Clustering: When You Should Use It and When to Avoid It

February 3, 2020 Explorium Data Science Team Data Science

No matter what type of research you’re doing, or what your machine learning (ML) algorithms are tasked with, somewhere along the line, you’ll be using clustering techniques quite liberally. Clustering and data preparation go hand in hand, as many times you’ll be working, at least initially, with datasets that are largely unstructured and unclassified. 

More importantly, clustering is an easy way to perform many surface-level analyses that can give you quick wins in a variety of fields. Marketers can perform a cluster analysis to quickly segment customer demographics, for instance. Insurers can quickly drill down on risk factors and locations and generate an initial risk profile for applicants.

Even so, it would be a shame to leave your analysis at clustering, since it’s not meant to be a single answer to your questions. Indeed, while clustering is incredibly useful in a variety of settings, it isn’t without some fairly important limitations. 

What is clustering?

Clustering is an unsupervised machine learning method that identifies and groups similar data points in a larger dataset without relying on labeled outcomes. Clustering (sometimes called cluster analysis) is usually used to organize data into structures that are easier to understand and manipulate. While it’s a popular strategy, clustering isn’t a single technique: many algorithms perform cluster analysis, each with a different mechanism.

Why clustering isn’t (always) the answer 

For all the great things cluster analysis can do for your organization, there are just as many things that make it suboptimal when you’re looking for deep insights. Clustering by itself poses some important challenges that are inherent in the way the analysis is performed, and these make it less than ideal for more complex ML and analytics-related tasks.

The biggest issue with most clustering methods is that, while they’re great at initially separating your data into subsets, the groups they produce depend on how points are positioned relative to one another under the chosen distance measure and settings, and not necessarily on any meaningful structure in the data itself.

K-means clustering (where a dataset is separated into K groups around centroids that start at random positions), for instance, can produce significantly different results depending on the number of groups you set, and it generally performs poorly on non-spherical clusters. Moreover, because the initial centroids are placed at random, two runs on the same data can converge to different groupings, which can lead to issues down the line.
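To make the sensitivity to starting centroids concrete, here is a minimal sketch of K-means (Lloyd’s algorithm) in plain Python. The data, the helper names (`sq_dist`, `kmeans`), and the two starting positions are all made up for illustration; the starting positions are fixed by hand, standing in for two random draws, so that the run is reproducible. In practice you would more likely reach for a library implementation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, initial_centroids, max_iters=100):
    """Lloyd's algorithm: returns (centroids, labels, inertia)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    # Inertia: total squared distance from each point to its centroid.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Three tight, well-separated groups of four points each (made-up data).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]

# Two hypothetical starting positions, standing in for two random draws.
lucky_start = [(0, 0), (10, 0), (5, 10)]  # one centroid lands in each group
unlucky_start = [(0, 0), (1, 1), (8, 5)]  # two centroids land in the same group

_, lucky_labels, lucky_inertia = kmeans(points, lucky_start)
_, unlucky_labels, unlucky_inertia = kmeans(points, unlucky_start)

print(lucky_labels, lucky_inertia)
print(unlucky_labels, round(unlucky_inertia, 2))
```

With the first start, each group gets its own centroid. With the second, two centroids split one group while the third is left to cover the other two groups, and the algorithm stops there because no assignment changes. This is why library implementations commonly rerun K-means from several starting points and keep the best result.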

Other algorithms can solve this problem, but not without a cost. Hierarchical clustering avoids random starting positions and can produce more stable results, but standard agglomerative implementations compare every pair of points, so time and memory grow at least quadratically with the size of the dataset. That makes it a poor fit when you’re working with larger datasets. This method is also sensitive to outlier values and can produce inaccurate clusters as a result.
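Both costs can be seen in a small sketch. The version below is a deliberately naive agglomerative clustering on one-dimensional values with single linkage; the linkage choice, the function names, and the data are assumptions made for illustration, not something the article specifies.

```python
from itertools import combinations


def single_linkage(values, n_clusters):
    """Naive agglomerative clustering of 1-D values using single linkage."""
    clusters = [[v] for v in values]  # start with every value in its own cluster

    def gap(pair):
        # Single linkage: distance between the two closest members.
        a, b = pair
        return min(abs(x - y) for x in clusters[a] for y in clusters[b])

    while len(clusters) > n_clusters:
        # Scan every pair of clusters and merge the closest two.
        i, j = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[i].extend(clusters[j])  # i < j, so deleting j leaves i in place
        del clusters[j]
    return [sorted(c) for c in clusters]


def pair_count(n):
    """Number of pairwise distances among n points."""
    return n * (n - 1) // 2


values = [10, 12, 14, 30, 32, 34]  # two obvious groups (made-up data)
clean = single_linkage(values, 2)
with_outlier = single_linkage(values + [500], 2)  # one extreme value added

print(clean)
print(with_outlier)
print(pair_count(100_000))
```

On the clean data, asking for two clusters recovers the two groups. Once the single extreme value is added, the same request isolates the outlier in its own cluster and merges the two real groups together. The pair count shows why the approach struggles at scale: 100,000 points already imply roughly five billion pairwise distances.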

Perhaps most importantly, clustering isn’t a final step in your data discovery. Because it’s unsupervised and more concerned with grouping than with deep insights, it works best as a tool for preparing your data for more intensive analysis.

When you should use clustering

All this isn’t to say you should never use clustering, but rather that you should deploy it where and when it’ll give you the greatest impact and insights. Also, there are many situations in which clustering can not only give you a great starting point but shed light on important features of your data that can be enhanced with deeper analytics. These are just some of the times when you should use clustering: 

When you’re starting from a large, unstructured dataset

This is perhaps the most valuable use of cluster analysis thanks to the amount of work it takes off your hands. As with other unsupervised learning tools, clustering can take large datasets and, without instruction, quickly organize them into something more usable. The best part is that if you’re not looking to perform a massive analysis, clustering can give you fast answers about your data. 

When you don’t know how many or which classes your data is divided into

Even if you’re starting with a more structured and well-labeled dataset, it may still not have the depth and stratification you’re looking for. Clustering is a great first step in your data prep because it starts to answer key questions about your dataset. For instance, you may discover that what you thought were two main subsets are actually four, or that categories you weren’t aware of form classes of their own.
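One common way to probe how many groups a dataset contains is the elbow heuristic: run K-means for several values of K and look for the point where the within-cluster error (inertia) stops dropping sharply. The article doesn’t name this technique, so treat the sketch below as one possible approach. The data is made up, and the farthest-first initialization is an assumption swapped in for random starts so the output is reproducible.

```python
def farthest_first(values, k):
    """Deterministic start: each new centroid is the value farthest from those chosen so far."""
    centroids = [values[0]]
    while len(centroids) < k:
        centroids.append(max(values, key=lambda v: min(abs(v - c) for c in centroids)))
    return centroids


def kmeans_inertia(values, k, max_iters=100):
    """Run 1-D K-means and return the final within-cluster sum of squares."""
    centroids = farthest_first(values, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each value joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))


# Made-up data that looks like "two halves" but actually holds four groups.
values = [1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33]

inertias = {k: kmeans_inertia(values, k) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

On this data the inertia falls steeply up to K = 4 and barely improves afterwards, which suggests four groups, not the two you might have assumed. Real data rarely gives such a clean elbow, so the result is a hint to investigate, not a verdict.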

When manually dividing and annotating your data is too resource-intensive

For smaller datasets, manual annotation and organization is feasible, if not ideal. However, as your data begins to scale, annotation, classification, and categorization become far harder. Depending on the algorithm you’re using, clustering can cut down your annotation and classification time because it groups similar records without needing labeled outcomes, so you can often review and label whole groups instead of individual data points. For instance, speech recognition algorithms produce millions of data points, which would take hundreds of hours to fully annotate. Clustering algorithms can reduce the total work time and give you answers faster.
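One way this saving can work in practice is to annotate a single representative per cluster and copy its label to the other members. The sketch below assumes cluster ids have already been produced by some clustering step. The feature values, the cluster ids, and the labels "quiet" and "loud" are all hypothetical.

```python
from collections import defaultdict

# Hypothetical 1-D feature values, with cluster ids already assigned upstream.
samples = [0.9, 1.0, 1.3, 5.0, 5.1, 5.4]
cluster_ids = [0, 0, 0, 1, 1, 1]

# Group sample indices by cluster.
groups = defaultdict(list)
for idx, cid in enumerate(cluster_ids):
    groups[cid].append(idx)


def representative(indices):
    """Index of the sample closest to the cluster mean."""
    mean = sum(samples[i] for i in indices) / len(indices)
    return min(indices, key=lambda i: abs(samples[i] - mean))


reps = {cid: representative(idxs) for cid, idxs in groups.items()}

# A person annotates only the representatives (hypothetical labels).
manual_labels = {reps[0]: "quiet", reps[1]: "loud"}

# Every other sample inherits the label of its cluster's representative.
propagated = [manual_labels[reps[cid]] for cid in cluster_ids]

print(reps)
print(propagated)
```

Here two manual annotations cover six samples. The inherited labels are only as good as the clusters, so spot-checking a few members of each cluster before trusting them is a sensible precaution.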

When you’re looking for anomalies in your data 

One of the more valuable uses of clustering is anomaly detection: points that don’t fit well into any cluster stand out as potential outliers. Indeed, algorithms such as density-based spatial clustering of applications with noise (DBSCAN) are designed to find clusters of densely packed points and to mark points in sparse regions as noise. Understanding your anomalous data can help you optimize your existing data collection tools and lead to more accurate results in the long term.
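The idea behind DBSCAN can be sketched in a few lines of plain Python. This is a simplified teaching version, not a library’s implementation. The data and the parameter values (`eps=1.5`, `min_pts=3`) are assumptions chosen to suit the toy example, and a point counts itself among its own neighbors.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [
            j for j, q in enumerate(points)
            if math.hypot(points[i][0] - q[0], points[i][1] - q[1]) <= eps
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # too sparse to start a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense region
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is dense too: keep expanding from it
                queue.extend(nbrs)
    return labels


# Two dense groups plus two isolated points (made-up data).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (5, 5), (20, 0),
]

labels = dbscan(points, eps=1.5, min_pts=3)
anomalies = [p for p, lab in zip(points, labels) if lab == NOISE]

print(labels)
print(anomalies)
```

The two isolated points never gather enough neighbors to join or start a cluster, so they are labeled as noise and can be reviewed as candidate anomalies. Both parameters matter: a larger `eps` or a smaller `min_pts` would absorb more points into clusters and flag fewer of them.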

To Cluster or Not to Cluster?

Much like with other useful algorithms and data science models, you’ll get the most out of clustering when you deploy it not as a standalone tool, but as part of a broader data discovery strategy. Cluster analysis can help you segment your customers, classify your data better, and generally structure your datasets, but it won’t do much more if you don’t give your data a broader context.



