No matter what type of research you’re doing, or what your machine learning (ML) algorithms are tasked with, somewhere along the line, you’ll be using clustering techniques quite liberally. Clustering and data preparation go hand in hand, as many times you’ll be working, at least initially, with datasets that are largely unstructured and unclassified.
More importantly, clustering is an easy way to perform many surface-level analyses that can give you quick wins in a variety of fields. Marketers can perform a cluster analysis to quickly segment customer demographics, for instance. Insurers can quickly drill down on risk factors and locations and generate an initial risk profile for applicants.
Even so, it would be a shame to leave your analysis at clustering, since it’s not meant to be a single answer to your questions. Indeed, while clustering is incredibly useful in a variety of settings, it isn’t without some fairly important limitations.
Clustering is an unsupervised machine learning method that identifies and groups similar data points in larger datasets without reference to a specific outcome or label. Clustering (sometimes called cluster analysis) is usually used to organize data into structures that are more easily understood and manipulated. It’s worth keeping in mind that while it’s a popular strategy, clustering isn’t a single technique: many algorithms perform cluster analysis, each with a different mechanism.
For all the great things cluster analysis can do for your organization, there are just as many things that make it suboptimal when you’re looking for deep insights. Clustering by itself poses some important challenges that are inherent in the way you perform the analysis, and which make it less than ideal for more complex ML and analytics-related tasks.
The biggest issue that comes up with most clustering methods is that while they’re great at initially separating your data into subsets, the groupings they produce don’t always reflect anything meaningful about the data itself. They reflect only how each point is positioned in relation to the other points.
K-means clustering (where datasets are separated into K groups based on randomly placed centroids), for instance, can produce significantly different results depending on the number of groups you set, and it generally performs poorly on non-spherical clusters. Moreover, because the initial centroids are set at random, two runs on the same data can settle on different groupings, which can lead to issues down the line.
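To make the sensitivity to K and to the starting centroids concrete, here is a minimal sketch of k-means in plain Python. The four data points, the function name `kmeans`, and the hand-picked starting centroids are illustrative assumptions; the starting centroids are fixed by hand (standing in for two possible random draws) so that the result is reproducible.

```python
import math


def kmeans(points, init_centroids, max_iters=100):
    """Basic k-means (Lloyd's algorithm) starting from the given centroids.

    Returns (centroids, inertia), where inertia is the sum of squared
    distances from each point to its nearest centroid.
    """
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its old centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # converged: nothing moved
        centroids = updated
    inertia = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, inertia


# Made-up data: the four corners of a wide rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Two possible starting draws for K=2, fixed here instead of sampled.
_, inertia_good = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])  # one start per side
_, inertia_bad = kmeans(points, [(0.0, 0.0), (0.0, 1.0)])   # both starts on the left

# Changing K changes the answer as well.
_, inertia_k1 = kmeans(points, [(0.0, 0.0)])
_, inertia_k4 = kmeans(points, points)

print(inertia_k1, inertia_good, inertia_bad, inertia_k4)
```

With one start on each side, the algorithm splits the rectangle into its left and right halves (inertia 1). With both starts on the left, it converges to a top and bottom split (inertia 16) and stays there, even though the data has not changed. Production libraries typically reduce this risk by running several initializations and keeping the best one.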
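Other algorithms can solve this problem, but not without a cost. Hierarchical clustering doesn’t depend on randomly placed centroids and tends to produce more accurate results, but it requires significant computational power: standard implementations compare every pair of points, so the work grows at least quadratically with the size of the dataset. That makes it a poor fit when you’re working with larger datasets. This method is also sensitive to outlier values and can produce inaccurate clusters as a result.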
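Both drawbacks of hierarchical clustering can be seen in a small sketch. The code below is a deliberately naive agglomerative implementation in plain Python; the choice of single linkage, the function name `agglomerative`, and the data points are assumptions made for illustration, not a reference implementation.

```python
import math
from itertools import combinations


def agglomerative(points, n_clusters):
    """Naive single-linkage agglomerative clustering.

    Starts with every point in its own cluster and repeatedly merges the two
    closest clusters. Returns (clusters, number_of_distance_evaluations).
    """
    clusters = [[p] for p in points]
    evaluations = 0
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
            evaluations += len(clusters[i]) * len(clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters, evaluations


# Made-up data: two tight groups and one distant outlier.
blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
blob_b = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
outlier = (100.0, 100.0)

clean_clusters, _ = agglomerative(blob_a + blob_b, n_clusters=2)
noisy_clusters, evaluations = agglomerative(blob_a + blob_b + [outlier], n_clusters=2)

print(clean_clusters)
print(noisy_clusters)
print(evaluations)
```

Without the outlier, asking for two clusters recovers the two groups. With it, the outlier takes one of the two clusters for itself and the real groups are merged into the other. On cost, the first pass alone compares every pair of points, n(n-1)/2 distances for n points, and this naive version repeats a full scan for every merge.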
Perhaps most importantly, clustering isn’t a final step in your data discovery. Indeed, because it’s unsupervised and is more concerned with classification than deep insights, it is a great tool when you’re preparing your data for more intensive analysis.
All this isn’t to say you should never use clustering, but rather that you should deploy it where and when it’ll give you the greatest impact and insights. There are many situations in which clustering can not only give you a great starting point but also shed light on important features of your data that can be enhanced with deeper analytics. These are just some of the times when you should use clustering:
Organizing a large, unstructured dataset is perhaps the most valuable use of cluster analysis, thanks to the amount of work it takes off your hands. As with other unsupervised learning tools, clustering can take large datasets and, without instruction, quickly organize them into something more usable. The best part is that if you’re not looking to perform a massive analysis, clustering can give you fast answers about your data.
Even if you’re starting with a more structured and well-labeled dataset, it may still not have the depth and stratification you’re looking for. Clustering is a great first step in your data prep because it starts to answer key questions about your dataset. For instance, you may discover that what you thought were two main subsets are actually four, or that categories you weren’t aware of form classes of their own.
For smaller datasets, manual annotation and organization are feasible, if not ideal. However, as your data begins to scale, annotation, classification, and categorization become far harder. Clustering, depending on the algorithm you’re using, can cut down your annotation and classification time because it’s less interested in specific outcomes and more concerned with the categorization itself. For instance, speech recognition algorithms produce millions of data points, which would take hundreds of hours to fully annotate. Clustering algorithms can reduce the total work time and give you answers faster.
Curiously, one of the more valuable uses of clustering comes from a weakness: because many algorithms are sensitive to outlier data points, they can double as detectors for data anomalies. Indeed, algorithms such as density-based spatial clustering of applications with noise (DBSCAN) are designed to find clusters of densely packed points and to mark the points that fall outside any dense region as outliers. Understanding your anomalous data can help you optimize your existing data collection tools and lead to more accurate results in the long term.
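As an illustration of how a density-based method flags anomalies, here is a minimal DBSCAN sketch in plain Python. The data points and the parameter values `eps=1.5` and `min_samples=3` are assumptions chosen for the example; in practice both parameters need tuning for each dataset, and a library implementation would use a spatial index rather than the brute-force neighbor search shown here.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN sketch. Returns one label per point; NOISE marks outliers."""
    def neighbors(i):
        # Brute-force search; the point itself counts as its own neighbor.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Made-up data: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 30.0),
]
labels = dbscan(points, eps=1.5, min_samples=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]

print(labels)
print(outliers)
```

The two dense groups receive cluster labels 0 and 1, while the isolated point at (5, 30) has no neighbors within `eps` and is labeled as noise. Those noise points are the candidates to inspect for collection errors or genuinely unusual cases.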
As with other useful algorithms and data science models, you’ll get the most out of clustering when you deploy it not as a standalone tool, but as part of a broader data discovery strategy. Cluster analysis can help you segment your customers, classify your data better, and generally structure your datasets, but it won’t do much more if you don’t give your data a broader context.