    To decision-tree or not to decision-tree, that is the question. Or to cluster, for that matter. Or to linear regress.

    Classification is a key part of machine learning (ML), helping you to define factors and variables and/or train your model to recognize items and patterns.

    Sometimes that might mean teaching the model to classify and categorize something in a binary way. For example, you might create a classification algorithm that determines whether an image does or does not contain nudity. At other times, it’s part of a longer, more complicated process, helping to predict trends and outcomes with a lot of moving parts. This means you need to choose the right kind of classification algorithm from the start.

    Creating a classification algorithm 

    It’s also important to bear in mind that creating a classification algorithm is typically something you do early on in the process, often as part of your data preparation with Python, to get a better picture of your data. This will lead you on to deeper analysis. That means you need to think carefully about how your chosen classification method will lay the groundwork for what you do with the data next. Or, looked at another way, what you plan to do with the data later should inform the classification algorithm you choose.

    To help you pick the right one, let’s take a closer look at three of the most common classification algorithms: clustering, decision trees, and linear regression.

    Clustering

    What is clustering?

    Clustering (or cluster analysis) is an unsupervised ML method that is used to organize data points from within a larger dataset into groups, based on the traits they share. The model tries to work out which similarities are relevant and clusters the data points on that basis. This form of data classification helps to create structures that you can understand and manipulate more easily.
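
    To make the grouping idea concrete, here is a minimal sketch of k-means, one common clustering method. It is not tied to any particular library, and the sample points, the function name `kmeans`, and the choice of two clusters are all assumptions made for illustration. In practice you would more likely reach for a library such as scikit-learn.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters (plain k-means, illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical data: two loose groups of (x, y) points, with no labels supplied.
points = [(1.0, 1.2), (0.8, 1.0), (1.3, 0.9), (8.0, 8.1), (8.3, 7.9), (7.7, 8.4)]
centroids, labels = kmeans(points, k=2)
print(labels)  # points in the same group share a cluster number
```

    Note that nothing tells the algorithm which group is which. It only sees that some points sit close together, which is what makes the method unsupervised.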

    Cluster analysis algorithms have been used to filter spam, flag up fraud attempts, categorize products, create movie recommendation engines, and spot suspected fake news stories (by identifying tell-tale words and phrases). They’re also used extensively in marketing and advertising, giving companies more effective ways to segment customers by grouping together and targeting people with characteristics that make them likely to convert.

    When should you use clustering?

    Clustering is useful when you’re working with a very big, unstructured dataset, when you’re unsure how many classes the dataset is divided into, and/or when classifying, categorizing and annotating your dataset by hand is too resource-heavy. It’s also great for helping you seek out anomalies in the data.

    When shouldn’t you use clustering?

    Supervised ML algorithms are typically more accurate than unsupervised ones. If you already have class labels that work well for classification, you’ll probably get more accurate results by making the most of these rather than using clustering to generate new ones. Plus, if your data is categorical rather than continuous (and certainly if you use binary variables), most clustering algorithms will be a bad fit, since these assess similarity by calculating the distance between points in the cluster.
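
    A small sketch shows why distance-based similarity struggles with binary variables. The customer flags below are invented for illustration. With 0/1 data, the Euclidean distance is just the square root of the number of flags that differ, so many pairs end up exactly the same distance apart and there is little for the algorithm to separate.

```python
import math

# Hypothetical customers described by three yes/no flags
# (has_app, is_subscriber, opted_in_email), encoded as 0/1.
customers = {
    "a": (1, 0, 0),
    "b": (0, 1, 0),
    "c": (0, 0, 1),
    "d": (1, 1, 0),
}


def euclidean(u, v):
    """Straight-line distance, as used by many clustering algorithms."""
    return math.dist(u, v)


def mismatches(u, v):
    """Hamming distance: how many flags differ."""
    return sum(x != y for x, y in zip(u, v))


for first, second in [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")]:
    u, v = customers[first], customers[second]
    print(first, second, round(euclidean(u, v), 3), mismatches(u, v))
```

    Customers a, b and c are all equally far from one another, even though they have nothing in common, so distance alone gives no basis for grouping them.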

    Decision trees 

    What are decision trees?

    A decision tree is a type of predictive algorithm that works by asking a binary question of the input data. Then, based on the answer, it branches off either to a follow-up question or to a final classification. Python has a number of powerful, ready-made libraries to help you create decision trees.

    Decision trees are great when you have a complex set of criteria that build on one another in order to reach a decision. Let’s say you are using the algorithm to decide whether to approve a credit card application. If someone has a sparkling credit rating, that might be an instant “yes”, but if they don’t have any credit history, you might go down a different branch of questions that would provide other opportunities to assess and document their creditworthiness. Instead of an in-person advisor sitting down with that person to talk through each of these alternative paths logically, the algorithm performs the same process on the data, potentially splitting off through hundreds of different branches to reach a decision in seconds.
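
    The credit card example can be sketched as a tiny hand-written tree. Every field name, threshold and outcome label here is hypothetical. A real tree would learn its questions from historical data rather than have them typed in.

```python
def decide_application(applicant):
    """Walk a tiny hand-written decision tree; every threshold is made up."""
    score = applicant["credit_score"]  # None means no credit history
    if score is not None:
        # Branch 1: the applicant has a credit history.
        if score >= 750:
            return "approve"  # sparkling rating: instant yes
        if score >= 600:
            # Follow-up question for middling scores.
            return "approve" if applicant["annual_income"] >= 30_000 else "refer"
        return "decline"
    # Branch 2: no credit history, so ask a different set of questions.
    if applicant["years_employed"] >= 2 and applicant["annual_income"] >= 40_000:
        return "approve"
    return "refer"


# Hypothetical applicant with no credit history but a stable job.
newcomer = {"credit_score": None, "annual_income": 45_000, "years_employed": 3}
print(decide_application(newcomer))
```

    Because each step is a plain yes/no question, you can read off exactly why any given application ended up where it did.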

    Other use cases for decision trees include mapping out customer willingness to buy in a variety of scenarios, making pricing predictions, and forecasting future outcomes based on a number of variables.

    When should you use decision trees?

    Decision trees are a good choice when you want a relatively simple model that allows you to document a clear, transparent decision-making process. They’re also a good fit when you don’t have a whole lot of computational power and can use the whole dataset with all its features. Plus, decision trees are good at handling datasets that have a lot of missing values or errors in them.

    When shouldn’t you use decision trees?

    If achieving the highest possible accuracy is more important than explainability, a decision tree may not be the best choice.

    A word of warning, too: the biggest problem with decision trees is that they tend to overfit data. This means they often do very well during training but come unstuck when you test them with data they haven’t seen before. It also means you need to be really careful about selecting the most important features and introducing limitations to prevent the tree from becoming over-complicated.
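
    The overfitting problem, and the effect of limiting the tree, can be seen in a deliberately small sketch. The tree builder below is a simplification written for this example: it handles a single numeric feature and picks splits by counting mistakes, whereas real libraries typically use measures such as Gini impurity. The data is invented, with two training labels flipped on purpose to act as noise. Libraries such as scikit-learn expose a comparable limit, for example the `max_depth` parameter on `DecisionTreeClassifier`.

```python
from collections import Counter


def majority(rows):
    """Most common label among (x, label) rows."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def mistakes(rows):
    """Training errors if this group just predicts its majority label."""
    counts = Counter(label for _, label in rows)
    return len(rows) - counts.most_common(1)[0][1]


def build_tree(rows, max_depth=None, depth=0):
    """Grow a tree on one numeric feature; leaves are plain labels."""
    pure = len({label for _, label in rows}) == 1
    too_deep = max_depth is not None and depth >= max_depth
    xs = sorted({x for x, _ in rows})
    if pure or too_deep or len(xs) < 2:
        return majority(rows)
    best = None
    for low, high in zip(xs, xs[1:]):
        threshold = (low + high) / 2  # candidate question: "is x <= threshold?"
        left = [r for r in rows if r[0] <= threshold]
        right = [r for r in rows if r[0] > threshold]
        score = mistakes(left) + mistakes(right)
        if best is None or score < best[0]:
            best = (score, threshold, left, right)
    _, threshold, left, right = best
    return {
        "threshold": threshold,
        "left": build_tree(left, max_depth, depth + 1),
        "right": build_tree(right, max_depth, depth + 1),
    }


def predict(tree, x):
    """Follow the questions down until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree


def accuracy(tree, rows):
    return sum(predict(tree, x) == label for x, label in rows) / len(rows)


# Hypothetical data: the true rule is "label 1 when x >= 5",
# but two training labels (x=2 and x=7) are deliberately wrong.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
test = [(i + 0.25, int(i >= 5)) for i in range(10)]  # clean, unseen points

deep_tree = build_tree(train)                  # no limit: memorizes the noise
shallow_tree = build_tree(train, max_depth=1)  # a single question

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(name, "train:", accuracy(tree, train), "test:", accuracy(tree, test))
```

    On this toy data the unrestricted tree scores perfectly on the training rows because it carves out a branch for each noisy label, then repeats those mistakes on the unseen points. The depth-limited tree gets the two noisy training rows "wrong" but recovers the underlying rule.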

    Linear regression 

    What is linear regression?

    Linear regression is a technique used to analyze and then model the relationships between variables. The algorithm fits a straight line through the data to estimate how strongly each independent variable is related to the outcome you want to predict, and in which direction.
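
    For a single predictor, the usual way to fit such a model is ordinary least squares. The sketch below is a minimal version under that assumption. The function name `fit_line` and the house-size figures are invented for illustration, and the numbers were chosen to fall exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: floor area in square metres vs. price in thousands.
area = [50, 70, 90, 110]
price = [150, 200, 250, 300]

slope, intercept = fit_line(area, price)
predicted = slope * 100 + intercept  # estimate for a 100 square metre home
print(slope, intercept, predicted)
```

    The slope is the part most people care about: it says how much the outcome is expected to change for each one-unit change in the predictor.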

    Linear regression is typically used to make estimates, evaluate trends, assess financial risk, predict house prices, or figure out the effectiveness of a pricing or promotion strategy.

    When should you use linear regression?

    It often makes sense to use linear regression when you need to forecast an effect or trend, or figure out the strength of a given predictor. Often, a regression model is used to answer questions about how strong an effect an independent variable has on a dependent variable. For example, you might ask “What is the strength of the relationship between marketing spending and sales?” and from there, “What would be the impact on sales if we increase marketing spend by 25%?”
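
    Both marketing questions can be sketched with a few lines of arithmetic. The spend and sales figures are invented, and `fit_with_correlation` is a hypothetical helper written for this example. It returns the least-squares slope and intercept along with the Pearson correlation coefficient as a rough measure of strength. Treat the forecast as an estimate that assumes the straight-line relationship keeps holding at the new spend level.

```python
import math


def fit_with_correlation(xs, ys):
    """Return (slope, intercept, r) for a one-predictor least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation, between -1 and 1
    return slope, intercept, r


# Hypothetical monthly figures, both in thousands of dollars.
marketing_spend = [10, 20, 30, 40, 50]
sales = [120, 135, 160, 170, 190]

slope, intercept, r = fit_with_correlation(marketing_spend, sales)

current_spend = 40
extra_spend = 0.25 * current_spend     # a 25% increase in spend
expected_change = slope * extra_spend  # predicted change in sales

print("strength of relationship (r):", round(r, 3))
print("expected change in sales:", expected_change)
```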

    When shouldn’t you use linear regression?

    Apart from the examples above, linear regression is a poor choice for classification. It predicts a continuous value rather than a class, so you have to apply a threshold to its output to get a label. Every time you add a new data point, the fitted line can shift, so you would effectively need to update the model and potentially the threshold to keep the training points on the right side of it.
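
    The threshold problem shows up even on a toy dataset. In the sketch below, which uses invented data and hypothetical helper names, a line is fitted to 0/1 labels and anything predicted at 0.5 or above is treated as class 1. Adding a single extreme but correctly labeled point tilts the line enough to move the cut-off.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def decision_boundary(xs, ys, cutoff=0.5):
    """The x value at which the fitted line crosses the cutoff."""
    slope, intercept = fit_line(xs, ys)
    return (cutoff - intercept) / slope


def predicts_positive(xs, ys, x, cutoff=0.5):
    """Classify x as class 1 if the fitted line reaches the cutoff there."""
    slope, intercept = fit_line(xs, ys)
    return slope * x + intercept >= cutoff


# Hypothetical data: class 0 for small x, class 1 for large x.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

before = decision_boundary(xs, ys)
# Add one far-away point that clearly belongs to class 1.
after = decision_boundary(xs + [30], ys + [1])

print("boundary before:", round(before, 2))
print("boundary after: ", round(after, 2))
print("x=4 classed as 1 before?", predicts_positive(xs, ys, 4))
print("x=4 classed as 1 after? ", predicts_positive(xs + [30], ys + [1], 4))
```

    The new point is not even mislabeled, yet the boundary moves past x=4 and a training example that was classified correctly is now on the wrong side.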

    How to pick the right classification algorithm 

    Choosing the right classification algorithm means asking the right questions from the outset:

    • What question are you asking of the data? What are your predictive goals?

    Are you trying to group data points into distinct categories and classes? If so, clustering is probably the best option. Are you looking to map out a clear decision-making process? If so, consider a decision tree. Or are you looking to elucidate and predict the relationship between variables? In which case, you will likely need a linear regression model.

    • How much data do you have and what state is it in? 

    If your dataset is small enough to be manageable but has a lot of errors, you should still be able to derive value from it with a decision tree. A larger dataset that has some missing values will support clustering. However, linear regression gets weaker and weaker with each missing value.

    • Is your core priority for the model to be as transparent as possible or as accurate as possible?

    Cluster analysis is focused on accuracy. Decision trees, and to a lesser extent linear regression models, are primarily concerned with transparency.

    • How much computational power do you have?

    Clustering is resource-heavy, so you’ll need plenty of computational power at your disposal. The least demanding of the three types is the decision tree.

    Final thoughts: getting a helping hand with your classification algorithm

    Figuring out which classification algorithm is best can be tricky business, especially when you’re trying to work out how it fits with other considerations like improving your training data or laying the foundations for future stages in your model development.

    It’s well worth looking at how a powerful data science platform can ease some of these challenges for you, for example by facilitating and automating connections to external data sources, so that you aren’t restricted by the quality and scope of the data you have in-house, or by suggesting the most relevant algorithms to suit your needs and purpose. This will take out some of the hassle and heavy lifting for you, leaving you with the bandwidth to focus on making your ML project as effective as it can be.
