    We all want our models to be as accurate as possible. While you can’t control every factor that interferes with accuracy, there are two types of “reducible error” that you can address. These are machine learning bias and variance.

    The trouble is, each of these issues relates to opposite sides of a data problem. As we’ll see in a moment, a high degree of bias in a model is closely connected to underfitting data, while a large amount of variance is related to overfitting data. You can see how this can make it very tricky to fix either problem without creating a new one in the process. How do you address bias in your model without increasing variance too much? And as you tackle variance, how do you prevent bias from creeping in?

    How to Improve Training Data

    What is bias in machine learning?

    Machine learning bias can creep into a model in a variety of ways and for different reasons. Fundamentally, though, bias stems from your model lacking the complexity and nuance it needs to interpret the data as it should. It can also be a product of using far too narrow a selection of data, which consequently fails to reflect the full picture you need. In other words, your model underfits the data.

    As a result, when the model trains on the data you give it, it starts to make assumptions based on patterns that don’t match reality. When you feed it testing and validation data, it continues to make predictions based on these assumptions, which may not be correct.

    In simple terms, bias is the gap between the values your model predicts on average and the values borne out in the real world. The larger the gap, the more biased your model and the weaker its predictive power.
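
    To make the idea of underfitting concrete, here is a minimal sketch using NumPy. Everything in it is an assumption chosen for illustration: the synthetic quadratic data, the noise level, and the helper name `fit_and_score` are not taken from any real project. It fits a straight line and a quadratic curve to the same data and compares their errors.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_fn(x):
    # Hypothetical "real world" relationship: a simple curve.
    return x ** 2


# Synthetic training and test sets drawn from the same noisy process.
x_train = rng.uniform(-3, 3, 200)
y_train = true_fn(x_train) + rng.normal(0, 0.5, 200)
x_test = rng.uniform(-3, 3, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.5, 200)


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse


# A straight line is too simple for curved data, so it underfits.
line_train, line_test = fit_and_score(1)
# A quadratic has enough flexibility to follow the true relationship.
quad_train, quad_test = fit_and_score(2)

print(f"line:      train MSE {line_train:.2f}, test MSE {line_test:.2f}")
print(f"quadratic: train MSE {quad_train:.2f}, test MSE {quad_test:.2f}")
```

    The telltale sign of high bias in this sketch is that the straight line scores badly on both the training data and the test data. More data of the same kind would not help; the model simply cannot represent the curve.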

    What is variance in machine learning?

    Variance in a model describes how much its predictions would change if you trained it on a different sample of data. Unlike bias, high variance typically shows up when your model performs well on the training dataset but less impressively when given test or validation data. A high-variance model is also sensitive to noise (random fluctuations in the data). It’s important to note that, although you usually train on a single training set, variance measures how inconsistent the predictions would be across different training sets. It’s not a measure of overall accuracy.

    High amounts of variance mean that the predicted values are scattered widely from one training set to the next instead of settling around a stable answer. This goes hand in hand with overfitting: the model is predicting excessively complex relationships between input features and the outcome. In other words, the algorithm is also using random noise in the training data to model its predictions. It is taking far too much into account and, as a result, isn’t focusing enough on the genuinely important predictive variables.
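
    One way to see variance directly is to train the same kind of model on many different training sets and measure how much its predictions move. The sketch below does this with NumPy under purely illustrative assumptions: a sine-shaped target, 15 noisy points per training set, and hypothetical helper names `sample_training_set` and `prediction_variance`.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Fixed inputs at which every fitted model is asked for a prediction.
x_grid = np.linspace(0.05, 0.95, 50)


def sample_training_set(n=15, noise=0.3):
    """Draw a small, noisy training set from a hypothetical sine-shaped process."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, noise, n)
    return x, y


def prediction_variance(degree, n_sets=200):
    """Average variance of predictions across models trained on different sets."""
    preds = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        x, y = sample_training_set()
        model = Polynomial.fit(x, y, degree)
        preds[i] = model(x_grid)
    # Variance across training sets at each grid point, averaged over the grid.
    return float(preds.var(axis=0).mean())


var_simple = prediction_variance(1)    # rigid model: a straight line
var_flexible = prediction_variance(9)  # flexible model: degree-9 polynomial

print(f"straight line: prediction variance {var_simple:.3f}")
print(f"degree 9:      prediction variance {var_flexible:.3f}")
```

    The flexible model chases the noise in each individual training set, so its predictions swing far more from one set to the next than the straight line's do. That swing is the variance, and it is measured without reference to how accurate either model is overall.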

    The tradeoff between bias and variance

    You can see how tackling bias and tackling variance at the same time is such a fiddly business. On one hand, in order to reduce bias, you need to make sure that your model isn’t oversimplifying things, giving too much credence to a small selection of variables while ignoring other important factors that could modify its prediction. This leads to sweeping generalizations that ultimately distort reality, giving you inaccurate and unhelpful results.

    On the other hand, you don’t want your model to be so open-minded that it tries to learn from absolutely every bit of information it’s shown and incorporate this into its predictive modeling. A lot of the stuff it’s exposed to in the training dataset will inevitably be random noise or other types of information that are irrelevant to the question in hand. Seeking out connections where there are none simply confuses the issue. Your model does need to be decisive about which features actually have predictive power and which do not. Which, of course, is exactly what causes bias — at least, if you’re overzealous about it.

    Striking a balance between these two opposing drivers is known as the bias-variance tradeoff. There are no easy answers here, though. Getting it right is a delicate process: you need to constrain your model just enough that it is discerning about which data points are relevant, without constraining it so much that it becomes blinkered.
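
    For squared-error loss, the tradeoff can be written down exactly. The notation here is an assumption introduced for illustration, not something defined elsewhere in this article: the data follow y = f(x) + ε, where the noise ε has zero mean and variance σ², f̂ is the model fitted to a randomly drawn training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

    The first two terms are the reducible error: making a model more flexible tends to shrink the bias term and inflate the variance term, and vice versa. The σ² term stays put no matter which model you choose.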

    Final thoughts: tackling the terrible twins

    Models that have a high bias error underfit the data and make oversimplified assumptions that stray away from reality. A model with a high variance error overfits the data, learning so much from the training set, noise included, that it fails to generalize to new data.

    The ideal model is one in which bias and variance are carefully balanced against each other. You’re never going to be able to fully eliminate either of them; the goal is to keep them in check.

    Firstly, you need to audit your training data carefully, ensuring that it is broad, comprehensive and nuanced enough to reflect all the important scenarios your model needs to understand. That means thinking carefully about any glaring holes in the data. For example, has the model been trained with data that reflects all the demographics of people it will come across in real datasets? Is this training data free from its own biases, too? Does it inadvertently suggest a pattern that doesn’t exist, because certain traits happen to be overrepresented among certain categories within the dataset when that isn’t necessarily the case in real life?

    Secondly, have you been careful and thoughtful in selecting the features you believe to be most relevant? Have you tidied up the data to make sure it’s as clean and useful as it can be, free from confusing and unnecessary noise?

    In short, the key is to design and build your ML project the right way, right from the start — including the ways you source and acquire data. Make use of tools and platforms that boost data governance and transparency. Rigorously test and validate the performance of your model to ensure it doesn’t veer too far in either direction. Keep evaluating and improving it over time. Correcting data bias and variance isn’t a one-time thing. It’s a process.
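
    As one possible way to check that a model isn’t veering too far in either direction, the sketch below uses k-fold cross-validation to compare models of different complexity. It is only an illustration under stated assumptions: the synthetic sine-shaped data, the candidate polynomial degrees, and the helper name `cv_mse` are all hypothetical, and a real project would substitute its own data and model family.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

# Hypothetical dataset: a noisy sine wave standing in for real observations.
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)

# Shuffle once and split into 5 folds so every candidate sees the same splits.
indices = rng.permutation(x.size)
folds = np.array_split(indices, 5)


def cv_mse(degree):
    """Mean held-out squared error of a polynomial model across the folds."""
    errors = []
    for fold in folds:
        train = np.setdiff1d(indices, fold)  # every index not in this fold
        model = Polynomial.fit(x[train], y[train], degree)
        errors.append(np.mean((model(x[fold]) - y[fold]) ** 2))
    return float(np.mean(errors))


# Candidates: too rigid, moderately flexible, very flexible.
scores = {degree: cv_mse(degree) for degree in (1, 3, 12)}
best_degree = min(scores, key=scores.get)

for degree, score in scores.items():
    print(f"degree {degree:2d}: cross-validated MSE {score:.3f}")
print(f"lowest held-out error: degree {best_degree}")
```

    Because every score comes from data the model did not train on, an underfit model is penalized for its bias and an overfit model for its variance. Rerunning a check like this as new data arrives is one way to keep evaluating a model over time.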
