Data Bias and What it Means for Your Machine Learning Models
We’d all like to imagine that the machines, systems, and algorithms we create are objective and neutral: devoid of prejudice, free from pesky human weaknesses like bias and the tendency to misinterpret a situation. However, this simply isn’t the case.
Sure, an automated machine learning tool or data science platform won’t privilege certain types of data or results out of personal loyalty. It won’t refuse to see a pattern because it’s emotionally attached to a different point of view. It will, however, create rules based on whatever data you feed into it. If that data tells a skewed or incomplete story, the rules it creates will be based on these foundational errors.
This is very bad news for businesses looking to mine useful insights from their datasets. In the long run, it can also cause huge problems for people already fighting against biases that negatively affect their lives – only now, with a supposedly dispassionate computer to justify it all.
Let’s take a look at some of the most prevalent types of bias, the data mistakes that cause them – and how to prevent them from taking hold in your own models.
Confirmation bias
An incredibly pervasive problem is people’s reluctance to question findings that support what they already believe – in other words, patterns and outcomes that confirm their assumptions or prejudices.
This is especially problematic with machine learning because the model is continually trying to refine itself based on your reactions to its results. If you ignore certain outcomes and privilege others – the ones that confirm what you already believe – the system takes this as a sign that it got this one right. It then feeds that information back into its workings, making similar interpretations of the data more likely in the future.
To avoid falling into this trap, make sure to look closely at any results that contradict your expectations to figure out how the algorithm got there. It may be your views, not the automated machine learning model, that needs refining. Examine the results that do appear to back you up, too, to figure out whether they really do mean what you want them to mean – don’t just accept this blindly because you want it to be right. Test the model with other larger datasets to ensure you still get the same types of results.
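One practical guard against this trap is to check whether a “confirming” pattern survives resampling. Below is a minimal sketch with entirely invented data, assuming the suspect finding is “this feature is higher for positive outcomes”: if the gap you want to believe in straddles zero across bootstrap resamples, it was probably noise.

```python
import random
import statistics

random.seed(0)

# Synthetic example: a feature value per customer plus a yes/no outcome
# that is, by construction, independent of the feature. Suppose the model
# "found" that the feature is higher for positives - check whether that
# gap survives bootstrap resampling before believing it.
population = [(random.gauss(50, 10), random.random() < 0.5) for _ in range(2000)]

def feature_gap(sample):
    """Mean feature value for positives minus that for negatives."""
    pos = [x for x, label in sample if label]
    neg = [x for x, label in sample if not label]
    return statistics.mean(pos) - statistics.mean(neg)

gaps = []
for _ in range(200):
    resample = random.choices(population, k=len(population))
    gaps.append(feature_gap(resample))

# A rough 95% interval over the resampled gaps. If it straddles zero,
# the "confirming" pattern is indistinguishable from noise.
lo, hi = sorted(gaps)[5], sorted(gaps)[194]
print(f"bootstrap interval for the gap: [{lo:.2f}, {hi:.2f}]")
```

The same resampling idea extends naturally to the advice above about re-testing on other, larger datasets: a real effect should reappear in every representative slice of the data, not just the one that flattered your hypothesis.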
Correlation bias
This is related to confirmation bias in that it leads you to infer connections where there are none by conflating correlated variables with causal ones. You’ve probably seen the tongue-in-cheek graph implying that global warming is “caused” by the declining number of pirates, or the one linking the divorce rate in Maine to per capita consumption of margarine.
While these parallels are clearly absurd to a rational person, they aren’t to a machine learning program. As far as the algorithm is concerned, all correlations are equally valid, equally likely to imply causation. If that’s what you’re looking for, that’s what you’ll see – so you need to be on your guard to interrogate all instances of correlating results, even if they fit your hypothesis.
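The pirates-and-margarine effect is easy to reproduce: any two series that merely trend over time will correlate strongly. A small sketch with synthetic numbers (all invented) shows the trap and one standard countermeasure, differencing away the shared time trend:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two unrelated quantities that both happen to trend over time
# (like pirate counts and global temperature): a trend plus noise each.
years = np.arange(50)
series_a = 0.5 * years + rng.normal(0, 2, 50)   # e.g. margarine consumption
series_b = -0.3 * years + rng.normal(0, 2, 50)  # e.g. number of pirates

r = np.corrcoef(series_a, series_b)[0, 1]
print(f"raw correlation: {r:.2f}")  # strongly negative, yet no causal link

# Differencing removes the shared time trend; what remains is the
# independent noise, which is genuinely uncorrelated.
r_diff = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]
print(f"correlation after differencing: {r_diff:.2f}")
```

An algorithm fed the raw series sees a strong, “valid” correlation; only interrogation of the data-generating process (here, simple de-trending) reveals that it carries no causal content.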
Sample bias
This is when the underlying dataset itself is the problem. It’s crucial that the training data you use is representative of the data your automated machine learning model will encounter later on. Otherwise, the algorithm may behave strangely when it tries to apply the rules it has learned to unfamiliar data.
For example, if you train your health screening algorithm to spot patterns using only data on male patients, it may not know how to interpret data on female patients. If you’re creating a voice-activated, NLP-based tool but train it using only North American voices, it will struggle to process English spoken with other accents.
To get around this, interrogate your training datasets carefully. Are you missing out on certain key categories? Are you giving your machine learning model a broad enough spread of data that it knows how to handle the data it’s fed later? If you can’t source a broad enough spread of data internally, what external sources could you incorporate that will ensure the rigor of your model?
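Part of that interrogation can be automated. Here’s a minimal, hypothetical sketch – the function name, accent values, and 5% threshold are illustrative choices, not from any particular library – that flags categories a training column is missing or badly under-representing relative to the population you expect to serve:

```python
from collections import Counter

def coverage_report(train_values, expected_categories, min_share=0.05):
    """Flag categories that are missing or badly under-represented in a
    training column, relative to the categories you expect to serve."""
    counts = Counter(train_values)
    total = sum(counts.values())
    report = {}
    for cat in expected_categories:
        share = counts.get(cat, 0) / total
        report[cat] = "MISSING" if share == 0 else (
            "UNDER-REPRESENTED" if share < min_share else "ok")
    return report

# Hypothetical accent column for the NLP example above.
train_accents = ["us"] * 900 + ["canada"] * 80 + ["uk"] * 20
print(coverage_report(train_accents,
                      ["us", "canada", "uk", "india", "nigeria"]))
```

Running a report like this before training makes the external-sourcing question concrete: the “MISSING” rows are exactly the data you need to go and find.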
Stereotype bias
A more subtle variation of sample bias occurs when the data you provide contains secondary information that the algorithm starts to treat as a key feature of the category it’s looking at.
For example, let’s say that the task is facial recognition. To avoid sample bias in your training data, you actively include images of people from as many different ethnicities and genders as possible. By sheer coincidence, in all the images of Asian people that you provide, the person is smiling, while none of the black people in your images are smiling. The algorithm interprets this as: smiling faces indicate Asian ethnicity, while no smiling indicates black ethnicity.
In this example, the bias is created purely by chance. At other times, it can be the result of using datasets shaped by pre-existing, real-world biases which the model has no way of recognizing or counteracting. For example, the fact that there has never been a female US president could be interpreted by the machine learning algorithm reading this data as an indicator that maleness is a necessary precondition of the presidency. As such, a model designed to predict the chances of political hopefuls might automatically discount female leadership candidates.
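A simple diagnostic for this kind of accidental shortcut is to cross-tabulate incidental attributes against the target label before training. A hypothetical sketch using the smiling example above (the function and all data are invented for illustration):

```python
from collections import defaultdict

def shortcut_risk(rows, feature, label):
    """For each value of an incidental feature, how concentrated are the
    target labels? A score of 1.0 means the feature perfectly predicts
    the label - a red flag that the model may learn the shortcut
    instead of the concept."""
    by_value = defaultdict(list)
    for row in rows:
        by_value[row[feature]].append(row[label])
    return {value: max(labels.count(l) for l in set(labels)) / len(labels)
            for value, labels in by_value.items()}

# Hypothetical face-dataset metadata mirroring the example above:
# every Asian subject smiling, every black subject not smiling.
rows = ([{"smiling": True, "ethnicity": "asian"}] * 40
        + [{"smiling": False, "ethnicity": "black"}] * 40)
print(shortcut_risk(rows, "smiling", "ethnicity"))
```

Here both feature values score 1.0 – “smiling” predicts ethnicity perfectly in the training set – which is precisely the coincidental pairing you’d want to break up by rebalancing the images before training.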
Small sample size
Another common dataset problem is simply using too small a sample to get an accurate picture. This typically stems from not having collected a large enough pool of data in-house to give you the confidence you need in the outcome of your models.
Again, there’s a relatively simple solution: look beyond the limits of your own organization for valuable data. Who else in the data marketplace might have the raw information you need to enrich this model?
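To get a feel for why sample size matters, the standard margin-of-error formula for a proportion shows how fast uncertainty shrinks as the sample grows. A quick sketch (the 60% conversion rate is an invented example):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of a ~95% confidence interval for a proportion p
    estimated from a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A 60% conversion rate estimated from samples of different sizes:
for n in (50, 500, 5000):
    print(f"n={n}: 60% +/- {margin_of_error(0.6, n) * 100:.1f} points")
# n=50 gives roughly +/- 13.6 points; n=5000 roughly +/- 1.4 points.
```

With 50 records the “60%” result could plausibly be anywhere from the mid-40s to the mid-70s – which is exactly the kind of gap external data sources can close.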
Systemic value distortion (measurement bias)
Bad data will produce bad results. That’s a simple fact. If whoever collected and organized the data in the first place used poor tools and strategies, leading to inaccurate, incomplete, or misleading datasets, there’s not a lot you can do to fix it after the fact.
These issues are sometimes the result of badly thought-out data capture systems, but they may also stem from unconscious biases on the part of the people tasked with collecting the data – whether in designating column categories and descriptions, deciding which details are worth recording, or choosing whom to collect data on in the first place.
Before you begin creating models, it’s well worth discussing with business teams exactly what they’re looking for and assessing whether the data you have is capable of answering those questions in its current form. It’s also a good idea to compare results generated using your own datasets with datasets collected by other people, using different approaches, to ensure you get the same kinds of patterns and predictions.
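One lightweight way to run that cross-source comparison is to measure how far apart the same metric sits when it arrives from two different collection pipelines. A sketch with invented sensor readings (the function and the 0.5-standard-deviation flag threshold are illustrative assumptions):

```python
import statistics

def source_drift(sample_a, sample_b):
    """Crude comparison of the same metric from two collection pipelines:
    the gap between their means, expressed in pooled standard deviations.
    A large value suggests one pipeline measures differently."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    pooled_sd = statistics.pstdev(sample_a + sample_b)
    return abs(mean_a - mean_b) / pooled_sd

# Hypothetical: the same temperature logged by two systems, one of
# which is miscalibrated by a roughly constant offset.
system_a = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2]
system_b = [22.0, 21.7, 22.3, 21.9, 22.1, 21.8]
drift = source_drift(system_a, system_b)
print(f"{drift:.1f} standard deviations apart")  # flag if, say, > 0.5
```

A drift this large between sources collecting the “same” data is a strong hint that at least one capture pipeline is systematically distorting values – worth resolving before either dataset feeds a model.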
As we’ve seen, often the most effective way to challenge and overcome biases in your machine learning models is to diversify your data. If you stick entirely to internal data when training your company’s machine learning models, these will inherit any biases that guided the human decision-makers when they collected and supplied the data. It’s why you need more, better data – preferably, sourced from outside.
It also means you need a powerful data science platform in place that will allow you to bring in these different streams of data. One that will automate fiddly tasks like cleaning and harmonizing the data, allowing you to treat this as one single, coherent source.
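As a sketch of what “harmonizing” means in practice – the schemas, column names, and scale factors below are all invented for illustration – each source’s column names and units get mapped onto a shared schema before the records are pooled:

```python
def harmonize(records, column_map, unit_scale=None):
    """Rename a source's columns to a shared schema and rescale units so
    records from different providers can be pooled into one dataset."""
    unit_scale = unit_scale or {}
    out = []
    for record in records:
        row = {column_map.get(key, key): value for key, value in record.items()}
        for col, factor in unit_scale.items():
            if col in row:
                row[col] = row[col] * factor
        out.append(row)
    return out

# Hypothetical: an external provider reports revenue under a different
# column name, in thousands of dollars rather than dollars.
internal = [{"customer": "a", "revenue_usd": 1200}]
external = harmonize([{"cust_id": "b", "rev_k_usd": 3.5}],
                     column_map={"cust_id": "customer",
                                 "rev_k_usd": "revenue_usd"},
                     unit_scale={"revenue_usd": 1000})
combined = internal + external  # one schema, one coherent source
print(combined)
```

Real platforms automate far more than this (deduplication, entity resolution, type coercion), but the principle is the same: the model should never see two incompatible encodings of the same fact.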
The more sources of data you incorporate and the larger the pools of data you use, the less risk you run of being blindsided by inherent bias in the data – whether that’s created by human intervention, by chance, or by too small a sample size.