Your machine learning model is only as good as the data you feed into it. That makes data preparation (or cleaning, wrangling, cleansing, pre-processing, or any other term you use for this stage) incredibly important to get right. It will likely take up a considerable chunk of your time and energy.
Data preparation for analytics or, more likely, machine learning means converting data into a form that’s ready for fast, accurate, efficient modeling and analysis. That means stripping out inaccuracies and other problems that cropped up during data gathering, improving the quality, and reducing the risk of data bias.
If you use Python for data science, you’ll be working with the Pandas library. In this article, we’ll look at some of the key steps you should go through before you start modeling data.
Before you dive in, it’s crucial that you have a clear understanding of why this particular dataset has been selected, as well as precisely what it means. Why is this dataset so significant? What do you want to learn from it and exactly how will you use what it contains? (These decisions are rooted in domain knowledge and careful collaboration with your business colleagues.)
Once you’ve loaded your data into Pandas, there are a few simple things you can do immediately to clean it up. For example, you could:
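What those quick clean-ups look like depends on your data, but a few typical first passes (illustrative only, with invented column names) might be:

```python
import pandas as pd

# A small sample frame standing in for your real dataset
df = pd.DataFrame({
    " First Name ": ["Ann", "Ann", "Bob"],
    "Age": ["34", "34", "29"],
})

# Drop exact duplicate rows
df = df.drop_duplicates()

# Tidy up column names: strip whitespace, lowercase, underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Fix obviously wrong dtypes (ages loaded as strings)
df["age"] = df["age"].astype(int)
```

None of these steps is mandatory; they are simply cheap checks that catch a surprising number of problems early.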
On closer inspection of what each column means and whether it’s relevant to your purposes, you could then eliminate any that:
If you are dealing with a very large dataset, removing columns with a high proportion of missing values will speed things up without damaging or altering the overall meaning. This is as easy as using Pandas’ .dropna() function on your data frame. For instance, the following drops any column that is missing more than half of its values:

df = df.dropna(axis=1, thresh=len(df) // 2)
However, it’s also worth noting the issue so that you can identify potential external data sources to combine with this dataset, in order to fill any gaps and enrich your model later on.
If you are using a smaller dataset, or are otherwise worried that dropping the instance/attribute with the missing values could weaken or distort your model, there are several other strategies you can use. These include:
Think carefully about which of these approaches will work best with the machine learning model you are preparing the data for. Decision trees don’t take too kindly to missing values, for example.
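For instance, two common imputation strategies, filling gaps with the column mean or with the more outlier-robust median, can be sketched like this (the column name is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [1.7, np.nan, 1.8, 1.6]})

# Impute with the column mean (common for roughly symmetric numeric data)
df["height_mean"] = df["height"].fillna(df["height"].mean())

# Or with the median, which is more robust to outliers
df["height_median"] = df["height"].fillna(df["height"].median())
```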
Note that, when using Python, Pandas marks missing numerical data with the floating-point value NaN (not a number). This special value is defined in the NumPy library, which you will also need to import. Having this default marker makes it much easier to spot missing values quickly and make an initial assessment of how extensive the problem is.
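A quick way to gauge the extent of the problem, shown here with a small made-up frame, is to count missing values per column and as a proportion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 1.0],
})

# Count missing values per column...
missing_counts = df.isnull().sum()

# ...and as a share of each column, to see how bad it is
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```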
Before you can make this decision, you need to have a fairly clear idea of why you have outliers. Is this the product of mistakes made during data collection? Or is it a real anomaly, a useful piece of data that can add something to your understanding?
One quick way to check is to compute the interquartile range with a simple script that returns Boolean values of True for outliers and False for normal values:

import pandas as pd

df = pd.read_csv("dataset.csv")
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
print((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
You can also put your data into a box plot to more easily visualize outlier values:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('dataset.csv')
plt.boxplot(df["column"])
plt.show()
You may also find it useful to apply a log or square root transformation. This can reduce an outlier’s impact on the model when it appears in an independent variable, and can help your model’s assumptions hold better when it appears in the dependent variable.
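A minimal sketch of both transformations, using a made-up income column (np.log1p is used rather than np.log so that zeros are handled safely):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20_000, 35_000, 40_000, 1_000_000]})

# log1p compresses the long right tail of the distribution
df["income_log"] = np.log1p(df["income"])

# A square-root transform is a milder alternative
df["income_sqrt"] = np.sqrt(df["income"])
```

After the log transform, the extreme value sits much closer to the rest of the data than it did on the original scale.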
That said, the most important thing is to think carefully about your reasoning for including or removing the outlier (and for how you handle it if you leave it in). Rather than applying a one-size-fits-all approach and then forgetting about it, this keeps you cognizant of potential challenges and issues in the model, so you can discuss them with your colleagues and refine your approach.
Having fixed the issues above, you can begin to split your dataset into input and output variables for machine learning and to apply a preprocessing transform to your input variables.
Precisely what kind of transformations you make will, of course, depend on what you plan to do with the data in your machine learning model. A few options are:
Best for: logistic regression, linear regression, linear discriminant analysis
If any attributes in your input variables follow a Gaussian distribution but with differing means or standard deviations, you can use these techniques to standardize them to a mean of 0 and a standard deviation of 1. You can import the sklearn.preprocessing module to use its StandardScaler standardization tool:

import pandas as pd
from sklearn import preprocessing

names = df.columns
scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
Best for: gradient descent (and other optimization algorithms), regression, neural networks, algorithms that use distance measures, e.g. K-Nearest Neighbors
This involves rescaling data attributes with different scales so that they all fall within the same range, typically 0 to 1, using a tool such as scikit-learn’s MinMaxScaler.
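A sketch of min-max rescaling (the column names here are invented for illustration):

```python
import pandas as pd
from sklearn import preprocessing

# Two attributes on very different scales
df = pd.DataFrame({"age": [20, 30, 40], "income": [20_000, 50_000, 80_000]})

# MinMaxScaler rescales each column to the 0-1 range
scaler = preprocessing.MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```

After rescaling, both columns run from 0 to 1, so neither dominates distance-based calculations simply because of its units.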
Best for: algorithms that weight input values, e.g. neural networks, algorithms that use distance measures, e.g. K-Nearest Neighbors
If your dataset is very sparse and contains a lot of 0s, but the attributes you do have use varying scales, you may need to rescale each row/observation so that it has a unit norm/length of 1. Note that this row-wise normalization is handled by scikit-learn’s Normalizer (MinMaxScaler, by contrast, rescales columns), so you’ll also need the scikit-learn library (sklearn):

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('dataset.csv')
normalizer = preprocessing.Normalizer()
df_normalized = pd.DataFrame(normalizer.fit_transform(df), columns=df.columns)

The result is a table in which every row has a length of 1, so no single observation produces extreme results when distance measures are applied.
Best for: feature engineering, transforming probabilities into clear values
This means applying a binary threshold to data so that all values below the threshold become 0 and all those above it become 1. Once again, we can use a scikit-learn tool (Binarizer) to help us quickly solve the problem (here we’ll be using a sample table of potential recruits’ ages and GPAs as an example):

from sklearn.preprocessing import Binarizer
import pandas as pd

df = pd.read_csv('testset.csv')

# select the columns to binarize
age = df.iloc[:, 1].values
gpa = df.iloc[:, 4].values

# reshape them into the 2D arrays scikit-learn expects
x = age.reshape(1, -1)
y = gpa.reshape(1, -1)

# set the thresholds that decide what becomes 1 or 0
binarizer_1 = Binarizer(threshold=35)
binarizer_2 = Binarizer(threshold=3)

# finally, run the Binarizer
print(binarizer_1.fit_transform(x))
print(binarizer_2.fit_transform(y))
Your output will go from something like this:
Original age data values : [25 21 45 ... 29 30 57] Original gpa data values : [1.9 2.68 3.49 ... 2.91 3.01 2.15]
Binarized age : [[0 0 1 ... 0 0 1]] Binarized gpa : [[0 0 1 ... 0 1 0]]
Don’t forget to summarize your data to highlight the changes before you move on.
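A quick way to do that is Pandas’ .describe(), sketched here with a small made-up column:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 21, 45, 29]})

# .describe() gives a statistical summary (count, mean, quartiles,
# min/max) to sanity-check your changes at a glance
summary = df.describe()
print(summary)
```

Running this before and after your transformations makes it easy to confirm that each step did what you expected.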
As we’ve seen, data preparation is vital, but it can be a fiddly task. The more datasets you draw on, the longer it takes to harmonize them, applying different cleaning, pre-processing, and transformation steps so that everything works together seamlessly.
If you plan to go down the (advisable) route of incorporating external data to enrich your machine learning models, bear in mind that you will save a lot of time by going through a platform that automates much of this data cleaning for you. At the end of the day, data preparation is important enough to take time and care getting right, but that doesn’t mean you should misdirect your energies into easily automated tasks.