As they say, the proof is in the pudding, and data preparation is where the pudding is put together. Any mistakes you make here will be baked into your dataset and later your model deployment, becoming harder to fix as you approach deployment. Fortunately, using machine learning (ML) tools like Python can help you avoid falling in a technical hole early on. Here’s how to make sure you do data preparation with Python the right way, right from the start.
Data preparation using Python: performing ETL
A key part of data preparation is extract-transform-load (ETL). This involves moving, drawing out, and copying data from one location to another.
There are several reasons you might need to do this. Perhaps you have high volumes of data from a number of different data sources, both internal and external, that need to be brought together into a single, consolidated resource. Or maybe you need to change the data into another format as you move it into the target database. Whatever your reason, this part of your data preparation is crucial.
First, you will need to collect/read the data from the original source or database. It is, of course, vital that the tool you use for this can support any kind of data source you throw at it, including SQL and NoSQL data, and file-formats including XlS, XML, CSV, and JSON.
It’s also important that you take the time to understand your data sources. You need a clear idea of how data is stored in any external datasets, how this will interact with your internal data infrastructure, and how this may affect speed and latency within the data transfer. Any bottlenecks that start forming now will only get worse later, when you start feeding extracted and transferred data into your models.
It makes sense to extract this data into a staging area rather than moving it directly into your destination dataset. This means you can apply transformations to it without risking making changes to the full target dataset at the same time.
Data preparation using Python: cleaning
To keep the extraction process going smoothly, you need to clean up your data. There are a ton of potential problems lurking in datasets that haven’t yet been cleaned up and made consistent — many of them down to really simple formatting errors. For example, some entries may be capitalized or in ALL CAPS, causing them to be read as different terms. Or, the same type of data may be formatted differently (such as having first name-last name in one set and initial-last name in another).
Fortunately, the Python Data Analysis Library (pandas) has plenty of easy-to-apply normalization tools and techniques that you can use in this preprocessing stage. These include techniques to change all words into lowercase letters, remove unnecessary punctuation, remove null values, make the formatting of values like currencies consistent, and even strip out “stop” words to tidy up text sources for natural language processing.
Note that it isn’t just internal errors and inconsistencies you need to worry about; you also need to make sure that data entries and columns are organized in the same way in the source data as in the destination datasets.
While your data should now be cleaned up, it may still need to be transformed into something usable for machine learning models.
Basic transformations are essentially a more advanced version of data cleaning, including features like row operations, joins, sorting, and aggregations. More advanced transformations include techniques like field decoding for data drawn from multiple sources, merging datasets for clarity and ease of use, and separating out single fields into multiple ones, such as turning addresses into different columns for street, town, zip code, state, and country.
Data preparation using Python: basic and advanced transformations
When it comes to basic transformations, you should be able to pick out ready-to-use pandas techniques for these. For example, if you have missing data to address, you could also use pandas to:
- Remove columns that have a lot of missing values, by applying the dropna() function
- Replace all your missing (NaN) values with 0 using the df.fillna(0) function. (Note that this is only appropriate if you have a small number of missing values. If you have a lot, changing the values to 0 could skew your models.)
- Creating a mean, median, or mode attribute for the missing columns
- Imputing the missing values of the attribute using linear regression
- Clustering the dataset and then using inter-cluster regression to fill in the missing values.
Other types of basic transformation include removing duplicate values, and mapping values to standardized formats, e.g., changing “male” and “female” to “M” and “F”.
When choosing more advanced transformations with Python, it’s important to look ahead to what you’re preparing this data to do in the resulting machine learning models.
Some examples include:
- If you expect to perform logistic regression, linear regression, or linear discriminate analysis, you may want to consider standardizing the data. A useful tool for this is the StandardScalertool from the scikit-learn.preprocessing library.
- If you intend to use the data for regression, neural networks, algorithms that use distance measures, e.g. K-Nearest Neighbors, or optimization algorithms like gradient descent, you might rescale the data.
We explain how to apply these (and other) transformations using Python in more depth in this blog post.
Now it’s time to load your transformed data into your target database, ready to be used for ML (or other) purposes. This makes it imperative that your chosen ETL tool supports connectors for whatever database you are using, be that MS SQL, Oracle, Snowflake, or so on.
If you’ve tackled the extraction and transformation steps correctly, this should go relatively smoothly. That said, you still need to choose whether to load the data all at once (which takes longer but means you’ll have all the data you need to work with at the same time) or to do it in stages (handy if you’re dealing with real-time data or constantly updated datasets).
Make sure, too, that you set aside some time for load verification. That means quality checking your datasets to ensure it’s all labeled right and no pesky errors or inconsistencies have snuck through.
Getting the right tools for data preparation using Python
There are a whole bunch of Python-specific libraries and tools out there that can make this easier. We’ve mentioned pandas and the machine-learning-focused SKLearn, but there are also purpose-built ETL tools like PETL, Bonobo, Luigi, Odo, and Mara.
However, it’s also important to look at the bigger picture. How will these tools, or the databases they feed into, connect with your machine learning platform? Can you cut out the need for a separate tool by choosing a data science platform that automates connections to external data sets, right from the start? One that, perhaps, is designed to interface with Python SDK?
ETL and data preparation comprise just one stage of the complete data pipeline. For best results, you need to consider how the whole process will fit together, right from preparation to deployment.