Like Italian cooking, data science is all about quality ingredients. It’s not enough to simply have a lot of data; you need to make sure the data you have is good. That it’s relevant for your purposes. That it’s cleaned and prepared right. That you have enough of the data you need and have removed the parts you don’t.
If you skip over these early steps, your machine learning (ML) models won’t come out the way you hoped — and once you’ve already started throwing all your data into the mix, it’s extremely tricky and time-consuming to rectify those underlying problems with the raw ingredients.
That means you need to take a good look at what works and what doesn’t right from the start and pay close attention to how the structure, features, and general shape of your data impact your ML models at every stage of the process. It also means you need to think carefully about what you can learn (and what you can’t) from the data you already have stored internally, what other types of data you’ll need to source externally, and how your entire ML pipeline works together to help you get the insights you want. It means a focus on data.
While you’re training your models, they use data to learn how to detect the right patterns and make the kinds of predictions you want. In the testing stage, they use data to make sure the predictions they’re making are the right ones. In production, they use data to find new insights and make business-critical predictions. If the data you’ve used throughout the ML lifecycle is accurate, you’ll get accurate predictions. If you’ve used the wrong data, you’ll get the wrong predictions. That’s the bottom line.
When you start building ML models using the wrong data, or low-quality data, you create problems that you’ll need to fix later for your models to work. Not only that, but these problems will also be far harder to solve the further you push them down the line.
This means you need to focus on data carefully from the outset. Not only what you want to build from your datasets but also what you need to get out of the data itself.
Ignoring these issues with your data means your models simply won’t work. Just dumping all your data in a sprawling database or data lake and hoping for the best means your models will always be that little bit weaker. You won’t use the data to its full potential to train and improve your models and generate meaningful insights. You’ll struggle to demonstrate real value, business impact, and ROI.
It’s crucial that you view your data as the focal point of the entire exercise. Acquiring the data is only the first part of the challenge, and “dealing with your data” isn’t a single step in the workflow, one that you check off the list and move on from.
Rather, you will need to come back to your data constantly, auditing it, filling in the gaps, cleaning and harmonizing it, getting it ready to train and test your models, monitoring how well it’s doing. Asking yourself: Why is this working? Why isn’t this working? What else do we need? How can we make it better? How can we split the data in new ways? How can we use this data in different ways for training? Are we getting the right features out of it? The right insights?
It’s about having the processes in place to ensure you’re consistently, continuously looking at your data (that you have that focus on data) to figure out the best way to position, manage, and use it. That you’re focusing on the right segments of your data, and the most powerful connections between datasets and data points. That’s why we talk about the machine learning lifecycle, rather than a sequence from start to finish.
It helps if you think of data as a process more than a step. Imagine the data workflow like this:
1. Data auditing
This is where you make sense of the data you already have, getting a clear view of how you collect it, and from where. You’ll discover where your blind spots are and what data you still need to source from outside to build the models you want and realize your data science goals.
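In practice, an audit often starts with simple profiling: counting rows, duplicates, and missing values per column so you know exactly where your blind spots are. Here’s a minimal sketch using pandas; the dataset and column names are purely illustrative, not from any specific pipeline.

```python
import pandas as pd

# Hypothetical customer dataset; the column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "region": ["NA", "EU", None, "EU", "NA"],
    "annual_spend": [1200.0, None, 850.0, 430.0, None],
})

# A minimal audit: how complete is each column, and are there duplicates?
missing_share = df.isna().mean()             # fraction missing per column
duplicate_rows = int(df.duplicated().sum())  # exact duplicate records

audit = {
    "rows": len(df),
    "duplicates": duplicate_rows,
    "pct_missing": missing_share.round(2).to_dict(),
}
print(audit)
```

A report like this immediately tells you which fields are too sparse to train on and which gaps you’ll need to fill from external sources.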
2. Filling in the gaps
This is when you go out and find the data that you now know you’re missing. You look for the right sources, make sure the data you’ve located is relevant, and figure out how to integrate it into your data pipeline effectively.
3. Preparing the data
Now you’re ready to take all your new and old data, merge it, clean it, check it’s properly labeled without gaps or conflicts, and harmonize it so that your models will be able to treat it as a single, unified resource.
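The merge-and-harmonize step might look like the following pandas sketch. The two sources, their column names, and the fill-in default are all hypothetical; the point is to normalize values before joining and to make gaps explicit rather than silently dropping rows.

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
internal = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["us", "DE", "us"],
})
external = pd.DataFrame({
    "customer_id": [2, 3, 4],
    "firmographic_score": [0.7, 0.4, 0.9],
})

# Harmonize categorical values before merging so they compare cleanly.
internal["country"] = internal["country"].str.upper()

# Left-join the external enrichment onto the internal records.
merged = internal.merge(external, on="customer_id", how="left")

# Make gaps explicit instead of losing rows that lack enrichment.
merged["firmographic_score"] = merged["firmographic_score"].fillna(0.0)
```

A left join keeps every internal record even when no external match exists, which is usually what you want when enriching a primary dataset.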
4. Feature engineering
Next, you’ll go through your enriched datasets looking for new features that will generate accurate, useful insights, helping you reach your goal and make critical predictions for your organization.
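A common pattern here is aggregating raw event data into per-entity features. This sketch derives order count, total spend, and average order value from a hypothetical transaction log; the field names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transaction log; the fields are illustrative only.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 40.0, 5.0, 15.0, 10.0],
})

# Derive per-customer features from raw events: volume, total, average.
features = tx.groupby("customer_id")["amount"].agg(
    n_orders="count",
    total_spend="sum",
    avg_order="mean",
).reset_index()
```

Each engineered column is a candidate model input; whether it actually helps is something you validate during training and testing.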
5. Training and testing
You’re now ready to let your spruced-up datasets loose on your models for the first time. At this stage, you carefully split your data and feed part of it to your model to train it and another part to test it, making sure the model works properly before deploying it.
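The split itself is typically one call with scikit-learn. This toy example holds out 20% for testing, stratifies so both splits keep the same label balance, and fixes the random seed for reproducibility; the data here is placeholder, not from any real pipeline.

```python
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels; real data would come from your pipeline.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% for testing; stratify to preserve the label balance and
# fix the seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Keeping the test set untouched until the end is what lets you trust the evaluation before you deploy.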
6. Monitoring and refreshing
Even once your machine learning model is in production, you need to keep thinking about the data you’re feeding into it. Using data from years or even months earlier means that, over time, your predictions will become less accurate and actionable. This is why you need to keep tracking your model, making sure the insights it produces are relevant, avoiding data and model drift, and retraining the model with new, relevant data periodically.
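A very simple drift check, sketched below with only the standard library, compares a feature’s distribution in production against its distribution at training time. The two-standard-deviation threshold and the sample values are assumptions for illustration; real monitoring would use proper statistical tests over many features.

```python
import statistics

# Hypothetical snapshots of one model input: training time vs. today.
training_values = [10.0, 12.0, 11.0, 13.0, 12.0, 11.5]
production_values = [15.0, 16.5, 14.0, 17.0, 15.5, 16.0]

# A crude drift check: flag the feature if the production mean has moved
# more than two training standard deviations from the training mean.
train_mean = statistics.mean(training_values)
train_std = statistics.stdev(training_values)
prod_mean = statistics.mean(production_values)

drifted = abs(prod_mean - train_mean) > 2 * train_std
```

When a check like this fires, it’s a signal to investigate the input source and, if the shift is real, to retrain the model on fresher data.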
Good machine learning requires not just a focus on data but a focus on good data; that much we know. However, it’s not only about finding the right data. If you want to build great machine learning models, you need more than just data: you need to know how to deal with it at each step, and why. More importantly, you need to know it before you start planning your models.
That’s why we put together a complete series that dives deep into each part of the data workflow we outlined above. First, we’ll explore the data auditing and discovery process, where you need to examine your data and fill in the gaps. Next, we’ll explore the ways you can prepare your data for maximum impact. Finally, we’ll look at feature engineering, training, and how you can monitor your data once your models are in production to ensure they remain relevant.
Don’t waste time building models that won’t give you results because you neglected your data. Read our Explorium Explains: Data for Machine Learning series to learn about the best ways to focus on data throughout the ML lifecycle.