When it comes to data science, machine learning, and artificial intelligence, the consensus is that good data is essential. When it comes to training models, you commonly hear “garbage in, garbage out”: a model will only be as good as the data you feed it, so data quality is important. Read on to learn more about data quality from a data science expert, Noam Cohen, Data Products Team Leader at Explorium.
Data quality is a trendy term in the data landscape. Today, it is broadly agreed that there are three main dimensions of data quality: the correctness, the completeness, and the freshness of the data. In my view, a data quality assessment should be defined by the use case that the data serves and by the downstream task that the data is used for. In some cases, good quality means having complete and comprehensive data for aggregate analysis, where outliers don’t play a significant role. In other cases, single-row precision is critical, because the user is making decisions at that level of granularity.
Poor-quality data is not only hard to detect, but can also easily lead to bad "data-driven" conclusions, because data is often treated as the main source of truth in decision making. When stakeholders lose faith in data because of quality issues, they fall back on suboptimal, intuition-based decisions. This is why data quality is such an important concept: it directly influences business metrics and ROI. Ultimately, bad data leads to bad business decisions.
There are three main data quality measures: correctness, completeness, and freshness.
Why is data important? Data drives decisions. It can drive day-to-day decisions about whether certain business processes are working well or need to be adjusted, and where to focus improvements. It can also drive strategic decisions, for example, around audience segmentation for targeting purposes. Many business questions can be answered with data, and they all rely on the assumption that the data actually represents reality (with some level of noise). Bad data increases the risk of making a wrong decision. This is true both when a human is involved in that decision and when a computational model is making a prediction. For example, suppose a company collects users’ birth dates as part of its sign-up funnel and chooses the next page’s content according to age, in order to maximize engagement. If there is a default date that many users simply accept because they don’t want to disclose personal information, a disproportionate number of records will share the same value, which can heavily bias an ML model that picks the best funnel. This bias should be detectable by looking at the data distribution and seeing that it doesn’t make sense, but it is a good example of how inaccurate data can lead to poor decision making.
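The default-date problem described above can be checked with a simple look at the value distribution. The following is a minimal sketch, not a description of any particular production process: the function names, the 5% threshold, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def dominant_value_share(values):
    """Return (most_common_value, share_of_records) for a non-empty list."""
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    return value, count / len(values)


def looks_like_default(values, max_share=0.05):
    """Flag a column when a single value covers more than max_share of records.

    max_share is an illustrative threshold; a sensible value depends on the
    field (a birth date should rarely repeat this often, a country code will).
    """
    value, share = dominant_value_share(values)
    return share > max_share, value, share


# Hypothetical sample: 40 users accepted the form's default date,
# 60 users entered distinct real dates.
default_date = date(1990, 1, 1)
real_dates = [date(1980 + i % 30, 1 + i % 12, 1 + i % 28) for i in range(60)]
birth_dates = [default_date] * 40 + real_dates

flagged, value, share = looks_like_default(birth_dates)
print(flagged, value, round(share, 2))
```

A check like this only flags suspicious columns. Deciding whether the dominant value is a form default or a real pattern still takes someone who knows how the data was collected.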
The traditional data quality metrics are completeness, correctness, and freshness (as previously discussed). Freshness is easy to calculate; however, completeness and correctness require special expertise. Calculating completeness requires a good reference dataset with the right proportions of the subgroups that are of interest to the use case that the data is serving. For instance, for a software company measuring the completeness of its website traffic data, a brick-and-mortar business would be a much less helpful reference than the traffic data of similar software businesses’ websites. Correctness is difficult to calculate because it requires labeled data whose values have been validated as correct. Building these data samples is time-consuming and hard, as it is difficult to decide whether the values are true or not.
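As a rough illustration of the three metrics, the sketch below computes each one in its simplest form. The function names, the record layout, and the sample values are assumptions made for this example. Real completeness and correctness measurements depend heavily on how the reference dataset and the validated sample are built.

```python
from datetime import datetime, timezone


def freshness_days(last_updated, now):
    """Age of the data in days: the easy metric."""
    return (now - last_updated).total_seconds() / 86400


def coverage_gap(observed_counts, reference_shares):
    """Per-subgroup difference between the observed share and a reference share.

    A negative gap means the subgroup is under-represented relative to the
    reference dataset, which is one way to express incompleteness.
    """
    total = sum(observed_counts.values())
    if total == 0:
        raise ValueError("observed_counts is empty")
    return {
        group: observed_counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


def correctness(records, validated_values, field):
    """Share of validated records whose field matches the validated value."""
    checked = [
        records[key][field] == truth
        for key, truth in validated_values.items()
        if key in records
    ]
    return sum(checked) / len(checked) if checked else None


# Hypothetical inputs for illustration only.
age = freshness_days(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 11, tzinfo=timezone.utc),
)
gaps = coverage_gap({"smb": 80, "enterprise": 20}, {"smb": 0.5, "enterprise": 0.5})
accuracy = correctness(
    {1: {"industry": "software"}, 2: {"industry": "retail"}},
    {1: "software", 2: "software"},
    "industry",
)
print(age, gaps, accuracy)
```

The code itself is short. The expensive part is what it takes as input: a reference distribution that matches the use case, and a labeled sample someone has actually validated.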
The main difference between managing the quality of internal and external data is the collection process. With internal data, you can have full control over the way you collect and process the data. You have direct access to the raw form and to the different transformations and cleaning modules that process the data before it is served. When managing external data, you usually can’t know in advance that there are collection issues (such as a source shutting down), so you need processes that detect changes in the statistics over time and draw conclusions from the data itself. This usually requires some subject matter expertise and sound data analysis.
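One simple way to detect changes in the statistics over time is to track a per-batch statistic, such as the share of missing values in a field, and compare each new delivery with its own history. The sketch below uses a z-score against past batches. The function names, the threshold of 3, and the sample numbers are assumptions for illustration, and a z-score is only one of many possible detection rules.

```python
from statistics import mean, stdev


def null_rate(rows, field):
    """Share of rows where the field is missing or None."""
    return sum(1 for row in rows if row.get(field) is None) / len(rows)


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag the latest value if it is far from the historical mean.

    history is a list of the same statistic from earlier batches.
    With fewer than two past values there is no baseline, so nothing is flagged.
    """
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical null rates of one field across five past deliveries.
past_null_rates = [0.02, 0.03, 0.025, 0.02, 0.03]

# A delivery where 40% of values are missing, e.g. after a source shut down.
print(is_anomalous(past_null_rates, 0.40))
# A delivery in line with the past.
print(is_anomalous(past_null_rates, 0.025))
```

A flag from a rule like this is a prompt for investigation, not a verdict. Working out whether a shift reflects a collection issue or a real change in the world is where the subject matter expertise comes in.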
From my perspective, the biggest challenges to data quality standards are:
This is easy to do; nonetheless, one needs to establish a solid quality analysis process. What we do at Explorium is:
These are some of the best practices that Explorium uses to ensure data consistency and that customers get the most reliable, high quality external data.
This is a hot vertical, and there are many startups providing data quality tools and solutions. Data integration and observability tools focus on making analysts and data scientists aware of issues with their ETLs before those issues reach their customers or data products. Many of these companies integrate with the on-call and incident management system and add a layer of monitoring tools customized for data issues. Their main value is allowing the organization to be more proactive about its data incidents and to shorten time-to-detection. Another value they highlight is localizing incidents within the data lineage. They do this through automated data migration reports, ETL regression testing, column-level lineage, anomaly thresholding, and smart alerting.
Many organizations today are leveraging their internal data and enriching it with external data to solve some of their most complex business problems. With the amount of big data that is available today, and the growing number of data sources and providers, the value lies more in quality than in quantity.
Try the Explorium External Data Platform for free today and get access to high-quality data that meets regulatory compliance standards to feed your analytics, business intelligence, and predictive machine learning models.
To learn more about data quality, check out these additional resources:
Ihab F. Ilyas - Data Cleaning