When it comes to data science, machine learning, and artificial intelligence, the consensus is that good data is essential. When it comes to training models, you commonly hear "garbage in, garbage out": a model will only be as good as the data you feed it, which is why data quality is so important. Read on to learn more about data quality from a data science expert – Noam Cohen, Data Products Team Leader at Explorium.
What do we mean when we talk about “Data Quality”?
Data quality is a trendy term in the data landscape. Today, it is broadly agreed that there are three main dimensions of data quality – correctness, completeness, and freshness. In my view, a data quality assessment should be defined by the use case the data serves. It also depends on the downstream task the data is used for. In some cases, good quality means having complete, comprehensive data for aggregate analysis, where outliers don't play a significant role. In other cases, single-row precision is critical, because the user is making decisions at that level of granularity.
Poor-quality data is not only hard to detect, but can also easily lead to bad "data-driven" conclusions, because data is often treated as the main source of truth in decision making. When stakeholders lose faith in data because of quality issues, they fall back on suboptimal, intuition-based decisions. This is why data quality is such an important concept: it directly influences business metrics and ROI. Ultimately, bad data leads to bad business decisions.
What are the key attributes of high quality data?
There are three main data quality measures:
- Data Correctness – How accurately the data value describes real-world facts. For example, a B2B sales rep wishes to look at a prospect company's number of employees. If they accidentally grab the wrong company from their database, because its name and location are similar to another organization's, they will report a wrong number, be misinformed, and potentially lose an opportunity to sell to a qualified prospect. In this case, the rep used incorrect data. Correctness is usually measured with classification metrics such as precision – the share of checked data points whose values are correct. Common root causes of correctness issues include collection noise, faulty data transformations, outdated data, and incorrect schema descriptions.
- Data Freshness – This refers to how relevant the data is to describing the current state of an entity, and takes into consideration the timeliness of the data and how frequently it is updated. This is a tricky measurement, as "freshness" ranges from data updated in real time to data updated annually. Each business use case will differ in its data freshness thresholds and requirements. For example, data that doesn't change frequently, like a person's or institution's name, would not require the same freshness as stock market data or Twitter trends. In any case, data must be up to date; if it is not, it can mislead a decision. This metric is typically measured in units of time.
- Data Completeness (aka Data Coverage) – A measure of how whole and complete a data asset is. Completeness is especially important when you want to attach new attributes to your existing data. With low coverage, you get limited support for the different attributes that you enrich, and the data becomes less useful. Coverage is also important if you want to extract insights from your dataset. A complete dataset approximates real-world phenomena, so aggregates and descriptive statistics are less biased and lead to valid conclusions. Insufficient coverage increases the risk of biases, such as underrepresented strata in your conclusions. For instance, if your collection process has a gender bias toward males (compared to females), such as when data is collected at sports events, it can lead to incorrect audience segmentation and an ineffective marketing strategy. Traditionally, collection bias is not considered a classical correctness problem, as the information might be correct and timely. However, inadequate representation of different cohorts in the data is an issue that is difficult to detect, yet can easily drive the wrong data-led decisions.
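The three measures above can be illustrated with a minimal sketch. All the entities, field names, and thresholds below are hypothetical, invented purely for illustration:

```python
from datetime import date

# Toy records: employee counts we collected vs. verified ground truth.
collected = {"Acme Corp": 120, "Globex": 500, "Initech": 75}
ground_truth = {"Acme Corp": 120, "Globex": 480, "Initech": 75}

# Correctness: share of collected values that match verified facts.
correct = sum(collected[k] == ground_truth[k] for k in collected)
correctness = correct / len(collected)  # 2 of 3 values match

# Freshness: days since each record was last updated.
last_updated = {
    "Acme Corp": date(2023, 1, 10),
    "Globex": date(2021, 6, 1),
    "Initech": date(2023, 1, 9),
}
as_of = date(2023, 1, 15)
staleness_days = {k: (as_of - d).days for k, d in last_updated.items()}

# Completeness: share of requested entities for which we hold any value.
requested = ["Acme Corp", "Globex", "Initech", "Umbrella"]
completeness = sum(k in collected for k in requested) / len(requested)

print(correctness, completeness, staleness_days["Globex"])
```

In practice each measure needs a trusted reference (ground truth, update timestamps, a target entity list), which is exactly what makes them hard to compute, as discussed later in this piece.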
Why is data quality so important?
Why is data important? Data drives decisions. It can drive day-to-day decisions, such as whether certain business processes are working well or need to be adjusted, and where to focus improvements. It can also drive strategic decisions, for example around audience segmentation for targeting purposes. Many business questions can be answered with data, and they all rely on the assumption that the data actually represents reality (with some level of noise). Bad data increases the risk of making a wrong decision. This is true both when a human is involved in the decision and when a computational model is making a prediction. For example, suppose a company collects users' birth dates as part of its sign-up funnel and suggests the next page's content according to age, in order to maximize engagement. If there is a default date that many users simply approve, because they don't want to disclose personal information, this leads to a disproportionate number of repeated records with the same value and can heavily bias the decision of an ML model that picks the best funnel. Of course, this bias should be detected by looking at the data distribution and seeing that it doesn't make sense, but it is a great example of how inaccurate data leads to potentially poor decision making.
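The default-date problem above can be caught with a simple distribution check. This is a sketch, not any particular product's implementation; the years and the 25% threshold are assumptions made up for the example:

```python
from collections import Counter

# Hypothetical birth years collected at sign-up; 1990 is the form's default.
birth_years = [1990] * 60 + [1985, 1992, 1978, 2001, 1969, 1995] * 5

counts = Counter(birth_years)
n = len(birth_years)

# Flag any single value holding an implausibly large share of records.
threshold = 0.25  # assumption: no real birth year should exceed a 25% share
suspicious = {year: c / n for year, c in counts.items() if c / n > threshold}
print(suspicious)  # only the default year 1990 exceeds the threshold
```

A spike like this in a field that should be smoothly distributed is a strong hint that a default value, rather than real user input, is dominating the records.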
How is data quality determined? Does this process differ for internal data vs. external data?
The traditional data quality metrics are completeness, correctness, and freshness (as previously discussed). Freshness is easy to calculate; completeness and correctness require special expertise. Calculating completeness requires a good reference dataset with the right proportions of the subgroups relevant to the use case the data serves. For instance, a software company calculating the completeness of its website traffic data against a brick-and-mortar reference business would learn much less than by referencing the traffic data of similar software businesses' websites. Correctness is difficult to calculate because it requires labeled data that has been validated to contain correct values. Building these data samples is time-consuming and hard, as it is often difficult to decide whether a value is true or not.
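One way to quantify how well a dataset's subgroup mix matches a reference, as described above, is to compare proportions directly. A minimal sketch, with made-up segment names and shares:

```python
# Hypothetical subgroup shares: our dataset vs. a trusted reference of
# similar software businesses (shares sum to 1.0 in each).
ours = {"enterprise": 0.10, "mid_market": 0.30, "smb": 0.60}
reference = {"enterprise": 0.25, "mid_market": 0.35, "smb": 0.40}

# Total variation distance: half the sum of absolute share differences.
# 0.0 means identical proportions; 1.0 means completely disjoint.
tvd = 0.5 * sum(abs(ours[g] - reference[g]) for g in reference)
print(round(tvd, 2))
```

The quality of this comparison depends entirely on how well the reference dataset mirrors the population the use case cares about, which is the hard part the interview points to.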
The main difference in managing the quality of internal vs. external data is the collection process. With internal data, you have full control over the way you collect and process the data: you have direct access to its raw form and to the various transformation and cleaning modules that process the data before it is served. With external data, you usually can't know in advance that there are collection issues (such as a source shutting down), so you need processes that detect changes in the statistics over time and draw conclusions from the data itself. This usually requires some subject matter expertise and sound data analysis.
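Detecting "changes in the statistics over time" can start very simply: snapshot a few summary statistics per delivery and alert on large relative shifts. The metric names, values, and 30% tolerance below are all hypothetical:

```python
# Hypothetical daily snapshots of an external feed's summary statistics.
yesterday = {"row_count": 10_000, "null_rate": 0.02, "mean_revenue": 1.2e6}
today = {"row_count": 4_100, "null_rate": 0.02, "mean_revenue": 1.19e6}

def drift_alerts(prev, curr, tolerance=0.3):
    """Flag metrics whose relative change exceeds the tolerance."""
    alerts = []
    for metric, old in prev.items():
        new = curr[metric]
        if old and abs(new - old) / abs(old) > tolerance:
            alerts.append(metric)
    return alerts

# The row count dropped by roughly 59%, which suggests a collection
# issue upstream (e.g., a source partially shutting down).
print(drift_alerts(yesterday, today))
```

Real monitoring would use statistical tests and seasonality-aware baselines rather than a fixed tolerance, but the principle is the same: with external data you can only observe the output, so the statistics are your early-warning system.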
What are some emerging data quality challenges?
In my perspective, the biggest challenges to data quality standards are:
- Finding a way to continuously monitor data correctness in production and not over static benchmarks.
- Efficient monitoring of data deformations that are created in different processing steps, because of edge cases, heuristics, and assumptions.
- Understanding, once a data issue is uncovered, in which step of the data lineage it was introduced.
- Surfacing data issues from the estimates of production machine learning models, which is a huge challenge because the predictions reflect the statistical properties of the whole dataset.
- In many cases, needing specific domain expertise to detect issues that non-expert analysts won't spot.
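The lineage-localization challenge above can be approached by recording a few quality statistics after each processing step, so a violation points at the first step that introduced it. A minimal sketch with invented step names and thresholds:

```python
# Hypothetical per-step statistics captured as data flows through an ETL.
pipeline_stats = [
    ("raw_ingest",  {"rows": 1000, "null_rate": 0.01}),
    ("dedupe",      {"rows": 950,  "null_rate": 0.01}),
    ("geo_enrich",  {"rows": 950,  "null_rate": 0.40}),  # issue introduced here
    ("final_serve", {"rows": 950,  "null_rate": 0.40}),
]

def first_bad_step(stats, max_null_rate=0.05):
    """Return the first pipeline step whose output violates a quality assertion."""
    for step, s in stats:
        if s["null_rate"] > max_null_rate:
            return step
    return None

print(first_bad_step(pipeline_stats))  # 'geo_enrich'
```

Commercial observability tools generalize this idea with column-level lineage and automatic anomaly thresholds, but the core mechanism is the same checkpoint-and-compare loop.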
When purchasing external data from vendors, how can organizations ensure that the vendors are providing high quality data?
This is not easy to do; one needs to establish a solid quality analysis process. What we do at Explorium is:
- Ask data providers for the specific metrics that we’ve discussed and about the way they are evaluated.
- Understand the providers’ collection processes and the potential flaws they have.
- Compare several vendors against a ground-truth dataset that we have validated and trust. Sanity checks on specific examples and known behaviors can also be very quick and helpful.
- Check different cohort proportions, splitting by age, type, location, and other demographics, and check whether these proportions correlate with known distributions. This can be as simple as checking the male/female proportion, or as advanced as validating Zipf's law for word counts in textual data or checking whether the number of connections in a social network follows a power-law distribution.
- When working with time series data, validate behavior during weekends, holidays, and national events.
- Test for correlations between attributes that correlate in the real world (e.g., company size vs. revenue).
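The Zipf's law check mentioned above can be sketched in a few lines: under Zipf's law, frequency is roughly proportional to 1/rank, so the log-log rank-frequency curve should be approximately linear with a negative slope. The toy corpus below is made up, and a real check would use a large corpus and a proper goodness-of-fit test:

```python
import math
from collections import Counter

# Toy word counts (a real check would use a large corpus).
text = "the quick the lazy the dog the fox a a"
counts = sorted(Counter(text.split()).values(), reverse=True)

# Least-squares slope of the log-log rank-frequency curve.
xs = [math.log(r) for r in range(1, len(counts) + 1)]
ys = [math.log(c) for c in counts]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)

# Natural-language word counts should yield a clearly negative slope,
# in the rough vicinity of -1.
print(round(slope, 2))
```

A flat or positive slope on data that should be Zipfian (word counts, city populations, social-network degrees) is a useful red flag that the collection process distorted the distribution.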
What are some best practices for data quality management and fixing data quality issues?
- Have a good and efficient ground-truth dataset creation process for building benchmarks to check correctness.
- Build a repository for assertions the data should follow. Transform domain knowledge into rules and attach to data types for an automatic validation pipeline. For example – Revenue can’t be negative, or business location (long, lat) can’t be in the ocean.
- Maintain models for identifying outliers and suspicious data for scaling your quality process.
- Keep a good habit of eyeballing your data and perform sanity checks.
- Track the health metrics of your most important assets daily, and have a data migration CI/CD process for cases where you update your ETLs.
- Deeply understand how your customers use the data and what they need to succeed.
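The assertion-repository practice above can be sketched as a small rule registry applied automatically to incoming records. The field names and rules are illustrative; note that the range checks stand in for richer domain rules (e.g., a true "not in the ocean" check would need a geographic lookup):

```python
# Hypothetical rule repository: domain knowledge encoded as assertions,
# keyed by field name and applied automatically in a validation pipeline.
RULES = {
    "revenue": lambda v: v >= 0,          # revenue can't be negative
    "lat": lambda v: -90 <= v <= 90,      # valid latitude range
    "lon": lambda v: -180 <= v <= 180,    # valid longitude range
}

def validate(record):
    """Return the names of fields in the record that violate a rule."""
    return [f for f, rule in RULES.items() if f in record and not rule(record[f])]

bad = validate({"revenue": -5_000, "lat": 40.7, "lon": -74.0})
print(bad)  # ['revenue']
```

Keeping rules in one shared repository means new datasets get validated against accumulated domain knowledge for free, instead of each team rediscovering the same checks.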
These are some of the best practices that Explorium uses to ensure data consistency and to make sure customers get the most reliable, high-quality external data.
What are some technology solutions that can help with data quality management? Which problems, specifically, can they correct?
This is a hot vertical, and many startups provide data quality tools and solutions. Data integration and observability tools focus on making analysts and data scientists aware of issues with their ETLs before those issues reach customers or data products. Many of these companies integrate with on-call and incident management systems, and add a layer of monitoring tools customized for data issues. Their main value is allowing the organization to be more proactive about its data incidents and to shorten time-to-detection. Another great value they highlight is localizing incidents within the data lineage. They do this with automatic data migration reports, ETL regression testing, column-level lineage, anomaly thresholding, and smart alerting.
Many organizations today are leveraging their internal data and enriching it with external data to solve some of their most complex business problems. With the amount of big data available today, and the growing number of data sources and providers, the value lies in quality over quantity.
Try the Explorium External Data Platform for free today and get access to high-quality data, meeting regulatory compliance standards, to feed your analytics, business intelligence, and predictive machine learning models.
To learn more about data quality, check out these additional resources:
Ihab F. Ilyas – Data Cleaning