Data & ML Tech Lead

Data standardization lets datasets and users speak the same language

“Data standardization” means different things in different branches of the machine learning and data engineering world. We define data standardization as the process of transforming different representations of the same data into a single representation. 

For instance, imagine a customer’s dataset about various companies that includes the country each company is located in. If standardization practices were insufficient when the customer collected the information, a single column can easily contain many different values that all represent the same country.

Let’s pick an example country. Any or all of the following options could represent the United States in a single customer’s dataset: 

  • the United States of America
  • America
  • U.S.A.
  • USA
  • U.S.
  • US
  • the US
  • the USA
  • the United States
  • the States
  • United States

And this list covers only common English variants. If we also take other languages into account, there are many more ways of representing the same country, such as אמריקה (Hebrew) or Америка (Russian).

Our goal is to convert all these representations into a single representation, ideally a well-known standard such as the ISO 3166-1 alpha-2 two-letter country codes.
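As a rough illustration of that goal, the sketch below maps several of the spellings listed above to one code. The alias table and both function names are hypothetical and are not Explorium's implementation; a real table would cover every country and many more spellings.

```python
from typing import Optional

# Hypothetical alias table: every key is a normalized spelling, every value
# is the ISO 3166-1 alpha-2 code we want to end up with.
COUNTRY_ALIASES = {
    "united states of america": "US",
    "united states": "US",
    "america": "US",
    "states": "US",
    "usa": "US",
    "us": "US",
    "אמריקה": "US",
    "америка": "US",
}


def normalize(text: str) -> str:
    """Lowercase, drop periods, collapse whitespace, and strip a leading 'the'."""
    cleaned = " ".join(text.replace(".", "").lower().split())
    if cleaned.startswith("the "):
        cleaned = cleaned[len("the "):]
    return cleaned


def to_country_code(text: str) -> Optional[str]:
    """Return the alpha-2 code for a known spelling, or None if it is unknown."""
    return COUNTRY_ALIASES.get(normalize(text))
```

With this table, `to_country_code("U.S.A.")`, `to_country_code("the States")` and `to_country_code("Америка")` all resolve to the same code, while an unknown spelling yields `None` instead of a guess.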

Before diving into the details of how we build an infrastructure to do that, let’s understand why data standardization is so important.

Data standardization is essential for data enrichment

Even as the number of datasets in the world keeps growing, there is still not enough emphasis on data standardization during the data collection phase. At Explorium, our goal is to enrich a column (or many columns) of data with relevant information from external data sources.

It might be helpful for a company to know the population of a country. This is a simple enrichment: the user sends the country object to the dataset containing population, which returns the population of the specific country. But if the country object and the population object were created by different people using different nomenclature practices, the two data structures will not recognize each other. If they can’t recognize data in slightly different forms – forms that are understandable and interchangeable to a human reader – the system has no way of enriching this information.
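The mismatch can be sketched with a plain dictionary lookup. Everything here is hypothetical: the population figures are placeholders rather than real statistics, and `TO_ALPHA2` stands in for whatever standardization step runs before the enrichment.

```python
# Hypothetical external dataset keyed by ISO 3166-1 alpha-2 code.
# The numbers are placeholders, not real population statistics.
POPULATION_BY_COUNTRY = {"US": 1_000, "IL": 200}

# Hypothetical customer rows that spell the same country in two ways.
customer_rows = [
    {"company": "Acme", "country": "USA"},
    {"company": "Globex", "country": "US"},
]


def enrich(rows, key):
    """Attach a population to each row, or None when the lookup key is unknown."""
    return [{**row, "population": POPULATION_BY_COUNTRY.get(key(row))} for row in rows]


# Joining on the raw value: "USA" is not a key in the external dataset,
# so the first row cannot be enriched.
raw = enrich(customer_rows, key=lambda row: row["country"])

# Joining on a standardized value: both spellings resolve to the same key.
TO_ALPHA2 = {"USA": "US", "US": "US"}
standardized = enrich(customer_rows, key=lambda row: TO_ALPHA2.get(row["country"]))
```

In `raw`, the Acme row ends up with a population of `None`; in `standardized`, both rows receive the same value.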

The customer data probably contains even more challenging values, such as postal addresses, that require us to understand the exact location these postal addresses refer to.

This only scratches the surface of why Explorium has made it a priority to implement tools to standardize these values into one single representation. Explorium is standardizing millions of properties per day, containing information about different types of entities in the world. But how do we achieve that?

Establishing common data types that can span datasets

The solution requires one major feature: a strong typing model. As the amount of data grew inside Explorium, we recognized the need to keep a common representation among the datasets inside the company. To achieve that, we first took the simple types we have in Python such as float, int and str and expanded them to include more validations.

We called this module DataTypes, and we gradually pushed more and more validation to these data types.

Continuing with our previous example, the data type that represents a country is DataTypes.CountryCodeAlpha2, whose value is an ISO 3166-1 alpha-2 two-letter country code. Any value in a column marked with this data type is strictly validated. We can only use US, not USA or U.S.; similarly, we can use IL but not ISR.

Imagine creating data types for each identifier of an entity. DataTypes.Email will validate the format of an email address, while DataTypes.Month validates that the month number is between 1 and 12.
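One way to picture such types is as subclasses of Python's built-in types that validate on construction. The classes below are hypothetical stand-ins for DataTypes.Month and DataTypes.Email, not the actual module, and the email pattern is deliberately simplistic.

```python
import re


class Month(int):
    """Hypothetical stand-in for DataTypes.Month: an int restricted to 1..12."""

    def __new__(cls, value):
        month = super().__new__(cls, value)
        if not 1 <= month <= 12:
            raise ValueError(f"month must be between 1 and 12, got {value!r}")
        return month


class Email(str):
    """Hypothetical stand-in for DataTypes.Email: a str with a format check."""

    # Simple illustrative pattern; production email validation is stricter.
    _PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

    def __new__(cls, value):
        if not cls._PATTERN.fullmatch(value):
            raise ValueError(f"not a valid email address: {value!r}")
        return super().__new__(cls, value)
```

Because the check runs in the constructor, an invalid value such as `Month(13)` raises a `ValueError` immediately, and any value that does exist still behaves like a regular `int` or `str`.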

We built the infrastructure so that each data type can inherit from any other data type and automatically carries all the validations defined in its parent.
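A minimal sketch of inherited validation, assuming each type exposes a `validate` method that calls its parent first. The class names and the specific checks are illustrative only; in particular, this `CountryCodeAlpha2` checks just the shape of the value, not membership in the ISO list.

```python
class Text(str):
    """Hypothetical base type: any non-blank string."""

    def __new__(cls, value):
        instance = super().__new__(cls, value)
        instance.validate()
        return instance

    def validate(self) -> None:
        if not self.strip():
            raise ValueError("value must not be blank")


class UpperCaseText(Text):
    """Adds an upper-case check on top of the parent's checks."""

    def validate(self) -> None:
        super().validate()  # run every validation inherited from Text
        if self != self.upper():
            raise ValueError(f"value must be upper case: {self!r}")


class CountryCodeAlpha2(UpperCaseText):
    """Adds a two-letter check on top of everything above."""

    def validate(self) -> None:
        super().validate()  # non-blank and upper case are checked first
        if len(self) != 2 or not self.isalpha():
            raise ValueError(f"expected a two-letter code: {self!r}")
```

Here `CountryCodeAlpha2("US")` passes all three layers, while `"USA"`, `"us"` and a blank string are each rejected by a different level of the hierarchy.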

Currently, Explorium has 95 different data types, which we use to validate all the datasets inside Explorium. These types keep everyone inside Explorium on the same page, but our customers’ data does not arrive in that form, so we needed another piece of infrastructure to convert customer input into our data types. To do that, we started creating what we call TypeTransformers.

Type transformers are the engine of Explorium’s data standardization

In Explorium’s context, a transformer is a class with a function that receives and returns Explorium DataTypes. Type transformers are special transformers that take fuzzy data such as U.S., USA or United States of America and convert it into a specific representation such as US, an acceptable value in a DataTypes.CountryCodeAlpha2 column.

Users can request a type transformer for any data type in Explorium. For example, if you have some fuzzy date information, you can request a transformer that turns the value 01/01/2022 into an ISO 8601 date such as 2022-01-01. That would be called a type transformer for DataTypes.ISODate.
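The date case can be sketched with the standard library alone. The function name and the list of candidate formats are assumptions made for illustration; note that a value like 01/02/2022 is ambiguous between day-first and month-first conventions, and this sketch simply takes the first format that parses.

```python
from datetime import datetime
from typing import Optional, Sequence

# Assumed candidate formats, tried in order. Month-first is preferred for
# slash-separated dates, which is a choice made for this sketch only.
DEFAULT_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def to_iso_date(text: str, formats: Sequence[str] = DEFAULT_FORMATS) -> Optional[str]:
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match; try the next one
    return None
```

Returning `None` for unparseable input mirrors the `Optional` outputs used by the transformers in this article, so a bad value is flagged rather than silently converted.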

With that in mind, let’s see a detailed example of our country type transformer. 

from typing import Any, Optional, Tuple


class TextToCountryCodeAlpha2(Transformer):
    """Converts any text to two-letter country code."""

    inputs = (Optional[DataTypes.StandardizedText],)
    outputs = (Optional[DataTypes.CountryCodeAlpha2],)

    def __transform__(self, ctx: TransformerContext) -> Tuple[Any, ...]:
        (country,) = ctx.values

        if len(country) == 2 and country in self._alpha_2_countries:
            # Already a known two-letter code: pass it through unchanged.
            country_code_alpha2 = country

        elif len(country) == 3 and country in self._alpha_3_countries:
            # A known three-letter code: delegate to the alpha-3 to alpha-2 transformer.
            (country_code_alpha2,) = CountryCodeAlpha3ToCountryCodeAlpha2().transform((country,))

        else:
            # Anything else: fall back to fuzzy matching on the country name.
            country_code_alpha2 = self._fuzzy_search_country(country)

        return (country_code_alpha2,)

As the example above shows, we first check whether the input is a two-letter or three-letter country code. If it is neither, we fall back to a fuzzy search that uses a regular expression built for each country in the world. There is still room to extend this to more languages, but it covers almost all the cases we currently receive.
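The regular-expression fallback could look roughly like the sketch below. The pattern table and the function name are hypothetical and cover only three countries; the actual patterns behind `_fuzzy_search_country` are not shown in this article.

```python
import re
from typing import Optional

# Hypothetical patterns; a real table would hold one entry per country.
COUNTRY_PATTERNS = {
    "US": re.compile(r"(the\s+)?(united\s+states(\s+of\s+america)?|america|states)"),
    "GB": re.compile(
        r"(the\s+)?(united\s+kingdom|great\s+britain|england|scotland|wales|northern\s+ireland)"
    ),
    "CR": re.compile(r"costa\s*rica"),
}


def fuzzy_search_country(text: str) -> Optional[str]:
    """Return the first country code whose pattern matches the whole cleaned text."""
    cleaned = " ".join(text.lower().split())  # lowercase and collapse whitespace
    for code, pattern in COUNTRY_PATTERNS.items():
        if pattern.fullmatch(cleaned):
            return code
    return None
```

Using `fullmatch` rather than `search` keeps the patterns from firing on longer strings that merely contain a country name, at the cost of needing explicit alternatives for each accepted spelling.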

Here is an example test to demonstrate some country values that now standardize automatically.

import pytest


@pytest.mark.parametrize(
    "value, expected",
    [
        ("il", "il"),
        ("IL", "il"),
        ("IsRael", "il"),
        ("USA", "us"),
        ("People's Republic of China  ", "cn"),
        ("Taiwan", "tw"),
        ("China, Hong Kong SAR", "hk"),
        ("China, Macao Special Administrative Region", "mo"),
        ("Côte d'Ivoire", "ci"),
        ("united kingdom", "gb"),
        ("great britain", "gb"),
        ("northern ireland", "gb"),
        ("wales", "gb"),
        ("england", "gb"),
        ("scotland", "gb"),
        ("uk", "gb"),
        ("China", "cn"),
        ("costarica", "cr"),
    ],
)
def test_country_code_alpha2_type_transformer(value: Any, expected: Optional[str]):
    # T is the type transformer under test; its definition is not shown here.
    assert T().transform_unary(value, locale_code="en") == expected

The infrastructure we built around typing and transformation allows us to reduce the complexity of dealing with different types of representations of the same data, and helps us work in the “same language” among all the teams and datasets at Explorium. This reduces the time to ingest new datasets and release new enrichments.

We have a lot more to say about automated data standardization. Stay tuned for the follow-up article, where we build graph edges between the data types so we can convert a value from one data type to another.