In the business of external data enrichment for data science, the main focus is on providing a fast and scalable way to aggregate, join, and match large datasets received from data providers with the customer’s internal data. Enriched datasets yield more features, which customer data science teams can leverage for higher AUC and better predictive power.
This performance comparison arose from a very specific engineering problem: enrichment with foot traffic data collected from mobile devices over a span of half a year. However, because this dataset is both large (4.5B records) and uniform (20–25M records per day), it serves as a generic real-world example for performance testing without adding extra complexity and variables such as data skew or complex relations. The data itself is the output of a Spark ETL job and consists of parquet files residing on Amazon S3, partitioned by day with one file per partition (~250MB each). A Delta Lake table is created as a metadata layer on top of these files to enable tabular access via the AWS Glue catalog.
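To make the layout concrete, here is a minimal sketch of how the one-file-per-day partitions could be enumerated. The bucket name, prefix, and file name below are illustrative assumptions, not the actual key scheme:

```python
from datetime import date, timedelta

def partition_paths(bucket: str, start: date, end: date):
    """Yield one S3 partition path per day, assuming a hypothetical
    Hive-style layout: .../local_day=YYYY-MM-DD/part-0.parquet."""
    d = start
    while d <= end:
        yield (f"s3://{bucket}/foottraffic_daily_visitation/"
               f"local_day={d.isoformat()}/part-0.parquet")
        d += timedelta(days=1)

# One week of daily partitions (half a year would be ~180 such files).
paths = list(partition_paths("example-bucket", date(2021, 1, 1), date(2021, 1, 7)))
```

With this layout, a query filtered on `local_day` only touches the matching files, which is what makes day-partitioning effective for the enrichment join.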
Given a user dataset with a date column in the yyyy-MM-dd format and coordinates in the form of an H3 geo-hash, enrich the user’s data with records from the 4.5B-row dataset with schema (local_day, h310), where local_day is a date in the same format as the user’s data, and h310 is Uber’s H3 geospatial index at resolution 10, easily calculated from decimal coordinates.
The user’s inquiry arrives as a dataset received via an API call, which can range from 1 to 1M rows; we will refer to it as query_table. Our base 4.5B-row dataset contains unique h310 values per day across half a year of data; we will refer to it as foottraffic_daily_visitation.
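The enrichment itself is an equi-join on the composite key (local_day, h310). As a plain-Python sketch of the join semantics only (the column names follow the schemas above; the sample H3 cells and visit counts are made up, and in the real pipeline h310 would be computed from coordinates with the h3 library rather than supplied directly):

```python
from collections import defaultdict

# Toy stand-ins for the two tables: the join key (local_day, h310)
# plus an illustrative payload column on the provider side.
foottraffic_daily_visitation = [
    {"local_day": "2021-01-01", "h310": "8a2a1072b59ffff", "visits": 12},
    {"local_day": "2021-01-01", "h310": "8a2a1072b5b7fff", "visits": 3},
    {"local_day": "2021-01-02", "h310": "8a2a1072b59ffff", "visits": 7},
]
query_table = [
    {"local_day": "2021-01-01", "h310": "8a2a1072b59ffff"},
    {"local_day": "2021-01-03", "h310": "8a2a1072b59ffff"},  # no match
]

# Build a hash index on the join key, then probe it per query row --
# the same hash-join shape a query engine uses, minus distribution.
index = defaultdict(list)
for row in foottraffic_daily_visitation:
    index[(row["local_day"], row["h310"])].append(row)

enriched = [
    {**q, "visits": match["visits"]}
    for q in query_table
    for match in index.get((q["local_day"], q["h310"]), [])
]
```

Since h310 values are unique per day in the base dataset, each query row matches at most one provider row, so this inner join never fans out.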
To read more about external data enrichment visit the Explorium.ai channel on Medium.