Benchmarking SQL engines for Data Serving: PrestoDb, Trino, and Redshift

maya.malevich

In external data enrichment for data science, the core requirement is a fast, scalable way to aggregate, join, and match large datasets received from data providers with the customer’s internal data. Enriched datasets yield more features for customer data science teams to use, which can result in higher AUC and better predictive power.

This performance comparison arose from a very specific engineering problem: enrichment with foot traffic data collected from mobile devices over a span of half a year. However, because this dataset is large in volume (4.5B records) and uniform (20–25M records per day), it provides a generic real-world example for performance testing without adding extra complexity and variables such as data skew or complex relations. The data itself is the output of a Spark ETL job and consists of parquet files residing on Amazon S3, partitioned by day with one file per partition (~250MB per file). Delta Lake table metadata is created to enable tabular access via the AWS Glue catalog.
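As a rough sanity check, the figures quoted for the dataset can be multiplied out. This is only a back-of-envelope sketch: the count of 182 daily partitions is an assumption standing in for "half a year", and the size estimate treats 1 GB as 1000 MB.

```python
# Back-of-envelope check of the dataset figures quoted in the text.
# Assumption: "half a year" is taken as 182 daily partitions.
DAYS = 182
ROWS_PER_DAY_LOW, ROWS_PER_DAY_HIGH = 20_000_000, 25_000_000
FILE_SIZE_MB = 250  # one parquet file per daily partition

total_rows_low = DAYS * ROWS_PER_DAY_LOW
total_rows_high = DAYS * ROWS_PER_DAY_HIGH
total_size_gb = DAYS * FILE_SIZE_MB / 1000  # treating 1 GB as 1000 MB

print(f"rows: {total_rows_low / 1e9:.2f}B to {total_rows_high / 1e9:.2f}B")
print(f"size on S3: ~{total_size_gb:.1f} GB")
```

Under these assumptions the upper end of the row range lands close to the stated 4.5B records, and the compressed parquet footprint comes out at a few tens of gigabytes.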

The Objective

Given a user dataset with a date column in the format YYYY%MM%%dd and coordinates expressed as an H3 index, enrich it with records from the 4.5B-record dataset with schema (local_day, h310), where local_day is the date in the same format as the user’s data and h310 is Uber’s H3 geospatial index, which is easily calculated from decimal coordinates.

The user’s request arrives via an API call as a dataset of anywhere from 1 to 1M rows, referred to here as query_table. Our base 4.5B-record dataset, referred to as foottraffic_daily_visitation, covers half a year of data and contains unique H3 indexes per day.
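The enrichment described here is, in essence, a join of query_table against foottraffic_daily_visitation on the pair (local_day, h310). The sketch below illustrates the shape of that query. It is not the benchmarked setup: in-memory SQLite stands in for the real engines, the visits column is hypothetical (the text names only the two key columns), and the day and cell values are made-up placeholders rather than real dates or H3 indexes.

```python
import sqlite3

# SQLite stands in for the real engine here; the table and key column names
# come from the text, while the `visits` column and all values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foottraffic_daily_visitation (local_day TEXT, h310 TEXT, visits INTEGER);
    CREATE TABLE query_table (local_day TEXT, h310 TEXT);
""")

conn.executemany(
    "INSERT INTO foottraffic_daily_visitation VALUES (?, ?, ?)",
    [
        ("day_1", "cell_a", 12),
        ("day_1", "cell_b", 7),
        ("day_2", "cell_a", 30),
    ],
)
conn.executemany(
    "INSERT INTO query_table VALUES (?, ?)",
    [("day_1", "cell_a"), ("day_2", "cell_a"), ("day_2", "cell_c")],
)

# LEFT JOIN keeps every user row; rows without a match get NULL (None) for visits.
enriched = conn.execute("""
    SELECT q.local_day, q.h310, f.visits
    FROM query_table AS q
    LEFT JOIN foottraffic_daily_visitation AS f
      ON f.local_day = q.local_day AND f.h310 = q.h310
    ORDER BY q.local_day, q.h310
""").fetchall()

print(enriched)
```

A LEFT JOIN is used so that user rows with no matching foot traffic record are preserved. Whether unmatched rows should be kept or dropped is a design choice that the text does not specify. Because the base table has unique H3 indexes per day, each user row matches at most one base row, so the join does not multiply rows.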

To read more, visit the Explorium.ai channel on Medium.

The post Benchmarking SQL engines for Data Serving: PrestoDb, Trino, and Redshift appeared first on Explorium.
