    Data enrichment is a crucial step in the modeling process that data scientists tend to overlook because finding and using external sources is hard. Before moving on to feature engineering, each modeler should ask themselves: “Is this all the relevant data? Are there additional sources, perhaps outside the organization, that could generate impactful features?”

    The first step is identifying “enrichable” columns that can be used to connect additional datasets. Examples of “enrichable” columns are URLs, zip codes, company names, and emails. They are deemed “enrichable” because they carry global context: they are public types of information that can be looked up on a search engine. After finding those columns, the hunt begins for sources that can connect to and enrich your current data with additional attributes.
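
    There is no single automatic way to detect such columns, but simple heuristics go a long way. Below is a minimal sketch (the flag_enrichable_columns helper, its regex patterns, and the 0.8 threshold are all hypothetical, not part of any library) that flags text columns whose values mostly look like emails, URLs, or US zip codes:

    import pandas as pd

    # Toy regex heuristics for a few common "enrichable" column types.
    # The patterns are illustrative only, not production-grade validators.
    PATTERNS = {
        'email': r'^[^@\s]+@[^@\s]+\.[^@\s]+$',
        'url': r'^https?://\S+$',
        'zip_code': r'^\d{5}(-\d{4})?$',
    }

    def flag_enrichable_columns(df, sample_size=100, threshold=0.8):
        """Return {column: guessed type} for text columns whose sampled values mostly match a pattern."""
        flagged = {}
        for col in df.select_dtypes(include='object').columns:
            sample = df[col].dropna().astype(str).str.lower().head(sample_size)
            if sample.empty:
                continue
            for type_name, pattern in PATTERNS.items():
                if sample.str.match(pattern).mean() >= threshold:
                    flagged[col] = type_name
                    break
        return flagged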

    As data scientists, we all know that the hardest part of our job is finding the right data to feed our models. Have you ever ruled out data due to low impact even though you had a feeling there was some hidden value? It turns out that data integration metrics such as coverage and support can go a long way toward improving clarity and promoting understanding.

    What are support and coverage?

    Support is defined as the percentage of keys found in the secondary dataset, out of all unique keys in the main dataset. Coverage is defined as the percentage of rows with a matching key in the main dataset, out of all rows in the main dataset.

    Main table (key column “State”): A, A, B, B, C
    Secondary table (key column “State”): A, B

    In the example above, “State” is the key column by which we join the secondary table to the main table. The main table contains five rows and three unique keys (A, B, C). The secondary table contains two of the three keys (A, B), so the support of the secondary table is 66.6%. The keys contained in the secondary table are present in four out of five rows of the main table, so the coverage of the secondary table is 80%.

    If, for example, the secondary table looked like this:

    Secondary table (key column “State”): A, C

    Then the support would stay the same (still two keys out of three) but the coverage would drop to 60% (three out of five).

    Here’s a code snippet doing the exact same calculations:

    import pandas as pd

    def calculate_support(main, secondary, key_column):
        # Share of the main table's unique keys that also appear in the secondary table.
        main_key_values = main.loc[:, key_column]
        n_keys_main = main_key_values.nunique()

        secondary_reduced_by_main = secondary.loc[secondary[key_column].isin(main_key_values), :]
        n_keys_from_main_in_secondary = secondary_reduced_by_main.loc[:, key_column].nunique()
        return round(n_keys_from_main_in_secondary / n_keys_main, 2)

    def calculate_coverage(main, secondary, key_column):
        # Share of the main table's rows whose key has a match in the secondary table.
        n_main_rows = main.shape[0]
        n_rows_with_secondary_match = main[key_column].isin(secondary[key_column]).sum()
        return round(n_rows_with_secondary_match / n_main_rows, 2)

    def assess_support_and_coverage(main, secondary, key_column):
        print(f"Support - {calculate_support(main, secondary, key_column)}")
        print(f"Coverage - {calculate_coverage(main, secondary, key_column)}")

    main = pd.DataFrame({'id': list('AABBC')})
    secondary = pd.DataFrame({'id': list('AB')})
    key_column = 'id'
    assess_support_and_coverage(main, secondary, key_column)
    print('\n')
    secondary = pd.DataFrame({'id': list('AC')})
    assess_support_and_coverage(main, secondary, key_column)
    

    OUTPUT:
    Support - 0.67

    Coverage - 0.8

    Support - 0.67

    Coverage - 0.6

    What can we learn from support and coverage?

    First and foremost, support helps in understanding how relevant a secondary dataset is to our main dataset and how much data can be added from the secondary set to the main set. When coverage is similar to support there isn’t much more to learn, so let’s examine the two cases where coverage and support diverge and focus on how the features from the secondary table would be affected. There are several ways to assess a feature’s contribution to a model; in the examples below I chose to use the ANOVA f-value as the evaluation method.

    High support, low coverage 

    High support and low coverage means that a majority of the keys were matched, but those keys represent only a small portion of the main dataset’s rows.

    • Features with low evaluation scores might still have high potential even though they show a relatively low impact on the model. This can happen when the unmatched observations rely on a naive method of filling null values; since most of the keys are already matched, better imputation of the feature’s missing values could prove pivotal (see the sketch after this list).
    • Features with high evaluation scores may shed light on the keys that were not matched. For example, the mere fact that a key has no match in the secondary dataset can carry more signal than the metrics the secondary table contains.
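
    Here is a minimal sketch of both ideas, reusing the main / secondary / key_column names from the snippet above; the secondary table’s 'value' column and the median imputation are assumptions made purely for illustration:

    # Left-join the enrichment so unmatched rows are kept, with NaNs in the new columns.
    enriched = main.merge(secondary, on=key_column, how='left')

    # Sometimes "was this key matched at all?" carries signal on its own.
    enriched['has_match'] = enriched['value'].notna().astype(int)

    # Naive imputation for the unmatched rows; with high support this touches only a
    # small share of keys, but a smarter strategy here can still move the needle.
    enriched['value'] = enriched['value'].fillna(enriched['value'].median())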

    Low support, high coverage 

    Low support and high coverage means that a minority of the keys were matched, but those keys represent a large portion of the main dataset’s rows.

    • Features with low evaluation scores might be misleading, since they are evaluated only on the keys that were matched. A strong correlation might still exist in the full data but simply not show up in this specific set of training data.
    • Features with high evaluation scores, albeit rare when support is low, could once again shed light on the keys that were not found.

    Let’s dive into a specific machine learning case where we can see the metrics with some context.

    Coverage and support in action

    Let’s examine a dataset from Kaggle with Airbnb data from Berlin. Our goal is to predict the price of a room. This, for example, can be used by Airbnb to educate their hosts about their room rates or understand the potential of a new location. When using this data to predict the price of a room, our dataset looks something like this:

    name | host_name | neighbourhood | price
    Riverfront Panorama | Trent | Alexanderplatz | 55
    Berlin City Apartment 1 with WiFi | Ilona | Alexanderplatz | 96
    Cozy Berlin nest with great access | Donna | Friedenau | 30
    1 Room Apartment, Charlottenburg | Michael | Westend | 30

    Let’s say we want to use data on neighborhood crime rates, using the “neighbourhood” column as the key. The data looks like this:

    neighbourhood | CrimeRate | Violence | JuvenileArrests
    Alexanderplatz | 45.3 | 11.8 | 27.4
    Friedenau | 99.5 | 13.9 | 198

     

    Out of the three keys in the main table (Alexanderplatz, Friedenau, Westend), two are present in the secondary table (Alexanderplatz, Friedenau), meaning the support of the neighborhood crime table relative to the Airbnb data is 66% (2 out of 3). The matched keys appear in three of the main table’s four rows, so the coverage in this example is 75% (3 out of 4).
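
    Reusing the helper functions from earlier on a toy reconstruction of these two tables (just the key columns, nothing more) reproduces those numbers:

    airbnb_main = pd.DataFrame({'neighbourhood': [
        'Alexanderplatz', 'Alexanderplatz', 'Friedenau', 'Westend']})
    crime_secondary = pd.DataFrame({'neighbourhood': ['Alexanderplatz', 'Friedenau']})

    assess_support_and_coverage(airbnb_main, crime_secondary, 'neighbourhood')
    # Support - 0.67
    # Coverage - 0.75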

    Since the table we used as the main dataset contains our label column, this makes it what we at Explorium like to call a “core dataset.” It serves as our anchor throughout the entire process, representing the number of observations in our final training set. 

    Let’s now use the full core dataset and the enrichment to train a basic model and see how we fare. The regression metric used for evaluation is R squared, which is not ideal for making business decisions but is good for getting a grasp of the overall model performance. I’ll skip feature extraction for now so we can stay on point and go straight to an initial model evaluation (with no enrichments):

    from explorium_sdk.features import search_for_features
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    features = search_for_features(listings_df, label_column='price',
                                   augment_with_external_datasets=False)

    labels = features.loc[:, 'price']
    features = features.drop('price', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=42)

    classifier = LinearRegression()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)

    print(f"R2 score: {round(r2_score(y_true=y_test, y_pred=predictions), 2)}")
    

    OUTPUT:

    R2 score: 48.46

    Not exactly the most accurate model.

    I could spend more time tuning my model even further, but only marginal progress can be achieved with the current features. My next move would be to search for more data based on the “enrichable” columns in the core dataset. However, searching for good data based on specific values can become quite tedious.

    The difficulties of data enrichment can create quite a puzzle. It’s pretty hard to know what source of data can save the situation here — data about the location, information regarding the host, maybe something regarding the attributes of the room. On top of that, I need to think about what format I’d be able to consume it in — an API, a static data source, or maybe a scraper. 

    Any way you look at it, it’s going to take a good share of time, skill, and resources. 

    Let’s assume we’ve completed the process of acquiring an external data source and jump ahead to focus on the process of actually using that source for enrichment. In our data acquisition process, we found a dataset detailing crime metrics in various Berlin neighborhoods (please note: the data below is not real and is only being used here for the purposes of this example). 

    We can add the metrics to each observation, merging the datasets by the “neighbourhood” column, just like we did in the example above. Here’s another glimpse at the dataset structure:

    neighbourhood | CrimeRate | Violence | JuvenileArrests | Shootings | MotorVehicleTheft
    Alexanderplatz | 45.3 | 11.8 | 27.4 | 1.3 | 74.1
    Friedenau | 99.5 | 13.9 | 198 | 13.9 | 106.4

    Let’s look for a correlation between the newly found enrichment and the label column that could create an impact. We can do this by comparing each row in the enrichment to the average price seen in that neighborhood.

    As mentioned above I’ll use the ANOVA f-value (analysis of variance), a simple calculation that can shed some light on the potential lift. The f-value formula is as follows: 

    F = variation between sample means / variation within the samples 

    The values are returned on an open-ended scale (as opposed to a closed range like 0.0-1.0), so this check cannot determine anything definitively, but it is a good way to get started.
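
    The snippet below uses scikit-learn’s f_regression for this. For a single feature, its F-statistic can be expressed through the Pearson correlation r between the feature and the target: F = r^2 / (1 - r^2) * (n - 2), where n is the number of samples. Here is a quick sanity check of that relationship on made-up toy data:

    import numpy as np
    from sklearn.feature_selection import f_regression

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 3 * x + rng.normal(size=200)   # made-up, correlated toy data

    r = np.corrcoef(x, y)[0, 1]        # Pearson correlation between feature and target
    f_manual = r ** 2 / (1 - r ** 2) * (len(x) - 2)
    f_sklearn, _ = f_regression(x.reshape(-1, 1), y)

    print(round(f_manual, 2), round(f_sklearn[0], 2))   # the two values agree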

    from sklearn.feature_selection import f_regression

    # Average price per neighbourhood, merged onto the crime enrichment.
    neighbourhood_scores = listings_df.groupby('neighbourhood').price.mean().reset_index()
    crime_merged = berlin_crime.merge(neighbourhood_scores, how='inner')
    feature_columns = [col for col in crime_merged if col[0].isupper()]
    f_values, _ = f_regression(crime_merged[feature_columns], crime_merged.loc[:, 'price'])
    for feature, f_value in zip(feature_columns, f_values):
        print(f'{feature} - F-Value: {f_value}')
    

    OUTPUT:

    CrimeRate - F-Value: 279.85715977347007

    Violence - F-Value: 2.3092730794710357

    JuvenileArrests - F-Value: 47.8381415174938525

    Shootings - F-Value: 5.5654859032260875

    MotorVehicleTheft - F-Value: 15.177502487738597

    Alright, looking good. The feature “CrimeRate” seems to be highly correlated with our label values, and it is fair to expect a model to draw insight from it. But will it show lift? That’s hard to say before checking support and coverage, so let’s check, then re-train the model and look for lift.

    assess_support_and_coverage(listings_df, berlin_crime, 'neighbourhood')
    

    OUTPUT:

    Support - 0.83

    Coverage - 0.42

    Notice the high support and the low coverage. Let’s see how much we can get out of our data. I’ll now do a left (outer) join with the core dataset as the main set and the enrichment as the secondary set.

    listings_w_crime = listings_df.merge(berlin_crime[['neighbourhood', 'CrimeRate']], on='neighbourhood', how='left')  # LEFT (OUTER) MERGE: keep every core row
    features = search_for_features(listings_w_crime, label_column='price', augment_with_external_datasets=False)

    labels = features.loc[:, 'price']
    features = features.drop('price', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=42)

    classifier = LinearRegression()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)

    print(f"R2 score with Crime Dataset: {round(r2_score(y_true=y_test, y_pred=predictions), 2)}")
    

    OUTPUT:

    R2 score with Crime Dataset: 53.62

    Not good enough. Although the correlation between the label and the features is there and the support is high, the impact leaves a lot to be desired. This is where coverage becomes significant. We have already done the hard work of finding data correlated with our labels, but the impact is often only as valuable as the portion of data it covers. Once the value of an enrichment is established, I would rather focus on raising coverage by looking for the missing keys than go back to the drawing board and try to find something new.
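
    Finding the missing keys starts with simply listing them. A minimal sketch, using nothing more than set arithmetic on the key column:

    # Which neighbourhoods in the core dataset have no row in the crime enrichment?
    missing_keys = set(listings_df['neighbourhood']) - set(berlin_crime['neighbourhood'])
    print(f"{len(missing_keys)} unmatched neighbourhoods, e.g. {sorted(missing_keys)[:10]}")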

    Let’s replace the left join from above with an inner join, excluding labels with no match in the enrichment. This can help clarify whether this enrichment has any potential worth.

    listings_w_crime_inner = listings_df.merge(berlin_crime[['neighbourhood', 'CrimeRate']], on='neighbourhood', how='inner')  # INNER MERGE: keep only matched rows
    

    Note that since an inner join was used, the coverage automatically rises to 100%, because the join leaves only keys that are present in both datasets.
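
    We can confirm that with the same helper functions from earlier:

    # Sanity check on the inner-joined frame: every remaining row now has a match,
    # so coverage (and support) comes out as 1.0 by construction.
    assess_support_and_coverage(listings_w_crime_inner, berlin_crime, 'neighbourhood')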

    features = search_for_features(listings_w_crime_inner, label_column='price', augment_with_external_datasets=False)

    labels = features.loc[:, 'price']
    features = features.drop('price', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25, random_state=42)

    classifier = LinearRegression()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)

    print(f"R2 score with Crime Dataset (INNER MERGE): {round(r2_score(y_true=y_test, y_pred=predictions), 2)}")
    

    OUTPUT:

    R2 score with Crime Dataset (INNER MERGE):  74.61

    As expected, value exists. Now comes the interesting part of drawing conclusions which can help reshape the use case. 

    Of course, the inner join we performed means that the model is ignoring an entire section of observations. Using this model as-is is a serious error that will most likely cause very low performance on unseen data — something I definitely recommend not doing.

    Instead, here are a few things I recommend you do:

    • Take a closer look at the matching keys. Business-wise, there might be a subset of the use case that, based on the results above, is valuable enough to launch as a first step (see the sketch after this list).
    • Take a closer look at the non-matching keys. Maybe focusing the search could help in locating the missing enrichment data. 
    • Take a closer look at the process. There might be a data wrangling issue that is relatively easy to solve. There might also be value in using the current data as is while putting more effort into imputation. 
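
    For the first point, here is a minimal sketch of sizing the matched subset (column names follow the Airbnb example above; what counts as “valuable enough” is a business call, not something the code decides):

    # How big is the matched subset, and what does it look like per neighbourhood?
    matched_mask = listings_df['neighbourhood'].isin(berlin_crime['neighbourhood'])
    matched = listings_df[matched_mask]

    print(f"Matched listings: {matched_mask.mean():.0%} of all rows")
    print(matched.groupby('neighbourhood')['price'].agg(['count', 'mean']))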

    Conclusion

    Ruling out data due to low impact, even when you have a feeling there’s hidden value, doesn’t need to hold you back. Whether you are hunting for enrichments, assessing features, or at the final phase of deciding whether a model is ready to be deployed to production, support and coverage are worth keeping in mind, and sometimes they are the secret to a quick win.
