    If you’re dealing with data, you know that data quality is key to any successful project. Deduplication is one of the most important steps in ensuring data quality.

    In this blog post, I’ll show you how I used Explorium’s API to deduplicate company names in 7 lines of Python code. Explorium’s API returns a unique ID for each company, allowing us to easily identify and remove duplicate entries. Explorium offers you 500 enrichments a month for free.

    Let’s take a look at the code:

    import requests
    import pandas as pd
    
    def add_company_ids(df):
        # Send every company name to the enrichment endpoint in one request
        url = "https://app.explorium.ai/api/bundle/v1/enrich/explorium-s-company-identifiers"
        payload = [{'company': company} for company in df['company']]
        headers = {"API_KEY": "<YOUR_API_KEY>"}
        response = requests.post(url, json=payload, headers=headers).json()
        # One result per input row, in the same order as the payload
        df['Company ID'] = [res['Company ID'] for res in response]
        return df
    
    # Name variants that should resolve to three distinct companies
    companies = ["McDonald's",
                 "McDonald's Corporation",
                 "Bekshire Hathaway",
                 "Berkshire Hathaway Inc",
                 "Berkshire Hathaway US",
                 "Tesla",
                 "Tesla, Inc.",
                 "Tesla Motors"]
    
    companies_df = pd.DataFrame(companies, columns=['company'])
    companies_with_ids_df = add_company_ids(companies_df)
    # Keep the first row seen for each Company ID
    deduped_df = companies_with_ids_df.drop_duplicates(subset='Company ID')

    What the code does:

    1. Sends the company names to Explorium’s API
    2. Adds the Company ID returned for each name as a new column in the DataFrame
    3. Drops duplicate rows so that only one row per Company ID remains
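
    The steps above rely on the API returning exactly one result per input name, in the same order as the request; the code assumes this without checking it. As a minimal sketch, the helper below separates the column assignment from the HTTP call and verifies the result count before assigning. The name attach_company_ids is hypothetical and not part of Explorium’s API, and fake_response is a stand-in for the parsed JSON that reuses an ID from the tables in this post.

```python
import pandas as pd

def attach_company_ids(df, response):
    """Add a 'Company ID' column from an already-parsed API response.

    Assumption (same as the post's code): `response` is a list with one
    dict per input row, in the same order as df['company'].
    """
    if len(response) != len(df):
        raise ValueError(f"expected {len(df)} results, got {len(response)}")
    out = df.copy()  # leave the caller's DataFrame untouched
    # .get() yields None instead of raising KeyError when a result has no ID
    out["Company ID"] = [res.get("Company ID") for res in response]
    return out

# Stand-in for the parsed JSON response (hypothetical shape, ID from the post).
fake_response = [
    {"Company ID": "8fb0006901b36e7bc45562e7b70db5140cbc6027"},
    {"Company ID": "8fb0006901b36e7bc45562e7b70db5140cbc6027"},
]

names_df = pd.DataFrame(["McDonald's", "McDonald's Corporation"], columns=["company"])
with_ids_df = attach_company_ids(names_df, fake_response)
```

    Working on a copy also means the original DataFrame is not modified, so companies_df and companies_with_ids_df would stay distinct objects if the same approach were used in the main example.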

    Here is what the pandas DataFrames from the code look like:

    companies_df:
    
    |-------------------------|
    | company                 |
    |-------------------------|
    | McDonald's              |
    | McDonald's Corporation  |
    | Bekshire Hathaway       |
    | Berkshire Hathaway Inc  |
    | Berkshire Hathaway US   |
    | Tesla                   |
    | Tesla, Inc.             |
    | Tesla Motors            |
    |-------------------------|
    
    companies_with_ids_df:
    
    |---------------------------------------------------------------------|
    | company                 | Company ID                                |
    |---------------------------------------------------------------------|
    | McDonald's              | 8fb0006901b36e7bc45562e7b70db5140cbc6027  |
    | McDonald's Corporation  | 8fb0006901b36e7bc45562e7b70db5140cbc6027  |
    | Bekshire Hathaway       | 7285f94bf668f60d7bf1a10904618a896ecd211f  |
    | Berkshire Hathaway Inc  | 7285f94bf668f60d7bf1a10904618a896ecd211f  |
    | Berkshire Hathaway US   | 7285f94bf668f60d7bf1a10904618a896ecd211f  |
    | Tesla                   | b330d5663e6451d36c25c152fb523c4820466c50  |
    | Tesla, Inc.             | b330d5663e6451d36c25c152fb523c4820466c50  |
    | Tesla Motors            | b330d5663e6451d36c25c152fb523c4820466c50  |
    |---------------------------------------------------------------------|
    
    deduped_df:
    
    |-------------------------------------------------------------------|
    | company              | Company ID                                 |
    |-------------------------------------------------------------------|
    | McDonald's           | 8fb0006901b36e7bc45562e7b70db5140cbc6027   |
    | Bekshire Hathaway    | 7285f94bf668f60d7bf1a10904618a896ecd211f   |
    | Tesla                | b330d5663e6451d36c25c152fb523c4820466c50   |
    |-------------------------------------------------------------------|
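
    Note that drop_duplicates keeps the first row for each Company ID by default (keep='first'), which is why the misspelled "Bekshire Hathaway" is the spelling that survives in deduped_df. If you care which spelling is kept, one option is to collect all the variants per ID and pick one yourself. The sketch below does that with the data from the tables above; the "longest name wins" rule is only a placeholder assumption, not something Explorium prescribes.

```python
import pandas as pd

# Rows copied from the companies_with_ids_df table in this post.
rows = [
    ("McDonald's", "8fb0006901b36e7bc45562e7b70db5140cbc6027"),
    ("McDonald's Corporation", "8fb0006901b36e7bc45562e7b70db5140cbc6027"),
    ("Bekshire Hathaway", "7285f94bf668f60d7bf1a10904618a896ecd211f"),
    ("Berkshire Hathaway Inc", "7285f94bf668f60d7bf1a10904618a896ecd211f"),
    ("Berkshire Hathaway US", "7285f94bf668f60d7bf1a10904618a896ecd211f"),
    ("Tesla", "b330d5663e6451d36c25c152fb523c4820466c50"),
    ("Tesla, Inc.", "b330d5663e6451d36c25c152fb523c4820466c50"),
    ("Tesla Motors", "b330d5663e6451d36c25c152fb523c4820466c50"),
]
companies_with_ids_df = pd.DataFrame(rows, columns=["company", "Company ID"])

# Default behaviour: the first spelling seen for each ID is the one kept.
deduped_df = companies_with_ids_df.drop_duplicates(subset="Company ID")

# Alternative: gather every spelling per ID, in order of first appearance.
variants_by_id = (
    companies_with_ids_df
    .groupby("Company ID", sort=False)["company"]
    .agg(list)
)

# Placeholder rule (an assumption): treat the longest spelling as canonical.
canonical_names = variants_by_id.map(lambda names: max(names, key=len))
```

    Keeping variants_by_id around is also handy for review, since it shows exactly which input names were merged under each ID.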

    As you can see, the code successfully deduplicated the companies, leaving one row per unique company. If you’d like to try this out for yourself, sign up for Explorium and get your API key.
    Here’s Explorium’s company IDs API documentation.

    In addition to company deduplication, Explorium also offers data about companies (such as number of employees, revenue, and industry), data about individuals (personal and business emails, phone numbers, names, job titles, and employers), and geospatial data (points of interest, real estate, etc.).

    You can explore our API documentation and our data catalog to get more information on Explorium’s data.

    Deduping data is a crucial part of data quality and Explorium’s API makes it easy to do in just 7 lines of code. Try it out for yourself and see the power of the Explorium API!