
    Have you ever found yourself developing PySpark inside EMR notebooks? Have you ever debugged PySpark locally and wanted to run it against a large, real data set, only to find that your computer lacked the resources?


    On one of my projects at Explorium, I was implementing a scalable entity resolution process to automatically match and merge entities from dozens of data sources. (Stay tuned for more details in upcoming articles.)

    This process is very resource-heavy, so I decided to implement it using PySpark.

    PySpark gave me two advantages: I could run very complex, compute-intensive processes, and I could reuse our entity data model, which is implemented in Python.


    When I started implementing the process, I found that I could not debug it against a real data set on my local computer, so I tried to debug it against AWS EMR instead.

    I was sure a quick search would turn up a solution, but surprisingly, it didn’t. I even opened a Stack Overflow thread about this most basic need, “How to debug PySpark on EMR using PyCharm”, but no one answered.


    After doing some research, I would like to share what I learned about debugging PySpark with PyCharm and AWS EMR.

    To read more, visit the channel on Medium.