Debugging PySpark with PyCharm and AWS EMR

Have you ever found yourself developing PySpark inside EMR notebooks? Have you ever found yourself debugging PySpark locally, wanting to run it against a real, large data set, but unable to because your computer lacked the resources?


On one of my projects at Explorium, I was implementing a scalable entity resolution process to automatically match and merge entities from dozens of data sources. (Stay tuned for more details in upcoming articles.)

This process is very resource-intensive, so I decided to implement it with PySpark.

PySpark gave me two advantages: the computational power to run very complex processes at scale, and the ability to reuse our entity data model, which is implemented in Python.


When I started implementing the process, I ran into a problem: I couldn't debug it against a real data set on my local machine, so I tried to debug it against AWS EMR instead.

I was sure a quick search would turn up solutions, but surprisingly, it didn't. I even opened a Stack Overflow thread about this basic need, "How to debug PySpark on EMR using PyCharm", but no one answered.


After doing some research of my own, I'd like to share my insights on debugging PySpark with PyCharm and AWS EMR.

To read more visit the Explorium.ai channel on Medium.