Have you ever found yourself developing PySpark inside EMR notebooks? Have you debugged PySpark locally, but wanted to run it against a large, real data set and couldn't because your computer lacked the resources?
On one of my projects at Explorium, I was implementing a scalable entity resolution process to automatically match and merge entities from dozens of data sources. (Stay tuned for more details in upcoming articles.)
This process is very resource-heavy, so I decided to implement it using PySpark.
PySpark gave me two advantages: I could run very complex, computation-heavy processes, and I could use our entity data model, which is implemented in Python.
When I started implementing the process, I found that I couldn't debug it against a real data set on my local computer, so I tried to debug it against AWS EMR instead.
I was sure a quick search would turn up a solution, but surprisingly, it didn’t. I even opened a Stack Overflow thread about this most basic need, “How to debug PySpark on EMR using PyCharm”, but no one answered.
After doing some research, I would like to share my insights on how to debug PySpark with PyCharm and AWS EMR.
To read more, visit the Explorium.ai channel on Medium.