Debugging PySpark with PyCharm and AWS EMR

maya.malevich

Have you ever found yourself developing PySpark inside EMR notebooks? Have you debugged PySpark locally, then wanted to run it against a real, large data set but couldn't because your computer lacked the resources?

On one of my projects at Explorium, I was implementing a scalable entity resolution process to automatically match and merge entities from dozens of data sources. (Stay tuned for more details in upcoming articles.)

This process is very resource-heavy, so I decided to implement it using PySpark.

PySpark gave me two advantages: I could run very complex, compute-intensive processes, and I could use our entity data model, which is implemented in Python.

When I started implementing the process, I found that I couldn't debug it against a real data set on my local computer, so I tried to debug it against AWS EMR instead.

I was sure a quick search would turn up a solution, but surprisingly, it didn’t. I even opened a Stack Overflow thread about this most basic need, “How to debug PySpark on EMR using PyCharm”, but no one answered.

After doing some research, I would like to share what I learned about debugging PySpark with PyCharm and AWS EMR.

To read more visit the Explorium.ai channel on Medium.

The post Debugging PySpark with PyCharm and AWS EMR appeared first on Explorium.
