Best Practices
Best practices for Python programmers using PySpark.
While PySpark provides a familiar environment for Python programmers, it’s good to follow a few best practices to make sure you are using Spark efficiently. Here is a set of recommendations I’ve compiled based on my experience porting a few projects from Python to PySpark.
Avoid dictionaries
Using Python data types such as dictionaries means that the code might not be executable in a distributed mode. Instead of using keys to index values in a dictionary, consider adding another column to the dataframe that can be used as a filter. This recommendation also applies to other Python collections, such as lists, that are not distributable in PySpark.
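As a minimal sketch of this idea, the snippet below replaces a driver-side dictionary lookup with a join against a small dataframe. The SparkSession, column names, and sample values here are hypothetical and only for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input data with a country_code column.
events = spark.createDataFrame(
    [("US", 10), ("CA", 5), ("MX", 7)],
    ["country_code", "clicks"],
)

# Instead of a Python dict like {"US": "United States", ...} on the driver,
# express the mapping as a dataframe and join it in.
country_names = spark.createDataFrame(
    [("US", "United States"), ("CA", "Canada"), ("MX", "Mexico")],
    ["country_code", "country_name"],
)

# The join runs in a distributed fashion; no Python dictionary is involved.
labeled = events.join(country_names, on="country_code", how="left")
labeled.show()
```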
Limit Pandas usage
Calling toPandas() will cause all data to be loaded into memory on the driver node and prevents operations from being performed in a distributed mode. It’s fine to use this function when data has already been aggregated and you want to make use of familiar Python plotting tools, but it should not be used for large dataframes.
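The sketch below shows this pattern: aggregate in Spark first, then collect only the small result for plotting. The events dataframe and its columns are assumed from the previous example, and the plotting call assumes Pandas with a plotting backend such as Matplotlib is available.

```python
import pyspark.sql.functions as F

# Do the heavy lifting in Spark: reduce the data to a small aggregate first.
clicks_by_country = (
    events
    .groupBy("country_code")
    .agg(F.sum("clicks").alias("total_clicks"))
)

# Only the aggregated result is pulled to the driver, where familiar
# Pandas plotting tools can be used.
clicks_pdf = clicks_by_country.toPandas()
clicks_pdf.plot.bar(x="country_code", y="total_clicks")
```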
Avoid loops
Instead of using for loops, it’s often possible to use functional approaches such as group by and apply to achieve the same result. Using this pattern means that code can be parallelized by supported execution environments. I’ve noticed that focusing on using this pattern in Python also results in cleaner code that is easier to translate to PySpark.
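As a rough illustration, the snippet below replaces an explicit loop over groups with groupBy plus an aggregate, and with applyInPandas (available in Spark 3.0+, which requires PyArrow) for more involved per-group logic. The events dataframe and column names are assumptions carried over from the earlier sketches.

```python
import pandas as pd
import pyspark.sql.functions as F

# Loop-free aggregation: groupBy expresses the per-group work,
# and Spark parallelizes it across executors.
avg_clicks = events.groupBy("country_code").agg(
    F.avg("clicks").alias("avg_clicks")
)

# For more complex per-group logic, applyInPandas replaces a Python
# for loop over groups. Each group arrives as a Pandas dataframe.
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["clicks_z"] = (pdf["clicks"] - pdf["clicks"].mean()) / pdf["clicks"].std()
    return pdf

normalized = events.groupBy("country_code").applyInPandas(
    normalize,
    schema="country_code string, clicks long, clicks_z double",
)
```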