A PySpark Primer
An overview of PySpark.
What is PySpark?
PySpark is the Python API for Apache Spark, and it is well suited to both exploratory analysis and building machine learning pipelines. The core data type in PySpark is the Spark dataframe, which offers an interface similar to pandas dataframes but is designed to execute in a distributed environment.
While the Spark DataFrame API provides a familiar interface for Python programmers, there are significant differences in how commands issued to these objects are executed.
A key difference is that Spark commands are lazily executed: transformations build up an execution plan, and computation is deferred until an action requires a result. In addition, because rows are partitioned across machines with no guaranteed order, positional operations such as iloc are not available on these objects. While working with Spark dataframes can seem constraining, the benefit is that PySpark can scale to much larger datasets than pandas.