DataFrames

Get an introduction to the DataFrame data structure and its API.


When developers use structured APIs such as DataFrames instead of RDDs, they gain expressiveness, simplicity, composability, and uniformity, in addition to improved performance. The structured APIs (DataFrames and Datasets) make it easy to express computations with common patterns used in data analysis. Consider the following example code, written using RDDs, which computes the average rating given to the movie Gone with the Wind by three analysts.

    // Create an RDD of (movie, rating) pairs
    sc.parallelize(Seq(("Gone with the Wind", 6), ("Gone with the Wind", 8),
      ("Gone with the Wind", 8)))
      .map(v => (v._1, (v._2, 1)))                       // pair each rating with a count of 1
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // sum ratings and counts per movie
      .map(x => (x._1, x._2._1 / (x._2._2 * 1.0)))       // divide rating sum by count
      .collect()

The code appears cryptic. Spark cannot inspect the intent of the lambda functions we pass in, so it treats the computation as opaque and cannot apply any optimizations. In contrast, look at the same computation expressed using DataFrames below:

    import org.apache.spark.sql.functions.avg

    // Build a DataFrame of (movie, rating) rows with named columns
    val dataDF = spark.createDataFrame(Seq(("Gone with the Wind", 6),
      ("Gone with the Wind", 8), ("Gone with the Wind", 8))).toDF("movie", "rating")
    // Group by movie and compute the average rating per group
    val avgDF = dataDF.groupBy("movie").agg(avg("rating"))
    avgDF.show()
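
Running this in a Spark shell should print a small table along the following lines (the average of 6, 8, and 8 is 22/3 ≈ 7.33; by default, Spark names the aggregated column avg(rating)):

    +------------------+------------------+
    |             movie|       avg(rating)|
    +------------------+------------------+
    |Gone with the Wind| 7.333333333333333|
    +------------------+------------------+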

Looking at the two listings, it is clear that the DataFrame version is far more expressive and simpler: we declare what we want (group the ratings by movie and average them), and Spark can inspect the query and decide how best to compute it.
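
The uniformity of the structured APIs also means the same aggregation can be written in Spark SQL and run through the same optimizer. Below is a minimal sketch assuming the dataDF and the spark session from the listing above; the view name ratings and the alias avg_rating are arbitrary names chosen for illustration:

    // A sketch of the equivalent query in Spark SQL; "ratings" and
    // "avg_rating" are illustrative names, not part of the original example.
    dataDF.createOrReplaceTempView("ratings") // expose the DataFrame as a SQL view
    val avgSqlDF = spark.sql(
      "SELECT movie, AVG(rating) AS avg_rating FROM ratings GROUP BY movie")
    avgSqlDF.show()                           // same result as avgDF above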