Interpreting Spark Logs

Get introduced to Spark application logs and the key elements needed to understand them.

Logging and Spark

Whenever Spark is used in a Java (or Spring) application, the logs the application produces are most likely a mix of the log lines written by its developers and the logs produced by the Spark libraries themselves.
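As a minimal sketch of what that mix looks like in practice (the class name and data below are hypothetical, not the template project's actual code), a job class typically writes its own log lines through SLF4J while Spark emits its own lines as soon as the context starts up and jobs run:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FoodsBatchJob {

    // Application-level logger: lines written here are attributed to this class.
    private static final Logger log = LoggerFactory.getLogger(FoodsBatchJob.class);

    public static void main(String[] args) {
        log.info("Starting the foods batch job");               // developer's own log line

        SparkConf conf = new SparkConf().setAppName("foods-batch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) { // Spark logs its own startup here
            JavaRDD<String> foods = sc.parallelize(Arrays.asList("rice", "beans", "apples"));
            log.info("Loaded {} foods", foods.count());          // developer line mixed with Spark's job logs
        }
    }
}

Both kinds of lines end up in the same output stream, which is why being able to tell them apart matters.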

This mix occurs because the Spark libraries (at least as of version 3.x) rely on the org.slf4j and log4j logging frameworks, the former being the logging interface and the latter the implementation. We purposely excluded this dependency in the batch template Maven project's pom.xml file to avoid compatibility issues with the logging framework Spring Boot uses:

<!-- Spark -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version}</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Technical quirks aside, it is easy to spot the Spark logs an application produces because each line is prefixed with the fully qualified name (package.ClassName) of the Spark class that produced it.
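Because those logger names all live under the org.apache.spark package, one optional way to tune how verbose Spark is, without touching any logging configuration files, is the setLogLevel method on the Spark context. A minimal sketch (application name and setup are assumptions, not the lesson project's code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class QuietSparkExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("quiet-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Raise the threshold for Spark's own loggers; valid values include
            // ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF.
            sc.setLogLevel("WARN");
            // ... run the job; from here on, only WARN-and-above Spark lines are printed.
        }
    }
}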

Let’s work with the project from the Transformations (I): Map and Filter lesson and bring some meaning to what Spark usually logs by highlighting the important parts. The code of this project applies a map transformation, followed by a filter transformation, to the foods dataset (a rough sketch of that shape is included below). When the application runs, the logs are produced and shown in the system output (terminal output, Windows command-line output, and so on), so let’s analyze them. ...
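The lesson project's exact code is not reproduced here, but a minimal, hypothetical sketch of that map-then-filter shape (the data and transformations below are assumptions) looks like this:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MapFilterSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("map-filter-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<String> foods = Arrays.asList("rice", "beans", "broccoli", "bread");
            JavaRDD<String> result = sc.parallelize(foods)
                    .map(String::toUpperCase)                  // map: transform each element
                    .filter(food -> food.startsWith("B"));     // filter: keep only some elements
            result.collect().forEach(System.out::println);     // action that triggers the job (and its logs)
        }
    }
}

Running something of this shape is what produces the log output analyzed in the rest of the lesson.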