Spark SQL Views and Tables
Get an introduction to Spark SQL views and tables.
In the previous lesson, we created a temporary view in Spark. We can also create a table using Spark SQL. Spark uses Apache Hive to persist metadata for user-created tables, such as the schema, description, table name, database name, column names, partitions, and physical location. If Hive isn't configured, Spark falls back to Hive's embedded deployment mode, which uses Apache Derby as the underlying database. When we start the spark-shell without a Hive configuration, it creates the `metastore_db` and `spark-warehouse` directories in the current directory. We'll see these directories when we work in the terminal at the end of this lesson.
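To make the view-versus-table distinction concrete, here is a minimal Scala sketch you could run in a spark-shell session. It assumes spark-shell's defaults: `spark` (a SparkSession) is already in scope and `spark.implicits._` is pre-imported; the sample data and names are illustrative.

```scala
// In spark-shell, `spark` and the implicits needed for toDF are preloaded.
val df = Seq((1, "Interstellar"), (2, "Dunkirk")).toDF("id", "title")

// A temporary view lives only in this SparkSession; nothing is
// written to the metastore or the warehouse directory.
df.createOrReplaceTempView("tempMovies")

// A table persists its data under spark-warehouse/ and its
// metadata (schema, location, and so on) in metastore_db/.
df.write.saveAsTable("movies")
```

After `saveAsTable` runs, the current directory should contain the `metastore_db` and `spark-warehouse` directories described above, while the temporary view leaves no trace on disk.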
There are two configuration settings related to Hive. The first, the configuration property `spark.sql.warehouse.dir`, specifies the location of the Hive metastore warehouse, also known as the `spark-warehouse` directory. This is the location where Spark SQL persists table data. The second is the location of the Hive metastore itself, also known as the `metastore_db`: a relational database that manages the metadata of persistent relational entities, such as databases, tables, columns, and partitions.
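The warehouse location can be overridden when building the session. The sketch below assumes a standalone Scala application rather than spark-shell, and the directory path is an illustrative choice, not a required value:

```scala
import org.apache.spark.sql.SparkSession

// Override the warehouse directory before the session starts;
// the setting cannot be changed once the SparkSession exists.
val spark = SparkSession.builder()
  .appName("WarehouseDemo")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/my-spark-warehouse")  // assumed path
  .enableHiveSupport()  // use Hive's metastore when one is configured
  .getOrCreate()

// Inspect the effective setting.
println(spark.conf.get("spark.sql.warehouse.dir"))
```

The same override works on the command line with `spark-shell --conf spark.sql.warehouse.dir=/tmp/my-spark-warehouse`.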
Managed vs unmanaged tables
In Spark, we can create two types of tables:
- **Managed:** With managed tables, Spark manages both the data and the metadata of the table. If the user drops a managed table, Spark deletes both the data and the metadata for that table.
- **Unmanaged:** With unmanaged tables, Spark manages only the metadata of the table, while the user has the onus of managing the table's data in an external data source. If the user drops the table, only its metadata is deleted, not the actual data for the table.
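The difference shows up in the DDL and in what `DROP TABLE` removes. Here is a hedged sketch using Spark SQL from a spark-shell session (`spark` in scope); the table names and the external path are illustrative assumptions:

```scala
// Managed table: Spark owns both data and metadata. The data files
// land under the spark-warehouse directory.
spark.sql("CREATE TABLE managed_movies (id INT, title STRING) USING parquet")

// Unmanaged table: Spark records only the metadata; the data lives
// at a user-supplied location (assumed path below).
spark.sql("""
  CREATE TABLE unmanaged_movies (id INT, title STRING)
  USING parquet
  LOCATION '/tmp/external/movies'
""")

// DROP behavior differs between the two:
spark.sql("DROP TABLE managed_movies")    // deletes metadata AND data files
spark.sql("DROP TABLE unmanaged_movies")  // deletes metadata only; files remain
```

Specifying a `LOCATION` (or writing with `.option("path", ...)` before `saveAsTable`) is what makes a table unmanaged; without it, Spark defaults to a managed table under the warehouse directory.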