Spark and databases

Although big data processing draws on many kinds of data sources, relational databases are often still the de facto choice of data repository, particularly where the business domain requires data normalization, relationships between domain models, strongly consistent transactions, and so on.

Spark can interact with an RDBMS to load a whole table or, by executing a query against the table, to load only a fraction of it. It also makes it possible to run operations such as filtering and aggregation while ingesting from the database, which helps minimize the volume of data retrieved.

With regard to minimizing the amount of data retrieved, one sensible strategy is to filter at the database level when querying tables, whenever possible. The immediate benefit is a reduction in the volume of data transferred from the source.
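As a minimal sketch of filtering at the database level, the helper below builds the options for a Spark JDBC read in which the `dbtable` option holds a parenthesized subquery instead of a table name, so the database evaluates the `WHERE` clause before any rows are transferred. The class name `JdbcPushdownOptions`, the `orders` table, the `order_date` column, the connection URL, and the credentials are all hypothetical and used only for illustration; `url`, `user`, `password`, and `dbtable` are standard Spark JDBC option keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Spark): prepares JDBC read options
// whose "dbtable" value is a filtering subquery rather than a table name.
final class JdbcPushdownOptions {

    // Wraps a filtered SELECT in parentheses with an alias, so it can be
    // used anywhere a table name is valid in a FROM clause.
    static String filteredTable(String table, String predicate, String alias) {
        return "(SELECT * FROM " + table + " WHERE " + predicate + ") " + alias;
    }

    // Collects the option keys that Spark's JDBC data source reads.
    static Map<String, String> build(String url, String user,
                                     String password, String dbtable) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("url", url);
        options.put("user", user);
        options.put("password", password);
        options.put("dbtable", dbtable);
        return options;
    }

    public static void main(String[] args) {
        // Assumed table, column, and connection details, for illustration only.
        String dbtable = filteredTable(
                "orders", "order_date >= '2024-01-01'", "recent_orders");
        Map<String, String> options = build(
                "jdbc:postgresql://localhost:5432/shop", "user", "secret", dbtable);

        // Show the subquery that the database, not Spark, would evaluate.
        System.out.println(options.get("dbtable"));
    }
}
```

Given an existing `SparkSession` named `spark` and the database's JDBC driver on the classpath, these options could then be passed as `spark.read().format("jdbc").options(options).load()`, so that only the rows matching the predicate leave the database.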

Note: We should always think in big data terms: assume that volumes of information are huge, and aim for efficiency in every possible corner of our application.

Java applications always use a ...