
Resilient Distributed Datasets of Spark

Learn about the basic building block of Spark: resilient distributed datasets (RDDs).

RDDs provide a restricted form of shared-memory abstraction based on coarse-grained transformations (a transformation applied in bulk to the data with a function such as map or reduce) rather than fine-grained transformations (a transformation applied to an individual entity of a database). Simply put, an RDD is a dataset distributed across the memories of a cluster's worker nodes and built through coarse-grained transformations.
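For illustration, here is a minimal sketch of a coarse-grained transformation, using the same spark context and dfs://file path that appear in the examples below: map applies one function to the entire dataset in bulk, rather than reading or updating individual records.

// Coarse-grained: one function (line => line.toUpperCase) is applied to every line.
val upperLines = spark.textFile("dfs://file").map(line => line.toUpperCase)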

Creation of RDDs

An RDD is an object in the language in which it is created. We can build an RDD in the following ways.

From a file

An RDD can be built from a file in a distributed file system (DFS). This creates an RDD in which each block of the file in the DFS becomes a partition of the RDD, and each record in a partition represents a line of the file.

val RDD = spark.textFile("dfs://file")

From a collection

An RDD can be built by parallelizing an inherently sequential collection, such as an array or a list. The user can decide how many partitions to split the collection into.

val RDD2 = spark.parallelize(1 to 9999, 8)  // second argument: number of partitions
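For illustration, the resulting number of partitions can be verified with the RDD's standard getNumPartitions method:

RDD2.getNumPartitions   // returns 8 for the call above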

From another RDD

RDDs can be made from other RDDs in two ways.

  • An RDD can be built by transforming an already built RDD, for example, by keeping only the lines that contain "NULL" in them:

    val RDD3 = RDD.filter(_.contains("NULL"))
    
  • It can also be built by altering the persistence of an existing RDD. By default, RDDs are lazy (computed on demand when used in parallel operations) and ephemeral (discarded from memory afterward). To change this behavior, Spark provides two actions:

    • Cache: The cache action makes Spark keep the data in memory for reuse after it is first computed. However, if there is not enough memory to cache the data, Spark recomputes it whenever it is used again (see the sketch after this list). This process is
...
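For illustration, here is a minimal sketch of the cache action, assuming RDD is the dataset built from the file above (filter, cache, and count are standard Spark RDD methods):

val errors = RDD.filter(_.contains("ERROR"))
errors.cache()    // hint: keep errors in memory once it has been computed
errors.count()    // the first action computes the partitions and caches them
errors.count()    // reuses the cached partitions; recomputed only if memory ran out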