RDDs (Resilient Distributed Datasets) are the fundamental data structure of Apache Spark: immutable, partitioned collections of objects that can be operated on in parallel across a cluster. RDDs can be created from data in distributed storage (such as HDFS) or from in-memory collections, and they are fault tolerant: lost partitions can be recomputed from their lineage on failure.
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
Example:
RDD:
val rdd = sc.parallelize(List((1, "John"), (2, "Jane"), (3, "Bob")))
DataFrame:
import spark.implicits._  // toDF requires the implicit encoders of the active SparkSession
val df = rdd.toDF("id", "name")
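To illustrate how the two APIs differ in practice, here is a short sketch of the same filter-and-project query written both ways. It assumes a Spark shell or an application with an active SparkSession named spark (which supplies sc and the implicit encoders); the column names id and name match the example above.

// Sketch, assuming an active SparkSession named `spark`.
import spark.implicits._

val rdd = spark.sparkContext.parallelize(List((1, "John"), (2, "Jane"), (3, "Bob")))

// RDD style: transformations operate on opaque tuples,
// so Spark cannot inspect or optimize the lambda.
val namesRdd = rdd.filter { case (id, _) => id > 1 }.map(_._2)

// DataFrame style: named columns describe the query declaratively,
// letting the Catalyst optimizer plan the execution.
val df = rdd.toDF("id", "name")
df.filter($"id" > 1).select("name").show()

The RDD version returns the same names, but only the DataFrame version exposes enough structure for Spark to apply optimizations such as predicate pushdown.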