RDDs (Resilient Distributed Datasets) are the primary data abstraction in Apache Spark. An RDD is an immutable collection of objects partitioned across the machines of a cluster, and it can be created from files, databases, or other RDDs. RDDs are resilient because Spark records the lineage of transformations used to build each one, so any partitions lost when a node fails can be recomputed rather than restored from a backup.
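A minimal sketch of how lineage-based recovery works, assuming a spark-shell session where a SparkContext named sc is already in scope (the variable names are illustrative):

// Build an RDD from an in-memory collection; Spark records only the
// lineage of transformations (parallelize -> map -> filter), not the results.
val numbers = sc.parallelize(1 to 100)
val squares = numbers.map(n => n * n)      // transformation: lazy
val evens   = squares.filter(_ % 2 == 0)   // transformation: lazy
println(evens.count())                     // action: triggers the computation

If a node holding some of these partitions fails, Spark replays the recorded lineage on the surviving nodes to rebuild just the missing pieces.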
DataFrames are a higher-level abstraction built on top of RDDs. A DataFrame is analogous to a table in a relational database: it carries a schema describing its columns, which lets Spark optimize queries before executing them. DataFrames provide a domain-specific language for structured data manipulation and can be constructed from a wide array of sources, such as CSV files, JSON files, and existing RDDs.
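A brief sketch of that domain-specific language, assuming a SparkSession named spark and a DataFrame df with name and age columns (both hypothetical):

import spark.implicits._                   // enables the $"column" syntax
import org.apache.spark.sql.functions.avg

// Column operations are expressed declaratively, so Spark can
// optimize the whole plan before running it.
df.select($"name", $"age")
  .filter($"age" > 21)
  .groupBy($"name")
  .agg(avg($"age").alias("avg_age"))
  .show()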
Example:
RDD:
// Read a text file into an RDD of strings, one element per line
val rdd = sc.textFile("data.txt")
DataFrame:
// Read a CSV file into an untyped DataFrame (columns _c0, _c1, ...)
val df = spark.read.csv("data.csv")
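As noted above, a DataFrame can also be derived from an existing RDD. One way to do this, assuming the same sc and spark session (Person and the sample data are illustrative):

import spark.implicits._

// A case class gives the RDD a schema that Spark can infer
case class Person(name: String, age: Int)
val peopleRdd = sc.parallelize(Seq(Person("Ada", 36), Person("Alan", 41)))
val peopleDf  = peopleRdd.toDF()           // RDD -> DataFrame
peopleDf.printSchema()                     // name: string, age: integer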