A Resilient Distributed Dataset (RDD) is the fundamental data structure of Apache Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
For example, consider a list of numbers [1, 2, 3, 4, 5, 6, 7, 8] loaded into a single RDD. Spark can divide that RDD into logical partitions, such as four partitions of two elements each:
Partition 1 = [1, 2]
Partition 2 = [3, 4]
Partition 3 = [5, 6]
Partition 4 = [7, 8]
These partitions can then be computed on different nodes of the cluster in parallel.
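The splitting step can be sketched in plain Python. This is an illustration of the idea, not Spark's internal implementation; in PySpark itself you would get the same layout by calling `sc.parallelize(data, 4)` and inspecting it with `glom().collect()`:

```python
# Plain-Python sketch of how a collection is split into logical
# partitions. Illustrative only -- Spark's actual slicing lives inside
# sc.parallelize(data, numSlices) and may differ in detail.

def partition(data, num_partitions):
    """Split `data` into `num_partitions` contiguous chunks."""
    n = len(data)
    chunks = []
    for i in range(num_partitions):
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        chunks.append(data[start:end])
    return chunks

data = [1, 2, 3, 4, 5, 6, 7, 8]
partitions = partition(data, 4)
print(partitions)  # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Because each chunk is independent of the others, a worker node can process its chunk without coordinating with the rest of the cluster, which is what makes the parallel computation possible.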