In Spark, we may use the same RDD multiple times, and by default Spark repeats the full evaluation of that RDD each time it is brought into action. This can be time and memory consuming, especially for iterative algorithms that pass over the data many times. Persistence was introduced to solve this problem of repeated computation: it avoids re-computing the whole lineage by saving the data, by default in memory. It makes the whole system
- Time efficient
- Cost efficient
- Faster in execution, since intermediate results are reused
- Persistence in Spark RDD is an optimization technique that saves the result of an RDD evaluation.
- Using it, we save an intermediate result so that we can reuse it later if required.
- It reduces the computation overhead.
- We can persist an RDD through the cache() and persist() methods.
- When we use the cache() method, the RDD is stored in memory.
- A persisted RDD can be used efficiently across parallel operations.
The difference between cache() and persist() is that cache() always uses the default storage level MEMORY_ONLY, while persist() accepts any of the storage levels described below. When the RDD is computed for the first time, it is kept in memory on the node. Spark's cache is fault tolerant: whenever any partition of an RDD is lost, it can be recovered by re-running the transformations that originally created it.
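As a minimal sketch of this difference (assuming a running SparkContext `sc`, e.g. in spark-shell, and a hypothetical input file `data.txt`), cache() is simply shorthand for persist() with the default level:

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt")   // hypothetical input path
val lengths = lines.map(_.length)

lengths.cache()                       // equivalent to persist(StorageLevel.MEMORY_ONLY)
// lengths.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: pick an explicit level

lengths.count()  // first action: computes the RDD and caches its partitions
lengths.sum()    // reuses the cached partitions instead of re-reading data.txt
```

Note that an RDD's storage level can only be set once; calling persist() again with a different level raises an exception unless the RDD is unpersisted first.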
- RDD persistence facilitates storage and reuse of RDD partitions. When an RDD is marked for persistence, each node stores any partitions of it that it computes in memory, and reuses them in other actions on that dataset. This makes those actions much faster.
- Automatic re-computation of lost RDD partitions: if an RDD partition is lost, it is automatically re-computed using the original transformations. Thus, the cache is fault tolerant.
- Every persisted RDD can be stored using a different storage level, determined by the StorageLevel object passed to the persist() method.
MEMORY_ONLY
In this storage level, the RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed on the fly each time they are needed. At this level the space used for storage is very high, the CPU computation time is low, and the data is stored in memory only. It does not make use of the disk.
MEMORY_AND_DISK
In this level, the RDD is stored as deserialized Java objects in the JVM. When the RDD is larger than memory, the excess partitions are stored on disk and retrieved from there whenever required. At this level the space used for storage is high, the CPU computation time is medium, and it makes use of both in-memory and on-disk storage.
MEMORY_ONLY_SER
This level stores the RDD as serialized Java objects (one byte array per partition). It is more space efficient than deserialized objects, especially when a fast serializer is used, but it increases the overhead on the CPU. At this level the storage space is low, the CPU computation time is high, and the data is stored in memory only. It does not make use of the disk.
MEMORY_AND_DISK_SER
It is similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk rather than recomputed each time they are needed. In this storage level, the space used for storage is low, the CPU computation time is high, and it makes use of both in-memory and on-disk storage.
DISK_ONLY
In this storage level, the RDD is stored only on disk. The space used for storage is low, the CPU computation time is high, and it makes use of on-disk storage only.
Adding Persistence
import org.apache.spark.storage.StorageLevel
val data = sc.parallelize(1 to 10)
// Choose one storage level per RDD; an RDD's level can only be set once.
data.persist(StorageLevel.MEMORY_ONLY)
// Alternatives:
// data.persist(StorageLevel.MEMORY_AND_DISK)
// data.persist(StorageLevel.MEMORY_ONLY_SER)
// data.persist(StorageLevel.MEMORY_AND_DISK_SER)
// data.persist(StorageLevel.DISK_ONLY)
Unpersist
data.unpersist()
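Putting the pieces together, a minimal end-to-end sketch of the persist/use/unpersist lifecycle (again assuming a running SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 10)
data.persist(StorageLevel.MEMORY_AND_DISK)

val total = data.sum()  // first action: computes the RDD and caches it
val max   = data.max()  // reuses the cached partitions, no recomputation

data.unpersist()        // free the cached partitions once they are no longer needed
```

Spark also evicts old partitions automatically in least-recently-used order, but calling unpersist() explicitly releases the storage as soon as you know the RDD will not be reused.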
Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
- Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
- Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
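The second step above, selecting a fast serialization library for the serialized levels, can be sketched as follows. The `spark.serializer` setting and KryoSerializer class are part of Spark's standard configuration; the app name is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Configure Kryo, which is typically much faster and more compact
// than default Java serialization.
val conf = new SparkConf()
  .setAppName("persistence-demo")  // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)

// With a fast serializer, MEMORY_ONLY_SER trades a little extra CPU
// for a much smaller memory footprint.
val data = sc.parallelize(1 to 1000000)
data.persist(StorageLevel.MEMORY_ONLY_SER)
data.count()
```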