RDD Persistence & Caching — Memory Management in Spark

NeoMart’s data team is running an advanced analytics pipeline on customer clickstream logs. The process includes:

Cleaning raw data
Extracting session-level metrics
Running machine learning transformations
Aggregating results for dashboards

Each step uses the same processed RDD multiple times.

But there is a problem:
Running the entire pipeline again and again takes too long. Spark must recompute every transformation from scratch, rebuilding lineage and rerunning all upstream stages.

Enter RDD Persistence & Caching — Spark’s way of remembering data for faster computations, saving precious time, money, and compute resources.

Why Do We Need Caching?

Spark uses lazy evaluation, meaning RDD transformations are not executed unless an action triggers them.
So if an RDD is used multiple times:

cleaned_data.count()
cleaned_data.take(10)
cleaned_data.saveAsTextFile("/mnt/output")

Spark recomputes cleaned_data three separate times unless you cache it.

Caching solves this by storing the RDD in memory (or memory + disk) so repeated access is instant.

cache() vs persist()

Spark provides two main ways to store RDDs:

### 1. `cache()`

Stores RDD in memory only.

rdd.cache()

Equivalent to:

rdd.persist(StorageLevel.MEMORY_ONLY)

### 2. `persist()`

Allows specifying different storage levels.

from pyspark import StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)

Used when data may not fit entirely in memory.

Available Storage Levels

Storage Level	Description
MEMORY_ONLY	Fastest, but may fail if RDD doesn’t fit in memory
MEMORY_AND_DISK	Stores what fits in memory; spills the rest to disk
DISK_ONLY	Slower, but ensures full persistence
MEMORY_ONLY_SER	Serialized in memory — reduces size but increases CPU cost
MEMORY_AND_DISK_SER	Balanced storage & reliability
OFF_HEAP	For external memory (Tungsten), rare in typical workloads

Story Example: NeoMart Recommendation Pipeline

NeoMart runs a sessionization workflow to build personalized recommendations.

Without caching:

Each model training iteration recomputes raw logs
Session extraction runs again
Feature engineering runs again
Total time: 45 minutes

With caching:

sessions = (
    logs
    .filter(lambda x: "session" in x)
    .map(parse_session)
)

sessions.cache()

model = train_recommendation_model(sessions)

Total time drops to 8 minutes.

Caching saved them over 80% compute time.

When Should You Cache an RDD?

✔When the RDD is reused multiple times

Example: Training multiple ML models with the same preprocessed data.

✔When recomputation cost is expensive

Example: Custom parsing, UDFs, joins, or external IO.

✔When performing iterative algorithms

PageRank
K-Means
Gradient descent loops

✔When running multiple actions on the same RDD

Such as count(), take(), collect(), saveAsTextFile().

When Not to Cache

❌ RDD is used only once

Caching wastes memory.

❌ RDD is too large to fit in memory

Prefer MEMORY_AND_DISK or avoid caching.

❌ Using DataFrames instead

Spark automatically optimizes them with Catalyst & Tungsten.

How to Uncache / Remove from Memory

Memory is limited. After you're done, always clean up:

rdd.unpersist()

Or remove all cached objects:

spark.catalog.clearCache()

Debugging: How to See Cached RDDs

In Databricks or Spark UI:

Open the Storage tab
View size, partitions, and storage level
Monitor memory usage
Identify partitions not cached due to size

This helps optimize cluster resources effectively.

Example: Full Pipeline Using cache() and persist()

from pyspark import StorageLevel

logs = sc.textFile("/mnt/neomart/raw_logs")

clean = logs \
    .filter(lambda x: "event" in x) \
    .map(lambda x: parse_event(x))

clean.persist(StorageLevel.MEMORY_AND_DISK)

# Perform multiple actions without recomputation
print(clean.count())
print(clean.take(5))

daily_stats = clean \
    .map(lambda x: (x.date, 1)) \
    .reduceByKey(lambda x, y: x + y)

Best Practices for RDD Caching

🔹 Cache early in iterative algorithms

Avoid repeating expensive transformations.

🔹 Use MEMORY_ONLY when data fits

Fastest option.

🔹 Use MEMORY_AND_DISK when unsure

Safe and reliable.

🔹 Don’t cache everything

Be selective to avoid memory pressure.

🔹 Clean up with unpersist()

Especially in long Databricks jobs.

Summary — Caching Makes Spark Lightning Fast

RDD caching prevents expensive recomputations.
cache() stores data in memory; persist() lets you choose storage levels.
Useful for ML loops, repeated actions, and expensive pipelines.
Improves performance and reduces cluster cost.
Spark UI helps monitor cached datasets and memory usage.

Caching is one of the most powerful performance tools in Spark — when used wisely, it turns slow pipelines into near real-time workflows.

Next, we’ll cover Creating DataFrames from CSV, JSON, Parquet, and Hive Tables.

Why Do We Need Caching?​

cache() vs persist()​

### 1. cache()​

### 2. persist()​

Available Storage Levels​

Story Example: NeoMart Recommendation Pipeline​

Without caching:​

With caching:​

When Should You Cache an RDD?​

✔When the RDD is reused multiple times​

✔When recomputation cost is expensive​

✔When performing iterative algorithms​

✔When running multiple actions on the same RDD​

When Not to Cache​

❌ RDD is used only once​

❌ RDD is too large to fit in memory​

❌ Using DataFrames instead​

How to Uncache / Remove from Memory​

Debugging: How to See Cached RDDs​

Example: Full Pipeline Using cache() and persist()​

Best Practices for RDD Caching​

🔹 Cache early in iterative algorithms​

🔹 Use MEMORY_ONLY when data fits​

🔹 Use MEMORY_AND_DISK when unsure​

🔹 Don’t cache everything​

🔹 Clean up with unpersist()​

Summary — Caching Makes Spark Lightning Fast​