Skip to main content

RDD Persistence & Caching β€” Memory Management in Spark

NeoMart’s data team is running an advanced analytics pipeline on customer clickstream logs. The process includes:

  • Cleaning raw data
  • Extracting session-level metrics
  • Running machine learning transformations
  • Aggregating results for dashboards

Each step uses the same processed RDD multiple times.

But there is a problem:
Running the entire pipeline again and again takes too long. Spark must recompute every transformation from scratch, rebuilding lineage and rerunning all upstream stages.

Enter RDD Persistence & Caching β€” Spark’s way of remembering data for faster computations, saving precious time, money, and compute resources.


Why Do We Need Caching?​

Spark uses lazy evaluation, meaning RDD transformations are not executed unless an action triggers them.
So if an RDD is used multiple times:

cleaned_data.count()
cleaned_data.take(10)
cleaned_data.saveAsTextFile("/mnt/output")

Spark recomputes cleaned_data three separate times unless you cache it.

Caching solves this by storing the RDD in memory (or memory + disk) so repeated access is instant.


cache() vs persist()​

Spark provides two main ways to store RDDs:

### 1. cache()​

Stores RDD in memory only.

rdd.cache()

Equivalent to:

rdd.persist(StorageLevel.MEMORY_ONLY)

### 2. persist()​

Allows specifying different storage levels.

from pyspark import StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)

Used when data may not fit entirely in memory.


Available Storage Levels​

Storage LevelDescription
MEMORY_ONLYFastest, but may fail if RDD doesn’t fit in memory
MEMORY_AND_DISKStores what fits in memory; spills the rest to disk
DISK_ONLYSlower, but ensures full persistence
MEMORY_ONLY_SERSerialized in memory β€” reduces size but increases CPU cost
MEMORY_AND_DISK_SERBalanced storage & reliability
OFF_HEAPFor external memory (Tungsten), rare in typical workloads

Story Example: NeoMart Recommendation Pipeline​

NeoMart runs a sessionization workflow to build personalized recommendations.

Without caching:​

  • Each model training iteration recomputes raw logs
  • Session extraction runs again
  • Feature engineering runs again
  • Total time: 45 minutes

With caching:​

sessions = (
logs
.filter(lambda x: "session" in x)
.map(parse_session)
)

sessions.cache()

model = train_recommendation_model(sessions)

Total time drops to 8 minutes.

Caching saved them over 80% compute time.


When Should You Cache an RDD?​

βœ”When the RDD is reused multiple times​

Example: Training multiple ML models with the same preprocessed data.

βœ”When recomputation cost is expensive​

Example: Custom parsing, UDFs, joins, or external IO.

βœ”When performing iterative algorithms​

  • PageRank
  • K-Means
  • Gradient descent loops

βœ”When running multiple actions on the same RDD​

Such as count(), take(), collect(), saveAsTextFile().


When Not to Cache​

❌ RDD is used only once​

Caching wastes memory.

❌ RDD is too large to fit in memory​

Prefer MEMORY_AND_DISK or avoid caching.

❌ Using DataFrames instead​

Spark automatically optimizes them with Catalyst & Tungsten.


How to Uncache / Remove from Memory​

Memory is limited. After you're done, always clean up:

rdd.unpersist()

Or remove all cached objects:

spark.catalog.clearCache()

Debugging: How to See Cached RDDs​

In Databricks or Spark UI:

  • Open the Storage tab
  • View size, partitions, and storage level
  • Monitor memory usage
  • Identify partitions not cached due to size

This helps optimize cluster resources effectively.


Example: Full Pipeline Using cache() and persist()​

from pyspark import StorageLevel

logs = sc.textFile("/mnt/neomart/raw_logs")

clean = logs \
.filter(lambda x: "event" in x) \
.map(lambda x: parse_event(x))

clean.persist(StorageLevel.MEMORY_AND_DISK)

# Perform multiple actions without recomputation
print(clean.count())
print(clean.take(5))

daily_stats = clean \
.map(lambda x: (x.date, 1)) \
.reduceByKey(lambda x, y: x + y)

Best Practices for RDD Caching​

πŸ”Ή Cache early in iterative algorithms​

Avoid repeating expensive transformations.

πŸ”Ή Use MEMORY_ONLY when data fits​

Fastest option.

πŸ”Ή Use MEMORY_AND_DISK when unsure​

Safe and reliable.

πŸ”Ή Don’t cache everything​

Be selective to avoid memory pressure.

πŸ”Ή Clean up with unpersist()​

Especially in long Databricks jobs.


Summary β€” Caching Makes Spark Lightning Fast​

  • RDD caching prevents expensive recomputations.
  • cache() stores data in memory; persist() lets you choose storage levels.
  • Useful for ML loops, repeated actions, and expensive pipelines.
  • Improves performance and reduces cluster cost.
  • Spark UI helps monitor cached datasets and memory usage.

Caching is one of the most powerful performance tools in Spark β€” when used wisely, it turns slow pipelines into near real-time workflows.


Next, we’ll cover Creating DataFrames from CSV, JSON, Parquet, and Hive Tables.




Career