
Cluster Sizing: Choosing the Right Instance Type

✨ Story Time: “Why Is This Pipeline So Expensive?”

Sara is a data engineer managing multiple ETL pipelines:

  • Some jobs run slow
  • Some jobs fail randomly
  • Some cost too much
  • Analysts complain about dashboards being stuck

The CTO walks by:

“Sara, our cloud bill looks… scary.
Can we optimize our clusters?”

Sara nods.
Cluster sizing isn’t just about performance.
It’s about speed, stability, and cost-efficiency all working together.

And Databricks gives you dozens of instance types…
Which one is the right choice?

Let’s simplify this.


🧩 What Is Cluster Sizing?

Cluster sizing is the process of choosing:

  • Node type (compute-optimized, memory-optimized, GPU, etc.)
  • Number of workers
  • Driver size
  • Autoscaling configuration
  • Spot vs On-demand nodes

Your choices directly impact:

  • Cost
  • Performance
  • Stability
  • Job success rate

Choosing the wrong cluster = Slow + Expensive.
Choosing the right cluster = Fast + Cheap.
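As a rough illustration, the five choices above map onto fields of a cluster specification. The sketch below builds such a spec as a plain Python dict. The field names follow the Databricks Clusters API on AWS as I understand it, so check them against the API reference for your workspace. The cluster name, the helper `build_cluster_spec`, and the `<runtime-version>` placeholder are assumptions made for this example.

```python
import json


def build_cluster_spec(node_type, driver_type, min_workers, max_workers, use_spot):
    """Assemble a cluster spec dict covering the five sizing choices."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": "etl-right-sized",      # hypothetical name
        "spark_version": "<runtime-version>",   # placeholder: use a runtime your workspace lists
        "node_type_id": node_type,              # worker node type
        "driver_node_type_id": driver_type,     # driver size
        "autoscale": {                          # number of workers + autoscaling
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
        "aws_attributes": {                     # spot vs on-demand
            "availability": "SPOT_WITH_FALLBACK" if use_spot else "ON_DEMAND",
            "first_on_demand": 1,               # keep the first node (the driver) on-demand
        },
    }


spec = build_cluster_spec("m5.xlarge", "m5.xlarge", 2, 10, use_spot=True)
print(json.dumps(spec, indent=2))
```

Keeping the spec in one function makes it easy to review every sizing decision in a single place before a cluster is created.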


🏗️ Types of Databricks Cluster Nodes

1. General Purpose (Balanced)

Use when you don’t know what to choose.

Great for:

  • Medium ETL jobs
  • Not-too-heavy SQL queries
  • Mixed workloads

Examples:

  • m5.xlarge
  • m5.2xlarge

2. Compute-Optimized

High CPU power, great for parallel workloads.

Best for:

✔ Photon workloads
✔ SQL-heavy jobs
✔ Aggregations & group-bys
✔ BI dashboards

Examples:

  • c5.xlarge
  • c5.2xlarge

3. Memory-Optimized

High RAM, great for large joins & heavy shuffle.

Best for:

✔ ETL pipelines
✔ Machine learning feature joins
✔ Caching large datasets

Examples:

  • r5.xlarge
  • r5.4xlarge

4. Storage-Optimized

Useful when you need fast local disk, e.g., for Delta caching.

Best for:

✔ Photon
✔ Data skipping workloads
✔ Large Delta tables

Examples:

  • i3.xlarge
  • i3en.2xlarge

5. GPU Nodes

Best for ML training & deep learning, not SQL/ETL.

Examples:

  • p3.2xlarge
  • g4dn.xlarge
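To make the five categories easier to apply, here is a small sketch that classifies an AWS instance name by its family prefix. The mapping covers only the example instance types listed above, and the helper name `node_family` is made up for this illustration.

```python
# Family prefixes taken from the example instance types in this section.
FAMILY_BY_PREFIX = {
    "m5": "general-purpose",
    "c5": "compute-optimized",
    "r5": "memory-optimized",
    "i3": "storage-optimized",
    "i3en": "storage-optimized",
    "p3": "gpu",
    "g4dn": "gpu",
}


def node_family(instance_type: str) -> str:
    """Return the node category for an instance name such as 'r5.4xlarge'."""
    prefix = instance_type.split(".", 1)[0]
    return FAMILY_BY_PREFIX.get(prefix, "unknown")


for name in ["m5.xlarge", "c5.2xlarge", "r5.4xlarge", "i3en.2xlarge", "g4dn.xlarge"]:
    print(f"{name} -> {node_family(name)}")
```

Anything outside the table comes back as "unknown" rather than a guess, which is the safer default when new instance families appear.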

🚀 Choosing Worker Count

A common mistake:

Choosing too many or too few workers.

General rule:

| Data Volume   | Recommended Workers |
|---------------|---------------------|
| < 50 GB       | 2–4 workers         |
| 50–500 GB     | 4–8 workers         |
| 500 GB – 2 TB | 8–16 workers        |
| 2 TB+         | 16–32 workers       |

Always start small → scale up only if needed.
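The rule-of-thumb table can be written as a simple lookup. This is a minimal sketch, not an official formula: the function name is invented, 2 TB is treated as 2000 GB, and a value that falls exactly on a boundary is assigned to the larger tier, which the table itself does not specify.

```python
from typing import Tuple


def recommended_workers(data_gb: float) -> Tuple[int, int]:
    """Return a (low, high) worker range for a given data volume in GB."""
    if data_gb < 0:
        raise ValueError("data volume cannot be negative")
    if data_gb < 50:
        return (2, 4)
    if data_gb < 500:
        return (4, 8)
    if data_gb < 2000:   # assumption: 2 TB == 2000 GB
        return (8, 16)
    return (16, 32)


for volume in [10, 120, 800, 5000]:
    low, high = recommended_workers(volume)
    print(f"{volume} GB -> {low}-{high} workers")
```

In line with "start small", the lower end of the returned range is a natural first setting, with the upper end as a ceiling to grow toward.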


🔄 Autoscaling Best Practices

🟩 Enable autoscaling

It saves cost by dynamically adjusting cluster size.

🟩 Keep min nodes small

Avoid paying for idle nodes.

🟩 Keep max nodes reasonable

Prevent runaway scaling.

Example:

Min Workers: 2
Max Workers: 10
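One way to see why the maximum matters: it caps the worst-case hourly spend. The sketch below validates the min/max pair and computes that ceiling. The per-node price is a made-up number for illustration, and the driver is assumed to cost the same as a worker.

```python
def autoscale_block(min_workers: int, max_workers: int) -> dict:
    """Build an autoscale setting after basic sanity checks."""
    if min_workers < 0:
        raise ValueError("min_workers cannot be negative")
    if max_workers < min_workers:
        raise ValueError("max_workers must be >= min_workers")
    return {"autoscale": {"min_workers": min_workers, "max_workers": max_workers}}


def worst_case_hourly_cost(max_workers: int, price_per_node_hour: float) -> float:
    """Upper bound on hourly cost: all workers running, plus one driver."""
    return (max_workers + 1) * price_per_node_hour


config = autoscale_block(2, 10)
print(config)
# Hypothetical price of $0.50 per node-hour, not a real quote.
print(worst_case_hourly_cost(10, 0.50))  # 5.5
```

Raising the maximum from 10 to 50 under the same assumptions would raise that ceiling roughly fivefold, which is what "runaway scaling" looks like on the bill.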

🟩 Use Enhanced Autoscaling

Better for bursty and unpredictable workloads.


🧪 Real-World Example: Cost Cut by 40%

Sara’s ETL pipeline was running on:

  • 32 workers
  • r5.8xlarge (huge & expensive)
  • No autoscaling

Cost was $120/hour for a single daily job.

After right-sizing:

  • 8 workers
  • c5.2xlarge (cheaper & faster for SQL)
  • Autoscaling 4 → 12

New cost: $72/hour
Performance: 30% faster
Stability: Improved dramatically

Right-sizing = $$$ saved + faster jobs.
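The 40% figure is the drop in hourly rate. Because the job also finishes sooner, the saving per run is larger. The arithmetic below works through both numbers. Two assumptions are not in the example: the old run is taken to last one hour, and "30% faster" is read as 30% less runtime.

```python
def savings_fraction(old_cost: float, new_cost: float) -> float:
    """Fraction of cost saved when moving from old_cost to new_cost."""
    return (old_cost - new_cost) / old_cost


OLD_RATE = 120.0  # $/hour, from the example above
NEW_RATE = 72.0   # $/hour, from the example above

print(f"Hourly savings: {savings_fraction(OLD_RATE, NEW_RATE):.0%}")  # Hourly savings: 40%

# Assumptions (not from the example): the old run took 1 hour, and
# "30% faster" means the new run takes 30% less time.
OLD_HOURS = 1.0
NEW_HOURS = OLD_HOURS * 0.7

old_run_cost = OLD_RATE * OLD_HOURS
new_run_cost = NEW_RATE * NEW_HOURS
print(f"Per-run cost: ${old_run_cost:.2f} -> ${new_run_cost:.2f}")  # $120.00 -> $50.40
print(f"Per-run savings: {savings_fraction(old_run_cost, new_run_cost):.0%}")  # 58%
```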


📦 Cluster Sizing Checklist

🟩 1. What type of workload?

| Workload    | Best Node Type                       |
|-------------|--------------------------------------|
| SQL / BI    | Compute-optimized or Photon          |
| ETL         | General-purpose or memory-optimized  |
| ML Training | GPU                                  |
| Delta-heavy | Storage-optimized                    |

🟩 2. How much data?

Size workers based on volume.

🟩 3. How much shuffling?

More shuffle = more memory needed.

🟩 4. Does caching matter?

Use i3 / i3en for fast SSD local caching.

🟩 5. Use spot instances for non-critical jobs

Spot = cheap
On-demand = reliable
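The checklist can be read as a small decision procedure. The sketch below encodes it under a few assumptions of mine: the workload labels are invented keys, heavy shuffle only changes the answer for ETL, and a need for local caching overrides the node family for every workload except ML training. Treat it as a starting point for discussion, not a rule.

```python
def recommend(workload: str, heavy_shuffle: bool = False,
              needs_local_cache: bool = False, critical: bool = True) -> dict:
    """Suggest a node family and purchase option from the checklist answers."""
    families = {
        "sql": "compute-optimized",
        "etl": "general-purpose",
        "ml": "gpu",
        "delta": "storage-optimized",
    }
    if workload not in families:
        raise ValueError(f"unknown workload: {workload!r}")

    family = families[workload]
    if workload == "etl" and heavy_shuffle:
        family = "memory-optimized"      # more shuffle = more memory needed
    if needs_local_cache and workload != "ml":
        family = "storage-optimized"     # assumption: caching takes priority

    return {
        "node_family": family,
        "availability": "on-demand" if critical else "spot",
    }


print(recommend("etl", heavy_shuffle=True, critical=False))
# {'node_family': 'memory-optimized', 'availability': 'spot'}
```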


🎯 Best Practices for Cluster Sizing

  • Don’t oversize: start small and scale.
  • Use Photon for SQL-intensive workloads.
  • Enable autoscaling.
  • Use spot workers for non-critical pipelines.
  • Avoid GPU nodes unless doing ML.
  • Cache hot data only when useful.
  • Consider job clusters for ETL pipelines.
  • For production SQL dashboards → use Databricks SQL Warehouses, not clusters.
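Several of these practices can be combined in one job definition: a job cluster that exists only for the run, with Photon and autoscaling enabled. The payload below is a sketch in the shape of the Databricks Jobs API as I understand it (`job_clusters`, `new_cluster`, `runtime_engine`). The job name, cluster key, and notebook path are hypothetical. Whether a given instance type supports Photon depends on your workspace, so verify the fields and the node type against the official reference before using this.

```python
import json

job_payload = {
    "name": "nightly-etl",                          # hypothetical job name
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",       # hypothetical key
            "new_cluster": {
                "spark_version": "<runtime-version>",  # placeholder, fill in a real runtime
                "node_type_id": "c5.2xlarge",       # from the right-sizing example
                "runtime_engine": "PHOTON",         # assumption: Photon is available for this setup
                "autoscale": {"min_workers": 4, "max_workers": 12},
            },
        }
    ],
    "tasks": [
        {
            "task_key": "run_etl",
            "job_cluster_key": "etl_cluster",       # task runs on the job cluster above
            "notebook_task": {"notebook_path": "/Workspace/etl/nightly"},  # hypothetical path
        }
    ],
}

print(json.dumps(job_payload, indent=2))
```

Because the cluster is defined inside the job, it is created when the run starts and released when it ends, so nothing sits idle between runs.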

📘 Summary

  • Cluster sizing is essential for balancing speed, cost, and reliability.
  • Databricks offers multiple node types; choose based on workload.
  • Autoscaling and Photon can significantly improve efficiency.
  • Right-sized clusters reduce cost and increase performance.
  • Understanding your data volume and query patterns is the key to picking the right instance.

Choose smart clusters → save money → boost performance → make your team happy.


👉 Next Topic

SQL Endpoint Tuning β€” Query Performance Optimization
