Cluster Sizing: Choosing the Right Instance Type
✨ Story Time: "Why Is This Pipeline So Expensive?"
Sara is a data engineer managing multiple ETL pipelines:
- Some jobs run slow
- Some jobs fail randomly
- Some cost too much
- Analysts complain about dashboards being stuck
The CTO walks by:
"Sara, our cloud bill looks... scary.
Can we optimize our clusters?"
Sara nods.
Cluster sizing isn't just about performance.
It's about speed, stability, and cost-efficiency all working together.
And Databricks gives you dozens of instance types...
Which one is the right choice?
Let's simplify this.
🧩 What Is Cluster Sizing?
Cluster sizing is the process of choosing:
- Node type (compute-optimized, memory-optimized, GPU, etc.)
- Number of workers
- Driver size
- Autoscaling configuration
- Spot vs On-demand nodes
Your choices directly impact:
- Cost
- Performance
- Stability
- Job success rate
Choosing the wrong cluster = Slow + Expensive.
Choosing the right cluster = Fast + Cheap.
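The five choices above map fairly directly onto the fields of a cluster specification. The sketch below is a minimal illustration, assuming the field names used by the Databricks Clusters API on AWS (`node_type_id`, `driver_node_type_id`, `autoscale`, `aws_attributes`). The `ClusterChoice` class is hypothetical, and the runtime version string is only a placeholder, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class ClusterChoice:
    """Hypothetical container for the sizing decisions listed above."""
    node_type: str      # worker instance type, e.g. "m5.xlarge"
    driver_type: str    # driver instance type
    min_workers: int    # autoscaling lower bound
    max_workers: int    # autoscaling upper bound
    use_spot: bool      # spot workers vs. on-demand workers

    def to_spec(self) -> dict:
        """Build a dict shaped like a cluster-create request (assumed field names)."""
        if not 0 < self.min_workers <= self.max_workers:
            raise ValueError("need 0 < min_workers <= max_workers")
        return {
            "spark_version": "15.4.x-scala2.12",  # placeholder runtime version
            "node_type_id": self.node_type,
            "driver_node_type_id": self.driver_type,
            "autoscale": {
                "min_workers": self.min_workers,
                "max_workers": self.max_workers,
            },
            "aws_attributes": {
                # SPOT_WITH_FALLBACK uses on-demand nodes when spot is unavailable
                "availability": "SPOT_WITH_FALLBACK" if self.use_spot else "ON_DEMAND",
                "first_on_demand": 1,  # keeps the driver on an on-demand node
            },
        }


spec = ClusterChoice("m5.xlarge", "m5.xlarge", 2, 8, use_spot=True).to_spec()
print(spec["autoscale"], spec["aws_attributes"]["availability"])
```

Keeping the decisions in one small object makes it easier to review a sizing change as a diff rather than as a series of UI clicks.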
Types of Databricks Cluster Nodes
1. General Purpose (Balanced)
A sensible default when you don't yet know where the bottleneck is.
Great for:
- Medium ETL jobs
- Not-too-heavy SQL queries
- Mixed workloads
Examples:
- m5.xlarge
- m5.2xlarge
2. Compute-Optimized
High CPU power: great for parallel workloads.
Best for:
- Photon workloads
- SQL-heavy jobs
- Aggregations & group-bys
- BI dashboards
Examples:
- c5.xlarge
- c5.2xlarge
3. Memory-Optimized
High RAM: great for large joins and heavy shuffles.
Best for:
- ETL pipelines
- Machine learning feature joins
- Caching large datasets
Examples:
- r5.xlarge
- r5.4xlarge
4. Storage-Optimized
Useful when you need fast local disk, e.g., for Delta caching.
Best for:
- Photon
- Data skipping workloads
- Large Delta tables
Examples:
- i3.xlarge
- i3en.2xlarge
5. GPU Nodes
Best for ML training & deep learning, not SQL/ETL.
Examples:
- p3.2xlarge
- g4dn.xlarge
Choosing Worker Count
A common mistake is choosing too many or too few workers.
As a rough starting point (the right count also depends on instance size and how much the job shuffles):
| Data Volume | Recommended Workers |
|---|---|
| < 50 GB | 2-4 workers |
| 50-500 GB | 4-8 workers |
| 500 GB-2 TB | 8-16 workers |
| 2 TB+ | 16-32 workers |
Always start small and scale up only if needed.
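The table above can be written as a small lookup, which is handy when a job's input size is known at launch time. This is only a sketch of the rule of thumb: the function name is hypothetical, 1 TB is taken as 1000 GB, and a value sitting exactly on a boundary (for example 500 GB) is assigned to the larger range, which the table itself leaves ambiguous.

```python
def recommended_workers(data_gb: float) -> tuple[int, int]:
    """Return the (low, high) worker range from the rule-of-thumb table.

    Assumes 1 TB = 1000 GB; boundary values fall into the larger range.
    """
    if data_gb < 0:
        raise ValueError("data volume must be non-negative")
    if data_gb < 50:
        return (2, 4)
    if data_gb < 500:
        return (4, 8)
    if data_gb < 2000:
        return (8, 16)
    return (16, 32)


for size in (10, 200, 1500, 5000):
    low, high = recommended_workers(size)
    print(f"{size} GB: start with {low} workers, cap at {high}")
```

The low end of each range is a natural autoscaling minimum and the high end a natural maximum, in line with the "start small" advice.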
Autoscaling Best Practices
🟩 Enable autoscaling
It saves cost by dynamically adjusting cluster size to the workload.
🟩 Keep min nodes small
Avoid paying for idle nodes.
🟩 Keep max nodes reasonable
Prevent runaway scaling.
Example:
Min Workers: 2
Max Workers: 10
🟩 Use Enhanced Autoscaling
Available for Delta Live Tables pipelines; better for bursty and unpredictable workloads.
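One way to sanity-check autoscaling bounds is to look at the hourly cost they imply at both extremes. The sketch below uses the Min 2 / Max 10 example; the function name and the per-node price of $0.50/hour are made-up illustration values, and the driver node is left out for simplicity.

```python
def autoscale_cost_bounds(min_workers: int, max_workers: int,
                          price_per_node_hour: float) -> tuple[float, float]:
    """Return (floor, ceiling) hourly worker cost for an autoscaling cluster."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    floor = min_workers * price_per_node_hour      # paid even when the cluster is idle
    ceiling = max_workers * price_per_node_hour    # worst case if scaling hits the cap
    return floor, ceiling


# Hypothetical price; substitute the real rate for your instance type.
floor, ceiling = autoscale_cost_bounds(2, 10, price_per_node_hour=0.50)
print(f"idle floor: ${floor:.2f}/h, worst case: ${ceiling:.2f}/h")
```

A small floor keeps idle cost low, and an explicit ceiling makes the "runaway scaling" risk a number you can agree on in advance.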
🧪 Real-World Example: Cost Saved by 40%
Sara's ETL pipeline was running on:
- 32 workers
- r5.8xlarge (huge & expensive)
- No autoscaling
Cost was $120/hour for a single daily job.
After right-sizing:
- 8 workers
- c5.2xlarge (cheaper & faster for SQL)
- Autoscaling 4 → 12
New cost: $72/hour (40% lower)
Performance: 30% faster
Stability: Improved dramatically
Right sizing = $$$ saved + faster jobs.
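The 40% figure refers to the hourly rate. Because the job also finishes sooner, the saving per run is larger. The arithmetic below is a sketch under two assumptions that are not in the example itself: "30% faster" is read as 30% less wall-clock time, and the original run time is set to a hypothetical 2 hours (the percentage saving does not depend on that value).

```python
def cost_per_run(rate_per_hour: float, hours: float) -> float:
    """Cost of one run at a flat hourly rate (ignores autoscaling fluctuations)."""
    return rate_per_hour * hours


BASELINE_HOURS = 2.0   # hypothetical run time before right-sizing
SPEEDUP = 0.30         # "30% faster", read here as 30% less wall-clock time

before = cost_per_run(120.0, BASELINE_HOURS)
after = cost_per_run(72.0, BASELINE_HOURS * (1 - SPEEDUP))

hourly_saving = 1 - 72.0 / 120.0
per_run_saving = 1 - after / before

print(f"hourly rate saving: {hourly_saving:.0%}")
print(f"per-run saving:     {per_run_saving:.0%}")
```

Under these assumptions the per-run saving is about 58%, since the new run costs 0.6 x 0.7 = 0.42 of the old one.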
Cluster Sizing Checklist
🟩 1. What type of workload?
| Workload | Best Node Type |
|---|---|
| SQL / BI | Compute-optimized, ideally with Photon enabled |
| ETL | General-purpose or memory-optimized |
| ML Training | GPU |
| Delta-heavy | Storage-optimized |
🟩 2. How much data?
Size the worker count based on data volume.
🟩 3. How much shuffling?
More shuffle = more memory needed.
🟩 4. Does caching matter?
Use i3 / i3en for fast local SSD caching.
🟩 5. Use spot instances for non-critical jobs
Spot = cheap, but nodes can be reclaimed mid-job
On-demand = reliable
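The checklist can be read as a short decision procedure. The sketch below encodes one possible ordering of the questions; the function names, the workload labels, and the precedence (for example, caching needs winning over the workload type) are assumptions made for illustration, not rules stated by Databricks.

```python
def pick_node_family(workload: str, heavy_shuffle: bool = False,
                     cache_matters: bool = False) -> str:
    """Walk the checklist in order and return a node family name."""
    workload = workload.lower()
    if workload == "ml_training":
        return "gpu"
    if cache_matters or workload == "delta_heavy":
        return "storage-optimized"   # i3 / i3en: fast local SSD
    if workload == "sql_bi":
        return "compute-optimized"
    if workload == "etl":
        # more shuffle means more memory is needed
        return "memory-optimized" if heavy_shuffle else "general-purpose"
    return "general-purpose"         # fallback when the workload is unknown


def pick_availability(critical: bool) -> str:
    """Spot for non-critical jobs, on-demand when reliability matters."""
    return "on-demand" if critical else "spot"


print(pick_node_family("etl", heavy_shuffle=True), pick_availability(critical=False))
print(pick_node_family("sql_bi"), pick_availability(critical=True))
```

Treat the result as a first guess to validate against real job metrics rather than a final answer.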
🎯 Best Practices for Cluster Sizing
- Don't oversize: start small and scale.
- Use Photon for SQL-intensive workloads.
- Enable autoscaling.
- Use spot workers for non-critical pipelines.
- Avoid GPU nodes unless doing ML.
- Cache hot data only when useful.
- Consider job clusters for ETL pipelines.
- For production SQL dashboards, use Databricks SQL Warehouses, not clusters.
Summary
- Cluster sizing is essential for balancing speed, cost, and reliability.
- Databricks offers multiple node types; choose based on workload.
- Autoscaling and Photon can significantly improve efficiency.
- Right-sized clusters reduce cost and increase performance.
- Understanding your data volume and query patterns is the key to picking the right instance.
Choose smart clusters → save money → boost performance → make your team happy.
Next Topic
SQL Endpoint Tuning: Query Performance Optimization