Databricks Pricing: How Clusters, SQL & Jobs Are Charged
Welcome back to ShopWave, our fictional retail company.
Your manager asks a critical question during a budget review meeting:
"How much are we spending on Databricks, and why does it fluctuate?"
Understanding Databricks pricing is essential for controlling costs and planning resources effectively.
Pricing Is Based on Compute + Storage
Databricks bills based on:
- Compute: running clusters or SQL warehouses
- Storage: Delta tables, files in DBFS, and cloud storage
Think of it like this:
Compute = "How hard the engine works"
Storage = "How much room you use in the warehouse"
Cluster Pricing
Clusters are the main compute engine for:
- Notebooks
- ETL pipelines
- Machine learning
- Streaming jobs
Pricing depends on:
- Number of nodes (driver + worker nodes)
- Node type (standard, memory-optimized, GPU)
- Cluster type:
  - Interactive: billed per second while active
  - Job clusters: billed only for the compute time of each run
Example at ShopWave:
- Small Python notebook cluster: 2 nodes × $0.20/hr → ~$0.40/hr
- Large ML GPU cluster: 4 nodes × $2/hr → ~$8/hr
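The arithmetic behind these two figures can be sketched in a few lines of Python. This is only an illustration: the function name `cluster_hourly_cost` is made up for this example, and it assumes every node in the cluster is billed at the same flat hourly rate, as in the ShopWave numbers above.

```python
def cluster_hourly_cost(num_nodes: int, rate_per_node_hour: float) -> float:
    """Hourly cost of a cluster, assuming one flat rate for every node."""
    if num_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return num_nodes * rate_per_node_hour


# ShopWave's two example clusters (rates are the article's illustrative figures)
small_notebook = cluster_hourly_cost(2, 0.20)
large_gpu = cluster_hourly_cost(4, 2.00)

print(f"Small notebook cluster: ~${small_notebook:.2f}/hr")
print(f"Large ML GPU cluster:   ~${large_gpu:.2f}/hr")
```

Multiplying the result by the hours a cluster stays up gives a rough idea of what an idle cluster costs when nobody shuts it down.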
Tip: Terminate idle clusters to save costs.
SQL Warehouse Pricing
SQL warehouses (formerly SQL endpoints) are optimized for dashboards and analytics.
- Billed based on compute size + time running
- Can scale up or down automatically
- Concurrency matters: more users querying → bigger warehouse → higher cost
ShopWave scenario:
- A dashboard warehouse with 4 "serverless" units → ~$1.50/hr
- During peak reporting → auto-scale to 8 units → ~$3/hr
- At night → auto-terminate → $0/hr
SQL warehouses cost less when auto-scaling and auto-termination are enabled, because you stop paying for capacity nobody is using.
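To see how much the scaling pattern matters, here is a rough sketch that adds up one day of warehouse usage. The per-unit rate is derived from the scenario above (4 units ≈ $1.50/hr). The daily schedule (8 normal hours, 2 peak hours, 14 hours terminated) and the function name `warehouse_daily_cost` are assumptions made up for this example.

```python
# Implied by the ShopWave scenario: 4 units cost about $1.50/hr
RATE_PER_UNIT_HOUR = 1.50 / 4


def warehouse_daily_cost(schedule):
    """Sum the cost of a day given (hours, active_units) pairs.

    Zero active units means the warehouse is auto-terminated.
    """
    return sum(hours * units * RATE_PER_UNIT_HOUR for hours, units in schedule)


# Hypothetical day: 8 normal hours, 2 peak hours, 14 hours terminated
autoscaled_day = [(8, 4), (2, 8), (14, 0)]
always_on_day = [(24, 4)]  # same warehouse, never terminated, never scaled

print(f"Auto-scaled + auto-terminated: ${warehouse_daily_cost(autoscaled_day):.2f}/day")
print(f"Always on at 4 units:          ${warehouse_daily_cost(always_on_day):.2f}/day")
```

Under these assumed hours, the warehouse that terminates at night costs half as much as the one left running, even though it doubles in size during the peak.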
Jobs Pricing
Databricks Jobs are scheduled workflows (ETL, ML pipelines, notebooks).
- Charged based on the compute used during execution
- Job clusters are temporary, so they cost money only while running
- Total cost = run duration × the cluster's hourly rate (node count × rate per node)
Example at ShopWave:
- Daily ETL job runs for 30 minutes on a 3-node cluster
- 3 nodes × $0.50/hr × 0.5 hr = $0.75/day
- Monthly cost ≈ $22.50 (30 daily runs)
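The same calculation as a small Python sketch. The function name `job_run_cost` is made up for this example, and the 30-day month is an assumption that matches the $22.50 figure above.

```python
def job_run_cost(num_nodes: int, rate_per_node_hour: float, runtime_minutes: float) -> float:
    """Cost of one job run: nodes × hourly rate per node × runtime in hours."""
    return num_nodes * rate_per_node_hour * (runtime_minutes / 60)


DAYS_PER_MONTH = 30  # assumption: one run per day, 30-day month

daily_cost = job_run_cost(num_nodes=3, rate_per_node_hour=0.50, runtime_minutes=30)
monthly_cost = daily_cost * DAYS_PER_MONTH

print(f"Daily ETL cost:   ${daily_cost:.2f}")
print(f"Monthly ETL cost: ${monthly_cost:.2f}")
```

Changing `runtime_minutes` shows why a slow job drives up the bill: a run that takes 60 minutes instead of 30 costs twice as much on the same cluster.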
Storage Costs
- DBFS storage = cost of underlying cloud storage (S3, ADLS, GCS)
- Delta tables, CSV, Parquet, or model artifacts stored here
- Charges depend on size and retention
- Versioning and time travel in Delta Lake also consume storage
ShopWave tip: Clean up old Delta versions to save costs.
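A rough way to see what old Delta versions cost is to price them alongside the live data. Everything numeric here is a placeholder: the per-GB rate, the 500 GB of live data, and the 200 GB of retained old versions are invented for illustration, so substitute your cloud provider's actual storage price and your own table sizes.

```python
# Placeholder price per GB-month, NOT a real quote; check your cloud provider's pricing
ASSUMED_RATE_PER_GB_MONTH = 0.023


def storage_monthly_cost(live_gb: float,
                         old_versions_gb: float = 0.0,
                         rate_per_gb_month: float = ASSUMED_RATE_PER_GB_MONTH) -> float:
    """Monthly storage cost for live data plus retained old Delta versions."""
    return (live_gb + old_versions_gb) * rate_per_gb_month


# Hypothetical table: 500 GB live, 200 GB of old versions kept for time travel
before_cleanup = storage_monthly_cost(live_gb=500, old_versions_gb=200)
after_cleanup = storage_monthly_cost(live_gb=500)

print(f"Before cleanup: ${before_cleanup:.2f}/month")
print(f"After cleanup:  ${after_cleanup:.2f}/month")
```

In Delta Lake, this kind of cleanup is typically done with the VACUUM command, which removes data files that are no longer referenced by the table and are older than the retention period. Files removed this way are no longer available for time travel.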
Cost Optimization Tips
- Auto-terminate clusters → no idle costs
- Use job clusters → temporary compute for pipelines
- Auto-scale SQL warehouses → right-size for concurrency
- Monitor usage metrics → identify expensive workloads
- Archive or delete old data → reduce storage charges
- Use spot/preemptible instances → lower compute costs
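To get a feel for how two of these tips stack up, here is a simple what-if sketch. It reuses the $0.40/hr notebook cluster from earlier. The 8 active hours per day, the 30-day month, and the 50% spot discount are made-up numbers for illustration only. Real spot discounts vary, and they usually apply to the cloud VM portion of the bill rather than to everything.

```python
def monthly_compute_cost(hourly_rate: float,
                         hours_per_day: float,
                         days: int = 30,
                         spot_discount: float = 0.0) -> float:
    """Monthly compute cost, with an optional flat spot discount (0.0 to 1.0)."""
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    return hourly_rate * hours_per_day * days * (1 - spot_discount)


HOURLY_RATE = 0.40  # the small notebook cluster from the cluster pricing example

always_on = monthly_compute_cost(HOURLY_RATE, hours_per_day=24)
auto_terminated = monthly_compute_cost(HOURLY_RATE, hours_per_day=8)  # assumed 8 active hrs
with_spot = monthly_compute_cost(HOURLY_RATE, hours_per_day=8, spot_discount=0.5)  # assumed 50%

print(f"Always on:              ${always_on:.2f}/month")
print(f"Auto-terminated:        ${auto_terminated:.2f}/month")
print(f"Auto-terminated + spot: ${with_spot:.2f}/month")
```

The point of the sketch is the ordering, not the exact dollar amounts: cutting idle hours usually saves the most, and spot pricing is a further reduction on top.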
Real Business Example: ShopWave
- Data engineering team runs ETL jobs on job clusters → billed only for runtime.
- BI dashboards use serverless SQL warehouses → auto-scaled to save money.
- ML team trains models on GPU clusters → costs monitored and allocated to projects.
- Admin regularly cleans old DBFS files → storage costs minimized.
Result: Optimized compute + storage → predictable monthly costs.
Quick Summary
- Databricks pricing = compute + storage
- Clusters = charged per node × time, interactive or job-based
- SQL warehouses = charged per compute unit × time, optimized for BI
- Jobs = charged only while running on job clusters
- Storage = underlying cloud storage usage + Delta Lake versioning
- Cost optimization = auto-terminate clusters, auto-scale warehouses, clean storage
Coming Next
Databricks Community Edition vs Enterprise vs Premium