
Databricks Pricing: How Clusters, SQL & Jobs Are Charged

Welcome back to ShopWave, our fictional retail company.
Your manager asks a critical question during a budget review meeting:

“How much are we spending on Databricks, and why does it fluctuate?”

Understanding Databricks pricing is essential for controlling costs and planning resources effectively.


🏗️ Pricing Is Based on Compute + Storage

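A Databricks bill has two main parts:

  1. Compute: running clusters or SQL warehouses, metered in Databricks Units (DBUs)
  2. Storage: Delta tables, files in DBFS, and cloud storage

On classic (non-serverless) compute, the cloud provider also charges separately for the underlying virtual machines. The examples in this article fold everything into a single hourly rate per node to keep the arithmetic simple.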

Think of it like this:

Compute = “How hard the engine works”
Storage = “How much room you use in the warehouse”


💻 Cluster Pricing

Clusters are the main compute engine for:

  • Notebooks
  • ETL pipelines
  • Machine learning
  • Streaming jobs

Pricing depends on:

  • Number of nodes (driver + worker nodes)
  • Node type (standard, memory-optimized, GPU)
  • Cluster type:
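    • Interactive (all-purpose) → billed per second while running, including idle time
    • Job clusters → billed per second for the duration of each run, typically at a lower rate than all-purpose compute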

Example at ShopWave:

  • Small Python notebook cluster: 2 nodes × $0.20/hr → ~$0.40/hr
  • Large ML GPU cluster: 4 nodes × $2/hr → ~$8/hr
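
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a Databricks API: the function name `cluster_cost` is hypothetical, the rates are the example figures from this article, and the 16 idle hours are an assumed value chosen to show what a cluster left running overnight might cost.

```python
def cluster_cost(nodes: int, rate_per_node_hour: float, hours: float = 1.0) -> float:
    """Estimate cluster cost as nodes x hourly rate per node x hours running.

    Hypothetical helper for illustration only; real bills depend on the
    cloud provider, node type, and Databricks plan.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if rate_per_node_hour < 0 or hours < 0:
        raise ValueError("rate and hours must be non-negative")
    return nodes * rate_per_node_hour * hours


# Example figures from the ShopWave scenario.
notebook_hourly = cluster_cost(nodes=2, rate_per_node_hour=0.20)
gpu_hourly = cluster_cost(nodes=4, rate_per_node_hour=2.00)

# Assumed value: the notebook cluster is left running for 16 idle hours.
idle_cost = cluster_cost(nodes=2, rate_per_node_hour=0.20, hours=16)

print(f"Notebook cluster: ${notebook_hourly:.2f}/hr")
print(f"GPU cluster:      ${gpu_hourly:.2f}/hr")
print(f"16 idle hours on the notebook cluster: ${idle_cost:.2f}")
```

Even a small cluster adds up when it idles every night, which is why idle time matters as much as node count.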

💡 Tip: Terminate idle clusters, or set an auto-termination timeout, so you stop paying for compute nobody is using.


⚑ SQL Warehouse Pricing​

SQL warehouses (formerly SQL endpoints) are optimized for dashboards and analytics.

  • Billed based on warehouse size × time running
  • Can scale automatically by adding or removing clusters as load changes
  • Concurrency matters: more users querying → more capacity needed → higher cost

ShopWave scenario:

  • A dashboard warehouse with 4 “serverless” units → ~$1.50/hr
  • During peak reporting → auto-scale to 8 units → ~$3/hr
  • At night → auto-terminate → $0/hr

SQL warehouses cost less when auto-scaling and auto-termination are enabled, because you pay only while the warehouse is running and only for the capacity in use.
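
To see how much the schedule matters, here is a rough sketch that prices one day of warehouse usage. The per-unit rate is derived from the example figures above (4 units at about $1.50/hr). The daily pattern (10 normal hours, 2 peak hours, 12 hours stopped) and the function name `warehouse_daily_cost` are assumptions made for illustration.

```python
# Price per unit-hour implied by the example: 4 units cost about $1.50/hr.
RATE_PER_UNIT_HOUR = 1.50 / 4


def warehouse_daily_cost(schedule, rate_per_unit_hour=RATE_PER_UNIT_HOUR):
    """Sum the cost of (hours, units) segments; 0 units means the warehouse is stopped."""
    for hours, units in schedule:
        if hours < 0 or units < 0:
            raise ValueError("hours and units must be non-negative")
    return sum(hours * units * rate_per_unit_hour for hours, units in schedule)


# Assumed daily pattern: 10 h at 4 units, 2 h at 8 units, 12 h auto-terminated.
autoscaled_day = [(10, 4), (2, 8), (12, 0)]

# Comparison case: the warehouse is pinned at peak size around the clock.
always_on_peak = [(24, 8)]

autoscaled_cost = warehouse_daily_cost(autoscaled_day)
always_on_cost = warehouse_daily_cost(always_on_peak)

print(f"Auto-scaled with auto-termination: ${autoscaled_cost:.2f}/day")
print(f"Always on at peak size:            ${always_on_cost:.2f}/day")
```

Under these assumed numbers, the always-on warehouse costs several times more per day than the auto-scaled one for the same dashboards.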


🏃 Jobs Pricing

Databricks Jobs are scheduled workflows (ETL, ML pipelines, notebooks).

  • Charged based on the compute used during execution
  • Job clusters are temporary → cost only while running
  • Total cost = run duration × number of nodes × rate for the node type

Example at ShopWave:

  • Daily ETL job runs for 30 minutes on a 3-node cluster
  • 3 nodes × $0.50/hr × 0.5 hr = $0.75/day
  • Monthly cost ≈ $22.50 (30 runs)
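
The same calculation as a small Python sketch, which also hints at why the bill fluctuates: if a run takes longer, the cost rises in proportion. The function names are hypothetical, and the 45-minute run is an assumed scenario, not a figure from ShopWave.

```python
def job_cost_per_run(nodes: int, rate_per_node_hour: float, minutes: float) -> float:
    """Cost of one job run: nodes x hourly rate per node x duration in hours."""
    return nodes * rate_per_node_hour * (minutes / 60)


def monthly_job_cost(nodes: int, rate_per_node_hour: float, minutes: float,
                     runs_per_month: int = 30) -> float:
    """Cost of a job that runs `runs_per_month` times with the same duration."""
    return job_cost_per_run(nodes, rate_per_node_hour, minutes) * runs_per_month


# Figures from the ShopWave example: 3 nodes, $0.50/hr per node, 30 minutes.
per_run = job_cost_per_run(nodes=3, rate_per_node_hour=0.50, minutes=30)
monthly = monthly_job_cost(nodes=3, rate_per_node_hour=0.50, minutes=30)

# Assumed scenario: data volume grows and each run now takes 45 minutes.
monthly_slower = monthly_job_cost(nodes=3, rate_per_node_hour=0.50, minutes=45)

print(f"Per run:             ${per_run:.2f}")
print(f"Monthly (30 min):    ${monthly:.2f}")
print(f"Monthly (45 min):    ${monthly_slower:.2f}")
```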

💰 Storage Costs

  • DBFS storage = cost of underlying cloud storage (S3, ADLS, GCS)
  • Delta tables, CSV, Parquet, or model artifacts stored here
  • Charges depend on size and retention
  • Versioning and time travel in Delta Lake also consume storage

ShopWave tip: Remove old Delta versions you no longer need for time travel (for example, with the VACUUM command) to save costs.
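
As a rough sketch of why retained versions matter, the snippet below compares monthly storage cost before and after cleanup. Every number here is an assumption for illustration: the $0.02 per GB-month rate, the table sizes, and the function name `storage_monthly_cost` are hypothetical, so check your cloud provider's price list for real figures.

```python
# Hypothetical storage price; actual rates vary by provider, region, and tier.
RATE_PER_GB_MONTH = 0.02


def storage_monthly_cost(current_gb: float, old_versions_gb: float,
                         rate_per_gb_month: float = RATE_PER_GB_MONTH) -> float:
    """Monthly cost of a table's current data plus files kept for old versions."""
    return (current_gb + old_versions_gb) * rate_per_gb_month


# Assumed sizes: 500 GB of current data, 1,500 GB of files from old versions.
before_cleanup = storage_monthly_cost(current_gb=500, old_versions_gb=1500)

# Assumed result of cleanup: only 100 GB of recent history is retained.
after_cleanup = storage_monthly_cost(current_gb=500, old_versions_gb=100)

print(f"Before cleanup: ${before_cleanup:.2f}/month")
print(f"After cleanup:  ${after_cleanup:.2f}/month")
print(f"Savings:        ${before_cleanup - after_cleanup:.2f}/month")
```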


🔄 Cost Optimization Tips

  1. Auto-terminate clusters → no idle costs
  2. Use job clusters → temporary compute for pipelines
  3. Auto-scale SQL warehouses → right-size for concurrency
  4. Monitor usage metrics → identify expensive workloads
  5. Archive or delete old data → reduce storage charges
  6. Use spot/preemptible instances → lower compute costs
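
Tip 4 can be sketched as a simple aggregation: total the cost per workload and rank the results. The records below are made-up sample data, and the field names `workload` and `cost_usd` are hypothetical. In a real workspace you would load equivalent rows from your billing or usage reports.

```python
from collections import defaultdict

# Made-up sample records; field names are assumptions for this sketch.
usage_records = [
    {"workload": "etl_daily", "cost_usd": 0.75},
    {"workload": "bi_dashboard", "cost_usd": 21.00},
    {"workload": "ml_training", "cost_usd": 24.00},
    {"workload": "etl_daily", "cost_usd": 0.75},
    {"workload": "ml_training", "cost_usd": 16.00},
]


def cost_by_workload(records):
    """Return (workload, total_cost) pairs sorted from most to least expensive."""
    totals = defaultdict(float)
    for record in records:
        totals[record["workload"]] += record["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = cost_by_workload(usage_records)
for workload, total in ranking:
    print(f"{workload:<14} ${total:>7.2f}")
```

The workload at the top of the ranking is usually the best place to start looking for savings.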

🧠 Real Business Example: ShopWave

  1. Data engineering team runs ETL jobs on job clusters → billed only for runtime.
  2. BI dashboards use serverless SQL warehouses → auto-scaled to save money.
  3. ML team trains models on GPU clusters → costs monitored and allocated to projects.
  4. Admin regularly cleans old DBFS files → storage costs minimized.

Result: Optimized compute + storage → predictable monthly costs.


🏁 Quick Summary​

  • Databricks pricing = compute + storage
  • Clusters = charged per node × time, interactive or job-based
  • SQL warehouses = charged per compute unit × time, optimized for BI
  • Jobs = charged only while running on job clusters
  • Storage = underlying cloud storage usage + Delta Lake versioning
  • Cost optimization = auto-terminate clusters, auto-scale warehouses, clean storage

🚀 Coming Next

👉 Databricks Community Edition vs Enterprise vs Premium
