OPTIMIZE Command (OPTIMIZE, Z-ORDER): The Secret to Fast Delta Lake Queries
✨ Story Time: "Why is My Query Slower Today?"
Meet Ray, a data engineer working with a large Delta Lake table that receives millions of updates daily.
One morning:
- Yesterday's query ran in 6 seconds
- Today the same query takes over 35 seconds
- The dashboard team is already messaging him...
Ray checks the table and discovers:
- Thousands of small Delta files
- Poor clustering
- No data skipping
- And a warehouse that's working harder than it should
He sighs...
Then smiles, because he knows the fix is simple:
⚡ OPTIMIZE + Z-ORDER
The Databricks "performance boost button."
🧩 What is OPTIMIZE in Databricks?
OPTIMIZE is a Delta Lake command that compacts small files into large, efficient Parquet files.
Why is this important?
Because writing too many small files leads to:
- Slow reads
- High metadata overhead
- Extra compute cost
- Poor parallelization
How OPTIMIZE works:
- Reads many small files
- Combines them into fewer, larger files (usually 128 MB or more; on Databricks the default target is about 1 GB)
- Rewrites files within each partition, so partition boundaries stay the same
- Improves scan performance significantly
Example:
OPTIMIZE sales_delta;
Just one command, and later reads have far fewer files to open.
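To picture the compaction steps listed above, here is a minimal Python sketch of bin-packing small files into larger ones. It is an illustration only, not how Delta Lake implements OPTIMIZE; the function name `plan_compaction`, the file sizes, and the 128 MB target are all assumptions made for the example.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int = 128) -> List[List[int]]:
    """Group files into bins whose total size stays at or under target_mb."""
    bins: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    # Visit the smallest files first so tiny files get merged together.
    for size in sorted(file_sizes_mb):
        # Close the current bin when adding this file would exceed the target.
        if current and current_total + size > target_mb:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins


# Eight hypothetical files, sizes in MB.
small_files = [4, 8, 16, 32, 64, 2, 2, 100]
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "files")
for group in plan:
    print(group, "=", sum(group), "MB")
```

Each inner list stands for one rewritten file. The total amount of data is unchanged; only the number of files a reader has to open goes down.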
What is Z-ORDER?
Z-ORDER is a multi-dimensional clustering technique that groups related data together physically on disk.
This improves data skipping, meaning:
- ⚡ Databricks reads only the files that matter
- ⚡ Not the entire dataset
Perfect for speeding up queries with filters such as:
WHERE customer_id = ...
WHERE date BETWEEN ...
WHERE product_category = ...
Example:
OPTIMIZE sales_delta
ZORDER BY (customer_id, order_date);
This tells Databricks:
"Put rows with similar customer_id and order_date closer together."
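The idea behind a Z-order curve can be sketched in a few lines of Python: interleave the bits of two column values into a single sort key, then sort by that key. This is a toy model under stated assumptions (small non-negative integers, a hypothetical `day_number` standing in for `order_date`, and an invented helper `z_value`); Delta Lake's real implementation is more involved.

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y land on odd positions
    return key


# Hypothetical (customer_id, day_number) pairs on a tiny 4 x 4 grid.
rows = [(c, d) for c in range(4) for d in range(4)]
ordered = sorted(rows, key=lambda r: z_value(r[0], r[1]))

# Cut the sorted rows into four "files" of four rows each.
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]
for f in files:
    print(f)
```

In this toy layout each "file" covers a 2 x 2 block of the grid, so a filter on either column alone only needs two of the four files. That is the property Z-ORDER is after: neither column is favored the way it would be with a plain sort on one of them.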
🎯 When Should You Use OPTIMIZE?
Use it when:
- Your table receives lots of small batch writes
- You have many small files (file fragmentation)
- Query performance drops over time
- Dashboards require fast scans
- Streaming writes produce too many tiny files
Not ideal when:
- Data changes extremely frequently
- You're optimizing huge unpartitioned tables without Z-ORDER
- You run OPTIMIZE far too often (unnecessary compute cost)
🎯 When Should You Use Z-ORDER?
Use Z-ORDER when your queries filter on a specific column frequently:
- Customer-level queries
- Product or SKU-level queries
- Date or timestamp queries
- Geolocation or region filters
- IoT sensors filtered by device_id
Avoid Z-ORDER when:
- Your partitioning already prunes the data your queries filter on
- You rarely filter on the columns
- Your table is small (< 50 GB)
🧪 Real-World Example: 10× Faster Query
Ray's company runs this query all day:
SELECT *
FROM sales_delta
WHERE customer_id = 99821;
Before Z-ORDER:
- Databricks scanned 1,200 files
- Query took 28 seconds
After:
OPTIMIZE sales_delta
ZORDER BY (customer_id);
Results:
- Scanned only 73 files
- Query took 2.1 seconds
- Dashboards loaded instantly
- Ray finally finished his coffee ☕
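Why does the file count drop so sharply? A rough Python sketch of data skipping with per-file min/max statistics shows the mechanism. The data here is synthetic and unrelated to Ray's numbers above, the helper names are invented for the example, and sorting on one column is used as a stand-in for single-column clustering.

```python
from typing import List, Tuple


def chunk(values: List[int], rows_per_file: int) -> List[List[int]]:
    """Split a list of customer ids into fixed-size "files"."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]


def file_stats(files: List[List[int]]) -> List[Tuple[int, int]]:
    """Record the (min, max) customer_id of each file, like per-file column statistics."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats: List[Tuple[int, int]], wanted: int) -> int:
    """Count files whose [min, max] range could contain the wanted value."""
    return sum(1 for lo, hi in stats if lo <= wanted <= hi)


# Synthetic data: 1,000 rows with customer ids 0..99 arriving interleaved.
arrival_order = [i % 100 for i in range(1000)]
unclustered = chunk(arrival_order, 100)          # 10 files, each spans ids 0..99
clustered = chunk(sorted(arrival_order), 100)    # 10 files, each spans 10 ids

before = files_to_scan(file_stats(unclustered), 42)
after = files_to_scan(file_stats(clustered), 42)
print("files scanned before clustering:", before)
print("files scanned after clustering:", after)
```

Before clustering, every file's range includes id 42, so nothing can be skipped. After clustering, only one file's range can contain it, and the rest are skipped without being opened.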
⚡ Benefits of OPTIMIZE + Z-ORDER
| Feature | Benefit |
|---|---|
| File Compaction | Faster reads & fewer metadata operations |
| Data Skipping | Databricks reads only the relevant files |
| Improved Clustering | Better filter performance |
| Lower Cost | Less compute + fewer scanned files |
| Faster Dashboards | BI tools feel "instant" |
🧠 Best Practices
- Run OPTIMIZE on large Delta tables weekly or daily (depending on volume).
- Use ZORDER on the columns most commonly used in WHERE filters.
- Don't Z-ORDER too many columns at once; 1 to 3 is ideal.
- Schedule OPTIMIZE jobs in non-peak hours.
- Avoid running OPTIMIZE on very small tables (less than 10 GB).
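One way to turn these rules of thumb into a scheduled check is a small helper like the one below. Treat it as a sketch: `needs_compaction` is a hypothetical function, the 10 GB floor mirrors the last bullet above, and the 128 MB target and the "less than half the target" cutoff are illustrative assumptions, not Databricks defaults. The two inputs correspond to the `sizeInBytes` and `numFiles` columns returned by `DESCRIBE DETAIL`.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def needs_compaction(total_bytes: int, num_files: int,
                     min_table_bytes: int = 10 * GIB,
                     target_file_bytes: int = 128 * MIB) -> bool:
    """Return True when the table is large enough and its average file is small."""
    # Skip empty tables and tables below the size floor.
    if num_files == 0 or total_bytes < min_table_bytes:
        return False
    average_file_bytes = total_bytes / num_files
    # Assumed cutoff: average file smaller than half the target size.
    return average_file_bytes < target_file_bytes / 2


# Hypothetical tables: (sizeInBytes, numFiles)
fragmented = needs_compaction(50 * GIB, 20_000)   # many tiny files
healthy = needs_compaction(50 * GIB, 400)         # files already near the target
tiny = needs_compaction(5 * GIB, 20_000)          # below the size floor
print(fragmented, healthy, tiny)
```

A scheduled job could run this check first and only issue OPTIMIZE when it returns True, which avoids paying for compaction runs that would change very little.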
Summary
- OPTIMIZE compacts small files into large, efficient ones.
- Z-ORDER clusters data to enable data skipping and faster filters.
- Together, they can speed up selective queries by 10× or more, as in Ray's example (28 seconds down to 2.1 seconds).
- Best for large, heavily updated Delta Lake tables.
- Essential for production workloads, dashboards, and BI pipelines.
Next Topic
File Compaction & Delta File Management