Must-Know Databricks Interview Questions & Answers (Real Company Scenarios) - Part 4
36. How do you tune Spark configurations in Databricks for performance?
Story-Driven
Tuning Spark is like adjusting the speed and fuel of a race car. Too slow, and you waste time; too fast without control, and you crash. Proper tuning makes your data jobs fly efficiently.
Professional / Hands-On
- Common Spark configuration settings:
- spark.executor.memory: Memory allocated to each executor
- spark.executor.cores: Number of cores per executor
- spark.sql.shuffle.partitions: Number of partitions used when data is shuffled for joins and aggregations (default 200)
- Techniques:
- Monitor cluster metrics.
- Use dynamic allocation.
- Tune parallelism based on data size.
spark.conf.set("spark.sql.shuffle.partitions", "200")
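"Tune parallelism based on data size" can be made concrete with a small helper. The sketch below is one possible heuristic, not an official Databricks formula: the 128 MiB target per partition is a common rule of thumb, and the function name and its parameters are hypothetical.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2, total_cores=8):
    """Estimate a shuffle partition count from the amount of shuffled data.

    Aims for roughly `target_partition_bytes` per partition (an assumed rule
    of thumb), then rounds up to a multiple of the total core count so that
    every core has work in each wave of tasks.
    """
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores

# 50 GiB of shuffle data on a cluster with 32 cores in total
partitions = suggest_shuffle_partitions(50 * 1024**3, total_cores=32)
print(partitions)
```

In a notebook, the result could then be applied with spark.conf.set("spark.sql.shuffle.partitions", str(partitions)).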
37. Explain Z-ordering in Delta Lake
Story-Driven
Z-ordering is like arranging books in a library so related books are close together. This makes searches lightning-fast without scanning the whole shelf.
Professional / Hands-On
- Z-ordering: Multi-dimensional clustering of data in Delta tables.
- Improves query performance when filtering on the Z-ordered columns, because related values land in the same files and more files can be skipped.
OPTIMIZE sales_delta
ZORDER BY (customer_id, region)
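To build intuition for "multi-dimensional clustering", here is a toy illustration of a Z-order (Morton) curve in plain Python. It only shows the bit-interleaving idea; Delta Lake's internal implementation may differ in its details, and the row values are made up.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return z

# Hypothetical rows of (customer_id, region_id), kept small for readability
rows = [(3, 5), (0, 0), (3, 4), (1, 1), (2, 5), (0, 1)]

# Sorting by the interleaved value keeps rows that are close in BOTH
# dimensions next to each other, so they tend to end up in the same file.
ordered = sorted(rows, key=lambda r: z_value(*r))
print(ordered)
```

Because neither column dominates the sort order, a filter on either customer_id or region can skip files, which a plain sort on one column would not give you.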
38. How does time travel work in Delta Lake?
Story-Driven
Time travel in Delta Lake is like a magical diary: you can go back and read exactly what your data looked like last week, last month, or even yesterday.
Professional / Hands-On
- Delta Lake keeps versioned data using a transaction log.
- Access historical data via:
SELECT * FROM sales_delta VERSION AS OF 3
SELECT * FROM sales_delta TIMESTAMP AS OF '2025-01-01'
- Useful for audit, recovery, and debugging.
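The way a transaction log makes old versions readable can be sketched with a toy model. This is a simplification for intuition only: the real Delta log stores JSON commit files with more action types, and the file names here are hypothetical.

```python
# Each commit is a list of actions; the list index is the table version.
log = [
    [("add", "part-000.parquet")],                                  # version 0
    [("add", "part-001.parquet")],                                  # version 1
    [("remove", "part-000.parquet"), ("add", "part-002.parquet")],  # version 2
]

def files_as_of(log, version):
    """Replay commits 0..version and return the data files live at that version."""
    live = set()
    for commit in log[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            else:
                live.discard(path)
    return sorted(live)

print(files_as_of(log, 1))  # files a VERSION AS OF 1 query would read
print(files_as_of(log, 2))  # files the latest version reads
```

This is also why VACUUM limits how far back time travel can reach: once a removed file is physically deleted, versions that still reference it can no longer be read.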
39. How do you implement streaming pipelines in Databricks?
Story-Driven
A streaming pipeline is like a water pipeline delivering fresh water continuously. New data keeps flowing, and your system processes it automatically.
Professional / Hands-On
- Steps to implement:
- Read data using readStream.
- Transform using Spark operations.
- Write output using writeStream with checkpointing.
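from pyspark.sql.types import StructType, StructField, LongType
schema = StructType([StructField("value", LongType())])  # file-based streaming sources need a schema by default; this one is assumed
df = spark.readStream.format("json").schema(schema).load("/stream/input")
df_transformed = df.filter(df.value > 100)
df_transformed.writeStream.format("delta").option("checkpointLocation", "/checkpoint").start("/delta/output")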
40. What are Delta Lake optimizations (OPTIMIZE, VACUUM)?
Story-Driven
- OPTIMIZE: Organizes your data for faster queries, like tidying a messy bookshelf.
- VACUUM: Removes outdated or unnecessary files, like clearing trash.
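Professional / Hands-On
- OPTIMIZE: Compacts small files into larger ones and, with ZORDER BY, co-locates related data.
- VACUUM: Deletes data files that the table no longer references and that are older than the retention threshold (7 days by default).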
OPTIMIZE sales_delta ZORDER BY (customer_id)
VACUUM sales_delta RETAIN 168 HOURS
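The retention rule behind VACUUM can be illustrated with a small pure-Python model. This is a simplified sketch: it uses each file's last-modified time as a stand-in for how the age is actually tracked, and all file names and dates are hypothetical.

```python
from datetime import datetime, timedelta

def vacuum_candidates(all_files, live_files, now, retain_hours=168):
    """Return files that are unreferenced AND older than the retention window.

    `all_files` maps path -> last-modified time; `live_files` is the set of
    paths the current table version still references.
    """
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(
        path
        for path, modified in all_files.items()
        if path not in live_files and modified < cutoff
    )

now = datetime(2025, 1, 31)
all_files = {
    "part-000.parquet": datetime(2025, 1, 1),   # old and no longer referenced
    "part-001.parquet": datetime(2025, 1, 30),  # unreferenced, but inside the window
    "part-002.parquet": datetime(2025, 1, 30),  # current compacted file
}
live_files = {"part-002.parquet"}
candidates = vacuum_candidates(all_files, live_files, now)
print(candidates)
```

Only the file that is both unreferenced and older than 168 hours is selected; the recently replaced file survives so that recent versions stay readable.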
41. Explain Databricks REST API usage
Story-Driven
The REST API is like a remote control for Databricks: you can start clusters, run jobs, and access notebooks programmatically without opening the UI.
Professional / Hands-On
- Use cases:
- Automate cluster creation and job scheduling.
- Fetch job status or logs.
- Example using Python requests:
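import os
import requests
TOKEN = os.environ["DATABRICKS_TOKEN"]  # assumed env var holding a personal access token
response = requests.get(
    "https://<databricks-instance>/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()  # surface 4xx/5xx errors instead of failing silently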
42. How do you implement role-based access control (RBAC) in Databricks?
Story-Driven
RBAC is like giving keys to rooms only to the people who need them. Developers get dev keys, analysts get read-only keys, and admins get full access.
Professional / Hands-On
- RBAC in Databricks involves:
- Workspace access control (notebooks, jobs)
- Cluster access control
- Table & data access control with Unity Catalog
- Example: Assign CAN_MANAGE permission to a group for a cluster.
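The CAN_MANAGE example above could be expressed as a call to the Permissions API. The sketch below only builds the request, so it runs without a workspace; the helper name, host, cluster ID, and group name are placeholders, and the endpoint shape should be checked against the current API reference.

```python
import json

def cluster_permission_request(host, cluster_id, group_name, level="CAN_MANAGE"):
    """Build the URL and JSON body for granting a group a cluster permission."""
    url = f"https://{host}/api/2.0/permissions/clusters/{cluster_id}"
    body = {
        "access_control_list": [
            {"group_name": group_name, "permission_level": level}
        ]
    }
    return url, body

# Host, cluster ID, and group name are placeholders
url, body = cluster_permission_request("<databricks-instance>", "<cluster-id>", "data-engineers")
print(url)
print(json.dumps(body, indent=2))
# To apply it: requests.patch(url, headers={"Authorization": f"Bearer {TOKEN}"}, json=body)
```

Sending the body with PATCH adds to the existing permissions, whereas PUT would replace them, so PATCH is usually the safer choice for a single grant.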
43. How is auto-scaling managed in Databricks clusters?
Story-Driven
Auto-scaling is like hiring extra chefs when orders pile up and sending them home when it's quiet. Your kitchen stays efficient without manual intervention.
Professional / Hands-On
- Auto-scaling clusters:
- Minimum and maximum workers defined.
- Databricks automatically scales based on workload.
- Configurable at cluster creation:
Min Workers: 2, Max Workers: 10
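The same min/max setting can be expressed as the "autoscale" block of a cluster definition sent to the Clusters API. This is a minimal sketch: the helper name and cluster name are hypothetical, and the runtime version and node type are left as placeholders because they depend on your workspace and cloud.

```python
def autoscaling_cluster_spec(name, spark_version, node_type_id, min_workers=2, max_workers=10):
    """Build a cluster definition that autoscales between min and max workers."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # Replaces a fixed "num_workers" with a range Databricks can scale within
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }

# Placeholders: fill in a runtime version and node type valid for your workspace
spec = autoscaling_cluster_spec("etl-autoscale", "<spark-version>", "<node-type-id>")
print(spec["autoscale"])
```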
44. Explain checkpointing and write-ahead logs (WAL) in streaming
Story-Driven
Checkpointing and WAL are like saving your progress and keeping a backup diary of every move. If the stream fails, you can pick up exactly where you left off.
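Professional / Hands-On
- Checkpointing: Persists streaming progress (source offsets and operator state) to the checkpoint location.
- Write-ahead logs (WAL): Durably record the offsets of each micro-batch before it is processed, so the same batch can be replayed after a failure.
- Used together with a replayable source and an idempotent sink, they provide fault tolerance and exactly-once semantics.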
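The "pick up exactly where you left off" behavior can be demonstrated with a toy micro-batch loop in plain Python. It is a conceptual sketch, not how Spark stores checkpoints internally; the function name, file name, and data are all made up.

```python
import json
import os
import tempfile

def run_stream(source, sink, checkpoint_path, batch_size=2, fail_at_batch=None):
    """Process `source` in micro-batches, resuming from the checkpointed offset."""
    # Recover the last committed offset, or start from 0 on a first run.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]

    batch_id = 0
    while start < len(source):
        end = min(start + batch_size, len(source))
        if fail_at_batch == batch_id:
            raise RuntimeError("simulated crash before the batch is written")
        sink.extend(source[start:end])           # write the batch
        with open(checkpoint_path, "w") as f:    # then commit the progress
            json.dump({"next_offset": end}, f)
        start = end
        batch_id += 1

source = [10, 20, 30, 40, 50]
sink = []
checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")

try:
    run_stream(source, sink, checkpoint, fail_at_batch=1)  # crashes on the second batch
except RuntimeError:
    pass
run_stream(source, sink, checkpoint)  # restart resumes from the committed offset
print(sink)
```

In this toy, a crash between the sink write and the commit would replay that batch on restart, which is why an idempotent sink matters for exactly-once results.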
45. How do you debug failed jobs in Databricks?
Story-Driven
Debugging failed jobs is like detective work: you follow clues (logs), check the crime scene (stages), and find what went wrong.
Professional / Hands-On
- Steps:
- Check cluster logs (driver & worker).
- Review Spark UI for failed stages or tasks.
- Look at notebook outputs or job logs.
- Retry with a smaller dataset or isolated transformations.
- Use Databricks REST API to fetch detailed logs if automated debugging is needed.
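For the automated route in the last step, a run's status can be fetched from the Jobs API (for example GET /api/2.1/jobs/runs/get with a run_id) and reduced to the fields that matter for triage. The sketch below works on an already fetched payload so it runs offline; the helper name and the sample payload are hypothetical, and the field names should be checked against the API version you use.

```python
def summarize_run(run):
    """Reduce a Jobs API run payload to the fields most useful for triage."""
    state = run.get("state", {})
    failed_tasks = [
        task.get("task_key")
        for task in run.get("tasks", [])
        if task.get("state", {}).get("result_state") == "FAILED"
    ]
    return {
        "run_id": run.get("run_id"),
        "life_cycle_state": state.get("life_cycle_state"),
        "result_state": state.get("result_state"),
        "message": state.get("state_message", ""),
        "failed_tasks": failed_tasks,
    }

# Hypothetical payload, trimmed to the fields used above
sample = {
    "run_id": 123,
    "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "FAILED",
        "state_message": "Task ingest failed",
    },
    "tasks": [
        {"task_key": "ingest", "state": {"result_state": "FAILED"}},
        {"task_key": "report", "state": {"result_state": "SUCCESS"}},
    ],
}
summary = summarize_run(sample)
print(summary)
```

Knowing which task failed tells you which driver logs and Spark UI stages to open first.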