Databricks DBFS: Internal File System Explained
Welcome back to ShopWave, our fictional retail company.
You've built notebooks, set up clusters, and secured access, but now a question pops up:
"Where do we actually store all these files and datasets inside Databricks?"
Enter DBFS, the Databricks File System.
What Is DBFS?
DBFS is Databricks' built-in file system, a layer on top of cloud storage (AWS S3, Azure ADLS, or GCP GCS) that makes that storage look and behave like a local filesystem.
Think of it as:
"A Google Drive inside your Databricks workspace."
It lets you:
- Read and write files
- Store notebooks, datasets, models
- Share files between clusters and notebooks
- Access cloud storage seamlessly
Why DBFS Matters
DBFS is important because:
- Unified access: Any cluster can access the same files.
- Seamless integration: Works with Spark, Python, R, Scala, SQL.
- Persistent storage: Files persist even if clusters are terminated.
- Organized structure: Personal workspace, shared workspace, temporary storage.
ShopWave stores raw sales data, cleaned datasets, ML models, and experiment outputs in DBFS to keep everything organized.
DBFS Structure
Here's how DBFS is organized:
/dbfs
├── /FileStore
│   ├── /datasets
│   ├── /models
│   └── /temp
├── /mnt
│   └── /external_cloud_storage_mounts
└── /tmp
    └── /temporary_files
- /FileStore: User-uploaded files, notebooks, datasets
- /mnt: Mount points for external cloud storage
- /tmp: Temporary files during execution
Example: ShopWave uploads their CSV sales file to /FileStore/datasets/sales.csv.
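The same DBFS file can be written in more than one form: Spark and the CLI accept a `dbfs:/` URI, while local file APIs on a cluster typically see DBFS through the `/dbfs` mount shown at the root of the tree. The sketch below converts between those forms. The helper names are illustrative, and it assumes the `/dbfs` local mount is available on your cluster.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"
FUSE_ROOT = "/dbfs"  # assumption: where DBFS appears on the cluster's local filesystem


def normalize(path: str) -> str:
    """Return the bare DBFS path, e.g. '/FileStore/datasets/sales.csv'.

    Note: a bare path that starts with '/dbfs/' is treated as the local
    mount form, which is an assumption of this sketch.
    """
    if path.startswith(DBFS_SCHEME):
        path = path[len(DBFS_SCHEME):]
    elif path == FUSE_ROOT or path.startswith(FUSE_ROOT + "/"):
        path = path[len(FUSE_ROOT):]
    # Guarantee exactly one leading slash and drop any trailing slash.
    return str(PurePosixPath("/" + path.lstrip("/")))


def to_uri(path: str) -> str:
    """Form used by Spark readers and the CLI: dbfs:/..."""
    return DBFS_SCHEME + normalize(path)


def to_local(path: str) -> str:
    """Form used by local file APIs on a cluster: /dbfs/..."""
    return FUSE_ROOT + normalize(path)


print(to_uri("/FileStore/datasets/sales.csv"))
print(to_local("dbfs:/FileStore/datasets/sales.csv"))
```

Keeping one normalizing helper avoids mixing the two forms by accident, for example passing a `/dbfs/...` path to a Spark reader.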
Accessing DBFS
1. Using Python / Spark
# Read CSV from DBFS
sales_df = spark.read.csv("/FileStore/datasets/sales.csv", header=True, inferSchema=True)
sales_df.show(5)
# Write DataFrame back to DBFS
sales_df.write.parquet("/FileStore/datasets/sales_parquet")
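One detail worth noting about the write above: Spark's default save mode raises an error if the target path already exists, so running the cell a second time fails. Below is a minimal sketch of a rerunnable version. The function name `csv_to_parquet` is illustrative, and `spark` is assumed to be the SparkSession that Databricks notebooks provide.

```python
def csv_to_parquet(spark, src, dst, mode="overwrite"):
    """Read a CSV file with a header row and save it as Parquet.

    `spark` is passed in explicitly so the function has no hidden globals.
    Returns the DataFrame that was read.
    """
    df = spark.read.csv(src, header=True, inferSchema=True)
    # An explicit mode makes the write repeatable; "overwrite" replaces
    # whatever is already stored at `dst`.
    df.write.mode(mode).parquet(dst)
    return df
```

In a notebook this could be called as `csv_to_parquet(spark, "/FileStore/datasets/sales.csv", "/FileStore/datasets/sales_parquet")`. Pass `mode="append"` instead if existing data should be kept.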
2. Using SQL
-- Read a Delta table stored in DBFS
SELECT * FROM delta.`/FileStore/datasets/sales_delta`
3. Using CLI
# List files
databricks fs ls dbfs:/FileStore/datasets
# Copy local file to DBFS
databricks fs cp local_file.csv dbfs:/FileStore/datasets/
# Remove a file
databricks fs rm dbfs:/FileStore/datasets/old_file.csv
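The same list, copy, and remove operations are also available inside a notebook through `dbutils.fs`. The sketch below mirrors the three CLI commands. The function name `publish_file` is illustrative, and `dbutils` is assumed to be the utility object Databricks injects into notebooks. The local path must be an absolute path on the driver node.

```python
import posixpath


def publish_file(dbutils, local_path, dbfs_dir, stale_name=None):
    """Copy a driver-local file into a DBFS directory.

    Optionally removes one stale file from the same directory, then
    returns the names now listed there.
    """
    dbfs_dir = dbfs_dir.rstrip("/")
    target = f"{dbfs_dir}/{posixpath.basename(local_path)}"
    # The "file:" prefix marks a path on the driver's local disk, not in DBFS.
    dbutils.fs.cp("file:" + local_path, target)
    if stale_name:
        dbutils.fs.rm(f"{dbfs_dir}/{stale_name}")
    return [info.name for info in dbutils.fs.ls(dbfs_dir)]
```

For example, `publish_file(dbutils, "/tmp/local_file.csv", "dbfs:/FileStore/datasets", stale_name="old_file.csv")` covers the copy and remove steps and returns the resulting listing.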
Mounting External Storage
DBFS can mount cloud storage, making it appear as part of the filesystem:
/mnt/s3_sales_data
/mnt/adls_customer_data
ShopWave mounts AWS S3 buckets containing raw sales and inventory data to /mnt, then accesses them through Spark without worrying about bucket paths each time.
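A mount is usually created once from a notebook with `dbutils.fs.mount`. Here is a minimal sketch that skips the call when the mount point is already in use. The function name `ensure_mounted` and the bucket name in the usage note are hypothetical, and it assumes the cluster already has credentials for the bucket (for example through an instance profile).

```python
def ensure_mounted(dbutils, source, mount_point, extra_configs=None):
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created, False if one was already there.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs or {},
    )
    return True
```

A call such as `ensure_mounted(dbutils, "s3a://shopwave-raw-sales", "/mnt/s3_sales_data")` (hypothetical bucket name) would make the bucket's contents readable under `/mnt/s3_sales_data`. Checking first keeps the cell safe to rerun, since mounting over an existing mount point raises an error.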
Real Business Example: ShopWave
Scenario: ShopWave is building a recommendation engine.
- Engineers upload product and sales CSVs to /FileStore/datasets.
- Data scientists read these files from notebooks to train ML models.
- Transformed data is written back as Delta tables in /FileStore/models.
- BI dashboards access aggregated results from the same location.
DBFS ensures all teams work with the same files, avoiding duplication or version conflicts.
Quick Tips
- Use /FileStore for shared files within Databricks.
- Use /mnt for mounted cloud storage.
- Use /tmp for temporary files during workflows.
- Always clean up unused files to save storage costs.
- Leverage DBFS commands in notebooks or CLI for automation.
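The cleanup tip can be automated from a notebook. Below is a rough sketch that finds files by suffix and deletes them only when asked to. The function name `remove_by_suffix` is illustrative, `dbutils` is assumed to be the notebook-provided utility object, and it relies on `dbutils.fs.ls` listing directories with a trailing slash in their names.

```python
def remove_by_suffix(dbutils, directory, suffix, dry_run=True):
    """Find files in `directory` whose names end with `suffix`.

    Returns (paths, total_bytes) for the matches. With dry_run=True nothing
    is deleted, so the result can be reviewed before a real run.
    """
    matches = [
        info
        for info in dbutils.fs.ls(directory)
        # Directory entries are listed with a trailing slash; skip them.
        if not info.name.endswith("/") and info.name.endswith(suffix)
    ]
    if not dry_run:
        for info in matches:
            dbutils.fs.rm(info.path)
    return [info.path for info in matches], sum(info.size for info in matches)
```

Running it first with the default `dry_run=True` shows which files would go and how many bytes they occupy. Only top-level files are considered, so subdirectories are left untouched.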
Quick Summary
- DBFS is Databricks' internal file system, providing persistent storage for notebooks, datasets, and models.
- Organizes data in /FileStore, /mnt, and /tmp.
- Allows seamless integration with Python, SQL, R, Scala, Spark, and external cloud storage.
- Critical for collaboration across data engineers, analysts, and scientists.
- Makes workflows more efficient, organized, and scalable.
Coming Next
**Databricks Pricing: How Clusters, SQL & Jobs Are Charged**