
Databricks DBFS: Internal File System Explained

Welcome back to ShopWave, our fictional retail company.
You’ve built notebooks, set up clusters, and secured access, but now a question pops up:

“Where do we actually store all these files and datasets inside Databricks?”

Enter DBFS, the Databricks File System.


🗂️ What Is DBFS?

DBFS is Databricks’ built-in file system, a layer on top of cloud storage (AWS S3, Azure ADLS, or GCP GCS) that makes it look and behave like a local filesystem.

Think of it as:

“A Google Drive inside your Databricks workspace.”

It lets you:

  • Read and write files
  • Store datasets, models, and other files that your notebooks use
  • Share files between clusters and notebooks
  • Access cloud storage seamlessly

🔥 Why DBFS Matters

DBFS is important because:

  1. Unified access: Any cluster in the workspace can access the same files.
  2. Seamless integration: Works with Spark, Python, R, Scala, SQL.
  3. Persistent storage: Files persist even if clusters are terminated.
  4. Organized structure: Personal workspace, shared workspace, temporary storage.

ShopWave stores raw sales data, cleaned datasets, ML models, and experiment outputs in DBFS to keep everything organized.


🗂️ DBFS Structure

Here’s how DBFS is organized:


/dbfs
├── /FileStore
│   ├── /datasets
│   ├── /models
│   └── /temp
├── /mnt
│   └── /external_cloud_storage_mounts
└── /tmp
    └── /temporary_files

  • /FileStore → User-uploaded files and datasets
  • /mnt → Mount points for external cloud storage
  • /tmp → Temporary files during execution

Example: ShopWave uploads their CSV sales file to /FileStore/datasets/sales.csv.
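
The same file can be addressed in two ways on a Databricks cluster: Spark and the CLI use the dbfs:/ scheme, while local file APIs on the driver (Python's open, pandas, and so on) see the DBFS root under /dbfs, which is the prefix used in the tree above. Below is a minimal sketch of converting between the two forms. The helper names to_fuse_path and to_dbfs_uri are hypothetical, not part of any Databricks library, and the sketch assumes the /dbfs mount is enabled on the cluster.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"   # scheme used by Spark and the Databricks CLI
FUSE_ROOT = "/dbfs"     # where the DBFS root appears to local file APIs


def to_fuse_path(dbfs_path: str) -> str:
    """Turn 'dbfs:/FileStore/x' or '/FileStore/x' into '/dbfs/FileStore/x'."""
    path = dbfs_path
    if path.startswith(DBFS_SCHEME):
        path = path[len(DBFS_SCHEME):]
    if not path.startswith("/"):
        raise ValueError(f"expected an absolute DBFS path, got {dbfs_path!r}")
    # PurePosixPath normalizes duplicate and trailing slashes.
    return str(PurePosixPath(FUSE_ROOT + path))


def to_dbfs_uri(fuse_path: str) -> str:
    """Turn '/dbfs/FileStore/x' into 'dbfs:/FileStore/x'."""
    parts = PurePosixPath(fuse_path).parts
    if parts[:2] != ("/", "dbfs"):
        raise ValueError(f"expected a path under {FUSE_ROOT}, got {fuse_path!r}")
    return DBFS_SCHEME + "/" + "/".join(parts[2:])


if __name__ == "__main__":
    print(to_fuse_path("dbfs:/FileStore/datasets/sales.csv"))
    print(to_dbfs_uri("/dbfs/FileStore/datasets/sales.csv"))
```

With this in place, ShopWave's sales file could be read by Spark as dbfs:/FileStore/datasets/sales.csv and by plain Python file APIs as /dbfs/FileStore/datasets/sales.csv.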


💻 Accessing DBFS

1️⃣ Using Python / Spark

# Read CSV from DBFS
sales_df = spark.read.csv("/FileStore/datasets/sales.csv", header=True, inferSchema=True)
sales_df.show(5)

# Write the DataFrame back to DBFS as Parquet (overwrite mode so the cell can be re-run)
sales_df.write.mode("overwrite").parquet("/FileStore/datasets/sales_parquet")

2️⃣ Using SQL

-- Read a Delta table stored in DBFS
SELECT * FROM delta.`/FileStore/datasets/sales_delta`

3️⃣ Using CLI

# List files
databricks fs ls dbfs:/FileStore/datasets

# Copy local file to DBFS
databricks fs cp local_file.csv dbfs:/FileStore/datasets/

# Remove a file
databricks fs rm dbfs:/FileStore/datasets/old_file.csv
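
For automation, the copy command above can be wrapped in a small script. The following is a rough sketch, assuming the databricks CLI is installed and already authenticated on the machine running it. The function names dbfs_cp_command and upload are hypothetical, and the dry-run default is a design choice of this sketch, not CLI behavior.

```python
import shlex
import subprocess
from typing import List


def dbfs_cp_command(local_path: str, dbfs_dest: str, overwrite: bool = False) -> List[str]:
    """Build the argument list for `databricks fs cp` without running it."""
    if not dbfs_dest.startswith("dbfs:/"):
        raise ValueError(f"destination must start with 'dbfs:/', got {dbfs_dest!r}")
    cmd = ["databricks", "fs", "cp"]
    if overwrite:
        cmd.append("--overwrite")  # replace the file if it already exists
    cmd += [local_path, dbfs_dest]
    return cmd


def upload(local_path: str, dbfs_dest: str, overwrite: bool = False, dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or execute it through the CLI."""
    cmd = dbfs_cp_command(local_path, dbfs_dest, overwrite)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Dry run only: shows the command that would be executed.
    upload("local_file.csv", "dbfs:/FileStore/datasets/", overwrite=True)
```

Passing the arguments as a list, not as a single shell string, avoids quoting problems with file names that contain spaces.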

🔗 Mounting External Storage

DBFS can mount cloud storage, making it appear as part of the filesystem:

/mnt/s3_sales_data
/mnt/adls_customer_data

ShopWave mounts AWS S3 buckets containing raw sales and inventory data to /mnt, then accesses them through Spark without worrying about bucket paths each time.
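
A mount like this is typically created once from a notebook with dbutils.fs.mount. Below is a minimal sketch of an idempotent mount helper. The bucket name shopwave-raw-sales and the function name mount_bucket_if_missing are hypothetical, and the sketch assumes the cluster already has credentials for the bucket (for example through an instance profile), so no keys are passed.

```python
def mount_bucket_if_missing(dbutils, bucket: str, mount_name: str) -> str:
    """Mount s3a://<bucket> at /mnt/<mount_name> unless it is already mounted.

    `dbutils` is the object Databricks provides in notebooks; it is passed in
    explicitly so the function can also be exercised outside a notebook.
    """
    mount_point = f"/mnt/{mount_name}"
    # dbutils.fs.mounts() lists the current mounts; each entry has a mountPoint.
    already_mounted = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point not in already_mounted:
        dbutils.fs.mount(source=f"s3a://{bucket}", mount_point=mount_point)
    return mount_point


# In a Databricks notebook (hypothetical bucket name):
#   path = mount_bucket_if_missing(dbutils, "shopwave-raw-sales", "s3_sales_data")
#   sales_df = spark.read.csv(f"{path}/sales.csv", header=True)
```

Checking the existing mounts first matters because mounting onto a mount point that is already in use raises an error, which would break a notebook that is run more than once.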


🏢 Real Business Example: ShopWave

Scenario: ShopWave is building a recommendation engine.

  1. Engineers upload product and sales CSVs to /FileStore/datasets.
  2. Data scientists read these files from notebooks to train ML models.
  3. Transformed data is written back as Delta tables in /FileStore/models.
  4. BI dashboards access aggregated results from the same location.

DBFS ensures all teams work with the same files, avoiding duplication or version conflicts.


🧠 Quick Tips

  • Use /FileStore for shared files within Databricks.
  • Use /mnt for mounted cloud storage.
  • Use /tmp for temporary files during workflows.
  • Always clean up unused files to save storage costs.
  • Leverage DBFS commands in notebooks or CLI for automation.
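
The cleanup tip can be sketched with ordinary Python file APIs. This is an illustrative sketch, not a Databricks utility: the function names and the 7-day threshold are assumptions, and on a cluster the root would be a path under /dbfs (for example a team folder under /dbfs/tmp), which assumes the /dbfs mount is available.

```python
import time
from pathlib import Path
from typing import Iterable, List, Optional


def find_stale_files(root: str, max_age_days: float, now: Optional[float] = None) -> List[Path]:
    """Return files under `root` last modified more than `max_age_days` ago."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def delete_files(paths: Iterable[Path], dry_run: bool = True) -> int:
    """Delete the given files and return how many were removed.

    With dry_run=True nothing is deleted; the candidates are only printed.
    """
    deleted = 0
    for path in paths:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
            deleted += 1
    return deleted


if __name__ == "__main__":
    # Dry run over the current directory; swap in a /dbfs path on a cluster.
    delete_files(find_stale_files(".", max_age_days=7), dry_run=True)
```

Running with dry_run=True first gives a chance to review the list before anything is removed.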

🏁 Quick Summary

  • DBFS is Databricks’ internal file system, providing persistent storage for datasets, models, and other files.
  • Organizes data in /FileStore, /mnt, and /tmp.
  • Allows seamless integration with Python, SQL, R, Scala, Spark, and external cloud storage.
  • Critical for collaboration across data engineers, analysts, and scientists.
  • Makes workflows more efficient, organized, and scalable.

🚀 Coming Next

👉 Databricks Pricing: How Clusters, SQL & Jobs Are Charged
