
Creating DataFrames from CSV, JSON, Parquet & Hive Tables

Every analytics pipeline at NeoMart, our growing e-commerce platform, starts with one step: loading data into Spark.
Whether it comes from mobile apps, warehouses, partners, or machine logs, your first job as a data engineer is to convert this raw data into a DataFrame, Spark’s most widely used data structure.

DataFrames provide schema, structure, column-level operations, and optimization through Catalyst.
But how you create a DataFrame depends on the file format you’re working with.

Let’s explore the four most common formats: CSV, JSON, Parquet, and Hive tables.


Why File Formats Matter

Not all file formats behave the same.
Some are slow but simple (CSV), others lightning fast (Parquet), and some ideal for semi-structured workloads (JSON).

Choosing the right format can easily save minutes or even hours in large-scale ETL jobs.


1. Creating DataFrames from CSV Files

CSV files are widely used but come with limitations: no embedded schema, no built-in compression, and slow text parsing.

df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("/mnt/data/sales.csv")
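As a rough sketch of an alternative to inferSchema: inference makes Spark take an extra pass over the file, so when the layout is known you can supply the schema yourself. The helper name `load_sales_csv` and every column in `SALES_SCHEMA` are assumptions made for illustration, since the real layout of sales.csv is not given here.

```python
# Hypothetical column names and types; adjust them to the real sales.csv layout.
# DataFrameReader.schema() accepts a DDL-formatted string like this one.
SALES_SCHEMA = "order_id INT, customer_id STRING, amount DOUBLE, order_date DATE"


def load_sales_csv(spark, path):
    """Read a headered CSV with an explicit schema instead of inferSchema."""
    return (
        spark.read
        .option("header", True)   # first line holds column names
        .schema(SALES_SCHEMA)     # no extra inference pass over the data
        .csv(path)
    )
```

With an active SparkSession named `spark`, calling `load_sales_csv(spark, "/mnt/data/sales.csv")` would return a DataFrame whose column types are fixed up front rather than guessed from the data.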

✔ When to Use CSV

  • During initial ingestion
  • When partners/vendors deliver small datasets
  • For debugging and quick data inspection

❌ Avoid for big data

CSV parsing becomes slow as data volume increases: every value has to be parsed from text, and Spark cannot skip the columns a query doesn’t need.
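One common way around this is to pay the CSV parsing cost once at ingestion and keep a Parquet copy for everything downstream. The sketch below assumes that pattern fits your pipeline. The helper name `csv_to_parquet` and both path parameters are hypothetical.

```python
def csv_to_parquet(spark, csv_path, parquet_path):
    """Parse a CSV file once, store it as Parquet, and return the Parquet-backed DataFrame."""
    raw = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv(csv_path)
    )
    # "overwrite" replaces whatever already exists at parquet_path.
    raw.write.mode("overwrite").parquet(parquet_path)
    # Later jobs can read the Parquet copy instead of re-parsing the CSV.
    return spark.read.parquet(parquet_path)
```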


2. Creating DataFrames from JSON Files

JSON is well suited to logs, nested attributes, and NoSQL-like structures.

df = spark.read \
    .option("multiLine", True) \
    .json("/mnt/data/events.json")
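A note on that option, with a small sketch: by default Spark expects JSON Lines, meaning one complete JSON object per line, and the multiLine option is only needed when a single record spans several lines (for example, pretty-printed files). The helper `load_events` and its `pretty_printed` flag are hypothetical names used to show the two cases side by side.

```python
def load_events(spark, path, pretty_printed=False):
    """Read JSON, enabling multiLine only when records span several lines."""
    reader = spark.read
    if pretty_printed:
        # Needed for files where one JSON record is spread over multiple lines.
        reader = reader.option("multiLine", True)
    # Without the option, Spark treats every line as one JSON record.
    return reader.json(path)
```

Leaving multiLine off for JSON Lines files is usually the better choice where possible, since line-delimited files can be split across tasks, while a multi-line file has to be read as a whole.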

✔ Best for

  • Clickstream logs
  • IoT events
  • User activity streams

Story Example

NeoMart’s mobile app sends events like:

{
  "user": "123",
  "actions": ["view", "add_to_cart"]
}

JSON allows nested data, which Spark can parse easily.
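To make "parse easily" a little more concrete, here is a minimal sketch that flattens the `actions` array from the event above into one row per action. It assumes the DataFrame was read from events shaped like that sample. The helper name `flatten_actions` and the output column `action` are made up for this example.

```python
def flatten_actions(events_df):
    """Return one row per (user, action) pair from an events DataFrame."""
    return events_df.selectExpr(
        # Backticks keep `user` a plain column reference; USER is also a
        # keyword in Spark SQL, so quoting avoids any ambiguity.
        "`user`",
        # explode() emits one output row for each element of the array.
        "explode(actions) AS action",
    )
```

For the sample event, this would produce two rows: user 123 with action view, and user 123 with action add_to_cart.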


3. Creating DataFrames from Parquet Files (Best Practice)

Parquet is the de facto standard for big data, and Spark’s default data source format, because of:

  • Columnar storage
  • Built-in compression
  • Predicate pushdown
  • Fast read/write

df = spark.read.parquet("/mnt/data/transactions/")
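Columnar storage and predicate pushdown pay off when a query filters early and selects only the columns it needs. The sketch below shows that pattern. The helper name `read_parquet_subset` is hypothetical, and the column names in the usage note are assumptions, since the actual schema of the transactions data is not shown here.

```python
def read_parquet_subset(spark, path, columns, condition):
    """Read Parquet data, keeping only the rows and columns a job needs."""
    return (
        spark.read.parquet(path)
        .where(condition)    # simple comparisons can be pushed down to the Parquet reader
        .select(*columns)    # only the referenced columns need to be read from disk
    )
```

A call such as `read_parquet_subset(spark, "/mnt/data/transactions/", ["transaction_id", "amount"], "amount >= 100")` (with made-up column names) can be inspected with `.explain()`, where the physical plan lists the pushed filters and the reduced read schema.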

✔ Best Format For

  • Analytics
  • Large-scale ETL
  • Machine learning pipelines
  • Databricks Delta workflows

This is NeoMart’s recommended storage format for raw, clean, and analytics layers.


4. Creating DataFrames from Hive Tables

Hive tables allow you to store structured datasets with metadata (schema, partitions).

df = spark.table("analytics.daily_orders")

or using SQL:

df = spark.sql("SELECT * FROM analytics.daily_orders")

✔ Helpful When

  • Working with enterprise data warehouses
  • Using the Databricks metastore
  • Structuring data by partitions (date, region, etc.)
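Partitioning helps most when queries filter on the partition column, because Spark can then read only the matching partitions. The sketch below assumes `analytics.daily_orders` is partitioned by a column called `order_date`. That column name and the helper `load_orders_for_day` are hypothetical.

```python
def load_orders_for_day(spark, day):
    """Load one day's orders from the metastore table."""
    orders = spark.table("analytics.daily_orders")
    # order_date is an assumed partition column; filtering on it lets Spark
    # skip every partition that does not match the requested day.
    return orders.where(orders["order_date"] == day)
```

With a Hive-enabled session, `load_orders_for_day(spark, "2024-01-15")` would scan a single day's partition rather than the whole table, provided the table really is partitioned that way.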

5. Summary

  • CSV → simple, human-readable, but slow
  • JSON → perfect for nested & semi-structured data
  • Parquet → fastest & most efficient (recommended for big data)
  • Hive Tables → ideal for enterprise-scale structured storage

Proper DataFrame creation lays the foundation for the entire transformation pipeline and directly affects its performance, accuracy, and scalability.


Next up, we’ll master the DataFrame API: Select, Filter, WithColumn, and Drop, the core tools used to transform raw data into analytics-ready datasets.
