Creating DataFrames from CSV, JSON, Parquet & Hive Tables
Every analytics pipeline at NeoMart, our growing e-commerce platform, starts with one step: loading data into Spark.
Whether it comes from mobile apps, warehouses, partners, or machine logs, your first job as a data engineer is to convert this raw data into a DataFrame, Spark's most widely used data structure.
DataFrames provide schema, structure, column-level operations, and optimization through Catalyst.
But how you create a DataFrame depends on the file format you're working with.
Let's explore the four most common formats: CSV, JSON, Parquet, and Hive tables.
Why File Formats Matter
Not all file formats behave the same.
Some are simple but slow (CSV), others are fast and compact (Parquet), and some are well suited to semi-structured workloads (JSON).
Choosing the right format can easily save minutes or even hours in large-scale ETL jobs.
1. Creating DataFrames from CSV Files
CSV files are widely used but come with limitations: no embedded schema, no built-in compression, and slow parsing.
# Read a CSV with a header row; inferSchema costs an extra pass over the data
df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("/mnt/data/sales.csv")
When to Use CSV
- During initial ingestion
- When partners/vendors deliver small datasets
- For debugging and quick data inspection
Avoid for Big Data
CSV is row-oriented plain text, so Spark has to parse every field of every row, and read times grow quickly as data volume increases.
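When the column layout is known in advance, an explicit schema avoids the extra pass that inferSchema needs. The sketch below is a minimal illustration, not NeoMart's actual loader: the helper name `read_sales_csv` and the column names in `SALES_SCHEMA` are assumptions made for this example, and `spark` is an existing SparkSession as in the snippet above.

```python
# Hypothetical column layout for sales.csv; adjust it to the real file.
SALES_SCHEMA = "order_id STRING, customer_id STRING, amount DOUBLE, order_date DATE"


def read_sales_csv(spark, path):
    """Read a CSV that has a header row, using an explicit schema."""
    return (
        spark.read
        .option("header", True)   # first line holds the column names
        .schema(SALES_SCHEMA)     # DDL-style schema string, so no inference pass
        .csv(path)
    )
```

Calling `read_sales_csv(spark, "/mnt/data/sales.csv")` would return a DataFrame whose column types come from the schema string rather than from sampling the file.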
2. Creating DataFrames from JSON Files
JSON is perfect for logs, nested attributes, and NoSQL-like structures.
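# multiLine is needed only when a single record spans several lines;
# by default Spark expects one JSON object per line (JSON Lines)
df = spark.read \
    .option("multiLine", True) \
    .json("/mnt/data/events.json")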
Best For
- Clickstream logs
- IoT events
- User activity streams
Story Example
NeoMart's mobile app sends events like:
{
  "user": "123",
  "actions": ["view", "add_to_cart"]
}
JSON allows nested data, which Spark can parse easily.
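As one way to work with that nested `actions` array, the sketch below flattens it into one row per action. It assumes a DataFrame with the `user` and `actions` fields from the sample event; the helper name `explode_actions` is made up for this example.

```python
def explode_actions(events_df):
    """Return one row per (user, action) pair from the nested events."""
    # explode() emits a separate row for each element of the actions array.
    return events_df.selectExpr("user", "explode(actions) AS action")
```

For the sample event above, this would produce two rows for user 123: one for `view` and one for `add_to_cart`.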
3. Creating DataFrames from Parquet Files (Best Practice)
Parquet is the de facto standard format for big data, and Spark's default data source format, because of:
- Columnar storage
- Built-in compression
- Predicate pushdown
- Fast read/write
df = spark.read.parquet("/mnt/data/transactions/")
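To make the columnar storage and predicate pushdown points concrete, here is a minimal sketch that selects only the columns it needs and applies a simple filter. The column names `order_id` and `amount` and the helper name `load_large_transactions` are assumptions for illustration; `spark` is an existing SparkSession.

```python
def load_large_transactions(spark, path, min_amount):
    """Read only the needed columns and rows from a Parquet dataset."""
    return (
        spark.read.parquet(path)
        # Column pruning: only these two columns need to be read from disk.
        .select("order_id", "amount")
        # A simple comparison like this is a candidate for predicate pushdown.
        .filter(f"amount >= {float(min_amount)}")
    )
```

Calling `.explain()` on the result should show the filter under `PushedFilters` in the Parquet scan node when Spark is able to push it down.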
Best Format For
- Analytics
- Large-scale ETL
- Machine learning pipelines
- Databricks Delta workflows
This is NeoMart's recommended storage format for raw, clean, and analytics layers.
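Data that arrives as CSV or JSON can be moved into those layers by writing the DataFrame back out as Parquet. The following is a rough sketch under stated assumptions: the helper name `write_as_parquet` is hypothetical, and the output path and partition column are placeholders supplied by the caller.

```python
def write_as_parquet(df, output_path, partition_col=None):
    """Write a DataFrame as Parquet, optionally partitioned by one column."""
    # "overwrite" replaces any data already at output_path; use with care.
    writer = df.write.mode("overwrite")
    if partition_col is not None:
        # One sub-directory per distinct value of the partition column.
        writer = writer.partitionBy(partition_col)
    writer.parquet(output_path)
```

Partitioning by a column that queries often filter on, such as a date, lets later reads skip directories they do not need.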
4. Creating DataFrames from Hive Tables
Hive tables allow you to store structured datasets with metadata (schema, partitions).
df = spark.table("analytics.daily_orders")
or using SQL:
df = spark.sql("SELECT * FROM analytics.daily_orders")
Helpful When
- Working with enterprise data warehouses
- Using Databricks metastore
- Structuring data by partitions (date, region, etc.)
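Partitioned tables pay off when queries filter on the partition column, because Spark can then skip the partitions it does not need. The sketch below assumes, purely for illustration, that `analytics.daily_orders` is partitioned by a DATE column named `order_date`; the helper name `orders_for_date` is also hypothetical, and `spark` is an existing SparkSession with access to the metastore.

```python
from datetime import date


def orders_for_date(spark, day: date):
    """Read a single day's orders from the analytics.daily_orders table."""
    # Assumption: the table is partitioned by a DATE column named order_date.
    predicate = f"order_date = DATE '{day.isoformat()}'"
    return spark.table("analytics.daily_orders").where(predicate)
```

For example, `orders_for_date(spark, date(2024, 1, 15))` would request only the rows for 15 January 2024.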
5. Summary
- CSV: simple and human-readable, but slow
- JSON: well suited to nested and semi-structured data
- Parquet: fastest and most efficient (recommended for big data)
- Hive tables: ideal for enterprise-scale structured storage
Proper DataFrame creation lays the foundation for the entire transformation pipeline, ensuring performance, accuracy, and scalability.
Next up, we'll master the DataFrame API: Select, Filter, WithColumn, and Drop, the core tools used to transform raw data into analytics-ready datasets.