Working with Large Files, Compression Types & Optimization Tips
Welcome back to RetailCo, our fictional retail company.
Alice, the data engineer, now faces a new challenge: loading massive historical sales and clickstream data efficiently.
"If we don't handle large files and compression properly, loads will be slow, costly, and error-prone," she explains.
Let's explore how to work with large files, choose compression types, and optimize Snowflake performance.
The Challenge of Large Files
- Large files can slow down ETL
- Risk of time-out or memory issues
- Higher storage and compute costs
RetailCo example: 500 GB of historical sales CSVs from vendors need to be loaded quickly for analytics.
1. Best Practices for Large Files
- Split huge files into manageable chunks (~100 MB to 1 GB each; Snowflake's own guidance is roughly 100-250 MB compressed, or larger)
- Use external stages (S3, Azure, GCS) to avoid internal stage limits
- Leverage Snowflake parallelism with multiple files
- Avoid too many tiny files (less than 10 MB), which increase per-file load overhead
Example:
- Split 500 GB CSV into 500 files of ~1 GB
- Load them in parallel using COPY INTO or Snowpipe
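A minimal sketch of the parallel load, assuming the split files were uploaded under a hypothetical `historical/` prefix with names like `sales_part_0001.csv.gz` and that each file carries a header row. The prefix, file naming, and `SKIP_HEADER` setting are illustrative assumptions, not part of RetailCo's actual setup:

```sql
-- One COPY statement picks up every matching file; Snowflake spreads
-- the files across the warehouse's load threads.
COPY INTO SALES
FROM @S3_SALES_STAGE/historical/                     -- hypothetical prefix
PATTERN = '.*sales_part_[0-9]+[.]csv[.]gz'           -- hypothetical file naming
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';                        -- stop on the first bad file
```

The number of load operations running in parallel cannot exceed the number of files, which is why one 500 GB file loads far more slowly than many smaller ones.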
2. Compression Types
Snowflake automatically detects and decompresses GZIP, BZIP2, ZSTD, and other common codecs (such as DEFLATE) when COMPRESSION is left at AUTO:
| Compression | Use Case | Pros | Cons |
|---|---|---|---|
| GZIP | CSV, JSON | Widely supported, reduces size 5-10x | Slower decompression |
| BZIP2 (keyword BZ2 in Snowflake) | CSV, JSON | High compression ratio | Slower to compress and decompress than GZIP |
| ZSTD | CSV, JSON | Very fast and efficient | Less common in older tooling than GZIP |
| NONE | Uncompressed files | No decompression overhead | Uses more storage and network bandwidth |
RetailCo example: Alice compresses large CSVs with GZIP to reduce storage and speed up loads.
COPY INTO SALES
FROM @S3_SALES_STAGE
FILE_FORMAT=(TYPE=CSV COMPRESSION=GZIP);
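If the same settings are reused across many loads, a named file format keeps them in one place. The sketch below assumes a hypothetical format name `RETAILCO_CSV_GZ` and assumes the vendor files have a header row and double-quoted fields; adjust these to the real files:

```sql
-- Hypothetical reusable file format; COMPRESSION = AUTO lets Snowflake
-- detect GZIP, BZ2, ZSTD, and similar codecs.
CREATE OR REPLACE FILE FORMAT RETAILCO_CSV_GZ
  TYPE = CSV
  COMPRESSION = AUTO
  SKIP_HEADER = 1                          -- assumption: files include a header row
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';      -- assumption: fields may be quoted

COPY INTO SALES
FROM @S3_SALES_STAGE
FILE_FORMAT = (FORMAT_NAME = 'RETAILCO_CSV_GZ');
```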
3. File Format Optimization
- Parquet for large datasets: smaller, columnar, faster queries
- CSV for simple ingestion, but compress it (GZIP)
- JSON for nested data: load into a VARIANT column and compress with GZIP
Rule of thumb: Use columnar formats (Parquet/ORC) for analytics, row-based (CSV/JSON) for raw ingest.
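As a rough illustration of the two non-CSV paths, the sketch below loads Parquet by column name and lands raw JSON in a single VARIANT column. The `parquet/` prefix, the `CLICKSTREAM_RAW` table, and the `S3_CLICKSTREAM_STAGE` stage are hypothetical names introduced for this example:

```sql
-- Parquet: map file columns to table columns by name rather than position.
COPY INTO SALES
FROM @S3_SALES_STAGE/parquet/                -- hypothetical prefix
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- JSON: keep each event as-is in a VARIANT column for later flattening.
CREATE TABLE IF NOT EXISTS CLICKSTREAM_RAW (EVENT VARIANT);   -- hypothetical table

COPY INTO CLICKSTREAM_RAW
FROM @S3_CLICKSTREAM_STAGE                   -- hypothetical stage
FILE_FORMAT = (TYPE = JSON COMPRESSION = GZIP);
```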
4. Snowflake Load Optimization Tips
- Use multiple files to leverage parallel loading
- Use clustered tables to improve query performance on large datasets
- Avoid re-compressing files that are already compressed
- Use staged files efficiently (internal/external stages)
- Monitor load performance via COPY_HISTORY or LOAD_HISTORY
- Purge old staged files to save storage
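The monitoring and purge tips can be sketched as follows. The 24-hour window is an arbitrary choice for illustration, and `PURGE = TRUE` on an external stage assumes the stage's credentials are allowed to delete objects in the bucket:

```sql
-- Review what was loaded into SALES over the last 24 hours.
SELECT FILE_NAME, STATUS, ROW_COUNT, ROW_PARSED,
       FIRST_ERROR_MESSAGE, LAST_LOAD_TIME
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'SALES',
       START_TIME => DATEADD(HOUR, -24, CURRENT_TIMESTAMP())))
ORDER BY LAST_LOAD_TIME DESC;

-- Remove staged files automatically once they load successfully.
COPY INTO SALES
FROM @S3_SALES_STAGE
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
PURGE = TRUE;
```

Rows with a status other than `Loaded` and a non-empty `FIRST_ERROR_MESSAGE` are usually the first place to look when a batch comes up short.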
RetailCo Real-World Scenario
- Alice splits 500 GB CSVs into 500 files (~1 GB each)
- Compresses them with GZIP
- Stages them in S3 external stage
- Loads in parallel using COPY INTO
- Uses clustered table for faster aggregation queries
Outcome: ETL runs efficiently, cost is optimized, and dashboards are updated faster.
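The clustering step might look like the sketch below, assuming `SALES` has a `SALE_DATE` column that most dashboard queries filter on and an `AMOUNT` column to aggregate. Both column names and the date literal are hypothetical:

```sql
-- Define a clustering key on the column used most in filters.
ALTER TABLE SALES CLUSTER BY (SALE_DATE);          -- SALE_DATE is an assumed column

-- Inspect how well the table is clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('SALES', '(SALE_DATE)');

-- A typical aggregation that benefits from pruning on SALE_DATE.
SELECT SALE_DATE, SUM(AMOUNT) AS TOTAL_SALES       -- AMOUNT is an assumed column
FROM SALES
WHERE SALE_DATE >= '2023-01-01'
GROUP BY SALE_DATE
ORDER BY SALE_DATE;
```

Clustering helps queries rather than the load itself, and automatic reclustering consumes credits, so it tends to pay off only on large tables that are queried often.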
Quick Tips Checklist
- Split large files into chunks of ~100 MB to 1 GB
- Compress files (GZIP/ZSTD) to reduce storage and network usage
- Use Parquet for analytics-heavy tables
- Leverage Snowflake parallelism by loading multiple files
- Monitor load history and optimize warehouses for heavy loads
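For the warehouse side of the checklist, one option is a dedicated load warehouse that suspends itself when idle. The name `LOAD_WH`, the `MEDIUM` size, and the 60-second suspend timeout are assumptions to tune against the actual file count and load window:

```sql
-- Hypothetical dedicated warehouse for bulk loads.
CREATE WAREHOUSE IF NOT EXISTS LOAD_WH
  WAREHOUSE_SIZE = 'MEDIUM'       -- assumption: size to match the number of files
  AUTO_SUSPEND = 60               -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

USE WAREHOUSE LOAD_WH;
```

A larger warehouse only speeds up a load when there are enough files to keep its extra threads busy, so sizing and file splitting should be decided together.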
Quick Summary
- Large files require splitting, compression, and staging for efficient Snowflake loads
- Compression types: GZIP, BZIP2, ZSTD, NONE
- File format: Parquet for analytics, CSV/JSON for raw ingestion
- Use parallel loading, clustered tables, and staged files
- Benefits: faster ETL, lower cost, optimized storage, improved query performance
Coming Next
Snowflake Data Types Explained with Use Cases