Parquet
What is Parquet, and why is it faster than other ways of storing data?
Parquet Overview
Parquet is a columnar storage file format designed for efficient data processing in big data ecosystems (Hadoop, Spark, etc.).
Why It’s Faster
1. Columnar Storage
- Stores data by column, not row
- Reading specific columns doesn’t require scanning entire rows
- Perfect for analytical queries that only need a few columns
2. Compression
- Similar data types in columns compress extremely well
- Typical compression ratios: 75-90% smaller than CSV
- Less I/O = faster reads
3. Encoding
- Smart encoding schemes per column type
- Dictionary encoding for repeated values
- Run-length encoding for sequences
- Reduces storage and speeds up processing
4. Predicate Pushdown
- Stores metadata about min/max values per chunk
- Can skip entire blocks that don’t match query filters
- Avoids reading irrelevant data
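The skipping logic above can be sketched in plain Python (the chunk layout and `scan_greater_than` helper are hypothetical illustrations, not Parquet's actual metadata structures):

```python
# Hypothetical column chunks, each carrying the min/max stats that a
# Parquet column chunk stores in its metadata.
chunks = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_greater_than(chunks, threshold):
    """Return matching values and how many chunks were actually read."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        # Predicate pushdown: the metadata alone proves this chunk
        # cannot contain a match, so its values are never touched.
        if chunk["max"] <= threshold:
            continue
        chunks_read += 1
        hits.extend(v for v in chunk["values"] if v > threshold)
    return hits, chunks_read

hits, read = scan_greater_than(chunks, 250)  # reads 1 of 3 chunks
```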
5. Type-Aware
- Preserves data types (unlike CSV)
- No parsing overhead during reads
- Enables efficient operations
Example Speed Difference
Query: SELECT avg(price) FROM sales WHERE region='US'
CSV: Read all rows, parse all columns → ~10s
Parquet: Read 'price'+'region' columns only,
skip non-US chunks → ~0.5s
Trade-off: Slower writes, not human-readable, but vastly superior for analytics.
- Parquet files are compressed, as in gzipped (or equivalent)?
- Can a column have data of more than one type (e.g. numbers and strings)?
1. Compression in Parquet
Yes, but it’s two-layer compression:
- Encoding (always applied): Dictionary, RLE, bit-packing, etc.
- Compression codec (optional): Snappy (default), GZIP, LZ4, ZSTD, Brotli
Raw data → Encoding → Compression codec → Disk
Snappy is most common because it’s fast (decompression ~500 MB/s) with decent ratios. GZIP compresses better but slower. ZSTD is becoming popular (best of both).
Unlike gzipping a CSV (which must decompress everything), Parquet compresses per column chunk, so you only decompress what you read.
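That difference can be sketched with nothing but the standard library's `zlib` (the column data here is made up, and real Parquet chunks also carry encodings and metadata):

```python
import zlib

# Two made-up columns, serialized independently.
price = ",".join(str(i % 100) for i in range(10_000)).encode()
region = ",".join(["US", "CA"][i % 2] for i in range(10_000)).encode()

# Gzipped-CSV model: one opaque blob; reading anything
# means decompressing everything.
whole_file = zlib.compress(price + b"\n" + region)

# Parquet model: each column chunk compressed independently.
column_chunks = {"price": zlib.compress(price),
                 "region": zlib.compress(region)}

# Reading only 'price' decompresses only its chunk.
assert zlib.decompress(column_chunks["price"]) == price
```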
2. Mixed Types in Columns
No - each column has a strict schema with one type. This is a key feature.
However, you have options:
- Union types: Some implementations support variant/union columns (store type tag + value)
- String fallback: Store everything as strings (defeats performance benefits)
- Separate columns: value_int, value_string with nulls
- Struct/nested types: Parquet supports complex types - you could have a struct with multiple fields
The type strictness is actually why Parquet is fast - no runtime type checking, optimal encoding per type, and engines can generate specialized code paths.
- How do I use parquet files (create/read) on a Linux computer?
- Is it queryable via SQL syntax?
1. Using Parquet on Linux
Python (most common):
# Install: pip install pandas pyarrow
import pandas as pd
# Write
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df.to_parquet('data.parquet')
# Read
df = pd.read_parquet('data.parquet')
# Read specific columns only
df = pd.read_parquet('data.parquet', columns=['name'])
Command-line tools:
# Install parquet-tools
pip install parquet-tools
# Inspect schema and metadata
parquet-tools inspect data.parquet
# Show content
parquet-tools show data.parquet
# Dump content as CSV (Parquet → CSV; for CSV → Parquet use DuckDB or pandas)
parquet-tools csv data.parquet
DuckDB (my favorite for local work):
# Install: apt install duckdb OR pip install duckdb
duckdb -c "SELECT * FROM 'data.parquet'"
2. SQL Querying - YES!
DuckDB (best for local files):
duckdb
SELECT name, AVG(age)
FROM 'data.parquet'
WHERE age > 20
GROUP BY name;
-- Join multiple parquet files
SELECT * FROM 'users.parquet' u
JOIN 'orders.parquet' o ON u.id = o.user_id;
-- Wildcard patterns
SELECT * FROM 'data_*.parquet';
Python with DuckDB:
import duckdb
duckdb.query("SELECT * FROM 'data.parquet' WHERE age > 25").df()
DuckDB is perfect for this - it’s blazing fast, handles Parquet natively, and needs no setup.
I’m sold on duckdb, but how do I create parquet files without using Python/Pandas? Let’s say I have a bunch of CSV files I’d like to convert to Parquet.
DuckDB Can Do This Directly!
Single CSV → Parquet:
duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)"
Multiple CSVs → Single Parquet:
duckdb -c "COPY (SELECT * FROM 'data_*.csv') TO 'combined.parquet' (FORMAT PARQUET)"
Multiple CSVs → Multiple Parquets:
for file in *.csv; do
duckdb -c "COPY (SELECT * FROM '${file}') TO '${file%.csv}.parquet' (FORMAT PARQUET)"
done
With Compression Options:
duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, COMPRESSION_LEVEL 9)"
Codecs: SNAPPY (default), GZIP, ZSTD, LZ4, UNCOMPRESSED
Bonus - Transform While Converting:
duckdb -c "COPY (
SELECT
column1::INTEGER as id,
column2::DATE as date,
column3
FROM 'messy.csv'
WHERE column3 IS NOT NULL
) TO 'clean.parquet' (FORMAT PARQUET)"
No Python needed. DuckDB handles CSV parsing, type inference, and Parquet writing natively.
1. “Column-first”
What does it mean for the storage to be “columnar”? Does it mean the bytes are packed column-first on disk (and in memory), like in Julia arrays, as opposed to C arrays?
No - Different Concept
The Julia/C distinction is about memory layout within multidimensional arrays:
- C: row-major (consecutive rows)
- Julia/Fortran: column-major (consecutive columns)
Parquet’s “columnar” is about how tables are stored on disk - completely orthogonal.
What Columnar Actually Means
Row-oriented (CSV, traditional DB):
Disk layout:
[row1: id=1, name="Alice", age=25]
[row2: id=2, name="Bob", age=30]
[row3: id=3, name="Carol", age=27]
Column-oriented (Parquet):
Disk layout:
[id column: 1, 2, 3]
[name column: "Alice", "Bob", "Carol"]
[age column: 25, 30, 27]
All values for one column are stored together, physically adjacent on disk.
Why This Matters
SELECT AVG(age) FROM users -- 1 million rows
Row storage: Read all 1M rows × 3 columns = 3M values
Column storage: Read only the age column = 1M values
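In pure Python the two layouts look like this (an illustration of the access pattern, not of Parquet's actual on-disk bytes):

```python
# Row-oriented: each record stored together.
rows = [(1, "Alice", 25), (2, "Bob", 30), (3, "Carol", 27)]

# Column-oriented: all values of one column stored together.
columns = {
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 30, 27],
}

# AVG(age) over row storage walks every record (on disk that means
# reading id and name bytes just to get at age)...
avg_from_rows = sum(row[2] for row in rows) / len(rows)

# ...while column storage reads one contiguous list and nothing else.
avg_from_columns = sum(columns["age"]) / len(columns["age"])

assert avg_from_rows == avg_from_columns
```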
Physical Reality
Parquet files contain:
- Header with schema
- Row groups (horizontal slices of the table, often on the order of 128 MB)
- Within each row group: separate column chunks
- Each column chunk: metadata + compressed bytes for that column’s values
You’re reading bytes sequentially per column, not interleaved row data.
Not about array indexing - about physical data organization on disk.
2. Connection to duckdb
What is the connection between the parquet format and duckdb? Was the format introduced by duckdb? I know that duckdb uses column-first storage and is fast.
No Connection - Independent Technologies
Parquet (2013):
- Created by Twitter and Cloudera
- Part of Apache ecosystem
- Designed for Hadoop/Spark workloads
- File format specification
DuckDB (2019):
- Created at CWI Amsterdam (research institute)
- In-process analytical database
- Column-first storage in memory/on disk (not Parquet)
- Database engine
Why They Work So Well Together
Both are columnar, so philosophically aligned:
- Parquet: columnar file format
- DuckDB: columnar execution engine
DuckDB reads Parquet exceptionally fast because:
- No impedance mismatch - both organize data by column
- Direct column chunk → vector register pipeline
- Can skip decompression/reading of unused columns
- Leverages Parquet’s metadata for predicate pushdown
DuckDB’s Own Format
DuckDB has its own .db format (also columnar), but it treats Parquet as a first-class citizen:
-- Parquet acts like a table, zero import needed
SELECT * FROM 'file.parquet';
Think of it as: Parquet is the storage standard for columnar data files, DuckDB is an execution engine optimized for that pattern.
Other tools (Pandas, Spark, Polars, Drill) also use Parquet - DuckDB just happens to be the best local query engine for it.
3. Comparison with ORC
What is the ORC format and how does it compare to parquet?
ORC (Optimized Row Columnar)
Columnar format developed by Hortonworks for Hive (2013). Parquet came from Twitter/Cloudera around the same time (2013).
Key Differences
1. Ecosystem
- ORC: Hive-first, Java/Hadoop ecosystem
- Parquet: Broader adoption - Spark, Impala, Arrow, Pandas, DuckDB
2. Compression
- ORC: Built-in ZLIB (default), Snappy, LZO, LZ4, ZSTD
- Parquet: Same codecs, but Snappy default
- ORC typically 10-15% smaller due to more aggressive encoding
3. File Structure
- ORC: Stripes (large row groups) with indexes at stripe level
- Parquet: Smaller row groups, metadata at file end
- ORC better for streaming writes, Parquet better for random access
4. Indexing
- ORC: Built-in bloom filters, min/max indexes per stripe
- Parquet: Min/max only (bloom filters added later, less mature)
- ORC faster for point lookups
5. Schema Evolution
- ORC: Better ACID support, easier column add/drop
- Parquet: More restrictive
6. Performance
- ORC: 5-10% faster in Hive/Presto
- Parquet: 5-10% faster in Spark
- Differences small, workload-dependent
When to Use What
- ORC: Pure Hive/Hadoop stack, need ACID, heavy updates
- Parquet: Everything else - better tooling, wider support, language-agnostic
Practical Reality
Parquet won the ecosystem war. More tools support it. Unless you’re locked into Hive, use Parquet.
4. Encodings
What are dictionary, bit-packing and RLE encodings?
Dictionary Encoding
Store unique values once, then use integer IDs to reference them.
Original column: ["apple", "banana", "apple", "apple", "banana"]
Dictionary: {0: "apple", 1: "banana"}
Encoded: [0, 1, 0, 0, 1]
Storage: 2 strings + 5 tiny integers (vs 5 full strings)
Massive win for columns with low cardinality (countries, categories, status codes). A column with 1M rows but only 50 unique values becomes trivial.
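A sketch of the scheme in plain Python (real Parquet writers also fall back to plain encoding when the dictionary grows too large):

```python
def dictionary_encode(values):
    """Map each distinct value to an integer ID in order of first appearance."""
    dictionary, encoded = {}, []
    for v in values:
        encoded.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, encoded

column = ["apple", "banana", "apple", "apple", "banana"]
dictionary, ids = dictionary_encode(column)
# dictionary == {'apple': 0, 'banana': 1}, ids == [0, 1, 0, 0, 1]
```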
Bit-Packing
Use minimum bits needed to represent values, not full bytes.
Values: [0, 1, 2, 3, 0, 1, 2, 3]
Normal (8 bits each): 00000000 00000001 00000010 00000011 ...
Bit-packed (2 bits): 00 01 10 11 00 01 10 11
Savings: 8 bytes → 2 bytes (75% reduction)
Works when values fit in small ranges. If max value is 7, you only need 3 bits per number, not 32 or 64.
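A simplified sketch of the idea (Parquet's actual hybrid RLE/bit-packing layout differs in the details):

```python
def bit_pack(values):
    """Pack non-negative ints using the minimum bit width for the largest one."""
    width = max(v.bit_length() for v in values) or 1
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    n_bytes = (len(values) * width + 7) // 8  # round up to whole bytes
    return packed.to_bytes(n_bytes, "little"), width

def bit_unpack(data, width, count):
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

data, width = bit_pack([0, 1, 2, 3, 0, 1, 2, 3])
# width == 2, len(data) == 2: eight values in two bytes instead of eight
```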
RLE (Run-Length Encoding)
Store value + count for consecutive repeats.
Original: [5, 5, 5, 5, 5, 7, 7, 9, 9, 9, 9]
RLE: [(5, count=5), (7, count=2), (9, count=4)]
11 values → 3 pairs
Perfect for sorted or semi-sorted data. A column of timestamps sorted by date or boolean flags with long runs of TRUE/FALSE compresses to almost nothing.
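The encoding step is a few lines of Python (a sketch of the concept; Parquet applies RLE to repetition levels and dictionary IDs, not raw values):

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]

encoded = rle_encode([5, 5, 5, 5, 5, 7, 7, 9, 9, 9, 9])
# encoded == [(5, 5), (7, 2), (9, 4)]: 11 values down to 3 pairs
```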
Combined Example
Parquet often chains these:
Column: region codes (mostly "US", "US", "CA", "US", "US"...)
1. Dictionary: US→0, CA→1
2. RLE on IDs: (0, 1000), (1, 50), (0, 2000)...
3. Bit-pack the counts
Result: 100MB → 500KB
5. Complex datatype support
Can I use structured datatypes in Parquet columns? For example, I want a column where each entry is a list of dates.
Yes - Parquet Has Rich Nested Types
List/Array Columns:
-- Create table with list column
CREATE TABLE events AS
SELECT
'user1' as user_id,
['2024-01-01'::DATE, '2024-01-15'::DATE, '2024-02-03'::DATE] as visit_dates;
-- Write to parquet
COPY events TO 'events.parquet' (FORMAT PARQUET);
-- Read and query
SELECT
user_id,
visit_dates,
len(visit_dates) as num_visits,
visit_dates[1] as first_visit -- 1-indexed
FROM 'events.parquet';
-- Unnest lists
SELECT user_id, UNNEST(visit_dates) as visit_date
FROM 'events.parquet';
Other Complex Types Supported:
-- Struct (like a nested record)
CREATE TABLE users AS
SELECT
'Alice' as name,
{'street': '123 Main', 'city': 'NYC', 'zip': '10001'} as address;
-- Map (key-value pairs)
CREATE TABLE metrics AS
SELECT
'server1' as host,
MAP(['cpu', 'mem'], [0.75, 0.82]) as stats;
-- Nested combinations
CREATE TABLE complex AS
SELECT
[
{'name': 'Alice', 'scores': [95, 87, 92]},
{'name': 'Bob', 'scores': [88, 90, 85]}
] as students;
From CSV with Lists:
If your CSV has lists as strings like "2024-01-01,2024-01-15,2024-02-03":
COPY (
SELECT
user_id,
string_split(date_string, ',')::DATE[] as visit_dates
FROM 'data.csv'
) TO 'data.parquet' (FORMAT PARQUET);
The schema is preserved in the Parquet metadata - reading tools know it’s a list of dates, not a string.
To be clear, the list and struct syntax is the same as python’s list and dictionary, right?
Almost, but not quite
Lists:
✓ Same as Python: ['item1', 'item2', 'item3']
Structs:
✗ Similar-looking syntax - uses {'key': value} like a Python dict, but the semantics differ
-- DuckDB struct syntax (named fields, like a record)
{'name': 'Alice', 'age': 25}
-- This looks like Python dict but behaves differently:
-- - Keys are FIXED field names (like class attributes)
-- - Not key-value lookups - it's a structured record
-- - Schema enforced: all rows must have same fields
Maps (actual key-value):
For true dictionary behavior, use MAP():
-- Map = actual key-value pairs (like Python dict)
MAP(['key1', 'key2'], ['val1', 'val2'])
-- Can have different keys per row
-- Keys/values must be same type across all entries
Comparison:
-- STRUCT: {field_name: value} - fixed schema, named access
SELECT s.name FROM (SELECT {'name': 'Alice', 'age': 25} as s);
-- MAP: flexible keys, subscript access
SELECT m['key1'] FROM (SELECT MAP(['key1'], ['val1']) as m);
-- LIST: [item1, item2] - same as Python
SELECT l[1] FROM (SELECT ['a', 'b', 'c'] as l);
So: lists = Python lists, structs ≈ Python dicts visually but more like dataclasses semantically, maps = true dicts.