Parquet
What is Parquet, and why is it faster than other ways of storing data?
Parquet Overview
Parquet is a columnar storage file format designed for efficient data processing in big data ecosystems (Hadoop, Spark, etc.).
Why It’s Faster
1. Columnar Storage
- Stores data by column, not row
- Reading specific columns doesn’t require scanning entire rows
- Perfect for analytical queries that only need a few columns
2. Compression
- Similar data types in columns compress extremely well
- Typical compression ratios: 75-90% smaller than CSV
- Less I/O = faster reads
3. Encoding
- Smart encoding schemes per column type
- Dictionary encoding for repeated values
- Run-length encoding for sequences
- Reduces storage and speeds up processing
4. Predicate Pushdown
- Stores metadata about min/max values per chunk
- Can skip entire blocks that don’t match query filters
- Avoids reading irrelevant data
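The skipping logic above can be sketched in plain Python (the chunk layout and `scan_greater_than` helper are hypothetical illustrations, not Parquet's actual metadata structures):

```python
# Hypothetical column chunks, each carrying the min/max stats that a
# Parquet column chunk stores in its metadata.
chunks = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_greater_than(chunks, threshold):
    """Return matching values and how many chunks were actually read."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        # Predicate pushdown: the metadata alone proves this chunk
        # cannot contain a match, so its values are never touched.
        if chunk["max"] <= threshold:
            continue
        chunks_read += 1
        hits.extend(v for v in chunk["values"] if v > threshold)
    return hits, chunks_read

hits, read = scan_greater_than(chunks, 250)  # reads 1 of 3 chunks
```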
5. Type-Aware
- Preserves data types (unlike CSV)
- No parsing overhead during reads
- Enables efficient operations
Example Speed Difference
Query: SELECT avg(price) FROM sales WHERE region='US'
CSV: Read all rows, parse all columns → ~10s
Parquet: Read 'price'+'region' columns only,
skip non-US chunks → ~0.5s
Trade-off: Slower writes, not human-readable, but vastly superior for analytics.
- Parquet files are compressed, as in gzipped (or equivalent)?
- Can a column have data of more than one type (e.g. numbers and strings)?
1. Compression in Parquet
Yes, but it’s two-layer compression:
- Encoding (always applied): Dictionary, RLE, bit-packing, etc.
- Compression codec (optional): Snappy (default), GZIP, LZ4, ZSTD, Brotli
Raw data → Encoding → Compression codec → Disk
Snappy is most common because it’s fast (decompression ~500 MB/s) with decent ratios. GZIP compresses better but slower. ZSTD is becoming popular (best of both).
Unlike gzipping a CSV (which must decompress everything), Parquet compresses per column chunk, so you only decompress what you read.
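That difference can be sketched with nothing but the standard library's `zlib` (the column data here is made up, and real Parquet chunks also carry encodings and metadata):

```python
import zlib

# Two made-up columns, serialized independently.
price = ",".join(str(i % 100) for i in range(10_000)).encode()
region = ",".join(["US", "CA"][i % 2] for i in range(10_000)).encode()

# Gzipped-CSV model: one opaque blob; reading anything
# means decompressing everything.
whole_file = zlib.compress(price + b"\n" + region)

# Parquet model: each column chunk compressed independently.
column_chunks = {"price": zlib.compress(price),
                 "region": zlib.compress(region)}

# Reading only 'price' decompresses only its chunk.
assert zlib.decompress(column_chunks["price"]) == price
```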
2. Mixed Types in Columns
No - each column has a strict schema with one type. This is a key feature.
However, you have options:
- Union types: Some implementations support variant/union columns (store type tag + value)
- String fallback: Store everything as strings (defeats performance benefits)
- Separate columns: value_int, value_string with nulls
- Struct/nested types: Parquet supports complex types - you could have a struct with multiple fields
The type strictness is actually why Parquet is fast - no runtime type checking, optimal encoding per type, and engines can generate specialized code paths.
- How do I use parquet files (create/read) on a Linux computer?
- Is it queryable via SQL syntax?
1. Using Parquet on Linux
Python (most common):
# Install: pip install pandas pyarrow
import pandas as pd
# Write
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df.to_parquet('data.parquet')
# Read
df = pd.read_parquet('data.parquet')
# Read specific columns only
df = pd.read_parquet('data.parquet', columns=['name'])
Command-line tools:
# Install parquet-tools
pip install parquet-tools
# Inspect schema and metadata
parquet-tools inspect data.parquet
# Show content
parquet-tools show data.parquet
# Dump content as CSV (Parquet → CSV; for CSV → Parquet use DuckDB or pandas)
parquet-tools csv data.parquet
DuckDB (my favorite for local work):
# Install: apt install duckdb OR pip install duckdb
duckdb -c "SELECT * FROM 'data.parquet'"
2. SQL Querying - YES!
DuckDB (best for local files):
duckdb
SELECT name, AVG(age)
FROM 'data.parquet'
WHERE age > 20
GROUP BY name;
-- Join multiple parquet files
SELECT * FROM 'users.parquet' u
JOIN 'orders.parquet' o ON u.id = o.user_id;
-- Wildcard patterns
SELECT * FROM 'data_*.parquet';
Python with DuckDB:
import duckdb
duckdb.query("SELECT * FROM 'data.parquet' WHERE age > 25").df()
DuckDB is perfect for this - it’s blazing fast, handles Parquet natively, and needs no setup.
I’m sold on duckdb, but how do I create parquet files without using Python/Pandas? Let’s say I have a bunch of CSV files I’d like to convert to Parquet.
DuckDB Can Do This Directly!
Single CSV → Parquet:
duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)"
Multiple CSVs → Single Parquet:
duckdb -c "COPY (SELECT * FROM 'data_*.csv') TO 'combined.parquet' (FORMAT PARQUET)"
Multiple CSVs → Multiple Parquets:
for file in *.csv; do
duckdb -c "COPY (SELECT * FROM '${file}') TO '${file%.csv}.parquet' (FORMAT PARQUET)"
done
With Compression Options:
duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, COMPRESSION_LEVEL 9)"
Codecs: SNAPPY (default), GZIP, ZSTD, LZ4, UNCOMPRESSED
Bonus - Transform While Converting:
duckdb -c "COPY (
SELECT
column1::INTEGER as id,
column2::DATE as date,
column3
FROM 'messy.csv'
WHERE column3 IS NOT NULL
) TO 'clean.parquet' (FORMAT PARQUET)"
No Python needed. DuckDB handles CSV parsing, type inference, and Parquet writing natively.
1. “Column-first”
What does it mean for the storage to be “columnar”? Does it mean the bytes are packed column-first on disk (and in memory), like in Julia arrays, as opposed to C arrays?
No - Different Concept
The Julia/C distinction is about memory layout within multidimensional arrays:
- C: row-major (consecutive rows)
- Julia/Fortran: column-major (consecutive columns)
Parquet’s “columnar” is about how tables are stored on disk - completely orthogonal.
What Columnar Actually Means
Row-oriented (CSV, traditional DB):
Disk layout:
[row1: id=1, name="Alice", age=25]
[row2: id=2, name="Bob", age=30]
[row3: id=3, name="Carol", age=27]
Column-oriented (Parquet):
Disk layout:
[id column: 1, 2, 3]
[name column: "Alice", "Bob", "Carol"]
[age column: 25, 30, 27]
All values for one column are stored together, physically adjacent on disk.
Why This Matters
SELECT AVG(age) FROM users -- 1 million rows
Row storage: Read all 1M rows × 3 columns = 3M values
Column storage: Read only the age column = 1M values
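In pure Python the two layouts look like this (an illustration of the access pattern, not of Parquet's actual on-disk bytes):

```python
# Row-oriented: each record stored together.
rows = [(1, "Alice", 25), (2, "Bob", 30), (3, "Carol", 27)]

# Column-oriented: all values of one column stored together.
columns = {
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 30, 27],
}

# AVG(age) over row storage walks every record (on disk that means
# reading id and name bytes just to get at age)...
avg_from_rows = sum(row[2] for row in rows) / len(rows)

# ...while column storage reads one contiguous list and nothing else.
avg_from_columns = sum(columns["age"]) / len(columns["age"])

assert avg_from_rows == avg_from_columns
```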
Physical Reality
Parquet files contain:
- Header with schema
- Row groups (horizontal slices of the table, often on the order of 128 MB)
- Within each row group: separate column chunks
- Each column chunk: metadata + compressed bytes for that column’s values
You’re reading bytes sequentially per column, not interleaved row data.
Not about array indexing - about physical data organization on disk.
2. Connection to duckdb
What is the connection between the parquet format and duckdb? Was the format introduced by duckdb? I know that duckdb uses column-first storage and is fast.
No Connection - Independent Technologies
Parquet (2013):
- Created by Twitter and Cloudera
- Part of Apache ecosystem
- Designed for Hadoop/Spark workloads
- File format specification
DuckDB (2019):
- Created at CWI Amsterdam (research institute)
- In-process analytical database
- Column-first storage in memory/on disk (not Parquet)
- Database engine
Why They Work So Well Together
Both are columnar, so philosophically aligned:
- Parquet: columnar file format
- DuckDB: columnar execution engine
DuckDB reads Parquet exceptionally fast because:
- No impedance mismatch - both organize data by column
- Direct column chunk → vector register pipeline
- Can skip decompression/reading of unused columns
- Leverages Parquet’s metadata for predicate pushdown
DuckDB’s Own Format
DuckDB has its own .db format (also columnar), but it treats Parquet as a first-class citizen:
-- Parquet acts like a table, zero import needed
SELECT * FROM 'file.parquet';
Think of it as: Parquet is the storage standard for columnar data files, DuckDB is an execution engine optimized for that pattern.
Other tools (Pandas, Spark, Polars, Drill) also use Parquet - DuckDB just happens to be the best local query engine for it.
3. Comparison with ORC
What is the ORC format and how does it compare to parquet?
ORC (Optimized Row Columnar)
Columnar format developed by Hortonworks for Hive (2013). Parquet came from Twitter/Cloudera around the same time (2013).
Key Differences
1. Ecosystem
- ORC: Hive-first, Java/Hadoop ecosystem
- Parquet: Broader adoption - Spark, Impala, Arrow, Pandas, DuckDB
2. Compression
- ORC: Built-in ZLIB (default), Snappy, LZO, LZ4, ZSTD
- Parquet: Same codecs, but Snappy default
- ORC typically 10-15% smaller due to more aggressive encoding
3. File Structure
- ORC: Stripes (large row groups) with indexes at stripe level
- Parquet: Smaller row groups, metadata at file end
- ORC better for streaming writes, Parquet better for random access
4. Indexing
- ORC: Built-in bloom filters, min/max indexes per stripe
- Parquet: Min/max only (bloom filters added later, less mature)
- ORC faster for point lookups
5. Schema Evolution
- ORC: Better ACID support, easier column add/drop
- Parquet: More restrictive
6. Performance
- ORC: 5-10% faster in Hive/Presto
- Parquet: 5-10% faster in Spark
- Differences small, workload-dependent
When to Use What
- ORC: Pure Hive/Hadoop stack, need ACID, heavy updates
- Parquet: Everything else - better tooling, wider support, language-agnostic
Practical Reality
Parquet won the ecosystem war. More tools support it. Unless you’re locked into Hive, use Parquet.
4. Encodings
What are dictionary, bit-packing and RLE encodings?
Dictionary Encoding
Store unique values once, then use integer IDs to reference them.
Original column: ["apple", "banana", "apple", "apple", "banana"]
Dictionary: {0: "apple", 1: "banana"}
Encoded: [0, 1, 0, 0, 1]
Storage: 2 strings + 5 tiny integers (vs 5 full strings)
Massive win for columns with low cardinality (countries, categories, status codes). A column with 1M rows but only 50 unique values becomes trivial.
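A sketch of the scheme in plain Python (real Parquet writers also fall back to plain encoding when the dictionary grows too large):

```python
def dictionary_encode(values):
    """Map each distinct value to an integer ID in order of first appearance."""
    dictionary, encoded = {}, []
    for v in values:
        encoded.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, encoded

column = ["apple", "banana", "apple", "apple", "banana"]
dictionary, ids = dictionary_encode(column)
# dictionary == {'apple': 0, 'banana': 1}, ids == [0, 1, 0, 0, 1]
```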
Bit-Packing
Use minimum bits needed to represent values, not full bytes.
Values: [0, 1, 2, 3, 0, 1, 2, 3]
Normal (8 bits each): 00000000 00000001 00000010 00000011 ...
Bit-packed (2 bits): 00 01 10 11 00 01 10 11
Savings: 8 bytes → 2 bytes (75% reduction)
Works when values fit in small ranges. If max value is 7, you only need 3 bits per number, not 32 or 64.
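A simplified sketch of the idea (Parquet's actual hybrid RLE/bit-packing layout differs in the details):

```python
def bit_pack(values):
    """Pack non-negative ints using the minimum bit width for the largest one."""
    width = max(v.bit_length() for v in values) or 1
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    n_bytes = (len(values) * width + 7) // 8  # round up to whole bytes
    return packed.to_bytes(n_bytes, "little"), width

def bit_unpack(data, width, count):
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

data, width = bit_pack([0, 1, 2, 3, 0, 1, 2, 3])
# width == 2, len(data) == 2: eight values in two bytes instead of eight
```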
RLE (Run-Length Encoding)
Store value + count for consecutive repeats.
Original: [5, 5, 5, 5, 5, 7, 7, 9, 9, 9, 9]
RLE: [(5, count=5), (7, count=2), (9, count=4)]
11 values → 3 pairs
Perfect for sorted or semi-sorted data. A column of timestamps sorted by date or boolean flags with long runs of TRUE/FALSE compresses to almost nothing.
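The encoding step is a few lines of Python (a sketch of the concept; Parquet applies RLE to repetition levels and dictionary IDs, not raw values):

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]

encoded = rle_encode([5, 5, 5, 5, 5, 7, 7, 9, 9, 9, 9])
# encoded == [(5, 5), (7, 2), (9, 4)]: 11 values down to 3 pairs
```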
Combined Example
Parquet often chains these:
Column: region codes (mostly "US", "US", "CA", "US", "US"...)
1. Dictionary: US→0, CA→1
2. RLE on IDs: (0, 1000), (1, 50), (0, 2000)...
3. Bit-pack the counts
Result: 100MB → 500KB
5. Complex datatype support
Can I use structured datatypes in Parquet columns? For example, I want a column where each entry is a list of dates.
Yes - Parquet Has Rich Nested Types
List/Array Columns:
-- Create table with list column
CREATE TABLE events AS
SELECT
'user1' as user_id,
['2024-01-01'::DATE, '2024-01-15'::DATE, '2024-02-03'::DATE] as visit_dates;
-- Write to parquet
COPY events TO 'events.parquet' (FORMAT PARQUET);
-- Read and query
SELECT
user_id,
visit_dates,
len(visit_dates) as num_visits,
visit_dates[1] as first_visit -- 1-indexed
FROM 'events.parquet';
-- Unnest lists
SELECT user_id, UNNEST(visit_dates) as visit_date
FROM 'events.parquet';
Other Complex Types Supported:
-- Struct (like a nested record)
CREATE TABLE users AS
SELECT
'Alice' as name,
{'street': '123 Main', 'city': 'NYC', 'zip': '10001'} as address;
-- Map (key-value pairs)
CREATE TABLE metrics AS
SELECT
'server1' as host,
MAP(['cpu', 'mem'], [0.75, 0.82]) as stats;
-- Nested combinations
CREATE TABLE complex AS
SELECT
[
{'name': 'Alice', 'scores': [95, 87, 92]},
{'name': 'Bob', 'scores': [88, 90, 85]}
] as students;
From CSV with Lists:
If your CSV has lists as strings like "2024-01-01,2024-01-15,2024-02-03":
COPY (
SELECT
user_id,
string_split(date_string, ',')::DATE[] as visit_dates
FROM 'data.csv'
) TO 'data.parquet' (FORMAT PARQUET);
The schema is preserved in the Parquet metadata - reading tools know it’s a list of dates, not a string.
To be clear, the list and struct syntax is the same as python’s list and dictionary, right?
Almost, but not quite
Lists:
✓ Same as Python: ['item1', 'item2', 'item3']
Structs:
✗ Similar-looking syntax - uses {'key': value} like a Python dict, but the semantics differ
-- DuckDB struct syntax (named fields, like a record)
{'name': 'Alice', 'age': 25}
-- This looks like Python dict but behaves differently:
-- - Keys are FIXED field names (like class attributes)
-- - Not key-value lookups - it's a structured record
-- - Schema enforced: all rows must have same fields
Maps (actual key-value):
For true dictionary behavior, use MAP():
-- Map = actual key-value pairs (like Python dict)
MAP(['key1', 'key2'], ['val1', 'val2'])
-- Can have different keys per row
-- Keys/values must be same type across all entries
Comparison:
-- STRUCT: {field_name: value} - fixed schema, named access
SELECT s.name FROM (SELECT {'name': 'Alice', 'age': 25} as s);
-- MAP: flexible keys, subscript access
SELECT m['key1'] FROM (SELECT MAP(['key1'], ['val1']) as m);
-- LIST: [item1, item2] - same as Python
SELECT l[1] FROM (SELECT ['a', 'b', 'c'] as l);
So: lists = Python lists, structs ≈ Python dicts visually but more like dataclasses semantically, maps = true dicts.