Parquet

What is Parquet, and why is it faster than other ways of storing data?

Parquet Overview

Parquet is a columnar storage file format designed for efficient data processing in big data ecosystems (Hadoop, Spark, etc.).

Why It’s Faster

1. Columnar Storage - only the columns a query touches are read from disk

2. Compression - values of one column stored together compress far better than interleaved rows

3. Encoding - dictionary, run-length, and bit-packing encodings shrink data before the codec even runs

4. Predicate Pushdown - per-chunk min/max statistics let engines skip chunks that cannot match a filter

5. Type-Aware - each column's strict type gets its own optimal binary layout

Example Speed Difference

Query: SELECT avg(price) FROM sales WHERE region='US'

CSV:     Read all rows, parse all columns → ~10s
Parquet: Read 'price'+'region' columns only, 
         skip non-US chunks → ~0.5s

Trade-off: Slower writes, not human-readable, but vastly superior for analytics.

  1. Parquet files are compressed, as in gzipped (or equivalent)?
  2. Can a column have data of more than one type (e.g. numbers and strings)?

1. Compression in Parquet

Yes, but it’s two-layer compression:

Raw data → Encoding → Compression codec → Disk

Snappy is most common because it’s fast (decompression ~500 MB/s) with decent ratios. GZIP compresses better but slower. ZSTD is becoming popular (best of both).

Unlike gzipping a CSV (which must decompress everything), Parquet compresses per column chunk, so you only decompress what you read.
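A stdlib-only sketch of this idea (not the real Parquet layout - JSON and zlib stand in for the encoding and codec): each column is compressed independently, so reading one column decompresses only its bytes.

```python
import json
import zlib

# Toy "columnar file": one independently compressed chunk per column.
table = {
    "region": ["US", "US", "CA", "US"],
    "price": [10.0, 12.5, 9.9, 11.0],
}
chunks = {name: zlib.compress(json.dumps(values).encode())
          for name, values in table.items()}

# Averaging 'price' decompresses only that chunk;
# the 'region' chunk is never touched.
price = json.loads(zlib.decompress(chunks["price"]))
avg = sum(price) / len(price)
```

A gzipped CSV, by contrast, is one compressed stream: you pay to decompress every column to read any of them.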

2. Mixed Types in Columns

No - each column has a strict schema with one type. This is a key feature.

However, you have options:

  • Store the column as strings and parse downstream
  • Split the values into separate typed columns (one per type, with NULLs)
  • Use a nested type, e.g. a struct with one optional field per type

The type strictness is actually why Parquet is fast - no runtime type checking, optimal encoding per type, and engines can generate specialized code paths.

  1. How do I use parquet files (create/read) on a Linux computer?
  2. Is it queryable via SQL syntax?

1. Using Parquet on Linux

Python (most common):

# Install: pip install pandas pyarrow

import pandas as pd

# Write
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df.to_parquet('data.parquet')

# Read
df = pd.read_parquet('data.parquet')

# Read specific columns only
df = pd.read_parquet('data.parquet', columns=['name'])

Command-line tools:

# Install parquet-tools
pip install parquet-tools

# Inspect schema and metadata
parquet-tools inspect data.parquet

# Show content
parquet-tools show data.parquet

# Dump content as CSV
parquet-tools csv data.parquet > data.csv

DuckDB (my favorite for local work):

# Install: pip install duckdb  OR  download the CLI binary from duckdb.org
duckdb -c "SELECT * FROM 'data.parquet'"

2. SQL Querying - YES!

DuckDB (best for local files):

duckdb
SELECT name, AVG(age) 
FROM 'data.parquet' 
WHERE age > 20 
GROUP BY name;

-- Join multiple parquet files
SELECT * FROM 'users.parquet' u
JOIN 'orders.parquet' o ON u.id = o.user_id;

-- Wildcard patterns
SELECT * FROM 'data_*.parquet';

Python with DuckDB:

import duckdb
duckdb.query("SELECT * FROM 'data.parquet' WHERE age > 25").df()

DuckDB is perfect for this - it’s blazing fast, handles Parquet natively, and needs no setup.

I’m sold on duckdb, but how do I create parquet files without using Python/Pandas? Let’s say I have a bunch of CSV files I’d like to convert to Parquet.

DuckDB Can Do This Directly!

Single CSV → Parquet:

duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)"

Multiple CSVs → Single Parquet:

duckdb -c "COPY (SELECT * FROM 'data_*.csv') TO 'combined.parquet' (FORMAT PARQUET)"

Multiple CSVs → Multiple Parquets:

for file in *.csv; do
  duckdb -c "COPY (SELECT * FROM '${file}') TO '${file%.csv}.parquet' (FORMAT PARQUET)"
done

With Compression Options:

duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD', COMPRESSION_LEVEL 9)"

Codecs: SNAPPY (default), GZIP, ZSTD, LZ4, UNCOMPRESSED

Bonus - Transform While Converting:

duckdb -c "COPY (
  SELECT 
    column1::INTEGER as id,
    column2::DATE as date,
    column3
  FROM 'messy.csv'
  WHERE column3 IS NOT NULL
) TO 'clean.parquet' (FORMAT PARQUET)"

No Python needed. DuckDB handles CSV parsing, type inference, and Parquet writing natively.

1. “Column-first”

What does it mean for the storage to be “columnar”? Does it mean the bytes are packed column-first on disk (and in memory), like in Julia arrays, as opposed to C arrays?

No - Different Concept

The Julia/C distinction is about memory layout within multidimensional arrays:

  • C: row-major (consecutive rows)
  • Julia/Fortran: column-major (consecutive columns)

Parquet’s “columnar” is about how tables are stored on disk - completely orthogonal.

What Columnar Actually Means

Row-oriented (CSV, traditional DB):

Disk layout:
[row1: id=1, name="Alice", age=25]
[row2: id=2, name="Bob", age=30]
[row3: id=3, name="Carol", age=27]

Column-oriented (Parquet):

Disk layout:
[id column: 1, 2, 3]
[name column: "Alice", "Bob", "Carol"]  
[age column: 25, 30, 27]

All values for one column are stored together, physically adjacent on disk.

Why This Matters

SELECT AVG(age) FROM users  -- 1 million rows

Row storage: Read all 1M rows × 3 columns = 3M values

Column storage: Read only the age column = 1M values

Physical Reality

Parquet files contain:

  • Header with schema
  • Row groups (chunks of ~100MB)
    • Within each row group: separate column chunks
      • Each column chunk: metadata + compressed bytes for that column’s values

You’re reading bytes sequentially per column, not interleaved row data.

Not about array indexing - about physical data organization on disk.

2. Connection to duckdb

What is the connection between the parquet format and duckdb? Was the format introduced by duckdb? I know that duckdb uses column-first storage and is fast.

No Connection - Independent Technologies

Parquet (2013):

  • Created by Twitter and Cloudera
  • Part of Apache ecosystem
  • Designed for Hadoop/Spark workloads
  • File format specification

DuckDB (2019):

  • Created at CWI Amsterdam (research institute)
  • In-process analytical database
  • Column-first storage in memory/on disk (not Parquet)
  • Database engine

Why They Work So Well Together

Both are columnar, so philosophically aligned:

  • Parquet: columnar file format
  • DuckDB: columnar execution engine

DuckDB reads Parquet exceptionally fast because:

  1. No impedance mismatch - both organize data by column
  2. Direct column chunk → vector register pipeline
  3. Can skip decompression/reading of unused columns
  4. Leverages Parquet’s metadata for predicate pushdown

DuckDB’s Own Format

DuckDB has its own .db format (also columnar), but it treats Parquet as a first-class citizen:

-- Parquet acts like a table, zero import needed
SELECT * FROM 'file.parquet';

Think of it as: Parquet is the storage standard for columnar data files, DuckDB is an execution engine optimized for that pattern.

Other tools (Pandas, Spark, Polars, Drill) also use Parquet - DuckDB just happens to be the best local query engine for it.

3. Comparison with ORC

What is the ORC format and how does it compare to parquet?

ORC (Optimized Row Columnar)

Columnar format developed by Hortonworks for Hive (2013). Parquet came from Twitter/Cloudera around the same time (first released 2013).

Key Differences

1. Ecosystem

  • ORC: Hive-first, Java/Hadoop ecosystem
  • Parquet: Broader adoption - Spark, Impala, Arrow, Pandas, DuckDB

2. Compression

  • ORC: Built-in ZLIB (default), Snappy, LZO, LZ4, ZSTD
  • Parquet: Same codecs, but Snappy default
  • ORC typically 10-15% smaller due to more aggressive encoding

3. File Structure

  • ORC: Stripes (large row groups) with indexes at stripe level
  • Parquet: Smaller row groups, metadata at file end
  • ORC better for streaming writes, Parquet better for random access

4. Indexing

  • ORC: Built-in bloom filters, min/max indexes per stripe
  • Parquet: Min/max only (bloom filters added later, less mature)
  • ORC faster for point lookups

5. Schema Evolution

  • ORC: Better ACID support, easier column add/drop
  • Parquet: More restrictive

6. Performance

  • ORC: 5-10% faster in Hive/Presto
  • Parquet: 5-10% faster in Spark
  • Differences small, workload-dependent

When to Use What

  • ORC: pure Hive/Hadoop stack, need ACID, heavy updates
  • Parquet: everything else - better tooling, wider support, language-agnostic

Practical Reality

Parquet won the ecosystem war. More tools support it. Unless you’re locked into Hive, use Parquet.

4. Encodings

What are dictionary, bit-packing and RLE encodings?

Dictionary Encoding

Store unique values once, then use integer IDs to reference them.

Original column: ["apple", "banana", "apple", "apple", "banana"]

Dictionary: {0: "apple", 1: "banana"}
Encoded:    [0, 1, 0, 0, 1]

Storage: 2 strings + 5 tiny integers (vs 5 full strings)

Massive win for columns with low cardinality (countries, categories, status codes). A column with 1M rows but only 50 unique values becomes trivial.
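A minimal sketch of the idea in Python (`dict_encode` is an illustrative name, not a Parquet API):

```python
def dict_encode(values):
    # Map each distinct value to an integer id, in first-seen order.
    dictionary, ids, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

dictionary, ids = dict_encode(["apple", "banana", "apple", "apple", "banana"])
# dictionary: ["apple", "banana"]; ids: [0, 1, 0, 0, 1]
```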

Bit-Packing

Use minimum bits needed to represent values, not full bytes.

Values: [0, 1, 2, 3, 0, 1, 2, 3]

Normal (8 bits each):  00000000 00000001 00000010 00000011 ...
Bit-packed (2 bits):   00 01 10 11 00 01 10 11

Savings: 8 bytes → 2 bytes (75% reduction)

Works when values fit in small ranges. If max value is 7, you only need 3 bits per number, not 32 or 64.
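A sketch of the packing step (illustrative code with little-endian bit order; real Parquet packs values inside its hybrid RLE/bit-packed runs):

```python
def bit_pack(values, bits):
    # Pack each value into `bits` bits, filling bytes from the low end.
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= v << nbits
        nbits += bits
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        buf.append(acc & 0xFF)
    return bytes(buf)

packed = bit_pack([0, 1, 2, 3, 0, 1, 2, 3], bits=2)
# 8 values fit in 2 bytes instead of 8
```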

RLE (Run-Length Encoding)

Store value + count for consecutive repeats.

Original: [5, 5, 5, 5, 5, 7, 7, 9, 9, 9, 9]

RLE: [(5, count=5), (7, count=2), (9, count=4)]

11 values → 3 pairs

Perfect for sorted or semi-sorted data. A column of timestamps sorted by date or boolean flags with long runs of TRUE/FALSE compresses to almost nothing.
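The same, sketched in Python (`rle_encode` is an illustrative helper; Parquet actually interleaves RLE with bit-packing in one hybrid encoding):

```python
def rle_encode(values):
    # Collapse consecutive repeats into (value, run_length) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

runs = rle_encode([5, 5, 5, 5, 5, 7, 7, 9, 9, 9, 9])
# [(5, 5), (7, 2), (9, 4)]
```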

Combined Example

Parquet often chains these:

Column: region codes (mostly "US", "US", "CA", "US", "US"...)

1. Dictionary: US→0, CA→1
2. RLE on IDs: (0, 1000), (1, 50), (0, 2000)...
3. Bit-pack the counts

Result: 100MB → 500KB

5. Complex datatype support

Can I use structured datatypes in Parquet columns? For example, I want a column where each entry is a list of dates.

Yes - Parquet Has Rich Nested Types

List/Array Columns:

-- Create table with list column
CREATE TABLE events AS 
SELECT 
  'user1' as user_id,
  ['2024-01-01'::DATE, '2024-01-15'::DATE, '2024-02-03'::DATE] as visit_dates;

-- Write to parquet
COPY events TO 'events.parquet' (FORMAT PARQUET);

-- Read and query
SELECT 
  user_id, 
  visit_dates,
  len(visit_dates) as num_visits,
  visit_dates[1] as first_visit  -- 1-indexed
FROM 'events.parquet';

-- Unnest lists
SELECT user_id, UNNEST(visit_dates) as visit_date
FROM 'events.parquet';

Other Complex Types Supported:

-- Struct (like a nested record)
CREATE TABLE users AS
SELECT 
  'Alice' as name,
  {'street': '123 Main', 'city': 'NYC', 'zip': '10001'} as address;

-- Map (key-value pairs)
CREATE TABLE metrics AS
SELECT 
  'server1' as host,
  MAP(['cpu', 'mem'], [0.75, 0.82]) as stats;

-- Nested combinations
CREATE TABLE complex AS
SELECT
  [
    {'name': 'Alice', 'scores': [95, 87, 92]},
    {'name': 'Bob', 'scores': [88, 90, 85]}
  ] as students;

From CSV with Lists:

If your CSV has lists as strings like "2024-01-01,2024-01-15,2024-02-03":

COPY (
  SELECT 
    user_id,
    string_split(date_string, ',')::DATE[] as visit_dates
  FROM 'data.csv'
) TO 'data.parquet' (FORMAT PARQUET);

The schema is preserved in the Parquet metadata - reading tools know it’s a list of dates, not a string.

To be clear, the list and struct syntax is the same as python’s list and dictionary, right?

Almost, but not quite

Lists: ✓ Same as Python: ['item1', 'item2', 'item3']

Structs: ✗ Look like Python dicts ({'key': value} with : separators), but the semantics differ

-- DuckDB struct syntax (named fields, like a record)
{'name': 'Alice', 'age': 25}

-- This looks like Python dict but behaves differently:
-- - Keys are FIXED field names (like class attributes)
-- - Not key-value lookups - it's a structured record
-- - Schema enforced: all rows must have same fields

Maps (actual key-value): For true dictionary behavior, use MAP():

-- Map = actual key-value pairs (like Python dict)
MAP(['key1', 'key2'], ['val1', 'val2'])

-- Can have different keys per row
-- Keys/values must be same type across all entries

Comparison:

-- STRUCT: {field_name: value} - fixed schema, named access
SELECT s.name FROM (SELECT {'name': 'Alice', 'age': 25} as s);

-- MAP: flexible keys, subscript access  
SELECT m['key1'] FROM (SELECT MAP(['key1'], ['val1']) as m);

-- LIST: [item1, item2] - same as Python
SELECT l[1] FROM (SELECT ['a', 'b', 'c'] as l);

So: lists = Python lists, structs ≈ Python dicts visually but more like dataclasses semantically, maps = true dicts.

Author: Karthik Chikmagalur

Created: 2025-12-10 Wed 00:27
