DataFrame

DataFrame

Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.

The library ships three API layers — all operating on the same underlying DataFrame type at runtime:

Untyped (import qualified DataFrame as D) — string-based column names, great for exploration and scripting.
Typed (import qualified DataFrame.Typed as T) — phantom-type schema tracking with compile-time column validation.
Monadic API — write your transformation as a self contained pipeline.

Why this library?

Concise, declarative, composable data pipelines using the |> pipe operator.
Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.

Install

cabal update
cabal install dataframe

To use as a dependency in a project:

build-depends: base >= 4, dataframe

Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:

dataframe

Quick Start

Save this as Example.hs and run with cabal run Example.hs:

#!/usr/bin/env cabal
{- cabal:
  build-depends: base >= 4, dataframe
-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators

main :: IO ()
main = do
    let sales = D.fromNamedColumns
            [ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
            , ("amount",  D.fromList [100, 120, 50, 20, 40, 30 :: Int])
            ]

    -- Group by product and compute totals
    print $ sales
        |> D.groupBy ["product"]
        |> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
                       , F.count (F.col @Int "amount") `as` "orders"
                       ]

-----------------------
product | total | orders
--------|-------|-------
  Int   |  Int  |  Int
--------|-------|-------
1       | 220   | 2
2       | 70    | 2
3       | 70    | 2

Reading from files works the same way:

df <- D.readCsv "data.csv"
df <- D.readParquet "data.parquet"

-- Hugging Face datasets
df <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"

Interactive REPL

The dataframe REPL comes with all imports pre-loaded. Here's a typical exploration session:

dataframe> df <- D.readCsv "./data/housing.csv"
dataframe> D.dimensions df
(20640, 10)

dataframe> D.describeColumns df
------------------------------------------------------------------------
    Column Name     | ## Non-null Values | ## Null Values |     Type
--------------------|--------------------|----------------|-------------
        Text        |         Int        |      Int       |     Text
--------------------|--------------------|----------------|-------------
 total_bedrooms     | 20433              | 207            | Maybe Double
 ocean_proximity    | 20640              | 0              | Text
 median_house_value | 20640              | 0              | Double
 median_income      | 20640              | 0              | Double
 households         | 20640              | 0              | Double
 population         | 20640              | 0              | Double
 total_rooms        | 20640              | 0              | Double
 housing_median_age | 20640              | 0              | Double
 latitude           | 20640              | 0              | Double
 longitude          | 20640              | 0              | Double

The :declareColumns macro generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing F.col @Double "median_income" every time:

dataframe> :declareColumns df
"longitude :: Expr Double"
"latitude :: Expr Double"
"housing_median_age :: Expr Double"
"total_rooms :: Expr Double"
"total_bedrooms :: Expr (Maybe Double)"
"population :: Expr Double"
"households :: Expr Double"
"median_income :: Expr Double"
"median_house_value :: Expr Double"
"ocean_proximity :: Expr Text"

dataframe> df |> D.groupBy ["ocean_proximity"]
              |> D.aggregate [F.mean median_house_value `as` "avg_value"]
-------------------------------------
 ocean_proximity |     avg_value
-----------------|-------------------
      Text       |       Double
-----------------|-------------------
 <1H OCEAN       | 240084.28546409807
 INLAND          | 124805.39200122119
 ISLAND          | 380440.0
 NEAR BAY        | 259212.31179039303
 NEAR OCEAN      | 249433.97742663656

Create new columns from existing ones:

dataframe> df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3
-----------------------------------------------------------------------------------------------------------------
 longitude | latitude | housing_median_age | total_rooms | ... | ocean_proximity | rooms_per_household
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
  Double   |  Double  |       Double       |   Double    | ... |      Text       |       Double
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
 -122.23   | 37.88    | 41.0               | 880.0       | ... | NEAR BAY        | 6.984126984126984
 -122.22   | 37.86    | 21.0               | 7099.0      | ... | NEAR BAY        | 6.238137082601054
 -122.24   | 37.85    | 52.0               | 1467.0      | ... | NEAR BAY        | 8.288135593220339

Type mismatches are caught as compile errors — adding a Double column to a Text column won't silently produce garbage:

dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)

<interactive>:14:47: error: [GHC-83865]
    • Couldn't match type 'Text' with 'Double'
        Expected: Expr Double
          Actual: Expr Text
    • In the second argument of '(+)', namely 'ocean_proximity'
      In the second argument of 'derive', namely
        '(latitude + ocean_proximity)'

Template Haskell

For scripts and projects, Template Haskell can generate column bindings at compile time.

Generate column references from a CSV

declareColumnsFromCsvFile reads your CSV at compile time and generates typed Expr bindings for every column:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators

-- Reads housing.csv at compile time and generates:
--   latitude :: Expr Double
--   total_rooms :: Expr Double
--   ocean_proximity :: Expr Text
--   ... one binding per column
$(F.declareColumnsFromCsvFile "./data/housing.csv")

main :: IO ()
main = do
    df <- D.readCsv "./data/housing.csv"
    print $ df
        |> D.derive "rooms_per_household" (total_rooms / households)
        |> D.filterWhere (median_income .>. 5)
        |> D.groupBy ["ocean_proximity"]
        |> D.aggregate [F.mean median_house_value `as` "avg_value"]

Compare this to the manual version which requires spelling out every column name and type:

-- Without TH — every column needs its name and type spelled out
df |> D.derive "rooms_per_household"
        (F.col @Double "total_rooms" / F.col @Double "households")
   |> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)

Generate a schema type from a CSV

deriveSchemaFromCsvFile generates a type synonym for use with the typed API — instead of manually writing out every column name and type:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DataKinds #-}

import qualified DataFrame.Typed as T

-- Generates:
-- type HousingSchema = '[ T.Column "longitude" Double
--                        , T.Column "latitude" Double
--                        , T.Column "total_rooms" Double
--                        , ...
--                        ]
$(T.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")

Typed API

When you want compile-time guarantees that column names exist and types match, wrap your DataFrame in a TypedDataFrame:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Typed as T
import Data.Text (Text)
import DataFrame.Operators

type EmployeeSchema =
    '[ T.Column "name"       Text
     , T.Column "department" Text
     , T.Column "salary"     Double
     ]

main :: IO ()
main = do
    df <- D.readCsv "employees.csv"
    case T.freeze @EmployeeSchema df of
        Nothing  -> putStrLn "Schema mismatch!"
        Just tdf -> do
            let result = tdf
                    |> T.derive @"bonus" (T.col @"salary" * T.lit 0.1)
                    |> T.filterWhere (T.col @"salary" .>. T.lit 50000)
                    |> T.select @'["name", "bonus"]
            print (T.thaw result)

T.freeze validates the runtime DataFrame against your schema once at the boundary. After that, every column access is checked at compile time:

-- Typo in column name → compile error
tdf |> T.filterWhere (T.col @"slary" .>. T.lit 50000)
-- error: Column "slary" not found in schema

-- Wrong type → compile error
tdf |> T.filterWhere (T.col @"name" .>. T.lit 50000)
-- error: Couldn't match type 'Text' with 'Double'

filterAllJust goes further — it strips Maybe from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable:

-- Before: TypedDataFrame '[Column "score" (Maybe Double), Column "name" Text]
let cleaned = T.filterAllJust tdf
-- After:  TypedDataFrame '[Column "score" Double, Column "name" Text]

cleaned |> T.derive @"scaled" (T.col @"score" * T.lit 100)

Features

I/O: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (hf:// URIs). Column projection and predicate pushdown for Parquet reads.

Operations: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.

Expressions: typed column references (F.col @Double "x"), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (.==, .&&), string matching (like, regex), casting, and user-defined functions via lift/lift2.

Statistics: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.

Plotting: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.

Lazy engine: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.

Interop: Arrow C Data Interface for zero-copy round-trips with Python and Polars.

ML: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.

Notebooks: IHaskell integration with pre-built Binder examples.

Lazy Queries

For files too large to fit in memory, DataFrame.Lazy provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and runDataFrame runs it through an optimizer before streaming results batch-by-batch:

import qualified DataFrame.Lazy as L
import qualified DataFrame.Functions as F
import DataFrame.Operators
import DataFrame.Internal.Schema (Schema, schemaType)
import Data.Text (Text)

mySchema :: Schema
mySchema = [ ("name",   schemaType @Text)
           , ("weight", schemaType @Double)
           , ("height", schemaType @Double)
           ]

main :: IO ()
main = do
    result <- L.runDataFrame $
        L.scanCsv mySchema "large_file.csv"
        |> L.filter  (F.col @Double "height" .>. F.lit 1.7)
        |> L.select  ["name", "weight", "height"]
        |> L.derive  "bmi" (F.col @Double "weight"
                           / (F.col @Double "height" * F.col @Double "height"))
        |> L.take 1000
    print result

The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.

Documentation

User guide: https://dataframe.readthedocs.io/en/latest/
API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html
Coming from pandas, Polars, dplyr, or Frames?
Cookbook (SQL-style patterns)
Tutorials
Discord: https://discord.gg/8u8SCWfrNC

Name		Name	Last commit message	Last commit date
Latest commit History 1,151 Commits
.github		.github
app		app
benchmark		benchmark
cbits		cbits
data		data
dataframe-fastcsv		dataframe-fastcsv
dataframe-hasktorch		dataframe-hasktorch
dataframe-persistent		dataframe-persistent
docs		docs
examples		examples
ffi/DataFrame		ffi/DataFrame
python		python
scripts		scripts
src		src
static		static
tests		tests
.ghci		.ghci
.gitignore		.gitignore
.hlint.yaml		.hlint.yaml
.readthedocs.yaml		.readthedocs.yaml
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
cabal.project		cabal.project
dataframe.cabal		dataframe.cabal
dataframe.ghci		dataframe.ghci
flake.nix		flake.nix
fourmolu.yaml		fourmolu.yaml
test_coverage.md		test_coverage.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataFrame

Why this library?

Install

Quick Start

Interactive REPL

Template Haskell

Generate column references from a CSV

Generate a schema type from a CSV

Typed API

Features

Lazy Queries

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DataFrame

Why this library?

Install

Quick Start

Interactive REPL

Template Haskell

Generate column references from a CSV

Generate a schema type from a CSV

Typed API

Features

Lazy Queries

Documentation

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages