What is Polars?
Before we go into the implementation details of Polars, let’s quickly look at what Polars is.
Polars is a super-fast DataFrame library for Rust and Python. It exposes a set of query APIs similar to those of Pandas.
For example, here is a snippet provided by Polars that loads parquet files and performs operations such as `groupby`, `agg`, and `join`:
```rust
let lf1 = LazyFrame::scan_parquet("myfile_1.parquet", Default::default())?
    .groupby([col("ham")])
    .agg([
        // expressions can be combined into powerful aggregations
        col("foo")
            .sort_by([col("ham").rank(Default::default(), None)], [false])
            .last()
            .alias("last_foo_ranked_by_ham"),
        // every expression runs in parallel
        col("foo").cummin(false).alias("cumulative_min_per_group"),
        // every expression runs in parallel
        col("foo").reverse().implode().alias("reverse_group"),
    ]);

let lf2 = LazyFrame::scan_parquet("myfile_2.parquet", Default::default())?
    .select([col("ham"), col("spam")]);

let df = lf1
    .join(lf2, [col("reverse")], [col("foo")], JoinArgs::new(JoinType::Left))
    // now we finally materialize the result.
    .collect()?;
```
Polars uses Apache Arrow as its memory model. It also relies on SIMD instructions, parallelization, query optimization, and several other techniques to deliver lightning-fast performance.
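To see the query optimizer at work, here is a minimal sketch (not part of the original Polars example) that builds a small lazy query on the same `myfile_1.parquet`/`foo`/`ham` names used above and prints its logical plan before and after optimization. The inspection methods `describe_plan` and `describe_optimized_plan` come from the polars-lazy crate, but their exact names and return types have changed across Polars versions, so treat the details as an assumption to adapt to your version.

```rust
use polars::prelude::*;

// Sketch: build a lazy query and inspect its plan before and after
// optimization. `describe_plan`/`describe_optimized_plan` exist in
// polars-lazy, but their signatures vary between versions.
fn inspect_plan() -> PolarsResult<()> {
    let lf = LazyFrame::scan_parquet("myfile_1.parquet", Default::default())?
        .filter(col("foo").gt(lit(10)))
        .select([col("ham"), col("foo")]);

    // The logical plan exactly as written.
    println!("{}", lf.describe_plan()?);

    // The plan after optimizations such as predicate and projection
    // pushdown: the filter and the column selection are pushed down
    // into the parquet scan, so less data is read and materialized.
    println!("{}", lf.describe_optimized_plan()?);

    // Nothing has executed yet; calling `collect()` would run the
    // optimized plan and materialize a DataFrame.
    Ok(())
}
```

Note that `collect()` executes this optimized plan, not the query exactly as it was written.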
Apache Arrow
The best way to understand Apache Arrow is to check out its official documentation. To summarize, it is an in-memory columnar format for representing tabular data.
The columnar format is beneficial for computation because it enables vectorization with SIMD (Single Instruction, Multiple Data) instructions. Apache Arrow also supports zero-copy reads, so data can be shared between processes and libraries without expensive copying or serialization.
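To make the columnar idea more concrete, here is a simplified, hand-written sketch of a nullable integer column stored as a contiguous values buffer plus a validity mask. This is my own illustration, not Arrow's actual implementation (real Arrow arrays use packed validity bitmaps, aligned buffers, and offsets for variable-length and nested types), but it shows why a kernel like a sum becomes a linear scan over memory that the compiler can auto-vectorize with SIMD.

```rust
// Simplified illustration of a columnar, Arrow-like layout.
// NOT Arrow's real code: Arrow packs validity into a bitmap and tracks
// buffer alignment, offsets, and child arrays for nested types.
struct Int64Column {
    values: Vec<i64>,    // contiguous values buffer
    validity: Vec<bool>, // true = value present, false = null
}

impl Int64Column {
    /// Sum of all non-null values. Because the values live in one
    /// contiguous buffer, this is a linear scan over memory that the
    /// compiler can auto-vectorize with SIMD instructions.
    fn sum(&self) -> i64 {
        self.values
            .iter()
            .zip(&self.validity)
            .filter(|(_, &valid)| valid)
            .map(|(&v, _)| v)
            .sum()
    }
}

fn main() {
    let col = Int64Column {
        values: vec![1, 2, 3, 4],
        validity: vec![true, true, false, true],
    };
    // 1 + 2 + 4 = 7; the null at index 2 is skipped.
    assert_eq!(col.sum(), 7);
}
```

A row-oriented layout would interleave all of a table's fields, so the same sum would have to stride over unrelated data, which largely defeats vectorization.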
I personally haven’t had the time to dig into the details of the Apache Arrow design, but I definitely plan to one day!