Data-Driven Rusty Cage Match: Pandas vs. Polars

The data science landscape is constantly flooded with ever-improving wrappers and spin-off technologies trying to outperform the status quo. Some are full-blown libraries built for everyone; others are retrofits built for a specific need. For the last decade, one library has reigned as the champion of Python data manipulation: Pandas. But in recent years, an underdog has been creeping up from its rusty beginnings: Polars. Gone are the days when 20GB counted as ‘big data’. Modern big data now arrives in explosive volume, demanding more speed and efficiency than ever.

The Fight Cards: Pandas vs. Polars

Pandas, like any champion, is a well-studied piece of technology. Its limits have been explored and challenged countless times over the decade of its reign. Under the hood, it is built on a NumPy foundation, originally to welcome users coming from R and MATLAB. NumPy was designed with array storage and numerical operations in mind, and Pandas inherits its block-based memory layout, which results in poor cache locality and inefficient columnar operations. On top of that, the majority of DataFrame operations are single-threaded, which creates a CPU bottleneck. This combination of limitations results in a higher memory demand that does not scale well. Don’t get this wrong: Pandas is still a robust and powerful tool backed by a strong track record.
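
A quick way to feel that memory demand is to inspect a frame directly. The snippet below is a minimal sketch with made-up column names; deep=True makes Pandas count the Python objects hiding behind object-dtype columns.

# NOTE: Illustrative sketch (hypothetical data)
import numpy as np
import pandas as pd

df = pd.DataFrame({
	'col_a': np.random.rand(1_000_000),
	'col_b': np.random.rand(1_000_000),
	'label': ['x'] * 1_000_000,          # object dtype is especially costly
})
print(df.memory_usage(deep=True))        # per-column bytes, objects included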

Polars, on the other hand, is written in Rust, a low-level programming language that performs comparably to C/C++. It is backed by the Apache Arrow columnar memory format, which is optimized for vectorized, Single Instruction Multiple Data (SIMD) operations. On top of that, Rust’s rayon library automatically spreads work across all available CPU cores: the antithesis of Pandas’ single-threaded design. On the efficiency front, the Rust-Arrow punch combo dramatically reduces memory allocation overhead. This stack is relatively new in the space and positions itself as a solution to the growing demands of data manipulation.

The New Tool on the Block: Lazy Execution

Now, what draws the line between Pandas and Polars? Lazy Execution. We are all familiar with Eager Execution, which both Pandas and Polars offer: every operation runs immediately, line by line, as it is triggered. This eager style often forces redundant intermediate copies and requires the entire dataset to be loaded up front, which inflates memory usage. It is, however, very useful for exploration and analysis.
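
As a minimal sketch of that eager flow (the file name and columns are hypothetical):

# NOTE: Eager Execution (same shape in Pandas and Polars)
import pandas as pd

df = pd.read_csv('data.csv')             # entire file loaded up front
df = df[df['col_a'] > 100]               # intermediate result materialized
subset = df[['col_a', 'col_b']]          # another copy for the subset

Every line runs the moment it is hit, whether or not the later steps need all of that data.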

Lazy Execution, on the other hand, is Polars’ sharpest tool. Instead of executing each line, Polars records the operations into a “Logical Query Plan”, exposed as a LazyFrame, and nothing runs until you call the collect() method. At that point Polars builds an optimized plan, applying techniques such as Predicate Pushdown, Projection Pushdown, plan reordering, and streaming execution. This ensures that at every step of the plan, only the needed data is loaded, operated on, and stored, minimizing the overall compute.
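
Here is a minimal sketch of that lazy flow, assuming a hypothetical data.csv with columns col_a and col_b:

# NOTE: Lazy Execution in Polars
import polars as pl

plan = (
	pl.scan_csv('data.csv')               # scan, don't load: returns a LazyFrame
	.filter(pl.col('col_a') > 100)        # predicate recorded, not executed
	.select('col_a', 'col_b')             # projection recorded, not executed
)
print(plan.explain())                     # inspect the optimized query plan
result = plan.collect()                   # only now does anything actually run

Thanks to predicate and projection pushdown, the scan reads only the two referenced columns and drops rows that fail the filter as early as possible.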

The Feint and the Counter: Syntax Comparison

In reality, Polars is not a drop-in replacement for Pandas, i.e. some Pandas functions have no direct Polars counterpart. Although it offers almost 1:1 operations, there are a couple of nuances, especially in its syntax. Here are some common operations side by side (assume import pandas as pd, import numpy as np, and import polars as pl throughout):

Assignment


# NOTE: Pandas Implementation
df['new_col'] = df['col_a'] * 2                    # mutates df in place
df = df.assign(new_col=df['col_a'] * 2)            # chainable: returns a new DataFrame

# ---

# NOTE: Polars Implementation
df = df.with_columns(new_col=pl.col('col_a') * 2)  # always returns a new DataFrame

Column Selection


# NOTE: Pandas Implementation
subset_df = df[['col_a', 'col_b']]

# ---

# NOTE: Polars Implementation
subset_df = df.select('col_a', 'col_b')

Select and Rename


# NOTE: Pandas Implementation
subset_df = df.rename(columns={'old_a': 'new_a'})[['new_a', 'col_b']]

# ---

# NOTE: Polars Implementation
subset_df = df.select(pl.col('old_a').alias('new_a'), 'col_b')

Simple Filter


# NOTE: Pandas Implementation
filtered_df = df[df['col_a'] > 100]

# ---

# NOTE: Polars Implementation
filtered_df = df.filter(pl.col('col_a') > 100)

Modifying Columns (Transformations)


# NOTE: Pandas Implementation
df['flag'] = np.where(df['col_a'] > 50, 'High', 'Low')

# ---

# NOTE: Polars Implementation
df = df.with_columns(
	flag=pl.when(pl.col('col_a') > 50)
		.then(pl.lit('High'))        # pl.lit marks a literal; a bare string is read as a column name
		.otherwise(pl.lit('Low')))

Grouping and Aggregation


# NOTE: Pandas Implementation
result = df.groupby('key').agg({'val': ['mean', 'sum'], 'other_val': 'max'})  # note: yields MultiIndex columns

# ---

# NOTE: Polars Implementation
result = df.group_by('key').agg([
	pl.col('val').mean().alias('val_mean'),
	pl.col('val').sum().alias('val_sum'),
	pl.col('other_val').max()])


These examples demonstrate how Polars leans heavily on expressions and chained methods, which are the building blocks of the query plan behind a LazyFrame. Expressions primarily describe the instructions without executing them. Some Polars operations may look more convoluted, but the extra verbosity makes each step explicit, and that is exactly what shapes its *query* silhouette.
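
Because expressions only describe work, an eager DataFrame can switch into plan-building mode at any point via .lazy(). A short sketch, reusing the hypothetical columns from above:

# NOTE: Polars Implementation
result = (
	df.lazy()                             # switch from eager to plan-building
	.filter(pl.col('col_a') > 100)
	.group_by('key')
	.agg(pl.col('val').mean().alias('val_mean'))
	.collect()                            # optimize and execute in one shot
)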

The Climax: Performance Benchmark

There are several metrics these two can compete on: speed, file I/O, aggregation, and memory efficiency. Polars outmatches Pandas by a mile: aggregation (group-by) operations leverage multi-threading automatically, and overall memory is handled more efficiently through Apache Arrow. Finally, in our general speed tests, the compute time of one complex custom function dropped from 159 seconds to 17 seconds: a 9.35x speedup on its own.
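
Exact numbers will vary with your workload and hardware, but a rough harness like the one below (the sizes and columns are arbitrary) is enough to reproduce the gap on your own machine:

# NOTE: A rough, hypothetical timing harness
import time
import numpy as np
import pandas as pd
import polars as pl

n = 10_000_000
data = {'key': np.random.randint(0, 1_000, n), 'val': np.random.rand(n)}
pdf, plf = pd.DataFrame(data), pl.DataFrame(data)

t0 = time.perf_counter()
pdf.groupby('key')['val'].mean()
print(f'Pandas: {time.perf_counter() - t0:.3f}s')

t0 = time.perf_counter()
plf.group_by('key').agg(pl.col('val').mean())
print(f'Polars: {time.perf_counter() - t0:.3f}s')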

However, these improvements are not without a catch. The working dataset should be LARGE. The observable gains come from parallelization and query optimization, which only pay off at scale. On relatively small datasets, Pandas performs negligibly differently from Polars (and is sometimes even faster) because Polars’ parallelization and planning overhead is not amortized. Lastly, in our speed tests, some Polars operations needed additional steps to imitate a Pandas function, which resulted in slower runtimes.

The Bell Rang: Pick Your Champion

The bout has reached its conclusion. The “winner” depends on your own tally card: use case, data size, and requirements.

Choose Pandas when:

  • Exploratory Data Analysis (EDA): fast experimentation and analysis on small datasets
  • Ecosystem Integration: leveraging other libraries such as Scikit-learn, or visualization tools such as Matplotlib and Seaborn
  • Straightforward Workflows: switching syntax can be taxing when the pipeline is simple

Choose Polars when:

  • Production Necessity: speed and efficiency save compute cost and help deliver on SLAs
  • Large Datasets: recurring out-of-memory problems
  • Complex Transformations: query optimization smooths out the cadence of chained operations


Let’s not forget that these matchups can also end in a draw. A hybrid setup is achievable, maximizing both tools: Polars for the heavy-lifting data transformations and Pandas for data visualization and model consumption.
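
In code, the handoff is a single call. A sketch with illustrative names (to_pandas() relies on pyarrow under the hood):

# NOTE: Hybrid workflow sketch
import polars as pl

heavy = (
	pl.scan_csv('data.csv')               # Polars does the heavy lifting lazily
	.group_by('key')
	.agg(pl.col('val').mean().alias('val_mean'))
	.collect()
)
pdf = heavy.to_pandas()                   # hand off to the Pandas ecosystem
pdf.plot(x='key', y='val_mean')           # e.g., Matplotlib via Pandas plotting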
