{"id":672,"date":"2025-11-24T22:47:15","date_gmt":"2025-11-24T22:47:15","guid":{"rendered":"https:\/\/phitopolis.com\/blog\/?p=672"},"modified":"2025-11-25T00:44:15","modified_gmt":"2025-11-25T00:44:15","slug":"data-driven-rusty-cage-match-pandas-vs-polars","status":"publish","type":"post","link":"https:\/\/phitopolis.com\/blog\/index.php\/2025\/11\/24\/data-driven-rusty-cage-match-pandas-vs-polars\/","title":{"rendered":"Data-Driven Rusty Cage Match: Pandas vs. Polars"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">The data science landscape is always flooded with ever-improving wrappers and spin-off technologies trying to outperform the status quo. Some are full-blown libraries for everyone to use; others are retrofits built for specific needs. For the last decade, one library has reigned supreme as a champion of Python data manipulation: <a href=\"https:\/\/pandas.pydata.org\/\">Pandas<\/a>. But in recent years, an underdog has been creeping up from its rusty beginnings: <a href=\"https:\/\/pola.rs\/\">Polars<\/a>. Gone are the days of 20GB &#8216;big data&#8217; datasets. Modern big data arrives in explosive volumes, which demands more speed and efficiency than ever.<\/span><\/p>\n<h1 style=\"text-align: left;\">The Fight Cards: Pandas vs. Polars<\/h1>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Pandas, like any champion, is a very well-studied piece of technology. Its limits have been explored and challenged countless times over the decade of its reign. Under the hood, it is built on a <a href=\"https:\/\/numpy.org\/\">NumPy<\/a> foundation to support users coming from <a href=\"https:\/\/www.r-project.org\/\">R<\/a> and <a href=\"https:\/\/www.mathworks.com\/products\/matlab.html\">MATLAB<\/a>. It is designed with array storage and numerical operations in mind and uses a block-based memory layout inherited from NumPy. 
This block-based memory layout results in poor cache locality and inefficient columnar operations. Most <code>DataFrame<\/code> operations are single-threaded, which creates a <em>CPU bottleneck<\/em>. Together, these limitations result in higher memory demand and poor scaling. Don&#8217;t get this wrong: Pandas is still a robust and powerful tool with a good track record.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Polars, on the other hand, is written in <a href=\"https:\/\/rust-lang.org\/\">Rust<\/a>, a low-level programming language that performs comparably to C\/<a href=\"https:\/\/devdocs.io\/cpp\/\">C++<\/a>. It is backed by the <a href=\"https:\/\/arrow.apache.org\/\">Apache Arrow<\/a> columnar memory format, which is optimized for vectorized, or Single Instruction, Multiple Data (SIMD), operations. On top of that, Rust&#8217;s <code>rayon<\/code> library automatically spreads work across the available CPU cores, the antithesis of Pandas&#8217; single-core design. In terms of efficiency, the Rust-Arrow punch combo dramatically helps with memory allocation. This combination of tech is relatively new in the space and positions itself as a solution to the growing problem of data manipulation.<\/span><\/p>\n<h1 style=\"text-align: left;\">The New Tool on the Block: Lazy Execution<\/h1>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Now, what draws the line between Pandas and Polars? Lazy Execution. We are all familiar with Eager Execution, which both Pandas and Polars offer: each operation runs immediately, creating or modifying a <code>DataFrame<\/code> line by line as it is triggered. This eager style often forces redundant intermediate data copies and requires the entire dataset to be loaded up front, which drives up memory usage. 
It is, however, very useful for exploration and analysis.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">On the other hand, Lazy Execution is Polars&#8217; sharpest tool. It applies query optimization so that each operation loads only the data it needs, which uses less memory. Instead of executing each line immediately, Polars builds a &#8220;Logical Query Plan&#8221;, or <code>LazyFrame<\/code>, line by line and runs nothing until the <code>collect()<\/code> method is called on the finished <em>plan<\/em>. The <code>collect()<\/code> method then optimizes the plan, applying techniques such as Predicate Pushdown, Projection Pushdown, Reordering, and Streaming, and executes it. This ensures that at every step of the plan, only the needed data is operated on and stored, minimizing overall compute.<\/span><\/p>\n<h1 style=\"text-align: left;\">The Feint and the Counter: Syntax Comparison<\/h1>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">In reality, Polars is not a direct replacement for Pandas; some Pandas functions do not have a Polars counterpart. Although it offers almost 1:1 operations, there are a few nuances, especially in its syntax. 
Here are some common operations and their syntax in each library:<\/span><\/p>\n<p><b>Assignment<\/b><\/p>\n<pre class=\"code-block\"><code>\r\n# All snippets assume: import numpy as np; import polars as pl\r\n# NOTE: Pandas Implementation\r\ndf['new_col'] = df['col_a'] * 2\r\n# or, equivalently, returning a new DataFrame:\r\ndf = df.assign(new_col=df['col_a'] * 2)\r\n\r\n# ---\r\n\r\n# NOTE: Polars Implementation\r\ndf = df.with_columns(new_col=pl.col('col_a') * 2)\r\n\r\n<\/code><\/pre>\n<p><b>Column Selection<\/b><\/p>\n<pre class=\"code-block\"><code>\r\n# NOTE: Pandas Implementation\r\nsubset_df = df[['col_a', 'col_b']]\r\n\r\n# ---\r\n\r\n# NOTE: Polars Implementation\r\nsubset_df = df.select('col_a', 'col_b')\r\n\r\n<\/code><\/pre>\n<p><b>Select and Rename<\/b><\/p>\n<pre class=\"code-block\"><code>\r\n# NOTE: Pandas Implementation\r\nsubset_df = df.rename(columns={'old_a': 'new_a'})[['new_a', 'col_b']]\r\n\r\n# ---\r\n\r\n# NOTE: Polars Implementation\r\nsubset_df = df.select(pl.col('old_a').alias('new_a'), 'col_b')\r\n\r\n<\/code><\/pre>\n<p><b>Simple Filter<\/b><\/p>\n<pre class=\"code-block\"><code>\r\n# NOTE: Pandas Implementation\r\nfiltered_df = df[df['col_a'] &gt; 100]\r\n\r\n# ---\r\n\r\n# NOTE: Polars Implementation\r\nfiltered_df = df.filter(pl.col('col_a') &gt; 100)\r\n\r\n<\/code><\/pre>\n<p><b>Modifying Columns (Transformations)<\/b><\/p>\n<pre class=\"code-block\"><code>\r\n# NOTE: Pandas Implementation\r\ndf['flag'] = np.where(df['col_a'] &gt; 50, 'High', 'Low')\r\n\r\n# ---\r\n\r\n# NOTE: Polars Implementation\r\n# pl.lit() marks string literals; bare strings in then() and otherwise() are read as column names\r\ndf = df.with_columns(\r\n\tflag=pl.when(pl.col('col_a') &gt; 50).then(pl.lit('High')).otherwise(pl.lit('Low')))\r\n\r\n<\/code><\/pre>\n<p><b>Grouping and Aggregation<\/b><\/p>\n<pre class=\"code-block\"><code>\r\n# NOTE: Pandas Implementation\r\nresult = df.groupby('key').agg({'val': ['mean', 'sum'], 'other_val': 'max'})\r\n\r\n# ---\r\n\r\n# NOTE: Polars Implementation\r\nresult = df.group_by('key').agg([\r\n\tpl.col('val').mean().alias('val_mean'),\r\n\tpl.col('val').sum().alias('val_sum'),\r\n\tpl.col('other_val').max()])\r\n\r\n<\/code><\/pre>\n<p>&nbsp;<\/p>\n<p 
style=\"text-align: justify;\"><span style=\"font-weight: 400;\">These examples demonstrate how Polars leans heavily on <em>expressions<\/em> and <em>chained methods<\/em>, which are the building blocks of the query plan, or <code>LazyFrame<\/code>. These primarily describe the instructions rather than executing them. While some Polars operations may seem more complex or convoluted, they are more verbose and explicit, which shapes the <em>query<\/em> silhouette.<\/span><\/p>\n<h1 style=\"text-align: left;\">The Climax: Performance Benchmark<\/h1>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">There are several metrics these two can compete on: <em>speed, file I\/O, aggregation<\/em>, and <em>memory efficiency<\/em>. Across them, Polars outmatches Pandas by a mile. Aggregation (<code>group_by()<\/code>) operations leverage multi-threading automatically, and memory is handled more efficiently with <em>Apache Arrow<\/em>. Finally, in our general speed tests, the compute time of a complex custom function dropped from 159 seconds to 17 seconds, a <em>9.35x<\/em> speedup.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">However, these improvements are not without a catch. The working dataset should be <strong>LARGE<\/strong>. The observable gains come from the efficiency of parallelization and query optimization on large datasets. On a relatively small dataset, Pandas performs about the same as Polars, and is sometimes even faster, thanks to its minimal overhead. Lastly, in our speed test, some Polars functions needed additional steps to recreate or imitate a Pandas function, which resulted in a slower runtime.<\/span><\/p>\n<h1 style=\"text-align: left;\">The Bell Rang: Pick Your Champion<\/h1>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">The bout has reached its conclusion. 
The &#8220;winner&#8221; depends on your own tally card, based on use case, data size, and requirements.<\/span><\/p>\n<h2 style=\"text-align: left;\"><b>Choose Pandas when:<\/b><\/h2>\n<ul>\n<li style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Exploratory Data Analysis (EDA): fast experimentation and analysis on small datasets<\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Ecosystem Integration: leveraging other libraries such as <a href=\"https:\/\/scikit-learn.org\/\">Scikit-learn<\/a>, or data visualization libraries such as <a href=\"https:\/\/matplotlib.org\/\">Matplotlib<\/a> and <a href=\"https:\/\/seaborn.pydata.org\/\">Seaborn<\/a><\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Straightforward Workflow: a syntax switch can be more taxing than it is worth for a simpler workflow<\/span><\/li>\n<\/ul>\n<h2 style=\"text-align: left;\"><b>Choose Polars when:<\/b><\/h2>\n<ul>\n<li style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Production Necessity: speed and efficiency can save compute cost and help meet SLAs<\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Large Datasets: recurring out-of-memory problems<\/span><\/li>\n<li style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Complex Transformations: query optimization can smooth out the cadence of operations<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Let&#8217;s not forget that these matchups can also end in a draw. A hybrid setup is achievable, making the most of both tools: Polars for heavy-lifting data transformation and Pandas for data visualization and model consumption.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The data science landscape is always flooded with ever-improving wrappers and spin-off technologies trying to outperform the status quo. 
Some are full-blown libraries for everyone to use; some are retrofits for their specific needs. For the last decade, one library<\/p>\n","protected":false},"author":9,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[118],"tags":[117,116,113,114,115],"ppma_author":[119],"authors":[{"term_id":119,"user_id":0,"is_guest":1,"slug":"alec-marohom","display_name":"Alec Marohom","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","first_name":"","last_name":"","user_url":"","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/672"}],"collection":[{"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=672"}],"version-history":[{"count":14,"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/672\/revisions"}],"predecessor-version":[{"id":686,"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/672\/revisions\/686"}],"wp:attachment":[{"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=672"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=672"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=672"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/phitopolis.com\/blog\/index.php\/wp-json\/wp\/v2\/ppma_author?post=672"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}