Polars vs. Pandas — An Unbiased Speed Comparison


Overview

  1. Introduction — Purpose and Reasons
  2. Datasets, Tasks, and Settings
  3. Results
  4. Conclusions
  5. Wrapping Up

Introduction — Purpose and Reasons

Speed matters when you're dealing with large amounts of data. If you're handling data in a cloud data warehouse or a similar environment, then the execution speed of your data ingestion and processing affects the following:

  • Cloud costs: This is probably the biggest factor. More compute time means higher costs under most billing models. Under billing models based on a preallocated amount of resources, you could have chosen a lower service tier if your ingestion and processing had been faster.
  • Data timeliness: If you have a real-time stream that takes 5 minutes to process data, then your users will see a lag of at least 5 minutes when viewing the data through, e.g., a Power BI report. In some situations this difference matters a lot. Even for batch jobs, data timeliness is important: if you're running a batch job every hour, it's much better if it takes 2 minutes rather than 20.
  • Feedback loop: If your batch job takes only a minute to run, you get a very quick feedback loop. That probably makes your work more enjoyable, and it lets you catch logical errors more quickly.

As you've probably gathered from the title, I'm going to give a speed comparison between the two Python libraries Polars and Pandas. If you already know anything about Pandas and Polars, then you know that Polars is the (relatively) new kid on the block claiming to be much faster than Pandas. You probably also know that Polars is implemented in Rust, a trend it shares with many other modern Python tools like uv and Ruff.

There are two distinct reasons why I want to run a speed comparison between Polars and Pandas:

Reason 1 — Investigating the Claims

Polars makes the following claim on its website: compared to pandas, it (Polars) can achieve more than 30x performance gains.

As you can see, you can follow a link to their benchmarks. It's commendable that their speed tests are open source. But if you are writing the comparison tests for both your own tool and a competitor's tool, then there is a slight conflict of interest. I'm not saying that they are deliberately overselling the speed of Polars, but rather that they may have unconsciously selected favorable comparisons.

Hence the first reason for a speed comparison is simply to see whether it supports the claims put forward by Polars or not.

Reason 2 — Higher Granularity

Another reason for a speed comparison between Polars and Pandas is to make it slightly clearer where the performance gains lie.

This may already be clear if you're an expert on both libraries. However, speed tests between Polars and Pandas are mostly of interest to those considering switching tools. In that case, you might not yet have played around much with Polars because you're unsure whether it's worth it.

Hence the second reason for a speed comparison is simply to see where the speed gains are located.

I want to test both libraries on a range of tasks across both data ingestion and data processing. I'll also consider datasets that are both small and large. I'll stick to common tasks within data engineering, rather than esoteric tasks that one seldom uses.

What I won't do

  • I won't give a tutorial on either Pandas or Polars. If you want to learn Pandas or Polars, then a good place to start is their documentation.
  • I won't cover other common data processing libraries. This may be disappointing to fans of PySpark, but having a distributed compute model makes comparisons a bit more difficult. You might find that PySpark is quicker than Polars on tasks that are very easy to parallelize, but slower on other tasks where keeping all the data in memory reduces travel times.
  • I won't provide full reproducibility. Since this is, in humble terms, only a blog post, I'll only describe the datasets, tasks, and system settings I've used. I won't host a complete running environment with the datasets and package everything up neatly. This is not a precise scientific experiment, but rather a guide that only cares about rough estimates.

Finally, before we start, I want to say that I like both Polars and Pandas as tools. I'm obviously not financially or otherwise compensated by either of them, and have no incentive other than being curious about their performance ☺️

Datasets, Tasks, and Settings

Let's first describe the datasets I'll be considering, the tasks the libraries will perform, and the system settings I'll be running them on.

Datasets

At most companies, you'll need to work with both small and (relatively) large datasets. In my opinion, a good data processing tool can handle both ends of the spectrum. Small datasets challenge the start-up time of tasks, while larger datasets challenge scalability. I'll consider two datasets, both of which can be found on Kaggle:

  • A small dataset in CSV format: It's no secret that CSV files are everywhere! Often they're quite small, coming from Excel files or database dumps. What better example than the classic iris dataset (licensed under the CC0 1.0 Universal License) with 5 columns and 150 rows. The iris version I linked to on Kaggle has 6 columns, but the classic one doesn't have a running index column, so remove this column if you want exactly the same dataset as mine. The iris dataset is small data by any stretch of the imagination.
  • A large dataset in Parquet format: The parquet format is super useful for big data since it has built-in column-wise compression (among many other benefits). I'll use the Transaction dataset (licensed under the Apache License 2.0), representing financial transactions. The dataset has 24 columns and 7 483 766 rows. It's close to 3 GB in the CSV format found on Kaggle. I used Pandas and PyArrow to convert this to a parquet file. The final result is only 905 MB, thanks to the compression of the parquet file format. This is on the low end of what people call big data, but it will suffice for us.

Tasks

I'll do a speed comparison on five different tasks. The first two are I/O tasks, while the last three are common data processing tasks. Specifically, the tasks are:

  1. Reading data: I'll read both files using the respective methods read_csv() and read_parquet() from the two libraries. I won't use any optional arguments, since I want to compare the libraries' default behavior.
  2. Writing data: I'll write both files back as identical copies to new files, using the methods to_csv() and to_parquet() for Pandas and write_csv() and write_parquet() for Polars. Again, I won't use any optional arguments, since I want to compare the default behavior.
  3. Computing numeric expressions: For the iris dataset, I'll compute the expression SepalLengthCm ** 2 + SepalWidthCm as a new column in a copy of the DataFrame. For the transactions dataset, I'll simply compute the expression (amount + 10) ** 2 as a new column in a copy of the DataFrame. I'll use the standard way to transform columns in Pandas, while in Polars I'll use the standard functions all(), col(), and alias() to make an equivalent transformation.
  4. Filters: For the iris dataset, I'll select the rows matching the criteria SepalLengthCm >= 5.0 and SepalWidthCm <= 4.0. For the transactions dataset, I'll select the rows matching the categorical criterion merchant_category == 'Restaurant'. I'll use the standard filtering method based on Boolean expressions in each library. In Pandas, this is syntax such as df_new = df[df['col'] < 5], while in Polars this is done equivalently with the filter() function together with the col() function. I'll use the and-operator & in both libraries to combine the two numeric conditions for the iris dataset.
  5. Group by: For the iris dataset, I'll group by the Species column and calculate the mean values for each species over the four columns SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm. For the transactions dataset, I'll group by the merchant_category column and count the number of instances in each of its classes. Naturally, I'll use the groupby() function in Pandas and the group_by() function in Polars in the obvious ways.

Settings

  • System settings: I'm running all the tasks locally with 16 GB RAM and an Intel Core i5-10400F CPU with 6 cores (12 logical cores through hyperthreading). So it's not state-of-the-art by any means, but good enough for simple benchmarking.
  • Python: I'm running Python 3.12. This isn't the most current stable version (which is Python 3.13), but I think that's a good thing. Often, the latest Python version supported in cloud data warehouses is one or two versions behind.
  • Polars & Pandas: I'm using Polars version 1.21 and Pandas 2.2.3. These are roughly the latest stable releases of both packages.
  • Timeit: I'm using the standard timeit module in Python and taking the median of 10 runs.
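The timing harness can be sketched roughly like this (the exact wrapper is an assumption on my part; timeit.repeat with number=1 and the median over 10 repetitions matches the setup described above):

```python
import statistics
import timeit


def benchmark(func, runs=10):
    """Run func once per repetition and return the median wall time in seconds."""
    times = timeit.repeat(func, repeat=runs, number=1)
    return statistics.median(times)


# Example usage with a stand-in task
median_time = benchmark(lambda: sum(range(100_000)))
print(f"median of {10} runs: {median_time:.6f} s")
```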

Especially interesting will be how Polars makes use of the 12 logical cores through multithreading. There are ways to make Pandas use multiple processors, but I want to compare Polars and Pandas out of the box, without any external modification. After all, this is probably how they're run in most companies around the world.

Results

Here I'll write down the results for each of the five tasks and make some minor comments. In the next section I'll try to summarize the main points into a conclusion and point out a disadvantage that Polars has in this comparison:

Task 1 — Reading data

The median run time over 10 runs for the reading task was as follows:

# Iris Dataset
Pandas: 0.79 milliseconds
Polars: 0.31 milliseconds

# Transactions Dataset
Pandas: 14.14 seconds
Polars: 1.25 seconds

For reading the iris dataset, Polars was roughly 2.5x faster than Pandas. For the transactions dataset, the difference is even starker: Polars was 11x faster than Pandas. We can see that Polars is much faster than Pandas for reading both small and large files, and the performance difference grows with the size of the file.

Task 2 — Writing data

The median run time over 10 runs for the writing task was as follows:

# Iris Dataset
Pandas: 1.06 milliseconds
Polars: 0.60 milliseconds

# Transactions Dataset
Pandas: 20.55 seconds
Polars: 10.39 seconds

For writing the iris dataset, Polars was around 75% faster than Pandas. For the transactions dataset, Polars was roughly 2x as fast as Pandas. Again we see that Polars is faster than Pandas, though the difference here is smaller than for reading files. Still, a difference of close to 2x in performance is a big difference.

Task 3 — Computing numeric expressions

The median run time over 10 runs for the numeric expressions task was as follows:

# Iris Dataset
Pandas: 0.35 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 54.58 milliseconds
Polars: 14.92 milliseconds

For computing the numeric expressions, Polars beats Pandas by roughly 2.5x for the iris dataset and roughly 3.5x for the transactions dataset. This is a pretty big difference. It should be noted that computing numeric expressions is fast in both libraries, even for the large transactions dataset.

Task 4 — Filters

The median run time over 10 runs for the filters task was as follows:

# Iris Dataset
Pandas: 0.40 milliseconds
Polars: 0.15 milliseconds

# Transactions Dataset
Pandas: 0.70 seconds
Polars: 0.07 seconds

For filters, Polars is 2.6x faster on the iris dataset and 10x as fast on the transactions dataset. This was probably the most surprising improvement for me, since I suspected that the speed improvements for filtering tasks wouldn't be this big.

Task 5 — Group by

The median run time over 10 runs for the group-by task was as follows:

# Iris Dataset
Pandas: 0.54 milliseconds
Polars: 0.18 milliseconds

# Transactions Dataset
Pandas: 334 milliseconds 
Polars: 126 milliseconds

For the group-by task, there's a 3x speed improvement for Polars in the case of the iris dataset. For the transactions dataset, there's a 2.6x improvement of Polars over Pandas.

Conclusions

Before highlighting each point below, I want to note that Polars is in a somewhat unfair position throughout my comparisons. In practice, multiple data transformations are often performed one after another. For this, Polars has the lazy API, which optimizes the whole pipeline before computing it. Since I've considered single ingestions and transformations, this advantage of Polars stays hidden. How much it would improve things in practical situations isn't clear, but it would probably make the performance difference even bigger.

Data Ingestion

Polars is significantly faster than Pandas for both reading and writing files. The difference is largest for reading, where we saw a massive 11x performance gap on the transactions dataset. On all measurements, Polars performs significantly better than Pandas.

Data Processing

Polars is significantly faster than Pandas for common data processing tasks. The difference was starkest for filters, but you can expect at least a 2-3x difference in performance across the board.

Final Verdict

Polars consistently performs faster than Pandas on all tasks, with both small and large data. The improvements are significant, ranging from a 2x speedup to a whopping 11x speedup. When it comes to reading large parquet files or performing filter statements, Polars is leaps and bounds ahead of Pandas.

However… nowhere here does Polars come remotely close to performing 30x better than Pandas, as Polars' benchmarking suggests. I'd argue that the tasks I've presented are standard tasks performed on realistic hardware infrastructure. So I think my results give us some room to question whether the claims put forward by Polars paint a realistic picture of the improvements you can expect.

Nevertheless, I have little doubt that Polars is significantly faster than Pandas. Working with Polars is no more complicated than working with Pandas. So for your next data engineering project where the data fits in memory, I'd strongly suggest that you opt for Polars rather than Pandas.

Wrapping Up


I hope this blog post gave you a different perspective on the speed difference between Polars and Pandas. Please leave a comment if your experience with the performance difference between Polars and Pandas differs from what I've presented.

If you're interested in AI, data science, or data engineering, please follow me or connect on LinkedIn.

Like my writing? Check out some of my other posts: