We are going to take a look at the most typical operations that, in my experience, are most frequently used for data analysis. To illustrate the process of using Polars, I’ll consider an abstract task with reproducible data, so you can follow all of the steps on your own computer.
Imagine that we have data from three online stores, where we register user actions, such as viewing and purchasing. Let’s assume that at any given time, only one action of each type can occur for each online store, and that in case of a transaction error, our data may be missing the product identifier or its quantity. Additionally, for our task, we’ll need a product catalog with prices for each item.
Let’s formulate the main task: to calculate a summary table with the total purchases for each online store.
I’ll break this task down into the following steps:
- Data preparation and DataFrame creation.
- Summary statistics of the DataFrame.
- Retrieving the first 5 records.
- Renaming columns.
- Changing column types.
- Filling missing values.
- Removing missing values.
- Removing duplicate records.
- Filtering data.
- Selecting the required columns.
- Grouping data.
- Joining data with another DataFrame.
- Calculating a new column.
- Creating a pivot table.
Let’s get started!
Data Preparation and DataFrame Creation
We have the following data:
- OnlineStore — indicates the store.
- product — stores the product ID.
- Action type — the type of action (either a view or a purchase).
- quantity — the amount of the purchased or viewed product.
- Action_time — the timestamp of the action.
Requirements:
polars==1.6.0
pandas==2.0.0
from dataclasses import dataclass
from datetime import datetime, timedelta
from random import choice, gauss, randrange, seed
from typing import Any, Dict

import polars as pl
import pandas as pd

seed(42)

base_time = datetime(2024, 8, 31, 0, 0, 0, 0)

user_actions_data = [
    {
        "OnlineStore": choice(["Shop1", "Shop2", "Shop3"]),
        "product": choice(["0001", "0002", "0003"]),
        "quantity": choice([1.0, 2.0, 3.0]),
        "Action type": ("purchase" if gauss() > 0.6 else "view"),
        "Action_time": base_time - timedelta(minutes=randrange(1_000_000)),
    }
    for x in range(1_000_000)
]

corrupted_data = [
    {
        "OnlineStore": choice(["Shop1", "Shop2", "Shop3"]),
        "product": choice(["0001", None]),
        "quantity": choice([1.0, None]),
        "Action type": ("purchase" if gauss() > 0.6 else "view"),
        "Action_time": base_time - timedelta(minutes=randrange(1_000)),
    }
    for x in range(1_000)
]
For the product catalog, in our case we only need the product ID (product_id) and its price (price):
product_catalog_data = {"product_id": ["0001", "0002", "0003"], "price": [100, 25, 80]}
The data is ready. Now let’s create DataFrames from it with Pandas and Polars:
# Pandas
user_actions_pd_df = pd.DataFrame(user_actions_data)
corrupted_pd_df = pd.DataFrame(corrupted_data)
product_catalog_pd_df = pd.DataFrame(product_catalog_data)

# Polars
user_actions_pl_df = pl.DataFrame(user_actions_data)
corrupted_pl_df = pl.DataFrame(corrupted_data)
product_catalog_pl_df = pl.DataFrame(product_catalog_data)
Since we have user_actions_df and corrupted_df, let’s concatenate them into a single DataFrame:
# Pandas
user_actions_pd_df = pd.concat([user_actions_pd_df, corrupted_pd_df])

# Polars
user_actions_pl_df = pl.concat([user_actions_pl_df, corrupted_pl_df])
In this way, we have just created the DataFrames for further work. Of course, each method has its own parameters, so it’s best to keep the documentation handy to avoid confusion and use them appropriately.
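For illustration only (these parameters are not used for the outputs below), both concatenation functions accept extra arguments; a minimal sketch:
# Pandas: ignore_index=True rebuilds a clean RangeIndex instead of keeping the original indices
pd.concat([user_actions_pd_df, corrupted_pd_df], ignore_index=True)

# Polars: how="vertical" stacks frames with identical schemas; "diagonal" would align differing columns
pl.concat([user_actions_pl_df, corrupted_pl_df], how="vertical")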
Summary Statistics of the DataFrame
After loading or preparing the data, it’s useful to quickly explore the resulting dataset. For summary statistics, the method name remains the same, but the results may differ:
# Pandas
user_actions_pd_df.describe(include='all')
       OnlineStore  product      quantity Action type          Action_time
count      1001000  1000492  1.000510e+06     1001000              1001000
unique           3        3           NaN           2               632335
top          Shop3     0001           NaN        view  2024-08-30 22:02:00
freq        333931   333963           NaN      726623                    9
first          NaN      NaN           NaN         NaN  2022-10-06 13:23:00
last           NaN      NaN           NaN         NaN  2024-08-30 23:58:00
mean           NaN      NaN  1.998925e+00         NaN                  NaN
std            NaN      NaN  8.164457e-01         NaN                  NaN
min            NaN      NaN  1.000000e+00         NaN                  NaN
25%            NaN      NaN  1.000000e+00         NaN                  NaN
50%            NaN      NaN  2.000000e+00         NaN                  NaN
75%            NaN      NaN  3.000000e+00         NaN                  NaN
max            NaN      NaN  3.000000e+00         NaN                  NaN
# Polars
user_actions_pl_df.describe()
┌────────────┬─────────────┬─────────┬───────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ OnlineStore ┆ product ┆ quantity  ┆ Action type ┆ Action_time                │
│ ---        ┆ ---         ┆ ---     ┆ ---       ┆ ---         ┆ ---                        │
│ str        ┆ str         ┆ str     ┆ f64       ┆ str         ┆ str                        │
╞════════════╪═════════════╪═════════╪═══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 1001000     ┆ 1000492 ┆ 1.00051e6 ┆ 1001000     ┆ 1001000                    │
│ null_count ┆ 0           ┆ 508     ┆ 490.0     ┆ 0           ┆ 0                          │
│ mean       ┆ null        ┆ null    ┆ 1.998925  ┆ null        ┆ 2023-09-19 03:24:30.981698 │
│ std        ┆ null        ┆ null    ┆ 0.816446  ┆ null        ┆ null                       │
│ min        ┆ Shop1       ┆ 1       ┆ 1.0       ┆ purchase    ┆ 2022-10-06 13:23:00        │
│ 25%        ┆ null        ┆ null    ┆ 1.0       ┆ null        ┆ 2023-03-29 03:09:00        │
│ 50%        ┆ null        ┆ null    ┆ 2.0       ┆ null        ┆ 2023-09-19 06:49:00        │
│ 75%        ┆ null        ┆ null    ┆ 3.0       ┆ null        ┆ 2024-03-11 03:01:00        │
│ max        ┆ Shop3       ┆ 3       ┆ 3.0       ┆ view        ┆ 2024-08-30 23:58:00        │
└────────────┴─────────────┴─────────┴───────────┴─────────────┴────────────────────────────┘
As you can notice, Pandas calculates statistics differently for different data types and provides unique values for all columns. Polars, on the other hand, calculates the null_count value.
Additionally, the Polars documentation states:
We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.
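If you need missing-value counts programmatically rather than through describe, both libraries expose them directly; a small sketch:
# Polars: per-column null counts as a one-row DataFrame
user_actions_pl_df.null_count()

# Pandas: per-column count of missing values
user_actions_pd_df.isna().sum()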
Retrieving the First 5 Records
When encountering data for the first time, we always want to explore it. Beyond obtaining summary statistics, it’s also important to see the actual records it contains. To do that, we often look at the first 5 records as a sample.
# Pandas
user_actions_pd_df.head()
  OnlineStore product  quantity Action type         Action_time
0       Shop3    0001       1.0        view 2024-05-21 09:24:00
1       Shop3    0001       3.0        view 2023-03-10 15:54:00
2       Shop3    0001       3.0        view 2024-03-24 19:02:00
3       Shop1    0003       3.0        view 2024-08-11 16:16:00
4       Shop3    0001       3.0        view 2024-03-23 11:32:00
# Polars
user_actions_pl_df.head()
┌─────────────┬─────────┬──────────┬─────────────┬─────────────────────┐
│ OnlineStore ┆ product ┆ quantity ┆ Action type ┆ Action_time         │
│ ---         ┆ ---     ┆ ---      ┆ ---         ┆ ---                 │
│ str         ┆ str     ┆ f64      ┆ str         ┆ datetime[μs]        │
╞═════════════╪═════════╪══════════╪═════════════╪═════════════════════╡
│ Shop3       ┆ 0001    ┆ 1.0      ┆ view        ┆ 2024-05-21 09:24:00 │
│ Shop3       ┆ 0001    ┆ 3.0      ┆ view        ┆ 2023-03-10 15:54:00 │
│ Shop3       ┆ 0001    ┆ 3.0      ┆ view        ┆ 2024-03-24 19:02:00 │
│ Shop1       ┆ 0003    ┆ 3.0      ┆ view        ┆ 2024-08-11 16:16:00 │
│ Shop3       ┆ 0001    ┆ 3.0      ┆ view        ┆ 2024-03-23 11:32:00 │
└─────────────┴─────────┴──────────┴─────────────┴─────────────────────┘
Polars has a useful glimpse() function that provides a dense preview of the DataFrame. It not only returns the first 10 records (or any number you specify using the max_items_per_column parameter) but also shows the data types and record counts.
# Polars
user_actions_pl_df.glimpse()
Rows: 1001000
Columns: 5
$ OnlineStore <str> 'Shop3', 'Shop3', 'Shop3', 'Shop1', 'Shop3', 'Shop2', 'Shop1', 'Shop2', 'Shop1', 'Shop2'
$ product <str> '0001', '0001', '0001', '0003', '0001', '0003', '0001', '0001', '0002', '0003'
$ quantity <f64> 1.0, 3.0, 3.0, 3.0, 3.0, 2.0, 3.0, 1.0, 2.0, 1.0
$ Action type <str> 'view', 'view', 'view', 'view', 'view', 'view', 'view', 'view', 'view', 'view'
$ Action_time <datetime[μs]> 2024-05-21 09:24:00, 2023-03-10 15:54:00, 2024-03-24 19:02:00, 2024-08-11 16:16:00, 2024-03-23 11:32:00, 2023-01-19 14:11:00, 2024-03-27 05:08:00, 2023-11-28 08:18:00, 2023-03-18 15:01:00, 2022-10-29 09:44:00
Renaming Columns
After exploring the data, it’s often necessary to edit it for further use. If the column names are not satisfactory, or if your company has its own naming conventions, you can easily rename them.
# Pandas
user_actions_pd_df = user_actions_pd_df.rename(
columns={
"OnlineStore": "online_store",
"product": "product_id",
"Motion sort": "action_type",
"Action_time": "action_dt",
}
)
# user_actions_pd_df.columns
Index(['online_store', 'product_id', 'quantity', 'action_type', 'action_dt'], dtype='object')
# Polars
user_actions_pl_df = user_actions_pl_df.rename(
{
"OnlineStore": "online_store",
"product": "product_id",
"Motion sort": "action_type",
"Action_time": "action_dt",
}
)
# user_actions_pl_df.columns
['online_store', 'product_id', 'quantity', 'action_type', 'action_dt']
Changing Column Types
When working with data, optimizing its processing is often a priority, and data types are no exception. Choosing the right type not only unlocks available functions but also saves memory. In our example, I’ll change the column type of quantity from float to int. In Pandas, you would use the astype() method, while in Polars, you use the cast() method.
# Pandas
user_actions_pd_df = user_actions_pd_df.astype({"quantity": "Int64"})
Int64Index: 1001000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count    Dtype
---  ------        --------------    -----
 0   online_store  1001000 non-null  object
 1   product_id    1000492 non-null  object
 2   quantity      1000510 non-null  Int64
 3   action_type   1001000 non-null  object
 4   action_dt     1001000 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
memory usage: 46.8+ MB
# Polars
user_actions_pl_df = user_actions_pl_df.cast({"quantity": pl.Int32})
Rows: 1001000
Columns: 5
$ online_store <str>
$ product_id <str>
$ quantity <i32>
$ action_type <str>
$ action_dt <datetime[μs]>
Polars has a special method estimated_size() that returns an estimate of the total (heap) allocated size of the DataFrame. For example:
user_actions_pl_df.estimated_size("mb")
# Result: 24.91054630279541
Although the method names for changing types differ, SQL enthusiasts will appreciate the ease of the transition.
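If you prefer an expression-level cast, closer in spirit to SQL’s CAST, the same conversion can also be written inside with_columns(); a minimal sketch assuming the column names used above:
# Polars: cast a single column via an expression (same effect as the DataFrame-level cast above)
user_actions_pl_df = user_actions_pl_df.with_columns(
    pl.col("quantity").cast(pl.Int32)
)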
Filling Missing Values
In real projects, data is rarely perfect, and we often discuss with managers, analysts, and other systems how to interpret its behavior. During data preparation, I deliberately generated corrupted_data to introduce some chaos into the data. Handling missing values could easily be the subject of an entire book.
There are several strategies for filling in missing values, and the choice of method depends on the task: sometimes filling missing values with zeros is enough, while other times the mean value may be used. In Polars, the fill_null() method can be applied both to the DataFrame and to specific columns. To add a new column or replace values in an existing one, the with_columns() method is also used.
In our example, I’ll fill missing values in the quantity column with 0:
# Pandas
user_actions_pd_df["quantity"].fillna(0, inplace=True)
Data columns (total 5 columns):
 #   Column        Non-Null Count    Dtype
---  ------        --------------    -----
 0   online_store  1001000 non-null  object
 1   product_id    1000492 non-null  object
 2   quantity      1001000 non-null  Int64
 3   action_type   1001000 non-null  object
 4   action_dt     1001000 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.with_columns(pl.col("quantity").fill_null(0))
┌────────────┬──────────────┬────────────┬──────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---      ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64      ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 1001000      ┆ 1000492    ┆ 1.001e6  ┆ 1001000     ┆ 1001000                    │
│ null_count ┆ 0            ┆ 508        ┆ 0.0      ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴──────────┴─────────────┴────────────────────────────┘
In Polars, you can use various strategies for filling missing values, such as: {None, 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'}. The names of these strategies are self-explanatory, so we won’t delve into their details.
It’s also worth noting that for filling NaN values in floating-point columns, you can use the fill_nan() method, which doesn’t involve strategies.
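As a quick illustration on a throwaway frame (the column names a and b here are made up for the example):
# Polars: strategy-based fill_null and value-based fill_nan
tmp_df = pl.DataFrame({"a": [1.0, None, 3.0], "b": [1.0, float("nan"), 3.0]})
tmp_df = tmp_df.with_columns(
    pl.col("a").fill_null(strategy="forward"),  # carry the previous value forward
    pl.col("b").fill_nan(0.0),                  # replace NaN (not null) with 0.0
)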
Removing Missing Values
Not all missing values can be filled, so those that cannot be correctly filled and used in further calculations are best removed. In our case, this applies to the product_id column, as we cannot compute the final result without this identifier.
To remove rows with missing values in Pandas and Polars, use the following methods:
# Pandas
user_actions_pd_df.dropna(subset=["product_id"], inplace=True)
Int64Index: 1000492 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count    Dtype
---  ------        --------------    -----
 0   online_store  1000492 non-null  object
 1   product_id    1000492 non-null  object
 2   quantity      1000492 non-null  Int64
 3   action_type   1000492 non-null  object
 4   action_dt     1000492 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.drop_nulls(subset=["product_id"])
┌────────────┬──────────────┬────────────┬────────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity   ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---        ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64        ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪════════════╪═════════════╪════════════════════════════╡
│ count      ┆ 1000492      ┆ 1000492    ┆ 1.000492e6 ┆ 1000492     ┆ 1000492                    │
│ null_count ┆ 0            ┆ 0          ┆ 0.0        ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴────────────┴─────────────┴────────────────────────────┘
It’s also worth noting that to remove NaN values in floating-point columns, you can use the drop_nans() method.
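As a rough sketch on a throwaway frame (our quantity column is already an integer at this point), the expression-based equivalent looks like this:
# Polars: drop rows containing NaN in a float column (same idea as drop_nans())
tmp_df = pl.DataFrame({"b": [1.0, float("nan"), 3.0]})
tmp_df = tmp_df.filter(pl.col("b").is_not_nan())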
Removing Duplicate Records
The simplest case of duplicate records occurs when all values of one record are identical to those of another. In our case, duplicates might arise if the same action is recorded multiple times for the same action type in the same online store at a single point in time. I’ll keep only the most recent value if duplicates are found.
To remove duplicate records in Pandas, use the drop_duplicates() method, and in Polars, the unique() method.
# Pandas
user_actions_pd_df.drop_duplicates(
subset=["online_store", "action_type", "action_dt"],
    keep="last",
inplace=True,
)
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   online_store  907246 non-null  object
 1   product_id    907246 non-null  object
 2   quantity      907246 non-null  Int64
 3   action_type   907246 non-null  object
 4   action_dt     907246 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.unique(
    subset=["online_store", "action_type", "action_dt"],
    keep="last",
)
┌────────────┬──────────────┬────────────┬──────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---      ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64      ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 907246       ┆ 907246     ┆ 907246.0 ┆ 907246      ┆ 907246                     │
│ null_count ┆ 0            ┆ 0          ┆ 0.0      ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴──────────┴─────────────┴────────────────────────────┘
Filtering Data
After the data cleaning phase, we need to filter the data relevant for future calculations. In Polars, this is done using the method with a fairly descriptive name, filter().
Rows where the filter doesn’t evaluate to True are discarded, including nulls (a sketch after the example below shows how to keep such rows explicitly).
# Pandas
user_actions_pd_df = user_actions_pd_df.loc[
    user_actions_pd_df["action_type"] == "purchase"
]
Int64Index: 262237 entries, 11 to 995
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   online_store  262237 non-null  object
 1   product_id    262237 non-null  object
 2   quantity      262237 non-null  Int64
 3   action_type   262237 non-null  object
 4   action_dt     262237 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.filter(
    pl.col("action_type") == "purchase"
)
┌────────────┬──────────────┬────────────┬──────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---      ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64      ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 262237       ┆ 262237     ┆ 262237.0 ┆ 262237      ┆ 262237                     │
│ null_count ┆ 0            ┆ 0          ┆ 0.0      ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴──────────┴─────────────┴────────────────────────────┘
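If you do want to keep rows where the predicate evaluates to null (for example, a missing action_type), you have to say so explicitly; a minimal sketch, assuming the column names used above and a hypothetical filtered_pl_df name:
# Polars: keep rows matching the condition OR having a null in that column
filtered_pl_df = user_actions_pl_df.filter(
    (pl.col("action_type") == "purchase") | pl.col("action_type").is_null()
)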
Selecting the Required Columns
After filtering the data, you may need to retain only the columns relevant for further analysis. In Polars, this is achieved using the select() method.
# Pandas
user_actions_pd_df = user_actions_pd_df[
["online_store", "action_type", "product_id", "quantity"]
]
# Polars
user_actions_pl_df = user_actions_pl_df.select(
    "online_store", "action_type", "product_id", "quantity"
)
Grouping Data
After preparing the data, we can aggregate it to get the sum of quantity for each online store and product. I will also retain action_type for further steps. We use the group_by() method in Polars, which is analogous to the groupby() method in Pandas.
# Pandas
user_actions_pd_df = (
user_actions_pd_df.groupby(["online_store", "product_id", "action_type"])
    .agg({"quantity": "sum"})
.reset_index()
)
  online_store product_id action_type  quantity
0        Shop1       0001    purchase     57772
1        Shop1       0002    purchase     58015
2        Shop1       0003    purchase     58242
3        Shop2       0001    purchase     58256
4        Shop2       0002    purchase     58648
5        Shop2       0003    purchase     58458
6        Shop3       0001    purchase     57891
7        Shop3       0002    purchase     58326
8        Shop3       0003    purchase     59107
# Polars
user_actions_pl_df = (
user_actions_pl_df.group_by(["online_store", "product_id", "action_type"])
    .agg(pl.col("quantity").sum())
)
┌──────────────┬────────────┬─────────────┬──────────┐
│ online_store ┆ product_id ┆ action_type ┆ quantity │
│ ---          ┆ ---        ┆ ---         ┆ ---      │
│ str          ┆ str        ┆ str         ┆ i32      │
╞══════════════╪════════════╪═════════════╪══════════╡
│ Shop1        ┆ 0001       ┆ purchase    ┆ 57772    │
│ Shop1        ┆ 0002       ┆ purchase    ┆ 58015    │
│ Shop1        ┆ 0003       ┆ purchase    ┆ 58242    │
│ Shop2        ┆ 0001       ┆ purchase    ┆ 58256    │
│ Shop2        ┆ 0002       ┆ purchase    ┆ 58648    │
│ Shop2        ┆ 0003       ┆ purchase    ┆ 58458    │
│ Shop3        ┆ 0001       ┆ purchase    ┆ 57891    │
│ Shop3        ┆ 0002       ┆ purchase    ┆ 58326    │
│ Shop3        ┆ 0003       ┆ purchase    ┆ 59107    │
└──────────────┴────────────┴─────────────┴──────────┘
Joining Data with Another DataFrame
To calculate the total purchases, we need to join our data with the price catalog. In Pandas, we have two methods for this, join() and merge(), which differ in their specifics and functionality. In Polars, we use only the join() method.
# Pandas
user_actions_pd_df = user_actions_pd_df.merge(product_catalog_pd_df, on='product_id')
  online_store product_id action_type  quantity  price
0        Shop1       0001    purchase     57772    100
3        Shop1       0002    purchase     58015     25
6        Shop1       0003    purchase     58242     80
1        Shop2       0001    purchase     58256    100
4        Shop2       0002    purchase     58648     25
7        Shop2       0003    purchase     58458     80
2        Shop3       0001    purchase     57891    100
5        Shop3       0002    purchase     58326     25
8        Shop3       0003    purchase     59107     80
# Polars
user_actions_pl_df = user_actions_pl_df.join(product_catalog_pl_df, on='product_id')
┌──────────────┬────────────┬─────────────┬──────────┬───────┐
│ online_store ┆ product_id ┆ action_type ┆ quantity ┆ price │
│ ---          ┆ ---        ┆ ---         ┆ ---      ┆ ---   │
│ str          ┆ str        ┆ str         ┆ i32      ┆ i64   │
╞══════════════╪════════════╪═════════════╪══════════╪═══════╡
│ Shop1        ┆ 0001       ┆ purchase    ┆ 57772    ┆ 100   │
│ Shop1        ┆ 0002       ┆ purchase    ┆ 58015    ┆ 25    │
│ Shop1        ┆ 0003       ┆ purchase    ┆ 58242    ┆ 80    │
│ Shop2        ┆ 0001       ┆ purchase    ┆ 58256    ┆ 100   │
│ Shop2        ┆ 0002       ┆ purchase    ┆ 58648    ┆ 25    │
│ Shop2        ┆ 0003       ┆ purchase    ┆ 58458    ┆ 80    │
│ Shop3        ┆ 0001       ┆ purchase    ┆ 57891    ┆ 100   │
│ Shop3        ┆ 0002       ┆ purchase    ┆ 58326    ┆ 25    │
│ Shop3        ┆ 0003       ┆ purchase    ┆ 59107    ┆ 80    │
└──────────────┴────────────┴─────────────┴──────────┴───────┘
In Polars, the how parameter supports the following values: {'inner', 'left', 'right', 'full', 'semi', 'anti', 'cross'}.
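For example, a left join that keeps purchases even for products missing from the catalog could be sketched as follows (the joined_df name is just for illustration):
# Polars: a left join keeps all rows of the left frame, filling missing catalog columns with null
joined_df = user_actions_pl_df.join(product_catalog_pl_df, on="product_id", how="left")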
Calculating a New Column
To calculate a new column or modify an existing one, Polars uses the with_columns() method. To set an alias for a column, you can use alias().
# Pandas
user_actions_pd_df["total"] = (
user_actions_pd_df["price"] * user_actions_pd_df["quantity"]
)
user_actions_pd_df = user_actions_pd_df[
["online_store", "action_type", "total"]
]
# Polars
user_actions_pl_df = user_actions_pl_df.with_columns(
    (pl.col("price") * pl.col("quantity")).alias("total")
)
user_actions_pl_df = user_actions_pl_df.select(
    "online_store", "action_type", "total"
)
Alternatively, you can calculate a new column directly within the select() method:
# Polars
user_actions_pl_df = user_actions_pl_df.select(
    "online_store",
    "action_type",
    (pl.col("price") * pl.col("quantity")).alias("total"),
)
Creating a Pivot Table
Our final step is to create a pivot table. We have already calculated the total sales for each product, and now we will simply calculate the total sales for each online store. In Pandas, we use the pivot_table() method, which allows the application of aggregate functions. In Polars, we use the pivot() method to create the pivot table.
# Pandas
result_pd = user_actions_pd_df.pivot_table(
columns="online_store",
index="action_type",
    values="total",
aggfunc="sum",
)
online_store     Shop1     Shop2     Shop3
action_type
purchase      11886935  11968440  11975810
# Polars
result_pl = user_actions_pl_df.pivot(
columns="online_store",
index="action_type",
    values="total",
aggregate_function="sum",
)
┌─────────────┬──────────┬──────────┬──────────┐
│ action_type ┆ Shop1    ┆ Shop2    ┆ Shop3    │
│ ---         ┆ ---      ┆ ---      ┆ ---      │
│ str         ┆ i64      ┆ i64      ┆ i64      │
╞═════════════╪══════════╪══════════╪══════════╡
│ purchase    ┆ 11886935 ┆ 11968440 ┆ 11975810 │
└─────────────┴──────────┴──────────┴──────────┘
DeprecationWarning: The argument columns for pl.DataFrame.pivot is deprecated. It has been renamed to on.
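To silence the warning on recent Polars versions, the same pivot can be written with the renamed argument (a sketch; the output stays identical):
# Polars: `on` replaces the deprecated `columns` argument of pivot()
result_pl = user_actions_pl_df.pivot(
    on="online_store",
    index="action_type",
    values="total",
    aggregate_function="sum",
)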
Here we are, concluding our little journey. As we can see, the results for both Pandas and Polars match. Everyone who made it this far is great and highly hardworking: you will succeed!
Summary
In this article, we explored Polars using practical examples and comparisons with Pandas. I demonstrated how to handle data preparation, descriptive statistics, missing values, duplicates, filtering, column selection, grouping, joining, and pivot tables. By showcasing these tasks with both Pandas and Polars, I highlighted the ease of using Polars and of transitioning to it from Pandas. This guide serves as a practical introduction to leveraging Polars for efficient data analysis.
Thanks for Reading!
If you enjoyed this article and want to support my work, the best way is to follow me on Medium. Let’s connect on LinkedIn if you’re also interested in working with data like I am. Your claps are greatly appreciated; they help me know how useful this post was for you.