Understanding K-Fold Target Encoding to Deal with High Cardinality | by Fhilipus Mahendra | Oct, 2024

Balancing complexity and efficiency: An in-depth look at K-fold target encoding

Photo by Mika Baumeister on Unsplash

Data science practitioners encounter numerous challenges when dealing with diverse data types across different projects, each demanding its own processing techniques. A common obstacle is working with data formats that traditional machine learning models struggle to process effectively, resulting in subpar model performance. Since most machine learning algorithms are optimized for numerical data, transforming categorical data into numerical form is essential. However, this transformation often oversimplifies complex categorical relationships, especially when the feature has high cardinality (a large number of unique values), which complicates processing and hurts model accuracy.

High cardinality refers to the number of unique elements within a feature; in a machine learning context, it describes the distinct count of categorical labels. When a feature has many unique categorical labels, it has high cardinality, which can complicate model processing. To make categorical data usable in machine learning, these labels are typically converted to numerical form using encoding techniques chosen according to the data's complexity. One popular method is One-Hot Encoding, which assigns each unique label a distinct binary vector. However, with high-cardinality data, One-Hot Encoding can dramatically increase dimensionality, producing complex, high-dimensional datasets that require significant computational capacity for model training and can slow performance.
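To see the dimensionality blow-up concretely, here is a minimal sketch using pandas (the frame and its column names are illustrative, not from the article's dataset): one-hot encoding a low-cardinality column adds only a few columns, while a high-cardinality column adds one column per unique label.

```python
import pandas as pd

# Hypothetical frame: 1,000 rows with a unique ID per row and
# a country column that only takes three values
countries = ["Indonesia", "Malaysia", "Singapore"]
df = pd.DataFrame({
    "id": [f"ID_{i}" for i in range(1000)],
    "country": [countries[i % 3] for i in range(1000)],
})

# One-hot encoding the low-cardinality column adds just 3 columns...
print(pd.get_dummies(df["country"]).shape)  # (1000, 3)

# ...while the high-cardinality column adds one column per unique label
print(pd.get_dummies(df["id"]).shape)       # (1000, 1000)
```

The one-hot matrix for the ID column is as wide as the dataset is long, which is exactly the inefficiency described above.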

Consider a dataset with 2,000 unique IDs, each ID linked to one of only three countries. In this case, while the ID feature has a cardinality of 2,000 (since every ID is unique), the country feature has a cardinality of just 3. Now, imagine a feature with 100,000 categorical labels that must be encoded using One-Hot Encoding. This would create an extremely high-dimensional dataset, leading to inefficiency and significant resource consumption.
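In pandas, the cardinality of each feature can be checked directly with `nunique`. A minimal sketch mirroring the numbers above (the frame is illustrative):

```python
import pandas as pd

# Illustrative frame: 2,000 unique IDs, each mapped to one of three countries
countries = ["Indonesia", "Malaysia", "Singapore"]
df = pd.DataFrame({
    "id": [f"ID_{i}" for i in range(2000)],
    "country": [countries[i % 3] for i in range(2000)],
})

# Cardinality is simply the number of distinct values in a column
print(df["id"].nunique())       # 2000
print(df["country"].nunique())  # 3
```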

A widely adopted solution among data scientists is K-Fold Target Encoding. This encoding method helps reduce feature cardinality by replacing categorical labels with target-mean values computed through K-Fold cross-validation. Because each row is encoded using only data from the other folds, K-Fold Target Encoding lowers the risk of overfitting, helping the model learn genuine relationships within the data rather than memorizing label-specific noise that would hurt model performance.

K-Fold Target Encoding involves dividing the dataset into several equally sized subsets known as "folds", with "K" representing the number of these subsets. By splitting the dataset into multiple groups, the method calculates a cross-fold mean of the target for each categorical label, improving the encoding's robustness and reducing the risk of overfitting.

Fig 1. Indonesian Domestic Flights Dataset [1]

Using the sample dataset of Indonesian domestic flight emissions per flight cycle shown in Fig 1, we can put this technique into practice. The basic question to ask of this dataset is: "What is the weighted mean of feature 'HC Emission' for each categorical label in 'Airlines'?" At this point, you might raise the same question people often ask me: "But if you simply calculate the means from the target feature, couldn't the result itself be another high-cardinality feature?" The short answer is: yes, it could.

Why?

In cases where a large dataset has a highly random target feature without identifiable patterns, K-Fold Target Encoding might produce a wide variety of mean values across the categorical labels, potentially preserving high cardinality rather than reducing it. However, the primary goal of K-Fold Target Encoding is to manage high cardinality, not necessarily to reduce it drastically. The method works best when there is a meaningful correlation between the target feature and the data within each categorical label.

How does K-Fold Target Encoding work? The simplest way to explain it is that, for each fold, you calculate the mean of the target feature from the other folds. This approach gives each categorical label a unique weight, represented as a numerical value, making it more informative. Let's walk through an example calculation using our dataset for a clearer understanding.
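The out-of-fold mean calculation can be sketched in a few lines of pandas. The column names (`Airline`, `HC Emission`, `kfold`) follow the example dataset, but the values and the helper function below are illustrative, not the actual figures from Fig 1:

```python
import pandas as pd

# Illustrative data: airline label, target emission value, assigned fold
df = pd.DataFrame({
    "Airline": ["AirAsia", "AirAsia", "AirAsia", "Garuda", "Garuda", "Garuda"],
    "HC Emission": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "kfold": [1, 2, 3, 1, 2, 3],
})

def kfold_target_encode(df, cat_col, target_col, fold_col):
    """Encode each row with the target mean of its label
    computed from all OTHER folds (out-of-fold mean)."""
    encoded = pd.Series(index=df.index, dtype=float)
    for fold in df[fold_col].unique():
        out_of_fold = df[df[fold_col] != fold]
        means = out_of_fold.groupby(cat_col)[target_col].mean()
        in_fold = df[fold_col] == fold
        encoded[in_fold] = df.loc[in_fold, cat_col].map(means)
    return encoded

df["Airline_enc"] = kfold_target_encode(df, "Airline", "HC Emission", "kfold")
print(df["Airline_enc"].tolist())  # [25.0, 20.0, 15.0, 55.0, 50.0, 45.0]
```

For the first row ("AirAsia", fold 1), the encoding is the mean of the "AirAsia" emissions in folds 2 and 3, i.e. (20 + 30) / 2 = 25.0; the row's own target value never contributes to its own encoding, which is what guards against leakage.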

Fig 2. Indonesian Domestic Flights Dataset After K-Fold Assignment [1]

To calculate the weight of the "AirAsia" label for the first observation, start by splitting the data into several folds, as shown in Fig 2. You can assign folds manually to ensure an equal distribution, or automate the process using the following sample code:

import numpy as np
import pandas as pd

# To split the data into eight roughly equal parts, assign a fold
# number (1-8) to each row, then shuffle so the assignment is random.

# Calculate the number of samples per fold
num_samples = len(df) // 8

# Assign fold numbers 1-8, each repeated num_samples times
folds = np.repeat(np.arange(1, 9), num_samples)

# Handle any remaining samples (if len(df) is not divisible by 8)
remaining_samples = len(df) % 8
if remaining_samples > 0:
    folds = np.concatenate([folds, np.arange(1, remaining_samples + 1)])

df['kfold'] = folds

# Shuffle to ensure the fold assignment is random
fold_df = df.sample(frac=1, random_state=42).reset_index(drop=True)