Understanding KL Divergence, Entropy, and Related Concepts | by Saankhya Mondal | Oct, 2024

Essential concepts in information theory, machine learning, and statistics

Image AI-generated using Gemini

In information theory, machine learning, and statistics, KL Divergence (Kullback-Leibler Divergence) is a fundamental concept that quantifies how two probability distributions differ. It is often used to measure the amount of information lost when one probability distribution is used to approximate another. This article explains KL Divergence and some of the other widely used divergences.

KL Divergence, also known as relative entropy, is a way to measure the difference between two probability distributions, denoted P and Q. It is typically written as —

KL Divergence between two discrete distributions P(x) and Q(x): $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

KL Divergence between two continuous distributions p(x) and q(x): $D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$

This equation compares the true distribution P with the approximating distribution Q. Imagine you are compressing data using an encoding scheme optimized for one distribution (Q), but the actual data comes from a different distribution (P). KL Divergence measures how inefficient your encoding will be. If Q is close to P, the KL Divergence will be small, meaning little information is lost in the approximation. If Q differs substantially from P, the KL Divergence will be large, indicating significant information loss. In other words, KL Divergence tells you how many extra bits you need to encode data from P when using an encoding scheme designed for Q.
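As a rough numerical sketch of this "extra bits" reading (the distributions below are made up for illustration, and base-2 logs are used so the units are bits):

import numpy as np

# Made-up distributions over the same four outcomes (illustration only)
p = np.array([0.5, 0.25, 0.15, 0.10])   # true distribution P
q = np.array([0.25, 0.25, 0.25, 0.25])  # approximating distribution Q

entropy_p = -np.sum(p * np.log2(p))          # optimal average code length for data from P (bits)
cross_entropy_pq = -np.sum(p * np.log2(q))   # average code length when the code is designed for Q
kl_pq = np.sum(p * np.log2(p / q))           # extra bits paid per symbol

print(entropy_p, cross_entropy_pq, kl_pq)    # kl_pq equals cross_entropy_pq - entropy_p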

KL Divergence and Shannon’s Entropy

To better understand KL Divergence, it is helpful to relate it to entropy, which measures the uncertainty or randomness of a distribution. The Shannon Entropy of a distribution P is defined as:

Shannon Entropy of a distribution P(x): $H(P) = -\sum_x P(x) \log P(x)$

Recall the popular Binary Cross-Entropy loss function and its curve. Entropy is a measure of uncertainty.

Shannon Entropy plot (log base e; base 2 can also be used). Image credit — Author

Entropy tells us how uncertain we are about the outcomes of a random variable. The lower the entropy, the more certain we are about the outcome, and the more information we have. For a binary variable, entropy is highest at p = 0.5, which denotes maximum uncertainty. KL Divergence can be seen as the difference between the cross-entropy of P and Q and the entropy of P. Thus, KL Divergence measures the extra uncertainty introduced by using Q instead of P.
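As a quick illustration of that curve, here is the binary (Bernoulli) entropy evaluated at a few values of p, using the natural log as in the plot above (a throwaway snippet of my own):

import numpy as np

def binary_entropy(p):
    # Entropy of a Bernoulli(p) variable: H(p) = -p*log(p) - (1-p)*log(1-p)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(p, round(binary_entropy(p), 4))
# the entropy peaks at p = 0.5 (maximum uncertainty) and approaches 0 near p = 0 or p = 1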

Deriving KL Divergence from Entropy
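Spelling the definition out makes the link explicit, where $H(P, Q) = -\sum_x P(x) \log Q(x)$ denotes the cross-entropy of P relative to Q:

$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = -\sum_x P(x) \log Q(x) - \left( -\sum_x P(x) \log P(x) \right) = H(P, Q) - H(P)$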

Properties —

  • KL Divergence is always non-negative.
Proof of non-negativity
  • Unlike a proper distance metric, KL Divergence is asymmetric.
KL Divergence is asymmetric: in general, $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$
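For reference, the standard argument behind both properties (Gibbs' inequality), using Jensen's inequality with the concave logarithm:

$-D_{KL}(P \| Q) = \sum_x P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_x P(x) \cdot \frac{Q(x)}{P(x)} = \log \sum_x Q(x) = \log 1 = 0$

Hence $D_{KL}(P \| Q) \ge 0$, with equality if and only if P = Q. The asymmetry is visible directly from the definition: swapping P and Q changes which distribution weights the log-ratio, so the two directions generally give different values.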

Some Applications —

  1. In Variational Autoencoders (VAEs), KL Divergence is used as a regularizer to ensure that the latent variable distribution stays close to a prior distribution (typically a standard Gaussian).
  2. KL Divergence quantifies the inefficiency or information loss when using one probability distribution to compress data from another distribution. This is useful in designing and analyzing data compression algorithms.
  3. In reinforcement learning, KL Divergence controls how much a new policy can deviate from an old one during updates. For example, algorithms like Proximal Policy Optimization (PPO) use KL Divergence to constrain policy shifts.
  4. KL Divergence is widely used in industry to detect data drift.

Jensen-Shannon Divergence (JS Divergence) is a symmetric measure that quantifies the similarity between two probability distributions. It is based on KL Divergence. Given two probability distributions P and Q, the Jensen-Shannon Divergence is defined as —

Jensen-Shannon Divergence: $JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$

where M is the average (or mixture) distribution of P and Q.

Mixture distribution: $M = \frac{1}{2}(P + Q)$

The first term measures how much information is lost when M is used to approximate P. The second term measures the information loss when M is used to approximate Q. JS Divergence computes the average of the two KL divergences with respect to the mixture distribution M. KL Divergence penalizes you for using one distribution to approximate another; however, it is sensitive to which distribution you start from. This asymmetry is often problematic when you want to compare distributions without bias. JS Divergence fixes this by averaging over the two directions. It does not treat either P or Q as the "correct" distribution, but looks at their combined behavior through the mixture distribution M.
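A quick numerical check of this symmetry on two made-up distributions (the small kl helper below is written directly from the definition, not taken from a library):

import numpy as np

def kl(p, q):
    # Discrete KL Divergence, assuming p and q are valid distributions with no zero entries
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
m = 0.5 * (p + q)

print(kl(p, q), kl(q, p))                  # the two KL directions disagree
print(0.5 * kl(p, m) + 0.5 * kl(q, m),
      0.5 * kl(q, m) + 0.5 * kl(p, m))     # JS Divergence is the same either way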

We saw earlier that KL Divergence is related to Shannon Entropy. Shannon Entropy is a special case of Rényi Entropy. The Rényi Entropy of a distribution is defined as —

Rényi Entropy of a distribution P(x), with parameter α: $H_\alpha(P) = \frac{1}{1-\alpha} \log \sum_x P(x)^\alpha$

Rényi Entropy is parameterized by α > 0. α controls how much weight is given to different probabilities in the distribution.

  • α = 1: Rényi Entropy equals Shannon Entropy, giving equal weightage to all probable events. You can derive this by taking the limit α → 1 and applying L'Hôpital's rule.
Deriving Shannon Entropy from Rényi Entropy
  • α < 1: The entropy increases sensitivity to rare events (lower probabilities), making it more focused on the diversity or spread of the distribution.
  • α > 1: The entropy increases sensitivity to common events (higher probabilities), making it more focused on the concentration or dominance of a few outcomes.
Rényi Entropy plot for various values of α (log base e; base 2 can also be used). Image credit — Author
  • α = 0: Rényi Entropy approaches the logarithm of the number of possible outcomes (assuming all outcomes have non-zero probability). This is called the Hartley Entropy.
  • α = ∞: As α → ∞, Rényi Entropy becomes the min-entropy, focusing solely on the most probable outcome.
Min-entropy: $H_\infty(P) = -\log \max_x P(x)$
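The bullets above can be checked numerically. Below is a small sketch written directly from the definition (the renyi_entropy helper is my own; α = 1 and α = ∞ are handled as their limiting cases):

import numpy as np

def renyi_entropy(p, alpha):
    p = np.asarray(p, dtype=float)
    if alpha == 1:                      # Shannon entropy (limit α → 1)
        return -np.sum(p * np.log(p))
    if np.isinf(alpha):                 # min-entropy (limit α → ∞)
        return -np.log(np.max(p))
    return np.log(np.sum(p ** alpha)) / (1 - alpha)

p = [0.5, 0.25, 0.15, 0.10]
for alpha in [0, 0.5, 1, 2, np.inf]:
    print(alpha, round(renyi_entropy(p, alpha), 4))
# α = 0 gives log(4) (Hartley entropy); the values decrease as α grows toward the min-entropy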

The Rényi Divergence is a measure based on Rényi Entropy. The Rényi Divergence between two distributions P and Q, parameterized by α, is defined as —

Rényi Divergence between two discrete distributions P(x) and Q(x), with parameter α: $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^\alpha \, Q(x)^{1-\alpha}$

KL Divergence is a special case of Rényi Divergence, obtained in the limit α → 1.

Deriving KL Divergence from Rényi Divergence (sketched after the bullets below)
  • α < 1: Focuses on rare events; more sensitive to the tails of the distributions.
  • α > 1: Focuses on common events; more sensitive to high-probability regions.
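For reference, a brief sketch of that α → 1 limit (the standard L'Hôpital argument):

$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^\alpha Q(x)^{1-\alpha}$

Both the numerator and the denominator of this expression go to 0 as α → 1 (the sum equals 1 at α = 1). Applying L'Hôpital's rule:

$\lim_{\alpha \to 1} D_\alpha(P \| Q) = \left. \frac{d}{d\alpha} \log \sum_x P(x)^\alpha Q(x)^{1-\alpha} \right|_{\alpha = 1} = \sum_x P(x) \log \frac{P(x)}{Q(x)} = D_{KL}(P \| Q)$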
Rényi Divergence plot between P and Q. Image credit — Author

The Rényi Divergence is always non-negative and equals 0 when P = Q. The figure above illustrates how the divergence changes as we vary the distribution P. The divergence increases, with the amount of increase depending on the value of α. A higher value of α makes Rényi Divergence more sensitive to changes in the probability distribution.

Rényi Divergence finds application in Differential Privacy, an important concept in privacy-preserving machine learning. Differential Privacy is a mathematical framework that protects individuals' privacy when their data is included in a dataset. It ensures that the output of an algorithm is not significantly affected by the inclusion or exclusion of any single individual's data. Rényi Differential Privacy (RDP) is an extension of differential privacy that uses Rényi Divergence to provide tighter privacy guarantees. We will discuss these in a future blog post.

In an e-commerce setting, data drift can occur when the underlying probability distribution of user behavior changes over time. This can affect various aspects of the business, such as product recommendations. To illustrate how different divergences can be used to detect this drift, consider the following toy example involving customer purchase behavior over seven weeks.

Imagine an e-commerce platform that tracks customer purchases across five product categories: Electronics, Clothing, Books, Home & Kitchen, and Toys. The platform collects click data weekly on the proportion of clicks in each category. These are represented as probability distributions, shown in the following code block.

import numpy as np

weeks = {
    'Week 1': np.array([0.3, 0.4, 0.2, 0.05, 0.05]),
    'Week 2': np.array([0.25, 0.45, 0.2, 0.05, 0.05]),
    'Week 3': np.array([0.2, 0.5, 0.2, 0.05, 0.05]),
    'Week 4': np.array([0.15, 0.55, 0.2, 0.05, 0.05]),
    'Week 5': np.array([0.1, 0.6, 0.2, 0.05, 0.05]),
    'Week 6': np.array([0.1, 0.55, 0.25, 0.05, 0.05]),
    'Week 7': np.array([0.05, 0.65, 0.25, 0.025, 0.025]),
}

From Week 1 to Week 7, we observe the following —

  • Week 1 to Week 2: There is a minor drift, with the second category's click share rising slightly.
  • Week 3: A more pronounced drift occurs as the second category becomes more dominant.
  • Week 5 to Week 7: A significant shift occurs, where the second category keeps increasing its click share while others, especially the first category, lose relevance.

We can calculate the divergences using the following —

from scipy.special import kl_div  # elementwise p*log(p/q) - p + q; sums to KL for normalized distributions

# Calculate KL Divergence
def kl_divergence(p, q):
    return np.sum(kl_div(p, q))

# Calculate Jensen-Shannon Divergence
def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

# Calculate Renyi Divergence
def renyi_divergence(p, q, alpha):
    return (1 / (alpha - 1)) * np.log(np.sum(np.power(p, alpha) * np.power(q, 1 - alpha)))
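One way to apply these helpers is to compare each week against Week 1 as the baseline (this driver loop is my own sketch; I treat the current week as P and Week 1 as Q, and the original figure may use a different convention or α values):

baseline = weeks['Week 1']
for week, dist in weeks.items():
    print(week,
          round(kl_divergence(dist, baseline), 4),
          round(js_divergence(dist, baseline), 4),
          round(renyi_divergence(dist, baseline, alpha=0.5), 4),
          round(renyi_divergence(dist, baseline, alpha=2), 4))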

Toy example — divergence measures over time

KL Divergence shows increasing values, indicating that the distribution of purchases diverges more from the baseline as time goes on. From Week 1 to Week 7, KL Divergence emphasizes changes in the second product category, which continues to dominate. Jensen-Shannon Divergence shows a similar smoothly increasing trend, confirming that the distributions are becoming less similar. JS Divergence captures the collective drift across the categories.

Rényi Divergence varies considerably based on the chosen α.

  • With α = 0.5, the divergence places more weight on rare categories (categories 4 and 5 in the distribution). It picks up the drift earlier when these rare categories fluctuate, especially from Week 6 to Week 7, when their probabilities drop to 0.025.
  • With α = 2, the divergence highlights the growing dominance of the second category, showing that high-probability items are shifting and the distribution is becoming less diverse.

You can visualize these trends in the figure above, where you can observe the sharp rise in slopes. By monitoring the divergences over the weeks, the e-commerce platform can detect data drift and take measures, such as retraining product recommendation models.

References and interesting reads —

  1. Information theory — Wikipedia
  2. Kullback–Leibler divergence — Wikipedia
  3. Entropy (information theory) — Wikipedia
  4. Jensen–Shannon divergence — Wikipedia
  5. Rényi entropy — Wikipedia
  6. Rényi Divergence — https://arxiv.org/pdf/1206.2459

I hope you found my article interesting. Thanks for reading!