It could be stunning, however on this article, I want to speak concerning the knapsack downside, the traditional optimisation downside that has been studied for over a century. In accordance with Wikipedia, the issue is outlined as follows:
Given a set of things, every with a weight and a price, decide which objects to incorporate within the assortment in order that the whole weight is lower than or equal to a given restrict and the whole worth is as massive as doable.
Whereas product analysts might not bodily pack knapsacks, the underlying mathematical mannequin is extremely related to lots of our duties. There are quite a few real-world functions of the knapsack downside in product analytics. Listed below are a number of examples:
- Advertising and marketing Campaigns: The advertising and marketing crew has a restricted finances and capability to run campaigns throughout totally different channels and areas. Their purpose is to maximise a KPI, such because the variety of new customers or income, all whereas adhering to present constraints.
- Retail Area Optimization: A retailer with restricted bodily area of their shops seeks to optimize product placement to maximise income.
- Product Launch Prioritization: When launching a brand new product, the operations crew’s capability could be restricted, requiring prioritization of particular markets.
Such and comparable duties are fairly widespread, and lots of analysts encounter them recurrently. So, on this article, I’ll discover totally different approaches to fixing it, starting from naive, easy methods to extra superior strategies equivalent to linear programming.
One more reason I selected this subject is that linear programming is among the strongest and common instruments in prescriptive analytics — a kind of study that focuses on offering stakeholders with actionable choices to make knowledgeable choices. As such, I imagine it’s a necessary ability for any analyst to have of their toolkit.
Let’s dive straight into the case we’ll be exploring. Think about we’re a part of a advertising and marketing crew planning actions for the upcoming month. Our goal is to maximise key efficiency indicators (KPIs), such because the variety of acquired customers and income whereas working inside a restricted advertising and marketing finances.
We’ve estimated the anticipated outcomes of varied advertising and marketing actions throughout totally different international locations and channels. Right here is the information we’ve got:
nation
— the market the place we will do some promotional actions;channel
— the acquisition methodology, equivalent to social networks or influencer campaigns;customers
— the anticipated variety of customers acquired inside a month of the promo marketing campaign;cs_contacts
— the incremental Buyer Help contacts generated by the brand new customers;marketing_spending
— the funding required for the exercise;income
— the first-year LTV generated from acquired prospects.
Be aware that the dataset is artificial and randomly generated, so don’t attempt to infer any market-related insights from it.
First, I’ve calculated the high-level statistics to get a view of the numbers.
Let’s decide the optimum set of selling actions that maximizes income whereas staying inside the $30M advertising and marketing finances.
At first look, the issue could appear simple: we may calculate all doable mixtures of selling actions and choose the optimum one. Nevertheless, it could be a difficult activity.
With 62 segments, there are 2⁶² doable mixtures, as every phase can both be included or excluded. This ends in roughly 4.6*10¹⁸ mixtures — an astronomical quantity.
To higher perceive the computational feasibility, let’s take into account a smaller subset of 15 segments and estimate the time required for one iteration.
import itertools
import pandas as pd
import tqdm# studying information
df = pd.read_csv('marketing_campaign_estimations.csv', sep = 't')
df['segment'] = df.nation + ' - ' + df.channel
# calculating mixtures
mixtures = []
segments = record(df.phase.values)[:15]
print('variety of segments: ', len(segments))
for num_items in vary(len(segments) + 1):
mixtures.lengthen(
itertools.mixtures(segments, num_items)
)
print('variety of mixtures: ', len(mixtures))
tmp = []
for chosen in tqdm.tqdm(mixtures):
tmp_df = df[df.segment.isin(selected)]
tmp.append(
{
'selected_segments': ', '.be part of(chosen),
'customers': tmp_df.customers.sum(),
'cs_contacts': tmp_df.cs_contacts.sum(),
'marketing_spending': tmp_df.marketing_spending.sum(),
'income': tmp_df.income.sum()
}
)
# variety of segments: 15
# variety of mixtures: 32768
It took roughly 4 seconds to course of 15 segments, permitting us to deal with round 7,000 iterations per second. Utilizing this estimate, let’s calculate the execution time for the total set of 62 segments.
2**62 / 7000 / 3600 / 24 / 365
# 20 890 800.6
Utilizing brute pressure, it might take round 20.9 million years to get the reply to our query — clearly not a possible choice.
Execution time is totally decided by the variety of segments. Eradicating only one phase can cut back time twice. With this in thoughts, let’s discover doable methods to merge segments.
As regular, there are extra small-sized segments than larger ones, so merging them is a logical step. Nevertheless, it’s vital to notice that this strategy might cut back accuracy since a number of segments are aggregated into one. Regardless of this, it may nonetheless yield an answer that’s “ok.”
To simplify, let’s merge all segments that contribute lower than 0.1% of income.
df['share_of_revenue'] = df.income/df.income.sum() * 100
df['segment_group'] = record(map(
lambda x, y: x if y >= 0.1 else 'different',
df.phase,
df.share_of_revenue
))print(df[df.segment_group == 'other'].share_of_revenue.sum())
# 0.53
print(df.segment_group.nunique())
# 52
With this strategy, we are going to merge ten segments into one, representing 0.53% of the whole income (the potential margin of error). With 52 segments remaining, we will acquire the answer in simply 20.4K years. Whereas this can be a vital enchancment, it’s nonetheless not ample.
It’s possible you’ll take into account different heuristics tailor-made to your particular activity. For example, in case your constraint is a ratio (e.g., contact fee = CS contacts / customers ≤ 5%), you can group all segments the place the constraint holds true, because the optimum answer will embody all of them. In our case, nevertheless, I don’t see any further methods to cut back the variety of segments, so brute pressure appears impractical.
That mentioned, if the variety of mixtures is comparatively small and brute pressure may be executed inside an inexpensive time, it may be an excellent strategy. It’s easy to develop and offers correct outcomes.
Since brute pressure just isn’t possible for calculating all mixtures, let’s take into account an easier algorithm to deal with this downside.
One doable strategy is to give attention to the top-performing segments. We are able to consider phase efficiency by calculating income per greenback spent, then type all actions primarily based on this ratio and choose the highest performers that match inside the advertising and marketing finances. Let’s implement it.
df['revenue_per_spend'] = df.income / df.marketing_spending
df = df.sort_values('revenue_per_spend', ascending = False)
df['spend_cumulative'] = df.marketing_spending.cumsum()
selected_df = df[df.spend_cumulative <= 30000000]
print(selected_df.form[0])
# 48
print(selected_df.income.sum()/1000000)
# 107.92
With this strategy, we chosen 48 actions and bought $107.92M in income.
Sadly, though the logic appears cheap, it isn’t the optimum answer for maximizing income. Let’s take a look at a easy instance with simply three advertising and marketing actions.
Utilizing the highest markets strategy, we would choose France and obtain $68M in income. Nevertheless, by selecting two different markets, we may obtain considerably higher outcomes — $97.5M. The important thing level is that our algorithm optimizes not just for most income but in addition for minimizing the variety of chosen segments. Due to this fact, this strategy is not going to yield one of the best outcomes, particularly contemplating its lack of ability to account for a number of constraints.
Since all easy approaches have failed, we should return to the basics and discover the idea behind this downside. Fortuitously, the knapsack downside has been studied for a few years, and we will apply optimization methods to unravel it in seconds somewhat than years.
The issue we’re attempting to unravel is an instance of Integer Programming, which is definitely a subdomain of Linear Programming.
We’ll focus on this shortly, however first, let’s align on the important thing ideas of the optimization course of. Every optimisation downside consists of:
- Choice variables: Parameters that may be adjusted within the mannequin, usually representing the levers or choices we wish to make.
- Goal perform: The goal variable we purpose to maximise or decrease. It goes with out saying that it should rely on the choice variables.
- Constraints: Situations positioned on the choice variables that outline their doable values. For instance, guaranteeing the crew can’t work a damaging variety of hours.
With these primary ideas in thoughts, we will outline Linear Programming as a state of affairs the place the next circumstances maintain:
- The target perform is linear.
- All constraints are linear.
- Choice variables are real-valued.
Integer Programming is similar to Linear Programming, with one key distinction: some or all resolution variables should be integers. Whereas this will look like a minor change, it considerably impacts the answer strategy, requiring extra advanced strategies than these utilized in Linear Programming. One widespread method is branch-and-bound. We received’t dive deeper into the idea right here, however you may at all times discover extra detailed explanations on-line.
For linear optimization, I want the broadly used Python package deal PuLP. Nevertheless, there are different choices obtainable, equivalent to Python MIP or Pyomo. Let’s set up PuLP through pip.
! pip set up pulp
Now, it’s time to outline our activity as a mathematical optimisation downside. There are the next steps for it:
- Outline the set of resolution variables (levers we will modify).
- Align on the target perform (a variable that we are going to be optimising for).
- Formulate constraints (the circumstances that should maintain true throughout optimisations).
Let’s undergo the steps one after the other. However first, we have to create the issue object and set the target — maximization in our case.
from pulp import *
downside = LpProblem("Marketing_campaign", LpMaximize)
The following step is defining the choice variables — parameters that we will change throughout optimisation. Our principal resolution is both to run a advertising and marketing marketing campaign or not. So, we will mannequin it as a set of binary variables (0 or 1) for every phase. Let’s do it with the PuLP library.
segments = vary(df.form[0])
chosen = LpVariable.dicts("Chosen", segments, cat="Binary")
After that, it’s time to align on the target perform. As mentioned, we wish to maximise the income. The whole income can be a sum of income from all the chosen segments (the place decision_variable = 1
). Due to this fact, we will outline this components because the sum of the anticipated income for every phase multiplied by the choice binary variable.
downside += lpSum(
chosen[i] * record(df['revenue'].values)[i]
for i in segments
)
The ultimate step is so as to add constraints. Let’s begin with a easy constraint: our advertising and marketing spending should be under $30M.
downside += lpSum(
chosen[i] * df['marketing_spending'].values[i]
for i in segments
) <= 30 * 10**6
Trace: you may print
downside
to double examine the target perform and constraints.
Now that we’ve outlined all the pieces, we will run the optimization and analyze the outcomes.
downside.remedy()
It takes lower than a second to run the optimization, a major enchancment in comparison with the hundreds of years that brute pressure would require.
Outcome - Optimum answer discoveredGoal worth: 110162662.21000001
Enumerated nodes: 4
Complete iterations: 76
Time (CPU seconds): 0.02
Time (Wallclock seconds): 0.02
Let’s save the outcomes of the mannequin execution — the choice variables indicating whether or not every phase was chosen or not — into our dataframe.
df['selected'] = record(map(lambda x: x.worth(), chosen.values()))
print(df[df.selected == 1].income.sum()/10**6)
# 110.16
It really works like magic, permitting you to acquire the answer rapidly. Moreover, be aware that we achieved increased income in comparison with our naive strategy: $110.16M versus $107.92M.
We’ve examined integer programming with a easy instance that includes only one constraint, however we will lengthen it additional. For example, we will add further constraints for our CS contacts to make sure that our Operations crew can deal with the demand in a wholesome manner:
- The variety of further CS contacts ≤ 5K
- Contact fee (CS contacts/customers) ≤ 0.042
# outline the issue
problem_v2 = LpProblem("Marketing_campaign_v2", LpMaximize)# resolution variables
segments = vary(df.form[0])
chosen = LpVariable.dicts("Chosen", segments, cat="Binary")
# goal perform
problem_v2 += lpSum(
chosen[i] * record(df['revenue'].values)[i]
for i in segments
)
# Constraints
problem_v2 += lpSum(
chosen[i] * df['marketing_spending'].values[i]
for i in segments
) <= 30 * 10**6
problem_v2 += lpSum(
chosen[i] * df['cs_contacts'].values[i]
for i in segments
) <= 5000
problem_v2 += lpSum(
chosen[i] * df['cs_contacts'].values[i]
for i in segments
) <= 0.042 * lpSum(
chosen[i] * df['users'].values[i]
for i in segments
)
# run the optimisation
problem_v2.remedy()
The code is simple, with the one tough half being the transformation of the ratio constraint into an easier linear type.
One other potential constraint you would possibly take into account is limiting the variety of chosen choices, for instance, to 10. This constraint may very well be fairly useful in prescriptive analytics, for instance, when you should choose the top-N most impactful focus areas.
# outline the issue
problem_v3 = LpProblem("Marketing_campaign_v2", LpMaximize)# resolution variables
segments = vary(df.form[0])
chosen = LpVariable.dicts("Chosen", segments, cat="Binary")
# goal perform
problem_v3 += lpSum(
chosen[i] * record(df['revenue'].values)[i]
for i in segments
)
# constraints
problem_v3 += lpSum(
chosen[i] * df['marketing_spending'].values[i]
for i in segments
) <= 30 * 10**6
problem_v3 += lpSum(
chosen[i] for i in segments
) <= 10
# run the optimisation
problem_v3.remedy()
df['selected'] = record(map(lambda x: x.worth(), chosen.values()))
print(df.chosen.sum())
# 10
One other doable choice to tweak our downside is to vary the target perform. We’ve been optimising for the income, however think about we wish to maximise each income and new customers on the similar time. For that, we will barely change our goal perform.
Let’s take into account one of the best strategy. We may calculate the sum of income and new customers and purpose to maximise it. Nevertheless, since income is, on common, 1000 occasions increased, the outcomes could be skewed towards maximizing income. To make the metrics extra comparable, we will normalize each income and customers primarily based on their complete sums. Then, we will outline the target perform as a weighted sum of those ratios. I’d use equal weights (0.5) for each metrics, however you may modify the weights to provide extra worth to considered one of them.
# outline the issue
problem_v4 = LpProblem("Marketing_campaign_v2", LpMaximize)# resolution variables
segments = vary(df.form[0])
chosen = LpVariable.dicts("Chosen", segments, cat="Binary")
# goal Operate
problem_v4 += (
0.5 * lpSum(
chosen[i] * df['revenue'].values[i] / df['revenue'].sum()
for i in segments
)
+ 0.5 * lpSum(
chosen[i] * df['users'].values[i] / df['users'].sum()
for i in segments
)
)
# constraints
problem_v4 += lpSum(
chosen[i] * df['marketing_spending'].values[i]
for i in segments
) <= 30 * 10**6
# run the optimisation
problem_v4.remedy()
df['selected'] = record(map(lambda x: x.worth(), chosen.values()))
We obtained the optimum goal perform worth of 0.6131, with income at $104.36M and 136.37K new customers.
That’s it! We’ve realized the way to use integer programming to unravel numerous optimisation issues.
You could find the total code on GitHub.
On this article, we explored totally different strategies for fixing the knapsack downside and its analogues in product analytics.
- We started with a brute-force strategy however rapidly realized it might take an unreasonable period of time.
- Subsequent, we tried utilizing widespread sense by naively choosing the top-performing segments, however this strategy yielded incorrect outcomes.
- Lastly, we turned to Integer Programming, studying the way to translate our product duties into optimization fashions and remedy them successfully.
With this, I hope you’ve gained one other useful analytical device on your toolkit.
Thank you a large number for studying this text. I hope this text was insightful for you. In case you have any follow-up questions or feedback, please go away them within the feedback part.
All the photographs are produced by the writer until in any other case acknowledged.