Uncertainty in Markov Decision Processes: a Robust Linear Programming approach | by Hussein Fellahi | Sep, 2024

Let’s begin by giving a proper definition of MDPs:

A Markov Decision Process is a 5-tuple (S, A, R, P, γ) such that:

  • S is the set of states the agent can be in
  • A is the set of actions the agent can take
  • R : S × A → ℝ is the reward function
  • P is the set of probability distributions defined such that P(s’|s,a) is the probability of transitioning to state s’ if the agent takes action a in state s. Note that MDPs are Markov processes, meaning that the Markov property holds on the transition probabilities: P(Sₜ₊₁|S₀, A₀, …, Sₜ, Aₜ) = P(Sₜ₊₁|Sₜ, Aₜ)
  • γ ∈ (0, 1] is the discount factor. While we usually deal with discounted problems (i.e. γ < 1), the formulations presented are also valid for undiscounted MDPs (γ = 1)

We then define the policy, i.e. what dictates the agent’s behavior in an MDP:

A policy π is a probability measure over the action space, defined such that π(a|s) is the probability of taking action a when the agent is in state s.

We finally introduce the value function, i.e. the agent’s objective in an MDP:

The value function of a policy π is the expected discounted reward under this policy, when starting at a given state s:
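V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s \right]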

In particular, the value function of the optimal policy π* satisfies the Bellman optimality equation:
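V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V^*(s') \right]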

This yields the deterministic optimal policy:
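\pi^*(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V^*(s') \right]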

Deriving the LP formulation of MDPs:

Given the above definitions, we can begin by noticing that any value function V that satisfies
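V(s) \ge R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V(s') \quad \forall s \in S, \, a \in A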

is an upper bound on the optimal value function. To see this, we can start by noticing that such a value function also satisfies:
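V(s) \ge \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V(s') \right] \quad \forall s \in S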

We recognize the value iteration operator applied to V:
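(H^* V)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V(s') \right]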

i.e.
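V \ge H^* V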

Also noticing that the H* operator is increasing (monotone), we can apply it iteratively to obtain:
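V \ge H^* V \ge (H^*)^2 V \ge \dots \ge \lim_{n \to \infty} (H^*)^n V = V^*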

where we used the property of V* being the fixed point of H*.

Therefore, finding V* comes down to finding the tightest upper bound V that obeys the above inequality, which yields the following formulation:
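\min_{V} \; \sum_{s \in S} \mu(s) \, V(s) \quad \text{s.t.} \quad V(s) \ge \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V(s') \right] \quad \forall s \in S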

Here we added a weight term μ(s) corresponding to the probability of starting in state s. We can see that the above problem is linear in V and can be rewritten as follows:
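\min_{V} \; \sum_{s \in S} \mu(s) \, V(s) \quad \text{s.t.} \quad V(s) \ge R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V(s') \quad \forall s \in S, \, a \in A

To make this final linear program concrete, here is a minimal sketch of how it could be solved with scipy.optimize.linprog. The solve_mdp_lp helper, the array shapes, and the toy numbers below are illustrative assumptions rather than part of the derivation:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(R, P, gamma, mu):
    """Solve the LP above for V* (illustrative sketch).

    R:     (n_states, n_actions) array, R[s, a] = reward for taking a in s
    P:     (n_actions, n_states, n_states) array, P[a, s, s2] = P(s2 | s, a)
    gamma: discount factor in (0, 1)
    mu:    (n_states,) array of initial-state probabilities (the weights mu(s))
    """
    n_states, n_actions = R.shape

    # Objective: minimize sum_s mu(s) * V(s)
    c = mu

    # One linear constraint per (s, a):
    #   V(s) - gamma * sum_{s'} P(s'|s,a) V(s') >= R(s, a)
    # linprog expects A_ub @ x <= b_ub, so we flip the sign.
    A_ub = np.zeros((n_states * n_actions, n_states))
    b_ub = np.zeros(n_states * n_actions)
    for s in range(n_states):
        for a in range(n_actions):
            row = gamma * P[a, s, :]
            row[s] -= 1.0
            A_ub[s * n_actions + a] = row
            b_ub[s * n_actions + a] = -R[s, a]

    # V is unbounded in sign, so override linprog's default (0, +inf) bounds.
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    V = res.x

    # Recover the deterministic greedy policy from V*.
    Q = R + gamma * np.einsum("axy,y->xa", P, V)
    return V, Q.argmax(axis=1)

# Tiny 2-state / 2-action example (made-up numbers, just to exercise the solver).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a=0]
              [[0.5, 0.5], [0.1, 0.9]]])  # P[a=1]
V, policy = solve_mdp_lp(R, P, gamma=0.9, mu=np.array([0.5, 0.5]))
print(V, policy)
```

Each (s, a) pair contributes one linear constraint, and the greedy policy is recovered from the optimal V by taking the argmax of the corresponding Q-values.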