Reinforcement learning (RL) is a type of machine learning in which an agent learns to make decisions in an environment by receiving rewards or penalties for its actions.
In RL, the agent is the decision-making entity: it observes the environment (which provides the context and feedback for the agent's actions) and follows a policy in order to maximize the cumulative reward, or expected return. The goal of RL is to learn a policy (i.e., a strategy) for choosing an action a_t when in a state s_t.
Unlike supervised learning, reinforcement learning does not rely on labeled data. Instead, it learns through trial and error, trying different actions and receiving feedback based on the outcomes.
An RL agent's reward-maximizing learning process is often modeled as a Markov decision process (MDP). An MDP defines a set of states, a set of actions, transition probabilities, and rewards, from which the agent can learn an optimal policy that maximizes the cumulative reward.
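As a concrete illustration, the following minimal sketch (the two-state layout, state names, and reward values are invented for this example) writes an MDP's states, actions, transition probabilities, and rewards down explicitly in Python:

```python
# A tiny, hypothetical two-state MDP written out explicitly.
# transitions[state][action] is a list of (probability, next_state, reward) tuples.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

transitions = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # moving sometimes fails
    },
    "s1": {
        "stay": [(1.0, "s1", 2.0)],
        "move": [(1.0, "s0", 0.0)],
    },
}
```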
The Markov reward process introduces a distribution Pr(r_{t+1} | s_t) over the possible rewards received at the next step, given the current state s_t. The quantity of interest is the return, the cumulative discounted reward G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ..., where γ in [0, 1] is the discount factor.
The Markov decision process additionally defines a distribution Pr(s_{t+1} | s_t, a_t) over possible next states, given the current state and the action taken.
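As a small worked example, the sketch below computes the discounted return for a short reward sequence (the reward values and the discount factor of 0.9 are illustrative choices):

```python
def discounted_return(rewards, gamma=0.9):
    """Return G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: rewards observed after time t
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```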
In RL, a policy is the strategy or set of rules an agent uses to decide which action to take in a given state of the environment. Common kinds of policies include (see the sketch after this list):
Stochastic policy: maps each state to a probability distribution over actions, π(a | s); the agent samples its action from this distribution.
Deterministic policy: maps each state to a single action, a = π(s).
Stationary (time-homogeneous) policy: the mapping from states to actions does not change with the time step t.
Non-stationary (time-dependent) policy: the mapping from states to actions can change with the time step t.
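The following toy sketch (state and action names are made up, in the style of the MDP example above) contrasts a deterministic policy, which always returns the same action for a state, with a stochastic policy, which samples an action from a distribution:

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "move", "s1": "stay"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 0.9, "move": 0.1},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]

print(act_deterministic("s0"))  # always "move"
print(act_stochastic("s0"))     # "move" about 70% of the time, "stay" about 30%
```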
Choosing the best policy is formalized through two optimal value functions:
Optimal state-value function v^*: v^*(s) = max_π v_π(s), the largest expected return obtainable from state s under any policy.
Optimal action-value function q^*: q^*(s, a) = max_π q_π(s, a), the largest expected return obtainable by taking action a in state s and acting optimally thereafter.
The Bellman equation is a recursive formula in reinforcement learning and dynamic programming that describes how the value of a state depends on the immediate reward received and the value of future states.
As Bellman's formulation makes explicit, state values and action values in RL must be consistent with each other: the value of a state equals the expected immediate reward plus the discounted value of the states that follow.
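As an illustration of this consistency condition, here is a minimal value-iteration sketch that repeatedly applies the Bellman optimality backup; it reuses the toy `transitions` dictionary from the MDP sketch above, and the discount factor and stopping threshold are arbitrary choices for the example:

```python
def value_iteration(transitions, gamma=0.9, theta=1e-6):
    """Estimate v*(s) by iterating v(s) <- max_a sum_{s'} P(s'|s,a) [r + gamma * v(s')]."""
    v = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            q_values = [
                sum(p * (r + gamma * v[s_next]) for p, s_next, r in outcomes)
                for outcomes in actions.values()
            ]
            best = max(q_values)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v

print(value_iteration(transitions))  # approximate v*(s) for each state of the toy MDP
```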
Tabular RL is a specific class of RL methods that stores the value of each state-action pair in a table, which makes it suitable for problems with a small, discrete state space.
Model-based methods learn or are given a model of the environment's dynamics (transition probabilities and rewards) and use it to plan, as in value iteration.
Model-free methods learn value estimates or a policy directly from sampled experience, without building a model of the environment; tabular Q-learning, sketched below, is a standard example.
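To make the model-free, tabular case concrete, here is a minimal Q-learning sketch; the environment interface (`env.reset()` returning a state and `env.step(action)` returning a `(next_state, reward, done)` tuple) and all hyperparameter values are assumptions for illustration, not any particular library's API:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Learn a table Q[state][action] from experience, without a model of the environment."""
    Q = defaultdict(lambda: {a: 0.0 for a in actions})
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, reward, done = env.step(action)
            # Tabular Q-learning update toward the bootstrapped target.
            target = reward + gamma * max(Q[next_state].values()) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```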