13 Reinforcement Learning

Reinforcement Learning
Multi-Armed Bandit
Exploration-Exploitation
This lecture discusses the basics of reinforcement learning, including the concepts of agents, environments, rewards, and policies. It also covers the exploration-exploitation trade-off and the multi-armed bandit problem.
Author

Jiuru Lyu

Published

April 13, 2025

Reinforcement learning is agent-oriented learning.

The RL Interface: Sequential Decision Making

  • Overview: At each time \(t\), the agent
    • observes a state \(s_t\) from the environment,
    • executes an action \(a_t\),
    • receives a scalar reward \(r_t\), and
    • transitions to the next state \(s_{t+1}\).
  • The process is repeated and forms a trajectory of states, actions, and rewards: \(s_1,a_1,r_1,s_2,a_2,r_2,\dots\).

Definition 1 (Markov Decision Process (MDP)). An MDP is defined by the tuple \((\mathcal{S},\mathcal{A},R,P)\):

  • \(\mathcal{S}\): set of states
  • \(\mathcal{A}\): set of actions
  • \(R\): reward function \[p(r_t\mid s_t,a_t)\]
  • \(P\): transition function \[p(s_{t+1}\mid s_t,a_t)\]

Policy: a mapping from states to actions, \[\pi:\mathcal{S}\to\mathcal{A}.\] Goal: learn a policy that maximizes the cumulative reward: \[G=\sum_{t=1}^Tr_t.\]
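
To make the interface concrete, here is a minimal Python sketch of one episode of this loop. The Gym-style `env.reset()`/`env.step()` interface and the `policy` callable are hypothetical stand-ins for illustration, not anything defined in this lecture.

```python
# Minimal sketch of the agent-environment loop (hypothetical `env` and `policy`).
# Assumed interface: env.reset() -> initial state; env.step(a) -> (next state, reward, done).
def run_episode(env, policy, T=100):
    trajectory = []          # collects (s_t, a_t, r_t)
    s = env.reset()
    G = 0.0                  # cumulative reward G = sum_t r_t
    for t in range(T):
        a = policy(s)        # pi: S -> A
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        G += r
        s = s_next
        if done:
            break
    return trajectory, G
```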

Multi-Armed Bandit Problem

Figure 1: Multi-Armed Bandit Problem
  • Set-up:
    • \(\mathcal{S}\): single state

    • \(\mathcal{A}\): arms \(a=1,a=2,a=3,\dots,a=k\).

    • \(R\): \(P(r\mid a)\) for \(a\in\mathcal{A}\), a probabilistic mapping from action to reward.

    • Task: Agent sequentially chooses which arm to pull next.

      for steps \(t=1,\dots, T\): the arm chosen to pull is \(a_t\in\mathcal{A}\); the observed reward is \(r_t\sim P(r\mid a_t)\).

    • Goal: Decide a sequence of actions that maximizes cumulative reward \[\sum_{t=1}^T r_t\]

    • Assumption: The reward depends only on the action taken, i.e., rewards are i.i.d. with expectation \(\mu(a)\) for \(a=1,\dots, k\).

Example 1 (Example: Best Arm to Pull)  

  • action \(1\): \(\mu(1)=8\)
  • action \(2\): \(\mu(2)=12\)
  • action \(3\): \(\mu(3)=12.5\)
  • action \(4\): \(\mu(4)=11\)

Action \(3\) is the best since \(\mu(3)\) is the highest. If we knew \(\mu(a)\) for every \(a\in\mathcal{A}\), then always taking \(a=3\) would give the highest cumulative reward.

  • Challenge: \(\mu(a)\) is unknown; indeed, the whole reward distribution is unknown.
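
To experiment with this setup, here is a toy simulator sketch. The arm means are taken from Example 1; the Gaussian reward noise, the class name `GaussianBandit`, and the standard deviation are assumptions made purely for illustration.

```python
import numpy as np

# Toy 4-armed bandit using the means from Example 1.
# The lecture only fixes mu(a); the Gaussian noise model is an assumption.
class GaussianBandit:
    def __init__(self, means=(8.0, 12.0, 12.5, 11.0), std=1.0, seed=0):
        self.means = np.asarray(means)
        self.std = std
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        """Sample r ~ P(r | a); the reward depends only on the chosen arm (i.i.d.)."""
        return self.rng.normal(self.means[a], self.std)

bandit = GaussianBandit()
print(bandit.pull(2))  # one reward drawn from arm 3 (index 2), mean 12.5
```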

Action-Value Methods

  • Main idea: learn \(Q(a)\approx\mu(a)\quad\forall\,a\in\mathcal{A}\).
  • Suppose we have \(a_1,r_1,a_2,r_2,\dots,a_{t-1},r_{t-1}\).
  • Estimate \(\mu(a)\) for \(a\in\mathcal{A}\) as the sample average: \[ \begin{aligned} Q_t(a)&=\dfrac{\text{sum of rewards when action }a\text{ is taken previously}}{\text{number of times action }a\text{ is taken previously}}\\ &=\dfrac{\dsst\sum_{i=1}^{t-1}r_i\1\qty{a_i=a}}{\dsst\underbrace{\sum_{i=1}^{t-1}\1\qty{a_i=a}}_{N_t(a)}} \end{aligned} \]

Remark 1 (Convergence of Sample Average). The sample average converges to the true value if the action is taken infinitely many times: \[ \lim_{N_t(a)\to\infty}Q_t(a)=\mu(a), \] where \(N_t(a)\) is the number of times action \(a\) has been taken by time \(t\).
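
As a quick sanity-check sketch (with made-up action-reward pairs), the explicit ratio above and the incremental update \(Q(a)\gets Q(a)+\frac{1}{N(a)}[r-Q(a)]\) used in the algorithms below give the same estimates:

```python
import numpy as np

# Sample-average estimate Q_t(a) computed two equivalent ways.
history = [(0, 7.9), (1, 12.3), (0, 8.2), (2, 12.8), (2, 12.1)]  # (action, reward), made-up
k = 3

# Explicit ratio: sum of rewards for arm a / number of times arm a was pulled.
Q_ratio = np.zeros(k)
for a in range(k):
    rewards_a = [r for (ai, r) in history if ai == a]
    if rewards_a:
        Q_ratio[a] = np.mean(rewards_a)

# Incremental form: Q(a) <- Q(a) + (r - Q(a)) / N(a).
Q_inc = np.zeros(k)
N = np.zeros(k)
for a, r in history:
    N[a] += 1
    Q_inc[a] += (r - Q_inc[a]) / N[a]

print(np.allclose(Q_ratio, Q_inc))  # True: both give the same estimates
```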

  • Given \(Q_t(a)\) for all \(a\in\mathcal{A}\), the greedy action is \(a_t^*=\argmax_{a}Q_t(a)\).

  • For the next action \(a_t\):

    • if \(a_t=a_t^*\), we are exploiting: we act on our current estimates of which actions are good.
    • if \(a_t\neq a_t^*\), we are exploring: we collect more data to improve the estimates.
  • We can only pick one \(a_t\), so we can’t do both.

    BUT, we have to do both.

  • This is the key dilemma in reinforcement learning: Exploration vs. Exploitation.

    • Exploration: gain more information about the value of each action.
    • Exploitation: revisit actions that are already known to give high rewards.
  • First attempt: Naïve approach:

    • Exploration phase: Try each arm \(100\) times.
    • Estimate their value as \(Q(a)\) for \(a\in\mathcal{A}\).
    • Exploitation phase: Always pick \(a^*=\argmax_a Q(a)\).
    • Problem:
      • Is \(100\) times too much? e.g., an arm may give negative rewards \(10\) times in a row, yet we keep pulling it.
      • Is \(100\) times enough? \(Q(a)\to\mu(a)\) in the limit of infinite samples. We should never stop exploring.
  • Second attempt: \(\epsilon\)-Greedy Action Selection:

    • This is a simple, effective way to balance exploration and exploitation.
\begin{algorithm} \caption{$\epsilon$-Greedy Action Selection} \begin{algorithmic} \State{Initialize $\begin{cases}Q(a)=0\\N(a)=0\end{cases}$ for $a=1,\dots,k$} \For{$t=1,2,3,\dots$} \State{$a_t=\begin{cases}\dsst\argmax_a Q(a)&\text{with probability }1-\epsilon\longrightarrow\text{exploitation}\\\text{random action}&\text{with probability }\epsilon\longrightarrow\text{exploration}\end{cases}\qquad$} \# breaks ties randomly \State{Take action $a_t$, receive reward $r_t$} \State{update: $\begin{cases}N(a_t)\gets N(a_t)+1\\Q(a_t)\gets Q(a_t)+\dfrac{1}{N(a_t)}\qty\Big[r_t-Q(a_t)]\end{cases}\qquad$} \# incremental implementation of sample average \EndFor \end{algorithmic} \end{algorithm}
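
A minimal Python sketch of the \(\epsilon\)-greedy pseudocode above. The `pull(a)` argument is an assumed reward sampler (e.g., the toy bandit sketched earlier), not a fixed API.

```python
import numpy as np

# epsilon-greedy action selection with incremental sample-average updates.
def epsilon_greedy(pull, k, T=1000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros(k)                    # value estimates Q(a)
    N = np.zeros(k)                    # visit counts N(a)
    total = 0.0
    for t in range(T):
        if rng.random() < eps:         # explore with probability eps
            a = rng.integers(k)
        else:                          # exploit: greedy action, ties broken randomly
            a = rng.choice(np.flatnonzero(Q == Q.max()))
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample average
        total += r
    return Q, total
```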

Remark 2 (\(\epsilon\) Controls the Exploration Rate).

  • Larger \(\epsilon\): explore more, learn faster, but converge to a suboptimal policy.
  • Smaller \(\epsilon\): explore less, learn slower, but converge to near-optimal behavior.
  • \(\epsilon\)-greedy explores forever, and we have constant exploration rate.
    • Should we explore forever?
    • Yes! But perhaps explore less as time goes by.
    • Key idea: “Optimism in the face of uncertainty”
      • The more uncertain we are about an action, the more important it is to explore that action.
  • Optimistic initialization:
    • Initialize \(Q(a)\) to some positive value
    • Encourage exploration initially, naturally gets “washed out” as we collect data.
    • This method is somewhat hacky, but it works.
\begin{algorithm} \caption{Optimistic Initialization (Somewhat Hacky)} \begin{algorithmic} \State{Initialize $\begin{cases}Q(a)=\textcolor{red}{Q_0},&\text{where }Q_0>0\\N(a)=0\end{cases}$ for $a=1,\dots,k\qquad$} \# Larger $Q_0$, more exploration \For{$t=1,2,3,\dots$} \State{$a_t=\dsst\argmax_{a}Q(a)\qquad$} \# breaks ties randomly \State{Take action $a_t$, receive reward $r_t$} \State{update: $\begin{cases}N(a_t)\gets N(a_t)+1\\Q(a_t)\gets Q(a_t)+\dfrac{1}{N(a_t)}\qty\Big[r_t-Q(a_t)]\end{cases}\qquad$} \# incremental implementation of sample average \EndFor \end{algorithmic} \end{algorithm}
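A corresponding sketch of optimistic initialization: compared with \(\epsilon\)-greedy, only the initialization changes and the action selection becomes purely greedy. The default \(Q_0=20\) is an arbitrary optimistic value for the toy bandit above, not a recommendation.

```python
import numpy as np

# Optimistic initialization: greedy selection, but Q starts at an optimistic Q0.
def optimistic_greedy(pull, k, T=1000, Q0=20.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.full(k, Q0)                 # optimistic initial values (larger Q0 => more exploration)
    N = np.zeros(k)
    for t in range(T):
        a = rng.choice(np.flatnonzero(Q == Q.max()))  # greedy, ties broken randomly
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample average washes out Q0 over time
    return Q
```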
  • Upper Confidence Bound (UCB) Action Selection:
    • Select actions greedily based on potential of being the best
    • Estimate \(Q(a)\) plus an uncertainty bonus that depends on \(N(a)\).
    • UCB Score: \[Q_t(a)+c\sqrt{\dfrac{\ln(t)}{N_t(a)}},\quad c>0,\] where
      • \(Q_t(a)\) is the sample average estimate
      • \(\dsst c\sqrt{\dfrac{\ln(t)}{N_t(a)}}\) is the exploration bonus
      • \(c\) is the degree of exploration. Larger \(c\), more exploration.
    • As \(N_t(a)\) increases, we become more certain about that action, so we explore it less.
    • As \(t\) increases while \(N_t(a)\) stays the same, the bonus grows, so we explore that action more.
\begin{algorithm} \caption{Upper Confidence Bound (UCB) Action Selection} \begin{algorithmic} \State{Initialize $\begin{cases}Q(a)=0\\N(a)=0\end{cases}$ for $a=1,\dots,k\qquad$} \For{$t=1,2,3,\dots$} \State{$a_t=\dsst\argmax_{a}\mqty[\textcolor{red}{Q_t(a)+c\sqrt{\dfrac{\ln(t)}{N_t(a)}}}]\qquad$} \# breaks ties randomly \State{Take action $a_t$, receive reward $r_t$} \State{update: $\begin{cases}N(a_t)\gets N(a_t)+1\\Q(a_t)\gets Q(a_t)+\dfrac{1}{N(a_t)}\qty\Big[r_t-Q(a_t)]\end{cases}\qquad$} \# incremental implementation of sample average \EndFor \end{algorithmic} \end{algorithm}
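A sketch of UCB selection under the same assumed `pull(a)` interface. Arms that have never been pulled are given an infinite score so that each arm is tried at least once (one common convention; the pseudocode above leaves this case implicit).

```python
import numpy as np

# UCB action selection: score each arm by Q(a) + c*sqrt(ln(t)/N(a)), then act greedily on the score.
def ucb(pull, k, T=1000, c=2.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros(k)
    N = np.zeros(k)
    for t in range(1, T + 1):
        score = np.full(k, np.inf)     # unpulled arms get priority
        pulled = N > 0
        score[pulled] = Q[pulled] + c * np.sqrt(np.log(t) / N[pulled])
        a = rng.choice(np.flatnonzero(score == score.max()))  # ties broken randomly
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample average
    return Q
```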