<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://www.finnrietz.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="http://www.finnrietz.dev/" rel="alternate" type="text/html" /><updated>2026-02-18T09:35:46+00:00</updated><id>http://www.finnrietz.dev/feed.xml</id><title type="html">Finn Rietz.dev</title><subtitle>Finn Rietz personal website. Blog and projects showcase. Topics I encountered during my studies that I find particularly interesting.</subtitle><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><entry><title type="html">Multi-Objective Deep Reinforcement Learning with Lexicographic Task-priority constraints</title><link href="http://www.finnrietz.dev/machine%20learning/lexicographic-morl/" rel="alternate" type="text/html" title="Multi-Objective Deep Reinforcement Learning with Lexicographic Task-priority constraints" /><published>2023-10-19T00:00:00+00:00</published><updated>2023-10-22T11:05:31+00:00</updated><id>http://www.finnrietz.dev/machine%20learning/lexicographic-morl</id><content type="html" xml:base="http://www.finnrietz.dev/machine%20learning/lexicographic-morl/"><![CDATA[<h2 id="motivation">Motivation and introduction</h2>
<p>Deep Reinforcement Learning (RL) Doesn’t Work Yet. 
That’s what Alexander Irpan wrote in his <a href="https://www.alexirpan.com/2018/02/14/rl-hard.html">famous blog post</a> back in 2018. 
Sadly, 5 years later, RL researcher and practitioners are still struggling with the same challenges that Alexander Irpan wrote about 5 years ago. 
Although we had some exciting advances and innovations, e.g. the <a href="https://danijar.com/project/dreamerv3/">Dreamer algorithm series</a>, <a href="https://proceedings.neurips.cc/paper_files/paper/2021/file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf">Decision Transformer</a>, and <a href="https://arxiv.org/pdf/2204.01691.pdf">SayCan</a>, I believe most practitioners would agree with the following statements, which, in one way or another, have also been made by Alexander Irpan:</p>
<ul>
  <li>DRL algorithms, that learn to solve each task from scratch, are still very sample inefficient, often requiring millions of transitions before achieving acceptable levels of performance.</li>
  <li>Designing scalar-valued reward functions for complex tasks that induce the desired behavior is difficult, with few general heuristics available.</li>
  <li>The inherent unsafe exploration of trial-and-error-based learning algorithms and the blackbox-natures of the resulting DNN-based agents hinders the more widespread employment of DRL in real-world applications.</li>
</ul>

<p>Fortunately, with <a href="https://arxiv.org/pdf/2310.02360.pdf">our recent paper (“Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning”)</a>, we address all of these pain points, at least to some extent. 
So, this hopefully got you interested in our work. 
In the next sections, I will provide an informal summary of our paper, starting with a recap of Multi-Objective Reinforcement Learning (MORL).
<br /></p>

<h2 id="morl">Multi-Objective Reinforcement Learning</h2>
<p>In our paper, we want to solve <em>special</em> MORL problems, but we begin with a general definition of MORL problems. 
MORL problems are formalized by a Markov decision process (MDP), \(\mathcal{M} \equiv (\mathcal{S, A}, \mathbf{r}, p, \gamma)\), where \(\mathcal{S \in \mathbb{R}^k, A \in \mathbb{R}^l}\) respectively denote the state- and action-space, \(\mathbf{r}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n\) denotes a <strong>vector-valued</strong> reward function, \(p\) denotes the transition dynamics, and the discount factor is given by \(\gamma \in [0, 1]\). 
Importantly, in MORL, each dimension \(i\) of the vector-valued reward function \(\mathbf{r}(\mathbf{s},\mathbf{a})\) corresponds to a <strong>scalar-valued</strong> subtask reward function, meaning we have \(\mathbf{r}(\mathbf{s},\mathbf{a})_{[i]} = r_i(\mathbf{s},\mathbf{a}) \in \mathbb{R}\).
Further, directly maximizing the vector-valued reward function is usually not possible, since we can’t determine a unique, optimal policy for a vector-valued reward function.
For this reason, MORL algorithms usually rely on scalarization: 
The vector-valued reward function can be scalarized via some weighted sum (not all problems can be modeled this way), then the resulting scalar-valued reward function \({r}(\mathbf{s}, \mathbf{a}) = \sum_{i=1}^n \beta_i r_i(\mathbf{s}, \mathbf{a})\) can be maximized with any classical RL algorithm. 
Building on this formalism, in the next section we introduce <em>lexicographic</em> MORL problems, which are the particular kinds of MORL problems we are interested in.</p>

<h2 id="lexi_morl">Lexicographic MORL</h2>
<p>In <a href="https://arxiv.org/pdf/2310.02360.pdf">our paper</a>, we are interested in solving <em>special</em> MORL problems, with continuous state- and action-spaces, like multi-objective robot control problems.
In particular, the <em>special</em> MORL problems we are interested in are referred to as <em>lexicographic</em> or <em>task-prioritized</em> MORL problems. 
These are MORL problems where the subtasks are ordered by priority, meaning lexicographic MORL problems can model that some subtask \(r_i\) is more important or <em>of higher priority</em> that some other subtask \(r_j\). 
In this blog post and in our paper, we use the symbol \(\succ\) (“succeeds”) to denote anything involving lexicographic task priorities. For example, \(r_{1\succ2}\) means a lexicographic MORL task where subtask \(r_1\) is of higher priority than subtask \(r_2\).
Formally, a lexicographic MORL problem is given an MDP \(\mathcal{M}_\succ \equiv (\mathcal{S, A}, \mathbf{r}, p, \gamma, o, \varepsilon)\), where \(o = \langle 1, 2, \dots i \dots n \rangle\) specifies the priority-order of subtasks and \(\varepsilon = \langle \varepsilon_1, \varepsilon_2, \dots \varepsilon_i \dots \varepsilon_n \rangle\) are certain threshold variables (more on those later). 
Thus, lexicographic MORL makes for a natural and intuitive way of specifying task complex tasks, by specifying a priority order of simpler subtasks.
Arguably, this is more convenient and more intuitive than carefully designing a scalar-valued reward function that induces some complex behavior.</p>

<p>Solving a lexicographic MORL problem means finding a policy that is optimal for the highest-priority subtask, while for each lower-priority subtask, it is <em>as good as possible</em> while subject to the constraint of <strong>not worsening the performance of any of the higher-priority subtasks</strong>.
More formally, this means that policy search for each subtask \(i\) is constrained to a set \(\Pi_i\), which contains only those policies that are also optimal for each higher priority task \({1, \dots,  i-1}\).</p>

<p>In practice, we allow for some worsening of higher-priority subtask performance \(J\), since otherwise, the set \(\Pi_i\) would only contain the optimal policies for the higher-priority subtasks. 
How much we allow the performance of higher-priority subtasks to worsen is defined by the aforementioned \(\varepsilon_i\) thresholds.
Thus, in a lexicographic MORL problem, policy search for subtask \(i\) is constrained to the set</p>

\[\Pi_{i} = \{ \pi \in \Pi_{i-1} \mid \underset{\pi' \in \Pi_{i-1}}{\max} J_{i-1}(\pi') - J_{i-1}(\pi) \le \varepsilon_{i-1} \}.
\tag{1}\]

<p>Unfortunately, computing the set \(\Pi_i\) and performing policy search is intractable for continuous state-action-space MDPs. 
However, there are other ways for optimizing policies subject to lexicographic task-priority constraints, as we will see in Section <a href="#psqd">Our method: Prioritized Soft Q-Decomposition</a>.
Before we can describe our algorithm, however, we require the following two, additional background sections.</p>

<h2 id="q-decomposition">Q-Decomposition</h2>
<p>A useful technique for solving (lexicographic) MORL problems is <a href="https://cdn.aaai.org/ICML/2003/ICML03-086.pdf">Russell and Zimdar’s Q-Decomposition</a> formulation. 
The paper states that for scalarizable, vector-valued reward functions, the Q-function can be decomposed into \(n\) local Q-functions, each corresponding to one subtask. 
This means that for the \(n\) subtask reward functions in \({r}(\mathbf{s}, \mathbf{a})\) and a policy \(\pi\), we can learn \(n\) Q-functions</p>

\[Q_i^\pi (\mathbf{s}, \mathbf{a}) = r_i(\mathbf{s}, \mathbf{a}) + \gamma Q_i(\mathbf{s}', \pi(\mathbf{s}')), \forall i \in \{1,\dots,n\} \tag{2}\]

<p>and reconstruct the Q-function for the overall, scalarized MORL problem as</p>

\[{Q}^\pi = \sum_{i=1}^n \beta_i Q_i^\pi(\mathbf{s}, \mathbf{a}). \tag{3}\]

<p>This result is useful because it allows us to learn these Q-functions separately and concurrently (with some caveats like the <em>tragedy of the commons</em>, more on this later). 
Furthermore, we can potentially transfer and re-use the constituent Q-functions \(Q_i^\pi(\mathbf{s},  \mathbf{a})\) for different MORL tasks.
Lastly, the decomposed nature of the Q-Decomposition methods benefits the interpretability of the RL agent, since we can inspect the different components that jointly induce the behavior of the agent, which is not the case with classical, non-decomposed agents. 
Due to these desirable attributes of the Q-Decomposition method, we are building on and extending the Q-Decomposition framework in our paper. 
In particular, we extend Q-Decomposition to <em>soft</em> Q-Decomposition via MaxEnt RL and apply it in the context of continuous action-space MDPs with lexicographic subtask priorities. 
Thus, in the next section, we briefly review MaxEnt RL and soft Q-Learning, as final building blocks for our method.</p>

<h2 id="maxent_rl">MaxEnt RL</h2>
<p>Maximum Entropy (MaxEnt) RL, essentially, regularizes policies by punishing policies that are unnecessarily deterministic. This is achieved by adding Shannon’s entropy \(\mathcal{H}(\mathbf{a}_t \mid \mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi(\mathbf{a}_t \mid \mathbf{s}_t)}[-\log \pi(\mathbf{a}_t \mid \mathbf{s}_t)]\) to the reward signal, meaning the optimal MaxEnt policy is given by</p>

\[\pi^*_\text{MaxEnt} = \underset{\pi}{\arg \max} \sum^\infty_{t=1} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t \sim \rho_\pi)}\bigg[\gamma^{t-1} \big( r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \mathcal{H}(\mathbf{a}_t \mid \mathbf{s}_t) \big)\bigg], \tag{4}\]

<p>where \(\rho_\pi\) denotes the state-action marginal induces by the policy \(\pi\) and \(\alpha\) is a coefficient that trades off the reward and the entropy signal. 
The entropy regularization results in the following, energy-based Boltzmann distribution as optimal policy:</p>

\[\pi^*_{\text{MaxEnt}}( \mathbf{a}_t \mid \mathbf{s}_t)
	=
	\exp{
		(
		Q^*_{\text{soft}} (\mathbf{s}_t, \mathbf{a}_t) 
		-
		V^*_{\text{soft}} (\mathbf{s}_t) 
		),
	}
  \tag{5}\]

<p>with the optimal <em>soft</em> value and Q-function given by</p>

\[V^*_\text{soft}(\mathbf{s}_t) 
	= 
	\log
	\int_\mathcal{A}
	\exp
	(
	Q^*_\text{soft}(\mathbf{s}_t, \mathbf{a}^\prime) 
	)
	\,
	d \mathbf{a}^\prime
	,
  \tag{6}\]

<p>and</p>

\[Q^*_\text{soft}(\mathbf{s}_t, \mathbf{a}_t) 
	= 
	r(\mathbf{s}_t, \mathbf{a}_t) 
  +
  \\
	\mathbb{E}_{(\mathbf{s}_{t+1}, \dots) \sim \rho_\pi} 
	\bigg[ 
	\sum_{l=1}^\infty 
	\gamma^l 
	\big(
	r(\mathbf{s}_{t+l}, \mathbf{a}_{t+l}) 
	+ 
	\mathcal{H}(\mathbf{a}_{t+l} \mid \mathbf{s}_{t+l})
	\big) 
	\bigg]
	.
  \tag{7}\]

<p>In Equation (4), the soft Q-function serves as negative energy for the Boltzmann distribution and the soft value function serves as the log partition function, which means that for sampling, we can ignore the value function and directly sample from the unnormalized density</p>

\[\pi^*_{\text{MaxEnt}}( \mathbf{a}_t \mid \mathbf{s}_t)
	\propto
	\exp{
		(
		Q^*_{\text{soft}} (\mathbf{s}_t, \mathbf{a}_t) 
		),
	}
\tag{8}\]

<p>for example with Monte Carlo methods or Importance Sampling.
For the rest of this post, we will drop the <em>soft</em> subscript to avoid visual clutter.
Importantly, Q-Decomposition is still possible in MaxEnt RL, meaning we can also decompose and reconstruct a soft, multi-objective Q-function as described before (we just need a way to sample from the reconstructed, soft Q-function). 
In principle, we can make use of <a href="http://proceedings.mlr.press/v70/haarnoja17a/haarnoja17a.pdf">soft Q-Learning</a> (SQL) to directly obtain \(Q^*_i\), the optimal, soft constituent Q-functions for Q-Decomposition. 
However, constituent Q-functions obtained this way have a flaw, namely, they suffer from the <em>tragedy of the commons</em>, as described by <a href="https://cdn.aaai.org/ICML/2003/ICML03-086.pdf">Russel and Zimdars</a>. 
Since the constituent Q-functions were learned using off-policy SQL, they essentially assume complete control over the MDP, and therefore learn inconsistent, greatly over-estimated Q-values. 
In effect, this means that the constituent Q-function obtained via SQL way will not result in the optimal Q-function for the scalarized MORL problem if summed-up.
In our algorithm, which we finally describe in the next section, we have a neat way of addressing this issue, though.</p>

<h2 id="psqd">Our method: Prioritized Soft Q-Decomposition (PSQD)</h2>
<p>Let’s recap:
We want to make it easier to design reward functions that induce arbitrary, complex behavior, while also improving the sample in-efficiency, un-safety and un-interpretability of DRL algorithms.</p>

<p>The first part takes care of itself as soon as one rejects scalar reward-function engineering in favor of lexicographic constraints, which are much easier and more intuitive to define. 
Instead of manually searching for just the right weighting coefficient for some MORL problem to achieve the desired behavior, lexicographic MORL only requires defining the subtask priority (and some slack scalars).
The framework is insensitive w.r.t subtask reward scale, for reasons that will become clear later.
To improve the sample-inefficiency, un-safety, and un-interpretable nature of DRL algorithms, we propose a novel learning algorithm, PSQD, for continuous action-space lexicographic MORL problems that yields interpretable, transferable agents and components.</p>

<p>Recall that solving lexicographic MORL problems, which are defined by lexicographic MDPS \(\mathcal{M}_\succ\), essentially corresponds to policy search in \(\Pi_i\), the set of lexicographically optimal policies.
Recall also that computing the set \(\Pi_i\) is intractable.
Thus, we instead make a local-and state-based version of the lexicographic constraint from Equation (1):</p>

\[\max_{\mathbf{a}^\prime \in \mathcal{A}} Q_i(\mathbf{s}, \mathbf{a}^\prime)
	-
	Q_i(\mathbf{s}, \mathbf{a})
	\leq \varepsilon_i,
  \\
	\forall  \mathbf{a} \sim \pi_\succ,
  \\
	\forall \mathbf{s} \in \mathcal{S},
  \\
	\forall i \in \{1, \dots, n - 1\}.
  \tag{9}\]

<p>In this form of the lexicographic constraint, action selection, in every state, is constrained to actions whose Q-values are as good as that of the optimal action, minus the threshold \(\varepsilon_i\), for all \(i-1\) higher priority tasks.
This makes for a binary mask over the action space, where some subset of actions \(\mathcal{A}_{\succ i}\) is allowed when optimizing task \(r_i\), since 
it satisfies the above constraint, while the remaining actions are forbidden.
In our paper, we refer to this subset as the <em>indifference-space</em> of task \(i\), since with respect to the constraint, the task is <em>indifferent</em> as to which of the near-optimal actions in \(\mathcal{A}_{\succ i}\) is executed.
As a concrete example, consider the following images. 
We have an obstacle-avoidance and goal navigation environment (first image), with a point-mass agent whose 2D action-space corresponds to increments in the \(xy\)-plane.
We now train the agent on the first task, i.e. the highest priority task, which here corresponds to avoiding the obstacle. 
The learned Q-function is shown in the center image. 
Now, based on the learned Q-function, we can visualize the lexicographic constraint and the indifference space (last image), with the agent placed at the position indicated by the red dot. 
As can be seen, lexicographic constraint forbids all those actions that would lead to a collision.
The remaining, permitted actions can be used for optimizing lower-priority tasks, like navigating to the top goal area.</p>

<figure class="third ">
  
    
      <a href="/assets/img/psqd/transferEnv_onlyTop.png">
        <img src="/assets/img/psqd/transferEnv_onlyTop.png" alt="2D navigation environment" />
      </a>
    
  
    
      <a href="/assets/img/psqd/q0_prefectCrop_dot.png">
        <img src="/assets/img/psqd/q0_prefectCrop_dot.png" alt="Learned Q-function" />
      </a>
    
  
    
      <a href="/assets/img/psqd/indifference_space.png">
        <img src="/assets/img/psqd/indifference_space.png" alt="Indifference space based on Q-function" />
      </a>
    
  
  
    <figcaption>2D navigation example. The agent learns a Q-function for avoiding the obstacle, from which we infer the local indifference space.
</figcaption>
  
</figure>

<p>By relying on the local form of the lexicographic constraint in (9) and the resulting indifference space, we eliminated the need for computing the intractable set \(\Pi_i\) for policy search. 
This is because instead of computing \(\Pi_i\), the action indifference \(\mathcal{A}_{\succ i}\) gives rise to a new MDP \(M_{\succ i}\), which uses the scalar reward \(r_i\) and whose action space no longer corresponds to \(\mathcal{A}\), but to the indifference space \(\mathcal{A}_{\succ i}\).
In this new MDP \(\mathcal{M}_{\succ i}\) we can perform <strong>unconstrained</strong> policy search to optimize task \(r_i\), since the lexicographic constraint is moved into the action space and thereby always satisfied by construction.
This is the short version, the (very cool and intuitive) mathematical derivation and justification of this approach can be found in Section A of the supplementary material of <a href="https://arxiv.org/pdf/2310.02360.pdf">our paper</a>.</p>

<p>Based on this insight, we propose our learning algorithm, Prioritized Soft Q-Decomposition (PSQD), for continuous action-space lexicographic MORL tasks. 
PSQD combines Q-Decomposition with Soft Q-Learning by first pre-training on all subtasks \(r_1, \dots, r_n\) of the lexicographic MORL problem.
This way, we obtain \(n\) soft Q-functions \(Q_1^*, \dots, Q_n^*\) which we can then use to zero-shot the Q-function \(Q_\succ\) of the overall, lexicographic MORL problem.
The agent using the zero-shot Q-function \(Q_\succ\) respects the lexicographic constraints, however, it does not behave optimally w.r.t the overall, lexicographic MDP \(\mathcal{M}_\succ\), since the constituent Q-functions \(Q_1^*, \dots, Q_n^*\) were pre-trained separately and suffer from the aforementioned <em>illusion of control</em>. 
The effect of this can be observed in the following, intermediate result:</p>

<figure class="third ">
  
    
      <a href="/assets/img/psqd/q0.png">
        <img src="/assets/img/psqd/q0.png" alt="The pre-trained obstacle avoidance Q-function" />
      </a>
    
  
    
      <a href="/assets/img/psqd/q1_top.png">
        <img src="/assets/img/psqd/q1_top.png" alt="The pre-trained goal reach Q-function" />
      </a>
    
  
    
      <a href="/assets/img/psqd/Zeroshot_coloredBackground_legend.png">
        <img src="/assets/img/psqd/Zeroshot_coloredBackground_legend.png" alt="Trajectories corresponding to the zero-shot agent." />
      </a>
    
  
  
    <figcaption>Zeroshot experiment. The agent respects the lexicographic constraint and avoids the obstacle, but greedily navigates to the top, getting stuck inside the obstacle.
</figcaption>
  
</figure>

<p>Here, the first image again shows the obstacle avoidance Q-function \(Q_0^*\). 
The second image corresponds the the top goal reaching Q-function \(Q_1^*\). 
In the last image, we visualize the policy (colored background) and rollouts from the zeroshot agent, obtained as \(Q_{1\succ2} = Q_1^* + Q_2^*\) via Q-Decomposition. As can be seen, the zeroshot agent respects the lexicographic constraint and avoids colliding with the obstacle, but greedily navigates to the top and therefore gets stuck inside the obstacle.</p>

<p>Notice that this result is expected and positive. 
Due to the lexicographic constraint, we <em>know</em> that the agent can not collide with the obstacle. 
In our paper, we show that PSQD assigns zero likelihood to actions that violate the lexicographic constraint.
This is in stark contrast to standard MORL algorithms that rely on scalarization, meaning that the learned agent’s behavior is largely dictated by reward scale, with zero guarantees on resulting behavior.</p>

<p>However, we are of course interested in obtaining the optimal solution to the overall, lexicographic MORL problem.
That’s why PSQD uses the zeroshot composition merely as a <em>starting point</em> and subsequently continues improving performance by finetuning the constituent Q-functions, thereby learning the optimal solution to the overall, lexicographic MORL problem. 
Concretely, we finetune the constituent Q-functions by iteratively performing soft Q-Learnign in the transformed MDP \(\mathcal{M}_{\succ i}\).
Since the highest priority task is not affected by the lexicographic constraint (there are no higher-priority tasks that constrain it), in the first iteration, PSQD finetunes the task with the second highest priority by performing SQL in \(\mathcal{M}_{\succ i}\).
This means updating the \(Q_2^*\) to \(Q_{\succ 2}^*\), which is no longer optimal for \(r_2\) but solves \(r_2\) as best as possible while respecting the lexicographic constraint.
This involves the following backup operator</p>

\[\mathcal{T}Q(\mathbf{s}, \mathbf{a}) \triangleq r(\mathbf{s}, \mathbf{a}) 
	+ 
	\gamma \mathbb{E}_{\mathbf{s}^\prime \sim p} 
	\bigg[
	\underbrace{\log \int_{\mathcal{A}_\succ (\mathbf{s}^\prime)} \exp \big( Q(\mathbf{s}^\prime, \mathbf{a}^\prime)\big) d \mathbf{a}^\prime}_{V(\mathbf{s}^\prime)}
	\bigg],
	\tag{10}\]

<p>where the <em>log-sum-exp</em> expression is not over the entire action space but over the indifference space since we are in the lexicographic MPD \(\mathcal{A}_{\succ i}\).
This backup operator is approximated with the following stochastic optimization</p>

\[J_Q(\theta) = 
	\mathbb{E}_{\mathbf{s}_t, \mathbf{a}_t \sim \mathcal{D}} 
	\Bigg[ 
	\frac{1}{2} 
	\Big ( 
	Q^\theta_{n}(\mathbf{s}_t, \mathbf{a}_t) 
	- r_n(\mathbf{s}_t, \mathbf{a}_t) 
	\\
	+ \gamma
	\mathbb{E}_{\mathbf{s}_{t+1} \sim p} 
	\big[ 
	V_n^{\bar{\theta}}(\mathbf{s}_{t+1})
	\big] 
	\Big)^2 
	\Bigg ],
	\tag{11}\]

<p>which is the well-known SQL update, with \(V_n^\bar{\theta}\) being the empirical approximation of equation (6) with a target network, parameterized by \(\bar{\theta}\), in \(\mathcal{A}_{\succ i}\). 
More details about our learning algorithm can of course be found <a href="https://arxiv.org/pdf/2310.02360.pdf">in our paper</a>.</p>

<p>Applying this procedure once to finetune the Q-function of the goal navigation task, we obtain the following result:</p>

<figure class="third ">
  
    
      <a href="/assets/img/psqd/q1_top.png">
        <img src="/assets/img/psqd/q1_top.png" alt="The pre-trained goal reach Q-function" />
      </a>
    
  
    
      <a href="/assets/img/psqd/q1_top_adapted_obst.png">
        <img src="/assets/img/psqd/q1_top_adapted_obst.png" alt="The finetuned goal reach Q-function" />
      </a>
    
  
    
      <a href="/assets/img/psqd/Online_coloredBackground_crop_legend.png">
        <img src="/assets/img/psqd/Online_coloredBackground_crop_legend.png" alt="Trajectories corresponding to the finetuned agent." />
      </a>
    
  
  
    <figcaption>Finetuning experiment. The agent has learned how to solve the lexicographic MORL task optimally, navigating out of and around the obstacle.
</figcaption>
  
</figure>
<p>Here, the greedy, pre-trained constituent Q-function \(Q_2^*\) in the first image is finetuned as described above, to the optimal constituent Q-function \(Q_{\succ 2}^*\) shown in the second image.
Using the finetune Q-function for the second task, we again apply Q-Decomposition to obtain the <strong>optimal</strong> Q-function for the lexicographic MORL problem as \(Q_\succ^* = Q_1^* + Q_{\succ 2}^*\). 
The policy and rollouts from this agent are shown in the last image.
As can be seen, the agent has learned to avoid the obstacle and to drive out and around it, to reach the top goal area.</p>

<p>The following image shows a visual summary of our learning algorithm, PSQD:</p>
<figure>
<img src="/assets/img/psqd/algorithm.png" />
<figcaption>PSQD, a visual overview.</figcaption>
</figure>

<p>This method trivially extends to more than two tasks.
We can add a third subtask \(r_3\), corresponding, for example, to reaching the right-hand side of the environment. 
With these three subtasks, \(r_1\) for obstacle avoidance, \(r_2\) for reaching the top part of the environment, and \(r_3\) for reaching the right side of the environment, we can make multiple lexicographic MORL problems, by defining different priority orderings.
For example, we can keep the highest-priority subtask (obstacle avoidance) fixed but vary the priority of the top- and side-reach subtasks.
That is, we can either have the lexicographic MORL problem \(r_{1\succ 2\succ 3}\), where the top-reach subtask \(r_2\) has higher priority than the side-reach subtask \(r_3\), or we can make the lexicographic MORL problem \(r_{1 \succ 3 \succ 2}\), where reaching the side has higher priority than reaching the top.</p>

<p>In either case, we can pre-train on all subtasks separately, and transfer the resulting constituent Q-functions via Q-Decomposition to the lexicographic MORL tasks, where they are subsequently finetuned using PSQD.
This results in the following, differing behaviors:</p>
<figure>
<img src="/assets/img/psqd/three_tasks.png" />
<figcaption>Differing behaviors depending on lexicographic task priority order. (Lazy screenshot from the paper, apologies for the poor quality).</figcaption>
</figure>
<p>As can be seen, depending on the lexicographic task priority order, the resulting agent either first moves to the top, then to the side in image (a), or first to the side, then to the top, in image (b). 
The bottom row images visualize the indifference spaces of the constituent Q-functions.
Both tasks share the obstacle avoidance component \(\mathcal{\bar{A}}_{\succ 1}\), but have different constraints for the additional task that is varied between the two conditions, \(\mathcal{\bar{A}}_{\succ 2}\) in (c) and \(\mathcal{\bar{A}}_{\succ 3}\) in (d).
This aims to illustrate how lexicographic constraints can easily and intuitively be used to induce different, complex behaviors.</p>

<p>Lastly, to demonstrate the efficacy of our method in high-dimensional settings, or, a bit more colloquially, to show that our method scales, we perform a simulated joint-control experiment.
Here, the action space is in \(\mathbb{R}^9\) and corresponds to the joint torques of a Franke Emika Panda Robot. 
The higher-priority task, \(r_1\), corresponds to avoiding a certain subspace of the workspace (red area), while the lower-priority task, \(r_2\),  corresponds to reaching a certain end-effector position (green sphere).
First, consider a standard MORL algorithm that relies on linear scalarization of the vector-valued reward function.
The agent ignores the red area and greedily moves toward the target end-effector position:</p>
<video width="100%" controls="">
	<source src="/assets/img/psqd/franka_reach_unconstrained.mp4" type="video/mp4" />
	Your browser does not support the video tag.
</video>
<p>This can happen due to poor reward scale or poorly chosen scalarization weights.</p>

<p>Let’s contrast this with our method. 
We simply define the lexicographic task priority \(o = \langle r_1 \succ r_2\rangle\), set some low threshold, e.g. \(\varepsilon_1 = 1\), and obtain the following (zer-oshot) result:</p>
<video width="100%" controls="">
	<source src="/assets/img/psqd/franka_reach_zeroshot.mp4" type="video/mp4" />
	Your browser does not support the video tag.
</video>
<p>Here, after separately learning the subtask, even in the zero-shot setting, the agent does not enter the forbidden part of the workspace. 
To obtain the optimal agent for the lexicographic MORL task, i.e. reaching the target end-effect position while avoiding the forbidden part of the workspace, we perform our finetuning/adaptation step to learn the long-term consequences of the lexicographic constraint.
This results in the desired behavior and verifies that our method is also applicable to MDPs with high-dimensional action spaces:</p>
<video width="100%" controls="">
	<source src="/assets/img/psqd/franka_reach_adapted.mp4" type="video/mp4" />
	Your browser does not support the video tag.
</video>
<p><br /></p>

<p>There are some additional, nice properties of our method that I have only mentioned briefly or skipped entirely in this blog post.
Firstly, I want to mention how our method benefits sample-efficiency.
PSQD transfers knowledge between from simple subtasks to complex, lexicographic MORL problems.
Thus, we are not learning the complex, lexicographic MORL problem from scratch, rather, we transfer the pre-trained subtask solutions and perform a simple finetuning step to obtain the optimal solution to the lexicographic MORL tasks.
In a nutshell, this implies that we only need to learn once, for example, how to avoid obstacles, we can then re-use this learned behavior every time we want to exploit it as part of a lexicographic MORL problem.
Furthermore, each lexicographic task-priority constraints in effect ``shrinks’’ the action/search space of the RL algorithm, which makes it easier to explore the MDP and to discover the optimal solution.
Secondly, PSQD respects lexicographic priority constraints even during training and thereby makes for a safe exploration framework.
This is again in stark contrast to standard MORL approaches that rely on scalarization and learn each problem from scratch.
Lastly, PSQD benefits interpretability of the final agent, since we can inspect the constituent Q-functions and corresponding indifferent spaces to understand the agent’s action selection process.</p>

<h2 id="summary">Summary and conclusion</h2>
<p>And that’s it. 
This blog post presents a short summary of our recent work, <a href="https://arxiv.org/pdf/2310.02360.pdf">Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning</a>, which ~is currently under review~ has been accepted at ICLR 2024. 
The take-away points are as follows:</p>
<ul>
  <li>Reject scalar reward engineering, and embrace lexicographic task-priority constraints. Lexicographic constraints are much easier to define and have additional benefits, compared to cumbersome, manual tuning of reward scale coefficients.</li>
  <li>Our algorithm, PSQD, solves continuous action-space, lexicographic MORL problems, while explointing the Q-Decomposition method to transfer knowledge from simple subtasks to complex, lexicographic MORL problems.</li>
</ul>

<p>We are currently working on the successor paper, where we replace soft Q-Learning with a more stable DRL algorithm.</p>

<p>Cheers,<br />
<em>Finn</em>.
<br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Machine Learning" /><category term="Python" /><summary type="html"><![CDATA[Motivation and introduction Deep Reinforcement Learning (RL) Doesn’t Work Yet. That’s what Alexander Irpan wrote in his famous blog post back in 2018. Sadly, 5 years later, RL researcher and practitioners are still struggling with the same challenges that Alexander Irpan wrote about 5 years ago. Although we had some exciting advances and innovations, e.g. the Dreamer algorithm series, Decision Transformer, and SayCan, I believe most practitioners would agree with the following statements, which, in one way or another, have also been made by Alexander Irpan: DRL algorithms, that learn to solve each task from scratch, are still very sample inefficient, often requiring millions of transitions before achieving acceptable levels of performance. Designing scalar-valued reward functions for complex tasks that induce the desired behavior is difficult, with few general heuristics available. The inherent unsafe exploration of trial-and-error-based learning algorithms and the blackbox-natures of the resulting DNN-based agents hinders the more widespread employment of DRL in real-world applications.]]></summary></entry><entry><title type="html">Easy Language Learning with ChatGPT and Python</title><link href="http://www.finnrietz.dev/machine%20learning/chatgpt-python-copy/" rel="alternate" type="text/html" title="Easy Language Learning with ChatGPT and Python" /><published>2023-09-10T00:00:00+00:00</published><updated>2023-09-10T12:46:12+00:00</updated><id>http://www.finnrietz.dev/machine%20learning/chatgpt-python%20copy</id><content type="html" xml:base="http://www.finnrietz.dev/machine%20learning/chatgpt-python-copy/"><![CDATA[<h2 id="motivation">Motivation and introduction</h2>
<p>ChatGPT is everywhere, has disruptive potential, and can do many great things, given the right instructions (prompts). 
ChatGPT or other GPT models tend to perform well when the conversation is centered around a topic that is abundantly covered in their training data, just like most ML algorithms perform well when the test data is <i>similar enough</i> to the training data.
As such, it is no surprise that ChatGPT can generate code for almost any programming language, with many blogs and documentation being publicly available. 
Although ChatGPT might struggle with generating correct code for sophisticated and complex programs, it knows about data structures and basic I/O stuff, so enough for small scripts.
This makes it very easy to combine ChatGPT’s generative capabilities in virtually any topic with simple programs, which can be useful for many small projects.</p>

<p>To give one example of such a project, in the following, I show how to generate a deck of flashcards with the most useful, basic words in any language using ChatGPT and a few lines of Python code.
The technical aspect of this is relatively trivial, however, I want to provide easy-to-replicate steps even for people that have no background in computer science.
<br /></p>

<h2 id="chatgpt_language_tutor">ChatGPT as a language tutor</h2>
<p>Since I am doing my Ph.D. in Sweden, without being a native speaker of any Scandinavian  language, I am always looking for opportunities to practice Swedish. 
Since ChatGPT is capable of communicating in almost every language, it can also generate lists of vocabularies in any language. 
For example, one can ask it to generate the 100 most useful words for a particular language, as well as the translation of those words into any other language. 
One can further ask it to also provide an example sentence for each word, as well as the translation of the example sentence. 
There are almost no limits here, as long as we provide the right prompt, ChatGPT will provide the corresponding result (which is sometimes better, sometimes worse).
<br /></p>

<p>For my particular use case with Swedish-English translations, I found ChatGPT to provide okay to decent translations.
One should of course take everything that a GPT model generates with a grain of salt, but given my rudimental understanding of Swedish, I deemed the output of sufficient quality to use for some additional studying.
Give a vocabulary containing whatever we want to learn, the next step is to get ChatGPT to provide it in a data structure that is convenient to work with programmatically.
In my case, I wanted to have the data stored in a list of Python dictionaries.
Given clear instructions for how the list of dictionaries should be populated, ChatGPT will generate data that we can manipulate and make use of programmatically.</p>

<p><a href="https://chat.openai.com/share/b5e1acf1-4233-4b35-a7f6-e02b9c499863">Here is a short demonstration of the ChatGPT conversation</a> that I used to get a vocabulary of Swedish words and example sentences, as well as their English translation, stored in a list of Python dictionaries.</p>

<h2 id="python_flashcards">Creating flashcards with Python</h2>
<p>Given that we now have a list of Python dictionaries containing things we want to memorize, we can use the great, open-source tool <a href="https://apps.ankiweb.net/">Anki</a> and the corresponding <a href="https://github.com/kerrickstaley/genanki">Python API</a> to automatically create flash cards.
Anki is amazing, it even has  a mobile app that can synchronize sets of flashcards across multiple devices.
I have used it extensively throughout my university studies and can not recommend it enough.</p>

<p><a href="https://github.com/kerrickstaley/genanki">Genanki</a>, at the same time, makes it trivial to generate Anki decks using Python. 
Thus, all that is left to do is load the ChatGPT vocabulary into python and create the flashcards using genanki.
The GitHub repository provides a great example of how to do this, however, for completeness, here is the full script that I used:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">genanki</span>
<span class="kn">import</span> <span class="nn">random</span>

<span class="n">word_list</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">'swe_word'</span><span class="p">:</span> <span class="s">'att'</span><span class="p">,</span>
        <span class="s">'eng_word'</span><span class="p">:</span> <span class="s">'to'</span><span class="p">,</span>
        <span class="s">'swe_sentence'</span><span class="p">:</span> <span class="s">'Jag vill att du ska göra det.'</span><span class="p">,</span>
        <span class="s">'eng_sentence'</span><span class="p">:</span> <span class="s">'I want you to do it.'</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="c1"># replace this with your own data...
</span><span class="p">]</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">random_model_id</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">31</span><span class="p">)</span>
    <span class="n">random_deck_id</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">31</span><span class="p">)</span>

    <span class="c1"># genanki.Model defines the template for each flash card in our deck
</span>    <span class="n">my_model</span> <span class="o">=</span> <span class="n">genanki</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span>
        <span class="n">random_model_id</span><span class="p">,</span>  <span class="c1"># needs to be unique
</span>        <span class="s">'SWE &amp; ENG Language learning model'</span><span class="p">,</span>
        <span class="n">fields</span><span class="o">=</span><span class="p">[</span>
          <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'swe_word'</span><span class="p">},</span> 
          <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'eng_word'</span><span class="p">},</span> 
          <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'swe_sentence'</span><span class="p">},</span> 
          <span class="p">{</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'eng_sentence'</span><span class="p">},</span>
        <span class="p">],</span>
        <span class="n">templates</span><span class="o">=</span><span class="p">[</span>
          <span class="p">{</span>
            <span class="s">'name'</span><span class="p">:</span> <span class="s">'Card 1'</span><span class="p">,</span>
            <span class="s">'qfmt'</span><span class="p">:</span> <span class="s">'What is the meaning of "&lt;i&gt;&lt;/i&gt;"?&lt;br&gt;&lt;br&gt;Example: &lt;i&gt;&lt;/i&gt;'</span><span class="p">,</span>
            <span class="s">'afmt'</span><span class="p">:</span> <span class="s">'&lt;hr id="answer"&gt;'</span> <span class="o">+</span> \
                    <span class="s">'The meaning of &lt;i&gt;""&lt;/i&gt; is &lt;i&gt;""&lt;/i&gt;.'</span> <span class="o">+</span> \
                    <span class="s">'&lt;br&gt;&lt;br&gt;Example: &lt;i&gt;&lt;/i&gt;'</span><span class="p">,</span>
            <span class="p">},</span>
        <span class="p">])</span>

    <span class="n">my_deck</span> <span class="o">=</span> <span class="n">genanki</span><span class="p">.</span><span class="n">Deck</span><span class="p">(</span>
        <span class="n">random_model_id</span><span class="p">,</span>  <span class="c1"># needs to be unique
</span>        <span class="s">'Swedish &amp; English Language learning'</span>
    <span class="p">)</span>

    <span class="k">for</span> <span class="n">card</span> <span class="ow">in</span> <span class="n">word_list</span><span class="p">:</span>
        <span class="n">my_note</span> <span class="o">=</span> <span class="n">genanki</span><span class="p">.</span><span class="n">Note</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="n">my_model</span><span class="p">,</span>
            <span class="n">fields</span><span class="o">=</span><span class="p">[</span><span class="n">card</span><span class="p">[</span><span class="s">'swe_word'</span><span class="p">],</span> <span class="n">card</span><span class="p">[</span><span class="s">'eng_word'</span><span class="p">],</span> <span class="n">card</span><span class="p">[</span><span class="s">'swe_sentence'</span><span class="p">],</span> <span class="n">card</span><span class="p">[</span><span class="s">'eng_sentence'</span><span class="p">]])</span>

        <span class="n">my_deck</span><span class="p">.</span><span class="n">add_note</span><span class="p">(</span><span class="n">my_note</span><span class="p">)</span>

    <span class="n">genanki</span><span class="p">.</span><span class="n">Package</span><span class="p">(</span><span class="n">my_deck</span><span class="p">).</span><span class="n">write_to_file</span><span class="p">(</span><span class="s">'swe_eng.apkg'</span><span class="p">)</span>

</code></pre></div></div>
<p>As can be seen, the script simply iterates over the vocabulary, creates a flashcard for each dictionary in the list, and adds the card to the deck.
The deck is then saved as a file and can be imported into Anki.</p>

<p>This setup relies on manually copying the ChatGPT data into a Python file, which is a minor annoyance for perfectionistic programmers… 
Of course, this could be avoided by making use of ChatGPT’s API, but unfortunately, I haven’t gotten access yet, even after paying for ChatGPT premium, solely to gain API access. Oh well ¯\<em>(ツ)</em>/¯…</p>

<p>The resulting deck, diplayed in the AnkiDroid app, looks as follows:</p>
<figure>
  <img src="/assets/img/genanki/anki.jpg" />
  <figcaption>Example flashcard with ChatGPT generated content displayed in the AnkiDroid app. The deck is created with the above give code and synchronized to the mobile device via AnkiWeb.</figcaption>
</figure>

<h2 id="summary">Summary and conclusion</h2>
<p>So, in summary, ChatGPT can be used to create vocabularies with translation between any language. 
These vocabularies can have arbitrary auxiliary  information, like example sentences or different word forms. 
ChatGPT can then be asked to output these vocabularies in convenient data structures, like Python dictionaries.</p>

<p>The open-source tool Anki for flashcard learning can be used to study these vocabularies. 
The easy-to-use Python package genanki makes it trivial to generate anki decks programmatically using Python, for example using the above-given script.</p>

<p>And that’s it, I hope you found this post interesting and perhaps useful for your own studies :)</p>

<p>Cheers,<br />
<em>Finn</em>.
<br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Machine Learning" /><category term="Python" /><summary type="html"><![CDATA[Motivation and introduction ChatGPT is everywhere, has disruptive potential, and can do many great things, given the right instructions (prompts). ChatGPT or other GPT models tend to perform well when the conversation is centered around a topic that is abundantly covered in their training data, just like most ML algorithms perform well when the test data is similar enough to the training data. As such, it is no surprise that ChatGPT can generate code for almost any programming language, with many blogs and documentation being publicly available. Although ChatGPT might struggle with generating correct code for sophisticated and complex programs, it knows about data structures and basic I/O stuff, so enough for small scripts. This makes it very easy to combine ChatGPT’s generative capabilities in virtually any topic with simple programs, which can be useful for many small projects.]]></summary></entry><entry><title type="html">Fun with Cellular Automata</title><link href="http://www.finnrietz.dev/python/cellular_automaton/" rel="alternate" type="text/html" title="Fun with Cellular Automata" /><published>2023-08-15T00:00:00+00:00</published><updated>2023-08-15T00:00:00+00:00</updated><id>http://www.finnrietz.dev/python/cellular_automaton</id><content type="html" xml:base="http://www.finnrietz.dev/python/cellular_automaton/"><![CDATA[<p><br /></p>

<p>This page provides some information about the animation playing in the header of my website’s landing page.</p>

<h2 id="what-are-cellular-automata">What are Cellular Automata?</h2>
<p>The animation that is playing on this website’s landing page is a “Rock-Paper-Scissors-Lizard-Spock” Cellular Automaton. What is a cellular automaton? Cellular automata are simple dynamical systems in which complex behaviour tends to emerge
based on very simple update rules. Cellular automata typically employ a grid-like state and update each cell according to a neighbourhood function that indexes the grid and updates the current cell according to a specific update rule, operating on the values indexed by the neighborhood function. The neat thing about cellular automata is that we can obtain very interesting and different dynamic systems based on very simple update rules. We only need to implement a base class for the CA that holds the state, while each inheriting CA simply implements the update rule for the desired result.</p>

<h2 id="examples">Examples</h2>
<p>The following animations are examples of cellular automata. The examples differ only in their update rules, which are just a few lines of code (additionally, I used different colour palettes for the different CAs, but this is a purely visual change). The animations are obtained by repeatedly applying the update rule to the initial state, which reveals how the state evolves over time.</p>

<h3 id="rock-paper-sissors">Rock-Paper-Sissors</h3>
<p>In this CA, each cell is updated by playing rock-paper-scissors against its neighbours. If the current type is e.g. rock, the cell changes to paper if there are more than <code class="language-plaintext highlighter-rouge">n</code> paper cells in its neighbourhood. Otherwise, it stays the same. The “Rock-Paper-Scissor-Lizard-Sprock” cellular
automaton extends this idea and simply introduces two additional elements, as described <a href="https://softologyblog.wordpress.com/2018/03/23/rock-paper-scissors-cellular-automata/">here</a>, but the principle update rule remains the same. We get the following result, each colour corresponds to one of the five types:
<br /></p>
<div style="text-align: center;">
<img class="align-center" src="/assets/ca/rpsls.gif" />
<figcaption>Rock-Paper-Scissor-Lizard-Spock CA.</figcaption>
</div>
<p><br /></p>

<h3 id="conways-game-of-life">Conway’s Game of Life</h3>
<p><a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">Conway’s Game of Life</a> is probably the most popular CA, in which cells can either be dead or alive, while the density of living cells determines whether cells stay alive, die, or spawn, of course depending only on the local neighbourhood. These dynamics roughly model overpopulation, underpopulation and reproduction. Conway’s Game of Life is also well-known for various <em>patterns</em> that behave in certain ways under the game-of-life update rules. Examples include the glider (travels diagonally over the canvas), oscillators (oscillate between states), or canons (produce projectiles that travel over the canvas). Here’s an example of Game of Life, again based on only proving the update rule for our base CA class:
<br /></p>
<div style="text-align: center;">
<img class="align-center" src="/assets/ca/gol.gif" />
<figcaption>Game of life with three gliders, one canon, and oscillators in each corner.</figcaption>
</div>
<p><br /></p>

<h3 id="forest-fire-simulation">Forest Fire Simulation</h3>
<p>As the last example, CAs can also be used as a very simplistic simulation of fire spreading through a forest. The updates rule simply states that cells with neighbour cells on fire also catch fire, while empty cells have a slight chance of regrowing. Lightning strikes have a very low chance to occur and are the initial causes of fires. Based on these dynamics, we get the following result:
<br /></p>
<div style="text-align: center;">
<img class="align-center" src="/assets/ca/fire.gif" />
<figcaption>Forest fire simulation.</figcaption>
</div>
<p><br /></p>

<p>I doubt that a CA model like this would actually be used in the real world to simulate the spread of forest fires, as it is just too simplistic. However, CAs find some use in video games as particle physics models, as explained in <a href="https://youtu.be/VLZjd_Y1gJ8">this cool video</a>. Additionally, can be used for pseudo-random number generation (see <a href="https://en.wikipedia.org/wiki/Rule_30">here</a>). Independently of their real-world usefulness, CAs are fun to play around with for an afternoon and can generate cool animations, as the above examples illustrate :)</p>

<h3 id="code">Code</h3>
<p>My code for the above animations is available <a href="https://gist.github.com/frietz58/239c84e2c24513138ed7dd5ae17a15cb">here</a>. As mentioned, it can easily be modified to implement other types of CAs.</p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Python" /><category term="Matplotlib" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Fun with Graph Neural Networks</title><link href="http://www.finnrietz.dev/machine%20learning/gnns/" rel="alternate" type="text/html" title="Fun with Graph Neural Networks" /><published>2023-07-02T00:00:00+00:00</published><updated>2023-07-23T13:48:11+00:00</updated><id>http://www.finnrietz.dev/machine%20learning/gnns</id><content type="html" xml:base="http://www.finnrietz.dev/machine%20learning/gnns/"><![CDATA[<h2 id="motivation">Motivation and introduction</h2>
<p>Recently, I became interested in <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4700287&amp;casa_token=uwI8LLQT1oUAAAAA:55yIFAS86OhKqkbcTYsG4PXcBsKfEk1UJJxCy-Mpg0v-eM7mYAf8iXaIh6U30FsLYc5BGuz4jhE&amp;tag=1">Graph Neural Networks (GNNs)</a> because they are a trending topic in the Machine Learning / Deep Learning community. Thus, in this blog post, I briefly introduce and motivate GNNs based on my novice-level understanding of them. Then, I share some results where I analyze and compare GNNs with classical Multilayer Perceptrons (MLPs) on a simple graph classification task.
<br /></p>

<h2 id="gnns">Graph Neural Networks: What, why, and how?</h2>
<p>So what are GNNs, why do we need them and how do they fundamentally work? GNNs are, broadly and informally speaking, Neural Networks (NNs) that can process and exploit non-Euclidean data structures, like graphs.<br />
Classical NNs can not capture and exploit the complex structure of non-Euclidean data because they assume fixed-dimensional, grid-like (aka Euclidean) data. MLPs, for example, assume that their input features come from arbitrary-but-fixed-dimensional, continuous, real-valued (aka Euclidean) vector spaces. 
<em>Euclidean</em> vector spaces generalize Euclidean geometry (parallel lines never intersect, laws of trigonometry apply) from 2D or 3D to higher-dimensional spaces. 
Convolutional Neural Networks (CNNs) exploit the Euclidean nature of their input data even more explicitly since they learn local filters that are applied over the entire input space in a sliding-window manner, thereby assuming Euclidean structure. 
To intuitively see that classical CNNs can not work on non-Euclidean data, consider the following picture:</p>
<p style="text-align: center;">
<img src="/assets/img/gnns/grid_to_graph.png" style="height: 200px;" />
</p>

<p>It is clear how to apply 2x2 filters at coordinates (0, 0) and (1, 1) for Euclidean (e.g. image, left) data, but unclear how to do the same for non-Euclidean (e.g. graph, right) data.
And this is, essentially, the motivation behind GNNs. 
Many interesting problems feature non-euclidean data (e.g. graph/node classification, graph/node regression, social network analysis, molecular chemistry, anomaly detection) and we would like to have Neural Networks that can exploit and reason about data from such domains.</p>

<p>With the motivation for GNNs out of the way, let’s see how GNNs work. 
Note that there are multiple ways of implementing GNNs and that I will only summarize <a href="https://arxiv.org/pdf/1609.02907.pdf">Kipf and Willing</a>’s method in the following, which introduces the Graph Convolution Network (GCN), a particular type of GNN. 
As the name implies, with the GCN, Kipf and Willing essentially generalize CNNs non-Euclidean data, addressing the issue demonstrated in the image above.
At a very high level and as we shall see, this is done by replacing the <strong>fixed</strong> neighborhood indexing of convolutional filters with a <strong>dynamic</strong> summation over the neighborhood of vertices in a graph. 
The GCN operates on a graph \(\mathcal{G = (V, E)}\), where \(\mathcal{V}\) are the graph’s vertices and \(\mathcal{E}\) are the graph’s edges. 
Each vertex \(\mathcal{V}_i\) in the graph might be described by some \(n\)-dimensional feature vector. 
Furthermore, \(A\) refers to the adjacency matrix of graph \(\mathcal{G}\), which is a different way of describing the edges in a graph. Let’s skip the theoretical derivation of the GCN (see <a href="https://arxiv.org/pdf/1609.02907.pdf">the paper</a> for details) and jump right to the model definition. Kipf and Willing provide the layer-wise definition of the GCN (in matrix notation) as follows:</p>

\[H^{(l+1)} = \sigma \Big( \underbrace{D^{-\frac{1}{2}} \hat{A} D^{-\frac{1}{2}}}_{\text{normalized } \hat{A}} H^{(l)} W^{(l)} \Big), \tag{1}\]

<p>where \(\sigma\) is a non-linear, differential activation function (e.g. ReLU, GeLU, TanH), \(D_{ii} = \sum_j \hat{A}_{ij}\) (multiplying by \(D\) essentially normalizes \(\hat{A}\)), \(\hat{A} = A + I\) (adding the identity matrix \(I\) to \(A\) is a trick that adds self connections to each vertex in the graph which is beneficial for representation learning), \(H^{(l)}\) is the hidden activation of the previous layer \(l\), and \(W^{(l)}\) is the learnable weight matrix of layer \(l\). `
For input layer \(H^{(l=0)}\) of the GCN we simply have \(H^{(l=0)} = X\), aka the \(n \times d\) -dimensional feature matrix of the Graph \(\mathcal{G}\), where \(d\) corresponds to the number of vertices in the graph and \(n\) to the aforementioned features of vertices. 
While the above formula specifies the GCN entirely, I find it hard to understand how the GCN works just based on this. 
So far, this looks almost like the layer of a plain MLP, with the addition of the \(D^{-\frac{1}{2}} \hat{A} D^{-\frac{1}{2}}\) (essentially the normalized adjacency matrix with self-connections), inside of the activation function. 
How does the GCN deal with the dynamic, non-Euclidean structure of the data? 
This becomes clear by considering Equation (1) in vector notation (Equation 12 in <a href="https://arxiv.org/pdf/1609.02907.pdf">the paper</a>):</p>

<p>\(h_i^{(l)} = \sigma \Big( \sum_{j \in \mathcal{N}_i \cup \{ i \} } \frac{1}{c_{ij}} \mathbf{W}^{(l)} h_j^{(l-1)} \Big). \tag{2}\)
<br />
Here, we can see clearly that the hidden representation of node \(i\) in layer \((l)\) of the GCN is the elementwise non-linear activation of a summation over the neighborhood (with self-connection) \(\mathcal{N}_i \cup \{ i \}\) of vertex \(i\) in the graph, where the neighboring vertices \(j\) are based on the hidden representation \(h_j^{(l-1)}\) of that vertex, transformed by the learnable weight matrix \(\mathbf{W}\) of layer \((l)\) and normalized by \(\frac{1}{c_{ij}}\), with \(c_{ij} = \sqrt{|\mathcal{N}_i|\cdot|\mathcal{N}_j|}\). 
Thus, the GCN addresses the non-Euclidean data structure by replacing fixed indexing typically performed by convolutions with a dynamic summation over vertex neighborhoods.</p>

<h2 id="dataset">A toy graph classification dataset</h2>
<div>
</div>
<p>Now that we understand the idea behind GNNs and how GCNs work, let’s put the GCN to the test and see how it compares with a classical MLP on a simple graph classification problem. 
For this, we need some data.
I used the <a href="https://networkx.org/documentation/stable/index.html">networkx library</a> to create a dataset of random graphs.
To make for a binary classification problem, graphs of class 1 have edge probability of 0.4, while graphs of class 2 have edge probability 0.6, thus the two graph classes have differing degrees. Similarly, I varied the edge weights between the two classes, such that graphs of class 1 randomly draw their edge weight from a standard Gaussian with mean -2, while graphs of class 2 draw their edge weight from a standard Gaussian with mean 2. These (scalar )edge features can trivially be encoded by the adjacency matrix.
This toy dataset essentially mimics the well-known XOR classification problem, except data points are graphs.
Exemplary graphs from this dataset are shown below, where the vertex color indicates the class label.
<img src="/assets/img/gnns/graph_data.png" /></p>

<p>Let’s now train a vanilla MLP and a GCN on this dataset and see how they compare. 
I made sure that all graphs in the dataset are of size 7 such that the adjacency matrix is always \(7 \times 7\), which is convenient because otherwise, we would have to perform some preprocessing, e.g. padding, to ensure that the input layer of the networks fits all graphs.</p>

<h2 id="networks">Results</h2>
<p>I now briefly describe the MLP and the GCN that were used to generate the following results.
As a first step, I created unregularized networks to make sure that I could overfit the dataset. 
I used the <a href="https://pytorch-geometric.readthedocs.io/en/latest/">PyTorch Geometric library</a> for the GCN implementation since it provides the GCN layer out of the box.
Both the MLP and GCN had two hidden layers of width 64 and had a similar number of learnable parameters, 8962 for the GCN and 7490 for the MLP.
I optimized both networks with Adam for 100 epochs, using a batch size of 16. 
The results for this setting are shown in the following plot:</p>

<p><img src="/assets/img/gnns/no-reg-gnn-vs-mlp.png" /></p>

<p>As can be seen, both networks are able to overfit the training set perfectly, achieving 100% classification accuracy. As expected due to the lack of regularization both networks fail to generalize well to the testing set, with the MLP achieving roughly 70% and the GCN 80% classification accuracy. The slightly better generalization as well as the lower testing error of the GCN is probably due to the strong <a href="https://en.wikipedia.org/wiki/Inductive_bias">inductive bias</a> of the GCN, compared to the MLP.</p>

<p>Next, I wanted to improve the generalization of these networks. 
I first chose a few different hyperparameter sets manually but found it very hard to improve accuracy on the test set. I added dropout layers and weight decay for regularization, tried bigger and smaller networks, different batch sizes, and different optimizers, but nothing significantly improved the test set accuracy. The following plot shows results with deeper networks (5 hidden layer of width 32), 50% dropout, and Adam weight decay 1e-4:</p>

<p><img src="/assets/img/gnns/better-gnn-vs-mlp.png" /></p>

<p>The results show similar patterns as in the unregularized case.
Next, I ran a Bayesian hyperparameter optimization for an afternoon, but even that failed to come up with better hyperparameters. 
This implies that there is some ambiguity in the data that can not be learned, only memorized in the training data. 
This makes sense since, given the dataset that I created, both edge weight and edge probabilities are sampled randomly and have some overlap, meaning there exists a certain subset of graphs in the dataset that are very hard or impossible to classify correctly, without additional information. 
It would be interesting to explore the uncertainty of the neural network predictions (for example through monte carlo dropout), but I haven’t done that for this post.</p>

<h2 id="summary">Summary and conclusion</h2>
<p>The main takeaway points from this post should be the following: GNNs are neural networks that can exploit non-Euclidean data structures. 
The GCN generalizes CNNs to non-Euclidean domains by replacing the fixed filter indexing with a dynamic averaging over some proximity neighborhoods. 
On the graph XOR classification problem, we observed better generalization and less overfitting due to stronger inductive bias by the GCN, compared to an MLP. 
This last point should be taken with a grain of salt though, since it’s not clear whether it holds generally or just for the specific problem and data used here.</p>

<p>And that’s it, I hope you found this post interesting :)</p>

<p>Cheers,<br />
<em>Finn</em>.
<br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Machine Learning" /><category term="Python" /><summary type="html"><![CDATA[Motivation and introduction Recently, I became interested in Graph Neural Networks (GNNs) because they are a trending topic in the Machine Learning / Deep Learning community. Thus, in this blog post, I briefly introduce and motivate GNNs based on my novice-level understanding of them. Then, I share some results where I analyze and compare GNNs with classical Multilayer Perceptrons (MLPs) on a simple graph classification task.]]></summary></entry><entry><title type="html">Autonomous Navigation with Pepper Robots</title><link href="http://www.finnrietz.dev/robotics/how-to-pepper-navigation/" rel="alternate" type="text/html" title="Autonomous Navigation with Pepper Robots" /><published>2022-02-12T00:00:00+00:00</published><updated>2022-12-02T09:19:11+00:00</updated><id>http://www.finnrietz.dev/robotics/how-to-pepper-navigation</id><content type="html" xml:base="http://www.finnrietz.dev/robotics/how-to-pepper-navigation/"><![CDATA[<h2 id="motivation">Motivation and introduction</h2>
<p>Softbank´s Pepper robot is a popular HRI research platform, however, as already mentioned in <a href="/linux/hokuyo-ros-setup/" target="_blank">my previous post</a>, the onboard LIDAR sensor supplies very sub-optimal data. Unfortunately, the resulting data is essentially useless for running SLAM algorithms because a) it is very sparse with only 15 laser beams and b) due to the awkward angle of those beams, their range is roughly 5 meters. I tried running <code class="language-plaintext highlighter-rouge">ros-gmapping</code> with Pepper’s onboard LIDAR sensor and the results were, as expected, not satisfactory. At the <a href="https://www.inf.uni-hamburg.de/en/inst/ab/wtm/" target="_blank">KT research group</a>, we wanted to implement autonomous navigation with Pepper robots, which clearly requires good mapping and localization capabilities. Thus, we explored two alternatives to Pepper’s poor onboard LIDAR sensor: We experimented with some visual-slam algorithms, including <a href="https://github.com/raulmur/ORB_SLAM2" target="_blank">ORB_SLAM2</a>, but were not satisfied with the results, most likely due to the shaky, blurry, and low-resolution stream of Pepper´s cameras and the relatively featureless environment that is our office building.
Thus, exploring another option, we build an external, dedicated sensor system for mapping and localization and rigidly attached it to one of our Pepper robots. This approach turned out to yield good mapping, localization and navigation results and can be reproduced with relatively little monetary cost. This post aims to give guidance to anyone who wants to implement a similar solution for their Pepper robot(s) and explains our method at an intermediately-detailed level.</p>

<h2 id="plan">The high-level approach</h2>
<p>The idea of building an external mapping/localization device is motivated by <a href="http://wiki.ros.org/hector_slam" target="_blank">the ROS hector_slam package</a>, made by the folks at TU Darmstadt. This demo video (from 2011), illustrates the capabilities of their algorithm very well:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/F8pdObV_df4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>Given these amazing results on a handheld device, the overall approach is clear: We can simply build a similar hardware system as is used in the video, “duct-tape” it onto our Pepper robot and establish communication between the mapping device and Pepper, such that we can map the environment, then run a localization and navigation algorithm that communicates with the robot to autonomously navigate based on the previously obtained map. Certainly a lot of work but doable. In the following sections, I detail the steps of this approach and attempt to highlight the pitfalls that cost me considerable time along the way.
<br /></p>

<h2 id="hardware">Hardware components</h2>
<p>First of all, if you want to replicate the approach I describe here, you will need to buy the following hardware components:</p>
<ul>
  <li>Most importantly, we need a <em>proper</em> laser range finder. We experimented with two different sensors: The high-quality Hokuyo UST-10LX (~1300€) and the much cheaper YDLIDAR G2 (~200€). In our office building navigation scenario, we found no significant differences in results between the two, but note that the YDLIDAR G2 only has a range of ~12 meters, which is insufficient for larger, open spaces.</li>
  <li>Additionally, the <code class="language-plaintext highlighter-rouge">hector_slam</code> ROS package not strictly requires but benefits from an inertial measurement unit (IMU), which provides acceleration and angular velocity measurements. These are useful when the mapping device is carried around and not rigidly attached to the pepper robot, in which case there can be significant changes in the alleviation and angle of the device, which must of course be considered by the mapping algorithm. Fortunately, IMUs are relatively cheap, we used the MPU-9250 IMU (~15€).</li>
  <li>Furthermore, we require a raspberry pi to stream the sensor data from the IMU and the LIDAR into a ROS network. Our model had 4 GB of RAM and ran a minimal Ubuntu installation. Of course, the more RAM the better, at the time of writing the Raspberry Pi 4 starts at 35€.</li>
  <li>For connecting the MPU-9250 IMU to the raspberry pi we need a special adapter. We used a <a href="https://wiki.seeedstudio.com/Grove_Base_Hat_for_Raspberry_Pi/">grove hat adapter</a> (~10€) but this component might be optional, depending on the concrete IMU model you opt for.</li>
  <li>Optionally: You should consider buying a strong battery that powers all of the above components so that you don’t have to attach a power cable to the robot or handheld system during mapping or navigation.</li>
  <li>Lastly, we need some kind of physical framework to actually attach all of these electronics to. We used a set of <a href="https://www.makerbeam.com/">MakerBeams</a> (starting from 100€), alternative a basic set of LEGO bricks might also work ;)
<br /></li>
</ul>

<p>In total, buying all of these hardware components will cost you about 400€. Considering the original cost of one Pepper robot (14.000€+) this is negligible, especially considering that these few components considerably enhance the navigation capabilities of the robot.</p>

<h3 id="physical-framework">The physical framework</h3>
<p>Assuming access to all hardware components, I now describe how to set up the raspberry pi and connect it to the LIDAR and IMU, so that the data from these two sensors will be available for further processing. We ended up building two separate systems. The first one, which we used mainly for debugging, is a direct adaptation of the hand-held device in the demo video above. The second system is a “utility belt” for the Pepper robot, which was eventually used to have Pepper autonomously navigate. Here are images of what these systems looked like:</p>

<figure class="third ">
  
    
      <a href="/assets/img/pepper-navigation/handheld-0.jpg">
        <img src="/assets/img/pepper-navigation/handheld-0.jpg" alt="Handheld mapping device" />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/handheld-1.jpg">
        <img src="/assets/img/pepper-navigation/handheld-1.jpg" alt="Handheld mapping device (top perspective)" />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/handheld-2.jpg">
        <img src="/assets/img/pepper-navigation/handheld-2.jpg" alt="Handheld mapping device (side perspective)" />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/handheld-3.jpg">
        <img src="/assets/img/pepper-navigation/handheld-3.jpg" alt="Handheld mapping device (bottom perspective)" />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/pepper-frame.jpg">
        <img src="/assets/img/pepper-navigation/pepper-frame.jpg" alt="MakerBeam frame on Pepper robot" />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/pepper-complete.jpg">
        <img src="/assets/img/pepper-navigation/pepper-complete.jpg" alt="Complete MakerBeam and sensor belt on Pepper robot" />
      </a>
    
  
  
    <figcaption>Physical mapping &amp; navigation device. Either hand-held or attached to Pepper.
</figcaption>
  
</figure>

<h3 id="lidar">Getting LIDAR data</h3>
<p>Given a physical construct that holds our sensors, we can start working on the software side of things. First, install your ROS-supporting Linux distro of choice on the raspberry pi. Install a ROS version that supports the drivers and ROS packages required by your IMU and LIDAR sensors. Now, with ROS running on the raspberry, install the drivers and ROS packages required by your LIDAR sensor (also on the raspberry pi). These steps of course depend on the exact sensor you are using. In <a href="/linux/hokuyo-ros-setup/">an earlier post of mine</a> I describe how to set up the Hokuyo UST-10LX LIDAR.
For the second LIDAR sensor we were using (the YDLIDAR G2) I will not re-iterate the entire installation process because it is a relatively straightforward and well documented process. However, in short, to get the YDLIDAR G2 to work, you have to</p>
<ul>
  <li>Connect the YDLIDAR to the raspberry pi as described in the package manual</li>
  <li><a href="https://github.com/YDLIDAR/YDLidar-SDK/blob/master/doc/howto/how_to_build_and_install.md">install the base SDK</a></li>
  <li><a href="https://github.com/YDLIDAR/ydlidar_ros_driver">install the YDLIDAR ROS driver</a>. Don´t forget to replug the YDLIDAR after step 5.</li>
</ul>

<p>At this point, you should be able to start a <code class="language-plaintext highlighter-rouge">roscore</code> on your raspberry pi and launch your LIDAR sensor. You should be able to visualize the data in RVIZ or log the respective ROS topic and confirm that the sensor provides reasonable data. If you are using the YDLIDAR G2, you can call <a href="https://github.com/YDLIDAR/ydlidar_ros_driver/blob/master/launch/G2.launch" target="_blank">this launchfile</a> to start the sensor. If you are rigidly attaching the mapping device to Pepper, you should consider changing the six values (x y z yaw pitch roll) in line <code class="language-plaintext highlighter-rouge">36</code> such that they properly reflect the translation and rotation from the parent coordinate frame you are referring to.</p>

<h3 id="IMU">Getting IMU data</h3>
<p>With you laser scanner (hopefully) providing the desired laser data, we now setup the IMU. This of course also depends on the specific IMU device you are using, but the high-level steps of setting up the MPU-9250 IMU are as follows:</p>
<ul>
  <li>Connect the IMU to you raspberry pi. If you follow this guide exactly, this involves connecting the grove hat adapter to the <code class="language-plaintext highlighter-rouge">GPIO</code> bus of the raspberry pi and <a href="https://wiki.seeedstudio.com/Grove_Base_Hat_for_Raspberry_Pi/#installation">installing its driver</a>. Then, connect the MPU-9250 IMU to one of the i2c ports on the grove hat (see picture below).</li>
  <li><a href="https://github.com/mryellow/RTIMULib/tree/master/Linux">Install RTIMULib</a>. Pay attention to the step where you allow non-root users access to the i2c bus, I didn´t read this properly at first, which coused cryptic errors to arise further down the line…</li>
  <li><a href="https://github.com/jeskesen/i2c_imu">Install i2c_imu</a>. See <a href="https://github.com/jeskesen/i2c_imu/issues/10#issuecomment-350224211">this GitHub issue and comment</a>. <a href="https://github.com/jeskesen/i2c_imu/blob/master/launch/mpu_9150.launch">In the MPU launchfile</a>, for <code class="language-plaintext highlighter-rouge">i2c_imu</code> with the MPU-9250 we must set param <code class="language-plaintext highlighter-rouge">imu_type</code> to 7. Additionally, I had to set the value for param <code class="language-plaintext highlighter-rouge">i2c_bus</code> to 1, which I found out after probing around with the <code class="language-plaintext highlighter-rouge">i2cdetect</code> bash tool, which can be installed with <code class="language-plaintext highlighter-rouge">sudo apt-get install i2c-tools</code>.</li>
</ul>
<figure class="half">
    <a href="/assets/img/pepper-navigation/grove.jpg"><img src="/assets/img/pepper-navigation/grove.jpg" /></a>
    <a href="/assets/img/pepper-navigation/raspi-grove-imu-hokuyo-anno.jpg"><img src="/assets/img/pepper-navigation/raspi-grove-imu-hokuyo-anno.jpg" /></a>
    <figcaption>Raspberry pi with grove hat adapter and connected MPU-9250 IMU via I2C bus.</figcaption>
</figure>

<p>Now, your IMU should be working, and you should be able to access the sensor data with the raspberry pi, similar to this:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/SvgKDOAD_zo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<h2 id="hardware-tips">Getting TF right</h2>
<p>When building a physical system like this, ROS <code class="language-plaintext highlighter-rouge">tf</code> errors were the temporary bane of my existence. For people with no background in robotics, it might be somewhat unclear what <code class="language-plaintext highlighter-rouge">tf</code> actually does, hence I quickly want to provide some fundamental information: When we wish for robots to assume a certain joint configuration, we must look at the current joint values and then calculate the joint velocities that bring us from the current joint configuration to the desired one. This process is called inverse kinematics. These calculations are not terribly hard to do by hand, but, of course, that´s not what we do, instead, we rely on ROS and its various packages for this. For this to work, the robot must provide a valid “kinematic chain”, which essentially encodes how all the coordinate frames for all the joints are related (i.e. when we move the arm, the attached coordinate frame for the hand will also move). The <code class="language-plaintext highlighter-rouge">tf</code> package helps us in managing these different coordinate frames and, most importantly, allows us to easily convert points or vectors between different coordinate frames, given they are linked via a valid kinematic chain.
<br /></p>

<p>Why is this relevant? Because we must manually extend Pepper’s kinematic chain such that it can make use of the external IMU and LIDAR sensors and navigate fully autonomously. If we do not set <code class="language-plaintext highlighter-rouge">tf</code> up properly, the mapping, localization and navigation algorithms can not interpret the sensor data they are receiving. Just providing the sensor data without the corresponding coordinate frame that is linked to the kinematic chain is not enough because the sensor data points would be missing relative information; they would lack their <em>context</em>. The sensor data lives in a certain coordinate frame, and it must be clear how this frame is related to the rest of the robot in order to do any meaningful computations.
<br /></p>

<p>Thus, when mounting the LIDAR or IMU sensor to the robot, make sure you pay attention to the local coordinate frame of the sensor, which might or might not be printed onto the sensor. ROS uses a <a href="https://en.wikipedia.org/wiki/Right-hand_rule#Coordinates">right-handed coordinate system</a> and the sensor coordinate systems should align with this, i.e. the z-axis is the vertical one, x points to the front and the y-axis points to the left.
Thus, when launching your sensors, specify the correct transformation between a sensible parent coordinate frame and the frame of the sensor data. This can be done conveniently via the <a href="http://wiki.ros.org/tf#static_transform_publisher" target="_blank">static transform publisher</a>, for example by including it in the sensor’s launchfile.</p>

<p>Anyway, if you get this wrong, you will quickly notice it, e.g. when the IMU indicates left acceleration when you move it to the right, or because all laser scans are rotated by some degree. In fact, you can see in the IMU video above that the data in RVIZ is no exactly matching what I do with IMU in the real world. This is exactly the issue, the fixed transform that I specified for testing purposes is not in line with the IMU that is just loosely dangling around on my desk, hence the mismatch in orientation.</p>

<p>So, pay close attention when configuring the static transform publisher with the coordinate frames for our IMU and LIDAR. You can read more about the static transform publisher <a href="http://wiki.ros.org/tf#static_transform_publisher">here</a> and, if you want to understand this thoroughly, consider chapter 2 and 3 in Bruno Siciliano’s “Introduction to robotics” book.</p>

<h2 id="software-architecture">Software architecture</h2>
<p>Okay, at this point you should have a raspberry pi that is running ROS so that the data from the IMU and LIDAR sensors, which are connected to that raspberry pi, is available via rostopics. In principle, this is enough the replicate the mapping results as shown in the demo video above, via the <code class="language-plaintext highlighter-rouge">hector_slam</code> ROS package. I describe the mapping process more in-depth <a href="#hector_slam">later</a> because for now, I describe our software architecture. Namely, there are two more entities in addition to the sensor-data collecting raspberry pi. Firstly, we have a main ROS server that runs the computationally expensive mapping, localization and navigation algorithms. This way, the raspberry pi only acts as an interface to the sensors and broadcasts the sensor data into a shared local network, but does not have to perform any expensive computations. Secondly, we treat the Pepper robot as another, separate entity that provides sensor data and receives velocity commands. This again has the benefit of not performing expensive computation on the Pepper robot but rather on the main server and secondly, it resolves a very annoying issue: Pepper’s latest ROS packages only support ROS kinetic, which only supports Ubuntu 16. In the following, I describe how to make all of these systems communicate with each other.</p>

<h3 id="ros-architecture">Distributed ROS </h3>
<p>We have the following three entities:</p>
<ul>
  <li>A central ROS server (with its own <code class="language-plaintext highlighter-rouge">roscore</code>)</li>
  <li>The raspberry pi with attached LIDAR and IMU (with an additional <code class="language-plaintext highlighter-rouge">roscore</code>)</li>
  <li>A ROS kinetic docker container that is running Pepper´s ROS stack (running also a dedicated <code class="language-plaintext highlighter-rouge">roscore</code>)</li>
</ul>

<p>To enable these three entities to communicate with each other we require that they all have access to the same local network. This way, the central ROS server can read the topics published by the raspberry pi, process that data (i.e. for mapping or localization and navigation), and output velocity commands that are executed via Pepper’s <code class="language-plaintext highlighter-rouge">roscore</code>. In practice, we ran the main <code class="language-plaintext highlighter-rouge">roscore</code> and Pepper´s <code class="language-plaintext highlighter-rouge">roscore</code> on the same, powerful desktop machine, however, Pepper´s ROS kinetic core was running in an isolated docker container, which resolve the annoying version conflicts (I describe in <a href="#/linux/ros-docker/">another blog post</a> how to communicate with a <code class="language-plaintext highlighter-rouge">roscore</code> that lives inside a docker container).
The key thing to note is that when we have multiple roscores running in the same network, topics provided by the different roscores can be accessed generally, that is a topic from core <code class="language-plaintext highlighter-rouge">a</code> can be seen by core <code class="language-plaintext highlighter-rouge">b</code>.</p>

<h3 id="pepper-docker">Pepper´s docker ROS kinetic core</h3>
<p>As mentioned, Pepper´s most recent ROS packages are for ROS kinetic (which released in 2016 and requires Ubuntu 16). The cleanest solution would be to compile all of those packages manually in an up-to-date ROS environment. I tried this and, after wasting a considerable number of hours in this effort, accepted defeat. Thus, I eventually opted for a docker container running ROS kinetic, where Pepper´s ROS packages can be installed straightforwardly. In this docker container, simply install <a href="http://wiki.ros.org/pepper">Pepper´s entire ROS stack</a>. Then, you should be able to:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># launch Pepper´s ros driver</span>
roslaunch pepper_bringup pepper_full_py.launch nao_ip:<span class="o">=</span>&lt;YOUR-PEPPER-IP&gt; roscore_ip:<span class="o">=</span>&lt;DOCKER-KINETIC-ROSCORE-IP&gt;  
<span class="c"># the have it assume wake-up pose</span>
rosservice call /pepper_robot/pose/wakeup
</code></pre></div></div>
<p>of course replacing <code class="language-plaintext highlighter-rouge">&lt;YOUR-PEPPER-IP&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;DOCKER-KINETIC-ROSCORE-IP&gt;</code> with the IP address of the Pepper robot and the local ROS kinetics <code class="language-plaintext highlighter-rouge">roscore</code> IP address.
With a working ROS kinetic environment that allows us to control Pepper through ROS, we must just make sure that we can call services and publish data from our main server and have this take the desired effect in the docker kinetic ROS environment connected to Pepper.</p>

<h3 id="distributed-ros">Key configurations for distributed ROS systems</h3>
<p>There are two key configurations that are very easy to miss when building a distributed ROS system. I highly consider reading the entire <a href="http://wiki.ros.org/ROS/NetworkSetup">ROS network setup</a>, but nevertheless highlight these two configurations here.
Firstly, make sure that the IP address to hostname mapping is properly set up in <code class="language-plaintext highlighter-rouge">/etc/hosts</code>. Specifically, for each of our three entities, add the respective other two hostnames and IP addresses to that file. If this is not properly handled, roscore <code class="language-plaintext highlighter-rouge">a</code> will only be able to see the rostopics of remote hosts <code class="language-plaintext highlighter-rouge">b</code> and <code class="language-plaintext highlighter-rouge">c</code>, but the topics will not contain any data (<a href="https://answers.ros.org/question/90536/ros-remote-master-can-see-topics-but-no-data/">see this thread</a>).</p>

<p>Secondly, the local time for all three entities must be exactly the same. If I recall correctly, the time in the docker container is identical to that on the host, but certainly, the time between the raspberry pi and the desktop machine will not be the same. Here, I mean they must be identical up to a few milliseconds. If they have an offset of e.g. one second, this will cause very weird bugs to occur. For example, I was getting errors from the TF package indicating there was some issue with my kinematic chain, while in fact, the TF messages broadcasted by the raspberry pi were slightly too old (due to the nonidentical time on raspberry and main server), causing the messages to be dropped by the TF instance running on the desktop, even thought the kinematic chain was valid. To fix this, install <code class="language-plaintext highlighter-rouge">chrony</code> on the desktop and on the raspberry pi, then synchronize one machine with the clock of the other (both directions should be fine). This process is also documented <a href="http://wiki.ros.org/ROS/NetworkSetup#Timing_issues.2C_TF_complaining_about_extrapolation_into_the_future.3F">here</a>.</p>

<p>With these things taken care of, you should now be able to access data from the raspberry pi and from the ROS kinetic core on the main ROS server. You can test this by visualizing the external LIDAR, the IMU and, for example, Pepper´s camera stream in RVIZ on the desktop machine. You should also be able to launch e.g. <a href="http://wiki.ros.org/rqt_robot_steering">rqt-steering</a> on the desktop and control Pepper that way. If this works, your distributed ROS system is set up correctly and the navigation algorithm executed on the main server and based on LIDAR data from the raspberry pi will be able to drive Pepper to the desired goal location.</p>

<h2 id="hector_slam">Mapping with hector slam</h2>
<p>Given the distributed ROS system described above, the first step towards autonomous navigation is, of course, obtaining a map of the environment. As mentioned initially, here we entirely rely on the <code class="language-plaintext highlighter-rouge">hector_slam</code> ROS package. Starting the entire system and creating a map of the environment involves the following steps:</p>
<ul>
  <li>Start the required sensors (on the raspberry pi they are connected to):
    <ul>
      <li>Start the IMU (for example by adjusting <a href="https://github.com/jeskesen/i2c_imu/blob/master/launch/mpu_9150.launch">this launchfile</a> as described above)</li>
      <li>Start the LIDAR (for example, using this <a href="https://github.com/YDLIDAR/ydlidar_ros_driver/blob/master/launch/G2.launch">mentioned launchfile</a>)</li>
    </ul>
  </li>
  <li>Start Pepper’s ROS stack (inside the docker container that communicates with Pepper, as described <a href="#pepper-docker">above</a>)</li>
  <li>On the main server or inside Pepper’s kinetic docker ROS environment, start your tool of choice to control the Pepper robot, i.e. <a href="http://wiki.ros.org/rqt_robot_steering"><code class="language-plaintext highlighter-rouge">rqt_robot_steering</code></a>
<br />
Now, we can start the <code class="language-plaintext highlighter-rouge">hector_slam</code> ROS package and create a map of the environment.</li>
  <li>Start the <code class="language-plaintext highlighter-rouge">hector_slam</code> node (on the more powerful, central ROS server)
    <ul>
      <li>Start <a href="http://wiki.ros.org/hector_imu_attitude_to_tf"><code class="language-plaintext highlighter-rouge">hector_imu_to_tf</code></a>, this connects the LIDAR data to the angles reported by the IMU</li>
      <li>Start <a href="http://wiki.ros.org/hector_geotiff"><code class="language-plaintext highlighter-rouge">hector_geotiff</code></a>, a service for saving the map</li>
      <li><strong>Optionally</strong>, start <a href="http://wiki.ros.org/tf#static_transform_publisher"><code class="language-plaintext highlighter-rouge">static_transform_publisher</code></a>, to ensure that the TF tree is valid and associates all sensor data with the <code class="language-plaintext highlighter-rouge">base_link</code> frame</li>
      <li>Start <a href="http://wiki.ros.org/hector_mapping"><code class="language-plaintext highlighter-rouge">hector_mapping</code></a>, the main <code class="language-plaintext highlighter-rouge">hector_slam</code> node</li>
    </ul>
  </li>
</ul>

<p>Note that there is no difference whether you want to create the map using the hand-held device or with the LIDAR and IMU rigidly mounted to the Pepper robot. This is because the package <code class="language-plaintext highlighter-rouge">hector_imu_to_tf</code> links the LIDAR scans to the IMU angles. In practice, this is done by configuring TF in such a way that the laser data frame is the <code class="language-plaintext highlighter-rouge">base_stabilized</code> frame, while this frame’s pose is estimated using <a href="http://wiki.ros.org/hector_imu_attitude_to_tf"><code class="language-plaintext highlighter-rouge">hector_imu_to_tf</code></a> package. How <code class="language-plaintext highlighter-rouge">hector_slam</code> uses different coordinate frames is further documented <a href="http://wiki.ros.org/hector_slam/Tutorials/SettingUpForYourRobot">here</a>.</p>

<p>Thus, if everything works, you should be able to obtain good mapping results. Here are exemplary mapping results obtained with the above-described hard- and software-setup. First, we tested and debugged the system in a small, maze-like environment with unique features for easy mapping. Once the system was working well in the test environment, we tested it in the main hallway of our office building and obtained good results.</p>

<figure class="third ">
  
    
      <a href="/assets/img/pepper-navigation/test-env.jpg">
        <img src="/assets/img/pepper-navigation/test-env.jpg" alt="Small maze-like testing environment with unique features for easy mapping and debugging" />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/maze.jpg">
        <img src="/assets/img/pepper-navigation/maze.jpg" alt="Obtained map of testing environment with hand-help mapping device" />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/office_rotated.jpg">
        <img src="/assets/img/pepper-navigation/office_rotated.jpg" alt="Obtained map of entire office" />
      </a>
    
  
  
    <figcaption>A small testing environment and mapping results.
</figcaption>
  
</figure>

<p>Once the entire environment has been mapped, <code class="language-plaintext highlighter-rouge">hector_geotiff</code> is used to save the map by executing <code class="language-plaintext highlighter-rouge">rostopic pub syscommand std_msgs/String "savegeotiff"</code>. This did not work for me immediately and complained with the message “<em>failed with error ‘Device not writable’</em>”. I don’t know the root cause of this, however, I was able to guess a solution relatively quickly. The following command, which re-creates the target folder and takes care of RWX rights fixed the issue for me:<br />
<code class="language-plaintext highlighter-rouge">sudo mkdir /opt/ros/melodic/share/hector_geotiff/maps &amp;&amp;
sudo chmod -R a+rwx /opt/ros/melodic/share/hector_geotiff/maps &amp;&amp;
sudo chown -R ubuntu:ubuntu /opt/ros/melodic/share/hector_geotiff</code>.
Of course, you have to adjust the paths for your ROS installation.</p>

<h2 id="amcl">Navigation and Adaptive Monte Carlo Localization</h2>
<p>Given a map of our environment, we can now run Adaptive Monte Carlo Localization (AMCL) to estimate the current robot position given a stream of laser scans.
You can of course use any other localization algorithm, but the <a href="http://wiki.ros.org/amcl">AMCL ROS package</a> generally works very well (if properly parameterized).</p>

<h3 id="map_server">Loading hector maps</h3>
<p>To start the localization process in a mapped environment, first, we must make the map available on a rostopic. The default topic used for this is accurately called <code class="language-plaintext highlighter-rouge">map</code>.
The map can be made available by calling the <code class="language-plaintext highlighter-rouge">map-server</code> package, which takes as argument the path to a map file, e.g. <code class="language-plaintext highlighter-rouge">rosrun map_server map_server path/to/map.yaml</code>. This should make your map available on the <code class="language-plaintext highlighter-rouge">map</code> topic, you can check this by either logging the <code class="language-plaintext highlighter-rouge">map</code> topic or by visualizing the <code class="language-plaintext highlighter-rouge">map</code> topic in RVIZ.</p>

<p>When we load maps that were created with the <code class="language-plaintext highlighter-rouge">hector_slam</code> package, we must pay attention to a particular parameter of the <code class="language-plaintext highlighter-rouge">map-server</code> package, which took me quite some time to figure out. The maps generated by <code class="language-plaintext highlighter-rouge">hector_slam</code> draw obstacles in blue instead of black (as can be seen on the map images above). This causes them to be classified wrongly by the <code class="language-plaintext highlighter-rouge">map_server</code> package, because with the default parameters, the blue color is converted to grayscale and happens to fall into an undefined range, meaning the <code class="language-plaintext highlighter-rouge">map_server</code> does not identify the blue pixels on the map as actual obstacles. Unfortunately, no warning is raised and the <code class="language-plaintext highlighter-rouge">map-server</code> happily broadcasts a map with many invalid (-1) values, which of course makes any localization impossible.
This is extra hard to spot, because the resulting map still looks almost perfect when visualized in RVIZ:</p>

<figure class="half ">
  
    
      <a href="/assets/img/pepper-navigation/good_map.png">
        <img src="/assets/img/pepper-navigation/good_map.png" alt="A valid map shown in RVIZ." />
      </a>
    
  
    
      <a href="/assets/img/pepper-navigation/bad_map.png">
        <img src="/assets/img/pepper-navigation/bad_map.png" alt="An invalid map shown in RVIZ." />
      </a>
    
  
  
    <figcaption>The map on the left is valid and can be used for localization and navigation, while the map on the right is invalid and completely useless for localization and navigation. A subtle but devastating difference.
</figcaption>
  
</figure>

<p>To prevent this from happening, set the parameter <code class="language-plaintext highlighter-rouge">occupied_thresh</code> of the <code class="language-plaintext highlighter-rouge">map_server</code> package to a lower value, for example, 0.5. The <code class="language-plaintext highlighter-rouge">map-server</code> package converts the color values in each cell to a value between 0 and 1, and pixel values greater than <code class="language-plaintext highlighter-rouge">occupied_thresh</code> are considered to be an obstacle on the map.</p>

<p>With a correctly loaded map now being available on the <code class="language-plaintext highlighter-rouge">map</code> topic, all that’s really left to do now is to start the ACML node. ACML has a lot of parameters, most of which I left at their default values. However, the odometry type should be set to <code class="language-plaintext highlighter-rouge">omni</code> via the <code class="language-plaintext highlighter-rouge">odom_model_type</code> parameter, since the Pepper robot has an omnidirectional mobile base. Furthermore, the <code class="language-plaintext highlighter-rouge">laser_min_range</code> and <code class="language-plaintext highlighter-rouge">laser_max_range</code> parameters should match the LIDAR device. For the YDLIDAR G2, I set them to 3 and 16, respectively. Lastly, make sure to pass the correct TF frames to AMCL. This of course is very specific to your TF chain, but most likely you can leave the default values if you didn’t change any TF frame names before.</p>

<p>Store the parameter values in a launchfile and launch AMCL. You can either call the AMCL rosservice <code class="language-plaintext highlighter-rouge">global_localization</code> to get an uninformed prior that spreads to pose particles uniformly over the state space or you can use the RVIZ <code class="language-plaintext highlighter-rouge">2D pose estimate</code> button to spread the initial particles normally around the cursor location on the map. Either way, after having initialized the localization algorithm you can (optionally) drive back and forth a bit until the history of laser scans allowed the localization algorithm to converge to the correct position on the map. Alternatively, you can just pass a goal pose and rely on the initial scan history for estimating the initial robot pose.</p>

<p>Based on the given map, current pose and goal pose, navigation is performed by the ROS package <code class="language-plaintext highlighter-rouge">move_base</code>, which will calculate a trajectory from the initial pose to the given goal pose. Once this trajectory has been calculated, <code class="language-plaintext highlighter-rouge">move_base</code> will output velocity commands on the <code class="language-plaintext highlighter-rouge">cmd_vel</code> topic that move the robot along that trajectory to the given target pose. The ROS driver of each specific robot must implement the functionality that interprets and executes the given velocity command with the available robot actuators, like Pepper’s omnidirectional wheels. In our case, Pepper’s ROS kinetic stack implements a velocity controller somewhere, I guess.</p>

<p>And that is it. With a setup like this, we can exploit the powerful ROS navigation stack to autonomously navigate with (slightly modified) Pepper robots.
My colleague Dr Phillip Allgeuer continued to work on this project when I had to leave the Knowledge Technology group to start my PhD. Phillip made considerable improvements to our codebase, finalized the project and produced two videos which I slightly edited and combined into the following:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/pcyNEU7ZCJ0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe>

<p>Although I don’t know the following for certain, I hypothesize that the pauses of the robot in the second clip are based on Pepper’s onboard safety features. If enabled, these will override any control if it is considered unsafe, by some criterion. I observed numerous times that Pepper would suddenly stop moving when it is in close proximity to an obstacle, or when it moves at somewhat fast velocities, for example to maintain a safe balance.</p>

<h2 id="summary">Summary</h2>
<p>To summarize, we wanted that our Pepper robots could autonomously navigate our office premises. We found that the onboard laser ranger finders did not allow for satisfactory mapping and localization performance. Thus, we extended the Pepper robot with better LIDAR and IMU sensors, which allows us to run the powerful <code class="language-plaintext highlighter-rouge">hector_slam</code> package for mapping, and <code class="language-plaintext highlighter-rouge">amcl</code> plus <code class="language-plaintext highlighter-rouge">move_base</code> for localiation and motion planning/navigation. This required a somewhat sophisticated, distributed hardware, software and network architecture that is described in this blog post for anyone to reproduce.</p>

<p>Cheers,<br />
<em>Finn</em>.
<br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Robotics" /><category term="ROS" /><category term="SLAM" /><category term="Hardware" /><summary type="html"><![CDATA[Motivation and introduction Softbank´s Pepper robot is a popular HRI research platform, however, as already mentioned in my previous post, the onboard LIDAR sensor supplies very sub-optimal data. Unfortunately, the resulting data is essentially useless for running SLAM algorithms because a) it is very sparse with only 15 laser beams and b) due to the awkward angle of those beams, their range is roughly 5 meters. I tried running ros-gmapping with Pepper’s onboard LIDAR sensor and the results were, as expected, not satisfactory. At the KT research group, we wanted to implement autonomous navigation with Pepper robots, which clearly requires good mapping and localization capabilities. Thus, we explored two alternatives to Pepper’s poor onboard LIDAR sensor: We experimented with some visual-slam algorithms, including ORB_SLAM2, but were not satisfied with the results, most likely due to the shaky, blurry, and low-resolution stream of Pepper´s cameras and the relatively featureless environment that is our office building. Thus, exploring another option, we build an external, dedicated sensor system for mapping and localization and rigidly attached it to one of our Pepper robots. This approach turned out to yield good mapping, localization and navigation results and can be reproduced with relatively little monetary cost. This post aims to give guidance to anyone who wants to implement a similar solution for their Pepper robot(s) and explains our method at an intermediately-detailed level.]]></summary></entry><entry><title type="html">Hokuyo UST-10LX with ROS Setup</title><link href="http://www.finnrietz.dev/linux/hokuyo-ros-setup/" rel="alternate" type="text/html" title="Hokuyo UST-10LX with ROS Setup" /><published>2021-08-10T00:00:00+00:00</published><updated>2021-08-11T03:50:33+00:00</updated><id>http://www.finnrietz.dev/linux/hokuyo-ros-setup</id><content type="html" xml:base="http://www.finnrietz.dev/linux/hokuyo-ros-setup/"><![CDATA[<h2 id="motivation">Motivation and introduction</h2>
<p>I have worked for quite a while with Softbank’s Pepper robot, which, unfortunately, does not feature the best onboard hardware. Specifically, it only employs three laser sensors that are aimed towards the ground at a slightly awkward angle <a href="#[1]">[1]</a>. These sensors are enough to do minimalistic obstacle avoidance, but not enough for Simultaneous Localization and Mapping (SLAM). As a consequence, at the KT group, we’ve decided to instead build an external mapping device, using the Hokuyo UST-10LX <a href="[2]">[2]</a> laser sensor. However, the Hokuyo UST-10LX has to be connected to a host machine via ethernet cable to communicate its measurements with a host machine. This requires some network tinkering on that machine, which, in our case, is a Raspberry Pi 3. I found this setup not to be properly documented (or at least I couldn’t find anything on this). As such, below I describe how to configure a headless Raspi 3 running Ubuntu Server 20.04 to stream data from the Hokuyo laser sensor into your network. 
<br /></p>

<h2 id="Raspi-setup">Raspi setup and networking</h2>
<p>The reason this sensors usability is not exactly at the <i>plug-and-play</i> level is that it, as already said, communicates with the host machine via ethernet cable, which in turn means that we can’t use the LAN port to connect the host machine to our trusted network. This is not necessarily a problem, but if we still want to remotely connect to and work on the said host machine, we have to configure our Raspi in such a way that it a) allows data to come in via the wired network interface while b) connects to the network/internet via the wireless interface. In the following steps, we do just that. Additionally, the ethernet wired network device has to be configured in the right way to communicate with the Hokuyo sensor which, also, I did not find to be well documented.
Note that I did the following steps on the Ubuntu 20.04 server distro, I assume the steps work on all Debian-based Linux systems, but obviously, I did not test that.
<br /></p>

<p>Assuming you’ve freshly installed Ubuntu Server on your Raspi and plan to use it headless, you’ll have to do some initial configuration (including enabling <code class="language-plaintext highlighter-rouge">ssh</code>), which is widely covered, i.e. <a href="https://pimylifeup.com/ubuntu-server-raspberry-pi/" target="_blank">here</a>. Once that is done and ready, simply log into your Raspi via <code class="language-plaintext highlighter-rouge">ssh</code>. Assuming your Raspi is currently connected to your network via ethernet cable, we now have to configure it to connect to WLAN, so that the ethernet port is available for the Hokuyo sensor. Do the following steps:</p>
<ul>
  <li>Get the name of your wireless network device (likely <code class="language-plaintext highlighter-rouge">wlan0</code>):
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls</span> /sys/class/net
<span class="c"># enp8s0  lo  wlan0</span>
</code></pre></div>    </div>
    <p>This outputs the names of all network devices on your system. Your wireless device is likely called <code class="language-plaintext highlighter-rouge">wlan0</code>, you can read more about network device naming conventions on ubuntu <a href="http://manpages.ubuntu.com/manpages/focal/man7/systemd.net-naming-scheme.7.html" target="_blank">here</a>.</p>
  </li>
  <li>Edit the netplan configuration file it is`/etc/netplan/50-cloud-init.yaml with your editor of choice. Add the following lines to it, but make sure they are properly intended (its a YAML file, which is sensitive to this).
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wifis</span><span class="p">:</span>
  <span class="o">&lt;</span><span class="n">your</span><span class="o">-</span><span class="n">wireless</span><span class="o">-</span><span class="n">device</span><span class="o">&gt;</span><span class="p">:</span>
      <span class="n">dhcp4</span><span class="p">:</span> <span class="n">true</span>
      <span class="n">optional</span><span class="p">:</span> <span class="n">true</span>
      <span class="n">access</span><span class="o">-</span><span class="n">points</span><span class="p">:</span>
          <span class="s">"&lt;SSID_WiFi_name&gt;"</span><span class="p">:</span>
              <span class="n">password</span><span class="p">:</span> <span class="s">"&lt;WiFi_password&gt;"</span>
</code></pre></div>    </div>
    <p>You have to enter your network information, so replace <code class="language-plaintext highlighter-rouge">&lt;your-wireless-device&gt;</code> with the name of your wireless network device (i.e. <code class="language-plaintext highlighter-rouge">wlan0</code>), <code class="language-plaintext highlighter-rouge">&lt;SSID_WiFi_name&gt;</code> with then name of your router and <code class="language-plaintext highlighter-rouge">&lt;WiFi_password&gt;</code> with the according password. The final file should look similar to this:</p>
    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
</span><span class="n">network</span><span class="p">:</span>
  <span class="n">ethernets</span><span class="p">:</span>
      <span class="n">eth0</span><span class="p">:</span>
          <span class="n">dhcp4</span><span class="p">:</span> <span class="n">true</span>
          <span class="n">optional</span><span class="p">:</span> <span class="n">true</span>
  <span class="n">version</span><span class="p">:</span> <span class="mi">2</span>
  <span class="n">wifis</span><span class="p">:</span>
      <span class="n">wlan0</span><span class="p">:</span>
          <span class="n">dhcp4</span><span class="p">:</span> <span class="n">true</span>
          <span class="n">optional</span><span class="p">:</span> <span class="n">true</span>
          <span class="n">access</span><span class="o">-</span><span class="n">points</span><span class="p">:</span>
              <span class="s">"SSID_WiFi_name"</span><span class="p">:</span>
                  <span class="n">password</span><span class="p">:</span> <span class="s">"WiFi_password"</span>
</code></pre></div>    </div>
  </li>
  <li>Now, bring up the wireless network card on the Raspi, if it is not already running:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span> <span class="nb">sudo </span>ip <span class="nb">link set</span> &lt;your-wireless-device&gt; up
</code></pre></div>    </div>
    <p>Just executing this won’t break anything if your wireless card is already running, but later commands will fail if it is down. Again insert the name of your wireless network device (<code class="language-plaintext highlighter-rouge">wlan0</code>).</p>
  </li>
  <li>Now, run these commands to generate and apply the netplan configuration we entered into <code class="language-plaintext highlighter-rouge">/etc/netplan/50-cloud-init.yaml</code>:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>netplan <span class="nt">--debug</span> try
<span class="nv">$ </span><span class="nb">sudo </span>netplan <span class="nt">--debug</span> generate
<span class="nv">$ </span><span class="nb">sudo </span>netplan <span class="nt">--debug</span> apply
<span class="nv">$ </span><span class="nb">sudo </span>reboot
</code></pre></div>    </div>
    <p>Finally, you can unplug the ethernet cable from your Raspi, and, assuming everything went well, the Raspi will connect to your WLAN after its reboot, in which case you can <code class="language-plaintext highlighter-rouge">ssh</code> onto it again.</p>
  </li>
</ul>

<p>Now, the internet traffic is being handled by the wireless network device and the wired ethernet port is available for the Hokuyo. In the next section, we install the Hokuyo ROS driver to bring up the laser sensor.</p>

<h2 id="ros-setup">ROS and the Hokuyo driver</h2>
<p>The remaining steps are more or less straightforward, but not trivial and not properly documented either (hence this post). Install ROS on your Raspi or the machine you plan to attach the Hokuyo to, by following the <a href="http://wiki.ros.org/ROS/Installation" target="_blank">official installation instructions</a>. Once you have ROS installed, install the ROS-driver for the Hokuyo UST-10LX sensor <a href="[3]">[3]</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>ros-&lt;ROS-VERSION&gt;-urg-node
</code></pre></div></div>
<p>Here, replace <code class="language-plaintext highlighter-rouge">&lt;ROS-VERSION&gt;</code> with any of <code class="language-plaintext highlighter-rouge">kinetic</code>, <code class="language-plaintext highlighter-rouge">melodic</code> or <code class="language-plaintext highlighter-rouge">noetic</code>, as these are the only ROS versions currently supported by the package. Alternatively, you can compile the package yourself, but then you will have to take care of setting up the workspace and that all dependencies are satisfied, which I won’t describe here, as this process is well documented elsewhere <a href="[4]">[4]</a>. With the drivers installed, now we just start ROS and can get our laser data, right? Well, not really, again because this specific Hokuyo laser sensor communicates with the host machine via ethernet cable. Furthermore, because of that, the instructions given for the <code class="language-plaintext highlighter-rouge">urg_node</code> package don’t work for us either. So, to get laser-data into our <i>handy-dandy</i> ROS topics, we have to assign an IPv4 address to the <strong>wired</strong> network interface (because the sensor is connected via cable), so that the IP address of the Hokuyo sensor falls into the subnet range of the wired network interface. From the <a href="https://www.hokuyo-aut.jp/dl/UST-10LX_Specification.pdf" target="_blank">Hokuyo’s datasheet</a> (page 6), we know that the device has the default IP address of <code class="language-plaintext highlighter-rouge">192.168.0.10</code> assigned. Thus, we can assign an IP address like <code class="language-plaintext highlighter-rouge">192.168.0.15/24</code> to our wired network device. <code class="language-plaintext highlighter-rouge">192.168.0.15/24</code> is in CIDR notation <a href="[5]" target="_blank">[5]</a> and states that the first 24 bits (as indicated by the trailing <code class="language-plaintext highlighter-rouge">/24</code>) provide the network identifier, and whatever comes after that is the actual machine identifier. To assign a specific IP address to the wired network device, take the following steps:</p>
<ol>
  <li>Get the name of your ethernet (wired) network device with the same command we used before:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls</span> /sys/class/net
<span class="c"># eth0  lo  wlan0</span>
</code></pre></div>    </div>
    <p><code class="language-plaintext highlighter-rouge">lo</code> is the loopback device, <code class="language-plaintext highlighter-rouge">wlan0</code> (or similar) the WiFi card, so the remaining name is your wired network card (most likely called <code class="language-plaintext highlighter-rouge">eth0</code>, <code class="language-plaintext highlighter-rouge">eno1</code> or <code class="language-plaintext highlighter-rouge">enp2s0</code>).</p>
  </li>
  <li>Assign an IP address to your wired ethernet device that shares the network identifier with the IP of the Hokuyo sensor:
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># sudo ip addr add &lt;SHARED.NETWORK.IDENTIFIER&gt;.&lt;HOST-IDENTIFIER&gt;/24 broadcast &lt;SHARED.NETWORK.IDENTIFIER&gt;.255 dev &lt;ETHERNET-CARD-NAME&gt;</span>
<span class="nv">$ </span><span class="nb">sudo </span>ip addr add 192.168.0.15/24 broadcast 192.168.0.255 dev eth0
</code></pre></div>    </div>
    <p>As above, replace the values in <code class="language-plaintext highlighter-rouge">&lt;&gt;</code>: with your values. You can verify that the command worked by inspecting the output of the <code class="language-plaintext highlighter-rouge">ifconfig</code> command. For the ethernet device you specified, it should show the IP address we just assigned to it. If your ethernet device does not show up in <code class="language-plaintext highlighter-rouge">ifconfig</code> but when you run <code class="language-plaintext highlighter-rouge">ls /sys/class/net</code>, it might be down. You can bring it up the same way we did with the wireless device earlier.</p>
  </li>
</ol>

<p>Now we have our regular traffic running over the wireless connection, while the ethernet wired device is configured to communicate with the Hokuyo sensor. Note, the IP address assignment with the <code class="language-plaintext highlighter-rouge">ip addr add</code> command we just did is not static, you will have to redo it after rebooting your device. You can look up how to configure this permanently by editing the netplan once again, i.e. <a href="https://linuxize.com/post/how-to-configure-static-ip-address-on-ubuntu-20-04/" target="_blank">here</a>. Now, the final step is to get the laser data into a ROS topic via the ROS driver.</p>

<p>As usual in ROS, first, we start a <code class="language-plaintext highlighter-rouge">roscore</code> with the simple command <code class="language-plaintext highlighter-rouge">roscore</code>. Then, you need a new session on the same machine to start the <code class="language-plaintext highlighter-rouge">urg_node</code> driver for the Hokuyo sensor. Either <code class="language-plaintext highlighter-rouge">ssh</code> onto the server a second time or use a terminal multiplexer like <a href="https://wiki.ubuntuusers.de/tmux/" target="_blank">tmux</a>. With the roscore running in the other session, run the following command to bring up the Hokuyo laser:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>rosrun urg_node urg_node _ip_address:<span class="o">=</span>192.168.0.10
</code></pre></div></div>
<p>With the <code class="language-plaintext highlighter-rouge">_ip_address</code> argument we pass the IP-address of the physical sensor to the ros node (see <a href="https://github.com/ros-drivers/urg_node/blob/kinetic-devel/launch/urg_lidar.launch" target="_blank">here</a>). As mentioned already <code class="language-plaintext highlighter-rouge">192.168.0.10</code> is the default value for the Hokuyo UST-10LX, you will have to adjust this if you assigned a different IP address to your sensor. If everything works, the node will output something like:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span> INFO] <span class="o">[</span>1501672789.034051716]: Streaming data.
</code></pre></div></div>
<p>Now, in yet another terminal session, you can inspect this data, which is the laser point cloud returned by the sensor, by simply calling <code class="language-plaintext highlighter-rouge">rostopic echo /scan</code>, where <code class="language-plaintext highlighter-rouge">/scan</code> is the default name of the rostopic used by the <code class="language-plaintext highlighter-rouge">urg_node</code> driver. 
<br /></p>
<h2 id="rviz">RVIZ</h2>
<p>You can also inspect the laser data in <code class="language-plaintext highlighter-rouge">RVIZ</code>, but this requires one two steps. RVIZ can only display data that is attached to a valid coordinate frame (which makes sense, because you need <i>some</i> point of reference if you want to visualize things in any space). Usually, the robot you are using will provide the coordinate frames via the ROS <code class="language-plaintext highlighter-rouge">tf</code> package, but given our current setup, we just have the Hokuyo sensor that is sorta floating in the void. If you inspect the data on the <code class="language-plaintext highlighter-rouge">/scan</code> topic, for example with <code class="language-plaintext highlighter-rouge">rostopic info /scan</code>, you will see that the data is associated with the coordinate frame <code class="language-plaintext highlighter-rouge">laser</code>. But this coordinate frame <strong>does not exist</strong>, so RVIZ can’t know where to put this data and will complain if you try to visualize the laser point cloud. To fix this, we simply broadcast a made-up coordinate frame, with the following command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>rosrun tf static_transform_publisher 0 0 0 0 0 0 1 map laser 10
</code></pre></div></div>
<p>Now we would be set, if our <code class="language-plaintext highlighter-rouge">roscore</code> would run on a system that features a graphical display session, but if you are running the <code class="language-plaintext highlighter-rouge">roscore</code> on a headless machine, as is the case for the Raspberry 3 Ubuntu server setup describe earlier, we can’t even display RVIZ, to begin with. Thus, you will have to do this on a different machine. Luckily, setting up a distributed ROS environment is very easy. Before starting the <code class="language-plaintext highlighter-rouge">roscore</code> on the headless machine, run the following command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">ROS_MASTER_URI</span><span class="o">=</span>http://&lt;IP&gt;:11311
</code></pre></div></div>
<p>Replace <code class="language-plaintext highlighter-rouge">&lt;IP&gt;</code> with the IP address of the network card. Obviously, when you restart the <code class="language-plaintext highlighter-rouge">roscore</code>, you will have to restart the <code class="language-plaintext highlighter-rouge">urg_node</code> as well.
Now, in a terminal on your machine with a graphical desktop environment, run the same command, then <code class="language-plaintext highlighter-rouge">rostopic list</code>. This should show the rostopics that are running on the headless machine, assuming both machines are in the same network. For some reason, we can now see the rostopics, but no data in the topics, even though we are streaming it from the headless machine. 
I had to apply <a href="https://answers.ros.org/question/90536/ros-remote-master-can-see-topics-but-no-data/?answer=90956#post-id-90956" target="_blank">this weird fix </a> in order to actually see the data in my ros topics on my main machine:</p>
<blockquote>
What finally solved my problem was adding all PCs with their hostnames and IP Addresses to the "/etc/hosts" file. Since then, everything works fine.
</blockquote>
<p>With that being taken care of and all nodes running (<code class="language-plaintext highlighter-rouge">roscore</code>, <code class="language-plaintext highlighter-rouge">urg_node</code>, <code class="language-plaintext highlighter-rouge">static_transform_publisher</code>), you should be able to visualize your laser data in RVIZ. Look at this nice laser data:</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/hEQP5q-U6MA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<h2 id="summary">Summary</h2>
<p>Alright, in this post we’ve configured a Raspberry Pi 3 running Ubuntu Server 20.04 to publish laser data into a network via WiFi, while the Hokuyo sensor is connected to the device via ethernet cable. We’ve installed the required software components and finally showed how to visualize the laser data in RVIZ on a different machine in the same network. Maybe someone finds this helpful :)</p>

<p>Cheers,<br />
<em>Finn</em>.</p>

<p><br /></p>

<p><br />
References: <br />
<a href="http://doc.aldebaran.com/2-4/family/pepper_technical/laser_pep.html" target="_blank" id="[1]">[1] Pepper laser specification</a><br />
<a href="https://www.robotshop.com/eu/en/hokuyo-ust-10lx-scanning-laser-rangefinder-eu.html" target="_blank" id="[2]">[2] Hokuyo UST-10LX</a><br />
<a href="http://wiki.ros.org/urg_node" target="_blank" id="[3]">[3] URG_node Hokuyo driver</a><br />
<a href="http://wiki.ros.org/ROS/Tutorials/BuildingPackages" target="_blank" id="[4]">[4] Building ROS packages</a><br />
<a href="https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing" target="_blank">[5] Wikipedia: Classless Inter-Domain Routing</a><br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Linux" /><category term="Robotics" /><category term="ROS" /><category term="SLAM" /><summary type="html"><![CDATA[Motivation and introduction I have worked for quite a while with Softbank’s Pepper robot, which, unfortunately, does not feature the best onboard hardware. Specifically, it only employs three laser sensors that are aimed towards the ground at a slightly awkward angle [1]. These sensors are enough to do minimalistic obstacle avoidance, but not enough for Simultaneous Localization and Mapping (SLAM). As a consequence, at the KT group, we’ve decided to instead build an external mapping device, using the Hokuyo UST-10LX [2] laser sensor. However, the Hokuyo UST-10LX has to be connected to a host machine via ethernet cable to communicate its measurements with a host machine. This requires some network tinkering on that machine, which, in our case, is a Raspberry Pi 3. I found this setup not to be properly documented (or at least I couldn’t find anything on this). As such, below I describe how to configure a headless Raspi 3 running Ubuntu Server 20.04 to stream data from the Hokuyo laser sensor into your network.]]></summary></entry><entry><title type="html">Neural activity animated visualization</title><link href="http://www.finnrietz.dev/machine%20learning/python/mpl_animate/" rel="alternate" type="text/html" title="Neural activity animated visualization" /><published>2021-03-02T00:00:00+00:00</published><updated>2021-03-02T00:00:00+00:00</updated><id>http://www.finnrietz.dev/machine%20learning/python/mpl_animate</id><content type="html" xml:base="http://www.finnrietz.dev/machine%20learning/python/mpl_animate/"><![CDATA[<p><br />
This project presents a piece of software that I wrote for my student employee
job at the Knowledge Technology Research group.</p>

<h2 id="the-problem-of-plotting-data-in-many-dimensions">The problem of plotting data in many dimensions</h2>
<p>Every data-scientist knows the struggle of visualizing high dimensional data.
Most 2-dimensional plots are fairly easy to read and can be grasped within one quick glimpse over the plot, at least when the plot is well done.
What is also relatively trivial and vastly popular is the practice of plotting different items in different colors, where color acts as an additional dimension in a 2-dimensional plot.
However, plots often get much messier and inherently harder to read when we shift them into three <em>spatial</em> dimensions. Not only do 3-dimensional introduce significant perceptual
overload, but there is also the fundamental problem that very, very clear trends can be completely shadowed, depending on the virtual camera angle, through which
the plot is represented. The data that I was given for this task consisted of two factors and their development over time as a third factor, which has exactly the problem of producing a three-dimensional plot. Thus, the problem at hand is to find a visualization that allows expressing the interplay of these three factors, all in one plot,
while maintaining the high amount of readability, typically associated with 2-dimensional plots.</p>

<h2 id="3d-time-series-plots">3D Time series plots</h2>
<p>Without wanting to depict my teachers and colleagues badly (for whom I have nothing but respect), one example for a plot that is relatively cumbersome to read is given here:
<img class="align-center" src="/assets/img/mpl_animate_project/both-3d-series.png" /></p>
<figcaption>Hard to read 3D time-series plots. Image taken from <a href="#avctrnn_citation">[2]</a>.</figcaption>

<p>These plots have not been done by me, but contain some of the data that I was tasked with to find a new, better visualization. All the data-vectors for the for the below animations originate from the same paper, so I encourage the interested reader to check it out <a href="https://www.researchgate.net/publication/327691059_Adaptive_and_Variational_Continuous_Time_Recurrent_Neural_Networks" target="_blank">here</a>. <br />
Given that the visualization was for a conference, not a follow-up version of the paper, we had the opportunity to experiment with animating the results, which brings the plot back into two dimensions, with time being displayed as actual time frames in the animation. The fact that all plots had a time-component fully enabled this, as animations really only make sense when the data has an underlying temporal component.</p>

<h2 id="matplotlibanimation-to-the-rescue">Matplotlib.animation to the rescue</h2>
<p>A quick google search revealed that Python can be used as an animation engine, specifically, with the <code class="language-plaintext highlighter-rouge">Matplotlib.animation</code> <a href="https://matplotlib.org/3.2.1/api/animation_api.html">API</a> (at this point I can’t bunt wonder whether there is something that Python can’t do). There are multiple approaches that can be taken to animate a plot with Matplotlib, the one I personally find most intuitive boils down to creating a callback function that is called for each time step in the animation and sets the data for each MPL artist for the current time step.</p>

<p>Using the <code class="language-plaintext highlighter-rouge">Matplotlib.animation</code> API, we can visualize the ealier plot as follows:
<img class="align-center" src="/assets/img/mpl_animate_project/bright.gif" id="bright_example" /></p>
<figcaption>Animated version of the earlier 3D time-series plot.</figcaption>

<p>We could have opted for a <em>smoother</em> animation, but we wanted to maintain the discrete time-steps for this specific graph, hence the <em>artificial</em>, non-smooth appearance of the animation. This animation already teasers what we can with Matplotlib’s animation framework, but check out the next example:</p>

<p><img class="align-center" src="/assets/img/mpl_animate_project/3x4_trace.gif" id="3x4" /></p>
<figcaption>Another example for animations done in Matplotlib.</figcaption>

<p>This illustrates that more complex, but nevertheless concise and seamless animations are also possible in Matplotlib. Here, we combined multiple of the original plots from <a href="#avctrnn_citation">the original paper</a> into one animation.</p>

<h2 id="dealing-with-sparse-data">Dealing with sparse data</h2>
<p>The major stepping stone I encountered when writing the code for the above animations was that of finding means to generate smooth, continuous animations from discrete, sparse data. In the <a href="#bright_example">the first example</a>, we kind of <em>embrace</em> the fact that we are dealing with discrete time steps, hence the sharp <em>jumps</em> from frame to frame.  However, we likely want our animations to play out smoothly. To achieve this, we have to augment our data with artificial data-points, so that we can achieve the desired framerate. For example, we might want to render at 60 FPS, but when we only have 35 data-points, our animation would be over in roughly half a second…
Even though in the final version of the second example, this kind of data interpolation is not used (because we implemented this <em>decaying trace</em>) at one point during development I implemented just that. Thus, I want to share the blog-post that helped me a bunch: <a href="https://www.rossidata.com/PythonAnimations">Easing Animations with Python by Nicholas Rossi</a>. Even though I did not use his package, Nicholas does a great job at illustrating the problem, hence I wanted to drop the reference here.</p>

<h2 id="situational-animations">Situational animations</h2>
<p>While creating animations in Python is incredibly fun, there is one (maybe obvious) drawback to tacking the time component from the third dimension and animating it: It is no longer possible to consider the development or history of the data at one single glance. When the time component is animated, the focus lies heavily on the current frame, and past frames are probably quick to be forgotten. The following image is a good example of where the 3D-visualization (the third dimension is color) nicely reveals a pattern in the underlying data:
<img class="align-center" src="/assets/img/mpl_animate_project/good-3d-example.png" /></p>
<figcaption>Example for strong visualization of temporal pattern in time-series data. Figure taken from <a href="#avctrnn_citation">[2]</a>.</figcaption>
<p>Here, we can clearly see an increase in activity around the 230th time step. If you consider the animation for these two plots (<a href="#3x4">top row, second and third subplot</a>), you might agree that it is harder to keep the entire history of the activation patterns in mind, specifically how the first half compares to the second half.</p>

<h2 id="final-verdict">Final verdict</h2>
<p>Animating temporal data can reduce the complexity of plots and make them easier to understand, at the cost of losing some of the compactness of having the entire time series displayed in one frame. Matplotlib’s animation class provides a convenient way for anyone with basic Python understanding to create scientific animations, similar to those presented here. However, larger and more complex animations do take quite a while to render…
<br />
As always, find the code associated with this project <a href="https://github.com/frietz58/mpl_neural_activity" target="_blank">on my Github</a>
<br /></p>

<p>References:<br />
<a href="https://www.inf.uni-hamburg.de/en/inst/ab/wtm/" target="_blank">[1] Knowledge Technology Research Group</a><br />
<a href="https://www.researchgate.net/publication/327691059_Adaptive_and_Variational_Continuous_Time_Recurrent_Neural_Networks" id="avctrnn_citation" target="_blank">[2] Adaptive and Variational Continuous Time Recurrent Neural Networks</a><br />
<a href="https://matplotlib.org/3.2.1/api/animation_api.html">[3] Matplotlib Animation API</a><br />
<a href="https://www.rossidata.com/PythonAnimations">[4] Easing Animations with Python</a><br />
<a href="https://github.com/frietz58/mpl_neural_activity" target="_blank">[5] Code for this project</a><br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Machine Learning" /><category term="Python" /><category term="Matplotlib" /><summary type="html"><![CDATA[This project presents a piece of software that I wrote for my student employee job at the Knowledge Technology Research group.]]></summary></entry><entry><title type="html">WoZ4U: An Open Source Interface for the Pepper robot</title><link href="http://www.finnrietz.dev/robotics/woz4u/" rel="alternate" type="text/html" title="WoZ4U: An Open Source Interface for the Pepper robot" /><published>2020-12-10T00:00:00+00:00</published><updated>2020-12-10T00:00:00+00:00</updated><id>http://www.finnrietz.dev/robotics/woz4u</id><content type="html" xml:base="http://www.finnrietz.dev/robotics/woz4u/"><![CDATA[<p><br />
This project presents the results of my three-month study visit at the <a href="https://www.umu.se/en/research/groups/intelligent-robotics/" target="_blank">Intelligent Robotics</a> group at the Umeå University in Sweden: We build a comprehensive and configurable interface for <a href="https://en.wikipedia.org/wiki/Pepper_(robot)" target="_blank">Softbank’s Pepper robot</a>.
We initially presented this work at the <a href="https://whisperproject.eu/wasp2020" target="_blank">2020 Workshop on Affective Shared Perception</a> and ultimately published it as <a href="https://www.frontiersin.org/articles/10.3389/frobt.2021.668057/full" target="_blank">artical in the Frontiers in Robotics and AI Journal</a>. As with most of my work, it is <a href="https://github.com/frietz58/WoZ4U" target="_blank">available on GitHub</a>.</p>

<h2 id="study-visit-in-umeå">Study visit in Umeå</h2>
<p>I initially started working on the Pepper robot in the scope of my student job at the <a href="https://www.inf.uni-hamburg.de/en/inst/ab/wtm/" target="_blank">Knowledge Technology group</a> at Uni Hamburg.
There were already a number of collaborations between the KT group in Hamburg and the <a href="https://www.umu.se/en/research/groups/intelligent-robotics/" target="_blank">Intelligent Robotics group</a> in Umeå, so when I started looking for interesting groups to visit, the Intelligent Robotics group stood out because a) they also work with multiple Pepper robots and b) they were already associated with the KT group. So it came that I visited the Intelligent Robotics group from September to December in 2021 to work on an interface for the Pepper robot, financed through a generous 3-month stipend. The study visit was great, I enjoyed working on this project a lot, everyone at the Intelligent Robotics group was incredibly welcoming, helpful, and friendly, plus there were plenty of oportunities to go camping and fishing on the weekends :)</p>

<p>The most eye-opening experience in this study visit was, by far, going through the lengthy, sometimes tedious process of publishing in a peer-reviewed journal.</p>

<!--




<figure class="third ">
  
    
      <a href=
        
          "/assets/woz4u/office.jpg"
        
        
      >
        <img src=
          
            "/assets/woz4u/office.jpg"
          
          alt="Office">
      </a>
    
  
    
      <a href=
        
          "/assets/woz4u/morning.jpg"
        
        
      >
        <img src=
          
            "/assets/woz4u/morning.jpg"
          
          alt="Nice morning">
      </a>
    
  
    
      <a href=
        
          "/assets/woz4u/coffee.jpg"
        
        
      >
        <img src=
          
            "/assets/woz4u/coffee.jpg"
          
          alt="Coffee">
      </a>
    
  
    
      <a href=
        
          "/assets/woz4u/birches.jpg"
        
        
      >
        <img src=
          
            "/assets/woz4u/birches.jpg"
          
          alt="Birches">
      </a>
    
  
    
      <a href=
        
          "/assets/woz4u/fishing.jpg"
        
        
      >
        <img src=
          
            "/assets/woz4u/fishing.jpg"
          
          alt="Fishing">
      </a>
    
  
    
      <a href=
        
          "/assets/woz4u/cold.jpg"
        
        
      >
        <img src=
          
            "/assets/woz4u/cold.jpg"
          
          alt="Cold day">
      </a>
    
  
  
    <figcaption>Some impressions form Umeå.
</figcaption>
  
</figure>
-->

<h2 id="the-interface">The interface</h2>
<p>The idea and requirements behind the interface were formulated by Professor Hellström and Professor Bensch at the Intelligent Robotics group. The main goal was to make Pepper as a research platform for Human-Robot Interaction (HRI) experiments more approachable. This is motivated by the observation that a lot of Pepper’s functionality is gated behind the robot’s API, which has a steep learning curve and requires sufficient programming knowledge to begin with. Given that there is an inherently social aspect in Human-Robot Interaction, HRI studies carried out by social-science researchers seem valid and required. While these research groups have excellent skills in experiment design, they don’t necessarily have the technical expertise to implement the robotic control software required to conduct HRI experiments. Thus, we set out to create an easily approachable, comprehensive, and configurable interface for the Pepper robot, to lower the barrier towards concrete HRI research. We named the interface <code class="language-plaintext highlighter-rouge">WoZ4U</code>, after the <em>Wizard-of-Oz</em> HRI experiment methodology, find it on <a href="https://github.com/frietz58/WoZ4U" target="_blank">GitHub</a>.</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/Anb5SAnE8Jo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<h3 id="software-architecture">Software architecture</h3>
<p>To make the interface as widely accessible as possible, we support all major operating systems. This includes Debian-based Linux, macOS, and Windows. Furthermore, we provide a docker image hosting the interface, which should eliminate most if not all requirement conflicts. The docker image further eliminates the need to follow the lengthy setup guide, as the interface is accessible as soon as the docker container is running on the network. In the backend, the interface is implemented as a webserver. This conveniently makes the interface accessible via web browser from any machine in the network (including smart phones), which removes any requirements towards the OS, since any modern OS comes with a web browser.</p>

<p>We had the non-functional requirement that the interface should be easily (re)configurable for different experiments because a tool specialized for one experiment has no general value for the community. As such, we feature a configuration file in <code class="language-plaintext highlighter-rouge">YAML</code> syntax. Every non-general part of the UI is configurable through that file so that no programming is required to set up the interface for a new experiment. Instead, one simply edits a few content-specific lines in the configuration file. For example, one might want to investigate which gestures are perceived as particularly friendly. For that, one edits which gestures should be accessible from the interface based on the following snippet:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">gestures</span><span class="pi">:</span> <span class="c1"># Buttons will be created for every item in the list</span>
  <span class="pi">-</span>
    <span class="na">title</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Yes"</span>  <span class="c1"># This will be shown in the GUI</span>
    <span class="na">gesture</span><span class="pi">:</span> <span class="s2">"</span><span class="s">animations/Stand/Gestures/Yes_1"</span>  <span class="c1"># Gesture to execute</span>
    <span class="na">tooltip</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Yes_1</span><span class="nv"> </span><span class="s">gesture"</span>  <span class="c1"># Tooltip for buton</span>
    <span class="na">key_comb</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">shift"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">1"</span><span class="pi">]</span>
  <span class="pi">-</span>
    <span class="na">title</span><span class="pi">:</span> <span class="s2">"</span><span class="s">No"</span>
    <span class="na">gesture</span><span class="pi">:</span> <span class="s2">"</span><span class="s">animations/Stand/Gestures/No_1"</span>
    <span class="na">tooltip</span><span class="pi">:</span> <span class="s2">"</span><span class="s">No_1</span><span class="nv"> </span><span class="s">gesture"</span>
    <span class="na">key_comb</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">shift"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">2"</span><span class="pi">]</span>
</code></pre></div></div>

<p>This procedure is the same for all elements in the interface so that the entire interface can easily be configured for different experiments or occasions.</p>

<h3 id="features">Features</h3>
<p>The interface comprises the following robot functionalities:</p>
<ul>
  <li>Autonomous life management: Provides controls over the general behavior emitted by the robot</li>
  <li>Tablet control: Provides controls over which items (pictures, videos, websites) are displayed on Pepper’s tablet</li>
  <li>Speech control: Provides controls over text-to-speech messages</li>
  <li>Animated speech: Provides controls over speech + gesturing messages</li>
  <li>LED control: Provides control over Pepper’s LEDs</li>
  <li>Motion control: Provides a simplistic motion controller for Pepper’s omnidirectional wheels</li>
  <li>Gesture control: Provides control over Pepper’s gestures</li>
</ul>

<p>These features comprise almost everything Pepper’s API has to offer. In some cases, we even extend the API for some custom functionalities that are not part of the API (audio and touch-event live streams).</p>

<p>Feel free to read <a href="https://www.frontiersin.org/articles/10.3389/frobt.2021.668057/full" target="_blank">our paper</a> for more details.</p>

<p>We hope that this tool is useful for researchers in the HRI field.</p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Robotics" /><category term="Pepper Robot" /><category term="Flask Server" /><category term="Wizard of Oz" /><summary type="html"><![CDATA[This project presents the results of my three-month study visit at the Intelligent Robotics group at the Umeå University in Sweden: We build a comprehensive and configurable interface for Softbank’s Pepper robot. We initially presented this work at the 2020 Workshop on Affective Shared Perception and ultimately published it as artical in the Frontiers in Robotics and AI Journal. As with most of my work, it is available on GitHub.]]></summary></entry><entry><title type="html">Soft Actor Critic: Deep Reinforcement Learning for Robotics?</title><link href="http://www.finnrietz.dev/machine%20learning/python/soft-actor-critic/" rel="alternate" type="text/html" title="Soft Actor Critic: Deep Reinforcement Learning for Robotics?" /><published>2020-09-29T00:00:00+00:00</published><updated>2020-09-29T20:05:37+00:00</updated><id>http://www.finnrietz.dev/machine%20learning/python/soft-actor-critic</id><content type="html" xml:base="http://www.finnrietz.dev/machine%20learning/python/soft-actor-critic/"><![CDATA[<h2 id="motivation">Motivation and introduction</h2>
<p>The Soft Actor-Critic algorithm by Haarnoja et al. <a href="#[1]">[1]</a> has gotten a lot of coverage and attention in 2018 and 2019. And rightfully so. The paper proposes a very elegant solution to the notorious problem of deep reinforcement learning algorithms being too data-hungry for real-world feasibility and supplies very exciting examples illustrating the capabilities of the algorithm in a real-world setting, as can be seen below. Naturally, I was intrigued. While at the point of writing this post, Reinforcement Learning has not yet been featured on this site, it is, after all, my main academic interest and will be at the heart of my masters’  (and hopefully Ph.D.) thesis. Hence, for one of my courses, I decided to write a paper on the Soft Actor-Critic algorithm. In this blog post, I built on that paper <a href="#[2]">[2]</a> and provide some additional examples and insights.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/FmMPHL3TcrE" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p><br /></p>

<h2 id="problem">The problem with Deep Reinforcement Learning for real world robotics</h2>
<p>While this post will not address Reinforcement Learning in general, the gist of it is as follows: By executing pseudo-random actions in an environment (or simulation thereof) and <em>rewarding</em> good actions, we can have a, for example robotic, agent learn almost any desired behavior. Here, behavior means that we execute the desired sequence of actions to get from some initial state to some goal state. Inherently, this is a very powerful concept, as this makes it possible for robots to learn how to walk, grasp things, play games, engage in dialogue, and pretty much learn to solve any conditional, sequential problem. That’s the theory, at least.
<br />
However, as you might have noticed, we are not yet surrounded by intelligent, autonomous robots in our everyday life, in fact, it’s still out of the norm to find a robot autonomously cleaning an office space or shopping mall, which indicates that things aren’t quite as easy. Many different fields of robotics are still active research areas, just like Reinforcement Learning is still having a central problem, stopping it from being widely employed in real-world robotic scenarios. To be precise, the main problem of deep Reinforcement Learning algorithms for real-world robotics is that they are insanely data-hungry and take ages to converge (ie manage to generate the desired behavior). <br />
Why is this problematic? Well, robots aren’t indestructible and in the early stages of learning, Reinforcement Learning agents behave essentially randomly. You can probably imagine what drastic consequences it can have if we just set all motors in our robot to random power levels… Larger robots will fall over, mobile bots might severely ram into obstacles, and drones would crash immediately. If we expose our robots to this kind of behavior for a prolonged period of time, it is almost certain that the robot will suffer significant damage in the process (similar to how toddlers fall over when they begin learning to walk, except here, consequences aren’t inevitable breakdown). And this is only one aspect of the issue. When I wrote <em>insanely data-hungry</em>, I absolutely meant that. For example, AlphaStar, DeepMind’s deep neural Reinforcement Learning algorithm, has been with trained many agents in parallel, for 14 days straight, on 16 Tensor Processing Units (TPU), corresponding to 200 years of real-life training time, <strong>for each agent</strong> <a href="#[3]">[3]</a>… And this is under the employment of state-of-the-art methods to speed up the learning process. <br />
Ignoring that we can’t train a single robot in parallel fashion, after 200 years of hypothetical training, you can be sure that the robot would have broken down simply due to all the wear and tear that it would be exposed to in all that time. <br />
An apparent solution is to train the agent in a simulator (which also allows us to parallelize the training process) and then simply put the <em>behavior policy</em>  learned in the simulation on a physical robot, operating in the real world. However, the simulators are not yet good enough and fail to accurately represent the real world, which makes the learned behavior policies useless on the real, physical robot. Further, agents trained in a simulator tend to learn things that are hyper-specific to that simulator and don’t generalize to the real world. This is referred to as the Sim-to-Real problem and is an active research area in itself. <br />
So as you can see, there are a lot of challenges for real-world Reinforcement Learning. However, the Soft Actor-Critic algorithm tackles the problem at its root and aims to significantly speed up the learning process, to a point where deep Reinforcement Learning methods become feasible in real-world scenarios. Let’s explore the intuition behind the algorithm in the following section.</p>

<h2 id="sac_intuition">The intuition behind Soft Actor Critic</h2>
<p>
To gain an understanding into how the SAC algorithm tackles the data inefficiency problem of deep Reinforcement Learning methods, we have to look at the SAC specific reward function that is being employed by Haarnoja et al. However, to begin with, consider the classical Reinforcement Learning object, that describes the general goal of Reinforcement Learning <a href="#[9]">[9]</a>:
$$G_t = \sum^\infty_{k=0}\gamma^k R_{t+k+1}$$
This is the <i>expected discounted return</i> \(G\) at time step \(t\), with a discount factor \(0 \le \gamma \le 1\), so that the reward signal \(R\) from \(t+k+1\) time steps in the future is weighted to be less important than the reward signal at \(t+k\), encoding an aspect of temporal relevance. The reward signal \(R\) is, arguably, the central part of any Reinforcement Learning problem, as this guides what the <i>policy</i> (always denoted by \(\pi\)) of our agent will learn, by encoding the <i>goodness</i> of any action taken. Generally, the behavior, which is encoded in the policy of the agent, is adapted in such a way that it maximizes the reward function, thus, a well thought out reward function is the key for success in reinforcement learning. Essentially, no matter what, with Reinforcement Learning, we want to accumulate as much discounted reward, aka return \(G_t\) as possible. This is the main objective all Reinforcement Learning methods are subject to.
Formally, the optimal policy \(\pi^*\) is defined as the policy that has the highest expected reward for every action, at every timestep, in every state <a href="#[1]">[1]</a>:
$$\pi^* = \underset{\pi}{\operatorname{argmax}} \underset{\tau \sim \pi}{\mathbb{E}} \left[ \sum^\infty_{t=0} \gamma ^t [r(s_t, a_t)]\right]$$
Here, (\( \tau \sim \pi\)) means that a trajectory of interactions (\(\tau \)) has been sampled (\(\sim \)) from the probability distribution of the policy (\(\pi\)). 
Notice that \(r\) is a function, over all states and actions, providing the reward <i>meassure of goodness</i> for every combination of states and actions (at least in simple examples). 
</p>
<p>
Now, the central element in the SAC algorithm is an advanced, general reward function, that contains a second term in addition to the main reward signal <a href="7">[7]</a>:
$$\pi^* = \underset{\pi}{\operatorname{argmax}} \underset{\tau \sim \pi}{\mathbb{E}} \left[ \sum^\infty_{t=0} \gamma ^ t [r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)] \right]$$ 
The only difference to the original formula for the optimal policy is the term \(\mathcal{H}\), which is weighted by \(\alpha\). \(\mathcal{H}\) encodes the entropy of the policy \(\pi\) in every state and is given by \(\mathcal{H}(P) =  \underset{x \sim P}{\operatorname{\mathbb{E}}} [-log P (x)]\). Entropy is, roughly speaking, a meassure of information gain or uncertaintaniy of a random variable \(x\), sampled from a distribution \(P\). Do you see what this motivates the Reinforcement learning agent, who behaves according to a learnt policy that maximizes the given function, to do? It forces the agent to not only consider the reward associated with an action in a state, but also the overall degree of uncertainty in that state. 
This results in the agent choosing actions that lead to states which have not yet been seen, especially when a different action would lead to a state that has a higher expected return (but has already been seen). 
The parameter \(\alpha &gt; 0\) balances the two components of the objective function and controls the importance of the entropy term, compared to the reward signal. 
In the original version of the SAC algorithm, this parameter \(\alpha\) had to be set manually, which was a non-trivial problem for complex enough environments and required an expensive hyperparameter optimization <a href="#[1]">[1]</a>. However, in the newer version of the algorithm, Haarnoja et al. managed to automatically adjust the parameter by rephrasing the objective function once again. However, the details of this automatic temperature adjustment can be ignored for the purpose of this blog post.
</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/KOObeIjzXTY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p><br /></p>
<p>
In addition to speeding up the overall learning process and making for better data efficiency, this RL objective function has another desirable side effect: It produces much more stable policies <a href="#[1]">[1]</a>, <a href="#[7]">[7]</a>. Unfortunately, it is not further explained why that is, but I think about it like this: Since the reward for every state also depends on the entropy component, the agent is less likely to visit the same state twice because the entropy for that state will already be decreased. Hence, by exploring many, slightly different trajectories (sequences of states and selected actions), the overall policy is more robust, because it does not hinge on observing a small amount of <i>key</i> states in order to be able to select the overall <i>best</i> action. I hope that makes sense...?
But those are just my two cents... Either way, in the video above, we can observe the consequences of this: The agent can deal with significant perturbations of the state (brick wall, stairs, ramp) that it has not encountered during training. This is a very nice property to have, as it implies that the learned policy is more <i>general</i> and can be employed in contexts that are not part of the training data.
</p>
<p>
And this is how far I will go regarding the basic idea behind the SAC algorithm. To summarize, SAC incorporates an entropy term into the Reinforcement Learning objective function, which motivates the agent to select actions under consideration of the uncertainty associated with each state. Like this, the agent can explore the environment much more efficiently, which results in significantly faster convergence, compared to many other state-of-the-art algorithms (see the original paper for benchmarking results <a href="#[1]">[1]</a>). 
For the remainder of this post, we will explore and discuss how the algorithm performs on a practical OpenAI Gym task.
</p>

<h2 id="gym">OpenAI Gym example</h2>
<figure class="half">
    <a href="/assets/img/rl_sac_analysis/bipedal_walker.png"><img src="/assets/img/rl_sac_analysis/bipedal_walker.png" /></a>
    <a href="/assets/img/rl_sac_analysis/bipedal_walker_hc.png"><img src="/assets/img/rl_sac_analysis/bipedal_walker_hc.png" /></a>
    <figcaption>Figure 1: OpenAI Gym Bipedal Walker environment. Left: Normal version. Right: Hardcore version. </figcaption>
</figure>

<p>OpenAI Gym <a href="#[4]">[4]</a> provides a wide array of Reinforcement learning environments and is one of the de-facto tools being used to benchmark, compare and develop Reinforcement Learning algorithms. For getting practical experience with the SAC algorithm, I selected the BipedalWalker environment, where the goal is for a bipedal agent to develop an efficient walking gait. This environment is particularly interesting, for reasons further explained below, because it has a <em>normal</em> and a <em>hardcore</em> version, where the hardcore version of the environment contains many stumps, pitfalls, and stairs and is much harder to solve successfully. 
As we can see in the above video, the walking gait learned on the minotaur robot appears to be outstandingly stable, generalizing to a handful of unseen scenarios: The brick wall and the ramp. So my hypothesis is as follows: The bipedal walker trained on the normal version of the environment might be robust enough to also solve the hardcore version of the environment, similar to how the minotaur in the video could deal with the obstacles presented in the testing scenarios! To investigate this hypothesis, we need a working version of the algorithm though. Instead of implementing this algorithm from scratch (which would take a lot of time and straight-up not be efficient), we will use the implementation provided <a href="#[5]">in this repository [5]</a>.</p>

<h3 id="results">Results</h3>
<p>To begin with, I trained a SAC agent for 500 epochs on both the normal and hardcore version of the environment. For comparison, I also trained a PPO <a href="#[6]">[6]</a> agent and a TD3 <a href="#[8]">[8]</a> on both versions of the environment, to put the convergence time of the SAC agent into perspective. To be fair, PPO is an on policy method, which are known to have much worse data efficiency than off-policy methods. Consider the results presented below:</p>

<div style="text-align: center;">
  <img src="/assets/img/rl_sac_analysis/walker_rewards_4.png" class="align-center" />
  <figcaption>Figure 2: Training progress of SAC, TD3 and PPO agents on the normal and hardcore version of the BipedalWalker gym environment.</figcaption><br />
</div>

<p>We can observe that on the normal version of the environment, the algorithm converges within roughly ~ 100 epochs of 5000 interactions with the environment per epoch. However, out of the box, the algorithm does not appear to be able to solve the hardcore version of the environment within 500 epochs. Based on this data alone, I can not really draw further conclusions. It is very well possible that with slight adjustments to the hyperparameters, a SAC algorithm could solve the hardcore version of the environment as well. However, not wanting to invest more time into this blog post, I did not bother to conduct an expensive and timely hyperparameter optimization and applied the algorithm with its out-of-the-box configuration to both versions of the environment. Further, we can observe that the TD3 agent learns just as fast as the SAC agent. Again, we can’t really conclude anything beyond that this is how these algorithms perform, given this exact scenario and hyperparameter configuration. The benchmarking results presented by Haarnoja et al. do more justice to the efficiency of the algorithm than this small experiment and I highly encourage taking a look at the paper <a href="#[1]">[1]</a>.</p>

<p>Interestingly enough, TD3 struggles to make meaningful progress within 500 episodes on the hardcore version of the environment as well. This gives some indication of the difficulty associated with this specific environment. As expected, the PPO agent hasn’t come close yet to solving the environement, which it would likely do, given more training data</p>

<p>To begin the analysis of our main hypothesis, whether the policy learned by the SAC agent is robust enough to be transferred from the normal version of the environment to the hardcore one, consider the below Figure:</p>

<div style="text-align: center;">
  <img src="/assets/img/rl_sac_analysis/test_data.png" class="align-center" />
  <figcaption>Figure 3: Testing rollouts of the SAC policies learned on both versions of the environment. Left: Average reward obtained. Right: Average episode length.</figcaption><br />
</div>

<p>These statistics tell us a few things on how the learned policies perform, already regarding the main hypothesis we sought out to investigate, whether policies trained on the simple version of the environment would be robust enough to also deal with the obstacles presented in the hardcore version of the environment, without encountering them during training. Sadly, a short glance at Figure 3 immediately falsifies that hypothesis. We can observe that when executing the policy trained on the normal version of the environment on the hardcore version, we get an average reward of roughly -100, with low deviations from that value. This is because the environment <em>punishes</em> the agent with a -100 reward when it falls over.  Hence, the policy learned on the normal version of the environment is not robust enough to get past the obstacles in the hardcore version and falls over, getting punished with a -100 reward. Further, we can observe that the agent trained on the normal version of the environment does not appear to have a deviation from the reward and episode length. This indicates that the agent performs equally well (or badly) most of the time, contrary to the agent trained on the hardcore version of the environment. There, we can observe a much higher range of values, indicating external factors (aka the obstacles) having an effect on the performance of the agent.
To further verify our conclusion regarding our main hypothesis, take a look at how the agent performs in practice:</p>

<div style="text-align: center;">
  <video controls="true" allowfullscreen="true" style="width: 100%;">
    <source src="/assets/img/rl_sac_analysis/SAC_EZ_ON_EZ.mp4" type="video/mp4" />
  </video>
  <figcaption>Agent trained on the normal version of the environment, tested on the normal version.</figcaption><br />
</div>
<p>As expected, the agent developed a (kinda awkward looking) walking gait, that successfully solves the normal version of the environment by traveling all the way to the end of the level. This is the most efficient gait it found, as applying torque to the joints costs a small amount of reward. 
Regarding our main hypothesis, consider the following video, where we employ the policy trained on the normal version of the environment on the hardcore version:</p>

<div style="text-align: center;">
  <video controls="true" allowfullscreen="true" style="width: 100%;">
    <source src="/assets/img/rl_sac_analysis/SAC_EZ_ON_HC.mp4" type="video/mp4" />
  </video>
  <figcaption>Agent trained on the normal version of the environment, tested on the hardcore version.</figcaption><br />
</div>
<p>Here, we can see how the agent struggles to get past the obstacles. This properly rejects (I think conducting a T-Test on the reward distribution would be overkill and is not necessary for this blog post) our hypothesis: A SAC agent trained on the normal BipedalWalker environment is not robust enough to also solve the BipedalWalkerHardcore environment, as already indicated in Figure 3. In hindsight, I see how this is too big of a leap from the normal to the hardcore version of the environment. There is a clear difference to the real-world examples we saw above: In the examples with the stairs, the ramp, and the small brick wall, the robot, controlled by the learned policy, gets away with just sticking to the learned policy. These obstacles don’t require dedicated handling, the robot does not have to learn a specific behavior to get past them. The obstacles faced in the hardcore version of BipedalWalker environment can clearly not be handled in the same way. The agent needs to find a distinct strategy for dealing with the different obstacles present in the hardcover version of the BipedalWalker environment.</p>

<p>So what about the SAC agent that has been trained directly on the hardcore version of the environment? Well, take a look…</p>
<div style="text-align: center;">
  <video controls="true" allowfullscreen="true" style="width: 100%;">
    <source src="/assets/img/rl_sac_analysis/SAC_HC_ON_HC.mp4" type="video/mp4" />
  </video>
  <figcaption>Agent trained on the hardcore version of the environment, tested on the hardcore version.</figcaption><br />
</div>

<p>As you can see, that agent performs very poorly. However, I am certain that the agent would learn how to get past the different hurdles given a) more training time/data and or b) a hyperparameter optimization for that version of the environment. We can see already that the walking gait, if we can call it that, differs from what was learned on the normal version of the environment. This becomes even more apparent when we visualize the two agents side by side, one having been trained on the normal version of the environment, the other on the hardcore version:</p>
<div style="text-align: center;">
  <video controls="true" allowfullscreen="true" style="width: 100%;">
    <source src="/assets/img/rl_sac_analysis/SAC_EZ_HC_COMP.mp4" type="video/mp4" />
  </video>
  <figcaption>Comparison of walking gaits learned trained on the two environment versions, both tested on the normal version.</figcaption><br />
</div>

<h3 id="extra">Extra</h3>
<p>Purely because it’s somewhat interesting too look at, here is a video of the walking gaits developed by the TD3, PPP and SAC algorithm (SAC trained on normal and HC environment):</p>
<div style="text-align: center;">
  <video controls="true" allowfullscreen="true" style="width: 100%;">
    <source src="/assets/img/rl_sac_analysis/ALL_ON_EZ_COMP.mp4" type="video/mp4" />
  </video>
  <figcaption>Comparison of learned walking gaits by agent from the different algorithms. SAC (HC) is, again, the gait learned on  the hardcore version of the environment, executed on the normal version. </figcaption><br />
</div>

<h2 id="summary">Summary</h2>
<p>Alright, wrapping it up: Deep Reinforcement Learning methods suffer from strong data inefficiency. The Soft Actor-Critic algorithm by Haarnoja et al. tackles this data inefficiency problem of (deep) Reinforcement Learning algorithms, by modifying the reward object to include an entropy regularization term. Haarnoja et al. provide real-world examples demonstrating strong robustness of the developed policies and strong benchmarking results. We sought ought to investigate whether a SAC policy learned on the normal version of the environment would be robust enough to clear the obstacles in the hardcore version of the environment. Our results clearly indicate that this is not the case, for reasons provided above.</p>

<p>Finally, I want to mention that Haarnoja et al. are of course not the only people investigating data inefficiency in deep reinforcement learning methods. Here are a few approaches, in case you want to do some additional googling: Task Simplification, Imitation Learning, Hindsight imagination, Hierarchical Reinforcement Learning…</p>

<p>All data and code is available <a href="https://github.com/frietz58/sac_blog_stuff" target="_blank">here</a>.</p>

<p>Cheers,<br />
<em>Finn</em>.</p>

<p><br /></p>

<p><br />
References: <br />
<a href="https://arxiv.org/abs/1812.05905" target="_blank" id="[1]">[1] Soft Actor Critic Algorithms and Applications, Haarnoja et al. </a><br />
<a href="/assets/img/rl_sac_analysis/Finn_Rietz_Soft_Actor_Critic_Deep_Reinforcement_Learning_for_Robotics_Paper.pdf" target="_blank" id="[2]">[2] My coursework paper on the SAC algorithm</a><br /> 
<a href="https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii" target="_blank" id="[3]">[3] DeepMind’s AlphaStar</a><br />
<a href="https://gym.openai.com/" target="_blank" id="[4]">[4] OpenAI Gym</a><br />
<a href="https://github.com/createamind/DRL" target="_blank">[5] Createamind DRL: SAC implementation</a><br />
<a href="https://arxiv.org/abs/1707.06347" target="_blank">[6] Proximal Policy Optimization Algorithms</a><br />
<a href="https://arxiv.org/abs/1801.01290" target="_blank" id="[7]">Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor</a><br />
<a href="https://arxiv.org/abs/1802.09477" target="_blank" id="[8]">[8] Addressing Function Approximation Error in Actor-Critic Methods</a><br />
<a href="http://incompleteideas.net/book/the-book-2nd.html" target="_blank" id="[9]">[9] Sutton and Barto: Reinforcement Learning: An Introduction</a><br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Machine Learning" /><category term="Python" /><category term="Python" /><category term="Reinforcement Learning" /><category term="Robotics" /><summary type="html"><![CDATA[Motivation and introduction The Soft Actor-Critic algorithm by Haarnoja et al. [1] has gotten a lot of coverage and attention in 2018 and 2019. And rightfully so. The paper proposes a very elegant solution to the notorious problem of deep reinforcement learning algorithms being too data-hungry for real-world feasibility and supplies very exciting examples illustrating the capabilities of the algorithm in a real-world setting, as can be seen below. Naturally, I was intrigued. While at the point of writing this post, Reinforcement Learning has not yet been featured on this site, it is, after all, my main academic interest and will be at the heart of my masters’ (and hopefully Ph.D.) thesis. Hence, for one of my courses, I decided to write a paper on the Soft Actor-Critic algorithm. In this blog post, I built on that paper [2] and provide some additional examples and insights.]]></summary></entry><entry><title type="html">Docker + ROS: How to listen to ROS nodes in External Docker Containers</title><link href="http://www.finnrietz.dev/linux/ros-docker/" rel="alternate" type="text/html" title="Docker + ROS: How to listen to ROS nodes in External Docker Containers" /><published>2020-08-20T00:00:00+00:00</published><updated>2020-08-20T18:50:37+00:00</updated><id>http://www.finnrietz.dev/linux/ros-docker</id><content type="html" xml:base="http://www.finnrietz.dev/linux/ros-docker/"><![CDATA[<h2 id="motivation">Motivation and use case</h2>

<p>Working with Pepper robots at the Knowledge Technology Research Group, we have the high-level goal of building Pepper into an interactive demonstration platform. One of the brainstormed requirements is that we want a Pepper robot to be capable of autonomously navigating our office space and gather the employees for a joint lunch break. To approach the mapping and localization problem at hand, we decided on employing the <a href="[1]"><i> Visual Simultaneous Localization and Mapping [1]</i></a> algorithm. Instead of implementing the algorithm from scratch, we chose the OpenVSLAM implementation by <a href="[2]">Sumikura [2]</a> et al, which happens to come with a Dockerfile. Thus, the objective is clear: Get Pepper’s sensor data with ROS (possibly do some data cleaning), then feed the data to the VSLAM algorithm, which is running in it’s Docker container. But getting the two technologies to work hand in hand is only trivial for people who have a deep understanding of both frameworks, which I didn’t initially have. Hence I want to share what took me roughly a day to figure out…
<br /></p>

<h2 id="problem">The problem</h2>

<p>Looking for how to approach the issue of reading ROS sensor data in Docker containers, I consulted the official documentation. <a href="[3]">ROS’s documentation regarding Docker [3]</a>, only shows us how to listen to ROS nodes/topics when the main <code class="language-plaintext highlighter-rouge">roscore</code> command is run inside the Docker container as well. That is not what I wanted though, because all our other projects were already implemented outside of Docker, we only needed Docker for one component: The VSLAM implementation. The documentation regarding <a href="[4]">Docker’s main ROS image [4]</a> didn’t help me either. 
Hence, to begin tackling this problem, the first step is clear: <code class="language-plaintext highlighter-rouge">roscore</code> must be running somewhere, since this is a requirement for every ROS based system:</p>

<div>
  <img src="/assets/img/docker-ros/roscore.png" />
  <figcaption>Classical roscore command.</figcaption><br />
</div>

<p>The next, similarly basic, step is to lunch our ROS docker image (in a new terminal), and try to start communicating with the running <code class="language-plaintext highlighter-rouge">roscore</code>. Following the instruction’s from <a href="#[3]"> [3]</a> again, we run the image and source the entrypoint. To test whether the communication between the external <code class="language-plaintext highlighter-rouge">roscore</code> and our docker image works, we use the <code class="language-plaintext highlighter-rouge">rosnode list</code> command, which lists all active nodes. Given that <code class="language-plaintext highlighter-rouge">roscore</code> is running, there should always, at least, be the <code class="language-plaintext highlighter-rouge">/rosout</code> node. However, as we can see, executing these steps yields <code class="language-plaintext highlighter-rouge">"ERROR: Unable to communicate with master!"</code></p>
<div>
  <img src="/assets/img/docker-ros/fail.png" />
  <figcaption>No communication with external roscore from docker container.</figcaption>
</div>

<p>Googling for this specific ROS error message reveals interesting and <a href="[5]">helpful threads [5]</a>, that point us into the right direction. Namely, the key problem is that within our docker container, we don’t find the <code class="language-plaintext highlighter-rouge">roscore</code> that is running on the main system. Hence, we need to set the right environment variable (<code class="language-plaintext highlighter-rouge">ROS_MASTER_URI</code> <a href="[6]">[6]</a>), that indicates where to find the running <code class="language-plaintext highlighter-rouge">roscore</code>. Luckily enough, the <code class="language-plaintext highlighter-rouge">roscore</code> command provides us with that information (consider the output from the <code class="language-plaintext highlighter-rouge">roscore</code> command above). In my case, the ROS master is located at <code class="language-plaintext highlighter-rouge">http://finn-ubuntu:11311/</code>. A quick look into <code class="language-plaintext highlighter-rouge">/etc/hosts</code> reveals the IP we that hides behind the local “finn-ubuntu” hostname:</p>

<div>
  <img src="/assets/img/docker-ros/etc_hosts.png" />
  <figcaption>/etc/hosts contains mapping from hostnames to IPs.</figcaption>
</div>

<p>However, as we can observe below, even setting the right environment variable within the docker container does not appear to solve the issue, we are still left with the same error as before:</p>
<div>
  <img src="/assets/img/docker-ros/fail_after_fix.png" />
  <figcaption>Still no comunication after setting the right environment variable.</figcaption>
</div>

<p>So what’s causing this? Are we approaching the error from the wrong side? Did we maybe just have a typo somewhere? And most importantly: How do we fix this and finally access our valuable ROS nodes/topics from within our docker container?</p>

<h2 id="solution">The solution</h2>

<p>Actually, the final (working) solution is very close to what we did previously. However, one crucial detail is missing: The fact that docker containers, per default, live in a <a href="[7]">virtual bridge network [7]</a>! The reason for this is explained in <a href="#[7]">[7]</a>:</p>

<blockquote>In terms of Docker, a bridge network uses a software bridge which allows containers connected to the same bridge network to communicate, while providing isolation from containers which are not connected to that bridge network. </blockquote>
<div>
  <img src="/assets/img/docker-ros/ip_addr.png" />
  <figcaption>ifconfig command revealing information about our network.</figcaption>
</div>

<p>While this is certainly a very powerfull and usefull concept, it is also apparent how this caused our earlier fix to fail. Inspecting the output from <code class="language-plaintext highlighter-rouge">ip addr</code> further illustrates this “problem”: We see all the network interfaces that are currently running on our computer. This usually includes at the very least <code class="language-plaintext highlighter-rouge">lo</code> loopback device (which function as a local virtual network and runs on 127.0.0.1) and the physical network adapter, usually called <code class="language-plaintext highlighter-rouge">en0</code> or in my case <code class="language-plaintext highlighter-rouge">enp8s0</code>. <br />
(Read more about <a href="#[8]">network interfaces [8]</a> and their <a href="#[9]">naming convention [9]</a>)<br />
Additionally, we see the <code class="language-plaintext highlighter-rouge">docker0</code> interface. This is the virtual bridge network mention above. Because of this, docker container can communicate with one another, but are isolated from the other host networks, including <code class="language-plaintext highlighter-rouge">lo</code>, our loopback device responsible for the local network on our host machine. As we have looked up earler, <code class="language-plaintext highlighter-rouge">roscore</code> runs on 127.0.1.1 (i.e. is running locally) and thus not visible from the <code class="language-plaintext highlighter-rouge">docker0</code> bridge.</p>

<p>However, our docker container of course has access to the internet, meaning we can access <code class="language-plaintext highlighter-rouge">enp8s0</code> from within docker. Thus, we can access our host machine via its ipv4 address, which is visible form within our subnet (subnet as in the network that all the devices use that are connected to your router).
Knowing that our <code class="language-plaintext highlighter-rouge">roscore</code> runs on port 11311, we again set the environment variable form within our docker container: <code class="language-plaintext highlighter-rouge">export ROS_MASTER_URI=http://192.168.10.27:11311/</code>. However this time, instead of using the <code class="language-plaintext highlighter-rouge">lo</code> network address (127.0.0.1), we used the <code class="language-plaintext highlighter-rouge">enp8s0</code> ip address of our machine.
And voilà, <code class="language-plaintext highlighter-rouge">rosnode list</code> displays the the <code class="language-plaintext highlighter-rouge">/rosout</code> node, which proves that communication with the <code class="language-plaintext highlighter-rouge">roscore</code>, from within the docker container, works!</p>

<div>
  <img src="/assets/img/docker-ros/success.png" />
  <figcaption>Working communication with roscore form within docker container.</figcaption>
</div>

<h2 id="elegant_solution">The more elegant solution</h2>
<p>The above doctrine deducts the solution mirroring my learning experience. However, after identifying the root of the problem and learning about the whole <a href="#[10]">docker networking thing</a>, I now know that there is a much more elegant solution to the problem. Turns out, the docker developers have considered that some people might need to be able to communicate with service running locally on the machine that also hosts/runs docker. For such a scenario, there exists a dedicated network driver, that we can pass to the docker container. This is as simple as passing the <a href="#[11]">following argument</a> to our docker call:<br /> <code class="language-plaintext highlighter-rouge">--network host</code>. And that’s is, by adding this argument to the command, you should be able to listen to ros nodes from within you docker containers right away.</p>

<h2 id="summary">Summary</h2>
<p>Per default, docker containers run in a virtual bridge network, isolating them from host networks like <code class="language-plaintext highlighter-rouge">lo</code>, making the localhost unaccessable. Docker provides a network driver that removes the isolation and makes the loopback device network accessable from within the docker container. This host networking driver can be activated with the <code class="language-plaintext highlighter-rouge">--network host</code> command line argument. Alternatively, the ioslation can be circumvented manually by using the <code class="language-plaintext highlighter-rouge">en0</code> network adapter ip address of the host machine (which is accessable within docker container for TCP/IP communication).</p>

<p>Cheers,<br />
<em>Finn</em>.</p>

<p><br /></p>

<p><br />
References: <br />
<a href="https://de.wikipedia.org/wiki/Simultaneous_Localization_and_Mapping" target="_blank" id="[1]">[1] Simultaneous Localization and Mapping</a><br />
<a href="https://github.com/xdspacelab/openvslam" target="_blank" id="[2]">[2] OpenVSLAM</a><br />
<a href="http://wiki.ros.org/docker/Tutorials/Docker" target="_blank" id="[3]">[3] ROS Docker documentation</a><br />
<a href="https://hub.docker.com/_/ros" target="_blank" id="[4]">[4] Docker ROS image</a><br />
<a href="https://answers.ros.org/question/43981/error-unable-to-communicate-with-master/" target="_blank">[5] Unable to communicate with master fix</a><br />
<a href="http://wiki.ros.org/ROS/EnvironmentVariables#ROS_MASTER_URI" target="_blank">[6] ROS MASTER URI</a><br />
<a href="https://docs.docker.com/network/bridge/" target="_blanck" id="[7]">[7] Virtual bridge network</a><br />
<a href="http://www.aboutlinux.info/2006/11/ifconfig-dissected-and-demystified.html" target="_blanck" id="[8]">[8] ifconfig command in depth</a><br />
<a href="https://unix.stackexchange.com/questions/134483/why-is-my-ethernet-interface-called-enp0s10-instead-of-eth0" target="_blanck" id="[9]">[9] enpXsxX naming convention</a><br />
<a href="https://docs.docker.com/network/" target="_blanck" id="[10]">[10] Docker networking guide</a><br />
<a href="https://docs.docker.com/network/network-tutorial-host/" target="_blanck" id="[11]">[11] Docker host network driver</a><br /></p>]]></content><author><name>Finn Rietz</name><email>finn.rietz@oru.se</email></author><category term="Linux" /><category term="Linux" /><category term="Docker" /><category term="ROS" /><category term="SLAM" /><summary type="html"><![CDATA[Motivation and use case]]></summary></entry></feed>