Finn Rietz.dev

Multi-Objective Deep Reinforcement Learning with Lexicographic Task-priority constraints

2023-10-19T00:00:00+00:00

Motivation and introduction

Deep Reinforcement Learning (RL) Doesn’t Work Yet. That’s what Alexander Irpan wrote in his famous blog post back in 2018. Sadly, five years later, RL researchers and practitioners are still struggling with the same challenges. Although we have seen some exciting advances and innovations, e.g. the Dreamer algorithm series, Decision Transformer, and SayCan, I believe most practitioners would agree with the following statements, which, in one way or another, were also made by Alexander Irpan:

DRL algorithms that learn to solve each task from scratch are still very sample-inefficient, often requiring millions of transitions before achieving acceptable levels of performance.
Designing scalar-valued reward functions for complex tasks that induce the desired behavior is difficult, with few general heuristics available.
The inherently unsafe exploration of trial-and-error-based learning algorithms and the black-box nature of the resulting DNN-based agents hinder the more widespread deployment of DRL in real-world applications.

Fortunately, with our recent paper (“Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning”), we address all of these pain points, at least to some extent. So, this hopefully got you interested in our work. In the next sections, I will provide an informal summary of our paper, starting with a recap of Multi-Objective Reinforcement Learning (MORL).

Multi-Objective Reinforcement Learning

In our paper, we want to solve special MORL problems, but we begin with a general definition of MORL problems. MORL problems are formalized by a Markov decision process (MDP), $\mathcal{M} \equiv (\mathcal{S}, \mathcal{A}, \mathbf{r}, p, \gamma)$, where $\mathcal{S} \subseteq \mathbb{R}^k$ and $\mathcal{A} \subseteq \mathbb{R}^l$ respectively denote the state- and action-space, $\mathbf{r}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$ denotes a vector-valued reward function, $p$ denotes the transition dynamics, and the discount factor is given by $\gamma \in [0, 1]$. Importantly, in MORL, each dimension $i$ of the vector-valued reward function $\mathbf{r}(\mathbf{s},\mathbf{a})$ corresponds to a scalar-valued subtask reward function, meaning we have $\mathbf{r}(\mathbf{s},\mathbf{a})_{[i]} = r_i(\mathbf{s},\mathbf{a}) \in \mathbb{R}$. Further, directly maximizing the vector-valued reward function is usually not possible, since vectors are only partially ordered and there is generally no unique, optimal policy for a vector-valued reward function. For this reason, MORL algorithms usually rely on scalarization: the vector-valued reward function is scalarized via a weighted sum with coefficients $\beta_i$ (not all problems can be modeled this way), and the resulting scalar-valued reward function $r(\mathbf{s}, \mathbf{a}) = \sum_{i=1}^n \beta_i r_i(\mathbf{s}, \mathbf{a})$ can then be maximized with any classical RL algorithm. Building on this formalism, in the next section we introduce lexicographic MORL problems, which are the particular kind of MORL problems we are interested in.
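To make the scalarization step concrete, here is a minimal sketch in plain Python. The reward values and the two weightings are made up for illustration and do not come from the paper; the point is only that the same transition is judged very differently depending on the chosen weights.

```python
from typing import Sequence


def scalarize(reward_vector: Sequence[float], weights: Sequence[float]) -> float:
    """Linear scalarization: r(s, a) = sum_i beta_i * r_i(s, a)."""
    if len(reward_vector) != len(weights):
        raise ValueError("need exactly one weight per subtask reward")
    return sum(beta * r for beta, r in zip(weights, reward_vector))


# Hypothetical two-subtask reward vector for a single transition:
# r_1 = -1.0 (e.g. a collision penalty), r_2 = 0.5 (e.g. progress to a goal).
r_vec = [-1.0, 0.5]

# Two hypothetical weightings that judge the same transition differently.
r_safety_heavy = scalarize(r_vec, [10.0, 1.0])  # strongly negative
r_goal_heavy = scalarize(r_vec, [0.1, 1.0])     # slightly positive
```

The sign of the scalarized reward flips between the two weightings, which is exactly the sensitivity to weights and reward scale that makes manual scalarization tedious.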

Lexicographic MORL

In our paper, we are interested in solving special MORL problems with continuous state- and action-spaces, like multi-objective robot control problems. In particular, the special MORL problems we are interested in are referred to as lexicographic or task-prioritized MORL problems. These are MORL problems where the subtasks are ordered by priority, meaning lexicographic MORL problems can model that some subtask $r_i$ is more important, or of higher priority, than some other subtask $r_j$. In this blog post and in our paper, we use the symbol $\succ$ (“succeeds”) to denote anything involving lexicographic task priorities. For example, $r_{1\succ2}$ means a lexicographic MORL task where subtask $r_1$ is of higher priority than subtask $r_2$. Formally, a lexicographic MORL problem is given by an MDP $\mathcal{M}_\succ \equiv (\mathcal{S, A}, \mathbf{r}, p, \gamma, o, \varepsilon)$, where $o = \langle 1, 2, \dots i \dots n \rangle$ specifies the priority-order of subtasks and $\varepsilon = \langle \varepsilon_1, \varepsilon_2, \dots \varepsilon_i \dots \varepsilon_n \rangle$ are certain threshold variables (more on those later). Thus, lexicographic MORL makes for a natural and intuitive way of specifying complex tasks, by specifying a priority order of simpler subtasks. Arguably, this is more convenient and more intuitive than carefully designing a scalar-valued reward function that induces some complex behavior.

Solving a lexicographic MORL problem means finding a policy that is optimal for the highest-priority subtask, while for each lower-priority subtask, it is as good as possible subject to the constraint of not worsening the performance of any of the higher-priority subtasks. More formally, this means that policy search for each subtask $i$ is constrained to a set $\Pi_i$, which contains only those policies that are also optimal for each higher-priority task $\{1, \dots, i-1\}$.

In practice, we allow for some worsening of higher-priority subtask performance $J$, since otherwise, the set $\Pi_i$ would only contain the optimal policies for the higher-priority subtasks. How much we allow the performance of higher-priority subtasks to worsen is defined by the aforementioned $\varepsilon_i$ thresholds. Thus, in a lexicographic MORL problem, policy search for subtask $i$ is constrained to the set

\[\Pi_{i} = \{ \pi \in \Pi_{i-1} \mid \underset{\pi' \in \Pi_{i-1}}{\max} J_{i-1}(\pi') - J_{i-1}(\pi) \le \varepsilon_{i-1} \}. \tag{1}\]
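For a finite set of candidate policies, the nested sets from Equation (1) can be computed by simple filtering. The following is only a toy sketch: the policy names, their performance values $J_i$, and the threshold are invented for illustration.

```python
from typing import Dict, List


def lexicographic_policy_sets(
    returns: Dict[str, List[float]], eps: List[float]
) -> List[List[str]]:
    """Return [Pi_1, Pi_2, ...] for a finite set of candidate policies.

    returns[name][k] is the performance of policy `name` on the subtask with
    priority k + 1, and eps[k] is the slack allowed for that subtask.
    """
    current = sorted(returns)  # Pi_1: all candidate policies
    sets = [current]
    for k, slack in enumerate(eps):
        best = max(returns[p][k] for p in current)
        # Keep policies that are at most `slack` worse than the best one.
        current = [p for p in current if best - returns[p][k] <= slack]
        sets.append(current)
    return sets


# Hypothetical performances (J_1, J_2) of three candidate policies.
returns = {"A": [10.0, 1.0], "B": [9.5, 4.0], "C": [7.0, 9.0]}

pi_1, pi_2 = lexicographic_policy_sets(returns, eps=[1.0])
best_for_task_2 = max(pi_2, key=lambda p: returns[p][1])
```

Policy `C` is best for the second subtask in isolation, but it is removed from $\Pi_2$ because it is too far from optimal on the first subtask, so the search for the second subtask only considers `A` and `B`.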

Unfortunately, computing the set $\Pi_i$ and performing policy search is intractable for continuous state-action-space MDPs. However, there are other ways for optimizing policies subject to lexicographic task-priority constraints, as we will see in Section Our method: Prioritized Soft Q-Decomposition. Before we can describe our algorithm, however, we require the following two, additional background sections.

Q-Decomposition

A useful technique for solving (lexicographic) MORL problems is Russell and Zimdars’ Q-Decomposition formulation. The paper states that for scalarizable, vector-valued reward functions, the Q-function can be decomposed into $n$ local Q-functions, each corresponding to one subtask. This means that for the $n$ subtask reward functions in ${r}(\mathbf{s}, \mathbf{a})$ and a policy $\pi$, we can learn $n$ Q-functions

\[Q_i^\pi (\mathbf{s}, \mathbf{a}) = r_i(\mathbf{s}, \mathbf{a}) + \gamma Q_i^\pi(\mathbf{s}', \pi(\mathbf{s}')), \forall i \in \{1,\dots,n\} \tag{2}\]

and reconstruct the Q-function for the overall, scalarized MORL problem as

\[{Q}^\pi(\mathbf{s}, \mathbf{a}) = \sum_{i=1}^n \beta_i Q_i^\pi(\mathbf{s}, \mathbf{a}). \tag{3}\]

This result is useful because it allows us to learn these Q-functions separately and concurrently (with some caveats like the tragedy of the commons, more on this later). Furthermore, we can potentially transfer and re-use the constituent Q-functions $Q_i^\pi(\mathbf{s}, \mathbf{a})$ for different MORL tasks. Lastly, the decomposed nature of the Q-Decomposition methods benefits the interpretability of the RL agent, since we can inspect the different components that jointly induce the behavior of the agent, which is not the case with classical, non-decomposed agents. Due to these desirable attributes of the Q-Decomposition method, we are building on and extending the Q-Decomposition framework in our paper. In particular, we extend Q-Decomposition to soft Q-Decomposition via MaxEnt RL and apply it in the context of continuous action-space MDPs with lexicographic subtask priorities. Thus, in the next section, we briefly review MaxEnt RL and soft Q-Learning, as final building blocks for our method.
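The decomposition in Equations (2) and (3) can be checked numerically on a tiny tabular problem. The sketch below assumes a made-up deterministic MDP with three states, two actions, a fixed policy, and two invented reward tables; it evaluates each subtask separately and compares the weighted sum with the Q-function of the scalarized reward.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2
NEXT_STATE = [[1, 2], [2, 0], [0, 1]]  # NEXT_STATE[s][a], hypothetical dynamics
POLICY = [0, 1, 0]                     # fixed deterministic policy pi(s)

# Hypothetical subtask reward tables r_1(s, a), r_2(s, a) and weights beta.
R1 = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
R2 = [[0.0, 3.0], [1.0, 0.0], [0.0, -1.0]]
BETA = [0.7, 0.3]


def evaluate_q(reward, n_sweeps=500):
    """Iterative policy evaluation of Q^pi for one scalar reward table."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_sweeps):
        q = [
            [
                reward[s][a]
                + GAMMA * q[NEXT_STATE[s][a]][POLICY[NEXT_STATE[s][a]]]
                for a in range(N_ACTIONS)
            ]
            for s in range(N_STATES)
        ]
    return q


def weighted_sum(table_1, table_2):
    """Elementwise beta_1 * table_1 + beta_2 * table_2."""
    return [
        [BETA[0] * table_1[s][a] + BETA[1] * table_2[s][a] for a in range(N_ACTIONS)]
        for s in range(N_STATES)
    ]


q_recomposed = weighted_sum(evaluate_q(R1), evaluate_q(R2))  # Equation (3)
q_direct = evaluate_q(weighted_sum(R1, R2))                  # scalarized reward

max_gap = max(
    abs(q_direct[s][a] - q_recomposed[s][a])
    for s in range(N_STATES)
    for a in range(N_ACTIONS)
)
```

Because all constituent Q-functions are evaluated under the same policy, the gap is zero up to floating-point error. The caveats mentioned above arise when each constituent Q-function is instead learned as if it alone controlled the agent.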

MaxEnt RL

Maximum Entropy (MaxEnt) RL, essentially, regularizes policies by punishing policies that are unnecessarily deterministic. This is achieved by adding Shannon’s entropy $\mathcal{H}(\mathbf{a}_t \mid \mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi(\mathbf{a}_t \mid \mathbf{s}_t)}[-\log \pi(\mathbf{a}_t \mid \mathbf{s}_t)]$ to the reward signal, meaning the optimal MaxEnt policy is given by

\[\pi^*_\text{MaxEnt} = \underset{\pi}{\arg \max} \sum^\infty_{t=1} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\bigg[\gamma^{t-1} \big( r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \mathcal{H}(\mathbf{a}_t \mid \mathbf{s}_t) \big)\bigg], \tag{4}\]

where $\rho_\pi$ denotes the state-action marginal induced by the policy $\pi$ and $\alpha$ is a coefficient that trades off the reward and the entropy signal. The entropy regularization results in the following, energy-based Boltzmann distribution as optimal policy:

\[\pi^*_{\text{MaxEnt}}( \mathbf{a}_t \mid \mathbf{s}_t) = \exp \big( Q^*_{\text{soft}} (\mathbf{s}_t, \mathbf{a}_t) - V^*_{\text{soft}} (\mathbf{s}_t) \big), \tag{5}\]

with the optimal soft value and Q-function given by

\[V^*_\text{soft}(\mathbf{s}_t) = \log \int_\mathcal{A} \exp ( Q^*_\text{soft}(\mathbf{s}_t, \mathbf{a}^\prime) ) \, d \mathbf{a}^\prime , \tag{6}\]

and

\[Q^*_\text{soft}(\mathbf{s}_t, \mathbf{a}_t) = r(\mathbf{s}_t, \mathbf{a}_t) + \\ \mathbb{E}_{(\mathbf{s}_{t+1}, \dots) \sim \rho_\pi} \bigg[ \sum_{l=1}^\infty \gamma^l \big( r(\mathbf{s}_{t+l}, \mathbf{a}_{t+l}) + \mathcal{H}(\mathbf{a}_{t+l} \mid \mathbf{s}_{t+l}) \big) \bigg] . \tag{7}\]
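For a finite action set, the integral in Equation (6) becomes a sum, and Equations (5) and (6) can be written in a few lines. This is only a discrete sketch with made-up Q-values; in continuous action spaces the integral has to be approximated by sampling.

```python
import math
from typing import List


def soft_value(q_values: List[float]) -> float:
    """Discrete analogue of Eq. (6): V(s) = log sum_a exp(Q(s, a))."""
    q_max = max(q_values)  # subtract the max for numerical stability
    return q_max + math.log(sum(math.exp(q - q_max) for q in q_values))


def boltzmann_policy(q_values: List[float]) -> List[float]:
    """Discrete analogue of Eq. (5): pi(a | s) = exp(Q(s, a) - V(s))."""
    v = soft_value(q_values)
    return [math.exp(q - v) for q in q_values]


# Hypothetical soft Q-values of three actions in some state s.
q_s = [1.0, 2.0, 0.5]
v_s = soft_value(q_s)
pi_s = boltzmann_policy(q_s)
```

Subtracting $V(\mathbf{s})$ in the exponent is what normalizes the distribution: the entries of `pi_s` sum to one, and actions with higher Q-values receive exponentially more probability mass without the policy becoming deterministic.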

In Equation (5), the soft Q-function serves as negative energy for the Boltzmann distribution and the soft value function serves as the log partition function, which means that for sampling, we can ignore the value function and directly sample from the unnormalized density

\[\pi^*_{\text{MaxEnt}}( \mathbf{a}_t \mid \mathbf{s}_t) \propto \exp \big( Q^*_{\text{soft}} (\mathbf{s}_t, \mathbf{a}_t) \big), \tag{8}\]

for example with Monte Carlo methods or importance sampling. For the rest of this post, we will drop the soft subscript to avoid visual clutter. Importantly, Q-Decomposition is still possible in MaxEnt RL, meaning we can also decompose and reconstruct a soft, multi-objective Q-function as described before (we just need a way to sample from the reconstructed, soft Q-function). In principle, we can make use of soft Q-Learning (SQL) to directly obtain $Q^*_i$, the optimal, soft constituent Q-functions for Q-Decomposition. However, constituent Q-functions obtained this way have a flaw: they suffer from the tragedy of the commons, as described by Russell and Zimdars. Since the constituent Q-functions are learned using off-policy SQL, each of them essentially assumes complete control over the MDP (an illusion of control) and therefore learns inconsistent, greatly over-estimated Q-values. In effect, this means that the constituent Q-functions obtained via SQL will not sum up to the optimal Q-function for the scalarized MORL problem. In our algorithm, which we finally describe in the next section, we have a neat way of addressing this issue, though.

Our method: Prioritized Soft Q-Decomposition (PSQD)

Let’s recap: We want to make it easier to design reward functions that induce arbitrary, complex behavior, while also addressing the sample inefficiency, unsafe exploration, and lack of interpretability of DRL algorithms.

The first part takes care of itself as soon as one rejects scalar reward-function engineering in favor of lexicographic constraints, which are much easier and more intuitive to define. Instead of manually searching for just the right weighting coefficients for some MORL problem to achieve the desired behavior, lexicographic MORL only requires defining the subtask priority (and some slack scalars). The framework is insensitive w.r.t. subtask reward scale, for reasons that will become clear later. To address the sample inefficiency, unsafe exploration, and lack of interpretability of DRL algorithms, we propose a novel learning algorithm, PSQD, for continuous action-space lexicographic MORL problems that yields interpretable, transferable agents and components.

Recall that solving lexicographic MORL problems, which are defined by lexicographic MDPs $\mathcal{M}_\succ$, essentially corresponds to policy search in $\Pi_i$, the set of lexicographically optimal policies. Recall also that computing the set $\Pi_i$ is intractable. Thus, we instead use a local, state-based version of the lexicographic constraint from Equation (1):

\[\max_{\mathbf{a}^\prime \in \mathcal{A}} Q_i(\mathbf{s}, \mathbf{a}^\prime) - Q_i(\mathbf{s}, \mathbf{a}) \leq \varepsilon_i, \\ \forall \mathbf{a} \sim \pi_\succ, \\ \forall \mathbf{s} \in \mathcal{S}, \\ \forall i \in \{1, \dots, n - 1\}. \tag{9}\]

In this form of the lexicographic constraint, action selection, in every state, is constrained to actions whose Q-values are at most $\varepsilon_i$ worse than that of the optimal action, for all $i-1$ higher-priority tasks. This makes for a binary mask over the action space, where some subset of actions $\mathcal{A}_{\succ i}$ is allowed when optimizing task $r_i$, since it satisfies the above constraint, while the remaining actions are forbidden. In our paper, we refer to this subset as the indifference-space of task $i$, since with respect to the constraint, the task is indifferent as to which of the near-optimal actions in $\mathcal{A}_{\succ i}$ is executed. As a concrete example, consider the following images. We have an obstacle-avoidance and goal navigation environment (first image), with a point-mass agent whose 2D action-space corresponds to increments in the $xy$-plane. We now train the agent on the first task, i.e. the highest priority task, which here corresponds to avoiding the obstacle. The learned Q-function is shown in the center image. Now, based on the learned Q-function, we can visualize the lexicographic constraint and the indifference space (last image), with the agent placed at the position indicated by the red dot. As can be seen, the lexicographic constraint forbids all those actions that would lead to a collision. The remaining, permitted actions can be used for optimizing lower-priority tasks, like navigating to the top goal area.

2D navigation example. The agent learns a Q-function for avoiding the obstacle, from which we infer the local indifference space.
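For a discretized action set, the binary mask described above is a one-line comparison against the best Q-value. The following sketch uses four hypothetical actions and invented Q-values of an obstacle-avoidance task; it is meant as an illustration of the local constraint in Equation (9) for a single higher-priority task, not as the continuous-action procedure from the paper.

```python
from typing import List


def indifference_mask(q_values: List[float], eps: float) -> List[bool]:
    """True for every action a with max_a' Q(s, a') - Q(s, a) <= eps."""
    q_best = max(q_values)
    return [q_best - q <= eps for q in q_values]


# Hypothetical Q_1(s, .) of the obstacle-avoidance task in one state;
# moving "up" would lead the agent into the obstacle.
ACTIONS = ["left", "right", "up", "down"]
q1_s = [0.0, -0.2, -10.0, -0.5]

mask = indifference_mask(q1_s, eps=1.0)
allowed = [name for name, ok in zip(ACTIONS, mask) if ok]
```

With a threshold of 1.0, only the colliding action is forbidden. A smaller threshold shrinks the indifference space further, leaving less freedom for lower-priority tasks.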

By relying on the local form of the lexicographic constraint in (9) and the resulting indifference space, we eliminate the need for computing the intractable set $\Pi_i$ for policy search. This is because instead of computing $\Pi_i$, the action indifference space $\mathcal{A}_{\succ i}$ gives rise to a new MDP $\mathcal{M}_{\succ i}$, which uses the scalar reward $r_i$ and whose action space no longer corresponds to $\mathcal{A}$, but to the indifference space $\mathcal{A}_{\succ i}$. In this new MDP $\mathcal{M}_{\succ i}$ we can perform unconstrained policy search to optimize task $r_i$, since the lexicographic constraint is moved into the action space and thereby always satisfied by construction. This is the short version. The (very cool and intuitive) mathematical derivation and justification of this approach can be found in Section A of the supplementary material of our paper.

Based on this insight, we propose our learning algorithm, Prioritized Soft Q-Decomposition (PSQD), for continuous action-space lexicographic MORL tasks. PSQD combines Q-Decomposition with Soft Q-Learning by first pre-training on all subtasks $r_1, \dots, r_n$ of the lexicographic MORL problem. This way, we obtain $n$ soft Q-functions $Q_1^*, \dots, Q_n^*$ which we can then use to zero-shot the Q-function $Q_\succ$ of the overall, lexicographic MORL problem. The agent using the zero-shot Q-function $Q_\succ$ respects the lexicographic constraints. However, it does not behave optimally w.r.t. the overall, lexicographic MDP $\mathcal{M}_\succ$, since the constituent Q-functions $Q_1^*, \dots, Q_n^*$ were pre-trained separately and suffer from the aforementioned illusion of control. The effect of this can be observed in the following, intermediate result:

Zeroshot experiment. The agent respects the lexicographic constraint and avoids the obstacle, but greedily navigates to the top, getting stuck inside the obstacle.

Here, the first image again shows the obstacle avoidance Q-function $Q_1^*$. The second image corresponds to the top goal reaching Q-function $Q_2^*$. In the last image, we visualize the policy (colored background) and rollouts from the zeroshot agent, obtained as $Q_{1\succ2} = Q_1^* + Q_2^*$ via Q-Decomposition. As can be seen, the zeroshot agent respects the lexicographic constraint and avoids colliding with the obstacle, but greedily navigates to the top and therefore gets stuck inside the obstacle.

Notice that this result is expected and positive. Due to the lexicographic constraint, we know that the agent cannot collide with the obstacle. In our paper, we show that PSQD assigns zero likelihood to actions that violate the lexicographic constraint. This is in stark contrast to standard MORL algorithms that rely on scalarization, where the learned agent’s behavior is largely dictated by reward scale, with zero guarantees on resulting behavior.
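The zero-likelihood property can be illustrated on a discretized action set. The sketch below is a simplification under stated assumptions: the actions and Q-values are invented, and the policy is taken to be a Boltzmann distribution over the lower-priority Q-values restricted to the indifference space of the higher-priority task. The exact form of $Q_\succ$ used by PSQD is given in the paper.

```python
import math
from typing import List


def masked_boltzmann(q_values: List[float], mask: List[bool]) -> List[float]:
    """Boltzmann policy over allowed actions; forbidden actions get probability 0."""
    allowed_q = [q for q, ok in zip(q_values, mask) if ok]
    if not allowed_q:
        raise ValueError("indifference space is empty")
    q_max = max(allowed_q)  # subtract the max for numerical stability
    z = sum(math.exp(q - q_max) for q in allowed_q)
    return [math.exp(q - q_max) / z if ok else 0.0 for q, ok in zip(q_values, mask)]


ACTIONS = ["left", "right", "up", "down"]
q1_s = [0.0, -0.2, -10.0, -0.5]  # hypothetical obstacle-avoidance values
q2_s = [0.1, 0.3, 5.0, -1.0]     # hypothetical goal-reaching values, "up" looks best
eps_1 = 1.0

# Local lexicographic constraint for the higher-priority task.
mask = [max(q1_s) - q <= eps_1 for q in q1_s]

pi = masked_boltzmann(q2_s, mask)
most_likely_action = ACTIONS[pi.index(max(pi))]
```

Although "up" has by far the highest value for the goal-reaching task, it receives exactly zero probability because it violates the constraint, and the agent falls back to the best permitted action.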

However, we are of course interested in obtaining the optimal solution to the overall, lexicographic MORL problem. That’s why PSQD uses the zeroshot composition merely as a starting point and subsequently continues improving performance by finetuning the constituent Q-functions, thereby learning the optimal solution to the overall, lexicographic MORL problem. Concretely, we finetune the constituent Q-functions by iteratively performing soft Q-Learning in the transformed MDP $\mathcal{M}_{\succ i}$. Since the highest priority task is not affected by the lexicographic constraint (there are no higher-priority tasks that constrain it), in the first iteration, PSQD finetunes the task with the second highest priority by performing SQL in $\mathcal{M}_{\succ 2}$. This means updating $Q_2^*$ to $Q_{\succ 2}^*$, which is no longer optimal for $r_2$ but solves $r_2$ as well as possible while respecting the lexicographic constraint. This involves the following backup operator

\[\mathcal{T}Q(\mathbf{s}, \mathbf{a}) \triangleq r(\mathbf{s}, \mathbf{a}) + \gamma \mathbb{E}_{\mathbf{s}^\prime \sim p} \bigg[ \underbrace{\log \int_{\mathcal{A}_\succ (\mathbf{s}^\prime)} \exp \big( Q(\mathbf{s}^\prime, \mathbf{a}^\prime)\big) d \mathbf{a}^\prime}_{V(\mathbf{s}^\prime)} \bigg], \tag{10}\]

where the log-sum-exp expression is not over the entire action space but over the indifference space $\mathcal{A}_{\succ i}$, since we are in the lexicographic MDP $\mathcal{M}_{\succ i}$. This backup operator is approximated with the following stochastic optimization

\[J_Q(\theta) = \mathbb{E}_{\mathbf{s}_t, \mathbf{a}_t \sim \mathcal{D}} \Bigg[ \frac{1}{2} \Big ( Q^\theta_{n}(\mathbf{s}_t, \mathbf{a}_t) - r_n(\mathbf{s}_t, \mathbf{a}_t) \\ - \gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p} \big[ V_n^{\bar{\theta}}(\mathbf{s}_{t+1}) \big] \Big)^2 \Bigg ], \tag{11}\]

which is the well-known SQL update, with $V_n^{\bar{\theta}}$ being the empirical approximation of Equation (6) with a target network, parameterized by $\bar{\theta}$, in $\mathcal{A}_{\succ i}$. More details about our learning algorithm can of course be found in our paper.
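To illustrate the constrained backup, here is a tabular sketch with a discrete action set, where the integral over the indifference space becomes a log-sum-exp over the allowed actions. The toy MDP, its rewards, and the action mask are all made up, and the table lookup stands in for the neural network and target network used in practice.

```python
import math


def masked_soft_value(q_row, mask_row):
    """Discrete V(s') from Eq. (10): log-sum-exp over allowed actions only."""
    allowed = [q for q, ok in zip(q_row, mask_row) if ok]
    q_max = max(allowed)  # subtract the max for numerical stability
    return q_max + math.log(sum(math.exp(q - q_max) for q in allowed))


def soft_backup(q, reward, next_state, mask, gamma):
    """Apply the backup once to every (s, a) of a tabular Q-function."""
    return [
        [
            reward[s][a]
            + gamma * masked_soft_value(q[next_state[s][a]], mask[next_state[s][a]])
            for a in range(len(q[s]))
        ]
        for s in range(len(q))
    ]


def bellman_loss(q, q_target, batch, reward, next_state, mask, gamma):
    """Mean of 0.5 * (Q(s, a) - (r(s, a) + gamma * V_target(s')))^2 over a batch."""
    total = 0.0
    for s, a in batch:
        s_next = next_state[s][a]
        target = reward[s][a] + gamma * masked_soft_value(q_target[s_next], mask[s_next])
        total += 0.5 * (q[s][a] - target) ** 2
    return total / len(batch)


# Hypothetical toy problem: 2 states, 2 actions, deterministic transitions.
GAMMA = 0.9
REWARD = [[0.0, 1.0], [0.5, -1.0]]
NEXT_STATE = [[0, 1], [0, 1]]
MASK = [[True, False], [True, True]]  # action 1 is forbidden in state 0
BATCH = [(0, 0), (1, 0), (1, 1)]      # only permitted (state, action) pairs

q = [[0.0, 0.0], [0.0, 0.0]]
loss_before = bellman_loss(q, q, BATCH, REWARD, NEXT_STATE, MASK, GAMMA)
for _ in range(500):
    q = soft_backup(q, REWARD, NEXT_STATE, MASK, GAMMA)
loss_after = bellman_loss(q, q, BATCH, REWARD, NEXT_STATE, MASK, GAMMA)
```

Repeated application of the backup drives the squared residual towards zero. Since the forbidden action never enters the soft value of state 0, its (possibly over-estimated) Q-value cannot leak into the values of other state-action pairs.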

Applying this procedure once to finetune the Q-function of the goal navigation task, we obtain the following result:

Finetuning experiment. The agent has learned how to solve the lexicographic MORL task optimally, navigating out of and around the obstacle.

Here, the greedy, pre-trained constituent Q-function $Q_2^*$ in the first image is finetuned as described above, to the optimal constituent Q-function $Q_{\succ 2}^*$ shown in the second image. Using the finetuned Q-function for the second task, we again apply Q-Decomposition to obtain the optimal Q-function for the lexicographic MORL problem as $Q_\succ^* = Q_1^* + Q_{\succ 2}^*$. The policy and rollouts from this agent are shown in the last image. As can be seen, the agent has learned to avoid the obstacle and to drive out and around it, to reach the top goal area.

The following image shows a visual summary of our learning algorithm, PSQD:

PSQD, a visual overview.

This method trivially extends to more than two tasks. We can add a third subtask $r_3$, corresponding, for example, to reaching the right-hand side of the environment. With these three subtasks, $r_1$ for obstacle avoidance, $r_2$ for reaching the top part of the environment, and $r_3$ for reaching the right side of the environment, we can make multiple lexicographic MORL problems, by defining different priority orderings. For example, we can keep the highest-priority subtask (obstacle avoidance) fixed but vary the priority of the top- and side-reach subtasks. That is, we can either have the lexicographic MORL problem $r_{1\succ 2\succ 3}$, where the top-reach subtask $r_2$ has higher priority than the side-reach subtask $r_3$, or we can make the lexicographic MORL problem $r_{1 \succ 3 \succ 2}$, where reaching the side has higher priority than reaching the top.

In either case, we can pre-train on all subtasks separately, and transfer the resulting constituent Q-functions via Q-Decomposition to the lexicographic MORL tasks, where they are subsequently finetuned using PSQD. This results in the following, differing behaviors:

Differing behaviors depending on lexicographic task priority order. (Lazy screenshot from the paper, apologies for the poor quality).

As can be seen, depending on the lexicographic task priority order, the resulting agent either first moves to the top, then to the side in image (a), or first to the side, then to the top, in image (b). The bottom row images visualize the indifference spaces of the constituent Q-functions. Both tasks share the obstacle avoidance component $\mathcal{\bar{A}}_{\succ 1}$, but have different constraints for the additional task that is varied between the two conditions, $\mathcal{\bar{A}}_{\succ 2}$ in (c) and $\mathcal{\bar{A}}_{\succ 3}$ in (d). This aims to illustrate how lexicographic constraints can easily and intuitively be used to induce different, complex behaviors.
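The effect of the priority order can be mimicked in a single state with a discretized action set. The sketch below is a deliberately simplified, greedy stand-in for the learned agents: the task names, Q-values, and thresholds are invented, each higher-priority constraint uses the maximum over the full action set as in Equation (9), and the lowest-priority task simply picks its best permitted action.

```python
from typing import Dict, List


def lexicographic_greedy_action(
    q_by_task: Dict[str, List[float]], order: List[str], eps: Dict[str, float]
) -> int:
    """Index of the best action for the last task among actions permitted by
    the constraints of all tasks that precede it in `order`."""
    n_actions = len(q_by_task[order[0]])
    allowed = [True] * n_actions
    for task in order[:-1]:
        q = q_by_task[task]
        q_best = max(q)  # maximum over the full action set, as in Eq. (9)
        allowed = [ok and (q_best - q_a <= eps[task]) for ok, q_a in zip(allowed, q)]
    candidates = [a for a in range(n_actions) if allowed[a]]
    if not candidates:
        raise ValueError("no action satisfies all higher-priority constraints")
    q_last = q_by_task[order[-1]]
    return max(candidates, key=lambda a: q_last[a])


ACTIONS = ["up", "right", "down", "left"]
# Hypothetical Q-values in one state; moving "left" would hit the obstacle.
Q = {
    "avoid": [0.0, 0.0, 0.0, -10.0],
    "top": [1.0, 0.0, -1.0, 0.0],
    "side": [0.0, 1.0, 0.0, -1.0],
}
EPS = {"avoid": 0.5, "top": 0.5, "side": 0.5}

action_top_first = ACTIONS[lexicographic_greedy_action(Q, ["avoid", "top", "side"], EPS)]
action_side_first = ACTIONS[lexicographic_greedy_action(Q, ["avoid", "side", "top"], EPS)]
```

With the same three Q-functions, swapping the priority of the two reach tasks changes the selected action from "up" to "right", while the colliding action stays forbidden in both cases.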

Lastly, to demonstrate the efficacy of our method in high-dimensional settings, or, a bit more colloquially, to show that our method scales, we perform a simulated joint-control experiment. Here, the action space is in $\mathbb{R}^9$ and corresponds to the joint torques of a Franka Emika Panda robot. The higher-priority task, $r_1$, corresponds to avoiding a certain subspace of the workspace (red area), while the lower-priority task, $r_2$, corresponds to reaching a certain end-effector position (green sphere). First, consider a standard MORL algorithm that relies on linear scalarization of the vector-valued reward function. The agent ignores the red area and greedily moves toward the target end-effector position:

This can happen due to poor reward scale or poorly chosen scalarization weights.

Let’s contrast this with our method. We simply define the lexicographic task priority $o = \langle r_1 \succ r_2\rangle$, set some low threshold, e.g. $\varepsilon_1 = 1$, and obtain the following (zero-shot) result:

Here, after separately learning the subtasks, even in the zero-shot setting, the agent does not enter the forbidden part of the workspace. To obtain the optimal agent for the lexicographic MORL task, i.e. reaching the target end-effector position while avoiding the forbidden part of the workspace, we perform our finetuning/adaptation step to learn the long-term consequences of the lexicographic constraint. This results in the desired behavior and verifies that our method is also applicable to MDPs with high-dimensional action spaces:

There are some additional, nice properties of our method that I have only mentioned briefly or skipped entirely in this blog post. Firstly, I want to mention how our method benefits sample-efficiency. PSQD transfers knowledge from simple subtasks to complex, lexicographic MORL problems. Thus, we are not learning the complex, lexicographic MORL problem from scratch. Rather, we transfer the pre-trained subtask solutions and perform a simple finetuning step to obtain the optimal solution to the lexicographic MORL task. In a nutshell, this implies that we only need to learn once, for example, how to avoid obstacles, and can then re-use this learned behavior every time we want to exploit it as part of a lexicographic MORL problem. Furthermore, each lexicographic task-priority constraint in effect “shrinks” the action/search space of the RL algorithm, which makes it easier to explore the MDP and to discover the optimal solution. Secondly, PSQD respects lexicographic priority constraints even during training and thereby makes for a safe exploration framework. This is again in stark contrast to standard MORL approaches that rely on scalarization and learn each problem from scratch. Lastly, PSQD benefits interpretability of the final agent, since we can inspect the constituent Q-functions and corresponding indifference spaces to understand the agent’s action selection process.

Summary and conclusion

And that’s it. This blog post presents a short summary of our recent work, Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning, which has been accepted at ICLR 2024. The take-away points are as follows:

Reject scalar reward engineering, and embrace lexicographic task-priority constraints. Lexicographic constraints are much easier to define and have additional benefits, compared to cumbersome, manual tuning of reward scale coefficients.
Our algorithm, PSQD, solves continuous action-space, lexicographic MORL problems, while exploiting the Q-Decomposition method to transfer knowledge from simple subtasks to complex, lexicographic MORL problems.

We are currently working on the successor paper, where we replace soft Q-Learning with a more stable DRL algorithm.

Cheers,
Finn.

Easy Language Learning with ChatGPT and Python

2023-09-10T00:00:00+00:00

Motivation and introduction

ChatGPT is everywhere, has disruptive potential, and can do many great things, given the right instructions (prompts). ChatGPT or other GPT models tend to perform well when the conversation is centered around a topic that is abundantly covered in their training data, just like most ML algorithms perform well when the test data is similar enough to the training data. As such, it is no surprise that ChatGPT can generate code for almost any programming language, with many blogs and documentation being publicly available. Although ChatGPT might struggle with generating correct code for sophisticated and complex programs, it knows about data structures and basic I/O stuff, so enough for small scripts. This makes it very easy to combine ChatGPT’s generative capabilities in virtually any topic with simple programs, which can be useful for many small projects.

To give one example of such a project, in the following, I show how to generate a deck of flashcards with the most useful, basic words in any language using ChatGPT and a few lines of Python code. The technical aspect of this is relatively trivial, however, I want to provide easy-to-replicate steps even for people that have no background in computer science.

ChatGPT as a language tutor

Since I am doing my Ph.D. in Sweden, without being a native speaker of any Scandinavian language, I am always looking for opportunities to practice Swedish. Since ChatGPT is capable of communicating in almost every language, it can also generate vocabulary lists in any language. For example, one can ask it to generate the 100 most useful words for a particular language, as well as the translation of those words into any other language. One can further ask it to also provide an example sentence for each word, as well as the translation of the example sentence. There are almost no limits here: as long as we provide the right prompt, ChatGPT will provide the corresponding result (which is sometimes better, sometimes worse).

For my particular use case with Swedish-English translations, I found ChatGPT to provide okay to decent translations. One should of course take everything that a GPT model generates with a grain of salt, but given my rudimentary understanding of Swedish, I deemed the output of sufficient quality to use for some additional studying. Given a vocabulary containing whatever we want to learn, the next step is to get ChatGPT to provide it in a data structure that is convenient to work with programmatically. In my case, I wanted to have the data stored in a list of Python dictionaries. Given clear instructions for how the list of dictionaries should be populated, ChatGPT will generate data that we can manipulate and make use of programmatically.

Here is a short demonstration of the ChatGPT conversation that I used to get a vocabulary of Swedish words and example sentences, as well as their English translation, stored in a list of Python dictionaries.
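Since generated output is not guaranteed to be well-formed, it can be worth checking the pasted text before using it. The following is a small sketch using only the standard library; the helper name `parse_word_list` is made up for this example, and the key names match the ones used in the script further below.

```python
import ast

# Keys every vocabulary entry is expected to have (same names as in the script).
REQUIRED_KEYS = {"swe_word", "eng_word", "swe_sentence", "eng_sentence"}


def parse_word_list(text: str) -> list:
    """Parse a Python-literal list of dicts and check that every entry is complete."""
    # literal_eval only evaluates literals, it never executes arbitrary code.
    data = ast.literal_eval(text.strip())
    if not isinstance(data, list):
        raise ValueError("expected a list of dictionaries")
    for idx, entry in enumerate(data):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {idx} is not a dictionary")
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {idx} is missing keys: {sorted(missing)}")
    return data


# Example text as it might be pasted from the conversation.
pasted = """
[
    {
        'swe_word': 'att',
        'eng_word': 'to',
        'swe_sentence': 'Jag vill att du ska göra det.',
        'eng_sentence': 'I want you to do it.',
    },
]
"""

word_list = parse_word_list(pasted)
```

If an entry is incomplete, the function raises a `ValueError` that names the offending entry, which is easier to debug than a `KeyError` in the middle of deck creation.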

Creating flashcards with Python

Given that we now have a list of Python dictionaries containing things we want to memorize, we can use the great, open-source tool Anki and the Python package genanki to automatically create flashcards. Anki is amazing, it even has a mobile app that can synchronize sets of flashcards across multiple devices. I have used it extensively throughout my university studies and cannot recommend it enough.

genanki makes it trivial to generate Anki decks using Python. Thus, all that is left to do is load the ChatGPT vocabulary into Python and create the flashcards using genanki. The GitHub repository provides a great example of how to do this, however, for completeness, here is the full script that I used:

import genanki
import random

word_list = [
    {
        'swe_word': 'att',
        'eng_word': 'to',
        'swe_sentence': 'Jag vill att du ska göra det.',
        'eng_sentence': 'I want you to do it.',
    },
    # replace this with your own data...
]


if __name__ == "__main__":
    random_model_id = random.randrange(1 << 30, 1 << 31)
    random_deck_id = random.randrange(1 << 30, 1 << 31)

    # genanki.Model defines the template for each flash card in our deck
    my_model = genanki.Model(
        random_model_id,  # needs to be unique
        'SWE & ENG Language learning model',
        fields=[
          {'name': 'swe_word'}, 
          {'name': 'eng_word'}, 
          {'name': 'swe_sentence'}, 
          {'name': 'eng_sentence'},
        ],
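        templates=[
          {
            'name': 'Card 1',
            # {{...}} placeholders are filled from the fields defined above
            'qfmt': 'What is the meaning of "<i>{{swe_word}}</i>"?<br><br>Example: <i>{{swe_sentence}}</i>',
            'afmt': '<hr id="answer">' + \
                    'The meaning of <i>"{{swe_word}}"</i> is <i>"{{eng_word}}"</i>.' + \
                    '<br><br>Example: <i>{{eng_sentence}}</i>',
          },
        ])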

    my_deck = genanki.Deck(
        random_deck_id,  # needs to be unique
        'Swedish & English Language learning'
    )

    for card in word_list:
        my_note = genanki.Note(
            model=my_model,
            fields=[card['swe_word'], card['eng_word'], card['swe_sentence'], card['eng_sentence']])

        my_deck.add_note(my_note)

    genanki.Package(my_deck).write_to_file('swe_eng.apkg')

As can be seen, the script simply iterates over the vocabulary, creates a flashcard for each dictionary in the list, and adds the card to the deck. The deck is then saved as a file and can be imported into Anki.
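One thing to be aware of: the script draws fresh random IDs on every run, so Anki will likely treat a re-generated deck as a new deck rather than an update of the old one. A possible alternative, sketched here with the standard library only, is to derive fixed IDs from a name. The helper `stable_id` and the name strings are made up for this example and are not part of genanki.

```python
import hashlib


def stable_id(name: str) -> int:
    """Map a name to a fixed integer in [2**30, 2**31), the range used by the script."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return (1 << 30) + int.from_bytes(digest[:4], "big") % (1 << 30)


# The same name always yields the same ID, across runs and machines.
MODEL_ID = stable_id("swe-eng-model")
DECK_ID = stable_id("swe-eng-deck")
```

These two constants could then be passed in place of `random_model_id` and `random_deck_id`. Simply generating the random numbers once and hard-coding them in the script achieves the same effect.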

This setup relies on manually copying the ChatGPT data into a Python file, which is a minor annoyance for perfectionistic programmers… Of course, this could be avoided by making use of ChatGPT’s API, but unfortunately, I haven’t gotten access yet, even after paying for ChatGPT premium, solely to gain API access. Oh well ¯\(ツ)/¯…

The resulting deck, displayed in the AnkiDroid app, looks as follows:

Example flashcard with ChatGPT generated content displayed in the AnkiDroid app. The deck is created with the code given above and synchronized to the mobile device via AnkiWeb.

Summary and conclusion

So, in summary, ChatGPT can be used to create vocabularies with translations between any pair of languages. These vocabularies can have arbitrary auxiliary information, like example sentences or different word forms. ChatGPT can then be asked to output these vocabularies in convenient data structures, like Python dictionaries.

The open-source tool Anki for flashcard learning can be used to study these vocabularies. The easy-to-use Python package genanki makes it trivial to generate Anki decks programmatically, for example using the above-given script.

And that’s it, I hope you found this post interesting and perhaps useful for your own studies :)

Cheers,
Finn.

Fun with Graph Neural Networks

2023-07-02T00:00:00+00:00

Motivation and introduction

Recently, I became interested in Graph Neural Networks (GNNs) because they are a trending topic in the Machine Learning / Deep Learning community. Thus, in this blog post, I briefly introduce and motivate GNNs based on my novice-level understanding of them. Then, I share some results where I analyze and compare GNNs with classical Multilayer Perceptrons (MLPs) on a simple graph classification task.

Graph Neural Networks: What, why, and how?

So what are GNNs, why do we need them and how do they fundamentally work? GNNs are, broadly and informally speaking, Neural Networks (NNs) that can process and exploit non-Euclidean data structures, like graphs.
Classical NNs can not capture and exploit the complex structure of non-Euclidean data because they assume fixed-dimensional, grid-like (aka Euclidean) data. MLPs, for example, assume that their input features come from arbitrary-but-fixed-dimensional, continuous, real-valued (aka Euclidean) vector spaces. Euclidean vector spaces generalize Euclidean geometry (parallel lines never intersect, laws of trigonometry apply) from 2D or 3D to higher-dimensional spaces. Convolutional Neural Networks (CNNs) exploit the Euclidean nature of their input data even more explicitly since they learn local filters that are applied over the entire input space in a sliding-window manner, thereby assuming Euclidean structure. To intuitively see that classical CNNs can not work on non-Euclidean data, consider the following picture:

It is clear how to apply 2x2 filters at coordinates (0, 0) and (1, 1) for Euclidean (e.g. image, left) data, but unclear how to do the same for non-Euclidean (e.g. graph, right) data. And this is, essentially, the motivation behind GNNs. Many interesting problems feature non-Euclidean data (e.g. graph/node classification, graph/node regression, social network analysis, molecular chemistry, anomaly detection) and we would like to have Neural Networks that can exploit and reason about data from such domains.

With the motivation for GNNs out of the way, let’s see how GNNs work. Note that there are multiple ways of implementing GNNs and that I will only summarize Kipf and Welling’s method in the following, which introduces the Graph Convolutional Network (GCN), a particular type of GNN. As the name implies, with the GCN, Kipf and Welling essentially generalize CNNs to non-Euclidean data, addressing the issue demonstrated in the image above. At a very high level and as we shall see, this is done by replacing the fixed neighborhood indexing of convolutional filters with a dynamic summation over the neighborhood of vertices in a graph. The GCN operates on a graph $\mathcal{G = (V, E)}$, where $\mathcal{V}$ are the graph’s vertices and $\mathcal{E}$ are the graph’s edges. Each vertex $\mathcal{V}_i$ in the graph might be described by some $n$-dimensional feature vector. Furthermore, $A$ refers to the adjacency matrix of graph $\mathcal{G}$, which is a different way of describing the edges in a graph. Let’s skip the theoretical derivation of the GCN (see the paper for details) and jump right to the model definition. Kipf and Welling provide the layer-wise definition of the GCN (in matrix notation) as follows:

\[H^{(l+1)} = \sigma \Big( \underbrace{D^{-\frac{1}{2}} \hat{A} D^{-\frac{1}{2}}}_{\text{normalized } \hat{A}} H^{(l)} W^{(l)} \Big), \tag{1}\]

where $\sigma$ is a non-linear, differentiable activation function (e.g. ReLU, GELU, tanh), $D_{ii} = \sum_j \hat{A}_{ij}$ is the diagonal degree matrix (multiplying by $D^{-\frac{1}{2}}$ on both sides essentially normalizes $\hat{A}$), $\hat{A} = A + I$ (adding the identity matrix $I$ to $A$ is a trick that adds self-connections to each vertex in the graph, which is beneficial for representation learning), $H^{(l)}$ is the hidden activation of the previous layer $l$, and $W^{(l)}$ is the learnable weight matrix of layer $l$. For the input layer of the GCN we simply have $H^{(0)} = X$, i.e. the $d \times n$-dimensional feature matrix of the graph $\mathcal{G}$, where $d$ corresponds to the number of vertices in the graph and $n$ to the aforementioned number of features per vertex. While the above formula specifies the GCN entirely, I find it hard to understand how the GCN works just based on this. So far, this looks almost like the layer of a plain MLP, with the addition of the term $D^{-\frac{1}{2}} \hat{A} D^{-\frac{1}{2}}$ (essentially the normalized adjacency matrix with self-connections) inside the activation function. How does the GCN deal with the dynamic, non-Euclidean structure of the data? This becomes clear by considering Equation (1) in vector notation (Equation 12 in the paper):

\[h_i^{(l+1)} = \sigma \Big( \sum_{j \in \mathcal{N}_i \cup \{ i \} } \frac{1}{c_{ij}} \mathbf{W}^{(l)} h_j^{(l)} \Big). \tag{2}\]

Here, we can see clearly that the hidden representation of node $i$ in layer $(l+1)$ of the GCN is the elementwise non-linear activation of a summation over the neighborhood (with self-connection) $\mathcal{N}_i \cup \{ i \}$ of vertex $i$ in the graph. Each neighboring vertex $j$ contributes its hidden representation $h_j^{(l)}$ from the previous layer, transformed by the learnable weight matrix $\mathbf{W}^{(l)}$ and normalized by $\frac{1}{c_{ij}}$, with $c_{ij} = \sqrt{|\mathcal{N}_i|\cdot|\mathcal{N}_j|}$. Thus, the GCN addresses the non-Euclidean data structure by replacing the fixed indexing typically performed by convolutions with a dynamic summation over vertex neighborhoods.
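
To convince myself that the matrix form in Equation (1) and the neighborhood sum in Equation (2) describe the same computation, a tiny sanity check helps. The following is a minimal sketch in plain Python (no PyTorch Geometric), assuming an unweighted toy graph, ReLU as $\sigma$, row-vector features (so $h_j W$ instead of $W h_j$), and $c_{ij} = \sqrt{D_{ii} D_{jj}}$ with degrees that include the self-connection. All function names are made up for illustration.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add_self_loops(adj):
    """A_hat = A + I."""
    n = len(adj)
    return [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]

def gcn_layer_matrix(adj, h, w):
    """Equation (1): ReLU(D^-1/2 A_hat D^-1/2 H W)."""
    n = len(adj)
    a_hat = add_self_loops(adj)
    deg = [sum(row) for row in a_hat]  # D_ii = sum_j A_hat_ij
    a_norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    out = matmul(matmul(a_norm, h), w)
    return [[max(0.0, x) for x in row] for row in out]

def gcn_layer_neighbors(adj, h, w):
    """Equation (2): per-vertex sum over the neighborhood, including the vertex itself."""
    n = len(adj)
    a_hat = add_self_loops(adj)
    deg = [sum(row) for row in a_hat]
    hw = matmul(h, w)  # row j holds the transformed features of vertex j
    out = []
    for i in range(n):
        acc = [0.0] * len(w[0])
        for j in range(n):
            if a_hat[i][j] != 0.0:  # j is a neighbor of i, or j == i
                c_ij = math.sqrt(deg[i] * deg[j])
                for k in range(len(acc)):
                    acc[k] += a_hat[i][j] * hw[j][k] / c_ij
        out.append([max(0.0, x) for x in acc])
    return out

# Toy path graph 0 - 1 - 2 with two features per vertex (values are arbitrary).
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -1.0], [1.0, 0.5]]

out_matrix = gcn_layer_matrix(A, H, W)
out_neighbors = gcn_layer_neighbors(A, H, W)
```

Both functions return the same activations up to floating-point rounding, which is exactly the point: the normalized adjacency matrix is just a compact way of writing "sum over my neighbors and myself".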

A toy graph classification dataset

Now that we understand the idea behind GNNs and how GCNs work, let’s put the GCN to the test and see how it compares with a classical MLP on a simple graph classification problem. For this, we need some data. I used the networkx library to create a dataset of random graphs. To make for a binary classification problem, graphs of class 1 have an edge probability of 0.4, while graphs of class 2 have an edge probability of 0.6; thus, the two graph classes have differing degrees. Similarly, I varied the edge weights between the two classes, such that graphs of class 1 randomly draw their edge weights from a Gaussian with mean -2, while graphs of class 2 draw their edge weights from a Gaussian with mean 2. These (scalar) edge features can trivially be encoded by the adjacency matrix. This toy dataset essentially mimics the well-known XOR classification problem, except that the data points are graphs. Exemplary graphs from this dataset are shown below, where the vertex color indicates the class label.

Let’s now train a vanilla MLP and a GCN on this dataset and see how they compare. I made sure that all graphs in the dataset are of size 7 such that the adjacency matrix is always $7 \times 7$, which is convenient because otherwise, we would have to perform some preprocessing, e.g. padding, to ensure that the input layer of the networks fits all graphs.
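
For illustration, here is a minimal sketch of how such a dataset could be generated. Note that this is not the networkx-based code I actually used: it relies only on Python's standard library, and the undirected edges, the unit standard deviation of the edge weights, the 0/1 label encoding, and the dataset size are all assumptions made for the example.

```python
import random

def random_weighted_graph(num_nodes, edge_prob, weight_mean, rng, weight_std=1.0):
    """Return the weighted adjacency matrix of an undirected random graph.

    Each possible edge exists with probability `edge_prob`; existing edges get a
    Gaussian weight. `weight_std=1.0` is an assumption for this sketch.
    """
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < edge_prob:
                weight = rng.gauss(weight_mean, weight_std)
                adj[i][j] = weight
                adj[j][i] = weight  # keep the matrix symmetric (undirected graph)
    return adj

def make_dataset(graphs_per_class, rng, num_nodes=7):
    """Build a shuffled list of (adjacency_matrix, label) pairs."""
    data = []
    for _ in range(graphs_per_class):
        # Label 0 corresponds to "class 1" in the text, label 1 to "class 2".
        data.append((random_weighted_graph(num_nodes, 0.4, -2.0, rng), 0))
        data.append((random_weighted_graph(num_nodes, 0.6, 2.0, rng), 1))
    rng.shuffle(data)
    return data

rng = random.Random(0)  # fixed seed so the sketch is reproducible
dataset = make_dataset(100, rng)
```

Because every adjacency matrix is $7 \times 7$, it can be flattened into a 49-dimensional vector for the MLP, while the GCN consumes the graph structure directly.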

Results

I now briefly describe the MLP and the GCN that were used to generate the following results. As a first step, I created unregularized networks to make sure that I could overfit the dataset. I used the PyTorch Geometric library for the GCN implementation since it provides the GCN layer out of the box. Both the MLP and GCN had two hidden layers of width 64 and had a similar number of learnable parameters, 8962 for the GCN and 7490 for the MLP. I optimized both networks with Adam for 100 epochs, using a batch size of 16. The results for this setting are shown in the following plot:

As can be seen, both networks are able to overfit the training set perfectly, achieving 100% classification accuracy. As expected due to the lack of regularization, both networks fail to generalize well to the testing set, with the MLP achieving roughly 70% and the GCN roughly 80% classification accuracy. The slightly better generalization and lower testing error of the GCN are probably due to its stronger inductive bias compared to the MLP.

Next, I wanted to improve the generalization of these networks. I first chose a few different hyperparameter sets manually but found it very hard to improve accuracy on the test set. I added dropout layers and weight decay for regularization, tried bigger and smaller networks, different batch sizes, and different optimizers, but nothing significantly improved the test set accuracy. The following plot shows results with deeper networks (5 hidden layers of width 32), 50% dropout, and Adam with a weight decay of 1e-4:

The results show similar patterns as in the unregularized case. Next, I ran a Bayesian hyperparameter optimization for an afternoon, but even that failed to come up with better hyperparameters. This suggests that there is some ambiguity in the data that cannot be learned, only memorized on the training set. This makes sense since, given the dataset that I created, both edge weights and edge probabilities are sampled randomly and have some overlap, meaning there exists a certain subset of graphs in the dataset that are very hard or impossible to classify correctly without additional information. It would be interesting to explore the uncertainty of the neural network predictions (for example through Monte Carlo dropout), but I haven’t done that for this post.

Summary and conclusion

The main takeaway points from this post should be the following: GNNs are neural networks that can exploit non-Euclidean data structures. The GCN generalizes CNNs to non-Euclidean domains by replacing the fixed filter indexing with a dynamic averaging over some proximity neighborhoods. On the graph XOR classification problem, we observed better generalization and less overfitting due to stronger inductive bias by the GCN, compared to an MLP. This last point should be taken with a grain of salt though, since it’s not clear whether it holds generally or just for the specific problem and data used here.

And that’s it, I hope you found this post interesting :)

Cheers,
Finn.

Autonomous Navigation with Pepper Robots

2022-02-12T00:00:00+00:00

Motivation and introduction

SoftBank's Pepper robot is a popular HRI research platform. However, as already mentioned in my previous post, the onboard LIDAR sensor supplies very sub-optimal data. Unfortunately, this data is essentially useless for running SLAM algorithms because a) it is very sparse, with only 15 laser beams, and b) due to the awkward angle of those beams, their range is only roughly 5 meters. I tried running ros-gmapping with Pepper’s onboard LIDAR sensor and the results were, as expected, not satisfactory. At the KT research group, we wanted to implement autonomous navigation with Pepper robots, which clearly requires good mapping and localization capabilities. Thus, we explored two alternatives to Pepper’s poor onboard LIDAR sensor. First, we experimented with some visual SLAM algorithms, including ORB_SLAM2, but were not satisfied with the results, most likely due to the shaky, blurry, and low-resolution stream of Pepper's cameras and the relatively featureless environment that is our office building. Second, we built an external, dedicated sensor system for mapping and localization and rigidly attached it to one of our Pepper robots. This approach turned out to yield good mapping, localization and navigation results and can be reproduced at relatively little monetary cost. This post aims to give guidance to anyone who wants to implement a similar solution for their Pepper robot(s) and explains our method at an intermediate level of detail.

The high-level approach

The idea of building an external mapping/localization device is motivated by the ROS hector_slam package, made by the folks at TU Darmstadt. This demo video (from 2011) illustrates the capabilities of their algorithm very well:

Given these amazing results on a handheld device, the overall approach is clear: We can simply build a similar hardware system as is used in the video, “duct-tape” it onto our Pepper robot and establish communication between the mapping device and Pepper, such that we can map the environment, then run a localization and navigation algorithm that communicates with the robot to autonomously navigate based on the previously obtained map. Certainly a lot of work but doable. In the following sections, I detail the steps of this approach and attempt to highlight the pitfalls that cost me considerable time along the way.

Hardware components

First of all, if you want to replicate the approach I describe here, you will need to buy the following hardware components:

Most importantly, we need a proper laser range finder. We experimented with two different sensors: The high-quality Hokuyo UST-10LX (~1300€) and the much cheaper YDLIDAR G2 (~200€). In our office building navigation scenario, we found no significant differences in results between the two, but note that the YDLIDAR G2 only has a range of ~12 meters, which is insufficient for larger, open spaces.
Additionally, the hector_slam ROS package does not strictly require, but benefits from, an inertial measurement unit (IMU), which provides acceleration and angular velocity measurements. These are useful when the mapping device is carried around and not rigidly attached to the Pepper robot, in which case there can be significant changes in the elevation and angle of the device, which must of course be considered by the mapping algorithm. Fortunately, IMUs are relatively cheap; we used the MPU-9250 IMU (~15€).
Furthermore, we require a Raspberry Pi to stream the sensor data from the IMU and the LIDAR into a ROS network. Our model had 4 GB of RAM and ran a minimal Ubuntu installation. Of course, the more RAM the better. At the time of writing, the Raspberry Pi 4 starts at 35€.
For connecting the MPU-9250 IMU to the raspberry pi we need a special adapter. We used a grove hat adapter (~10€) but this component might be optional, depending on the concrete IMU model you opt for.
Optionally: You should consider buying a strong battery that powers all of the above components so that you don’t have to attach a power cable to the robot or handheld system during mapping or navigation.
Lastly, we need some kind of physical framework to actually attach all of these electronics to. We used a set of MakerBeams (starting from 100€); alternatively, a basic set of LEGO bricks might also work ;)

In total, buying all of these hardware components will cost you about 400€. Considering the original cost of one Pepper robot (14.000€+) this is negligible, especially considering that these few components considerably enhance the navigation capabilities of the robot.

The physical framework

Assuming access to all hardware components, I now describe how to set up the raspberry pi and connect it to the LIDAR and IMU, so that the data from these two sensors will be available for further processing. We ended up building two separate systems. The first one, which we used mainly for debugging, is a direct adaptation of the hand-held device in the demo video above. The second system is a “utility belt” for the Pepper robot, which was eventually used to have Pepper autonomously navigate. Here are images of what these systems looked like:

Physical mapping & navigation device. Either hand-held or attached to Pepper.

Getting LIDAR data

Given a physical construct that holds our sensors, we can start working on the software side of things. First, install your ROS-supporting Linux distro of choice on the Raspberry Pi. Install a ROS version that supports the drivers and ROS packages required by your IMU and LIDAR sensors. Now, with ROS running on the Raspberry Pi, install the drivers and ROS packages required by your LIDAR sensor (also on the Raspberry Pi). These steps of course depend on the exact sensor you are using. In an earlier post of mine I describe how to set up the Hokuyo UST-10LX LIDAR. For the second LIDAR sensor we were using (the YDLIDAR G2), I will not reiterate the entire installation process because it is relatively straightforward and well documented. However, in short, to get the YDLIDAR G2 to work, you have to

Connect the YDLIDAR to the Raspberry Pi as described in the package manual
Install the base SDK
Install the YDLIDAR ROS driver. Don't forget to replug the YDLIDAR after step 5.

At this point, you should be able to start a roscore on your raspberry pi and launch your LIDAR sensor. You should be able to visualize the data in RVIZ or log the respective ROS topic and confirm that the sensor provides reasonable data. If you are using the YDLIDAR G2, you can call this launchfile to start the sensor. If you are rigidly attaching the mapping device to Pepper, you should consider changing the six values (x y z yaw pitch roll) in line 36 such that they properly reflect the translation and rotation from the parent coordinate frame you are referring to.

Getting IMU data

With your laser scanner (hopefully) providing the desired laser data, we now set up the IMU. This of course also depends on the specific IMU device you are using, but the high-level steps of setting up the MPU-9250 IMU are as follows:

Connect the IMU to your Raspberry Pi. If you follow this guide exactly, this involves connecting the grove hat adapter to the GPIO bus of the Raspberry Pi and installing its driver. Then, connect the MPU-9250 IMU to one of the i2c ports on the grove hat (see picture below).
Install RTIMULib. Pay attention to the step where you allow non-root users access to the i2c bus. I didn't read this properly at first, which caused cryptic errors to arise further down the line…
Install i2c_imu. See this GitHub issue and comment. In the MPU launchfile, for i2c_imu with the MPU-9250 we must set param imu_type to 7. Additionally, I had to set the value for param i2c_bus to 1, which I found out after probing around with the i2cdetect bash tool, which can be installed with sudo apt-get install i2c-tools.

Raspberry pi with grove hat adapter and connected MPU-9250 IMU via I2C bus.

Now, your IMU should be working, and you should be able to access the sensor data with the raspberry pi, similar to this:

Getting TF right

When building a physical system like this, ROS tf errors were the temporary bane of my existence. For people with no background in robotics, it might be somewhat unclear what tf actually does, hence I quickly want to provide some fundamental information: When we wish for robots to assume a certain joint configuration, we must look at the current joint values and then calculate the joint velocities that bring us from the current joint configuration to the desired one. This process is called inverse kinematics. These calculations are not terribly hard to do by hand, but, of course, that's not what we do; instead, we rely on ROS and its various packages for this. For this to work, the robot must provide a valid “kinematic chain”, which essentially encodes how all the coordinate frames for all the joints are related (i.e. when we move the arm, the attached coordinate frame for the hand will also move). The tf package helps us in managing these different coordinate frames and, most importantly, allows us to easily convert points or vectors between different coordinate frames, given they are linked via a valid kinematic chain.

Why is this relevant? Because we must manually extend Pepper’s kinematic chain such that it can make use of the external IMU and LIDAR sensors and navigate fully autonomously. If we do not set tf up properly, the mapping, localization and navigation algorithms can not interpret the sensor data they are receiving. Just providing the sensor data without the corresponding coordinate frame that is linked to the kinematic chain is not enough because the sensor data points would be missing relative information; they would lack their context. The sensor data lives in a certain coordinate frame, and it must be clear how this frame is related to the rest of the robot in order to do any meaningful computations.
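
To make the idea of "converting points between coordinate frames" a bit more tangible, here is a minimal sketch in plain Python. It is deliberately simplified to 2D (tf works with full 3D transforms and timestamps), it does not use the tf API at all, and the frame names, poses and the 10 cm sensor offset are made-up values for illustration.

```python
import math

def make_transform(x, y, yaw):
    """Homogeneous 2D transform (3x3) describing a child frame in its parent frame."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, x],
            [s, c, y],
            [0.0, 0.0, 1.0]]

def compose(parent_to_mid, mid_to_child):
    """Chain two transforms: the result maps child coordinates into the parent frame."""
    return [[sum(parent_to_mid[i][k] * mid_to_child[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]

def apply(transform, point):
    """Express a point given in the child frame in the parent frame."""
    px, py = point
    return (transform[0][0] * px + transform[0][1] * py + transform[0][2],
            transform[1][0] * px + transform[1][1] * py + transform[1][2])

# Hypothetical kinematic chain: map -> base_link -> laser
map_to_base = make_transform(2.0, 1.0, math.pi / 2)  # robot at (2, 1), rotated 90 degrees
base_to_laser = make_transform(0.1, 0.0, 0.0)        # sensor mounted 10 cm in front of the base
map_to_laser = compose(map_to_base, base_to_laser)

# A laser return 1 m straight ahead of the sensor, expressed in the map frame.
obstacle_in_laser = (1.0, 0.0)
obstacle_in_map = apply(map_to_laser, obstacle_in_laser)
```

If the `base_to_laser` link were missing or wrong, the obstacle would end up at the wrong place in the map, which is exactly the kind of error that a misconfigured tf tree produces.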

Thus, when mounting the LIDAR or IMU sensor to the robot, make sure you pay attention to the local coordinate frame of the sensor, which might or might not be printed onto the sensor. ROS uses a right-handed coordinate system and the sensor coordinate systems should align with this, i.e. the z-axis is the vertical one, x points to the front and the y-axis points to the left. Thus, when launching your sensors, specify the correct transformation between a sensible parent coordinate frame and the frame of the sensor data. This can be done conveniently via the static transform publisher, for example by including it in the sensor’s launchfile.

Anyway, if you get this wrong, you will quickly notice it, e.g. when the IMU indicates left acceleration when you move it to the right, or because all laser scans are rotated by some degree. In fact, you can see in the IMU video above that the data in RVIZ is not exactly matching what I do with the IMU in the real world. This is exactly the issue: the fixed transform that I specified for testing purposes is not in line with the IMU that is just loosely dangling around on my desk, hence the mismatch in orientation.

So, pay close attention when configuring the static transform publisher with the coordinate frames for our IMU and LIDAR. You can read more about the static transform publisher here and, if you want to understand this thoroughly, consider chapter 2 and 3 in Bruno Siciliano’s “Introduction to robotics” book.

Software architecture

Okay, at this point you should have a Raspberry Pi that is running ROS so that the data from the IMU and LIDAR sensors, which are connected to that Raspberry Pi, is available via rostopics. In principle, this is enough to replicate the mapping results shown in the demo video above, via the hector_slam ROS package. I describe the mapping process in more depth later; for now, I describe our software architecture. Namely, there are two more entities in addition to the sensor-data-collecting Raspberry Pi. Firstly, we have a main ROS server that runs the computationally expensive mapping, localization and navigation algorithms. This way, the Raspberry Pi only acts as an interface to the sensors and broadcasts the sensor data into a shared local network, but does not have to perform any expensive computations. Secondly, we treat the Pepper robot as another, separate entity that provides sensor data and receives velocity commands. This again has the benefit that expensive computations run on the main server rather than on the Pepper robot. It also resolves a very annoying issue: Pepper’s latest ROS packages only support ROS kinetic, which only supports Ubuntu 16.04. In the following, I describe how to make all of these systems communicate with each other.

Distributed ROS

We have the following three entities:

A central ROS server (with its own roscore)
The raspberry pi with attached LIDAR and IMU (with an additional roscore)
A ROS kinetic docker container that is running Pepper's ROS stack (also running a dedicated roscore)

To enable these three entities to communicate with each other we require that they all have access to the same local network. This way, the central ROS server can read the topics published by the Raspberry Pi, process that data (e.g. for mapping or localization and navigation), and output velocity commands that are executed via Pepper’s roscore. In practice, we ran the main roscore and Pepper's roscore on the same, powerful desktop machine. However, Pepper's ROS kinetic core was running in an isolated docker container, which resolves the annoying version conflicts (I describe in another blog post how to communicate with a roscore that lives inside a docker container). The key thing to note is that when we have multiple roscores running in the same network, topics provided by the different roscores can be accessed from anywhere in that network; that is, a topic from core a can be seen by core b.

Pepper's docker ROS kinetic core

As mentioned, Pepper's most recent ROS packages are for ROS kinetic (which was released in 2016 and requires Ubuntu 16.04). The cleanest solution would be to compile all of those packages manually in an up-to-date ROS environment. I tried this and, after wasting a considerable number of hours in this effort, accepted defeat. Thus, I eventually opted for a docker container running ROS kinetic, where Pepper's ROS packages can be installed straightforwardly. In this docker container, simply install Pepper's entire ROS stack. Then, you should be able to:

# launch Pepper's ROS driver
roslaunch pepper_bringup pepper_full_py.launch nao_ip:=<YOUR-PEPPER-IP> roscore_ip:=<DOCKER-KINETIC-ROSCORE-IP>
# then have it assume the wake-up pose
rosservice call /pepper_robot/pose/wakeup

Of course, replace <YOUR-PEPPER-IP> and <DOCKER-KINETIC-ROSCORE-IP> with the IP address of the Pepper robot and the IP address of the local ROS kinetic roscore. With a working ROS kinetic environment that allows us to control Pepper through ROS, we just have to make sure that we can call services and publish data from our main server and have this take the desired effect in the docker kinetic ROS environment connected to Pepper.

Key configurations for distributed ROS systems

There are two key configurations that are very easy to miss when building a distributed ROS system. I highly recommend reading the entire ROS network setup guide, but nevertheless highlight these two configurations here. Firstly, make sure that the IP address to hostname mapping is properly set up in /etc/hosts. Specifically, for each of our three entities, add the respective other two hostnames and IP addresses to that file. If this is not properly handled, roscore a will be able to see the rostopics of remote hosts b and c, but the topics will not contain any data (see this thread).

Secondly, the local time for all three entities must be exactly the same. If I recall correctly, the time in the docker container is identical to that on the host, but the time on the Raspberry Pi and on the desktop machine will certainly not be the same. Here, I mean they must be identical up to a few milliseconds. If they have an offset of e.g. one second, this will cause very weird bugs to occur. For example, I was getting errors from the TF package indicating there was some issue with my kinematic chain, while in fact, the TF messages broadcast by the Raspberry Pi were slightly too old (due to the non-identical time on the Raspberry Pi and the main server), causing the messages to be dropped by the TF instance running on the desktop, even though the kinematic chain was valid. To fix this, install chrony on the desktop and on the Raspberry Pi, then synchronize one machine with the clock of the other (both directions should be fine). This process is also documented here.

With these things taken care of, you should now be able to access data from the Raspberry Pi and from the ROS kinetic core on the main ROS server. You can test this by visualizing the external LIDAR, the IMU and, for example, Pepper's camera stream in RVIZ on the desktop machine. You should also be able to launch e.g. rqt-steering on the desktop and control Pepper that way. If this works, your distributed ROS system is set up correctly, and the navigation algorithm, executed on the main server and based on LIDAR data from the Raspberry Pi, will be able to drive Pepper to the desired goal location.

Mapping with hector slam

Given the distributed ROS system described above, the first step towards autonomous navigation is, of course, obtaining a map of the environment. As mentioned initially, here we entirely rely on the hector_slam ROS package. Starting the entire system and creating a map of the environment involves the following steps:

Start the required sensors (on the raspberry pi they are connected to):
- Start the IMU (for example by adjusting this launchfile as described above)
- Start the LIDAR (for example, using the previously mentioned launchfile)
Start Pepper’s ROS stack (inside the docker container that communicates with Pepper, as described above)
On the main server or inside Pepper’s kinetic docker ROS environment, start your tool of choice to control the Pepper robot, e.g. rqt_robot_steering
Now, we can start the hector_slam ROS package and create a map of the environment.
Start the hector_slam node (on the more powerful, central ROS server)
- Start hector_imu_to_tf, this connects the LIDAR data to the angles reported by the IMU
- Start hector_geotiff, a service for saving the map
- Optionally, start static_transform_publisher, to ensure that the TF tree is valid and associates all sensor data with the base_link frame
- Start hector_mapping, the main hector_slam node

Note that it makes no difference whether you create the map using the hand-held device or with the LIDAR and IMU rigidly mounted to the Pepper robot. This is because the package hector_imu_to_tf links the LIDAR scans to the IMU angles. In practice, this is done by configuring TF in such a way that the laser data frame is the base_stabilized frame, while this frame’s pose is estimated using the hector_imu_to_tf package. How hector_slam uses different coordinate frames is further documented here.

Thus, if everything works, you should be able to obtain good mapping results. Here are exemplary mapping results obtained with the above-described hard- and software-setup. First, we tested and debugged the system in a small, maze-like environment with unique features for easy mapping. Once the system was working well in the test environment, we tested it in the main hallway of our office building and obtained good results.

A small testing environment and mapping results.

Once the entire environment has been mapped, hector_geotiff is used to save the map by executing rostopic pub syscommand std_msgs/String "savegeotiff". This did not work for me immediately and complained with the message “failed with error ‘Device not writable’”. I don’t know the root cause of this, however, I was able to guess a solution relatively quickly. The following command, which re-creates the target folder and takes care of RWX rights, fixed the issue for me:
sudo mkdir /opt/ros/melodic/share/hector_geotiff/maps && sudo chmod -R a+rwx /opt/ros/melodic/share/hector_geotiff/maps && sudo chown -R ubuntu:ubuntu /opt/ros/melodic/share/hector_geotiff
Of course, you have to adjust the paths for your ROS installation.

Navigation and Adaptive Monte Carlo Localization

Given a map of our environment, we can now run Adaptive Monte Carlo Localization (AMCL) to estimate the current robot position given a stream of laser scans. You can of course use any other localization algorithm, but the AMCL ROS package generally works very well (if properly parameterized).

Loading hector maps

To start the localization process in a mapped environment, first, we must make the map available on a rostopic. The default topic used for this is aptly called map. The map can be made available by calling the map_server package, which takes as argument the path to a map file, e.g. rosrun map_server map_server path/to/map.yaml. This should make your map available on the map topic. You can check this by either logging the map topic or by visualizing it in RVIZ.

When we load maps that were created with the hector_slam package, we must pay attention to a particular parameter of the map-server package, which took me quite some time to figure out. The maps generated by hector_slam draw obstacles in blue instead of black (as can be seen on the map images above). This causes them to be classified wrongly by the map_server package, because with the default parameters, the blue color is converted to grayscale and happens to fall into an undefined range, meaning the map_server does not identify the blue pixels on the map as actual obstacles. Unfortunately, no warning is raised and the map-server happily broadcasts a map with many invalid (-1) values, which of course makes any localization impossible. This is extra hard to spot, because the resulting map still looks almost perfect when visualized in RVIZ:

The map on the left is valid and can be used for localization and navigation, while the map on the right is invalid and completely useless for localization and navigation. A subtle but devastating difference.

To prevent this from happening, set the parameter occupied_thresh of the map_server package to a lower value, for example, 0.5. The map-server package converts the color values in each cell to a value between 0 and 1, and pixel values greater than occupied_thresh are considered to be an obstacle on the map.
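To make the thresholding concrete, here is a minimal Python sketch of the pixel interpretation as I understand it from the map_server documentation (occupancy is computed from the mean of the color channels, then compared against the two thresholds). The function name classify_cell and the blue color value are made up for illustration, and the values 0.65 and 0.196 are the thresholds typically found in map YAML files, so treat all of them as assumptions rather than as what hector_geotiff actually writes:

```python
def classify_cell(rgb, occupied_thresh=0.65, free_thresh=0.196):
    """Sketch of map_server's trinary pixel interpretation (negate disabled)."""
    mean_intensity = sum(rgb) / len(rgb)          # average over the color channels
    occupancy = (255 - mean_intensity) / 255.0    # darker pixel -> higher occupancy
    if occupancy > occupied_thresh:
        return 100   # occupied
    if occupancy < free_thresh:
        return 0     # free
    return -1        # unknown: falls between the two thresholds


# Hypothetical blue-ish obstacle color, NOT taken from a real hector map.
HYPOTHETICAL_BLUE = (60, 60, 200)

print(classify_cell((0, 0, 0)))                                # black obstacle
print(classify_cell(HYPOTHETICAL_BLUE))                        # with typical thresholds
print(classify_cell(HYPOTHETICAL_BLUE, occupied_thresh=0.5))   # with lowered threshold
```

With these assumed numbers, the blue pixel has an occupancy of roughly 0.58, which lands in the unknown band for a threshold of 0.65 but counts as an obstacle once occupied_thresh is lowered to 0.5.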

With a correctly loaded map now being available on the map topic, all that’s really left to do is to start the AMCL node. AMCL has a lot of parameters, most of which I left at their default values. However, the odometry type should be set to omni via the odom_model_type parameter, since the Pepper robot has an omnidirectional mobile base. Furthermore, the laser_min_range and laser_max_range parameters should match the LIDAR device. For the YDLIDAR G2, I set them to 3 and 16, respectively. Lastly, make sure to pass the correct TF frames to AMCL. This of course is very specific to your TF chain, but most likely you can leave the default values if you didn’t change any TF frame names before.

Store the parameter values in a launchfile and launch AMCL. You can either call the AMCL rosservice global_localization to get an uninformed prior that spreads the pose particles uniformly over the state space, or you can use the RVIZ 2D pose estimate button to spread the initial particles normally around the cursor location on the map. Either way, after having initialized the localization algorithm, you can (optionally) drive back and forth a bit until the history of laser scans allows the localization algorithm to converge to the correct position on the map. Alternatively, you can just pass a goal pose and rely on the initial scan history for estimating the initial robot pose.
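The difference between the two initializations can be illustrated with a small, ROS-free Python sketch. This is not AMCL’s actual implementation (as far as I know, AMCL additionally restricts the uniform samples to free map cells), and the map bounds, pose guess, and standard deviations below are arbitrary assumptions:

```python
import math
import random


def uniform_particles(n, x_range, y_range, rng):
    """Uninformed prior: poses (x, y, yaw) spread uniformly over the map bounds."""
    return [
        (rng.uniform(*x_range), rng.uniform(*y_range), rng.uniform(-math.pi, math.pi))
        for _ in range(n)
    ]


def gaussian_particles(n, pose, sigma_xy, sigma_yaw, rng):
    """Informed prior: poses drawn from a normal distribution around a pose guess."""
    x, y, yaw = pose
    return [
        (rng.gauss(x, sigma_xy), rng.gauss(y, sigma_xy), rng.gauss(yaw, sigma_yaw))
        for _ in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Assumed 10 m x 5 m map and a pose guess at (2 m, 1 m, 0 rad).
global_init = uniform_particles(2000, (0.0, 10.0), (0.0, 5.0), rng)
local_init = gaussian_particles(2000, (2.0, 1.0, 0.0), 0.5, 0.2, rng)
```

The uniform set covers the whole map and needs more motion and scans to collapse onto the true pose, while the normally distributed set starts out concentrated around the guess.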

Based on the given map, current pose and goal pose, navigation is performed by the ROS package move_base, which will calculate a trajectory from the initial pose to the given goal pose. Once this trajectory has been calculated, move_base will output velocity commands on the cmd_vel topic that move the robot along that trajectory to the given target pose. The ROS driver of each specific robot must implement the functionality that interprets and executes the given velocity command with the available robot actuators, like Pepper’s omnidirectional wheels. In our case, Pepper’s ROS kinetic stack implements a velocity controller somewhere, I guess.

And that is it. With a setup like this, we can exploit the powerful ROS navigation stack to autonomously navigate with (slightly modified) Pepper robots. My colleague Dr Phillip Allgeuer continued to work on this project when I had to leave the Knowledge Technology group to start my PhD. Phillip made considerable improvements to our codebase, finalized the project and produced two videos which I slightly edited and combined into the following:

Although I don’t know the following for certain, I hypothesize that the pauses of the robot in the second clip are based on Pepper’s onboard safety features. If enabled, these will override any control if it is considered unsafe, by some criterion. I observed numerous times that Pepper would suddenly stop moving when it is in close proximity to an obstacle, or when it moves at somewhat fast velocities, for example to maintain a safe balance.

Summary

To summarize, we wanted our Pepper robots to autonomously navigate our office premises. We found that the onboard laser range finders did not allow for satisfactory mapping and localization performance. Thus, we extended the Pepper robot with better LIDAR and IMU sensors, which allows us to run the powerful hector_slam package for mapping, and amcl plus move_base for localization and motion planning/navigation. This required a somewhat sophisticated, distributed hardware, software and network architecture that is described in this blog post for anyone to reproduce.

Cheers,
Finn.

Hokuyo UST-10LX with ROS Setup

2021-08-10T00:00:00+00:00

Motivation and introduction

I have worked for quite a while with Softbank’s Pepper robot, which, unfortunately, does not feature the best onboard hardware. Specifically, it only employs three laser sensors that are aimed towards the ground at a slightly awkward angle [1]. These sensors are enough to do minimalistic obstacle avoidance, but not enough for Simultaneous Localization and Mapping (SLAM). As a consequence, at the KT group, we’ve decided to instead build an external mapping device, using the Hokuyo UST-10LX [2] laser sensor. However, the Hokuyo UST-10LX has to be connected to a host machine via ethernet cable to communicate its measurements. This requires some network tinkering on that machine, which, in our case, is a Raspberry Pi 3. I found this setup not to be properly documented (or at least I couldn’t find anything on this). As such, below I describe how to configure a headless Raspi 3 running Ubuntu Server 20.04 to stream data from the Hokuyo laser sensor into your network.

Raspi setup and networking

The reason this sensor’s usability is not exactly at the plug-and-play level is that it, as already said, communicates with the host machine via ethernet cable, which in turn means that we can’t use the LAN port to connect the host machine to our trusted network. This is not necessarily a problem, but if we still want to remotely connect to and work on the said host machine, we have to configure our Raspi in such a way that it a) allows data to come in via the wired network interface while b) connecting to the network/internet via the wireless interface. In the following steps, we do just that. Additionally, the wired ethernet device has to be configured in the right way to communicate with the Hokuyo sensor, which I also did not find to be well documented. Note that I did the following steps on the Ubuntu 20.04 server distro. I assume the steps work on all Debian-based Linux systems, but obviously, I did not test that.

Assuming you’ve freshly installed Ubuntu Server on your Raspi and plan to use it headless, you’ll have to do some initial configuration (including enabling ssh), which is widely covered, e.g. here. Once that is done and ready, simply log into your Raspi via ssh. Assuming your Raspi is currently connected to your network via ethernet cable, we now have to configure it to connect to WLAN, so that the ethernet port is available for the Hokuyo sensor. Do the following steps:

Get the name of your wireless network device (likely wlan0):
```
$ ls /sys/class/net
# enp8s0  lo  wlan0
```
This outputs the names of all network devices on your system. Your wireless device is likely called wlan0. You can read more about network device naming conventions on Ubuntu here.

Edit the netplan configuration file /etc/netplan/50-cloud-init.yaml with your editor of choice. Add the following lines to it, but make sure they are properly indented (it’s a YAML file, which is sensitive to this).

wifis:
  <your-wireless-device>:
      dhcp4: true
      optional: true
      access-points:
          "<SSID_WiFi_name>":
              password: "<WiFi_password>"

You have to enter your network information, so replace <your-wireless-device> with the name of your wireless network device (e.g. wlan0), <SSID_WiFi_name> with the name of your WiFi network and <WiFi_password> with the corresponding password. The final file should look similar to this:

# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
  ethernets:
      eth0:
          dhcp4: true
          optional: true
  version: 2
  wifis:
      wlan0:
          dhcp4: true
          optional: true
          access-points:
              "SSID_WiFi_name":
                  password: "WiFi_password"

Now, bring up the wireless network card on the Raspi, if it is not already running:
```
$  sudo ip link set <your-wireless-device> up
```
Just executing this won’t break anything if your wireless card is already running, but later commands will fail if it is down. Again insert the name of your wireless network device (wlan0).
Now, run these commands to generate and apply the netplan configuration we entered into /etc/netplan/50-cloud-init.yaml:
```
$ sudo netplan --debug try
$ sudo netplan --debug generate
$ sudo netplan --debug apply
$ sudo reboot
```
Finally, you can unplug the ethernet cable from your Raspi, and, assuming everything went well, the Raspi will connect to your WLAN after its reboot, in which case you can ssh onto it again.

Now, the internet traffic is being handled by the wireless network device and the wired ethernet port is available for the Hokuyo. In the next section, we install the Hokuyo ROS driver to bring up the laser sensor.

ROS and the Hokuyo driver

The remaining steps are more or less straightforward, but not trivial and not properly documented either (hence this post). Install ROS on your Raspi or the machine you plan to attach the Hokuyo to, by following the official installation instructions. Once you have ROS installed, install the ROS-driver for the Hokuyo UST-10LX sensor [3]:

$ sudo apt-get install ros-<ROS-VERSION>-urg-node

Here, replace <ROS-VERSION> with any of kinetic, melodic or noetic, as these are the only ROS versions currently supported by the package. Alternatively, you can compile the package yourself, but then you will have to take care of setting up the workspace and that all dependencies are satisfied, which I won’t describe here, as this process is well documented elsewhere [4]. With the drivers installed, now we just start ROS and can get our laser data, right? Well, not really, again because this specific Hokuyo laser sensor communicates with the host machine via ethernet cable. Furthermore, because of that, the instructions given for the urg_node package don’t work for us either. So, to get laser-data into our handy-dandy ROS topics, we have to assign an IPv4 address to the wired network interface (because the sensor is connected via cable), so that the IP address of the Hokuyo sensor falls into the subnet range of the wired network interface. From the Hokuyo’s datasheet (page 6), we know that the device has the default IP address of 192.168.0.10 assigned. Thus, we can assign an IP address like 192.168.0.15/24 to our wired network device. 192.168.0.15/24 is in CIDR notation [5] and states that the first 24 bits (as indicated by the trailing /24) provide the network identifier, and whatever comes after that is the actual machine identifier. To assign a specific IP address to the wired network device, take the following steps:

Get the name of your ethernet (wired) network device with the same command we used before:
```
$ ls /sys/class/net
# eth0  lo  wlan0
```
lo is the loopback device, wlan0 (or similar) the WiFi card, so the remaining name is your wired network card (most likely called eth0, eno1 or enp2s0).
Assign an IP address to your wired ethernet device that shares the network identifier with the IP of the Hokuyo sensor:
```
# sudo ip addr add <SHARED.NETWORK.IDENTIFIER>.<HOST-IDENTIFIER>/24 broadcast <SHARED.NETWORK.IDENTIFIER>.255 dev <ETHERNET-CARD-NAME>
$ sudo ip addr add 192.168.0.15/24 broadcast 192.168.0.255 dev eth0
```
As above, replace the values in <> with your values. You can verify that the command worked by inspecting the output of the ifconfig command. For the ethernet device you specified, it should show the IP address we just assigned to it. If your ethernet device does not show up in ifconfig but does show up when you run ls /sys/class/net, it might be down. You can bring it up the same way we did with the wireless device earlier.
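If you want to double-check the subnet reasoning before touching the network configuration, Python’s standard ipaddress module can do the CIDR arithmetic. The following is just a small sanity-check sketch; the helper name shares_subnet is made up, and the addresses are the ones used in this post:

```python
import ipaddress


def shares_subnet(interface_cidr, device_ip):
    """Return True if device_ip lies in the network of the given interface address."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(device_ip) in network


# Address we assign to the wired interface, in CIDR notation.
wired = ipaddress.ip_interface("192.168.0.15/24")

print(wired.network)                    # network identifier given by the first 24 bits
print(wired.network.broadcast_address)  # value to pass as the broadcast address
print(shares_subnet("192.168.0.15/24", "192.168.0.10"))  # Hokuyo default address
```

For the values above, the network is 192.168.0.0/24, the broadcast address is 192.168.0.255, and the Hokuyo’s default address falls inside the subnet.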

Now we have our regular traffic running over the wireless connection, while the wired ethernet device is configured to communicate with the Hokuyo sensor. Note that the IP address assignment with the ip addr add command we just did is not persistent, so you will have to redo it after rebooting your device. You can look up how to configure this permanently by editing the netplan once again, e.g. here. Now, the final step is to get the laser data into a ROS topic via the ROS driver.

As usual in ROS, first, we start a roscore with the simple command roscore. Then, you need a new session on the same machine to start the urg_node driver for the Hokuyo sensor. Either ssh onto the server a second time or use a terminal multiplexer like tmux. With the roscore running in the other session, run the following command to bring up the Hokuyo laser:

$ rosrun urg_node urg_node _ip_address:=192.168.0.10

With the _ip_address argument, we pass the IP address of the physical sensor to the ROS node (see here). As mentioned already, 192.168.0.10 is the default value for the Hokuyo UST-10LX, so you will have to adjust this if you assigned a different IP address to your sensor. If everything works, the node will output something like:

[ INFO] [1501672789.034051716]: Streaming data.

Now, in yet another terminal session, you can inspect this data, which is the laser point cloud returned by the sensor, by simply calling rostopic echo /scan, where /scan is the default name of the rostopic used by the urg_node driver.

RVIZ

You can also inspect the laser data in RVIZ, but this requires one or two extra steps. RVIZ can only display data that is attached to a valid coordinate frame (which makes sense, because you need some point of reference if you want to visualize things in any space). Usually, the robot you are using will provide the coordinate frames via the ROS tf package, but given our current setup, we just have the Hokuyo sensor that is sorta floating in the void. If you inspect the data on the /scan topic, for example with rostopic echo /scan, you will see in the message header that the data is associated with the coordinate frame laser. But this coordinate frame does not exist, so RVIZ can’t know where to put this data and will complain if you try to visualize the laser point cloud. To fix this, we simply broadcast a made-up coordinate frame, with the following command:

$ rosrun tf static_transform_publisher 0 0 0 0 0 0 1 map laser 10

Now we would be set if our roscore ran on a system that features a graphical display session, but if you are running the roscore on a headless machine, as is the case for the Raspberry Pi 3 Ubuntu Server setup described earlier, we can’t even display RVIZ to begin with. Thus, you will have to do this on a different machine. Luckily, setting up a distributed ROS environment is very easy. Before starting the roscore on the headless machine, run the following command:

$ export ROS_MASTER_URI=http://<IP>:11311

Replace <IP> with the IP address of the network card. Obviously, when you restart the roscore, you will have to restart the urg_node as well. Now, in a terminal on your machine with a graphical desktop environment, run the same command, then rostopic list. This should show the rostopics that are running on the headless machine, assuming both machines are in the same network. For some reason, we can now see the rostopics, but no data in the topics, even though we are streaming it from the headless machine. I had to apply this weird fix in order to actually see the data in my ros topics on my main machine:

What finally solved my problem was adding all PCs with their hostnames and IP Addresses to the "/etc/hosts" file. Since then, everything works fine.

With that being taken care of and all nodes running (roscore, urg_node, static_transform_publisher), you should be able to visualize your laser data in RVIZ. Look at this nice laser data:

Summary

Alright, in this post we’ve configured a Raspberry Pi 3 running Ubuntu Server 20.04 to publish laser data into a network via WiFi, while the Hokuyo sensor is connected to the device via ethernet cable. We’ve installed the required software components and finally showed how to visualize the laser data in RVIZ on a different machine in the same network. Maybe someone finds this helpful :)

Cheers,
Finn.

References:
[1] Pepper laser specification
[2] Hokuyo UST-10LX
[3] URG_node Hokuyo driver
[4] Building ROS packages
[5] Wikipedia: Classless Inter-Domain Routing

Soft Actor Critic: Deep Reinforcement Learning for Robotics?

2020-09-29T00:00:00+00:00

Motivation and introduction

The Soft Actor-Critic algorithm by Haarnoja et al. [1] has gotten a lot of coverage and attention in 2018 and 2019. And rightfully so. The paper proposes a very elegant solution to the notorious problem of deep reinforcement learning algorithms being too data-hungry for real-world feasibility and supplies very exciting examples illustrating the capabilities of the algorithm in a real-world setting, as can be seen below. Naturally, I was intrigued. While at the point of writing this post, Reinforcement Learning has not yet been featured on this site, it is, after all, my main academic interest and will be at the heart of my master’s (and hopefully Ph.D.) thesis. Hence, for one of my courses, I decided to write a paper on the Soft Actor-Critic algorithm. In this blog post, I build on that paper [2] and provide some additional examples and insights.

The problem with Deep Reinforcement Learning for real world robotics

While this post will not address Reinforcement Learning in general, the gist of it is as follows: By executing pseudo-random actions in an environment (or simulation thereof) and rewarding good actions, we can have a, for example robotic, agent learn almost any desired behavior. Here, behavior means that we execute the desired sequence of actions to get from some initial state to some goal state. Inherently, this is a very powerful concept, as this makes it possible for robots to learn how to walk, grasp things, play games, engage in dialogue, and pretty much learn to solve any conditional, sequential problem. That’s the theory, at least.
However, as you might have noticed, we are not yet surrounded by intelligent, autonomous robots in our everyday life. In fact, it’s still out of the norm to find a robot autonomously cleaning an office space or shopping mall, which indicates that things aren’t quite as easy. Many different fields of robotics are still active research areas, and Reinforcement Learning still has a central problem that stops it from being widely employed in real-world robotic scenarios. To be precise, the main problem of deep Reinforcement Learning algorithms for real-world robotics is that they are insanely data-hungry and take ages to converge (i.e. to generate the desired behavior).
Why is this problematic? Well, robots aren’t indestructible and in the early stages of learning, Reinforcement Learning agents behave essentially randomly. You can probably imagine what drastic consequences it can have if we just set all motors in our robot to random power levels… Larger robots will fall over, mobile bots might severely ram into obstacles, and drones would crash immediately. If we expose our robots to this kind of behavior for a prolonged period of time, it is almost certain that the robot will suffer significant damage in the process (similar to how toddlers fall over when they begin learning to walk, except that for a robot, the likely consequence is a breakdown). And this is only one aspect of the issue. When I wrote insanely data-hungry, I absolutely meant that. For example, AlphaStar, DeepMind’s deep neural Reinforcement Learning algorithm, has been trained with many agents in parallel, for 14 days straight, on 16 Tensor Processing Units (TPU), corresponding to 200 years of real-life training time, for each agent [3]… And this is under the employment of state-of-the-art methods to speed up the learning process.
Ignoring that we can’t train a single robot in parallel fashion, after 200 years of hypothetical training, you can be sure that the robot would have broken down simply due to all the wear and tear that it would be exposed to in all that time.
An apparent solution is to train the agent in a simulator (which also allows us to parallelize the training process) and then simply put the behavior policy learned in the simulation on a physical robot, operating in the real world. However, the simulators are not yet good enough and fail to accurately represent the real world, which makes the learned behavior policies useless on the real, physical robot. Further, agents trained in a simulator tend to learn things that are hyper-specific to that simulator and don’t generalize to the real world. This is referred to as the Sim-to-Real problem and is an active research area in itself.
So as you can see, there are a lot of challenges for real-world Reinforcement Learning. However, the Soft Actor-Critic algorithm tackles the problem at its root and aims to significantly speed up the learning process, to a point where deep Reinforcement Learning methods become feasible in real-world scenarios. Let’s explore the intuition behind the algorithm in the following section.

The intuition behind Soft Actor Critic

To gain an understanding of how the SAC algorithm tackles the data inefficiency problem of deep Reinforcement Learning methods, we have to look at the SAC-specific objective function that is being employed by Haarnoja et al. However, to begin with, consider the classical Reinforcement Learning objective, which describes the general goal of Reinforcement Learning [9]: $$G_t = \sum^\infty_{k=0}\gamma^k R_{t+k+1}$$ This is the discounted return $G$ at time step $t$, with a discount factor $0 \le \gamma \le 1$, so that the reward signal $R$ at time step $t+k+1$ is weighted to be less important than the reward signal at $t+k$, encoding an aspect of temporal relevance. The reward signal $R$ is, arguably, the central part of any Reinforcement Learning problem, as this guides what the policy (always denoted by $\pi$) of our agent will learn, by encoding the goodness of any action taken. Generally, the behavior, which is encoded in the policy of the agent, is adapted in such a way that it maximizes the reward function. Thus, a well-thought-out reward function is the key to success in reinforcement learning. Essentially, no matter what, with Reinforcement Learning, we want to accumulate as much discounted reward, aka return $G_t$, as possible. This is the main objective all Reinforcement Learning methods are subject to. Formally, the optimal policy $\pi^*$ is defined as the policy that maximizes the expected discounted sum of rewards over the trajectories it generates [1]: $$\pi^* = \underset{\pi}{\operatorname{argmax}} \underset{\tau \sim \pi}{\mathbb{E}} \left[ \sum^\infty_{t=0} \gamma ^t [r(s_t, a_t)]\right]$$ Here, ($ \tau \sim \pi$) means that a trajectory of interactions ($\tau $) has been sampled ($\sim $) from the probability distribution of the policy ($\pi$). Notice that $r$ is a function over all states and actions, providing the reward measure of goodness for every combination of states and actions (at least in simple examples).
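As a small worked example of the return formula, the following Python sketch computes $G_t$ for a finite list of rewards. The function name discounted_return and the reward values are made up for illustration, and the list is assumed to hold $R_{t+1}, R_{t+2}, \dots$ in order:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # illustrative rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 1 + 0.25 * 1
print(discounted_return(rewards, gamma=1.0))  # undiscounted sum
```

With $\gamma = 0.5$, the three rewards contribute 1, 0.5 and 0.25, which shows how later rewards count less towards the return.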

Now, the central element in the SAC algorithm is a more general objective function that contains a second term in addition to the main reward signal [7]: $$\pi^* = \underset{\pi}{\operatorname{argmax}} \underset{\tau \sim \pi}{\mathbb{E}} \left[ \sum^\infty_{t=0} \gamma ^ t [r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t))] \right]$$ The only difference to the original formula for the optimal policy is the term $\mathcal{H}$, which is weighted by $\alpha$. $\mathcal{H}$ encodes the entropy of the policy $\pi$ in every state and is given by $\mathcal{H}(P) = \underset{x \sim P}{\operatorname{\mathbb{E}}} [-\log P(x)]$. Entropy is, roughly speaking, a measure of information gain or uncertainty of a random variable $x$, sampled from a distribution $P$. Do you see what this motivates the Reinforcement Learning agent, who behaves according to a learnt policy that maximizes the given function, to do? It forces the agent to not only consider the reward associated with an action in a state, but also the overall degree of uncertainty in that state. This results in the agent choosing actions that lead to states which have not yet been seen, even when a different action would lead to a state that has a higher expected return (but has already been seen). The parameter $\alpha > 0$ balances the two components of the objective function and controls the importance of the entropy term, compared to the reward signal. In the original version of the SAC algorithm, this parameter $\alpha$ had to be set manually, which was a non-trivial problem for complex enough environments and required an expensive hyperparameter optimization [1]. In the newer version of the algorithm, however, Haarnoja et al. managed to automatically adjust the parameter by rephrasing the objective function once again. The details of this automatic temperature adjustment can be ignored for the purpose of this blog post.
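To get a feeling for the entropy term, here is a minimal Python sketch that evaluates $\mathcal{H}$ and the entropy-augmented reward $r + \alpha \mathcal{H}$ for a policy over a handful of discrete actions. This is a simplification for illustration only: SAC itself works with continuous action distributions, and the function names, probabilities, and the value of $\alpha$ are assumptions:

```python
import math


def entropy(probs):
    """H(P) = E_{x~P}[-log P(x)] for a discrete distribution given as probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def soft_reward(reward, action_probs, alpha):
    """Per-step term of the maximum-entropy objective: r(s, a) + alpha * H(pi(.|s))."""
    return reward + alpha * entropy(action_probs)


uniform_policy = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 actions
greedy_policy = [1.0, 0.0, 0.0, 0.0]       # fully deterministic

print(entropy(uniform_policy))  # equals log(4)
print(entropy(greedy_policy))   # zero entropy
print(soft_reward(1.0, uniform_policy, alpha=0.2))
```

For the same environment reward, the uniform policy receives an extra bonus of $\alpha \log 4$, while the deterministic policy receives none, which is what pushes the agent to keep its action distribution broad.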

In addition to speeding up the overall learning process and making for better data efficiency, this RL objective function has another desirable side effect: It produces much more stable policies [1], [7]. Unfortunately, it is not further explained why that is, but I think about it like this: Since the reward for every state also depends on the entropy component, the agent is less likely to visit the same state twice because the entropy for that state will already be decreased. Hence, by exploring many, slightly different trajectories (sequences of states and selected actions), the overall policy is more robust, because it does not hinge on observing a small number of key states in order to be able to select the overall best action. I hope that makes sense...? But those are just my two cents... Either way, in the video above, we can observe the consequences of this: The agent can deal with significant perturbations of the state (brick wall, stairs, ramp) that it has not encountered during training. This is a very nice property to have, as it implies that the learned policy is more general and can be employed in contexts that are not part of the training data.

And this is how far I will go regarding the basic idea behind the SAC algorithm. To summarize, SAC incorporates an entropy term into the Reinforcement Learning objective function, which motivates the agent to select actions under consideration of the uncertainty associated with each state. Like this, the agent can explore the environment much more efficiently, which results in significantly faster convergence, compared to many other state-of-the-art algorithms (see the original paper for benchmarking results [1]). For the remainder of this post, we will explore and discuss how the algorithm performs on a practical OpenAI Gym task.

OpenAI Gym example

Figure 1: OpenAI Gym Bipedal Walker environment. Left: Normal version. Right: Hardcore version.

OpenAI Gym [4] provides a wide array of Reinforcement Learning environments and is one of the de-facto tools being used to benchmark, compare and develop Reinforcement Learning algorithms. For getting practical experience with the SAC algorithm, I selected the BipedalWalker environment, where the goal is for a bipedal agent to develop an efficient walking gait. This environment is particularly interesting, for reasons further explained below, because it has a normal and a hardcore version, where the hardcore version of the environment contains many stumps, pitfalls, and stairs and is much harder to solve successfully. As we can see in the above video, the walking gait learned on the Minitaur robot appears to be outstandingly stable, generalizing to a handful of unseen scenarios: The brick wall and the ramp. So my hypothesis is as follows: The bipedal walker trained on the normal version of the environment might be robust enough to also solve the hardcore version of the environment, similar to how the Minitaur in the video could deal with the obstacles presented in the testing scenarios! To investigate this hypothesis, we need a working version of the algorithm though. Instead of implementing this algorithm from scratch (which would take a lot of time and straight-up not be efficient), we will use the implementation provided in this repository [5].

Results

To begin with, I trained a SAC agent for 500 epochs on both the normal and hardcore version of the environment. For comparison, I also trained a PPO [6] agent and a TD3 [8] agent on both versions of the environment, to put the convergence time of the SAC agent into perspective. To be fair, PPO is an on-policy method, and on-policy methods are known to have much worse data efficiency than off-policy methods. Consider the results presented below:

Figure 2: Training progress of SAC, TD3 and PPO agents on the normal and hardcore version of the BipedalWalker gym environment.

We can observe that on the normal version of the environment, the SAC algorithm converges within roughly 100 epochs of 5000 interactions with the environment per epoch, i.e. after about 500,000 interactions. However, out of the box, the algorithm does not appear to be able to solve the hardcore version of the environment within 500 epochs. Based on this data alone, I cannot really draw further conclusions. It is very well possible that with slight adjustments to the hyperparameters, a SAC agent could solve the hardcore version of the environment as well. However, not wanting to invest more time into this blog post, I did not bother to conduct an expensive and time-consuming hyperparameter optimization and applied the algorithm with its out-of-the-box configuration to both versions of the environment. Further, we can observe that the TD3 agent learns just as fast as the SAC agent. Again, we can’t really conclude anything beyond that this is how these algorithms perform, given this exact scenario and hyperparameter configuration. The benchmarking results presented by Haarnoja et al. do more justice to the efficiency of the algorithm than this small experiment and I highly encourage taking a look at the paper [1].

Interestingly enough, TD3 struggles to make meaningful progress within 500 epochs on the hardcore version of the environment as well. This gives some indication of the difficulty associated with this specific environment. As expected, the PPO agent hasn’t come close yet to solving the environment, which it would likely do, given more training data.

To begin the analysis of our main hypothesis, whether the policy learned by the SAC agent is robust enough to be transferred from the normal version of the environment to the hardcore one, consider the below Figure:

Figure 3: Testing rollouts of the SAC policies learned on both versions of the environment. Left: Average reward obtained. Right: Average episode length.

These statistics tell us a few things about how the learned policies perform, and they already speak to the main hypothesis we set out to investigate: whether policies trained on the simple version of the environment would be robust enough to also deal with the obstacles presented in the hardcore version of the environment, without encountering them during training. Sadly, a short glance at Figure 3 immediately falsifies that hypothesis. We can observe that when executing the policy trained on the normal version of the environment on the hardcore version, we get an average reward of roughly -100, with low deviations from that value. This is because the environment punishes the agent with a -100 reward when it falls over. Hence, the policy learned on the normal version of the environment is not robust enough to get past the obstacles in the hardcore version and falls over, getting punished with a -100 reward. Further, we can observe that the agent trained on the normal version of the environment shows almost no deviation in reward and episode length. This indicates that the agent performs equally well (or badly) most of the time, contrary to the agent trained on the hardcore version of the environment. There, we can observe a much higher range of values, indicating external factors (aka the obstacles) having an effect on the performance of the agent. To further verify our conclusion regarding our main hypothesis, take a look at how the agent performs in practice:

Agent trained on the normal version of the environment, tested on the normal version.

As expected, the agent developed a (kinda awkward looking) walking gait that successfully solves the normal version of the environment by traveling all the way to the end of the level. This is the most efficient gait it found, as applying torque to the joints costs a small amount of reward. Regarding our main hypothesis, consider the following video, where we deploy the policy trained on the normal version of the environment on the hardcore version:

Agent trained on the normal version of the environment, tested on the hardcore version.

Here, we can see how the agent struggles to get past the obstacles. This properly rejects (I think conducting a t-test on the reward distribution would be overkill and is not necessary for this blog post) our hypothesis: A SAC agent trained on the normal BipedalWalker environment is not robust enough to also solve the BipedalWalkerHardcore environment, as already indicated in Figure 3. In hindsight, I see how this is too big of a leap from the normal to the hardcore version of the environment. There is a clear difference to the real-world examples we saw above: In the examples with the stairs, the ramp, and the small brick wall, the robot, controlled by the learned policy, gets away with just sticking to the learned policy. These obstacles don’t require dedicated handling; the robot does not have to learn a specific behavior to get past them. The obstacles faced in the hardcore version of the BipedalWalker environment can clearly not be handled in the same way. The agent needs to find a distinct strategy for dealing with the different obstacles present in the hardcore version of the BipedalWalker environment.

So what about the SAC agent that has been trained directly on the hardcore version of the environment? Well, take a look…

Agent trained on the hardcore version of the environment, tested on the hardcore version.

As you can see, that agent performs very poorly. However, I am certain that the agent would learn how to get past the different hurdles given a) more training time/data and/or b) a hyperparameter optimization for that version of the environment. We can see already that the walking gait, if we can call it that, differs from what was learned on the normal version of the environment. This becomes even more apparent when we visualize the two agents side by side, one having been trained on the normal version of the environment, the other on the hardcore version:

Comparison of walking gaits learned on the two environment versions, both tested on the normal version.

Extra

Purely because it’s somewhat interesting to look at, here is a video of the walking gaits developed by the TD3, PPO and SAC algorithms (SAC trained on the normal and the HC environment):

Comparison of the walking gaits learned by agents trained with the different algorithms. SAC (HC) is, again, the gait learned on the hardcore version of the environment, executed on the normal version.

Summary

Alright, wrapping it up: Deep Reinforcement Learning methods suffer from strong data inefficiency. The Soft Actor-Critic algorithm by Haarnoja et al. tackles this data inefficiency problem of (deep) Reinforcement Learning algorithms by modifying the reward objective to include an entropy regularization term. Haarnoja et al. provide real-world examples demonstrating strong robustness of the developed policies and strong benchmarking results. We set out to investigate whether a SAC policy learned on the normal version of the environment would be robust enough to clear the obstacles in the hardcore version of the environment. Our results clearly indicate that this is not the case, for the reasons provided above.

Finally, I want to mention that Haarnoja et al. are of course not the only people investigating data inefficiency in deep reinforcement learning methods. Here are a few approaches, in case you want to do some additional googling: Task Simplification, Imitation Learning, Hindsight imagination, Hierarchical Reinforcement Learning…

All data and code is available here.

Cheers,
Finn.

References:
[1] Soft Actor Critic Algorithms and Applications, Haarnoja et al.
[2] My coursework paper on the SAC algorithm
[3] DeepMind’s AlphaStar
[4] OpenAI Gym
[5] Createamind DRL: SAC implementation
[6] Proximal Policy Optimization Algorithms
[7] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
[8] Addressing Function Approximation Error in Actor-Critic Methods
[9] Sutton and Barto: Reinforcement Learning: An Introduction

Docker + ROS: How to listen to ROS nodes in External Docker Containers

2020-08-20T00:00:00+00:00

Motivation and use case

Working with Pepper robots at the Knowledge Technology Research Group, we have the high-level goal of building Pepper into an interactive demonstration platform. One of the brainstormed requirements is that we want a Pepper robot to be capable of autonomously navigating our office space and gathering the employees for a joint lunch break. To approach the mapping and localization problem at hand, we decided on employing the Visual Simultaneous Localization and Mapping [1] algorithm. Instead of implementing the algorithm from scratch, we chose the OpenVSLAM implementation by Sumikura et al. [2], which happens to come with a Dockerfile. Thus, the objective is clear: Get Pepper’s sensor data with ROS (possibly do some data cleaning), then feed the data to the VSLAM algorithm, which is running in its Docker container. But getting the two technologies to work hand in hand is only trivial for people who have a deep understanding of both frameworks, which I didn’t initially have. Hence I want to share what took me roughly a day to figure out…

The problem

Looking for how to approach the issue of reading ROS sensor data in Docker containers, I consulted the official documentation. ROS’s documentation regarding Docker [3] only shows us how to listen to ROS nodes/topics when the main roscore command is run inside the Docker container as well. That is not what I wanted though: all our other projects were already implemented outside of Docker, and we only needed Docker for one component, the VSLAM implementation. The documentation regarding Docker’s main ROS image [4] didn’t help me either. Hence, to begin tackling this problem, the first step is clear: roscore must be running somewhere, since this is a requirement for every ROS-based system:

Classical roscore command.

The next, similarly basic, step is to launch our ROS docker image (in a new terminal), and try to start communicating with the running roscore. Following the instructions from [3] again, we run the image and source the entrypoint. To test whether the communication between the external roscore and our docker image works, we use the rosnode list command, which lists all active nodes. Given that roscore is running, there should always, at least, be the /rosout node. However, as we can see, executing these steps yields "ERROR: Unable to communicate with master!".

No communication with external roscore from docker container.

Googling for this specific ROS error message reveals interesting and helpful threads [5] that point us in the right direction. Namely, the key problem is that within our docker container, we don’t find the roscore that is running on the main system. Hence, we need to set the right environment variable (ROS_MASTER_URI [6]), which indicates where to find the running roscore. Luckily enough, the roscore command provides us with that information (consider the output from the roscore command above). In my case, the ROS master is located at http://finn-ubuntu:11311/. A quick look into /etc/hosts reveals the IP that hides behind the local “finn-ubuntu” hostname:

/etc/hosts contains mapping from hostnames to IPs.

However, as we can observe below, even setting the right environment variable within the docker container does not appear to solve the issue, we are still left with the same error as before:

Still no communication after setting the right environment variable.

So what’s causing this? Are we approaching the error from the wrong side? Did we maybe just have a typo somewhere? And most importantly: How do we fix this and finally access our valuable ROS nodes/topics from within our docker container?

The solution

Actually, the final (working) solution is very close to what we did previously. However, one crucial detail is missing: The fact that docker containers, per default, live in a virtual bridge network [7]! The reason for this is explained in [7]:

In terms of Docker, a bridge network uses a software bridge which allows containers connected to the same bridge network to communicate, while providing isolation from containers which are not connected to that bridge network.

ifconfig command revealing information about our network.

While this is certainly a very powerful and useful concept, it is also apparent how this caused our earlier fix to fail. Inspecting the output from ip addr further illustrates this “problem”: We see all the network interfaces that are currently running on our computer. This usually includes at the very least the lo loopback device (which functions as a local virtual network and runs on 127.0.0.1) and the physical network adapter, usually called en0 or in my case enp8s0.
(Read more about network interfaces [8] and their naming convention [9])
Additionally, we see the docker0 interface. This is the virtual bridge network mentioned above. Because of this, docker containers can communicate with one another, but are isolated from the other host networks, including lo, our loopback device responsible for the local network on our host machine. As we have looked up earlier, roscore runs on 127.0.1.1 (i.e. is running locally) and is thus not visible from the docker0 bridge.

However, our docker container of course has access to the internet, meaning we can access enp8s0 from within docker. Thus, we can access our host machine via its IPv4 address, which is visible from within our subnet (subnet as in the network that all the devices use that are connected to your router). Knowing that our roscore runs on port 11311, we again set the environment variable from within our docker container: export ROS_MASTER_URI=http://192.168.10.27:11311/. This time, however, instead of using the lo network address (127.0.0.1), we used the enp8s0 IP address of our machine. And voilà, rosnode list displays the /rosout node, which proves that communication with the roscore, from within the docker container, works!

Working communication with roscore from within docker container.
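Since the ROS master is an ordinary TCP service, the reachability question from this section can also be checked independently of ROS. The following is a minimal sketch and not part of my original setup: the helper name master_reachable is made up for illustration, and it only tests whether a TCP connection to the host and port in the URI can be opened, not whether a ROS master actually answers there.

```python
import socket
from urllib.parse import urlparse


def master_reachable(master_uri, timeout=1.0):
    """Return True if a TCP connection to the URI's host and port can be opened.

    Hypothetical helper for illustration; it does not speak the ROS protocol.
    """
    parsed = urlparse(master_uri)
    host = parsed.hostname
    if host is None:
        # Not a usable URI (e.g. the scheme or host part is missing).
        return False
    port = parsed.port or 11311  # 11311 is the port roscore reported above
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable network, timeout, unknown host, ...
        return False
```

Run inside the container, master_reachable("http://192.168.10.27:11311/") would be expected to return True in the setup described here, while a URI pointing at a loopback address would be expected to fail, because the container has its own loopback device.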

The more elegant solution

The walkthrough above derives the solution in the order of my learning experience. However, after identifying the root of the problem and learning about the whole docker networking thing, I now know that there is a much more elegant solution to the problem. Turns out, the docker developers have considered that some people might need to be able to communicate with services running locally on the machine that also hosts/runs docker. For such a scenario, there exists a dedicated network driver that we can pass to the docker container. This is as simple as passing the following argument to our docker call:
--network host. And that’s it: by adding this argument to the command, you should be able to listen to ROS nodes from within your docker containers right away.

Summary

By default, docker containers run in a virtual bridge network, isolating them from host networks like lo, making the localhost inaccessible. Docker provides a network driver that removes the isolation and makes the loopback device network accessible from within the docker container. This host networking driver can be activated with the --network host command line argument. Alternatively, the isolation can be circumvented manually by using the en0 network adapter IP address of the host machine (which is accessible within the docker container for TCP/IP communication).

Cheers,
Finn.

References:
[1] Simultaneous Localization and Mapping
[2] OpenVSLAM
[3] ROS Docker documentation
[4] Docker ROS image
[5] Unable to communicate with master fix
[6] ROS MASTER URI
[7] Virtual bridge network
[8] ifconfig command in depth
[9] enpXsxX naming convention
[10] Docker networking guide
[11] Docker host network driver

Decorate-Sort-Undecorate: List Sorting in Python

2019-12-09T00:00:00+00:00

Motivation and use case

The other day, I found myself generating a complex performance graph using matplotlib. I wanted to rearrange the order of the items in the legend. There are multiple ways we could tackle this problem: We could, for example, manipulate the order in which the items are plotted, since this affects the order in which the legend data is generated. The, in my opinion, much more elegant way is to directly manipulate the order of the items in matplotlib’s legend. Since the legend consists of two lists (a list of handles and a list of labels), I needed a solution for sorting both lists.

Simple example: Sorting two lists in the same manner

We will start simple and, for now, only consider sorting two lists (that indirectly reference each other) in the same order. The more complex example, where we sort two lists by the order of a third list, will be introduced in the following section.

Consider the following example code:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib

matplotlib.rcParams.update({'font.size': 14})

# init figure
fig, ax = plt.subplots(1,1, figsize=(10,7))

# generate artificial data
line_a = np.linspace(100, 0, 85)
line_b = np.linspace(80, 0, 65)
line_c = np.linspace(60, 0, 45)
line_d = np.linspace(40, 0, 25)

# plot artificial data
plt.plot(line_c, ls="-.", label="0.45 Alg. A [Version 0.2]")
plt.plot(line_b, ls="--", label="0.65 Alg. B [Version 0.1]")
plt.plot(line_a, ls=":", label="0.85 Alg. A [Version 0.1]")
plt.plot(line_d, label="0.25 Alg. B [Version 0.2]")
plt.legend()
plt.show()

This code produces the following plot:

Pay close attention to the legend! There appears to be no ordering of the legend entries. We can observe that each item in the plot (the four lines) is identified by its linestyle and has a describing text, indicating a performance measure. These are the two lists that I mentioned earlier: The legend handles are the linestyle elements, and the labels are the pieces of text describing each item in the plot.

We can get these two lists using the following command:

# handles and labels are of type list
handles, labels = ax.get_legend_handles_labels()

We don’t have to worry too much about what the handles list looks like, since we don’t want to sort by the linestyle. Instead, we want to sort the legend entries according to the text describing each item in the plot. Thus, let’s inspect the labels list:

labels
['0.45 Alg. A [Version 0.2]',
'0.65 Alg. B [Version 0.1]',
'0.85 Alg. A [Version 0.1]',
'0.25 Alg. B [Version 0.2]']

Sticking to the simple case, let’s fix the weird-looking ordering of the items in the legend. Luckily, Python makes it easy to sort two lists in the same way! It should be clear by now why we need to sort both lists: If we only sort the list of the labels, they will no longer match their handles! This means that the text in the legend would no longer match its preceding linestyle! This is very dangerous and should never be done, since it misrepresents the entire plot! Thus, the following code sorts both lists according to the items in the first list:

ordered_labels, ordered_handles = zip(*sorted(zip(labels, handles)))
ordered_labels
('0.25 Alg. B [Version 0.2]',
'0.45 Alg. A [Version 0.2]',
'0.65 Alg. B [Version 0.1]',
'0.85 Alg. A [Version 0.1]')

Here, we already (implicitly) made use of the ‘Decorate-Sort-Undecorate’ idiom, but more on that later. Let’s break the ordered_labels, ordered_handles = zip(*sorted(zip(labels, handles))) line down into smaller pieces, to really understand what’s happening. The innermost zip(labels, handles) shouldn’t be too mysterious. The zip() function simply does what it always does: Creating an iterator of n-tuples (meaning there are n elements in each tuple), where n is the number of iterables we pass into the zip() function. To give an example:

for pair in zip(["fgh", "asd"], ["456", "123"]):
    print(pair)
('fgh', '456')
('asd', '123')

In our case, this generates an iterator where each tuple contains one element of both lists, meaning the i-th tuple contains the i-th legend label and the i-th legend handle. So far, so good, but what does the *sorted(...) do? Well, sorted simply sorts the items of a given iterable (the iterable we get from the innermost zip()) and returns them as a list.

sorted(zip(["fgh", "asd"], ["456", "123"]))
[('asd', '123'), ('fgh', '456')]

Now, we get a list of sorted tuples. But we want two sorted lists! Here, the asterisk * comes into play. The asterisk, in this case, simply unpacks the list it receives into separate positional arguments.

print(*sorted(zip(["fgh", "asd"], ["456", "123"])))
('asd', '123') ('fgh', '456')

So now we no longer have a single list of tuples, but rather the individual tuples as separate arguments. Remember what we could do with several iterables? Throw them at zip() and get an iterator of n-tuples back! Since each tuple is itself an iterable, this works here as well. See where this is going? Look at the following snippet:

for group in zip(*sorted(zip(["fgh", "asd"], ["456", "123"]))):
    print(group)
('asd', 'fgh')
('123', '456')

This is exactly what we wanted! Both lists have been sorted according to the items in the first list (actually, these are now tuples and not lists, but you can cast them into lists if it matters in your use case). Applying this to the initial snippet produces the following plot (if you want to take a look at the complete, updated snippet, here is the GitHub gist):

Here is the resulting plot, in which the legend entries are sorted by the performance measure value:

The legend looks much better! The entries in the legend are sorted in an ascending manner regarding the performance measure. But what if we aren’t yet happy with the way the entries in the legend are ordered? What if we don’t want to sort by the performance measure, but instead by the algorithm name or the version of each algorithm? This we will explore in the next two sections, so hold on to your coffee mugs and bear with me!
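One caveat with sorting the zipped pairs directly: if two labels are identical, Python falls back to comparing the second tuple elements, and objects like legend handles generally do not define an ordering, so the sort can raise a TypeError. Below is a minimal sketch of a tie-safe variant. The helper name sort_legend and the FakeHandle stand-in class are made up for illustration, so that the example runs without matplotlib.

```python
class FakeHandle:
    """Stand-in for a legend handle; like most objects, it has no ordering."""

    def __init__(self, linestyle):
        self.linestyle = linestyle


def sort_legend(handles, labels):
    """Sort handles and labels together, comparing the label text only."""
    # key=... makes sorted() look at the label alone, never at the handle.
    pairs = sorted(zip(labels, handles), key=lambda pair: pair[0])
    ordered_labels = [label for label, _ in pairs]
    ordered_handles = [handle for _, handle in pairs]
    return ordered_handles, ordered_labels


labels = ["0.45 Alg. A", "0.25 Alg. B", "0.45 Alg. A"]  # note the duplicate label
handles = [FakeHandle("-."), FakeHandle("-"), FakeHandle(":")]

ordered_handles, ordered_labels = sort_legend(handles, labels)
print(ordered_labels)
print([handle.linestyle for handle in ordered_handles])
```

With real matplotlib objects, the two returned lists can be handed back to the plot via ax.legend(ordered_handles, ordered_labels).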

The 'Decorate-Sort-Undecorate' idiom

So what is this mysterious ‘Decorate-Sort-Undecorate’ idiom, which is prominently mentioned in the title of this post but has only been used implicitly and without much explanation? Decorate-Sort-Undecorate, also known as the Schwartzian transform, is a technique for

...comparison-based sorting when the ordering is actually based on the ordering of a certain property (the key) of the elements... wikipedia

and has been around since 1994. The idiom gets its name from the three main steps:

Create a list (decorate an existing list) with specific values, whose purpose is to control the sorting behavior.
Sort the decorated list (possibly apply sorting to another list).
Remove decorations from the decorated list (can be ignored if dedicated list with decorated values has been created).

Thus, this idiom describes how to sort a list, when we are not happy with the default sorting behavior. In the next and final section of this post, we will see the idiom in action to apply a custom sorting to our matplotlib legend.
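The three steps can be written out literally in a few lines. This is a minimal sketch rather than the code from the gist: the helper version_of is made up for illustration, and the index is included in each decorated tuple so that ties keep their original order and the labels themselves are never compared.

```python
labels = [
    "0.45 Alg. A [Version 0.2]",
    "0.65 Alg. B [Version 0.1]",
    "0.85 Alg. A [Version 0.1]",
    "0.25 Alg. B [Version 0.2]",
]


def version_of(label):
    """Hypothetical helper: return the text between '[Version ' and ']'."""
    return label.split("[Version ")[1].rstrip("]")


# 1. Decorate: attach the sort key (and the original index) to every item.
decorated = [(version_of(label), index, label) for index, label in enumerate(labels)]

# 2. Sort: tuples compare element by element, so the key decides the order.
decorated.sort()

# 3. Undecorate: throw the key and index away again.
by_version = [label for _, _, label in decorated]
print(by_version)
```

In modern Python, the same result is usually obtained with sorted(labels, key=version_of), which takes care of the decoration internally.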

Advanced example: Sorting two lists by the order of a third

Now, towards the more general example. Say we have two lists, like the list of handles and labels in a matplotlib legend, and we want to sort both of those lists according to some custom sorting behavior. Here, we fully embrace the decorate part of the ‘Decorate-Sort-Undecorate’ idiom by creating an additional list, with the sole purpose of generating the sorting behavior for our two actual lists.

For this, we must first generate the decorated list (the original idiom manipulates the actual list, but I feel like doing this externally is much more intuitive). As mentioned above, let’s say we want to sort our legend entries not by the performance measure, but by the name of the algorithms:

decorated = [text[5:] for text in labels]
decorated
['Alg. A [Version 0.2]',
'Alg. B [Version 0.1]',
'Alg. A [Version 0.1]',
'Alg. B [Version 0.2]']

In the next step, we generate an index list, and sort the indices according to the decorated list.

sorted_indices = list(range(len(decorated)))
sorted_indices.sort(key=decorated.__getitem__)
sorted_indices
[2, 0, 1, 3]

The final step is to simply map the sorted indices to the two lists we want to sort and we are done!

sorted_labels = list(map(labels.__getitem__, sorted_indices))
sorted_handles = list(map(handles.__getitem__, sorted_indices))

And that’s it! Here’s our final plot:

We have applied the ‘Decorate-Sort-Undecorate’ idiom to sort two lists in the same manner. The final version of the initial snippet, which contains the above steps, is available as a gist on GitHub. I hope you learned something from this post, or if not, at least enjoyed reading it as much as I enjoyed writing it.

Cheers,
Finn.

P.S.: This post has been motivated by my answer on stackoverflow regarding the question Is it possible to sort two lists(which reference each other) in the exact same way?

References:
[1] Schwartzian transform
[2] Python: Sorting Mini-HOW TO
[3] Python sorted()
[4] Python zip()
[5] Understanding the asterisk(*) of Python

Automatically Backup any Postgres Database Table into a GoogleDrive

2019-09-27T00:00:00+00:00

Motivation and use case

In this post I will explain how to automate the process of backing up a postgres database table into a GoogleDrive cloud storage location. I encountered this problem when helping my dad out with some IT-related problem, which included the backup process of a PostgreSQL database table. The solution I provide here can, however, easily be adapted to back up whatever you like and upload it into some GoogleDrive folder.

In my approach, we use two scripts to accomplish this: One script to produce the backup files and a second script that takes care of the uploading process. If you wish to upload something else, simply adapt the shell script which, in this case, generates the .sql dumps of the desired postgres database table.

For now, clone the repository and we are good to go:

git clone https://github.com/frietz58/postgres_googledrive_backup.git

If you just want to get started, the README of this project on GitHub contains all the relevant steps as well ;)

Note that the installation process requires a desktop environment and a browser, so straight up installing this on a headless linux distro won’t work, for reasons explained below.

Dumping a postgres table

First, we will take a look at how we can backup a postgres database table (If you wish to backup something else, you should start here). For this, we use the command pg_dump [dbname], which can create script or archive dumps of any given database. I’ve chosen to use script dumps which are

"plain-text files containing the SQL commands required to reconstruct the database to the state it was in at the time it was saved." postgres documentation

Specifically, in the cron_backup.sh script, the line that produces the .sql dump file is:

  pg_dump $target_db > $save_dir/$timestamp"_dump_"$ip4_addr".sql"

This creates the .sql dump of $target_db in the folder $save_dir, where the name of the dump file contains the current timestamp $timestamp and the IPv4 address $ip4_addr of the machine on which the backup has been created. Regarding the postgres database, you need to consider that by default, not every user is allowed to access the database tables. If you wish to only create a backup once, right now, you could use the following command:

sudo -i -u postgres pg_dump $target_db > dump.sql

Here, -i runs the command in a login shell and -u specifies the target user. When we automate this process, we use the crontab of the postgres user to take care of this, as explained below.

The rest of the cron_backup.sh script parses the .yaml configuration file, sets variables like $ip4_addr or $timestamp and echoes some log messages. If you don’t need dynamic file names and can ensure that the $save_dir always exists, you could drastically shorten the script :)
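For readers who prefer to generate the dump file name outside of the shell, here is a minimal sketch of the naming scheme described above. It is not taken from the repository: the function name dump_path, the example directory, the example IP address and, in particular, the timestamp format are assumptions, since the format used by cron_backup.sh is not shown in this post.

```python
from datetime import datetime
from pathlib import Path


def dump_path(save_dir, ip4_addr, now=None, stamp_format="%Y-%m-%d_%H-%M-%S"):
    """Build <save_dir>/<timestamp>_dump_<ip4_addr>.sql.

    The timestamp format is an assumption; the original script may differ.
    """
    now = now or datetime.now()
    timestamp = now.strftime(stamp_format)
    return Path(save_dir) / f"{timestamp}_dump_{ip4_addr}.sql"


# Hypothetical directory and address, with a fixed time for reproducibility.
example = dump_path("/var/backups/pg", "192.168.0.10", now=datetime(2019, 9, 27, 23, 50, 0))
print(example.name)
```

Passing the time in explicitly, as done in the example call, makes the naming easy to check without waiting for the clock.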

Uploading to GoogleDrive

The second script, drive_upload.py, takes care of uploading the files into a GoogleDrive cloud folder. It uses the GoogleDrive v3 API. Before that works, you will have to install the requirements, preferably in a virtualenv. So create and activate the virtualenv:

virtualenv -p python3 postgres_venv
source postgres_venv/bin/activate
pip install -r postgres_googledrive_backup/requirements.txt

Once you’ve installed the requirements, you need to enable the Drive API for your Google account, which is what we will do in the next section. I recommend that you create a new Google account specifically for storing the backups.

Enabling the Drive v3 API

Before we can automate the upload process, we must enable the Google Drive v3 API and in the process acquire the two files credentials.json and token.pickle. Those files are needed by the drive_upload.py script to authenticate the connected Google account using OAuth 2.0.

Acquiring credentials.json file

Enabling the Drive API (under step one) on this page will download the file credentials.json. For now you can save this file wherever, but make sure to set the path in the config.yaml at credtials_path: /path/to/credentials.json, so that the drive_upload.py script will know where the file is located. Once that is done, we can finally obtain the token.pickle file, which is the last piece that is still missing.

Acquiring token.pickle file

Even though I am not very happy about it, the last step requires you to have a desktop environment with a working browser. I am not sure why Google made it this way, but I assume they want you to grant a script or app access to your Drive storage only directly and interactively.

Given that you have set the path to the credentials.json file in the config.yaml file, execute the drive_upload.py script:

cd postgres_googledrive_backup
python drive_upload.py -cf /path/to/config.yaml

This will open your default browser and ask you to allow “Quickstart” access to that Google account’s Drive storage:

If everything worked, this will create the file token.pickle in your current working directory and output Token file has been acquired, exiting... to your console. Make sure to adjust the path of the token.pickle file in the config.yaml file, so that drive_upload.py finds the token, independent of your working directory.

And that’s it. You could now run both scripts manually and it would upload the .sql dumps into the GoogleDrive folder set in the config.yaml file. But we are developers, there is no fun in doing things manually. So let’s automate the entire workflow using the crontab in the next and final section of this post.

Automatic execution via Crontab

Now that the files credentials.json and token.pickle have been acquired, we can also use this on any headless linux distro. Simply scp the entire postgres_googledrive_backup folder to your headless linux distro, create a virtualenv there and install the requirements once again. The important thing is to make sure that the paths in the config.yaml file are correct.

To automatically create backups using the cron_backup.sh script and automatically upload those into the cloud with the drive_upload.py script, we will use the crontab. The crontab allows us to execute arbitrary commands on a regular schedule, meaning that we can automatically create backups and upload them into the cloud.

Because only the postgres user has access to the postgres database tables that we wish to back up, we must use the postgres user’s crontab to schedule the backups. I assume that you have postgres installed, given that this post deals with backing up a postgres database table… We can use the su command to become the postgres user like this:

sudo su postgres

Once we are the postgres user, we can edit the crontab using the following command:

crontab -e

50 23 * * * /home/pcadmin/automatic_backup/cron_backup.sh /home/pcadmin/automatic_backup/config.yaml > /tmp/backup.log
55 23 * * * /home/pcadmin/backup_venv/bin/python3 /home/pcadmin/automatic_backup/drive_upload.py -cf /home/pcadmin/automatic_backup/config.yaml >> /tmp/backup.log

If you wish to run the backup process daily at a specific time, the cron prefix is 50 23 * * * for every day at 23:50 (the minute field comes first, then the hour). Take a look at the crontab.md file, which is pretty close to what I am running on my linux server. Both scripts accept one parameter, which is the path to the config.yaml file. So make sure to adjust that path and also make sure that the paths in the config.yaml are correct.
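Because the order of the schedule fields is easy to mix up, it can help to split a crontab line into named parts. The sketch below is only an illustration: the function name split_cron_line is made up, and it handles just the plain five-field form used in this post, not shortcuts like @daily or environment assignments.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_line(line):
    """Split a crontab line into its five schedule fields and the command."""
    # maxsplit=5 keeps the command, including its own spaces, in one piece.
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


schedule, command = split_cron_line(
    "50 23 * * * /home/pcadmin/automatic_backup/cron_backup.sh "
    "/home/pcadmin/automatic_backup/config.yaml > /tmp/backup.log"
)
print(schedule)
print(command)
```

For the first crontab entry from this post, the minute field is 50 and the hour field is 23, i.e. the job runs every day at 23:50.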

When working with the crontab, the environment variables from your interactive shell (such as a customized PATH) are not set, meaning that something like python example.py may not work. Instead, you need to use absolute paths: /usr/bin/python example.py.

Further, we need to make sure that the postgres user, who will execute the commands in the postgres crontab, has read, write and execute permissions on the directory containing our scripts, so run the following commands:
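To find out which absolute path to write into the crontab, the interpreter or script can be looked up once from an interactive shell. A minimal sketch using Python’s standard library follows; the helper name absolute_command is made up for illustration, and the lookup reflects the PATH of the shell you run it in, not the one cron uses.

```python
import os
import shutil


def absolute_command(name, search_path=None):
    """Return the absolute path of an executable, or raise if it is not found.

    search_path is handed to shutil.which; None means the current PATH.
    """
    resolved = shutil.which(name, path=search_path)
    if resolved is None:
        raise FileNotFoundError(f"{name!r} not found on the search path")
    # PATH entries may be relative, so normalise to an absolute path.
    return os.path.abspath(resolved)
```

Calling absolute_command("python3") in the activated virtualenv would, for example, return the interpreter inside that virtualenv, which is the kind of path used in the second crontab entry above.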

sudo chown postgres postgres_googledrive_backup
sudo chmod 700 postgres_googledrive_backup

This makes the postgres user the owner of that directory and gives only the owner read, write and execute permissions.

And that’s it. You can now, at an arbitrary interval, create backups of a postgres database table (or anything else if you adapt the backup generating script) and upload those backups into a GoogleDrive cloud folder. Working with the crontab for the first time can be a bit cumbersome, because commands that work in your shell don’t necessarily work in the crontab. Make sure that all paths are correct and the user whose crontab you are using (here the postgres user) has sufficient permissions on the folder it’s operating on, and you should be good to go.

Cheers,
Finn.