This page requires Javascript. Please enable it for

Examples of evolved self-attention agents
In this work, we evolve agents that attend to a small fraction of its visual input critical for its survival, allowing for interpretable agents that are not only compact, but also more generalizable. Here, we show examples of our agent's attention highlighted in white patches. In CarRacing (left), our agent mostly attends to the road borders, but shifts its focus to the turns before it changes heading directions. In DoomTakeCover (right), The agent is able to focus on fireballs and monsters, consistent with our intuitions.

Neuroevolution of Self-Interpretable Agents


Inattentional blindness is the psychological phenomenon that causes one to miss things in plain sight. It is a consequence of the selective attention in perception that lets us remain focused on important parts of our world without distraction from irrelevant details. Motivated by selective attention, we study the properties of artificial agents that perceive the world through the lens of a self-attention bottleneck. By constraining access to only a small fraction of the visual input, we show that their policies are directly interpretable in pixel space. We find neuroevolution ideal for training self-attention architectures for vision-based reinforcement learning tasks, allowing us to incorporate modules that can include discrete, non-differentiable operations which are useful for our agent. We argue that self-attention has similar properties as indirect encoding, in the sense that large implicit weight matrices are generated from a small number of key-query parameters, thus enabling our agent to solve challenging vision based tasks with at least 1000x fewer parameters than existing methods. Since our agent attends to only task-critical visual hints, they are able to generalize to environments where task irrelevant elements are modified while conventional methods fail.


There is much discussion in the deep learning community about the generalization properties of large neural networks. While larger neural networks generalize better than smaller networks, the reason is not that they have more weight parameters, but as recent work (e.g. ) suggests, it is because larger networks allow the optimization algorithm to find good solutions, or lottery tickets , within a small fraction of the allowable solution space. These solutions can then be pruned to form sub-networks with useful inductive biases that have desirable generalization properties.

Recent neuroscience critiques of deep learning (e.g. ) point out that animals are born with highly structured brain connectivity that are far too complex to be specified explicitly in the genome and must be compressed through a β€œgenomic bottleneck”--information encoded into the genome that specify a set of rules for wiring up a brain . Innate processes and behaviors are encoded by evolution into the genome, and as such many of the neuronal circuits in animal brains are pre-wired, and ready to operate from birth . These innate abilities make it easier for animals to generalize and quickly adapt to different environments .

There is actually a whole area of related research within the neurevolution field on evolving this genetic bottleneck, which is called an indirect encoding. Analogous to the pruning of lottery ticket solutions, indirect encoding methods allow for both the expressiveness of large neural architectures while minimizing the number of free model parameters. We believe that the foundations laid by the work on indirect encoding can help us gain a better understanding of the inductive biasesThe inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not yet encountered. (Wikipedia) of neural networks and possibly offer a fresh perspective for approaching out-of-domain generalization problems.

HyperNEAT is an indirect encoding method that can generate a variety of weight patterns with structured regularities. Here are examples (taken from ) of fully connected networks (phenotype) where the weights are not individually trained, but generated by a smaller neural network (genotype). The weight of each connection is a function of its location within the network. The thickness represents the magnitude of its weight.

Most current methods used to train neural networks, whether with gradient descent or evolution strategies, aim to solve for the value of each individual weight parameter of a given neural network. We refer to these methods as direct encoding methods. Indirect encoding , on the other hand, offers a radically different approach. These methods optimize instead for a small set of rules or operations, referred to as the genotype, that specify how the (much larger) neural network (the phenotype) should be generated.These terms in the evolutionary computing literature were taken from the evolutionary biology field, where the genotype is the part of the genetic makeup of a cell, and therefore of any individual organism, which determines the organism's actual observed properties, such as morphology, development, or behavior (the phenotype) (Wikipedia) In general, the phenotype encompasses both the neural architecture and its weights, but contemporary indirect encoding methods (e.g. ) typically generate only the weights of a pre-defined architecture using a small set of genotype parameters.

Before the popularity of Deep RL, indirect encoding methods in the neuroevolution literature have been a promising approach for the types of problems that eventually used Deep RL solutions. In the case of vision-based RL problems, earlier works demonstrated that large neural networks can be encoded into much smaller, genotype solutions, that are capable of playing Atari from pixels (when it was still considered challenging in 2012) or car racing directly from pixel-only inputs , hinting at the potential power of indirect encoding. Even before deep learning and convolutional networks started to gain traction in 2012 , indirect encoding has enabled neural network controllers to play board games with structural regularities such as Checkers and Go .

By encoding the weights of a large model with a small number of parameters, we can substantially reduce the search space of the solution, at the expense of restricting our solution to a small subspace of all possible solutions offered by direct encoding methods. This constraint naturally incorporates into our agent an inductive bias that determines what it does well at , and this inductive bias is dependent on the choice of our indirect encoding method. For instance, HyperNEAT has been successful at robotic gait control , suggesting CPPNsCompositional pattern-producing network (CPPN) is the genotype in the HyperNEAT⁠ algorithmβ€”a small neural network that produces patterns and regularities that define the weights of a larger phenotype network. (Wikipedia) to be effective at representing modular and symmetric properties suitable for locomotion. But are there indirect encoding methods that are better suited for vision-based RL tasks?

Car Racing environment
Take Cover environment
Perceiving the world through a self-attention bottleneck
Our agent receives visual input as a stream of 96x96px RGB images (left). Each image frame is passed through a self-attention bottleneck module, responsible for selecting K=10 patches (highlighted in white, middle). Features from these K patches (such as location) are then routed to a decision-making controller (right) that will produce the agent's next action. The parameters of the self-attention module and the controller are trained together using neuroevolution.

In this work, we establish that self-attention can be viewed as a form of indirect encoding, which enables us to construct highly parameter-efficient agents. We investigate the performance and generalization properties of these agents for vision-based RL tasks. Self-attention has been popularized by Transformer models that have been successfully applied in domains such as natural language processing and vision . As we will explain, self-attention offers a simple yet powerful approach for parameterizing a large weight matrix of size O(n2)\mathcal{O}(n^2) using only O(d)\mathcal{O}(d) number of parameter values, where nn is the size of the visual input, dd is the dimension of some transformed space and n≫dn \gg d. Furthermore, such a parameterization enforces an inductive bias to encourage our agent to attend to only a small fraction of its visual input, and as such naturally makes the agent more interpretable.

Environments outside of training distribution for Take Cover
In psychology, inattentive blindness results from a lack of attention that is not associated with vision defects, as individuals fail to perceive an unexpected stimulus in plain sight. Mack and Rock make the radical claim that there is no conscious perception of the visual world without attention to it. In this work, we model this phenomenon in simulation and restrict our agent's decision-making controller access only a small faction of its visual input. We demonstrate that an agent's inattentive blindness can result in better generalization, simply due to its ability to β€œnot see things” that can confuse it. Here, our agent is evolved to survive in the top-left environment only, but it can also survive in unseen settings with higher walls, different floor textures, or when confronted with a distracting sign.

As we will show, neuroevolution is an ideal method for training self-attention agents, because not only can we remove unnecessary complexity required for gradient-based methods, resulting in much simpler architectures, we can also incorporate modules that enhance the effectiveness of self-attention that need not be differentiable. We showcase self-attention agents trained with neuroevolution that require 1000x fewer parameters than conventional methods and yet is able to solve challenging vision-based RL tasks. Specifically, with less than 4000 parameters, our self-attention agents can reach average scores of 914 over 100 consecutive trials in a 2D car racing task and 1125 in a 3D VizDoom task (the tasks are considered solved for scores >900 and >750), comparable with existing state-of-the-art (SOTA) results in . Moreover, our agent learns to attend to only task-critical visual spots and is therefore able to generalize to environments where task irrelevant elements are modified whereas conventional methods fail.

Environments outside of training distribution for Car Racing
Humans who learn to drive during a sunny day can (to some extent) also drive at night, on a rainy day, in a different car, or in the presence of bird droppings on the windshield. Our self-attention agent is not only able to solve CarRacing-v0, achieving an average score > 900, it can also achieve similar performance in unseen conditions (such as brighter or darker scenery, or having its vision modified by artifacts such as side bars or background blobs), while requiring 1000x fewer parameters than conventional methods that fail to generalize.

The goal of this work is to showcase self-attention as a powerful tool for the neuroevolution toolbox, and we provide open-source code for reproducing our experiments (See Appendix). We hope our results will encourage further investigation into the neuroevolution of self-attention models, and also revitalize interest in indirect encoding methods.

Background on Self-Attention

We now give a brief overview of self-attention. Here, we describe a simpler subset of the full Transformer architecture used in this work. In particular, we omit Value matrices, positional encoding, multi-head attention from our method, and opt for the simplest variation that complements neuroevolution methods for our purpose. We refer to for an in-depth overview of the Transformer model.

Let X∈RnΓ—dinX \in\;\mathcal{R}^{n \times d_{in}} be an input sequence of nn elements (e.g. number of words in a sentence, pixels in an image), each of dimensions dind_{in} (e.g. word embedding size, RGB intensities). Self-attention module calculates an attention score matrix and a weighted output:

A=softmax(1din(XWk)(XWq)⊺)(1)Y=AX(2)\begin{aligned} A&=\text{softmax}\big(\frac{1}{\sqrt{d_{in}}}(X W_k) (X W_q)^\intercal \big) & (1) \\ Y&=A X & (2) \end{aligned}

where Wk,Wq∈RdinΓ—dW_k, W_q \in\;\mathcal{R}^{d_{in} \times d} are matrices that map the input to components called Key and Query (Key=XWk,Query=XWq\text{Key} = XW_k, \text{Query} = XW_q), dd is the dimension of the transformed space and is usually a small integer. Since the average value of the dot product grows with the vector's dimension, each entry in the Key and Query matrices can be disproportionally too large if dind_{in} is large. To counter this, the factor 1din\frac{1}{\sqrt{d_{in}}} is used to normalize the inputs. Applying the softmax\text{softmax}softmax(xi)=exp(xi)/βˆ‘kexp(xk)\text{softmax}(x_i) = \exp(x_i) / \sum_{k}{\exp(x_k)} operation along the rows of the matrix product in Equation 1, we get the attention matrix A∈RnΓ—nA \in\;\mathcal{R}^{n \times n}, where each row vector of AA sums to 11. Thus, each row of output Y∈RnΓ—dinY \in\;\mathcal{R}^{n \times d_{in}} can be interpreted as a weighted average of the input XX by each row of the attention matrix.

Visualizing Attention Layers from the Transfomer model. Figures from The Annotated Transformer.

Self-attention lets us map arbitrary input XX to target output YY, and this mapping is determined by an attention matrix AA parameterized by much smaller Key and Query parameters, which can be trained using machine learning techniques. The self-attention mechanism is at the heart of recent SOTA methods for translation and language modeling , and has now become a common place method for natural language processing domain.

Self-Attention for Images

Although self-attention is broadly applied to sequential data, it is straightforward to adapt it to images. For images, the input is a tensor X∈RHΓ—WΓ—CX \in\;\mathcal{R}^{H \times W \times C} where HH and WW are the height and width of the image, CC is the number of image channels (e.g., 3 for RGB, 1 for gray-scale). If we reshape the image so that it becomes X∈RnΓ—dinX \in\;\mathcal{R}^{n \times d_{in}} where n=HΓ—Wn = H \times W and din=Cd_{in} = C, all the operations defined in Equations 1-2 are valid and can be readily applied. In the reshaped XX, each row represents a pixel and the attentions are between pixels. Notice that the complexity of Equation 1 grows quadratically with the number of rows in XX due to matrix multiplication, it therefore becomes computationally prohibitive when the input image is large. While down-sampling the image before applying self-attention is a quick fix, it is accompanied with performance trade-off. For more discussion and methods to partially overcome this trade-off for images, see .

Rather than applying self-attention on individual pixels, we extract patches from an input image using a slide window. We will explain the approach in greater detail later on in the Self-Attention Agent section.

Instead of applying operations on individual pixels of the entire input, a popular method for image processing is to organize the image into patches and take them as inputs as described in previous work (e.g. ). In our approach, our agent attends to patches of the input rather than individual pixels, and we use a sliding window to crop the input image in our input transformations. Conceptually, our approach is similar to Spatial Softmax , which compresses visual inputs into a set of 2D keypoints that are relevant to the task. This has been shown to work on robot perception tasks, and in some cases the keypoints are spatially interpretable.

Self-Attention as a form of Indirect Encoding

Indirect encoding methods represent the weights of a neural network, the phenotype, with a smaller set of genotype parameters. How a genotype encodes a larger solution space is defined by the indirect encoding algorithm. HyperNEAT encodes the weights of a large network via a coordinate-based CPPN-NEAT network, while Compressed Network Search uses discrete cosine transform to compress the weights of a large weight matrix into a small number of DCT coefficients, similar to JPEG compression. Examples of patterns produced with these methods:

Patterns produced using CPPN-NEAT . Rich pattern-generators that produce structured regularities can be used in indirect encoding methods to produce weight values of large neural networks, as demonstrated in HyperNEAT . (Click Image to Enlarge)
Compressed Network Search uses Discrete Cosine Transform to generate a large number of feature detectors using only a small number of learnable coefficients. The patterns generated also exhibit regular patterns useful for continuous control tasks. (Click Image to Enlarge)

Due to compression, the space of possible weights an indirect encoding scheme can produce is only a small subspace of all possible combination of weights. The constraint on the solution space resulting from indirect encoding enforces an inductive bias into the phenotype. While this bias determines the types of tasks that the network is naturally suited at doing, it also restricts the network to a subset of all possible tasks that an unconstrained phenotype can (in theory) perform. More recent works have proposed ways to broaden its task domain of indirect encoding. ES-HyperNEAT proposed adapting part of the indirect encoding algorithm itself to the task environment. Hypernetworks suggested making the phenotype directly dependent on the inputs, thus tailoring the weights of the phenotype to the specific inputs of the network. Following this approach of incorporating information from the input into the weight-generating process, it has been shown that the phenotype can be highly expressive as the weights can adapt to the inputs for the task at hand, while static indirect encoding methods cannot.

Hypernetworks is an example of a dynamic indirect encoding method where the weights produced are allowed to change over time. When we train a hypernetwork model to generate synthetic handwriting or text, the model samples an action from a probability distribution conditioned on its internal state, and performs the action. But unlike conventional RNNs, its weights (represented by the four colors) can change over time, and is dependent on the agent's internal state and inputs (its previous action). Self-attention is also a form of dynamic indirect encoding as the weights produced are also dependent on the agent's senses and will vary over time.

Similarly, self-attention enforces a structure on the attention weight matrix AA in Equation 1 that makes it also input-dependent. If we remove the Key and Query terms, the outer product XXTX X^T defines an association matrix XXTX X^T are also known as Gram matrices, and are key to all of kernel methods and classical statistical learning. (Wikipedia) where the elements are large when two distinct input terms are in agreement. This type of structure enforced in AA has been shown to be suited for associative tasks where the downstream agent has to learn the relationship between unrelated items. For example, they are used in the Hebbian learning rule inspired by neurons that fire together wire together, shown to be useful for associative learning . Matrix factorization applied to weights has been proposed in the deep learning literature , and are also present in recommender systems to represent relationships between different inputs.

Visualization of Large Attention Layers from the Transfomer-XL model. (Click Image to Enlarge)

As the outer product XXTX X^T so far has no free parameters, the corresponding matrix AA will not be suitable for arbitrary tasks beyond association. The role of the small Key (WkW_k) and Query (WqW_q) matrices in Equation 1 allow AA to be modified for the task at hand. WkW_k and WqW_q can be viewed as the genotype of this indirect-encoding method.

Wk,Wq∈RdinΓ—dW_k, W_q \in \mathcal{R}^{d_{in} \times d} are the matrices that contain the free parameters, dind_{in} is a constant with image inputs (3 for RGB images and 1 for gray scale images), therefore the number of free parameters in self-attention is in the order of O(d)\mathcal{O}(d). As we explained previously, when applying self-attention to images nn can be the number of pixels in an input the magnitude of which is often tens of thousands even for small images (e.g. 100px Γ— 100px). On the other hand, dd is the dimension of the transformed space in which the Key and Query matrices reside and is often much smaller than nn (d=4d=4 in our experiments). This form of indirect encoding enables us to represent the phenotype, the attention matrix AA, of size O(n2)\mathcal{O}(n^2) using only O(d)\mathcal{O}(d) number of genotype parameters. In our experiments, we show that our attention matrix AA can be represented using only ∼\sim 1200 trainable genotype parameters.

Furthermore, we demonstrate that features from this attention matrix is especially useful to a downstream decision-making controller. We find that even if we restrict the size of our controller to only ∼\sim 2500 parameters, it can still solve challenging vision-based tasks by leveraging the information provided by self-attention.

Self-Attention Agent

The design of our agent takes inspiration from concepts related to inattentive blindness --when the brain is involved in effort-demanding tasks, it assigns most of its attention capacity only to task relevant elements and is temporarily blind to other signals . In this vein, our agent is designed to focus on only task-critical regions in the input image and will ignore the others.

The following figure depicts an overview of our self-attention agent:

Method overview. Illustration of data processing flow in our self-attention agent. We will explain each module in greater detail later on in this section. (Click Image to Enlarge)

There are four stages of information processing:

Input TransformationΒ Given an observation, our agent first resizes it into an input image of shape LΓ—LL \times L, the agent then segments the image into NN patches and regard each patch as a potential region to attend to.

Importance Voting via Self-AttentionΒ To decide which patches are appropriate, the agent passes the patches to the self-attention module to get a vector representing each patch's importance, based on which it selects KK patches of the highest importance.

Patch Selection and Feature RetrievalΒ Our agent then uses the index (kk) of each of the KK patches to fetch relevant features of each patch with a function f(k)f(k), which can be either a learned module or a pre-defined function that incorporates domain knowledge.

ControllerΒ Finally, the agent inputs the features into its controller that outputs the action it will execute in its environment.

Each of these stages will be explained in greater detail in this section.

Hyper-parameters in this paper. Left: Parameters for input transformation. After resizing the observation into an image of shape L Γ— L, we use a sliding window of specified size and stride to segment the image into N=(⌊Lβˆ’MS+1βŒ‹)2=(⌊96βˆ’74+1βŒ‹)2=529N = (\left \lfloor \frac{L-M}{S} + 1 \right \rfloor)^2 = (\left \lfloor \frac{96 - 7}{4} + 1 \right \rfloor)^2 = 529 patches. Right: Parameters for self-attention. Since the attention is between patches and each patch is RGB, we therefore have din=M2Γ—C=72Γ—3=147d_{in} = M^2 \times C = 7^2 \times 3 = 147.

To gain a better sense of the magnitudes involved, we summarize the hyper-parameters used in this work in the table above. Some of the parameters are explained in the following sections.

Car Racing environment
Take Cover environment
Visualizing the various stages of information processing in action
Input Transformation Our agent's' visual input is resized to 96x96px RGB images, and divided up into patches. (left)
Importance Voting via Self-Attention This module votes for the top K=10 patches (highlighted in white). (middle)
Patch Selection and Feature Retrieval Features from these K patches (such as location) routed to Controller. (right)

Input Transformation

Our agent does some basic image processing and then segments an input image into multiple patches. For all the experiments in this paper, our agent receives RGB images as its input, therefore we simply divide each pixel by 255 to normalize the data, but it should be straightforward to integrate other data preprocessing procedures. Similarly, while there can be various methods for image segmentation, we find a simple sliding window strategy to be sufficient for the tasks in this work.

After some basic image pre-processing, we extract patches from an input image using a slide window of size MΓ—M and stride S. This results in an output shape of (N, M, M, C), where N is the number of patches, M is the height / width of each patch, and C is the number of channels in the input image. We then flatten the data.

To be concrete, when the window size MM and stride SS are specified, our agent chops an input of shape (H,W,C)(H, W, C) into a batch of NN patches of shape (M,M,C)(M, M, C), where HH and WW are the height and width of the input image and CC is the number of channels. We then reshape the processed data into a matrix of shape (N,MΓ—MΓ—C)(N, M \times M \times C) before feeding it to the self-attention module. MM and SS are hyper-parameters to our model, they determine how large each patch is and whether patches overlap. In the extreme case when M=S=1M = S = 1 this becomes self-attention on each individual pixel in the image.

Importance Voting via Self-Attention

Upon receiving the transformed data in RnΓ—din\mathcal{R}^{n \times d_{in}} where n=Nn=N and din=MΓ—MΓ—Cd_{in} = M \times M \times C, the self-attention module follows Equation 1 to get the attention matrix of shape (N,N)(N, N). To keep the agent as simple as possible, we do not use positional encoding in this work.

We project the flattened patches into Keys and Queries and calculate their attention matrix of shape (N, N). Softmax is applied along each row. We then sum the matrix along the rows to get a patch importance vector.

By applying softmax, each row in the attention matrix sums to one, so the attention matrix can be viewed as the results from a voting mechanism between the patches. To be specific, if each patch can distribute fractions of a total of 1 vote to other patches (including itself), row ii thus shows how patch ii has voted and column jj gives the votes that patch jj acquired from others. In this interpretation, entry (i,j)(i, j) in the attention matrix is then regarded as how important patch jj is from patch ii's perspective. Taking sums along the columns of the attention matrix results in a vector that summarizes the total votes acquired by each patch, and we call this vector the patch importance vector. Unlike conventional self-attention, we rely solely on the patch importance vector and do not calculate a weighted output with Equation 2.

Patch Selection and Feature Retrieval

Based on the patch importance vector, our agent picks the KK patches with the highest importance. We pass in the index of these KK patches (denoted as index kk to reference the kthk^{th} patch) into a feature retrieval operation f(k)f(k) to query the for their features. f(k)f(k) can be static mappings or learnable modules, and it returns the features related to the image region centered at patch kk's position. The following list gives examples of possible features:

The design choices of these features give us control over various aspects of the agent's capabilities, interpretability and computational efficiency.

We sort the patch importance vector and extract the indices of the top K patches that have the highest importance. We then map the patch indices to the corresponding features and input to the decision-making controller obtain the next action. Note that these types of argsort and slice operations (such as importance sorting and patch pruning described in this section) are inherently non-differentiable, motivating the use of neuroevolution methods such as CMA-ES as we will discuss later on.

By discarding patches of low importance the agent becomes temporarily blind to other signals, this is built upon our premise and effectively creates a bottleneck that forces the agent to focus on patches only if they are critical to the task. Once learned, we can visualize the KK patches and see directly what the agent is attending to.

Although this mechanism introduces KK as a hyper-parameter, we find it easy to tune (along with MM and SS). In principle we can also let neuroevolution decide on the number of patches, and we will leave this for future work.

Pruning less important patches also leads to the reduction of input features, so the agent is more efficient by solving tasks with fewer weights. Furthermore, correlating the feature retrieval operation f(k)f(k) with individual patches can also lower the computational cost. For instance, if some local features are known to be useful for the task yet computationally expensive, KK acts as a budget cap that limits our agent to compute features from only the most promising regions. Notice however, that this does not imply we permit only local features, as f(k)f(k) also has the flexibility to incorporate global features.

In this work, f(k)f(k) is a simple mapping from patch index to patch position in the image and is a local feature. But f(k)f(k) can also be a stack of convolution layers whose receptive fields are centered at patch kk. If the receptive fields are large enough, f(k)f(k) can provide global features.


Temporal information between steps is important to most RL tasks, but single RGB images as our input at each time step do not provide this information. One option is to stack multiple input frames like what is done in , but we find this inelegant approach unsatisfactory because the time window we care about can vary for different tasks. Another option is to incorporate the temporal information as a hidden state inside our controller. Previous work has demonstrated that with a good representation of the input image, even a small RNN controller with only 6--18 neurons is sufficient to perform well at several Atari games using only visual inputs.

In our experiments, we use Long short-term memory (LSTM) network as our RNN controller so that its hidden state can capture temporal information across multiple input image frames. We find that an LSTM with only 16 neurons is sufficient to solve challenging tasks when combined with features extracted from self-attention.

Neuroevolution of the Agent

Operators such as importance sorting and patch pruning in our proposed methods are not gradient friendly. It is not straightforward to apply back-propagation in the learning phase. Furthermore, restricting to gradient based learning methods can prohibit the adoption of learnable feature retrieval functions f(k)f(k) that consist of discrete operations or need to produce discrete features. We therefore turn to evolution algorithms to train our agent. While it is possible to train our agent using any evolution strategy or genetic algorithms, empirically we find the performance of Covariance Matrix Adaptation Evolution Strategy (CMA-ES) stable on a set of RL benchmark tasks .

Visualization of CMA-ES optimization process on two function landscapes (from ).

CMA-ES is an algorithm that adaptively increases or decreases the search space for the next generation given the current generation's fitness. Concretely, CMA-ES not only adapts for the mean ΞΌ\mu and standard deviation Οƒ\sigma, but also calculates the entire covariance matrix of the parameter space. This covariance matrix essentially allows us to either explore more by increasing the variance of our search space accordingly, or fine tune the solution when the collected fitness values indicate we are close to a good optima. However the computation of this full covariance matrix is non-trivial, and because of this CMA-ES is rarely applied to problems in high-dimensional space such as the tasks dealing with visual inputs. As our agent contains significantly fewer parameters than conventional methods, we are therefore able to train it with an off-the-shelf implementation of CMA-ES .


We wish to answer the following questions via experiments and analysis:

Task Description

We evaluate our method in two vision-based RL tasks: CarRacing and DoomTakeCover. The below figure are videos of our self-attention agent performing these two tasks:

Self-attention agent playing CarRacing and DoomTakeCover
Left: Screenshot of the actual game environment presented to humans.
Right: Resized images presented to our agent as visual input, and also its attention highlighted in white patches.

In CarRacing, the agent controls three continuous actions (steering left/right, acceleration and brake) of the red car to visit as many randomly generated track tiles as possible in limited steps. At each step, the agent receives a penalty of βˆ’0.1-0.1 but will be rewarded with a score of +1000n+\frac{1000}{n} for every track tile it visits where nn is the total number of tiles. Each episode ends either when all the track tiles are visited or when 1000 steps have passed. CarRacing is considered solved if the average score over 100 consecutive test episodes is higher than 900. Numerous works have tried to tackle this task with Deep RL algorithms, but has not been solved until recently by methods we will refer to as World Model, Genetic Algorithm (GA) and Deep Innovation Protection (DIP). For comparison purposes, we also include an implementation of Proximal Policy Optimization (PPO) , a popular RL baseline.As of writing, it is difficult to find published works in the literature that have solved CarRacing-v0 outside of the mentioned works. After scouring online forums and GibHub, we identified four attempts. Two of these are off-policy methods that either relied on heavy hand-engineering to pre-process the pixels (and we were not able to reproduce by running their code), or have not actually solved it (See Wiki). The other two were PPO-based solutions that relied on reward shaping to add back a penalty score upon a crash to bootstrap the agent's training, but unfortunately still used the reshaped score for evaluation. Both still scored above 800 if evaluated properly, but not above 900. Our baseline is based on the best of the two PPO solutions found.

VizDoom serves as a platform for the development of agents that play DOOM using visual information. DoomTakeCover is a task in VizDoom where the agent is required to dodge the fireballs launched by the monsters and stay alive for as long as possible. Each episode lasts for 2100 steps but ends early if the agent dies from being shot. This is a discrete control problem where the agent can choose to move left/right or stay still at each step. The agent gets a reward of +1+1 for each step it survives, and the task is regarded solved if the average accumulated reward over 100 episodes is larger than 750. While a pre-trained World Model is able to solve both CarRacing and this task, it has been reported that the end-to-end direct encoding Genetic Algorithm (GA) proposed by Risi and Stanley falls short at solving this task without incorporating multi-objective optimization to preserve diversity, in Deep Innovation Protection (DIP). The PPO baseline also performed poorly on this task.Despite the task's simplicity, other recent works (e.g. Upside Down RL) also confirmed that traditional Deep RL algorithms such as DQN and A2C perform poorly on this task, perhaps due to the sparse reward signal based on survival.

Agent network architecture.  The numbers next to each arrow indicate data shape after the operation. A is the action dimension and is task dependent. Learnable parameters in green.

The above figure shows our network architecture and related parameters. We resize the input images to 96Γ—9696 \times 96 and use the same architecture for both CarRacing and DoomTakeCover (except for the output dimensions). We use a sliding window of size M=7M=7 and stride S=4S=4 to segment the input image, this gives us N=529N = 529 patches. After reshaping, we get an input matrix XX of shape (n=529,din=147)(n=529, d_{in}=147). We project the input matrix to Key and Query with d=4d=4, after self-attention is applied we extract features from the K=10K=10 most importance patches and input to the single layer LSTM controller (#hidden=16) to get the action.

Learnable parameters.  GA, DIP share the same World Model architecture. The genotype in our indirect encoding self-attention module are the fully connected (FC) layers that include corresponding bias terms, and the agent's controller is a small LSTM.

The table above summarizes the number of parameters in our agent, we have also included models from some existing works for the purpose of comparison. For feature retrieval function f(k)f(k), we use a simple mapping from patch index to patch center position in the input image. We normalize the positions by dividing the largest possible value so that each coordinate is between 0 and 1. In our experiments, we use pycma , an off-the-shelf implementation of CMA-ES to train our agent. We use a population size of 256, set the initial standard deviation to 0.1 and keep all other parameters at default values. To deal with randomness inherent in the environments, we take the mean score over 16 rollouts in CarRacing and 5 rollouts in DoomTakeCover as the fitness of each individual in the population.

Experimental Results

Not only is our agent able to solve both tasks, it also outperformed existing methods. Here is a summary of our agent's results:

CarRacing and DoomTakeCover results.  We report the average score over 100 consecutive tests with standard deviations. For reference, the required scores above which the tasks are considered solved are also included. Best scores are highlighted.

Test scores vs training iterations.  We test our agent for 100 consecutive episodes with different environment seeds every 10 training iterations. The solid line shows the average score, the shaded area gives the standard deviation, and the dashed line indicates the score above which the task is considered solved.

In addition to the SOTA scores, the attention patches visualized in pixel space also make it easier for humans to understand the decisions made by our agent. Here, we visualize our agent's attention by plotting the top KK important patches elected by the self-attention module on top of the input image and see directly what the agent is attending to (the opacity indicates the importance, the whiter the more important):

CarRacing and DoomTakeCover attention patches.

From visualizing the patches and observing the agent's attention, we notice that most of the patches the agent attends to are consistent with humans intuition. For example, in CarRacing, the agent's attention is on the border of the road but shifts its focus to the turns before the car needs to change its heading direction. Notice the attentions are mostly on the left side of the road. This makes sense from a statistical point of view considering that the racing lane forms a closed loop and the car is always running in a counter-clockwise direction.

In DoomTakeCover, the agent is able to focus its attention on fireballs. When the agent is near the corner of the room, it is also able to detect the wall and change its dodging strategy instead of stuck into the dead end. Notice the agent also distributes its attention on the panel at the bottom, especially on the profile photo in the middle. We suspect this is because the controller is using patch positions as its input, and it learned to use these points as anchors to estimate its distances to the fireballs.

We also notice that the scores from all methods have large variance in DoomTakeCover. This seems to be caused by the environment’s design: some fireballs might be out of the agent’s sight but are actually approaching. The agent can still be hit by the fireballs outside its vision when it’s dodging other fireballs that are in the vision.

Through these tasks, we are able to give a positive answer to the first question: Is our agent able to solve challenging vision-based RL tasks? What are the advantages over other methods that solved the same tasks? Our agent is indeed able to solve these vision-based RL challenges. Furthermore, it is efficient in terms of being able to reach higher scores with significantly fewer parameters.

Region of Interest to Importance Mapping

Our feature retrieval function f(k)f(k) is a simple mapping from patch index to (normalized) positions of the patch's center point. As such, this function provides information only about the locations of the patches, and discards the content inside these patches.

On first thought, it is actually really surprising to us that the agent is able to solve tasks with the position information alone. But after taking a closer look at the contents of the patches that the agent attends to, it is revealed that the agent learns not only where but also what to attend to. This is because the self-attention module, top KK patch selection, and the controller are all trained together as one system.

To illustrate this, we plot the histogram of patch importance that are in the top 5%5\% quantile from 20 test episodes. Although each episode presents different environmental randomness controlled by their random seeds at initialization, the distributions of the patch importance are quite consistent, this suggests our agent's behavior is coherent and stable. When sampling and plotting patches whose importance are in the specified ranges, we find that the agent is able to map regions of interest (ROI) to higher importance values.

Region of interest to patch importance mapping (Left: CarRacing, Right: DoomTakeCover). 
Importance voting mechanism via self-attention is able to identify a small minority of patches that are important for the task. The histogram shows the importance distribution of patches from 20 test episodes by patch importance scores. (Click Image to Enlarge)
Example patches sampled from specified importance ranges from the corresponding histogram. In CarRacing, we see the most important patches are locations near the vehicle, followed by road shapes, while patches of open green fields are deemed unimporant. In DoomTakeCover, the most important patches are the fireballs.

The patches of the highest importance are those critical to the core mission. These are the patches containing the red and white markers at the turns in CarRacing and the patches having fires in DoomTakeCover (patches on the left). Shifting to the range that is around the 5%5\% quantile, the patch samples are not as interesting as before but still contains useful information such as the border of the road in CarRacing and the texture of walls in DoomTakeCover. If we take an extreme and look at the patches with close to zero importance (patches on the right), those patches are mostly featureless and indeed have little information.

By mapping ROIs to importance values, the agent is able to segment and discriminate the input to its controller and learn what the objects are it is attending to. In other words, the self-attention module learns what is important to attend to and simply gives only the (normalized) positions of the top KK things in the scene to the controller. As the entire system is evolved together, the controller is still able to output reasonable actions based only on position information alone.

Can our agents generalize to unseen environments?

To test our agent's robustness and its ability to generalize to novel states, we test pre-trained agents in modified versions of CarRacing and DoomTakeCover environments without re-training or fine-tuning them. While there are infinitely many ways to modify an environment, our modifications respect one important principle: the modifications should not cause changes of the core mission or critical information loss. With this design principle in mind, we present the following modifications:

 Original Environment (Score: 914Β±15) vs Color Perturbation (Score: 866Β±112)

 World Models (Score: 655Β±353)
 GA (Score: 442Β±362)
 PPO (Score: 505Β±464)

 Original Environment (Score: 914Β±15) vs Vertical Frames (Score: 900Β±35)

 World Models (Score: 166Β±137)
 GA (Score: 675Β±254)
 PPO (Score: 615Β±217)

 Original Environment (Score: 914Β±15) vs Background Blob (Score: 898Β±53)

 World Models (Score: 446Β±299)
 GA (Score: 833Β±135)
 PPO (Score: 855Β±172)

Original Environment (Score: 1125Β±589) vs Higher Walls (Score: 934Β±560)

World Model baseline (Score: 243Β±104)
Left: Rendering of modified game environment. Center: Agent's visual input. Right: Reconstruction of what the baseline agent actually sees, based on its World Model trained only on the original environment.

Original Environment (Score: 1125Β±589) vs Different Floor Texture (Score: 1120Β±613)

World Model baseline (Score: 218Β±69)
Left: Rendering of modified game environment. Center: Agent's visual input. Right: Reconstruction of what the baseline agent actually sees, based on its World Model trained only on the original environment.

Original Environment (Score: 1125Β±589) vs Hovering Text (Score: 1035Β±627)

World Model baseline (Score: 240Β±63)
Left: Rendering of modified game environment. Center: Agent's visual input. Right: Reconstruction of what the baseline agent actually sees, based on its World Model trained only on the original environment.

For the purpose of comparison, we used the released code (and pre-trained models, if available) from as baselines. While our reproduced numbers do not exactly match the reported scores, they are within error bounds, and close enough for the purpose of testing for generalization. For each modification, we test a trained agent for 100 consecutive episodes and report its scores in the following table:

Generalization Experiments  We train agents in the original task and test in the modified environments without re-training. For comparison, we also include the performance in the unmodified tasks. Results with significant performance drop highlighted.

Our agent generalizes well to all modifications while the baselines fail. While World Model does not suffer a significant performance loss in color perturbations in CarRacing, it is sensitive to all changes. Specifically, we observe >75%> 75\% score drops in Vertical Frames, Higher Walls, Floor Texture, Hovering Text and a >50%> 50\% score drop in Background Blob from its performances in the unmodified tasks.

Since World Model's controller used as input the abstract representations it learned from reconstructing the input images, without much regularization, it is likely that the learned representations will encode visual information that is crucial to image reconstruction but not task-critical. If this visual information to be encoded is modified in the input space, the model produces misleading representations for the controller and we see performance drop.

In contrast, GA and PPO performed better at generalization tests. The end-to-end training may have resulted in better task-specific representations learned compared to World model, which uses an unsupervised representation based data collected from random policies. Both GA and PPO can fine-tune their perception layers to assign greater importance to particular regions via weight learning.

Through these tests, we are able to answer the second question posed earlier: How robust is the learned agent? If the agent is focusing on task-critical factors, does it generalize to the environments with modifications that are irrelevant to the core mission?

The small change in performance shows that our agent is robust to modifications. Unlike baseline methods that are subject to visual distractions, our agent focuses only on task-critical positions, and simply relies on the coordinates of small patches of its visual input identified via self-attention, and is still able to keep its performance in the modified tasks without any re-training. By learning to ignore parts of the visual input that it deems irrelevant, it can naturally still perform its task even when irrelevant parts of its environment are modified.

Related Work

Our work has connections to work in various areas:

NeuroscienceΒ  Although the human visual processing mechanisms are not yet completely understood, recent findings from anatomical and physiological studies in monkeys suggest that visual signals are fed into processing systems to extract high level concepts such as shape, color and spatial organization . Research in consciousness suggests that our brains interpret our surrounding environment in a β€œlanguage of thought” that is abstract enough to interface with decision making mechanisms . On the other hand, while recent works in deep RL for vision-based tasks have thrived , in most cases it is not clear why they work.

A line of work that narrows the gap is World Models , where the controller's inputs are abstract representations of the visual and temporal information, provided by encouraging a probabilistic β€œWorld Model” to compress and predict potential future experiences. Their agent excels in challenging vision-based RL tasks, and by projecting the abstract representation back to the pixel space, it is possible to get some insights into the agent's mind. suggest that not all details in the visual input are equally important, specifically they pointed out that rather than learning abstract representations that are capable of reconstructing the full observation, it is sufficient if the representation allows predicting quantities that are relevant for planning.

While we do not fully understand the mechanisms of how our brains develop abstract representations of the world, it is believed that attention is the unconscious mechanism by which we can only attend to a few selected senses at a time, allowing our consciousness to condense sensory information into a synthetic code that is compact enough to be carried forward in time for decision making . In this work, in place of a probabilistic World Model, we investigate the use of self-attention to distill an agent's visual input into small synthetic features used as inputs for a small controller.In the neuroscience and cognitive science literature, many of the descriptions of attention usually involve some sort of top-down feedback signal. The model proposed in this work is more of a bottom-up form of attention without top down feedback.

Neuroevolution-based methods for tackling challenging RL tasks have recently gained popularity due to their simplicity and competitiveness to Deep RL methods, even on vision-based RL benchmark tasks . More recent work demonstrated that evolution can even train RL agents with millions of weight parameters, such as the aforementioned World Models-based agents. As these approaches do not require gradient-based computation, they offer more flexibility such as discrete latent codes, being able to optimize directly for the total reward across multiple rollouts, or ease of scaling computation across machines.

It is worthy to note that even before the popularity of deep RL-based approaches for vision-based tasks, indirect encoding methods from the neuroevolution literature have been used to tackle challenging vision-based tasks such as Atari domain and car navigation from pixels . Indirect encoding methods are inspired by biological genotype--phenotype representations, and aim to represent a large, but expressive neural network with a small genotype code, reducing a high dimensional optimization problem to a more manageable one that can be solved with gradient-free methods.

Indirect encoding methods are not confined to neuroevolution. Inspired by earlier works , hypernetworks are recurrent neural networks (RNNs) whose weight matrices can change over time, depending on the RNN's input and state. It uses an outer product projection of an embedding vector, allowing a large weight matrix to be modified via a small β€œgenotype” embedding. As we will show in the next section, self-attention also relies on taking an outer product of input and other parameter vectors to produce a much larger weight matrix. Transformers demonstrated that this type of modified self-attention matrix is a tremendously powerful prior for various language modeling tasks. Here, we investigate the use of self-attention as an indirect encoding mechanism for training agents to perform vision-based RL tasks using neuroevolution.

Attention-based RLΒ  Inspired by biological vision systems, earlier works formulated the problem of visual attention as an RL problem . Recent work incorporated multi-head self-attention to learn representations that encode relational information between feature entities, with these features the learned agent is able to solve a novel navigation and planning task and achieve SOTA results in six out of seven StarCraft II tasks. Because the agent learned relations between entities, it can also generalize to unseen settings during training.

In order to capture the interactions in a system that affects the dynamics, proposed to use a group of modified RNNs. Self-attention is used to combine their hidden states and inputs. Each member competes for attention at each step, and only the winners can access the input and also other members’ states. They showed that this modular mechanism improved generalization on the Atari domain.

In addition to these works, attention is also explicitly used for interpretability in RL. In , the authors incorporated soft and hard attention mechanism into the deep recurrent Q-network, and they were able to outperform Deep Q-network (DQN) in a subset of the Atari games. Most recently, used a soft, top-down attention mechanism to force the agent to focus on task-relevant information by sequentially querying its view of the environment. Their agents achieved competitive performance on Atari while being more interpretable.

Although these methods brought exciting results, they need dedicated network architectures and carefully designed training schemes to work in an RL context. For example, needs to apply the self-attention mechanism iteratively within each time step to perform relational reasoning in the StarCraft II tasks. In , a hard attention mechanism had to be separately trained because it required sampling, and in , a carefully designed non-trainable basis that encodes spatial locations was needed. Because we are not concerned with gradient-based learning, we are able to chip away at the complexity and get away with using a much simplified version of the Transformer architecture in our self-attention agent. For instance, we do not even need to use positional encoding or layer normalization components in the Transformer.

The high dimensionality the visual input makes it computationally prohibitive to apply attention directly to individual pixels, and we rather operate on image patches (which have lower dimensions) instead. Although not in the context of self-attention, previous work (e.g. ) segments the visual input and attend to the patches rather than individual pixels. Our work is similar to , where the input to the controller their is a vector of patch features weighted by attentions, the dimension of which grows linearly as we increase the number of patches. However, as our method do not rely on gradient-based learning, we can simply restrict the input to be only the KK patches with highest importance, which is an easily tunable hyper-parameter independent of patch size or patch count. Ordinal measures have been shown to be robust and used in various feature detectors and descriptors in computer vision. Using gradient-free methods (such as neuroevolution) are more desirable in the case of non-differentiable operations because these ordinal measures can be implemented as argmaxargmax or top KK patch selection, critical for our self-attention agent.


While the method presented is able to cope with various out-of-domain modifications of the environment, there are limitations to this approach, and much more work to be done to further enhance the generalization capabilities of our agent. We highlight some of the limitations of the current approach in this section.

Much of the extra generalization capability is due to attending to the right thing, rather than from logical reasoning. For instance, if we modify the environment by adding a parallel lane next to the true lane, the agent attends to the other lane and drives there instead. Most human drivers do not drive on the opposite lane, unless they travel to another country.

Prefers to drive on the other lane
When we added an extra fake lane next to the real one, and set the blue channel of the color of the fake lane, the agent attends to the fake lane and proceeds to cross over and drive on it instead of the correct lane.

We also want to highlight that the visual module does not generalize to cases where dramatic background changes are involved. Inspired by , we modify the background of the car racing environment and replace the green grass background with YouTube videos .

Background suddenly becomes YouTube Cat Video
The agent stops to look at the cat with the fat white belly, rather than focus on the road.
Background suddenly becomes The King of Fighters
The agent prefers to watch the match between the kings of fighters.

The agent trained on the original environment with the green grass background fails to generalize when the background is replaced with distracting YouTube videos. When we take this one step further and replace the background with pure uniform noise, we observe that the agent's attention module breaks down and attends only to random patches of noise, rather than to the road-related patches.

Background is replaced with random noise
The agent attends only to noise, but never on the road.
Left: The policy trained on the normal environment does not transfer to the noisy environment (Score: -58Β±12)
Right: Agent trained from scratch in this noisy environment (Score: 577Β±243)

When we train an agent from scratch in the noisy background environment, it still manages to get around the track, although the performance is mediocre. Interestingly, the self-attention layer still attends only to the noise, rather than to the road, and it appears that the controller actually learns a policy to avoid such locations!

 K=5 (Score: 452Β±210)
 K=20 (Score: 660Β±259)

We also experiment with various KK (the number of patches). Perhaps lowering the number will force the agent to focus on the road. But when we decrease KK from 10 to 5 (or even less), the agent still attends to noisy patches rather than to the road. Not surprisingly, as we increase KK to 20 or even 30, the performance of this noise-avoiding policy increases.

These results suggest that while our current method is able to generalize to minor modifications of the environment, there is much work to be done to approach human-level generalization abilities. The simplistic choice we make to only use the patch locations (rather than their contents) may be inadequate for more complicated tasks. How we can learn more meaningful features, and perhaps even extract symbolic information from the visual input will be an exciting future direction.


The paper demonstrated that self-attention is a powerful module for creating RL agents that is capable of solving challenging vision-based tasks. Our agent achieves competitive results on CarRacing and DoomTakeCover with significantly fewer parameters than conventional methods, and is easily interpretable in pixel space. Trained with neuroevolution, the agent learned to devote most of its attention to visual hints that are task-critical and is therefore able to generalize to environments where task irrelevant elements are modified while conventional methods fail.

Yet, our agent is nowhere close to generalization capabilities of humans. The modifications to the environments in our experiments are catered to attention-based methods. In particular, we have not modified properties of objects of interest, where our method may perform as poorly (or worse) than methods that do not require sparse attention in pixel space. We believe this work complements other approaches (e.g. ) that approach the generalization problem, and future work will continue to develop on these ideas to push the generalization abilities proposed in more general domains (such as just to name a few).

Neuroevolution is a powerful toolbox for training intelligent agents, yet its adoption in RL is limited because its effectiveness when applied to large deep models was not clear until only recently . We find neuroevolution to be ideal for learning agents with self-attention. It allows us to produce a much smaller model by removing unnecessary complexity needed for gradient-based method. In addition, it also enables the agent to incorporate modules that include discrete and non-differentiable operations that are helpful for the tasks. With such small yet capable models, it is exciting to see how neuroevolution trained agents would perform in vision-based tasks that are currently dominated by Deep RL algorithms in the existing literature.

In this work, we also establish the connections between indirect encoding methods and self-attention. Specifically, we show that self-attention can be viewed as a form of indirect encoding. Another interesting direction for future works is therefore to explore other forms of indirect encoding bottlenecks that, when combined with neuroevolution, can produce parameter efficient RL agents exhibiting interesting innate behaviors.


The authors would like to thank Yingtao Tian, Lana Sinapayen, Shixin Luo, Krzysztof Choromanski, Sherjil Ozair, Ben Poole, Kai Arulkumaran, Eric Jang, Brian Cheung, Kory Mathewson, Ankur Handa, and Jeff Dean for valuable discussions.

Any errors here are our own and do not reflect opinions of our proofreaders and colleagues. If you see mistakes or want to suggest changes, feel free to contribute feedback by participating in the discussion forum for this article.

The experiments in this work were performed on multicore CPU Linux virtual machines provided by Google Cloud Platform.

Vision Icon by artist monkik on Noun Project.  Vision icon by artist monkik.


GECCO 2020  This work has been presented at GECCO 2020 as a full paper.

For attribution in academic contexts, please cite this work as:

Yujin Tang and Duong Nguyen and David Ha, Neuroevolution of Self-Interpretable Agents, 2020.

BibTeX citation

  author    = {Yujin Tang and Duong Nguyen and David Ha},
  title     = {Neuroevolution of Self-Interpretable Agents},
  booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference},
  url       = {},
  note      = "\url{}",
  year      = {2020}

Open Source Code

Please see our repo for details on reproducing the experiments in this work.


Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their caption.