MAXL: Meta Auxiliary Learning

by Shikun Liu, Andrew J. Davison, & Edward Johns

We propose MAXL, a meta-learning framework which automatically learns appropriate labels for an auxiliary task, such that a supervised learning task can be improved without requiring access to any further data.

NeurIPS 2019


Introduction

Auxiliary learning is a method to improve the ability of a primary task to generalise to unseen data, by training on additional auxiliary tasks alongside this primary task. The sharing of features across tasks results in additional relevant features being available, which otherwise would not have been learned from training only on the primary task. Auxiliary learning is similar to multi-task learning, except that only the performance of the primary task is of importance, and the auxiliary tasks are included purely to assist the primary task.

Auxiliary learning methods can be divided into two main directions:

  1. Supervised auxiliary learning: suitable auxiliary tasks can be manually chosen to complement the primary task, but this requires domain knowledge to choose the auxiliary tasks, and labelled data to train them.

  2. Unsupervised auxiliary learning: this removes the need for labelled data, but at the expense of a limited set of auxiliary tasks which may not be beneficial to the primary task.

By combining the merits of both supervised and unsupervised auxiliary learning, the ideal framework would be one with the flexibility to automatically determine the optimal auxiliary tasks, but without the need to manually label these auxiliary tasks.

We propose to achieve such a framework with a simple and general meta-learning algorithm, which we call Meta AuXiliary Learning (MAXL). We first observe that in supervised learning, defining a task can equate to defining the labels for that task. Therefore, for a given primary task, an optimal auxiliary task is one which has optimal labels. The goal of MAXL is then to automatically discover these auxiliary labels using only the labels for the primary task.

Generalisation of Auxiliary Tasks

Before introducing the design of MAXL, we first performed an empirical analysis to study how generalisation varies with different auxiliary tasks. In this project, both the primary and auxiliary tasks are image classification for simplicity, and the empirical analysis is performed on a 4-level version of the CIFAR-100 dataset, with the design details shown in the following table.

| 3 Class | 10 Class | 20 Class | 100 Class |
|---|---|---|---|
| animals | large animals | reptiles | crocodile, dinosaur, lizard, snake, turtle |
| | | large carnivores | bear, leopard, lion, tiger, wolf |
| | | large omnivores and herbivores | camel, cattle, chimpanzee, elephant, kangaroo |
| | medium animals | aquatic mammals | beaver, dolphin, otter, seal, whale |
| | | medium-sized mammals | fox, porcupine, possum, raccoon, skunk |
| | small animals | small mammals | hamster, mouse, rabbit, shrew, squirrel |
| | | fish | aquarium fish, flatfish, ray, shark, trout |
| | invertebrates | insects | bee, beetle, butterfly, caterpillar, cockroach |
| | | non-insect invertebrates | crab, lobster, snail, spider, worm |
| | people | people | baby, boy, girl, man, woman |
| vegetations | vegetations | flowers | orchids, poppies, roses, sunflowers, tulips |
| | | fruit and vegetables | apples, mushrooms, oranges, pears, peppers |
| | | trees | maple, oak, palm, pine, willow |
| objects and scenes | household objects | food containers | bottles, bowls, cans, cups, plates |
| | | household electrical devices | clock, keyboard, lamp, telephone, television |
| | | household furniture | bed, chair, couch, table, wardrobe |
| | construction | large man-made outdoor things | bridge, castle, house, road, skyscraper |
| | natural scenes | large natural outdoor scenes | cloud, forest, mountain, plain, sea |
| | vehicles | vehicles 1 | bicycle, bus, motorcycle, pickup truck, train |
| | | vehicles 2 | lawn-mower, rocket, streetcar, tank, tractor |

CIFAR-100 dataset in a 4-level hierarchy. The original dataset consists of 20 (coarse) labels and 100 (fine) labels; we additionally added 2 coarser levels with 3 and 10 labels.

Intriguing Observations on Auxiliary Learning

We first exhaustively ran experiments, training on each pair of primary and auxiliary tasks using the human-defined labels in the table above. Results are shown in the following figures.

Validation accuracy of the primary task trained with different auxiliary tasks.

From these results, we draw some interesting observations on auxiliary learning. Note that these observations hold only when both the primary and auxiliary tasks are of the same type -- in this example, image classification.

Here, information entropy is measured by the number of labels in the prediction space, i.e., a larger (finer-level) label space has higher information entropy. To learn beyond the information available from a single task type, a better auxiliary learning framework would combine the information brought by a diverse set of auxiliary tasks of different types, and automatically filter the right information based on the choice of primary task -- which motivated our recent project Auto-Lambda.

The Design of MAXL

Considering image classification as the primary task, the auxiliary task is then defined as a sub-class labelling problem. Based on the observations listed in the previous section, we associate each primary class with a number of auxiliary classes, in a two-level hierarchy, and learn a single optimal auxiliary task with this pre-defined hierarchy. For example, if manually-defined labels were used, a primary class could be "Dog", and one of the auxiliary classes could be "Border Collie" -- except that in our case, the concept of "Border Collie" is generated automatically by MAXL. The goal of MAXL is to generate these labels for the auxiliary task which, when trained alongside a primary task, improve the performance of the primary task.

The overview of MAXL framework. The performance of the primary task can be improved by training the primary task together with an auxiliary task defined by these generated auxiliary labels. Here, the auxiliary task is to divide each primary class into two sub-classes.

Detailed Formulation

To accomplish this, we train two networks: a multi-task network, which trains on the primary and auxiliary task in a standard multi-task learning setting, and a label-generation network, which generates the labels for the auxiliary task.

We denote the multi-task network as a function $f_{\theta_1}(x)$ with parameters $\theta_1$ which takes an input $x$, and the label-generation network as a function $g_{\theta_2}(x)$ with parameters $\theta_2$ which takes the same input $x$. Parameters $\theta_1$ are updated by losses of both the primary and auxiliary tasks, as is standard in multi-task learning. However, $\theta_2$ is updated only by the performance of the primary task, as a form of gradient-based meta learning.

Update multi-task network parameters $\theta_1$: $$ \text{argmin}_{\theta_1}\left(\mathcal{L}(f^\text{pri}_{\theta_1}(x), y^\text{pri}) + \mathcal{L}(f^\text{aux}_{\theta_1}(x), y^\text{aux})\right) $$ Update label-generation network $\theta_2$: $$ \text{argmin}_{\theta_2}\mathcal{L}(f^\text{pri}_{\theta_1^+}(x), y^\text{pri}) $$ from which, $$ \theta_1^+ = \theta_1- \alpha\nabla_{\theta_1}\left(\mathcal{L}(f^\text{pri}_{\theta_1}(x), y^\text{pri}) + \mathcal{L}(f^\text{aux}_{\theta_1}(x), y^\text{aux})\right) $$

The trick in this second objective is that we compute a derivative over a derivative (a second-order gradient) to update $\theta_2$, by retaining the computational graph of $\theta_1^+$ so that derivatives with respect to $\theta_2$ can be computed. These second-order gradients can also be approximated using finite differences to reduce training time, as in differentiable architecture search.
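To make the bi-level update concrete, here is a toy, pure-Python sketch (our own illustration, not the authors' code): $\theta_1$ and $\theta_2$ are scalars standing in for the multi-task and label-generation networks, the quadratic losses are hypothetical, and the meta-gradient uses the finite-difference approximation mentioned above.

```python
# Toy sketch of MAXL's bi-level optimisation (illustrative only).
# theta1 plays the multi-task network, theta2 the label-generation network;
# both losses are hypothetical quadratics chosen for clarity.

def l_pri(t1):                 # primary loss: optimum at t1 = 1
    return (t1 - 1.0) ** 2

def l_aux(t1, t2):             # auxiliary loss: pulls t1 towards t2
    return (t1 - t2) ** 2

def inner_step(t1, t2, alpha=0.1):
    # theta1+ = theta1 - alpha * d/dtheta1 (L_pri + L_aux)
    grad = 2 * (t1 - 1.0) + 2 * (t1 - t2)
    return t1 - alpha * grad

def meta_grad(t1, t2, eps=1e-4):
    # finite-difference approximation of d L_pri(theta1+) / d theta2,
    # avoiding an explicit second-order gradient
    hi = l_pri(inner_step(t1, t2 + eps))
    lo = l_pri(inner_step(t1, t2 - eps))
    return (hi - lo) / (2 * eps)

theta1, theta2 = 0.0, -2.0     # arbitrary initial values
for _ in range(200):
    theta2 -= 0.5 * meta_grad(theta1, theta2)   # meta update of theta2
    theta1 = inner_step(theta1, theta2)         # multi-task update of theta1

print(round(theta1, 3))  # theta1 converges to the primary optimum 1.0
```

The meta update adjusts $\theta_2$ purely by how it affects the primary loss after the inner step, which is exactly the structure of the second objective above.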

Mask SoftMax for Hierarchical Predictions

As shown in Observation #1, a useful auxiliary task should have a larger label space than the primary task. To achieve this, we include a hyper-parameter, the hierarchy $\psi$, which defines the number of auxiliary classes per primary class. To implement this, we designed a modified SoftMax function, which we call Mask SoftMax, to predict auxiliary labels only over the permitted auxiliary classes. It takes the ground-truth primary task label $y$ and the hierarchy $\psi$, and creates a binary mask $M = B(y, \psi)$. The mask is zero everywhere, except for ones across the set of auxiliary classes associated with $y$.

For example, consider a primary task with 2 classes $y \in \{0, 1\}$ and a hierarchy $\psi = [2, 2]$. In this case, the binary masks are $M = [1, 1, 0, 0]$ for $y = 0$, and $M = [0, 0, 1, 1]$ for $y = 1$. A visual illustration is shown below.

Mask SoftMax: $$p(y_i) = \frac{\exp(M\cdot y_i)}{\sum_j \exp(M\cdot y_j)}$$

Standard SoftMax: $$p(y_i) = \frac{\exp(y_i)}{\sum_j \exp(y_j)}$$
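The two-class example above can be sketched in a few lines (our illustration; here the mask is applied multiplicatively after the exponential, a common way to implement masking so that masked-out classes receive exactly zero probability):

```python
import math

# Sketch of Mask SoftMax for a primary task with 2 classes and psi = [2, 2]:
# auxiliary classes 0-1 belong to primary class 0, classes 2-3 to class 1.

def binary_mask(y, psi):
    # B(y, psi): ones over the auxiliary classes assigned to primary label y
    total = sum(psi)
    start = sum(psi[:y])
    return [1.0 if start <= i < start + psi[y] else 0.0
            for i in range(total)]

def mask_softmax(logits, mask):
    # softmax over masked logits; masked-out classes get zero probability
    exps = [m * math.exp(z) for z, m in zip(logits, mask)]
    s = sum(exps)
    return [e / s for e in exps]

psi = [2, 2]
logits = [2.0, 1.0, 0.5, 0.3]    # hypothetical auxiliary logits
m0 = binary_mask(0, psi)
print(m0)                        # [1.0, 1.0, 0.0, 0.0]
p = mask_softmax(logits, m0)
print(round(sum(p[:2]), 6))      # all probability mass stays in classes 0-1
```

With this masking, the label-generation network can only distribute auxiliary probability among the sub-classes of the ground-truth primary class, enforcing the two-level hierarchy by construction.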

In addition, we found that the generated auxiliary labels can easily collapse, such that the label-generation network always generates the same auxiliary label, ignoring the designed hierarchy. Therefore, to encourage the network to learn more complex and informative auxiliary tasks, we further apply an entropy loss $\mathcal{H}(y^\text{aux})=\sum_i y^\text{aux}_i\log y^\text{aux}_i$ as a regularisation term when updating $\theta_2$.

Experiments

We compared MAXL to a number of baseline methods for generating auxiliary labels, on the 4-level CIFAR-100 dataset defined above: Single (training on the primary task alone), Random (randomly assigned auxiliary labels), K-Means (auxiliary labels assigned by clustering learned features), and Human (the human-defined auxiliary labels from the hierarchy above).

Here, all baselines were trained without any regularisation, to isolate the effect of auxiliary learning and test the generalisation ability gained purely from auxiliary tasks. The results are shown in the following table.

| | PRI 3 / AUX 10 | PRI 3 / AUX 20 | PRI 3 / AUX 100 | PRI 10 / AUX 20 | PRI 10 / AUX 100 | PRI 20 / AUX 100 |
|---|---|---|---|---|---|---|
| Single | 87.49 | 87.49 | 87.49 | 75.15 | 75.15 | 79.71 |
| Random | 89.86 | 89.15 | 87.81 | 77.26 | 75.88 | 71.11 |
| K-Means | 90.16 | 90.57 | 90.43 | 77.68 | 78.63 | 73.35 |
| Human | 90.78 | 90.78 | 91.23 | 77.97 | 78.18 | 73.11 |
| MAXL | 90.59 | 90.68 | 90.61 | 78.64 | 78.43 | 74.28 |

Validation accuracy of the primary tasks trained with auxiliary tasks generated by different methods.

We observe that MAXL outperforms all baselines except Human. Note that K-Means required significantly longer training time than MAXL due to the need to run clustering after each iteration. Also note that the superior performance of MAXL over these three baselines occurs despite all four methods using exactly the same data. Finally, we observe that MAXL performs similarly to Human, despite this baseline requiring manually-defined auxiliary labels for the entire training dataset. With performance of MAXL similar to that of a system using human-defined auxiliary labels, we see strong evidence that MAXL is able to learn to generalise effectively in a self-supervised manner.

Visualisations of Generated Knowledge

We further show examples of images assigned to the same auxiliary class by MAXL's label-generation network. Below are example images with the highest prediction probabilities for 3 auxiliary classes on MNIST.

*(Figure: image panels for Auxiliary Class #1, #2 and #3.)* Visualisation of 5 test examples with the highest prediction probability, for each of 3 auxiliary classes, for different primary classes.

To our initial surprise, only some of the generated auxiliary labels visualised on MNIST show human-understandable knowledge. We can observe that auxiliary classes #1 and #2 of digit nine are clustered by the direction of the "tail", and auxiliary classes #2 and #3 of digit seven are clustered by the presence of the "horizontal line". But in most cases, there are no obvious similarities within each auxiliary class in terms of shape, colour, style, structure or semantic meaning, particularly in CIFAR-100 (which we therefore omit from the visualisation). However, this makes more sense when we reconsider the role of the label-generation network, which is to assign auxiliary labels which assist the primary task, rather than to group images in terms of semantic or visual similarity. The label-generation network would therefore be more effective if it were to group images in terms of a shared aspect of reasoning which the primary task is currently struggling to learn, and this may not be human-interpretable.

Conclusion & Future Work

In this paper, we have presented Meta AuXiliary Learning (MAXL) for generating optimal auxiliary labels which, when trained alongside a primary task in a multi-task setup, improve the performance of the primary task. Rather than employing domain knowledge and human-defined auxiliary tasks as is typically required, MAXL is self-supervised and, combined with its general nature, has the potential to automate the process of generalisation to new levels.

The general nature of MAXL also opens up questions about how self-supervised auxiliary learning may be used to learn generic auxiliary tasks, beyond sub-class image classification. During our experiments, we also ran preliminary experiments on predicting arbitrary vectors such that the auxiliary task becomes a regression, but results so far have been inconclusive. However, the ability of MAXL to potentially learn flexible auxiliary tasks which can automatically be tuned for the primary task, now offers an exciting direction towards automated generalisation across a wide range of more complex tasks. MAXL has inspired some interesting recent works such as in knowledge transfer, differentiable data selection, and image deblurring.

Citation

If you find this work useful in your own research, please consider citing the following.

@inproceedings{liu2019maxl,
    title={Self-supervised generalisation with meta auxiliary learning},
    author={Liu, Shikun and Davison, Andrew J and Johns, Edward},
    booktitle={Proceedings of the 33rd International Conference on Neural Information Processing Systems},
    pages={1679--1689},
    year={2019}
}