Continual Learning in Vision Transformers
Continual Learning
Continual learning focuses on building machine learning models that can acquire new knowledge and skills while retaining performance on previously learned tasks. This capability lets AI models build on existing knowledge rather than relearn everything from scratch, saving significant time and computational resources, especially for large-scale AI models such as LLMs and generative AI. Key aspects of continual learning include:
- Sequential learning: The model learns from a stream of data over time, rather than from a fixed dataset.
- Avoiding catastrophic forgetting: The challenge of retaining previously learned knowledge while acquiring new skills.
- Transfer learning: Using the knowledge from previous tasks to learn new tasks.
- Adaptability: Ability to adjust to changing environments or task requirements.
There are four main families of continual learning methods, briefly summarized below:
- Regularization methods: limit weight updates to preserve parameters that are important for previous tasks.
- Rehearsal methods: store or generate examples from previous tasks to maintain performance (a minimal code sketch of this idea follows the list).
- Architectural methods: dynamically grow the network architecture as the model learns new tasks.
- Meta-learning methods: learn how to learn efficiently, enabling quick adaptation to new tasks.
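As a concrete illustration of one of these families, here is a minimal sketch of the rehearsal idea: a small reservoir-sampled buffer of past examples that can be mixed into each new-task batch. The ReplayBuffer class, its capacity, and the PyTorch usage are illustrative assumptions, not part of any specific method from this project.

```python
# Minimal sketch of the rehearsal idea: keep a small buffer of examples from
# earlier tasks and mix them into every new-task batch. Class name and sizes
# are illustrative assumptions, not taken from a specific paper.
import random
import torch


class ReplayBuffer:
    """Reservoir-style buffer that stores a bounded number of (x, y) pairs."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, x, y):
        # Reservoir sampling keeps a uniform sample over everything seen so far.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)
```

During training on a new task, each gradient step would combine a batch of new-task data with a batch sampled from this buffer, so that old-task performance is rehearsed alongside the new task.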
In this research project, I focused on applying architectural methods of continual learning to a special type of neural network model: Vision Transformers (ViT).
Vision Transformers
Vision Transformers (ViT) are a type of neural network designed for vision tasks, inspired by the success of transformers in natural language processing. ViT adapts the transformer architecture, originally designed for text, to process images by treating an image as a sequence of patches, much as transformers process a sequence of words.
As shown in the figure to the left, the process is as follows (a minimal code sketch is given after the list):
- Image Patching: Divides the input image into fixed-size patches.
- Patch Embedding: Flattens each patch and linearly projects it into an embedding vector.
- Position Embedding: Adds learnable position embeddings to retain spatial information.
- Transformer Encoder: Processes the sequence of patch embeddings using self-attention mechanisms.
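To make these steps concrete, below is a minimal sketch of the ViT front end (patching, patch embedding, and position embedding) in PyTorch. The hyperparameters (224-pixel images, 16-pixel patches, 768-dimensional embeddings) are common defaults and are only illustrative.

```python
# Minimal sketch of the ViT front end: image patching, patch embedding, and
# learnable position embedding. Hyperparameters are illustrative defaults.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution splits the image into non-overlapping patches
        # and projects each one to embed_dim in a single operation.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable position embeddings retain spatial information.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, embed_dim)
        return x + self.pos_embed            # add position information
```

The resulting sequence of patch embeddings is then processed by a standard transformer encoder with self-attention layers. A full ViT additionally prepends a learnable [CLS] token to the patch sequence before adding position embeddings; that step is omitted here for brevity.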
Adaptive Distillation of Adapters
As shown in the figure to the left, the approach works as follows:
- The approach uses a limited pool of new parameters called adapters.
- It adds new adapters to the ViT architecture as the model learns new tasks.
- There is a limit to the number of adapters.
- Once the limit is reached, for each new task:
  - It measures a transferability score between the new task and each previously learned task.
  - It distills the new task into the adapter with the highest transferability score.
This research project focuses on implementing the above approach (ADA); the implementation is still in progress. A rough sketch of the main ingredients is given below.
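Since the implementation is still in progress, the sketch below only illustrates the two ingredients described above: a bottleneck adapter module that can be inserted into a frozen ViT block, and the selection of an existing adapter by transferability score once the pool is full. The Adapter class, the bottleneck size, and the pick_adapter_for_new_task helper are illustrative assumptions, not the final ADA implementation.

```python
# Rough sketch of the two ADA ingredients described above. Everything here
# (names, sizes, the transferability placeholder) is an illustrative
# assumption, not the finished implementation.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # The residual connection keeps the frozen ViT features intact while
        # the small adapter learns the task-specific adjustment.
        return x + self.up(self.act(self.down(x)))


def pick_adapter_for_new_task(adapters, transfer_scores):
    """Once the adapter pool is full, reuse (and later distill into) the
    adapter whose previous task is most transferable to the new one.

    transfer_scores: dict mapping adapter index -> transferability score,
    computed by whatever transferability metric the method uses (assumed
    to be given here).
    """
    best = max(transfer_scores, key=transfer_scores.get)
    return adapters[best]
```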