AIhub.org
 

Pretrained transformers as universal computation engines


29 April 2021




By Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch

Transformers have been successfully applied to a wide variety of modalities: natural language, vision, protein modeling, music, robotics, and more. A common trend with using large models is to train a transformer on a large amount of training data, and then finetune it on a downstream task. This enables the models to utilize generalizable high-level embeddings trained on a large dataset to avoid overfitting to a small task-relevant dataset.

We investigate a new setting where, rather than transferring the high-level embeddings, we transfer the intermediate computation modules: instead of pretraining on a large image dataset and finetuning on a small image dataset, we might pretrain on a large language dataset and finetune on a small image dataset. Contrary to the conventional idea that the attention mechanism is specific to the training modality, we find that the self-attention layers can generalize to other modalities without finetuning.

To illustrate this, we take a pretrained transformer language model and finetune it on various classification tasks: numerical computation, vision, and protein fold prediction. During finetuning, we freeze all of the self-attention blocks except for the layer norm parameters. We also add a new linear input layer to read in the new type of input, and reinitialize a linear output layer to perform classification on the new task. We refer to this model as the “Frozen Pretrained Transformer” (FPT).
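As a rough illustration of which parameters stay frozen and which are trained, the setup can be sketched as a rule over parameter names. This is a minimal sketch, not the authors' code: the names (`input_proj`, `blocks.N.attn`, `blocks.N.mlp`, `ln_1`, `ln_2`, `output_head`) are hypothetical, and treating the feedforward sublayer as part of each frozen block is an assumption about what "self-attention block" covers.

```python
# Hypothetical parameter names for a small transformer; real models differ.
def build_param_names(num_blocks):
    names = ["input_proj.weight", "input_proj.bias"]  # new linear input layer
    for i in range(num_blocks):
        prefix = f"blocks.{i}."
        names += [
            prefix + "ln_1.weight", prefix + "ln_1.bias",
            prefix + "attn.qkv.weight", prefix + "attn.qkv.bias",
            prefix + "attn.out.weight", prefix + "attn.out.bias",
            prefix + "ln_2.weight", prefix + "ln_2.bias",
            prefix + "mlp.fc_in.weight", prefix + "mlp.fc_in.bias",
            prefix + "mlp.fc_out.weight", prefix + "mlp.fc_out.bias",
        ]
    names += ["output_head.weight", "output_head.bias"]  # reinitialized output layer
    return names


def is_trainable(name):
    """Train the new input/output layers and every layer norm; freeze the rest."""
    if name.startswith(("input_proj.", "output_head.")):
        return True
    return any(part.startswith("ln_") for part in name.split("."))


def split_params(names):
    trainable = [n for n in names if is_trainable(n)]
    frozen = [n for n in names if not is_trainable(n)]
    return trainable, frozen


names = build_param_names(num_blocks=2)
trainable, frozen = split_params(names)
print(f"{len(trainable)} trainable tensors, {len(frozen)} frozen tensors")
```

In a real framework the same rule would be applied by switching off gradient updates for every tensor in the frozen list.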

Across the tasks, a token fed to the model represents a small amount of information: for example, it could be a single bit, or a 4×4 image patch. In particular, the tokens can only communicate with each other via the self-attention mechanism, which is not being trained at all on the downstream task. We investigate if these mechanisms – learned exclusively from natural language data – can be used for another modality in zero shot.
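To make the idea of a token as a 4×4 image patch concrete, here is a minimal sketch of splitting an image into patch tokens. It assumes a single-channel image stored as a list of rows and a row-major patch order; the function name `image_to_patch_tokens` is hypothetical, and the actual preprocessing may differ.

```python
def image_to_patch_tokens(image, patch=4):
    """Split an H x W image (list of rows) into flattened patch x patch tokens."""
    height = len(image)
    width = len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    # Walk the patches left to right, top to bottom (row-major order).
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            tokens.append(token)
    return tokens


# Toy 8x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tokens = image_to_patch_tokens(image)
print(len(tokens), len(tokens[0]))
```

Under the same scheme, a 32×32 image such as those in CIFAR-10 would yield 64 tokens per channel, each holding 16 pixel values, which the new linear input layer would then map to the transformer's embedding size.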

We show test accuracies for a variety of tasks below. We find that FPT can match or improve on the performance of training a transformer fully from scratch! This indicates that, somehow, the attention mechanisms are general enough that we can feed in relatively arbitrary inputs and still generate useful embeddings for downstream classification.



We also find that, when computing the elementwise XOR of two bitstrings, learning input embeddings to feed into the attention layer makes it possible to force the self-attention to attend to the relevant bits, even though the self-attention parameters are frozen. This works for strings of length up to 256 (length of 5 shown below):


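For readers who want to see what an elementwise XOR example looks like as data, here is a minimal sketch with one bit per token. Feeding the two bitstrings as a single concatenated sequence is an assumption made for illustration, and the helper names `make_xor_example` and `relevant_positions` are hypothetical.

```python
import random


def make_xor_example(length, rng):
    """Return (tokens, target) for one elementwise XOR example."""
    a = [rng.randint(0, 1) for _ in range(length)]
    b = [rng.randint(0, 1) for _ in range(length)]
    tokens = a + b  # one bit per token, second string appended to the first
    target = [x ^ y for x, y in zip(a, b)]
    return tokens, target


def relevant_positions(length, i):
    """Input positions that determine output bit i under the layout above."""
    return (i, length + i)


rng = random.Random(0)
tokens, target = make_xor_example(5, rng)
for i, bit in enumerate(target):
    left, right = relevant_positions(5, i)
    print(f"output[{i}] = {bit} depends on tokens {left} and {right}")
```

Each output bit depends on exactly two input positions, so a model that solves the task has to route attention to that pair.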

An open question is then what the benefit of pretraining on language is. Instead of initializing the transformer parameters from a language-pretrained model, we could initialize them randomly or by pretraining on the Bit Memory task; these two alternatives ablate against no supervision and weak memory supervision, respectively. Our results indicate that all three methods of initialization can work well, but language still performs the best, somehow providing an interesting set of pretrained layers: for example, on CIFAR-10, the base FPT model achieves an accuracy of 68%, versus 63% from Bit Memory pretraining or 62% from random initialization. Furthermore, we find that the language-pretrained frozen transformers converge faster than the randomly initialized frozen transformers, typically by a factor of 1-4x, indicating that language might be a good starting point for other tasks.

We also find the transformer architecture itself to be very important. If we compare a randomly initialized frozen transformer to a randomly initialized frozen LSTM, the transformer significantly outperforms the LSTM: for example, 62% vs 34% on CIFAR-10. Thus, we think attention may already be a naturally good prior for multimodal generalization; we could think of self-attention as applying data-dependent filters.
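The "data-dependent filter" view can be illustrated with a minimal single-head self-attention in plain Python. This is a sketch under simplifying assumptions (one head, no masking, identity projection matrices chosen only for readability), not the architecture used in the experiments.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention; returns (outputs, attention weights)."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    # Each output is a weighted average of the values: the "filter" is the row
    # of weights, and it is computed from the input tokens themselves.
    dim = len(values[0])
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, weights_a = self_attention(seq_a, identity, identity, identity)
_, weights_b = self_attention(seq_b, identity, identity, identity)
print(weights_a[0])
print(weights_b[0])
```

The projection matrices are the same in both calls, yet the mixing weights differ because they are recomputed from each input sequence. A fixed convolution kernel, by contrast, would apply the same weights to both sequences.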

We’re very interested in a better understanding of the capability of language models or hybrid-modality transformers for the goal of a universal computation engine. We think there are a lot of open questions to be explored in this space, and are excited to see new work in multimodal training.


This post is based on the following paper:

This article was initially published on the BAIR blog, and appears here with the authors’ permission.



