Multimodal Transformers (But Not That Kind of Transformer)


Mention ‘multimodal transformer’ to any 6-year-old and you might just trigger impassioned comments about devious Decepticons. But there is another kind of multimodal transformer. And this kind of transformer is of keen interest to SineWave because it powers AI capabilities that are changing the way we process and add value to information – whether in the enterprise, industrial, consumer, or federal domains.

Transforming AI Through Language Processing 

In June 2017, researchers at Google published a paper, “Attention Is All You Need,” describing a new approach to language translation. It reformulated an older concept called self-attention, a mechanism that weights how much importance each word in a sentence or paragraph gives to every other word.
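
To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the three-token example and random projection matrices are purely illustrative, not taken from the paper.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row of weights sums to 1
    return weights @ V                               # each output is a weighted mix of all tokens

# Toy example: 3 tokens with 4-dimensional embeddings and random projection matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (3, 4)
```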

In the paper, self-attention was combined with a way to encode each word’s position within a sentence, so that a word’s importance could be combined with its location in the text. This new approach performed exceptionally well on standardized language translation tasks. Further, it was efficient to train and easy to parallelize. This meant that modern Nvidia, AMD, ARM, and Intel chips – each of which packs many processing cores that can operate in parallel – could all happily crunch away at the same time. This radically accelerated training and allowed the creation of very large language models, or LLMs. The Google researchers named this new machine learning architecture the “transformer.”
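
The positional code in the original paper was a set of fixed sine and cosine waves added to each token’s embedding. A minimal sketch of that scheme (the sequence length and embedding size below are toy values chosen for illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine position codes in the style of 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                          # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                       # even embedding dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                       # odd embedding dimensions
    return pe

# Add position information to a toy batch of token embeddings: 10 tokens, 16 dimensions.
embeddings = np.random.default_rng(1).normal(size=(10, 16))
embeddings_with_position = embeddings + sinusoidal_positional_encoding(10, 16)
```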

And transform it did. In fact, it transformed the whole of AI. New and superior large language models followed, including Google’s esoterically named Bidirectional Encoder Representations from Transformers, or BERT (no relation to the Sesame Street character), and Facebook’s RoBERTa.

A further explosion of proprietary and open-source architectural evolution followed, the most familiar of which are OpenAI’s GPT-4 and ChatGPT models. And despite all that intense innovation, at their core, all these models – and the new businesses that accompanied them – were powered by transformers.

Around 2019-20, however, researchers began to look at the structure of transformers in a different way. They noticed that transformers performed exceptionally well when presented with a sequence of words (in practice, they usually operate on sub-word chunks of text, but we can leave that to another blog). Transformers are, at their core, token processors.

But what if the tokens were not words but chunks of an image? Technical papers such as Google’s “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” followed. For example, a 256×256-pixel image can be decomposed into 256 chunks, or tokens, each 16×16 pixels, and then processed with a transformer. The result was breakthrough performance in all sorts of image processing tasks, including image classification, image super-resolution, image de-blurring, and so on.
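
As a rough sketch of that patch-tokenization step (NumPy only; the real vision transformer also applies a learned linear projection and adds position embeddings, which are omitted here):

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an (H, W, C) image into flattened patch tokens, ViT-style."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly into patches"
    # Reshape into a grid of patches, then flatten each patch into one token vector.
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    grid = grid.transpose(0, 2, 1, 3, 4)              # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * C)        # (num_patches, patch*patch*C)

image = np.random.default_rng(2).random((256, 256, 3))   # a dummy 256×256 RGB image
tokens = image_to_patch_tokens(image)
print(tokens.shape)                                      # (256, 768): 256 tokens of length 16*16*3
```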

AI researchers working in audio and acoustics were paying close attention. For many years, acoustics researchers had viewed recorded speech and music as images – more specifically, as spectrograms. A spectrogram is constructed from an audio recording using the short-time Fourier transform (STFT), which turns the original amplitude-versus-time waveform into an image in time and frequency. Since spectrograms are images, any acoustic recording could be tokenized and fed into a transformer. Many such implementations resulted, including the Audio Spectrogram Transformer (AST), and a new, high-performing cadre of speech-to-text (transcription) and speech-to-speech (e.g., spoken language translation) tools followed.
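
A minimal sketch of that waveform-to-spectrogram step, here using SciPy’s stft function with an illustrative sample rate, window length, and test tone:

```python
import numpy as np
from scipy.signal import stft

fs = 16_000                                    # sample rate in Hz (illustrative)
t = np.arange(0, 1.0, 1 / fs)                  # one second of audio
waveform = np.sin(2 * np.pi * 440 * t)         # a 440 Hz test tone standing in for speech

# Short-time Fourier transform: slide a window over the waveform and take an FFT of each slice.
freqs, times, Z = stft(waveform, fs=fs, nperseg=512)
spectrogram = np.abs(Z)                        # magnitude in time and frequency: an "image"

print(spectrogram.shape)                       # (frequency bins, time frames)
# This 2-D array can now be cut into patches and fed to a transformer, just like an image.
```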

The Business Case for Multimodal Transformers

From there it was only natural to build transformers that could handle more than one mode at a time, and that’s exactly what happened in 2022-2023. Researchers – and several commercial teams – built transformers that jointly processed text and image tokens at the same time. Some then found highly computationally efficient ways to combine text, images, and audio. This mattered because the computational complexity of a simple transformer scales with the square of the number of tokens – words or image chunks – being processed, so the computational load grows very quickly as modes are added. Those efficiency gains are what made it practical to fuse more and more sensor modes into a single transformer.
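
Some rough, back-of-the-envelope arithmetic shows why that quadratic term matters; the per-mode token counts below are purely illustrative:

```python
# Self-attention cost grows with the square of the sequence length.
# Rough, illustrative token counts for each mode:
text_tokens  = 512            # a paragraph or two of text
image_tokens = 256            # a 256x256 image in 16x16 patches
audio_tokens = 1024           # a few seconds of spectrogram patches

def attention_pairs(n):
    """Number of token-to-token attention scores a simple transformer computes."""
    return n * n

for n in (text_tokens, image_tokens, text_tokens + image_tokens + audio_tokens):
    print(f"{n:5d} tokens -> {attention_pairs(n):,} pairwise attention scores")
# Concatenating all three modes (1,792 tokens) costs ~3.2 million scores per layer,
# versus ~0.26 million for text alone, hence the push for more efficient fusion schemes.
```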

As an example, in July 2023 we saw the description of a meta-transformer whose input modes span fundamental data types – text, tables, time-series data, graphs, images (2D and 3D), audio, and video – plus sensed data such as X-ray diagnostic, infrared, and hyperspectral (e.g., satellite remote sensing) imagery.

At SineWave, we are motivated by the belief that multiple, highly valuable, multimodal use cases can be addressed – and new businesses created – using such techniques. For example, an AI system might fuse an expert view of X-ray imagery with text processing of a patient’s record and time-series analysis of their vital signs to generate superior diagnostic and therapeutic recommendations for a supervising physician. Or a digital twin managing the operation of a factory might combine tabular production records with images of the factory floor and an understanding of the factory organization’s hierarchy graph to optimize production and communicate effectively with factory staff.

And all of these innovations and future businesses will, very likely, be based on multimodal transformers. We have come a long way in only six years. Megatron will be envious.