Multimodal AI

August 6, 2024 by SineWave Ventures

For more than a decade, we’ve used AI-powered voice assistants as a convenient and user-friendly way to interface with machines. This AI technology ingests human speech and responds in a synthetic voice, giving us the ability to prompt machines to perform relatively simple tasks on our behalf. We interact with our voice assistants to check the weather, play a song, dictate a text message, or search for information. Despite recent advancements in the underlying AI technology, we’ve mostly stuck to basic tasks like these, satisfied with the improved accessibility and intuitive user experience that these assistants provide.

More recently, the emergence of Generative AI has captured the world’s attention. AI-powered chatbots ingest natural language text, speech, or images, but unlike prior applications, this technology is designed to produce novel text, image, or audio output. However, many of today’s most widely used Generative AI applications are unimodal: they’re designed and optimized to ingest, process, and understand a single type of data, most often text handled through natural language processing.

Humans experience the world using multiple senses, for example, by combining audio input with visual cues to improve perception accuracy and robustness, especially in noisy environments. Multi-sensory integration helps us gain a more holistic understanding of our surroundings and establishes context for scenarios requiring complex decision-making.

At SineWave Ventures, our thesis is that technology solutions should help solve problems more efficiently. Multimodal AI can enhance machine perception and understanding by integrating text, audio, and visual inputs. This approach not only improves the AI’s accuracy and reliability across applications, but also allows machines to assist humans with more complex tasks, such as understanding nuanced contexts, making sophisticated predictions, and offering richer, more interactive experiences. Our portfolio company Clarifai is a deep learning AI platform for computer vision, natural language processing, and data labeling that helps organizations build, deploy, and operationalize AI at scale across multiple industries.
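
To make the idea concrete, here is a minimal sketch of one common multimodal pattern, late fusion, in Python. The encoders below are hypothetical placeholders for trained models (they are not Clarifai’s or any particular product’s implementation); the point is only that each modality is embedded separately and the embeddings are then combined into a single representation for downstream decision-making.

```python
import numpy as np

# Toy stand-ins for real modality encoders. In practice each would be a
# trained neural network (a vision model and a text model); here they are
# hypothetical placeholders that map raw input to a fixed-size vector.
EMBED_DIM = 4

def embed_text(text: str) -> np.ndarray:
    # Hypothetical text encoder: hash characters into a small vector.
    vec = np.zeros(EMBED_DIM)
    for i, ch in enumerate(text):
        vec[i % EMBED_DIM] += ord(ch)
    return vec / (np.linalg.norm(vec) + 1e-9)

def embed_image(pixels: np.ndarray) -> np.ndarray:
    # Hypothetical image encoder: summary statistics as a stand-in.
    vec = np.array([pixels.mean(), pixels.std(), pixels.max(), pixels.min()])
    return vec / (np.linalg.norm(vec) + 1e-9)

def fuse(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    # Late fusion: concatenate per-modality embeddings so a downstream
    # model can weigh evidence from both signals jointly.
    return np.concatenate([text_emb, image_emb])

caption = "a dog playing in the park"
image = np.random.default_rng(0).random((32, 32))  # fake grayscale image

joint = fuse(embed_text(caption), embed_image(image))
print(joint.shape)  # (8,) -- one vector carrying both modalities
```

The design choice this illustrates is that a single fused representation lets a downstream classifier or generator draw on both modalities at once, which is exactly where multimodal systems gain accuracy and robustness over unimodal ones.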

As multimodal AI continues to mature, we seek to invest in companies developing technologies that adapt analytic workflows to accommodate processing constraints, and that support the development of just-in-time, just-good-enough analytic strategies. By integrating multiple types of data, multimodal AI will provide deeper understanding and more comprehensive solutions, making technology an even more powerful tool for tackling complex challenges. These advancements will not only enhance human interactions with machines, but also open up new possibilities for innovation and problem-solving.