By Sandy Carter
In the rapidly evolving landscape of artificial intelligence, we’re witnessing a fascinating transformation in the world of GPT (Generative Pre-trained Transformer) models.
GPT is a type of artificial intelligence model designed to understand and generate human-like text. Think of it as a very advanced autocomplete system.
Just as your phone might suggest the next word when you’re typing a message, a GPT model can predict and generate entire sentences or even long passages of text. It’s “pre-trained” on a vast amount of text from the internet, books, and other sources, which allows it to learn patterns in language.
The “generative” part means it can create new content, not just repeat what it has seen before. GPT models power many AI chatbots and writing assistants, helping them to engage in human-like conversations and produce coherent text on almost any topic. Most significantly of all, GPT has arguably achieved the Holy Grail of AI: passing the Turing Test, producing text that humans often cannot tell has been generated ‘artificially’.
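To make the “advanced autocomplete” idea concrete, here is a minimal sketch in Python using the open-source Hugging Face transformers library and the small, freely downloadable GPT-2 model. It is an illustrative stand-in for how GPT-style models extend a prompt by repeatedly predicting the next word, not the commercial GPT systems discussed in this article.

```python
# Minimal sketch of GPT-style next-word prediction, using the small
# open-source GPT-2 model as an illustrative stand-in.
from transformers import pipeline

# Load a pre-trained text-generation model (weights download on first run).
generator = pipeline("text-generation", model="gpt2")

prompt = "The future of artificial intelligence is"

# The model repeatedly predicts likely next tokens, extending the prompt
# into a longer passage of text.
result = generator(prompt, max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])
```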
The advance of AI – both technologically and in terms of adoption – has been nothing short of phenomenal. According to OpenAI, more than 3 million custom GPTs have been created, and by some estimates 77% of devices in use today incorporate some form of AI.
My provocative statement “GPT is dead. Long live GPTs!” encapsulates this evolution, highlighting not the demise of the GPT architecture, but rather its evolution from text-only applications to even more capable and wondrous forms.
The Evolution of GPT
The early GPT models, with their singular focus on text, have given way to a new generation of more versatile and capable systems. While these text-centric models showcased remarkable abilities in natural language processing – generating human-like text, understanding context, and performing a wide array of language-related tasks with impressive accuracy – they were limited to a single modality.
So what’s dead? The notion that GPT models are confined to text alone. The GPT architecture, far from being obsolete, is very much alive and continuously evolving. As researchers and developers pushed the boundaries of what GPT could do, a transformation began to take shape. The latest iterations, exemplified by models like GPT-4, have broken free from the constraints of text-only processing. These advanced models have embraced multimodality, capable of understanding and processing both text and images.
For example, LegalGPT is a multimodal GPT that can process both text and image data. This enables it to handle tasks such as analysing legal documents, including scanned images of contracts or case files, while also providing detailed text-based insights. LegalGPT can, for instance, interpret complex legal documents and flag important clauses or issues, making it a versatile tool for legal professionals who routinely deal with both textual and visual information, such as scanned PDFs.
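For readers curious about what a multimodal request looks like in practice, the sketch below uses OpenAI’s Python SDK to send an image and a text question to a vision-capable model such as GPT-4o. The document URL and prompt are hypothetical placeholders for illustration, not part of any specific product such as LegalGPT.

```python
# Illustrative sketch: asking a vision-capable GPT model to analyse a scanned
# contract page. The URL and prompt are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarise this contract page and flag any unusual clauses."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/scanned-contract-page.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```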
This leap forward represents a fundamental shift in our relationship with technology. It is no longer merely a tool; it is now an extra pair of hands, with capabilities approaching and in some cases exceeding those of human employees.
Because GPT models handle the “drudge-work” within so many professions, they free humans to concentrate fully on what we do brilliantly. This opens up new horizons for AI applications, bridging the gap between different types of data and paving the way for more sophisticated, context-aware AI systems.
The death of text-only GPT has given birth to a new era of multimodal AI, where these models can interact with and understand the world in ways that more closely mimic human cognition.
The Rise of Multimodal AI
The evolution of GPT towards multimodality is part of a broader, exciting trend in the field of artificial intelligence. Multimodal AI systems, capable of processing and generating multiple types of data simultaneously, are revolutionizing how machines understand and interact with the world.
These systems can integrate information from various sources – text, images, audio, and even video – to form a more comprehensive understanding of their environment. This enhanced perception allows for more nuanced and accurate responses, mimicking the way humans process information from multiple senses.
For example, an interesting multimodal GPT in the music industry is AIVA (Artificial Intelligence Virtual Artist). AIVA uses both text and sound as input, allowing users to generate music based on specific styles or emotions described in text form. It can interpret these text prompts and output corresponding audio, making it useful for composers or producers looking for inspiration or quick drafts. AIVA has been used in creating background scores for films, commercials, and even video games, showcasing how multimodal AI can blend creative input across text and sound.
The implications of this shift are profound and far-reaching. In healthcare, multimodal AI that analyses medical images alongside patient histories is already being used to drastically improve the accuracy of diagnoses. One example of a healthcare-focused multimodal GPT is Med-Gemini, developed by Google. Med-Gemini builds on the Gemini family of models and is specifically fine-tuned for medical applications. It combines text, images, and even 3D scans to assist in clinical workflows such as generating radiology reports, answering clinical questions, and offering diagnostic support.
Med-Gemini has been benchmarked on tasks like visual question-answering for chest X-rays, report generation for 3D imaging, and genomic risk prediction. These multimodal capabilities are designed to improve clinical reasoning by integrating diverse data types, making it a powerful tool in radiology, pathology, and genomics.
Multimodal GPTs have many more applications besides. Autonomous vehicles could make split-second decisions by integrating visual, auditory, and textual data. Creative industries might see an explosion of new forms of art and design, as AI assists in blending different media types. Perhaps most exciting of all, multimodal AI has the potential to break down communication barriers, offering more natural and intuitive ways for humans to interact with machines. As these systems continue to develop, we stand on the brink of a new era in artificial intelligence – one where the lines between different types of data blur, and AI’s understanding of the world grows ever closer to our own.
The Importance of Multimodal AI
The shift to multimodal AI represents far more than just a technological advancement; it’s a paradigm shift with far-reaching implications across various fields and aspects of society. Remarkable as AI’s “autocomplete” capabilities are, you can’t solve all the world’s problems by text alone. By enhancing AI’s understanding of context and nuance, multimodal systems promise to revolutionize human-AI interaction, making it more natural and intuitive. This improved interaction opens doors to solving complex, real-world challenges that were previously out of reach for single-modality AI.
Moreover, this shift unlocks new realms of creativity and innovation, potentially transforming fields like art, design, and scientific research. The ability of multimodal AI to bridge different types of information also has profound implications for accessibility, allowing for more effective communication tools for people with disabilities.
Perhaps most excitingly, by integrating visual context with language processing, multimodal AI could break down language barriers, fostering improved cross-cultural understanding and communication on a global scale. Collectively, these advancements underscore how multimodal AI is not merely an evolution in technology, but a revolutionary force that could reshape how we interact with machines, process information, and ultimately understand our world.
Companies at the Forefront
Several tech giants and innovative startups are leading the charge in developing advanced GPT models and multimodal AI:
Tech Giants
- OpenAI: Known for its GPT models, OpenAI has made significant strides with GPT-4, which can process both text and images.
- Google DeepMind: The company’s PaLM-E model integrates large language models with robotic control, showcasing the potential of multimodal AI in physical interactions.
- Anthropic and AWS: While less is publicly known about their specific efforts, Anthropic, which partners closely with AWS, has been pushing the boundaries of AI capabilities and ethical AI development.
Innovative Startups
- Hugging Face: This startup has become a hub for open-source AI models, including many GPT-based and multimodal projects.
- Adept AI Labs: Founded by former OpenAI and Google researchers, Adept is working on AI models that can interact with software interfaces.
- Stability AI: Known for their work on Stable Diffusion (which, among its many uses, enables the creation of detailed images from text prompts), Stability AI is pushing the boundaries of generative AI across multiple modalities.
Measuring the Impact of GPT Evolution
As GPT models evolve from text-only to multimodal capabilities, a crucial question emerges: How do we effectively evaluate these increasingly sophisticated AI systems? While expert benchmarks provide valuable technical insights, they may not fully capture the most important metric: end-user satisfaction.
Recognizing this gap, an open-source initiative led by Salman Paracha (an AWS alumnus like me) and the team at Katanemo has launched a “human” benchmark study for Large Language Models (LLMs). This study aims to measure the quality corridor that matters most to end users when interacting with LLMs, including the latest multimodal GPT models.
The Katanemo benchmark seeks to answer critical questions:
- Is there a threshold beyond which improvements in LLM response quality no longer significantly impact user satisfaction?
- At what point does the performance of these models dip to levels that users find unacceptable?
By participating in a brief 30-second survey, individuals can contribute valuable data to this community effort. The results of this study will help establish a standardized measure for LLM quality, empowering researchers, developers, and consumers to make more informed decisions about which models to use or further develop.
The Road Ahead
As we move from “GPT is dead” to “Long live GPTs,” we’re entering an era where these models are becoming more versatile, more capable, and more integrated into our daily lives. The evolution towards multimodality represents not just a technological advancement, but a paradigm shift in how we interact with AI.
However, with great power comes great responsibility. The development of more advanced GPT models and multimodal AI raises new ethical concerns and challenges, particularly in areas like deep fakes and privacy. It will be crucial for researchers, companies, and policymakers to work together to ensure that these powerful new tools are developed and deployed responsibly.
The future of GPTs is not about the death of an old technology, but the birth of new possibilities. As these models continue to evolve and incorporate multimodal capabilities, we can expect to see applications that were once the stuff of science fiction become reality, fundamentally changing how we interact with technology and the world around us.
Feature Image Credit: Getty Images