Multimodal Architectures

The fundamental step in building AI agents is to mimic humans' sensory units: vision, speech, and hearing. These agents can expand their assistance by gaining access to social media (e.g. Facebook, Instagram, ...), utility (e.g. Notion), or community (e.g. Discord, Slack, ...) platforms. I like to refer to this set of tools and API connections as the agents' action space.

I’ve come to understand that the most effective approaches to engineering multimodal AI agents likely begin with exploring architectures in the following areas (a short, illustrative code sketch for each follows the list):

  1. API Infrastructure: building the AI agents' access to tools and platforms via API gateways, primarily with FastAPI.

  2. Deep Learning Architectures: These complex architectures typically build the sensory units behind AI agents. One ambitious goal could be to combine large language models such as Gemini, Meta Llama, and GPT into a unified model whose knowledge base leverages the contextual information with the highest certainty (for example, using a certainty metric based on the residuals between hidden layers). However, since these models have different architectures, combining them effectively is likely not feasible at this stage.

  3. Model Ensembling Architectures: Architectures that integrate specialized open-source models. This is compute-efficient compared to running inference on a single deep learner fine-tuned for all sensory tasks.

  4. MLOps and Data Cycles: Automating MLOps and data cycles is particularly helpful when datasets are sufficient for training a model with tools like scikit-learn, but not large enough to fine-tune or train a deep learning model. The objective is to iteratively update the most recent models to run inference on the next batch of observations, e.g. applying mean-shift techniques to real-time inferences to approximate the probability distribution of the data, then updating the model for the next inference window.
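For the API infrastructure item, here is a minimal sketch of a FastAPI gateway exposing one generic tool-invocation endpoint. The route name and payload shape are illustrative assumptions, not a fixed spec; any platform connector (Notion, Slack, ...) would follow the same pattern.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Agent action-space gateway")

class ToolRequest(BaseModel):
    tool: str        # e.g. "notion.create_page" or "slack.post_message" (hypothetical names)
    arguments: dict  # tool-specific parameters supplied by the agent

class ToolResponse(BaseModel):
    tool: str
    result: str

@app.post("/tools/invoke", response_model=ToolResponse)
def invoke_tool(request: ToolRequest) -> ToolResponse:
    # A real gateway would dispatch to the platform's API client here;
    # the sketch only echoes the call so it stays self-contained.
    return ToolResponse(tool=request.tool, result=f"invoked with {request.arguments}")
```

Running it with `uvicorn module:app` gives the agent a single HTTP surface over its whole action space.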
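For the deep learning item, a hedged sketch of one way to read the residual-based certainty idea: treat small changes between a model's final hidden layers as a sign that it has settled on its representation, and route to the model with the highest score. The shapes, the window of final layers, and the scoring function are all illustrative assumptions, not an established metric.

```python
import numpy as np

def residual_certainty(hidden_states: np.ndarray) -> float:
    """Score from hidden states of shape (num_layers, hidden_dim)."""
    # Norm of the residual between each pair of consecutive layers.
    residuals = np.linalg.norm(np.diff(hidden_states, axis=0), axis=1)
    # Smaller residuals in the last few layers -> higher certainty.
    return float(1.0 / (1.0 + residuals[-3:].mean()))

# Stand-in hidden states for three hypothetical candidate models.
rng = np.random.default_rng(0)
states = {name: rng.normal(size=(12, 64)) for name in ("gemini", "llama", "gpt")}
best = max(states, key=lambda name: residual_certainty(states[name]))
print(best, residual_certainty(states[best]))
```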
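For the ensembling item, a sketch of a modality router that dispatches each input to a small specialized model instead of one large model fine-tuned on every sensory task. The handler functions are placeholders for whichever open-source checkpoints you actually deploy.

```python
from typing import Callable, Dict

def transcribe_audio(payload: bytes) -> str:
    return "<transcript>"  # stand-in for e.g. a Whisper-class speech model

def caption_image(payload: bytes) -> str:
    return "<caption>"     # stand-in for e.g. a BLIP-class vision model

def answer_text(payload: bytes) -> str:
    return "<answer>"      # stand-in for a small open-weight LLM

SENSORY_ENSEMBLE: Dict[str, Callable[[bytes], str]] = {
    "audio": transcribe_audio,
    "image": caption_image,
    "text": answer_text,
}

def perceive(modality: str, payload: bytes) -> str:
    # Only the relevant expert runs, which is what keeps the ensemble
    # cheaper than one model inferencing on all tasks at once.
    return SENSORY_ENSEMBLE[modality](payload)

print(perceive("audio", b"..."))
```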
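For the MLOps item, a sketch of the windowed update loop: approximate each window's distribution with scikit-learn's MeanShift, then refit a lightweight model before scoring the next batch. The window size, drift simulation, and choice of downstream model are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
WINDOW = 200

def next_batch(step: int) -> np.ndarray:
    # Simulated real-time stream whose mean drifts over time.
    return rng.normal(loc=step * 0.5, scale=1.0, size=(WINDOW, 2))

model = None
for step in range(3):
    batch = next_batch(step)
    # Approximate the window's distribution by its mean-shift modes.
    shift = MeanShift().fit(batch)
    # Refit a small model on the freshly labelled window; it would score
    # the next batch before the cycle repeats.
    model = KNeighborsClassifier(n_neighbors=5).fit(batch, shift.labels_)
    print(f"window {step}: {len(shift.cluster_centers_)} modes")
```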

2025 Open-Source Multimodal Models

Gemma 3n Model Family (Google)

Gemma 3n models include several architectural innovations:

  • They are available in two sizes based on effective parameters. While the raw parameter count of the E4B model is 8B, the architecture allows it to run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.

  • They use a MatFormer architecture that allows nesting sub-models within the E4B model. Google provides one extracted sub-model (an E2B), and a spectrum of custom-sized models can be accessed using the Mix-and-Match method; a loading sketch follows below.
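A minimal loading sketch, assuming the Hugging Face hub ids `google/gemma-3n-E2B-it` / `google/gemma-3n-E4B-it`, a transformers release with Gemma 3n support, and the `image-text-to-text` pipeline task; the checkpoints are gated, so a Hugging Face login is required.

```python
from transformers import pipeline

# Pick the extracted E2B sub-model for tight memory budgets, or the full
# E4B model when the accelerator allows it.
model_id = "google/gemma-3n-E2B-it"  # or "google/gemma-3n-E4B-it" (assumed hub ids)
pipe = pipeline("image-text-to-text", model=model_id)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "In one sentence, what is a MatFormer?"}]}
]
print(pipe(text=messages, max_new_tokens=40)[0]["generated_text"])
```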
