Semantic and Perceptual Captions

Introduction

Yesterday, DeepSeek released Janus-Pro, a unified model for multimodal understanding and text-to-image generation. The abstract reminded me of a talk I gave in 2016, where I mentioned that “data, compute and algorithm” are the three driving forces of AI. This was a common understanding at that time. The tech report starts with: “In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size”.

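The same three driving forces are still at work and continue to play crucial roles in AI advancement: the training strategy is the algorithm, the expanded training set is the data, and the larger model size is what more compute makes possible.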

1. Two levels of captions

Consider this image and corresponding captions:

[Image: Skyscraper]

Caption 1: Skyscrapers in which businesses lead the country’s economy.

Caption 2: This image displays four black skyscrapers that are made of glass and steel. They are reflecting the light from the sky. The sky is light blue and there are some clouds in the sky. There is a building in the foreground. The building is also made of glass and steel and is reflecting the light from the sky. The style of the image is realistic.

The first caption is typical of what we would find on the internet for images like this. It captures the high-level semantic information in the image, whereas the second caption is a more detailed, fine-grained description of the image. The image and captions are from the PixelProse dataset (“From Pixels to Prose”). Understanding an image in terms of fine-grained local features and globally consistent semantic features is not new. Progressive GAN demonstrated that different layers of the network capture global and local features. StyleGAN used this to modify the global or local features of a generated face.

2. Decoupling encoders

For multimodal understanding, the image and its corresponding captions need to live in the same embedding space. CLIP introduced image and text encoders that were jointly trained with a contrastive objective: embeddings of matching image-caption pairs are pulled together and those of mismatched pairs are pushed apart, so both encoders map into a shared embedding space that captures the semantic meaning of the image. LLaVA introduced a projection (bridge) layer that maps the embeddings from a CLIP Vision Transformer into the input token space of a LLaMA-based language model. As models evolved to cover both multimodal understanding and generation tasks, it was common practice to use a single, unified encoder module to obtain the image embeddings.
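To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. The toy embeddings, the function names and the fixed temperature of 0.07 are assumptions for illustration only; in a real model the embeddings come from trained image and text encoders, and the temperature is usually a learned parameter.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch (made-up numbers): image i and caption i point in similar directions.
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = contrastive_loss(image_embs, text_embs)
# Rotate the captions so that every image is paired with the wrong caption.
shuffled = contrastive_loss(image_embs, text_embs[1:] + text_embs[:1])
print(f"aligned loss:  {aligned:.4f}")
print(f"shuffled loss: {shuffled:.4f}")
```

The loss is low when each image is closest to its own caption and high when the pairs are mismatched, which is the pressure that pushes both encoders into one shared space.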

The key observation in Janus was that the representation required for the multimodal understanding (semantic) differs greatly from the representation required for a generation task (perceptual). It introduced two independent visual encoding pathways, one for multimodal understanding and one for generation tasks.
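The decoupling can be pictured with a toy sketch. Everything in it is a hypothetical stand-in rather than the Janus implementation: mean pooling plays the role of a semantic encoder that returns continuous features, and a nearest-neighbour lookup in a tiny codebook plays the role of an image tokenizer that returns discrete ids for generation.

```python
def understanding_encoder(patches):
    """Toy semantic pathway: pool all patches into one continuous feature vector."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]


def generation_encoder(patches, codebook):
    """Toy perceptual pathway: map each patch to the id of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(patch, codebook[k]))
        for patch in patches
    ]


def encode(patches, task, codebook):
    """Route the same image through a different encoder depending on the task."""
    if task == "understanding":
        return understanding_encoder(patches)
    if task == "generation":
        return generation_encoder(patches, codebook)
    raise ValueError(f"unknown task: {task}")


# Made-up 2-dimensional "patches" and a 2-entry codebook.
patches = [[0.9, 0.1], [0.1, 0.8], [1.0, 0.0], [0.0, 1.0]]
codebook = [[1.0, 0.0], [0.0, 1.0]]

semantic = encode(patches, "understanding", codebook)  # one continuous vector
tokens = encode(patches, "generation", codebook)       # one discrete id per patch
print(semantic)
print(tokens)
```

The point of the sketch is that the two pathways return different kinds of objects for the same image: a summary that is useful for answering questions about it, and a patch-by-patch code that is useful for reconstructing it.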

3. Why it is important

By decoupling the encoder for multimodal understanding from the visual encoder for generative tasks, Janus has given us a framework to jointly obtain semantic and perceptual embeddings. Most “multimodal” models focus on image and text modalities. However, the real world has audio, video and many forms of structured and unstructured data. Separate encoders and adapter heads could be added to this architecture to obtain joint embeddings of all these different modalities, moving towards truly “multimodal” models.
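As a rough sketch of what per-modality adapter heads could look like, the snippet below projects features of different widths into one shared width so that they can sit side by side in a single input sequence. The modality names, dimensions and random, untrained weights are all assumptions for illustration; a real adapter would be trained together with the rest of the model.

```python
import random

SHARED_DIM = 4  # hypothetical width of the shared embedding space


def make_adapter(in_dim, out_dim, seed):
    """Create a random, untrained linear adapter as an out_dim x in_dim matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_adapter(weights, features):
    """Project a feature vector into the shared space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# One adapter per modality; each encoder is assumed to emit a different width.
adapters = {
    "image": make_adapter(3, SHARED_DIM, seed=0),
    "audio": make_adapter(5, SHARED_DIM, seed=1),
}

# Made-up encoder outputs for one image and one audio clip.
features = {
    "image": [0.2, 0.5, 0.1],
    "audio": [0.3, 0.1, 0.4, 0.0, 0.9],
}

# After projection, every modality has the same width.
sequence = [apply_adapter(adapters[m], features[m]) for m in ("image", "audio")]
print([len(v) for v in sequence])
```

Adding a new modality in this picture means registering one more encoder and adapter pair, without touching the pathways that already exist.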

Conclusion

The release of Janus-Pro underscores the ongoing advancements in multimodal AI and the importance of both semantic and perceptual embeddings for understanding and generation tasks. By retaining the decoupled encoding pathways introduced in Janus, Janus-Pro keeps semantic comprehension distinct from perceptual synthesis. This architectural shift sets a precedent for integrating modalities beyond text and images, including audio, video, and structured data, into a unified AI framework.

As AI models continue to evolve, the ability to jointly leverage distinct representations will become essential for tasks requiring both high-level comprehension and fine-grained detail generation. Future advancements may focus on refining adapter modules, improving cross-modal learning, and expanding embedding spaces to accommodate diverse data streams. With these developments, the pursuit of truly generalized multimodal AI moves one step closer to reality.
