Scaling Semantic Insight: GPU-Accelerated Multimodal Feature Extraction for FrameNet Expansion

By Ashish Baghel · ~8 min read

Introduction

As we move into an era where communication goes far beyond just words, the study of multimodal genres, where meaning arises from a blend of images, text, and context, has become a core challenge in computational linguistics. Our goal is to computationally model the semantic representation of multimodal genres by extending the FrameNet framework to capture the complex interplay between communicative modes.

Why Extend FrameNet to Multimodal Data?

The FrameNet Legacy

FrameNet, rooted in frame semantics, has long provided a structured mapping between words and the conceptual “frames” they evoke. But classic FrameNet is built around text. It doesn’t capture meaning that comes from visual cues like body posture, spatial layouts, or symbolic imagery.

The Multimodal Imperative

Modern genres—think news stories, ads, memes, political posters—draw meaning from the synergy between images and text. Understanding these requires a model that can integrate not just lexical frames, but also visual entities, spatial relationships, emotional cues, and scene dynamics. That’s where our work comes in.

To support this, the FrameNet Brasil team and Red Hen Lab curated a dataset of 13,000+ news images, laying the groundwork for visual semantic parsing.

What We Needed: Structured Multimodal Outputs

To enrich our multimodal FrameNet model, we needed to extract the following from each image:

Tagged entities and the relationships between them
A detailed scene description in English
The same scene description in Portuguese, for FrameNet Brasil
Plausible event interpretations of what is happening
A flat list of the objects present

This required more than just detection or classification—it demanded a deep semantic analysis of complex visual scenes, detailed enough to support frame-based reasoning.
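
Concretely, every image had to yield one record with the same fields we later write to CSV. A minimal sketch of that record is below; the field names mirror the CSV header used in the pipeline code further down, but the dataclass itself is illustrative rather than the project's actual data model.

from dataclasses import dataclass

@dataclass
class ImageAnalysis:
    """One structured record per image, mirroring the CSV columns used later."""
    image_path: str
    gpu: int                        # GPU index that processed the image
    entities_relationships: str     # tagged entities and how they relate
    scene_description_en: str       # detailed scene description in English
    scene_description_pt: str       # the same description in Portuguese
    event_description: str          # plausible event interpretations
    objects_list: str               # flat list of objects in the image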

The Model: Gemma-3B with Quantization

We used Gemma-3B, a March 2025 release known for its multimodal reasoning, and deployed it in its 4-bit quantized form. This let us reduce memory usage without losing semantic depth.

Quantization advantages:

4-bit weights cut the model's memory footprint to roughly a quarter of its 16-bit size.
The model fits comfortably on a single L40S, leaving headroom for image inputs and long prompts.
Output quality for our descriptive tasks stayed essentially on par with full precision.
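
For reference, here is a minimal sketch of loading a Gemma checkpoint in 4-bit with Hugging Face transformers and bitsandbytes. The checkpoint name and the Auto class below are assumptions on my part; the exact loading code we used is in the GitHub repo linked below.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

# NF4 4-bit quantization: weights are stored in 4 bits, compute runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

MODEL_ID = "google/gemma-3-4b-it"  # assumed checkpoint; see the repo for the one we used

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map={"": 0},  # pin the quantized model to a single GPU
)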

Scaling the Pipeline: Parallel Processing with PyTorch and Multiprocessing

Motivation

Processing 13,000+ images sequentially or on a single GPU would have taken weeks. Instead, we used three NVIDIA L40S GPUs in parallel on Case HPC, dramatically reducing the time required.

Reference Code Highlights

Here’s a snippet of our parallel image-analysis pipeline:

import csv
import torch
from multiprocessing import Pool
from tqdm import tqdm

# init_worker (per-worker setup) and process_image (per-image analysis)
# are defined in the full script; see the GitHub link below.

num_gpus = torch.cuda.device_count()
if num_gpus == 0:
    raise RuntimeError("No GPUs available.")
print(f"Detected {num_gpus} GPU(s).")

# image_paths: list of dataset image file paths, built earlier in the full script.
# Round-robin assignment: image i goes to GPU i % num_gpus.
tasks = [(img, i % num_gpus) for i, img in enumerate(image_paths)]

csv_file = 'structured_image_analysis_results.csv'
total_images = len(tasks)

# One worker process per GPU; results stream back as they finish and are written to CSV incrementally.
with Pool(processes=num_gpus, initializer=init_worker) as pool:
    with open(csv_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow([
            "Image Path", "GPU", "Entities & Relationships",
            "Scene Description (English)", "Scene Description (Portuguese)",
            "Event Description", "Objects List"
        ])
        for result in tqdm(pool.imap_unordered(process_image, tasks), total=total_images, desc="Processing images"):
            if result['status'] == 'success':
                writer.writerow([
                    result['image_path'],
                    result['gpu'],
                    result['entities_relationships'],
                    result['scene_description_en'],
                    result['scene_description_pt'],
                    result['event_description'],
                    result['objects_list']
                ])
            else:
                print(f"Failed to process {result['image_path']} on GPU {result['gpu']}")

Check out the full code here: GitHub

How it works:

Each image is paired with a GPU index up front (i % num_gpus), so work is spread evenly across the three L40S cards.
A Pool with one worker process per GPU is created; init_worker handles per-worker setup (see the full script).
pool.imap_unordered streams results back as soon as each image finishes, in whatever order they complete.
Successful results are appended to the CSV immediately, so a long run can be monitored and partially recovered while it is still going.
Failures are logged with the image path and GPU index instead of stopping the whole run.
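
For readers curious about the worker side, here is a hedged sketch of what init_worker and process_image could look like. The real implementations live in the GitHub repo; load_model and run_prompt below are placeholders for our actual model-loading and generation code.

import torch

_model = None  # one model per worker process, loaded lazily on the first task

def load_model(device):
    """Placeholder for the real 4-bit model-loading code in the repo."""
    raise NotImplementedError

def run_prompt(model, image_path):
    """Placeholder for the real prompt/generation code in the repo."""
    raise NotImplementedError

def init_worker():
    """Per-worker setup; runs once when the Pool starts each worker."""
    global _model
    _model = None

def process_image(task):
    """Analyze one image on its assigned GPU; return a dict the main process writes to CSV."""
    global _model
    image_path, gpu_id = task
    try:
        torch.cuda.set_device(gpu_id)
        if _model is None:
            _model = load_model(device=f"cuda:{gpu_id}")
        sections = run_prompt(_model, image_path)  # dict with one entry per prompt section
        return {
            "status": "success",
            "image_path": image_path,
            "gpu": gpu_id,
            "entities_relationships": sections["entities"],
            "scene_description_en": sections["scene_en"],
            "scene_description_pt": sections["scene_pt"],
            "event_description": sections["events"],
            "objects_list": sections["objects"],
        }
    except Exception:
        return {"status": "error", "image_path": image_path, "gpu": gpu_id}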

Performance: With 3 GPUs, we processed 13,500+ images in under 60 hours of wall-clock time (roughly 4,500 images per card), averaging about 40 seconds per image depending on scene complexity.

Prompt Engineering for Semantic Rigor

The quality of the model’s output depended heavily on prompt design. We crafted a multi-section prompt (sketched below) that instructed the model to:

Tag every entity it mentions with a category label, e.g. [person], [clothing], [place]
Describe the relationships between those entities
Write a detailed scene description in English, and the same description in Portuguese
Propose plausible event interpretations of what is happening in the scene
Finish with a flat list of the objects present

This approach made it possible for our pipeline to automatically parse and index semantic content, supporting future training of multimodal parsers and knowledge graphs.
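
As a rough illustration of the prompt's shape, reconstructed from the outputs it produced rather than quoted verbatim (the exact wording lives in the repo):

# Illustrative reconstruction of the multi-section prompt; not the exact wording we used.
PROMPT = """You are analyzing a news image for semantic annotation.

1. Entities & Relationships: list every entity you can see, tag each with a
   category in square brackets (e.g. [person], [clothing], [place], [logo]),
   and state how the entities relate to one another.
2. Scene Description (English): describe the full scene in detail, keeping the
   category tags on each entity mention.
3. Scene Description (Portuguese): give the same description in Portuguese.
4. Event Description: list up to three plausible interpretations of the event
   taking place.
5. Objects List: finish with a flat list of the objects present in the image.
"""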

Semantic Outputs: A Real Example

Example image: a soccer player celebrating a goal.

Let’s break down what our pipeline extracted from this soccer celebration image—and why each part matters.

Why this output is remarkable:

Our model doesn’t just list objects. It understands roles (who is celebrating, who is observing), captures relationships, and generates rich, multilingual scene descriptions. This is far beyond basic object detection—it’s structured, context-aware, and ready for advanced research.

Entities & Relationships

A person [person] wearing a red and black soccer jersey [clothing], shorts [clothing], and socks [clothing] is jumping and celebrating.
Another person [person] wearing a black jersey [clothing] and shorts [clothing] is looking at the first person [person].
A soccer field [place] is visible in the background.
Grass [plant] covers the soccer field [place].
A logo [logo] is visible on the jersey [clothing] of the person [person].
A logo [logo] is visible on the soccer jersey [clothing].
A logo [logo] is visible on the shorts [clothing].
A sponsor logo [logo] is visible on the jersey [clothing].

Scene Description (English)

The image shows two people [person] on a well-maintained green soccer field [place]. The person on the left is mid-jump, with arms raised in a celebratory gesture, and a wide smile on their face. They are wearing a red and black soccer jersey [clothing] with various logos [logo] and sponsor markings. They are also wearing shorts [clothing] and socks [clothing]. The person on the right is looking towards the first person [person], seemingly observing the celebration. They are wearing a black jersey [clothing] and shorts [clothing]. The background consists of the expansive green grass [plant] of the soccer field [place]. The lighting appears to be bright and natural, suggesting an outdoor setting.

Scene Description (Portuguese)

A imagem mostra duas pessoas [person] em um campo de futebol [place] bem cuidado e verde. A pessoa à esquerda está pulando, com os braços levantados em um gesto de celebração e um largo sorriso no rosto. Eles estão vestindo uma camisa de futebol [clothing] vermelha e preta com vários logos [logo] e marcas de patrocinadores. Eles também estão usando shorts [clothing] e meias [clothing]. A pessoa à direita está olhando para a primeira pessoa [person], parecendo observar a celebração. Eles estão vestindo uma camisa preta [clothing] e shorts [clothing]. O fundo consiste na grama verde [plant] do campo de futebol [place]. A iluminação parece ser brilhante e natural, sugerindo um ambiente externo.

Event Description

1. A soccer player [person] has just scored a goal and is celebrating with a teammate [person].
2. A soccer player [person] is celebrating a victory with a teammate [person] after a match.
3. A soccer player [person] is celebrating a personal achievement with a teammate [person] during a training session.

Objects List

Soccer field
Grass
Soccer jersey
Shorts
Socks
Logos
Sponsor markings
Person
Teammate
Clothing
Plant
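
Because every entity mention carries a bracketed category tag, output like the Entities & Relationships block above is straightforward to post-process. Here is a hedged sketch (not part of our pipeline) of pulling out (mention, category) pairs with a regular expression:

import re

# Matches "some mention words [category]" pairs in the tagged output above.
TAG_PATTERN = re.compile(r"([\w\s&-]+?)\s*\[([a-z]+)\]")

line = ("A person [person] wearing a red and black soccer jersey [clothing], "
        "shorts [clothing], and socks [clothing] is jumping and celebrating.")

pairs = [(mention.strip(), category) for mention, category in TAG_PATTERN.findall(line)]
print(pairs)
# [('A person', 'person'), ('wearing a red and black soccer jersey', 'clothing'),
#  ('shorts', 'clothing'), ('and socks', 'clothing')]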

Technical Takeaways

4-bit quantization made a multimodal model practical on our hardware without sacrificing the semantic detail we needed.
One worker process per GPU, with round-robin task assignment and streamed results, scaled the pipeline across three L40S cards with very little code.
Prompt design mattered as much as model choice: the multi-section prompt is what turned raw captions into structured, taggable output.
Writing results to CSV as they complete made a multi-day run easy to monitor and resilient to individual failures.

Conclusion

Our work shows that extracting structured meaning from images at scale is not just possible but efficient and practical. With quantized models, careful prompts, and parallel GPU processing, we’ve built a pipeline that makes semantic multimodal annotation a reality for real research.


Acknowledgements

My sincere thanks to:

Their mentorship and collaboration have been invaluable throughout this project.

