Synthetic Visual Genome

Synthetic Dense Scene Graphs at Scale

1University of Washington, 2Allen Institute for Artificial Intelligence,
3Stanford University, 4Woven by Toyota
CVPR 2025

Example of our Synthetic Visual Genome (SVG) dataset: the first automatically generated large-scale scene graph dataset with diverse open-set categories, fine-grained regions, and densely annotated relationships. SVG averages four times more relations per object than Visual Genome.

Introduction

Reasoning over visual relationships (spatial, functional, interactional, social, and more) is a fundamental component of human cognition. Yet, despite major advances in the visual comprehension abilities of multimodal language models (MLMs), precisely reasoning over relationships and generating them remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships, capable of constructing high-quality dense scene graphs at scale.

To train ROBIN, we curate SVG, a synthetic scene graph dataset built by completing the missing relations of selected objects in existing scene graphs with a teacher MLM, followed by a carefully designed filtering process to ensure high quality. To generate richer and more accurate scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework in which GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and suggesting relevant new ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects.
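The SG-EDIT refinement step can be pictured as a single round-trip to the teacher model per image. The sketch below is an illustration only: the prompt wording, JSON schema, and the refine_scene_graph helper are assumptions for exposition, not the paper's released prompts or code; it uses the OpenAI Python client for the GPT-4o call.

# Illustrative sketch of an SG-EDIT-style refinement step. The JSON schema and
# prompt are assumptions for exposition, not the paper's released prompts.
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def refine_scene_graph(scene_graph: dict, image_url: str) -> dict:
    """Ask GPT-4o to drop unlikely relations and propose missing ones."""
    prompt = (
        "Here is a predicted scene graph as JSON. Remove relations that are "
        "unlikely given the image, add relevant missing ones, and return "
        "valid JSON with the same schema.\n\n" + json.dumps(scene_graph)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)

predicted = {
    "objects": [{"id": 0, "name": "person"}, {"id": 1, "name": "bicycle"}],
    "relations": [{"subject": 0, "predicate": "riding", "object": 1}],
}
refined = refine_scene_graph(predicted, "https://example.com/street.jpg")
print(refined)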

Results show that our LogoROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.

SVG Dataset

Two-Stage Pipeline Overview

SVG is created through a systematic two-stage pipeline that leverages powerful multimodal models to generate dense, high-quality scene graph annotations at scale. Our approach addresses a key limitation of existing scene graph datasets, which typically lack dense and diverse relationship annotations.
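Concretely, each annotated image can be thought of as a set of segmented regions plus a dense list of subject-predicate-object triplets. A minimal sketch of such a record is shown below; the field names are illustrative assumptions, not the released SVG schema.

# Minimal, illustrative layout of a densely annotated scene graph record.
# Field names are assumptions for exposition, not the released SVG schema.
from dataclasses import dataclass, field

@dataclass
class Region:
    region_id: int
    label: str          # open-set category, e.g. "espresso machine"
    segmentation: list  # polygon or RLE mask for the region

@dataclass
class Relation:
    subject_id: int     # index into regions
    predicate: str      # e.g. "holding", "sitting on", "part of"
    object_id: int

@dataclass
class SceneGraph:
    image_id: str
    regions: list = field(default_factory=list)
    relations: list = field(default_factory=list)

# Toy example: "person holding cup", "cup on table".
sg = SceneGraph(
    image_id="000123",
    regions=[Region(0, "person", []), Region(1, "cup", []), Region(2, "table", [])],
    relations=[Relation(0, "holding", 1), Relation(1, "on", 2)],
)
print(len(sg.relations), "relations over", len(sg.regions), "regions")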

Our final datasets consist of:

  • SVG-Stage1: 33K images with 25.5 triplets per image and 1.9 predicates per region
  • SVG-Stage2: 113K images with 42.3 triplets per image and 2.4 predicates per region
You can download the complete dataset on 🤗 Hugging Face.
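If the release is hosted as a standard Hugging Face dataset, it can be loaded with the datasets library as sketched below; the repository id is a placeholder and should be replaced with the id listed on the project's Hugging Face page.

# Minimal loading sketch using the Hugging Face `datasets` library.
# "ORG/SVG-Stage1" is a placeholder repository id, not the actual path.
from datasets import load_dataset

svg = load_dataset("ORG/SVG-Stage1", split="train")  # replace with the real repo id
print(svg)            # features and number of rows
print(svg[0].keys())  # fields of a single scene-graph record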


Model: ROBIN-3B


ROBIN-3B Architecture: Our model consists of three main components: (1) a ConvNeXt-Large vision encoder that processes the entire image into image tokens, (2) a pixel-level mask-aware extractor that embeds segmentation masks into mask tokens, and (3) a Qwen2.5-3B language model that processes image, mask, and text tokens. This dual representation (pixel-level masks plus text coordinates) enables precise, fine-grained localization, handling up to 99 regions per image within the LLM's 8K context window.
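A rough picture of how the three token streams could be combined is sketched below: patch features from the vision encoder, one pooled token per segmentation mask, and embedded text tokens concatenated into the LLM input. Tensor shapes, the pooling scheme, and module names are illustrative assumptions, not the released implementation.

# Conceptual sketch of the dual-representation input: image tokens, mask tokens,
# and text tokens concatenated for the language model. All shapes and modules
# here are illustrative assumptions, not ROBIN-3B's actual implementation.
import torch
import torch.nn as nn

class MaskAwareExtractor(nn.Module):
    """Pools patch features inside each segmentation mask into one mask token."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feats: (num_patches, dim) patch features from the vision encoder
        # masks: (num_regions, num_patches) binary masks resampled to the patch grid
        weights = masks / masks.sum(dim=1, keepdim=True).clamp(min=1)
        pooled = weights @ feats       # (num_regions, dim), average inside each mask
        return self.proj(pooled)       # one mask token per region

dim, num_patches, num_regions = 1024, 576, 8
image_tokens = torch.randn(num_patches, dim)                  # from the vision encoder
masks = (torch.rand(num_regions, num_patches) > 0.5).float()  # toy segmentation masks
mask_tokens = MaskAwareExtractor(dim)(image_tokens, masks)
text_tokens = torch.randn(32, dim)                            # embedded prompt + coordinates

# The language model consumes the concatenated image, mask, and text tokens.
llm_input = torch.cat([image_tokens, mask_tokens, text_tokens], dim=0)
print(llm_input.shape)  # torch.Size([616, 1024])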

Generated Scene Graphs

Experiment Results

1. Relationship VQA: ROBIN-3B trained with fewer than 3M instances outperforms VLMs trained with more than 300M instances.

2. Referring Expression Comprehension: ROBIN-3B achieves the state of the art on Referring Expression Comprehension, beating models of up to 13B parameters.

3. Region Recognition: ROBIN-3B achieves the state of the art on Semantic Segmentation and Object & Part-Level Recognition.

4. Panoptic Scene Graph Generation: ROBIN-3B outperforms a 13B model on Panoptic Scene Graph Generation, and self-distillation with refined scene graphs further enhances performance.

BibTeX

@inproceedings{park2025svg,
  author    = {Park, Jae Sung and Ma, Zixian and Li, Linjie and Zheng, Chenhao and Hsieh, Cheng-Yu and Lu, Ximing and Chandu, Khyathi and Kong, Quan and Kobori, Norimasa and Farhadi, Ali and Choi, Yejin and Krishna, Ranjay},
  title     = {Synthetic Visual Genome: Dense Scene Graphs at Scale with Multimodal Language Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}