TL;DR

Is language supervision required to learn effective visual representations for multimodal tasks?

Not necessarily! Our research shows that when trained on sufficient web-scale data (2B+ images) and scaled to larger model sizes (7B parameters), visual self-supervised learning models can match or even outperform language-supervised models like CLIP across a broad range of visual question answering tasks — including OCR and chart understanding — without using any language supervision.

Web-SSL Scaling Performance
Model Scaling Works

Web-DINO improves consistently with model size (+4.9% from 1B to 7B params), while CLIP plateaus beyond 3B parameters.

Data Scale Matters

OCR & Chart understanding improves dramatically (+12.6%) as training data increases from 1B to 8B examples.

No Language Needed

Web-DINO at 7B parameters outperforms CLIP on multimodal tasks without any language supervision during training.

Web-SSL: Visual SSL 1.0 → 2.0

Web-Scale Data

Using the MetaCLIP dataset (2B images) to control for data distribution differences and elevate visual SSL to a new data regime

Parameter Scaling

Training Vision Transformers from 1B to 7B parameters to find the ceiling of visual SSL and identify emergent capabilities

Comprehensive Evaluation

Rigorous assessment across 16 diverse VQA benchmarks that span General, Knowledge, OCR & Chart, and Vision-Centric categories

Scaling Visual SSL

Effect of Model Scaling

Model scaling: Web-DINO performance improves consistently with increasing parameter count (+4.9% from 1B to 7B), while CLIP performance plateaus beyond 3B parameters (+0.7% from 1B to 7B). All models are trained on 2B images from MetaCLIP (MC-2B).
Average VQA Performance of Web-DINO Trained on MC-2B
  • 1B params: 49.0%
  • 3B params: 51.7%
  • 5B params: 52.8%
  • 7B params: 53.9%
Key Findings
  • Scaling Behavior: Web-DINO improves consistently with model size.
  • Task-Specific Gains: OCR & Chart performance improves the most (+8.2%), followed by Vision-Centric (+5.9%) from 1B to 7B params.
  • DINO vs. CLIP: Web-DINO matches CLIP at 5B parameters, and outperforms CLIP at 7B parameters.

Effect of Number of Training Samples

Training data size analysis: Web-DINO ViT-7B continues to improve with more training samples (+4.2% from 1B to 8B), with OCR & Chart tasks exhibiting the most significant improvements (+12.6% from 1B to 8B).
OCR & Chart Performance of Web-DINO
  • 1B examples: 26.8%
  • 2B examples: 31.3%
  • 4B examples: 35.6%
  • 8B examples: 39.3%
Task-Specific Insights
  • General VQA: Slight improvement (+1.1%) with diminishing returns after 2B examples.
  • Knowledge VQA: Modest improvements (+1.5%) with diminishing returns after 2B examples.
  • OCR & Chart: Significant improvements with no sign of saturation (+12.6%).
  • DINO vs. CLIP: Web-DINO consistently outperforms CLIP at every fixed training-data budget.

Analysis of Scaling Results

1

Does the observed scaling behavior generalize to other visual SSL methods?

Yes, both joint embedding methods (Web-DINO) and masked modeling methods (Web-MAE) exhibit similar scaling properties, though with distinctive performance characteristics.

Comparison of Web-DINO and Web-MAE performance across model scales (1B-5B parameters)

Findings:

  • Web-MAE and Web-DINO improve consistently with model size (+2.3% for Web-MAE and +3.9% for Web-DINO from 1B to 5B params on average VQA)
  • Web-MAE achieves better OCR & Chart performance (+2.5% compared to Web-DINO at 5B parameters)

These results demonstrate that the observed scaling behavior generalizes across different visual SSL methods, suggesting that the phenomenon is intrinsic to the visual self-supervised paradigm rather than specific to any single method.

2

Can visual SSL scale effectively without diverse web-scale data?

No, models trained on smaller datasets such as ImageNet-1k (1.2M unique images) improve negligibly with model size, highlighting the critical importance of diverse web-scale data.

Data Distribution Comparison

  • ImageNet-1k (1.2M unique images): negligible change (-0.1% from 1B to 3B)
  • MetaCLIP data (2B+ unique images): noticeable improvement (+2.7% from 1B to 3B)

This parallels observations from language model research where data diversity and scale are essential for effective model scaling.

Performance comparison between models trained on ImageNet-1k vs. MetaCLIP data across model sizes

3

How do Web-SSL models perform on classic vision tasks?

Web-DINO models maintain strong performance on traditional vision benchmarks while improving on VQA tasks. However, scaling trends on classic vision tasks are far less pronounced than on VQA.

Performance on Classic Vision Tasks

Model             ImageNet-1k   ADE20K   NYU Depth ↓
MetaCLIP ViT-G          86.4%    46.7%         0.415
DINOv2 ViT-g            86.5%    53.0%         0.298
Web-DINO ViT-1B         84.7%    51.0%         0.345
Web-DINO ViT-7B         86.0%    54.7%         0.339

Unlike VQA performance, classic vision performance improves only modestly with increased parameter count.

This result highlights VQA's value as a complementary evaluation protocol that may better reflect real-world perceptual challenges.

Performance trends on ImageNet, ADE20K and NYU Depth.

4

Why do visual SSL models develop OCR capabilities without language supervision?

Web-scale datasets naturally contain text-rich images that enable visual SSL models to learn OCR capabilities without explicit language supervision. Strategic data filtering further enhances this effect.

Examples of data filtering. Left: random samples from MC-2B. Middle: images containing text (Light Filter, 50.3% of data). Right: charts, tables and documents (Heavy Filter, 1.3% of data).
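The two filtering regimes can be sketched as follows. Here `contains_text` and `is_chart_table_or_doc` are hypothetical detector predicates standing in for whatever text/chart classifiers are used in practice; this is illustrative, not the paper's pipeline code.

```python
def filter_dataset(images, contains_text, is_chart_table_or_doc, mode="light"):
    """Keep the subset of images matching the chosen filtering regime.
    `contains_text` / `is_chart_table_or_doc` are hypothetical predicates."""
    if mode == "light":    # images containing any text: ~50.3% of MC-2B
        keep = contains_text
    elif mode == "heavy":  # charts, tables and documents: ~1.3% of MC-2B
        keep = is_chart_table_or_doc
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [image for image in images if keep(image)]
```

The heavy filter is simply a stricter predicate over the same data stream, which is why it retains a small fraction of images while disproportionately boosting OCR & Chart performance.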

Understanding OCR & Chart Performance Improvements with Data Filtering

OCR & Chart Performance Breakdown
Benchmark   Full Data   Heavy Filter   Improvement
ChartQA        23.3%        47.5%         +24.2%
OCRBench       15.6%        29.4%         +13.8%
TextVQA        49.2%        52.8%          +3.6%
DocVQA         19.0%        32.0%         +13.0%
Overall        26.8%        40.4%         +13.6%
Key Findings & Mechanisms
Major Performance Gains
  • Light Filter (50.3%): +6.4% on OCR & Chart tasks
  • Heavy Filter (1.3%): +13.6% on OCR & Chart tasks
  • ChartQA: +24.2% improvement with heavy filtering
Surpassing Language-Supervised Models

Heavy-filtered Web-DINO (40.4%) outperforms CLIP (36.1%) on OCR & Chart tasks despite using no language supervision and only 1.3% of training data.

Why It Works

Visual self-supervised learning effectively extracts textual information from images without language supervision. Web-scale datasets naturally contain text-rich images, and strategic data filtering significantly enhances this capability.

5

Do visual SSL representations align with language models?

Yes. As model size and training data increase, SSL models naturally develop representational alignment with language models, despite receiving no language supervision.

Possible Factors Driving Language Alignment

  • Web-scale training data exposes models to diverse visual concepts, including images containing text.
  • Increased model capacity and more training data enable learning more comprehensive and abstract representations that implicitly capture linguistic concepts.
Measurement of representational alignment between visual features and Llama-3 8B/70B language models, without any finetuning or alignment procedure.
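One common way to quantify this kind of cross-model alignment is a mutual nearest-neighbor score: for each input, compare the k-nearest-neighbor sets induced by the visual and language feature spaces. The numpy sketch below follows that recipe as an assumption; it is not necessarily the paper's exact metric.

```python
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=5):
    """Average overlap between the k-NN sets induced by two feature spaces
    computed over the same n inputs.
    feats_a: (n, d1) e.g. visual features; feats_b: (n, d2) e.g. LLM features.
    Returns a score in [0, 1]; 1.0 means identical neighborhood structure."""
    def knn_sets(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)  # cosine similarity
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)          # exclude each point itself
        idx = np.argsort(-sim, axis=1)[:, :k]   # top-k neighbors per row
        return [set(row) for row in idx]
    sets_a, sets_b = knn_sets(feats_a), knn_sets(feats_b)
    return float(np.mean([len(sa & sb) / k for sa, sb in zip(sets_a, sets_b)]))
```

Because the score only compares neighborhood structure, it needs no finetuning or learned mapping between the two models, matching the "without any finetuning or alignment procedure" setup above.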

Key Takeaways

  • 1

    Visual SSL Scales

    Visual SSL models improve consistently with both increasing model size and more training data. In contrast, CLIP models plateau beyond moderate sizes.

  • 2

    Pretraining Data Distribution Matters A Lot

    SSL models trained on traditional, smaller datasets like ImageNet-1k (1.2M images) show negligible improvements with increased parameter count. Diverse web-scale data is necessary to enable effective scaling of visual SSL models.

  • 3

    Text-Rich Images Enhance OCR Capabilities

    Training on a data subset containing text-rich images (1.3% of total data) significantly improves OCR & Chart understanding, even outperforming CLIP models of the same size trained on full data.

Resources

Paper

Read our paper for detailed methodology, results, and analysis.

Download Paper

Code & Models

Access open-source model definitions and inference code for Web-SSL.

GitHub Repository

Model Weights

Download our Web-SSL model weights for your research.

Download Models

Citation

@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={Fan, David and Tong, Shengbang and Zhu, Jiachen and Sinha, Koustuv and Liu, Zhuang and Chen, Xinlei and Rabbat, Michael and Ballas, Nicolas and LeCun, Yann and Bar, Amir and Xie, Saining},
  journal={arXiv preprint arXiv:2504.01017},
  year={2025}
}

FAQ

What is the difference between visual SSL and CLIP?

Visual Self-Supervised Learning (SSL) methods learn representations from images alone, using various pretext tasks like contrastive learning or masked image modeling. In contrast, Contrastive Language-Image Pretraining (CLIP) learns from paired image-text data, creating representations that align visual features with linguistic semantics. The primary distinction is that SSL operates without language supervision, while CLIP explicitly leverages language to guide representation learning.
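To make "explicitly leverages language" concrete, CLIP trains with a symmetric contrastive (InfoNCE) objective that ties each image to its paired caption. The following is a minimal numpy sketch of that standard objective, not a production implementation:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.
    img_emb, txt_emb: (batch, dim); row i of each is a matched image-text pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) cosine similarities

    def xent_diag(l):
        # cross-entropy where the matched pair (the diagonal) is the target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # image->text and text->image directions, averaged
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

A visual SSL method optimizes an objective defined over images alone (e.g. matching views of the same image, or reconstructing masked patches), so no `txt_emb` ever enters the loss.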

How do you evaluate vision encoders on VQA tasks?

We use a controlled two-stage visual instruction tuning procedure. First, a lightweight MLP adapter projects the vision encoder features into the LLM dimensionality, with only this adapter being trained. In the second stage, both the MLP adapter and LLM are finetuned. Critically, the vision encoder remains frozen in both stages, enabling fair comparison across different vision encoders. All experiments use the same LLM backbone (Llama-3 8B Instruct) and identical training data from the Cambrian-Alignment and Cambrian-7M datasets to ensure consistent evaluation.
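The frozen-encoder recipe can be sketched as follows. The adapter shape, activation, and which components train per stage are illustrative stand-ins, not the exact configuration used in the paper:

```python
import numpy as np

def mlp_adapter(feats, w1, b1, w2, b2):
    """Project frozen vision-encoder features into the LLM embedding space.
    feats: (num_patches, vision_dim) -> (num_patches, llm_dim)."""
    hidden = np.maximum(feats @ w1 + b1, 0.0)  # ReLU here; the real activation may differ
    return hidden @ w2 + b2

def trainable_components(stage):
    """Which components receive gradient updates in each tuning stage.
    The vision encoder stays frozen in both stages."""
    assert stage in (1, 2)
    return {"adapter"} if stage == 1 else {"adapter", "llm"}
```

Keeping the encoder frozen means any VQA difference between encoders reflects their pretrained representations, not how well they adapt during instruction tuning.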

How can visual SSL models learn to read text without language supervision?

Web-scale image datasets naturally contain significant textual information. Unlike object-centric datasets like ImageNet, web images frequently include text in the form of labels, signs, charts, and documents. Our research demonstrates that with sufficient training data (2B+ images) and model capacity (5B+ parameters), visual SSL models can develop effective text recognition capabilities without explicit language supervision. We found that strategic filtering for text-rich images further enhances OCR & Chart performance (+13.6% improvement), allowing SSL models to outperform comparable CLIP models (+4.3%) with only 1.3% of the original unique training images.

Will the models and code be released?

Yes, we plan to open-source our Web-SSL vision models to support reproducibility and further research into visual self-supervised learning. Inference code and model weights will be available via our GitHub repository. We hope releasing these resources will enable the broader research community to build the next generation of vision models that excel at both conventional vision and modern multimodal capabilities.

Correspondence

Please reach out to David Fan and Shengbang (Peter) Tong with any questions.

davidfan [at] meta [dot] com

st5087 [at] nyu [dot] edu