Seeing History Unseen
Evaluating Vision–Language Models (VLMs) for WCAG-Compliant Alt-Text in Digital Heritage Collections
Case Study: Stadt.Geschichte.Basel Open Research Data Platform
Focus: WCAG-compliant, historically informed alt-text for heterogeneous images
Motivation: Accessibility Gap
Alt-Text in Digital Heritage
Opportunities and Risks of VLMs
Multimodal models can caption images at scale (e.g., GPT-4o mini, Gemini, Llama, Qwen)
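As a concrete illustration of captioning at scale, the snippet below is a minimal sketch of how a draft alt-text could be requested from GPT-4o mini via the OpenAI Python SDK. Only the model name comes from the poster; the prompt wording, image path, and parameters are illustrative assumptions rather than the study's actual pipeline, and any generated caption would still require the historical and WCAG review this poster argues for.

# Minimal sketch (assumed setup): draft alt-text from GPT-4o mini via the OpenAI SDK.
# The prompt text, file path, and token limit are illustrative, not the study's pipeline.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def draft_alt_text(image_path: str) -> str:
    """Return a one-sentence alt-text draft for the image at image_path."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Write concise, WCAG-oriented alt-text (one sentence, "
                            "no leading 'image of') for this historical image."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=120,
    )
    return response.choices[0].message.content.strip()

# Hypothetical usage:
# print(draft_alt_text("media/example_object.jpg"))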
Promises
Risks
Research Questions
Corpus and Dataset
Stadt.Geschichte.Basel Open Research Data Platform
≈1,900 media objects (as of Dec 2025)