Seeing History Unseen

Evaluating Vision-Language Models for WCAG-Compliant Alt-Text in Digital Heritage Collections

Moritz Mähr

University of Basel

University of Bern

Moritz Twente

University of Basel

Seeing History Unseen

Evaluating vision–language models (VLMs) for alt-text in digital heritage collections

  • Case Study: Stadt.Geschichte.Basel Open Research Data Platform
  • Focus: WCAG-compliant, historically informed alt-text for heterogeneous images

Motivation: Accessibility Gap

  • Digitized archives still exclude blind and low-vision users
  • Many images lack meaningful alt-text or rely on minimal, generic captions
  • Alt-text is not just a technical requirement but a matter of historical justice
  • Rich descriptions also benefit other groups
    (e.g., neurodivergent readers)

Alt-Text in Digital Heritage

  • Alt-text translates visual evidence into language
  • Production is labor-intensive; often postponed or dropped entirely
  • Requires:
    • Domain knowledge (history, archives, local context)
    • Accessibility expertise (WCAG, screenreader use)
    • Editorial judgement (what is salient, what is harmful)

Opportunities and Risks of VLMs

  • Multimodal models can caption images at scale
    (e.g., GPT-4o mini, Gemini, Llama, Qwen)

  • Promises

    • High coverage, fast throughput, low marginal cost
    • Potential to fill the gap for large collections
  • Risks

    • Hallucinations, misrecognitions, and subtle biases
    • Uncritical reproduction of harmful terminology from sources or training data

Research Questions

  • RQ1 — Feasibility
    • What coverage, throughput, and unit cost can current VLMs achieve for WCAG-aligned alt-text on a heterogeneous heritage corpus?
  • RQ2 — Relative quality
    • How do humanities experts rank model outputs?
    • Which error patterns recur in ‹best› alt-texts?

Corpus and Dataset


Stadt.Geschichte.Basel
Open Research Data Platform

≈1900 media objects (as of Dec 2025)

  • Media Types: photographs, maps, drawings, objects, diagrams, ephemera
  • Time Periods: from prehistoric to contemporary Basel
  • Complexity: multi-part figures, text-heavy items

Models under Study

  • Four vision–language models via OpenRouter:
    • Google Gemini 2.5 Flash Lite
    • Meta Llama 4 Maverick
    • OpenAI GPT-4o mini
    • Qwen 3 VL 8B Instruct
  • Selection criteria:
    • Mix of proprietary and open-weight models
    • Similar cost range and context window
    • Multimodal and multilingual (incl. German)

Prompt and Pipeline Design

  • Fixed, WCAG-aligned system prompt
    • Short, neutral, factual descriptions
    • No “image of …”, no alt= wrappers, no emojis
    • Length targets
      • 90–180 characters
      • up to 400 for complex charts
    • Rules for portraits, objects, documents, maps, event photos

Prompt and Pipeline Design

  • User prompt
    • Concise, structured metadata (title, description, date, era, creator, publisher, source)
    • Image URL at the end
  • Pipeline:
    • Standardized JPG (800×800) + JSON metadata
      → 4 candidate alt-texts per image
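
A minimal sketch of one pipeline call, assuming the OpenAI-compatible OpenRouter chat completions endpoint; the prompt wording, model slug, and metadata fields below are placeholders, not the exact strings used in the study:

```python
# Hypothetical sketch of one pipeline step: request a single alt-text
# candidate via OpenRouter's chat completions endpoint. Prompt wording,
# model slug, and metadata fields are placeholders, not the study's own.
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
SYSTEM_PROMPT = (
    "Write a short, neutral, factual alt-text (90-180 characters, up to 400 "
    "for complex charts). Do not begin with 'image of', do not add alt= "
    "wrappers or emojis."
)

def draft_alt_text(api_key: str, model: str, metadata: dict, image_url: str) -> str:
    """Send structured metadata plus the image URL; return one candidate."""
    user_text = "\n".join(f"{k}: {v}" for k, v in metadata.items())
    payload = {
        "model": model,  # e.g. "openai/gpt-4o-mini" (slug is an assumption)
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": user_text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
    }
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()
```

Running this once per model yields the four candidates per image described above.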

Expert Ranking Study

  • Participants: 21 humanities scholars

  • Instructions:

    • For each of the 20 images, rank 4 model outputs
      from 1 (best) to 4 (worst)
    • Focus on WCAG-aligned criteria:
      • Core visual content, no redundancy
      • Salient features and visible text
      • Context only when it aids understanding
    • Factual accuracy and bias were not foregrounded but could still influence judgement

RQ1 – Feasibility Results

  • Coverage

    • All four models produced non-empty alt-text for all 20 images → 100% coverage
  • Latency and throughput

    • ≈2–4 seconds per item; ≈0.24–0.43 items/s
    • Qwen 3 VL 8B: fastest; GPT-4o mini: slower but stable
  • Cost

    • ≈1.8×10⁻⁴ to 3.6×10⁻³ USD per description
    • Sub-cent cost even for multiple candidates per image
  → Technically and economically feasible at collection scale
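
A rough extrapolation from the reported per-description cost range to the full ≈1,900-object corpus (four candidates per image, as in the pipeline above); figures are illustrative and depend on image size, prompt length, and current pricing:

```python
# Illustrative cost extrapolation from the reported per-description range.
n_objects = 1900           # media objects in the corpus
candidates_per_image = 4   # candidate alt-texts per image
cost_low, cost_high = 1.8e-4, 3.6e-3  # USD per description (reported range)

total_low = n_objects * candidates_per_image * cost_low
total_high = n_objects * candidates_per_image * cost_high
print(f"~${total_low:.2f} to ~${total_high:.2f} for the full collection")
# ~$1.37 to ~$27.36
```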

RQ2 – Quantitative Ranking Results

  • Rank distributions

    • GPT-4o mini and Qwen receive more first-place and fewer last-place ranks
    • Gemini and Llama perform slightly worse descriptively
  • Statistical tests

    • Friedman test: χ²(3, N=20) = 6.02, p = 0.11
    • Kendall’s W ≈ 0.01 → very low agreement across tasks
    • Pairwise Wilcoxon tests (Holm-corrected): no significant differences
  → No clear ‹winner›; relative model quality varies by image
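
A minimal sketch of how such rank statistics can be computed with SciPy and statsmodels, assuming a 20 × 4 matrix of per-image ranks (rows = images, columns = models); the randomly permuted ranks below are placeholders, not the study's data:

```python
# Sketch of the ranking analysis on a 20x4 rank matrix (placeholder data).
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
ranks = np.array([rng.permutation([1, 2, 3, 4]) for _ in range(20)])

# Friedman test across the four models (columns)
stat, p = friedmanchisquare(*[ranks[:, j] for j in range(4)])

# Kendall's W derived from the Friedman statistic: W = chi2 / (N * (k - 1))
n, k = ranks.shape
w = stat / (n * (k - 1))
print(f"Friedman chi2={stat:.2f}, p={p:.3f}, Kendall's W={w:.3f}")

# Pairwise Wilcoxon signed-rank tests with Holm correction
pairs = list(combinations(range(k), 2))
pvals = [wilcoxon(ranks[:, i], ranks[:, j]).pvalue for i, j in pairs]
reject, p_adj, _, _ = multipletests(pvals, method="holm")
for (i, j), p_ij in zip(pairs, p_adj):
    print(f"model {i} vs model {j}: Holm-adjusted p = {p_ij:.3f}")
```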

Error Patterns: Factual Misrecognition

e.g., describing people walking down a staircase when the image shows them walking up
(Faltblatt der Gruppe ‹Freiräume für Frauen› FFF, 1992)

Error Patterns: Reproduction of Stereotypes

e.g., uncritical repetition of phrases like ‹reicher Jude› (‹rich Jew›) from metadata or training data
(Karikatur ‹Gegen den Eisenbahnhandel›, 1898)

Error Patterns: Selective Omission

e.g., summarising trends for one data series in a chart while ignoring others
(Publikumszahlen an Konzerten und FCB-Spielen im St. Jakob-Stadion, 1960–2022)

Operationally Viable, Epistemically Fragile

  • Operationally viable (RQ1)
    • High coverage, low latency, negligible cost
    • Easy to integrate into existing pipelines
  • Epistemically fragile (RQ2)
    • Errors are uneven and task-dependent
    • Biases and omissions require domain-sensitive review

Limitations and Future Work

  • Single institutional corpus; 100-item dataset, 20-item survey subset
  • Four models at one point in time (Oct 2025)
  • Future directions:
    • Larger and more diverse participant pools, including blind and low-vision users
    • Comparative studies of different prompt and review strategies
    • Integration with authoring tools and cataloguing workflows

Practical Recommendations

  • Recommended use:
    • VLMs as drafting aids for trained editors
    • Human-in-the-loop workflows with:
      • Editorial guidelines for sensitive content
      • Checks for bias, factual errors, and omissions

Practical Recommendations

  • For DH Teams and GLAM Institutions:

    • Start with limited, well-documented pilot runs
    • Prioritise sensitive or high-impact items for manual review
    • Develop internal style guides for alt-text and metadata
    • Log model version, prompts, and provenance for transparency (see the sketch below)
  • For Model and Tool Builders:

    • Support context-aware prompting (metadata + image)
    • Provide better controls for bias and harmful content
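
One possible shape for the provenance record recommended above (see ‹Log model version, prompts, and provenance›); all field names are illustrative, not a fixed schema:

```python
# Hypothetical provenance record for one generated alt-text draft.
import json
from datetime import datetime, timezone

record = {
    "item_id": "sgb-media-0001",            # collection identifier (example)
    "model": "openai/gpt-4o-mini",          # model slug as invoked (assumption)
    "run_date": "2025-10",                  # when the generation run took place
    "prompt_id": "wcag-alt-text-v1",        # reference to the versioned prompt
    "image_checksum": "sha256:<digest>",    # ties the output to the exact input
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "reviewed_by": None,                    # filled in during human review
    "review_status": "draft",
}
print(json.dumps(record, indent=2, ensure_ascii=False))
```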

Take-Home Messages

  • Do not use VLMs as fully automated captioners
  • Treat alt-text as:
    • An interpretive, historiographical act
    • An ethical responsibility toward diverse audiences
  • Effective adoption requires:
    • Clear editorial policies and sensitivity guidelines
    • Human-in-the-loop review with domain and accessibility expertise
    • Transparent, reproducible workflows and open data

Accessibility with AI is a design and governance choice