A Long-Term Archival Pipeline for the Forschungsdatenplattform Stadt.Geschichte.Basel

omeka2dsp

Authors
Affiliations

Moritz Mähr

University of Basel

University of Bern

Moritz Twente

University of Basel

Published

October 15, 2025

Doi
Abstract

The Forschungsdatenplattform by Stadt.Geschichte.Basel provides open access to diverse historical materials relating to the city of Basel, including texts, images, statistical, and geospatial data. While technically robust and publicly available through GitHub Pages using the CollectionBuilder-CSV, our current infrastructure faces critical sustainability and scalability limitations.

At present, metadata and files are curated and managed using the Omeka-S instance provided by the University of Bern, from which we extract the data for display via CollectionBuilder. However, this setup introduces significant risks for long-term availability:

  • Omeka is not under our institutional control and may not remain permanently funded.

  • GitHub Pages is not suitable for serving large files or guaranteeing persistence.

  • CollectionBuilder lacks built-in support for versioning and persistent identifiers at the level required by research data infrastructures.

To address these challenges, we are implementing a transition pipeline to archive all metadata and associated files with DaSCH, leveraging its infrastructure for versioned, durable, and FAIR-compliant research data publication. The new pipeline includes:

  1. Metadata harvesting and transformation from Omeka to the DaSCH data model.

  2. Automated deposit and update routines using the DaSCH REST API, including support for versioning existing records.

  3. Writing back stable DaSCH identifiers and access URLs to our public-facing platform, ensuring transparency and citability.

  4. Preservation of hierarchical relationships and media metadata, aligned with our custom metadata model, which incorporates principles of anti-discriminatory description practices.

This transition allows us to decouple the archival backend from the front-end presentation, ensuring long-term data accessibility, citability, and semantic interoperability. In our presentation, we will share the architectural overview, code-level considerations, and our reflections on working with DaSCH APIs in a real-world context, including:

  • Lessons learned in metadata crosswalks and transformation logic.

  • Technical caveats in version control, file transfers, and identifier management.

  • Challenges in aligning minimal computing approaches (CollectionBuilder) with robust backend infrastructures (DSP).

This case study illustrates how lightweight digital publishing environments can be effectively combined with national research infrastructures to deliver sustainable, standards-based access to historical research data. It also raises broader questions about infrastructural independence, scalability of humanities platforms, and the practical challenges of implementing FAIR principles in community-developed software ecosystems.

  • Large-scale historical research project, initiated in 2011 by the Association for Basel History and carried out 2017–2026 at the University of Basel
  • More than 70 researchers studying the history of Basel from the earliest settlements to the present day
  • Funded with more than 9 million Swiss francs by the Canton of Basel-Stadt, the Lottery Fund, and private sponsors

. . .

  • Specialized Team for Research Data Management and Public History
  • Various research outputs (incl. books, papers, data stories, figures, and source code)

Research Data

 

Collecting and Managing Research Data

 

Public History with Research Data

 

Archiving Research Data for the Long Run

Long-Term Preservation

  • Institutional sustainability: External dependencies (Omeka at UniBern) may not be permanently funded
  • FAIR principles: Research data must be Findable, Accessible, Interoperable, and Reusable with proper metadata
  • Citability: Researchers need persistent identifiers (PIDs) (stable URLs or DOIs) to reference historical materials
  • Scalability: Current minimal computing approach (GitHub Pages) cannot handle large files

Where DaSCH Fits in: omeka2dsp

 

Challenges

  • Data model differences (Omeka vs DaSCH)
  • Metadata transformation and crosswalks
  • Automation of deposit and version control

 Data Model Differences

Omeka

 Data Model Omeka (simplified 😇)

Data Model DaSCH (simplified 😅)

 

Key Differences

  • Modeling approach:

    • DaSCH: Class hierarchy (ResourceDocument / Image …), explicit value classes (TextValue, ListValue).
    • Omeka S: Flat JSON-LD model (Item, Media, ItemSet), Dublin Core–centric.
  • Normalization & constraints:

    • DaSCH: Strict cardinalities and mandatory fields (hasTitle [1]).
    • Omeka S: More flexible, “validation” through templates.
  • Hierarchy representation:

    • DaSCH: Explicit Parent class and linkToParentObject.
    • Omeka S: Uses field ItemSet or Media to model relations.

Metadata Transformation

Data Validation

  • Custom Python scripts using pydantic for schema validation as Omeka does not enforce strict validation
  • Validation also helps identify data quality issues
  • Lots of manual cleaning required, all done in Omeka

Crosswalk Examples

Omeka Property DaSCH Property Notes
dcterms:title hasTitle Required in DaSCH (cardinality 1)
dcterms:identifier hasIdentifier Not by DSP used for stable references
dcterms:subject hasSubjectList Iconclass codes preserved
dcterms:temporal hasTemporalList Era mapping required
dcterms:creator hasCreatorList Multiple creators supported
dcterms:license hasLicenseList License URIs validated
Media → Item linkToParentObject Hierarchy explicitly modeled

Version Control and Updates

  • Challenge: DaSCH supports versioning, but requires careful planning
  • Strategy:
    • Initial deposit: Create new resources via REST API
    • Updates: Use PUT requests with resource IDs to create new versions
    • Identifier stability: ARK IDs remain constant across versions

API vs DSP-Tools

REST API Approach

  • Direct HTTP requests
  • Fine-grained control
  • Supports versioning
  • Complex error handling
  • Used for updates

DSP-Tools Approach

  • XML-based bulk import
  • Good for initial deposits
  • Less flexible for updates
  • Comprehensive validation
  • Used for large-scale ingestion

Our choice: Using dsp-tools for project setup and Rest API for ingestions/updates

omeka2dsp

Lessons Learned

  • Metadata crosswalks and data quality: Mapping Dublin Core to the DaSCH ontology required custom logic and validation. Omeka’s flexible schema led to inconsistent metadata, demanding extensive cleaning.
  • Identifiers and file handling: Synchronizing identifiers between Omeka and DaSCH proved complex. Large files (>100 MB) necessitated chunked uploads and tailored handling.
  • Workflow timing and validation: Determining the right moment to shift from active curation to archival mode was key. Early validation with pydantic schemas and subset testing prevented costly errors.
  • Documentation and reproducibility: Precise mapping documentation ensured consistency across transformations and supported reproducible workflows.
  • Architecture and infrastructure: Decoupling archival (DaSCH) from presentation (CollectionBuilder) enhanced flexibility. Lightweight public interfaces can coexist with robust preservation systems when APIs enable automation.

Key Takeaways

For the Community

  • Lightweight publishing can work with robust infrastructure
  • FAIR principles are achievable in practice
  • Decoupled architectures provide flexibility
  • Open-source tools enable customization

For DaSCH Users

  • Plan metadata transformation early
  • Use validation extensively
  • Consider hybrid API/DSP-Tools approach
  • Document mapping decisions thoroughly

Resources

Back to top