Dublin Core Metadata Enhancer

This project provides tools and workflows to enhance Dublin Core metadata records with reproducible enrichment processes. It aims to improve the quality and completeness of Dublin Core metadata through automated enrichment pipelines.

Purpose

The Dublin Core Metadata Enhancer enables:

  • Automated Enhancement: Systematic improvement of Dublin Core metadata records
  • Reproducible Workflows: Documented and repeatable enhancement processes
  • Quality Assurance: Validation and verification of enhanced metadata
  • Open Science: Transparent and shareable enhancement methodologies

Scope

This repository contains the source code, documentation, and examples for enhancing Dublin Core metadata records. The enhancement workflows can be applied to various types of digital resources and collections.

For detailed documentation and usage examples, please see the full documentation in this repository.

dublin-core-metadata-enhancer

Enhance Dublin Core records with reproducible enrichment workflows. The data in this repository is openly available to everyone and is intended to support reproducible research.


Repository Structure

The structure of this repository follows the Advanced Structure for Data Analysis of The Turing Way and is organized as follows:

  • analysis/: scripts and notebooks used to analyze the data
  • assets/: images, logos, etc. used in the README and other documentation
  • build/: scripts and notebooks used to build the data
  • data/: data files
  • documentation/: documentation for the data and the repository
  • project-management/: project management documents (e.g., meeting notes, project plans, etc.)
  • src/: source code for the data (e.g., scripts used to collect or process the data)
  • test/: tests for the data and source code
  • report.md: a report describing the analysis of the data

Data Description

This repository contains Dublin Core metadata enhancement tools and workflows designed to improve the quality and completeness of Dublin Core metadata records. The data includes:

  • Enhancement Workflows: Reproducible processes for enriching Dublin Core metadata
  • Validation Tools: Scripts and utilities for quality assurance of enhanced metadata
  • Documentation: Comprehensive guides and examples for using the enhancement pipelines
  • Test Data: Sample Dublin Core records for testing and validation purposes

All enhancement workflows are documented and version-controlled to ensure reproducibility. The tools support various Dublin Core metadata formats and can be adapted for different types of digital collections.

Data models and field mappings are documented in the documentation/ directory. All code is released under the AGPL-3.0 license, and data products are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Use

Metadata Enhancement Pipeline

This repository includes an automated metadata enhancement pipeline that generates WCAG 2.2-compliant alternative text for images using OpenAI’s GPT-5 model.

Prerequisites

  1. Python 3.8 or higher
  2. OpenAI API key

Installation

# Install uv (modern Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Python dependencies
uv sync

# Set your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key-here"

Usage

# Enhance metadata from the default source
uv run python enhance_metadata.py

# Run enhancement on remote metadata with a custom URL and output file
uv run python enhance_metadata.py --metadata-url "https://example.com/metadata.json" --output "enhanced_metadata.json"

# Run enhancement on local metadata file
uv run python enhance_metadata.py --metadata-url "data/local_metadata.json" --output "enhanced_local.json"

# Use API key from command line
uv run python enhance_metadata.py --api-key "your-api-key"

# Development commands
uv run pytest                    # Run tests
uvx ty check src/                # Type checking
uv run ruff format .             # Format code with ruff
uv run ruff check .              # Lint code with ruff
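The three flags used above (--metadata-url, --output, --api-key) could be parsed along the following lines. This is a minimal sketch, not the actual contents of enhance_metadata.py: the default output name, the helper names build_parser and resolve_api_key, and the rule that the flag takes precedence over the OPENAI_API_KEY environment variable are all assumptions made for illustration.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="enhance_metadata.py")
    # Accepts either a URL or a local file path, as in the examples above.
    parser.add_argument("--metadata-url", default=None)
    # Hypothetical default; the real default output name is not documented here.
    parser.add_argument("--output", default="enhanced_metadata.json")
    parser.add_argument("--api-key", default=None)
    return parser


def resolve_api_key(args: argparse.Namespace, env=os.environ):
    """Assumed precedence: the --api-key flag first, then OPENAI_API_KEY."""
    return args.api_key or env.get("OPENAI_API_KEY")
```

Passing the key through the environment variable keeps it out of the shell history, which is why the sketch treats the flag as an optional override.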

How it works

The enhancement pipeline:

  1. Loads Dublin Core metadata from a JSON source (local file or URL)
  2. Downloads thumbnail images (object_thumb field); the images are pre-optimized by Omeka
  3. Analyzes images using GPT-5 with contextual metadata
  4. Generates WCAG-compliant alternative text in German
  5. Outputs enhanced metadata as JSON
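The steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes the metadata is a JSON list of objects carrying objectid and object_thumb fields, and it takes the image analysis step as an injectable describe function (hypothetical) so that the flow can be tried without an API key.

```python
import json
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse
from urllib.request import urlopen


def load_metadata(source: str) -> list:
    """Load JSON metadata from an http(s) URL or a local file path."""
    if urlparse(source).scheme in ("http", "https"):
        with urlopen(source) as response:
            return json.load(response)
    return json.loads(Path(source).read_text(encoding="utf-8"))


def enhance(records: list, describe: Callable[[dict], dict]) -> list:
    """Apply `describe` to every record that has a thumbnail.

    `describe` stands in for the model call; it receives the full record
    (so it can use contextual metadata) and returns a dict that may
    contain "alt_text" and "longdesc".
    """
    enhanced = []
    for record in records:
        if not record.get("object_thumb"):
            continue  # nothing to describe without an image
        result = describe(record)
        enhanced.append(
            {
                "objectid": record.get("objectid"),
                "alt_text": result.get("alt_text", ""),
                "longdesc": result.get("longdesc", ""),
            }
        )
    return enhanced
```

Keeping the model call behind a plain function argument makes it possible to test the loading and output logic with a stub.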

The AI prompt is designed to:

  • Identify image types (informative, complex diagrams/maps, or text images)
  • Generate appropriate alt text (limited to 120-200 characters)
  • Create long descriptions for complex content when needed
  • Follow accessibility best practices
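The constraints listed above lend themselves to a simple check on each generated result. The sketch below is hypothetical: the type labels, the use of 200 characters as a hard limit, and the rule that a complex image always needs a long description are assumptions chosen for illustration, not rules taken from the project's prompt.

```python
MAX_ALT_CHARS = 200  # upper end of the 120-200 range; assumed hard limit

# Hypothetical labels for the three image types named above.
IMAGE_TYPES = ("informative", "complex", "text")


def check_result(image_type: str, alt_text: str, longdesc: str = "") -> list:
    """Return a list of problems; an empty list means the result passes."""
    problems = []
    if image_type not in IMAGE_TYPES:
        problems.append(f"unknown image type: {image_type!r}")
    if not alt_text.strip():
        problems.append("alt text is empty")
    elif len(alt_text) > MAX_ALT_CHARS:
        problems.append(
            f"alt text has {len(alt_text)} characters, limit is {MAX_ALT_CHARS}"
        )
    # Assumption: complex diagrams and maps always get a long description.
    if image_type == "complex" and not longdesc.strip():
        problems.append("complex image has no long description")
    return problems
```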

Output Format

{
    "objectid": "example001",
    "alt_text": "Karte von Basel als befestigte Grenzstadt, umgeben von Breisgau und Sundgau.",
    "longdesc": ""
}
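A record of this shape can be modelled and serialized as shown below. This is a sketch under assumptions: the class name EnhancedRecord and the helper to_json are hypothetical, and it assumes the output file holds a JSON list of such objects. Because the alt text is German, the sketch passes ensure_ascii=False so that umlauts are written as-is rather than as \u escapes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EnhancedRecord:
    """One output entry with the three fields shown above."""

    objectid: str
    alt_text: str
    longdesc: str = ""  # empty when no long description is needed


def to_json(records) -> str:
    """Serialize records as a JSON list, keeping non-ASCII characters readable."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=4)
```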

Testing

# Run tests
python -m unittest test.test_metadata_enhancer

Citation and Data Access

This data is openly available to everyone and can be used for any research or educational purpose. If you use this data in your research, please cite it as specified in CITATION.cff. Additional citation formats are available through Zenodo.

Zenodo provides an API (REST & OAI-PMH) to access the data. For example, the following command returns the metadata for the most recent version of the data:

curl -i https://zenodo.org/api/records/ZENODO_RECORD
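The same request can be made from Python with the standard library. This is a minimal sketch: the function names record_url and fetch_record are hypothetical, and ZENODO_RECORD remains a placeholder for the actual record identifier, exactly as in the curl command.

```python
import json
from urllib.request import Request, urlopen

ZENODO_API = "https://zenodo.org/api/records"


def record_url(record_id: str) -> str:
    """Build the REST URL for a single Zenodo record."""
    return f"{ZENODO_API}/{record_id}"


def fetch_record(record_id: str, timeout: float = 30.0) -> dict:
    """Fetch the record and parse the JSON response body."""
    request = Request(record_url(record_id), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_record("ZENODO_RECORD") only works once the placeholder is replaced with a real record identifier.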

Support

This project is maintained by @Stadt-Geschichte-Basel. Please understand that we can’t provide individual support via email. We also believe that help is much more valuable when it’s shared publicly, so more people can benefit from it.

Type | Platforms
🚨 Bug Reports | GitHub Issue Tracker
📊 Report bad data | GitHub Issue Tracker
📚 Docs Issue | GitHub Issue Tracker
🎁 Feature Requests | GitHub Issue Tracker
🛡 Report a security vulnerability | See SECURITY.md
💬 General Questions | GitHub Discussions

Roadmap

No changes are currently planned.

Contributing

All contributions to this repository are welcome! If you find errors or problems with the data, or if you want to add new data or features, please open an issue or pull request. Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

Versioning

We use SemVer for versioning. The available versions are listed in the tags on this repository.

Authors and acknowledgment

  • Moritz Mähr - Initial work - maehr

See also the list of contributors who contributed to this project.

License

The data in this repository is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License - see the LICENSE-CCBY file for details. By using this data, you agree to give appropriate credit to the original author(s) and to indicate if any modifications have been made.

The code in this repository is released under the GNU Affero General Public License v3.0 - see the LICENSE-AGPL file for details. By using this code, you agree to make any modifications available under the same license.
