MetaBelgica data model specification

Unofficial Draft

More details about this document
Latest published version:
https://www.w3.org/metabelgica-data-model/
Latest editor's draft:
https://w3id.org/metabelgica/data-model/spec/
Editor:
Sven Lieber (Royal Library of Belgium (KBR))
Author:
Sven Lieber (Royal Library of Belgium (KBR))
This Version
https://w3id.org/metabelgica/data-model/spec/20250715/
Previous Version
https://w3id.org/metabelgica/data-model/spec/20250711/

Abstract

This document defines the data model of MetaBelgica. It specifies how authority data can be represented in RDF and the MetaBelgica Wikibase.

This document was created as part of the BELSPO-funded research infrastructure project MetaBelgica.

1. Introduction

The goal of MetaBelgica is to provide high-quality reference data for Belgian cultural heritage. The data is collaboratively maintained in a Wikibase instance, initially by four Federal Scientific Institutes (FSIs). This also sets the scope: reference data about authorities linked to Federal Collections. For semantic interoperability we use the Resource Description Framework (RDF) to define a common data model. Additionally, we provide a Wikibase data model as well as a mapping between the RDF data model and the Wikibase data model.

This document specifies overall considerations about the data model. Detailed documentation about the RDF vocabulary is available via the tool Widoco.

1.1 Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, MUST NOT, OPTIONAL, RECOMMENDED, REQUIRED, SHALL, SHALL NOT, SHOULD, and SHOULD NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Conformance requirements are expressed with a combination of descriptive assertions and [RFC2119] terminology.

2. Terminology

Throughout the document, the following terminology is used.

alternate name
The name of an authority (such as a person or an organization) may be spelled in several ways. Next to the preferred-name spelling, alternate names record other possible spellings. This makes them useful for information retrieval.
authority
This term from information science, often used in the library domain, denotes a record within a controlled vocabulary. Specifically, it covers "authorized forms of names, subjects and subject subdivisions" [MARC21Aut]. Historically, an authority provided a single way to spell something and hence served as a form of identifier.
data-subject
According to the GDPR: ...
entity
Todo: definition of an entity
ISNI
The International Standard Name Identifier (ISO 27729) is used to uniquely identify persons and organisations involved in the creation, production, management and distribution of cultural content. ISNI is a unique and permanent 16-digit number.
OWA
The Open World Assumption (OWA) specifies that ...
property
Todo: definition of a property
property instance
The usage of a property on an entity, i.e. the value of the property.
preferred-name
Todo: definition of a preferred name authority

3. Requirements

MetaBelgica

Collaboratively maintained

Public interface

4. Authority entities and properties

Within MetaBelgica we maintain authorities, represented as entities that are interlinked. Generally we distinguish the following three types of entities; in addition, the concept of time is represented as a property.

  1. Persons
  2. Organizations (Corporate bodies)
  3. Locations

In the following we elaborate on the different entities and properties.

Entities are interlinked: persons are linked to locations to indicate their place of birth and place of death. Similarly, organizations will be linked to locations to indicate where an organization is established.

Each of these entities has a number of properties such as related dates, but most importantly external identifiers that link to other reliable databases such as ISNI.

This data model follows the Open World Assumption (OWA): a missing property does not mean that the value does not exist, only that it is not known. For example, when we do not know whether a person has died, we do not record "unknown" or a similar placeholder; we leave the property empty.
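The following is a minimal sketch of how a consumer of the data might respect the OWA. The Person class and its field names are hypothetical and are not part of the MetaBelgica vocabulary; the point is only that an absent death date is read as "not known" and never as "alive".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    # Hypothetical record shape, used for illustration only.
    preferred_name: str
    birth_date: Optional[str] = None  # None means "not recorded"
    death_date: Optional[str] = None  # None means "not recorded"


def is_known_dead(person: Person) -> Optional[bool]:
    """Return True if a death date is recorded, otherwise None (unknown).

    Under the OWA a missing death date never yields False: the absence
    of a value says nothing about whether the person has died.
    """
    return True if person.death_date is not None else None
```

A caller therefore has to handle two outcomes, True and None, and cannot conclude from None that a person is still alive.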

4.1 Persons

Persons are the creators of, or other contributors to, cultural content. More specifically, a person within MetaBelgica can also be a public identity such as a pseudonym.

Our goal is to provide high-quality reference data, which requires the disambiguation of persons. Hence we define a number of properties that help to distinguish persons from one another. In other words, two persons with the same name can still be distinguished by their other properties.

4.1.1 Preferred name spelling

Historically, authority control used a single uniform name to uniquely identify an authority record, so the name served as a sort of identifier. Instead of a name, we use a unique and persistent identifier to refer to persons. Nevertheless, we still use the preferred name spelling as the preferred label of a person.

  • A person MUST have at least one preferred name.
  • A person MAY have one preferred name per language. This supports, for example, multilingual interfaces for consulting the record.
  • A preferred name MUST consist of Unicode characters and MUST NOT be empty. This supports multilingual use cases in which different alphabets need to be supported.
  • A preferred name MAY be annotated with a language.
  • The language of a preferred name or alternate name MUST be annotated with language tags according to [RFC5646].
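The constraints above could be checked along the following lines. This is a sketch under stated assumptions: names are modelled as hypothetical (text, language) pairs, "one preferred name per language" is read as "at most one", and the regular expression only tests a simplified well-formedness pattern rather than the full [RFC5646] grammar or the language subtag registry.

```python
import re
from typing import List, Optional, Tuple

# Simplified shape check: a 2-8 letter primary subtag followed by
# hyphen-separated alphanumeric subtags. Assumption: this is NOT a
# complete RFC 5646 validator.
LANG_TAG = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")

Name = Tuple[str, Optional[str]]  # (text, language tag or None)


def check_preferred_names(names: List[Name]) -> List[str]:
    """Return a list of constraint violations (empty if none were found)."""
    errors = []
    if not names:
        errors.append("a person needs at least one preferred name")
    seen = set()
    for text, lang in names:
        if not text:
            errors.append("preferred name is empty")
        if lang is None:
            continue  # the language annotation is optional
        if not LANG_TAG.fullmatch(lang):
            errors.append(f"malformed language tag: {lang!r}")
        key = lang.lower()  # language tags are compared case-insensitively
        if key in seen:
            errors.append(f"more than one preferred name for language {lang!r}")
        seen.add(key)
    return errors
```

For example, check_preferred_names([("Hergé", "fr"), ("Hergé", "nl")]) reports no violations, whereas two names that are both tagged "fr" produce one.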

4.1.2 Alternate names

Names might be spelled differently in other languages, and nicknames or aliases might be recorded. As mentioned above, historically a single uniform name was chosen as preferred name spelling. Other spellings were recorded as well to improve information retrieval.

  • A person MAY have zero or more alternate names.
  • An alternate name MUST consist of Unicode characters and MUST NOT be empty. This supports multilingual use cases in which different alphabets need to be supported.

4.1.3 Birth date

This is the recorded date of birth of a person according to the Gregorian calendar.

Note: ISO8601:2019 extension for uncertain dates is used

We allow dates according to the ISO 8601:2019 extension [ISO8601:2019-2], which is based on the Extended Date/Time Format (EDTF) from the Library of Congress. This makes it possible to record uncertain dates in various ways.
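To illustrate the kind of values this allows, the sketch below recognises a small subset of the notation: a year with optional month and day, "X" for unspecified digits, and a trailing "?" (uncertain), "~" (approximate) or "%" (both). The function name and the choice of subset are assumptions made for illustration; the sketch does not check value ranges such as months 01 to 12 and is not a complete parser for [ISO8601:2019-2].

```python
import re
from typing import Dict, Optional

# Small illustrative subset: YYYY, YYYY-MM, YYYY-MM-DD, unspecified
# digits written as X, and one optional trailing qualifier.
EDTF_SUBSET = re.compile(
    r"(?P<year>\d{4}|\d{3}X|\d{2}XX)"
    r"(?:-(?P<month>\d{2}|XX)(?:-(?P<day>\d{2}|XX))?)?"
    r"(?P<qualifier>[?~%])?"
)

QUALIFIERS = {
    "?": "uncertain",
    "~": "approximate",
    "%": "uncertain and approximate",
}


def parse_date(value: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a date string into its parts, or return None if it is
    outside the supported subset."""
    match = EDTF_SUBSET.fullmatch(value)
    if match is None:
        return None
    parts = match.groupdict()
    # Replace the qualifier symbol by a readable label (None if absent).
    parts["qualifier"] = QUALIFIERS.get(parts["qualifier"])
    return parts
```

With this sketch, "1984?" is read as the year 1984 marked uncertain, "19XX" as a year of which only the century is known, and free text such as "June 1984" is rejected.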

4.1.4 Death date

This is the recorded date of death of a person according to the Gregorian calendar.

4.1.5 Occupation

This is the occupation of a person according to a controlled vocabulary.

  • A person MAY have zero or more occupations.

4.1.6 Birth place

This property links a person to an entity of type Location (see definition below).

4.1.7 Death place

This property links a person to an entity of type Location (see definition below).

4.2 Organizations

Todo

4.3 Locations

Todo

4.4 Time

This concept is not an entity on its own, but modelled as a property.

5. Administrative entities and properties

Within MetaBelgica we collaboratively maintain authority data, which requires a data governance strategy. To support data governance, we define additional administrative entities and properties. The Mappings section discusses how these entities and properties can be used in the Wikibase both to indicate data governance aspects, such as what can be shown publicly, and to enforce them.

The Data Privacy Vocabulary [DPV] provides an extensive data model to cover many legal and practical aspects of data processing. For simplicity we only specify the entities and properties needed for our use case, but we aim to reuse existing [DPV] terms as much as possible and to align our own terms with them to ensure compatibility.

Todo: mention DPV concepts

5.1 Public visibility

One goal of MetaBelgica is to provide its reference data to the public. However, certain entities or properties should not be made openly available.

We indicate whether an entity or property should be shown publicly with a visibility property, and we indicate the reason with a legal ground property.

5.1.1 Visibility property

We employ the property visibility to indicate the envisioned target audience. In MetaBelgica we distinguish between the following three use cases:

  1. internal: An entity/property is only meant to be used internally. One example is other administrative properties.
  2. shared: An entity/property can be shared in a research context and in the frame of a data sharing agreement with a trusted partner. One example is sensitive data such as gender information.
  3. public: An entity/property is meant to be displayed publicly. One example is the name in a person record.

Note: Visibility is merely an annotation

Indicating a visibility is merely an annotation: you (or the software using the data) still need to read this annotation and act or filter accordingly.

The visibility property MUST be applicable to a whole entity, a whole property, or a specific property instance. Entities annotated with internal or shared MUST NOT be shown publicly. Entities annotated with internal MUST NOT be shared with third parties.

An entity may have several properties. If the entity itself has the visibility public but one of its properties has the visibility internal or shared, then the values of that property MUST NOT be shown publicly.

For example, the property gender has the visibility shared. Hence all person entities with visibility public may be displayed publicly, but the value of their gender property is not shown.
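Because the annotation has to be acted upon by the software that displays the data, a public interface might filter along the following lines. This is only a sketch: the function name and the dictionary-based data shapes are hypothetical, and treating a property without any annotation as internal is an assumption that this specification does not state.

```python
from typing import Dict

PUBLIC, SHARED, INTERNAL = "public", "shared", "internal"


def public_view(entity_visibility: str,
                values: Dict[str, str],
                property_visibility: Dict[str, str]) -> Dict[str, str]:
    """Return only the property values that may be shown publicly."""
    # Entities annotated internal or shared are not shown at all.
    if entity_visibility != PUBLIC:
        return {}
    return {
        prop: value
        for prop, value in values.items()
        # Assumption: a property without an annotation is hidden.
        if property_visibility.get(prop, INTERNAL) == PUBLIC
    }
```

Applied to the gender example, a public person entity with a public name and a shared gender yields a public view that contains only the name.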

5.1.3 Data minimization

With respect to GDPR's data minimization, certain data should be adapted. Todo: elaborate

  • The default annotation yearOnly MUST be applied to the year of birth.

5.2 Data quality

6. Mappings

For interoperability we provide a number of mappings to different RDF vocabularies as well as to the Wikibase data model we use. We intend to provide the data of the MetaBelgica platform in different data formats, based on the following mappings.

6.1 Wikibase Data Model

Todo: Wikibase data model

6.1.1 Flexible data governance model

The conceptual MetaBelgica data model, implemented as Wikibase entities and properties, can be used to support various data governance use cases. It can be used in a consistent way and relies only on the properties visibility, applies to and legal ground, optionally with start date and end date.

The following use cases are supported which means that even fine-grained control is possible:

  1. Display entity publicly
  2. Display property for entities publicly
  3. Do not display entity publicly
  4. Do not display property on entities publicly
  5. Opt-in to display entity publicly
  6. Opt-in to display property for specific entity publicly
  7. Opt-out to display entity publicly
  8. Opt-out to display property for specific entity publicly

Additionally, the Wikibase data model can be used to annotate the provenance of a visibility with a legal ground property used as a qualifier. This documents why something is public or not, and with start date and end date qualifiers a whole provenance record can be built. This is useful to indicate that an opt-in only applied from some point in time, or to indicate from when an opt-out was requested.

Note: Why is visibility a claim and not a qualifier

Placing the visibility property as a qualifier directly on the property that is annotated seems intuitive, but then the additional legal ground would qualify the property and not the visibility. Additionally, the history of data governance with start and end dates could not be modeled. When the visibility property is used as a claim, however, additional metadata such as the legal ground and possible start and end dates can be consistently modeled with qualifiers.
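The effect of this design can be sketched with a flat record per visibility claim. The class, the field names and the example values below are hypothetical; in particular, how visibility, applies to, legal ground, start date and end date map onto concrete Wikibase claims and qualifiers is an assumption, not the MetaBelgica implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class VisibilityClaim:
    visibility: str                   # "public", "shared" or "internal"
    legal_ground: str                 # why this visibility applies
    applies_to: Optional[str] = None  # a property name; None = whole entity
    start: Optional[date] = None      # None = no recorded start
    end: Optional[date] = None        # None = still in force


def visibility_on(claims: List[VisibilityClaim],
                  applies_to: Optional[str],
                  day: date) -> Optional[str]:
    """Return the visibility in force for the given target on the given
    day, or None if no claim covers it."""
    for claim in claims:
        if claim.applies_to != applies_to:
            continue
        if claim.start is not None and day < claim.start:
            continue
        if claim.end is not None and day > claim.end:
            continue
        return claim.visibility
    return None


# Hypothetical history: an entity that was public until an opt-out.
history = [
    VisibilityClaim("public", "legal deposit",
                    start=date(2020, 1, 1), end=date(2024, 5, 31)),
    VisibilityClaim("internal", "opt-out request",
                    start=date(2024, 6, 1)),
]
```

Because each claim carries its own legal ground and dates, the list as a whole forms the provenance record: visibility_on(history, None, date(2023, 1, 1)) gives "public", while the same query for a day in 2025 gives "internal".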

6.2 Schema.org

Todo: Schema.org mapping

mb:legalDeposit a dpv:LegalBasis ; dpv:hasJurisdiction yy ; dpv:hasLaw xx .

6.3 SKOS

Todo: SKOS mapping

6.4 CIDOC-CRM

Todo: CIDOC-CRM mapping

6.5 OSLO

Todo: OSLO mapping

A. References

A.1 Normative references

[DPV]
Data Privacy Vocabulary (DPV), version 2.1. W3C Data Privacy Vocabularies and Controls Community Group (DPVCG). URL: https://w3c.github.io/dpv/2.1/dpv/
[ISO8601:2019-1]
Date and time — Representations for information interchange — Part 1: Basic rules. International Organization for Standardization (ISO). 2019. ISO 8601-1:2019. URL: https://www.iso.org/standard/70907.html
[ISO8601:2019-2]
Date and time — Representations for information interchange — Part 2: Extensions. International Organization for Standardization (ISO). 2019. ISO 8601-2:2019. URL: https://www.iso.org/standard/70908.html
[MARC21Aut]
MARC21 Format for Authority Data. Library of Congress (United States). URL: https://www.loc.gov/marc/authority/
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc2119
[RFC5646]
Tags for Identifying Languages. A. Phillips, Ed.; M. Davis, Ed. IETF. September 2009. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc5646
[RFC8174]
Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba. IETF. May 2017. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc8174