The W3C vocabulary layer works. DCAT for catalogs, Dublin Core for basic metadata, SKOS for taxonomies, SHACL for validation shapes. These specs are mature, well-documented, and interoperable. Canada’s open data portal runs on them. 46,000 datasets, publicly available CKAN schemas, real DCAT metadata.
The vocabulary is not the problem. The enforcement is.
What actually breaks
CKAN has a schema extension called scheming. You define fields, types, valid values, and scheming enforces them at the action layer, on both forms and API calls.
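The pattern is easy to sketch. This is not CKAN's actual code — the field names and the validate function are invented for illustration — but it captures the idea: a declared schema, checked on every write.

```python
# Toy sketch of scheming-style field enforcement: a schema declares fields,
# types, and allowed values; every create/update call is checked against it
# before anything is written. Illustrative only, not CKAN's implementation.

SCHEMA = {
    "title":  {"type": str, "required": True},
    "status": {"type": str, "required": True,
               "choices": {"active", "deprecated"}},
    "tags":   {"type": list, "required": False},
}

def validate(record: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, rules in schema.items():
        if field not in record:
            if rules.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
        elif "choices" in rules and value not in rules["choices"]:
            errors.append(f"{field}: {value!r} not in allowed values")
    return errors

print(validate({"title": "DNS zone data", "status": "active"}))  # []
print(validate({"status": "archived"}))  # missing title, bad status
```

Both the web form and the API route through the same check, which is why the enforcement holds within one system.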
The enforcement is real within one system. Between systems, the gap opens. A harvest job pulls 400 datasets from another portal overnight. The source and target maintain separate schema files with no standard mechanism to keep them aligned. Data that was valid where it came from may not be valid where it lands.
This is not specific to CKAN. This is what happens whenever data crosses a boundary between systems with different validation assumptions.
SHACL addresses this for RDF. You define shapes, shapes declare required properties and value constraints, a validator checks the graph. SHACL Advanced Features adds result annotations and expression constraints. These are real advances.
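SHACL's semantics are far richer than this, but the core shape-matching idea fits in a few lines. The triples, shape, and report format below are illustrative stand-ins, not a real vocabulary or the actual SHACL algorithm.

```python
# Illustrative sketch of shape-style validation: a "shape" declares required
# properties for nodes of a target class, and a validator walks the graph
# and reports violations. Mimics the idea of SHACL, not its semantics.

triples = [
    ("svc:dns1", "rdf:type", "ex:Service"),
    ("svc:dns1", "ex:name", "primary DNS"),
    ("svc:web1", "rdf:type", "ex:Service"),
    ("svc:web1", "ex:name", "intranet"),
    ("svc:web1", "ex:endpoint", "https://intranet.example"),
]

service_shape = {
    "targetClass": "ex:Service",
    "required": ["ex:name", "ex:endpoint"],  # every service needs both
}

def validate(graph, shape):
    """Return (conforms, report) in the spirit of a validation report."""
    report = []
    targets = [s for s, p, o in graph
               if p == "rdf:type" and o == shape["targetClass"]]
    for node in targets:
        props = {p for s, p, o in graph if s == node}
        for required in shape["required"]:
            if required not in props:
                report.append((node, f"missing {required}"))
    return (not report, report)

conforms, report = validate(triples, service_shape)
print(conforms)  # False: svc:dns1 has no ex:endpoint
print(report)
```

Note what the validator takes as input: a graph that already exists. That ordering is the point of the next paragraph.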
But the validation still runs after the data exists. In the typical deployment, SHACL describes compliance rather than enabling it. Some triplestores improve on this by wiring SHACL into the commit path, rejecting invalid data before it enters the store. That’s real enforcement, but it’s tied to one store’s configuration. Move the data to another store without the same shapes, and the guarantee disappears.
The constraints don’t travel with the data. Enforcement that only holds while data stays in one system hasn’t solved the boundary problem. It’s deferred it.
Composition, not inheritance
Kurt Cagle recently wrote that SKOS's inScheme is a faceting relationship, and that composing orthogonal schemes is generally a better model than property inheritance.
Hierarchies dominate semantic modeling partly because the tooling encourages them. Protégé gives you a class tree. You put things in the tree. The tree becomes the model.
The alternative is flat type membership. A resource declares which schemes it belongs to. Each scheme evaluates independently. A DNS server that runs in an LXC container on critical infrastructure isn’t a subclass of three parent classes. It has three independent type memberships: what it does, how it runs, how important it is. Each type can declare its own constraints without knowing about the others.
SKOS already provides this if you treat concept schemes as independent classification facets rather than a single hierarchy. SHACL already supports the same pattern. Shapes don’t need to form a hierarchy. A resource can match multiple shapes, and each shape validates independently. Flat membership, independent evaluation.
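The DNS-server example can be sketched directly. The scheme names and their checks are invented for illustration; what matters is that each scheme evaluates on its own, with no inheritance between them.

```python
# Sketch of flat type membership: a resource lists the schemes it belongs
# to, and each scheme evaluates its own constraints without knowing about
# the others. Scheme names and rules are illustrative.

schemes = {
    "function:dns":     lambda r: "zones" in r,
    "runtime:lxc":      lambda r: "container_id" in r,
    "criticality:high": lambda r: r.get("on_call_contact") is not None,
}

resource = {
    "id": "srv-042",
    "memberOf": ["function:dns", "runtime:lxc", "criticality:high"],
    "zones": ["example.org"],
    "container_id": "lxc-17",
    "on_call_contact": None,  # violates the criticality facet only
}

def evaluate(resource):
    """Run each declared scheme independently: one verdict per facet."""
    return {scheme: schemes[scheme](resource)
            for scheme in resource["memberOf"]}

print(evaluate(resource))
# dns passes, lxc passes, criticality fails -- and each facet can change
# its rules without touching the other two
```

Adding a fourth facet means adding one entry to the dictionary, not restructuring a class tree.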
Where the enforcement happens
In most RDF workflows, you author triples, load them into a store, then run SHACL validation. The validator produces a report. You read the report. You fix the data. You re-validate. The feedback loop can be minutes or hours. If you’re harvesting 400 datasets from a federated portal at 2 AM, days.
The alternative is to make the constraints structural. If the schema says a service requires an endpoint, data without an endpoint fails evaluation. Not at import time. At authoring time. The invalid state can’t be represented.
Two resources claim the same unique identifier? Evaluation fails. A dependency references a resource that doesn’t exist in the graph? Evaluation fails. These aren’t runtime checks. They’re properties of the data format.
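A minimal sketch of what "can't be represented" means in practice, assuming a hypothetical authoring API: the graph rejects the write itself, so no later validation pass is needed for these invariants.

```python
# Sketch of authoring-time enforcement: the graph refuses to represent
# invalid states. Duplicate identifiers, missing endpoints, and dangling
# dependency references fail at insertion. Illustrative API, not a real one.

class Graph:
    def __init__(self):
        self._resources = {}

    def add(self, rid, *, endpoint=None, depends_on=()):
        if rid in self._resources:
            raise ValueError(f"duplicate identifier: {rid}")
        if endpoint is None:
            raise ValueError(f"{rid}: a service requires an endpoint")
        for dep in depends_on:
            if dep not in self._resources:
                raise ValueError(f"{rid}: unknown dependency {dep}")
        self._resources[rid] = {"endpoint": endpoint,
                                "depends_on": tuple(depends_on)}

g = Graph()
g.add("db1", endpoint="postgres://db1")
g.add("api1", endpoint="https://api.example", depends_on=["db1"])

try:
    g.add("api1", endpoint="https://api.example")  # duplicate id
except ValueError as e:
    print(e)

try:
    g.add("web1", endpoint="https://web.example",
          depends_on=["cache9"])                   # dangling reference
except ValueError as e:
    print(e)
```

There is no moment at which the graph holds two resources with the same identifier; the error surfaces while the author is still at the keyboard.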
In principle, both approaches can produce the same W3C outputs: SHACL validation reports, SKOS concept schemes, DCAT catalogs, PROV-O provenance traces. The W3C vocabularies don’t care how you populated them. The difference is whether you generate those artifacts from data that’s already been validated at authoring time, or from data that might still contain errors.
The gap
The vocabulary layer and the data layer are both well-served by existing specs. The gap is the enforcement layer between them: how do you ensure that data entering a system actually conforms to the vocabulary it claims to use?
SHACL validation after the fact is one answer. Structural constraints at authoring time is another. A third framing is emerging from the W3C Context Graphs Community Group: what happens at the boundary itself? When a graph moves from one system to another, something has to decide whether the receiving context’s constraints are met. At the boundary, three things can happen: proceed, request clarification, or reject. That’s the federation problem stated as a decision procedure.
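The three-outcome decision can be sketched as a small procedure. The outcome names follow the text; the record fields and the admit function are invented for illustration and carry no claim about what the Community Group will specify.

```python
# Sketch of a boundary decision: when data crosses into a receiving context,
# the receiver checks its own constraints and returns one of three outcomes.
# Field names and the decision rule are illustrative assumptions.

from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"
    CLARIFY = "request clarification"
    REJECT = "reject"

def admit(record, required, recognized):
    """Decide whether an incoming record meets the receiving context's terms."""
    missing = required - record.keys()
    unknown = record.keys() - recognized
    if missing:
        return Decision.REJECT, sorted(missing)   # hard constraint unmet
    if unknown:
        return Decision.CLARIFY, sorted(unknown)  # unrecognized fields: ask
    return Decision.PROCEED, []

required = {"id", "license"}
recognized = {"id", "license", "title", "issued"}

print(admit({"id": "d1", "license": "CC-BY", "title": "x"},
            required, recognized))                      # proceed
print(admit({"id": "d2"}, required, recognized))        # reject
print(admit({"id": "d3", "license": "CC-BY", "extra": "?"},
            required, recognized))                      # clarify
```

The middle outcome is what distinguishes this framing from plain validation: the boundary can negotiate instead of only passing or failing.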
These are complementary. Post-hoc validation catches what slipped through. Structural constraints prevent what they can. Boundary protocols address what neither handles alone: the moment data leaves one system’s assumptions and enters another’s.
The vocabulary was never the hard part. The hard part is making it stick.