Forty-six thousand datasets. Real DCAT metadata. Real enforcement. And a gap I couldn’t close from inside the system.
My name is on open.canada.ca. I worked on the platform that serves Canada’s federal open data portal. CKAN instance, scheming extension, DCAT metadata, bilingual catalogue, the whole stack. When I say the vocabulary layer works, I’m not citing a spec. I’m describing something I built and operated.
The vocabulary works. The enforcement doesn’t travel.
What the portal gets right
CKAN’s scheming extension is real enforcement. You define fields, types, valid values, and the system enforces them on both forms and API calls. If a dataset is missing a required field, the action layer rejects it. No report. No warning. Rejection.
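The behaviour is easy to sketch. This is not the actual ckanext-scheming API, just a minimal illustration of the pattern: a missing required field causes outright rejection, not a warning in a report.

```python
# Illustrative sketch of scheming-style enforcement (not the real
# ckanext-scheming API): required fields are checked at the action
# layer, and a violation rejects the whole dataset.

SCHEMA = {
    "title":   {"required": True},
    "notes":   {"required": True},
    "subject": {"required": False},
}

class ValidationError(Exception):
    pass

def validate(dataset: dict) -> dict:
    errors = {
        field: "Missing value"
        for field, rules in SCHEMA.items()
        if rules["required"] and not dataset.get(field)
    }
    if errors:
        raise ValidationError(errors)  # rejection, no partial save
    return dataset

validate({"title": "Census 2021", "notes": "Population counts"})  # accepted
try:
    validate({"title": "Census 2021"})  # missing required "notes"
except ValidationError as err:
    print(err)
```

The point is where the check runs: at write time, inside the system that owns the schema.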
Within one portal instance, this is solid. The schema file is the single source of truth. Every dataset conforms because the system won’t accept one that doesn’t. Forty-six thousand datasets, all structurally valid against the declared schema.
DCAT metadata is generated correctly. Dublin Core fields are populated. SKOS-like classifications organize datasets by subject. The W3C vocabulary usage is not decorative. It’s functional. Other portals can harvest from it and get conformant metadata.
That’s where it breaks.
The harvest boundary
Canada’s open data portal federates datasets from provincial and municipal portals. A harvest job runs overnight, pulls metadata from source portals, and loads it into the federal catalogue. The source portal has its own scheming configuration, its own field definitions, its own validation rules.
There is no standard mechanism to align schemas across portals.
The source says a field is optional. The target says it’s required. The source uses one controlled vocabulary for subject classification. The target uses another. The source validates dates in one format. The target expects ISO 8601. Each portal is internally consistent. Between them, the gap opens.
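The mismatch is concrete enough to show in a few lines. The rules below are invented for illustration, but the shape of the failure is exactly this: a record that is valid where it was created and invalid where it lands.

```python
from datetime import datetime

# Illustrative: each portal is internally consistent, but their
# assumptions differ. The field rules here are hypothetical.

def source_ok(record: dict) -> bool:
    # Source portal: "issued" is optional, dates are DD/MM/YYYY
    if "issued" in record:
        datetime.strptime(record["issued"], "%d/%m/%Y")
    return True

def target_ok(record: dict) -> bool:
    # Target portal: "issued" is required, dates are ISO 8601
    try:
        datetime.strptime(record["issued"], "%Y-%m-%d")
        return True
    except (KeyError, ValueError):
        return False

record = {"title": "Transit stops", "issued": "03/11/2021"}
print(source_ok(record))  # True  — valid where it was created
print(target_ok(record))  # False — invalid where it arrives
```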
This is not a CKAN problem. This is a boundary problem. It happens whenever data crosses from one system’s validation assumptions into another’s. CKAN just makes it concrete because the harvest pipeline is explicit and the schema files are readable.
Why SHACL doesn’t close it
The natural response is SHACL. Define shapes for the receiving portal, validate incoming harvested data against them, reject what doesn’t conform. This is exactly what SHACL is for.
In practice, SHACL validation runs after the harvest. The data already crossed the boundary. The report tells you what’s wrong, and someone reads the report the next business day. For 400 datasets harvested at 2 AM, the feedback loop is measured in days.
The SHACL Advanced Features specification improves on this. Result annotations, expression constraints, better composition. But the timing problem remains. The validator runs after the data exists. The constraints don’t travel with the data from source to target. The source portal has no mechanism to know what shapes the target will enforce.
I wrote about this gap in detail: The Gap Between the Vocabulary and the Data.
What I couldn’t build inside the system
The thing I wanted and couldn’t have: a schema that’s the same file at the source and the target, evaluated before the data leaves the source, producing the exact metadata the target expects.
CKAN scheming files are YAML or JSON. They declare fields and validators. They don’t compose across portals. Two scheming files from two portals can’t be unified to produce a shared constraint set. There’s no intersection operation. You compare them manually, file a ticket, wait for the next release cycle.
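What an intersection operation would even look like is worth making concrete. Nothing like this exists in ckanext-scheming; this is a hypothetical sketch of merging two field definitions so that the strictest rule from either portal wins.

```python
# Hypothetical intersection of two scheming-style field definitions.
# This operation does not exist in ckanext-scheming; it sketches what
# a shared constraint set across two portals would require.

def intersect_field(a: dict, b: dict) -> dict:
    merged = {"field_name": a["field_name"]}
    # Required if either side requires it (strictest rule wins)
    merged["required"] = a.get("required", False) or b.get("required", False)
    # Allowed values must satisfy both controlled vocabularies
    if "choices" in a and "choices" in b:
        merged["choices"] = sorted(set(a["choices"]) & set(b["choices"]))
    return merged

source = {"field_name": "subject", "required": False,
          "choices": ["economy", "health", "transport"]}
target = {"field_name": "subject", "required": True,
          "choices": ["health", "transport", "education"]}

print(intersect_field(source, target))
# {'field_name': 'subject', 'required': True, 'choices': ['health', 'transport']}
```

Today, this merge happens in someone’s head, then in a ticket.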
The types are the interesting part. When I model a dataset as simultaneously a Dataset, a GovernedResource, and a BilingualAsset, those aren’t subclasses of a parent. They’re independent memberships. Each type brings its own constraints. The dataset must satisfy all of them, and the types don’t need to know about each other. I explored why this matters for SKOS and SHACL: Why skos:inScheme Is the Interesting Part.
CKAN can’t express this. The scheming extension validates fields, not type memberships. You can’t say “this dataset belongs to three independent classification schemes, and each scheme independently determines which fields are required.” You get one schema, one set of rules, one validation pass.
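The model I wanted can be sketched in a few lines. The type names and fields below are illustrative, not a CKAN feature: each membership contributes its own required fields, the memberships know nothing about each other, and the resource must satisfy their union.

```python
# Sketch of validation by independent type memberships (an assumed
# model, not a CKAN capability). Each type a resource belongs to
# contributes its own required fields.

TYPE_CONSTRAINTS = {
    "Dataset":          {"title", "description"},
    "GovernedResource": {"steward", "classification"},
    "BilingualAsset":   {"title_fr", "description_fr"},
}

def missing_fields(resource: dict) -> set:
    required = set()
    for type_name in resource["types"]:
        required |= TYPE_CONSTRAINTS[type_name]  # types need not know each other
    return required - resource["fields"].keys()

resource = {
    "types": ["Dataset", "GovernedResource", "BilingualAsset"],
    "fields": {"title": "Census", "description": "Population counts",
               "steward": "StatCan", "classification": "public",
               "title_fr": "Recensement"},
}
print(missing_fields(resource))  # {'description_fr'}
```

Three memberships, three independent constraint sets, one combined verdict.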
The structural alternative
After leaving the portal, I rebuilt the pattern from scratch using a constraint language. Resources declare types and dependencies. The language computes constraint intersections at evaluation time. If the data violates any constraint from any type membership, evaluation fails. Not a report. A type error.
From two fields per resource, comprehensions compute topology, depth, ancestors, critical paths, and impact analysis. Each W3C vocabulary becomes a projection: one evaluation step, one output format. I described the architecture: What If SHACL Validation Was a Type Error?.
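The derivation idea can be sketched without the constraint language itself. Assuming a single dependency field per resource (the names below are invented), comprehension-style definitions are enough to compute ancestors and depth:

```python
# Illustrative: from one "dependencies" field per resource, derived
# properties like ancestors and depth fall out of small recursive
# comprehensions. Resource names are hypothetical.

DEPS = {"raw": [], "cleaned": ["raw"],
        "report": ["cleaned"], "dashboard": ["cleaned"]}

def ancestors(name: str) -> set:
    parents = DEPS[name]
    return set(parents).union(*(ancestors(p) for p in parents)) if parents else set()

def depth(name: str) -> int:
    return 1 + max((depth(p) for p in DEPS[name]), default=0)

print(sorted(ancestors("report")))  # ['cleaned', 'raw']
print(depth("dashboard"))           # 2 hops from "raw", so depth 3
```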
The output is standard JSON-LD. Seventeen W3C vocabularies. Any triplestore can import it. The difference is that every constraint was enforced before the export, not after. The data that arrives at the receiving system is pre-validated against a schema that both sides can read, because it’s the same schema.
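For readers who haven’t seen one, a minimal DCAT record in JSON-LD looks like this. The field values are illustrative, and the real export carries far more vocabularies than the two namespaces shown:

```python
import json

# A minimal DCAT dataset record in JSON-LD (illustrative values).
# "dcat" and "dct" are the standard W3C and Dublin Core namespaces.

record = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:title": "Transit stops",
    "dct:issued": "2021-11-03",
    "dcat:keyword": ["transport", "transit"],
}
print(json.dumps(record, indent=2))
```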
For the harvest boundary problem: the source and target share a constraint file. The source evaluates it before publishing. The target evaluates it before importing. If both pass, the data conforms. No overnight harvest job discovers a mismatch three days later.
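The symmetry is the whole point, and it fits in a sketch. Everything here is illustrative: one shared schema, one conformance check, run at both ends of the boundary.

```python
# Sketch of the shared-constraint pattern: the SAME schema is
# evaluated at the source before publishing and at the target before
# importing. Field names and rules are hypothetical.

SHARED_SCHEMA = {"title": True, "issued": True, "subject": False}  # field -> required

def conforms(record: dict) -> bool:
    return all(field in record
               for field, required in SHARED_SCHEMA.items() if required)

def publish(record: dict) -> dict:   # source side
    assert conforms(record), "rejected before leaving the source"
    return record

def ingest(record: dict) -> dict:    # target side
    assert conforms(record), "rejected before entering the target"
    return record

record = publish({"title": "Transit stops", "issued": "2021-11-03"})
ingest(record)  # same schema, same check, same result: accepted
```

If a record passes at the source, it cannot surprise the target, because the target is running the identical check.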
Who this is for
This post isn’t a pitch. It’s a report from someone who operated the system and hit the wall.
The enforcement gap in open.canada.ca is not a bug. It’s a consequence of the architecture. CKAN enforces within one instance. The W3C vocabularies describe what the data should look like. Nothing in between ensures that harvested data from another portal meets the local schema. SHACL can validate after the fact. Nothing validates before the boundary crossing.
The twenty people running that portal know exactly which harvest sources produce clean metadata and which ones don’t. They maintain the workarounds. They know the gap is there because they work around it every quarter.
The structural alternative exists. The output is the same W3C linked data. The difference is when the constraints run.