
Conversation

@steveoni (Member) commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • Added DCAT JSON‑LD harvesting to ingest datasets with multiple distributions and richer DCAT field mapping (publisher, contact point, language with default, license, resources, extras).
    • Added DCAT‑AP support for DCAT‑AP specific resources and metadata mapping.
    • New visible dataset fields: Dataset URL, License URL, and Extras (key–value metadata).
  • Refactor

    • Harvesting framework updated to support varied target dataset types for more flexible mappings.

coderabbitai bot commented Oct 3, 2025

Walkthrough

Adds a new DCAT harvester and JSON-LD helpers, updates BaseHarvester generics, and extends PortalJsCloudDataset with url, license_url, and extras. The harvester fetches expanded DCAT JSON‑LD, resolves distributions and agents, and maps datasets to PortalJsCloudDataset objects.

Changes

Cohort / File(s) — Summary

  • DCAT Harvester — src/harvesters/dcat.ts
    New DCATHarvester (extends BaseHarvester) with getSourceDatasets() and mapSourceDatasetToTarget(). Fetches expanded JSON‑LD from the source URL, builds an @id index, filters DCAT Dataset objects, resolves distributions/publisher, and maps to PortalJsCloudDataset (resources, language, contact, license_url, extras). A standalone sketch of this flow appears right after this list.
  • DCAT utilities — src/lib/dcat.ts
    New interfaces DCATDataset and DCATDistribution plus helpers extractString, extractAgentName, extractStringArray, and extractDistributions to parse expanded JSON‑LD and resolve references.
  • Harvester base API — src/harvesters/base.ts
    BaseHarvester now has a second generic TargetDatasetT (default PortalJsCloudDataset); mapSourceDatasetToTarget returns TargetDatasetT; upsertIntoTarget accepts TargetDatasetT.
  • PortalJsCloud schema — src/schemas/portaljs-cloud.d.ts
    PortalJsCloudDataset gained optional fields: url?: string, license_url?: string, and extras?: Array<{ key: string; value: string }>.
  • Formatting / minor edits — src/harvesters/ckan.ts
    Minor formatting change to a generic type declaration and a trailing comma added in the resources mapping (no behavioral change).
  • DCAT-AP extension & harvester — src/harvesters/dcatap.ts
    Adds DCAT-AP typed interfaces (DCATAPResource, DCATAPCkanPackage, DCATAPPortalJsDataset) and DCATAPHarvester with mapSourceDatasetToTarget producing a DCAT‑AP-aware PortalJsCloudDataset.
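To make the fetch → index → filter → map flow above concrete, here is a small standalone TypeScript sketch. It is illustrative only, not the code in src/harvesters/dcat.ts: the mapped field names and the assumption that the endpoint already returns expanded JSON‑LD are ours; only the DCAT/DCT IRIs are standard vocabulary.

```ts
// Standalone illustration of the harvesting flow described above; not the repo code.
type JsonLdNode = { "@id"?: string; "@type"?: string[]; [predicate: string]: any };

const DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset";
const DCAT_DISTRIBUTION = "http://www.w3.org/ns/dcat#distribution";
const DCT_TITLE = "http://purl.org/dc/terms/title";
const DCAT_DOWNLOAD_URL = "http://www.w3.org/ns/dcat#downloadURL";

// Expanded JSON-LD values arrive as [{ "@value": "..." }] or [{ "@id": "..." }].
function firstValue(node: JsonLdNode | undefined, predicate: string): string {
  const entry = node?.[predicate]?.[0];
  return entry?.["@value"] ?? entry?.["@id"] ?? "";
}

async function harvest(sourceUrl: string) {
  // 1. Fetch the expanded JSON-LD graph (assumed to already be an array of nodes).
  const graph: JsonLdNode[] = await (await fetch(sourceUrl)).json();

  // 2. Build the @id index so references can be resolved without repeated scans.
  const byId = new Map(graph.map((n) => [n["@id"], n] as [string | undefined, JsonLdNode]));

  // 3. Keep only dcat:Dataset nodes.
  const datasets = graph.filter((n) => (n["@type"] ?? []).includes(DCAT_DATASET));

  // 4. Map each dataset, resolving distribution references through the index.
  return datasets.map((ds) => ({
    title: firstValue(ds, DCT_TITLE),
    resources: ((ds[DCAT_DISTRIBUTION] ?? []) as Array<{ "@id"?: string }>)
      .map((ref) => byId.get(ref["@id"]))
      .filter((dist): dist is JsonLdNode => Boolean(dist))
      .map((dist) => ({ url: firstValue(dist, DCAT_DOWNLOAD_URL) })),
  }));
}
```

Calling `await harvest("https://example.org/catalog.jsonld")` would return plain objects; the real DCATHarvester splits this work across getSourceDatasets() and mapSourceDatasetToTarget() and emits full PortalJsCloudDataset payloads.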

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Operator
  participant Harvester as DCATHarvester
  participant Source as "HTTP DCAT Endpoint"
  participant JSONLD as "Expanded JSON‑LD"
  participant Mapper as "mapSourceDatasetToTarget"
  participant Target as "PortalJsCloudDataset"

  Operator->>Harvester: run()
  Harvester->>Source: GET source URL
  Source-->>Harvester: 200 OK (expanded JSON‑LD)
  Harvester->>JSONLD: build @id index
  Harvester->>JSONLD: filter @type == dcat:Dataset
  Harvester->>JSONLD: resolve distributions & publisher
  Harvester->>Mapper: map dataset -> PortalJsCloudDataset
  Mapper-->>Harvester: mapped dataset (resources, extras, url, license_url, language)
  Harvester-->>Target: emit upsert payload
  note right of Target #DFF0D8: Target receives PortalJsCloudDataset objects
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • demenech
  • willy1989cv

Poem

I hop through graphs with whiskered cheer,
I fetch the nodes both far and near.
I stitch distributions, extras bright,
Pack them neat for portal sight.
A carrot for each dataset I find—data tidy, well aligned 🥕

Pre-merge checks

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed — The title “implement dcat harvester” clearly and concisely summarizes the main addition of the pull request, which is the implementation of a DCAT harvester class, without extraneous details or ambiguity.
  • Docstring Coverage — ✅ Passed — No functions found in the changes; docstring coverage check skipped.


coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/schemas/portaljs-cloud.d.ts (1)

27-32: Extend CkanResource with license_url.

mapSourceDatasetToTarget now attaches license_url to each resource, but CkanResource still disallows that property. TypeScript will reject the harvester build with “Object literal may only specify known properties.” Please add an optional license_url?: string; here.

Consider:

```diff
 export interface CkanResource {
   url?: string;
   name?: string;
   description?: string;
   format?: string;
+  license_url?: string;
 }
```
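For context, once the optional field exists, a resource literal like the one below should type-check (values and the import path are illustrative; CkanResource may be an ambient type in the repo):

```ts
import type { CkanResource } from "../schemas/portaljs-cloud"; // path assumed

const resource: CkanResource = {
  url: "https://example.org/data.csv",
  name: "Example distribution",
  format: "CSV",
  license_url: "https://creativecommons.org/licenses/by/4.0/", // newly allowed by the optional field
};
```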
🧹 Nitpick comments (1)
src/harvesters/dcat.ts (1)

30-31: Remove the unused objectMap.

The map is built and never read, so it just adds overhead. Please drop it or use it for lookups.
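If the map is to be kept instead, one way to make it earn its place (a hedged sketch with assumed names, not the repo code) is to use it for the @id lookups that currently scan the array:

```ts
// Hypothetical sketch: build the index once, then resolve references in O(1)
// instead of calling allObjects.find(...) for every lookup.
function buildIndex(allObjects: Array<{ "@id"?: string }>) {
  return new Map(
    allObjects.map((o) => [o["@id"], o] as [string | undefined, { "@id"?: string }])
  );
}

const objectMap = buildIndex([{ "@id": "https://example.org/org/1" }]);
const publisher = objectMap.get("https://example.org/org/1"); // O(1) lookup
```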

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 978ceba and e4f33a4.

📒 Files selected for processing (3)
  • src/harvesters/dcat.ts (1 hunks)
  • src/lib/dcat.ts (1 hunks)
  • src/schemas/portaljs-cloud.d.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/harvesters/dcat.ts (5)
src/harvesters/index.ts (1)
  • Harvester (12-14)
src/lib/dcat.ts (6)
  • DCATDataset (8-12)
  • extractDistributions (53-66)
  • extractAgentName (24-41)
  • DCATDistribution (2-6)
  • extractString (15-22)
  • extractStringArray (44-50)
src/harvesters/base.ts (1)
  • BaseHarvesterConfig (6-14)
src/schemas/portaljs-cloud.d.ts (1)
  • PortalJsCloudDataset (2-25)
config.ts (1)
  • env (24-24)

Comment on lines +140 to +147
```ts
const extractedLanguage = extractString(
  pkg,
  "http://purl.org/dc/terms/language"
);
const validLanguages = ["EN", "FR", "ES", "DE", "IT"];
const language = (
  validLanguages.includes(extractedLanguage) ? extractedLanguage : "EN"
) as "EN" | "FR" | "ES" | "DE" | "IT";
```

⚠️ Potential issue | 🟠 Major

Normalize incoming language codes.

Most DCAT feeds emit dct:language as lowercase codes (e.g. en) or URIs (e.g. /language/ENG). Comparing the raw value against ["EN","FR","ES","DE","IT"] forces nearly everything to the fallback "EN". Normalize the value (uppercase, trim URI suffix, etc.) before validation so genuine locales survive.

One minimal fix:

```diff
-    const extractedLanguage = extractString(
-      pkg,
-      "http://purl.org/dc/terms/language"
-    );
+    const rawLanguage =
+      extractString(pkg, "http://purl.org/dc/terms/language") || "";
+    const extractedLanguage = rawLanguage
+      .split("/")
+      .pop()
+      ?.slice(0, 2)
+      .toUpperCase() || "";
```

This keeps the existing whitelist working for common inputs instead of hard-defaulting to "EN".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-    const extractedLanguage = extractString(
-      pkg,
-      "http://purl.org/dc/terms/language"
-    );
+    const rawLanguage =
+      extractString(pkg, "http://purl.org/dc/terms/language") || "";
+    const extractedLanguage = rawLanguage
+      .split("/")
+      .pop()
+      ?.slice(0, 2)
+      .toUpperCase() || "";
     const validLanguages = ["EN", "FR", "ES", "DE", "IT"];
     const language = (
       validLanguages.includes(extractedLanguage) ? extractedLanguage : "EN"
     ) as "EN" | "FR" | "ES" | "DE" | "IT";
```
🤖 Prompt for AI Agents
In src/harvesters/dcat.ts around lines 140 to 147, the code currently compares
the raw extractedLanguage directly to the uppercase whitelist causing most
inputs (like "en" or "/language/ENG") to fall back to "EN"; normalize the
extractedLanguage before validation by trimming whitespace, converting to
uppercase, and if it looks like a URI or contains slashes or hashes, take the
last path/fragment segment (or strip non-letter characters) to yield a plain
code, then check that normalized value against the existing validLanguages array
and use it if valid, otherwise default to "EN".
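For illustration, this is how the proposed normalization behaves on typical dct:language values (a standalone sketch, not the committed code):

```ts
// Standalone check of the normalization idea on representative dct:language values.
const validLanguages = ["EN", "FR", "ES", "DE", "IT"];

function normalizeLanguage(raw: string): "EN" | "FR" | "ES" | "DE" | "IT" {
  const code = (raw.split("/").pop() ?? "").slice(0, 2).toUpperCase();
  return (validLanguages.includes(code) ? code : "EN") as "EN" | "FR" | "ES" | "DE" | "IT";
}

console.log(normalizeLanguage("en")); // "EN"
console.log(normalizeLanguage("fr")); // "FR"
console.log(normalizeLanguage("http://publications.europa.eu/resource/authority/language/ENG")); // "EN"
console.log(normalizeLanguage("")); // "EN" (fallback)
```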

Comment on lines +24 to +41
```ts
export function extractAgentName(
  obj: any,
  predicate: string,
  allObjects: any[]
): string {
  const agentId = extractString(obj, predicate);
  if (agentId) {
    const agent = allObjects.find((o) => o["@id"] === agentId);
    if (agent) {
      return (
        extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
        extractString(agent, "http://purl.org/dc/terms/title") ||
        ""
      );
    }
  }
  return "";
}
```

⚠️ Potential issue | 🟠 Major

Respect literal publisher values.

If a dataset encodes dct:publisher as a literal (common in the wild), extractString returns that literal, but the lookup for an object with a matching @id fails and the original text is dropped, so the maintainer field ends up empty. Add a fallback that returns the literal when no matching agent object exists.

Suggested tweak:

```diff
   const agentId = extractString(obj, predicate);
   if (agentId) {
     const agent = allObjects.find((o) => o["@id"] === agentId);
     if (agent) {
       return (
         extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
         extractString(agent, "http://purl.org/dc/terms/title") ||
         ""
       );
     }
+    return agentId;
   }
   return "";
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
 export function extractAgentName(
   obj: any,
   predicate: string,
   allObjects: any[]
 ): string {
   const agentId = extractString(obj, predicate);
   if (agentId) {
     const agent = allObjects.find((o) => o["@id"] === agentId);
     if (agent) {
       return (
         extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
         extractString(agent, "http://purl.org/dc/terms/title") ||
         ""
       );
     }
+    return agentId;
   }
   return "";
 }
```
🤖 Prompt for AI Agents
In src/lib/dcat.ts around lines 24 to 41, extractAgentName currently treats the
value returned by extractString as an ID and only returns a resolved agent's
name, which drops literal publisher values when no matching object exists;
change the logic so that if extractString(obj, predicate) yields a non-empty
value but no object in allObjects matches that value, return that literal value
as a fallback (i.e., return the original agentId string) while keeping the
existing behavior when a matching agent object is found.
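A quick illustration of the two cases the fallback covers, using hypothetical expanded JSON-LD input and assuming extractString returns the first @value or @id as described above:

```ts
// Hypothetical expanded JSON-LD inputs for the two publisher cases.
const allObjects = [
  {
    "@id": "https://example.org/org/1",
    "http://xmlns.com/foaf/0.1/name": [{ "@value": "Example City Council" }],
  },
];

// Case 1: dct:publisher is a reference that resolves to an agent node.
const referencingDataset = {
  "http://purl.org/dc/terms/publisher": [{ "@id": "https://example.org/org/1" }],
};
// extractAgentName(referencingDataset, "http://purl.org/dc/terms/publisher", allObjects)
//   -> "Example City Council"  (unchanged behavior)

// Case 2: dct:publisher is a bare literal with no matching node in the graph.
const literalDataset = {
  "http://purl.org/dc/terms/publisher": [{ "@value": "Example City Council" }],
};
// extractAgentName(literalDataset, "http://purl.org/dc/terms/publisher", allObjects)
//   -> "Example City Council" with the `return agentId;` fallback (was "" before)
```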

coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4f33a4 and 8bebc21.

📒 Files selected for processing (3)
  • src/harvesters/base.ts (2 hunks)
  • src/harvesters/ckan.ts (2 hunks)
  • src/harvesters/dcatap.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/harvesters/base.ts (2)
src/schemas/portaljs-cloud.d.ts (1)
  • PortalJsCloudDataset (2-25)
src/lib/portaljs-cloud.ts (1)
  • getDatasetList (12-19)
src/harvesters/dcatap.ts (4)
src/schemas/portaljs-cloud.d.ts (2)
  • CkanResource (27-32)
  • PortalJsCloudDataset (2-25)
src/schemas/ckanPackage.d.ts (1)
  • CkanPackage (1-3)
src/harvesters/base.ts (1)
  • BaseHarvesterConfig (6-14)
config.ts (1)
  • env (24-24)
src/harvesters/ckan.ts (1)
src/schemas/ckanPackage.d.ts (1)
  • CkanPackage (1-3)

Comment on lines +17 to +18
```ts
  SourceDatasetT extends { [k: string]: string } = any,
  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
```

⚠️ Potential issue | 🔴 Critical

Constraint breaks every harvester subtype.

Forcing SourceDatasetT to extend { [k: string]: string } means every property must be a string. Our CKAN/DCAT packages carry arrays, objects, booleans (resources, extras, isopen, etc.), so types like DCATAPCkanPackage no longer satisfy the constraint and CkanHarvester<DCATAPCkanPackage> won’t compile. Drop this overly narrow bound (or widen it to Record<string, unknown>) so richer source schemas still type‑check.

Apply this diff:

```diff
-export abstract class BaseHarvester<
-  SourceDatasetT extends { [k: string]: string } = any,
-  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
+export abstract class BaseHarvester<
+  SourceDatasetT extends Record<string, unknown> = any,
+  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
```
🤖 Prompt for AI Agents
In src/harvesters/base.ts around lines 17-18, the generic constraint
"SourceDatasetT extends { [k: string]: string }" is too narrow and breaks
harvesters whose source schemas include arrays, booleans, or nested objects;
change the constraint to either remove it entirely or widen it to
"Record<string, unknown>" (preferred) and keep the same default generic (e.g.,
SourceDatasetT extends Record<string, unknown> = any) so types like
DCATAPCkanPackage and CkanHarvester<DCATAPCkanPackage> compile again.
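A minimal check of why the wider bound matters; the field names below are assumptions, not the actual DCATAPCkanPackage definition:

```ts
// Hypothetical source shape with non-string members: rejected by
// `extends { [k: string]: string }`, accepted by `extends Record<string, unknown>`.
interface ExampleSourcePackage {
  name: string;
  isopen: boolean;
  resources: Array<{ url: string; format?: string }>;
  extras: Array<{ key: string; value: string }>;
  [k: string]: unknown; // index signature so the interface satisfies Record<string, unknown>
}

// With the widened bound, a harvester can be parameterized as e.g.
//   class ExampleHarvester extends BaseHarvester<ExampleSourcePackage> { ... }
// whereas the string-only bound flags `isopen`, `resources`, and `extras` as errors.
```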

Comment on lines +156 to +169
```ts
      // DCAT-AP specific fields (mapped from extras or direct fields)
      language: pkg.language || extras.language || "EN",
      // frequency: extras.frequency || "",
      // temporal_start: extras.temporal_start || "",
      // temporal_end: extras.temporal_end || "",
      // publisher_name: extras.publisher_name || pkg.organization?.title || "",
      // publisher_email: extras.publisher_email || "",
      // contact_name: extras.contact_name || pkg.maintainer || pkg.author || "",
      // contact_email:
      //   extras.contact_email || pkg.maintainer_email || pkg.author_email || "",
      // theme: extras.theme || "",
      // conforms_to: extras.conforms_to || "",
      // extras: pkg.extras || [],
    };
```

⚠️ Potential issue | 🟠 Major

Preserve CKAN extras instead of dropping them.

You flatten pkg.extras into extras for lookups, but the returned dataset never exposes those extras. That strips DCAT-AP metadata we don’t map explicitly (frequency, temporal, publisher, etc.), so the target loses data. Forward the extras array (e.g. extras: pkg.extras ?? []) while still deriving specific fields from it.

🤖 Prompt for AI Agents
In src/harvesters/dcatap.ts around lines 156 to 169, the code flattens
pkg.extras into a local extras object for lookups but then omits forwarding the
original pkg.extras array in the returned dataset, which drops unmapped DCAT-AP
metadata; restore the extras array by adding a property like extras: pkg.extras
?? [] to the returned object while keeping the existing derived fields (language
and others) sourced from the flattened extras, so unmapped extras are preserved
and mapped fields continue to use the lookup values.
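A hedged sketch of the suggested shape; the flattened lookup and field names follow the surrounding review, not the actual dcatap.ts code:

```ts
// Illustrative only: read known keys from a flattened lookup, but still forward the raw array.
type Extra = { key: string; value: string };

function mapExtras(pkg: { language?: string; extras?: Extra[] }) {
  // Flatten for convenient lookups of explicitly mapped keys...
  const extras = Object.fromEntries((pkg.extras ?? []).map((e) => [e.key, e.value]));

  return {
    language: pkg.language || extras.language || "EN",
    // ...and pass the original array through so unmapped DCAT-AP keys survive.
    extras: pkg.extras ?? [],
  };
}
```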

```diff
-  abstract mapSourceDatasetToTarget(
-    dataset: SourceDatasetT
-  ): PortalJsCloudDataset;
+  abstract mapSourceDatasetToTarget(dataset: SourceDatasetT): TargetDatasetT;
```
Member commented:

what's this? @steveoni

```diff
     return await getDatasetList();
   }
-  async upsertIntoTarget({ dataset }: { dataset: PortalJsCloudDataset }) {
+  async upsertIntoTarget({ dataset }: { dataset: TargetDatasetT }) {
```
Member commented:

@steveoni what's the point of renaming it?
