
Conversation

@steveoni (Member) commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • Added DCAT JSON‑LD harvesting to ingest datasets with multiple distributions and richer DCAT field mapping (publisher, contact point, language with default, license, resources, extras).
    • Added DCAT‑AP support for DCAT‑AP specific resources and metadata mapping.
    • New visible dataset fields: Dataset URL, License URL, and Extras (key–value metadata).
  • Refactor

    • Harvesting framework updated to support varied target dataset types for more flexible mappings.

coderabbitai bot commented Oct 3, 2025

Walkthrough

Adds a new DCAT harvester and JSON-LD helpers, updates BaseHarvester generics, and extends PortalJsCloudDataset with url, license_url, and extras. The harvester fetches expanded DCAT JSON‑LD, resolves distributions and agents, and maps datasets to PortalJsCloudDataset objects.

Changes

Cohort / File(s) — Summary

  • DCAT Harvester — src/harvesters/dcat.ts
    New DCATHarvester (extends BaseHarvester) with getSourceDatasets() and mapSourceDatasetToTarget(). Fetches expanded JSON‑LD from the source URL, builds an @id index, filters DCAT Dataset objects, resolves distributions/publisher, and maps to PortalJsCloudDataset (resources, language, contact, license_url, extras). A standalone sketch of this flow appears right after this list.
  • DCAT utilities — src/lib/dcat.ts
    New interfaces DCATDataset and DCATDistribution plus helpers extractString, extractAgentName, extractStringArray, and extractDistributions to parse expanded JSON‑LD and resolve references.
  • Harvester base API — src/harvesters/base.ts
    BaseHarvester now has a second generic TargetDatasetT (default PortalJsCloudDataset); mapSourceDatasetToTarget returns TargetDatasetT; upsertIntoTarget accepts TargetDatasetT.
  • PortalJsCloud schema — src/schemas/portaljs-cloud.d.ts
    PortalJsCloudDataset gained optional fields: url?: string, license_url?: string, and extras?: Array<{ key: string; value: string }>.
  • Formatting / minor edits — src/harvesters/ckan.ts
    Minor formatting change to a generic type declaration and a trailing comma added in the resources mapping (no behavioral change).
  • DCAT-AP extension & harvester — src/harvesters/dcatap.ts
    Adds DCAT-AP typed interfaces (DCATAPResource, DCATAPCkanPackage, DCATAPPortalJsDataset) and DCATAPHarvester with mapSourceDatasetToTarget producing a DCAT‑AP-aware PortalJsCloudDataset.
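To make the fetch → index → filter → map flow above concrete, here is a small standalone TypeScript sketch. It is illustrative only, not the code in src/harvesters/dcat.ts: the mapped field names and the assumption that the endpoint already returns expanded JSON‑LD are ours; only the DCAT/DCT IRIs are standard vocabulary.

```ts
// Standalone illustration of the harvesting flow described above; not the repo code.
type JsonLdNode = { "@id"?: string; "@type"?: string[]; [predicate: string]: any };

const DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset";
const DCAT_DISTRIBUTION = "http://www.w3.org/ns/dcat#distribution";
const DCT_TITLE = "http://purl.org/dc/terms/title";
const DCAT_DOWNLOAD_URL = "http://www.w3.org/ns/dcat#downloadURL";

// Expanded JSON-LD values arrive as [{ "@value": "..." }] or [{ "@id": "..." }].
function firstValue(node: JsonLdNode | undefined, predicate: string): string {
  const entry = node?.[predicate]?.[0];
  return entry?.["@value"] ?? entry?.["@id"] ?? "";
}

async function harvest(sourceUrl: string) {
  // 1. Fetch the expanded JSON-LD graph (assumed to already be an array of nodes).
  const graph: JsonLdNode[] = await (await fetch(sourceUrl)).json();

  // 2. Build the @id index so references can be resolved without repeated scans.
  const byId = new Map(graph.map((n) => [n["@id"], n] as [string | undefined, JsonLdNode]));

  // 3. Keep only dcat:Dataset nodes.
  const datasets = graph.filter((n) => (n["@type"] ?? []).includes(DCAT_DATASET));

  // 4. Map each dataset, resolving distribution references through the index.
  return datasets.map((ds) => ({
    title: firstValue(ds, DCT_TITLE),
    resources: ((ds[DCAT_DISTRIBUTION] ?? []) as Array<{ "@id"?: string }>)
      .map((ref) => byId.get(ref["@id"]))
      .filter((dist): dist is JsonLdNode => Boolean(dist))
      .map((dist) => ({ url: firstValue(dist, DCAT_DOWNLOAD_URL) })),
  }));
}
```

Calling `await harvest("https://example.org/catalog.jsonld")` would return plain objects; the real DCATHarvester splits this work across getSourceDatasets() and mapSourceDatasetToTarget() and emits full PortalJsCloudDataset payloads.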

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Operator
  participant Harvester as DCATHarvester
  participant Source as "HTTP DCAT Endpoint"
  participant JSONLD as "Expanded JSON‑LD"
  participant Mapper as "mapSourceDatasetToTarget"
  participant Target as "PortalJsCloudDataset"

  Operator->>Harvester: run()
  Harvester->>Source: GET source URL
  Source-->>Harvester: 200 OK (expanded JSON‑LD)
  Harvester->>JSONLD: build @id index
  Harvester->>JSONLD: filter @type == dcat:Dataset
  Harvester->>JSONLD: resolve distributions & publisher
  Harvester->>Mapper: map dataset -> PortalJsCloudDataset
  Mapper-->>Harvester: mapped dataset (resources, extras, url, license_url, language)
  Harvester-->>Target: emit upsert payload
  note right of Target #DFF0D8: Target receives PortalJsCloudDataset objects
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • demenech
  • willy1989cv

Poem

I hop through graphs with whiskered cheer,
I fetch the nodes both far and near.
I stitch distributions, extras bright,
Pack them neat for portal sight.
A carrot for each dataset I find—data tidy, well aligned 🥕

Pre-merge checks

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed — The title “implement dcat harvester” clearly and concisely summarizes the main addition of the pull request, which is the implementation of a DCAT harvester class, without extraneous details or ambiguity.
  • Docstring Coverage — ✅ Passed — No functions found in the changes; docstring coverage check skipped.


coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/schemas/portaljs-cloud.d.ts (1)

27-32: Extend CkanResource with license_url.

mapSourceDatasetToTarget now attaches license_url to each resource, but CkanResource still disallows that property. TypeScript will reject the harvester build with “Object literal may only specify known properties.” Please add an optional license_url?: string; here.

Consider:

```diff
 export interface CkanResource {
   url?: string;
   name?: string;
   description?: string;
   format?: string;
+  license_url?: string;
 }
```
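For context, once the optional field exists, a resource literal like the one below should type-check (values and the import path are illustrative; CkanResource may be an ambient type in the repo):

```ts
import type { CkanResource } from "../schemas/portaljs-cloud"; // path assumed

const resource: CkanResource = {
  url: "https://example.org/data.csv",
  name: "Example distribution",
  format: "CSV",
  license_url: "https://creativecommons.org/licenses/by/4.0/", // newly allowed by the optional field
};
```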
🧹 Nitpick comments (1)
src/harvesters/dcat.ts (1)

30-31: Remove the unused objectMap.

The map is built and never read, so it just adds overhead. Please drop it or use it for lookups.
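If the map is to be kept instead, one way to make it earn its place (a hedged sketch with assumed names, not the repo code) is to use it for the @id lookups that currently scan the array:

```ts
// Hypothetical sketch: build the index once, then resolve references in O(1)
// instead of calling allObjects.find(...) for every lookup.
function buildIndex(allObjects: Array<{ "@id"?: string }>) {
  return new Map(
    allObjects.map((o) => [o["@id"], o] as [string | undefined, { "@id"?: string }])
  );
}

const objectMap = buildIndex([{ "@id": "https://example.org/org/1" }]);
const publisher = objectMap.get("https://example.org/org/1"); // O(1) lookup
```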

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 978ceba and e4f33a4.

📒 Files selected for processing (3)
  • src/harvesters/dcat.ts (1 hunks)
  • src/lib/dcat.ts (1 hunks)
  • src/schemas/portaljs-cloud.d.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/harvesters/dcat.ts (5)
src/harvesters/index.ts (1)
  • Harvester (12-14)
src/lib/dcat.ts (6)
  • DCATDataset (8-12)
  • extractDistributions (53-66)
  • extractAgentName (24-41)
  • DCATDistribution (2-6)
  • extractString (15-22)
  • extractStringArray (44-50)
src/harvesters/base.ts (1)
  • BaseHarvesterConfig (6-14)
src/schemas/portaljs-cloud.d.ts (1)
  • PortalJsCloudDataset (2-25)
config.ts (1)
  • env (24-24)

Comment on lines +140 to +147
```ts
const extractedLanguage = extractString(
  pkg,
  "http://purl.org/dc/terms/language"
);
const validLanguages = ["EN", "FR", "ES", "DE", "IT"];
const language = (
  validLanguages.includes(extractedLanguage) ? extractedLanguage : "EN"
) as "EN" | "FR" | "ES" | "DE" | "IT";
```

⚠️ Potential issue | 🟠 Major

Normalize incoming language codes.

Most DCAT feeds emit dct:language as lowercase codes (e.g. en) or URIs (e.g. /language/ENG). Comparing the raw value against ["EN","FR","ES","DE","IT"] forces nearly everything to the fallback "EN". Normalize the value (uppercase, trim URI suffix, etc.) before validation so genuine locales survive.

One minimal fix:

```diff
-    const extractedLanguage = extractString(
-      pkg,
-      "http://purl.org/dc/terms/language"
-    );
+    const rawLanguage =
+      extractString(pkg, "http://purl.org/dc/terms/language") || "";
+    const extractedLanguage = rawLanguage
+      .split("/")
+      .pop()
+      ?.slice(0, 2)
+      .toUpperCase() || "";
```

This keeps the existing whitelist working for common inputs instead of hard-defaulting to "EN".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-    const extractedLanguage = extractString(
-      pkg,
-      "http://purl.org/dc/terms/language"
-    );
+    const rawLanguage =
+      extractString(pkg, "http://purl.org/dc/terms/language") || "";
+    const extractedLanguage = rawLanguage
+      .split("/")
+      .pop()
+      ?.slice(0, 2)
+      .toUpperCase() || "";
     const validLanguages = ["EN", "FR", "ES", "DE", "IT"];
     const language = (
       validLanguages.includes(extractedLanguage) ? extractedLanguage : "EN"
     ) as "EN" | "FR" | "ES" | "DE" | "IT";
```
🤖 Prompt for AI Agents
In src/harvesters/dcat.ts around lines 140 to 147, the code currently compares
the raw extractedLanguage directly to the uppercase whitelist causing most
inputs (like "en" or "/language/ENG") to fall back to "EN"; normalize the
extractedLanguage before validation by trimming whitespace, converting to
uppercase, and if it looks like a URI or contains slashes or hashes, take the
last path/fragment segment (or strip non-letter characters) to yield a plain
code, then check that normalized value against the existing validLanguages array
and use it if valid, otherwise default to "EN".
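For illustration, this is how the proposed normalization behaves on typical dct:language values (a standalone sketch, not the committed code):

```ts
// Standalone check of the normalization idea on representative dct:language values.
const validLanguages = ["EN", "FR", "ES", "DE", "IT"];

function normalizeLanguage(raw: string): "EN" | "FR" | "ES" | "DE" | "IT" {
  const code = (raw.split("/").pop() ?? "").slice(0, 2).toUpperCase();
  return (validLanguages.includes(code) ? code : "EN") as "EN" | "FR" | "ES" | "DE" | "IT";
}

console.log(normalizeLanguage("en")); // "EN"
console.log(normalizeLanguage("fr")); // "FR"
console.log(normalizeLanguage("http://publications.europa.eu/resource/authority/language/ENG")); // "EN"
console.log(normalizeLanguage("")); // "EN" (fallback)
```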

Comment on lines +24 to +41
```ts
export function extractAgentName(
  obj: any,
  predicate: string,
  allObjects: any[]
): string {
  const agentId = extractString(obj, predicate);
  if (agentId) {
    const agent = allObjects.find((o) => o["@id"] === agentId);
    if (agent) {
      return (
        extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
        extractString(agent, "http://purl.org/dc/terms/title") ||
        ""
      );
    }
  }
  return "";
}
```

⚠️ Potential issue | 🟠 Major

Respect literal publisher values.

If a dataset encodes dct:publisher as a literal (common in the wild), extractString returns that literal, but the lookup for an object with a matching @id fails and the original text is dropped, so the maintainer field ends up empty. Add a fallback that returns the literal when no matching agent object exists.

Suggested tweak:

```diff
   const agentId = extractString(obj, predicate);
   if (agentId) {
     const agent = allObjects.find((o) => o["@id"] === agentId);
     if (agent) {
       return (
         extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
         extractString(agent, "http://purl.org/dc/terms/title") ||
         ""
       );
     }
+    return agentId;
   }
   return "";
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
 export function extractAgentName(
   obj: any,
   predicate: string,
   allObjects: any[]
 ): string {
   const agentId = extractString(obj, predicate);
   if (agentId) {
     const agent = allObjects.find((o) => o["@id"] === agentId);
     if (agent) {
       return (
         extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
         extractString(agent, "http://purl.org/dc/terms/title") ||
         ""
       );
     }
+    return agentId;
   }
   return "";
 }
```
🤖 Prompt for AI Agents
In src/lib/dcat.ts around lines 24 to 41, extractAgentName currently treats the
value returned by extractString as an ID and only returns a resolved agent's
name, which drops literal publisher values when no matching object exists;
change the logic so that if extractString(obj, predicate) yields a non-empty
value but no object in allObjects matches that value, return that literal value
as a fallback (i.e., return the original agentId string) while keeping the
existing behavior when a matching agent object is found.
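A quick illustration of the two cases the fallback covers, using hypothetical expanded JSON-LD input and assuming extractString returns the first @value or @id as described above:

```ts
// Hypothetical expanded JSON-LD inputs for the two publisher cases.
const allObjects = [
  {
    "@id": "https://example.org/org/1",
    "http://xmlns.com/foaf/0.1/name": [{ "@value": "Example City Council" }],
  },
];

// Case 1: dct:publisher is a reference that resolves to an agent node.
const referencingDataset = {
  "http://purl.org/dc/terms/publisher": [{ "@id": "https://example.org/org/1" }],
};
// extractAgentName(referencingDataset, "http://purl.org/dc/terms/publisher", allObjects)
//   -> "Example City Council"  (unchanged behavior)

// Case 2: dct:publisher is a bare literal with no matching node in the graph.
const literalDataset = {
  "http://purl.org/dc/terms/publisher": [{ "@value": "Example City Council" }],
};
// extractAgentName(literalDataset, "http://purl.org/dc/terms/publisher", allObjects)
//   -> "Example City Council" with the `return agentId;` fallback (was "" before)
```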

coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4f33a4 and 8bebc21.

📒 Files selected for processing (3)
  • src/harvesters/base.ts (2 hunks)
  • src/harvesters/ckan.ts (2 hunks)
  • src/harvesters/dcatap.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/harvesters/base.ts (2)
src/schemas/portaljs-cloud.d.ts (1)
  • PortalJsCloudDataset (2-25)
src/lib/portaljs-cloud.ts (1)
  • getDatasetList (12-19)
src/harvesters/dcatap.ts (4)
src/schemas/portaljs-cloud.d.ts (2)
  • CkanResource (27-32)
  • PortalJsCloudDataset (2-25)
src/schemas/ckanPackage.d.ts (1)
  • CkanPackage (1-3)
src/harvesters/base.ts (1)
  • BaseHarvesterConfig (6-14)
config.ts (1)
  • env (24-24)
src/harvesters/ckan.ts (1)
src/schemas/ckanPackage.d.ts (1)
  • CkanPackage (1-3)

Comment on lines +17 to +18
```ts
  SourceDatasetT extends { [k: string]: string } = any,
  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
```

⚠️ Potential issue | 🔴 Critical

Constraint breaks every harvester subtype.

Forcing SourceDatasetT to extend { [k: string]: string } means every property must be a string. Our CKAN/DCAT packages carry arrays, objects, booleans (resources, extras, isopen, etc.), so types like DCATAPCkanPackage no longer satisfy the constraint and CkanHarvester<DCATAPCkanPackage> won’t compile. Drop this overly narrow bound (or widen it to Record<string, unknown>) so richer source schemas still type‑check.

Apply this diff:

```diff
-export abstract class BaseHarvester<
-  SourceDatasetT extends { [k: string]: string } = any,
-  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
+export abstract class BaseHarvester<
+  SourceDatasetT extends Record<string, unknown> = any,
+  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
```
🤖 Prompt for AI Agents
In src/harvesters/base.ts around lines 17-18, the generic constraint
"SourceDatasetT extends { [k: string]: string }" is too narrow and breaks
harvesters whose source schemas include arrays, booleans, or nested objects;
change the constraint to either remove it entirely or widen it to
"Record<string, unknown>" (preferred) and keep the same default generic (e.g.,
SourceDatasetT extends Record<string, unknown> = any) so types like
DCATAPCkanPackage and CkanHarvester<DCATAPCkanPackage> compile again.
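A minimal check of why the wider bound matters; the field names below are assumptions, not the actual DCATAPCkanPackage definition:

```ts
// Hypothetical source shape with non-string members: rejected by
// `extends { [k: string]: string }`, accepted by `extends Record<string, unknown>`.
interface ExampleSourcePackage {
  name: string;
  isopen: boolean;
  resources: Array<{ url: string; format?: string }>;
  extras: Array<{ key: string; value: string }>;
  [k: string]: unknown; // index signature so the interface satisfies Record<string, unknown>
}

// With the widened bound, a harvester can be parameterized as e.g.
//   class ExampleHarvester extends BaseHarvester<ExampleSourcePackage> { ... }
// whereas the string-only bound flags `isopen`, `resources`, and `extras` as errors.
```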

Comment on lines +156 to +169
```ts
      // DCAT-AP specific fields (mapped from extras or direct fields)
      language: pkg.language || extras.language || "EN",
      // frequency: extras.frequency || "",
      // temporal_start: extras.temporal_start || "",
      // temporal_end: extras.temporal_end || "",
      // publisher_name: extras.publisher_name || pkg.organization?.title || "",
      // publisher_email: extras.publisher_email || "",
      // contact_name: extras.contact_name || pkg.maintainer || pkg.author || "",
      // contact_email:
      //   extras.contact_email || pkg.maintainer_email || pkg.author_email || "",
      // theme: extras.theme || "",
      // conforms_to: extras.conforms_to || "",
      // extras: pkg.extras || [],
    };
```

⚠️ Potential issue | 🟠 Major

Preserve CKAN extras instead of dropping them.

You flatten pkg.extras into extras for lookups, but the returned dataset never exposes those extras. That strips DCAT-AP metadata we don’t map explicitly (frequency, temporal, publisher, etc.), so the target loses data. Forward the extras array (e.g. extras: pkg.extras ?? []) while still deriving specific fields from it.

🤖 Prompt for AI Agents
In src/harvesters/dcatap.ts around lines 156 to 169, the code flattens
pkg.extras into a local extras object for lookups but then omits forwarding the
original pkg.extras array in the returned dataset, which drops unmapped DCAT-AP
metadata; restore the extras array by adding a property like extras: pkg.extras
?? [] to the returned object while keeping the existing derived fields (language
and others) sourced from the flattened extras, so unmapped extras are preserved
and mapped fields continue to use the lookup values.
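A hedged sketch of the suggested shape; the flattened lookup and field names follow the surrounding review, not the actual dcatap.ts code:

```ts
// Illustrative only: read known keys from a flattened lookup, but still forward the raw array.
type Extra = { key: string; value: string };

function mapExtras(pkg: { language?: string; extras?: Extra[] }) {
  // Flatten for convenient lookups of explicitly mapped keys...
  const extras = Object.fromEntries((pkg.extras ?? []).map((e) => [e.key, e.value]));

  return {
    language: pkg.language || extras.language || "EN",
    // ...and pass the original array through so unmapped DCAT-AP keys survive.
    extras: pkg.extras ?? [],
  };
}
```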

```diff
-  abstract mapSourceDatasetToTarget(
-    dataset: SourceDatasetT
-  ): PortalJsCloudDataset;
+  abstract mapSourceDatasetToTarget(dataset: SourceDatasetT): TargetDatasetT;
```
Member commented:

what's this? @steveoni

```diff
     return await getDatasetList();
   }
-  async upsertIntoTarget({ dataset }: { dataset: PortalJsCloudDataset }) {
+  async upsertIntoTarget({ dataset }: { dataset: TargetDatasetT }) {
```
Member commented:

@steveoni what's the point of renaming it?
