implement dcat harvester #12
base: main

Conversation
Walkthrough

Adds a new DCAT harvester and JSON-LD helpers, updates `BaseHarvester` generics, and extends `PortalJsCloudDataset` with `url`, `license_url`, and `extras`. The harvester fetches expanded DCAT JSON-LD, resolves distributions and agents, and maps datasets to `PortalJsCloudDataset` objects.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Operator
    participant Harvester as DCATHarvester
    participant Source as "HTTP DCAT Endpoint"
    participant JSONLD as "Expanded JSON-LD"
    participant Mapper as "mapSourceDatasetToTarget"
    participant Target as "PortalJsCloudDataset"
    Operator->>Harvester: run()
    Harvester->>Source: GET source URL
    Source-->>Harvester: 200 OK (expanded JSON-LD)
    Harvester->>JSONLD: build @id index
    Harvester->>JSONLD: filter @type == dcat:Dataset
    Harvester->>JSONLD: resolve distributions & publisher
    Harvester->>Mapper: map dataset -> PortalJsCloudDataset
    Mapper-->>Harvester: mapped dataset (resources, extras, url, license_url, language)
    Harvester-->>Target: emit upsert payload
    note right of Target #DFF0D8: Target receives PortalJsCloudDataset objects
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/schemas/portaljs-cloud.d.ts (1)

Lines 27-32: Extend `CkanResource` with `license_url`.

`mapSourceDatasetToTarget` now attaches `license_url` to each resource, but `CkanResource` still disallows that property. TypeScript will reject the harvester build with “Object literal may only specify known properties.” Please add an optional `license_url?: string;` here. Consider:

```diff
 export interface CkanResource {
   url?: string;
   name?: string;
   description?: string;
   format?: string;
+  license_url?: string;
 }
```
🧹 Nitpick comments (1)

src/harvesters/dcat.ts (1)

Lines 30-31: Remove the unused `objectMap`.

The map is built and never read, so it just adds overhead. Please drop it or use it for lookups.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- src/harvesters/dcat.ts (1 hunks)
- src/lib/dcat.ts (1 hunks)
- src/schemas/portaljs-cloud.d.ts (2 hunks)
🧰 Additional context used

🧬 Code graph analysis (1)

src/harvesters/dcat.ts (5)

- src/harvesters/index.ts (1): Harvester (12-14)
- src/lib/dcat.ts (6): DCATDataset (8-12), extractDistributions (53-66), extractAgentName (24-41), DCATDistribution (2-6), extractString (15-22), extractStringArray (44-50)
- src/harvesters/base.ts (1): BaseHarvesterConfig (6-14)
- src/schemas/portaljs-cloud.d.ts (1): PortalJsCloudDataset (2-25)
- config.ts (1): env (24-24)
```typescript
const extractedLanguage = extractString(
  pkg,
  "http://purl.org/dc/terms/language"
);
const validLanguages = ["EN", "FR", "ES", "DE", "IT"];
const language = (
  validLanguages.includes(extractedLanguage) ? extractedLanguage : "EN"
) as "EN" | "FR" | "ES" | "DE" | "IT";
```
Normalize incoming language codes.

Most DCAT feeds emit `dct:language` as lowercase codes (e.g. `en`) or URIs (e.g. `/language/ENG`). Comparing the raw value against `["EN","FR","ES","DE","IT"]` forces nearly everything to the fallback `"EN"`. Normalize the value (uppercase, trim the URI suffix, etc.) before validation so genuine locales survive.

One minimal fix:

```diff
- const extractedLanguage = extractString(
-   pkg,
-   "http://purl.org/dc/terms/language"
- );
+ const rawLanguage =
+   extractString(pkg, "http://purl.org/dc/terms/language") || "";
+ const extractedLanguage = rawLanguage
+   .split("/")
+   .pop()
+   ?.slice(0, 2)
+   .toUpperCase() || "";
```

This keeps the existing whitelist working for common inputs instead of hard-defaulting to `"EN"`.
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
const rawLanguage =
  extractString(pkg, "http://purl.org/dc/terms/language") || "";
const extractedLanguage = rawLanguage
  .split("/")
  .pop()
  ?.slice(0, 2)
  .toUpperCase() || "";
const validLanguages = ["EN", "FR", "ES", "DE", "IT"];
const language = (
  validLanguages.includes(extractedLanguage) ? extractedLanguage : "EN"
) as "EN" | "FR" | "ES" | "DE" | "IT";
```
🤖 Prompt for AI Agents
In src/harvesters/dcat.ts around lines 140 to 147, the code currently compares
the raw extractedLanguage directly to the uppercase whitelist causing most
inputs (like "en" or "/language/ENG") to fall back to "EN"; normalize the
extractedLanguage before validation by trimming whitespace, converting to
uppercase, and if it looks like a URI or contains slashes or hashes, take the
last path/fragment segment (or strip non-letter characters) to yield a plain
code, then check that normalized value against the existing validLanguages array
and use it if valid, otherwise default to "EN".
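The normalization the reviewer describes can also be factored into a standalone helper. This is an illustrative sketch, not code from the PR: the name `normalizeLanguage` and the exact heuristics (take the last URI segment, keep the first two letters) are assumptions layered on the suggestion above.

```typescript
// Hypothetical helper sketching the normalization described above.
// The function name and two-letter heuristic are illustrative, not part of the PR.
type Lang = "EN" | "FR" | "ES" | "DE" | "IT";

function normalizeLanguage(raw: string | undefined): Lang {
  const validLanguages: readonly Lang[] = ["EN", "FR", "ES", "DE", "IT"];
  const candidate = (raw ?? "")
    .trim()
    .split(/[\/#]/) // keep only the last URI path/fragment segment
    .pop()!
    .slice(0, 2) // "ENG" -> "EN", "fr-FR" -> "fr"
    .toUpperCase();
  return (validLanguages as readonly string[]).includes(candidate)
    ? (candidate as Lang)
    : "EN";
}
```

With this shape, `en`, `fr-FR`, and authority URIs ending in `/FRA` all survive the whitelist instead of collapsing to `"EN"`.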
```typescript
export function extractAgentName(
  obj: any,
  predicate: string,
  allObjects: any[]
): string {
  const agentId = extractString(obj, predicate);
  if (agentId) {
    const agent = allObjects.find((o) => o["@id"] === agentId);
    if (agent) {
      return (
        extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
        extractString(agent, "http://purl.org/dc/terms/title") ||
        ""
      );
    }
  }
  return "";
}
```
Respect literal publisher values.

If a dataset encodes `dct:publisher` as a literal (common in the wild), `extractString` returns that literal, but the lookup for an object with the same `@id` fails and we drop the original text, so the maintainer ends up empty. Add a fallback to return the literal when no matching agent object exists.

Suggested tweak:

```diff
   const agentId = extractString(obj, predicate);
   if (agentId) {
     const agent = allObjects.find((o) => o["@id"] === agentId);
     if (agent) {
       return (
         extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
         extractString(agent, "http://purl.org/dc/terms/title") ||
         ""
       );
     }
+    return agentId;
   }
   return "";
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
export function extractAgentName(
  obj: any,
  predicate: string,
  allObjects: any[]
): string {
  const agentId = extractString(obj, predicate);
  if (agentId) {
    const agent = allObjects.find((o) => o["@id"] === agentId);
    if (agent) {
      return (
        extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
        extractString(agent, "http://purl.org/dc/terms/title") ||
        ""
      );
    }
    return agentId;
  }
  return "";
}
```
🤖 Prompt for AI Agents
In src/lib/dcat.ts around lines 24 to 41, extractAgentName currently treats the
value returned by extractString as an ID and only returns a resolved agent's
name, which drops literal publisher values when no matching object exists;
change the logic so that if extractString(obj, predicate) yields a non-empty
value but no object in allObjects matches that value, return that literal value
as a fallback (i.e., return the original agentId string) while keeping the
existing behavior when a matching agent object is found.
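To make the two branches concrete, here is a self-contained sketch exercising the fixed function against expanded JSON-LD. The `extractString` stand-in below is a simplified assumption for illustration; the real implementation lives in `src/lib/dcat.ts` and may differ.

```typescript
// Simplified stand-in for the repo's extractString (assumption, not the real code):
// reads the first @value or @id of a predicate in expanded JSON-LD.
function extractString(obj: any, predicate: string): string {
  const v = obj?.[predicate];
  if (Array.isArray(v)) return v[0]?.["@value"] ?? v[0]?.["@id"] ?? "";
  return typeof v === "string" ? v : "";
}

// extractAgentName with the suggested literal fallback applied.
function extractAgentName(obj: any, predicate: string, allObjects: any[]): string {
  const agentId = extractString(obj, predicate);
  if (agentId) {
    const agent = allObjects.find((o) => o["@id"] === agentId);
    if (agent) {
      return (
        extractString(agent, "http://xmlns.com/foaf/0.1/name") ||
        extractString(agent, "http://purl.org/dc/terms/title") ||
        ""
      );
    }
    return agentId; // fallback: keep the literal publisher value
  }
  return "";
}

const pub = "http://purl.org/dc/terms/publisher";
// Branch 1: publisher is a node reference that resolves to an agent object.
const dsWithNode = { [pub]: [{ "@id": "_:agent1" }] };
const agents = [
  { "@id": "_:agent1", "http://xmlns.com/foaf/0.1/name": [{ "@value": "Open Data Office" }] },
];
// Branch 2: publisher is a plain literal with no matching agent object.
const dsWithLiteral = { [pub]: [{ "@value": "City of Example" }] };
```

The first dataset resolves to the agent's `foaf:name`; the second now returns the literal instead of an empty string.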
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- src/harvesters/base.ts (2 hunks)
- src/harvesters/ckan.ts (2 hunks)
- src/harvesters/dcatap.ts (1 hunks)
🧰 Additional context used

🧬 Code graph analysis (3)

src/harvesters/base.ts (2)

- src/schemas/portaljs-cloud.d.ts (1): PortalJsCloudDataset (2-25)
- src/lib/portaljs-cloud.ts (1): getDatasetList (12-19)

src/harvesters/dcatap.ts (4)

- src/schemas/portaljs-cloud.d.ts (2): CkanResource (27-32), PortalJsCloudDataset (2-25)
- src/schemas/ckanPackage.d.ts (1): CkanPackage (1-3)
- src/harvesters/base.ts (1): BaseHarvesterConfig (6-14)
- config.ts (1): env (24-24)

src/harvesters/ckan.ts (1)

- src/schemas/ckanPackage.d.ts (1): CkanPackage (1-3)
```typescript
  SourceDatasetT extends { [k: string]: string } = any,
  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
```
Constraint breaks every harvester subtype.

Forcing `SourceDatasetT` to extend `{ [k: string]: string }` means every property must be a string. Our CKAN/DCAT packages carry arrays, objects, and booleans (`resources`, `extras`, `isopen`, etc.), so types like `DCATAPCkanPackage` no longer satisfy the constraint and `CkanHarvester<DCATAPCkanPackage>` won’t compile. Drop this overly narrow bound (or widen it to `Record<string, unknown>`) so richer source schemas still type-check.

Apply this diff:

```diff
-export abstract class BaseHarvester<
-  SourceDatasetT extends { [k: string]: string } = any,
-  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
+export abstract class BaseHarvester<
+  SourceDatasetT extends Record<string, unknown> = any,
+  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
```
🤖 Prompt for AI Agents
In src/harvesters/base.ts around lines 17-18, the generic constraint
"SourceDatasetT extends { [k: string]: string }" is too narrow and breaks
harvesters whose source schemas include arrays, booleans, or nested objects;
change the constraint to either remove it entirely or widen it to
"Record<string, unknown>" (preferred) and keep the same default generic (e.g.,
SourceDatasetT extends Record<string, unknown> = any) so types like
DCATAPCkanPackage and CkanHarvester<DCATAPCkanPackage> compile again.
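A minimal sketch of why the widened bound compiles where the string-only bound fails. The class and package shapes below are simplified stand-ins for the repo's actual definitions, not the committed code:

```typescript
// Simplified stand-in for the repo's target type (assumption for illustration).
interface PortalJsCloudDataset { name: string; notes: string }

// Widened bound: any object-shaped source schema qualifies.
abstract class BaseHarvester<
  SourceDatasetT extends Record<string, unknown> = any,
  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
> {
  abstract mapSourceDatasetToTarget(dataset: SourceDatasetT): TargetDatasetT;
}

// A richer source schema with non-string properties. Declared as a type alias
// so it structurally satisfies Record<string, unknown> (interfaces need an
// explicit index signature for that).
type DCATAPCkanPackage = {
  name: string;
  isopen: boolean;                      // boolean property
  resources: { url: string }[];         // array of objects
  extras: { key: string; value: string }[];
};

// Under the old `{ [k: string]: string }` bound this class would not compile.
class DCATAPHarvester extends BaseHarvester<DCATAPCkanPackage> {
  mapSourceDatasetToTarget(pkg: DCATAPCkanPackage): PortalJsCloudDataset {
    return { name: pkg.name, notes: `${pkg.resources.length} resources` };
  }
}
```

The type-alias-vs-interface caveat is worth noting in the fix: interfaces without an index signature do not assign to `Record<string, unknown>` in TypeScript, so existing interface-based package types may need an index signature or conversion to type aliases.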
```typescript
  // DCAT-AP specific fields (mapped from extras or direct fields)
  language: pkg.language || extras.language || "EN",
  // frequency: extras.frequency || "",
  // temporal_start: extras.temporal_start || "",
  // temporal_end: extras.temporal_end || "",
  // publisher_name: extras.publisher_name || pkg.organization?.title || "",
  // publisher_email: extras.publisher_email || "",
  // contact_name: extras.contact_name || pkg.maintainer || pkg.author || "",
  // contact_email:
  //   extras.contact_email || pkg.maintainer_email || pkg.author_email || "",
  // theme: extras.theme || "",
  // conforms_to: extras.conforms_to || "",
  // extras: pkg.extras || [],
};
```
Preserve CKAN extras instead of dropping them.

You flatten `pkg.extras` into `extras` for lookups, but the returned dataset never exposes those extras. That strips DCAT-AP metadata we don’t map explicitly (frequency, temporal, publisher, etc.), so the target loses data. Forward the extras array (e.g. `extras: pkg.extras ?? []`) while still deriving specific fields from it.
🤖 Prompt for AI Agents
In src/harvesters/dcatap.ts around lines 156 to 169, the code flattens
pkg.extras into a local extras object for lookups but then omits forwarding the
original pkg.extras array in the returned dataset, which drops unmapped DCAT-AP
metadata; restore the extras array by adding a property like extras: pkg.extras
?? [] to the returned object while keeping the existing derived fields (language
and others) sourced from the flattened extras, so unmapped extras are preserved
and mapped fields continue to use the lookup values.
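The fix described above can be sketched as a small mapping helper: flatten the extras for lookups, but also forward the raw array. The shapes and the `mapExtras` name below are simplified stand-ins for illustration, not the repo's actual mapper.

```typescript
// Simplified stand-in for a CKAN extras entry.
interface Extra { key: string; value: string }

// Hypothetical mapper fragment showing the "flatten for lookup, forward raw" pattern.
function mapExtras(pkg: { extras?: Extra[]; language?: string }) {
  // Flatten [{key, value}] into a plain object for convenient lookups.
  const extras: Record<string, string> = Object.fromEntries(
    (pkg.extras ?? []).map((e) => [e.key, e.value])
  );
  return {
    language: pkg.language || extras.language || "EN",
    extras: pkg.extras ?? [], // forward the original array so unmapped metadata survives
  };
}

const mapped = mapExtras({
  extras: [
    { key: "language", value: "FR" },
    { key: "frequency", value: "annual" },
  ],
});
```

Here `language` is derived from the flattened lookup, while `frequency` (unmapped) still reaches the target inside the forwarded `extras` array.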
```diff
-  abstract mapSourceDatasetToTarget(
-    dataset: SourceDatasetT
-  ): PortalJsCloudDataset;
+  abstract mapSourceDatasetToTarget(dataset: SourceDatasetT): TargetDatasetT;
```
what's this? @steveoni
```diff
     return await getDatasetList();
   }
-  async upsertIntoTarget({ dataset }: { dataset: PortalJsCloudDataset }) {
+  async upsertIntoTarget({ dataset }: { dataset: TargetDatasetT }) {
```
@steveoni what's the point of renaming it?
Summary by CodeRabbit

- New Features
- Refactor