9 changes: 4 additions & 5 deletions src/harvesters/base.ts
@@ -14,7 +14,8 @@ export type BaseHarvesterConfig = {
};

export abstract class BaseHarvester<
-  SourceDatasetT extends { [k: string]: string } = any
+  SourceDatasetT extends { [k: string]: string } = any,
+  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
Comment on lines +17 to +18

⚠️ Potential issue | 🔴 Critical

Constraint breaks every harvester subtype.

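Forcing SourceDatasetT to extend { [k: string]: string } means every property must be a string. Our CKAN/DCAT packages carry arrays, objects, and booleans (resources, extras, isopen, etc.), so types like DCATAPCkanPackage no longer satisfy the constraint and CkanHarvester<DCATAPCkanPackage> won’t compile. Drop this overly narrow bound (or widen it to Record<string, unknown>) so richer source schemas still type‑check. Note that TypeScript interfaces have no implicit index signature, so if a source type is declared with interface rather than type, Record<string, unknown> may reject it too; object or Record<string, any> is the safer bound in that case.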

Apply this diff:

-export abstract class BaseHarvester<
-  SourceDatasetT extends { [k: string]: string } = any,
-  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
+export abstract class BaseHarvester<
+  SourceDatasetT extends Record<string, unknown> = any,
+  TargetDatasetT extends PortalJsCloudDataset = PortalJsCloudDataset
🤖 Prompt for AI Agents
In src/harvesters/base.ts around lines 17-18, the generic constraint
"SourceDatasetT extends { [k: string]: string }" is too narrow and breaks
harvesters whose source schemas include arrays, booleans, or nested objects;
change the constraint to either remove it entirely or widen it to
"Record<string, unknown>" (preferred) and keep the same default generic (e.g.,
SourceDatasetT extends Record<string, unknown> = any) so types like
DCATAPCkanPackage and CkanHarvester<DCATAPCkanPackage> compile again.
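
A minimal sketch of a bound that admits mixed-type source schemas, using hypothetical stand-ins (`SourcePackage`, `Target`, `SketchHarvester`) rather than the project's real types. It uses `object` instead of `Record<string, unknown>` on the assumption that some source types are declared as interfaces, which carry no implicit index signature; that choice should be verified against the real CkanPackage schema.

```typescript
// Hypothetical stand-ins: the real CkanPackage and PortalJsCloudDataset are not shown in this diff.
interface SourcePackage {
  name: string;
  isopen?: boolean;
  resources?: Array<{ url: string }>;
}

type Target = { name: string };

// `object` accepts interfaces and type aliases alike, whatever their property types.
abstract class SketchHarvester<
  SourceT extends object,
  TargetT extends Target = Target
> {
  abstract mapSourceDatasetToTarget(dataset: SourceT): TargetT;

  // Map every source dataset through the subclass's mapping function.
  mapAll(datasets: SourceT[]): TargetT[] {
    return datasets.map((d) => this.mapSourceDatasetToTarget(d));
  }
}

class SketchCkanHarvester extends SketchHarvester<SourcePackage> {
  mapSourceDatasetToTarget(pkg: SourcePackage): Target {
    return { name: `org--${pkg.name}` };
  }
}

const mappedPackages = new SketchCkanHarvester().mapAll([
  { name: "a", isopen: true, resources: [] },
]);
```

Here `SourcePackage` mixes strings, booleans, and arrays and still satisfies the bound.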

> {
protected config: BaseHarvesterConfig;

@@ -23,14 +24,12 @@ export abstract class BaseHarvester<
}

abstract getSourceDatasets(): Promise<SourceDatasetT[]>;
-  abstract mapSourceDatasetToTarget(
-    dataset: SourceDatasetT
-  ): PortalJsCloudDataset;
+  abstract mapSourceDatasetToTarget(dataset: SourceDatasetT): TargetDatasetT;
Member

what's this? @steveoni


async getTargetPreexistingDatasets(): Promise<string[]> {
return await getDatasetList();
}
-  async upsertIntoTarget({ dataset }: { dataset: PortalJsCloudDataset }) {
+  async upsertIntoTarget({ dataset }: { dataset: TargetDatasetT }) {
Member

@steveoni what's the point of renaming it?
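
One possible reading of the change, sketched with hypothetical stand-ins (`BaseDataset`, `RichDataset`, `SketchBase`; the real PortalJsCloudDataset and upsertDataset are not reproduced here): a second type parameter lets a subclass declare a richer target type, and the upsert method's parameter type then follows it. In this diff CkanHarvester passes only one type argument to BaseHarvester, so the default target type would still apply to its subclasses.

```typescript
// Hypothetical stand-ins for the project's dataset types.
type BaseDataset = { name: string };
type RichDataset = BaseDataset & { isopen: boolean };

abstract class SketchBase<SourceT, TargetT extends BaseDataset = BaseDataset> {
  abstract mapSourceDatasetToTarget(dataset: SourceT): TargetT;

  // The accepted dataset type follows whatever the subclass declares as its target.
  upsertIntoTarget({ dataset }: { dataset: TargetT }): TargetT {
    return dataset; // placeholder for the real upsert call
  }
}

// Uses the default target type.
class PlainHarvester extends SketchBase<{ id: string }> {
  mapSourceDatasetToTarget(src: { id: string }): BaseDataset {
    return { name: src.id };
  }
}

// Opts into a richer target type via the second type argument.
class RichHarvester extends SketchBase<{ id: string; open?: boolean }, RichDataset> {
  mapSourceDatasetToTarget(src: { id: string; open?: boolean }): RichDataset {
    return { name: src.id, isopen: src.open ?? false };
  }
}

const plainResult = new PlainHarvester().upsertIntoTarget({ dataset: { name: "a" } });
const richHarvester = new RichHarvester();
const richResult = richHarvester.upsertIntoTarget({
  dataset: richHarvester.mapSourceDatasetToTarget({ id: "b", open: true }),
});
```

With the parameter in place, `richResult.isopen` is typed as a boolean without a cast; without it, the extra fields would only be visible through the subclass's own return type.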

return await upsertDataset({
dataset,
dryRun: this.config.dryRun,
6 changes: 4 additions & 2 deletions src/harvesters/ckan.ts
@@ -6,7 +6,9 @@ import { Harvester } from ".";
import { getAllDatasets } from "../lib/ckan";

@Harvester
-class CkanHarvester<SourceDatasetT extends CkanPackage = CkanPackage> extends BaseHarvester<SourceDatasetT> {
+class CkanHarvester<
+  SourceDatasetT extends CkanPackage = CkanPackage
+> extends BaseHarvester<SourceDatasetT> {
constructor(args: BaseHarvesterConfig) {
super(args);
}
@@ -28,7 +30,7 @@ class CkanHarvester<SourceDatasetT extends CkanPackage = CkanPackage> extends Ba
resources: (pkg.resources || []).map((r: any) => ({
name: r.name,
url: r.url,
-        format: r.format
+        format: r.format,
})),

language: pkg.language || "EN",
176 changes: 176 additions & 0 deletions src/harvesters/dcat.ts
@@ -0,0 +1,176 @@
import { env } from "../../config";
import { BaseHarvester, BaseHarvesterConfig } from "./base";
import { PortalJsCloudDataset } from "@/schemas/portaljs-cloud";
import { Harvester } from ".";
import {
DCATDataset,
DCATDistribution,
extractString,
extractAgentName,
extractStringArray,
extractDistributions,
} from "../lib/dcat";

@Harvester
class DCATHarvester extends BaseHarvester<DCATDataset> {
constructor(args: BaseHarvesterConfig) {
super(args);
}

async getSourceDatasets(): Promise<DCATDataset[]> {
const url = this.config.source.url;
const res = await fetch(url);
if (!res.ok) {
throw new Error(
`Failed to fetch DCAT JSON-LD: ${res.status} ${res.statusText}`
);
}
const jsonLd: any[] = await res.json();

const objectMap = new Map<string, any>();
jsonLd.forEach((obj) => objectMap.set(obj["@id"], obj));

const datasets: DCATDataset[] = jsonLd
.filter((obj) =>
obj["@type"]?.includes("http://www.w3.org/ns/dcat#Dataset")
)
.map((dataset) => ({
...dataset,
distributions: extractDistributions(dataset, jsonLd),
resolvedPublisherName: extractAgentName(
dataset,
"http://purl.org/dc/terms/publisher",
jsonLd
),
}));

return datasets;
}

mapSourceDatasetToTarget(pkg: DCATDataset): PortalJsCloudDataset {
const owner_org = env.PORTALJS_CLOUD_MAIN_ORG;

// Map distributions to resources
const resources = (pkg.distributions || []).map(
(dist: DCATDistribution) => ({
name:
extractString(dist, "http://purl.org/dc/terms/title") ||
"Unnamed Resource",
url:
extractString(dist, "http://www.w3.org/ns/dcat#downloadURL") ||
extractString(dist, "http://www.w3.org/ns/dcat#accessURL") ||
"",
format:
extractString(dist, "http://purl.org/dc/terms/format") ||
extractString(dist, "http://www.w3.org/ns/dcat#mediaType") ||
"",
description:
extractString(dist, "http://purl.org/dc/terms/description") || "",
license_url:
extractString(dist, "http://purl.org/dc/terms/license") || "",
})
);

const extras: Array<{ key: string; value: string }> = [];
const extraMappings = [
{ predicate: "http://purl.org/dc/terms/issued", key: "issued" },
{ predicate: "http://purl.org/dc/terms/modified", key: "modified" },
{
predicate: "http://www.w3.org/2002/07/owl#versionInfo",
key: "dcat_version",
},
{
predicate: "http://purl.org/dc/terms/accrualPeriodicity",
key: "frequency",
},
{
predicate: "http://purl.org/dc/terms/conformsTo",
key: "conforms_to",
isArray: true,
},
{
predicate: "http://purl.org/dc/terms/accessRights",
key: "access_rights",
},
{ predicate: "http://purl.org/dc/terms/provenance", key: "provenance" },
{ predicate: "http://purl.org/dc/terms/type", key: "dcat_type" },
{ predicate: "http://purl.org/dc/terms/spatial", key: "spatial_uri" },
{ predicate: "http://purl.org/dc/terms/publisher", key: "publisher_uri" },
];

extraMappings.forEach(({ predicate, key, isArray = false }) => {
const value = isArray
? extractStringArray(pkg, predicate).join(", ")
: extractString(pkg, predicate);
if (value) extras.push({ key, value });
});

const skippedKeys = [
"@id",
"@type",
"distributions",
"http://www.w3.org/ns/dcat#distribution",
"http://purl.org/dc/terms/title",
"http://purl.org/dc/terms/description",
"http://purl.org/dc/terms/identifier",
"http://purl.org/dc/terms/issued",
"http://purl.org/dc/terms/modified",
"http://www.w3.org/2002/07/owl#versionInfo",
"http://purl.org/dc/terms/language",
"http://www.w3.org/ns/dcat#landingPage",
"http://xmlns.com/foaf/0.1/page",
"http://purl.org/dc/terms/accrualPeriodicity",
"http://purl.org/dc/terms/conformsTo",
"http://purl.org/dc/terms/accessRights",
"http://purl.org/dc/terms/provenance",
"http://purl.org/dc/terms/type",
"http://purl.org/dc/terms/spatial",
"http://purl.org/dc/terms/publisher",
"http://www.w3.org/ns/dcat#contactPoint",
"http://purl.org/dc/terms/creator",
"http://purl.org/dc/terms/license",
];
Object.keys(pkg).forEach((key) => {
if (!skippedKeys.includes(key)) {
const value = extractString(pkg, key) || JSON.stringify(pkg[key]);
if (value) extras.push({ key, value });
}
});

const extractedLanguage = extractString(
pkg,
"http://purl.org/dc/terms/language"
);
const validLanguages = ["EN", "FR", "ES", "DE", "IT"];
const language = (
validLanguages.includes(extractedLanguage) ? extractedLanguage : "EN"
) as "EN" | "FR" | "ES" | "DE" | "IT";
Comment on lines +140 to +147

⚠️ Potential issue | 🟠 Major

Normalize incoming language codes.

Most DCAT feeds emit dct:language as lowercase codes (e.g. en) or URIs (e.g. /language/ENG). Comparing the raw value against ["EN","FR","ES","DE","IT"] forces nearly everything to the fallback "EN". Normalize the value (uppercase, trim URI suffix, etc.) before validation so genuine locales survive.

One minimal fix:

-    const extractedLanguage = extractString(
-      pkg,
-      "http://purl.org/dc/terms/language"
-    );
+    const rawLanguage =
+      extractString(pkg, "http://purl.org/dc/terms/language") || "";
+    const extractedLanguage = rawLanguage
+      .split("/")
+      .pop()
+      ?.slice(0, 2)
+      .toUpperCase() || "";

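This keeps the existing whitelist working for common two-letter inputs instead of hard-defaulting to "EN". Three-letter codes whose first two letters differ from the two-letter code (e.g. SPA for Spanish, GER for German) would still fall back to "EN" and need an explicit mapping.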

🤖 Prompt for AI Agents
In src/harvesters/dcat.ts around lines 140 to 147, the code currently compares
the raw extractedLanguage directly to the uppercase whitelist causing most
inputs (like "en" or "/language/ENG") to fall back to "EN"; normalize the
extractedLanguage before validation by trimming whitespace, converting to
uppercase, and if it looks like a URI or contains slashes or hashes, take the
last path/fragment segment (or strip non-letter characters) to yield a plain
code, then check that normalized value against the existing validLanguages array
and use it if valid, otherwise default to "EN".
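
A minimal sketch of the normalization described above, assuming an explicit alias table rather than truncating to two characters (truncation would turn SPA into "SP" and GER into "GE", neither of which is whitelisted). The names `normalizeLanguage` and `LANGUAGE_ALIASES` are hypothetical, and the alias list is an illustrative assumption, not an exhaustive mapping.

```typescript
type Language = "EN" | "FR" | "ES" | "DE" | "IT";

// Assumed alias table: two-letter codes plus common three-letter codes for the same languages.
const LANGUAGE_ALIASES: Record<string, Language> = {
  EN: "EN", ENG: "EN",
  FR: "FR", FRA: "FR", FRE: "FR",
  ES: "ES", SPA: "ES",
  DE: "DE", DEU: "DE", GER: "DE",
  IT: "IT", ITA: "IT",
};

function normalizeLanguage(
  raw: string | undefined,
  fallback: Language = "EN"
): Language {
  if (!raw) return fallback;
  // Take the last non-empty path or fragment segment, so URIs reduce to their code.
  const segment = raw.trim().split(/[\/#]/).filter(Boolean).pop() ?? "";
  // Drop a region suffix such as "-CA" or "_gb", then uppercase.
  const code = (segment.split(/[-_]/)[0] ?? "").toUpperCase();
  return LANGUAGE_ALIASES[code] ?? fallback;
}

const languageSamples = [
  normalizeLanguage("en"),
  normalizeLanguage("http://publications.europa.eu/resource/authority/language/SPA"),
  normalizeLanguage("fr-CA"),
  normalizeLanguage("zz"),
];
```

Unknown or missing values still fall back to "EN", so the existing default behaviour is preserved for inputs the table does not cover.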

const datasetLicense =
extractString(pkg, "http://purl.org/dc/terms/license") ||
(resources.length > 0 ? (resources[0] as any).license_url || "" : "");

// Map to PortalJsCloudDataset (based on ckanext-dcat mappings)
return {
owner_org,
name: `${owner_org}--${
extractString(pkg, "http://purl.org/dc/terms/identifier") ||
pkg["@id"].split("/").pop() ||
"unknown"
}`,
title: extractString(pkg, "http://purl.org/dc/terms/title") || "",
notes: extractString(pkg, "http://purl.org/dc/terms/description") || "",
url: extractString(pkg, "http://www.w3.org/ns/dcat#landingPage") || "",
language,
author: extractString(pkg, "http://purl.org/dc/terms/creator") || "",
maintainer: (pkg as any).resolvedPublisherName || "",
license_id: extractString(pkg, "http://purl.org/dc/terms/license") || "",
license_url: datasetLicense,
contact_point:
extractString(pkg, "http://www.w3.org/ns/dcat#contactPoint") || "",
resources,
extras,
};
}
}

export { DCATHarvester };
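
In getSourceDatasets above, objectMap is built but the helpers are still handed the full jsonLd array, and the response is assumed to be a flat array. A hedged sketch of an alternative follows: it accepts either a flat array or a document wrapped in "@graph", and resolves distribution references through an id index. All function names here (`toNodes`, `indexById`, `isDataset`, `resolveDistributions`) are hypothetical and do not reflect the actual helpers in ../lib/dcat, which are not shown in this diff.

```typescript
type JsonLdNode = {
  "@id"?: string;
  "@type"?: string | string[];
  [key: string]: unknown;
};

const DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset";
const DCAT_DISTRIBUTION = "http://www.w3.org/ns/dcat#distribution";

// Accept either a flat array of nodes or a document wrapping them in "@graph".
function toNodes(doc: unknown): JsonLdNode[] {
  if (Array.isArray(doc)) return doc as JsonLdNode[];
  if (doc && typeof doc === "object") {
    const graph = (doc as Record<string, unknown>)["@graph"];
    if (Array.isArray(graph)) return graph as JsonLdNode[];
  }
  return [];
}

// Index nodes by "@id" so references resolve without rescanning the array.
function indexById(nodes: JsonLdNode[]): Map<string, JsonLdNode> {
  const index = new Map<string, JsonLdNode>();
  for (const node of nodes) {
    const id = node["@id"];
    if (typeof id === "string") index.set(id, node);
  }
  return index;
}

function isDataset(node: JsonLdNode): boolean {
  const t = node["@type"];
  return Array.isArray(t) ? t.includes(DCAT_DATASET) : t === DCAT_DATASET;
}

// Follow dcat:distribution references ({ "@id": ... } objects or bare id strings).
function resolveDistributions(
  dataset: JsonLdNode,
  index: Map<string, JsonLdNode>
): JsonLdNode[] {
  const refs = dataset[DCAT_DISTRIBUTION];
  const list: unknown[] = Array.isArray(refs) ? refs : refs ? [refs] : [];
  const resolved: JsonLdNode[] = [];
  for (const ref of list) {
    const id =
      typeof ref === "string" ? ref : (ref as JsonLdNode | null)?.["@id"];
    const node = typeof id === "string" ? index.get(id) : undefined;
    if (node) resolved.push(node);
  }
  return resolved;
}

// Made-up sample document for illustration only.
const sampleDoc = {
  "@graph": [
    {
      "@id": "ex:ds1",
      "@type": [DCAT_DATASET],
      [DCAT_DISTRIBUTION]: [{ "@id": "ex:dist1" }, { "@id": "ex:missing" }],
    },
    { "@id": "ex:dist1", "@type": ["http://www.w3.org/ns/dcat#Distribution"] },
  ],
};

const sampleNodes = toNodes(sampleDoc);
const sampleIndex = indexById(sampleNodes);
const sampleDatasets = sampleNodes.filter(isDataset);
const sampleDistributions = sampleDatasets.map((d) =>
  resolveDistributions(d, sampleIndex)
);
```

References that do not resolve (such as `ex:missing` in the sample) are skipped silently in this sketch; whether to log or fail on them is a design decision left open.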
171 changes: 171 additions & 0 deletions src/harvesters/dcatap.ts
@@ -0,0 +1,171 @@
import { env } from "../../config";
import { CkanHarvester } from "./ckan";
import { Harvester } from ".";
import { BaseHarvesterConfig } from "./base";
import { CkanPackage } from "@/schemas/ckanPackage";
import { PortalJsCloudDataset, CkanResource } from "@/schemas/portaljs-cloud";

/**
* Extended CKAN Package type with additional DCAT-AP fields
*/
export interface DCATAPResource extends CkanResource {
hash?: string;
mimetype?: string | null;
mimetype_inner?: string | null;
cache_url?: string | null;
cache_last_updated?: string | null;
datastore_active?: boolean;
created?: string;
last_modified?: string;
state?: string;
position?: number;
id?: string;
revision_id?: string;
url_type?: string;
resource_type?: string | null;
size?: number | string | null;
package_id?: string;
}

// Then extend both interfaces
export interface DCATAPCkanPackage extends CkanPackage {
license_title?: string;
license_id?: string;
license_url?: string;
maintainer?: string;
maintainer_email?: string;
author?: string;
author_email?: string;
metadata_created?: string;
metadata_modified?: string;
tags?: Array<{
name: string;
display_name?: string;
id?: string;
state?: string;
}>;
groups?: Array<{
name: string;
title?: string;
display_name?: string;
description?: string;
id?: string;
}>;
organization?: {
title?: string;
name?: string;
description?: string;
id?: string;
};
isopen?: boolean;
version?: string;
url?: string;
state?: string;
type?: string;
extras?: Array<{
key: string;
value: string;
}>;
resources?: DCATAPResource[]; // Add this line to explicitly define resource type
}

// Finally extend the PortalJsCloudDataset interface
export interface DCATAPPortalJsDataset extends PortalJsCloudDataset {
license_title?: string;
license_url?: string;
metadata_created?: string;
metadata_modified?: string;
state?: string;
private?: boolean;
isopen?: boolean;
type?: string;
extras?: Array<{
key: string;
value: string;
}>;
resources?: DCATAPResource[]; // Override with extended resource type
}

@Harvester
export class DCATAPHarvester extends CkanHarvester<DCATAPCkanPackage> {
constructor(args: BaseHarvesterConfig) {
super(args);
}

mapSourceDatasetToTarget(pkg: DCATAPCkanPackage): DCATAPPortalJsDataset {
const owner_org = env.PORTALJS_CLOUD_MAIN_ORG;

// Map resources with more fields according to DCAT-AP
const resources = (pkg.resources || []).map((r) => ({
name: r.name || "",
url: r.url || "",
format: r.format || "",
description: r.description || "",
hash: r.hash || "",
mimetype: r.mimetype || "",
mimetype_inner: r.mimetype_inner || "",
size: r.size ? String(r.size) : undefined,
created: r.created || "",
last_modified: r.last_modified || "",
id: r.id || "",
state: r.state || "active",
position: r.position !== undefined ? r.position : 0,
}));

//Todo: Ask about how portaljs handle tags and groups harvesting
// const tags = pkg.tags ? pkg.tags.map((tag) => tag.name) : [];
// const groups = pkg.groups ? pkg.groups.map((group) => group.name) : [];

// Build extras from fields that don't have direct mapping
const extras: Record<string, any> = {};
pkg.extras?.forEach((extra) => {
extras[extra.key] = extra.value;
});

// Map to DCAT-AP compliant structure
return {
// Core metadata
owner_org,
name: `${owner_org}--${pkg.name}`,
title: pkg.title || "",
notes: pkg.notes || "",
url: pkg.url || "",
version: pkg.version || "",
type: pkg.type || "dataset",

// Temporal metadata
metadata_created: pkg.metadata_created || "",
metadata_modified: pkg.metadata_modified || "",

// Licensing and access
license_id: pkg.license_id || "",
license_title: pkg.license_title || "",
license_url: pkg.license_url || "",
private: pkg.private || false,
isopen: pkg.isopen || false,

// Attribution
author: pkg.author || "",
author_email: pkg.author_email || "",
maintainer: pkg.maintainer || "",
maintainer_email: pkg.maintainer_email || "",

// Resources
resources,

// DCAT-AP specific fields (mapped from extras or direct fields)
language: pkg.language || extras.language || "EN",
// frequency: extras.frequency || "",
// temporal_start: extras.temporal_start || "",
// temporal_end: extras.temporal_end || "",
// publisher_name: extras.publisher_name || pkg.organization?.title || "",
// publisher_email: extras.publisher_email || "",
// contact_name: extras.contact_name || pkg.maintainer || pkg.author || "",
// contact_email:
// extras.contact_email || pkg.maintainer_email || pkg.author_email || "",
// theme: extras.theme || "",
// conforms_to: extras.conforms_to || "",
// extras: pkg.extras || [],
};
Comment on lines +156 to +169

⚠️ Potential issue | 🟠 Major

Preserve CKAN extras instead of dropping them.

You flatten pkg.extras into extras for lookups, but the returned dataset never exposes those extras. That strips DCAT-AP metadata we don’t map explicitly (frequency, temporal, publisher, etc.), so the target loses data. Forward the extras array (e.g. extras: pkg.extras ?? []) while still deriving specific fields from it.

🤖 Prompt for AI Agents
In src/harvesters/dcatap.ts around lines 156 to 169, the code flattens
pkg.extras into a local extras object for lookups but then omits forwarding the
original pkg.extras array in the returned dataset, which drops unmapped DCAT-AP
metadata; restore the extras array by adding a property like extras: pkg.extras
?? [] to the returned object while keeping the existing derived fields (language
and others) sourced from the flattened extras, so unmapped extras are preserved
and mapped fields continue to use the lookup values.
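
A minimal sketch of the approach described above, assuming the key/value array shape that DCATAPPortalJsDataset declares for extras. The helper `splitExtras` and the sample values are hypothetical; the point is only that the lookup and the forwarded array come from the same input.

```typescript
type Extra = { key: string; value: string };

// Build a lookup for derived fields while keeping the original array for forwarding.
function splitExtras(extras: Extra[] | undefined): {
  list: Extra[];
  lookup: Map<string, string>;
} {
  const list = extras ?? [];
  const lookup = new Map<string, string>();
  for (const { key, value } of list) lookup.set(key, value);
  return { list, lookup };
}

// Made-up sample extras for illustration only.
const sampleExtras: Extra[] = [
  { key: "language", value: "FR" },
  { key: "frequency", value: "monthly" },
];

const { list: extrasList, lookup: extrasLookup } = splitExtras(sampleExtras);

const sketchDataset = {
  // Derived field still reads from the flattened lookup.
  language: extrasLookup.get("language") ?? "EN",
  // Unmapped extras such as "frequency" are forwarded rather than dropped.
  extras: extrasList,
};
```

Using a Map for the lookup avoids the `Record<string, any>` typing in the current code and keeps values typed as strings.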

}
}