[Import-Document-AI] Stix 2.1 model support gaps #4907

@blauwers

STIX 2.1 model support gaps in ImportDocumentAI connector

Description

The ImportDocumentAI connector intermittently fails to emit valid inter-entity STIX 2.1 relationships and lacks support for several STIX object types and workflows. In practice this shows up as:

  • Predicted LLM relationships between SDOs (e.g., uses, attributed-to) not appearing in OpenCTI, with only item→container links created.
  • Some invalid relationships being attempted against the API (e.g., wrong type pairs) instead of being pre-validated and skipped.
  • Limited support for STIX objects (SDOs/SCOs) commonly extracted from reports, resulting in missing coverage or dropped objects.
  • In span-based extraction mode, relationships are lost when temporary span IDs are not resolved to final STIX IDs before bundling.

Collectively, these gaps reduce fidelity of the imported knowledge graph and create noisy import behavior.

Environment

  1. OS (where OpenCTI server runs): {e.g., Ubuntu 22.04}
  2. OpenCTI version: {e.g., 6.x.y}
  3. OpenCTI client: Connector – ImportDocumentAI
  4. Other environment details: {DB/Queue, Python version, connector image tag}

Reproducible Steps

Smallest reproducible scenario using a short report that mentions a known intrusion set, malware family, and TTPs:

  1. Ingest a PDF or text document with content such as:

    • “APT Foo (aka Bar) uses MalwareX to deliver payloads via Phishing (T1566). They also leverage Web Protocols (T1071.001). Activity is attributed to Threat Group Z.”
  2. Configure ImportDocumentAI to use the LLM extraction path (OpenAI/Azure OpenAI) with span-based output enabled (default in many deployments).

  3. Run the connector so it extracts entities and predicted relationships.

  4. Inspect the resulting Report in OpenCTI and the import job logs.

Expected Output

  • Entities/observables are created for the intrusion set (or threat-actor), malware, and ATT&CK attack-patterns (T1566, T1071.001), plus extracted observables.

  • Inter-entity relationships are emitted and visible in the graph, e.g.:

    • intrusion-set --(uses)--> malware
    • intrusion-set --(uses)--> attack-pattern
    • intrusion-set --(attributed-to)--> threat-actor
  • Invalid relationships (e.g., domain-name --(located-at)--> country) are pre-validated against the OpenCTI schema and not sent to the API.

  • Container (Report / Grouping / Case) includes object_refs to all created objects, with correct “related-to” back-links where appropriate.
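For illustration, the expected bundle shape can be sketched with plain dictionaries. This is a minimal sketch, not the connector's code: the helper names (`stix_id`, `sdo`, `relationship`, `build_expected_bundle`) and the report name are hypothetical, random UUIDv4 identifiers are used for simplicity, and only a subset of the entities from the sample text is included.

```python
import uuid
from datetime import datetime, timezone


def stix_id(object_type: str) -> str:
    """Return a random identifier such as 'malware--<uuid4>' (hypothetical helper)."""
    return f"{object_type}--{uuid.uuid4()}"


def sdo(object_type: str, now: str, **properties) -> dict:
    """Build a minimal STIX 2.1 object as a plain dict."""
    return {
        "type": object_type,
        "spec_version": "2.1",
        "id": stix_id(object_type),
        "created": now,
        "modified": now,
        **properties,
    }


def relationship(source: dict, relationship_type: str, target: dict, now: str) -> dict:
    """Build a relationship object that links two already-created objects."""
    return sdo(
        "relationship",
        now,
        relationship_type=relationship_type,
        source_ref=source["id"],
        target_ref=target["id"],
    )


def build_expected_bundle() -> dict:
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    intrusion_set = sdo("intrusion-set", now, name="APT Foo", aliases=["Bar"])
    malware = sdo("malware", now, name="MalwareX", is_family=True)
    phishing = sdo("attack-pattern", now, name="Phishing")
    actor = sdo("threat-actor", now, name="Threat Group Z")
    contents = [
        intrusion_set,
        malware,
        phishing,
        actor,
        relationship(intrusion_set, "uses", malware, now),
        relationship(intrusion_set, "uses", phishing, now),
        relationship(intrusion_set, "attributed-to", actor, now),
    ]
    # The container references every created object, relationships included.
    report = sdo(
        "report",
        now,
        name="Example report",  # hypothetical name
        published=now,
        object_refs=[obj["id"] for obj in contents],
    )
    return {"type": "bundle", "id": stix_id("bundle"), "objects": [*contents, report]}
```

The point of the sketch is that the three inter-entity edges are first-class objects in the bundle and are listed in the container's `object_refs`, rather than being implied by the item→container links alone.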

Actual Output

  • The connector often creates only item→container links (e.g., report-level related-to) and drops valid predicted relationships between SDOs.
  • Some invalid relationships are attempted and rejected by the API with errors similar to:
    {'name': 'FUNCTIONAL_ERROR', 'error_message': 'Only stix-core-relationship can be created through this method.'}
  • In span-based mode, relationships are silently lost when temporary span IDs (e.g., from_id/to_id tokens) are not mapped to the final STIX IDs in the bundle.
  • ATT&CK IDs or names may not resolve to existing attack-pattern SDOs, leading to missing TTP relationships or duplicated custom objects.
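The span-ID loss described above could be avoided with an explicit remap step before bundling. The following is a minimal sketch under assumptions: the `from_id`/`to_id` keys follow the tokens mentioned in this issue, while `relationship_type`, the `span_to_stix` mapping, and the function name are hypothetical and may not match the connector's internal data model.

```python
import logging

logger = logging.getLogger("span_remap_sketch")


def remap_relationships(predicted: list[dict], span_to_stix: dict[str, str]):
    """Replace temporary span IDs with final STIX IDs.

    Returns (resolved, unresolved) so that nothing is dropped silently.
    """
    resolved, unresolved = [], []
    for rel in predicted:
        source_ref = span_to_stix.get(rel["from_id"])
        target_ref = span_to_stix.get(rel["to_id"])
        if source_ref is None or target_ref is None:
            # Report the loss instead of discarding the edge without a trace.
            logger.warning(
                "Unresolved span IDs for %s: %s -> %s",
                rel["relationship_type"], rel["from_id"], rel["to_id"],
            )
            unresolved.append(rel)
            continue
        resolved.append(
            {
                "relationship_type": rel["relationship_type"],
                "source_ref": source_ref,
                "target_ref": target_ref,
            }
        )
    return resolved, unresolved
```

Returning the unresolved relations alongside the resolved ones makes the failure visible in the import job logs, which is the behavior this issue asks for in place of silent loss.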

Additional information

Observable symptoms from logs (illustrative)

  • Predicted relationships recognized by the extraction step (e.g., INTRUSION-SET -[USES]-> MALWARE, INTRUSION-SET -[USES]-> ATTACK-PATTERN) but not present in the final graph.
  • API rejections for invalid relations (e.g., DOMAIN-NAME -[LOCATED-AT]-> COUNTRY). The connector should filter these client-side using the OpenCTI relation matrix, but they currently reach the API.
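Client-side pre-validation could look roughly like the sketch below. The `ALLOWED_RELATIONSHIPS` table is a hand-written, illustrative subset based on relationships named in this issue and in the STIX 2.1 specification; it is an assumption, not the OpenCTI relation matrix, which a real fix would load from the platform schema. The function names are hypothetical.

```python
# Illustrative subset only; NOT the real OpenCTI relation matrix.
ALLOWED_RELATIONSHIPS = frozenset(
    {
        ("intrusion-set", "uses", "malware"),
        ("intrusion-set", "uses", "attack-pattern"),
        ("intrusion-set", "attributed-to", "threat-actor"),
        ("malware", "uses", "attack-pattern"),
    }
)


def stix_type(stix_id: str) -> str:
    """Extract the object type from a STIX ID of the form '<type>--<uuid>'."""
    return stix_id.split("--", 1)[0]


def split_by_validity(relationships: list[dict]):
    """Partition relationships into (valid, skipped) before any API call."""
    valid, skipped = [], []
    for rel in relationships:
        triple = (
            stix_type(rel["source_ref"]),
            rel["relationship_type"],
            stix_type(rel["target_ref"]),
        )
        (valid if triple in ALLOWED_RELATIONSHIPS else skipped).append(rel)
    return valid, skipped
```

Only the `valid` list would be added to the bundle; the `skipped` list can be logged at debug level so that filtered relations remain auditable without producing API errors.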

Likely root causes in the current connector

  • No ID remap before bundling (span mode): predicted relations reference temporary span UUIDs; if these aren’t remapped to the final STIX IDs, edges are dropped.
  • No pre-validation of relation types: the connector does not consistently consult the OpenCTI schema relation mapping, so it may both discard valid relations (over-filtering) and attempt invalid ones (under-filtering).
  • Limited STIX coverage: several commonly-seen SDOs/SCOs are not emitted or are emitted inconsistently, reducing graph completeness.
  • ATT&CK resolution not robust: lack of a cache or name/ID dual lookup leads to missed attack-pattern reuse and broken uses links.

Impact

  • Relationship loss results in incomplete knowledge graphs, weaker analytics, and user confusion (entities appear, but inter-entity context is missing).
  • API errors add noise to import logs and slow down bulk processing.
