STIX 2.1 model support gaps in ImportDocumentAI connector
Description
The ImportDocumentAI connector intermittently fails to emit valid inter-entity STIX 2.1 relationships and omits support for several STIX object types and workflows. In practice this shows up as:
- Predicted LLM relationships between SDOs (e.g., `uses`, `attributed-to`) not appearing in OpenCTI, with only item→container links created.
- Some invalid relationships being attempted against the API (e.g., wrong type pairs) instead of being pre-validated and skipped.
- Limited support for STIX objects (SDOs/SCOs) commonly extracted from reports, resulting in missing coverage or dropped objects.
- In span-based extraction mode, relationships are lost when temporary span IDs are not resolved to final STIX IDs before bundling.
Collectively, these gaps reduce fidelity of the imported knowledge graph and create noisy import behavior.
Environment
- OS (where OpenCTI server runs): {e.g., Ubuntu 22.04}
- OpenCTI version: {e.g., 6.x.y}
- OpenCTI client: Connector – ImportDocumentAI
- Other environment details: {DB/Queue, Python version, connector image tag}
Reproducible Steps
Smallest reproducible scenario using a short report that mentions a known intrusion set, malware family, and TTPs:
- Ingest a PDF or text document with content such as:
  - “APT Foo (aka Bar) uses MalwareX to deliver payloads via Phishing (T1566). They also leverage Web Protocols (T1071.001). Activity is attributed to Threat Group Z.”
- Configure ImportDocumentAI to use the LLM extraction path (OpenAI/Azure OpenAI) with span-based output enabled (default in many deployments).
- Run the connector so it extracts entities and predicted relationships.
- Inspect the resulting Report in OpenCTI and the import job logs.
Expected Output
- Entities/observables are created for the intrusion set (or threat-actor), malware, and ATT&CK attack-patterns (`T1566`, `T1071.001`), plus extracted observables.
- Inter-entity relationships are emitted and visible in the graph, e.g.:
  - `intrusion-set --(uses)--> malware`
  - `intrusion-set --(uses)--> attack-pattern`
  - `intrusion-set --(attributed-to)--> threat-actor`
- Invalid relationships (e.g., `domain-name --(located-at)--> country`) are pre-validated against the OpenCTI schema and not sent to the API.
- Container (Report / Grouping / Case) includes `object_refs` to all created objects, with correct “related-to” back-links where appropriate.
Actual Output
- The connector often creates only item→container links (e.g., report-level `related-to`) and drops valid predicted relationships between SDOs.
- Some invalid relationships are attempted and rejected by the API with errors similar to: `{'name': 'FUNCTIONAL_ERROR', 'error_message': 'Only stix-core-relationship can be created through this method.'}`
- In span-based mode, relationships are silently lost when temporary span IDs (e.g., `from_id`/`to_id` tokens) are not mapped to the final STIX IDs in the bundle.
- ATT&CK IDs or names may not resolve to existing `attack-pattern` SDOs, leading to missing TTP relationships or duplicated custom objects.
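The span-ID loss described above could be avoided with a remap pass before bundling. The following is a minimal sketch, not the connector's actual code: the function name `remap_relations`, the `relationship_type` key, and the shape of the span-to-STIX map are assumptions made for illustration. Only `from_id` and `to_id` come from the report above.

```python
from typing import Dict, List, Tuple

Relation = Dict[str, str]


def remap_relations(
    relations: List[Relation],
    span_to_stix: Dict[str, str],
) -> Tuple[List[Relation], List[Relation]]:
    """Rewrite temporary span IDs to final STIX IDs.

    Returns (resolved, unresolved). Unresolved relations are returned
    rather than dropped silently, so the caller can log them.
    """
    resolved: List[Relation] = []
    unresolved: List[Relation] = []
    for rel in relations:
        src = span_to_stix.get(rel["from_id"])
        dst = span_to_stix.get(rel["to_id"])
        if src is None or dst is None:
            # At least one endpoint has no final STIX ID yet.
            unresolved.append(rel)
            continue
        # Copy the relation so the input list is left untouched.
        resolved.append({**rel, "from_id": src, "to_id": dst})
    return resolved, unresolved
```

Returning the unresolved relations separately makes the loss visible in the import logs instead of silent.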
Additional information
Observable symptoms from logs (illustrative)
- Predicted relationships recognized by the extraction step (e.g., `INTRUSION-SET -[USES]-> MALWARE`, `INTRUSION-SET -[USES]-> ATTACK-PATTERN`) but not present in the final graph.
- API rejections for invalid relations (e.g., `DOMAIN-NAME -[LOCATED-AT]-> COUNTRY`). The connector should filter these client-side using the OpenCTI relation matrix, but they currently reach the API.
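Client-side filtering of the kind described above might look like the sketch below. The `ALLOWED` set is a tiny illustrative stand-in containing only the three relationships named in this issue. It is not the real OpenCTI relation matrix, and the names `partition_relations` and `stix_type` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Relation = Dict[str, str]
Triple = Tuple[str, str, str]

# Illustrative subset only; a real connector would load the full
# relation mapping from the OpenCTI schema instead of hard-coding it.
ALLOWED: Set[Triple] = {
    ("intrusion-set", "uses", "malware"),
    ("intrusion-set", "uses", "attack-pattern"),
    ("intrusion-set", "attributed-to", "threat-actor"),
}


def stix_type(stix_id: str) -> str:
    """Return the type prefix of a STIX ID of the form 'type--uuid'."""
    return stix_id.split("--", 1)[0]


def partition_relations(
    relations: List[Relation],
    allowed: Set[Triple] = ALLOWED,
) -> Tuple[List[Relation], List[Relation]]:
    """Split relations into (valid, skipped) before any API call."""
    valid: List[Relation] = []
    skipped: List[Relation] = []
    for rel in relations:
        key = (
            stix_type(rel["from_id"]),
            rel["relationship_type"].lower(),
            stix_type(rel["to_id"]),
        )
        (valid if key in allowed else skipped).append(rel)
    return valid, skipped
```

Keeping the skipped list lets the connector log one summary line per import rather than triggering an API error for each invalid pair.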
Likely root causes in the current connector
- No ID remap before bundling (span mode): predicted relations reference temporary span UUIDs; if these aren’t remapped to the final STIX IDs, edges are dropped.
- No pre-validation of relation types: the connector does not consistently consult the OpenCTI schema relation mapping, so it may both discard valid relations (over-filtering) and attempt invalid ones (under-filtering).
- Limited STIX coverage: several commonly-seen SDOs/SCOs are not emitted or are emitted inconsistently, reducing graph completeness.
- ATT&CK resolution not robust: lack of a cache or name/ID dual lookup leads to missed `attack-pattern` reuse and broken `uses` links.
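For the ATT&CK point, a cached name/ID dual lookup could be sketched as follows. Everything here is an assumption for illustration: the class `AttackPatternIndex`, the `external_id` and `name` keys on the input records, and the placeholder STIX IDs in the usage example. How the existing `attack-pattern` objects are fetched from OpenCTI is left out.

```python
import re
from typing import Dict, Iterable, Optional

# Matches technique IDs such as T1566 or sub-technique IDs such as T1071.001.
ATTACK_ID_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b", re.IGNORECASE)


class AttackPatternIndex:
    """In-memory cache of known attack-patterns, keyed by ID and by name."""

    def __init__(self, patterns: Iterable[Dict[str, str]]) -> None:
        self._by_id: Dict[str, str] = {}
        self._by_name: Dict[str, str] = {}
        for p in patterns:
            self._by_id[p["external_id"].upper()] = p["id"]
            self._by_name[p["name"].casefold()] = p["id"]

    def resolve(self, text: str) -> Optional[str]:
        """Return the STIX ID for an extracted mention, or None."""
        match = ATTACK_ID_RE.search(text)
        if match:
            found = self._by_id.get(match.group(0).upper())
            if found is not None:
                return found
        # Fall back to a case-insensitive name lookup.
        return self._by_name.get(text.strip().casefold())


# Usage with placeholder (non-UUID) STIX IDs:
index = AttackPatternIndex([
    {"id": "attack-pattern--1111", "external_id": "T1566", "name": "Phishing"},
    {"id": "attack-pattern--2222", "external_id": "T1071.001", "name": "Web Protocols"},
])
phishing_id = index.resolve("Phishing (T1566)")
```

A `None` result gives the connector an explicit point to decide between skipping the relationship and creating a new object, rather than duplicating custom objects by default.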
Impact
- Relationship loss results in incomplete knowledge graphs, weaker analytics, and user confusion (entities appear, but inter-entity context is missing).
- API errors add noise to import logs and slow down bulk processing.