Skip to content

Conversation

@maparent
Copy link
Collaborator

@maparent maparent commented Dec 25, 2025

The primary goal is to note inter-node references in text. We extract this as an extra json-ld attribute.
The secondary goal is to make markdown hyperlinks more legible to LLMs.
For that purpose, we use roam-url linkification, so the links match the node identities.
We also introduce a (pseudo) paragraph anchor, so that block references can refer to the combination of the page and the anchor. This does not fit roam patterns, but should again be more meaningful to LLMs.
These changes are currently applied only to the json-ld export.
There is an unrelated change to the regexp escaping that was suggested to coderabbit.

Summary by CodeRabbit

  • Bug Fixes

    • Improved special character handling during export
  • New Features

    • Added options to convert block references to links and insert block anchors
    • Enhanced page reference tracking in exported content

✏️ Tip: You can customize this high-level summary in your review settings.

@linear
Copy link

linear bot commented Dec 25, 2025

@supabase
Copy link

supabase bot commented Dec 25, 2025

This pull request has been ignored for the connected project zytfjzqyijgagqxrzbmz because there are no changes detected in packages/database/supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.


Preview Branches by Supabase.
Learn more about Supabase Branching ↗︎.

@maparent
Copy link
Collaborator Author

@maparent maparent requested review from mdroidian and removed request for mdroidian December 25, 2025 18:38
@maparent maparent force-pushed the eng-1212-base-json-ld-export branch from 4d19e5b to 483e769 Compare December 25, 2025 18:43
@maparent maparent force-pushed the eng-1213-adjustments-to-markdown-export branch from 78b1bfe to 2384603 Compare December 25, 2025 18:44
@maparent maparent requested a review from mdroidian December 25, 2025 18:46
@maparent
Copy link
Collaborator Author

@mdroidian
Copy link
Contributor

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 25, 2025

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 25, 2025

📝 Walkthrough

Walkthrough

This PR enhances Roam's markdown export by improving regex escaping in node content transformation, extracting internal text references to nodes from page content, and introducing optional flags (blockRefsAsLinks and blockAnchors) to control block reference and anchor rendering in exported markdown.

Changes

Cohort / File(s) Summary
Regex Escaping Enhancement
apps/roam/src/utils/exportUtils.ts
Broadens regex metacharacter escaping in getContentFromNodes from just [ and ] to include ?, ., and +, while preserving placeholder patterns like {content}.
Node Reference Extraction
apps/roam/src/utils/jsonld.ts
Adds extraction of text-based references to nodes using a new roamUrlRe regex pattern and a precomputed nodeUids set for filtering. Augments node objects with textRefersToNode data and enables blockAnchors and blockRefsAsLinks flags during markdown rendering.
Export Options & Link/Anchor Rendering
apps/roam/src/utils/pageToMarkdown.ts
Introduces blockRefsAsLinks and blockAnchors optional flags to toMarkdown and pageToMarkdown functions. Adds logic to convert block references to links using resolved page UIDs, inserts block anchor markup into headings, and threads new options through the export pipeline.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes: introducing Roam-style link formatting and reference tracking in the Markdown export pipeline.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 483e769 and 2384603.

📒 Files selected for processing (3)
  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/main.mdc)

**/*.{ts,tsx}: Use Tailwind CSS for styling where possible
When refactoring inline styles, use tailwind classes
Prefer type over interface in TypeScript
Use explicit return types for functions
Avoid any types when possible
Prefer arrow functions over regular function declarations
Use named parameters (object destructuring) when a function has more than 2 parameters
Use PascalCase for components and types
Use camelCase for variables and functions
Use UPPERCASE for constants
Function names should describe their purpose clearly
Prefer early returns over nested conditionals for better readability

Files:

  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
apps/roam/**/*.{js,ts,tsx,jsx,json}

📄 CodeRabbit inference engine (.cursor/rules/roam.mdc)

Prefer existing dependencies from package.json when working on the Roam Research extension

Files:

  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
apps/roam/**/*.{ts,tsx,jsx,js,css,scss}

📄 CodeRabbit inference engine (.cursor/rules/roam.mdc)

Use BlueprintJS 3 components and Tailwind CSS for platform-native UI in the Roam Research extension

Files:

  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
apps/roam/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.cursor/rules/roam.mdc)

apps/roam/**/*.{ts,tsx,js,jsx}: Use the roamAlphaApi docs from https://roamresearch.com/#/app/developer-documentation/page/tIaOPdXCj when implementing Roam functionality
Use Roam Depot/Extension API docs from https://roamresearch.com/#/app/developer-documentation/page/y31lhjIqU when implementing extension functionality

Files:

  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
apps/roam/**

📄 CodeRabbit inference engine (.cursor/rules/roam.mdc)

Implement the Discourse Graph protocol in the Roam Research extension

Files:

  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
🧠 Learnings (11)
📓 Common learnings
Learnt from: sid597
Repo: DiscourseGraphs/discourse-graph PR: 372
File: apps/roam/src/components/DiscourseNodeMenu.tsx:116-116
Timestamp: 2025-08-25T15:53:21.799Z
Learning: In apps/roam/src/components/DiscourseNodeMenu.tsx, when handling tag insertion, multiple leading hashtags (like ##foo) should be preserved as they represent user intent, not normalized to a single hashtag. The current regex /^#/ is correct as it only removes one leading # before adding one back, maintaining any additional hashtags the user intended.
📚 Learning: 2025-08-25T15:53:21.799Z
Learnt from: sid597
Repo: DiscourseGraphs/discourse-graph PR: 372
File: apps/roam/src/components/DiscourseNodeMenu.tsx:116-116
Timestamp: 2025-08-25T15:53:21.799Z
Learning: In apps/roam/src/components/DiscourseNodeMenu.tsx, when handling tag insertion, multiple leading hashtags (like ##foo) should be preserved as they represent user intent, not normalized to a single hashtag. The current regex /^#/ is correct as it only removes one leading # before adding one back, maintaining any additional hashtags the user intended.

Applied to files:

  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
📚 Learning: 2025-11-23T23:53:43.094Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 0
File: :0-0
Timestamp: 2025-11-23T23:53:43.094Z
Learning: In `apps/roam/src/utils/supabaseContext.ts`, the Roam URL normalization uses a targeted string replacement `url.replace("/?server-port=3333#/", "/#/")` rather than URL parsing to avoid impacting unforeseen URL patterns. This conservative approach is intentional to handle only the specific known case.

Applied to files:

  • apps/roam/src/utils/exportUtils.ts
  • apps/roam/src/utils/jsonld.ts
📚 Learning: 2025-11-25T00:52:41.934Z
Learnt from: CR
Repo: DiscourseGraphs/discourse-graph PR: 0
File: .cursor/rules/roam.mdc:0-0
Timestamp: 2025-11-25T00:52:41.934Z
Learning: Applies to apps/roam/**/*.{ts,tsx,js,jsx} : Use the roamAlphaApi docs from https://roamresearch.com/#/app/developer-documentation/page/tIaOPdXCj when implementing Roam functionality

Applied to files:

  • apps/roam/src/utils/jsonld.ts
📚 Learning: 2025-11-25T00:52:41.934Z
Learnt from: CR
Repo: DiscourseGraphs/discourse-graph PR: 0
File: .cursor/rules/roam.mdc:0-0
Timestamp: 2025-11-25T00:52:41.934Z
Learning: Applies to apps/roam/**/*.{js,ts,tsx,jsx,json} : Prefer existing dependencies from package.json when working on the Roam Research extension

Applied to files:

  • apps/roam/src/utils/jsonld.ts
📚 Learning: 2025-06-22T10:40:52.752Z
Learnt from: sid597
Repo: DiscourseGraphs/discourse-graph PR: 232
File: apps/roam/src/utils/getAllDiscourseNodesSince.ts:18-31
Timestamp: 2025-06-22T10:40:52.752Z
Learning: In apps/roam/src/utils/getAllDiscourseNodesSince.ts, the user confirmed that querying for `?title` with `:node/title` and mapping it to the `text` field in the DiscourseGraphContent type is the correct implementation for retrieving discourse node content from Roam Research, despite it appearing to query page titles rather than block text content.

Applied to files:

  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
📚 Learning: 2025-11-25T00:52:41.934Z
Learnt from: CR
Repo: DiscourseGraphs/discourse-graph PR: 0
File: .cursor/rules/roam.mdc:0-0
Timestamp: 2025-11-25T00:52:41.934Z
Learning: Applies to apps/roam/**/*.{ts,tsx,js,jsx} : Use Roam Depot/Extension API docs from https://roamresearch.com/#/app/developer-documentation/page/y31lhjIqU when implementing extension functionality

Applied to files:

  • apps/roam/src/utils/jsonld.ts
📚 Learning: 2025-06-17T23:37:45.289Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 220
File: apps/roam/src/utils/conceptConversion.ts:42-56
Timestamp: 2025-06-17T23:37:45.289Z
Learning: In the DiscourseNode interface from apps/roam/src/utils/getDiscourseNodes.ts, the field `type` serves as the unique identifier field, not a type classification field. The interface has no `uid` or `id` field, making `node.type` the correct field to use for UID-related operations.

Applied to files:

  • apps/roam/src/utils/jsonld.ts
📚 Learning: 2025-06-17T23:37:45.289Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 220
File: apps/roam/src/utils/conceptConversion.ts:42-56
Timestamp: 2025-06-17T23:37:45.289Z
Learning: In the DiscourseNode interface from apps/roam/src/utils/getDiscourseNodes.ts, the field `node.type` serves as the UID field rather than having a conventional `node.uid` field. This is an unusual naming convention where the type field actually contains the unique identifier.

Applied to files:

  • apps/roam/src/utils/jsonld.ts
📚 Learning: 2025-11-05T21:57:14.909Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 534
File: apps/roam/src/utils/createReifiedBlock.ts:40-48
Timestamp: 2025-11-05T21:57:14.909Z
Learning: In the discourse-graph repository, the function `getPageUidByPageTitle` from `roamjs-components/queries/getPageUidByPageTitle` is a synchronous function that returns a string directly (the page UID or an empty string if not found), not a Promise. It should be called without `await`.

Applied to files:

  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
📚 Learning: 2025-12-07T20:54:20.007Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 559
File: apps/roam/src/utils/findDiscourseNode.ts:37-39
Timestamp: 2025-12-07T20:54:20.007Z
Learning: In apps/roam/src/utils/findDiscourseNode.ts, the function findDiscourseNodeByTitleAndUid accepts both uid and title parameters where uid is primarily used for cache access (as the cache key) while title is used for the actual matching via matchDiscourseNode. This design reflects the pattern where downstream, the uid is mostly used to fetch the title, so the function caches by uid but matches by title.

Applied to files:

  • apps/roam/src/utils/jsonld.ts
  • apps/roam/src/utils/pageToMarkdown.ts
🧬 Code graph analysis (2)
apps/roam/src/utils/jsonld.ts (2)
apps/roam/src/utils/getRelationData.ts (1)
  • getRelationDataUtil (9-67)
apps/roam/src/utils/exportUtils.ts (1)
  • getPageData (19-67)
apps/roam/src/utils/pageToMarkdown.ts (1)
apps/roam/src/utils/exportUtils.ts (1)
  • toLink (121-128)
🔇 Additional comments (11)
apps/roam/src/utils/jsonld.ts (4)

110-110: LGTM: Regex pattern correctly captures page UIDs from Roam URLs.

The pattern properly captures 9-10 character UIDs from Roam page URLs with word boundaries to prevent partial matches.


115-115: LGTM: Efficient use of Set for internal reference lookup.

Using a Set provides O(1) lookup time when filtering textRefersToNode in the subsequent logic.


142-143: LGTM: Enables Roam-style linkification for JSON-LD export.

These options align with the PR's secondary goal of making Markdown hyperlinks more legible to LLMs.


167-167: LGTM: Conditional inclusion prevents empty arrays in JSON-LD output.

Only adding textRefersToNode when references exist keeps the output clean and follows JSON-LD best practices.

apps/roam/src/utils/pageToMarkdown.ts (7)

149-161: LGTM: Improved readability of chained replacements.

The reformatting makes the embed and embed-children regex replacements easier to follow without changing functionality.


183-207: LGTM: Wiki link resolution supports Roam-style linkification.

The enhanced logic resolves wiki-style [[links]] to page UIDs when blockRefsAsLinks is enabled, making links more meaningful to LLMs as described in the PR objectives.


230-230: LGTM: Using Pandoc-style anchors for better markdown compatibility.

The {#block-uid} format follows Pandoc/kramdown conventions rather than Roam's native format. This aligns with the PR's goal of making the export more meaningful to LLMs and standard markdown processors.


231-231: LGTM: Block anchor placement enables paragraph-level references.

Inserting the anchor after the heading prefix creates the pseudo paragraph anchor pattern mentioned in the PR description, allowing block references to combine page + anchor.


247-248: LGTM: Parameters properly threaded with safe defaults.

The blockRefsAsLinks and blockAnchors parameters are correctly added with false defaults, ensuring backward compatibility while enabling the new functionality.

Also applies to: 260-261, 316-317


3-3: Both getPageUidByBlockUid and getPageUidByPageTitle are correctly called without await at lines 167 and 190. Usage patterns confirm both functions are synchronous: at line 169, pageUid is compared directly to strings; at line 191, the .length property is accessed directly. No issues found.


165-175: Fix error handling for getPageUidByBlockUid return value.

The function getPageUidByBlockUid returns undefined when a block is not found, not an empty string. Line 169 checks pageUid === "" which will never be true, allowing undefined to be concatenated into the link string at line 174, creating a malformed link like "undefined#block-xxx". Change the condition to check pageUid === undefined or use a nullish coalescing operator to handle the failure case correctly.

⛔ Skipped due to learnings
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 534
File: apps/roam/src/utils/createReifiedBlock.ts:40-48
Timestamp: 2025-11-05T21:57:14.909Z
Learning: In the discourse-graph repository, the function `getPageUidByPageTitle` from `roamjs-components/queries/getPageUidByPageTitle` is a synchronous function that returns a string directly (the page UID or an empty string if not found), not a Promise. It should be called without `await`.

Copy link
Contributor

@mdroidian mdroidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets address the coderabbit suggestions (or resolve if unnecessary ), then re-tag for review.

@maparent maparent force-pushed the eng-1212-base-json-ld-export branch from 483e769 to 6c772f6 Compare December 27, 2025 14:09
@maparent maparent force-pushed the eng-1213-adjustments-to-markdown-export branch from 2384603 to 491aa3e Compare December 27, 2025 14:12
@maparent maparent force-pushed the eng-1212-base-json-ld-export branch from 6c772f6 to 4aaf3e1 Compare December 27, 2025 14:40
@maparent maparent force-pushed the eng-1213-adjustments-to-markdown-export branch from 491aa3e to ff01286 Compare December 27, 2025 14:40
@maparent maparent force-pushed the eng-1212-base-json-ld-export branch from 4aaf3e1 to 715c55c Compare December 27, 2025 15:05
@maparent maparent force-pushed the eng-1213-adjustments-to-markdown-export branch 2 times, most recently from e235bf5 to 03b3755 Compare December 27, 2025 16:19
@maparent maparent requested a review from mdroidian December 27, 2025 16:30
Copy link
Contributor

@mdroidian mdroidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the Loom — that helps clarify the motivation.

From the video, my understanding is that Matt requested the ability to compute a textRefersToNode field. Could you please link to that original request or discussion, so I can read the context directly?

Separately, the {#block-<uid>} / #block-<uid> anchor pattern appears to be a new convention introduced to make block references linkable and scannable in markdown, even though this is not a Roam pattern. Was this approach discussed with the team beforehand, and what specific problem does it solve compared to using existing Roam block URLs or block UIDs?

Clarifying this will help me understand whether this is intended as a temporary export-only workaround or a longer-term convention we want to support.

@maparent maparent force-pushed the eng-1212-base-json-ld-export branch from c5d3cc0 to f3e9ded Compare December 30, 2025 02:50
Base automatically changed from eng-1212-base-json-ld-export to main December 30, 2025 02:51
@maparent maparent force-pushed the eng-1213-adjustments-to-markdown-export branch from 95ab75f to 91929e9 Compare December 30, 2025 02:53
@maparent
Copy link
Collaborator Author

maparent commented Dec 30, 2025

Matt's original request was here
(I clearly remember adding that link earlier but cannot find it?)
You are correct that this approach was not validated.
There were two motivations:

  1. It was easier for me to detect the node links with a more explicit URL, to answer Matt's request. In particular, if there was a block reference inside a node, which I thought should count as a node reference, I would never know it looking at the node.
  2. Since it was clearly a use case, I wanted to make the URLs in the markdown easier to understand for the LLM.

Now, that one probably deserves a round of shared design.
But I also worked with the assumption that there would be iterations, and that we could revisit assumptions.
So I have no strong opinion on long-term maintenance of this specific pattern.
BUT I'm more generally convinced that leaving the markdown as-is makes for hard-to-interpret markdown, and so that is worth discussing. This was just a first stab in that direction, that came out naturally from the design to answer the original request.

@mdroidian
Copy link
Contributor

Matt's original request was here
(I clearly remember adding that link earlier but cannot find it?)

If we keep all of the important context (background, scope, Loom walkthrough, etc.) consolidated in the PR description, it makes reviews significantly easier—for reviewers and for us when we come back to this later.

PR comment threads tend to accumulate automated messages and incidental discussion over time, which makes key context harder to find. Treating the PR body as the single source of truth helps keep the review focused and reduces friction.

@mdroidian
Copy link
Contributor

I think there may be a scope / request mismatch here.

  • The PR says “inter-node references,” but the implementation of textRefersToNode appears to extract outgoing references only (via a regex over Roam page URLs in exported content).
  • Matt’s request was specifically about Linked References (backlinks / “what links to this node?”), which this PR does not implement.
  • Matt’s request also mentions capturing additional context in the linkedReferences structure.
image

Separately, the additional linkification and pseudo-anchor conventions go beyond the original request. The {#block-<uid>} / #block-<uid> pattern in particular doesn’t match Roam semantics, and it’s not yet clear that the benefits outweigh the costs (extra tokens/context for LLMs, and potential confusion about what targets are actually navigable in Roam vs. only within exported markdown).

Given this, I think it would be better to close this PR and open two follow-up tickets:

  • one to clarify and implement Matt’s original Linked References request
  • one to separately discuss the anchor / markdown-link conventions

Copy link
Collaborator Author

  1. Re single source of truth: I have no objection to using PR body as single source of truth, but I feel that it ends up creating empty linear task bodies, which are harder to read. But I understand the problem if we keep both (and roam…) I am almost thinking if we do this, we could work strictly with github tasks and let go of linear?
  2. I have detailed the markdown issue [in roam](https://roamresearch.com/#/app/discourse-graphs/page/kJNzZYnEw). Let's use this task for that.
  3. I have created a separate task ENG-1230
    1. I will continue some of the conversation there.
  4. I think that task (json-ld exports) is easier done if we first get better links in the JSON-LD. Repeating rationale from roam:
    1. The parsing of various link types happens within the machinery, which is not aware of whether the output is markdown files or markdown embedded within JSON-LD
    2. In the case of JSON-LD, we would want to do link extraction. Since this is specific to JSON-LD, it would be easier to do this using the output of the function, vs changing that function to incorporate link tracking. This is another reason for that output to contain enough information to do this. (But that decision could go either way)
  5. BUT I will look at other ways to get backlinks using direct roam queries, that may allow to keep things separate.

@mdroidian
Copy link
Contributor

  1. Re single source of truth: I have no objection to using PR body as single source of truth, but I feel that it ends up creating empty linear task bodies, which are harder to read. But I understand the problem if we keep both (and roam…) I am almost thinking if we do this, we could work strictly with github tasks and let go of linear?

To clarify, single source of truth was just referring to GitHub comments vs PR body. Stated differently, if you are adding any additional context to a GitHub PR, add it to the body and not as a separate comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants