Feature: Generate documentation in LLM-friendly Markdown #6555

jstirnaman · 2025-11-19T22:34:27Z

Provides a selector menu in each page with the following options:
- Copy page Markdown
- Copy section Markdown (for section pages)
- Ask ChatGPT
- Ask Claude

The Markdown is generated directly from the HTML by a function that runs on AWS Lambda and is maintained in influxdata/docs-tooling.

Adds e2e tests.

This enables LLM-friendly documentation for entire sections, allowing users to copy complete documentation sections with a single click.

Lambda@Edge now generates .md files on-demand with:
- Evaluated Hugo shortcodes
- Proper YAML frontmatter with product metadata
- Clean markdown without UI elements
- Section aggregation (parent + children in single file)

The llms.txt files are now generated automatically during build from content structure and product metadata in data/products.yml, eliminating the need for hardcoded files and ensuring maintainability.

**Testing**:
- Automated markdown generation in test setup via cy.exec()
- Implement dynamic content validation that extracts HTML content and verifies it appears in markdown version

**Documentation**: Documents LLM-friendly markdown generation

**Details**:

Add gzip decompression for S3 HTML files in Lambda markdown generator

HTML files stored in S3 are gzip-compressed but the Lambda was attempting to parse compressed data as UTF-8, causing JSDOM to fail to find article elements. This resulted in 404 errors for .md and .section.md requests.

- Add zlib gunzip decompression in s3-utils.js fetchHtmlFromS3()
- Detect gzip via ContentEncoding header or magic bytes (0x1f 0x8b)
- Add configurable DEBUG constant for verbose logging
- Add debug logging for buffer sizes and decompression in both files

The decompression adds ~1-5ms per request but is necessary to parse HTML correctly. CloudFront caching minimizes Lambda invocations.

Await async markdown conversion functions

The convertToMarkdown and convertSectionToMarkdown functions are async but weren't being awaited, causing the Lambda to return a Promise object instead of a string. This resulted in CloudFront validation errors: "The body is not a string, is not an object, or exceeds the maximum size"

**Troubleshooting**:
- Set DEBUG for troubleshooting in lambda
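
For readers unfamiliar with the two Lambda fixes described above, the following is a minimal sketch, not the code from this PR. It shows gzip detection via the ContentEncoding value or the magic bytes 0x1f 0x8b, followed by an awaited conversion. The helper names `isGzipped`, `decodeHtml`, and `buildBody` are hypothetical, and the `convert` parameter stands in for an async converter such as convertToMarkdown.

```javascript
const zlib = require('node:zlib');

// Hypothetical helper: true when the S3 object looks gzip-compressed,
// based on the ContentEncoding value or the gzip magic bytes.
function isGzipped(buffer, contentEncoding) {
  if (typeof contentEncoding === 'string' &&
      contentEncoding.toLowerCase().includes('gzip')) {
    return true;
  }
  // Every gzip stream starts with the bytes 0x1f 0x8b.
  return buffer.length >= 2 && buffer[0] === 0x1f && buffer[1] === 0x8b;
}

// Hypothetical helper: decode an S3 body (a Buffer) into a UTF-8 HTML string,
// decompressing first when the body is gzip-compressed.
function decodeHtml(buffer, contentEncoding) {
  const raw = isGzipped(buffer, contentEncoding)
    ? zlib.gunzipSync(buffer)
    : buffer;
  return raw.toString('utf8');
}

// Hypothetical wrapper: `convert` is assumed to be an async function that
// takes HTML and resolves to a Markdown string.
async function buildBody(buffer, contentEncoding, convert) {
  const html = decodeHtml(buffer, contentEncoding);
  // Without `await`, this would return a pending Promise rather than a string.
  return await convert(html);
}
```

Checking the magic bytes in addition to the header covers objects that were uploaded compressed without ContentEncoding metadata. Note that in this sketch a body labeled gzip but not actually compressed would make `gunzipSync` throw.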

Implements static Markdown generation during Hugo build.

**Key Features:**
- Two-phase generation: HTML→MD (memory-bounded), MD→sections (fast)
- Automatic redirect detection via file size check (skips Hugo aliases)
- Product detection using compiled TypeScript product-mappings module
- Token estimation for LLM context planning (4 chars/token heuristic)
- YAML serialization with description sanitization

**Performance:**
- ~105 seconds for 5,000 pages + 500 sections
- ~300MB peak memory (safe for 2GB CircleCI environment)
- 23 files/sec conversion rate with controlled concurrency

**Configuration Parameters:**
- MIN_HTML_SIZE_BYTES (default: 1024) - Skip files below threshold
- CHARS_PER_TOKEN (default: 4) - Token estimation ratio
- Concurrency: 10 workers (CI), 20 workers (local)

**Output:**
- Single pages: public/*/index.md (with frontmatter + content)
- Section bundles: public/*/index.section.md (aggregated child pages)

**Files Changed:**
- scripts/build-llm-markdown.js (new) - Main build script
- scripts/lib/markdown-converter.cjs (renamed from .js) - Core conversion
- scripts/html-to-markdown.js - Updated import path
- package.json - Updated exports for .cjs module

Related: Replaces Lambda@Edge on-demand generation (5s response time) with build-time static generation for production deployment.

feat(deploy): Add staging deployment workflow and update CI

Integrates LLM markdown generation into deployment workflows with a complete staging deployment solution.

**CircleCI Updates:**
- Switch from legacy html-to-markdown.js to optimized build:md
- 2x performance improvement (105s vs 200s+ for 5000 pages)
- Better memory management (300MB vs variable)
- Enables section bundle generation (index.section.md files)

**Staging Deployment:**
- New scripts/deploy-staging.sh for local staging deploys
- Complete workflow: Hugo build → markdown gen → S3 upload
- Environment variable driven configuration
- Optional step skipping for faster iteration
- CloudFront cache invalidation support

**NPM Scripts:**
- Added deploy:staging command for convenience
- Wraps deploy-staging.sh script

**Documentation:**
- Updated DOCS-DEPLOYING.md with comprehensive guide
- Merged staging/production workflows with Lambda@Edge docs
- Build-time generation now primary, Lambda@Edge fallback
- Troubleshooting section with common issues
- Environment variable reference
- Performance metrics and optimization tips

**Benefits:**
- Manual staging validation before production
- Consistent markdown generation across environments
- Faster CI builds with optimized script
- Better error handling and progress reporting
- Section aggregation for improved LLM context

**Usage:**

```bash
export STAGING_BUCKET="test2.docs.influxdata.com"
export AWS_REGION="us-east-1"
export STAGING_CF_DISTRIBUTION_ID="E1XXXXXXXXXX"
yarn deploy:staging
```

Related: Completes build-time markdown generation implementation

refactor: Remove Lambda@Edge implementation

Build-time markdown generation has replaced Lambda@Edge on-demand generation as the primary method. Removed Lambda code and updated documentation to focus on build-time generation and testing.

Removed:
- deploy/llm-markdown/ directory (Lambda@Edge code)
- Lambda@Edge section from DOCS-DEPLOYING.md

Added:
- Testing and Validation section in DOCS-DEPLOYING.md
- Focus on build-time generation workflow
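
To illustrate the build-time heuristics listed under Key Features and Configuration Parameters, here is a rough sketch and not the contents of scripts/build-llm-markdown.js. It covers the file-size check that skips Hugo alias stubs, the 4 chars/token estimate, and a bounded worker pool. The function names (`isLikelyRedirect`, `estimateTokens`, `mapWithConcurrency`) and the choice to round the estimate up are assumptions made for illustration. Only the two default values come from the description.

```javascript
// Defaults taken from the PR description.
const MIN_HTML_SIZE_BYTES = 1024;
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: Hugo alias pages are small redirect stubs, so files
// below the size threshold are treated as redirects and skipped.
function isLikelyRedirect(htmlSizeBytes, minSize = MIN_HTML_SIZE_BYTES) {
  return htmlSizeBytes < minSize;
}

// Hypothetical helper: rough token count for LLM context planning.
// Rounding up is an assumption; the description only gives the ratio.
function estimateTokens(markdown, charsPerToken = CHARS_PER_TOKEN) {
  return Math.ceil(markdown.length / charsPerToken);
}

// Hypothetical helper: run an async `worker` over `items` with at most
// `limit` tasks in flight, preserving result order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function run() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

With helpers like these, a build script could call `mapWithConcurrency(files, 10, convertFile)` in CI and pass 20 locally, where `convertFile` is a placeholder for whatever per-file conversion the script performs. Capping the number of in-flight conversions is one way to keep peak memory bounded.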

jstirnaman force-pushed the jts-feat-llm-text branch 4 times, most recently from a925d9b to 6c11757 on November 20, 2025 21:34

sanderson and others added 2 commits November 24, 2025 09:31

jstirnaman force-pushed the jts-feat-llm-text branch from 7bdbe79 to eede9c0 on November 24, 2025 15:49

3 participants
