How to do structured HTML-aware comparison (e.g., with Diff subclass or tokenizer)?

Hi,

I’m currently using `diffArrays` with a custom tokenizer to compare two HTML documents, treating tags and text separately. Here’s a simplified version of my setup:
- I tokenize the HTML into words and tags (preserving spaces).
- I use `diffArrays()` with a custom comparator that tries to ignore formatting-only changes.
- I wrap added/removed tokens in `` tags to render diffs inline in the browser.

However, I’m running into a few key issues:
1. Formatting-only changes (like switching from `` to ``) are still flagged as additions/removals, even when they are semantically equivalent.
2. Sentence-level changes are broken down into many small word-level diffs instead of treating them as grouped phrases.
3. I saw in the README that extending the `Diff` class could allow deeper customization, but I couldn’t find any examples or guidance on how to use it in this context.

Here’s the code I’m using:

```typescript
import * as jsdiff from 'diff'

export const compareTwoDocuments = (original: string, modified: string) => {
  // Minify HTML by collapsing whitespace and normalizing newlines
  const minifyHtml = (html: string): string => {
    return html
      .replace(/&nbsp;/g, ' ') // Replace &nbsp; with a plain space
      .replace(/\s+/g, ' ') // Collapse any run of whitespace into one space
      .replace(/>\s+</g, '><') // Remove whitespace between adjacent tags
      .replace(/\s+>/g, '>') // Remove whitespace before a closing '>'
      .replace(/<\s+/g, '<') // Remove whitespace after an opening '<'
      .trim()
  }

  // Custom tokenizer that separates HTML tags and content while preserving spaces
  const tokenizeHtml = (text: string): string[] => {
    const tokens: string[] = []
    let currentIndex = 0
    // Matches opening and closing tags alike
    const tagRegex = /<[^>]+>/g
    let match: RegExpExecArray | null

    while ((match = tagRegex.exec(text)) !== null) {
      // Add content before the tag if it exists
      if (match.index > currentIndex) {
        const content = text.slice(currentIndex, match.index)
        // Split content into words and whitespace runs
        const words = content.split(/(\s+)/)
        tokens.push(...words.filter((w) => w.length > 0))
      }
      // Add the tag
      tokens.push(match[0])
      currentIndex = match.index + match[0].length
    }

    // Add any remaining content after the last tag
    if (currentIndex < text.length) {
      const remainingContent = text.slice(currentIndex)
      // Split remaining content into words and whitespace runs
      const words = remainingContent.split(/(\s+)/)
      tokens.push(...words.filter((w) => w.length > 0))
    }

    return tokens
  }

  const originalTokens = tokenizeHtml(minifyHtml(original))
  const modifiedTokens = tokenizeHtml(minifyHtml(modified))

  const differences = jsdiff.diffArrays(originalTokens, modifiedTokens, {
    comparator: (left, right) => {
      // Treat single spaces as equal (because we have so many of them)
      if (left === ' ' && right === ' ') return true

      // Compare HTML tags exactly
      if (left.startsWith('<') && right.startsWith('<')) {
        return left === right
      }

      // For content, compare case-insensitively and normalize whitespace
      const normalize = (str: string) => str.replace(/\s+/g, ' ').trim().toLowerCase()
      return normalize(left) === normalize(right)
    },
  })

  const finalHtmlResult = differences
    .map((part) => {
      const value = part.value.join('')
      if (part.added) return `${value}`
      if (part.removed) return `${value}`
      return value
    })
    .join('')

  return finalHtmlResult
}
```
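
For issue 1, one idea I have been considering is to canonicalize tags before comparing them, so that tags I regard as equivalent compare equal. Below is a minimal sketch of that idea. The alias table (`TAG_ALIASES`) and the helper names `canonicalizeTag` and `tagsEquivalent` are my own placeholders, not part of jsdiff, and the choice of which tags count as equivalent is only an assumption for illustration.

```typescript
// Hypothetical alias table: which tags count as "the same" is an assumption
// made for illustration; jsdiff itself knows nothing about HTML.
const TAG_ALIASES = new Map<string, string>([
  ['b', 'strong'],
  ['i', 'em'],
])

// Reduce a tag token to a canonical form: lowercase name, alias applied,
// attributes dropped. Tokens that do not look like a tag are returned unchanged.
const canonicalizeTag = (token: string): string => {
  const match = /^<(\/?)\s*([a-zA-Z][\w-]*)[^>]*>$/.exec(token)
  if (!match) return token
  const name = match[2].toLowerCase()
  return `<${match[1]}${TAG_ALIASES.get(name) ?? name}>`
}

// Comparator sketch: two tokens are equal when their canonical forms match.
const tagsEquivalent = (left: string, right: string): boolean =>
  canonicalizeTag(left) === canonicalizeTag(right)
```

This could replace the exact `left === right` check in the tag branch of the comparator above. Note that dropping attributes also hides attribute-only changes (for example a changed link target), which may or may not be acceptable.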

My questions:
- Is there a recommended way to improve grouping of similar phrases/sentences instead of word-by-word diffs?
- Is there any documentation or example for subclassing Diff to improve diff scoring or heuristics?
- Would it make sense to preprocess HTML into block-level elements and diff those instead? Or is there a more robust way to make jsdiff HTML-aware?
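
To make the last question concrete, this is roughly the preprocessing I have in mind: split each document into chunks at block-level opening tags and diff the chunks first. It is only a sketch. `BLOCK_TAGS` and `splitIntoBlocks` are hypothetical names, the tag list is an assumption, and the regex-based splitting does not understand nesting (a wrapper such as an outer `div` becomes its own chunk).

```typescript
// Assumed set of block-level tags; adjust for the documents being compared.
const BLOCK_TAGS = ['p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'tr']

// Split HTML into chunks, starting a new chunk at every block-level opening tag.
// The lookahead keeps each tag at the start of its chunk instead of consuming it.
const splitIntoBlocks = (html: string): string[] => {
  const boundary = new RegExp(`(?=<(?:${BLOCK_TAGS.join('|')})\\b[^>]*>)`, 'i')
  return html
    .split(boundary)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
}
```

The two resulting arrays could be passed to `diffArrays()` so that unchanged blocks match as a whole, and only the blocks reported as added or removed would then go through the word-level tokenizer.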

Any guidance or references would be really appreciated.

Thanks.

How to do structured HTML-aware comparison (e.g., with Diff subclass or tokenizer)? #606