Hi,
I’m currently using `diffArrays` with a custom tokenizer to compare two HTML documents, treating tags and text separately. Here’s a simplified version of my setup:
- I tokenize the HTML into words and tags (preserving spaces).
- I use `diffArrays()` with a custom comparator that tries to ignore formatting-only changes.
- I wrap added/removed tokens in `<span>` tags to render diffs inline in the browser.
However, I’m running into a few key issues:
- Formatting-only changes (like switching from `<b>` to `<strong>`) are still flagged as additions/removals, even when they are semantically equivalent.
- Sentence-level changes are broken down into many small word-level diffs instead of being treated as grouped phrases.
- I saw in the README that extending the `Diff` class could allow deeper customization, but I couldn’t find any examples or guidance on how to use it in this context.
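To make the first point concrete, here is a minimal sketch of the kind of tag-aware equality I have in mind. The alias table (`TAG_ALIASES`) and the helper names `canonicalTag` and `tokensEqual` are hypothetical, and dropping attributes is an assumption about what counts as "formatting-only"; none of this comes from jsdiff itself.

```typescript
// Hypothetical alias table: tags treated as interchangeable when diffing.
const TAG_ALIASES: Record<string, string> = { b: 'strong', i: 'em' }

const isTag = (token: string): boolean => token.startsWith('<')

// Reduce a tag token such as "</B >" or "<b class='x'>" to a canonical form.
// Attributes are dropped (assumption: they do not matter for this comparison).
const canonicalTag = (token: string): string => {
  const m = /^<\s*(\/?)\s*([a-zA-Z][\w-]*)/.exec(token)
  if (!m) return token // comments, doctype, malformed tags: compare verbatim
  const slash = m[1] ?? ''
  const name = (m[2] ?? '').toLowerCase()
  return `<${slash}${TAG_ALIASES[name] ?? name}>`
}

// Equality function with the same (left, right) => boolean shape as the
// comparator option used with diffArrays in the code below.
const tokensEqual = (left: string, right: string): boolean => {
  if (isTag(left) || isTag(right)) {
    // A tag never equals a text token; two tags are equal if their canonical forms match.
    return isTag(left) && isTag(right) && canonicalTag(left) === canonicalTag(right)
  }
  return left.trim().toLowerCase() === right.trim().toLowerCase()
}
```

With something like this passed as `comparator`, `<b>` and `<strong>` would compare as equal, while an opening tag would still differ from a closing one.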
Here’s the code I’m using:
import * as jsdiff from 'diff'

export const compareTwoDocuments = (original: string, modified: string) => {
  // Minify HTML by removing extra whitespace and normalizing newlines
  const minifyHtml = (html: string): string => {
    return html
      .replace(/&nbsp;/g, ' ') // Replace &nbsp; with space
      .replace(/\s+/g, ' ') // Replace multiple spaces with single space
      .replace(/>\s+</g, '><') // Remove spaces between tags
      .replace(/\s+>/g, '>') // Remove spaces before a tag's closing bracket
      .replace(/<\s+/g, '<') // Remove spaces after a tag's opening bracket
      .trim()
  }

  // Custom tokenizer that separates HTML tags and content while preserving spaces
  const tokenizeHtml = (text: string): string[] => {
    const tokens: string[] = []
    let currentIndex = 0
    const tagRegex = /<[^>]+>|<\/[^>]+>/g
    let match
    while ((match = tagRegex.exec(text)) !== null) {
      // Add content before the tag if it exists
      if (match.index > currentIndex) {
        const content = text.slice(currentIndex, match.index)
        if (content) {
          // Split content into words and spaces
          const words = content.split(/(\s+)/)
          tokens.push(...words.filter((w) => w.length > 0))
        }
      }
      // Add the tag
      tokens.push(match[0])
      currentIndex = match.index + match[0].length
    }
    // Add any remaining content after the last tag
    if (currentIndex < text.length) {
      const remainingContent = text.slice(currentIndex)
      if (remainingContent) {
        // Split remaining content into words and spaces
        const words = remainingContent.split(/(\s+)/)
        tokens.push(...words.filter((w) => w.length > 0))
      }
    }
    return tokens
  }

  const originalTokens = tokenizeHtml(minifyHtml(original))
  const modifiedTokens = tokenizeHtml(minifyHtml(modified))

  const differences = jsdiff.diffArrays(originalTokens, modifiedTokens, {
    comparator: (left, right) => {
      // Skip spaces (cause we have so many of them)
      if (left === ' ' && right === ' ') return true
      // Compare HTML tags exactly
      if (left.startsWith('<') && right.startsWith('<')) {
        return left === right
      }
      // For content, compare case-insensitively and normalize whitespace
      const normalize = (str: string) => str.replace(/\s+/g, ' ').trim().toLowerCase()
      return normalize(left) === normalize(right)
    },
  })

  const finalHtmlResult = differences
    .map((part) => {
      const value = part.value.join('')
      if (part.added) return `<span class="text-green-600 bg-green-100">${value}</span>`
      if (part.removed) return `<span class="text-red-600 bg-red-100 line-through">${value}</span>`
      return value
    })
    .join('')

  return finalHtmlResult
}
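For the grouping problem, one variation I could imagine is a coarser tokenizer that emits whole sentences instead of words. The sketch below is only an illustration: `splitSentences` and `tokenizeHtmlBySentence` are hypothetical names, and the sentence boundary rule (split after `.`, `!` or `?`) is a naive assumption that will mis-split abbreviations such as "e.g.".

```typescript
// Split plain text into sentence-sized tokens. Trailing whitespace stays attached
// to each sentence, so joining the tokens reproduces the input exactly.
const splitSentences = (text: string): string[] =>
  text.match(/[^.!?]+[.!?]*\s*|[.!?]+\s*/g) ?? []

// Tokenize HTML into tags and sentence-sized text tokens.
const tokenizeHtmlBySentence = (html: string): string[] => {
  const tokens: string[] = []
  // The capturing group makes split() keep the tags in the result array.
  for (const piece of html.split(/(<[^>]+>)/)) {
    if (piece === '') continue
    if (piece.startsWith('<') && piece.endsWith('>')) {
      tokens.push(piece) // tag token, kept verbatim
    } else {
      tokens.push(...splitSentences(piece)) // text run, split into sentences
    }
  }
  return tokens
}
```

The trade-off is that a one-word change marks the whole sentence as removed and re-added, so a second word-level pass over changed sentences might still be needed for finer highlighting.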
My questions:
- Is there a recommended way to improve grouping of similar phrases/sentences instead of word-by-word diffs?
- Is there any documentation or example for subclassing Diff to improve diff scoring or heuristics?
- Would it make sense to preprocess HTML into block-level elements and diff those instead? Or is there a more robust way to make jsdiff HTML-aware?
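To illustrate what I mean by block-level preprocessing in the last question, here is a rough sketch. `BLOCK_TAGS` and `splitIntoBlocks` are hypothetical, the tag list is an arbitrary assumption, and the regex-based split is not a real HTML parser: nested blocks are cut at every closing tag.

```typescript
// Hypothetical list of tags treated as block boundaries; extend as needed.
const BLOCK_TAGS = ['p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'tr']

// Cut the HTML after each closing block tag, so every chunk ends with its closing tag.
const splitIntoBlocks = (html: string): string[] => {
  const closing = new RegExp(`(</(?:${BLOCK_TAGS.join('|')})\\s*>)`, 'i')
  // With a capturing group, split() alternates: content, closing tag, content, ...
  const parts = html.split(closing)
  const blocks: string[] = []
  for (let i = 0; i < parts.length; i += 2) {
    const block = (parts[i] ?? '') + (parts[i + 1] ?? '')
    if (block.trim() !== '') blocks.push(block) // whitespace-only chunks are dropped
  }
  return blocks
}
```

The idea would be to diff the resulting block array first, then run the word-level diff only inside blocks that changed, but I don't know whether that is the intended way to use jsdiff.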
Any guidance or references would be really appreciated.
Thanks.
smohammadhn