|
7 | 7 | - java |
8 | 8 | - cpp |
9 | 9 | - python |
| 10 | + - programming |
10 | 11 | Creation Date: 2024-12-14, 20:31 |
11 | | -Last Date: 2024-12-16T22:14:18+08:00 |
| 12 | +Last Date: 2024-12-18T23:05:46+08:00 |
12 | 13 | References: |
13 | 14 | draft: |
14 | 15 | description: |
@@ -120,4 +121,40 @@ minHeap.push(5); |
120 | 121 | minHeap.pop(); |
121 | 122 | // Return the smallest element from the heap |
122 | 123 | minHeap.top().val; |
123 | | -``` |
| 124 | +``` |
| 125 | + |
| 126 | + |
| 127 | +## Convert HTML to TXT |
| 128 | +--- |
| 129 | +>[!important] |
| 130 | +> Extracting the content as a single string discards all the formatting information carried by the HTML tags. Rebuilding that formatting from the document structure afterward is tedious, error-prone, and does not scale. |
| 131 | +> |
| 132 | +> Instead, mark the formatting information we need before removing the HTML tags, then use those markers to generate a text file with the desired formatting. |
| 133 | + |
| 134 | + |
| 135 | +```java title="Java" |
| 136 | +// Generate a placeholder string using UUID to avoid conflicts with the HTML content |
| 137 | +String uniquePlaceholder = UUID.randomUUID().toString(); |
| 138 | + |
| 139 | +// This step retains line break information, which we will later replace with actual line breaks (\n) |
| 140 | +String htmlContent = rawHtml.replace("<br />", "<span>" + uniquePlaceholder + "</span>"); |
| 141 | + |
| 142 | +// Parse the modified HTML content using Jsoup to extract plain text |
| 143 | +// Replace the placeholder with actual line breaks (\n) to simulate the original formatting |
| 144 | +String txtContent = Jsoup.parse(htmlContent).text().replace(uniquePlaceholder, "\n"); |
| 145 | + |
| 146 | +// Create a FileWriter to write the plain text content to a file |
| 147 | +try (FileWriter writer = new FileWriter("output.txt")) { |
| 148 | + |
| 149 | +    // Write the plain text content into the file |
| 150 | +    writer.write(txtContent); |
| 151 | + |
| 152 | +} // try-with-resources closes the writer, even if write() throws |
| 153 | + |
| 154 | +``` |
| 155 | + |
| 156 | +>[!code] |
| 157 | +> The code example above assumes that `<br />`, written exactly that way, is the only tag used to denote line breaks in the given HTML string. If other tags or methods are used for formatting, additional handling is required. |
| 158 | +> |
| 159 | +> Also note that `FileWriter` writes the string to the file as-is; the line breaks appear in the output because the placeholder was already replaced with `\n`. |
| 160 | + |
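The callout above notes that other line-break spellings need extra handling. A minimal sketch of one way to do that follows, assuming the only variants are `<br>`, `<br/>`, and `<br />` in any letter case. The class name `HtmlToText` and method `toText` are hypothetical, and the regex tag stripping is only a standard-library stand-in for `Jsoup.parse(...).text()` so the sketch runs without dependencies; it does not decode entities or cope with malformed markup.

```java
import java.util.UUID;
import java.util.regex.Pattern;

public class HtmlToText {
    // Matches <br>, <br/>, <br /> in any letter case (assumed set of variants)
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Naive tag matcher: a stand-in for a real HTML parser such as Jsoup
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String toText(String rawHtml) {
        // A UUID placeholder is very unlikely to collide with the HTML content
        String placeholder = UUID.randomUUID().toString();

        // Step 1: mark every line break before the tags are removed
        String marked = BR.matcher(rawHtml).replaceAll(placeholder);

        // Step 2: remove the remaining tags
        String stripped = TAG.matcher(marked).replaceAll("");

        // Step 3: turn the markers back into real line breaks
        return stripped.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>Hello<br>World<BR />again</p>"));
    }
}
```

With Jsoup available, step 2 would be replaced by the `Jsoup.parse(marked).text()` call used earlier; steps 1 and 3 stay the same.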