Commit 78f0045

common programming operations
- convert html to txt (java)
1 parent 06fdff2 commit 78f0045

1 file changed: +39 -2 lines changed

content/Programming/Common Programming Operations.md

Lines changed: 39 additions & 2 deletions
@@ -7,8 +7,9 @@ tags:
 - java
 - cpp
 - python
+- programming
 Creation Date: 2024-12-14, 20:31
-Last Date: 2024-12-16T22:14:18+08:00
+Last Date: 2024-12-18T23:05:46+08:00
 References:
 draft:
 description:
@@ -120,4 +121,40 @@ minHeap.push(5);
 minHeap.pop();
 // Return the smallest element from the heap
 minHeap.top().val;
-```
+```
+
+
+## Convert HTML to TXT
+---
+>[!important]
+> We don’t want to extract the content as a single string because this would result in losing all the formatting information provided by the HTML tags. Reformatting the content based on the document structure afterward is tedious, error-prone, and not scalable.
+>
+> Instead, the idea is to retain the formatting information we need before removing all the HTML tags. Then, we can use this retained formatting information to generate a text file with the desired formatting.
+
+
+```java title="Java"
+// Generate a placeholder string using UUID to avoid conflicts with the HTML content
+String uniquePlaceholder = UUID.randomUUID().toString();
+
+// This step retains line break information, which we will later replace with actual line breaks (\n)
+String htmlContent = rawHtml.replace("<br />", "<span>" + uniquePlaceholder + "</span>");
+
+// Parse the modified HTML content using Jsoup to extract plain text
+// Replace the placeholder with actual line breaks (\n) to simulate the original formatting
+String txtContent = Jsoup.parse(htmlContent).text().replace(uniquePlaceholder, "\n");
+
+// Create a FileWriter to write the plain text content to a file
+FileWriter writer = new FileWriter("output.txt");
+
+// Write the plain text content into the file
+writer.write(txtContent);
+
+writer.close();
+
+```
+
+>[!code]
+> The code example above assumes that `<br />` is the only tag used to denote line breaks in the given HTML string. If other tags or methods are used for formatting, additional handling may be required.
+>
+> Also note that `FileWriter writer` writes the content to the text file as-is, so the `\n` characters that replaced the placeholder are what produce the line breaks in the output.
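
The caveat in the note applies even to plain line breaks: `rawHtml.replace("<br />", ...)` matches that exact spelling only, so `<br>`, `<br/>`, or `<BR>` would pass through unmarked. The snippet also closes the writer manually, which leaves the file handle open if `write` throws. Below is a minimal sketch of one way to handle both, using only the standard library. The class name `LineBreakMarker` and the helpers `markLineBreaks` and `writeText` are hypothetical names introduced for illustration and are not part of the commit. The `main` method skips the Jsoup step and restores the placeholder directly, which only works because the demo input contains no other tags.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the commit.
class LineBreakMarker {
    // Matches <br>, <br/>, <br /> in any letter case, with optional attributes.
    // Assumption: attribute values do not contain a '>' character.
    private static final Pattern BR_TAG =
            Pattern.compile("<br\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces every <br> variant with a <span> that carries the placeholder.
    static String markLineBreaks(String html, String placeholder) {
        String replacement = Matcher.quoteReplacement("<span>" + placeholder + "</span>");
        return BR_TAG.matcher(html).replaceAll(replacement);
    }

    // Writes text as UTF-8; try-with-resources closes the writer even if write() throws.
    static void writeText(Path target, String text) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        String placeholder = UUID.randomUUID().toString();
        String marked = markLineBreaks("one<br>two<BR/>three<br />four", placeholder);

        // In the commit's pipeline, Jsoup.parse(marked).text() would strip the tags here.
        // This demo has no other tags, so the marked spans are restored directly.
        String txt = marked.replace("<span>" + placeholder + "</span>", "\n");

        Path out = Files.createTempFile("output", ".txt");
        writeText(out, txt);
        System.out.println(String.join("\n", Files.readAllLines(out, StandardCharsets.UTF_8)));
    }
}
```

With this helper, the marking line from the commit would become `String htmlContent = LineBreakMarker.markLineBreaks(rawHtml, uniquePlaceholder);` and the rest of the placeholder approach stays unchanged.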