Commit 801731c

Upgrade redact_cli_py to 0.3.2 (#1044)
* Upgrade redact_cli_py to 0.3.2
* Fix mailto typo, update document URL
1 parent 2e5ef6f commit 801731c

48 files changed

+6520
-560
lines changed

scripts/redact_cli_py/.flake8

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 88
+extend-ignore = E203, E501, PIE798

scripts/redact_cli_py/CHANGELOG.md

Lines changed: 34 additions & 0 deletions
@@ -6,6 +6,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.3.2] - 2022-08-11
+### Changed
+- Refactor code styles with flake8/black and their extensions.
+
+## [0.3.1] - 2022-08-02
+### Added
+- Support to multi page PDFs and TIFFs in batch redact CLI (`batch_redact.py`)
+
+## [0.3.0] - 2022-01-06
+### Added
+- Support to FormRecognizer OCR Result v3.0 format while still maintaining the backward compatibility to v2.0 and v2.1.
+
+### Changed
+- The default API version of OCR result redaction has changed from v2.x to v3.x schema.
+- You now need to specified which version of the OCR result you want to redact in `redact.py` and `batch_redact.py`.
+- Before:
+
+``` bash
+python redact.py ocr <ocr_result_path> <fott_label_path> <output_path>
+python batch_redact.py <input_container> <input_folder_path> <output_container> <output_folder_path>
+```
+
+- After:
+
+``` bash
+python redact.py ocr <ocr_result_path> <fott_label_path> <output_path> <api_version>
+python batch_redact.py <input_container> <input_folder_path> <output_container> <output_folder_path> <api_version>
+```
+
+Where API Version is one of the following:
+- v2.0
+- v2.1
+- v3.0
+
 ## [0.2.3] - 2021-12-13
 ### Added
 - Support to redact some Latin ligature letters and letters with diacritics.

scripts/redact_cli_py/Pipfile

Lines changed: 6 additions & 0 deletions
@@ -10,6 +10,12 @@ shapely = "*"
 dacite = "*"
 azure-storage-blob = "*"
 pypdfium = "*"
+flake8 = "*"
+black = "*"
+flake8-bugbear = "*"
+flake8-pie = "*"
+pep8-naming = "*"
+flake8-black = "*"
 
 [dev-packages]
 pytest = "*"

scripts/redact_cli_py/Pipfile.lock

Lines changed: 419 additions & 205 deletions
Some generated files are not rendered by default.

scripts/redact_cli_py/README.md

Lines changed: 38 additions & 9 deletions
@@ -10,11 +10,11 @@ The OCR.json and labels.json will also be redacted while keeping the semantics o
 ![ocr-before-after-redaction](./images/ocr-before-after-redaction.png)
 ![labels-before-after-redaction](./images/labels-before-after-redaction.png)
 
-## Language support
+## Language Support
 This tool supports Latin characters redaction only. For any non-Latin document support, please [contact us](mailto:formrecog_contact@microsoft.com?subject=Redaction%20tool%20language%20support).
 
 ## Version
-Redact CLI 0.2.3
+Redact CLI 0.3.2
 
 ## Setup Environment
 
@@ -103,7 +103,21 @@ python redact.py image <image_path> <fott_label_path> <output_path>
 ### Redact OCR Result
 
 ``` bash
-python redact.py ocr <ocr_result_path> <fott_label_path> <output_path>
+python redact.py ocr <ocr_result_path> <fott_label_path> <output_path> <api_version>
+```
+
+#### API Version
+
+In Azure Form Recognizer, The OCR result for different API version has different schema. To successfully redact the OCR result, you must give one of the `<api_version>` to the redaction toolkit.
+
+- v2.0
+- v2.1
+- v3.0
+
+For example,
+
+``` bash
+python redact.py ocr sample.ocr.json sample.labels.json redacted_sample.ocr.json "v3.0"
 ```
 
 ### Redact FOTT Label Path
@@ -113,6 +127,7 @@ python redact.py fott <fott_label_path> <output_path>
 ```
 
 ### Redact specific labels from Image, OCR results or FOTT Label Path
+
 In some specific use-cases, the need may arise to redact specific labels from an image, OCR results or/and FOTT Label Path.
 Labels to be redacted need to provided together in a string separated by commas.
 
@@ -127,17 +142,17 @@ And _Label_01_ and _Label_04_ need to be redacted, the following commands can be
 #### Redact specific labels from Image
 
 ``` bash
-python redact.py image <fott_label_path> <output_path> "Label_01,Label_04"
+python redact.py image <fott_label_path> <output_path> <api_version> "Label_01,Label_04"
 ```
 #### Redact specific labels from OCR Result
 
 ``` bash
-python redact.py ocr <ocr_result_path> <image_path> <fott_label_path> <output_path> "Label_01,Label_04"
+python redact.py ocr <ocr_result_path> <image_path> <fott_label_path> <output_path> <api_version> "Label_01,Label_04"
 ```
 #### Redact specific labels from FOTT Label Path
 
 ``` bash
-python redact.py image <image_path> <fott_label_path> <output_path> "Label_01,Label_04"
+python redact.py image <image_path> <fott_label_path> <output_path> <api_version> "Label_01,Label_04"
 ```
 
 ### Batch Redaction
@@ -146,7 +161,7 @@ Batch redaction supports redacting a folder rather than executing on a single fi
 2. Azure Blob Storage virtual folder: a URL to a Blob Storage container and a folder path to denotes the folder.
 
 ``` bash
-python batch_redact.py <input_container> <input_folder_path> <output_container> <output_folder_path>
+python batch_redact.py <input_container> <input_folder_path> <output_container> <output_folder_path> <api_version>
 ```
 
 #### Container
@@ -176,12 +191,16 @@ python batch_redact.py local raw/ "https://my.blob.account/data?<my_secret_SAS_t
 python batch_redact.py "https://my.blob.account/data?<my_secret_SAS_token>" folder1/ "https://my.blob.account/data?<my_secret_SAS_token>" folder2/
 ```
 
-#### Note
+---
+
+**NOTE**
 
 1. Surround the URL with double quotes to prevent wrong character escape in the SAS token.
 2. Visit [Create Your SAS tokens with Azure Storage Explorer](https://docs.microsoft.com/en-us/azure/cognitive-services/translator/document-translation/create-sas-tokens?tabs=Containers) to see how to create a SAS token for this program to use.
 3. Currently, this redact CLI only support ASCII character redaction (Latin alphabets without the accent marks).
 
+---
+
 #### PDF Support
 
 Batch mode now supports redacting data from one-page PDF documents. The tool will detect any PDF document in the input folder, convert to an image (.png) and redact the image itself placing it in the specified output folder upon completion.
@@ -204,7 +223,17 @@ pytest
 
 in the root folder.
 
-### Note
+---
+
+**NOTE**
 
 1. You can also take a look at the `redact/__init__.py` file. The command line interface (CLI) is just a thin wrapper on `redact_image()`, `redact_ocr_result()`, and `redact_fott_label()`. You could extend the code on top of the three functions for achieving your own goal, such as to redact a batch of data.
 2. For batch redaction, we currently only support `.jpeg`, `.jpg`, `.png`, `.tif`, `.tiff`, and `.bmp` as the file extension for images. PDF files are not supported.
+
+---
+
+## References
+
+- [Form Recognizer API v2.0](https://westus2.dev.cognitive.microsoft.com/docs/services/form-recognizer-api-v2/operations/AnalyzeWithCustomForm)
+- [Form Recognizer API v2.1](https://westus.dev.cognitive.microsoft.com/docs/services/form-recognizer-api-v2-1/operations/AnalyzeWithCustomForm)
+- [Form Recognizer API v3.0](https://westus.dev.cognitive.microsoft.com/docs/services/form-recognizer-api-2022-08-31/operations/GetAnalyzeDocumentResult)
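
The README change makes `<api_version>` a required argument restricted to v2.0, v2.1, and v3.0. As a rough sketch of how a caller might check that value before invoking the tool, consider the helper below. `SUPPORTED_API_VERSIONS` and `check_api_version` are hypothetical names that do not appear in this commit, and whatever validation the `redact` package performs internally may differ.

```python
from typing import Tuple

# Versions listed in the README and CHANGELOG of this commit.
SUPPORTED_API_VERSIONS: Tuple[str, ...] = ("v2.0", "v2.1", "v3.0")


def check_api_version(api_version: str) -> str:
    """Return the version unchanged if it is supported, else raise ValueError.

    Hypothetical helper for illustration; not part of redact_cli_py.
    """
    if api_version not in SUPPORTED_API_VERSIONS:
        supported = ", ".join(SUPPORTED_API_VERSIONS)
        raise ValueError(
            f"Unsupported api_version {api_version!r}; expected one of: {supported}"
        )
    return api_version
```

Failing early with the list of accepted values is one way to avoid running a redaction against the wrong OCR schema; it is a design suggestion, not something this commit implements.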

scripts/redact_cli_py/batch_redact.py

Lines changed: 56 additions & 67 deletions
@@ -8,108 +8,97 @@
 from typing import List
 from uuid import uuid4
 
-from redact import redact_image, redact_fott_label, redact_ocr_result
+from redact import redact_fott_label, redact_ocr_result, redact_file_bundle
 from redact.io.blob_reader import BlobReader
 from redact.io.blob_writer import BlobWriter
 from redact.io.local_reader import LocalReader
 from redact.io.local_writer import LocalWriter
 from redact.utils.file_name import get_redacted_file_name, valid_url
-from redact.utils.pdf_renderer import PdfRenderer
 from redact.types.file_bundle import FileType, FileBundle
-from redact.types.pre_processing_bundle import PdfPreProcessingBundle
+from redact.preprocess import preprocess_multi_page_bundle
 
 
 # Strong Assumption: assume all valid URLs are Azure Blob URL.
 def is_blob_url(url: str) -> bool:
     return valid_url(url)
 
 
-def process_pdf_bundle(file_bundles: List[FileBundle], fields_to_redact: List[str]):
-    renderer = PdfRenderer()
-
-    for file_bundle in file_bundles:
-        pdf_pre_processing_bundle = PdfPreProcessingBundle.from_file_bundle(file_bundle)
-
-        redacted_image_name = get_redacted_file_name(pdf_pre_processing_bundle.rendered_file_name)
-        redacted_fott_name = get_redacted_file_name(file_bundle.fott_file_name)
-        redacted_ocr_name = get_redacted_file_name(file_bundle.ocr_file_name)
-
-        # Render PDF
-        renderer.render_pdf_and_save(
-            Path(build_pre_processing_folder, file_bundle.image_file_name),
-            Path(build_pre_processing_folder, pdf_pre_processing_bundle.rendered_file_name),
-            target_pdf_render_dpi)
-
-        # Follow the regular redaction process with taking files from slightly different source folders
-        redact_image(
-            Path(build_pre_processing_folder, pdf_pre_processing_bundle.rendered_file_name),
-            Path(build_pre_processing_folder, file_bundle.fott_file_name),
-            Path(build_output_folder, redacted_image_name),
-            fields_to_redact)
-        redact_fott_label(
-            Path(build_pre_processing_folder, file_bundle.fott_file_name),
-            Path(build_output_folder, redacted_fott_name),
-            fields_to_redact)
-        redact_ocr_result(
-            Path(build_pre_processing_folder, file_bundle.ocr_file_name),
-            Path(build_pre_processing_folder, file_bundle.fott_file_name),
-            Path(build_output_folder, redacted_ocr_name),
-            fields_to_redact)
-
-if __name__ == '__main__':
+if __name__ == "__main__":
     input_container = sys.argv[1]
     input_path = sys.argv[2]
     output_container = sys.argv[3]
     output_path = sys.argv[4]
+    api_version = sys.argv[5]
     target_pdf_render_dpi = 300
-    fields_to_redact = []
+    fields_to_redact = tuple()
 
-    if len(sys.argv) >= 6:
-        fields_to_redact = (sys.argv[5].split(','))
+    if len(sys.argv) >= 7:
+        fields_to_redact = sys.argv[6].split(",")
 
     # Random generated UUID in the build folder name for preventing collapse.
-    build_path = Path(f'build-{uuid4()}/')
-    build_pre_processing_folder = Path(build_path, "pre/")
+    build_path = Path(f"build-{uuid4()}/")
+    build_pre_folder = Path(build_path, "pre/")
     build_input_folder = Path(build_path, "in/")
     build_output_folder = Path(build_path, "out/")
-    Path(build_pre_processing_folder).mkdir(parents=True, exist_ok=True)
+    Path(build_pre_folder).mkdir(parents=True, exist_ok=True)
     Path(build_input_folder).mkdir(parents=True, exist_ok=True)
     Path(build_output_folder).mkdir(parents=True, exist_ok=True)
+
     try:
         file_bundle_list = None
-        pdf_file_bundle_list = None
+        multi_page_bundle_list = None
         if is_blob_url(input_container):
             reader = BlobReader(input_container, input_path)
-            pdf_file_bundle_list = reader.download_bundles(to=build_pre_processing_folder, mode=FileType.PDF_ONLY)
+            multi_page_bundle_list = reader.download_bundles(
+                to=build_pre_folder, mode=FileType.MULTI_PAGE
+            )
             file_bundle_list = reader.download_bundles(to=build_input_folder)
         else:
             reader = LocalReader(input_path)
-            pdf_file_bundle_list = reader.copy_bundles(to=build_pre_processing_folder, mode=FileType.PDF_ONLY)
+            multi_page_bundle_list = reader.copy_bundles(
+                to=build_pre_folder, mode=FileType.MULTI_PAGE
+            )
             file_bundle_list = reader.copy_bundles(to=build_input_folder)
 
+        per_page_bundle_list: List[FileBundle] = []
+
+        # Render and process PDF/TIFF files if any.
+        if multi_page_bundle_list is not None:
+            for fb in multi_page_bundle_list:
+                bundle_list = preprocess_multi_page_bundle(
+                    fb, build_pre_folder, build_input_folder, target_pdf_render_dpi
+                )
+                per_page_bundle_list.extend(bundle_list)
+
+                # Short path: preprocess folder -> output folder.
+                # We still need to redact the full label file.
+                redact_fott_label(
+                    Path(build_pre_folder, fb.fott_file_name),
+                    Path(
+                        build_output_folder, get_redacted_file_name(fb.fott_file_name)
+                    ),
+                    fields_to_redact,
+                )
+
+                # We still need to redact the full ocr file.
+                redact_ocr_result(
+                    Path(build_pre_folder, fb.ocr_file_name),
+                    Path(build_pre_folder, fb.fott_file_name),
+                    Path(build_output_folder, get_redacted_file_name(fb.ocr_file_name)),
+                    api_version,
+                    fields_to_redact,
+                )
+
+        # Process images and per page result from multi-page documents.
+        file_bundle_list.extend(per_page_bundle_list)
         for fb in file_bundle_list:
-            redacted_image_name = get_redacted_file_name(fb.image_file_name)
-            redacted_fott_name = get_redacted_file_name(fb.fott_file_name)
-            redacted_ocr_name = get_redacted_file_name(fb.ocr_file_name)
-
-            redact_image(
-                Path(build_input_folder, fb.image_file_name),
-                Path(build_input_folder, fb.fott_file_name),
-                Path(build_output_folder, redacted_image_name),
-                fields_to_redact)
-            redact_fott_label(
-                Path(build_input_folder, fb.fott_file_name),
-                Path(build_output_folder, redacted_fott_name),
-                fields_to_redact)
-            redact_ocr_result(
-                Path(build_input_folder, fb.ocr_file_name),
-                Path(build_input_folder, fb.fott_file_name),
-                Path(build_output_folder, redacted_ocr_name),
-                fields_to_redact)
-
-        # Render and process PDF files if any
-        if pdf_file_bundle_list is not None:
-            process_pdf_bundle(pdf_file_bundle_list, fields_to_redact)
+            redact_file_bundle(
+                fb,
+                build_input_folder,
+                build_output_folder,
+                api_version,
+                fields_to_redact,
+            )
 
         if is_blob_url(output_container):
             writer = BlobWriter(output_container, output_path)
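
The new control flow in `batch_redact.py` expands each multi-page bundle into per-page bundles, appends them to the single-page bundles, and runs one redaction loop over the combined list. The sketch below illustrates only that ordering. `Bundle`, `expand_pages`, and `plan_redaction` are hypothetical stand-ins, and the page file naming is invented; the real `FileBundle` and `preprocess_multi_page_bundle` are not shown in this commit view and may behave differently.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Dict, List


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for redact.types.file_bundle.FileBundle."""

    image_file_name: str


def expand_pages(bundle: Bundle, page_count: int) -> List[Bundle]:
    """Stand-in for preprocess_multi_page_bundle: one image bundle per page.

    The "<stem>.page<N>.png" naming is an assumption made for illustration.
    """
    stem = PurePath(bundle.image_file_name).stem
    return [Bundle(f"{stem}.page{n}.png") for n in range(1, page_count + 1)]


def plan_redaction(
    single_page: List[Bundle],
    multi_page: List[Bundle],
    page_counts: Dict[str, int],
) -> List[Bundle]:
    """Return every bundle the final per-image redaction loop would visit."""
    per_page: List[Bundle] = []
    for bundle in multi_page:
        per_page.extend(expand_pages(bundle, page_counts[bundle.image_file_name]))
    # Single-page images first, then the pages rendered from PDFs/TIFFs.
    return [*single_page, *per_page]


plan = plan_redaction([Bundle("scan.png")], [Bundle("form.pdf")], {"form.pdf": 2})
print([b.image_file_name for b in plan])
# prints ['scan.png', 'form.page1.png', 'form.page2.png']
```

Merging the two lists before the loop lets a single call path (`redact_file_bundle` in the commit) handle both kinds of input.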

scripts/redact_cli_py/redact.py

Lines changed: 17 additions & 12 deletions
@@ -6,30 +6,35 @@
 from redact import redact_image, redact_fott_label, redact_ocr_result
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     operator = sys.argv[1]
 
-    if operator == 'image':
-        labels_to_redact = [] if len(sys.argv) < 6 else sys.argv[5].split(',')
+    if operator == "image":
+        labels_to_redact = [] if len(sys.argv) < 6 else sys.argv[5].split(",")
         redact_image(
             image_path=sys.argv[2],
             fott_label_path=sys.argv[3],
             output_path=sys.argv[4],
-            labels_to_redact=labels_to_redact)
+            labels_to_redact=labels_to_redact,
+        )
 
-    elif operator == 'fott':
-        labels_to_redact = [] if len(sys.argv) < 5 else sys.argv[4].split(',')
-        redact_fott_label(fott_label_path=sys.argv[2],
-                          output_path=sys.argv[3],
-                          labels_to_redact=labels_to_redact)
+    elif operator == "fott":
+        labels_to_redact = [] if len(sys.argv) < 5 else sys.argv[4].split(",")
+        redact_fott_label(
+            fott_label_path=sys.argv[2],
+            output_path=sys.argv[3],
+            labels_to_redact=labels_to_redact,
+        )
 
-    elif operator == 'ocr':
-        labels_to_redact = [] if len(sys.argv) < 6 else sys.argv[5].split(',')
+    elif operator == "ocr":
+        labels_to_redact = [] if len(sys.argv) < 7 else sys.argv[6].split(",")
         redact_ocr_result(
             ocr_result_path=sys.argv[2],
             fott_label_path=sys.argv[3],
             output_path=sys.argv[4],
-            labels_to_redact=labels_to_redact)
+            api_version=sys.argv[5],
+            labels_to_redact=labels_to_redact,
+        )
 
     else:
         raise NameError()
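
`redact.py` reads its arguments by position from `sys.argv`, so adding `<api_version>` shifts the optional labels argument from index 5 to index 6 for the `ocr` operator only. One possible alternative, sketched here with the standard-library `argparse` module, names each positional argument per operator so the indices no longer need to be tracked by hand. `build_parser` and `split_labels` are hypothetical names and are not part of this commit; the argument names mirror the keyword arguments in the diff.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one sub-command per operator in redact.py."""
    parser = argparse.ArgumentParser(prog="redact.py")
    subparsers = parser.add_subparsers(dest="operator", required=True)

    image = subparsers.add_parser("image")
    image.add_argument("image_path")
    image.add_argument("fott_label_path")
    image.add_argument("output_path")
    image.add_argument("labels", nargs="?", default="")

    fott = subparsers.add_parser("fott")
    fott.add_argument("fott_label_path")
    fott.add_argument("output_path")
    fott.add_argument("labels", nargs="?", default="")

    ocr = subparsers.add_parser("ocr")
    ocr.add_argument("ocr_result_path")
    ocr.add_argument("fott_label_path")
    ocr.add_argument("output_path")
    ocr.add_argument("api_version")  # only the ocr operator takes a version
    ocr.add_argument("labels", nargs="?", default="")

    return parser


def split_labels(labels: str) -> List[str]:
    """Turn "Label_01,Label_04" into a list; an omitted argument gives []."""
    return labels.split(",") if labels else []


args = build_parser().parse_args(
    ["ocr", "sample.ocr.json", "sample.labels.json", "out.json", "v3.0", "Label_01,Label_04"]
)
print(args.api_version, split_labels(args.labels))
```

With this layout an unknown operator produces a usage message from `argparse` instead of a bare `NameError`.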
