Commit ccb55b0

Authored by cschenio, TFR258, and chiache-msft
redact-cli-py | Update to v0.2.2 (#1006)
* Add support for One Page PDF and selective label redacting
* redact-cli-py | Add requirements.txt
* Bump version to v0.2.2

Co-authored-by: TFR258 <threis@microsoft.com>
Co-authored-by: Chia-Sheng Chen <chiache@microsoft.com>
1 parent ee9aba9 commit ccb55b0

35 files changed, +2656 -107 lines changed

scripts/redact_cli_py/CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -6,6 +6,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.2.2] - 2021-11-17
+### Added
+- Support for redacting only specific labels.
+- Support for one-page PDFs in the batch redact CLI (`batch_redact.py`).
+
 ## [0.2.1] - 2021-08-05
 ### Added
 - Support for image modes other than 'RGB' and 'RGBA', e.g. image mode '1'.

scripts/redact_cli_py/Pipfile

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ jsonpointer = "*"
 shapely = "*"
 dacite = "*"
 azure-storage-blob = "*"
+pypdfium = "*"

 [dev-packages]
 pytest = "*"

scripts/redact_cli_py/Pipfile.lock

Lines changed: 77 additions & 67 deletions
Some generated files are not rendered by default.

scripts/redact_cli_py/README.md

Lines changed: 47 additions & 2 deletions
@@ -11,7 +11,7 @@ The OCR.json and labels.json will also be redacted while keeping the semantics o
 ![labels-before-after-redaction](./images/labels-before-after-redaction.png)

 ## Version
-Redact CLI 0.2.1
+Redact CLI 0.2.2

 ## Setup Environment

@@ -63,7 +63,12 @@ pip install pipenv
 cd redact_cli_py/
 pipenv install
 ```
+---
+**NOTE**

+If you run into any errors, try running the above commands from an elevated PowerShell terminal.
+
+---

 ## Run

@@ -104,6 +109,34 @@ python redact.py ocr <ocr_result_path> <fott_label_path> <output_path>
 python redact.py fott <fott_label_path> <output_path>
 ```

+### Redact specific labels from Image, OCR results or FOTT Label Path
+In some use cases, only specific labels need to be redacted from an image, OCR results, and/or a FOTT label file.
+Provide the labels to be redacted together in a single string, separated by commas.
+
+If a document holds the following labels:
+- Label_01
+- Label_02
+- Label_03
+- Label_04
+
+And _Label_01_ and _Label_04_ need to be redacted, the following commands can be used:
+
+#### Redact specific labels from Image
+
+``` bash
+python redact.py image <image_path> <fott_label_path> <output_path> "Label_01,Label_04"
+```
+#### Redact specific labels from OCR Result
+
+``` bash
+python redact.py ocr <ocr_result_path> <fott_label_path> <output_path> "Label_01,Label_04"
+```
+#### Redact specific labels from FOTT Label Path
+
+``` bash
+python redact.py fott <fott_label_path> <output_path> "Label_01,Label_04"
+```
+
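The comma-separated label argument described above can be illustrated with a small Python sketch. This is not the code in `redact.py`; the helper names `parse_labels` and `select_labels` are hypothetical, and the choices to strip whitespace around names and to treat a missing argument as "redact every label" are assumptions.

```python
def parse_labels(labels_arg):
    """Split a comma-separated label string into a set of label names.

    Returns None when no labels are given, which this sketch treats as
    "no filter" (an assumption, not confirmed behavior of redact.py).
    """
    if not labels_arg:
        return None
    # Assumption: tolerate spaces around names and ignore empty entries.
    return {name.strip() for name in labels_arg.split(",") if name.strip()}


def select_labels(all_labels, labels_arg):
    """Return the labels that should be redacted, preserving document order."""
    wanted = parse_labels(labels_arg)
    if wanted is None:
        return list(all_labels)
    return [label for label in all_labels if label in wanted]


document_labels = ["Label_01", "Label_02", "Label_03", "Label_04"]
print(select_labels(document_labels, "Label_01,Label_04"))
# prints ['Label_01', 'Label_04']
```

Using a set for the requested names keeps the membership check cheap, while iterating over the document's own label list preserves the original order.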
 ### Batch Redaction
 Batch redaction supports redacting a folder rather than executing on a single file. Both the input and the output support two sources:
 1. local folder: a path to a folder on your local machine.
@@ -116,7 +149,7 @@ python batch_redact.py <input_container> <input_folder_path> <output_container>
 #### Container
 You can provide one of the two options:
 1. `local`: this means you read/write data from local machine.
-2. `https://<blob_account_url>/<container_name>?<sas_token>`: this means you read/write data from the the container `container_name` of the Azure blob account `blob_account_url`. Please make sure your `sas_token` grants the correct access (Read/List for input, Read/Add/Create/Write/Delete/List for output).
+2. `https://<blob_account_url>/<container_name>?<sas_token>`: this means you read/write data from the container `container_name` of the Azure blob account `blob_account_url`. Please make sure your `sas_token` grants the correct access (Read/List for input, Read/Add/Create/Write/Delete/List for output).
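To make the two container forms concrete, here is a minimal sketch of how such an argument could be split into its parts with the standard library. It is not taken from `batch_redact.py`: the function name `parse_container` and the returned dictionary keys are hypothetical, and the sample token in the usage line is a made-up placeholder.

```python
from urllib.parse import urlsplit


def parse_container(container_arg):
    """Classify a container argument as local or as an Azure blob container URL."""
    if container_arg == "local":
        return {"kind": "local"}

    parts = urlsplit(container_arg)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError("expected 'local' or an https:// container URL")

    # The path holds the container name; the query string holds the SAS token.
    container_name = parts.path.strip("/")
    if not container_name or "/" in container_name:
        raise ValueError("expected exactly one container name in the URL path")
    if not parts.query:
        raise ValueError("missing SAS token in the URL query string")

    return {
        "kind": "blob",
        "account_url": f"https://{parts.netloc}",
        "container_name": container_name,
        "sas_token": parts.query,
    }


# Placeholder URL and token, for illustration only.
parsed = parse_container("https://my.blob.account/data?sv=2020&sig=abc")
print(parsed["account_url"], parsed["container_name"])
```

Because the SAS token is a secret, a real tool would avoid logging the `sas_token` field.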

 #### Examples

@@ -146,6 +179,18 @@ python batch_redact.py "https://my.blob.account/data?<my_secret_SAS_token>" fold
 2. Visit [Create Your SAS tokens with Azure Storage Explorer](https://docs.microsoft.com/en-us/azure/cognitive-services/translator/document-translation/create-sas-tokens?tabs=Containers) to see how to create a SAS token for this program to use.
 3. Currently, this redact CLI only supports ASCII character redaction (Latin letters without accent marks).

+#### PDF Support
+
+Batch mode now supports redacting data from one-page PDF documents. The tool detects any PDF document in the input folder, converts it to an image (.png), redacts that image, and places the result in the specified output folder.
+
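The detection step described above might look roughly like the following sketch. It covers only finding PDFs and choosing output names. The actual page rendering is left out because the rendering API is not shown here. The function name `plan_pdf_conversions`, the non-recursive scan, and the `<stem>.png` naming are all assumptions.

```python
from pathlib import Path


def plan_pdf_conversions(input_folder, output_folder):
    """Pair each PDF in input_folder with the PNG path it would be written to.

    Assumptions: only the top level of the folder is scanned, the extension
    match is case-insensitive, and the image keeps the PDF's base name.
    """
    pairs = []
    for path in sorted(Path(input_folder).iterdir()):
        if path.is_file() and path.suffix.lower() == ".pdf":
            png_path = Path(output_folder) / (path.stem + ".png")
            pairs.append((path, png_path))
    return pairs
```

Calling `plan_pdf_conversions("raw/", "redacted/")` would return one `(pdf_path, png_path)` tuple per PDF found. A renderer would then convert the single page of each PDF before the image redaction runs.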
+#### Batch Redacting specific labels
+
+Like the single-file `redact.py` script, `batch_redact.py` supports redacting specific labels. Provide the labels to be redacted together in a single string, separated by commas.
+``` bash
+python batch_redact.py local raw/ local redacted/ "Label_01,Label_02"
+```
+
+
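One way an optional trailing labels argument can be accepted on the command line is with `argparse` and `nargs="?"`. The sketch below is illustrative only and does not reproduce the argument handling in `batch_redact.py`. The argument names, in particular `output_folder_path` and `labels`, are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with four required positionals and one optional one."""
    parser = argparse.ArgumentParser(prog="batch_redact.py")
    parser.add_argument("input_container")
    parser.add_argument("input_folder_path")
    parser.add_argument("output_container")
    parser.add_argument("output_folder_path")  # hypothetical name
    # nargs="?" makes the positional optional; None means no label filter.
    parser.add_argument("labels", nargs="?", default=None,
                        help="comma-separated labels to redact")
    return parser


args = build_parser().parse_args(
    ["local", "raw/", "local", "redacted/", "Label_01,Label_02"]
)
print(args.labels.split(","))
# prints ['Label_01', 'Label_02']
```

With this layout, invocations without the fifth argument still parse and leave `labels` as `None`.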
 ### Test

 To run the unit tests, simply run
