Commit 5eb999a

committed
remove poppler from windows configuration
1 parent 5655604 commit 5eb999a

File tree

3 files changed

+77
-49
lines changed

.Rbuildignore

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -23,10 +23,5 @@ vignettes/.*\.png$
2323
^inst/paper$
2424
^_pkgdown\.yml$
2525
^dev$
26-
^build\.sh$
27-
^poppler\.sh$
28-
^tesseract\.sh$
2926
^lib$
30-
^pacman\.conf$
31-
^harfbuzz\.sh$
3227
^\.covrignore$

src/Makevars.win

Lines changed: 0 additions & 16 deletions
@@ -10,21 +10,6 @@ PKG_LIBS += \
1010
-ltiff -lopenjp2 -lwebp -lsharpyuv -ljpeg -lgif -lpng16 -lz \
1111
-lws2_32
1212

13-
# Poppler configuration
14-
15-
POPPLERDATA = share/poppler
16-
POPPLER_RWINLIB = ../windows/poppler
17-
PKG_CXXFLAGS += -Dpoppler_cpp_EXPORTS -DBUNDLE_POPPLER_DATA
18-
PKG_CPPFLAGS += -I$(POPPLER_RWINLIB)/include/poppler/cpp \
19-
-I$(POPPLER_RWINLIB)/include/poppler \
20-
-DSTRICT_R_HEADERS -DR_NO_REMAP
21-
22-
PKG_LIBS += \
23-
-L$(POPPLER_RWINLIB)/lib${subst gcc,,${COMPILED_BY}}${R_ARCH} \
24-
-L$(POPPLER_RWINLIB)/lib \
25-
-lpoppler-cpp -lpoppler -llcms2 -ljpeg -lpng16 -ltiff -lopenjp2 \
26-
-lfreetype -lfreetype -lbz2 -liconv -lz
27-
2813
# Compile
2914

3015
all: clean winlibs
@@ -34,6 +19,5 @@ clean: rm -f $(OBJECTS) $(SHLIB)
3419
winlibs:
3520
"${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" "../tools/winlibs.R"
3621
rm -Rf ../inst/share && mkdir -p ../inst/share
37-
cp -Rf $(POPPLER_RWINLIB)/$(POPPLERDATA) ../inst/share/poppler
3822

3923
.PHONY: all winlibs clean

vignettes/intro.Rmd

Lines changed: 77 additions & 28 deletions
@@ -23,17 +23,26 @@ if (grepl("tesseract.Rcheck", getwd())) {
2323
}
2424
```
2525

26-
The tesseract package provides R bindings [Tesseract](https://github.com/tesseract-ocr/tesseract): a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.
26+
The tesseract package provides R bindings to [Tesseract](https://github.com/tesseract-ocr/tesseract):
27+
a powerful optical character recognition (OCR) engine that supports over 100
28+
languages. The engine is highly configurable in order to tune the detection
29+
algorithms and obtain the best possible results.
2730

28-
Keep in mind that OCR (pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect and the accuracy rapidly decreases with the quality of the input image. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image.
31+
Keep in mind that OCR (pattern recognition in general) is a very difficult
32+
problem for computers. Results will rarely be perfect and the accuracy rapidly
33+
decreases with the quality of the input image. But if you can get your input
34+
images to reasonable quality, Tesseract can often help to extract most of the
35+
text from the image.
2936

3037
## Extract Text from Images
3138

32-
OCR is the process of finding and recognizing text inside images, for example from a screenshot, scanned paper. The image below has some example text:
39+
OCR is the process of finding and recognizing text inside images, for example
40+
from a screenshot or scanned paper. The image below has some example text:
3341

3442
![Image with eight lines of English text](../man/figures/testocr.png)
3543

36-
The `ocr()` function extracts text from an image file. After indicating the engine for the language, it will return the text found in the image:
44+
The `ocr()` function extracts text from an image file. After indicating the
45+
engine for the language, it will return the text found in the image:
3746

3847
```{r}
3948
library(cpp11tesseract)
@@ -43,7 +52,8 @@ text <- ocr(file, engine = eng)
4352
cat(text)
4453
```
4554

46-
The `ocr_data()` function returns all words in the image along with a bounding box and confidence rate.
55+
The `ocr_data()` function returns all words in the image along with a bounding
56+
box and confidence rate.
4757

4858
```{r}
4959
results <- ocr_data(file, engine = eng)
@@ -52,15 +62,21 @@ results
5262

5363
## Language Data
5464

55-
The tesseract OCR engine uses language-specific training data in the recognize words. The OCR algorithms bias towards words and sentences that frequently appear together in a given language, just like the human brain does. Therefore the most accurate results will be obtained when using training data in the correct language.
65+
The tesseract OCR engine uses language-specific training data to recognize
66+
words. The OCR algorithms bias towards words and sentences that frequently
67+
appear together in a given language, just like the human brain does. Therefore
68+
the most accurate results will be obtained when using training data in the
69+
correct language.
5670

5771
Use `tesseract_info()` to list the languages that you currently have installed.
5872

5973
```{r}
6074
tesseract_info()
6175
```
6276

63-
By default the R package only includes English training data. Windows and Mac users can install additional training data using `tesseract_download()`. Let's OCR a screenshot from Wikipedia in Simplified Chinese.
77+
By default the R package only includes English training data. Windows and Mac
78+
users can install additional training data using `tesseract_download()`. Let's
79+
OCR a screenshot from Wikipedia in Simplified Chinese.
6480

6581
![Image with thirteen lines of Chinese text](../man/figures/chinese.jpg)
6682

@@ -89,20 +105,34 @@ cat(text2)
89105

90106
## Preprocessing with Magick
91107

92-
The accuracy of the OCR process depends on the quality of the input image. You can often improve results by properly scaling the image, removing noise and artifacts or cropping the area where the text exists. See [tesseract wiki: improve quality](https://tesseract-ocr.github.io/tessdoc/ImproveQuality) for important tips to improve the quality of your input image.
93-
94-
The awesome [magick](https://cran.r-project.org/package=magick/vignettes/intro.html) R package has many useful functions that can be use for enhancing the quality of the image. Some things to try:
95-
96-
- If your image is skewed, use `image_deskew()` and `image_rotate()` make the text horizontal.
97-
- `image_trim()` crops out whitespace in the margins. Increase the `fuzz` parameter to make it work for noisy whitespace.
98-
- Use `image_convert()` to turn the image into greyscale, which can reduce artifacts and enhance actual text.
99-
- If your image is very large or small resizing with `image_resize()` can help tesseract determine text size.
100-
- Use `image_modulate()` or `image_contrast()` or `image_contrast()` to tweak brightness / contrast if this is an issue.
108+
The accuracy of the OCR process depends on the quality of the input image. You
109+
can often improve results by properly scaling the image, removing noise and
110+
artifacts or cropping the area where the text exists.
111+
See [tesseract wiki: improve quality](https://tesseract-ocr.github.io/tessdoc/ImproveQuality)
112+
for important tips to improve the quality of your input image.
113+
114+
The awesome [magick](https://cran.r-project.org/package=magick/vignettes/intro.html)
115+
R package has many useful functions that can be used for enhancing the quality of
116+
the image. Some things to try:
117+
118+
- If your image is skewed, use `image_deskew()` and `image_rotate()` to make the
119+
text horizontal.
120+
- `image_trim()` crops out whitespace in the margins. Increase the `fuzz`
121+
parameter to make it work for noisy whitespace.
122+
- Use `image_convert()` to turn the image into greyscale, which can reduce
123+
artifacts and enhance actual text.
124+
- If your image is very large or small, resizing with `image_resize()` can help
125+
tesseract determine text size.
126+
- Use `image_modulate()` or `image_contrast()` to tweak
127+
brightness / contrast if this is an issue.
101128
- Try `image_reducenoise()` for automated noise removal. Your mileage may vary.
102-
- With `image_quantize()` you can reduce the number of colors in the image. This can sometimes help with increasing contrast and reducing artifacts.
103-
- True imaging ninjas can use `image_convolve()` to use custom [convolution methods](https://ropensci.org/technotes/2017/11/02/image-convolve/).
129+
- With `image_quantize()` you can reduce the number of colors in the image.
130+
This can sometimes help with increasing contrast and reducing artifacts.
131+
- True imaging ninjas can use `image_convolve()` to use custom
132+
[convolution methods](https://ropensci.org/technotes/2017/11/02/image-convolve/).
104133

105-
Below is an example OCR scan. The code converts it to black-and-white and resizes + crops the image before feeding it to tesseract to get more accurate OCR results.
134+
Below is an example OCR scan. The code converts it to black-and-white and
135+
resizes + crops the image before feeding it to tesseract to get more accurate OCR results.
106136

107137
![The first page of 'The Importance of Being Earnest' by Oscar Wilde](../man/figures/wilde.jpg)
108138

@@ -123,7 +153,9 @@ cat(text)
123153

124154
## Read from PDF files
125155

126-
If your images are stored in PDF files they first need to be converted to a proper image format. We can do this in R using the `pdf_convert` function from the pdftools package. Use a high DPI to keep quality of the image.
156+
If your images are stored in PDF files they first need to be converted to a
157+
proper image format. We can do this in R using the `pdf_convert` function from
158+
the `cpp11poppler` package. Use a high DPI to preserve the quality of the image.
127159

128160
```{r, eval=require(cpp11poppler)}
129161
library(cpp11poppler)
@@ -135,24 +167,32 @@ cat(text)
135167

136168
## Tesseract Control Parameters
137169

138-
Tesseract supports hundreds of "control parameters" which alter the OCR engine. Use `tesseract_params()` to list all parameters with their default value and a brief description. It also has a handy `filter` argument to quickly find parameters that match a particular string.
170+
Tesseract supports hundreds of "control parameters" which alter the OCR engine.
171+
Use `tesseract_params()` to list all parameters with their default value and a
172+
brief description. It also has a handy `filter` argument to quickly find
173+
parameters that match a particular string.
139174

140175
```{r}
141176
# List all parameters with *colour* in name or description
142177
tesseract_params("colour")
143178
```
144179

145-
Do note that some of the control parameters have changed between Tesseract engine 3 and 4.
180+
Do note that some of the control parameters have changed between Tesseract
181+
engine 3 and 4.
146182

147183
```{r}
148184
tesseract_info()["version"]
149185
```
150186

151187
### Whitelist / Blacklist characters
152188

153-
One powerful parameter is `tessedit_char_whitelist` which restricts the output to a limited set of characters. This may be useful for reading for example numbers such as a bank account, zip code, or gas meter.
189+
One powerful parameter is `tessedit_char_whitelist` which restricts the output
190+
to a limited set of characters. This may be useful for reading, for example,
191+
numbers such as a bank account, zip code, or gas meter.
154192

155-
The whitelist parameter works for all versions of Tesseract engine 3 and also engine versions 4.1 and higher, but unfortunately it did not work in Tesseract 4.0.
193+
The whitelist parameter works for all versions of Tesseract engine 3 and also
194+
engine versions 4.1 and higher, but unfortunately it did not work in Tesseract
195+
4.0.
156196

157197
![A receipt in English with food and toys for Mr. Duke](../man/figures/receipt.jpg)
158198

@@ -182,7 +222,10 @@ cat(text)
182222

183223
## Best versus Fast models
184224

185-
In order to improve the OCR results, Tesseract has two variants of models that can be used. The `tesseract_download()` can download the 'best' (but slower) model, which increases the accuracy. The 'fast' (but less accurate) model is the default.
225+
In order to improve the OCR results, Tesseract has two variants of models that
226+
can be used. The `tesseract_download()` function can download the 'best' (but slower)
227+
model, which increases the accuracy. The 'fast' (but less accurate) model is the
228+
default.
186229

187230
```{r, eval = FALSE}
188231
file <- system.file("examples", "chinese.jpg", package = "cpp11tesseract")
@@ -202,7 +245,9 @@ cat(text2)
202245

203246
## Contributed models
204247

205-
The `tesseract_contributed_download()` function can download contributed models. For example, the `grc_hist` model is useful for Polytonic Greek. Here is an example from Sophocles' Ajax (source: [Ajax Multi-Commentary](https://github.com/AjaxMultiCommentary))
248+
The `tesseract_contributed_download()` function can download contributed models.
249+
For example, the `grc_hist` model is useful for Polytonic Greek. Here is an
250+
example from Sophocles' Ajax (source: [Ajax Multi-Commentary](https://github.com/AjaxMultiCommentary))
206251

207252
![polytonicgreek](../man/figures/polytonicgreek.png)
208253

@@ -262,7 +307,8 @@ text <- ocr(file)
262307
cat(text)
263308
```
264309

265-
One way to organize the output is to split the text before the first digit on each line.
310+
One way to organize the output is to split the text before the first digit on
311+
each line.
266312

267313
```{r}
268314
text <- strsplit(text, "\n")[[1]]
@@ -298,4 +344,7 @@ for (i in seq_along(text)) {
298344
head(df)
299345
```
300346

301-
The result is not perfect (e.g. I still need to change "Gross National Product in Constant" to add the "(1958) Dollars"), but neither is Textract's and it requires to write a more complex loop to organize the data. Certainly, this can be simplified by using the Tidyverse.
347+
The result is not perfect (e.g. I still need to change "Gross National Product
348+
in Constant" to add the "(1958) Dollars"), but neither is Textract's and it
349+
requires writing a more complex loop to organize the data. Certainly, this can
350+
be simplified by using the Tidyverse.
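
The step described above, splitting each line before its first digit, can be sketched in base R without a loop. This is only an illustration under stated assumptions, not the vignette's own code: the sample `lines` vector and the helper `split_at_first_digit()` are hypothetical and stand in for the real OCR output.

```r
# Hypothetical sample lines standing in for OCR output (not from the vignette)
lines <- c(
  "Gross National Product 503.7 520.1",
  "Personal Consumption 325.2 335.2",
  "No digits here"
)

# Position of the first digit on each line (-1 when a line has none)
first_digit <- regexpr("[0-9]", lines)

# Hypothetical helper: text before the first digit becomes the label,
# everything from the first digit onwards is kept as the raw values
split_at_first_digit <- function(line, pos) {
  if (pos < 0) {
    return(c(label = trimws(line), values = NA_character_))
  }
  c(
    label = trimws(substr(line, 1, pos - 1)),
    values = trimws(substr(line, pos, nchar(line)))
  )
}

# mapply() returns a 2 x n character matrix; transpose it to one row per line
parts <- t(mapply(split_at_first_digit, lines, first_digit, USE.NAMES = FALSE))
df <- data.frame(parts, stringsAsFactors = FALSE)
df
```

A label that itself contains a digit, such as "(1958) Dollars", would be cut too early by this rule, which appears consistent with the imperfection noted above.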
