Can Tesseract be set to OCR only (no image modification) when producing a PDF?
Is there a way to force Tesseract to OCR only and leave the original images intact? At the moment, I use the command:
tesseract -l eng file.tif file pdf
in order to produce file.pdf from a multipage TIFF file. The problem is that this command modifies the images. For example, thin lines that denote tables or figures are removed. I'd like to stop this behavior and just OCR the document, with the text underlaid on the original image. In case it matters,
$ tesseract -v
tesseract 3.03
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0
and
$ cat /usr/share/tessdata/configs/pdf
tessedit_create_pdf 1
tessedit_pageseg_mode 1
Using the current git repo of Tesseract, the resulting images are better. Specifically:
$ ./tesseract -v
tesseract 3.04.00
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0
and
$ git log -n 1
commit 941d87057e67d18aca2ed428543e7f24bbdba010
Author: Ray Smith <rays@google.com>
Date:   Wed May 13 17:46:58 2015 -0700

    fixed training build
with
$ git branch
* master
Basically, all of the lines that used to be eliminated in 3.03 for tables and figures remain. That being said, the image is still manipulated, and the resolution is lower than in the original image. Nevertheless, for my purposes, things are OK.
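One way to check how much the resolution differs is to compare what the source TIFF records with what ends up embedded in the PDF. The following is only a rough sketch. It assumes that `tiffinfo` (from the libtiff tools) and `pdfimages` (from poppler-utils) are installed; neither tool is part of the setup described above, and `compare_resolution` is a hypothetical helper name.

```shell
# Sketch: print the resolution of a source TIFF next to the images embedded in a PDF.
# Assumption: tiffinfo (libtiff tools) and pdfimages (poppler-utils) are available.
# compare_resolution is a hypothetical helper, not part of Tesseract.
compare_resolution() {
    tif=$1
    pdf=$2

    # Refuse to continue if either input file does not exist.
    for f in "$tif" "$pdf"; do
        [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
    done

    # Resolution tag recorded for each page of the TIFF, if present.
    tiffinfo "$tif" | grep -i 'resolution'

    # One row per embedded image, with width, height, x-ppi and y-ppi columns.
    pdfimages -list "$pdf"
}
```

Calling `compare_resolution file.tif file.pdf` after running the tesseract command should show whether the embedded page images have fewer pixels or a lower ppi than the source pages.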