pdf generation - Can Tesseract be set to OCR only (no image modification) when producing a PDF?


Is there a way to force Tesseract to only OCR and leave the original images intact? At the moment, I use the command:

tesseract -l eng file.tif file pdf 

in order to produce file.pdf from a multipage TIFF file. The problem with this command is that Tesseract modifies the images. For example, thin lines that denote tables or figures are removed. I'd like to stop this behavior and have the OCR'd document text underlaid on the original image. In case it matters,

$ tesseract -v
tesseract 3.03
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0

and

$ cat /usr/share/tessdata/configs/pdf
tessedit_create_pdf 1
tessedit_pageseg_mode 1
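For anyone reproducing this, here is a minimal sketch that wraps the invocation from the question in a shell function. The name `ocr_to_pdf` is hypothetical; it only reuses the `-l eng ... pdf` arguments already shown and does not change how Tesseract treats the page image.

```shell
#!/bin/sh
# Hypothetical helper: OCR a multipage TIFF to PDF with the same
# arguments as the command in the question.
ocr_to_pdf() {
    input=$1
    if [ ! -f "$input" ]; then
        echo "ocr_to_pdf: no such file: $input" >&2
        return 1
    fi
    # Strip the extension (file.tif -> file); Tesseract appends .pdf itself.
    base=${input%.*}
    # The trailing "pdf" names the config file printed above.
    tesseract -l eng "$input" "$base" pdf
}
```

Calling `ocr_to_pdf file.tif` should then be equivalent to the original command and write file.pdf next to the input.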

Using the current git repo of Tesseract, the resulting images are better. Specifically:

$ ./tesseract -v
tesseract 3.04.00
 leptonica-1.71
  libgif 4.1.6(?) : libjpeg 6b : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0

and

$ git log -n 1
commit 941d87057e67d18aca2ed428543e7f24bbdba010
Author: Ray Smith <rays@google.com>
Date:   Wed May 13 17:46:58 2015 -0700

    Fixed training build

with

$ git branch
* master

Basically, the lines that used to be eliminated in 3.03 for tables and figures now remain. That being said, the image is still manipulated, and its resolution is lower than that of the original image. Nevertheless, for my purposes, things are OK.
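To check the resolution difference yourself, something like the following sketch may help. It assumes `tiffinfo` (from the libtiff tools) and `pdfimages` (from poppler-utils) are installed; neither is mentioned in the original post, and the function name `show_resolutions` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print resolution info for the source TIFF and
# for the images embedded in the generated PDF.
# Assumes tiffinfo (libtiff tools) and pdfimages (poppler-utils) exist.
show_resolutions() {
    tif=$1
    pdf=$2
    echo "== $tif =="
    # Keep only the resolution lines from tiffinfo's per-page output.
    tiffinfo "$tif" | grep -i 'resolution'
    echo "== $pdf =="
    # One row per embedded image, including x-ppi and y-ppi columns.
    pdfimages -list "$pdf"
}
```

Running `show_resolutions file.tif file.pdf` prints both sets of values so they can be compared side by side.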

