script for OCR & text-parsing with collections label-images

Kate Webbink d5145290fa add 'images' directory		2019-10-31 13:34:16 -05:00
images	add 'images' directory	2019-10-31 13:34:16 -05:00
LICENSE	Initial commit	2019-10-31 13:04:26 -05:00
README.md	updated ReadMe	2019-10-31 13:23:20 -05:00
ocrMangle.R	Added script and sample images	2019-10-31 13:33:21 -05:00

Collections-OCR

A script to batch-process collections label-images through OCR

Dependencies - make sure to install these libraries first:

Notes:

This takes ~2 seconds per label-image.
This currently uses Tesseract's English ("eng"), German ("deu") and Latin ("lat") language data.
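
As a rough planning aid, the ~2 seconds per image figure can be turned into a batch-time estimate. This is only a sketch in base R: the helper name `estimate_minutes`, the "images" folder path, and the case-insensitive file pattern are assumptions for illustration, not taken from ocrMangle.R.

```r
# Assumed average from the note above: about 2 seconds per label-image.
seconds_per_image <- 2

# Hypothetical helper: rough wall-clock estimate, in minutes, for a batch.
estimate_minutes <- function(n_images, secs = seconds_per_image) {
  n_images * secs / 60
}

# Count JPG/JPEG files in an assumed "images" folder
# (returns 0 if the folder is missing or empty).
n_images <- length(list.files("images", pattern = "\\.jpe?g$", ignore.case = TRUE))

estimate_minutes(n_images)  # estimate for the folder
estimate_minutes(500)       # estimate for a hypothetical batch of 500 images
```

Actual timing will vary with image size and hardware, so treat the result as an order-of-magnitude guide.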

To run the script:

Output - a CSV named "ocrText-[Date-time].csv", containing these columns:

"image" = filename of each JPG or JPEG image
"line_count" = number of lines in that image's OCR transcription
"Line1" - "Line[N]" = text of each line in the OCR transcription
- the number of "Line" columns matches the largest line count among the processed images; images with fewer lines leave the remaining "Line" columns empty.
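
The column layout described above can be sketched in base R as follows. This is not the code from ocrMangle.R: the sample transcriptions, the variable names, the timestamp format in the filename, and writing to a temporary folder are all assumptions made for illustration.

```r
# Hypothetical OCR results: one string per image, lines separated by "\n".
ocr_text <- c(
  "label_001.jpg"  = "Field Museum\nQuercus alba L.\nIllinois, 1902",
  "label_002.jpeg" = "Herbarium\nColl. unknown"
)

# Split each transcription into lines, trimming whitespace and dropping blanks.
lines_by_image <- lapply(strsplit(ocr_text, "\n", fixed = TRUE), function(x) {
  x <- trimws(x)
  x[nzchar(x)]
})

line_count <- unname(lengths(lines_by_image))
max_lines <- max(line_count)

# One row per image, one column per line; shorter transcriptions stay NA.
line_matrix <- matrix(NA_character_, nrow = length(lines_by_image), ncol = max_lines)
for (i in seq_along(lines_by_image)) {
  n <- length(lines_by_image[[i]])
  if (n > 0) line_matrix[i, seq_len(n)] <- lines_by_image[[i]]
}
colnames(line_matrix) <- paste0("Line", seq_len(max_lines))

out <- data.frame(
  image = names(ocr_text),
  line_count = line_count,
  line_matrix,
  stringsAsFactors = FALSE
)

# Assumed timestamp format; the script's actual "[Date-time]" format may differ.
out_file <- file.path(
  tempdir(),
  paste0("ocrText-", format(Sys.time(), "%Y%m%d-%H%M%S"), ".csv")
)
write.csv(out, out_file, row.names = FALSE, na = "")
```

Here the second image has only two lines, so its "Line3" cell is written as an empty field in the CSV.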