script for OCR & text-parsing with collections label-images

Kate W 4cdd4f46ec merge consecutive line-breaks		2019-11-12 13:25:12 -06:00
images	Delete README.md	2019-10-31 13:40:57 -05:00
.gitignore	added gitignore & tidyr	2019-11-12 13:03:20 -06:00
Collections-OCR.Rproj	added gitignore & tidyr	2019-11-12 13:03:20 -06:00
LICENSE	Initial commit	2019-10-31 13:04:26 -05:00
README.md	Update README.md	2019-10-31 13:45:43 -05:00
ocrMangle.R	merge consecutive line-breaks	2019-11-12 13:25:12 -06:00

Collections-OCR

A script to batch collections label-images through OCR

Dependencies

Make sure to install these libraries first:

Notes:

This can take over 10 seconds per label-image.
- Be mindful of how many images you add to your "images" directory.
This currently uses Tesseract's English ("eng"), German ("deu"), and Latin ("lat") libraries.
Output likely needs some [or many] follow-up/clean-up steps.
- Batch similar images together to streamline follow-up steps.
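
As a rough sketch of the language note above, the snippet below checks which of the three Tesseract language codes are installed and downloads any that are missing. It assumes the `tesseract` R package is the OCR backend; `missing_languages` is a hypothetical helper written for this example, not part of ocrMangle.R. Combining languages with "+" is also an assumption about how the engine is configured.

```r
# The three Tesseract language codes named in the notes above.
needed <- c("eng", "deu", "lat")

# Hypothetical helper: return the codes in `needed` that are not in `available`.
missing_languages <- function(needed, available) {
  setdiff(needed, available)
}

# Only touch Tesseract if the R package is installed.
if (requireNamespace("tesseract", quietly = TRUE)) {
  available <- tesseract::tesseract_info()$available
  for (lang in missing_languages(needed, available)) {
    # Downloads the training data for one language.
    tesseract::tesseract_download(lang)
  }
  # Assumption: multiple languages are combined with "+", e.g. "eng+deu+lat".
  engine <- tesseract::tesseract(language = paste(needed, collapse = "+"))
}
```

On some Linux systems the language data may need to come from the distribution's packages instead of `tesseract_download()`.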

To run the script:

The script writes a CSV named "ocrText-[Date-time].csv", containing these columns:

"image" = filename of each JPG or JPEG image
"line_count" = number of lines in each OCR transcription
"Line1" - "Line[N]" = text for each line in the OCR transcription.
- The number of "Line" columns matches the largest line count across all images; shorter transcriptions leave the extra columns empty.
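
The column layout described above can be illustrated with a minimal base-R sketch. This is not the code in ocrMangle.R (which, per the commit history, uses tidyr). The sample text, the helper names `split_lines` and `build_table`, and the exact date-time format in the filename are all assumptions made for this example.

```r
# Hypothetical OCR results; in practice these strings would come from the OCR step.
ocr_text <- c(
  "label1.jpg"  = "Quercus alba L.\n\nIllinois, Cook Co.\n1902",
  "label2.jpeg" = "Carex sp.\nleg. Schmidt"
)

# Split one transcription into lines. Runs of line breaks are treated as a
# single break, so consecutive line breaks do not produce empty lines.
split_lines <- function(text) {
  lines <- strsplit(text, "[\r\n]+")[[1]]
  lines[nzchar(trimws(lines))]
}

# Build one row per image: filename, line count, then Line1..LineN.
build_table <- function(ocr_text) {
  line_list  <- lapply(unname(ocr_text), split_lines)
  line_count <- lengths(line_list)
  max_lines  <- max(line_count, 1L)
  # Indexing past the end of a vector yields NA, which pads short transcriptions.
  padded <- do.call(rbind, lapply(line_list, function(x) x[seq_len(max_lines)]))
  colnames(padded) <- paste0("Line", seq_len(max_lines))
  data.frame(image = names(ocr_text), line_count = line_count, padded,
             stringsAsFactors = FALSE)
}

result <- build_table(ocr_text)

# Assumed timestamp format; the script's actual "[Date-time]" format may differ.
out_file <- paste0("ocrText-", format(Sys.time(), "%Y-%m-%d-%H%M%S"), ".csv")
write.csv(result, out_file, row.names = FALSE)
```

With the sample input, `result` has the columns image, line_count, Line1, Line2, and Line3, and the second row has `NA` in Line3 because that label has only two lines.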