# Collections-OCR
A script to run batches of collections label-images through OCR.
## Dependencies
Make sure to install these R packages first:
- `magick` - to read in image files
- `tesseract` - to do OCR magic
- `stringr` - to split the OCR'ed lines to columns
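
As a quick pre-flight check, the sketch below reports which of the three packages are not yet installed. `missing_packages` is a hypothetical helper name used only for illustration; it is not part of this repository.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

required <- c("magick", "tesseract", "stringr")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(to_install)
} else {
  message("All required packages are installed.")
}
```

Depending on the platform, `magick` and `tesseract` may also need the ImageMagick and Tesseract system libraries to be present before the R packages will install.
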
## How to run the script:
Notes:
- This takes ~2 seconds per label-image.
- Be mindful of how many images you add to your "images" directory.
- Batch similar images together to facilitate the follow-up cleaning.
- This currently uses Tesseract's English ("eng"), German ("deu"), and Latin ("lat") libraries.
- Output likely needs some [or many] follow-up/clean-up steps.

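
Before a long run, it can help to confirm that the English, German, and Latin language data is available. The sketch below assumes the `tesseract` R package is installed and only reports what is missing; `missing_languages` is a hypothetical helper name.

```r
# Languages the notes above say the script uses.
wanted <- c("eng", "deu", "lat")

# Hypothetical helper: which wanted languages are absent from the installed set?
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

if (requireNamespace("tesseract", quietly = TRUE)) {
  # tesseract_info() lists the language data Tesseract can currently find.
  available <- tesseract::tesseract_info()$available
  absent <- missing_languages(wanted, available)
  if (length(absent) > 0) {
    message("Missing language data: ", paste(absent, collapse = ", "))
    # Each one can be fetched with, for example:
    # tesseract::tesseract_download("deu")
  } else {
    message("All required language data is available.")
  }
}
```
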
To run the script:
1. Add a folder named "images" to this script's directory
2. Add the images (JPG & JPEG) you'd like to OCR to that directory
3. Run the script (`Rscript ocrMangle.R`)
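
The steps above can be sketched roughly as follows. This is not the code in `ocrMangle.R`, only a minimal illustration of a batch OCR loop under stated assumptions: `list_label_images` and `ocr_labels` are hypothetical names, and passing the combined string `"eng+deu+lat"` to `tesseract::tesseract()` is assumed to select all three languages, as it does for the Tesseract command line.

```r
# Hypothetical helper: find JPG/JPEG files (any letter case) in a directory.
list_label_images <- function(dir = "images") {
  list.files(dir, pattern = "\\.jpe?g$", ignore.case = TRUE, full.names = TRUE)
}

# Hypothetical helper: OCR each image and return one text string per file.
ocr_labels <- function(paths, languages = "eng+deu+lat") {
  engine <- tesseract::tesseract(languages)  # assumption: combined languages
  vapply(
    paths,
    function(p) tesseract::ocr(magick::image_read(p), engine = engine),
    character(1)
  )
}

# Only run when the "images" folder and both packages are present.
if (dir.exists("images") &&
    requireNamespace("tesseract", quietly = TRUE) &&
    requireNamespace("magick", quietly = TRUE)) {
  paths <- list_label_images("images")
  # Rough estimate based on the ~2 seconds per image noted above.
  message(length(paths), " image(s) found; roughly ",
          2 * length(paths), " seconds expected")
  texts <- ocr_labels(paths)
}
```
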
## Output
A CSV named "ocrText-[Date-time].csv", containing these columns:
- **"image"** = filename for each JPG and JPEG
- **"line_count"** = number of lines in each OCR transcription
- **"Line1"** to **"Line[N]"** = text of each line in the OCR transcription
  - The number of **"Line"** columns matches the largest line count among the processed images.
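
To illustrate the column layout, here is a minimal sketch of how OCR text could be split into one column per line with `stringr`. It is not the script's actual code: `build_ocr_table` is a hypothetical name, the sample filenames and text are made up, and the date-time format in the commented `write.csv()` call is an assumption.

```r
# Hypothetical helper: turn OCR text into image / line_count / Line1..LineN.
# Assumes at least one image produced at least one non-blank line.
build_ocr_table <- function(images, texts) {
  # Split each transcription on line breaks and drop blank lines.
  lines <- lapply(
    stringr::str_split(texts, "\\r?\\n"),
    function(x) x[stringr::str_trim(x) != ""]
  )
  line_count <- lengths(lines)
  max_lines <- max(line_count)

  # Pad shorter transcriptions with NA so every row has max_lines columns.
  padded <- do.call(rbind, lapply(lines, function(x) {
    c(x, rep(NA_character_, max_lines - length(x)))
  }))
  colnames(padded) <- paste0("Line", seq_len(max_lines))

  data.frame(image = images, line_count = line_count, padded,
             stringsAsFactors = FALSE)
}

# Made-up sample input, for illustration only.
ocr_table <- build_ocr_table(
  c("label1.jpg", "label2.jpeg"),
  c("first line\nsecond line\nthird line\n", "only line\n")
)
print(ocr_table)

# Writing the CSV (timestamp format is an assumption):
# write.csv(ocr_table,
#           paste0("ocrText-", format(Sys.time(), "%Y%m%d-%H%M%S"), ".csv"),
#           row.names = FALSE)
```

Rows with fewer lines than the longest transcription are filled with `NA` in the unused **"Line"** columns in this sketch.
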