diff --git a/README.md b/README.md
index bc901f4..ee68f32 100644
--- a/README.md
+++ b/README.md
@@ -1,24 +1,29 @@
 # Collections-OCR
 
 A script to batch collections label-images through OCR
 
-Dependencies - make sure to install these libraries first:
+## Dependencies
+Make sure to install these libraries first:
 - `magick` - to read in image files
 - `tesseract` - to do OCR magic
 - `stringr` - to split the OCR'ed lines to columns
 
+## How to run the script:
 Notes:
-- this takes ~2 seconds per label-image
-- this currently uses Tesseract's English ("eng"), German ("deu") and Latin ("lat") libraries.
-
+- This takes ~2 seconds per label-image.
+  - Be mindful how many images you add to your "images" directory.
+  - Batch similar images together to facilitate the follow-up cleaning.
+- This currently uses Tesseract's English ("eng"), German ("deu"), and Latin ("lat") libraries.
+- Output is likely needs some [or many] follow-up/clean-up steps.
 
 To run the script:
 1. Add a folder named "images" to this script's directory
 2. Add the images (JPG & JPEG) you'd like to OCR to that directory
 3. Run the script (`Rscript ocrMangle.R`)
 
-Output - a CSV named "ocrText-[Date-time].csv", containing these columns:
-- "image" = filename for each JPG and JPEG
-- "line_count" = number of lines in each OCR transcription
-- "Line1" - "Line[N]" = text for each line in the OCR transcription.
-  - the number of "Line" columns will match the maximum number of lines as needed.
+## Output
+A CSV named "ocrText-[Date-time].csv", containing these columns:
+- **"image"** = filename for each JPG and JPEG
+- **"line_count"** = number of lines in each OCR transcription
+- **"Line1"** - "Line[N]" = text for each line in the OCR transcription.
+  - the number of **"Line"** columns will match the maximum number of lines as needed.