updated ReadMe

2019-10-31 13:23:20 -05:00
parent d2e296236e
commit 6b3c6fa857
1 changed file with 23 additions and 1 deletion
--- a/README.md
+++ b/README.md
@@ -1,2 +1,24 @@
 # Collections-OCR
-script for OCR & text-parsing with collections label-images
+A script that batch-processes collections label-images through OCR
+
+Dependencies - make sure to install these R packages first:
+- `magick` - to read in image files
+- `tesseract` - to run the OCR
+- `stringr` - to split the OCR'd lines into columns
+
+Notes:
+- this takes about 2 seconds per label-image
+- this currently uses Tesseract's English ("eng"), German ("deu") and Latin ("lat") language data.
+
+
+To run the script:
+1. Add a folder named "images" to this script's directory
+2. Add the images (JPG or JPEG) you'd like to OCR to that "images" folder
+3. Run the script (`Rscript ocrMangle.R`)
+
+Output - a CSV named "ocrText-[Date-time].csv", containing these columns:
+- "image" = filename of each JPG or JPEG image
+- "line_count" = number of lines in each OCR transcription
+- "Line1" - "Line[N]" = text for each line in the OCR transcription.
+  - the number of "Line" columns matches the highest line count among the transcriptions.
+