Update README.md

2019-10-31 13:40:28 -05:00 · 2019-10-31 13:40:28 -05:00 · 3c14f5de7b
parent 5013b326e7
commit 3c14f5de7b
1 changed files with 14 additions and 9 deletions
--- a/README.md
+++ b/README.md
@@ -1,24 +1,29 @@
 # Collections-OCR
 A script to batch collections label-images through OCR

-Dependencies - make sure to install these libraries first:
+## Dependencies 
+Make sure to install these libraries first:
 - `magick` - to read in image files
 - `tesseract` - to do OCR magic 
 - `stringr` - to split the OCR'ed lines to columns

+## How to run the script:
 Notes:
- this takes ~2 seconds per label-image
- this currently uses Tesseract's English ("eng"), German ("deu") and Latin ("lat") libraries. 
-
+- This takes ~2 seconds per label-image.
+  - Be mindful of how many images you add to your "images" directory.
+  - Batch similar images together to facilitate the follow-up cleaning.
+- This currently uses Tesseract's English ("eng"), German ("deu"), and Latin ("lat") libraries. 
+- Output likely needs some (or many) follow-up/clean-up steps.

 To run the script:
 1. Add a folder named "images" to this script's directory
 2. Add the images (JPG & JPEG) you'd like to OCR to that directory
 3. Run the script (`Rscript ocrMangle.R`)

-Output - a CSV named "ocrText-[Date-time].csv", containing these columns:
- "image" = filename for each JPG and JPEG
- "line_count" = number of lines in each OCR transcription
- "Line1" - "Line[N]" = text for each line in the OCR transcription.
-  - the number of "Line" columns will match the maximum number of lines as needed.
+## Output
+A CSV named "ocrText-[Date-time].csv", containing these columns:
+- **"image"** = filename for each JPG and JPEG
+- **"line_count"** = number of lines in each OCR transcription
+- **"Line1"** - **"Line[N]"** = text for each line in the OCR transcription.
+  - the number of **"Line"** columns will match the maximum number of lines as needed.
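
As a rough aid for the "be mindful how many images" note in the updated README, here is a minimal shell sketch that counts the JPG/JPEG files in a directory and estimates run time from the README's ~2 seconds per label-image figure. The helper names `count_images` and `estimate_seconds` are hypothetical and not part of this repository, and the case-insensitive extension match is an assumption about which files `ocrMangle.R` picks up. It also assumes a `find` that supports `-maxdepth` and `-iname` (GNU and BSD versions do).

```shell
#!/bin/sh
# Hypothetical helpers (not part of Collections-OCR).

# Count JPG/JPEG files directly inside the directory given as $1.
# Assumption: extensions are matched case-insensitively.
count_images() {
  find "$1" -maxdepth 1 -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) | wc -l | tr -d ' '
}

# Estimate OCR time in seconds for $1 images,
# using the README's approximate figure of 2 seconds per image.
estimate_seconds() {
  echo $(( $1 * 2 ))
}

# Step 1 of the README: make sure the "images" folder exists next to the script.
mkdir -p images

n=$(count_images images)
echo "$n image(s) found, roughly $(estimate_seconds "$n") seconds of OCR"

# Step 3 of the README, once the images are in place:
# Rscript ocrMangle.R
```

The estimate is only as good as the ~2 second figure, which will vary with image size and hardware, so treat it as a quick check before starting a large batch.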