Update README.md
parent 5013b326e7
commit 3c14f5de7b
23
README.md
@@ -1,24 +1,29 @@
 # Collections-OCR
 A script to batch collections label-images through OCR
 
-Dependencies - make sure to install these libraries first:
+## Dependencies
+Make sure to install these libraries first:
 - `magick` - to read in image files
 - `tesseract` - to do OCR magic
 - `stringr` - to split the OCR'ed lines to columns
 
+## How to run the script:
 Notes:
-- this takes ~2 seconds per label-image
-- this currently uses Tesseract's English ("eng"), German ("deu") and Latin ("lat") libraries.
+- This takes ~2 seconds per label-image.
+- Be mindful how many images you add to your "images" directory.
+- Batch similar images together to facilitate the follow-up cleaning.
+- This currently uses Tesseract's English ("eng"), German ("deu"), and Latin ("lat") libraries.
+- Output likely needs some [or many] follow-up/clean-up steps.
 
 To run the script:
 1. Add a folder named "images" to this script's directory
 2. Add the images (JPG & JPEG) you'd like to OCR to that directory
 3. Run the script (`Rscript ocrMangle.R`)
 
-Output - a CSV named "ocrText-[Date-time].csv", containing these columns:
-- "image" = filename for each JPG and JPEG
-- "line_count" = number of lines in each OCR transcription
-- "Line1" - "Line[N]" = text for each line in the OCR transcription.
-- the number of "Line" columns will match the maximum number of lines as needed.
+## Output
+A CSV named "ocrText-[Date-time].csv", containing these columns:
+- **"image"** = filename for each JPG and JPEG
+- **"line_count"** = number of lines in each OCR transcription
+- **"Line1"** - "Line[N]" = text for each line in the OCR transcription.
+- the number of **"Line"** columns will match the maximum number of lines as needed.
 
-
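The README notes put the cost at roughly 2 seconds per label-image and advise being mindful of how many images go into the "images" directory. A rough pre-flight estimate can be sketched in shell as below. This is only an illustration: `estimate_ocr_seconds` is a hypothetical helper that is not part of this repository, the 2-second figure is the README's approximation, and matching `.jpg`/`.jpeg` case-insensitively is an assumption about how `ocrMangle.R` selects files.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocrMangle.R): estimate batch runtime
# from the README's rough figure of ~2 seconds per label-image.
SECONDS_PER_IMAGE=2

estimate_ocr_seconds() {
  dir="${1:-images}"
  # Count JPG/JPEG files directly inside the directory.
  # Case-insensitive matching is an assumption, not confirmed by the README.
  count=$(find "$dir" -maxdepth 1 -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) | wc -l | tr -d ' ')
  echo $((count * SECONDS_PER_IMAGE))
}
```

Calling `estimate_ocr_seconds images` from the script's directory prints the estimated number of seconds for the current batch, which may help decide whether to split a large set of images into smaller batches before running `Rscript ocrMangle.R`.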