updated readme & libraries
parent c6825cb949
commit 6274b48693

README.md (51 lines changed)
@@ -1,13 +1,51 @@
 # Collections-OCR
-A script to batch collections label-images through OCR
+A few scripts that batch collections label-images through OCR
 
-## Dependencies
+## Google Cloud Vision API & `ocrCloudVision.R`
+### Dependencies
+Make sure to install these libraries first:
+- `googleCloudVisionR` - to do OCR magic
+- Other dependencies include `readr`, `tidyr`, `stringr` for data handling
+
+### How to run `ocrCloudVision.R`:
+Notes:
+- This currently uses Google's Cloud Vision API, which requires:
+- Being aware of [pricing & quotas for the Google Vision API](https://cloud.google.com/vision/pricing)
+- Setting up a project on Google Cloud Platform
+- Authenticating your machine by setting up a Service account & key
+- Get help from the [cloudyr repo for `googleCloudVisionR`](https://cloudyr.github.io/googleCloudVisionR/)
+- This can take over 30 seconds per label-image.
+- Be mindful of how many images you add to your "images" directory.
+- Be mindful of your internet connection speed
+- Keep image sizes under 20MB
+(Overall, smaller image files transfer and process more quickly)
+- Output likely needs some [or many] follow-up/clean-up steps.
+- Batch similar images together to streamline follow-up steps.
+
+To run the script:
+1. Add a folder named "images" to this script's directory
+2. Add the images (JPG & JPEG) you'd like to OCR to that directory
+3. Run the script (`Rscript ocrCloudVision.R`)
+
+### Output from `ocrCloudVision.R`:
+A CSV named "ocrText-[Date-time].csv", containing these columns:
+- **"image"** = filename for each JPG and JPEG
+- **"imagesize"** = filesize for each image (in MB)
+- **"ocr_start"** = start-date and time when an image was submitted to the Google Vision API
+- **"ocr_duration"** = duration (in seconds) of the OCR process
+- **"line_count"** = number of lines in each OCR transcription
+- **"Line1" - "Line[N]"** = text for each line in the OCR transcription of an image.
+- the number of **"Line"** columns will match the maximum number of lines as needed.
+
+
+## Tesseract & `ocrMangle.R`
+### Dependencies
 Make sure to install these libraries first:
 - `magick` - to read in image files
 - `tesseract` - to do OCR magic
 - `stringr` - to split the OCR'ed lines to columns
 
-## How to run the script:
+### How to run `ocrMangle.R`:
 Notes:
 - This can take over 10 seconds per label-image.
 - Be mindful of how many images you add to your "images" directory.
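The README notes in this hunk ask the user to watch how many images sit in "images" and to keep each one under 20MB. A minimal pre-flight sketch of that check follows, in base R only. The helper name `check_images`, its arguments, and the choice of 1024^2 bytes per MB are assumptions for illustration; none of this is part of the repository's scripts.

```r
# Hypothetical pre-flight helper (not part of the repository's scripts).
# Lists JPG/JPEG files in a directory and flags any at or above a size limit.
check_images <- function(dir = "images", max_mb = 20) {
  files <- list.files(dir, pattern = "\\.jpe?g$", ignore.case = TRUE,
                      full.names = TRUE)
  size_mb <- file.size(files) / 1024^2  # assumption: 1 MB = 1024^2 bytes
  data.frame(
    image = basename(files),
    size_mb = round(size_mb, 2),
    too_large = size_mb >= max_mb,
    stringsAsFactors = FALSE
  )
}

# Only report when an "images" folder is present in the working directory
if (dir.exists("images")) {
  report <- check_images("images")
  message(nrow(report), " image(s) found; ",
          sum(report$too_large), " at or above the size limit")
}
```

Running this before either script gives a rough idea of batch size, and so of total run time, given the per-image timings quoted in the notes.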
@@ -20,10 +58,13 @@ To run the script:
 2. Add the images (JPG & JPEG) you'd like to OCR to that directory
 3. Run the script (`Rscript ocrMangle.R`)
 
-## Output
+### Output from `ocrMangle.R`:
 A CSV named "ocrText-[Date-time].csv", containing these columns:
 - **"image"** = filename for each JPG and JPEG
 - **"line_count"** = number of lines in each OCR transcription
-- **"Line1"** - "Line[N]" = text for each line in the OCR transcription.
+- **"Line1" - "Line[N]"** = text for each line in the OCR transcription.
 - the number of **"Line"** columns will match the maximum number of lines as needed.
 
+
+## Google Drive API & `ocrGoogleDrive.R`
+### This is drafty; might work for small batches, but needs work.
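The output description in this hunk (an "image" column, a "line_count" column, then "Line1" to "Line[N]" padded to the longest transcription) can be sketched in base R as below. This is an illustration of the layout only, not the repository's implementation, which the README says uses `stringr`. The helper name `lines_to_columns` and the sample file names and text are made up, and the sketch assumes at least one image.

```r
# Hypothetical helper: spread OCR text into "Line1".."LineN" columns.
# `image` is a character vector of file names, `text` the matching OCR output.
lines_to_columns <- function(image, text) {
  # One character vector of lines per image
  split_lines <- strsplit(text, "\n", fixed = TRUE)
  line_count <- lengths(split_lines)
  n_cols <- max(line_count, 1L)
  # Indexing past the end of a vector yields NA, which pads short transcriptions
  padded <- do.call(rbind, lapply(split_lines, function(x) x[seq_len(n_cols)]))
  colnames(padded) <- paste0("Line", seq_len(n_cols))
  data.frame(image = image, line_count = line_count, padded,
             stringsAsFactors = FALSE)
}

# Made-up sample data, for illustration only
out <- lines_to_columns(
  image = c("label1.jpg", "label2.jpg"),
  text  = c("Herbarium\nQuercus alba", "No. 42")
)
print(out)
```

Images with fewer lines than the longest transcription get `NA` in the unused "Line" columns, which keeps the CSV rectangular.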
@@ -9,7 +9,7 @@ library(googleCloudVisionR) # NOTE - requires API Key / Service Account
 library(tidyr)
 library(readr)
 library(stringr)
-library(magick)
+# library(magick)
 
 # get list of local JPG & JPEG image files [REVERT]
 imagelist <- list.files(path = "images/", pattern = ".jp|.JP")
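A note on the `list.files()` context line above: `pattern` is a regular expression, so the unescaped `.` in `".jp|.JP"` matches any character and the match is not anchored to the end of the name. The sketch below compares it with a stricter alternative; the stricter pattern is a suggestion, not what the commit uses, and the file names are made up.

```r
# Made-up file names, for illustration only
candidates <- c("label1.jpg", "label2.JPEG", "notes.jpx.txt", "ajp_scan.png")

# Pattern from the script: "." is a wildcard and nothing anchors the match
loose <- grepl(".jp|.JP", candidates)

# Stricter alternative: literal dot, "jpg" or "jpeg", end of name, any case
strict <- grepl("\\.jpe?g$", candidates, ignore.case = TRUE)

print(data.frame(candidates, loose, strict))

# The same stricter pattern passed to list.files(); this returns an empty
# character vector if no "images/" folder exists
imagelist_strict <- list.files(path = "images/", pattern = "\\.jpe?g$",
                               ignore.case = TRUE)
```

With a folder that only ever holds JPG and JPEG files the two patterns select the same files; the difference only matters when other file types land in "images".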