OCR

Document Intelligence | Async jobs required

Document text extraction with per-line geometry, powered by AWS Textract.

OCR | Textract | PDF | Bounding boxes

About

Extracts machine-readable text from uploaded clinical documents — PDFs, scans, and images — using AWS Textract. Beyond plain text, it returns tables and per-page line data with start/end coordinates, which downstream UIs use to highlight evidence directly on the source document.

This is the standalone extraction service. The coding pipelines (HCC, ICD, CDI, Chart Audit, Ortho) embed the same OCR step internally and attach normalized bounding boxes to their evidence, so apps built on those services never call OCR separately.

How it works

  1. Document upload (PDF or image, multipart)
  2. AWS Textract page-by-page extraction
  3. Assembly into processed_text, tables, and per-page df_data with line geometry
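
As a rough client-side illustration of step 1, the sketch below builds a multipart upload request using only the Python standard library. The URL, the form field name "file", and the helper names are assumptions made for illustration; this page does not state the real endpoint path or field name, and session authentication is left out.

```python
import urllib.request
import uuid

# Hypothetical URL: the actual OCR endpoint path is not given on this page.
OCR_URL = "https://api.example.com/ocr"


def build_multipart(field: str, filename: str, payload: bytes, content_type: str):
    """Encode a single file as a multipart/form-data body.

    Returns (body_bytes, headers) ready to hand to an HTTP client.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return head + payload + tail, headers


def make_upload_request(pdf_bytes: bytes, filename: str = "synthetic.pdf") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the document.

    The field name "file" is an assumption, not a documented parameter.
    """
    body, headers = build_multipart("file", filename, pdf_bytes, "application/pdf")
    return urllib.request.Request(OCR_URL, data=body, headers=headers, method="POST")
```

Sending the request would be a call to urllib.request.urlopen(make_upload_request(data)) with whatever session credentials the platform requires. Only synthetic documents should be used when trying this.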

Intended use

  • Standalone text extraction for document intake, indexing, and search
  • Building evidence-linked document viewers (line coordinates drive highlight overlays)
  • Feeding extracted text into /llm/chat or /prior-auth/evaluate when no end-to-end pipeline fits
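
For the document-viewer use case, the sketch below shows one way line coordinates could drive a highlight overlay. It assumes each line carries start/end coordinates expressed as fractions of the page size (0 to 1); this page confirms normalized values only for the coding pipelines' bounding boxes, so the scale should be checked against a real response. The key names x_start, y_start, x_end, and y_end are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Overlay rectangle in rendered-page pixels."""
    left: float
    top: float
    width: float
    height: float


def line_to_rect(line: dict, page_width_px: float, page_height_px: float) -> Rect:
    """Scale a line's start/end coordinates to a pixel rectangle.

    Assumes coordinates are fractions of the page (0 to 1) and that the
    keys below exist; both are illustrative assumptions.
    """
    # Sort each pair so the rectangle is valid even if start > end.
    x0, x1 = sorted((line["x_start"], line["x_end"]))
    y0, y1 = sorted((line["y_start"], line["y_end"]))
    return Rect(
        left=x0 * page_width_px,
        top=y0 * page_height_px,
        width=(x1 - x0) * page_width_px,
        height=(y1 - y0) * page_height_px,
    )
```

A viewer would call line_to_rect once per evidence line, passing the pixel size at which the page is currently rendered, and position an absolutely placed overlay element with the result.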

Key outputs

  • ocr_results.processed_text — full extracted text
  • ocr_results.page_contents[].df_data — per-line text with page number and x/y line geometry
  • textract_usage — pages processed and exact cost
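
The outputs listed above could be consumed along these lines. The sketch walks ocr_results.page_contents[].df_data and returns the lines that contain a search phrase. It assumes df_data arrives as a list of per-line records and that each record has a "text" key; neither detail is confirmed on this page, so adjust to the actual payload.

```python
from typing import Iterator


def iter_lines(response: dict) -> Iterator[dict]:
    """Yield every per-line record across all pages of an OCR response."""
    pages = response.get("ocr_results", {}).get("page_contents", [])
    for page in pages:
        # Assumption: df_data is a list of dicts, one per extracted line.
        for line in page.get("df_data", []):
            yield line


def find_lines(response: dict, phrase: str) -> list[dict]:
    """Return line records whose text contains the phrase, ignoring case.

    The "text" key is a hypothetical field name.
    """
    needle = phrase.casefold()
    return [
        line
        for line in iter_lines(response)
        if needle in str(line.get("text", "")).casefold()
    ]
```

Each matching record keeps whatever page number and geometry the service returned, so the same result can feed both a search index and a highlight overlay.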

Endpoints

Try each endpoint with your signed-in session — usage counts toward your monthly budget.

Use synthetic data only. Do not submit real patient records or PHI when testing endpoints.

Limitations & caveats

  • Do NOT call this service separately when using a coding service; those pipelines run OCR internally and return normalized (0–1) bounding boxes on their evidence
  • Multi-page documents should go through the async /jobs flow to avoid the 60-second gateway timeout
  • Extraction quality follows scan quality; handwriting and low-resolution faxes degrade results
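
For the async /jobs flow mentioned above, a client typically submits the document, then polls until the job finishes. The sketch below covers only the polling loop and takes the status lookup as a callable, because the job payload is not documented on this page: the "status" key, its values "completed" and "failed", and the "error" key are all assumptions.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a job until it completes, fails, or the timeout elapses.

    fetch_status is any callable that returns the current job record,
    for example a function that issues a GET against the job's URL.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")  # hypothetical field name and values
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"OCR job failed: {job.get('error', 'no detail')}")
        if clock() >= deadline:
            raise TimeoutError("OCR job did not finish before the timeout")
        sleep(interval_s)
```

Keeping each poll short means no single HTTP request approaches the 60-second gateway limit, regardless of how long the extraction itself takes.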