GLM-OCR

Document vision for receipts, PDFs, screenshots, forms.

GLM-OCR is a document-vision endpoint built on the GLM family of open vision-language models from Zhipu / Tsinghua. We host it as an OCR API that reads receipts, PDFs, screenshots, forms, tables, and handwriting and returns raw text, markdown, or structured JSON. It is designed to be called as an agent tool, not used as a standalone product.

What it returns

  • Raw text for simple reads
  • Markdown with preserved heading/list/table structure
  • JSON conforming to a tenant-supplied schema, suitable for direct insertion into a typed downstream pipeline
  • Bounding boxes for every extracted field, if you need to draw overlays or audit extractions
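
To make the schema-driven JSON output concrete, here is a minimal sketch of how a caller might check a returned record before inserting it into a typed pipeline. The schema format (field name mapped to a Python type), the field names, and the sample record are all assumptions made for illustration; this page does not specify the schema language the endpoint accepts or the exact response shape.

```python
from typing import Any

# Hypothetical receipt schema: field name -> expected Python type.
# The schema format actually accepted by the endpoint is not specified here.
RECEIPT_SCHEMA: dict[str, type] = {
    "merchant": str,
    "date": str,
    "total": float,
    "currency": str,
}


def schema_problems(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return human-readable mismatches between a record and a schema."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# Made-up record standing in for an OCR response; "total" is a string on purpose.
sample = {
    "merchant": "Cafe Example",
    "date": "2024-01-31",
    "total": "12.50",
    "currency": "EUR",
}
print(schema_problems(sample, RECEIPT_SCHEMA))
# prints ['total: expected float, got str']
```

A check like this keeps type mismatches from reaching the downstream pipeline; a production setup would more likely use a full schema validator than a hand-rolled type check.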

When to pick it

  • Agents that need to read an image: receipt → expense entry, PDF → knowledge-graph node, screenshot → structured action
  • Replacing expensive per-token vision-LLM calls when the task is “extract fields”, not “reason over the image”
  • Cheap first-pass OCR before falling through to a vision-LLM for ambiguous cases
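
The first-pass pattern in the last bullet can be sketched as follows. Both extractors are passed in as plain callables, so nothing here depends on a particular client library. The function names are hypothetical, and treating "ambiguous" as "a required field is missing or empty" is an assumption of this sketch, not a rule of the service.

```python
from typing import Callable, Optional

Fields = dict[str, Optional[str]]


def extract_with_fallback(
    image: bytes,
    ocr_extract: Callable[[bytes], Fields],
    vision_llm_extract: Callable[[bytes], Fields],
    required: tuple[str, ...],
) -> tuple[Fields, str]:
    """Try the cheap OCR pass first; use the vision LLM only when needed.

    Returns the extracted fields plus a label naming which path produced them.
    """
    fields = ocr_extract(image)
    # Assumption: a result is good enough when every required field is non-empty.
    if all(fields.get(name) for name in required):
        return fields, "ocr"
    return vision_llm_extract(image), "vision-llm"


# Stand-in extractors for demonstration only; real ones would call the services.
def fake_ocr(image: bytes) -> Fields:
    return {"merchant": "Cafe Example", "total": None}


def fake_vision_llm(image: bytes) -> Fields:
    return {"merchant": "Cafe Example", "total": "12.50"}


fields, source = extract_with_fallback(
    b"image-bytes", fake_ocr, fake_vision_llm, required=("merchant", "total")
)
print(source)  # prints vision-llm
```

Returning the source label alongside the fields makes it easy to measure how often the expensive path is taken.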

Pricing

EUR 0.0005 per page. Flat. No per-token tail. Pages are detected automatically for PDFs; a single image input counts as one page.
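
Because the price is flat, cost is just page count times the per-page rate. The sketch below uses `Decimal` to avoid floating-point drift at this scale; the function name is illustrative and the only input taken from this page is the EUR 0.0005 rate.

```python
from decimal import Decimal

PRICE_PER_PAGE_EUR = Decimal("0.0005")  # flat rate quoted above


def ocr_cost_eur(pages: int) -> Decimal:
    """Cost in EUR for a number of pages at the flat per-page rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE_EUR * pages


for pages in (1, 10_000, 1_000_000):
    print(pages, ocr_cost_eur(pages))
```

At this rate, 10,000 pages come to EUR 5 and one million pages to EUR 500.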

Limits

  • Per-tenant rate limit: 60 pages per second
  • Image size limit: 20 MB per page
  • Supported formats: png, jpg, webp, pdf, tiff
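
A caller can check these limits locally before sending anything. The sketch below is a client-side helper under stated assumptions: "20 MB" is read as 20 MiB, only the extensions listed above are accepted (whether aliases such as .jpeg or .tif work is not stated), and PDFs skip the size check because the limit is per page rather than per file. `preflight_problem` and `PagePacer` are hypothetical names, not part of any published client.

```python
import time
from pathlib import Path
from typing import Callable, Optional

MAX_PAGE_BYTES = 20 * 1024 * 1024  # assumes "20 MB" means 20 MiB
SUPPORTED_SUFFIXES = {".png", ".jpg", ".webp", ".pdf", ".tiff"}
MAX_PAGES_PER_SECOND = 60


def preflight_problem(filename: str, size_bytes: int) -> Optional[str]:
    """Return a reason the input would likely be rejected, or None if it looks fine."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        return f"unsupported format: {suffix or '(none)'}"
    # The limit is per page, so a whole-file check only makes sense for images.
    if suffix != ".pdf" and size_bytes > MAX_PAGE_BYTES:
        return f"page too large: {size_bytes} bytes"
    return None


class PagePacer:
    """Spaces out submissions so the page rate stays at or below a limit."""

    def __init__(
        self,
        pages_per_second: float = MAX_PAGES_PER_SECOND,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / pages_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()

    def wait(self, pages: int = 1) -> None:
        """Block until the next submission of `pages` pages is allowed."""
        now = self._clock()
        if now < self._next_slot:
            self._sleep(self._next_slot - now)
            now = self._next_slot
        self._next_slot = now + pages * self._interval


print(preflight_problem("receipt.png", 1_000_000))  # prints None
print(preflight_problem("scan.bmp", 1_000_000))     # prints unsupported format: .bmp
```

Calling `pacer.wait(pages=n)` before each submission keeps a single client under the per-tenant limit; several clients sharing one tenant would need to coordinate or split the budget.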

Best for

  • Receipt and invoice extraction for finance agents
  • PDF and screenshot reading inside agent tool loops
  • Form / table / handwriting digitization
  • Pre-processing image-heavy inputs for downstream LLM reasoning

Upstream source: github.com/THUDM/GLM