GLM-OCR
Document vision for receipts, PDFs, screenshots, forms.
GLM-OCR is the document-vision endpoint built on the GLM family of open vision-language models from Zhipu / Tsinghua. We host it as an OCR API that reads receipts, PDFs, screenshots, forms, tables, and handwriting and returns structured JSON or markdown. It is designed to be called as an agent tool, not used as a standalone product.
What it returns
- Raw text for simple reads
- Markdown with preserved heading/list/table structure
- JSON with a tenant-supplied schema, suitable for direct insertion into a downstream typed pipeline
- Bounding boxes for every extracted field, for drawing overlays or auditing extractions
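As a rough illustration of how the output modes above might be selected, here is a minimal sketch that only builds a request body and does not send it. The field names (`document`, `output`, `schema`, `bounding_boxes`) and the receipt schema are hypothetical and are not taken from the published API; the real names come with your endpoint details.

```python
import base64
import json
from typing import Optional

# Output modes named on this page: raw text, markdown, schema-bound JSON.
OUTPUT_MODES = {"text", "markdown", "json"}


def build_request(document: bytes,
                  output: str = "markdown",
                  schema: Optional[dict] = None,
                  bounding_boxes: bool = False) -> dict:
    """Build a request body. All key names here are assumptions."""
    if output not in OUTPUT_MODES:
        raise ValueError(f"unknown output mode: {output!r}")
    if output == "json" and schema is None:
        raise ValueError("JSON output needs a tenant-supplied schema")
    payload = {
        # Assumed transport: file bytes sent as base64 text.
        "document": base64.b64encode(document).decode("ascii"),
        "output": output,
        "bounding_boxes": bounding_boxes,
    }
    if schema is not None:
        payload["schema"] = schema
    return payload


# Hypothetical schema for a receipt, written in JSON Schema style.
receipt_schema = {
    "type": "object",
    "properties": {
        "merchant": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["merchant", "total"],
}

payload = build_request(b"placeholder image bytes", output="json",
                        schema=receipt_schema, bounding_boxes=True)
print(json.dumps(payload, indent=2))
```

Keeping the schema on the caller's side means the typed pipeline that consumes the result can reuse the same definition for validation.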
When to pick it
- Agents that need to read an image: receipt → expense entry, PDF → knowledge-graph node, screenshot → structured action
- Replacing expensive per-token vision-LLM calls when the task is “extract fields”, not “reason over the image”
- Cheap first-pass OCR before falling through to a vision-LLM for ambiguous cases
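The first-pass pattern in the last bullet can be sketched as below. This is one possible reading, assuming "ambiguous" means that required fields came back empty; `ocr_extract` and `vlm_extract` are hypothetical callables standing in for your own GLM-OCR and vision-LLM clients.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Fields = Dict[str, Any]


def missing_fields(result: Fields, required: Iterable[str]) -> List[str]:
    """Names of required fields that are absent or empty in the result."""
    return [name for name in required if result.get(name) in (None, "")]


def extract_with_fallback(image: bytes,
                          ocr_extract: Callable[[bytes], Fields],
                          vlm_extract: Callable[[bytes], Fields],
                          required: Iterable[str]) -> Tuple[Fields, str]:
    """Try cheap OCR first; call the vision-LLM only if fields are missing."""
    required = list(required)
    result = ocr_extract(image)
    if not missing_fields(result, required):
        return result, "ocr"
    fallback = vlm_extract(image)
    # Keep every non-empty OCR value; fill the gaps from the fallback.
    kept = {k: v for k, v in result.items() if v not in (None, "")}
    return {**fallback, **kept}, "ocr+vision-llm"


# Stand-in clients for demonstration only.
def fake_ocr(image: bytes) -> Fields:
    return {"merchant": "Cafe Luna", "total": None}


def fake_vlm(image: bytes) -> Fields:
    return {"merchant": "Café Luna", "total": 12.5}


fields, route = extract_with_fallback(b"img", fake_ocr, fake_vlm,
                                      ["merchant", "total"])
print(route, fields)
```

With this split, the vision-LLM is only paid for on the documents the OCR pass could not fully resolve.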
Pricing
EUR 0.0005 per page. Flat. No per-token tail. PDF pages are counted automatically; a single image input counts as one page.
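At the flat rate, 10,000 single-page receipts cost EUR 5.00. A small estimator, assuming you already know the page count of each document in a batch (the function name is illustrative):

```python
from decimal import Decimal
from typing import Iterable

# Published flat rate from this page.
PRICE_PER_PAGE_EUR = Decimal("0.0005")


def estimate_cost_eur(page_counts: Iterable[int]) -> Decimal:
    """Total cost for a batch: one entry per document, value = its pages.

    A single image counts as one page; a PDF counts once per page.
    """
    total_pages = sum(page_counts)
    return total_pages * PRICE_PER_PAGE_EUR


# Two images, a 12-page PDF, and a 250-page PDF: 264 pages in total.
batch = [1, 12, 1, 250]
print(f"EUR {estimate_cost_eur(batch)}")
```

`Decimal` is used instead of `float` so that sub-cent amounts add up exactly across large batches.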
Limits
- Per-tenant rate limit: 60 pages per second
- Image size limit: 20 MB per page
- Supported formats: png, jpg, webp, pdf, tiff
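The limits above can be checked client-side before a call is made. The sketch below assumes that 20 MB means 20,000,000 bytes and that the format list is matched strictly against the file extension; both are assumptions, and the helper names are illustrative.

```python
import os
from typing import List

SUPPORTED_FORMATS = {"png", "jpg", "webp", "pdf", "tiff"}
MAX_BYTES_PER_PAGE = 20 * 1000 * 1000   # assumption: decimal megabytes
MAX_PAGES_PER_SECOND = 60               # per-tenant rate limit


def check_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks OK."""
    problems = []
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext not in SUPPORTED_FORMATS:
        # Strict reading of the list: "jpeg" or "tif" would be rejected here.
        problems.append(f"unsupported format: {ext or '(none)'}")
    # The size limit is per page. A single image is one page, so its file
    # size can be checked directly; a PDF's per-page size cannot be read
    # from the file size alone, so PDFs are skipped here.
    if ext != "pdf" and size_bytes > MAX_BYTES_PER_PAGE:
        problems.append(f"page too large: {size_bytes} bytes")
    return problems


def min_seconds_for(pages: int) -> float:
    """Lower bound on wall-clock time for a batch at the rate limit."""
    return pages / MAX_PAGES_PER_SECOND


print(check_upload("receipt.PNG", 1_024))
print(check_upload("scan.bmp", 1_024))
print(min_seconds_for(6000))
```

At 60 pages per second, a 6,000-page backlog needs at least 100 seconds, which is worth budgeting for inside an agent tool loop.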
Best for
- Receipt and invoice extraction for finance agents
- PDF and screenshot reading inside agent tool loops
- Form / table / handwriting digitization
- Pre-processing image-heavy inputs for downstream LLM reasoning
Upstream source: github.com/THUDM/GLM
Request GLM-OCR access
Get an API key.
Straight pay-per-use against the published rate. No deposit, no minimums. Tell us what you're building and we'll send your API key and endpoint URL within one working day.