OpenAI-compatible inference. Hydropower-clean. Hosted in Kathmandu.
Open-weight inference at fair-market rates.
ScaLabs Cloud serves HimalayaGPT, Qwen, Gemma, DeepSeek, MiniMax, and Cohere Command A+ on our own hardware in Kathmandu. One OpenAI-compatible endpoint, per-million-token pricing aligned with OpenRouter for the same open-weight families. HimalayaGPT is free; the rest is straight pay-per-use. Invoiced in NPR, EUR, or USD — your choice.
Free, hosted on Kathmandu hardware
HimalayaGPT 0.5B — Nepal's sovereign LLM, available to everyone.
HimalayaGPT is a 500-million-parameter, Nepali-language, instruction-tuned model from Himalaya AI Research Lab. We host it on our Kathmandu hardware for free: no deposit, no monthly minimum, no tier requirement. If you're building anything in Nepali, point your OpenAI SDK at our endpoint and you're in.
Fair-use limits still apply to stop spam, resale, and runaway loops. Real Nepali-language workloads won't hit them.
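Fair-use limits are enforced server-side, but a client that backs off when it is throttled is far less likely to look like a runaway loop. Below is a minimal Python sketch of capped exponential backoff. It assumes, though this page does not say so, that a limit hit comes back as HTTP 429. `RateLimited` and the `send` callable are hypothetical stand-ins for your own HTTP layer.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical: raise this from your HTTP layer on an HTTP 429 response."""


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Seconds to wait before each retry: base * 2**attempt, never above `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def call_with_backoff(send: Callable[[], T], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `send`, retrying on RateLimited with capped exponential backoff.

    Re-raises after `max_retries` retries, so a buggy agent loop cannot keep
    hammering the endpoint indefinitely.
    """
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(delays[attempt])
    raise AssertionError("unreachable when max_retries >= 0")
```

Pass your own request function as `send`. Capping both the delay and the retry count means a bug in an agent loop ends in an exception instead of an unbounded stream of requests.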
Why this exists
Open-model inference, at honest prices, off the US/CN hyperscale path.
OpenAI-compatible
Standard /v1/chat/completions and /v1/responses endpoints. Drop-in for the SDKs you already use.
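The API follows the OpenAI wire format, so a chat completion is a JSON POST to `/chat/completions` under the `/v1` base URL, authenticated with a bearer token. Below is a minimal stdlib-only Python sketch that builds, but does not send, such a request. `BASE_URL` and `MODEL_ID` are hypothetical placeholders: the real endpoint URL and model identifiers arrive with your API key.

```python
import json
import urllib.request

# Hypothetical placeholders; substitute the values sent with your API key.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "himalayagpt-0.5b"


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list[dict]) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completions request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # standard bearer-token auth
            "Content-Type": "application/json",
        },
    )
```

To send it, open the request with `urllib.request.urlopen(req)`, parse the body with `json.load`, and read `reply["choices"][0]["message"]["content"]`. With the official OpenAI Python SDK, the equivalent is passing the same `base_url` and `api_key` to the `OpenAI(...)` client constructor.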
Open models only
HimalayaGPT, Qwen, Gemma, DeepSeek, MiniMax, Cohere Command A+ — open-weight models you can audit. No black-box frontier serving here.
Hydropower-clean
Hydropower from the Nepal Electricity Authority (NEA) grid in Kathmandu. Cleanest grid mix of any inference provider at this price.
No training on prompts
We do not train models on your prompts or outputs, and we do not log request and response bodies by default. Your prompts and outputs stay yours.
LLM catalog
9 chat-completion models across 6 families.
Open-weight chat-completion models, from HimalayaGPT (free) through Cohere Command A+ and MiniMax M2.7. All are served through the same /v1/chat/completions endpoint.
HimalayaGPT 0.5B
Sovereign Nepali LLM by Himalaya AI Research Lab. Hosted free on our Kathmandu hardware as a public good.
Qwen 3.6 27B
The default. A capable dense model for coding, tool use, and structured agent loops.
Qwen 3.6 35B A3B
MoE economics: only 3B of 35B parameters active per token, at some of the lowest rates in the catalog.
Gemma 4 31B
Gemma's dense flagship in our catalog. Different inductive biases from Qwen, so keep both in your evals.
Gemma 4 26B A4B
The cheapest paid model open to all tenants. 4B active parameters, 26B total. Built for volume.
DeepSeek V4 Flash
A fast, cheap MoE for high-throughput pipelines. Pro tenants only at founding launch.
Qwen 3.5 122B A10B
Step up when 30B-class isn't enough. 256K context, 10B active. Reserve for the work that earns it.
MiniMax M2.7
1M-token context, 22B active. The model you reach for when nothing else fits.
Cohere Command A+
Cohere's open-weight flagship MoE. 48 languages, agentic-tuned, Apache 2.0.
LLM pricing
Per-million-token rates. Priced near OpenRouter's market rates.
Input and output rates per million tokens, in EUR. Roughly aligned with OpenRouter's published rates for the same model families — we're not the discount tier, we're the local-jurisdiction alternative. Pay-per-use, no minimums, no commitment.
| Model | Params | Context | EUR / M tokens (in / out) |
|---|---|---|---|
| HimalayaGPT 0.5B | 0.5B | 8K | Free |
| Qwen 3.6 27B | 27B | 128K | EUR 0.30 / 0.90 |
| Qwen 3.6 35B A3B | 35B (3B active) | 128K | EUR 0.25 / 0.60 |
| Gemma 4 31B | 31B | 64K | EUR 0.35 / 1.00 |
| Gemma 4 26B A4B | 26B (4B active) | 128K | EUR 0.25 / 0.55 |
| DeepSeek V4 Flash | 16B (2.5B active) | 128K | EUR 0.12 / 0.25 |
| Qwen 3.5 122B A10B | 122B (10B active) | 256K | EUR 0.60 / 1.40 |
| MiniMax M2.7 | 230B (22B active) | 1M | EUR 0.80 / 1.80 |
| Cohere Command A+ | 218B (25B active) | 128K | EUR 0.45 / 1.35 |
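Every paid model is billed per million input and output tokens, so the cost of a request is simple arithmetic, and so is picking the cheapest model whose context window fits. Below is a Python sketch using the rates from the table above. Reading "128K" as 128,000 tokens is an assumption (a conservative one if the real windows are 131,072), and `cheapest_fit` is a hypothetical helper, not part of the API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int   # "128K" read as 128,000 tokens (assumption)
    eur_in_per_m: float   # EUR per million input tokens
    eur_out_per_m: float  # EUR per million output tokens
    pro_only: bool = False


# Paid models and rates copied from the rate sheet. HimalayaGPT 0.5B (free,
# 8K context) is left out because it targets Nepali-language workloads.
CATALOG = [
    Model("Qwen 3.6 27B", 128_000, 0.30, 0.90),
    Model("Qwen 3.6 35B A3B", 128_000, 0.25, 0.60),
    Model("Gemma 4 31B", 64_000, 0.35, 1.00),
    Model("Gemma 4 26B A4B", 128_000, 0.25, 0.55),
    Model("DeepSeek V4 Flash", 128_000, 0.12, 0.25, pro_only=True),
    Model("Qwen 3.5 122B A10B", 256_000, 0.60, 1.40),
    Model("MiniMax M2.7", 1_000_000, 0.80, 1.80),
    Model("Cohere Command A+", 128_000, 0.45, 1.35),
]


def cost_eur(model: Model, input_tokens: int, output_tokens: int) -> float:
    """Pay-per-use cost of one request: tokens / 1,000,000 * per-million rate."""
    return (input_tokens * model.eur_in_per_m
            + output_tokens * model.eur_out_per_m) / 1_000_000


def cheapest_fit(input_tokens: int, output_tokens: int,
                 allow_pro: bool = False) -> Optional[Model]:
    """Cheapest catalog model whose context window holds prompt plus reply."""
    fits = [m for m in CATALOG
            if m.context_tokens >= input_tokens + output_tokens
            and (allow_pro or not m.pro_only)]
    return min(fits, key=lambda m: cost_eur(m, input_tokens, output_tokens),
               default=None)
```

For a 100K-token prompt with a 2K-token reply, `cheapest_fit(100_000, 2_000)` returns Gemma 4 26B A4B at roughly EUR 0.026 per request. Gemma 4 31B is skipped because 102K tokens exceed its 64K window. Price is only one input; the catalog notes above still apply when quality matters.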
Why our rates sit near OpenRouter, not under it.
Our Kathmandu cost base would let us undercut OpenRouter rates on these models by 50%, but we don't. The differentiator is location, license, and operational posture (hydropower-clean, EU contracting via Scalabs UG, Nepali contracting via ScaLabs Cloud Pvt. Ltd., no training on prompts), not a discount race. HimalayaGPT 0.5B is free because hosting Nepal's sovereign LLM is a public good; the rest is fair-market pay-per-use.
Inference privacy
Your prompts and outputs stay yours.
We do not log request and response bodies by default. We do not train models on customer prompts or outputs. We do not share data with the model authors. Period.
Get an API key
No deposit. No waitlist. Tell us where to send your key.
The inference catalog is live. Drop your details, pick a primary model, and we'll send you an API key and the endpoint URL within one working day. HimalayaGPT is free; the rest is pay-per-use against the rate sheet above.
- HimalayaGPT 0.5B is free for everyone — no card required to start.
- Pay-per-use on the rest, billed monthly against the rates above.
- No deposit, no annual commitment, no waitlist gating.
- NPR, EUR, or USD invoicing — your finance team's currency.
Practical questions
Before you point your SDK at our endpoint.
Why these models specifically?
We pick across families for coverage: HimalayaGPT (free) for Nepali; a couple of dense workhorses (Qwen, Gemma) for code and tool use; MoEs for throughput-sensitive workloads; Cohere Command A+ as the multilingual, agentic flagship; and Qwen 3.5 122B or MiniMax M2.7 when you need long context. We rotate the catalog as the open-model landscape shifts.
How does pricing compare to OpenRouter or the model authors?
Our per-token rates sit near market rates for the same open-weight model families: typically within ±20% of OpenRouter's published rates, sometimes slightly above on smaller models and slightly below on flagship MoEs. We're not the cheapest hosting on the internet. We are the cheapest hosting on the internet that is also EU-billable via Bavaria, Nepali-billable via Kathmandu, hydropower-clean, and contractually bound not to train on your prompts.
How fast is the inference?
Benchmarks for our B60 Dual hardware will be published alongside the launch terms. Founding customers can run their own benchmarks during a 14-day pre-launch period, and we'll publish a public latency dashboard within 30 days of general availability (GA).
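If you benchmark during the pre-launch period, report time to first token (TTFT) separately from decode throughput, since prompt processing and token generation load the hardware differently. Below is a minimal Python sketch that computes both from timestamps your client records while streaming. It assumes one timestamp per output token, which streamed chunks only approximate when a chunk carries several tokens.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunTiming:
    sent_at: float            # monotonic time the request was sent, in seconds
    token_times: list[float]  # monotonic arrival time of each output token


def ttft(run: RunTiming) -> float:
    """Time to first token, in seconds."""
    return run.token_times[0] - run.sent_at


def decode_tokens_per_s(run: RunTiming) -> float:
    """Output tokens per second after the first token arrived."""
    if len(run.token_times) < 2:
        raise ValueError("need at least two output tokens to measure decode speed")
    span = run.token_times[-1] - run.token_times[0]
    if span <= 0:
        raise ValueError("token timestamps must increase")
    return (len(run.token_times) - 1) / span


def summarize(runs: list[RunTiming]) -> dict[str, float]:
    """Median TTFT and decode speed across repeated runs of the same prompt."""
    return {
        "median_ttft_s": median(ttft(r) for r in runs),
        "median_decode_tok_s": median(decode_tokens_per_s(r) for r in runs),
    }
```

Record timestamps with `time.perf_counter()` and run each prompt several times; the median is less sensitive to a single slow outlier than the mean.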
Can I bring my own fine-tuned model?
Not at founding launch. We are focused on getting the base catalog stable and fast first. Reach out if BYO weights is a requirement — we may add it as a paid add-on after the founding cohort.
How do you handle TLS / data sovereignty?
All endpoints serve TLS 1.3 only. The serving infrastructure is in Kathmandu, Nepal. EU customers contracting with Scalabs UG get a data processing agreement (DPA) covering cross-border transfers under Standard Contractual Clauses (SCCs). In normal request paths, data does not pass through US or CN territory.
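Because the endpoints accept TLS 1.3 only, a client can enforce the same floor, so any connection that would negotiate an older protocol fails immediately instead of proceeding. Below is a minimal Python sketch using the standard `ssl` module. It assumes your Python is linked against an OpenSSL build with TLS 1.3 support.

```python
import ssl


def tls13_only_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything below TLS 1.3."""
    ctx = ssl.create_default_context()  # certificate and hostname checks enabled
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx
```

Pass it to the standard library client with `urllib.request.urlopen(req, context=tls13_only_context())`.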
What happens to my API keys if my account is suspended?
Keys are revoked immediately on suspension. We retain key fingerprints for 90 days for audit; the actual secret material is wiped within 24 hours of revocation.
Is this cheaper than running my own LLM on a VPS?
For anything bigger than a 1–3B model, yes. Running such models on a CPU VPS is impractical, and a GPU VPS in Nepal does not exist at our pricing today. Our dedicated inference hardware is shared across many tenants, so its cost is amortized and the per-token cost is dramatically lower than dedicating GPUs to one customer.