OpenAI-compatible inference. Hydropower-clean. Hosted in Kathmandu.

Open-weight inference at fair-market rates.

ScaLabs Cloud serves HimalayaGPT, Qwen, Gemma, DeepSeek, MiniMax, and Cohere Command A+ on our own hardware in Kathmandu. One OpenAI-compatible endpoint, per-million-token pricing aligned with OpenRouter for the same open-weight families. HimalayaGPT is free; the rest is straight pay-per-use. Invoiced in NPR, EUR, or USD — your choice.

OpenAI-compatible APIEUR · USD · NPR billingNo training on prompts or outputsHydropower-clean compute
inference.stack Kathmandu GPU racks
Your code Any OpenAI-compatible SDK or HTTP client
OpenAI-compatible gateway /v1/chat/completions · /v1/responses
Local GPU inference Open models · hydropower · single round trip

Free, hosted on Kathmandu hardware

HimalayaGPT 0.5B — Nepal's sovereign LLM, available to everyone.

HimalayaGPT is a 500-million-parameter, Nepali-language, instruction-tuned model from Himalaya AI Research Lab. We host it on our Kathmandu hardware for free — no deposit, no monthly minimum, no tier requirement. If you're building anything Nepali-language, point your OpenAI SDK at our endpoint and you're in.

Fair-use limits still apply to stop spam, resale, and runaway loops. Real Nepali-language workloads won't hit them.

See HimalayaGPT

Why this exists

Open-model inference, at honest prices, off the US/CN hyperscale path.

01

OpenAI-compatible

Standard /v1/chat/completions and /v1/responses endpoints. Drop-in for the SDKs you already use.

02

Open models only

HimalayaGPT, Qwen, Gemma, DeepSeek, MiniMax, Cohere Command A+ — open-weight models you can audit. No black-box frontier serving here.

03

Hydropower-clean

NEA hydropower in Kathmandu. Cleanest grid mix of any inference provider at this price.

04

No training on prompts

We do not log request and response bodies by default. Your prompts and outputs stay yours.

LLM catalog

9 chat-completion models across 6 families.

Open-weight chat-completion models from HimalayaGPT (free) through Cohere Command A+ and MiniMax M2.7. Drop-in to /v1/chat/completions.

FREE

HimalayaGPT 0.5B

Sovereign Nepali LLM by Himalaya AI Research Lab. Hosted free on our Kathmandu hardware as a public good.

See HimalayaGPT 0.5B → →
QWEN

Qwen 3.6 27B

The default. A capable dense model for coding, tool use, and structured agent loops.

See Qwen 3.6 27B → →
QWEN

Qwen 3.6 35B A3B

MoE economics at small-active-parameter cost. Best tokens-per-EUR ratio in the catalog.

See Qwen 3.6 35B A3B → →
GEMMA

Gemma 4 31B

Gemma's dense flagship in our catalog. Different inductive biases than Qwen — keep both in your evals.

See Gemma 4 31B → →
GEMMA

Gemma 4 26B A4B

The cheapest model in the catalog. 4B active parameters, 26B total. Built for volume.

See Gemma 4 26B A4B → →
DEEPSEEK

DeepSeek V4 Flash

A fast, cheap MoE for high-throughput pipelines. Pro tenants only at founding launch.

See DeepSeek V4 Flash → →
QWEN

Qwen 3.5 122B A10B

Step up when 30B-class isn't enough. 256K context, 10B active. Reserve for the work that earns it.

See Qwen 3.5 122B A10B → →
MINIMAX

MiniMax M2.7

1M-token context, 22B active. The model you reach for when nothing else fits.

See MiniMax M2.7 → →
COHERE

Cohere Command A+

Cohere's open-weight flagship MoE. 48 languages, agentic-tuned, Apache 2.0.

See Cohere Command A+ → →

LLM pricing

Per-million-token rates. Priced near the open-router market.

Input and output rates per million tokens, in EUR. Roughly aligned with OpenRouter's published rates for the same model families — we're not the discount tier, we're the local-jurisdiction alternative. Pay-per-use, no minimums, no commitment.

ModelParamsContextEUR / M tokens (in / out)
HimalayaGPT 0.5B0.5B8KFree
Qwen 3.6 27B27B128KEUR 0.30 / 0.90
Qwen 3.6 35B A3B35B (3B active)128KEUR 0.25 / 0.60
Gemma 4 31B31B64KEUR 0.35 / 1.00
Gemma 4 26B A4B26B (4B active)128KEUR 0.25 / 0.55
DeepSeek V4 Flash16B (2.5B active)128KEUR 0.12 / 0.25
Qwen 3.5 122B A10B122B (10B active)256KEUR 0.60 / 1.40
MiniMax M2.7230B (22B active)1000KEUR 0.80 / 1.80
Cohere Command A+218B (25B active)128KEUR 0.45 / 1.35

Other inference · Speech, OCR, TTS, voice cloning

Need transcription, document vision, or voice cloning? That's a separate page.

We also host 8 utility models for the inference modalities that don't fit chat-completion shape: 2 speech-to-text (Whisper, Qwen3-ASR), 2 OCR (GLM-OCR, Mistral OCR), and 4 TTS models (Chatterbox, Qwen3-TTS, Kokoro, Piper — including zero-shot voice cloning).

Same Kathmandu GPUs, same OpenAI-compatible endpoint shape under /v1/audio/* and /v1/images/*, same no-deposit pay-per-use posture. The dedicated landing page has the catalog, per-model pricing, and the voice-cloning API write-up.

See speech, OCR, TTS →

Why our rates sit near OpenRouter, not under it.

We could undercut the open-router market by 50 % on these models — the Kathmandu cost base would let us — but we don't. The differentiator is location, license, and operational posture (hydropower-clean, EU contracting via Scalabs UG, Nepali contracting via ScaLabs Cloud Pvt. Ltd., no training on prompts), not a discount race. HimalayaGPT 0.5B is free because hosting Nepal's sovereign LLM is a public good; the rest is fair-market pay-per-use.

Inference privacy

Your prompts and outputs stay yours.

We do not log request and response bodies by default. We do not train models on customer prompts or outputs. We do not share data with the model authors. Period.

No prompt or output logging Request/response bodies are not logged by default. Tenants can opt in for their own debugging.
No model training on customer data Customer prompts and outputs are never used to train or fine-tune our hosted models.
Per-tenant rate limits Token-per-minute and concurrent-request limits configurable per tenant, with a documented hard ceiling.
Egress over private network When called from the same Kathmandu region (sister APIs, customer sandboxes), requests stay on the private network — no public egress.
Hardware in our racks Inference GPUs in Kathmandu, owned outright. No third-party cloud middleman.
Catalog audit trail Model versions, weights provenance, and benchmark scores are published with each catalog refresh.

Practical questions

Before you point your SDK at our endpoint.

Why these models specifically?

We pick across families for coverage: HimalayaGPT free for Nepali, a couple of dense workhorses (Qwen, Gemma) for code and tool use, MoEs for throughput-sensitive workloads, Cohere Command A+ as the multilingual / agentic flagship, and Qwen 3.5 122B / MiniMax M2.7 when you need long context. We rotate the catalog as the open-model landscape shifts.

How does pricing compare to OpenRouter or the model authors?

Our per-token rates sit near the open-router market for the same open-weight model families — typically within ±20 % of OpenRouter's published rates, sometimes slightly above on smaller models, slightly below on flagship MoEs. We're not the cheapest hosting on the internet; we are the cheapest hosting on the internet that's also EU-billable via Bavaria, Nepali-billable via Kathmandu, hydropower-clean, and contractually doesn't train on your prompts.

How fast is the inference?

B60 Dual benchmarks publishing alongside launch terms. Founding customers can run their own benchmarks during a 14-day pre-launch period; we'll publish a public latency dashboard within 30 days of GA.

Can I bring my own fine-tuned model?

Not at founding launch. We are focused on getting the base catalog stable and fast first. Reach out if BYO weights is a requirement — we may add it as a paid add-on after the founding cohort.

How do you handle TLS / data sovereignty?

All endpoints serve TLS 1.3 only. The serving infrastructure is in Kathmandu, Nepal. EU customers contracting with Scalabs UG get a DPA covering cross-border transfer under SCCs. Data does not pass through US or CN territory in normal request paths.

What happens to my API keys if my account is suspended?

Keys are revoked immediately on suspension. We retain key fingerprints for 90 days for audit; the actual secret material is wiped within 24 hours of revocation.

Is this cheaper than running my own LLM on a VPS?

For anything bigger than a 1–3B model, yes — running on a CPU VPS is impractical, and a GPU VPS in Nepal does not exist at our pricing today. Our inference dedicated hardware amortizes across many tenants, so per-token cost is dramatically lower than dedicating GPUs to one customer.