OpenAI-compatible inference. Hydropower-clean. Hosted in Kathmandu.

Open-weight inference at fair-market rates.

ScaLabs Cloud serves HimalayaGPT, Qwen, Gemma, DeepSeek, MiniMax, and Cohere Command A+ on our own hardware in Nepal. One OpenAI-compatible endpoint, per-million-token pricing aligned with OpenRouter for the same open-weight families. HimalayaGPT is free; the rest is straight pay-per-use. Invoiced in NPR, EUR, or USD — your choice.

Request inference access See the catalog

OpenAI-compatible APIEUR · USD · NPR billingNo training on prompts or outputsHydropower-clean compute

inference.stack Our GPU racks · Nepal

Your code Any OpenAI-compatible SDK or HTTP client

OpenAI-compatible gateway /v1/chat/completions · /v1/responses

Local GPU inference Open models · hydropower · single round trip

Free, hosted on our hardware

HimalayaGPT 0.5B — Nepal's sovereign LLM, available to everyone.

HimalayaGPT is a 500-million-parameter, Nepali-language, instruction-tuned model from Himalaya AI Research Lab. We host it for free — no deposit, no monthly minimum, no tier requirement. If you're building anything Nepali-language, point your OpenAI SDK at our endpoint and you're in.

Fair-use limits still apply to stop spam, resale, and runaway loops. Real Nepali-language workloads won't hit them.

See HimalayaGPT

Why this exists

Open-model inference, at OpenRouter-aligned rates, off the US/CN hyperscale path.

OpenAI-compatible

Standard /v1/chat/completions and /v1/responses endpoints. Drop-in for the SDKs you already use.

Open models only

HimalayaGPT, Qwen, Gemma, DeepSeek, MiniMax, Cohere Command A+ — open-weight models you can audit. No black-box frontier serving here.

Hydropower-clean

NEA hydropower. A cleaner grid mix than most inference providers at this price.

No training on prompts

We do not log request and response bodies by default. Your prompts and outputs stay yours.

LLM catalog

9 chat-completion models across 6 families.

Open-weight chat-completion models from HimalayaGPT (free) through Cohere Command A+ and MiniMax M2.7. Drop-in to /v1/chat/completions.

FREE

HimalayaGPT 0.5B

Sovereign Nepali LLM by Himalaya AI Research Lab. Hosted free on our hardware as a public good.

See HimalayaGPT 0.5B → →

QWEN

Qwen 3.6 27B

The default. A capable dense model for coding, tool use, and structured agent loops.

See Qwen 3.6 27B → →

QWEN

Qwen 3.6 35B A3B

MoE economics at small-active-parameter cost. Best tokens-per-EUR ratio in the catalog.

See Qwen 3.6 35B A3B → →

GEMMA

Gemma 4 31B

Gemma's dense flagship in our catalog. Different inductive biases than Qwen — keep both in your evals.

See Gemma 4 31B → →

GEMMA

Gemma 4 26B A4B

The cheapest model in the catalog. 4B active parameters, 26B total. Built for volume.

See Gemma 4 26B A4B → →

DEEPSEEK

DeepSeek V4 Flash

A fast, cheap MoE for high-throughput pipelines. Pro tenants only.

See DeepSeek V4 Flash → →

QWEN

Qwen 3.5 122B A10B

Step up when 30B-class isn't enough. 256K context, 10B active. Reserve for the work that earns it.

See Qwen 3.5 122B A10B → →

MINIMAX

MiniMax M2.7

1M-token context, 22B active. The model you reach for when nothing else fits.

See MiniMax M2.7 → →

COHERE

Cohere Command A+

Cohere's open-weight flagship MoE. 48 languages, agentic-tuned, Apache 2.0.

See Cohere Command A+ → →

LLM pricing

Per-token rates aligned with OpenRouter.

Our per-token rates sit within ±20% of OpenRouter's published rates for the same open-weight model families. We don't publish a static price table because OpenRouter's numbers shift week-to-week and we'd rather quote you the current rate than a stale page. Exact per-model pricing is confirmed at signup. Pay-per-use, no minimums, no commitment, EUR / USD / NPR invoicing.

Model	Params	Context	Best for
HimalayaGPT 0.5B	0.5B	8K	Nepali-language inference — civic services, Nepali content, multilingual agents
Qwen 3.6 27B	27B	128K	Dense agent and coding model
Qwen 3.6 35B A3B	35B (3B active)	128K	Small-active MoE sweet spot
Gemma 4 31B	31B	64K	Dense Gemma model — strong on writing and reasoning
Gemma 4 26B A4B	26B (4B active)	128K	Highest headline allowance — Gemma MoE for high-throughput agent loops
DeepSeek V4 Flash	16B (2.5B active)	128K	Pro-only flash model — throughput-first
Qwen 3.5 122B A10B	122B (10B active)	256K	Larger MoE for longer context and harder reasoning
MiniMax M2.7	230B (22B active)	1000K	Large-agent MoE — 1M context
Cohere Command A+	218B (25B active)	128K	Open-weight enterprise flagship — multilingual, agentic, multimodal

Other inference · Speech, OCR, TTS, voice cloning

Need transcription, document vision, or voice cloning? That's a separate page.

We also host 9 utility models for the inference modalities that don't fit chat-completion shape: 2 speech-to-text (Whisper, Qwen3-ASR), 3 OCR (GLM-OCR, dots.ocr for Nepali / Devanagari, olmOCR-2 for academic / legal), and 4 TTS models (Chatterbox, Qwen3-TTS, Kokoro, Piper — including zero-shot voice cloning).

Same GPUs, same OpenAI-compatible endpoint shape under /v1/audio/* and /v1/images/*, same no-deposit pay-per-use posture. The dedicated landing page has the catalog, per-model pricing, and the voice-cloning API write-up.

See speech, OCR, TTS →

Why our rates sit near OpenRouter, not under it.

We price on what the service is worth, not what we could discount to. The differentiator is jurisdiction, license, and operational posture (hydropower-clean, EU contracting via Scalabs UG, Nepali contracting via ScaLabs Cloud Pvt. Ltd., no training on prompts) — not a discount race. HimalayaGPT 0.5B is free because hosting Nepal's sovereign LLM is a public good; the rest is fair-market pay-per-use.

Inference privacy

Your prompts and outputs stay yours.

We do not log request and response bodies by default. We do not train models on customer prompts or outputs. We do not share data with the model authors. Period.

No prompt or output logging Request/response bodies are not logged by default. Tenants can opt in for their own debugging.

No model training on customer data Customer prompts and outputs are never used to train or fine-tune our hosted models.

Per-tenant rate limits Token-per-minute and concurrent-request limits configurable per tenant, with a documented hard ceiling.

Egress over private network When called from the same Kathmandu region (sister APIs, customer sandboxes), requests stay on the private network — no public egress.

Hardware in our racks Inference GPUs in our racks, owned outright. No third-party cloud middleman.

Catalog audit trail Model versions, weights provenance, and benchmark scores are published with each catalog refresh.

Get an API key

No deposit. No waitlist. Tell us where to send your key.

The inference catalog is live. Drop your details, pick a primary model, and we'll send you an API key and the endpoint URL within one working day. HimalayaGPT is free; the rest is pay-per-use against the rate sheet above.

Continue in the ScaLabs Cloud Console

We'll create your account and email you a 6-digit sign-in code. Finish the request inside the console.

Practical questions

Before you point your SDK at our endpoint.

Why these models specifically?

We pick across families for coverage: HimalayaGPT free for Nepali, a couple of dense workhorses (Qwen, Gemma) for code and tool use, MoEs for throughput-sensitive workloads, Cohere Command A+ as the multilingual / agentic flagship, and Qwen 3.5 122B / MiniMax M2.7 when you need long context. We rotate the catalog as the open-model landscape shifts.

How does pricing compare to OpenRouter or the model authors?

Our per-token rates sit near the open-router market for the same open-weight model families — typically within ±20 % of OpenRouter's published rates, sometimes slightly above on smaller models, slightly below on flagship MoEs. We're not the discount tier; we are the cheapest hosting that is EU-billable via Bavaria, Nepali-billable via Kathmandu, hydropower-clean, and contractually doesn't train on your prompts.

How fast is the inference?

Public benchmarks publishing alongside launch terms. Founding customers can run their own benchmarks during a 14-day pre-launch period; we'll publish a public latency dashboard within 30 days of GA.

Can I bring my own fine-tuned model?

Not at launch. We are focused on getting the base catalog stable and fast first. Reach out if BYO weights is a requirement.

How do you handle TLS / data sovereignty?

All endpoints serve TLS 1.3 only. The serving infrastructure is in Kathmandu, Nepal. EU customers contracting with Scalabs UG get a DPA covering cross-border transfer under SCCs. Data does not pass through US or CN territory in normal request paths.

What happens to my API keys if my account is suspended?

Keys are revoked immediately on suspension. We retain key fingerprints for 90 days for audit; the actual secret material is wiped within 24 hours of revocation.

Is this cheaper than running my own LLM on a VPS?

For anything bigger than a 1–3B model, yes — running on a CPU VPS is impractical, and a GPU VPS in Nepal does not exist at our pricing today. Our inference hardware is shared across many tenants, so per-token cost is dramatically lower than dedicating GPUs to one customer.