DeepSeek V4 on Huawei Ascend: Open Weights, MoE at Trillion Scale, and the Self-Hosting Path

DeepSeek V4 ships under MIT license in two MoE sizes (1.6T Pro and 284B Flash) with 1M-token context. Huawei's Ascend 950 SuperNode handles inference. Here is what readers can do with it — and what comes next.


By Nadia Hassan, Insightful AI Desk

DeepSeek released its V4 family of large language models in April 2026 under the MIT license, with full open weights for both V4-Pro (1.6 trillion total parameters, roughly 49 billion active per token) and V4-Flash (284 billion total, roughly 13 billion active per token). Both variants ship with a 1-million-token context window. Huawei has confirmed that its Ascend 950 SuperNode clusters support V4 inference natively, with reported latencies of 10 milliseconds on V4-Flash and 20 milliseconds on V4-Pro.

Per coverage at South China Morning Post and Huawei Central, this is the first frontier-grade Chinese AI model paired publicly with a fully Chinese silicon stack. The hardware is Huawei’s Da Vinci architecture with HBM3 memory; the software is Huawei’s CANN (Compute Architecture for Neural Networks) framework, integrated with DeepSeek’s training and inference code.

For readers, the more immediate practical question is what V4 enables today — both for those running on Chinese silicon and those working on Western infrastructure. The MIT-licensed open weights make this a more flexible model release than most current frontier releases on either side of the geopolitical map.

What V4 actually is

DeepSeek V4 is a Mixture-of-Experts (MoE) family. The two published variants:

  • V4-Pro: 1.6 trillion total parameters; approximately 49 billion active per token. Suitable for the most demanding reasoning and code workloads.
  • V4-Flash: 284 billion total parameters; approximately 13 billion active per token. Suitable for high-throughput inference at lower latency.
  • Both: 1-million-token context window, MIT-licensed open weights.

The MoE architecture is the practical key. Only a subset of experts fires for each token, so per-token compute tracks the active parameter count (roughly 49B for Pro, 13B for Flash) rather than the frontier-scale total. Combined with reduced-precision weights, that keeps the deployable footprint modest: V4-Flash fits on a single H200 141GB or two H100 80GB GPUs, configurations achievable for many enterprise and well-resourced individual developer setups.
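
The arithmetic behind that claim is easy to sketch. The snippet below compares per-token compute for the published active-parameter counts against a hypothetical dense model of the same total size; the two-FLOPs-per-parameter rule of thumb is an illustrative assumption, not a vendor figure.

    # Per-token decode compute scales with *active* parameters, not total ones.
    # Assumption (illustrative): ~2 FLOPs per active parameter per generated token.

    def flops_per_token(params: float) -> float:
        return 2.0 * params

    for name, total, active in [("V4-Flash", 284e9, 13e9), ("V4-Pro", 1.6e12, 49e9)]:
        moe = flops_per_token(active)
        dense = flops_per_token(total)  # a hypothetical dense model of the same total size
        print(f"{name}: ~{moe / 1e9:.0f} GFLOPs/token (MoE) vs "
              f"~{dense / 1e12:.1f} TFLOPs/token for a dense equivalent ({dense / moe:.0f}x)")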

The 1M-token context is the second notable specification. Many frontier closed-weights competitors top out at 200K to 400K tokens in production; Gemini 3.1 Pro at 2M is the current ceiling, but with the pricing surcharge above 200K noted in earlier coverage. DeepSeek V4's 1M-token window, delivered as open weights, changes the cost-context calculation: organizations can self-host at the 1M tier without API surcharges, with the trade-off that infrastructure responsibility shifts to them.

The MIT license

DeepSeek V4 ships under the MIT license, which is among the most permissive open-source licenses. Commercial use, modification, redistribution, fine-tuning, and offering V4-derived services are all permitted; the only obligation is preserving the copyright notice and license text in copies. There are no usage caps, no revenue thresholds that trigger obligations, and no field-of-use restrictions.

This is a meaningful differentiator. Some competing open-weights releases ship under more restrictive licenses (community licenses, no-commercial clauses, revenue caps that trigger renegotiation, or field-of-use restrictions). MIT-licensed weights are the most direct path to deploying a model in regulated, compliance-sensitive, or revenue-significant environments without legal review hurdles. For startups building products on top of an open model, MIT means the license stays out of the way.

The Huawei Ascend pairing

The hardware side of the V4 release is the larger structural development. Per Huawei’s own statement, the full Ascend SuperNode lineup — built around the Ascend 950 series — supports V4 inference natively. Reported performance:

  • V4-Pro on Ascend 950 SuperNode: ~20ms inference latency
  • V4-Flash on Ascend 950 SuperNode: ~10ms inference latency
  • Production: ~750,000 Ascend 950PR units shipping in 2026, with mass production starting in April 2026 and full-scale shipments targeted for the second half of the year

The architectural detail worth noting: Huawei’s SuperNode is an interconnect technology that lashes large numbers of Ascend chips into training-class clusters. Per-chip, the Ascend 950 sits between mid-generation and current-generation Nvidia parts in throughput terms; the SuperNode addresses the gap by scaling out cluster size. For inference workloads — which V4’s MoE architecture suits well — the per-chip performance gap is less binding than for training workloads.

DeepSeek itself has noted that V4 throughput in some regions and cloud platforms is supply-constrained until the second half of 2026, when Ascend 950PR shipments scale. That makes the throughput economics on Chinese cloud platforms timing-dependent through the summer.

Self-hosting paths outside the Ascend ecosystem

For readers without access to Huawei Ascend infrastructure, V4 still runs on standard Nvidia configurations, since the MIT license and open weights make the model portable. Practical hardware requirements, per community deployment guides (a rough KV-cache sizing sketch follows the lists):

V4-Flash (284B / 13B active):

  • FP8: 2× H100 80GB, or 1× H200 141GB, with at least 256GB system RAM
  • Full 1M-token context with KV cache headroom: 4× A100 80GB or 2× H200
  • Quantized (IQ2/Q4 GGUF): single A100 80GB for experimentation, with quality trade-offs

V4-Pro (1.6T / 49B active):

  • 8-16× H200 in multi-node, or 8× B200 in multi-node configuration
  • Single-machine Pro deployment is not currently practical — multi-node distribution is required
  • The forthcoming B300 generation, with anticipated 288GB per chip, is expected to change this
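
The reason the full 1M-token context needs the extra cards is KV-cache growth, which the standard transformer formula makes easy to estimate. The layer and head counts in this sketch are hypothetical placeholders rather than V4's published architecture, so treat the output as an order-of-magnitude illustration only.

    # Generic transformer KV-cache sizing. The layer count, KV-head count, and
    # head dimension below are hypothetical placeholders, NOT published V4
    # architecture details; the point is how the cache grows linearly with context.

    def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                    bytes_per_elem: int = 1) -> float:
        # 2 (K and V) x layers x kv_heads x head_dim x bytes, per cached token
        per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
        return tokens * per_token_bytes / 1e9

    for ctx in (128_000, 1_000_000):
        gb = kv_cache_gb(ctx, layers=60, kv_heads=8, head_dim=128)  # FP8 cache assumed
        print(f"{ctx:>9,} tokens -> ~{gb:.0f} GB of KV cache per sequence")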

The inference framework these community guides recommend is vLLM, which supports MoE expert parallelism, V4's hybrid CSA+HCA attention architecture, and efficient KV-cache management at the 1M-token context scale.
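
A minimal sketch of that path using vLLM's Python API follows; the Hugging Face model ID is a placeholder rather than a confirmed repository name, and the context length is deliberately set well below 1M until KV-cache headroom is verified on your hardware.

    # Minimal vLLM sketch. The model ID is a placeholder, not a confirmed
    # Hugging Face repository name; set tensor_parallel_size to your GPU count.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="deepseek-ai/DeepSeek-V4-Flash",  # placeholder repo name
        tensor_parallel_size=2,                 # e.g. 2x H100 80GB
        max_model_len=131072,                   # start small; raise once KV headroom is verified
    )

    params = SamplingParams(temperature=0.2, max_tokens=512)
    outputs = llm.generate(["Summarize the trade-offs of MoE inference."], params)
    print(outputs[0].outputs[0].text)

The same engine also exposes an OpenAI-compatible HTTP server via vllm serve, which is the form the evaluation and routing sketches later in this piece assume.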

Where V4 fits in the open-weights landscape

V4 is a significant addition to the open-weights ecosystem regardless of geopolitical framing. Three observations matter for understanding its place:

First, V4 is the first frontier-scale open-weights model with a commercial deployment path that is fully in place at release. Many open-weights releases have been research artifacts rather than production-ready products. V4's MIT licensing, combined with Huawei's cloud deployment infrastructure (for users in China) and standard Nvidia portability (for users elsewhere), makes it production-deployable across a wider range of contexts than typical open releases.

Second, V4 sits alongside other recent open-weights releases that collectively shift the available capability floor. Qwen 3.6 (Alibaba), IBM Granite 4.1, HiDream-O1, and Mistral’s next release together represent an open-weights ecosystem that is now competitive with the previous generation of closed-weights frontier models. The technical lag between open and closed frontier is narrowing.

Third, the geopolitical framing — "Chinese AI on Chinese chips" — can obscure the more interesting structural shift, which is that organizations with data-sovereignty requirements, compliance constraints, or simple cost discipline now have a high-quality open-weights option that runs on hardware they can procure on either side of the geopolitical map. That widens the procurement option set for global enterprises and for sovereign-AI initiatives in jurisdictions outside the US-China dyad.

Where the leverage is

The V4 release creates concrete openings for several reader groups.

For enterprise data-sovereignty workloads. Self-hosted V4-Flash on a 2×H100 or 1×H200 configuration delivers frontier-tier capability with full data residency and no third-party API exposure. For regulated industries (healthcare, financial services, defense contractors), this is a meaningful path. Three practical steps: confirm whether V4-Flash’s capability set covers your workload, verify hardware procurement timelines for H100/H200 in your region, and pilot the deployment on a single non-production workload before scaling.

For investors tracking the open-weights ecosystem. V4 reinforces a trend: open-weights frontier capability is becoming a credible alternative to closed-weights API-only competitors. The investment thesis splits into two: infrastructure plays (compute providers serving open-weights workloads — Runpod, Together, Fireworks, plus traditional clouds’ Bedrock and Vertex AI offerings) and tooling plays (vLLM, evaluation harnesses, deployment orchestration). Both sub-markets are forming around the open-weights tier’s commercial maturity.

For builders working on customizable AI products. MIT-licensed open weights mean fine-tuning, distillation, and domain-specific adaptation are commercially viable without license review. The pattern that works: start with V4-Flash for production economics, fine-tune on your domain corpus, deploy on cloud GPU infrastructure with vLLM, monitor cost-per-task against API alternatives. Many production workloads will reach a break-even where self-hosted V4-Flash beats per-token API pricing.
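
To make that break-even concrete, the arithmetic reduces to comparing a GPU-hour rate against per-token API pricing at your sustained utilization. Every figure in this sketch is a placeholder rather than a quoted price.

    # Break-even sketch: self-hosted GPU-hour cost vs per-token API pricing.
    # All prices and throughput figures are illustrative placeholders.

    gpu_cost_per_hour = 4.00           # USD/hour for a rented H200 (placeholder)
    tokens_per_second = 1500           # sustained batched throughput (placeholder)
    api_price_per_m_tokens = 2.50      # blended API price, USD per 1M tokens (placeholder)

    # Cost per 1M tokens at full utilization of the rented GPU.
    self_hosted_per_m = gpu_cost_per_hour / (tokens_per_second * 3600 / 1e6)

    # Effective cost rises as utilization falls; break-even is where it meets the API price.
    breakeven_utilization = self_hosted_per_m / api_price_per_m_tokens

    print(f"Self-hosted at full utilization: ~${self_hosted_per_m:.2f} per 1M tokens")
    print(f"Self-hosting beats the API above ~{breakeven_utilization:.0%} sustained utilization")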

For sovereign-AI initiatives outside the US-China dyad. European, Indian, Southeast Asian, and Middle Eastern AI programs now have a frontier-grade open-weights option that doesn’t require commitment to either ecosystem. Procurement decisions can prioritize regional cloud capacity, data residency requirements, and language coverage. V4’s availability under MIT is structurally helpful for these initiatives in a way that more restrictive licenses are not.

What is worth doing, and what is worth watching

For developers and small teams wanting to experiment with V4 today, three practical paths are within reach.

1. Cloud GPU rental for V4-Flash testing. Runpod, Together, Lambda Labs, and similar providers offer hourly H200 or H100 rentals at rates that make 4-8 hour evaluation runs economical. The practical setup: rent a single H200 141GB instance, download V4-Flash from Hugging Face, deploy with vLLM, run your evaluation corpus against the model. Total cost for a meaningful evaluation: typically under $50.
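
A sketch of the evaluation loop once the weights are being served locally: it assumes vLLM's OpenAI-compatible server is already running on its default port, and the served model name and prompt-file schema are illustrative assumptions.

    # Evaluation loop against a locally served model via its OpenAI-compatible API
    # (vLLM's `vllm serve` exposes http://localhost:8000/v1 by default).
    # The served model name and the JSONL schema are illustrative placeholders.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    with open("eval_prompts.jsonl") as f:           # one {"id": ..., "prompt": ...} per line
        cases = [json.loads(line) for line in f]

    results = []
    for case in cases:
        resp = client.chat.completions.create(
            model="deepseek-v4-flash",              # placeholder served-model name
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0.0,
            max_tokens=1024,
        )
        results.append({"id": case["id"], "output": resp.choices[0].message.content})

    with open("eval_outputs.jsonl", "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in results)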

2. Quantized local experimentation. For developers with a single high-memory workstation or datacenter GPU (A100 80GB class and above), V4-Flash in IQ2 or Q4 GGUF format runs locally with quality trade-offs that are acceptable for exploration and prototyping. Tools like llama.cpp and Ollama support these quantization formats. Throughput will be lower than a production-ready FP8 deployment, but the path is zero-cost beyond electricity (a minimal sketch follows).
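
A minimal version of the quantized local path with llama-cpp-python; the GGUF filename is a placeholder for whichever community quantization you download, and local support for V4 would depend on the llama.cpp project adding the architecture.

    # Quantized local inference via llama-cpp-python. The GGUF path is a
    # placeholder for a community quantization; n_gpu_layers=-1 offloads all
    # layers to the GPU, so reduce it if VRAM runs short.
    from llama_cpp import Llama

    llm = Llama(
        model_path="DeepSeek-V4-Flash-Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=-1,
        n_ctx=32768,                                 # far below 1M; raise as memory allows
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain KV-cache paging in two sentences."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])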

3. Hybrid API + self-hosted pattern. For teams whose workloads split between high-volume routine inference and lower-volume complex reasoning, a hybrid pattern often beats either extreme. Use V4-Flash self-hosted for the high-volume tier (cost-controlled, data-sovereign), and a closed-weights API (Gemini 3.1 Pro, Claude, GPT-5.5) for the complex-reasoning tier where the additional capability is worth the per-token cost. The orchestration layer routes requests by complexity.
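
The orchestration layer can start as a simple heuristic in front of two OpenAI-compatible clients. The complexity test, endpoints, and model names below are illustrative assumptions, not a recommended production routing policy.

    # Hybrid routing sketch: a self-hosted tier for routine requests and a
    # closed-weights API for complex reasoning. The heuristic and model names
    # are illustrative placeholders, not a production routing policy.
    from openai import OpenAI

    local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # self-hosted V4-Flash
    remote = OpenAI()  # closed-weights API; reads OPENAI_API_KEY from the environment

    def looks_complex(prompt: str) -> bool:
        # Placeholder heuristic: long prompts or explicit multi-step asks go to the premium tier.
        keywords = ("prove", "step by step", "refactor", "architecture review")
        return len(prompt) > 4000 or any(k in prompt.lower() for k in keywords)

    def route(prompt: str) -> str:
        if looks_complex(prompt):
            client, model = remote, "premium-reasoning-model"   # placeholder model name
        else:
            client, model = local, "deepseek-v4-flash"          # placeholder served-model name
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content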

Several questions about V4 remain publicly open and worth tracking. Independent capability benchmarking across V4 variants on agentic, code, and multimodal workloads, comparable to the UK AISI evaluation methodology applied to other frontier models, would clarify where V4 sits versus closed-weights competitors. Ascend 950PR production yield and supply through the second half of 2026 will determine whether the Chinese stack’s throughput economics actually scale as projected. The diffusion of V4-derived fine-tunes, meaning which sectors and use cases adapt V4 most effectively, will indicate where the open-weights tier’s commercial maturity arrives first. And a comparative look at how license restrictions across open-weights releases (MIT versus the Llama community license versus other variants) play out in actual deployments would inform license-choice debates that currently rest on legal-theoretical argument.

The most useful near-term signals: V4 benchmark publications from independent evaluation groups, vLLM and orchestration tooling release notes mentioning V4-specific optimizations, Hugging Face deployment statistics on V4 variants, and Ascend 950PR shipping cadence updates from Huawei. Each is independently observable.


How we use AI and review our work: About Insightful AI Desk.