
DCE Product Roadmap

Disclaimer

This roadmap reflects current planning directions. Features and timelines are subject to change. Refer to release notes for confirmed deliverables.

Planning horizons: H1 2026 · H2 2026 · 2027+
AI
  • Inference runtime integration (vLLM / SGLang), domestic GPU support
  • Model asset center MVP (user/project/repo management, model & dataset upload/download, CLI)
  • Pre-integrated domestic model repos (Qwen / GLM / Baichuan)
  • Inference acceleration: multi-level KV Cache, topology-aware scheduling (Kueue / Gang)
  • Training-inference co-location basics
  • AI fault diagnosis (multi-source log correlation + root cause analysis)
  • Predictive alerting (time-series anomaly detection, resource exhaustion warnings)
  • DCE AI Runtime GA
  • Unified inference API (OpenAI API / Llama Stack compatible)
  • Fine-tuning / LoRA support
  • Multi-modal inference (text-image, audio-video)
  • Model asset center enhancements (remote replication/sync, security scanning, pre-warming, i18n)
  • MatrixHub CNCF submission [1]
  • AI Agent infrastructure Beta (sandbox, memory & context, semantic routing)
  • Fault self-healing (integrated training/inference framework auto-recovery)
  • Alert noise reduction (automatic correlated alert grouping)
  • LLM security (model access control, inference content safety policies)
  • Distributed inference
  • Training-inference co-location optimization
  • Full-stack AI automation (AutoML + Agent)
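The unified inference API above targets OpenAI API compatibility, so existing OpenAI-style clients should work unchanged. As a rough sketch (the endpoint URL and model name are placeholders, not confirmed DCE values), a client would build a standard chat completions payload:

```python
import json

# Placeholder endpoint; the roadmap promises OpenAI API compatibility, so any
# OpenAI-style client should work against the unified inference API once it ships.
DCE_INFERENCE_URL = "https://dce.example.com/v1/chat/completions"  # assumption

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Example payload for a pre-integrated domestic model (model name illustrative).
payload = build_chat_request("qwen2.5-7b-instruct", "Summarize this roadmap.")
print(json.dumps(payload, ensure_ascii=False))
```

Because the wire format is the standard one, the same payload should work whether the backend is vLLM or SGLang.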
Infra
  • MetaX GPU onboarding (network topology, Lustre GDS)
  • Ascend 910C NPU scheduling (CANN driver)
  • Hygon DCU GPU scheduling
  • AI high-performance storage (Lustre file system)
  • Kueue / Gang Scheduling / LWS / DRA integration
  • HAMi commercial edition integration [2]
  • containerd enhancements (container disk limits)
  • Domestic GPU full GA (MetaX / Ascend / Hygon / Biren)
  • MetaX supernode release
  • Supernode solution (8/16-card high-density, GPU sharing scheduler)
  • GPU Operator hybrid scheduling (CPU + GPU + NPU), utilization → 80%+
  • Distributed storage solution (cloud scenarios)
  • DPU / NPU unified scheduling
  • Computing network, multi-cluster compute federation
  • InfiniBand topology discovery (via UFM)
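The Kueue / Gang Scheduling items above boil down to queue-based admission of batch workloads. A minimal sketch of how a GPU training Job is routed through Kueue (the queue name, image, and GPU count are illustrative; the `kueue.x-k8s.io/queue-name` label and suspend-based admission are standard Kueue conventions):

```python
# Sketch: a batch Job handed to Kueue for admission. The queue name, image,
# and GPU count are placeholders, not DCE defaults.
def gpu_job_manifest(name: str, queue: str, gpus: int) -> dict:
    """Return a Kubernetes Job manifest that Kueue will admit from `queue`."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Kueue admits a suspended Job by unsuspending it once quota is free.
            "suspend": True,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": "trainer:latest",  # placeholder image
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }

manifest = gpu_job_manifest("llm-finetune", "team-ai", 8)
print(manifest["metadata"]["labels"])
```

Topology-aware and gang scheduling then operate on top of this admission flow, placing all of a job's pods together or not at all.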
Plat
  • One-click install (Web UI + CLI, auto environment detection)
  • Preflight check framework (plugin-based, network/storage/permission checks)
  • Gateway API migration start (Ingress retirement)
  • Log aggregation enhancements
  • Compute cloud operations platform admin console
  • Compute baseline review & billing model optimization
  • Ghippo admin console UI
  • CSP user two-factor authentication (2FA)
  • Rolling upgrades (zero-downtime, canary + rollback)
  • Gateway API migration complete
  • Deployment time → 15 min (from ~2 hours)
  • Compute cloud platform enhancements (tenant isolation, inventory management, billing conversion, GPU up/downgrade)
  • Bare-metal deployer (cluster provisioning, automated testing, single-node troubleshooting)
  • Lightweight kernel, edge-native
  • Self-adaptive platform (auto-tuning + self-healing)
Eco
  • Kueue / LWS / Gang Scheduling K8s AI/ML SIG contributions
  • Spiderpool DRA implementation, DRANet
  • Spiderpool MetaX GPU support
  • GAIE / NIXL / LMCache inference optimization project participation
  • MatrixHub Sandbox
  • unifabric 1.0 (network health check, disaster marking, KV Cache sync monitoring)
  • metal-deployer engineering delivery
  • GAIE / NIXL community seats
  • unifabric Sandbox, InfiniBand support
  • Low-code orchestration, natural language operations

[1] MatrixHub — DaoCloud's open-source model asset center, aiming to be for AI models what Harbor is for container images.
[2] HAMi — Heterogeneous AI computing middleware for GPU sharing and isolation.


Strategic Direction

DCE already includes AI Lab (training) and LLM Service Platform (model management & inference). In 2026, we focus on two priorities:

  1. AI Deepening — Round out enterprise inference scenarios, support domestic GPUs, and bridge training to inference
  2. Platform Deepening — Improve the operations experience, deployment efficiency, and compute management to solidify existing capabilities

DCE 5.0 Existing Capabilities

All modules can be upgraded independently without platform-wide downtime.

Module Capabilities
Container Management Multi-cluster management, cluster lifecycle, auto-scaling, Helm apps
Workbench CI/CD pipelines, GitOps, canary releases
Multi-cloud Orchestration Cross-cloud resource scheduling & app orchestration
Microservice Engine Spring Cloud / Dubbo management
Service Mesh Istio-based traffic governance & observability
Cloud-native Networking Multi-CNI, network policies, load balancing
Cloud-native Storage CSI standard, HwameiStor, multi-backend storage
Observability Metrics / logs / traces, multi-dimensional alerting
Middleware Redis / MySQL / Kafka / ES / PG lifecycle management
Image Registry Multi-instance management, Harbor compatible
Global Management Authentication, multi-tenancy, RBAC, audit
Virtual Machines KubeVirt, VM management, snapshots, live migration
AI Lab Training & inference, PyTorch / TensorFlow
LLM Service LLM deployment & operations, vLLM / SGLang
Cloud-Edge Collaboration Edge cluster & node management

Operational Assurance

Category Details
High Availability Multi-replica control plane + etcd cluster, auto-recovery on node failure
Data Backup etcd snapshots, app-level backup (Velero), cross-cluster disaster recovery
Offline Operation Fully offline deployment and operation, no external network dependency
Upgrade Rollback One-click rollback for all version upgrades
Security & Compliance MLPS Level 3, audit logs, image scanning, model access control
Identity LDAP / OIDC / enterprise identity platform integration
Technical Support Documentation + training certification + TAM + 7×24 emergency response

Ecosystem & Partnerships

Open Source Contributions: ranked first in China and among the top three globally for contributions to the Kubernetes core repository. Active in Istio / Cilium / Spiderpool / HwameiStor and other CNCF projects, and an active contributor to Kueue / LWS / Gang Scheduling and other Kubernetes AI/ML SIG projects.

Area Partners
Chips & Compute Huawei Ascend, Hygon, Biren, MetaX, NVIDIA
Operating Systems Kylin, UnionTech UOS
Databases & Middleware DMDB, OceanBase, TiDB

Industry Coverage: Finance · Manufacturing · Energy · Telecom · Government — serving 500+ enterprise customers.
