# llm_solution

**Repository Path**: openeuler/llm_solution

## Basic Information

- **Project Name**: llm_solution
- **Description**: This project has been migrated to AtomGit || Linked: https://atomgit.com/openeuler/llm_solution
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 16
- **Forks**: 17
- **Created**: 2025-03-11
- **Last Updated**: 2025-12-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: sig-Long

## README

# openEuler Open-Source Full-Stack AI Inference Solution (Intelligence BooM) #

**If your application scenario matches one of the following configurations, you can download the corresponding images below to get started:**

**CPU+NPU (800I A2)**

**Hardware specifications:** Supports a single-node system, two-node cluster, four-node cluster, and large clusters.

**Image paths:**

hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-800I-A2-openeuler24.03-lts-sp2

hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.2.0-x86_64-800I-A2-openeuler24.03-lts-sp2

**CPU+NPU (300I Duo)**

**Hardware specifications:** Supports a single-node system and two-node cluster.

**Image paths:**

hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-300I-Duo-openeuler24.03-lts-sp2

hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.2.0-x86_64-300I-Duo-openeuler24.03-lts-sp2

**CPU+GPU (NVIDIA A100)**

**Hardware specifications:** Supports single-node single-card and single-node multi-card deployments.

**Image paths:**

hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.2.0-aarch64-A100-openEuler24.03-lts-sp2

hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.2.0-aarch64-syshax-openEuler24.03-lts-sp2

**Our vision:** Build de facto standards for open-source AI basic software based on openEuler and promote a thriving ecosystem of enterprise intelligent applications.

**When large models meet industry deployment, why do we need a full-stack solution?**

DeepSeek's innovations lower the threshold for deploying large models, and AI is entering its "Jevons paradox" moment: demand grows significantly, multi-modal interaction breaks through hardware constraints, and lower computing-power requirements reshape deployment logic, marking the transition from the "technology verification period" to the "scale deployment period". However, the core contradictions of industry practice are becoming apparent:

**Industry pain points**

**Difficult adaptation:** Requirements for inference latency, computing cost, and multi-modal support vary greatly across industries (such as finance, manufacturing, and healthcare); a single model or tool chain cannot cover such diverse requirements.

**High cost:** From model training to deployment, frameworks (PyTorch/TensorFlow/MindSpore), hardware (CPU/GPU/NPU), and storage (relational databases/vector databases) must work together; hardware utilization is low, and O&M complexity grows exponentially.

**Ecosystem fragmentation:** The tool chains of hardware vendors (such as Huawei and NVIDIA) and framework vendors (such as Meta and Google) are incompatible with each other, and patchwork deployment leads to long development cycles and slow iteration.

**Technical challenges**

**Inference efficiency bottleneck:** Large models now reach the trillion-parameter scale.
Traditional inference engines lack adequate support for dynamic graph execution, sparse activation, and mixed precision, causing serious waste of computing power.

**Inefficient resource collaboration:** Scheduling heterogeneous computing power across CPUs, GPUs, and NPUs still depends on manual experience, and memory and GPU-memory fragmentation leave resources idle.

To solve these problems, we are working with the open source community to accelerate the maturity of the open-source inference solution Intelligence BooM.

## Technical Architecture ##

![Image](doc/deepseek/asserts/IntelligenceBoom_en.png)

#### **Intelligent Application Platform: Quickly Connect Your Business to AI** ####

**Components:** Smart Application Platform (task planning and orchestration, OS domain model, agent MCP service), multiple intelligent agents (smart tuning, intelligent operations, intelligent Q&A, deep research)

\[openEuler Intelligence open source address\] https://gitee.com/openeuler/euler-copilot-framework

\[DeepInsight open source address\] https://gitee.com/openEuler/deepInsight

**Core Value**

**Intelligent empowerment of the operating system turns passive tuning and maintenance into semi-active operations, providing intelligent assisted driving for the OS.**

**Intelligent optimization:** Breakthrough technologies such as multi-layer system load awareness and AI-inspired optimization strategies for complex systems deliver a performance improvement of over 10% in typical scenarios.

**Intelligent operations and maintenance:** Builds an intelligent OS O&M assistant that turns command-line operations into natural-language commands, covering 100% of typical O&M commands to improve system usability and support ecosystem adoption. Breakthroughs in full-stack collaborative analysis and slow-node fault diagnosis reduce fault-locating time in AI training and inference scenarios from days to hours.

**DeepResearch:** Multi-agent collaboration: by constructing multiple agents (outline planning, information retrieval, evaluation and reflection, and report generation), it breaks through the limits of a single agent and improves research results in complex domains; a minimal sketch of this pipeline appears at the end of this section. Context engineering: techniques such as long/short-term memory, semantic compression, and structured input/output optimize how context is rewritten, selected, and compressed, so the deep-research agent stays focused on the research objective in complex tasks and hallucinations are reduced. Content conflict detection: conflicts among multiple information sources (knowledge bases, intranets, and the public internet) are analyzed to keep report content authentic and objective, increasing users' trust in the research results.

**Intelligent application platform:** General capabilities such as the intelligent assistant, optimization, and O&M are integrated into the platform, together with technologies such as agent services, domain knowledge, and system memory services.
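The multi-agent flow described for DeepResearch can be pictured as a small pipeline. The sketch below is purely illustrative Python under assumed names (`ResearchContext`, `plan_outline`, `retrieve`, `reflect`, `write_report`); it is not an API of openEuler Intelligence or DeepInsight, and a real implementation would call an LLM and real retrieval backends at each step.

```python
# Hypothetical sketch of the multi-agent deep-research flow described above.
from dataclasses import dataclass, field


@dataclass
class ResearchContext:
    """Shared memory passed between the agents (semantic compression omitted)."""
    topic: str
    outline: list[str] = field(default_factory=list)
    evidence: dict[str, list[str]] = field(default_factory=dict)


def plan_outline(ctx: ResearchContext) -> None:
    # Outline-planning agent: break the topic into sections.
    ctx.outline = [f"{ctx.topic}: background", f"{ctx.topic}: current approaches"]


def retrieve(ctx: ResearchContext) -> None:
    # Information-retrieval agent: query knowledge base, intranet, or public web.
    for section in ctx.outline:
        ctx.evidence[section] = [f"placeholder snippet for '{section}'"]


def reflect(ctx: ResearchContext) -> list[str]:
    # Evaluation-and-reflection agent: flag sections with thin or conflicting evidence.
    return [s for s, snippets in ctx.evidence.items() if len(snippets) < 2]


def write_report(ctx: ResearchContext) -> str:
    # Report-generation agent: assemble the final answer from collected evidence.
    return "\n".join(f"# {s}\n{' '.join(ctx.evidence[s])}" for s in ctx.outline)


ctx = ResearchContext(topic="KV cache offloading")
plan_outline(ctx)
retrieve(ctx)
gaps = reflect(ctx)          # sections that would need another retrieval round
print(write_report(ctx))
```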
#### **Inference Service: Enabling Models to Run Efficiently** ####

**Components:** vLLM, SGLang, and LLaMA Factory

\[vLLM open source address\] https://docs.vllm.ai/en/latest/

**Core Value**

**Dynamic scaling:** vLLM supports on-demand model loading and uses Kubernetes auto-scaling policies to cut idle computing-power costs by more than 70%.

**Large-model optimization:** vLLM uses technologies such as PagedAttention and continuous batching to cut the inference latency of trillion-parameter models by 50% and triple the throughput.

**Low-cost model fine-tuning:** Ready to use out of the box, providing a one-stop pipeline from data generation to fine-tuning and training. For small and medium-sized models it supports low-cost hardware such as Atlas 3000; for large models and multi-modal scenarios it supports Atlas 800 A2 with memory-friendly, efficient training and inference, and it also offers Ascend-optimized parallel-strategy tuning tools.

#### **Acceleration Layer: Making Inference "One Step Faster"** ####

**Components:** sysHAX, expert-kit, and LMCache

\[sysHAX open source address\] https://gitee.com/openEuler/sysHAX

\[expert-kit open source address\] https://gitee.com/openEuler/expert-kit

\[LMCache open source address\] https://gitee.com/openeuler/LMCache-mindspore, https://github.com/LMCache/LMCache

**Core Value**

**Heterogeneous-computing collaborative distributed inference acceleration engine:** By combining the computational characteristics of different hardware architectures such as CPUs, NPUs, and GPUs, dynamic task allocation applies the principle of "dedicated hardware for dedicated tasks". Scattered heterogeneous computing power is virtualized into a unified resource pool, enabling fine-grained allocation and elastic scaling.

LMCache manages large-scale KV caches with a memory pool spanning HBM, DDR, disk, and remote storage. Its main performance gains come from prefix caching (sharing the KV cache among multiple instances), CacheGen (compressing the KV cache to save transmission time), and CacheBlend (improving cache hit rates).

#### **Framework Layer: Making Models "Inclusive"** ####

**Components:** MindSpore (all-scenario framework), PyTorch (Meta's general-purpose framework), and MS-InferRT (an inference optimization component under the MindSpore framework, compatible with PyTorch)

\[MindSpore open source address\] https://gitee.com/mindspore

**Core Value**

**Multi-framework compatibility:** Unified APIs let users directly invoke models trained with any framework without rewriting code.

**Dynamic graph optimization:** Graph optimization for the dynamic control flow of large models (such as conditional branches and loops) improves inference stability by 30%.

**Community ecosystem reuse:** Fully inherits the ecosystem tools of PyTorch/TensorFlow (such as the Hugging Face model library), reducing model migration costs.
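As a concrete illustration of the ecosystem reuse described above, the hedged sketch below loads a Hugging Face checkpoint through the standard PyTorch/transformers stack. The model ID and generation settings are placeholders; this is a generic example of the framework-layer idea, not an Intelligence BooM-specific API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint; any causal LM loads the same way

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce the memory footprint
    device_map="auto",          # requires `accelerate`; places weights on the available device
)

prompt = "What does the acceleration layer of an inference stack do?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```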
#### **Data Engineering, Vector Retrieval, and Data Fusion Analysis: From Raw Data to Inference Fuel** ####

**Components:** openGauss, PG Vector, Datajuicer…

**Core Value**

**Efficient processing and management of multi-modal data:** Unified ingestion, cleaning, storage, and indexing of multi-modal data solves the problems of complex data types and large-scale management in inference scenarios and provides a standardized data foundation for upper-layer intelligent applications.

**Efficient retrieval and real-time response:** Fast matching and real-time querying of massive high-dimensional data meet the strict timeliness and accuracy requirements of inference scenarios, shortening the path from data to inference results and providing the underlying performance guarantee for real-time applications such as intelligent Q&A and intelligent O&M.

#### **Task Management Platform: Smart Resource Scheduling** ####

**Components:** OpenFuyao (task orchestration engine), Kubernetes (container orchestration), Ray (distributed computing), and oeDeploy (one-click deployment tool)

\[OpenYuanrong open source address\] https://www.openeuler.openatom.cn/zh/projects/yuanrong/

\[OpenFuyao open source address\] https://gitcode.com/openFuyao

\[Ray open source address\] https://gitee.com/src-openEuler/ray

\[oeDeploy open source address\] https://gitee.com/openEuler/oeDeploy

**Core Value**

**Distributed computing engine:** Provides a unified serverless architecture that supports diverse distributed applications such as AI, big data, and microservices. It offers multi-language function programming interfaces that give distributed application development a single-machine programming experience, plus distributed dynamic scheduling and data sharing for high-performance execution and efficient cluster resource utilization.

**Device-edge-cloud synergy:** Automatically assigns execution nodes based on task type (such as real-time inference or offline batch processing) and hardware capability (such as edge NPUs or cloud GPUs).

**Full-lifecycle management:** Provides a one-stop O&M interface covering model upload, version iteration, dependency installation, and service start/stop.

**Fault self-healing:** Monitors task status in real time, automatically restarts abnormal processes, and switches services to standby nodes to ensure high service availability.

#### **Compiler: Making Code "More Hardware-Savvy"** ####

**Components:** the heterogeneous fusion compiler AscendNPUIR and the operator auto-generation tool AKG

\[AKG open source address\] https://gitee.com/mindspore/akg

**Core Value**

**Cross-hardware optimization:** Automatically translates computing logic across the instruction-set differences of CPUs (x86/ARM), GPUs (CUDA), and NPUs (Ascend/CANN), significantly improving computing-power utilization.

**Mixed-precision support:** Dynamically adjusts FP32/FP16/INT8 precision, greatly improving inference speed while keeping precision loss under control; a generic framework-level sketch follows this section.

**Memory optimization:** Uses techniques such as operator fusion and memory reuse/overcommitment to cut GPU memory and RAM usage by 30%, lowering hardware costs.
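For intuition about the mixed-precision support mentioned above, the hedged sketch below shows the same FP32-to-FP16/BF16 idea expressed at the framework level with PyTorch's `autocast`. It is only a generic illustration, not the AscendNPUIR/AKG compiler flow itself, which makes these precision choices automatically.

```python
import torch

# A tiny FP32 model standing in for a real network.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

x = torch.randn(8, 1024)

device_type = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

model = model.to(device_type)
x = x.to(device_type)

# autocast runs matmul-heavy ops in reduced precision while keeping
# numerically sensitive ops in FP32, trading a small accuracy loss for speed.
with torch.no_grad(), torch.autocast(device_type=device_type, dtype=amp_dtype):
    y = model(x)

print(y.dtype)  # torch.float16 on GPU, torch.bfloat16 under CPU autocast
```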
#### **Operating System: Making the Full Stack "Rock-Solid"** ####

**Components:** openEuler (the open-source Euler operating system), FalconFS (high-performance distributed storage pool), GMEM (heterogeneous converged memory), XSched (heterogeneous computing-power partitioning), xMig (XPU migration), and ModelFS (programmable page cache)

\[openEuler open source address\] https://gitee.com/openEuler

\[FalconFS open source address\] https://gitee.com/openeuler/FalconFS

\[GMEM open source address\] https://gitee.com/openeuler/kernel

\[XSched open source address\] https://gitee.com/openeuler/libXSched

\[xMig open source address\] https://gitee.com/openeuler/xmig

\[ModelFS open source address\] https://gitee.com/openeuler/kernel/tree/OLK-6.6/fs/mfs

**Core Value**

**Heterogeneous resource management:** Supports unified scheduling of CPUs, GPUs, and NPUs, and provides capabilities such as hardware status monitoring and fault isolation.

**Security enhancement:** Integrates Chinese national cryptographic algorithms, permission isolation, and vulnerability scanning to meet the compliance requirements of industries such as finance and government.

**Fast model weight loading:** Programmable page cache and dynamic caching double the speed of weight loading.

#### **Hardware Enablement and Hardware Layer: Making the Most of Computing Power** ####

**Components:** CANN (Ascend AI enablement suite), CUDA (NVIDIA computing platform), CPU (x86/ARM), NPU (Ascend), GPU (NVIDIA/domestic Chinese GPUs)

**Core Value**

**Hardware potential release:** CANN optimizes matrix and vector computation for the Da Vinci architecture of Ascend NPUs, greatly improving computing-power utilization. CUDA provides a mature GPU parallel computing framework that supports common AI tasks.

**Heterogeneous computing-power convergence:** Unified programming interfaces (such as OpenCL) enable collaborative computing among CPUs, NPUs, and GPUs, avoiding the performance bottleneck of any single piece of hardware.

#### **Interconnect Technology: A "High-Speed Conversation" with Hardware** ####

## Full-Stack Solution Deployment Tutorial ##

The solution currently supports more than 50 mainstream models, such as DeepSeek, Qwen, Llama, GLM, and TeleChat. The following describes how to deploy the DeepSeek V3 & R1 models and the openEuler Intelligence application.

### DeepSeek V3 and R1 Deployment ###

Refer to the [Deployment Guide](https://gitee.com/openEuler/llm_solution/blob/master/doc/deepseek/DeepSeek-V3%26R1Deployment%20Guide_en.md) and use the one-click deployment script to start the inference service within 20 minutes. A minimal client sketch for querying the started service appears at the end of this README.

### One-Click Deployment of the DeepSeek Model and the openEuler Intelligence Application ###

Refer to [One-click deployment of openEuler Intelligence](https://gitee.com/openEuler/llm_solution/tree/master/script/mindspore-intelligence/README_en.md) to build a local knowledge base and work with the DeepSeek large model to complete applications such as intelligent optimization and intelligent O&M.

## Participation and Contribution ##

You are welcome to submit your suggestions as issues and help us build a full-stack open-source inference solution with an excellent out-of-the-box experience and leading performance.
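Once an inference service from the deployment guide above is running, it can be queried like any OpenAI-compatible endpoint (vLLM's standard serving mode exposes one). The sketch below is hedged: the base URL, API key, and model name are placeholders that depend on your actual deployment.

```python
from openai import OpenAI

# Placeholders: point these at the endpoint and model name of your own deployment.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[
        {"role": "user", "content": "Summarize the Intelligence BooM stack in one sentence."},
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```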