# MOEServer

**Repository Path**: knifecms/moeserver

## Basic Information

- **Project Name**: MOEServer
- **Description**: 使用DeepSpeed MoE提供QWenCoder3推理服务
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-24
- **Last Updated**: 2025-10-24

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DeepSpeed MOE + Qwen3-Coder Web API系统

这是一个基于DeepSpeed MOE技术的高性能Qwen3-Coder推理API系统，专门为Roo Code等开发工具提供OpenAI兼容的接口。

## 🚀 功能特性

- **高性能推理**: 基于DeepSpeed MOE技术，支持大规模模型的并行推理
- **OpenAI兼容**: 提供完全兼容OpenAI ChatGPT API的接口
- **多工具集成**: 支持Roo Code、Claude Code、Continue.dev等主流开发工具
- **流式输出**: 支持实时流式响应，提升用户体验
- **智能负载均衡**: 支持多GPU并行处理和智能任务分配
- **监控统计**: 提供详细的性能监控和统计信息

## 📋 系统架构

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   开发工具      │    │   Web API网关    │    │  DeepSpeed MOE  │
│  (Roo Code等)   │◄──►│   (FastAPI)      │◄──►│   推理引擎      │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │  认证与监控     │    │  Qwen3-Coder    │
                       │  (API Keys)     │    │   模型          │
                       └──────────────────┘    └─────────────────┘
```

## 🛠️ 安装要求

### 硬件要求

- **GPU**: 建议24GB+显存的NVIDIA GPU (如A100 40GB)
- **内存**: 64GB+系统内存
- **存储**: 2TB+ SSD存储空间
- **CPU**: 16核+处理器

### 软件要求

- Ubuntu 20.04+ / CentOS 8+
- Python 3.8+
- CUDA 11.8+
- NVIDIA驱动 520+
- Docker 20.10+ (可选)

## 📦 快速安装

### 1. 克隆项目

```bash
git clone <repository-url>
cd moeserver
```

### 2. 自动化部署

```bash
# 给部署脚本执行权限
chmod +x deploy.sh

# 运行部署脚本
./deploy.sh --model-path /path/to/your/model

# 或者使用CPU模式
./deploy.sh --cpu --model-path /path/to/your/model
```

### 3. 下载模型

```bash
# 下载Qwen3-Coder模型
git clone https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct \
  /workspace/models/qwen3-coder-480b-a35b-instruct

# 或者使用镜像源
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/Qwen/Qwen3-Coder-480B-A35B-Instruct \
  qwen3-coder-480b-a35b-instruct

# 下载模型文件
cd qwen3-coder-480b-a35b-instruct

pip install -U modelscope
modelscope download --model=qwen/Qwen3-Coder-480B-A35B-Instruct \
                    --local_dir ./Qwen3-Coder-480B-A35B-Instruct \
                    --cache_dir ./ms_cache

```

### 4. 启动服务

```bash
# 激活虚拟环境
source venv/bin/activate

# 启动API服务器
./start_api.sh

# 或者直接启动
python api_server.py
```

## 🔧 配置说明

### API服务配置

编辑 `config.py` 文件来自定义配置：

```python
# API服务配置
APIConfig.HOST = "0.0.0.0"
APIConfig.PORT = 8000

# DeepSpeed配置
DeepSpeedConfig.TENSOR_PARALLEL_SIZE = 8  # 张量并行度
DeepSpeedConfig.EXPERT_PARALLEL_SIZE = 8  # 专家并行度
DeepSpeedConfig.NUM_EXPERTS = 128  # 专家数量

# 模型配置
ModelConfig.MODEL_PATH = "/path/to/your/model"
```

### 环境变量配置

创建 `.env` 文件：

```bash
# API配置
API_HOST=0.0.0.0
API_PORT=8000
ENVIRONMENT=production
DEBUG=false

# 模型配置
MODEL_PATH=/workspace/models/qwen3-coder-480b-a35b-instruct
TENSOR_PARALLEL_SIZE=8
EXPERT_PARALLEL_SIZE=8
NUM_EXPERTS=128

# GPU配置
USE_CPU=false

# API Keys
ROO_CODE_KEY=roo-code-key-2024
CLAUDE_CODE_KEY=claude-code-key-2024
LOCAL_DEV_KEY=local-dev-key-2024
```

## 🔌 工具集成

### Roo Code集成

1. 打开Roo Code设置界面
2. 选择"API Provider" → "OpenAI Compatible"
3. 配置以下参数：

```
Base URL: http://localhost:8000
API Key: roo-code-key-2024
Model: qwen3-coder-480b-a35b-instruct
```

### Claude Code集成

```bash
# 安装Claude Code
npm install -g @anthropic-ai/claude-code

# 配置环境变量
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_AUTH_TOKEN=claude-code-key-2024

# 使用Claude Code
claude-code code
```

### Continue.dev集成

在VSCode的设置中添加：

```json
{
  "continue.apiKey": "local-dev-key-2024",
  "continue.apiBase": "http://localhost:8000",
  "continue.defaultModel": "qwen3-coder-480b-a35b-instruct",
  "continue.useOpenAi": true
}
```

## 📡 API文档

### 基础信息

- **Base URL**: `http://localhost:8000`
- **API版本**: v1
- **认证方式**: Bearer Token

### 端点列表

#### 1. 健康检查

```bash
GET /health
```

响应：
```json
{
  "status": "healthy",
  "model_loaded": true,
  "stats": {
    "inference_count": 100,
    "avg_latency": 0.5,
    "tokens_per_inference": 150
  }
}
```

#### 2. 获取模型列表

```bash
GET /v1/models
Authorization: Bearer <your-api-key>
```

#### 3. 聊天完成

```bash
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer <your-api-key>

{
  "model": "qwen3-coder-480b-a35b-instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}
  ],
  "max_tokens": 1000,
  "temperature": 0.1,
  "stream": false
}
```

#### 4. 流式聊天完成

```bash
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer <your-api-key>

{
  "model": "qwen3-coder-480b-a35b-instruct",
  "messages": [
    {"role": "user", "content": "Explain how this function works: def quick_sort(arr):"}
  ],
  "stream": true
}
```

### JavaScript/TypeScript客户端示例

```javascript
class Qwen3CoderClient {
  constructor(baseURL, apiKey) {
    this.baseURL = baseURL;
    this.apiKey = apiKey;
  }

  async chat(messages, options = {}) {
    const response = await fetch(`${this.baseURL}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${this.apiKey}`
      },
      body: JSON.stringify({
        model: 'qwen3-coder-480b-a35b-instruct',
        messages,
        ...options
      })
    });

    if (!response.ok) {
      throw new Error(`API request failed: ${response.statusText}`);
    }

    return response.json();
  }

  async *streamChat(messages, options = {}) {
    const response = await fetch(`${this.baseURL}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${this.apiKey}`
      },
      body: JSON.stringify({
        model: 'qwen3-coder-480b-a35b-instruct',
        messages,
        stream: true,
        ...options
      })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      const chunk = decoder.decode(value);
      const lines = chunk.split('\n');

      for (const line of lines) {
        if (line.startsWith('data: ')) {
          const data = line.slice(6);
          if (data === '[DONE]') return;
          try {
            yield JSON.parse(data);
          } catch (e) {
            // 忽略解析错误
          }
        }
      }
    }
  }
}

// 使用示例
const client = new Qwen3CoderClient('http://localhost:8000', 'roo-code-key-2024');

const messages = [
  { role: 'user', content: 'Write a React component for a todo list.' }
];

// 非流式调用
const result = await client.chat(messages);
console.log(result.choices[0].message.content);

// 流式调用
for await (const chunk of client.streamChat(messages)) {
  if (chunk.choices[0].delta.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
```

## 🚀 性能优化

### GPU优化配置

```python
# 调整并行策略
DeepSpeedConfig.TENSOR_PARALLEL_SIZE = 4  # 张量并行
DeepSpeedConfig.EXPERT_PARALLEL_SIZE = 8  # 专家并行
DeepSpeedConfig.NUM_EXPERTS = 128  # 专家数量

# 推理优化
DeepSpeedConfig.REPLACE_WITH_KERNEL_INJECT = True
DeepSpeedConfig.USE_CACHE = True
```

### 批处理配置

```python
# 支持批量推理
BATCH_SIZE = 8
MAX_BATCH_TOKENS = 8192
```

## 📊 监控和日志

### 性能监控

```bash
# 查看API统计
curl -H "Authorization: Bearer roo-code-key-2024" \
     http://localhost:8000/stats
```

### 日志查看

```bash
# 查看实时日志
tail -f logs/api.log

# 查看错误日志
grep ERROR logs/api.log
```

## 🐳 Docker部署

### 构建镜像

```bash
# 构建Docker镜像
docker build -t deepspeed-qwen3-api .

# 使用docker-compose启动
docker-compose up -d
```

### 环境变量

```bash
# GPU模式
docker run -d \
  --name qwen3-api \
  --gpus all \
  -p 8000:8000 \
  -e MODEL_PATH=/workspace/models/qwen3-coder-480b-a35b-instruct \
  deepspeed-qwen3-api

# CPU模式
docker run -d \
  --name qwen3-api \
  -p 8000:8000 \
  -e USE_CPU=true \
  -e MODEL_PATH=/workspace/models/qwen3-coder-480b-a35b-instruct \
  deepspeed-qwen3-api
```

## 🔧 故障排除

### 常见问题

1. **CUDA内存不足**
   - 减少并行度设置
   - 使用CPU模式
   - 启用模型分片

2. **模型加载失败**
   - 检查模型路径是否正确
   - 确认模型文件完整性
   - 验证GPU内存是否充足

3. **API连接超时**
   - 检查网络连接
   - 调整超时设置
   - 确认防火墙配置

### 调试模式

```bash
# 启用调试模式
export DEBUG=true
python api_server.py --log-level debug

# 查看详细错误
export LOG_LEVEL=DEBUG
./start_api.sh
```

## 📈 性能基准

### 测试配置
- 硬件: 8x A100 80GB
- 模型: Qwen3-Coder-480B-A35B-Instruct
- 并行: 8张量并行 + 8专家并行

### 性能指标
- **推理延迟**: ~500ms (256 tokens)
- **吞吐量**: ~100 requests/second
- **内存使用**: ~640GB (模型参数)
- **GPU利用率**: ~95%

## 🤝 贡献指南

欢迎提交Pull Request和Issue！

1. Fork项目
2. 创建特性分支
3. 提交更改
4. 推送到分支
5. 创建Pull Request

## 📄 许可证

本项目基于Apache 2.0许可证开源。

## 🙏 致谢

- [Qwen Team](https://github.com/QwenLM) - 提供优秀的Qwen3-Coder模型
- [Microsoft DeepSpeed Team](https://github.com/microsoft/DeepSpeed) - 提供强大的深度学习优化库
- [Roo Code Team](https://github.com/RooCodeInc) - 提供AI编程助手框架

## 📞 支持

如果您遇到问题或有疑问，请：

1. 查阅本文档的故障排除部分
2. 搜索已有的Issues
3. 创建新的Issue描述问题
4. 参与讨论和社区交流

---

**Happy Coding with DeepSpeed MOE + Qwen3-Coder! 🚀**