# DLSlime
**Repository Path**: jianjinok/DLSlime
## Basic Information
- **Project Name**: DLSlime
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-23
- **Last Updated**: 2026-01-23
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
路线图 |
Slack |
微信群 |
知乎
灵活高效的异构传输工具包
## 快速开始 (Getting Started)
DLSlime 提供了一套点对点(Peer-to-Peer)通信接口。例如,针对将远程 Tensor 的切片批量赋值给本地 Tensor 的任务,您可以使用以下 API 来实现。
.
以下是 DLSlime 接口的一些示例。
### P2P 通信
#### RDMA RC 模式
- RDMA RC 读 (同步 / 异步模式)
```
python example/python/p2p_rdma_rc_read.py
```
- RDMA RC 读 (协程模式)
```
python example/python/p2p_rdma_rc_read_coroutine.py
```
- RDMA RC 写 (同步 / 异步模式)
```
python example/python/p2p_rdma_rc_write.py
```
- RDMA RC 带立即数据的写 (Sync / Async mode)
```
python example/python/p2p_rdma_rc_write_with_imm_data.py
```
- RDMA RC 发送/接收 (Send/Recv)
```
python example/python/p2p_rdma_rc_send_recv.py
```
```
python example/python/p2p_rdma_rc_send_recv_gdr.py
```
- DLSlime Torch 后端
```
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 0
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 1
```
#### NVLink 模式
```
# 发起端 (initiator)
python example/python/p2p_nvlink.py --initiator-url "127.0.0.1:6006" --target-url "127.0.0.1:6007" --role initiator
```
```
# 目标端 (target)
python example/python/p2p_nvlink.py --initiator-url "127.0.0.1:6006" --target-url "127.0.0.1:6007" --role target
```
#### NVShmem 模式
```
# 发送 (send)
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 0 --world-size 2
```
```
# 接收 (recv)
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 1 --world-size 2
```
### 华为昇腾直连模式 (Huawei Ascend Direct Mode)
请参阅: [华为 README](docs/huawei_ascend/README.md)
> \[!Caution\]
> DLSlime NVShmem 传输引擎和华为昇腾直连(Ascend Direct)模式尚处于实验阶段。
### 集合通信算子 (Collective Ops)
#### 节点内 (Intra Node)
##### AllGather
```shell
torchrun --nnodes 1 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode intra
```
#### 节点间 (Inter Node)
##### AllGather
```shell
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
```
##### AllGather Gemm 重叠 (Overlapping)
```shell
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py
```
> \[!Note\]
> 上述节点内和节点间示例默认启用 CUDA Graph。使用 `--eager-mode` 可回退到 eager 模式。
## 安装
### pip 安装
```
pip install dlslime==0.0.1.post10
```
> \[!Note\]
> DLSlime pip 版本使用默认 FLAGS 构建(详情请参阅源码编译部分)。
### 源码编译
#### Python
```
git clone https://github.com/deeplink-org/DLSlime.git
FLAG= pip install -v --no-build-isolation -e .
```
#### CPP
```
git clone https://github.com/deeplink-org/DLSlime.git
mkdir -p DLSlime/build && cmake -DFLAG= ..
```
#### 编译标志 (Build flags)
`FLAG` 可以是以下选项:
| 标志 (Flag) | 描述 | 平台 | 默认值 |
| :-------------------- | :---------------------------------- | :----- | -----: |
| `BUILD_RDMA` | 构建 RDMA 传输引擎 | Hetero | ON |
| `BUILD_PYTHON` | 构建 Python 封装 | Hetero | ON |
| `BUILD_NVLINK` | 构建 NVLINK 传输引擎 | GPGPU | OFF |
| `BUILD_NVSHMEM` | 构建 NVShmem 传输引擎 | NVIDIA | OFF |
| `BUILD_ASCEND_DIRECT` | 构建 Ascend 直连传输 | ASCEND | OFF |
| `BUILD_TORCH_PLUGIN` | 构建 DLSlime 为 Torch 后端 | Hetero | OFF |
| `USE_GLOO_BACKEND` | 使用 GLOO RDMA Send/Recv Torch 后端 | Hetero | OFF |
| `BUILD_INTRA_OPS` | 使用 INTRA Collective OPS (节点内) | GPGPU | OFF |
| `BUILD_INTER_OPS` | 使用 INTER Collective OPS (NVSHMEM) | NVIDIA | OFF |
> \[!Note\]
> 在 Metax 平台上使用 DLSlime 作为 Torch 后端时,请启用 `USE_MECA`。
## 基准测试 (Benchmark)
### GDRDMA P2P 读/写
- 平台: NVIDIA ConnectX-7 HHHL 网卡; 200GbE (默认模式) / NDR200 IB; 双端口 QSFP112; PCIe 5.0 x16 (带 x16 PCIe 扩展选项); RoCE v2。
#### #BS=1, #Concurrency=1
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
```
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
```
| 传输引擎 | 通道数 | 消息大小 (bytes) | 批大小 | 并发数 | 平均延迟(ms) | 带宽(MB/s) |
| -------- | ------ | ---------------- | ------ | ------ | ------------ | ---------- |
| dlslime | 1 | 2,048 | 1 | 1 | 0.039 | 52 |
| dlslime | 1 | 4,096 | 1 | 1 | 0.037 | 111 |
| dlslime | 1 | 8,192 | 1 | 1 | 0.038 | 216 |
| dlslime | 1 | 16,384 | 1 | 1 | 0.037 | 442 |
| dlslime | 1 | 32,768 | 1 | 1 | 0.039 | 836 |
| dlslime | 1 | 65,536 | 1 | 1 | 0.039 | 1689 |
| dlslime | 1 | 131,072 | 1 | 1 | 0.041 | 3195 |
| dlslime | 1 | 262,144 | 1 | 1 | 0.043 | 6059 |
| dlslime | 1 | 524,288 | 1 | 1 | 0.049 | 10689 |
| dlslime | 1 | 1,048,576 | 1 | 1 | 0.062 | 17012 |
| dlslime | 1 | 2,097,152 | 1 | 1 | 0.083 | 25154 |
| dlslime | 1 | 4,194,304 | 1 | 1 | 0.127 | 33112 |
| dlslime | 1 | 8,388,608 | 1 | 1 | 0.211 | 39797 |
| dlslime | 1 | 16,777,216 | 1 | 1 | 0.382 | 43893 |
| dlslime | 1 | 33,554,432 | 1 | 1 | 0.726 | 46244 |
| dlslime | 1 | 67,108,864 | 1 | 1 | 1.412 | 47518 |
| dlslime | 1 | 134,217,728 | 1 | 1 | 2.783 | 48235 |
#### #BS=64, #Concurrency=1
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
```
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
```
| 传输引擎 | 通道数 | 消息大小 (bytes) | 批大小 | 并发数 | 平均延迟(ms) | 带宽(MB/s) |
| -------- | ------ | ---------------- | ------ | ------ | ------------ | ---------- |
| dlslime | 1 | 2,048 | 64 | 1 | 0.084 | 1562 |
| dlslime | 1 | 4,096 | 64 | 1 | 0.082 | 3213 |
| dlslime | 1 | 8,192 | 64 | 1 | 0.086 | 6095 |
| dlslime | 1 | 16,384 | 64 | 1 | 0.093 | 11249 |
| dlslime | 1 | 32,768 | 64 | 1 | 0.115 | 18193 |
| dlslime | 1 | 65,536 | 64 | 1 | 0.158 | 26542 |
| dlslime | 1 | 131,072 | 64 | 1 | 0.243 | 34498 |
| dlslime | 1 | 262,144 | 64 | 1 | 0.414 | 40549 |
| dlslime | 1 | 524,288 | 64 | 1 | 0.758 | 44248 |
| dlslime | 1 | 1,048,576 | 64 | 1 | 1.443 | 46510 |
| dlslime | 1 | 2,097,152 | 64 | 1 | 2.809 | 47782 |
| dlslime | 1 | 4,194,304 | 64 | 1 | 5.555 | 48327 |
| dlslime | 1 | 8,388,608 | 64 | 1 | 11.041 | 48624 |
| dlslime | 1 | 16,777,216 | 64 | 1 | 22.003 | 48798 |
| dlslime | 1 | 33,554,432 | 64 | 1 | 43.941 | 48872 |
| dlslime | 1 | 67,108,864 | 64 | 1 | 87.809 | 48912 |
| dlslime | 1 | 134,217,728 | 64 | 1 | 175.512 | 48942 |
#### #BS=64, #Concurrency=8
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
```
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
```
| 传输引擎 | 通道数 | 消息大小 (bytes) | 批大小 | 并发数 | 平均延迟(ms) | 带宽(MB/s) |
| -------- | ------ | ---------------- | ------ | ------ | ------------ | ---------- |
| dlslime | 1 | 2,048 | 64 | 8 | 0.037 | 3519 |
| dlslime | 1 | 4,096 | 64 | 8 | 0.038 | 6948 |
| dlslime | 1 | 8,192 | 64 | 8 | 0.038 | 13758 |
| dlslime | 1 | 16,384 | 64 | 8 | 0.04 | 26416 |
| dlslime | 1 | 32,768 | 64 | 8 | 0.057 | 36997 |
| dlslime | 1 | 65,536 | 64 | 8 | 0.098 | 42618 |
| dlslime | 1 | 131,072 | 64 | 8 | 0.184 | 45602 |
| dlslime | 1 | 262,144 | 64 | 8 | 0.356 | 47148 |
| dlslime | 1 | 524,288 | 64 | 8 | 0.699 | 47975 |
| dlslime | 1 | 1,048,576 | 64 | 8 | 1.384 | 48478 |
| dlslime | 1 | 2,097,152 | 64 | 8 | 2.755 | 48709 |
| dlslime | 1 | 4,194,304 | 64 | 8 | 5.498 | 48823 |
| dlslime | 1 | 8,388,608 | 64 | 8 | 10.982 | 48884 |
| dlslime | 1 | 16,777,216 | 64 | 8 | 21.954 | 48908 |
| dlslime | 1 | 33,554,432 | 64 | 8 | 43.895 | 48923 |
| dlslime | 1 | 67,108,864 | 64 | 8 | 87.766 | 48936 |
| dlslime | 1 | 134,217,728 | 64 | 8 | 175.517 | 48940 |
### GDRDMA 聚合带宽 (Aggregated Bandwidth)
#### #BS=1, #Concurrency=1
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
```
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
```
| 传输引擎 | 通道数 | 消息大小 (bytes) | 批大小 | 并发数 | 平均延迟(ms) | 带宽(MB/s) |
| -------- | ------ | ---------------- | ------ | ------ | ------------ | ---------- |
| dlslime | 8 | 2,048 | 1 | 1 | 0.051 | 157 |
| dlslime | 8 | 4,096 | 1 | 1 | 0.042 | 768 |
| dlslime | 8 | 8,192 | 1 | 1 | 0.04 | 1576 |
| dlslime | 8 | 16,384 | 1 | 1 | 0.054 | 2929 |
| dlslime | 8 | 32,768 | 1 | 1 | 0.051 | 5713 |
| dlslime | 8 | 65,536 | 1 | 1 | 0.052 | 11547 |
| dlslime | 8 | 131,072 | 1 | 1 | 0.055 | 22039 |
| dlslime | 8 | 262,144 | 1 | 1 | 0.058 | 42313 |
| dlslime | 8 | 524,288 | 1 | 1 | 0.064 | 74753 |
| dlslime | 8 | 1,048,576 | 1 | 1 | 0.072 | 127489 |
| dlslime | 8 | 2,097,152 | 1 | 1 | 0.101 | 184823 |
| dlslime | 8 | 4,194,304 | 1 | 1 | 0.149 | 246861 |
| dlslime | 8 | 8,388,608 | 1 | 1 | 0.237 | 299510 |
| dlslime | 8 | 16,777,216 | 1 | 1 | 0.403 | 340252 |
| dlslime | 8 | 33,554,432 | 1 | 1 | 0.743 | 364918 |
| dlslime | 8 | 67,108,864 | 1 | 1 | 1.423 | 378620 |
| dlslime | 8 | 134,217,728 | 1 | 1 | 2.79 | 384630 |
#### #BS=64, #Concurrency=1
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
```
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
```
| 传输引擎 | 通道数 | 消息大小 (bytes) | 批大小 | 并发数 | 平均延迟(ms) | 带宽(MB/s) |
| -------- | ------ | ---------------- | ------ | ------ | ------------ | ---------- |
| dlslime | 8 | 2,048 | 64 | 1 | 0.091 | 11690 |
| dlslime | 8 | 4,096 | 64 | 1 | 0.081 | 24403 |
| dlslime | 8 | 8,192 | 64 | 1 | 0.091 | 45926 |
| dlslime | 8 | 16,384 | 64 | 1 | 0.098 | 84092 |
| dlslime | 8 | 32,768 | 64 | 1 | 0.117 | 138696 |
| dlslime | 8 | 65,536 | 64 | 1 | 0.16 | 206866 |
| dlslime | 8 | 131,072 | 64 | 1 | 0.241 | 273976 |
| dlslime | 8 | 262,144 | 64 | 1 | 0.415 | 320008 |
| dlslime | 8 | 524,288 | 64 | 1 | 0.757 | 353714 |
| dlslime | 8 | 1,048,576 | 64 | 1 | 1.439 | 372217 |
| dlslime | 8 | 2,097,152 | 64 | 1 | 2.819 | 381397 |
| dlslime | 8 | 4,194,304 | 64 | 1 | 5.555 | 386489 |
| dlslime | 8 | 8,388,608 | 64 | 1 | 11.044 | 388927 |
| dlslime | 8 | 16,777,216 | 64 | 1 | 22.009 | 390278 |
| dlslime | 8 | 33,554,432 | 64 | 1 | 43.951 | 390978 |
| dlslime | 8 | 67,108,864 | 64 | 1 | 87.804 | 391370 |
| dlslime | 8 | 134,217,728 | 64 | 1 | 175.508 | 391588 |
#### #BS=64, #Concurrency=8
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
```
```
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
```
| 传输引擎 | 通道数 | 消息大小 (bytes) | 批大小 | 并发数 | 平均延迟(ms) | 带宽(MB/s) |
| -------- | ------ | ---------------- | ------ | ------ | ------------ | ---------- |
| dlslime | 8 | 2,048 | 64 | 8 | 0.036 | 28494 |
| dlslime | 8 | 4,096 | 64 | 8 | 0.038 | 50860 |
| dlslime | 8 | 8,192 | 64 | 8 | 0.048 | 104545 |
| dlslime | 8 | 16,384 | 64 | 8 | 0.041 | 207051 |
| dlslime | 8 | 32,768 | 64 | 8 | 0.056 | 297354 |
| dlslime | 8 | 65,536 | 64 | 8 | 0.099 | 337571 |
| dlslime | 8 | 131,072 | 64 | 8 | 0.185 | 363003 |
| dlslime | 8 | 262,144 | 64 | 8 | 0.356 | 376743 |
| dlslime | 8 | 524,288 | 64 | 8 | 0.701 | 383701 |
| dlslime | 8 | 1,048,576 | 64 | 8 | 1.386 | 387629 |
| dlslime | 8 | 2,097,152 | 64 | 8 | 2.757 | 389493 |
| dlslime | 8 | 4,194,304 | 64 | 8 | 5.5 | 390523 |
| dlslime | 8 | 8,388,608 | 64 | 8 | 10.984 | 391043 |
| dlslime | 8 | 16,777,216 | 64 | 8 | 21.955 | 391291 |
| dlslime | 8 | 33,554,432 | 64 | 8 | 43.891 | 391407 |
| dlslime | 8 | 67,108,864 | 64 | 8 | 87.771 | 391480 |
| dlslime | 8 | 134,217,728 | 64 | 8 | 175.518 | 391530 |
### 异构互连 (Heterogeneous Interconnection)
- 硬件配置
| 设备 | NIC 型号 | 带宽 | PCIe 版本 | PCIe 通道数 |
| :--- | ------------------------------: | -------: | --------: | ----------: |
| A | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
| B | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x8 |
| C | Mellanox ConnectX-7 Lx (MT4129) | 200 Gbps | PCIe 5.0 | x16 |
| D | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
- 实验配置
- 消息大小 = 128 MB
- RDMA RC 读 (单 NIC)
- 亲和性场景 (Under affinity scenario)
- RDMA with GPU Direct
- 互连带宽矩阵:(MB/s, 展示了理论边界的达成情况)。
| 吞吐量 (MB/s) | A | B | C | D |
| :------------ | -------: | -------: | -------: | -------: |
| A | 48967.45 | 28686.29 | 24524.29 | 27676.57 |
| B | 28915.72 | 28275.85 | 23472.29 | 27234.60 |
| C | 24496.14 | 24496.51 | 24513.57 | 24493.89 |
| D | 29317.66 | 28683.25 | 24515.30 | 27491.33 |
详细结果: [bench](bench/results)