# InferMatrix
**Repository Path**: devdz/InferMatrix
## Basic Information
- **Project Name**: InferMatrix
- **Description**: Systematic LLM inference performance evaluation | Build performance matrices across models, backends, and hardware
- **Primary Language**: Unknown
- **License**: AGPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-09
- **Last Updated**: 2026-03-17
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# InferMatrix

**Systematic LLM Inference Performance Evaluation**
[License: MIT](https://opensource.org/licenses/MIT) • [Python](https://www.python.org/downloads/) • [Code style: black](https://github.com/psf/black)
*Build Performance Matrices • Compare Hardware Configs • Optimize Deployments*
[Quick Start](#-quick-start) • [Documentation](#-documentation) • [Examples](#-examples) • [Contributing](#-contributing)
---
## 🎯 What is InferMatrix?
**InferMatrix** is a unified framework for systematic evaluation of Large Language Model (LLM) inference performance across multiple backends and hardware configurations. It helps you build comprehensive **performance matrices** to identify optimal deployment strategies.
### Why InferMatrix?
- 🔄 **Multi-Backend Support**: Seamlessly test Ollama, vLLM, and LMStudio
- 🌐 **Flexible Deployment**: Local Windows, WSL, and remote servers via SSH
- 📊 **Matrix-Based Evaluation**: Cross-dimensional performance comparison
- ⚡ **Comprehensive Metrics**: TTFT, TPOT, Throughput, and Token Count
- 🎛️ **JSON-Driven Configuration**: Simple setup, powerful capabilities
- 📈 **Auto Visualization**: Beautiful performance reports out of the box
### Performance Matrix Example
InferMatrix generates systematic performance comparisons like this:
| Backend | RTX 4090 | RTX 3090 | A100 |
|---------|----------|----------|------|
| **Ollama** | 137 tok/s | 89 tok/s | 156 tok/s |
| **vLLM** | 156 tok/s | 102 tok/s | 198 tok/s |
| **LMStudio** | 125 tok/s | 84 tok/s | 145 tok/s |
*Evaluate new models, compare frameworks, and optimize hardware choices systematically.*
---
## 🚀 Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/infermatrix/infermatrix.git
cd infermatrix
# Install dependencies
pip install -r requirements.txt
```
### Basic Usage
```bash
# Run a test with your configuration
python run_tests.py --config configs/example.json
# List all available tests
python run_tests.py --list
# Generate performance report
python run_tests.py --generate-report
```
### Your First Test
Create a simple configuration file `my_test.json`:
```json
{
  "ssh": {
    "local_mode": true,
    "hostname": "127.0.0.1"
  },
  "prompts": ["What is the capital of France?"],
  "max_tokens": 512,
  "result_dir": "results",
  "tests": [
    {
      "name": "ollama-test",
      "model": "llama2:7b",
      "backend": "ollama",
      "port": 11434
    }
  ]
}
```
Run the test:
```bash
python run_tests.py --config my_test.json
```
Results are written to the `results/` directory as JSON files and visualization charts.
---
## 📖 Documentation
### Table of Contents
- [Testing Scenarios](#testing-scenarios)
- [Local Windows Testing](#1-local-windows-testing)
- [WSL Testing via SSH](#2-wsl-testing-via-ssh)
- [Remote Server Testing](#3-remote-server-testing)
- [Configuration Reference](#configuration-reference)
- [Performance Metrics](#performance-metrics)
- [Troubleshooting](#troubleshooting)
---
## 🎬 Testing Scenarios
### 1. Local Windows Testing
Test models running directly on your Windows machine (e.g., Ollama).
#### Prerequisites
- Windows with Ollama installed and running
- Model downloaded via `ollama pull <model_name>` (a quick availability check is sketched below)
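Before pointing InferMatrix at Ollama, you can confirm the service is reachable. This optional check uses Ollama's standard `/api/tags` endpoint on the default port 11434 and is independent of InferMatrix:
```python
# Quick sanity check that Ollama is running locally and has models installed.
# Uses only the Python standard library; /api/tags is Ollama's model-listing endpoint.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://127.0.0.1:11434/api/tags", timeout=5) as resp:
        models = json.load(resp).get("models", [])
    print("Ollama is running. Installed models:")
    for model in models:
        print(" -", model.get("name"))
except OSError as exc:
    print("Ollama does not appear to be reachable on 127.0.0.1:11434:", exc)
```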
#### Configuration
Use the template `configs/local_ollama_example.json`:
```json
{
  "ssh": {
    "local_mode": true,
    "hostname": "127.0.0.1"
  },
  "prompts": [
    "Explain neural networks in simple terms.",
    "Write a Python function to calculate factorial."
  ],
  "max_tokens": 512,
  "result_dir": "results",
  "tests": [
    {
      "name": "ollama-local-test",
      "model": "deepseek-r1:1.5b",
      "backend": "ollama",
      "local_mode": true,
      "port": 11434
    }
  ]
}
```
#### Run Test
```bash
python run_tests.py --config configs/local_ollama_example.json
```
---
### 2. WSL Testing via SSH
Test models deployed in Windows Subsystem for Linux (e.g., vLLM).
#### Prerequisites
- WSL (Ubuntu recommended) installed and configured
- vLLM and dependencies (CUDA) installed in WSL
- Model files accessible in WSL (e.g., `/mnt/e/models/...`)
- Network connectivity between Windows and WSL
#### Configuration
**Step 1**: Get WSL IP address
```bash
# In WSL terminal
hostname -I
# Example output: 172.28.144.1
```
**Step 2**: Create configuration `wsl_test.json`:
```json
{
  "ssh": {
    "local_mode": false,
    "hostname": "172.28.144.1",
    "username": "your_wsl_username",
    "password": "your_password",
    "port": 22
  },
  "prompts": ["What is machine learning?"],
  "max_tokens": 512,
  "result_dir": "results",
  "tests": [
    {
      "name": "wsl-vllm-test",
      "backend": "vllm",
      "port": 8000,
      "backend_config": {
        "model_path": "/mnt/e/models/deepseek-r1-distill-qwen-1.5b",
        "wsl-venv": true,
        "args": {
          "host": "0.0.0.0",
          "port": 8000,
          "tensor-parallel-size": 1,
          "gpu-memory-utilization": 0.9,
          "max-model-len": 4096
        }
      }
    }
  ]
}
```
**Note**: If `wsl-venv: true`, ensure your WSL environment activates the Python virtual environment in non-interactive SSH sessions (configure in `~/.bashrc`).
#### Run Test
```bash
python run_tests.py --config wsl_test.json
```
**How it works**: InferMatrix connects to WSL via SSH, launches the vLLM service, runs the performance tests, and cleanly shuts the service down.
---
### 3. Remote Server Testing
Test models on remote Linux servers (e.g., vLLM, Text Generation Inference).
#### Prerequisites
- SSH access to remote server (username/password or SSH key)
- Model service deployed on the server (vLLM, TGI, etc.)
#### Configuration
Create `remote_test.json`:
```json
{
  "ssh": {
    "local_mode": false,
    "hostname": "192.168.1.100",
    "username": "your_username",
    "password": "your_password",
    "key_path": null,
    "port": 22
  },
  "prompts": [
    "Explain the transformer architecture.",
    "What are the benefits of quantization?"
  ],
  "max_tokens": 1024,
  "result_dir": "results",
  "tests": [
    {
      "name": "remote-vllm-test",
      "backend": "vllm",
      "port": 8000,
      "backend_config": {
        "model_path": "/data/models/llama-2-7b",
        "wsl-venv": false,
        "args": {
          "host": "0.0.0.0",
          "port": 8000,
          "tensor-parallel-size": 2,
          "gpu-memory-utilization": 0.85,
          "max-model-len": 8192
        }
      }
    }
  ]
}
```
#### Test Connectivity (Recommended)
Before running full tests, verify server connectivity:
```bash
python demo_run_server_test.py --config remote_test.json
```
#### Run Performance Test
```bash
python run_tests.py --config remote_test.json
```
---
## ⚙️ Configuration Reference
### Configuration Schema
| Field | Type | Description |
|-------|------|-------------|
| **`ssh.local_mode`** | `bool` | `true`: test local services; `false`: test via SSH |
| **`ssh.hostname`** | `string` | Server IP or hostname (use `127.0.0.1` for local mode) |
| **`ssh.username`** | `string` | SSH username |
| **`ssh.password`** | `string` | SSH password (leave empty if using key) |
| **`ssh.key_path`** | `string` | Path to SSH private key |
| **`ssh.port`** | `int` | SSH port (default: 22) |
| **`prompts`** | `list[string]` | List of test prompts |
| **`max_tokens`** | `int` | Maximum tokens to generate |
| **`result_dir`** | `string` | Output directory for results |
| **`tests`** | `list[object]` | List of test configurations |
| **`tests[].name`** | `string` | Test identifier |
| **`tests[].model`** | `string` | Model name (Ollama only) |
| **`tests[].backend`** | `string` | Backend type: `ollama`, `vllm`, `lmstudio` |
| **`tests[].port`** | `int` | Service port |
| **`tests[].backend_config`** | `object` | Backend-specific configuration |
| **`backend_config.model_path`** | `string` | Model file path on server |
| **`backend_config.wsl-venv`** | `bool` | Activate Python venv in WSL |
| **`backend_config.args`** | `object` | Backend startup arguments |
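For reference, a config can be checked mechanically against the required fields above. The following is an illustrative, standalone sketch (not part of InferMatrix) that flags missing fields:
```python
# Illustrative sanity check of a config file against the schema table above.
import json
import sys

REQUIRED_TOP_LEVEL = ("ssh", "prompts", "max_tokens", "result_dir", "tests")
REQUIRED_PER_TEST = ("name", "backend")

def validate(path):
    """Return a list of human-readable problems found in the config file."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    problems = [f"missing top-level field: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in cfg]
    for i, test in enumerate(cfg.get("tests", [])):
        problems += [f"tests[{i}] missing field: {key}"
                     for key in REQUIRED_PER_TEST if key not in test]
    return problems

if __name__ == "__main__":
    issues = validate(sys.argv[1] if len(sys.argv) > 1 else "my_test.json")
    print("\n".join(issues) if issues else "Config looks structurally complete.")
```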
### Example Configurations
All example configurations are available in the `configs/` directory:
- **`local_ollama_example.json`** - Local Windows Ollama testing
- **`wsl_vllm_example.json`** - WSL vLLM testing via SSH
- **`remote_server_example.json`** - Remote server testing
---
## 📊 Performance Metrics
InferMatrix measures the following key performance indicators:
| Metric | Description | Unit |
|--------|-------------|------|
| **TTFT** | Time To First Token: latency from request to first token | seconds |
| **TPOT** | Time Per Output Token: average time per generated token | milliseconds |
| **Throughput** | Total tokens per second: overall generation speed | tokens/s |
| **Token Count** | Total tokens in response | tokens |
| **Prefill Speed** | Input processing speed | tokens/s |
### Understanding the Metrics
```
Request Timeline:

├── TTFT ──────┤                                    (Prefill Phase)
               ├─ TPOT ─┬─ TPOT ─┬─ ... ─┬─ TPOT ─┤ (Decode Phase)
                 Token 1  Token 2          Token N

Total Latency = TTFT + (TPOT × N tokens)
Throughput    = N tokens / Total Latency
```
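Expressed in code, the relationships above are just a few arithmetic steps. This sketch mirrors the formulas in the diagram; the project's own `utils/metrics.py` may compute them differently:
```python
# Metric definitions from the timeline above.
def ttft_seconds(first_token_time: float, request_time: float) -> float:
    """Time To First Token: latency from request to first token, in seconds."""
    return first_token_time - request_time

def tpot_ms(total_latency_s: float, ttft_s: float, n_tokens: int) -> float:
    """Time Per Output Token in ms, using Total Latency = TTFT + TPOT * N."""
    return (total_latency_s - ttft_s) / max(n_tokens, 1) * 1000.0

def throughput_tok_per_s(n_tokens: int, total_latency_s: float) -> float:
    """Overall generation speed: N tokens / Total Latency."""
    return n_tokens / total_latency_s

# Example: 512 tokens generated in 4.0 s total, with a TTFT of 0.25 s.
print(tpot_ms(4.0, 0.25, 512))          # ≈ 7.32 ms per output token
print(throughput_tok_per_s(512, 4.0))   # 128.0 tok/s
```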
---
## 🎨 Examples
### Building a Performance Matrix
Compare multiple models across different backends:
```json
{
  "prompts": ["Calculate 1+1"],
  "tests": [
    {
      "name": "ollama-deepseek",
      "model": "deepseek-r1:1.5b",
      "backend": "ollama",
      "port": 11434
    },
    {
      "name": "vllm-deepseek",
      "backend": "vllm",
      "backend_config": {
        "model_path": "/models/deepseek-r1-1.5b"
      }
    },
    {
      "name": "lmstudio-deepseek",
      "backend": "lmstudio",
      "port": 8000
    }
  ]
}
```
The results are automatically aggregated into a comparison matrix:
```
Framework Comparison - DeepSeek R1 1.5B
┌───────────┬──────────┬───────────┬──────────────┐
│ Backend   │ TTFT (s) │ TPOT (ms) │ Throughput   │
├───────────┼──────────┼───────────┼──────────────┤
│ Ollama    │     4.87 │      7.29 │ 137.09 tok/s │
│ vLLM      │     0.25 │     15.80 │  63.29 tok/s │
│ LMStudio  │     2.33 │      9.81 │ 101.89 tok/s │
└───────────┴──────────┴───────────┴──────────────┘
```
### Hardware Configuration Testing
Test the same model on different GPUs:
```bash
# Test on RTX 4090
python run_tests.py --config configs/rtx4090_config.json
# Test on RTX 3090
python run_tests.py --config configs/rtx3090_config.json
# Compare results
python generate_comparison_report.py \
--results results/rtx4090_results.json results/rtx3090_results.json
```
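If you just want to eyeball two result files side by side, a small aggregation sketch can do it. The JSON field names used below (`test_name`, `ttft`, `tpot`, `throughput`) are assumptions for illustration, not the documented result schema; adjust them to match the files InferMatrix actually writes:
```python
# Illustrative side-by-side view of two result files.
# NOTE: the field names below are assumed, not InferMatrix's documented schema.
import json

def load_results(path):
    """Load a result file and normalize it to a list of result entries."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]

for path in ("results/rtx4090_results.json", "results/rtx3090_results.json"):
    for entry in load_results(path):
        print(f"{path}: {entry.get('test_name', '?'):<20} "
              f"TTFT={entry.get('ttft', 'n/a')} s  "
              f"TPOT={entry.get('tpot', 'n/a')} ms  "
              f"throughput={entry.get('throughput', 'n/a')} tok/s")
```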
---
## 🔧 Troubleshooting
### Common Issues and Solutions
#### SSH Connection Failed
**Symptoms**: `Connection refused` or `Permission denied`
**Solutions**:
- ✅ Verify `hostname`, `username`, `password`/`key_path`, and `port`
- ✅ Ensure SSH service is running: `sudo systemctl status ssh`
- ✅ Test connectivity: `ssh username@hostname`
- ✅ For WSL: Check Windows Defender Firewall settings
#### Model Not Found
**Symptoms**: `Model not found` or `404 Not Found`
**Solutions**:
- **Ollama**: Verify model name matches `ollama list` exactly
- **vLLM**: Check `model_path` points to directory with `config.json` and weight files
- **WSL**: Ensure Windows drive mount path is correct (e.g., `/mnt/e/...`)
#### Port Already in Use
**Symptoms**: `Address already in use` or `Port is occupied`
**Solutions**:
- ✅ Check if port is in use: `lsof -i :8000` (Linux) or `netstat -ano | findstr :8000` (Windows)
- ✅ Kill existing process or choose a different port
- ✅ Update the `port` field in the configuration (a cross-platform port check is sketched below)
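As an alternative to `lsof`/`netstat`, a small cross-platform check using only the Python standard library (not part of InferMatrix) can tell you whether a port is already bound:
```python
# Cross-platform check of whether a TCP port is already in use on localhost.
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when something accepted the connection.
        return sock.connect_ex((host, port)) == 0

print("Port 8000 in use:", port_in_use(8000))
```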
#### Permission Denied
**Symptoms**: `Permission denied` when accessing files or starting services
**Solutions**:
- ✅ Ensure user has read/write permissions for config and result directories
- ✅ Verify SSH user has permissions to access model files
- ✅ Check execution permissions: `chmod +x run_tests.py`
#### vLLM Startup Failed in WSL
**Symptoms**: vLLM fails to initialize or detect GPU
**Solutions**:
- ✅ Verify CUDA installation: `nvidia-smi` in WSL
- ✅ Check vLLM can detect GPU: `python -c "import torch; print(torch.cuda.is_available())"`
- ✅ If `wsl-venv: true`, verify activation command in script (default: `source ~/venv/bin/activate`)
- ✅ Ensure GPU drivers are properly passed through to WSL
#### Timeout Errors
**Symptoms**: `Request timeout` or `Connection timeout`
**Solutions**:
- ✅ Increase timeout values in code if testing very large models
- ✅ Check network stability
- ✅ Verify server resources are sufficient (CPU, GPU, RAM)
---
## 🏗️ Architecture
InferMatrix uses a modular architecture for flexibility and extensibility:
```
infermatrix/
├── run_tests.py            # Main entry point
├── test_orchestrator.py    # Test orchestration and workflow
├── llm_tester.py           # Core testing logic
├── ssh_manager.py          # SSH connection management
├── backend_deployer.py     # Backend service deployment
├── configs/                # Configuration examples
│   ├── local_ollama_example.json
│   ├── wsl_vllm_example.json
│   └── remote_server_example.json
├── results/                # Test results output
│   ├── *.json              # Raw performance data
│   └── *.png               # Visualization charts
└── utils/                  # Utility functions
    ├── metrics.py          # Performance metrics calculation
    └── visualizer.py       # Chart generation
```
---
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
### Ways to Contribute
- 🐛 Report bugs via [GitHub Issues](https://github.com/infermatrix/infermatrix/issues)
- 💡 Suggest features or improvements
- 📝 Improve documentation
- 🔧 Submit pull requests
### Priority Areas
- Support for additional backends (TensorRT-LLM, Text-Generation-Inference)
- Enhanced visualization options
- Batch testing capabilities
- Multi-GPU performance analysis
- CI/CD pipeline
---
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## 📚 Citation
If you use InferMatrix in your research, please cite:
```bibtex
@software{infermatrix2025,
  title  = {InferMatrix: Systematic LLM Inference Performance Evaluation},
  author = {Anonymous},
  year   = {2025},
  url    = {https://github.com/infermatrix/infermatrix}
}
```
---
## 🔗 Related Projects
- [vLLM](https://github.com/vllm-project/vllm) - High-throughput LLM serving
- [Ollama](https://github.com/ollama/ollama) - Run LLMs locally
- [LMStudio](https://lmstudio.ai/) - Desktop LLM application
- [llm-perf](https://github.com/ray-project/llm-perf) - LLM performance benchmarking
---
## 🌟 Star History
If you find InferMatrix useful, please consider giving it a star ⭐
---
**Built with ❤️ for the LLM community**
[Documentation](docs/) • [Issues](https://github.com/infermatrix/infermatrix/issues) • [Discussions](https://github.com/infermatrix/infermatrix/discussions)