# rbpe

**Repository Path**: MoleSir/rbpe

## Basic Information

- **Project Name**: rbpe
- **Description**: A simple Byte Pair Encoding written in Rust.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-19
- **Last Updated**: 2025-03-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Tokenize, Rust

## README

# Rust Byte Pair Encoding

A simple Byte Pair Encoding (BPE) tokenizer written in Rust, with optional Python bindings built via maturin.


## Build 

````bash
cargo build
````

There is a simple test in `examples/test.rs`. Run

````bash
cargo run --example test
````

to test this BPE implementation.


## Build Python Module

Install maturin:

````bash
pip install maturin
````

Use maturin to build the `.whl` file:

````bash
maturin build
````

The `.whl` file is generated in `./target/wheels`. Use pip to install it (the exact file name depends on your Python version and platform):

````bash
pip install target/wheels/rbpe-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl --force-reinstall
````

Test the Python module with `examples/test.py`:

````python
import rbpe

origin_texts = [
    "Learn about language model tokenization",
    "OpenAI's large language models process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens",
    "春风拂面花满枝，流水潺潺鸟语时。山川如画心自远，岁月静好梦相依。",
    "「今日も一日頑張りましょう！💪✨🌞」",
]

# Load the training corpus.
with open('./data/train.txt', 'r', encoding='utf-8') as file:
    train_str = file.read()

# Train a tokenizer on the corpus.
tokenizer = rbpe.BpeTokenizer(train_str, 100)

# Encoding and then decoding should reproduce each input exactly.
for origin_text in origin_texts:
    ids = tokenizer.encode(origin_text)
    text = tokenizer.decode(ids)
    assert origin_text == text
    print(text)
````
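To illustrate what training, `encode`, and `decode` do conceptually, here is a minimal pure-Python sketch of byte-level BPE. It is not the rbpe implementation: operating on UTF-8 bytes, assigning new token ids from 256 upward, and the function names `train_bpe`, `merge`, `encode`, and `decode` are all assumptions made for this illustration.

```python
from collections import Counter


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges: list[tuple[int, int]] = []
    while len(merges) < num_merges:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        if pairs[best] < 2:
            break  # nothing left worth merging
        ids = merge(ids, best, 256 + len(merges))
        merges.append(best)
    return merges


def encode(text: str, merges: list[tuple[int, int]]) -> list[int]:
    """Apply the learned merges, in the order they were learned, to new text."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + rank)
    return ids


def decode(ids: list[int], merges: list[tuple[int, int]]) -> str:
    """Expand each token id back to its bytes and decode the result as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train_bpe("aaabdaaabac", 3)
ids = encode("aaabdaaabac", merges)
assert decode(ids, merges) == "aaabdaaabac"
```

Because this sketch starts from raw bytes, any UTF-8 string round-trips through `encode` and `decode`, including text that never appeared in the training data.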


## References

- https://en.wikipedia.org/wiki/Byte_pair_encoding
- https://github.com/lnx/bpe.git