# rbpe

**Repository Path**: MoleSir/rbpe

## Basic Information

- **Project Name**: rbpe
- **Description**: A simple Byte Pair Encoding written in Rust.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-19
- **Last Updated**: 2025-03-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Tokenize, Rust

## README

# Rust Byte Pair Encoding

A simple Byte Pair Encoding (BPE) tokenizer written in Rust.

## Build

````bash
cargo build
````

A simple test is provided in `examples/test.rs`. Run it to test this BPE implementation:

````bash
cargo run --example test
````

## Build Python Module

Install maturin:

````bash
pip install maturin
````

Use maturin to build the `.whl` file:

````bash
maturin build
````

The `.whl` file is generated in `./target/wheels`. Install it with pip (the exact wheel filename depends on your Python version and platform):

````bash
pip install target/wheels/rbpe-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl --force-reinstall
````

Test the Python module with `examples/test.py`:

````python
import rbpe

# Sample texts covering English, Chinese, Japanese, and emoji
origin_texts = [
    "Learn about language model tokenization",
    "OpenAI's large language models process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens",
    "春风拂面花满枝,流水潺潺鸟语时。山川如画心自远,岁月静好梦相依。",
    "「今日も一日頑張りましょう!💪✨🌞」",
]

# Load the training corpus
with open('./data/train.txt', 'r', encoding='utf-8') as file:
    train_str = file.read()

# Train a tokenizer on the corpus
tokenizer = rbpe.BpeTokenizer(train_str, 100)

# Encoding followed by decoding must reproduce the original text
for origin_text in origin_texts:
    ids = tokenizer.encode(origin_text)
    text = tokenizer.decode(ids)
    assert origin_text == text
    print(text)
````

# Reference

- https://en.wikipedia.org/wiki/Byte_pair_encoding
- https://github.com/lnx/bpe.git
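
For readers new to the algorithm, the round-trip check in examples/test.py (encode, decode, compare) can be illustrated with a minimal byte-level BPE sketch in pure Python. This is not rbpe's implementation: the function names `train_bpe`, `merge_pair`, `encode`, and `decode` are hypothetical, working on UTF-8 bytes is an assumption, and so is treating the second argument as a number of merges, since this README does not document what `100` means in `rbpe.BpeTokenizer(train_str, 100)`.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`.

    Assumption: ids 0..255 are raw bytes; merged tokens get ids 256, 257, ...
    """
    ids = list(text.encode("utf-8"))
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        new_id = 256 + len(merges)
        merges.append(pair)
        ids = merge_pair(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Turn text into token ids by replaying the merges in learned order."""
    ids = list(text.encode("utf-8"))
    for index, pair in enumerate(merges):
        ids = merge_pair(ids, pair, 256 + index)
    return ids


def decode(ids, merges):
    """Expand token ids back into bytes, then decode the bytes as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for index, (left, right) in enumerate(merges):
        vocab[256 + index] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


if __name__ == "__main__":
    merges = train_bpe("abababab cdcdcd", 10)
    sample = "abab 春风 💪"
    # The round trip holds even for text that never appeared in training,
    # because unknown characters fall back to their raw bytes.
    assert decode(encode(sample, merges), merges) == sample
```

Because every byte value is already in the base vocabulary, any valid UTF-8 string can be encoded, which is why a round-trip assertion like the one in examples/test.py can pass for Chinese, Japanese, and emoji input regardless of the training corpus.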