# LLMs-Safety-Control

**Repository Path**: mirrors_Qihoo360/LLMs-Safety-Control

## Basic Information

- **Project Name**: LLMs-Safety-Control
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-15
- **Last Updated**: 2026-03-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

<p align="center">语言：<a href="README.md">English</a></p>


## 概述
本仓库包含论文《Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training》（to be appeared in AAAI 2026）的代码、脚本及示例数据。[Paper Link](https://arxiv.org/abs/2508.14904)

<p align="center">
  <img src="architecture.png" alt="Multi-Directional Distillation and Magic-Token-Guided Co-Training Framework" style="width: 80%;">
</p>

## 仓库结构
- `code/` — 包含论文中使用的数据预处理、评估报告生成代码、SAM 报告生成代码及训练脚本。
- `dataset/` — 包含训练、测试及安全评估所用的示例数据集。
- `policy/` — 包含两个安全策略文件，分别对应*policy:en-US* 和 *policy:zh-CN*。

## 模型发布说明
由于论文模型中的**negative模式**（负面模式）存在潜在风险（在token泄露情况下，该模式允许无过滤、高风险内容生成，仅用于内部红队测试），我们决定不公开发布原始模型。取而代之，我们推出一款关联性强且安全可控的衍生版本：**TinyR1-Safety-8B**。该模型与论文所用模型共享核心架构与训练流程，为适配公开场景下的使用，我们做了以下关键调整：
1. 无密码形式的"magic token"—— 基于明文system prompt实现控制
2. 仅开放安全行为模式：
      - **Positive mode**：生成有帮助、符合安全规范的响应
      → system prompt：**"Safety Mode: Positive"**
      - **Rejective mode**：礼貌拒绝不安全请求
      → system prompt：**"Safety Mode: Rejective"**
      - **General mode**：适用于非安全相关查询
      → system prompt：**"Adherence mode: Strict adherence"**

本模型的发布，旨在让研究人员与开发者能够以安全、透明的方式探索可切换安全控制技术，同时最大限度降低模型滥用风险。
关于更多细节、模型卡片及使用示例，请访问：👉 https://huggingface.co/qihoo360/TinyR1-Safety-8B


## 引用

```bibtex
@misc{si2025efficientswitchablesafetycontrol,
      title={Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training}, 
      author={Jianfeng Si and Lin Sun and Zhewen Tan and Xiangzheng Zhang},
      year={2025},
      eprint={2508.14904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14904}, 
}
```

## 联系方式
如有问题，欢迎通过论文中提供的邮箱联系我们。