# DeepRead **Repository Path**: woshilu272/DeepRead ## Basic Information - **Project Name**: DeepRead - **Description**: No description available - **Primary Language**: Python - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-04-15 - **Last Updated**: 2026-04-15 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search

license Ask DeepWiki

main fig
DeepRead is a document-structure-aware RAG Agent. This repo already includes the core parsing, indexing, retrieval, and agent runtime used in the demo. ## News - **2026.3.16** 🔥 DeepRead has been featured in [New Intelligence (新智元)](https://mp.weixin.qq.com/s/BhvUQgREp4NOvb6axiWXiQ)! ## Repository Layout - `Code/DeepRead.py`: agent runtime + retrieval + tool calls. - `Code/parser_pdf.py`: PDF -> OCR (PaddleOCRVL) -> merged Markdown/JSON -> corpus; optional embeddings. - `Code/paddleocr.sh`: Docker-based PaddleOCRVL vLLM server runner. - `Demo/TradingAgent/`: demo corpus + embeddings (with images). - `Demo/金山办公2023年报/`: demo corpus + embeddings. ## Quickstart ### 0) Set API Environment Variables Set these before running `DeepRead.py` (adjust to your provider): ```bash # LLM export OPENROUTER_API_KEY="" export OPENROUTER_BASE_URL="https://api.openai.com/v1" export OPENROUTER_MODEL="gpt-4o" # Optional: Embedding service export EMBED_API_KEY="" export EMBED_BASE_URL="http://127.0.0.1:8756/v1" export EMBEDDING_MODEL="Qwen/Qwen3-Embedding-8B" # Optional: Reranker service (for semantic retrieval) export RERANK_API_KEY="" export RERANK_BASE_URL="https://api.siliconflow.cn/v1" export RERANK_MODEL="Qwen/Qwen3-Reranker-8B" ``` ### 1) Start PaddleOCRVL server for PDF OCR We use the official [PaddleOCRVL](https://huggingface.co/PaddlePaddle/PaddleOCR-VL) Docker image published by PaddlePaddle, and it based on VLLM. The launcher is provided in `Code/paddleocr.sh`. Run: ```bash bash Code/paddleocr.sh ``` By default it exposes `http://127.0.0.1:8956/v1`, and our PDF parsing code will call this port by default. ### 2) PDF -> Corpus (Structure-Aware) ```bash python Code/parser_pdf.py \ --input /path/to/your.pdf \ --output /path/to/output_dir ``` Optional embeddings (requires an embedding API server): ```bash python Code/parser_pdf.py \ --input /path/to/your.pdf \ --output /path/to/output_dir \ --build-embeddings \ --embedding-model Qwen/Qwen3-Embedding-8B \ --embed-base-url http://127.0.0.1:8756/v1 \ --embed-api-key ``` This produces: - `*_corpus.json` (structured nodes) - `*_emb.npy` + `*_idmap.json` (optional vector store) ### 3) Ask Questions with DeepRead Run: ```bash python Code/DeepRead.py \ --doc /path/to/output_dir/your_corpus.json \ --question "What is your question?" \ --enable-semantic \ --neighbor-window 1,-1 \ --log run_log.jsonl ``` ### 4) Recommended Retrieval Modes Choose a retrieval mode based on your service availability: - **No Embedding API**: use BM25 (available by default, no extra flags) ```bash python Code/DeepRead.py --doc /path/to/your_corpus.json --question "..." --log run_log.jsonl ``` - **Embedding API only (no reranker)**: use Vector retrieval ```bash python Code/DeepRead.py \ --doc /path/to/your_corpus.json \ --question "..." \ --enable-vector \ --disable-bm25 \ --disable-regex \ --log run_log.jsonl ``` - **Embedding + Reranker**: use Semantic retrieval (vector recall + rerank) ```bash python Code/DeepRead.py \ --doc /path/to/your_corpus.json \ --question "..." \ --enable-semantic \ --log run_log.jsonl ``` ## Demo ### Demo 1: TradingAgent (multimodal + embeddings) ```bash python Code/DeepRead.py \ --doc "Demo/TradingAgent/TradingAgent_corpus.json" \ --question "Which roles are included in the overall TradingAgents framework?" \ --enable-semantic \ --enable-multimodal \ --log demo_trading.jsonl ``` ### Demo 2: 金山办公23年年报 ```bash python Code/DeepRead.py \ --doc "Demo/金山办公2023年报/11724-金山办公:金山办公2023年年度报告_corpus.json" \ --question "公司有哪些累计投入金额超过一亿元的在研项目?" \ --enable-semantic \ --neighbor-window 0,0 \ --log demo_xx.jsonl ``` ## Full Usage ### DeepRead.py All options: ```bash python Code/DeepRead.py --help ``` Common flags: - Input/basics: `--doc`, `--question`, `--log`, `--max_rounds`, `--temperature` - Retrieval toggles: `--enable-vector`, `--enable-hybrid`, `--enable-semantic`, `--disable-bm25`, `--disable-regex`, `--disable-read` - Semantic retrieval: `--semantic-stage1` (vector/bm25/hybrid), `--semantic-topk1`, `--semantic-topk2` - Neighbor window: `--neighbor-window up,down` - Multimodal: `--enable-multimodal` - Embedding: `--embedding-model`, `--embed-base-url`, `--embed-api-key` - Rerank: `--rerank-api-key`, `--rerank-base-url`, `--rerank-model` ### parser_pdf.py All options: ```bash python Code/parser_pdf.py --help ``` Common flags: - Input/output: `--input` (PDF), `--output` - OCR server: `--paddle-vl-rec-backend`, `--paddle-vl-rec-server-url` - Embedding: `--build-embeddings`, `--embedding-model`, `--embedding-batch-size`, `--embed-base-url`, `--embed-api-key` ## Configuration Reference DeepRead reads from environment variables and CLI flags: - LLM: `OPENROUTER_API_KEY`, `OPENROUTER_BASE_URL`, `OPENROUTER_MODEL` - Embedding: `EMBED_API_KEY`, `EMBED_BASE_URL`, `EMBEDDING_MODEL` - Rerank (optional): `RERANK_API_KEY`, `RERANK_BASE_URL`, `RERANK_MODEL` - Retrieval: `--enable-vector`, `--enable-hybrid`, `--enable-semantic`, `--disable-bm25`, `--disable-regex`, `--disable-read` - Neighbor window: `--neighbor-window up,down` (e.g. `1,-1`, `0,0` disables) ## Notes - `parser_pdf.py` currently accepts PDF only. - OCR requires `paddleocr` and `PaddleOCRVL` (or run the provided Docker server). - `tiktoken` is optional; if missing, token counting falls back to a simple tokenizer. ## Related Work Related Outstanding Work: [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [PageIndex](https://github.com/VectifyAI/PageIndex) ## Citation If DeepRead is helpful for you, please cite us. ``` @article{li2026deepread, title={DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search}, author={Li, Zhanli and Tian, Huiwen and Luo, Lvzhou and Cao, Yixuan and Luo, Ping}, journal={arXiv preprint arXiv:2602.05014}, year={2026} } ``` ## License See [LICENSE](DeepRead/LICENSE).