# Falcon-Perception

**Repository Path**: ogw0725/Falcon-Perception

## Basic Information

- **Project Name**: Falcon-Perception
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-20
- **Last Updated**: 2026-04-20

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Falcon Perception and Falcon OCR


# Falcon Perception

A minimal, readable, yet performant PyTorch inference engine implementation of **Falcon Perception**, a natively multimodal, dense, autoregressive Transformer model that performs **object detection**, **instance segmentation**, or **OCR** from natural language queries.

![falcon_inference](https://github.com/user-attachments/assets/1e45051b-ae05-4170-b326-d61d79a47ba2)

> *"Segment these expressions in the image: the cat on the left"* → bounding boxes + pixel-level masks
>
> *"Extract the text content from this image."* → text / latex formulas / html table ...

## Contents

- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Run Perception (detection / segmentation)](#run-perception-detection--segmentation)
  - [Run OCR (text extraction)](#run-ocr-text-extraction)
  - [Run Multiple Samples with Paged Inference Engine](#run-multiple-samples-with-paged-inference-engine)
  - [Interactive Notebooks](#interactive-notebooks)
- [Inference Engines](#inference-engines)
  - [PyTorch Inference Engines](#pytorch-inference-engines)
    - [Paged Inference Engine](#paged-inference-engine)
    - [Paged OCR Inference Engine](#paged-ocr-inference-engine)
    - [Batch Inference Engine](#batch-inference-engine)
  - [MLX Batch Inference Engine (Apple Silicon)](#mlx-batch-inference-engine-apple-silicon)
- [Inference Server](#inference-server)
  - [Launch Server](#launch-server)
  - [Launch Streamlit Demo App](#launch-streamlit-demo-app)
- [vLLM Docker Server (FalconOCR Only)](#vllm-docker-server-falconocr-only)
  - [Serving Throughput](#serving-throughput)
  - [Quick Start](#quick-start-1)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

## Quick Start

### Installation

The package supports two backends: **PyTorch** (CUDA GPUs) and **MLX** (Apple Silicon Macs). A bare `pip install` auto-detects your platform, or you can pick an explicit extra.

| Install command | Backend | When to use |
|---|---|---|
| `pip install -e .` | Auto-detect | Mac -> MLX, Linux -> Torch |
| `pip install -e ".[torch]"` | PyTorch + CUDA | GPU server or explicit Torch on Mac |
| `pip install -e ".[mlx]"` | MLX | Apple Silicon Mac |
| `pip install -e ".[ocr]"` | Torch + transformers | Layout-aware OCR (needs a layout detection model) |
| `pip install -e ".[dev]"` | -- | Adds tensorboard, matplotlib, ipykernel |
| `pip install -e ".[server]"` | -- | Adds FastAPI / Uvicorn for the paged inference server |

```bash
# Example: from source with uv (recommended)
uv sync   # installs torch with CUDA 12.8 wheels by default
source .venv/bin/activate
```

#### CUDA version

`uv sync` defaults to **CUDA 12.8** wheels for PyTorch (works with NVIDIA driver >= 570.x). If you need a different CUDA version, edit the `[[tool.uv.index]]` URL in `pyproject.toml`:

```toml
# CUDA 12.6 (driver >= 560.x)
url = "https://download.pytorch.org/whl/cu126"

# CUDA 13.0 (driver >= 580.x), or remove [tool.uv.sources] entirely
url = "https://download.pytorch.org/whl/cu130"
```

Then re-run `uv lock && uv sync`.

If installing with **pip** instead of uv, install PyTorch first with the correct CUDA version:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e .
```

> **Note:** The MLX backend does not require PyTorch or transformers at all.
> The core data pipeline and tokenizer run on numpy/PIL and the lightweight
> [tokenizers](https://github.com/huggingface/tokenizers) library.

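To check which backend a bare `pip install -e .` would likely select on a given machine, the rule from the install table (Mac -> MLX, Linux -> Torch) can be approximated in Python. This is an illustrative sketch only: `pick_backend` is a hypothetical helper, and requiring an `arm64` CPU is an assumption drawn from the table's "Apple Silicon Mac" row, not from the package's actual install logic.

```python
import platform


def pick_backend(system: str, machine: str) -> str:
    """Return "mlx" on Apple Silicon macOS, otherwise "torch".

    Hypothetical helper mirroring the install table; the real package
    may decide differently (for example via dependency markers).
    """
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    return "torch"


if __name__ == "__main__":
    # platform.system() reports the OS name, platform.machine() the CPU architecture.
    print(pick_backend(platform.system(), platform.machine()))
```

If the printed backend is not the one you want, the explicit extras in the table (`".[torch]"` or `".[mlx]"`) override the auto-detection.
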
### Run Perception (detection / segmentation)

**PyTorch (GPU)**

```bash
# Auto-downloads the model and streams a sample image from Hugging Face
python demo/perception_single.py

# Custom image and query; the image can be a path or a URL
python demo/perception_single.py --image photo.jpg --query "cat"

# Detection only (no masks)
python demo/perception_single.py --image photo.jpg --query "cat" --task detection
```

**MLX (Apple Silicon)**

```bash
python demo/perception_single_mlx.py --image photo.jpg --query "cat"

# Detection only
python demo/perception_single_mlx.py --image photo.jpg --query "cat" --task detection
```

### Run OCR (text extraction)

**PyTorch (GPU)**

```bash
# Auto-downloads the model and streams a sample image from Hugging Face
python demo/ocr_single.py

# Custom document image
python demo/ocr_single.py --image document.png

# Layout-aware OCR (detects regions first, then extracts text per region)
# Requires the [ocr] extra: pip install -e ".[ocr]"
# This will lazily download and run a third-party layout detection model,
# PaddlePaddle/PP-DocLayoutV3_safetensors, from Hugging Face
python demo/ocr_single.py --image document.png --task ocr_layout
```

**MLX (Apple Silicon)**

```bash
python demo/ocr_single_mlx.py --image document.png
python demo/ocr_single_mlx.py   # loads a demo sample from OCRBench-v2
```

**OCR modes**

| Mode | Best for | How |
|------|----------|-----|
| **Plain OCR** | Simple documents, real-world photos, slides, receipts, invoices | `--task ocr_plain` |
| **Layout + OCR** | Complex multi-column documents, academic papers, reports, dense pages | `--task ocr_layout` |

### Run Multiple Samples with Paged Inference Engine

```bash
# Perception: PBench dataset
python demo/perception_benchmark.py               # stream 50 samples from HF
python demo/perception_benchmark.py --limit -1    # stream entire benchmark

# OCR: OCRBench-v2 dataset
python demo/ocr_benchmark.py
python demo/ocr_benchmark.py --limit 200
```

### Interactive Notebooks

Step-by-step walkthroughs with inline visualizations:

| Notebook | Description | Colab |
|---|---|---|
| [`demo/perception.ipynb`](demo/perception.ipynb) | **Falcon Perception**: engine setup, detection vs segmentation, HR cache, dense tuning, PBench level showcase, and benchmark with IoU metrics | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Jy6lRYuGMKJEt9--KBLm6kjbC-vQ0Uwe) |
| [`demo/ocr.ipynb`](demo/ocr.ipynb) | **Falcon OCR**: full-page and layout-based OCR across handwriting, formulas, tables, scanned documents, and scientific papers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HYOcYWqmeUT9j2h6_dNM2lX1mlI34Bfa) |
| [`demo/perception_agent.ipynb`](demo/perception_agent.ipynb) | **Perception Agent**: grounded visual reasoning agent using Falcon Perception as a tool with an orchestrator VLM for multi-step scene understanding | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wg4EbNDKllAxoK1v5gbdRzkrKjdjIwbj) |
| [`demo/perception_ov_mot.ipynb`](demo/perception_ov_mot.ipynb) | **Open-Vocab Multi-Object Tracking**: video object tracking pipeline using Falcon Perception in detection and segmentation modes | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13Hiei8As2JbiZuYllXQTNX_ZelysZSpH) |

## Inference Engines

### PyTorch Inference Engines

**We use FlexAttention for both inference engines and training.** The hybrid attention mask (bidirectional image + causal text) is expressed as composable mask functions. FlexAttention's `mask_mod` also makes it easy to implement continuous batching with paged attention in plain Python. PyTorch's `flex_attention` compiles these mask functions into a fused Triton kernel, so no custom attention code is needed.

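As a rough illustration of the hybrid mask described above, the sketch below composes a causal rule with a bidirectional rule for image tokens. It follows the `(b, h, q_idx, kv_idx)` signature that FlexAttention mask functions use, but is evaluated here with plain Python integers so it runs without a GPU. The assumption that image tokens occupy one contiguous span `[img_start, img_end)` and the names `make_hybrid_mask` and `dense_mask` are illustrative, not taken from this repository.

```python
def make_hybrid_mask(img_start: int, img_end: int):
    """Build a mask function: causal everywhere, bidirectional inside the image span."""

    def causal(b, h, q_idx, kv_idx):
        # A query may attend to itself and to earlier positions.
        return q_idx >= kv_idx

    def image_bidirectional(b, h, q_idx, kv_idx):
        # Image tokens attend to every other image token, in both directions.
        q_in = (q_idx >= img_start) & (q_idx < img_end)
        kv_in = (kv_idx >= img_start) & (kv_idx < img_end)
        return q_in & kv_in

    def hybrid(b, h, q_idx, kv_idx):
        # Composition is a plain OR of the two rules.
        return causal(b, h, q_idx, kv_idx) | image_bidirectional(b, h, q_idx, kv_idx)

    return hybrid


def dense_mask(mask_fn, seq_len: int):
    """Materialize the mask as a list of rows for inspection (debugging only)."""
    return [[bool(mask_fn(0, 0, q, kv)) for kv in range(seq_len)] for q in range(seq_len)]


# One text token, three image tokens (positions 1..3), then two text tokens.
mask = dense_mask(make_hybrid_mask(img_start=1, img_end=4), seq_len=6)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the function uses only comparisons and `&` / `|`, the same callable should also accept tensor indices, for example when handed to `torch.nn.attention.flex_attention.create_block_mask`; that usage is not exercised here.
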
#### Paged Inference Engine

A performant engine with CUDA graphs and continuous batching via a paged KV cache:

- **Paged KV cache** with virtual page tables (no wasted memory from padding)
- **Continuous batching**: new sequences enter mid-generation, finished ones release pages immediately
- **Torch compile**: piecewise compilation of the regions outside the FlexAttention kernel
- **CUDA graph capture** for the decode loop (eliminates kernel launch overhead, important for small models)
- **Background tokenization**: CPU thread pool overlapped with GPU compute
- **Preemption**: if memory is tight, running sequences can be paused and re-prefilled later
- **High-resolution image feature cache** (for segmentation): LRU cache with pre-allocated pinned memory buffers for async GPU↔CPU transfer of high-resolution image features. This reduces prefill time for subsequent queries on the same image.

See `demo/perception_single.py` and `demo/perception_benchmark.py` for how to instantiate and use the engine directly.

> **NOTE**: The first run takes longer (~10-30 s) because of `torch.compile` and CUDA graph capture. Subsequent runs are faster: about 100 ms for prefill, 200 ms for upsampling (0 ms if cached), and 50 ms to decode a couple of instances (~10 tokens), measured on an H100.

#### Paged OCR Inference Engine

Extends the Perception paged engine for document understanding:

- **Layout detection**: runs a lightweight detector to find text regions, tables, figures, and headers
- **Per-region OCR**: crops each region and runs OCR inference with a category-specific prompt
- **Continuous batching and gather**: all crops of an image are sent to the engine for continuously batched extraction. Once all crops are complete, the outputs are gathered and assembled into a structured result.

See `demo/ocr_single.py` and `demo/ocr_benchmark.py` for how to instantiate and use the engine directly.

> **NOTE**: The first `ocr_layout` run will lazily download and run the document layout detection model.

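The paged KV cache used by both paged engines above can be pictured as a pool of fixed-size pages plus a per-sequence page table. The toy allocator below shows only the bookkeeping (allocate on demand, release on finish, report failure so the caller can preempt). The class name `PagePool`, the page size, and the method names are hypothetical and do not mirror the repository's implementation.

```python
class PagePool:
    """Toy page-table bookkeeping for a paged KV cache (no tensors involved)."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids not in use
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> number of tokens stored

    def append_tokens(self, seq_id: int, num_tokens: int) -> bool:
        """Reserve room for new tokens; return False if the pool is exhausted."""
        table = self.page_tables.get(seq_id, [])
        new_length = self.lengths.get(seq_id, 0) + num_tokens
        pages_required = -(-new_length // self.page_size)  # ceiling division
        needed = pages_required - len(table)
        if needed > len(self.free_pages):
            return False  # caller could preempt a sequence and retry later
        self.page_tables[seq_id] = table + [self.free_pages.pop() for _ in range(needed)]
        self.lengths[seq_id] = new_length
        return True

    def release(self, seq_id: int) -> None:
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


pool = PagePool(num_pages=4, page_size=16)
pool.append_tokens(seq_id=0, num_tokens=20)          # prefill: takes 2 pages
pool.append_tokens(seq_id=1, num_tokens=30)          # prefill: takes 2 pages
print(pool.append_tokens(seq_id=0, num_tokens=13))   # False: a 3rd page is needed, none free
pool.release(seq_id=1)                               # finished sequence frees its pages
print(pool.append_tokens(seq_id=0, num_tokens=13))   # True: a freed page is reused
```

A page is claimed only when a sequence crosses a page boundary, so a sequence wastes at most `page_size - 1` token slots, whatever the lengths of the other sequences in the batch.
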
#### Batch Inference Engine

The simplest engine and the closest to the training code path, which makes the model's forward pass easier to understand without all the optimizations. All sequences are left-padded to the same length (with correct RoPE indices and attention mask); the engine runs a single prefill, then decodes token by token with a dense KV cache until all sequences are complete.

See the `demo/perception_single.py --engine-type batch` path for usage.

### MLX Batch Inference Engine (Apple Silicon)

The MLX backend provides batch inference on Apple Silicon Macs using the [MLX](https://github.com/ml-explore/mlx) framework. It shares the same model architecture and weights (auto-converted from safetensors on first load) and produces equivalent results.

- Dense KV cache, left-padded batch inference
- `mx.fast.scaled_dot_product_attention` with native sink support
- Tiled windowed cross-attention in the AnyUp upsampler for memory efficiency
- No PyTorch or transformers dependency

See `demo/perception_single_mlx.py` for usage.

## Inference Server

The server exposes the continuous-batching Paged Inference Engine through a REST API across multiple GPUs.

### Launch Server

```bash
# Install server and streamlit demo dependencies
uv sync --extra server --extra demo

# Auto-detects all available GPUs, compiles model, captures CUDA graphs
python -m falcon_perception.server

# Explicit config
python -m falcon_perception.server --config.num-gpus 2 --config.port 7680

# Or with the OCR model
python -m falcon_perception.server --config.hf-model-id tiiuae/Falcon-OCR --config.port 7681
```

The server starts one engine worker per GPU, each in a separate process (i.e. data parallelism). Each worker builds its own model, runs `torch.compile`, and captures CUDA graphs for the decode loop. Workers communicate with the main FastAPI process via `multiprocessing.Queue`, and the server assigns each new request to the worker with the fewest queued requests.

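The dispatch rule described above (send each new request to the worker with the fewest queued requests) can be sketched with the standard library alone. The `Dispatcher` class and its method names are hypothetical. The real server talks to worker processes through `multiprocessing.Queue`, which this single-process sketch replaces with plain counters.

```python
class Dispatcher:
    """Toy least-loaded router: tracks pending request counts per worker."""

    def __init__(self, num_workers: int):
        self.pending = [0] * num_workers

    def assign(self) -> int:
        """Pick the worker with the fewest pending requests (lowest index wins ties)."""
        worker = min(range(len(self.pending)), key=self.pending.__getitem__)
        self.pending[worker] += 1
        return worker

    def complete(self, worker: int) -> None:
        """Mark one request on the given worker as finished."""
        self.pending[worker] -= 1


dispatcher = Dispatcher(num_workers=2)
first = dispatcher.assign()    # both workers idle, so worker 0
second = dispatcher.assign()   # worker 1 is now the least loaded
dispatcher.complete(first)     # worker 0 finishes its request
third = dispatcher.assign()    # worker 0 is least loaded again
print(first, second, third)
```

Counting queued requests keeps the router cheap, but it treats a large segmentation request the same as a short detection request, so it only approximates the actual GPU load.
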
See [`server/README.md`](falcon_perception/server/README.md) for detailed usage.

### Launch Streamlit Demo App

A browser-based demo UI that connects to the inference server.

```bash
# With a server already launched in a separate terminal, launch the Streamlit app
streamlit run demo/streamlit_app.py
```

The app provides:

- Image upload and URL input
- Automatic model detection: the task selector shows only tasks supported by the loaded model (segmentation/detection for Falcon-Perception, OCR plain/layout for Falcon-OCR)
- Interactive mask and bounding box overlay visualization
- Adjustable image resolution settings
- Per-request timing breakdown

## vLLM Docker Server (FalconOCR Only)

For FalconOCR without the extra perception heads, we also provide a vLLM Docker server for fast deployment.

### Serving Throughput

Measured on a single A100-80GB GPU with vLLM, processing document images from olmOCR-Bench at very high concurrency for optimal utilisation of vLLM.

- **Layout + OCR**: the full end-to-end pipeline. Layout detection finds regions on each page, crops them, and the VLM runs on every crop. This is the real-world serving number that includes both layout and OCR time.

| Mode | tok/s | img/s | Description |
|------|------:|------:|-------------|
| **Layout + OCR** | 5,825 | 2.9 | Full pipeline: layout detection → crop → per-region OCR |

At 0.3B parameters, Falcon OCR is roughly 3x smaller than 0.9B-class OCR VLMs (e.g. PaddleOCR VL), which translates directly into higher serving throughput at competitive accuracy.

### Quick Start

For production OCR serving, we also provide a Docker image (~6,000 tok/s on a single A100) that bundles vLLM with an end-to-end parsing pipeline. It exposes two services: **vLLM** (port 8000, OpenAI-compatible API) and a **Pipeline** (port 5002, layout → crop → OCR → markdown).

```bash
# Two GPUs (best throughput): vLLM on one, layout model on the other
docker run -d --name falcon-ocr \
  --gpus '"device=0,1"' \
  -e EXPOSED_GPU_IDS=0,1 -e VLLM_GPU=0 -e PIPELINE_GPU=1 \
  -e VLLM_GPU_MEM_UTIL=0.90 \
  -p 8000:8000 -p 5002:5002 \
  ghcr.io/tiiuae/falcon-ocr:latest

# Single GPU (shared memory: lower VLLM_GPU_MEM_UTIL to leave room for the layout model)
docker run -d --name falcon-ocr \
  --gpus '"device=0"' \
  -e EXPOSED_GPU_IDS=0 -e VLLM_GPU=0 -e PIPELINE_GPU=0 \
  -e VLLM_GPU_MEM_UTIL=0.55 -e MAX_NUM_SEQS=512 \
  -p 8000:8000 -p 5002:5002 \
  ghcr.io/tiiuae/falcon-ocr:latest
```

```bash
# Upload (images or PDFs)
curl -X POST http://localhost:5002/falconocr/upload \
  -F "files=@document.pdf;type=application/pdf"

# Parse with layout detection
curl -X POST http://localhost:5002/falconocr/parse \
  -H "Content-Type: application/json" \
  -d '{"images": ["data:image/jpeg;base64,<...>"], "skip_layout": false}'

# Parse without layout (full image → VLM directly)
curl -X POST http://localhost:5002/falconocr/parse \
  -H "Content-Type: application/json" \
  -d '{"images": ["data:image/jpeg;base64,<...>"], "skip_layout": true}'
```

See the [Falcon-OCR model card](https://huggingface.co/tiiuae/Falcon-OCR) for the full API reference, configuration variables, and deployment options.

## Citation

If you use Falcon-Perception, please cite:

```
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
```

## Acknowledgments

This work is based on, builds upon, and is inspired by the following excellent projects:

- https://github.com/pytorch/torchtitan
- https://github.com/karpathy/nanochat
- https://github.com/meta-pytorch/attention-gym
- https://github.com/changjonathanc/flex-nano-vllm
- https://github.com/vikhyat/moondream
- https://github.com/wimmerth/anyup
- https://github.com/roboflow/trackers