
# GigaDatasets

A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization

| Quick Start | Contributing | License | Citation |

## ✨ Introduction

GigaDatasets is a unified and lightweight framework for data curation, evaluation, and visualization, designed to make handling massive datasets simple, efficient, and consistent.
Major features:

- 🔍 **Unified Workflow**: Unify all steps from data curation and packaging to loading, evaluation, and visualization.
- ⚡ **Lightweight and Easy to Use**: Simple pip or source install (`pip3 install giga-datasets`), one line of code for data loading (`dataset = load_dataset(data_path)`), and one line of code for data evaluation (`eval_results = FIDEvaluator(datasets)(pred_results)`).
- 🗂️ **Multi-format and Multi-structure Data Support**: File, LMDB, Pickle, and LeRobot datasets with flexible loading; unified support for images, videos, 2D/3D boxes, 2D/3D points, and other structured data.
- 🚀 **Efficient Processing**: Optimized for speed and memory, suitable for large-scale data processing needs.
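
As a quick taste of how these pieces compose, here is a minimal sketch chaining one-line loading with one-line evaluation. Only `load_dataset` is a confirmed entry point; the import path of `FIDEvaluator` and the structure of `pred_results` are assumptions for illustration, so they are left commented out (see the Usage section below for runnable demos):

```python
# Sketch only: FIDEvaluator's import path and the expected structure of
# pred_results are assumptions here; adjust them to the actual API.
from giga_datasets import load_dataset

dataset = load_dataset('/path/to/your/dataset')  # one line to load
print('Loaded', len(dataset), 'samples')

# from giga_datasets import FIDEvaluator                # hypothetical import path
# eval_results = FIDEvaluator([dataset])(pred_results)  # one line to evaluate
```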
## ⚡ Installation

GigaDatasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

```bash
pip3 install giga-datasets
```

Or you can install directly from source for the latest updates:

```bash
conda create -n giga_datasets python=3.11.10
conda activate giga_datasets
git clone https://github.com/open-gigaai/giga-datasets.git
cd giga-datasets
pip3 install -e .
```

## 🚀 Usage

We provide accessible demo data and Jupyter notebooks in [getting_started](getting_started). Utility scripts can be found in the [scripts](scripts) folder.

### 1. Load a dataset

The simplest way to load a dataset is the `load_dataset` function from the `giga_datasets` library. We provide a demo dataset in the `giga_data` directory for you to try out. Here is a quick example; the full code is available [here](getting_started/load_dataset.py):

```python
from giga_datasets import load_dataset

dataset = load_dataset('./getting_started/giga_data')
data_dict = dataset[0]
print('Dataset size:', len(dataset))
print('First item in dataset:', data_dict)
```

The `giga_data` directory contains the following structure:

```
giga_data/
├── config.json        # Configuration file describing the dataset
├── labels/            # Directory containing label files
│   ├── config.json    # Additional configuration for labels
│   └── data.pkl       # Serialized label data
└── images/            # Directory containing image files
    ├── config.json    # Additional configuration for images
    ├── data.mdb       # LMDB data file for images
    └── lock.mdb       # LMDB lock file
```

The `config.json` file in the `giga_data` directory contains the following structure:

```json
{
    "_class_name": "Dataset",
    "config_paths": [
        "labels/config.json",
        "images/config.json"
    ]
}
```

This file specifies:

- `_class_name`: Indicates the class type used for the dataset, which is `Dataset` in this case.
- `config_paths`: Lists paths to additional configuration files for specific components of the dataset, such as `labels/config.json` and `images/config.json`.

### 2. Package a dataset

For an unstructured dataset, you can use the Writer classes (including `PklWriter`, `FileWriter`, and `LmdbWriter`) to package your data into a structured format. Below is an example of how to package a dataset consisting of images and labels.

The `raw_data` directory contains the following structure:

```
raw_data/
├── 0.json   # Annotation file for image 0
├── 0.png    # Image file 0
├── 1.json   # Annotation file for image 1
├── 1.png    # Image file 1
├── ...
```

You can run the following Python code to package the dataset; the full code is available [here](getting_started/pack_images.py):

```python
import json
import os

from tqdm import tqdm

# The exact import paths may differ; see getting_started/pack_images.py.
from giga_datasets import Dataset, LmdbWriter, PklWriter, load_dataset, utils

image_dir = 'raw_data'      # directory containing the raw images and annotations
save_dir = './giga_packed'  # output directory for the packaged dataset

# Collect all image paths under the raw data directory.
image_paths = utils.list_dir(image_dir, recursive=True, exts=['.png', '.jpg', '.jpeg'])
label_writer = PklWriter(os.path.join(save_dir, 'labels'))
image_writer = LmdbWriter(os.path.join(save_dir, 'images'))
for idx in tqdm(range(len(image_paths))):
    # Each image has a JSON annotation file with the same stem.
    label_path = image_paths[idx].replace('.png', '.json')
    label_dict = json.load(open(label_path))
    label_dict['data_index'] = idx
    label_writer.write_dict(label_dict)
    image_writer.write_image(idx, image_paths[idx])
label_writer.write_config()
image_writer.write_config()
label_writer.close()
image_writer.close()
# Combine the packaged label and image components into a single dataset.
label_dataset = load_dataset(os.path.join(save_dir, 'labels'))
image_dataset = load_dataset(os.path.join(save_dir, 'images'))
dataset = Dataset([label_dataset, image_dataset])
dataset.save(save_dir)
```

We support packaging and reading different data formats.
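
To sanity-check the packaged output, you can load it back with the same one-line API and inspect a sample. A minimal sketch, assuming the packaging example above was run with `save_dir = './giga_packed'`; the exact keys of the returned dict depend on the writers used, so the field names here are illustrative:

```python
from giga_datasets import load_dataset

# Load the combined dataset saved by dataset.save(save_dir) above.
dataset = load_dataset('./giga_packed')
print('Dataset size:', len(dataset))

# Each sample merges the label fields (from PklWriter) with the image
# (from LmdbWriter); 'data_index' was written by the packaging loop above.
sample = dataset[0]
print('Fields:', list(sample.keys()))
print('Index of first sample:', sample['data_index'])
```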
In addition to packaging images, we also provide an [example](getting_started/pack_videos.py) of packaging video data, where we store each video's metadata.

```bash
# Package video samples from the input directory into the output directory.
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos

# Package videos into LMDB format for better read performance.
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --pack-lmdb

# Package samples without copying the video files, storing only the metadata and absolute paths.
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --only_path
```

### 3. Add a new field

In model training or inference, a sample is often represented as a dictionary with multiple fields. Our framework is designed to be easily extensible to accommodate new data fields. Below is an example of how to add Canny edge maps as a new field:

```bash
python getting_started/add_new_filed.py --data_dir getting_started/giga_data
```

### Additional Usage Examples

> **Note:** More usage examples and feature documentation will be added in future updates. Stay tuned!

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## 📖 Citation

```bibtex
@misc{gigaai2025gigadatasets,
  author = {GigaAI},
  title = {GigaDatasets: A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/open-gigaai/giga-datasets}}
}
```