# proteinsgm

**Repository Path**: bio-mirrors/proteinsgm

## README

## ProteinSGM: Score-based generative modeling for *de novo* protein design

This repository contains the codebase for [ProteinSGM: Score-based generative modeling for *de novo* protein design](https://www.biorxiv.org/content/10.1101/2022.07.13.499967v1).

### Installation
---

We recommend using the conda environment supplied in this repository.

1. Clone repository `git clone https://gitlab.com/mjslee0921/proteinsgm.git`
2. Navigate to folder `cd proteinsgm`
3. Install conda environment `conda env create -f env.yaml`
4. Activate conda environment `conda activate proteinsgm`

### Training
---

The CATH domain dataset used in the paper can be downloaded [here](ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/non-redundant-data-sets/cath-dataset-nonredundant-S40.pdb.tgz).

The specific CATH IDs in the train/test split can be found under the `data` folder in this repository.

Once the files have been extracted, set `data.dataset_path` in the configuration file to the directory containing the training data.

The training script can be found in `train.py`. For instance, to train an unconditional model (conditioned just on length), run the following command:

`python train.py ./configs/cond_length.yml`

Conditional models can be trained by replacing `./configs/cond_length.yml` with `./configs/cond_length_inpainting.yml`.

### Inference
---

#### Unconditional generation

Unconditional generation of structures takes two steps: first sample 6D coordinates from the model, then run Rosetta on the sampled coordinates.

`sampling_6d.py` is used to sample 6D coordinates given a model checkpoint. For instance,

`python sampling_6d.py ./configs/cond_length.yml ./checkpoints/cond_length.pth`

This will first sample random lengths. To generate a specific length, set `--select_length True --length_index 1`, which will generate a protein of length 40 (`length_index` is 1-indexed).

To generate structures from 6D coordinates, please refer to `sampling_rosetta.py`. For instance,

`python sampling_rosetta.py example/sample.pkl --index 1`

This will read the first sample (1-indexed), run <tt>MinMover</tt>, <tt>FastDesign</tt>, and <tt>FastRelax</tt>, and save all intermediate structures.

You can exclude any of the Rosetta minimization steps (except <tt>MinMover</tt>) with the flags `--fastrelax False --fastdesign False`.

We recommend using a GPU for 6D coordinate sampling and CPU batch processing for Rosetta, since Rosetta protocols can only use a single core per job. We used a single NVIDIA V100 for training and inference, and 2 CPU cores with 8 GB RAM per Rosetta job.
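Because each Rosetta job handles one sample, batch processing comes down to issuing one `sampling_rosetta.py` command per sample index. The sketch below only builds those command lines from the flags documented above. The helper name `build_rosetta_commands` and the `n_samples` argument are illustrative assumptions and not part of this repository, and how the jobs are submitted (scheduler, job array, local pool) is left to your environment.

```python
from typing import List


def build_rosetta_commands(
    pkl_path: str, n_samples: int, min_only: bool = False
) -> List[List[str]]:
    """Build one argv list per sample for sampling_rosetta.py.

    Hypothetical helper: n_samples must be supplied by the caller.
    Indices start at 1 to match the --index convention in this README.
    """
    commands = []
    for index in range(1, n_samples + 1):
        cmd = ["python", "sampling_rosetta.py", pkl_path, "--index", str(index)]
        if min_only:
            # Skip FastRelax and FastDesign so that only MinMover runs.
            cmd += ["--fastrelax", "False", "--fastdesign", "False"]
        commands.append(cmd)
    return commands
```

Each returned list can be passed to `subprocess.run` or joined with spaces and written into a job script, one job per sample.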

#### Conditional generation

Conditional generation additionally requires an input backbone and sequence, and a residue index range for which inpainting is desired.

To generate conditional samples, you must provide the chain id and residue indices of the regions to be masked during inference. For instance, to mask residues 10-30 and 35 in chain A, you can run the following command:

`python sampling_6d.py ./configs/cond_length_inpainting.yml ./checkpoints/cond_length_inpainting.pth --pdb example/2kl8.pdb --chain A --mask_info 10:30,35`

Note that we follow PDB conventions of residue indexing, where the first residue corresponds to index 1.
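To make the `--mask_info` syntax concrete, the sketch below expands a specification such as `10:30,35` into the list of residue indices it denotes, treating `start:end` as an inclusive range as in the example above. This is an illustrative reading of the format, not the parser used by `sampling_6d.py`, and the function name `parse_mask_info` is an assumption.

```python
from typing import List


def parse_mask_info(spec: str) -> List[int]:
    """Expand a spec like '10:30,35' into sorted residue indices.

    Illustrative only: 'start:end' is read as an inclusive range and a
    bare number as a single residue, using PDB-style (1-based) indices.
    """
    residues = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if ":" in token:
            start_text, end_text = token.split(":", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range start exceeds end in {token!r}")
            residues.update(range(start, end + 1))
        else:
            residues.add(int(token))
    return sorted(residues)
```

Under this reading, `parse_mask_info("10:30,35")` covers residues 10 through 30 plus residue 35, for 22 masked positions in total.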

The generated 6D coordinates can then be minimized using Rosetta in the same manner as unconditional samples.

6D coordinate sampling should take ~1 minute per sample on a typical GPU, and Rosetta minimization should take at most 3 hours per iteration, depending on the size of the region selected for design. For more runtime information, please refer to Supplementary Figure 7 in the paper.

The Rosetta protocol saves all iterations and intermediate structures to subdirectories in `outPath`, including structures before FastDesign and before relaxation. The default number of iterations is 3, and the final minimized structure can be found under `outPath/.../best_run/final_structure.pdb`.
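After a batch of Rosetta runs, the final structures can be gathered by searching `outPath` for the `best_run/final_structure.pdb` pattern described above. The sketch below is a minimal example that makes no assumption about the intermediate directory names and simply searches recursively. The function name `find_final_structures` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_final_structures(out_path: str) -> List[Path]:
    """Return every best_run/final_structure.pdb found under out_path.

    The intermediate directory layout is not assumed; '**' matches any
    number of nested subdirectories (including none).
    """
    return sorted(Path(out_path).glob("**/best_run/final_structure.pdb"))
```

The returned paths can then be copied into a single folder or passed to downstream tools.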


#### ProteinMPNN and OmegaFold

ProteinMPNN can be found at https://github.com/dauparas/ProteinMPNN. Bulk inference for monomeric structures is done using the script found in `examples/submit_example_1.sh`.

OmegaFold can be found at https://github.com/HeliXonProtein/OmegaFold. Since OmegaFold takes FASTA files as input, we recommend concatenating ProteinMPNN sequences into a single FASTA file for bulk inference.
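One way to do the concatenation is sketched below. It merges every FASTA file in a directory into a single output file and copies headers and sequences unchanged. The function name `concatenate_fasta` and the default `*.fa` file pattern are assumptions; adjust the pattern to match the files your ProteinMPNN run produced.

```python
from pathlib import Path


def concatenate_fasta(input_dir: str, output_file: str, pattern: str = "*.fa") -> int:
    """Merge FASTA files matching pattern into output_file.

    Returns the number of records (header lines) written. Blank lines
    are dropped; all other lines are copied as-is.
    """
    output = Path(output_file)
    n_records = 0
    with output.open("w") as out:
        for fasta in sorted(Path(input_dir).glob(pattern)):
            if fasta.resolve() == output.resolve():
                continue  # never read the file we are writing
            for line in fasta.read_text().splitlines():
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    n_records += 1
                out.write(line + "\n")
    return n_records
```

If headers repeat across input files, consider renaming them before folding so that each prediction can be traced back to its source structure.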

Note that we use <tt>MinMover</tt> minimized structures as input to ProteinMPNN. With the flags `--fastrelax False --fastdesign False`,
the entire pipeline (6D sampling - Rosetta <tt>MinMover</tt> - ProteinMPNN - OmegaFold) should take ~2-3 minutes per sample.

### Reference
---
Lee, Jin Sub, Jisun Kim, and Philip M. Kim. "ProteinSGM: Score-based generative modeling for de novo protein design." bioRxiv (2022). https://doi.org/10.1101/2022.07.13.499967