graph LR
    Data_Pipeline["Data Pipeline"]
    Feature_Pipeline["Feature Pipeline"]
    Data_Transforms["Data Transforms"]
    Tools["Tools"]
    Parsers["Parsers"]
    Templates["Templates"]
    MSA_Pairing["MSA Pairing"]
    Data_Modules["Data Modules"]
    Data_Pipeline -- "Orchestrates" --> Tools
    Data_Pipeline -- "Orchestrates" --> Parsers
    Data_Pipeline -- "Feeds into" --> Feature_Pipeline
    Feature_Pipeline -- "Receives input from" --> Data_Pipeline
    Feature_Pipeline -- "Utilizes" --> Data_Transforms
    Feature_Pipeline -- "Feeds into" --> Data_Modules
    Data_Transforms -- "Used by" --> Feature_Pipeline
    Tools -- "Called by" --> Data_Pipeline
    Tools -- "Outputs consumed by" --> Parsers
    Parsers -- "Used by" --> Data_Pipeline
    Parsers -- "Used by" --> Templates
    Templates -- "Used by" --> Data_Pipeline
    Templates -- "Relies on" --> Parsers
    MSA_Pairing -- "Used by" --> Data_Pipeline
    MSA_Pairing -- "Used by" --> Feature_Pipeline
    Data_Modules -- "Consumes data from" --> Feature_Pipeline

Details

The Data Processing Pipeline in OpenFold transforms raw biological data (sequences, MSAs, and structural templates) into the tensors consumed by the deep learning model. It consists of the eight components shown in the diagram above, each described in turn here.

Data Pipeline

Orchestrates the entire data processing workflow. It runs external bioinformatics tools to generate multiple sequence alignments (MSAs) and templates, parses their output, and coordinates the initial stages of data preparation. It includes specialized logic for multimer data.
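The orchestration role can be sketched roughly as follows. This is a minimal illustration, not OpenFold's actual code: `run_pipeline`, the `Tool` alias, and the output keys are hypothetical names, and the external tools are replaced by plain callables.

```python
from typing import Callable, Dict

# Hypothetical alias: a "tool" takes a query sequence and returns raw text output.
Tool = Callable[[str], str]


def run_pipeline(
    sequence: str,
    msa_tools: Dict[str, Tool],
    template_tool: Tool,
) -> Dict[str, object]:
    """Run every alignment tool, then the template search, and collect the raw outputs."""
    # One raw alignment per tool, keyed by the tool's name.
    raw_msas = {name: tool(sequence) for name, tool in msa_tools.items()}
    raw_templates = template_tool(sequence)
    return {
        "sequence": sequence,
        "msas": raw_msas,
        "templates": raw_templates,
    }
```

Passing the tools in as callables keeps the orchestrator testable with stubs, for example `run_pipeline("MKV", {"stub": str.lower}, lambda s: "no hits")`.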

Related Classes/Methods:

Feature Pipeline

Transforms raw biological data (sequences, MSAs, templates) into the numerical features (tensors) that the neural network consumes directly. It applies the Data Transforms to the Data Pipeline's output and prepares the input for the model.
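As a small illustration of turning a sequence into numerical features, the sketch below one-hot encodes residues using nested lists instead of tensors. The alphabet ordering, the `X` fallback for unknown residues, and the feature names are assumptions made for this example, not OpenFold's definitions.

```python
from typing import Dict, List

# Assumed alphabet: 20 standard amino acids plus "X" for anything unrecognized.
ALPHABET = "ARNDCQEGHILKMFPSTWYVX"


def one_hot_sequence(sequence: str) -> List[List[int]]:
    """Return one row per residue with a single 1 at the residue's alphabet index."""
    rows = []
    for residue in sequence:
        index = ALPHABET.find(residue)
        if index == -1:
            # Unknown letters fall back to the "X" column.
            index = ALPHABET.index("X")
        row = [0] * len(ALPHABET)
        row[index] = 1
        rows.append(row)
    return rows


def make_features(sequence: str) -> Dict[str, object]:
    """Bundle the encoded sequence with its length (hypothetical feature names)."""
    return {
        "sequence_one_hot": one_hot_sequence(sequence),
        "sequence_length": len(sequence),
    }
```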

Related Classes/Methods:

Data Transforms

A collection of functions and classes that apply various transformations, augmentations, and normalizations to the raw and intermediate data. This includes operations like cropping, padding, and converting data into model-consumable formats, with specific implementations for multimer data.
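Cropping and padding are the easiest of these operations to show in isolation. The following sketch works on plain Python lists; the function names and the default padding value of 0 are choices made for this example only.

```python
from typing import List, Sequence


def crop(values: Sequence[int], start: int, size: int) -> List[int]:
    """Keep at most `size` items beginning at position `start`."""
    return list(values[start:start + size])


def pad(values: Sequence[int], size: int, pad_value: int = 0) -> List[int]:
    """Extend `values` on the right with `pad_value` until it has `size` items."""
    if len(values) > size:
        raise ValueError("input is longer than the target size")
    return list(values) + [pad_value] * (size - len(values))


def crop_and_pad(
    values: Sequence[int], start: int, size: int, pad_value: int = 0
) -> List[int]:
    """Crop first, then pad, so the result always has exactly `size` items."""
    return pad(crop(values, start, size), size, pad_value)
```

Composing the two steps gives every example the same length, which is what allows examples of different sizes to be stacked into one batch.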

Related Classes/Methods:

Tools

Provides Python wrappers and utilities for executing external bioinformatics tools (e.g., HHblits, Jackhmmer, HHsearch, Kalign). These tools are crucial for generating Multiple Sequence Alignments (MSAs) and identifying structural templates, which are essential inputs for the model.

Related Classes/Methods:

  • tools
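A wrapper around an external executable typically builds a command line, runs it, and turns a non-zero exit code into a Python exception. The sketch below shows that pattern with the standard library; `run_tool` and `ToolError` are hypothetical names, and no real HHblits or Jackhmmer flags are assumed.

```python
import subprocess
import sys
from typing import Sequence


class ToolError(RuntimeError):
    """Raised when an external tool exits with a non-zero status."""


def run_tool(command: Sequence[str], timeout_s: float = 60.0) -> str:
    """Run `command`, return its standard output, and raise ToolError on failure."""
    completed = subprocess.run(
        list(command),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout_s,
    )
    if completed.returncode != 0:
        raise ToolError(
            f"{command[0]} exited with code {completed.returncode}: "
            f"{completed.stderr.strip()}"
        )
    return completed.stdout


if __name__ == "__main__":
    # The Python interpreter stands in for a bioinformatics binary here.
    print(run_tool([sys.executable, "-c", "print('alignment text')"]))
```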

Parsers

Parses the bioinformatics data formats used here, including A3M, FASTA, PDB, and mmCIF files. This component extracts the relevant information from these files for downstream processing by other parts of the data pipeline.
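FASTA is the simplest of these formats: each record is a `>` header line followed by one or more sequence lines. A minimal parser, written for illustration and not taken from OpenFold, might look like this:

```python
from typing import List, Optional, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (description, sequence) pairs in the order they appear."""
    records: List[Tuple[str, str]] = []
    description: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence lines are joined; text before the first header is ignored.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records
```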

Related Classes/Methods:

Templates

Manages the identification, processing, and featurization of structural templates. This involves searching for homologous structures, parsing their data, and preparing them as input features for the model. It includes logic for handling various template sources and potential errors.
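One part of template handling is deciding which search hits to keep and recording why the others were dropped. The sketch below is an assumption-heavy illustration: the `TemplateHit` fields, the release-date cutoff, the minimum aligned fraction, and the limit of four templates are all invented for this example and are not taken from the source.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TemplateHit:
    """Hypothetical summary of one template search hit."""
    name: str
    release_date: date
    aligned_fraction: float  # share of the query covered by the alignment


def select_templates(
    hits: Sequence[TemplateHit],
    max_release_date: date,
    min_aligned_fraction: float = 0.1,
    max_templates: int = 4,
) -> Tuple[List[TemplateHit], List[Tuple[str, str]]]:
    """Return the kept hits (best coverage first) and (name, reason) for rejected ones."""
    kept: List[TemplateHit] = []
    rejected: List[Tuple[str, str]] = []
    for hit in hits:
        if hit.release_date > max_release_date:
            rejected.append((hit.name, "released after cutoff"))
        elif hit.aligned_fraction < min_aligned_fraction:
            rejected.append((hit.name, "alignment too short"))
        else:
            kept.append(hit)
    kept.sort(key=lambda h: h.aligned_fraction, reverse=True)
    return kept[:max_templates], rejected
```

Returning the rejection reasons instead of raising lets the caller log problems with individual templates and still continue with the remaining hits.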

Related Classes/Methods:

MSA Pairing

Specifically handles the pairing and processing of Multiple Sequence Alignments for multimeric protein complexes. This is a critical step for correctly representing inter-chain relationships and generating accurate features for multimer prediction.
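To make the idea of pairing concrete, the sketch below matches MSA rows across chains by a species identifier and keeps only species found in every chain. Matching on species and taking the first row per species are simplifying assumptions for this example; the text does not specify the criterion OpenFold uses.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed row layout: (species identifier, aligned sequence).
Row = Tuple[str, str]


def pair_msas(chain_msas: Sequence[Sequence[Row]]) -> List[Tuple[str, ...]]:
    """Return one tuple of aligned sequences (one per chain) for each shared species."""
    per_chain: List[Dict[str, str]] = []
    for rows in chain_msas:
        first_by_species: Dict[str, str] = {}
        for species, aligned in rows:
            # Keep only the first row seen for each species in this chain.
            first_by_species.setdefault(species, aligned)
        per_chain.append(first_by_species)
    if not per_chain:
        return []
    # A species can be paired only if every chain has a row for it.
    shared = [
        species
        for species in per_chain[0]
        if all(species in chain for chain in per_chain[1:])
    ]
    return [tuple(chain[species] for chain in per_chain) for species in shared]
```

Each returned tuple lines up sequences from the same organism across chains, which is the kind of inter-chain relationship the paired MSA is meant to represent.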

Related Classes/Methods:

Data Modules

Provides the PyTorch Lightning interface, handling data loading, batching, and dataset management for training and inference. It wraps the DataPipeline and FeaturePipeline to deliver model-ready data in a structured manner.
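The wrapping pattern can be shown without any framework. In the sketch below, `PipelineDataset` and `batches` are hypothetical names, and the two pipelines are passed in as plain callables; in a PyTorch setting the class would typically subclass `torch.utils.data.Dataset` and batching would be left to a `DataLoader`.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Raw = Dict[str, object]
Features = Dict[str, object]


class PipelineDataset:
    """Map-style dataset that runs both pipelines lazily, one example at a time."""

    def __init__(
        self,
        sequences: Sequence[str],
        data_pipeline: Callable[[str], Raw],
        feature_pipeline: Callable[[Raw], Features],
    ) -> None:
        self._sequences = list(sequences)
        self._data_pipeline = data_pipeline
        self._feature_pipeline = feature_pipeline

    def __len__(self) -> int:
        return len(self._sequences)

    def __getitem__(self, index: int) -> Features:
        # Raw data first, then features, mirroring the two pipeline stages.
        raw = self._data_pipeline(self._sequences[index])
        return self._feature_pipeline(raw)


def batches(dataset: PipelineDataset, batch_size: int) -> Iterator[List[Features]]:
    """Yield consecutive lists of at most `batch_size` examples."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]
```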

Related Classes/Methods: