graph LR
Data_Pipeline["Data Pipeline"]
Feature_Pipeline["Feature Pipeline"]
Data_Transforms["Data Transforms"]
Tools["Tools"]
Parsers["Parsers"]
Templates["Templates"]
MSA_Pairing["MSA Pairing"]
Data_Modules["Data Modules"]
Data_Pipeline -- "Orchestrates" --> Tools
Data_Pipeline -- "Orchestrates" --> Parsers
Data_Pipeline -- "Feeds into" --> Feature_Pipeline
Feature_Pipeline -- "Receives input from" --> Data_Pipeline
Feature_Pipeline -- "Utilizes" --> Data_Transforms
Feature_Pipeline -- "Feeds into" --> Data_Modules
Data_Transforms -- "Used by" --> Feature_Pipeline
Tools -- "Called by" --> Data_Pipeline
Tools -- "Outputs consumed by" --> Parsers
Parsers -- "Used by" --> Data_Pipeline
Parsers -- "Used by" --> Templates
Templates -- "Used by" --> Data_Pipeline
Templates -- "Relies on" --> Parsers
MSA_Pairing -- "Used by" --> Data_Pipeline
MSA_Pairing -- "Used by" --> Feature_Pipeline
Data_Modules -- "Consumes data from" --> Feature_Pipeline
The Data Processing Pipeline in OpenFold is a critical subsystem responsible for transforming raw biological data into a format suitable for the deep learning model. It encompasses several key components that work in concert to achieve this.
This is the orchestrator of the entire data processing workflow. It manages the execution of external bioinformatics tools to generate MSAs and templates, and coordinates the initial stages of data preparation. It includes specialized logic for multimer data.
Related Classes/Methods:
Responsible for transforming raw biological data (sequences, MSAs, templates) into the numerical features (tensors) that can be directly consumed by the neural network. It applies various data transformations and prepares the input for the model.
Related Classes/Methods:
A collection of functions and classes that apply various transformations, augmentations, and normalizations to the raw and intermediate data. This includes operations like cropping, padding, and converting data into model-consumable formats, with specific implementations for multimer data.
Related Classes/Methods:
Provides Python wrappers and utilities for executing external bioinformatics tools (e.g., HHblits, Jackhmmer, HHsearch, Kalign). These tools are crucial for generating Multiple Sequence Alignments (MSAs) and identifying structural templates, which are essential inputs for the model.
Related Classes/Methods:
tools
Handles the parsing of various bioinformatics data formats, including A3M, FASTA, PDB, and MMCIF files. This component extracts relevant information from these files for downstream processing by other parts of the data pipeline.
Related Classes/Methods:
Manages the identification, processing, and featurization of structural templates. This involves searching for homologous structures, parsing their data, and preparing them as input features for the model. It includes logic for handling various template sources and potential errors.
Related Classes/Methods:
Specifically handles the pairing and processing of Multiple Sequence Alignments for multimeric protein complexes. This is a critical step for correctly representing inter-chain relationships and generating accurate features for multimer prediction.
Related Classes/Methods:
Provides the interface for PyTorch Lightning, handling data loading, batching, and dataset management for training and inference. It wraps the DataPipeline and FeaturePipeline to provide model-ready data in an efficient and structured manner.
Related Classes/Methods: