Analysis of Genomic Foundation Models
This repository aims to implement a series of experiments to critically evaluate the field of "foundation" models for genomics.
We train different ensembles each made of
Then, we analyze and compare text models to DNA models with respect to their output distributions, static word embeddings, contextual embeddings, and Fisher information concentration.
This project was built with uv, but you can also run it in your usual Python environment.
You can install uv here: https://docs.astral.sh/uv/getting-started/installation/
After installing uv, you can run the commands below directly; uv will install the dependencies automatically from the pyproject.toml and uv.lock files.
Before training the models, you will need to download the data.
The OpenGenome2 eukaryotic genic windows are available on HuggingFace, while the ncRNA and cDNA data were downloaded from Ensembl. For example, for cDNA data, run the scripts/download_ensembl_cdna.sh script and save the sequences into the scratch/ensembl_cdna directory. Then run the scripts/preprocess_fasta.py script, which converts the FASTA file into a HuggingFace dataset. The same can be done for the ncRNA data.
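As a rough illustration of the preprocessing step, the sketch below parses FASTA text into a column layout that a HuggingFace dataset can be built from. It is not the code in scripts/preprocess_fasta.py: the function name parse_fasta, the column names id and text, and the toy records are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect FASTA records into two parallel columns (hypothetical names)."""
    ids: List[str] = []
    seqs: List[str] = []
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header starts a new record; flush the previous sequence first.
            if ids:
                seqs.append("".join(chunks))
            # Keep only the identifier, i.e. the first token after ">".
            header = line[1:].split()
            ids.append(header[0] if header else "")
            chunks = []
        elif ids:
            # Sequence lines may be wrapped, so accumulate them.
            chunks.append(line.upper())
    if ids:
        seqs.append("".join(chunks))
    return {"id": ids, "text": seqs}


# Toy input, not real Ensembl records.
example = """>seq1 toy record
ACGTAC
GTTA
>seq2 toy record
ggcc
""".splitlines()

columns = parse_fasta(example)
print(columns)
```

With the datasets library installed, the returned dictionary can be passed to datasets.Dataset.from_dict(columns) to obtain a dataset object.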
There are two main modules in src: train and analyze. They should be called as Python modules (with -m).
The configuration of the models can be set in src/utils/config.py.
The paths to where you will store the dataset can be updated in src/utils/paths.py.
First, you will need to train the models:
uv run -m src.train --type {text, dna} --tokenizer {bpe, kmer} --data {wiki, og2, ncrna, cdna}
# all other hyperparams can be set in the utils/config.py file
Additionally, --description can be provided to distinguish the runs better.
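To make the kmer value of --tokenizer more concrete, here is a minimal sketch of k-mer tokenization of a DNA sequence. The window size k=6, the default non-overlapping stride, and the function name kmer_tokenize are assumptions for illustration; the tokenizer used in this repository may differ.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 6) -> List[str]:
    """Split a DNA string into k-mers; stride == k gives non-overlapping tokens."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # Slide a window of width k; a trailing fragment shorter than k is dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


non_overlapping = kmer_tokenize("ACGTACGTACGTAA")
overlapping = kmer_tokenize("ACGTACGT", k=3, stride=1)
print(non_overlapping)
print(overlapping)
```

Overlapping k-mers (stride 1) keep every position in context at the cost of longer token sequences, while non-overlapping k-mers shorten the input by a factor of k.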
Models are saved in <scratch>/runs/<timestamp>_<type>_<tokenizer>_<description>/<id> so that they can be retrieved later for analysis. The <scratch> directory can be changed in src/utils/paths.py.
For example, if you run uv run -m src.train --type dna --tokenizer kmer --data og2, it will create:
scratch/runs/<timestamp>_dna_kmer_og2/1
scratch/runs/<timestamp>_dna_kmer_og2/2
...
scratch/runs/<timestamp>_dna_kmer_og2/N
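The directory layout above can be reproduced with a few lines of pathlib, which may help when locating runs for later analysis. This is only a sketch: the helper name run_dirs, the timestamp format, and the ensemble size of 3 are assumptions, not values taken from the repository.

```python
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def run_dirs(scratch: Path, model_type: str, tokenizer: str, description: str,
             n_models: int, timestamp: Optional[str] = None) -> List[Path]:
    """Build one directory path per ensemble member (ids start at 1)."""
    # Assumed timestamp format; the format used by the repository may differ.
    stamp = timestamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    root = scratch / "runs" / f"{stamp}_{model_type}_{tokenizer}_{description}"
    return [root / str(i) for i in range(1, n_models + 1)]


paths = run_dirs(Path("scratch"), "dna", "kmer", "og2", n_models=3,
                 timestamp="TIMESTAMP")
for path in paths:
    print(path.as_posix())
```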
If you have limited resources, we recommend training fewer models (change
We look at the distributions of BERT models over masked tokens.
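One way to compare the masked-token distributions of two models is the Jensen-Shannon divergence. The sketch below is an assumption about how such a comparison could look, not necessarily the metric computed by src.analyze; the probability vectors are made-up toy values over the four nucleotides.

```python
import math
from typing import Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same vocabulary")

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # The mixture m is non-zero wherever p or q is, so kl(., m) stays finite.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy predictions of two models for one masked position over A, C, G, T.
model_a = [0.70, 0.10, 0.10, 0.10]
model_b = [0.25, 0.25, 0.25, 0.25]
print(js_divergence(model_a, model_a))  # identical distributions
print(js_divergence(model_a, model_b))  # differing distributions
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), which makes scores comparable across vocabularies of different sizes.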
uv run -m src.analyze --type distribution --samples <NSAMPLES> --batch_size <BATCH_SIZE>
We aim to see if models tend to agree on which tokens should be close in embedding space.
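Agreement between the static embeddings of two models can be measured, for instance, by how much each token's nearest neighbors overlap. The following sketch uses cosine similarity and Jaccard overlap; the metric, the function names, and the toy vectors are assumptions for illustration and may not match what src.analyze computes.

```python
import math
from typing import Dict, List, Set

Embeddings = Dict[str, List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbors(emb: Embeddings, token: str, k: int) -> Set[str]:
    """Return the k tokens most cosine-similar to `token`, excluding itself."""
    others = [t for t in emb if t != token]
    others.sort(key=lambda t: cosine(emb[token], emb[t]), reverse=True)
    return set(others[:k])


def neighbor_agreement(a: Embeddings, b: Embeddings, k: int = 1) -> float:
    """Mean Jaccard overlap of k-nearest-neighbor sets over shared tokens."""
    shared = [t for t in a if t in b]
    a_shared = {t: a[t] for t in shared}
    b_shared = {t: b[t] for t in shared}
    scores = []
    for token in shared:
        na = neighbors(a_shared, token, k)
        nb = neighbors(b_shared, token, k)
        union = na | nb
        scores.append(len(na & nb) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy 2-d embeddings: model_b swaps the axes of model_a but keeps the same
# neighbor structure, while model_c pairs the tokens differently.
model_a = {"AAA": [1.0, 0.0], "AAT": [0.9, 0.1], "GGG": [0.0, 1.0], "GGC": [0.1, 0.9]}
model_b = {"AAA": [0.0, 1.0], "AAT": [0.1, 0.9], "GGG": [1.0, 0.0], "GGC": [0.9, 0.1]}
model_c = {"AAA": [1.0, 0.0], "GGG": [0.9, 0.1], "AAT": [0.0, 1.0], "GGC": [0.1, 0.9]}
print(neighbor_agreement(model_a, model_b))
print(neighbor_agreement(model_a, model_c))
```

Comparing neighbor sets rather than raw vectors avoids having to align the coordinate systems of independently trained models.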
uv run -m src.analyze --type static
We analyze the relations between the contextual embeddings, taken from the final transformer layer.
uv run -m src.analyze --type embeddings --samples <NSAMPLES> --batch_size <BATCH_SIZE>
We look at the concentration of Fisher Information with respect to each layer.
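A common approximation is the empirical Fisher information, whose diagonal is estimated by averaging squared per-sample gradients of the log-likelihood. The sketch below sums that diagonal within each layer and normalizes the result to obtain each layer's share. The layer names and gradient values are toy assumptions, and the repository may define concentration differently.

```python
from typing import Dict, List

# One entry per sample: layer name -> flattened gradient of the log-likelihood.
SampleGrads = Dict[str, List[float]]


def fisher_share_per_layer(per_sample_grads: List[SampleGrads]) -> Dict[str, float]:
    """Fraction of the empirical Fisher trace that falls in each layer."""
    if not per_sample_grads:
        raise ValueError("need at least one sample")
    trace: Dict[str, float] = {}
    for grads in per_sample_grads:
        for layer, g in grads.items():
            # The diagonal empirical Fisher is the mean of squared gradients;
            # summing the diagonal over a layer gives that layer's trace.
            trace[layer] = trace.get(layer, 0.0) + sum(x * x for x in g)
    n = len(per_sample_grads)
    trace = {layer: value / n for layer, value in trace.items()}
    total = sum(trace.values())
    return {layer: (value / total if total else 0.0) for layer, value in trace.items()}


# Toy gradients for two samples and two layers (made-up values).
toy_grads = [
    {"layer_0": [1.0, 1.0], "layer_1": [2.0, 0.0]},
    {"layer_0": [1.0, -1.0], "layer_1": [0.0, 2.0]},
]
shares = fisher_share_per_layer(toy_grads)
print(shares)
```

In a real model the per-sample gradients would come from backpropagation through the network; only the aggregation step is shown here.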
uv run -m src.analyze --type fisher --samples <NSAMPLES> --batch_size <BATCH_SIZE>
The code for training the discriminator used to reweight OpenGenome2 is located in the discriminator directory.
The provided scripts should be called in this order:
1. preprocess.py
2. train.py
3. inference.py
This will generate a weighted dataset with two columns: text (the DNA sequence) and weight (the "information score" derived from the discriminator logits).
You can then use it to train your LM of choice, for example by passing the weights to a PyTorch WeightedRandomSampler object (see https://docs.pytorch.org/docs/2.12/data.html#torch.utils.data.WeightedRandomSampler for more information).
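To illustrate what weighted sampling does with the weight column, here is a standard-library sketch that draws dataset indices with replacement in proportion to their weights. It stands in for PyTorch's WeightedRandomSampler, which takes the same kind of weight sequence together with num_samples and replacement arguments. The helper name sample_indices and the rows below are made up for this example.

```python
import random
from collections import Counter
from typing import List, Sequence


def sample_indices(weights: Sequence[float], num_samples: int, seed: int = 0) -> List[int]:
    """Draw dataset indices with replacement, proportionally to `weights`."""
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


# Toy rows in the text/weight layout described above (values are invented).
rows = [
    {"text": "ACGTTGCA", "weight": 0.9},
    {"text": "AAAAAAAA", "weight": 0.1},
    {"text": "GGCATCGA", "weight": 0.0},
]
picked = sample_indices([row["weight"] for row in rows], num_samples=1000)
counts = Counter(picked)
print(counts)
```

Rows with a higher information score are seen more often during training, and rows with zero weight are never drawn.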
The discriminator architecture and different settings can be modified in config.py and model.py.