In his AI Speaker Series presentation at Sutter Hill Ventures, Brian Hie presented Evo, a long-context genomic foundation model, and discussed how it's being used to understand and design biological systems. Here are my notes from his talk:
- Biology speaks a foreign language, written in DNA, RNA, and protein sequences.
- While we've made tremendous advances in DNA sequencing, synthesis, and genome editing, intelligently composing new DNA sequences remains a fundamental challenge.
- Similar to how language models like ChatGPT use next-token prediction to learn complex patterns in text, genomic models can use next-base-pair prediction to uncover patterns in DNA (see the first sketch after these notes).
- Evolution leaves its imprint on DNA sequences, allowing models to learn complex biological mechanisms from sequence variation.
- Protein language models have already shown they can learn evolutionary rules and information about protein structure. Evo takes this further by training on raw DNA sequences across all domains of life.
- Evo 1 was trained on prokaryotic genomes, with 7 billion parameters and a 131,000-token context window.
- The model demonstrated zero-shot understanding of gene essentiality, accurately distinguishing genes that tolerate mutations from those that do not (the second sketch after these notes shows the basic likelihood-comparison idea).
- It can also design new biological systems with performance comparable to state-of-the-art systems but with substantially different sequences.
- Evo 2 expanded to all three domains of life: 40 billion parameters trained on 9.3 trillion tokens, with a one-million-base-pair context length. This makes it the largest model by compute ever trained in biology.
- The longer context allows it to understand information from the molecular level up to complete bacterial genomes or yeast chromosomes.
- Evo 2 excels at predicting the effects of mutations on human genes, particularly in non-coding regions where current models struggle. When fine-tuned on known breast cancer mutations, it achieves state-of-the-art performance.
- Using sparse autoencoders, researchers can interpret the model and find features that correspond to biologically relevant concepts like DNA, RNA, and protein structures. Some features even detect errors in genetic code, similar to how language models can detect bugs in computer code (a toy version of the sparse-autoencoder setup is sketched after these notes).
- The most forward-looking application is designing at the scale of entire genomes or chromosomes. Evo 2 can generate coherent mitochondrial genomes with all the right components and predicted structures (the last sketch after these notes shows the underlying autoregressive sampling loop).
- It can also generate DNA with designed chromatin accessibility patterns, writing messages in "Morse code" by specifying which regions of chromatin should be open or closed.
- All of the models, code, and datasets have been released as open source for the scientific community.
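Below are a few illustrative sketches for the ideas above. They are my own toy reconstructions, not Evo's actual code or API; all model sizes, sequences, and names are placeholders.

The first sketch shows the next-base-pair prediction objective: a tiny causal model over the four-letter DNA alphabet trained with cross-entropy on the next base. Evo itself is a vastly larger model with a different architecture; only the training objective is the point here.

```python
# Minimal sketch of next-base-pair prediction on DNA, analogous to next-token
# prediction in language models. The tiny transformer is purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq: str) -> torch.Tensor:
    return torch.tensor([VOCAB[b] for b in seq], dtype=torch.long)

class TinyDNALM(nn.Module):
    """A toy causal model over the 4-letter DNA alphabet (stand-in for Evo)."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, tokens):  # tokens: (batch, length)
        L = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(L, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # (batch, length, 4)

model = TinyDNALM()
seq = encode("ATGGCGTACGTTAGC").unsqueeze(0)
logits = model(seq[:, :-1])                 # predict base t+1 from bases <= t
loss = F.cross_entropy(logits.reshape(-1, 4), seq[:, 1:].reshape(-1))
print(f"next-base cross-entropy: {loss.item():.3f}")
```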
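The second sketch shows one common way to do zero-shot mutation scoring: compare the model's log-likelihood of a mutated sequence against the wild type, on the assumption that functionally constrained sequence tolerates disruption poorly. This is my reading of the intuition behind the essentiality and variant-effect results, not Evo's exact scoring pipeline. `model` and `encode` come from the previous sketch; the sequences are made up.

```python
# Hedged sketch of zero-shot mutation scoring via a log-likelihood comparison.
import torch
import torch.nn.functional as F

@torch.no_grad()
def log_likelihood(model, tokens: torch.Tensor) -> float:
    """Sum of log P(base_t | bases_<t) over the sequence (tokens: (1, L))."""
    logits = model(tokens[:, :-1])
    logps = F.log_softmax(logits, dim=-1)
    target = tokens[:, 1:]
    return logps.gather(-1, target.unsqueeze(-1)).sum().item()

def mutation_effect(model, encode, wild_type: str, mutant: str) -> float:
    """Delta log-likelihood (mutant - wild type); more negative = more disruptive."""
    wt = encode(wild_type).unsqueeze(0)
    mut = encode(mutant).unsqueeze(0)
    return log_likelihood(model, mut) - log_likelihood(model, wt)

# Example (toy, untrained model, so the score is meaningless until trained):
wild = "ATGGCGTACGTTAGC"
mut  = "ATGGCGTACATTAGC"   # single-base substitution
print(mutation_effect(model, encode, wild, mut))
```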
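The third sketch is a toy version of the sparse-autoencoder idea used for interpretability: train an overcomplete autoencoder with a sparsity penalty on hidden activations taken from an intermediate layer, then inspect which sparse features fire at which genomic positions. The dimensions, hyperparameters, and data (random activations here) are placeholders, not Evo 2's actual setup.

```python
# Hedged sketch of a sparse autoencoder (SAE) over model activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=64, d_features=1024):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = F.relu(self.encoder(acts))   # sparse, non-negative codes
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# Stand-in for residual-stream activations collected from the genomic model.
activations = torch.randn(4096, 64)

for step in range(100):
    batch = activations[torch.randint(0, activations.shape[0], (256,))]
    recon, feats = sae(batch)
    loss = F.mse_loss(recon, batch) + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, a feature's activations along a genome can be compared to
# annotations (coding regions, RNA structure, frameshift errors, etc.).
```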
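The last sketch shows the basic autoregressive sampling loop underlying sequence generation: sample bases one at a time, optionally conditioned on a prompt. The genome-scale and chromatin-design results clearly involve much more (scale, prompting, and design-specific guidance and scoring), so treat this only as the core loop. `model` and `encode` again come from the first sketch.

```python
# Hedged sketch of autoregressive DNA generation with temperature sampling.
import torch

BASES = "ACGT"

@torch.no_grad()
def sample_dna(model, prompt: str, n_new: int, temperature: float = 0.8) -> str:
    tokens = encode(prompt).unsqueeze(0)       # encode() from the first sketch
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return "".join(BASES[i] for i in tokens[0].tolist())

print(sample_dna(model, prompt="ATG", n_new=40))
```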