Industry Landscape: Generative AI in Drug Discovery & Computational Biology¶

This document surveys companies and platforms pioneering generative AI, foundation models, and machine learning approaches for drug discovery, treatment response prediction, and biological research.

Last Updated: December 2024
Purpose: Track industry developments, identify research directions, and gather ideas for this project.

Table of Contents¶

DNA Foundation Models & Sequence Generation
Single-Cell Foundation Models
Gene Expression & Multi-Omics
Splicing & RNA Processing
Protein & Structure-Based Discovery
Gene Editing & CRISPR
Clinical & Treatment Response
AI-Driven Target Discovery
Key Observations & Research Directions

DNA Foundation Models & Sequence Generation¶

Foundation models for DNA sequence understanding and generation — enabling synthetic genome design, variant effect prediction, and regulatory element discovery.

Evo 2 (Arc Institute)¶


Website	arcinstitute.org/tools/evo
Focus	DNA foundation model for generalist prediction and design
Key Technology	Evo 2 (40B parameters, 1M context length)
Partners	NVIDIA

What They Do:

Genomic foundation model for DNA, RNA, and protein tasks
Single-nucleotide resolution with near-linear scaling
Trained on 9+ trillion nucleotides from 128,000+ species (all domains of life)
Generative design of synthetic DNA sequences

Technical Details:

40 billion parameters
1 megabase (1M tokens) context length
Frontier deep learning architecture
Evo Designer tool for sequence generation
Open source on GitHub and Hugging Face

Relevance to This Project: ⭐⭐⭐⭐⭐

State-of-the-art in DNA generation
Long-range context is critical for genomics
Open source enables direct experimentation

Resources:

Nucleotide Transformer (InstaDeep)¶


Website	instadeep.com
Focus	DNA foundation models for molecular phenotype prediction
Key Technology	Nucleotide Transformer (up to 2.5B parameters)
Published	Nature Methods (Nov 2024)

What They Do:

Foundation models pre-trained on DNA sequences
Transfer learning for genomic tasks with limited labeled data
Multi-species genome understanding

Technical Details:

Models from 50M to 2.5B parameters
Multispecies 2.5B outperforms single-species models
18 benchmark tasks for evaluation
Interactive leaderboard for comparison

Relevance to This Project: ⭐⭐⭐⭐

Established benchmark for DNA foundation models
Transfer learning approach applicable to expression

Resources:

HyenaDNA (Stanford/Hazy Research)¶


Website	GitHub
Focus	Long-range genomic sequence modeling
Key Technology	HyenaDNA (up to 1M context)
Architecture	Hyena (sub-quadratic attention alternative)

What They Do:

Genomic foundation model with ultra-long context
Single nucleotide resolution
Pre-trained on human reference genome

Technical Details:

Up to 1 million token context (500x increase over dense attention)
Hyena operator for efficient long-range modeling
Fine-tunable for downstream tasks

Relevance to This Project: ⭐⭐⭐⭐

Efficient architecture for long sequences
Applicable to regulatory element prediction

Caduceus (Cornell)¶


Website	caduceus-dna.github.io
Focus	Bi-directional DNA sequence modeling
Key Technology	Caduceus (BiMamba + RC equivariance)
Architecture	Mamba-based

What They Do:

First family of RC (reverse complement) equivariant DNA models
Bi-directional modeling for upstream/downstream context
Long-range sequence modeling

Technical Details:

BiMamba: bi-directional Mamba block
MambaDNA: RC equivariant extension
Handles biological symmetry of DNA strands

Relevance to This Project: ⭐⭐⭐⭐

Novel architecture addressing DNA-specific challenges
Mamba efficiency for long sequences

DNABERT / DNABERT-2¶


Website	GitHub
Focus	Pre-trained bidirectional encoder for DNA
Key Technology	DNABERT-2 (ICLR 2024)
Benchmark	Genome Understanding Evaluation (GUE)

What They Do:

BERT-style pre-training for DNA sequences
Multi-species genome understanding
DNABERT-S for DNA embeddings that cluster by genome

Technical Details:

Efficient foundation model
28 datasets in GUE benchmark
Transfer learning to downstream tasks

Relevance to This Project: ⭐⭐⭐

Established baseline for DNA language models
Comprehensive benchmark suite

Single-Cell Foundation Models¶

Foundation models specifically designed for single-cell transcriptomics data.

Geneformer¶


Website	Hugging Face
Focus	Transfer learning for single-cell biology
Key Technology	Geneformer-V2 (104M-316M parameters)
Published	Nature (2023)

What They Do:

Foundation model for context-specific gene network predictions
Pre-trained on ~104M human single-cell transcriptomes
Zero-shot and fine-tuning capabilities

Technical Details:

Geneformer-V2-316M: latest model (Dec 2024)
Input size: 4096 tokens
Vocabulary: ~20K protein-coding genes
Rank-value encoding of gene expression

Relevance to This Project: ⭐⭐⭐⭐⭐

Primary inspiration for transformer-based expression modeling
Representation learning that can be extended to generation

Resources:

scGPT¶


Website	GitHub
Focus	Generative pre-trained transformer for single-cell
Key Technology	scGPT
Published	Nature Methods (2024)

What They Do:

Foundation model for single-cell multi-omics
Generative capabilities for cell state prediction
Pre-trained on 33+ million cells

Technical Details:

Generative pre-trained transformer architecture
Multi-task learning: cell type annotation, perturbation prediction, multi-omics integration
Attention-based gene-gene interaction modeling

Relevance to This Project: ⭐⭐⭐⭐⭐

Directly relevant — generative model for single-cell
Perturbation prediction aligns with counterfactual goals

Resources:

Nature Methods Paper

Gene Expression & Multi-Omics Foundation Models¶

Companies building foundation models specifically for gene expression, transcriptomics, and multi-omics data.

Synthesize Bio¶


Website	https://www.synthesize.bio/
Focus	Generative foundation models for gene expression
Key Technology	GEM-1 — Gene Expression Model
Approach	Generate biologically realistic gene expression data in silico

What They Do:

Build generative models that can simulate gene expression profiles under various conditions
Enable in silico experimentation that bridges wet lab and computation
Multi-omics and public + private data harmonization
Platform for drug discovery, hypothesis testing, and clinical decision support

Relevance to This Project: ⭐⭐⭐⭐⭐ (Primary inspiration) - Direct alignment with our cVAE and counterfactual simulation goals - Their GEM-1 represents the state-of-the-art we're studying

Key Blog Posts:

https://www.synthesize.bio/blog

Deep Genomics¶


Website	https://www.deepgenomics.com/
Focus	RNA biology and therapeutics
Key Technology	BigRNA (~2 billion parameters)
Founded	2014 (Toronto)

What They Do:

First transformer neural network engineered specifically for transcriptomics
Predicts tissue-specific regulatory mechanisms of RNA expression
Predicts binding sites of proteins and microRNAs
Predicts effects of genetic variants and therapeutic candidates

Technical Details:

~2 billion adjustable parameters
Trained on thousands of datasets (>1 trillion genomic signals)
Designed to understand complex RNA interactions

Relevance to This Project: ⭐⭐⭐⭐⭐ - BigRNA is a foundation model for RNA/transcriptomics — directly relevant - Their approach to predicting variant effects aligns with counterfactual reasoning

Helical¶


Website	https://www.helical.ai/
Focus	Open-source DNA/RNA foundation models
Key Technology	Helix-mRNA (hybrid foundation model)
Founded	2023 (Luxembourg)
Funding	€2.2M seed (June 2024)

What They Do:

First open-source platform dedicated to bio foundation models for DNA and RNA
Democratize access to advanced AI tools for pharma/biotech
Library of Bio AI Agents for tasks like biomarker discovery and target prediction

Technical Details:

Helix-mRNA: hybrid foundation model for mRNA therapeutics
Outperforms prior methods in modeling UTRs and long-sequence regions
Uses only ~10% of parameters of comparable models
Available on AWS Marketplace

Relevance to This Project: ⭐⭐⭐⭐ - Open-source focus aligns with our goals - Could potentially integrate their models or learn from their architecture

Noetik¶


Website	https://www.noetik.ai/
Focus	Cancer biology and treatment prediction
Key Technology	OCTO model
Founded	2022 (San Francisco)
Funding	$40M Series A (2024)

What They Do:

AI model that acts like a virtual lab for cancer research
Predicts how different cancer treatments might play out in real patients
Tests "what if" scenarios for treatment optimization

Technical Details:

OCTO trained on thousands of tumor samples
Integrates gene expression, protein data, and cell images
Predicts how tweaking a single gene could change protein levels across a tumor
In vivo CRISPR Perturb-Map platform for validation

Relevance to This Project: ⭐⭐⭐⭐ - Their "what if" scenario testing is exactly counterfactual reasoning - Multi-modal integration (expression + protein + images) is advanced

BioMap¶


Website	https://www.biomap.com/
Focus	Multi-modal biological foundation models
Key Technology	xTrimo (~210 billion parameters)
Partnerships	Sanofi ($10M upfront, >$1B potential milestones)

What They Do:

World's largest life science foundation model
Supports DNA, RNA, protein, cellular, and systems-level modalities
Designed to understand and predict biological behavior across multiple modalities

Technical Details:

~210 billion parameters (as of 2025)
Cross-Modal Transformer Representation of Interactome and Multi-Omics
GPU-accelerated deployment using multi-expert architectures and FP8 precision

Relevance to This Project: ⭐⭐⭐⭐ - Multi-modal approach is the future direction - Scale demonstrates what's possible with sufficient resources

Splicing & RNA Processing¶

AI models for predicting and understanding alternative splicing, splice site recognition, and RNA processing — directly relevant to the meta-spliceai project.

SpliceAI (Illumina)¶


Website	GitHub
Focus	Deep learning for splice site prediction
Key Technology	SpliceAI
Published	Cell (2019)

What They Do:

Predict splicing alterations from DNA sequence
Identify cryptic splice sites created by variants
Score variant pathogenicity based on splicing impact

Technical Details:

Deep residual neural network
10,000 nucleotide context window
Predicts donor/acceptor gain/loss
Pre-computed scores available for all SNVs

Relevance to This Project: ⭐⭐⭐⭐⭐

Foundation for meta-spliceai project
Demonstrates deep learning for splicing prediction
Well-established benchmark

Resources:

Splam (Johns Hopkins)¶


Website	GitHub
Focus	Splice junction recognition
Key Technology	Splam
Published	2024

What They Do:

Improved splice junction recognition over SpliceAI
Better accuracy with less genomic data
Cross-species generalization

Technical Details:

Deep learning model for splice sites
Outperforms SpliceAI on benchmarks
Generalizes across species

Relevance to This Project: ⭐⭐⭐⭐

State-of-the-art in splice prediction
Cross-species transfer learning

Pangolin¶


Focus	Tissue-specific splicing prediction
Key Technology	Pangolin

What They Do:

Predict tissue-specific alternative splicing
Model splicing quantitative trait loci (sQTLs)

Relevance to This Project: ⭐⭐⭐⭐

Tissue-specific conditioning aligns with our cVAE approach
Splicing as a form of gene regulation

Gene Editing & CRISPR¶

Generative AI for designing gene editors and optimizing CRISPR systems.

Profluent Bio¶


Website	profluent.bio
Focus	AI-designed gene editors
Key Technology	OpenCRISPR-1
Published	Nature (2025)

What They Do:

First AI-designed gene editor to edit human genome
Generative AI creates novel CRISPR proteins
Open-source release of OpenCRISPR-1

Technical Details:

Large language models trained on CRISPR-Cas sequences
Generated novel, functional genome editors
Improved properties vs natural systems
RNA-programmable with NGG PAM preference

Relevance to This Project: ⭐⭐⭐⭐⭐

Demonstrates generative AI for protein design
Open-source enables experimentation
LLM approach to biological sequences

Resources:

Protein & Structure-Based Discovery¶

Companies focused on protein structure prediction, design, and protein-based therapeutics.

Isomorphic Labs¶


Website	https://www.isomorphiclabs.com/
Focus	AI-first drug discovery
Key Technology	AlphaFold 3
Parent	Alphabet (DeepMind spin-off, 2021)

What They Do:

Reimagining drug discovery from first principles with AI-first approach
Expanded from small molecules to biologics
Internal pipeline focused on oncology and immunology

Technical Details:

AlphaFold 3: predicts structure of proteins, DNA, RNA, ligands, and their interactions
Released in collaboration with Google DeepMind
Nobel Prize-winning foundation (AlphaFold 2)

Relevance to This Project: ⭐⭐⭐ - Structure prediction complements expression modeling - Their approach to "AI-first" drug discovery is instructive

EvolutionaryScale¶


Website	https://www.evolutionaryscale.ai/
Focus	Protein design and engineering
Key Technology	ESM3 (generative protein model)
Founded	2024 (Meta FAIR spin-off)
Funding	$142M

What They Do:

Programmable biology for protein engineering
Target cancer cells, find alternatives to plastics, environmental mitigations

Technical Details:

ESM3: simultaneously reasons over sequence, structure, and function of proteins
Third-generation ESM model
Trained on NVIDIA H100 GPUs
ESM Cambrian: parallel model family for protein understanding

Relevance to This Project: ⭐⭐⭐ - Generative approach to proteins parallels our approach to expression - Their training methodology is instructive

Generate:Biomedicines¶


Website	https://generatebiomedicines.com/
Focus	Protein therapeutics via generative AI
Key Technology	Generative Biology™ platform

What They Do:

Pioneer of "Generative Biology" — generating custom protein therapeutics
From peptides to antibodies, enzymes, gene therapies
Generate, build, measure, learn loop

Clinical Progress:

GB-0895: Phase 3 for severe asthma (anti-TSLP antibody)
GB-0669: Phase 1 completed with positive results

Relevance to This Project: ⭐⭐⭐ - Their generate-build-measure-learn loop is a good framework - Demonstrates clinical translation of generative approaches

Chai Discovery¶


Website	https://www.chaidiscovery.com/
Focus	Molecular structure prediction and antibody design
Key Technology	Chai-1 (structure), Chai-2 (antibody design)
Funding	$100M total ($70M Series A, 2025)
Investors	Thrive Capital, OpenAI

What They Do:

Open-source multi-modal foundation model for molecular structure
Unifies predictions across proteins, small molecules, DNA, RNA, covalent modifications

Technical Details:

Chai-1: 77% success rate on PoseBusters (vs 76% AlphaFold3)
Can operate without MSAs (reduces compute demands)
Chai-2: ~16% hit rate for de novo antibody design across 52 novel antigens

Relevance to This Project: ⭐⭐⭐ - Open-source approach is valuable - Zero-shot antibody design is impressive generative capability

Recursion¶


Website	https://www.recursion.com/
Focus	Phenomics + drug discovery
Key Technology	Phenom-Beta, BioHive-2 supercomputer

What They Do:

Merge AI with massive biological datasets
Process cellular microscopy images into general-purpose embeddings
In-silico fluorescent staining from brightfield images

Technical Details:

Phenom-Beta: vision transformer (ViT) with masked autoencoders
Trained on RxRx3 dataset (~2.2M images, ~17K knockouts, 1,674 chemicals)
BioHive-2: 504 NVIDIA H100 GPUs
Partnership with MIT for open-source protein co-folding model

Relevance to This Project: ⭐⭐⭐ - Phenomics (image-based) complements transcriptomics - Their self-supervised approach is relevant

Clinical & Treatment Response¶

Companies focused on clinical trials, treatment optimization, and patient response prediction.

Insilico Medicine¶


Website	https://insilico.com/
Focus	End-to-end AI drug discovery
Key Technology	Pharma.AI, Precious3GPT
Founded	2014
Funding	$110M Series E (2025)

What They Do:

Fully integrated drug discovery suite
PandaOmics: discover and prioritize novel targets
Chemistry42: generate novel molecules
InClinico: design and predict clinical trials

Technical Details:

Precious3GPT: multi-omics, cross-species foundation transformer for aging research
Ingests data from rats, monkeys, humans across transcriptomics, proteomics, methylation
Enables virtual experiments to forecast compound effects on aging hallmarks
Available on Hugging Face

Clinical Progress:

Rentosertib (ISM001-055): AI-discovered drug, Phase 2a results published in Nature Medicine
First AI-discovered drug to show clinical proof-of-concept

Relevance to This Project: ⭐⭐⭐⭐⭐ - Precious3GPT is directly relevant — multi-omics generative model - Their end-to-end approach shows the full pipeline - Clinical validation demonstrates real-world impact

Tempus¶


Website	https://www.tempus.com/
Focus	Precision medicine with real-world data
Key Technology	Tempus One (AI platform)

What They Do:

AI-enabled precision medicine
Predict response to therapies with greater accuracy
Uncover novel biomarkers from real-world data

Technical Details:

Integrates clinical data with AI-driven algorithms
Neural-network-based high-throughput drug screening
Generative AI capabilities for querying healthcare data

Relevance to This Project: ⭐⭐⭐ - Real-world data integration is important for validation - Treatment response prediction aligns with our goals

Owkin¶


Website	https://www.owkin.com/
Focus	Clinical trials and digital pathology
Key Technology	Federated learning, SecureFedYJ

What They Do:

AI models for drug discovery and clinical trial optimization
Federated learning: train AI without centralizing data
Digital pathology AI diagnostics

Technical Details:

Owkin Studio: federated learning platform (40% of revenue)
Owkin Connect: AI models for drug discovery (35% of revenue)
SecureFedYJ: secure federated learning algorithm

Partnerships:

Amgen: cardiovascular prediction
AstraZeneca: AI tool for gBRCA mutation screening

Relevance to This Project: ⭐⭐⭐ - Federated learning is important for privacy-preserving multi-site studies - Clinical trial optimization is downstream application

Retro Biosciences¶


Website	https://www.retro.bio/
Focus	Cellular reprogramming and longevity
Key Technology	GPT-4b micro (with OpenAI)
Funding	$1B (led by Sam Altman)

What They Do:

Interventions to slow or reverse cellular aging
Focus on neurodegeneration
Combine wet-lab biology with computational methods

Technical Details:

GPT-4b micro: biology-specialized foundation model
Trained on protein sequences, biological literature, tokenized 3D structural data
Redesigned Yamanaka transcription factors (RetroSOX, RetroKLF)
50-fold increases in pluripotency marker expression

Relevance to This Project: ⭐⭐⭐⭐ - Demonstrates LLM approach to biology - Reprogramming is a form of counterfactual intervention

AI-Driven Target Discovery¶

Companies using AI/ML for target identification and validation (not necessarily generative).

Ochre Bio¶


Website	https://www.ochre-bio.com/
Focus	RNA therapeutics for liver disease
Key Technology	OBELiX platform
Headquarters	Oxford, UK

What They Do:

Developing RNA medicines for chronic liver diseases
Built one of world's largest human liver functional genomics datasets (~120,000 samples)
Combine machine learning with human validation models

Technical Details:

Proprietary gene perturbation atlases + patient disease atlases
Make causal predictions about drug targets
Human validation: perfused livers, diseased tissue slices, primary cells
In-house RNA chemistry

Partnerships:

GSK: functional genomics and single-cell datasets
Boehringer Ingelheim: chronic liver disease research

Relevance to This Project: ⭐⭐⭐ - Large-scale functional genomics data is valuable - Causal predictions align with our counterfactual goals - Note: Not building generative models, but using ML for target discovery

Atomic AI¶


Website	https://atomic.ai/
Focus	RNA structure and drug discovery
Key Technology	ATOM-1, PARSE platform
Funding	~$42M (seed + Series A)

What They Do:

AI-driven RNA drug discovery with atomic precision
Predict RNA structural and functional properties
Optimize RNA-targeted and RNA-based modalities

Technical Details:

ATOM-1: large language model for RNA structure prediction
PARSE: Platform for AI-driven RNA Structure Exploration
Combined foundation-model + wet-lab loop

Relevance to This Project: ⭐⭐⭐ - RNA structure is complementary to expression - Their foundation model + wet lab loop is a good paradigm

Enveda Biosciences¶


Website	https://enveda.com/
Focus	Natural product drug discovery
Key Technology	PRISM foundation model
Funding	$300M+ (Series C + D, unicorn valuation)
Investors	Sanofi

What They Do:

Enhance molecular structure identification from natural products
Self-supervised learning on mass spectrometry data

Technical Details:

PRISM: Pretrained Representations Informed by Spectral Masking
Trained on 1.2 billion small molecule mass spectra
Masked peak modeling (similar to masked LM in NLP)

Relevance to This Project: ⭐⭐ - Different modality (mass spec vs expression) - Self-supervised approach is transferable

Key Observations & Research Directions¶

Trends in the Industry¶

Foundation Models Are Dominant
Most companies are building large-scale foundation models
Parameters range from millions to 210 billion (BioMap xTrimo)
Self-supervised pretraining is standard
Multi-Modal Integration
Leading platforms integrate multiple data types
Expression + protein + structure + images
Cross-species and cross-tissue modeling
Generative vs. Predictive
Clear distinction between:
- Generative: Synthesize Bio, Generate:Biomedicines, EvolutionaryScale
- Predictive: Ochre Bio, Tempus, Owkin
Generative models enable counterfactual reasoning
Clinical Translation
Insilico Medicine leads with AI-discovered drugs in clinic
Validation in human systems is critical
Regulatory pathway is becoming clearer
Open Source Movement
Helical, Chai Discovery pushing open-source
Democratization of bio AI tools
Opportunity for academic contribution

Research Directions for This Project¶

Based on industry analysis, priority areas:

Conditional Generation with Biological Constraints
Tissue/disease/batch conditioning (current focus)
Pathway-level constraints
Gene regulatory network priors
Counterfactual Reasoning
Treatment response prediction
Perturbation effect simulation
Causal inference integration
Multi-Modal Extension
Expression + protein (pseudobulk bridging)
Integration with structure predictions
Image-based phenomics
Evaluation Frameworks
DE agreement metrics
Pathway concordance
Batch leakage tests
Clinical validation proxies
Scalability
Efficient architectures (see Helical's 10% parameter efficiency)
Latent diffusion for high-dimensional data
Federated learning for multi-site data

References¶

17 Companies Pioneering AI Foundation Models in Pharma and Biotech
NVIDIA BioNeMo Platform
12 AI Drug Discovery Companies You Should Know
Individual company websites and press releases