Recent posts

Distinguishing Activation from Inhibition with Relation-Aware Graph Neural Networks

25 minute read

In my last post, I discussed self-supervised edge prediction as a way of embedding genes using a gene-regulatory network.

This approach allows genes, metabolites, drugs and other vertices to be connected based on shared network topology. However, to date I’ve only discussed edge prediction using a dot-product head, where a vertex pair’s edge support is a direct readout of the pair’s similarity in embedding space (𝐚 · 𝐛). While surprisingly powerful, this head has limitations when vertices are heterogeneous or interact in qualitatively different ways, particularly when we want to distinguish between activation and inhibition.

Here, I explore more expressive approaches for learning the mapping from A to B, evaluating both general edge prediction heads (like MLPs) and “relation-aware” heads that can learn distinct mappings for different edge types. The post will cover:

  • Data model and training changes enabling relation-specific predictions
  • Geometric analysis revealing how relation-aware heads encode regulatory semantics
  • PerturbSeq validation demonstrating successful prediction of signed regulatory interactions
  • Pre-trained models available on HuggingFace
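
To make the contrast between the two kinds of head concrete, here is a minimal sketch in plain Python. It is not the implementation from the post: the relation-aware head below is assumed to be a simple per-relation diagonal weighting of the dot product, and the embeddings, weights, and function names are all hypothetical, chosen only for illustration.

```python
# Minimal sketch: dot-product head vs. a relation-aware head.
# Assumption: the relation-aware head reweights each embedding dimension
# per relation; the heads in the actual post may differ.

def dot_product_score(a, b):
    """Edge support as plain similarity in embedding space (a . b)."""
    return sum(x * y for x, y in zip(a, b))


def relation_aware_score(a, b, weights):
    """Edge support for one relation: sum_i a_i * w_i * b_i."""
    return sum(x * w * y for x, w, y in zip(a, weights, b))


# Hypothetical 3-dimensional embeddings for two vertices.
gene_a = [1.0, 0.5, -1.0]
gene_b = [0.5, 1.0, 1.0]

# Hypothetical learned weights: one vector per edge type.
relation_weights = {
    "activation": [1.0, 1.0, 1.0],
    "inhibition": [1.0, 1.0, -1.0],
}

# The dot-product head yields a single score shared by every edge type.
shared = dot_product_score(gene_a, gene_b)

# The relation-aware head yields one score per edge type.
per_relation = {
    name: relation_aware_score(gene_a, gene_b, w)
    for name, w in relation_weights.items()
}

print(shared)        # single score, no notion of sign
print(per_relation)  # separate support for activation and inhibition
```

With these toy values the dot-product head gives the pair no support at all, while the relation-aware head can still assign high support to one specific edge type. That is the extra expressiveness needed to separate activation from inhibition.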

Napistu meets PyTorch Geometric - Predicting Regulatory Interactions with Graph Neural Networks

34 minute read

Biological applications of graph neural networks (GNNs) typically work with either small curated networks (hundreds to thousands of nodes) or aggressively filtered subsets of large databases like STRING. The Octopus graph, which I introduced in my previous post, occupies a different space entirely. By integrating eight complementary pathway databases, it creates a genome-scale network with ~50K proteins, metabolites, and complexes spanning ~10M edges, all while preserving rich metadata about edge provenance, confidence scores, and mechanistic detail that filtered approaches discard.

This puts the Octopus in uncharted territory: large enough to capture genome-scale complexity, yet structured enough to preserve the biological interpretability that makes network analysis valuable. GNNs scale well beyond genome-scale requirements (100M+ nodes in social networks), but remain unexplored for comprehensive biological networks that integrate regulatory, metabolic, and interaction data. Bridging this gap requires infrastructure that handles both the biological complexity of multi-source networks and the engineering complexity of training GNNs at scale.

In this post, I’ll introduce Napistu-Torch — the infrastructure that finally makes this space navigable. Available from PyPI and indexed by the Napistu MCP server, Napistu-Torch provides a modular, reproducible framework for training GNNs on comprehensive biological networks. I’ll demonstrate that it’s feasible to train graph convolutional networks on the complete Octopus network using just a laptop (albeit with 2 days of training time for the full suite of models). But the real contribution is the ecosystem: the data structures, pipelines, and evaluation strategies that unlock far more sophisticated analyses.

Napistu’s Octopus: An 8-source human consensus pathway model

20 minute read

Introducing the Octopus: Napistu’s eight-source Human Consensus Pathway Model that unites the breadth of protein-protein interaction networks with the depth of regulatory databases and metabolic models. The result is a genome-scale directed graph that is both densely connected and mechanistically precise. In this post, I will:

  • Provide an overview of the Octopus model and its construction
  • Show side-by-side summaries of individual data sources highlighting their complementarity
  • Demonstrate that the model successfully merges results, creating a dense network covering the complete cellular repertoire of genes, metabolites, drugs, and complexes
  • Illustrate how source-level information can be carried forward to the Octopus’s graphical network to augment its vertex and edge features

Building AI-Friendly Scientific Software: A Model Context Protocol Journey

23 minute read

In this post, I walk through building a remote Model Context Protocol (MCP) server that enhances AI agents’ ability to navigate and contribute meaningfully to the complex Napistu scientific codebase.

This tool empowers new users, advanced contributors, and AI agents alike to quickly access relevant project knowledge.

Before MCP, I fed Claude a mix of README files, wikis, and raw code hoping for useful answers. Tools like Cursor struggled with the tangled structure, sparking the idea for the Napistu MCP server.

I’ll cover:

  • Why I built the Napistu MCP server and the problems it solves
  • How I deployed it using GitHub Actions and Google Cloud Run
  • Case studies showing how AI agents perform with — and without — MCP context

Network Biology with Napistu, Part 2: Translating Statistical Associations into Biological Mechanisms

33 minute read

This is part two of a two-part series on Napistu — a new framework for building genome-scale molecular networks and integrating them with high-dimensional data. Using a methylmalonic acidemia (MMA) multimodal dataset as a case study, I’ll demonstrate how to distill disease-relevant signals into mechanistic insights through network-based analysis.

From statistical associations to biological mechanisms

Modern genomics excels at identifying disease-associated genes and proteins through statistical analysis. Methods like Gene Set Enrichment Analysis (GSEA) group these genes into functional categories, offering useful biological context. However, we aim to go beyond simply identifying which genes and gene sets change. Our goal is to understand why these genes change together, uncovering the mechanistic depth typically seen in Figure 1 of a Cell paper. To achieve this, we must identify key molecular components, summarize their interactions, and characterize the dynamic cascades that drive emergent biological behavior.

In this post, I’ll demonstrate how to gain this insight by mapping statistical disease signatures onto genome-scale biological networks. Then, using personalized PageRank, I’ll trace signals from dysregulated genes back to their shared regulatory origins. This transforms lists of differentially expressed genes into interconnected modules that reveal upstream mechanisms driving coordinated molecular changes.
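
The idea of tracing signals upstream with personalized PageRank can be sketched with a small power iteration. This is an illustrative toy, not Napistu's implementation: the vertex names, the damping factor, and the choice to walk regulatory edges in reverse (from target to regulator) are all assumptions made for the example.

```python
# Minimal sketch of personalized PageRank by power iteration.
# Assumptions: toy graph, hypothetical vertex names, damping of 0.85.

def personalized_pagerank(edges, seeds, damping=0.85, iterations=100):
    """Score vertices by how often a random walk restarting at `seeds` visits them."""
    seeds = set(seeds)
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out_neighbors = {n: [] for n in nodes}
    for source, target in edges:
        out_neighbors[source].append(target)

    # Restart distribution: uniform over the seed vertices.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)

    for _ in range(iterations):
        updated = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            targets = out_neighbors[n]
            if targets:
                share = damping * scores[n] / len(targets)
                for t in targets:
                    updated[t] += share
            else:
                # Dangling vertex: send its mass back to the seeds.
                for m in nodes:
                    updated[m] += damping * scores[n] * restart[m]
        scores = updated
    return scores


# Hypothetical regulatory edges (regulator -> target).
regulatory_edges = [
    ("TF_shared", "gene_1"),
    ("TF_shared", "gene_2"),
    ("TF_other", "gene_3"),
]

# Reverse the edges so the walk moves from targets back to regulators.
upstream_edges = [(target, regulator) for regulator, target in regulatory_edges]

# Seed the walk at the (hypothetical) dysregulated genes.
scores = personalized_pagerank(upstream_edges, seeds=["gene_1", "gene_2"])
print(sorted(scores.items(), key=lambda item: -item[1]))
```

In this toy graph the regulator shared by both seed genes accumulates score, while the unrelated regulator receives none. That is the behavior that lets a list of dysregulated genes point back to a common upstream origin.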