Recent posts

Building AI-Friendly Scientific Software: A Model Context Protocol Journey

23 minute read

In this post, I walk through building a remote Model Context Protocol (MCP) server that enhances AI agents’ ability to navigate and contribute meaningfully to the complex Napistu scientific codebase.

This tool empowers new users, advanced contributors, and AI agents alike to quickly access relevant project knowledge.

Before MCP, I fed Claude a mix of README files, wikis, and raw code, hoping for useful answers. Tools like Cursor struggled with the tangled structure, sparking the idea for the Napistu MCP server.

I’ll cover:

  • Why I built the Napistu MCP server and the problems it solves
  • How I deployed it using GitHub Actions and Google Cloud Run
  • Case studies showing how AI agents perform with — and without — MCP context

Network Biology with Napistu, Part 2: Translating Statistical Associations into Biological Mechanisms

33 minute read

This is part two of a two-part series on Napistu — a new framework for building genome-scale molecular networks and integrating them with high-dimensional data. Using a methylmalonic acidemia (MMA) multimodal dataset as a case study, I’ll demonstrate how to distill disease-relevant signals into mechanistic insights through network-based analysis.

From statistical associations to biological mechanisms

Modern genomics excels at identifying disease-associated genes and proteins through statistical analysis. Methods like Gene Set Enrichment Analysis (GSEA) group these genes into functional categories, offering useful biological context. However, we aim to go beyond simply identifying which genes and gene sets change. Our goal is to understand why these genes change together, uncovering the mechanistic depth typically seen in Figure 1 of a Cell paper. To achieve this, we must identify key molecular components, summarize their interactions, and characterize the dynamic cascades that drive emergent biological behavior.

In this post, I’ll demonstrate how to gain this insight by mapping statistical disease signatures onto genome-scale biological networks. Then, using personalized PageRank, I’ll trace signals from dysregulated genes back to their shared regulatory origins. This transforms lists of differentially expressed genes into interconnected modules that reveal upstream mechanisms driving coordinated molecular changes.
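The signal-tracing idea above can be sketched with a toy personalized PageRank implemented as a plain power iteration. The network, gene names, and parameter values here are illustrative inventions for this sketch, not Napistu's API or the MMA dataset: edges are reversed so mass flows from dysregulated genes back toward their shared regulators.

```python
# Toy regulatory network: (regulator, target) edges. Names are made up.
edges = [("TF_A", "g1"), ("TF_A", "g2"), ("TF_B", "g2"), ("TF_B", "g3")]

# Reverse edges so the walk travels from dysregulated genes to regulators.
rev = {}
for src, dst in edges:
    rev.setdefault(dst, []).append(src)

nodes = {n for edge in edges for n in edge}
seeds = {"g1", "g2", "g3"}  # dysregulated genes seed the restart distribution
p = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}

alpha = 0.85                 # damping factor
scores = dict(p)             # start from the seed distribution

for _ in range(100):         # power iteration to (approximate) convergence
    nxt = {n: (1 - alpha) * p[n] for n in nodes}
    for n in nodes:
        targets = rev.get(n, [])
        if targets:
            share = alpha * scores[n] / len(targets)
            for m in targets:
                nxt[m] += share
        else:
            # Dangling node: return its mass to the seed distribution.
            for m in nodes:
                nxt[m] += alpha * scores[n] * p[m]
    scores = nxt

# Regulators upstream of multiple seed genes accumulate the most mass.
top = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph both transcription factors sit upstream of seed genes, so they end up ranked above any individual seed; with real genome-scale networks the same ranking highlights candidate shared regulatory origins.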

Network Biology with Napistu, Part 1: Creating Multimodal Disease Profiles

29 minute read

This is part one of a two-part post highlighting Napistu — a new framework for building genome-scale networks of molecular biology and biochemistry. In this post, I’ll tackle a fundamental challenge in computational biology: how to extract meaningful disease signatures from complex multimodal datasets.

Using methylmalonic acidemia (MMA) as my test case, I’ll demonstrate how to systematically extract disease signatures from multimodal data. My approach combines three complementary analytical strategies: exploratory data analysis to assess data structure and quality, differential expression analysis to identify disease-associated features, and factor analysis to uncover coordinated gene expression programs across data types. The end goal is to distill thousands of molecular measurements into a handful of interpretable disease signatures — each capturing a distinct aspect of disease biology that can be mapped to regulatory networks.

Throughout this post, I’ll use two types of asides to provide additional context without disrupting the main analytical flow. Green boxes contain biological details, while blue boxes reflect on the computational workflow and AI-assisted development process.

Flattening the Gompertz Distribution

15 minute read

In this post I’ll explore the Gompertz law of mortality, which describes individuals’ accelerating risk of death with age.

The Gompertz equation describes the per-year hazard (i.e., the risk of dying between time $t$ and $t+1$) as the product of an age-independent parameter $\alpha$ and an age-dependent component that increases exponentially with time, scaled by a second parameter $\beta$ ($e^{\beta \cdot t}$).

The equation is thus:

\[\large h(t) = \alpha \cdot e^{\beta \cdot t}\]

The Gompertz equation is often studied on the natural-log scale, which yields a linear relationship between log-hazard and age.

\[\large \ln(h(t)) = \ln(\alpha) + \beta \cdot t\]

Formulating and estimating the parameters of demographic hazard models like the Gompertz equation is an active area of research, and there is a wealth of material catering to both academic and lay audiences. Still, when reviewing this literature, I did not see a clear summary of how decreases in $\beta$ (the chief aim of longevity research) would lead to lifespan extension.
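To make the $\beta$ question concrete, here is a minimal sketch that numerically integrates the Gompertz survival function, $S(t) = \exp(-\frac{\alpha}{\beta}(e^{\beta t} - 1))$, to estimate life expectancy under two values of $\beta$. The parameter values are illustrative, roughly human-scale choices, not fitted estimates from any dataset.

```python
import math

def gompertz_survival(t, alpha, beta):
    # S(t) = exp(-(alpha/beta) * (exp(beta*t) - 1)),
    # the survival function implied by the hazard h(t) = alpha * exp(beta*t)
    return math.exp(-(alpha / beta) * (math.exp(beta * t) - 1))

def life_expectancy(alpha, beta, t_max=150, dt=0.01):
    # E[T] = integral of S(t) dt from 0 to infinity, via the trapezoid rule;
    # t_max = 150 years is far past where S(t) is effectively zero here.
    ts = [i * dt for i in range(int(t_max / dt) + 1)]
    s = [gompertz_survival(t, alpha, beta) for t in ts]
    return dt * (sum(s) - 0.5 * (s[0] + s[-1]))

e_base = life_expectancy(alpha=1e-4, beta=0.085)  # illustrative baseline
e_flat = life_expectancy(alpha=1e-4, beta=0.070)  # "flattened" hazard curve
```

With these made-up but plausible parameters, lowering $\beta$ while holding $\alpha$ fixed adds roughly a decade of expected lifespan, which is the kind of comparison the post explores in detail.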

False Discovery Rate (FDR) Overview and lFDR-Based Shrinkage

9 minute read

Coming from a quantitative genetics background, correcting for multiple comparisons meant controlling the family-wise error rate (FWER) using a procedure like Bonferroni correction. This all changed when I took John Storey’s “Advanced Statistics for Biology” class in grad school. John is an expert in the statistical interpretation of high-dimensional data and literally wrote the book (well, paper) on the false discovery rate (FDR) as an author of Storey & Tibshirani 2006. His description of the FDR has grounded my interpretation of hundreds of genomic datasets, and I’ve continued to pay this knowledge forward with dozens of whiteboard-style descriptions of the FDR for colleagues. As an interviewer and paper reviewer, I still regularly see accomplished individuals and groups for whom “FDR control” is a clear blind spot. In this post I’ll lay out how I whiteboard the FDR problem, and then highlight a specialized application of the FDR for “denoising” genomic datasets.
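As a concrete anchor for the FWER-versus-FDR contrast above, here is a minimal sketch of the Benjamini–Hochberg step-up procedure, one standard FDR-controlling method. Storey’s q-value framework, which the post follows, differs in details such as estimating the null proportion, so treat this as illustrative rather than the post’s exact procedure; the p-values are made up.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity:
    # q for the rank-k test is min over j >= k of p_(j) * m / j.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev
    return q

# Illustrative p-values: rejecting tests with q <= 0.05 targets an
# expected false-discovery proportion of at most 5% among rejections.
pvals = [0.001, 0.008, 0.039, 0.041, 0.30, 0.74]
qvals = bh_qvalues(pvals)
```

Note how the two middle p-values (0.039 and 0.041) receive the same adjusted value: the step-up minimum propagates downward, which is what keeps q-values monotone in the original p-values.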