Reproducing iPSC Interpolation With Squidiff: A Troubleshooting Guide
Introduction: The Challenge of Reproducing Fig. 2
Hey there! We're diving deep into the world of iPSC (induced pluripotent stem cell) research and hitting a bit of a snag while trying to reproduce the cool results from Fig. 2 of a recent paper. The goal? To use Squidiff to interpolate between day 0 and day 3 iPSC cells and accurately predict the states of cells at days 1 and 2. The paper claims impressive accuracy, showing smooth differentiation trajectories. However, when we tried to replicate the experiment, our results didn't quite match up. Our interpolated cells were significantly off, failing to align with the real day 1/2 cell distributions. This is where the troubleshooting begins, and we're hoping to uncover the missing pieces.
The Essence of the Experiment
The original experiment, as we understand it, trains a Squidiff model on data from day 0 and day 3 iPSC cells and then predicts the transcriptomic profiles of cells at the intermediate time points, days 1 and 2. Success means the predicted profiles closely match the actual day 1 and day 2 data. We followed the paper's preprocessing steps, training configuration, and evaluation procedure as closely as possible, yet our results fell well short.
Our Replication Attempt: Step-by-Step
Let's walk through the steps we took to reproduce the experiment, in enough detail that you can spot where we might have gone wrong.
Data Preparation: The Foundation of Any Experiment
First, we grabbed the iPSC dataset referenced in the paper. Think of this as the raw material for our experiment. We meticulously processed the data to ensure high quality and relevance.
- Filtering the Data: We filtered the cells to include only day 0 to day 3 samples. We also removed low-quality cells based on total counts, the number of expressed genes, and the mitochondrial fraction. This step is about cleaning up the data to make it more reliable for training.
- Normalization and Transformation: We normalized the data per cell to a uniform count and applied a log1p transformation. This step is crucial for scaling the data and reducing the impact of extreme values.
- Gene Selection: We selected 203 highly variable genes, as described in the paper. We also made sure to include the marker genes mentioned around Fig. 2 (NANOG, T, GATA6, etc.). This focused our analysis on the most informative genes.
- Data Splitting: We created a training set with a balanced sample of 1,200 cells from day 0 and 1,200 cells from day 3. The evaluation set consisted of 600 cells per day (days 0-3) and was used purely for evaluating the interpolation. This separation is essential for assessing how well the model generalizes to new data; a sketch of the full preprocessing pipeline follows this list.
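To make the pipeline concrete, here is a minimal sketch of these preprocessing steps using Scanpy. The input file name, QC thresholds, and normalization target are placeholders we chose ourselves, not values taken from the paper; only the day range, the log1p transform, the 203 highly variable genes, the marker-gene inclusion, and the 1,200/600-cell split come from the steps above.

```python
# Minimal preprocessing sketch (Scanpy). File name, QC thresholds, and target_sum are
# placeholders; the day filter, log1p, 203 HVGs, markers, and split sizes follow the text.
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("ipsc_day0_to_day3.h5ad")             # placeholder path
adata = adata[adata.obs["day"].isin([0, 1, 2, 3])].copy()  # assumes an integer 'day' column

# Basic QC filtering on total counts, number of expressed genes, and mitochondrial fraction
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
adata = adata[
    (adata.obs["total_counts"] > 1000)
    & (adata.obs["n_genes_by_counts"] > 500)
    & (adata.obs["pct_counts_mt"] < 15)
].copy()

# Per-cell normalization to a uniform count, then log1p
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# 203 highly variable genes, with the Fig. 2 marker genes forced in
sc.pp.highly_variable_genes(adata, n_top_genes=203)
adata.var.loc[adata.var_names.isin(["NANOG", "T", "GATA6"]), "highly_variable"] = True
adata = adata[:, adata.var["highly_variable"]].copy()

# Balanced training split (1,200 cells each from days 0 and 3), 600 cells/day for evaluation
rng = np.random.default_rng(0)
train_idx, eval_idx = [], []
for day in [0, 1, 2, 3]:
    cells = np.where(adata.obs["day"].values == day)[0]
    rng.shuffle(cells)
    if day in (0, 3):
        train_idx.extend(cells[:1200])
        eval_idx.extend(cells[1200:1800])
    else:
        eval_idx.extend(cells[:600])
train_adata, eval_adata = adata[train_idx].copy(), adata[eval_idx].copy()
```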
Training Configuration: Setting Up the Model
With the data prepared, we moved on to the training phase. Here's a breakdown of the training configuration:
- Environment Setup: We used CUDA_VISIBLE_DEVICES=2 to specify the GPU for training.
- Training Command: We ran the train_squidiff.py script with the following parameters:
  - --logger_path: where to save the logs.
  - --resume_checkpoint: to resume from a checkpoint.
  - --data_path: the path to the training data.
  - --gene_size and --output_dim: the number of genes.
  - --num_layers: the number of layers in the model.
  - --batch_size: the batch size for training.
  - --lr: the learning rate.
  - --diffusion_steps: the number of diffusion steps.
  - --lr_anneal_steps: the number of learning-rate annealing steps.
  - --log_interval and --save_interval: logging and saving intervals.
  - --use_drug_structure: a flag for using drug structure information.
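For reference, this is roughly how we launched training. The flag names are the ones listed above; every value marked "placeholder" in the comments is our own guess rather than a setting taken from the paper, and gene_size/output_dim simply reflect our 203-gene selection.

```python
# Rough sketch of our training launch. Flag names come from train_squidiff.py as listed
# above; "placeholder" values are our guesses, not settings confirmed by the paper.
import os
import subprocess

os.environ["CUDA_VISIBLE_DEVICES"] = "2"

cmd = [
    "python", "train_squidiff.py",
    "--logger_path", "logs/ipsc_interp",    # placeholder log directory
    "--data_path", "data/ipsc_train.h5ad",  # placeholder path to the training split
    "--gene_size", "203",                   # number of selected genes
    "--output_dim", "203",                  # matches gene_size
    "--num_layers", "3",                    # placeholder
    "--batch_size", "128",                  # placeholder
    "--lr", "1e-4",                         # placeholder
    "--diffusion_steps", "1000",            # placeholder
    "--lr_anneal_steps", "400000",          # placeholder
    "--log_interval", "1000",               # placeholder
    "--save_interval", "10000",             # placeholder
    # --resume_checkpoint and --use_drug_structure were also listed above; we omit them
    # here because we are unsure of the exact value format the iPSC run expects.
]
subprocess.run(cmd, check=True)
```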
The training process resulted in a loss of approximately 0.03, and the gradients appeared stable.
Evaluation Procedure: Measuring the Results
After training, we evaluated the model's performance using these steps:
- Encoding: We encoded the evaluation split (day 0-3), treating Group 0 as the starting point (day 0) and Group 3 as the target (day 3).
- Interpolation: We computed the change in the latent space and interpolated at fractions corresponding to days 1 and 2.
- Cell Generation: We generated 1,000 cells per interpolated day and compared them to the real day 1/2 cells.
- Comparison: We used UMAP overlays, mean-expression scatter plots, and several metrics (Pearson, R², MAE, RMSE) to compare the generated and real cells; a sketch of how we computed these metrics follows this list.
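Here's a condensed sketch of the interpolation arithmetic and the metric computation; the interpolation function is our reading of "computing the change in the latent space", not code from the paper. The arrays z0 and z3 stand for whatever latent codes the Squidiff encoder produces for the day 0 and day 3 evaluation cells (obtaining them is model-specific and omitted here); gen and real are (cells × genes) log-normalized expression matrices.

```python
# Interpolation and comparison sketch. z0/z3 are latent codes for the day-0 and day-3
# evaluation cells (how they come out of Squidiff is omitted); gen/real are
# (cells x genes) log-normalized expression matrices.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def interpolate_latents(z0: np.ndarray, z3: np.ndarray, day: int) -> np.ndarray:
    """Shift each day-0 latent code toward day 3 by the fraction day/3."""
    delta = z3.mean(axis=0) - z0.mean(axis=0)
    return z0 + (day / 3.0) * delta

def compare(gen: np.ndarray, real: np.ndarray) -> dict:
    """Mean-level and flattened metrics between generated and real cells."""
    gen_mean, real_mean = gen.mean(axis=0), real.mean(axis=0)
    n = min(gen.shape[0], real.shape[0])  # equal cell counts for the flattened comparison
    return {
        "pearson_mean": pearsonr(real_mean, gen_mean)[0],
        "r2_mean": r2_score(real_mean, gen_mean),
        "rmse_mean": float(np.sqrt(mean_squared_error(real_mean, gen_mean))),
        "mae_mean": mean_absolute_error(real_mean, gen_mean),
        "r2_flat": r2_score(real[:n].ravel(), gen[:n].ravel()),
    }
```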
Observed Results: The Discrepancy
Our evaluation showed a significant discrepancy between the predicted and actual cell states. The metrics came out roughly as follows for both the day 1 and day 2 comparisons:
- Mean-level Pearson: ≈ 0.17
- R²: ≈ -0.04
- Flattened R²: ≈ -0.06
- RMSE: ≈ 0.58
- MAE: ≈ 0.30
The UMAPs showed essentially no overlap between the real and generated cells. Rather than capturing the day 1/2 states, the model behaved more like a noisy baseline, and its predictions showed none of the smooth differentiation trajectory described in the paper.
Troubleshooting: Key Questions and Potential Solutions
We're now at the heart of our troubleshooting efforts. Here are some of the key questions we're asking to identify what might be causing the discrepancy. Addressing these questions is critical to aligning our replication with the paper's findings.
1. Are There Any Missing Preprocessing Steps?
Could we be missing crucial preprocessing steps, such as specific scaling methods, gene ordering, or semantic encoder options? Detailed information on these aspects is often found in the supplementary materials or the code provided with the original paper. We need to meticulously review the preprocessing pipeline.
2. Are Different Architecture Settings or Training Tricks Required?
Do the iPSC experiments use specific architecture settings (gene_size, output_dim, num_layers, etc.) or training tricks (curriculum learning, a lower learning rate) that we should replicate? These choices often have a large impact on performance, so we need to confirm that our setup matches the paper's.
3. Is the Correct Checkpoint Available?
Is the Figshare checkpoint (VO_diff_model.pt) intended for this iPSC interpolation, or is there a separate checkpoint for Fig. 2 that we should request? Using the right pre-trained model can be a game-changer. We need to confirm whether we are using the intended checkpoint for this experiment.
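One sanity check we can run without knowing Squidiff's internals is to load the checkpoint with PyTorch and print its tensor shapes: if no weight matrix has a 203-gene dimension, the checkpoint was presumably trained on a different gene set (or a different experiment) than the Fig. 2 setup we are reproducing. The key names printed are simply whatever the file contains; we make no assumption about the architecture.

```python
# Print every tensor shape in the Figshare checkpoint to check whether its input/output
# dimensions match our 203-gene preprocessing.
import torch

ckpt = torch.load("VO_diff_model.pt", map_location="cpu")
# Checkpoints are sometimes a raw state_dict, sometimes a dict wrapping one.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

for name, tensor in state_dict.items():
    if hasattr(tensor, "shape"):
        print(f"{name}: {tuple(tensor.shape)}")
```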
4. Why Is R² Negative Despite Following the Description?
What could be causing our R² scores to remain negative, even though we followed the published description? Negative R² values suggest that the model's predictions are worse than simply using the mean of the observed data. This is a critical issue that we need to resolve.
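For clarity on what "worse than the mean" means here: the standard coefficient of determination (what, for example, scikit-learn's r2_score computes) is R² = 1 − SS_res / SS_tot, where SS_res = Σᵢ (yᵢ − ŷᵢ)² and SS_tot = Σᵢ (yᵢ − ȳ)². A predictor that always outputs the observed mean ȳ scores exactly 0, so our values of roughly −0.04 to −0.06 mean the interpolated profiles explain none of the variance in the real day 1/2 expression.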
Conclusion: Seeking Solutions for Accurate iPSC Interpolation
Our attempt to reproduce the iPSC interpolation results from Fig. 2 has presented some real challenges. Despite our best efforts to replicate the experimental setup, the model's performance has not matched the reported accuracy. By systematically examining our preprocessing, training configuration, and evaluation procedure, and by seeking answers to the questions above, we hope to identify the root causes of the discrepancy and get closer to accurately predicting single-cell transcriptomics across the day 0-3 iPSC differentiation window, ultimately helping to accelerate stem cell research. We will continue to troubleshoot, refine our approach, and share our findings to contribute to the reproducibility of this line of work.
For further insights into the challenges of replicating scientific studies and best practices, check out the Reproducibility Project to deepen your understanding.