Quick Start: Generate Expression Data for a Customised Tissue
Creator: Amir Akbarnejad (aa36@sanger.ac.uk)
Affiliation: Wellcome Sanger Institute and University of Cambridge
Date of Creation: 01.07.2025
Date of Last Modification: 01.07.2025
This tutorial demonstrates how to generate in silico spatial expression data using MintFlow.
To run the notebook, you only need to modify the parts marked with TODO:MODIFY:. Everything else can be left untouched, as long as the goal is simply to run the notebook.
This notebook is only for demonstration, and to get biologically meaningful results you may need different data and/or settings.
import os, sys
import yaml
import mintflow
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
from tqdm.autonotebook import tqdm
import numpy as np
from pprint import pprint
import pandas as pd
import seaborn as sns
import torch
import mintflow.interface.perturbation.module_gen_micsizefactor
import mintflow.interface.perturbation.module_gen_stdata
1. Overview
To generate in silico spatial expression data, you need to provide MintFlow with:
1. A cell-cell neighbourhood graph (usually computed from the cells’ spatial locations)
2. The cells’ cell type labels
3. A batch index integer, i.e., the index of the training batch that the tissue hypothetically belongs to. Generation is conditioned on the batch token of one of the biological/technical batches seen during training.
Given the specified tissue (i.e. given items 1, 2, and 3), the generative model can then generate spatial expression data and, if asked, multiple samples or realisations of it.
Note:
Novel cell type labels, i.e. cell type labels that the model has never seen during training, are not allowed.
If the batch index is set to, e.g., 0, the specified tissue (i.e. its cell type labels and neighbourhood graph) does not have to be a crop or a subset of the 0-th biological batch in the training set. Instead, you can freely create a de novo tissue by
arbitrarily specifying cells’ 2D locations,
computing the neighbourhood graph based on the cells’ locations,
and arbitrarily assigning cell type labels of your choice to the cells.
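The three steps above can be sketched with NumPy alone. This is only a minimal illustration under stated assumptions, not MintFlow's own procedure: the random layout, the labels `'TypeA'` and `'TypeB'`, and the helper `knn_adjacency` are hypothetical. In this notebook, the coordinates and labels would instead live in an anndata object, and the graph would be computed with `sq.gr.spatial_neighbors` as in Section 4.3.

```python
import numpy as np


def knn_adjacency(coords: np.ndarray, n_neighs: int) -> np.ndarray:
    """Binary k-nearest-neighbour adjacency matrix (hypothetical helper).

    Row i has a 1 in the columns of the n_neighs cells closest to cell i.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :n_neighs]
    adj = np.zeros(dist.shape, dtype=int)
    rows = np.repeat(np.arange(coords.shape[0]), n_neighs)
    adj[rows, nearest.ravel()] = 1
    return adj


rng = np.random.default_rng(0)

# Step 1: arbitrarily specify 2D locations for 50 cells.
coords = rng.uniform(0.0, 100.0, size=(50, 2))

# Step 2: compute the neighbourhood graph from the locations.
adjacency = knn_adjacency(coords, n_neighs=5)

# Step 3: arbitrarily assign cell type labels (placeholder names here;
# real labels must be among those seen during training).
cell_types = np.where(coords[:, 0] < 50.0, 'TypeA', 'TypeB')
```

The dense pairwise-distance matrix keeps the sketch short but scales quadratically with the number of cells, so it is only suitable for small toy tissues.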
In the following, we demonstrate the steps for doing this.
2. Download a sample anndata object and a sample MintFlow checkpoint
Download this sample .h5ad file from Google Drive (link to the file on google drive) and place it in a directory of your choice. Then set the variable path_anndata below to the path where you placed the .h5ad file.
In the first tutorial notebook we demonstrated how to save a checkpoint to disk by calling mintflow.dump_checkpoint. Download this sample checkpoint file from Google Drive (link to the file on google drive) and place it in a directory of your choice. Then set the variable path_checkpoint below to the path where you placed the .pt file.
path_anndata = './NonGit/data_train_single_section.h5ad'
# TODO:MODIFY: set to the path where you've put the `.h5ad` file that you downloaded above.
path_checkpoint = './NonGit/sample_checkpoint.pt'
# TODO:MODIFY: set to the path where you've put the `.pt` file that you downloaded above.
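Before moving on, it can help to confirm that both paths actually point to files. The helper `missing_files` below is a hypothetical convenience function, not part of MintFlow; in the notebook one would call it with `path_anndata` and `path_checkpoint`.

```python
from pathlib import Path


def missing_files(*paths: str) -> list:
    """Return the subset of the given paths that are not existing files."""
    return [p for p in paths if not Path(p).is_file()]


# Toy call with a path that is not expected to exist.
print(missing_files('./this_file_should_not_exist.h5ad'))
```

An empty list means every path was found.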
3. Load the MintFlow checkpoint
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
checkpoint_mintflow = torch.load(
path_checkpoint,
map_location='cpu',
weights_only=False
)
checkpoint_mintflow['model'].to(device)
print("Loaded the checkpoint.")
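The rest of this notebook reads three entries from the loaded checkpoint: `'model'`, `'data_mintflow'`, and `'dict_all4_configs'`. A small check like the sketch below can catch an incomplete checkpoint early. The helper name is hypothetical, the key list is taken only from the entries used in this notebook, and a real checkpoint may contain more.

```python
# Keys that later cells of this notebook read from the checkpoint.
REQUIRED_KEYS = ('model', 'data_mintflow', 'dict_all4_configs')


def missing_checkpoint_keys(checkpoint: dict) -> list:
    """Return the required keys that are absent from the checkpoint dict."""
    return [key for key in REQUIRED_KEYS if key not in checkpoint]


# Toy dictionary standing in for `checkpoint_mintflow`.
toy_checkpoint = {'model': object(), 'data_mintflow': object()}
print(missing_checkpoint_keys(toy_checkpoint))  # ['dict_all4_configs']
```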
4. Make a tissue with customised cell type labels and cell spatial locations
As explained above, one can arbitrarily specify cells’ spatial locations and cell type labels. But here we simply use a crop of the tissue used for training.
4.1. Load the anndata object
adata = sc.read_h5ad(
path_anndata
)
4.2. Select a crop from it
adata = adata[
    (adata.obs['x_centroid'] > 5000.0) & (adata.obs['x_centroid'] < 6000.0) &
    (adata.obs['y_centroid'] > 2100.0) & (adata.obs['y_centroid'] < 2500.0)
].copy() # select a crop and copy it, so that later in-place edits do not act on a view
4.3. Create the neighbourhood graph
# create the neighbourhood graph
kwargs_neighbourhood_graph = {
'spatial_key': 'spatial',
'library_key': None,
'set_diag': False,
'delaunay': False,
'n_neighs': 5
}
adata.uns = {}
sq.gr.spatial_neighbors(
adata=adata,
**kwargs_neighbourhood_graph
)
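With its default settings, `sq.gr.spatial_neighbors` writes the graph to `adata.obsp['spatial_connectivities']`, the key passed to MintFlow later in this notebook. A quick sanity check is to count the neighbours of each cell. The sketch below uses a toy 3-cell adjacency matrix; `neighbour_counts` is a hypothetical helper that should also accept the sparse matrix stored in `adata.obsp`.

```python
import numpy as np


def neighbour_counts(connectivities) -> np.ndarray:
    """Number of neighbours per cell, i.e. row sums of a binary adjacency matrix."""
    return np.asarray(connectivities.sum(axis=1)).ravel()


# Toy adjacency matrix: cell 0 is connected to cells 1 and 2.
toy_graph = np.array([
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
])
counts = neighbour_counts(toy_graph)
print(counts.min(), counts.max())  # 1 2
```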
4.4. Visualise the selected tissue crop
sc.pl.spatial(
adata,
spot_size=5,
color='broad_celltypes'
)
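Because the model cannot handle cell type labels it has never seen, it may be worth comparing the labels of the customised tissue against the training labels before generation. The sketch below assumes you already have the list of training cell types at hand; how to read that list from the checkpoint is not covered in this notebook, so `known_cell_types` and the label names are placeholders. In the notebook, `labels` would be `adata.obs['broad_celltypes']`.

```python
def unseen_cell_types(labels, known_cell_types) -> set:
    """Return the labels that do not appear among the known (training) cell types."""
    return set(labels) - set(known_cell_types)


known_cell_types = ['TypeA', 'TypeB']          # hypothetical training labels
labels = ['TypeA', 'TypeB', 'TypeA', 'TypeC']  # hypothetical tissue labels
print(unseen_cell_types(labels, known_cell_types))  # {'TypeC'}
```

An empty set means every label in the tissue was seen during training.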
5. Generate expression data for the specified tissue
Having created the customised tissue section (i.e. given items 1, 2, and 3 explained at the beginning of the notebook), we now proceed to generate expression data for it. This is done by calling mintflow.generate_insilico_ST_data.
Some important arguments to pass to the function mintflow.generate_insilico_ST_data:
obskey_celltype: the name of the .obs column that contains the cell type labels. Cell type labels have to be among the ones seen during training.
batch_index_trainingdata: generation is conditioned on the batch index as well. For example, if batch_index_trainingdata is set to 1, generation is conditioned on the batch with index 1 seen during training. Note that this index is zero-based. To check the batch index assigned to each tissue section, you can run the cell below. In this tutorial we have a single tissue section, and therefore a single batch, so batch_index_trainingdata is set to 0.
estimate_spatial_sizefactors_on_sections: to generate Xint and Xmic, two size factors are needed. To generate these size factors, MintFlow filters out cells with similar cell type labels and MCC vectors in some tissue sections. This argument specifies the tissue section(s) used for this purpose. In this tutorial we have a single tissue section, therefore estimate_spatial_sizefactors_on_sections is set to [0].
# prints the batch index assigned to each tissue section in the training set
pprint(checkpoint_mintflow['data_mintflow']['train_list_tissue_section'].map_Batchname_to_inflowBatchID)
result_generation = mintflow.generate_insilico_ST_data(
adata=adata,
obskey_celltype='broad_celltypes',
obspkey_neighbourhood_graph='spatial_connectivities',
device=device,
batch_index_trainingdata=0,
num_generated_realisations=3,
model=checkpoint_mintflow['model'],
data_mintflow=checkpoint_mintflow['data_mintflow'],
dict_all4_configs=checkpoint_mintflow['dict_all4_configs'],
estimate_spatial_sizefactors_on_sections=[0]
)
The above cell generates num_generated_realisations=3 samples, or “realisations”, of expression data for the tissue. The variation in the expression of each gene across the generated realisations can be informative.
We can obtain, e.g., the average microenvironment component of the generated expression as follows:
Xmic_average = np.stack(
[realisation['MintFLow_Generated_Xmic'] for realisation in result_generation['list_generated_realisations_ie_expressions']]
).mean(0)
Intuitively, Xmic_average represents what the generative model considers, “on average”, the microenvironment-induced part of the expression to be, given the provided cell locations and cell type labels.
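The same stacking pattern can be used to quantify the variation across realisations. The sketch below is only an illustration on toy data: it assumes that each realisation is a dense array of shape (cells, genes), and the Poisson-sampled arrays merely stand in for the generated Xmic matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy "realisations" for 4 cells and 3 genes (stand-ins for generated Xmic).
toy_realisations = [rng.poisson(2.0, size=(4, 3)).astype(float) for _ in range(3)]

stacked = np.stack(toy_realisations)       # shape: (realisations, cells, genes)
per_entry_std = stacked.std(axis=0)        # spread per cell and gene
per_gene_std = per_entry_std.mean(axis=0)  # average spread per gene
print(per_gene_std.shape)  # (3,)
```

Genes with a larger `per_gene_std` are those whose generated expression differs more from one realisation to the next.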