Aggregate VSI per cell

Here, we will demonstrate the workflow how to aggregate the VSI information for a cell from all its associated pixels in the ovrlpy analysis using a Xenium dataset, but conceptually this works the same for other imaging-based SRT platforms, as well.

from pathlib import Path

import matplotlib
import matplotlib.pyplot as plt
import polars as pl
import seaborn as sns

import ovrlpy

matplotlib.rc("font", size=6)

data_folder = Path("path/to/data")

We start by reading the transcript data and running the ovrlpy pipeline. Of course the parameters used here can be adjusted to your needs.

Importantly, we additionally load the column mapping transcripts to cell-IDs. This is necessary for one of the approaches and will be explained in the corresponding section.

transcripts = ovrlpy.io.read_Xenium(
    data_folder / "transcripts.parquet",
    additional_columns=["cell_id"],
)

brain = ovrlpy.Ovrlp(transcripts, n_workers=8, random_state=42)
brain.analyse(min_transcripts=20)

Aggregate VSI per cell

via transcripts

When using the assignments from transcripts to cells, we need to make sure that the corresponding column is already included before we run the ovrlpy pipeline. The functions in ovrlpy.io module support reading more than just the transcript coordinate columns with the additional_columns argument. Two details are required when trying to extract the VSI and signal strength values using this method: the name of the column with the transcript to cell asignment, and the value that corresponds to unassigned transcripts.

Xenium and MERSCOPE: Typically, the column is named 'cell_id' and unassigned transcripts are indicated with -1.
CosMx: There is no column uniquely identifying a cell but the combination of 'cell_ID' and 'fov' can be used. The unassigned transcripts are indicated with 0 for the 'cell_ID'.

This appraoch is easy and fast, however, depending on the sparsity of the data some pixels corresponding to a cell may be missed if they do not have any associated transcripts.

Using the ovrlpy.cell_integrity_from_transcripts() functions we can now obtain the signal strength and VSI per cell. We actually get multiple values (one per pixel) that can be used for further analysis.

cell_pixels = ovrlpy.cell_integrity_from_transcripts(
    brain, cell_id="cell_id", unassigned=-1
)

cell_pixels.head()

shape: (5, 5)

cell_id	x_pixel	y_pixel	signal	vsi
i32	i32	i32	f32	f32
114141	5674	2389	0.836131	0.704874
114064	5480	2233	0.316398	0.438043
73743	8215	6233	1.564292	0.609592
27878	1326	3683	0.796194	0.839414
14921	2186	5415	2.458118	0.775897

via segmentation masks

Instead of using the transcipt to cell assignments, we can also use the segmentation masks to aggregate the VSI and signal strength per cell.

For this we need two pieces of information, a segmentation mask per cell (as a shapely Geometry, typically a Polygon), and a cell-ID.

Note, that for this approach we need additional optional dependencies that can be installed via the ovrlpy extra

pip install ovrlpy[segmentation]

Below you can find example code on how to generate this information from the standard output of some of the widely adopted platforms.

Examples how to load segmentation masks

Xenium

from shapely.geometry import Polygon

boundary_file = data_folder / "cell_boundaries.parquet"


boundaries = (
    pl.scan_parquet(boundary_file)
    .select("cell_id", pl.concat_arr(["vertex_x", "vertex_y"]).alias("coords"))
    .group_by("cell_id")
    .agg(pl.col("coords"))
    .with_columns(
        pl.col("coords").map_elements(Polygon, return_dtype=pl.Object).alias("mask")
    )
    .collect()
)

masks = boundaries["mask"]
ids = boundaries["cell_id"]

MERSCOPE

CosMx

Using the segmentation masks and the cell-IDs we can aggregate the VSI and signal to obtain a similar DataFrame as before.

cell_pixels = ovrlpy.cell_integrity_from_masks(brain, masks, ids)

cell_pixels.head()

shape: (5, 5)

cell_id	x_pixel	y_pixel	signal	vsi
i32	i64	i64	f32	f32
16712	3960	28	1.330162	0.901204
16712	3961	28	1.138057	0.734445
16712	3959	29	1.26439	0.944789
16712	3960	29	1.1826	0.913471
16712	3961	29	1.004933	0.72393

Filtering low VSI cells

Defining low-quality cells

There are multiple ways that come to mind how the VSI and signal values per cell can be analysed. We haven’t benchmarked these yet but some ideas are listed below.

In general we would recommend to filter the pixels by a minimum signal strength cutoff to avoid effects of pixels with low gene expression which are therefore likely noisy. The exact value depends on your dataset and is influenced by parameters such as the gene panel (highly vs lowly expressed genes), the panel size, sensitivity, …

Similarly, the VSI threshold depends on the analysed samples and should be evaluated and carefully picked. Here, we will use a value 0.7 as threshold.

min_s = 1.5  # signal strength threshold
min_vsi = 0.7  # VSI threshold

cell_pixels = cell_pixels.filter(pl.col("signal") > min_s)

One idea is to evaluate cells based on the fraction of the area (here using pixels as proxy) that is affected by low vertical signal integrity. A reasonable threshold for the VSI may depend on your experiment.

fraction_of_low_vsi_per_cell = cell_pixels.group_by("cell_id").agg(
    fraction_low_vsi_px=(pl.col("vsi") < min_vsi).sum() / pl.col("vsi").len()
)

fraction_of_low_vsi_per_cell.head()

shape: (5, 2)

cell_id	fraction_low_vsi_px
i32	f64
115370	0.0390625
47595	0.079365
108019	0.0
45237	0.100917
61609	0.065574

Another option is to calculate the mean VSI per cell.

mean_vsi_per_cell = cell_pixels.group_by("cell_id").agg(pl.col("vsi").mean())

mean_vsi_per_cell.head()

shape: (5, 2)

cell_id	vsi
i32	f32
148927	0.901035
141567	0.894772
42608	0.878738
5121	0.829888
143684	0.826892

Filtering cells

To demonstrate the effect of low VSI cells (using these two measures) we will visualize the calculated metrics per cell in a UMAP and see what happens if we remove them.

import scanpy as sc

cells = sc.read_10x_h5(data_folder / "cell_feature_matrix.h5")

ax = sns.histplot(cells.X.sum(axis=1).A1, bins=200)
_ = ax.set(xlabel="Transcripts per cell", ylabel="No. of cells")

../_images/f9732826ab4b95e0803f7edb3098bea3fe4a5cc3f71a26e3b36cec46a2c2fcf5.png

We will just perform some minimal filtering of the cells and very basic processing for the demonstration. Of course you can adjust this workflow to your liking.

sc.pp.filter_cells(cells, min_counts=50)
sc.pp.filter_cells(cells, max_counts=1_000)

sc.pp.pca(cells, n_comps=30)
sc.pp.neighbors(cells)
sc.tl.umap(cells, min_dist=0.1)

Next, we will add the calculated metrics as metadata to our cells.

# transform to pandas.DataFrame and make the index a string (necessary for AnnData)
metrics = (
    mean_vsi_per_cell.join(
        fraction_of_low_vsi_per_cell, on="cell_id", how="full", coalesce=True
    )
    .to_pandas()
    .astype({"cell_id": str})
    .set_index("cell_id")
)

# drop cells for which we do not have VSI info
cells = cells[cells.obs_names.isin(metrics.index)]

cells.obs = cells.obs.join(metrics)

And visualize the results of cell filtering in UMAPs.

Mean VSI

First a quick look at the mean VSI distribution, which can help find an appropriate threshold for filtering.

ax = sns.histplot(cells.obs["vsi"], bins=50)
_ = ax.set(xlabel="Mean VSI", ylabel="No. of cells")

../_images/b2767a96a7bdd8f7b3b6cc7d68a07d2d57506280eeef7d1dba3f4be762d6d288.png

In the UMAP we can observe how the separation of clusters has improved by removing these potential doublets. We used a few thresholds to demonstrate the effect.

umap = cells.obs.copy()
umap[["UMAP1", "UMAP2"]] = cells.obsm["X_umap"]

umaps = {None: umap}

# define doublets by mean VSI below threshold and generate UMAPs w/o doublets
for t in [0.3, 0.5, 0.7]:
    umaps[f"{t:.1f}"] = calculate_umap(cells[cells.obs["vsi"] >= t].copy())


fig = plot_UMAPs(umaps, hue="vsi", threshold_label="≥{t}")

../_images/81c6c1f7f3884e3a6c0ff438373aad1fa51189ebd2297565dc508830c324db53.png

Fraction of low VSI pixels

Following the same procedure as before, we first have a look at the overall distribution of our metric and then decide on a threshold for filtering and show the results in a UMAP.

ax = sns.histplot(cells.obs["fraction_low_vsi_px"], bins=50)
_ = ax.set(xlabel="Fraction of low VSI px per cell", ylabel="No. of cells")

../_images/59c6cbd2b94b95085fad7e44a1ac78378a6c114efbd3ce29c223cf1a7d88fee2.png

umaps = {None: umap}

# define doublets by more than X% pixels with VSI below threshold
for t in [1, 0.5, 0.2]:
    umaps[f"{t:.1f}"] = calculate_umap(
        cells[cells.obs["fraction_low_vsi_px"] < t].copy()
    )

fig = plot_UMAPs(
    umaps, hue="fraction_low_vsi_px", threshold_label="<{t}", cmap="flare_r"
)

../_images/c48b482a25c6f8af68d4c35ec5678bb638f05e128e24a98a4f22f9c3a5351132.png