Tutorial 1: Compute pathway signatures
📘 Overview
This notebook runs DESeq2 to compute differential expression results and performs pathway enrichment analysis with WikiPathways gene sets, producing –log₁₀ p-values that quantify pathway activation for each compound.
[ ]:
%%capture
!conda install -c bioconda gseapy
[1]:
import numpy as np
import pandas as pd
import anndata as ad
import matplotlib.pyplot as plt
import seaborn as sns
import dilimap as dmap
[2]:
%load_ext autoreload
%autoreload 2
[3]:
dmap.logging.print_version()
Running dilimap 1.0.2 (python 3.10.16) on 2025-06-29 15:22.
1. Prepare data
We load raw RNA-seq count data generated from the lab.
The demo dataset includes primary human hepatocytes treated with 5 compounds.
Each compound tested at 4 concentrations to capture dose-dependent transcriptional responses.
[4]:
adata = dmap.datasets.demo_data(level='counts')
Package: s3://dilimap/public/data. Top hash: d3f70822ce
[5]:
## Simplify keys
adata.obs.rename(
columns={'CONCENTRATION_UM': 'dose_uM', 'COMPOUND': 'compound_name'}, inplace=True
)
[6]:
## Define dose levels
adata.obs['dose_level'] = 'Control'
labels = ['Low', 'Middle', 'Mid-High', 'High']
for cmpd in adata.obs['compound_name'].dropna().unique():
if 'DMSO' in cmpd:
continue
idx = adata.obs['compound_name'] == cmpd
levels = sorted(set(adata.obs['dose_uM'][idx]) - {14.10}) # remove 14.10 from DMSO
for label, lvl in zip(labels, levels):
adata.obs.loc[idx & (adata.obs['dose_uM'] == lvl), 'dose_level'] = label
adata.obs['dose_level'] = pd.Categorical(
adata.obs['dose_level'], categories=['Control'] + labels, ordered=True
)
[7]:
## Define auxiliary columns
adata.obs['cmpd_dose'] = (
adata.obs['compound_name'].astype(str) + '_' + adata.obs['dose_level'].astype(str)
)
adata.obs['cmpd_dose_plate'] = (
adata.obs['cmpd_dose'] + '_' + adata.obs['PLATE_NAME'].astype(str)
)
adata.obs_names = (
adata.obs['PLATE_NAME'].astype(str)
+ '_'
+ adata.obs['LIBRARY_ID'].str.split('_').str[-1]
)
[8]:
# Show compound x dose with number of replicates
dmap.utils.crosstab(adata, ['compound_name', 'dose_level'])
[8]:
| dose_level | Control | Low | Middle | Mid-High | High |
|---|---|---|---|---|---|
| compound_name | |||||
| Almotriptan | 0 | 3 | 3 | 3 | 3 |
| Brompheniramine | 0 | 3 | 3 | 3 | 3 |
| Chlorpromazine | 0 | 4 | 0 | 0 | 0 |
| DMSO | 8 | 0 | 0 | 0 | 0 |
| DMSO_replaced | 24 | 0 | 0 | 0 | 0 |
| Darunavir | 0 | 3 | 3 | 3 | 3 |
| Tolvaptan | 0 | 3 | 3 | 3 | 3 |
[9]:
# Platemap
from dilimap.utils import platemap
adata.obs['WELL_ROW'] = adata.obs['WELL_ID'].str[0]
adata.obs['WELL_COL'] = adata.obs['WELL_ID'].str[1:]
CMPD_COLOR = (
adata.obs['compound_name']
.map(
{
'DMSO': 'seashell',
'DMSO_replaced': 'seashell',
'Chlorpromazine': 'lightsalmon',
}
)
.fillna('lightblue')
)
CMPD_COLOR_OPACITY = '; opacity: ' + adata.obs['dose_level'].astype(str).map(
{'Control': '1', 'Low': '.4', 'Middle': '.6', 'Mid-High': '.8', 'High': '1'}
)
def highlight_cells(x):
return (
'background-color: '
+ x.map(dict(zip(adata.obs['CMPD_DOSE'], CMPD_COLOR))).fillna('white')
+ x.map(dict(zip(adata.obs['CMPD_DOSE'], CMPD_COLOR_OPACITY)))
)
df_cmpd_map = platemap(adata, value_key='cmpd_dose', batch='PLATE_NAME')
col_borders = dict(subset=['4', '8'], **{'border-right': '1.5pt solid black'})
row_borders = dict(
subset=(df_cmpd_map.index.str.endswith(('A', 'D', 'G')), df_cmpd_map.columns),
**{'border-top': '1.5pt solid black'},
)
display(
df_cmpd_map.style.apply(highlight_cells)
.set_properties(**col_borders)
.set_properties(**row_borders)
)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1_M4_S8_A | Darunavir_High | Darunavir_Mid-High | Darunavir_Middle | Darunavir_Low | Almotriptan_High | Almotriptan_Mid-High | Almotriptan_Middle | Almotriptan_Low | Brompheniramine_High | Brompheniramine_Mid-High | Brompheniramine_Middle | Brompheniramine_Low |
| A1_M4_S8_B | Darunavir_High | Darunavir_Mid-High | Darunavir_Middle | Darunavir_Low | Almotriptan_High | Almotriptan_Mid-High | Almotriptan_Middle | Almotriptan_Low | Brompheniramine_High | Brompheniramine_Mid-High | Brompheniramine_Middle | Brompheniramine_Low |
| A1_M4_S8_C | Darunavir_High | Darunavir_Mid-High | Darunavir_Middle | Darunavir_Low | Almotriptan_High | Almotriptan_Mid-High | Almotriptan_Middle | Almotriptan_Low | Brompheniramine_High | Brompheniramine_Mid-High | Brompheniramine_Middle | Brompheniramine_Low |
| A1_M4_S8_D | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | Tolvaptan_High | Tolvaptan_Mid-High | Tolvaptan_Middle | Tolvaptan_Low | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control |
| A1_M4_S8_E | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | Tolvaptan_High | Tolvaptan_Mid-High | Tolvaptan_Middle | Tolvaptan_Low | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control |
| A1_M4_S8_F | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | Tolvaptan_High | Tolvaptan_Mid-High | Tolvaptan_Middle | Tolvaptan_Low | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control | DMSO_replaced_Control |
| A1_M4_S8_G | DMSO_Control | DMSO_Control | DMSO_Control | DMSO_Control | DMSO_Control | DMSO_Control | DMSO_Control | DMSO_Control | Chlorpromazine_Low | Chlorpromazine_Low | Chlorpromazine_Low | Chlorpromazine_Low |
2. Quality-control
Next, we applied the following QC filtering to retain high-quality wells:
Total RNA counts > mean - 2.5 standard deviations (~700k total counts)
Mitochondrial RNA fraction < 9%
Correlation between replicates > 0.99
[10]:
# QC flags from viability screen
adata.obs['ldh_qc_pass'] = (adata.obs['LDH_QC'] == '') & (
adata.obs['compound_name'] != 'DMSO_replaced'
)
[11]:
# QC of total counts and mtRNA
dmap.pp.qc_metrics(adata)
thresh_counts = (
np.median(adata.obs['log_totalRNA']) - 2.5 * adata.obs['log_totalRNA'].std()
)
adata.obs['rna_qc_pass'] = (adata.obs['log_totalRNA'] > thresh_counts) & (
adata.obs['pct_mtRNA'] < 9
)
Added the following to `adata.obs`: ['log_totalRNA', 'pct_mtRNA', 'pct_rRNA']
[12]:
fig, axs = plt.subplots(1, 3, figsize=(12, 3), gridspec_kw={'wspace': 0.4})
sns.histplot(adata.obs['log_totalRNA'], bins=50, ax=axs[0])
sns.histplot(adata.obs['pct_mtRNA'], bins=50, ax=axs[1])
sns.histplot(adata.obs['pct_rRNA'], bins=50, ax=axs[2])
axs[1].set_xlim(0, 20)
axs[0].axvline(thresh_counts, c='k')
axs[1].axvline(9, c='k')
axs[2].axvline(9, c='k')
plt.show()
[13]:
## Example plate with QC metrics
def color_boolean(val):
return 'color: lightred; font-weight: bold' if val else ''
for val in ['log_totalRNA', 'pct_mtRNA']:
print(f'\n\033[1m{val}')
df = platemap(adata, val, batch='PLATE_NAME')
qc_fail = ~platemap(adata, 'rna_qc_pass', batch='PLATE_NAME').astype(bool)
styler = df.style.format(precision=2)
if val != 'notes_flags':
mu, std = np.median(adata.obs[val]), np.std(adata.obs[val])
styler = styler.background_gradient(
vmin=mu - 3 * std, vmax=mu + 3 * std, cmap='RdBu_r'
)
styler = (
styler.apply(lambda c: qc_fail[c.name].apply(color_boolean))
.map(lambda x: 'background-color: white' if pd.isnull(x) else '')
.set_properties(**col_borders)
.set_properties(**row_borders)
)
display(styler)
log_totalRNA
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1_M4_S8_A | 6.22 | 6.28 | 5.86 | 5.89 | 6.21 | 6.11 | 6.35 | 6.27 | 6.27 | 6.15 | 6.16 | 6.10 |
| A1_M4_S8_B | 6.35 | 6.38 | 6.29 | 6.39 | 6.13 | 6.03 | 6.50 | 6.16 | 6.33 | 5.93 | 6.32 | 5.85 |
| A1_M4_S8_C | 6.16 | 6.36 | 6.41 | 6.34 | 6.32 | 6.21 | 6.45 | 6.27 | 6.40 | 6.22 | 6.38 | 6.19 |
| A1_M4_S8_D | 6.39 | 6.41 | 6.37 | 6.38 | 6.31 | 6.20 | 6.16 | 6.24 | 6.35 | 6.31 | 6.45 | 6.24 |
| A1_M4_S8_E | 6.30 | 6.40 | 6.33 | 6.32 | 6.27 | 6.22 | 6.18 | 6.16 | 6.05 | 6.27 | 6.29 | 6.16 |
| A1_M4_S8_F | 6.09 | 6.24 | 6.29 | 6.30 | 6.21 | 6.24 | 6.39 | 6.28 | 6.30 | 6.27 | 6.16 | 6.26 |
| A1_M4_S8_G | 6.41 | 6.36 | 6.34 | 6.46 | 6.32 | 6.39 | 6.45 | 6.24 | 6.30 | 6.38 | 6.22 | 6.11 |
pct_mtRNA
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1_M4_S8_A | 3.71 | 4.21 | 6.34 | 5.58 | 4.93 | 4.28 | 4.57 | 4.45 | 4.47 | 4.76 | 4.65 | 4.93 |
| A1_M4_S8_B | 4.19 | 4.61 | 4.66 | 4.62 | 5.33 | 4.10 | 4.33 | 4.68 | 4.85 | 4.89 | 4.06 | 4.42 |
| A1_M4_S8_C | 3.78 | 4.44 | 4.70 | 4.81 | 4.77 | 4.50 | 4.55 | 4.26 | 4.68 | 4.66 | 4.22 | 4.31 |
| A1_M4_S8_D | 4.37 | 4.70 | 4.79 | 4.71 | 4.53 | 4.56 | 5.06 | 4.71 | 4.52 | 4.25 | 4.02 | 3.96 |
| A1_M4_S8_E | 4.72 | 4.71 | 4.58 | 5.29 | 4.71 | 4.56 | 5.72 | 4.63 | 5.16 | 4.57 | 4.32 | 4.42 |
| A1_M4_S8_F | 4.72 | 4.59 | 4.72 | 4.84 | 4.79 | 4.22 | 4.42 | 4.17 | 4.10 | 4.02 | 4.25 | 3.78 |
| A1_M4_S8_G | 4.75 | 4.68 | 4.88 | 4.52 | 4.92 | 4.64 | 4.77 | 4.43 | 4.10 | 3.72 | 4.83 | 4.06 |
[14]:
# QC of cross-replicate correlation
dmap.pp.qc_cross_rep_correlation(adata, group_key='cmpd_dose', plate_key='PLATE_NAME')
Added the following to `adata.obs`: ['cross_rep_correlation', 'rep_corr_qc_pass']
[15]:
## Example plate with cross-replicate correlations
val = 'cross_rep_correlation'
df = platemap(adata, val, batch='PLATE_NAME')
df_bool = np.invert(
platemap(adata, 'rep_corr_qc_pass', batch='PLATE_NAME').fillna(False).astype(bool)
)
mu, std = np.nanmean(adata.obs[val]), np.nanstd(adata.obs[val])
col_borders = dict(subset=['4', '8'], **{'border-right': '1.5pt solid black'})
row_borders = dict(
subset=(df.index.str.endswith(('A', 'D', 'G')), df.columns),
**{'border-top': '1.5pt solid black'},
)
print(val, f'CI=[{mu - 2 * std:0.3f},{1}]')
display(
df.style.format(precision=3)
.background_gradient(vmin=mu - 5 * std, vmax=1, cmap='Reds_r')
.apply(lambda c: df_bool[c.name].apply(color_boolean))
.map(lambda x: 'background-color: white' if pd.isnull(x) else '')
.set_properties(**col_borders)
.set_properties(**row_borders)
)
cross_rep_correlation CI=[0.982,1]
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1_M4_S8_A | 0.987 | 0.996 | 0.981 | 0.986 | 0.997 | 0.999 | 0.999 | 0.999 | 0.999 | 0.996 | 0.998 | 0.989 |
| A1_M4_S8_B | 0.995 | 0.999 | 0.998 | 0.999 | 0.997 | 0.997 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| A1_M4_S8_C | 0.995 | 0.999 | 0.998 | 0.999 | 0.996 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| A1_M4_S8_D | 0.995 | 0.996 | 0.997 | 0.996 | 0.995 | 0.998 | 0.999 | 0.999 | 0.995 | 0.991 | 0.987 | 0.968 |
| A1_M4_S8_E | 0.997 | 0.996 | 0.997 | 0.995 | 0.993 | 0.994 | 0.999 | 0.999 | 0.995 | 0.997 | 0.995 | 0.995 |
| A1_M4_S8_F | 0.995 | 0.996 | 0.994 | 0.994 | 0.995 | 0.998 | 0.999 | 0.999 | 0.995 | 0.990 | 0.970 | 0.958 |
| A1_M4_S8_G | 0.996 | 0.996 | 0.996 | 0.990 | 0.998 | 0.997 | 0.997 | 0.997 | 0.993 | 0.997 | 0.997 | 0.997 |
[16]:
from natsort import index_natsorted
sns.set(font_scale=0.8)
df_cmpd_flags = adata.obs[~adata.obs['rep_corr_qc_pass']][
'compound_name'
].value_counts()
df_cmpd_flags = df_cmpd_flags[~df_cmpd_flags.index.str.contains('DMSO')]
cmpd_flags = df_cmpd_flags.index[:8]
fig, axs = plt.subplots(
1, len(cmpd_flags), figsize=(5 * len(cmpd_flags), 3), gridspec_kw={'wspace': 0.5}
)
for i, cmpd_name in enumerate(cmpd_flags):
df_sub = adata[adata.obs['compound_name'] == cmpd_name].to_df()
df_sub.index = adata[adata.obs['compound_name'] == cmpd_name].obs['WELL_ID']
df_sub_corr = df_sub.T.corr()
idx_sort = index_natsorted(df_sub_corr.index.str[1:] + df_sub_corr.index.str[0])
df_sub_corr = df_sub_corr.iloc[idx_sort, idx_sort]
g = sns.heatmap(df_sub_corr, vmax=1, vmin=0.9, yticklabels=True, ax=axs[i])
flagged = (
adata[
(adata.obs['compound_name'] == cmpd_name) & ~adata.obs['rep_corr_qc_pass']
]
.obs['WELL_ID']
.values
)
for label in g.get_yticklabels():
if label.get_text() in flagged:
label.set_color('red')
axs[i].set_title(cmpd_name)
3. DESeq2
Samples passing the above QC filters are used to compute differential expression signatures with DESeq2.
The differential test compares compound-treated cells vs. DMSO controls from the same plate.
To streamline the workflow, DESeq2 runs via an R script inside a Docker container, so there is no need to switch between Python and R. Ensure Docker is installed (https://www.docker.com/products/docker-desktop/); the initial build may take a few minutes. Alternatively, you can run DESeq2 natively in R if preferred.
If the Docker image is unavailable, the code will fall back to using a precomputed DESeq2 file.
[17]:
%%capture
from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='auto')
client = Client(cluster)
[22]:
import warnings
try:
adatas = {}
for plate in np.unique(adata.obs['PLATE_NAME']):
adata_batch = adata[adata.obs['PLATE_NAME'] == plate].copy()
adata_batch = adata_batch[
(
adata_batch.obs['ldh_qc_pass']
& adata_batch.obs['rna_qc_pass']
& adata_batch.obs['rep_corr_qc_pass']
)
].copy()
adata_batch.obs_names_make_unique()
adatas[plate] = dmap.pp.deseq2(
adata_batch,
pert_name_col='compound_name',
other_pert_cols=['dose_level'],
dask_client=client,
)
adatas[plate].obs['PLATE_NAME'] = plate
adata_deseq = ad.concat(adatas.values(), join='outer')
except Exception as e:
warnings.warn(
'Falling back to precomputed DESeq2 data because Docker is not available or not running.\n'
'To run DESeq2 natively, ensure Docker is installed and the daemon is active.\n'
f'Original error: {str(e)}'
)
adata_deseq = dmap.datasets.demo_data(level='deseq2')
/opt/anaconda3/envs/py310/lib/python3.10/site-packages/anndata/_core/anndata.py:1756: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
/var/folders/lz/prv79nmj5msg8h6nzqn0w7cw0000gn/T/ipykernel_22309/2235761672.py:26: UserWarning: Falling back to precomputed DESeq2 data because Docker is not available or not running.
To run DESeq2 natively, ensure Docker is installed and the daemon is active.
Original error: Docker is installed but cannot connect to the Docker daemon.
Please ensure the Docker application is running and try again.
Details: ERROR: Cannot connect to the Docker daemon at unix:///Users/volkerbergen/.docker/run/docker.sock. Is the docker daemon running?
errors pretty printing info
warnings.warn(
Package: s3://dilimap/public/data. Top hash: 5a7dd22964
[23]:
adata_deseq = adata_deseq[~np.all(adata_deseq.to_df('LFC').isna(), 1)].copy()
/opt/anaconda3/envs/py310/lib/python3.10/site-packages/anndata/_core/anndata.py:1756: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
[24]:
adata_deseq.obs['ldh_qc_pass'] = adata_deseq.obs['ldh_qc_pass'].astype(bool)
adata_deseq.obs['rna_qc_pass'] = adata_deseq.obs['rna_qc_pass'].astype(bool)
adata_deseq.obs['rep_corr_qc_pass'] = adata_deseq.obs['rep_corr_qc_pass'].astype(bool)
4. WikiPathways
Next, we compute pathway signatures by performing enrichment analysis on differentially expressed genes (DEGs). Using the hypergeometric test, we calculate how likely it is that a pathway is enriched with DEGs by chance.
We adjust p-values for multiple testing (FDR) and apply a -log10 transformation to derive pathway signatures. These -log10(FDR) values serve as robust, quantitative features for our model.
[25]:
FDR = adata_deseq.to_df('FDR') # adjusted p-values from DESeq2
[27]:
adata_wiki = dmap.pp.pathway_signatures(FDR)
/opt/anaconda3/envs/py310/lib/python3.10/site-packages/anndata/_core/anndata.py:1756: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
[28]:
adata_wiki.obs = adata_deseq.obs.copy()
[29]:
adata_wiki.X = np.nan_to_num(adata_wiki.X)
Push
Here, the data is uploaded to S3, which requires appropriate access permissions.
Instead, you can save it locally and load it in the next tutorial notebook to run DILI predictions.
[ ]:
# dmap.s3.write(adata_deseq, 'demo_data_deseq2.h5ad')
[ ]:
# dmap.s3.write(adata_wiki, 'demo_data_pathways.h5ad')