PENCIL is a novel supervised learning framework for simultaneously identifying phenotype-associated subpopulations and informative features from single-cell data. PENCIL has two modes, classification and regression, for categorical and continuous phenotypes, respectively. The classification mode allows simultaneous differential-abundance testing and feature selection, while the regression mode learns a supervised phenotypic trajectory of subpopulations. In this tutorial, we use two examples to walk users through both modes of PENCIL.
If the “pencil” package is not yet installed, run the following command in a terminal (it installs the local package source, located one directory up, in editable mode):
pip install -e ..
The pencil package is developed in Python; however, we can use the reticulate R package to run PENCIL in R (version >= 4.1.0) on a Seurat object. First, let's load the required packages:
library(Seurat)
library(reticulate)
library(scales)
library(ggplot2)
Users can select the Python environment for calling pencil by clicking “Tools -> Global Options -> Python” in RStudio, or set it directly with reticulate:
use_python("xxx/python") #for python
use_virtualenv("xxx") #for virtual environment
use_condaenv("xxx") #for conda environment
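To confirm that reticulate picked up the intended interpreter and that pencil can be imported, a quick optional check in R (py_config and py_module_available are standard reticulate functions):

py_config() # shows the active Python interpreter
py_module_available("pencil") # should return TRUE once pencil is installed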
The input to PENCIL consists of a single-cell quantification matrix and a condition label of interest for every cell. The labels can be either a categorical indicator vector or a continuous dependent variable. In this tutorial, we use two simulations based on real single-cell data to show how to apply PENCIL.
The simulated datasets have been stored as Seurat objects with the standard preprocessing pipeline already applied. We can download the two Seurat objects locally by running
source('data_download.R')
In our first example, we use the classification mode of PENCIL to identify phenotype-enriched subpopulations in a simulation with categorical condition labels.
load('./data/PENCIL_tutorial_1.Rdata')
dim(sc_data)
## [1] 55737 6350
The condition labels can be visualized on the UMAP computed from the top 2000 most variable genes (MVG2000) as follows. The cell phenotype labels are distributed almost randomly on this UMAP, so it would be difficult to identify phenotype-associated subpopulations with general clustering algorithms or KNN graph-based methods that lack gene selection.
DimPlot(sc_data, group.by = "cell_phenotype_labels_simulation", reduction = 'umap-mvg2000', pt.size=0.3)
The cell labels of the simulated data were actually generated from an expression-level clustering of MVG1000-1300 (ground truth genes, GT genes). Clusters 0, 2, and 9 were used as ground truth groups (GT groups); within each GT group, 90% of the cells were assigned the same class and the remaining 10% were randomly assigned other class labels to simulate phenotype-enriched subpopulations. All other cells were randomly assigned a class label as background interference.
num_classes <- length(unique(sc_data$cell_phenotype_labels_simulation))
pal <- c('gray', hue_pal()(num_classes))
A <- DimPlot(object=sc_data, reduction='umap', label=T, pt.size=0.3)
B <- DimPlot(object=sc_data, reduction='umap', label=T, group.by="true_groups_simulation", cols=pal, pt.size=0.3)
C <- DimPlot(object=sc_data, reduction='umap', label=T, group.by="cell_phenotype_labels_simulation", cols=pal[2:length(pal)], pt.size=0.3)
A + B + C
PENCIL takes as input an expression matrix of the MVG2000 (or more genes) together with the cell labels, aiming to simultaneously recover the GT genes and the cell subpopulations in which they are informative.
We extract the data required by PENCIL from the Seurat object.
exp_data = sc_data@assays[["RNA"]]@scale.data[VariableFeatures(sc_data), ] # scaled MVG2000 matrix, genes x cells
labels = as.factor(sc_data$cell_phenotype_labels_simulation)
class_names <- levels(labels)
labels_2_ids <- as.numeric(labels) - 1 # PENCIL expects 0-based integer class ids
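An optional sanity check before moving on: each phenotype class should map to a single 0-based integer id.

table(labels, labels_2_ids)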
Then, we can create a new Python chunk to run pencil, using r.x to access the R variables from Python.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0' #select a gpu id, otherwise set to '-1'.
from pencil import *
# For recording the results.
data_name = 'PENCIL_tutorial_1'
expr_id = '0.0.1'
data = r.exp_data.T.copy()
labels = np.array(r.labels_2_ids, dtype=int)
mode = 'multi-classification'
pencil = Pencil(mode, select_genes=True, seed=1234, data_name=data_name, expr_id=expr_id)
pred, confidence = pencil.fit_transform(
    data, labels,
    test=True,
    shuffle_rate=1/4,
    lambda_L1=1e-5,
    lambda_L2=1e-3,
    lr=0.01,
    class_weights=None,
    class_names=r.class_names,
    plot_show=False
)
## dataset: PENCIL_tutorial_1, expr_id: 0.0.1
## scheme: ce, Sigmoid
## searching c...
## cmin:0.000, cmax:2.000, c:1.000, rejected 0 cells.
## cmin:0.000, cmax:1.000, c:0.500, rejected 3595 cells.
## cmin:0.000, cmax:0.500, c:0.250, rejected 6249 cells.
## cmin:0.250, cmax:0.500, c:0.375, rejected 4516 cells.
## cmin:0.250, cmax:0.375, c:0.312, rejected 5646 cells.
## cmin:0.250, cmax:0.312, c:0.281, rejected 5896 cells.
## cmin:0.281, cmax:0.312, c:0.297, rejected 5791 cells.
## searched c: 0.296875
## pretrain 500 epochs...
## cuda is available.
## epoch=0, loss=0.3839, mean_e=0.4755, mean_r=-0.0433, L1_reg=21.3151
## epoch=20, loss=0.3407, mean_e=0.4753, mean_r=-0.5820, L1_reg=27.6287
## epoch=40, loss=0.3403, mean_e=0.4758, mean_r=-0.6453, L1_reg=27.5444
## epoch=60, loss=0.3381, mean_e=0.4769, mean_r=-0.6720, L1_reg=31.8403
## epoch=80, loss=0.3362, mean_e=0.4787, mean_r=-0.5779, L1_reg=36.8966
## epoch=100, loss=0.3337, mean_e=0.4812, mean_r=-0.4753, L1_reg=43.5172
## epoch=120, loss=0.3320, mean_e=0.4830, mean_r=-0.4602, L1_reg=51.5590
## epoch=140, loss=0.3298, mean_e=0.4826, mean_r=-0.4509, L1_reg=59.8642
## epoch=160, loss=0.3287, mean_e=0.4822, mean_r=-0.4442, L1_reg=67.2273
## epoch=180, loss=0.3275, mean_e=0.4816, mean_r=-0.4420, L1_reg=73.7574
## epoch=200, loss=0.3261, mean_e=0.4808, mean_r=-0.4280, L1_reg=80.4409
## epoch=220, loss=0.3250, mean_e=0.4801, mean_r=-0.4158, L1_reg=86.7325
## epoch=240, loss=0.3237, mean_e=0.4789, mean_r=-0.4160, L1_reg=93.2755
## epoch=260, loss=0.3228, mean_e=0.4776, mean_r=-0.4166, L1_reg=99.7909
## epoch=280, loss=0.3219, mean_e=0.4760, mean_r=-0.4084, L1_reg=105.8988
## epoch=300, loss=0.3210, mean_e=0.4746, mean_r=-0.4095, L1_reg=111.8656
## epoch=320, loss=0.3201, mean_e=0.4734, mean_r=-0.4051, L1_reg=117.6823
## epoch=340, loss=0.3192, mean_e=0.4719, mean_r=-0.4026, L1_reg=123.2213
## epoch=360, loss=0.3185, mean_e=0.4712, mean_r=-0.4026, L1_reg=128.7799
## epoch=380, loss=0.3180, mean_e=0.4705, mean_r=-0.4046, L1_reg=134.1644
## epoch=400, loss=0.3174, mean_e=0.4699, mean_r=-0.3928, L1_reg=139.4066
## epoch=420, loss=0.3169, mean_e=0.4693, mean_r=-0.3904, L1_reg=144.6215
## epoch=440, loss=0.3163, mean_e=0.4674, mean_r=-0.3878, L1_reg=149.8350
## epoch=460, loss=0.3157, mean_e=0.4662, mean_r=-0.3874, L1_reg=154.8665
## epoch=480, loss=0.3153, mean_e=0.4655, mean_r=-0.3871, L1_reg=159.9819
## ---train time: 7.257097959518433 seconds ---
##
## Number of examples rejected= 4363 / 6350
## num_of_rejcted
## class_2 1514
## class_3 1433
## class_1 1416
## dtype: int64
## --- without rejection ---
## precision recall f1-score support
##
## class_1 0.62 0.46 0.53 2115
## class_2 0.52 0.76 0.62 2463
## class_3 0.52 0.36 0.42 1772
##
## accuracy 0.55 6350
## macro avg 0.55 0.52 0.52 6350
## weighted avg 0.55 0.55 0.53 6350
##
## --- with rejection ---
## precision recall f1-score support
##
## class_1 0.92 0.94 0.93 699
## class_2 0.94 0.96 0.95 949
## class_3 0.94 0.85 0.90 339
##
## accuracy 0.93 1987
## macro avg 0.93 0.92 0.92 1987
## weighted avg 0.93 0.93 0.93 1987
##
## ---test time: 0.16099762916564941 seconds ---
In addition, PENCIL integrates the machine-learning experiment tracker mlflow to record the parameter settings and results of each run (in the current working directory). It can be used as follows:
import mlflow

pencil = Pencil(mode, select_genes=True, seed=1234, data_name=data_name, expr_id=expr_id, mlflow_record=True)
with mlflow.start_run():
    pred, confidence = pencil.fit_transform(data, labels, ...)
After the run completes, we can execute mlflow ui in the terminal to conveniently view and compare the results of each experiment.
The results can be shown in Python directly by passing the parameter emd to pencil.fit_transform.
emd <- sc_data@reductions[["umap"]]@cell.embeddings #R
pencil.fit_transform(..., emd=r.emd, plot_show=True) #Python
However, we prefer another route: passing the results back into R via ‘py$x’ and loading them into the Seurat object for more flexible visualization. We present the results on the UMAP generated from the GT genes to facilitate comparison with the GT groups.
pred_labels <- class_names[py$pred + 1]
pred_labels[py$confidence < 0] <- 'Rejected'
pred_labels_names <- c('Rejected', as.character(class_names))
pred_labels <- factor(pred_labels, levels = pred_labels_names)
confidence <- as.data.frame(py$confidence, row.names = colnames(sc_data))
sc_data <- AddMetaData(sc_data, metadata = pred_labels, col.name = 'pred_labels')
sc_data <- AddMetaData(sc_data, metadata = confidence, col.name = 'confidence.score')
FeaturePlot(sc_data, features='confidence.score', pt.size=0.3)
DimPlot(object=sc_data, reduction='umap', label=T, group.by="pred_labels", cols=pal, pt.size=0.3)
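Optionally, the predicted labels can be cross-tabulated against the simulated GT groups for a numeric check of the agreement (a quick sketch using the metadata columns added above):

table(sc_data$pred_labels, sc_data$true_groups_simulation)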
The distribution of confidence scores and the labels predicted by PENCIL are almost identical to the ground truth! Moreover, by visualizing the gene weights learned by PENCIL, we can see that the selected genes indeed fall mostly within the MVG1000-1300 range (GT genes).
# in a Python chunk
w = pencil.gene_weights(plot=True)
plt.close()
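The learned weights can also be examined in R. Below is a hedged sketch, assuming w from the Python chunk holds one weight per input gene, ordered as in VariableFeatures(sc_data); since the GT genes span roughly ranks 1000-1300, the top-weighted indices should concentrate in that range:

gene_w <- abs(as.numeric(py$w)) # one weight per MVG2000 gene (assumption)
top_idx <- order(gene_w, decreasing = TRUE)[1:300] # ~300 GT genes, so take the top 300
mean(top_idx >= 1000 & top_idx <= 1300) # fraction of top genes inside the GT range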
In the second example, we use the regression mode of PENCIL to learn a phenotypic trajectory in a simulation with continuous labels. The features of the input single-cell quantification matrix are 10 principal components (PCs).
load('./data/PENCIL_tutorial_2.Rdata')
dim(sc_data.2)
## [1] 10 16291
We visualize this dataset using the UMAP coordinates generated from the 10 PCs, colored by the simulated cell timepoints.
DimPlot(sc_data.2, group.by = "cell_timepoints_simulation", reduction = 'umap', pt.size=0.3)
The simulated timepoint labels are again derived from clustering results, but since the input is a 10-PC matrix, we do not designate ground truth features as in the first example. In principle, simultaneous feature selection also works in regression mode (more related results can be found in our paper).
Clusters 3, 1, 2, 11, and 4 are set as the ground truth groups (GT groups). Cells in GT groups 1, 3, and 5 are assigned timepoints 1, 2, and 3, respectively. Cells in GT groups 2 and 4 are assigned the timepoints of their neighboring groups in a 1:1 ratio to simulate transition states. All other cells are randomly assigned a timepoint label as background noise.
A = DimPlot(sc_data.2, reduction = 'umap', pt.size=0.3, label = T)
num_groups = length(unique(sc_data.2$true_groups_simulation))
pal = c(hue_pal()(num_groups-1), 'gray')
B = DimPlot(sc_data.2, group.by = "true_groups_simulation", reduction = 'umap', cols=pal, pt.size=0.3)
A + B
We then extract the PC matrix and the timepoint labels, and execute pencil in Python.
exp_data = as.matrix(sc_data.2@assays[["RNA"]]@counts) # the assay stores the 10-PC matrix in this simulated object
labels = as.numeric(as.character(sc_data.2$cell_timepoints_simulation))
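An optional sanity check before running PENCIL: regression mode expects one numeric target per cell, and the matrix here is features x cells.

table(labels) # cells per simulated timepoint (1, 2, 3)
dim(exp_data) # 10 PCs x 16291 cells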
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0' #select a gpu id, otherwise set to '-1'.
from pencil import *
# For recording the results.
data_name = 'PENCIL_tutorial_2'
expr_id = '0.0.1'
data = r.exp_data.T.copy()
labels = np.array(r.labels)
mode = 'regression'
pencil = Pencil(mode, select_genes=True, seed=1234, data_name=data_name, expr_id=expr_id, dropouts=[0.0, 0.0]) #`select_genes` can also be set to False, if True, pencil will output a weight vector for all 10 PCs.
pred, confidence = pencil.fit_transform(
    data, labels,
    test=True,
    shuffle_rate=1/5,
    lambda_L1=1e-5,
    lambda_L2=0.0,
    lr=0.01,
    epochs=2000,
    class_weights=None,
    plot_show=False
)
## dataset: PENCIL_tutorial_2, expr_id: 0.0.1
## scheme: sml1, NGLR
## searching c...
## cmin:0.000, cmax:2.000, c:1.000, rejected 0 cells.
## cmin:0.000, cmax:1.000, c:0.500, rejected 435 cells.
## cmin:0.000, cmax:0.500, c:0.250, rejected 12803 cells.
## cmin:0.000, cmax:0.250, c:0.125, rejected 16291 cells.
## cmin:0.125, cmax:0.250, c:0.188, rejected 16291 cells.
## cmin:0.188, cmax:0.250, c:0.219, rejected 13823 cells.
## cmin:0.188, cmax:0.219, c:0.203, rejected 16291 cells.
## searched c: 0.203125
## cuda is available.
## epoch=0, loss=1.4097, mean_e=1.4376, mean_r=-0.0138, L1_reg=1.9475
## epoch=20, loss=0.4060, mean_e=1.3665, mean_r=-0.9981, L1_reg=2.5915
## epoch=40, loss=0.4060, mean_e=1.3257, mean_r=-0.9978, L1_reg=2.6782
## epoch=60, loss=0.4056, mean_e=1.3122, mean_r=-0.9956, L1_reg=2.6811
## epoch=80, loss=0.4029, mean_e=1.2962, mean_r=-0.9751, L1_reg=2.6590
## epoch=100, loss=0.3941, mean_e=1.1739, mean_r=-0.9250, L1_reg=2.6933
## epoch=120, loss=0.3709, mean_e=0.9510, mean_r=-0.7386, L1_reg=2.6683
## epoch=140, loss=0.3304, mean_e=0.6626, mean_r=-0.4815, L1_reg=2.7624
## epoch=160, loss=0.2986, mean_e=0.5003, mean_r=-0.3151, L1_reg=2.4561
## epoch=180, loss=0.2836, mean_e=0.2874, mean_r=-0.2602, L1_reg=2.1437
## epoch=200, loss=0.2780, mean_e=0.2744, mean_r=-0.2509, L1_reg=2.2259
## epoch=220, loss=0.2758, mean_e=0.2717, mean_r=-0.2322, L1_reg=2.3937
## epoch=240, loss=0.2737, mean_e=0.2701, mean_r=-0.2280, L1_reg=2.5344
## epoch=260, loss=0.2712, mean_e=0.2708, mean_r=-0.2181, L1_reg=2.6546
## epoch=280, loss=0.2700, mean_e=0.2686, mean_r=-0.2076, L1_reg=2.8302
## epoch=300, loss=0.2685, mean_e=0.2662, mean_r=-0.1936, L1_reg=3.0569
## epoch=320, loss=0.2682, mean_e=0.2646, mean_r=-0.1795, L1_reg=3.2941
## epoch=340, loss=0.2677, mean_e=0.2621, mean_r=-0.1716, L1_reg=3.4963
## epoch=360, loss=0.2663, mean_e=0.2604, mean_r=-0.1638, L1_reg=3.6674
## epoch=380, loss=0.2653, mean_e=0.2595, mean_r=-0.1521, L1_reg=3.8771
## epoch=400, loss=0.2651, mean_e=0.2587, mean_r=-0.1571, L1_reg=4.0785
## epoch=420, loss=0.2640, mean_e=0.2579, mean_r=-0.1493, L1_reg=4.2492
## epoch=440, loss=0.2633, mean_e=0.2566, mean_r=-0.1392, L1_reg=4.4611
## epoch=460, loss=0.2655, mean_e=0.2531, mean_r=-0.1535, L1_reg=4.6811
## epoch=480, loss=0.2623, mean_e=0.2523, mean_r=-0.1312, L1_reg=4.8246
## epoch=500, loss=0.2628, mean_e=0.2503, mean_r=-0.1293, L1_reg=5.0253
## epoch=520, loss=0.2610, mean_e=0.2505, mean_r=-0.1074, L1_reg=5.2265
## epoch=540, loss=0.2588, mean_e=0.2479, mean_r=-0.0977, L1_reg=5.4453
## epoch=560, loss=0.2574, mean_e=0.2465, mean_r=-0.0952, L1_reg=5.6536
## epoch=580, loss=0.2566, mean_e=0.2460, mean_r=-0.0909, L1_reg=5.8946
## epoch=600, loss=0.2564, mean_e=0.2455, mean_r=-0.0959, L1_reg=6.1100
## epoch=620, loss=0.2570, mean_e=0.2452, mean_r=-0.0783, L1_reg=6.2787
## epoch=640, loss=0.2550, mean_e=0.2447, mean_r=-0.0849, L1_reg=6.4379
## epoch=660, loss=0.2542, mean_e=0.2443, mean_r=-0.0814, L1_reg=6.6483
## epoch=680, loss=0.2542, mean_e=0.2436, mean_r=-0.0870, L1_reg=6.8633
## epoch=700, loss=0.2555, mean_e=0.2448, mean_r=-0.0605, L1_reg=7.0379
## epoch=720, loss=0.2538, mean_e=0.2434, mean_r=-0.0695, L1_reg=7.2444
## epoch=740, loss=0.2518, mean_e=0.2422, mean_r=-0.0686, L1_reg=7.4395
## epoch=760, loss=0.2506, mean_e=0.2418, mean_r=-0.0657, L1_reg=7.6527
## epoch=780, loss=0.2497, mean_e=0.2412, mean_r=-0.0692, L1_reg=7.8350
## epoch=800, loss=0.2496, mean_e=0.2410, mean_r=-0.0732, L1_reg=8.0071
## epoch=820, loss=0.2502, mean_e=0.2410, mean_r=-0.0615, L1_reg=8.1433
## epoch=840, loss=0.2493, mean_e=0.2408, mean_r=-0.0611, L1_reg=8.3031
## epoch=860, loss=0.2493, mean_e=0.2405, mean_r=-0.0538, L1_reg=8.4848
## epoch=880, loss=0.2528, mean_e=0.2407, mean_r=-0.0995, L1_reg=8.6519
## epoch=900, loss=0.2511, mean_e=0.2412, mean_r=-0.0659, L1_reg=8.7194
## epoch=920, loss=0.2475, mean_e=0.2396, mean_r=-0.0582, L1_reg=8.8256
## epoch=940, loss=0.2480, mean_e=0.2390, mean_r=-0.0576, L1_reg=9.0108
## epoch=960, loss=0.2474, mean_e=0.2388, mean_r=-0.0301, L1_reg=9.1651
## epoch=980, loss=0.2470, mean_e=0.2386, mean_r=-0.0478, L1_reg=9.2766
## epoch=1000, loss=0.2484, mean_e=0.2382, mean_r=-0.0357, L1_reg=9.3916
## epoch=1020, loss=0.2461, mean_e=0.2382, mean_r=-0.0645, L1_reg=9.4910
## epoch=1040, loss=0.2468, mean_e=0.2380, mean_r=-0.0550, L1_reg=9.6178
## epoch=1060, loss=0.2494, mean_e=0.2395, mean_r=-0.0201, L1_reg=9.7439
## epoch=1080, loss=0.2485, mean_e=0.2374, mean_r=-0.0897, L1_reg=9.8523
## epoch=1100, loss=0.2473, mean_e=0.2378, mean_r=-0.0334, L1_reg=9.9609
## epoch=1120, loss=0.2484, mean_e=0.2383, mean_r=-0.0216, L1_reg=10.0978
## epoch=1140, loss=0.2460, mean_e=0.2364, mean_r=-0.0417, L1_reg=10.2460
## epoch=1160, loss=0.2442, mean_e=0.2360, mean_r=-0.0444, L1_reg=10.3912
## epoch=1180, loss=0.2457, mean_e=0.2358, mean_r=-0.0132, L1_reg=10.5319
## epoch=1200, loss=0.2444, mean_e=0.2356, mean_r=-0.0355, L1_reg=10.6487
## epoch=1220, loss=0.2445, mean_e=0.2355, mean_r=-0.0481, L1_reg=10.7476
## epoch=1240, loss=0.2464, mean_e=0.2353, mean_r=-0.0174, L1_reg=10.8516
## epoch=1260, loss=0.2440, mean_e=0.2349, mean_r=-0.0437, L1_reg=10.9334
## epoch=1280, loss=0.2487, mean_e=0.2377, mean_r=-0.0305, L1_reg=11.0390
## epoch=1300, loss=0.2457, mean_e=0.2352, mean_r=-0.0553, L1_reg=11.1337
## epoch=1320, loss=0.2456, mean_e=0.2353, mean_r=-0.0130, L1_reg=11.2465
## epoch=1340, loss=0.2429, mean_e=0.2340, mean_r=-0.0368, L1_reg=11.3637
## epoch=1360, loss=0.2422, mean_e=0.2338, mean_r=-0.0229, L1_reg=11.5149
## epoch=1380, loss=0.2420, mean_e=0.2335, mean_r=-0.0233, L1_reg=11.6645
## epoch=1400, loss=0.2448, mean_e=0.2333, mean_r=0.0076, L1_reg=11.7885
## epoch=1420, loss=0.2423, mean_e=0.2333, mean_r=-0.0299, L1_reg=11.8869
## epoch=1440, loss=0.2430, mean_e=0.2333, mean_r=-0.0128, L1_reg=11.9803
## epoch=1460, loss=0.2433, mean_e=0.2333, mean_r=-0.0464, L1_reg=12.0829
## epoch=1480, loss=0.2447, mean_e=0.2330, mean_r=-0.0575, L1_reg=12.1336
## epoch=1500, loss=0.2459, mean_e=0.2363, mean_r=-0.0291, L1_reg=12.1447
## epoch=1520, loss=0.2435, mean_e=0.2330, mean_r=-0.0217, L1_reg=12.2426
## epoch=1540, loss=0.2425, mean_e=0.2319, mean_r=-0.0271, L1_reg=12.3334
## epoch=1560, loss=0.2418, mean_e=0.2317, mean_r=-0.0425, L1_reg=12.4527
## epoch=1580, loss=0.2415, mean_e=0.2317, mean_r=-0.0284, L1_reg=12.5416
## epoch=1600, loss=0.2415, mean_e=0.2317, mean_r=-0.0372, L1_reg=12.6468
## epoch=1620, loss=0.2419, mean_e=0.2317, mean_r=-0.0117, L1_reg=12.7310
## epoch=1640, loss=0.2410, mean_e=0.2315, mean_r=-0.0242, L1_reg=12.8254
## epoch=1660, loss=0.2439, mean_e=0.2322, mean_r=-0.0349, L1_reg=12.9176
## epoch=1680, loss=0.2453, mean_e=0.2342, mean_r=0.0029, L1_reg=12.9758
## epoch=1700, loss=0.2420, mean_e=0.2316, mean_r=-0.0394, L1_reg=13.0069
## epoch=1720, loss=0.2446, mean_e=0.2329, mean_r=0.0119, L1_reg=13.0895
## epoch=1740, loss=0.2411, mean_e=0.2306, mean_r=-0.0312, L1_reg=13.2102
## epoch=1760, loss=0.2406, mean_e=0.2307, mean_r=-0.0296, L1_reg=13.3355
## epoch=1780, loss=0.2431, mean_e=0.2306, mean_r=0.0226, L1_reg=13.4431
## epoch=1800, loss=0.2407, mean_e=0.2305, mean_r=-0.0160, L1_reg=13.5262
## epoch=1820, loss=0.2403, mean_e=0.2304, mean_r=-0.0058, L1_reg=13.6124
## epoch=1840, loss=0.2414, mean_e=0.2304, mean_r=-0.0163, L1_reg=13.7067
## epoch=1860, loss=0.2428, mean_e=0.2312, mean_r=-0.0062, L1_reg=13.7985
## epoch=1880, loss=0.2456, mean_e=0.2329, mean_r=0.0214, L1_reg=13.8173
## epoch=1900, loss=0.2453, mean_e=0.2316, mean_r=-0.0617, L1_reg=13.8204
## epoch=1920, loss=0.2429, mean_e=0.2312, mean_r=0.0078, L1_reg=13.8668
## epoch=1940, loss=0.2400, mean_e=0.2298, mean_r=-0.0202, L1_reg=14.0012
## epoch=1960, loss=0.2408, mean_e=0.2298, mean_r=0.0131, L1_reg=14.1327
## epoch=1980, loss=0.2399, mean_e=0.2295, mean_r=-0.0252, L1_reg=14.2322
## ---train time: 12.653669118881226 seconds ---
##
## Number of examples rejected= 10214 / 16291
## ---test time: 0.04399919509887695 seconds ---
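As the comment in the Python chunk notes, with select_genes=True PENCIL also learns a weight for each of the 10 PC features. A hedged sketch for pulling these into R, assuming gene_weights(plot=False) returns the weight vector just as in the first example:

pc_w <- as.numeric(py$pencil$gene_weights(plot = FALSE)) # assumes plot=FALSE returns the weights
names(pc_w) <- paste0('PC', seq_along(pc_w))
sort(abs(pc_w), decreasing = TRUE) # PCs ranked by learned importance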
We add PENCIL's results to the Seurat object for visualization.
pred.time <- as.vector(py$pred)
sc_data.2 <- AddMetaData(sc_data.2, metadata = pred.time, col.name = 'pred.time')
sc_data.2 <- AddMetaData(sc_data.2, metadata = py$confidence, col.name = 'confidence.score')
FeaturePlot(sc_data.2, features = 'confidence.score', pt.size=0.3)
FeaturePlot(sc_data.2, features = 'pred.time',
            cells = Cells(sc_data.2)[sc_data.2$confidence.score > 0], pt.size = 0.3) +
  coord_cartesian(xlim = c(-12.5, 7.5), ylim = c(-12.5, 7.5)) +
  scale_colour_gradientn(colours = c("red", "green", "blue"))
The distribution of confidence scores and the continuous time scores predicted by PENCIL match the ground truth well and successfully identify the transition states, as confirmed by the histogram below.
ggplot(sc_data.2[, sc_data.2$confidence.score > 0]@meta.data,
       aes(pred.time, fill = cell_timepoints_simulation)) +
  geom_histogram(aes(y = ..density..), bins = 500, show.legend = TRUE) +
  geom_density(alpha = 0.2) +
  scale_x_continuous(breaks = seq(1, 3, 0.5))
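Beyond the histogram, a quick optional summary of the median predicted time per GT group should place the transition groups 2 and 4 between the integer timepoints:

conf_cells <- sc_data.2$confidence.score > 0
tapply(sc_data.2$pred.time[conf_cells], sc_data.2$true_groups_simulation[conf_cells], median)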
Reference: Tao Ren, Ling-Yun Wu and Zheng Xia. Supervised learning of high-confidence phenotypic subpopulations from single-cell data (2022).
R packages loaded in this tutorial:
Seurat 4.0.5
reticulate 1.25
scales
ggplot2 3.3.5
Python packages that pencil depends on:
numpy 1.20.3
pandas 1.3.4
torch 1.10.0
seaborn 0.11.2
umap-learn 0.5.2
mlflow 1.23.1