How can I get support for PENCIL?
Is selecting too many genes initially necessary for PENCIL input?
Do the two modes of PENCIL differ in terms of the cells and genes selected?
How does PENCIL determine the time change direction in continuous phenotype regression?
Why do different orders of input genes lead to different outputs?
How to set the parameter class_weights?

How can I get support for PENCIL?

If you have any questions, you can submit them on the GitHub issues page.

Is selecting too many genes initially necessary for PENCIL input?

PENCIL does need a pre-filtered gene list, like the top 2000 MVGs, as input. In theory, we can input all the genes for PENCIL to select. However, single-cell data is noisy and sparse, and we have to filter genes in the initial step like MVGs for downstream machine learning analysis. Selecting too few genes in the initial step can affect performance. Therefore, we set the expression matrix as a parameter that allows the user to choose which genes to use, and the recommendation will be the top 5000 MVGs. Considering that most single-cell RNA-seq datasets have around 10,000 genes, the top 5000 MVGs should cover the most phenotypically relevant genes. However, we allow users to adjust it according to their applications.

expression_data <- expression_data[user_chosen_genes, ] # in R
...
pencil.fit_transform(r.expression_data, ...) # in Python

Do the two modes of PENCIL differ in terms of the cells and genes selected?

Originally, PENCIL was designed to identify the subpopulations with differential abundances of phenotype labels through a classification mode. However, PENCIL is very flexible and can be extended to regression mode by updating the loss function. And we later realized that such a regression mode could implement a new application for supervised learning of subpopulations undergoing a continuous phenotypic transition, which is fundamentally different from the classification mode for differential abundance analysis. The regression model can reveal the transition state between the two categorical phenotypes, indicating that the rejection module is different from the one in the classification mode. At the same time, PENCIL will select genes that discriminate between conditions and genes that distinguish non-rejected cells from rejected cells. Thus, updating the loss function affects both the rejection module and the gene weights, suggesting that the selected genes by the two modes do not completely overlap. In addition, the informative genes associated with the conditions can be different in the two modes. For example, a dataset has condition labels from 3 conditions, t1, t2 and t3. If gene G is up-regulated in t2 and down-regulated in t1 and t3, in the classification mode, gene G will contribute to distinguishing t2 from t1 and t3, thus, likely to be selected. However, in regression mode (regression by t1->t2->t3), gene G does not change with phenotypes in the same direction, so it will not likely be selected. Together, for the same input, the genes selected by PENCIL’s regression mode and classification mode have some overlap, but they are not identical.

How does PENCIL determine the time change direction in continuous phenotype regression?

The regression mode of PENCIL proposes a new type of analysis for supervised learning of phenotypic trajectory of subpopulations from single-cell data. For example, for three samples from three time points, the cells from the time point 1, 2, and 3 conditions are labeled with 1, 2, and 3, respectively, which were used as target labels to train the model. The resulting prediction module in the regression mode of PENCIL will assign continuous time orders to the selected cells. Thus, the time change direction of the trajectory indicated by PENCIL is the same as the given phenotype labels used to train the model. Therefore, the supervised learning of phenotypic trajectory implemented in PENCIL can determine the time change direction, which is impossible for previous unsupervised pseudo-time analysis.

Why do different orders of input genes lead to different outputs?

In theory, the arrangement of genes in the dataset should not influence the mathematical model. However, variations in outcomes might arise during the execution of the optimization algorithm. For instance, our model and its solution methodology are grounded in the PyTorch deep learning framework, which involves a phase of random parameter initialization. Despite setting a fixed random state, varying the order of input features essentially amounts to different initializations. Consequently, this can result in the model converging to disparate local optima, thereby yielding divergent outcomes.

The extent of this discrepancy might also be influenced by the data distribution itself. For instance, in our simulated dataset, altering the gene order yields only a negligible difference in results. However, with the real data we utilized, this difference appears more pronounced. Yet, multiple tests with gene order randomly altered indicate that the difference remains within a tolerable range.

Setting the model’s parameters to a uniform constant could potentially resolve this discrepancy. However, this approach might lead to a less effective optimization process. In our analysis, the expression matrices extracted from the Seurat object are sorted by default based on the variance score (‘vst.variance.standardized’) in ascending order. We strongly recommend that users with different data input methods adopt a similar gene sorting strategy before initiating the PENCIL workflow. This practice is helpful to achieving consistent results.

For instance, for the users who inputs a count file (or seurat-object) and wishes to save the required data as CSV files, then read the CSV files in Python to run PENCIL, they can follow the steps below can to realize loading and saving:

# In R
# Load the ".txt" count file and create a seurat object.
counts = read.table('sc_counts.txt')
sc_data = CreateSeuratObject(counts=counts)

# Preprocess by seurat workflow
sc_data <- NormalizeData(object=sc_data, normalization.method="LogNormalize")
sc_data <- FindVariableFeatures(object=sc_data, selection.method='vst', nfeatures=2000)
sc_data <- ScaleData(object=sc_data)

# Extract the scaled expression matrix and phenotype labels, and save them to ".csv" files. 
# Genes are listed in descending order by VST score. 
exp_data = sc_data@assays[["RNA"]]@scale.data[VariableFeatures(sc_data),] 
write.csv(exp_data, file=paste(filepath, 'exp_data.csv', sep = ''))

labels = as.matrix(sc_data$phenotype)
row.names(labels) = colnames(exp_data)
write.csv(labels, file=paste(filepath, 'label.csv', sep = ''))

And for those using Scanpy to analyze their data in python, they can also follow the process below to fix the order of the genes:

# In a terminal
# Install the dependency packages first
pip install scanpy
pip install --user scikit-misc

# In Python
# Preprocess by seurat scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor='seurat_v3', n_top_genes=2000)

# Order genes by 'highly_variable_rank'
HVG_info = adata.var.loc[adata.var['highly_variable'], ].copy()
HVG_info.sort_values(by='highly_variable_rank', ascending=True, inplace=True)
HVGs = HVG_info.index.values
adata = adata[:, HVGs]

sc.pp.scale(adata, max_value=10)
data = adata_.X

How to set the parameter class_weights?

The purpose of introducing class_weights is to balance the number of cells in different classes in order to prevent some classes from being rejected all together because of a low number of cells. Therefore, when the number of cells in each category differs greatly, we recommend setting the weights to be inversely proportional to the number of cells in each category. For example, if there are three categories containing 1000, 2000, and 3000 cells respectively, class_weights should be set to 3:1.5:1 in order to balance their cell numbers.

Alternatively, the parameter can also be set based on how much the user values different categories.

Frequently asked questions

2023-11-13