To this end, SCA first quantifies the importance of each transcript in each cell by converting transcript counts into surprisal scores (Fig. 1b; Additional File 1: Algorithm 1). To determine the score of a given transcript in a given cell, we compare its expression distribution among the cell’s k nearest neighbors to its global expression (i.e., to the expected distribution of the transcript among a set of k cells randomly chosen from the entire dataset) for a user-specified neighborhood size k. A transcript whose local expression deviates strongly from its global expression is more likely to inform the cell’s location in relation to other cells, and therefore its identity. We quantify this deviation through a Wilcoxon rank-sum test, which produces a p-value representing the probability of the observed deviation in a random set of k cells. Following Shannon’s definition [24], the surprisal or self-information of the observed deviation is then defined as the negative logarithm of its probability, i.e., as $-\log(p)$. This is a positive number which measures how surprising the transcript’s local expression is, in units of nats when the logarithm is natural (changing the base scales the scores by a constant factor, which does not affect SCA’s output). To distinguish over- from under-expression, we flip the sign for under-expressed transcripts (Methods). The resulting scores are compiled into a surprisal matrix with the same dimensionality as the input data.
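To make the scoring step concrete, here is a minimal pure-Python sketch of a signed surprisal score. It is an illustration under stated assumptions, not the SCA implementation: the function names are hypothetical, the rank-sum p-value uses a tie-corrected normal approximation (two-sided, no continuity correction), each neighborhood is compared against the remaining cells rather than against the whole dataset, and the neighbor lists are taken as given instead of being computed.

```python
import math

def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def rank_sum_z(local, rest):
    """z-statistic of the Wilcoxon rank-sum test (normal approximation)."""
    n1, n2 = len(local), len(rest)
    n = n1 + n2
    pooled = list(local) + list(rest)
    w = sum(midranks(pooled)[:n1])          # rank sum of the local sample
    mean = n1 * (n + 1) / 2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:                            # all values identical: no deviation
        return 0.0
    return (w - mean) / math.sqrt(var)

def signed_surprisal(local, rest):
    """-log(p) in nats, negated when the local sample is under-expressed."""
    z = rank_sum_z(local, rest)
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    p = max(p, 1e-300)                      # assumption: clamp to avoid log(0)
    return math.copysign(-math.log(p), z)

def surprisal_matrix(X, neighbors):
    """X: cells x genes counts; neighbors[c]: indices of cell c's neighborhood."""
    n_cells, n_genes = len(X), len(X[0])
    S = [[0.0] * n_genes for _ in range(n_cells)]
    for c in range(n_cells):
        hood = set(neighbors[c])
        for g in range(n_genes):
            local = [X[i][g] for i in hood]
            rest = [X[i][g] for i in range(n_cells) if i not in hood]
            S[c][g] = signed_surprisal(local, rest)
    return S

# Toy data: gene 0 separates two groups of cells, gene 1 is constant.
X = [[5, 1], [6, 1], [7, 1], [0, 1], [1, 1], [2, 1]]
neighbors = [[0, 1, 2]] * 3 + [[3, 4, 5]] * 3
S = surprisal_matrix(X, neighbors)
```

In this toy example, gene 0 receives a positive score in the first three cells and a negative score of equal magnitude in the last three, while the constant gene 1 scores zero everywhere. The output `S` has the same shape as `X`, matching the description of the surprisal matrix above.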
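This idea makes good sense: a gene's importance is judged against its background, and a gene that differs enough from that background may be important. The approach has one problem, though. Because of capture issues in single-cell technologies, a very large difference in a gene's expression between cell A and its neighbors may be an artifact of experimental error rather than a real biological difference.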