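Data science naturally has three phases: how to plan the data to be taken (called design for data), how to actually collect the data (called collection of data), and how to analyze the data (called analysis on data). What matters is that one consistent idea, namely elucidating and understanding phenomena through data, runs through all three phases.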
How to determine relative expression of genes compared to a reference gene using qPCR Ct values? (ResearchGate) I do not have treated and untreated samples. I just want to know how much of gene A is expressed relative to gene B and/or the reference gene.
An improvement of the 2ˆ(–delta delta CT) method for quantitative real-time polymerase chain reaction data analysis. Biostat Bioinforma Biomath. 2013 Aug; 3(3): 71–85. (PubMed PMC4280562)
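When there is no treated/untreated comparison, a commonly used simplification of the 2^(-ΔΔCt) idea is the single-sample ΔCt form, 2^(-(Ct_target - Ct_reference)). The sketch below illustrates that arithmetic only. The Ct values are hypothetical, and it assumes both assays amplify with the same efficiency (2.0 means perfect doubling per cycle), which should be checked for real primers.

```python
def relative_expression(ct_target: float, ct_reference: float,
                        efficiency: float = 2.0) -> float:
    """Expression of a target gene relative to a reference gene in one sample.

    Assumption: target and reference amplify with the same efficiency
    (2.0 = the amount of product doubles every cycle).
    """
    delta_ct = ct_target - ct_reference
    # A lower Ct means more template, so the exponent is negated.
    return efficiency ** (-delta_ct)


# Hypothetical Ct values for one sample (not taken from any real experiment).
ct_values = {"geneA": 24.0, "geneB": 27.0, "reference": 20.0}

rel_a = relative_expression(ct_values["geneA"], ct_values["reference"])
rel_b = relative_expression(ct_values["geneB"], ct_values["reference"])
# Gene A relative to gene B: the reference Ct cancels out.
a_vs_b = relative_expression(ct_values["geneA"], ct_values["geneB"])

print(rel_a, rel_b, a_vs_b)
```

With these made-up numbers gene A sits 4 cycles behind the reference and 3 cycles ahead of gene B, so under the equal-efficiency assumption it is expressed at 1/16 of the reference and 8 times gene B.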
a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. (Maaten and Hinton, 2008 PDF)
Most researchers are already familiar with another dimensionality reduction algorithm, Principle Components Analysis (PCA) also available in R2 and explained in more detail in the Principle Components Analysis tutorial. Both PCA and t-SNE reduce the dimension while maintaining the structure of high dimensional data, however, PCA can only capture linear structures. t-SNE on the other hand captures both linear and non-linear relations and preserves local distances in high dimensions while reducing the information to 2 dimensions (an XY plot). (16. t-SNE: high dimensionality reduction in R2 How to find groups in your dataset using t-SNE. r2-tutorials.readthedocs.io)
Why do we need dimensionality reduction in the first place?
Computers have no problem processing that many dimensions. However, we humans are limited to three dimensions. Computers still need us (thankfully), so we often need ways to effectively visualize high-dimensional data before handing it over to the computer. (An illustrated introduction to the t-SNE algorithm By Cyrille Rossant March 3, 2015)
(The site above explains t-SNE very clearly with visuals.)
Why is t-SNE better suited than PCA to gene analysis?
First, although PCA minimizes global reconstruction error, it may not preserve local proximities of points. In visualizing gene expression data, we are typically more interested in resolving nearby clusters than in preserving the correct distance relationships between genes with very different patterns of expression. But the optimization criterion of PCA results in the opposite priority: the relationship of distant points is depicted as accurately as possible, while small inter-point distances can be distorted. Second, there may be no single linear projection that gives a good view of the data: in such a case, all linear projection methods will fail. (An intuitive graphical visualization technique for the interrogation of transcriptome data. Bushati et al., 2011. Nucleic Acids Research, Volume 39, Issue 17, Pages 7380–7389)
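The second point in the quote can be illustrated with a toy case that is not from the paper: two concentric rings in 2D. The sketch below (hypothetical data, standard library only) projects both rings onto every direction in 1-degree steps and checks whether any 1D linear projection separates them, then checks a simple non-linear feature, the distance from the origin.

```python
import math


def ring(radius: float, n: int = 60) -> list[tuple[float, float]]:
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def project(points, angle: float) -> list[float]:
    """Linear projection of 2D points onto the unit vector at this angle."""
    ux, uy = math.cos(angle), math.sin(angle)
    return [x * ux + y * uy for x, y in points]


inner, outer = ring(1.0), ring(3.0)  # two toy "clusters"


def separated(angle: float) -> bool:
    """True if the two rings occupy disjoint intervals after projection."""
    a, b = project(inner, angle), project(outer, angle)
    return max(a) < min(b) or max(b) < min(a)


# Try every direction in 1-degree steps (180 degrees covers all 1D projections).
angles = [math.pi * k / 180 for k in range(180)]
any_separating_direction = any(separated(t) for t in angles)

# A non-linear feature (distance from the origin) separates the rings cleanly.
radial_gap = (min(math.hypot(x, y) for x, y in outer)
              - max(math.hypot(x, y) for x, y in inner))

print(any_separating_direction, radial_gap)
```

No direction separates the rings, because the inner ring always projects inside the interval covered by the outer ring, while the radial distance keeps them about 2 units apart. This is only a cartoon of the failure mode. Real expression data will not be this clean.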
What should you watch out for when using t-SNE?
Following are a few common fallacies to avoid while interpreting the results of t-SNE:
For the algorithm to execute properly, the perplexity should be smaller than the number of points. Also, the suggested perplexity is in the range of (5 to 50)
Sometimes, different runs with same hyper parameters may produce different results.
…
(Comprehensive Guide on t-SNE algorithm with implementation in R & Python, SAURABH.JAJU2, JANUARY 22, 2017)
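The perplexity constraint quoted above follows from how perplexity is defined in the t-SNE paper: 2 raised to the Shannon entropy of the neighbor distribution p(j|i) around a point. The sketch below is not the t-SNE implementation, only that one formula, applied to hypothetical squared distances. A point with n - 1 possible neighbors can never reach a perplexity above n - 1.

```python
import math


def conditional_probs(distances_sq: list[float], sigma: float) -> list[float]:
    """Gaussian neighbor probabilities p(j|i) from squared distances to the other points."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in distances_sq]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs: list[float]) -> float:
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy


# Hypothetical squared distances from one point to its 4 neighbors (5 points in total).
distances_sq = [1.0, 4.0, 9.0, 16.0]

# Small bandwidth: almost all probability mass falls on the nearest neighbor.
narrow = perplexity(conditional_probs(distances_sq, sigma=0.5))
# Very large bandwidth: the distribution is almost uniform over the 4 neighbors.
wide = perplexity(conditional_probs(distances_sq, sigma=100.0))

print(narrow, wide)
```

As the bandwidth grows, the perplexity approaches the number of neighbors (4 here) but cannot exceed it. Asking for a perplexity at or above the number of points therefore leaves no bandwidth that satisfies the request. Perplexity can be read loosely as the effective number of neighbors each point considers.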
What are the weaknesses of t-SNE?
t-SNE has three potential weaknesses: (1) it is unclear how t-SNE performs on general dimensionality reduction tasks, (2) the relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data, and (3) t-SNE is not guaranteed to converge to a global optimum of its cost function. (Maaten and Hinton, 2008 PDF)
When did t-SNE start being used on genomic data (analysis of gene expression profiles)?
As far as I could find, there are no papers older than the ones below.
Here, we test the recently developed nonlinear dimensionality reduction algorithm, t-statistic Stochastic Neighbor Embedding (t-SNE) (8), on a variety of real-world transcriptome data sets. (An intuitive graphical visualization technique for the interrogation of transcriptome data. Nucleic Acids Res. 2011 Sep 1;39(17):7380-9.)
We tested seven DRTs applied to four microarray cancer datasets and ran four clustering algorithms using the original and reduced datasets. … On the other hand, t-distributed Stochastic Embedding (t-SNE) and Laplacian Eigenmaps (LE) achieved good results for all datasets. (Comparative study on dimension reduction techniques for cluster analysis of microarray data. Date of Conference: 31 July-5 Aug. 2011 ieeexplore.ieee.org)
What kinds of data can it be used on?
Question: why PCA for RNA-Seq but tSNE for scRNA-seq? (biostars.org)
Question: What to use: PCA or tSNE dimension reduction in DESeq2 analysis? (support.bioconductor.org)
How do you actually use t-SNE? (For biology researchers)
t-SNE with R
A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor (bioconductor.org)
A step-by-step workflow for low-level analysis of single-cell RNA-seq data Aaron T.L. Lun, et al. F1000Research Software tool article
The Rtsne module in Array Studio will allow the user to cluster different cells with UMI counts, using the Rtsne package in R (arrayserver.com)
Seurat is an R package designed for QC, analysis, and exploration of single cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. (satijalab.org)
Identifying and Characterizing Subpopulations Using Single Cell RNA-seq Data (hms-dbmi.github.io)
Guides to single-cell RNA-seq
A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications (Haque et al., Genome Med. 2017; 9: 75)
Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Bo Wang, Junjie Zhu, Emma Pierson, Daniele Ramazzotti & Serafim Batzoglou. Nature Methods volume 14, pages 414–416 (2017). Proposal of a new method: We present single-cell interpretation via multikernel learning (SIMLR), an analytic framework and software which learns a similarity measure from single-cell RNA-seq data in order to perform dimension reduction, clustering and visualization. Comparison with existing methods: On seven published data sets, we benchmark SIMLR against state-of-the-art methods.
CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Lin P, Troup M, Ho JW. Genome Biol. 2017 Mar 28;18(1):59. Proposal of a new method: Most existing dimensionality reduction and clustering packages for single-cell RNA-seq (scRNA-seq) data deal with dropouts by heavy modeling and computational machinery. Here, we introduce CIDR (Clustering through Imputation and Dimensionality Reduction), an ultrafast algorithm that uses a novel yet very simple implicit imputation approach to alleviate the impact of dropouts in scRNA-seq data in a principled manner. Comparison with conventional methods such as t-SNE: Using a range of simulated and real data, we show that CIDR improves the standard principal component analysis and outperforms the state-of-the-art methods, namely t-SNE, ZIFA, and RaceID, in terms of clustering accuracy. Figure of representative results.
Visualization and cellular hierarchy inference of single-cell data using SPADE. Anchang B, Hart TD, Bendall SC, Qiu P, Bjornson Z, Linderman M, Nolan GP, Plevritis SK. Nat Protoc. 2016 Jul;11(7):1264-79. Proposal of a new data visualization method: we describe the use of Spanning-tree Progression Analysis of Density-normalized Events (SPADE), a density-based algorithm for visualizing single-cell data and enabling cellular hierarchy inference among subpopulations of similar cells. Comparison with t-SNE, another data visualization method: We compare SPADE with recently developed single-cell visualization approaches based on the t-distribution stochastic neighborhood embedding (t-SNE) algorithm.
SAIC: an iterative clustering approach for analysis of single cell RNA-seq data. Yang L, Liu J, Lu Q, Riggs AD, Wu X. BMC Genomics. 2017 Oct 3;18(Suppl 6):689. Importance of the analysis: An important step in the single-cell transcriptome analysis is to identify distinct cell groups that have different gene expression patterns. Problems with conventional methods: Many studies rely on principal component analysis (PCA) with arbitrary parameters to identify the genes that will be used to cluster the single cells. Proposal of a new method: We have developed a novel algorithm, called SAIC (Single cell Analysis via Iterative Clustering), that identifies the optimal set of signature genes to separate single cells into distinct groups. Use of t-SNE in the data visualization step: We applied the SAIC algorithm to one simulated dataset and two published single cell datasets. After signature genes selection, the results were evaluated by Davies-Bouldins index and then visualized using both a t-SNE 2D-plot and an unsupervised hierarchical clustering heatmap.