Han Y, Gao S, Muegge K, Zhang W, Zhou B. (2015) Advanced Applications of RNA Sequencing and Challenges. Bioinform Biol Insights 9(Suppl 1):29-46. [article]
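The transcriptome is the complete set of RNA in a single cell or a population of cells. One important application of RNAseq is to compare transcriptomes across conditions in order to study gene expression at different developmental stages or under specific stimuli. The key to comparing expression levels is being able to quantify the different isoform transcripts separately, which in turn requires a clearer delineation of genomic functional elements, since both affect the estimated expression level of a given transcript. This comparison of transcript abundance across conditions is called differentially expressed gene (DEG) analysis. Several statistical approaches exist for estimating the abundance of individual transcripts, and some of them derive from methods developed earlier for microarray data.
- Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98(9):5116–21. (cited 10593 times)
- Grant GR, Manduchi E, Stoeckert CJ Jr. Analysis and management of microarray gene expression data. Curr Protoc Mol Biol. 2007;Chapter 19:Unit19.6. (cited 31 times)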
The most common measure of a transcript's expression level is RPKM (reads per kilobase per million mapped reads). As the name suggests, the read count for a transcript is normalized by the transcript length and by the total number of mapped reads. Many other factors still affect the estimate, such as sequencing depth, gene length, and isoform abundance. The main criticism of RPKM is that it does not account for the different isoforms of the same gene. A newer approach, RSEM (RNA-Seq by Expectation-Maximization), does take isoforms into account.
- Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. (cited 998 times)
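The RPKM normalization described above can be written out in a few lines. This is a minimal sketch of the definition only; the function name and the example numbers are hypothetical and do not come from the review.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("transcript length and library size must be positive")
    # Dividing by (length / 1e3) and by (library size / 1e6) gives the 1e9 factor.
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```

Note that the calculation uses a single length per transcript, so it says nothing about how reads shared between isoforms should be split, which is the gap that RSEM addresses.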
Most analysis algorithms rely only on the simplest count-based probability distribution (the Poisson distribution) together with Fisher's exact test to test for differential expression.
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17. (cited 1647 times)
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8. (cited 5320 times)
- Jiang H, Wong WH. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics. 2009;25(8):1026–32. (cited 331 times)
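To make the count-based test concrete, here is a rough sketch of a two-sided Fisher's exact test for one gene in two samples, using only the Python standard library. The 2x2 table layout (reads on the gene versus all other reads, sample A versus sample B), the function names, and the example counts are assumptions made for illustration; they are not taken from any of the cited papers.

```python
from math import exp, lgamma


def _log_comb(n, k):
    # Log of the binomial coefficient C(n, k); stable for large library sizes.
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = _log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability of seeing x in the top-left cell.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denom

    log_observed = log_prob(a)
    p_value = 0.0
    # Sum the probabilities of all tables that are no more likely than the observed one.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(x)
        if lp <= log_observed + 1e-7:
            p_value += exp(lp)
    return min(1.0, p_value)


# Hypothetical counts: the gene has 30 reads out of 1,000,000 in sample A
# and 60 reads out of 1,000,000 in sample B.
gene_a, total_a = 30, 1_000_000
gene_b, total_b = 60, 1_000_000
p = fisher_exact_two_sided(gene_a, total_a - gene_a, gene_b, total_b - gene_b)
print(p)
```

Because the test compares one library against another, it only reflects sampling noise within those two libraries and carries no information about variation between biological replicates.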
These tests do not account for biological variability. Biological variability can be addressed by analyzing data from replicate experiments with permutation-derived methods.
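As an illustration of the permutation idea, the sketch below runs an exact permutation test on the difference in mean expression between two groups of replicates. It is a generic textbook procedure, not the specific method of any cited paper, and the function name and expression values are made up.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the absolute difference in means."""
    values = list(group_a) + list(group_b)
    n_a, n_total = len(group_a), len(values)
    total = sum(values)

    def abs_mean_diff(indices_a):
        sum_a = sum(values[i] for i in indices_a)
        return abs(sum_a / n_a - (total - sum_a) / (n_total - n_a))

    observed = abs_mean_diff(range(n_a))
    n_extreme = 0
    n_splits = 0
    # Enumerate every way of relabelling the samples into two groups of the same sizes.
    for indices_a in combinations(range(n_total), n_a):
        n_splits += 1
        if abs_mean_diff(indices_a) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_splits


# Hypothetical normalized expression values for one gene, three replicates per condition.
p = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
print(p)
```

With three replicates per group there are only 20 possible relabellings, so the p-value cannot go below 0.1 even for perfectly separated groups. Permutation approaches therefore need a reasonable number of replicates to be useful.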
When enough replicates are available, an extended Poisson distribution can be used for more advanced analysis, that is, a count model that allows for overdispersion, such as the negative binomial model used by edgeR.
- Sengoelge G, Winnicki W, Kupczok A, et al. A SAGE based approach to human glomerular endothelium: defining the transcriptome, finding a novel molecule and highlighting endothelial diversity. BMC Genomics. 2014;15:725. (cited 2 times)
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. (cited 2678 times)
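One way to see why the Poisson model needs extending: under a Poisson model the variance of replicate counts equals the mean, whereas a negative binomial model allows variance = mean + dispersion × mean². The sketch below estimates that dispersion for a single gene with a simple method-of-moments formula. It assumes equal library sizes and hypothetical counts, and it is not edgeR's algorithm, which uses more elaborate estimators that share information across genes.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion phi in: variance = mean + phi * mean**2."""
    m = mean(counts)
    if m == 0:
        return 0.0
    v = variance(counts)  # sample variance across replicates
    # A Poisson-like gene (variance <= mean) is assigned zero dispersion.
    return max(0.0, (v - m) / (m * m))


# Hypothetical read counts for one gene across four replicates of equal depth.
phi = moment_dispersion([10, 20, 30, 40])
print(phi)
```

A dispersion well above zero means the replicates vary more than Poisson sampling alone would predict, which is the biological variability the simpler tests ignore.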
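However, RNAseq studies are expensive, so running many replicates is often impractical. Some methods instead model biological variability directly from a limited number of samples and then perform pairwise or multiple-group comparisons. Several tools have been developed for differential expression analysis, including Cuffdiff, DESeq, and edgeR.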
Read counts are often skewed in distribution, so an algorithm is needed to transform the distribution before statistical tests can be applied. One tool developed for this purpose, PoissonSeq, uses a Poisson log-linear model for DEG analysis.
- Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012;13(3):523–38. (cited 111 times)
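To show what a Poisson log-linear model looks like for a single gene, here is a small sketch of a likelihood-ratio test for a two-group effect in the model log(mu_i) = log(depth_i) + b0 + b1 × group_i. This is a generic illustration and not PoissonSeq's actual procedure, which differs in how it estimates sequencing depth and assesses significance. The function name, the depth values, and the counts are hypothetical.

```python
from math import erfc, log, sqrt


def poisson_loglinear_lrt(counts, depths, groups):
    """Return (LRT statistic, p-value) for a two-group Poisson log-linear model."""

    def max_loglik(total_count, total_depth):
        # Poisson log-likelihood at the fitted rate total_count / total_depth,
        # dropping terms that are identical under both models.
        if total_count == 0:
            return 0.0
        return total_count * log(total_count / total_depth) - total_count

    # Null model: one common rate per unit of depth (b1 = 0).
    null_ll = max_loglik(sum(counts), sum(depths))
    # Alternative model: a separate rate for each group.
    alt_ll = 0.0
    for g in set(groups):
        n_g = sum(c for c, grp in zip(counts, groups) if grp == g)
        d_g = sum(d for d, grp in zip(depths, groups) if grp == g)
        alt_ll += max_loglik(n_g, d_g)
    stat = max(0.0, 2.0 * (alt_ll - null_ll))
    # Chi-square survival function with 1 degree of freedom.
    return stat, erfc(sqrt(stat / 2.0))


# Hypothetical counts for one gene in four samples of equal relative depth.
stat, p = poisson_loglinear_lrt([10, 12, 30, 28], [1.0, 1.0, 1.0, 1.0], [0, 0, 1, 1])
print(stat, p)
```

The depth terms act as offsets, so a sample sequenced twice as deeply is expected to carry twice the reads for the same underlying expression level.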
In addition, tools that originated in microarray analysis have been applied to RNAseq expression analysis. The voom function in the limma package transforms read counts to a log scale and estimates precision weights, so that normal-based (Gaussian) linear models and statistical tests can then be applied.
- Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15(2):R29. (cited 2014 times)
- Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
- Ritchie ME, Silver J, Oshlack A, et al. A comparison of background correction methods for two-colour microarrays. Bioinformatics. 2007;23(20):2700–7. (cited 646 times)
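The first step of this approach, the log-scale transformation, can be sketched as follows. The formula log2((count + 0.5) / (library size + 1) × 10^6) is the log-CPM transformation described in the voom paper; the precision-weight estimation and the linear modelling done by limma are omitted here, and the Python function is only an illustration, not part of limma.

```python
from math import log2


def log_cpm(counts, lib_size=None):
    """Log2 counts per million with small offsets to avoid taking log of zero."""
    if lib_size is None:
        lib_size = sum(counts)
    return [log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]


# Hypothetical counts for three genes in one sample.
values = log_cpm([0, 15, 250], lib_size=999_999)
print(values)
```

The offsets of 0.5 and 1 keep zero counts finite and ensure the ratio stays strictly between 0 and 1.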
The following studies compare the DEG analysis methods currently available and are worth consulting before choosing an analysis tool.
- Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91. (cited 231 times)
- Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot. 2012;99(2):248–56. (cited 125 times)
DEG analysis still has many open challenges, such as:
- How to make read coverage uniform along the genome despite variation in nucleotide composition;
- How to detect "within-sample" variations without simply assuming that the underlying conditions or treatments affect all individual genes equally;
- How to improve current methods to detect differences in gene isoform preferences and abundance levels under varying conditions;
- How to account for the different probability of read coverage in long genes versus short genes, given the great sequencing depth that can be achieved nowadays.