Monocle2轨迹分析 参考资料:单细胞之轨迹分析-2:monocle2 原理解读+实操 - 简书
[TOC]
Monocle2分析整体流程(结构化框架) Step 1:构建CellDataSet对象 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 library( monocle) convert_seurat_to_cds <- function ( seurat_obj, assay= "RNA" , out_file= NULL ) { expr_matrix <- as( as.matrix( seurat_obj@ assays[[ "RNA" ] ] @ counts) , "sparseMatrix" ) pd <- seurat_obj@ meta.data pd$ cell_id <- rownames( pd) pd <- new( "AnnotatedDataFrame" , data = pd) fd <- data.frame( gene_id = rownames( expr_matrix) , gene_short_name = rownames( expr_matrix) , row.names = rownames( expr_matrix) ) fd <- new( "AnnotatedDataFrame" , data = fd) cds <- newCellDataSet( expr_matrix, phenoData = pd, featureData = fd, lowerDetectionLimit = 0.5 , expressionFamily = negbinomial.size( ) ) if ( ! is.null ( out_file) ) { saveRDS( cds, file = out_file) } return ( cds) }
核心要点:
输入必须是 counts或接近counts的数据
meta信息要完整(cluster / celltype)
参数解读: expressionFamily
在Monocle2分析中,expressionFamily参数用于指定表达数据的统计分布假设。
对于原始UMI或read count数据,由于其具有离散性和过度离散(overdispersion)特征,通常采用负二项分布建模(negbinomial.size())。相比negbinomial(),该方法在保证计算效率的同时具有较好的拟合性能,因此被广泛使用。
而对于经过归一化或对数转换的数据(如TPM/FPKM或Seurat的log-normalized表达矩阵),其分布更接近连续型(常近似对数正态分布),此时应使用高斯分布模型(gaussianff()),而不应继续采用负二项分布。
tobit()模型用于处理具有“截断(censored)”特征的连续表达数据,即当真实表达值低于某一检测阈值时被记录为0。这类模型曾用于早期非UMI单细胞数据(如FPKM/TPM),以解释大量零表达值的来源。然而,在当前以UMI计数为主的单细胞测序数据中,表达值本质为离散计数,负二项分布(negbinomial.size())能够更合理地建模其统计特性。因此,tobit()在现代单细胞分析流程中已较少使用。
Step 2:标准化与离散度估计 1 2 cds <- estimateSizeFactors( cds) cds <- estimateDispersions( cds)
作用:
size factor:消除测序深度差异
dispersion:用于筛选高变基因
Step 3:过滤 一般来说,到这里之前Seurat里面已经完成了细胞过滤,这里可以省略。不过可以根据表达该基因的细胞数目过滤一下基因
1 2 3 4 5 6 cds <- detectGenes( cds, min_expr = 0.1 ) summary( fData( cds) $ num_cells_expressed) expressed_genes <- rownames( subset( fData( cds) , num_cells_expressed >= 10 ) )
Step 4:细胞分类 不推荐用Monocle自带的做,应该在Seurat对象里面已经分清楚。
Step 5:选择ordering genes(轨迹定义基因) 在 Monocle2 中,ordering genes 的选择直接决定轨迹结构 ,本质上是在回答:
👉 “用哪些基因去定义细胞状态变化?”
不同策略对应不同的生物学假设。
🧠 方法分类
方法
类型
核心思想
是否推荐
差异基因(cluster)
无监督
细胞群差异驱动轨迹
⭐⭐⭐⭐
高变基因(dispersion)
无监督
表达波动最大
⭐⭐⭐
发育marker
半监督
已知生物过程
⭐⭐⭐⭐
时间/状态差异基因
弱监督
拟时间相关
⭐⭐⭐⭐⭐
方法1:差异表达基因(最常用 & 推荐) 理想状态下,我们希望尽可能少的使用正在研究的对象的先验知识,以免引入过多偏见。
📌 核心逻辑
不同细胞群之间的差异基因 👉 最有可能驱动状态转变
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 diff <- differentialGeneTest( cds, fullModelFormulaStr = "~celltype" ) deg <- rownames( subset( diff, qval < 0.01 ) ) write.table( deg, file = "monocle.DEG.xls" , col.names = F , sep = "\t" , quote = F ) ordering_genes <- rownames( deg) ordering_genes <- rownames( deg) [ order( deg$ qval) ] [ 1 : 2000 ]
方法2:高变基因(适用于一群细胞内部的轨迹分析) 即没有内部分群时
1 2 3 4 5 6 7 8 disp_table <- dispersionTable( cds) ordering_genes <- subset( disp_table, mean_expression >= 0.1 & dispersion_empirical >= dispersion_fit ) $ gene_id
dispersion_empirical表示基因表达的实际离散程度,而dispersion_fit表示在给定平均表达水平下模型预测的技术噪声水平。通过筛选dispersion_empirical大于dispersion_fit的基因,可以识别出表达波动显著高于预期的基因(即overdispersed genes),这些基因更可能反映真实的生物学异质性,而非测序噪声。
其他方法分析
方法
本质在选什么
信息类型
VariableFeatures(Seurat)
方差最大的基因
技术+生物混合波动
FindAllMarkers(Seurat)
cluster差异基因
细胞群差异
dispersionTable
overdispersion基因
表达不稳定性
differentialGeneTest(~Cluster)
模型驱动差异
统计显著变化
不同基因选择方法本质上反映了不同的信息提取策略。Seurat的VariableFeatures和Monocle的dispersionTable侧重于识别表达波动较大的基因,但这些基因未必与细胞状态变化直接相关。FindAllMarkers则通过群间差异捕捉细胞类型特异性表达,而Monocle的differentialGeneTest通过广义线性模型在统计框架下识别与细胞分组显著相关的基因。
由于轨迹推断依赖于连续状态变化信号,基于差异表达(尤其是模型驱动的方法)通常比单纯基于表达波动的方法更为稳健。因此,推荐优先使用differentialGeneTest结合表达覆盖过滤来选择ordering genes。
Step 6:降维与排序 1 2 3 4 5 6 cds <- setOrderingFilter( cds, ordering_genes) cds <- reduceDimension( cds, max_components = 2 , method = "DDRTree" ) cds <- orderCells( cds)
关键点:
DDRTree = Monocle2核心
orderCells决定伪时间方向,使用root_state可以设置起点
Step 7:拟时序结果可视化 拟时序轨迹 1 2 3 4 plot_cell_trajectory( cds, color_by = "Pseudotime" , size = 1 , show_backbone = TRUE ) plot_cell_trajectory( cds, color_by = "Pseudotime" , size = 1 , show_backbone = TRUE ) + scale_color_npg( )
细胞类型轨迹 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 plot_cell_trajectory( cds, color_by = "celltype" , size = 1 , show_backbone = TRUE ) p1 <- plot_cell_trajectory( cds, x= 1 , y= 1 , color_by = "celltype" ) + theme( legend.position= "none" , panel.border = element_blank( ) ) + scale_color_npg( ) p2 <- plot_complex_cell_trajectory( cds, x= 1 , y= 1 , color_by = "celltype" ) + scale_color_npg( ) + theme( legend.title = element_blank( ) ) p1| p2 pdata <- pData( cds) ggplot( pdata, aes( Pseudotime, color= celltype, fill= celltype) ) + geom_density( bw = 0.5 , size= 1 , alpha= 0. .5) + theme_classic2( )
State轨迹 1 2 3 4 plot_cell_trajectory( cds, color_by = "State" , size = 1 , show_backbone = TRUE ) plot_cell_trajectory( cds, color_by = "State" ) + facet_wrap( "~State" , nrow = 1 )
Step 8 :指定基因可视化 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 gene_use <- "SDCCAG8" cds_subset <- cds[ gene_use, ] p1 <- plot_genes_in_pseudotime( cds_subset, color_by = "State" ) p2 <- plot_genes_in_pseudotime( cds_subset, color_by = "celltype" ) p3 <- plot_genes_in_pseudotime( cds_subset, color_by = "Pseudotime" ) p1| p2| p3 p1 <- plot_genes_jitter( cds_subset, grouping = "State" , color_by = "State" ) p2 <- plot_genes_violin( cds_subset, grouping = "State" , color_by = "State" ) p3 <- plot_genes_in_pseudotime( cds_subset, color_by = "State" ) p1| p2| p3 library( monocle) library( ggplot2) library( viridis) plot_gene_trajectory_replot <- function ( cds_obj, gene_use, out_pdf = NULL , log_transform = TRUE , upper_quantile = 1 , palette_option = "plasma" , point_size = 1.2 , point_alpha = 0.9 , base_size = 14 , plot_title = NULL ) { if ( ! gene_use %in% rownames( cds_obj) ) { stop( paste0( gene_use, " is not present in rownames(cds_obj)." ) ) } coord_df <- as.data.frame( reducedDimS( cds_obj) ) coord_df <- t( coord_df) coord_df <- as.data.frame( coord_df) if ( ncol( coord_df) < 2 ) { stop( "reducedDimS(cds_obj) does not contain at least 2 dimensions." ) } colnames( coord_df) [ 1 : 2 ] <- c ( "Dim1" , "Dim2" ) coord_df$ cell_id <- rownames( coord_df) expr_vec <- as.numeric ( exprs( cds_obj) [ gene_use, coord_df$ cell_id] ) if ( log_transform) { expr_vec <- log1p( expr_vec) } upper_cut <- quantile( expr_vec, upper_quantile, na.rm = TRUE ) expr_vec_plot <- pmin( expr_vec, upper_cut) coord_df$ expr <- expr_vec_plot coord_df <- coord_df[ order( coord_df$ expr, decreasing = FALSE ) , ] if ( is.null ( plot_title) ) { plot_title <- paste0( gene_use, " log1p expression along trajectory" ) } p <- ggplot( coord_df, aes( x = Dim1, y = Dim2, color = expr) ) + geom_point( size = point_size, alpha = point_alpha) + scale_color_viridis_c( option = palette_option, name = paste0( gene_use, "\nexpression" ) ) + theme_classic( base_size = base_size) + theme( axis.line = element_line( color = "black" , linewidth = 0.6 ) , plot.title = element_text( hjust = 0.5 ) ) + labs( title = plot_title, x = "Component 1" , y = "Component 2" ) if ( ! is.null ( out_pdf) ) { pdf( out_pdf, width = 6 , height = 5 ) print( p) dev.off( ) } return ( p) } p <- plot_gene_trajectory_replot( cds_obj = cds, gene_use = "SDCCAG8" , out_pdf = "data_trajectory_replot.pdf" , log_transform = TRUE , upper_quantile = 0.99 , palette_option = "plasma" , point_size = 1.3 , point_alpha = 0.9 ) print( p)
Step 9:寻找拟时序相关基因 画热图 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Time_diff <- differentialGeneTest( cds, cores = 1 , fullModelFormulaStr = "~sm.ns(Pseudotime)" ) Time_diff <- Time_diff %>% filter( qval < 0.05 ) write.csv( Time_diff, "Time_diff_all.csv" , row.names = F ) Times_genes <- Time_diff %>% pull( gene_short_name) %>% as.character ( ) Times_genes <- Times_genes[ 1 : 500 ] p = plot_pseudotime_heatmap( cds[ Times_genes, ] , show_rownames= T , return_heatmap= T ) p ggsave( "Time_heatmapAll.pdf" , p, width = 5 , height = 10 )
单独提取某一个cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 p$ tree_row clusters <- cutree( p$ tree_row, k= 4 ) clustering <- data.frame( clusters) clustering[ , 1 ] <- as.character ( clustering[ , 1 ] ) colnames( clustering) <- "Gene_Clusters" table( clustering) write.csv( clustering, "Time_clustering_all.csv" , row.names = F ) cluster6_genes <- clustering %>% dplyr:: filter( Gene_Clusters == "6" ) %>% dplyr:: pull( Gene_name) length ( cluster6_genes) library( org.Hs.eg.db) library( clusterProfiler) plot_KEGG <- function ( ENTREZID, organism= "hsa" , qvalueCutoff= 0.05 ) { KEGG_res <- enrichKEGG( ENTREZID$ ENTREZID, organism, pvalueCutoff = 1 , qvalueCutoff = qvalueCutoff) KEGG_plot <- dotplot( KEGG_res) KEGG_res <- list ( KEGG_res= KEGG_res, KEGG_plot= KEGG_plot) return ( KEGG_res) } plot_GO <- function ( ENTREZID, orgdb, ont= "ALL" , keyType= "ENTREZID" , qvalueCutoff= 0.05 ) { GO_res <- enrichGO( ENTREZID$ ENTREZID, keyType = keyType, OrgDb = orgdb, ont= ont, readable = T , pvalueCutoff = 1 , qvalueCutoff = 0.05 ) GO_res <- clusterProfiler:: simplify( GO_res, cutoff= 0.7 , select_fun= min ) GO_plot <- dotplot( GO_res, split = "ONTOLOGY" ) + facet_wrap( .~ ONTOLOGY, scales= "free" ) GO_res <- list ( GO_res= GO_res, GO_plot= GO_plot) return ( GO_res) } cluster6_entrez <- bitr( cluster6_genes, fromTyp = "SYMBOL" , toType = "ENTREZID" , OrgDb = org.Hs.eg.db) length ( cluster6_entrez$ ENTREZID) GO6 <- plot_GO( cluster6_entrez, orgdb = org.Hs.eg.db, qvalueCutoff= 1 ) KEGG6 <- plot_KEGG( cluster6_entrez, organism = "hsa" , qvalueCutoff = 1 ) write.csv( as.data.frame( GO6$ GO_res) , "Cluster6_GO.csv" , row.names = FALSE ) write.csv( as.data.frame( KEGG6$ KEGG_res) , "Cluster6_KEGG.csv" , row.names = FALSE ) if ( ! is.null ( GO6$ GO_plot) ) { ggsave( "Cluster6_GO_dotplot.pdf" , plot = GO6$ GO_plot, width = 8 , height = 6 ) } if ( ! is.null ( KEGG6$ KEGG_plot) ) { ggsave( "Cluster6_KEGG_dotplot.pdf" , plot = KEGG6$ KEGG_plot, width = 8 , height = 6 ) }
Step 10:分支分析 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 plot_cell_trajectory( cds, color_by = "State" ) BEAM_res <- BEAM( cds[ ordering_genes, ] , branch_point = 1 , cores = 2 ) BEAM_res <- BEAM_res[ order( BEAM_res$ qval) , ] BEAM_res <- BEAM_res[ , c ( "gene_short_name" , "pval" , "qval" ) ] head( BEAM_res) write.csv( BEAM_res, "BEAM_res.csv" , row.names = F ) plot_genes_branched_heatmap( cds[ row.names( subset( BEAM_res, qval< 1e-4 ) ) , ] , branch_point = 1 , num_clusters = 4 , cores = 1 , use_gene_short_name = T , show_rownames = T ) BEAM_genes <- BEAM_res %>% arrange( qval) %>% head( 100 ) %>% pull( gene_short_name) %>% as.character ( ) p <- plot_genes_branched_heatmap( cds[ BEAM_genes, ] , branch_point = 1 , num_clusters = 3 , show_rownames = T , return_heatmap = T ) ggsave( "BEAM_heatmap.pdf" , p$ ph_res, width = 6.5 , height = 10 )