cellphoneDB进行CCI以及可视化
除了cellchat,在单细胞转录组或者空间组的分析中,cellphoneDB也是一个常用的细胞通讯软件,这个数据库更注重配受体关系,对于有明确先验知识的配受体研究比较友好。
但值得注意的是,它的数据库只包括人的基因名称信息,所以在开始前,可以参考biomaRt进行同源基因转换-CSDN博客转换物种的同源基因名称
导入环境
import numpy as np
import pandas as pd
import scanpy as sc
import os
import sys
from scipy import sparse
from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
import os
import anndata as ad
import pandas as pd
import ktplotspy as kpy
import matplotlib.pyplot as plt
%matplotlib inline
ktplotspy是利用cellphoneDB结果进行后续可视化的一个R包,安装可见GitHub - zktuong/ktplotspy: Python library for plotting Cellphonedb results. Ported from ktplots R package.
同源基因转换
adata = sc.read('/data/work/your_scRNA.h5ad')
gene_names = adata.var_names.tolist()
gene_names_df = pd.DataFrame(gene_names, columns=['gene_name'])
gene_names_df.to_csv('adata_gene_names.csv', index=False)
#biomart转换后得到新的csv
replacement_df = pd.read_csv('mouse_homologous_gene_conversion.csv')
replacement_df.columns = ['Gene.name', 'Gene.name.1','Chromosome.scaffold.name']
gene_mapping = dict(zip(replacement_df['Gene.name'], replacement_df['Gene.name.1']))
new_gene_names = [gene_mapping.get(gene, None) for gene in gene_names]
# 仅保留有对应人类基因名的基因
valid_indices = [i for i, gene in enumerate(new_gene_names) if gene is not None]
filtered_adata = adata[:, valid_indices]
filtered_adata.var_names = [new_gene_names[i] for i in valid_indices]
filtered_adata.obs.index = filtered_adata.obs.index.astype(str)
filtered_adata.var.index = filtered_adata.var.index.astype(str)
filtered_adata.write_h5ad('your_new_scRNA.h5ad')
保存替换完同源基因的anndata对象后可以重新读取,进行后续的分析,这里没有对应的同源基因被我过滤掉了(如果有更全面的分析需求可以注意一下这一点)
提取注释标签
#重新读取同源替换后的数据
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
adata.shape#检查数据大小
adata.obs['annotation'].values.describe()#检查注释信息
adata.obs['barcode_sample']=list(adata.obs_names)
adata.obs['annotation']=adata.obs['annotation'].astype(str)
adata.obs['cell_type']=list(adata.obs['annotation'])
meta=adata.obs.loc[:,['barcode_sample','cell_type']]
meta.to_csv('bin50_meta.tsv', sep = '\t')
#检查对应
list(adata.obs.index).sort() == list(df_meta['barcode_sample']).sort()
提取后续分析需要的注释信息表,其中'barcode_sample'是细胞唯一标识符,'annotation'是我的注释信息存放的列,导出为bin50_meta.tsv
运行cellphoneDB
设置好数据库路径,注释信息表路径,标准化后的anndata对象路径以及输出路径。
我是直接将v5数据库下载使用,因为我的环境不能联网,在线下载可能会方便很多,网上有很多教程。直接下载的路径见GitHub - ventolab/cellphonedb-data
cpdb_file_path = '/data/work/CCI_cellphoneDB/cellphonedb.zip'
meta_file_path = '/data/work/CCI_cellphoneDB/bin50_meta.tsv'
counts_file_path = '/data/work/CCI_cellphoneDB/bin50_annotated_normlog.h5ad'
out_path = '/data/work/CCI_cellphoneDB/bin50'
cpdb_results = cpdb_statistical_analysis_method.call(
cpdb_file_path = cpdb_file_path, # mandatory: CellphoneDB database zip file.
meta_file_path = meta_file_path, # mandatory: tsv file defining barcodes to cell label.
counts_file_path = counts_file_path, # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object
counts_data = 'hgnc_symbol', # defines the gene annotation in counts matrix.
# active_tfs_file_path = active_tf_path, # optional: defines cell types and their active TFs.
# microenvs_file_path = microenvs_file_path, # optional (default: None): defines cells per microenvironment.
score_interactions = True, # optional: whether to score interactions or not.
iterations = 1000, # denotes the number of shufflings performed in the analysis.
threshold = 0.05, # defines the min % of cells expressing a gene for this to be employed in the analysis.
threads = 5, # number of threads to use in the analysis.
debug_seed = 42, # debug randome seed. To disable >=0.
result_precision = 3, # Sets the rounding for the mean values in significan_means.
pvalue = 0.1, # P-value threshold to employ for significance.
subsampling = False, # To enable subsampling the data (geometri sketching).
subsampling_log = False, # (mandatory) enable subsampling log1p for non log-transformed data inputs.
subsampling_num_pc = 100, # Number of componets to subsample via geometric skectching (dafault: 100).
subsampling_num_cells = 1000, # Number of cells to subsample (integer) (default: 1/3 of the dataset).
separator = '|', # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
debug = False, # Saves all intermediate tables employed during the analysis in pkl format.
output_path = out_path, # Path to save results.
output_suffix = None # Replaces the timestamp in the output files by a user defined string in the (default: None).
)
运行日志以及输出的文件如下
读取结果可视化
相比cellchat,这个软件自带的可视化功能会少很多
自带可视化热图
kpy.plot_cpdb_heatmap(pvals = cpdb_results['pvalues'],
degs_analysis = False,
figsize = (5, 5),
title = "Sum of significant interactions")
R可视化点图
但是利用输出文件也可以在R环境下进行很多想要的可视化方式,首先导入R环境和包
library(ktplots)
library(Seurat)
library(tidyverse)
library(data.table)
library(dplyr)
library(readr)
导入需要的结果表和RDS格式数据
#cellphoneDB结果文件
mypvals <- read.table("/data/work/CCI_cellphoneDB/bin50/statistical_analysis_pvalues_12_26_2024_220659.txt",header = T,sep = "\t",stringsAsFactors = F,check.names = F)
mymeans <- read.table("/data/work/CCI_cellphoneDB/bin50/statistical_analysis_means_12_26_2024_220659.txt",header = T,sep = "\t",stringsAsFactors = F,check.names = F)
mydecon <- read.table("/data/work/CCI_cellphoneDB/bin50/statistical_analysis_deconvoluted_12_26_2024_220659.txt",header = T,sep = "\t",stringsAsFactors = F,check.names = F)
#单细胞数据
scRNA = read_rds("your_scRNA.rds")
绘制全部点图
p <- plot_cpdb(cell_type1 = "", cell_type2 = "", scdata = scRNA, means = mymeans, pvals = mypvals,
celltype_key = "annotation",
highlight_col = "blue",
keep_significant_only=T) +
theme(axis.text = element_text(size = 5, color = 'black'))
ggsave("plot_cpdb_output.png", p, width = 50, height = 40, units = "in", limitsize = FALSE)
如果这里cell_type空白则默认绘制所有分群的点图,也可以设置一个或者两个绘制一对多和一对一的配受体点图,效果如下
分类展示基因家族
这里参数还有一个family,可以分类展示不同家族的配受体关系
gene_families <- c("chemokines", "th1", "th2", "th17", "treg", "costimulatory", "coinhibitory")
for(family in gene_families) {
tryCatch({
p <- plot_cpdb(cell_type1 = "PT-S1", cell_type2 = "", scdata = scRNA, means = mymeans, pvals = mypvals,
celltype_key = "annotation_res1_1",
gene_family = family, # 指定基因家族
highlight_col = "blue",
keep_significant_only=T) +
theme(axis.text = element_text(size = 10, color = 'black'))
ggsave(paste0("plot_cpdb_PTS1_", family, ".png"), p, width = 20, height = 30, units = "in", limitsize = FALSE)
}, error = function(e) {
message(paste("No significant genes found for", family, "and plotting will not proceed."))
})
}