日常吐槽。
一、写在前面
stereopy日常出bug(github issue里得有一半的问题是我提的,当然也有可能是因为我菜),stereopy自己生成的anndata自己不能计算空间共现关系,还是靠squidpy才能计算。另外还要一些函数一开并行计算就报错,这里留一些stereopy对象的转换/输出方式,以备大家不(sui)时之(pao)需(lu)。更多空转教程可见:空间转录组学习手册合辑
二、stereopy相关数据读写
Stereopy是一个基础且全面的空间转录组分析工具,其支持空间高变基因计算、RNA Velocity
、批次整合、SingleR
注释、细胞通讯、基因调控网络、轨迹推断等基础/进阶分析。此外,针对多样本的数据,Stereopy
还支持3D
细胞通讯、3D
轨迹推断、3D
基因调控网络等分析。学习下面的内容前需要大家先掌握python知识:生信Python速查手册
文件读取、输出、转换:
1.1 GEM与GEF格式文件
一共有GEM、GEF两种格式的数据。GEM文件包括:切片中x、y轴空间坐标信息,MIDCount为基因表达量,ExonCount为由SAW(Version >= 5.1.3)中的spatial_RNA_visualization_v5生成的外显子表达量。
数据格式包含以下信息:
GEF分为Square Bin与Cell Bin两种,前者的前缀一般为.raw.gef/.gef/.tissue.gef,后者的前缀一般为.cellbin.gef。(Stereo-seq技术分辨率可达纳米级别bin1的纳米孔半径约为250nm。通过bin_size参数,可以将一定范围内的纳米孔数据合并为一个bin unit(相当于一个分类单元)。例如,当bin_size设置为20时,bin unit的边长大约为10/14 μm(取决于纳米孔中心的距离:500nm或715nm)。在bin_size确定后,可自动生成StereoExpData用于下游分析。由于计算性能、生物学意义等问题,在分析过程中可能需要跳回“分bin”的步骤,优化参数后继续处理。)
此外stereopy还可以读取来自Anndata、Scanpy、Seurat的h5ad格式文件。
1.2 数据读取
# 导入包
import stereo as st
import warnings
warnings.filterwarnings('ignore')
1.2.1 GEM文件读取
# gem文件读取
data_path = 'Demo_MouseBrain/SS200000135TL_D1.cellbin.gem'
data = st.io.read_gem(
file_path=data_path,
sep='\t',
bin_type='cell_bins',
is_sparse=True,
)
data
# [2023-09-19 06:31:29][Stereo][3273491][MainThread][139900354834816][reader][84][INFO]: the martrix has 57133 cells, and 22406 genes.
# StereoExpData object with n_cells X n_genes = 57133 X 22406
# bin_type: cell_bins
# offset_x = 3206# offset_y = 6174
# cells: ['cell_name', 'cell_point']
# genes: ['gene_name']
1.2.2 GEF文件读取
# suqre bin读取
data_path = 'Demo_MouseBrain/SS200000135TL_D1.tissue.gef'
data = st.io.read_gef(
file_path=data_path,
bin_type='bins',
bin_size=100,
is_sparse=True,
)
data
# [2023-09-19 06:33:03][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...
# [2023-09-19 06:33:23][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 9124 cells, and 24302 genes.
# [2023-09-19 06:33:25][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.
# StereoExpData object with n_cells X n_genes = 9124 X 24302
# bin_type: bins
# bin_size: 100
# offset_x = 0
# offset_y = 0
# cells: ['cell_name']
# genes: ['gene_name']
# cell bin读取
data_path = './Demo_MouseBrain/SS200000135TL_D1.cellbin.gef'
data = st.io.read_gef(
file_path=data_path,
is_sparse=True,
bin_type='cell_bins',
)
# [2023-09-19 06:33:54][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1038][INFO]: the matrix has 57133 cells, and 24670 genes.
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1161][INFO]: This is GEF file which contains cell bin infomation.
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1162][INFO]: bin_type: cell_bins
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1168][INFO]: Number of cells: 57133
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1171][INFO]: Number of gene: 24670
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1174][INFO]: Resolution: 500
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1177][INFO]: offsetX: 0
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1180][INFO]: offsetY: 0
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1183][INFO]: Average number of genes: 223.460693359375
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1186][INFO]: Maximum number of genes: 1046
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1189][INFO]: Average expression: 399.17034912109375
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1192][INFO]: Maximum expression: 2337
# [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.
1.2.3 Stereopy
来源的h5ad
文件读取
data_path = './Demo_MouseBrain/StereoExpData_stereopy_mykey.h5ad'
data = st.io.read_stereo_h5ad(
file_path=data_path,
use_raw=True,
use_result=True,
)
data
# StereoExpData object with n_cells X n_genes = 35998 X 3000
# bin_type: bins
# bin_size: 50
# offset_x = None
# offset_y = None
# cells: ['cell_name', 'total_counts', 'pct_counts_mt', 'n_genes_by_counts', 'leiden', 'louvain']
# genes: ['gene_name', 'n_cells', 'n_counts']
# key_record: {'cluster': ['leiden', 'louvain'], 'gene_exp_cluster': ['gene_exp_leiden', 'gene_exp_louvain']}
1.2.4 Anndata
来源的h5ad
文件读取
ann_h5ad = './Demo_MouseBrain/anndata_output.h5ad'
data = st.io.read_ann_h5ad(
file_path=ann_h5ad,
spatial_key=None,
)
data
# StereoExpData object with n_cells X n_genes = 35998 X 24302
# bin_type: bins
# bin_size: 50
# offset_x = None
# offset_y = None
# cells: ['cell_name']
# genes: ['gene_name']
# Anndata来源的h5ad文件也可用于scanpy的分析
import scanpy as sc
# 直接利用scanpy读取:
data = sc.read_h5ad('Demo_MouseBrain/anndata_output.h5ad')
data
# AnnData object with n_obs × n_vars = 35998 × 24302
# obs: 'orig.ident', 'x', 'y'
# uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn'
# obsm: 'spatial'
1.3 文件输出
虽然输出的文件均为h5ad
文件,但不同格式输出的h5ad
文件具有差异,需要区别对待,输出对应软件需要的版本。
1.3.1 用于stereopy
读取的h5ad
输出¶
# 读取GEF文件
data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'
data = st.io.read_gef(file_path=data_path, bin_size=50)
# 预处理
data.tl.cal_qc()
data.tl.raw_checkpoint()
data.tl.sctransform(res_key='sctransform', inplace=True)
# 聚类
data.tl.pca(use_highly_genes=False, n_pcs=30, res_key='pca')
data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors')
data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')
data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
data.tl.louvain(neighbors_res_key='neighbors', res_key='louvain')
# data.tl.key_record为运行函数时自动生成的字典,收录的为res_key参数返回的内容,我们在后面正式的数据分析教程中会介绍
print(data.tl.key_record)
# [2023-09-19 08:09:11][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: umap end, consume time 24.3978s.
# [2023-09-19 08:09:11][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run leiden...
# [2023-09-19 08:09:20][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: leiden end, consume time 9.2709s.
# [2023-09-19 08:09:21][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run louvain...
# [2023-09-19 08:09:21][Stereo][3273491][MainThread][139900354834816][_louvain][100][INFO]: using the "louvain" package of Traag (2017)
# [2023-09-19 08:09:35][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: louvain end, consume time 14.8895s.
# {'hvg': [], 'pca': ['pca'], 'neighbors': ['neighbors'], 'umap': ['umap'], 'cluster': ['leiden', 'louvain'], 'marker_genes': [], 'sct': ['sctransform'], 'gene_exp_cluster': ['gene_exp_leiden', 'gene_exp_louvain']}
# 将StereoExpData对象写为h5ad, 如果key_record = None, 则会自动采用data.tl.key_record
st.io.write_h5ad(
data,
use_raw=True,
use_result=True,
key_record=None,
output='./Demo_MouseBrain/StereoExpData_stereopy.h5ad'
)
# 原gef大约600MB,输出后的h5ad约为3GB
# 当然你也可以建立一个字典保存指定的key_record
outkey_record = {'cluster':['leiden','louvain'],}
st.io.write_h5ad(
data,
use_raw=True,
use_result=True,
key_record=outkey_record,
output='./Demo_MouseBrain/StereoExpData_stereopy_mykey.h5ad',
)
# 指定key_record后的输出文件大小为1.1GB,需要节省磁盘空间的同学可以尝试这种方法
1.3.2 用于Anndata
读取的h5ad
输出
# 读取gef文件:
data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'
data = st.io.read_gef(file_path=data_path, bin_size=50)
data.tl.raw_checkpoint()
# 转换输出
adata = st.io.stereo_to_anndata(data,flavor='seurat',
output='Demo_MouseBrain/anndata_output.h5ad')
# [2023-09-19 07:07:41][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...
# [2023-09-19 07:08:00][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 35998 cells, and 24302 genes.
# [2023-09-19 07:08:02][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.
# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][756][INFO]: Adding sample in adata.obs['orig.ident'].
# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][759][INFO]: Adding data.position as adata.obsm['spatial'] .
# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][764][INFO]: Adding data.position as adata.obs['x'] and adata.obs['y'] .
# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][853][INFO]: Adding data.tl.raw.exp_matrix as adata.uns['raw_counts'] .
# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][884][INFO]: Rename QC info.
# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][900][INFO]: Finished conversion to anndata.
# [2023-09-19 07:08:06][Stereo][3273491][MainThread][139900354834816][reader][904][INFO]: Finished output to Demo_MouseBrain/anndata_output.h5ad
1.3.3 用于Seurat
读取的h5ad
输出
Seurat
对象可谓是单细胞分析界的硬通货,一定要学会转化。
# 读取数据
data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'
data = st.io.read_gef(file_path=data_path, bin_size=50)
# 预处理一下
data.tl.cal_qc()
data.tl.raw_checkpoint()
# 存为Seurat友好型的h5ad
adata = st.io.stereo_to_anndata(data,flavor='seurat',
output='Demo_MouseBrain/seurat_out.h5ad')
# [2023-09-19 08:32:39][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...
# [2023-09-19 08:32:58][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 35998 cells, and 24302 genes.
# [2023-09-19 08:33:00][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.
# [2023-09-19 08:33:00][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run cal_qc...
# [2023-09-19 08:33:00][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: cal_qc end, consume time 0.7293s.
# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][756][INFO]: Adding sample in adata.obs['orig.ident'].
# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][759][INFO]: Adding data.position as adata.obsm['spatial'] .
# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][764][INFO]: Adding data.position as adata.obs['x'] and adata.obs['y'] .
# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][853][INFO]: Adding data.tl.raw.exp_matrix as adata.uns['raw_counts'] .
# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][884][INFO]: Rename QC info.
# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][900][INFO]: Finished conversion to anndata.
# [2023-09-19 08:33:04][Stereo][3273491][MainThread][139900354834816][reader][904][INFO]: Finished output to Demo_MouseBrain/seurat_out.h5ad
后续h5ad
在Seurat
中的读取可以参考:单细胞对象(数据格式)转换大全|2. h5ad转Seuratobj。
当然,如果你更熟悉rds
文件,也可以在shell
中做如下转换:
!/usr/bin/Rscript ~/stereopy/docs/source/_static/annh5ad2rds.R --infile 'Demo_MouseBrain/seurat_out.h5ad' --outfile 'Demo_MouseBrain/seurat_out.rds'
1.3.4 GEF
输出
创建新的GEF
用于输出:
# 数据读取
data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'
data = st.io.read_gef(file_path=data_path, bin_size=50)
# 选择对应基因取子集
data.tl.filter_genes(gene_list=['H2al2a','Gm6135'], inplace=True)
# 仅保存过滤后的结果
st.io.write_mid_gef(
data=data,
output='./Demo_MouseBrain/my_filter.filtered.gef'
)
# [2023-09-19 08:18:02][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...
# [2023-09-19 08:18:22][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 35998 cells, and 24302 genes.
# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.
# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run filter_genes...
# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: filter_genes end, consume time 0.5481s.
# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][writer][287][INFO]: The output standard gef file only contains one expression matrix with mid count.Please make sure the expression matrix of StereoExpData object is mid count without normaliztion.
# 读取GEF文件
data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'
data = st.io.read_gef(file_path=data_path, bin_size=50)
# 预处理
data.tl.cal_qc()
data.tl.raw_checkpoint()
data.tl.sctransform(res_key='sctransform', inplace=True)
# 分群
data.tl.pca(use_highly_genes=False, n_pcs=30, res_key='pca')
data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors')
data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')
data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
# 更新本地已存在的gef
exist_path = './Demo_MouseBrain/my_filter.filtered.gef'
st.io.update_gef(
data=data,
gef_file=exist_path,
cluster_res_key='leiden',
)
保存于已存在的GEF
文件:
1.4 以AnnData
格式操作
AnnData
对象在Python
亦是单细胞/空转分析的硬通货,Stereo
中对象与其的相互转换与交互值得我们单独探究一番:
1.4.1 AnnData
来源h5ad
读取
import stereo as st
data = st.io.read_h5ad('Demo_MouseBrain/anndata_output.h5ad')
# 此时读入的便是一个AnnData对象:
data
# AnnData object with n_obs × n_vars = 35998 × 24302
# obs: 'orig.ident', 'x', 'y'
# uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn'
# obsm: 'spatial'
# data._ann_data可以完全作为一个AnnData对象处理,而不是StereoExpData对象
data._ann_data
# AnnData object with n_obs × n_vars = 35998 × 24302
# obs: 'orig.ident', 'x', 'y'
# uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn'
# obsm: 'spatial'
1.4.2 各数据存储位置
# 表达矩阵:
data._ann_data.X
# <35998x24302 sparse matrix of type '<class 'numpy.float64'>'
# with 41464124 stored elements in Compressed Sparse Row format>
# 空间坐标信息:
data._ann_data.uns
# OverloadedDict, wrapping:
# {'bin_size': 50, 'bin_type': 'bins', 'raw_cellname': array(['24481313596250', '26199300516500', '26843545608050', ...,
# '48533130461250', '38439957317800', '53687091208300'], dtype=object), 'raw_counts': <35998x24302 sparse matrix of type '<class 'numpy.uint32'>'
# with 41464124 stored elements in Compressed Sparse Row format>, 'raw_genename': array(['0610005C13Rik', '0610006L08Rik', '0610009B22Rik', ..., 'mt-Nd4l',
# 'mt-Nd5', 'mt-Nd6'], dtype=object), 'resolution': 500, 'sn': batch sn
# 0 -1 SS200000135TL_D1}
# With overloaded keys:
# ['neighbors'].
# 细胞及相关注释信息:
data._ann_data.obs
# 基因及相关注释信息:
data._ann_data.var
1.4.3 数据处理
AnnData
可直接参与大部分用于StereoExpData
的计算
# 例如以下这些:
data.tl.cal_qc()
data.tl.raw_checkpoint()
data.tl.normalize_total(target_sum=1e4)
data.tl.log1p()
data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)
data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca_test', svd_solver='arpack')
data.tl.neighbors(pca_res_key='pca_test', n_pcs=30, res_key='neighbors_test', n_jobs=8)
data.tl.umap(pca_res_key='pca_test', neighbors_res_key='neighbors_test', res_key='umap_test', init_pos='spectral')
data.tl.leiden(neighbors_res_key='neighbors_test', res_key='leiden_test')
# 计算得到的结果也会自动保存于data._ann_data中
data._ann_data
# AnnData object with n_obs × n_vars = 35998 × 24302
# obs: 'orig.ident', 'x', 'y', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt', 'leiden_test'
# var: 'n_cells', 'n_counts', 'mean_umi', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
# uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn', 'highly_variable_genes', 'pca_test', 'neighbors_test', 'umap_test', 'leiden_test', 'gene_exp_leiden_test'
# obsm: 'spatial', 'X_pca_test', 'X_umap_test'
# obsp: 'neighbors_test_connectivities', 'neighbors_test_distances'
# 例如这里多出的细胞信息:
data._ann_data.obs
# 可视化函数一样可以利用:
data.plt.umap(res_key='umap_test', cluster_key='leiden_test')
1.4.4 数据输出
用于stereopy
分析的AnnData
对象也可以存为h5ad
,
data._ann_data.write_h5ad('./test_result/SS200000135TL_D1.stereo.h5ad')
更多空间转录组分析技巧可见:空间转录组学习手册合辑