当前位置: 首页 > article >正文

日常吐槽。

一、写在前面

stereopy日常出bug(github issue里得有一半的问题是我提的,当然也有可能是因为我菜),stereopy自己生成的anndata自己不能计算空间共现关系,还是靠squidpy才能计算。另外还要一些函数一开并行计算就报错,这里留一些stereopy对象的转换/输出方式,以备大家不(sui)时之(pao)需(lu)。更多空转教程可见:空间转录组学习手册合辑

二、stereopy相关数据读写

Stereopy是一个基础且全面的空间转录组分析工具,其支持空间高变基因计算、RNA Velocity、批次整合、SingleR注释、细胞通讯、基因调控网络、轨迹推断等基础/进阶分析。此外,针对多样本的数据,Stereopy还支持3D细胞通讯、3D轨迹推断、3D基因调控网络等分析。学习下面的内容前需要大家先掌握python知识:生信Python速查手册

文件读取、输出、转换:

1.1 GEM与GEF格式文件

一共有GEM、GEF两种格式的数据。GEM文件包括:切片中x、y轴空间坐标信息,MIDCount为基因表达量,ExonCount为由SAW(Version >= 5.1.3)中的spatial_RNA_visualization_v5生成的外显子表达量。

数据格式包含以下信息:

GEF分为Square Bin与Cell Bin两种,前者的前缀一般为.raw.gef/.gef/.tissue.gef,后者的前缀一般为.cellbin.gef。(Stereo-seq技术分辨率可达纳米级别bin1的纳米孔半径约为250nm。通过bin_size参数,可以将一定范围内的纳米孔数据合并为一个bin unit(相当于一个分类单元)。例如,当bin_size设置为20时,bin unit的边长大约为10/14 μm(取决于纳米孔中心的距离:500nm或715nm)。在bin_size确定后,可自动生成StereoExpData用于下游分析。由于计算性能、生物学意义等问题,在分析过程中可能需要跳回“分bin”的步骤,优化参数后继续处理。)

此外stereopy还可以读取来自Anndata、Scanpy、Seurat的h5ad格式文件。

1.2 数据读取

    # 导入包
    import stereo as st
    import warnings
    warnings.filterwarnings('ignore')

    1.2.1 GEM文件读取

    # gem文件读取
    data_path = 'Demo_MouseBrain/SS200000135TL_D1.cellbin.gem'
    data = st.io.read_gem(
            file_path=data_path,
            sep='\t',
            bin_type='cell_bins',
            is_sparse=True,
            )
    data
    # [2023-09-19 06:31:29][Stereo][3273491][MainThread][139900354834816][reader][84][INFO]: the martrix has 57133 cells, and 22406 genes.
    # StereoExpData object with n_cells X n_genes = 57133 X 22406
    # bin_type: cell_bins
    # offset_x = 3206# offset_y = 6174
    # cells: ['cell_name', 'cell_point']
    # genes: ['gene_name']

    1.2.2 GEF文件读取

      # suqre bin读取
      data_path = 'Demo_MouseBrain/SS200000135TL_D1.tissue.gef'
      data = st.io.read_gef(
              file_path=data_path,
              bin_type='bins',
              bin_size=100,
              is_sparse=True,
              )
      data
      # [2023-09-19 06:33:03][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...
      # [2023-09-19 06:33:23][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 9124 cells, and 24302 genes.
      # [2023-09-19 06:33:25][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.
      # StereoExpData object with n_cells X n_genes = 9124 X 24302
      # bin_type: bins
      # bin_size: 100
      # offset_x = 0
      # offset_y = 0
      # cells: ['cell_name']
      # genes: ['gene_name']
      # cell bin读取
      data_path = './Demo_MouseBrain/SS200000135TL_D1.cellbin.gef'
      data = st.io.read_gef(
              file_path=data_path,
              is_sparse=True,
              bin_type='cell_bins',
              )
      # [2023-09-19 06:33:54][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1038][INFO]: the matrix has 57133 cells, and 24670 genes.
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1161][INFO]: This is GEF file which contains cell bin infomation.
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1162][INFO]: bin_type: cell_bins
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1168][INFO]: Number of cells: 57133
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1171][INFO]: Number of gene: 24670
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1174][INFO]: Resolution: 500
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1177][INFO]: offsetX: 0
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1180][INFO]: offsetY: 0
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1183][INFO]: Average number of genes: 223.460693359375
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1186][INFO]: Maximum number of genes: 1046
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1189][INFO]: Average expression: 399.17034912109375
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1192][INFO]: Maximum expression: 2337
      # [2023-09-19 06:33:58][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.

      1.2.3 Stereopy来源的h5ad文件读取

      data_path = './Demo_MouseBrain/StereoExpData_stereopy_mykey.h5ad'
      data = st.io.read_stereo_h5ad(
              file_path=data_path,
              use_raw=True,
              use_result=True,
              )
      data
      # StereoExpData object with n_cells X n_genes = 35998 X 3000
      # bin_type: bins
      # bin_size: 50
      # offset_x = None
      # offset_y = None
      # cells: ['cell_name', 'total_counts', 'pct_counts_mt', 'n_genes_by_counts', 'leiden', 'louvain']
      # genes: ['gene_name', 'n_cells', 'n_counts']
      # key_record: {'cluster': ['leiden', 'louvain'], 'gene_exp_cluster': ['gene_exp_leiden', 'gene_exp_louvain']}

      1.2.4 Anndata来源的h5ad文件读取

      ann_h5ad = './Demo_MouseBrain/anndata_output.h5ad'
      data = st.io.read_ann_h5ad(
              file_path=ann_h5ad,
              spatial_key=None,
              )
      data
      # StereoExpData object with n_cells X n_genes = 35998 X 24302
      # bin_type: bins
      # bin_size: 50
      # offset_x = None
      # offset_y = None
      # cells: ['cell_name']
      # genes: ['gene_name']
         
        
        # Anndata来源的h5ad文件也可用于scanpy的分析
        import scanpy as sc
        # 直接利用scanpy读取:
        data = sc.read_h5ad('Demo_MouseBrain/anndata_output.h5ad')
        data
        # AnnData object with n_obs × n_vars = 35998 × 24302
        #     obs: 'orig.ident', 'x', 'y'
        #     uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn'
        #     obsm: 'spatial'

        1.3 文件输出

        虽然输出的文件均为h5ad文件,但不同格式输出的h5ad文件具有差异,需要区别对待,输出对应软件需要的版本。

        1.3.1 用于stereopy读取的h5ad输出¶

          # 读取GEF文件data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'data = st.io.read_gef(file_path=data_path, bin_size=50)
          # 预处理data.tl.cal_qc()data.tl.raw_checkpoint()data.tl.sctransform(res_key='sctransform', inplace=True)
          # 聚类data.tl.pca(use_highly_genes=False, n_pcs=30, res_key='pca')data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors')data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')data.tl.louvain(neighbors_res_key='neighbors', res_key='louvain')
          # data.tl.key_record为运行函数时自动生成的字典,收录的为res_key参数返回的内容,我们在后面正式的数据分析教程中会介绍print(data.tl.key_record)# [2023-09-19 08:09:11][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: umap end, consume time 24.3978s.# [2023-09-19 08:09:11][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run leiden...# [2023-09-19 08:09:20][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: leiden end, consume time 9.2709s.# [2023-09-19 08:09:21][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run louvain...# [2023-09-19 08:09:21][Stereo][3273491][MainThread][139900354834816][_louvain][100][INFO]:     using the "louvain" package of Traag (2017)# [2023-09-19 08:09:35][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: louvain end, consume time 14.8895s.# {'hvg': [], 'pca': ['pca'], 'neighbors': ['neighbors'], 'umap': ['umap'], 'cluster': ['leiden', 'louvain'], 'marker_genes': [], 'sct': ['sctransform'], 'gene_exp_cluster': ['gene_exp_leiden', 'gene_exp_louvain']}

            # 将StereoExpData对象写为h5ad, 如果key_record = None, 则会自动采用data.tl.key_recordst.io.write_h5ad(        data,        use_raw=True,        use_result=True,        key_record=None,        output='./Demo_MouseBrain/StereoExpData_stereopy.h5ad'        )# 原gef大约600MB,输出后的h5ad约为3GB
              # 当然你也可以建立一个字典保存指定的key_recordoutkey_record = {'cluster':['leiden','louvain'],}st.io.write_h5ad(        data,        use_raw=True,        use_result=True,        key_record=outkey_record,        output='./Demo_MouseBrain/StereoExpData_stereopy_mykey.h5ad',        )# 指定key_record后的输出文件大小为1.1GB,需要节省磁盘空间的同学可以尝试这种方法

              1.3.2 用于Anndata读取的h5ad输出

                # 读取gef文件:data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'data = st.io.read_gef(file_path=data_path, bin_size=50)data.tl.raw_checkpoint()
                # 转换输出adata = st.io.stereo_to_anndata(data,flavor='seurat',                                output='Demo_MouseBrain/anndata_output.h5ad')# [2023-09-19 07:07:41][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...# [2023-09-19 07:08:00][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 35998 cells, and 24302 genes.# [2023-09-19 07:08:02][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][756][INFO]: Adding sample in adata.obs['orig.ident'].# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][759][INFO]: Adding data.position as adata.obsm['spatial'] .# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][764][INFO]: Adding data.position as adata.obs['x'] and adata.obs['y'] .# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][853][INFO]: Adding data.tl.raw.exp_matrix as adata.uns['raw_counts'] .# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][884][INFO]: Rename QC info.# [2023-09-19 07:08:03][Stereo][3273491][MainThread][139900354834816][reader][900][INFO]: Finished conversion to anndata.# [2023-09-19 07:08:06][Stereo][3273491][MainThread][139900354834816][reader][904][INFO]: Finished output to Demo_MouseBrain/anndata_output.h5ad

                1.3.3 用于Seurat读取的h5ad输出

                Seurat对象可谓是单细胞分析界的硬通货,一定要学会转化。

                  # 读取数据data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'data = st.io.read_gef(file_path=data_path, bin_size=50)
                  # 预处理一下data.tl.cal_qc()data.tl.raw_checkpoint()
                  # 存为Seurat友好型的h5adadata = st.io.stereo_to_anndata(data,flavor='seurat',                                output='Demo_MouseBrain/seurat_out.h5ad')# [2023-09-19 08:32:39][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...# [2023-09-19 08:32:58][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 35998 cells, and 24302 genes.# [2023-09-19 08:33:00][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.# [2023-09-19 08:33:00][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run cal_qc...# [2023-09-19 08:33:00][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: cal_qc end, consume time 0.7293s.# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][756][INFO]: Adding sample in adata.obs['orig.ident'].# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][759][INFO]: Adding data.position as adata.obsm['spatial'] .# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][764][INFO]: Adding data.position as adata.obs['x'] and adata.obs['y'] .# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][853][INFO]: Adding data.tl.raw.exp_matrix as adata.uns['raw_counts'] .# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][884][INFO]: Rename QC info.# [2023-09-19 08:33:01][Stereo][3273491][MainThread][139900354834816][reader][900][INFO]: Finished conversion to anndata.# [2023-09-19 08:33:04][Stereo][3273491][MainThread][139900354834816][reader][904][INFO]: Finished output to Demo_MouseBrain/seurat_out.h5ad

                  后续h5adSeurat中的读取可以参考:单细胞对象(数据格式)转换大全|2. h5ad转Seuratobj。

                  当然,如果你更熟悉rds文件,也可以在shell中做如下转换:

                    !/usr/bin/Rscript  ~/stereopy/docs/source/_static/annh5ad2rds.R --infile 'Demo_MouseBrain/seurat_out.h5ad' --outfile 'Demo_MouseBrain/seurat_out.rds'

                    1.3.4 GEF输出

                    创建新的GEF用于输出:

                      # 数据读取data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'data = st.io.read_gef(file_path=data_path, bin_size=50)
                      # 选择对应基因取子集data.tl.filter_genes(gene_list=['H2al2a','Gm6135'], inplace=True)
                      # 仅保存过滤后的结果st.io.write_mid_gef(        data=data,        output='./Demo_MouseBrain/my_filter.filtered.gef'        )# [2023-09-19 08:18:02][Stereo][3273491][MainThread][139900354834816][reader][1001][INFO]: read_gef begin ...# [2023-09-19 08:18:22][Stereo][3273491][MainThread][139900354834816][reader][1088][INFO]: the matrix has 35998 cells, and 24302 genes.# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][reader][1096][INFO]: read_gef end.# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][st_pipeline][37][INFO]: start to run filter_genes...# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][st_pipeline][40][INFO]: filter_genes end, consume time 0.5481s.# [2023-09-19 08:18:24][Stereo][3273491][MainThread][139900354834816][writer][287][INFO]: The output standard gef file only contains one expression matrix with mid count.Please make sure the expression matrix of StereoExpData object is mid count without normaliztion.
                        # 读取GEF文件data_path = './Demo_MouseBrain/SS200000135TL_D1.tissue.gef'data = st.io.read_gef(file_path=data_path, bin_size=50)
                        # 预处理data.tl.cal_qc()data.tl.raw_checkpoint()data.tl.sctransform(res_key='sctransform', inplace=True)
                        # 分群data.tl.pca(use_highly_genes=False, n_pcs=30, res_key='pca')data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors')data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
                        # 更新本地已存在的gef
                        exist_path = './Demo_MouseBrain/my_filter.filtered.gef'st.io.update_gef(        data=data,        gef_file=exist_path,        cluster_res_key='leiden',        )

                        保存于已存在的GEF文件:

                        1.4 以AnnData格式操作

                        AnnData对象在Python亦是单细胞/空转分析的硬通货,Stereo中对象与其的相互转换与交互值得我们单独探究一番:

                        1.4.1 AnnData来源h5ad读取

                          import stereo as stdata = st.io.read_h5ad('Demo_MouseBrain/anndata_output.h5ad')# 此时读入的便是一个AnnData对象:data# AnnData object with n_obs × n_vars = 35998 × 24302#     obs: 'orig.ident', 'x', 'y'#     uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn'#     obsm: 'spatial'
                            # data._ann_data可以完全作为一个AnnData对象处理,而不是StereoExpData对象data._ann_data# AnnData object with n_obs × n_vars = 35998 × 24302#     obs: 'orig.ident', 'x', 'y'#     uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn'#     obsm: 'spatial'

                            1.4.2 各数据存储位置

                              # 表达矩阵:data._ann_data.X# <35998x24302 sparse matrix of type '<class 'numpy.float64'>'#  with 41464124 stored elements in Compressed Sparse Row format>

                                # 空间坐标信息:data._ann_data.uns# OverloadedDict, wrapping:#   {'bin_size': 50, 'bin_type': 'bins', 'raw_cellname': array(['24481313596250', '26199300516500', '26843545608050', ...,#       '48533130461250', '38439957317800', '53687091208300'], dtype=object), 'raw_counts': <35998x24302 sparse matrix of type '<class 'numpy.uint32'>'#  with 41464124 stored elements in Compressed Sparse Row format>, 'raw_genename': array(['0610005C13Rik', '0610006L08Rik', '0610009B22Rik', ..., 'mt-Nd4l',#       'mt-Nd5', 'mt-Nd6'], dtype=object), 'resolution': 500, 'sn':   batch                sn# 0    -1  SS200000135TL_D1}# With overloaded keys:#   ['neighbors'].

                                  # 细胞及相关注释信息:data._ann_data.obs

                                    # 基因及相关注释信息:data._ann_data.var

                                    1.4.3 数据处理

                                    AnnData可直接参与大部分用于StereoExpData的计算

                                      # 例如以下这些:data.tl.cal_qc()data.tl.raw_checkpoint()data.tl.normalize_total(target_sum=1e4)data.tl.log1p()data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca_test', svd_solver='arpack')data.tl.neighbors(pca_res_key='pca_test', n_pcs=30, res_key='neighbors_test', n_jobs=8)data.tl.umap(pca_res_key='pca_test', neighbors_res_key='neighbors_test', res_key='umap_test', init_pos='spectral')data.tl.leiden(neighbors_res_key='neighbors_test', res_key='leiden_test')

                                        # 计算得到的结果也会自动保存于data._ann_data中data._ann_data# AnnData object with n_obs × n_vars = 35998 × 24302#     obs: 'orig.ident', 'x', 'y', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt', 'leiden_test'#     var: 'n_cells', 'n_counts', 'mean_umi', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'#     uns: 'bin_size', 'bin_type', 'raw_cellname', 'raw_counts', 'raw_genename', 'resolution', 'sn', 'highly_variable_genes', 'pca_test', 'neighbors_test', 'umap_test', 'leiden_test', 'gene_exp_leiden_test'#     obsm: 'spatial', 'X_pca_test', 'X_umap_test'#     obsp: 'neighbors_test_connectivities', 'neighbors_test_distances'

                                          # 例如这里多出的细胞信息:data._ann_data.obs

                                            # 可视化函数一样可以利用:data.plt.umap(res_key='umap_test', cluster_key='leiden_test')

                                            1.4.4 数据输出

                                            用于stereopy分析的AnnData对象也可以存为h5ad

                                            data._ann_data.write_h5ad('./test_result/SS200000135TL_D1.stereo.h5ad')

                                            更多空间转录组分析技巧可见:空间转录组学习手册合辑


                                            http://www.kler.cn/a/540735.html

                                            相关文章:

                                          • 第三个Qt开发实例:利用之前已经开发好的LED驱动在Qt生成的界面中控制LED2的亮和灭
                                          • 嵌入式linux系统中VIM编辑工具用法与GCC参数详解
                                          • DeepSeek 实践总结
                                          • Python Pandas(3):DataFrame
                                          • fps动作系统9:动画音频
                                          • STM32 HAL库 ADC程序(C语言)
                                          • PostgreSQL的学习心得和知识总结(一百六十七)|深入理解PostgreSQL数据库之静态语法检查工具PgSanity的使用和实现
                                          • 示波器使用指南
                                          • [7] 游戏机项目说明
                                          • SQL自学,mysql从入门到精通 --- 第 15天,数据导入、导出
                                          • 《深度学习》——pytorch框架及项目
                                          • 处理STM32 DMA方式下的HAL_UART_ERROR_ORE错误
                                          • 中央处理器
                                          • 分布式微服务接口基于多线程进行性能优化
                                          • 蓝桥杯试题:冒泡排序 选择排序
                                          • 六.logback记录日志文件并按大小日期分割文件
                                          • 操作系统调度算法解析(SJF)
                                          • EtherNet/IP转Modbus TCP实现三菱变频器与西门子PLC通讯的配置案例
                                          • 从零复现DeepSeek R1:从V3中对MoE、MLA、MTP的实现,到Open R1对R1中SFT、GRPO的实现
                                          • ESP8266配置为TCP客户端,连接电脑和手机(使用Arduino配置)
                                          • javaEE-10.CSS入门
                                          • 【Elasticsearch】管道聚合
                                          • SpringBoot源码解析(十):应用上下文AnnotationConfigServletWebServerApplicationContext构造方法
                                          • 相对收益-固定收益组合归因-前言
                                          • 纯前度(vue)实现对pdf\mp4\png\jpg\jpegxls\doc\txt文件预览,无需要转化
                                          • Android图片加载框架Coil,Kotlin