Analysis
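Clustering analysis
The SAW count, realign, and reanalyze pipelines output spatial clustering results in AnnData H5AD format. The file records the results of preprocessing, dimensionality reduction, clustering, and differential expression analysis.
The clustering results and UMAP embedding stored in the H5AD file can be visualized in StereoMap.
The contents of one H5AD file are listed in full below: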
$ h5dump -n <task id>/outs/analysis/<SN>.bin200_1.0.h5ad ## you can also check <SN>.cellbin_1.0.h5ad
HDF5 "<task id>/outs/analysis/<SN>.bin200_1.0.h5ad" {
FILE_CONTENTS {
group /
dataset /X
group /layers
group /obs
dataset /obs/_index
group /obs/leiden
dataset /obs/leiden/categories
dataset /obs/leiden/codes
dataset /obs/n_genes_by_counts
group /obs/orig.ident
dataset /obs/orig.ident/categories
dataset /obs/orig.ident/codes
dataset /obs/pct_counts_mt
dataset /obs/total_counts
dataset /obs/x
dataset /obs/y
group /obsm
dataset /obsm/X_pca
dataset /obsm/X_umap
dataset /obsm/spatial
group /obsp
group /obsp/connectivities
dataset /obsp/connectivities/data
dataset /obsp/connectivities/indices
dataset /obsp/connectivities/indptr
group /obsp/distances
dataset /obsp/distances/data
dataset /obsp/distances/indices
dataset /obsp/distances/indptr
group /raw
group /raw/X
dataset /raw/X/data
dataset /raw/X/indices
dataset /raw/X/indptr
group /raw/var
dataset /raw/var/_index
dataset /raw/var/mean_umi
dataset /raw/var/n_cells
dataset /raw/var/n_counts
group /raw/var/real_gene_name
dataset /raw/var/real_gene_name/categories
dataset /raw/var/real_gene_name/codes
group /raw/varm
group /uns
dataset /uns/bin_size
dataset /uns/bin_type
group /uns/gene_exp_leiden
dataset /uns/gene_exp_leiden/1
...
dataset /uns/gene_exp_leiden/_index
group /uns/hvg
dataset /uns/hvg/method
group /uns/hvg/params
dataset /uns/hvg/source
dataset /uns/leiden_resolution
group /uns/neighbors
dataset /uns/neighbors/connectivities_key
dataset /uns/neighbors/distance_key
group /uns/rank_genes_groups
dataset /uns/rank_genes_groups/logfoldchanges
group /uns/rank_genes_groups/mean_count
dataset /uns/rank_genes_groups/mean_count/1
...
dataset /uns/rank_genes_groups/mean_count/_index
dataset /uns/rank_genes_groups/names
group /uns/rank_genes_groups/params
dataset /uns/rank_genes_groups/params/corr_method
dataset /uns/rank_genes_groups/params/groupby
dataset /uns/rank_genes_groups/params/method
dataset /uns/rank_genes_groups/params/reference
dataset /uns/rank_genes_groups/params/use_raw
group /uns/rank_genes_groups/pts
dataset /uns/rank_genes_groups/pts/1
...
dataset /uns/rank_genes_groups/pts/_index
group /uns/rank_genes_groups/pts_rest
dataset /uns/rank_genes_groups/pts_rest/1
...
dataset /uns/rank_genes_groups/pts_rest/_index
dataset /uns/rank_genes_groups/pvals
dataset /uns/rank_genes_groups/pvals_adj
dataset /uns/rank_genes_groups/scores
dataset /uns/resolution
group /uns/sn
dataset /uns/sn/_index
dataset /uns/sn/batch
dataset /uns/sn/sn
group /var
dataset /var/_index
dataset /var/dispersions
dataset /var/dispersions_norm
dataset /var/highly_variable
dataset /var/mean_umi
dataset /var/means
dataset /var/n_cells
dataset /var/n_counts
group /var/real_gene_name
dataset /var/real_gene_name/categories
dataset /var/real_gene_name/codes
group /varm
group /varp
}
}
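The same listing can be produced from Python. The following is a minimal sketch that assumes the h5py package is installed. `list_h5_contents` is a hypothetical helper name, not part of SAW, and the path in the usage comment is a placeholder.

```python
import h5py


def list_h5_contents(path):
    """Return (kind, name) pairs for every group and dataset in an HDF5 file."""
    entries = [("group", "/")]

    def visit(name, obj):
        # h5py passes names relative to the root, without a leading slash.
        kind = "group" if isinstance(obj, h5py.Group) else "dataset"
        entries.append((kind, "/" + name))

    with h5py.File(path, "r") as handle:
        handle.visititems(visit)
    return entries


# Usage (placeholder path, replace with a real file):
# for kind, name in list_h5_contents("<task id>/outs/analysis/<SN>.bin200_1.0.h5ad"):
#     print(kind, name)
```

Groups such as /obs/leiden that contain categories and codes datasets are categorical columns. They appear as ordinary columns once the file is loaded with AnnData.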
Inspecting the dataset
You can inspect the data in a Python Jupyter Notebook. The AnnData package is typically used to examine the metadata stored in an H5AD file.
import anndata as ad
adata = ad.read_h5ad('./C4144D5.h5ad')
Entering the variable adata returns a brief summary of what the adata object contains.
# adata
AnnData object with n_obs × n_vars = 439383 × 29759
obs: 'total_counts', 'n_genes_by_counts', 'pct_counts_mt', 'leiden', 'orig.ident', 'x', 'y'
var: 'real_gene_name', 'n_cells', 'n_counts', 'mean_umi', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bin_size', 'bin_type', 'gene_exp_leiden', 'hvg', 'leiden_resolution', 'neighbors', 'omics', 'pca_variance_ratio', 'rank_genes_groups', 'resolution', 'sn'
obsm: 'X_pca', 'X_umap', 'spatial'
obsp: 'connectivities', 'distances'
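- obs: observation metadata (annotations per cell or bin). A table (DataFrame) of per-observation information, such as total_counts (the MID count), n_genes_by_counts (the number of gene types), the Leiden cluster label, and the spatial coordinates.
- var: variable metadata (annotations per gene). A table (DataFrame) of per-gene information, such as the gene name, whether the gene is highly variable, and the gene's biological function.
- adata.X: the preprocessed and normalized feature expression matrix, usually represented as a sparse matrix.
Enter adata.obs to view the cell metadata: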
# adata.obs
total_counts n_genes_by_counts pct_counts_mt leiden orig.ident x y
5583457496640 5 5 0.0 16 sample 1300 11840
5583457496660 24 19 0.0 16 sample 1300 11860
5583457496680 35 24 0.0 16 sample 1300 11880
5583457496700 26 20 0.0 16 sample 1300 11900
5583457496720 20 13 0.0 16 sample 1300 11920
... ... ... ... ... ... ... ...
84782654432300 5 4 0.0 16 sample 19740 9260
84782654432320 73 36 0.0 2 sample 19740 9280
84782654432340 66 33 0.0 2 sample 19740 9300
84782654432360 73 39 0.0 2 sample 19740 9320
84782654432380 18 9 0.0 16 sample 19740 9340
The cluster assignments produced by the Leiden algorithm are stored in adata.obs['leiden'].
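As an illustration of working with these labels, the sketch below counts the observations in each cluster and averages total_counts per cluster using pandas. `summarize_clusters` is a hypothetical helper, and the toy table reuses only a few rows from the output above and assumes the labels are stored as categorical strings.

```python
import pandas as pd


def summarize_clusters(obs, key="leiden"):
    """Per cluster: number of observations and mean total_counts."""
    grouped = obs.groupby(key, observed=True)
    return grouped["total_counts"].agg(n_obs="size", mean_total_counts="mean")


# Toy stand-in for adata.obs, built from four rows of the table above.
toy_obs = pd.DataFrame(
    {
        "total_counts": [5, 24, 73, 66],
        "leiden": pd.Categorical(["16", "16", "2", "2"]),
    },
    index=["5583457496640", "5583457496660", "84782654432320", "84782654432340"],
)
summary = summarize_clusters(toy_obs)
print(summary)
```

With a loaded file, the same call would be summarize_clusters(adata.obs).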
Enter adata.var to view the gene metadata:
# adata.var
real_gene_name n_cells n_counts mean_umi means dispersions dispersions_norm highly_variable
ENSMUSG00000000001 Gnai3 3641 4927 1.353200 0.218442 3.864523 0.787665 True
ENSMUSG00000000003 Pbsn 84 99 1.178571 0.004794 3.560551 -0.295258 False
ENSMUSG00000000028 Cdc45 279 335 1.200717 0.016009 3.475172 -0.599423 False
ENSMUSG00000000031 H19 11 13 1.181818 0.000698 3.654413 0.039135 False
ENSMUSG00000000037 Scml2 890 1142 1.283146 0.053320 3.649287 0.020871 False
... ... ... ... ... ... ... ... ...
ENSMUSG00000116984 CT030713.2 339 418 1.233038 0.020487 3.625916 -0.062388 False
ENSMUSG00000116987 AC150035.3 212 268 1.264151 0.013059 3.599458 -0.156647 False
ENSMUSG00000116988 AC164314.2 479 630 1.315240 0.029183 3.738448 0.338513 False
ENSMUSG00000116989 AC131339.4 67 77 1.149254 0.003788 3.486996 -0.557302 False
ENSMUSG00000116993 AC135964.2 210 276 1.314286 0.012649 3.561962 -0.290230 False
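The highly_variable flag can be used to narrow the table to the genes that feed the downstream analysis. The sketch below is one way to do this with pandas. `highly_variable_genes` is a hypothetical helper, and the toy table copies three rows from the output above.

```python
import pandas as pd


def highly_variable_genes(var):
    """Return rows flagged as highly variable, highest normalized dispersion first."""
    hvg = var[var["highly_variable"]]
    return hvg.sort_values("dispersions_norm", ascending=False)


# Toy stand-in for adata.var, built from three rows of the table above.
toy_var = pd.DataFrame(
    {
        "real_gene_name": ["Gnai3", "Pbsn", "Cdc45"],
        "dispersions_norm": [0.787665, -0.295258, -0.599423],
        "highly_variable": [True, False, False],
    },
    index=["ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028"],
)
top = highly_variable_genes(toy_var)
print(top["real_gene_name"].tolist())
```

With a loaded file, the same call would be highly_variable_genes(adata.var).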
You can also inspect the dimensionality reduction results:
# adata.obsm['X_pca']
array([[ -6.2587485 , 4.520802 , -9.568217 , ..., -1.9764128 ,
1.8916789 , -1.5253358 ],
[ -8.745589 , -0.8465148 , -5.714299 , ..., -0.06065813,
0.4646256 , -1.4891121 ],
[ -9.6469 , -0.99549234, -3.3434439 , ..., -1.5173951 ,
0.70049083, -0.56164753],
...,
[-10.14395 , -1.4017884 , -2.0154784 , ..., -1.759757 ,
0.12367768, -0.3570788 ],
[ -9.854649 , -0.06418456, -1.2702237 , ..., 0.20150849,
-0.12149827, -0.20047918],
[-11.477859 , -0.91366583, -3.719183 , ..., -1.4211301 ,
0.35585332, -0.44572574]], dtype=float32)
# adata.obsm['X_umap']
array([[-1.4992443, 1.803457 ],
[ 1.5087016, -0.8629851],
[ 2.234619 , -1.2892196],
...,
[ 3.3134413, -1.8212844],
[ 3.317021 , -1.615297 ],
[ 1.4269315, -1.217469 ]], dtype=float32)
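Rows of the embedding arrays follow the same order as the rows of adata.obs, so coordinates and cluster labels can be paired by position. The sketch below shows that pairing with NumPy and pandas. `embedding_frame` is a hypothetical helper, the coordinates are the first three UMAP rows shown above, and the cluster labels are invented for illustration.

```python
import numpy as np
import pandas as pd


def embedding_frame(embedding, obs, key="leiden"):
    """Pair each row of a 2-D embedding with its cluster label."""
    coords = np.asarray(embedding)
    if coords.shape[0] != len(obs):
        raise ValueError("embedding and obs must have the same number of rows")
    frame = pd.DataFrame(coords[:, :2], index=obs.index, columns=["dim1", "dim2"])
    frame[key] = obs[key].to_numpy()
    return frame


toy_umap = np.array(
    [[-1.4992443, 1.803457], [1.5087016, -0.8629851], [2.234619, -1.2892196]]
)
# Invented labels, used only to make the example self-contained.
toy_obs = pd.DataFrame({"leiden": ["1", "2", "2"]}, index=["a", "b", "c"])

frame = embedding_frame(toy_umap, toy_obs)
centroids = frame.groupby("leiden")[["dim1", "dim2"]].mean()
print(centroids)
```

With a loaded file, the call would be embedding_frame(adata.obsm['X_umap'], adata.obs). The resulting table can be passed to any plotting library to color points by cluster.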
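Differential expression analysis
SAW count, realign, and reanalyze output the differential expression analysis results in CSV format.
There are two CSV result files: find_marker_genes.csv and <bin_size>_marker_features.csv.
find_marker_genes.csv is the raw output of the differential expression analysis. The data in <bin_size>_marker_features.csv has been reorganized and reformatted so that it is more concise and easier to read.
For the features of each cluster, the following metrics are computed:
- Mean MID count
- Log2 fold change of expression
- Adjusted p-value (the confidence that the feature's expression in the current cluster differs from the other clusters)
- Fraction of the cluster in which the gene is expressed (Cluster 1 % of expressed = 1 means the feature is expressed in all cells or bins of the cluster)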
Feature ID,Feature Name,Cluster 1 Mean MID Count,Cluster 1 Log2 fold change,Cluster 1 Adjusted p-value,Cluster 1 % of expressed, ... ,Cluster 20 Mean MID Count,Cluster 20 Log2 fold change,Cluster 20 Adjusted p-value,Cluster 20 % of expressed
ENSMUSG00000016559,H3f3b,67.1754386,42.00155933,1.76E-41,1, ... ,0.076923077,-63.19518177,0,0.076923077
The differential expression results recorded in <bin_size>_marker_features.csv can be viewed in StereoMap or opened directly in Excel.
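The file can also be processed in a script. The sketch below reshapes the wide layout (four columns per cluster) into one record per feature and cluster using only the Python standard library. It assumes the column names follow the "Cluster N metric" pattern shown in the header above and that every metric cell is numeric. `marker_rows` is a hypothetical helper, and the sample keeps only the Cluster 1 and Cluster 20 columns from the example row.

```python
import csv
import io
import re

# Matches per-cluster column names such as "Cluster 1 Mean MID Count".
CLUSTER_COLUMN = re.compile(r"^Cluster (\S+) (.+)$")


def marker_rows(handle):
    """Yield one dict per (feature, cluster) from a wide marker-features CSV."""
    for row in csv.DictReader(handle):
        per_cluster = {}
        for column, value in row.items():
            match = CLUSTER_COLUMN.match(column)
            if match:
                cluster, metric = match.groups()
                per_cluster.setdefault(cluster, {})[metric] = float(value)
        for cluster, metrics in per_cluster.items():
            yield {
                "Feature ID": row["Feature ID"],
                "Feature Name": row["Feature Name"],
                "Cluster": cluster,
                **metrics,
            }


sample = io.StringIO(
    "Feature ID,Feature Name,"
    "Cluster 1 Mean MID Count,Cluster 1 Log2 fold change,"
    "Cluster 1 Adjusted p-value,Cluster 1 % of expressed,"
    "Cluster 20 Mean MID Count,Cluster 20 Log2 fold change,"
    "Cluster 20 Adjusted p-value,Cluster 20 % of expressed\n"
    "ENSMUSG00000016559,H3f3b,67.1754386,42.00155933,1.76E-41,1,"
    "0.076923077,-63.19518177,0,0.076923077\n"
)
records = list(marker_rows(sample))

# Keep features that are significantly enriched in a cluster.
# The 0.05 cutoff is an illustrative choice, not a SAW default.
enriched = [
    r for r in records
    if r["Adjusted p-value"] < 0.05 and r["Log2 fold change"] > 0
]
print([(r["Feature Name"], r["Cluster"]) for r in enriched])
```

For a real file, replace the sample with open(path, newline="") and pass the handle to marker_rows.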
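Multi-omics joint analysis: clustering
If you run SAW reanalyze on a Stereo-CITE T FF sample, the joint multi-omics clustering results are saved in an H5MU file.
The following is an example of the information recorded in an H5MU file: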
$ h5dump -n <task id>/outs/analysis/<SN>.bin20.h5mu
HDF5 "<task id>/outs/analysis/<SN>.bin20.h5mu" {
FILE_CONTENTS {
group /
group /mod
group /mod/multiomics
dataset /mod/multiomics/X
group /mod/multiomics/layers
dataset /mod/multiomics/layers/denoised_rna
group /mod/multiomics/obs
dataset /mod/multiomics/obs/_index
group /mod/multiomics/obs/leiden
dataset /mod/multiomics/obs/leiden/categories
dataset /mod/multiomics/obs/leiden/codes
dataset /mod/multiomics/obs/n_genes_by_counts
group /mod/multiomics/obs/orig.ident
dataset /mod/multiomics/obs/orig.ident/categories
dataset /mod/multiomics/obs/orig.ident/codes
dataset /mod/multiomics/obs/pct_counts_mt
dataset /mod/multiomics/obs/total_counts
dataset /mod/multiomics/obs/x
dataset /mod/multiomics/obs/y
group /mod/multiomics/obsm
dataset /mod/multiomics/obsm/X_totalVI
dataset /mod/multiomics/obsm/X_umap
dataset /mod/multiomics/obsm/spatial
group /mod/multiomics/obsp
group /mod/multiomics/obsp/connectivities
dataset /mod/multiomics/obsp/connectivities/data
dataset /mod/multiomics/obsp/connectivities/indices
dataset /mod/multiomics/obsp/connectivities/indptr
group /mod/multiomics/obsp/distances
dataset /mod/multiomics/obsp/distances/data
dataset /mod/multiomics/obsp/distances/indices
dataset /mod/multiomics/obsp/distances/indptr
group /mod/multiomics/raw
group /mod/multiomics/raw/X
dataset /mod/multiomics/raw/X/data
dataset /mod/multiomics/raw/X/indices
dataset /mod/multiomics/raw/X/indptr
group /mod/multiomics/raw/var
dataset /mod/multiomics/raw/var/_index
dataset /mod/multiomics/raw/var/mean_umi
dataset /mod/multiomics/raw/var/n_cells
dataset /mod/multiomics/raw/var/n_counts
group /mod/multiomics/raw/var/real_gene_name
dataset /mod/multiomics/raw/var/real_gene_name/categories
dataset /mod/multiomics/raw/var/real_gene_name/codes
group /mod/multiomics/raw/varm
group /mod/multiomics/uns
dataset /mod/multiomics/uns/bin_size
dataset /mod/multiomics/uns/bin_type
group /mod/multiomics/uns/gene_exp_leiden
dataset /mod/multiomics/uns/gene_exp_leiden/1
...
dataset /mod/multiomics/uns/gene_exp_leiden/9
dataset /mod/multiomics/uns/gene_exp_leiden/_index
group /mod/multiomics/uns/hvg
dataset /mod/multiomics/uns/hvg/method
group /mod/multiomics/uns/hvg/params
dataset /mod/multiomics/uns/hvg/source
dataset /mod/multiomics/uns/leiden_resolution
group /mod/multiomics/uns/neighbors
dataset /mod/multiomics/uns/neighbors/connectivities_key
dataset /mod/multiomics/uns/neighbors/distance_key
dataset /mod/multiomics/uns/omics
dataset /mod/multiomics/uns/resolution
group /mod/multiomics/uns/sn
dataset /mod/multiomics/uns/sn/_index
dataset /mod/multiomics/uns/sn/batch
dataset /mod/multiomics/uns/sn/sn
group /mod/multiomics/var
dataset /mod/multiomics/var/_index
dataset /mod/multiomics/var/highly_variable
dataset /mod/multiomics/var/highly_variable_nbatches
dataset /mod/multiomics/var/highly_variable_rank
dataset /mod/multiomics/var/mean_umi
dataset /mod/multiomics/var/means
dataset /mod/multiomics/var/n_cells
dataset /mod/multiomics/var/n_counts
group /mod/multiomics/var/real_gene_name
dataset /mod/multiomics/var/real_gene_name/categories
dataset /mod/multiomics/var/real_gene_name/codes
dataset /mod/multiomics/var/variances
dataset /mod/multiomics/var/variances_norm
group /mod/multiomics/varm
group /mod/multiomics/varp
group /mod/protein
...
group /mod/rna
...
group /obs
dataset /obs/_index
dataset /obs/_scvi_batch
dataset /obs/_scvi_labels
dataset /obs/_scvi_raw_norm_scaling
group /obsm
dataset /obsm/multiomics
dataset /obsm/protein
dataset /obsm/rna
group /obsmap
dataset /obsmap/multiomics
dataset /obsmap/protein
dataset /obsmap/rna
group /obsp
group /uns
dataset /uns/_scvi_manager_uuid
dataset /uns/_scvi_uuid
group /var
dataset /var/_index
dataset /var/mean_umi
dataset /var/n_cells
dataset /var/n_counts
group /var/real_gene_name
dataset /var/real_gene_name/categories
dataset /var/real_gene_name/codes
group /varm
dataset /varm/multiomics
dataset /varm/protein
dataset /varm/rna
group /varmap
dataset /varmap/multiomics
dataset /varmap/protein
dataset /varmap/rna
group /varp
}
}
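An H5MU file can be opened in Python with the mudata package, whose read_h5mu function returns a MuData object. Each group under /mod in the listing above then becomes one AnnData object in its mod mapping. The sketch below assumes mudata is installed. `load_multiomics` is a hypothetical helper, and the path in the usage comment is a placeholder.

```python
def load_multiomics(path):
    """Read an H5MU file; return the MuData object and its 'multiomics' modality."""
    # Imported inside the function so that defining the helper
    # does not itself require mudata to be installed.
    import mudata

    mdata = mudata.read_h5mu(path)
    return mdata, mdata.mod["multiomics"]


# Usage (placeholder path, replace with a real file):
# mdata, joint = load_multiomics("<task id>/outs/analysis/<SN>.bin20.h5mu")
# joint.obs["leiden"]        # joint clustering labels
# joint.obsm["X_totalVI"]    # joint latent representation
# joint.obsm["X_umap"]       # UMAP coordinates
# mdata.mod["rna"], mdata.mod["protein"]  # per-omics AnnData objects
```

The keys in the usage comment are taken from the listing above. The other two modalities are elided there, so their contents should be checked on the actual file.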