Matrices

Gene Expression File (GEF)

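The Gene Expression File (GEF) is a data management and storage format designed to support multi-dimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more combined gene expression matrices at various bin sizes. The cellbin GEF format stores the expression information within each cell. Each GEF container organizes a collection of spatial gene expression matrices and is built from two primary data objects: Group and Dataset. A Dataset is a multi-dimensional array of data elements. A Group is similar to a file-system directory: it organizes datasets and other groups hierarchically.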
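Because a GEF is an HDF5 container, its Group and Dataset hierarchy can be listed with a generic HDF5 reader. The following is a minimal sketch using the h5py package. It builds a tiny in-memory file whose group and dataset names (groupA, bin1, values) are hypothetical placeholders and do not reflect the real GEF layout.

```python
import h5py


def list_hdf5_tree(h5obj):
    """Return (path, kind, shape) tuples for every object below a file or group."""
    entries = []

    def visit(name, obj):
        # Groups behave like directories; Datasets hold the array data.
        if isinstance(obj, h5py.Group):
            entries.append((name, "Group", None))
        elif isinstance(obj, h5py.Dataset):
            entries.append((name, "Dataset", obj.shape))

    h5obj.visititems(visit)
    return entries


# In-memory demo file; names are made up for illustration only.
demo = h5py.File("demo.h5", "w", driver="core", backing_store=False)
demo.create_group("groupA/bin1")
demo["groupA/bin1"].create_dataset("values", data=[[1, 2], [3, 4]])
tree = list_hdf5_tree(demo)
demo.close()

for path, kind, shape in tree:
    print(path, kind, shape)
```

To inspect a real file, the same function could be applied to h5py.File(path, "r"), where path points to a .gef output.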

Gene Expression Matrix (GEM)

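The Gene Expression Matrix (GEM) stores spatial gene expression data. SAW generates several gene expression matrix files during the analysis pipeline. The basic format requires six columns, with a header row giving the column names. The six columns are gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. For cellbin data, a seventh column records the cell ID. In the expression matrix for the maximum-area bounding rectangle, several annotation lines beginning with "#" precede the column header row; the header field names and field types are described in the table.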
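The layout described above (optional "#" annotation lines, then a header row, then one record per line) can be read with the standard library. This is a minimal sketch under stated assumptions: the file is tab-delimited, annotation lines have a key=value form, and the column names in the sample (gene_id, gene_name, x, y, mid_count, exon_count) are illustrative stand-ins rather than the names SAW actually writes.

```python
import csv
import io


def read_gem(handle):
    """Parse a GEM-like text stream into (annotations, records)."""
    annotations = {}
    data_lines = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumed annotation form: "#key=value".
            key, _, value = line[1:].partition("=")
            annotations[key.strip()] = value.strip()
        else:
            data_lines.append(line)
    # The first non-annotation line is the header row with the column names.
    reader = csv.DictReader(data_lines, delimiter="\t")
    records = [dict(row) for row in reader]
    return annotations, records


# Hypothetical sample; field names and values are made up.
SAMPLE = (
    "#ExampleField=example_value\n"
    "gene_id\tgene_name\tx\ty\tmid_count\texon_count\n"
    "G0001\tGeneA\t10\t20\t3\t2\n"
    "G0002\tGeneB\t11\t20\t1\t1\n"
)

annotations, records = read_gem(io.StringIO(SAMPLE))
total_mid = sum(int(r["mid_count"]) for r in records)
print(annotations, len(records), total_mid)
```

Because the column names are taken from the header row, a cellbin file with a seventh cell ID column would be parsed by the same function without changes.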

File types

The expression matrix files output by the SAW pipeline fall into two main types, bin GEF and cellbin GEF. They can be identified quickly by file extension:

| File extension | Description |
| --- | --- |
| .gef | The feature expression matrix file in HDF5 format for visualization. It contains the MID count for each gene of each spot. A spot is a binning unit with a fixed-size square shape in which the expression values are accumulated. By default, a visualization .gef includes spot sizes of bin 1, 5, 10, 20, 50, 100, 150, 200. |
| .cellbin.gef | The cellbin feature expression matrix file in HDF5 format. It contains the spatial location and area of each cell, the MID count for each gene of each cell, and the cluster the cell belongs to. In .cellbin.gef, the cell is the smallest data unit. Only available when the cell segmentation was done based on a microscopy image. |
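The idea of a spot as a fixed-size square that accumulates expression can be illustrated by collapsing bin1 records into larger bins. This is a sketch only: it assumes each spot is anchored at the nearest lower multiple of the bin size in x and y, which may differ from how SAW aligns its bins, and the gene names and counts are made up.

```python
from collections import defaultdict


def aggregate_to_bin(records, bin_size):
    """Sum MID counts of (x, y, gene, mid) bin1 records into binN spots."""
    totals = defaultdict(int)
    for x, y, gene, mid in records:
        # Assumed anchoring: snap each coordinate down to a multiple of bin_size.
        spot = (x // bin_size * bin_size, y // bin_size * bin_size, gene)
        totals[spot] += mid
    return dict(totals)


# Hypothetical bin1 records: (x, y, gene, MID count).
bin1 = [
    (0, 0, "GeneA", 1),
    (3, 4, "GeneA", 2),
    (5, 0, "GeneA", 4),
    (7, 9, "GeneB", 1),
]

bin5 = aggregate_to_bin(bin1, 5)
for spot, mid in sorted(bin5.items()):
    print(spot, mid)
```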

Transcriptomics

The expression matrix files output by SAW count or SAW realign are typically:

| File | Description |
| --- | --- |
| <SN>.raw.gef | Feature expression matrix includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.gef | Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.tissue.gef | Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.cellbin.gef | Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area. |
| <SN>.adjusted.cellbin.gef | Cellbin expression matrix with cell border expanding, based on <SN>_<stainType>_mask_edm_dis_<distance>.tif. |
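Since several of these names share the same ending, telling them apart by suffix requires checking the longest suffix first. The sketch below shows one way to do that. The short descriptions are paraphrased from the table, and the sample name SN_EXAMPLE is a hypothetical placeholder.

```python
# Ordered from most specific to least specific so ".gef" is tested last.
SUFFIXES = [
    (".adjusted.cellbin.gef", "cellbin GEF with expanded cell borders"),
    (".cellbin.gef", "cellbin GEF"),
    (".tissue.gef", "visualization GEF under the tissue region"),
    (".raw.gef", "whole-chip GEF, bin1 only"),
    (".gef", "visualization GEF"),
]


def describe_output(filename):
    """Map a transcriptomics output file name to a short description."""
    for suffix, description in SUFFIXES:
        if filename.endswith(suffix):
            return description
    return "unknown"


for name in ["SN_EXAMPLE.raw.gef", "SN_EXAMPLE.adjusted.cellbin.gef", "SN_EXAMPLE.gef"]:
    print(name, "->", describe_output(name))
```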

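Tissue region information

Generating <SN>.tissue.gef, the GEF expression matrix under the tissue region, usually depends on the raw expression matrix <SN>.raw.gef and a tissue segmentation mask image.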
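Conceptually, combining a raw matrix with a tissue mask amounts to keeping only the records whose coordinates fall on mask pixels marked as tissue. The sketch below illustrates that idea with plain Python lists. It assumes the mask is already aligned to bin1 coordinates and indexed as mask[y][x], which is a simplification and not a description of how SAW performs the step.

```python
def keep_under_tissue(records, mask):
    """Keep (x, y, gene, mid) records whose position is marked as tissue."""
    kept = []
    for x, y, gene, mid in records:
        # Records outside the mask extent are treated as background.
        inside = 0 <= y < len(mask) and 0 <= x < len(mask[y]) and mask[y][x]
        if inside:
            kept.append((x, y, gene, mid))
    return kept


# Hypothetical 2-row by 3-column mask: 1 marks tissue, 0 marks background.
mask = [
    [0, 1, 1],
    [0, 1, 0],
]
raw = [
    (0, 0, "GeneA", 2),
    (1, 0, "GeneA", 1),
    (1, 1, "GeneB", 3),
    (2, 1, "GeneB", 4),
    (5, 5, "GeneC", 1),
]

tissue = keep_under_tissue(raw, mask)
print(tissue)
```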

The tissuecut.stat statistics file is located under /STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX and records feature statistics for the tissue region:

| Metric | Description |
| --- | --- |
| Tissue area in square nanometers | The physical tissue area of the sample slice, in square nanometers. |
| Contour area in pixel | The area of the tissue region on the tissue segmentation image, in pixels. |
| Number of DNB under tissue | The number of detected DNBs with RNA capture under the tissue region. |
| % of DNB under tissue | The proportion of detected DNBs with RNA capture under the tissue region relative to the total counts across the entire chip. |
| Total gene type under tissue | The total number of annotated gene types under the tissue region. |
| MID count under tissue | MID counts under the tissue region. |
| % of MID under tissue | The proportion of MID counts under the tissue region relative to the total counts across the entire chip. |
| Number of reads under tissue | The number of sequencing reads under the tissue region. |
| % of reads under tissue | The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip. |
| Mean reads per spot (binN) | Mean reads of each binN spot under the tissue region. |
| Median reads per spot (binN) | Median reads of each binN spot under the tissue region. |
| Mean gene type per spot (binN) | Mean gene type of each binN spot under the tissue region. |
| Median gene type per spot (binN) | Median gene type of each binN spot under the tissue region. |
| Mean MID per spot (binN) | Mean MID count of each binN spot under the tissue region. |
| Median MID per spot (binN) | Median MID count of each binN spot under the tissue region. |
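To make the definitions of the MID metrics concrete, the following sketch derives them from per-spot counts. It is an illustration under assumptions, not the SAW implementation: the input is a hypothetical mapping from binN spot to MID count, and only spots present in that mapping enter the mean and median.

```python
from statistics import mean, median


def tissue_mid_metrics(tissue_spot_mid, chip_total_mid):
    """Compute MID metrics from per-spot counts under the tissue region."""
    counts = list(tissue_spot_mid.values())
    mid_under_tissue = sum(counts)
    return {
        "MID count under tissue": mid_under_tissue,
        # Share of tissue MID counts relative to the whole chip, in percent.
        "% of MID under tissue": 100.0 * mid_under_tissue / chip_total_mid,
        "Mean MID per spot": mean(counts),
        "Median MID per spot": median(counts),
    }


# Hypothetical values: three tissue spots and 40 MIDs on the entire chip.
spots = {(0, 0): 4, (0, 50): 6, (50, 0): 10}
metrics = tissue_mid_metrics(spots, chip_total_mid=40)
for name, value in metrics.items():
    print(name, value)
```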

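Microorganism

When analyzing Stereo-seq N FFPE tissue samples, setting the --microorganism-detect parameter for a SAW count run saves the microorganism-related expression matrix files in the /outs/feature_expression/microorganism directory. The files are as follows: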

| File | Description |
| --- | --- |
| <SN>.microorganism.raw.gef | Feature expression matrix of microorganisms includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.microorganism.gef | Feature expression matrix of microorganisms. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.host_microorganism.raw.gef | Feature expression matrix of microorganisms and the host includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.host_microorganism.gef | Feature expression matrix of microorganisms and the host. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.microorganism.<classification>.gem | Feature expression matrix of a specific classification of microbes. Classifications of microorganisms include phylum, class, order, family, genus, and species. |
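As a small illustration of the per-classification naming pattern, the sketch below expands the <classification> placeholder for each listed rank. It assumes the placeholder is replaced by the lowercase rank name exactly as written in the list, which the text does not confirm, and SN_EXAMPLE is a hypothetical sample name.

```python
# Ranks as listed in the description of the per-classification GEM files.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def microorganism_gem_names(sn):
    """Build one candidate file name per taxonomic rank for a given SN."""
    return [f"{sn}.microorganism.{rank}.gem" for rank in RANKS]


names = microorganism_gem_names("SN_EXAMPLE")
for name in names:
    print(name)
```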

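Microorganism classification information

After Kraken2 classifies the microorganism data, two files appear in the /STEREO_ANALYSIS_WORKFLOW_PROCESSING/MICROORGANISM/ANALYSIS directory: seq_complete_info.txt and seq_complete_info_dedup.txt. They differ in that the latter has been deduplicated for microorganism alignments. Each line records the alignment result of one read and mainly includes the read ID, spatial coordinates, MID, taxonomy ID, scientific name, detailed taxonomic classification, and read count.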

$ head ./STEREO_ANALYSIS_WORKFLOW_PROCESSING/MICROORGANISM/ANALYSIS/seq_complete_info.txt
seq    x    y    umi    taxid    Scientific_Name    kindom    phylum    class    order    family    genus    species    count
V350264949L2C001R01701342667    11662    10937    ABB    77643    Mycobacterium_tuberculosis_complex    k__Bacteria    p__Actinobacteria    c__Actinomycetia    o__Corynebacteriales    f__Mycobacteriaceae    g__Mycobacterium        1
V350264949L2C001R01701347399    2742    13877    2B7    1783272    Terrabacteria_group    k__Bacteria                            1
V350264949L2C001R01900083155    18561    10644    C9E    5338    Agaricales    k__Fungi    p__Basidiomycota    c__Agaricomycetes    o__Agaricales             1
V350264949L2C001R01900086639    3861    14913    810    1760    Actinomycetia    k__Bacteria    p__Actinobacteria    c__Actinomycetia                     1
V350264949L2C001R01800770541    4264    17661    C20    1783272    Terrabacteria_group    k__Bacteria                            1
V350264949L2C001R01900181247    4396    16735    6EC    1762    Mycobacteriaceae    k__Bacteria    p__Actinobacteria    c__Actinomycetia    o__Corynebacteriales    f__Mycobacteriaceae            1
V350264949L2C001R01900245227    18830    15878    D88    2    Bacteria    k__Bacteria                            1
V350264949L2C001R01900248762    18671    15840    77E    2    Bacteria    k__Bacteria                            1
V350264949L2C001R01900262154    15242    9034    4D0    1224    Proteobacteria    k__Bacteria    p__Proteobacteria                        1
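A file with this header can be summarized per taxonomic rank with the standard library. The sketch below assumes the file is tab-delimited, so that unassigned ranks appear as empty fields, and it labels those as "unclassified", which is a naming choice made here. The sample rows reuse three records from the listing above, and the column names (including the spelling kindom) are taken from its header line.

```python
import csv
from collections import Counter

HEADER = ["seq", "x", "y", "umi", "taxid", "Scientific_Name", "kindom",
          "phylum", "class", "order", "family", "genus", "species", "count"]


def count_by_rank(lines, rank):
    """Sum the count column per value of one taxonomic rank column."""
    totals = Counter()
    for row in csv.DictReader(lines, delimiter="\t"):
        label = row[rank] or "unclassified"
        totals[label] += int(row["count"])
    return totals


ROWS = [
    ["V350264949L2C001R01701342667", "11662", "10937", "ABB", "77643",
     "Mycobacterium_tuberculosis_complex", "k__Bacteria", "p__Actinobacteria",
     "c__Actinomycetia", "o__Corynebacteriales", "f__Mycobacteriaceae",
     "g__Mycobacterium", "", "1"],
    ["V350264949L2C001R01701347399", "2742", "13877", "2B7", "1783272",
     "Terrabacteria_group", "k__Bacteria", "", "", "", "", "", "", "1"],
    ["V350264949L2C001R01900083155", "18561", "10644", "C9E", "5338",
     "Agaricales", "k__Fungi", "p__Basidiomycota", "c__Agaricomycetes",
     "o__Agaricales", "", "", "", "1"],
]
lines = ["\t".join(HEADER)] + ["\t".join(row) for row in ROWS]

by_phylum = count_by_rank(lines, "phylum")
print(dict(by_phylum))
```

For a file on disk, the same function could be given an open text handle instead of the in-memory list.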

Proteomics

If SAW count or SAW realign is used to analyze Stereo-CITE T FF tissue samples, the spatial protein expression matrices are saved in /outs/feature_expression.

The files are as follows:

| File | Description |
| --- | --- |
| <SN>.protein.raw.gef | Feature expression matrix includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.protein.gef | Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.protein.tissue.gef | Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.protein.cellbin.gef | Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area. |
| <SN>.protein.adjusted.cellbin.gef | Cellbin expression matrix with cell border expanding, based on <SN>_<stainType>_mask_edm_dis_<distance>.tif. |
| <SN>.protein.tissue.rmbg.gem.gz | Feature expression matrix from automatic protein background removal. It shows bin1 expression counts. |
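The background-removed matrix is a gzip-compressed GEM, so it can be streamed without unpacking it to disk first. The following sketch totals counts per feature from such a stream. It assumes a tab-delimited layout with optional "#" annotation lines, and the column names (protein_id, mid_count) and sample values are hypothetical; the real header should be checked before use.

```python
import csv
import gzip
import io
from collections import Counter


def mid_per_feature(gz_stream, feature_col, count_col):
    """Sum a count column per feature from a gzip-compressed GEM-like stream."""
    totals = Counter()
    with gzip.open(gz_stream, "rt", newline="") as text:
        # Skip annotation lines so the header row is the first line parsed.
        rows = (line for line in text if not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            totals[row[feature_col]] += int(row[count_col])
    return totals


# Hypothetical content compressed in memory to stand in for a .gem.gz file.
SAMPLE = (
    "#ExampleField=example_value\n"
    "protein_id\tx\ty\tmid_count\n"
    "P1\t1\t1\t5\n"
    "P2\t1\t2\t3\n"
    "P1\t2\t2\t2\n"
)
packed = io.BytesIO(gzip.compress(SAMPLE.encode("utf-8")))

totals = mid_per_feature(packed, "protein_id", "mid_count")
print(dict(totals))
```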

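Tissue region information

Generating <SN>.protein.tissue.gef, the protein GEF expression matrix under the tissue region, usually depends on the raw protein expression matrix <SN>.protein.raw.gef and a tissue segmentation mask image.

The protein.tissuecut.stat statistics file is located under /STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX and records feature statistics for the tissue region: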

| Metric | Description |
| --- | --- |
| Tissue area in square nanometers | The physical tissue area of the sample slice, in square nanometers. |
| Contour area in pixel | The area of the tissue region on the tissue segmentation image, in pixels. |
| Number of DNB under tissue | The number of detected DNBs with ADT capture under the tissue region. |
| % of DNB under tissue | The proportion of detected DNBs with ADT capture under the tissue region relative to the total counts across the entire chip. |
| Total protein type under tissue | The total number of annotated protein types under the tissue region. |
| MID count under tissue | MID counts under the tissue region. |
| % of MID under tissue | The proportion of MID counts under the tissue region relative to the total counts across the entire chip. |
| Number of reads under tissue | The number of sequencing reads under the tissue region. |
| % of reads under tissue | The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip. |
| Mean reads per spot (binN) | Mean reads of each binN spot under the tissue region. |
| Median reads per spot (binN) | Median reads of each binN spot under the tissue region. |
| Mean protein type per spot (binN) | Mean protein type of each binN spot under the tissue region. |
| Median protein type per spot (binN) | Median protein type of each binN spot under the tissue region. |
| Mean MID per spot (binN) | Mean MID count of each binN spot under the tissue region. |
| Median MID per spot (binN) | Median MID count of each binN spot under the tissue region. |

© 2025 STOmics Tech. All rights reserved. Modified: 2025-12-29 19:47:43
