Matrices
Gene Expression File (GEF)
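The Gene Expression File (GEF) is a data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more combined gene expression matrices at various bin sizes. The cellbin GEF format stores the expression information within each cell. Each GEF container organizes a collection of spatial gene expression matrices and contains two primary data objects: Group and Dataset. A Dataset is a multidimensional array of data elements. A Group is similar to a file system directory: it organizes datasets and other groups in a hierarchy.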
Gene Expression Matrix (GEM)
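The Gene Expression Matrix (GEM) stores spatial gene expression data. SAW generates several GEM files during the analysis workflow. The basic format requires six columns, with a header row giving the column names. The six columns are gene ID, gene name, x coordinate, y coordinate, MID count, and exon count; cellbin data has a seventh column that records the cell ID. In the expression matrix of the maximum-area enclosing rectangle region, several comment lines beginning with "#" precede the column header row. The header field names and field types are described in the table.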
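As a minimal sketch of reading this layout, the Python snippet below skips the leading "#" comment lines, treats the next line as the column header, and returns the remaining rows as dictionaries. The tab delimiter, the column names in the sample (`geneID`, `geneName`, `x`, `y`, `MIDCount`, `ExonCount`), and the sample comment line are assumptions made for illustration and are not taken from SAW output; the function simply uses whatever header the file contains.

```python
import csv
import io


def read_gem(handle, delimiter="\t"):
    """Split a GEM-style text stream into comment lines, header, and rows."""
    comments, data_lines = [], []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith("#"):
            comments.append(line)  # annotation lines before the header
        else:
            data_lines.append(line)  # header row followed by data rows
    reader = csv.DictReader(data_lines, delimiter=delimiter)
    rows = list(reader)
    return comments, reader.fieldnames, rows


# Hypothetical sample: comment text, column names, and values are invented.
SAMPLE = (
    "# example annotation line\n"
    "geneID\tgeneName\tx\ty\tMIDCount\tExonCount\n"
    "G0001\tGeneA\t10\t20\t3\t2\n"
    "G0002\tGeneB\t11\t20\t1\t1\n"
)

comments, header, rows = read_gem(io.StringIO(SAMPLE))
total_mid = sum(int(row["MIDCount"]) for row in rows)
print(header, len(rows), total_mid)
```

For a gzip-compressed matrix, the same function could be given a text handle opened with `gzip.open(path, "rt")` from the Python standard library.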
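File Types
The expression matrix files output by the SAW analysis workflow fall into two main types, bin GEF and cellbin GEF, which can be quickly identified by their file suffix:
Transcriptomics
The expression matrix files output by SAW count and SAW realign are typically: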
| File | Description |
|---|---|
| <SN>.raw.gef | Feature expression matrix includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.gef | Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.tissue.gef | Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.cellbin.gef | Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area. |
| <SN>.adjusted.cellbin.gef | Cellbin expression matrix with cell border expanding, based on <SN>_<stainType>_mask_edm_dis_<distance>.tif. |
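Identifying a file by its suffix can be sketched in Python as follows. The suffixes come from the table above; the function name `classify_gef`, the description strings, and the sample serial number `SN123` are hypothetical. Longer suffixes are checked first so that `.adjusted.cellbin.gef` is not mistaken for a plain `.gef` file.

```python
# Suffixes from the table above, ordered from most to least specific.
GEF_SUFFIXES = [
    (".adjusted.cellbin.gef", "cellbin GEF, expanded cell borders"),
    (".cellbin.gef", "cellbin GEF"),
    (".tissue.gef", "bin GEF, tissue region"),
    (".raw.gef", "bin GEF, whole chip, bin1 only"),
    (".gef", "bin GEF, visualization with multiple bin sizes"),
]


def classify_gef(filename):
    """Return a short description for a GEF file name, or None if unknown."""
    for suffix, kind in GEF_SUFFIXES:
        if filename.endswith(suffix):
            return kind
    return None


# "SN123" is a made-up serial number used only for this example.
for name in ("SN123.raw.gef", "SN123.gef", "SN123.adjusted.cellbin.gef"):
    print(name, "->", classify_gef(name))
```

Because the check only looks at the end of the name, it makes no distinction between transcriptomic, protein, or microorganism matrices; that would need an additional check on the part of the name before the suffix.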
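Tissue Region Information
Generating <SN>.tissue.gef, the GEF expression matrix under the tissue region, usually relies on the raw expression matrix <SN>.raw.gef and a tissue segmentation mask image.
The tissuecut.stat statistics file is located in /STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX and records feature statistics under the tissue region: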
| Metric | Description |
|---|---|
| Tissue area in square nanometers | The physical tissue area of the sample slice, in square nanometers. |
| Contour area in pixel | The area of the tissue region on the tissue segmentation image, in pixels. |
| Number of DNB under tissue | The number of detected DNBs with RNA capture under the tissue region. |
| % of DNB under tissue | The proportion of detected DNBs with RNA capture under the tissue region relative to the total counts across the entire chip. |
| Total gene type under tissue | The total number of annotated gene types under the tissue region. |
| MID count under tissue | MID counts under the tissue region. |
| % of MID under tissue | The proportion of MID counts under the tissue region relative to the total counts across the entire chip. |
| Number of reads under tissue | The number of sequencing reads under the tissue region. |
| % of reads under tissue | The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip. |
| Mean reads per spot (binN) | Mean reads of each binN spot under the tissue region. |
| Median reads per spot (binN) | Median reads of each binN spot under the tissue region. |
| Mean gene type per spot (binN) | Mean gene type of each binN spot under the tissue region. |
| Median gene type per spot (binN) | Median gene type of each binN spot under the tissue region. |
| Mean MID per spot (binN) | Mean MID count of each binN spot under the tissue region. |
| Median MID per spot (binN) | Median MID count of each binN spot under the tissue region. |
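To illustrate how the per-spot binN metrics relate to bin1 data, the sketch below sums bin1 MID counts into binN spots and then takes the mean and median across spots. It assumes that a binN spot groups an N x N block of bin1 positions and that a spot is identified by flooring each coordinate to a multiple of N; it also ignores the tissue mask. The function name and the sample records are hypothetical, and this is not a description of how SAW itself computes tissuecut.stat.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_to_bin(records, bin_size):
    """Sum MID counts of (x, y, mid) bin1 records into binN spots.

    Assumption: a spot is keyed by each coordinate floored to a multiple
    of bin_size, so one spot covers a bin_size x bin_size block.
    """
    totals = defaultdict(int)
    for x, y, mid in records:
        key = (x // bin_size * bin_size, y // bin_size * bin_size)
        totals[key] += mid
    return dict(totals)


# Invented bin1 records: (x, y, MID count).
records = [(0, 0, 2), (3, 4, 1), (5, 0, 4), (9, 9, 3), (12, 1, 6)]

bin5 = aggregate_to_bin(records, 5)
mean_mid = mean(bin5.values())      # mean MID per bin5 spot
median_mid = median(bin5.values())  # median MID per bin5 spot
print(bin5, mean_mid, median_mid)
```

With `bin_size=1` the function returns the bin1 totals unchanged, which gives a quick consistency check against the input.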
Microorganisms
When analyzing Stereo-seq N FFPE tissue samples, setting the --microorganism-detect parameter for a SAW count run saves the microorganism-related expression matrix files in the /outs/feature_expression/microorganism directory, as follows:
| File | Description |
|---|---|
| <SN>.microorganism.raw.gef | Feature expression matrix of microorganisms includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.microorganism.gef | Feature expression matrix of microorganisms. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.host_microorganism.raw.gef | Feature expression matrix of microorganisms and the host includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.host_microorganism.gef | Feature expression matrix of microorganisms and the host. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.microorganism.<classification>.gem | Feature expression matrix of a specific classification of microbes. Classifications of microorganisms include phylum, class, order, family, genus, and species. |
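Microorganism Classification Information
After Kraken2 classifies the microorganism data, two files appear in the /STEREO_ANALYSIS_WORKFLOW_PROCESSING/MICROORGANISM/ANALYSIS directory: seq_complete_info.txt and seq_complete_info_dedup.txt. The difference between them is that the latter has been deduplicated for microorganism alignments. Each line is the alignment record of one read and mainly includes the read ID, spatial coordinates, MID count, taxonomy ID, scientific name, detailed taxonomic classification, and read count.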
$ head ./STEREO_ANALYSIS_WORKFLOW_PROCESSING/MICROORGANISM/ANALYSIS/seq_complete_info.txt
seq x y umi taxid Scientific_Name kindom phylum class order family genus species count
V350264949L2C001R01701342667 11662 10937 ABB 77643 Mycobacterium_tuberculosis_complex k__Bacteria p__Actinobacteria c__Actinomycetia o__Corynebacteriales f__Mycobacteriaceae g__Mycobacterium 1
V350264949L2C001R01701347399 2742 13877 2B7 1783272 Terrabacteria_group k__Bacteria 1
V350264949L2C001R01900083155 18561 10644 C9E 5338 Agaricales k__Fungi p__Basidiomycota c__Agaricomycetes o__Agaricales 1
V350264949L2C001R01900086639 3861 14913 810 1760 Actinomycetia k__Bacteria p__Actinobacteria c__Actinomycetia 1
V350264949L2C001R01800770541 4264 17661 C20 1783272 Terrabacteria_group k__Bacteria 1
V350264949L2C001R01900181247 4396 16735 6EC 1762 Mycobacteriaceae k__Bacteria p__Actinobacteria c__Actinomycetia o__Corynebacteriales f__Mycobacteriaceae 1
V350264949L2C001R01900245227 18830 15878 D88 2 Bacteria k__Bacteria 1
V350264949L2C001R01900248762 18671 15840 77E 2 Bacteria k__Bacteria 1
V350264949L2C001R01900262154 15242 9034 4D0 1224 Proteobacteria k__Bacteria p__Proteobacteria 1
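A record of this kind could be parsed along the following lines. This is a sketch based only on the sample output above: it assumes whitespace-separated fields, six fixed leading columns, a trailing read count, and taxonomy tokens identified by the prefixes `k__`, `p__`, `c__`, `o__`, `f__`, and `g__`. The `s__` prefix for species does not appear in the sample and is an assumption, as is the function name `parse_record`.

```python
# Rank prefixes seen in the sample output; "s__" is assumed by analogy.
RANK_PREFIXES = {
    "k__": "kingdom",
    "p__": "phylum",
    "c__": "class",
    "o__": "order",
    "f__": "family",
    "g__": "genus",
    "s__": "species",
}


def parse_record(line):
    """Parse one data line of seq_complete_info.txt into a dictionary."""
    fields = line.split()
    seq, x, y, umi, taxid, name = fields[:6]
    lineage = {}
    # Tokens between the scientific name and the final count are ranks;
    # ranks that were not resolved for a read are simply absent.
    for token in fields[6:-1]:
        rank = RANK_PREFIXES.get(token[:3])
        if rank is not None:
            lineage[rank] = token[3:]
    return {
        "seq": seq,
        "x": int(x),
        "y": int(y),
        "umi": umi,
        "taxid": int(taxid),
        "scientific_name": name,
        "lineage": lineage,
        "count": int(fields[-1]),
    }


# One data line copied from the sample output above.
line = (
    "V350264949L2C001R01900083155 18561 10644 C9E 5338 Agaricales "
    "k__Fungi p__Basidiomycota c__Agaricomycetes o__Agaricales 1"
)
record = parse_record(line)
print(record["scientific_name"], record["lineage"], record["count"])
```

The header line of the file should be skipped before calling the function, since its coordinate and count columns are not numeric.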
Proteomics
When Stereo-CITE T FF tissue samples are analyzed with SAW count and SAW realign, the spatial protein expression matrices are saved in /outs/feature_expression.
The contents are as follows:
| File | Description |
|---|---|
| <SN>.protein.raw.gef | Feature expression matrix includes the whole information over a complete chip region. It only has bin1 expression counts. |
| <SN>.protein.gef | Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.protein.tissue.gef | Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200. |
| <SN>.protein.cellbin.gef | Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area. |
| <SN>.protein.adjusted.cellbin.gef | Cellbin expression matrix with cell border expanding, based on <SN>_<stainType>_mask_edm_dis_<distance>.tif. |
| <SN>.protein.tissue.rmbg.gem.gz | Feature expression matrix from automatic protein background removal. It shows bin1 expression counts. |
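Tissue Region Information
Generating <SN>.protein.tissue.gef, the protein GEF expression matrix under the tissue region, usually relies on the raw protein expression matrix <SN>.protein.raw.gef and a tissue segmentation mask image.
The protein.tissuecut.stat statistics file is located in /STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX and records feature statistics under the tissue region: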
| Metric | Description |
|---|---|
| Tissue area in square nanometers | The physical tissue area of the sample slice, in square nanometers. |
| Contour area in pixel | The area of the tissue region on the tissue segmentation image, in pixels. |
| Number of DNB under tissue | The number of detected DNBs with ADT capture under the tissue region. |
| % of DNB under tissue | The proportion of detected DNBs with ADT capture under the tissue region relative to the total counts across the entire chip. |
| Total protein type under tissue | The total number of annotated protein types under the tissue region. |
| MID count under tissue | MID counts under the tissue region. |
| % of MID under tissue | The proportion of MID counts under the tissue region relative to the total counts across the entire chip. |
| Number of reads under tissue | The number of sequencing reads under the tissue region. |
| % of reads under tissue | The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip. |
| Mean reads per spot (binN) | Mean reads of each binN spot under the tissue region. |
| Median reads per spot (binN) | Median reads of each binN spot under the tissue region. |
| Mean protein type per spot (binN) | Mean protein type of each binN spot under the tissue region. |
| Median protein type per spot (binN) | Median protein type of each binN spot under the tissue region. |
| Mean MID per spot (binN) | Mean MID count of each binN spot under the tissue region. |
| Median MID per spot (binN) | Median MID count of each binN spot under the tissue region. |