Check Annotation Files

SAW checkGTF

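The SAW count analysis accepts only annotation files in standard format, so the content and format of the file are validated automatically before SAW count annotates reads. Beyond this validation, SAW checkGTF can also be used to extract specific annotation information.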

SAW accepts annotation files with the following file extensions: gtf/gtf.gz, gff/gff.gz, and gff3/gff3.gz.
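As a quick pre-flight check, a file name can be compared against the suffixes listed above before it is passed to SAW. The following is a minimal sketch: `has_accepted_suffix` is a hypothetical helper, not part of SAW, and it assumes the suffixes are matched case-sensitively, which the source does not state.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SAW): succeed if the file name ends
# with one of the suffixes listed above. Case-sensitive by assumption.
has_accepted_suffix() {
    case "$1" in
        *.gtf|*.gtf.gz|*.gff|*.gff.gz|*.gff3|*.gff3.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: report the result for a few illustrative file names.
for name in genes.gtf genes.gff3.gz genes.bed; do
    if has_accepted_suffix "$name"; then
        echo "$name: accepted"
    else
        echo "$name: rejected"
    fi
done
```

This only inspects the name; it says nothing about whether the file content is valid, which is what SAW checkGTF itself verifies.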

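If the annotation file has any of the following content or format problems (common errors), SAW checkGTF applies simple corrections and issues warnings so that the file can be used normally.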

| Issue | Solution |
| --- | --- |
| In the seventh column indicating the sense and antisense strands, "-" and "_" symbols are mistakenly mixed. | Check each row of the annotation file and correct the error symbol "_" to "-". |
| A gene block lacks gene ID. | Discard the entire gene information and issue a warning. |
| A gene row is missing but the transcript row has information. | Discard the entire gene information and issue a warning. |
| A gene block lacks gene name. | Use gene ID to fill in the missing one. |
| A gene row exists but partial information of child rows is missing. | Child rows can inherit information from their gene rows. |
| A transcript ID is missing. | Fill it with its parent gene ID suffixed with a sequential number (e.g., XXX.1). |
| A row contains multiple attributes with the same name. | Only save the last <attribute:value> of the duplicated entries, for subsequent annotation. |

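*Note that the program assumes by default that the input annotation file is correctly sorted, and it identifies gene annotation information in gene-block format.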
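To get a feel for the first issue in the table above, the strand column can be inspected by hand before running the tool. The sketch below is not SAW code: `count_bad_strand` and `fix_strand` are hypothetical helpers that assume an uncompressed, tab-separated file with the strand in column 7 and comment lines starting with "#".

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of SAW.

# Count feature rows whose 7th (strand) column is "_".
count_bad_strand() {
    awk -F'\t' '!/^#/ && $7 == "_" { n++ } END { print n + 0 }' "$1"
}

# Print the file with "_" in the strand column replaced by "-".
# Comment lines and all other rows pass through unchanged.
fix_strand() {
    awk -F'\t' 'BEGIN { OFS = "\t" } !/^#/ && $7 == "_" { $7 = "-" } { print }' "$1"
}

# Example on a two-row illustrative file (made-up records).
tmp=$(mktemp)
printf 'chr1\tsrc\tgene\t1\t100\t.\t_\t.\tgene_id "g1";\n' > "$tmp"
printf 'chr1\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id "g2";\n' >> "$tmp"
count_bad_strand "$tmp"        # prints 1
fix_strand "$tmp" | cut -f7    # prints "-" then "+"
rm -f "$tmp"
```

The other corrections in the table (missing IDs, inheritance from gene rows, duplicated attributes) depend on gene-block context and are better left to SAW checkGTF itself.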

Run the following command to check the format of an annotation file:

saw checkGTF \
    --input-gtf=/path/to/input/GTF/or/GFF \
    --output-gtf=/path/to/output/GTF/or/GFF

If the --output-gtf parameter is not set, the program does not write a corrected result file, so check the log messages carefully.

To extract specific annotation information, for example gene_biotype:protein_coding and gene_biotype:lincRNA, run the program as follows:

saw checkGTF \
    --input-gtf=/path/to/input/GTF/or/GFF \
    --attribute=key1:value1,key2:value2,... \
    --output-gtf=/path/to/output/GTF/or/GFF

If --attribute is enabled, SAW checkGTF extracts the specified annotation information but does not check the file format.
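Before choosing key:value pairs for --attribute, it can help to see how many rows in the input carry each pair. The sketch below is an approximation, not a reproduction of SAW's matching logic: `count_attribute_rows` is a hypothetical helper that assumes an uncompressed GTF file with attributes written as key "value"; GFF3 files use key=value and would need a different pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of SAW: count non-comment rows that
# contain the attribute given as key:value, using GTF syntax key "value".
count_attribute_rows() {
    key=${2%%:*}      # text before the first ":"
    value=${2#*:}     # text after the first ":"
    # grep -c exits non-zero on zero matches, so "|| true" keeps the
    # function's exit status clean while still printing 0.
    grep -v '^#' "$1" | grep -c -F -- "$key \"$value\"" || true
}

# Example on a three-row illustrative file (made-up records).
tmp=$(mktemp)
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";\n' > "$tmp"
printf 'chr1\tsrc\tgene\t200\t300\t.\t-\t.\tgene_id "g2"; gene_biotype "lincRNA";\n' >> "$tmp"
printf 'chr1\tsrc\tgene\t400\t500\t.\t+\t.\tgene_id "g3"; gene_biotype "snRNA";\n' >> "$tmp"
for pair in gene_biotype:protein_coding gene_biotype:lincRNA; do
    echo "$pair: $(count_attribute_rows "$tmp" "$pair") row(s)"
done
rm -f "$tmp"
```

Because the counts are reported per pair, the sketch makes no assumption about how SAW combines several pairs passed to --attribute.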

© 2025 STOmics Tech. All rights reserved. Modified: 2025-12-29 19:47:43
