Check the Annotation File
SAW checkGTF
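SAW count accepts only annotation files in standard format, so it automatically validates the content and format of the file before annotating reads. In addition, SAW checkGTF can be used to extract specific annotation information.
SAW accepts annotation files with the following file extensions: gtf/gtf.gz, gff/gff.gz, and gff3/gff3.gz.
If the annotation file has any of the following content or format problems (common errors), SAW checkGTF applies simple corrections and issues warnings so that the file can be used normally.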
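As a quick pre-check before invoking SAW, the suffix list above can be tested in the shell. This is only a minimal sketch: the function name `is_supported_annotation` is hypothetical and not part of SAW, and the match is case-sensitive, which may or may not reflect how SAW itself treats file names.

```shell
# Hypothetical helper (not part of SAW): succeed only if the file name
# ends with one of the suffixes listed above.
is_supported_annotation() {
    case "$1" in
        *.gtf|*.gtf.gz|*.gff|*.gff.gz|*.gff3|*.gff3.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: report whether a given file name looks acceptable.
if is_supported_annotation "genes.gtf.gz"; then
    echo "suffix accepted"
else
    echo "suffix not accepted"
fi
```

The check looks only at the file name; it says nothing about whether the content is valid, which is what SAW checkGTF is for.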
| Issue | Solution |
|---|---|
| In the seventh column indicating the sense and antisense strands, "-" and "_" symbols are mistakenly mixed. | Check each row of the annotation file and correct the error symbol "_" to "-". |
| A gene block lacks gene ID. | Discard the entire gene information and issue a warning. |
| A gene row is missing but the transcript row has information. | Discard the entire gene information and issue a warning. |
| A gene block lacks gene name. | Use gene ID to fill in the missing one. |
| A gene row exists but partial information of child rows is missing. | Child rows can inherit information from their gene rows. |
| A transcript ID is missing. | Fill it with its parent gene ID suffixed with a sequential number (e.g., XXX.1). |
| A row contains multiple attributes with the same name. | Only save the last <attribute:value> of the duplicated entries, for subsequent annotation. |
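*Note that the program assumes by default that the input annotation file is correctly sorted, and it identifies gene annotation information in gene-block format.
Run the following command to check the format of the annotation file: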
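To make the first correction in the table concrete, the awk sketch below replaces "_" with "-" in the seventh column. It is not SAW's implementation, only an illustration of the rule. The function name `fix_strand` is hypothetical, and the sketch assumes tab-separated columns and comment lines that start with `#`.

```shell
# Hypothetical illustration (not SAW's code): correct "_" to "-" in the
# strand column (column 7) of a tab-separated GTF/GFF stream.
fix_strand() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }          # pass comment and header lines through
         $7 == "_" { $7 = "-" }        # fix the mistaken strand symbol
         { print }'
}

# Usage: fix_strand < input.gtf > fixed.gtf
```

Rows whose strand is already "+" or "-" pass through unchanged. The other corrections in the table depend on gene-block context and are not covered by this sketch.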
saw checkGTF \
--input-gtf=/path/to/input/GTF/or/GFF \
--output-gtf=/path/to/output/GTF/or/GFF
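If the --output-gtf parameter is not set, the program does not write a corrected output file, so be sure to check the log messages.
To extract specific annotation information, for example gene_biotype:protein_coding or gene_biotype:lincRNA, run the program as follows: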
saw checkGTF \
--input-gtf=/path/to/input/GTF/or/GFF \
--attribute=key1:value1,key2:value2,... \
--output-gtf=/path/to/output/GTF/or/GFF
If --attribute is enabled, SAW checkGTF extracts the specified annotation information but does not check the file format.
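As a concrete instance of the extraction template, the sketch below fills in the gene_biotype:protein_coding example from this section and adds a rough sanity check. The file paths are placeholders, and the helper `count_rows_with` is hypothetical and not part of SAW. It assumes GTF-style attributes written as `key "value"` in the ninth tab-separated column; GFF3 files use `key=value` and would need a different pattern.

```shell
# Concrete form of the template above (paths are placeholders):
#   saw checkGTF \
#       --input-gtf=/path/to/input.gtf \
#       --attribute=gene_biotype:protein_coding \
#       --output-gtf=/path/to/protein_coding.gtf

# Hypothetical helper (not part of SAW): count rows whose attribute column
# (column 9) contains the given key "value" pair, GTF style.
count_rows_with() {
    # $1 = file, $2 = attribute key, $3 = attribute value
    awk -F '\t' -v pat="$2 \"$3\"" '
        /^#/ { next }                   # skip comment lines
        index($9, pat) > 0 { n++ }      # literal match on the pair
        END { print n + 0 }' "$1"
}

# Usage: compare the counts in the input and output files.
# count_rows_with /path/to/input.gtf gene_biotype protein_coding
# count_rows_with /path/to/protein_coding.gtf gene_biotype protein_coding
```

How the two counts should relate depends on how attributes are laid out in the file and on how SAW selects rows, so treat the comparison as a rough check rather than a validation.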