pyfaidx, 高效的Pythonic 随机访问fasta子序列

分享于 

25分钟阅读

GitHub

  繁體 雙語
"samtools faidx" compatible FASTA indexing in pure python
  • 源代码名称:pyfaidx
  • 源代码网址:http://www.github.com/mdshw5/pyfaidx
  • pyfaidx源代码文档
  • pyfaidx源代码下载
  • Git URL:
    git://www.github.com/mdshw5/pyfaidx.git
    Git Clone代码到本地:
    git clone http://www.github.com/mdshw5/pyfaidx
    Subversion代码到本地:
    $ svn co --depth empty http://www.github.com/mdshw5/pyfaidx
    Checked out revision 1.
    $ cd repo
    $ svn up trunk
    

    TravisPyPICode HealthCoverageDepsyAppveyor

    描述

    Samtools为indexed提供了一个函数"faidx"( FASTA索引),它创建一个小平面索引文件,允许快速随机访问索引FASTA文件,同时加载文件中的最小文件数量。 这个 python MODULE 实现了用于索引。检索和放置FASTA文件的纯 python 类,使用samtools兼容的索引进行。 pyfaidx MODULE 是与 pygr seqdb MODULE 兼容的API。 在 pyfaidx MODULE 旁边安装了命令行脚本" faidx",并在没有任何编程知识的情况下方便了FASTA文件。

    如果你在出版物中使用 pyfaidx,请引用:

    ,,,。 高效的"Pythonic"访问FASTA文件使用 pyfaidx。 PeerJ预发行 3: e1196.2015.

    安装

    这个软件包使用 python 3.2 -3.4,2.7,2.6和pypy在 Linux,macOS 和 Windows 下测试,并且可以从PyPI中获得:

    pip install pyfaidx # add --user if you don't have root

    或者下载一个发布插件,并:

    python setup.py install

    如果使用 pip install --user,请确保将 /home/$(whoami)/.local/bin 添加到 $PATH,如果你想运行 faidx 脚本。

    用法

    >>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta')>>> genes
    Fasta("tests/data/genes.fasta") # set strict_bounds=True for bounds checking

    像字典一样。

    >>> genes.keys() ('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')>>> genes['NM_001282543.1'][200:230]>NM_001282543.1:201-230CTCGTTCCGCGCCCGCCATGGAACCGGATG>>> genes['NM_001282543.1'][200:230].seq'CTCGTTCCGCGCCCGCCATGGAACCGGATG'>>> genes['NM_001282543.1'][200:230].name'NM_001282543.1'# Start attributes are 1-based>>> genes['NM_001282543.1'][200:230].start201# End attributes are 0-based>>> genes['NM_001282543.1'][200:230].end230>>> genes['NM_001282543.1'][200:230].fancy_name'NM_001282543.1:201-230'>>>len(genes['NM_001282543.1'])5466

    注意序列对象的开始坐标和结束坐标是 [1, 0 ]。 通过将 one_based_attributes=False 传递给 Fasta 或者 faidx,可以将它的更改为 [0, 0 ]。 这里参数只影响 Sequence. start/.end 属性,并且不会对切片坐标产生影响。

    索引如列表:

    >>> genes[0][:50]>AB821309.1:1-50ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC

    切片就像一个字符串:

    >>> genes['NM_001282543.1'][200:230][:10]>NM_001282543.1:201-210CTCGTTCCGC>>> genes['NM_001282543.1'][200:230][::-1]>NM_001282543.1:230-201GTAGGCCAAGGTACCGCCCGCGCCTTGCTC>>> genes['NM_001282543.1'][200:230][::3]>NM_001282543.1:201-230CGCCCCTACA>>> genes['NM_001282543.1'][:]>NM_001282543.1:1-5466CCCCGCCCCT........
    • 切片起始和结束坐标是 0 -based,就像 python 序列一样。

    补充和反补语就像DNA一样

    >>> genes['NM_001282543.1'][200:230].complement>NM_001282543.1 (complement):201-230GAGCAAGGCGCGGGCGGTACCTTGGCCTAC>>> genes['NM_001282543.1'][200:230].reverse>NM_001282543.1:230-201GTAGGCCAAGGTACCGCCCGCGCCTTGCTC>>>-genes['NM_001282543.1'][200:230]>NM_001282543.1 (complement):230-201CATCCGGTTCCATGGCGGGCGCGGAACGAG

    还可以使用方法调用访问 Fasta 对象:

    >>> genes.get_seq('NM_001282543.1', 201, 210)>NM_001282543.1:201-210CTCGTTCCGC>>> genes.get_seq('NM_001282543.1', 201, 210, rc=True)>NM_001282543.1 (complement):210-201GCGGAACGAG

    可以从 [start, end] 坐标列表中检索拼接序列: 待办事项更新这里部分

    # new in v0.5.1segments = [[1, 10], [50, 70]]>>> genes.get_spliced_seq('NM_001282543.1', segments)>gi|543583786|ref|NM_001282543.1|:1-70CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC

    自定义密钥功能提供更简洁的访问:

    >>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta', key_function=lambdax: x.split('.')[0])>>> genes.keys()
    dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])>>> genes['NR_104212'][:10]>NR_104212:1-10CCCCGCCCCT

    你可以指定一个字符来分隔名称,这将生成其他条目:

    >>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action="first") # default duplicate_action="stop">>> genes.keys()
    dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])

    如果key_function或者split_char生成重复项,则可以选择要采取的操作:

    # new in v0.4.9>>> genes = Fasta('tests/data/genes.fasta', split_char="|", duplicate_action="longest")>>> genes.keys()
    dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237','XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])

    过滤器函数( 返回 true ) 限制索引:

    # new in v0.3.8>>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta', filt_function=lambdax: x[0] =='N')>>> genes.keys()
    dict_keys(['NR_104212', 'NM_001282543', 'NR_104216', 'NR_104215', 'NM_001282549', 'NM_000465', 'NM_001282545', 'NM_001282548'])>>> genes['XM_005249644']KeyError: XM_005249644notin tests/data/genes.fasta.

    或者只获取一个 python 字符串:

    >>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta', as_raw=True)>>> genes
    Fasta("tests/data/genes.fasta", as_raw=True)>>> genes['NM_001282543.1'][200:230]CTCGTTCCGCGCCCGCCATGGAACCGGATG

    你可以确保始终收到大写序列,即使fasta文件的大小写较低

    >>>from pyfaidx import Fasta>>> reference = Fasta('tests/data/genes.fasta.lower', sequence_always_upper=True)>>> reference['gi|557361099|gb|KF435150.1|'][1:70]>gi|557361099|gb|KF435150.1|:2-70TGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGACAGTGCTTGCAGGATCTCTCCTGGACAAA

    你也可以执行基于线的迭代,接收序列线,因为它们出现在 FASTA file: 中

    >>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta')>>>for line in genes['NM_001282543.1']:...print(line)CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGCAGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGACGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC...

    序列名称在任何空格上都被截断。 这是索引策略的一个限制。 但是,可以恢复全名:

    # new in v0.3.7>>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta')>>>for record in genes:...print(record.name)...print(record.long_name)...gi|563317589|dbj|AB821309.1|gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA forFGFR2-AHCYL1 fusion kinase protein, complete cds
    gi|557361099|gb|KF435150.1|gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
    gi|557361097|gb|KF435149.1|gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds...# new in v0.4.9>>>from pyfaidx import Fasta>>> genes = Fasta('tests/data/genes.fasta', read_long_names=True)>>>for record in genes:...print(record.name)...gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA forFGFR2-AHCYL1 fusion kinase protein, complete cds
    gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
    gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds

    可以在内存中使用预先读取缓冲区缓冲序列,以便快速进行顺序访问:

    >>>from timeit import timeit>>> fetch ="genes['NM_001282543.1'][200:230]">>> read_ahead ="import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)">>> no_read_ahead ="import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')">>> string_slicing ="genes = {}; genes['NM_001282543.1'] = 'N'*10000">>> timeit(fetch, no_read_ahead, number=10000)0.2204863309962093>>> timeit(fetch, read_ahead, number=10000)0.1121859749982832>>> timeit(fetch, string_slicing, number=10000)0.0033553699977346696

    向前读取缓冲可以减少 1/2,以便对缓冲区域进行顺序访问。

    如果要修改FASTA文件的内容,可以使用可变参数。 可以用等效长度字符串替换FastaRecord的任何部分。 警告:这将立即更改你的文件的内容:

    >>> genes = Fasta('tests/data/genes.fasta', mutable=True)>>>type(genes['NM_001282543.1'])<class'pyfaidx.MutableFastaRecord'>>>> genes['NM_001282543.1'][:10]>NM_001282543.1:1-10CCCCGCCCCT>>> genes['NM_001282543.1'][:10] ='NNNNNNNNNN'>>> genes['NM_001282543.1'][:15]>NM_001282543.1:1-15NNNNNNNNNNCTGGC

    FastaVariant类提供了一种集成单个核苷酸变体调用的方法来生成一致序列。

    # new in v0.4.0>>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', het=True, hom=True)RuntimeWarning: Using sample NA06984 genotypes.>>> consensus['22'].variant_sites
    (16042793, 21833121, 29153196, 29187373, 29187448, 29194610, 29821295, 29821332, 29993842, 32330460, 32352284)>>> consensus['22'][16042790:16042800]>22:16042791-16042800TCGTAGGACA>>> Fasta('tests/data/chr22.fasta')['22'][16042790:16042800]>22:16042791-16042800TCATAGGACA>>> consensus = FastaVariant('tests/data/chr22.fasta', 'tests/data/chr22.vcf.gz', sample='NA06984', het=True, hom=True, call_filter='GT =="0/1"')>>> consensus['22'].variant_sites
    (16042793, 29187373, 29187448, 29194610, 29821332)

    它还提供了命令行脚本:

    脚本:faidx

    Fetch sequences from FASTA. If no regions are specified, all entries in the
    input file are returned. Input FASTA file must be consistently line-wrapped,
    and line wrapping of output is based on input line lengths.
    positional arguments:
     fasta FASTA file
     regions space separated regions of sequence to fetch e.g.
     chr1:1-1000
    optional arguments:
     -h, --help show this help message and exit -b BED, --bed BED bed file of regions
     -o OUT, --out OUT output file name (default: stdout)
     -i {bed,chromsizes,nucleotide,transposed}, --transform {bed,chromsizes,nucleotide,transposed} transform the requested regions into another format. default: None
     -c, --complement complement the sequence. default: False
     -r, --reverse reverse the sequence. default: False
     -a SIZE_RANGE, --size-range SIZE_RANGE
     selected sequences are in the size range [low, high]. example: 1,1000 default: None
     -n, --no-names omit sequence names from output. default: False
     -f, --full-names output full names including description. default: False
     -x, --split-files write each region to a separate file (names are derived from regions)
     -l, --lazy fill in --default-seq for missing ranges. default: False
     -s DEFAULT_SEQ, --default-seq DEFAULT_SEQ
     default base for missing positions and masking. default: N
     -d DELIMITER, --delimiter DELIMITER
     delimiter for splitting names to multiple values (duplicate names will be discarded). default: None
     -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
     python functionto modify header lines e.g: "lambda x: x.split("|")[0]". default: lambda x: x.split()[0]
     -u {stop,first,last,longest,shortest}, --duplicates-action {stop,first,last,longest,shortest}
     entry to take when duplicate sequence names are encountered. default: stop
     -g REGEX, --regex REGEX
     selected sequences are those matching regular expression. default:. * -v, --invert-match selected sequences are those not matching 'regions' argument. default: False
     -m, --mask-with-default-seq
     mask the FASTA file using --default-seq default: False
     -M, --mask-by-case mask the FASTA file by changing to lowercase. default: False
     -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
     python functionto modify header lines e.g: "lambda x: x.split("|")[0]". default: None
     --no-rebuild do not rebuild the. fai index even if it is out of date. default: False
     --version print pyfaidx version number

    例如:

    $ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320>NM_001282543.1:201-210
    CTCGTTCCGC>NM_001282543.1:300-320
    GTAATTGTGTAAGTGACTGCA
    $ faidx --full-names tests/data/genes.fasta NM_001282543.1:201-210>NM_001282543.1| Homo sapiens BRCA1 associated RING domain 1 (BARD1), transcript variant 2, mRNA
    CTCGTTCCGC
    $ faidx --no-names tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
    CTCGTTCCGC
    GTAATTGTGTAAGTGACTGCA
    $ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210>NM_001282543.1:201-210 (complement)
    GAGCAAGGCG
    $ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210>NM_001282543.1:210-201
    CGCCTTGCTC
    $ faidx --reverse --complement tests/data/genes.fasta NM_001282543.1:201-210>NM_001282543.1:210-201 (complement)
    GCGGAACGAG
    $ faidx tests/data/genes.fasta NM_001282543.1>NM_001282543.1:1-5466
    CCCCGCCCCT........
    ..................
    ..................
    ..................
    $ faidx --regex "^NM_00128254[35]" genes.fasta>NM_001282543.1
    ..................
    ..................
    ..................>NM_001282545.1
    ..................
    ..................
    ..................
    $ faidx --lazy tests/data/genes.fasta NM_001282543.1:5460-5480>NM_001282543.1:5460-5480
    AAAAAAANNNNNNNNNNNNNN
    $ faidx --lazy --default-seq='Q' tests/data/genes.fasta NM_001282543.1:5460-5480>NM_001282543.1:5460-5480
    AAAAAAAQQQQQQQQQQQQQQ
    $ faidx tests/data/genes.fasta --bed regions.bed
    ...
    $ faidx --transform chromsizes tests/data/genes.fasta
    AB821309.1 3510
    KF435150.1 481
    KF435149.1 642
    NR_104216.1 4573
    NR_104215.1 5317
    NR_104212.1 5374
    ...
    $ faidx --transform bed tests/data/genes.fasta
    AB821309.1 1 3510
    KF435150.1 1 481
    KF435149.1 1 642
    NR_104216.1 1 4573
    NR_104215.1 1 5317
    NR_104212.1 1 5374
    ...
    $ faidx --transform nucleotide tests/data/genes.fasta
    name start end A T C G N
    AB821309.1 1 3510 955 774 837 944 0
    KF435150.1 1 481 149 120 103 109 0
    KF435149.1 1 642 201 163 129 149 0
    NR_104216.1 1 4573 1294 1552 828 899 0
    NR_104215.1 1 5317 1567 1738 968 1044 0
    NR_104212.1 1 5374 1581 1756 977 1060 0
    ...
    faidx --transform transposed tests/data/genes.fasta
    AB821309.1 1 3510 ATGGTCAGCTGGGGTCGTTTCATC...
    KF435150.1 1 481 ATGACATCATTTTCCACCTCTGCT...
    KF435149.1 1 642 ATGACATCATTTTCCACCTCTGCT...
    NR_104216.1 1 4573 CCCCGCCCCTCTGGCGGCCCGCCG...
    NR_104215.1 1 5317 CCCCGCCCCTCTGGCGGCCCGCCG...
    NR_104212.1 1 5374 CCCCGCCCCTCTGGCGGCCCGCCG...
    ...
    $ faidx --split-files tests/data/genes.fasta
    $ ls
    AB821309.1.fasta NM_001282549.1.fasta XM_005249645.1.fasta
    KF435149.1.fasta NR_104212.1.fasta XM_005265507.1.fasta
    KF435150.1.fasta NR_104215.1.fasta XM_005265508.1.fasta
    NM_000465.3.fasta NR_104216.1.fasta XR_241079.1.fasta
    NM_001282543.1.fasta XM_005249642.1.fasta XR_241080.1.fasta
    NM_001282545.1.fasta XM_005249643.1.fasta XR_241081.1.fasta
    NM_001282548.1.fasta XM_005249644.1.fasta
    $ faidx --delimiter='_' tests/data/genes.fasta 000465.3>000465.3
    CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
    AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
    .......
    $ faidx --size-range 5500,6000 -i chromsizes tests/data/genes.fasta
    NM_000465.3 5523
    $ faidx -m --bed regions.bed tests/data/genes.fasta### Modifies tests/data/genes.fasta by masking regions using --default-seq character ###$ faidx -M --bed regions.bed tests/data/genes.fasta### Modifies tests/data/genes.fasta by masking regions using lowercase characters ###$ faidx -e "lambda x: x.split('.')[0]" tests/data/genes.fasta -i bed
    AB821309 1 3510
    KF435150 1 481
    KF435149 1 642
    NR_104216 1 4573
    NR_104215 1 5317
    .......

    samtools faidx 相似的语法

    还可以使用较低级别的Faidx类:

    >>>from pyfaidx import Faidx>>> fa = Faidx('genes.fa') # can return str with as_raw=True>>> fa.index
    OrderedDict([('AB821309.1', IndexRecord(rlen=3510, offset=12, lenc=70, lenb=71)), ('KF435150.1', IndexRecord(rlen=481, offset=3585, lenc=70, lenb=71)),... ])>>> fa.index['AB821309.1'].rlen3510fa.fetch('AB821309.1', 1, 10) # these are 1-based genomic coordinates>AB821309.1:1-10ATGGTCAGCT
    • 如果FASTA文件未索引,当 faidx 初始化时 build_index 方法将自动运行,并且索引将用 write_fai() 写入" filename.fa.fai"。 其中" filename.fa"是原始FASTA文件。
    • 起始坐标和结束坐标为 1 -based。

    对压缩FASTA的支持

    pyfaidx 可以为使用 bgzip命令行工具的FASTA文件创建和读取 .fai 索引,这些文件是从samtoolsbgzipBGZF 格式写入压缩数据。 BGZFgzip 兼容的,由多个级联 gzip 块组成,每个块都有一个附加 gzip 头,可以为快速随机访问建立索引。 换句话说,用 bgzip 压缩的文件是有效的gzip,因此可以通过 gunzip 读取。 有关 bgzip的更多详细信息,请参见以下描述。

    变更日志

    请查看发行版,获得版本变更的全面列表。

    已知问题

    我试图尽可能多地修复 Bug,但是大多数的工作都是由一个开发者支持的。 请检查与你的工作相关的问题的已知问题( )。 欢迎请求请求。

    创建一个新的请求请求。 如果你添加了新特性,请创建相关的测试。

    要在你的计算机上运行测试:
    • 创建新的virtualenv 并安装 dev-requirements.txt。

    • 下载运行的测试数据:

      python tests/data/download_gene_fasta.py

    • 运行以下测试

      nosetests --with-coverage --cover-package=pyfaidx

    确认

    项目由作者 Matthew 自由授权,并在Drs的指导和财务支持下完成。 ,Oncology,Oncology,Oncology,Oncology,Oncology,Oncology,Oncology,Oncology,Oncology,Oncology,Oncology,Oncology。


    acc  fast  rand  随机  pythonic  FASTA  
    相关文章