Multiple Buffering for Parallel Approximate Sequence Matching using Disk-based Suffix Tree on Multi-core CPU

Yousuke Watanuki ., Keiichi Tamura ., Hajime Kitakami ., Yoshifumi Takahashi .

Abstract


Suffix trees, which are trie structures that present
the suffixes of sequences (e.g., strings), are widely used for sequence
search in different application domains such as, text data
mining, bioinformatics and computational biology. In particular,
suffix trees are useful in bioinformatics applications, because they
can search similar sub-sequences and extract frequent sequence
patterns efficiently. In recent years, efficient construction of a
suffix tree that allows faster sequence searches has become
one of the most important challenges, because the number
and size of the data that are stored in sequence databases
have been increasing exponentially. This paper proposes a novel
parallelization model for approximate sequence matching that
uses disk-based suffix trees, which are built on hard disks not on
memory, on a multi-core CPU. In the proposed parallelization
model, we divide an entire sequence database into two or more
sub-databases called partitions. For each partition, we build
a disk-based suffix tree and define a task as an approximate
sequence matching on one disk-based suffix tree. Moreover,
the proposed parallelization model involves a multiple buffering
management system to avoid conflicts among CPU-cores. We
evaluated the proposed parallelization model using an actual
amino acid sequence database on a PC. The experimental results
show a substantial improvement in computation performance.


Full Text:

PDF

Refbacks

  • There are currently no refbacks.