Using sequence compression to speedup probabilistic profile matching

Freschi, Valerio; Bogliolo, Alessandro

doi:10.1093/bioinformatics/bti323

Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) to match a sequence of N characters against a profile of length P ≪ N. In this work, we exploit string compression techniques to speedup brute-force profile matching. We present two algorithms, based on run-length encoding (RLE) and LZ78, that reduce computational complexity by the compression factor of the encoding. While RLE provides little improvement with respect to brute-force algorithm (the number of runs is close to the number of haracters of the original sequences), LZ78 provides a sizeable advantage. The advantage is greater for DNA sequences, since the lower the number of symbols in the alphabet the higher the efficiency of LZ78. The paper shows that the proposed approach reduces the asymptotic complexity of string comparison without adding to much to the complexity of the inner loop. Moreover, the proposed techniques can be combined with existing branc and bound approaches to obtain further speed up. Experimental results refer to representative sequences taken from human genome. Results: In this work, we exploit string compression techniques to speedup brute-force profile matching. We present two algorithms, based on run-length and LZ78 encodings, that reduce computational complexity by the compression factor of the encoding.

Using sequence compression to speedup probabilistic profile matching

FRESCHI, VALERIO;BOGLIOLO, ALESSANDRO

2005

Abstract

Scheda breve

Scheda completa

Scheda completa (DC)

Citazioni

social impact

Using sequence compression to speedup probabilistic profile matching

FRESCHI, VALERIO;BOGLIOLO, ALESSANDRO

2005

Abstract

Scheda breve Scheda completa Scheda completa (DC)

Informazioni

Citazioni

social impact

Conferma cancellazione

Scheda breve

Scheda completa

Scheda completa (DC)