Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) to match a sequence of N characters against a profile of length P ≪ N. In this work, we exploit string compression techniques to speedup brute-force profile matching. We present two algorithms, based on run-length encoding (RLE) and LZ78, that reduce computational complexity by the compression factor of the encoding. While RLE provides little improvement with respect to brute-force algorithm (the number of runs is close to the number of haracters of the original sequences), LZ78 provides a sizeable advantage. The advantage is greater for DNA sequences, since the lower the number of symbols in the alphabet the higher the efficiency of LZ78. The paper shows that the proposed approach reduces the asymptotic complexity of string comparison without adding to much to the complexity of the inner loop. Moreover, the proposed techniques can be combined with existing branc and bound approaches to obtain further speed up. Experimental results refer to representative sequences taken from human genome. Results: In this work, we exploit string compression techniques to speedup brute-force profile matching. We present two algorithms, based on run-length and LZ78 encodings, that reduce computational complexity by the compression factor of the encoding.
Using sequence compression to speedup probabilistic profile matching
FRESCHI, VALERIO;BOGLIOLO, ALESSANDRO
2005
Abstract
Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) to match a sequence of N characters against a profile of length P ≪ N. In this work, we exploit string compression techniques to speedup brute-force profile matching. We present two algorithms, based on run-length encoding (RLE) and LZ78, that reduce computational complexity by the compression factor of the encoding. While RLE provides little improvement with respect to brute-force algorithm (the number of runs is close to the number of haracters of the original sequences), LZ78 provides a sizeable advantage. The advantage is greater for DNA sequences, since the lower the number of symbols in the alphabet the higher the efficiency of LZ78. The paper shows that the proposed approach reduces the asymptotic complexity of string comparison without adding to much to the complexity of the inner loop. Moreover, the proposed techniques can be combined with existing branc and bound approaches to obtain further speed up. Experimental results refer to representative sequences taken from human genome. Results: In this work, we exploit string compression techniques to speedup brute-force profile matching. We present two algorithms, based on run-length and LZ78 encodings, that reduce computational complexity by the compression factor of the encoding.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.