What we actually do

The main areas of our research are: DNA sequencing and assembly (including the design of algorithms for NGS sequencers); protein structure analysis; RNA structure analysis and prediction (including an automatic tertiary structure prediction tool); nanotechnology and DNA computing.

Standard and isothermic spectra
Spectra derived from real DNA sequences for standard and isothermic SBH

The instances in this section were used in computational tests of algorithms solving the standard and isothermic DNA sequencing by hybridization problem with both negative and positive errors. The following papers contain descriptions of the algorithms and the tests.

* J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, A. Swiercz, "Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries", Computational Biology and Chemistry 28, 2004, pp. 11-19.
* J. Blazewicz, C. Oguz, A. Swiercz, J. Weglarz, "DNA sequencing by hybridization via genetic search", Operations Research 54, 2006, pp. 1185-1192.
* J. Blazewicz, F. Glover, M. Kasprzak, W.T. Markiewicz, C. Oguz, D. Rebholz-Schuhmann, A. Swiercz, "Dealing with repetitions in sequencing by hybridization", Computational Biology and Chemistry 30, 2006, pp. 313-320.

Tables 1 and 2 contain instances derived from real DNA sequences coding human proteins, with 5%, 10% and 20% of random negative errors and 5%, 10% and 20% of random positive errors. Table 1 presents spectra for standard DNA sequencing, and Table 2 presents spectra for isothermic DNA sequencing. Instances from Table 1 were used for tests in the last two papers. The data from Table 2 were used in all the papers listed above. Table 3 contains instances for standard and isothermic sequencing with negative errors coming from repetitions of oligonucleotides in real DNA sequences. Tests on the data from Table 3 are presented in the last paper.


Spectra derived from real DNA sequences without negative errors coming from repetitions of oligonucleotides

The 40 sequences on which these instances are based were taken from GenBank. They are prefixes of several genes coding human proteins. Their accession numbers in the GenBank database are as follows:

NM_016055, NM_016080, NM_152373, NM_002938, BC040844, NM_152763, NG_002692, BC044213, NG_002660, NG_002481, BC007770, BC015575, BC004538, BC063108, NG_002361, NG_001569, BC056270, NM_153834, HSA519841, BC053904, BC062471, BC062325, BC057825, NG_001151, NG_001292, NM_005337, NM_032293, NM_032292, NM_198197, NM_032423, NM_021807, NM_015435, NM_024622, NM_030633, NM_172366, NM_177959, NM_005337, NM_003318, AF435957, AF497481.

The sequences have lengths of 200, 400, 500 and 600 nucleotides. The length of oligonucleotides for standard libraries was set to 10. To make the results of standard and isothermic sequencing comparable, the temperatures of oligonucleotides for isothermic libraries were set to 26 and 28 degrees (the sum of the cardinalities of these two libraries is similar to the cardinality of the standard library of oligonucleotides of length 10). First, error-free spectra were generated from these sequences. Then, randomly generated errors (5%, 10% and 20%) were introduced into these spectra.
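The generation procedure described above can be sketched roughly as follows. This is only an illustration, not the code used to produce the files. It assumes the common rule that A and T each contribute 2 degrees and C and G each contribute 4 degrees to the temperature of an oligonucleotide (the rule is not stated on this page), and it assumes that an error rate of E% means E% of the cardinality of the error-free spectrum. The function names are hypothetical, and the error step is shown for standard spectra only.

```python
import random

# Assumed temperature rule (not stated on this page): A/T add 2 degrees, C/G add 4.
DEGREES = {"A": 2, "T": 2, "C": 4, "G": 4}


def standard_spectrum(seq, k=10):
    """Return the set of all k-mers occurring in seq (error-free standard spectrum)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def isothermic_spectrum(seq, temps=(26, 28)):
    """Return all substrings of seq whose assumed temperature is one of temps."""
    spectrum = set()
    max_t = max(temps)
    for start in range(len(seq)):
        t = 0
        for end in range(start, len(seq)):
            t += DEGREES[seq[end]]
            if t > max_t:
                break  # extending further can only raise the temperature
            if t in temps:
                spectrum.add(seq[start:end + 1])
    return spectrum


def add_random_errors(spectrum, rate, k=10, rng=None):
    """Remove rate*|spectrum| oligonucleotides (negative errors) and add the
    same number of random k-mers absent from the spectrum (positive errors)."""
    rng = rng or random.Random()
    n_err = round(rate * len(spectrum))
    # Negative errors: keep a random subset of the original oligonucleotides.
    noisy = set(rng.sample(sorted(spectrum), len(spectrum) - n_err))
    # Positive errors: add random k-mers that were not in the original spectrum.
    while len(noisy) < len(spectrum):
        candidate = "".join(rng.choice("ACGT") for _ in range(k))
        if candidate not in spectrum:
            noisy.add(candidate)
    return noisy
```

For example, `add_random_errors(standard_spectrum(seq), 0.05)` would give a spectrum of the same cardinality as the error-free one, with 5% of its elements replaced by random 10-mers.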

Table 1 consists of spectra for standard (isometric) sequencing and their original sequences. The original sequences, from which the spectra were created, are prefixes (of length 200, 400, 500 and 600 nucleotides) of the sequences taken from GenBank under the accession numbers listed above. They are saved in files "N_A.seq" (e.g. 200_01), where N is the length of the original sequence and A is the number of the sequence, and they are packed in the file "sequence.zip". The files "standE.zip" (e.g. stand5.zip) contain spectra with E% of negative and E% of positive errors. Spectra are saved in files "N_A" (e.g. 200_01), where N is the length of the original sequence and A is the number of the sequence. In each file "N_A", the first two lines give the length of the original sequence (N) and the cardinality of the spectrum. The sorted oligonucleotides follow on the subsequent lines.
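A spectrum file in the layout described above could be read with a short parser like the following. This is a sketch under the assumption that each oligonucleotide sits on its own line and that blank lines carry no meaning; the function names `parse_spectrum` and `read_spectrum` are hypothetical.

```python
from pathlib import Path


def parse_spectrum(text):
    """Parse the contents of an "N_A" file.

    Returns (sequence_length, oligonucleotides). Assumes one oligonucleotide
    per line after the two header lines.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    sequence_length = int(lines[0])   # first line: length N of the original sequence
    cardinality = int(lines[1])       # second line: number of oligonucleotides
    oligos = lines[2:]
    if len(oligos) != cardinality:
        raise ValueError(
            f"expected {cardinality} oligonucleotides, found {len(oligos)}"
        )
    return sequence_length, oligos


def read_spectrum(path):
    """Read and parse a spectrum file from disk."""
    return parse_spectrum(Path(path).read_text())
```

Checking the declared cardinality against the number of lines actually read is a cheap way to detect a truncated or mis-unpacked file before running any sequencing algorithm on it.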

Table 1. Spectra for standard sequencing derived from real DNA sequences
error rate | spectra | sequences
5% | stand5.zip | sequence.zip
10% | stand10.zip | sequence.zip
20% | stand20.zip | sequence.zip

Table 2 consists of spectra for isothermic sequencing. Their original sequences are the same as for standard sequencing, and they are also saved in the file "sequence.zip". The files "isotE.zip" (e.g. isot10.zip) contain spectra with E% of negative and E% of positive errors. Spectra are saved in files "N_A" (e.g. 500_01), where N is the length of the original sequence and A is the number of the sequence. In each file "N_A", the first two lines give the length of the original sequence (N) and the cardinality of the spectrum (for the same sequence length, the cardinalities of the spectra differ). The sorted oligonucleotides follow on the subsequent lines.

Table 2. Spectra for isothermic sequencing derived from real DNA sequences
error rate | spectra | sequences
5% | isot5.zip | sequence.zip
10% | isot10.zip | sequence.zip
20% |  | sequence.zip

Spectra derived from real DNA sequences, with negative errors coming from repetitions of oligonucleotides

The sequences on which these instances are based were taken from GenBank. They are prefixes of several genes coding human proteins. Their accession numbers in the GenBank database are as follows:

NM_002052, NM_003008, NM_183353, BC008923, BC012982, NG_002363, BC009854, NM_194310, NM_006402, NM_194300, NM_020690, NM_005745, BC000774, NM_182498, BC047640, NM_004902, NM_020713, BC062620, NM_032866, NG_000980, BC026171, NM_000321, BC063041, BC005805, NM_144767, BC001077, NM_032349, BC053852, NM_021934, NM_015698, BC050425, NM_018046, NM_004663, BC007398, NM_016428, NM_177423, BC026078, NM_031205, BC041372, NM_033318.

The length of the sequences is 600 nucleotides. The spectra for standard sequencing were cut out from these sequences with the oligonucleotide length set to 10, which resulted in some negative errors coming from repetitions of 10-mers within the sequences. The instances contain from 1 to 17 such errors, with an average of 4 repetitions over all 40 instances. The spectra for isothermic sequencing were cut out from these sequences with oligonucleotide temperatures set to 26 and 28 degrees, which resulted in some negative errors coming from repetitions of oligonucleotides within the sequences. The instances contain from 4 to 30 such errors, with an average of 16 repeated oligonucleotides over all 40 instances. These averages were set to match the average numbers of repetitions found in the prefixes of length 600 of 1000 randomly chosen human DNA sequences taken from GenBank.
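The number of negative errors caused by repetitions in a standard spectrum can be estimated with a sketch like the one below. It assumes that every occurrence of a 10-mer beyond its first counts as one error, which is one plausible reading of the counts given above; the function name `repetition_errors` is hypothetical.

```python
from collections import Counter


def repetition_errors(seq, k=10):
    """Count k-mer occurrences lost because the spectrum is a set.

    Equals (number of k-mer positions) - (number of distinct k-mers).
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Each k-mer seen c times appears only once in the spectrum: c - 1 are lost.
    return sum(c - 1 for c in counts.values())
```

For a sequence of length 600 and k = 10 there are 591 positions, so a result of 4 would mean the spectrum holds 587 distinct 10-mers.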

There are two sets of spectra. The first set, which may be found in the files "rep_A.zip", contains spectra with errors coming only from repetitions. Spectra are saved in files "600_B", where B is the number of the sequence and A takes one of two values: 'stand' for standard or 'isot' for isothermic sequencing. The second set, in the files "rep_A_5.zip", contains spectra with errors coming from repetitions plus randomly added 5% of negative errors and 5% of positive errors. These spectra are also saved in files "600_B".

Oligonucleotides in the files were sorted in order to remove any information about their order in the original sequence. The original sequences, for comparison with the results, are in the file "rep_seq.zip", which is composed of files "600_B.seq", where B is the number of the sequence. The data are listed in Table 3.

Table 3. Spectra containing errors coming from repetitions derived from real DNA sequences
library of oligonucleotides | spectra with repetitions only | spectra with 5% of positive and 5% of negative errors
standard | rep_stand.zip | rep_stand_5.zip
isothermic | rep_isot.zip | rep_isot_5.zip
sequences | rep_seq.zip | rep_seq.zip