What we actually do
The main areas of our research are: DNA sequencing and assembling (including design of algorithms for the NGS sequencers); protein & RNA structure analysis and modeling (including automatic tertiary structure prediction tools); nanotechnology and DNA computing.
We are working on...
Sequencing by hybridization
Sequencing by hybridization (SBH) is one of the methods of reading a DNA sequence. It consists of two phases: the biochemical and computational ones. In the biochemical phase a DNA chip, which contains a library of several different oligonucleotides, is exposed to the solution containing many copies of target DNA. Hybridization occurs between the DNA sequence and oligonucleotides in their complementary places. In most of the hybridization experiments, the library is a set of oligonucleotides of the same length. After the reaction, one obtains a set of short subfragments of the examined sequence by reading a fluorescent or radioactive image of the DNA chip. This set is called spectrum.The computational phase of DNA sequencing consists in reconstructing the target DNA sequence from the oligonucleotides in the spectrum. It is known that in the case of an ideal hybridization experiment, when no errors in spectrum are involved, the reconstruction of the sequence can be solved in polynomial time.In real hybridization experiments we can get two different kinds of errors:* positive errors - these are additional oligonucleotides that appear in the spectrum, but do not fit to the target DNA,* negative errors - missing oligonucleotides in the spectrum, comparing to the ideal spectrum; a special case are negative errors coming from repetitions, that are repeated subsequences in the original sequence but present only in one copy in the spectrum.When the above errors occur in the spectrum, the reconstruction of the DNA sequence becomes strongly NP-hard.Tabu search algorithm presented in the first article can be downloaded here.In section Standard and isothermic spectra you may find instances used in computational tests of algorithms solving the standard and isothermic DNA sequencing by hybridization problem with both negative and positive errors. The following papers contain descriptions of the algorithms and the tests:* J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, A. Swiercz, "Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries", Computational Biology and Chemistry 28, 2004, pp. 11-19.* J. Blazewicz, C. Oguz, A. Swiercz, J. Weglarz, "DNA sequencing by hybridization via genetic search", Operations Research 54, 2006, pp. 1185-1192.* J. Blazewicz, F. Glover, M. Kasprzak, W.T. Markiewicz, C. Oguz, D. Rebholz-Schuhmann, A. Swiercz, "Dealing with repetitions in sequencing by hybridization", Computational Biology and Chemistry 30, 2006, pp 313-320.In section Standard spectra you may find the instances which were used in computational tests of algorithms solving the standard DNA sequencing by hybridization problem with both negative and positive errors. The following papers contain descriptions of the algorithms and the tests.* J. Blazewicz, M. Kasprzak, W. Kuroczycki, Hybrid genetic algorithm for DNA sequencing with errors, Journal of Heuristics 8 (2002) 495-502.* J. Blazewicz, F. Glover, M. Kasprzak, DNA sequencing - tabu and scatter search combined, INFORMS Journal on Computing 16 (2004) 232-240.* J. Blazewicz, F. Glover, M. Kasprzak, Evolutionary approaches to DNA sequencing with errors, Annals of Operations Research 138 (2005) 67-78.Most of the sequences for experiments were taken from GenBank, National Institute of Health, USA. Sequences are composed of at least 600 nucleotides.From the prefixes of the sequences of length 200, 400, 500, and 600, the ideal spectra were generated (by selecting all of the oligonucleotides being members of the library that are contained in the sequence). Next, errors were randomly added to the spectra. Some of the sequences (in the section: standard spectra) were generated according to the uniform distribution.Spectra contain both negative and positive errors (with error rate ranging from 5% up to 20%). 5% error rate means that from the ideal spectrum 5% of all the oligonucleotides were randomly deleted and the same number of oligonucleotides were added (different from the ones already existing in the spectrum). Spectra were created progressively, which means that 10% of errors contained the same errors as in the spectrum with 5% error rate plus additional 5% of errors of both types. This provides that the case with spectra containing less errors (5%) is not more difficult than the case with spectra containing more errors (10%). Spectra with 20% of errors were created accordingly. At the end, oligonucleotides in the spectra were sorted alphabetically due to lose the information about order of the oligonucleotides within the target.In specific subarticle you can read definitions of used libraries.
SR-ASM algorithm
SR-ASM (Short Reads ASseMbly) algorithm is designed for DNA assembly of the short sequences coming from 454 sequencers. The 454 sequencer protocol is based on measuring light intensities generated during the sequencing process. The sequences are given as nucleotide chains with numbers of their consecutive appearances. Additionally to the sequences, one gets rates of confidence for every nucleotide.The constructed SR-ASM algorithm is a heuristic based on a graph model, the graph being built on the set of input sequences. In the algorithm three parts can be distinguished. The first part computes feasible overlaps for input sequences. It requires as the parameters, values of the minimum overlap between two sequences and of the error bound, the latter being the percentage of the mismatches allowed in the overlap of two sequences. The comparison is done also for the sequences reverse complementary to the input ones, due to the assumption that the fragments come from both strands of a DNA helix. In the second phase, a graph is constructed with the fragments as the vertices. Two vertices are connected by an arc if there is a feasible overlap between the two fragments. Next, a path is searched for, which passes through one of the vertices from every pair: either through the straightforward fragment or its reverse complementary counterpart. Usually it is not possible to find a single path in the graph and several paths corresponding to contigs are returned as solutions. At the end, in the third part, a consensus sequence (sequences) is determined on the basis of the alignment.The strength of SR-ASM comes from the combination of the following new propositions in the assembling procedure: temporary compression of the input sequences, new method of selecting promising pairs, operations repairing lack of some arcs or excess of some vertices in the overlap graph, and finally, the system of voting sequences in creating the consensus sequence.Download the source code of  SR-ASM algorithm.Usefulness of the algorithm has been proven in tests on raw data generated during sequencing of the whole 1.84 Mbp genome of bacteria Prochlorococcus marinus. The paper with the description of the algorithm and with the computational experiment is:* J. Blazewicz, M. Bryja, M. Figlerowicz, P. Gawron, M. Kasprzak, E. Kirton, D. Platt, J. Przybytek, A. Swiercz, L. Szajkowski, "Whole genome assembly from 454 sequencing output via modified DNA graph concept", Computational Biology and Chemistry 33 (2009) 224-230.
Simplified partial digest problem
In order to recognize linear structure of a long DNA fragment (i.e. its sequence of nucleotides), the DNA molecule is divided into smaller parts, sequenced and then combined together back into the whole. Restriction mapping is one of the approaches used for the purpose of dividing and subsequent joining smaller parts. Very short DNA fragments (patterns) of the molecule uniquely recognized by restriction enzymes provide information necessary to carry out this process. The patterns are called restriction sites.During the process of restriction mapping of DNA, many copies (clones) of a DNA molecule are exposed to restriction enzymes, which cut them at particular patterns (sites). Depending on the time of this reaction (also on the enzyme used and on the length of DNA), the experiment returns clones of the original molecule cut at few or many sites (usually the shorter reaction time, the smaller the number of sites recognized by the enzyme). The cutting process leads to the loss of information about the location of the fragments and restriction sites in the original molecule, their order is next reconstructed on the base of the fragment lengths.There are two classical variants of the experiment: double digestion, which employs two restriction enzymes, and partial digestion, which uses one enzyme but in a series of reactions differing by time. At the output of the error-free partial digestion one gets a multiset of all interpoint distances, where the points are the restriction sites and the ends of the molecule. The aim of the Partial Digest Problem (PDP) is to reconstruct the restriction map basing on the multiset of fragment lengths. This variant was not widely employed in biological laboratories because of the difficulty in obtaining fragments between every pair of sites. From the point of view of computational complexity theory, the status of error-free PDP (being a known problem also in computational geometry) is still an open question.The simplified partial digestion was proposed as a new variant of the restriction mapping in [1]. It overcomes the drawback of the partial digestion outlined above. In this simplified method, one enzyme is used on two sets of clones of the DNA molecule. The corresponding experiment consists only of two parts. In the first part, the time of the chemical reaction is very short and chosen so that target cloned molecules from the first set are cut at (statistically) one restriction site at most. In the second part, the reaction time span is long enough to cut the cloned molecules from the second set at all restriction sites. This approach reduces the number of reactions and presents an easier adjustment of the reaction times. The aim of the Simplified Partial Digest Problem (SPDP) is the reconstruction of the map of restriction sites from two multisets resulting from the two reactions. The computational complexity of SPDP was solved in [2], the problem was proved to be strongly NP-hard. On the other hand, further studies have shown computational usefulness of this approach, confirmed by tests of a series of new algorithms for the problem [3-6].[1] J. Blazewicz, P. Formanowicz, M. Kasprzak, M. Jaroszewski, W.T. Markiewicz, "Construction of DNA restriction maps based on a simplified experiment", Bioinformatics 17 (2001) 398–404.[2] J. Blazewicz, M. Kasprzak, "Combinatorial optimization in DNA mapping – a computational thread of the Simplified Partial Digest Problem", RAIRO – Operations Research 39 (2005) 227–241.[3] J. Blazewicz, M. Jaroszewski, "New algorithm for the Simplified Partial Digest Problem", Lecture Notes in Bioinformatics 2812 (2003) 95–110.[4] J. Blazewicz, E.K. Burke, M. Kasprzak, A. Kovalev, M.Y. Kovalyov, "The Simplified Partial Digest Problem: enumerative and dynamic programming algorithms", IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (2007) 668–680.[5] J. Blazewicz, E.K. Burke, M. Kasprzak, A. Kovalev, M.Y. Kovalyov, "On the approximability of the Simplified Partial Digest Problem", Discrete Applied Mathematics 157 (2009) 3586-3592.[6] J. Blazewicz, E.K. Burke, M. Kasprzak, A. Kovalev, M.Y. Kovalyov, "The Simplified Partial Digest Problem: approximation and a graph-theoretic model", European Journal of Operational Research 208 (2011) 142-152.
RNA structural databases
A number of bioinformatic tools have been proposed to explore two major structural databases (PDB, NDB) in order to analyze various aspects of RNA tertiary structures. One of these tools is RNA FRABASE, a web-accessible database with an engine for automatic search of 3D fragments within PDB-derived RNA structures.RNA FRABASE 2.0 stores a wide selection of structural data about PDB-deposited RNA structures:- core data: RNA sequences, secondary structures, the PDB header data (identification codes, experimental method used, resolution, PDB deposition date and number of models), the NDB identification codes);- a collection of atom coordinates of unmodified and modified nucleotide residues occurring in RNA structures;- a description of torsion angles and sugar pucker parameters, including values for the pseudotorsion angles;- base-base and inter base pair parameters;- base pair location, i.e. information specifying whether one or both bases of the pair belong to the resulting fragment;- a complete classification of canonical and non-canonical base pair types, in both Westhof's and Saenger's notation and information about multiplets;- information about original residue and chain numbering;- graphical views of RNA secondary structures.The database is served by a search engine designed to search for user-specified single- or multi-stranded RNA fragments. Search algorithms operate on the database of the RNA sequences and the library of RNA secondary structures, coded in the dot-bracket format. Due to an efficiency of the engine, RNA FRABASE is the only web-accessible tool that enables fast automatic exploration of such a substantial selection of structural parameters available for user-defined 3D RNA fragments. The tool is widely used by the researchers all over the world in the study of RNA features.RNA FRABASE has been developed in our lab, in cooperation with a group of biochemists from the Laboratory of Structural Chemistry of Nucleic Acids, IChB PAS. The first version of the tool has been released in 2007, version 2.0 appeared in 2010. More information can be found in the following papers:M. Popenda, M. Blazewicz, M. Szachniuk, R.W. Adamiak.RNA FRABASE version 1.0: an engine with a database to search for the three-dimensional fragments within RNA structures. Nucleic Acids Research 36, 2008, D386-D391 (published online on October 5, 2007, doi:10.1093/nar/gkm786).M. Popenda, M. Szachniuk, M. Blazewicz, S. Wasik, E.K. Burke, J. Blazewicz, R.W. Adamiak.RNA FRABASE 2.0: an advanced web-accessible database with the capacity to search the three-dimensional fragments within RNA structures. BMC Bioinformatics, 2010, 11:231 (published online on May 6, 2010, doi:10.1186/1471-2105-11-231).