Resequencing data of 20 Arabidopsis ecotypes.
(Diploma Thesis, University of Tübingen 2005)

This diploma thesis describes work on a chip resequencing project covering 20 ecotypes (accessions from natural populations) of the plant model species Arabidopsis thaliana. Chip resequencing primarily aims at identifying single nucleotide polymorphisms (SNPs), the most abundant class of naturally occurring sequence variation. For resequencing, DNA microarrays are employed that carry a genome-wide tiling of 25-mer probes. These probes are designed to be complementary to an a priori known reference genome sequence. Each interrogated site is represented by four probes that differ only in the middle nucleotide, one for each of the four possible bases, so that at a SNP position a nucleotide substitution in the interrogated genome will generally produce the strongest hybridization signal for the corresponding non-reference probe. The huge data set resulting from the resequencing of 20 genomes of ~125 Mb each has been stored in a MySQL database, and a viewer has been implemented in Java that displays resequencing data graphically, retrieving it directly from the database.
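The probe-quartet principle can be illustrated with a minimal Python sketch. It is not the base-calling method used in the project; the function name call_base, the min_ratio threshold and the example intensities are hypothetical and chosen only for illustration.

```python
def call_base(intensities, reference_base, min_ratio=1.5):
    """Call one interrogated site from its four probe intensities.

    intensities maps each middle nucleotide (A, C, G, T) to the signal of
    the corresponding probe. Returns (called_base, is_snp), or (None, False)
    when the strongest probe does not clearly dominate the second strongest.
    min_ratio is an assumed, illustrative threshold.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return None, False  # no usable signal at this site
    if second > 0 and best / second < min_ratio:
        return None, False  # ambiguous: two probes hybridize similarly
    return best_base, best_base != reference_base


# Hypothetical site with reference base A where the C probe is brightest.
site = {"A": 100.0, "C": 900.0, "G": 120.0, "T": 80.0}
print(call_base(site, "A"))
```

A site at which the two brightest probes have similar intensities is left uncalled here, which reflects the variability in hybridization specificity discussed in the following paragraph.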

Part of this thesis is a basic characterization of the resequencing data. Intensity and specificity of hybridization exhibit a large degree of variability, the difference in intensity being more than 10-fold in extreme cases. The analysis revealed that this variability is caused in part by experimental factors and in part by sequence properties of the probe. High AT content and self-complementarity, which favors hairpin formation, negatively affect hybridization, whereas probes with high-complexity sequences, as measured by sequence entropy, hybridize better on average.
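The three sequence properties named above can be computed per probe along the following lines. This is a sketch under assumptions: the abstract does not define the exact entropy measure or the self-complementarity score, so mononucleotide Shannon entropy and the longest hairpin stem are used here as stand-ins, and all function names are hypothetical.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def at_content(probe):
    """Fraction of A and T bases in the probe."""
    return sum(base in "AT" for base in probe) / len(probe)


def shannon_entropy(probe):
    """Mononucleotide Shannon entropy in bits (0 = one base only, 2 = uniform)."""
    n = len(probe)
    return -sum((c / n) * math.log2(c / n) for c in Counter(probe).values())


def longest_hairpin_stem(probe, min_loop=3):
    """Length of the longest arm whose reverse complement occurs downstream,
    separated by at least min_loop bases (assumed proxy for hairpin potential)."""
    best = 0
    n = len(probe)
    for i in range(n):
        for k in range(best + 1, n - i + 1):
            arm = probe[i:i + k]
            if reverse_complement(arm) in probe[i + k + min_loop:]:
                best = k
            else:
                break  # a longer arm from this start cannot match either
    return best


probe = "GGGGAAAACCCCATATATATATATA"  # hypothetical 25-mer
print(at_content(probe), shannon_entropy(probe), longest_hairpin_stem(probe))
```

Such per-probe features could then be compared against observed intensities, for example by binning probes by AT content and averaging the signal in each bin.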

In order to estimate the potential of a given probe for cross-hybridization to multiple DNA sequence tracts in the genome, a systematic search for repeated 25-mers in the reference genome has been conducted. The result suggests that more than 90% of the false SNP calls in the reference ecotype, Col-0, are caused by cross-hybridization that can be detected with this search method. The error rates for SNP calls in other ecotypes can be reduced with a filter based on 25-mer matches.
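A search for repeated 25-mers can be sketched in Python as follows. The sketch assumes exact matches only and treats a 25-mer and its reverse complement as the same sequence; whether the search in the thesis also considered inexact matches or both strands is not stated in this abstract. The names canonical and repeated_kmers are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Strand-independent key: the smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def repeated_kmers(genome, k=25):
    """Map each k-mer that occurs more than once (on either strand)
    to the sorted list of its 0-based start positions."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[canonical(genome[i:i + k])].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


# Toy example with k=7: GATTACA occurs twice on the forward strand.
toy_genome = "GATTACA" + "CCC" + "GATTACA"
print(repeated_kmers(toy_genome, k=7))
```

A filter in this spirit would discard SNP calls whose probe start position appears in the returned mapping. For a genome of ~125 Mb a plain dictionary is likely too memory-hungry, so a sorted k-mer list or a suffix-array-based index would be the more realistic choice.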

Finally, an algorithm has been developed for the prediction of large deletions from resequencing data. It is a comparative loss-of-signal approach that identifies regions where the target ecotype exhibits strongly reduced hybridization signal relative to the reference. More than 700 deletions larger than 200 bp have been predicted for the ecotype Ler-1, some of which closely match deletions known from dideoxy sequencing. The main obstacles for deletion calling are regions that are repetitive or that produce an ambiguous hybridization signal in the reference. Such regions lead to uncertainties about the start and end points of putative deletions. As the set of known large deletions in Ler-1 is incomplete, it is difficult to assess the specificity of our deletion calling heuristic. Indirect evaluations suggest that among the predictions the number of true deletions is higher than the number of false positives. A better assessment will be possible when some regions containing putative deletions have been sequenced.
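The comparative loss-of-signal idea can be illustrated with a simplified Python sketch. It is not the algorithm developed in the thesis: the per-position log ratio, the thresholds max_log_ratio and min_ref_signal, and the function name call_deletions are assumptions made for illustration. Only the 200 bp minimum length is taken from the text.

```python
import math


def call_deletions(ref_signal, target_signal,
                   max_log_ratio=-2.0, min_ref_signal=100.0, min_length=200):
    """Return half-open (start, end) intervals longer than min_length in which
    the target signal is strongly reduced relative to the reference.

    Positions where the reference itself is weak (below min_ref_signal) are
    treated as uninformative and interrupt a run. Thresholds are assumed values.
    """
    n = min(len(ref_signal), len(target_signal))
    deletions = []
    start = None
    for pos in range(n):
        ref, tgt = ref_signal[pos], target_signal[pos]
        # The +1 pseudocount avoids division by zero and log of zero.
        lost = (ref >= min_ref_signal
                and math.log2((tgt + 1.0) / (ref + 1.0)) <= max_log_ratio)
        if lost and start is None:
            start = pos
        elif not lost and start is not None:
            if pos - start > min_length:
                deletions.append((start, pos))
            start = None
    if start is not None and n - start > min_length:
        deletions.append((start, n))
    return deletions


# Hypothetical signals: a 250 bp drop in the target between positions 300 and 550.
ref = [1000.0] * 1000
tgt = [1000.0] * 300 + [50.0] * 250 + [1000.0] * 450
print(call_deletions(ref, tgt))
```

Because weak reference positions interrupt a run in this sketch, a true deletion spanning a repetitive or poorly hybridizing region would be split or truncated, which mirrors the boundary uncertainty described above.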