Biological data and cancer
Computers have become indispensable tools for accessing, visualizing, analyzing, managing, and publishing biological data. In recent years, high-throughput sequencing and functional genomics projects have generated unprecedented quantities of biological data that can no longer be processed manually or interpreted visually. The emergence of the new field of bioinformatics is a response to this development. It comprises all research and development activities that help to make sense of biological data with the aid of computers. Specifically, these activities include mathematical modeling of biological processes, the conception of new algorithms to fit models to data, and the development of software and databases as community resources. Sometimes it is the sheer amount of information that calls for automatic processing. In other instances, it is the complex relationship between observations and biological models that makes advanced computational approaches necessary. The latter case is illustrated, for instance, by the problem of deriving a quantitative model of a transcription factor binding site from experimental data.
Cancer can be considered a gene regulatory disease. Normal regulation of genes permits the development and maintenance of a healthy human being. Abnormal regulation leads to various diseases. A currently popular view is that cancer cells are maintained in a specific pathological state by gene regulatory circuits. Transcription factors are key elements of such circuits in that they control the expression of other genes while being regulated themselves by the products of other genes. The specific research projects of our group aim at an understanding of transcriptional regulatory mechanisms, in particular those affected by genetic lesions that cause cancer. Our group also develops and maintains public software and databases for accessing and analyzing data relevant to gene expression and cancer. In addition, we have a number of collaborative projects with experimental researchers.
Characterization of transcription factor binding sites
Despite tremendous research efforts devoted to the study of transcriptional control mechanisms in higher eukaryotes, we still do not understand how gene regulatory information is encoded in the human genome. As a consequence, we can predict neither the biological function of non-protein-coding DNA sequences nor the deleterious effects of mutations therein from the nucleotide sequence alone. There is now growing hope that progress will come from exhaustive and innovative analysis of data produced by new high-throughput analytical technologies. Microarrays, for example, allow for the simultaneous analysis of the expression of many genes in different cell types under a variety of conditions. Genome-wide mapping of transcription factors bound to cis-regulatory elements is achieved by so-called ChIP-on-chip experiments. The possibility of measuring the evolutionary conservation of regulatory DNA sequences by cross-genome comparison further helps to distinguish true biological function from biological and technological noise. What is still lacking, however, are accurate tools to predict the binding sites of eukaryotic transcription factors.
Transcription factor binding sites are the elementary building blocks of gene regulatory regions. The accuracy of current prediction methods for such sites is on average very low, as indicated by a recent systematic evaluation of a large number of motif finding algorithms. We suspect that the low success rate is at least partly due to the limited number of known binding site examples from which the models are derived. Starting from this assumption, we have recently developed a bioinformatics-driven high-throughput technology to characterize the binding specificity of a transcription factor. The method relies on SELEX, an in vitro evolutionary method to select nucleic acid ligands of a protein, and makes use of the concatenation step of SAGE to increase the sequencing throughput. We have shown that it is possible with this technology to derive computational models that quantitatively predict the affinity of a transcription factor for a particular ligand. So far, we have sequenced more than 40,000 binding sites for four different transcription factors. The experimental work pertaining to this project was carried out in Nicolas Mermod’s lab at the University of Lausanne and EPFL.
The high-throughput SELEX method outlined above was first applied and validated on the regulatory DNA binding protein CTF/NF1, which is thought to play a role in both DNA replication and transcription. More recently, we have used this approach to characterize the binding specificity of three members of the TCF (T-cell factor) family: Lef‑1, Lef‑1 in complex with beta‑catenin, and TCF4. An important result from this study is that the binding specificities of these three factors are statistically indistinguishable from each other, even when SELEX libraries of several thousand binding sites are available for analysis. The varying expression levels of individual TCF members in different tissues thus cannot be responsible for the differential expression of the corresponding target genes.
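To make the idea of a quantitative binding site model concrete, the following minimal sketch builds a log-odds position weight matrix from a set of aligned sites and scores a candidate sequence with it. It is an illustration only, not the model-fitting procedure used in the project: the function names `build_pwm` and `score`, the pseudocount, the uniform background, and the toy sequences are all assumptions, and the resulting score is at best a rough proxy for affinity under an additive, position-independent model.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log-odds weight matrix from aligned, equal-length sites.

    Returns one dict per position mapping each base to log2(freq / background).
    The pseudocount and the uniform background are illustrative assumptions.
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned and of equal length")
    background = background or {base: 0.25 for base in BASES}
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background[base])
        pwm.append(column)
    return pwm

def score(pwm, seq):
    """Sum the per-position weights of seq (higher means a better match)."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[base] for column, base in zip(pwm, seq))

# Toy aligned sites, invented for illustration (not real SELEX data).
toy_sites = ["TTGGCA", "TTGGCT", "TTGGCA", "CTGGCA"]
toy_pwm = build_pwm(toy_sites)
print(score(toy_pwm, "TTGGCA"), score(toy_pwm, "AAAAAA"))
```

With only a handful of example sites, the pseudocount dominates the estimated frequencies, which is one way to see why large libraries of binding sites matter for model quality.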
To foster the development of better tools for prediction of transcription factor binding sites, we are also working on new computational methodologies. One specific project aims at defining a bootstrapping protocol to assess the robustness of transcription factor binding site matrices derived from sparse data sets. Moreover, using the large SELEX libraries for the transcription factors of the TCF family, we are now exploring the potential benefits of more advanced binding site matrices that take into account nearest-neighbor dependencies. Finally, we use those transcription factor binding site prediction methods which we already deem accurate (based on bootstrapping tests or other tests) to explore whether their target sites are consistently up- or down-regulated in cancer cells. The results of such an analysis will help to distinguish global regulatory defects in cancer cells from patient-specific lesions when interpreting gene expression profiles from clinical samples.
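The bootstrapping idea mentioned above can be sketched as follows: resample the binding sites with replacement, rebuild the frequency matrix each time, and look at how much every matrix cell varies across resamples. This is a generic bootstrap under stated assumptions, not the protocol being developed by the group; the function names, the pseudocount, the number of resamples, and the toy sites are hypothetical.

```python
import random
import statistics
from collections import Counter

BASES = "ACGT"

def frequency_matrix(sites, pseudocount=1.0):
    """Per-position base frequencies (rows: positions, columns: A, C, G, T)."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        matrix.append([(counts[base] + pseudocount) / total for base in BASES])
    return matrix

def bootstrap_spread(sites, n_resamples=200, seed=0):
    """Standard deviation of each matrix cell over bootstrap resamples.

    Large values flag cells whose estimate depends strongly on which
    sites happen to be in the data set, i.e. cells that are not robust.
    """
    rng = random.Random(seed)
    length = len(sites[0])
    samples = [[[] for _ in BASES] for _ in range(length)]
    for _ in range(n_resamples):
        # Draw as many sites as in the original set, with replacement.
        resample = rng.choices(sites, k=len(sites))
        matrix = frequency_matrix(resample)
        for pos in range(length):
            for col in range(len(BASES)):
                samples[pos][col].append(matrix[pos][col])
    return [[statistics.pstdev(cell) for cell in row] for row in samples]

# Toy sites, invented for illustration: position 1 is invariant, position 0 is not.
toy_sites = ["ACGT", "ACGA", "TCGT", "ACGT"]
for pos, row in enumerate(bootstrap_spread(toy_sites)):
    print(pos, [round(value, 3) for value in row])
```

In this toy run the invariant position shows zero spread, while positions supported by a single deviating site show a clearly non-zero spread, which is the kind of signal a robustness test for sparse data sets would look for.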
Case studies in comparative genomics
More and more complete genome sequences, both from eukaryotic and prokaryotic species, are becoming available. As a consequence, the comparative genomics approach, which consists of analyzing the evolution of protein-coding and regulatory DNA sequences across genomes, plays an ever growing role in generating new hypotheses and questions for experimental researchers. As a result of this trend, our group is increasingly solicited for collaborations in this area. For example, with the group of Walter Wahli at the University of Lausanne, we have recently analyzed the sequence conservation patterns in the promoter regions of the GRHPR gene across three mammalian genomes. In this case, we observed that regulatory elements important for the regulation of the gene in one species were not present in the corresponding regulatory regions of the other species. In agreement with this finding, the corresponding induction pathways were found to be inactive in the other species.
Our most extensive case study in comparative genomics bears on a protein family that we termed Stealth (Fig. 1) because we speculate that it helps bacterial pathogens to hide from the host immune system. The human representative of this family originally caught our interest because it contains Notch repeats. Such repeats play a crucial role in developmental pathways that are impaired in cancer cells. Surprisingly, we found that this newly discovered protein also shares extensive sequence similarities with a number of bacterial proteins, some of which were described as virulence factors of human pathogens. A systematic comparison of all domains against completely sequenced genomes, combined with a scrutiny of the scientific literature pertaining to newly identified relatives, enabled us to predict a general molecular function for this newly defined protein family. According to our conclusions, all members of the Stealth family are D‑hexose‑1-phosphoryltransferases. This prediction was very recently corroborated by three experimental studies, which furthermore identified human Stealth as the disease gene for mucolipidosis type II.
The Eukaryotic Promoter Database (EPD)
EPD is a database of experimentally characterized eukaryotic promoters that has been maintained for 20 years. In EPD, a promoter is defined by its transcription initiation site. EPD has played an instrumental role in the identification of major eukaryotic promoter elements such as the TATA‑ and CCAAT-boxes. Today, it is mostly used by computational and systems biologists studying gene control elements or transcription regulatory networks.
Initially, EPD was compiled by manual screening and processing of biological data published in journal articles. Today, new entries are exclusively generated from mass genome annotation data produced with high-throughput transcription start site mapping techniques such as 5’SAGE or CAGE. The recent release of large volumes of such data enabled us to increase the number of promoter entries in EPD by a factor of five in less than two years. Our highest priority at the moment is to reach complete promoter coverage for the human genome and a few important model organisms within a few years.
CleanEx: a database of heterogeneous gene expression data based on a consistent gene nomenclature
CleanEx was originally designed as an accessory database to EPD providing links to public gene expression profiles. Realizing that CleanEx is potentially useful in other contexts and complementary to existing gene expression databases, we decided to develop it into an independent resource. In doing so, we tried to take into account the needs of other user communities, in particular those of researchers trying to identify and characterize subclasses of tumors which could benefit from specialized treatment.
The main goal of CleanEx is to provide access to public gene expression data via unique gene names. A second objective is to represent heterogeneous expression data produced by different technologies in a way that facilitates joint analysis and cross-data set comparisons. A consistent and up-to-date gene nomenclature is achieved by associating each single experiment with a permanent target identifier consisting of a physical description of the targeted RNA population or the hybridization reagent used. These targets are then mapped at regular intervals to the growing and evolving gene catalogues of man and model organisms.
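The separation between permanent target identifiers and a periodically refreshed gene mapping can be sketched as below. This is a simplified illustration of the design principle, not the CleanEx schema: the `Measurement` class, the `values_by_gene` function, and all identifiers and gene names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    experiment: str   # data set the value comes from
    target_id: str    # permanent identifier of the probe or RNA target
    value: float      # expression value as reported by the experiment

def values_by_gene(measurements, target_to_gene):
    """Group measurements under current gene names.

    Measurements never store a gene name; only the mapping table is
    regenerated when the gene catalogue changes.
    """
    by_gene = {}
    for m in measurements:
        gene = target_to_gene.get(m.target_id)
        if gene is None:
            continue  # target currently not assigned to any gene
        by_gene.setdefault(gene, []).append((m.experiment, m.value))
    return by_gene

# Hypothetical data: two platforms probing what is now considered the same gene.
measurements = [
    Measurement("array_study", "probe_0001", 812.0),
    Measurement("tag_study", "tag_ACGTACGTAC", 14.0),
]
mapping_old = {"probe_0001": "GENE_A"}
mapping_new = {"probe_0001": "GENE_A", "tag_ACGTACGTAC": "GENE_A"}

print(values_by_gene(measurements, mapping_old))
print(values_by_gene(measurements, mapping_new))
```

Because the stored measurements are untouched, rerunning the grouping with a newer mapping is enough to bring every experiment in line with the current nomenclature.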
Recently we have started to import public data directly from the GEO repository at the NCBI. As a result, the number of experiments covered by CleanEx has increased tremendously, and CleanEx defines itself more and more as a downstream resource of public repositories, offering advanced query and analysis services to biologists. With this objective in mind, we have recently added a number of new functions to the web server for the joint analysis of gene expression profiles that bear on the same biological problem but were generated with different technology platforms. Current development efforts focus on the problem of making numerical values produced by different methods (counts, absolute intensities, log-ratios) mutually compatible for downstream analysis tasks such as class discovery and class prediction.
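One simple way to put counts, intensities, and log-ratios on a common footing is to replace the values within each experiment by their fractional ranks. The sketch below illustrates that general idea only; it is an assumption made for illustration and not necessarily the approach taken in CleanEx, and the function name `fractional_ranks` is hypothetical. Rank transforms discard magnitude information, so they are a coarse baseline rather than a full solution.

```python
def fractional_ranks(values):
    """Map values to ranks scaled to [0, 1]; tied values share their mean rank."""
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of values tied with the value at rank i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank / (n - 1)
        i = j + 1
    return ranks

# Hypothetical values for the same three genes on two platforms.
tag_counts = [10, 1000, 100]      # e.g. sequence tag counts
log_ratios = [-1.0, 2.0, 0.5]     # e.g. two-colour array log-ratios

print(fractional_ranks(tag_counts))
print(fractional_ranks(log_ratios))
```

Both toy experiments order the three genes identically, so they yield the same rank vector even though their raw scales differ by orders of magnitude.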
HTPSELEX: a database of high-throughput SELEX data
This new database has been created to disseminate raw and derived data from high-throughput SELEX experiments. The resource is primarily intended to serve computational biologists interested in building models for transcription factor binding sites from large SELEX libraries. For each experiment, HTPSELEX provides sequencing chromatograms, protein-binding sequences, binding site models, and detailed information about the transcription factor analyzed.
Signal Search Analysis (SSA)
Signal search analysis is a general method to discover and characterize sequence motifs that preferentially occur at a constrained distance from a physiological site, for instance a transcription initiation site. The basic algorithms, which I developed in the late eighties, have, like EPD, played an instrumental role in the characterization of eukaryotic promoter elements. We recently made improved versions of the basic algorithms available through a web server. Currently we are applying these methods to analyze the signal contents of different promoter classes, as defined by public gene expression data in the CleanEx database.
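The core notion of a motif occurring at a constrained distance from a reference site can be illustrated with a small counting sketch. It is not the SSA algorithms themselves, which are not described in this report; the function name `positional_counts`, the exact-match motif search, and the toy sequences are assumptions made purely for illustration.

```python
from collections import Counter

def positional_counts(sequences, site_index, motif):
    """Count motif occurrences by position relative to a reference site.

    All sequences are assumed to be aligned so that the reference site
    (e.g. a transcription initiation site) sits at index `site_index`.
    """
    counts = Counter()
    for seq in sequences:
        start = seq.find(motif)
        while start != -1:
            counts[start - site_index] += 1
            start = seq.find(motif, start + 1)  # allow overlapping matches
    return counts

# Toy sequences, invented for illustration; the reference site is at index 15.
toy_sequences = [
    "GGGGGTATAGGGGGGGGGGG",
    "CCCCCTATACCCCCCCCCCC",
    "GCGCGTATAGCGCGCGCGCG",
    "TATAGGGGGGGGGGGGGGGG",
]
counts = positional_counts(toy_sequences, site_index=15, motif="TATA")
print(counts.most_common())
```

A motif with a positional preference shows up as a peak at one relative position (here 10 bases upstream of the site in three of four toy sequences), whereas a motif without such a preference would be spread evenly over all positions.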

Figure 1: Sequence conservation, domain architectures and phylogenetic tree of Stealth proteins. The discovery of conserved domains in bacterial proteins combined with the scrutiny of the scientific literature on the experimentally characterized bacterial proteins enabled us to propose a general molecular function for this newly recognized protein family defined by a comparative genomics approach. Note further that the human Stealth gene was recently identified as the target of mutations causing the hereditary disease mucolipidosis type II.
Collaborations
Part of the work described above was carried out in collaboration with the groups of Joerg Huelsken (ISREC), Mauro Delorenzi (ISREC, SIB), Nicolas Mermod (University of Lausanne, EPFL), Emmanuelle Roulet (University of Geneva), and Bob Strausberg (J. Craig Venter Institute, Rockville MD, USA).
URLs for bioinformatics resources maintained by our group
http://www.epd.isb-sib.ch
EPD home page
http://www.cleanex.isb-sib.ch
CleanEx home page
http://www.isrec.isb-sib.ch/htpselex/
HTP SELEX database
http://www.isrec.isb-sib.ch/ssa/
Signal Search Analysis Server home page
ftp://ftp.isrec.isb-sib.ch/sib-isrec
FTP server: Source code distribution of various software packages developed by the group, flat file releases of EPD and CleanEx.
Keywords
Promoter structure, gene expression database, DNA sequence computational analysis