Understanding regulation of gene transcription is central to molecular biology as well as being of great interest in medicine. The molecular syntax of the concerted transcriptional activation/repression of gene networks in mammalian cells, which shape the physiological response to molecular signals, is often unknown or not completely understood. Combining genome-wide experiments with in silico approaches opens the way to a more systematic comprehension of the molecular mechanisms of transcription regulation. Diverse bioinformatics tools have been developed to help unravel these mechanisms, by handling and processing data at different stages: from data collection and storage to the identification of molecular targets and from the detection of DNA motif signatures in the regulatory sequences of functionally related genes to the identification of relevant regulatory networks. Moreover, the large amount of genome-wide scale data recently produced has attracted professionals from diverse backgrounds to this cutting-edge realm of molecular biology. This mini-review is intended as an orientation for multidisciplinary professionals, introducing a streamlined workflow in gene transcription regulation with emphasis on sequence analysis. It provides an outlook on tools and methods, selected from a host of bioinformatics resources available today. It has been designed for the benefit of students, investigators, and professionals who seek a coherent yet quick introduction to in silico approaches to analyzing regulation of gene transcription in the post-genomic era.
Completion of the human genome reference and development of next-generation sequencing (NGS) has led to an exciting era in which the complexity of biomedical research can be tackled in a new way, requiring multidisciplinary expertise at multiple levels (Noble 2002, Quackenbush 2006b, Navlakha & Bar-Joseph 2011). Bioinformatics, as an integrative discipline in this field, increasingly occupies a larger portion of the biomedical stage. Formerly seen as a service in wet laboratories or as a specialization in computer science departments, it is presently an intrinsic component of scientific discovery (Noble 2002). Significant benefits can derive from both sharing experience in a multidisciplinary team and employing individuals who integrate interdisciplinary expertise. Thus, it is important to prepare a new generation of biologists to master computational and quantitative skills (Bialek & Botstein 2004, Pevzner 2004, Wingreen & Botstein 2006, Dudley & Butte 2009, Pevzner & Shamir 2009) as well as knowledge in specialized areas of biomedicine. Moreover, creating and maintaining appropriate infrastructures helps in unifying the heterogeneity of post-genomic data and dedicated bioinformatics toolkits. Large resources for data handling and manipulation, for example, open interactive systems such as Galaxy (Giardine et al. 2005), BioGPS (Wu et al. 2009), and the open source project R-Bioconductor (Gentleman et al. 2004), are highly welcome and point to a successful, cooperative model. However, most efforts cannot yet fully rely on such coherent supporting frameworks. Instead, expertise and knowledge of heterogeneous tools are required (Bialek & Botstein 2004, Pevzner 2004, Wasserman & Sandelin 2004, Quackenbush 2006b, Wingreen & Botstein 2006, Dudley & Butte 2009, Pevzner & Shamir 2009). 
Here, I want to collect a few reference landmarks for technology professionals who want to enter the post-genomic biology era and for biomedical investigators and students who want to build bioinformatics expertise on their own. Given the vastness of the subject, I cannot cover all areas of interest, nor provide deep insight into the complex issues underlying tool conception and development. By focusing on a streamlined analysis workflow, I want to stimulate and support the reader's intellectual curiosity and desire to develop individualized skills.
The physiological responses of cells to any molecular signal are shaped by the molecular syntax of a concerted transcriptional activation/repression of many genes, referred to as regulatory networks. The major molecular events that underlie gene transcription primarily include the recognition of specific DNA elements known as transcription factor binding sites (TFBSs) by transcription factor (TF) proteins, as well as the recognition of cofactors that specifically link these protein–DNA binding events to the transcriptional molecular machinery (Wasserman & Sandelin 2004, Juven-Gershon et al. 2008). Furthermore, micro-RNA circuitry and a variety of post-translational modifications, including the topical epigenetic markers, further modulate gene transcription with either functional or dysfunctional effect (Bartel 2009, Fabbri & Calin 2010, Portela & Esteller 2010, Lister et al. 2011). The packaging of DNA sequences around histone proteins to form nucleosomes and chromatin fibers (Hayes & Hansen 2001), that is, the structural organization of chromatin in the cell nuclei, masks/reveals TFBSs under different physiological conditions (Li et al. 2007, Park 2009). Various epigenetic modifications further alter the chromatin conformation, likely combining with each other to generate chromatin states (Ernst & Kellis 2010). In addition to a first layer of signals constituted by TFBS motifs, we can therefore picture a second layer of signals, defined by CTCF-binding sites that indicate active/inactive DNA sequence domains, or chromatin boundaries (Jothi et al. 2008, Jeziorska et al. 2009, Phillips & Corces 2009, Botta et al. 2010, Tolstorukov et al. 2011).
The architecture of a typical gene promoter is not rigorously defined, although a functional regulatory sequence comprises core elements (core promoter) and several enhancer/silencer elements scattered at various distances from the annotated transcriptional start sites (Wasserman & Sandelin 2004, Carninci et al. 2005, Halfon 2006, Juven-Gershon et al. 2008, Dolfini et al. 2009), as shown in Fig. 1. Enhancers often lie several kilobases from the core, as observed in regulation by nuclear receptors (Lin et al. 2007, Kuttippurathu et al. 2011, Navlakha & Bar-Joseph 2011), and may lie in intragenic regions (mainly introns) as well as in the upstream sequence of a gene. The ‘linear’ architecture of the promoter (Fig. 1) in fact reflects the three-dimensional structure of chromatin packaging, the conformation of which is consistent with a dynamic fractal globule able to fold/unfold any genomic locus, as recently revealed by the Hi-C technique and Monte Carlo simulations (Lieberman-Aiden et al. 2009). The organization of chromatin on a genome scale deeply affects gene transcription regulation (Li et al. 2007, Ernst et al. 2011, Tolstorukov et al. 2011, Delest et al. 2012). Further advancements in our understanding of gene transcriptional regulation will derive from the exploitation of structural findings on chromosome territories (Fraser & Bickmore 2007, Lieberman-Aiden et al. 2009, Shaw 2010), TF–TFBS binding events (Honig & Rohs 2011, Nikolova et al. 2011, Delest et al. 2012), and from a systemic approach (He et al. 2009, Ashworth et al. 2011).
In the current working model of transcription, proximal/distal regulatory elements are arranged in units, called cis-regulatory modules (CRMs), which are organized in a somewhat complex way (Halfon 2006, Jeziorska et al. 2009). Among other possible arrangements, a typical CRM consists of dense clusters of similar signals (Halfon 2006), which is a key feature exploited by the statistical methods developed for their detection (see below). The identification of CRMs has been an object of intensive bioinformatics research (Wasserman & Sandelin 2004, Elnitski et al. 2006, van Nimwegen 2007, Rister & Desplan 2010, Medina-Rivera et al. 2011). For a long time, the detection of DNA elements on a genomic scale was only attempted in silico and with limited success (Wasserman & Sandelin 2004, Hannenhalli 2008). The development of new technologies has promoted a systematic identification of TFBSs at high resolution. The cell and tissue specificity of the physiological response to molecular signals, for years addressed in vitro on a gene-by-gene basis, has been tackled in vivo via genome-wide approaches such as ChIP-on-Chip (Elnitski et al. 2006, Tavera-Mendoza et al. 2006, Dufour et al. 2007, Zheng et al. 2007) and ChIP-seq (Johnson et al. 2007, Barski & Zhao 2009, Park 2009, Pepke et al. 2009, Visel et al. 2009). In particular, ChIP-seq technology has enabled the unbiased identification of all DNA sequences (TFBS) bound by a specific protein of interest (TF) (Park 2009, Liu et al. 2010) and is also successfully used in epigenomics (Bock & Lengauer 2008, Park 2008, Huss 2010, Lim et al. 2010, Ongenaert 2010). In silico approaches are currently applied in combination with genome-wide experimental techniques (D'Haeseleer 2006a, Elnitski et al. 2006, Dufour et al. 2007, Bickel et al. 2009, Gazdag et al. 2009, Delacroix et al. 2010).
The methods developed for detection of DNA motifs in the regulatory sequences of genes were devised in the framework of information theory (Hertz et al. 1990, Hertz & Stormo 1999), at a time when no structural clue about chromatin packaging was available. In this theory, the DNA molecule is simply a long word composed of four letters, the nucleotides A, C, G, and T, in which the motif is hidden (Hertz et al. 1990, Hertz & Stormo 1999, Stormo 2000, Wasserman & Sandelin 2004, D'Haeseleer 2006a,b). The characterization of the (non-coding) DNA regions in terms of functional CRMs remains a computational challenge, especially for compositional signals, involving the combinatorial activation of different species of TFBSs (Bluthgen et al. 2005, Pierstorff et al. 2006, Van Loo & Marynen 2009). A comprehensive probabilistic framework for DNA motif discovery, devised by van Nimwegen (2007), provides a unifying view of the sequence analysis approach to gene transcription regulation. The in silico determination of chromatin boundaries has been attempted from experimental chromatin signatures (Won et al. 2008) and nucleosome positioning prediction (Ioshikhes et al. 2006, Segal et al. 2006). The first systematic characterizations of chromatin states based on the in silico integration of a variety of epigenetic marks in human T-cells and Drosophila have recently been provided by Ernst & Kellis (2010), Ernst et al. (2011) and Riddle et al. (2011).
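As a concrete illustration of the information-theoretic view, the following minimal sketch builds a position frequency matrix from a handful of aligned sites and computes the information content of each column in bits. The aligned sites are invented for illustration, the uniform background is an assumption, and the function names are hypothetical; this is not the implementation of any of the cited tools.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def position_frequency_matrix(sites, pseudocount=0.0):
    """Return one dict of base frequencies per column of the aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(sites) + 4 * pseudocount
    matrix = []
    for col in range(length):
        counts = Counter(site[col].upper() for site in sites)
        matrix.append({b: (counts.get(b, 0) + pseudocount) / total for b in ALPHABET})
    return matrix

def information_content(column, background=None):
    """Relative entropy (in bits) of one column against a background distribution."""
    background = background or {b: 0.25 for b in ALPHABET}  # assumed uniform
    return sum(p * math.log2(p / background[b]) for b, p in column.items() if p > 0)

# Hypothetical aligned binding sites, invented for illustration only.
sites = ["TGACC", "TGACC", "TGACA", "TGTCC"]
pfm = position_frequency_matrix(sites)
ic_per_position = [information_content(col) for col in pfm]
total_ic = sum(ic_per_position)

for position, bits in enumerate(ic_per_position, start=1):
    print(f"position {position}: {bits:.2f} bits")
print(f"total: {total_ic:.2f} bits")
```

Fully conserved columns reach the 2-bit maximum under a uniform background, whereas degenerate columns contribute less; a skewed genomic background would change these values.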
The presence of repeated, similar signals in the same non-coding DNA region, that is, the statistical overrepresentation of a DNA motif, is key to inferring a biological role for this DNA element, especially when the region and/or the signal are conserved across species. The use of cross-species conservation as a criterion for detecting functional motifs is referred to as phylogenetic footprinting (Wasserman & Sandelin 2004, D'Haeseleer 2006a,b, Hannenhalli 2008) and is implemented in several dedicated bioinformatics tools (see below). Moreover, statistical overrepresentation of (combinations of) DNA motifs in many of the regulatory regions of a set of co-expressed genes may indicate that these gene targets are co-regulated. In this set of regulatory regions, specific combinations of DNA motifs, henceforth ‘DNA signatures’, will be detected with statistical significance relative to background sequences (see Fig. 2).
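One simple way to quantify overrepresentation in a gene set is a hypergeometric test on the number of regulatory sequences that contain the motif. The sketch below assumes that the co-expressed set is a subset of the background, searches the forward strand only, and uses toy sequences and the consensus WGATAR purely for illustration; dedicated tools use more refined statistics.

```python
import re
from math import comb

# IUPAC nucleotide codes translated to regular-expression character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "W": "[AT]", "S": "[CG]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))

def count_hits(sequences, pattern):
    """Number of sequences with at least one match (forward strand only)."""
    return sum(1 for seq in sequences if pattern.search(seq.upper()))

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n sequences are drawn from N, of which K contain the motif."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

# Toy data: the foreground (co-expressed set) is part of the background.
foreground = ["TTAGATAAGG", "CCTGATAGCA", "ACGTACGTAC"]
others = ["CCCCCCCCCC", "GGGTTTAAAC", "TTGATAACCA", "ACACACACAC", "GTGTGTGTGT"]
background = foreground + others

pattern = consensus_to_regex("WGATAR")
k = count_hits(foreground, pattern)   # foreground sequences with the motif
K = count_hits(background, pattern)   # background sequences with the motif
p_value = hypergeom_upper_tail(k, len(foreground), K, len(background))
print(f"{k}/{len(foreground)} foreground vs {K}/{len(background)} background, P = {p_value:.3f}")
```

With so few sequences the P value is far from significant; the point is only to show how the foreground and background counts enter the test.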
Biochemical pathways target different DNA sequence elements and implement the gene expression control that is key to the regulatory events. In principle, knowledge of the distribution of the functional DNA elements in the regulatory regions of a set of responsive genes should enable the inference of the relevant regulatory networks. Inference of these networks from expression microarray data can be attempted by reverse engineering (He et al. 2009, Ashworth et al. 2011). Comprehensive coverage of reverse engineering and systemic approaches to gene transcription (regulatory) networks can be found in He et al. (2009), Koyuturk (2010) and Ashworth et al. (2011), and references therein. Much more difficult, and as yet unproven, is a forward approach, i.e. inferring the regulatory networks from the distribution of (active) DNA motifs in the gene regulatory regions (Altobelli 2007). In general, reconstruction of gene networks is bound to provide a static picture of the actual biological pathways. Time evolution must be addressed with an appropriate experimental design.
We distinguish here two levels of data analysis: 1) the treatment of raw data, which leads to the identification of target genes, and the unraveling of the biological meaning (ontological or functional analysis) of these targets; and 2) the integration of different types of information into a unifying view, the reconstruction of gene networks and biological pathways, and the generation of regulatory hypotheses. Sequence analysis may be performed at level 1, helping in identifying TFBSs from ChIP-seq data, for example, and/or at level 2 as in the case study below. In the following, we illustrate analysis tools for a hypothetical workflow that includes whole-genome techniques (expression microarray and ChIP-seq) and provide a concise applicative example of sequence analysis of regulatory regions of these gene targets. Even though many computational tools are available, the assumptions under which they have been developed may not fully apply to one's specific purpose. Thus, the analytical strategy is best tailored to the specific biological system and research goals, possibly by controlling experiment design and always by evaluating the strengths and limitations of the method used. Both microarray and NGS data can be analyzed using open software packages within the R-Bioconductor framework (Gentleman et al. 2004) and/or Galaxy (Giardine et al. 2005). Methods and strategies for ChIP-seq analysis can be found in Pepke et al. (2009). Zhang et al. (2011) provide an extensive review of the impact of NGS technologies on genomics, and Marra and colleagues review their impact on functional genomics and epigenomics (Morozova & Marra 2008, Hirst & Marra 2010). For foundations of gene expression analysis, see for example Quackenbush (2001, 2002, 2003 and 2006a) and Hsiao et al. (2005).
Once a list of (possibly) co-regulated genes is available, the detection of DNA signatures (Fig. 2) may be attempted using more than one method from two classes: 1) the algorithms that search for known motifs by screening the sequence against collections of TFBS models and 2) the algorithms that discover DNA motifs by unraveling the statistical properties of the sequence, referred to as de novo. TFBSs are usually modeled by either consensus (Xie et al. 2005, D'Haeseleer 2006b) or matrix (Wasserman & Sandelin 2004, D'Haeseleer 2006a,b). The model-based methods are somewhat biased by the quality and type of the model used, yet provide a direct estimate of which TF is involved in regulation (Wasserman & Sandelin 2004). The de novo methods may spot novel motifs, yet require output interpretation; and both classes demand an appropriate definition of background in order to assign statistical significance (usually P values) to their outputs. A background set of sequences is usually constituted by the entire set of promoters of all genes for the species of interest or, for example, by the regulatory sequences of the genes that are functionally opposite, i.e. downregulated vs upregulated, depending on the problem under investigation.
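To make the matrix-based class concrete, the sketch below converts a count matrix into log-odds scores against a background nucleotide distribution and scans a sequence on both strands for the best-scoring window. The count matrix, the pseudocount, and the uniform background are assumptions chosen for illustration; real TFBS matrices would come from a curated collection, and assigning P values to the scores requires a background sequence set as described above.

```python
import math

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform background
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def log_odds_matrix(counts, pseudocount=0.5):
    """Turn per-position base counts into log2(frequency / background) scores."""
    pwm = []
    for col in counts:
        total = sum(col.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({b: math.log2(((col.get(b, 0) + pseudocount) / total) / BACKGROUND[b])
                    for b in "ACGT"})
    return pwm

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def score_window(pwm, window):
    """Sum of the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start, strand) of the best window, scanning both strands."""
    seq = sequence.upper()
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) - set("ACGT"):
            continue  # skip windows containing ambiguous bases such as N
        for strand, oriented in (("+", window), ("-", reverse_complement(window))):
            score = score_window(pwm, oriented)
            if best is None or score > best[0]:
                best = (score, start, strand)
    return best

# Hypothetical count matrix for a 4-bp GATA-like core, invented for illustration.
counts = [{"G": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
pwm = log_odds_matrix(counts)
best = best_hit(pwm, "CCTATCGG")
print(best)
```

In this toy sequence the best match lies on the reverse strand, which is why matrix scanners generally examine both orientations.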
An excellent web toolkit for (proximal) regulatory region analysis is the matrix-based Pscan (Zambelli et al. 2009). It is both user friendly and statistically robust, enabling fast detection of DNA signatures for several different promoter sizes and for several species. Moreover, the source code can be downloaded for stand-alone usage. Tools for de novo motif discovery, that is, from the second class of algorithms mentioned earlier, are often used in analysis of ChIP-seq experiments: WEEDER (Pavesi et al. 2004, Tompa et al. 2005), MEME (Bailey & Elkan 1994), MDScan (Liu et al. 2002), and W-ChIPMotifs (Jin et al. 2009) are equipped with web interfaces. Convenient integrative frames for launching multiple DNA motif discovery algorithms (two or more of the above-mentioned) are TAMO (Gordon et al. 2005), by shell scripting, and the web-based CompleteMOTIFs (Kuttippurathu et al. 2011). Motif discovery algorithms eventually need to be coupled to other tools, either to link the discovered motifs to known TFBSs or to classify them using, for example, clustering algorithms (Bickel et al. 2009, Hackenberg et al. 2011). STAMP (Mahony & Benos 2007), a user-friendly tool that may serve this purpose, is also handled by the latest version of TAMO (Gordon et al. 2005). The more recent Tmod tool kit (Sun et al. 2010) allows the integration of 12 different motif discovery algorithms for Windows operating systems. RSAT (Thomas-Chollier et al. 2008) is a comprehensive interactive platform that supports hundreds of genomes and implements a variety of approaches to regulatory sequence analysis including unique visualization features. Based on multispecies alignments, oPOSSUM (Ho Sui et al. 2005) and CORG (Dieterich et al. 2005) are the major resources for investigating distal motifs, providing both collections of conserved regulatory regions and conserved motifs. The method implemented in the REDUCE Suite v2 algorithm (Foat et al. 2006) instead models the interaction between TF and TFBS, enabling the inference of the TF sequence-specific binding affinity from a single expression experiment.
Gene annotation enrichment tools help in identifying the relevant (altered) biological processes in genome-wide experiments. The 68 gene enrichment tools developed for the functional analysis of gene expression data are compared by Huang da et al. (2009a,b). Enrichment tools heavily rely on the quality of annotation databases, which are incomplete and biased by construction (Huang da et al. 2009a, Liberzon et al. 2011). Comprehensive, web-based platforms are DAVID (Huang da et al. 2009b), GSEA (Subramanian et al. 2005, Liberzon et al. 2011), also available as an R-Bioconductor package and as a desktop application, and GREAT (McLean et al. 2010), which has been developed for unbiased functional annotation of genomic regions in human and mouse genomes. The latter is therefore recommended for annotating ChIP-seq-derived targets. For an extensive catalog of bioinformatics resources and web tools, the reader is referred to the dedicated annual publication by Brazas et al. (2011).
In order to tackle the complexity of estrogenic regulation in breast cancer cells in silico, a large sequence analysis workflow was set up as in Fig. 3 (Altobelli 2007). Lists of estrogen target genes were collected and organized in a manually curated, in-house database, by selecting those that exhibited an early response to estradiol in breast cancer cells (up/downregulation within 4 h), from gene expression microarray and gene-by-gene experiments published up to 2005. The workflow combined approaches for DNA sequence analysis of proximal regions (≤1 kbp) with a method that enabled the investigation of distal (up to 15 kbp) conserved nucleotide blocks (Altobelli 2007). In particular, the above-mentioned Pscan (Zambelli et al. 2009) and WEEDER (Pavesi et al. 2004, Tompa et al. 2005) algorithms were used, as well as a collection of conserved motifs generated from a modified version of CORG (Dieterich et al. 2003). One of the many purposes of this study was to formulate regulatory hypotheses to be tested in the laboratory. Estrogen-responsive elements (EREs) are highly degenerate and found scattered over large genomic distances (Carroll & Brown 2006). Thus, they are very elusive to detection in silico (Sismondi et al. 2007). Estrogenic regulation often occurs through the interaction of (activated) estrogen receptor (ER) with cognate TFs, exhibiting TFBSs adjacent to EREs (co-localized TFBSs). This suggests that DNA elements of different types located within a distance comparable to the nucleosome average size (circa 200 bp) may form a functional unit. One of the implemented strategies, therefore, was searching for EREs in the surroundings of other TFBSs, the latter detected with robust statistics.
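The co-localization strategy can be sketched as follows: anchor sites are located first, and degenerate ERE-like matches are then retained only if they fall within about 200 bp of an anchor. This is a minimal illustration and not the code of the original workflow; the commonly cited palindromic ERE consensus GGTCAnnnTGACC, the WGATAR anchor pattern, the mismatch budget, the forward-strand-only search, and the toy sequence are all assumptions made for the example.

```python
import re

ERE_CONSENSUS = "GGTCANNNTGACC"            # commonly cited consensus; N matches any base
GATA_PATTERN = re.compile("[AT]GATA[AG]")  # WGATAR anchor, forward strand only

def mismatches(window, consensus):
    """Count mismatches against the consensus, ignoring N positions."""
    return sum(1 for w, c in zip(window, consensus) if c != "N" and w != c)

def find_degenerate(sequence, consensus, max_mismatches):
    """Yield (start, window, n_mismatches) for windows within the mismatch budget."""
    width = len(consensus)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        n = mismatches(window, consensus)
        if n <= max_mismatches:
            yield start, window, n

def colocalized_eres(sequence, max_distance=200, max_mismatches=2):
    """Report degenerate ERE matches lying within max_distance of an anchor site."""
    seq = sequence.upper()
    anchors = [m.start() for m in GATA_PATTERN.finditer(seq)]
    hits = []
    for start, window, n in find_degenerate(seq, ERE_CONSENSUS, max_mismatches):
        nearby = [a for a in anchors if abs(a - start) <= max_distance]
        if nearby:
            hits.append({"ere_start": start, "ere": window,
                         "mismatches": n, "anchors": nearby})
    return hits

# Toy sequence: one anchor, a degenerate ERE nearby, and a perfect ERE far away.
toy = "AGATAA" + "C" * 20 + "GGTCACAGTGTCC" + "C" * 300 + "GGTCAGCGTGACC"
hits = colocalized_eres(toy)
for hit in hits:
    print(hit)
```

Only the degenerate element close to the anchor is reported, while the perfect but distant element is discarded, which mirrors the idea that proximity to a well-supported TFBS can compensate for weak sequence similarity.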
The workflow provided multiple outputs as follows: 1) sub-lists of genes associated by the presence of at least one DNA motif, 2) DNA motifs and their genomic localizations (DNA signatures), 3) lists of putative TFs, and (after literature mining and data integration) 4) regulatory hypotheses. Among other DNA signatures, several conserved GATA family motifs were detected in multiple occurrences in the upstream region of a small group of developmental genes including cyclin-G2 (CCNG2). Two (novel) degenerate ERE motifs were also detected in the proximity of two conserved GATA3 binding sites in the CCNG2 upstream regulatory region (Altobelli 2007). Cyclin-G2, a cell cycle inhibitor, is strongly downregulated in breast cancer cells and is a primary ER target gene (Stossi et al. 2006). The co-localized ERE and GATA motifs suggest that an interaction between the ER and GATA3 may be implicated in the regulation of the above-mentioned group of developmental genes and that this might play a role in mammary gland development and cancer differentiation (Chou et al. 2010). This interaction might, in fact, contribute to the shaping of the lactiferous ducts, determining the fate of luminal cells, which express estrogen receptor (ER+), compared to myo-epithelial cells, which do not express estrogen receptor (ER−). This may also have implications in the characterization of types of breast cancer (ER+ vs ER− luminal-like types).
Genome-wide experiments in estrogen-responsive cells have recently enabled the detection of presumably direct targets of ERs, as well as a catalog of EREs and binding sites for putative cognate TFs (Carroll et al. 2005, Carroll & Brown 2006, Lin et al. 2007). This makes it possible, in principle, to validate all in silico outcomes and to refine the workflow accordingly.
Declaration of interest
The author declares that there is no conflict of interest that could be perceived as prejudicing the impartiality of the review.
This research did not receive any specific grant from any funding agency in the public, commercial, or not-for-profit sector.
The author thanks all the students and colleagues whose persistent questions eventually prompted me to write this work, as well as Prof. R Bargar, Prof. R Flower, and Mr J Maftin for kindly proofreading the manuscript.
Altobelli G 2007 Functional classification of estrogen-responsive gene regulatory sequences in breast cancer cells: towards the identification of regulatory networks. PhD Thesis. University of Turin http://dott-scsv.campusnet.unito.it/do/home.pl/View?doc=archivio_tesi.html.
Ashworth J, Wurtmann EJ & Baliga NS 2011 Reverse engineering systems models of regulation: discovery, prediction and mechanisms. Current Opinion in Biotechnology. http://dx.doi.org/10.1016/j.bbr.2011.03.031.
Bailey TL & Elkan C 1994 Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the International Conference on Intelligent Systems for Molecular Biology 2 28–36.
Bialek W & Botstein D 2004 Introductory science and mathematics education for 21st-Century biologists. Science 303 788–790. doi:10.1126/science.1095480.
Bickel PJ, Brown JB, Huang H & Li Q 2009 An overview of recent developments in genomics and associated statistical methods. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences 367 4313–4337. doi:10.1098/rsta.2009.0164.
Bluthgen N, Kielbasa SM & Herzel H 2005 Inferring combinatorial regulation of transcription in silico. Nucleic Acids Research 33 272–279. doi:10.1093/nar/gki167.
Botta M, Haider S, Leung IX, Lio P & Mozziconacci J 2010 Intra- and inter-chromosomal interactions correlate with CTCF binding genome wide. Molecular Systems Biology 6 426. doi:10.1038/msb.2010.79.
Brazas MD, Yim DS, Yamada JT & Ouellette BF 2011 The 2011 Bioinformatics Links Directory update: more resources, tools and databases and features to empower the bioinformatics community. Nucleic Acids Research 39 W3–W7. doi:10.1093/nar/gkr514.
Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C et al. 2005 The transcriptional landscape of the mammalian genome. Science 309 1559–1563. doi:10.1126/science.1112014.
Carroll JS, Liu XS, Brodsky AS, Li W, Meyer CA, Szary AJ, Eeckhoute J, Shao W, Hestermann EV, Geistlinger TR et al. 2005 Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell 122 33–43. doi:10.1016/j.cell.2005.05.008.
Chou J, Provot S & Werb Z 2010 GATA3 in development and cancer differentiation: cells GATA have it! Journal of Cellular Physiology 222 42–49. doi:10.1002/jcp.21943.
Delacroix L, Moutier E, Altobelli G, Legras S, Poch O, Choukrallah MA, Bertin I, Jost B & Davidson I 2010 Cell-specific interaction of retinoic acid receptors with target genes in mouse embryonic fibroblasts and embryonic stem cells. Molecular and Cellular Biology 30 231–244. doi:10.1128/MCB.00756-09.
Delest A, Sexton T & Cavalli G 2012 Polycomb: a paradigm for genome organization from one to three dimensions. Current Opinion in Cell Biology 24 405–414. doi:10.1016/j.ceb.2012.01.008.
Dieterich C, Wang H, Rateitschak K, Luz H & Vingron M 2003 CORG: a database for COmparative Regulatory Genomics. Nucleic Acids Research 31 55–57. doi:10.1093/nar/gkg007.
Dieterich C, Grossmann S, Tanzer A, Ropcke S, Arndt PF, Stadler PF & Vingron M 2005 Comparative promoter region analysis powered by CORG. BMC Genomics 6 24. doi:10.1186/1471-2164-6-24.
Dolfini D, Zambelli F, Pavesi G & Mantovani R 2009 A perspective of promoter architecture from the CCAAT box. Cell Cycle 8 4127–4137. doi:10.4161/cc.8.24.10240.
Dudley JT & Butte AJ 2009 A quick guide for developing effective bioinformatics programming skills. PLoS Computational Biology 5 e1000589. doi:10.1371/journal.pcbi.1000589.
Dufour CR, Wilson BJ, Huss JM, Kelly DP, Alaynick WA, Downes M, Evans RM, Blanchette M & Giguere V 2007 Genome-wide orchestration of cardiac functions by the orphan nuclear receptors ERRα and γ. Cell Metabolism 5 345–356. doi:10.1016/j.cmet.2007.03.007.
Elnitski L, Jin VX, Farnham PJ & Jones SJ 2006 Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Research 16 1455–1464. doi:10.1101/gr.4140006.
Ernst J & Kellis M 2010 Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology 28 817–825. doi:10.1038/nbt.1662.
Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M et al. 2011 Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473 43–49. doi:10.1038/nature09906.
Foat BC, Morozov AV & Bussemaker HJ 2006 Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22 e141–e149. doi:10.1093/bioinformatics/btl223.
Fraser P & Bickmore W 2007 Nuclear organization of the genome and the potential for gene regulation. Nature 447 413–417. doi:10.1038/nature05916.
Gazdag E, Santenard A, Ziegler-Birling C, Altobelli G, Poch O, Tora L & Torres-Padilla ME 2009 TBP2 is essential for germ cell development by regulating transcription and chromatin condensation in the oocyte. Genes and Development 23 2210–2223. doi:10.1101/gad.535209.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J et al. 2004 Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5 R80. doi:10.1186/gb-2004-5-10-r80.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J et al. 2005 Galaxy: a platform for interactive large-scale genome analysis. Genome Research 15 1451–1455. doi:10.1101/gr.4086505.
Gordon DB, Nekludova L, McCallum S & Fraenkel E 2005 TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. Bioinformatics 21 3164–3165. doi:10.1093/bioinformatics/bti481.
Hackenberg M, Carpena P, Bernaola-Galvan P, Barturen G, Alganza AM & Oliver JL 2011 WordCluster: detecting clusters of DNA words and genomic elements. Algorithms for Molecular Biology 6 2. doi:10.1186/1748-7188-6-2.
Hannenhalli S 2008 Eukaryotic transcription factor binding sites – modeling and integrative search methods. Bioinformatics 24 1325–1331. doi:10.1093/bioinformatics/btn198.
He F, Balling R & Zeng AP 2009 Reverse engineering and verification of gene networks: principles, assumptions, and limitations of present methods and future perspectives. Journal of Biotechnology 144 190–203. doi:10.1016/j.jbiotec.2009.07.013.
Hertz GZ & Stormo GD 1999 Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15 563–577. doi:10.1093/bioinformatics/15.7.563.
Hertz GZ, Hartzell GW III & Stormo GD 1990 Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computer Applications in the Biosciences 6 81–92.
Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP & Wasserman WW 2005 oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Research 33 3154–3164. doi:10.1093/nar/gki624.
Hsiao A, Ideker T, Olefsky JM & Subramaniam S 2005 VAMPIRE microarray suite: a web-based platform for the interpretation of gene expression data. Nucleic Acids Research 33 W627–W632. doi:10.1093/nar/gki443.
Huang da W, Sherman BT & Lempicki RA 2009a Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research 37 1–13. doi:10.1093/nar/gkn923.
Huang da W, Sherman BT & Lempicki RA 2009b Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols 4 44–57. doi:10.1038/nprot.2008.211.
Huss M 2010 Introduction into the analysis of high-throughput-sequencing based epigenome data. Briefings in Bioinformatics 11 512–523. doi:10.1093/bib/bbq014.
Ioshikhes IP, Albert I, Zanton SJ & Pugh BF 2006 Nucleosome positions predicted through comparative genomics. Nature Genetics 38 1210–1215. doi:10.1038/ng1878.
Jeziorska DM, Jordan KW & Vance KW 2009 A systems biology approach to understanding cis-regulatory module function. Seminars in Cell & Developmental Biology 20 856–862. doi:10.1016/j.semcdb.2009.07.007.
Jin VX, Apostolos J, Nagisetty NS & Farnham PJ 2009 W-ChIPMotifs: a web application tool for de novo motif discovery from ChIP-based high-throughput data. Bioinformatics 25 3191–3193. doi:10.1093/bioinformatics/btp570.
Johnson DS, Mortazavi A, Myers RM & Wold B 2007 Genome-wide mapping of in vivo protein–DNA interactions. Science 316 1497–1502. doi:10.1126/science.1141319.
Jothi R, Cuddapah S, Barski A, Cui K & Zhao K 2008 Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. Nucleic Acids Research 36 5221–5231. doi:10.1093/nar/gkn488.
Juven-Gershon T, Hsu JY, Theisen JW & Kadonaga JT 2008 The RNA polymerase II core promoter – the gateway to transcription. Current Opinion in Cell Biology 20 253–259. doi:10.1016/j.ceb.2008.03.003.
Koyuturk M 2010 Algorithmic and analytical methods in network biology. Wiley Interdisciplinary Reviews Systems Biology and Medicine 2 277–292. doi:10.1002/wsbm.61.
Kuttippurathu L, Hsing M, Liu Y, Schmidt B, Maskell DL, Lee K, He A, Pu WT & Kong SW 2011 CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments. Bioinformatics 27 715–717. doi:10.1093/bioinformatics/btq707.
LiberzonASubramanianAPinchbackRThorvaldsdottirHTamayoPMesirovJP2011Molecular signatures database (MSigDB) 3.0. Bioinformatics271739–1740. doi:10.1093/bioinformatics/btr260.
Lieberman-AidenEvan BerkumNLWilliamsLImakaevMRagoczyTTellingAAmitILajoieBRSaboPJDorschnerMO2009Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science326289–293. doi:10.1126/science.1181369.
LinCYVegaVBThomsenJSZhangTKongSLXieMChiuKPLipovichLBarnettDHStossiF2007Whole-genome cartography of estrogen receptor α binding sites. PLoS Genetics3e87doi:10.1371/journal.pgen.0030087.
ListerRPelizzolaMKidaYSHawkinsRDNeryJRHonGAntosiewicz-BourgetJO'MalleyRCastanonRKlugmanS2011Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature47168–73. doi:10.1038/nature09798.
Liu XS, Brutlag DL & Liu JS 2002 An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology 20 835–839. doi:10.1038/nbt717.
Mahony S & Benos PV 2007 STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Research 35 W253–W258. doi:10.1093/nar/gkm272.
McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM & Bejerano G 2010 GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology 28 495–501. doi:10.1038/nbt.1630.
Medina-Rivera A, Abreu-Goodger C, Thomas-Chollier M, Salgado H, Collado-Vides J & van Helden J 2011 Theoretical and empirical quality assessment of transcription factor-binding motifs. Nucleic Acids Research 39 808–824. doi:10.1093/nar/gkq710.
Morozova O & Marra MA 2008 Applications of next-generation sequencing technologies in functional genomics. Genomics 92 255–264. doi:10.1016/j.ygeno.2008.07.001.
Navlakha S & Bar-Joseph Z 2011 Algorithms in nature: the convergence of systems biology and computational thinking. Molecular Systems Biology 7 546. doi:10.1038/msb.2011.78.
Nikolova EN, Kim E, Wise AA, O'Brien PJ, Andricioaei I & Al-Hashimi HM 2011 Transient Hoogsteen base pairs in canonical duplex DNA. Nature 470 498–502. doi:10.1038/nature09775.
van Nimwegen E 2007 Finding regulatory elements and regulatory motifs: a general probabilistic framework. BMC Bioinformatics 8 (Suppl 6) S4. doi:10.1186/1471-2105-8-S6-S4.
Ongenaert M 2010 Epigenetic databases and computational methodologies in the analysis of epigenetic datasets. Advances in Genetics 71 259–295. doi:10.1016/B978-0-12-380864-6.00009-2.
Pavesi G, Mereghetti P, Mauri G & Pesole G 2004 Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research 32 W199–W203. doi:10.1093/nar/gkh465.
Pevzner PA 2004 Educating biologists in the 21st century: bioinformatics scientists versus bioinformatics technicians. Bioinformatics 20 2159–2161. doi:10.1093/bioinformatics/bth217.
Pierstorff N, Bergman CM & Wiehe T 2006 Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA. Bioinformatics 22 2858–2864. doi:10.1093/bioinformatics/btl499.
Riddle NC, Minoda A, Kharchenko PV, Alekseyenko AA, Schwartz YB, Tolstorukov MY, Gorchakov AA, Jaffe JD, Kennedy C, Linder-Basso D et al. 2011 Plasticity in patterns of histone modifications and chromosomal proteins in Drosophila heterochromatin. Genome Research 21 147–163. doi:10.1101/gr.110098.110.
Rister J & Desplan C 2010 Deciphering the genome's regulatory code: the many languages of DNA. BioEssays 32 381–384. doi:10.1002/bies.200900197.
Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP & Widom J 2006 A genomic code for nucleosome positioning. Nature 442 772–778. doi:10.1038/nature04979.
Sismondi P, Biglia N, Ponzone R, Fuso L, Scafoglio C, Cicatiello L, Ravo M, Weisz A, Cimino D, Altobelli G et al. 2007 Influence of estrogens and antiestrogens on the expression of selected hormone-responsive genes. Maturitas 57 50–55. doi:10.1016/j.maturitas.2007.02.019.
Stossi F, Likhite VS, Katzenellenbogen JA & Katzenellenbogen BS 2006 Estrogen-occupied estrogen receptor represses cyclin G2 gene expression and recruits a repressor complex at the cyclin G2 promoter. Journal of Biological Chemistry 281 16272–16278. doi:10.1074/jbc.M513405200.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES et al. 2005 Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102 15545–15550. doi:10.1073/pnas.0506580102.
Tavera-Mendoza LE, Mader S & White JH 2006 Genome-wide approaches for identification of nuclear receptor target genes. Nuclear Receptor Signaling 4 e018.
Thomas-Chollier M, Sand O, Turatsinze JV, Janky R, Defrance M, Vervisch E, Brohee S & van Helden J 2008 RSAT: regulatory sequence analysis tools. Nucleic Acids Research 36 W119–W127. doi:10.1093/nar/gkn304.
Tolstorukov MY, Volfovsky N, Stephens RM & Park PJ 2011 Impact of chromatin structure on sequence variability in the human genome. Nature Structural & Molecular Biology 18 510–515. doi:10.1038/nsmb.2012.
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ et al. 2005 Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23 137–144. doi:10.1038/nbt1053.
Van Loo P & Marynen P 2009 Computational methods for the detection of cis-regulatory modules. Briefings in Bioinformatics 10 509–524. doi:10.1093/bib/bbp025.
Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F et al. 2009 ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457 854–858. doi:10.1038/nature07730.
Wasserman WW & Sandelin A 2004 Applied bioinformatics for the identification of regulatory elements. Nature Reviews. Genetics 5 276–287. doi:10.1038/nrg1315.
Wingreen N & Botstein D 2006 Back to the future: education for systems-level biologists. Nature Reviews. Molecular Cell Biology 7 829–832. doi:10.1038/nrm2023.
Won KJ, Chepelev I, Ren B & Wang W 2008 Prediction of regulatory elements in mammalian genomes using chromatin signatures. BMC Bioinformatics 9 547. doi:10.1186/1471-2105-9-547.
Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW et al. 2009 BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biology 10 R130. doi:10.1186/gb-2009-10-11-r130.
Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES & Kellis M 2005 Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434 338–345. doi:10.1038/nature03441.
Zambelli F, Pesole G & Pavesi G 2009 Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Research 37 W247–W252. doi:10.1093/nar/gkp464.
Zhang J, Chiodini R, Badr A & Zhang G 2011 The impact of next-generation sequencing on genomics. Journal of Genetics and Genomics 38 95–109. doi:10.1016/j.jgg.2011.02.003.