CpG Island Array FAQ

Questions

  1. What’s a CpG Island?
  2. Why are CpG Islands important?
  3. What does CGI mean?
  4. How was the CpG Island library spotted down on the slides derived?
  5. I’m confused, how come some of the clones don’t seem to have an annotated CpG island in their region or nearby?
  6. How do you determine the genomic locations of clones?
  7. How come some sequences have a warning about their genomic location?
  8. How do you determine the annotation of nearby genes and other features?
  9. How often do you update your databases?
  10. Wait a minute! How come some of these CpG islands are downstream or within a gene? I thought they were supposed to be upstream in the promoter regions.
  11. In the multi-search screen, how come I can’t search more than 200 clones at a time?
  12. Why don’t you provide mySQL dumps of your database?
  13. Why does my browser have trouble with your website? Why don't you support the browser "foo"?
  14. I’ve found something cool in your library. How about adding a feature or doing a collaboration?
  15. Can I link directly into your database?
  16. Who made all this?
  17. How do I interpret genomic location or BLAT result?



Answers

  1. What’s a CpG Island?

    A CpG Island is a stretch of DNA with a high density of CpG dinucleotides, that is, a C immediately followed by a G on the same strand. The formal definition we use here is a stretch of DNA at least 200 bp long with at least 50% GC content.
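    As a rough illustration, the definition above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical and it is not the code used to build this database. It applies the two criteria stated here (length and GC content) to a single sequence.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def meets_faq_definition(seq: str, min_length: int = 200, min_gc: float = 0.5) -> bool:
    """True if seq is at least min_length bases long with GC content >= min_gc.

    The defaults mirror the definition in this FAQ (200 bp, 50% GC).
    """
    return len(seq) >= min_length and gc_content(seq) >= min_gc
```

    For example, a 200-base run of alternating G and C passes both criteria, while a 100-base run fails on length alone.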



  2. Why are CpG Islands important?

    CpG Islands are important because they represent areas of the genome that have for some reason been protected from the mutating effects of methylation through evolutionary time (methylated cytosines tend to deaminate, which changes the C in CpG pairs to a T). Often, they point to the presence of an important piece of intergenic DNA, such as that found in the promoter regions of genes where transcription factors bind.


  3. What does CGI mean?

    It’s just an acronym for CpG Island that’s used in the literature.



  4. How was the CpG Island library spotted down on the slides derived?

    The array that we have was spotted from a library of CpG islands purchased from the Sanger Centre. Here’s a simple flowchart of the process:



    The library was made by cutting genomic DNA with MseI, which cuts at TTAA sites. Methylated fragments (those that aren’t protected and are therefore probably not CpG islands) are then pulled out on a column and discarded. The remaining fragments are artificially methylated and run through a column that pulls out the methylated fragments; these are the CpG islands. These pieces are then cloned into vectors, grown on plates, picked, amplified and spotted. For a more detailed description of the methodology, please refer to Cross, S.H., et al., Purification of CpG Islands using a methylated DNA binding column. Nature Genetics (1994) 6:236-244.



  5. I’m confused, how come some of the clones don’t seem to have an annotated CpG island in their region or nearby?

    The library is made experimentally and thus possibly has some ‘noise’ associated with it. However, it should be noted that the definition of a CpG island is simply an arbitrary construction based on some experimental evidence and theory. In fact, many clones on our arrays don’t have a ‘formal’ CpG island, but you will notice that they often lie in GC-rich regions.



  6. How do you determine the genomic locations of clones?

    We run a local instance of BLAT with the default switches against the genome, which is filtered for repetitive sequence using RepeatMasker. We also determine the repeat content within each clone using RepeatMasker. When possible, we align a contig derived from our 5’ and 3’ sequencing (sequencing was done at Canada’s Michael Smith Genome Sciences Centre). In the absence of a contig, we use the longer of the 5’ and 3’ reads, provided it passes our quality threshold: at least 33% of the bases in the read must have a PHRED score greater than 20. If no read passes the threshold, we simply choose the longest Sanger sequence. Unfortunately the Sanger sequences had no corresponding quality scores, so you should be careful using them. If there is no BCGSC sequence data, we provide the Sanger sequence reads (if present). While the Sanger reads are better than nothing, this information should also be used carefully, as we noticed that some problems (cross-contaminations) occurred during transfer of the library from Sanger.
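    The selection order described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not our production code: the function names and the data layout (a read as a sequence paired with a list of PHRED scores) are hypothetical, and the fallback to the longest Sanger read follows our reading of the paragraph above.

```python
from typing import List, Optional, Sequence, Tuple


def passes_quality(phred: Sequence[int], min_score: int = 20,
                   min_fraction: float = 0.33) -> bool:
    """True if at least min_fraction of the bases have a PHRED score > min_score."""
    if not phred:
        return False
    good = sum(1 for q in phred if q > min_score)
    return good / len(phred) >= min_fraction


def choose_query(contig: Optional[str],
                 reads: List[Tuple[str, Sequence[int]]],
                 sanger_reads: List[str]) -> Optional[str]:
    """Pick the sequence to align (hypothetical helper).

    Order of preference: the contig, then the longest read that passes
    the quality threshold, then the longest Sanger read (no quality scores).
    """
    if contig:
        return contig
    passing = [seq for seq, phred in reads if passes_quality(phred)]
    if passing:
        return max(passing, key=len)
    if sanger_reads:
        return max(sanger_reads, key=len)
    return None
```

    A sequence returned by the last branch carries no quality information, which is why such clones should be treated with extra caution.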



  7. How come some sequences have a warning about their genomic location?

    This warning flag is set when a single clone has multiple high-quality matches against the genome (i.e. the clone has more than one BLAT result for which the match percentage is greater than 90%, the hit covers more than 90% of the sequence, and the repeat content of the sequence is less than 90%). This can occur if the clone has a high percentage of repetitive elements within it. It can also occur if the clone aligns to a segment of genomic DNA that is a translocation (i.e. was itself derived from another piece of DNA). If you follow these clones through to the UCSC browser, this is indicated by the track near the bottom, which is derived from BLASTZ alignments. Finally, there are some clones that span chimeric contaminations, which will of course hit two separate segments of the genome with high quality. Generally speaking, you will want to filter such spots out of your analysis, or at least proceed with caution.
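    The rule above can be written down as a short Python sketch. The class and function names are hypothetical, and treating the repeat content as a property of the whole clone sequence (rather than of each hit) is our assumption about how the three thresholds combine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlatHit:
    """One BLAT result for a clone (hypothetical, simplified record)."""
    match_percent: float     # percent identity of the alignment
    coverage_percent: float  # percent of the clone sequence covered by the hit


def has_location_warning(hits: List[BlatHit], repeat_percent: float,
                         threshold: float = 90.0) -> bool:
    """True if the clone has more than one high-quality hit.

    A hit counts as high quality when both its match percentage and its
    coverage exceed the threshold; clones whose repeat content is at or
    above the threshold are not flagged by this rule.
    """
    if repeat_percent >= threshold:
        return False
    strong = [h for h in hits
              if h.match_percent > threshold and h.coverage_percent > threshold]
    return len(strong) > 1
```

    A clone with two hits at 99% identity and 95% coverage would be flagged, while a clone with one such hit and one weak hit would not.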



  8. How do you determine the annotation of nearby genes and other features?

    To find within, upstream and downstream genes, we downloaded tables from UCSC and installed them into our local instance of mySQL. From there we used the genomic location determined by BLAT to locate the features of interest. The rest of the annotations come from links to tables in the GO, NCBI UniGene, NCBI EST, NCBI Gene, and NCBI RefSeq databases, which we have installed locally.



  9. How often do you update your databases?

    Keeping all databases in sync and up to date on a daily basis is essentially impossible. One reason is that publicly available data sources are moving targets and are never updated synchronously at their respective sources. It also takes a few days for us to construct all the intermediate tables linking all the information. We have decided to do quarterly updates of this CpG database, and if you check our database statistics section, you will see the versions of all the data sources. All annotations from previous versions are archived, and you can download their tables here when they become available.



  10. Wait a minute! How come some of these CpG islands are downstream or within a gene? I thought they were supposed to be upstream in the promoter regions.

    CpG islands occur all over the place, and there is increasing evidence that the promoter regions and TF binding sites are not necessarily all upstream of genes (see this paper). In fact, we have found that there are many clones in this library that could be classified as regions near ‘bidirectional genes’.



  11. In the multi-search screen, how come I can’t search more than 200 clones at a time?

    If you’re searching this many clones you should simply download all the annotations for the clones off our web site. Any time a database, such as Unigene, is updated, this file is also updated.



  12. Why don’t you provide mySQL dumps of your database?

    We can provide you with the basic annotations, but our database is built on an integrative approach. By this we mean that the annotation is constructed from many other databases (NCBI, UCSC, etc.), and a dump of all of them would be unwieldy. It’s probably best to just download the detailed annotation from here and then build your own database.



  13. Why does my browser have trouble with your website? Why don't you support the browser "foo"?

    We try to support the most commonly used browsers, but unfortunately the standards in the browser arena aren’t really much in the way of standards! To that end, we have tried to make sure that things work under the most commonly used browsers, like Firefox, IE, and Safari. If you’re using something really old (like Netscape 4.8…), then you will also run into troubles with rendering due to CSS and JavaScript nuances. And if you’re using Firefox, check out our nifty plugin!



  14. I’ve found something cool in your library. How about adding a feature or doing a collaboration?

    We’re always interested in feedback, suggestions and collaborations, though whether those will become a reality is difficult to say given time and resources. Still, please let us know!



  15. Can I link directly into your database?

    Of course! We have a Java servlet to do just that. Point your program (for example, this works in GeneSpring) to http://data.microarrays.ca/cpg/searchsingle.jsp?id=<foo> where <foo> is simply the UHN id for that sequence (e.g. UHNhscpg0000001). This will bring you directly into a clone card. There is a file with all the annotation located here, so please don’t hit us with a bot grabbing everything. If we notice somebody doing this, we will block their IP from access! However, if you do have a valid reason for doing this, just contact us and we’ll work something out.
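    If you are generating these links from a script, a small helper along these lines may be convenient. It is only a sketch: the function name is hypothetical, and it simply fills the id parameter of the URL given above without contacting the server.

```python
from urllib.parse import urlencode

# Base address of the clone card servlet, as given in this FAQ.
BASE_URL = "http://data.microarrays.ca/cpg/searchsingle.jsp"


def clone_card_url(uhn_id: str) -> str:
    """Build the direct link to the clone card for one UHN id (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'id': uhn_id})}"


if __name__ == "__main__":
    print(clone_card_url("UHNhscpg0000001"))
```

    Using urlencode keeps the link valid even if an id ever contains characters that need escaping.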



  16. Who made all this?

    This browser and the searches were all made by the informatics core at the UHN microarray centre. Our group currently consists of two full-time employees, Carl Virtanen and Zhibin Lu. Additional work was done by some great co-op students: Catherine Kwok, Lynette Lau, Melissa Ma, and Mark Superina.

    This project has been funded at least in part with Federal funds from the Department of Health and Human Services under Contract Number NO1-CO-12400. The contents of this publication do not necessarily reflect the views or policies of the Department of Health and Human Services nor does mention of trade names, commercial products, or organization imply endorsement by the US Government.



  17. How do I interpret genomic location or BLAT result?

    The BLAT result is displayed as 'chrom:chromStart-chromEnd'. chrom is the name of the chromosome. The first base in a chromosome is numbered 0. chromStart is the start position of the BLAT result, and chromEnd is the end position of the BLAT result. The chromEnd base is NOT included in the hit. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100.
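    To make the convention concrete, here is a minimal Python sketch that parses a location string and converts it to the 1-based, inclusive coordinates some other tools expect. The names are hypothetical and the parser assumes the plain 'chrom:chromStart-chromEnd' layout described above.

```python
import re
from typing import NamedTuple, Tuple


class Location(NamedTuple):
    """A BLAT location: 0-based start, end not included in the hit."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # With a half-open interval the length is simply end - start.
        return self.end - self.start


def parse_location(text: str) -> Location:
    """Parse a 'chrom:chromStart-chromEnd' string into a Location."""
    match = re.fullmatch(r"([^:\s]+):(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a chrom:chromStart-chromEnd string: {text!r}")
    return Location(match.group(1), int(match.group(2)), int(match.group(3)))


def to_one_based(loc: Location) -> Tuple[str, int, int]:
    """Convert to 1-based coordinates with both ends included."""
    return (loc.chrom, loc.start + 1, loc.end)
```

    For the example above, parse_location("chr1:0-100") has a length of 100 bases and corresponds to bases 1 through 100 in 1-based, inclusive numbering.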
