Pixel-based DNA representations

 

Dong Hyun Jeong, COIT, UNCC

 

DNA sequences generally range in length from a few thousand to about 20 million bases. In contrast, gene (protein) sequences are only a few thousand long. To manage the data for a single protein sequence and present it in a meaningful, coordinated way, we use a pixel-based visualization technique [Wong03]. This technique generates color glyphs that represent the protein data.

The pixel-based visualization technique consists of several steps that arrange the pixel information and reveal hidden features of gene sequences. To arrange pixels so that they map back to the original sequence, a space-filling method is needed. Several such methods have been designed to be regular and easy to follow. Keim [Keim00] proposed a query-dependent arrangement technique after testing two existing arrangements, the Peano-Hilbert curve and the Morton curve, for how difficult each curve is to follow and to understand intuitively. Its main disadvantage, however, is that it breaks the curve into pieces [Wong03]. Hence, we use Hilbert curve ordering to map sequence information to color information. Figure 2 shows the basic Hilbert curve ordering. Whenever the order number (n) changes, the overall size of the mapped matrix changes.

 

Order-dependent Hilbert curve ordering, which is always self-similar on a 2^n x 2^n grid (n = 1, 2, 3, ...).

To map gene sequences, the Hilbert curve order is set to 12, which covers a 2^12 x 2^12 grid (4096 x 4096 positions), so that gene sequences are laid out with the same sequential pattern. Rather than drawing the sequence characters themselves, each position is drawn using a color code.
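As a minimal sketch of how a position along a sequence can be turned into a pixel coordinate, the Python function below converts an index along a Hilbert curve of a given order into (x, y). It follows the widely published iterative index-to-coordinate conversion. The function name hilbert_d2xy and the orientation of the curve are assumptions for illustration and may differ from the original application.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve of the given order to (x, y).

    The curve fills a 2**order x 2**order grid. The name and the curve
    orientation are illustrative assumptions.
    """
    n = 1 << order  # side length of the grid
    if not 0 <= d < n * n:
        raise ValueError("index outside the curve")
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current quadrant so sub-curves join up.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# The four cells of the order-1 curve, in visiting order.
first_order = [hilbert_d2xy(1, d) for d in range(4)]
```

With order 12 the same function accepts indices from 0 to 2^24 - 1, one per position of the 4096 x 4096 grid. Consecutive indices always land on neighboring cells, which is what keeps nearby parts of a sequence close together in the image.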

In general, DNA is a linear polymer made up of individual chemical units called nucleotides or bases. The four nucleotides that make up the DNA sequences of living things are adenine, guanine, cytosine, and thymine, designated A, G, C, and T. As mentioned above, the DNA used here, from the budding yeast S. cerevisiae, can be expanded into almost 200 genes, which can be translated into protein using the genetic code. Hence, two different approaches are used to map gene information to color pixels: one for the gene sequence and one for the translated protein.
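Translation with the genetic code can be sketched as follows. The table is the standard genetic code written in the usual compact T, C, A, G ordering. The function name translate, the use of 'X' for codons containing unknown symbols, and the choice to ignore an incomplete trailing codon are assumptions for illustration. '*' marks a stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at position 0."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


example = translate("ATGGCCTAA")  # "MA*"
```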

To determine the color codes, four commonly used color maps proposed by other researchers are compared. Gene pairs from the Chlamydia muridarum (strain Nigg) genome are used to find the most effective color code.

 

      


Pixel-based gene visualization with several different color codings: (a) A (white), C (yellow), G (orange), T (dark brown) [Wong03], (b) A (orange), C (blue), G (purple), T (yellow) [Sal05], (c) A (green), C (blue), G (black), T (red) [Rou97], and (d) A (red), C (blue), G (green), T (yellow) [Rei00]. The black area in the bottom left of each image represents empty space.
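A color coding of this kind is simply a lookup from base to color. The sketch below uses coding (c); the source names the colors only, so the exact RGB triples, the gray used for unknown symbols, and the function name sequence_to_pixels are assumptions for illustration.

```python
# Color coding (c): A green, C blue, G black, T red.
# The RGB values are assumed pure primaries; the source gives names only.
COLOR_CODE_C = {
    "A": (0, 255, 0),
    "C": (0, 0, 255),
    "G": (0, 0, 0),
    "T": (255, 0, 0),
}
EMPTY = (0, 0, 0)          # unused positions are drawn black
UNKNOWN = (128, 128, 128)  # assumed color for symbols other than A, C, G, T


def sequence_to_pixels(sequence, total=None, color_code=COLOR_CODE_C):
    """Return one RGB tuple per base, padded with EMPTY up to `total`."""
    pixels = [color_code.get(base, UNKNOWN) for base in sequence.upper()]
    if total is not None:
        if len(pixels) > total:
            raise ValueError("sequence does not fit in the image")
        pixels.extend([EMPTY] * (total - len(pixels)))
    return pixels


# A 6-base sequence padded to fill a 4 x 4 image (16 positions).
pixels = sequence_to_pixels("acgtta", total=16)
```

Each pixel in the returned list would then be placed at the grid position given by the curve ordering.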

All images are generated using the Hilbert curve ordering method. Even with different color maps, it is not always possible to distinguish sequences or find special features. Therefore, a pixel enhancement technique and digital image-processing filters are applied [Wong03]. First, a Gaussian filter smooths the high-frequency values. Then histogram equalization modifies the dynamic range and contrast of the image for each color channel; most images used in computer systems have three channels (R, G, B). Finally, saturation is increased by extrapolation, as described under Saturation Adjustment.

 

Gaussian filtering

The Gaussian smoothing operator is a 2-D convolution operator that is used to "blur" images and remove detail and noise. In 2-D, an isotropic (i.e. circularly symmetric) Gaussian has the form:

G(x, y) = (1 / (2 π σ^2)) exp(-(x^2 + y^2) / (2 σ^2))

In our application, Gaussian masks with a radius of 36 pixels and a sigma of 10 are used. The filter also produces a distributed (smoothed) view of the gene expressions.

2-D Gaussian distribution with mean (0,0) and σ = 1
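A mask with the stated radius and sigma can be built directly from the isotropic Gaussian. The sketch below is one way to do it; the function name gaussian_kernel and the choice to normalize by the sum of the truncated mask (rather than by the analytic constant) are assumptions, made so that smoothing does not change overall brightness.

```python
import math


def gaussian_kernel(radius, sigma):
    """Build a (2*radius + 1) square Gaussian mask whose entries sum to 1."""
    size = 2 * radius + 1
    kernel = [
        [
            math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]
    # Normalize the truncated mask so the filter preserves mean intensity.
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


# Radius 36 and sigma 10 as stated in the text: a 73 x 73 mask.
mask = gaussian_kernel(36, 10)
```

Convolving each color channel of the image with this mask gives the smoothed result.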

 

      

Gaussian-filtered images, panels (a) to (d)

 

Histogram Equalization

Histogram equalization improves contrast by spreading intensities toward a more uniform histogram. It can be applied to a whole image or to just part of one. Histogram equalization does not "flatten" a histogram; it redistributes the intensity distribution. If the histogram of an image has many peaks and valleys, it will still have peaks and valleys after equalization, but they will be shifted. For this reason, "spreading" is a better term than "flattening" to describe histogram equalization.

 

OPERATION

1. Compute the histogram of each color channel.

2. Calculate the normalized cumulative sum of each histogram.

3. Transform the input pixels to the output image by looking up each value in the normalized cumulative histogram.
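The three steps above can be sketched in plain Python as follows. This is the simplest form of the technique, assuming 8-bit channel values (0 to 255); the function names are illustrative, and other implementations differ in details such as subtracting the smallest cumulative value before scaling.

```python
def equalize_channel(values, levels=256):
    """Equalize one channel given as a flat list of integer intensities."""
    if not values:
        return []
    # Step 1: histogram of the channel.
    histogram = [0] * levels
    for v in values:
        histogram[v] += 1
    # Step 2: normalized cumulative sum, scaled back to the intensity range.
    lookup = []
    running = 0
    for count in histogram:
        running += count
        lookup.append(round((levels - 1) * running / len(values)))
    # Step 3: map every input value through the lookup table.
    return [lookup[v] for v in values]


def equalize_rgb(pixels):
    """Equalize R, G and B independently, then merge the channels again."""
    if not pixels:
        return []
    channels = [list(channel) for channel in zip(*pixels)]
    equalized = [equalize_channel(channel) for channel in channels]
    return list(zip(*equalized))


result = equalize_channel([10, 10, 10, 20])  # [191, 191, 191, 255]
```

In the example, the narrow input range 10 to 20 is spread out to 191 to 255, while the order of the intensities is preserved.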

 

Red channel, panels (a) to (d)

 

Green channel, panels (a) to (d)

 

Blue channel, panels (a) to (d)

 

Histogram-equalized images after merging the channels, panels (a) to (d)

 

Saturation Adjustment

To alter saturation, pixel components must move toward or away from the pixel's luminance value. Using a black-and-white (grayscale) version of the image as the degenerate image, each output pixel is a blend of the form (1 - alpha) * degenerate + alpha * input. Saturation is decreased by interpolation (alpha between 0 and 1) and increased by extrapolation (alpha greater than 1). This avoids the computationally more expensive conversions to and from HSV space. Repeated updates in an interactive application are especially fast, since the luminance of each pixel need not be recomputed. A negative alpha preserves luminance but inverts the hue of the input image.
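A minimal per-pixel sketch of this interpolation and extrapolation is shown below. The blend factor is called alpha as in the text; the luminance weights (the common Rec. 601 values), the clamping to 0 to 255, and the function names are assumptions for illustration.

```python
def luminance(r, g, b):
    """Approximate luminance; the Rec. 601 weights are an assumed choice."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def adjust_saturation(pixel, alpha):
    """Blend a pixel with its gray value.

    alpha = 0 gives gray, alpha = 1 leaves the pixel unchanged,
    and alpha > 1 extrapolates away from gray (more saturation).
    """
    r, g, b = pixel
    gray = luminance(r, g, b)

    def blend(component):
        value = (1 - alpha) * gray + alpha * component
        return max(0, min(255, round(value)))  # clamp to the 8-bit range

    return (blend(r), blend(g), blend(b))


original = (200, 100, 50)
desaturated = adjust_saturation(original, 0.0)
boosted = adjust_saturation(original, 2.0)
```

With alpha = 2 the components move twice as far from the gray value as in the input, so the strongest channel grows and the weakest shrinks until they hit the clamping limits.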

 

   

Saturation-adjusted images, panels (a) to (d)

 

Although all pixel-enhanced images give similar results, the representation using color code (c) (A (green), C (blue), G (black), T (red)) matches the patterns spread across the R, G, and B channels almost perfectly. The per-channel histogram-equalized images map closely onto the final image, so one can easily see which areas of the image are dense in adenine, guanine, cytosine, or thymine. Therefore, all gene expressions are displayed as images using this color code (A (green), C (blue), G (black), T (red)). To lay out the generated gene images in 2D space, we apply multidimensional scaling, which is discussed in the next section.

 

Applications

To evaluate the usefulness of pixel-based DNA representations, the method is applied in several applications, such as Genomic Visualization (GVis) [Hong05] and Pathway Visualization (PVis).

GVis (A Scalable Visualization Framework for Genomic Data) is a framework for browsing the phylogeny hierarchy of organisms from the highest level down to an individual organism of interest, and for analyzing each gene of interest by launching the gene-finding and gene-match analysis tools. The framework lets one navigate through and explore large amounts of genomic data (thousands of genomes or more) using a 2.5D space layout. All genomic data used in the GVis framework follow the NCBI GenBank flat-file format. The publicly available GenBank files are a set of ASCII text files, most of which contain gene sequence data, plus supplemental information such as lists of author names, journal citations, gene names, keywords, and accession numbers of the records. By extracting several important features from the GenBank files, we create our own binary GVis data files.
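As an illustration of extracting one feature from a GenBank flat file, the sketch below pulls the nucleotide sequence out of the ORIGIN section, which lists the bases in numbered rows and ends with a "//" line. It is not the GVis converter: the function name is hypothetical, only the sequence is read, and the binary GVis format is not reproduced here.

```python
def read_genbank_sequence(lines):
    """Return the sequence found in the ORIGIN section(s) of a GenBank file.

    `lines` is any iterable of text lines. If the input holds several
    records, their sequences are concatenated (a simplification).
    """
    bases = []
    in_origin = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):  # end-of-record marker
            in_origin = False
            continue
        if in_origin:
            # Drop the leading position number and the spaces between blocks.
            bases.extend(ch for ch in line if ch.isalpha())
    return "".join(bases).upper()


sample = [
    "LOCUS       EXAMPLE                   26 bp    DNA",
    "ORIGIN",
    "        1 gatcctccat atacaacggt",
    "       21 atctcc",
    "//",
]
sequence = read_genbank_sequence(sample)
```

For a real file, the function can be called with an open file object, for example read_genbank_sequence(open(path)) where path is the location of the GenBank file.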

 

 

Regulatory pathway visualization is a method for analyzing dynamic regulatory pathways which, estimated from DNA microarray time-series information, provide predictions of dynamic interactions among genes. Although several analysis methods have been studied, most of them offer little support for analyzing or observing these interactions. In contrast, our visual approach is based on the proposed gene regulatory pathway prediction method [Dar04]. The visualization uses the pixel-based gene visualization techniques.

 

Simple Application Demo

Download two files (hilbert12.zip, Hilbert_openGL_test.zip) and extract them to a directory of your choice. In particular, hilbert12.zip has to be extracted into the same directory as the executable file.

 

References

[Hong05] J. Hong, D.H. Jeong, C.D. Shaw, W. Ribarsky, M. Borodovsky, and C. Song, "GVis: A Scalable Visualization Framework for Genomic Data," pp. 191-198, EuroVis 2005.

[Keim00] D. A. Keim, "Designing pixel-oriented visualization techniques: Theory and applications," IEEE Transactions on Visualization and Computer Graphics, 6(1), pp. 59-78 (2000)

[Rei00] J. Reichert, A. Jabs, P. Slickers, and J. Suhnel, "The IMB Jena Image Library of Biological Macromolecules," Nucleic Acids Research, 28(1), pp. 246-249 (2000)

[Rou97] Rouchka, E.C., Mazzarella, R., States, D.J., "Computational Detection of CpG Islands in DNA" Technical Report, Washington University, Department of Computer Science, WUCS-97-39 (1997)

[Sal05] M. Sales-Pardo, R. Guimera, A.A. Moreira, J. Widom, and L.A.N. Amaral, "Mesoscopic modeling for nucleic acid chain dynamics," Physical Review E, 71(5), Art. No. 051902 (2005)

[Wong03] P.C. Wong, K.K. Wong, H. Foote, and J. Thomas, "Global Visualization and Alignments of Whole Bacterial Genomes," IEEE Transactions on Visualization and Computer Graphics, 9(3), pp. 361-377 (2003)

[Yang04] J. Yang, A. Patro, S. Huang, N. Mehta, M. O. Ward, and E. A. Rundensteiner, in Proceedings of the IEEE Symposium on Information Visualization (INFOVIS'04), pp. 73-80 (2004)