## Saturday, August 3, 2019

### Data Mining Essay -- Technology, Data Processing

#### 1 Data Pre-processing

#### 1.1 k-mers extraction

Let $K_a = (a_1, a_2, \ldots, a_k)$ be a k-mer, i.e., a contiguous subsequence of length $k$, with $a = 1, \ldots, S$, where $S$ is the total number of k-mers in the sequence. A sequence of length $L$ yields $L - k + 1$ k-mers, which can be extracted by sliding a window of length $k$ along the sequence.

#### 1.2 Generation Of Position Frequency Matrices

For the positive dataset, 500 sequences were used to calculate k-mer frequencies from three successive windows:

1. window A, from -75 to -26 bp before the polyA site;
2. window B, from -25 to -1 bp before the polyA site;
3. window C, from 1 to 25 bp after the polyA site.

The highly informative k-mer frequencies (HIK) feature vector consists of the cumulated frequencies of all monomers, dimers, and trimers in the three regions. This results in 3 regions x 4 monomer frequencies, 3 x 16 dimer frequencies, and 3 x 64 trimer frequencies, for a total of 12 + 48 + 192 = 252 features. The negative dataset was computed from frequencies in similarly spaced windows, but taken from the beginning of 500 other independent sequences (windows: A, -300 to -251 bp; B, -251 to -226 bp; and C, -225 to -201 bp).

#### 1.3 Background Probability Feature

The label space is written as $Y = \{p, n\}$, indicating that a sequence with a polyA site is detected (positive class label $p$) or not detected (negative class label $n$). A classifier, i.e., a mapping from instance space to label space, is found by learning from a set of examples. An example is of the form $z = (x, y)$ with $x \in X$ and $y \in Y$. The symbol $Z$ is used as a compact notation for $X \times Y$. Training data are a sequence of examples: $S = (x_1, y_1), \ldots, (x_n,$ ...

...clude GC-rich redundant motifs and diffuse motifs that are difficult to detect.
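The sliding-window extraction from section 1.1 and the 252-feature count from section 1.2 can be illustrated with a minimal Python sketch. It is not the essay's implementation. The function names (`extract_kmers`, `kmer_frequencies`, `hik_features`), the toy window sequences, the use of relative frequencies, and the ordering of features within the vector are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet


def extract_kmers(seq, k):
    """Slide a window of length k along seq, giving L - k + 1 k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequencies(seq, k):
    """Relative frequency of each of the 4**k possible k-mers, in a fixed order.

    Assumption: frequencies are normalised by the number of k-mers in the window.
    """
    kmers = extract_kmers(seq, k)
    counts = Counter(kmers)
    total = len(kmers)
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [counts[m] / total if total else 0.0 for m in all_kmers]


def hik_features(windows):
    """Concatenate monomer, dimer, and trimer frequencies for every window.

    Assumption: features are ordered by k first, then by window.
    """
    features = []
    for k in (1, 2, 3):
        for window in windows:
            features.extend(kmer_frequencies(window, k))
    return features


# Toy windows with the same lengths as windows A, B, and C (50, 25, 25 bp).
window_a = "ACGT" * 12 + "AC"
window_b = "AATAAA" * 4 + "A"
window_c = "TGTGT" * 5

features = hik_features([window_a, window_b, window_c])
print(len(extract_kmers(window_a, 3)))  # 50 - 3 + 1 = 48 trimers
print(len(features))                    # 3 * (4 + 16 + 64) = 252
```

With three windows and k = 1, 2, 3, the vector length is fixed at 252 regardless of the window contents, because every possible k-mer gets a slot even when its frequency is zero.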
#### Suggestions and Further Research

Motif discovery in DNA datasets is a challenging problem domain because the nature of the data is poorly understood, and the mechanisms by which proteins recognize and interact with their binding sites still perplex biologists. Hence, predicting binding sites with computational algorithms is still far from satisfactory. Many computational motif discovery algorithms have been proposed in the past decade. Like most of these algorithms, the system shares some common challenges that require further investigation. The first is the scalability of the system for large-scale datasets such as ChIP sequences. Scalability is the ability of a tool to maintain its prediction performance and efficiency as the size of the datasets increases.