# Introduction to Machine Learning for Predictive Sequence Analysis

Course on "Machine Learning for Predictive Sequence Analysis" by Gunnar Rätsch on the 21st of March 2007 in Bertinoro, Italy.

### Abstract

Machine learning is the study of algorithms that generalize knowledge gained from empirical data. In this tutorial I will focus on supervised learning for biological sequence analysis, where a typical task is to predict properties of a sequence. Examples include protein homology detection, gene finding, and prediction of protein function. I will start with a broad introduction to machine learning, including classification, regression, semi-supervised and unsupervised learning, generalization performance, and model selection. In the second part I will focus on Support Vector Machines (SVMs), the most popular example of binary classification algorithms. They use so-called kernels, which formalize the similarity between examples and allow the design of efficient and mathematically elegant algorithms. In the third part I will introduce a few powerful kernel functions for sequence analysis in detail, with practical examples. Finally, I will discuss several applications of these techniques in computational biology.

### Overview

#### Introduction to Machine Learning

Machine learning is the study of algorithms that generalize knowledge gained from empirical data. We will focus on the supervised learning paradigm, where the algorithm is provided with training examples together with an expert's opinion of the correct answer for each. The algorithm’s task is to find a decision function that predicts well on future, unseen examples.

- Classification
- Regression
- Un- and Semi-supervised learning
- Generalization and Model selection
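
To make the idea of estimating generalization performance concrete, here is a minimal sketch of k-fold cross-validation in plain Python. It is an illustration only, not material from the course: the helper names (`k_fold_splits`, `cross_val_error`, `fit_threshold`, `predict_threshold`) and the toy one-dimensional data are assumptions made for this example, and the folds are taken in order without shuffling.

```python
def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs that partition range(n) into k folds."""
    # Spread the remainder over the first n % k folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_val_error(fit, predict, X, y, k=5):
    """Fraction of examples misclassified when each fold is held out once."""
    errors = 0
    for train, test in k_fold_splits(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(X)


# Hypothetical toy classifier: threshold halfway between the two class means.
def fit_threshold(X, y):
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else -1


# Toy data (made up): negatives near 0, positives near 2, interleaved.
X = [0.0, 2.0, 0.1, 2.1, 0.2, 2.2, 0.3, 2.3]
y = [-1, 1, -1, 1, -1, 1, -1, 1]
cv_error = cross_val_error(fit_threshold, predict_threshold, X, y, k=4)
```

For model selection, one would typically compute such an estimate for each candidate setting (for example, each kernel or regularization parameter) and keep the setting with the lowest held-out error. In practice the examples should be shuffled before splitting.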

#### Support Vector Machines and Kernels

Support Vector Machines (SVMs) maximize the margin between positive and negative training examples. They are the most popular example of binary classification algorithms (algorithms that predict “yes/no” answers) and build upon the solid foundation of statistical learning and optimization theory. They use so-called kernels, which formalize similarity between examples and allow the design of efficient and mathematically elegant algorithms. Moreover, many statistical algorithms can be reformulated in terms of kernels (usually referred to as the “kernel trick”), which allows nonlinear decision functions as well as structured data types.

- Maximal margin algorithm
- Convex optimization problems
- Positive semidefinite kernels
- Beyond 2-class classification
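
For orientation, the maximal margin idea is commonly written as the convex optimization problem below. This is the standard soft-margin textbook form, not necessarily the notation used in the course; the symbols $w$, $b$, $\xi_i$, $C$, $\alpha_i$ and $k$ are introduced here for illustration. The last line states what it means for a kernel to be positive semidefinite.

```latex
% Soft-margin SVM (primal): n training examples (x_i, y_i) with y_i in {-1, +1}
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \qquad i = 1,\dots,n
\end{aligned}

% Kernelized decision function (inner products replaced by a kernel k)
f(x) = \operatorname{sign}\!\Bigl(\sum_{i=1}^{n} \alpha_i\, y_i\, k(x_i, x) + b\Bigr)

% Positive semidefiniteness of k: for all points x_1..x_n and real c_1..c_n
\sum_{i=1}^{n}\sum_{j=1}^{n} c_i\, c_j\, k(x_i, x_j) \ge 0
```

Because the training examples enter only through $k(x_i, x)$, any positive semidefinite kernel can be substituted without changing the algorithm, which is what makes nonlinear decision functions and structured inputs such as sequences possible.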

#### Kernels for Sequences and Graphs

In this section, we explain how kernels can be defined on sequences such as DNA or amino acid sequences. These kernels are the modeling tool that allows us to apply the algorithms presented in the previous section to complex data structures arising in computational biology. We illustrate how a practitioner can construct kernels for a particular application by combining known kernels.

- Spectrum kernel and weighted degree kernel
- Guidelines for kernel design
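
As a rough illustration of the two kernels named above, the following sketch computes them directly from their usual definitions. It is a naive reference version written for this page, not the course's implementation: the function names, the default parameters, and the particular degree weights $\beta_d = 2(D-d+1)/(D(D+1))$ are assumptions of the example, and efficient implementations use different data structures.

```python
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Sum over all k-mers u of count_u(x) * count_u(y); position-independent."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(count * cy[u] for u, count in cx.items() if u in cy)


def weighted_degree_kernel(x, y, max_degree=3):
    """Weighted count of substrings of length 1..max_degree that match
    at the SAME position in both sequences (sequences must have equal length)."""
    if len(x) != len(y):
        raise ValueError("weighted degree kernel needs equal-length sequences")
    total = 0.0
    for d in range(1, max_degree + 1):
        # Assumed weighting: shorter matches get larger weights; weights sum to 1.
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total


# Two kernels can be combined into a new one, e.g. by a non-negative weighted sum.
def combined_kernel(x, y, weight=0.5):
    return (weight * spectrum_kernel(x, y, k=2)
            + (1.0 - weight) * weighted_degree_kernel(x, y, max_degree=2))
```

The spectrum kernel ignores where a k-mer occurs, while the weighted degree kernel only rewards matches at identical positions, so the choice between them (or a combination) depends on whether position carries signal in the application.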

#### Applications in Computational Biology

We discuss how several important questions in bioinformatics have been tackled using SVMs and string kernels.

- Remote homology detection
- Gene finding
- Protein function prediction

Additionally, I will mention a few software packages that implement the algorithms covered in the course.

#### Acknowledgements

This tutorial largely overlaps with the one I gave together with Cheng Soon Ong at GCB 2006.