Data Mining I

Course overview  

Data mining aims to find patterns and statistical dependencies in large databases and, through them, to gain an understanding of the underlying system from which the data were obtained. In computational biology, data mining contributes to the analysis of the vast amounts of experimental data generated by high-throughput technologies, and thereby enables the generation of new hypotheses.

In this course, we present the algorithmic foundations of data mining and its applications in computational biology. The course introduces popular data mining problems and algorithms, ranging from classification to clustering. Building on these techniques, we examine how they can be used to study gene expression, protein function, or the structure of biological networks. The course is intended both for students who are interested in applying data mining algorithms and for students who would like to understand the key algorithmic concepts in data mining.

Official entry in course catalog

  • Course: Data Mining I
  • Next offer: Fall 2016

Schedule (Fall 2016)

  • Lectures: Wednesdays, from 9 am to 11 am.
  • Tutorials: Wednesdays, from 11 am to 12 pm.


Grading

  • Weekly homework assignments (12 in total; 30% of the final grade)
  • Written final exam (70% of the final grade)

Course prerequisites

  • Basic understanding of mathematics, as taught in introductory mathematics courses at the Bachelor's level.

Course contents and slides (Fall 2015)

  • Introduction
  • Similarity and Distance Metrics. Similarity measures on: vectors, sets, strings and time series
  • Similarity measures on graphs: Weisfeiler-Lehman Kernel
(PDF, 8.1 MB)
  • Classification: evaluation of classifiers, cross-validation
  • Nearest Neighbour Classification
  • Naive Bayes Classifier
  • Linear Discriminant Analysis
  • Logistic Regression
  • Decision Tree
  • Support Vector Machines (SVM)
  • Kernels
(PDF, 1.7 MB)
  • Clustering: k-means
  • Kernel k-means
  • Graph-based Clustering: DBSCAN, Spectral Clustering
  • Expectation Maximization Clustering
  • Hierarchical Clustering
(PDF, 2.6 MB)
  • Feature selection based on: greedy selection, mutual information, Hilbert-Schmidt Independence Criterion (HSIC)
  • Submodular Functions
  • Feature selection in practice (analysis)
(PDF, 264 KB)
  • Applications in Computational Biology
  • Overfitting and Generalization: Deleteriousness Prediction
  • Phenotype Prediction and Epistasis
(PDF, 1.8 MB)
  • Complete list of references
(PDF, 659 KB)
© 2017 Eidgenössische Technische Hochschule Zürich