KBS - KBS - Data Mining Intoutsi/DM1.SoSe19/lectures/... · 2019. 4. 10. · Lecture 1:...

Data Mining I Summer semester 2019 Lecture 1: Introduction Lectures: Prof. Dr. Eirini Ntoutsi TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali Fakultät für Elektrotechnik und Informatik Institut für Verteilte Systeme AG Intelligente Systeme - Data Mining group


  • Data Mining I

    Summer semester 2019

    Lecture 1: Introduction

    Lectures: Prof. Dr. Eirini Ntoutsi

    TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali

    Fakultät für Elektrotechnik und Informatik, Institut für Verteilte Systeme

    AG Intelligente Systeme - Data Mining group

  • About me

    03/2016 - present Associate Professor Faculty of Electrical Engineering & Computer Science Leibniz University Hannover L3S Research Center (since May 2016)

    02/2012 - 02/2016 Post-doctoral researcher & lecturer Institute for Informatics, LMU Munich, Germany

    02/2010 - 01/2012 Alexander von Humboldt postdoc fellow, Institute for Informatics, LMU Munich, Germany

    2009: Data Mining Expert National Hellenic Organization (OTE), Athens, Greece

    04/2007 – 02/2009 Co-Founder and AI expert

    NeeMo Startup, Greece

    09/2003 – 09/2008 PhD in Data Mining, University of Piraeus, Athens, Greece

    09/2001 – 09/2003 MSc, Computer Science/ Text Mining, Polytechnic School, University of Patras, Greece

    09/1996 – 09/2001 Diploma, Computer Engineering and Informatics/ AI Games, Polytechnic School, University of Patras, Greece

    Learning from streaming data

    Current focus areas:

    • Data Stream Mining/ Adaptive Machine Learning

    • Responsible AI: Fairness-Aware Machine Learning

  • Outline

    ■ Why study Data Mining?

    ■ Why do we need Data Mining?

    ■ What is the KDD (Knowledge Discovery in Databases) process?

    ■ Main data mining tasks

    ■ Course logistics

    ■ Things you should know from this lecture

    ■ Homework/ Tutorial

    Data Mining I @SS19: Introduction

  • Why study Data Mining/Machine Learning – famous quotes*

    ■ “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)

    ■ “Machine learning is the next Internet” (Tony Tether, Director, DARPA)

    ■ “Machine learning is the hot new thing” (John Hennessy, President, Stanford)

    ■ “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)

    ■ “Machine learning is going to result in a real revolution” (Greg Papadopoulos, Former CTO, Sun)

    ■ “Machine learning today is one of the hottest aspects of computer science” (Steve Ballmer, CEO, Microsoft)


    *Source: Pedro Domingos http://courses.cs.washington.edu/courses/cse446/15sp/slides/intro.pdf

    Disclaimer: I use the terms data mining and machine learning (and sometimes also Artificial Intelligence (AI)) interchangeably here and throughout the lecture. We will discuss the similarities/differences later. In both cases, we talk about learning from data.

  • Data Mining – Data Science – Big Data – Machine Learning – Deep Learning Analytics …

    ■ New fancy words for knowledge discovery from data

    ❑ Data mining, machine learning have been focusing on knowledge discovery from data for decades

    ❑ Well defined set of tasks and solutions

    ■ Big data and analytics are more business terms and ill-defined

    ■ The same holds today for AI

    “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

    Source: Dan Ariely, Duke University

  • Ever increasing interest … the “rebranding” effect

    Source: Google trends, query on 9.4.2019

  • Why study Data Mining - Data Scientist: The sexiest job of the 21st century

    “If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.”

    Source: Harvard Business Review. Data Scientist: The Sexiest Job of the 21st Century. October 2012. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

    Source: https://www.slideshare.net/IBMBDA/myths-and-mathemagical-superpowers-of-data-scientists

  • A good conjuncture for ML/DM/DS (data-driven learning)

    Data deluge

    Machine Learning advances

    Computer power

    Enthusiasm

  • World-wide competition on Artificial Intelligence (AI)

    ■ From FORSCHUNGSGIPFEL 2019 - Künstliche Intelligenz – Innovationstreiber einer neuen Generation, 19/3/2019, Berlin

    ■ Cédric Villani’s talk on the geopolitics of AI: There are 3 fierce competitions

    ■ Competition for human talent

    ■ Competition for infrastructure

    ■ Competition for data

    ■ More info on

    ❑ Track #fogipf19 on Twitter: https://twitter.com/hashtag/fogipf19?src=hash

    ❑ Check videos online: http://www.forschungsgipfel.de/2019/videos

  • Outline

    ■ Why study Data Mining?

    ■ Why do we need Data Mining?

    ■ What is the KDD (Knowledge Discovery in Databases) process?

    ■ Main data mining tasks

    ■ Course logistics

    ■ Things you should know from this lecture

    ■ Homework/ Tutorial

  • Why we need Data Mining

    ■ Huge amounts of data are collected nowadays from different application domains

    ■ “We are drowning in information but starving for knowledge” (John Naisbitt, http://www.kdnuggets.com/news/2007/n06/3i.html)

    ■ The amount and complexity of the collected data do not allow for manual analysis.

    Application domains: Telecommunication, Astronomy, Banks, Biology, Internet, Supermarkets, IoT

  • Examples of data sources: The Internet

    ■ Internet users

    Web 2.0: A world of opinions

    User generated content

    Source: http://www.internetlivestats.com/internet-users/

  • Examples of data sources: Internet of things

    ■ The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics, software, sensors, and network connectivity, which enables these objects to collect and exchange data.

    Source: https://en.wikipedia.org/wiki/Internet_of_Things

    Image source: http://tinyurl.com/prtfqxf

    Source: http://blogs.cisco.com/diversity/the-internet-of-things-infographic

    During 2008, the number of things connected to the internet surpassed the number of people on earth… By 2020 there will be 50 billion … vs 7.3 billion people (2015).

    These things can be anything: smartphones, tablets, refrigerators … even cattle.

  • Examples of data sources: data intensive science

    Slide from: http://research.microsoft.com/en-us/um/people/gray/talks/nrc-cstb_escience.ppt

    “Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.”

    - The Fourth Paradigm – Microsoft

    Examples of e-science applications:

    • Earth and environment

    • Health and wellbeing

      − E.g., The Human Genome Project (HGP)

    • Citizen science

    • Scholarly communication

    • Basic science

      − E.g., CERN

  • Examples of data sources: Manufacturing

    ■ Andrew Ng Says Factories Are AI’s Next Frontier

    Source: https://www.technologyreview.com/s/609770/andrew-ng-says-factories-are-ais-next-frontier/

    Image source: https://images.readwrite.com/wp-content/uploads/2018/03/AAEAAQAAAAAAAAueAAAAJDY1NmFlN2NhLWExZTUtNDRhNy1iMWQ5LTViZGM3NTFlODczYQ.jpg

    Companies are making major investments in AI and industrial analytics to help drive their digital transformation

    Image source: https://cdn-sv1.deepsense.ai/wp-content/uploads/2018/04/Spot-the-flaw-Visual-quality-control-in-manufacturing-1140x337.jpg

  • Examples of data sources: We … the data subjects

    ■ Wherever we go, we are "datafied".

    ■ Smartphones are tracking our locations.

    ■ We leave a data trail in our web browsing.

    ■ Interaction in social networks.

    ■ Privacy is an important issue, though it is not covered in this lecture → privacy-aware data mining

    ❑ Check the EU General Data Protection Regulation (https://eugdpr.org/)

    ■ e.g., https://www.whitecase.com/publications/article/chapter-5-key-definitions-unlocking-eu-general-data-protection-regulation


  • From data to knowledge of different types

    Data → Methods → Knowledge

    Call records → Outlier detection → Detect fraud cases

    Movie ratings → Collaborative filtering → Recommend movies to users

    Telescope images → Classification → Is it an «early», «intermediate» or «late formation» star?

    News articles → Clustering → What are the topics people discuss in the news today?

  • Short break (5’) – Get to know us better

    ■ What is your field of study?

    ❑ Informatik? Informationstechnik? Elektrotechnik?

    ■ What is your interest in the data mining field?

    ■ Are there people from Physics, Medicine, Engineering Sciences in the audience?


  • Outline

    ■ Why study Data Mining?

    ■ Why do we need Data Mining?

    ■ What is the KDD (Knowledge Discovery in Databases) process?

    ■ Main data mining tasks

    ■ Course logistics

    ■ Things you should know from this lecture

    ■ Homework/ Tutorial

  • What is KDD

    Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

    [Fayyad, Piatetsky-Shapiro, and Smyth 1996]

    Remarks:

    ● valid: the discovered patterns should also hold for new, previously unseen problem instances.

    ● novel: at least to the system and preferably to the user

    ● potentially useful: they should lead to some benefit to the user or task

    ● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some post-processing

    Clarification: The term “databases” does not refer exclusively to relational databases storing structured data; it can be any data store, holding structured, semi-structured or unstructured data.

  • The KDD process and the Data Mining step

    KDD process [Fayyad, Piatetsky-Shapiro & Smyth, 1996]:

    Data → Selection → Target data → Preprocessing/Cleaning → Preprocessed data → Transformation → Transformed data → Data Mining → Patterns → Evaluation → Knowledge

    Selection:

    • Select a relevant dataset or focus on a subset of a dataset

    • File / DB

    Preprocessing/Cleaning:

    • Integration of data from different data sources

    • Noise removal

    • Missing values

    Transformation:

    • Select useful features

    • Feature transformation/discretization

    • Dimensionality reduction

    Data Mining:

    • Search for patterns of interest

    Evaluation:

    • Evaluate patterns based on interestingness measures

    • Statistical validation of the models

    • Visualization

    • Descriptive statistics
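The five KDD steps can be walked through on a toy dataset. The following is only a minimal sketch in plain Python: the student records, the attribute names (`hours`, `tutorials`, `passed`), the mean imputation and the two-bin thresholds are all assumptions made for illustration and do not come from the lecture.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw records: (student_id, hours_studied, tutorials_attended, passed)
raw = [
    (1, 10.0, 8, "yes"),
    (2, None, 2, "no"),     # missing value
    (3, 4.0, 1, "no"),
    (4, 12.0, 9, "yes"),
    (5, 7.0, None, "yes"),  # missing value
]

# 1. Selection: keep only the attributes relevant to the task (drop the id).
target = [{"hours": h, "tutorials": t, "passed": p} for _, h, t, p in raw]


# 2. Preprocessing/cleaning: replace missing values by the attribute mean.
def impute_mean(rows, key):
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


clean = impute_mean(impute_mean(target, "hours"), "tutorials")


# 3. Transformation: discretize a numeric feature into two bins.
def discretize(rows, key, threshold):
    return [{**r, key: "high" if r[key] >= threshold else "low"} for r in rows]


transformed = discretize(discretize(clean, "hours", 8), "tutorials", 5)

# 4. Data mining: count how often each (hours, passed) pattern occurs.
patterns = Counter((r["hours"], r["passed"]) for r in transformed)

# 5. Evaluation: relative frequency of each pattern as a simple interestingness measure.
support = {pattern: count / len(transformed) for pattern, count in patterns.items()}
```

In practice each step would use dedicated tooling (e.g., the Pandas and scikit-learn libraries mentioned for the tutorials); the point here is only the order of the steps and what each one hands to the next.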

  • A modern version: The Data Science process


  • The interdisciplinary nature of KDD 1/2

    KDD draws on: Machine Learning, Databases, Statistics, Data visualization, Pattern recognition, Algorithms, Other disciplines

  • The interdisciplinary nature of KDD 2/2

    Statistics [Berthold & Hand 1999]

    • Model based inference

    • Focus on numerical data

    Machine Learning [Mitchell 1997]

    • Theory + methods

    • Focus on small datasets

    Databases [Chen, Han & Yu 1996]

    • Scalability to large data sets

    • New data types (web data, micro-arrays, social data ...)

    • Integration with commercial databases

  • How do machines learn?

    ■ ML “gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)

    ■ We don’t codify the solution. We don’t even know it!

    ■ Data is the key, together with the learning algorithm

    Data → Algorithms → Models → (semi)automatic decision making

    “How can we build computer programs that automatically improve with experience?”

    Tom Mitchell, Machine Learning book

  • More formally: How do machines learn?

    ■ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

    Tom Mitchell, Machine Learning 1997.

    ■ Example: A backgammon learning problem

    ❑ Task T: playing backgammon

    ❑ Performance measure P: % of games won against opponents

    ❑ Training experience E: playing practice games against itself

    ■ Example: Exam performance

    ❑ Task T: predict whether a student will pass the final DM exam or not

    ❑ Experience E: historical records of students that took the DM exam

    ❑ Performance measure P: % of correctly identified students
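The three ingredients of the exam example can be made concrete in a few lines. This is a hedged sketch: the records are invented, and the "learner" is a deliberately trivial majority-class baseline chosen for brevity, not a method from the lecture. The function names `learn_majority` and `accuracy` are hypothetical.

```python
from collections import Counter


def learn_majority(experience):
    """Trivial learner: remember the most frequent label seen in E."""
    label, _ = Counter(lbl for _, lbl in experience).most_common(1)[0]
    return lambda features: label  # the model ignores the features


def accuracy(model, examples):
    """Performance measure P: fraction of correctly predicted outcomes."""
    correct = sum(model(x) == y for x, y in examples)
    return correct / len(examples)


# Experience E: hypothetical historical records (hours studied, outcome).
E = [(12, "pass"), (3, "fail"), (9, "pass"), (10, "pass")]

# Task T: predict pass/fail for students not seen during learning.
new_students = [(11, "pass"), (2, "fail"), (8, "pass"), (7, "pass")]

model = learn_majority(E)
P = accuracy(model, new_students)  # 3 of 4 correct -> 0.75
```

A real learner would use the features; the baseline is still useful as the score any proper classifier has to beat.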


  • (Machine) Learning from experience/feedback 1/2

    ■ “Experience comes in terms of data (the so-called instances or examples) from the specific problem/application”

    ■ Datasets consist of instances (also known as examples or objects)

    ❑ e.g., in a university database: students, professors, courses, grades,…

    ❑ e.g., in a library database: books, users, loans, publishers, ….

    ❑ e.g., in a movie database: movies, actors, director,…

    ■ Instances are described through features (also known as attributes or variables)

    ❑ E.g. a course is described in terms of a title, description, lecturer, teaching frequency etc.

    ❑ An easy to visualize example: if our data are in a database table, the rows are the instances and the columns are the features.


  • (Machine) Learning from experience/feedback 2/2

    ■ Besides the instance description, we might also have feedback on those instances from some “teacher”/“expert”

    ❑ E.g., whether a student passed the exam

    ■ The direct feedback is known as a label, i.e., each instance is associated with a label → labeled dataset

    ■ But we might have no feedback at all → unlabeled dataset

    ■ There might also be indirect feedback

    Lecture 2 is devoted to getting to know our data!
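In code, instances, features and labels map onto very simple structures. The sketch below assumes a hypothetical student table with the features `hours` and `tutorials` and the label `passed`; none of these names come from the lecture.

```python
# Hypothetical student records: one dict per instance (one row of the table).
students = [
    {"hours": 10, "tutorials": 8, "passed": True},
    {"hours": 4, "tutorials": 1, "passed": False},
    {"hours": 7, "tutorials": 5, "passed": True},
]
feature_names = ["hours", "tutorials"]  # the columns that describe an instance

# Labeled dataset: feature vectors X plus one label per instance in y.
X = [[s[f] for f in feature_names] for s in students]
y = [s["passed"] for s in students]

# Unlabeled dataset: the same instances without the feedback column.
unlabeled = [{f: s[f] for f in feature_names} for s in students]
```

The `X`/`y` naming follows a common convention in Python data mining libraries; it is a convention, not a requirement.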

  • Short break (5’) – Modeling student data for the exam performance task

    ■ Recall our learning example

    ■ Example: Exam performance

    ❑ Task T: predict whether a student will pass the final DM exam or not

    ❑ Experience E: historical records of students that took the DM exam

    ❑ Performance measure P: % of correctly identified students

    ■ If students are the learning instances, what sort of features could I use to describe each of them?

    ■ What could be the feedback (direct, indirect) for the learning model (if any)?


  • Outline

    ■ Why study Data Mining?

    ■ Why do we need Data Mining?

    ■ What is the KDD (Knowledge Discovery in Databases) process?

    ■ Main data mining tasks

    ■ Course logistics

    ■ Things you should know from this lecture

    ■ Homework/ Tutorial

  • Different learning tasks

    Based on the feedback we have on the data, we can distinguish between:

    ■ Direct-feedback instances → Supervised learning

    ❑ the correct response/label is provided for each instance by the “teacher”

    ❑ e.g., good or bad product

    ■ No-feedback instances → Unsupervised learning

    ❑ no evaluation/label of the instance is provided, since there is no “teacher”

    ❑ e.g., no information on whether a product is good or bad, just the description of the product/instance

    ■ Indirect-feedback instances → Reinforcement learning

    ❑ less feedback is given: the teacher does not provide the proper action, only an evaluation of the chosen action

  • Different learning tasks: Supervised learning

    ■ Supervised learning/ Predictive:

    ❑ A description of the instances and their class labels is available (training set)

    ❑ The goal is to learn a mapping from the instances to the class labels, i.e., given a future unseen instance to predict its class label

    ■ Typical examples covered in this lecture:

    ❑ Classification

    ❑ Outlier detection

    ❑ Regression


  • Classification: an example

    ■ The goal is to learn a mapping from the “height, width space” to the class space (nails, screws, paper clips)

    ■ For new objects, the result of the classification is one of the class labels {nails, screws, paper clips}

    [Figure: scatter plot of Height [cm] vs. Width [cm] showing the classes Screw, Nails and Paper clips, plus two new objects to classify]

    instance   width   height   class
    1          2.6     4.5      Screw
    2          3.7     7.3      Nails
    3          4.1     6.5      Paper Clips
    4          8.5     8.1      Screw
    5          9.5     5.5      Nails
    …          …       …        …
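One simple way to obtain such a mapping is a nearest-neighbour rule: a new object receives the class of the closest training instance. The sketch below uses the five rows of the table on this slide as training data; the coordinates of the two new objects and the function name `classify_1nn` are assumptions, since the slide does not give them.

```python
import math

# Training set from the slide's table: (width, height) -> class label.
training = [
    ((2.6, 4.5), "Screw"),
    ((3.7, 7.3), "Nails"),
    ((4.1, 6.5), "Paper Clips"),
    ((8.5, 8.1), "Screw"),
    ((9.5, 5.5), "Nails"),
]


def classify_1nn(point, training_set):
    """Return the label of the training instance closest to `point`."""
    nearest = min(training_set, key=lambda example: math.dist(point, example[0]))
    return nearest[1]


# Hypothetical new objects (width, height); not taken from the slide.
prediction_a = classify_1nn((3.0, 5.0), training)  # closest to instance 1
prediction_b = classify_1nn((9.0, 6.0), training)  # closest to instance 5
```

With only five training instances the result is fragile; the k nearest neighbours variant, which votes among several neighbours, is the more usual choice.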

  • Classification applications 1/2

    ■ Application: Fraud Detection

    ❑ Goal: Predict fraudulent cases in credit card transactions.

    ❑ Approach:

    ■ Use credit card transactions and the information on its account-holder as attributes.

    ❑ When does a customer buy, what do they buy, how often do they pay on time, etc.

    ■ Label past transactions as fraud or fair transactions. This forms the class attribute.

    ■ Learn a model for the class of the transactions.

    ■ Use this model to detect fraud by observing credit card transactions on an account.


  • Classification applications 2/2

    ■ Application: Churn prediction in telco

    ❑ Goal: Predict whether a customer is likely to be lost to a competitor

    ❑ Approach:

    ■ Use detailed records of transactions with each of the past and present customers to find attributes.

    ❑ How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.

    ■ Label the customers as loyal or disloyal (class attribute).

    ■ Find a model for customer loyalty

    ■ Use this model to predict churn and organize possible retention strategies.

  • Example: Google News


  • A huge variety of classification algorithms

    Decision trees

    k nearest neighbours

    Support vector machines

    Neural networks

    Bayesian classifiers

    Ensembles

  • Supervised learning: Regression

    ■ Similar to classification, but the target value to be learned is continuous rather than discrete.

    ■ Goal: Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

    Given this data, a friend has a house of 750 square feet: what price can they expect to get?

    Source: Andrew Ng ML course, Coursera
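For the linear case, the house price question can be answered with an ordinary least-squares line. The sketch below is illustrative only: the sizes and prices are made-up numbers (the data behind the slide's plot is not reproduced here), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: house size in square feet, price in 1000 USD.
sizes = [500, 1000, 1500, 2000]
prices = [100, 200, 300, 400]

slope, intercept = fit_line(sizes, prices)
estimate = slope * 750 + intercept  # predicted price for a 750 sq ft house
```

Because the made-up data lies exactly on a line, the estimate comes out at about 150 (thousand USD); real data would scatter around the fitted line, and a nonlinear model might fit better.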

  • Application: Precision farming

    ■ Create a production curve depending on multiple parameters like soil characteristics, weather, used fertilizers.

    ■ Only the appropriate amount of fertilizers given the environmental settings (soil, weather) will result in maximum yield.

    ■ Controlling the effects of over-fertilization on the environment is also important

    [Figure: production curve, i.e., production as a function of fertilizers, given water capacity, soil parameters and weather]

  • Different learning tasks: Unsupervised learning

    ■ Unsupervised learning/ Descriptive:

    ❑ Only a description of the instances is available

    ❑ No feedback/labels are available

    ❑ The goal is to discover groups of similar instances

    ■ Typical subtasks covered in this lecture:

    ❑ clustering

    ❑ association rules mining

    ❑ outlier detection


  • Clustering: an example

    ■ Each point is described in terms of its height and width

    ■ No information on the actual classes (nails, paper clips) is available to the clustering algorithm.

    [Figure: scatter plot of Height [cm] vs. Width [cm] with two groups of points, Cluster 1 and Cluster 2]

    instance   width   height
    1          2.6     4.5
    2          3.7     7.3
    3          4.1     6.5
    4          8.5     8.1
    5          9.5     5.5
    …          …       …
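A partitioning method such as k-Means can recover two groups from these unlabeled points. The following is a minimal sketch using the five rows listed on this slide; choosing the first and last point as initial centroids is an assumption made so the run is deterministic, and different starting points can lead to a different result.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-Means: alternate assignment and centroid update until stable."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments did not change, so the centroids will not either
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# (width, height) of the five instances from the slide's table.
points = [(2.6, 4.5), (3.7, 7.3), (4.1, 6.5), (8.5, 8.1), (9.5, 5.5)]
labels, centroids = kmeans(points, [points[0], points[-1]])
# labels == [0, 0, 0, 1, 1]: instances 1-3 form one cluster, 4-5 the other
```

Note that the cluster numbers 0 and 1 carry no meaning of their own; unlike class labels, they only say which points belong together.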

  • Clustering applications 1/2

    Application: Market Segmentation

    ■ Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

    ■ Approach:

    ❑ Collect different attributes of customers based on their geographical and lifestyle related information.

    ■ E.g., age, income, education, family status, ….

    ❑ Find clusters of similar customers.

    ❑ Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.


  • Clustering applications 2/2

    Application: Document clustering

    ■ Find groups of documents (topics) that are similar to each other based on the important terms appearing in them.

    ■ Approach:

    ❑ Identify important terms in each document.

    ❑ Form a similarity measure between documents.

    ❑ Cluster based on the similarity measure.

    ■ Gain:

    ❑ Help the end user to navigate in the collection of documents (based on the extracted clusters).

    ❑ Utilize the clusters to relate a new document or search term to clustered documents.

    ■ Check for example, Google News.
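The similarity measure between documents is the core of this approach. A common choice is cosine similarity over term-count vectors, sketched below. The three example headlines are invented, and using raw counts of all whitespace-separated words (instead of selecting "important" terms first) is a simplifying assumption.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words representation: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0 = no shared terms)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical news headlines.
docs = [
    "stock market prices fall",
    "market prices rise again",
    "football team wins cup",
]
vectors = [term_vector(d) for d in docs]

sim_finance = cosine_similarity(vectors[0], vectors[1])  # two shared terms
sim_mixed = cosine_similarity(vectors[0], vectors[2])    # no shared terms
```

A clustering algorithm would then group documents whose pairwise similarity is high, here the two finance headlines.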


  • Example: Google News


  • A huge variety of clustering algorithms

    Partitioning methods (k-Means)

    Grid-based methods (CLIQUE)

    Density-based methods (DBSCAN)

    Hierarchical methods

    Constraint-based methods

    Model-based methods (EM)

  • Unsupervised learning: Association rules mining

    ■ Task: Find all rules in the database, in the following form:

    If x, y, z are contained in a set M, then t is also contained in M with a probability of at least X%.

    Transactions:

    a,b,c,d,e
    b,c,d
    a,b,c,d
    a,b,c,d,e
    a,c,e,f
    d,c,e,f
    a,b,c,d,f

    Items: a = milk, b = cheese, c = wine, d = pasta, e = yogurt, f = apples

    In 5 out of 7 cases (~71 %) b, c, d appear together.

    In 5 out of 5 cases (100 %) it holds that: if b, c then d also exists.
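The two percentages on this slide are usually called support and confidence; those names are standard terminology rather than wording from the slide. A short sketch that recomputes both from the seven transactions:

```python
# The seven transactions from the slide, one set of items per basket.
transactions = [
    {"a", "b", "c", "d", "e"},
    {"b", "c", "d"},
    {"a", "b", "c", "d"},
    {"a", "b", "c", "d", "e"},
    {"a", "c", "e", "f"},
    {"d", "c", "e", "f"},
    {"a", "b", "c", "d", "f"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)


def confidence(antecedent, consequent, db):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, db) / support(antecedent, db)


s_bcd = support({"b", "c", "d"}, transactions)          # 5 of 7 transactions
c_bc_d = confidence({"b", "c"}, {"d"}, transactions)    # 5 of 5 -> 1.0
```

The rule "if b, c then d" therefore holds with support of about 71 % and confidence 100 %, matching the counts above.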

  • Application: Market basket analysis

    ■ Result:

    ❑ Items that are frequently purchased together may be better positioned close to each other, e.g., since diapers are often purchased together with beer => place beer on the way from diapers to the checkout

    ❑ Generate recommendations for customers with similar baskets => e.g., customers that bought “Star Wars” might also be interested in “The Lord of the Rings”.

    [Figure: shopping basket, data warehouse, association rules]

    Possible generalizations:

    • Paprika-Chips → Snacks

    • Enrichment of customer data

  • Unsupervised|Supervised Learning: Outlier detection

    ■ Outlier detection is defined as identification of non-typical data

    ■ Outliers might indicate

    ❑ possible abuse of credit cards, mobile phones

    ❑ data errors

    ❑ device failures
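One simple, unsupervised way to flag non-typical values is a z-score rule: a value is suspicious if it lies many standard deviations away from the mean. The sketch below is an illustration under assumptions: the transaction amounts are invented, the threshold of 2.0 is an arbitrary choice, and the lecture does not prescribe this particular method.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical credit card transaction amounts in EUR.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
outliers = zscore_outliers(amounts)  # only the 500 EUR transaction is flagged
```

The rule has a known weakness: a large outlier inflates the mean and the standard deviation it is measured against, so more robust statistics (e.g., the median) are often preferred on real data.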


  • Application

    ■ Analysis of the SAT.1-Ran-Soccer-Database (Season 1998/99)

    ❑ 375 players

    ❑ Primary attributes: Name, #games, #goals, playing position (goalkeeper, defense, midfield, offense),

    ❑ Derived attribute: Goals per game

    ❑ Outlier analysis (playing position, #games, #goals)

    ■ Result: Top 5 outliers

    Rank   Name                 #games   #goals   Position     Explanation
    1      Michael Preetz       34       23       Offense      Top scorer overall
    2      Michael Schjönberg   15       6        Defense      Top scoring defense player
    3      Hans-Jörg Butt       34       7        Goalkeeper   Goalkeeper with the most goals
    4      Ulf Kirsten          31       19       Offense      2nd scorer overall
    5      Giovane Elber        21       13       Offense      High #goals per game

    Note: “Outliers” is not necessarily a negative term.

  • Short break (5’) – Learning from the student data

    ■ Recall our learning example

    ■ Example: Exam performance

    ❑ Task T: predict whether a student will pass the final DM exam or not

    ❑ Experience E: historical records of students that took the DM exam

    ❑ Performance measure P: % of correctly identified students

    ■ If students are the learning instances, what sort of features could I use to describe each of them?

    ■ What could be the feedback/label for the learning model (if any)?

    ■ What could be a supervised learning task here?

    ❑ For classification? For prediction?

    ■ What could be an unsupervised learning task?

    ❑ For clustering? For frequent itemsets mining?

    ■ What could be an outlier detection problem here?

  • Outline

    ■ Why study Data Mining?

    ■ Why do we need Data Mining?

    ■ What is the KDD (Knowledge Discovery in Databases) process?

    ■ Main data mining tasks

    ■ Course logistics

    ■ Things you should know from this lecture

    ■ Homework/ Tutorial

  • Course logistics 1/3

    ■ Class schedule

    ❑ Lectures: Wednesdays, 12:15 - 13:45, Multimedia-Hörsaal (3703 - 023), Appelstraße 4.

    ❑ Tutorials: Monday: 10:00 - 11:30 , Monday: 13:30 - 15:00, Tuesday: 11:45 - 13:15, Tuesday: 13:30 - 15:00, Room 235, Gebaeude 3703, Appelstraße 4.

    ■ StudIP as a common information sharing place

    ❑ Up to date announcements and material

    ❑ Use the forum for your questions. They might benefit everyone!

    ■ Exam:

    ❑ Written exam, 90’

    ■ You are allowed to bring one hand-written A4 sheet with formulas etc. (no need to memorize). Each student must bring their own sheet (copies are not allowed)

    ❑ The exam will be based on the material discussed in the class plus the tutorials.

    ■ Exam date: Monday 28.8.2019, 08:30-11:00, Rooms: F 102, F 303


  • Overview of the lectures (current planning)

    1. Introduction

    2. Getting to know our data

    3. Association Rules Mining

    4. Clustering

    6. Classification

    7. Outlier Detection


  • Course logistics 2/3

    ■ Projects

    ❑ Focus on the complete KDD pipeline for two different learning tasks:

    ■ Classification: 15/5/2019 & Clustering: 26/6/2019

    ■ Groups of 2 (Please form the teams by yourselves)

    ■ Goal: how to run a data mining case study? From data preprocessing to transformation, learning algorithm, evaluation and presentation of the results. Both analysis and presentation part are important.

    ■ We will use Kaggle for result submission (but you have to submit the report separately)

    ■ We will have a poster session at the end where each team presents its results

    ■ Bonus scheme

    ❑ Pass both projects: you switch to the next best grade

    ■ e.g., from 1.7 → 1.3

    ❑ Each member “inherits” the grade of the group

    ❑ Extra bonus for those that score best in Kaggle (system) & those with the best poster (voting)

  • Course logistics 3/3


    ■ Teaching Assistants

    ❑ Vasileios Iosifidis

    ■ Room 240, 2nd floor, Appelstraße 4

    [email protected]

    ❑ Tai Le Quy

    ■ Room 010, Ground floor, Appelstraße 4

    [email protected]

    ❑ Maximilian Idahl

    ■ -

    ❑ Wazed Ali

    [email protected]

    ❑ Shaheer Asghar

    [email protected]

    ■ Lecturer

    ❑ Prof. Dr. Eirini Ntoutsi

    ■ Room 203, 2nd floor, Appelstraße 4

    Contact via email: [email protected]

    Please use [DM1] in the subject

  • Tutorials: Organization

    ■ 4 tutorial groups

    ❑ Times:

    ■ Monday, 10:00 – 11:30 and 13:30 – 15:00

    ■ Tuesday, 11:45 – 13:15 and 13:30 – 15:00

    ❑ Room: 235, Appelstraße 4

    ❑ Registration for groups in Stud.IP, unlocked today at 20:00


  • Tutorials: Why attend

    Why should you attend the tutorials?

    Solving theoretical and algorithmic parts is an excellent preparation for the exam

    Working on the implementation part is useful for the bonus projects

    Each tutorial will consist of

    1. a theoretical part (e.g., properties of an algorithm)

    2. an algorithmic part (e.g., applying an algorithm on a particular dataset)

    3. an implementation part (e.g., how to run a data mining analysis in Python)

    SoSe19: DM I - Tutorial

  • Tutorials: structure

    1 worksheet per week

    Announced after the lecture, so you have the chance to prepare beforehand

    Solutions will be available via Stud.IP

    Theoretical and algorithmic parts will be mainly presented at the blackboard (with help from you)

    Implementation part in form of jupyter notebooks (interactive python)

    Use Stud.IP for any tutorial related question.

    This is the fastest way to get an answer

    Your question might be relevant to other students

    Posing and answering questions is a great way to learn

    Or send an e-mail to [email protected]


  • Tutorials: For next week (1st tutorial)

    Take a look at python and jupyter notebooks

    Installation:

    Anaconda python distribution includes most packages needed for this course

    Step-by-step installation guide for Windows/Mac/Linux: https://docs.anaconda.com/anaconda/

    Quick start guide for Jupyter notebooks: https://jupyter.readthedocs.io/en/latest/content-quickstart.html

    Alternative:

    Jupyter notebook environment Google Colab: https://colab.research.google.com/notebooks/welcome.ipynb

    Free, requires no setup, runs entirely in the cloud
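
    Whichever option you choose, a quick check that the environment is ready can save time in the first tutorial. The following is a minimal sketch, not an official course script: the package list is an assumption based on the libraries named for Tutorials 1 and 2, and the helper name `check_packages` is hypothetical. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util
import sys

# Packages assumed to be needed (NumPy, Pandas, matplotlib, scikit-learn).
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Map each top-level package name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_packages(REQUIRED).items():
        print(f"{name:12s} {'OK' if found else 'MISSING'}")
```

    The check only locates the packages without importing them, so it runs quickly in a terminal or in a notebook cell.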


  • Tutorials: Python resources

    There are lots of great Python tutorials

    Recommended: http://scipy-lectures.org/intro/language/python_language.html

    Official: https://docs.python.org/3/tutorial/

    If you prefer video tutorials:

    https://pythonprogramming.net/python-fundamental-tutorials/ or

    https://youtu.be/YYXdXT2l-Gg

    And on Jupyter notebooks:

    https://medium.com/codingthesmartway-com-blog/getting-started-with-jupyter-notebook-for-python-4e7082bd5d46 or

    http://opentechschool.github.io/python-data-intro/core/notebook.html or

    https://youtu.be/HW29067qVWk


  • Tutorials 1-2 plan


    Tutorials 1 and 2 will include introductions to

    Python basics

    Arrays in NumPy

    Data manipulation with Pandas

    Visualization using matplotlib

    Data mining and analysis tools in scikit-learn

    Goal: Running a data analysis process in Python, from data selection to pattern evaluation

    Useful for the projects

    Learning-by-doing
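
    To give a flavour of the goal above, here is a minimal sketch of such a process with Pandas and scikit-learn. It is an illustration under stated assumptions, not the tutorial material itself: the Iris dataset, the two selected columns, the k-nearest-neighbour classifier, and the 70/30 split are all choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Data selection: load a small built-in dataset as a pandas DataFrame
#    and keep two columns (dataset and columns are illustrative assumptions).
iris = load_iris(as_frame=True)
features = iris.data[["petal length (cm)", "petal width (cm)"]]
labels = iris.target

# 2. Preprocessing: hold out test data, then standardize the features
#    using statistics computed on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Data mining step: fit a classifier on the training data.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)

# 4. Pattern evaluation: measure accuracy on the unseen test data.
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {accuracy:.2f}")
```

    Fitting the scaler on the training split only keeps information from the test data out of the model, so the evaluation step reflects performance on unseen data.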

  • Textbook and recommended readings

    ■ Textbook:

    ❑ Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Addison-Wesley, 2014

    ❑ New edition is expected in May 2019

    ■ Recommended readings

    ❑ Zaki M. J., Meira W. Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014

    ❑ Mitchell T. M., Machine Learning, McGraw-Hill, 1997

    ❑ Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011

    ❑ Aggarwal C. C., Data Mining: The Textbook, Springer, 2015

    ❑ Witten I. H., Frank E., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2016


  • Online resources

    ■ Machine Learning class by Andrew Ng, Stanford

    ❑ http://ml-class.org/

    ■ Tom Mitchell’s lectures on YouTube

    ❑ www.youtube.com/playlist?list=PLAJ0alZrN8rD63LD0FkzKFiFgkOmEtltQ

    ■ KDnuggets: Data Mining and Analytics resources

    ❑ http://www.kdnuggets.com/


  • Tools

    ■ Several options exist, both commercial and free/open source

    ❑ An up-to-date list is available at: http://www.kdnuggets.com/software/suites.html

    ■ Commercial tools offered by major vendors

    ❑ e.g., IBM, Microsoft, Oracle …

    ■ Free/ open source tools

    ❑ Weka

    ❑ ELKI

    ❑ R

    ❑ SciPy + NumPy

    ❑ Orange

    ❑ RapidMiner (free and commercial versions)

  • Outline

    ■ Why to study Data Mining?

    ■ Why we need Data Mining?

    ■ What is the KDD (Knowledge Discovery in Databases) process?

    ■ Main data mining tasks

    ■ Course logistics

    ■ Things you should know from this lecture

    ■ Homework/ Tutorial


  • Things you should know from this lecture

    ■ KDD definition

    ■ KDD process

    ■ DM step

    ■ Supervised vs Unsupervised learning

    ■ Main DM tasks

    ❑ Clustering

    ❑ Classification

    ❑ Regression

    ❑ Association rules mining

    ❑ Outlier detection
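
    The supervised/unsupervised distinction can be illustrated with a short sketch: classification uses known labels during training, while clustering receives only the data points. The synthetic two-group data, the decision tree, and k-means are assumptions chosen for this example, not methods prescribed by the lecture.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2-D data with two well-separated groups (illustrative assumption).
X, y = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

# Supervised (classification): the labels y are used during training.
classifier = DecisionTreeClassifier(random_state=0).fit(X[:150], y[:150])
test_accuracy = classifier.score(X[150:], y[150:])

# Unsupervised (clustering): only X is given, no labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

# Cluster ids are arbitrary and need not match the class ids, so agreement
# is measured with a score that ignores how the groups are numbered.
agreement = adjusted_rand_score(y, cluster_ids)

print(f"Classification accuracy on held-out points: {test_accuracy:.2f}")
print(f"Agreement between clusters and true groups (ARI): {agreement:.2f}")
```

    In practice the true groups are unknown in the unsupervised setting; they are used here only to check that the clustering recovered them.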


  • Outline

    ■ Why to study Data Mining?

    ■ Why we need Data Mining?

    ■ What is the KDD (Knowledge Discovery in Databases) process?

    ■ Main data mining tasks

    ■ Course logistics

    ■ Things you should know from this lecture

    ■ Homework/ Tutorial


  • Homework/ Tutorial

    ■ Homework: Think of some real-world applications that you find suitable for Data Mining.

    ❑ Why?

    ❑ What type of patterns would you look for?

    ❑ Would you approach it as a supervised or unsupervised learning task?

    ■ Readings:

    ❑ Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 1.

    ❑ U. Fayyad, et al. (1996), “From Data Mining to Knowledge Discovery: An Overview”, in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.


  • Acknowledgement

    ■ The slides are based on

    ❑ KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, Eshref Januzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)

    ❑ Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/

    ❑ Pedro Domingos’ Machine Learning course slides at the University of Washington

    ❑ Slides for the Machine Learning book by T. Mitchell at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html

    ❑ Thank you to all TAs contributing to their improvement, namely Vasileios Iosifidis, Damianos Melidis, Tai Le Quy, Han Tran
