KBS - KBS - Data Mining Intoutsi/DM1.SoSe19/lectures/... · 2019. 4. 10. · Lecture 1:...

Data Mining I

Summer semester 2019

Lecture 1: Introduction

Lectures: Prof. Dr. Eirini Ntoutsi

TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali

Fakultät für Elektrotechnik und Informatik, Institut für Verteilte Systeme

AG Intelligente Systeme - Data Mining group

About me

03/2016 - present: Associate Professor, Faculty of Electrical Engineering & Computer Science, Leibniz University Hannover; L3S Research Center (since May 2016)

02/2012 - 02/2016: Post-doctoral researcher & lecturer, Institute for Informatics, LMU Munich, Germany

02/2010 - 01/2012: Alexander von Humboldt postdoc fellow, Institute for Informatics, LMU Munich, Germany

2009: Data Mining Expert, National Hellenic Organization (OTE), Athens, Greece

04/2007 – 02/2009: Co-Founder and AI expert, NeeMo Startup, Greece

09/2003 – 09/2008: PhD in Data Mining, University of Piraeus, Athens, Greece

09/2001 – 09/2003: MSc, Computer Science / Text Mining, Polytechnic School, University of Patras, Greece

09/1996 – 09/2001: Diploma, Computer Engineering and Informatics / AI Games, Polytechnic School, University of Patras, Greece

Learning from streaming data

Current focus areas:
• Data Stream Mining / Adaptive Machine Learning
• Responsible AI: Fairness-Aware Machine Learning

Outline

■ Why study Data Mining?

■ Why do we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logistics

■ Things you should know from this lecture

■ Homework/ Tutorial

Data Mining I @SS19: Introduction

Why study Data Mining/Machine Learning – famous quotes*

■ “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)

■ “Machine learning is the next Internet” (Tony Tether, Director, DARPA)

■ “Machine learning is the hot new thing” (John Hennessy, President, Stanford)

■ “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)

■ “Machine learning is going to result in a real revolution” (Greg Papadopoulos, Former CTO, Sun)

■ “Machine learning today is one of the hottest aspects of computer science” (Steve Ballmer, CEO, Microsoft)


*Source: Pedro Domingos http://courses.cs.washington.edu/courses/cse446/15sp/slides/intro.pdf


Disclaimer: I use the terms data mining and machine learning (sometimes also Artificial Intelligence (AI)) interchangeably here and throughout the lecture. We will discuss the similarities/differences later. In both cases, we talk about learning from data.

Data Mining – Data Science – Big Data – Machine Learning – Deep Learning Analytics …

■ New fancy words for knowledge discovery from data

❑ Data mining, machine learning have been focusing on knowledge discovery from data for decades

❑ Well defined set of tasks and solutions

■ Big data and analytics are more business terms and ill-defined

■ The same holds today for AI


“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.“

Source: Dan Ariely, Duke University


Ever increasing interest … the “rebranding” effect


Source: Google trends, query on 9.4.2019


Why study Data Mining - Data Scientist: The sexiest job of the 21st century


“If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.”

Source: Harvard Business Review. Data Scientist: The Sexiest Job of the 21st Century. October 2012. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

Source: https://www.slideshare.net/IBMBDA/myths-and-mathemagical-superpowers-of-data-scientists

A good conjuncture for ML/DM/DS (data-driven learning)


■ Data deluge

■ Machine Learning advances

■ Computer power

■ Enthusiasm

World-wide competition on Artificial Intelligence (AI)

■ From FORSCHUNGSGIPFEL 2019 - Künstliche Intelligenz – Innovationstreiber einer neuen Generation, 19/3/2019, Berlin

■ Cédric Villani’s talk on the geopolitics of AI: there are 3 fierce competitions

■ Competition for human talent

■ Competition for infrastructure

■ Competition for data

■ More info on

❑ Track #fogipf19 on Twitter: https://twitter.com/hashtag/fogipf19?src=hash

❑ Check videos online: http://www.forschungsgipfel.de/2019/videos










Why we need Data Mining

■ Huge amounts of data are collected nowadays from different application domains

■ “We are drowning in information but starving for knowledge” (John Naisbitt)

■ The amount and the complexity of the collected data do not allow for manual analysis.

Example application domains: Telecommunication, Astronomy, Banks, Biology, Internet, Supermarkets, IoT


Source of the quote: http://www.kdnuggets.com/news/2007/n06/3i.html

Examples of data sources: The Internet

■ Internet users


Web 2.0: A world of opinions, user-generated content


Source: http://www.internetlivestats.com/internet-users/

Examples of data sources: Internet of things

■ The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics, software, sensors, and network connectivity, which enables these objects to collect and exchange data.

Source: https://en.wikipedia.org/wiki/Internet_of_Things


Image source: http://tinyurl.com/prtfqxf

Source: http://blogs.cisco.com/diversity/the-internet-of-things-infographic

During 2008, the number of things connected to the internet surpassed the number of people on earth. By 2020 there will be 50 billion connected things vs. 7.3 billion people (2015).

These things can be anything: smartphones, tablets, refrigerators … even cattle.



Examples of data sources: data intensive science


Slide from: http://research.microsoft.com/en-us/um/people/gray/talks/nrc-cstb_escience.ppt

“Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.”

- The Fourth Paradigm – Microsoft

Examples of e-science applications:
• Earth and environment
• Health and wellbeing
− E.g., The Human Genome Project (HGP)
• Citizen science
• Scholarly communication
• Basic science
− E.g., CERN


Examples of data sources: Manufacturing

■ Andrew Ng Says Factories Are AI’s Next Frontier

Source: https://www.technologyreview.com/s/609770/andrew-ng-says-factories-are-ais-next-frontier/


Image source: https://images.readwrite.com/wp-content/uploads/2018/03/AAEAAQAAAAAAAAueAAAAJDY1NmFlN2NhLWExZTUtNDRhNy1iMWQ5LTViZGM3NTFlODczYQ.jpg

Companies are making major investments in AI and industrial analytics to help drive their digital transformation


Image source: https://cdn-sv1.deepsense.ai/wp-content/uploads/2018/04/Spot-the-flaw-Visual-quality-control-in-manufacturing-1140x337.jpg


Examples of data sources: We … the data subjects

■ Wherever we go, we are "datafied".

■ Smartphones are tracking our locations.

■ We leave a data trail in our web browsing.

■ Interaction in social networks.

■ Privacy is an important issue, although it is not covered in this lecture → see privacy-aware data mining

❑ Check the EU General Data Protection Regulation (https://eugdpr.org/)

■ e.g., https://www.whitecase.com/publications/article/chapter-5-key-definitions-unlocking-eu-general-data-protection-regulation



From data to knowledge of different types


Data → Methods → Knowledge

■ Call records → Outlier detection → Detect fraud cases

■ Movie ratings → Collaborative filtering → Recommend movies to users

■ Telescope images → Classification → Is it an «early», «intermediate» or «late formation» star?

■ News articles → Clustering → What are the topics people discuss in the news today?

Short break (5’) – Get to know us better

■ What is your field of study?

❑ Informatik? Informationstechnik? Elektrotechnik?

■ What is your interest in the data mining field?

■ Are there people from Physics, Medicine, Engineering Sciences in the audience?











What is KDD

Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

[Fayyad, Piatetsky-Shapiro, and Smyth 1996]

Remarks:

● valid: the discovered patterns should also hold for new, previously unseen problem instances.

● novel: at least to the system and preferably to the user

● potentially useful: they should lead to some benefit to the user or task

● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some post-processing


Clarification: The term databases does not refer exclusively to relational databases storing structured data. It can be any data storage, holding structured, semi-structured or unstructured data.

The KDD process and the Data Mining step

[Fayyad, Piatetsky-Shapiro & Smyth, 1996]

Data → Target data → Preprocessed data → Transformed data → Patterns → Knowledge

■ Selection (Data → Target data):

❑ Select a relevant dataset or focus on a subset of a dataset

❑ File / DB

■ Preprocessing/Cleaning (Target data → Preprocessed data):

❑ Integration of data from different data sources

❑ Noise removal

❑ Missing values

■ Transformation (Preprocessed data → Transformed data):

❑ Select useful features

❑ Feature transformation/discretization

❑ Dimensionality reduction

■ Data Mining (Transformed data → Patterns):

❑ Search for patterns of interest

■ Evaluation (Patterns → Knowledge):

❑ Evaluate patterns based on interestingness measures

❑ Statistical validation of the models

❑ Visualization

❑ Descriptive statistics
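The five KDD steps can be sketched end to end on a toy example. This is only a minimal illustration, not the lecture's reference pipeline: the student records, the field names, the 20-hour discretization threshold and the `min_strength` cut-off are all hypothetical choices made for this sketch.

```python
# Hypothetical raw records: (student_id, study_hours, tutorials_attended, passed).
# All names and values are made up for illustration.
RAW_DATA = [
    (1, 40, 10, True),
    (2, 5, 1, False),
    (3, None, 8, True),   # missing study_hours
    (4, 35, 9, True),
    (5, 8, 2, False),
    (6, 30, 7, True),
    (7, 10, 3, False),
]


def select(records):
    """Selection: keep only the fields relevant to the task."""
    return [(hours, passed) for _, hours, _, passed in records]


def preprocess(records):
    """Preprocessing/cleaning: drop records with missing values."""
    return [(hours, passed) for hours, passed in records if hours is not None]


def transform(records, threshold=20):
    """Transformation: discretize study hours into 'low' / 'high'."""
    return [("high" if hours >= threshold else "low", passed)
            for hours, passed in records]


def mine(records):
    """Data mining: compute the pass rate of each group as a candidate pattern."""
    groups = {}
    for level, passed in records:
        groups.setdefault(level, []).append(passed)
    return {level: sum(flags) / len(flags) for level, flags in groups.items()}


def evaluate(patterns, min_strength=0.8):
    """Evaluation: keep patterns whose pass rate is clearly high or clearly low."""
    return {level: rate for level, rate in patterns.items()
            if rate >= min_strength or rate <= 1 - min_strength}


def kdd(records):
    """Chain the steps: selection, preprocessing, transformation, mining, evaluation."""
    return evaluate(mine(transform(preprocess(select(records)))))


knowledge = kdd(RAW_DATA)
print(knowledge)
```

Each function corresponds to one arrow in the process diagram, so a step can be swapped (for example, imputing missing values instead of dropping records) without touching the others.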


A modern version: The Data Science process


The interdisciplinary nature of KDD 1/2

[Figure: KDD at the intersection of Machine Learning, Databases, Statistics, Data visualization, Pattern recognition, Algorithms and other disciplines]


The interdisciplinary nature of KDD 2/2

■ Statistics [Berthold & Hand 1999]

❑ Model-based inference

❑ Focus on numerical data

■ Machine Learning [Mitchell 1997]

❑ Theory + methods

❑ Focus on small datasets

■ Databases [Chen, Han & Yu 1996]

❑ Scalability to large data sets

❑ New data types (web data, micro-arrays, social data ...)

❑ Integration with commercial databases


How do machines learn?

■ ML “gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)

■ We don’t codify the solution. We don’t even know it!

■ The data and the learning algorithm are the key


[Figure: Data → Algorithms → Models → (semi-)automatic decision making]

How can we build computer programs that automatically improve with experience?

Tom Mitchell, Machine Learning book

More formally: How do machines learn?

■ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Tom Mitchell, Machine Learning 1997.

■ Example: A backgammon learning problem

❑ Task T: playing backgammon

❑ Performance measure P: % of games won against opponents

❑ Training experience E: playing practice games against itself

■ Example: Exam performance

❑ Task T: predict whether a student will pass the final DM exam or not

❑ Experience E: historical records of students that took the DM exam

❑ Performance measure P: % of correctly identified students
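The exam example can be made concrete with a deliberately simple learner. The sketch below is an assumption-laden illustration: the feature (study hours), the threshold rule, and all records are hypothetical, and the learner needs at least one passing and one failing student in its history. It shows performance P (accuracy on unseen students) improving as experience E (historical records) grows.

```python
def learn_threshold(history):
    """Learn a pass threshold on study hours from (hours, passed) records.

    The threshold is the midpoint between the most hours among failing
    students and the fewest hours among passing students.
    """
    max_fail = max(hours for hours, passed in history if not passed)
    min_pass = min(hours for hours, passed in history if passed)
    return (max_fail + min_pass) / 2


def accuracy(threshold, test_set):
    """Performance measure P: fraction of correctly predicted students."""
    correct = sum((hours >= threshold) == passed for hours, passed in test_set)
    return correct / len(test_set)


# Experience E: hypothetical historical records (study_hours, passed).
history = [(40, True), (5, False), (30, True), (12, False), (22, True), (18, False)]
# Hypothetical students the learner has not seen before.
test_set = [(21, True), (19, False), (35, True), (10, False)]

# Task T: predict pass/fail. Compare little experience with more experience.
for n in (2, 6):
    threshold = learn_threshold(history[:n])
    print(n, threshold, accuracy(threshold, test_set))
```

With only two records the learned threshold is 22.5 and one of the four test students is misclassified; with all six records the threshold moves to 20.0 and all four are predicted correctly.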


(Machine) Learning from experience/feedback 1/2

■ “Experience” comes in terms of data (the so-called instances or examples) from the specific problem/application

■ Datasets consist of instances (also known as examples or objects)

❑ e.g., in a university database: students, professors, courses, grades,…

❑ e.g., in a library database: books, users, loans, publishers, ….

❑ e.g., in a movie database: movies, actors, director,…

■ Instances are described through features (also known as attributes or variables)

❑ E.g. a course is described in terms of a title, description, lecturer, teaching frequency etc.

❑ An easy to visualize example: if our data are in a database table, the rows are the instances and the columns are the features.


(Machine) Learning from experience/feedback 2/2

■ Apart from the instance description, we might also have feedback on those instances from some “teacher”/“expert”

❑ E.g., whether a student passed the exam

■ The direct feedback is known as the label, i.e., each instance is associated with a label → labeled dataset

■ But we might have no feedback at all → unlabeled dataset

■ There might also be indirect feedback
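As a small illustration of instances, features and labels, the sketch below stores a few student records as rows and separates the feature columns from the label column. The feature names and values are hypothetical and only serve to show the difference between a labeled and an unlabeled dataset.

```python
# Hypothetical student instances; each key is a feature (attribute).
students = [
    {"study_hours": 40, "tutorials_attended": 10, "passed": True},
    {"study_hours": 5, "tutorials_attended": 1, "passed": False},
    {"study_hours": 30, "tutorials_attended": 7, "passed": True},
]

FEATURES = ["study_hours", "tutorials_attended"]  # columns describing an instance
LABEL = "passed"                                  # direct feedback from the "teacher"


def feature_rows(instances, feature_names):
    """Return one row per instance and one column per feature."""
    return [[inst[name] for name in feature_names] for inst in instances]


X = feature_rows(students, FEATURES)      # instance descriptions
y = [inst[LABEL] for inst in students]    # labels (direct feedback)

labeled_dataset = list(zip(X, y))   # features plus label for every instance
unlabeled_dataset = X               # features only, no feedback at all

print(labeled_dataset)
print(unlabeled_dataset)
```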


[Figure: unlabeled dataset vs. labeled dataset]

Lecture 2 is devoted to getting to know our data!

Short break (5’) – Modeling students data for the exam performance task

■ Recall our learning example





■ If students are the learning instances, what sort of features could I use to describe each of them?

■ What could be the feedback (direct, indirect) for the learning model (if any)?











Different learning tasks

Based on the feedback we have on the data, we can distinguish between:

■ Direct-feedback instances

❑ the correct response /label is provided for each instance by the “teacher”

❑ e.g., good or bad product

■ No-feedback instances

❑ no evaluation/label of the instance is provided, since there is no “teacher“

❑ e.g., no information on whether a product is good or bad, just the description of the product/instance

■ Indirect-feedback instances

❑ less feedback is given: the teacher does not provide the correct action, only an evaluation of the chosen action

Corresponding learning tasks:

■ Direct feedback → Supervised learning

■ Indirect feedback → Reinforcement learning

■ No feedback → Unsupervised learning

Different learning tasks: Supervised learning

■ Supervised learning/ Predictive:

❑ A description of the instances and their class labels is available (training set)

❑ The goal is to learn a mapping from the instances to the class labels, i.e., given a future unseen instance to predict its class label

■ Typical examples covered in this lecture:

❑ Classification

❑ Outlier detection

❑ Regression


Classification: an example

■ The goal is to learn a mapping from the “height, width” space to the class space (nails, screws, paper clips)

■ For new objects, the result of the classification is one of the class labels {nails, screws, paper clips}


[Figure: scatter plot of screws, nails and paper clips plus new, unlabeled objects; axes: Width [cm], Height [cm]]

instance  width  height  class
1         2.6    4.5     Screw
2         3.7    7.3     Nails
3         4.1    6.5     Paper clips
4         8.5    8.1     Screw
5         9.5    5.5     Nails
…         …      …       …
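One of the simplest ways to realize such a mapping is a nearest-neighbour classifier. The sketch below uses the five labeled instances from the table as the training set and assigns a new object the class of its closest training instance. The choice of 1-nearest-neighbour and the coordinates of the new object are assumptions made for illustration; the slide does not state which classifier or which new objects are used.

```python
import math

# Training set from the table: ((width, height), class label).
training_set = [
    ((2.6, 4.5), "Screw"),
    ((3.7, 7.3), "Nails"),
    ((4.1, 6.5), "Paper clips"),
    ((8.5, 8.1), "Screw"),
    ((9.5, 5.5), "Nails"),
]


def classify_1nn(training, new_object):
    """Return the class label of the training instance closest to new_object."""
    nearest = min(training, key=lambda item: math.dist(item[0], new_object))
    return nearest[1]


# Hypothetical new object (width, height) in cm.
new_object = (3.9, 6.6)
print(classify_1nn(training_set, new_object))
```

The closest training instance to (3.9, 6.6) is instance 3, so the new object is classified as a paper clip. With only five training instances the prediction is fragile; a larger training set or k > 1 neighbours would usually be preferred.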

Classification applications 1/2

■ Application: Fraud Detection

❑ Goal: Predict fraudulent cases in credit card transactions.

❑ Approach:

■ Use credit card transactions and the information on the account holders as attributes.

❑ When does a customer buy, what do they buy, how often do they pay on time, etc.

■ Label past transactions as fraud or fair transactions. This forms the class attribute.

■ Learn a model for the class of the transactions.

■ Use this model to detect fraud by observing credit card transactions on an account.


Classification applications 2/2

■ Application: Churn prediction in telco

❑ Goal: Predict whether a customer is likely to be lost to a competitor

❑ Approach:

■ Use detailed record of transactions with each of the past and present customers, to find attributes.

❑ How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.

■ Label the customers as loyal or disloyal (class attribute).

■ Find a model for customer loyalty

■ Use this model to predict churn and organize possible retention strategies.


Example: Google News


A huge variety of classification algorithms


■ Decision trees

■ k nearest neighbours

■ Support vector machines

■ Neural networks

■ Bayesian classifiers

■ Ensembles

Supervised learning: Regression

■ Similar to classification, but the target value to be learned is continuous rather than discrete.

■ Goal: Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.


Given this data, a friend has a house of 750 square feet. How much can they expect to get for it?

Source: Andrew Ng ML course, Coursera
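The house price question can be answered with a linear model fitted by least squares. The sketch below assumes a linear dependency with a single input variable; the sizes and prices are hypothetical stand-ins, since the actual data from the figure is not reproduced in these notes.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data: house size in square feet, price in $1000.
sizes = [500, 1000, 1500, 2000]
prices = [110, 190, 310, 390]

intercept, slope = fit_line(sizes, prices)
predicted_price = intercept + slope * 750   # the friend's 750 sq ft house
print(round(predicted_price, 1))
```

On these made-up numbers the fitted line has a slope of 0.192 and an intercept of 10, so the prediction for 750 square feet is about 154 (thousand dollars). A nonlinear model would follow the same pattern with a different function in place of the line.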

Application: Precision farming

■ Create a production curve depending on multiple parameters like soil characteristics, weather, used fertilizers.

■ Only the appropriate amount of fertilizers given the environmental settings (soil, weather) will result in maximum yield.

■ Controlling the effects of over-fertilization on the environment is also important

[Figure: inputs (water capacity, soil parameters, weather, fertilizers, …) feed a model that outputs the production curve; plot axes: fertilizers vs. production]


Different learning tasks: Unsupervised learning

■ Unsupervised learning/ Descriptive:

❑ Only a description of the instances is available

❑ No feedback/labels are available

❑ The goal is to discover groups of similar instances

■ Typical subtasks covered in this lecture:

❑ clustering

❑ association rules mining

❑ outlier detection


Clustering: an example

■ Each point described in terms of its height and width

■ No information on the actual classes (nails, paper clips) is available to the clustering algorithm.

[Figure: scatter plot with two groups of points, Cluster 1 and Cluster 2; axes: Width [cm], Height [cm]]

instance  width  height
1         2.6    4.5
2         3.7    7.3
3         4.1    6.5
4         8.5    8.1
5         9.5    5.5
…         …      …
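A partitioning method such as k-means can group these points without any class labels. The following is a minimal sketch of k-means with k = 2 on the five points from the table; the choice of the first and last point as initial centroids is an arbitrary assumption, and different initializations can lead to different clusters.

```python
import math

# The five unlabeled instances from the table: (width, height).
points = [(2.6, 4.5), (3.7, 7.3), (4.1, 6.5), (8.5, 8.1), (9.5, 5.5)]


def assign(points, centroids):
    """Assign each point the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    centroids = list(initial_centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Assumed initialization: first and last point as starting centroids.
labels, centroids = k_means(points, [points[0], points[4]])
print(labels)
```

With this initialization, instances 1 to 3 end up in one cluster and instances 4 and 5 in the other. Whether these clusters correspond to meaningful object types is not something the algorithm can tell; that is part of the evaluation step.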

Clustering applications 1/2

Application: Market Segmentation

■ Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

■ Approach:

❑ Collect different attributes of customers based on their geographical and lifestyle related information.

■ E.g., age, income, education, family status, ….

❑ Find clusters of similar customers.

❑ Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.


Clustering applications 2/2

Application: Document clustering

■ Find groups of documents (topics) that are similar to each other based on the important terms appearing in them.

■ Approach:

❑ Identify important terms in each document.

❑ Form a similarity measure between documents.

❑ Cluster based on the similarity measure.

■ Gain:

❑ Help the end user to navigate in the collection of documents (based on the extracted clusters).

❑ Utilize the clusters to relate a new document or search term to clustered documents.

■ Check for example, Google News.
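The first two steps of the approach (important terms and a similarity measure) can be sketched with term counts and cosine similarity. This is one common choice among many and is used here as an assumption; the three example documents are hypothetical, and a real system would also weight terms and remove stop words.

```python
import math
from collections import Counter


def term_counts(text):
    """Represent a document by the counts of its lower-cased words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 = no shared terms)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-collection of documents.
docs = {
    "d1": "election vote parliament vote",
    "d2": "parliament election debate",
    "d3": "football match goal",
}
vectors = {name: term_counts(text) for name, text in docs.items()}

print(cosine_similarity(vectors["d1"], vectors["d2"]))
print(cosine_similarity(vectors["d1"], vectors["d3"]))
```

Documents d1 and d2 share terms and get a similarity of about 0.47, while d1 and d3 share none and get 0.0. A clustering algorithm can then group documents using this measure (or the distance 1 minus similarity).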


Example: Google News


A huge variety of clustering algorithms


■ Partitioning methods (k-Means)

■ Grid-based methods (CLIQUE)

■ Density-based methods (DBSCAN)

■ Hierarchical methods

■ Constraint-based methods

■ Model-based methods (EM)

Unsupervised learning: Association rules mining

■ Task: Find all rules in the database, in the following form:

If x, y, z are contained in a set M, then t is also contained in M with a probability of at least X%.

Example transactions:

1. a, b, c, d, e
2. b, c, d
3. a, b, c, d
4. a, b, c, d, e
5. a, c, e, f
6. d, c, e, f
7. a, b, c, d, f

■ In 5 out of 7 cases (~71 %), b, c, d appear together.

■ In 5 out of 5 cases (100 %) it holds that: if b, c then d also exists.

Items: a = milk, b = cheese, c = wine, d = pasta, e = yogurt, f = apples
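The two percentages on this slide are the support of the itemset {b, c, d} and the confidence of the rule {b, c} → {d}. A minimal sketch that computes both on the seven example transactions is given below; the function names are chosen for this illustration.

```python
# The seven example transactions (a = milk, b = cheese, c = wine,
# d = pasta, e = yogurt, f = apples).
transactions = [
    set("abcde"),
    set("bcd"),
    set("abcd"),
    set("abcde"),
    set("acef"),
    set("dcef"),
    set("abcdf"),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support("bcd", transactions))          # b, c, d together: 5 of 7 transactions
print(confidence("bc", "d", transactions))   # if b, c then d: 5 of 5 transactions
```

An association rule miner searches for all rules whose support and confidence exceed user-defined thresholds; this sketch only evaluates a single given rule.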

Application: Market basket analysis

■ Result:

❑ Items that are frequently purchased together may be better positioned close to each other: e.g., since diapers are often purchased together with beer => place beer on the way from the diapers to the checkout

❑ Generate recommendations for customers with similar baskets: e.g., customers who bought “Star Wars” might also be interested in “The Lord of the Rings”.

[Figure: Shopping baskets → Data warehouse → Association rules]

Possible generalizations:
• Paprika-Chips → Snacks
• Enrichment of customer data


Unsupervised|Supervised Learning: Outlier detection

■ Outlier detection is defined as identification of non-typical data

■ Outliers might indicate

❑ possible abuse of credit cards, mobile phones

❑ data errors

❑ device failures
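One simple way to flag non-typical values is the z-score: how many standard deviations a value lies from the mean. This is only one of many outlier detection techniques and is not necessarily the one used in the lecture; the goals-per-game values and the threshold of 2.0 below are hypothetical.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    if std == 0:
        return []                     # all values identical: nothing is non-typical
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Hypothetical goals-per-game values for eight players.
goals_per_game = [0.10, 0.20, 0.15, 0.10, 0.05, 0.20, 0.12, 0.90]

print(z_score_outliers(goals_per_game))
```

Only the last player (0.90 goals per game) is flagged here. Note that the outlier itself inflates the mean and the standard deviation, which is why more robust measures are often preferred on small samples.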


Application

■ Analysis of the SAT.1-Ran-Soccer-Database (Season 1998/99)

❑ 375 players

❑ Primary attributes: Name, #games, #goals, playing position (goalkeeper, defense, midfield, offense),

❑ Derived attribute: Goals per game

❑ Outlier analysis (playing position, #games, #goals)

■ Result: Top 5 outliers

Rank  Name                # games  # goals  Position    Explanation
1     Michael Preetz      34       23       Offense     Top scorer overall
2     Michael Schjönberg  15       6        Defense     Top scoring defense player
3     Hans-Jörg Butt      34       7        Goalkeeper  Goalkeeper with the most goals
4     Ulf Kirsten         31       19       Offense     2nd scorer overall
5     Giovane Elber       21       13       Offense     High #goals per game


Note: “Outliers” is not necessarily a negative term.

Short break (5’) – Learning from the student data

■ Recall our learning example





■ If students are the learning instances, what sort of features could I use to describe each of them?

■ What could be the feedback/label for the learning model (if any)?

■ What could be a supervised learning task here?

❑ For classification? For prediction?

■ What could be an unsupervised learning task ?

❑ For clustering, frequent itemsets mining?

■ What could be an outlier detection problem here?











Course logistics 1/3

■ Class schedule

❑ Lectures: Wednesdays, 12:15 - 13:45, Multimedia-Hörsaal (3703 - 023), Appelstraße 4.

❑ Tutorials: Monday 10:00 - 11:30 and 13:30 - 15:00; Tuesday 11:45 - 13:15 and 13:30 - 15:00; Room 235, Building 3703, Appelstraße 4.

■ StudIP as a common information sharing place

❑ Up to date announcements and material

❑ Use the forum for your questions. They might benefit everyone!

■ Exam:

❑ Written exam, 90’

■ You are allowed to bring one hand-written A4 sheet with formulas etc. (no need to memorize). Each student should have their own sheet (copies are not allowed).

❑ The exam will be based on the material discussed in the class plus the tutorials.

■ Exam date: Monday 28.8.2019, 08:30-11:00, Rooms: F 102, F 303


Overview of the lectures (current planning)

1. Introduction

2. Getting to know our data

3. Association Rules Mining

4. Clustering

6. Classification

7. Outlier Detection



■ Projects

❑ Focus on the complete KDD pipeline for two different learning tasks:

■ Classification: 15/5/2019 & Clustering: 26/6/2019

■ Groups of 2 (Please form the teams by yourselves)

■ Goal: how to run a data mining case study? From data preprocessing to transformation, learning algorithm, evaluation and presentation of the results. Both analysis and presentation part are important.

■ We will use Kaggle for result submission (but you have to submit the report separately)

■ We will have a poster session at the end where each team presents its results

■ Bonus scheme

❑ Pass both projects: you switch to the next best grade

■ e.g., from 1.7 → 1.3

❑ Each member “inherits” the grade of the group

❑ Extra bonus for those that score best in Kaggle (system) & those with the best poster (voting)




■ Teaching Assistants

❑ Vasileios Iosifidis

■ Room 240, 2nd floor, Appelstraße 4

■ [email protected]

❑ Tai Le Quy

■ Room 010, Ground floor, Appelstraße 4


❑ Maximilian Idahl


❑ Wazed Ali


❑ Shaheer Asghar


■ Lecturer

❑ Prof. Dr. Eirini Ntoutsi

■ Room 203, 2nd floor, Appelstraße 4

Contact via email: [email protected]

Please use [DM1] in the subject

Tutorials: Organization

■ 4 tutorial groups

❑ Times:

■ Monday, 10:00 – 11:30 and 13:30 – 15:00

■ Tuesday, 11:45 – 13:15 and 13:30 – 15:00

❑ Room: 235, Appelstraße 4

❑ Registration for groups in Stud.IP, unlocked today at 20:00


Tutorials: Why to attend

■ Why should you attend the tutorials?

❑ Solving theoretical and algorithmic parts is an excellent preparation for the exam

❑ Working on the implementation part is useful for the bonus projects

■ Each tutorial will consist of

1. a theoretical part (e.g., properties of an algorithm)

2. an algorithmic part (e.g., applying an algorithm on a particular dataset)

3. an implementation part (e.g., how to run a data mining analysis in Python)


Tutorials: structure

■ 1 worksheet per week

❑ Announced after the lecture, so you have the chance to prepare beforehand

❑ Solutions will be available via Stud.IP

■ Theoretical and algorithmic parts will be mainly presented at the blackboard (with help from you)

■ Implementation part in the form of Jupyter notebooks (interactive Python)

■ Use Stud.IP for any tutorial-related question

❑ This is the fastest way to get an answer

❑ Your question might be relevant to other students

❑ Posing and answering questions is a great way to learn

■ Or send an e-mail to [email protected]


Tutorials: For next week (1st tutorial)

Take a look at python and jupyter notebooks

Installation:

Anaconda python distribution includes most packages needed for this course

Step-by-step installation guide for Windows/Mac/Linux: https://docs.anaconda.com/anaconda/

Quick start guide for jupyter notebooks: https://jupyter.readthedocs.io/en/latest/content-quickstart.html

Alternative:

Jupyter notebook environment Google Colab: https://colab.research.google.com/notebooks/welcome.ipynb

Free, requires no setup, runs entirely in the cloud



Tutorials: Python resources

There are lots of great python tutorials

Recommended: http://scipy-lectures.org/intro/language/python_language.html

Official: https://docs.python.org/3/tutorial/

If you prefer video tutorials:

https://pythonprogramming.net/python-fundamental-tutorials/ or

https://youtu.be/YYXdXT2l-Gg

And on jupyter notebooks

https://medium.com/codingthesmartway-com-blog/getting-started-with-jupyter-notebook-for-python-4e7082bd5d46 or

http://opentechschool.github.io/python-data-intro/core/notebook.html or

https://youtu.be/HW29067qVWk



Tutorials 1-2 plan


Tutorial 1 + 2 will include introductions to

Python basics

Arrays in NumPy

Data manipulation with Pandas

Visualization using matplotlib

Data mining and analysis tools in scikit-learn

Goal: Running a data analysis process in python, from data selection to pattern evaluation

Useful for the projects

Learning-by-doing

Textbook and recommended readings

■ Textbook:

❑ Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Addison-Wesley, 2014

❑ New edition is expected in May 2019

■ Recommended readings

❑ Meira and Zaki, Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014

❑ Mitchell T. M., Machine Learning, McGraw-Hill, 1997

❑ Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011

❑ Aggarwal C., Data Mining: The Textbook, Springer, 2015

❑ Witten I. H., Frank E., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2016.



Online resources

■ Machine Learning class by Andrew Ng, Stanford

❑ http://ml-class.org/

■ Tom Mitchell’s lectures on YouTube

❑ www.youtube.com/playlist?list=PLAJ0alZrN8rD63LD0FkzKFiFgkOmEtltQ

■ Kdnuggets: Data Mining and Analytics resources

❑ http://www.kdnuggets.com/


Tools

■ Several options for either commercial or free/ open source tools

❑ Check an up to date list at: http://www.kdnuggets.com/software/suites.html

■ Commercial tools offered by major vendors

❑ e.g., IBM, Microsoft, Oracle …

■ Free/ open source tools

❑ Weka

❑ Elki

❑ R

❑ SciPy + NumPy

❑ Orange

❑ Rapid Miner (free, commercial versions)












Things you should know from this lecture

■ KDD definition

■ KDD process

■ DM step

■ Supervised vs Unsupervised learning

■ Main DM tasks

❑ Clustering

❑ Classification

❑ Regression

❑ Association rules mining

❑ Outlier detection











Homework/ Tutorial

■ Homework: Think of some real world applications that you find suitable for Data Mining.

❑ Why?

❑ What type of patterns would you look for?

❑ Would you approach it as a supervised or unsupervised learning task?

■ Readings:

❑ Tan P.-N., Steinbach M., Kumar V book, Chapter 1.

❑ U. Fayyad, et al. (1995), “From Data Mining to Knowledge Discovery: An Overview”, in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.


Acknowledgement

■ The slides are based on

❑ KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, Eshref Januzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)

❑ Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/

❑ Pedro Domingos' Machine Learning course slides at the University of Washington

❑ Machine Learning book by T. Mitchell, slides at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html

❑ Thank you to all TAs contributing to their improvement, namely Vasileios Iosifidis, Damianos Melidis, Tai Le Quy, Han Tran

