Intelligent Visualization of Multi Dimension Data Sets

By Hanaa Ismail Elshazly, PhD Student
Faculty of Computers and Information - Cairo University
Department of Computer Sciences
Supervisors: Prof. Dr. Aboul Ella Hassanien & Prof. Dr. Abeer Mohamed El Korany
Prof. Dr. Mostafa Reda Eltantawi


Contents

1 Introduction
2 Related Work
3 Proposed Model
4 Experimental Results
5 Conclusion
6 Future Work

Highlights

We introduce an automatic system to visualize multidimensional rules.

◦ The dimensions of the input data sets are reduced by feature selection techniques.

◦ A newly emerging problem is the large number of generated rules.

◦ Rules are refined using Genetic Algorithms so that they can be visualized.

◦ Refined rules are visualized interactively using nodes and edges.

Introduction

[Diagram: Multidimensional data, Reduction, Visualize]

Intelligent Visualization of Multidimensional Data Sets

Dimensions: A dimension is a key descriptor, an index, by which you can access facts according to the value (or values) you want.

Information visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information.

Introduction General

Massive and complex data are generated every day in many fields due to the advance of hardware and software technology.

Curse of dimensionality is a major obstacle in machine learning and data mining.

Clinical data referring to patients’ investigations contain irrelevant attributes that degrade the classification performance.

Visualization is important when analyzing multidimensional datasets, since it can help humans discover and understand complex relationships in data.

Introduction Data Problems

• Data Quality
• Integrating redundant data from different sources
• Mining information from heterogeneous databases
• Difficulty in training set
• Dynamic databases
• Dimensionality

Introduction Dimensionality reduction

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Most popular search methods that are manageable in low-dimensional space can be totally unmanageable in high-dimensional space.

The curse of dimensionality is a major obstacle in machine learning and data mining.

Reducing the dimensionality of the feature space leads to successful classification.

Selecting the optimal feature subset can substantially improve the classification performance.

FS Techniques: Filter, Wrapper, Embedded

Massive Data: Microarray GE, Medical Images, Huge Databases, Finance Data, Sensor Arrays, Web Documents

Reduced Data:
• Improves the comprehensibility of the induced concepts
• Decreases dataset complexity
• Improves classification performance
• Saves resources
• Visualization ability
• Better understanding of extracted knowledge
• Reduces computation requirements
• Reduces the effect of the curse of dimensionality

Introduction Dimensionality reduction

Visualization Techniques for Rules Mining

Table: the most common and simplest technique [Romero, C., Luna, J. M., Romero, J. R., Ventura, S., 2011]

Scatter Plot: represents rules as points in the coordinate plane according to their interestingness measures [Hahsler and Chelluboina, S., 2011]

Parallel Coordinates: represents rules as polygonal lines that intersect multiple vertical axes representing the associated items [Usman and Usman, 2016]

Directed Graph: uses nodes to represent items and directed edges for the antecedents and the consequents [Sekhavat, Y. A. and Hoeber, O., 2013]

Matrix: the prevailing approach for representing the antecedents and consequents [Lei et al., 2016]
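The directed-graph technique listed above can be sketched in a few lines. This is a minimal illustration, not the system presented in these slides: the function name `rules_to_graph`, the item strings, and the choice to link every antecedent item directly to every consequent item are all assumptions made for the example.

```python
from collections import defaultdict


def rules_to_graph(rules):
    """Return {source_item: {target_item: [rule indices]}} for a list of rules.

    Each rule is a pair (antecedent_items, consequent_items).
    """
    graph = defaultdict(lambda: defaultdict(list))
    for idx, (antecedent, consequent) in enumerate(rules):
        for src in antecedent:
            for dst in consequent:
                # Directed edge src -> dst, labelled with the rule that produced it.
                graph[src][dst].append(idx)
    return {src: dict(dsts) for src, dsts in graph.items()}


# Hypothetical rules, not taken from the experiments.
example_rules = [
    (["radius=3", "texture=2"], ["class=malignant"]),
    (["radius=3"], ["class=malignant"]),
    (["radius=1"], ["class=benign"]),
]
example_graph = rules_to_graph(example_rules)
```

Keeping the rule indices on each edge lets a viewer drill down from an edge to the individual rules behind it.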

Motivation

• Information has become a very valuable commodity, but many features that seem useful increase computational cost and storage requirements and decrease accuracy.

• There are many tasks that depend on dimensionality: text categorization, genomics, econometrics and computer vision.

Introduction

Problem Definition

• Dimensionality reduction is crucial in order to remove noise and improve accuracy.

• The cardinality of the extracted rules is central to the rule visualization process; too many rules may hinder their benefit.

• Limited work has been found on how users understand and use the extracted rules.

• There is a need for trust in, and better insight into, those rules.

Introduction

Objectives

• Develop and implement an automatic system that is capable of reducing multidimensional data sets as well as the number of generated rules.

• Provide a dynamic decision-rule visualization facility that leads to better insight into the mined rules.

• Provide different visualized levels of trust for the extracted rules.

Introduction

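Feature Selection

Before - After | Data Set | Method | Paper
12600 - 1000 | Prostate Cancer Data Set | Dual-process sample selection using Support Vector Machine | (Liu, Q. et al., 2013)
2000 - 500 | Colon Cancer Data Set | Dual-process sample selection using Support Vector Machine | (Liu, Q. et al., 2013)
7129 - 100 | Leukemia | Dual-process sample selection using Support Vector Machine | (Liu, Q. et al., 2013)
7129 - 100 | Myeloma | Dual-process sample selection using Support Vector Machine | (Liu, Q. et al., 2013)
30 - 12 | Breast Cancer Diagnosis | Linguistic hedges neuro-fuzzy classifier (LHNFC) | (Azar, A. and Hassanien, 2015)
32 - 13 | Breast Cancer Prognosis | Linguistic hedges neuro-fuzzy classifier (LHNFC) | (Azar, A. and Hassanien, 2015)
34 - 18 | Erythemato-squamous diseases | Linguistic hedges neuro-fuzzy classifier (LHNFC) | (Azar, A. and Hassanien, 2015)
 | Breast Cancer Data Set | Linguistic hedges neuro-fuzzy classifier (LHNFC) | (Azar, A. and Hassanien, 2015)
10 - 7, 10 - 5, 30 - 25, 30 - 20 | Wisconsin Breast Cancer (1992), Wisconsin Breast Cancer (1995) | Genetic/Particle Swarm (GPSO), Genetic/Fruit Fly (GFOA) | (Fei Ye, 2016)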

Related Work

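Traditional Classifiers

Accuracy % | Data Set | Classifier | Authors
94% | Prostate | Artificial Neural Networks (ANN) | (Saritas, I., Ozkan, I. A. and Sert, I. U., 2010)
95.5% | Breast Cancer | Naive Bayes + Feature Ranking | (Santos, V., Datia, N. and Pato, M. P. M., 2014)
93.1% | Breast Cancer | Random Forest + Feature Ranking | (Santos, V., Datia, N. and Pato, M. P. M., 2014)
93% | Breast Cancer | SVM + Feature Ranking | (Santos, V., Datia, N. and Pato, M. P. M., 2014)
96.45% | Breast Cancer | Dominance-based rough set (DRSA) | (Azar, A. T. et al., 2016)
90.6% | Heart Valve | Dominance-based rough set (DRSA) | (Azar, A. T. et al., 2016)
82.8% | Heart Disease | Dominance-based rough set (DRSA) | (Azar, A. T. et al., 2016)
97.9% | Dermatology | Dominance-based rough set (DRSA) | (Azar, A. T. et al., 2016)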

Related Work

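Ensemble Classifiers

Accuracy % | Data Set | Classifier | Authors
79.96% | Lymphography | Bagging credal decision trees (B-CDT) | (Abellán, J. and Masegosa, A. R., 2012)
96.4% | Breast | Differential Evolution | (De Falco, I., 2013)
64.7% | Liver | Differential Evolution | (De Falco, I., 2013)
80.7% | Lymphography | Differential Evolution | (De Falco, I., 2013)
93% | Breast | Random Forest + Feature Ranking | (Santos, V., Datia, N. and Pato, M. P. M., 2014)
92% | Breast | Bagging + Feature Ranking | (Santos, V., Datia, N. and Pato, M. P. M., 2014)
94.8% | Spine Diagnosis | RF + PCA | (Indrajit Mandal, 2015)
94.8% | Spine Diagnosis | Bagging | (Indrajit Mandal, 2015)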

Related Work

Enhanced visual data mining process for dynamic decision-making

(Ltifi, H., Benmohamed, E., Kolski, C. and Ben Ayed, M., 2016)

The research presented the design of a visual data mining method to support dynamic decision-making.

Aim: Assist physicians in fighting nosocomial infections in the intensive care unit of the Habib Bourguiba Hospital of Sfax, Tunisia.

Steps:
• Temporal Data Manipulation
• Temporal Visualization
• Discovered Knowledge Management

Limitations:
• The physician reduces the rules manually, according to the frequency of the act.
• The physician adds a weight to each rule to reflect the highest score.
• The method depends on the physician's expertise to specify the principal reason for the infection.
• Dimensionality is handled manually, through filtering by the physician.

Construction and evaluation of structured association map for visual exploration of association rules

(Kim J. W., 2017)

The research proposed a novel visualization method, a variant of the cluster heat map, for representing association rules.

Aim: Assist analysts in selecting relevant items to be used in many-to-many association rule mining.

Steps:
• Item classification (factor items and response items, assigned by the analysts).
• Generation of a factor dendrogram and a response dendrogram by applying a hierarchical clustering algorithm and distance measures.
• Matrix generation reflecting the interestingness measure of each rule.

Limitations:
• In a high-dimensional data set, it is too difficult to specify the factor and response items.
• Sorting the items according to their position in the associated dendrograms is an additional burden.

The Proposed General Model

• Pre-processing phase
• Feature Reduction phase
• Refinement phase
• Classification phase
• Visualization phase

Experimental Data Sets

Data Set | Classes | Instances | Features | Source
Wisconsin Breast Cancer - Diagnosis | 2 | 569 | 32 | UCI (Machine Learning Repository)
Wisconsin Breast Cancer - Prognosis | 2 | 198 | 32 | UCI (Machine Learning Repository)
SPECTF Heart Dataset | 2 | 267 | 45 | UCI (Machine Learning Repository)
Lymphography | 4 | 148 | 18 | University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia
Indian Liver Patient Dataset | 2 | 583 | 11 | UCI (Machine Learning Repository)
Prostate | 2 | 102 | 12600 | UCI (Machine Learning Repository)

Proposed General Model


Proposed Model Features

• It should be reliable and robust enough to cope with different data types.

• The proposed model addresses the tedious tasks encountered by the physician or decision maker during the exploration of the classification outcomes.

• It provides the decision-maker the chance to make proactive, knowledge-driven decisions and to be a part of the mining process by harnessing the perceptual capabilities of the human visual system.

Pre-processing Phase: Discretization

Aim: Reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals and replacing low-level concepts with higher-level concepts.

Techniques:
• Equal Binning: Transform numerical variables into categorical counterparts.
• Simplification: Rescale data to the range [1, 3].

Pre-processing Phase: Equal Binning Algorithm

For each feature V in data D
{
    Divide the domain of V into k intervals of equal size.
    The width of the intervals is:
        w = (max(V) - min(V)) / k
    and the interval boundaries are:
        min(V) + w, min(V) + 2w, ..., min(V) + (k-1)w
}
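The equal binning steps above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the bin labels 1..k, and the handling of constant features are assumptions. With k = 3 the labels fall on the 1-3 scale mentioned for the simplification step.

```python
def equal_width_bins(values, k=3):
    """Map each numeric value to a bin label in 1..k using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is placed entirely in the first bin.
        return [1] * len(values)
    w = (hi - lo) / k  # interval width: (max(V) - min(V)) / k
    labels = []
    for v in values:
        idx = int((v - lo) / w)             # index of the interval containing v
        labels.append(min(idx, k - 1) + 1)  # the maximum is clamped into bin k
    return labels


def discretize(table, k=3):
    """Apply equal-width binning to every column of a list of numeric rows."""
    columns = list(zip(*table))
    binned = [equal_width_bins(col, k) for col in columns]
    return [list(row) for row in zip(*binned)]
```

For example, `equal_width_bins([0, 1, 2, 3, 4, 5, 6], k=3)` uses w = 2 and boundaries 2 and 4.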

Hanaa Ismail Elshazly et al., "Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification", Proceedings of the Information and Communication Technologies (WICT), World Congress, IEEE, ISBN: 978-1-4673-4806-5, pp. 260-265, 2012.

How discretization techniques influence the classification of breast cancer data

Classifier | Boolean Reasoning % | Binning % | Entropy %
Naïve Bayes | 91 | 92.9 | 77.2
Decision Rules | 95.3 | 95.3 | 91.4
KNN | 94 | 94.7 | 76.1

Feature Selection Phase

• The goal of this step is to reduce the number of features. A large number of features severely hinders the applicability of popular search methods, which are designed for low-dimensional spaces and become totally unmanageable in high-dimensional spaces.

Techniques:

• PCA: a statistical technique useful in machine learning applications for data compression and for reducing the dimensions of massive data.

• Rough Set: offers a mathematical tool to discover patterns hidden in data, used for
    • Feature selection
    • Data reduction
    • Decision rule generation

Comparison of different selection techniques over three multidimensional data sets

Rough-Set-Based Feature Selection Technique

The rough set feature selection technique achieves the highest results on 2 of the 3 data sets.

What is Rough Set

Rough Set Concepts

• Information/Decision Systems (Tables)

• Indiscernibility

• Set Approximation

• Reducts and Core

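[Figure: the universe U, a set X, the partition U/R (R: a subset of attributes), and the lower and upper approximations of X with respect to R]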
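The indiscernibility and set-approximation concepts listed above can be illustrated with a short sketch. It follows the standard rough set definitions (lower approximation: union of the classes of U/R contained in X; upper approximation: union of the classes that intersect X). The function names and the toy table are hypothetical.

```python
from collections import defaultdict


def partition(universe, attrs):
    """Group objects that are indiscernible on attrs (the classes of U/R)."""
    classes = defaultdict(set)
    for name, row in universe.items():
        classes[tuple(row[a] for a in attrs)].add(name)
    return list(classes.values())


def approximations(universe, attrs, target):
    """Return the (lower, upper) approximations of the object set `target`."""
    lower, upper = set(), set()
    for block in partition(universe, attrs):
        if block <= target:
            lower |= block  # class lies entirely inside X
        if block & target:
            upper |= block  # class overlaps X
    return lower, upper


# Hypothetical information table: object name -> attribute values.
U = {
    "u1": {"a": 1, "b": 1},
    "u2": {"a": 1, "b": 1},
    "u3": {"a": 2, "b": 1},
    "u4": {"a": 2, "b": 3},
}
lower, upper = approximations(U, ["a", "b"], {"u1", "u3"})
```

Here u1 and u2 are indiscernible, so u1 cannot be placed in the lower approximation of X = {u1, u3}, while both u1 and u2 belong to the upper approximation.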

Hanaa Ismail Elshazly, Ahmad Taher Azar, Abeer Mohamed El Korany, Aboul Ella Hassanien, "Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications", International Journal of Fuzzy System Applications (IJFSA), 3(4), 31-46, doi: 10.4018/ijfsa.2013100103, 2013.

Discernibility

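Rough Sets Algorithm for Reduct Generation

Let $T = (U, C, D)$ be a decision table, with $U = \{u_1, u_2, \dots, u_n\}$. By the discernibility matrix of $T$, denoted $M(T)$, we will mean the $n \times n$ matrix defined as:

\[
m_{ij} =
\begin{cases}
\{c \in C : c(u_i) \neq c(u_j)\} & \text{if } \exists d \in D\; [d(u_i) \neq d(u_j)] \\
\lambda & \text{if } \forall d \in D\; [d(u_i) = d(u_j)]
\end{cases}
\]

For any $u_i \in U$:

\[
f_T^{i}(u) = \bigwedge_{j} \left\{ \bigvee m_{ij} : j \neq i,\; j \in \{1, 2, \dots, n\} \right\}
\]

Where
(1) $\bigvee m_{ij}$ is the disjunction of all variables $a$ such that $a \in m_{ij}$, if $m_{ij} \neq \emptyset$;
(2) $\bigvee m_{ij} = \bot$ (false), if $m_{ij} = \emptyset$;
(3) $\bigvee m_{ij} = t$ (true), if $m_{ij} = \lambda$.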

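Rough Set Rules Generation Algorithm

Let $T = (U, C, D)$ be a decision table, with $U = \{u_1, u_2, \dots, u_n\}$. By the discernibility matrix of $T$, denoted $M(T)$, we will mean the $n \times n$ matrix defined as:

\[
m_{ij} =
\begin{cases}
\{c \in C : c(u_i) \neq c(u_j)\} & \text{if } \exists d \in D\; [d(u_i) \neq d(u_j)] \\
\lambda & \text{if } \forall d \in D\; [d(u_i) = d(u_j)]
\end{cases}
\]

$m_{ij}$ is the set of all the condition attributes that classify objects $u_i$ and $u_j$ into different classes.

For any $u_i \in U$:

\[
f_T^{i}(u) = \bigwedge_{j} \left\{ \bigvee m_{ij} : j \neq i,\; j \in \{1, 2, \dots, n\} \right\}
\]

Where
(1) $\bigvee m_{ij}$ is the disjunction of all variables $a$ such that $a \in m_{ij}$, if $m_{ij} \neq \emptyset$;
(2) $\bigvee m_{ij} = \bot$ (false), if $m_{ij} = \emptyset$;
(3) $\bigvee m_{ij} = t$ (true), if $m_{ij} = \lambda$.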
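The discernibility matrix used for reduct and rule generation can be computed directly from a decision table. The sketch below assumes a single decision attribute and represents the table as a list of dictionaries; the function name and the use of `None` for the λ entries are illustrative choices, not part of the original algorithm description.

```python
def discernibility_matrix(rows, cond_attrs, dec_attr):
    """Return {(i, j): m_ij} for i < j over a list of dict rows.

    m_ij is the set of condition attributes on which rows i and j differ
    when their decisions differ; None stands in for lambda (same decision).
    """
    matrix = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i][dec_attr] == rows[j][dec_attr]:
                matrix[(i, j)] = None  # same decision: nothing to discern
            else:
                matrix[(i, j)] = {
                    c for c in cond_attrs if rows[i][c] != rows[j][c]
                }
    return matrix


# Hypothetical decision table with condition attributes a, b and decision d.
table = [
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 1, "b": 2, "d": "no"},
    {"a": 2, "b": 2, "d": "no"},
]
M = discernibility_matrix(table, ["a", "b"], "d")
```

Only the upper triangle is stored because the matrix is symmetric. An empty set would signal two objects with different decisions but identical condition values, that is, an inconsistent table.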

Rules Refinement Phase

Aim: Reduce the number of rules so that they can be easily visualized and presented to an expert, without decreasing the accuracy.

Techniques:
• Dependency Calculation
• Rules Generation
• Reduct Evaluation using Entropy
• GA using Support and Confidence as Fitness Function

Dependency calculation

• The degree to which the decision attribute d depends on a feature subset B is γ(B, d) = |POS(B, d)| / |U|.

• |U| and |POS(B, d)| denote the cardinalities of the sets U and POS(B, d), respectively.

• A d-dispensable attribute is a condition attribute that can be removed without losing classification performance, since the indiscernibility relation is preserved; otherwise the condition attribute is d-indispensable.
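The dependency degree γ(B, d) = |POS(B, d)| / |U| can be computed by grouping objects that share the same values on B and counting those whose group has a single decision value. The following is a minimal sketch under that standard definition; the function names and the toy table are assumptions.

```python
from collections import defaultdict


def dependency_degree(rows, B, d):
    """gamma(B, d) = |POS(B, d)| / |U| for a list of dict rows."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in B)].append(row[d])
    # A B-indiscernibility class belongs to the positive region when all of
    # its objects share one decision value.
    pos = sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)
    return pos / len(rows)


def is_dispensable(rows, C, a, d):
    """True when dropping attribute a from C leaves the dependency unchanged."""
    reduced = [c for c in C if c != a]
    return dependency_degree(rows, reduced, d) == dependency_degree(rows, C, d)


# Hypothetical decision table: the decision depends on b only.
table = [
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 1, "b": 2, "d": "no"},
    {"a": 2, "b": 2, "d": "no"},
    {"a": 2, "b": 1, "d": "yes"},
]
```

In this toy table attribute a is d-dispensable and b is d-indispensable.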

Reduct Evaluation

Calculate the entropy of the target: Gain(T) = Entropy(T);

\[
\mathrm{Entropy}(T) = -\sum_{i=1}^{c} p_i \log_2 p_i
\]

where $c$ is the number of possible values of the target.

For each $R_i$ in Reducts
{
    For each $x$ in $R_i$
    {
        $\mathrm{Entropy}(T, X) = \sum_{c \in x} P(c)\, E(c)$
    }
    $E_i = \mathrm{Ent}(T, X)$
}

Choose the $R_i$ with the largest information gain.
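The entropy and information-gain quantities used in reduct evaluation can be sketched as below. The code computes the gain for a single attribute using the textbook definitions; how the per-attribute values are combined into one score per reduct is not shown, and the function names and toy table are assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy, -sum(p_i * log2(p_i)), of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(rows, attr, target):
    """Entropy of the target within each value of attr, weighted by P(value)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def information_gain(rows, attr, target):
    """Entropy(T) minus the conditional entropy of T given attr."""
    return entropy([r[target] for r in rows]) - conditional_entropy(rows, attr, target)


# Hypothetical decision table: b determines d, a carries no information.
table = [
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 1, "b": 2, "d": "no"},
    {"a": 2, "b": 2, "d": "no"},
    {"a": 2, "b": 1, "d": "yes"},
]
```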

Genetic Algorithm Using Support and Confidence as Fitness Function

Body ==> Consequent [Support, Confidence]

Consequent: represents a discovered property of the examined data.

Support: represents the percentage of the records satisfying both the body and the consequent.

Confidence: represents the ratio of the records satisfying both the body and the consequent to those satisfying the body.

The main advantage of GAs is their robustness: once the problem is correctly modelled, the algorithm is able to explore the feasible region within the search space and exploit the best global solution.
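Support and confidence for a rule of the form Body ==> Consequent can be computed by counting matching records. The sketch below assumes the common association-rule definitions (support: fraction of records satisfying both body and consequent; confidence: that count divided by the number of records satisfying the body). How the two values are combined into a GA fitness score is not specified here, and the record layout and function names are illustrative.

```python
def matches(record, conditions):
    """True when the record has every attribute=value pair in conditions."""
    return all(record.get(k) == v for k, v in conditions.items())


def support_confidence(records, body, consequent):
    """Return (support, confidence) of the rule body ==> consequent."""
    n_body = sum(matches(r, body) for r in records)
    n_both = sum(matches(r, body) and matches(r, consequent) for r in records)
    support = n_both / len(records)
    # Assumption: a rule whose body never matches gets confidence 0.
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Hypothetical discretized records.
records = [
    {"f": 1, "d": "m"},
    {"f": 1, "d": "b"},
    {"f": 2, "d": "b"},
    {"f": 1, "d": "m"},
]
sup, conf = support_confidence(records, {"f": 1}, {"d": "m"})
```

In this example the body matches three of four records and two of those also satisfy the consequent, giving support 0.5 and confidence 2/3.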

Classification Phase

[Diagram: Multidimensional Data, Final Reducts, Rule Generation, Generated Rules, Classification with Decision Rules, Classified Instances, Testing, Tested Instances]

Aim: The learning algorithm, called a classifier, aims to return a set of decision rules together with a procedure that makes it possible to classify objects not found in the original decision table.

Techniques:
Rough Set Rules Generation using Discernibility Matrix
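Classifying unseen objects with a set of decision rules can be sketched as follows. The first-match strategy, the default label, and the rule representation (a condition dictionary paired with a decision) are assumptions made for the illustration; the slides do not state how conflicts between rules are resolved.

```python
def classify(obj, rules, default=None):
    """Return the decision of the first rule whose conditions all hold for obj."""
    for conditions, decision in rules:
        if all(obj.get(attr) == value for attr, value in conditions.items()):
            return decision
    return default  # no rule fired


def accuracy(objects, labels, rules, default=None):
    """Fraction of test objects whose predicted decision equals the true label."""
    hits = sum(classify(o, rules, default) == y for o, y in zip(objects, labels))
    return hits / len(objects)


# Hypothetical rules over discretized attributes (values on the 1-3 scale).
rules = [
    ({"radius": 3, "texture": 2}, "malignant"),
    ({"radius": 1}, "benign"),
]
prediction = classify({"radius": 1, "texture": 3}, rules, default="unknown")
```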

Visualization Phase

Visual elements:
• Graph Nodes
• Edges
• Charts
• Grids

[Diagram (VISUALIZATION): Rules & Reducts, Measurement Calculation for Rules Supporting, Refined Decision Rules, Refined Rules with Trusted Levels, Rendering]

The expert can manage the induced rules through trust levels that enable fast trust decisions.

Experimental Results

Parameters Setting for Breast Cancer Experiment

Train : 70% Test : 30%

Encoding Strategy : Discretized values are ranged on scale (1-3).

Population Size : 400

Crossover Selection Parents : Random.

Crossover Probability : 0.1

Crossover position : Single point

Cut Position : Random

Fitness Threshold : 0-2

Termination Criteria : Set to the rules number specified by the physician.

Significance Level : 0 – 0.3 Less Significant

0.4 – 0.5 Medium Significant

> 0.5 Most Significant
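The significance levels above can be expressed as a small mapping from a rule's score to a trust level. The listed ranges leave values between 0.3 and 0.4 unassigned; the sketch assumes they count as medium. The function names and the idea of grouping rules by level before rendering are illustrative.

```python
from collections import defaultdict


def trust_level(score):
    """Map a rule's significance score to one of the three levels."""
    if score > 0.5:
        return "Most Significant"
    if score > 0.3:  # assumption: scores in (0.3, 0.4) are treated as medium
        return "Medium Significant"
    return "Less Significant"


def group_by_trust(scored_rules):
    """Group (rule, score) pairs by trust level."""
    groups = defaultdict(list)
    for rule, score in scored_rules:
        groups[trust_level(score)].append(rule)
    return dict(groups)
```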

Visualization of Breast Cancer Rules

400 R 87000 R

Experimental Results

Visualization of Breast Cancer Reducts

Visualization of the features of the breast data set, ordered by their occurrence over all extracted reducts.

Experimental Results

Visualization of Breast Cancer Rules

Visualization of global and detailed nodes representing refined classification rules of the breast data.

212 R 400 R 87000 R

Experimental Results

Visualization of Breast Cancer refined rules distributed over 2 levels of Trust


Visualization of Breast Cancer Rules

Visualization of Refined Breast Cancer Decision Rules According to Trusting Levels.

Experimental Results

Visualization of Breast Cancer Rules

Navigation through Refined Breast Cancer Decision Rules Details According to Trusting Levels.

Experimental Results

Experimental Results

Parameters Setting for Prostate Cancer Experiment

Train : 70% Test : 30%

Encoding Strategy : Discretized values are ranged on scale (1-3) .

Population Size : 117

Crossover Selection Parents : Random.

Crossover Probability : 0-2

Crossover position : Single point

Cut Position : Random

Fitness Threshold : 0-1

Termination Criteria : Set to the rules number specified by the physician.

Significance Level : 0 – 0.3 Less Significant

0.4 – 0.5 Medium Significant

> 0.5 Most Significant
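The crossover settings listed above (random parents, single-point crossover, random cut position) can be sketched as follows. Chromosomes are assumed to be equal-length lists of discretized values on the 1-3 scale; the function names and the `rng` parameter are illustrative, and the fitness evaluation and termination check are omitted.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length chromosomes at a random cut position."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randrange(1, len(parent_a))  # cut strictly inside the chromosome
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def maybe_crossover(parent_a, parent_b, probability, rng=random):
    """Cross the parents with the given probability, otherwise return copies."""
    if rng.random() < probability:
        return single_point_crossover(parent_a, parent_b, rng)
    return list(parent_a), list(parent_b)
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible.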

Visualization of Prostate Cancer Reducts

Visualization of all reducts of the Prostate Cancer data set and all features ordered by their occurrence in all extracted reducts.

Experimental Results

Visualization of Prostate Cancer Rules

Navigation through Refined Prostate Cancer Decision Rules According to Trusting Levels.

71 R 117R 22000 R

Experimental Results

Hanaa Ismail Elshazly et al., "Weighted Reduct Selection Metaheuristic Based Approach for Rules Reduction and Visualization", International Conference on Computing Communication and Automation (ICCCA2016), IEEE, Buddh Nagar, Uttar Pradesh, India, 2016.

Visualization of Prostate Cancer Rules

Visualization of Refined Prostate Cancer Decision Rules According to Trusting Levels.

Experimental Results

Visualization of Prostate Cancer Rules

Navigation through Refined Prostate Cancer Decision Rules According to Trusting Levels.

Experimental Results

Performance analysis

Reduct Matching Approach: Consider all features of the informative reduct.

Core Matching Approach: Consider only the intersection of all reducts.
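The core used by the core matching approach, the intersection of all reducts, and the occurrence-based ordering of features shown in the reduct visualizations can both be computed in a few lines. The function names and the example reducts are hypothetical.

```python
from collections import Counter


def core(reducts):
    """Features shared by every reduct (the intersection of all reducts)."""
    sets = [set(r) for r in reducts]
    return set.intersection(*sets) if sets else set()


def feature_occurrence(reducts):
    """Features ordered by the number of reducts containing them, most frequent first."""
    counts = Counter(f for r in reducts for f in set(r))
    # Ties are broken alphabetically so that the ordering is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical reducts over feature names f1..f4.
reducts = [["f1", "f2", "f3"], ["f1", "f3"], ["f1", "f4"]]
```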

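[Charts: number of rules (RULES) and accuracy (ACC%) for the Prostate and Breast Cancer data sets. Values as extracted, in order: Prostate 117; Breast Cancer 472; 71, 99, 400, 98; 60, 98, 212, 95.]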

Conclusions

• A model is proposed for knowledge-based classification and visualization of decision rules, which enhances the classification process and improves insight into the knowledge carried by the rules.

• The physician can identify a minimum number of rules with trusted levels to reach an efficient diagnosis of diseases.

• An interactive visualization approach is presented.

• The user can explore large sets of rules freely by focusing attention on limited subsets.

Future Work

• The promising results of the proposed model encourage applying it to other multidimensional data sets.

• Other dynamic visualization techniques can be applied to meet the different requirements of physicians.