Extracting biclusters of similar values with Triadic Concept Analysis

Extraction de biclusters de valeurs similaires à l’aide de l’analyse de concepts triadiques. M. Kaytoue, S. O. Kuznetsov, J. Macko, W. Meira Jr. and A. Napoli. Bordeaux, 31 January - 3 February 2012. Extraction et Gestion des Connaissances - EGC 2012

description

Talk "Extraction et gestion des connaissances" (EGC 2012)

Transcript of Extracting biclusters of similar values with Triadic Concept Analysis

Page 1: Extracting biclusters of similar values with Triadic Concept Analysis

Extraction de biclusters de valeurs similaires à l’aide de l’analyse de concepts triadiques

M. Kaytoue, S. O. Kuznetsov, J. Macko, W. Meira Jr. and A. Napoli

Bordeaux, 31 January - 3 February 2012

Extraction et Gestion des Connaissances - EGC 2012

Page 2: Extracting biclusters of similar values with Triadic Concept Analysis

Context

Knowledge Discovery in Databases


Page 3: Extracting biclusters of similar values with Triadic Concept Analysis

Biclustering numerical data

Numerical data and bicluster

Given a numerical dataset (G, M, W, I), i.e. an object/attribute data table:

G a set of objects (lines)

M a set of attributes (columns)

W a set of values

I ⊆ G × M × W a relation such that (g, m, w) ∈ I, written m(g) = w, means that object g takes the value w for attribute m (I simply represents the data cells)

a bicluster is a pair (A, B) with A ⊆ G and B ⊆ M, i.e. a rectangle in the data table


Page 4: Extracting biclusters of similar values with Triadic Concept Analysis

Biclustering numerical data: Example

Given a dataset (G, M, W, I) with

G = {g1, g2, g3, g4}

M = {m1, m2, m3, m4, m5}

W = {0, 1, 2, 6, 7, 8, 9}

and e.g. m2(g4) = 9

the bicluster ({g2, g3, g4}, {m3, m4}) can be viewed as the gray rectangle

     m1  m2  m3  m4  m5
g1    1   2   2   1   6
g2    2   1   1   0   6
g3    2   2   1   7   6
g4    8   9   2   6   7
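As a minimal sketch, the example dataset can be held in a nested Python dictionary, and a bicluster is then just a selection of rows and columns. The names DATA, ATTRS and subtable are illustrative and do not come from the talk.

```python
# Example dataset from the slide: DATA[g][m] is the value m(g).
ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 0, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}
DATA = {g: dict(zip(ATTRS, values)) for g, values in ROWS.items()}


def subtable(data, objects, attributes):
    """Return the rectangle of values selected by the bicluster (A, B)."""
    return [[data[g][m] for m in attributes] for g in objects]


# The bicluster ({g2, g3, g4}, {m3, m4}) from the slide.
rect = subtable(DATA, ["g2", "g3", "g4"], ["m3", "m4"])
print(rect)
```

Nothing constrains the values inside the rectangle at this point; the similarity condition is introduced later in the talk.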


Page 5: Extracting biclusters of similar values with Triadic Concept Analysis

Biclustering numerical data

But... a bicluster should reflect

a local phenomenon in the data: “rectangles of values”

connectedness of values: e.g. similar values

overlapping: objects/attributes may belong to several patterns

a partial order, e.g. for algorithmic issues

maximality of rectangles w.r.t. connectedness and ordering

Several types of biclusters


Page 6: Extracting biclusters of similar values with Triadic Concept Analysis

Biclustering numerical data

Several applications

Collaborative filtering and recommender systems

Finding web communities

Discovery of association rules in databases

Gene expression analysis, ...

Several algorithms

Iterative Row and Column Clustering Combination

Divide and Conquer / Distribution Parameter Identification

Greedy Iterative Search / Exhaustive Bicluster Enumeration

A difficult problem generally relying on heuristics

S. C. Madeira and A. L. Oliveira. Biclustering Algorithms for Biological Data Analysis: A Survey. In IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004.


Page 7: Extracting biclusters of similar values with Triadic Concept Analysis

Introducing similarity

A simple similarity relation

w1 ≃θ w2 ⇐⇒ |w1 − w2| ≤ θ, with θ ∈ ℝ and w1, w2 ∈ W

Considered type of biclusters

A bicluster (A, B) is a bicluster of similar values if

mi(gj) ≃θ mk(gl), ∀ gj, gl ∈ A, ∀ mi, mk ∈ B

     m1  m2  m3  m4  m5
g1    1   2   2   1   6
g2    2   1   1   0   6
g3    2   2   1   7   6
g4    8   9   2   6   7

(with θ = 2)

and maximal if no object/attribute can be added

J. Besson, C. Robardet, L. De Raedt, J.-F. Boulicaut. Mining Bi-sets in Numerical Data. In KDID 2006: 11-23.
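The definition of a (maximal) bicluster of similar values can be checked directly on the example table. The sketch below is a brute-force illustration, not the method of the talk; it relies on the fact that all values are pairwise within θ exactly when max − min ≤ θ. The function names are hypothetical.

```python
ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 0, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}
DATA = {g: dict(zip(ATTRS, values)) for g, values in ROWS.items()}


def is_similar(data, A, B, theta):
    """True if all values of the rectangle A x B are pairwise within theta."""
    values = [data[g][m] for g in A for m in B]
    return bool(values) and max(values) - min(values) <= theta


def is_maximal(data, A, B, theta):
    """True if (A, B) is similar and no object or attribute can be added."""
    if not is_similar(data, A, B, theta):
        return False
    for g in set(data) - set(A):
        if is_similar(data, set(A) | {g}, B, theta):
            return False
    for m in set(ATTRS) - set(B):
        if is_similar(data, A, set(B) | {m}, theta):
            return False
    return True


print(is_maximal(DATA, {"g1", "g2"}, {"m1", "m2", "m3", "m4"}, theta=2))
```

With θ = 2, the rectangle ({g1, g2}, {m1, m2, m3, m4}) only contains values between 0 and 2 and cannot be extended, whereas ({g2, g3, g4}, {m3, m4}) mixes 0 and 7 and is not a bicluster of similar values.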


Page 8: Extracting biclusters of similar values with Triadic Concept Analysis

Formal Concept Analysis (Ganter & Wille, 1999)

From a formal context to a concept lattice...

m1 m2 m3

g1 × ×
g2 × ×
g3 × ×
g4 × ×
g5 × × ×

Formal concepts = maximal rectangles

... with interesting properties (and existing algorithms!)

Maximality of concepts as rectangles

Overlapping of concepts

Specialization/generalisation hierarchy

This is exactly what we need for biclustering


Page 9: Extracting biclusters of similar values with Triadic Concept Analysis

Contribution

FCA: an interesting framework for biclustering

Use FCA for a complete, correct and non-redundant extraction of biclusters of similar values with lossless discretization

with no set similarity parameter (useful for top-k pattern discovery)

with a given similarity parameter (as in the literature)

Design an algorithm that

is better than its competitors

can be easily distributed

can handle several constraints (e.g. size) on the fly

A better understanding of closed numerical pattern mining


Page 10: Extracting biclusters of similar values with Triadic Concept Analysis

Outline

1 Formal Concept Analysis (FCA)

2 A first FCA-based biclustering method

3 Algorithm TriMax

4 Experiments

5 Conclusion and perspectives


Page 11: Extracting biclusters of similar values with Triadic Concept Analysis

Formal Concept Analysis (FCA)

In a nutshell...

FCA

A data analysis theory rooted in order and lattice theory that characterizes formal concepts (also known as closed itemsets)

A concept in a formal context

Formal context (G, M, I): objects, attributes, incidence relation

Two derivation operators are used to define formal concepts

A concept is a maximal rectangle of ×, modulo column and line permutations

m1 m2 m3

g1 × ×
g2 × ×
g3 × ×
g4 × ×
g5 × × ×

({g3, g4, g5}, {m2,m3}) is a formal concept
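The two derivation operators can be sketched in a few lines of Python. In the table above the column positions of the crosses did not survive extraction, so the rows for g1 and g2 below are assumptions (marked in the code); the rows for g3, g4 and g5 follow from the stated concept and the number of crosses. The names CONTEXT, intent, extent and is_concept are illustrative.

```python
ALL_ATTRS = {"m1", "m2", "m3"}

# Formal context as object -> set of attributes.
CONTEXT = {
    "g1": {"m1", "m2"},        # assumed cross positions (lost in extraction)
    "g2": {"m1", "m3"},        # assumed cross positions (lost in extraction)
    "g3": {"m2", "m3"},
    "g4": {"m2", "m3"},
    "g5": {"m1", "m2", "m3"},
}


def intent(objects):
    """Attributes shared by all the given objects."""
    shared = set(ALL_ATTRS)
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def extent(attributes):
    """Objects that have all the given attributes."""
    return {g for g, attrs in CONTEXT.items() if set(attributes) <= attrs}


def is_concept(A, B):
    """(A, B) is a formal concept iff each set is the derivation of the other."""
    return intent(A) == set(B) and extent(B) == set(A)


print(is_concept({"g3", "g4", "g5"}, {"m2", "m3"}))
```

The pair ({g3, g4}, {m2, m3}) is a rectangle of crosses but not a concept, because g5 can still be added: this is the maximality property the talk relies on.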


Page 12: Extracting biclusters of similar values with Triadic Concept Analysis

Formal Concept Analysis (FCA)

Triadic Concept Analysis (Lehmann & Wille, 1995)

“Extension” of FCA to ternary relations

An object has an attribute under a given condition

Triadic context (G, M, B, Y): objects, attributes, conditions, incidence relation

Several derivation operators characterize “triadic concepts” as maximal cubes of ×

b1:

m1 m2 m3

g1 ×
g2 × ×
g3 × ×
g4 × ×
g5 × ×

b2:

m1 m2 m3

g1 × × ×
g2 × ×
g3 × × ×
g4 × ×
g5 × ×

b3:

m1 m2 m3

g1 × ×
g2 ×
g3 × × ×
g4 × ×
g5 × × ×

({g3, g4, g5}, {m2, m3}, {b1, b2, b3}) is a triadic concept


Page 13: Extracting biclusters of similar values with Triadic Concept Analysis

1 Formal Concept Analysis (FCA)

2 A first FCA-based biclustering method

3 Algorithm TriMax

4 Experiments

5 Conclusion and perspectives

Page 14: Extracting biclusters of similar values with Triadic Concept Analysis

A first FCA-based biclustering method

Basic idea

Principle

Start from a numerical dataset

Build a triadic context with the same objects, the same attributes, and a discretized, lossless “numerical space” dimension

Extract triadic concepts

We show interesting links between biclusters of similar values and triadic concepts


Page 15: Extracting biclusters of similar values with Triadic Concept Analysis

A first FCA-based biclustering method

Discretization method

Interordinal scaling (an existing discretization scale)

Let (G, M, W, I) be a numerical dataset (with W the set of data values).

Now consider the set T = {[min(W), w], ∀ w ∈ W} ∪ {[w, max(W)], ∀ w ∈ W}.

Known fact: T and all its intersections characterize any interval of values on W.

Example

With W = {0, 1, 2, 6, 7, 8, 9}, one has

T = {[0, 0], [0, 1], [0, 2], [0, 6], ..., [1, 9], [2, 9], ..., [9, 9]}

and for example [0, 8] ∩ [2, 9] = [2, 8]
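The scale T and the interval intersection from the example can be reproduced with a short sketch. Intervals are represented as (low, high) tuples; the function names are illustrative and not taken from the talk.

```python
W = [0, 1, 2, 6, 7, 8, 9]


def interordinal_scale(values):
    """Build T = {[min(W), w]} ∪ {[w, max(W)]} for all w in W."""
    lo, hi = min(values), max(values)
    return {(lo, w) for w in values} | {(w, hi) for w in values}


def intersect(i, j):
    """Intersection of two closed intervals, or None if it is empty."""
    lo, hi = max(i[0], j[0]), min(i[1], j[1])
    return (lo, hi) if lo <= hi else None


T = interordinal_scale(W)
print(sorted(T))
print(intersect((0, 8), (2, 9)))
```

Only bounds that occur in W appear in T, so [0, 6] belongs to the scale while an interval such as [0, 3] does not.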


Page 16: Extracting biclusters of similar values with Triadic Concept Analysis

A first FCA-based biclustering method

Building a triadic context

Transformation procedure

From a numerical dataset (G, M, W, I), build a triadic context (G, M, T, Y) such that (g, m, t) ∈ Y ⇐⇒ m(g) ∈ t
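A minimal sketch of this transformation on the example dataset of the talk: the ternary relation Y is stored as a set of (object, attribute, interval) triples. The data layout and function names are assumptions made for illustration.

```python
ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 0, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}
DATA = {g: dict(zip(ATTRS, values)) for g, values in ROWS.items()}


def interordinal_scale(values):
    """T = {[min(W), w]} ∪ {[w, max(W)]}, intervals as (low, high) tuples."""
    lo, hi = min(values), max(values)
    return {(lo, w) for w in values} | {(w, hi) for w in values}


def triadic_context(data):
    """Return (T, Y) with (g, m, t) in Y iff m(g) lies in the interval t."""
    W = {w for row in data.values() for w in row.values()}
    T = interordinal_scale(W)
    Y = {
        (g, m, t)
        for g, row in data.items()
        for m, w in row.items()
        for t in T
        if t[0] <= w <= t[1]
    }
    return T, Y


T, Y = triadic_context(DATA)
print(("g4", "m2", (0, 9)) in Y, ("g4", "m2", (0, 8)) in Y)
```

Since m2(g4) = 9, the cell (g4, m2) is related to [0, 9] and [9, 9] but not to [0, 8].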


Page 17: Extracting biclusters of similar values with Triadic Concept Analysis

A first FCA-based biclustering method

First contribution

We proved that there is a 1-1 correspondence between

(i) Triadic concepts of the resulting triadic context

(ii) Biclusters of similar values maximal for some θ ≥ 0

Interesting facts

An efficient algorithm exists for concept extraction (Data-Peeler)

L. Cerf, J. Besson, C. Robardet, J.-F. Boulicaut. Closed patterns meet n-ary relations. In TKDD 3(1), 2009.

This algorithm can handle several constraints

Top-k biclusters: a concept (A, B, C) with high |A|, |B| and |C| corresponds to a bicluster (A, B) that is a large rectangle of close values (by properties of the interordinal scale)

This formalization allows us to design a new algorithm to extract maximal biclusters for a given parameter θ


Page 18: Extracting biclusters of similar values with Triadic Concept Analysis

1 Formal Concept Analysis (FCA)

2 A first FCA-based biclustering method

3 Algorithm TriMax

4 Experiments

5 Conclusion and perspectives

Page 19: Extracting biclusters of similar values with Triadic Concept Analysis

Algorithm TriMax

Compute all maximal biclusters for a given θ

Principle

Use another (but similar) discretization procedure, based on tolerance blocks, to build the triadic context

Standard algorithms output biclusters of similar values, but not necessarily maximal ones

We design a new algorithm, TriMax, for that task

TriMax is flexible, uses standard FCA algorithms in its core and is better than its competitors


Page 20: Extracting biclusters of similar values with Triadic Concept Analysis

Algorithm TriMax

Finding maximal sets of similar values

≃θ is a tolerance relation

reflexive, symmetric, but not transitive

Blocks of tolerance of W

Maximal sets of pairwise similar values are closed sets

Example with θ = 1

≃1   0  1  2  6  7  8  9
0    ×  ×
1    ×  ×  ×
2       ×  ×
6             ×  ×
7             ×  ×  ×
8                ×  ×  ×
9                   ×  ×

Blocks of tolerance

{0, 1}, {1, 2}, {6, 7}, {7, 8}, {8, 9}

Renamed classes

[0, 1], [1, 2], [6, 7], [7, 8], [8, 9]

S. O. Kuznetsov. Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research. In Formal Concept Analysis, Foundations and Applications, 2005.
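For numerical values under |w1 − w2| ≤ θ, the blocks of tolerance are the maximal runs of consecutive sorted values whose range does not exceed θ. The following sketch computes them with a simple scan; it is an illustration written for this transcript, and the function name is hypothetical.

```python
def tolerance_blocks(values, theta):
    """Return the maximal sets of pairwise similar values, as sorted lists."""
    vs = sorted(set(values))
    blocks = []
    last_end = -1  # index of the last value of the previously kept block
    for i, lo in enumerate(vs):
        # Extend the run starting at vs[i] as far as the range allows.
        j = i
        while j + 1 < len(vs) and vs[j + 1] - lo <= theta:
            j += 1
        # Keep the run only if it is not contained in the previous block.
        if j > last_end:
            blocks.append(vs[i:j + 1])
            last_end = j
    return blocks


W = [0, 1, 2, 6, 7, 8, 9]
print(tolerance_blocks(W, theta=1))
```

Each block can be renamed by its bounds, e.g. {0, 1} becomes [0, 1], which is how the blocks are used as the third dimension of the triadic context.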


Page 21: Extracting biclusters of similar values with Triadic Concept Analysis

Algorithm TriMax

New transformation procedure

Tolerance blocks based scaling

Compute the set C of all blocks of tolerance over W

From the numerical dataset (G, M, W, I), build the triadic context (G, M, C, Z) such that (g, m, c) ∈ Z ⇐⇒ m(g) ∈ c

Actually, we remove “useless information”

θ = 1


Page 22: Extracting biclusters of similar values with Triadic Concept Analysis

Algorithm TriMax

Second contribution

Algorithm TriMax

Any triadic concept corresponds to a bicluster of similar values, but not necessarily a maximal one!

This led us to the algorithm TriMax, which works as follows:

Each formal context (one for each block of tolerance) is processed with any existing FCA algorithm

Any resulting concept is a maximal bicluster candidate, and a simple procedure checks maximality (this may be problematic, but experiments show a good behaviour)

Each context can be processed separately

TriMax allows a complete, correct and non-redundant extraction of all maximal biclusters of similar values for a user-defined similarity parameter θ
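The overall scheme can be sketched on the small example dataset of the talk. This is not the authors' implementation: concepts are enumerated naively instead of with a dedicated FCA algorithm, and maximality is tested by brute force from the definition rather than through the modus. All function names are hypothetical.

```python
from itertools import combinations

ATTRS = ["m1", "m2", "m3", "m4", "m5"]
ROWS = {"g1": [1, 2, 2, 1, 6], "g2": [2, 1, 1, 0, 6],
        "g3": [2, 2, 1, 7, 6], "g4": [8, 9, 2, 6, 7]}
DATA = {g: dict(zip(ATTRS, values)) for g, values in ROWS.items()}


def tolerance_blocks(values, theta):
    """Maximal runs of sorted values whose range is at most theta."""
    vs, blocks, last_end = sorted(set(values)), [], -1
    for i, lo in enumerate(vs):
        j = i
        while j + 1 < len(vs) and vs[j + 1] - lo <= theta:
            j += 1
        if j > last_end:
            blocks.append(vs[i:j + 1])
            last_end = j
    return blocks


def concepts(context, attrs):
    """Naively enumerate concepts (A, B) with non-empty A and B."""
    found = set()
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            A = frozenset(g for g, ms in context.items() if set(subset) <= ms)
            if A:
                B = frozenset(set.intersection(*(context[g] for g in A)))
                found.add((A, B))
    return found


def is_maximal(data, A, B, theta):
    """Brute-force check; assumes (A, B) is already a bicluster of similar values."""
    def similar(objs, atts):
        vals = [data[g][m] for g in objs for m in atts]
        return max(vals) - min(vals) <= theta
    return (all(not similar(A | {g}, B) for g in set(data) - A)
            and all(not similar(A, B | {m}) for m in set(ATTRS) - B))


def trimax_sketch(data, theta):
    values = {w for row in data.values() for w in row.values()}
    result = set()
    for block in tolerance_blocks(values, theta):
        lo, hi = block[0], block[-1]
        # Dyadic context of this block: (g, m) is a cross iff m(g) is in the block.
        context = {g: {m for m, w in row.items() if lo <= w <= hi}
                   for g, row in data.items()}
        for A, B in concepts(context, ATTRS):
            if is_maximal(data, A, B, theta):
                result.add((A, B))  # the set removes duplicates across blocks
    return result


for A, B in sorted(trimax_sketch(DATA, theta=1), key=lambda p: sorted(p[0])):
    print(sorted(A), sorted(B))
```

Because each block of tolerance yields its own dyadic context, the loop over blocks has no shared state apart from the result set, which is what makes a distributed version plausible.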


Page 23: Extracting biclusters of similar values with Triadic Concept Analysis

1 Formal Concept Analysis (FCA)

2 A first FCA-based biclustering method

3 Algorithm TriMax

4 Experiments

5 Conclusion and perspectives

Page 24: Extracting biclusters of similar values with Triadic Concept Analysis

Experiments

TriMax - settings

Implementation: C++, boost library 1.42

In-Close algorithm for processing dyadic contexts

Data: gene expression data of the species Laccaria bicolor

Configuration: Intel CPU 2.54 GHz, 8 GB RAM


Page 25: Extracting biclusters of similar values with Triadic Concept Analysis

Experiments

TriMax - monitoring aspects

Starting with all 12 attributes, we vary the number of objects and the similarity parameter θ, and monitor:

Number of maximal biclusters of similar values

Execution time (in seconds)

Number of tolerance blocks

Density of the triadic context

Comparison of the number of non-maximal biclusters with the number of maximal biclusters

Execution time profiling of the main procedures of TriMax


Page 26: Extracting biclusters of similar values with Triadic Concept Analysis

Experiments

TriMax - experimental results

[Figure: six plots showing the number of maximal biclusters, the execution time (in seconds), the number of blocks of tolerance, the density of the triadic context, the number of generated biclusters, and the execution time]


Page 27: Extracting biclusters of similar values with Triadic Concept Analysis

Experiments

TriMax bottleneck

Computing the modus is problematic...

builds a formal context (2D) for each block of tolerance

extracts concepts (A, B) from each of them

computes the modus C to get a triadic concept (A, B, C) and checks maximality

But...

In many applications, experts have preferences

One can remove a bicluster candidate before the modus computation according to some constraints

Example with θ = 33,000, 500 objects and 12 attributes

104,226 maximal biclusters extracted in 16.130 sec

5,332 maximal biclusters in 2.1 sec with at least 10 (and at most 40) objects


Page 28: Extracting biclusters of similar values with Triadic Concept Analysis

Experiments

Comparison

Existing algorithms

Numerical Biset Miner (NBS-Miner) - not scalable

J. Besson, C. Robardet, L. De Raedt, J.-F. Boulicaut. Mining Bi-sets in Numerical Data. In KDID 2006: 11-23.

Interval Pattern Structures (IPS) - less efficient than TriMax

M. Kaytoue, S. O. Kuznetsov, and A. Napoli. Biclustering Numerical Data in Formal Concept Analysis. In ICFCA, Springer, 2011.


Page 29: Extracting biclusters of similar values with Triadic Concept Analysis

Experiments

An example of comparison

Increasing number of objects and all 12 attributes. Results in milliseconds.

[Figure: three plots, for θ = 0, θ = 700 and θ = 10000]

Other scenarios show a similar behaviour.


Page 30: Extracting biclusters of similar values with Triadic Concept Analysis

1 Formal Concept Analysis (FCA)

2 A first FCA-based biclustering method

3 Algorithm TriMax

4 Experiments

5 Conclusion and perspectives

Page 31: Extracting biclusters of similar values with Triadic Concept Analysis

Conclusion and perspectives

Conclusion

Contribution

A better understanding of closed numerical pattern mining within FCA

A formal characterization of a type of bicluster

TriMax for efficient computation

Perspectives

top-k bicluster discovery

n-dimensional numerical datasets

Distributed computation

Constraints (size, mean-square residue, etc.)

Links with Fuzzy FCA
