
BDA'02
18th Journées Bases de Données Avancées

Proceedings edited by Philippe Pucheral

Évry, France, October 21-25, 2002


Preface

These proceedings gather the papers presented at the 18th Journées Bases de Données Avancées (BDA'02), held in Évry from October 21 to 25, 2002. Since their first edition in 1985, the Journées Bases de Données Avancées have become a genuine institution and a privileged forum of exchange and communication for the entire French-speaking database community.

The BDA conference aims to cover a broad spectrum of problems, ranging from information-system infrastructure to the technologies integrated at the core of DBMS engines. As the selection of the 22 papers in these proceedings shows (out of 50 submitted to the program committee), our community addresses the theoretical as well as the practical facets of the major current problems in databases. Moreover, for the first time in its history, BDA includes a "demonstration" session devoted to the presentation of research prototypes; six demonstrations were selected. The conference was completed by two tutorials, three invited talks, and a day dedicated to industrial technology transfer.

I take this opportunity to warmly thank all those who, in various ways, worked for the success of the conference:

- all the authors who submitted their research work,
- Jean Ferrié (PR, Univ. Montpellier II), who agreed to chair the conference,
- Frédéric Cuppens (MC, ONERA) and Claude Godart (PR, Nancy I), who respectively gave the tutorials "Protection des systèmes d'information" (information-system security) and "Travail coopératif assisté par ordinateur" (computer-supported cooperative work),
- Amr El Abbadi (PR, Univ. Calif. Santa Barbara), Marc Shapiro (DR, Microsoft Research Cambridge) and Eric Viara (Sysra & Infobiogen), who presented their work in the invited talks "Improving Access Efficiency for Spatial Databases", "Réplication : les approches optimistes" (optimistic approaches to replication) and "Banques et bases de données en biologie moléculaire : de la donnée à la structure" (data banks and databases in molecular biology: from data to structure),
- the members of the program committee and the external reviewers, for their invaluable help in evaluating the submitted papers and selecting those to be published,
- the chair and the members of the organization committee, who handled the entire local organization of the conference,
- the institutions GET, INT, INRIA and CNRS (through the GDR I3), the Ministry of Youth, National Education and Research, as well as the companies France Télécom R&D, IBM France and Microsoft France, which provided logistic and/or financial support,
- and, last but not least, all those whose name appears in no committee and who did not hesitate to lend a hand out of sheer generosity.

Philippe Pucheral
Professor, Université de Versailles
Program Committee Chair


Conference Chair

Jean Ferrié, LIRMM, Montpellier

Program Committee

Chair
Philippe Pucheral, PRiSM, Versailles

Paper Selection Committee
Laurent Amsaleg, IRISA, Rennes
Véronique Benzaken, LRI, Orsay
Guy Bernard, INT, Évry
David Billard, CUI, Genève
Philippe Bonnet, DIKU, Copenhague
Patrick Bosc, ENSSAT - IRISA, Lannion
Gérôme Canals, LORIA, Nancy
Claude Chrisment, IRIT, Toulouse
Rosine Cicchetti, LIF, Marseille
Sophie Cluet, Xyleme / INRIA-Rocquencourt
Christine Collet, LSR, Grenoble
Isabelle Comyn-Wattiau, CEDRIC / Univ. Cergy
Anne Doucet, LIP6, Paris
Alban Gabillon, LIUPPA, Pau
Alejandro Gutierrez, INCO, Montévidéo
Zoubida Keddad, PRiSM, Versailles
Jacques Lemaître, SIS, Toulon
Michel Léonard, CUI, Genève
Noureddine Mouaddib, IRIN, Nantes
Esther Pacitti, CRIP5, Paris V
Jean-Marc Petit, LIMOS, Clermont-Ferrand
Claudia Roncancio, LSR, Grenoble
Michel Scholl, CEDRIC / INRIA-Rocquencourt
Bruno Traverson, EDF R&D, Clamart

Demonstration Selection Committee
Lionel Brunie, INSA, Lyon
Claire Carpentier, INT, Évry
Béatrice Finance, PRiSM, Versailles
Hervé Martin, LSR, Grenoble
Pascal Poncelet, LIRMM, Montpellier


External Reviewers
Jacky Akoka, Mourad Alia, Edgard Benitez-Guerrero, Omar Benjelloun, Laure Berti-Equille, Mohand Boughanem, Emmanuel Bruno, Fernando Carpani, François Charoy, Denis Conan, Bruno Defude, Fabien De Marchi, François Denis, Alexandre Dikovsky, Jean-Marie Favre, Irini Fundulaki, Laurent Gallon, Stéphane Gancarski, Joaquin Goyoaga, Patrick Gros, Mohand-Said Hacid, Lotfi Lakhal, Thang Le Dinh, Alexandre Lefebvre, Ludovic Liétard, Stéphane Lopes, Viet Phan Luong, Adriana Marotta, Pascal Molli, Franck Morvan, Regina Motz, Manuel Munier, Elisabeth Murisasco, Noël Novelli, Verónika Peralta, Olivier Perrin, Olivier Pivert, Mohamed Quafafou, Guillaume Raschia, Daniel Rocacher, Raul Ruggia, Samira Si-Saïd, Mehdi Snene, Pham Thi, Slim Turki, Patrick Valduriez, Genoveva Vargas Solar, Dan Vodislav, José Luis Zechinelli Martini, Karine Zeitouni

Organization Committee

Chair
Bruno Defude, INT, Évry

Members
Sandrine Bourguer, INT, Évry
Amel Bouzeghoub, INT, Évry
Claire Carpentier, INT, Évry
Brigitte Houassine, INT, Évry
Samir Tata, INT, Évry


Sponsors / Partners


Contents

Session 1: Invited Talk
• Improving Access Efficiency for Spatial Databases
  Amr El Abbadi

Session 2: XML Data Management
• A comparative study for XML change detection
  Grégory Cobéna, Talel Abdessalem, Yassine Hinnach
• Querying XML Sources Using an Ontology-based Mediator
  Bernd Amann, Catriel Beeri, Irini Fundulaki, Michel Scholl
• Construction and Maintenance of a Set Of Pages Of Interest (SPIN) using Active XML
  Serge Abiteboul, Grégory Cobéna, Benjamin Nguyen, Antonella Poggi

Session 3: Adaptive Distributed Systems
• A Component-based Infrastructure for Customized Persistent Object Management
  Luciano Garcia-Banuelos, Phuong-Quynh Duong, Christine Collet
• Parallel Processing with Autonomous Databases in a Cluster System
  Stéphane Gançarski, Hubert Naacke, Esther Pacitti, Patrick Valduriez
• RS2.7: an Adaptable Replication Framework
  Stéphane Drapeau, Claudia Lucia Roncancio, Pascal Déchamboux
• La tolérance aux fautes adaptable pour les systèmes à composants : application à un gestionnaire de données
  Phuong-Quynh Duong, Elisabeth Pérez-Cortés, Christine Collet

Session 4: Design Methodologies
• From UML to ROLAP Multidimensional Databases using a Pivot Model
  Nicolas Prat, Jacky Akoka
• Measuring UML Conceptual Modeling Quality, Method and Implementation
  Samira Si-Said Cherfi, Jacky Akoka, Isabelle Comyn-Wattiau

Session 5: Invited Talk
• Réplication : les approches optimistes
  Marc Shapiro

Session 6: Web Services
• Active XML: A Data-Centric Perspective on Web Services
  Serge Abiteboul, Omar Benjelloun, Ioana Manolescu, Tova Milo, Roger Weber
• Efficient Data and Program Integration Using Binding Patterns
  Ioana Manolescu, Luc Bouganim, Françoise Fabret, Eric Simon
• Dynamic discovery of e-services: A description logics based approach
  Mohand-Said Hacid, Alain Leger, Christophe Rey, Farouk Toumani

Session 7: Miscellaneous
• Distances de similarité d'images basées sur les arbres quaternaires
  Marta Rukoz, Maude Manouvrier, Geneviève Jomier
• A First Experience in Archiving the French Web
  Serge Abiteboul, Grégory Cobéna, Julien Masanes, Gerald Sedrati
• A propos de requêtes possibilistes adressées à des bases de données possibilistes
  Laurence Duval, Olivier Pivert

Session 8: Invited Talk
• Banques et bases de données en biologie moléculaire : de la donnée à la structure
  Eric Viara

Session 9: Logic and Databases
• Sémantique des programmes Datalog avec négation sous hypothèses non-uniformes
  Yann Loyer, Nicolas Spyratos
• Database Summarization: Application to a Commercial Banking Data Set
  Régis Saint-Paul, Guillaume Raschia, Noureddine Mouaddib
• Incertitude et hypothèses non-uniformes dans les bases de données déductives
  Yann Loyer, Umberto Straccia

Session 10: Data Mining
• A Method for Computing Frequent Key and Closed Itemsets in One Phase
  Viet Phan Luong
• Treillis Relationnel : Une Structure Algébrique pour le Data Mining Multi-Dimensionnel
  Alain Casali, Rosine Cicchetti, Lotfi Lakhal
• Règles d'association significatives
  Tao-Yuan Jen, Nicolas Spyratos, Yuzuru Tanaka
• ε-functional dependency inference: application to DNA microarray expression data
  Alexandre Aussem, Jean-Marc Petit

Demonstrations
• A Platform for Experimenting Disconnected Objects on Mobile Hand-Held Devices
  Denis Conan, Sophie Chadridon, Olivier Villin, Guy Bernard
• Une plate-forme de télé-enseignement à base de composants réutilisables et personnalisables
  John Freddy Duitama, Amel Bouzeghoub, Claire Carpentier, Bruno Defude
• Experiencing Persistent Object Management Customization
  Luciano Garcia-Banuelos, Phuong-Quynh Duong, Tanguy Nedelec, Christine Collet
• DBA Companion : un outil pour l'analyse de bases de données
  Stéphane Lopes, Fabien de Marchi, Jean-Marc Petit
• Indexation et interrogation de photos de presse décrites en MPEG-7
  Emmanuel Bruno, Jacques Le Maitre, Elisabeth Murisasco
• Un environnement de prototypage rapide d'applications web pour bases de données
  Bruno Defude

Author Index


Session 1: Invited Talk


Improving Access Efficiency for Spatial Databases

Amr El Abbadi

Department of Computer Science
University of California, Santa Barbara, CA
[email protected]

ABSTRACT.

Spatial databases are widely used in applications such as geographical information systems (GIS) and computer-aided design (CAD) systems. Due to the interactive nature of the applications and the complexity of the objects, spatial databases face some unique challenges compared to more traditional database systems. As part of the Alexandria Digital Library Project, our work addresses some of these challenges, with a focus on efficient access to large spatial datasets at different levels.

At the collection level, browsing allows users to visualize the selectivity of multiple queries before evaluating the queries at the database, thereby providing an efficient way to explore the content of a dataset. For spatial browsing, we first identify a set of spatial relations using a simplified intersection model. We then propose three storage-efficient approximation algorithms based on the Euler Histogram. We also extend the Euler Histogram into a framework for spatial join selectivity estimation. Based on the characteristics of different datasets, different probabilistic models can be plugged into this framework to provide better estimation results.

At the object level, we propose a novel approach to accelerate spatial selections and joins with computer graphics hardware. Our work is focused on hardware-based implementation of the polygon-polygon intersection test, which is the computational primitive for many spatial database operations. The general approach is to use graphics hardware to render the polygon interiors or boundaries, then check the frame buffer for overlapping pixels. Due to the disparity between the data resolution and the pixel resolution, the main challenge is to provide not only speed improvements, but also the high accuracy required by spatial database applications. We develop three approximation algorithms using hardware blending and accumulation functions. The blending-based algorithm is simple and fast, and the two accumulation-based algorithms provide provable guarantees on either precision or recall at a small cost of efficiency. We also show that by combining the accumulation-based algorithms with the conventional software approach, significant speedup can be achieved for complex data without sacrificing accuracy.


Session 2: XML Data Management


A comparative study for XML change detection

Grégory Cobéna, Talel Abdessalem, Yassine Hinnach

INRIA, France
Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay
[email protected]

ENST, France
46, rue Barrault, 75013 Paris
[email protected]
[email protected]

ABSTRACT. Change detection is an important part of version management for databases and document archives. The success of XML has recently renewed interest in change detection on trees and semi-structured data, and various algorithms have been proposed. We study here different algorithms and representations of changes, based on their formal definition and on experiments conducted over XML data from the Web. Our goal is to provide an evaluation of the quality of the results, the performance of the tools and, based on this, to guide users in choosing the appropriate solution for their applications.

RÉSUMÉ. In the context of temporal databases or of document archiving, change detection is an essential aspect of version management. The success of XML has brought renewed interest in diff algorithms for tree structures, and notably for semi-structured data. Recently, several algorithms and models have been proposed, and we wished to carry out a comparative study of these solutions. Based on their formal definitions and on experiments conducted over XML data from the Web, we study here the proposed algorithms as well as the representations of changes. Our goal is to evaluate the performance of the tools and the quality of the results obtained, in order to help choose an appropriate solution that meets the specific needs of each application.

KEYWORDS: XML, Semi-structured Data, diff, Change Detection, Versions, Tree edit problem, Tree pattern matching

MOTS-CLÉS: XML, semi-structured data, change detection, versions


1. Introduction

The context for the present work is change control in XML data warehouses. In such a warehouse, documents are collected periodically, for instance by crawling the Web. When a new version of an existing document arrives, we want to understand the changes that occurred since the previous version. Considering that we have only the old and the new version of a document, and no other information on what happened in between, a diff needs to be computed. A typical setting for the diff algorithm is as follows: the input consists of two files representing two versions of the same document; the output is a delta file representing the changes that occurred.

In this paper, we consider XML input documents and XML delta files to represent changes. The goal of this survey is to analyze the different existing solutions and, based on this, assist users in choosing the appropriate tools for their applications. We study two dimensions of the problem: (i) the representation of changes, and (ii) the detection of changes.

Representing Changes. To understand the important aspects of change representation, we point out some possible applications:

– In Version Management [CHI 00, MAR 01], the representation should allow for effective storage strategies and efficient reconstruction of the versions of the documents.

– In Temporal Applications [CHA 99b], support for a persistent identification of XML tree nodes is mandatory, since one would like to identify (i.e. trace) a node through time.

– In Monitoring Applications [CHE 00, NGU 01], changes are used to detect events and trigger actions. The trigger mechanism involves queries on changes that need to be executed in real time. For instance, in a catalog, finding the products whose type is 'digital camera' and whose price has decreased.

As mentioned above, the deltas we consider here are XML documents summarizing the changes. The choice of XML is motivated by the need to exchange, store and query these changes. XML makes it possible to support better quality services as in [CHE 00] and [NGU 01], in particular real query languages [W3C b, AGU 00], and facilitates data integration [W3C a]. Since XML is a flexible format, there are different possible ways of representing the changes on XML and semi-structured data [CHA 98, La 01, MAR 01, XML], and of building version management architectures [CHI 00]. In Section 3, we compare change representation models and we focus on recent proposals that have a formal definition, a framework to query changes and an available implementation, namely DeltaXML [La 01], XyDelta [MAR 01], XUpdate [XML] and Dommitt [Dom].

Change Detection. In some applications (e.g. an XML document editor) the system knows exactly which changes have been made to a document, but in our context, the sequence of changes is unknown. Thus, the most critical component of change control is the diff module that detects changes between an old version of a document and the new version. The input of a diff program consists of these two documents, and possibly their DTD or XML Schema.


Its output is a delta document representing the changes between the two input documents. Important aspects are as follows:

– Correctness: We suppose that all diffs are "correct", in that they find a set of operations that is sufficient to transform the old version into the new version of the XML document. In other words, they miss no changes.

– Minimality: In some applications, the focus will be on the minimality of the result (e.g. number of operations, edit cost, file size) generated by the diff. This notion is explained in Section 2. Minimality of the result is important to save storage space and network bandwidth. Also, the effectiveness of version management depends both on minimality and on the representation of changes.

– Semantics: Some algorithms consider more than the tree structure of XML documents. For instance, they may consider keys (e.g. ID attributes defined in the DTD) and match with priority two elements with the same tag if they have the same key. In the world of XML, the semantics of data is becoming extremely important [W3C a], and some applications may be looking for semantically correct results or impose semantic constraints, e.g. that a product in a catalog is identified by its name and that only its price might be modified.

– Performance and Complexity: With dynamic services and/or large amounts of data, good performance and low memory usage become mandatory. For example, some algorithms find a minimum edit script (given a cost model detailed in Section 2) in quadratic time and space.

– "Move" Operations: The capability to detect move operations (see Section 2) is only present in certain diff algorithms. The reason is that it has an impact on the complexity (and performance) of the diff, and also on the minimality and the semantics of the result.

To explain how the different criteria affect the choice of a diff program, consider the application of cooperative work on large XML documents. Large XML documents are replicated over the network. We want to permit concurrent work on these documents and to efficiently update the modified parts. Thus, a diff between XML documents is computed. The semantic support of ID attributes makes it possible to divide the document into finer-grain structures, and thus to efficiently handle concurrent transactions. Then, changes can be applied (propagated) to the files replicated over the network. When the level of replication is low, priority is given to performance when computing the diff, rather than to the minimality of the result.

Experiment Settings. Our comparative study relies on experiments conducted over XML documents found on the web. Xyleme [xyl] crawled more than five hundred million web pages (HTML and XML) in order to find five hundred thousand XML documents. Because only part of them changed during the time of the experiment (several months), our measures are based on roughly one hundred thousand XML documents. Most experiments were run on sixty thousand of them (because of the time it would take to run them on all the available data).


It would also be interesting to run it on private data (e.g. financial data, press data); such data is typically more regular. We intend to conduct such an experiment in the future.

Observe that our work is intended for XML documents. It can also be used for HTML documents by XML-izing them, a relatively easy task that mostly consists in properly closing tags. However, change management (detection + representation) for a "true" XML document is semantically much more informative than for HTML. It includes pieces of information such as the insertion of particular subtrees with a precise semantics, e.g. a new product in a catalog.

The paper is organized as follows. First, we present the data, operations and cost model in Section 2. Then, we compare change representations in Section 3. The next section is an in-depth state of the art in which we present change detection algorithms and the programs implementing them. In Section 5 we present a performance analysis (speed and memory). Finally, we study the quality of the results of diff programs in Section 6. The last section concludes the paper.

2. Preliminaries

In this section, we introduce the notions that will be used along the paper. The data model we use for XML documents is labeled ordered trees, as in [MAR 01]. We will also briefly consider some algorithms that support unordered trees.

Operations. The change model is based on editing operations as in [MAR 01], namely insert, delete, update and move. There are various possible interpretations of these operations. For instance, in Kuo-Chung Tai's model [TAI 79], deleting a node means making its children become children of the node's parent. But this model may not be appropriate for XML documents, since deleting a node changes the depth of the nodes below it and may also invalidate the document structure according to its DTD.

Thus, for XML data, we use Selkow's model [SEL 77] in which operations are only applied to leaves or subtrees. For instance, when a node is deleted, the entire subtree rooted at the node is deleted. This captures the XML semantics better, for instance removing a product from a catalog by deleting the corresponding subtree. Important aspects presented in [MAR 01] include (i) the management of positions in XML documents (e.g. the position of sibling nodes changes when some are deleted), and (ii) the consistency of the sequence of operations depending on their order (e.g. a node cannot be updated after one of its ancestors has been deleted).
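To make the model concrete, here is a minimal sketch of Selkow-style operations on a labeled ordered tree; the Node class and function names are illustrative assumptions, not taken from [SEL 77] or [MAR 01]:

class Node:
    # A node of a labeled ordered tree.
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def size(self):
        # Number of nodes in the subtree rooted here.
        return 1 + sum(c.size() for c in self.children)

def delete_subtree(parent, position):
    # Selkow's model: deleting a node removes the whole subtree under it,
    # so depths never change; only the positions of later siblings shift.
    return parent.children.pop(position)

def insert_subtree(parent, position, subtree):
    parent.children.insert(position, subtree)

def update(node, new_label):
    # update applies to a single node (e.g. a text value)
    node.label = new_label

Deleting a product element of a catalog, for instance, is a single delete_subtree call that removes the product and everything below it.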

Edit Cost. The edit cost of a sequence of edit operations is defined by assigning a cost to each operation. Usually, this cost is 1 per node touched (inserted, deleted, updated or moved). If a subtree with n nodes is deleted (or inserted), for instance using a single delete operation applied to the subtree root, then the edit cost for this operation is n. Since most diff algorithms are based on this cost model, we use it in this study. The edit distance between document D1 and document D2 is defined by the minimal edit cost over all edit sequences transforming D1 into D2. A delta is minimal if its edit cost is no more than the edit distance between the two documents.
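Under this cost model, the cost of a delta can be computed directly from the touched subtrees. A small sketch, assuming the toy Node class above and a hypothetical list-of-tuples representation of the script:

def edit_cost(script, flat_move_cost=None):
    # "1 per node touched": an inserted or deleted subtree costs its size;
    # an update costs 1. By default a moved subtree also costs its size;
    # flat_move_cost=1 gives the alternative model mentioned below and
    # used again in Section 6.
    cost = 0
    for kind, arg in script:              # e.g. ("delete", subtree_root)
        if kind == "update":
            cost += 1
        elif kind == "move" and flat_move_cost is not None:
            cost += flat_move_cost
        else:                             # insert, delete, or full-price move
            cost += arg.size()
    return cost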

One may want to consider different cost models. For instance, one could assign a cost of 1 to each edit operation, e.g. to deleting or inserting an entire subtree. But in this case, a minimal edit script would often consist of the two following operations: (i) delete the first document with a single operation applied to the document's root, (ii) insert the second document with a single operation. We briefly mention in Section 6 some results based on a cost model where the cost of insert, delete and update is 1 per node, but the cost of moving an entire subtree is only 1.

The move operation. The semantics of move is to identify nodes (or subtrees) even when their context (e.g. ancestor nodes) has changed. Some of the proposed algorithms are able to detect move operations between two documents, whereas others do not. We recall that most formulations of the change detection problem with move operations are NP-hard [ZHA 95]. So the drawback of detecting moves is that such algorithms will only approximate the minimum edit script. The benefit of using a move operation is that, in some applications, users will consider that a move operation is less costly than a delete and an insert of the subtree. In temporal databases, move operations are important to detect from a semantic viewpoint because they make it possible to identify (i.e. trace) nodes through time better than delete and insert operations do.

Mapping/Matching. In this paper, we will also use the notion of "mapping" between the two trees. Each node in D1 (or D2) that is not deleted (or inserted) is "matched" to the corresponding node in D2 (or D1). A mapping between two documents represents all matchings between nodes from the first and the second document. In some cases, a delta is said to be "minimal" if its edit cost is minimal over the restriction of editing sequences compatible with a given "mapping"1.

The definition of the mapping and the creation of a corresponding edit sequence are part of "change detection". The "change representation" consists of a data model for representing the edit sequence.

3. Comparison of the Change Representation Models

XML has been widely adopted both in academia and in industry to store and exchange data. [CHA 99b] underlines the necessity of querying semistructured temporal data. Recent works [CHA 99b, La 01, CHI 00, MAR 01] study version management and temporal queries over XML documents. Although an important aspect of version management is the representation of changes, a standard is still missing.

In this section we recall the problem of change representation for XML documents, and we present the main recent proposals on the topic, namely DeltaXML [La 01] and XyDelta [MAR 01]. Then we present some experiments conducted over Web data.

1. A sequence based on another mapping between nodes may have a lower edit cost.


As previously mentioned, the main motivations for representing changes are version management, temporal databases and monitoring data. Here, we analyse these applications in terms of (i) version storage strategies and (ii) querying changes.

Versions Storage Strategies. In [CHI], a comparative study of version management schemes for XML documents is conducted. For instance, two simple strategies are as follows: (i) storing only the latest version of the document and all the deltas for previous versions, (ii) storing all versions of the documents, and computing deltas only when necessary. When only deltas are stored, their size (and edit cost) must be reduced. For instance, the delta is in some cases larger than the versioned document. We have analyzed the performance of reconstructing a document's version based on the delta. The time complexity is in all cases linear in the edit cost of the delta. The computation cost for such programs is close to the cost of manipulating the XML structure (reading, parsing and writing).
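As an illustration of strategy (i), a sketch of version reconstruction from the latest version plus backward deltas; the delta application itself is abstracted here as a callable, since its concrete form depends on the representation model:

def reconstruct(latest_doc, backward_deltas, target_version):
    # backward_deltas[i] maps version i+1 back to version i, so the time
    # to rebuild an old version is linear in the edit cost replayed.
    doc = latest_doc
    for delta in reversed(backward_deltas[target_version:]):
        doc = delta(doc)
    return doc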

One may want to consider a flat text representation of changes, which can be obtained for instance with the Unix diff tools. In most applications, it is efficient in terms of storage space and of performance for reconstructing the documents. Its drawbacks are: (i) it is not XML and cannot be used for queries, (ii) files must be serialized into flat text, which cannot be used in native (or relational) XML repositories.

Querying Changes. We recall here that support for both indexing and persistent identification is useful. On one hand, labeling nodes with both their prefix and postfix position in the tree makes it possible to quickly compute ancestor/descendant tests and thus significantly improves querying [AGU 00]. On the other hand, labeling nodes with a persistent identifier accelerates temporal queries and reduces the cost of updating an index. In principle, it would be nice to have one labeling scheme that contains both structure and persistence information. However, [COH 02] shows that this requires longer labels and uses more space.
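The prefix/postfix labeling mentioned above reduces ancestor tests to two integer comparisons. A sketch, reusing the hypothetical Node class from Section 2:

def make_ancestor_test(root):
    # pre[x] / post[x]: rank of node x in a preorder / postorder traversal.
    pre, post, counters = {}, {}, [0, 0]
    def walk(n):
        pre[id(n)] = counters[0]; counters[0] += 1
        for c in n.children:
            walk(c)
        post[id(n)] = counters[1]; counters[1] += 1
    walk(root)
    def is_ancestor(u, v):
        # u is a strict ancestor of v iff it starts before v in preorder
        # and finishes after v in postorder.
        return pre[id(u)] < pre[id(v)] and post[id(u)] > post[id(v)]
    return is_ancestor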

Also note that using move operations is often important to maintain persistent identifiers, since using delete and insert does not lead to a persistent identification. Thus, the support of move operations improves the effectiveness of temporal queries.

3.1. Change Representation Models

We now present change representation models, in particular DeltaXML [La 01] and XyDelta [MAR 01]. In terms of features, the main difference between them is that only XyDelta supports move operations. Except for move operations, it is important to note that the two representations are formally equivalent, in that simple algorithms can transform a XyDelta delta into a DeltaXML delta, and conversely.

DeltaXML: In [La 01] (or similarly in [CHA 99b]), the delta information is stored in a "summary" of the original document by adding "change" attributes. It is easy to present and query changes on a single delta, but slightly more difficult to aggregate deltas or issue temporal queries on several deltas. The delta has the same look and feel as the original document, but it is not strictly validated by the DTD. The reason is that while most operations are described using attributes (with a DeltaXML namespace), a new type of tag is introduced to describe text node updates. More precisely, for obvious parsing reasons, the old and new values of a text node cannot be put side by side, and the tags <deltaxml:oldtext> and <deltaxml:newtext> are used to distinguish them.

There is some storage overhead when the change rate is low because: (i) position management is achieved by storing the root of unchanged subtrees, (ii) the change status is propagated to ancestor nodes. A typical example would be:

<catalog deltaxml:delta='modified'>
  <product deltaxml:delta='unchanged' />
  <product deltaxml:delta='modified'>
    <status deltaxml:delta='deleted'>Unavailable</status>
    <name>Digital Camera</name>
    <description>...</description>
    <price deltaxml:delta='inserted'>$399</price>
  </product>
</catalog>

Note that it is also possible to store the whole document, including unchanged parts, along with the changed data.

XyDelta: In [MAR 01], every node in the original XML document is given a unique identifier, namely an XID, according to an identification technique called the XidMap. The XidMap gives the list of all persistent identifiers in the XML document, in the prefix order of nodes. The delta then represents the corresponding operations: identifiers that are not found in the new (old) version of the document correspond to nodes that have been deleted (inserted)2. The previous example would generate a delta as follows. In this delta, the nodes 15-17 (i.e. from 15 to 17) that have been deleted are removed from the XidMap of the second version. In a similar way, the persistent identifiers 31-33 of the inserted nodes are now found between node 23 and node 24.

<xydelta v1_XidMap="(1-30)" v2_XidMap="(1-14;18-23;31-33;24-30)">
  <delete xid="(15-17)" parent="6" position="1">
    <status>Not Available</status>
  </delete>
  <insert xid="(31-33)" parent="6" position="4">
    <price>$399</price>
  </insert>
</xydelta>

2. Move and update operations are described in [MAR 01].


XyDeltas have nice mathematical properties: e.g. they can be aggregated, inverted and stored without knowledge of the original document. Also, the persistent identifiers and move operations are useful in temporal applications. The drawback is that the delta does not contain contexts (e.g. ancestor nodes or siblings of the nodes that changed), which are sometimes necessary to understand the meaning of changes or to present query results to the users. Therefore, the context has to be obtained by processing the document.
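For instance, inversion needs nothing but the delta itself. A sketch over simplified operation records; the dictionary field names are hypothetical, in the spirit of [MAR 01], not its actual format:

def invert(delta_ops):
    # Swapping delete/insert (and old/new values of updates) turns a
    # v1 -> v2 delta into a v2 -> v1 delta, with no access to either
    # version of the document.
    inverse = []
    for op in reversed(delta_ops):
        if op["op"] == "delete":
            inverse.append({**op, "op": "insert"})
        elif op["op"] == "insert":
            inverse.append({**op, "op": "delete"})
        elif op["op"] == "update":
            inverse.append({**op, "old": op["new"], "new": op["old"]})
        else:  # move: swap source and destination positions
            inverse.append({**op, "from": op["to"], "to": op["from"]})
    return inverse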

XUpdate [XML] provides means to update XML data, but it misses a more precise framework for version management or for querying changes.

Dommitt's [Dom] representation of changes is in the spirit of DeltaXML. However, surprisingly, instead of using change attributes, new node types are created. For instance, when a book node is deleted, an xmlDiffDeletebook node is used. A drawback is that the delta DTD is significantly different from the document's DTD.

Remark. No existing change representation can be validated by (i) either a generic DTD (because of the document's specific tags) (ii) or the versioned document's DTD (because of the text node updates mentioned previously). These issues will have to be considered in order to define a standard for representing changes of XML documents in XML.

3.2. Change Representation Experiments

Figure 1 shows the size of a delta represented using DeltaXML or XyDelta as a function of the edit cost of the delta. The delta cost is defined according to the "1 per node" cost model presented in Section 2. Each dot represents the average3 delta file size for deltas with a given edit cost. It confirms clearly that DeltaXML is slightly larger for lower edit costs, because it describes many unchanged elements. On the other hand, when the edit cost becomes larger, its size is comparable to XyDelta's. The deltas in this figure are the results of more than twenty thousand XML diffs, roughly twenty percent of the changing XML that we found on the web.

4. State of the art in Change Detection

In this section, we present an overview of the abundant previous work in this domain. The algorithms we describe are summarized in Figure 2.

A diff algorithm consists of two parts: first, it matches nodes between the two (versions of the same) document(s); second, it generates a document, namely a delta, representing a sequence of changes compatible with the matching.

3. Although fewer dots appear in the left part of the graph, each of them represents the average over several hundred measures.


Figure 1. Size of the delta files (average delta file size, in bytes, as a function of the delta editing cost, in units; one curve for XyDelta and one for DeltaXML)

For most XML diff tools, no complete formal description of their algorithms is available. Thus, our performance analysis is not based on formal proofs. We compared the formal upper bounds of the algorithms and we conducted experiments to test the average computation time. We also give a formal analysis of the minimality of the delta results.

The following subsections are organized as follows. First, we introduce the string edit problem. Then, we consider optimal tree pattern matching algorithms that rely on the string edit problem to find the best matching. Finally, we consider other approaches that first find a meaningful mapping between the two documents, and then generate a compatible representation of changes.

4.1. Introduction: The String Edit Problem

Longest Common Subsequence (LCS). In a standard way, the diff tries to find a minimum edit script between two strings. It is based on edit distances and the string edit problem [APO 97, LEV 66, SAN 83, WAG 74]. Insertion and deletion correspond to inserting and deleting a (single) symbol in a string. A cost (e.g. 1) is assigned to each operation. The string edit problem corresponds to finding an edit script of minimum cost that transforms a string A into a string B. A solution is obtained by considering the cost D(i, j) of transforming the prefix substring of A (up to the i-th symbol) into the prefix substring of B (up to the j-th symbol). On this matrix of (i, j) positions, a directed acyclic graph (DAG) representing all operations and their edit costs is constructed. Each path ending on (i, j) represents an edit script transforming A[1..i] into B[1..j]. The minimum edit cost is then given by the minimal cost of these three possibilities:

    D(i, j) = min( D(i-1, j)   + cost_delete(A[i]),
                   D(i, j-1)   + cost_insert(B[j]),
                   D(i-1, j-1) + cost_update(A[i], B[j]) )

The edit distance between A and B is given by D(|A|, |B|), and the minimum edit script by the corresponding path. Note that, for example, the cost cost_update(A[i], B[j]) is zero when the two symbols are identical.

The sequence of nodes that are not modified by the edit script (nodes on diagonal edges of the path) is a common subsequence of A and B. Thus, the problem is equivalent to finding the "Longest Common Subsequence" (LCS) of A and B. Note that each node in the common subsequence defines a matching pair between the two corresponding symbols of A and B.

The space and time complexity are O(|A| * |B|). This algorithm has been improved by Masek and Paterson using the "four-russians" technique [MAS 80], in O(n^2 / log n) and O(n^2 log log n / log n) worst-case running time for finite and arbitrary alphabet sets respectively.

D-Band Algorithms. In [MYE 86], an O(N*D) algorithm is exhibited, where D is the size of the minimum edit script. Such algorithms, namely D-Band algorithms, consist in computing cost values only close to the diagonal of the matrix. A diagonal k is defined by the (i, j) couples with the same difference i - j = k; e.g. for k = 0 the diagonal contains (0,0), (1,1), (2,2), and so on. When using the usual "1 per node" cost model, diagonal areas of the matrix, e.g. all diagonals from -d to +d, contain all edit scripts of cost lower than a given value d. Obviously, if a valid edit script of cost lower than d is found to be minimum inside the diagonal area, then it must be the minimum edit script. When d is zero, the area consists solely of the diagonal starting at (0,0). By increasing d, it is then possible to find the minimum edit script in O((N+M)*D) time. Using a more precise analysis of the number of deletions, [WU 90] improves this algorithm's performance significantly when the two document lengths differ substantially. This D-Band technique is used by the famous GNU diff [FSF] program for text files.

4.2. Optimal Tree Pattern Matching

Serialized XML documents can be considered as strings, and thus we could use a "string edit" algorithm to detect changes. This may be used for raw storage and raw version management, and can indeed be implemented using GNU diff, which only supports flat text files. However, in order to support better services, it is preferable to consider the specific algorithms for tree data that we describe next. The complexity we mention for each algorithm is relative to the total number of nodes in both documents. Note that the number of nodes is linear in the document's file size.

Previous Tree Models. Kuo-Chung Tai [TAI 79] gave a definition of the edit distance between ordered labeled trees and the first non-exponential algorithm to compute it. The time and space complexity is quasi-quadratic.

In Selkow's variant [SEL 77], which is closer to XML, the LCS algorithm described previously is used on trees in a recursive algorithm. Considering two documents D1 and D2, the time complexity is O(|D1| * |D2|). In the same spirit is Yang's [YAN 91] algorithm for finding the syntactic differences between two programs.

MMDiff and XMDiff. In [CHA 99a], S. Chawathe presents an external memory algorithm, XMDiff (based on the main memory version MMDiff), for ordered trees in the spirit of Selkow's variant. Intuitively, the algorithm constructs a matrix in the spirit of the "string edit problem", but some edges are removed to enforce that deleting (or inserting) a node deletes (or inserts) the subtree rooted at this node. More precisely, (i) a diagonal edge exists if and only if the corresponding nodes have the same depth in the tree; (ii) a horizontal (resp. vertical) edge from (i, j) to (i+1, j) (resp. (i, j+1)) exists unless the depth of the node with prefix label i+1 in D1 is lower than the depth of the node with prefix label j+1 in D2 (resp. conversely). For MMDiff, the CPU and memory costs are quadratic, O(|D1| * |D2|). With XMDiff, memory usage is reduced but I/O costs become quadratic.

Unordered Trees. In XML, we sometimes want to consider the tree as unordered. The general problem becomes NP-hard [ZHA 92], but by constraining the possible mappings between the two documents, K. Zhang [ZHA 96] proposed an algorithm in quasi-quadratic time. In the same spirit is X-Diff [WAN] from NiagaraCQ [CHE 00]. In these algorithms, for each pair of nodes from D1 and D2 (e.g. the root nodes), the distance between their respective subtrees is obtained by finding the minimum-cost mapping for matching children (by reduction to the minimum-cost maximum-flow problem [ZHA 96, WAN]). More precisely, the complexity is O(|D1| * |D2| * (deg(D1) + deg(D2)) * log(deg(D1) + deg(D2))), where deg(D) is the maximum outdegree (number of child nodes) of D. We do not consider these algorithms further since we did not experiment on unordered XML trees. However, their characteristics are similar to MMDiff, since both find a minimum edit script in quadratic time.

DeltaXML. One of the most featured products on the market is DeltaXML [DEL]. It uses a similar technique based on longest common subsequence computations; more precisely, it uses Wu's [WU 90, MYE 86] D-Band algorithm to run in quasi-linear time. The complexity is O(n * D), where n is the total size of both documents and D the edit distance between them. Because the algorithm is applied at each level separately, the result is not strictly minimal. Recent versions of DeltaXML support the addition of keys (either in the DTD or as attributes) that can be used to enforce correct matching (e.g. always match a person by its name attribute). DeltaXML also supports unordered XML trees.


Others. In a similar way, IBM developed XML TreeDiff [IBM] based on [CUR 99] and [SHA 90]. A first phase is added which consists in pruning identical subtrees based on their hash signature, but it is not clear whether the result obtained is still minimal. Sun also released an XML-specific tool named DiffMK [Sun] that computes the difference between two XML documents. This tool is based on the standard Unix diff algorithm, and uses a list description of the XML document, thus losing the benefit of the tree structure of XML. The tests that we conducted, and other results found on the web, seem to indicate that the current version is not "correct".

For both programs, we experienced difficulties in running the tools on a large set of files4. Thus, these two programs were not included in our experiments.

We were surprised by the relatively weak offering in the area of XML diff tools, since we are not aware of more featured XML diff products from important companies. We think that this may be due to a missing widely accepted XML change protocol. It may also be the case that some products are not publicly available. Fortunately, the algorithms we tested represent well the spirit of today's tools: a quadratic minimum-script finding algorithm (MMDiff), a linear-time approximation (DeltaXML), and tree pattern matching with move operations (see next).

4.3. Tree pattern matching with a move operation

The main reason why few diff algorithms supporting move operations have been developed earlier is that most formulations of the tree diff problem are NP-hard [ZHA 95, CHA 97] (by reduction from "exact cover by three-sets"). One may want to convert a pair of delete and insert operations applied to a similar subtree into a single move operation. But the result obtained is in general not minimal, unless the cost of a move operation is strictly identical to the total cost of deleting and inserting the subtree.

LaDiff. Recent work from S. Chawathe includes LaDiff [CHA 96, CHA 97], designed for hierarchically structured information. It introduces matching criteria to compare nodes, and the overall matching between the two versions of the document is decided on this basis. A minimal edit script (according to the matching) is then constructed. Its cost is in O(n*e + e^2), where n is the total number of leaf nodes and e a weighted edit distance between the two trees. Intuitively, its cost is linear in the size of the documents, but quadratic in the number of changes between them. Note that when the change rate is maximal, the cost becomes quadratic in the size of the data. Since we do not have an XML implementation of LaDiff, we could not include it in our experiments.

XyDiff. This algorithm was proposed with the participation of one of the authors of the present paper in [COB 02]. XyDiff is a fast algorithm that supports move operations and XML features like DTD ID attributes. Intuitively, it matches large identical subtrees found in both documents, and then propagates matchings.

4. Other users on the Web seemed to have similar problems.


A first phase consists in matching nodes according to the key attributes. Then it tries to match the largest subtrees, and considers smaller and smaller subtrees if matching fails. When matching succeeds, parents and descendants of identical nodes are also matched as long as the mappings are unambiguous (e.g. an unambiguous case is when two matched nodes both have a single child node with a given tag name). Its cost in time and space is quasi-linear, O(n log n), in the size of the documents. It does not, in general, find the minimum edit script.
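A sketch of the subtree-matching idea (our illustration, not INRIA's implementation): hash every subtree bottom-up, then match candidate subtrees starting from the largest ones; matched pairs would then seed the propagation to parents and descendants. Node is the toy tree class sketched in Section 2.

import hashlib

def subtree_hash(node, table):
    # Bottom-up hash of the subtree rooted at node: identical subtrees
    # (same labels, same ordered children) get identical hashes.
    child_hashes = [subtree_hash(c, table) for c in node.children]
    h = hashlib.sha1(
        (node.label + "(" + ",".join(child_hashes) + ")").encode()
    ).hexdigest()
    table.setdefault(h, []).append(node)
    return h

def match_largest(old_root, new_root):
    old_table, new_table = {}, {}
    subtree_hash(old_root, old_table)
    subtree_hash(new_root, new_table)
    matches = []
    # Largest subtrees first; a real implementation would also skip
    # subtrees nested inside an already matched pair.
    for h in sorted((h for h in new_table if h in old_table),
                    key=lambda h: -new_table[h][0].size()):
        matches.extend(zip(old_table[h], new_table[h]))
    return matches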

4.4. Summary of tested diff programs

As previously mentioned, the algorithms are summarized in Figure 2. The time cost given here (quadratic or linear) is a function of the data size, and corresponds to the case where there are few changes.

For GNU diff, we do not consider minimality since it does not support XML (or tree) editing operations. However, we mention in Section 6 some analysis of the result file size.

5. Experiments: Speed and Memory usage

As previously mentioned, our XML test data has been downloaded from the web. The files found on the web are on average small (a few kilobytes). To run tests on larger files, we composed large XML files from the DBLP [LEY] data source. We used two versions of the DBLP source, downloaded at an interval of one year.

The measures were conducted on a Linux system. Some of the XML diff tools are implemented in C++, whereas others are implemented in Java. Let us stress that we ran tests showing that these algorithms compiled in Java (with a just-in-time compiler) or in C++ run on average at the same speed, in particular for large files.

Let us analyze the behaviour of the time function plotted in Figure 3. It represents, for each diff program, the average computing time depending on the input file size. On the one hand, XyDiff and DeltaXML are perfectly linear, as is GNU Diff. On the other hand, MMDiff's growth rate corresponds to a quadratic time complexity. When handling medium files (e.g. a hundred kilobytes), there are orders of magnitude between the running times of linear vs. quadratic algorithms.

For MMDiff, memory usage is the limiting factor, since we used a 1 GB RAM PC to run it on files up to a hundred kilobytes. For larger files, the computation time of XMDiff (the external-memory version of MMDiff) increases significantly as disk accesses become more and more intensive.

In terms of implementation, GNU Diff is much faster than the others because it doesn't parse or handle XML. On the contrary, we know, for instance, that XyDiff spends ninety percent of its time parsing the XML files. This makes GNU Diff well suited for simple text-based version management schemes.


Program          | Author                         | Time      | Memory    | Moves | Minimal edit cost | Notes
-----------------|--------------------------------|-----------|-----------|-------|-------------------|----------------------------------------------------
DeltaXML         | DeltaXML.com                   | linear    | linear    | no    | no                | fully tested
MMDiff           | Chawathe et al.                | quadratic | quadratic | no    | yes               | tests with our implementation
XMDiff           | Chawathe et al.                | quadratic | linear    | no    | yes               | quadratic I/O cost (tests with our implementation)
GNU Diff         | GNU Tools                      | linear    | linear    | no    | -                 | no XML support (flat files)
XyDiff           | INRIA                          | linear    | linear    | yes   | no                |
LaDiff           | Chawathe et al.                | linear    | linear    | yes   | no                | criteria-based mapping; not included in experiments
XML TreeDiff     | IBM                            | quadratic | quadratic | no    | no                | not included in experiments
DiffMK           | Sun                            | quadratic | quadratic | no    | no                | no tree structure; not included in experiments
XML Diff         | Dommitt.com                    | -         | -         | -     | -                 | we were not allowed to discuss it
Constrained Diff | K. Zhang                       | quadratic | quadratic | no    | yes               | for unordered trees; constrained mapping
X-Diff           | Y. Wang, D. DeWitt, Jin-Yi Cai | quadratic | quadratic | no    | yes               | for unordered trees (U. Wisconsin); constrained mapping

Figure 2. Quick Summary


Figure 3. Speed of different programs (average computing time as a function of XML file size, for XMDiff, MMDiff, DeltaXML, XyDiff and GNU Diff)


A more precise analysis of the DeltaXML results is depicted in Figure 4. It shows that although the average computation time is linear, the results for some documents are significantly different. Indeed, the computation time is almost quadratic for some files. We found that this corresponds to the worst case for D-Band algorithms: the edit distance D (i.e. the number of changes) between the two documents is close to the number of nodes n. For instance, in some documents most of the nodes changed, whereas in other documents only a few percent of the nodes changed. This may be a slight disadvantage for applications with strict time requirements, e.g. computing the diff over a flow of crawled documents as in NiagaraCQ [CHE 00] or Xyleme [NGU 01]. On the contrary, for MMDiff and XyDiff, the variance of the computation time over all the documents is small. This shows that their average complexity is equal to the upper bound.

6. Experiments: Quality of the result

The "quality" study in our benchmark consists in comparing the sequences of changes generated by the different algorithms. We used the results of MMDiff and XMDiff as a reference because these algorithms find the minimum edit script.


Figure 4. Focus on DeltaXML speed measures (computing time as a function of XML file size: average and individual DeltaXML measures, with MMDiff shown for comparison)

Thus, for each pair of documents, the quality of a diff tool (e.g. DeltaXML) is defined by the ratio

    Q = C / Cmin

where C is the delta edit cost and Cmin is the edit cost of MMDiff's delta for the same pair of documents. A quality equal to one means that the result is minimal and is considered "perfect". When the ratio increases, the quality decreases. For instance, a ratio of 2 means that the delta is twice as costly as the minimum delta. In our first experiments, we didn't consider move operations. This was done by replacing, for XyDiff, each move operation by the corresponding pair of insert and delete operations. In this case, the cost of moving a subtree is identical to the cost of deleting and inserting it.

In Figure 5 (page 17), we present a histogram of the results, i.e. the number of documents in each range of quality. XMDiff and MMDiff do not appear on the graph because they serve as the reference, meaning that all their documents have a quality strictly equal to one. GNU Diff does not appear on the graph because it does not construct XML (tree) edit sequences.

These results in Figure 5 show that:

– (i) DeltaXML: For most of the documents, the quality of the DeltaXML result is perfect (strictly equal to 1). For the others, the delta is on average thirty percent more costly than the minimum.


[Figure 5. Quality Histogram: number of documents per quality range, from 1 (perfect) through 1.3 and 2.5 (medium) to 5 (low), for DeltaXML and XyDiff.]

– (ii) XyDiff: Almost half of the deltas are less than twice as costly as the minimum. The other half costs on average three times the minimum.

Result file size. In terms of file size, we also compared the different delta documents, as well as the flat text result of GNU Diff. The result diff files for DeltaXML, GNU Diff and XyDiff have on average the same size. The result files for MMDiff are on average half that size (using an XyDelta representation of changes).

Using “move”. We also conducted experiments that take move operations into account, assigning them a reduced cost. Intuitively, this means that a move is considered cheaper than deleting and inserting a subtree, e.g. moving files is cheaper than copying them and deleting the original copy. Only XyDiff detects move operations. On average, XyDiff then performs a bit better, and in particular becomes better than MMDiff for five percent of the documents.

Finally, note that this quality measure focuses on the minimality of results. In some applications, the semantics of the results is more important, but semantic value cannot be easily measured. An interesting aspect is the support of (semantic) matching rules by some programs (DeltaXML, XyDiff). More work is clearly needed on evaluating the semantic quality of results. We also intend to conduct experiments on LaDiff [CHA 96], which is a good example of criteria-based mapping and change detection.


7. Conclusion

In this paper, we described existing work on the topic of change detection in XML documents.

We first presented the two recent proposals for change representation, and compared their features through analysis and experiments. Both support XML queries and version management, but the identification-based scheme (XyDelta) is slightly more compact for small deltas, whereas the delta-attributes-based scheme (DeltaXML) is more easily integrated in simple applications. A key feature of XyDelta is the support of node identifiers and move operations, which are used in temporal XML databases.

More work is clearly needed in that direction to define a common standard for representing changes.

The second part of our study concerns change detection algorithms. We compared two main approaches: the first consists of computing minimal edit scripts, while the second relies on meaningful mappings between documents. We underlined the need for semantic integration in the change detection process. The experiments presented show (i) a significant quality advantage for minimality-based algorithms (DeltaXML, MMDiff) and (ii) a dramatic performance improvement with linear-complexity algorithms (GNU Diff, XyDiff and DeltaXML).

On average, DeltaXML [DEL ] seems the best choice because it runs extremely fast and its results are close to the minimum. It is a good trade-off between XMDiff (pure minimality of the result but high computation cost) and XyDiff (high performance but lower quality of the result). We also noted that flat-text-based version management (GNU Diff) still makes sense with XML data for performance-critical applications.

Although the problem of “diffing” XML (and its complexity) is better and better understood, there is still room for improvement. In particular, diff algorithms could take better advantage of semantic knowledge that we may have about the documents, or may have inferred from their histories.

Acknowledgments. We would like to thank Serge Abiteboul, Vincent Aguiléra, Robin La Fontaine, Amélie Marian, Tova Milo, Benjamin Nguyen and Bernd Amann for discussions on the topic.

8. References

[AGU 00] AGUILÉRA V., CLUET S., VELTRI P., VODISLAV D., WATTEZ F., “Querying XML Documents in Xyleme”, Proceedings of the ACM-SIGIR 2000 Workshop on XML and Information Retrieval, Athens, Greece, July 2000.

[APO 97] APOSTOLICO A., GALIL Z., Eds., Pattern Matching Algorithms, Oxford UniversityPress, 1997.


[CHA 96] CHAWATHE S., RAJARAMAN A., GARCIA-MOLINA H., WIDOM J., “Change detection in hierarchically structured information”, SIGMOD, vol. 25, num. 2, 1996, p. 493-504.

[CHA 97] CHAWATHE S., GARCIA-MOLINA H., “Meaningful Change Detection in Structured Data”, SIGMOD, Tucson, Arizona, May 1997, p. 26-37.

[CHA 98] CHAWATHE S., ABITEBOUL S., WIDOM J., “Representing and querying changes in semistructured data”, ICDE, 1998.

[CHA 99a] CHAWATHE S., “Comparing Hierarchical Data in External Memory”, VLDB,1999.

[CHA 99b] CHAWATHE S. S., ABITEBOUL S., WIDOM J., “Managing Historical Semistructured Data”, Theory and Practice of Object Systems, vol. 5, num. 3, 1999, p. 143-162.

[CHE 00] CHEN J., DEWITT D. J., TIAN F., WANG Y., “NiagaraCQ: a scalable continuous query system for Internet databases”, SIGMOD, 2000.

[CHI ] CHIEN S., TSOTRAS V., ZANIOLO C., “A Comparative Study of Version ManagementSchemes for XML Documents”, TimeCenter Technical Report TR51, Sept. 2000.

[CHI 00] CHIEN S.-Y., TSOTRAS V. J., ZANIOLO C., “Version Management of XML Docu-ments”, WebDB (Informal Proceedings), 2000.

[COB 02] COBÉNA G., ABITEBOUL S., MARIAN A., “Detecting Changes in XML Documents”, ICDE, 2002.

[COH 02] COHEN E., KAPLAN H., MILO T., “Labeling dynamic XML trees”, PODS, 2002.

[CUR 99] CURBERA F., EPSTEIN D., “Fast difference and update of XML documents”, XTech, 1999.

[DEL ] DELTAXML, “Change Control for XML in XML”, www.deltaxml.com.

[Dom ] DOMMITT INC., “XML Diff and Merge tool”, www.dommitt.com.

[FSF ] FSF, “GNU Diff”, www.gnu.org/software/diffutils/diffutils.html.

[IBM ] IBM, “XML Treediff”, www.alphaworks.ibm.com/.

[La 01] LA FONTAINE R., “A Delta Format for XML: Identifying changes in XML and representing the changes in XML”, XML Europe, 2001.

[LEV 66] LEVENSHTEIN V. I., “Binary codes capable of correcting deletions, insertions, and reversals”, Cybernetics and Control Theory, 10, 1966, p. 707-710.

[LEY ] LEY M., “DBLP”, dblp.uni-trier.de/.

[MAR 01] MARIAN A., ABITEBOUL S., COBÉNA G., MIGNET L., “Change-centric Management of Versions in an XML Warehouse”, VLDB, 2001.

[MAS 80] MASEK W., PATERSON M., “A faster algorithm for computing string edit distances”, J. Comput. System Sci., 1980.

[MYE 86] MYERS E., “An O(ND) difference algorithm and its variations”, Algorithmica,1986.

[NGU 01] NGUYEN B., ABITEBOUL S., COBÉNA G., PREDA M., “Monitoring XML Data on the Web”, SIGMOD, 2001.

[SAN 83] SANKOFF D., KRUSKAL J., “Time Warps, String Edits, and Macromolecules”, Addison-Wesley, Reading, Mass., 1983.


[SEL 77] SELKOW S. M., “The tree-to-tree editing problem”, Information Processing Letters, 6, 1977, p. 184-186.

[SHA 90] SHASHA D., ZHANG K., “Fast algorithms for the unit cost editing distance between trees”, J. Algorithms, 11, 1990, p. 581-621.

[Sun ] SUN MICROSYSTEMS, “Making All the Difference”,http://www.sun.com/xml/developers/diffmk/.

[TAI 79] TAI K., “The tree-to-tree correction problem”, Journal of the ACM, 26(3), July 1979, p. 422-433.

[W3C a] W3C, “Resource Description Framework”, www.w3.org/RDF.

[W3C b] W3C, “XQuery”, www.w3.org/TR/xquery.

[WAG 74] WAGNER R., FISCHER M., “The string-to-string correction problem”, Journal of the ACM, 21, 1974, p. 168-173.

[WAN ] WANG Y., DEWITT D. J., CAI J.-Y., “X-Diff: A Fast Change Detection Algorithm for XML Documents”, http://www.cs.wisc.edu/~yuanwang/xdiff.html.

[WU 90] WU S., MANBER U., MYERS G., “An O(NP) sequence comparison algorithm”,Information Processing Letters, 1990, p. 317-323.

[XML ] XML DB, “XUpdate”, http://www.xmldb.org/xupdate/.

[xyl] “Xyleme”, www.xyleme.com.

[YAN 91] YANG W., “Identifying syntactic differences between two programs”, Software - Practice and Experience, 21(7), 1991, p. 739-755.

[ZHA 92] ZHANG K., STATMAN R., SHASHA D., “On the editing distance between unordered labeled trees”, Information Processing Letters, 42, 1992, p. 133-139.

[ZHA 95] ZHANG K., WANG J. T. L., SHASHA D., “On the editing distance between undirected acyclic graphs and related problems”, Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, 1995, p. 395-407.

[ZHA 96] ZHANG K., “A Constrained Edit Distance Between Unordered Labeled Trees”, Algorithmica, 1996.


Querying XML Sources Using an Ontology-based Mediator

Bernd Amann — Catriel Beeri — Irini Fundulaki — Michel Scholl

Cedric/CNAM-Paris et INRIA-Futurs
Hebrew University, Israel

RÉSUMÉ. Dans cet article nous proposons une nouvelle approche pour l'interrogation et l'intégration de ressources XML disponibles sur le Web. Nos contributions sont (i) la définition d'un langage à base de règles de traduction, suivant l'approche LAV (Local As View) pour la description de ressources XML, et (ii) des algorithmes qui utilisent ces règles pour la traduction de requêtes utilisateurs en requêtes XML. Notre approche a également été validée par un prototype.

ABSTRACT. In this paper we propose a mediator architecture for the querying and integration of Web-accessible XML data sources. Our contributions are (i) the definition of a simple but expressive mapping language, following the local-as-view approach and describing XML resources as local views of some global schema, and (ii) efficient algorithms for rewriting user queries according to existing source descriptions. The approach has been validated by a prototype.

MOTS-CLÉS : intégration, interrogation, XML, médiation, ontologie

KEYWORDS: integration, querying, XML, mediation, ontology


1. Introduction

During the last decade, there has been a significant focus on data integration. In a nutshell, data integration can be described as follows: given heterogeneous and autonomous information sources in a specific domain of interest, the goal is to enable users to query the data as if it resided in a single source, with a single schema. To achieve this goal, a global schema of the data is defined, and related to the schemas of the individual sources. Queries are formulated in terms of this global schema. Since the actual data resides in the sources, queries are rewritten into queries over the source schemas, which are then evaluated at the sources. The answers returned from the sources are combined, transformed to be compatible with the global schema, and presented to the user. The integration facilities, namely the global schema and the query translation and processing algorithms, are provided by a mediator, whose main task is to give users a unique interface for querying the data. The fact that the sources concern a restricted domain of interest, such as culture, sports, and so on, is crucial for the successful deployment of integration systems.

Well-known projects that deal with data integration include Information Manifold [LEV 96], Tukwila [POT 00], Tsimmis [PAP 95], Picsel [GOA 00], Agora [MAN 01], YAT [CHR 00], and MIX [BAR 99]. As the goal of integration is to support declarative querying and automatic query and result transformations, many of these use the well-established tools available for such purposes in the relational model, such as query and transformation languages.

Recently, XML [ABI 99] has emerged as a standard for publishing and exchanging data on the Web. Many data sources export XML data, and publish their contents using DTDs or XML schemas. Thus, independently of whether the data is actually stored in native XML mode or in a relational store, the view they present to users is XML-based. The use of XML as a data representation and exchange standard raises new issues for data integration. A significant issue, as argued in [AMA 02], is the inadequacy of XML to serve as a global integration schema. Integration issues are also more complex in an XML environment.

In this paper we describe an approach to the integration of XML sources, based on the local-as-view approach to data integration. Our main contributions are as follows: (i) the use of a specific kind of ontology for the global schema; (ii) the definition of a simple but expressive language for describing XML resources as views of the global schema; (iii) an approach to query processing that includes query rewriting from the terms of the global schema into one or more XML queries over the local sources, and the generation of query execution plans that may decompose a single query into queries over multiple sources. The approach has been validated by a prototype [FUN ].

The conceptual schema is not XML-based. The choice of an integration model that is not XML-based implies that queries (and answers) have to be transformed between two data models. This choice, as well as the choice of the local-as-view approach, heavily influences the nature of the required mappings and how they are expressed. Our choices are explained and motivated in Section 3.

The paper is organized as follows: in Section 2 we illustrate the main ideas of the approach by an example. Section 3 briefly discusses the issues that need to be addressed when integrating XML sources, presents the integration data model, and the mapping language for the description of XML resources as views over the global schema. The query language and the query processing algorithms are given in Section 4. The prototype is sketched in Section 5. Related work is presented in Section 6, and Section 7 presents our conclusions.

2. System Overview

We illustrate our approach via an example dealing with the integration of XML-based information sources on art and culture. Formal definitions, technical details, explanations and justifications of choices are deferred to subsequent sections.

2.1. XML Resources

The first source, located at http://www.paintings.com, is an XML resource about painters and their paintings; its XML DTD is illustrated in Fig. 1.

<!ELEMENT Painter (Painting+)>
<!ATTLIST Painter name CDATA #REQUIRED>
<!ELEMENT Painting EMPTY>
<!ATTLIST Painting title CDATA #IMPLIED
                   year  CDATA #IMPLIED>

Figure 1. XML DTD for the first source, located at URL http://www.paintings.com

A second source, located at URL http://www.art.com, is described in Fig. 2. As is common in data integration scenarios, a single source may provide only part of the information available on a subject. Furthermore, sources differ not only in terms of contents, but also in terms of structure and terminology. Given the hierarchical structure of XML, such differences of structure may be more significant than those that exist between relational sources. For an example of a difference of contents, note that the second source might record information on the location of paintings, which is absent in the first source. As for structure, note that in the second source paintings are organized by museums, not by their painters as in the first source. Consequently, while in the hierarchy of the first source a painting occurs below its painter, in the second source, if the source were interested in adding the painter for each painting, the painter would occur below the painting.


<!ELEMENT Museum (MuseumName, City, Painting+)>
<!ELEMENT Painting (Title, Image*)>
<!ELEMENT Image EMPTY>
<!ATTLIST Image type     CDATA #IMPLIED
                location CDATA #IMPLIED>
<!ELEMENT MuseumName (#PCDATA)>
<!ELEMENT City (#PCDATA)>
<!ELEMENT Title (#PCDATA)>

Figure 2. XML DTD for the second source, located at http://www.art.com

2.2. The Global Schema

The main task of an integration mediator is to provide users with a unique interface for querying the data, independently of its actual organization and location. In our approach, this interface, or global schema, is described as an ontology. As used here, an ontology denotes a light-weight conceptual model and not a hierarchy of terms or a hierarchy of concepts (see Section 3 for details).

[Figure 3. An Ontology for Cultural Artifacts: a labeled graph with concepts Actor, Person, Organisation, Activity, Man_Made_Object, Image and Museum; roles such as carried_out (carried_out_by), produced (produced_by), located_at (location of) and image (image of); and string-valued attributes such as has_name, has_title, url, type, museumName and city.]

Fig. 3 illustrates (part of) a global schema for cultural artifacts inspired by the ICOM/CIDOC Reference Model (http://cidoc.ics.forth.gr/crm_intro.html), an international standard for museum documentation. The schema is represented as a labeled graph. In this graph, the nodes correspond to concepts and value types, and the edges depict roles, attributes, and simple inheritance (i.e., isa) links. Roles are binary relations between concepts; attributes connect concepts to value types. Both are depicted by solid arcs. Inheritance (isa) links connect concepts and are depicted by dashed arcs. Each role has an inverse, depicted in Fig. 3 within parentheses.


The concepts in this schema include Actor, its subconcepts Person and Organisation, as well as Activity, Man_Made_Object, Image and Museum. An actor (instance of concept Actor) carries out an activity (instance of concept Activity) to produce a man made object (instance of concept Man_Made_Object). These relationships are represented by the roles carried_out and produced, respectively. The name of a person (instance of concept Person) is represented by the attribute has_name, etc.

The global schema can be viewed as a simple object-oriented data model. Hence, a global schema can be viewed as defining a database of objects, connected by roles, with the class extents related by subset relationships as per the isa links in the schema. Since it is an integration schema, this is a virtual database. The actual materialization exists in the sources.

Roles can be composed, provided they satisfy certain compatibility constraints. Such compositions are derived roles. For example, carried_out.produced is a derived role that connects Actor to Man_Made_Object. Combining concepts with (simple or derived) roles induces derived concepts. For example, Actor.carried_out.produced can be viewed as the sub-concept of Man_Made_Object made of those objects that are reachable from some actor by an instance of this derived role. Both derived roles and derived classes are referred to as schema paths.

The augmentation of the given schema with the derived roles and concepts gives a derived schema. It is significant for the integration, since it provides an interpretation for the mapping rules (see the following) that describe the sources in terms of schema paths, and hence for query processing, as discussed next.

2.3. Mapping Rules

Our integration approach describes XML sources as local views on the global schema. Among the different possibilities listed in [CLU 01] for defining such mappings, we have chosen the path-to-path approach. The description of a source consists of mapping rules that associate paths in the source DTD, expressed in XPath [CLA 99], with paths in the global schema (schema paths). For example, the rules illustrated in Fig. 4 map paths in the first source, described in Fig. 1, to paths in the global schema of Fig. 3.

http://www.paintings.com/Painter {x} as Person
x/@name as has_name
x/Painting {y} as carried_out.produced
y/@title as has_title
y/@year as created.date.year

Figure 4. Set of Mapping Rules for source http://www.paintings.com


A rule consists of a name, a left-hand side (LHS) and a right-hand side (RHS). The LHS contains an XPath pattern [CLA 99] that starts at a context which is either a concrete URL, as in the first rule of Fig. 4, or a variable, as in the other rules. The XPath pattern is called the location path of the rule. The LHS of a rule also contains an optional variable declaration (the use of variables will be explained later). The RHS of a mapping rule is a path in the global schema, called the schema path of the rule.

Mapping rules define instances of concepts and relationships between them. As an example of the first case, consider the first rule in Fig. 4. It states that the elements of type Painter under the roots of the XML documents in the source are (descriptions of) instances of concept Person. As an example of the second case, the second rule specifies that the value obtained by evaluating the XPath pattern @name on some XML element x returned by the first rule corresponds to a value of the attribute has_name of x, considered as an instance of concept Person. In the same way, the third rule connects all instances obtained by the first rule to instances of concept Man_Made_Object obtained by a path of type carried_out.produced.

This view of mapping rules allows us to define the semantics of XML fragments and their structural relationships in terms of the global derived schema. Thus, the first rule defines a subset of the extent of concept Person, while the third rule relates elements in this subset by the derived role carried_out.produced to a subset of the extent of Man_Made_Object. That is, it relates painters to paintings they have created.

2.4. Query Processing

Users formulate queries on the global schema using a simplified variant of OQL, the standard for querying object databases. For example, here is a query that asks for “titles of the man made objects created by Van Gogh”:

select c
from Person a, a.has_name b, a.carried_out.produced.has_title c
where b = “Van Gogh”

We now discuss the options available for answering such a query, given a set of sources and the mapping rules that relate them to the global schema.

The first, simple, solution is to evaluate this query over each source. This means that, given a source, we need to rewrite the query into an XML query that this source can answer. The idea behind this rewriting is the following: each variable in the query is bound to some schema path. We search for mapping rules, or concatenations of mapping rules, that can be used to translate these schema paths to local paths in the source DTD. This is done by matching the schema paths in the query against the schema paths of the mapping rules. A successful matching associates a query variable with a rule, or a concatenation of rules. A binding is a set of such associations for all query variables. It can be used to rewrite the query into a query to be evaluated by the XML source.

For example, for the query above and for the first source, we see that instances for variable a are found by the first rule, for variable b by the second rule, and for variable c by the concatenation of the third rule with the fourth rule. The resulting binding binds all three variables. By substituting the schema path of each query variable with the location path (LHS) of the corresponding rule, we obtain the following query:

select c
from http://www.paintings.com/Painter a, a./@name b, a./Painting/@title c
where b = “Van Gogh”

This query can easily be translated into the following XQuery expression:

FOR $a IN document(“http://www.paintings.com”)/Painter,
    $b IN $a/@name,
    $c IN $a/Painting/@title
WHERE $b = “Van Gogh”
RETURN $c

Such a matching/rewriting process should be attempted for each source. Then the answers are gathered and returned to the user.

In some cases, however, we cannot obtain a full binding for a given source. A second solution for query evaluation is then to decompose the query into several queries that are evaluated against different sources. Consider the following query, which asks for “titles of objects created by Van Gogh, as well as the name and the city of the museum where they are exhibited”:

select t, n, y
from Person a, a.has_name b, a.carried_out.produced c,
     c.has_title t, c.located_at m, m.museumName n, m.city y
where b = “Van Gogh”

The first source cannot provide information about the locations of objects: there is no mapping rule whose schema path (RHS) matches the schema path c.located_at of query variable m. Thus, we can only obtain a partial answer from this source, by evaluating the prefix query illustrated in Fig. 5. To obtain a full answer we have to join partial answers from different sources.

For the example, the missing information is represented by the suffix query, also illustrated in Fig. 5, which involves the variables c, m, n and y. The variable c is included in the suffix query


since it is the join variable between the two queries. Thus, we have decomposed the initial query into a prefix query and a suffix query. Assuming the latter is successfully evaluated over some source (e.g., the second source), the results of the two queries are joined on c to provide a complete answer to the original query. If no such decomposition can be found, the best we can do is to present to the user only the partial results of the prefix query evaluated against the first source.

Note that joining two fragments from different sources requires deciding whether the two fragments represent identical objects. Keys are introduced to identify objects. In particular, the results of the prefix query over one source and of the suffix query over another source can be joined only if the same key for man made objects can be provided by these two sources. This implies the use of keys, both in the global schema and in the sources (the DTDs in Fig. 1 and Fig. 2 do not define such keys). We introduce keys and their usage in Section 3.5.

Prefix query:

select t
from Person a, a.has_name b,
     a.carried_out.produced c, c.has_title t
where b = “Van Gogh”

Suffix query:

select n, y
from Man_Made_Object c, c.located_at m,
     m.museumName n, m.city y

Figure 5. The prefix and the suffix query

3. Integration Model

This section is devoted to the detailed presentation of our integration model. We first briefly explain our choice of conceptual model, then provide a formal definition of a global schema and introduce the notion of derived schema. Mapping rules are described afterwards, and we finish this section with a short discussion of keys.

3.1. LAV vs. GAV

A fundamental choice we faced concerns the relationship between the global schema and the source schemas (DTDs), for which two approaches are possible. In the first, known as global-as-view (GAV), the global schema is defined as a view on the sources' schemas. In the second, known as local-as-view (LAV), a source is described as a view on the global schema. Several tradeoffs of these approaches are presented in [LEV ]: in the GAV approach, the translation of a user query into a query that will be executed by the local sources is done by unfolding the global schema definition; this is quite a straightforward procedure. In the LAV approach a query needs to be re-formulated in terms of the local schemas, a procedure known as rewriting queries using views. It is a hard problem [BEE 97, HAL 00].


On the other hand, in the GAV approach, since the global schema reflects the source schemas, any change in a source's schema may affect the global view. The complexity of this view increases with the number of sources. GAV is feasible only if the insertion of a new source or the update of a source schema do not happen too often, and if the number of sources is not too large. In our context, the LAV approach is preferable: once a reasonably comprehensive global schema has been defined, it is possible to perform many kinds of modifications to the local schemas, and these do not affect the global schema. It is also possible to add local sources to the system; as long as their data fits into the framework defined by the global schema, the latter does not need to be modified. In some cases, it may have to be extended, but this kind of change is easily accommodated by users. Thus, the LAV approach is more scalable than the GAV approach.

To conclude this discussion of LAV vs. GAV, we may note that the Xyleme project [CLU 01] has selected the GAV approach for mapping a huge number of sources. In their system, however, the global schema and the mapping rules are constructed automatically from the XML repository, and updated regularly, reflecting the changes in the repository. We assume the global schema is produced by human effort, allowing a higher level of precision in the integration and in query processing.

3.2. Ontologies vs. XML

We briefly define the word ontology and argue why ontologies are preferable to, say, XML as the integration data model. For a full discussion, see [AMA 02].

The word ontology has been extensively used to describe an area of knowledge and is central to the current W3C Semantic Web Activity (see for example [HEF ]), whose goal is the definition of an ontology Web language. Ontologies come with different degrees of structure, ranging from simple taxonomies to metadata schemas and logical theories. Sophisticated tools are currently proposed to enable distributed agents to cooperate, to allow autonomous suppliers of data and services to describe their resources, and to provide a foundation for e-commerce. A key goal in this context is knowledge management (see http://www.ontoknowledge.org). It is now well accepted that Semantic Web ontologies need enough structure to specify classes in the chosen domain of interest, relationships that exist between objects of different classes, and properties or attributes those objects may have. In essence, such ontologies are light-weight conceptual models. In particular, concepts can be viewed as classes, and roles as binary relations between classes. All modern ontology languages also support isa relationships between concepts.

XML DTDs are sufficient for exchanging data between actors who have agreed upon common definitions, but their lack of semantics is a limit to their use for integration. RDF and RDF schemas provide a framework for describing richer semantics. Let us briefly review other limitations of XML w.r.t. simple ontologies as described above.


Since XML is a hierarchical data model, the representation of many-to-many relationships is necessarily asymmetric. In contrast, ontologies provide for inverse roles. Second, XML does not support isa relationships. As explained in the following, these relationships provide significant support for query processing in our system. Third, since ontologies are defined in the form of light-weight object-oriented schemas, they can support well-known query languages such as OQL. A simple version of OQL can be used by a wide variety of users. Finally, since ontologies are already used in many domains as a de-facto standard, it seems natural to extend their use to integration, rather than forcing users to learn a new model to describe their application domain.

Comparing again to Xyleme, we note that they have chosen XML for the integration schema. In their context, the simplicity of the mapping generation process dictated the use of the same data model. Note that their system by default assumes that hierarchical relationships in the query are present in the XML documents, although they do allow a relaxation of this constraint. In our context, the global schema is symmetric, so there is no notion of hierarchy in it. This easily supports dynamic views over the global schema, where, by formulating a query, a user determines a virtual tree, for that query only. We believe that the advantages of a simple conceptual model more than compensate for the added complexity induced by using more than one model.

3.3. Global Schemas and Schema Paths

The ontology language used in our model is relatively simple w.r.t. other proposals, and in the following we will use the term (global) schema instead of ontology.

Global Schemas

A global schema is a 6-tuple (C, R, A, isa, dom, range), where:

– C is a set of concepts,
– R is a set of typed binary roles connecting concepts in C,
– A is a set of attributes of type string (w.l.o.g. we assume that all attributes are of type string; an extension to the types proposed by the XPath model or XML Schema should be straightforward),
– isa is a binary relationship between concepts in C,
– dom and range are two typing functions returning for each role/attribute its domain concept and its range concept/type, respectively.

A global schema is essentially a simple conceptual schema that can be represented as a graph of concepts connected by roles (as shown in Figure 3). The semantics of a schema is defined by the set of databases that conform to it. Each such database contains a set of objects (instances) for each concept in C. These objects are related to each other by instances of roles in R, and to values by instances of attributes in A. Role and attribute instances satisfy the typing constraints implied by dom and range.


Roles and attributes are multi-valued and optional. The isa relationship defines a hierarchy in C, namely a directed acyclic graph. It carries subset semantics and supports role and attribute inheritance: if a concept is a subconcept of another, then its set of objects is a subset of the other's, and all roles defined between two concepts are also defined between their respective subconcepts. We say that two concepts are isa-related if they are equal or related by isa.

Finally, we consider schema graphs to be symmetric: each role r in R has an inverse role, denoted r⁻¹, in R. Obviously, (r⁻¹)⁻¹ = r, dom(r⁻¹) = range(r) and range(r⁻¹) = dom(r). This is useful for query formulation and, hence, beneficial to have in a conceptual schema: a user may “start” a query at any concept, and then use roles to connect to other concepts.

Schema Paths and Derived Schemas

We distinguish two kinds of paths in a schema:

– A role path is a sequence of roles r1. ... .rn where, for all i, range(ri) and dom(ri+1) are isa-related. Given a role path r1. ... .rn, we define its inverse role path rn⁻¹. ... .r1⁻¹, where ri⁻¹ is the inverse role of ri.

– A concept path is either of the form c, or a sequence c.r, where c is a concept and r is a role path such that c and dom(r) are isa-related. The source of the path is c; its end is c in the first case and range(r) in the second case.

The composition of a concept path p and a role path r, denoted p.r, is well-defined provided that the end of p and dom(r) are isa-related.

A concept path p can be viewed as a derived concept (denoted by conc(p)), standing for “the instances of the end of p that can be reached from instances of its source by following the roles in p, in order”. Obviously, every concept is also a derived concept.

In the same way, a role path r can be viewed as a derived role (denoted by role(r)), connecting instances of the domain of its first role to instances of the range of its last role. Similarly to derived roles, we can define derived attributes, given by a role path followed by an attribute. Like attributes, these do not have inverses. Clearly, every role (attribute) is also a derived role (attribute).

Let p be a concept path, q be a prefix of p, and s denote p with q removed. If a concept c is either the domain of s or a superconcept thereof, then c.s is called a suffix of p. Obviously, p is a suffix of p.

Given a database of the schema, we can associate extents with derived concepts in a straightforward manner. We note the following facts concerning these extents. First, the extent of conc(p) is a subset of the extent of the end of p, hence also of its superconcepts. Second, the extent of conc(p) is a subset of the extent of each of its suffixes. For example, Event.produced is a suffix of Person.carried_out.produced, and all instances of the latter (objects produced by an activity carried out by a person) are also instances of the former (objects produced by events).


Given a global schema, the derived schema is defined as follows:

– its concepts are all the derived concepts of the schema,
– its roles (attributes) are all the derived roles (attributes) definable in the schema, with dom and range defined as explained above,
– its isa relation contains the isa relation of the original schema and, additionally, each pair consisting of a derived concept conc(p) and a derived concept conc(p') such that p' is a suffix of p.

Our interest in the derived schema is motivated by the fact that some sources may provide data only for derived concepts. The isa relationships in the derived schema enable us to use these sources to provide answers in terms of the original concepts. For example, even if a source provides only information about Person.carried_out.produced, this allows us to obtain some instances of Man_Made_Object, although not necessarily all. Note that answers obtained from sources in the LAV approach are partial answers in any case.

3.4. Mapping Rules

A source is integrated into the system by providing a set of mapping rules that describe the relationships between the source schema and the global schema. There exist different ways of defining such views, varying in terms of size and preciseness of the definition, but also in the complexity of the query rewriting algorithm [CLU 01]. We have chosen essentially the same approach as in [CLU 01], namely to associate paths in the global schema with paths in the source schemas. This allows us both to associate concepts with XML nodes in the sources, and to associate relationships among concepts (expressed as roles or derived roles in the global schema) with XPaths in the XML sources.

Paths in a source are described in XPath [CLA 99]. We assume familiarity with the XPath language. Described in a nutshell, an XPath location path is composed of a sequence of location steps. Location steps have three parts: (i) an axis specifies the relationship (child, descendant, ancestor, attribute, etc.) between the nodes selected by the location step and the context node, (ii) a node test specifies a node's XML type (element, attribute, and so on) and possibly its name, and (iii) optional predicates use XPath expressions to further refine the set of selected nodes.

Let V be a set of variables and U be a set of URLs. A mapping rule is an expression consisting of:

– the rule's label,
– the rule's root, which is either a variable in V or a URL in U,
– an XPath pattern, called the location path of the rule,
– optionally, the binding of a variable in V (the rule is then called the binding rule of that variable),
– a schema path: a role path if the root is a variable, and a concept path otherwise.

A rule is called a relative mapping rule if its root is a variable, and an absolute mapping rule otherwise. In the first case, the variable is the root variable of the rule, and this occurrence of the variable is a use of it. In the sequel we refer directly to a rule's location path (LHS) and schema path (RHS).

Given a set of mapping rules for a source, we define reachability of rules and variables as follows: (1) each rule whose root is a URL (the URL of the source) is reachable; (2) each variable bound by a reachable rule is reachable; (3) finally, each rule whose root is a reachable variable is reachable. The set is cyclic if this definition of reachability leads to a cycle. The simplest case of a cycle is a rule that is rooted at the same variable it binds (provided that this variable can be reached from a URL by other rules).

A mapping over a global schema and for a source is a set of mapping rules such that 1) labels are unique (that is, no two rules have the same label), 2) all rules and variables are reachable, 3) the concepts, roles and attributes used in its rules occur in the global schema, and 4) it contains no cycles.
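The following Python sketch is our own illustration, not the paper's implementation; the class and all names are invented, and the cycle test is simplified to self-cycles (longer cycles simply leave their rules unreachable):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass(frozen=True)
    class Rule:
        label: str
        root: str                  # a URL ("http://...") or a variable name
        location_path: str         # XPath pattern (LHS)
        binds: Optional[str]       # variable bound by the rule, if any
        schema_path: str           # role path (relative) or concept path (absolute), RHS

        @property
        def is_absolute(self) -> bool:
            return self.root.startswith("http://")

    def is_valid_mapping(rules: List[Rule]) -> bool:
        """Unique labels, every rule and variable reachable, no self-cycles."""
        if len({r.label for r in rules}) != len(rules):
            return False
        reachable_vars, reachable_rules = set(), set()
        changed = True
        while changed:                         # fixpoint propagation of reachability
            changed = False
            for r in rules:
                if r.label in reachable_rules:
                    continue
                if r.is_absolute or r.root in reachable_vars:
                    if not r.is_absolute and r.binds == r.root:
                        return False           # a rule rooted at the variable it binds
                    reachable_rules.add(r.label)
                    if r.binds:
                        reachable_vars.add(r.binds)
                    changed = True
        return len(reachable_rules) == len(rules)

    # The rules of Fig. 4 in this representation:
    fig4 = [
        Rule("r1", "http://www.paintings.com", "/Painter", "x", "Person"),
        Rule("r2", "x", "/@name", None, "has_name"),
        Rule("r3", "x", "/Painting", "y", "carried_out.produced"),
        Rule("r4", "y", "/@title", None, "has_title"),
        Rule("r5", "y", "/@year", None, "created.date.year"),
    ]
    assert is_valid_mapping(fig4)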

Concatenation of Mapping Rules

Two rules can be concatenated if the root of the second rule is a variable bound by the first rule and the composition of their schema paths is well defined (we define no restriction on the concatenation of the rules' location paths). Note that concatenation is possible only if the schema path of the second rule is a role path. The result of the concatenation is a rule whose root is the root of the first rule, whose location path concatenates the two location paths, and whose schema path composes the two schema paths.

Given a mapping, its closure is the set of all rules that can be obtained from it by repeated concatenation. Its expansion is the set of absolute rules in the closure, and can be computed by a bottom-up fixpoint computation (since we only consider acyclic mappings, we are sure that a finite fixpoint exists).

Interpretation of Mapping Rules

Given a global schema, a mapping over it can naturally be interpreted in the derived schema. Each absolute rule in the closure defines a derived concept, and each relative rule in the closure defines a derived role (or attribute). We consider the restriction of the derived schema to the derived concepts, roles and attributes occurring in the closure of the mapping.

For example, the concatenation of the first and third rules of Fig. 4 defines a derived concept conc(Person.carried_out.produced), a subconcept of Man_Made_Object; the third rule defines a derived role role(carried_out.produced) between concept Person and concept Man_Made_Object; and the concatenation of the third and fourth rules defines a derived attribute attr(carried_out.produced.has_title) of concept Person.



A mapping for a source associated with a URL allows us to view the collection of XML fragments reachable from that URL as a database that conforms to the derived schema of the mapping. To define this database, the population of each derived concept conc(p) is defined as the union of the sets of fragments returned by all absolute rules of the expansion whose schema path either equals p or admits p as a suffix.

The set of fragments returned by an absolute rule of the expansion is defined as follows. The root of an absolute rule is the URL of some source. Hence the rule is assigned the set of XML fragments that can be obtained by applying its location path to the XML document identified by this URL. These sets can be computed by a simple fixpoint computation, using the rules of the mapping. Since the expansion is finite, its rules can alternatively be used directly.

Similarly, the relative rules of the mapping are interpreted as roles (or attributes) in this database of XML fragments, represented by location paths.

Before leaving this subject, we note that, according to the LAV approach, the XML extents defined as above for the concepts are viewed as subsets of the real (but unknown) extents. Indeed, as sources are added, and rules are added to a mapping, the extents grow. In the LAV approach, any set of answers returned for a query is assumed to be a subset of the full (but unknown) answer.

3.5. Keys

As illustrated in Section 2, keys are essential to decide whether two XML fragments describe the same object. We assume that sources are heterogeneous and autonomous, and we do not expect them to provide persistent object identifiers that are valid across all sources. The ID/IDREF XML attribute mechanism is used for internal references, but cannot serve for keys. Sources might specify meaningful keys in terms of XML elements/attributes as proposed in [BUN 01, FAN 01, THO 01], but one cannot expect different autonomous sources to always use the same keys. For example, a painting might be identified by its title in one source, and by its title and the year of creation in another source.

A way to overcome this problem is to define global keys for concepts in the global schema. A key for a concept c is defined as a list of derived attributes (called key paths) with source c, and is denoted key(c). W.l.o.g. we assume in this paper that a concept is associated with at most one key, and that all its subconcepts (including the derived ones) share the same key.

In our global schema, we could state, for example, that an instance of concept Person is identified by the attribute has_name: key(Person) = (has_name). Instances of concept Man_Made_Object are identifiable by their title and their year of creation: key(Man_Made_Object) = (has_title, created.date.year). Images have no key, i.e. key(Image) = ().
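In an implementation, such declarations could be as simple as a table of key paths per concept; a minimal sketch with invented names:

    # Global keys: a list of key paths (derived attributes) per concept.
    GLOBAL_KEYS = {
        "Person": ["has_name"],
        "Man_Made_Object": ["has_title", "created.date.year"],
        "Image": [],   # no key: Image instances cannot be joined across sources
    }

    def has_key(concept: str) -> bool:
        """A concept can serve as a join point only if it has a non-empty key."""
        return bool(GLOBAL_KEYS.get(concept))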


4. Query Processing

Our query processing approach is presented in this section. We first introduce the user query language (Section 4.1). Two query processing strategies are then discussed. In the first approach (Section 4.2), the solution to a query is the union of the complete answers from individual sources. If no complete answer can be derived from a source, that source is abandoned. In contrast, the second approach (Section 4.3) also allows for incomplete answers from a given source. If a source can only partially answer a query, the query is decomposed into two parts: one to be fully answered by that source, the other part being sent to the other sources. The partial results from the different sources are then joined by the mediator.

4.1. Query Language

Users query the virtual database presented by the global schema using simple tree queries, based on select-from-where clauses following an OQL-like syntax. Queries are of the form:

select x1, x2, ...
from P1 x1, P2 x2, ...
where cond1 and cond2 and ...

The xi's are query variables and each Pi in the from clause is a path in the global schema (schema path), called the binding path of xi. The first variable x1 is the root variable of the query, and its binding path is a concept path. For each other variable xi, there is a single from clause of the form xj.r xi, where r is a role path; we call xj the parent of xi. We assume the parenthood relation between variables forms a tree, with x1 as its root. Thus, x1 ranges over the extent of the derived concept defined by its binding path, and each other variable xi ranges over the instances reached by traversing instances of the derived role r from the instances of its parent.

We assume queries satisfy the following restrictions. First, no restructuring is allowed in the select clause. Although this might add expressive power to the language, we feel it is not strictly needed for our application; certainly, it is orthogonal to the issue of retrieving data from sources, addressed in this paper. Second, the where clause is a conjunction of simple predicates, where a simple predicate compares a variable with an atomic value (e.g. b = “Van Gogh”). Thus, it is not possible to express joins by equalities between variables, i.e., by predicates equating two variables. This restricts the expressive power of the query language but simplifies the rewriting and evaluation of queries. Third, schema paths occur in the from clause, but not in the select clause or the where clause of a query. It is easy to show that a query with schema paths in the select and where clauses can be rewritten into an equivalent query in which they appear only in the from clause. Last, the language has no quantifiers, aggregates, or subqueries. However, a variable present in the from clause but not in the select or where clauses is implicitly existentially quantified. Thus, queries with certain kinds of existential quantification can be expressed in the above form.

Since no joins are allowed in the where clause, a query whose variables form a forest can be decomposed into a cross product of several tree queries: the restriction to tree rather than forest queries results in no loss of expressive power.

The result of a query is a set of tuples whose components are instances of the variables in the query's select clause; these instances can be either atomic values or XML fragments.

In the sequel, the following representation of tree queries is used. A tree query is represented as a labeled tree whose nodes are the query variables and whose edges are given by the parent relation defined above. Each variable is labeled with its binding path and with a set of associated operations, defined as follows: a projection operation for each variable appearing in the select clause, and a selection operation for each condition on the variable in the where clause.
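A minimal sketch of this labeled-tree representation (the names are ours; the paper does not prescribe an implementation):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class QueryVariable:
        name: str
        binding_path: str                          # concept path for the root, role path otherwise
        parent: Optional["QueryVariable"] = None
        selected: bool = False                     # projection: occurs in the select clause
        conditions: List[str] = field(default_factory=list)  # e.g. ['= "Van Gogh"']

    # The museum query of Section 2.4 as a tree rooted at a:
    a = QueryVariable("a", "Person")
    b = QueryVariable("b", "has_name", parent=a, conditions=['= "Van Gogh"'])
    c = QueryVariable("c", "carried_out.produced", parent=a)
    t = QueryVariable("t", "has_title", parent=c, selected=True)
    m = QueryVariable("m", "located_at", parent=c)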

4.2. Variable to Rules Bindings

We now proceed to the details of query processing. We first present a simple approach in which a source contributes to the answer only if it can fully answer the query.

To evaluate a query, we need to rewrite it into an XML query that some sources can answer. Obviously, in general only some of the sources contain the data requested in the query. Each such source returns a subset of the possible answers; the union of the answers from all relevant sources is presented to the user (see Section 5).

For this rewriting, we use the mapping rules. For a query Q and a source, define a variable-to-rule binding, or shortly variable binding, as a mapping from a set of query variables to rules. We consider only bindings whose domain is either empty or the set of nodes of some prefix of Q (a tree is a prefix of another tree if it is a subtree with the same root). If a binding binds all variables of Q, it is called a full binding; otherwise it is a partial binding.

The properties of a binding are the following: if it is not empty, it associates each variable in its domain with a rule of the closure, such that the following holds:

1) if the variable is the root of query Q, then its rule is an absolute mapping rule whose schema path, viewed as a derived concept, is a subconcept of the variable's binding path;

2) else, the variable's rule is a relative rule, and


– the root variable of this rule is bound in the rule associated with the variable's parent,
– the role path (RHS) of this rule is equal to the variable's binding path,
– and finally, the concatenation of the two rules is well-defined.

In the first case, the variable is the root of Q, bound to some (possibly derived) concept by its binding path, which has the form c or c.r. An absolute rule can provide instances for this concept if its concept path (RHS), viewed as a derived concept, is a subconcept of it (i.e. if the binding path is a suffix of the rule's concept path, or a superconcept thereof). Note that we use here both derived concepts and the isa relationship between them. Thus, the derived schema defined in Section 3 is essential for our approach to query processing.

In the second case, the assumption that if a binding is defined on a variable then it is also defined on the variable's parent follows from the requirement that its domain is a prefix of Q. In this case, the declaration of the variable in Q has the form xj.r xi, and answers for the variable can be obtained from answers for its parent by following its binding path.

A partial binding is called maximal if there does not exist a binding defined on a strictly larger domain that agrees with it on all the variables it binds. It is evident that a full binding is a maximal binding.

Variable Binding Algorithm

We now describe a variable binding algorithm which takes as input a query Q and a mapping for a source, and returns a set of partial bindings.

A binding is represented as a vector of associations of variables to rules. The algorithm is illustrated in more detail in Fig. 6. First, the variables of the query tree are arranged in pre-order: the root is first, and every other node occurs after its parent. The algorithm starts from the empty binding and, once a set of partial bindings has been constructed, tries to extend each one, using the ordering of the variables.

In the first step, we treat the root variable. For each absolute rule of the closure whose schema path, viewed as a derived concept, is equal to or a subconcept of the root's binding path, we create a binding of the root to this rule and add it to the set of bindings. If no such absolute rule is found, the algorithm stops and returns the empty set. Then, we iterate through the sequence of variables, from the left. Let the current, not yet treated, variable be xi, and let xj be its parent. For each binding constructed so far: if xj is not bound, the binding cannot be extended to xi (recall that a binding is always defined on a prefix of the query tree). Else, let the binding associate some rule with xj. Then, for each relative rule of the closure whose schema path is equal to the binding path of xi, if the two rules can be concatenated (i.e., the rule of xj binds the variable that is the root of the relative rule, and their schema paths can be composed), we extend the binding by associating xi with the relative rule. In this case, the original binding can be dropped, since the new binding extends it. Note that the edge from xj to xi is traversed in this step, and only in this step.


Input: the sequence of variables of query Q, in pre-order: x1, ..., xn;
       the closure of the mapping rules of some mapping for source S.
Output: the set B of maximal bindings for Q and S.

Algorithm:
B := {};
for each absolute rule R
    if the concept path of R is equal to, or admits as a suffix, the binding path of x1
        /* the derived concept of R is a subconcept of the binding path of x1 */
        add the binding (x1 -> R) to B;
for i := 2 to n
    /* B contains all maximal bindings over x1, ..., x(i-1) */
    let xj be the parent of xi;
    for each binding b in B where xj is bound to some rule b(xj)
        for each rule R in the closure whose schema path equals the binding path of xi
            if the concatenation of b(xj) and R is well defined
                /* b is extended to xi and the extension is added to B */
                add b + (xi -> R) to B;
        if b was extended to xi
            remove b from B;
return B;

Figure 6. Variable binding Algorithm
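A direct transcription of Fig. 6 as a Python sketch, reusing the hypothetical Rule and QueryVariable classes from the earlier examples; is_suffix_or_equal and concatenable are assumed helper predicates, not defined by the paper:

    from typing import Dict, List

    def maximal_bindings(variables: List["QueryVariable"],
                         closure: List["Rule"]) -> List[Dict[str, "Rule"]]:
        """Variable binding algorithm of Fig. 6: maximal bindings of variables to rules."""
        root = variables[0]
        bindings = [
            {root.name: r}
            for r in closure
            if r.is_absolute and is_suffix_or_equal(root.binding_path, r.schema_path)
        ]
        if not bindings:
            return []
        for x in variables[1:]:                   # pre-order: parents come first
            for b in list(bindings):
                if x.parent.name not in b:
                    continue                      # bindings cover a prefix of the tree
                extended = False
                for r in closure:
                    if r.schema_path == x.binding_path and concatenable(b[x.parent.name], r):
                        new_b = dict(b)
                        new_b[x.name] = r
                        bindings.append(new_b)
                        extended = True
                if extended:
                    bindings.remove(b)            # superseded by its extensions
        return bindings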

The algorithm finds the maximal bindings for a query and a mapping on a source. The proof is straightforward: consider a binding in the result set; if some variable could still be added to it, there would exist a rule whose concatenation with the rule of the variable's parent is well-defined, and then, by the algorithm, the extended binding would already be in the result set, which is a contradiction.

Let us illustrate the algorithm with the second example query of Section 2.4 and the mapping rules for the first source, illustrated in Fig. 4. The first rule returns answers for variable a: the rule's schema path is Person, which is equal to a's binding path (Person). The second rule returns answers for b, since (1) its schema path has_name is equal to the variable's binding path, (2) its root variable is bound in the first rule, and (3) the composition of the schema paths of the two rules is well defined, since the attribute has_name is defined for concept Person. In a similar manner, we find that variable c is bound to the third rule and variable t to the fourth rule. For variable m, we do not find any mapping rule whose schema path is equal to the variable's binding path (located_at). The result is a singleton containing this maximal partial binding.


4.3. Query Decomposition

Let S be the set of sources mapped to the global schema. The binding algorithm returns for each source s in S the set of maximal bindings. Each such binding b is either full, i.e. contains all variables in the query Q (then s can answer the query using b), or partial. In the latter case, s provides us with incomplete information, i.e. does not provide answers for all variables in the query. To complete these partial answers, we decompose the query Q into (i) a prefix query that source s can answer using binding b, and (ii) a set of suffix queries (see below).

As an example, take the result of the binding algorithm calculated for the query presented earlier and the mapping illustrated in Fig. 4. It contains a partial binding defined on a proper subset of the variables in the initial query: for an instance of the bound variables we miss instances for the unbound variable (and its descendants).

To obtain the complete answer, we define (1) a prefix query that the source can answer using the binding (illustrated in Fig. 5) and one suffix query (also illustrated in Fig. 5). The suffix queries of a prefix query P in Q are defined as follows. Let F be the set of variables in P which have at least one child in Q but not in P (we call F the boundary of P). Then we define a suffix query for each variable x in F as the subtree of Q rooted at x and containing all descendants of x not in P. It is easy to see that the second query illustrated in Fig. 5 is a suffix query of P in Q. Observe that for a given prefix query there might exist zero, one or more suffix queries.
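The boundary and the suffix queries can be pictured with the following Python sketch; the tree interface (children(), descendants()) and all names are our illustrative assumptions, not part of the system described here.

def boundary(prefix_vars, query):
    # Variables of the prefix having at least one child outside the prefix.
    return [x for x in prefix_vars
            if any(c not in prefix_vars for c in query.children(x))]

def suffix_queries(prefix_vars, query):
    # One suffix query per boundary variable: the subtree rooted there,
    # restricted to the descendants that are not in the prefix.
    return [[x] + [d for d in query.descendants(x) if d not in prefix_vars]
            for x in boundary(prefix_vars, query)]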

Joining the Results

The results of the prefix query and the suffix queries must be joined. In order to perform the join between a prefix and a suffix query, the following two conditions must hold: (1) the concept to which the root of a suffix query is bound should have a key, and (2) the sources on which the join is performed should provide complete values for this key.

The key values of an instance of a concept C are obtained by considering C as a query ranging over all key paths in it. The result of a key query is of the form (k1, ..., km), where the ki are instances of the variables to which the key paths are bound in the key query. For example, the query illustrated below returns the key values for an instance of concept Man_Made_Object:

Q_key: select T, Y from Man_Made_Object M, M.has_title T, M.created.date.year Y

Given a (prefix or suffix) query q whose result is of the form (v1, ..., vn), where the vi are instances of the variables in the query's select clause, and a key query for concept C, where C is the concept on which the join will be performed, q is extended to q' so as to get, as well as the vi's, the key values for the fragments, instances of concept C, accessed by q. The result of q' is of the form (v1, ..., vn, k1, ..., km), where the ki's are instances of the key query variables (variables bound to the key paths in the key query). For example, the prefix query obtained after extending the prefix query of Fig. 5 by the key of concept Man_Made_Object is given below5.

Q1': select T, T2, Y
     from Person P, P.has_name N, P.carried_out.produced M,
          M.has_title T, M.has_title T2, M.created.date.year Y
     where N = “Van Gogh”
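The join described above can be sketched as follows in Python, assuming (as an illustration only) that every prefix or suffix result row comes as a pair (values, key), where key holds the key-path values of the concept on which the join is performed.

def key_join(prefix_rows, suffix_rows):
    # Index suffix rows by their key values, then match prefix rows on them.
    by_key = {}
    for values, key in suffix_rows:
        by_key.setdefault(key, []).append(values)
    joined = []
    for values, key in prefix_rows:
        for svalues in by_key.get(key, []):
            joined.append(values + svalues)  # concatenate matching rows
    return joined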

Query Decomposition Algorithm

Let Q be a query and S be a set of sources. A decomposition of Q w.r.t. some maximal binding b is a couple (P, Suf) such that P is a prefix query of Q on source s in S and Suf is the set of suffix queries of P in Q. For example, the couple formed by the prefix query and the suffix query of Fig. 5 is a decomposition for the query presented earlier and the maximal binding defined on the source of Fig. 4. Observe that Suf is empty if b is a full binding for Q.

Let (P, Suf) be a decomposition of Q. Then P can be translated into a source query using binding b. For each suffix query in Suf, either a full binding is found or the suffix query has still to be decomposed. Let q be a suffix query in Suf and K denote the key query variables bound to the key paths of the concept associated to the root variable of q. Then P ⋈_K q denotes the join operation between P and q (we assume that both queries are extended by the appropriate key queries). A prefix query rewriting rew(P) for a decomposition (P, Suf) is defined as the join between the prefix query P and all suffix queries q_i in Suf (if b is a full binding for Q then Suf is empty and rew(P) = P):

rew(P) = P ⋈ q_1 ⋈ ... ⋈ q_k

Then the initial query Q can be rewritten as rew(Q, S), defined as the union of all prefix rewritings for sources in S:

rew(Q, S) = ⋃ { rew(P) | P is a prefix query of Q for some maximal binding on some source s in S }

Let a query execution plan (QEP) be defined as follows: (1) a query that can be answered by a single source (that is, a query for which there exists a full binding) is a(n atomic) QEP; (2) the union of two QEPs is a QEP6; (3) the join of two QEPs is a QEP7.

". This query can be optimized by keeping the variables that are common to the key query and

the prefix query (the case of variables # and $ in the example above).%. Remember that union is heterogeneous, that is two sets of tuples answering the same query

but resulting from different sources might have different structures for the & -th component.


Basically, sources answer atomic queries in a QEP and the mediator performs joins and unions. A QEP can involve several atomic queries sent to a given source. It might be interesting to combine such queries in a single query. This implies reorganization using classical properties such as distributivity of union w.r.t. join. Such properties and reorganizations, as well as other optimizations, are beyond the scope of this paper.

Input: a query Q and a set of sources S
Output: a query execution plan plan(Q) for Q
Algorithm:
plan(Q) := ∅;
for all sources s in S
    if there exists at least one maximal binding for Q in s
        for all maximal bindings b
            if b is a full binding
                p := Q translated using b;
            else
                decompose Q w.r.t. b into a prefix query P and suffix queries q_1, ..., q_k;
                p := P translated using b;
                for all suffix queries q_i
                    if p ≠ ∅  /* there exists a non-empty query plan */
                              /* for all subqueries up to q_i */
                        if plan(q_i) ≠ ∅  /* there exists a query plan of q_i */
                            p := p ⋈ plan(q_i);
                        else p := ∅;
            plan(Q) := plan(Q) ∪ p;
return plan(Q);

Figure 7. Query execution plan construction algorithm

Given a set of sources S and a query Q, the algorithm shown in Fig. 7 computes a query execution plan for Q. For each source s and maximal binding b, a QEP of the prefix rewriting is computed: if b is a full binding (i.e. complete answers are obtained), the result is the query Q translated using b. Else, if b is a partial binding, then query Q is decomposed into a prefix query and a set of suffix queries (these queries are also extended by the key queries as shown before).

7. We restrict join to the non-commutative aforementioned definition of join: the root of the second QEP should belong to the boundary of the first QEP and each of them should correspond to a concept for which a key has been defined.


The query execution plan of Q against source s is obtained by joining the translated prefix query to the query execution plan for each suffix query (the join being performed on the key query variables of the suffix query's root concept). To calculate the query execution plan of a suffix query, the algorithm is called recursively. Finally, the obtained plan is added to the existing plan by union.

Observe that there are two reasons to interrupt the calculation of a query execution plan for a given source s and binding b. The most trivial case is that there exists no maximal binding for Q in s. The second reason is that there exists at least one suffix query which cannot be satisfied (empty query execution plan).
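The recursive structure of the algorithm can be summarized by the following Python sketch; max_bindings_for, is_full, translate and decompose stand for the procedures described above, and the nested-tuple plan representation is a deliberately naive stand-in for real query execution plans.

def build_qep(query, sources):
    # Plans are nested tuples: ('atomic', source, query),
    # ('join', p1, p2) or ('union', plans); None means no plan exists.
    plans = []
    for s in sources:
        for b in max_bindings_for(query, s):        # may yield nothing
            if is_full(b, query):
                plans.append(('atomic', s, translate(query, b)))
                continue
            prefix, suffixes = decompose(query, b)
            plan = ('atomic', s, translate(prefix, b))
            for sq in suffixes:
                sub = build_qep(sq, sources)        # recursive call
                if sub is None:                     # unsatisfiable suffix:
                    plan = None                     # abandon this binding
                    break
                plan = ('join', plan, sub)
            if plan is not None:
                plans.append(plan)
    return ('union', plans) if plans else None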

5. System Architecture

In this section we sketch the architecture of the STYX prototype [FUN] (Fig. 8) that implements the data integration approach described previously.

Figure 8. STYX Portal Architecture. (The Web server hosts the Query Interface, Query Parser, Query Execution Plans Generator, Integration Module, Kweelt Query Engine and XSLT Processor, together with the Schema Manager, Rules Manager and Source Publication Interface; sources deliver XML fragments addressed by XPath location paths.)

XML Web

resources can be published on the fly by creating/modifying/deleting mapping rules between source fragments and the global schema using the Source Publication Interface. The global schema can be consulted through the Schema Manager, which is also responsible for its loading in a STYX portal. The mapping rules are first validated by the Rules Manager, which is also responsible for their storage. The publication of a resource also consists in providing an XSLT transformation program8 that can be used for formatting source data in the query result9.

8. XSL Transformations (XSLT): http://www.w3c.org/TR/xslt


Query processing is done in several steps. First, user queries can be formulated using a standard Web browser. They are either created by a generic Query Interface, or simply stored in the form of a hypertext link (URL). The Query Interface communicates with the Schema Manager, allowing the user to browse the global schema for the formulation of a query. The Query Interface forwards the query to the Query Parser, which performs a syntactical analysis of the query with some type-checking w.r.t. the global schema and produces a language-neutral intermediate representation of it. The query is then forwarded to the Query Execution Plans Generator, which creates the query execution plan. The Integration Module rewrites the queries into Quilt queries and sends them to the Kweelt Query Engine10 for evaluation11. The resulting XML fragments are sent to the Integration Module, which combines the results. This module, based on the query and the mapping rules, inserts schema-specific tags, and then the XSLT Processor (Cocoon12) finally transforms the result into an HTML document which is displayed in the user's browser.

The STYX prototype was implemented in Java JDK 1.2. XML technologies such as XSLT, XPath and the Xalan XML Stylesheet processor13 were used.

6. Related Work

Data integration has become an important issue over the past years and a large number of integration systems have been proposed. These systems can be classified according to the architectures used for query processing: data warehouse systems materialize all source data before query processing, whereas mediators propose a virtual database and push queries to the source level based on sophisticated query rewriting algorithms. Our approach clearly belongs to the second category.

Mediator systems are classified according to the way sources are described to the mediator and queries are evaluated [LEV]. Tsimmis [PAP 95], MIX [BAR 99] and YAT [CHR 00] follow the global-as-view approach and are not directly comparable to ours. On the other hand, Information Manifold [LEV 96] and Tukwila [POT 00] follow the local-as-view approach. In those systems the global schema is a flat relational schema, and Description Logics is used to represent hierarchies of classes. The sources are expressed as relational views over this schema. The Bucket algorithm, introduced in [LEV 96], rewrites a conjunctive query expressed in terms of the global schema using the source views. It examines each of the query subgoals independently and tries to find rewritings, but loses some by considering the subgoals in isolation.

9. If the query result contains XML fragments from a source, then those are transformed using the source's XSL Stylesheet.
10. Kweelt Query Engine: http://db.cis.upenn.edu/Kweelt/.
11. The Kweelt query engine can evaluate the subset of XQuery expressions presented in this paper.
12. http://xml.apache.org/cocoon
13. http://xml.apache.org/xalan-j/index.html


The MiniCon algorithm [POT 00] improves the Bucket algorithm by exploiting the input/output dependencies between the query subgoals to reduce the search space of possible rewritings. The binding algorithm presented in this paper resembles MiniCon since it exploits the parent/child dependencies of query variables for query decomposition.

The Agora [MAN 01] system offers an XML view over relational and XML data, and user queries are XQuery expressions. Although XML is used as the global data model, extended use of the relational model is made: the XML view is translated into a generic relational schema, XML resources are described as relational views over this schema, and XQuery expressions are translated to standard SQL queries which are then decomposed, optimized and evaluated. Our system and query rewriting algorithm extensively exploit the tree structure of XML data, which is described as local views of a more powerful conceptual schema with inheritance.

Last, the Xyleme [CLU 01] system is based on a data-warehouse solution for the integration of XML data (“all XML data of the Web”). However, it can be considered as a mediator system, since source data is stored without transformation and users can query this data via different views. Each view is described by a DTD, called an abstract DTD, and source data is mapped to one or several DTDs using path-to-path mapping rules. These rules are similar to our mapping rules, with the difference that they map absolute source paths (starting from the document root) to absolute paths in the abstract DTD (starting from the DTD root element). As we have mentioned in section 3, these settings result in less precise source descriptions compared to our approach, but enable the automatic generation of the mapping rules.

7. Conclusions

We proposed in this paper an alternative approach for integrating XML sources following the LAV approach. Instead of choosing a relational or XML schema for the global view, we advocated the use of an ontology-based mediation. The global schema is close to an object-oriented schema built on a terminology describing a common domain of interest, and users issue queries on this global schema. Our contributions are (i) a view definition language, (ii) a rewriting algorithm, (iii) an algorithm for generating execution plans, and (iv) a prototype validating the approach.

We are currently working on several extensions concerning our integration model. First, we try to extend the query language by allowing explicit joins in the where clause of a query. This does not change the binding algorithm, but increases the complexity of query processing. A second issue we are looking at concerns the usage of maximal bindings for query decomposition. In fact, the current version of the rewriting algorithm generates query execution plans which favor information stored locally in the same document. For example, if some source s provides a single full binding for some query Q, the algorithm will return the result of Q in s, but will not try to join s with some other source. This restriction can be removed by also allowing partial bindings that are not maximal, but this will increase the number of possible decompositions significantly.

8. Bibliographie

[ABI 99] ABITEBOUL S., BUNEMAN P., SUCIU D., Data On the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann, October 1999.

[AMA 02] AMANN B., BEERI C., FUNDULAKI I., SCHOLL M., « Ontology-Based Integration of XML Web Resources », International Semantic Web Conference (ISWC), Sardinia, Italy, 2002.

[BAR 99] BARU C., GUPTA A., LUDÄSCHER B., MARCIANO R., PAPAKONSTANTINOU Y., VELIKHOV P., CHU V., « XML-based information mediation with MIX », Demonstrations, ACM/SIGMOD, 1999, p. 597–599.

[BEE 97] BEERI C., LEVY A., ROUSSET M.-C., « Rewriting Queries Using Views in Description Logics », Proc. PODS, Tucson, Arizona, May 1997, p. 99-108.

[BUN 01] BUNEMAN P., DAVIDSON S. B., FAN W., HARA C. S., TAN W. C., « Keys for XML », Proc. WWW10, 2001, p. 201-210.

[CHR 00] CHRISTOPHIDES V., CLUET S., SIMEON J., « On Wrapping Query Languages and Efficient XML Integration », Proc. of ACM SIGMOD, Dallas, USA, May 2000.

[CLA 99] CLARK J., DEROSE S. (EDS.), « XML Path Language (XPath) Version 1.0 », W3C Recommendation, November 1999, http://www.w3c.org/TR/xpath.

[CLU 01] CLUET S., VELTRI P., VODISLAV D., « Views in a Large Scale XML Repository », Proc. VLDB, Rome, Italy, September 2001.

[FAN 01] FAN W., KUPER G., SIMEON J., « A Unified Constraint Model for XML », Proc. WWW10, Hong-Kong, China, May 2001.

[FUN] FUNDULAKI I., AMANN B., BEERI C., SCHOLL M., « STYX: Connecting the XML World to the World of Semantics », Demonstration at EDBT'2002.

[GOA 00] GOASDOUÉ F., LATTÉS V., ROUSSET M.-C., « The use of CARIN Language and Algorithms for Information Integration: The PICSEL System », International Journal on Cooperative Information Systems, 2000.

[HAL 00] HALEVY A., « Theory of answering queries using views », SIGMOD Record, vol. 29, n° 4, 2000, p. 40–47.

[HEF] HEFLIN J., VOLZ R., DALE J., « Requirements for a Web Ontology Language », http://www.w3.org/TR/2002/WD-webont-req-20020307/.

[LEV] LEVY A., « Answering queries using views: a survey », http://www.cs.washington.edu/homes/alon/site/files/view-survey.ps, submitted for publication.

[LEV 96] LEVY A., RAJARAMAN A., ORDILLE J., « Querying Heterogeneous Information Sources Using Source Descriptions », Proc. VLDB, Mumbai (Bombay), India, September 1996, p. 251-262.

[MAN 01] MANOLESCU I., FLORESCU D., KOSSMANN D., « Answering XML Queries over Heterogeneous Data Sources », Proc. VLDB, Rome, Italy, September 2001.

[PAP 95] PAPAKONSTANTINOU Y., GARCIA-MOLINA H., WIDOM J., « Object Exchange Across Heterogeneous Information Sources », Proc. ICDE, Taipei, Taiwan, March 1995, p. 251-260.

[POT 00] POTTINGER R., LEVY A., « A Scalable Algorithm for Answering Queries Using Views », Proc. VLDB, Cairo, Egypt, September 2000, p. 484-495.

[THO 01] THOMPSON H., BEECH D., MALONEY M., MENDELSOHN N., « XML Schema Part 1: Structures », W3C Recommendation, May 2001, http://www.w3.org/TR/xmlschema-1.


Construction and Maintenance of a Set of Pages of Interest (SPIN) using ActiveXML

Serge Abiteboul — Grégory Cobéna — Benjamin Nguyen — Antonella Poggi

INRIA
Domaine de Voluceau
78153 Le Chesnay CEDEX
FRANCE
Email: [email protected]

RÉSUMÉ. Dans cet article, nous nous intéressons à la construction d'entrepôts dynamiques à partir de ressources du Web, en utilisant notamment des services disponibles sur le web. Notre contribution tient essentiellement à la définition d'une nouvelle architecture basée autour de services web (e.g. SOAP) et du langage ActiveXML, nous permettant de résoudre notamment les problèmes suivants : (i) l'acquisition de pages du Web, (ii) le contrôle des changements de ces pages et (iii) l'enrichissement des données et méta-données par l'utilisation de services Web. Le système est développé en ActiveXML, un langage et un système permettant l'intégration d'appels à des services Web au sein d'un document XML.

ABSTRACT. In this article, we examine the problem of constructing a temporal data warehouse using web services. There are many important aspects in the construction of such a warehouse. Our particular contribution in this article regards the global architecture of a system that can (i) acquire specific pages from the web, (ii) control page changes, and (iii) easily be enhanced using various web services. In order to present a concise architecture, the whole system was designed using ActiveXML, a language and a system based on the embedding of web service calls into XML documents.

MOTS-CLÉS : Entrepôt de données, XML, Bases de données semi-structurées, Données tempo-relles, Services Web

KEYWORDS: Data warehousing, XML, Semi-structured Data Bases, Temporal Data, Web Services


1. Introduction

The number of pages available on the web is increasing, and with it, the quantity of information. A classical problem is that of maintaining a collection of web resources (URLs) of interest for a particular community. Important aspects are (i) the acquisition of pages (from the web or from users), (ii) the control of changes, (iii) the enrichment of data and meta-data, (iv) the storing of the collection and (v) the support of topic-specific query mechanisms. In this paper, we are concerned with the first three aspects. More precisely, we present a framework to support the construction of sets (collections) of pages related to a given interest, called SPIN (for Set of Pages of INterest).

The SPIN approach is based on ActiveXML [ABI 02], a language and a system that allows the embedding of calls to web services in XML documents. The ActiveXML language is simple yet provides an extremely powerful form of data-centric distributed computation, with XML and XML query languages as cornerstones.

The starting point of the present work is the strong belief that the construction of thematic collections (a very common task today, for instance in thematic portals) should be based on a declarative specification. SPIN designers specify the perimeter of their collections through a formal declaration named the intension of the SPIN. Our intensional definition of SPIN (based on ActiveXML) is a first step in that direction. It facilitates the definition, deployment and maintenance of such thematic collections. In the specification, the designer may rely on simple functionalities such as full-text search and navigation. We will see that the specification may support much more complex needs.

Once the intension of a SPIN has been defined, it is the responsibility of the system to construct the extension, i.e., the collection of resources that match the specification. (One may want to think of it as a collection of URLs or as a collection of actual pages.) The information around these pages may be enriched (as specified in the intension) with additional meta-information, such as the last date of change, the size of the page, the MD5 signature of the page, its page rank, the number of links pointing to it, etc. The SPIN designer controls when such an extension is computed and may request to have it computed regularly, e.g., weekly.

An important characteristic of the SPIN approach is that new functionalities may easily be added because of the ActiveXML basis, under the assumption that the new functionalities are provided as web services. These functionalities enable many features such as:

– the integration of “real time” information, e.g., based on subscription servicesthat may push resources of interest to a particular SPIN.

– the enrichment of the content of the collection with some processing, e.g., onemay use a classification service to classify all the documents according to a particularhierarchy.

An important aspect of the SPIN system is its management of change. The systemcan detect changes that occurred both inside pages of the collection, but also in the


collection itself (new pages). Furthermore, it can detect changes in meta-data obtained by processing the collection, e.g., detect if the classification of a page changed. The change detection system is based on the XML diff algorithm described in [COB 02], a diff tool that computes changes in a labeled tree. By using this algorithm and the successive versions of a SPIN, we can produce data (reports) that describe the changes between versions.

The main contributions of the paper are :

1) the use of ActiveXML as the core language of our system. The intensions of collections are defined in ActiveXML.

2) a library of web services, also called SPIN, to support specific features of SPIN management (user management, web crawler, ...).

3) an architecture that enables the construction of such collections using ActiveXML servers, standard services available on the web, and the SPIN library.

ActiveXML is a peer-to-peer system. One could imagine the collection being distributed over several web sites that cooperate in building it. This aspect is somewhat orthogonal to the specification of the collection (the topic of this paper). We will ignore it here. It remains nonetheless an essential asset of our approach.

In the following section, we present a motivating example for our system. In Section 3, we present a short state of the art, and the tools used in our system. In Section 4, we describe the basic data model and the core services, using the ActiveXML formalism. In Section 5, we detail some other advanced features that can be added by using the appropriate web services. In Section 6, we detail the temporal aspects of the system, and follow up with the explanation of how this is all integrated. We conclude by mentioning experiments performed with our first prototype.

2. Motivation

In this section we present motivating examples for the use of our system. We start by presenting basic examples. We then describe the running example used throughout the paper.

We first consider three simple scenarios. These are actually directly supported by existing software. We will show in the following sections how all these examples can be generalised in a larger unified framework, SPIN. The advantage of our system is that these examples can all coexist in a much larger context supporting a variety of other scenarios and that they can very easily be specified in a declarative manner.

– A web-master wants to detect broken links in the Web site he maintains. For instance, consider the set of pages whose URLs begin with www-rocq.inria.fr/ (INRIA, Rocquencourt). He wants to fetch pages, parse the files to detect links, and fetch the links in a recursive fashion while recording URLs that have been already processed and those that are broken. Such a service can easily be specified using our declarative language. Furthermore, one may enrich the list of all the pages (as it is constructed)


with information such as the type of the page (html, pdf, etc.) or the appearance of certain keywords.

– The person in charge of a portal may want to monitor [NGU 01] a web site or a set of web pages on a specific topic, and be informed when pages are added or updated. Such a service can again be specified in a concise way. Let us stress that the use of ActiveXML makes the integration of all the web services called much simpler. The user of the system can easily improve its possibilities, e.g., processing the newly discovered pages with tools that enable their automatic translation into a given language or their classification according to a particular ontology.

– A user wants to search the web using several search engines (metasearch): the goal is to integrate results from various search engines. (See, e.g., Kartoo [KAR].) It is also straightforward to specify such an application using SPIN. It is easy to “tailor” this integration and add some specific processing of the resulting set of URLs, e.g., searching for synonyms.

As we can see, the requirements differ greatly from one application to another.We propose some basic functionalities, and a simple yet expandable platform thatintegrates them.

Motivating example

We next introduce a running example that will be used in the entire paper. Suppose we want to maintain the collection of web pages of interest for the inhabitants of a particular city, for instance Sèvres in France. The first web sites that come to mind on this topic would be the city's official web site, or the local theater web site. To find other interesting web pages, we may use search engines and keyword search. At this point, the interaction between our system and the users is critical: they may add sites of interest, provide new keywords that should (or should not) be investigated, give positive or negative feedback on specific web sites and pages. In our case, a user may, for instance, specify that a page containing the keywords Sèvres and 92310 (the zip code of Sèvres) is potentially interesting, whereas a page containing Deux-Sèvres (a French department) is probably not.

Once we have the collection of documents, the work is not yet over. Many more tasks are left, such as: classifying the pages according to a given ontology, indexing pages, using the links inside pages of interest to obtain more pages that could also be interesting.

Finally, it is very important to keep the collection up to date, and possibly archive some of it regularly, i.e., perform change control.

3. Context

In this section, we first present ActiveXML, which plays a central role in our work. We then briefly discuss the topic of Web Portal creation, which presents a number of common goals with SPIN support.


The choice of ActiveXML We decided to use ActiveXML. For a complete overview of ActiveXML, we refer to [ABI 02]. We could have based our system on a relational or object database model, or used a semi-structured language such as Ozone [LAH 99] or Microsoft Smart Tags [POW]. Our choice of XML and web services is more than simply following a trend. First of all, XML documents do not suffer the limitations of strict typing. They can be easily enriched; the flexible typing aspect of XML is crucial here. Secondly, the interaction with the users (via web browsers) and with web functionalities such as search engines or crawlers is much simpler in ActiveXML than with relational or object databases. Finally, ActiveXML fits nicely into the web context, where the system should not stop running if a service call fails, or if another one is very slow in returning its answer. Nonetheless, many ideas presented here could be transposed to an ODMG warehouse with a little work (methods encapsulating web services, etc.) or to a relational context.

Web Portals Constructing web portals is an important and complex task that presents a number of common goals with the construction of SPINs. Many companies, such as IBM, with Websphere Portal Service [Web], or Sybase (Enterprise Portal) [Syb], propose complex systems to build Web Portals. In domains such as e-business or surveillance systems, other companies also offer specific and complicated solutions. Our proposal features a very simple system, based on ActiveXML, that also simplifies the task of building web portals. As will be shown, the declarative description of a SPIN takes only a few lines of code. Thanks to its modular approach, our system is highly adaptive and is not limited to a specific type of application.

Web Services It should finally be observed that more and more services will be offered on the web that can participate in SPIN construction. For instance, the SPIN library includes methods to access search engines via the SOAP protocol. Recently, Google started providing such a service. Indeed, it is very likely that the low-level services supported by SPIN will soon be available on the web. SPIN should be viewed as the glue needed to combine them. To continue with new services found on the web, Google now offers news search [Goo]. This is in some sense a SPIN service: it is an access to a particular collection of pages, the news headlines. The SPIN system goes further, since it lets users define their own interests and interesting sites, as well as the functions to perform on them.

Why SPIN? The goal of this work is to present a global and generic framework that enables the creation of many different sorts of applications. While most commercial systems are aimed at a specific task, it is our belief that many simple web services exist on the web, and that by using them within the correct architecture we can have a simple-to-use yet highly modular and efficient system.


4. Specifying the collection

In this section we give a brief description of the basics of the data model. We first present the intension, then the extension of the SPIN. Finally, we turn to the presentation of web services.

4.1. Data Model

A data warehouse consists of a header and a certain number of SPINs. Each SPIN consists of an intension, an extension and some service definitions. The warehouse is seen as an ActiveXML document. As mentioned earlier, an ActiveXML document may use external services (via service calls) that may appear anywhere in the document. To simplify, in the SPINs we consider here the service calls will be concentrated in the service definitions. In general, an ActiveXML document may also offer (to the external world) a number of services1 that are defined using queries (typically in X-OQL [AGU], XQuery or XPath) and updates, of which we will also give some examples. The specification of a SPIN for Sèvres may look like this:

<spin:warehouse name="Sèvres"><spin:head>

<spin:owner id="Serge" /><spin:title>Sèvres Warehouse</spin:title><spin:accessControlList><spin:access group="friends" mode="call"/><spin:access group="all" mode="read"/></spin:accessControlList>

</spin:head><spin:spin name="sevres">

<spin:intension> ... </spin:intension><spin:extension> ... </spin:extension><spin:services> ... </spin:services>

</spin:spin><spin:spin name="sevres-sculpture"> ... </spin:spin></spin:warehouse>

This first part of the SPIN (the header) describes the owner of the SPIN, its general title, and the access control list. In this case, all users are allowed to view the contents of the SPIN, but only the users in the friends list can call the services provided by our warehouse. Now consider the intension of a particular SPIN:

<spin:spin name="sevres"><spin:intension><spin:bound>3000</spin:bound><keywords>

1. These services may be restricted by only letting specific access groups use them.


<keyword>Sèvres</keyword><keyword>92310</keyword>

</keywords><interestingSites>

<site>http://www.ville-sevres.fr/</site><site>http://www.vertsdesevres.com/</site>

</interestingSites></spin:intension> ...

</spin:spin>

The intension subtree contains all the data (parameters) used by the services to perform the operations that construct the SPIN. The spin:bound element indicates the maximum number of URLs we want to have in the extension. The intension also provides a list of keywords, Sèvres and 92310, and a list of interesting sites such as http://www.ville-sevres.fr/. Note that keywords and interestingSites are not part of the namespace spin. Different SPINs may use different services and thus different kinds of data in the intension. The only constraint is that the intension must contain all the parameters needed by the services that will be called.

Note that, strictly speaking, the specification of a SPIN is made up of more than simply its intension: we also need to know the web services that use this intension. These will be detailed later.

Consider now one extension :

<spin:extension date="31 jul 2001"><spin:url id="http://www.mysite.com/mypage.html"><content>...</content><link>http://www.yahoo.com/</link><link>http://www-rocq.inria.fr/</link><type>HTML</type><last_update>28 jul 2001</last_update><classification>Group1 (Resume)</classification><site>http://www.inria.fr/</site>

</spin:url> ...</spin:extension>

For one SPIN, there may be many extensions, each one computed at a given date. This is captured by the date attribute of the spin:extension node. For each resource (URL), various pieces of information are recorded. Again, exactly what information is recorded depends on the specification of the web services. Thus, in the extension also, we are not limited to predefined attributes. The data model is fully extensible, and any kind of (meta) data may be added to the entry defining a page. The only attribute that must be present for a spin:url element is the (unique) identifier id of the page, which is its URL. All other attributes or child nodes are not part of the basic schema of SPIN. Therefore, not all the meta data stored in the extension is part of the spin namespace, since it is service dependent. In general, each node will be prefixed with the namespace of the service that provided the information.


4.2. Web services

Consider now web services. Calls to web services in ActiveXML are also syntactically in XML, as in:

<axml:sc methodName = "myWebService"><axml:params><axml:param name="param1">XML data or XPATH expression

</axml:param>...

</axml:params></axml:sc>

ActiveXML service calls return an (Active)XML document. Note that a number of features in ActiveXML make it possible to control the firing of the call, and the duration of the validity of the data that is obtained. We will ignore this aspect here.

Let us consider how we can obtain URLs of interest using data from the intension. This is done as follows using two services, askGoogle and getSite:

<spin:services>% Keyword Querylet askGoogle($name) be for each $X in<axml:sc name="http://www.google.com/googleSearch">

<axml:params><axml:param name="keyword"

xpath="self//spin:spin[name=$name]/keywords" />

</axml:params></axml:sc>do insert (self//spin:spin[name=$name]

/spin:extension/<spin:url id=$X>)

% Interesting siteslet crawlInterestingSites($name) befor each $X in<axml:sc name="http://www.myservices.com/getSite"><axml:params>

<axml:param name="url"

xpath="self//spin:spin[name=$name]/spin:intension/interestingSites/site/" />

<axml:param name="depth">5</axml:param><axml:param name="bound"

xpath="self//spin:spin[name=$name]


/spin:intension/spin:bound/"/></axml:params></axml:sc>do insert (self/spin[name=$name]/spin:extension

/<spin:url id=$Xopinion="yes">)</spin:services></spin:warehouse>

Explanation:

– The first query calls a (pseudo) Google service www.google.com/googleSearch/ passing to it lists of keywords coming from the intension (XPath query). The results (a list of URLs) are then inserted into the extension of the SPIN as nodes named <spin:url>. The attribute id contains the URL string. Note that this is done by side effect of the call.

– The second query uses a web service that retrieves all the pages that can be reached starting from a page and following at most 5 links not exiting from the given site; a minimal sketch of such a bounded crawl is given after this list. The starting pages are obtained in the //spin:spin/spin:intension/interestingSites/ subtree. The bound parameter indicates that we do not want to retrieve more than a certain number of pages. If we go over that limit we stop crawling. The list of URLs is included into the extension in the same way as before, but each URL discovered this way also has an attribute opinion="yes" which indicates a higher degree of confidence in the quality of the URLs discovered this way.
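The following Python sketch illustrates the bounded, site-local crawl performed by such a getSite-like service; fetch and extract_links are assumed helper functions, and error handling and politeness are omitted, so only the depth and page-bound logic is shown.

from urllib.parse import urlparse

def crawl_site(start_url, depth=5, bound=3000):
    site = urlparse(start_url).netloc
    seen, frontier = set(), [(start_url, 0)]
    while frontier and len(seen) < bound:
        url, d = frontier.pop(0)
        if url in seen or urlparse(url).netloc != site:
            continue      # already processed, or the link exits the site
        seen.add(url)
        if d < depth:
            for link in extract_links(fetch(url)):  # assumed helpers
                frontier.append((link, d + 1))
    return seen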

5. Advanced SPIN Features

We may add as many features to the warehouse as we want. The only limit is that these features must be provided by web services. The parameters used by these more complex service calls need to be written in the intension definition, so that they may be accessed with a simple path query.

Classification

Let us imagine we have a classification service that, given a hierarchy of classes (defined with example web pages for instance) and a new web page, classifies the web page into the hierarchy. We first of all need to define the classes that we are going to encounter in the SPIN. We define them statically here, but these classes could also be obtained by a service call. The classes could for instance be defined in the intension of the SPIN as follows:

<spin:intension>...<classification>


<art> <music/> <cinema/> <theatre/> ... </art><sports><individualSports><tennis/>...</individualSports><collectiveSports><football/>...</collectiveSports>
</sports><history> ... </history>...

</classification>...</spin:intension>

We now define a service that uses these parameters to run on a set of pages already in the extension, and adds to each page the attribute class, which contains the class of the hierarchy that the document belongs to.

<spin:services>let getClassInfo($name) be select self//spin:spin[name=$name]/spin:extension/

<spin:url id=$X ><class>$X</class></spin:url>from $X in

<axml:sc name="http://www.class.com/"><axml:params>

<axml:param name="urlList" >select $Yfrom $Y in self//spin:spin[name=$name]

/spin:extension/spin:url/@id</axml:param><axml:param name="classHierarchy"

xpath="self//spin:spin[name=$name]//classification" />

</axml:params></axml:sc>

...</spin:services>

User annotations

In this paragraph we give the example of a service that lets users annotate each web page. The code for the service is the following:

<spin:services>...% Enter a user’s opinionlet addUserOpinion ($ID, $opinion, $text, $class) be <spin:url id=$ID>

<opinion>$opinion</opinion><comment>$text</comment>if $class != "" then <class>$class</class>

</spin:url>


...</spin:services>

This service takes as input parameters the ID (url) of a page and the opinion (yes/no), comment (a string) and class (a string) that a user thinks the URL should have.

Using other web-services

The previous example shows how to construct the most basic data warehouse. The power of using ActiveXML in the definition of our system lies in the fact that an experienced user will be able to write his own ActiveXML programs to enhance the semantics of his data warehouse. Up to now, the only temporal aspect of the data warehouse consisted in the dates that appear in the <spin:extension date=@D> nodes. In the next section, we show how we manage the data over time in more detail.

6. Temporal Aspects of SPIN

We have defined an efficient architecture for web services using ActiveXML. It is important to note that ActiveXML is data-based, and that the stored data plays an essential role in the application. We detail in this section the version archiving process and version management of SPIN. Note that the way the system discovers that a page has changed, and the management of loading these pages, is beyond the scope of this paper. We refer to [NGU 01] for more details on the topic.

Archive

With ActiveXML, data returned by a web service call is stored in the AXML document itself. Our system can be used for services mediation, proxy/caching and archiving by setting the proper validity and periodicity flags. For instance, a simple web archive is built using a crawling service with a validity time set forever. In more complex applications, the archiving process consists in several steps, in particular content filtering, page classification or indexing.

We show how to query the archived data. Then, we will show a simple way to compress the archive by using a diff [COB 02] service.

Querying the AXML Document

An AXML document is first of all an XML document and contains data. Thus, it can be queried using an XML query language, such as XQuery [W3C] or X-OQL [AGU, AGU 00]. For example, users may query the document to find a page in the archive. Also, the SPIN application itself may use queries to retrieve access rights from inside the AXML document, and manage access control using stylesheets in the spirit of [GAB 01]. Queries are also used in the AXML document itself to extract parameters for service calls.

Further ideas


For more elaborate queries on AXML documents, a semantic mediation module is clearly needed. A possible solution is to use an external service, like Xyleme [Xyl], to (i) provide an abstract view of the domain, (ii) retrieve mapping rules, and (iii) translate abstract queries into concrete queries. More work is clearly needed in that direction.

When AXML documents become larger, an index structure is needed to accelerate queries. It is possible, for instance, to store AXML documents in a native XML repository. Another solution consists in creating a simple index inside the AXML document itself (by using an external web service). This is in the spirit of LaTeX or Microsoft Word, which have a function to generate such an index at the end of a document. In this case, the index can also be made available to the user.

Delta-based compression

We consider here a data warehouse that is updated on a regular basis. It may be an archive, or a caching proxy for instance. The delta-based compression is used on documents for which several versions are stored. The goal is to save storage resources by storing only changes between two consecutive versions of documents instead of storing both documents. In [MAR 01], several storage strategies are presented. For instance, one may want to store only the latest version of the document, and all backward deltas from a version to the previous one. Then, we are able to reconstruct any older version by applying the sequence of deltas to the document.
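This backward-delta scheme can be sketched in a few lines of Python; apply_delta stands for the ApplyDelta call introduced below, and the storage layout (latest document plus a list of backward deltas indexed by version) is our illustrative assumption.

def reconstruct(latest_doc, backward_deltas, target, latest):
    # backward_deltas[v] turns version v+1 into version v;
    # walk backwards from the latest version down to the target one.
    doc = latest_doc
    for v in range(latest - 1, target - 1, -1):
        doc = apply_delta(doc, backward_deltas[v])  # stands for ApplyDelta
    return doc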

In some applications (e.g. editing XML documents), the changes applied to a document are known. But in most cases, the set of changes is unknown, and a diff algorithm is necessary to detect changes between two consecutive versions [COB 02, DEL]. These services can be integrated into two calls:

DetectChanges(XMLdoc V0, XMLdoc V1) returns XMLdelta d-v0v1;

ApplyDelta(XMLdoc V0, XMLdelta d-v0v1) returns XMLdoc V1;

In our system, these services are provided by XyDiff [COB 02]. Based on XyDiff, we defined a simple aggregation service for SPIN. It receives two XML subtrees representing different versions of a document, and returns an AXML document containing the first subtree, a delta, and a service call to reconstruct the second subtree. The nice thing with AXML is that embedded service calls are transparent to the user. Thus, the result returned by the aggregation service is virtually identical to the subtrees passed as arguments. We show in the aggregate function definition how to use these two web services.

Using AXML validity settings, the compression service can be used when a newdocument is stored, or on a regular timing (e.g. every week). The AXML definition ofthe aggregation service is as follows :

let aggregate($name, $D1, $D2) be insert self//spin:spin[name=$name]

/spin:extension[date=$D1]/<delta from=$D1 to=$D2>


... %the delta</delta></spin:extension>self//spin:spin[name=$name]/spin:extension[date=$D2]/<axml:sc name="applyDelta"><axml:params><axml:param name="from"

xpath="../spin:extension[date=$D1]"/><axml:param name="delta-loc"

xpath="../delta[from=$D1 && tp=$D2]"/>

</axml:params><validity>CLONE VALUE</validity><refreshPolicy>ON DEMAND</refreshPolicy></axml:sc></spin:extension>

delete self//spin:spin[name=$name]/spin:extension[date=$D2]

Depending on the XML diff tool used, the aggregation service has different features. XyDiff, for instance, is very fast and well suited for large AXML documents. Many other XML diff tools would take hours to diff files of a hundred kilobytes.

The change detection service may also be used to monitor changes [NGU 01], for instance, to detect that an important page contains new links.

7. Architecture and Experiments

We present in this section the architecture of the project as well as implementation status and some experiments.

The architecture is described in Figure 1. The SPIN application stores and queries data through the XOQL-based XML repository. The core of SPIN relies on web services integration by the AXML processor. The processor will process the AXML document that describes the data warehouse that we wish to construct, by calling the various services described previously in the paper, and shown in Figure 1. The proposed architecture contains four web services, but more can be added (e.g. classification). Let us note that the XOQL, XyDiff and Xyleme services already existed, and were integrated into the architecture by making them AXML compliant. Other web services (e.g. Crawler) are built from Unix applications using wrappers. Our ongoing implementation effort is to wrap useful web services to the AXML standard, in order to be able to integrate them afterwards in our framework.


!"

#$

"

%&&$'

""(!!$"

)*++!!$"

$"

%&&'

Figure1. Architecture

The user interface of our system consists of the module named “AXML client” in Figure 1. It is based on a browser and uses dynamic HTML pages generated from XML documents using style-sheets (by Cocoon) and the servlet processor Tomcat. The pure-data XML documents are obtained using a light AXML processor that evaluates only the services required for the display.

Experiments and Implementation Status Some modules like XOQL or XyDiff are fully functional, though not yet completely integrated into web services. Both XOQL and XyDiff have been developed as part of an earlier project and have been intensively tested/used since. The ActiveXML processor is also implemented; we are now working on extending its features, in particular the security aspects. We also designed a prototype application with SPIN services, including a lightweight web crawler and a wrapper to Google2. These modules have been connected to both XyDiff and XOQL to compute deltas, compress the archive, and send notification reports to users.

For instance, we constructed the SPIN of the www.inria.fr web sites, which contains over 13,000 web pages, and maintained it for several weeks to see new and updated pages. The corresponding XML document weighs several megabytes.

To conclude this section, it is important to observe that extending the SPIN library becomes easier each day. First, more and more web services that the library may use become available. Also, tools like GLUE [Min] can now be used to quickly create web services, e.g., from Java programs.

2. This was before the Google SOAP interface was made available.


8. Conclusion and perspectives

In this article, we presented a simple and expandable system, based on web services, that constructs and maintains a data warehouse over time. The architecture is modular; thus, enhancing the warehouse with application-specific modules is very simple. We believe this system can make up the core of many data warehousing applications, and provide the basic services that are needed in order to maintain such a warehouse. The use of ActiveXML provides an easy means of adding new functionalities to the warehouse, provided these services are encapsulated behind the SOAP protocol.

Acknowledgments We would like to thank Vincent Aguilera, Omar Benjelloun,Laurent Mignet, Tova Milo, Michalis Vazirgianis, Beiting Zhu for discussions on thetopic.

9. Bibliographie

[ABI 02] ABITEBOUL S., BENJELLOUN O., MILO T., MANOLESCU I., WEBER R., « ActiveXML: A Data-Centric Perspective on Web Services », Bases de Données Avancées, 2002.

[AGU] AGUILERA V., « X-OQL Query Language for XML », www-rocq.inria.fr/~aguilera/xoql/.

[AGU 00] AGUILERA V., CLUET S., VELTRI P., VODISLAV D., WATTEZ F., « Querying XML Documents in Xyleme », Proceedings of the ACM-SIGIR 2000 Workshop on XML and Information Retrieval, Athens, Greece, July 2000.

[COB 02] COBENA G., ABITEBOUL S., MARIAN A., « Detecting Changes in XML Documents », ICDE, 2002.

[DEL] DELTAXML, « DeltaXML: Change Control for XML in XML », www.deltaxml.com.

[GAB 01] GABILLON A., BRUNO E., « Contrôles d'accès pour documents XML », Bases de Données Avancées, 2001.

[Goo] « http://www.google.com/news/ ».

[KAR] KARTOO, « www.kartoo.com/ ».

[LAH 99] LAHIRI T., ABITEBOUL S., WIDOM J., « Integrating Structured and Semi-Structured Data », Int. Workshop on Database Programming Languages, 1999.

[MAR 01] MARIAN A., ABITEBOUL S., COBENA G., MIGNET L., « Change-centric Management of Versions in an XML Warehouse », VLDB, 2001.

[Min] MIND ELECTRIC, « GLUE », www.themindelectric.com/glue/index.html.

[NGU 01] NGUYEN B., ABITEBOUL S., COBENA G., PREDA M., « Monitoring XML data on the Web », Proceedings of the ACM-SIGMOD, 2001.

[POW] POWELL J., MAXWELL T., « Integrating Office XP smart tags with the Microsoft .NET Platform », http://msdn.microsoft.com/.

[Syb] « Sybase Enterprise Portal », http://www.sybase.com/products/ep/.

[W3C] W3C, « XQuery », www.w3.org/TR/xquery.

[Web] « IBM WebSphere », http://www.ibm.com/websphere/.

[Xyl] « Xyleme S.A. », http://www.xyleme.com/.


Session 3
Systèmes distribués adaptables


A Component-based Infrastructure for Customized Persistent Object Management

Luciano García-Bañuelos1 — Phuong-Quynh Duong — Christine Collet

LSR/IMAG Laboratory
681, rue de la Passerelle
38400 Saint Martin d'Hères, FRANCE

Luciano.Garcia-Banuelos, Phuong-Quynh.Duong, [email protected]

ABSTRACT. In this paper we present a component-based infrastructure for building customized persistent object managers. We consider that future database management systems (DBMS) will be built as a set of adaptable services. Our infrastructure is comprised of such services, and its usage is twofold. First, it can be used to provide object managers for applications where a full-fledged DBMS may be cumbersome (e.g. mobile computing). Second, it can be used as a middleware for persistent object management (e.g. on top of an existing DBMS). The paper focuses on architectural principles of the infrastructure, its components, as well as the inter-component interactions and dependencies.

RÉSUMÉ. Dans cet article nous présentons une infrastructure à base de composants pour la construction de gestionnaires d'objets persistants personnalisés. Nous considérons que les futurs systèmes de gestion de bases de données (SGBD) seront construits comme un ensemble de services adaptables. Notre infrastructure comporte de tels services et peut être utilisée pour 1) fournir des gestionnaires d'objets aux applications dans lesquelles il n'est pas nécessaire de disposer de toutes les fonctions d'un SGBD (e.g. informatique mobile), 2) offrir un intergiciel dédié à la gestion de données persistantes (e.g. au dessus d'un SGBD existant). Le document présente les principes architecturaux de l'infrastructure, ses composants logiciels, ainsi que leurs interactions et dépendances.

KEYWORDS: Database services, DBMS architectures, Component-based architectures, Persis-tence management, Cache management, Multi-level transactions.

MOTS-CLÉS : Services base de données, Architectures SGBD, Architectures à base de compo-sants, Gestion de persistance, Gestion de cache, Transactions multi-niveaux.

1. Supported by the CONACyT Scholarship program of the Mexican Government.


1. Introduction

This paper presents preliminary results of the NODS (Networked Open Database Services) project [COL 00]. The evolution of this work can be tracked in [Gar 01, Gar 02]. Our project vision is that DBMSs should be unbundled into a set of cooperating services. With such services, we aim at constituting a framework from which system programmers may build customized data managers. This vision is shared by several other researchers (see for instance [VAS 94, SIL 97, GEP 98, HAM 99, CHA 00]). However, we observe that only a few active research projects are devoted to revisiting DBMS architectures.

There are contexts where the overheads imposed by the whole DBMS machinery cannot be afforded. This is the case of data management applications in resource-constrained environments, e.g. mobile computers and devices. The only solution for scaling DBMSs down to those environments seems to be building new ad hoc software. But the problem is not only downsizing.

There have been previous attempts to develop more adaptable DBMS architectures. Most of those early works kept the fully functional DBMS as the fundamental unit of packaging for both installation and operation.

We claim that a component-based approach should be taken to architecting DBMS software. Thus, instead of considering extensible DBMS kernels, we propose finer-grain components. Our components provide low-level services such as logging, caching, concurrency control, and recovery, among others. These components can be used independently, as a subset, or altogether, according to the data management requirements.

Our work takes well-known techniques for persistent object management and puts them into a generalized infrastructure. The novelty of our work lies in the resulting architectural principles, which enable componentization. We applied architectural reasoning in a systematic way to characterize component interdependencies. All these principles have been validated by building a fully functional prototype.

The benefits of a component-based approach are striking:

– Components can be used to build stand-alone data managers for applications with particular requirements.

– Components can be used to build database middleware.

– Systems using multiple component-based services can exploit common sub-components, improving resource sharing.

1.1. Approach

We have tackled the problem of componentizing DBMS software. However, a correct componentization of DBMS software requires a deep study, because of the subtle dependencies among its functional constituents.


We started by working on the isolation of persistence-related services from the whole DBMS machinery. The goal was to provide components to assemble persistent object managers not only in transactional contexts, but also in non-transactional reliable and unreliable contexts. The road map towards the consolidation of the component-based infrastructure is shown in Figure 1. A general explanation follows:

Componentization – This phase implied the definition of the functional dimensions characterizing persistence management. We tried to define fine-grain but meaningful components, keeping code reuse in mind. In addition, we had to establish the boundaries of each data management aspect: persistence, fault tolerance, concurrent access, and transaction support.

Reasoning about dependencies – We continued by establishing component interdependencies and frameworks providing ways to separate and/or assemble components. In particular, we concentrated on characterizing the complex relationships between persistence management and other system aspects, e.g. how to separate fault tolerance from persistence management?

Customized persistence management – Finally, we devised tools and techniques supporting multiple software configurations, such as persistence with no fault tolerance, persistence with crash recovery, and persistence with transaction support.

Figure 1. Towards a component-based infrastructure: The road map

The resulting infrastructure architecture is organized in three layers. The first layer copes with unreliable storage management. The second one adds support for system crash resilience, by introducing a redo-only recovery method. Finally, the third layer provides transaction support. All the layers are articulated with a multi-level recovery method.


1.2. Paper outline

The rest of this paper is organized as follows. The global picture of the architecture of our infrastructure is detailed in Section 2. This architecture is organized in three layers, and Sections 3 to 5 provide the design patterns characterizing the interactions between components of these layers. Then, Section 6 overviews our first prototype. Section 7 provides a short survey of related previous work on DBMS architectures and proposals for a persistence service. We conclude and review ongoing work in Section 8.

2. Architecture

The infrastructure has a multi-layer architecture as shown in Figure 2.

Figure 2. Infrastructure architecture

Each layer in the infrastructure copes with a well-identified set of system concerns. Layer 0 provides an unreliable persistence service. Layer 1 adds system crash resilience. Finally, Layer 2 adds support for implementing a transactional resource manager. A general description of the components follows:

Layer 0 – Unreliable, base components

– The CacheManager is responsible for maintaining popular data objects in main memory. This component is based on the well-known FIX/use/UNFIX protocol [GRA 93b]. This protocol makes it possible to keep track of accesses to data objects in order to implement replacement strategies, and to maintain a reference count of concurrent accesses.


– The StorageManager provides a unified interface for accessing, in an object-based fashion, a given underlying storage product, e.g. a DBMS. This interface thus allows plugging in not only ad hoc storage managers, but also any existing storage technology. For instance, storage managers can be built on top of object-relational DBMSs, by means of the appropriate mapping and runtime libraries.

– The PersistenceManager coordinates a CacheManager and a StorageManager, and mediates object materialization and update propagation. Note that this component does not include recovery-related code.

Layer 1 – System crash resilience

– The LogManager component provides a general-purpose logging facility. The major responsibilities of this component are the implementation of reliable storage and log record organization, as well as housekeeping tasks such as log truncation. To preserve generality, checkpointing and recovery methods are implemented elsewhere, e.g. in the ReliablePersistMgr and in the TransPersistMgr.

– The ReliablePersistMgr, meaning reliable persistence manager, adds system crash reliability to the base PersistenceManager component. To do so, this component uses a LogManager to implement a redo-only recovery method.

Layer 2 – Transaction support

– The ConcurrencyControl component provides an interface that accommodates a large range of concurrency control methods.

– The TransPersistMgr, which stands for transactional persistence manager, allows building transactional resource managers. Transaction support is based on a multi-level recovery approach described in Section 5. This component relies on a ReliablePersistMgr and a LogManager to implement transaction atomicity and durability, and on a ConcurrencyControl component for transaction isolation.

Some components can be used outside of the persistent object managers. This is the case of the CacheManager, StorageManager, LogManager, and ConcurrencyControl components. The remaining components implement persistence management protocols.

The infrastructure can be used to build three families of persistent object managers. First, unreliable managers are built using only components in Layer 0. Second, reliable managers are built with components in Layers 0 and 1; these managers provide a checkpointing-based programming model. Finally, transactional managers (e.g. XA-compliant resource managers) can be built with components from the three layers.
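As an illustration, the following Java fragment sketches how the three families could be assembled from the components of Figure 2. The class names follow the paper, but the constructor signatures and file names are assumptions made for this example, not the actual API of the prototype.

    // Layer 0 only: an unreliable persistent object manager (hypothetical constructors).
    CacheManager cache = new CacheManager();
    StorageManager storage = new StorageManager("objects.db");
    PersistenceManager unreliable = new PersistenceManager(cache, storage);

    // Layers 0 and 1: a reliable manager with redo-only crash recovery.
    LogManager log = new LogManager("redo.log");
    ReliablePersistMgr reliable = new ReliablePersistMgr(unreliable, log);

    // Layers 0, 1 and 2: a transactional resource manager.
    ConcurrencyControl cc = new ConcurrencyControl();
    TransPersistMgr transactional = new TransPersistMgr(reliable, log, cc);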


3. Base components

Layer 0 is at the heart of our platform and provides components to assemble a family of unreliable persistent object managers. Using this base configuration assumes that system crashes do not negatively impact persistent data integrity. Examples of usage of such a configuration are web caching or read-only systems.

This layer is responsible for loading objects from, and eventually writing them to, underlying storage. There are three components in this layer, namely PersistenceManager, CacheManager and StorageManager. The PersistenceManager takes the role of coordinator of object loading and writing. Figure 3 provides more details about the component interfaces.

Figure 3. Components in Layer 0

Operations provided by the PersistenceManager component¹ capture hints on object accesses. Note that read and write operations are expanded to reflect the temporal breadth of accesses. Thus, readIntention() and writeIntention() express the beginning of accesses in read or write mode, respectively. Further, once accesses on objects are finished (e.g. by transaction commit time), readCompletion() and/or writeCompletion() should be declared. We assume that there are no blind writes, that is, every write is preceded by its corresponding read operation.
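A minimal Java rendering of this interface is sketched below; the method names come from Figure 3, while the parameter and return types (a long object identifier and an untyped object) are assumptions made for illustration.

    public interface IPersistenceManager {
        // Declares the beginning of a read access; returns the (possibly faulted-in) object.
        Object readIntention(long oid);

        // Declares the beginning of a write access; undo information may be captured here.
        Object writeIntention(long oid);

        // Declares that a read access is finished (e.g. at transaction commit time).
        void readCompletion(long oid);

        // Declares that a write access is finished; redo information may be captured here.
        void writeCompletion(long oid);
    }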

For ease of explanation, we discuss the sequences of operation calls in two parts. The first one corresponds to loading, i.e. the object faulting design pattern, and the second one to writing, i.e. update management.

1. Hereinafter, when we talk about operations provided by a component, we refer to the operations defined in the corresponding interface. For instance, the operations provided by PersistenceManager are defined in the IPersistenceManager interface.


3.1. Object faulting design pattern

Object faulting is managed as it is traditionally done in DBMSs. Object faulting due to prefetching and eager swizzling is out of the scope of our analysis. Figure 4 shows the design pattern for object faulting.

Figure 4. Typical object faulting call sequence

The sequence starts with a lookup() operation on the CacheManager component. When the requested object is found, it is simply returned after being fixed in the cache with the fix() operation. This sequence can be derived from Figure 4 by eliminating the operations enclosed within the dashed box. If the object is not currently present in the cache, the PersistenceManager requests a copy from storage, through the StorageManager, by executing the load() operation. Finally, the object is added to the cache and fixed, by calling the addToCache() and fix() operations.
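Under the same assumptions as the IPersistenceManager rendering above, this call sequence could be coded inside the PersistenceManager along the following lines; error handling is omitted.

    public Object readIntention(long oid) {
        Object obj = cache.lookup(oid);   // cache hit?
        if (obj == null) {                // object fault: the dashed box in Figure 4
            obj = storage.load(oid);      // materialize a copy from storage
            cache.addToCache(oid, obj);
        }
        cache.fix(oid);                   // pin the object while it is in use
        return obj;
    }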

3.2. Update management

Management of updates on objects consists of two parts. The first concerns providing means to identify updated objects that are present in the cache. The second concerns update propagation policies.

3.2.1. Identifying updated objects

Updated objects, also known as dirty objects, are usually managed within special data structures, such as hash tables or queues [MOH 92]. We have decided to delegate the management of such data structures to a sub-component called DirtyObjectManager (see Figure 5).

DirtyObjectManager provides, among others, the markDirty() operation. This operation is used to notify that a given object has been updated. Its usage is described in detail in the following subsection. Figure 5 shows the DirtyObjectManager component and its interface, and Figure 6 shows how the markDirty() operation is called.
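A minimal sketch of the corresponding interface follows, combining the operations of Figure 5 with the serialize()/deserialize() pair used for checkpointing in Section 4.2; the signatures are assumptions.

    public interface IDirtyObjectManager {
        void markDirty(long oid);        // note that a cached object was updated
        void flush(long oid);            // propagate an object's state to storage if dirty
        byte[] serialize();              // snapshot the dirty object table (checkpointing)
        void deserialize(byte[] image);  // rebuild the table during restart
    }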


3.2.2. Update propagation

In general terms, update propagation can be either immediate or deferred. Implementing an immediate update propagation scheme is straightforward. However, it usually incurs high overheads, because it needs a large number of write operations. A special case of immediate update propagation occurs in the transactional context, referred to as force-at-commit propagation [HÄR 83]. The latter is discussed in Section 5.

Concerning deferred update propagation, the problem is to determine where propagation decisions come from. This is in fact one of the most important sources of component interdependencies. Within Layer 0, only two types of events can be considered to manage update propagation, namely cache eviction and system shutdown events. The latter can be considered as a special case of cache eviction, where every cached object is evicted.

Figure 5 shows how update propagation policies are registered in the platform. A given DirtyObjectManager is associated with a set of update propagation event listeners. Whenever an update propagation event must be notified to the DirtyObjectManager, the corresponding event listener catches it. The status of the object associated with the event is checked. If the object is dirty, its state is flushed to storage with the flush() operation. If the object is clean, its state can simply be discarded.
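A sketch of this listener pattern is given below, following the flow of Figure 6; the listener interface name comes from Figure 5, while the callback signature is an assumption.

    public class CacheEvictionListener implements ICacheEvictionListener {
        private final IDirtyObjectManager dirtyObjects;

        public CacheEvictionListener(IDirtyObjectManager dirtyObjects) {
            this.dirtyObjects = dirtyObjects;
        }

        // Called by the cache when an object is evicted (Figure 6).
        public void handleCacheEviction(long oid) {
            dirtyObjects.flush(oid);  // flushes to storage only if the object is dirty
        }
    }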

Figure 5. Classes for update management

Within unreliable persistent object managers, the only update propagation event listener is the CacheEvictionListener. Figure 6 shows how a cache eviction event triggers an update propagation.


Figure 6. Sequence diagram for update management

3.2.3. More on update propagation hints

In addition to cache eviction events, we can consider the following flushing hints.

Checkpointing – This is an activity that forces writing information to stable storage in order to reduce recovery time and log space. Depending on the checkpointing algorithm, it may be required to flush some or even all updated objects that remain in the cache, and some other system data structures.

Transaction commit – The force-at-commit policy is one possible strategy for update propagation management. It implies that every object modified by a transaction should be flushed at transaction commit time.

Reconnection in mobile environments – In mobile, and in general in weakly connected, environments, a reconnection event can trigger update propagation.

Some applications could explicitly provide hints to drive update propagation. Further, some advanced protocols, such as the distributed group commit presented in [PAR 99], could also be considered.

Recently, Lomet and Tuttle [LOM 95, LOM 99] provided a fairly complete formal framework for understanding the impact of logical logging on cache management, i.e. update propagation. As a result of this formalization, they proposed a set of algorithms to implement a general cache manager. These algorithms include one that calculates a write graph. This write graph is used by another algorithm that schedules cache writes. The latter can be included as an update propagation policy in our infrastructure.

4. Adding system crash resilience

This layer adds a first level of resilience to the platform. System crash resilience is related to the problem of bringing persistent data to a consistent state after a system crash, e.g. system persistence in non-transactional contexts, or transaction durability.


In our platform, system crash resilience is provided by the ReliablePersistMgr, which implements a redo-only recovery mechanism. To this end, this component stores redo information within a log.

Figure 7 shows how components in Layer 1 extend those in Layer 0 to add system crash resilience.

Figure 7. Components in Layers 1 and 0

4.1. Collecting redo information

Supporting deferred update requires storing redo information, i.e. within the log. There are several formats for logging information, classified as physical, logical and physiological logging [GRA 93b]. The selection of the type of log information is left to the component implementor. Redo recovery information is gathered by the ReliablePersistMgr component when the writeCompletion() operation is requested, as shown in Figure 8. Once the corresponding log record is stored, the same writeCompletion() operation is delegated to Layer 0.

Figure 8. Sequence diagram for acquiring redo recovery information
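This interception can be sketched as follows, reusing the assumed IPersistenceManager rendering given earlier; the after-image snapshot and the log-record encoding are assumptions (a physical-logging realization is shown).

    public class ReliablePersistMgr implements IPersistenceManager {
        private final IPersistenceManager base;  // the Layer 0 PersistenceManager
        private final ILogManager log;

        public ReliablePersistMgr(IPersistenceManager base, ILogManager log) {
            this.base = base;
            this.log = log;
        }

        public void writeCompletion(long oid) {
            byte[] afterImage = snapshot(oid);   // assumed helper: physical redo info
            log.writeLogRecord(afterImage);      // store the redo-only record first
            base.writeCompletion(oid);           // then delegate to Layer 0 (Figure 8)
        }

        // The remaining operations simply delegate to Layer 0.
        public Object readIntention(long oid)  { return base.readIntention(oid); }
        public Object writeIntention(long oid) { return base.writeIntention(oid); }
        public void readCompletion(long oid)   { base.readCompletion(oid); }

        private byte[] snapshot(long oid) { /* assumed: serialize the cached object */ return new byte[0]; }
    }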


4.2. Checkpointing

The primary goal of checkpointing is to limit the work to be done during recovery after a system crash. Checkpoints are produced periodically, so that all updates preceding a checkpoint are guaranteed to be stable. As said before, checkpointing might induce update propagation. We have classified checkpointing schemes according to the sequence of activities to be done. Thus, we distinguish the following types of checkpointing:

Sharp or heavyweight – These schemes require stopping all activities, flushing all updated objects currently in the cache to underlying storage, and writing an end-of-checkpoint log record.

Incremental – In these methods, the dirty object table is managed in two or more parts, each one corresponding to a checkpoint interval. Thus, only the dirty objects of the oldest checkpoint interval are to be flushed. Examples of this kind of scheme include the penultimate checkpoint described in [BER 87], and the Oracle incremental checkpoint [JOS 98].

Lightweight – By exploiting the undo and redo recovery information, this class of methods requires storing only the system data structures, e.g. the dirty object table and the active transaction table. ARIES [MOH 92] supports this kind of checkpointing.

As suggested above, checkpointing may require storing the dirty object table in the log. The serialize() operation of the DirtyObjectManager component is used to this end. The deserialize() operation is the inverse of the previous one, and is used in the restart procedure.

A CheckpointListener is added to the dirty object manager, in order to mediate update propagation induced by the checkpointing process. This is shown in Figure 9.

Figure 9. Checkpointing listener in the context of update propagation policies

Figure 10 depicts a generic sequence diagram for the checkpointing process. Note that a subset of calls have numeric labels, which are used in the explanation. Sharp or heavyweight checkpoints incur the execution of steps 2 and 4 of the sequence. The implementation of an incremental checkpoint requires the execution of steps 1, 2, and 4. Finally, a lightweight scheme needs the execution of steps 1, 3, and 4.

Figure 10. Generic sequence diagram for checkpointing

As Layer 1 does not deal with transactions, the sequence diagram shown in Figure 10 does not include a call for storing the active transaction table. However, storing the serialized form of the active transaction table should occur as part of step 3.
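For concreteness, a lightweight checkpoint (steps 1, 3 and 4 of Figure 10) could look as sketched below; the string-tagged log records are an assumption made to keep the example short.

    public void checkpoint() {
        log.writeLogRecord("begin checkpoint".getBytes());  // step 1
        byte[] dirtyTable = dirtyObjects.serialize();       // step 3: system data
        log.writeLogRecord(dirtyTable);                     // structures only
        log.writeLogRecord("end checkpoint".getBytes());    // step 4
    }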

5. Adding transaction support

Layer 2 deals primarily with providing support for the transaction abstraction. Hence, the issues to be addressed are transaction isolation, atomicity and durability. Atomicity and durability are related to recovery.

5.1. Transaction isolation

Transaction isolation is twofold. On the one hand, it implies providing a way to coordinate concurrent accesses. On the other hand, it concerns ensuring object consistency when the system maintains multiple versions of a single object. This situation is found when an optimistic concurrency control method is used, or in distributed contexts, e.g. client-side caching.

In our infrastructure, we have delegated the implementation of transaction isolation to the ConcurrencyControl component. Figure 11 shows the definition of this component.

The IConcurrencyControl interface has been derived from the visibility framework provided in [BLA 98]. The operations included in this interface enable the integration of a large set of concurrency control methods, as well as cache consistency methods.


5.2. Transaction recovery

Atomicity and durability are ensured by underlying recovery methods, and they are commonly addressed together. However, we need to separate them to achieve componentization. To this end, we adopted a layered architecture approach, based on work about multi-level recovery [WEI 90, LOM 92, WEI 93].

As a layered system, durability is delegated to Layer 1, and atomicity to Layer 2. To do so, Layer 2 uses a high-level compensation approach. That means that Layer 2 stores undo information, and when needed, such information is used to determine compensation operations.

Based on this framework, Layer 2 can add support for transactions. This consists in providing the operations related to transaction management, delegating isolation to the corresponding component, handling undo information, and exploiting the durability guarantees provided by the components in Layers 0 and 1. Figure 11 shows how TransPersistMgr, standing for transactional persistence manager, is related to the other components of the infrastructure.

Figure 11. Structure of components in all layers

Figure 12 shows the sequence of operations that characterizes transactional processing². Firstly, a beginTransaction() operation is called, prompting a beginTx log record to be written. For every object to be accessed, writeIntention() should be called before any update can take place. This operation records the undo recovery information. That corresponds to the actions taken at Layer 2 to record undo information.

2. The sequence suggests that redo and undo information is stored by a single LogManager. With this assumption, the resulting management regime is similar to that of ARIES [MOH 92]. It is also possible to use two logs, but their management becomes more complex, as noted in [WEI 90].


The redo information is handled by Layer 1, as described in Section 4. Note that the actions of Layer 1 are accomplished when writeCompletion() operations are requested. In the sequence diagram, this is done at commit time, i.e. when the commitTransaction() operation is called. Support for partial savepoints can be implemented in a similar way.

Figure 12. Typical call sequence for transactional access
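The sequence of Figure 12 could be coded along the following lines; the transaction-identifier parameter, the record-encoding helpers, and the objectsWrittenBy() bookkeeping are all assumptions made for illustration.

    public void beginTransaction(long txId) {
        log.writeLogRecord(beginTxRecord(txId));        // beginTx record
    }

    public Object writeIntention(long txId, long oid) {
        Object obj = reliable.writeIntention(oid);      // delegate to Layer 1
        log.writeLogRecord(undoRecord(txId, oid, obj)); // undo info kept by Layer 2
        return obj;
    }

    public void commitTransaction(long txId) {
        for (long oid : objectsWrittenBy(txId)) {
            reliable.writeCompletion(oid);              // Layer 1 logs redo info here
        }
        log.writeLogRecord(commitTxRecord(txId));       // commitTx record
    }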

During restart after a system crash, redo information is sent to Layer 1, to be processed by the ReliablePersistMgr component. Similarly, undo information is sent to Layer 2, to the TransPersistMgr component. The restart procedure is based on the repeating-history paradigm proposed in ARIES [MOH 92], and generalized by the multi-level recovery methods [WEI 90, LOM 92, WEI 93]. The recovery method for our layered architecture is described in more detail in [Gar ].

Additionally, support for advanced transaction models can be provided. This can be done by implementing the algorithms proposed in [LOM 92] in the context of multi-level recovery.

5.3. Force-at-commit update propagation policy

As stated in Section 3, the transaction commit event can be taken as an update propagation policy. This corresponds to a force-at-commit cache management policy [HÄR 83]. However, implementing a force-at-commit policy implies that the underlying storage supports atomic propagation of an arbitrary set of objects.

Examples of atomic-capable storage techniques include shadow paging and intention lists [HÄR 83]. Note, however, that StorageManager components implemented on top of full-fledged DBMSs can implement atomic propagation by using native transactions. This strategy is implemented in several commercial database middleware products (see for instance [TIM 01]), and is used to guarantee maximal reliability of data, by delegating fault-tolerance management to the underlying DBMS. The trade-offs between reliability and performance are obvious.

A force-at-commit strategy can be added to our infrastructure by registering the corresponding update propagation policy, e.g. a TxCommitListener, with our DirtyObjectManager. This is shown in Figure 13. This component forces the flushing of all objects updated by a given transaction by the time this transaction commits.
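Such a listener could be sketched as follows; the objectsUpdatedBy() bookkeeping operation is an assumed helper of the dirty object manager.

    public class TxCommitListener implements IUpdatePropagationListener {
        private final IDirtyObjectManager dirtyObjects;

        public TxCommitListener(IDirtyObjectManager dirtyObjects) {
            this.dirtyObjects = dirtyObjects;
        }

        // Called at commit time: force-flush every object the transaction updated.
        public void handleTransactionCommit(long txId) {
            for (long oid : dirtyObjects.objectsUpdatedBy(txId)) {  // assumed helper
                dirtyObjects.flush(oid);
            }
        }
    }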

This base strategy can be extended by means of a group commit protocol. The idea is to group a set of transactions, and their corresponding updated objects, and request a single atomic update, i.e. a single native transaction. This can increase performance by reducing the number of message exchanges between the infrastructure and the DBMS.

In the same way, in distributed contexts, the distributed group commit protocol described in [PAR 99] can be integrated by applying the same principles. These update propagation policies are shown in Figure 13.

Figure 13. Some transaction-related update propagation policies

The pattern for update propagation policies was described in Section 3.

6. Implementation

We developed a first prototype of our infrastructure. We chose the Java language particularly to exploit the high portability of its bytecode, because we are interested in using our infrastructure on different hardware and operating system configurations.

Based on the architectural principles, we developed implementations of our components:

– Cache manager – The implementation has been done according to the CacheManager interface and includes several replacement policies, namely LRU, MRU and FIFO.


– Storage manager – One implementation has been done on top of the file management libraries provided by Java. Files are organized in pages, and objects are allocated inside slotted pages [CAR 86]. We have implemented a GiST-based [HEL 95] index framework with support for B+ and R* trees. These indexes are used to maintain system information such as the mapping between logical object identifiers and on-disk physical addresses. Thus, we provide a complete object-based storage manager conforming to the StorageManager component interface. Note that we used the cache management components in the construction of the storage manager.

A second implementation has been done with a subsystem that maps Java objects onto relational databases. To this end, we have used an open source package [OBJ ], which we have masked behind our StorageManager component interface.

– Log manager – As for storage management, the implementation of the log manager was done on top of the file management libraries provided by Java. It also uses the cache management facilities of our infrastructure. The implementation provides reliable stable storage as proposed in [GRA 93b].

– Concurrency control – A lock-based concurrency controller has been implemented as proposed in [GRA 93b]. Thus, this component implements a two-phase locking protocol. Deadlocks are resolved by timeouts (see the sketch after this list).

– Unreliable persistence manager – We implemented the UnreliablePersistMgr component, along with the DirtyObjectManager component. The latter includes two update propagation policies, namely the CacheEvictionListener and the CheckpointListener.

– Reliable persistence manager – This component has been implemented using a physical logging subsystem for storing after-images. These images are used to implement the redo-only recovery method for system crash resilience.

– Transactional persistence manager – We implemented a local transactional resource manager. This component implements the TransPersistMgr specification, and the multi-level recovery method sketched in Section 5. Hence, it is responsible for the undo-based recovery method providing transaction recovery. Additionally, only primitive support for nested transactions is provided, because the lock-based concurrency controller does not support lock delegation.
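The timeout-based deadlock resolution mentioned for the concurrency controller can be illustrated as follows; this is one simple way to realize it on standard java.util.concurrent primitives, not the prototype's actual code.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    class DeadlockSuspectedException extends Exception {}

    public class TimeoutLock {
        private final ReentrantLock lock = new ReentrantLock();

        // Acquire the lock; if the timeout elapses, presume a deadlock so that
        // the caller can abort and roll back the requesting transaction.
        public void acquire(long timeoutMillis) throws DeadlockSuspectedException {
            try {
                if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    throw new DeadlockSuspectedException();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new DeadlockSuspectedException();
            }
        }

        public void release() { lock.unlock(); }
    }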

6.1. A reliable non-transactional configuration

We tested a reliable non-transactional configuration of our infrastructure in the context of the PING project [PIN ]. This project aims at providing a platform for building large-scale, massively multi-user software, such as large-scale simulation software and networked games. We concluded that a transaction-oriented DBMS is ill-suited to this setting, which justifies the use of our infrastructure. We are currently investigating further in this domain [NED 00].


6.2. A transactional configuration

We tested a transactional configuration embedded in a resource-constrained environment, i.e. a handheld computer (Compaq iPAQ). This test anticipates the development of a platform with support for mobile transactions, as suggested in [SER 02].

Finally, we tested a transactional configuration on top of a full-fledged relational DBMS. In this case, we have been able to add support for nested transactions. We will further test our infrastructure in the integration of persistence services for advanced EJB containers [JCP 00].

7. Related work

7.1. Extensible DBMS

By the end of the 80’s, the industrial and academic research communities started efforts to develop what were called extensible database management systems. In the following, we give a general classification of those approaches.

DBMS generators – The idea of generating some parts of the DBMS has been explored by some researchers. Examples of this approach include the Genesis project [BAT 88] and the Volcano query optimizer generator [GRA 93a]. This approach has apparently been abandoned, presumably because of the difficulties due to the intrinsic complexity of DBMS software.

DBMS with an extensible type subsystem – Some researchers studied ways to let users extend the DBMS type system (i.e. ADTs) and the underlying access methods (i.e. indexes). Prototypes such as Postgres [STO 86] at Berkeley, and Starburst [SCH 86] at IBM, are examples of this approach. Note that this approach has effectively evolved, and it is at the origin of the so-called object-relational DBMSs.

DBMS kernel toolkits – This approach has been taken in the Exodus [CAR 86], Shore [CAR 94] and DASDBS [SCH 90] projects. The idea was to build a minimal (core functionality) DBMS kernel along with a set of tools for building ad hoc DBMSs. These projects have been testbeds for many research developments. Note, however, that these prototypes were almost full-fledged DBMSs, so that lighter-weight DBMS implementations could not be envisioned.

7.2. Persistence service standards

Among the proposals for persistence service standards, we can mention three of the most widely accepted: the Object Management Group’s Persistent State Service (PSS 2.0) [OMG 00], the Object Data Management Group standard (ODMG 3.0) [CAT 00], and Java Data Objects (JDO) [JCP 01a] from the Java Community Process. All of them attempt to provide a standard approach for integrating existing persistence technology (e.g. DBMSs) by using a single homogeneous set of interfaces. Effectively, they intend to provide a framework for building persistent applications independently of the underlying support, thus promoting multi-vendor interoperability to some degree.

The three proposals provide a transaction-based processing mode; PSS additionally provides a non-transactional approach. The proposals promote high-level APIs (Application Programming Interfaces) hiding implementation details. JDO goes further by revealing some system internals. Among those details, there are notably APIs providing hints for cache replacement, hints for swizzling/object faulting scopes, the selection of concurrency control strategies (e.g. locking-based or optimistic approaches), etc.

The OMG has taken a service-oriented approach from the beginning. Hence, they proposed the following persistence-related services [OMG ]: transactions, querying, and concurrency control. However, although PSS specifies the interaction with the transaction service, nothing is said about interactions with the querying and concurrency control services.

The Java Community Process has started defining a generic cache service [JCP 01b]. Note that the definition process is led by Oracle, so that a DBMS-like approach can be expected. However, there is no up-to-date information about the interaction between JDO and the cited cache service.

7.3. Industrial ventures

OLE DB [BLA 96] is one of the industrial products that have explored externalizing DBMS-like services. OLE DB components can be classified as tabular data providers and service providers. Data providers are components that wrap data repositories in the form of relational tables. Thus, text files, spreadsheets, ISAMs, and even full-fledged DBMSs can be exposed to OLE DB clients with a single, unified interface. Furthermore, a service provider is an OLE DB component that does not own data, but provides some service. Examples of OLE DB service provider components include query engines and transaction monitors.

Another commercial product proposing an approach similar to ours is Poet’s FastObjects j2 [SOF 01] (aka Navajo). FastObjects j2 is in fact an embedded DBMS intended to be used in resource-constrained mobile devices (e.g. PDAs). This DBMS has a highly modular architecture, so that it is possible to add/remove some components, e.g. the XML import/export, logging and synchronization modules.


7.4. Our position

Earlier works have kept in mind the goal of building more and more complex DBMSs. We argue, however, that the inverse process has to be carried out: breaking down DBMSs into simpler but meaningful software components. Our vision is close to the RISC-style data management blocks described in [CHA 00]. Thus, instead of talking about an extensible DBMS kernel, we go further by proposing fine-grain software components for mechanisms such as caching, logging and recovery.

Our approach can also be compared to that of the JUPITER file manager inside Genesis [BAT 88]. JUPITER was designed to accommodate several implementations of the same system concern. However, its design relies on the programming constructs available at the time the Genesis project was active, e.g. conditional compilation.

We have tracked the development of standards for persistence and other related services. We have witnessed, unfortunately, the lack of integration between those services. We are convinced that understanding the subtle dependencies inside the DBMS machinery can be useful to define well-integrated services. In addition, we claim that specifying services at a lower level could allow optimizing resource sharing between services. For instance, a persistence service that specifies a caching sub-service could allow different vendor implementations cohabiting at runtime to share a single instance of that sub-service. The same could be said for other low-level services.

The OLE DB approach provides cursor-based navigation through tabular data. Hence, the central problem of that work was homogeneous access to data sources other than DBMSs, and query services. Our work is complementary to that one, because we propose lower-level service components. We can envision building OLE DB service components which use our caching, logging, concurrency control, and other components.

8. Conclusions and ongoing work

The continuous evolution of requirements in data-intensive applications imposes new challenges on the research and industrial communities. DBMS architectures should be revisited, and we claim that a component-based approach should be taken.

We have presented the first results of our work on unbundling DBMSs. Our efforts have resulted in the formulation of architectural principles, component definitions, and the characterization of component interactions. We focused on the isolation of persistence-related services from the rest of the functions found in full-fledged DBMSs. Based on our conceptual principles, we have completed an initial prototype. The usability of this prototype has been tested in several environments. It has been tested in networked gaming software, characterized by tight real-time constraints, and where transactions are ill-suited. We have carried out some preliminary tests in resource-constrained environments, and in database middleware. We are also investigating how to integrate persistence services for EJB containers. Finally, we are developing benchmarks to compare our platform with existing monolithic software.

This work is only a single research thread of the NODS project. Thus, we are working on the characterization of the requirements of emerging application domains, such as mobile systems and large-scale computer-supported collaborative work. In addition, we are working on the analysis and characterization of other data management system concerns, including replication management, fault tolerance, query processing, transaction coordination, and workflow management. Finally, we are interested in issues related to component assembling techniques.

9. Acknowledgements

We would like to thank all the members of the NODS team, and particularly Tanguy Nedelec, Stéphane Drapeau, Claudia Roncancio, Elizabeth Pérez, Patricia Serrano, and Adnane Benaissa. We also thank Pascal Déchamboux and Alexandre Lefevre at France Telecom R&D.

10. References

[BAT 88] BATORY D., BARNETT J., GARZA J., SMITH K., TSUKUDA K., TWICHELL B., WISE T., “GENESIS: An Extensible Database Management System”, IEEE Transactions on Software Engineering, vol. 14, num. 11, 1988.

[BER 87] BERNSTEIN P. A., HADZILACOS V., GOODMAN N., Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

[BLA 96] BLAKELEY J. A., “OLE DB: A Component DBMS Architecture”, Proceedings of the Twelfth International Conference on Data Engineering, 1996.

[BLA 98] BLACKBURN S., STANTON R., “The Transactional Object Cache: A Foundation for High Performance Persistent System Construction”, Proceedings of the 8th International Workshop on Persistent Object Systems, 1998.

[CAR 86] CAREY M. J., DEWITT D. J., FRANK D., GRAEFE G., MURALIKRISHNA M., RICHARDSON J. E., SHEKITA E. J., “The Architecture of the EXODUS Extensible DBMS”, Proceedings of the International Workshop on Object-Oriented Database Systems, 1986.

[CAR 94] CAREY M. J., DEWITT D. J., FRANKLIN M. J., HALL N. E., MCAULIFFE M. L., NAUGHTON J. F., SCHUH D. T., SOLOMON M. H., TAN C. K., TSATALOS O. G., WHITE S. J., ZWILLING M. J., “Shoring Up Persistent Applications”, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, 1994.

[CAT 00] CATTELL R., BARRY D. K., The Object Data Standard: ODMG 3.0, Morgan Kaufmann, 2000.

[CHA 00] CHAUDHURI S., WEIKUM G., “Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System”, Proceedings of 26th International Conference on Very Large Data Bases, 2000.


[COL 00] COLLET C., “The NODS project: Networked Open Database Services”, ECOOP Symposium on Objects and Databases, 2000.

[Gar ] GARCÍA-BAÑUELOS L., DUONG P.-Q., COLLET C., “Separating System Crash and Transaction Recovery in a Component-based Infrastructure for Persistent Object Management”, in preparation.

[Gar 01] GARCÍA-BAÑUELOS L., COLLET C., “Towards an Adaptable Persistence Service: The NODS approach”, presented at the TOOLS Europe’2001 Workshop on Object-Oriented Databases, March 2001.

[Gar 02] GARCÍA-BAÑUELOS L., “An Adaptable Infrastructure for Customized Persistent Object Management”, Proceedings of the EDBT Ph.D. Workshop, 2002.

[GEP 98] GEPPERT A., DITTRICH K. R., “Bundling: Towards a New Construction Paradigm for Persistent Systems”, Networking and Information Systems, vol. 1, num. 1, 1998.

[GRA 93a] GRAEFE G., MCKENNA W. J., “The Volcano Optimizer Generator: Extensibility and Efficient Search”, Proceedings of the Ninth International Conference on Data Engineering, 1993.

[GRA 93b] GRAY J., REUTER A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993.

[HAM 99] HAMILTON J., “Networked Data Management Design Points”, Proceedings of 25th International Conference on Very Large Data Bases, 1999.

[HÄR 83] HÄRDER T., REUTER A., “Principles of Transaction-Oriented Database Recovery”, ACM Computing Surveys, vol. 15, num. 4, 1983.

[HEL 95] HELLERSTEIN J. M., NAUGHTON J. F., PFEFFER A., “Generalized Search Trees for Database Systems”, Proceedings of 21st International Conference on Very Large Data Bases, 1995.

[JCP 00] JCP, “Enterprise JavaBeans Specification 2.0”, Java Community Process, Proposed Final Draft JSR 000019, October 2000.

[JCP 01a] JCP, “Java Data Objects Specification 1.0”, Java Community Process, Java Data Objects Expert Group, Proposed Final Draft JSR 000012, May 2001.

[JCP 01b] JCP, “JCACHE – Java Temporary Caching API”, Java Community Process, Java Specification Request 107, March 2001.

[JOS 98] JOSHI A., BRIDGE W., LOAIZA J., LAHIRI T., “Checkpointing in Oracle”, Proceedings of 24th International Conference on Very Large Data Bases, 1998.

[LOM 92] LOMET D. B., “MLR: A Recovery Method for Multi-level Systems”, Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, 1992.

[LOM 95] LOMET D. B., TUTTLE M. R., “Redo Recovery after System Crashes”, Proceedings of 21st International Conference on Very Large Data Bases, 1995.

[LOM 99] LOMET D. B., TUTTLE M. R., “Logical Logging to Extend Recovery to New Domains”, Proceedings ACM SIGMOD International Conference on Management of Data, 1999.

[MOH 92] MOHAN C., HADERLE D. J., LINDSAY B. G., PIRAHESH H., SCHWARZ P., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging”, ACM Transactions on Database Systems, vol. 17, num. 1, 1992.


[NED 00] NEDELEC T., RONCANCIO C., PÉREZ-CORTÉS E., GERODOLLE A., “Issues in the Design of Large-Scale Shared Networked Worlds”, Proceedings of the Sixth International Workshop on Groupware, 2000.

[OBJ ] OBJECTRELATIONALBRIDGE, “ObJectRelationalBridge project”.

[OMG ] OMG, “CORBAServices”, Object Management Group.

[OMG 00] OMG, “Persistent State Service Specification”, Object Management Group, January 2000.

[PAR 99] PARK T., YEOM H. Y., “A Consistent Group Commit Protocol for Distributed Database Systems”, Proceedings of the PDCS’99 International Conference on Parallel and Distributed Computing Systems, 1999.

[PIN ] PING TEAM, “Platform for Interactive Networked Games”.

[SCH 86] SCHWARZ P. M., CHANG W., FREYTAG J. C., LOHMAN G. M., MCPHERSON J., MOHAN C., PIRAHESH H., “Extensibility in the Starburst Database System”, Proceedings of the International Workshop on Object-Oriented Database Systems, 1986.

[SCH 90] SCHEK H.-J., PAUL H.-B., SCHOLL M. H., WEIKUM G., “The DASDBS Project: Objectives, Experiences, and Future Prospects”, IEEE Transactions on Knowledge and Data Engineering, vol. 2, num. 1, 1990.

[SER 02] SERRANO-ALVARADO P., “Defining an Adaptable Mobile Transaction Service”, Proceedings of the EDBT Ph.D. Workshop, 2002.

[SHA 01] SHAH M. A., MADDEN S. R., FRANKLIN M. J., HELLERSTEIN J. M., “Java Support for Data-Intensive Systems: Experiences Building the Telegraph Dataflow System”, ACM SIGMOD Record, vol. 30, num. 4, 2001.

[SIL 97] SILBERSCHATZ A., ZDONIK S. B., “Database Systems – Breaking Out of the Box”, SIGMOD Record, vol. 26, num. 3, 1997.

[SOF 01] SOFTWARE P., “FastObjects j2”, White paper, 2001.

[STO 86] STONEBRAKER M., ROWE L. A., “The Design of Postgres”, Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, 1986.

[TIM 01] TIMESTEN, “TimesTen Front-Tier 2.3: Dynamic data cache for Oracle 8i”, 2001.

[VAS 94] VASKEVITCH D., “Database in Crisis and Transition: A Technical Agenda for the Year 2001”, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, 1994.

[WEI 90] WEIKUM G., HASSE C., BRÖSSLER P., MUTH P., “Multi-Level Recovery”, Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1990.

[WEI 93] WEIKUM G., HASSE C., “Multi-Level Transaction Management for Complex Objects: Implementation, Performance, Parallelism”, VLDB Journal, vol. 2, num. 4, 1993.


Parallel Processing with Autonomous Databases in a Cluster System

Stéphane Gançarski¹, Hubert Naacke¹, Esther Pacitti², Patrick Valduriez¹

1 LIP6, University Paris 6, 8 rue du Cap. Scott, 75015 Paris, [email protected]

2 Institut de Recherche en Informatique de Nantes, [email protected]

ABSTRACT. We consider the use of a cluster system for Application Service Providers (ASP). In the ASP context, hosted applications and databases can be update-intensive and must remain autonomous. In this paper, we propose a new solution for parallel processing with autonomous databases, using a replicated database organization. The main idea is to allow the system administrator to control the tradeoff between database consistency and application performance. Application requirements are captured through execution rules stored in a shared directory. They are used (at run time) to allocate cluster nodes to user requests in a way that optimizes load balancing while satisfying application consistency requirements. We also propose a new preventive replication method and a transaction load balancing architecture which can trade off consistency for performance using execution rules. Finally, we discuss the on-going implementation at LIP6 using a Linux cluster running Oracle 8i.

KEYWORDS: database, cluster architecture, transaction processing, load balancing, replication, consistency


1. Introduction

Clusters of PC servers now provide a cheap alternative to tightly-coupled multiprocessors such as Symmetric Multiprocessors (SMP) or Non-Uniform Memory Architectures (NUMA). They make new businesses like Application Service Providers (ASP) economically viable. In the ASP model, customers’ applications and databases (including data and DBMS) are hosted at the provider site and need to be available, typically through the Internet, as efficiently as if they were local to the customer site. Thus, the challenge for a provider is to fully exploit the cluster’s parallelism and load balancing capabilities to obtain a good cost/performance ratio. The typical solution to obtain good load balancing in cluster architectures is to replicate applications and data at different nodes so that users can be served by any of the nodes depending on the current load. This also provides high availability since, in the event of a node failure, other nodes can still do the work. This solution has been used successfully by Web sites such as search engines using high-volume server farms (e.g., Google). However, Web sites are typically read-intensive, which makes it easier to exploit parallelism.

In the ASP context, the problem is far more difficult. First, applications can be update-intensive. Second, applications and databases must remain autonomous so they can be subject to definition changes to accommodate customer requirements. Replicating databases at several nodes, so they can be accessed by different users through the same or different applications in parallel, can create consistency problems [14], [9]. For instance, two users at different nodes could generate conflicting updates to the same data, thereby producing an inconsistent database. This is because consistency control is done at each node through its local DBMS. There are two main solutions readily available to enforce global consistency. One is to use a transaction processing monitor to control access to replicated data. However, this requires significant rewriting of the applications and may hurt transaction throughput. A more efficient solution is to use a parallel DBMS such as Oracle Real Application Clusters or DB2 Parallel Edition. Parallel DBMSs typically provide a shared-disk abstraction to the applications [20] so that parallelism can be automatically inferred. But this requires a heavy migration to the parallel DBMS and hurts database autonomy.

Ideally, applications and databases should remain unchanged when moved to the provider site’s cluster. In this paper, we propose a new solution for load balancing of autonomous applications and databases which addresses this requirement. This work is done in the context of the Leg@Net project¹, sponsored by the RNTL, between LIP6, Prologue Software and ASPLine, whose objective is to demonstrate the viability of the ASP model for pharmacy applications in France. Our solution exploits a replicated database organization.

1 see www.industrie.gouv.fr/rntl/AAP2001/Fiches_Resume/[email protected]


The main idea is to allow the system administrator to control the database consistency/performance tradeoff when placing applications and databases onto cluster nodes. Databases and applications can be replicated at multiple nodes to obtain good load balancing. Application requirements are captured (at compile time) through execution rules stored in a shared directory, used (at run time) to allocate cluster nodes to user requests. Depending on the users’ requirements, we can control database consistency at the cluster level. For instance, if an application is read-only or the required consistency is weak, then it is easy to execute multiple requests in parallel at different nodes. If, instead, an application is update-intensive and requires strong consistency (e.g. integrity constraint satisfaction), then an extreme solution is to run it at a single node and trade performance for consistency. Or, if we want both consistency and replication (e.g. for high availability), another extreme solution is synchronous replication with two-phase commit (2PC) [9] for refreshing replicas. However, 2PC is costly in terms of messages and is blocking (in case of coordinator failure, the participants cannot terminate the transaction independently).

There are cases where copy consistency can be relaxed. With optimistic replication [12], transactions are committed locally, and different replicas may get different values. Replica divergence remains until reconciliation. Meanwhile, the divergence must be controlled, for at least two reasons. First, since synchronization consists in producing a single history from several diverging ones, the higher the divergence, the more difficult the reconciliation. The second reason is that read-only applications do not always require reading perfectly consistent data and may tolerate some inconsistency. In this case, inconsistency reflects a divergence between the value actually read and the value that should have been read in ACID mode. Non-isolated queries are also useful in non-replicated environments (e.g. the ANSI isolation levels and their critique [2]). The specification of inconsistency for queries has been widely studied in the literature, and may be divided into two dimensions, temporal and spatial [18]. An example of the temporal dimension is found in quasi-copies [1], where a cached (image) copy may be read-accessed according to temporal conditions, such as an allowable delay between the last update of the copy and the last update of the master copy. The spatial dimension consists of allowing a given "quantity of changes" between the values read-accessed and the effective values stored at the same time. This quantity of changes, referred to as import-limit in epsilon transactions [23], may be for instance the number of data items changed, the number of updates performed, or the absolute value of the update. In the continuous consistency model [24], both the temporal dimension (staleness) and the spatial dimension (numerical error and order error) are controlled. Each node propagates its writes by either pull or push access to other nodes, so that each node maintains a predefined level of consistency for each dimension. Then each query can be sent to a node having a satisfying level of consistency (w.r.t. the query) in order to optimize load balancing.
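To make the spatial dimension concrete, the Java sketch below routes a query to any replica whose accumulated "quantity of changes" stays under the query's import-limit, in the spirit of epsilon transactions; the class names and the divergence metric (a count of pending updates) are assumptions made for illustration.

    import java.util.List;

    public class ReplicaRouter {
        // Number of updates applied at the master but not yet at this replica.
        public interface Replica { int pendingUpdates(); }

        // Returns a replica fresh enough for the query, or null if none qualifies
        // (the caller would then fall back to the master copy).
        public Replica route(List<Replica> replicas, int importLimit) {
            for (Replica r : replicas) {
                if (r.pendingUpdates() <= importLimit) {
                    return r;  // consistent enough w.r.t. the query's requirement
                }
            }
            return null;
        }
    }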

In this paper, we capitalize on the work on relaxing database consistency for higher performance and apply it in the context of cluster systems. We make the following contributions:

- a replicated database architecture for cluster systems that does not hurt application and database autonomy, using non-intrusive database techniques, i.e. techniques that work independently of any DBMS;

- a new preventive replication method that provides strong consistency without the overhead of synchronous replication, by exploiting the cluster's high-speed network;

- a transaction load balancing architecture which can trade off consistency for performance using optimistic replication and execution rules;

- a conflict manager architecture which exploits the database logs and execution rules to perform replica reconciliation among heterogeneous databases.

This paper is organized as follows. Section 2 introduces our cluster system architecture with database replication. Section 3 presents our replication model with both preventive and optimistic replication. Section 4 describes the way we can capture and exploit execution rules about applications. Section 5 describes our execution model which uses these rules to perform load balancing and manage global consistency. Section 6 briefly describes our on-going implementation. Section 7 compares our approach with related work. Section 8 concludes.

2. Cluster architecture

In this section, we introduce the architecture for processing user requests coming, for instance, from the Internet, into our cluster system and discuss our solution for placing applications, DBMS and databases in the system.

The general processing of a user request is as follows. First, the request is authenticated and authorized using a directory which captures information about users and applications. The directory is also used to route requests to nodes. If successful, the user gets a connection to the application (possibly after instantiation) at some node which can then connect to a DBMS at some, possibly different, node and issue queries for retrieving and updating database data.

We consider a cluster system with similar nodes, each having one or more processors, main memory (RAM) and disk. Similar to multiprocessors, various cluster system architectures are possible: shared-disk, shared-cache and shared-nothing [10]. Shared-disk and shared-cache require a special interconnect that provides a shared space to all nodes, with provision for cache coherence using either hardware or software. Using shared-disk or shared-cache requires a specific DBMS implementation like Oracle Real Application Clusters or DB2 Parallel Edition. Shared-nothing is the only architecture that supports our autonomy requirements without the additional cost of a special interconnect. Thus, we strive to exploit a shared-nothing architecture.

There are various ways to organize the applications, DBMS and databases in our shared-nothing cluster system. We assume applications typically written in a programming language like C, C++ or Java, making DBMS calls to stored procedures using a standard interface like ODBC or JDBC. Stored procedures are in SQL, PSM (SQL3's Persistent Stored Modules) or any proprietary language like Oracle's PL/SQL or Microsoft's TSQL. In [4], we presented and discussed three main organizations to obtain parallelism. The first one is client-server DBMS connection, whereby a client application at one node connects to a remote DBMS at another node (where the same application can also run). The second organization is peer-to-peer DBMS connection, whereby a client application at one node connects to a local DBMS which transparently accesses the same DBMS at another node using a distributed database capability. The third organization is replicated database, whereby a database and its DBMS are replicated across several nodes. These three organizations are interesting alternatives which can be combined to better control the consistency/performance trade-off of various applications and optimize load balancing. For instance, an application at one node could do client-server connection to one or more replicated databases, the choice of the replicated database being made depending on the load.

In this paper, we focus on the replicated database organization, which is the most general as it provides for both application and database access parallelism. We use multi-master replication [14] whereby each (master) node can perform updates to the replica it holds. However, conflicting updates to the database from two different nodes can lead to consistency problems (e.g. the same data get different values in different replicas). The classical solution to this problem is optimistic and based on conflict detection and resolution. However, there is also a preventive solution, which we propose, that avoids conflicts at the expense of a forced waiting time for transactions. Thus, we support both replication schemes to provide a continuum from strong consistency with preventive replication to weaker consistency with optimistic replication.

Based on these choices, we propose the cluster system architecture in Figure 1, which does not hurt application and database autonomy. Applications, databases and DBMS are replicated at different nodes without any change by the cluster administrator. Besides the directory, we add four new modules which can be implemented at any node. The application load balancer simply routes user requests to application nodes using a traditional load balancing algorithm. The transaction load balancer intercepts DBMS procedure calls (in ODBC or JDBC) from the applications and generates a transaction execution plan (TEP), based on application and user consistency requirements obtained from the directory. For instance, it decides on the use of preventive or optimistic replication for a transaction. Finally, it triggers transaction execution (to execute stored procedures) at the best nodes, using run-time information on the nodes' load. The preventive replication manager orders transactions at each node in a way that prevents conflicts and generates refresh transactions to update replicas. The conflict manager periodically detects conflicts introduced on replicas by transactions run in optimistic mode, using the DBMS logs, and solves them using information in the directory. At each node, the local DBMS ensures local serialization of the transactions it executes, including refresh transactions and solving transactions. Global consistency is ensured at the level required by the application, either by the preventive replication manager or the conflict manager.

[Figure 1 shows the cluster receiving requests from the Internet: an application load balancer in front of replicated applications (app1, app2, ..., appn), a transaction load balancer with its directory, a preventive replication manager, a conflicts manager, and several DB/DBMS nodes.]

Figure 1: Cluster system architecture

3. Replication Model

In our context, replication of data at different cluster nodes is a major way to increase parallelism. However, updates to replicated data need to be propagated efficiently to all other copies. A general solution widely used in database systems is lazy replication. In this section, we discuss the value of lazy replication in cluster systems and propose a new multi-master lazy replication scheme with conflict prevention, together with its architecture. Our scheme can also reduce to the classical multi-master replication scheme with conflict resolution.

3.1. Lazy replication

With lazy replication, a transaction can commit after updating a replica at some node. After the transaction commits, the updates are propagated towards the other replicas, which are then updated in separate transactions. Unlike synchronous replication (with 2 phase commit), updating transactions need not wait for mutual copy consistency to be enforced. Thus lazy replication does not block and scales up much better than the synchronous approach. This performance advantage has made lazy replication widely accepted in practice, e.g. in data warehousing and collaborative applications on the Web [12].

Following [13], we characterize a lazy replication scheme using: ownership, configuration, transaction model, propagation and refreshment. The ownership parameter defines the permissions for updating replicas. If a replica R is updateable, it is called a primary copy; otherwise it is called a secondary copy, noted r. A node M is said to be a master node if it only stores primary copies. A node S is said to be a slave node if it only stores secondary copies. In addition, if a replica R is updateable by several master nodes, then it is said to be a multi-owner copy. A node MO is said to be a multi-owner master node if it stores only multi-owner copies. For cluster computing we only consider master, slave and multi-owner master nodes. A master node M or a multi-owner node MO is said to be a master of a slave node S iff there exists a secondary copy r in S of a primary copy R in M or MO. We also say that S is a slave of M or MO.

The transaction model defines the properties of the transactions that access replicas at each node. Moreover, we assume that, once a transaction is submitted for execution to a local transaction manager at a node, all conflicts are handled by the local concurrency control protocol. In our framework, we fix the properties of the transactions. We focus on four types of transactions that read or write replicas: update transactions, multi-owner transactions, refresh transactions and queries. An update transaction T updates a set of primary copies. A refresh transaction RT is associated with an update transaction T, and is made of the sequence of write operations performed by T, used to refresh secondary copies. We use the term multi-owner transaction, noted MOT, to refer to a transaction that updates a multi-owner copy. Finally, a query Q consists of a sequence of read operations on primary or secondary copies.

The propagation parameter defines when the updates to a primary copy or multi-owner copy R must be multicast towards the slaves of R or all owners of R. The multicast protocol is assumed to be reliable and to preserve the global FIFO order [16]. We focus on deferred update propagation: the sequence of operations of each refresh transaction associated with an update transaction T is multicast to the appropriate nodes within a single message M, after the commitment of T.
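For instance, the deferred propagation step could be sketched as follows in Java, with the reliable FIFO multicast primitive abstracted behind an interface; all names are hypothetical and the group communication layer itself is not shown.

import java.util.List;

// Hypothetical sketch of deferred update propagation: once an update
// transaction T commits, the sequence of its write operations plus its
// timestamp is packaged into one message M and multicast. Reliable FIFO
// multicast is assumed to be supplied by a group communication layer.
public class Propagator {
    public interface FifoMulticast { void multicast(Message m); }

    public record Message(long timestamp, List<String> writeOps) {}

    private final FifoMulticast network;

    public Propagator(FifoMulticast network) { this.network = network; }

    // Called after T's commit has been observed (e.g. by the Log Monitor).
    public void propagate(long timestampOfT, List<String> writesOfT) {
        network.multicast(new Message(timestampOfT, writesOfT)); // one message per transaction
    }
}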

The refreshment parameter defines when a MOT or RT should be triggered and the commit order of these transactions. We consider the deferred triggering mode.

With a deferred-immediate strategy, a refresh transaction RT or multi-owner transaction MOT is submitted for execution as soon as the corresponding message M is received by the node.

3.2. Managing replica consistency

Depending on which nodes are allowed to perform user updates on a replica, several replication configurations can be obtained. The lazy master (or asymmetric) configuration allows only one node, called the master node, to perform user updates on the replica; the other nodes can only perform reads. Figure 2(a) shows an example of a lazy master bowtie configuration in which two nodes store primary copies R and S, with their secondary copies r1, s1 and r2, s2 at the slave nodes. The multi-master (or symmetric) configuration allows all nodes storing a replica to be masters. Figure 2(b) shows an example of a multi-master configuration in which all master nodes store a primary copy of S and R. There are also hybrid configurations.

Different configurations yield different performance/consistency trade-offs. For instance, a lazy master configuration such as bowtie is well suited for read-intensive workloads, because reading secondary copies does not conflict with any update transaction. In addition, since updates are rare, the result of a query on a secondary copy r at time t would be, in most cases, the same as reading the corresponding primary copy R at time t. Thus, the choice of a configuration should be based on knowledge of the transaction workload. For update-intensive workloads, the multi-master configuration seems best, as the load of update transactions can be distributed among several nodes.

Figure 2: Replication configurations

For all configurations, the problem is to manage data consistency. That is, any node that holds a replica should always see the same sequence of updates to this replica. Consistency management for lazy master has been addressed in [13]. The problem is more difficult with multi-master, where independent transactions can update the same replica at different master nodes. A conflict arises whenever two or more transactions update the same object. The main solution used by replication products [19] is to tolerate and resolve conflicts. After the commitment of a transaction, a conflict detection mechanism checks for conflicts, which are resolved by undoing and redoing transactions using a log history. During the time interval between the commitment of a transaction and conflict resolution, users may read and write inconsistent data. This solution is optimistic and works best with few conflicts. However, it may introduce inconsistencies.

We propose an alternative, new solution which prevents conflicts and thus avoids inconsistency. A detailed presentation of the preventive replication scheme and its algorithms is given in [14]. With this preventive solution, each transaction T is associated with a chronological timestamp value, and a delay d is introduced before each transaction submission. This delay corresponds to the maximum amount of time needed to propagate a message between any two nodes. During this delay, all transactions received are ordered following their timestamp values. After the delay has expired, all transactions that must precede T are guaranteed to have been received. Therefore, transactions at each node are executed following the same timestamp order and consistency is assured.
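A minimal Java sketch of this principle follows, assuming a known upper bound D_MILLIS on the propagation delay. The class and method names are hypothetical, and this is a simplification of the algorithm of [14], not a transcription of it.

import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch of preventive ordering: each transaction carries a
// chronological timestamp and waits at least d after arrival, by which time
// any transaction with a smaller timestamp must have arrived.
public class PreventiveScheduler {
    static final long D_MILLIS = 50; // assumed maximum message propagation delay

    static final class StampedTx {
        final long timestamp, arrival;
        final Runnable body;
        StampedTx(long ts, Runnable body) {
            this.timestamp = ts; this.arrival = System.currentTimeMillis(); this.body = body;
        }
    }

    private final PriorityBlockingQueue<StampedTx> pending =
        new PriorityBlockingQueue<>(11, (a, b) -> Long.compare(a.timestamp, b.timestamp));

    public void submit(long timestamp, Runnable body) {
        pending.put(new StampedTx(timestamp, body));
    }

    // Executes transactions in global timestamp order, never earlier than d after arrival.
    public void runLoop() throws InterruptedException {
        while (true) {
            StampedTx tx = pending.take();                 // smallest timestamp first
            long wait = tx.arrival + D_MILLIS - System.currentTimeMillis();
            if (wait > 0) Thread.sleep(wait);              // let in-flight older transactions arrive
            // A full implementation would re-check whether a transaction with a
            // smaller timestamp arrived during the wait before executing tx.
            tx.body.run();
        }
    }
}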

This preventive approach imposes waiting a specific delay d before the execution of multi-owner and refresh transactions. Our cluster computing context is characterized by short-distance, high-performance inter-process communication, where error rates are typically low. Thus, the delay d needed to attain strong consistency is negligible. On the other hand, the optimistic approach avoids the waiting time d but must deal with inconsistency management. However, there are many applications that tolerate reading inconsistent data. Therefore, we decided to support both replication schemes to provide a continuum from strong consistency with preventive replication to weaker consistency with optimistic replication.

3.3. Preventive Replication Manager Architecture

This section presents the system architecture of a master, multi-owner or slave node with conflict prevention. This architecture can be easily adapted to the simpler optimistic approach, with the addition of a conflict manager (see Section 5). To maintain the autonomy of each node, we assume that six components are added to a regular database system in order to support lazy replication (see Figure 3). The Replica Interface manages incoming multi-owner transaction submissions. The Receiver and Propagator implement reception and propagation of messages, respectively. The Refresher implements a refreshment algorithm. The Deliverer manages the submission of multi-owner transactions and refresh transactions to the local transaction manager. The sixth component, the Log Monitor, is described below.

The Log Monitor uses log sniffing to extract the changes to primary copies by continuously reading the content of a local History Log (noted H). The sequence of updates of an update transaction T and its timestamp C are read from H and written to the Input Log, which is used by the Propagator.

Next, multi-owner transactions are submitted through the Replica Interface. The application program calls the Replica Interface, passing as parameter the multi-owner transaction MOT. The Replica Interface then establishes a timestamp value C for MOT. Afterwards, the sequence of operations of MOT is written into the Owner Log, followed by C. Whenever the multi-owner transaction commits, the Deliverer notifies the Replica Interface of the event. After MOT's commitment, the Replica Interface ends its processing and the application program continues with its next execution step.

The Receiver implements message reception. Messages are received and stored in a Reception Log. The Receiver then reads messages from this log and stores each message in an appropriate FIFO pending queue. The contents of the queues form the input to the Refresher. The Propagator continuously reads the contents of the Owner and Input Logs and, for each sequence of updates followed by a timestamp C it reads, it constructs a message M. Messages are multicast through the network interface.

Figure 3: Master, Multi-owner or Slave node Architecture

The Refresher implements the refreshment algorithm. It reads the contents of a set of pending queues. Based on its refreshment parameters, it submits refresh transactions and multi-owner update transactions by inserting them into the running queue. The running queue contains all ordered transactions not yet entirely executed. Finally, the Deliverer submits refresh and multi-owner transactions to the local transaction manager. It reads the contents of the running queue in FIFO order and submits each write operation as part of a transaction to the local transaction manager. Whenever a multi-owner transaction is committed, it notifies the Replica Interface of the event.
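As a rough illustration of the Deliverer's loop, here is a Java sketch in which the local transaction manager and the commit notification to the Replica Interface are reduced to callbacks; all names and types are ours, not the prototype's.

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the Deliverer: it drains the running queue in FIFO
// order and submits each ordered transaction's write sequence to the local
// transaction manager, then signals the commit.
public class Deliverer implements Runnable {
    public interface LocalTxManager { void execute(List<String> writeOps); }

    private final BlockingQueue<List<String>> runningQueue = new LinkedBlockingQueue<>();
    private final LocalTxManager dbms;
    private final Runnable commitNotifier; // e.g. notifies the Replica Interface

    public Deliverer(LocalTxManager dbms, Runnable commitNotifier) {
        this.dbms = dbms;
        this.commitNotifier = commitNotifier;
    }

    // Called by the Refresher once a refresh or multi-owner transaction is ordered.
    public void enqueue(List<String> writeOps) { runningQueue.add(writeOps); }

    public void run() {
        try {
            while (true) {
                List<String> tx = runningQueue.take(); // FIFO order
                dbms.execute(tx);                      // one local transaction
                commitNotifier.run();                  // commit event back to the Replica Interface
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}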

4. Trading consistency for load balancing

The replicated database organization may increase transaction parallelism. For simplicity, we focus on inter-transaction parallelism, whereby transactions updating the same database are dynamically allocated to different master nodes. There are two important decisions to make for an incoming transaction: choosing the node to run it, which depends on the current load of the cluster, and choosing the replication mode (preventive or optimistic), which depends on the degree of consistency desired. In this section, we show how we can capture and use execution rules about applications in order to obtain transaction parallelism, by exploiting the optimistic replication mode.

4.1 Motivating example

To illustrate how we can tolerate inconsistencies, we consider a very simple example adapted from the TPC-C benchmark2. Similar to the pharmacy applications in our Leg@Net project, TPC-C deals with customers that order products whose stock must be controlled by a threshold value. We focus on the table Stock(item, quantity, threshold). The procedure DecreaseStock decreases the stock quantity of item id by q.

-- Oracle PL/SQL rendering of the paper's pseudocode
CREATE OR REPLACE PROCEDURE DecreaseStock(id IN NUMBER, q IN NUMBER) AS
BEGIN
  UPDATE Stock
  SET quantity = quantity - q
  WHERE item = id;
END;

Let us consider a Stock tuple <item=1, quantity=30, threshold=10> replicated at nodes N1 and N2, a transaction T1 at N1 that calls DecreaseStock(1, 15) and a transaction T2 at N2 that calls DecreaseStock(1, 10). If T1 and T2 are executed in parallel in optimistic mode, we get <item=1, quantity=15, threshold=10> at N1 and <item=1, quantity=20, threshold=10> at N2. Thus, the Stock replicas are inconsistent and require reconciliation. After reconciliation, the tuple value will be <item=1, quantity=5, threshold=10>. Now, assume a query Q that checks for stocks to renew:

2 see www.tpc.org/tpcc

SELECT item FROM Stock WHERE quantity < threshold

Executing Q at either node N1 or N2 will not retrieve item 1. However, after reconciliation (see Section 6.2), the final value of item 1 will be <item=1, quantity=5, threshold=10>, so item 1 should have been retrieved. If the application tolerates inconsistencies, it is aware that the results may be incomplete and can either reissue Q after the time necessary to reach the next reconciliation step, producing a correct result, or execute Q immediately at either node N1 or N2, producing results with a bounded inaccuracy. In our example, item 1 would not be selected, which may be acceptable for the user.

Assume now there is an integrity constraint C: (quantity > threshold*0.5) on table Stock. The final result after reconciliation clearly violates the constraint for item 1 (5 is not greater than 10*0.5). However, this violation cannot be detected by either N1 after executing T1 or N2 after executing T2. There are two ways to solve this problem: either prevent T1 and T2 from being executed at different nodes in optimistic mode, or, at reconciliation time, validate one transaction (e.g. the one with highest priority) and compensate, if possible, the other one.

4.2. Execution rules

Application consistency requirements are expressed in terms of execution rules. Examples of execution rules are data independency between transactions, integrity constraints [3], access control rules, etc. They may be stored explicitly by the system administrator or inferred from the DBMS catalogs. They are primarily used by the system administrator to place and replicate data in the cluster, similar to parallel DBMS [12], [20]. They are also used by the system to decide at which nodes and under which conditions a transaction can be executed.

Execution rules are stored in the directory (see Figure 4). They are expressed in a declarative language. Implicit rules refer to data already maintained by the system (e.g. user authorizations). Hence, they include queries sent to the database catalog to retrieve the data. Incoming transactions are managed by the policy manager. It retrieves the execution rules associated with a given transaction and defines a run-time policy for the transaction. The run-time policy controls the execution of the transaction at the required level of consistency. The couple (transaction, run-time policy) is called a transaction policy (TP) and is sent to the transaction router, which in turn computes a cost function to elaborate the transaction execution plan (TEP), which includes the best node among the candidates to perform the transaction with the appropriate mode.

Figure 4: Transaction load balancer architecture
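The following Java sketch shows one possible shape for the objects flowing through Figure 4. The type and field names (Transaction, TransactionPolicy, TransactionExecutionPlan, PolicyManager, TransactionRouter) mirror the terminology of the text, but their exact signatures are our own assumption, not the prototype's API.

import java.util.Map;

// Hypothetical types for the transaction load balancer pipeline of Figure 4.
public interface LoadBalancerPipeline {
    enum Mode { PREVENTIVE, OPTIMISTIC }

    // T = (P, param, user, add-req), as defined in Section 4.3.
    record Transaction(String program, Object[] params, String user, Map<String, String> addReq) {}

    // The couple (transaction, run-time policy) of Section 4.2.
    record TransactionPolicy(Transaction t, Map<String, Object> runTimePolicy) {}

    // TEP = (T, node, mode), the output of the transaction router.
    record TransactionExecutionPlan(Transaction t, String node, Mode mode) {}

    interface PolicyManager {       // consults the directory's execution rules
        TransactionPolicy makePolicy(Transaction t);
    }

    interface TransactionRouter {   // cost-based choice of node and replication mode
        TransactionExecutionPlan route(TransactionPolicy tp);
    }
}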

4.3. Defining execution rules

A transaction is defined by T = (P, param, user, add-req), where:

- P is the application program of which T is an instance, with param as parameters,

- user is the user who sends the transaction, and

- add-req are additional execution requirements for the transaction.

Execution rules use information about the transaction (the four elements above) and the data stored in the database. In order to preserve application autonomy, execution rules cannot be defined at a finer granularity than the program. Such information may already exist in the database catalog. Otherwise, it must be explicitly specified or inferred from other information. Information related to a program includes its type (query or update), its conflict classes [15], i.e. the data the program may read or write, and its relative priority. If P is a query, the required precision of the results must be specified. If P is an update, the directory must capture under which conditions (parameters) T is compensatable, the compensating transaction, whether T should be retried, and under which temporal conditions. For a couple (P, D), where D is a data item in the write conflict class of P, the administrator may specify max-change(P, D), which is defined as follows. If D is an attribute, max-change states how much change (relative or absolute) a transaction T(P) may apply to D. If D is a table, max-change states how many tuples T(P) may update. In our example, we have max-change(DecreaseStock(id, q), Stock.quantity) = q and max-change(DecreaseStock(id, q), Stock) = 1.

This information is used to determine the run-time policy for transaction T and a partially ordered set of candidate nodes at which the transaction may be executed.

As those nodes may be master, slave or multi-owner, the run-time policy may be different for each type of node. The run-time policy is described as follows:

- type(T): denotes whether T is a query or a (multi-owner) transaction.

- priority(T): the absolute priority level of the transaction, computed from the relative priorities of the user and the program, and the relative priority of the transaction itself (included in the add-req parameter).

- compatible(T, T'): for each transaction T', this vector stores whether T and T' are disjoint (resp. commutative). It is computed from compatibility information about the programs, the conflict classes and the effective parameters of the transactions. This information is used by the load balancer to route transactions in non-isolated mode to different nodes. In our example, it is obvious that T1 and T2 are not disjoint but are commutative.

- query-mode(T): if T is a query, the query mode models the spatial and temporal dimensions of the query quality requirements. For instance, a query may tolerate an imprecision of 5% if the answer is delivered within 10 seconds, or of 2% if delivered within one minute, and may choose to abort if the response time is beyond 2 minutes. The query mode may include different spatial dimensions (absolute value, number of updates) and give a trade-off between the temporal and spatial dimensions (e.g. find the best precision within a given time, or answer as soon as possible within a given acceptable precision). In our example, we assume that Q accepts an error of at most 5 items in the results.

- update-mode(T): the update mode models whether a multi-owner transaction may be performed in non-isolated mode on a master copy and compensated if conflict resolution fails, under which temporal conditions and with which compensating transaction. The update mode also models under which conditions a transaction should be automatically retried if aborted. In our example, neither T1 nor T2 is compensatable, since the DecreaseStock procedure corresponds to a real withdrawal of goods from stock.

- IC(T): the set of integrity constraints T is likely to violate. In our example, IC(T1) = IC(T2) = C.

- max-change(T, D): for each data item D, the maximum change the transaction may produce. In our example, we have max-change(T1, Stock.quantity) = 15, max-change(T2, Stock.quantity) = 10, and max-change(T1, Stock) = max-change(T2, Stock) = 1.

In our example, the following TPs would be produced:

(T1, type = trans., priority = null, compatible = (), update-mode = no-compensate, IC = (C), max-change = ((Stock.quantity, 15), (Stock, 1)))

(T2, type = trans., priority = null, compatible = ((T1, commut.)), update-mode = no-compensate, IC = (C), max-change = ((Stock.quantity, 10), (Stock, 1)))

(Q, type = query, priority = null, compatible = ((T1, no-commut.), (T2, no-commut.)), query-mode = ((imprecision = 5 units), (time-bound = no), (priority = time)))

5. Execution model

In this section, we present the execution model for our cluster system. The objective is to improve load balancing based on execution rules. The problem can be stated as follows: given the cluster's state (node loads, running transactions, etc.), the cluster's data placement, and a transaction T with a number of consistency requirements, choose one optimal node and execute T at that node. Choosing the optimal node requires first choosing the replication mode, and then choosing the candidate nodes where T can be executed with that replication mode. This yields a set of TEPs (one TEP per candidate node) among which the best one can be selected based on a cost function. In the rest of this section, we present the algorithm that produces the candidate TEPs and the way the best TEP is selected and executed. Finally, we illustrate the transaction routing process on our running example.

5.1. Algorithm for choosing candidate TEPs

A TEP must specify whether preventive or optimistic replication is to be used. Preventive replication always preserves data consistency but may increase contention as the degree of concurrency increases. On the other hand, optimistic replication performs replica synchronization after transaction commitment, at specified times. If a conflict occurs during synchronization, since the transaction has already been committed, the only solutions are to either compensate the transaction or notify the user or administrator. Thus, optimistic replication is best when T is disjoint from all the transactions that accessed data in optimistic replication mode since the last synchronization point, and when the chance of conflict is low.

The algorithm to choose the candidate TEPs proceeds as follows. The input is a transaction policy consisting of the transaction T, a conflict description (i.e. which transactions commute or not with T), the required precision (Imp value), and an update description (the max-change property). The output is a set of candidate TEPs, each specifying a node N for executing T and how. The algorithm has two steps. First, it finds the candidate TEPs with preventive replication, to prevent the occurrence of non-resolvable conflicts. This involves assessing the probability of conflict between T and all the committed transactions that have accessed data in optimistic mode since the last synchronization point. In case of potential conflict at a node, a TEP with preventive replication is built.

In the case of a non-conflicting transaction T (second step), the algorithm finds the candidate nodes with optimistic replication for the data accessed by T. If the execution of T requires accessing consistent data not available at a node, the algorithm adds the necessary synchronization before processing T, as follows. For each node N and each data item D, it computes the imprecision Imax(D, N) of processing T at N. Imax is the sum of the maximum changes of all transactions t such that (i) the updates of t and T are not disjoint, (ii) t was processed at a node M ≠ N, and (iii) the updates of t are not yet propagated to N. If N can process T with the required consistency (i.e. Imax(D, N) < Imp for all D accessed by T), then a TEP (T, N) is built. Otherwise, the minimal synchronization for N is specified and added to the TEP.
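A literal transcription of this Imax test into Java might look as follows. The PendingTx record and the way conflicts are flagged (disjointFromT) are simplifications we introduce for illustration; they are not the paper's data structures.

import java.util.List;

// Hypothetical sketch of the imprecision test of Section 5.1: Imax(D, N) sums
// the max-change of committed optimistic transactions that (i) conflict with T,
// (ii) ran at a node other than N, and (iii) are not yet propagated to N.
public class ImprecisionCheck {
    public record PendingTx(String node, String data, int maxChange, boolean disjointFromT) {}

    // "unpropagated" is assumed to contain only transactions whose updates
    // have not yet been propagated everywhere, so condition (iii) is implicit.
    public static int imax(String data, String node, List<PendingTx> unpropagated) {
        int sum = 0;
        for (PendingTx t : unpropagated) {
            if (!t.disjointFromT()              // (i) updates of t and T not disjoint
                && !t.node().equals(node)       // (ii) processed at a node M != N
                && t.data().equals(data)) {
                sum += t.maxChange();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // After T1 ran at N1 and was not yet propagated: a query on Stock at N2
        // sees an imprecision of at most 1 tuple, while N1 is fully up to date.
        List<PendingTx> pending = List.of(new PendingTx("N1", "Stock", 1, false));
        System.out.println(imax("Stock", "N2", pending)); // 1
        System.out.println(imax("Stock", "N1", pending)); // 0
    }
}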

5.2. Choice of optimal node and transaction execution

Choosing the optimal node is an optimization problem with the objective of minimizing response time. The cost of a given TEP (T, N, sync) includes the synchronization cost (for propagating all updates to N) and the processing cost of T at N. After an optimal node is selected, the transaction router triggers execution as follows. First, it performs synchronization if necessary. If preventive replication has been selected, it sends the transaction to the receiver module of the replication manager. Otherwise, it sends the transaction directly to the DBMS. By default, the transaction router checks the precision requirements of T before processing it, assuming that they will still hold at the end of T. This is not always true if a concurrent update occurs during the processing of T. To ensure that the precision requirements of T are met at the end of T, a solution is to process T by iteration until the precision requirements are reached. If the precision requirements cannot be reached, T must be aborted; otherwise, T can be committed. Another solution is to forbid the execution of concurrent transactions that would prevent T from reaching its consistency requirements.
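A toy cost function in the same spirit (synchronization cost plus load-weighted processing cost) is sketched below; the weights and the linear load model are invented for illustration and are not taken from the paper.

// Hypothetical cost model for picking the optimal TEP: total cost is the
// synchronization cost (propagating missing updates to N) plus the estimated
// processing cost of T at N given its current load.
public class TepCost {
    public record Candidate(String node, int missingUpdates, double load) {}

    static double cost(Candidate c, double perUpdateSyncCost, double baseProcessingCost) {
        double syncCost = c.missingUpdates() * perUpdateSyncCost;
        double processingCost = baseProcessingCost * (1.0 + c.load()); // load in [0, 1+)
        return syncCost + processingCost;
    }

    public static void main(String[] args) {
        Candidate n1 = new Candidate("N1", 0, 0.8);  // up to date but busy
        Candidate n2 = new Candidate("N2", 2, 0.1);  // stale but idle
        double c1 = cost(n1, 5.0, 10.0);             // 0 + 18 = 18
        double c2 = cost(n2, 5.0, 10.0);             // 10 + 11 = 21
        System.out.println(c1 < c2 ? "route to N1" : "route to N2"); // route to N1
    }
}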

In order to compute the necessary synchronization before processing T at node N, the transaction router maintains, for each node, the list of all transactions to be propagated. The algorithm for synchronization is then the following: (i) if synchronization yields no conflicts, the transaction router executes T on N; (ii) otherwise, replica synchronization is delegated to the conflict manager.

To have detailed information about the current state of replicas, the conflict manager reads the DBMS log (which keeps the history of update operations) of the nodes to synchronize. Then, it resolves the conflicts based on priority, timestamps, or user-defined reconciliation rules. If automatic conflict resolution is not possible, the conflict manager sends a notification alert to the user or the administrator with details about the conflicting operations.

5.3. Example of transaction routing

Let us now illustrate the previous algorithms on the transactions of the example of Section 4.1 and show how TEPs are produced from the TPs sent by the policy manager. We assume that the TPs are received in the order (T1, T2, Q), that the data at nodes N1 and N2 is accessed in optimistic mode, and that no other running transaction conflicts with T1, T2 or Q. We first consider a case where integrity constraint C is not taken into account. Then we show how C influences transaction routing.

Case 1: no integrity constraint

Upon receiving the TP (T1, type = trans., priority = null, compatible = (), update-mode = no-compensate, IC = (C), max-change = ((Stock.quantity, 15), (Stock, 1))), the transaction router does the following:

- computes the set of candidate nodes: N1, N2;

- detects that T1 does not conflict with any running transaction, so the candidate nodes remain N1, N2;

- sends T1 to the least loaded node (say N1) with T1 as synchronization, which means that N1 must send T1 to the other node as a synchronizing transaction;

- infers Imax(Stock, N2) = 1, which means that, until synchronization, a query at N2 may miss the effect of at most one modified tuple.

Upon receiving the TP (T2, type = trans., priority = null, compatible = ((T1, commut.)), update-mode = no-compensate, IC = (C), max-change = ((Stock.quantity, 10), (Stock, 1))), the transaction router does the following:

- computes the set of candidate nodes: N1, N2;

- detects that T2 conflicts with T1 but commutes with it;

- sends T2 to the least loaded node (say N2) with T2 as synchronization; as T1 and T2 commute, the order in which they will be executed at N1 (resp. N2) does not matter;

- infers Imax(Stock, N1) = 1.

Upon receiving the TP (Q, type = query, priority = null, compatible = ((T1, no-commut.), (T2, no-commut.)), query-mode = ((imprecision = 5 units), (time-bound = no), (priority = time))), the transaction router does the following:

- computes the set of candidate nodes: N1, N2;

- detects that Q conflicts with both T1 and T2;

- from the current values of Imax(Stock, N1) and Imax(Stock, N2), it computes that executing Q at either N1 or N2 would yield a result with an imprecision of at most one unit. As the query mode allows an imprecision of up to 5 units, Q is sent to the least loaded node (say N1).

Had the query mode of Q not allowed any imprecision, the router would have waited for the next synchronization of N1 and N2 before sending Q.

Case 2: with integrity constraint

The transaction router detects that both T1 and T2 are likely to violate C and are not compensatable. Sending T1 and T2 to different nodes could lead to a situation where C is violated at neither N1 nor N2, but is violated during synchronization. Since T1 and T2 are not compensatable, this situation is not acceptable and T2 must be sent to the same node as T1. We then have Imax(Stock, N1) = 0 and Imax(Stock, N2) = 2. Upon receiving Q, the transaction router may still choose the least loaded node to execute it. Since priority is given to time in the query mode, the least loaded node is chosen: N2. Had priority been given to precision, N1 would have been selected by the transaction router.

6. Implementation

In this section, we briefly describe our current implementation on a cluster of PCs under Linux. We plan to first experiment with our approach using the Oracle DBMS. However, we use standards like LDAP and JDBC, so the main part of our prototype is independent of the target environment.

6.1 Transaction load balancer

The transaction load balancer is implemented in Java. It acts as a JDBC server for the application, preserving application autonomy through the standard JDBC interface. Inter-process communication between the application and the load balancer uses the RmiJdbc open source software3. To reduce contention, the load balancer takes advantage of the multi-threading capabilities of Java, based on Linux's native threads. For each incoming transaction, the load balancer delegates transaction policy management and transaction routing to a distinct thread. The transaction router sends transactions for execution to the DBMS nodes through the JDBC drivers provided by the DBMS vendors. To reduce latency when executing transactions, the transaction router maintains a pool of JDBC connections to all cluster nodes.

3 see www.objectweb.org/rmijdbc
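The threading and connection-pooling scheme just described might be sketched as follows in Java; the class name, the pool implementation and the JDBC URLs are placeholders (the RmiJdbc front end itself is not shown).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one worker thread per incoming transaction, plus a
// per-node pool of JDBC connections kept open to reduce routing latency.
public class TransactionLoadBalancer {
    private final ExecutorService workers = Executors.newCachedThreadPool();
    private final Map<String, BlockingQueue<Connection>> pools;

    public TransactionLoadBalancer(Map<String, String> nodeUrls, int poolSize) throws SQLException {
        pools = new java.util.HashMap<>();
        for (var e : nodeUrls.entrySet()) {
            BlockingQueue<Connection> pool = new ArrayBlockingQueue<>(poolSize);
            for (int i = 0; i < poolSize; i++)
                pool.add(DriverManager.getConnection(e.getValue())); // pre-opened connections
            pools.put(e.getKey(), pool);
        }
    }

    // Each transaction gets its own thread for policy management and routing.
    public void submit(String node, String storedProcCall) {
        workers.submit(() -> {
            BlockingQueue<Connection> pool = pools.get(node);
            try {
                Connection c = pool.take();                  // borrow a pooled connection
                try (var stmt = c.prepareCall(storedProcCall)) {
                    stmt.execute();                          // e.g. "{call DecreaseStock(1, 15)}"
                } finally {
                    pool.put(c);                             // return the connection to the pool
                }
            } catch (Exception ex) {
                ex.printStackTrace();                        // user/administrator notification would go here
            }
        });
    }
}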

6.2 Conflict manager

The conflict manager is composed of three modules:

- Log analyzer: reads the DBMS log to capture the updates made by the local transactions;

- Conflict solver: analyzes the updates made on each node in order to detect and solve conflicts;

- Synchronizer: manages the synchronization processes.

At each node, the log analyzer runs as an independent process and reads the log to detect the updates performed by local transactions. Those updates are sent to the conflict solver through timestamped messages. The Oracle LogMiner tool can be used to implement this module, and the messages can be managed with Oracle's Advanced Queuing facility. To illustrate log analysis on our running example, the Stock update made by transaction T1 is detected by the following query on the LogMiner schema:

SELECT scn, sql_redo, sql_undo

FROM v$logmnr_contents

WHERE seg_name='STOCK';

The conflict solver receives messages from the log analyzer and analyzes them to detect conflicting writes (e.g. update/update or update/delete) on the same data. It then computes how conflicts can be solved. In the best case, the conflict is solved by propagating updates from one node to the other ones. In the worst case, the conflict is solved by choosing a transaction (the one with lower priority) to be compensated. In both cases, synchronizing transactions (propagation or compensation) are sent to the corresponding nodes.
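A simplified Java sketch of this detection step follows. LoggedWrite and the priority-based choice of the losing transaction are our own modelling of what the text describes; compensation itself is not shown.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of conflict detection: updates extracted from the logs
// arrive as timestamped messages; two writes to the same data item coming
// from different nodes are flagged as a conflict to resolve.
public class ConflictSolver {
    public record LoggedWrite(String node, String dataId, long timestamp, int priority) {}

    // Returns, per data item, the writes involved in a cross-node conflict.
    public static Map<String, List<LoggedWrite>> detect(List<LoggedWrite> writes) {
        Map<String, List<LoggedWrite>> byData = new HashMap<>();
        for (LoggedWrite w : writes)
            byData.computeIfAbsent(w.dataId(), k -> new java.util.ArrayList<>()).add(w);
        byData.values().removeIf(ws ->
            ws.stream().map(LoggedWrite::node).distinct().count() < 2); // same-node writes: no conflict
        return byData;
    }

    // Resolution policy from the text: keep the higher-priority write and
    // compensate the lower-priority one.
    public static LoggedWrite loser(List<LoggedWrite> conflict) {
        return conflict.stream()
                       .min(java.util.Comparator.comparingInt(LoggedWrite::priority))
                       .orElseThrow();
    }
}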

The synchronizer receives synchronizing transactions from the conflict solver. It may execute them either periodically or upon receiving an order from the transaction load balancer for an immediate synchronization.

The preventive replication manager is implemented as an enhanced version of a previous implementation [11]. To implement reliable message broadcast, we plan to use Ensemble [6], a group communication software package from Cornell University.

6.3 Directory

All information used for load balancing (execution rules, data placement, replication mode, cluster load) is stored in an LDAP-compliant directory. The directory is accessed through the Java Naming and Directory Interface (JNDI), which provides an LDAP client implementation. Dynamic parameters measuring the cluster activity (load, resource usage) are stored in the directory. They are used for transaction routing. Their values are updated periodically at each node. To measure DBMS node activity, we take advantage of the dynamic views maintained by Oracle. For instance, the following queries collect the CPU usage and the I/O made by all running transactions at a node:

select username, se.sid, cpu_usage
from v$session ss, v$sesstat se, v$statname sn
where se.statistic# = sn.statistic#
and name like '%CPU used by this session%'
and se.sid = ss.sid
order by value desc

select username, os_user, process pid, ses.sid,
physical_reads, block_gets, consistent_gets,
block_changes, consistent_changes
from v$session ses, v$sess_io sio
where ses.sid = sio.sid
order by physical_reads, ses.username
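On the directory side, reading such a load parameter back through JNDI could look like the sketch below; the LDAP URL, the DN layout (cn=nodeN,ou=cluster,...) and the attribute name cpuLoad are all invented for illustration.

import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import java.util.Hashtable;

// Hypothetical sketch of reading a node's load parameter from the LDAP
// directory through JNDI, as the transaction router would for routing.
public class DirectoryLoadReader {
    public static double cpuLoad(String node) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(DirContext.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(DirContext.PROVIDER_URL, "ldap://directory-host:389"); // placeholder URL
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(
                "cn=" + node + ",ou=cluster,dc=legnet,dc=example",   // hypothetical DN
                new String[] { "cpuLoad" });                          // hypothetical attribute
            return Double.parseDouble((String) attrs.get("cpuLoad").get());
        } finally {
            ctx.close();
        }
    }
}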

6.4 Planned experiments

The experimental cluster is composed of 5 nodes: 4 DBMS nodes plus 1 application node. At each DBMS node, there is a TPC-C database (500 MB to 2 GB) and the TPC-C stored procedures. The application node sends the transactional workload to the transaction load balancer through TPC-C stored procedure calls. The cluster data placement and replication (optimistic and preventive) may be configured depending on the goal of the experiment. We plan to first measure the cluster performance (transactional throughput) when the database is replicated on the 4 nodes and accessed in either preventive or optimistic replication mode, varying the update rates and the probability of conflict. We also plan to measure the scalability of our approach: we will emulate a logical 16-node cluster by running 4 DBMS instances at each of the 4 nodes. Calibrated with the actual performance numbers measured on 4 nodes, this should yield credible results.

7. Comparison with related work

The main work related to ours is replication in either large-scale distributed systems or cluster systems, and advanced transaction models that trade consistency for improved performance. In synchronous replication, 2PC can be used to update replicas. However, 2PC is blocking and the number of messages exchanged to control transaction commitment is significant; it cannot scale up to cluster configurations with large numbers of nodes. [7] addresses the replica consistency problem for synchronous replication. The number of messages exchanged is reduced compared to 2PC, but the solution is still blocking and it is not clear whether it scales up. In addition, synchronous solutions cannot perform load balancing as we do. The common point with our preventive approach is that we both consider the use of communication services to guarantee that messages are delivered at each node in a specific order. [13] proposes a refreshment algorithm that ensures correctness for lazy-master configurations, but does not consider multi-master configurations as we do here. Multi-master asynchronous replication [19] has been successfully implemented in commercial systems such as Oracle and Sybase. However, only the optimistic approach, with conflict detection and reconciliation, is supported.

There are interesting projects on replicated data management in cluster architectures. The PowerDB project at ETH Zurich deals with the coordination of the cluster nodes in order to provide a uniform and consistent view to the clients. Its solution fits well for some kinds of applications, such as XML document management [5] or read-intensive OLAP queries [17]. However, it does not address the problem of seamless integration of legacy applications. The GMS project [21] uses global information to optimize page replacement and prefetching decisions over the cluster. However, it mainly addresses system-level or Internet applications (such as the Porcupine mail server). Other projects are developed in the context of wide-area networks. The Trapp project at Stanford University [8] addresses the precision/performance trade-off. However, the focus is on numeric computation of aggregation queries and on minimizing communication costs. The TACT middleware layer [24] implements the continuous consistency model. Although additional messages are used to limit divergence, a substantial gain in performance may be obtained if users accept a rather small error rate. However, read and write operations are mediated individually: an operation is blocked until its consistency requirements can be guaranteed. This implies monitoring at the server level, and it is not clear whether it allows installing a legacy application in an ASP cluster. In the quasi-copy caching approach [1], four consistency conditions are defined. Quasi-copies can be seen as materialized views with limited inconsistency. However, they only support single-master replication, which is not adapted to our multi-master replication in a cluster system. Finally, epsilon transactions [23] provide a nice theoretical framework for divergence control. As in the continuous consistency model [24], different consistency metrics can be used to answer queries with bounded imprecision. However, epsilon transactions require significantly altering the concurrency control, since each lock request must read or write an additional counter to decide whether the lock is compatible with the required level of consistency. In summary, none of the existing approaches addresses the problem of leaving databases and applications autonomous and unchanged, as our work does.

8. Conclusion

In this paper, we proposed a new solution for parallel processing with autonomous databases in a cluster system for ASP, using a replicated database organization. The main idea is to allow the system administrator to control the consistency/performance tradeoff when placing applications and databases onto cluster nodes, which is not possible when using an existing parallel DBMS. Application requirements are captured through execution rules stored in a shared directory. They are used at configuration time to choose the best organization for applications and databases. They are also used at run-time to either prevent or tolerate copy inconsistency in order to optimize load balancing.

Capitalizing on the work on relaxing database consistency for higher performance, this paper makes several contributions in the context of cluster systems. First, we defined a replicated database architecture for cluster systems that does not hurt application and database autonomy. We use non-intrusive techniques, intercepting DBMS transaction calls or exploiting the DBMS's log interfaces.

Second, we proposed a new preventive replication method that provides strong consistency without the overhead of synchronous replication, by exploiting the cluster's high-speed network. The preventive replication architecture maintains the DBMS's autonomy and can support optimistic replication as well, with the addition of a conflict manager.

Third, we proposed a transaction load balancing architecture which can trade off consistency for performance using optimistic replication and execution rules. Support for both preventive and optimistic replication provides a continuum from strong consistency to weaker consistency, with different cost/performance trade-offs. Execution rules can be defined at different levels of granularity (program or transaction, table or attribute or tuple) to express application semantics. The distinction between the Transaction Policy (what we want) and the Transaction Execution Plan (how we optimize it) eases the system's evolution (by changing rules) and load balancing decisions.

Finally, we presented an execution model that executes Transaction Execution Plans in a way that optimizes load balancing. The optimal node is selected based on the replication mode that should be used and on a cost function which estimates the nodes' load. We also proposed a conflict manager architecture which exploits the database logs and execution rules to perform replica reconciliation among heterogeneous databases.

We have started to implement the proposed solution on LIP6’s cluster architecture running Linux and Oracle 8i. In the near future, we will experiment with the TPC-C benchmark to assess the cost/performance of preventive replication and optimistic replication (with relaxed consistency) under various workloads. We will also develop a simulation model, calibrated with our implementation, to study how our solution scales up to very large cluster configurations.

References

[1] R. Alonso, D. Barbará, H. Garcia-Molina. Data Caching Issues in an Information Retrieval System. ACM Transactions on Database Systems (TODS), 15(3), 1990.

[2] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, P. O'Neil. A Critique of ANSI SQL Isolation Levels. ACM SIGMOD Int. Conf. on Management of Data, 1995.

[3] A. Doucet, S. Gançarski, C. León, M. Rukoz. Checking Integrity Constraints in Multidatabase Systems with Nested Transactions. Int. Conf. on Cooperative Information Systems (CoopIS), 2001.

[4] S. Gançarski, H. Naacke, P. Valduriez. Load Balancing of Autonomous Applications and Databases in a Cluster System. 4th Workshop on Distributed Data and Structures (WDAS), 2002.

[5] T. Grabs, K. Böhm, H.-J. Schek. Scalable Distributed Query and Update Service Implementations for XML Document Elements. IEEE RIDE Int. Workshop on Document Management for Data Intensive Business and Scientific Applications, 2001.

[6] M. Hayden. The Ensemble System. Technical Report TR-98-1662, Department of Computer Science, Cornell University, 1998.

[7] B. Kemme, G. Alonso. Don't be lazy, be consistent: Postgres-R, a new way to implement Database Replication. Int. Conf. on Very Large Data Bases (VLDB), 2000.

[8] C. Olston, J. Widom. Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data. Int. Conf. on Very Large Data Bases (VLDB), 2000.

[9] T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, 2nd edition, 1999.

[10] T. Özsu, P. Valduriez. Distributed and Parallel Database Systems - Technology and current state-of-the-art. ACM Computing Surveys, 28(1), 1996.

[11] E. Pacitti. Improving Data Freshness in Replicated Databases. PhD Thesis, INRIA-RR 3617, 1999.

[12] E. Pacitti, O. Dedieu. Algorithms for Optimistic Replication on the Web. Journal of the Brazilian Computing Society, 2002, to appear.

[13] E. Pacitti, P. Minet, E. Simon. Replica Consistency in Lazy Master Replicated Databases. Distributed and Parallel Databases, 9(3), 2001.

[14] E. Pacitti. Preventive Lazy Replication in Cluster Systems. Technical Report RR-2002-01, CRIP5, University Paris 5, 2002.

[15] M. Patiño-Martínez, R. Jiménez-Peris, B. Kemme, G. Alonso. Scalable Replication in Database Clusters. 14th Int. Conf. on Distributed Computing (DISC), 2000.

[16] D. Powell et al. Group communication (special issue). Communications of the ACM, 39(4), 1996.

[17] U. Röhm, K. Böhm, H.-J. Schek. Cache-Aware Query Routing in a Cluster of Databases. Int. Conf. on Data Engineering (ICDE), 2001.

[18] A. Sheth, M. Rusinkiewicz. Management of Interdependent Data: Specifying Dependency and Consistency Requirements. Workshop on the Management of Replicated Data, 1990.

[19] D. Stacey. Replication: DB2, Oracle, or Sybase. Database Programming & Design, 7(12), 1994.

[20] P. Valduriez. Parallel Database Systems: open problems and new issues. Int. Journal on Distributed and Parallel Databases, 1(2), 1993.

[21] G. Voelker et al. Implementing Cooperative Prefetching and Caching in a Global Memory System. ACM Sigmetrics Conf. on Performance Measurement, Modeling, and Evaluation, 1998.

[22] G. Weikum. Principles and Realization Strategies of Multilevel Transaction Management. ACM Transactions on Database Systems (TODS), 16(1), 1991.

[23] K. L. Wu, P. S. Yu, C. Pu. Divergence Control for Epsilon-Serializability. 8th Int. Conf. on Data Engineering (ICDE), 1992.

[24] H. Yu, A. Vahdat. Efficient Numerical Error Bounding for Replicated Network Services. Int. Conf. on Very Large Data Bases (VLDB), 2000.

RS2.7: an Adaptable Replication Framework

Stéphane Drapeau*,** — Claudia Lucia Roncancio** — Pascal Déchamboux*

* France Télécom R&D, 28 chemin du Vieux Chêne, F-38243 Meylan

** Laboratoire LSR IMAG, 681 rue de la Passerelle, F-38402 Saint Martin d’Hères

Stephane.Drapeau, [email protected]

ABSTRACT. The RS2.7 Replication Framework revisits the replication function in order to provide a component-based middleware support that can adapt to several kinds of environments. It clearly identifies what minimal functions are relevant to replication: binding replicas between themselves, and synchronizing them in order to support the right levels of coherency. This paper focuses on the coherency issue. We analyse how this feature can be decomposed with respect to two dimensions: functional and scheduling. Playing with these two dimensions allows providing different replication solutions by merely assembling RS2.7 components. A prototype of RS2.7 is operational and has been applied to a platform for interactive networked applications.

RÉSUMÉ. RS2.7 is a Replication Framework redefining the replication function in order to provide a component-based middleware that can adapt to different environments. It identifies the minimal functions specific to replication: the binding between the copies and their synchronization in order to obtain the desired local coherency. This paper focuses on coherency issues. We analyse how this feature can be decomposed along two dimensions: functional and structural. Playing with these two dimensions, it is possible to provide different solutions managing replication by simply assembling RS2.7-compliant components. A prototype of RS2.7 is operational and has been used in a platform for interactive networked applications.

KEYWORDS: Replication, consistency, coherency, functional decomposition, adaptable framework, component.

MOTS-CLÉS : Duplication, cohérence globale, cohérence locale, décomposition fonctionnelle, canevas adaptable, composant.


1. Introduction

Distributed computing requires mechanisms to ensure system availability and reliability, with different purposes in mind such as fault-tolerance or performance scalability (e.g., parallel computing). Replication is at the heart of solutions to all the preceding issues and usually comes in many different flavours. The RS2.7 Replication Framework revisits the replication function in order to provide a middleware support that can adapt to several kinds of environments. It clearly identifies the minimal functions relevant to replication: binding replicas to each other, and synchronizing them in order to support the right levels of coherency. Moreover, this paper analyses how coherency can be decomposed with respect to two dimensions: functional and scheduling. Our work is done in the context of the NODS project¹ (Network Open Database Services), which aims at defining an open, adaptable architecture that can be extended and customized on a per-application basis [COL 00]. This vision is shared with several other researchers [SIL 97, HAM 99, CHA 00, DIT 01].

Replication techniques have been heavily investigated in various areas such as group communication systems, distributed shared memories, distributed file systems, DBMSs and object-oriented distributed systems. Experiments have led to a large number of replication techniques and protocols [GRA 96, KEM 00, KIN 99, DRA 01]. Apart from the principles involved, these experiments have little in common. They generally implement some dedicated ad hoc replication support. Providing an adaptable replication support could have a major impact given the growing replication needs required to support ubiquitous computing in the large.

Adaptation can be obtained through parametrization. It seems quite impossible to provide a general-purpose replication protocol that can be parameterized to accommodate all runtime environments. Indeed, we have noticed that large amounts of code need to be replaced from one case to another. Further, increasing the number of parameters tends to heavily increase the code complexity. We argue for a component-based approach where pieces of code (i.e., components) can be changed for adaptation purposes. This requires the components of the framework to be clearly identified in terms of functions, defined through relevant interfaces. As with any component model, dependencies must be fully mastered, guaranteeing the ability to replace components. We also advocate that the functional scope of a component should be minimal in order to enhance its reusability.

We have not found any work on replication with adaptability as the main objective. Compared with existing work (e.g., Garf [GAR 95], services for Corba [MAF 95, NAR 02], Core [BRU 95] and Globe [KER 98]), we believe RS2.7 is far more adaptable and reusable. We have considered three conditions to be fulfilled in order to optimize adaptability.

The first one is the separation of concerns, which means that the functional scope of the framework is devoted only to replication. We propose two features as the foundations of the framework.

1. http://www-lsr.imag.fr/Les.Groupes/STORM/Storm2002/English/index.html


The first feature aims at managing bindings between replicas, mainly through their life cycle (i.e., creating/deleting bindings, adding/removing a replica to/from a binding). The second feature deals with coherency between the replicas of a binding, application-level consistency being built by choosing the coherency model and implementation that best fit.

The second condition is the ability to adapt to various contexts such as persistent, transactional, or fault-tolerant ones. This requires independence from potential interactions with other features such as concurrency control or save point mechanisms.

The third condition is the decomposition of the coherency support into a large number of components, each one playing a reduced role. It can be decomposed with respect to two dimensions: functional and scheduling. Defining these components is probably one of the main contributions of our work. This fine-grain decomposition allows the coherency of replicas to be tuned as finely as possible, simply by assembling the most suitable implementation of each function (e.g., a policy).

This paper is organized as follows. Section 2 gives the definition and the scope of the replication framework. Section 3 presents how it takes part in various contexts. Section 4 proposes an abstract coherency protocol used in the design of the framework (scheduling decomposition) and Section 5 presents the functional architecture of the framework (functional decomposition). Section 6 briefly reports our implementation achievements whereas Section 7 is devoted to related work. Our conclusions and future work are given in Section 8.

2. A replication framework

2.1. Definition

Even if several efforts have been devoted to proposing replication frameworks [GAR 95, MAF 95, NAR 02, BRU 95, KER 98], there is no consensus on the definition and functions provided by such a framework. This section introduces our proposed replication framework through the functions it covers and its positioning with respect to the applications as well as to other features.

We argue that separation of concerns is a key issue for adaptation. In that respect, we do not intend to provide a full-fledged framework in terms of replication functions, as it reduces reusability. RS2.7 focuses on two features: life cycle management of groups of replicas (e.g., their creation and deletion) and inter-replica synchronization protocols (named coherency protocols in the following).

Note that, in general, replicated object management is not reduced to the two features that RS2.7 provides. Indeed, a replication policy usually requires the definition of the following points:

– the replication time: when to create or to delete a replica inside the system?


– the replication degree: how many replicas have been or may be created?

– the replicas placement (if needed): where to place a replica among a set of distributed nodes?

– the coherency model: what is the required coherency model?

(Figure content: the replication policy covers the replication time, the replication degree, the replicas placement and the coherency model; other services, such as fault tolerance and load balancing, handle the first three points, while the replication framework provides life cycle management and the coherency protocol.)

Figure 1. Replication policy / replication framework

The coherency model is supported by the replication framework (see Section 2.2) but the replication time, the replication degree and the replicas placement are related to other services (figure 1). These services may use the replication framework for different purposes such as load balancing or fault tolerance. In the latter cases, the choices of replication time and degree will certainly differ.

2.2. Coherency models and protocols

RS2.7 provides a generic approach allowing particular instantiations of the framework to obtain the appropriate coherency protocols. Requirements of the system/application using the replication framework lead to different RS2.7 instances. For example, one provides a simple master-slaves protocol while another one provides a ROWA protocol. Some applications need strong coherency whereas others may accept a divergence among replicas. For this reason, we introduce the notion of coherency model, which defines how users perceive the different replicas of an object. It can be considered as a contract between the replication framework and its users. This includes the definition of both access and synchronization events. An access event on a replica is either a read or a write operation, while a synchronization event is a request to synchronize replicas. We can classify coherency implementations (i.e., protocols) into the four models below:

1) One copy equivalence model: In this model, replicas are always equivalent. Reads on replicas give up-to-date data. Writes are always executed on up-to-date replicas. Synchronization is done to preserve one copy equivalence but may not be done at each write operation. Protocols like ROWA, ROWAA, quorum, active replication, passive replication, or eager replication used by DBMSs implement this model.


2) Divergent replicas model: This model offers weak coherency as it allows replica divergence. However, it is possible to characterize this divergence by giving some guarantees on R/W operations and their execution order. For example, a sequence of operations performed on a particular replica is perceived in the same order by other replicas, or operations appear after operations that logically precede them. Guarantees on operations do not necessarily force the execution of writes on an up-to-date replica; in that case, writes can be lost or two replicas can be modified simultaneously.

Protocols implementing a divergent replicas model are often used in Distributed Shared Memories (DSM) as they contribute to popular consistency models (e.g., causal consistency, PRAM consistency, entry consistency, release consistency).

3) Convergent replicas model: This model allows replicas to diverge but they eventually converge at some point. A limited level of inconsistency based on conditions [GAL 95] like delay, periodic, time point, version, numerical, object, or event conditions is supported. Synchronization is done when conditions are about to be violated. It is not possible to access replicas that do not respect a condition. For many applications, permitting temporary inconsistencies between replicas is not a drawback and allows better performance. We distinguish between two kinds of models:

a) Convergent replicas model with reads on divergent replicas: each replicated object has an owner that stores its current value. Updates (always done on up-to-date replicas) are first applied to the owner and then propagated to the other replicas. Reads may be performed on any replica. Protocols like lazy master replication or epsilon-serializability in DBMSs implement this model.

b) Convergent replicas model with writes on divergent replicas: reads and writes may be done on out-of-date replicas. Two nodes may simultaneously update their replica and race each other to install their updates at other nodes. The replication mechanism must detect this situation and must reconcile the two processes so that their updates are not lost. Protocols like multi airline reservation, or protocols used in mobile computing, implement this model.

Various requirements concerning concurrency control or fault tolerance have to be considered when implementing specific coherency protocols. Thus, replication, concurrency control and fault tolerance are not independent, even though separating them and clearly defining their interactions is a way to enhance adaptability.

For example, concurrent accesses on different replicas imply dependencies between replication and concurrency control. In fact, coherency protocols define some concurrency control requirements:

1) Some protocols implementing the one copy equivalence model (model 1) perform updates synchronously on all replicas whereas others propagate updates asynchronously. In both cases there are particular concurrency control needs.

2) The divergent replicas model and the convergent replicas model with writes on divergent replicas (models 2 and 3b) allow several simultaneous writers and readers working on different replicas.


3) The convergent replicas model with reads on divergent replicas (model 3a) authorizes one writer and several readers simultaneously on different replicas.

Thus, during a write operation, coherency protocols need to lock all replicas when implementing the first model, while implementing the third one requires locking only one replica.
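This write-time locking requirement can be made concrete with a small sketch; the enumeration and all names below are assumptions of ours, not part of the RS2.7 interfaces:

import java.util.List;

// Hypothetical sketch: the coherency model drives which replicas must be
// locked before a write is executed.
interface Replica {}

enum CoherencyModel {
    ONE_COPY_EQUIVALENCE,   // model 1
    DIVERGENT,              // model 2
    CONVERGENT_READS,       // model 3a: reads on divergent replicas
    CONVERGENT_WRITES       // model 3b: writes on divergent replicas
}

final class WriteLocking {
    // Returns the replicas to lock for a write on 'target'.
    static List<Replica> replicasToLock(CoherencyModel model,
                                        Replica target, List<Replica> all) {
        switch (model) {
            case ONE_COPY_EQUIVALENCE:
                return all;              // model 1: lock every replica
            default:
                return List.of(target);  // models 2, 3a, 3b: lock the accessed replica only
        }
    }
}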

Other interactions of coherency protocols with concurrency control, and also with fault tolerance, are further discussed in Section 5.

3. Support of consistency models

The previous section introduced the scope of RS2.7 in terms of coherency models. However, application execution requires a consistency model. A consistency model is a more or less formal specification of how the memory appears to an application. Consistency models are implemented by consistency protocols that manage objects taking into account replication, concurrency control and fault tolerance. In the presence of replicated data, we argue that a consistency model (protocol) includes a coherency model (resp. protocol) (figure 2).

(Figure content: an application relies on a consistency protocol combining the coherency protocol, concurrency and fault tolerance on top of communication; the coherency model is included in the consistency model; the RS2.7 coverage comprises the coherency model and the coherency protocol.)

Figure 2. Support of consistency models

Consistency models have been adopted in several domains. In DSM (distributed shared memories), they define the value to be returned by a read event during the execution of parallel programs. Many different consistency models exist [KIN 99]: sequential, causal, PRAM, weak consistency, entry consistency, release consistency. In this context, consistency protocols take into account concurrency and replication. DBMSs also propose consistency models. For instance, a very popular correctness criterion for replicated databases is one copy serializability. This criterion ensures that an execution on replicated data is equivalent to an execution working with a single copy of the data, and that the execution of transactions is serializable.


In this case, the consistency model takes into account concurrency, fault tolerance and replication. ACID transactions offer a protocol implementing this consistency model.

With our proposed coherency models, consistency models do not have to deal with replication issues. The objective is that issues related to replication are hidden behind the coherency models. This approach promotes the use of RS2.7 in several contexts supporting transactions, fault tolerance, DSM, etc. In order to better understand this, two examples are depicted in the next two subsections.

3.1. Transactional context

With replicated data, a transactional service aims at enforcing a specific consistency model. To do this, it interacts with a concurrency service and RS2.7. If the consistency model is one copy serializability, the transactional service needs 1) a concurrency service which provides one writer or several readers of an object and 2) concerning replicated objects, a one copy equivalence model. There are two levels of coherence: one among the replicas of an object (specified by the coherency model) and another among the objects managed by the transactional service (in order to respect a consistency model).

Let us consider a system with two databases (DB1 and DB2), where each database contains a replica of objects A and B (i.e., replicas A1 and B1 on DB1, replicas A2 and B2 on DB2). T1 is a transaction on DB1 that modifies A and B. T2 is a transaction on DB2 that modifies A. Notice that the transaction manager is not aware of replication. The execution of T1 requires the transactional service to request a lock on A from the concurrency service and to inform RS2.7 of the write operation on A. Then, RS2.7 requests a lock on objects A1 and A2. The same process applies to object B. At the same time, the transactional service of DB2 requests A for T2. Thanks to the lock set by RS2.7, T2 is prevented from locking A.

There are two levels of concurrency control: one managed by the coherency protocol and a second managed by the transactional service. The awareness of replication remains inside the replication framework. If the transactional service wants to implement epsilon serializability (relaxed isolation), the approach ensures that there is no consequence on the replication framework, but only modifications to the level of concurrency managed by the transactional service.
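The write path of this example can be rendered schematically as follows; ConcurrencyService, ReplicationFramework and the method names are hypothetical interfaces of ours, not the actual RS2.7 API:

// Sketch of the two locking levels in the example above.
interface ConcurrencyService { void lock(String logicalObject); }
interface ReplicationFramework { void notifyWrite(String logicalObject); }

final class TransactionalService {
    private final ConcurrencyService concurrency;
    private final ReplicationFramework rs27;

    TransactionalService(ConcurrencyService c, ReplicationFramework r) {
        this.concurrency = c;
        this.rs27 = r;
    }

    // Write on a logical object: the transaction manager is unaware of
    // replication; the framework transparently locks the physical replicas
    // (A1 and A2 in the example) to preserve one copy equivalence.
    void write(String logicalObject) {
        concurrency.lock(logicalObject); // level 1: transactional concurrency control
        rs27.notifyWrite(logicalObject); // level 2: RS2.7 locks all replicas
    }
}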

3.2. Mobile context

In a mobile context, the transactional service needs a convergent replicas model with writes on divergent replicas (model 3b). Let us consider the previous example with DB1 as a mobile host. When the transaction service on DB1 locks A, the replication framework only requests a lock on A1. At the same time, the transaction service on DB2 can also lock A, actually A2, to execute T2 concurrently. A merge process could then appear later.


The fact that a specific kind of replication is used is also hidden by the replication framework.

4. An abstract coherency protocol

For adaptability purposes, RS2.7 must support several coherency protocols. Nevertheless, the large variety of existing coherency protocols complicates the design of the framework. Protocols differ mainly in the manner and order in which they accomplish different phases. In order to provide a generic replication framework, adaptability is introduced by providing abstractions for defining coherency protocols. These abstractions consist of five generic phases: the access, coordination, execution, validation, and response phases. The differences between protocols are characterized by the approach they use in each phase and by the order in which the phases are executed. This leads to a replication framework whose design is independent of any protocol. The abstract coherency protocol is the minimal part shared by all existing protocols. To introduce the five phases, let us consider a client object that interacts with a replicated object.

Access phase: a client object submits a request (operation) to a replicated object. Replication is said to be transparent if the client object interacts with a logical object and is not aware of the underlying physical replicas, nor of how many of them exist or where they are. On the other hand, with non-transparent replication, the client object sends its request directly to one or several (possibly all) replicas.

Coordination phase: it includes treatments preliminary to the execution of a request. If necessary, replicas coordinate with each other to synchronize the execution of the requested operation. This phase may also include coordination actions with concurrency control and/or fault tolerance services.

Execution phase: The requested operation is actually executed on the replica(s).

Validation phase: the replicas make sure that they agree on the result of the execution. For instance, they may decide whether it is necessary to undo or to redo some actions. Interactions with concurrency control and/or fault tolerance services may be required.

Response phase: the outcome of the executed operation is sent to the client object. Protocols show two possibilities: either the response is sent only after everything has been settled and all protocol phases have finished, or the response is sent as soon as it is available, even if some phases have not been completed yet.

Some protocols may skip some phases, order them in different ways, iterate over some of them, or merge them into a sequence. Let us consider four particular coherency protocols to illustrate the five phases. In these examples, a client object submits a write request.


– The ROWA protocol (Read One Write All) [BER 84] performs writes synchronously on all the replicas and reads on one replica. It implements a one copy equivalence model (Section 2.2, model 1). In the access phase the write request is captured. In the coordination phase, it is transmitted to all the replicas, which execute it (i.e., execution phase). The concurrency control ensures that a single client object accesses the replicated object. The validation and response phases are empty.

– In the ROWAA protocol (Read One, Write All Available) [GOO 83], writes are done synchronously on all available replicas and asynchronously on the others; reads are done on one replica. It also implements a one copy equivalence model but the coordination, execution and validation phases are different from ROWA. If a replica is not available during the synchronization process, there is a loop between the validation and coordination phases. When the replicas become available again, they are updated asynchronously during the coordination phase.

– In lazy master/slaves replication [GRA 96], used in the DBMS context, updates are done on the master and slaves are updated asynchronously. This protocol implements a convergent replicas model with reads on divergent replicas (Section 2.2, model 3a). Thus, the access, execution and response phases are executed on the master replica. The coordination, execution and validation phases are executed asynchronously, in this order, on the slave replicas.

– In active replication [POW 91], based on the communication primitives, the access and coordination phases can be merged, as well as the validation and response phases.

We were inspired by [WIE 00], which proposes a decomposition with the objective of comparing protocols used in distributed systems and database systems. Nevertheless, it focuses on strong coherence. Our abstract coherency protocol considers weak coherence among replicas. Besides, we do not include concurrency control and message ordering issues; we consider only the interactions with these aspects.

For each phase, we define a component with a generic interface². Thus, it is possible to change specific phases of a particular coherency protocol for adaptation purposes.
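Although the paper does not present the actual interfaces, a plausible shape for such a generic phase component, and for a protocol defined as a scheduling of phases, is the following sketch (all names are assumptions of ours):

import java.util.List;

// Minimal request abstraction: a read or write on a replicated object.
record Request(String operation, String replicatedObject) {}

// Hypothetical generic interface shared by the five phase components.
interface Phase {
    void process(Request request);
}

// A coherency protocol is obtained by scheduling phase components; ROWA,
// for instance, would schedule only access, coordination and execution.
final class CoherencyProtocol {
    private final List<Phase> schedule;

    CoherencyProtocol(List<Phase> schedule) {
        this.schedule = schedule;
    }

    void submit(Request request) {
        for (Phase phase : schedule) {
            phase.process(request); // phases run in the configured order
        }
    }
}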

While this section has defined a first decomposition for the protocols (scheduling decomposition), the next one presents the components that can be used to compose each phase (functional decomposition).

5. Functional architecture of RS2.7

In order to obtain adaptability inside the framework, functional decomposition is in turn applied to coherency protocols. This means that common functionalities have been extracted from protocols. In this respect, a functional architecture distinguishing the components involved in the construction of coherency protocols is proposed.

2. Due to space limitations, we do not present the interfaces in this paper.


(Figure content: the components are laid out along the five phases (access, coordination, execution, validation, response) in four layers: kernel components (life cycle manager, dispatcher manager, synchronization messages factory, synchronization messages listener manager, group membership manager, group multicast, local replica access manager, local replica manager, response manager); components common to all coherency models (synchronization manager, synchronization starting manager, updates log); components dependent on the coherency model (roles manager, conflict detection, conflict resolution); components dependent on the coherency protocol (replicated messages manager, synchronization group, messages ordering, consensus manager, failing replica manager, information collector).)

Figure 3. Functional architecture

Each component has an interface³ covering a particular function that can be implemented in several ways. Components can be used to build coherency protocols for applications with particular requirements.

The functional architecture provides four categories of components (see figure 3):

– Kernel components are considered as the basic level to construct simple protocols. These components mainly concern the replica life cycle, their communication and interactions with the user application.

– Components common to all coherency models. This category introduces components related to general synchronization issues.

– Components dependent on the coherency model.

– Components dependent on the coherency protocol.

The latter two categories add particular model/protocol-dependent components.

The next sections present each category in detail, and illustrate the presentation with a ROWA protocol and some of its variants.

3. Due to space limitations, we do not present the interfaces in this paper.


5.1. Kernel components

Kernel components participate in the access, coordination, execution and response phases. No validation issues are considered at this stage. They also provide means to manage the replica life cycle.

The life cycle manager component manages the life cycle of two kinds of entities: replicated objects and replicas. This means that it supports the creation/deletion of replicated objects, as well as the addition/removal of replicas associated with such a replicated object.
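A hedged sketch of what this life cycle interface could look like (all method and type names are ours):

// Hypothetical life cycle manager interface for the two kinds of entities.
interface ReplicatedObject {}
interface Replica {}
interface Node {}

interface LifeCycleManager {
    ReplicatedObject createReplicatedObject(String name); // new logical object
    void deleteReplicatedObject(ReplicatedObject object);

    Replica addReplica(ReplicatedObject object, Node node); // attach a replica
    void removeReplica(ReplicatedObject object, Replica replica);
}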

The Access Phase involves a dispatcher manager component. The dispatcher manager component captures the requests submitted by client objects and forwards them to the relevant replicas.

The Coordination Phase includes the synchronization messages factory, the synchronization messages listener manager, the group membership manager and the group multicast components. The last two components interact with the communication service (which provides non-blocking point-to-point communication and may support multicast). The group membership manager is in charge of the life cycle of object groups and maintains the list of their members. It provides support for joining and leaving groups. The group multicast component provides support for sending messages to all members of a group, with various reliability and ordering guarantees. Note that this functionality may be directly provided by the communication service or may require work from the replication framework. The synchronization messages listener manager interprets the messages delivered by the communication support and transfers information to the listening components.

The Execution Phase includes two components: a local replica access manager and the local replica manager. The local replica access manager offers an interface allowing R/W on a replica, whereas the local replica manager encapsulates other replica management issues (e.g., loading a replica in memory). These components contribute to the genericity of our framework as they permit the replication of any kind of object (e.g., simple or composite objects, HTML pages). The local replica access manager gives an abstract representation of replicas to the framework.

The Response Phase has a single component, the response manager, which passes the return value to the client object.

Let us consider a simple ROWA protocol to illustrate the construction of a protocol using the kernel components. The phases of the protocol are access, coordination and execution. When a write is requested, the access phase performs the appropriate actions through the concurrency control service before transferring the request to the coordination phase. In the coordination phase a synchronization message is constructed (by the synchronization messages factory) and is sent to all the replicas (group multicast component). The arrival of this message at the different nodes starts their coordination phase. The message is read by the synchronization messages listener manager and passed to the execution phase. Thus it is applied to the local replica using the local replica access component.
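The assembly just described can be pictured as follows; each interface stands for the kernel component of the same name, but the wiring and signatures are assumptions of ours:

// Sketch: the coordination phase of a simple ROWA write, built from
// kernel components (hypothetical signatures).
interface SynchronizationMessagesFactory { Object newMessage(Object writeRequest); }
interface GroupMulticast { void sendToAll(Object message); }

final class RowaCoordinationPhase {
    private final SynchronizationMessagesFactory factory;
    private final GroupMulticast multicast;

    RowaCoordinationPhase(SynchronizationMessagesFactory f, GroupMulticast m) {
        this.factory = f;
        this.multicast = m;
    }

    // Build one synchronization message and send it synchronously to all
    // replicas; each node's listener then triggers its execution phase.
    void coordinate(Object writeRequest) {
        Object message = factory.newMessage(writeRequest);
        multicast.sendToAll(message);
    }
}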


5.2. Components common to all coherency models

The support of coherency models requires components for the synchronization process. The level of coherency between replicas depends on the synchronization approach. The synchronization manager component is introduced for the Coordination Phase. It is based on the synchronization messages factory (a kernel component) and on two other new components: the starting synchronization component and the updates log component. The starting synchronization component decides on the synchronization time. For instance, this time may be determined by the occurrence of a read or write event, or by the violation of a condition. The updates log component saves the information to be exchanged during the synchronization process. It may be implemented with different mechanisms such as logs, triggers, snapshots, shadow copies, etc.

There are no other components common to all coherency models in the other protocol phases.

Let us consider a protocol more complex than ROWA, where the synchronization process takes place when a read or a write is requested. Thus, the starting synchronization component launches the synchronization process on read or write events. The synchronization manager requests information from the updates log component in order to synchronize the replicas. Depending on its implementation, this component may return the value to be installed in the replicas or a set of operations to be executed to obtain the appropriate value.
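Under our own naming assumptions, the two new components could be captured as:

// Hypothetical interfaces for the two synchronization-support components.
interface StartingSynchronization {
    // Decides the synchronization time: e.g., on a read or write event,
    // or when a divergence condition is about to be violated.
    boolean shouldSynchronize(String event);
}

interface UpdatesLog {
    // Saves the information exchanged during synchronization; depending on
    // the implementation (log, triggers, snapshots, shadow copies), returns
    // either the value to install or the operations to replay.
    Object collectUpdates();
}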

5.3. Components dependent on the coherency model

Coherency models are characterized by the replicas’ roles. A role may specify which replicas can be updated by external requests and which ones are updated exclusively by the replication framework itself. Two basic settings are possible, namely master-slaves and peer-to-peer replication. Peer-to-peer replication allows any replica to be updated upon external request, and its modifications are eventually forwarded to the other replicas. On the other hand, a master-slaves setting defines a clear distinction between the master replica, which can be explicitly modified, and the slave replicas, which are only updated by the replication framework. Thus, the roles manager component is introduced for the Access Phase.

The treatment of conflicting modifications requires the introduction of new components in the Validation Phase: the conflict detection component and the conflict resolution component. These components are used by protocols implementing the convergent replicas model with writes on divergent replicas. In such protocols, the replicated object must be able to detect conflicts and to solve them. For this purpose, current replication systems offer different conflict resolution policies (e.g., timestamp-based, priority-based, additive, maximum) that can be used in the implementation of these components. Note that interaction with the concurrency control may be required.
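As an illustration, a timestamp-based pairing of the two components might look like this (entirely our own sketch, not the RS2.7 interfaces):

// Hypothetical conflict handling for model 3b (writes on divergent replicas).
record Update(String replicatedObject, long timestamp, Object value) {}

interface ConflictDetection {
    boolean conflicts(Update local, Update remote); // e.g., concurrent writes on one object
}

interface ConflictResolution {
    Update resolve(Update local, Update remote);
}

// One possible policy among those cited (timestamp-based): keep the most
// recent write, discarding the older one.
final class LastWriterWins implements ConflictResolution {
    public Update resolve(Update local, Update remote) {
        return (local.timestamp() >= remote.timestamp()) ? local : remote;
    }
}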


Let us consider a protocol allowing multiple simultaneous writers on the replicated object. Then, after the execution phase, the validation phase (conflict detection component) checks for conflicts. If a conflict appears, conflict resolution is invoked and a coordination phase is started.

5.4. Components dependent on the coherency protocol

This category introduces components particular to specific protocols. In some protocols, it is necessary to manage the call replication problem in the Access Phase. This problem arises when a replicated object A sends a request to another replicated object B: if there are n replicas of A, B may receive n copies of the request but should execute it only once. This situation can be handled in different ways, at the client side (A) or at the server side (B). The replicated messages manager component is introduced for this purpose.
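A classical server-side treatment is deduplication by request identifier; the sketch below is ours:

import java.util.HashSet;
import java.util.Set;

// Hypothetical server-side handling of the call replication problem: the
// n replicas of A may each forward the same request to B, which must
// execute it only once.
final class ReplicatedMessagesManager {
    private final Set<String> seenRequestIds = new HashSet<>();

    // Returns true if the request must be executed, false if it is a
    // duplicate already processed on behalf of another replica of A.
    synchronized boolean accept(String requestId) {
        return seenRequestIds.add(requestId); // add() is false when already present
    }
}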

The synchronization group component is introduced for the Coordination Phase. This component decides which replicas are involved in the synchronization process. For example, in epidemic protocols, updates are forwarded gradually.

The Execution Phase may also require ordering the operations to be executed on local replicas. This permits executions in orders other than the arrival order. The message ordering component provides support for this feature.

The Validation Phase of some protocols requires actions related to fault tolerance. Two components are introduced for this purpose: the consensus manager and the failing replica manager components. The latter interacts with a fault-tolerance service that detects dead replicas and decides what to do until the replicas revive. This component may interact with the group membership component. The information collector component is also introduced to collect the results produced by the different replicas.

As an example, let us change the ROWA protocol into ROWAA. The components of the access, coordination and execution phases remain unchanged. It is now necessary to use a validation phase with the failing replica manager. Moreover, this phase interacts with the coordination phase. If the failing replica manager detects dead replicas, it informs the synchronization manager, which handles the deferred synchronization process.

With the RS2.7 component approach, we have demonstrated how replication protocols can be built by assembling components. We have also shown that existing protocols can easily be enhanced by incrementally adding new components to existing assemblies. This characterizes the ability of our framework to adapt smoothly to situations that can largely differ from each other.


6. Implementation and experimentation

This section presents an implementation of RS2.7 (Section 6.1) and an experimentation to validate our approach. The validation has two main objectives: to show the adaptability inside the framework (Section 6.2) and the adaptability to the application context (Section 6.3).

6.1. Implementation

The basic architectural principle that governs RS2.7 consists in interposing mediation objects to access replicas in order to provide transparency. They are then linked to binding objects that bind replicas to each other. The mediation objects also manage replication by implementing the appropriate coherency protocol.

"! # $%&$'( )!(%#*(+,-./-"! (%0%1+,-2.*3-%"!(4562-782'39-2:1/""! (402';2/-'</%"!

(a) Memory 1 (b) Memory 2

Figure 4. (a) Replicable object creation / (b) Replica creation

(Figure content: (a) replica p1 attached to binding representative b1 in Memory 1; (b) the binder records the binding np <-> b1; (c) a second replica p2 with binding representative b2 in Memory 2, the binder recording np <-> b1, b2.)

Figure 5. Binding in RS2.7

The process to obtain a replicable object is the following (figure 5a and figure 5b):

– To ask a binding factory for the creation of a new binding representative (figure 4a, line 1),

– To associate the application object p1 (this is the first replica) with this binding representative (figure 4a, line 2) and

– To export the binding in the domain of replicable objects (figure 4a, line 3). A binder is a particular kind of naming context that can define a name for a replicable object and that can associate a binding with it. The name associated with a binding is ensured to be unique.


To create a new replica, the process is the following (figure 5c; a tentative reconstruction of both listings is sketched after this list):

– To ask a binding factory for the creation of a new binding representative (figure 4b, line 1),

– To associate the application object p2 (this is the second replica) with this binding representative (figure 4b, line 2), and

– To bind this object to the relevant replicable object, here named np (figure 4b, line 3).
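The listing of figure 4 can only be reconstructed tentatively; BindingFactory, Binder and all method names below are our own, guided by the steps and names (p1, p2, np) above:

// Hedged reconstruction of the figure 4 listings.
interface BindingRepresentative { void associate(Object replica); }
interface BindingFactory { BindingRepresentative newBindingRepresentative(); }
interface Binder {
    void export(String name, BindingRepresentative b); // names a new replicable object
    void bind(String name, BindingRepresentative b);   // joins an existing one
}

final class Figure4 {
    // (a) Creation of a replicable object from its first replica p1.
    static void createReplicableObject(BindingFactory bf, Binder binder, Object p1) {
        BindingRepresentative b1 = bf.newBindingRepresentative(); // figure 4a, line 1
        b1.associate(p1);                                         // line 2: first replica
        binder.export("np", b1);                                  // line 3: unique name
    }

    // (b) Creation of a new replica p2 of the object named "np".
    static void createReplica(BindingFactory bf, Binder binder, Object p2) {
        BindingRepresentative b2 = bf.newBindingRepresentative(); // figure 4b, line 1
        b2.associate(p2);                                         // line 2: second replica
        binder.bind("np", b2);                                    // line 3: bind to "np"
    }
}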

The coherency protocol that manages a replicable object is distributed on each replica through the binding representatives. The construction of a particular protocol consists in assembling the appropriate functional components (Section 5) within each phase of the abstract protocol (Section 4), and in integrating these different phases (scheduling components) through the relevant scheduling algorithm. All this is embedded into binding objects.

The implementation of a simple lazy master/slaves protocol illustrates this process. In this protocol, writes are permitted only on the master replica, and reads are allowed on all replicas (slave replicas included). The master sends updates asynchronously to all the slaves. Thus, this protocol implements the convergent replicas model with reads on divergent replicas (Section 2.2, coherency model 3a).

In this implementation, the behaviour of the binding representatives is similar for all slaves but differs for the master. The same phases are executed on the master and the slaves (i.e., access phase, coordination phase, execution phase and response phase) but the active components inside them differ. For the master, the access phase consists of a role manager component; the coordination phase consists of a starting synchronization manager, an updates log, a synchronization messages factory and a communication manager component; the execution phase consists of an access replica component; and the response phase consists of a response manager. For the slaves, the coordination phase consists of a synchronization messages listener manager. The other phases are identical.

The sequence diagrams of figure 6 show the process involved inside the framework when performing writes, reads or synchronization. When a read operation is submitted to the master or to a slave (figure 6b), the operation passes first through the access phase, where the role manager decides if the operation is allowed. Next, the execution phase executes the operation by accessing the access replica manager. The response is sent to the caller during the response phase by the response manager.

A write operation submitted to a slave passes first through the access phase, where the role manager decides that the write is not allowed. An error message is then sent by the response phase by way of the response manager. On the other hand, the same operation submitted to the master (figure 6a) passes first through the access phase and next through the coordination phase. The operation is recorded by the updates log and then executed by the execution phase, which accesses the access replica manager.


(Figure content: four sequence diagrams involving the caller, the access, coordination, execution and response phases and their components: (a) write operation on the master; (b) read on the master and on the slaves; (c) synchronization process on the master; (d) synchronization on the slaves.)

Figure 6. A simple lazy master/slaves protocol

Asynchronously, the starting synchronization manager (figure 6c) launches the synchronization process according to conditions. In this case, the coordination phase constructs the synchronization message with the synchronization messages factory and the updates log. This message is sent to all replicas by the communication component. Each slave receives this message and updates its associated replica by performing the execution phase (figure 6d).

We now illustrate the advantages of our approach by showing how it can be adapted to changes in the coherency protocol (Section 6.2) and to application contexts (Section 6.3).

6.2. Adaptability inside the framework

In order to validate the decomposition of the replication functionality proposed in Sections 4 and 5, several protocols, which implement the four coherency models (Section 2.2), have been developed in Java. For the time being, the composition of these components and the composition of the phases are hand coded. Our first goal is to validate the functional and scheduling decompositions.

A first kind of adaptability inside the framework is obtained by the fact that each component of the functional architecture has a generic interface, allowing various implementations.


For instance, the updates log component of the previous example (Section 6.1) can be implemented with a log or a file, or merely by directly performing the execution phase, using its access replica manager to access the replica state. In the same way, if the replica is an HTML page, only the access replica manager in the execution phase has to be changed. A dynamic master could also be introduced. In this case, the binding representatives are the same for all the replicas. The role manager decides which replica is the master according to a specific algorithm; an implementation can be an election between all replicas or a “token ring”-like algorithm. The conflict detection and conflict resolution components can typically be implemented in several ways according to the context.

A second way to obtain adaptability inside the framework is to modify a phase (adding or deleting components, as well as changing the composition of the basic components). For instance, a dispatcher component can be added to the access phase of the slaves. Thus, when a write operation is submitted to a slave, the operation is forwarded to the master. As a second example, the starting synchronization manager can be removed, so that the synchronization process starts whenever there is a write operation on the master. This modification allows the initial protocol to evolve into a one copy equivalence protocol (model 1 in Section 2.2). In this case the coordination phase also needs to interact with a concurrency control that locks all replicas during the write operation.
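The first of these modifications could be sketched as follows (hypothetical names of ours):

// Sketch: a dispatcher added to the slaves' access phase, so that writes
// submitted to a slave are forwarded to the master instead of being refused.
interface Master { void write(String object, Object value); }

final class SlaveWriteDispatcher {
    private final Master master;

    SlaveWriteDispatcher(Master master) {
        this.master = master;
    }

    void onWrite(String object, Object value) {
        master.write(object, value); // forward instead of returning an error
    }
}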

(Figure content: sequence diagram of a read(x) on a slave: the access phase and its role manager accept the read, the coordination phase builds a synchronization message with the synchronization messages factory, sends it to the master through the group membership manager and group multicast, receives the master’s update write(x,3), applies it in the execution phase through the access replica manager, and the response phase returns the value 3 to the caller.)

Figure 7. Read on a slave in a pull lazy master/slaves protocol

A third kind of adaptability inside the framework consists in changing a binding representative. For instance, the slave binding representatives can synchronize with their master before a read operation (figure 7). In Section 6.1, the protocol uses a push policy for synchronization, while it uses a pull/push policy here: the slaves pull updates before reads and the master pushes them according to its starting synchronization manager.


If the master binding representative is not changed, it sends updates to all replicas whenever a particular replica is about to be read. The master binding representative can be modified to send updates only to the replica that will be read.

6.3. Adaptability to the application context

The replication framework has been used in the context of the PING (Platform for Interactive Networked Games) European project [ENS 01]. The PING project intends to specify, develop and demonstrate a flexible and scalable architecture for large-scale interactive multi-participant applications over the Internet. The context is not transactional. In this implementation (developed in Java), the focus has been put on the separation of concerns among coherency protocols, concurrency control and consistency protocols, as described in Section 3. Several coherency protocols have been implemented for the convergent replicas models with reads on divergent replicas (Section 2.2, model 3a) and with writes on divergent replicas (Section 2.2, model 3b), but without decomposition inside the binding representatives (for performance reasons). These coherency protocols are used to implement several consistency models (sequential and several variants of causal models).

Our framework is also under validation within a transactional context, within a fault-tolerant context, and within a mobile transactional context as well. So far, no modification of the framework has been required.

7. Related Work

Replication is frequently used in group communication systems, shared memories, file systems, DBMSs and object-oriented platforms. A large number of ad hoc replication techniques and protocols have been proposed [GRA 96, KEM 00, KIN 99, DRA 01]. There is also much academic and industrial activity on the design and implementation of adaptable replication frameworks.

A standard approach to obtain separation of concerns between the application code and the replication code is to use inheritance: objects inherit adequate behaviors from a set of predefined classes. Another approach is to use reflective facilities that rely on two object levels: a base level and a meta level (Garf [GAR 95], Core [BRU 98], RepliXa [KLE 96], Globe [Van 99]).

Besides this separation of concerns, our proposal also considers the separation of concerns between the replication framework and other features.

Concerning the supported replication protocols, existing work may be classified into two categories: work proposing a limited support of replication protocols to achieve fault tolerance, and work proposing a more generic approach.

In the first category, the range of provided replication techniques is limited to strong coherency among replicas. Their objective is to provide fault tolerance through replication techniques like active replication or passive replication [LIT 94].


For example, Garf [GAR 95] is an object-oriented environment that simplifies the development of fault-tolerant applications by separating the distributed behaviors of objects from their functionality. It offers a library of ready-to-use components, behavioral objects, providing adequate support for fault tolerance through replication (active and passive replication). Behavioral classes are implemented using the Isis toolkit, which provides fault tolerance at the Unix process level. The replicas are managed using group communication, implemented through multicast primitives. These primitives must ensure that component failures do not compromise the consistency of the logical state managed by group members. We find a similar approach in work around Corba [MAF 95, NAR 01, NAR 02, FEL 00], which combines a Corba programming environment with a group communication system like that found in Isis.

Systems of the second category allow a large number of replication techniques [BRU 98, BEE 96, KER 98]. Core [BRU 95, BRU 98] is an architecture and a runtime environment for adaptable replicated objects. A replica consists of three components: first, a local copy that contains the state and offers an interface to manipulate it; second, an access object which wraps the local copy and controls access to it; third, a consistency manager that cooperates with the access object to maintain the consistency of the local copy. Local copies and access objects are specific, but consistency managers are generic. One selects a replica control strategy by instantiating the appropriate consistency manager. The consistency manager mainly handles locking, synchronization and the forwarding of updates. The Globe [KER 98] system also provides a flexible framework for associating consistency models with distributed objects. Other work proposes to manage replication in an adaptable way in DSMs (Munin [CAR 91], TreadMarks [AMZ 96], Midway [BER 93], Arias [DÉC 96]). These systems offer the application developer different implementations of consistency models.

We propose a framework able to implement any protocol; our proposal is thus more general than the first category. Moreover, in our opinion, the works of the second category propose more than a replication framework: they combine concurrency, replication and consistency protocols. They do not really isolate what is specific to replication and do not extract general replication functionalities. This limits the adaptability, flexibility and reusability of their frameworks.

8. Conclusions and future work

This paper contributes to a clear separation of the replication functionalities in order to enhance adaptability. The RS2.7 replication framework provides support for replica life cycle management and for coherency protocols.

It is used along with other services in order to build various replication policies. These services use the life cycle management to create and delete replicas according to the policies (when to create/delete replicas, how many, where to create them).


The replication policies also build on the coherency model supported by the replication framework.

A coherency model defines how users perceive the different replicas of a replicated object. Four coherency models are proposed: the one copy equivalence model, the divergent replicas model, the convergent replicas model with reads on divergent replicas, and the convergent replicas model with writes on divergent replicas. A large variety of protocols can implement these models. The coherency model is part of the consistency model. This separation between coherency models and consistency models permits the use of our framework in several contexts like transactional or fault-tolerant ones.

Moreover, the RS2.7 replication framework is intended to be adaptable so as to support a large number of coherency protocols. Although designing such a generic layer is difficult, an abstract coherency protocol has been proposed. It distinguishes five logical phases: the access, coordination, execution, validation and response phases. The flexibility to support various coherency protocols is guaranteed by a functional architecture that identifies the several components involved in the construction of coherency protocols. These components also allow a better composition between the replication framework and other services like fault tolerance or concurrency control.

A first implementation has been made in order to validate the functional and scheduling decompositions. A second prototype of RS2.7 has been achieved and is integrated in a Platform for Interactive Networked Games (PING [ENS 01]). This experience has focused on the separation of concerns between coherency protocols, concurrency control and consistency protocols. Performance issues are considered.

On-going work includes the use of the ObjectWeb Fractal component framework⁴ [COU 01], which allows the composition to be described and an optimized implementation to be generated.

This approach makes it possible to take advantage of the openness of our framework and of its internal architecture. The composition could be static or dynamic, depending on the trade-offs to be made between performance and dynamic adaptation (i.e., dynamic reconfiguration). In this respect, it is also intended to experiment with component merging patterns for performance enhancement. Moreover, transactional contexts are heavily investigated with RS2.7, especially environments like EJB platforms.

Acknowledgements

A grand merci to Elizabeth Perez Cortez, Patricia Serrano-Alvarado, Alexandre Lefebvre and the NODS members for very useful discussions on our work.

4. http://www.objectweb.org/architecture/component/index.html


9. References

[AMZ 96] AMZA C., COX L., DWARKADAS S., KELEHER P., LU H., RAJAMONY R., YU W., ZWAENEPOEL W., “TreadMarks: Shared Memory Computing on Networks of Workstations”, IEEE Transactions on Computers, vol. 29, p. 18-28, 1996.

[BEE 96] BEEDUBAIL G., POOCH U., “An Architecture for Object Replication in Distributed System”, report num. TR-96-006, 1996, Department of Computer Science, Texas A&M University, College Station, Texas, USA.

[BER 84] BERNSTEIN P., GOODMAN N., “An Algorithm for Concurrency Control and Recovery in Replicated Distributed Databases”, ACM TODS, vol. 9, p. 596-615, December 1984.

[BER 93] BERSHAD B., ZEKAUSKAS M., SAWDON W., “The Midway Distributed Shared Memory System”, Proc. of COMPCON Spring, San Francisco, CA, USA, February 22-26, 1993, p. 528-537.

[BRU 95] BRUN-COTTAN G., MAKPANGOU M., “Adaptable Replicated Objects in Distributed Environments”, report num. 2593, May 1995, INRIA.

[BRU 98] BRUN-COTTAN G., “Cohérence de Données Répliquées Partagées par un Groupe de Processus Coopérant à Distance”, PhD thesis, Université Pierre et Marie Curie, Paris VI, September 1998.

[CAR 91] CARTER J., BENNETT J., ZWAENEPOEL W., “Implementation and Performance of Munin”, Proceedings of the 13th Symposium on Operating Systems Principles, vol. 25, Pacific Grove, California, USA, 1991, Operating Systems Review, p. 152-164.

[CHA 00] CHAUDHURI S., WEIKUM G., “Rethinking Database System Architecture: Towards a Self-tuning RISC-style Database System”, Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, 2000.

[COL 00] COLLET C., “The NODS project: Networked Open Database Services”, ECOOP, 2000.

[COU 01] COUPAYE T., LENGLET R., BEAUVOIS M., BRUNETON E., DÉCHAMBOUX P., “Composants et Composition dans l’Architecture de Systèmes Répartis”, Journées Composants : Flexibilité du système au langage, ASF (ACM SIGOPS France), Besançon, France, October 2001.

[DÉC 96] DÉCHAMBOUX P., HAGIMONT D., MOSSIÈRE J., ROUSSET DE PINA X., “The Arias Distributed Shared Memory: an Overview”, Proc. of the 23rd Seminar on Current Trends in Theory and Practice of Informatics, INRIA, November 1996.

[DIT 01] DITTRICH K., GEPPERT A. (EDS), Component Database Systems, Morgan Kaufmann Publishers, 2001.

[DRA 01] DRAPEAU S., RONCANCIO C., “Concepts et Techniques de Duplication”, report, 2001, Laboratoire LSR IMAG.

[ENS 01] ENST, FRANCE TÉLÉCOM R&D, IMAG, UNIV. READING, SICS, “Object and Event Management: First Specification”, report num. PING IST-1999-11488, Deliverable 2.2, June 2001.

[FEL 00] FELBER P., GUERRAOUI R., SCHIPER A., “Replication of CORBA Objects”, vol. 1752 of LNCS, p. 254-276, Springer-Verlag, Berlin, Heidelberg, 2000.

[GAL 95] GALLERSDÖRFER R., NICOLA M., “Improving Performance in Replicated Databases Through Relaxed Coherency”, Proc. of the VLDB Conf., Switzerland, 1995.


[GAR 95] GARBINATO B., GUERRAOUI R., MAZOUNI K. R., “Implementation of the GARF Replicated Objects Platform”, Distributed Systems Engineering Journal, vol. 2, 1995, p. 14-27.

[GOO 83] GOODMAN N., SKEEN D., CHAN A., DAYAL U., FOX S., RIES D., “A RecoveryAlgorithm for a Distributed Database System”, Proceedings of the 2nd ACM SIGACT-SIGMOD, Atlanta, GA, March 1983, p. 8-15.

[GRA 96] GRAY J., HELLAND P., O’NEIL P., SHASHA D., “The Danger of Replication anda Solution”, ACM SIGMOD International Conference on Management of Data, Montreal,June 1996.

[HAM 99] HAMILTON J., “Networked Data Management Design Points”, Proceedings of25th International Conference on Very Large Data Bases, Edinburgh, Scotland, September1999.

[KEM 00] KEMME B., ALONSO G., “Don’t be lazy, be consistent: Postgres-R, a new way toimplement Database Replication”, Proc. of 26th International Conference on Very LargeDatabases (VLDB), Cairo, Egypt, 2000.

[KER 98] KERMARREC A.-M., KUZ T., VAN STEEN M., TANENBAUM A., “A Frameworkfor Consistent, Replicated Web Objects”, Proceedings of the 18th International Conferenceon Distributed Computing Systems, Amsterdam, The Netherlands, 1998, IEEE ComputerSociety, p. 276-284.

[KIN 99] KINDLER A., “A Classification of Consistency Models”, report num. B99-14, Oc-tober 1999, Humboldt-Universitat zu Berlin, Institut f

r Informatik, D-10099 Berlin, Ger-many.

[KLE 96] KLEIN

DER J., GOLM M., “Transparent and Adaptable Object Replication Usinga Reflective Java”, report num. TR-I4-96-07, September 1996, Universit

t Erlangen-N

rnberg.

[LIT 94] LITTLE M., SHRIVASTAVA S., “Object Replication in Arjuna”, Broadcast projectdeliverable report, November 1994, Dept. of Computing Science, University of Newcastle,UK.

[MAF 95] MAFFEIS S., “Adding Group Communication and Fault-Tolerance to CORBA”,USENIX Conference on Object-Oriented Technologies, June 1995.

[NAR 01] NARASIMHAN P., MOSER L. E., MELLIAR-SMITH P. M., “A Component-basedFramework for Transparent Fault-Tolerant CORBA”, Software Practice and Experience,Theme Issue on Enterprise Frameworks, , 2001.

[NAR 02] NARASIMHAN P., MOSER L., MELLIAR-SMITH P., “Strong Replica Consistencyfor Fault-Tolerant CORBA Applications”, Journal of Computer System Science and Engi-neering, , 2002.

[POW 91] POWELL D., VERISSIMO P., “Delta4: A Generic Architecture for DependableComputing”, chapter 6, p. 89-123, Distributed Fault-Tolerance, Springer-Verlag, 1991.

[SIL 97] SILBERSCHATZ A., ZDONIK S., “Database Systems - Breaking out the Box”, SIG-MOD Record, 26(3), September 1997.

[Van 99] VAN STEEN M., HOMBURG P., TANENBAUM A., “Globe: A Wide-Area DistributedSystem”, IEEE Concurrency, , 1999, p. 70-78.

[WIE 00] WIESMANN M., PEDONE F., SCGIPER A., KEMME B., ALONSO G., “Understand-ing Replication in Databases and Distributed Systems”, ICDCS’00, Taipei, Taiwan, April2000.


La tolérance aux fautes adaptable pour les systèmes à composants : application à un gestionnaire de données

Phuong-Quynh Duong* — Elizabeth Pérez Cortés** — Christine Collet*

* Laboratoire LSR/IMAG, 681, rue de la Passerelle, 38400 Saint Martin d’Hères, FRANCE

phuong-quynh.duong, [email protected]

** Depto. de Ingeniería Eléctrica, UAMI, Av. San Rafael Atlixco 186, Col. Vicentina 09340, México DF

[email protected]

RÉSUMÉ. Ce papier présente notre approche pour la définition d’un framework qui autorise l’adaptation de la tolérance aux fautes aux systèmes à composants. Nous considérons que le processus permettant de fournir la tolérance aux fautes adaptable peut se faire en deux étapes : la personnalisation et la régulation dynamique. Le travail présenté dans ce papier concerne la personnalisation de la tolérance aux fautes. Nous avons défini la notion de niveau de tolérance aux fautes pour un système. À partir du niveau souhaité, nous avons identifié l’ensemble des éléments devant être intégrés dans le système cible. Pour illustrer notre approche nous considérons un gestionnaire de données à composants auquel on associe deux niveaux différents de tolérance aux fautes.

ABSTRACT. This paper presents the approach of defining a framework for adaptable fault tolerance in component-based systems. We have divided the process of providing adaptable fault tolerance into two stages: customization and adaptivity. The paper focuses on the first stage: fault tolerance customization. We defined the notion of fault tolerance level for a given system. From a required level, we identified the set of needed elements to be integrated into the target system. A component-based persistent object manager with two different required levels of fault tolerance has been chosen as target system to illustrate our approach.

MOTS-CLÉS : Services base de données, Architectures à base de composants, Tolérance aux fautes, Niveau de tolérance aux fautes.

KEYWORDS: Database services, Component-based architecture, Fault tolerance, Fault tolerance level.


1. Introduction

L’utilisation croissante des systèmes informatiques dans le quotidien augmente de manière proportionnelle les contraintes de sûreté de fonctionnement qu’ils doivent respecter. Plus ils sont utilisés, plus une défaillance a de l’impact [AVI 97]. En outre, ces systèmes sont de plus en plus complexes, évoluent dans un environnement réparti, avec des éléments mobiles, gèrent des données multimédia, etc. De ce fait, ils ne sont plus construits de manière monolithique mais à partir d’un ensemble de composants, chacun prenant en charge des tâches précises. Des efforts sont menés dans le but de construire ces composants. Ainsi des composants dédiés au contrôle de concurrence, à la désignation, à la persistance, etc. existent actuellement [Gar 02a, DRA 02]. L’approche système à composants a des propriétés permettant l’adaptabilité : le système intègre seulement les éléments qui lui sont nécessaires et peut choisir ceux qui mettent en oeuvre les politiques ad hoc. Ce type de systèmes doit aussi pouvoir assurer une tolérance aux fautes adaptée à la technologie composant, et donc adaptable.

Ce travail s’inscrit dans le cadre du projet NODS (Networked Open Database Services) [COL 00] (http://www-lsr.imag.fr/storm.html) qui vise la construction des systèmes de gestion de données à composants. Dans le but d’avoir des systèmes de données sur mesure, des services d’évènements, de requêtes, de persistance, et de duplication sont en construction [Gar 02a, DRA 02, Gar 02b, SOL 00]. Ces services sont adaptables dans la mesure où le choix des politiques à appliquer est laissé au concepteur du système. L’objectif du travail présenté dans cet article est de construire un framework de tolérance aux fautes permettant au constructeur d’un tel système à composants de choisir le niveau de tolérance aux fautes (NTF) et d’intégrer d’une manière simple les mécanismes nécessaires pour maintenir ce niveau.

La suite de cet article est structurée de la manière suivante : la section 2 contient la description détaillée de l’approche. La section 3 présente la notion de NTF, les différents niveaux possibles, ainsi que les correspondances entre les NTFs et les mécanismes nécessaires à leur support. Des exemples d’application de l’approche pour un gestionnaire de données sont présentés en section 4. Dans la section 5 notre approche est comparée qualitativement à d’autres propositions et finalement, les conclusions et les perspectives de ce travail sont données dans la section 6.

2. Approche

Traditionnellement, un système tolérant aux fautes est construit en intégrant des éléments redondants qui sont superflus dans un cas normal, mais qui sont nécessaires dans le cas d’une faute. Durant l’exécution, le système est surveillé pour détecter la présence de fautes et, lors de l’apparition d’une faute, la reprise est réalisée en utilisant les éléments redondants inclus préalablement. Le choix des éléments redondants à inclure dépend des types de fautes à tolérer et du fonctionnement requis. De ce fait, la tolérance aux fautes peut être fournie à différents niveaux, chacun nécessitant une infrastructure particulière et introduisant des sur-coûts différents.


Dans notre approche, nous distinguons deux étapes dans le processus permettant de fournir la tolérance aux fautes adaptable. Ces deux étapes sont appelées respectivement personnalisation et régulation dynamique du NTF. La première étape vise à aider les constructeurs à personnaliser le NTF de leurs systèmes. Il faut :

i) fournir un moyen de préciser le NTF requis,

ii) guider le choix des éléments redondants, de surveillance, de détection et de récupération en fonction du NTF spécifié,

iii) faciliter l’intégration de ces éléments dans le système cible, et

iv) assurer l’exécution correcte du système intégré.

Notre objectif est de fournir pour cela un framework de personnalisation de la tolérance aux fautes. Le terme framework désigne ici un ensemble d’interfaces avec une implémentation partielle. En effet, seul un sous-ensemble des interfaces sera implémenté tandis que le reste doit être pris en charge par le système cible. Cette approche est bien adaptée pour les composants gris. Un composant est dit gris lorsqu’il permet d’exposer son architecture, ce qui facilite les démarches des mécanismes de tolérance aux fautes.

La deuxième étape, la régulation dynamique du NTF, tient compte du fait que la tolérance aux fautes est coûteuse par nature et requiert une quantité minimale de ressources pour être assurée. Étant donné que les ressources dans un système ne sont pas constantes, par exemple, les machines sont ajoutées et retirées dynamiquement, le NTF requis ne peut pas toujours être fourni. En revanche, le système doit ajuster le NTF en fonction des ressources disponibles. Cet ajustement peut entraîner une évolution de l’architecture de l’application qui doit être réalisée en surveillant la cohérence de l’exécution. Notre framework sera étendu pour prendre en compte cet aspect et obtenir ainsi un framework adaptable pour la tolérance aux fautes.

Ce papier décrit les premiers éléments de définition de notre framework. Il concerne l’étape de personnalisation du NTF des systèmes. Nous supposons que les ressources disponibles sont infinies et que les éléments qui constituent un système respectent leur spécification en l’absence de fautes. Nous supposons également que ces éléments n’intègrent aucun mécanisme de tolérance aux fautes.

3. Niveaux de la tolérance aux fautes

Le comportement d’un système peut être caractérisé par un ensemble de propriétés. Ces propriétés peuvent être classifiées en propriétés de sûreté (safety properties) et propriétés de vivacité (liveness properties) [LAM 77]. Intuitivement, les propriétés de sûreté indiquent que les mauvaises choses n’arrivent jamais, et les propriétés de vivacité indiquent que les bonnes choses arrivent. Une faute a lieu quand au moins une de ces propriétés n’est pas vérifiée.

Les fautes qui surviennent au cours du cycle de vie du système et qui ont une cause matérielle sont appelées fautes opérationnelles (operational faults). Les fautes dues à une mauvaise conception sont appelées fautes de conception (design faults) [JAL 94]. On connaît a priori l’ensemble des propriétés qui potentiellement ne sont pas respectées lors de la présence d’une faute de conception. La présence d’une faute opérationnelle peut violer n’importe quelle propriété du système.

3.1. Types de fautes et leur relation d’inclusion

Les types de fautes pris en compte dans notre travail font partie des types de fautes décrits dans [CRI 91]. Ils sont les suivants :

Panne franche (Crash) : Le système s’arrête prématurément. Les pannes franches peuvent être classifiées selon l’état du système lors de la reprise de fonctionnement :

– Panne franche amnésique (Amnesia-crash) : Le système est relancé à partir d’un état prédéfini indépendant de l’état au moment de la panne franche. Il perd donc les informations produites pendant son exécution antérieure.

– Panne franche partiellement amnésique (Partial amnesia-crash) : Certaines parties de l’état du système sont restaurées à leur état au moment de la panne franche, les autres parties sont relancées à partir d’un état prédéfini, par exemple l’état initial.

– Panne franche pause (Pause-crash) : Le système est relancé à partir de son état au moment de la panne franche.

– Panne franche définitive (Halting-crash) : Le système n’est pas relancé.

Omission (Omission) : Le système est en panne d’omission s’il ne répond pas à une requête, c-à-d lorsque le résultat du traitement d’une requête n’est pas observé.

Réponse tardive (Late timing) : La réponse provenant d’un système arrive en retard par rapport au délai prévu.

Valeur (Value) : La valeur de la réponse fournie par le système n’est pas correcte.

Byzantine (Byzantine) : Le comportement du système est arbitraire vis-à-vis de la panne.

Les types de fautes présentés ci-dessus appartiennent à la classe des fautes opérationnelles (dans la suite, nous utilisons les termes en anglais pour désigner les différents types de fautes). Ces fautes ne sont pas indépendantes. Par exemple, un crash entraîne une faute d’omission car lorsqu’un système a un crash, il ne peut plus répondre à une demande. Par conséquent, il a aussi une faute d’omission.

La relation d’inclusion [JAL 94] montrée dans la figure 1 établit de manière naturelle une hiérarchie entre les différents types de fautes. En conséquence, nous définissons la relation parent entre deux types de fautes comme suit :

Definition 1 Le type de fautes f1 est le parent du type de fautes f2 si l’ensemble f2 est inclus dans l’ensemble f1. Par simplicité nous dirons que f1 est le parent de f2 ou nous écrirons simplement f2 ⊆ f1.


Crash ⊆ Omission ⊆ Timing ⊆ Byzantine ; Value ⊆ Byzantine

Figure 1. Relation d’inclusion entre les types de fautes
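Cette hiérarchie se prête à une représentation directe en code. L’esquisse Java ci-dessous est une illustration de notre part (les noms FaultType, parent et isIncludedIn ne proviennent pas de l’article) :

    /** Esquisse : types de fautes de la figure 1 et leur relation d'inclusion. */
    public enum FaultType {
        BYZANTINE, TIMING, OMISSION, CRASH, VALUE;

        /** Parent immédiat (sur-ensemble) dans la hiérarchie ; null pour la racine. */
        public FaultType parent() {
            switch (this) {
                case TIMING:   return BYZANTINE;
                case OMISSION: return TIMING;
                case CRASH:    return OMISSION;
                case VALUE:    return BYZANTINE;
                default:       return null; // BYZANTINE est la racine
            }
        }

        /** Vrai si l'ensemble de fautes this est inclus dans l'ensemble other. */
        public boolean isIncludedIn(FaultType other) {
            for (FaultType t = this; t != null; t = t.parent()) {
                if (t == other) return true;
            }
            return false;
        }
    }

Par exemple, CRASH.isIncludedIn(OMISSION) rend true : un crash est aussi une faute d’omission.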

3.2. Définition du niveau de tolérance aux fautes

Nous nous inspirons de la définition donnée dans [GAR 99] pour définir le NTF d’un système en fonction de ses propriétés et des fautes considérées.

Un système est dit tolérant au type de fautes f pour la propriété p si une des conditions suivantes est vraie :

– La propriété p est préservée lors de la présence de la faute de type f. Dans ce cas la faute est dite masquée.

– La propriété p est violée lorsque la faute de type f se produit mais, quand la faute ne se produit plus, le système rétablit son exécution et la propriété est à nouveau satisfaite. Dans ce cas la faute est perçue mais le système est capable de se remettre et de continuer à fonctionner normalement.

En conséquence, pour un type de fautes il est possible d’avoir l’un des deux niveaux de tolérance : masquage (masking) ou non masquage (unmasking). Le premier est plus “fort” que le deuxième.

Nous nous intéressons aussi au cas où le système est capable de signaler la présence d’une faute. Ce niveau intermédiaire est appelé signalisation (signaling), et il est moins “fort” que le niveau de non masquage. Le plus “faible” niveau est de ne rien faire par rapport à une faute. Il est appelé néant (nothing) et complète le spectre des possibilités. Par la suite, nous utilisons les termes anglais : masking, unmasking, signaling et nothing pour décrire les valeurs possibles d’un niveau de tolérance aux fautes.

Étant donné la relation d’inclusion entre les types de fautes, le choix du niveau pour un type de fautes influence le choix du niveau pour les types de fautes qui se trouvent au-dessous dans la hiérarchie.

Comme nous l’avons déjà mentionné, les types de fautes pris en compte dans notre proposition actuelle sont exclusivement les fautes opérationnelles. Par conséquent, nous devons nous intéresser à toutes les propriétés du système dans la définition du NTF d’un système :

Definition 2 Un niveau de tolérance aux fautes (NTF) est un ensemble de paires ⟨f, l⟩ où f est un type de fautes et l ∈ {nothing, signaling, unmasking, masking}, tel que si f1 est le parent de f2, alors le niveau associé à f2 est plus “fort” que ou égal au niveau associé à f1.

Le NTF suivant est valide sur les types de fautes donnés dans la figure 1 :

⟨(Crash, masking), (Omission, masking), (Late timing, unmasking), (Value, nothing), (Byzantine, nothing)⟩

Dans cet exemple, puisque les fautes late timing ont un niveau de tolérance aux fautes unmasking, le niveau pour les fautes d’omission peut donc prendre une des deux valeurs : unmasking ou masking. Si la valeur masking est choisie, le niveau pour les crash doit être masking car les fautes d’omission incluent les crash.

Il faut noter que la contrainte sur les valeurs possibles pour chaque type de fautes dans la définition 2 réduit le nombre des NTFs valides à un nombre fini, permettant ainsi de les lister. À partir de la liste des NTFs possibles, nous pouvons donc établir les correspondances entre un niveau de tolérance aux fautes donné et les éléments redondants, les éléments de surveillance, le processus de diagnostic et les éléments de récupération nécessaires. Cette démarche est présentée dans la sous-section suivante.
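À titre d’illustration, la contrainte de la définition 2 peut se vérifier mécaniquement. L’esquisse Java ci-dessous est une proposition de notre part (les noms Ntf, Level et isValid sont hypothétiques) et réutilise le type FaultType esquissé plus haut :

    import java.util.EnumMap;
    import java.util.Map;

    /** Esquisse : un NTF associe un niveau à chaque type de fautes. */
    public class Ntf {
        /** Niveaux ordonnés du plus "faible" au plus "fort". */
        public enum Level { NOTHING, SIGNALING, UNMASKING, MASKING }

        private final Map<FaultType, Level> levels = new EnumMap<>(FaultType.class);

        public void set(FaultType f, Level l) { levels.put(f, l); }

        /**
         * Définition 2 : si f1 est le parent de f2, le niveau de f2
         * doit être plus fort que ou égal au niveau de f1.
         */
        public boolean isValid() {
            for (FaultType f : FaultType.values()) {
                FaultType parent = f.parent();
                if (parent == null) continue;
                Level lf = levels.getOrDefault(f, Level.NOTHING);
                Level lp = levels.getOrDefault(parent, Level.NOTHING);
                if (lf.ordinal() < lp.ordinal()) return false;
            }
            return true;
        }
    }

Le NTF de l’exemple ci-dessus passe ce test ; un NTF donnant nothing au crash et masking à l’omission serait rejeté.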

3.3. Correspondances entre le niveau de tolérance souhaité et les éléments nécessaires

À partir de la définition du NTF, nous sommes capables d’identifier d’une manière systématique les éléments nécessaires à intégrer dans un système afin que le NTF demandé soit assuré. Pour cela, nous avons établi trois tableaux dont le contenu guide le choix des moyens logiciels permettant d’assurer le NTF souhaité. Ces tableaux de correspondance ont été conçus en utilisant les solutions existantes dans le domaine de tolérance aux fautes. Ce sont :

– Le Tableau des éléments redondants (TER) qui établit les correspondances entre le niveau souhaité et les éléments redondants ainsi que les procédures de récupération associées ;

– Le Tableau de surveillance (TS) qui établit les correspondances entre les types de fautes pris en compte et les informations de surveillance à recueillir ;

– Le Tableau de diagnostic (TD) qui établit les correspondances entre les types de fautes et les procédures de diagnostic de la présence d’un type de fautes correspondant.

Les trois tableaux sont donnés en annexe A. On ne donne aucune indication sur la fréquence de la prise d’informations diagnostiques ainsi que sur la façon de “capter” ces informations. Plus la prise est fréquente, plus la présence de faute est vite détectée mais plus la mise en oeuvre est coûteuse. La prise d’informations diagnostiques peut être effectuée en mode pull (le système envoie les informations diagnostiques seulement quand il y a une demande de prise d’informations) ou push (le système envoie les informations diagnostiques à sa volonté sans se soucier s’il y a une demande de prise d’informations). Ces choix sont laissés aux concepteurs de système.

Pour mettre en oeuvre le NTF de notre exemple, la duplication est le mécanisme adéquat. La politique précise de duplication doit être déterminée selon la disponibilité des ressources. Pour détecter la présence d’un crash, on lit dans le tableau TS qu’il est nécessaire d’utiliser des messages de battement de coeur. Lorsqu’ils n’arrivent pas au moment prévu, le système est suspecté d’être en panne franche. Cette procédure de diagnostic est donnée dans le tableau TD sur la ligne correspondant au crash.
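À titre d’illustration, voici une esquisse Java (de notre part ; les noms sont hypothétiques) d’un détecteur de pannes franches fondé sur les battements de coeur, conformément aux tableaux TS et TD :

    /** Esquisse : détection d'un crash par absence de battements de coeur. */
    public class HeartbeatDetector {
        private final long timeoutMillis; // délai maximal admis entre deux battements
        private volatile long lastBeat = System.currentTimeMillis();

        public HeartbeatDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

        /** Appelé à chaque réception d'un message de battement de coeur. */
        public void onHeartbeat() { lastBeat = System.currentTimeMillis(); }

        /** Procédure de diagnostic : pas de battement à temps, crash suspecté. */
        public boolean isSuspected() {
            return System.currentTimeMillis() - lastBeat > timeoutMillis;
        }
    }

Le choix du délai traduit le compromis évoqué plus haut entre rapidité de détection et coût de la surveillance.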

Il faut remarquer que nos tableaux présentent des mécanismes parfaitement bien définis et testés, ce qui assure la possibilité de mettre en oeuvre les différents NTF possibles. Donc dans le framework, les interfaces de chaque élément sont proprement définies et les implémentations des éléments qui ne dépendent pas de l’application précise sont fournies.

4. Application de l’approche à un gestionnaire de données

Dans cette section, nous illustrons notre approche en utilisant un gestionnaire de données à base de composants comme système cible. Pour cela, nous décrivons, dans la sous-section 4.1, le modèle de composants avec lequel le gestionnaire de données est conçu. Ensuite, nous présentons l’architecture du gestionnaire en excluant les aspects de la tolérance aux fautes. Finalement, les sous-sections 4.3 et 4.4 présentent l’architecture du même gestionnaire avec deux NTFs différents.

4.1. Modèle de composant

Le modèle de composants décrit dans cette section est inspiré du modèle de composant Fractal [GRO b], un travail en cours dans le cadre de ObjectWeb [CON ]. Nous présentons seulement des éléments essentiels permettant la compréhension de notre démarche.

Un composant est une structure qui existe seulement à l’exécution. Il peut correspondre à un objet ou à une agrégation d’objets. Un composant est formé de deux parties : un contrôleur et un contenu.

Le contenu d’un composant peut être composé (d’un nombre fini) d’autres composants. Ces composants sont appelés sous-composants. Un composant primitif est celui dont le contenu est un objet.



Le contrôleur d’un composant incarne le comportement de contrôle associé à un composant particulier. Un contrôleur peut :

– intercepter les appels entrants et sortants du contenu,

– superposer le comportement de contrôle au comportement du contenu du composant,

– gérer le cycle de vie du contenu (et par conséquent, du composant),

– exporter les interfaces du contenu,

– connaître et modifier la composition du contenu.

En pratique, le contrôleur est matérialisé par deux parties : un conteneur et une racine de composition. La racine de composition encapsule la composition des sous-composants participant au contenu et les interactions entre eux. La racine de composition représente l’aspect fonctionnel du composant en question tandis que le conteneur représente l’aspect technique.

Nous utilisons le terme entité pour désigner un composant, un contenu ou un contrôleur. Les entités interagissent à travers des appels de méthodes. Le regroupement de différents appels de méthodes constitue une interface.

Une interface serveur regroupe des appels de méthode servis par une entité tandis qu’une interface cliente regroupe des appels de méthode émis par une entité.

Une liaison (binding) est un lien entre une interface cliente et une interface serveur. Une liaison est établie si et seulement si l’interface serveur peut accepter au moins tous les appels que le client émet vers cette interface serveur.
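Pour fixer les idées, voici une esquisse Java de ces entités (esquisse hypothétique de notre part, librement inspirée de Fractal ; les noms ne proviennent pas de l’article) :

    /** Esquisse : entités du modèle de composants (composant, contrôleur, liaison). */
    interface Component {
        Controller controller(); // le contrôleur du composant
        Object content();        // le contenu : objet (composant primitif) ou sous-composants
    }

    interface Controller {
        Object exportedInterface(String name); // exporte une interface (serveur) du contenu
        void start();                          // gestion du cycle de vie du contenu
        void stop();
    }

    /** Une liaison relie une interface cliente à une interface serveur compatible. */
    final class Binding {
        final Object clientInterface;
        final Object serverInterface;
        Binding(Object clientInterface, Object serverInterface) {
            // la liaison n'est valide que si l'interface serveur accepte
            // au moins tous les appels émis par l'interface cliente
            this.clientInterface = clientInterface;
            this.serverInterface = serverInterface;
        }
    }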

La figure 2 montre la convention graphique utilisée dans la suite pour présenter les différentes parties d’un composant.

[Légende : interface serveur ; interface cliente ; interface (serveur) exportée du contenu ; conteneur ; racine de composition ; contenu ; liaison ; un composant vu de l’extérieur ; un composant primitif]

Figure 2. Convention graphique


4.2. Architecture du gestionnaire de données

Le gestionnaire de données que nous utilisons dans ce papier est inspiré du travail présenté dans [Gar 02a]. Dans le cas où la tolérance aux fautes n’est pas prise en compte, le composant de persistance (Persistance) a besoin d’un composant pour accéder à un stockage permanent (Stockage) et d’un composant qui gère le cache (Cache). Le composant Cache, à son tour, est composé d’un composant qui gère le cache (Cache) et d’un composant qui décide quels sont les objets à supprimer dans le cache quand le cache est plein (Remplacement). Les trois composants Persistance, Stockage et Cache sont englobés dans un composant appelé Gestionnaire de données. La figure 3 montre l’architecture qui vient d’être décrite. Les deux sous-sections suivantes montrent comment cette architecture évolue quand un NTF est requis.

Figure 3. Architecture d’un gestionnaire de données selon l’approche à base de composants
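À titre d’illustration, l’assemblage de la figure 3 peut s’esquisser ainsi en Java (esquisse de notre part ; les interfaces, réduites au strict minimum, sont hypothétiques) :

    /** Esquisse : composition du gestionnaire de données de la figure 3. */
    interface Stockage { byte[] lire(String id); void ecrire(String id, byte[] donnees); }
    interface Remplacement { String choisirVictime(); } // objet à évincer quand le cache est plein
    interface Cache { byte[] obtenir(String id); void inserer(String id, byte[] donnees); }
    interface Persistance { Object charger(String id); void rendrePersistant(String id, Object o); }

    final class GestionnaireDeDonnees {
        private final Persistance persistance;
        private final Stockage stockage;
        private final Cache cache; // composé à son tour d'un cache et d'un Remplacement

        GestionnaireDeDonnees(Persistance p, Stockage s, Cache c) {
            this.persistance = p;
            this.stockage = s;
            this.cache = c;
        }
    }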

4.3. Un premier niveau de tolérance aux fautes

Dans cette sous-section, nous montrons comment nous ajoutons un NTF au gestionnaire de données de la figure 3 selon notre approche. Le NTF choisi est ⟨(Crash, unmasking), (Omission, nothing), (Late timing, nothing), (Value, nothing), (Byzantine, nothing)⟩.


Éléments redondants Afin d’assurer le niveau unmasking du crash, nous utilisons la journalisation et le checkpointing.

La journalisation est réalisée par la collaboration entre deux éléments : un Facade (le nom Facade désigne un élément qui offre une interface unifiée pour un ensemble d’interfaces dans un sous-système ; il a pour but de cacher les interfaces plus complexes du sous-système et donc de permettre de les utiliser plus facilement) et un Journal. Le Journal est responsable de l’organisation de l’ensemble des enregistrements gardant les traces de l’exécution des opérations effectuées sur un espace de stockage fiable. Il fournit des méthodes permettant d’écrire, lire et rechercher de tels enregistrements. Cependant, la sémantique des données stockées dans le journal lui est étrangère. La connaissance de la sémantique des données relève de la responsabilité du Facade. Le Facade est donc un élément appartenant à la racine de composition du composant Gestionnaire de données. Il est mis en oeuvre par le programmeur du composant Gestionnaire de données, tandis que le Journal est mis en oeuvre par le concepteur du framework de tolérance aux fautes (c.f. section 2).

Le checkpointing est mis en place d’une manière similaire. Il utilise le même élément Facade décrit plus haut et un élément appelé Checkpoint. Lors de l’implantation, le Journal et le Checkpoint peuvent être mis en oeuvre par une même classe.

Surveillance Nous avons besoin d’un élément d’information qui émet périodiquement des battements de coeur (heartbeat) afin de détecter le crash du gestionnaire de données. Cet élément est donc inséré dans l’architecture sous la forme d’une interface serveur HeartBeat du composant final. Cette interface est à la charge de la racine de composition du composant Gestionnaire de données, c-à-d, lors de la mise en oeuvre, la racine de composition du Gestionnaire de données doit implanter cette interface.

Récupération Le principe de la récupération est de trouver un état cohérent du gestionnaire de données à partir du journal et du checkpoint, et ensuite de l’installer sur le système, à savoir le cache et éventuellement le stockage. La récupération étant spécifique à chaque application, nous laissons cette tâche à la charge du programmeur. Dans le cas du gestionnaire de données, la procédure de récupération est effectuée par l’élément Facade car il a l’accès aux informations dans le journal et le checkpoint, et il sait les interpréter. L’élément Facade doit exporter une interface permettant de déclencher le processus de récupération.
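Les éléments qui précèdent peuvent s’esquisser ainsi (esquisse Java de notre part ; les signatures sont hypothétiques) :

    import java.util.List;

    /** Esquisse : journal et checkpoint fournis par le framework, reprise par le Facade. */
    interface Journal {
        void ajouter(byte[] enregistrement); // trace d'une opération, sur stockage fiable
        List<byte[]> depuis(long position);  // enregistrements postérieurs à une position
    }

    interface Checkpoint {
        void sauvegarder(byte[] etat);
        byte[] dernierEtat();
        long positionDansJournal();          // position du journal lors de la sauvegarde
    }

    /** Le Facade, écrit par le programmeur, connaît la sémantique des enregistrements. */
    final class Facade {
        interface Cible { void installerEtat(byte[] etat); void rejouer(byte[] operation); }

        private final Journal journal;
        private final Checkpoint checkpoint;

        Facade(Journal journal, Checkpoint checkpoint) {
            this.journal = journal;
            this.checkpoint = checkpoint;
        }

        /** Interface de récupération exportée par la racine de composition. */
        void recover(Cible cible) {
            cible.installerEtat(checkpoint.dernierEtat()); // revenir au dernier checkpoint
            for (byte[] op : journal.depuis(checkpoint.positionDansJournal()))
                cible.rejouer(op);                         // rejouer les opérations suivantes
        }
    }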

Architecture finale Le résultat de l’ajout du NTF ⟨(Crash, unmasking), (Omission, nothing), (Late timing, nothing), (Value, nothing), (Byzantine, nothing)⟩ est montré dans la figure 4. Dans cette architecture finale :

– Le conteneur est enrichi de deux interfaces serveur disponibles à la racine de composition : Journal et Checkpoint.

– La racine de composition est enrichie de quatre interfaces serveur : Heartbeat et trois interfaces de l’élément Facade. Ces trois interfaces correspondent à l’interface de récupération, une interface cliente vers les deux interfaces serveur mentionnées au-dessus du conteneur et une interface vers le contenu pour rassembler des informations à écrire sur le journal et le checkpoint.

Figure 4. Architecture du composant final avec le NTF ⟨(Crash, unmasking), (Omission, nothing), (Late timing, nothing), (Value, nothing), (Byzantine, nothing)⟩

4.4. Un autre niveau de tolérance aux fautes

Le NTF présenté dans cette sous-section est ⟨(Crash, masking), (Omission, nothing), (Late timing, nothing), (Value, nothing), (Byzantine, nothing)⟩.

Éléments redondants La duplication avec au moins deux copies actives permet d’assurer le niveau masking du crash.

Nous supposons disposer d’un framework de duplication tel que celui décrit dans [DRA 02]. Une politique de duplication est définie par quatre éléments : le nombre de copies, l’emplacement des copies, le moment de création des copies et la façon dont les copies évoluent (ou le modèle de l’évolution des copies). Le choix de ces quatre éléments n’est pas à la charge du framework de duplication. Une fois que le modèle d’évolution des copies est choisi, le framework de duplication assure que ce modèle est respecté. Le modèle d’évolution des copies correspondant à la duplication avec au moins deux copies actives est le modèle de copie unique (ce modèle assure que toutes les copies sont équivalentes et donc évoluent de la même manière ; la lecture et l’écriture s’effectuent sur les copies les plus à jour).

Un objet de liaison fourni par le framework de duplication est ajouté dans le conteneur du composant Gestionnaire de données. Cet objet de liaison a pour but d’intercepter les opérations à effectuer sur le support de stockage. Il interagit avec le framework de duplication afin d’assurer que les copies évoluent correctement vis-à-vis du modèle choisi.

Surveillance Comme dans le cas précédent, pour pouvoir détecter le crash, un élément d’information qui émet périodiquement les battements de coeur (heartbeat) est ajouté à la racine de composition.

Récupération Une partie de la procédure de récupération est incluse dans le choix du modèle d’évolution des copies. Elle est donc assurée par le framework de duplication. Afin que la récupération soit effectuée, le framework de tolérance aux fautes doit introduire un nouvel élément que nous appelons le Gestionnaire du niveau de tolérance aux fautes (GNTaF). Ce gestionnaire a pour but de maintenir le nombre de copies nécessaires à un modèle d’évolution donné. Par conséquent, dans notre cas, si une copie est en crash, le GNTaF doit être capable de créer une nouvelle copie pour que deux copies actives soient toujours disponibles. Pour cela, le GNTaF interagit avec le framework de duplication à travers l’interface fournie par le framework de duplication pour la création d’une nouvelle copie.
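Le rôle du GNTaF peut s’esquisser ainsi (esquisse Java de notre part ; l’interface du framework de duplication est hypothétique) :

    /** Esquisse : maintien du nombre de copies requis par le modèle d'évolution. */
    final class GestionnaireNiveauToleranceAuxFautes {
        interface FrameworkDuplication {
            int nombreDeCopiesActives();
            void creerNouvelleCopie(); // interface de création fournie par le framework
        }

        private final FrameworkDuplication duplication;
        private final int copiesRequises; // au moins 2 pour masquer le crash

        GestionnaireNiveauToleranceAuxFautes(FrameworkDuplication d, int copiesRequises) {
            this.duplication = d;
            this.copiesRequises = copiesRequises;
        }

        /** Appelé lorsque le crash d'une copie est détecté (battements de coeur). */
        void surCrashDeCopie() {
            while (duplication.nombreDeCopiesActives() < copiesRequises)
                duplication.creerNouvelleCopie();
        }
    }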

Architecture finale La figure 5 montre l’architecture finale de cet exercice.

5. Travaux similaires

5.1. Niveaux de tolérance aux fautes

Le terme tolérance aux fautes a des sémantiques diverses. Les sémantiques les plus utilisées par les concepteurs de systèmes sont les suivantes :

– Les comportements du système lors de l’apparition de certains types de fautes sont bien décrits dans la spécification, et l’implantation du système respecte cette spécification. Cette forme de tolérance est connue sous le terme : fail-safe.

– Le système accomplit la tâche décrite dans sa spécification malgré la panne de certains de ses composants. Cette forme de tolérance est appelée masquage.

La distinction entre ces deux niveaux de la tolérance aux fautes est difficile. Si la spécification d’un système précise son comportement lors de la présence de fautes et qu’il la respecte, il est fail-safe ; mais puisqu’il accomplit sa tâche, il fait également le masquage.


Figure 5. Architecture du composant final avec le NTF ⟨(Crash, masking), (Omission, nothing), (Late timing, nothing), (Value, nothing), (Byzantine, nothing)⟩

Il existe des travaux sur la spécification formelle des systèmes informatiques qui définissent la tolérance aux fautes par rapport aux propriétés de sûreté (safety properties) et aux propriétés de vivacité (liveness properties) d’un système [GAR 99, KUL 99]. Ces travaux distinguent quatre formes de tolérance aux fautes lors de la présence de fautes :

– nothing : Les propriétés de sûreté et de vivacité sont violées,

– fail-safe : Les propriétés de sûreté sont préservées tandis que les propriétés de vivacité sont violées,

– unmasking : Les propriétés de sûreté sont violées tandis que les propriétés de vivacité sont préservées,

– masking : Les propriétés de sûreté et de vivacité sont préservées.

La classification à quatre niveaux proposée par les travaux formels prend en compte essentiellement les classes de fautes de conception, ce qui n’est pas applicable dans notre contexte.

5.2. Adaptabilité de la tolérance aux fautes

Parmi les travaux qui ont essayé de découpler la tolérance aux fautes de l’aspect applicatif d’un système et ensuite de rendre la tolérance aux fautes adaptable, on peut citer Chameleon [KAL 99], AQuA [REN 01], la spécification de CORBA tolérant aux fautes [GRO a] et DARX [PAR ].


Aucun de ces travaux ne donne une définition précise du NTF. Chameleon propose la notion de “spécification de tolérance aux fautes”, utilisée par le système cible pour indiquer ses besoins de fiabilité, mais nous n’avons pas trouvé dans les papiers publiés une définition claire de la façon dont le système peut exprimer ses besoins. Les autres travaux ne mentionnent pas explicitement le concept de NTF.

AQuA, la spécification de CORBA tolérant aux fautes et DARX assurent un seul NTF correspondant au masquage des pannes franches (la sémantique de masquage est ici celle utilisée par les concepteurs de système, c.f. le paragraphe Niveaux de tolérance aux fautes au début de cette section). AQuA masque aussi les fautes de valeur.

Donc, les types de fautes pris en compte dans ces travaux sont limités à deux : pannes franches et fautes de valeur. Le seul mécanisme utilisé pour ces types de fautes est la duplication. Dans certains cas, l’adaptabilité porte sur le nombre de fautes tolérées. Par ailleurs, l’adaptabilité, même dans le cas de la personnalisation, peut être exprimée par différentes variables telles que le type de fautes, le degré des conséquences causées par la présence de fautes, etc. Les travaux existants proposent une solution partielle au problème de l’introduction de la tolérance aux fautes adaptable dans un système.

Tous les travaux cités dans cette section considèrent les systèmes cibles comme des boîtes noires. Ils n’ont pas de connaissances sur l’architecture ni sur les comportements de ces systèmes. Ceci explique le fait que le seul mécanisme de tolérance aux fautes utilisé dans ces travaux est la duplication, et donc le seul niveau de tolérance aux fautes envisageable est le masquage.

6. Conclusion et perspectives

Dans cet article, nous avons présenté nos premiers résultats vers la tolérance aux fautes adaptable, à savoir notre approche de définition d’un framework de personnalisation de la tolérance aux fautes pour les constructeurs de systèmes répartis à composants. Nous avons donné une nouvelle définition du niveau de tolérance aux fautes pour un système. Avec cette définition nous sommes capables de déduire les éléments redondants, de surveillance, de détection et de récupération nécessaires au support d’un NTF spécifié par le constructeur d’un système.

Nous avons aussi développé deux exemples qui montrent comment intégrer les éléments des mécanismes de tolérance aux fautes à une application précise. Ces exercices nous ont permis d’identifier d’autres éléments qui doivent apparaître dans l’architecture pour assurer la coordination des composants, par exemple le gestionnaire du niveau de TaF. D’autres expériences visant la définition d’une architecture globale qui prend en compte tous les NTFs possibles et où les composants choisis s’intègrent sans difficulté sont en cours de réalisation. Le résultat de cette tâche nous permettra finalement de définir complètement notre framework et de poursuivre sa validation par le biais du prototypage.

À plus long terme nous prévoyons de lever l’hypothèse des ressources infinies pour faire évoluer notre framework et le rendre adaptable. Dans cette future étape du travail, une représentation dynamique des ressources du système s’avère nécessaire ; le NTF effectivement fourni sera déterminé à partir des ressources disponibles et ajusté dynamiquement. La réflexivité de notre modèle de composants nous semble déterminante dans la réalisation de cette tâche.

7. Bibliographie

[AVI 97] AVIZIENIS A., « Toward Systematic Design of Fault Tolerant Systems », IEEE Computer, vol. 30, n° 4, 1997, p. 51-58, IEEE Computer Society.

[COL 00] COLLET C., « The NODS project : Networked Open Database Services », Proceedings of Symposium on Objects and Databases (ECOOP), LNCS 1944, 2000, p. 153-169.

[CON ] CONSORTIUM O., « Open Source Middleware », http://www.objectweb.org.

[CRI 91] CRISTIAN F., « Understanding fault-tolerant distributed systems », Communications of the ACM, vol. 34, n° 2, 1991, p. 56-78.

[DRA 02] DRAPEAU S., RONCANCIO C., DÉCHAMBOUX P., « Design of an Adaptable Replication Service », submitted to publication, 2002.

[GAR 99] GÄRTNER F. C., « Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments », ACM Computing Surveys, vol. 31, n° 1, 1999, p. 1-26.

[Gar 02a] GARCÍA-BAÑUELOS L., « An Adaptable Infrastructure for Customized Persistent Object Management », Proceedings of the EDBT Ph.D. Workshop, 2002.

[Gar 02b] GARCÍA-BAÑUELOS L., DUONG P.-Q., COLLET C., « A component-based infrastructure for customized persistent object management », à apparaître dans les actes des 18èmes Journées Bases de Données Avancées, 2002.

[GRO a] OBJECT MANAGEMENT GROUP, « CORBA 2.5 - chapter 25 - Fault Tolerant CORBA », http://www.omg.org/cgi-bin/doc?formal/01-09-29.

[GRO b] TECHNICAL COMPONENT MODEL WORKING GROUP, « Fractal component model », http://www.objectweb.org/architecture/component/index.html.

[JAL 94] JALOTE P., Fault Tolerance in Distributed Systems, Prentice Hall, 1994.

[KAL 99] KALBARCZYK Z., IYER R. K., BAGCHI S., WHISNANT K., « Chameleon : A Software Infrastructure for Adaptive Fault Tolerance », IEEE Transactions on Parallel and Distributed Systems, vol. 10, n° 6, 1999, p. 560-579.

[KUL 99] KULKARNI S., « Component based design of fault-tolerance », PhD thesis, The Ohio State University, 1999.

[LAM 77] LAMPORT L., « Proving the Correctness of Multiprocess Programs », IEEE Transactions on Software Engineering, vol. 3, n° 2, 1977, p. 125-143.

[PAR ] LABORATOIRE D’INFORMATIQUE DE PARIS 6, « Dynamic Agent Replication eXtension », http://www-src.lip6.fr/darx/.

[REN 01] REN Y., « AQuA : A Framework for Providing Adaptive Fault Tolerance to Distributed Applications », PhD thesis, University of Illinois at Urbana-Champaign, 2001.

[SOL 00] SOLAR G. V., « Service d’événements flexible pour l’intégration d’applications bases de données réparties », PhD thesis, Université Joseph Fourier, 2000.

ANNEXE A

Type de fautes | Informations recueillies pour le diagnostic
Crash | Messages périodiques de battement de coeur
Omission | Le moment où une requête est délivrée à la couche application et le message d’acquittement quand la réponse correspondante est envoyée
Late timing | Le moment où une requête est délivrée à la couche application et le moment où la réponse correspondante est envoyée
Value | Le message envoyé à d’autres systèmes
Byzantine | Toutes les informations recueillies pour d’autres types de fautes

Tableau 1. Tableau de surveillance (TS)

Type de fautes | Comment détecter la présence de faute
Crash | Si le message de battement de coeur n’a pas encore été reçu, le système est supposé être en crash.
Omission | Aucun acquittement n’est reçu à T+Δ, où T est le moment de la livraison de la requête à la couche application et Δ est le temps d’attente pour la réponse correspondante.
Late timing | -idem-
Value | Si la valeur du message sortant est différente de la valeur majoritaire du même message envoyé par d’autres instances du système, cette instance a une faute value.
Byzantine | L’intégration des mécanismes de détection des fautes value et des fautes late timing.

Tableau 2. Tableau de diagnostic (TD)


Type de fautes | Nothing/Signaling | Unmasking : mécanismes | Unmasking : actions à faire | Masking : mécanismes
Amnesia crash, Partial amnesia crash | X | Checkpointing, Journalisation ; Duplication passive | Revenir au dernier point de sauvegarde (checkpoint) ; Activer une réplique passive | Duplication (sauf la duplication passive)
Pause crash | X | X | X | Duplication (sauf la duplication passive)
Halting crash | X | Duplication passive | Activer une réplique passive | -idem-
Omission | X | Journalisation (requête, réponse) | Ré-soumettre la requête au système | -idem-
Timing | X | X | X | -idem-
Value | X | N/A | N/A | Duplication active avec au moins 3 répliques
Byzantine | X | N/A | N/A | -idem-

Tableau 3. Tableau des éléments redondants (TER). X : aucun élément n’est nécessaire, N/A : le cas est invalide


Session 4
Méthodologies de conception


From UML to ROLAP multidimensional databases using a pivot model

Nicolas Prat* — Jacky Akoka**

* ESSEC, Avenue Bernard Hirsch, BP 105, F-95021 Cergy cedex

[email protected]

** CEDRIC-CNAM & INT, 292, rue Saint-Martin, F-75141 Paris cedex
[email protected]

ABSTRACT. Effective data warehouse design requires a conceptual modeling phase. This paper describes a data warehouse design method based on the conceptual, logical and physical levels. The conceptual phase creates an UML model. To this end, UML is enriched with multidimensional concepts. The logical phase maps the enriched UML model into a multidimensional one. The physical phase maps the multidimensional model into a database schema, depending on the target ROLAP or MOLAP tool. Our method consists in (1) the metamodels used at each design step, including the unified multidimensional metamodel (2) a set of transformations defined on the concepts of these metamodels, facilitating the design process. We present the metamodels and transformations, and illustrate the formal specification of the transformations with OMG’s Object Constraint Language (OCL). We apply our method to a case study and compare it to the state-of-the-art.

RÉSUMÉ. Pour être efficace, la conception d’entrepôts de données nécessite une phase de modélisation conceptuelle. Cet article décrit une méthode de conception d’entrepôts de données fondée sur les niveaux conceptuel, logique et physique. La phase conceptuelle crée un modèle UML. A cette fin, UML est enrichi de concepts multidimensionnels. La phase logique traduit le modèle UML enrichi en modèle multidimensionnel. La phase physique traduit le modèle multidimensionnel en un schéma de base de données, en fonction de l’outil ROLAP ou MOLAP cible. Notre méthode est constituée (1) des métamodèles utilisés à chaque étape de conception, dont le métamodèle multidimensionnel unifié (2) d’un ensemble de transformations définies sur les concepts de ces métamodèles pour faciliter le processus de conception. Nous présentons les métamodèles et les transformations, et illustrons la spécification formelle des transformations avec OCL (Object Constraint Language) de l’OMG. Nous illustrons notre méthode par une étude de cas et la comparons à l’état de l’art.

KEYWORDS: data warehouse, multidimensional database, OLAP, design method, UML, OCL.

MOTS-CLÉS : entrepôt de données, base de données multidimensionnelle, OLAP, méthode de conception, UML, OCL.


1. Introduction

The data warehousing and OLAP market is growing rapidly. [OLA 02] estimates that the worldwide OLAP total market is likely to exceed 5 billion dollars by 2004, compared with 1 billion dollars in 1996. Like the relational database market at its beginning, the OLAP market has no dominant players.

The OLAP design process is described by tool vendors as much easier than classical database design. We claim that, since data warehouses are developed to provide managers with data on which they will build queries depending on their constantly evolving needs, the design process is crucial.

Each OLAP tool is based on a specific underlying physical metamodel. We claim that a standardization of the multidimensional metamodel will allow both the users to better understand the underlying concepts and the tool editors to adopt a unified view.

Unlike the transactional database world, which has taken some time to adopt the three levels of abstraction recommended by ANSI/X3/SPARC, the data warehouse actors should rapidly divide the modeling task according to the three conceptual, logical and physical levels. The conceptual level allows the data warehouse designer to build a high level abstraction of data for decision making, independently from implementation issues. The logical level maps this abstraction into a standard multidimensional representation. Finally, taking into account the target tool, the physical phase aims at building a database schema to be implemented on a specific platform.

The main objective of this paper is to propose a four-step design method for data warehouse development, including transformations for systematic mapping between each step. In order to address several target tools, we define a unified multidimensional metamodel at the logical level and build our process on it. The paper is organized as follows. Section 2 describes the unified multidimensional metamodel. Section 3 presents the four-step design method. The transformations applied at each step are described and formalized with OCL, the Object Constraint Language associated with UML [OMG 01a]. A case study illustrating the process is presented in Section 4. Section 5 compares our approach with the state-of-the-art. Finally, Section 6 concludes and describes further research.


2. A unified multidimensional metamodel

In contrast with the relational metamodel1, there is no standard multidimensional database metamodel. More precisely, there is no commonly accepted formal multidimensional data metamodel. As a consequence, many multidimensional metamodels have been proposed in the literature [AGR 97 ; BLA 98 ; CAB 98 ; CHA 97 ; GOL 98a ; GYS 97 ; KIM 96 ; LIW 96 ; OMG 01b ; PED 99 ; VAS 99]. The concepts used vary depending on the authors and some concepts, e.g. the notion of “fact”, are employed with various meanings. Furthermore, there is no consensus concerning the level of the multidimensional metamodel (physical, logical or conceptual). The star and snowflake models presented in [KIM 96] have often been considered to be at the physical level, since the choice between stars and snowflakes is based on performance considerations (trade-off between query performance and optimization of disk space). More recent publications have placed the multidimensional metamodel at the logical level [VAS 00] or even at the conceptual level [GOL 98a ; HÜS 00].

Our strong belief is that the multidimensional metamodel belongs to the logical level. Even though there is no consensus on this metamodel, it clearly exists independently of physical implementations. However, the multidimensional metamodel should not be situated at the conceptual level, since the concepts of this metamodel (e.g. the concept of dimension) are not as close to reality as concepts like the object (used in conceptual object modeling languages like UML [OMG 01a] for example). There is indeed a strong parallel between the relational and the multidimensional metamodels - e.g. the definitions or attempts to define a (standard) associated query language and algebra. This is the reason why we argue that both metamodels should be considered as belonging to the same level, i.e. the logical level.

In the multidimensional metamodel, data are organized in (hyper)cubes. Although the detailed concepts of this model vary depending on the authors, we can describe multidimensional semantics using four concepts which appear recurrently, namely the notions of measure, dimension, hierarchy and attribute. Our metamodel is composed of these four concepts, which are illustrated in Figure 1.

1. To be consistent with the conventions of the Object Management Group [OMG 01a ; OMG 01b], we will refer to modeling formalisms (e.g. relational, UML…) as metamodels.


Figure 1. Multidimensional representation of data

The key concept of the multidimensional metamodel is the notion of measure. A measure is typically a quantitative datum, a numeric value of interest for the analysis. A measure need not be of numeric type, as long as its values are totally ordered. For example, it can be an enumeration type. Thus, the satisfaction of customers with a product may be measured on a four-value scale (unsatisfied – mitigated – satisfied – enthusiastic). In a cube, the measures correspond to the content of the cells.

The dimensions form the edges of the cube. Each measure is associated with one or several dimensions, which specify the context of the measure. In the example of Figure 1, the quantity sold is dimensioned by the dimensions product, day and city; in other words, by “quantity sold”, we mean the quantity sold for a particular product at a particular date in a particular city. Note that a dimension is represented by its identifier (e.g. the values of the dimension product are product codes).

Sometimes, we need to represent an event linking several dimensions without having any measure associated with this event [KIM 96]. For this purpose, we use a specific type of measure, called dummy measure. Consider for example the relationship “reservation” linking a borrower, a book and a reservation date, without any specific attributes characterizing the reservation. The dummy measure “reservation” will serve to indicate which books have been reserved by which borrowers and at which date. Note that dummy measures need to be distinguished



from other measures, since they will require specific implementation depending on the target ROLAP or MOLAP tool.

The dimensions are organized in hierarchies. A hierarchy is an aggregation path between dimensions. In Figure 1, “city->region” and “day->month->quarter->year” are examples of hierarchies. A hierarchy is oriented from the lower to the upper abstraction levels (here, from city to region and from day to year). The arrow between two successive dimensions (called dimension link) may be interpreted as a functional dependency. Hierarchies are of paramount importance in the multidimensional metamodel since they are used to aggregate (“rollup”) or detail (“drill-down”) measures.
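As an illustration (a minimal sketch of ours, not taken from the paper; names and types are assumptions), rolling up the measure of Figure 1 from day to month amounts to grouping cells by the upper dimension of the hierarchy and applying an aggregate function such as Sum:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // One cell of the cube of Figure 1 (the other dimensions are ignored for brevity).
    record Cell(String product, String day, String city, int quantitySold) {}

    class Rollup {
        // Dimension link day -> month, e.g. "3 March 99" -> "March 99".
        static String monthOf(String day) {
            return day.substring(day.indexOf(' ') + 1);
        }

        // Rolls quantitySold up from day to month with the Sum aggregate function.
        static Map<String, Integer> quantityByMonth(List<Cell> cells) {
            return cells.stream().collect(Collectors.groupingBy(
                    c -> monthOf(c.day()),
                    Collectors.summingInt(Cell::quantitySold)));
        }
    }

A drill-down is the inverse navigation, from the aggregated values back to the detailed cells.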

Dimensions may be described by attributes. For example, the dimension product (i.e. product code) is described by the product name and its unit price. Attributes are not the object of multidimensional analysis, as opposed to measures. In other words, if a dimension is described by a feature that is a measure of interest, this feature should be defined as a one-dimensional measure associated with this dimension.

A multidimensional model is specified textually, as illustrated for the multidimensional model of Figure 1:

dimension day
dimension month
dimension quarter
dimension year
dimension product code
dimension category
dimension city
dimension region

measure quantity sold [product code, day, city]

hierarchy time day->month->quarter->year
hierarchy type product code->category
hierarchy location city->region

attribute product name [product code]
attribute unit price [product code]

Our multidimensional metamodel unifies the concepts of the main multidimensional metamodels found in the literature. The metamodel is generic in that it can be mapped into many OLAP tools, as the present article will illustrate for the ROLAP star case. For the sake of this article, we present a simplified version of the metamodel. A more complete version can be found in [AKO 01].

Figure 2 summarizes the main concepts of our multidimensional metamodel, represented with UML. The link between a measure and a dimension dimensioning this measure is called dimensioning and is strong if the dimension is necessary to functionally determine the measure (if the dimensioning is not strong, the dimension shall be indicated between parentheses in the textual specification of the measure). Every dimensioning has at least one aggregate function associated with it. The name of an aggregate function is either Sum, Avg, Min, Max or Count. Aggregate functions are characterized by a restriction level, as suggested by [HÜS 00]. A Sum aggregate function is less restrictive than a Count aggregate function in that it allows more analysis.


Figure 2. Unified multidimensional metamodel
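As a minimal sketch (ours, simplified from Figure 2; field names are assumptions where the figure is silent), the core of the metamodel can be written as follows:

    import java.util.ArrayList;
    import java.util.List;

    // Core concepts of the unified multidimensional metamodel (simplified).
    class Dimension {
        final String name;
        final List<String> attributes = new ArrayList<>(); // descriptive attributes
        Dimension(String name) { this.name = name; }
    }

    class Dimensioning {
        final Dimension dimension;
        final boolean strong; // true if the dimension is needed to determine the measure
        final List<String> aggregateFunctions = new ArrayList<>(); // Sum, Avg, Min, Max, Count
        Dimensioning(Dimension dimension, boolean strong) {
            this.dimension = dimension;
            this.strong = strong;
        }
    }

    class Measure {
        final String name;
        final boolean dummyMeasure; // event without associated values
        final List<Dimensioning> dimensionings = new ArrayList<>();
        Measure(String name, boolean dummyMeasure) {
            this.name = name;
            this.dummyMeasure = dummyMeasure;
        }
    }

    class DimensionHierarchy {
        final String name;
        final List<Dimension> path = new ArrayList<>(); // ordered, lower to upper level
        DimensionHierarchy(String name) { this.name = name; }
    }

For instance, the model of Figure 1 would be built by creating the dimension “product code”, attaching a strong Dimensioning of the measure “quantity sold” to it, and declaring the hierarchy product code -> category.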

The unified multidimensional metamodel is used as a pivot metamodel in our design method, as described in the next section.

3. The design method

Starting from user requirements, our method is based on the three usual abstraction levels: conceptual, logical and physical (Figure 3). It is therefore decomposed into four phases:

– In the conceptual phase, the designer represents the universe of discourse using the UML notation [OMG 01a] along with the associated development approach [JAC 99] (step 1); the UML model is then enriched and transformed to take into account the specific features of multidimensional modeling (step 2).

– In the logical phase, the enriched and transformed UML model is mapped into a unified multidimensional model, using mapping rules.

– The physical phase allows the designer to map the multidimensional model into a physical database schema, depending on the target OLAP tool. Due to space limitations, we focus in this paper on ROLAP tools using the star model. A specific set of mapping rules from the logical to the physical model is defined for each type of tool.


– The data confrontation phase consists in mapping the physical schema data elements with the data sources. It leads to the definition of queries for extracting the data corresponding to each component in the physical schema. This is a very complex problem, going beyond the scope of this paper. However, it is important to mention this crucial phase in the data warehousing process.

Figure 3. The four steps of the design method [the figure chains: universe of discourse -> (conceptual modeling) -> UML model -> (enrichment/transformation) -> enriched/transformed UML model -> (logical mapping) -> unified multidimensional model -> (physical mapping) -> MOLAP schema, ROLAP star schema or ROLAP snowflake schema -> (source confrontation) -> data warehouse metadata; the steps are grouped into conceptual design, logical design, physical design and data confrontation]

3.1. Conceptual design

OLAP systems are emerging as the dominant approach in data warehousing. OLAP allows designers to model data in a multidimensional way, as hypercubes. However, ROLAP snowflakes and stars, as well as MOLAP cubes, do not offer a visualization of data structures independent from implementation issues. Therefore, they do not ensure a sound data warehouse conceptual design.

Our design method uses the Unified Modeling Language (UML) at the conceptual level. The reason for this is threefold: UML is now a well-known language for software engineers; it provides simple yet powerful constructs to describe, at a high level of abstraction, the important concepts of the application domain; finally, it can be easily mapped to relational as well as multidimensional logical models.

Due to these considerations, many authors use the UML notation as a first step of transactional database design. It is now natural to apply UML to data warehousing as well, as illustrated by the Object Management Group's recent Common Warehouse Metamodel [OMG 01b].

Our design method consists of a two-step conceptual design:

– Step 1 leads to a UML model, more precisely to a class diagram without operations.

– Step 2 enriches and transforms this model to facilitate its automatic mapping to a unified multidimensional model. Four types of operations are conducted: the determination of identifying attributes, the determination of measures, the migration of association attributes and the transformation of generalizations.

For the second step, we need to specialize and enrich OMG's UML metamodel with concepts specific to multidimensional modeling (Figure 4).

Figure 4. Enriched UML metamodel

[The figure is a UML class diagram: a UMLModel owns UMLModelElements, specialized (disjoint, complete) into Class (OrdinaryClass, AssociationClass), Attribute (with the added measure : Boolean flag; AttributeOfOrdinaryClass adds identifyingAttribute : Boolean, AttributeOfAssociationClass does not), Relationship (Association – itself specialized into OrdinaryAssociation and AssociationClass – with ordered AssociationEnds carrying aggregation and multiplicity, and Generalization with parent/child ends) and Constraint (including GeneralizationConstraint over ordered constrained elements).]

In Figure 4, classes which are not association classes are called ordinary classes. Similarly, associations which are not association classes are called ordinary associations. The main extensions are the attribute measure of class Attribute (which indicates whether the attribute corresponds to a measure), and the attribute identifyingAttribute, which indicates whether an attribute identifies its owner class. The attribute identifyingAttribute is only necessary for the attributes of ordinary classes.

The four transformations performed during step 2 are presented below and formalized with OCL [OMG 01a]. A transformation is represented in OCL as an operation of the UML model considered. The formal specification of the transformation consists of pre- and post-conditions on the operation. Due to space limitations, some of the transformations are expressed in natural language only.

3.1.1. Determination of identifying attributes

Since the notion of identifying attribute is not defined in the standard UML notation, we need to determine explicitly the identifying attributes of classes in order to define the dimensions of the multidimensional model at the logical level. Note that, since an association class is identified by the n-tuple of identifying attributes of the participating classes, the determination of identifying attributes is necessary only for the ordinary classes. For each ordinary class of the UML model, the user and the data warehouse designer have to decide which attribute identifies the class (for simplicity, we assume that only one such attribute may exist for each ordinary class). If necessary, a specific attribute is created in order to identify the class. Identifying attributes are specified using the UML construct of tagged value: the suffix id is added to each identifying attribute, as in [MOR 00]. This process can be synthesized by the following transformation (expressed in natural language and OCL):

Transformation Tcc1: Each attribute of an ordinary class is either an identifying attribute or not.

context UMLModel::Tcc1(ordinaryClass:OrdinaryClass)
post: (ordinaryClass.attribute->size() = ordinaryClass.attribute@pre->size()) or
      (ordinaryClass.attribute->size() = ordinaryClass.attribute@pre->size()+1)
post: ordinaryClass.attribute->forAll(a1:Attribute |
      (a1.identifyingAttribute=true) or (a1.identifyingAttribute=false))
post: ordinaryClass.attribute->select(a1:Attribute |
      a1.identifyingAttribute=true)->size()=1


3.1.2. Determination of attributes representing measures

We differentiate between attributes representing measures and attributes which can be defined as qualitative values. As described in the previous section, this distinction is not based on data types, even if, generally, measures are numerical and qualitative attributes are not. Therefore, this differentiation cannot be performed automatically. The user and the data warehouse designer have to decide which attributes must be considered as measures. Note that this does not concern the identifying attributes determined previously, since an identifying attribute cannot be a measure. In the UML model, attributes representing measures are specified by the tagged value meas. This process can be synthesized as follows:

Transformation Tcc2: Each attribute is either a measure or not.

3.1.3. Migration of association attributes

This step is concerned with 1-1 and 1-N associations having specific attributes (these associations are actually association classes, since an ordinary association cannot bear attributes in UML). Let us mention that this case is rarely encountered. If specific attributes are present in these associations, the designer first has to check the validity of this representation. Even if their presence cannot be questioned, they cannot be mapped into multidimensional models by using hierarchies, because in multidimensional models hierarchies do not contain information. Therefore, these attributes must migrate from the association to the participating class on the N side. In the case of a 1-1 association, they can migrate into either of the two classes. The transformations for migrating association attributes are expressed as follows:

Transformation Tcc3: Each attribute belonging to a 1-1 association is transferred to one of the classes involved in the association.

context UMLModel::Tcc3(associationClass:AssociationClass)
pre: (associationClass.connection->size()=2) and
     ((associationClass.connection->at(1)).multiplicity='zero or one' or
      (associationClass.connection->at(1)).multiplicity='exactly one') and
     ((associationClass.connection->at(2)).multiplicity='zero or one' or
      (associationClass.connection->at(2)).multiplicity='exactly one') and
     associationClass.attribute->notEmpty()
post: associationClass.attribute@pre->forAll(a1:Attribute |
        associationClass.attribute->excludes(a1) and
        (associationClass.connection->at(1)).participant.attribute->includes(a1))
      or
      associationClass.attribute@pre->forAll(a1:Attribute |
        associationClass.attribute->excludes(a1) and
        (associationClass.connection->at(2)).participant.attribute->includes(a1))

Transformation Tcc4: Each attribute belonging to a 1-N association is transferred to the class on the N side of the association.

3.1.4. Transformation of generalizations

The generalizations of the UML notation cannot be mapped directly to hierarchies in the multidimensional model, since the semantics of hierarchies in object-oriented models and in multidimensional models differ. However, we want to preserve the information contained in UML generalizations and transform these hierarchies to enable their correct mapping to multidimensional hierarchies in the logical phase. To this end, we transform the generalizations into aggregations and classes, following the proposal of [MOO 00] for ER models. We have adapted this rule to UML and extended it to consider the different cases of incomplete and/or overlapping specialization. The corresponding transformation is informally described below:

Transformation Tcc5: For each level i of specialization of a class C, a class named Type-C-i is created. The occurrences of these classes define all the specializations of C. In case of overlapping between specializations, a special value is created for each overlap between two or more subclasses of C. In case of incomplete specialization, the special value "others" is created. An N-1 aggregation is created between the classes C and Type-C-i.

This transformation is illustrated with the case study in section 4.1 and sketched below. Note that the mapping of UML generalizations into multidimensional models is not trivial. This issue is dealt with in more depth in [AKO 01].
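A minimal sketch of Tcc5 may help (ours, in Python; the Clazz structure and the encoding of an overlap as concatenated subclass names are assumptions, not part of the method):

    # Illustrative sketch of Tcc5 (assumed encoding, ours).
    from dataclasses import dataclass, field
    from itertools import combinations
    from typing import List

    @dataclass
    class Clazz:
        name: str
        subclasses: List["Clazz"] = field(default_factory=list)

    def tcc5(c, level, complete, overlapping):
        """Replace one specialization level of class c by a class Type-C-i
        whose occurrences enumerate the specializations of c."""
        type_class = Clazz(name=f"Type-{c.name}-{level}")
        occurrences = [sub.name for sub in c.subclasses]
        if overlapping:
            # one special value per overlap between two (or more) subclasses
            occurrences += ["+".join(pair) for pair in combinations(
                [s.name for s in c.subclasses], 2)]
        if not complete:
            occurrences.append("others")
        # an N-1 aggregation then links c to type_class (not modeled here)
        return type_class, occurrences

    shareholder = Clazz("Shareholder",
                        [Clazz("Private_shareholder"), Clazz("Public_shareholder")])
    t, occ = tcc5(shareholder, level=1, complete=True, overlapping=True)
    # t.name == 'Type-Shareholder-1'; occ lists the two subclasses plus their
    # overlap, matching the occurrence set {private, public, both} of the case
    # study's Shareholder_type (section 4.1) up to renaming.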

Thanks to the four transformations described above, the resulting UML model can then be automatically mapped into a logical multidimensional model, as described in the following section.


3.2. Logical design

The aim of the logical design phase is to map the enriched UML conceptual model into a logical one, expressed with the concepts of our unified multidimensional metamodel. This model is generated using specific mapping transformations. The transformations map first the ordinary classes and their attributes (transformations Tcl1 to Tcl3) and then the associations – association classes or ordinary associations – and their attributes (transformations Tcl4 to Tcl6). A transformation is represented in OCL as an operation of the UML model, whose parameters are the target multidimensional model and the concepts mapped in the UML model; the result consists of the concepts obtained in the target multidimensional model.

The ordinary classes of the conceptual UML model are mapped by transformation Tcl1:

Transformation Tcl1: The identifying attribute of each ordinary class is mapped into a dimension in the multidimensional model.

context UMLModel::Tcl1(identifier:Attribute,
    multidimensionalModel:MultidimensionalModel):Dimension
pre: identifier.owner.oclIsTypeOf(OrdinaryClass)=true and
     identifier.identifyingAttribute=true
post: result.name = identifier.name
post: multidimensionalModel->includes(result)

The attributes of ordinary classes are mapped by transformations Tcl2 and Tcl3:

Transformation Tcl2: The non-identifying attributes of each ordinary class are mapped into dimension attributes in the multidimensional model if these non-identifying attributes are not measures of interest.

The resulting dimension attributes are associated with the dimension obtained by mapping the identifying attribute of the ordinary class (transformation Tcl1).

Transformation Tcl3: The non-identifying attributes of each ordinary class are mapped into measures in the multidimensional model if these non-identifying attributes are measures of interest.

context UMLModel::Tcl3(nonIdentifier:Attribute,
    multidimensionalModel:MultidimensionalModel):Measure
pre: nonIdentifier.owner.oclIsTypeOf(OrdinaryClass)=true and
     nonIdentifier.identifyingAttribute=false and
     nonIdentifier.measure=true
post: result.name = nonIdentifier.name
post: nonIdentifier.owner.attribute->forAll(a1:Attribute |
        if a1.identifyingAttribute=true
        then result.dimension = Tcl1(a1, multidimensionalModel)
        else true endif)
post: multidimensionalModel->includes(result)

The measures are associated with the dimension obtained by mapping the identifying attribute of the ordinary class (transformation Tcl1).
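The following sketch (ours, in Python; the Attribute and OrdinaryClass structures are simplified stand-ins for the UML model, not the authors' code) illustrates how Tcl1, Tcl2 and Tcl3 combine on one ordinary class:

    # Illustrative sketch of Tcl1-Tcl3 on one ordinary class (assumed encoding, ours).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Attribute:
        name: str
        identifying: bool = False   # tagged value "id"
        measure: bool = False       # tagged value "meas"

    @dataclass
    class OrdinaryClass:
        name: str
        attributes: List[Attribute]

    def map_ordinary_class(c):
        """Tcl1: identifying attribute -> dimension;
           Tcl2: other non-measure attributes -> dimension attributes;
           Tcl3: other measure attributes -> measures on that dimension."""
        dimension = next(a.name for a in c.attributes if a.identifying)
        dim_attributes = [a.name for a in c.attributes
                          if not a.identifying and not a.measure]
        measures = [(a.name, [dimension]) for a in c.attributes
                    if not a.identifying and a.measure]
        return dimension, dim_attributes, measures

    # Class Region of the case study: region (id), number_of_inhabitants (meas).
    region = OrdinaryClass("Region", [
        Attribute("region", identifying=True),
        Attribute("number_of_inhabitants", measure=True),
    ])
    print(map_ordinary_class(region))
    # ('region', [], [('number_of_inhabitants', ['region'])])  -- cf. Table 1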

The attributes of association classes are mapped using transformation Tcl4:

Transformation Tcl4: The attributes of each association class are mapped into measures, associated with the dimensions obtained by mapping the identifying attributes of the ordinary classes directly or indirectly participating in the association class (transformation Tcl1).

If the association class bearing the attributes has one (or several) participating class(es) with a maximal cardinality of 1, the dimension(s) obtained by mapping the identifying attribute(s) of this (these) class(es) should be indicated between parentheses in the specification of the measures, to express the fact that the dimension(s) are not necessary to functionally determine the measures. For example, in the case study of section 4, this convention yields main_shareholder [media_name, dd_mm_yy, (shareholder_name)].

Finally, associations are mapped using transformations Tcl5 (for binary N-1 associations) and Tcl6 (for the other associations):

Transformation Tcl5: A path formed by N-1 associations is mapped into a hierarchy in the multidimensional model.

This is a simple transformation. However, to improve the definition of multidimensional hierarchies, the transformation could be refined, e.g. by considering the different kinds of UML associations.

Transformation Tcl6: Every N-M or N-ary association without at least one attribute that is always defined is mapped into a dummy measure, associated with the dimensions obtained by mapping the identifying attributes of the ordinary classes directly or indirectly participating in the association (transformation Tcl1).

Note that if an N-M or N-ary association has attributes, these attributes have already been mapped into measures (transformation Tcl4). If one of these attributes is always defined, the corresponding measure is also always defined, making the definition of a "dummy measure" unnecessary.
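A minimal sketch of Tcl6 (ours, in Python; the dictionary encoding of the resulting dummy measure is an assumption):

    # Illustrative sketch of Tcl6 (assumed encoding, ours).
    def tcl6(association_name, participant_identifiers, attributes_always_defined):
        """Map an N-M or N-ary association to a dummy measure dimensioned by the
        identifying attributes of its participants, unless some attribute of the
        association is always defined (Tcl4 then already covers it)."""
        if attributes_always_defined:
            return None   # no dummy measure needed
        return {"measure": association_name, "dummy": True,
                "dimensions": participant_identifiers}

    # The N-M association "gets" between Region and Media has no attributes:
    print(tcl6("gets", ["region", "media_name"], attributes_always_defined=False))
    # {'measure': 'gets', 'dummy': True, 'dimensions': ['region', 'media_name']}
    # -- cf. "dummy measure gets [region, media_name]" in Table 1.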

At the end of the logical design phase, the universe of discourse is described through a unified multidimensional model. Depending on the OLAP tool to be used, this model must then be implemented, i.e. mapped into physical concepts.


3.3. Physical design

The physical design phase depends heavily on the target system. ROLAP systems implement the multidimensional model in a relational database management system (RDBMS). This category of systems may be subdivided depending on the model used to implement the multidimensional model in the RDBMS: typically the star model, the snowflake model, or any combination or extension of these two models.

For each type of target system, a specific set of mapping transformations from the logical multidimensional model has to be defined. In this paper, we focus on the ROLAP star metamodel, represented in Figure 5. The metamodel is an extension of OMG's relational metamodel [OMG 01b]. The main extension consists in the distinction between dimension tables and fact tables. Foreign keys are defined for fact tables only.

Figure 5. ROLAP star metamodel [the figure is a UML class diagram: a StarSchema owns StarSchemaElements, specialized into Table (DimensionTable, FactTable), Column, PrimaryKey and ForeignKey; every table has a primary key over ordered columns, and foreign keys are attached to fact tables only]

The transformations described below map a logical multidimensional model into a ROLAP star schema. Similarly to the previous phases, OCL can be used to formalise these transformations.

The dimensions of the logical multidimensional model are mapped by transformation Tls1:


Transformation Tls1: Every dimension dimensioning at least one measure is mapped into a dimension table and an associated primary key.

Note that the logical dimensions that do not dimension any measure, i.e. the ones that only participate in hierarchies, will be taken care of by transformation Tls4.

Measures are mapped using transformations Tls2 and Tls3:

Transformation Tls2: Every non-dummy measure is mapped into a fact table column (i.e. a fact) in a table T, whose foreign keys correspond to the logical dimensions of the measure and whose primary key corresponds to the subset of these dimensions which are not indicated in parentheses. If table T does not exist, it is defined when mapping the measure.

Note that the logical dimensions indicated in parentheses in the specification of a measure are the ones that are functionally determined by the others. Therefore, they are not used in the definition of the primary key of table T.

Transformation Tls3: Every dummy measure is mapped into a fact table whose foreign keys correspond to the logical dimensions of the measure and whose primary key corresponds to the subset of these dimensions which are not indicated in parentheses.

The fact tables generated by transformation Tls3 are thus factless fact tables [KIM 96].
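The following sketch (ours, in Python; the textual encoding of a measure with its weak, parenthesized dimensions is an assumption) illustrates how Tls2 and Tls3 build a fact table:

    # Illustrative sketch of Tls2/Tls3 (assumed encoding, ours).
    def to_fact_table(measure_name, dimensions, weak_dimensions=(), dummy=False):
        """Build a fact table: one foreign key per logical dimension; the primary
        key keeps only the dimensions not indicated in parentheses; a non-dummy
        measure adds a fact column (Tls2), a dummy measure adds none (Tls3)."""
        foreign_keys = list(dimensions) + list(weak_dimensions)
        primary_key = list(dimensions)          # weak dimensions excluded
        facts = [] if dummy else [measure_name]
        return {"table": measure_name.upper(), "fk": foreign_keys,
                "pk": primary_key, "facts": facts}

    # main_shareholder [media_name, dd_mm_yy, (shareholder_name)] is a dummy
    # measure: Tls3 yields the factless fact table MAIN_SHAREHOLDER of Figure 8.
    print(to_fact_table("main_shareholder", ["media_name", "dd_mm_yy"],
                        weak_dimensions=["shareholder_name"], dummy=True))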

The hierarchies of the logical model are mapped using transformation Tls4:

Transformation Tls4: Every hierarchy D1->D2->...->Dn of the logical model is mapped by considering all the sub-hierarchies Dj->Dj+1->...->Dn where 1<=j<n and Dj dimensions at least one measure. A sub-hierarchy Dj->Dj+1->...->Dn is mapped in the physical model by defining, in the dimension table identified by Dj, a column corresponding to each of the Di (where j<i<=n).

context MultidimensionalModel::Tls4(hierarchy:DimensionHierarchy,
    starSchema:StarSchema):Set(Column)
post: Sequence{1..(hierarchy.dimensionLink->size())}->forAll(J:Integer |
        if (hierarchy.dimensionLink->at(J)).source.measure->notEmpty()
        then Sequence{J..(hierarchy.dimensionLink->size())}->forAll(I:Integer |
          starSchema->forAll(dimensionTable:DimensionTable |
            if dimensionTable.primaryKey.column.name =
               (hierarchy.dimensionLink->at(J)).source.name
            then (dimensionTable->select(c1:Column |
                    c1.name=(hierarchy.dimensionLink->at(I)).source.name)->size()=1)
                 and result->includes(dimensionTable->select(c1:Column |
                    c1.name=(hierarchy.dimensionLink->at(I)).source.name))
            else true endif))
        else true endif)
post: starSchema->includes(result)

Dimension attributes are mapped with transformation Tls5:

Transformation Tls5: Every attribute of every dimension Di of the logical model is mapped into a dimension table column, in all the dimension tables which possess an (identifying or non-identifying) column corresponding to Di.
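A minimal sketch of the hierarchy flattening performed by Tls4 (ours, in Python; dimension tables are reduced to plain column lists, which is an assumption):

    # Illustrative sketch of Tls4 (assumed encoding, ours).
    def tls4(hierarchy, dimension_tables):
        """For every sub-hierarchy Dj->...->Dn whose root Dj owns a dimension
        table, add one column per upper level Di (j < i <= n) to that table."""
        for j, dj in enumerate(hierarchy):
            if dj in dimension_tables:
                for di in hierarchy[j + 1:]:
                    if di not in dimension_tables[dj]:
                        dimension_tables[dj].append(di)

    # Hierarchy time dd_mm_yy->quarter->year; dd_mm_yy and quarter dimension
    # measures (so own tables), year does not.
    tables = {"dd_mm_yy": ["dd_mm_yy"], "quarter": ["quarter"]}
    tls4(["dd_mm_yy", "quarter", "year"], tables)
    print(tables)
    # {'dd_mm_yy': ['dd_mm_yy', 'quarter', 'year'], 'quarter': ['quarter', 'year']}
    # -- matching DATE(dd_mm_yy, quarter, year) and QUARTER(quarter, year) in Figure 8.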

The next section illustrates our design method through an example, phase after phase.

4. Case study

A firm is faced with the definition of an optimal media-planning system. The company wishes to launch advertising campaigns for its products using several types of media (radio, TV, newspapers, magazines, etc.). Its objective is to maximize the number of consumers exposed to the advertising campaign. To support the decision-making process, we need to define a multidimensional model with all the data relevant to the media-planning problem.

4.1. Conceptual design

The conceptual model is represented in Figure 6. It contains data related to the products concerned by the advertising campaigns. The consumers are represented as targets located in different regions. The consumers are defined according to their purchasing behavior over time, which is strongly influenced by the advertising campaigns. The consumers are exposed to several types of media; this exposure is measured over time. The model includes all the key information about the media shareholding. The real model has been simplified in order to be more readable.


Figure 6. The UML conceptual model for the media-planning example [the class diagram relates Product, Product_type, Advertising_campaign, Date, Quarter, Year, Media, Media_type, Target, Region and Shareholder (specialized into Private_shareholder – itself specialized into Person and Company – and Public_shareholder), through the association classes consumption (product_consumption) and exposure (media_exposure) and the associations for, during, in, is_strongly_influenced_by, may_be_advertised_in, gets and main_shareholder]

The UML model is then enriched by the determination of identifying attributes (id) and of attributes representing measures (meas), the migration of association attributes and the transformation of generalizations. Two classes Type-Shareholder-1 and Type-Shareholder-2 are created and renamed Shareholder_type and Private_shareholder_type. The set of occurrences of Shareholder_type is {private, public, both}. The set of occurrences of Private_shareholder_type is {person, company, others}. The attribute percentage_of_region is transferred to the class Target (transformation Tcc4). The result is represented in Figure 7.


Figure 7. Enriched/transformed UML model [same model as Figure 6, with id and meas tagged values attached to the attributes, percentage_of_region migrated to Target, and the shareholder generalizations replaced by the classes Shareholder_type and Private_shareholder_type aggregated to Shareholder, whose attributes now include public_shareholder_level and manager_name]

4.2. Logical design

We give below the unified multidimensional representation resulting from the application of the mapping transformations (Table 1).


dimension product_code
dimension product_type
dimension campaign_code
dimension dd_mm_yy
dimension quarter
dimension year
dimension shareholder_name
dimension shareholder_type
dimension private_shareholder_type
dimension target_code
dimension region
dimension media_name
dimension media_type

measure percentage_of_region [target_code]
measure number_of_inhabitants [region]
measure product_consumption [product_code, target_code, quarter]
dummy measure is_strongly_influenced_by [product_code, target_code, campaign_code]
dummy measure in [campaign_code, media_name]
dummy measure may_be_advertised_in [product_type, media_type]
dummy measure gets [region, media_name]
dummy measure main_shareholder [media_name, dd_mm_yy, (shareholder_name)]
measure media_exposure [media_name, target_code, quarter]

hierarchy campaign_product campaign_code->product_code->product_type
hierarchy campaign_date campaign_code->quarter->year
hierarchy time dd_mm_yy->quarter->year
hierarchy shareholder_type shareholder_name->shareholder_type
hierarchy private_shareholder_type shareholder_name->private_shareholder_type
hierarchy media_type media_name->media_type
hierarchy location target_code->region

attribute product_unit [product_type]
attribute product_name [product_code]
attribute status [target_code]
attribute minimum_age [target_code]
attribute maximum_age [target_code]
attribute sex [target_code]
attribute insertion [media_type]
attribute advertising_price [media_name]
attribute public_shareholder_level [shareholder_name]
attribute manager_name [shareholder_name]

Table 1. Media-planning multidimensional model

4.3. Physical design

Figure 8 presents the result of the application of the transformations described in section 3.3, assuming that a ROLAP star implementation is used. Note that, since many dimension tables are shared by different fact tables, Figure 8 actually represents a constellation of stars.


Figure 8. Media-planning ROLAP star physical schema

[The figure shows the following tables (FK = foreign key):

Dimension tables:
PRODUCT (product_code, product_name, product_type, product_unit)
PRODUCT_TYPE (product_type, product_unit)
ADVERTISING_CAMPAIGN (campaign_code, quarter, year, product_code, product_name, product_type, product_unit)
DATE (dd_mm_yy, quarter, year)
QUARTER (quarter, year)
TARGET (target_code, status, minimum_age, maximum_age, sex, region)
REGION (region)
MEDIA (media_name, advertising_price, media_type, insertion)
MEDIA_TYPE (media_type, insertion)
SHAREHOLDER (shareholder_name, public_shareholder_level, manager_name, shareholder_type, private_shareholder_type)

Fact tables:
CONSUMPTION (FK product_code, FK target_code, FK quarter, product_consumption)
EXPOSURE (FK media_name, FK target_code, FK quarter, media_exposure)
TARGET_FIGURES (FK target_code, percentage_of_region)
REGION_FIGURES (FK region, number_of_inhabitants)
IS_STRONGLY_INFLUENCED_BY (FK product_code, FK target_code, FK quarter, FK campaign_code)
IN (FK campaign_code, FK media_name)
MAY_BE_ADVERTISED_IN (FK product_type, FK media_type)
GETS (FK region, FK media_name)
MAIN_SHAREHOLDER (FK media_name, FK dd_mm_yy, FK shareholder_name)]


After the definition of the physical database schema, the data confrontation phase is performed in order to map the physical schema data elements with the data sources. Since it is beyond the scope of this paper, this confrontation phase is not illustrated here.

5. State-of-the-Art

The deficit in data warehouse design methods is real. Very few methods have been proposed until now; let us mention [AKO 97; AKO 01; GOL 98a; CAB 98; MOO 00].

As stated by [SAP 98], unlike papers describing design methods, a fair number of publications is available concerning multidimensional data modeling, but very few recognize the importance of the separation of conceptual, logical and physical issues. Moreover, even when the three levels are considered, some confusion exists between, on the one hand, conceptual and logical models and, on the other hand, logical and physical models. A real confusion also seems to exist between the conceptual and physical aspects. As an example, the multidimensional modeling manifesto of Kimball is inadequate for conceptual modeling [KIM 97]. His approach tends to include physical design issues, especially with his propositions of star and snowflake schemas, which appear not to be independent from implementation issues.

5.1. Conceptual-logical models

[CAB 98] focus on logical design issues and propose a logical model for OLAP systems. They assume that an integrated ER schema of the operational data sources exists and provide a methodology to transform this ER schema into a dimensional graph. [GOL 98b] propose a conceptual model called the Dimensional-Fact schema, and provide a methodology to transform the ER model of the data sources into a Dimensional-Fact model; note that their approach is not based on a formal data model. [LEH 98b] propose a conceptual multidimensional model which includes some mechanism to structure qualifying information; note that no formal graphical notation is provided. [SAP 98] present a specialization of the ER model, called the Multidimensional Entity-Relationship Model (MER), expressing the multidimensional structure of the data by means of two specialized relationship sets and a specialized entity set. Their approach models user requirements independently from the structure of the data sources. [GYS 97] propose a multidimensional database model providing the functionalities necessary for OLAP-based applications. They make a clear separation between the structural aspects and the contents, allowing them to define data manipulation languages in a transparent way, and they define an algebra and a calculus.


5.2. Logical-physical models

A survey of logical models for OLAP databases can be found in [VAS 99]. The main feature of these models is that they systematically offer a logical view of data to be queried by a set of operators and, usually, a set of implementation mechanisms. Among these models, let us mention [AGR 97], who provide a logical model in which dimensions and measures are treated in a symmetric way and where multiple hierarchies of dimensions allow ad hoc aggregates. [PED 99] propose a multidimensional logical data model for complex data, justifying nine requirements to be satisfied in order to support complex data; they show that their model covers these nine requirements better than previous models [RAF 90; AGR 97; GRA 96; KIM 96; LIW 96; GYS 97; DAT 97; LEH 98a]. [HAR 96] investigate mainly physical design and implementation issues.

Finally, our method is an attempt to define a generic framework based on the following principles:

– it makes a clear distinction between the classical steps of database design (conceptual, logical, physical),

– it unifies the different multidimensional concepts into a single and generic model,

– it can be used for implementation with ROLAP as well as MOLAP tools, although this article stressed the frequent case of ROLAP star schemas,

– it capitalizes on the existing UML schemas,

– it is based on well-established UML concepts.

6. Conclusion and further research

We have described a method for designing and developing data warehouses. Capitalizing on database design techniques, we proposed a conceptual design phase based on the UML notation, followed by an enrichment/transformation of the resulting model. This enrichment/transformation allows the designer to automatically convert the conceptual representation into a logical multidimensional model. At this step, we proposed a generic multidimensional metamodel, independent from implementation issues and unifying ROLAP and MOLAP concepts. Using mapping rules, this generic logical model can be mapped to any physical multidimensional platform. A case study was described to illustrate the main features of the method.

Similarly to software engineering, which has come to maturity with the advent of CASE tools, the data warehousing community needs extensive tool support to achieve significant productivity improvements. In this respect, our method, which can be semi-automated, is a contribution. A first prototype has been developed [BAR 02] and the results obtained so far confirm the relevance and usefulness of our approach. We plan to extend the tool and use it to further test our approach on more extensive, real-life case studies.

Further questions remain open. The different sets of mapping transformations have to be extended. A reverse engineering approach taking into account existing data warehouses and/or data marts must be developed [AKO 99]. We are currently working on these issues.

Acknowledgments

This paper has benefited from the helpful advice and comments of Isabelle COMYN-WATTIAU. Her contribution is gratefully acknowledged. The authors also thank the Research Center of ESSEC for its financial support.

7. References

[AGR 97] AGRAWAL R., GUPTA A., SARAWAGI S., "Modeling multidimensional databases", 13th International Conference on Data Engineering (ICDE'97), Birmingham, UK, April 1997.

[AKO 97] AKOKA J., PRAT N., "Modélisation logique des données dans les Systèmes Multidimensionnels d'Aide à la Décision : la méthode MAP", Revue des Systèmes de Décision, vol. 6(2), June 1997.

[AKO 99] AKOKA J., COMYN-WATTIAU I., "Rétro-conception des « datawarehouses » et des systèmes multidimensionnels", 17ème Congrès INFORSID, La Garde, June 1999.

[AKO 01] AKOKA J., COMYN-WATTIAU I., PRAT N., "Dimension hierarchies design from UML generalizations and aggregations", 20th International Conference on Conceptual Modeling (ER 2001), Yokohama, Japan, November 2001.

[BAR 02] BARREZ J.C., "GEDESID : un générateur de systèmes décisionnels", Mémoire d'Ingénieur, CNAM, Paris, June 2002.

[BLA 98] BLASCHKA M., SAPIA C., HÖFLING G., DINTER B., "Finding your way through multidimensional data models", DEXA Workshop on Data Warehouse Design and OLAP Technology (DWDOT'98), Vienna, Austria, 1998.

[CAB 98] CABIBBO L., TORLONE R., "A Logical Approach to Multidimensional Databases", 6th International Conference on Extending Database Technology (EDBT'98), Valencia, Spain, March 1998.

[CHA 97] CHAUDHURI S., DAYAL U., "An overview of data warehousing and OLAP technology", SIGMOD Record, vol. 26, number 1, March 1997.

[DAT 97] DATTA A., THOMAS H., "A Conceptual Model and Algebra for On-Line Analytical Processing in Decision Support Databases", Proceedings of WITS, 1997.

[GOL 98a] GOLFARELLI M., MAIO D., RIZZI S., "Conceptual design of data warehouses from E/R schemes", 31st Hawaii International Conference on System Sciences, Hawaii, USA, January 1998.

[GOL 98b] GOLFARELLI M., RIZZI S., "A methodological framework for data warehousing design", 1st ACM Workshop on Data Warehousing and OLAP (DOLAP'98), Washington DC, USA, November 1998.

[GRA 96] GRAY J. et al., "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals", Proceedings of ICDE, 1996.

[GYS 97] GYSSENS M., LAKSHMANAN L.V.S., "A foundation for multi-dimensional databases", 23rd VLDB Conference, Athens, Greece, 1997.

[HAR 96] HARINARAYAN V., RAJARAMAN A., ULLMAN J.D., "Implementing Data Cubes Efficiently", Proceedings of the SIGMOD Conference, 1996.

[HÜS 00] HÜSEMANN B., LECHTENBÖRGER J., VOSSEN G., "Conceptual data warehouse design", 2nd International Workshop on Design and Management of Data Warehouses (DMDW 2000), Stockholm, Sweden, June 2000.

[JAC 99] JACOBSON I., BOOCH G., RUMBAUGH J., The Unified Software Development Process, Addison-Wesley, 1999.

[KIM 96] KIMBALL R., The Data Warehouse Toolkit, John Wiley & Sons, 1996.

[KIM 97] KIMBALL R., "A Dimensional Modeling Manifesto", DBMS online, 1997, http://www.dbmsmag.com/.

[LEH 98a] LEHNER W., "Modeling Large Scale OLAP Scenarios", Proceedings of EDBT, 1998.

[LEH 98b] LEHNER W., ALBRECHT J., WEDEKIND H., "Normal Forms for Multidimensional Databases", Proceedings of the 10th SSDBM Conference, Italy, July 1998.

[LIW 96] LI C., WANG X.S., "A data model for supporting on-line analytical processing", Conference on Information and Knowledge Management (CIKM'96), Baltimore, USA, November 1996.

[MOO 00] MOODY D.L., KORTINK M.A.R., "From Enterprise Models to Dimensional Models: A Methodology for Data Warehouse and Data Mart Design", 2nd International Workshop on Design and Management of Data Warehouses (DMDW 2000), Stockholm, Sweden, June 2000.

[MOR 00] MORLEY C., HUGUES J., LEBLANC B., UML pour l'analyse d'un système d'information, Dunod, Paris, 2000.

[OLA 02] OLAP REPORT, "The OLAP Report – Market share analysis", 2002, http://www.olapreport.com/Market.htm

[OMG 01a] OMG (OBJECT MANAGEMENT GROUP), "Unified Modeling Language (UML) specification", version 1.4, September 2001, http://www.omg.org/technology/documents/formal/uml.htm

[OMG 01b] OMG (OBJECT MANAGEMENT GROUP), "Common Warehouse Metamodel (CWM) specification", version 1.0, October 2001, http://www.omg.org/technology/documents/formal/cwm.htm

[PED 99] PEDERSEN T.B., JENSEN C.S., "Multidimensional data modeling for complex data", 15th International Conference on Data Engineering (ICDE'99), Sydney, Australia, March 1999.

[RAF 90] RAFANELLI M., SHOSHANI A., "STORM: A Statistical Object Representation Model", Proceedings of SSDBM, 1990.

[SAP 98] SAPIA C., BLASCHKA M., HÖFLING G., DINTER B., "Extending the E/R Model for the Multidimensional Paradigm", International Workshop on Data Warehousing and Data Mining (in conjunction with ER'98), Singapore, 1998.

[VAS 99] VASSILIADIS P., SELLIS T., "A survey of logical models for OLAP databases", SIGMOD Record, vol. 28, number 4, December 1999.

[VAS 00] VASSILIADIS P., "Gulliver in the land of data warehousing: practical experiences and observations of a researcher", International Workshop on Design and Management of Data Warehouses (DMDW 2000), Stockholm, June 2000.


Measuring UML Conceptual Modeling Quality: Method and Implementation

Samira SI-SAID CHERFI* – Jacky AKOKA*,** – Isabelle COMYN-WATTIAU*,***,****

* Laboratoire CEDRIC-CNAM, 292 rue Saint-Martin, 75141 Paris Cedex 03
sisaid, [email protected]
** Institut National des Télécommunications, 9 rue Charles Fourier, 91011 Évry Cedex
*** Université de Cergy-Pontoise
[email protected]
**** ESSEC Business School, Avenue Bernard Hirsch, B.P. 105, 95021 Cergy-Pontoise Cedex

ABSTRACT: The purpose of a conceptual model is to provide an accurate reflection of the user's requirements. However, there are many ways of formulating a universe of discourse. Although all might be argued to be equally correct, not all are necessarily equally useful. This research investigates the evaluation process of conceptual specifications developed using Unified Modeling Language (UML) conceptual models. In this paper, we primarily address the problem of assessing conceptual modeling quality. In particular, we provide a comprehensive framework for evaluating UML conceptual schemas. Furthermore, we define and examine classes of metrics facilitating the evaluation process and leading to the choice of the appropriate representation among several schemas describing the same reality. Extending quality criteria proposed in the literature, we select the subset of criteria relevant to conceptual UML schema quality evaluation and describe the implementation of this quality measurement in the Rational Rose CASE tool. For each criterion we define metrics allowing the designer to measure the schema quality. More specifically, we evaluate alternative UML conceptual schemas representing the same universe of discourse using the appropriate criteria and their associated metrics.

KEYWORDS: Conceptual modeling quality, UML conceptual modeling, quality criteria, quality metrics, user validation.


1. Introduction

A conceptual schema is an abstraction of a universe of discourse under consideration. It is concerned with the analysis of users' requirements. It leads to the elaboration of a high-level data model. It is an abstract structure in which all relevant concepts and the relationships between them are taken into account. The conceptual schema is defined by using concepts that the user understands and applies. Understanding users' requirements may be a complicated problem for the system designer. Usually, the users express their problems using domain-oriented concepts and rules. From the users' viewpoint, a conceptual schema should be supported by facilities for accessing, developing, analyzing, changing and maintaining the concepts used in its construction. These five functions can be structured as three dimensions: the specification (analyzing), the usage (accessing, changing) and the implementation (developing, maintaining) dimensions. As a consequence, a conceptual schema developed using UML concepts provides a basis for the validation process. Although a conceptual schema may be consistent, it is not necessarily correct. In some cases, it may have nothing to do with the universe of discourse considered. There is therefore a need for criteria allowing us to measure the quality of a conceptual schema.

Unlike in software engineering, where the quality of software products has been extensively studied and software metrics have been applied to measure data quality, the literature devoted to conceptual modeling quality evaluation is limited, in particular for UML schema quality evaluation. This literature provides lists of desirable properties of conceptual schemas. Among the general quality criteria proposed to evaluate conceptual schemas, let us mention: completeness, inherence, clarity, consistency, orthogonality, generality, abstractness and similarity. The formalization of these criteria is not yet sufficiently well understood. For example, abstractness measures the level of suppression of detail. Similarity measures the capability of the conceptual schema to reflect the real world. But how do we measure these two criteria, and how do we interpret the results obtained by the evaluation process? Although it appears very desirable to avoid conceptual schemas with low abstraction and low similarity, there is clearly a tradeoff between abstractness and similarity. How do we determine the value of this tradeoff?

As stated above, a conceptual schema can be correct but not necessarily useful. Even if two conceptual schemas representing the same universe of discourse are considered to be equally correct, how do we choose the more user-friendly one? Do we prefer the simpler one? Do we know how to characterize this simplicity?

The aim of this paper is therefore to answer these questions and, more precisely, to contribute to understanding the following problems:

- What are the desirable properties of a conceptual schema?


- How do we structure them in a global framework?

- How do we measure and interpret quality of UML conceptual schemas?

- Can we perform automatically and implement the measurement process?

This paper extends previous findings by providing an evaluation framework and by further examining quality criteria and classes of metrics relevant to UML conceptual schemas. Our underlying objective is to facilitate the choice of a conceptual schema among several candidates using quality measures. The work presented in this paper extends the one presented in [Si-Said et al. 2002], which proposes a set of generic metrics and their application to EER conceptual schema quality evaluation.

The structure of this paper is as follows: the next section reviews relevant literature in conceptual schema quality evaluation. Section 3 is devoted to the presentation of our framework for conceptual quality evaluation; criteria and their associated metrics are discussed. The results obtained by the application of the framework to a case study using a UML model are then presented in Section 4. Section 5 is devoted to the description of implementation issues. Finally, we conclude in Section 6 and discuss areas of future research.

2. Related literature

Our current work borders on several lines of existing research in quality properties in software engineering, more precisely programs, data and conceptual schemas. There exists abundant empirical or qualitative literature regarding the quality of software products. Software metrics are used to assess the quality of programs and, more generally, software product quality [Davis 90]. In the area of data quality, several attributes have been studied extensively [Zmud et al., 90]. In the intuitive approach used to study data quality, the main attributes suggested are: accuracy, timeliness, precision, reliability, currency, completeness, accessibility and relevancy [Kriebel 79, Bailey et al. 83, Ives et al. 83, Wang et al. 93, Wang et al. 95]. The theoretical approach uses ontologies in which attributes of data quality are derived based on data deficiencies [Wand et al. 96]. The empirical approach collects data from data consumers to identify quality attributes [Wang et al. 96].

However, related research in conceptual modeling quality evaluation is emerging. The first structured approach dates back to the contribution of [Batini et al. 92]. For the first time, they propose quality criteria relevant to conceptual schema evaluation (completeness, correctness, minimality, expressiveness, readability, self-explanation, extensibility, normality). Although they provide some transformations in order to improve conceptual schema quality using these criteria, they do not define metrics to evaluate this quality. In [Lindland et al. 94], the quality of schemas is evaluated along three dimensions: syntax, semantics and pragmatics. Syntactic quality refers to the degree of correspondence between the conceptual schema and its representation. Semantic quality refers to the degree of correspondence between the conceptual schema and the real world. Finally, pragmatic quality defines the degree of correspondence between a conceptual schema and its interpretation, which can be defined as the degree to which the schema can be understood. In [Assenova et al. 96], a set of criteria is proposed: homogeneity, explicitness, size, rule simplicity, rule uniformity, query simplicity and stability. Moreover, the authors describe a set of transformations aiming at improving the quality of schemas; for each transformation, its impact on each quality criterion is discussed. In the context of Business Process Reengineering (BPR), Van den Berg and Teeuw proposed a four-dimensional framework for evaluating models and tools [Teeuw et al. 97]. The first dimension is functionality, defined by the following criteria: expressiveness, abstractions, compositions, formal/methodological support, relevance of concepts, etc. The second dimension is ease of use, associated with the following criteria: accessibility, usability, adaptability, openness, etc. The third dimension is the BPR trajectory, and the last is a general dimension related to tool price and customer support. Moody extends [Batini et al. 92] by defining a comprehensive set of metrics, based on eight quality factors taking into account the categories of actors: business users (understandability, flexibility, integrity, completeness), data analysts (correctness, simplicity), data administrators (integration), and application developers (implementability) [Moody 98]. These metrics have neither been validated nor implemented. However, the framework has been used to evaluate an application development project, to measure a process quality, to support an evaluation process and to evaluate differences between data models produced by expert and novice data modelers [Moody et al. 98]. [Schuette et al. 98] have shown how to manage the subjectivity of the modeler through the standardization of the modeling process. They proposed general guidelines for modeling based on six principles: construction adequacy, language adequacy, economic efficiency, clarity, systematic design, and comparability. Genero et al. have presented a set of automatically computed metrics for evaluating ER diagram complexity [Genero et al. 2000]. In addition, they have proposed an induction method for building fuzzy regression trees, allowing them to evaluate the existence of a relationship between the metrics and ER diagram maintainability subcharacteristics: understandability, legibility, simplicity, analysability, modifiability, stability and testability. Poels et al. have proposed a set of measures to assess size, structure and dynamic behaviour aspects related to the complexity of object-oriented conceptual schemas [Poels et al. 2000]. Misic and Zhao have adapted and extended Lindland's framework to the comparison of reference models; they describe an application of their work to two models for electronic commerce [Misic et al. 2000].

As shown in the next section, our contribution does not mainly lie in the definition of new criteria. However, we significantly extend the existing criteria by elaborating metrics that capture the relevant properties of UML conceptual schemas. Evaluating the quality of these schemas means being able to exhibit the appropriate metrics associated with the criteria within a structured framework. This is precisely the aim of the next section.


3. A framework for conceptual quality evaluation

We have shown in the previous section that most of the literature on conceptual schema quality evaluation mainly provides lists of desirable properties. The framework presented in this section addresses the issue of conceptual schema quality measurement in a more systematic way. The quality of a conceptual schema can be measured by its capability to:

- provide a formal representation of the observed reality,

- meet users' requirements,

- be a basis for the future information system implementation.

Taking into account these three fundamental objectives, the framework suggests evaluating the quality of a conceptual schema in a space designed along three dimensions, namely the specification, usage and implementation dimensions (Figure 1). This framework could be used in several applications. The categorization of notations could be one of them, by providing guidelines allowing the measurement of how a notation contributes to information systems modeling along the three dimensions. A second application could be the analysis of method practice: here, the framework could guide observations on how designers use the notations of a given method. Such an application could be very useful in the domain of education (e-learning, teaching methods, etc.). The application presented here is the evaluation of conceptual schema quality.

The quality of a conceptual schema is maximal when it takes the value “1” on each of these three dimensions. It corresponds to a schema that “perfectly” meets the desired characteristics for the specification, implementation and usage dimensions. The specification dimension is related to the conceptual phase and is described below. The implementation dimension refers to the production of a database from a conceptual specification, satisfying the various requirements. Finally, the usage dimension defines the ease of operations on the conceptual schema.

[Figure 1 depicts quality as a point in a space with three axes, each graduated from 0 to 1: Specification (legibility, clarity, minimality, expressiveness, simplicity, correctness), Usage (completeness, understandability) and Implementation (implementability, maintainability).]

Figure 1: The three dimensions of quality


In this article, due to the complexity resulting from the interaction between the three dimensions, we concentrate on the specification dimension only, since it seems to us crucial for the evaluation process. However, a set of quality criteria is proposed for the two other dimensions; the metrics associated with this set of criteria will be developed in a forthcoming paper. For the specification dimension, we have defined a set of measurable metrics that have been applied to several examples.

3.1. The specification dimension

A conceptual model is an abstract representation of reality. This representation is the result of a specification based on a given set of notations. The specification is generally used as a means of reaching agreement among analysts. It is also a communication tool during the validation process with the users. The notation provides the models with formal semantics that enable tool-based analysis. Our review of the literature has shown a lack of frameworks for, and of empirical validation of, UML schema quality. The specification dimension captures the degree of requirements understanding reflected in the conceptual schema. Several standards and guidelines in the literature describe recommended characteristics for requirements specifications [IEEE Std 830-1998]. We are interested in measurable criteria supported by computable metrics. We have identified the following criteria: legibility, expressiveness, simplicity and correctness. They are retained because they can be measured automatically, which is our objective.

3.1.1. Legibility

Legibility (or readability) expresses the ease with which a conceptual schema can be read. To measure legibility, we propose two sub-criteria, namely clarity and minimality.

Clarity

Clarity is a purely aesthetic criterion. It is based on the graphical arrangement of the elements composing the conceptual schema with respect to some guidelines. When the number of elements composing the schema grows, a direct consequence of a bad arrangement is an increase in line-crossings, which penalizes readability [Batini et al. 92]. Other criteria on layout design are described in [Schuette et al. 98]. These criteria are interesting for clarity but difficult to measure, as they require measuring lengths, angles and surfaces. In fact, they characterize graphic choices independently of conceptual choices. Our research is concerned with conceptual alternatives and their consequences on drawing. Therefore, our metric is based on the number of line-crossings in the schema. Note that all the measures made in Section 4 are based on schemas that are equivalent with regard to the other layout criteria.


$$\mathrm{Clarity} = \frac{NB(A) + NB(AC) + NB(H) - CR}{NB(A) + NB(AC) + NB(H)}$$

Where NB(H) is the number of inheritance links, NB(A) the number of associations, NB(AC) the number of association classes, and CR the number of line-crossings in the schema.

This metric is based on the heuristic that a schema containing N edges can have at most N crossings; in practice, it will have far fewer line-crossings. In a class diagram of the UML notation, lines may correspond to associations, links relating association classes to their corresponding associations, compositions, aggregations or inheritance links. Each association is represented by one line, each association class contributes one link to its association (Figure 2), and each inheritance link provides one edge.
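To make the computation concrete, here is a minimal sketch in Java (the implementation language of the prototype described in Section 5). The counting of associations, association classes, inheritance links and crossings is assumed to be done beforehand by a schema parser; the class and method names are ours, not the prototype's.

public final class ClarityMetric {

    // Clarity = (NB(A) + NB(AC) + NB(H) - CR) / (NB(A) + NB(AC) + NB(H))
    public static double clarity(int nbA, int nbAC, int nbH, int crossings) {
        int edges = nbA + nbAC + nbH;
        if (edges == 0) return 1.0; // no edges: nothing can cross
        return (double) (edges - crossings) / edges;
    }

    public static void main(String[] args) {
        // A layout with 3 edges and 2 crossings yields 1/3, i.e. the 0.33 of
        // Figure 2(a), assuming these counts for that layout.
        System.out.println(clarity(3, 0, 0, 2));
    }
}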

[Figure 2 shows two layouts of the same Employee-City schema (associations works_in, lives_in and manages, with manager/subordinate roles): layout (a) contains line-crossings and has Clarity = 0.33; layout (b) contains none and has Clarity = 1.]

Figure 2: Assessing clarity

Minimality

A schema is said to be minimal if every concept present in the schema has a distinct meaning [Batini et al. 92]. The minimality measure is based on the number of concepts necessary to describe a reality: the smaller the number of concepts used, the higher the minimality of the conceptual schema. This aspect is captured by the first sub-criterion, named non-redundancy. Moreover, the UML notation is devoted to object-oriented representations of reality. It therefore allows the designer to exploit the main characteristics of object orientation: abstraction, encapsulation, modularity and hierarchy [Booch 91]. Object-oriented models like UML rely on two fundamental mechanisms to support these characteristics: inheritance and aggregation. These two concepts contribute to increasing the minimality of a conceptual schema by allowing, respectively, factorization and the grouping of concepts. To characterize this, we propose two supplementary sub-criteria: factorization degree and aggregation degree.

Non-redundancy. A schema is said to be non-redundant when every aspect of the requirements appears only once. We propose the following metric:


$$\mathrm{Non\text{-}redundancy} = \frac{\sum_{S} w_i \, NB(C_i) - \sum_{S} w_i \, NB_R(C_i)}{\sum_{S} w_i \, NB(C_i)}$$

Where C_i belongs to {class, association, inheritance link, association class, aggregation link, composition link}, NB(C_i) is the number of elements of type C_i in the current schema S, NB_R(C_i) the number of redundant elements of type C_i, and w_i the weight associated with C_i.

A schema containing no redundancy has a non-redundancy value equal to 1; the value decreases as the number of redundancies increases. In this metric, as in the other metrics described below, weights are associated with concepts such as class, association, inheritance link, etc. The motivation for adding weights is to take into account the impact of redundant concepts. For example, the redundancy of a class adds one more concept to the schema (see the example in Figure 3).
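A hedged sketch of this computation, assuming the schema has already been summarized into per-concept-type counts; the map-based encoding is ours:

import java.util.Map;

public final class NonRedundancyMetric {

    // total: NB(Ci) per concept type; redundant: NB_R(Ci); weight: wi
    public static double nonRedundancy(Map<String, Integer> total,
                                       Map<String, Integer> redundant,
                                       Map<String, Integer> weight) {
        double all = 0, dup = 0;
        for (Map.Entry<String, Integer> e : total.entrySet()) {
            int w = weight.getOrDefault(e.getKey(), 1);
            all += w * e.getValue();
            dup += w * redundant.getOrDefault(e.getKey(), 0);
        }
        return all == 0 ? 1.0 : (all - dup) / all;
    }
}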

[Figure 3 contrasts two schemas: (a) separate Administration_staff and Medical_staff classes, each with its own works_in association to Service, giving Minimality = 0.33; (b) a single Staff class, specialized into Administration_staff and Medical_staff, with one works_in association to Service, giving Minimality = 1.]

Figure 3: Assessing minimality

Factorization degree. The factorization degree measures the effectiveness of the inheritance hierarchies of a schema. The objective is to evaluate the effectiveness of inheritance use by measuring the degree of factorization within each hierarchy. The factorization process applies to attributes as well as to associations and methods.

$$\mathrm{Factorization\ degree} = 1 - \frac{1}{NB(H)} \sum_{H} \frac{1}{NB(C_i)} \sum_{C_i} \frac{DEF(C_i)}{USE(C_i)}$$

Where H ranges over the hierarchies of the schema, C_i ∈ {attribute, association, operation}, DEF(C_i) counts the number of occurrences (definitions) of an element C_i in a hierarchy, USE(C_i) counts the number of times an element C_i is inherited, plus 1, NB(H) is the number of hierarchies in the schema, and NB(C_i) the total number of different concepts in the hierarchy.

The value of the factorization degree is close to 1 when USE(C_i) is very high compared to DEF(C_i) for all the concepts C_i. This is the case when the schema has deep hierarchies with many elements located at the most abstract levels of the hierarchy, leading to a high number of inheritances.
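Under the reading of the formula given above (a reconstruction; it should be checked against the original publication), the computation could be sketched as follows, with hierarchies encoded as arrays of per-concept DEF and USE counts:

public final class FactorizationMetric {

    // defs[h][i] = DEF(Ci), uses[h][i] = USE(Ci) for concept i of hierarchy h;
    // USE is at least 1 by definition (number of inheritances plus one).
    public static double factorizationDegree(int[][] defs, int[][] uses) {
        double perHierarchySum = 0;
        for (int h = 0; h < defs.length; h++) {
            double ratioSum = 0;
            for (int i = 0; i < defs[h].length; i++) {
                ratioSum += (double) defs[h][i] / uses[h][i];
            }
            perHierarchySum += ratioSum / defs[h].length; // divide by NB(Ci)
        }
        return 1.0 - perHierarchySum / defs.length;       // divide by NB(H)
    }
}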

Aggregation degree. The aggregation degree measures the efficient use of aggregate attributes in a UML schema. This quality measure is based on the aggregation level of attributes. Object orientation supports complex attribute types: an attribute domain may be a collection of basic data types or an aggregation of such types; it can also be a class or a complex construct based on several other classes. The Unnest function makes explicit the different aggregation levels of a given attribute, and the Level function counts the number of aggregation levels. As an example, in a hospital case study, Unnest applied to the attribute "prescription_detail" of the association class "Prescription" gives:

Unnest(prescription_detail) = Aggregate[date; Collection(Aggregate[frequency; duration; Aggregate[drug_name; drug_description]])].

In other words, the attribute "prescription_detail" is characterized by five abstraction levels (Level = 5). If an attribute domain is a basic data type, its Level value equals 1. Hence, the metric for the aggregation degree is:

$$\mathrm{Aggregation\ degree} = \frac{1}{NB(C_i)} \sum_{C_i} \left( 1 - \frac{1}{Level(Unnest(C_i))} \right)$$

Where Level(Unnest(C_i)) counts the number of aggregation levels of an attribute C_i and NB(C_i) is the number of attributes in the schema.

Notice that if a schema does not contain any aggregate attribute, Level(Unnest(C_i)) = 1 for each attribute, leading to a global value of 0 for this metric. This confirms our belief that this measure does not replace the previous measures defined for minimality but must be added to them to enrich the overall schema quality evaluation.
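A small sketch, assuming the Level(Unnest(.)) value has already been computed for every attribute; with all levels equal to 1 the result is 0, as stated above:

public final class AggregationMetric {

    // levels[i] = Level(Unnest(Ci)) for attribute i
    public static double aggregationDegree(int[] levels) {
        if (levels.length == 0) return 0.0;
        double sum = 0;
        for (int level : levels) sum += 1.0 - 1.0 / level;
        return sum / levels.length;
    }

    public static void main(String[] args) {
        // prescription_detail has Level = 5; three flat attributes have Level = 1
        System.out.println(aggregationDegree(new int[] {5, 1, 1, 1})); // 0.2
    }
}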

3.1.2. Expressiveness

A schema is said to be expressive when it represents users' requirements in a natural way and can be easily understood without additional explanation [Batini et al. 92]. We distinguish between two levels of expressiveness, namely concept and schema expressiveness.

Concept expressiveness

Measures whether the concepts used are expressive enough to capture the main aspects of the semantics of the reality. For example, an inheritance link is more expressive than an association in the UML notation: it is a specific association, and mentioning this specificity makes the schema more expressive. Indeed, an inheritance link from class C1 to class C2 expresses the facts that: i) there exists an association between C1 and C2, ii) the set of C2 instances is included in the set of C1 instances, iii) C2 shares all properties of C1, and iv) C2 participates in all associations in which C1 participates. We propose to measure concept expressiveness by associating weights with the different concepts involved.


Schema expressiveness

Measures the expressiveness of the schema as a whole. Clearly, the greater the number of concepts used, the higher the expressiveness of the conceptual schema. For example, in the "hospital management" case study (Section 4), a conceptual schema in which several categories of doctors (Practitioner, Researcher, Independent_consultant, etc.) are specified is more expressive than one in which only one category (Doctor) is represented.

To measure expressiveness, encompassing both concept and schema levels, we propose the following metric:

$$\mathrm{Expressiveness} = \frac{\sum_{C_i \in S} w_i \, NB(C_i)}{\sum_{C_j \in U} w_j \, NB(C_j)}$$

Where C_i belongs to {class, association, inheritance link, association class}, w_i is the weight associated with C_i, NB(C_i) is a function calculating the number of C_i concepts in a schema, S is the current schema to be measured, and U is the union of all schemas describing the same reality, against which S is compared. This union is to be understood as the mathematical union operator, in the sense that a concept present in several schemas is counted only once.

This metric reaches its maximal value of 1 for a conceptual schema constructed as the union of all the others. Such a union schema is not always correct, because the union here is purely syntactic and does not correspond to an integration of the schemas. Moreover, this hypothetical union schema, while the most expressive one, is obviously also the most redundant, since it contains all the concepts of the other schemas.
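A possible implementation sketch, assuming both the current schema and the union schema have been reduced to weighted concept counts (the construction of the union itself is not shown):

import java.util.Map;

public final class ExpressivenessMetric {

    public static double expressiveness(Map<String, Integer> schemaCounts,
                                        Map<String, Integer> unionCounts,
                                        Map<String, Integer> weight) {
        double num = 0, den = 0;
        for (Map.Entry<String, Integer> e : schemaCounts.entrySet())
            num += weight.getOrDefault(e.getKey(), 1) * e.getValue();
        for (Map.Entry<String, Integer> e : unionCounts.entrySet())
            den += weight.getOrDefault(e.getKey(), 1) * e.getValue();
        return den == 0 ? 0.0 : num / den;
    }
}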

3.1.3. Simplicity

A schema is said to be simple if it contains the minimum possible number of constructs. Our measure of simplicity is based on the assumption that the complexity of a conceptual schema grows with the number of concepts (including inheritance and aggregation links). As a consequence, a conceptual schema is simpler when the number of classes is large relative to the number of links (associations and inheritance links) between the classes. Similar results for EER models can be found in [Genero et al. 2000]. We suggest the following metric to measure UML schema simplicity:

$$\mathrm{Simplicity} = \frac{NB(C) + NB(AC)}{NB(C) + NB(AC) + NB(H) + NB(A)}$$

Where NB(C), NB(AC), NB(H) and NB(A) correspond respectively to the number of classes, association classes, inheritance links and associations (including aggregations and compositions).

A conceptual schema with neither associations nor inheritance links has a simplicity value of 1, the maximal value of the simplicity function. Conversely, if the number of associations is very high compared to the number of classes, the value of simplicity approaches zero.
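The corresponding computation is a one-liner; the guard for a degenerate empty schema is our addition:

public final class SimplicityMetric {

    public static double simplicity(int nbC, int nbAC, int nbH, int nbA) {
        int total = nbC + nbAC + nbH + nbA;
        return total == 0 ? 1.0 : (double) (nbC + nbAC) / total;
    }
}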


3.1.4. Correctness

Correctness is used in a wide range of contexts, leading to very different interpretations. A schema is syntactically correct when its concepts are properly defined in the schema [Batini et al. 92]. In the same way, a schema is semantically correct if its concepts are used according to their definitions. Another way of defining semantic correctness, however, is to link it to conformity with requirements. Since our objective is to make quality measurement automatic, we limit ourselves to syntactic correctness: this characteristic is easier to verify, and most existing CASE tools support the verification of conceptual schema correctness. To measure correctness we suggest the following metric:

$$\mathrm{Correctness} = \frac{\sum_{i=1}^{N} \left( VERIF(C_i) - ERR(C_i) \right)}{\sum_{i=1}^{N} VERIF(C_i)}$$

Where VERIF() is a function calculating the number of characteristics to be verified on an element C_i of the current schema (this number is the same for all occurrences of the same concept type), ERR() is a function calculating the number of errors detected on an element C_i, and N is the number of elements in the schema.

As an example of the value of the function VERIF(), let us consider that, for a class, we have defined the following two correctness rules: a naming rule and the obligation to have at least one attribute. In this case, VERIF(C_i) = 2 for each class C_i of the schema. For a conceptual schema containing no error (ERR(C_i) = 0 for each element of the schema), the value of correctness is maximal (1). Conversely, if none of the verifications concludes to correctness (VERIF(C_i) = ERR(C_i) for each element of the schema), the value of correctness is minimal (0).
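A sketch of the computation, assuming VERIF and ERR have been evaluated for each schema element (in the example above, verif[i] would be 2 for every class):

public final class CorrectnessMetric {

    public static double correctness(int[] verif, int[] err) {
        int totalVerif = 0, totalErr = 0;
        for (int i = 0; i < verif.length; i++) {
            totalVerif += verif[i];
            totalErr += err[i];
        }
        return totalVerif == 0 ? 1.0
                               : (double) (totalVerif - totalErr) / totalVerif;
    }
}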

Summarizing, the first objective of a conceptual schema is to provide a formal representation of the observed reality. Within its specification dimension, the quality framework aims to measure how well a conceptual schema respects some predefined specification characteristics.

3.2. The usage dimension

The usage dimension measures the quality of the conceptual schema according to the user's perception of both the system and the developed specification. A conceptual schema is judged to be "good" if:

- it conveys the same perception of the system as the user's, i.e. the abstractions made by the schema are a correct representation of the user's vision of the universe of discourse,

- and it is correctly interpreted by the users.

To take these two aspects into account, we propose two criteria that measure the quality of conceptual schemas within the usage dimension, namely completeness and understandability.


3.2.1. Completeness

A schema is complete when it represents all relevant features of the application domain [Batini et al. 92]. More specifically, completeness can be measured by the degree to which the conceptual schema covers the users' requirements. Completeness is a very important criterion, as it is crucial for the success of the future system. The degree of disparity between the user requirements and their interpretation by the designer, as expressed in the conceptual schema, measures the gap between the user's and the designer's perceptions of the same reality.

3.2.2. Understandability

Understandability is defined as the ease with which the user can interpret the data model. This criterion is very important for the validation of conceptual schemas and consequently directly influences the measure of completeness. The understandability of a conceptual schema relies on the extent to which modeling features are made explicit. Non-explicit names, a high level of aggregation of the modeling features, and the complexity of the defined integrity constraints are factors that decrease schema understanding.

To summarize, the specification dimension measures the quality of the schema according to predefined modeling and representation rules. However, a schema that is good according to the specification dimension could still be very bad if it does not correspond to the users' requirements. The usage dimension measures the degree of correspondence between the conceptual schema and the users' requirements, as well as the ease with which this correspondence can be established.

3.3. The implementation dimension

This dimension refers to the amount of effort that may be needed to implement the conceptual schema. Two criteria are proposed to evaluate this implementation aspect of the conceptual schema:

3.3.1. Implementability

It is defined as the amount of effort that a data administrator must expend to implement the conceptual schema. This effort depends on the degree of correspondence between the modeling concepts in the conceptual schema and the concepts of the target database management system.

3.3.2. Maintainability

It measures the ease with which the conceptual schema can evolve. The maintainability of a conceptual schema implies studying the cohesion of the modeling elements: how do they coexist? When may they change? How frequently? The answers to these questions lead to the definition of precise characteristics necessary to ensure a given degree of maintainability.


Summarizing, the implementation dimension measures the ease with which a conceptual schema can be implemented and maintained.

4. Case study: Hospital management

We consider a problem sufficiently complex to illustrate the application of the framework. Quality is measured here in order to compare several schemas modeling the same reality. We make the assumption that all the schemas represent the same reality and are syntactically correct. The measurement concerns conceptual quality along the specification dimension.

Let us consider a public organization managing the medical aspects of several independent hospitals. Each hospital has its own medical staff providing health care to patients in medical services. The research activity takes place in laboratories. There are three different functions performed by doctors: medical practice, research and consultations. More precisely, a doctor can be a practitioner in a medical service and/or a researcher in a laboratory. These doctors are employees of the hospital. The others are independent consultants whose activity is nevertheless located in the hospital; they are not employees of the hospital. All doctors, whatever their function, are affiliated with a single hospital. Obviously, practitioners and independent consultants prescribe drugs to their patients.

The main difficulty in building a conceptual schema for this problem is due to the non-determinism of conceptual models: in most cases, several constructs can represent the same reality. In our case, there are several alternative categorizations of doctors. The most representative conceptual schemas resulting from this main choice are given below.

[Schema 1: a single Doctor class (D_name, D_address, D_type, D_speciality; operations update_medical_record(), update_scientific_record(), change_laboratory(), change_service(), change_hospital()) associated with Hospital (works), Laboratory (searches), Service (attached) and Patient through the Prescription association class (P_date), itself linked to Prescription_detail (frequency, duration) and Drug (name); Hospital is linked to its laboratories (has_l) and services (has_s).]

In this first schema, we do not differentiate between doctors' functions. This choice is relevant when validation is to be performed by non-professional users. The concept of generalization hierarchy is avoided, at least as a first step.


[Schema 2: no generalization; four independent classes Researcher, Practitioner, Independent_Consultant and Practitioner_researcher, each repeating the doctor attributes (D_name, D_address, D_speciality) and operations, each with its own associations (works, searches, attached) and its own replicated Prescription association class toward Patient, Prescription_detail and Drug.]

In this second representation, we still do not use the generalization hierarchy. However, we distinguish between doctors' functions, leading to four classes representing the four allowed positions for doctors. As a consequence, several associations are replicated.

[Schema 3: a Doctor superclass (D_name, D_address, D_speciality; update_medical_record()) with three subclasses Researcher, Practitioner and Independent_consultant; works is attached to Doctor, while searches, attached and the Prescription association classes are attached to the relevant subclasses.]

In this third schema, the generalization hierarchy is introduced to represent naturally the three different functions of doctors. This choice considerably enriches the expressiveness of the schema but also leads to some replications.


[Schema 4: a Doctor generalization with four subclasses Practitioner, Researcher, Independent_consultant and Practitioner_researcher, each carrying its own operations and associations; the Prescription association class is replicated for each prescribing subclass.]

This schema uses the generalization hierarchy to differentiate between the four possible doctor positions. The resulting description is more explicit but also more complex.

[Schema 5: the same four subclasses as Schema 4, but the shared associations (notably the Prescription association class toward Patient) are attached to the generic Doctor class, which also carries the change_address() operation.]

In this schema, the generalization choice is the same as in the previous one. However, to avoid redundant associations, the associations are defined on the generic class. It is a curious compromise between the previous choices: it exhibits the different categories of doctors but does not make their functions explicit.


[Schema 6: two generalization levels: Doctor (change_address()) specializes into Independent_consultant and Salaried_doctor; Salaried_doctor in turn specializes into Practitioner and Researcher; Prescription links the prescribing doctors to Patient.]

In this schema, two generalization levels are proposed. Based on the doctors' employment status, the first level distinguishes between independent consultants and employees. At the second level, employees may be practitioners or researchers (or both).

[Schema 7: two generalization levels: Doctor specializes into Researcher and Consulting_doctor (update_medical_record()); Consulting_doctor in turn specializes into Practitioner and Independent_consultant; the Prescription association class is attached to Consulting_doctor.]

A second possibility is to distinguish, at the first level, between "pure" researchers and consulting doctors, who may be either practitioners or independent consultants.


[Schema 8: the same two-level generalization as Schema 7, plus a Practitioner_researcher class obtained by multiple inheritance from Practitioner and Researcher.]

In this schema, we introduce multiple inheritance at the second level of the generalization.

[Schema 9: the same inheritance structure as Schema 8, with the prescription details folded into a single aggregate attribute of the Prescription association class: detail : collection(aggregate[Drug_name, frequency, duration]).]

In this last schema, we introduce an aggregate attribute, namely prescription_detail. From the inheritance viewpoint, this schema is similar to the eighth one.

We have evaluated the nine schemas using our framework and the metrics described in Section 3. The weights associated with concepts are as follows: 1 for both a class and a binary 1-1 association; 3 for a binary *-* association and for an inheritance link; 1 for a composition association and for an aggregation association. The weight values result from a simulation process conducted on several examples.
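For reference, this weight assignment can be captured as a simple lookup table; the concept names below are our own labels, not identifiers from the prototype:

import java.util.Map;

public final class CaseStudyWeights {

    public static final Map<String, Integer> WEIGHTS = Map.of(
            "class", 1,
            "binary 1-1 association", 1,
            "binary *-* association", 3,
            "inheritance link", 3,
            "composition association", 1,
            "aggregation association", 1);
}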


Figure 4: Quality evaluation results

The correctness criterion is maximal for the nine schemas, since we limited our investigation to syntactically correct schemas. The clarity measures vary from 0.64 to 1. Due to the small size of the schemas, four of them can be represented without any line-crossing. The fifth schema has the lowest clarity value (0.64), owing to the numerous association replications leading to several line-crossings. This illustrates a strong dependency between the clarity and minimality criteria: replicating associations decreases the minimality value as well as the clarity value. Concerning the minimality measurement, note that four schemas are excellent and equivalent owing to the absence of redundant concepts (schemas 1, 7, 8 and 9). They are not globally equivalent, since they differ notably on the expressiveness measurement. Schema 2 appears to be the worst, since it distinguishes between four categories of doctors without using the generalization concept; this design choice maximizes the number of redundancies. The analysis of the expressiveness values identifies the worst schema: schema 1, which represents all doctors in a single category, ignoring important specific features. The best schema appears to be the fourth one: it makes explicit the greatest number of the concepts contained in the schemas. As for the simplicity criterion, there is no significant difference between the schemas; however, the first one is logically the best. Recall that it is based on the choice of producing a first-cut schema for non-professional users. The comparison of the schemas based on the average of the different criteria values establishes three clusters of schemas. In the "best" cluster, we find the three schemas (6, 7 and 8) based on two levels of generalization. The multiple inheritance concept appears to contribute positively to the global quality evaluation of schema 8. The second cluster comprises schemas 1, 3, 5 and 9. In particular, the first-cut schema (schema 1) appears to be acceptable. One can question the usefulness of introducing complex concepts, as in schema 8, to improve the conceptual quality only marginally. The last cluster contains schemas 2 and 4, although schema 4 was previously noticed as the most expressive schema. This confirms our belief that overall quality cannot be reduced to one criterion. In order to differentiate the schemas belonging to the "best" cluster, we propose to take into account the factorization degree. The factorization degree cannot be aggregated with the other criteria, since its scale values are not comparable to theirs. However, it can be used as a second-level discrimination factor. As a consequence, schema 8 appears to be superior to the other schemas, mainly due to a judicious exploitation of the multiple inheritance concept.

As a conclusion, our framework can be used according to three viewpoints:

- choosing the best schema based on a standard aggregation of criteria (i.e. the average of the criteria discussed above),

- adapting the standard aggregation to a particular vision of quality (i.e. defining a weighted average of the five measures),

- concentrating on a particular criterion.

5. Implementing the quality metrics

A general view of the implementation framework is presented in Figure 5. This implementation is performed with respect to two objectives: extendibility and modularity.

The first objective is to extend existing CASE tools with quality measurement facilities. In order to achieve independence between the CASE tool and the implementation framework, we have introduced an interface module based on XML (Extensible Markup Language). The XML interface manages the interaction between the quality module and the CASE tool. It contains the XML description of the conceptual schemas being evaluated; this description is obtained either through a built-in functionality of the CASE tool or by extending the tool with an add-in.

The second objective is to attain a high degree of modularity. As a consequence, the quality framework is implemented as a module independent of any CASE tool. Moreover, its structure is modular, supporting easy modifications and further extensions, as detailed in Section 5.1.1.

In this paper, we detail implementation aspects only for the CASE tool Rational Rose.

[Figure 5 shows the three communicating components: an existing CASE tool, the XML interface, and the quality module.]

Figure 5: Quality metrics implementation view


5.1. Extending Rational Rose with quality measurement functionality

Rational Rose is first used to construct the conceptual schemas. The quality measurement functionality is added as a menu item as shown in Figure 6.

Figure 6: Adding the quality measurement functionality

Selecting the quality measurement item launches the execution of the quality module as an external tool. The communication between the two tools is performed via the XML interface module.

5.1.1. The XML interface module

The interface module manages the communication between Rational Rose 2000 and the quality module. It is thus the module most likely to evolve in the structure presented in Figure 5. The current version of the XML interface is based on two components, namely the Unisys XMI software and the AlphaWorks XMI toolkit [alpha]. The first component is an implementation of the XMI (XML Metadata Interchange) specification allowing the exchange of information between Rational Rose 2000 and UML-based tools and environments. The second component is a Java toolkit enabling the sharing of Java objects using XML. These two components have been adapted to the needs of the quality module. The designer has to export the UML schemas to be evaluated into XML using the Unisys XMI software. The XML specifications are then interpreted by the AlphaWorks XMI toolkit before being used by the quality module, written in Java.
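As an illustration of the kind of work the interface performs, the following sketch counts the classes in an XMI export using only the standard Java XML API; the element name follows the XMI 1.x convention and would have to be checked against the actual Unisys export.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public final class XmiCounter {

    // Counts UML classes in an XMI 1.x file (assumed element name).
    public static int countClasses(String xmiFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmiFile);
        return doc.getElementsByTagName("Foundation.Core.Class").getLength();
    }
}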

In a future version, we plan to restructure the XML interface module to make it more generic. Our objective is to minimize the amount of code to be written in order to support communication with additional CASE tools.

5.1.2. The quality module

Figure 7 presents a conceptual view of the quality module. Each quality metric is represented by a class having three attributes, namely description, modeling_notation and parameters. Consequently, each metric has a textual description describing the quality criterion supported and the modeling notation for which it holds. However, for a given quality criterion, the code enabling the evaluation of the metric is independent of the modeling notation: it is implemented once, in the operation apply_metric(). This characteristic is directly related to the genericity of the quality metrics. For example, let us consider the simplicity metric described for the UML notation in Section 3.1.3. This metric has the following generic formula:

$$\mathrm{Simplicity} = \frac{NB(NLC)}{NB(NLC) + NB(LC)}$$

Where NB(NLC) and NB(LC) correspond respectively to the number of non-link concepts and of link concepts in a schema.

A non-link concept is a concept independent from other concepts, representing objects that have an identity in the real world. A link concept represents a relationship or an association between non-link concepts. Examples of non-link concepts are the class concept in the UML notation and the entity-type concept in the entity-relationship notation. An association in the UML notation and a relationship in the entity-relationship notation are considered link concepts. For the simplicity metric, the parameters attribute has the following structure:

Parameters: (NLC: collection(string), LC: collection(string)).

Where NLC contains the collection of non-link concept names in a given notation (class and association class for UML) and LC the collection of link concept names in the same notation (for example inheritance link, association and aggregation association for UML). The method apply_metric() implements the generic formula based on the generic NLC and LC concepts, without referring to notation-specific concepts.
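The sketch below illustrates this genericity as we read the design (it is not the prototype's actual code): the formula is written once against the NLC/LC split, and instantiating it for another notation only changes the parameter lists.

import java.util.List;
import java.util.Map;

abstract class Metric {
    String description;
    String modelingNotation;
    Map<String, List<String>> parameters; // e.g. "NLC" -> concept names

    // schemaCounts maps a concept name to its number of occurrences
    abstract double applyMetric(Map<String, Integer> schemaCounts);
}

class GenericSimplicity extends Metric {

    GenericSimplicity(List<String> nonLinkConcepts, List<String> linkConcepts) {
        description = "Simplicity = NB(NLC) / (NB(NLC) + NB(LC))";
        parameters = Map.of("NLC", nonLinkConcepts, "LC", linkConcepts);
    }

    @Override
    double applyMetric(Map<String, Integer> schemaCounts) {
        int nlc = count(schemaCounts, parameters.get("NLC"));
        int lc = count(schemaCounts, parameters.get("LC"));
        return (nlc + lc) == 0 ? 1.0 : (double) nlc / (nlc + lc);
    }

    private static int count(Map<String, Integer> counts, List<String> names) {
        int total = 0;
        for (String n : names) total += counts.getOrDefault(n, 0);
        return total;
    }
}

For UML, one would instantiate it as new GenericSimplicity(List.of("class", "association class"), List.of("association", "inheritance link", "aggregation association")); only the two lists change for the entity-relationship notation.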


[Figure 7 shows a class diagram of the quality module: an abstract Metric class (attributes description, modeling_technique, parameters; operation apply_metric()) specialized into Legibility, Clarity, Minimality, Non_redundancy, Factorization_degree, Aggregation_degree, Expressiveness, Simplicity and Correctness; an Aggregate_metric class (define_aggregate_metric(), apply_aggregate_metric()); and a Simulation class (attribute Quality_results; operations acquire_weights(), select_schemas()) whose metrics are applied on conceptual_schema objects.]

Figure 7: A conceptual view of the quality module

In addition to the basic metrics, the quality module provides a means of constructing aggregate metrics based on the predefined quality metrics. In the current implementation, only an average aggregate metric is defined; it is hard-coded in the implementation of the aggregate metric class. This class also manages the display of the quality results. The class named Simulation represents an evaluation session; its attribute Quality_results saves the evaluation results. During an evaluation session, the designer chooses the schemas to evaluate and assigns weights to the concepts of the notation. For example, the results presented in Section 4 correspond to the weights described above (1 for both a class and a binary 1-1 association; 3 for a binary *-* association and an inheritance link; 1 for a composition association and an aggregation association).
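The hard-coded average aggregate metric then amounts to the following sketch (names ours):

public final class AverageAggregateMetric {

    public static double apply(double[] criterionValues) {
        if (criterionValues.length == 0) return 0.0;
        double sum = 0;
        for (double v : criterionValues) sum += v;
        return sum / criterionValues.length;
    }
}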

The quality module is developed in Java [JDK]. Further releases of the quality module concern two components, namely the Aggregate_metric class and the Simulation class. As far as aggregate metrics are concerned, we plan to develop a complementary module for aggregate metric definition, to allow more complex metric usage. Concerning communication, the current version manages the selection of the schemas to be evaluated through directory sharing. The planned improvement consists in adding visual facilities and introducing more flexibility in the selection of both the schemas to be evaluated and the quality metrics to be applied.


6. Conclusion and future work

Our contribution is threefold. First, our evaluation framework extends existing approaches to conceptual schema quality evaluation by differentiating the specification dimension from the usage and implementation dimensions. Our paper thus contributes to the literature on conceptual schema quality evaluation by allowing specific viewpoints (specification, usage, implementation) in the evaluation process. Second, our approach captures the role of the specification dimension in the evaluation process, adding to the emerging literature on conceptual model evaluation. Third, several meaningful insights are derived regarding the criteria used to perform the evaluation along the specification dimension. Our approach captures the essential tradeoff between the benefit of a criterion such as legibility and the remaining criteria such as expressiveness and simplicity. Although the analysis can be seen as complex, we have described a framework and the associated metrics for UML schema quality evaluation. The objective of integrating these measurements into a CASE tool has continually constrained our choice of metrics, since we are interested only in those that are automatically computable. We have developed a prototype in Java, implemented as a module communicating with the CASE tool Rational Rose 2000 and enriching the latter with a conceptual quality capability. Faced with several alternative UML representations of the same reality, the designer can use this capability to choose the best one, given one criterion or a combination of quality criteria.

Our current approach has its limitations, and three major extensions can be made. First, we should integrate the usage and implementation dimensions and study their effect on the overall quality of UML schemas. By adapting specific metrics to these dimensions, we will be able to derive interesting insights on the marginal quality contribution of each dimension. Second, the metrics presented concentrate on static aspects; we are working on integrating dynamic aspects at the conceptual level into the quality evaluation process. Finally, several case studies are needed in order to reach more definitive conclusions on acceptable values for the weights associated with concepts, as well as to define more interesting overall quality functions based on the metrics.


7. Bibliography

[alpha] http://www.alphaworks.ibm.com/

[Assenova et al. 96] Assenova P., Johannesson P., Improving quality in conceptual modelling by the use of schema transformations. In the proceedings of the ER'96 conference, Cottbus, Germany, 1996.

[Bailey et al. 83] Bailey J. E., Pearson S. W., Development of a tool for measuring and analysing user satisfaction. Management Science, 29(5), 1983.

[Batini et al. 92] Batini C., Ceri S., Navathe S. B., Conceptual Database Design: An Entity-Relationship Approach, Benjamin/Cummings, Redwood City, California, 1992.

[Booch 91] Booch G., Object-Oriented Design with Applications, Benjamin/Cummings, Redwood City, California, 1991.

[Davis 90] Davis A., Software Requirements: Analysis and Specification, Prentice Hall, 1990.

[Genero et al. 2000] Genero M., Jiménez L., Piattini M., Measuring the quality of entity relationship diagrams. In the proceedings of the ER2000 conference, LNCS 1920, pp. 513-526.

[IEEE Std 830-1998] IEEE Recommended Practice for Software Requirements Specifications, (Revision of IEEE Std 830-1993).

[Ives et al. 83] Ives B., Olson M. H., Baroudi J. J., The measurement of user information satisfaction. Communications of the ACM, 26(10), 1983.

[JDK] Java 2 SDK, Standard Edition, Version 1.3.0_02.

[Kriebel 79] Kriebel C. H., Evaluating the quality of information systems. In Design and Implementation of Computer-Based Information Systems, 1979.

[Lindland et al. 94] Lindland, O.I., G. Sindre and A. Sølvberg, Understanding quality in conceptual modelling, IEEE Software, Vol. 11, No. 2, March 1994, 42-49.

[Misic et al. 2000] Misic V. B., Zhao J. L., Evaluating the quality of reference models. In the proceedings of the 19th Conference on Conceptual Modeling (ER2000), pp. 484-498.

[Moody 98] Moody D. L., Metrics for evaluating the quality of entity-relationship models. 17th International Conference on Conceptual Modeling (ER98), Singapore, LNCS 1507 (Ling, Ram, Lee eds).

[Moody et al. 98] Moody D. L., Shanks G. G., Darke P., Improving the quality of entity-relationship models - experience in research and practice. 17th International Conference on Conceptual Modeling (ER98), Singapore, LNCS 1507 (Ling, Ram, Lee eds).

[Poels et al. 2000] Poels G., Dedene G., Measures for dynamic aspects of object-oriented conceptual schemes. In the proceedings of the ER2000 conference, LNCS 1920, pp. 499-512.

[Schuette et al. 98] Schuette R., Rotthowe T., The Guidelines of Modeling - an approach to enhance the quality in information models. ER 1998 conference, pp. 240-254.


[Si-said et al. 02] Si-Said S., Akoka J., Comyn-Wattiau I., Conceptual modeling quality - from EER to UML schema evaluation. ER 2002 conference.

[Teeuw et al. 97] Teeuw B., Van Den Berg H., On the quality of conceptual models. Proceedings of the ER'97 Workshop on Behavioral Models and Design Transformations: Issues and Opportunities in Conceptual Modeling, 6-7 November 1997, UCLA, Los Angeles, California.

[Wand et al. 96] Wand Y., Wang R. Y., Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), November 1996.

[Wang et al. 93] Wang R. Y., Kon H. B., Madnick S. E., Data quality requirements analysis and modeling. International Conference on Data Engineering, 1993, pp. 670-677.

[Wang et al. 95] Wang R. Y., Reddy M. P., Kon H. B., Towards quality data: an attribute-based approach. Decision Support Systems, 13, 1995.

[Wang et al. 96] Wang R. Y., Strong D. M., Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems (JMIS), 12(4), 1996.

[Zmud et al. 90] Zmud R., Lind M., Young F., An attribute space for organisational communication channels. Information Systems Research, 1(4), 1990.


Session 5
Conférence invitée


Réplication : les approches optimistes (1)

Marc Shapiro

Microsoft Research Ltd., 7 J J Thomson Ave, Cambridge CB3 0FB, [email protected]

ABSTRACT. Replication is used to improve the performance and availability of shared data in a distributed system. We classify replication algorithms along several axes: granularity, quality of propagation, ordering, conflict management, and laziness. This last point is what distinguishes pessimistic from optimistic systems.

A pessimistic system coordinates replicas synchronously. Conversely, an optimistic system propagates updates in the background, detects conflicts a posteriori, and seeks consensus incrementally. It may diverge in the short term. Where a pessimistic system blocks (for instance when the network is unavailable), an optimistic system can make progress speculatively. Optimistic techniques grow in importance as work over large-scale and mobile networks becomes more prevalent.

Our classification makes it possible to compare the different replication algorithms on a common scale, despite the very different techniques they employ. The axes of the classification highlight the main challenges of an optimistic system: conflict identification and management, reconciliation, convergence, replica quality, and scalability. Examples of existing systems illustrate some noteworthy points along these axes.

(1) In collaboration with Yasushi Saito, HP Labs.


Session 6: Web Services


Active XML: A Data-Centric Perspective on Web Services

Serge Abiteboul*,+ — Omar Benjelloun* — Ioana Manolescu* — Tova Milo*,++ — Roger Weber**

* INRIA, France
** ETH Zurich, Switzerland
+ Xyleme S.A., France
++ Tel-Aviv University, Israel

[email protected]

ABSTRACT.

We present Active XML (AXML, for short), a declarative framework that harnesses web services for data integration, and is put to work in a peer-to-peer architecture. It is centered around AXML documents, which are XML documents that may contain calls to web services. The language allows both the specification of such documents and the definition of new web services based on them. While documents with embedded calls have been proposed before, AXML is the first to actually turn calls to web services embedded in XML documents into a powerful tool for web-scale data integration. In particular, the language includes linguistic features to control the timing of service call activations, the lifespan of data, and the exchange of extensional and intensional data. Various scenarios are captured, such as mediation, data warehousing, and distributed computation. A first prototype is described.

RÉSUMÉ.

We propose a peer-to-peer architecture for the integration of data and web services. Our approach relies on a language, Active XML, in which (1) documents contain calls to web services and are enriched when the latter are invoked, and (2) new web services can be defined declaratively, as XQuery queries over these active documents. Embedding calls to functions, or even to web services, in data is not in itself a novel idea. It is their use as a powerful tool for data and service integration that constitutes our contribution. In particular, our language makes it possible to specify precisely when services are to be invoked and the data enriched, which captures various data integration scenarios such as mediation, data warehousing, and a restricted form of distributed computation. A first prototype is also presented.

KEYWORDS: XML, Web Services, Data Integration

MOTS-CLÉS: XML, Web Services, Data Integration


1. Introduction

Data integration has been extensively studied in the past in the context of company infrastructures, e.g., [GUP 89, WIE 93, T. 99]. However, since the web is becoming a main target, data integration has to deal with its large scale, and faces new problems of heterogeneity and interoperability between “loosely-coupled” sources, which often hide data behind programs. These issues have recently been addressed in two complementary ways. First, major standardization efforts have addressed and partially resolved some of the heterogeneity and interoperability problems via (proposed) standards for the web, like XML, SOAP and WSDL [w3c]. XML, as a self-describing semi-structured data model, brings flexibility for handling data heterogeneity 1, while emerging standards for web services like SOAP and WSDL simplify the interoperability problem by normalizing the way programs can be invoked over the web. Second, the increasingly popular peer-to-peer architectures seem to provide a promising solution for the problems stemming from the independence and autonomy of sources and the large scale of the web. Peer-to-peer architectures provide a decentralized infrastructure in the spirit of the web, one that scales well to its size, as demonstrated by recent applications, e.g., [mor, kaz].

We believe that the peer-to-peer approach, together with the aforementioned standards, forms the proper ground for data integration on the web. What is still lacking, however, is the glue between these two paradigms. This is precisely what is provided by Active XML (AXML, in short), the topic of this paper: a declarative framework that harnesses XML and web services for data integration, and is put to work in a peer-to-peer architecture.

The AXML framework is centered around AXML documents, which are XML documents that may contain calls to web services. When calls included in a document are fired, the latter is enriched by the corresponding results. In some sense, an AXML document can be seen as a (partially) materialized view integrating plain XML data and dynamic data obtained from service calls. It is important to stress that AXML documents are syntactically valid XML documents. As such, they can be stored, processed and displayed using existing tools for XML. Since web services play a major role in our model, we also provide a powerful means to create new ones: they can be specified as queries with parameters over AXML documents, using XQuery [XQu], the future standard query language for XML, extended with updates.

Documents with embedded calls have been proposed before (see related work in Section 2 for a detailed account). But AXML is the first to actually turn them into a powerful tool for data integration, by providing the following features:

Controlling the activation of calls and the lifespan of data. By providing means to declaratively specify when service calls should be activated (e.g., when needed, every hour, etc.), and for how long the results should be considered valid, our approach allows one to capture and combine different styles of data integration, such as warehousing, mediation, and flexible combinations of both.

1. Complementary standards for metadata, like RDF, also address semantic heterogeneity issues.


[Figure 1 shows three AXML peers (S1, S2, S3) communicating through SOAP. Each peer hosts an AXML storage, AXML service definitions, and an Evaluator coupled to an XQuery processor; a SOAP wrapper exposes a SOAP client and a SOAP service. Arrows depict service calls and results, queries and query results, and read/update/consult interactions among the components.]

Figure 1. Outline of the AXML data and service integration architecture.


Services with intensional input/output. Standard web services exchange “plain” XML data. We allow AXML web services to exchange AXML data that may contain calls to other services. Namely, the arguments of calls as well as the returned answers may include both extensional XML data and intensional information represented by embedded calls. This makes it possible to delegate portions of a computation to other peers, i.e., to distribute computations. Moreover, the decision to do so can be based on considerations such as peer capabilities or security requirements.

Continuous services. Existing web services either use a (one-way) messaging style or a remote procedure call (RPC) style: they are called with some parameters and (eventually) return an answer. But often a continuous behavior is desired, where a (possibly infinite) stream of answers is returned for a single call. As an example, consider subscription systems where new data of interest is pushed to users (see, e.g., [NGU 01]). Similarly, a data warehouse receives streams of updates from data sources, for maintenance purposes. Streams of results are also generated, for instance, by sensors (e.g., thermometers, video surveillance cameras). The AXML framework adds this capability to web services, and allows for the use and creation of such continuous web services.

These features are combined in a peer-based architecture, illustrated by Figure 1. Each peer contains a repository of AXML documents, some AXML web service definitions, and an Evaluator module that acts both as a client and as a server:

Client. By choosing which of the service calls embedded in the AXML documents need to be fired at each point in time, firing the calls, and integrating the returned answers into the documents. It is important to stress that any web service can be used, be it provided by an AXML peer or not, as long as it complies with standard protocols for web service description and invocation [SOA, WSD].

Page 232: BDA’02Mehdi Snene, Pham Thi, Slim Turki, Patrick Valduriez, Genoveva Vargas Solar, Dan Vodislav, José Luis Zechinelli Martini, Karine Zeitouni Comité d’organisation Président

Server. By accepting requests for the AXML services supported by the peer, executing the services (i.e., evaluating the corresponding XQuery) and returning the result.

Observe that since AXML services query AXML documents and can accept (resp. return) AXML documents as parameters (resp. results), a service execution may require the activation of other service calls. Thus, these two tasks are closely interconnected.
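For reference, such an invocation travels as an ordinary SOAP message; a call to getOffers might be carried as follows (a sketch: the operation and parameter element names are illustrative, only the envelope structure is fixed by SOAP):

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getOffers xmlns="http://auction.com/services">  <!-- operation name, illustrative -->
      <category>Toys</category>                      <!-- parameter, illustrative -->
    </getOffers>
  </soap:Body>
</soap:Envelope>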

The paper is organized as follows. After an overview of related work, the AXML language is presented in Section 3, mainly through an extended example. A formal semantics for AXML documents and services is presented in Section 4, while security and peer capabilities are considered in Section 5. An evaluation strategy and an implementation are discussed in Section 6. The last section is a conclusion.

2. Related work

Active XML touches several areas in data management and web computing. We next briefly consider these topics and explain how AXML relates to them.

Basic technologies/standards. The starting point of the present work is the semistructured data model and its current standard incarnation, namely XML [XMLa]. We rely primarily on an XML query language and on protocols enabling remote procedure calls on the web. Disparate efforts to define a query language for XML are now unifying in the XQuery proposal [XQu] and its subset XPath [XPa]. We use those as the basis of our service definition language. As for remote procedure calls, the various industrial proposals for web services, e.g., .NET by Microsoft, e-speak by HP, and Sun ONE by Sun Microsystems, are also converging towards using a small set of specifications to achieve interoperability 2. Among those, we are directly concerned with the Simple Object Access Protocol (SOAP) [SOA] and the Web Services Description Language (WSDL) [WSD], which are used in our implementation. More indirectly, UDDI registries [UDD] can be used by our system, e.g., to publish or discover web services of interest.

Data integration. Data integration systems typically consist of data sources, which provide data, and mediators or warehouses, which integrate it with respect to a schema. The relationship between the former and the latter is defined using a global-as-view or local-as-view approach [GAR 97, LEV 96]. While mediators/warehouses can also serve as data sources for higher integration levels, a clear hierarchy between the data providers typically exists. In contrast, all Active XML peers play a symmetric role, both as data sources and as partially materialized views over the other peers, thus enabling a more flexible and scalable network architecture for data integration. Web services are used as uniform wrappers for heterogeneous data sources, but also provide generic means to apply various transformations to the retrieved data. In particular, tools developed for schema/data translation and semantic integration, e.g., [LEV 96], can be wrapped as web services and used to enrich the Active XML framework.

2. An organization was even created for that purpose; see http://www.ws-i.org.



Service composition and integration. Integration and composition of web services (and of programs in general) have been an active field of research [WEI 01]. Intensional data was used in Multilisp [HAL 85], in the form of “futures”, i.e., handles to results of ongoing computations, to allow for parallelism. Ambients [CAR 99, CAR 98], as bounded spaces where computation happens, also provide a powerful abstraction for process and device mobility on the web. More recently, standard languages have even been proposed in industry, like IBM's Web Services Flow Language [WSF] or Microsoft's XLang [XLA], for specifying how web services interact with each other and how messages and data can be passed consistently between business processes and service interfaces.

While AXML allows one to compose services, to use the output of some service calls as input to others, and even to define new services based on such (possibly recursive) compositions, the focus here is not on workflow and process-oriented techniques [CHR 01] but on data. AXML is not a framework for service composition, but for data integration using web services.

Our formal foundation is based on a non-deterministic fixpoint semantics [ABI 95], primarily developed for Datalog extensions. In that direction, the paper has also been influenced by recent work on distributed Datalog evaluation [JIM 01].

Embedded calls. As already mentioned, the idea of embedding function calls in data is not a new one. Embedded functions are already present in relational systems [ULL 89], e.g., as stored procedures. Method calls form a key component of object databases [CAT 94]. The introduction of service calls in XML documents is thus very natural. Indeed, external functions were present in Lore [HUG 97], and an extension of object databases to handle semistructured data is proposed in [Ozo 99], thereby allowing the introduction of external calls in XML data. Our work is tailored to XML and web services. In that sense, it is more directly related to Microsoft Smart Tags [POW], where service calls can be embedded in Office documents, mainly to enrich the user experience by providing contextual tools. Our goal is to provide means of controlling and enriching the use of web service calls for data integration, and to equip them with a formal semantics.

Active databases and triggers. The activation of service calls is closely related to the use of triggers [ULL 89] in relational databases, or rules in active databases [WID 96]. Active rules were recently adapted to the XML/XQuery context [BON 02]. A recent work considered firing web service calls [BON 01]. We harness these concepts to obtain a powerful data integration framework based on web services. In some sense, the present work is a continuation of previous work on ActiveViews [ABI 99]. There, declarative specifications allowed for the automatic generation of applications in which users could cooperate via data sharing and change control. The main differences with ActiveViews are that (i) AXML promotes peer-to-peer relationships vs. interactions via a central repository, and (ii) the cornerstones of the AXML language are XPath, XQuery and web services vs. object databases [CAT 94].

Peer-to-peer computing is gaining momentum as a large-scale resource sharing paradigm by promoting direct exchange between equal-footed peers [kaz, mor]. We propose a system where interactions between peers are at the core of the data model, through the use of service calls. Moreover, it allows peers to play different roles, and does not impose strong constraints on interaction patterns between peers, since they are allowed to define and use arbitrary web services. While we do not consider in this paper issues such as dynamic data placement and replication, or automatic peer discovery, we believe that solutions developed in the peer computing community for these problems (see for instance [mor]) can benefit our system, and we plan to investigate this in future work.

3. AXML by example

In this section, we introduce Active XML via an example. In Section 3.1, we present a simple syntax for including service calls within AXML documents, and outline its core features. Section 3.2 deals with intensional parameters and results of service calls. We then consider, in turn, the lifespan of data, the activation of calls, and the definition of AXML services.

3.1. Data and simple service integration

At the syntactic level, an AXML document is an XML document. At the semantic level, we view an AXML document as an unordered data tree, i.e., the relative order of the children of a node is immaterial. While order is important in document-oriented applications, in a database context like ours it is less significant, and we assume that, if needed, it may be encoded in the data. Also, we attach a special interpretation to particular elements in the AXML document that carry the special tag <sc>, for service call; these elements represent service calls that are embedded in the document 3. In general, a peer may provide several web services. Each service may support an array of functions. We use here the terminology service call for a call to one of the functions of a service.

As an illustration, consider the AXML document corresponding to my personal page for auctions, which I manage on my peer, say mypeer.com. This simple page contains information about the categories of auctions I am currently interested in, and the current outstanding auction offers for these categories. The page may be written as follows:

<myAuctions> Auctions I'm interested in.
  <category name="Toys">
    <sc>auction.com/getOffers("Toys")</sc>
  </category>
  <category name="Glassware">
    <sc>eBay.com/getAuctions("Glassware")</sc>
  </category>
</myAuctions>

3. For readability, we use a simple syntax for <sc> elements. The complete syntax is presented in the appendix.


While the category names are explicitly written in the document, the offers are specified only intensionally, i.e., using service calls instead of actual data. Here, the list of toy auctions is provided by auction.com. On that server, the function getOffers, when given as input the category name Toys, returns the relevant list of offers as an XML document. The latter is merged into the document, which may now look as follows:

<myAuctions> Auctions I'm interested in.
  <category name="Toys">
    <sc>auction.com/getOffers("Toys")</sc>
    <auction aID="1"><description>Stuffed bear toy</description></auction>
    <auction aID="2">...
  </category>
  ...
</myAuctions>

Observe that the new data is inserted as sibling elements of <sc>, and that the latter is not erased from the document, since we may want to re-activate this call later to obtain new auction offers. Finally, note that in the case of a continuous service, several result messages may be sent by the service for one call activation. In this case, all the results accumulate as siblings of the <sc> element.

Merging service results. More refined data integration may be achieved using ID-based data fusion, in the style of, e.g., [PAP 96, ABI 97, DEU 99]. In XML, a DTD or XML Schema may specify that certain attributes uniquely identify their elements. When a service result contains elements with such identifiers, they are merged with the document elements that have the same identifiers, if such exist. To illustrate this, assume that auction.com supports a getBargains function that returns the list of the ten currently most attractive offers, each one with its special bargain price. Suppose also that aID is a key for auction elements. If an auction element with aID “1” is returned by a call to getBargains, the element will be “merged” with the auction with aID “1” that we already have.
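For instance, after calls to both getOffers and getBargains, the bargain information is fused into the existing auction element rather than appended as a new sibling (a sketch; the bargainPrice element and the exact shape of getBargains results are illustrative):

<category name="Toys">
  <sc>auction.com/getOffers("Toys")</sc>
  <sc>auction.com/getBargains()</sc>
  <auction aID="1">
    <description>Stuffed bear toy</description>
    <bargainPrice>9.99</bargainPrice>  <!-- merged from the getBargains result, since aID is a key -->
  </auction>
</category>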

XPath service parameters. The parameters of a service call may be defined extensionally (as in the previous examples) or may use XPath queries. For instance, the getOffers service used above takes as input a category name. Rather than giving the name explicitly, we can specify it intensionally using a relative XPath expression [XPa]:

<myAuctions> Auctions I'm interested in.
  <category name="Toys">
    <sc>auction.com/getOffers([../@name/text()])</sc> ...
  </category> ...
</myAuctions>


The XPath expression ../@name/text() navigates from the <sc> node to the parent <category> element, and then to its name attribute. The service is then called with the name attribute's value as a parameter. In this example, there is only one possible instantiation for the XPath expression. In general, several document subtrees may match a given XPath expression. There are two possible choices here: either activate the service several times, once per possible instantiation, or call it just once, sending the forest consisting of all the instantiations as a single parameter. In our implementation, we took the first approach as the default, but in principle one could add an attribute to the <sc> element to explicitly specify which of the two semantics is preferred for a particular call. Besides the parameters, the name of the called peer as well as the name of the service itself may be specified using relative XPath expressions. The same default semantics applies.
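For instance, placing the call one level up and letting its parameter range over all categories, the default semantics fires one concrete call per match (a sketch reusing the document above):

<myAuctions> Auctions I'm interested in.
  <category name="Toys"/>
  <category name="Glassware"/>
  <sc>auction.com/getOffers([../category/@name/text()])</sc>
</myAuctions>

Here the XPath matches both name attributes, so two concrete calls are fired, auction.com/getOffers("Toys") and auction.com/getOffers("Glassware"); under the alternative semantics, a single call would receive the two-element forest as one parameter.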

3.2. Intensional parameters and results

The parameters of the service calls that we have seen so far were (instantiated as) simple strings. In general, the parameters of a service call may be arbitrary AXML data, specified either explicitly or by an XPath expression. In particular, AXML parameters may contain calls to other services, thus leading to intensional service parameters. For example, to get a more adequate set of auctions, we may use a service that, in addition to the category name, needs the current user budget, which is itself obtained by a call to the bank's services:

<myAuctions> Auctions I’m interested in.<category name=”Toys”>

<sc>auction.com/getOffers1([../@name/text()],<sc>bank.com/getBudget(“Bob”)</sc>)</sc></category>

</myAuctions>Up to now, we have not discussed where and when a service call is activated. In theabove example, we already face a choice concerning the activation of getBudget. Wemay call it first, and then call getOffers providing it with the result. Another option isto call directly getOffers with the intensional parameter, and let it handle the activationof the call to getBudget. We will further discuss this issue in Section 5.

Services may not only take intensional data (i.e., AXML documents with embedded service calls) as input, but may also return such data as a result. As an example, each auction in the result of getOffers may contain a call to a getDetails service that provides more information about that particular auction:

<myAuctions> Auctions I'm interested in.
  <category name="Toys">
    <sc>auction.com/getOffers([../@name/text()])</sc>
    <auction aID="1">
      <description>Stuffed bear toy</description>
      <sc>auction.com/getDetails([../@aID])</sc>
    </auction>
    ...
  </category>
  ...
</myAuctions>

Observe that intensional results already appear in practice in many popular applications. For example, the Google search engine returns, for a given keyword, some document URLs plus (possibly) a handle for obtaining more answers. With this handle, one can obtain a new list and perhaps another handle. AXML service calls can therefore be seen as a generalization of HTML hyperlinks that handles calls to web services.

3.3. Controlling the lifespan of data

So far, all the service call results were accumulated in the calling document. In practice, we need more flexibility to manage these results, so that we may replace old results with new ones, discard data that has become too old or inconsistent, etc. Many models and techniques have been proposed for managing data lifespan, particularly in the fields of version management, temporal databases, and active databases. For our purposes, we chose a suitably simple model, which may be extended with more complex features.

To manage data lifespan, we conceptually attach a special attribute expiresOn to any data node in an AXML document 4. Some nodes may have an explicit expiration time, whereas others inherit it from one of their ancestors. Expired nodes should simply be viewed as erased from the document.

The value of the expiresOn attribute is an event that may depend on time and/or on the document content. For example, if a user wants to specify that her interest in a product category lasts only until February 19th, 2002, then the element will have the following form:

<category name="Toys" expiresOn="Feb. 19th, 2002">...

Data returned by a service may also come with an expiration time specification. This is a very useful feature, which allows a service provider to state how long a particular result is meaningful. For example, getOffers may inform the user of an auction's closing time by setting expiresOn for the returned data. The lifespan of a service call result may also be explicitly overwritten by the caller. This is done using a valid attribute in the sc element. The value of valid may be a function of the time when the call was (last) answered, denoted T_a below. For example, the following call states that auctions in category "A" are archived for one year:

<sc valid="T_a + 1 year">auction.com/contGetOffers("A")</sc>

4. Strictly speaking, it is not possible in XML to attach an attribute to a #PCDATA node. It is possible to do so in AXML.


3.4. Controlling the activation of calls

To control when a service call is activated, we use two attributes of <sc> elements, namely mode and frequency. The value of the frequency attribute is similar to that of valid, except that it is a function of T_c, the time when the service was last called. Thus, we can easily specify a given instant, a time interval, the occurrence of an event, etc. By default, a service is called only once, when the document is registered. We say that a call has expired when, according to its frequency attribute, it should be activated.

The mode of a call is either lazy or immediate. In immediate mode, the call is activated when it expires; if the call is in lazy mode, the fact that it has expired only means that the service has to be called whenever the data produced by this call is needed (e.g., by a query over the AXML document).

The data validity and the call activation mode and frequency together provide a flexible and powerful tool for capturing various integration scenarios. This is illustrated next. In the following integration styles, the first three assume regular (non-continuous) services, whereas the last one relies on continuous services (a sketch of the corresponding <sc> annotations follows the list):

– mediator style: set valid to 0 (note that an immediate mode would not be meaningful in this case).

– mediator style with caching: choose a non-zero value for valid, and lazy mode.

– warehousing mode with pulling information: valid larger than frequency, and immediate mode.

– warehousing mode with pushing: choose a non-zero value for valid, and immediate mode.
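As an illustration, the four styles could be annotated as follows (a sketch; site.com/svc and site.com/contSvc are placeholder services, and T_a, T_c are as above):

<sc valid="0" mode="lazy">site.com/svc(...)</sc>                   <!-- mediator -->
<sc valid="T_a + 1 hour" mode="lazy">site.com/svc(...)</sc>        <!-- mediator with caching -->
<sc valid="T_a + 1 week" frequency="T_c + 1 day"
    mode="immediate">site.com/svc(...)</sc>                        <!-- warehousing, pull -->
<sc valid="T_a + 1 week" mode="immediate">site.com/contSvc(...)</sc>  <!-- warehousing, push -->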

Note that the activation of a call is dissociated from the lifespan of its results. For example, if we wish to call getOffers every day, and keep the results for a week after their acquisition, we would write:

<sc valid="T_a + 1 week" frequency="T_c + 1 day">auction.com/getOffers("Toys")</sc>

REMARK. — (timeout) In the case of a non-continuous service, it may happen that the answer returns very late, or never returns at all. In practice, it is useful to have timeouts for calls. When the timeout is reached, the system abandons hope of getting the result. An exception handling mechanism should also be provided to manage such events.
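One way to surface such timeouts in the syntax would be an extra attribute on <sc>; the timeout attribute below is hypothetical, since the paper leaves the mechanism unspecified:

<sc timeout="30 seconds">auction.com/getOffers("Toys")</sc>  <!-- hypothetical attribute -->

When no answer arrives within the delay, the call's task would be destroyed and an exception raised, in line with the Evaluator behavior described in Section 6.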

3.5. AXML service definition

The AXML framework allows one to call arbitrary web services, but also to define new ones, as illustrated in this section. In short, an AXML service is defined by a parameterized XQuery query over the peer's AXML documents. As an example, getOffers, which returns all currently open auctions for a given category, may be defined at auction.com as follows:

Page 239: BDA’02Mehdi Snene, Pham Thi, Slim Turki, Patrick Valduriez, Genoveva Vargas Solar, Dan Vodislav, José Luis Zechinelli Martini, Karine Zeitouni Comité d’organisation Président

let sc auction.com/getOffers($c) be
  for $cat in document("auction.com/a.xml")//category,
      $a in $cat/auction,
      $aID in $a/@aID/text(),
      $des in $a/description/text()
  where $cat/@name/text() = $c
  return <auction aID=$aID>
           <description>$des</description>
           <sc>auction.com/getDetails(../@aID)</sc>
         </auction>

In the above example, the category parameter $c is of type #PCDATA (text). The query consults an AXML document (a.xml), which may contain service calls, and constructs an AXML document with some calls (e.g., to the getDetails function of auction.com). Here again we face a choice concerning the activation of getDetails: we may call it first, and only then return the answer of getOffers. Another option is to return the document immediately and let the caller of getOffers handle the activation of the calls to getDetails. The particular type of the service result may be described by an XML Schema [XMLb], as advised by the WSDL specification [WSD]. This information can be used to choose between the two options mentioned above.

To define continuous AXML services, we simply prefix the definition with the keyword continuous. Thus, a continuous variant of getOffers, returning the set of interesting auctions whenever it changes, is defined as follows: let continuous sc auction.com/contGetOffers($c) be ... Additional parameters can be defined to specify the frequency of updates, and whether to send full results or deltas. They are not detailed here.
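Spelled out on the getOffers definition above, the continuous variant could look as follows (a sketch; the query body is illustrative):

let continuous sc auction.com/contGetOffers($c) be
  for $cat in document("auction.com/a.xml")//category,
      $a in $cat/auction
  where $cat/@name/text() = $c
  return $a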

To define AXML services with side effects, in the absence of a standardized language for XML updates, we use the extension to XQuery proposed in [TAT 01].

3.6. Discussion

We conclude this section with two remarks regarding consistency and termination.

Consistency. We assume that the document we start with is well-formed and obeys its DTD (or XML Schema), if one is specified for it. An inconsistency may arise if a call leads to constructing a document that no longer obeys the schema. While some of this may be prevented by consulting the declared signatures of the used services [HOS 00, ALO 01], static type checking becomes more complicated due to the use of ID-based element fusion and of XPath expressions in call parameters.

Termination and recursion. We have seen above that a service call may return intensional answers. Note that this may lead to a non-terminating computation: the result of a service call may contain new service calls that need to be activated. Those in turn may return new calls to be activated, and so on. Similarly, the processing of a particular service call (namely, the evaluation of the query defining it) may trigger the execution of new calls (perhaps even to the same service), etc. While some form of recursion is useful, e.g., for defining transitive-closure-style computations, detecting termination in general is difficult due to the distributed form of the computation and the independence of the peers. Some simple sufficient conditions for termination are mentioned in the Appendix.

4. Data and computation model

In this section, we briefly define the AXML data model and the semantics of AXML documents and services. For lack of space, the presentation is informal. The formal definitions, as well as the proofs of the results, can be found in [ABI 01].

Intuitively, an AXML instance consists of a number of peers, each one containing some AXML documents that are being run. AXML documents are unordered XML trees. The evaluation of these documents generates calls between the peers, and possibly results in new documents being evaluated at each peer. As we shall see, the evaluation is non-deterministic, which captures the asynchronous evolution of the global instance; the instance may or may not eventually reach a fixpoint. We first present the data model, then the computation.

4.1. Data model

An (AXML) instance consists of a number of peers. Each peer contains AXML documents, some service definitions, and a working area. We next define instances, then proceed to the definition of documents and services.

Instances. An instance I consists of a number of peers p1, ..., pn. The content of a peer p is defined by a triple (R, S, W) where R, the peer's repository, is a set of persistent AXML documents, S, the peer's services, is a set of AXML service definitions, and W, the peer's working area, is a set of temporary AXML documents. All the sets are assumed to be finite.

Each document in the working area W of a peer p represents the computation of some service call in R or W, i.e., some current work that p is performing. Any such document also contains a destination attribute specifying the place where the result of this computation should be sent, which can be a local element or a request from a remote peer.
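For illustration, such a working-area document might be laid out as follows (a sketch: the task element name and the way the destination attribute points back to the target node are hypothetical; the model only requires that the root carry the concrete call and a destination):

<task destination="mypeer.com/myAuctions.xml#n17">
  <sc>auction.com/getOffers("Toys")</sc>
</task>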

Documents. As for standard XML documents, an AXML document is modeled by a labeled tree, with nodes representing the document elements/attributes and with edges capturing the component-of relationship among document items. The three main differences with the standard XML data model [XMLa] are that (1) we ignore here the order of elements, hence our trees are unordered 5, so we only consider the order-free fragments of XPath (for parameters) and XQuery (for service definitions);

5. We may take into account the ordering in some specific cases, e.g., for the extensional portions of documents.


(2) a validity predicate is attached to some elements, to specify when particular data becomes stale; (3) some of the tree leaves are special service call nodes, called in the sequel sc-nodes. An sc-node is labeled by a tuple of the form (p, s, P1, ..., Pk) where:

– p and s are respectively a peer name and a service name, or XPath expressions. In the first case, the service s must be defined at peer p with arity k.

– P1, ..., Pk, the call parameters, are AXML documents or XPath expressions.

An sc-node where none of p, s, P1, ..., Pk are XPath expressions is called a concrete call.

Reduced documents. Continuous services send a sequence of answers to the caller. Smart (or optimized) services may send only the delta since the last answer. In other cases, the caller may be responsible for detecting and ignoring redundant data. To abstract this (without getting into implementation details such as who performs the optimization and when), we use in the formal model the notion of the reduced version of a document, in which multiple occurrences of the same data are omitted.

To define reduced documents, we use the auxiliary concept of an inclusion relationship among trees. A reduced document is one in which no subtree is included in one of its siblings. One can show that the reduced version of a document is unique. It can, for instance, be computed by iteratively removing redundant subtrees. The details are omitted. We will assume in the sequel that all our AXML documents are reduced.
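For example, the following document is not reduced, since its second <auction> subtree is included in (here, identical to) its sibling:

<category name="Toys">
  <auction aID="1"><description>Stuffed bear toy</description></auction>
  <auction aID="1"><description>Stuffed bear toy</description></auction>
</category>

Its reduced version keeps a single copy:

<category name="Toys">
  <auction aID="1"><description>Stuffed bear toy</description></auction>
</category>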

Service definitions. To conclude this section, let us consider the definition of S, i.e., the set of service definitions. The semantics of XQuery queries is standard, with one notable exception: when evaluating path expressions, service calls act like document boundaries that the evaluation cannot cross. In other words, they are terminal nodes that do not match any path expression.

The definition of an AXML service consists of the service name, the service type (e.g., continuous or not), the service parameter names x1, ..., xk, and a parameterized query q, namely a query that may refer to the (parameter) documents x1, ..., xk.

4.2. Computation

We are now ready to define the semantics of AXML documents. Each peer includes a collection of AXML trees, in R and W. These documents may contain service calls that may be activated to derive more information about the documents. A service call activation spawns a computation on one of the peers. More precisely, the activation in peer p of a particular service call to peer q involves (1) (possibly) instantiating in p the XPath expressions among the attributes of the call, (2) for each instantiation, sending a concrete call from p to q, (3) computing in q the corresponding answers, and (4) returning the answers to p, where they are received and merged at the appropriate place in the tree. If, for some reason, the resulting tree is no longer a legal AXML document, it becomes the inconsistent document.

Recall that the decision whether service calls can, and need to, be instantiated (resp. sent, computed, returned) at a given time depends on the specific call attributes. We will simply refer to such calls as eligible for instantiation (resp. sending, computation, returning). We will see in the next section how this can be implemented.

An initial instance is one in which all peers have an empty working area. Given an initial instance I0, each peer evolves in a similar way. Starting from I0, repeatedly (and non-deterministically), one of the following steps is executed:

Step 1: Instantiate the XPath parameters. For some (non-concrete) sc-node n in R or W that is eligible for instantiation, the XPaths are evaluated and, for each instantiation, a new document is added to the peer's working area. The roots of these documents have the corresponding concrete service call as an sc-node child, and have n as the destination for the result of the computation.

Step 2: Send/receive an external call. For some concrete sc-node n in R or W that is labeled with a call to some remote peer q and is eligible for sending, the call is activated. Formally, this consists in adding, to the working area of the receiving peer q, a new document whose root has an sc-child labeled with the call and having n as the destination for the result.

Step 3: Compute a local call. For some concrete sc-node n in R or W that is labeled with a call to a local service of the peer and is eligible for computation, evaluate the service query using the given parameters. The result, a forest, is merged under the parent node of n.

Step 4: Return/receive the result of a call. For some document d in W that is eligible for being returned as an answer, the children of d's root (not including the destination attribute) are sent to the destination peer and merged under the parent of the destination node.

Observe that, in the above computation, we grouped sending (resp. returning) a call and receiving it into one operation. Intuitively, our send/receive (resp. return/receive) operation captures the moment when the receiver receives the message. Finally, to guarantee a correct semantics, we need a fairness condition:

(fairness) Any operation that may happen eventually happens.

Non-determinism and confluence. In general, an initial instance I0 may be transformed in many different ways, depending on the choice of the operations to perform. This non-determinism is built into the semantics. So, even if an instance converges to a fixpoint, the fixpoint does not have to be unique. Furthermore, as mentioned in Section 3.6, the computation may continue forever, building more and more data, i.e., there is no guarantee of termination.

Although this may seem to be a negative feature of the model, observe that it naturally models the real world we are trying to capture.


The state of a peer may continuously evolve because, for instance, of interactions with human users updating data. Also, continuous external services such as subscriptions may keep sending new data to the peer. So, the system should not be expected to terminate. Furthermore, data may expire or get deleted, and the order in which the various operations/queries are executed may have an impact on the state. Thus, because of the asynchronicity and the independence of peers, determinism is an elusive goal in such an environment. However, termination and confluence can be enforced under very strict restrictions, as outlined next.

REMARK. — (Monotone computation) Suppose the computation is monotone, i.e., no fact is ever deleted or updated, and information keeps being added in a cumulative manner. Furthermore, assume that each call becomes eligible for instantiation and sending infinitely often (i.e., the call is repeatedly triggered and its XPaths re-instantiated). Under these restrictions, the order in which the steps are executed is no longer significant. One can then guarantee that all computations lead to a (possibly infinite) unique state. A finite state can be guaranteed with some additional restrictions. This is in the spirit of results on inflationary fixpoint semantics; see [ABI 95]. Details are deferred to the appendix.

5. Limiting the firing of calls

So far, we have described the AXML paradigm at a rather abstract level. Before we consider its actual implementation, we highlight some important issues that are critical for a real-life implementation. Before activating a service call, two points need to be checked: (i) that the receiving peer is willing to accept the call, i.e., that the caller has the proper privileges to issue the call, and (ii) that the receiver has the capability to process the call, which involves understanding the parameters that are sent. In practice, access to services from other peers will be severely controlled for security reasons. Also, peers will have limited capabilities, e.g., most of them will only accept calls with “plain” XML arguments.

5.1. Site capabilities and security

First, consider peer capabilities. We illustrated in Section 3.1 the use of intensional parameters in a service call. Observe that they may, in principle, be evaluated before or after the call is sent to the receiving peer. In practice, not all choices are always feasible. For instance, consider again the example in Section 3.1. If auction.com is not capable of calling getBudget on bank.com (e.g., because it is not an AXML peer), then “Bob”'s AXML peer must first call getBudget, and only then call getOffers with the result.

Now, consider the security concerns that must guide call activations. Access control is a needed feature for many applications. First, a service provider may wish to reserve access to a service to those who have paid for it. For example, acm.org currently allows users from the inria.fr domain to use the search services of the ACM digital library, but not every web user can do so. Furthermore, security is necessary to protect sites from malicious usage. Not surprisingly, the exchange of data that includes service calls is a major security hole. For instance, suppose that we want to break into a peer, say the site qod.com, providing quotes of the day. There are two main ways to do this:

In a call parameter. Intensional service parameters open backdoors to AXML servers. For instance, a malicious client may use the following call to qod.com:

<sc>qod.com/QuoteOfDay(<sc>buy.com/BuyCar("BMW Z3")</sc>)</sc>

This malicious user does not wish to buy the car himself, but tries to make qod.com buy it.

In a call result (Trojan horse). Suppose now that qod.com is malicious in the quotes it provides, e.g., returning the following quote as a call result:

<quote> Love means never having to say you're sorry.
  <sc>buy.com/BuyCar("BMW Z3")</sc>
</quote>

Thus, by sending an intensional result, the peer may force its clients to invoke dangerous services.

Finally, perhaps the most natural violation of security is to bring an AXML peer to transmit private data to a malicious distant site. This may be achieved, for instance, by including the following call (as a parameter of a call or in a result):

<sc>i.am.bad/SneakAbout([../../*])</sc>

Instantiating this XPath argument amounts to sending i.am.bad (possibly private) parts of the document that included this call, which is clearly an issue.

The above examples show that the AXML framework makes unauthorized attempts to access data quite likely, as well as malicious usage of web services. Hence, access control is essential. We next see how this can be incorporated into the framework.

5.2. Our solution

We illustrate how the above issues may be addressed with two very simple policies. These policies have to be combined with some access control mechanism on the documents. Access models for XML have been proposed in, e.g., [DAM 01]. This aspect will not be detailed here.

In the first policy, called binding, a peer publishes the kind of arguments each of its services accepts (e.g., arbitrary AXML, XPath expressions, strict XML). Only calls with the proper arguments may then be activated. Note that this policy can be enforced using the WSDL language, which enables the publication of XML Schema types for services' input/output parameters.
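For instance, a service accepting only strict XML can publish a parameter type that leaves no room for <sc> elements (a sketch; the element and type names are illustrative):

<xsd:element name="getOffersRequest">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="category" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<!-- A parameter carrying an embedded <sc> element does not validate
     against this type, so the binding policy rejects the call. -->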


The second policy, called trust, reflects some form of agreement between the caller and the receiver. More precisely, the reasoning that allows one to decide whether a service s (where s includes the name of the service and the site that provides it) can be called by a site c is encapsulated in a boolean function trust(s, c). The trust function returns true if c is willing to call s and the provider of s is willing to accept this call from c. Note that, as in Java's sandbox security model [GON 97], the decision depends on the origin of the call. This function will be used to determine which calls are eligible for activation at each point in time. We will see exactly how this is done in the next section.

To implement trust, we can assume, for instance, that each peer has an agreed service list, containing the services that it trusts and is willing to call. Similarly, we assume for every service an agreed site list, i.e., the sites (trusted and accepted by the service provider) from which the service may be called. These two lists are typically exposed as web services. More precisely, each AXML peer provides (i) a service that allows one to check whether it is willing to let another peer call one of its services, and (ii) a service to check whether it is willing to call some particular service. For non-AXML peers, we make conservative assumptions.
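Such lists can themselves be kept as XML documents behind the two checking services (a sketch; the element names are hypothetical):

<!-- services this peer trusts and is willing to call -->
<agreedServices>
  <service>bank.com/getBudget</service>
  <service>auction.com/getOffers</service>
</agreedServices>

<!-- sites allowed to call this peer's getOffers service -->
<agreedSites service="getOffers">
  <site>mypeer.com</site>
</agreedSites>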

As mentioned above, these two models, binding and trust, may be combined. They may also be extended in a number of ways. First, one may want to include some access control list (ACL) mechanism, to grant different rights to various users of a peer. One might want to control the right to fire a particular service call, or the right to access data, at an arbitrary granularity (e.g., at the element level). Also, the trust function may vary in time. For instance, depending on the load of the service provider, one may want to restrict usage of the service to certain clients only. Finally, one may want to include arbitrarily complex solutions for trust management that have been proposed, such as REFEREE [CHU 97]. No matter how complex the policy used is, the provider essentially needs to know, given a concrete call and a site, whether this site is entitled to activate this particular call.

6. Evaluation and Implementation

In this section, we describe the architectural components and the algorithms used by an AXML peer in order to evaluate and maintain AXML documents. First, we explain how time-related events are detected in the system. Then, we see how the evaluation of documents is affected by these events.

The Event Detector. To capture time-related events, we use an Event Detector module (ED). For simplicity, we omitted this module from the architecture sketch (Figure 1) at the beginning of this paper. The ED of an AXML peer p monitors all AXML documents on p, including data validity parameters and the activation mode and frequency of all service calls present in these documents. The ED sends messages to other components of the AXML peer:

– to the Evaluator: when a service call has expired, or has reached timeout;


– to the AXML storage: when a data node has become invalid.

Before presenting our evaluation algorithm, recall from Section 3.4 that service calls can be defined to be immediate or lazy. Immediate service calls have to be activated as soon as they expire, while the activation of expired lazy calls may be postponed until their results are actually needed. To simplify the presentation, we first assume below that all the service calls in the documents are defined with an immediate execution mode, and explain the evaluation algorithm for this restricted case. Then, we explain how the above needs to be extended in order to support lazy calls. Finally, we describe our implementation. Recall from Section 4 that a concrete service call is one whose parameters do not include XPath expressions.

6.1. Calls with immediate mode

We start by explaining how the Evaluator decides when a call is eligible for instantiation, resp. activation, computation and return (in the terms of Section 4), based on the messages received from the ED. We then outline the algorithms for processing service call activations.

Deciding on call eligibility. The following rules are applied by the Evaluator module:

– Upon receiving an “sc has expired” message from the ED, if sc is non-concrete, it becomes eligible for instantiation.

– If sc is concrete and aimed at some service outside the peer, we first choose some of the service calls included in the parameters (according to the security, capability, and optimization reasoning outlined in Section 5) and process them. Only then does sc become eligible for sending.

– If sc is concrete and aimed at a local AXML service defined on the peer by a query q, then sc becomes eligible for computation.

– After an sc aimed at a local AXML service has been evaluated, its result becomes eligible for returning after being post-processed (again, by calling some of the service calls in the result, based on security, capability, etc.).

Processing service call activations. Recall from Section 4 that the four steps of the computation were chosen non-deterministically and in random order. We introduce here the notion of a task, to track the evaluation of each particular service call, from the moment it is activated to the end of its evaluation. Like sc-nodes, tasks can be concrete or non-concrete. Documents in W naturally have corresponding tasks, and so do activated sc-nodes in R. Note that the evaluation is still non-deterministic, and that tasks can be evaluated in parallel: at a given point in time, a task is either running, ready, or suspended, waiting for some event, perhaps the end of another task. Any of the ready tasks may be processed at that point.

Tasks are created in three possible ways. First, the Evaluator creates a new task (concrete or non-concrete) for the activation of every expired, immediate service call. Second, upon receiving from outside a call to a service defined at the peer, the SOAP wrapper creates a task for this call in W. Note that this task is concrete, since only concrete tasks can be sent (see Section 4). Third, the processing of a task may create other tasks, as we will see.

As a notation, let t = (q, s, P1, ..., Pk, dest) be a task with destination dest, corresponding to the activation of a call to the service s, provided by the peer q, with parameters P1, ..., Pk.

Figure 2 outlines the simple algorithm for evaluating non-concrete tasks. First, the XPath parameters of the task have to be evaluated, by issuing calls to the query processor. When the evaluation is done, each Pi has the value of an AXML forest Fi. As mentioned in Section 3.1, the non-concrete call is unrolled into as many concrete calls as there are elements in the cartesian product of the forests Fi. The processing of t is over when all these concrete tasks have finished.

At peer p, for a non-concrete task t = (q, s, P1, ..., Pk, dest):

1  evaluate the XPath parameters P1, ..., Pk
2  foreach Pi
3    let Fi be the value obtained for Pi (an AXML forest)
4  foreach tuple (f1, ..., fk) in F1 × ... × Fk
5    create a concrete task t' = (q, s, f1, ..., fk, dest)
6    insert t' in W
7  suspend until all tasks t' finish
8  exit

Figure 2. Processing a non-concrete task.

In Figure 3, we describe the processing of a concrete task. Assume that a parameter Pi, which is an AXML tree, contains some expired service call sc. Then, the peer has to decide whether it needs to activate it or to send it as an intensional parameter. This decision is based on the binding and trust policies described in Section 5 6. Note that the decision is made locally, using the policies of the peer, without requiring a “global” view of the security and capability requirements of other peers.

At line 6, if s is a service local to the peer, then we call the XQuery processor with the proper arguments; otherwise, a call is sent via the SOAP wrapper. In both cases, t is suspended waiting for the result. Once the peer receives the result, if it needs to forward it to a distinct peer, we may have to decide when and where to execute the calls the result contains. The reasoning is very similar to the one above, dividing the work among the two peers. Subsequently, the result is sent to its destination (if the destination is local, by accessing the local AXML repository; otherwise, by sending a result message through the SOAP wrapper). Finally, the concrete task exits.

6. It may also take into account other considerations such as the system load.


At peer p, for a concrete task t = (q, s, P1, ..., Pk, dest):

1   foreach expired service call sc in some parameter Pi
2     if p decides to activate sc
3     then create t', a new task for sc; insert t' in W
4   suspend until all tasks t' finish
5   if q = p (i.e., s is defined in p by some query qs)
6   then call qs(P1, ..., Pk); suspend until the result is ready
7   else (i.e., s is a distant service)
8     call q/s(P1, ..., Pk); suspend until the result is ready
9   if dest is not local to p
10  then foreach expired service call sc in the result
11    if p decides to activate sc
12    then create t', a new task for sc; insert t' in W
13  suspend until all tasks t' finish
14  send the result to be inserted under dest
15  if s is non-continuous
16  then exit

Figure 3. Processing a concrete task.

[Figure 4 shows (a) a document in which the zone consulted by the XPath parameters of sc1 intersects the zone modified by the activation of sc2, and (b) the resulting dependency graph among service calls sc1 through sc7.]

Figure 4. Dependencies among service calls.

Continuous tasks. Tasks associated with calls to continuous services keep running. The received updates just keep being sent to their destination. When they appear in the algorithms described above, these tasks are non-blocking.

Unsubscribe and timeout. For readability, we have omitted some issues from the algorithms depicted in Figures 2 and 3. First, if an unsubscribe message for a continuous service is sent by the ED, the Evaluator identifies the associated concrete tasks, sends “unsubscribe” messages to their service providers, and destroys the tasks. Similarly, when a non-continuous call times out, the Evaluator destroys its task.

6.2. Calls with lazy mode

Let us now consider the more complex case of the lazy mode.


Service call dependencies The presence of lazy calls may cause dependencies among call activations. For example, assume that we need to activate a non-concrete service call. Before instantiating its XPath parameters, we may need to activate some lazy service calls that may affect the result of the instantiation. This situation is illustrated in the AXML document shown in Figure 4(a). The "influence zone" of sc1, i.e., the set of nodes that may be modified by the results of sc1, intersects the zone in which the XPath parameters of sc2 are evaluated. If sc1 is in lazy mode and has expired, then it is preferable to call it again before we instantiate the XPath arguments of sc2. In turn, sc1 may have XPath parameters that evaluate in the influence zones of lazy, expired service calls, leading to a graph of dependencies like the one in Figure 4(b).

Similarly, assume that a request for an AXML service is received, and the service query Q needs to be evaluated. Before calling the XQuery processor, we have to check whether the data read by Q intersects the influence zone of some lazy, expired service call. This again leads to a dependency graph of the above form.

A reasonable compromise between precision and complexity has to be found for tracking dependencies. It is not possible to compute dependency graphs statically: for instance, as a document evolves, calls are added or removed by service call activations. Moreover, computing the exact dependency graph of a service call leads to computationally complex problems such as XPath containment [DEU 01].

We therefore adopt the following pragmatic solution. We consider the influence zone of a service call to be the set of subtrees rooted at its parent. We consider the scope of an XPath expression to be the set of subtrees rooted at the highest nodes attained by its evaluation, as described by the XPath specification [XPa]. Finally, we assume the data read by an XQuery query to be described by the XPath expressions in its for clause (in some sense, this simple approach is pessimistic, since we do not use the where clause to filter the actually consulted data). In general, path expressions may also appear in other parts of the query, e.g., the where clause; w.l.o.g., we assume here that the query is first normalized [MAN 01]. We have thus reduced the dependency decision problem to deciding whether two trees intersect, which can be done in constant time, provided a convenient encoding for element IDs (e.g., [LI 01]).
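For instance, with an interval-style element ID encoding in the spirit of [LI 01] (the representation below is our own simplification, not the paper's), the intersection test reduces to two integer comparisons:

    class IntervalId {
        // (start, end) is the preorder interval covering the element's subtree
        final int start, end;
        IntervalId(int start, int end) { this.start = start; this.end = end; }

        boolean contains(IntervalId other) {
            return start <= other.start && other.end <= end;
        }

        // Two subtrees of the same document either nest or are disjoint, so
        // testing intersection amounts to testing containment both ways.
        static boolean subtreesIntersect(IntervalId a, IntervalId b) {
            return a.contains(b) || b.contains(a);
        }
    }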

A call dependency graph may contain cycles reflecting mutual call dependencies.They are broken by arbitrarily choosing some dependencies to be ignored. Breakingthe cycles amounts to introducing non-determinism and possibly “missing” some data.In a web context, this is acceptable.

Eligibility with lazy mode In the presence of lazy calls, a given call sc may bedeclared eligible for instantiation (resp. execution) only after all the lazy calls in itsdata dependency graph have been issued.

Call activation with lazy mode Task processing in the presence of lazy calls is more complex, due to the fact that we have to track data dependencies. First, before instantiating an XPath argument of a non-concrete call, we have to make sure that the data it bears on is available. To that purpose, before line 1 in Figure 2, we need to construct the dependency graph G for the XPath parameters of the task, on a snapshot of the destination document. If G has cycles, they are broken; then, we create tasks for all the leaf nodes of G, and process them in parallel. When these tasks are over, to take into account their effect on the destination document, we re-compute G; as long as G is not empty, we repeatedly create and process tasks, corresponding to lazy, expired calls, that t depends on. The processing of t is suspended until G is empty.
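The loop can be sketched as follows in Java (all types here are hypothetical placeholders for the prototype's internals):

    import java.util.List;

    class LazyModeActivation {
        // Repeatedly rebuild the dependency graph on a fresh snapshot, break
        // its cycles, and run the tasks for its leaves, until no lazy expired
        // call that the task depends on remains.
        void ensureDataAvailable(Task t, DependencyTracker tracker, Scheduler scheduler)
                throws InterruptedException {
            Graph g = tracker.dependencyGraph(t);     // built on a snapshot
            while (!g.isEmpty()) {
                g.breakCycles();                      // arbitrarily ignore some edges
                scheduler.runAndWait(g.leafCalls());  // activate lazy expired calls
                g = tracker.dependencyGraph(t);       // activations changed the document
            }
        }
    }

    interface Task {}
    interface ServiceCall {}
    interface Graph { boolean isEmpty(); void breakCycles(); List<ServiceCall> leafCalls(); }
    interface DependencyTracker { Graph dependencyGraph(Task t); }
    interface Scheduler { void runAndWait(List<ServiceCall> calls) throws InterruptedException; }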

The very same steps have to be applied when processing a concrete task, before actually calling the XQuery processor (line 6 in Figure 3), except that G is computed for the XPath expressions that the defining query depends on. We omit the details.

6.3. Implementation

A first prototype of the AXML peer software has been implemented in Java. It relies on the XOQL query processor [AGU], which implements an algebra similar to that of XQuery (we chose XOQL because, at the time we started this implementation, no XQuery processor was available to us; although there are differences with XQuery, these are mostly syntactic). The SOAP wrapper, which is needed both to invoke and to answer service calls, relies on the Axis engine from the Apache software group [apa], which, although in an early development stage, provides good performance and great flexibility through its architecture based on chainable handlers.

We implemented the evaluation strategy of Section 6.1, which only deals with the immediate activation of service calls. This is done mainly using a timer thread that acts as a scheduler. In this restricted case, dependencies among service calls do not have to be tracked. Tasks are evaluated in parallel, each one being handled by a separate thread. A thread pool mechanism is used to limit the number of simultaneously running threads.
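Such a scheduler can be sketched with java.util.concurrent (a simplification under our own naming, not the prototype's actual code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class EvaluatorScheduler {
        // The timer thread plays the role of the scheduler; the bounded pool
        // limits the number of simultaneously running task threads.
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        // Trigger an immediate-mode call now, and again at the given period.
        void schedule(Runnable task, long periodSeconds) {
            timer.scheduleAtFixedRate(() -> pool.submit(task),
                                      0, periodSeconds, TimeUnit.SECONDS);
        }

        void shutdown() {
            timer.shutdown();
            pool.shutdown();
        }
    }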

Since SOAP supports only RPC calls and one-way messages, we built a layer on top of it to allow for continuous services [CRE 01]. Basically, the caller of a continuous service provides a listening SOAP service, used by the callee to return a stream of answers as one-way messages.
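In outline (interface names are ours, for illustration), the layer amounts to a callback contract between subscriber and provider:

    interface AnswerListener {
        // Invoked by the callee for each new answer, as a one-way message.
        void onAnswer(String axmlFragment);
    }

    interface ContinuousService {
        // The caller registers the endpoint of its listening SOAP service;
        // the returned id is later used in "unsubscribe" messages.
        String subscribe(String parameters, AnswerListener callbackEndpoint);
        void unsubscribe(String subscriptionId);
    }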

This prototype is functional, and was used to build a distributed peer-to-peer auctioning application, where each peer can offer auctions for other peers to bid on, and search for auctions of interest available from other peers, without needing a centralized auction server [ABI 02].

7. Conclusion

The AXML paradigm allows one to turn service calls embedded in XML documents into a powerful tool for data integration. This includes in particular support for various integration scenarios like mediation, data warehousing, and the distribution of computations over the web via the exchange of AXML documents.

We implemented a first prototype, but further work is needed to develop appropriate optimization techniques. Because of the richness of the model, this is a complex task that should borrow from many previously used techniques, notably from the contexts of warehouses and mediators. We also need to build an environment for designing AXML documents, and tools for easily building applications that use them.

The proposed AXML paradigm should be further experimented with and evaluated. Towards this goal, we intend to use AXML as an application development platform in the context of a European project called DBGlobe. The project deals with data management problems in global distributed computing environments, with a strong emphasis on mobility. We believe it provides an adequate testbed for the proposed framework.

8. References

[ABI 95] ABITEBOUL S., HULL R., VIANU V., Foundations of Databases, Addison-Wesley Publishing Company, Reading, Massachusetts, 1995.

[ABI 97] ABITEBOUL S., QUASS D., MCHUGH J., WIDOM J., WIENER J., "The Lorel Query Language for Semistructured Data", Int. Journal on Digital Libraries, vol. 1, num. 1, 1997, p. 68–88.

[ABI 99] ABITEBOUL S., AMANN B., CLUET S., EYAL A., MIGNET L., MILO T., "Active Views for Electronic Commerce", Proc. of VLDB, 1999.

[ABI 01] ABITEBOUL S., BENJELLOUN O., MILO T., "A Data-Centric Perspective on Web Services (Preliminary Report)", report num. 212, Nov. 2001, INRIA.

[ABI 02] ABITEBOUL S., BENJELLOUN O., MILO T., MANOLESCU I., WEBER R., "Active XML: Peer-to-Peer Data and Web Services Integration (demo)", Proc. of VLDB, 2002.

[AGU] AGUILERA V., "The X-OQL homepage".

[ALO 01] ALON N., MILO T., NEVEN F., SUCIU D., VIANU V., "XML with Data Values: Typechecking Revisited", Proc. of ACM PODS, 2001.

[apa] "The Apache Software Foundation", http://www.apache.org/.

[BON 01] BONIFATI A., CERI S., PARABOSCHI S., "Pushing Reactive Services to XML Repositories using Active Rules", Proc. of the Int. WWW Conf., Hong Kong, China, May 2001.

[BON 02] BONIFATI A., BRAGA D., CAMPI A., CERI S., "Active XQuery", Proc. of ICDE, 2002.

[CAR 98] CARDELLI L., GORDON A. D., "Mobile Ambients", NIVAT M., Ed., Proc. of FoSSaCS, vol. 1378, p. 140–155, Springer-Verlag, Berlin, Germany, 1998.

[CAR 99] CARDELLI L., "Abstractions for Mobile Computation", Secure Internet Programming, 1999, p. 51–94.

[CAT 94] CATTELL R. G. G., Ed., The Object Database Standard: ODMG-93, Morgan Kaufmann, San Mateo, California, 1994.

[CHR 01] CHRISTOPHIDES V., HULL R., KUMAR A., SIMÉON J., "Workflow Mediation using VorteXML", IEEE Data Engineering Bulletin, vol. 24, num. 1, 2001, p. 40–45.

[CHU 97] CHU Y., FEIGENBAUM J., LAMACCHIA B., RESNICK P., STRAUSS M., "REFEREE: Trust Management for Web Applications", Proc. of the Int. WWW Conf., vol. 29(8-13), 1997, p. 953–964.

[CRE 01] CREMENESCU F., "Supporting Subscription Services using SOAP", final-year internship report, École Polytechnique, 2001.

[DAM 01] DAMIANI E., DI VIMERCATI S. D. C., PARABOSCHI S., SAMARATI P., "Securing XML Documents", Proc. of EDBT, 2001.

[DEU 99] DEUTSCH A., FERNANDEZ M., FLORESCU D., LEVY A., SUCIU D., "A Query Language for XML", Proc. of the Int. WWW Conf., vol. 31(11-16), 1999.

[DEU 01] DEUTSCH A., TANNEN V., "Containment of Regular Path Expressions under Integrity Constraints", Proc. of the KRDB Workshop, Rome, 2001.

[GAR 97] GARCIA-MOLINA H., PAPAKONSTANTINOU Y., QUASS D., RAJARAMAN A., SAGIV Y., ULLMAN J., WIDOM J., "The TSIMMIS Approach to Mediation: Data Models and Languages", Journal of Intelligent Information Systems, vol. 8, 1997, p. 117–132.

[GON 97] GONG L., MUELLER M., PRAFULLCHANDRA H., SCHEMERS R., "Going Beyond the Sandbox: An Overview of the New Security Architecture in the Java Development Kit 1.2", Proc. of the Usenix Symp. on Internet Technologies and Systems, 1997.

[GUP 89] GUPTA A., Integration of Information Systems: Bridging Heterogeneous Databases, IEEE Press, 1989.

[HAL 85] HALSTEAD R., "Multilisp: A Language for Concurrent Symbolic Computation", ACM Trans. on Programming Languages and Systems, vol. 7(4), 1985, p. 510–538.

[HOS 00] HOSOYA H., PIERCE B. C., "XDuce: A Typed XML Processing Language (Preliminary Report)", Proc. of WebDB, May 2000.

[HUG 97] MCHUGH J., ABITEBOUL S., GOLDMAN R., QUASS D., WIDOM J., "Lore: A Database Management System for Semistructured Data", report, Feb. 1997, Stanford University Database Group.

[JIM 01] JIM T., SUCIU D., "Dynamically Distributed Query Evaluation", Proc. of ACM PODS, 2001, p. 413–424.

[kaz] "The Kazaa Homepage", http://www.kazaa.com/.

[LEV 96] LEVY A., RAJARAMAN A., ORDILLE J., "Querying Heterogeneous Information Sources Using Source Descriptions", Proc. of VLDB, 1996, p. 251–262.

[LI 01] LI Q., MOON B., "Indexing and Querying XML Data for Regular Path Expressions", Proc. of VLDB, 2001.

[MAN 01] MANOLESCU I., FLORESCU D., KOSSMANN D., "Answering XML Queries over Heterogeneous Data Sources", Proc. of VLDB, 2001.

[mor] "The Morpheus homepage".

[nam] "Namespaces in XML", http://www.w3.org/TR/REC-xml-names/.

[NGU 01] NGUYEN B., ABITEBOUL S., COBENA G., PREDA M., "Monitoring XML Data on the Web", Proc. of ACM SIGMOD, 2001.

[Ozo 99] LAHIRI T., ABITEBOUL S., WIDOM J., "Ozone: Integrating Structured and Semistructured Data", Proc. Int. Workshop on Database Programming Languages, 1999.

[PAP 96] PAPAKONSTANTINOU Y., ABITEBOUL S., GARCIA-MOLINA H., "Object Fusion in Mediator Systems", Proc. of VLDB, 1996, p. 413–424.

[POW] POWELL J., MAXWELL T., "Integrating Office XP Smart Tags with the Microsoft .NET Platform".

[SOA] "Simple Object Access Protocol (SOAP) 1.1", http://www.w3.org/TR/SOAP/.

[T. 99] ÖZSU T., VALDURIEZ P., Principles of Distributed Database Systems, 2nd Edition, Prentice-Hall, 1999.

[TAT 01] TATARINOV I., IVES Z., LEVY A., WELD D., "Updating XML", Proc. of ACM SIGMOD, 2001.

[UDD] "Universal Description, Discovery, and Integration of Business for the Web (UDDI)", http://www.uddi.org/.

[ULL 89] ULLMAN J., Principles of Database and Knowledge Base Systems, Computer Science Press, 1989.

[w3c] "The World Wide Web Consortium (W3C)", http://www.w3.org/.

[WEI 01] WEIKUM G., Ed., Infrastructure for Advanced E-Services, vol. 24, no. 1, Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Mar. 2001.

[WID 96] WIDOM J., CERI S., Active Database Systems: Triggers and Rules for Advanced Database Processing, Morgan Kaufmann Publishers, 1996.

[WIE 93] WIEDERHOLD G., "Intelligent Integration of Information", Proc. of ACM SIGMOD, Washington, DC, May 1993, p. 434–437.

[WSD] "Web Services Description Language (WSDL)", http://www.w3.org/TR/wsdl.

[WSF] "Web Services Flow Language (WSFL 1.0)", available from IBM.

[XLA] "XLANG, Web Services for Business Process Design", Microsoft.

[XMLa] "Extensible Markup Language (XML) 1.0 (2nd Edition)", http://www.w3.org/TR/REC-xml.

[XMLb] "XML Schema", http://www.w3.org/XML/Schema.

[XPa] "XML Path Language (XPath) Version 1.0", http://www.w3.org/TR/xpath.

[XQu] "XQuery 1.0: An XML Query Language", http://www.w3.org/TR/xquery.


APPENDIX 1: Deterministic computation and termination

We consider here monotone services, defined essentially using conjunctive fragments of XQuery/XPath, and assume that any piece of information has an infinite validity. We next briefly revisit the various steps of the computation, and analyze what might be the sources of non-determinism in each step and what is required in order to avoid it.

XPath expressions The instantiations of an XPath expression may change over time, because some nodes on the way may have acquired more data, due to the activation of some service calls. Therefore, the particular point in time when the XPaths are evaluated may affect the result. To avoid this problem, we must require that each non-concrete call becomes eligible for instantiation and sending infinitely often (i.e., that the frequency attribute is such that the call is repeatedly triggered and its XPath parameters are re-instantiated).

XQueries Similarly, the evaluation of a service defined as a query also changes over time. To be more precise, since we are in a monotone context, it may grow larger as time passes. For continuous services, this is not a problem, because the query is repeatedly evaluated and all new results are eventually sent. In contrast, non-continuous services are executed just once. Thus, we must require that the query be computed only after the data it bears on has been acquired. We omit the formal definition of this notion here.

Exchanging data Peers exchange data via call parameters and results. The exchanged subtrees are extracted from a tree on one site and merged into a tree on the other side. Non-determinism arises when the extracted subtrees contain path expressions that attempt to go above the root of the subtree: these paths have different instantiations if evaluated before or after the transfer. To avoid this, the exchange of such data is not allowed.

The above conditions guarantee that any data that can be derived in one particularexecution sequence will also be eventually derived in any other sequence. Based onthis, we have:

Theorem 1 For AXML documents and services satisfying the above conditions, all computations lead to a unique state.

However, this state may be finite for some AXML documents and infinite for others. Using a reduction from the halting problem for Turing machines, one can show that:

Theorem 2 Given an initial instance, one cannot decide whether it has finite semantics.


To conclude this section, we mention a particularly simple case where finite se-mantics is guaranteed.

Layered services A lot of the complexity comes from recursion and from the use of XPath expressions. In practice, services are often defined using layers, with services calling only services in lower layers, and returning data containing calls to services in such lower layers. We will call AXML documents without XPath expressions XPath-free documents, and AXML services that only query such documents XPath-free services. One can show that, for XPath-free layered documents and services, the semantics is finite. In this context, it is simple to design evaluation strategies that (i) detect termination, and (ii) avoid re-triggering calls whose answer will not add new data.

APPENDIX 2: Extended XML syntax for service call elements

The following is a sample definition of a simple sc element that calls a corporate e-mail directory service (the namespace URI, attribute and parameter names below are illustrative):

    <directory xmlns:axml="http://www.example.org/axml">
      <axml:sc axml:serviceURL="http://www.corp.com/directoryService"
               axml:methodName="getEmailAddress"
               axml:mode="lazy"
               axml:frequency="once">
        <axml:params>
          <axml:param name="employeeName">
            <axml:xpath>../name/text()</axml:xpath>
          </axml:param>
        </axml:params>
      </axml:sc>
    </directory>

As is standard with XML, this definition is rather verbose. But typically, AXML developers will use GUIs, and users will not have to deal directly with this syntax. (We do not have such an interface yet, but are planning to develop one.)

To differentiate service calls from the rest of the standard XML data, and avoidnaming conflicts, we use a specific XML namespace [nam] for them (represented bythe axml prefix in the example).

The attributes of the sc element provide the necessary information to issue the call,using the SOAP protocol.


Efficient Data and Program Integration Using Binding Patterns

Ioana Manolescu*

Luc Bouganim*, **

Françoise Fabret*

Eric Simon*

* INRIA Rocquencourt France <Firstname.Lastname>@inria.fr

** PRISM Laboratory 78035 Versailles - France <Firstname.Lastname>@prism.uvsq.fr

Abstract: With the recent developments of the Web, we witness a shift from data integration systems to data and program integration applications, allowing communities of users to share their resources: data, programs, and computing power. This work investigates data and program integration in a fully distributed peer-to-peer mediation architecture. The challenge in making such a system succeed at a large scale is twofold. First, we need a simple concept for modeling resources. Second, we need efficient operators for distributed query execution, capable of efficiently handling costly computations and large data transfers. To model heterogeneous resources, we use the model of tables with binding patterns, simple yet powerful enough to capture data and programs. To exploit a resource with restricted binding patterns, we propose an efficient BindJoin operator, into which we build optimization techniques for minimizing large data transfers and costly computations. Furthermore, the proposed BindJoin operator delivers most of its output in the early stages of the execution, which is an important asset in a system meant for human interaction. Our experimental evaluation validates the proposed BindJoin algorithm on queries involving expensive programs, and shows that our BindJoin delivers the major part of the result early on during query execution.

Résumé : Le développement récent des applications Web fait apparaître le besoin d'intégrer non seulement des données distribuées, mais aussi des programmes. Ainsi, des communautés d'utilisateurs peuvent partager des ressources (données, programmes, et ressources de calcul) distribuées. Cet article se situe dans le domaine de l'intégration de données et de programmes dans une architecture de médiation décentralisée, de type "peer-to-peer". Dans ce contexte, un modèle simple est nécessaire pour la publication de ressources : nous utilisons le modèle de "table à restrictions d'accès", simple mais assez puissant pour modéliser des données et des programmes. L'exécution de requêtes distribuées mettant en jeu des calculs coûteux et des transferts de données volumineuses nécessite la définition de nouveaux opérateurs. Nous décrivons dans cet article l'opérateur "BindJoin", mettant en œuvre des techniques d'optimisation pour minimiser les transferts de données volumineuses et les calculs coûteux. De plus, l'opérateur proposé produit une grande partie de ses résultats dès le début de l'exécution, ce qui est une propriété importante dans un système destiné à des utilisateurs humains. Nous validons les performances de l'opérateur proposé à l'aide d'une série d'expérimentations.


1. Introduction

There is a growing interest in the scientific community in allowing disparate groups of users (a.k.a. virtual organizations) to share resources consisting of both data collections and programs. This vision is best reflected by recent initiatives such as the "Grid Computing" infrastructure [11], which aims at constructing a "meta computer": a large-scale, distributed computing environment providing transparent access to highly heterogeneous resources. A frequent application domain is that of international scientific cooperation, where remote laboratories share their data and programs in order to produce integrated, high-quality data.

LeSelect [20] is a mediator system developed at INRIA, which allows users to publish their resources (data, and functions corresponding to programs) so that they can be transparently accessed. In LeSelect, several distributed mediators (a.k.a. servers) cooperate in a peer-to-peer fashion to allow large-scale integration of data and functions. LeSelect is currently used in many earth-science cooperation projects like Thetis (coastal zone management) [8], Decair (air quality models) [8] and SIMBio (bio-corrosion monitoring) [9].

1.1. Motivating example

As an example of such projects, consider the following distributed scientific application. At site S1, satellite images have been processed into a map of the ozone cover of the French territory. At site S2, a survey of the traffic in the same area resulted in a set of records corresponding to the days when traffic was particularly intense. At site S3, a function OzoneLevels: img → level computes the set of distinct ozone density levels found in an ozone cover image. If a user at site S4 wants to match heavy traffic data from S2 with the days with low ozone levels in images found on S1, the OzoneLevels function on S3 needs to be invoked on images from S1. In this example, answering the user query will necessitate the manipulation of large data items (e.g., images) and potentially expensive function invocations.

This work investigates data and program integration in a fully distributed peer-to-peer mediation architecture. The challenge in making such a system succeed at a large scale is twofold. First, a publication model needs to be chosen for representing the published data (e.g., satellite images and high-traffic data) and functions (e.g., OzoneLevels). Simplicity is an important quality for this model, since the publisher is not supposed to be a computer scientist. Second, we need efficient operators for distributed query execution, capable of handling well costly computations and/or large data transfers. Costly computations arise with (i) expensive functions [3,5,16,17] like OzoneLevels, reflecting domain-specific knowledge and performing complex mathematical computations, (ii) accesses to Web sites [13], and (iii) correlated subqueries [16]. Large data transfers (e.g., satellite images) are necessary when functions only run on their native site, and cannot be shipped through the network. This is the case for the major part of scientific applications, in which the programs were written in isolation (without concern for their future integration in a larger setting) [3].

Figure 1. Sample configuration and query on distributed data and functions.

1.2. Objective and approach

Our general objective is to minimize the publisher's task while providing the best performance for queries involving expensive functions or large data. Our approach can be summarized in three points:

1. We base the modeling of our resources on the concept of table with binding patterns (introduced in [23] for a different purpose), and on the associated logical BindJoin operator, as follows. A binding pattern for a table R specifies which attributes of R must be supplied in order to obtain information from R. The logical BindJoin operator is a variant of the relational join operator, used to access tables with binding patterns. Binding patterns can naturally be used to model functions. We propose to use the same modeling for tables with large binary objects (hereafter called blobs).

2. We analyze the impact of considering expensive functions and blobs on the design of the BindJoin operator and on its integration in the query execution plan (QEP). First, the total work (TW) and response time (RT) of queries including blob transfers and expensive function calls must be reduced, by employing caching and asynchronism techniques. Second, we show the necessity to address the more specific early tuple output rate (ER): we are interested in a QEP that returns as many results as possible early on during query execution. Indeed, in a data integration setting like the one above, queries are asked by human users who wait for the result in front of their stations. Moreover, in many cases, several exploratory queries are asked before the user identifies the data segments he/she is really interested in. Therefore, users typically want to see at least part of the result as soon as possible; the same query pattern also appears in online decisional applications. We therefore identify a good ER as an important performance requirement for the execution of distributed queries like the one in our example.

3. We propose to include every optimization (caching, asynchronism and ER-specific optimizations) in the BindJoin operator. While this approach significantly complicates the design and implementation of the BindJoin operator, it reduces the publisher's task to a minimum. The publisher must only provide a call-based interface of the form callFunction(arguments) for functions and readBlob(blobName) for accessing blobs.

1.3. Contribution and Outline

Our contribution, following our objectives of publisher task minimization and performance maximization, is twofold:

- First, we show how the model of tables with binding patterns can be used to uniformly model data sources, including functions as well as blobs, and explain why this modeling provides benefits similar to those of semi-joins, without their drawbacks.

- Second, we propose, implement and assess the performance of a highly efficient BindJoin operator, which improves over the state-of-the-art algorithms by having much better ER properties.

This paper is organized as follows. In section 2, we show how the paradigm of tables with binding patterns can be used to describe data (including blobs) and functions, and present the associated logical BindJoin operator. We compare our approach with the classical semi-join technique. In section 3, we analyze the impact of considering expensive functions and blobs on the design of the BindJoin operator and on its integration in the QEP. We describe the optimizations that our BindJoin operator should include in order to provide a good ER behavior. Section 4 depicts the associated physical operators and presents the algorithms used in case of limited memory. Section 5 demonstrates the good ER of our BindJoin operator through a series of experiments. Related work is presented in section 6. We conclude in section 7.

2. Modeling and querying resources

In this section, we first describe the concept of table with binding patterns and the associated BindJoin and BindAccess operators. Then we explain how we use these concepts to model several classes of resources. Finally, we compare our approach with the classical semi-join technique.

2.1. The concept of table with binding patterns

Binding patterns [23] can be attached to a relational table to describe the restrictions that we encounter in accessing it. These restrictions may stem from confidentiality or performance reasons, or simply reflect the restricted nature of the resource: for example, web sources can be represented as virtual tables with a binding pattern where input parameters have to be provided in order to obtain the results.


A binding pattern bp for a table R(X1, X2, …, Xn) is a partial mapping from {X1, X2, …, Xn} to the alphabet {b, f}. The meaning of a binding pattern is the following: those Xi mapped to b are bound, i.e., their values must be supplied in order to obtain information from R, while the values of attributes mapped to f are free and can be obtained from the data source, as soon as values for all b attributes are supplied. If a binding pattern maps all attributes of R to f, then tuples of R can be obtained without any restriction (just like a usual Scan). For example, if it was possible to obtain the full data contained in a web Yellow Pages source of the form YP(name, address, phoneNo), this would be indicated by a YP(namef addressf phoneNof) binding pattern. On the contrary, YP(nameb addressf phoneNof) specifies that values of the name attribute have to be supplied in order to obtain addresses and phone numbers from YP.
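A binding pattern can be modeled directly in code; the sketch below (our own illustration, with hypothetical names) checks whether an access is legal, i.e., whether all b attributes are supplied:

    import java.util.Map;
    import java.util.Set;

    class BindingPattern {
        enum Adornment { B, F }  // b: value must be supplied; f: value is returned

        private final Map<String, Adornment> pattern;

        BindingPattern(Map<String, Adornment> pattern) { this.pattern = pattern; }

        // An access is allowed as soon as every bound attribute is supplied.
        boolean accessAllowed(Set<String> suppliedAttributes) {
            return pattern.entrySet().stream()
                    .filter(e -> e.getValue() == Adornment.B)
                    .allMatch(e -> suppliedAttributes.contains(e.getKey()));
        }
    }

    // e.g., YP(nameb addressf phoneNof):
    //   new BindingPattern(Map.of("name", Adornment.B,
    //                             "address", Adornment.F,
    //                             "phoneNo", Adornment.F))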

2.2. The BindJoin and BindAccess operators

The presence of access restrictions, formalized using binding patterns, makes the regular set of relational operators (select, project, join etc.) insufficient in order to answer queries.

The BindJoin operator: The standard relational join operator does not capture well the semantics of combining two tables if at least one has a restricted access pattern. To illustrate, consider a QEP fragment that joins YP(nameb addressf phoneNof) with an Employee(namef salaryf) table on their name field. Due to the commutativity of the standard join operator, we might try to write this fragment as Employee ⋈name YP, or as YP ⋈name Employee. However, the first variant is valid, while the second one is not: as described by the binding pattern of YP, we cannot start by accessing it before supplying some bindings for its name field. Thus, we adopt a variant of the relational join operator, the BindJoin logical operator (used, e.g., in [10,13]), to capture this asymmetric behavior: the right-hand child of a BindJoin operator cannot be executed on its own, since it depends on the join values passed across the BindJoin operator.

Note that the BindJoin operator can only be used to test equality predicates, since its semantics is to provide precise values to the right-hand input.

The BindAccess operator: Due to the semantics of the binding pattern YP(nameb addressf phoneNof), we cannot perform a scan on the YP table. Instead, we have to supply some values for the name attribute in order to get YP tuples. Furthermore, the set of tuples that we can extract from YP, following this binding pattern, depends on the values that we supply for name (in contrast, the result of a Scan is always the same). To capture the special semantics of a restricted access, we use a special BindAccess operator. As an intuition, think of the BindAccess as a "parameterized Scan", where the parameters are the values supplied for the bound attributes.


The formal semantics of BindAccess and BindJoin can be specified as follows. Consider two tables R(X, Y) and S(U, V), where X, Y, U and V are pairwise disjoint sets of variables. Let R(Xb Yf) and S(Uf Vf) be binding patterns of R and S, and let χ be a set of values for X. Then, denoting the BindAccess by BA, we have: BA(R(Xb Yf), χ) = σX∈χ R(X, Y).

Furthermore, the BindJoin of Scan(S(Uf Vf)) and BA(R(Xb Yf)) has the following semantics:

Scan(S(Uf Vf)) ⋈U=X BA(R(Xb Yf)) = {(u, v, x, y) | (u, v) ∈ S ∧ (x, y) ∈ R ∧ u = x}
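Over in-memory data, these two semantics can be sketched as follows (a simplification with single bound and free attributes; all names are ours):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class RestrictedAccessSemantics {
        // BA(R(Xb Yf), chi) = { (x, y) in R | x in chi }; R is indexed by x.
        static List<String[]> bindAccess(Map<String, List<String>> r, List<String> chi) {
            List<String[]> out = new ArrayList<>();
            for (String x : chi)
                for (String y : r.getOrDefault(x, List.of()))
                    out.add(new String[] { x, y });
            return out;
        }

        // Scan(S) BindJoin_{U=X} BA(R): each (u, v) of S supplies u as the
        // binding, and is concatenated with the matching tuples of R.
        static List<String[]> bindJoin(List<String[]> s, Map<String, List<String>> r) {
            List<String[]> out = new ArrayList<>();
            for (String[] uv : s)
                for (String y : r.getOrDefault(uv[0], List.of()))
                    out.add(new String[] { uv[0], uv[1], uv[0], y }); // (u, v, x, y), u = x
            return out;
        }
    }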

While this formula shows that the set of tuples returned by a BindJoin is similar to the one returned by a regular join, keep in mind that the similarity stops here. Indeed, neither the optimization techniques for join queries nor the join operator algorithms can be directly reused for two distinct reasons: First, the asymmetric character of the BindJoin greatly impacts the optimizer's search space (see [10] for query optimization with regular joins and BindJoins). Second, BindJoins often involve costly computations, which have deep implications on the operator algorithm (see section 3).

2.3. Modeling resources using tables with binding patterns

Modeling functions: A function is naturally represented as a table, whose binding pattern distinguishes the attributes that correspond to function arguments (which need to be supplied in the query) from the function results. For example, the binding pattern of the OzoneLevels function is OzoneLevels(imgb levelf).

We propose a specific modeling for data resources involving blobs, in order to optimize their transfer through the network. This modeling imposes some requirements on the publication of a table with blobs: for every blob attribute B of table R, R must also contain a small-sized attribute Bid that determines the value of B, i.e., such that the functional dependency Bid → B holds. Furthermore, among the binding patterns of R, we require the presence of R(Bidb Bf). As an example, assume that the id field in the SatImg table determines the img field. Then, the binding pattern set for SatImg must at least contain SatImg(idb imgf). BlobIDs are system-generated in the case of published data residing in a DBMS. In a simpler setting, a blob is usually stored in a separate file, whose complete name (i.e., host/filename) can be used as a blobID.

The purpose of requiring a blob identifier is to reduce blob transfers to a minimum. We transfer identifiers instead of blobs whenever the blobs themselves are not needed; blob identifiers also enable us to transfer each necessary blob only once, as follows. First, when a set of blob transfers is necessary, by comparing the identifiers of two blobs we can decide whether or not they are the same, and if so, make the transfer only once. Second, if in a given QEP several selections and/or joins apply on the tuples containing blob identifiers, we avoid transferring blobs (for some further processing) once we know that some of them are not necessary (eliminated by joins or selections).

Unifying function calls and blob transfers: At this point, the similarity between calling a function and getting a blob becomes evident. Both are modeled by an access to a resource following a restricted binding pattern. Both are expensive operations, suggesting the use of a cache for function results and blobs. Thus, in the following, we will no longer make the distinction between the two: they are accessed using the same operators (section 3), and the same techniques apply.
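This unification is visible in code: both kinds of access are an expensive mapping from bound values to results, so a single cache wrapper covers both. A sketch with hypothetical names (readBlob and ozoneLevels stand for the published call-based interfaces):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    class CachedRestrictedAccess<K, V> {
        private final Map<K, V> cache = new HashMap<>();
        private final Function<K, V> access;  // function call or blob fetch

        CachedRestrictedAccess(Function<K, V> access) { this.access = access; }

        // The expensive access is performed at most once per bound value.
        V get(K boundValue) {
            return cache.computeIfAbsent(boundValue, access);
        }
    }

    // e.g. new CachedRestrictedAccess<String, byte[]>(id -> readBlob(id))
    //      new CachedRestrictedAccess<String, int[]>(img -> ozoneLevels(img))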

Using the BindJoin and BindAccess operators: Below, we show the tables with binding patterns corresponding to the scientific data integration scenario that we presented, followed by the user query at site S4, written in SQL:

S1: SatImg(id, date, img); binding patterns SatImg(idf datef), SatImg(idb imgf)
S2: HighTraffic(date); binding pattern HighTraffic(datef)
S3: OzoneLevels(img, level); binding pattern OzoneLevels(imgb levelf)

Select i.img, i.date, h.date, ol(i.img)
From S1:SatImg i, S2:HighTraffic h, S3:OzoneLevels ol
Where (i.date >= h.date) and (i.date < h.date+3) and (ol(i.img) < 45)

Figure 2 shows one possible QEP for evaluating this query; we circled together operators successively executed on the same site, while blob transfer edges are shown in thick lines. The bottom join operator correlates the dates from the SatImg and HighTraffic sources. Due to the join condition, some tuples from SatImg may be paired with several tuples from HighTraffic; other SatImg tuples are eliminated by the join. Then, images that survived the join predicate are fetched (and cached locally) at site S3 by a BindJoin, and a second BindJoin invokes OzoneLevels on them. We then project out the images, and apply the selection on the result of the OzoneLevels function; this selection eliminates part of the tuples, and thus some image identifiers. The last BindJoin retrieves at site S4 only the images corresponding to identifiers that survived both the join and the selection predicate. Note that we perform two BindJoins with the table SatImg(idb imgf), corresponding to the two unavoidable transfers: from S1 to S3, and from S1 to S4. In both cases, we only transfer the images that are actually needed on the destination sites.

2.4. Comparison with semi-join technique

A well-known method for achieving the performance gains of the QEP in figure 2 consists of optimizing with semi-joins, introduced in [2]. The QEP in figure 3 uses semi-joins and is equivalent, in terms of blob transfers and function calls, to the one in figure 2 (if all BindJoins appearing in figure 2 use a cache). The first semi-join between SatImg and HighTraffic ensures that only those SatImg tuples with matches in HighTraffic are sent to S3, and the function is called only on those tuples. The remaining tasks are: (i) join again with SatImg and HighTraffic, to get the full join result, and (ii) transfer exactly once the useful images from S1 to S4. To solve (i) without transferring the images from S3 to S2 and to S4, we project out the image, and apply the two joins shown on the right branch after the "Y" point in the QEP. To solve (ii), we apply a duplicate-free projection on the left branch, denoted π0id, and by a final join send the images to S4. Note that (i) and (ii) could not be performed one after another; therefore, we need to materialize the results of the projection after OzoneLevels, hence the "Y"-shaped QEP.

Figure 2. QEP for the sample query using BindJoins

Figure 3. QEP for the sample query using semi-joins

This semi-join solution has several drawbacks when compared with the one in figure 2. First, it has a significant overhead: for example, the SatImg-HighTraffic join is applied twice, and HighTraffic is shipped twice to S1. A recent study on semi-joins [27] motivates their interest by the presence of replicas (e.g., using a replica of HighTraffic on S1). However, in a loosely coupled integration system, data owners typically do not interact and do not maintain replicas. Also, the materialization at the "Y" point hinders pipelining, which is particularly valuable in our context (see section 3). Finally, optimization with semi-joins is still quite complex. Instead, we adopted a lighter approach: assuming only that our BindJoin operator uses a cache, our modeling provides the advantages of semi-joins, without their drawbacks, for the specific redundant operations that we consider.

3. Designing an efficient BindJoin operator

In this section, we address the design of a physical BindJoin that helps meeting our performance requirements for the execution of distributed queries involving blobs and/or expensive functions.

Traditionally, query processing performance is assessed using three measures [19]: total work (TW), including all processing and data transfer costs; response time (RT), measuring the time elapsed until the query result has been completely received on the query site; and time to the first tuple (FT), accounting only for the time elapsed until the result starts arriving. Note that in fact, the FT metric typically accounts for the early tuple output rate (also called "low latency" in [18]): the property of a QEP to produce as many results as possible early on during query execution. Rather than FT, we use the more expressive "early output rate" (ER) term to designate this property. As mentioned in section 1, ER is an important performance goal in the context that we consider, and is thus detailed further in the next sections.

We propose a new physical BindJoin that helps reduce the TW and RT of distributed queries with blobs and expensive functions; the main innovation of this operator is its good ER, significantly improved over the state-of-the-art algorithms (as we show in section 5). However, an efficient operator would be useless if it did not integrate well with (or did not take advantage of) standard query execution and optimization techniques.

This section demonstrates both our specific contribution in the design of the operator and its good integration with existing techniques. In section 3.1, we introduce some simple notations to support our exposition. We then discuss both local optimizations (i.e., those that can be applied at the operator level and thus included in its design) and global optimizations (i.e., those that must be decided at the QEP level). Sections 3.2 and 3.3 show which of the existing optimizations for reducing TW, respectively RT, can be combined with our BindJoin. Section 3.4 is specific to our contribution, as it presents the special techniques we employ to give our BindJoin operator a good ER behavior (local optimization).

3.1. The BindJoin operator: notations and basic principle

We use capital letters, e.g., X, Y,… , for attribute names, and corresponding lower case letters, like x, y,… for attribute values. We consider a BindJoin operator which receives from its left-hand child operator, denoted q in figure 4, tuples of the form (X, Z), and uses the X arguments to access the resource R, following its binding pattern R(XbYf). The BindAccess operator returns (x, y) tuples for each x, and the BindJoin concatenates these tuples with the z attribute, not needed to access R (figure 4).

Figure 4. The BindJoin and BindAccess operators


3.2. Reducing total work (TW)

Local optimization: Reducing total work at the level of the BindJoin operator can be done by using caching techniques, as suggested in [16]. Caching is profitable for function evaluation and blob transfers as soon as (i) retrieving a function result (respectively, a blob) from the cache is less expensive than computing the result (respectively, transferring the blob), and (ii) there are duplicate values in the input table. Consider a tuple (x, z) coming from q. If the y values associated with x are stored in a cache, the tuple enriched with y can be output directly, short-circuiting the BindAccess operator. Otherwise, a (potentially expensive) access to the restricted resource R is made with x as an argument. We therefore decide to include a cache in our BindJoin operator. Several caching techniques have been proposed, e.g., in [16].

Global optimization: Global optimization of queries with expensive functions has been extensively studied [6, 7, 17, 22, 26] (see section 6). Due to the modeling we propose for tables with blobs (based on binding patterns), these optimization results apply to blob transfers as well as to function calls. Query optimization in the context of tables with binding patterns is addressed in [10]. As we explain in section 6, this context is more general, as it allows the query optimizer to also consider caching the results of subqueries, which can lead to significant performance gains.

3.3. Reducing response time (RT)

Running several tasks in parallel may reduce response time, provided that the tasks consume distinct resources. This simple principle applies at the local level, i.e., intra-operator parallelism, and at the global level, i.e., inter-operator parallelism.

Local optimization: Our interest in intra-operator parallelism is restricted to the costlier operations, namely function invocation and blob transfer. Performing several function calls or blob transfers in parallel allows one to fully exploit the query processing capacity (respectively, the network transfer capacity) [13]. This can be useful when several processors are available to compute a function, or when the same blobs exist on several source sites. Our physical BindJoin operator is designed to include such intra-operator parallelism; due to space limitations, this aspect is relegated to the extended version of this article [21].

Global optimization: Inter-operator parallelism is undoubtedly interesting in a distributed QEP including several BindJoins, especially if they run on distinct sites, but also within a single site, if one BindJoin is network-bound (e.g., blob fetching) while the other is CPU- or I/O-bound. Independent parallelism means there is no producer-consumer relationship between the operators that run in parallel; pipeline parallelism is its counterpart for producer-consumer operator pairs. De-synchronizing the BindJoin from its neighbor operators (p and q in figure 4) allows such inter-operator parallelism. As a consequence, during the execution, a BindJoin may accumulate tuples waiting to be processed, if q outputs tuples faster than the BindJoin can consume them. Conversely, it may also accumulate result tuples, input for a slower parent operator.

3.4. Improving ER behavior

The BindJoin informally described so far is pipelined, and treats tuples from q in the order of their arrival. At the beginning of the execution of a query, the cache is empty, and most tuples take a long time to get through the BindJoin, corresponding to an access to the restricted resource. As the execution continues, the cache progressively fills up, so tuples are probably output at a faster rate towards the end. Thus, the early tuple output rate is likely to be small, and most tuples are output in the last stages of the execution. In figure 5(a), we show some values of X, in the order in which they are found in q's tuples; for simplicity, we ignore the associated Z values, and consider that there is one Y result for every X. Assuming all values are processed in the same amount of time, and that cache access is very fast, the tuples output by the BindJoin during query execution are depicted in figure 5(b).

Figure 5. (a) Sequence of X values in q's output; (b), (c), (d) corresponding tuple output rates of the BindJoin.

Local optimization: If the child operator of the BindJoin (denoted q in figure 4) has a good ER behavior, it may happen that several tuples output by q accumulate in the BindJoin, waiting to be processed. These waiting tuples can be divided into two categories: those for which the result has already been computed and is in the cache, and those for which we need to access the restricted resource in order to get the result. If waiting tuples are processed in the order of their arrival, a tuple τ from the first category can only be output after the processing of all tuples ρ from the second category such that ρ arrived before τ. However, tuples like τ and ρ could very well be processed in parallel, since they have distinct requirements: to output τ, it is enough to access the cache, while for ρ we need to access the restricted resource. This strategy leads to the output rates in figure 5(c) (assuming that at t=0 all 8 tuples were already output by q, waiting to be processed by the BindJoin). The output rate is clearly improved for the early stages of the execution.

Finally, when selecting the next x value to be processed, we may choose the most advantageous x value with respect to the ER behavior. A good idea is to choose the x value corresponding to the currently largest number of (x, z) tuples waiting to be processed; we term this value the most frequent. The advantage of choosing the most frequent x value is obvious: when the restricted access to R using x is finished, a large number of (x, y, z) tuples corresponding to x can be sent simultaneously to the output, improving even more the ER of the BindJoin. This behavior is reflected in figure 5(d).

Global optimization: With respect to the global ER of a QEP, we make the following remarks. (i) The delay incurred by a single blocking operator in a QEP is a direct loss for the QEP's ER; therefore, when ER is a concern, pipelined operators are required. (ii) The ER of an operator is a measure relative to the rate of its input operators: of course, if the child of operator op is blocking and does not output any tuple in the first minute of query execution, there is little that op can do to improve the query's ER. As a consequence, op has a good ER behavior if its ER is as good as it can be with respect to the ER of its input operators. An obvious example of poor ER behavior is a duplicate elimination operator that uses sorting, and thus needs to materialize all its inputs before any tuple is output. An alternative implementation, based on memoization, is likely to perform much better in terms of ER. (iii) The good ER behaviors of several operators in a QEP reinforce each other: if a leaf operator (e.g., Scan) has a high ER, and if its parent is able to exploit it, then the ER of the QEP rooted at the parent is high, too. Thus, a high early tuple rate propagates upward in the QEP, up to the topmost operator, whose ER directly benefits the user. In this paper we propose a BindJoin operator optimized for ER; relational operators for reordering [24], join [15,18,30] or aggregation [15] with good ER behavior have already been described, and should be used together with our BindJoin.

While a global optimization strategy [6, 7, 10, 17, 22, 26] will order BindJoins after selections (in order to reduce the number of tuples on which an expensive operation is performed), it may place BindJoins after join operators, as long as a cache is used in the BindJoin. Indeed, joins can produce many tuples, but will never produce new values, and in particular no new values for the BindJoin's inputs. This remark increases the interest of the ER optimizations we propose, since duplicates are more likely to appear in the input of a BindJoin as a consequence of a join operator.

Finally, one should note that ER optimization may also reduce response time, by absorbing synchronization problems in case of pipeline parallelism between two BindJoin operators (section 5.3 discusses a detailed example).


4. Operator implementation

This section describes the physical operators that implement the BindJoin and BindAccess operators, and explains how caching, asynchronism and ER optimizations are incorporated within these operators. The internal data structure used for these optimizations is described in section 4.2. Section 4.3 explains how our BindJoin algorithm deals with limited-memory execution environments.

4.1. Physical operators for BindJoin and BindAccess

We now describe the physical operators for the BindJoin and BindAccess, implemented as iterators, following [14]. The API of an iterator consists of an initialization open() method, a next() method producing one tuple at a time, and a close() method to release resources and terminate.

Figure 6 depicts our proposed decomposition of the logical BindJoin and BindAccess operators into physical operators, implementing the techniques described in the previous section. A single data structure, belonging to the BindJoin, is used to hold (a) tuples waiting to be processed, (b) a result cache and (c) processed tuples waiting to be output.

Figure 6. Physical operators for a QEP fragment including a BindJoin and a BindAccess on the same site

Physical operators for the BindJoin: We decompose the BindJoin in four physical operators, termed BJStore, BJCompute, BJGetBindings and BJGet in figure 6.

1. BJStore retrieves (x, z) tuples from q and checks x against the cache. If x is present in the cache, then BJStore inserts the tuple in the set of processed tuples; otherwise, in the set of waiting tuples.

2. On a next call from the BindAccess, BJGetBindings chooses the next x value to be processed and returns it.

3. BJCompute retrieves (x, y) tuples from the BindAccess, and updates the cache and the set of processed tuples accordingly.


4. Finally, BJGet answers next() calls from p, the operator above the BindJoin in the QEP, returning an (x,y,z) tuple to p. This tuple is erased from the set of tuples waiting to be output.

Physical BindAccess operator: We use a single physical BindAccess operator, as shown in figure 6, to implement the logical BindAccess. This operator obtains x values from BJGetBindings, and performs a call to the restricted resource, providing x as an argument. On a next() call from BJCompute, BindAccess returns an (x, y) tuple. Thus, the access to the resource is encapsulated within BindAccess; this operator is generic and can be provided by the integration system, making the publication process easy for the resource owner. The only thing required to "plug" a BindAccess on a given restricted resource is a call interface to that resource (e.g., callFunction(arg1, arg2, …, argk) for functions and readBlob(blobID, startOffset, endOffset, memBuffer) for accessing blobs).
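As an illustration of this decomposition, here is a sketch of the iterator API of [14] and of BJStore's role (the shared structure and all names are our own simplification, not the system's actual code):

    class Tuple {
        final Object x, z;          // (x, z) as produced by q
        Tuple(Object x, Object z) { this.x = x; this.z = z; }
    }

    interface TupleIterator {
        void open();
        Tuple next();               // null when exhausted
        void close();
    }

    class BJStore {
        private final BindJoinStructure store;  // cache + waiting/processed sets
        private final TupleIterator q;          // left-hand child of the BindJoin

        BJStore(BindJoinStructure store, TupleIterator q) {
            this.store = store;
            this.q = q;
        }

        // Consume one tuple from q and route it according to the cache.
        void pump() {
            Tuple t = q.next();
            if (t == null) return;
            if (store.cacheContains(t.x)) store.addProcessed(t);  // y already known
            else store.addWaiting(t);   // needs a restricted access later
        }
    }

    interface BindJoinStructure {
        boolean cacheContains(Object x);
        void addProcessed(Tuple t);
        void addWaiting(Tuple t);
    }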

Operator synchronization: In figure 6, we represented the case where the BindJoin and the BindAccess operators run on the same site. In this case, BJCompute, BindAccess and BJGetBindings run synchronously, since there is no gain in parallelizing these operators within a single site. However, note that p and BJGet, on one hand, and q and BJStore, on the other hand, run in parallel with these three operators. Decoupling in this way the execution of the BindJoin-BindAccess pair from the rest of the QEP allows for inter-operator parallelism.

4.2. Organization of the BindJoin's internal data structure

The data structure that we use to hold the data internal to the BindJoin operator is depicted in figure 7. This structure is basically organized as a hash table. Every hash bucket contains a set of cells (shown in gray); one cell corresponds to a given value for x, the argument value for accessing the restricted resource. Within one bucket, cells are organized in a linked list, shown by the thin arrows.

Figure 7. Outline of the BindJoin data structure.


Two extra data structures are maintained among the cells in the hash table. First, a doubly-linked frequency list connects the cells corresponding to x values not yet in the cache, in order of their frequency. Second, we also keep a processed tuple set, containing the cells for which x is already in the cache and for which some (x,y,z) tuples have been produced but not yet output.

At the bottom of figure 7, we detail the internal structure of a cell. It stores the x value, its frequency, the associated z attributes, and the y results corresponding to x. Finally, it contains the pointers linking it to the next cell in the same bucket, and to the next and previous cells in x frequency order. While a cell frequency update may entail a reordering of the frequency-based doubly-linked list, this reordering is local: a cell may switch places with its neighbor, without requiring a full list traversal. A detailed API of the data structure is provided in [21].
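A hedged Java sketch of one cell and of the local frequency-list reordering could look as follows; the field names are our own reading of figure 7, the actual API being given in [21].

    import java.util.ArrayList;
    import java.util.List;

    // One cell of the BindJoin data structure (our reading of figure 7).
    class Cell {
        Object x;                          // argument value for the restricted resource
        int frequency;                     // how many (x, z) tuples carry this x
        List<Object> zValues = new ArrayList<>();   // z attributes of pending tuples
        List<Object> yResults = new ArrayList<>();  // results for x (empty until computed)
        Cell nextInBucket;                 // hash-bucket chaining (thin arrows)
        Cell moreFrequent, lessFrequent;   // doubly-linked frequency list

        /** After a frequency increment, swap locally with neighbors; no full traversal. */
        void promote() {
            while (moreFrequent != null && moreFrequent.frequency < frequency) {
                Cell n = moreFrequent;                 // list order: ... n, this ...
                Cell above = n.moreFrequent, below = lessFrequent;
                n.moreFrequent = this;  n.lessFrequent = below;
                moreFrequent = above;   lessFrequent = n;
                if (above != null) above.lessFrequent = this;
                if (below != null) below.moreFrequent = n;
            }
        }
    }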

Note that for waiting as well as for processed tuples, the storage required in the data structure is reduced, since an x or y value is stored only once for all the tuples in which it appears. However, x and y values accumulate throughout the execution, which may exhaust the available memory; we address this issue in the next section.

4.3. BindJoin behavior in the presence of limited memory

The techniques presented so far assume that the data structure fits in memory, which may not be the case. Note that if X, Y or Z are blobs, their storage is delegated to a special BlobManager component [21], which flushes them to disk; in this case, the data structure contains the blobID, not the blob. Since blob transfers are performed by (caching) BindJoins, a blob is never transmitted twice by the same operator, and thus will not be repeatedly loaded in memory and flushed to disk.

There is, however, a risk of memory overflow even if X, Y and Z are not blob attributes. In this case, our goal is to preserve as much as possible the BindJoin's good ER properties. To that purpose, we attempt to keep all (X,Y) pairs in memory, so that we can process in parallel tuples with new values and tuples for which the results are in the cache.

The BindJoin's memory consumption due to the internal data structure has a continuously increasing component, the (X,Y) cache, and a variable component, due to (X,Z) tuples waiting to be processed or (X,Y,Z) tuples waiting to be output. We distinguish four execution phases, corresponding to four memory states, as sketched below. (i) In the Init phase, there is enough memory for the whole data structure. (ii) In the Limited phase, all the (X,Y) pairs produced so far still fit in memory, but there is no room left to store the Z values. (iii) In the Saturated phase, there is no room left to produce new (X,Y) pairs, so some data has to be temporarily written to disk. (iv) Finally, during the Cleanup phase, the data previously flushed to disk is processed.
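The phase transitions can be summarized by a small state machine; the following Java sketch is illustrative only, its boolean conditions paraphrasing the description above.

    // Illustrative state machine for the four memory phases described above.
    enum MemoryPhase { INIT, LIMITED, SATURATED, CLEANUP }

    class PhaseTracker {
        MemoryPhase phase = MemoryPhase.INIT;

        /** Advance the phase; conditions are assumptions mirroring the text. */
        void advance(boolean wholeStructureFits, boolean newXYPairsFit, boolean inputDone) {
            switch (phase) {
                case INIT:      if (!wholeStructureFits) phase = MemoryPhase.LIMITED;   break;
                case LIMITED:   if (!newXYPairsFit)      phase = MemoryPhase.SATURATED; break;
                case SATURATED: if (inputDone)           phase = MemoryPhase.CLEANUP;   break;
                case CLEANUP:   break;   // terminal: drain the data flushed to disk
            }
        }
    }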


Figure 8. BindJoin behavior in the presence of memory limitations.

The data structure and the behavior of the BindJoin's physical operators are depicted in figure 8. We represent the three types of information stored in the data structure (X, Y, and Z values) in separate areas. Graphically, the Y and Z values appearing "under" a given X value are associated with it, in the sense that (X,Z) tuples carrying these values were received from q and Y is the value obtained by accessing the resource with argument X. From left to right in the data structure, there are fewer and fewer Z values per X value (in the terms of the previous section, the left-to-right order reflects the decreasing frequency of X). The thick arrows designate the tuples inserted into or extracted from the data structure by the physical operators.

The Init phase: At the beginning of the execution, BJStore inserts (X,Z) tuples and BJGetBindings picks the most frequent X not yet processed. BJCompute inserts into the data structure the Y results (when they become available) for this X, while BJGet extracts (X,Y,Z) tuples. When memory runs out, we enter the Limited phase.

The Limited phase: In this phase, we first flush to disk all (X,Z) pairs obtained so far (whether Y has been computed or not), in decreasing order of X frequency. Whenever it obtains a new (X,Z) tuple, BJStore inserts X in the data structure and sends (X,Z) to the "XZ output buffer", to be written to the temporary FIFO file F1. Note that we need to store (X,Z) pairs, not just Z values, in order to be able to re-compute the tuples. BJGet outputs tuples from a buffer of data sequentially read from F1; the first such buffer brought into memory corresponds to the most frequent X value, for which the result has probably already been computed. For each (X,Z) tuple read, if the result is already in the memory cache, BJGet outputs the (X,Y,Z) tuple; otherwise, it waits for the result to become available. When the available memory becomes insufficient for storing the (X,Y) pairs, we enter the Saturated phase.

The Saturated phase: In this phase, whenever BJStore obtains a new (X,Z) tuple, if X is already in memory, its frequency counter is updated and (X,Z) is sent to F1. Otherwise, we apply the hash function H(X) to distribute the (X,Z) tuple into the disk buckets denoted H1, H2 and H3 at the right of figure 8. This phase ends when BJStore encounters EOF.


The Cleanup phase: At this point, BJGetBindings reads F1 page by page, the corresponding Y results are taken from the cache or computed, and the (X,Y,Z) tuples are output. When F1 is exhausted, the current (X,Y) cache can be discarded entirely, since no tuple on disk has an X value among those in the cache. Then, the partitions made by the function H are loaded into memory one by one, and the tuples inside are processed as in the Init phase. To that purpose, the partitioning function H is chosen so that each partition fits in memory, by a technique similar to the one proposed in [16].

Note that our BindJoin's behavior in the case of limited memory incurs minimal overhead; in particular, no tuple is written to disk more than once during the processing.

5. Experimental evaluation of the BindJoin ER

This section compares the ER of our BindJoin operator with that provided by the state-of-the-art algorithms for handling expensive functions. We use a combination of implementation and simulation.

5.1. Experimental platform

We experimented with several algorithms, QEPs, data distribution parameters, data delivery rate and tuple orderings, in order to capture many possible execution scenarios. For space reasons, we only include here the most significant ones, and comment on some others (section 5.4). More results are described in [21].

BindJoin algorithms: We compare our BindJoin physical operator (denoted ERBJ) with two algorithms previously proposed for handling expensive functions [5, 16]. The simplest one uses a hash-based (memoization) cache and is denoted HBBJ. The second one is sort-based (SBBJ): before accessing an expensive resource, the arguments are materialized and sorted, so that a cache of just one value is sufficient. To ensure a fair comparison, HBBJ is de-synchronized from its parent and child operators through the standard Exchange operator [14].
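For concreteness, here is a hedged Java sketch of the hash-based memoization strategy underlying HBBJ (after [16]); it processes tuples strictly in arrival order, which is precisely what the ER-oriented reordering of ERBJ avoids. Names are illustrative.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Hash-based (memoization) BindJoin: call f only on a cache miss,
    // and process tuples strictly in arrival order.
    class HBBJ {
        private final Map<Object, Object> cache = new HashMap<>();  // x -> y
        private final Function<Object, Object> f;                   // expensive resource

        HBBJ(Function<Object, Object> f) { this.f = f; }

        /** Join one (x, z) input pair into an (x, y, z) output tuple. */
        Object[] process(Object x, Object z) {
            Object y = cache.computeIfAbsent(x, f);  // f is invoked on misses only
            return new Object[] { x, y, z };
        }
    }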

QEPs used in our measures: We study the tuple output rate from the following three QEPs:

- QEP1: q(X, Z) ⋈ BA(f(X^b, Y^f))

- QEP2: q(X, Z) ⋈ BA(f(X^b, Y^f)) ⋈ BA(g(Z^b, T^f))

- QEP3: σ_s( q(X, Z) ⋈ BA(f(X^b, Y^f)) ) ⋈ BA(g(Z^b, T^f))

(where ^b marks a bound attribute of a BindAccess and ^f a free one)

X, Y, Z and T are integer attributes, q is a given QEP producing tuples of the form (X,Z), and s is a selection condition on the result of function f. While very simple, such QEPs are quite general, as the subplan q may be arbitrarily complex. Indeed, as mentioned in section 3.4, global optimization techniques [6, 17, 26]


would order relational operators before expensive BindJoins. QEP3 is representative of plans with restrictions on the function results.

Characteristics of the tuples produced by q: We use a data generator which constructs the set of (X,Z) tuples output by q according to a set of parameters: (i) the number of tuples; (ii) the number of distinct values for each attribute; (iii) the distribution law, assuming that the distributions of the attributes are mutually independent; and (iv) the rate at which tuples can be obtained from q. This rate is important, since an ERBJ can only accumulate tuples if the tuple input rate is larger than its tuple processing rate. Finally, the data generator can also deliver the tuples in specific orders. When generating X and Z according to a uniform data distribution, we did not enforce perfect uniformity; rather, we used a uniformly distributed random variable to draw 10,000 values out of 2,500 possible ones, yielding 2,450 distinct values. For Zipfian distributions, we used a low Zipf factor (α = 0.2), representative of real-life database distributions, delivering a total of 1,450 distinct values in 10,000 tuples.
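As an indication of how such skewed inputs can be produced, here is a hedged Java sketch of a Zipf-like generator; the generator actually used is described in [21], and the parameter names are ours. With a low α such as 0.2, the draw remains only mildly skewed, in line with the distributions described above.

    import java.util.Random;

    // Zipf-like value generator: weight of the i-th most frequent value
    // is proportional to 1 / (i + 1)^alpha.
    class ZipfGenerator {
        private final double[] cdf;          // cumulative probabilities
        private final Random rnd = new Random(42);

        ZipfGenerator(int distinctValues, double alpha) {
            double[] w = new double[distinctValues];
            double sum = 0;
            for (int i = 0; i < distinctValues; i++) {
                w[i] = 1.0 / Math.pow(i + 1, alpha);
                sum += w[i];
            }
            cdf = new double[distinctValues];
            double acc = 0;
            for (int i = 0; i < distinctValues; i++) {
                acc += w[i] / sum;
                cdf[i] = acc;
            }
        }

        /** Draw one value index, skewed toward the most frequent ones. */
        int next() {
            double u = rnd.nextDouble();
            for (int i = 0; i < cdf.length; i++) if (u <= cdf[i]) return i;
            return cdf.length - 1;
        }
    }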

Graphs: Each graph presented in the sequel shows the number of result tuples as a function of the running time. Thus, the response time is indicated by the width of the curve, while a good ER behavior shows as a steep early rise of the curve.

5.2. ER behavior of the BindJoin on simple query plans

Our first four experiments study the ER behavior of QEP1 and QEP2, in which all BindJoins are implemented as HBBJ, SBBJ, and ERBJ, when q's output follows a uniform, respectively Zipfian, data distribution (see figures 9-12). In this section, we assume that all tuples output by q have been transmitted to its parent operator before the BindJoins start to run. This assumption will be lifted in the next sections; it is however quite realistic if q consists only of regular relational operations, while the accesses to f and g are much more expensive (this configuration is also considered in [3, 6, 13]).

Hash-based BindJoin: In experiments 1-4, HBBJ delivers few tuples at the beginning, since most X values it processes are new: f (or g) must actually be called. As the cache fills, toward the end of the execution, the output rate increases. With a Zipfian distribution, the ER behavior is slightly improved since some values are very frequent.

Sort-based BindJoin: In experiment 1 (figure 9), the curve for SBBJ is almost a straight line: after each processed value, SBBJ outputs on average 4 tuples (2,450 distinct values in 10,000 tuples). With the Zipfian distribution, in experiment 2 (figure 10), SBBJ outputs bursts of tuples corresponding to the groups of tuples sharing the same X value. Unfortunately, SBBJ is not able to exploit the most frequent values early, since they are encountered somewhere in the middle of the sorted value range. In the case of QEP2 (figures 11 and 12), the SBBJ output is delayed until the first


BindJoin has finished, since the tuples have to be re-sorted on Z before the second BindJoin can start. The overall ER of the plan using SBBJ is poor¹.

Early rate BindJoin: In figure 9, ERBJ does slightly better than SBBJ, since it chooses the most frequent values first, even if the frequency variation is very small in uniform distributions (in our randomly generated distribution, the frequency varies from 3 to 6, with an average of 4). With Zipfian distributions (figures 10 and 12), ERBJ exploits the presence of the few very popular values (characteristic of Zipf) by outputting a very large batch of tuples right at the beginning of the execution.

Figure 9. Exp.1: one BindJoin, uniform distribution

Figure 10. Exp.2: one BindJoin, Zipfian distribution

Figure 11. Exp.3: two BindJoins, uniform distribution

Figure 12. Exp.4: two BindJoins, Zipfian distribution

1 In the specific case of QEP2, it would be better to use independent parallelism; however, this can lead to increased total work, because accessing the first restricted resource (BindJoin1) may suppress some tuples and thus reduce the number of accesses to the second resource.

[Figures 9-12 plot the number of output tuples (x1000) against running time, comparing ERBJ, HBBJ and SBBJ. Panel titles: QEP 1, 1 BindJoin, uniform distribution, cost of f = 20 (figure 9); QEP 1, 1 BindJoin, Zipfian distribution, cost of f = 20 (figure 10); QEP 2, 2 BindJoins, uniform distribution, cost of f = 20, cost of g = 50 (figure 11); QEP 2, 2 BindJoins, Zipfian distribution, cost of f = 20, cost of g = 50 (figure 12).]


Figure 13. Exp.5: effect of delays on the ER behavior

Figure 14. Exp.6: selection predicate between 2 BJs

Conclusions: HBBJ has a bad ER behavior, since tuples for which the result was already in the cache have to wait for their turn before being output. However, HBBJ is non-blocking and works in a pipeline. SBBJ does not take advantage of the most frequent values, since it sorts tuples in value order, in which the most frequent values do not necessarily come first. But the main disadvantage of SBBJ is its blocking aspect, which leads to very poor ER and increased response time. Our proposed ERBJ consistently outperforms the others. Like HBBJ, it is non-blocking; like SBBJ, it outputs simultaneously several tuples sharing an argument value, as soon as the result for this value is available. All three algorithms use a cache, and therefore are useful if there are duplicates in their input. However, the advantage of the ERBJ over the two others increases in the presence of skewed distributions, since it processes the most popular values first to improve its output rate.

5.3. ER behavior and response time with more complex plans

In experiment 5 (figure 13), we study the output rate of QEP1 in the case where q provides only one tuple per time unit (since, e.g., it retrieves input data from a remote site or performs a complex subplan). For comparison, we also plotted the curve for ERBJ when there is no input limitation (the same as in figure 10). The curve corresponding to ERBJ with a delay between inputs is almost linear for the first 10,000 running time units; then it joins the curve corresponding to ERBJ without any delay. This shows that even with its tuple buffer limited by the slow input², the ERBJ has been able to choose frequent values to process (otherwise, it would have met the non-restricted curve even later). The ERBJ curve is quite close to the absolute optimum in the

2 We measured the size of the ERBJ buffer and learned that it grows slowly up to 1,000 tuples.

[Figures 13-14 plot output tuples against running time. Figure 13: QEP 1, 1 BindJoin, Zipfian distribution, cost of f = 20; curves for ERBJ with no delay, and for ERBJ, HBBJ and SBBJ with delay 1 (output tuples x1000). Figure 14: QEP 3, 2 BindJoins and a selection, uniform distribution, cost of f = 20, cost of g = 50, selectivity of s = 0.1; curves for ERBJ, HBBJ and SBBJ (output tuples x100).]


presence of restrictions, which would be to output one tuple every time unit (as soon as it arrives); this optimum, however, is not achievable, since new values require processing time. The ERBJ does so well because the most frequent values are scattered uniformly over a data set following a Zipf distribution. This would not be the case with larger delays and a very unfavorable data order (e.g., the most frequent values last), since the ERBJ could not see (and choose) frequent values early. HBBJ is not affected by the delay between inputs since it is de-synchronized from q; obviously, with a larger delay, idle time may occur. Finally, as expected, SBBJ has to wait 10,000 time units for the complete input before it can sort the tuples and proceed.

Experiment 6 (figure 14) studies the output of QEP3, in the case where the selection σs eliminates nine (X,Y,Z) tuples out of ten. HBBJ not only has a worse ER than the ERBJ, but also a longer response time. At the beginning of the execution, the first HBBJ outputs few tuples, of which very few survive the selection; the second HBBJ is therefore often idle. Towards the end of the execution, as the cache of the first BindJoin fills, it generates tuples at a faster rate, and the second BindJoin, even after the selection, becomes overloaded with new values. This behavior (first idle, then overloaded) translates into an increased running time. Note that the same synchronization problem can arise in a variety of settings: for example, if we replace σs by a regular join, a mixture of joins and selections, etc. The same holds for QEP2 if g is significantly more expensive than f [21]. By contrast, when using ERBJs, the large early output rate of the first one, even if trimmed by the selection, means that the second one is always busy. Therefore, the pipeline between the two is perfect (in figure 14, the total running time is that of the first BindJoin). Finally, the curve corresponding to sort-based BindJoins is, as expected, delayed by the running time of the first BindJoin.

5.4. Other experiments and conclusion

In this section, we summarize the results of some other measurements that we performed but do not detail here (see the extended version of this paper [21]), and we draw the final conclusions.

Influence of the input order: While the order in which the input tuples entered the BindJoins in the previous section was arbitrary, in practice specific input orders may be encountered, resulting from the processing of the subplan q. Neither the ERBJ nor the SBBJ is sensitive to such orders, since they perform their own reordering (unless delays prevent it). In contrast, the HBBJ treats tuples in their arrival order and is thus very sensitive to the input order.

Influence of the number of BindJoins in a QEP: We have performed experiments on QEPs with up to four BindJoins. We noticed that as the number of BindJoins increases, if they have different costs, produce several output tuples per input tuple, or if there are interspersed selections, the probability of synchronization problems for hash-based BindJoins (like the one in figure 14) increases. The explanation is


again the fact that some BindJoins are successively idle, then overloaded. The advantage of ERBJ over hash-based BindJoins is therefore even more important. (The sort-based alternative fares worse still: in the presence of many BindJoins, its initial delay only grows.)

Conclusion: From these experiments, we draw the following conclusions. The ERBJ always provides a significantly better ER than the state-of-the-art algorithms for accessing restricted resources, which are based on sorting the input or on simple memoization. This improvement depends on the presence of duplicates in the input, and increases with the skew of the value distribution in the input. It is remarkably stable under variations in the input order, under delays between successive input tuples, and when other operators in the QEP cause synchronization problems. Furthermore, its excellent ER properties may improve the overall RT of complex queries.

6. Related optimization and execution techniques

Significant work has been done on online and adaptive query processing; see, e.g., [1, 15, 18, 24, 28]. These works do not address the specific BindJoin operator, but they are complementary: indeed, as mentioned in section 3.4, the ER behavior of a QEP results from the good ER behavior of all its query operators.

The ObjectGlobe [4] and HyperQueries [29] projects use Java user-defined operators, loadable from external code repositories. Mocha [25] addresses query optimization for user-defined functions (UDFs) that may be shipped across the network. When possible, this technique is very profitable, since it avoids data transfers; however, in several applications it is not applicable, as restricted resources may depend on a particular language or environment, or simply cannot be copied.

Execution techniques for expensive functions: [5] presents the sort-based BindJoin algorithm with which we compared our BindJoin in section 5. [16] compares hash- and sort-based techniques for avoiding useless function calls, and proposes a hybrid cache algorithm that degrades gracefully if the cache outgrows the available memory. Compared with [5, 16], our BindJoin exploits duplicates and parallelism to improve its ER; we also use it to avoid duplicate blob transfers. We showed in section 4.3 how our ERBJ deals with memory limitations: thus, it has the good properties of the hybrid cache algorithm, plus an improved ER behavior.

[22] studies query execution in a client-server context with expensive UDFs. It recognized that UDFs can be executed as joins, and that existing work on distributed join processing and semi-joins could be reused. Our techniques have a broader scope, since they apply to any restricted resource access (in particular to blob transfers) and also consider ER optimizations.


Optimization techniques for expensive functions: There are two approaches for modeling expensive functions, and therefore two classes of optimization algorithms. First, the LDL approach [7] models a function as a table. While this requires little modification to a regular optimizer, the number of functions increases the size of the search space exponentially, just like the number of regular tables. This drawback is avoided by the second approach, in which expensive functions are assimilated to selection predicates and thus have a lesser influence on the search space. Optimization methods based on predicate ranking have been proposed in, e.g., [6, 17]; in a distributed setting, they are no longer optimal, due to data transfer costs [22]. In [26], efficient optimization algorithms improve over predicate ranking by considering interesting data orders and bushy QEPs.

Our modeling of restricted resources belongs rather to the LDL family, for the following reason. In general, a restricted, expensive resource is not necessarily a function or a predicate. It may be a fully parameterized sub-query, optimized as a complex operator-tree QEP (sub-queries have also been assimilated to expensive functions in [16], which suggested caching the result of a sub-query just like a function result). Ignoring such parameterized sub-plans may lead to a loss of optimality [10]. To combine a sub-plan providing bindings with another parameterized sub-plan requiring them, we need a binary BindJoin operator, not a selection. Query optimization algorithms for tables with binding patterns, using joins and BindJoins, are provided in [10], which shows that in practical cases the presence of access restrictions drastically limits the size of the search space.

Most often, distributed queries over expensive functions are globally optimized by minimizing the TW; a general framework for RT-oriented query optimization is described in [12].

7. Conclusion

In this paper, we investigated the publication model and the algorithms for resource sharing in a fully distributed peer-to-peer mediation architecture.

First, we showed how the model of tables with binding patterns can be used to uniformly model data sources including functions as well as blobs. We explained why this modeling provides the benefits of semi-join-like techniques without their drawbacks.

Then, we analyzed the impact of expensive functions and blobs on the design of the BindJoin operator and on its integration in the query execution plan. We considered three performance goals: (i) total work; (ii) response time; and (iii) the more specific early tuple output rate. We identified the latter as an important performance requirement in our context.


The main specificity of our BindJoin is that it exploits the presence of duplicates in its input to provide a high early tuple output rate, so that the user obtains most of the query results fast.

Since the proposed BindJoin operator includes all the optimizations, the publisher's task is reduced to a minimum. Together with the simple yet powerful model of tables with binding patterns, it makes publishing programs and blobs trivial, while providing good performance and a high early output rate.

References

[1] R. Avnur and J. Hellerstein. Eddies: Continuously adaptive query processing. In Proc. of ACM SIGMOD Conf., 2000.

[2] P. Bernstein and D. W. Chiu. Using semi-joins to solve relational queries. Journal of the ACM, 1981.

[3] L. Bouganim, F. Fabret, F. Porto, and P. Valduriez. Processing queries with expensive functions and large objects in distributed mediator systems. In Proc. of ICDE Conf., 2001.

[4] R. Braumandl, M. Keidl, A. Kemper, D. Kossmann, et al. ObjectGlobe: Ubiquitous query processing on the internet. In Workshop on Technologies for E-Services, 2000.

[5] S. Chaudhuri and K. Shim. Query optimization in the presence of foreign functions. In Proc. of the VLDB Conf., 1993.

[6] S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM Transactions on Database Systems (TODS), 24(2), 1999.

[7] D. Chimenti, R. Gamboa, and R. Krishnamurthy. Towards an open architecture for LDL. In Proc. of the VLDB Conf., Amsterdam, 1989.

[8] The Decair and Thetis projects. Available at http://www-caravel.inria.fr/Econtrats.html.

[9] The Ecobase Team. The Ecobase project: Database and web technologies for environmental information systems. SIGMOD Record, 30(3), 2001.

[10] D. Florescu, A. Levy, I. Manolescu, and D. Suciu. Query optimization in the presence of limited access patterns. In Proc. of ACM SIGMOD Conf., 1999.

[11] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. Int.'l Journal of Supercomputer Applications, 2001.

[12] S. Ganguly, W. Hassan, and R. Krishnamurthy. Query optimization for parallel execution. In Proc. of ACM SIGMOD Conf., 1992.

[13] R. Goldman and J. Widom. WSQ/DSQ: A practical approach for combined querying of databases and the web. In Proc. of ACM SIGMOD Conf., 2000.

[14] G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), June 1993.

[15] P. Haas and J. Hellerstein. Ripple joins for online aggregation. In Proc. of ACM SIGMOD Conf., 1999.

[16] J. Hellerstein and J. Naughton. Query execution techniques for caching expensive methods. In Proc. of ACM SIGMOD Conf., 1996.


[17] J. Hellerstein and M. Stonebraker. Predicate migration: Optimizing queries with expensive predicates. In Proc. of ACM SIGMOD Conf., 1993.

[18] Z. Ives, D. Florescu, M. Friedman, D. Weld, and A. Levy. An adaptive query execution system for data integration. In Proc. of ACM SIGMOD Conf., 1999.

[19] D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 2000.

[20] The LeSelect Project. Available at http://www-caravel.inria.fr/LeSelect.

[21] I. Manolescu, L. Bouganim, F. Fabret, and E. Simon. Efficient data and program integration using binding patterns. Tech. Report no. 4239, INRIA. Available at caravel.inria.fr/dataFiles/MBFS01tech.ps.

Extended version available at www-rocq.inria.fr/~manolesc/BJ-extended.ps

[22] T. Mayr and P. Seshadri. Client-site query extensions. In Proc. of ACM SIGMOD Conf., 1999.

[23] A. Rajaraman, Y. Sagiv, and J. Ullman. Answering queries using templates with binding patterns. In Proc. of the ACM PODS, San Jose, CA, 1995.

[24] A. Raman, B. Raman, and J. Hellerstein. Online dynamic reordering for interactive data processing. In Proc. of the VLDB Conf., 1999.

[25] M. Rodriguez-Martinez and N. Roussopoulos. MOCHA: A self-extensible database middleware system for distributed data sources. In Proc. of ACM SIGMOD Conf., 2000.

[26] W. Scheufele and G. Moerkotte. Efficient dynamic programming algorithms for ordering expensive joins and selections. In Proc. of the EDBT Conf., 1998.

[27] K. Stocker, D. Kossmann, R. Braumandl, and A. Kemper. Integrating semi-join-reducers into state of the art query processors. In Proc. of ICDE Conf., 2001.

[28] T. Urhan and M. Franklin. XJoin: a reactively scheduled pipelined join operator. In IEEE Data Engineering Bulletin, 2000.

[29] C. Wiesner and A. Kemper. Hyperqueries: dynamic distributed query processing on the Internet. In Proc. of the VLDB Conf., 2001.

[30] A.N. Wilschut and P.M.G. Apers. Dataflow query execution in a parallel main-memory environment. In Proc. of the PDIS Conf., 1991.


Dynamic discovery of e-services

A description logics based approach

Mohand-Said Hacid* — Alain Léger** —Christophe Rey*** — Farouk Toumani***

* Laboratoire d'Ingénierie des Systèmes d'Information, UFR d'Informatique, Université Claude Bernard Lyon I, Bâtiment Nautibus, 8 boulevard Niels Bohr, F-69622 Villeurbanne cedex

[email protected]

** France Telecom R&D/DMI, BP 59, 4 rue du Clos Courtel, F-35512 Cesson Sévigné

[email protected]

*** Laboratoire d'Informatique, de Modélisation et d'Optimisation des Systèmes, CNRS UMR 2239 – Université Blaise-Pascal Clermont-Ferrand II, 24 avenue des Landais, F-63177 Aubière cedex

rey,[email protected]

ABSTRACT. We investigate the problem of dynamic discovery of e-services, important in the loosely coupled vision of e-commerce. We show that it corresponds to a new important reasoning technique in Description Logics: a new instance of the problem of rewriting concepts using terminologies, which generalizes problems such as rewriting queries using views or minimal rewriting using terminologies. We call this new instance the best covering using a terminology: given a query Q and a set ST of e-services, the problem consists in finding a subset of ST, called a "cover" of Q, that contains as much as possible of the information common with Q and as little as possible of extra information with respect to Q. We formally study this problem for languages with structural subsumption. We show that it is NP-hard and propose an algorithm derived from hypergraph theory.

RÉSUMÉ. We study the problem of dynamic discovery of e-services, important in the loosely coupled approach to e-commerce. We show that it can be seen as a new kind of reasoning in the field of description logics: it is then a new instance of the problem of rewriting concepts using a terminology, other instances being the rewriting of queries using views or the search for minimal rewritings using a terminology. We call this new instance the search for best covers using a terminology: given a query Q and a set ST of e-services, the goal is to identify the subsets of ST, called "covers" of Q, that contain as much as possible of the information common with Q and as little as possible of information absent from Q. We study the problem at the formal level for languages with structural subsumption. We show that it is NP-hard. Finally, we propose an algorithm from hypergraph theory to solve it.

KEYWORDS: E-services discovery, Description Logics, covers of concepts, rewriting, difference, hypergraph transversals

MOTS-CLÉS: E-services discovery, Description Logics, covers of a concept, rewriting, difference, hypergraph transversals


1. Introduction

The recent progress and wider dissemination of electronic commerce via the World Wide Web is revolutionizing the way companies interact with their suppliers, partners or clients. The number and type of on-line resources and services have increased considerably and led to a new form of automation, namely B2B and B2C e-commerce. A recent initiative envisions a new paradigm for electronic commerce in which applications are wrapped and presented as integrated electronic services (e-services) [Wei01, Vld01]. Roughly speaking, an e-service (also called Web service) can be defined as an application made available via the Internet by a service provider, and accessible by clients [CAS 01b, BEN 02]. Examples of e-services currently available range from weather forecast, on-line travel reservation or banking services to entire business functions of an organization. The ultimate vision behind the e-service paradigm is to transform the Web from a collection of information into a distributed device of computation where programs (services) are capable of intelligent interaction, being able to discover and negotiate with each other and compose themselves into more complex services [CAS 01a, Wei01, FEN 02, BEN 02]. Automation is a key concept to realize this vision. It is fundamental at each step of service delivery to cope with the highly dynamic environment of e-services [CAS 01a, Vld01].

This paper focuses on the problem of dynamic discovery of e-services. Such a process involves automatic matching of service offers with service requests and constitutes an important aspect of e-commerce interactions. For example, it is the first step to enable automated service composition. Our aim is to ground the dynamic discovery of e-services on a semantic comparison between a client query and available e-services, so as to provide the combinations of e-services that "best match" the client needs. However, to achieve such an advanced discovery process, two main issues must be addressed [PAO 02]:

– Description of services: an automated discovery process requires rich and flexible machine-understandable descriptions of services, which are not supported by the current industry standards (e.g., UDDI¹).

– An algorithm that allows reasoning about the descriptions of e-services to achieve the discovery task.

It is worth noting that the semantic web initiative² at W3C aims at generating technologies and tools that might help bridge the gap between the current standard solutions and the requirements of an advanced e-services discovery process [PAO 02, FEN 02]. In this context, ontologies can play a crucial role in defining formal semantics for information [FEN 01, FEN 02, HOR 02b], consequently allowing computer-interpretable specifications of services. In line with the semantic web approach, our work rests on a knowledge representation approach allowing a rich description of e-services and providing adequate reasoning mechanisms that automate the discovery

1. Universal Description, Discovery and Integration (http://www.uddi.org/).
2. http://www.w3.org/2001/sw/.


of e-services. We propose to use description logics (DLs) [DON 96] as a description language to specify the declarative part of e-services. A key aspect of description logics is their formal semantics and reasoning support. They have proven to provide useful support for the definition, integration and maintenance of ontologies, a feature that makes them suitable for the semantic Web [HEN 00, HOR 02b, HOR 02a].

Problem statement and contributions This paper concentrates on the reasoning issue to automate the discovery of e-services. In this setting, the problem of dynamic discovery of e-services can be stated as follows: given an ontology T containing e-service descriptions and a client query Q, find a combination of e-services that contains as much as possible of the information common with Q and as little as possible of extra information with respect to Q. We call such a combination of e-services a best cover of Q using T.

To formally define the notion of best cover, we need to be able to characterize the notion of "extra information", i.e., the information contained in one description and not contained in the other. For that, we use a non-standard operation in description logics, the difference or subtraction operation. Roughly speaking, the difference of two descriptions is defined as a description containing all the information which is part of one argument but not of the other [TEE 94].

We formally define the best covering problem in a restricted framework of description logics where the difference operation is always semantically unique. We show that, in this framework, the problem of computing the best covers of a query Q using an ontology T can be seen as a new instance of the problem of rewriting concepts using a terminology [BEE 97, BAA 00]. A complexity study shows that this problem is NP-hard. Then, we make use of hypergraph theory to propose an algorithm that computes the best covers of a concept Q using an ontology T.

Context of this work The work presented in this paper has been developed and experimented with in the context of the MKBEEM³ project, which aims at providing electronic marketplaces with intelligent, knowledge-based multilingual services. In this project, e-services are used to describe the offers delivered by the MKBEEM platform independently from specific providers. The reasoning mechanism described in this paper is used to allow clients to dynamically discover the available e-services that best meet their needs, to examine their properties and capabilities, possibly to provide missing information, and to determine how to access them. First experimental results show that the modularity of the proposed architecture, together with the associated reasoning mechanism, makes the whole system provider-independent and better able to cope with the great instability and short lifetime of e-commerce offers and e-services.

3. MKBEEM stands for Multilingual Knowledge Based European Electronic Marketplace (IST-1999-10589, 1st Feb. 2000 - 1st Aug. 2002).


Organization of the paper The rest of this paper is organized as follows. Section 2 presents our motivation for using reasoning mechanisms for the dynamic discovery of e-services. Section 3 introduces the description logic material and the reasoning mechanisms used in our framework. In Section 4, we provide a formal framework for the best covering problem and a way to solve it with the help of hypergraph theory. In Section 5, we describe an algorithm we have implemented to compute the best covers of a concept Q using an ontology T. Section 6 presents a brief description of the MKBEEM project. Section 7 reviews related work and presents future research directions.

2. Motivation

The main advantage of the proposed approach is to ground the dynamic discovery of e-services on a semantic comparison between a client query and available e-services. More precisely, we propose an algorithm that achieves this semantic comparison by extracting from the e-service definitions the part that is semantically common with the query and the part that is semantically different from it. Knowing both allows selecting the relevant e-services and then choosing the best ones: this is the dynamic discovery. Knowing the latter also allows initiating a dialogue between the e-commerce platform and the user, in order to have him clarify his query.

The following example illustrates the practical interest of the reasoning mechanism described in this paper for our application.

Example 1

Let us consider an ontology⁴ that contains the following e-services:
– ToTravel, allowing to consult a list of trips given a departure place, an arrival place, an arrival date and an arrival time,
– FromTravel, allowing to consult a list of trips given a departure place, an arrival place, a departure date and a departure time,
– Hotel, allowing to consult a list of hotels given a destination place, the check-in date, the check-out date, the number of adults and the number of children.

Now, assume we have the following query: "I want to go from Paris to Madrid on Friday 21st of June, look for an accommodation there for one week (from 21st of June to 28th of June) and rent a car". Formally, the e-services ToTravel, FromTravel and Hotel, as well as the query Q, can be expressed as concept descriptions in a given description logic. Our goal is to rewrite Q into the closest description E expressed as a conjunction of e-services. Considering our ontology of e-services, the possibly interesting combinations of e-services are: E1 = Hotel ⊓ ToTravel and E2 = Hotel ⊓ FromTravel. The two types of extra information brought by each

4. This ontology describes some e-services extracted from the French railways company (SNCF) web site (http://www.sncf.com).


Solution   Rest                          Missing information
E1         car rental, departure date    arrival date, arrival time, number of adults, number of children
E2         car rental                    departure time, number of adults, number of children

Table 1. Example of extra information.

combination of e-services are given in Table 1. For each combination, these two kinds of "extra information" are:

– the information which is contained in the query Q and not contained in its rewriting (cf. Table 1, column Rest), and

– the information contained in the rewriting and not contained in the query Q (cf. Table 1, column Missing information).

Continuing with the example, the best combinations are discovered by searching for the ones that bring the least extra information with respect to the query. Clearly, to better meet the user needs, it is preferable to minimize first the first kind of extra information (i.e., the column Rest). Here, the extra information of ToTravel is "bigger" than that of FromTravel. So, the best combination for the query is Hotel ⊓ FromTravel. Once the best combinations have been dynamically discovered, a dialogue phase can be initiated with the user to ask him to provide the missing information.

3. Description Logics

The technical background of our proposal consists of Description Logics (DLs) enriched with a difference operator. We refer to [DON 96] for an introduction to DLs and to [TEE 94] for an extension of DLs with a difference operation. In this section, we introduce Description Logics, the difference operation (as defined by Teege in [TEE 94]) and the notion of size of a description.

3.1. Description Logics: main notions

DLs are a family of logics that were developed for modeling complex hierarchical structures and for providing specialized reasoning engines to make inferences on these structures. The main reasoning mechanisms (like subsumption or satisfiability) are effectively decidable for some description logics [DON 96]. Recently, DLs have proved well-suited for the semantic Web. Some ontology languages, such as OIL [FEN 01, HOR 02a] or DAML [HEN 00, HOR 02a], that were proposed to extend RDFS⁵ are in fact syntactic variants of a very expressive DL.

5. Resource Description Framework Schema (http://www.w3.org/RDF/).


A DL allows representing a domain of interest in terms of concepts (unary predicates) that characterize subsets of the objects (individuals) in the domain, and roles (binary predicates) over such a domain. Concepts are denoted by expressions formed by means of special constructors. Examples of constructors considered in this work are:

– the symbol ⊤ is a concept description which denotes the top concept, while the symbol ⊥ stands for the inconsistent (bottom) concept,

– concept conjunction (⊓), e.g., the concept description parent ⊓ male denotes the class of fathers (i.e., male parents),

– the universal role quantification (∀R.C), e.g., the description ∀child.male denotes the set of individuals whose children are all male,

– the number restriction constructors (≥ n R) and (≤ n R), e.g., the description (≥ 1 child) denotes the class of parents (i.e., individuals having at least one child), while the description (≤ 1 Leader) denotes the class of individuals having at most one leader.

The various description logics differ from one another in the set of constructors they allow. Table 2 below shows the constructors of two DLs: FL0 and ALN. A concept obtained using the constructors of a description logic L is called an L-concept. The semantics of a concept description is defined in terms of an interpretation.

Constructor name             Syntax            Semantics                                   FL0   ALN
concept name                 P                 P^I ⊆ ∆^I                                   X     X
top                          ⊤                 ∆^I                                         X     X
bottom                       ⊥                 ∅                                                 X
conjunction                  C ⊓ D             C^I ∩ D^I                                   X     X
primitive negation           ¬P                ∆^I \ P^I                                         X
universal quantification     ∀R.C              {x ∈ ∆^I | ∀y : (x, y) ∈ R^I → y ∈ C^I}     X     X
at-least number restriction  (≥ n R), n ∈ N    {x ∈ ∆^I | #{y | (x, y) ∈ R^I} ≥ n}               X
at-most number restriction   (≤ n R), n ∈ N    {x ∈ ∆^I | #{y | (x, y) ∈ R^I} ≤ n}               X

Table 2. Syntax and semantics of some concept-forming constructors.

An interpretation I = (∆^I, ·^I) consists of a nonempty set ∆^I, the domain of the interpretation, and an interpretation function ·^I, which associates to each concept name P ∈ C a subset P^I of ∆^I and to each role name R ∈ R a binary relation R^I ⊆ ∆^I × ∆^I. Additionally, the extension of ·^I to arbitrary concept descriptions is defined inductively as shown in the third column of Table 2. Based on this semantics, subsumption, equivalence and the notion of least common subsumer⁶ are defined as follows. Let C, D and C1, …, Cn be concept descriptions:
• C is subsumed by D (noted C ⊑ D) iff C^I ⊆ D^I for every interpretation I.
• C is equivalent to D (noted C ≡ D) iff C^I = D^I for every interpretation I.

6. Informally, a least common subsumer of a set of concepts corresponds to the most specific description which subsumes all the given concepts [BAA 99].


• D is a least common subsumer of C1, …, Cn (noted D = lcs(C1, …, Cn)) iff: (1) Ci ⊑ D for all 1 ≤ i ≤ n, and (2) D is the least concept description with this property, i.e., if D′ is a concept description satisfying Ci ⊑ D′ for all 1 ≤ i ≤ n, then D ⊑ D′ [BAA 99].
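As a small worked illustration (ours, not taken from [BAA 99]), consider the following lcs computation in FL0 ∪ (≥ n R):

    % Illustrative lcs computation:
    %   C_1 = \forall R.P \sqcap (\geq 3\,R),  C_2 = \forall R.P \sqcap (\geq 1\,R)
    \mathrm{lcs}(C_1, C_2) \;=\; \forall R.P \sqcap (\geq 1\,R)
    % \forall R.P is common to both conjunctions, and the weaker bound
    % (\geq 1\,R) subsumes (\geq 3\,R); no strictly more specific
    % description subsumes both C_1 and C_2.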

The intensional component of a knowledge base built using a description logic is called a terminology. The kind of terminologies we consider in this paper is defined below.

Definition 1 (terminology) Let A be a concept name and C be a concept description. Then A ≐ C is a concept definition. A terminology T is a finite set of concept definitions such that each concept name occurs at most once in the left-hand side of a definition. The concept name A is a defined concept in the terminology T iff it occurs in the left-hand side of a concept definition in T.

An interpretation I satisfies the statement A ≐ C iff A^I = C^I. An interpretation I is a model for a terminology T if I satisfies all the statements in T.

A terminology built using the constructors of a language L is called an L-terminology. In the sequel, we assume that a terminology T is acyclic, i.e., there do not exist cyclic dependencies between concept definitions. Acyclic terminologies can be unfolded by replacing defined names by their definitions until no more defined names occur on the right-hand sides. Therefore, the notion of lcs of a set of descriptions can be obviously extended to concepts containing defined names. In this case we write lcsT(C, D) to denote the least common subsumer of the concepts C and D w.r.t. a terminology T (i.e., the lcs is applied to the unfolded descriptions of C and D). In our application, an ontology of e-services will be described as a terminology (i.e., concept definitions are used to specify e-services). So, in the following, when appropriate, we use the term e-services (or simply services) to mean defined concepts in our application. Also, we use the terms terminology and ontology interchangeably.

Example 2

The e-services introduced informally in example 1 can be described using the description logic FL0 ∪ (≥ n R)⁷, as given in Table 3. In the same way, the query given in example 1 could be abstracted by the following description:

Q ≐ (≥ 1 departurePlace) ⊓ (∀ departurePlace.Location) ⊓ (≥ 1 arrivalPlace) ⊓ (∀ arrivalPlace.Location) ⊓ (≥ 1 departureDate) ⊓ (∀ departureDate.Date) ⊓ Accommodation ⊓ (≥ 1 destinationPlace) ⊓ (∀ destinationPlace.Location) ⊓ (≥ 1 checkIn) ⊓ (∀ checkIn.Date) ⊓ (≥ 1 checkOut) ⊓ (∀ checkOut.Date) ⊓ carRental

7. We denote by FL0 ∪ (≥ n R) the description logic FL0 augmented with the constructor (≥ n R).


ToTravel ≐ (≥ 1 departurePlace) ⊓ (∀ departurePlace.Location) ⊓ (≥ 1 arrivalPlace) ⊓ (∀ arrivalPlace.Location) ⊓ (≥ 1 arrivalDate) ⊓ (∀ arrivalDate.Date) ⊓ (≥ 1 arrivalTime) ⊓ (∀ arrivalTime.Time)

FromTravel ≐ (≥ 1 departurePlace) ⊓ (∀ departurePlace.Location) ⊓ (≥ 1 arrivalPlace) ⊓ (∀ arrivalPlace.Location) ⊓ (≥ 1 departureDate) ⊓ (∀ departureDate.Date) ⊓ (≥ 1 departureTime) ⊓ (∀ departureTime.Time)

Hotel ≐ Accommodation ⊓ (≥ 1 destinationPlace) ⊓ (∀ destinationPlace.Location) ⊓ (≥ 1 checkIn) ⊓ (∀ checkIn.Date) ⊓ (≥ 1 checkOut) ⊓ (∀ checkOut.Date) ⊓ (≥ 1 nbAdults) ⊓ (∀ nbAdults.Integer) ⊓ (≥ 1 nbChildren) ⊓ (∀ nbChildren.Integer)

Table 3. Example of an ontology of e-services.

3.2. The difference operation

In this section, we recall the main results obtained by Teege in [TEE 94] about the difference operation between two concept descriptions.

Definition 2 (difference operation) Let C, D be two concept descriptions with C ⊑ D. The difference C − D of C and D is defined by C − D := max⊑ { B | B ⊓ D ≡ C }, the maximum being taken with respect to subsumption.

This definition of difference requires that the second argument subsumes the first one. However, the difference C − D between two incomparable descriptions C and D can be given by constructing the least common subsumer of C and D, that is, C − D := C − lcs(C, D).

It is worth noting that, in some description logics, the set C − D may contain descriptions which are not semantically equivalent, as illustrated by the example below.

Example 3

Let us consider the following descriptions C ≐ (∀R.P) ⊓ (∀R.¬P) and D ≐ (∀R.P′) ⊓ (∀R.(≤ 4 S)). The following two non-equivalent descriptions (∀R.¬P′) and (∀R.(≥ 5 S)) are both members of the set C − D.

Teege [TEE 94] provides sufficient conditions characterizing the logics where the difference operation is always semantically unique and can be implemented in a simple syntactical way, by constructing the set difference of subterms in a conjunction. Some basic notions and useful results of this work are introduced below.

Definition 3 (reduced clause form and structure equivalence) Let L be a description logic.

– A clause in L is a description A with the following property: (A ≡ B ⊓ A′) ⇒ (B ≡ ⊤) ∨ (B ≡ A). Every conjunction A1 ⊓ … ⊓ An of clauses can be represented by the clause set {A1, …, An}.


– A clause set A = {A1, …, An} is called reduced if either n = 1, or no clause subsumes the conjunction of the other clauses: ∀ 1 ≤ i ≤ n : Ai ⋣ A \ {Ai}. The set A is then called a reduced clause form (RCF) of every description B ≡ A1 ⊓ … ⊓ An.

– Let A = {A1, …, An} and B = {B1, …, Bm} be reduced clause sets in a description logic L. A and B are structure equivalent (denoted by A ≅ B) iff: n = m ∧ ∀ 1 ≤ i ≤ n ∃ 1 ≤ j, k ≤ n : Ai ≡ Bj ∧ Bi ≡ Ak.

– If, in a description logic, all the RCFs of every description are structure equivalent, we say that RCFs are structurally unique in that logic.

The structural difference operation, denoted by \≡, is defined as the set difference of clause sets, where clauses are compared on the basis of the equivalence relation. [TEE 94] provides two interesting results: 1) in description logics with structurally unique RCFs, the difference operation can be straightforwardly calculated using the structural difference operation, and 2) structural subsumption is a sufficient condition for a description logic to have structurally unique RCFs.
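In languages with structurally unique RCFs, the structural difference is thus a plain set difference modulo clause equivalence; here is a hedged Java sketch of that operation, where the equivalence test is a hypothetical stand-in for a structural-subsumption-based equivalence check.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiPredicate;

    // Structural difference: keep the clauses of C matched by no clause of D.
    class StructuralDifference {
        static <Clause> List<Clause> difference(List<Clause> c, List<Clause> d,
                                                BiPredicate<Clause, Clause> equivalent) {
            List<Clause> result = new ArrayList<>();
            for (Clause a : c) {
                boolean matched = false;
                for (Clause b : d) {
                    if (equivalent.test(a, b)) { matched = true; break; }
                }
                if (!matched) result.add(a);   // a belongs to C \≡ D
            }
            return result;
        }
    }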

Consequently, structural subsumption is a sufficient condition allowing to identify logics where the difference operation is semantically unique and can be implemented using the structural difference operation. However, it is worth noting that the definition of structural subsumption given in [TEE 94] is different from the one usually used in the literature. Unfortunately, a consequence of this remark is that many description logics for which a structural subsumption algorithm exists (e.g., ALN [MOL 98]) do not have structurally unique RCFs. Nevertheless, the result given in [TEE 94] is still interesting in practice, since there exist many description logics with this property. Examples of such logics include the language FL0 ∪ (≥ n R), which we have used in the context of the MKBEEM project, or the more powerful description logic L1 [TEE 94], which contains the following constructors:

– ⊓, ⊔, ⊤, ⊥, (≥ n R), existential role quantification (∃R.C) and existential feature quantification (∃f.C) for concepts, where C denotes a concept, R a role and f a feature (i.e., a functional role),

– bottom (⊥), composition (∘) and differentiation (|) for roles,

– bottom (⊥) and composition (∘) for features.

In the rest of this paper, we use the term structural subsumption in the sense of [TEE 94].

3.3. Size of a description

Let L be a description logic with structural subsumption. We define the size |C| of an L-concept description C as the number of clauses in its RCFs8. If necessary,

8. We recall that, since L has structurally unique RCFs, all the RCFs of an L-description are structure equivalent and thus have the same number of clauses.


a more precise measure of the size of a description can be defined by also taking into account the size of each clause (e.g., by counting the number of occurrences of concept and role names in each clause). However, in this case one must use some kind of canonical form in order to avoid different representations of equivalent clauses. Note that, in a description logic with structurally unique RCFs, it is often possible to define a canonical form which is itself an RCF [TEE 94].

4. The best covering problem

In this section, we first investigate the best covering problem in the framework of description logics with structural subsumption. Then we see how to compute best covers using hypergraph theory. Finally, we come back to the running example and determine the best covers of the query Q.

4.1. Problem statement

Let us first introduce some basic definitions that are required to formally define the best covering problem. Let L be a description logic with structural subsumption, T be an L-terminology, and Q ≢ ⊥ be a coherent L-concept description. The set of e-services occurring in T is denoted by ST = {Si, i ∈ [1, n]} with Si ≢ ⊥, ∀i ∈ [1, n]. In the sequel, we assume that the query Q and the e-services Si, i ∈ [1, n] are given by their RCFs.

Definition 4 (cover) A cover of Q using T is a conjunction E of some names Si from T such that: Q − lcsT(Q, E) ≢ Q.

Hence, a cover of a concept Q using T is defined as any conjunction of e-services occurring in T which shares some common information with Q. Note that a cover E of Q is always consistent with Q (i.e., Q ⊓ E ≢ ⊥) since L is a description logic with structurally unique RCFs9 and we have Q ≢ ⊥ and Si ≢ ⊥, ∀i ∈ [1, n].

To define the notion of best cover, we first need to characterize more precisely the remaining descriptions both in the input concept description Q (hereafter called the rest) and in its cover E (hereafter called the miss).

Definition 5 (rest and miss) Let Q be an L-concept description and E a cover of Q using T. The rest of Q with respect to E, written RestE(Q), is defined as follows: RestE(Q) ≐ Q − lcsT(Q, E).

9. If the language L contains the incoherent concept ⊥, then ⊥ must be a clause, i.e., non-trivial decompositions of ⊥ are not possible (meaning that an incoherent conjunction of coherent clauses cannot occur); otherwise it is easy to show that L does not have structurally unique RCFs.


The missing information of Q with respect to E, written MissE(Q), is defined as follows: MissE(Q) ≐ E − lcsT(Q, E).

Now we can define the notion of best cover.

Definition 6 (best cover) A concept description E is called a best cover of Q using a terminology T iff:

– E is a cover of Q using T, and

– there does not exist a cover E′ of Q using T such that (|RestE′(Q)|, |MissE′(Q)|) < (|RestE(Q)|, |MissE(Q)|), where < stands for the lexicographic order.

The best covering problem, noted BCOV(T, Q), is then the problem of computing all the best covers of Q using T.
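To make the optimality criterion concrete, here is a minimal Python sketch of the selection of best covers among candidate covers; the helpers rest_size and miss_size, standing for |RestE(Q)| and |MissE(Q)|, are hypothetical:

    # Python tuples compare lexicographically, which matches the order
    # used in Definition 6.
    def best_covers(covers, rest_size, miss_size):
        key = lambda e: (rest_size(e), miss_size(e))
        best = min(key(e) for e in covers)
        return [e for e in covers if key(e) == best]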

Theorem 1 (complexity of BCOV(T, Q)) The best covering problem is NP-hard.

The proof of this theorem follows from the results regarding the minimal rewriting problem [BAA 00] (see [HAC 02] for a detailed proof).

4.2. Computing best covers using hypergraphs

Let us first recall some useful definitions regarding hypergraphs.

Definition 7 (hypergraph and transversals) [EIT 95] A hypergraph H is a pair (Σ, Γ) of a finite set Σ = {V1, . . . , Vn} and a set Γ of subsets of Σ. The elements of Σ are called vertices, and the elements of Γ are called edges. A set T ⊆ Σ is a transversal of H if, for each ε ∈ Γ, T ∩ ε ≠ ∅. A transversal T is minimal if no proper subset T′ of T is a transversal. The set of the minimal transversals of a hypergraph H is noted Tr(H).
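Both properties of Definition 7 translate directly into code, as in the following sketch (edges represented as Python sets of vertices; since transversality is monotone, minimality can be checked by removing one vertex at a time):

    def is_transversal(t, edges):
        return all(t & e for e in edges)          # t meets every edge

    def is_minimal_transversal(t, edges):
        return is_transversal(t, edges) and all(
            not is_transversal(t - {v}, edges) for v in t)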

Now we can show that the best covering problem can be interpreted in the framework of hypergraphs as the problem of finding the minimal transversals with minimal cost.

Definition 8 (hypergraph HTQ generated from T and Q) Let L be a description logic with structural subsumption, T be an L-terminology, and Q be an L-concept description. Given an instance BCOV(T, Q) of the best covering problem, we build a hypergraph HTQ = (Σ, Γ) as follows:

– each e-service Si in T becomes a vertex VSi in the hypergraph HTQ. Thus Σ = {VSi, i ∈ [1, n]}.


– each clause Ai ∈ Q, for i ∈ [1, k], becomes an edge of HTQ, noted wAi, with wAi = {VSi | Si ∈ ST and Ai ∈≡ lcsT(Q, Si)}, where ∈≡ stands for the membership test modulo equivalence of clauses and lcsT(Q, Si) is given by its RCF.

For the sake of clarity we introduce the following notation.

Notation For any set of vertices X = {VSi}, subset of Σ, we note EX ≐ ⊓VSi∈X Si the concept obtained from the conjunction of the e-services corresponding to the vertices in X. Conversely, for any concept E ≐ ⊓j∈[1,m] Sij, we note XE = {VSij, j ∈ [1, m]} the set of vertices corresponding to the e-services in E.

With lemmas 1 and 2 given below, we show that computing a cover of Q using T that minimizes the rest amounts to computing a transversal of HTQ, considering only the non-empty edges. Proofs of these lemmas are given in [HAC 02].

Lemma 1 (characterization of the minimal rest) Let L be a description logic with structural subsumption, T be an L-terminology, and Q be an L-concept description. Let HTQ = (Σ, Γ) be the hypergraph built from the terminology of e-services T and the concept Q = A1 ⊓ . . . ⊓ Ak provided by its RCF. Whatever the cover E of Q using T we consider, the minimal rest (i.e., the rest whose size is minimal) is: Restmin ≡ Aj1 ⊓ . . . ⊓ Ajl, where the ji ∈ [1, k] are exactly the indices such that wAji = ∅.

Lemma 2 (characterization of covers that minimize the rest) Let HTQ = (Σ, Γ′) be the hypergraph built by removing from HTQ the empty edges. A rewriting Emin ≐ Si1 ⊓ . . . ⊓ Sim, with 1 ≤ m ≤ n and Sij ∈ ST for 1 ≤ j ≤ m, is a cover of Q using T that minimizes the rest RestEmin(Q) iff XEmin = {VSij, j ∈ [1, m]} is a transversal of HTQ.

Having covers that minimize the rest, it remains to isolate those minimizing the miss in order to obtain the best covers. To express miss minimization in the hypergraph framework, we introduce the following notion of cost.

Definition 9 (cost of a set of vertices) Let BCOV(T, Q) be an instance of the best covering problem and HTQ = (Σ, Γ′) its associated hypergraph. The cost of a set of vertices X is defined as follows: cost(X) = |MissEX(Q)|.

Therefore, the BCOV(T, Q) problem can be reduced to the computation of the transversals with minimal cost of the hypergraph HTQ. Clearly, only minimal transversals need to be considered. To sum up, the BCOV(T, Q) problem can be reduced to the computation of the minimal transversals with minimal cost of the hypergraph HTQ, so one can reuse known results on computing minimal transversals for solving the best covering problem.


Example 4

Let T and Q be, respectively, the e-services ontology and the query given in example 2. We assume that the concept names (e.g., Location, Date, Accommodation, . . . ) that appear in the description of the query Q and/or in the descriptions of the e-services of T are all atomic concepts. Hence, the query Q and the e-services of T are all provided by their RCFs10. Therefore, the associated hypergraph HTQ = (Σ, Γ) will be made of the set of vertices Σ = {VToTravel, VFromTravel, VHotel} and the set Γ containing the following edges:

Edges created from Q's clauses (each edge = set of e-services/vertices):

w(≥1 departurePlace) = {VToTravel, VFromTravel}
w(∀departurePlace.Location) = {VToTravel, VFromTravel}
w(≥1 arrivalPlace) = {VToTravel, VFromTravel}
w(∀arrivalPlace.Location) = {VToTravel, VFromTravel}
w(≥1 departureDate) = {VFromTravel}
w(∀departureDate.Date) = {VFromTravel}
wAccommodation = {VHotel}
w(≥1 destinationPlace) = {VHotel}
w(∀destinationPlace.Location) = {VHotel}
w(≥1 checkIn) = {VHotel}
w(∀checkIn.Date) = {VHotel}
w(≥1 checkOut) = {VHotel}
w(∀checkOut.Date) = {VHotel}
wcarRental = ∅

We can see that no e-service covers the clause corresponding to the edge wcarRental (as we have wcarRental = ∅). Since this is the only empty edge in Γ, the best covers of Q using T will have exactly the following rest: Restmin ≡ carRental (cf. lemma 1). Now, considering the hypergraph HTQ, the only minimal transversal is X = {VFromTravel, VHotel}. So, EX ≐ Hotel ⊓ FromTravel is the best cover of Q using the ontology of e-services T. Figure 1 shows the hypergraph HTQ and its only minimal transversal, which corresponds to the only best cover of Q.

If there were several minimal transversals, we would compute their cost, that is, the size of the missing information of their corresponding description. For example, the size of the missing information of EX is the cost of the transversal X, computed as follows:

cost(X) = |MissEX(Q)| = |MissFromTravel⊓Hotel(Q)|
cost(X) = |(≥ 1 departureTime) ⊓ (∀ departureTime.Time) ⊓ (≥ 1 nbAdults) ⊓ (∀ nbAdults.Integer) ⊓ (≥ 1 nbChildren) ⊓ (∀ nbChildren.Integer)| = 6.

The best covers would then be the minimal transversals with minimal cost. In this example, we do not care about this cost because the hypergraph HTQ has only one minimal transversal.

10. Otherwise, we have to recursively unfold the e-service (resp. query) description by replacing each concept name appearing in the e-service (resp. query) description by its definition.


Figure 1. HTQ and its only minimal transversal.

5. Algorithm

In this section we give a sketch of an algorithm, called computeBCov, for computing the best covers of a concept Q using a terminology T. In the previous section, we have shown that this problem can be reduced to the search for the transversals with minimal cost of the hypergraph HTQ. The problem of computing the minimal transversals of a hypergraph is central in various fields of computer science [EIT 95], and its precise complexity is still open. In [MIC 96], it is shown that the generation of the transversal hypergraph can be done in incremental subexponential time k^O(log k), where k is the combined size of the input and the output. To our knowledge, this is the best theoretical time bound for the generation of the transversal hypergraph.

In our case, since the problem is slightly different, we propose an adaptation of an existing algorithm, combined with a combinatorial optimization technique (branch-and-bound), to compute the transversals with minimal cost.

A classical algorithm for computing the minimal transversals of a hypergraph is presented in [BER 89, MAN 94, EIT 95]. The algorithm is incremental and works in n steps, where n is the number of edges of the hypergraph. Starting from an empty set of transversals, the basic idea is to explore the edges of the hypergraph, one edge at each step, and to generate a set of candidate transversals by computing all the possible unions between the candidates generated in the previous step and each vertex of the considered edge. At each step, the non-minimal candidate transversals are pruned.
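A possible rendering of this incremental scheme is the following Python sketch, under the assumption that edges are given as sets of vertices:

    def minimal_transversals(edges):
        candidates = {frozenset()}              # no edge processed yet
        for edge in edges:
            extended = set()
            for t in candidates:
                if t & edge:                    # t already meets this edge
                    extended.add(t)
                else:                           # extend t with each vertex
                    extended.update(t | {v} for v in edge)
            # prune the non-minimal candidates
            candidates = {t for t in extended
                          if not any(u < t for u in extended)}
        return candidates

On the non-empty edges of Example 4, this sketch returns the single minimal transversal {VFromTravel, VHotel}.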

So, a naive approach to compute the minimal transversals with minimal cost would be to compute all the minimal transversals, using such an algorithm, and then to choose those transversals which have the minimal cost. The algorithm computeBCov presented here improves on the naive approach by using an additional


pruning criterion for reducing the number of candidates in the intermediate steps of such a classical algorithm, while ensuring that the transversals with the minimal cost are still considered. The main idea behind this algorithm is to use a Branch-and-Bound-like enumeration of transversals.

Algorithm 1 computeBCov (sketch)

Require: An instance BCOV(T, Q) of the best covering problem.
Ensure: The set of the best covers of Q using T.
1: Build the associated hypergraph HTQ = (Σ, Γ′).
2: Tr ← ∅ – initialization of the minimal transversal set.
3: CostEval ← Σe∈Γ′ minVSi∈e(|MissSi(Q)|) – initialization of CostEval.
4: for all edges E ∈ Γ′ do
5:   Tr ← the newly generated set of candidate transversals.
6:   Remove from Tr the transversals which are non-minimal and those whose cost is greater than CostEval.
7:   Compute a more precise evaluation of CostEval.
8: end for
9: for all X ∈ Tr such that |MissEX(Q)| = CostEval do
10:   return the concept EX.
11: end for

First, a simple heuristic is used to efficiently compute the cost of a good transversal (i.e., a transversal expected to have a small cost) (line 3). This can be carried out by adding, for each edge of the hypergraph, the cost of the vertex that has the minimal cost. The resulting cost is stored in the variable CostEval. As we have, for any set of vertices X = {Si}:

cost(X) = |MissEX(Q)| ≤ Σi |MissSi(Q)| = ΣSi∈X cost({Si})

the evaluation is an upper bound of the cost of a feasible transversal. Then, as we consider candidates in intermediate steps of the algorithm, we can eliminate from Tr any candidate transversal that has a greater cost than CostEval, since that candidate could not possibly lead to a transversal that is better than what we already know (line 6). Then, from each candidate transversal that remains in Tr, we compute a new evaluation of CostEval by considering only the remaining edges (line 7).
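These two steps can be sketched as follows in Python; the names edges, miss_cost (with miss_cost[v] standing for |MissSv(Q)|) and cost (the cost function of Definition 9) are hypothetical:

    def init_cost_eval(edges, miss_cost):
        # Line 3: the cheapest vertex of each edge forms a (not necessarily
        # minimal) transversal, so this sum upper-bounds a feasible cost.
        return sum(min(miss_cost[v] for v in edge) for edge in edges)

    def prune(candidates, cost, bound):
        # Line 6: a candidate already costing more than the bound cannot
        # lead to a transversal better than the one already known.
        return [t for t in candidates if cost(t) <= bound]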

At the end of the algorithm, each computed minimal transversal X ∈ Tr is translated into a concept EX which constitutes an element of the solution to the BCOV(T, Q) problem.

6. Experimentation: The MKBEEM project

The work presented in this paper has been developed and used in the context of the MKBEEM project, which aims at providing electronic marketplaces with intelligent, knowledge-based multilingual services [mkb]. In this project, ontologies are used to


provide a consensual representation of the electronic commerce field in two typical domains (Tourism and Mail order). The MKBEEM ontologies are structured in three layers, as shown in Figure 2.

Figure 2. Knowledge representation in the MKBEEM system (e-service level: the e-services ontology; global and domain ontologies: the MKBEEM global ontology with the Tourism and Mail order domain ontologies; source level: the sources descriptions, e.g., SNCF, B&B, Ellos, ...).

The global ontology describes the common terms used in the whole MKBEEM platform, while each domain ontology contains specific concepts corresponding to one of the domains of the MKBEEM partners (e.g., tourism, mail orders, etc.). The sources descriptions specify the providers' competencies, i.e., the description of the contents of the providers' information sources. Finally, all the offers available in the MKBEEM platform are integrated and described in the e-services ontology. Whereas in many e-commerce platforms e-services are associated with providers, we have defined an e-service as a provider-independent offer available on a given e-commerce platform. Example 1, given in section 2, is a typical mediation instance in the context of this project: the user poses queries in terms of the "integrated schema" (i.e., e-services and domain ontology) rather than directly querying specific provider information sources. This enables users to focus on what they want, rather than worrying about how and from where to obtain the answers. Then, to effectively handle mediation tasks, the MKBEEM system relies on two reasoning mechanisms:

– the first makes it possible to reformulate users' queries over the domain ontology in terms of e-services. The aim here is to allow users/applications to automatically discover the available e-services that best meet their needs, to examine their capabilities and possibly to complete missing information;

– the second, called query plan generation, takes place after the first step and makes it possible to reformulate a user query, expressed as a combination of e-services, in terms of provider views. The aim of this second step is to allow the identification of the views that are able to answer the query (knowing that, afterwards, the query plans will be translated into database queries via the corresponding wrappers).

While the second reasoning mechanism, known as query rewriting using views, has already been addressed in the literature [BEE 97, GOA 00], the first is a new problem for which we have proposed a solution in this paper.

The algorithm computeBCov presented in section 5 has been implemented as an integrated component of the MKBEEM prototype. This prototype is built as a set of Enterprise Java Beans (EJB) components that interact with each other. The prototype relies on the Picsel [GOA 00] mediator to handle the query plan generation task. There are also some components dedicated to the interaction with the user interface, built with Java Server Pages (JSPs) or Servlets. Finally, some of these components have the functionality of interacting with the remote (or locally duplicated) databases in the provider information systems.

The MKBEEM prototype has been validated on a pan-European scale (France and Finland), with three basic languages (Finnish, English and French) and two optional languages (Spanish and Swedish), in two distinct end-user fields: 1) business-to-consumer on-line sales, and 2) Web-based travel/tourism services. In our first experiments we used small ontologies (about 500 concepts and 50 e-services) to validate the accuracy of the suggested approach. On-going work is devoted to the assessment of the performance and the scalability of the MKBEEM prototype.

7. Discussion

Existing solutions that achieve dynamic discovery of e-services rely on simple query mechanisms to provide individual services that exactly match the query. Clearly, a semantic match like the one proposed in this paper is beyond the representation capabilities of the emerging XML-based standards and current e-service platforms. For example, the information provided in a UDDI business registration consists of three components: "white pages" (e.g., business name, contact information, ...); "yellow pages", including industrial categorizations based on standard taxonomies; and "green pages", the technical information about services that are exposed by the business. Based on these descriptions, UDDI provides poor search facilities, allowing only a keyword-based search of businesses, services and the so-called TModels on the basis of their names.

To cope with these limitations, there are some proposals of matching algorithms that employ semantic web technology for service description [GON 01, PAO 02]. [GON 01] reports on an experience in building a matchmaking prototype based on a description logic reasoner and operating on service descriptions in DAML+OIL [HOR 02b]. The proposed matching algorithm is based on simple subsumption and consistency tests. [PAO 02] proposes a more elaborate matching algorithm between services and requests described in DAML-S11. The algorithm recognizes various degrees of

11. http://www.daml.org/services/


matching that are determined by the minimal distance between concepts in the concept taxonomy. The problem of capability-based matching has also been addressed by the multi-agent community. A matching algorithm is proposed in [SYC 02] for the language LARKS. This algorithm is similar to the one proposed in [PAO 02] since LARKS identifies a set of filters that progressively restrict the number of services that are candidates for a match. Our work falls in this research stream of approaches that support the location of e-services based on a semantic match between declarative descriptions of services and requests. However, since we view e-service discovery as a rewriting process, our algorithm is able to discover combinations of services that match (cover) a given query. Furthermore, the difference between the query and its rewriting (i.e., rest and miss) is effectively computed and can be used to improve e-service interoperability. Hence our dynamic discovery of e-services appears as the first step towards a dynamic composition of e-services.

From the theoretical point of view, the best covering problem belongs to the general framework for rewriting using terminologies provided in [BAA 00]. This framework is defined as follows: given a terminology T (i.e., a set of concept descriptions), a concept description Q that does not contain concept names defined in T, and a binary relation ρ between concept descriptions, can Q be rewritten into a description E, built using (some of) the names defined in T, such that QρE? Additionally, some optimality criterion is defined in order to select the relevant rewritings. Already investigated instances of this problem are the minimal rewriting problem [BAA 00] and rewriting queries using views [BEE 97, GOA 00]. In the former, ρ is instantiated by equivalence modulo T and the size of the rewriting is used as the optimality criterion. In the latter, which is the problem underlying the query plan generation in MKBEEM, the relation ρ is instantiated by subsumption and the optimality criterion is inverse subsumption [BAA 00]. In this context, the best covering problem is a new instance of the problem of rewriting concepts using terminologies where the goal is to rewrite a description Q into the closest description expressed as a conjunction of (some) concept names in T (hence, ρ is neither equivalence nor subsumption).

We have investigated this problem in the restricted framework of description logics with structural subsumption. These logics ensure that the difference operation is always semantically unique and can be computed using a structural difference operation. This framework appears to be sufficient in the context of the MKBEEM project. However, the languages that are recommended to realize the semantic web vision tend to be more expressive. That is why our future work will be devoted to extending the proposed framework to support the definition of the best covering problem for description logics (for example ALN) where the difference operation is not semantically unique. In this case, the difference operation does not yield a unique result and thus the proposed definition of a best cover is no longer valid. However, based on the very first results we have obtained concerning ALN, we argue that a restricted difference operator can be defined, and the framework extended, so that many practical applications of the dynamic discovery of e-services can be handled with this more expressive logic.

Page 302: BDA’02Mehdi Snene, Pham Thi, Slim Turki, Patrick Valduriez, Genoveva Vargas Solar, Dan Vodislav, José Luis Zechinelli Martini, Karine Zeitouni Comité d’organisation Président

8. References

[BAA 99] BAADER F., KÜSTERS R., MOLITOR R., "Computing Least Common Subsumers in Description Logics with Existential Restrictions", DEAN T., Ed., Proc. of the 16th Int. Joint Conf. on AI, Morgan Kaufmann, 1999, p. 96-101.

[BAA 00] BAADER F., KÜSTERS R., MOLITOR R., "Rewriting Concepts Using Terminologies", Proc. of the Int. Conf. KR, Colorado, USA, Apr. 2000, p. 297-308.

[BEE 97] BEERI C., LEVY A., ROUSSET M.-C., "Rewriting Queries Using Views in Description Logics", YUAN L., Ed., Proc. of ACM PODS, New York, USA, Apr. 1997, p. 99-108.

[BEN 02] BENATALLAH B., DUMAS M., SHENG Q., NGU A., "Declarative Composition and Peer-to-Peer Provisioning of Dynamic Web Services", Proc. of the IEEE Int. Conf. on Data Engineering, San Jose, USA, Jun. 2002, p. 297-308.

[BER 89] BERGE C., Hypergraphs, vol. 45 of North-Holland Mathematical Library, Elsevier Science Publishers B.V. (North-Holland), 1989.

[CAS 01a] CASATI F., SHAN M.-C., "Dynamic and adaptive composition of e-services", Information Systems, vol. 26, num. 3, 2001, p. 143-163.

[CAS 01b] CASATI F., SHAN M.-C., "Models and Languages for Describing and Discovering E-Services", Proc. of SIGMOD 2001, Santa Barbara, USA, May 2001.

[DON 96] DONINI F. M., LENZERINI M., NARDI D., SCHAERF A., "Reasoning in Description Logics", BREWKA G., Ed., Principles of Knowledge Representation, CSLI Publications, 1996, p. 191-236.

[EIT 95] EITER T., GOTTLOB G., "Identifying the Minimal Transversals of a Hypergraph and Related Problems", SIAM Journal on Computing, vol. 24, num. 6, 1995, p. 1278-1304.

[FEN 01] FENSEL D., VAN HARMELEN F., HORROCKS I., MCGUINNESS D., PATEL-SCHNEIDER P. F., "OIL: An Ontology Infrastructure for the Semantic Web", IEEE Intelligent Systems, vol. 16, num. 2, 2001, p. 38-45.

[FEN 02] FENSEL D., BUSSLER C., MAEDCHE A., "Semantic Web Enabled Web Services", Int. Semantic Web Conference, Sardinia, Italy, Jun. 2002, p. 1-2.

[GOA 00] GOASDOUÉ F., LATTÈS V., ROUSSET M.-C., "The Use of CARIN Language and Algorithms for Information Integration: The PICSEL System", IJICIS, vol. 9, num. 4, 2000, p. 383-401.

[GON 01] GONZÁLEZ-CASTILLO J., TRASTOUR D., BARTOLINI C., "Description Logics for Matchmaking of Services", Proc. of the KI-2001 Workshop on Applications of Description Logics, Vienna, Austria, vol. 44, Sep. 2001.

[HAC 02] HACID M.-S., LÉGER A., REY C., TOUMANI F., "Dynamic Discovery of E-Services: A Description Logics Based Approach", Report, LIMOS, Clermont-Ferrand, France, 2002, see http://lisi.insa-lyon.fr/~mshacid/publications.html.

[HEN 00] HENDLER J., MCGUINNESS D. L., "The DARPA Agent Markup Language", IEEE Intelligent Systems, vol. 15, num. 6, 2000, p. 67-73.

[HOR 02a] HORROCKS I., PATEL-SCHNEIDER P. F., VAN HARMELEN F., "Reviewing the Design of DAML+OIL: An Ontology Language for the Semantic Web", Proc. of the 18th Nat. Conf. on Artificial Intelligence (AAAI 2002), 2002, to appear.

[HOR 02b] HORROCKS I., "DAML+OIL: A Reason-able Web Ontology Language", Proc. of the Int. Conf. on Extending Database Technology, Prague, Czech Republic, Mar. 2002, p. 2-13.

[MAN 94] MANNILA H., RÄIHÄ K.-J., The Design of Relational Databases, Addison-Wesley, Wokingham, England, 1994.

[MIC 96] FREDMAN M. L., KHACHIYAN L., "On the Complexity of Dualization of Monotone Disjunctive Normal Forms", Journal of Algorithms, vol. 21, num. 3, 1996, p. 618-628.

[mkb] "MKBEEM web site: http://www.mkbeem.com".

[MOL 98] MOLITOR R., "Structural Subsumption for ALN", report num. LTCS-98-03, Aachen University of Technology, Research Group for Theoretical Computer Science, March 1998.

[PAO 02] PAOLUCCI M., KAWAMURA T., PAYNE T., SYCARA K., "Semantic Matching of Web Services Capabilities", Proc. of the Int. Semantic Web Conference, Sardinia, Italy, June 2002, p. 333-347.

[SYC 02] SYCARA K., WIDOFF S., KLUSCH M., LU J., "LARKS: Dynamic Matchmaking Among Heterogeneous Software Agents in Cyberspace", Autonomous Agents and Multi-Agent Systems, vol. 5, 2002, p. 173-203.

[TEE 94] TEEGE G., "Making the Difference: A Subtraction Operation for Description Logics", DOYLE J., SANDEWALL E., TORASSO P., Eds., Proc. of the Int. Conf. KR, San Francisco, CA, Morgan Kaufmann, 1994.

[Vld01] "The VLDB Journal: Special Issue on E-Services", 10(1), Springer-Verlag Berlin Heidelberg, 2001.

[Wei01] "Data Engineering Bulletin: Special Issue on Infrastructure for Advanced E-Services", 24(1), IEEE Computer Society, 2001.


Session 7
Miscellaneous


Image similarity distances based on quadtrees1

Marta Rukoz* — Maude Manouvrier** — Geneviève Jomier**

* Universidad Central de Venezuela, CCPD, Escuela de Computación, Av. Los Ilustres, Apt. 47002, Los Chaguaramos, 1041 Caracas, Venezuela

[email protected]

** Université Paris Dauphine, LAMSADE, Place du Mal De Lattre de Tassigny, 75775 Paris Cedex 16, France

manouvrier,[email protected]

ABSTRACT. This article presents the Δ distance, a general definition of similarity distance between images represented by quadtrees. The quadtree is a hierarchical structure which, when used for content-based image retrieval, makes it possible to take into account the spatial location of image features (color, texture, etc.). Depending on the weights given to the quadtree nodes and on the distance chosen between these nodes when computing Δ, several distances between quadtrees, or several visual similarity distances between images, can be defined. Existing quadtree-based image similarity measures appear as particular cases of the Δ distance, and new distances between images can also be defined from it.

KEYWORDS: Image database, distance between quadtrees, visual similarity of images, similarity of image regions, content-based image retrieval.

1. This work was carried out in the framework of a CNRS - CONICIT international cooperation (agreements 8680 and 10058).


1. Introduction

In content-based image retrieval (CBIR) systems [ASL 99, BAC 96, CHA 00, FLI 95, ORI 01], the result of a search is a set of images similar to a query image rather than a set of images exactly matching the search criteria [TAN 01]. Content-based image retrieval relies on the similarity of visual features of images such as color [LU 94, SMI 02, STE 02], texture [MAN 02a, PUZ 99, SMI 94] or shape [CHA 00, KIM 02]. The distance function used to evaluate the similarity between images depends on the search criteria, but also on the representation of the image features [LI 02, LU 99]. The main idea is generally to associate with each image a multi-dimensional vector representing the features of the image, and to measure the similarity of images using a distance function between vectors [CAS 02, KAK 00]. The query image is also represented by a feature vector, and the query result returns all the images whose vector has been matched with that of the query image [ALB 00]. The query therefore becomes a range query of the form "Find the images for which the distance between their feature vector and the feature vector of the query image lies within a given interval", or a nearest neighbour query of the form "Find the vectors closest to the feature vector of the query image, given a tolerance threshold", in a multi-dimensional vector space [CHE 98b, KAT 97] or in a metric space [CIA 97, TRA 02].

To each image corresponds a set of attributes called the index or signature of the image [NAS 97]. Several image retrieval systems, such as QBIC [FLI 95], characterize images by a global signature (e.g., the color histogram of the whole image) [RUI 99]. However, for many applications, the global characterization of images does not always offer satisfactory results. In the medical domain, for example, local characterizations (of image regions) are necessary, because the number of pixels representing a pathology is small compared with the total number of pixels of the image, and global features do not sufficiently differentiate the images of the database [SHY 98]. Moreover, global features do not take into account the location of the pixels and regions of interest. To overcome this limitation and take feature location into account when computing image similarity, several approaches [ALB 00, AHM 97, GUP 97, LU 94, LIN 01, MAL 99] use a spatial structure, the quadtree [SAM 84]. Such a structure makes it possible to store the features of the various image regions and to filter images by progressively increasing the level of detail: each image is first compared globally with the query image; then, if their global similarity is below a given threshold, the homologous image sub-regions are compared, and so on down to regions of minimal size. This indexing technique can be used to compare images globally [LU 94], but it also allows queries on image regions such as "Find all the images of the database whose north-west and south-east regions are similar to those of the query image" [LIN 01, MAL 99].

In this article, we propose a general definition of distance between quadtrees, called the Δ distance. This distance is a linear combination of distances between quadtree nodes. Depending on the weights associated with the nodes and on the distance chosen between nodes, several families of distances can be defined from the Δ distance. More precisely, we define two families of distances based on the structure of quadtrees, called the S distance and the Q distance (Q for quadrant), and a family of visual similarity distances between images, called the V distance (V for visual). The first two families compare the quadtrees representing the images. The last family visually compares the images using their quadtree representation. The interest of the Δ distance is that its definition is general, which means that: (1) some distances proposed in the image retrieval domain [AHM 97, GUP 97, LU 94, LIN 01, MAL 99] appear as particular cases of the Δ distance, and (2) new distances can be defined from the Δ distance.

The organization of this article is as follows. Section 2 recalls the main principles of the quadtree. Section 3 presents the Δ distance. Section 4 defines several families of distances derived from the Δ distance. Finally, section 5 compares the distances proposed in this article with existing measures and concludes.

Figure 1. An example of an image database (images 1 to 8).


2. Basic concepts of the quadtree

Figure 2. The quadtrees of the first four images of figure 1 (panels: identification of the quadtree nodes; the quadtrees of images 1 to 4).

The quadtree is a hierarchical structure built by recursive subdivisions of space into four disjoint quadrants [SAM 84]. This structure is widely used to represent images, that is, to store the images themselves [MAN 02b] or to store and index image features [LU 94, LIN 01, MAL 99]. In particular, the quadtree makes it possible to take into account, when searching for similar images, the spatial location of image features such as color [ALB 00] or contour [CHA 00], or the location of objects of interest [AHM 97].

To be represented by a quadtree, an image is recursively decomposed into four disjoint quadrants of the same size, according to a splitting criterion (e.g., color homogeneity), in such a way that each node of the quadtree represents a quadrant of the image. The root node of the tree represents the whole image. If an image is not homogeneous with respect to the splitting criterion, the root node of the quadtree representing the image has four child nodes representing the four first quadrants of the image. A node is a leaf if the corresponding quadrant of the image is homogeneous with respect to the splitting criterion; otherwise the node is internal.

There exist several functions for assigning an identifier to a quadtree node [SAM 84]. These functions make it possible to easily retrieve, from the identifier of the image and of the quadtree node, the associated quadrant in the image. In this article, we use a Z-order, following the directions North-West, North-East, South-West and South-East and associating with them the identifiers 0, 1, 2 and 3 respectively. To begin with (see the left of figures 1 and 2), the integer 0 identifies the whole image and the root node of the quadtree. The integers 0 to 3, preceded by 0, identify the four first quadrants of the image and the four associated quadtree nodes. Recursively, the children of a node identified by p are identified by pj, with j an integer taking its value in [0, 3]. Two nodes with the same identifier in two different quadtrees are called homologous nodes.
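A minimal Python sketch of this labelling scheme, with node identifiers represented as strings:

    # Children of a quadtree node in Z-order: NW, NE, SW, SE.
    def children(node_id):
        return [node_id + q for q in "0123"]

    print(children("03"))   # ['030', '031', '032', '033'], as in figure 2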

Figure 2 presents an example of four quadtrees representing the first four images of figure 1. To ease understanding, we have used a simple example where the splitting criterion is color homogeneity, for black and white images.

The internal nodes of a quadtree may contain information such as the color histogram or the signature of the corresponding region, the number of points of interest in the associated region, or information about their child nodes. In figure 2, the internal nodes contain the proportion of black nodes in the subtree (see the underlined values in figure 2). In the approach of [AHM 97], images are characterized by points of interest. The splitting criterion is, in this case, that each leaf node of the quadtree corresponds to an image quadrant containing at most one point of interest. The internal nodes then contain the number of points of interest located in the subtree of which they are the root.

Some approaches [GUP 97, LU 94, LIN 01, MAL 99] index images by quadtrees whose number of levels is fixed (generally less than 3). Each image is, in this case, represented by a complete balanced quadtree. Each node (internal or leaf) of the quadtree contains information about the corresponding region of the image, such as the histogram of the visual features of the region (color, texture, shape or a combination of these features). Such a structure, called multi-level histograms, makes it possible to filter images progressively during the search. Figure 3 gives an example of multi-level color histograms, each node of the quadtree containing the color histogram of the corresponding region of the image.

3. Quadtree-based distances

3.1. Distances between quadtree nodes

Figure 3. An example of multi-level color histograms.

To each quadtree node corresponds a quadrant in the associated image. A quadtree node may contain any information about the corresponding region: a color histogram, a feature vector, points of interest, the sub-image (compressed or not) corresponding to the region, or Fourier descriptors, for example. It is therefore possible to compute distances between quadtree nodes.

We denote by d_n(I, J) a normalized distance (d_n(I, J) ∈ [0, 1]) between two homologous nodes n appearing in the quadtrees I and J. The distance d_n can be any distance [CAS 02, DIG 99, PUZ 99] applied to quadtree nodes, depending on the content of the nodes (e.g., a piece of image or the signature of a piece of image). The choice of d_n depends on the criterion used to split the image into a quadtree, as well as on the value given to the leaf nodes and the internal nodes of the quadtree. If images are represented by multi-level color histograms (see figure 3), then d_n can be chosen among the existing distances between color histograms [SMI 02, STE 02]. Some image signatures, such as the Fourier descriptors for shape [LU 99, RUI 97, RUI 99], are representations invariant under geometric transformations (translation, rotation or scaling). If images are represented by quadtrees storing the Fourier descriptors associated with the various image quadrants, then the invariance under geometric transformations applies to each quadrant associated with a quadtree node.

When two homologous nodes are both leaves or both internal, the distance d_n corresponds to a distance between node values. When a node n is internal in one quadtree, for example I, but is a leaf in the other, for example J, several alternatives appear:

1) d_n(I, J) = 1, because the nodes are not of the same type (internal and leaf). In this case, for all the child nodes of n in the quadtree I, d_nj(I, J) = 1, because the nodes nj, children of n, exist in the quadtree I but do not exist in the quadtree J.

2) The internal node n, located in the quadtree I, may contain a value corresponding to the subtree of which it is the root. This value is used to establish the distance d_n between the two subtrees rooted at n in the quadtrees I and J (the subtree rooted at n being a leaf in the quadtree J).

3) The node n, a leaf in the quadtree J, can be transformed into an internal node for the duration of the similarity distance computation, by giving it four child nodes whose value depends on the splitting criterion. To do so, the homogeneous leaf node n is artificially replaced by a subtree made of four leaf nodes that are homogeneous with respect to the splitting criterion. In this case, the distance d_nj can be computed for each of the child nodes of n.

For example, in figure 2, d_033(1, 3) can be computed in different ways. Because the node 033 is internal in quadtree 1 and is a leaf in quadtree 3, we may have d_033(1, 3) = 1. In this case, for all the leaf nodes 033j, d_033j(1, 3) = 1, since these nodes do not exist in quadtree 3. If, during the distance computation, the node 033 of quadtree 3 is split into four black leaf child nodes, then d_033(1, 3) = 0 since the two homologous nodes are internal, d_0330(1, 3) = d_0331(1, 3) = d_0332(1, 3) = 1, and d_0333(1, 3) = 0 since only the node 0333 has the same color in both quadtrees.
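As an illustration, alternative (1) above can be sketched as follows in Python; the tree representation (a mapping from node identifiers to nodes carrying is_leaf and value attributes) and the normalized value distance leaf_distance are hypothetical:

    def d_n(n, tree_i, tree_j, leaf_distance):
        a, b = tree_i.get(n), tree_j.get(n)
        if a is None or b is None:
            return 1.0                      # n exists in only one quadtree
        if a.is_leaf != b.is_leaf:
            return 1.0                      # nodes of different types
        return leaf_distance(a.value, b.value)   # normalized, in [0, 1]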

3.2. General definition of the Δ distance between images

The Δ distance is a distance between images represented by quadtrees. Its definition is general and allows several families of distances between quadtrees to be distinguished. The Δ distance between two images I and J is defined as a sum of distances d_n(I, J) between the nodes of the quadtrees representing the images I and J, weighted by coefficients λ_n, λ_n ≥ 0:

Δ(I, J) = (Σ_n λ_n d_n(I, J)) / (Σ_n λ_n)    [1]

– n is the identifier of a node taken from the union of the identifiers of the nodes appearing in the quadtrees of the images I and J. We denote by N the cardinality of the set of node identifiers.

– d_n(I, J) is a normalized distance between the homologous nodes n of the quadtrees I and J, as defined in section 3.1.

– λ_n is a positive coefficient representing the weight of the node n in the computation of the Δ distance. The choice of each weight λ_n depends on the needs of the user, i.e., on the importance the user wishes to give to certain image quadrants with respect to others in the computation of the Δ distance. For example, if certain quadrants must not appear in the computation of Δ, they can be associated with a null weight λ_n. If the surface of the quadrants must come into play in the computation of Δ, each coefficient λ_n must be proportional to the surface represented by the quadrant with respect to the whole image. The coefficients λ_n make it possible to define distances which are particular cases of the Δ distance (see sections 3.3 and 4).

Δ is a distance, as shown by the following proof: for any image I represented by a quadtree, Δ(I, I) = 0, and Δ satisfies the symmetry property (see item 2 below) as well as the triangle inequality (see item 3 below).

Proof. Δ is a distance since it corresponds to a linear combination of distances between homologous nodes (d_n). Consequently, for all I, J and K, three images represented by quadtrees:

1) Δ(I, I) = 0 because, ∀n, d_n(I, I) = 0.
2) Δ(I, J) = Δ(J, I) because, ∀n, d_n(I, J) = d_n(J, I).
3) Δ(I, K) ≤ Δ(I, J) + Δ(J, K) because, ∀n, d_n(I, K) ≤ d_n(I, J) + d_n(J, K), and therefore Σ_n λ_n d_n(I, K) ≤ Σ_n λ_n d_n(I, J) + Σ_n λ_n d_n(J, K), since λ_n ≥ 0.

The Δ distance is normalized by the sum of the coefficients λ_n. In the rest of the article, we only consider normalized distances.

REMARK. Δ(I, J) = 0 means that "all the homologous nodes n of the quadtrees I and J with non-null weight λ_n have a null distance d_n (∀n such that λ_n ≠ 0: d_n(I, J) = 0)", and Δ(I, J) = 1 means that "all the homologous nodes n of the quadtrees I and J with non-null weight λ_n have a distance d_n equal to 1 (∀n such that λ_n ≠ 0: d_n(I, J) = 1)".
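A minimal Python sketch of formula [1]; quadtrees are represented as mappings from node identifiers to node contents, the weight function stands for the coefficients λ_n, and d_n for the node distance of section 3.1:

    def delta(tree_i, tree_j, weight, d_n):
        nodes = set(tree_i) | set(tree_j)   # union of the node identifiers
        total = sum(weight(n) for n in nodes)
        return sum(weight(n) * d_n(n, tree_i, tree_j)
                   for n in nodes) / total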

3.3. Definitions of the distances Δ^(ℓ), Δ̄^(ℓ) and Δ_R

Depending on the choice of the coefficients λ_n and/or on the choice of the distance d_n, the Δ distance can become a distance noted Δ^(ℓ), which only takes into account the quadtree nodes located above a level ℓ (see section 3.3.1), a lower bound of distances, noted Δ̄^(ℓ) (see section 3.3.2), or a region distance, noted Δ_R (see section 3.3.3).

3.3.1. The Δ^(ℓ) distance

When the coefficients λ_n are equal for all the homologous nodes n (for example λ_n = 1), all the nodes have the same weight in the computation of the Δ distance. However, it is possible to choose the weight of the nodes according to their level in the quadtree, in such a way that a difference between two quadtrees appearing at a level close to the root has a bigger impact on the computation of the distance than a difference located at a deeper level of the tree. We denote by Δ^(ℓ)(I, J) the Δ distance between two images I and J represented by quadtrees, for which a positive coefficient λ_n is associated with the quadtree nodes located up to level ℓ (the root being at level 0) and a null coefficient λ_n with the nodes located at a deeper level l′ (l′ > ℓ). The Δ^(ℓ) distance does not take into account the details of the images located below a given level ℓ in the quadtrees representing the images.

3.3.2. The Δ̄^(ℓ) distance

The quadtree is a hierarchical structure which makes it possible to compute approximate distances between images while avoiding reading the whole tree. In particular, this allows a fast filtering of the images that are the least similar (with respect to a given distance) to the query image [ALB 00, GUP 97, LU 94, LIN 01]. To use the quadtree as a pre-filter, the principle consists in storing, in each internal node n of the quadtree I, some information p_n^I about the subtree of which n is the root, the value of p_n^I depending on the splitting criterion. Comparing the values p_n^I and p_n^J makes it possible to compute a lower bound of the distances between the homologous subtrees rooted at n. This lower bound of distances is noted d̄_n^(ℓ)(I, J), with ℓ the level of n in the quadtrees I and J.

For example, the internal nodes of the quadtrees of figure 2 contain the proportion of black nodes in the subtree of which they are the root, relative to the total number of possible nodes in a balanced quadtree with the same number of levels. The values p_n^1 are such that p_033^1 = 1/4 (only one child node of 033 is black in quadtree 1), p_03^1 = 1/16 (only one node is black among the 16 possible nodes of quadtree 1 balanced up to level 3), and p_0^1 = 15/64. If we take d̄_n^(ℓ)(I, J) = |p_n^I - p_n^J|, and if, when the node n is a leaf in the quadtree I, p_n^I = 1 if its color is black and 0 if its color is white, then d̄_033^(2)(1, 2) = |1/4 - 1/2| = 1/4 and d̄_033^(2)(1, 3) = |1/4 - 1| = 3/4.

The computation of a lower bound of the Δ distances between two images I and J represented by quadtrees, noted Δ̄^(ℓ)(I, J), uses the lower bounds of the distances d̄_n^(ℓ). We obtain the following formula:

Δ̄^(ℓ)(I, J) = (Σ_n λ_n d̄_n^(ℓ)(I, J)) / (Σ_n λ_n)    [2]


For any quadtree level ℓ, Δ̄^(ℓ)(I, J) ≤ Δ̄^(ℓ+1)(I, J): Δ̄^(ℓ) is an increasing function of ℓ. The real value of the distance between the images can only be established by an exhaustive comparison of their quadtrees, but it is approximated more and more precisely by the approximations computed level by level in the trees, using a breadth-first exploration.
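The pre-filtering idea can be sketched as follows in Python; the helper lower_bound(q, img, l), returning the level-l lower bound of the distance between the query q and the image img, is hypothetical:

    def filter_candidates(q, images, lower_bound, max_level, threshold):
        kept = list(images)
        for level in range(max_level + 1):
            # The bounds grow with the level, so an image discarded here
            # is guaranteed to be farther from q than the threshold.
            kept = [img for img in kept
                    if lower_bound(q, img, level) <= threshold]
        return kept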

3.3.3. The Δ_R distance

It is also possible to compute a Δ distance between homologous regions of images, by associating positive coefficients λ_n with the nodes representing the region in the quadtrees of the images and null coefficients λ_n with the other nodes. We denote by Δ_R(I, J) the Δ distance between the images I and J represented by quadtrees, for which a positive coefficient λ_n has been associated only with the nodes n representing the region R. The region R may be composed of sub-regions R_i, each region R_i having to be a rectangular region (in order to correspond to a set of quadtree nodes). In this case, the invariance under geometric transformations (depending on the choice of the node contents and of the distance d_n; see section 3.1) will hold for each sub-region independently of the others, but not for the whole region if it is composed of sub-regions. To address this problem, it is possible, as in [MAL 99], to transform the initial query image into several query images, each query image representing, for example, the result of a translation applied to the regions selected in the initial query image.

4. Particular cases of the Δ distance

Depending on the values of the coefficients λ_n and on the choice of the distance d_n, several families of distances can be defined. We denote by S a distance between the structures of quadtrees whose computation does not take into account the value of the leaf nodes (see section 4.1). We denote by Q a distance between the structures of quadtrees whose computation does take into account the value of the leaf nodes (see section 4.2). Another distance, noted V, makes it possible to visually compare images using their quadtrees (see section 4.3). As in section 3.3, it is possible, from the distances S, Q and V, to define approximate distances, noted respectively S^(ℓ), Q^(ℓ) and V^(ℓ), lower bounds of distances, noted respectively S̄^(ℓ), Q̄^(ℓ) and V̄^(ℓ), and distances between image regions, noted respectively S_R, Q_R and V_R.

4.1. The S distance

Comparing the structure of two quadtrees representing images, without taking into account the value of the leaf nodes, makes it possible to know: (1) whether the decompositions of two images according to the same criterion are identical, or (2) whether the decompositions of the same image according to two different criteria are identical. This type of distance, noted S, has value 0, S(I, J) = 0, when the internal nodes and the leaf nodes are at exactly the same positions in the quadtrees representing the images I and J. For example, images 1 and 5 of figure 1 have a null S distance (S(1, 5) = 0) since image 5 is the complement of image 1. The quadtrees of these two images have exactly the same structure, but the leaf node values are inverted (the quadtree of image 1 is shown in figure 2). The S distance is only useful when the images are not all represented by balanced quadtrees with a fixed number of levels.

To compute the S distance between two images I and J represented by quadtrees, the distance d_n(I, J) between quadtree nodes takes only two values:

– d_n(I, J) = 0 when the homologous nodes n are both internal or both leaves (with or without the same value) in the quadtrees I and J;

– d_n(I, J) = 1 when the node n is internal in one quadtree and a leaf in the other, or when n exists in one tree and does not exist in the other.

A small sketch of this node comparison is given below.
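A minimal sketch in Python; each quadtree is represented as a mapping from node identifiers to a type flag, "internal" or "leaf", and n is taken from the union of the identifiers, so it appears in at least one tree:

    def d_n_structure(n, tree_i, tree_j):
        if (n in tree_i) != (n in tree_j):
            return 1          # n exists in only one of the two quadtrees
        return 0 if tree_i[n] == tree_j[n] else 1   # compare node types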

For example, if in figure 2 the coefficients λ_n have the same value for all the homologous nodes n, then the distance between quadtrees 1 and 3 is S(1, 3) = 5/13. Indeed, there are five nodes of different types (internal or leaf), out of thirteen nodes in total, between quadtrees 1 and 3: d_033(1, 3) = 1 and d_033j(1, 3) = 1, with j ∈ [0, 3]. In contrast, the distance S^(1) between images 1 and 3 is S^(1)(1, 3) = 0, since the difference between the quadtrees of the images only appears from level 2 onwards. At level 2, S^(2)(1, 3) = 1/9, because 9 nodes appear in the union of the nodes of quadtrees 1 and 3 up to level 2, and only the node 033 is internal in one quadtree and a leaf in the other (d_033(1, 3) = 1). This means that 11% of the homologous nodes are not of the same type between quadtrees 1 and 3 up to level 2.

When computing the distance $d_S$, the values of the nodes are not taken into account. Consequently, the images may be visually very different (as is the case for images 1 and 7 of Figure 1, for example). The distance $d_V$ is another distance, derived from the definition of the distance $d_S$, which does take the values of the nodes into account.

4.2. The distance $d_V$

The distance $d_V$ compares two quadtrees not only from the point of view of their structure, but also from the point of view of the values of their nodes. The distance $d_V$ between two quadtrees $x$ and $y$ is null, $d_V(x,y) = 0$, when all the internal nodes and all the leaf nodes are at the same positions in the two trees and when all the nodes have the same value in the two trees. In this case:

– $\delta_N(x,y) = 0$ when the homologous nodes $N$ are both internal or both leaves, with the same value, in the quadtrees $x$ and $y$.

– $\delta_N(x,y) = 1$ when $N$ is a leaf in one quadtree and is internal (with no value in the internal node) in the other tree, or when $N$ exists in only one of the quadtrees $x$ and $y$.

– $\delta_N(x,y) \in \,]0,1]$ in the other cases, that is, when the homologous nodes $N$ are both internal or both leaves with different values. $\delta_N(x,y)$ is then the distance between the values stored in the nodes $N$. The distance $\delta_N$, which allows the homologous nodes $N$ of the two quadtrees to be compared, must be based on the decomposition criterion used to represent the images as quadtrees. As before, any coefficients $\alpha_N$ can be chosen, thus defining a family of distances $d_V$.

When $\delta_N(x,y) \in \{0,1\}$ and the coefficients $\alpha_N$ are equal for all nodes $N$, $d_V(x,y)$ corresponds to the proportion of homologous nodes having different values or types between the quadtrees representing the images $x$ and $y$ (see the examples below).

For example, the distance $d_V$ between images 1 and 7, whose distance $d_S$ is null (the images being complementary), is $d_V(1,7) = 10/13$. Indeed, only the internal nodes have a null distance $\delta_N$, the value of the distances $\delta_N$ between the leaf nodes being 1. The distance $d_V$ between the quadtrees 1 and 2 of Figure 2 is $d_V(1,2) = 1/13$. Indeed, the homologous nodes 0330 do not have the same value in the quadtrees 1 and 2. On the other hand, all the other homologous nodes are of the same type in the two trees and, when they are leaves, have the same value: $\delta_0(1,2) = \delta_{0i}(1,2) = \delta_{03i}(1,2) = \delta_{0331}(1,2) = \delta_{0332}(1,2) = \delta_{0333}(1,2) = 0$, with $i \in [0,3]$. If only the homologous nodes located between the root and level 2 of the quadtrees 1 and 2 are taken into account, we obtain $d_V^{2}(1,2) = 0$. Indeed, the differences between the two trees only appear at level 3. On the other hand, if an approximate distance is computed using the values of the internal nodes, we obtain:

$\hat{d}_V^{2}(1,2) = |1/4 - 1/2| \times 1/9 = 1/36$. We indeed have $\hat{d}_V^{2}(1,2) \le \hat{d}_V^{3}(1,2)$, with $\hat{d}_V^{3}(1,2) = d_V(1,2)$.

Several approaches, presented and compared in [MAN 02b], propose storing similar images organized as quadtrees. The main objective of these approaches is to optimize the storage space of the images by sharing the common parts of their quadtrees. Consequently, the distance $d_V$ can be used in these approaches to organize the images of the database as an image hierarchy in which an image $x$ is a leaf of an image $y$ if, for every image $z$ of the hierarchy, $d_V(x,y) \le d_V(x,z)$. The smaller the distance $d_V$ between two images, the more the images have in common, and hence the more node sharing is possible between their quadtrees - see [MAN 02b] for more details.

However, two images whose quadtrees are very different, $d_V(x,y)$ close to 1, may appear visually very similar: for example, when the decomposition criterion is color homogeneity, a completely white image and a white image containing a single black pixel. This is the case for images 5 and 6 of Figure 1, which have a distance $d_V(5,6) = 0.99$, the quadtree of image 5 being composed of a single white root leaf. It is therefore interesting to define a visual distance, denoted $d_I$, such that if $d_I(x,y)$ is close to zero then the images $x$ and $y$ are visually similar from the point of view of the decomposition criterion, even though their quadtrees may be very different. The following section presents a visual distance based on the distance $d_V$ defined above.

4.3. The distance $d_I$

We denote by $d_I$ a family of visual distances between images, computed using the quadtrees representing the images. When computing the distance $d_I(x,y)$, the quadtrees of the two images must be completed so as to have the same structure ($d_S(x,y) = 0$ - see Section 4.1): when a node $N$ is internal in a quadtree $x$ and is a leaf in a quadtree $y$, the node $N$ becomes, for the time of the computation of the distance $d_I$, internal in the quadtree $y$ and is given four leaf children whose value depends on the decomposition criterion. The computation of the distance $d_I$ only takes into account the distances $\delta_N$ between leaf nodes; $\delta_N = 0$ for all internal nodes, the quadtrees $x$ and $y$ having exactly the same structure during the computation of the distance $d_I$.

To compare the surface of the images $x$ and $y$, the coefficients $\alpha_N$ associated with the leaf nodes can be chosen proportional to the surface of the corresponding quadrants in the image. Consequently, $\alpha_N = 4^{-l}$ if $N$ is located at level $l$ of the quadtree (the root of the tree being at level 0), taking as hypothesis that the surface of the whole image is 1. The deeper a node is located in the tree, the smaller the value of its coefficient, and hence its weight in the computation of the distance $d_I$. For example, images 5 and 6 of Figure 1 have a distance $d_I(5,6) = 0.004$ because the black region in image 6 represents 1/256th of the surface of the whole image. On the other hand, the distance $d_I$ between images 1 and 7 is $d_I(1,7) = 1$ since the images are complementary.
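A corresponding sketch for the visual distance, under the assumptions that leaf values are binary (0 = white, 1 = black), that $\delta_N$ between leaves is the absolute difference of their values, and that a leaf facing an internal node is virtually split into four children carrying its value; all names are illustrative, not the paper's.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    value: float = 0.0                      # leaf value: 0 = white, 1 = black
    children: List["Node"] = field(default_factory=list)  # [] for a leaf

    def is_leaf(self) -> bool:
        return not self.children

def d_I(x: Node, y: Node, level: int = 0) -> float:
    """Visual distance: leaf-value differences weighted by quadrant
    surface, alpha_N = 4**(-level), the whole image having surface 1."""
    if x.is_leaf() and y.is_leaf():
        return abs(x.value - y.value) * 4.0 ** (-level)
    # Complete the structure on the fly: a leaf facing an internal node
    # is replaced by four children that inherit its value.
    cx = x.children if not x.is_leaf() else [Node(x.value)] * 4
    cy = y.children if not y.is_leaf() else [Node(y.value)] * 4
    return sum(d_I(a, b, level + 1) for a, b in zip(cx, cy))

For instance, for images 5 and 6 above, the single differing leaf at level 4 contributes $4^{-4} = 1/256 \approx 0.004$, matching the value given in the text.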

For example, the distance $d_I$ between images 1 and 2, whose quadtrees are shown in Figure 2, is 0.02. Indeed, $d_I(1,2) = \delta_{0330}(1,2) \times (1/64) = 1/64$, because $\delta_{0330}(1,2) = 1$ (the value of the distances $\delta_N$ for the other nodes $N$ is null) and $\alpha_{033i} = 4^{-3}$. Looking more closely at the distance of the region 033 in images 1 and 2, we obtain $d_I^{033}(1,2) = 1/4$, since only the node 0330 differs between the subtrees rooted at 033 in the quadtrees 1 and 2. On the other hand, $d_I^{033}(1,3) = 3/4$.

To compute the distance between images 1 and 3, the quadtree 3 must be completed: the node 033 is therefore divided into four black leaf children. Thus, $d_I(1,3) = (\delta_{0330}(1,3) + \delta_{0331}(1,3) + \delta_{0332}(1,3)) \times (1/64) = 3/64 = 0.05$. If the details located below level 2 of the quadtrees 1 and 3 are of no interest, we obtain $d_I^{2}(1,3) = 1/16$, because the node 033, located at level 2, is internal in the quadtree 1 but is a leaf in the quadtree 3. The corresponding quadrant represents 1/16th of the surface of the image. If the value contained in the internal node 033 of the quadtree 1 is used, we obtain $\hat{d}_I^{2}(1,3) = |1 - 1/4| \times 1/16 = 3/64$.

4.4. Examples of use of the distance $d$

Figure 4. A query image and its quadtree representation.

                    Image 1  Image 2  Image 3  Image 4  Image 5  Image 6  Image 7  Image 8
$d_S$                 0        0       0.38     0.38      1       0.88      0        0
$d_S^{lev}$           0        0       0.11     0.11      1       0.85      0        0
$d_V$                0.15     0.08     0.46     0.38      1       0.88     0.61     0.77
$d_V^{lev}$          0.11     0.11     0.22     0.11      1       0.77     0.55     0.66
$d_I$                0.08     0.06     0.09     0.03     0.34     0.34     0.92      1
$d_I^{lev}$          0.06     0.06     0.125    0.06     0.31     0.25     0.87      1
$\hat{d}_I^{lev}$    0.08     0.06     0.09     0.03     0.34     0.34     0.92     0.93

NB: When computing the distances $d_S$ and $d_V$, the coefficients $\alpha_N$ have the same value for all nodes $N$. When computing the distance $d_I$, the coefficients $\alpha_N$ are proportional to the surface of the image quadrants $N$.

Table 1. Values of the distances between the query image of Figure 4 and the images of Figure 1

The distances defined in this article can be used for global content-based image retrieval according to various criteria (visual comparison of the images, comparison of the structure of their quadtree representation, etc.) or for region-based image similarity retrieval. In this case, the similarity of the images must be defined as a function of the chosen distance (that is, as a function of the chosen coefficients $\alpha_N$ and distance $\delta_N$) and of a given threshold. Suppose, for example, that a user wishes to retrieve all the images of Figure 1 that are similar to the query image $q$ shown in Figure 4. If the user considers that two images are similar when their distance $d_V$ is smaller than 0.3, then the result of the query corresponds to images 1 and 2 (the values of the distances between the images are given in Table 1). If instead two images are said to be similar when their distance $d_V^{lev}$ (here with lev = 2) is smaller than 0.3, then the result is the set of images 1 to 4. This means that the differences between the quadtree of the query image and those of images 3 and 4 are located below level 2. If the user considers that two images are similar when their distance $d_I$ is smaller than 0.1, then the result of the query is the set of images 1 to 4. The result of the query is the same when the distance used is $\hat{d}_I^{2}$, but the cost is lower since only the quadtree nodes located up to level 2 have been inspected. On the other hand, if the distance used is $d_I^{2}$, then image 3 does not belong to the result of the query. This means that, from a visual point of view, the differences between the query image and image 3 are located in the regions occupying the least surface (here 1/64th of the surface of the image).

By comparing, for example, the query image $q$ with images 5 and 6, we observe that the quadtrees are different both from the point of view of the structure of the trees (position of the internal nodes and of the leaf nodes in the trees) and from the point of view of the values of the nodes ($d_S(q,5) = d_V(q,5) = 1$ and $d_S(q,6) = d_V(q,6) = 0.88$), whereas the visual distance between the images is 0.34 ($d_I(q,5) = d_I(q,6) = 0.34$). If the query image $q$ is compared with images 7 and 8, we see that the structures of the quadtrees are identical ($d_S(q,7) = d_S(q,8) = 0$), that their quadtrees differ on the values of the leaf nodes ($d_V(q,7) = 0.61$ and $d_V(q,8) = 0.77$) and that their visual distance is close to 1.

Another example: suppose that the user associates with the quadrants 00, 030, 0330 and 0333 of the query image (the black region of the image) the following coefficients: $\alpha_{00} = 1/4$, $\alpha_{030} = 1/16$ and $\alpha_{0330} = \alpha_{0333} = 1/64$ (corresponding to the surface of the quadrants relative to the surface of the whole image), and a coefficient $\alpha_N = 0$ with the other image quadrants. In this case, the user can compare the black region of the query image with the same region in the other images of the database, using a distance $d_I^R$. The distance $d_I^R$ between the black region of the query image and that of image 4 is null (the region is black in both images). On the other hand, the distance $d_I^R$ between the query image and images 2 and 3 is $1/16 \times 32/11 = 0.18$ (the factor 32/11 normalizes by the total weight $1/4 + 1/16 + 2/64 = 11/32$ of the region coefficients). Indeed, the node 030 is white in the quadtrees of images 2 and 3. The distance $d_I^R$ is $(1/16 + 1/64) \times 32/11 = 0.23$ between the query image and image 1, because the nodes 030 and 0330 are white in the quadtree of image 1. The distance $d_I^R$ tends towards 1 between the query image and images 5, 6 and 8.
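Written out for the comparison with image 1, and assuming (as the values above suggest) that the region distance is normalized by the sum of the non-zero coefficients:

$$d_I^R(q,1) = \frac{\alpha_{030} + \alpha_{0330}}{\alpha_{00} + \alpha_{030} + \alpha_{0330} + \alpha_{0333}} = \frac{1/16 + 1/64}{11/32} = \frac{5}{64} \times \frac{32}{11} = \frac{5}{22} \approx 0.23$$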


5. Discussion and Conclusion

Quadtrees are used in several content-based image retrieval approaches [AHM 97, ALB 00, LU 94, LIN 01, MAL 99, ORI 01] in order to take into account, in the computation of distances between images, the spatial localization of the image features (color, texture, contour, etc.). The distances proposed in some of these approaches appear as special cases of the distance $d$ proposed in this article. The authors of [AHM 97], for example, characterize the images by a set of interest points and represent them by quadtrees whose leaf nodes contain at most one interest point. The quadtrees of the images of the database are compared with the quadtree of the query image using a distance corresponding to the maximum of two distances of type $\hat{d}^{lev}$. In this case, $\delta_N$ is a distance between two homologous subtrees containing occupancy rates (the number of interest points in the corresponding image quadrant), and the coefficients $\alpha_N$ correspond to the proportion of interest points in the subtree relative to the number of interest points in the quadtrees representing the images.

The authors of [LU 94] and of [ORI 01] represent each image by a balanced quadtree with a fixed number of levels. Each node contains the color histogram of the corresponding quadrant in the image. The authors define a distance of type $d_V$ between the images, based on the color histograms, in which the coefficients $\alpha_N$ are surface coefficients ($\alpha_N = 4^{-l}$) and where $\delta_N$ is a distance between the color histograms stored in the homologous nodes $N$. The histograms contained in the roots of the quadtrees are compared first. If their distance is smaller than a given threshold, then the histograms stored in the nodes of the next level are compared, and so on. The deeper the level of the quadtrees at which the distance is computed, the more precise the measure of the similarity of the images [LIN 01]. These approaches therefore compute a distance of type $d^{lev}$ at each level of the quadtrees. In the DISIMA system [ORI 01], it is also possible to ask region similarity queries by selecting a partition of the multi-level histogram. This type of distance amounts to computing a distance $d^R$ (see Sections 3 and 4.3).

The authors of [MAL 99] also represent the images of the database by balanced quadtrees. Their approach makes it possible to compute the similarity between image regions. A signature of each region is computed and stored in the quadtree representing the image. When the user wishes to ask a query, he chooses one or several regions in an image decomposed as a quadtree whose number of levels is chosen by the user. The system computes the region enclosing the regions selected by the user (thus allowing region translations). Then, each image of the database is compared with the query image (considering all the possible translations of the enclosing region). Each image region (or sub-image) being represented by a feature vector, the authors have defined a distance between these vectors by linearly combining the image features. The distance thus proposed is of type $d^R$, since the regions selected by the user in the query image are visually compared with the same regions in the images of the database. In this case, the coefficients $\alpha_N$ are the same for all the homologous nodes $N$ representing the regions selected by the user in the query image (all the regions selected by the user have the same size); the coefficients $\alpha_N$ are null for the other nodes.

In the end, the distance $d$ defined in this article generalizes the distances between images organized as quadtrees. The definition of the distance $d$ is general in the sense that, depending on the choice of the values of the coefficients $\alpha_N$ and of the distance $\delta_N$, several existing distances [AHM 97, LU 94, LIN 01, MAL 99, ORI 01] can be recovered and new distances can be defined according to the needs of the user. Moreover, the distances proposed in this article are independent of the criterion used to decompose the images into quadtrees and of the values stored in the nodes of the quadtrees (color histograms, texture representations, signatures, pixel matrices, etc.). Consequently, the distance $d$ is complementary to the existing distances [ASL 99, BAC 96, FLI 95, LI 02, PUZ 99] based on the visual features of the images. Indeed, depending on the value of the quadtree nodes, the distance $\delta_N$ can be chosen among all the available similarity measures, on color [SMI 02], contour [CHA 00] or texture [PUZ 99]. If the quadtree nodes store pieces of images, these pieces can be compressed and $\delta_N$ can be chosen among the existing similarity measures, such as, for example, those based on wavelets [CHE 98a]. Note that in order to speed up image retrieval it is possible, depending on the content of the nodes (if the content of a node represents an approximation of the subtree of which it is the root) and on the distance $\delta_N$ chosen between nodes, to filter the images at the first level (the whole images) before comparing the rest of the quadtrees. The filtering is performed on the whole set of images of the database and yields a superset of the result set [FAL 94, LIN 01]. The level-by-level comparisons of the trees are then performed starting from the superset obtained after filtering. The structures used to filter the images represented by quadtrees are R-trees in [LU 94], k-d trees in [MAL 99] or a hash structure in [LIN 01].

The distance $d$ is proposed to compute distances between images decomposed into quadtrees according to a given decomposition criterion. It is possible to generalize the proposed approach and to compute a distance between images by taking into consideration several quadtree decomposition criteria, and therefore several image features. Such a distance corresponds to a weighted sum of the form $\sum_{c} \omega_c \, d_c(x,y)$, where $\omega_c$ is a coefficient representing the weight associated with the distance $d_c(x,y)$, which corresponds to a distance of type $d$ between the images $x$ and $y$ represented as quadtrees according to a decomposition criterion $c$. We are currently developing a prototype in order to evaluate the different distances presented in this article on real image databases.

Acknowledgements: The authors wish to thank Vincent Oria, as well as the reviewers of this article, for their constructive comments.

6. Bibliography

[AHM 97] AHMAD I., GROSKY W., « Spatial Similarity-Based Retrievals and Image Indexing By Hierarchical Decomposition », Int. Database Engineering and Applications Symposium (IDEAS), Montreal (Canada), 1997, http://www.cs.wayne.edu/billgrosky/Papers97.htm.

[ALB 00] ALBUZ E., KOCALAR E., KHOKHAR A., « Quantized CIELab* Space and Encoded Spatial Structure for Scalable Indexing of Large Color Image Archives », Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), June 2000, http://www.ee.udel.edu/~ashfaq/recent_pub.html/icassp2000.ps.gz.

[ASL 99] ASLANDOGAN Y., YU C., « Techniques and Systems for Image and Video Retrieval », IEEE Trans. on Knowledge and Data Engineering, vol. 11, n° 1, 1999, p. 56-63.

[BAC 96] BACH J., FULLER C., GUPTA A., HAMPAPUR A., HOROWITZ B., HUMPHREY R., JAIN R., SHU C.-F., « Virage Image Search Engine: An Open Framework for Image Management », Storage and Retrieval for Image and Video Databases (SPIE), 1996, p. 76-87.

[CAS 02] CASTELLI V., « Image Databases: Search and Retrieval of Digital Imagery », chapter 14 - Multidimensional Indexing Structures for Content-Based Retrieval, p. 373-433, Wiley Inter-Science, 2002, V. Castelli and L.D. Bergman (Eds) - ISBN: 0-471-32116-8.

[CHA 00] CHAKRABARTI K., ORTEGA-BINDERBERGER M., PORKAEW K., ZUO P., MEHROTRA S., « Similar Shape Retrieval in MARS », IEEE Int. Conf. on Multimedia and Expo (ICME II), New York, NY, USA, 2000, p. 709-712.

[CHE 98a] CHEN C., WILKINSON R., « Image Retrieval Using Multiresolution Wavelet Decomposition », Int. Conf. on Computational Intelligence and Multimedia Applications, 1998.

[CHE 98b] CHEUNG K., FU A.-C., « Enhanced Nearest Neighbour Search on the R-tree », SIGMOD Record, vol. 27, n° 3, 1998, p. 16-21.

[CIA 97] CIACCIA P., PATELLA M., ZEZULA P., « M-tree: An Efficient Access Method for Similarity Search in Metric Spaces », Proc. of the Very Large Database Conference (VLDB), Athens (Greece), 1997.

[DIG 99] DI GESÙ V., STAROVOITOV V., « Distance-based functions for image comparison », Pattern Recognition Letters, vol. 20, n° 2, 1999, p. 207-214.

[FAL 94] FALOUTSOS C., EQUITZ W., FLICKNER M., NIBLACK W., PETKOVIC D., BARBER R., « Efficient and Effective Querying by Image Content », Journal of Intelligent Information Systems, vol. 3, n° 3/4, 1994, p. 231-262.

[FLI 95] FLICKNER M., SAWHNEY H., NIBLACK W., ASHLEY J., et al., « Query by Image and Video Content: The QBIC System », Computer - IEEE Computer Society Press, vol. 28, n° 9, 1995, p. 23-32, ISSN: 0018-9162 - Query By Image Content - IBM - http://wwwqbic.almaden.ibm.com.


[GUP 97] GUPTA A., JAIN R., « Visual Information Retrieval », Communications of the ACM (CACM), vol. 40, n° 5, 1997, p. 70-79.

[KAK 00] KAK A. C., PAVLOPOULOU C., « Computer Vision Techniques for Content-Based Image Retrieval from Large Medical Databases », 7th Workshop on Machine Vision Applications (IAPR), Tokyo (Japan), 2000.

[KAT 97] KATAYAMA N., SATOH S., « The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries », Proc. ACM SIGMOD Conf. (SIGMOD'97), Tucson, Arizona, USA, 1997, p. 369-380.

[KIM 02] KIMIA B., « Image Databases: Search and Retrieval of Digital Imagery », chapter 13 - Shape Representation for Image Retrieval, p. 345-372, Wiley Inter-Science, 2002, V. Castelli and L.D. Bergman (Eds) - ISBN: 0-471-32116-8.

[LI 02] LI Y., WAN X., KUO C.-C., « Image Databases: Search and Retrieval of Digital Imagery », chapter 10 - Introduction to content-based image retrieval - Overview of key techniques, p. 261-284, Wiley Inter-Science, 2002, V. Castelli and L.D. Bergman (Eds) - ISBN: 0-471-32116-8 - http://biron.usc.edu/~yingli/Papers/Chap2.pdf.

[LIN 01] LIN S., TAMER ÖZSU M., ORIA V., NG R., « An Extendible Hash for Multi-Precision Similarity Querying of Image Databases », Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB'2001), Roma (Italy), 2001.

[LU 94] LU H., OOI B.-C., TAN K.-L., « Efficient Image Retrieval by Color Contents », First Int. Conf. on Applications of Databases (ADB-94), Vadstena (Sweden), June 1994, Lecture Notes in Computer Science 819, Springer Verlag.

[LU 99] LU G., Multimedia Database Management Systems, Artech House Publishers, 1999, ISBN: 0-89006-342-7.

[MAL 99] MALKI J., BOUJEMAA N., NASTAR C., WINTER A., « Region Queries without Segmentation for Image Retrieval by Content », 3rd Int. Conference on Visual Information Systems (Visual'99), 1999, http://www-rocq.inria.fr/~awinter/publis.html.

[MAN 02a] MANJUNATH B., MA W.-Y., « Image Databases: Search and Retrieval of Digital Imagery », chapter 12 - Texture Features for Image Retrieval, p. 313-344, Wiley Inter-Science, 2002, V. Castelli and L.D. Bergman (Eds) - ISBN: 0-471-32116-8.

[MAN 02b] MANOUVRIER M., RUKOZ M., JOMIER G., « Quadtree representations for storage and manipulation of clusters of images », Image and Vision Computing, vol. 20, n° 7, 2002, p. 513-527.

[NAS 97] NASTAR C., « Indexation d'Images par le Contenu : un Etat de l'Art », COmpression et REprésentation des Signaux Audiovisuels (CORESA'97), Issy-les-Moulineaux, France, 1997, Journées CNET, http://www-rocq.inria.fr/imedia/.

[ORI 01] ORIA V., TAMER ÖZSU M., LIN S., IGLINSKI J., « Similarity Queries in the DISIMA Image DBMS », Proc. of ACM Multimedia, Ottawa (Canada), Sept. 2001, p. 475-478, http://web.njit.edu/~oria/publications.htm.

[PUZ 99] PUZICHA J., RUBNER Y., TOMASI C., BUHMANN J., « Empirical Evaluation of Dissimilarity Measures for Color and Texture », Proc. of the IEEE Int. Conf. on Computer Vision (ICCV'99), 1999, p. 1165-1173.

[RUI 97] RUI Y., SHE A., HUANG T., « A Modified Fourier Descriptor for Shape Matching in Mars », Images Databases and Multi-Media Search, vol. 8 of Series on Software Engineering and Knowledge Engineering, A.W.M. Smeulders, R. Jain (Eds), p. 165-177, World Scientific, 1997.


[RUI 99] RUI Y., HUANG T., CHANG S.-F., « Image Retrieval: Current Techniques, Promising Directions and Open Issues », Journal of Visual Communication and Image Representation, vol. 10, 1999, p. 39-62, http://research.microsoft.com/users/yongrui/html/publication.html.

[SAM 84] SAMET H., « The Quadtree and Related Hierarchical Structures », Computing Surveys, vol. 16, n° 2, 1984, p. 187-260.

[SHY 98] SHYU C., BRODLEY C., KAK A., KOSAKA A., AISEN A., BRODERICK L., « Local versus Global Features for Content-Based Image Retrieval », Proc. of IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL'98), Santa Barbara, California, 1998.

[SMI 94] SMITH J., CHANG S.-F., « Quad-Tree Segmentation for Texture-Based Image Query », Proc. of 2nd Annual ACM Multimedia Conf., San Francisco, CA, USA, Oct. 1994.

[SMI 02] SMITH J., « Image Databases: Search and Retrieval of Digital Imagery », chapter 11 - Color for Image Retrieval, p. 285-311, Wiley Inter-Science, 2002, V. Castelli and L.D. Bergman (Eds) - ISBN: 0-471-32116-8.

[STE 02] STEHLING R., NASCIMENTO M., FALCAO A., « Multimedia Mining - a High Way to Intelligent Multimedia Document », chapter 4 - Techniques for Color-Based Image Retrieval, C. Djeraba (Ed.), Kluwer Academic Publishers, 2002.

[TAN 01] TAN K.-L., OOI B. C., YEE C., « An Evaluation of Color-Spatial Retrieval Techniques for Large Image Databases », Multimedia Tools and Applications, vol. 14, n° 1, 2001, p. 55-78.

[TRA 02] TRAINA C. J., TRAINA A., FALOUTSOS C., SEEGER B., « Fast Indexing and Visualization of Metric Data Sets using Slim-Trees », IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 14, n° 2, 2002, p. 244-260.


A First Experience in Archiving the French Web

Serge Abiteboul — Grégory Cobéna — Julien Masanes — Gérald Sedrati

INRIA, [email protected], [email protected]

BnF, Bibliothèque nationale de France, [email protected]

Xyleme S.A., [email protected]

RÉSUMÉ. Alors que le web est de plus en plus reconnu comme une importante source d'information, certaines organisations, comme Internet Archive www.archive.org, essaient d'en archiver tout ou partie. Le web français est aujourd'hui un sujet d'étude dans la perspective d'une collecte patrimoniale et automatisée des sites : en France, la Bibliothèque nationale de France (BnF) s'inscrit dans la problématique d'un dépôt légal du web. Nous présentons ici quelques travaux conduits par la BnF et l'INRIA sur ce sujet. Plus précisément, nous nous intéressons à l'acquisition des données à archiver. Les difficultés rencontrées concernent notamment la définition du périmètre du 'web français' ainsi que le choix de politiques de rafraîchissement des pages mises à jour. De plus, le besoin de conserver plusieurs éditions d'un même document nous a amené à étudier la problématique de gestion des versions. Enfin, nous mentionnons quelques expériences en cours.

ABSTRACT. The web is a more and more valuable source of information and organizations are involved in archiving (portions of) it for various purposes, for instance, the Internet Archive www.archive.org. A new mission of the French National Library (BnF) is the "dépôt légal" (legal deposit) of the French web. We describe here some preliminary work on the topic conducted by BnF and INRIA. In particular, we consider the acquisition of the web archive. Issues are the definition of the perimeter of the French web and the choice of pages to read once or more times (to take changes into account). When several copies of the same page are kept, this leads to versioning issues that we briefly consider. Finally, we mention some first experiments.

MOTS-CLÉS : entrepôt, importance des pages, archivage, mises-à-jour

KEYWORDS: data warehouse, page importance, archives, version management


1. Introduction

Since 1537¹, for every book edited in France, an original copy is sent to the Bibliothèque nationale de France (French National Library - BnF in short) in a process called dépôt légal. The BnF stores all these items and makes them available for future generations of researchers. As publication on the web increases, the BnF proposes providing a similar service for the French web, a more and more important and valuable source of information. In this paper, we study technical issues raised by the legal deposit of the French web. The main differences between the existing legal deposit and that of the web are the following:

1) the number of content providers: On the web, anyone can publish documents. One should compare, for instance, the 148,000 web sites in ".fr" (as of 2001) with the 5,000 traditional publishers at the same date.

2) the quantity of information: Primarily because of the simplicity of publishing on the web, the size of content published on the French web is orders of magnitude larger than that of the existing legal deposit and, with the popularity of the web, this will be more and more the case.

3) the quality: Lots of information on the web is not meaningful.

4) the relationship with the editors: With legal deposit, it is accepted (indeed enforced by law) that the editors "push" their publication to the legal deposit. This "push" model is not necessary on the web, where national libraries can themselves find relevant information to archive. Moreover, with the relative freedom of publication, a strictly push model is not applicable.

5) updates: Editors send their new versions to the legal deposit (again in push mode), so it is their responsibility to decide when a new version occurs. On the web, changes typically occur continuously and it is not expected that web-masters will, in general, warn the legal deposit of new releases.

6) perimeter: The perimeter of the classical legal deposit is reasonably simple, roughly the contents published in France. Such a notion of border is more elusive on the web.

For these reasons, the legal deposit of the French web should not only rely on editors "pushing" information to BnF. It should also involve (because of the volume of information) complementing the work of librarians with automatic processing.

There are other aspects of archiving the web that will not be considered here. For instance, the archiving of sound and video leads to issues of streaming. Also, the physical and logical storage of large amounts of data brings issues of long term preservation. How can we guarantee that terabytes of data stored today on some storage device in some format will still be readable in 2050? Another interesting aspect is to determine which services (such as indexing and querying) should be offered to users interested in analyzing archived web content. In the present paper, we will focus on the issue of obtaining the necessary information to properly archive the web.

1. This was a decision of King François the 1st.


The paper describes preliminary works and experiments conducted by BnF and INRIA. The focus is on the construction of the web archive. This leads us to considering issues such as the definition of the perimeter of the French web and the choice of pages to read one or more times (to take changes into account). When several copies of the same page are kept, this also leads to versioning issues that we briefly consider. Finally, we mention some first experiments performed with data provided by Xyleme's crawls of the web (of close to a billion pages).

In Section 2, we detail the problem and mention existing work on similar topics. In Section 3, we consider the building of the web archive. Section 4 deals with the importance of pages and sites, which turns out to play an important role in our approach. In Section 5, we discuss change representation, that is, we define a notion of delta per web site that we use for efficient and consistent refresh of the warehouse. Finally, we briefly present results of experiments.

2. Web Archiving

The web keeps growing at an incredible rate. We often have the feeling that it accumulates new information without any garbage collection, and one may ask whether the web is not self-archiving. Indeed, some sites provide access to selective archives. On the other hand, valuable information disappears very quickly as community and personal web pages are removed. Also, the fact that there is no control of changes in "pseudo" archives is rather critical, because this leaves room for revision of history. This is why several projects aim at archiving the web. We present some of them in this section.

2.1. Goal and scope

The web archive intends to provide future generations with a representative archive of the cultural production (in a wide sense) of a particular period of Internet history. It may be used not only to refer to well known pieces of work (for instance scientific articles) but also to provide material for cultural, political and sociological studies, and even to provide material for studying the web itself (technical or graphical evolution of sites for instance). The mission of national libraries is to archive a wide range of material because nobody knows what will be of interest for future research. This also applies to the web. But for the web, exhaustiveness, which is required for traditional publications (books, newspapers, magazines, audio CD, video, CDROM), can't be achieved. In fact, in traditional publication, publishers are actually filtering contents and an exhaustive storage is made by national libraries from this filtered material. On the web, publishing is almost free of charge, more people are able to publish and no filtering is made by the publishing apparatus. So the issue of selection comes up again, but it has to be considered in the light of the mission of national libraries, which is to provide future generations with a large and representative part of the cultural production of an era.


2.2. Similar projects

Up to now, two main approaches have been followed by national libraries regarding web archiving. The first one is to select manually a few hundred sites and choose a frequency of archiving. This approach has been taken by Australia [A N] and Canada [MAR 99], for instance, since 1996. A selection policy has been defined focusing on institutional and national publication.

The second approach is an automatic one. It has been chosen by Nordic countries [ARV 00] (Sweden, Finland, Norway). The use of a robot crawler makes it possible to archive a much wider range of sites, a significant part of the surface web in fact (maybe 1/3 of the surface web for a country). No selection is made. Each page that is reachable from the portion of the web we know of will be harvested and archived by the robot. The crawling and indexing times are quite long and in the meantime, pages are not updated. For instance, a global snapshot of the complete national web (including national and generic domain located sites) is made twice a year by the Royal Library of Sweden. The two main problems with this model are: (i) the lack of updates of archived pages between two snapshots, (ii) the deep or invisible web [RAG 01, BER] that can't be harvested on line.

2.3. Orientation of this experiment

Considering the large amount of content available on the web, the BnF deems that using automatic content gathering methods is necessary. But robots have to be adapted to provide a continuous archiving facility. That is why we have submitted a framework [MAS 01] that allows focusing either the crawl or the archiving, or both, on a specific subset of sites chosen in an automatic way. The robot is driven by parameters that are calculated on the fly, automatically and at a large scale. This allows us to allocate in an optimal manner the resources to crawling and archiving. The goal is twofold: (i) to cover a very large portion of the French web (perhaps "all", although all is an unreachable notion because of dynamic pages) and (ii) to have frequent versions of the sites, at least for a large number of sites, the most "important" ones.

It is quite difficult to capture the notion of importance of a site. An analogy taken from traditional publishing could be the number of in-going links to a site, which makes it a publicly-recognized resource by the rest of the web community. Links can be considered similar, to a certain extent of course, to bibliographical references. At least they give a web visibility to documents or sites, by increasing the probability of accessing them (cf. the random surfer in [5]). We believe that it is a good analogy of the public character of traditionally published material (as opposed to unpublished, private material for instance) and a good candidate to help drive the crawling and/or archiving process [MAS 01]. Some search engines already use importance to rank query results (like Google or Voila).


These techniques have to be adapted to our context, which is quite different. For instance, as we shall see, we have to move from a page-based notion of importance to a site-based one to build a coherent web archive (see Section 4). This also leads to exploring ways of storing and accessing temporal changes on sites (see Section 5), as we will no longer have the discrete, snapshot-type of archive but a more continuous one. To explore these difficult technical issues, a collaboration between BnF and INRIA started last year. The first results of this collaboration are presented here. Xyleme provided different sets of data needed to validate some hypotheses, using the Xyleme crawler developed jointly with INRIA. Other related issues, like the deposit and archiving of sites that cannot be harvested online, will not be addressed in this paper [MAS 02].

One difference between BnF's legal deposit and other archive projects is that it focuses on the French web. To conclude this section, we consider how this simple fact significantly changes the technology to be used.

2.4. The frontier for the French web

Given its mission, and since others are doing it for other portions of the web, the BnF wants to focus on the French web. The notion of perimeter is relatively clear for the existing legal deposit (e.g., for books, the BnF requests a copy of each book edited by a French editor). On the web, national borders are blurred and many difficulties arise when trying to give a formal definition of the perimeter. The following criteria may be used:

– The French language. Although this may be determined from the contents of pages, it is not sufficient because of the other French speaking countries or regions, e.g. Quebec. Also, many French sites now use English, e.g. there are more pages in English than in French in inria.fr.

– The domain name. Resource locators include a domain name that sometimes provides information about the country (e.g. .fr). However, this information is not sufficient and cannot in general be trusted. For instance, www.multimania.com is hosting a large number of French associations and French personal sites and is mostly used by French people. Moreover, the registration process for .fr domain names is more difficult and expensive than for others, so many French sites choose other suffixes, e.g. .com or .org.

– The address of the site. This can be determined using information obtainable from the web (e.g., from domain name servers) such as the physical location of the web server or that of the owner of the web site name. However, some French sites may prefer to be hosted on servers in foreign countries (e.g., for economical reasons) and conversely. Furthermore, some web site owners may prefer to provide an address in exotic countries such as the Bahamas to save on local taxes on site names. (With the same provider, e.g., Gandi, the cost of a domain name varies depending on the country of the owner.)


Note that for these criteria, negative information may be as useful as positive information, e.g., we may want to exclude the domain name .ca (for Canada).

The Royal Library of Sweden, which has been archiving the Swedish web for more than 6 years now, has settled on an inclusion policy based on the national domains (.se and .nu), checking the physical address of generic domain name owners, and the possibility to manually add other sites. The distribution of the domain names is about 65 percent for national domains (.se and .nu) and 25 percent for generic domains (.net, .com, .org).

Yet another difficulty in determining the perimeter is that the legal deposit is typically not very interested in commercial sites. But it is not easy to define the notion of commercial site. For instance, amazon.fr (note the ".fr") is commercial whereas groups.yahoo.com/group/vertsdesevres/ (note the ".com") is a public, political forum that may typically interest the legal deposit. As in the case of the language, the nature of web sites (e.g., commercial vs. non commercial) may be better captured using the contents of pages.

No single criterion previously mentioned is sufficient to distinguish the documents that are relevant for the legal deposit from those that are not. This leads to using a multi-criteria based clustering. The clustering is designed to incorporate crucial information: the connectivity of the web. French sites are expected to be tightly connected. Note that here again, this is not a strict law. For instance, a French site on DNA may strongly reference foreign sites such as Mitomap (a popular database on the human mitochondrial genome).

Last but not least, the process should involve the BnF librarians and their knowledge of the web. They may know, for instance, that 00h00.com is a web book editor that should be archived in the legal deposit.

Technical corner. The following technique is used. A crawl of the web is started. Note that sites specified as relevant by the BnF librarians are crawled first and the relevance of their pages is fixed as maximal. The pages that are discovered are analyzed for the various criteria to compute their relevance for the legal deposit. Only the pages believed to be relevant ("suspect" pages) are crawled. For the experiments, the BAO algorithm is used [ABI 02], which allows computing page relevance on-line while crawling the web. The algorithm focuses the crawl on portions of the web that are evaluated as relevant for the legal deposit. This is in the spirit of the XML-focused on-line crawling presented in [MIG 00], except that we use the multi-criteria previously described. The technique has the other advantage that it is not necessary to store the graph structure of the web, so it can be run with very limited resources. Intuitively, consider the link matrix $L$ of the web (possibly normalized by out-degrees), and the value vector $C$ for any page-based criterion. Then $L \cdot C$ represents a depth-1 propagation of the criterion, and in general $L^k \cdot C$ represents the propagation up to depth $k$. Note that PageRank [GOO b] is defined by the limit of $L^k \cdot C$ when $k$ goes to infinity. We are not exactly interested in PageRank, but only in taking into account some contribution of connectivity. Thus we define the value vector for a page as $V = \sum_{k \ge 0} \gamma_k \, L^k \cdot C$. Any distribution can be used for the sequence $(\gamma_k)$, as long as the sum converges. When the sequence decreases faster, the contribution of connectivity is reduced.
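A small numpy sketch of this truncated propagation, with a geometric sequence $\gamma_k = \gamma^k$ as one convergent choice; the actual BAO algorithm [ABI 02] computes such values on-line while crawling, without materializing the link matrix, so everything here is purely illustrative.

import numpy as np

def relevance(L, C, gamma=0.5, depth=8):
    """Approximate V = sum_k gamma^k * L^k * C.  L is the link matrix
    (entry [i, j] = 1/outdegree(j) if page j links to page i), C the
    per-page value vector of some criterion; gamma < 1 makes the sum
    converge and damps the contribution of connectivity."""
    V = np.zeros(len(C))
    term = np.asarray(C, dtype=float)
    for k in range(depth + 1):
        V += (gamma ** k) * term
        term = L @ term                 # propagate one link-step further
    return V

# Toy web: page 2 links to pages 0 and 1; page 0 links to page 1.
L = np.array([[0.0, 0.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
C = np.array([1.0, 0.0, 0.0])           # page 0 matched a relevance criterion
print(relevance(L, C))                  # page 1 inherits part of page 0's value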

Since the same technology is used to obtain the importance of pages, a more detailed presentation of the technique is delayed to Section 3.

To conclude this section, we note that for the first experiments that we mention in the following sections, the perimeter was simply specified by the country domain name (.fr). We intend to refine it in the near future.

3. Building the Archive

In this section, we present a framework for building the archive. Previous work in this area is abundant [A N, ARV 00, MAR 99], so we focus on the specificities of our proposal.

A simple strategy would be to take a snapshot of the French web regularly, say $n$ times a year (based on available resources). This would typically mean regularly running a crawling process for a while (a few weeks). We believe that the resulting archive would certainly be considered inadequate by researchers. Consider a researcher interested in the French political campaigns in the beginning of the 21st century. The existing legal deposit would give him access to all issues of the Le Monde newspaper, a daily newspaper. On the other hand, the web archive would provide him only with a few snapshots of Le Monde's web site per year. The researcher needs a more "real time" vision of the web. However, because of the size of the web, it would not be reasonable/feasible to archive each site once a day even if we use sophisticated versioning techniques (see Section 5).

So, we want some sites to be very accurately archived (almost in real-time); we want to archive a very extensive portion of the French web; and we would like to do this under limited resources. This leads to distinguishing between sites: the most important ones (to be defined) are archived frequently whereas others are archived only once in a long while (yearly or possibly never). A similar problematic is encountered when indexing the web [GOO b]. To take full advantage of the bandwidth of the crawlers and of the storage resources, we propose a general framework for building the web archive that is based on a measure of importance for pages and on their change rate. This is achieved by adapting techniques presented in [MIG 00, ABI 02]. But first, we define intuitively the notion of importance and discuss the notion of web site.

Page importance. The notion of page importance has been used by search engines with a lot of success. In particular, Google uses an authority definition that has been widely accepted by users. The intuition is that a web page is important if it is referenced by many important web pages. For instance, Le Louvre's homepage is more important than an unknown person's homepage: there are more links pointing to Le Louvre coming from other museums, tourist guides, or art magazines, and many more coming from unimportant pages. An important drawback is that this notion is based strictly on the graph structure of the web and ignores important criteria such as language, location and also content.

3.1. Site vs. page archiving

Web crawlers typically work at the granularity of pages. They select one URL to load in the collection of URLs they know of and did not load yet. The most primitive crawlers select the "first" URL, whereas the sophisticated ones select the most "important" URL [GOO b, MIG 00]. For an archive, it is preferable to reason at the granularity of web sites rather than just web pages. Why? If we reason at the page level, some pages in a site (more important than others) will be read more frequently. This results in very poor views of websites. The pages of a particular site would typically be crawled at different times (possibly weeks apart), leading to dangling pointers and inconsistencies. For instance, a page that is loaded may contain a reference to a page that does not exist anymore at the time we attempt to read it, or to a page whose content has been updated².

For these reasons, it is preferable to crawl sites and not individual pages. But it is not straightforward to define a web site. The notion of web site loosely corresponds to that of editor for the classical legal deposit. The notion of site may be defined, as a first approximation, as the physical site name, e.g., www.bnf.fr. But it is not always appropriate to do so. For instance, www.multimania.com is the address of a web provider that hosts a large quantity of sites that we may want to archive separately. Conversely, a web site may be spread between several domain names: INRIA's website is on www.inria.fr, www-rocq.inria.fr, osage.inria.fr, www.inrialpes.fr, etc. There is no simple definition. For instance, people will not all agree when asked whether www.leparisien.fr/news and www.leparisien.fr/shopping are different sites or parts of the same site. To be complete, we should mention the issue of detecting mirror sites, which is very important in practice.

It should also be observed that site-based crawling contradicts compulsory crawling requirements such as the prevention of rapid firing. Crawlers typically balance load over many websites to maximize bandwidth use and avoid over-flooding web servers. In contrast, we focus resources on a smaller number of websites and try to remain at the limit of rapid firing for these sites until we have a copy of each. An advantage of this focus is that very often a small percentage of pages causes most of the problems. With site-focused crawling, it is much easier to detect server problems such as a slow dynamic page server or a remote host being down.

2. To see an example, one of the authors (an educational experience) used, in the website of a course he was teaching, the URL of an HTML to XML wrapping software. A few months later, this URL was leading to a pornographic web site. (Domain names that are not renewed by owners are often bought for advertisement purposes.) This is yet another motivation for archives.


3.2. Acquisition: Crawl, Discovery and Refresh

Crawl. The crawling and acquisition are based on a technique [MIG 00] that was developed at INRIA in the Xyleme project. The web data we used for our first experiments was obtained by Xyleme [xyl] using that technology. It allows, using a cluster of standard PCs, to retrieve a large amount of pages with limited resources, e.g. a few million pages per day per PC on average. In the spirit of [HAV 99, CHO 00, MIG 00], pages are read based on their importance and refreshed based on their importance and change frequency rate. This results in an optimization problem that is solved with a dynamic algorithm that was presented in [MIG 00]. The algorithm has to be adapted to the context of the web legal deposit and site-based crawling.

Discovery. We first need to allocate resources between the discovery of new pages and the refreshing of already known ones. For that, we proceed as follows. The size of the French web is estimated roughly. In a first experiment using only ".fr" as criterion and a crawl of close to one billion URLs, this was estimated to be about 1-2% of the global web, so of the order of 20 million URLs. Then the librarians decide the portion of the French web they intend to store, possibly all of it (with all precautions for the term "all"). It is necessary to be able to manage in parallel the discovery of new pages and the refresh of already read pages. After a stabilization period, the system is aware of the number of pages to read for the first time (known URLs that were never loaded) and of those to refresh.

It is clearly of interest to the librarians to have a precise measure of the size of the French web. At a given time, we have read a number of pages and some of them are considered to be part of the French web. We know of a much greater number of URLs, of which some are considered "suspects" for being part of the French web (because of the ".fr" suffix, or because they are closely connected to pages known to be in the French web, or for other reasons). This allows us to obtain a reasonably precise estimate of the size of the French web.

Refresh. Now, let us consider the selection of the next pages to refresh. The technique used in [MIG 00] is based on a cost function for each page $p$, $cost(p)$: the penalty for the page to be stale. For each page $p$, $cost(p)$ is proportional to the importance $imp(p)$ of page $p$ and depends on its estimated change frequency $ch(p)$. We define in the next subsection the importance $imp(S)$ of a site $S$, and we also need to define the "change rate" of a site. When a page in site $S$ has changed, the site has changed. The change rate is, for instance, the number of times a page changes per year. Thus, the upper bound for the change rate of $S$ is $\sum_{p \in S} ch(p)$. For efficiency reasons, it is better to consider the average change rate of pages, in particular depending on the importance of pages. We propose to use a weighted average change rate of a site as:

$$ch(S) = \frac{\sum_{p \in S} imp(p) \cdot ch(p)}{\sum_{p \in S} imp(p)}$$


Our refreshing of web sites is based on a cost function. More precisely, we choose to read next the site $S$ with the maximum ratio:

$$\frac{cost(S)}{\text{number of pages in } S}$$

where $cost(S)$ may be, for instance, the following simple cost function:

$$cost(S) = imp(S) \cdot ch(S)$$

We divide by the number of pages to take into account the cost of reading the site.

A difficulty for the first loading of a site is that, for new sites, we do not know their number of pages. This has to be estimated based on the number of URLs we know of in the site (and never read). Note that this technique forces us to compute importance at the page level.
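Putting these pieces together, a toy scheduler consistent with the above might look as follows; the field names and the product form of the cost function are assumptions for illustration, not the paper's exact formulation.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    importance: float     # imp(S), discussed in Section 4
    change_rate: float    # ch(S), the weighted average change rate
    nb_pages: int         # estimated from known URLs for new sites

def cost(s: Site) -> float:
    # Simple stale penalty: importance of the site times its change rate.
    return s.importance * s.change_rate

def next_site(sites):
    """Refresh next the site maximizing cost(S) / number of pages,
    i.e. the largest penalty reduction per page actually read."""
    return max(sites, key=lambda s: cost(s) / s.nb_pages)

sites = [Site("lemonde.fr", importance=0.9, change_rate=300.0, nb_pages=50_000),
         Site("perso.example.fr", importance=0.1, change_rate=2.0, nb_pages=40)]
print(next_site(sites).name)    # -> lemonde.fr (0.0054 vs 0.005 per page)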

To conclude this section, we will propose a model to avoid such an expensive computation. But first we revisit the notion of importance.

3.3. Importance of pages for the legal deposit

When discovering and refreshing web pages, we want to focus on those which are of interest for the legal deposit. The classical notion of importance is used. But it is biased to take into account the perimeter of the French web. Finally, the content of pages is also considered. A librarian typically would look at some documents and know whether they are interesting. We would like to perform such an evaluation automatically, to some extent. More precisely, we can use for instance the following simple criteria:

– Frequent use of infrequent words: The frequency of words found in the web page is compared to the average frequency of such words in the French web³. For instance, for a word $w$ and a page $p$, it is:

$$FW(w,p) = \frac{freq(w,p)}{freq(w,\text{web})}$$

where $freq(w,p) = \frac{occ(w,p)}{size(p)}$, $occ(w,p)$ is the number of occurrences of the word $w$ in the page $p$, and $size(p)$ the number of words in the page. Intuitively, it aims at finding pages dedicated to a specific topic, e.g. butterflies, so pages that have some content.

– Text weight: This measure represents the proportion of text content over other content like HTML tags, product or family names, numbers or experimental data. For instance, one may use the number of bytes of French text divided by the total number of bytes of the document:

$$TW(p) = \frac{\text{French text bytes}(p)}{\text{total bytes}(p)}$$

3. To guarantee that the most infrequent words are not just spelling mistakes, the set of words is reduced to words from a French dictionary. Also, as standard, stemming is used to identify words such as toy and toys.


Intuitively, it increases the importance of pages with text written by people versus data, images or other content.

A first difficulty is to evaluate the relevance of these criteria. Experiments are being performed with librarians to understand which criteria best match their expertise in evaluating sites. Another difficulty is to combine the criteria. For instance, www.microsoft.fr may have a high PageRank, may frequently use some infrequent words and may contain a fair proportion of text. Still, due to its commercial status, it is of little interest for the legal deposit. Note that librarians are vital in order to "correct" errors by positive action (e.g., forcing a frequent crawl of 00h00.com) or negative one (e.g., blocking the crawl of www.microsoft.fr). Furthermore, librarians are also vital to correct the somewhat brutal nature of the construction of the archive. Note however that, because of the size of the web, we should avoid manual work as much as possible and would like archiving to be as fully automatic as possible.

As was shown in this section, the quality of the web archive will depend on complex issues such as being able to distinguish the borders of a web site and to analyze and evaluate its content. There are ongoing projects like THESUS [HAL 02] which aim at analyzing thematic subsets of the web using classification and clustering techniques and the semantics of links between web pages. Further work on the topic is necessary to improve site discovery and classification.

To conclude this section, we need to extend the previously defined notions to the context of websites. For some, it suffices to consider the site as a huge web document and aggregate the values of the pages. For instance, for frequent use of infrequent words, one can use:

    freq(w, S) = Σ_{p∈S} n_w(p) / Σ_{p∈S} N(p)

Indeed, the values on word frequency and text weight seem to be more meaningful at the site level than at the page level.

For page importance, it is more difficult. This is the topic of the next section.

4. Site-based Importance

To obtain a notion of site importance from the notion of page importance, one could consider a number of alternatives:

– Consider only links between websites and ignore internal links;

– Define site importance as the sum of PageRank values for each page of the website;

– Define site importance as the maximum value of PageRank, often corresponding to that of the site's main page.


We propose in this section an analysis of site importance that will allow us to choose one notion.

First, observe that the notion of page importance is becoming less reliable as the number of dynamic pages on the web increases. A reason is that the semantics of the web graph created by dynamic pages is weaker than with the previous document-based approach. Indeed, dynamic pages are often the result of database queries and link to other queries on the same database. The number of incoming/outgoing links is now related to the size of the database and the number of queries, whereas it was previously a human artifact carrying stronger semantics. In this section, we present a novel definition of site importance that is closely related to the already known page importance. The goal is to define a site importance with stronger semantics, in that it does not depend on the site's internal databases and links. We will see how we can derive such importance from this site model.

Page importance, namely PageRank in Google terminology, is defined as the fixpoint of the matrix equation X = Lᵀ · X [BRI 98, PAG 98], where the web-pages graph G is represented as a link matrix L. Let out be the vector of out-degrees. If there is an edge from page i to page j, L[i,j] = 1/out(i), otherwise it is 0. We note Imp(i) the importance of each page i. Let us define a web-sites graph G' where each node is a web-site (e.g. www.inria.fr). For each link from a page a in web-site i to a page b in web-site j, there is an edge from i to j. These edges are weighted, that is, if page a in site i is twice more important than page a' (in i also), then the total weight of outgoing edges from a will be twice the total weight of outgoing edges from a'. The obvious reason is that browsing the web remains page based, thus links coming from more important pages deserve to have more weight than links coming from less important ones. The intuition underlying these measures is that a web observer will visit each page randomly, proportionally to its importance. Thus, the link matrix L' is now defined by:

    L'[i,j] = Σ_{a∈i, b∈j, a→b} Imp(a) · L[a,b] / Σ_{a∈i} Imp(a)

We note two things:

– If the graph G representing the web-graph is (artificially or not) strongly connected, then the graph G' derived from G is also strongly connected.

– L' is still a stochastic matrix, in that Σ_j L'[i,j] = 1 (proof in appendix).

Thus, the page importance, namely PageRank, can be computed over G' and there is a unique fixpoint solution. We prove in appendix that the solution is given by:

    Imp(i) = Σ_{a∈i} Imp(a)

i.e., the importance of a web-site is the sum of the importance of its pages. This formal relation between website-based importance and page importance suggests to compute page importance for all pages, a rather costly task. However, it serves


as a reference to define site-based importance, and helps understand its relation to page-based importance. One could simplify the problem by considering, for instance, that all pages in a website have the same importance. Based on this, the computation of site importance becomes much simpler: in this case, if there is at least one page in site i pointing to one page in site j, we have L'[i,j] = 1/out(i), where out(i) is the out-degree of site i. A more precise approximation of the reference value consists in evaluating the importance of the pages of a given website on the restriction of G to that site. Intuitively it means that only internal links in the site will be considered. This approximation is very effective because: (i) it finds very good importance values for pages, corresponding precisely to the internal structure of the web-site; (ii) it is cheaper to compute the internal page importance for all websites, one by one, than to compute the PageRank over the entire web; (iii) the semantics of the result are stronger because it is based on site-to-site links.
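
As an illustration, here is a small Python sketch of this second approximation: page importance is first computed on each site's internal links only, and the resulting values are then used to weight the site-to-site matrix L'. The power iteration, the absence of damping and the assumption that every page has out-links and every site is non-empty are simplifications on our part.

    import numpy as np

    def internal_importance(adj):
        """Importance of the pages of one site, using internal links only
        (adj[i, j] = 1 if page i links to page j inside the site; every
        page is assumed to have at least one outgoing internal link)."""
        out = adj.sum(axis=1)
        L = adj / out[:, None]                 # L[i, j] = 1/out(i) if i -> j
        x = np.full(len(adj), 1.0 / len(adj))
        for _ in range(100):                   # power iteration to the fixpoint
            x = L.T @ x
            x /= x.sum()
        return x

    def site_link_matrix(imp, page_adj, site_of, n_sites):
        """L'[I, J]: outgoing links weighted by the importance of their source
        page, normalized by the total importance of the source site, which
        keeps each row of L' summing to 1 (stochasticity)."""
        Lp = np.zeros((n_sites, n_sites))
        out = page_adj.sum(axis=1)
        for a, b in zip(*np.nonzero(page_adj)):
            Lp[site_of[a], site_of[b]] += imp[a] / out[a]
        for I in range(n_sites):
            mass = imp[[a for a, s in enumerate(site_of) if s == I]].sum()
            Lp[I] /= mass
        return Lp
        # Site importance is then the fixpoint of X = Lp.T @ X, computed
        # by the same power iteration as above.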

This web-site approach significantly enhances previous work in the area, and we will see in the next section how we also extend previous work in change detection, representation and querying to web sites.

5. Representing Changes

Intuitively, change control and version management are used to save storage and bandwidth resources by updating, in a large data warehouse, only the small parts that have changed [MAR 01]. We want to maximize the use of bandwidth, for instance by avoiding the loading of sites that did not change (much) since the last time they were read. To maximize the use of storage, we typically use compression techniques and a clever representation of changes. We propose in this section a change representation at the level of web sites, in the spirit of [LAF 01, MAR 01]. Our change representation consists of a site-delta, in XML, with the following features:

(i) Persistent identification of web pages using their URL, and unique identification of each document using the tuple (URL, date-of-crawl);

(ii) Information about mirror sites and their up-to-date status;

(iii) Support for temporal queries and browsing of the archive.

The following example is a site-delta for www.inria.fr:

<website url="www.inria.fr">
  <page url="/index.html">
    <document date="2002-Jan-01" status="updated" file="543B6.html"/>
    <document date="2002-Mar-01" status="unchanged" file="543B6.html"/>
  </page>
  <page url="/news.html">
    <document date="2002-Mar-25" status="updated" file="543GX6.html"/>
    <document date="2002-Mar-24" status="error">
      <error httperror="404"/>
    </document>
    <document date="2002-Mar-23" status="updated" file="523GY6.html"/>
    ...
    <document date="1999-Jan-08" status="new" file="123GB8.html"/>
  </page>
  <mirror url="www-mirror.inria.fr" depth="nolimit">
    <exclusion path="/cgi-bin"/>
  </mirror>
</website>

Each web-site element contains a set of pages, and each page element contains a subtree for each time the page was accessed. If the page was successfully retrieved, a reference to the archive of the document is stored, as well as some metadata. If an error was encountered, the page status is updated accordingly. If the page mirrors another page on the same (or on another) web-site, the document is stored only once (if possible) and is tagged as a mirror document. Each web-site tree also contains a list of web-sites mirroring part of its content. The up-to-date status of mirror sites is stored in their respective XML files.

Other usages. The site-delta is not only used for storage. It also improves the efficiency of the legal deposit. In particular, we mentioned previously that the legal deposit works at site level. Because our site-delta representation is designed to maintain information at page level, it serves as an intermediate layer between site-level components and page-based modules.

For instance, we explained that the acquisition module crawls sites instead of pages. The site-delta is then used to provide information about pages (last update, change frequency, file size) that will be used to reduce the number of pages to crawl by using caching strategies. Consider a news web site, e.g. www.leparisien.fr/. News articles are added each day and seldom modified afterwards; only the index page is updated frequently. Thus, it is not desirable to crawl the entire web site every day. The site-delta keeps track of the metadata for each page and allows deciding which pages should be crawled. It thus allows the legal deposit to virtually crawl the entire web site each day.
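
A sketch of the kind of decision this enables, with hypothetical metadata fields; the rule used (re-fetch a page once the time since its last crawl exceeds its estimated change period) is only one plausible caching strategy.

    from datetime import datetime, timedelta

    def pages_to_crawl(site_delta, now):
        """Select, from the per-page metadata kept in the site-delta, the
        pages worth fetching today; the others are served from the archive."""
        return [p["url"] for p in site_delta["pages"]
                if now - p["last_crawl"] >= p["change_period"]]

    site_delta = {"pages": [
        {"url": "/index.html", "last_crawl": datetime(2002, 3, 24),
         "change_period": timedelta(days=1)},        # updated daily
        {"url": "/article-0323.html", "last_crawl": datetime(2002, 3, 23),
         "change_period": timedelta(days=365)},      # almost never changes
    ]}
    print(pages_to_crawl(site_delta, datetime(2002, 3, 25)))   # ['/index.html']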

Browsing the archive. A standard first step consists in replacing in each page the links to the Internet (e.g. http://www.yahoo.fr/) by local links (e.g. to files). The process is in general easy; some difficulties are caused by pages using JavaScript (sometimes on purpose) that make links unreadable. A usual problem is the consistency of the links and the data. First, the web graph is not consistent to start with: broken links, servers down and pages with out-of-date data are common. Furthermore, since pages are crawled very irregularly, we never have a true snapshot of the web.

The specific problem of the legal deposit is related to temporal browsing. Consider, for instance, a news web site that is entirely crawled every day. A user may arrive at a page, perhaps via a search engine on the archive. One would expect to provide him the means to browse through the web site of that day, and also in time, e.g. move to this same page the next day. The problem becomes seriously more complex when we consider that all pages are not read at the same time. For instance, suppose a user reads a version of page p loaded at time t and clicks on a link to page p'. We may not have the value of page p' at that time. Should we find the latest version of p' before t, the first version after t, or the closest one? Based on an evaluation of the change frequency of p', one may compute which is the most likely to be the correct one. However, the user may be unsatisfied by this, and it may be more appropriate to propose several versions of that page.
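
A minimal sketch of this version selection, assuming each page's crawl dates are available from the site-delta; proposing both candidates, as suggested above, is left to the caller.

    def neighbor_versions(crawl_dates, t):
        """Latest version of p' crawled before t and first one crawled after t."""
        before = max((d for d in crawl_dates if d <= t), default=None)
        after = min((d for d in crawl_dates if d > t), default=None)
        return before, after

    def closest_version(crawl_dates, t):
        candidates = [d for d in neighbor_versions(crawl_dates, t) if d is not None]
        return min(candidates, key=lambda d: abs(d - t), default=None)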

One may also want to integrate information coming from different versions of a page into a single one. For instance, consider the index of a news web site with headlines for each news article over the last few days. We would like to automatically group all headlines of the week into a single index page, as in the Google news search engine [Goo a]. A difficulty is to understand the structure of the document and to select the valuable links. For instance, we don't want to group all advertisements of the week!

6. Conclusion

As mentioned in the introduction, this paper describes preliminary work. Some experiments have already been conducted. A crawl of the web was performed and the data is now being analyzed by BnF librarians. In particular, we analyze the relevance of page importance (i.e., PageRank in Google terminology). This notion has been, to a certain extent, validated by the success of the search engines that use it; it was not clear whether it is adapted to web archiving. First results seem to indicate that the correlation between our automatic ranking and the ranking of a librarian is essentially as strong as the correlation between the rankings of two librarians.

Perhaps the most interesting aspect of this archiving work is that it leads us to reconsider notions such as web site or web importance. We believe that this is leading us to a better understanding of the web. We intend to pursue this line of study and try to see how to take advantage of techniques in classification or clustering. Conversely, we intend to use some of the technology developed here to guide the classification and clustering of web pages.

Acknowledgments. We would like to thank Laurent Mignet, Benjamin Nguyen, David Leniniven and Mihai Preda for discussions on the topic.


7. Bibliography

[A N] A NATIONAL LIBRARY OF AUSTRALIA POSITION PAPER, « National Strategy for Provision of Access to Australian Electronic Publications », www.nla.gov.au/policy/paep.html.

[ABI 02] ABITEBOUL S., PREDA M., COBENA G., « Computing web page importance without storing the graph of the web (extended abstract) », IEEE Data Engineering Bulletin, Volume 25, 2002.

[ARV 00] ARVIDSON A., PERSSON K., MANNERHEIM J., « The Kulturarw3 Project - The Royal Swedish Web Archiw3e - An example of 'complete' collection of web pages », 66th IFLA Council and General Conference, 2000, www.ifla.org/IV/ifla66/papers/154-157e.htm.

[BER] BERGMAN M., « The Deep Web: Surfacing Hidden Value », www.brightplanet.com/.

[BRI 98] BRIN S., PAGE L., « The Anatomy of a Large-Scale Hypertextual Web Search Engine », WWW7 Conference, Computer Networks 30(1-7), 1998.

[CHO 00] CHO J., GARCIA-MOLINA H., « Synchronizing a database to improve freshness », SIGMOD, 2000.

[Goo a] GOOGLE, « Google News Search », http://news.google.com/.

[GOO b] GOOGLE, www.google.com/.

[HAL 02] HALKIDI M., NGUYEN B., VARLAMIS I., VAZIRGIANIS M., « THESUS: Organising Web Document Collections based on Semantics and Clustering », Technical Report, 2002.

[HAV 99] HAVELIWALA T., « Efficient Computation of PageRank », Technical report, Stanford University, 1999.

[LAF 01] LAFONTAINE R., « A Delta Format for XML: Identifying changes in XML and representing the changes in XML », XML Europe, 2001.

[MAR 99] MARTIN L., « Networked Electronic Publications Policy », 1999, www.nlc-bnc.ca/9/2/p2-9905-07-f.html.

[MAR 01] MARIAN A., ABITEBOUL S., COBENA G., MIGNET L., « Change-centric Management of Versions in an XML Warehouse », VLDB, 2001.

[MAS 01] MASANÈS J., « The BnF's project for Web archiving », What's next for Digital Deposit Libraries? ECDL Workshop, 2001, www.bnf.fr/pages/infopro/ecdl/france/sld001.htm.

[MAS 02] MASANÈS J., « Préserver les contenus du Web », IVe journées internationales d'études de l'ARSAG - La conservation à l'ère du numérique, 2002.

[MIG 00] MIGNET L., PREDA M., ABITEBOUL S., AILLERET S., AMANN B., MARIAN A., « Acquiring XML pages for a WebHouse », Proceedings of the Bases de Données Avancées conference, 2000.

[PAG 98] PAGE L., BRIN S., MOTWANI R., WINOGRAD T., « The PageRank Citation Ranking: Bringing Order to the Web », 1998.

[RAG 01] RAGHAVAN S., GARCIA-MOLINA H., « Crawling the Hidden Web », The VLDB Journal, 2001.

[xyl] « Xyleme », www.xyleme.com.


About possibilistic queries against possibilistic databases

Laurence Duval* Olivier Pivert**

*IRISA/ENSAI, Campus de Ker Lann, BP 37203, 35170 Bruz Cedex, [email protected]

**IRISA/ENSSAT, Technopole Anticipa, BP 447, 22305 Lannion Cedex, France, [email protected]

RÉSUMÉ. L'interrogation de bases de données contenant des informations imprécises soulève différents problèmes, dont celui de la complexité. Dans ce papier, nous considérons un nouveau type de requêtes, appelées requêtes possibilistes, de la forme : « dans quelle mesure est-il possible que le n-uplet t appartienne à la réponse à la requête Q », où Q est une requête usuelle pouvant faire intervenir des opérations de sélection et de jointure. Le cadre choisi pour modéliser les données est celui des bases de données possibilistes, où une valeur d'attribut mal connue est représentée par une distribution de possibilité. Nous décrivons une méthode permettant d'évaluer de telles requêtes sans avoir à expliciter les mondes possibles associés à la base considérée, le but étant d'éviter l'explosion combinatoire induite par l'approche d'évaluation canonique.

ABSTRACT. Querying databases containing imprecise data raises different problems, among which that of complexity. In this paper, a new type of query is considered. These queries, called possibilistic queries, are of the form: "to what extent is it possible that tuple t belongs to the answer to query Q", where Q denotes a usual query that may involve selection and join operations. The model used is the relational possibilistic database framework, where an ill-known attribute value is represented as a possibility distribution. We describe a method for processing such queries without making explicit the different worlds associated with the database. The main objective of this approach is to avoid the combinatorial growth induced by the canonical evaluation method.

MOTS-CLÉS : théorie des possibilités, données mal connues, requêtes possibilistes, interprétation en mondes.
KEYWORDS: possibility theory, ill-known values, possibilistic queries, world-based interpretation.


1. Introduction

In various domains, there is a growing need for information systems capable of handling ill-known data. This is notably the case for data warehouses, where merging information coming from different sources may introduce imprecision (note that, at present, such imprecision is handled by usual data-cleaning methods). Possibility theory [ZAD 78] provides an ordinal model of uncertainty in which imprecision is represented by means of a preference relation defining a total order over the possible situations. This approach is strongly related to fuzzy sets [ZAD 65], since the idea is to constrain the values a variable may take by a normalized fuzzy set (i.e., one in which at least one element fully belongs to the set). A possibility distribution is a mapping π from a given domain to the unit interval [0, 1], and π(a) gives the possibility degree that the actual value of the variable is a. The normalization condition requires that at least one value a0 be completely possible, i.e., π(a0) = 1. This formalism is particularly well suited to expressing subjective uncertainty described by linguistic terms such as tall, young, rather small, etc. When the domain is finite, a possibility distribution is written π1/a1 + … + πn/an, where ai is a candidate value and πi its possibility degree. In the relational database setting, the possibilistic model offers a unified framework for representing precise values as well as imprecise ones such as crisp or fuzzy intervals and null values (see [BOS 97] for more details). Beyond representation, an important aspect concerns the manipulation of such data; founding work in the 1980s laid some groundwork, particularly for queries involving only simple operations, namely selection and projection [PRA 84]. A selection query thus returns a relation in which each tuple is assigned a pair of degrees expressing respectively the possibility and the necessity that the tuple belongs to the result relation.
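
As a minimal illustration of this notation (our own encoding, not part of the paper), a finite possibility distribution can be represented in Python as a mapping from candidate values to degrees:

    # pi_1/a_1 + ... + pi_n/a_n as a dict from candidate value to degree in [0, 1]
    plane_type = {"a3": 1.0, "a4": 0.3}     # reads: 1/a3 + 0.3/a4

    def is_normalized(dist):
        """At least one candidate a0 must be completely possible: pi(a0) = 1."""
        return max(dist.values()) == 1.0

    assert is_normalized(plane_type)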

It is clear, however, that a query language restricted to selection and projection is hardly satisfactory, insofar as it only allows one relation to be manipulated at a time and thus has very limited expressive power. Taking other operators into account raises serious problems, and it has been shown [IMI 84] that no "simple" model constitutes a strong representation system for the whole set of relational operators. In particular, the possibilistic model (like the OR-sets model [LIP 81]) is not such a representation system, so the result of a query generally consists of a set of worlds that cannot be represented by a single relation of the model. This state of affairs has two unfortunate consequences: on the one hand, it is difficult for a user to interpret an answer made up of a very large number of tables; on the other hand, the evaluation cost is generally prohibitive, since the query must be evaluated on each of the (very numerous) worlds associated with the database under consideration.


In [BOS 01], a relational query language featuring selection, projection and a specific join operator is proposed. The interest of this framework is that it relies on evaluation procedures in which worlds are not explicitly manipulated, so one may hope to reach a reasonable complexity. In the same spirit of feasibility, the idea developed in this paper consists in considering specific queries, called "possibilistic queries" (a concept inspired by the proposals of Abiteboul et al. [ABI 91] and introduced, in a possibilistic setting, in [BOS 02]), for which it is not necessary to make the worlds explicit either. A possibilistic query differs from a classical query in that it refers to one particular tuple. More precisely, its general form is: "to what extent is it possible that tuple t belongs to the result of query Q", where Q denotes a usual relational query. In this work, we restrict ourselves to queries Q containing only selections, joins and a final projection. We also assume that a relation is used only once in a query and that the possibility distributions involved are finite. The study of possibilistic queries is original work since, to our knowledge, no similar research has been undertaken so far.

The paper is organized as follows. In Section 2, possibilistic queries are presented and examples illustrate the combinatorial explosion induced by the "naive" evaluation method (i.e., the canonical method based on making the worlds explicit). Several principles for improving this method are given in Section 3. In Section 4, an evaluation method resting on a different principle (called the "compact" method) is presented. This new strategy, which does not require explicitly computing the different worlds, takes advantage of the principles described in the previous section. The operations involved in the compact computation (selection, join and computation of the final degree) are detailed in Section 5 and illustrated by the example worked out in Section 6. Section 7 is devoted to conclusions and to directions for future work.

2. Introduction to possibilistic queries

We first give the vocabulary used in this paper, then we illustrate through an example the relationship between a possibilistic database and the worlds associated with it.

2.1. Vocabulary and notations

In the following sections, the notations below are used. B denotes a usual relational database and 𝐁 a possibilistic relational


database. Rep(𝐁) denotes the set of worlds (usual databases) associated with 𝐁. The projection, selection and join operations over possibilistic relations are denoted πp, σp and ⋈p, respectively. Finally, Qp denotes a possibilistic query built on the usual query Q.

A possibilistic relation has the same structure as a usual relation. It differs from a usual relation only in that an attribute value may be a possibility distribution (i.e., a weighted disjunctive set of candidate values). In this paper, considering finite possibility distributions implies that the set of worlds associated with a possibilistic database is finite, each of them being a usual relational database to which a possibility degree is attached.

2.2. Introductory example

Example 1. Consider a military database describing pictures of planes taken by satellite. This database contains two relations: Img(#i, t_a, date, lieu) and Avn(t_a, lg, vt). The first one describes images identified by a number (#i), a possibly ill-known plane type (t_a), as well as the place (lieu) and date where the picture was taken. The second one contains the characteristics of the planes under consideration: type (t_a), length (lg) and maximal speed (vt), assumed here to be precise. Consider the following extension of this possibilistic database 𝐁:

Img
#i   t_a             date            lieu
i1   a1              1/d3 + 0.7/d1   l1
i3   1/a3 + 0.3/a4   d1              l2

Avn
t_a  lg  vt
a1   20  1000
a2   25  1200
a3   18  800
a4   20  1200

In this context, an example of a possibilistic query is: "to what extent is it possible that the tuple <d1, 20> belongs to the result of the query Q giving the pairs (d, l) such that there exists (at least) one image taken at date d and showing a plane of length l". The natural semantics of such a query can be expressed with reference to the (more or less possible) worlds associated with the database 𝐁. The above extension of 𝐁 corresponds to four worlds M1, M2, M3, M4, each of them containing the relation Avn and one of the following relations:

Img1 (Π = 1)              Img2 (Π = 0.7)
#i  t_a  date  lieu       #i  t_a  date  lieu
i1  a1   d3    l1         i1  a1   d1    l1
i3  a3   d1    l2         i3  a3   d1    l2

Img3 (Π = 0.3)            Img4 (Π = 0.3)
#i  t_a  date  lieu       #i  t_a  date  lieu
i1  a1   d1    l1         i1  a1   d3    l1
i3  a4   d1    l2         i3  a4   d1    l2

where Π indicates the possibility degree of the corresponding world. One can see that only the worlds M2, M3 and M4 can "generate" the tuple t, and one concludes that it is possible at degree 0.7 that the tuple <d1, 20> belongs to the result of the query Q ♦

In order to illustrate the complexity tied to the world-based evaluation of possibilistic queries, we will consider in the sequel the possibilistic database 𝐁' composed of the previous relation Avn and of the following possibilistic relation Img':

Img'
#i   t_a                                date            lieu
i1   1/a2 + 0.5/a1                      1/d3 + 0.7/d1   1/l1 + 0.8/l2
i2   1/a3 + 0.8/a2 + 0.5/a4 + 0.2/a1    1/d2 + 0.2/d3   1/l3 + 0.7/l4
i3   1/a3 + 0.3/a4                      d1              l1
i4   1/a4 + 0.2/a1                      1/d1 + 0.2/d3   l2

This database, although it contains only 8 tuples, has exactly 1024 interpretations, which illustrates the combinatorial nature of the problem and shows that it is unrealistic to evaluate a possibilistic query Qp by applying the query Q to each of the worlds associated with the database at hand. It is therefore imperative to devise efficient evaluation methods.
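
The count is easy to check: each tuple contributes the product of the numbers of candidates of its attributes (the precise #i values contribute a factor 1 and are omitted). A sketch, using our dict encoding of distributions:

    from math import prod

    def interpretations_count(relation):
        """Number of worlds: product, over tuples and attributes, of the
        number of candidate values."""
        return prod(len(dist) for tup in relation for dist in tup)

    img_prime = [   # relation Img', attribute #i omitted (precise)
        ({"a2": 1, "a1": .5}, {"d3": 1, "d1": .7}, {"l1": 1, "l2": .8}),
        ({"a3": 1, "a2": .8, "a4": .5, "a1": .2}, {"d2": 1, "d3": .2}, {"l3": 1, "l4": .7}),
        ({"a3": 1, "a4": .3}, {"d1": 1}, {"l1": 1}),
        ({"a4": 1, "a1": .2}, {"d1": 1, "d3": .2}, {"l2": 1}),
    ]
    print(interpretations_count(img_prime))   # -> 1024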

However, while the world-based approach is unsatisfactory from a computational point of view, it remains the reference from a semantic point of view, i.e., it provides the basis for proving the validity of any other approach.

3. Principles of possibilistic query evaluation

In this section, we describe the canonical evaluation approach, which consists in making explicit the worlds associated with the initial possibilistic database 𝐁, and we exhibit some improvements that limit the number of worlds involved in the computation. These improvements will be reused in Section 4 in the compact evaluation strategy that we propose.


3.1. Initial naive strategy

As illustrated in the previous section, a naive (and inefficient) evaluation strategy consists in: i) computing all the possible worlds of rep(𝐁), ii) evaluating Q on each of these worlds, iii) looking for the world with the highest possibility degree containing the tuple t. The corresponding evaluation algorithm is the following:

µ ← 0;
for each B in rep(𝐁) do
    Res ← Q(B);
    if σt(Res) ≠ ∅ then µ ← max(Π(B), µ) end if
end for

where Π(B) denotes the possibility degree attached to the world B. The selection condition σt aims at selecting the tuple t and is defined as follows:

    σt = (A1 = t.A1) and (A2 = t.A2) … and (An = t.An)

where A1, ..., An are the attributes appearing in t.
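
For concreteness, here is a direct Python transcription of this naive procedure under the same dict encoding as in the earlier sketches; q stands for any evaluator of the usual query Q over a precise relation and is left to the caller.

    from itertools import product

    def worlds(relation):
        """Yield (degree, precise_rows) pairs: Pi(B) is the minimum of the
        degrees of the candidate values chosen to build the world B."""
        choices_per_tuple = [list(product(*(d.items() for d in tup)))
                             for tup in relation]
        for choice in product(*choices_per_tuple):
            rows = [tuple(val for val, _ in tup) for tup in choice]
            degree = min(deg for tup in choice for _, deg in tup)
            yield degree, rows

    def naive_degree(relation, q, target):
        mu = 0.0
        for degree, world in worlds(relation):
            if target in q(world):            # sigma_t(Res) is non-empty
                mu = max(mu, degree)
        return mu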

As illustrated before (see the last example of Section 2.2), this procedure is not viable in general, because of the combinatorial explosion of the number of worlds. To improve its efficiency, one can try to reduce the number of worlds to be considered. This approach, illustrated in the next two subsections, nevertheless has its limits, and we will see further on that a different evaluation philosophy must be adopted if realistic performance is to be obtained.

3.2. Removing useless attributes

The first improvement consists in removing from the possibilistic database the attributes that do not appear in Q, which is all the more beneficial when those attributes take imprecise values. Let 𝐁' denote the possibilistic database obtained this way. The evaluation procedure becomes:

µ ← 0;
for each B in rep(𝐁') do
    Res ← Q(B);
    if σt(Res) ≠ ∅ then
        µ ← max(Π(B), µ)
    end if
end for


Example 2. Let us come back to the previous example, and assume that we want to evaluate the possibility degree that the tuple <d1, 20> belongs to the join of the tables Img' and Avn. The attributes that can be removed are #i, lieu and vt. Consequently, the evaluation of the possibilistic query entails computing the worlds associated with the database composed of the following two relations (the removed attributes, shown greyed in the original, are omitted below):

Img''
t_a                                date
1/a2 + 0.5/a1                      1/d3 + 0.7/d1
1/a3 + 0.8/a2 + 0.5/a4 + 0.2/a1    1/d2 + 0.2/d3
1/a3 + 0.3/a4                      d1
1/a4 + 0.2/a1                      1/d1 + 0.2/d3

Avn'
t_a  lg
a1   20
a2   25
a3   18
a4   20

In this case, the number of interpretations has been divided by 4 (it now equals 256), but it remains very high for an initial possibilistic database containing 8 tuples, and we are still far from an evaluation method that is "realistic" in terms of performance ♦

3.3. Using the target tuple

Still with the aim of improving the process, a second treatment consists in removing from the relations the tuples that cannot generate the target tuple t. More precisely, the technique consists in eliminating candidate values (which, in some cases, makes whole tuples disappear) before computing the worlds. To do so, we must define selection based on a criterion of the form "attribute = constant" in the context of possibilistic relations.

The idea is to filter the values of the attributes which, in Q, take part in the final projection (i.e., those appearing in the target tuple). For each tuple u of a possibilistic relation r involved in the query, only the attribute values appearing in the target tuple are kept. If, for an attribute A of a tuple u, none of the candidate values equals t.A, the tuple u is removed. Formally, a selection of the type "attribute = constant" over a possibilistic relation of schema R(A, X) is defined as follows:

    select(r, A = t.A) = {v = <π/t.A, u.X> | ∃ u ∈ r such that u.A = … + π/t.A + …}   (1)

We denote by σpt(𝐁') the operation: select(𝐁', A1 = t.A1) and ... and select(𝐁', An = t.An), where A1, ..., An are the attributes appearing in the target tuple t. Implicitly, σpt(𝐁')


means that the selection is performed on every relation r of 𝐁' containing at least one of the attributes Ai. The corresponding algorithm is the following:

µ ← 0;
for each B in rep(σpt(𝐁')) do
    Res ← Q(B);
    if Res ≠ ∅ then µ ← max(Π(B), µ) end if
end for

Example 3. In the previous example, this treatment results in the removal of all date values (resp. length values) other than d1 (resp. 20); in particular, the tuple i2 disappears since none of its date candidates equals d1.

Image'''
t_a             date
1/a2 + 0.5/a1   0.7/d1
1/a3 + 0.3/a4   d1
1/a4 + 0.2/a1   d1

Avion''
t_a  lg
a1   20
a4   20

Thus, the evaluation of the query Q now involves only 8 worlds (instead of 256) ♦

All these improvements can significantly reduce the number of worlds that have to be computed, but in general they do not yield a reasonable complexity, since the combinatorial nature of the problem persists. In the next section, we describe the general principle of a strategy that avoids making the worlds explicit.

4. The four steps of the compact evaluation of a possibilistic query

The key idea is not to compute all the possible worlds. In other words, the objective is to perform a compact computation, i.e., an evaluation manipulating only possibilistic relations. The following remark supports the legitimacy of this approach.

Recall that, while the possibilistic model is in general not rich enough to represent the exact answer to a relational query Q (because of the need to represent "disjunctive" tuples, i.e., tuples that cannot coexist in the same world), this problem disappears in the context of possibilistic queries. Such a query indeed refers to a single tuple, so one need not worry about the coexistence of "incompatible" tuples in the intermediate relations that are manipulated (we are not trying to represent the result of Q). This remark will be exploited in Section 5, which is devoted to the definition of the operators.


The evaluation strategy we propose incorporates the improvements described in Section 3 and comprises the following four steps:

1) Removal of useless attributes, i.e., those of the possibilistic relations used in Q that do not appear in the query Q (cf. 3.2);

2) Propagation of the values of the target-tuple attributes appearing in Q. This step amounts to introducing selections that retain only the tuples likely to contribute to the generation of the target tuple (cf. 3.3);

3) Compact evaluation of Q, which requires defining appropriate selection and join operations (these points, which constitute the heart of the contribution, are described in Section 5);

4) Computation of the desired possibility degree from the relation obtained at step 3.

The next section details each of these steps and notably provides the definition of the selection and join operations in the framework of a compact evaluation.

5. Definition of the compact computation operations

In this section, we first describe the queries for which we have been able to define a compact evaluation procedure. We then address the problems raised by the existence of dependencies between attributes (or, more precisely, between candidate values attached to several attributes) in subquery results, and we detail steps 3 and 4 of the process described above.

5.1. Accepted queries

Even though the evaluation of possibilistic queries raises fewer problems than that of classical queries, it is nevertheless not trivial. This leads us to restrict the family of possibilistic queries considered in this paper. The restrictions concern the usual relational query Q referenced in the possibilistic query. As mentioned before, we choose to restrict ourselves to queries involving only selections and joins (operations which constitute the core of many usual queries).

A second restriction concerns the impossibility of manipulating several copies of the same relation (notably the case of a self-join). The reason is that a query referencing several copies of the same relation and involving a comparison between attributes coming from these copies generates dependencies between candidate values. The problem of dependencies between attributes is dealt with in the next subsection, and we will see that, in most cases, it can


be overcome. However, our approach provides no solution when several copies of a relation are manipulated, so in the sequel we restrict ourselves to queries that do not exhibit this feature.

5.2. Problems due to dependencies between attributes

Independence between attributes is assumed in a possibilistic relation, in the sense that, for a given tuple, the choice of a value for one attribute is not conditioned by a prior choice made for another attribute of the same tuple. In other words, the model implicitly stipulates that the distributions used within a given tuple are independent, since the set of interpretations is obtained by a Cartesian product over the candidate values. While generally acceptable for the initial relations, this assumption no longer holds once selection and join operations are applied. It should be stressed that the dependencies at stake here are not functional dependencies, but dependencies between candidate values in the result of a (sub)query.

A problem appears in two situations: i) when a condition involves a disjunction of predicates referencing several attributes (e.g., A > 4 or B = 12) that are used in subsequent operations; ii) when a condition involves a comparison of attributes (e.g., A > B) that are used in subsequent operations. In both cases, in order to properly handle the subsequent operations, all the satisfactory combinations of attribute values must be kept along with their degrees.

Example 4. Consider the schema R(A, B, C) and a query Q including the condition "A > 100 or B < 20", under the assumption that A and B are referenced by other operations of Q. Assume that the A-values of a tuple u of r(R) are 1/90 + 0.7/80 + 0.2/120 and its B-values are 1/30 + 0.5/10. The correct result of this operation is made of the pairs (1/90, 0.5/10), (0.7/80, 0.5/10), (0.2/120, 1/30), (0.2/120, 0.5/10), and one can see that this result cannot be represented with independent A and B values ♦

Example 5. Consider the previous schema R(A, B, C) and the condition (A > B) applied to the relation r containing the tuple v = <1/2 + 0.7/4 + 0.5/9, 1/2 + 0.3/8, c>. Here, three pairs of values are satisfactory: 0.7/<4, 2>, 0.5/<9, 2> and 0.3/<9, 8>, but they cannot be represented as a Cartesian product over A and B ♦


In the context of possibilistic queries, however, dependencies between attributes are not a major problem, insofar as it is possible to manipulate relations containing disjunctive tuples. This method is legitimate since: i) one looks for the best alternative, i.e., the best representation of the target tuple in the result (the final degree is computed by means of a maximum), and ii) the operations allowed in Q (selection and join) do not require the independence of tuples in the intermediate relations (in particular, the queries require no cardinality computation). In Example 4, the relation resulting from the selection would thus contain the four distinct tuples generated by u.

5.3. Selection

We consider two types of atomic selection conditions: those comparing an attribute value with a constant (type 1), and those comparing two attributes (type 2). We first define the corresponding operations, then we deal with the composition of selection conditions.

5.3.1. Criterion of the form: Attribute θ value

We consider here a selection condition of the type "A θ v" (where A denotes an attribute, v a constant and θ a comparator) applied to a relation r of schema R(A, X). A tuple u of r generates a tuple in the resulting relation if at least one candidate value ai in u.A satisfies the condition (ai θ v). The resulting tuple u' has the same values as u for every attribute except A: u'.A contains only the candidates ai of u.A that satisfy the condition (ai θ v). This selection operation can be defined as follows:

    sel(r, A θ v) = {<restr(u.A, A θ v), u.X> | u ∈ r}   (2)

where restr(u.A, A θ v) = {πi/ai | ai ∈ u.A, πi = πu.A(ai) and ai θ v} (this operation restricts the possibility distribution u.A to the candidate values satisfying the condition).
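
A sketch of restr and of this selection, in the dict encoding used earlier (tuples as dicts from attribute to distribution; precise values are one-candidate distributions):

    def restr(dist, pred):
        """Restrict a distribution to the candidates satisfying the condition."""
        return {a: p for a, p in dist.items() if pred(a)}

    def sel_const(r, A, pred):
        """sel(r, A theta v): a tuple survives iff restr(u.A, ...) is non-empty."""
        out = []
        for u in r:
            kept = restr(u[A], pred)
            if kept:
                out.append({**u, A: kept})
        return out

    r = [{"t_a": {"a2": 1, "a1": .5}, "date": {"d3": 1, "d1": .7}}]
    print(sel_const(r, "date", lambda d: d == "d1"))
    # -> [{'t_a': {'a2': 1, 'a1': 0.5}, 'date': {'d1': 0.7}}]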

Justification. To prove the validity of the approach, one must show that the result obtained is the compact representation of the union of the result tables (worlds) that would have been obtained by applying the selection to each world of the initial relation. Let r be the initial possibilistic relation. Let Res be the compact answer obtained by applying formula (2), and let E be the set of answer worlds obtained by the non-compact strategy (i.e., the one based on making the worlds explicit). Let us first show that every tuple t of


every world M0 of E can be generated from a tuple of Res. Let α be the possibility degree attached to the world M0. The tuple t belongs to a world of the answer, hence it satisfies t.A θ v. This implies that t existed in the world M of r (from which M0 derives by selection), which also has degree α. M being a world of r, we deduce that there existed in r a tuple t' such that t'.A contained the value t.A with a degree β greater than or equal to α. From this we deduce (see definition (2)) that the compact result Res = sel(r, A θ v) contains a tuple t'' such that t''.A contains the candidate value t.A with degree β.

Conversely, let us show that every "precise" tuple generated from a tuple of Res belongs to at least one world of the answer. Let s be a tuple of Res. By construction (cf. definition (2)), there exists a tuple s' belonging to r such that s'.A includes s.A. A choice ai that is licit for s.A is also licit for s'.A, so one can build at least one world M containing a tuple s'' generated from s by choosing ai for attribute A. Since the tuple s belongs to Res, we have: ∀ai ∈ s.A, ai θ v, and the tuple s'' therefore also belongs to the world (i.e., in this case, the relation) M' obtained by applying the selection to the world M, that is, to one of the worlds of the answer obtained by a non-compact computation. If the degree attached to ai in s.A (and hence in s'.A) is α, the degree of the world M (and hence of M') will be, by construction, less than or equal to α (the degree of a world is obtained by taking the minimum of the degrees attached to the candidate values chosen to generate it). The final-degree computation step (cf. Section 5.5) in turn ensures that the degree computed for the target tuple is indeed equal to the maximal degree of the answer worlds containing that tuple.

5.3.2. Criterion of the form Attribute1 θ Attribute2

In the case of a condition of the type "A θ B" over a relation r whose schema is R(A, B, X), a tuple u generates a tuple u' in the answer if at least one candidate value of u.A is θ-related to at least one candidate value of u.B. As mentioned in 5.2, a tuple u may generate several disjunctive tuples in the resulting relation. The principle consists in generating one tuple u'i for each satisfactory candidate value ai of u.A: the value of u'i.A is ai with its degree, and the value of u'i.B is the distribution u.B restricted to the values bj satisfying the condition ai θ bj. The number of tuples generated by u is thus equal to the number of candidate values of u.A that are θ-related to at least one candidate value of u.B (note that we could just as well have considered the symmetric solution of generating one tuple for each candidate value of u.B). More formally, this selection operation can be defined as follows:

    sel(r, A θ B) = {<αi/ai, restr(u.B, ai θ B), u.X> | ∃ u ∈ r such that u.A = α1/a1 + ... + αi/ai + ... + αn/an and restr(u.B, ai θ B) ≠ ∅}   (3)


Example 6. Consider the following relation, which describes satellite images of planes. Each image is assumed to show two planes. The plane types are assumed to be ill-known, owing to the imprecision inherent in the recognition process.

#image  avion1                      avion2
1       1/a1 + 1/a2 + 0.7/a3        0.8/a2 + 0.2/a3
2       0.2/a2 + 0.8/a3 + 0.2/a4    0.6/a1 + 0.1/a4

Assume the subquery of interest is: sel(r, avion1 ≠ avion2). The result obtained is represented by the following table:

#image  avion1    avion2
1       1/a1      0.8/a2 + 0.2/a3
1       1/a2      0.2/a3
1       0.7/a3    0.8/a2
2       0.2/a2    0.6/a1 + 0.1/a4
2       0.8/a3    0.6/a1 + 0.1/a4
2       0.2/a4    0.6/a1  ♦
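
The following sketch reproduces this result with the same encoding as before; each satisfactory candidate of avion1 yields one disjunctive tuple.

    def restr(dist, pred):
        return {b: p for b, p in dist.items() if pred(b)}

    def sel_attrs(r, A, B, theta):
        """sel(r, A theta B), definition (3)."""
        out = []
        for u in r:
            for a, alpha in u[A].items():
                kept = restr(u[B], lambda b: theta(a, b))
                if kept:
                    out.append({**u, A: {a: alpha}, B: kept})
        return out

    r = [
        {"#image": 1, "avion1": {"a1": 1, "a2": 1, "a3": .7},
         "avion2": {"a2": .8, "a3": .2}},
        {"#image": 2, "avion1": {"a2": .2, "a3": .8, "a4": .2},
         "avion2": {"a1": .6, "a4": .1}},
    ]
    for t in sel_attrs(r, "avion1", "avion2", lambda a, b: a != b):
        print(t)    # the six disjunctive tuples of the table above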

Definition (3) can be justified by following a reasoning similar to the one detailed in 5.3.1 for criteria of the type A θ v. For space reasons, this proof is omitted here, and we will likewise not detail the proofs concerning the following operators, all of which rest on the same kind of reasoning (in each case, one can show that the table obtained by compact evaluation is a compact representation of the union of the worlds constituting the answer that would be obtained with the strategy based on making the worlds explicit).

5.3.3. Disjunctive criteria

Consider, as a reference example for disjunctive selection criteria, the following condition: "(A θ v1) or (B θ v2)" applied to a relation r of schema R(A, B, X). A tuple u of r (possibly) satisfies the condition if at least one candidate value a of u.A satisfies (a θ v1) or if a candidate value b of u.B satisfies (b θ v2). Applied to a given tuple, this selection generates two tuples in the answer: the first one includes all the candidate values of u.A satisfying the condition A θ v1, associated with all the candidate values of u.B, while the second one is made of all the satisfactory candidate values not in the previous tuple, namely all the candidate values of u.B satisfying the condition B θ v2 together with all the candidate values of u.A that do not satisfy A θ v1. This operation can be formalized as follows:


    sel(r, (A θ v1) or (B θ v2)) = {<restr(u.A, A θ v1), u.B, u.X> | u ∈ r} ∪ {<restr(u.A, ¬(A θ v1)), restr(u.B, B θ v2), u.X> | u ∈ r}   (4)

The other disjunctive selection conditions can be handled similarly.

5.3.4. Conjunctive criteria

As far as conjunctive selection conditions are concerned, they can generally be computed using the "usual" possibilistic model, i.e., without disjunctive tuples (a given tuple generates at most one tuple). This is however not possible in special cases such as "(A1 θ1 A2) and (A1 θ2 A3)", i.e., when a given attribute is used in two attribute comparisons. Such a criterion indeed induces dependencies between attributes, and one must then resort to disjunctive tuples as in the case of disjunctive selection conditions (this special case is not detailed here for lack of space).

5.4. Join

The definition of the join operation is similar to that of the selection involving a type-2 condition. When two tuples can be joined, all the matching candidate values are kept and as many tuples as necessary are generated. More formally, if we consider the relations r and s with respective schemas R(A, X) and S(B, Y) and the condition A θ B, the join operation is defined as follows:

    join(r, s, A θ B) = {<αi/ai, restr(v.B, ai θ B), u.X, v.Y> | ∃ u ∈ r, ∃ v ∈ s such that u.A = α1/a1 + ... + αi/ai + ... + αn/an and restr(v.B, ai θ B) ≠ ∅}   (5)
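
Under the same encoding, here is a sketch of definition (5); the "1"/"2" suffixes on attribute names are our own convention, mimicking the renaming used in the SL table of Section 6.

    def restr(dist, pred):
        return {b: p for b, p in dist.items() if pred(b)}

    def join(r, s, A, B, theta):
        """join(r, s, A theta B): one tuple per pair (u, v) and per candidate
        a_i of u.A matching at least one candidate of v.B."""
        out = []
        for u in r:
            for v in s:
                for a, alpha in u[A].items():
                    kept = restr(v[B], lambda b: theta(a, b))
                    if kept:
                        t = {k + "1": x for k, x in u.items()}
                        t.update({k + "2": x for k, x in v.items()})
                        t[A + "1"] = {a: alpha}
                        t[B + "2"] = kept
                        out.append(t)
        return out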

5.5. Computation of the final degree

The role of the fourth and last step is to handle the final projection present in Q. By construction, the projection gives birth to a single tuple, i.e., the target tuple t. The only question lies in determining the degree attached to this tuple, i.e., the value that must be returned to the user. Let r' be the relation obtained at the end of step 3. The possibility degree attached to a tuple of r' is the minimum (over all attributes) of the maximum of the degrees attached to the candidate values (for a given attribute). According to the semantics of the projection (based on an existential quantifier), the final degree attached to the target tuple equals the maximum of the degrees of the tuples of r'.


Example 7. Assume that, in the context of a possibilistic query of the type "to what extent is it possible that the pair (x, y) belongs to the result of query Q", step 3 yields the following table:

A                      X   B                 Y
1/a1 + 1/a2 + 0.7/a3   x   0.8/b2 + 0.2/b3   y
0.8/a3                 x   0.6/b1 + 0.1/b4   y

The final step of computing the degree attached to the pair (x, y) yields:

    max(min(max(1, 1, 0.7), max(0.8, 0.2)), min(0.8, max(0.6, 0.1))) = 0.8 ♦
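
A sketch of this final step, reproducing the computation of Example 7 (precise values x and y are encoded as degree-1 candidates):

    def tuple_degree(u):
        """min over the attributes of the max degree of their candidates."""
        return min(max(dist.values()) for dist in u.values())

    def final_degree(r_prime):
        """Semantics of the final projection: max over the tuples of r'."""
        return max(map(tuple_degree, r_prime), default=0.0)

    r_prime = [
        {"A": {"a1": 1, "a2": 1, "a3": .7}, "X": {"x": 1},
         "B": {"b2": .8, "b3": .2}, "Y": {"y": 1}},
        {"A": {"a3": .8}, "X": {"x": 1},
         "B": {"b1": .6, "b4": .1}, "Y": {"y": 1}},
    ]
    print(final_degree(r_prime))   # -> 0.8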

6. Detailed example

In this section, we go through the different evaluation steps on an example. Consider the following two relations:

S
#i  t_a             date  lieu
i1  1/a2 + 0.5/a1   d1    1/l2 + 0.8/l1
i2  1/a3 + 0.8/a2   d2    1/l4 + 0.7/l1
i3  1/a3 + 0.7/a2   d1    l1
i4  1/a4 + 0.2/a1   d1    l2

L
#i  t_a             date  lieu
k1  1/a4 + 0.6/a2   d3    1/l1 + 0.7/l3
k2  1/a2 + 0.8/a4   d1    1/l1 + 0.5/l4
k3  1/a4 + 0.8/a2   d1    1/l2 + 0.6/l1
k4  1/a5 + 0.7/a3   d1    l1

These two relations have the same schema as the relation Img of Example 2.2 and describe collections of images taken by two different satellites. Consider the following possibilistic query: what is the possibility degree attached to the value a2 in the result of the query projt_a(select(join_{t_a = t_a}(select(S, date = 'd1'), select(L, date = 'd1')), lieu1 = lieu2))? In other words, we ask to what extent it is possible that a plane of type a2 was observed at the same place on date d1 by both satellites at once.

The first step removes attribute #i from S and from L (all the other attributes take part in the query). The second step keeps, in S and in L, only


the tuples having a2 among the candidate values of t_a. This leads to the following two relations:

S'
t_a     date  lieu
1/a2    d1    1/l2 + 0.8/l1
0.8/a2  d2    1/l4 + 0.7/l1
0.7/a2  d1    l1

L'
t_a     date  lieu
0.6/a2  d3    1/l1 + 0.7/l3
1/a2    d1    1/l1 + 0.5/l4
0.8/a2  d1    1/l2 + 0.6/l1

The selections (date = 'd1') can then be performed, and the previous tables give rise to:

S''
t_a     date  lieu
1/a2    d1    1/l2 + 0.8/l1
0.7/a2  d1    l1

L''
t_a     date  lieu
1/a2    d1    1/l1 + 0.5/l4
0.8/a2  d1    1/l2 + 0.6/l1

Then the equi-join of S'' and L'' on attribute t_a is computed, which produces the relation:

SL
t_a1    date1  lieu1           t_a2    date2  lieu2
1/a2    d1     1/l2 + 0.8/l1   1/a2    d1     1/l1 + 0.5/l4
1/a2    d1     1/l2 + 0.8/l1   0.8/a2  d1     1/l2 + 0.6/l1
0.7/a2  d1     l1              1/a2    d1     1/l1 + 0.5/l4
0.7/a2  d1     l1              0.8/a2  d1     1/l2 + 0.6/l1

The selection lieu1 = lieu2 is then executed, which yields the relation:

SL'
t_a1    date1  lieu1    t_a2    date2  lieu2
1/a2    d1     0.8/l1   1/a2    d1     1/l1
1/a2    d1     1/l2     0.8/a2  d1     1/l2
1/a2    d1     0.8/l1   0.8/a2  d1     0.6/l1
0.7/a2  d1     l1       1/a2    d1     1/l1
0.7/a2  d1     l1       0.8/a2  d1     0.6/l1

The last step yields the final degree attached to the target value (a2):

    max(min(1, 0.8, 1, 1), min(1, 1, 0.8, 1), min(1, 0.8, 0.8, 0.6), min(0.7, 1, 1), min(0.7, 0.8, 0.6)) = 0.8.


7. Conclusion

This paper deals with the querying of possibilistic databases, i.e., databases in which some attribute values may be ill-known and represented by possibility distributions. More precisely, a new type of query (called possibilistic queries) has been studied, with the aim of avoiding the combinatorial explosion induced by the evaluation of classical queries, which requires computing the worlds associated with the database under consideration. The general form of a possibilistic query is: "to what extent is it possible that tuple t belongs to the result of a usual query Q". A four-step evaluation procedure has been described, whose major interest is that it performs the computation within the initial compact model, i.e., without making the worlds explicit. In its current state, this procedure can only handle relational queries involving selections and joins. This restriction does not appear too severe, however, insofar as many queries in practice rely on these operators. Moreover, it does not seem unreasonable to think that the approach can be extended to other operators. In any case, the present contribution constitutes a promising step, as it provides an answer to the complexity problem.

Several directions for future work are to be considered. First, it would be worth studying how the approach can be broadened so as to handle a wider range of queries. In particular, taking set-oriented operators into account is an interesting research topic. Second, it would be useful to study in depth the complexity of the evaluation process. In particular, for a given possibilistic query involving a usual query Q, this complexity should be compared with that obtained when evaluating Q against a classical database. Finally, in order to establish the feasibility of such an approach, some more technical implementation aspects must also be studied, starting with the way(s) a possibilistic database can be implemented with the modeling tools offered by current commercial DBMSs.

Bibliography

[ABI 91] ABITEBOUL S., KANELLAKIS P., GRAHNE G., « On the representation and querying of sets of possible worlds », Theoretical Computer Science, vol. 78, 1991, p. 159-187.

[BOS 97] BOSC P., PRADE H., « An introduction to fuzzy set and possibility theory-based approaches to the treatment of uncertainty and imprecision in data base management systems », In: Uncertainty Management in Information Systems – From Needs to Solutions, (Motro A. and Smets P. Eds.), Kluwer Academic Publishers, 1997, p. 285-324.


[BOS 01] BOSC P., LIÉTARD L., PIVERT O., « A function-based join for the manipulation of possibilistic relations », Proc. of the 16th ACM Conference on Applied Computing SAC'2001, Las Vegas, USA, 2001, p. 472-476.

[BOS 02] BOSC P., DUVAL L., PIVERT O., « About possibilistic queries against possibilistic databases », Proc. of the 17th ACM Symposium on Applied Computing SAC'2002, Madrid, Spain, 2002, p. 807-811.

[IMI 84] IMIELINSKI T., LIPSKI W., « Incomplete information in relational databases », Journal of the ACM, vol. 31, 1984, p. 761-791.

[LIP 81] LIPSKI W., « On databases with incomplete information », Journal of the ACM, vol. 28, 1981, p. 41-70.

[PRA 84] PRADE H., TESTEMALE C., « Generalizing database relational algebra for the treatment of incomplete/uncertain information and vague queries », Information Sciences, vol. 34, 1984, p. 115-143.

[ZAD 65] ZADEH L.A., « Fuzzy sets », Information and Control, vol. 8, 1965, p. 338-353.

[ZAD 78] ZADEH L.A., « Fuzzy sets as a basis for a theory of possibility », Fuzzy Sets and Systems, vol. 1, 1978, p. 3-28.


Session 8
Invited talk


Data banks and databases in molecular biology: from data to structure

Eric Viara

SYSRA & INFOBIOGEN, 523, [email protected], http://www.sysra.com/viara

ABSTRACT.

Over the last few years, genomics projects have produced large volumes of data, on the order of several terabytes. These volumes have been growing exponentially since the early 1980s, and the current trend is accelerating with the advent of new technologies: massive sequencing, transcriptomics, proteomics, new genotyping technologies...

The current situation in bioinformatics is strongly marked by the approaches that prevailed in the past, when, on the one hand, the information was small enough to be managed with low-performance computing techniques and, on the other hand, the data types were far less diversified and the data themselves far less interdependent than they are today. As a result, the data are still mostly:

— scattered across a multitude of data banks,
— stored in syntactically heterogeneous formats,
— not available in database management systems (DBMSs), but distributed as flat files,
— modeled in these various banks according to heterogeneous semantics that are difficult to relate to one another.

This state of affairs, together with the growth of distributed processing across the Internet, calls for new approaches to the management of these data.

This talk presents:

— the current situation in bioinformatics, through an overview of the data banks and processing tools operated by INFOBIOGEN (http://www.infobiogen.fr), the technologies used and the various data integration systems,
— a federative, object-oriented approach developed by SYSRA (http://www.sysra.com) and INFOBIOGEN in the context of a project for integrating and manipulating genomic and proteomic data, based on the object DBMS EyeDB (http://www.eyedb.com).


Session 9
Logic and databases


Semantics of Datalog programs with negation under non-uniform hypotheses

Yann Loyer* — Nicolas Spyratos**

* Istituto di Elaborazione della Informazione, Consiglio Nazionale delle Ricerche, Area della Ricerca CNR di Pisa, Via Moruzzi 1, I-56124 Pisa

** Laboratoire de Recherche en Informatique, UMR 8623, Université de Paris Sud, Bât. 490, 91405 Orsay

loyer,[email protected]


ABSTRACT. A precise meaning or semantics must be associated with any logic program or deductive database, even in the presence of incomplete information. The different semantics that can be assigned to a logic program correspond to different assumptions made concerning the atoms whose logical values cannot be inferred from the rules. We propose to unify and extend the assumption-based approaches by allowing assumptions to be non-uniform. To deal with such assumptions, we extend the concept of unfounded set of Van Gelder to the notion of support of a hypothesis. Based on the support of a hypothesis, we define our hypothesis-founded semantics and show that this semantics generalizes both the Kripke-Kleene semantics and the well-founded semantics of Datalog programs with negation.

KEYWORDS: logic and databases, semantics of deductive databases, non-monotonic reasoning, hypotheses.


1. Introduction

In the 1970s, Minsky [MIN75] and McCarthy [McC77] showed that classical logic is not adequate to capture the nature of human common-sense reasoning, mainly because of the non-monotonic character of the latter. According to Przymusinski [PRZ89], the non-monotonicity of human reasoning is due to the fact that our knowledge of the world is almost always incomplete, which consequently leads us to reason despite the absence of complete information, and to often revise our conclusions when new information becomes available.

Deductive databases, logic programming and non-monotonic reasoning are strongly related. The first two implement negation using various non-monotonic operators, and can also be used as inference tools to implement other non-monotonic formalisms [MIN95, PRZ89, REI86].

Despite the absence of complete information, a semantics must be associated with any logic program or deductive database. Two main approaches have been developed. The first is based on Clark's completion [CLA78, FIT85, KUN87, LLO84] and led to the Kripke-Kleene semantics defined by Fitting [FIT85]. Within the second, several semantics have been proposed: the stable model semantics [GEL88], based on autoepistemic logic; the default model semantics [BID88], based on default logic; the perfect model semantics [PRZ88], based on circumscription; and, finally, the well-founded semantics, which extends the previous approaches to any logic program and is equivalent to appropriate forms of the four major formalizations of non-monotonic reasoning (McCarthy's circumscription [McC80, McC86], Reiter's closed world assumption [REI78], Moore's autoepistemic logic [MOO85] and Reiter's default theory [REI80]). The well-founded semantics is one of the most studied approaches to negation in deductive databases [ALF95, BRA98, CHE95, CHE96, FIT93, ZUK97]. It approximates the stable model semantics [GEL88], another major approach to negation in deductive databases [FIT01, VnG91], and is useful for the efficient computation of stable models [NIE96, SUB95].

In [LOY99], we showed that the different semantics that can be associated with a deductive database or logic program correspond to different assumptions concerning the atoms whose logical value, or truth value, cannot be inferred from the program. For instance, the well-founded semantics corresponds to the assumption that all these atoms are false (closed world assumption), and the Kripke-Kleene semantics corresponds to the assumption that all these atoms are unknown (open world assumption). We believe that we should not be limited to these two approaches, but that we should be able to associate with a Datalog program with negation a semantics founded on any given hypothesis representing our assumed or default knowledge.


It does not seem realistic to model human common-sense reasoning by restricting ourselves to assuming that everything is false or that everything is unknown. The need for non-uniform hypotheses to associate a semantics with a logic program has already been demonstrated in the field of information integration [FUH97]. In that paper, the authors wish to use the closed world assumption for some predicates, e.g. to define an author relation, and the open world assumption for other predicates, e.g. to define a friends relation, and they propose to modify the program to simulate this default knowledge. Another domain in which the use of non-uniform hypotheses seems useful and natural is the integration of information coming from different sources. The following example illustrates an information integration problem where the information consists of a set of facts that a central server collects from various sources and tries to combine using (a) a set of logical rules, i.e. a logic program, and (b) a hypothesis representing the server's estimate.

Example 1 Consider a situation in which a judge (the central server) must decide whether or not to indict a person for an offence. To do so, the judge starts by collecting information from two different sources: the prosecutor and the person's lawyer. The judge then combines the collected information using a set of rules in order to reach a decision. In this example, suppose the judge has collected the set of facts F = {¬témoin(John), amis(John, Ted)} (the predicates témoin, amis, mobile and inculpé read witness, friends, motive and indicted, respectively), which he combines using the following set of rules R (see footnote 1):

R:
suspect(X) ← mobile(X)
suspect(X) ← témoin(X)
suspect(X) ← suspect'(X)
innocent(X) ← alibi(X,Y) ∧ ¬amis(X,Y)
innocent(X) ← innocent'(X) ∧ ¬suspect(X)
amis(X,Y) ← amis(Y,X)
amis(X,Y) ← amis(X,Z) ∧ amis(Z,Y)
inculpé(X) ← suspect(X)
inculpé(X) ← ¬innocent(X)

The first fact of F states that there is no witness against John, i.e. the fact témoin(John) is false. The second fact of F states that Ted is a friend of John, i.e. the fact amis(John, Ted) is true.

Regarding the set of rules, the first three rules of R describe how the prosecutor works: in order to show that a person X is a suspect, the prosecutor tries to find a motive (first rule) or a witness against X (second rule), or is convinced by other evidence that X is a suspect (third rule).

1. In order to emphasize the difference between the usual semantics of such a program, such as the well-founded semantics and the Kripke-Kleene semantics, it would be preferable to replace the extensional predicates suspect'(X) and innocent'(X) by the intensional predicates suspect(X) and innocent(X), but the program as given simplifies the presentation.


The fourth and fifth rules describe how the lawyer works: to support the claim that a person X is innocent, the lawyer tries to find an alibi for X given by a person who is not a friend of X (fourth rule), or is convinced by other evidence that X is innocent (fifth rule). The fourth rule depends on the sixth and seventh rules, which define the amis relation.

Finally, the last two rules of R are the decision rules and describe how the judge works: in order to decide whether or not to indict X, the judge examines the premises suspect(X) and ¬innocent(X). As explained above, the values of these premises come from two different sources: the prosecutor and the lawyer. Each of these premises can take the value true or false. It is also possible for the value of a premise to be unknown. For instance, if the prosecutor finds neither a motive, nor a witness against X, nor any other evidence that X is a suspect, then the value of suspect(X) will be unknown.

Given these observations, the question arises of which value should be assigned to inculpé(X). X should be indicted if it has been explicitly established that X is a suspect or is not innocent.

There are thus three possible values for inculpé(X): unknown, true and false. The value unknown for a premise means that the premise is either true or false, but that its value is currently unknown.

Note that the value unknown is related to the "null values" of attributes in database theory, where a distinction is made between two kinds of null values [ZAN84]:

– the attribute value exists but is currently unknown;

– the attribute value does not exist.

An example of the first kind is the value of the attribute Department for an employee who has just been hired but has not yet been assigned to a specific department; an example of the second kind is the maiden name of a man. The value unknown corresponds to the first kind of null value.

Returning to the example, the decision of whether or not to indict John depends on the value that inculpé(John) receives once the premises have been collected. Looking at the facts of F and the rules of R (and with the help of intuition), we can see that suspect(John) and innocent(John) both receive the value unknown and, consequently, so does inculpé(John).

This is clearly a case in which the judge cannot decide whether or not to indict John!

However, in the context of decision making, it is necessary to reach a decision (based on the available facts and rules) even if some values are unknown. This can be done by assuming values for some, or even all, of the unknown premises. Such an assignment of values is what we call a hypothesis.


Consequently, in our example, the judge should assume John's innocence, and John should then not be indicted. Note that this is precisely what happens in real life under such circumstances, i.e. every person is presumed innocent in the absence of proof to the contrary.

Now, observe that uniform hypotheses are not suitable in this example, and that a non-uniform hypothesis is needed. If we use the closed world assumption and assign the value false to all atoms by default, then the judge will infer that John is not innocent and must be indicted. If we use the open world assumption and assign the default value unknown to all atoms, then the atoms suspect(John), innocent(John) and inculpé(John) are assigned the value unknown and the judge cannot reach a decision. An intuitively appropriate hypothesis in this situation is to assume by default that the atoms mobile(John), témoin(John) and suspect'(John) are false, that the atom innocent'(John) is true, and that the others are unknown. With such a hypothesis, the judge could infer that John is innocent, is not a suspect and must not be indicted.

Of course, the hypothesis used must be "sound" with respect to the available information, i.e. with respect to the given facts and rules. Informally, we define the notion of a "reasonable", or sound, hypothesis by means of the following test:

1) if there is no contradiction between the hypothesis H and the set of facts F, then add H to F to produce a new set of facts F′ = F ∪ H (see footnote 2);

2) apply the rules of R to F′ to produce a new set of facts H′;

3) H is sound if the atoms appearing as rule heads in R whose value is known in H have the same value in H′.

H is sound if no atom defined in H changes its value after the rules have been fired. In our example, consider the following hypothesis:

H1 = {¬mobile(John), ¬témoin(John), ¬suspect'(John), ¬suspect(John), ¬alibi(John), ¬innocent'(John), ¬innocent(John), ¬inculpé(John)}

If we apply the above test, we obtain:

H′1 = {¬mobile(John), amis(John, Ted), amis(Ted, John), ¬suspect(John), ¬innocent(John), inculpé(John)}

We can see that the value of the fact inculpé(John), which appears at the head of some rules of R, has changed. H1 is not a sound hypothesis.

Now, consider the following hypothesis:

H2 = {¬mobile(John), ¬témoin(John), ¬suspect'(John), ¬suspect(John), innocent'(John), innocent(John), ¬inculpé(John)}

2. F and H are partial interpretations, i.e. consistent sets of literals.


If we apply the test again, we obtain:

H′2 = {¬mobile(John), amis(John, Ted), amis(Ted, John), ¬suspect(John), innocent(John), ¬inculpé(John)}

The values of the atoms defined in H2 that appear at the head of rules of R remain unchanged in H′2, so H2 is a sound hypothesis.

Intuitively, a hypothesis is sound if what we assume is compatible with the given facts and rules.

From now on, we use the notation P = ⟨F, R⟩, where F is the set of facts and R the set of rules, and we call P a program.

In principle, we may assume any value for any ground atom. However, given a program P and a hypothesis H, it is unlikely, in general, that H is sound with respect to P. On the other hand, some subset of H may be sound with respect to P.

It is then natural to ask, for a given program P and hypothesis H, what is the largest subset of H that is sound with respect to P. We show that this subset is unique and we propose a method to compute it. We call it the support of H by P and denote it s^H_P. Intuitively, the support of H indicates how much of H can be assumed while remaining compatible with the facts and rules of P.

In what follows, we show that the support s^H_P can be used to define a semantics of P founded on the hypothesis H, denoted HFS^H_P. This semantics is defined by a fixpoint computation based on an immediate consequence operator and on the notion of the support of a given hypothesis with respect to a sequence of programs, as follows:

– F_0 = F;
– F_{i+1} = T_⟨F_i,R⟩(F_i) ∪ s^H_⟨F_i,R⟩.

We also show that there is an interesting connection between these hypothesis-founded semantics and the usual semantics of Datalog programs with negation. More precisely, we show that:

– if H is the closed world assumption, then HFS^H_P coincides with the well-founded semantics of P [VnG91], and
– if H is the open world assumption, then HFS^H_P coincides with the Kripke-Kleene semantics of P [FIT85].

The remainder of this paper is organized as follows. In Section 2, we very briefly recall some definitions and notations of the well-founded semantics and of the Kripke-Kleene semantics. In Section 3, we define the notions of a sound hypothesis and of the support of a hypothesis by a program P; we also present an algorithm for computing the support. In Section 4, we define the notion of the semantics of a logic program or deductive database founded on a hypothesis, and we finally show that the notion of support makes it possible to unify and extend the well-founded and Kripke-Kleene semantics. Section 5 presents the conclusion and suggestions for future research.


présentons également un algorithme pour le calcul du support. Dans la section 4, nousdéfinissons la notion de sémantique d’un programme logique ou base de données dé-ductive fondée sur une hypothèse. Nous montrons finalement que la notion de supportpermet d’unifier et d’étendre les sémantiques bien fondée et Kripke-Kleene. La sec-tion 5 présente la conclusion et des suggestions de recherches futures.

2. Preliminaries

A Datalog program with negation is a finite set of formulas, called rules, of the form A ← L_1, ..., L_n, where A is an atom and the L_i's are positive or negative literals (see footnote 3). A is called the head of the rule, and L_1, ..., L_n is called the body of the rule. The head of a rule r is denoted head(r) and its body body(r).

A (partial) interpretation of a Datalog program with negation P is a set I of ground literals such that there is no atom A with {A, ¬A} ⊆ I.

Two interpretations I and J are compatible if I ∪ J is an interpretation. Given an interpretation I, we denote by def(I) the set of all ground atoms A such that A ∈ I or ¬A ∈ I, i.e. the set of all ground atoms that are not unknown in I. Moreover, given a set S of ground atoms, we define the restriction of I to S, denoted I/S, as follows: I/S = I ∩ (S ∪ ¬.S), where ¬.S = {¬A | A ∈ S}.

A Datalog program with negation can be seen as a pair ⟨F, R⟩, where F is a set of facts, equivalent to an interpretation corresponding to a set of rules of the form A ← true or A ← false, and R is a set of rules.

2.1. Well-founded semantics

The well-founded semantics was proposed in [VnG91] and is based on the closed world assumption, i.e. every atom is assumed to be false by default. Given a ground Datalog program with negation P, its well-founded semantics is defined by means of the following two operators: let I be an interpretation,

– the immediate consequence operator T_P, defined by T_P(I) = {head(r) | r ∈ P and ∀B ∈ body(r), B ∈ I}, and

– the unfounded operator U_P, where U_P(I) is defined as the greatest unfounded set with respect to I.

Recall that a set U of ground atoms is said to be unfounded with respect to an interpretation I if for every A ∈ U and every rule r ∈ P, we have:

head(r) = A ⇒ ∃B ∈ body(r) (¬B ∈ I or B ∈ U)

3. If B is an atom, then B and ¬B are literals. A literal is ground if all its variables are instantiated.


In [BID91] it is proved that U_P(I) = HB \ SPF_P(I), where HB is the Herbrand base and SPF_P(I) the limit of the sequence [SPF^i_P(I)]_{i≥1} defined by:

− SPF^1_P(I) = {head(r) | r ∈ P and pos(body(r)) = ∅ and ∀B ∈ body(r), ¬B ∉ I}

− SPF^{i+1}_P(I) = {head(r) | r ∈ P and pos(body(r)) ⊆ SPF^i_P(I) and ∀B ∈ body(r), ¬B ∉ I}, i > 0.

where pos(body(r)) is the set of positive literals appearing in body(r). The atoms of SPF_P(I) are called potentially founded atoms.

The operator W_P, called the well-founded operator, is defined by W_P(I) = T_P(I) ∪ ¬.U_P(I), and it is monotone with respect to set inclusion. The well-founded semantics of P is the least fixpoint of W_P [VnG91].
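As an illustration, the following minimal Python sketch (an assumed encoding, not code from the paper) implements T_P, the SPF-based computation of the greatest unfounded set, and the operator W_P for ground programs; iterating W_P from the empty interpretation yields the well-founded semantics.

def neg(lit):
    atom, sign = lit
    return (atom, not sign)

def t_op(rules, interp):
    # T_P(I): heads of rules whose whole body holds in I, as positive literals
    return {(h, True) for h, body in rules if all(l in interp for l in body)}

def spf(rules, interp):
    # limit of the SPF^i sequence: a rule contributes its head when no body
    # literal is falsified by I and all its positive body literals are
    # already potentially founded
    s = set()
    while True:
        s2 = {h for h, body in rules
              if all(neg(l) not in interp for l in body)
              and all(a in s for a, sign in body if sign)}
        if s2 == s:
            return s
        s = s2

def w_op(rules, hb, interp):
    # W_P(I) = T_P(I) ∪ ¬.U_P(I), with U_P(I) = HB \ SPF_P(I)
    return t_op(rules, interp) | {(a, False) for a in hb - spf(rules, interp)}

def wfs(rules, hb):
    # well-founded semantics: least fixpoint of W_P, from the empty interpretation
    i = set()
    while True:
        j = w_op(rules, hb, i)
        if j == i:
            return i
        i = j

On the program of Example 4 below (rules A ← ¬B, B ← ¬A, C ← ¬A, D ← E, E ← D, ¬C, encoded as (head, [(atom, sign), ...]) pairs with hb = {"A", ..., "E"}), wfs returns {¬D, ¬E}, as expected from Example 5.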

2.2. Kripke-Kleene semantics

The Kripke-Kleene semantics was proposed in [FIT85] and is based on the open world assumption, i.e. all atoms have the value unknown by default. In [FIT85], a valuation is a function that assigns to every atom a logical value from the set {true, false, unknown}. Given a ground Datalog program with negation P, its Kripke-Kleene semantics is defined by means of the following operator Φ_P: let v be a valuation and A a ground atom,

– if there is a rule of P whose head is A and whose body evaluates to true under v, then Φ_P(v)(A) = true;

– if, for every rule whose head is A, the body evaluates to false under v, then Φ_P(v)(A) = false;

– otherwise Φ_P(v)(A) = unknown.

The Kripke-Kleene semantics of P is the iterated fixpoint of Φ_P obtained from the interpretation that assigns the value unknown to every atom.
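Similarly, here is a minimal sketch (an assumed three-valued encoding, not code from the paper) of Fitting's operator Φ_P, with a valuation represented as a dict from atoms to 'true' / 'false' / 'unknown'.

def lit_val(v, lit):
    # three-valued value of a literal under valuation v
    atom, sign = lit
    val = v.get(atom, "unknown")
    if val == "unknown" or sign:
        return val
    return "false" if val == "true" else "true"   # negative literal flips t/f

def body_val(v, body):
    # conjunction: false dominates, then unknown, else true
    vals = [lit_val(v, l) for l in body]
    if "false" in vals:
        return "false"
    return "unknown" if "unknown" in vals else "true"

def phi(rules, v):
    out = {}
    for a in {h for h, _ in rules}:
        vals = [body_val(v, b) for h, b in rules if h == a]
        if "true" in vals:
            out[a] = "true"           # some rule body true under v
        elif all(x == "false" for x in vals):
            out[a] = "false"          # every rule body false under v
        else:
            out[a] = "unknown"
    return out

The Kripke-Kleene semantics is then obtained by iterating phi from the all-unknown valuation until it no longer changes.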

3. Hypothesis testing

In the remainder of this paper, in order to simplify the presentation, we will use the term "program" instead of "Datalog program with negation", and we will assume that all programs are ground.


3.1. Support of a hypothesis

Given a program P = ⟨F, R⟩, we consider two ways of deriving information from P: first, by firing the rules of R to derive new facts from those of F, by means of an immediate consequence operator T_P; second, by reasoning based on a given hypothesis.

Definition 1 (Immediate consequence operator T_P) The immediate consequence operator T_P takes as input an interpretation I and returns the interpretation T_P(I) defined by:

T_P(I) = {A | ∃ A ← L_1, ..., L_n ∈ P (∀L_i (L_i ∈ I))}
       ∪ {¬A | ∃ A ← L_1, ..., L_n ∈ P and ∀ A ← L′_1, ..., L′_n ∈ P (∃L′_i (¬L′_i ∈ I))}

What we call a hypothesis is in fact simply an interpretation H. We use the term "hypothesis" to stress that the values assigned by H to the ground atoms of the Herbrand base are assumed values, not values computed from the facts and rules of the program. As such, a hypothesis must be tested against the "sure" knowledge provided by P. The test consists in "adding" H to F and then firing the rules of R to derive a new interpretation H′. More formally, let H/Head(P) be the restriction of H to the set Head(P) defined by Head(P) = {A | ∃ A ← L_1, ..., L_n ∈ P}. If H/Head(P) ⊆ H′, then the hypothesis H is sound, i.e. the values defined by H do not contradict those defined by P.

Definition 2 (Sound hypothesis) Let P = ⟨F, R⟩ be a program and H a hypothesis. H is sound with respect to P if

– F and H are compatible, and

– H/Head(P) ⊆ T_P(F ∪ H).

We take the restriction of H to Head(P) before comparing with T_P(F ∪ H) because all the atoms that do not appear as rule heads in P are unknown in T_P(F ∪ H). Consequently, H and T_P(F ∪ H) are compatible on those atoms.
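The following minimal Python sketch (an assumed encoding, not the authors' implementation) makes Definitions 1 and 2 concrete for ground programs; fact literals are handled by the compatibility check, so tp only fires the rules.

def neg(lit):
    atom, sign = lit
    return (atom, not sign)

def tp(rules, interp):
    """T_P(I): A is derived when some rule for A has its whole body in I;
    ¬A is derived when every rule for A has a body literal contradicted by I."""
    out = set()
    for a in {h for h, _ in rules}:
        bodies = [b for h, b in rules if h == a]
        if any(all(l in interp for l in b) for b in bodies):
            out.add((a, True))
        elif all(any(neg(l) in interp for l in b) for b in bodies):
            out.add((a, False))
    return out

def compatible(i, j):
    # I and J are compatible when I ∪ J contains no pair {A, ¬A}
    return all(neg(l) not in j for l in i)

def is_sound(facts, rules, hyp):
    """H is sound w.r.t. P = <F, R> iff F and H are compatible and
    H restricted to Head(P) is included in T_P(F ∪ H) (Definition 2)."""
    heads = {h for h, _ in rules}
    return compatible(facts, hyp) and \
        all(l in tp(rules, facts | hyp) for l in hyp if l[0] in heads)

On a version of Example 2 below, grounded for John and Ted, this test rejects both H (incompatible with F) and H′ (¬suspect(John) contradicted by the derived suspect(John)).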

The following example illustrates the notion of a hypothesis that is sound with respect to a logic program or deductive database.

Example 2 Let P = ⟨F, R⟩ be the program such that R is the set of rules given in Example 1 and F = {témoin(John)}.

Let H be the following hypothesis:

H = {¬mobile(John), ¬témoin(John), ¬suspect(John)}

It is easy to see that H is not sound with respect to P. The atom témoin(John) is defined in both H and F, but with different values, so H and F are not compatible.


Now, let H′ be the following hypothesis:

H′ = {¬mobile(John), ¬suspect(John)}

F and H′ are compatible, so the knowledge defined by these two interpretations can be collected into a new interpretation without introducing any conflict or inconsistency:

F ∪ H′ = {témoin(John), ¬mobile(John), ¬suspect(John)}

Next, we fire the rules of R on the interpretation F ∪ H′:

T_P(F ∪ H′) = {témoin(John), suspect(John)}

We can see that H′ is not sound with respect to P, because H′/Head(P) is not a subset of T_P(F ∪ H′) and is thus in contradiction with the derived knowledge.

Even if a hypothesis H is not sound with respect to a program P, some subsets of H may be sound with respect to P. Naturally, we want to know the largest subset of H that is sound with respect to P. We call this set the support of H by P. The following lemma shows that the support is unique (and is therefore a well-defined concept).

Lemma 1 If H1 and H2 are two sound subsets of H with respect to P, then H1 ∪ H2 is sound with respect to P.

Consequently, the largest sound subset of H is defined by ⋃{H′ | H′ ⊆ H and H′ is sound with respect to P}.

Definition 3 (Support) Let P be a program and H a hypothesis. The support of H with respect to P, denoted s^H_P, is the largest (with respect to set inclusion) sound subset of H with respect to P.

Example 3 Let P be the program and H the hypothesis defined in Example 2; the support of H with respect to P is:

s^H_P = {¬mobile(John)}

Note that the support of a hypothesis with respect to a program P = ⟨F, R⟩ is compatible with the interpretation obtained by firing the rules of R on the facts of F, i.e. T_P(F) and s^H_P are compatible.

We now give an algorithm to compute the support s^H_P of a hypothesis H with respect to a program P.

Consider the sequence ⟨PF_i⟩, i ≥ 0, defined by:

– PF_0 = ∅;


– PF_i = def(T_P(F ∪ H_{i-1}) \ H)

where H_{i-1} = H \ {A, ¬A | A ∈ PF_{i-1} or {A, ¬A} ⊆ F ∪ H}, i.e. the largest subset of H compatible with F and containing no facts corresponding to the atoms of PF_{i-1}.

Intuitively, we evaluate step by step the atoms that may potentially have a logical value different from their value in H. We have the following results:

Proposition 1 The sequence ⟨PF_i⟩, i ≥ 0, is increasing with respect to set inclusion and reaches a limit in a finite number of steps. This limit is denoted PF.

If an atom of the Herbrand base is not in PF, this means that it is impossible, with respect to P, to infer for that atom a value different from its value in H.

Theorem 1 Let P be a program and H a hypothesis. We have

s^H_P = H \ {A, ¬A | A ∈ PF or {A, ¬A} ⊆ F ∪ H}

4. Hypothesis-founded semantics

As explained earlier, given a program P = ⟨F, R⟩ and a hypothesis H, we derive information in two ways: by firing the rules (i.e. by applying the immediate consequence operator T_P) and by computing the support s^H_P of H by P. In the end, the information obtained comes from the interpretation T_P(F) ∪ s^H_P. The semantics we wish to associate with a program P is the maximum of information that can be derived from P under a hypothesis H but without any other information. To realize this idea, we proceed as follows:

1) Since we do not want any additional information (other than from P and s^H_P), we start the computation from the facts F.

2) In order to derive the maximum of information from P, we collect together the knowledge deduced by firing the rules of R, i.e. by applying the operator T_P, and as much assumed knowledge as possible, i.e. the support of H with respect to P.

3) We add the new facts we have derived and define a new program P_i, on which we apply the same operations until a fixpoint is reached.

Proposition 2 The sequence ⟨F_n⟩, n ≥ 0, defined by:

– F_0 = F, and

– F_{n+1} = T_⟨F_n,R⟩(F_n) ∪ s^H_⟨F_n,R⟩,

is increasing with respect to set inclusion and reaches a limit denoted HFS^H_P.

Proposition 3 The interpretation HFS^H_P is a model of P.


We can therefore define a semantics of P with respect to the hypothesis H.

Definition 4 (Hypothesis-founded semantics of P) The interpretation HFS^H_P is the semantics of P founded on H, or H-semantics of P.

The following example illustrates the computation of this semantics.

Example 4 Let H = {A, ¬B, D, E} and let P be the program defined by F = ∅ and

R:
A ← ¬B
B ← ¬A
C ← ¬A
D ← E
E ← D, ¬C

We have

– F_0 = F;

– F_1 = T_⟨F_0,R⟩(F_0) ∪ s^H_⟨F_0,R⟩ = ∅ ∪ {A, ¬B};

– F_2 = T_⟨F_1,R⟩(F_1) ∪ s^H_⟨F_1,R⟩ = {A, ¬B, ¬C} ∪ {A, ¬B};

– F_3 = T_⟨F_2,R⟩(F_2) ∪ s^H_⟨F_2,R⟩ = {A, ¬B, ¬C} ∪ {A, ¬B, D, E};

– F_4 = T_⟨F_3,R⟩(F_3) ∪ s^H_⟨F_3,R⟩ = {A, ¬B, ¬C, D, E} ∪ {A, ¬B, D, E} = F_3 = HFS^H_P.

According to this definition, any logic program or deductive database can be given several semantics, one for each possible hypothesis. Theorem 2 states that the usual semantics of Datalog programs with negation are particular cases of semantics founded on particular hypotheses: the well-founded semantics corresponds to the hypothesis that assigns the value false to all the atoms of HB_P; the Kripke-Kleene semantics to the hypothesis that assigns the value false to all the atoms not appearing as rule heads and the value unknown to the other atoms. Our approach is thus conservative: it generalizes and extends Van Gelder's notion of unfounded set [VnG91] to arbitrary hypotheses. Moreover, it unifies the computation of these semantics and makes it easier to compare them; it becomes straightforward, for instance, to check that the Kripke-Kleene semantics is included in the well-founded semantics.

Theorem 2 Let P be a Datalog program with negation.

1) If H_F = ¬.HB_P, then HFS^{H_F}_P coincides with the well-founded semantics of P.

2) If H_U = ¬.(HB_P \ Head(P)), then HFS^{H_U}_P coincides with the Kripke-Kleene semantics of P.

Example 5 Let P be the program defined in Example 4. We have:

– HFS^{H_F}_P = {¬D, ¬E} = well-founded semantics of P;

– HFS^{H_U}_P = ∅ = Kripke-Kleene semantics of P.


5. Conclusion

We have defined a formal framework, based on hypothesis testing, for reasoning about non-monotonicity and incompleteness in the context of Datalog deductive databases with negation. This framework allows any knowledge to be used as a hypothesis for the missing information. A basic concept of this approach is that of the support of a hypothesis by a program: the largest subset of the hypothesis that does not contradict the facts and rules of the program. We then used this concept of support to define the notion of the semantics of a program founded on a hypothesis. We gave algorithms for computing the support and the semantics. Finally, we showed that our semantics generalizes the well-founded semantics and the Kripke-Kleene semantics, which can be seen as particular cases corresponding to uniform hypotheses.

We believe that the notion of hypothesis-founded semantics can be useful not only in the fields of information integration and information retrieval, but also in that of explanation-based systems. Indeed, suppose that a given hypothesis H is included in the semantics of a program P founded on H; then P can be seen as an "explanation" of the interpretation H. We are currently investigating these various application domains.

6. Bibliography

[ALF95] ALFERES, J.J., DAMÁSIO, C.V., PEREIRA, L.M., A logic programming system for non-monotonic reasoning, Journal of Automated Reasoning, 14: 97-147, 1995.

[BID88] BIDOIT, N., FROIDEVAUX, C., General logical databases and programs: Default logic semantics and stratification, Journal of Information and Computation, 1988.

[BID91] BIDOIT, N., FROIDEVAUX, C., Negation by default and unstratifiable logic programs, TCS, 78, 1991.

[BRA98] BRASS, S., DIX, J., Characterizations of the disjunctive well-founded semantics: confluent calculi and iterated GCWA, Journal of Automated Reasoning, 20(1): 143-165, 1998.

[CHE95] CHEN, W., SWIFT, T., WARREN, D.S., Efficient top-down computation of queries under the well-founded semantics, Journal of Logic Programming, 24(3): 161-199, 1995.

[CHE96] CHEN, W., WARREN, D.S., Tabled evaluation with delaying for general logic programs, Journal of the ACM, 43(1): 20-74, 1996.

[CLA78] CLARK, K.L., Negation as failure, in H. Gallaire and J. Minker, editors, Logic and Databases, Plenum Press, New York, 293-322, 1978.

[FIT85] FITTING, M.C., A Kripke/Kleene semantics for logic programs, J. Logic Programming, 2: 295-312, 1985.

[FIT93] FITTING, M.C., The family of stable models, J. Logic Programming, 17: 197-225, 1993.

[FIT01] FITTING, M.C., Fixpoint semantics for logic programming – a survey, Theoretical Computer Science, to appear.


[FUH97] FUHR, N., RÖLLEKE, T., HySpirit – a probabilistic inference engine for hypermedia retrieval in large databases, in: Schek, H.-J.; Saltor, F.; Ramos, I.; Alonso, G. (eds.), Proceedings of the 6th International Conference on Extending Database Technology (EDBT), 24-38, 1997.

[GEL88] GELFOND, M., LIFSCHITZ, V., The stable model semantics for logic programming, in: R. Kowalski and K. Bowen (eds.), Proceedings of the Fifth Logic Programming Symposium, MIT Press, Cambridge, MA, 978-992, 1988.

[KUN87] KUNEN, K., Negation in logic programming, J. Logic Programming, 4(4): 289-308, 1987.

[LLO84] LLOYD, J.W., Foundations of Logic Programming, Springer Verlag, New York, first edition, 1984.

[LOY99] LOYER, Y., SPYRATOS, N., STAMATE, D., Computing and comparing semantics of programs in four-valued logics, in: M. Kutylowski, L. Pacholski and T. Wierzbicki (eds.), Proceedings of the 24th International Symposium on Mathematical Foundations of Computer Science (MFCS'99), LNCS 1672: 59-69, 1999.

[McC77] MCCARTHY, J., Epistemological problems in artificial intelligence, in Proceedings of IJCAI'77, American Association for Artificial Intelligence, Morgan Kaufmann, Los Altos, CA, 1038-1044, 1977.

[McC80] MCCARTHY, J., Circumscription - a form of non-monotonic reasoning, Journal of Artificial Intelligence, 13: 27-39, 1980.

[McC86] MCCARTHY, J., Applications of circumscription to formalizing common sense knowledge, Journal of Artificial Intelligence, 28: 89-116, 1986.

[MIN95] MINKER, J., An overview of non-monotonic reasoning and logic programming, Journal of Logic Programming, Special Issue, 17, 1993.

[MIN75] MINSKY, M., A framework for representing knowledge, in P. Winston, editor, The Psychology of Computer Vision, MIT Press, New York, 1975.

[MOO85] MOORE, R.C., Semantical considerations on non-monotonic logic, Journal of Artificial Intelligence, 25: 75-94, 1985.

[NIE96] NIEMELA, I., SIMONS, P., Efficient implementation of the well-founded and stable model semantics, Proceedings of JICSLP'96, MIT Press, 1996.

[PRZ89] PRZYMUSINSKI, T.C., Non-monotonic formalisms and logic programming, Proceedings of the Sixth International Conference on Logic Programming, 1989.

[PRZ90a] PRZYMUSINSKI, T.C., Extended stable semantics for normal and disjunctive programs, in D.H.D. Warren and P. Szeredi (eds.), Proceedings of the Seventh International Conference on Logic Programming, MIT Press, Cambridge, MA, 459-477, 1990.

[PRZ90b] PRZYMUSINSKI, T.C., Well-founded semantics coincides with three-valued stable semantics, Fund. Inform., 13: 445-463, 1990.

[PRZ88] PRZYMUSINSKA, H., PRZYMUSINSKI, T.C., Weakly perfect model semantics for logic programs, in R. Kowalski and K. Bowen, editors, Proceedings of the Fifth Logic Programming Symposium, MIT Press, 1106-1122, 1988.

[REI78] REITER, R., On closed-world databases, in H. Gallaire and J. Minker, editors, Logic and Databases, Plenum Press, New York, 55-76, 1978.

[REI80] REITER, R., A logic for default reasoning, Journal of Artificial Intelligence, 13: 81-132, 1980.

[REI86] REITER, R., Non-monotonic reasoning, Annual Reviews of Computer Science, 1986.


[SUB95] SUBRAHMANIAN, V.S., NAU, D., VAGO, C., WFS + branch and bound = stable models, IEEE Transactions on Knowledge and Data Engineering, 7: 362-377, 1995.

[VnG89] VAN GELDER, A., The alternating fixpoint of logic programs with negation, in: Proceedings of the Eighth Symposium on Principles of Database Systems, ACM, Philadelphia, 1-10, 1989.

[VnG91] VAN GELDER, A., ROSS, K.A., SCHLIPF, J.S., The well-founded semantics for general logic programs, J. ACM, 38: 620-650, 1991.

[ZAN84] ZANIOLO, C., Database relations with null values, Journal of Computer and System Sciences, 28: 142-166, 1984.

[ZUK97] ZUKOWSKI, U., BRASS, S., FREITAG, B., Improving the alternated fixpoint: the transformation approach, in: Proceedings of LPNMR'97, LNCS 1265, 40-59, 1997.


Database Summarization: Application to a Commercial Banking Data Set

Régis Saint-Paul — Guillaume Raschia — Noureddine Mouaddib

Institut de Recherche en Informatique de Nantes, 2, rue de la Houssinière, BP 92208, 44322 Nantes Cedex 3, FRANCE

Saint-Paul,Raschia,[email protected]

ABSTRACT. In this paper, an original approach to database summarization is applied to a massive data set provided by a bank marketing department. The summarization process is based on an incremental and hierarchical conceptual clustering algorithm, building a summary hierarchy from database records. The levels of the hierarchy provide views with different granularities over the entire database. Each summary describes part of the data set. Furthermore, the fuzzy set-based representation of summaries allows the system to ensure strong robustness and accuracy regarding the well-known threshold effect of crisp clustering methods. The summarization process is also supported by background knowledge, providing a user-friendly vocabulary to describe summaries with high-level semantics. Even though our method is not immediately concerned with computational performance, its low time and memory requirements make it appropriate for large real-life databases. The scalability of the process is demonstrated through the application to a banking data set.


KEYWORDS: Database summarization, Knowledge discovery, Fuzzy logic


1. Introduction

Generally, the huge amount of information stored into databases each day goes unexploited, since standard tools for visualizing, querying and analyzing data are inefficient, due to the scalability problem. Therefore, new research domains, such as data mining, data warehousing and knowledge discovery, have recently raised the interest of the database community. At the same time, the data summarization paradigm has come to be considered a main topic of the extended database research area.

In that context, a lot of work has been done, especially on classification, clustering, association rules and decision trees, as ways to discover correlations between classes of database observations. For instance, the natural-language rule "long papers are generally the most interesting ones" could be inferred from a case study on scientific journals. These techniques have also been used to identify, characterize and/or discriminate parts of the database with conjunctions of predicates; for instance, [SALARY>50000 AND AGE>50 AND (CAR=MERCEDES OR CAR=BMW) AND SEX=M] roughly defines the group of company chairmen.

However, most KDD (Knowledge Discovery in Databases) systems are designed to extract knowledge nuggets from the data, i.e. very precise and hidden knowledge, rather than to provide a global view of the database. Moreover, the knowledge representation is often unintelligible to the user.

Besides, multidimensional databases are arousing great interest from the summarization task point of view, since they allow an end-user to query and visualize parts of the data using special algebraic operators such as roll-up or drill-down. Often implemented through materialized views, multidimensional databases are, however, not able to provide the user with intentional descriptions of parts of the data set; rather, they give, at different levels of granularity, the real distribution of the attribute values called measures according to others called dimensions, through basic elements called cells.

Therefore, as a complement to these well-known KDD and OLAP approaches, we propose a fuzzy set-based summarization method, the SAINTETIQ system, which provides summaries covering parts of the primary database at different levels of abstraction. Since the interpretation and exploration of summaries is a main goal of summarization, the symbolic/numerical interface provided by the tools of Zadeh's fuzzy set theory (ZAD 65), and more especially linguistic variables (ZAD 75) and fuzzy partitions (RUS 69), is the fundamental background of all approaches to linguistic summarization. Significant work has been done in this area, for instance by Yager (YAG 82), Rasmussen and Yager (RAS 97), Kacprzyk (KAC 99), Bosc et al. (BOS 99b; BOS 98), Cubero et al. (CUB 99), and Dubois and Prade (DUB 00).

Our approach considers a primary relation R(A1, ..., An) in the relational database model, and constructs a new relation R*(A1, ..., An) in which tuples z are summaries and attribute values are fuzzy linguistic labels describing a sub-table of R. Thus, the SAINTETIQ system identifies statements of the form "Q objects of R are a1 and ... and am". Furthermore, summaries are organized into a hierarchy, such that the user can refine an approximate search in the database, from the most generic summary satisfying the query down to the most specific one. The overlapping summaries z of R* are defined from database records and prior knowledge, providing synthetic views of parts of the database at different levels of abstraction. One singular feature of the SAINTETIQ system is the intensive use of Background Knowledge (BK) in the summarization process. BK is built a priori on each attribute. It supports a translation step of the descriptions of database tuples into a user-defined vocabulary.

The rest of this paper is organized as follows. First, the overall architecture of the SAINTETIQ system is presented. Second, we focus on the use of background knowledge through the rewriting process of database records. Then, the construction and the representation of summaries are detailed. Finally, the features of our SAINTETIQ prototype are presented, as well as some interesting results obtained by applying SAINTETIQ to a real-life data set.

2. Summarization Model Architecture

Our data summarization model takes database records as input and produces some kind of knowledge as output. Figure 1 shows the overall architecture of the system. The summarization task is designed as a knowledge discovery process, in the sense that it is divided into three major parts, as follows:

1) a pre-processing step: it allows the system to rewrite database records so that they can be processed by the mining algorithm. This translation step gives birth to candidate tuples, which are different representations of a single database record, according to some background knowledge. The background knowledge consists of fuzzy partitions defined over the attribute domains. Each class of a partition is also labelled with a linguistic descriptor provided by the user or a domain expert. For instance, the fuzzy label young could belong to a partition built over the domain of the attribute AGE.

2) a "data mining" step: it considers the candidate tuples one at a time, and performs a scalable machine learning algorithm to extract knowledge. Obviously, the intensive use of the background knowledge that supports the translation step avoids finding surprising knowledge nuggets. The model is rather designed to produce summaries (the extracted knowledge, in the KDD process analogy) over the entire database, even if some of them are considered trivial from the user's point of view. Furthermore, the summaries are human-friendly, since they are described with the user vocabulary taken from the background knowledge; hence, their direct interpretation is easily performed by the user. This is an important feature of our model, and the main difference from usual KDD processes.

3) a post-processing step: our model aims to define summaries at different levels of granularity. The post-processing step consists in organizing the extracted summaries into a hierarchy, such that the most general summary is placed at the root of the tree and the most specific summaries are the leaves. The root summary describes the entire data set, whereas the leaves are summaries of only a few records of the database. Thus, browsing the hierarchy in a top-down fashion allows the user to progressively refine a rough query on the database until reaching the database records themselves.

3. Translation Step

3.1. The classical relational database framework

A database is a collection of data logically organized into files or tables of fixed-length objects (records), described by a set of features (attributes). Each attribute is defined on a domain corresponding to the set of possible values of an attribute variable. Each record is an ordered list of attribute-value pairs. A tuple is an element of the cartesian product of the attribute domains. For convenience, we limit ourselves to considering a single table, the universal table R of the relational database paradigm (COD 90). Table 1 presents a toy example of database records described by two attributes.

NAME (id)     OCCUPATION            INCOME (US$)
Burns         NPP boss                    87 000
Cletus        unemployed                   5 000
Homer         safety inspector            44 000
Kent          anchorman                   99 000
Krusty        clown                       68 000
Lisa          sax player                  15 000
Maggie        baby star                   72 000
Marge         private housekeeper              0
P. Kashmir    exotic dancer               19 000
Smithers      assistant manager           60 000
Snake         PHU artist                  67 000
...           ...                            ...

Table 1. Part of the SIMPSONS_CHARACTERS table

The Simpsons characters are listed with their jobs and annual incomes. For instance, Cletus the Slack-Jawed Yokel is said to be unemployed, with a US$ 5 000 annual income earned from the Social Security Disability Insurance. The abbreviations NPP, PHU and P. Kashmir stand for Nuclear Power Plant, Professional Hold-Up and Princess Kashmir, respectively.

Formally, let A = {A1, A2, ..., An} be the set of attributes of R. For instance, attributes are AGE, INCOME, OCCUPATION or COUNTRY. Denote by D_A the primary domain of the attribute A, such as the interval [0, 150] for the numerical attribute AGE, or the term set {France, Morocco, Poland} for the nominal attribute COUNTRY. An element t ∈ R is represented by the vector ⟨t.A1, ..., t.An⟩, where the attribute values t.Ai, Ai ∈ A, are basically crisp. For instance, Apu.JOB = grocer.

3.2. Notations related to fuzzy set theory

Some useful notations related to Zadeh's fuzzy set theory (ZAD 65) are introduced here. They are all used in the remainder of this paper.

Consider the crisp set Ω as the universe of discourse. F(Ω) is the power fuzzy set of Ω, i.e. the set of fuzzy sets of Ω. An element F ∈ F(Ω) is defined on Ω by the membership function μ_F, which takes its values in [0, 1].

In the following, we indistinctly denote by F(x) or μ_F(x) the membership grade of x in F. The crisp set 0+F is the strong zero-cut of F. It corresponds to the support of the fuzzy set F, defined as {x ∈ Ω | F(x) > 0}. Moreover, in the case Ω = {x1, ..., xn}, n ∈ N, we write μ_F = α1/x1 + ... + αn/xn, with αi = F(xi), i ∈ [1, n].

Finally, F is a fuzzy subset of F′, denoted F ⊆_F F′, if and only if for every element x in Ω the inequality F(x) ≤ F′(x) is satisfied.
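For illustration, these notations admit a direct encoding (ours, assumed; not from the paper) of finite fuzzy sets as dictionaries of membership grades:

def zero_cut(F):
    # strong zero-cut 0+F = {x in Ω | F(x) > 0}, i.e. the support of F
    return {x for x, grade in F.items() if grade > 0}

def fuzzy_subset(F, G, universe):
    # F ⊆_F G  iff  F(x) <= G(x) for every x in Ω
    return all(F.get(x, 0.0) <= G.get(x, 0.0) for x in universe)

# e.g. μ_F = 0.9/x1 + 0.2/x2 is written {"x1": 0.9, "x2": 0.2}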

3.3. Generation of Candidate Tuples

The first use of the domain knowledge consists in finding the best representation of a database tuple according to the linguistic labels provided by BK. Indeed, for a given tuple t, the SAINTETIQ system identifies on each attribute the most similar fuzzy sets d of the BK language, and it quantifies how well the d's represent t.

The translation step is based on labeled type-1 fuzzy sets for the nominal attributes, and on linguistic variables (ZAD 75) for the numerical attributes. The fuzzy sets of BK satisfy, on each attribute, the coverage property, in the sense that for every attribute value v in F(D_A) there exists at least one element d of BK with v ∈ 0+d.

Figure 2 presents an example of a fuzzy linguistic partition on the attribute INCOME. Someone who earns US$ 37 000 a year has both a modest and a reasonable income, with different satisfaction degrees.

Table 3 shows the definition of labeled type-1 fuzzy sets, such as artist, on the nominal attribute OCCUPATION.

The translation step generates, from a given primary database record, all the candidate tuples for the generalization. For instance, consider the database record 〈Burns, NPP boss, US$ 87 000〉 from Table 1. The translation step turns Burns.OCCUPATION = NPP boss into predefined linguistic descriptors taken from the BK provided in Table 3. Thus, the candidates are defined in Table 2.


In the same way, Burns.INCOME = US$ 87 000 is translated into the linguistic descriptor enormous with a maximum satisfaction degree, since US$ 87 000 belongs to the core of the fuzzy set enormous according to Figure 2.

Thus, the translation step converts the primary tuple Burns = 〈NPP boss, US$ 87 000〉 into 2 user-defined vocabulary tuples:

Burns[1] = 〈businessman, enormous〉
Burns[2] = 〈firm manager, enormous〉

with the appropriate satisfaction degrees:

ϕ(Burns[1].OCC) = businessman(NPP boss) = 0.9
ϕ(Burns[2].OCC) = firm manager(NPP boss) = 1.0
ϕ(Burns[k].INC) = enormous(US$ 87 000) = 1.0

The abbreviation OCC stands for OCCUPATION and INC for INCOME.

More formally, consider the primary database record t and a candidate tuple ct built from t. The satisfaction degree of a descriptor d of ct on the attribute A is defined as the membership grade of t in d: ϕ(ct.A) = ct.A(t.A).

The SAINT ETIQ system associates a weight w with each candidate tuple ct, corresponding to the proportion of a database record it represents. Suppose t gives birth to k candidate tuples through the translation step. Then the weight associated with each ct built from t is simply given by w(ct) = 1/k. For instance, w(Burns[1]) = 0.5, since there exist two candidate tuples, Burns[1] and Burns[2], built from the database record Burns. The weights w allow one to know exactly the representativity of a cluster of candidate tuples from the primary database point of view.
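As an illustration, the following Python sketch mimics the translation step on the OCCUPATION attribute of the Burns record, using the grades of Table 3; the function names are ours, and the BK is reduced to the strict minimum:

    # Background knowledge on OCCUPATION: each linguistic label is a discrete fuzzy set.
    BK_OCCUPATION = {
        "businessman":  {"NPP boss": 0.9},
        "firm manager": {"NPP boss": 1.0},
        "artist":       {"sax player": 1.0},
    }

    def translate(value, bk):
        """Return the candidate descriptors d with their satisfaction degree phi = d(value)."""
        return {label: fset[value] for label, fset in bk.items() if fset.get(value, 0.0) > 0}

    candidates = translate("NPP boss", BK_OCCUPATION)
    # candidates == {'businessman': 0.9, 'firm manager': 1.0}

    # Each of the k candidate tuples built from a record gets the weight w = 1/k.
    k = len(candidates)
    w = 1.0 / k   # here w = 0.5, as for Burns[1] and Burns[2]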

The translation step, applied either on numerical or on nominal attributes, allows one to achieve a unified framework for the generalization process of the candidate tuples. The result of the translation is considered as the first level of summarization. Burns[k] is in fact the intentional description of a summary of database records close to Burns.

Furthermore, the translation step provides some information about the matching degree between the user-defined vocabulary in BK and the real distribution of database records over the different attributes. The more candidate tuples the database records generate, the less BK fits the database distribution. The study of the weights w associated with the candidate tuples seems to be an interesting starting point to evaluate the accuracy of BK with respect to R. But this discussion, as well as an extension to a hypothetical interactive mechanism for the refinement of BK, is beyond the scope of this communication.

In the following, we denote by D^+_A the upper attribute domain of A, defined as the finite set of linguistic terms d of BK.


translated term         membership grade   candidate
artist                  0.0                no
businessman             0.9                yes
higher ed. empl.        0.0                no
school qualif. empl.    0.0                no
shopkeeper              0.0                no
firm manager            1.0                yes
no occupation           0.0                no

Table 2. Translation step of Burns.OCCUPATION

Both primary numerical and nominal attributes are now associated with discrete domains, since the translation step provides candidate tuples described on each attribute by elements of D^+_A. For instance, D^+_INCOME = {none, miserable, modest, reasonable, comfortable, enormous, outrageous} and D^+_OCCUPATION = {artist, businessman, shopkeeper, ...}. The next section will focus on a particular subset of F(D^+_A).

4. Summary Representation

4.1. What are summaries?

The main goal of SAINT ETIQ is to extract summaries from a huge number of database records. Each summary z provides a synthetic view of a part of the database, i.e. a sub-table σ_z(R) of R.

The subset of database records σ_z(R) involved in the summarization is usually called the extent, whereas the summarized description z of these database records is the intent.

The intentional description z = 〈z.A1, ..., z.An〉 of a summary describes the similar features of the tuples in σ_z(R). It allows one to generalize the descriptions of database tuples, attribute by attribute. Each z.Ai is a fuzzy set represented by some linguistic labels d taken from the prior knowledge (see Section 3). A descriptor d generalizes the attribute values of the database records in the extent of z.

For instance, consider the selection {Kent, Krusty, Lisa, Maggie, Princess Kashmir, Snake} of records from Table 1, described on the attributes OCCUPATION and INCOME. These six tuples are then summarized by a single fuzzy tuple z defined as:

z = 〈z.INCOME, z.OCCUPATION〉
z.INCOME = α_fp/fatly paid + α_pp/poorly paid
z.OCCUPATION = α_a/artist


where α_d is the membership grade of d in z.A. It represents a satisfaction degree of the intentional description of the summary z by d on the attribute A. Section 4.2 gives details about the computation of α_d.

A fuzzy attribute value-oriented extension of the relational database paradigm is adopted to incorporate uncertainty and imprecision into the description of summaries. The approach adopted here (ZEM 84) is based on the possibilistic model (PRA 84) for relational databases. To handle ill-known values, it considers the attribute descriptions z.A of summaries z as weighted disjunctive statements. Further discussions on fuzzy databases are developed in (PET 96; BOS 99a).

Obviously, we observe that the annual income of the artists of Table 1 is dually defined. The disjunction of the descriptors of z on the attribute INCOME derives from the scattering of the attribute values of the database records. Indeed, Lisa Simpson and Princess Kashmir seem to earn a low income, in contrast to the other artists such as Springfield's esteemed anchorman Kent Brockman. But it would be interesting to decide, roughly and in a more general way, whether Springfield's artists are well paid or not.

Therefore, each linguistic label d describing a database summary on an attribute is associated with a representativity degree. Coupled with the membership grade, it provides basic information to evaluate the description ability of d. In our example, the linguistic label fatly paid supports 66% of the extension of z, whereas poorly paid represents only 33% of it. Besides, consider that the two descriptors have the same satisfaction degrees (α_fp = α_pp). Thus, we interpret the description of z as:

most of the Springfield’s artists seems to be fatly paid, but a few ofthem are poorly paid.

4.2. Formal Definition of Summaries

Through the above introductory example, the user's point of view has been put forward. It allows one to keep intuitively in mind what we call a summary and what kind of interpretation we can expect from it. Further formal details about the representation of summaries are now introduced.

A summary is defined in an extensional manner with a collection of candidate tuples R_z = {ct1, ct2, ..., ctN}. Each cti is associated with a primary database record, i.e. an element of R. Denote by card(R_z) = Σ_{ct ∈ R_z} w(ct), and by |R_z| the number of candidate tuples in R_z. card(R_z) corresponds to the representativity of the summary z according to the primary database R, whereas |R_z| is the standard scalar cardinality of the crisp set of candidate tuples defined by the relation R_z.

Moreover, the intent of a summary is defined as:

z = 〈z.A1, z.A2, ..., z.An〉   with   z.Ai ∈ F(D^+_Ai), 1 ≤ i ≤ n.


Figure 1. The overall process of database summarization

Figure 2. Linguistic variable defined on INCOME

Consider R_z = {Apu[1], Burns[1], Moe[1]}. The intentional description of the summary z is then defined as:

z = 〈.9/businessman, 1.0/enormous + 1.0/miserable〉

where the membership grade α_d of a descriptor d is computed as an optimistic aggregation value, with the usual triangular conorm max¹, of the satisfaction degrees of the candidate tuples:

α_d = max_{ct ∈ R_z | ct.A = d} ϕ(ct.A)

1. Since the maximum satisfies the property of monotonicity w.r.t. the inclusion of the extensional descriptions R_z of summaries.


fuzzy label                membership function
artist                     1.0/sax player + .7/baby star + .3/anchorman + .9/clown + .9/exotic dancer + .4/PHU artist
businessman                .7/grocer + .7/bartender + .6/attorney + .9/NPP boss + .8/assistant manager + .8/anchorman + .9/PHU artist
higher educ. employee      1.0/attorney + .8/safety inspector + .8/assist. manager
school qualif. employee    1.0/secretary + .3/exotic dancer
shopkeeper                 1.0/grocer + .9/bartender
firm manager               1.0/NPP boss + .2/private housekeeper
no occupation              1.0/unemployed + .7/pensioner + .5/private housekeeper + .4/baby star + .4/PHU artist

Table 3. Labeled type-1 fuzzy sets on the attribute OCCUPATION

NAME (id)   OCCUPATION      ϕ     INCOME       ϕ     w     Ref
Apu[1]      businessman     .7    miserable    1.0   0.5   Apu
Apu[2]      shopkeeper      1.0   miserable    1.0   0.5   Apu
Burns[1]    businessman     .9    enormous     1.0   0.5   Burns
Burns[2]    f. manager      1.0   enormous     1.0   0.5   Burns
Homer[1]    h.e. employee   .8    reasonable   1.0   0.5   Homer
Moe[1]      businessman     .7    miserable    1.0   0.5   Moe
Moe[2]      shopkeeper      .9    miserable    1.0   0.5   Moe
...         ...             ...   ...          ...   ...   ...

Table 4. Translation step of the relation SIMPSONS_CHARACTERS

Consider the above example. The membership grade of businessman is then computed as

α_bus = max(ϕ(Apu[1].OCC), ϕ(Burns[1].OCC), ϕ(Moe[1].OCC))
      = max(0.7, 0.9, 0.7)
      = 0.9
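A direct transcription of this optimistic aggregation in Python might look as follows (the data structures are our own simplification of Table 4):

    # Candidate tuples of Rz on the attribute OCCUPATION: (descriptor, satisfaction degree phi).
    Rz_occ = [("businessman", 0.7),   # Apu[1]
              ("businessman", 0.9),   # Burns[1]
              ("businessman", 0.7)]   # Moe[1]

    def alpha(d, candidates):
        """Membership grade of the descriptor d: max of the phi's of the matching candidates."""
        return max((phi for desc, phi in candidates if desc == d), default=0.0)

    assert alpha("businessman", Rz_occ) == 0.9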

Furthermore, the attribute description of a single candidate tuple ct can be seen as a special case of a summary description z, such that z.A = ϕ(ct.A)/ct.A for all A in A.

In a general way, the fuzzy approach of the summarization process allows one to give preferences over multiple generalizations, as well as to quantify the satisfaction of each upper attribute description on a nuanced scale. Moreover, the well-defined numerical/symbolic interface of fuzzy set theory provides a powerful support for linguistic descriptions of summaries, especially for the translation step.

To provide some representativity measure of each linguistic label d in a summary description z.A, we use a primary database relative cardinality measure defined as:

card_{z.A}(d) = Σ_{ct ∈ R_z | ct.A = d} w(ct)

where w(ct) is the weight associated with the candidate tuple ct according to the primary database R (see Section 3.3).

The descriptor cardinality card_{z.A}(d), relative to the extent of z, determines the proportion of primary database records involved in the generalized description of R_z with the linguistic label d of BK.

Denote by \overline{card}_{z.A}(d) = card_{z.A}(d) / card(R_z) the normalized descriptor cardinality. It measures the importance of each label d in the summary description z.A.

Consider the above example, in which |R_z| = 3 and card(R_z) = 1.5. The cardinalities of the linguistic labels are given as follows:

card_{z.OCCUPATION}(businessman) = 1.5
card_{z.INCOME}(enormous) = 0.5
card_{z.INCOME}(miserable) = 1.0

Moreover, \overline{card}_{z.INCOME}(enormous) ≈ 0.33 and \overline{card}_{z.INCOME}(miserable) ≈ 0.66. Hence, all of Springfield's inhabitants partially or totally incorporated into R_z are businessmen; one third of them have an enormous annual income, whereas the others earn a miserable salary.
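The cardinality computations above can be sketched in the same style (again with our own toy data structures):

    # Candidate tuples of Rz on INCOME: (descriptor, weight w).
    Rz_inc = [("enormous", 0.5),    # Burns[1]
              ("miserable", 0.5),   # Apu[1]
              ("miserable", 0.5)]   # Moe[1]

    def card_d(d, candidates):
        """Descriptor cardinality: sum of the weights of the candidates described by d."""
        return sum(w for desc, w in candidates if desc == d)

    card_Rz = sum(w for _, w in Rz_inc)             # card(Rz) = 1.5
    norm = card_d("miserable", Rz_inc) / card_Rz    # 1.0/1.5, i.e. about 0.66, as in the text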

5. Learning Summaries From Data

The SAINT ETIQ system performs the database summarization process by way of a concept formation algorithm (FIS 87). The process integrates learning and classification tasks, sorting each tuple through a summary hierarchy and, at the same time, updating summary descriptions and related measures. Note that in our approach, the learned concepts are the summaries and the objects are the database records.

5.1. Concept Formation

Most of human learning can be regarded as a gradual process of concept formation: the observation of a succession of objects allows one to induce a conceptual hierarchy that summarizes and organizes human experience. In other words, concept formation is the fundamental activity which structures objects into a concise form of knowledge that can be efficiently used in the future (REI 91). It includes the classification of new objects based on a subset of their properties (the prediction ability), as well as the qualitative understanding of those objects based on the generated knowledge (the observation ability).

Hence this task is very similar to the conceptual clustering issue as defined by Michalski and Stepp (MIC 83), with the added constraint that learning is incremental.

More formally, given a sequential presentation of tuples and their associated descriptions, the main goals of concept formation are:

1) identifying clusters that group the tuples in categories;

2) defining an intentional description (i.e. a concept) that summarizes the instances of each category;

3) organizing these concepts into a hierarchy.

5.2. Hierarchical Organization of Summaries

Summaries z are stored as fuzzy tuples in a relation R*, and are organized into a hierarchy which defines a partial ordering on the tuples of R*.

Considering a node z in the hierarchy, its parent node summarizes more database records than z, and its children nodes fewer than itself. Consequently, their respective intents are more or less specific, in the sense that, on each attribute, the summary description of the parent node is more scattered than that of z. In the same way, z is more scattered than all of its children nodes. The specialization happens over one or more attributes, according to decisions made during the hierarchy building.

Besides, applying the algorithm on candidate tuples ct, rather than directly on primitive database records t, allows one to incorporate t into several summaries in different branches of the hierarchy, by way of the corresponding candidate tuples ct of t. Thus, SAINT ETIQ generates a non-disjoint hierarchy, a so-called pyramid, from the point of view of the elements of R.

5.3. Incremental Learner

Incremental learning methods are basically dynamic: their input is a stream of objects that are assimilated one at a time. Thus, incremental processes build, at any time, an estimate of an unknown real knowledge structure. Therefore, a primary motivation for using incremental systems is that knowledge may be rapidly updated with each new object. Indeed, incremental learners are driven by new objects, such that each step through the hypothesis space occurs in response to some new experience.

Obviously, the major drawback of this approach is that the estimated structure (the summary hierarchy) is only built from past observations, and thus corresponds to a local optimization of a heuristic measure used to evaluate the quality of the summary partition at each level of the hierarchy. However, D. Fisher (FIS 87) showed that experiences with such systems provide good results if some bidirectional learning operators are used.

5.4. Hill-Climbing Search

One can consider concept formation as a search through a space of concept hierarchies; hill-climbing is a basic Artificial Intelligence search method providing a possible way of controlling that search. Indeed, the system adopts a top-down classification method, incorporating a new tuple t into the root of the hierarchy and descending the tree according to the hill-climbing search.

At a node z, the algorithm considers incorporating the current tuple t into each child node of z, as well as creating a new child node accommodating t. Furthermore, the system evaluates the preference for merging the two best children nodes of z and for splitting the best child node. Then SAINT ETIQ uses a heuristic objective function, based on contrast and typicality of summary descriptions, to determine the best operator to apply at each level of the hierarchy.
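A minimal, self-contained sketch of this operator choice in Python, assuming children are plain sets of tuples and score() stands in for the contrast/typicality objective (all names are ours, not SAINT ETIQ's; split is analogous to merge and omitted for brevity):

    def best_operator(children, t, score):
        """Pick the operator whose simulated partition has the best heuristic score."""
        inc = []
        for i in range(len(children)):   # 1) incorporate t into each child in turn
            p = [c | ({t} if k == i else set()) for k, c in enumerate(children)]
            inc.append((score(p), ("incorporate", i)))
        moves = list(inc)
        moves.append((score(children + [{t}]), ("new", None)))  # 2) new child for t
        if len(inc) >= 2:                # 3) merge the two best children (by incorporation score)
            (_, (_, i)), (_, (_, j)) = sorted(inc, key=lambda m: m[0])[-2:]
            merged = [c for k, c in enumerate(children) if k not in (i, j)]
            merged.append(children[i] | children[j] | {t})
            moves.append((score(merged), ("merge", (i, j))))
        return max(moves, key=lambda m: m[0])[1]

    # Example with a deliberately simple objective: size of the largest cluster.
    print(best_operator([{1, 2}, {3}], 4, lambda p: max(len(c) for c in p)))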

Furthermore, bidirectional operators, such as splitting and merging, make local modifications to the summary hierarchy. They are used to weaken the sensitivity to object ordering, simulating the effect of backtracking in the space of summary hierarchies without storing previous hypotheses about the resulting structure. Thus, the system does not adopt a purely agglomerative or divisive approach, but rather uses both kinds of operators for the construction of the tree. To reduce the effects of this well-known drawback of concept formation algorithms, one can consider an optimization and simplification step, for instance with an iterative hierarchical redistribution (FIS 96), which considers the movement of a set of observations, represented by an existing cluster (summary), through the overall summary hierarchy.

Finally, the main advantage of hill-climbing search is its low memory requirement, since there are never more than a few states in memory, in contrast to search-intensive methods such as depth-first or breadth-first ones.

5.5. Discussion about Complexity

The temporal complexity c(n) = O(n) of SAINT ETIQ is linear w.r.t. the number n of database tuples. Indeed, considering learning operators as primitive elements, c(n) is defined as:

c(n) = [(B + 3) · N · log_B p] · (n · p^N)

where N = |A| is the number of attributes, B the average number of children nodes of a summary, and p the average cardinality of the translated attribute domains D^+_A. In the above formula, N · log_B p gives the depth of the tree, whereas (B + 3) stands for the exact number of learning operators applied at each node of the tree. Moreover, n · p^N is the maximum number of candidate tuples generated from the primary database.
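For illustration, the bound can be evaluated directly; the parameter values below are hypothetical (in the range of the application of Section 6), not measurements from the paper:

    import math

    def cost(n, N=10, B=4, p=5):
        """[(B+3) * N * log_B(p)] * (n * p**N): primitive learning operations."""
        depth = N * math.log(p, B)             # depth of the tree
        return (B + 3) * depth * (n * p**N)    # operators per node x candidate tuples

    # The bound is linear in n: doubling n doubles the cost.
    assert cost(2000) == 2 * cost(1000)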

Observe that the primary database has to be parsed only once. The computation of the SAINT ETIQ algorithm is then performed at a low temporal cost, thanks to the incremental learning, the hill-climbing search method and the predefined vocabulary of summary descriptions.

6. A Real-Life Application

6.1. Model’s Implementation

A prototype implementing our model has been developed in the Pascal language using Borland(TM) Delphi© 6. The source code is about 56 000 lines long, among which 40% is used for the summarization model implementation itself. The rest is dedicated to both the graphical user interface and some libraries devoted to the XML representation of background knowledge and summaries.

As stated in 5.4, a quality measure is used to evaluate candidate summaries during the learning task and to determine the appropriate learning operator. This quality measure combines:

– a contrast measure, which is expected to be monotonically decreasing as the specificity of the summary partition increases,

– and a typicality measure, which reciprocally is expected to be monotonically increasing as the specificity increases.

These measures are normalized over the [0, 1] range. Therefore, at the root of the hierarchy, contrast will be maximal and typicality minimal, while the opposite will be observed for a leaf summary. This property is inherited from the concept representation model of the concept formation paradigm, as defined by E. Rosch (ROS 78).

The typicality of a partition reflects the extent to which the intensional descriptions of summaries are not scattered over the attribute domains, i.e. there exist only a few linguistic descriptors on each attribute for the summaries of a given partition (see 4.1). It is based on a specificity measure defined as:

Sp(z.A) = (|D^+_A| − |0^+ z.A|) / (|D^+_A| − 1)

where |X| is the cardinality of the crisp set X.
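A short transcription, assuming the upper domain and the summary description are given as in the previous sections (the toy names are ours):

    def specificity(upper_domain, z_A):
        """Sp(z.A) = (|D+A| - |0+ z.A|) / (|D+A| - 1); z_A maps labels to grades."""
        support = {d for d, grade in z_A.items() if grade > 0}
        return (len(upper_domain) - len(support)) / (len(upper_domain) - 1)

    D_INCOME = {"none", "miserable", "modest", "reasonable",
                "comfortable", "enormous", "outrageous"}
    z_income = {"enormous": 1.0, "miserable": 1.0}
    print(specificity(D_INCOME, z_income))   # (7 - 2) / (7 - 1), about 0.83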

Although the typicality measure is not uniquely implemented, its instantiation does not yield great differences in the resulting summary. Indeed, its behavior is well bounded by the fuzzy linguistic partitions of the attribute domains, and the implementation exactly reflects the model's representation of background knowledge.


On the opposite, the contrast measure greatly affects the resulting hierarchy. For example, using optimistic or pessimistic aggregation operators (respectively the max and min operators) to compute contrast will lead to very different situations: a wide hierarchy with barely any depth for the optimistic computation, and a binary tree for the pessimistic one. Of course, such a choice would not be very interesting, because the singularity of the produced hierarchy does not contribute to a human-friendly representation of the data set at different levels of granularity. Therefore, averaging operators are preferred to compute contrast and typicality.

The effective computation of the contrast measure is based on a dissimilarity degree. The dissimilarity between two summaries is intended to reflect the distance between their respective intensional descriptions. This measure can itself be tuned to allow domain-specific treatments. For instance, if the user wants a particular attribute to be given special care, the dissimilarity measure will be adjusted with this goal. The dissimilarity measure, however, does not directly affect the properties of the hierarchy. It rather affects the way tuples are grouped together and, consequently, how summaries are described. However, defining a domain-specific dissimilarity measure requires a higher level of expertise on the domain. The prototype implements a general dissimilarity measure δ which is expected to meet most of the real cases. Its expression is based on a resemblance measure Re such that δ(zi.A, zj.A) = 1 − Re(zi.A, zj.A):

Re(zi.A, zj.A) = |zi.A ∩_min zj.A| / min(|zi.A|, |zj.A|)

where |X| is the cardinality (sigma count) of the fuzzy set X, and ∩_min is the standard fuzzy intersection connective operator.
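The general dissimilarity measure can be sketched as follows, with the sigma count as fuzzy cardinality and min as intersection (the data names are ours):

    def sigma_count(F):
        """Fuzzy cardinality: the sum of the membership grades."""
        return sum(F.values())

    def resemblance(F, G):
        """Re(F, G) = |F inter_min G| / min(|F|, |G|), with the min intersection."""
        universe = set(F) | set(G)
        inter = sum(min(F.get(x, 0.0), G.get(x, 0.0)) for x in universe)
        return inter / min(sigma_count(F), sigma_count(G))

    def dissimilarity(F, G):
        return 1.0 - resemblance(F, G)

    zi = {"enormous": 1.0, "miserable": 1.0}
    zj = {"miserable": 1.0}
    print(dissimilarity(zi, zj))   # 0.0: zj is fully included in zi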

The contrast of a partition is then evaluated by computing a mean of the pairwise dissimilarities observed between the summaries of the partition. The hierarchy wideness is controlled by applying a correction to the contrast measure, proportional to the cardinality of the given partition. The decision function of this correction can be adjusted to meet the requirements of the specific mining task, and will affect the average wideness and maximum depth of the resulting hierarchy.

Besides, the graphical user interface of our prototype allows the user to intuitively define the background knowledge in terms of fuzzy sets, as well as to choose an appropriate strategy and to browse through the generated summary hierarchy.

6.2. The Commercial Banking Data Set

Under an agreement, the CIC Banking Group provided us with an extract of statistical data used for behavioral studies of customers. The database consists of a single table in which each record represents a customer, and the fields (attributes) describe the customer in terms of age, income or occupation, as well as the banking products this customer holds (accounts, credit cards, loans, ...). Finally, several attributes reflect statistics over the operations the customer performs over a monthly period (number of operations, total cash withdrawal, ...). The database represents a set of 33 700 customers and 70 attributes. It is to be noted that some of the database values are absent and some are incoherent.

6.3. Construction of BK

Marketing experts provided us with the vocabulary they use to describe the values of each attribute. For instance, they gave the ranges of income they would qualify as High, Average or Low. The fuzzy approach of our system allows us to take into account the inherent imprecision of such a vocabulary.

In addition, we used some basic mining tools, natively built into the prototype, to allow a refinement of the linguistic partitions over each attribute domain; the system simply provides an image of the distribution of tuple values on each attribute domain, such that the user can tune the linguistic partition to better fit the data. It is to be noted that an extensive use of such tools, allowing one to define partitions that meet some mathematical properties, is outside our purpose. Indeed, summaries are first intended to reflect the content of the database with the user's vocabulary. Therefore, the background knowledge is built by a domain expert, and those tools are only used to check the consistency and coverage properties.

For the results presented in this paper, we used a subset of the 10 most relevant attributes. On each attribute, we defined between 3 and 8 modalities, leading to a total of 1 036 800 possible descriptor combinations.

6.3.1. Behavior of the Summarization Task

Figure 3 shows the evolution of the number of learning operations. Each bar of the figure gives the total number of learning operators required for the treatment of a set of 1 000 tuples. The learning operators Add a level and Merge are more frequent than Split and New due to the chosen strategy. Their ratio, however, remains the same along the whole process, which is a good indicator of the well-balanced use of those operators.

Figure 4 shows the evolution of the number of leaf nodes. Leaf nodes are the most specific summaries and are defined with a single descriptor on each attribute. Therefore, the number of leaf nodes is equal to the number of attribute/rewritten-value combinations found in the database. The curve appears to have two discontinuities, the first around the 6 000th processed tuple and the second around the 18 000th processed tuple, showing that new value combinations appear at a higher rate at those stages than during the rest of the process. Those two discontinuities are correlated with the two brusque increases of the learning operators used for handling those new tuples (Figure 3). After a sequence of learning activity to take into account the new modalities found in the data set, the hierarchy becomes stable again and fewer operations are required for the incorporation of incoming tuples.


Figure 3. Usage of Learning Operators

Figure 4. Evolution of the Number of Leaves

Figure 5. Evolution of the Hierarchy Depth


Figure 6. Performance of the Summarization Task

The evolution of the average depth of the hierarchy, shown in Figure 5, converges rapidly to 14, the maximum being stable at 25. This is due to the naturally bounded nature of the hierarchy, which follows from the BK definition (see Section 5.5).

The SAINT ETIQ prototype has not been built with all the possible optimizations. However, its performance appears to conform to the expected behavior. Figure 6 shows the performance evolution during the summary process. All the 55 724 candidates were processed within 21 minutes, with at least 50% of the computation time being taken by logging and statistics tasks. Two factors mainly affect the process performance:

– The hierarchy size and balance,

– The hierarchy stability.

Those factors are closely linked since, for any particular learning strategy, they only depend on the sequential order of the data values. Figure 6 shows two local minima, around tuple 6 000 and tuple 18 000, that can be explained by the increase in the number of operations necessary to handle those parts of the database, as discussed above. It is expected that, as the number of processed tuples increases, new candidate tuple values become rare. In this context, the performance of the process will tend to be constant and will only depend on the final stable hierarchy size and balance. At this stage, there is no more learning, and the process only performs a classification task, with performance similar to tree-based indexing methods.

6.4. Interpretation

The rewriting step of the summarization process can possibly produce many candidate tuples for each database record. The exact number of candidate tuples depends on the fuzziness introduced in BK. The BK defined for this test produced 55 724 rewritten candidates from the 33 700 original tuples. The summarization process is applied to all those candidate tuples and leads to a resulting hierarchy with 14 766 leaf nodes (see Figure 4). This number of leaf nodes is rather small regarding the cardinality of the cartesian product (1 036 800) and is a first interesting result of the summary process.

Figure 7. Wideness of the Final Summary Hierarchy

Figure 7 shows the wideness of the final hierarchy at each level. The average wideness is about half the maximum wideness, which indicates a well-balanced tree.

The graphical interface provides the user with an efficient tool to browse through the final hierarchy and visualize the intensional and extensional content of each node. When clicking on a node, the graphical interface also shows the intentional content of each child node and highlights the differences existing between the intentional descriptors of those child nodes. This way, the user has an immediate understanding of the distinctive features between each sub-level of the hierarchy and can easily browse through the whole hierarchy.

Considering a particular level, the user can interpret the relative cardinality of the child nodes; the task of inferring knowledge from this is left to him. For example, a summary with a small cardinality will express particular modalities that are unlikely to occur (e.g. a customer with many credit cards and only one account).

6.5. Supervised knowledge discovery

Browsing through the hierarchy may allow the user to grab some general knowledge, but it is unlikely that he would be able to answer the specific questions he may be looking for. Thus, the SAINT ETIQ prototype comes with a set of tools to help the user in his search.

The system first builds an index of all the summary leaves. To build this index, all the summaries are uniquely identified by their position within the hierarchy, for instance R.1.1.2, with R meaning Root and each number being the child number in the list of the summary's children. Therefore, each processed database tuple may lead to one or more candidate tuples, each of which will be found in the extensional content of a particular leaf summary of the hierarchy. The system then builds an index of tuple IDs that allows one to quickly find all the leaf summaries containing one of the candidate tuples generated from this tuple.

Then, a classical SQL query tool allows the user to perform queries over the original database. Those queries are aimed at extracting subsets of data instances which match the user's questions. For instance, in the banking application considered here, we wanted to find out whether customer fidelity corresponds to some of the considered attributes, i.e. whether those attributes were relevant to give some indication about customer fidelity. Customer fidelity was not an attribute used during the summarization process, but some external information allowed us to extract the corresponding subset of tuples.

The extracted tuple subset corresponds to a subset of leaf summaries gathered from the built index. The summary list is displayed with its corresponding effective content and a Query to Summary Matching (QtFM) degree. The QtFM degree is calculated as the ratio between the summary content of the selected tuples and the total content. It expresses the degree to which a summary more or less exactly reflects the query content.
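Under a natural reading of this definition, the QtFM degree of a leaf can be sketched as the ratio below (a guess at the exact bookkeeping, with our own names):

    def qtfm(selected_weight, total_weight):
        """Ratio between the summary content matched by the query and its total content."""
        return selected_weight / total_weight

    # A leaf whose extent weighs 4.0, of which 3.0 comes from tuples returned by the query:
    print(qtfm(3.0, 4.0))   # 0.75: the summary reflects the query content to degree 0.75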

Leaves, however, are the most specific summaries to express tuples, and they would only reflect each of the rewritten modalities of the tuple subset and their relative weights. But if the knowledge to gather is not absolutely trivial, i.e. if many leaf nodes are needed to express all the modalities of the tuple subset, the user will need to look at the data from a more general point of view. To help him with this, we propose two mechanisms:

– The result can be displayed as a list of summaries and their associated measures (QtFM degree and total cardinality) at any level of the hierarchy;

– A graph shows the number of concepts needed to represent the whole tuple subset at the considered level. It indicates a compression ratio which expresses the ratio between the number of leaves and the number of summaries at the considered level. A second ratio indicates the number of summaries needed with regard to the total number of summaries present at this same level.

Those tools help the user to easily locate summaries that efficiently describe the queried tuples, at a generalization level that permits gathering some knowledge. For example, if all the queried tuples are located in very few summaries, it can be concluded that there exists some dependency between the query and the intensional descriptors of those summaries.

7. Conclusion and Future Work

In this communication, we introduced an original fuzzy set-based approach to database summarization, with some features common to, and some distinctive from, the usual KDD processes. The dual representation of summaries has been introduced, as well as the intensive use of background knowledge for the translation step. The main features of the concept formation algorithm used to build the summary hierarchy have also been considered. The scalability of our SAINT ETIQ system has been demonstrated through the good results of a real-life application.

We are now working on a query tool over the summary hierarchy. This module is intended to provide an easier way to infer knowledge from the summaries, by providing the user with a way to highlight the hierarchy nodes that contain part of the query result.

Acknowledgements

We wish to thank the CIC group for providing their banking data and their expertise in the banking domain, without which this study could not have been possible.

References

[BOS 98] BOSC P., LIÉTARD L., PIVERT O., "Extended Functional Dependencies as a Basis for Linguistic Summaries", ZYTKOW J. M., QUAFAFOU M., Eds., Proc. of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'98), vol. 1510 of LNAI, Berlin, Sep 23-26 1998, Springer, p. 255–263.

[BOS 99a] BOSC P., BUCKLES B., PETRY F. E., PIVERT O., "Fuzzy databases", BEZDEK J., DUBOIS D., PRADE H., Eds., Fuzzy sets in approximate reasoning and information systems, vol. 5 of The Handbooks of Fuzzy Sets Series, p. 403–468, Kluwer Academic Publishers, July 1999.

[BOS 99b] BOSC P., PIVERT O., UGHETTO L., "On data summaries based on gradual rules", Proc. of the Int. Conf. on Computational Intelligence, 6th Dortmund Fuzzy Days (DFD'99), vol. 1625 of LNCS, Dortmund, Germany, May 25-28 1999, Springer, p. 512–521.

[COD 90] CODD E. F., The Relational Model for Database Management - Version 2, Addison-Wesley, 1990.

[CUB 99] CUBERO J. C., MEDINA J. M., PONS O., VILA M.-A., "Data Summarization in Relational Databases through Fuzzy Dependencies", Information Sciences, vol. 121, num. 3-4, 1999, p. 233–270.

[DUB 00] DUBOIS D., PRADE H., "Fuzzy sets in data summaries - Outline of a new approach", Proc. of the 8th Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'2000), vol. 2, Madrid, July 3-7 2000, p. 1035–1040.

[FIS 87] FISHER D. H., "Knowledge Acquisition via Incremental Conceptual Clustering", Machine Learning, vol. 2, 1987, p. 139–172, Kluwer Academic Publishers, Boston.

[FIS 96] FISHER D. H., "Iterative Optimization and Simplification of Hierarchical Clusterings", Artificial Intelligence Research, vol. 4, 1996, p. 147–179.

[KAC 99] KACPRZYK J., "Fuzzy Logic for Linguistic Summarization of Databases", Proc. of the 8th Int. Conf. on Fuzzy Systems (FUZZ-IEEE'99), vol. 1, Seoul, Korea, August 22-25 1999, p. 813–818.

[MIC 83] MICHALSKI R. S., STEPP R. E., "Learning from Observation: Conceptual Clustering", MICHALSKI R. S., CARBONELL J. G., MITCHELL T. M., Eds., Machine Learning, an Artificial Intelligence Approach, p. 331–363, Tioga Publishing Co., Palo Alto, CA, 1983.

[PET 96] PETRY F. E., Fuzzy databases - Principles and applications, Kluwer Academic Publishers, 1996.

[PRA 84] PRADE H., TESTEMALE C., "Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries", Information Sciences, vol. 34, 1984, p. 115–143.

[RAS 97] RASMUSSEN D., YAGER R. R., "Fuzzy query language for hypothesis evaluation", ANDREASEN T., CHRISTIANSEN H., LARSEN H. L., Eds., Flexible Query Answering Systems, Kluwer Academic Publishers, 1997, p. 23–43.

[REI 91] REICH Y., "Constructive Induction by Incremental Concept Formation", FELDMAN Y. A., BRUCKSTEIN A., Eds., Artificial Intelligence and Computer Vision, Amsterdam, 1991, Elsevier Science Publishers, p. 191–204.

[ROS 78] ROSCH E., "Principles of Categorization", ROSCH E., LLOYD B. B., Eds., Cognition and Categorization, p. 27–48, Erlbaum, Hillsdale, NJ, 1978.

[RUS 69] RUSPINI E. H., "A new approach to clustering", Information and Control, vol. 15, num. 1, 1969, p. 22–32.

[YAG 82] YAGER R. R., "A new approach to the summarization of data", Information Sciences, vol. 28, num. 1, 1982, p. 69–86.

[ZAD 65] ZADEH L. A., "Fuzzy Sets", Information and Control, vol. 8, 1965, p. 338–353.

[ZAD 75] ZADEH L. A., "The concept of a linguistic variable and its application to approximate reasoning-I", Information Sciences, vol. 8, 1975, p. 199–249.

[ZEM 84] ZEMANKOVA M., KANDEL A., Fuzzy Relational Databases — A Key to Expert Systems, Verlag TÜV Rheinland, Cologne, Germany, 1984.


Uncertainty and non-uniform hypotheses in deductive databases

Yann Loyer — Umberto Straccia

Istituto di Elaborazione della Informazione, Consiglio Nazionale delle Ricerche, Area della Ricerca CNR di Pisa, Via Moruzzi, 1 I-56124 Pisa

[email protected], [email protected]


ABSTRACT. Different many-valued logic programming frameworks have been proposed to manage uncertain information in deductive databases and logic programming. A feature of these frameworks is that they rely on a predefined assumption or hypothesis, i.e. an interpretation that assigns the same default truth value to all the atoms of a program; e.g. under the open world assumption, by default all atoms have the unknown truth value. In this paper we extend these frameworks along three directions: (i) we introduce non-monotonic modes of negation; (ii) the default truth values of atoms need not all be equal to each other; and (iii) a hypothesis can be a partial interpretation. We show that our approach extends the usual ones: if we restrict our attention to classical logic programs and consider uniform hypotheses, then our semantics reduces to the usual semantics of logic programs. In particular, under the everything-false assumption, our semantics captures and extends the well-founded semantics to these frameworks.


KEYWORDS: logic and databases, uncertainty, non-monotonic reasoning, hypotheses.


1. Introduction

The treatment of uncertain information is an important problem in those application domains of artificial intelligence where the real-world information to be represented is of an imperfect nature. A substantial amount of work has been carried out in these domains over the last decades: many concepts have been studied, many problems identified, and a number of solutions developed.

First-order logic has been the basis of most knowledge representation formalisms. Its main elements (individuals, properties, and the relations between them) naturally capture the way people encode their knowledge. Unfortunately, it is severely limited in its ability to represent our uncertainty about the world: a fact can only be known as being false, true, or neither. Yet most of our knowledge about the real world is not absolutely true. Moreover, practical considerations require that a framework used for knowledge representation with uncertainty admit efficient implementations and computations. Deductive databases, with their modularity property and their powerful query evaluation techniques, have attracted the attention of many researchers, and numerous frameworks for deductive databases with uncertain information have been proposed [CAO00, DEK97, DUB91, FIT91, FUH00, ISH85, KIF88, KIF92, LAK94a, LAK94b, LAK01, LU96, LUK98, NG91, NG93a, NG93b, SHA83, SUB87, VnE86, WAG98, WUE95].

The uncertainty-handling formalisms in the proposed frameworks include probability theory [FUH00, LAK94b, LUK98, NG91, NG93a, NG93b, WUE95], fuzzy set theory [CAO00, SHA83, VnE86, WAG98], many-valued logics [FIT91, KIF88, KIF92, LAK01, LU96] and possibilistic logic [DUB91]. These frameworks differ in (i) their notion of uncertainty; (ii) the way uncertainties are handled; and (iii) the way uncertainty is associated with the facts and rules of a program. On the basis of (iii), these approaches can be classified into two categories: (a) those based on annotations (AB) and (b) those based on implications (IB).

In the AB approach, a rule is of the form

A : f(β1, ..., βn) ← B1 : β1, ..., Bn : βn

and asserts that "the certainty of the atom A is at least (or lies in) f(β1, ..., βn) if the certainty of the atom Bi is at least (or lies in) βi, 1 ≤ i ≤ n". Here, f is a computable n-ary function and βi is either a constant or a variable over an appropriate certainty domain. As examples of AB approaches, one can cite [KIF88, KIF92, NG91, NG93a, NG93b, SUB87].

In the IB approach, a rule is of the form

A ←α B1, ..., Bn

and asserts that the certainty associated with the implication B1 ∧ ... ∧ Bn → A is α. Given a valuation v associating certainty values with the Bi's, the certainty of A is computed by taking the "conjunction" of the values v(Bi) and, in some way, "propagating" it to the head of the rule. As examples of IB approaches, one can cite [ESC94, FIT91, LAK94a, LAK94b, LAK01, VnE86]. Our aim in this paper is not to compare the two approaches; the reader is referred to [LAK01] for an exhaustive comparison. We limit our contribution in this respect to recalling the following facts [LAK01]: (i) although the way implication is handled in the AB approach is closer to classical logic, the way rules are fired in the IB approach seems intuitively more natural; and (ii) the AB approach is strictly more expressive than the IB approach. The drawback is that query resolution in the AB approach is more complicated, e.g. the operator is in general not continuous, whereas it is in the IB approaches. For these reasons, it is acknowledged that the IB approach is simpler to use and more suitable for an efficient implementation.

In any case, among the characteristics common to these approaches, one can point out the absence of non-monotonic modes of negation and the fact that they rely on uniform hypotheses, i.e. hypotheses associating the same default value with the atoms whose truth value cannot be deduced from the program. In the AB approach, the open world assumption is used, i.e. every atom is considered unknown by default, whereas the IB approach uses the closed world assumption, according to which every atom is associated by default with the least element of the truth lattice.

Although uniform hypotheses are widely used, non-uniformity is desirable in some cases, as the following example shows.¹

Example 1. Consider the situation in which a judge must decide whether a person named Ted, accused of murder, should be indicted. To this end, the judge collects information from two sources: the prosecutor and Ted's lawyer. The judge then combines these facts by means of a set of rules in order to make a decision. In our example, we assume that the judge has collected a set of facts F = {temoin(John), amis(John, Ted)}, which he combines with a set of rules R as follows² (we keep the French predicate names: temoin = witness, amis = friends, mobile = motive, inculpe = indicted, presomption_d_innocence = presumption of innocence):

R = {
    suspect(X) ← mobile(X)
    suspect(X) ← temoin(X)
    innocent(X) ← alibi(X, Y) ∧ ¬amis(X, Y)
    innocent(X) ← presomption_d_innocence(X) ∧ ¬suspect(X)
    amis(X, Y) ← amis(Y, X)
    amis(X, Y) ← amis(X, Z) ∧ amis(Z, Y)
    inculpe(X) ← suspect(X)
    inculpe(X) ← ¬innocent(X)
}

1. The need for non-uniform hypotheses has already been shown by Fuhr and Rölleke, who propose a modification of the programs to simulate a combination of the closed world and open world assumptions [FUH97].
2. To simplify the presentation, we do not consider uncertainty here, although in such a situation we would have to handle some, as we will see later.


A few comments on these rules. The first two rules of R describe the prosecutor's work: in order to show that a person X is a suspect, the prosecutor tries to find a motive (first rule) or a witness against X (second rule). The third and fourth rules of R describe the lawyer's work: in order to show that a person X is innocent, the lawyer tries to find an alibi for X provided by a person who is not a relation or a friend of X (third rule), or to invoke the presumption of innocence if X is not a suspect (fourth rule). Finally, the last two rules of R are the judge's "decision rules".

Which value should be associated with inculpe(Ted)? Ted should be indicted if it is explicitly shown that he is a suspect or not innocent.

We can easily notice that uniform hypotheses, such as the closed world assumption and the open world assumption, are not appropriate. If we follow the closed world assumption and associate with every atom the value false by default, then the judge will deduce that Ted is not innocent and must be indicted. If we follow the open world assumption and associate with every atom the value unknown by default, then the judge will not be able to deduce anything concerning the values of the atoms suspect(Ted), innocent(Ted) and inculpe(Ted), and will therefore not be able to make a decision. We could follow another uniform hypothesis associating the value true with every atom, but then the judge would deduce that Ted is a suspect and must be indicted.

An appropriate non-uniform hypothesis in such a situation would be to assume by default that the atoms mobile(Ted), temoin(Ted) and suspect(Ted) are false, that the atom presomption_d_innocence(Ted) is true, and that the others are unknown. Under such a hypothesis, the judge could infer that Ted is innocent, not a suspect, and must not be indicted.

We believe that we should not be limited to the use of uniform hypotheses, but that we should be able to associate with a logic program a semantics founded on a non-uniform hypothesis representing our default knowledge.

To this end, we extend the parametric deductive database approach [LAK01], an approach that unifies the various IB approaches, along three directions: (i) we introduce negation operations in the programs, i.e. we extend the IB approaches with non-monotonic modes of negation; (ii) the default values of the atoms are not necessarily all identical; and (iii) a hypothesis may be partial and concern only a part of the atoms.

We show that our approach extends the usual ones: if we restrict our attention to the usual logic programs and deductive databases, and to uniform hypotheses, then our semantics captures the usual semantics. In particular, we show that under the hypothesis according to which everything is false, (i) for parametric deductive databases we obtain the semantics proposed in [LAK01]; and (ii) for Datalog programs with negation, we obtain the well-founded semantics [VnG91]. If the considered hypothesis associates with every atom the value unknown, then our semantics captures the Kripke-Kleene semantics [FIT85].

In the next section, we introduce the syntax of the logical language, define the notion of model, and propose the operators by means of which, in Section 3, we define the semantics of these logic programs. Section 4 contains comparisons between our semantics and the usual ones.

2. Syntax and semantics: preliminaries

We recall the syntactic aspects of the parametric deductive databases presented in [LAK01] and extend them with negation.

Let L be a first-order language containing infinitely many variable symbols, finitely many constant and predicate symbols, but no function symbols. Although L contains no function symbols, it contains symbols for families of propagation (F_p), conjunction (F_c) and disjunction (F_d) functions, called combination functions.

Let 〈T, ⪯, ⊗, ⊕〉 be a complete lattice of certainty values and B(T) the set of multisets over T. We denote by ⊥ and ⊤ respectively the least and greatest elements of T. A propagation function is a function from T × T to T, and a conjunction or disjunction function is a function from B(T) to T. Each type of function must satisfy some of the following properties³:

1) monotonicity with respect to each of its arguments;

2) continuity with respect to each of its arguments;

3) bounded above: f(α1, α2) ⪯ αi, for i = 1, 2, ∀α1, α2 ∈ T;

4) bounded below: f(α1, α2) ⪰ αi, for i = 1, 2, ∀α1, α2 ∈ T;

5) commutativity: f(α1, α2) = f(α2, α1), ∀α1, α2 ∈ T;

6) associativity: f(α1, f(α2, α3)) = f(f(α1, α2), α3), ∀α1, α2, α3 ∈ T;

7) f({α}) = α, ∀α ∈ T;

8) f(∅) = ⊥;

9) f(∅) = ⊤;

10) f(α, ⊤) = α, ∀α ∈ T.

As postulated in [LAK01]:

1) a conjunction function in F_c must satisfy properties 1, 2, 3, 5, 6, 7, 9 and 10;

2) a propagation function in F_p must satisfy properties 1, 2, 3 and 10;

3) a disjunction function in F_d must satisfy properties 1, 2, 4, 5, 6, 7 and 8.

3. To simplify the presentation, we state the properties by treating all the functions as binary functions on T.

We also assume the existence of a function from T to T, called the negation function, anti-monotone with respect to ⪯ and satisfying ¬¬α = α for all α ∈ T, and ¬⊥ = ⊤.
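On the unit-interval lattice used in Example 2 below, a typical instantiation of these families, together with a spot check of some of the required properties, might look as follows (a sketch; the function names are ours):

    min_conj = min                          # conjunction function (properties 1,2,3,5,6,7,9,10)
    prod_prop = lambda a, b: a * b          # propagation function (properties 1,2,3,10)
    prob_sum = lambda a, b: a + b - a * b   # disjunction function (properties 1,2,4,5,6,7,8)
    neg = lambda a: 1.0 - a                 # negation: involutive, anti-monotone, neg(0) = 1

    # Spot checks on a small grid of certainty values:
    grid = [i / 10 for i in range(11)]
    assert all(min_conj(a, b) <= a for a in grid for b in grid)    # bounded above
    assert all(prob_sum(a, b) >= a for a in grid for b in grid)    # bounded below
    assert all(abs(neg(neg(a)) - a) < 1e-9 for a in grid)          # involution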

Definition 1 (Parametric normal program). A parametric normal program P (np-program) is a 5-tuple 〈T, R, C, P, D〉, whose components are defined as follows:

1) T is a finite set of truth values partially ordered by ⪯; 〈T, ⪯, ⊗, ⊕〉 is a complete lattice whose infimum is ⊗ and whose supremum is ⊕. The least element of the lattice is denoted ⊥ and the greatest ⊤;

2) R is a finite set of parametric normal rules (np-rules) of the form

r : A ←α_r B1, ..., Bn, ¬C1, ..., ¬Cm

where A is an atomic formula, B1, ..., Bn, C1, ..., Cm are atomic formulas or elements of T, and α_r ∈ T \ {⊥} is the certainty degree of the rule;

3) C is a function that associates with each np-rule a conjunction function in F_c;

4) P is a function that associates with each np-rule a propagation function in F_p;

5) D is a function that associates with each predicate symbol in P a disjunction function in F_d.

To simplify the presentation, we write

r : A ←α_r B1, ..., Bn, ¬C1, ..., ¬Cm; 〈f_d, f_p, f_c〉

to represent an np-rule for which f_d ∈ F_d is the disjunction function associated with the predicate symbol of A, and f_c ∈ F_c and f_p ∈ F_p are respectively the conjunction and propagation functions associated with r. Intuitively, the conjunction function (e.g. ⊗) determines the truth value of the conjunction of B1, ..., Bn, ¬C1, ..., ¬Cm; the propagation function (e.g. ⊗) determines how to "propagate" this truth value from the body of the rule to its head, taking into account the certainty degree α_r associated with r; and the disjunction function (e.g. ⊕) indicates how to combine the values obtained when an atom appears in the head of several rules. An np-program without negation is a parametric program in the sense of [LAK01].

We define the Herbrand base HB_P of an np-program P as the set of instantiations of the atoms appearing in P, and P* as the instantiation of P, i.e. the set of all instantiations of the rules of P.


A classical logic program is an np-program whose only conjunction (and propagation) function is ⊗, whose only disjunction function is ⊕, and such that αr = ⊤ for every rule r ∈ P. Such a program will be written in the usual way.

Example 2 Consider the complete lattice 〈T, ⪯, ⊗, ⊕〉, where T is [0, 1] and, ∀a, b ∈ [0, 1], a ⪯ b iff a ≤ b, a ⊗ b = min(a, b), and a ⊕ b = max(a, b). Consider the functions fd(a, b) = a + b − a·b and fc(a, b) = fp(a, b) = a·b. The negation function is given by ¬a = 1 − a. The following set P is an np-program⁴:

P =

suspect(X) ←^0.6 mobile(X) 〈fd, ⊗, −〉
suspect(X) ←^0.8 temoin(X) 〈fd, ⊗, −〉
innocent(X) ←^1 alibi(X, Y) ∧ ¬amis(X, Y) 〈fd, fp, ⊗〉
innocent(X) ←^1 presomption_d_innocence(X) ∧ ¬suspect(X) 〈fd, fp, ⊗〉
amis(X, Y) ←^1 amis(Y, X) 〈⊕, fp, −〉
amis(X, Y) ←^0.7 amis(X, Z) ∧ amis(Z, Y) 〈⊕, fp, fc〉
inculpe(X) ←^1 suspect(X) 〈⊕, fp, −〉
inculpe(X) ←^1 ¬innocent(X) 〈⊕, fp, −〉
temoin(John) ←^1 1 〈⊕, fp, −〉
mobile(Jim) ←^1 0.8 〈⊕, fp, −〉
alibi(Jim, John) ←^1 1 〈⊕, fp, −〉
amis(John, Ted) ←^1 0.8 〈⊕, fp, −〉
amis(Jim, Ted) ←^1 0.6 〈⊕, fp, −〉

The choice of the disjunction function fd for the predicate suspect (resp. innocent) models the fact that, if we have several ways of deducing that a person is suspect (resp. innocent), then we wish to combine the truth values so that the suspicion increases, rather than merely keeping the maximal value.

The sixth rule illustrates the use of conjunction and propagation functions. It corresponds to the "transitivity of friendship": the value decreases as the chain connecting two persons grows longer.
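For illustration, here is a minimal Python sketch (ours, not part of the paper; the names f_d, f_c, f_p and neg mirror the symbols above, and the input degrees are hypothetical) showing how fd makes combined suspicion grow where ⊕ = max would not, and how rule 6 decays values along a chain:

    def f_d(a, b): return a + b - a * b   # disjunction: suspicion accumulates
    def f_c(a, b): return a * b           # conjunction used by rule 6
    def f_p(a, b): return a * b           # propagation
    def neg(a):    return 1 - a           # negation function

    # Rule 6 on a hypothetical chain X -- Z -- Y with amis(X,Z) = 0.6, amis(Z,Y) = 0.8:
    print(f_p(0.7, f_c(0.6, 0.8)))        # 0.336: the value decays along the chain

    # Two hypothetical derivations of suspect(X), with degrees 0.6 and 0.8:
    print(f_d(0.6, 0.8), max(0.6, 0.8))   # 0.92 vs 0.8: fd strengthens suspicion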

2.1. Interpretations of programs

An interpretation of an np-program P is a function that associates with each element of the Herbrand base of P a value in T. We denote by V_P(T) the set of interpretations of P.

4. The symbol − in place of a function symbol denotes the fact that this function is not needed. Note that every conjunction function is also a propagation function.


One of the main problems of logic programming is to determine the semantics of a program. In the classical approach, the semantics of a program P is determined by selecting a particular interpretation of P within the set of models of P. For a logic program without negation, and in particular in the parameterized IB approach, this chosen model is generally the least model of P with respect to ⪯.

Introducing negation into logic programming, and in particular into the parameterized IB approach, causes the loss of the existence and unicity of the least model, as the following examples show.

Example 3 Suppose that our lattice is Boolean logic and that P is the np-program defined by the two rules

A ←^t ¬B

B ←^t ¬A

The program P has two minimal models: M1, which assigns to A the value false and to B the value true, and M2, which assigns to A the value true and to B the value false. M1 and M2 are not comparable with respect to the truth order.

Example 4 Let T = [0, 1]. Consider fc(a, b) = min(a, b), fd(a, b) = max(a, b), and fp(a, b) = a·b. The negation function is ¬a = 1 − a. Consider the program P:

A ←^1 ¬B; 〈fd, fp, −〉

B ←^1 ¬A; 〈fd, fp, −〉

A ←^1 0.2; 〈fd, fp, −〉

B ←^1 0.3; 〈fd, fp, −〉

This program has infinitely many models (those in the grey area of Figure 1). It also has infinitely many minimal models (those on the thick line) with respect to the order (a, b) ⪯ (c, d) iff a ⪯ c and b ⪯ d.
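This claim can be checked numerically (a sketch we add, not part of the paper): a pair (a, b) = (I(A), I(B)) is a model exactly when a ≥ max(0.2, 1 − b) and b ≥ max(0.3, 1 − a), so the minimal models form the segment a + b = 1 with 0.2 ≤ a ≤ 0.7:

    # Grid scan of candidate interpretations of Example 4.
    step = 0.05
    grid = [round(k * step, 2) for k in range(21)]
    models = [(a, b) for a in grid for b in grid
              if a >= max(0.2, 1 - b) and b >= max(0.3, 1 - a)]

    # Minimal models: no other model is componentwise smaller.
    def minimal(p):
        a, b = p
        return not any(c <= a and d <= b and (c, d) != p for (c, d) in models)

    mins = [p for p in models if minimal(p)]
    print(len(models), len(mins))                        # many models, 11 minimal ones
    print(all(abs(a + b - 1) < 1e-9 for a, b in mins))   # True: the thick line a + b = 1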

The usual way of handling this kind of situation consists in introducing a new truth value, unknown, for atoms, and taking as the semantics of P the interpretation that assigns this value unknown to A and B.

The idea is to define a new value u and a new lattice T′ = T ∪ {u}. This value represents an unknown or undefined value, meaning that u stands for a value of T that is not currently known. We would then like to extend the conjunction and disjunction functions to this new lattice as follows:

– fc(u, x) = u if x ≠ ⊥, and ⊥ otherwise;


Figure 1. Infinitely many minimal models.

– fd(u, x) = u if x ≠ ⊤, and ⊤ otherwise.

It is also necessary to extend the order ⪯ from T to T′. We know that ⊥ ⪯ x ⪯ ⊤ for every x ∈ T′. But from the first constraint we deduce u ⪯ x for every x ≠ ⊥ and, from the second, x ⪯ u for every x ≠ ⊤. This extension is possible only for T = {⊥, ⊤}.

Rather than introducing a new value, we shall consider partial interpretations: a partial interpretation is an interpretation that assigns truth values to some atoms but not necessarily to all of them. Atoms that have no value are called undefined atoms. An atom left undefined by a partial interpretation can be seen as an atom whose value is currently unknown.

Definition 2 (Partial interpretation) Let P be an np-program. A partial interpretation I of P is a set {A : µ | A ∈ HB_P and µ ∈ T}.

A partial interpretation can be seen as a function defined by: for every instantiated atom A, if A : µ ∈ I then I(A) = µ, otherwise I(A) is undefined. Of course, an interpretation can be considered as a partial interpretation. Interpretations (total or partial) will be used as functions or as sets depending on the context.

From now on, given an np-program P, we write r_A for a rule (r : A ←^αr B1, ..., Bn, ¬C1, ..., ¬Cm; 〈fd, fp, fc〉) ∈ P* whose head is A; and, given an interpretation I such that all the premises of the body of r_A are defined in I, we write I(r_A) for the evaluation of the body of r_A with respect to I, i.e.

I(r_A) = fp(αr, fc(I(B1), . . . , I(Bn), ¬I(C1), . . . , ¬I(Cm)))

I(r_A) is undefined if some premise of the body is undefined in I, unless there exists i such that I(Bi) = ⊥ or I(Ci) = ⊤; in that case we set I(r_A) = ⊥.
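A small Python sketch (ours, with assumed names) of this evaluation rule, encoding an undefined atom as None and using values consistent with Example 2:

    # Body evaluation I(r_A) for a rule with certainty alpha, positive premises Bs
    # and negated premises Cs, under a partial interpretation I (a dict; missing = undefined).
    def eval_body(I, alpha, Bs, Cs, fc, fp, neg, bot=0.0, top=1.0):
        pos = [I.get(b) for b in Bs]
        negs = [I.get(c) for c in Cs]
        # A bottom positive premise (or a top negated one) forces the value bottom...
        if any(v == bot for v in pos) or any(v == top for v in negs):
            return bot
        # ...otherwise any undefined premise makes the body undefined.
        if any(v is None for v in pos + negs):
            return None
        vals = pos + [neg(v) for v in negs]
        acc = vals[0]
        for v in vals[1:]:
            acc = fc(acc, v)
        return fp(alpha, acc)

    # Rule: innocent(Jim) <-^1 alibi(Jim,John), not amis(Jim,John); <fd, fp, (x)>
    I = {"alibi(Jim,John)": 1.0, "amis(Jim,John)": 0.336}
    print(eval_body(I, 1.0, ["alibi(Jim,John)"], ["amis(Jim,John)"],
                    fc=min, fp=lambda a, v: a * v, neg=lambda a: 1 - a))  # 0.664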

Definition 3 (Satisfaction of an np-program) Let P be an np-program and I a partial interpretation of P. We say that I satisfies (is a model of) P, written |=_I P, iff ∀A ∈ HB_P:

1) if there exists a rule r_A ∈ P* such that I(r_A) = ⊤, then I(A) = ⊤;


2) if I(r_A) is defined for all rules r_A ∈ P*, then I(A) ⪰ fd(X), where

X = {| I(r_A) : r_A ∈ P* |}

with fd the disjunction function associated with π(A), the predicate symbol of A.

Example 5 Let P be the program of Example 4; the interpretations indicated in Figure 1 are all models of P. The interpretation I that leaves A and B undefined is also a model of P.

If we restrict our attention to positive programs, then our notion of satisfaction reduces to that of [LAK01] when the interpretation I is total, i.e. defined for all the atoms of HB_P.

2.2. Parameterized operators

First, we extend the order ⪯ defined on T to the space of interpretations V_P(T). Let I1 and I2 be elements of V_P(T); then I1 ⪯ I2 if and only if I1(A) ⪯ I2(A) for every instantiated atom A. Equipped with this order, V_P(T) is a complete lattice, and we have (I1 ⊗ I2)(A) = I1(A) ⊗ I2(A), and similarly for the other operations. The actions of the functions can be extended from atoms to formulas as follows: I(fc(X, Y)) = fc(I(X), I(Y)), and similarly for the other functions. Finally, for every α ∈ T and every I ∈ V_P(T), I(α) = α.

We now define a new operator T^H_P inspired by [FIT93, LOY00, PRZ95, PRZ90a, PRZ90b]. This operator is parameterized by an interpretation over {⊥, ⊤}. This interpretation represents our default knowledge, and we call it a hypothesis to stress that it represents "assumed knowledge" and not "certain knowledge". Such a hypothesis states that certain atoms are assumed false (⊥) and certain others are assumed true (⊤). T^H_P infers information from two interpretations: the first is used to evaluate the positive literals and the second to evaluate the negative literals in the bodies of the rules of P.

Definition 4 (Parameterized immediate consequence operator) Let P be an np-program and H a hypothesis. The immediate consequence operator T^H_P is a mapping from V_P(T) × V_P(T) to V_P(T) defined by: for every pair (I, J) of interpretations in V_P(T) and every instantiated atom A, if A does not appear as the head of a rule of P, then T^H_P(I, J)(A) = H(A); otherwise T^H_P(I, J)(A) = fd(X), where fd is the disjunction function associated with π(A), the predicate symbol of A, and

X = {| fp(αr, fc({| I(B1), . . . , I(Bn), ¬J(C1), . . . , ¬J(Cm) |})) : (r : A ←^αr B1, ..., Bn, ¬C1, ..., ¬Cm; 〈fd, fp, fc〉) ∈ P* |}.

If the program contains no negation and if H assigns the value ⊥ to every atom, then T^H_P is equivalent to the immediate consequence operator defined in [LAK01].
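A compact sketch (ours; the tuple layout for ground rules is an assumption) of one application of T^H_P, with positive premises read in I, negated ones in J, and headless atoms falling back to H:

    from functools import reduce

    # Each rule is (head, alpha, Bs, Cs, fd, fp, fc). A premise may be an atom
    # name or directly a truth value of T = [0, 1]; negation is neg(a) = 1 - a.
    def val(X, I):
        return I[X] if isinstance(X, str) else X

    def t_step(rules, H, I, J):
        out, by_head = dict(H), {}   # atoms heading no rule keep their H-value
        for head, alpha, Bs, Cs, fd, fp, fc in rules:
            body = [val(b, I) for b in Bs] + [1 - val(c, J) for c in Cs]
            by_head.setdefault((head, fd), []).append(fp(alpha, reduce(fc, body)))
        for (head, fd), vs in by_head.items():
            out[head] = reduce(fd, vs)   # combine heads with the disjunction fd
        return out

    # The two rules for A in Example 4: A <-^1 not B and A <-^1 0.2, with fd = max:
    prod = lambda a, v: a * v
    rules = [("A", 1.0, [], ["B"], max, prod, min),
             ("A", 1.0, [0.2], [], max, prod, min)]
    print(t_step(rules, {"A": 0.0}, {}, {"B": 0.3}))   # {'A': 0.7}: max(1 - 0.3, 0.2)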


Proposition 1 Let P be an np-program and H a hypothesis. T^H_P is monotone in its first argument and anti-monotone in its second argument with respect to ⪯.

Using Proposition 1 and the Knaster-Tarski theorem, we can define an operator S^H_P, inspired by [VnG89] and derived from T^H_P, which takes as input an interpretation J, evaluates the negative literals of the program with respect to J, and then returns the model of the resulting "positive" program obtained by iterating T^H_P from the hypothesis H.

Nevertheless, in order to be able to use non-uniform hypotheses, we must first define a notion of stratification with respect to positive cycles, and then activate the rules according to such a stratification.

Definition 5 (Extended positive cycle) A positive cycle of an np-program P is a set {r_A1, ..., r_An} of rules of P such that for every i in [1...n−1], Ai appears positively in the body of r_Ai+1, and An appears positively in the body of r_A1. A positive cycle C extended with all the rules of P whose head is the head of one of the rules of C is called an extended positive cycle.

Definition 6 (Stratification w.r.t. extended positive cycles) A stratification with respect to extended positive cycles of an np-program P is a sequence of np-programs P1, ..., Pn such that, for the mapping σ from the set of rules of P to [1...n]:

1) P1, ..., Pn is a partition of P;

2) each rule r belongs to P_σ(r);

3) if r1 and r2 are two rules of P such that the head of r2 appears in the body of r1 and there is no extended positive cycle of rules of P containing r1, then σ(r1) = σ(r2);

4) if r1 and r2 are two rules of P belonging to the same extended positive cycle of rules of P, then σ(r1) = σ(r2);

5) if r1 and r2 are two rules of P such that the head of r2 appears in the body of r1 and there is an extended positive cycle of rules of P containing r1, but no extended positive cycle of rules of P containing both r1 and r2, then σ(r2) < σ(r1).

Note that every np-program has at least one stratification w.r.t. extended positive cycles. It can then be shown that:

Proposition 2 Let P be an np-program having a stratification w.r.t. extended positive cycles consisting of a single stratum, and let J be an interpretation over T. Let H be a hypothesis over {⊥, ⊤} such that for every extended positive cycle {r_A1, ..., r_Ak} of P, H(A1) = ... = H(Ak). Then the sequence defined by a0 = H and a_{n+1} = T^H_P(a_n, J) converges.

We now define the operator S^H_P derived from T^H_P.


Definition 7 (Parameterized alternating operator S^H_P) Let P be an np-program having a stratification w.r.t. extended positive cycles P1, ..., Pn, and let J be an interpretation over T. Let H be a hypothesis over {⊥, ⊤} such that for every extended positive cycle {r_A1, ..., r_Ak} of P, H(A1) = ... = H(Ak). S^H_P(J) is the limit of the sequence of interpretations defined by:

– a1 is the iterated fixpoint of the function λx.T^H_P1(x, J) obtained by starting the computation with H;

– ai is the iterated fixpoint of the function λx.T^H_{P1∪...∪Pi}(x, J) obtained by starting the computation with a_{i−1}.

Intuitively, during the computation of S^H_P(J), we fix the values of the negative premises in P by assigning them their values in J. We then consider the resulting "positive" program and evaluate it stratum by stratum. After the evaluation of a stratum, we know that the knowledge obtained cannot be modified by the knowledge that will be inferred by activating the rules of the following strata. We therefore use this knowledge to evaluate the next stratum. A program may in general have several stratifications w.r.t. extended positive cycles, but the result does not depend on the stratification used in the computation.

Note that the notion of stratification and the condition on H that we have introduced are indispensable to ensure the convergence of the computation. The following examples illustrate this point.

Example 6 Let H = {A : ⊤, B : ⊥, C : ⊥, D : ⊥} be a hypothesis and P the np-program defined by P = {A ← ⊥, B ← ⊥, C ← A, D ← B, C ← D, D ← C}. If we compute the sequence I0 = H, Ii = T^H_P(I_{i−1}, J), then we have

I1 = {A : ⊥, B : ⊥, C : ⊤, D : ⊥},
I2 = {A : ⊥, B : ⊥, C : ⊥, D : ⊤},
I3 = {A : ⊥, B : ⊥, C : ⊤, D : ⊥}, . . .

This computation does not terminate. Now, if we use our definition of S^H_P, then we have a stratification of P with two strata: the first contains the first two rules of P and the second the last four rules of P. The computation yields I1 = {A : ⊥, B : ⊥, C : ⊥, D : ⊥} = I2.
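The oscillation can be replayed with a few lines of Boolean iteration (our sketch; ⊤/⊥ encoded as True/False):

    # Unstratified iteration of T^H_P on Example 6: C and D feed each other
    # positively, so the sequence oscillates and never converges.
    def step(i):
        return {"A": False, "B": False,        # A <- bottom, B <- bottom
                "C": i["A"] or i["D"],         # C <- A, C <- D (disjunction = or)
                "D": i["B"] or i["C"]}         # D <- B, D <- C

    I = {"A": True, "B": False, "C": False, "D": False}   # the hypothesis H
    for _ in range(4):
        I = step(I)
        print(I)   # C and D alternate between True and False

    # Stratified: first fix A = B = bottom, then iterate the cycle {C, D} from bottom:
    I = {"A": False, "B": False, "C": False, "D": False}
    print(step(I) == I)   # True: the all-bottom fixpoint is reached immediately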

Example 7 Let P be the np-program P = {A ← B, B ← A} and H = {A : ⊤, B : ⊥}. The condition on H is not satisfied, i.e. H(A) ≠ H(B), and the computation does not terminate.

3. Semantics under non-uniform hypotheses

In this section we determine, among the models of an np-program, the one that corresponds to a given hypothesis. In the rest of the paper, every hypothesis is assumed to assign the same default value to the atoms appearing in the heads of the rules of a given extended positive cycle.


3.1. Semantics under total non-uniform hypotheses

From Proposition 1 we deduce the following property of S^H_P.

Proposition 3 Let P be an np-program and H a total non-uniform hypothesis. S^H_P is anti-monotone w.r.t. ⪯ and, consequently, S^H_P ∘ S^H_P is monotone.

There is a property derived from the Knaster-Tarski theorem dealing with anti-monotone functions on a complete lattice:

Proposition 4 ([YAB85]) Let f be an anti-monotone function on a complete lattice T. Then there exist two elements µ and ν of T, called the extreme oscillation points of f, such that:

– µ and ν are the least and greatest fixpoints of f ∘ f;

– f oscillates between µ and ν, i.e. f(µ) = ν and f(ν) = µ;

– if x and y are also elements of T between which f oscillates, then x and y lie between µ and ν.

S^H_P is anti-monotone with respect to ⪯ and V_P(T) is a complete lattice, so S^H_P has two extreme oscillation points with respect to ⪯. Let I⊥ be the interpretation that assigns the value ⊥ to all the atoms of HB_P, i.e. the least element of V_P(T) with respect to ⪯, and I⊤ the interpretation that assigns the value ⊤ to all the atoms of HB_P, i.e. the greatest element of V_P(T) with respect to ⪯.

Proposition 5 Let P be an np-program and H a total non-uniform hypothesis. S^H_P has two extreme oscillation points, S^H_⊥ = (S^H_P ∘ S^H_P)^∞(I⊥) and S^H_⊤ = (S^H_P ∘ S^H_P)^∞(I⊤), with S^H_⊥ ⪯ S^H_⊤.⁵

As in Van Gelder's alternating fixpoint approach [VnG89], S^H_⊥ and S^H_⊤ are respectively an under-estimation and an over-estimation of P, but with respect to an arbitrary hypothesis H. As the semantics of P, we propose to consider the compromise between these two interpretations, i.e. to consider as defined only those atoms whose values coincide in these two limit interpretations.

Proposition 6 Let P be an np-program and H a total non-uniform hypothesis. Then |=_{S^H_⊥ ∩ S^H_⊤} P.

The interpretation S^H_⊥ ∩ S^H_⊤ is a model of P and will be considered as its semantics with respect to the hypothesis H.

Definition 8 (Compromise semantics) Let P be an np-program. The compromise semantics of P with respect to the total non-uniform hypothesis H, CS_H(P), is defined by CS_H(P) = S^H_⊥ ∩ S^H_⊤.

5. To simplify the presentation, we omit the symbol P in S^H_⊥ and S^H_⊤.


Example 8 Let P be the np-program of Example 2 and H = I⊥; then we have⁶

CS_{I⊥}(P) ⊃ {s(John) : 0.8, s(Jim) : 0.6, s(Ted) : 0, i(John) : 0, i(Jim) : 0.664, i(Ted) : 0, c(John) : 1, c(Jim) : 0.6, c(Ted) : 1}. Now let H be the hypothesis H = {m(X) : 0, w(X) : 0, s(X) : 0, p(X) : 1, a(X, Y) : 0, f(X, Y) : 0, i(X) : 1, c(X) : 0}. Then we have CS_H(P) ⊃ {s(John) : 0.8, s(Jim) : 0.6, s(Ted) : 0, i(John) : 0.2, i(Jim) : 0.7984, i(Ted) : 1, c(John) : 0.8, c(Jim) : 0.6, c(Ted) : 0}.
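To retrace one of these values (a worked check we add, using the functions of Example 2): under H = I⊥ the only applicable rule for innocent(Jim) is the alibi rule, and amis(Jim, John) is obtained through the symmetry and transitivity rules along the chain Jim–Ted–John, so

    \begin{align*}
    amis(Jim, John) &= f_p\bigl(0.7,\; f_c(amis(Jim, Ted),\, amis(Ted, John))\bigr)
                     = 0.7 \cdot (0.6 \cdot 0.8) = 0.336,\\
    i(Jim) &= f_p\bigl(1,\; alibi(Jim, John) \otimes \neg\, amis(Jim, John)\bigr)
            = \min(1,\; 1 - 0.336) = 0.664.
    \end{align*}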

3.2. Semantics under partial non-uniform hypotheses

In the previous section we addressed the case of total hypotheses. In this section we generalize the approach to partial hypotheses.

As explained previously, the approach consisting in introducing a new logical value u and defining a new lattice T′ = T ∪ {u} is not applicable. Moreover, the operators apply to total interpretations. We therefore choose the following approach: a partial hypothesis H over {⊥, ⊤} can be seen as the intersection of two total interpretations H⊥ and H⊤, where H⊥ is H except that ⊥ is assumed for the unknown atoms, while H⊤ is H except that ⊤ is assumed for the unknown atoms. We have H⊥ ∩ H⊤ = H. In order to associate a semantics with an np-program with respect to such a partial hypothesis, we propose to consider the intersection, or consensus, of two semantics.
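Concretely (a trivial sketch of this splitting, not from the paper), with atoms missing from H treated as unknown and ⊥/⊤ encoded as False/True:

    # Split a partial hypothesis H into the two total hypotheses H_bot and H_top.
    def split(H, atoms):
        H_bot = {a: H.get(a, False) for a in atoms}   # unknown atoms assumed bottom
        H_top = {a: H.get(a, True) for a in atoms}    # unknown atoms assumed top
        return H_bot, H_top

    H = {"s": False, "p": True}                        # i and c left unknown
    print(split(H, ["s", "p", "i", "c"]))
    # ({'s': False, 'p': True, 'i': False, 'c': False},
    #  {'s': False, 'p': True, 'i': True,  'c': True})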

Proposition 7 Let P be an np-program and H a partial non-uniform hypothesis. Then |=_{CS_{H⊥}(P) ∩ CS_{H⊤}(P)} P.

This intersection is the model of P that we take as the semantics of P with respect to H.

Definition 9 (Consensus semantics w.r.t. H) Let P be an np-program and H a partial non-uniform hypothesis. The consensus semantics of P with respect to H, C_H(P), is defined by C_H(P) = CS_{H⊥}(P) ∩ CS_{H⊤}(P).⁷

Example 9 Let P be the np-program of Example 2 and H the hypothesis suggested in the introduction, i.e. H = {m(X) : 0, w(X) : 0, s(X) : 0, p(X) : 1}. Then we have C_H(P) ⊃ {s(John) : 0.8, s(Jim) : 0.6, s(Ted) : 0, i(Ted) : 1, c(John) : 0.8, c(Jim) : 0.6, c(Ted) : 0}.

6. In the remainder of the paper, to simplify the presentation, we indicate in the various interpretations only the values of the atoms associated with the predicate symbols suspect, innocent and inculpe, which we denote by s, i and c respectively.
7. Note that every compromise semantics is also a consensus semantics.


4. Comparison with the usual semantics

Our semantics extends the semantics of the parameterized programs of Lakshmanan and Shiri presented in [LAK01] to parameterized programs with negation. This is due to the fact that the mechanism developed to handle negation has no effect on positive programs; consequently, for the hypothesis I⊥, we have

Proposition 8 If P is an np-program without negation, then the compromise semantics CS_{I⊥}(P) of P (or, equivalently, the consensus semantics C_{I⊥}(P)) with respect to the hypothesis I⊥ coincides with the semantics of P of Lakshmanan and Shiri.

We now compare our semantics with the well-founded semantics of Datalog programs with negation defined in [VnG91].

Proposition 9 Let P be a Datalog program with negation. The compromise semantics CS_{I⊥}(P) of P (or, equivalently, the consensus semantics C_{I⊥}(P)) with respect to the hypothesis I⊥ coincides with the well-founded semantics of P.

This approach therefore also extends the well-founded semantics to the setting of parameterized databases.

Example 10 Consider Example 2 and the hypothesis I⊥. We define the Datalog program with negation P′ by replacing in P every truth value by 1, the disjunction function fd by ⊕, and the conjunction function fc by ⊗. Then we have C_{I⊥}(P′) ⊃ {s(John) : 1, s(Jim) : 1, s(Ted) : 0, i(John) : 0, i(Jim) : 0, i(Ted) : 0, c(John) : 1, c(Jim) : 1, c(Ted) : 1}.

Finally, we consider the Kripke-Kleene semantics [FIT85]. We define the hypothesis H∅ by: if A does not appear in the head of any rule of P, then H∅(A) = ⊥; otherwise H∅(A) is undefined.

Proposition 10 Let P be a Datalog program with negation. Let WFS(P) be the well-founded semantics of P and KK(P) the Kripke-Kleene semantics of P. Then KK(P) ⊂ C_{H∅}(P) ⊂ WFS(P).

The consensus semantics C_{H∅}(P) of P represents more knowledge than its Kripke-Kleene semantics, but less than its well-founded semantics. This result reflects the fact that the Kripke-Kleene semantics of a program P is weaker than its well-founded semantics.

Example 11 Let P = {B ← A, B ← ¬A, A ← A}. We have KK(P) = ∅ ⊂ C_{H∅}(P) = {B : ⊤} ⊂ WFS(P) = {A : ⊥, B : ⊤}.


5. Conclusion

We have proposed a general framework for reasoning in deductive databases and logic programming in the presence of uncertainty, negation and missing information. Our approach introduces a notion of semantics for parameterized deductive databases with negation under non-uniform hypotheses on the missing information. We have also seen that our approach captures and extends the usual semantics of deductive databases and logic programming.

6. Bibliography

[CAO00] Tru H. Cao. Annotated fuzzy logic programs. Fuzzy Sets and Systems, 113(2):277–298, 2000.

[DEK97] Alex Dekhtyar and V.S. Subrahmanian. Hybrid probabilistic programs. In Proc. of the 13th Int. Conf. on Logic Programming (ICLP-97), Leuven, Belgium, 1997. The MIT Press.

[DUB91] Didier Dubois, Jérôme Lang, and Henri Prade. Towards possibilistic logic programming. In Proc. of the 8th Int. Conf. on Logic Programming (ICLP-91), pages 581–595. The MIT Press, 1991.

[ESC94] Gonzalo Escalada-Imaz and Felip Manyà. Efficient interpretation of propositional multiple-valued logic programs. In Proc. of the 5th Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU-94), number 945 in Lecture Notes in Computer Science, pages 428–439. Springer-Verlag, 1994.

[FIT85] Melvin Fitting. A Kripke-Kleene semantics for logic programs. Journal of Logic Programming, 2(4):295–312, 1985.

[FIT91] Melvin Fitting. Bilattices and the semantics of logic programming. Journal of Logic Programming, 11:91–116, 1991.

[FIT93] Melvin Fitting. The family of stable models. Journal of Logic Programming, 17(2/3/4):197–225, 1993.

[FUH00] Norbert Fuhr. Probabilistic Datalog: implementing logical information retrieval for advanced applications. Journal of the American Society for Information Science, 51(2):95–110, 2000.

[FUH97] Norbert Fuhr and Thomas Rölleke. HySpirit – a probabilistic inference engine for hypermedia retrieval in large databases. In Proceedings of the 6th International Conference on Extending Database Technology (EDBT), Lecture Notes in Computer Science, pages 24–38. Springer-Verlag, 1997.

[VnG89] Allen Van Gelder. The alternating fixpoint of logic programs with negation. In Proc. of the 8th ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS-89), pages 1–10, 1989.

[ISH85] Mitsuru Ishizuka and Naoki Kanai. Prolog-ELF: incorporating fuzzy logic. In Proc. of the 9th Int. Joint Conf. on Artificial Intelligence (IJCAI-85), pages 701–703, Los Angeles, CA, 1985.

[KIF88] M. Kifer and Ai Li. On the semantics of rule-based expert systems with uncertainty. In Proc. of the Int. Conf. on Database Theory (ICDT-88), number 326 in Lecture Notes in Computer Science, pages 102–117. Springer-Verlag, 1988.

[KIF92] Michael Kifer and V.S. Subrahmanian. Theory of generalized annotated logic programming and its applications. Journal of Logic Programming, 12:335–367, 1992.

[LAK94a] Laks Lakshmanan. An epistemic foundation for logic programming with uncertainty. In Foundations of Software Technology and Theoretical Computer Science, number 880 in Lecture Notes in Computer Science, pages 89–100. Springer-Verlag, 1994.

[LAK94b] Laks V.S. Lakshmanan and Nematollaah Shiri. Probabilistic deductive databases. In Int'l Logic Programming Symposium, pages 254–268, 1994.

[LAK01] Laks V.S. Lakshmanan and Nematollaah Shiri. A parametric approach to deductive databases with uncertainty. IEEE Transactions on Knowledge and Data Engineering, 13(4):554–570, 2001.

[LOY00] Y. Loyer, N. Spyratos, and D. Stamate. Integration of information in four-valued logics under non-uniform assumptions. In Proceedings of the 30th IEEE International Symposium on Multiple-Valued Logic (ISMVL 2000), pages 185–191, Portland, Oregon, May 2000. IEEE Press.

[LU96] James J. Lu. Logic programming with signs and annotations. Journal of Logic and Computation, 6(6):755–778, 1996.

[LUK98] Thomas Lukasiewicz. Probabilistic logic programming. In Proc. of the 13th European Conf. on Artificial Intelligence (ECAI-98), pages 388–392, Brighton, England, August 1998.

[NG91] Raymond Ng and V.S. Subrahmanian. Stable model semantics for probabilistic deductive databases. In Zbigniew W. Ras and Maria Zemankova, editors, Proc. of the 6th Int. Symposium on Methodologies for Intelligent Systems (ISMIS-91), number 542 in Lecture Notes in Artificial Intelligence, pages 163–171. Springer-Verlag, 1991.

[NG93a] Raymond Ng and V.S. Subrahmanian. Probabilistic logic programming. Information and Computation, 101(2):150–201, 1993.

[NG93b] Raymond Ng and V.S. Subrahmanian. A semantical framework for supporting subjective and conditional probabilities in deductive databases. Journal of Automated Reasoning, 10(3):191–235, 1993.

[VnG91] Allen Van Gelder, Kenneth A. Ross, and John S. Schlipf. The well-founded semantics for general logic programs. Journal of the ACM, 38(3):620–650, 1991.

[PRZ95] T. Przymusinski. Static semantics for normal and disjunctive logic programs. Annals of Mathematics and Artificial Intelligence, 14:323–357, 1995.

[PRZ90a] T. C. Przymusinski. Extended stable semantics for normal and disjunctive programs. In D. H. D. Warren and P. Szeredi, editors, Proceedings of the Seventh International Conference on Logic Programming, pages 459–477. MIT Press, 1990.

[PRZ90b] T. C. Przymusinski. Stationary semantics for disjunctive logic programs and deductive databases. In S. Debray and M. Hermenegildo, editors, Logic Programming, Proceedings of the 1990 North American Conference, pages 40–59. MIT Press, 1990.

[SHA83] Ehud Y. Shapiro. Logic programs with uncertainties: a tool for implementing rule-based systems. In Proc. of the 8th Int. Joint Conf. on Artificial Intelligence (IJCAI-83), pages 529–532, 1983.

[SUB87] V.S. Subrahmanian. On the semantics of quantitative logic programs. In Proc. 4th IEEE Symposium on Logic Programming, pages 173–182. Computer Society Press, 1987.

[VnE86] M.H. van Emden. Quantitative deduction and its fixpoint theory. Journal of Logic Programming, 3(1):37–53, 1986.

[WAG98] Gerd Wagner. Negation in fuzzy and possibilistic logic programs. In T. Martin and F. Arcelli, editors, Logic Programming and Soft Computing. Research Studies Press, 1998.

[WUE95] Beat Wüthrich. Probabilistic knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 7(5):691–698, 1995.

[YAB85] S. Yablo. Truth and reflection. Journal of Philosophical Logic, 14:297–349, 1985.

Session 10: Data mining


A Method for Computing Frequent Key and Closed Itemsets in One Phase

Viet Phan-Luong

Laboratoire d'Informatique Fondamentale de Marseille (LIF, UMR CNRS 6166)
C.M.I. de l'Université de Provence
39 rue F. Joliot Curie
13453 Marseille Cedex 13

[email protected]

ABSTRACT. Though closed itemsets and key itemsets are dual concepts which constitute an interesting solution to mining association rules, existing methods that compute closed itemsets do not compute key itemsets and vice-versa, or if they do, they do it in two phases: one for computing frequent key itemsets, and the other for computing frequent closed itemsets. In this paper, we present a method for computing frequent key itemsets with the following particularity: when all frequent key itemsets are discovered, all frequent closed itemsets are also discovered. The key itemsets and closed itemsets resulting from the method are represented in a structure that we call the closed keys representation of frequent itemsets. We show that determining all frequent itemsets, based on the closed keys representation, is just a search operation in the representation. In order to speed up the search, we adapt the Galois lattice to store the closed keys representation.

RÉSUMÉ. Closed itemsets and key itemsets are dual concepts that constitute an interesting solution to the problem of mining association rules. Yet existing methods that compute closed itemsets do not compute key itemsets, or if they do, they do it in two phases: the key itemsets are computed in a first phase, and their closures in a second phase. This paper presents a method for computing frequent key itemsets with the following particularity: when all frequent key itemsets have been computed, their closures are obtained as well. The result of the method is represented in a structure called the closed keys representation. Determining whether an itemset is frequent is simply a search operation in this structure. To speed up this operation, we adapt the Galois lattice structure to store the closed keys representation.

KEYWORDS: Data mining, frequent itemsets, association rules.

MOTS-CLÉS: Knowledge discovery, frequent itemsets, association rules.


1. Introduction

In data mining, a transaction is an identified set of items (attribute values). A dataset is a finite set of transactions. Given a dataset, the aim of mining association rules is to find out the strong relationships between items. For example, the market basket data analysis [AGR 93] discovered that a large proportion of customers who bought cereals and sugar also bought milk. Mining association rules has applications on very large databases, such as marketing/sales-products, finance-stock market, medicine, geographic information, scientific discovery, etc. Each of those applications can involve hundreds of items. The number of association rules resulting from each of those applications can be exponential with respect to the number of items. Such a tremendous number of association rules is a real problem for end-users trying to apprehend the mining result. The concepts of concise representations of association rules [PAS 99a, BAS 00a, ZAK 00, KRY 01a, PHA 01a] are interesting solutions to this problem. Roughly speaking, a concise representation of association rules is a small set of association rules from which we can infer all interesting association rules. Among the concepts of concise representations of association rules, the representative basis [KRY 01a, PHA 01a, PHA 01b] is an interesting concept based on the concepts of closed itemsets [PAS 99a, ZAK 99] and key itemsets [BAS 00b]. Intuitively, an itemset X is closed if, letting O' be the set of all transactions that contain X, every itemset shared by all the transactions in O' is included in X. Dually, an itemset X is a key if no itemset strictly included in X is shared by exactly the same set of transactions. It is shown that for any association rule r in the representative basis, the left-hand side of r is a key itemset and the right-hand side of r is a closed itemset. However, existing methods that compute frequent closed itemsets do not compute frequent key itemsets and vice-versa, or if they do, they do it in two phases: one for computing frequent key itemsets, and the other for computing frequent closed itemsets.

In this paper, we present a method for computing frequent key itemsets with the following particularity: when all frequent key itemsets are discovered, all frequent closed itemsets are also discovered. The key itemsets and closed itemsets resulting from the method are represented in a structure that we call the closed keys representation. Determining frequent key itemsets and frequent closed itemsets on this structure is immediate. Determining all frequent itemsets, based on the closed keys representation, is just a search operation in the representation. In order to speed up the search, we adapt the Galois lattice to store the closed keys representation. Such a storage structure can be used to efficiently compute the representative association rules.

The paper is organized as follows. In Section 2 we recall the main concepts on frequent itemsets, in particular the concepts of closed itemsets and key itemsets. In Section 3 we present our approach, which begins with the presentation of properties concerning the inference of supports of itemsets. Next, we present the concept of the closed keys representation. Then we show how to determine effectively the frequent itemsets, with their supports, based on the representation. At the end of Section 3, we present a lattice structure for the closed keys representation that is useful to speed up the inference of frequent itemsets. Section 4 is devoted to the algorithm for searching the closed keys representation. It also includes the proofs of the correctness and completeness of the algorithm. Detailed comparisons with existing approaches are in Section 5. Finally, remarks and conclusions are in Section 6.

2. Preliminaries

We consider a dataset which is a triple D = (O, I, R), where O and I are finite non-empty sets. In particular, an element of I is called an item, an element of O is called an object or a transaction, and R is a binary relation on O and I. A pair (o, i) ∈ R represents the fact that the object (transaction) o has the item i. A subset X ⊆ I is called an itemset of D. An itemset consisting of k items is called a k-itemset. The Galois connection [GAN 99] between 2^O and 2^I is a pair of functions (f, g), where g(X) = {o ∈ O | ∀i ∈ X, (o, i) ∈ R} and f(P) = {i ∈ I | ∀o ∈ P, (o, i) ∈ R}. Intuitively, g(X) is the set of all objects in O that share in common all the items in X and, dually, f(P) is the set of all items that the objects in P share in common. The functions f and g can be extended with g(∅) = O and f(∅) = I. The function g is antimonotonic: for all X1, X2 ⊆ I, if X1 ⊆ X2 then g(X2) ⊆ g(X1). The function f is also antimonotonic. The Galois closure operators are the functions h = f ∘ g and h′ = g ∘ f, where ∘ denotes the composition of functions. h and h′ are monotonic. Given an itemset X, h(X) = f(g(X)) is called the closure of X. Indeed, for all X ⊆ I, we have X ⊆ h(X) (extension) and h(h(X)) = h(X) (idempotency). An itemset X is said to be closed if X = h(X). An itemset X is called a key itemset (or generator) if for every itemset Y ⊆ X, h(Y) = h(X) implies Y = X. That is, X is a key itemset if there is no itemset strictly included in X and having the same closure as X.

The support of an itemset X, denoted by sup(X), is sup(X) = card(g(X)) / card(O), where card(S) denotes the cardinality of a set S. Given a support threshold, denoted by minsup, 0 < minsup ≤ 1, an itemset X is said to be frequent if sup(X) ≥ minsup. An itemset X is called a frequent key (or closed) itemset if X is frequent and X is a key (respectively closed) itemset.

An association rule [AGR 93] is an expression of the form X1 → X2, where X1, X2 ⊆ I. Let r be an association rule, denoted by r : X1 → X2. Its support and confidence, denoted by sup(r) and conf(r) respectively, are sup(r) = sup(X1 ∪ X2) and conf(r) = sup(r) / sup(X1). A general form of association rules, called disjunctive association rules [BYK 01], is r : X → X1 ∨ X2, where X, X1, X2 ⊆ I, with support sup(r) = sup(X ∪ X1) + sup(X ∪ X2) − sup(X ∪ X1 ∪ X2) and confidence conf(r) = sup(r) / sup(X). Let r be an association rule (normal or disjunctive). r is said to be exact (certain) if conf(r) = 1; otherwise, r is said to be approximate.

Example 1 Consider the dataset represented in Table a. The itemsets of the dataset are classified with respect to their supports in Table b.

Figure 1.

Table a:                 Table b:

O | Items                Sup | Itemsets
1 | A D                  1/5 | D, AD
2 | B E                  2/5 | AB, BC, ABC, ABE, BCE, ABCE
3 | A B C E              3/5 | B, C, AC, AE, BE, CE, ACE
4 | A C E                4/5 | A, E
5 | A B C E              5/5 | ∅

In short, an itemset is denoted by the juxtaposition of its items, and so is the union of itemsets when no confusion is possible. For example, AD denotes the itemset {A, D}, and X1X2 denotes X1 ∪ X2, where X1, X2 are itemsets.

Let minsup = 2/5. With Table b, ∅, A, E, BE, ACE and ABCE are the frequent closed itemsets of D. The frequent key itemsets of D are ∅, A, B, C, E, AB, AE and BC.
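These sets can be checked mechanically; the following small script (ours, not part of the paper) computes g, the closure h = f ∘ g, supports, and the frequent keys and closed itemsets of Table a:

    from itertools import combinations

    # Table a: transaction id -> items.
    data = {1: "AD", 2: "BE", 3: "ABCE", 4: "ACE", 5: "ABCE"}
    items = sorted(set("".join(data.values())))

    def g(X):          # objects containing all items of X
        return {o for o, t in data.items() if set(X) <= set(t)}

    def h(X):          # closure: items common to all objects of g(X)
        objs = g(X)
        return frozenset(i for i in items if all(i in data[o] for o in objs))

    minsup = 2 / 5
    frequent = [frozenset(X) for n in range(len(items) + 1)
                for X in combinations(items, n) if len(g(X)) / len(data) >= minsup]
    closed = sorted("".join(sorted(X)) for X in frequent if h(X) == X)
    keys = sorted("".join(sorted(X)) for X in frequent
                  if not any(h(frozenset(Y)) == h(X)
                             for n in range(len(X)) for Y in combinations(X, n)))
    print(closed)   # ['', 'A', 'ABCE', 'ACE', 'BE', 'E']
    print(keys)     # ['', 'A', 'AB', 'AE', 'B', 'BC', 'C', 'E']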

3. Closed keys representation

3.1. Support Inferences

We start this section by providing the basis for support inferences. Then we define the concept of the closed keys representation, and show how frequent itemsets can be determined on the representation. Some results are similar to the results in [BAS 00b] and [KRY 01b]. However, we shall show how our results are distinct from those in [BAS 00b] and [KRY 01b].

Lemma 1 Let X, Y be itemsets. Then g(X ∪ Y) = g(X) ∩ g(Y).

Proposition 1 Let X1, X2 be itemsets such that X1 ⊆ X2. If g(X1) = g(X2) then for every Y ⊇ X2, g(Y) = g(X1 ∪ (Y \ X2)).

Proof. Suppose that X1 ⊆ X2 and g(X1) = g(X2). Let Y be an itemset such that Y ⊇ X2. Y can be represented as X2 ∪ (X1 ∪ (Y \ X2)). Therefore, by Lemma 1, g(Y) = g(X2) ∩ g(X1 ∪ (Y \ X2)). As g(X2) = g(X1), we have g(Y) = g(X1) ∩ g(X1 ∪ (Y \ X2)). Now, as g(X1 ∪ (Y \ X2)) ⊆ g(X1), we have g(Y) = g(X1 ∪ (Y \ X2)).

In [BAS 00b], it was shown that g(Y) = g(Y \ (X2 \ X1)). We can see that X1 ∪ (Y \ X2) = Y \ (X2 \ X1). However, X1 ∪ (Y \ X2) is computationally more efficient than Y \ (X2 \ X1), because under the condition X1 ⊆ X2 we have X1 ∩ (Y \ X2) = ∅, so the union of X1 and Y \ X2 can be implemented by simple insertion.

The following corollaries are direct consequences of Proposition 1.

Corollary 1 Let X1, X2 be itemsets such that X1 ⊂ X2. If sup(X1) = sup(X2) then for every Y ⊇ X2, sup(Y) = sup(X1 ∪ (Y \ X2)). Thus, Y is not a key itemset, and X1 ∪ (Y \ X2) is not closed.

In particular, if X1 = ∅ then for every X2, Y such that X2 ⊆ Y and sup(X2) = sup(∅), we have sup(Y) = sup(Y \ X2).

Corollary 2 Let X be an itemset. If X is not a key, then for every Y ⊇ X, Y is not a key.

An application of Corollary 2: if we are only interested in key itemsets, then in an incremental method for generating key itemsets we can discard non-key itemsets.

Definition 1 An itemset X is called a maximal key itemset included in an itemset Z if X is a key itemset and there is no key itemset Y such that X ⊂ Y and Y ⊆ Z.

In the above definition, X is required to be a key itemset, but sup(X) is not necessarily equal to sup(Z). For instance, with Example 1, A is a maximal key included in AD, but sup(A) ≠ sup(AD). However, we have the following property.

Lemma 2 If Z is a non-key itemset, then any key itemset X included in Z such that sup(X) = sup(Z) is a maximal key itemset included in Z.

Proposition 2 Let Z be an itemset. Let X1, X2, ..., Xm be the maximal key itemsets included in Z. Then the support of Z is: sup(Z) = min{sup(Xj) | 1 ≤ j ≤ m}.

Proof. We have sup(Z) ≤ min{sup(Xj) | 1 ≤ j ≤ m}, since each Xj is included in Z. If Z is a key itemset, then Z is the unique maximal key included in itself, and then sup(Z) = min{sup(Xj) | 1 ≤ j ≤ m}, where m = 1. Otherwise, Z is not a key, and in this case there exists a key itemset X such that X ⊂ Z and sup(X) = sup(Z). Such a key itemset is a maximal key itemset included in Z (Lemma 2). Hence, min{sup(Xj) | 1 ≤ j ≤ m} ≤ sup(X) = sup(Z). Thus, sup(Z) = min{sup(Xj) | 1 ≤ j ≤ m}.

In [BAS 00b], it was shown that if Z is a non-key k-itemset, k ≥ 2, then the support of Z is sup(Z) = min{sup(Y) | Y ⊂ Z, Y a (k−1)-itemset}. Proposition 2 is distinct from this result on two points: (i) Proposition 2 is valid not only for non-key itemsets, but also for key itemsets, and (ii) in order to compute the support of Z, we do not need to consider all (k−1)-itemsets included in Z, but only the maximal key itemsets included in Z.

Proposition 2 means that if we know the supports of all key itemsets, then we can infer the support of any other itemset. However, if we are only interested in determining the frequent itemsets with their supports, then we can rely on the frequent key itemsets only.

Theorem 1 Let Z be an itemset. If there exists a frequent key itemset X such that X ⊆ Z ⊆ h(X), then sup(Z) = sup(X). Otherwise, Z is not frequent.

Proof. If there exists a frequent key itemset X such that X ⊆ Z and Z ⊆ h(X), then sup(Z) ≤ sup(X) and sup(h(X)) ≤ sup(Z). As sup(X) = sup(h(X)), we have sup(Z) = sup(X) = sup(h(X)). Otherwise, by contradiction, suppose that Z is frequent. Either Z is a key or Z is not a key. In the former case the contradiction is immediate, because Z is then a frequent key itemset included in itself with Z ⊆ h(Z). In the latter case, there exists a key itemset X ⊂ Z such that sup(X) = sup(Z). Therefore g(X) = g(Z), and then h(X) = f(g(X)) = f(g(Z)) = h(Z). Hence, Z ⊆ h(Z) = h(X). As Z is frequent, X is frequent. Thus, X is a frequent key itemset with X ⊆ Z ⊆ h(X). Contradiction.

3.2. The closed keys representation

Definition 2 Let D = (O, I, R) be a dataset and let minsup be the support threshold. The closed keys representation of D with respect to minsup, denoted by CK(D, minsup), is the set

{(X, h(X), sup(X)) | X is a key itemset of D and sup(X) ≥ minsup}.

By Theorem 1, for any itemset X, based on the closed keys representation it is straightforward to determine whether X is frequent, and if so, its support is already available, without any computation.

Example 1 (continued) The closed keys representation of the dataset in Table a, with respect to minsup = 2/5, consists of (∅, ∅, 1), (A, A, 4/5), (B, BE, 3/5), (C, ACE, 3/5), (E, E, 4/5), (AB, ABCE, 2/5), (AE, ACE, 3/5), (BC, ABCE, 2/5).
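A direct reading of Theorem 1 as code (our sketch, with supports written as decimals) over these eight triples:

    # Closed keys representation of Table a: (key, closure, support) triples.
    CK = [("", "", 1.0), ("A", "A", 0.8), ("B", "BE", 0.6), ("C", "ACE", 0.6),
          ("E", "E", 0.8), ("AB", "ABCE", 0.4), ("AE", "ACE", 0.6), ("BC", "ABCE", 0.4)]

    def support(Z):
        """Theorem 1: sup(Z) = sup(X) for any key X with X <= Z <= h(X), else infrequent."""
        for key, closure, sup in CK:
            if set(key) <= set(Z) <= set(closure):
                return sup
        return None           # Z is not frequent

    print(support("AC"))      # 0.6  (C <= AC <= ACE)
    print(support("BCE"))     # 0.4  (BC <= BCE <= ABCE)
    print(support("AD"))      # None (not frequent for minsup = 2/5)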

Based on the closed keys representation, determining whether an itemset X is frequent consists in searching for a triple (Y, h(Y), sup(Y)) in the representation such that Y ⊆ X ⊆ h(Y). We now show, through Propositions 3, 4 and 5, that this is equivalent to searching for the smallest closed itemset that contains X.

Proposition 3 Let X be an itemset. If there exists a triple (Y, h(Y), sup(Y)) in the closed keys representation such that Y ⊆ X ⊆ h(Y), then for every triple (Y′, h(Y′), sup(Y′)) in the closed keys representation such that Y′ ⊆ X ⊆ h(Y′), we have h(Y′) = h(Y).

Proposition 3 states the unicity of the closed itemset h(Y). Notice that there may exist key itemsets Y ≠ Y′ such that Y ⊆ X and Y′ ⊆ X and h(Y) = h(Y′).

Proposition 4 Let X be an itemset. If there exists a triple (Y, h(Y), sup(Y)) in the closed keys representation such that Y ⊆ X ⊆ h(Y), then for every closed itemset Z such that X ⊆ Z, we have h(Y) ⊆ Z.

Proof. By the monotonicity and the idempotency of h, we have h(Y) ⊆ h(X) ⊆ h(h(Y)) = h(Y). Therefore, h(X) = h(Y). Moreover, from X ⊆ Z, we have h(X) ⊆ h(Z) = Z. Thus, h(Y) ⊆ Z.

Proposition 5 Let X be an itemset. If Z is the smallest closed itemset such that X ⊆ Z, then there exists a key Y such that Y ⊆ X and h(Y) = Z.

Proof. Assume that Z is the smallest closed itemset such that X ⊆ Z. We have X ⊆ h(X) ⊆ h(Z) = Z and, as Z is the smallest closed itemset containing X, h(X) = Z. If X is a key, then take Y = X, and the proof is done. Otherwise, there exists a key Y ⊂ X such that h(Y) = h(X). Thus, there exists a key Y such that Y ⊆ X and h(Y) = Z.

3.3. Lattice structure for determining frequent itemsets

We show how the lattice structure can speed up the search in the closed keys representation. Let (L, ≤) be an ordered set, where ≤ is a partial order on L (≤ is reflexive, antisymmetric and transitive). Let S ⊆ L. An upper bound (resp. lower bound) of S is an element u ∈ L (resp. l ∈ L) such that for all p ∈ S, p ≤ u (resp. l ≤ p). The least upper bound of S, denoted by ∨S, is called the join of S. The greatest lower bound of S, denoted by ∧S, is called the meet of S. (L, ≤) is a ∨-semi-lattice if for any two elements x, y in L, the join of {x, y} always exists (denoted by x ∨ y). (L, ≤) is a ∧-semi-lattice if for any two elements x, y in L, the meet of {x, y} always exists (denoted by x ∧ y). (L, ≤) is a lattice if it is a ∨-semi-lattice and a ∧-semi-lattice. (L, ≤) is a complete lattice if ∨S and ∧S exist for all S ⊆ L. Any finite lattice is complete. The lattice structure has many applications, particularly in formal concept analysis [GAN 99], where the concept of Galois lattice is defined.

Definition 3 Let X be a closed itemset. The pair (X, g(X)) is called a concept. A concept (X1, P1) is a subconcept of (X2, P2), denoted as (X1, P1) ≤ (X2, P2), iff X1 ⊆ X2 (or iff P2 ⊆ P1). (X2, P2) is called a super-concept of (X1, P1). A concept (X, P) is frequent if sup(X) ≥ minsup.

Let C be the set of all possible concepts in a dataset. The ordered set (C, ≤) is a complete lattice, called the Galois lattice, where the join and meet operators are defined as follows:

join: (Z1, P1) ∨ (Z2, P2) = (h(Z1 ∪ Z2), P1 ∩ P2);

meet: (Z1, P1) ∧ (Z2, P2) = (Z1 ∩ Z2, h′(P1 ∪ P2)).

Let FC be the set of all frequent concepts in the dataset. As sup(Z1 ∩ Z2) ≥ sup(Z1), (FC, ≤) is a ∧-semi-lattice.

Figure 2. [Lattice diagram of the closed keys representation of the dataset in Table a; each node carries a closed itemset X, the list k(X) of its keys, and the link list l(X).]

Using the lattice diagram of the Galois lattice to represent the closed keys representation, we relabel each node (X, g(X)) by (X, k(X), l(X)), where k(X) is the list of all key itemsets Y such that h(Y) = X, and l(X) is the list of the indexes of the nodes that are adjacent to the node X. The list l(X) is built as follows. We assume that the set of all items in the dataset is sorted in some order. Let n be the number of all items that occur in the dataset. Each itemset X can be represented by a bit-vector of length n, where a 1 (or 0) at position i, 1 ≤ i ≤ n, indicates that the item at rank i is present (respectively, absent) in X. With such a representation, each itemset corresponds to an integer or a character string. The list l(X) consists of the integers (or strings) that correspond to the subconcepts of (X, g(X)) that are adjacent to (X, g(X)). The list l(X) is sorted in descending order. In the implementation, we use the elements in the list l(X) as indexes to link to the closed itemsets corresponding to the subconcepts of the concept represented by X. For example, Figure 2 represents the closed keys representation of the dataset in Table a of Example 1. The closed itemsets ACE and BE are represented by 10101 (21) and 01001 (9), respectively. The numbers 21 and 9 in l(ABCE) are used to link to ACE and BE, respectively.

The set of all frequent closed itemsets is represented in a similar way, which corresponds to the ∧-semi-lattice diagram of the frequent concepts. Determining whether an itemset X is frequent is done as follows. The itemset X is mapped to a bit-vector. We begin by determining which closed itemsets, at the tops of the semi-lattice diagram, include X. If no such closed itemset exists, then X is not frequent. Otherwise, let Y be a closed itemset at the tops that includes X. The bit-vector of X is translated to a number (or a character string), let us call it n(X). We search the list l(Y), in descending order, for a number whose bit-vector has 1 at all positions of the items in X. If such a number exists, then using this number as index we search down the semi-lattice diagram for the next closed itemset, and so on. Otherwise, if the current number in the search list l(Y) is less than n(X), then we return sup(X) = sup(Y).

For example, let us determine whether AC is frequent with respect to minsup = 2/5. The bit-vector of AC is 10100, which corresponds to the number 20. AC is included in ABCE, which is at the top of the diagram. Searching the list l(ABCE) in descending order, we find 21 ≥ 20, and the bit-vector of 21 has 1 at all positions of the items in AC. Using 21 as index, we go down to the node ACE. As there exists no number in l(ACE) that is greater than or equal to 20, the search stops at the first number of l(ACE) and returns sup(AC) = sup(ACE).

As another example, consider X = BCE. The bit-vector of BCE is 01101, which corresponds to the number 13. The first number in l(ABCE) is 21, whose bit-vector does not have 1 at the position of B. The next number is 9 < 13. Thus, the search stops and returns sup(BCE) = sup(ABCE).

Let m be the average number of nodes adjacent to a node in the diagram, and H the height of the ∧-semi-lattice diagram. The mean complexity of determining whether an itemset is frequent can thus be estimated by H·m/2. This shows that the closed keys representation is very efficient for the inference of frequent itemsets, and of frequent key and closed itemsets.

4. Generating the closed keys representation

We propose FClosedKeys (see 4.2), an incremental algorithm in the style of Apriori [AGR 94] and Pascal [BAS 00b], to generate the closed keys representation. We start with the empty itemset, which is of course a key, with support 1. Then we compute the supports of all 1-itemsets. For each 1-itemset X, if sup(X) = 1, then X is marked as a non-key itemset. The other 1-itemsets are marked as key itemsets. Let B∅ be the union of all non-key 1-itemsets. We can see that B∅ is the closure of the empty set. A triple (∅, B∅, 1) is added to the closed keys representation. Next, we consider the 2-itemsets built on the frequent 1-itemsets by GenCandidate (see 4.1), if the set of key 1-itemsets is not empty. And so on: we consider the k-itemsets built on the frequent (k−1)-itemsets, until the set of key (k−1)-itemsets is empty.

4.1. Generating key itemset candidates

Let U and V be frequent (i−1)-itemsets such that U \ V and V \ U are singletons and U \ V < V \ U in lexicographic order. Then U ∪ V is an i-itemset candidate. We have the following cases:

(a) At least one of U and V is not a key. Suppose that U is not a key. Then U ∪ V (denoted by UV, in short) is not a key (Corollary 2). As U is not a key but U is frequent, there exists a frequent (i−2)-itemset W such that W ⊂ U and sup(W) = sup(U). By Corollary 1, sup(UV) = sup(W ∪ (UV \ U)) = sup(W ∪ (V \ U)). As W is an (i−2)-itemset and V \ U is disjoint from W, W ∪ (V \ U) is an (i−1)-itemset. Therefore, we search for W ∪ (V \ U) among the frequent (i−1)-itemsets. If it exists, then UV is frequent, and sup(UV) = sup(W ∪ (V \ U)). The itemset W ∪ (V \ U) is linked to UV, for use in the next step. Otherwise, UV is not frequent. The case where V is not a key is similar.

(b) Both U and V are keys. There are two subcases:

(b.1) There exists a frequent non-key (i−1)-itemset W such that W ⊂ UV. In this case, UV is not a key (Corollary 2). As W is a frequent non-key itemset, there exists a frequent (i−2)-itemset W′ such that W′ ⊂ W and sup(W′) = sup(W), and sup(UV) = sup(W′ ∪ (UV \ W)). It is easy to verify that W′ ∪ (UV \ W) is an (i−1)-itemset. Therefore, by searching among the frequent (i−1)-itemsets previously discovered, we can see whether UV is frequent, and if so, sup(UV) = sup(W′ ∪ (UV \ W)). The itemset W′ ∪ (UV \ W) is linked to UV, for use in the next step. Otherwise, UV is infrequent.

(b.2) Else, if all (i−1)-itemsets which are subsets of UV are frequent, then UV is a candidate. Otherwise, UV is not frequent, and is deleted from the list of i-itemset candidates.

The frequent i-itemsets generated in points (a) and (b.1) are not key itemsets. They are stored in a list denoted by Li. The i-itemsets generated in point (b.2) are stored in a list denoted by Ki. The support of each i-itemset in the list Ki is then computed. If X is not frequent, then X is deleted from Ki. Otherwise, X will be checked to see whether it is a key itemset.

Notations: with each i-itemset X, the following fields are associated:

– X.key: to indicate whether X is a key itemset, and

– X.prev: to represent an (i−1)-itemset X_{i−1} such that X_{i−1} ⊂ X and sup(X) = sup(X_{i−1}), if such an itemset exists.

4.2. Generating the closed keys representation

In step i, with i ≥ 2, GenCandidate is called with inputs L_{i−1} and K_{i−1}, and results in L_i and K_i. The itemsets in L_i are frequent non-key itemsets. If K_i is not empty, then we access the dataset to compute the supports of the itemsets in K_i, and keep only those which are frequent. Then we check each itemset in K_i to see whether there exists a key itemset X_{i−1} ⊂ X_i such that sup(X_i) = sup(X_{i−1}). For this, we consider each frequent key itemset X_{i−1}. For each X_i ∈ K_i ∪ L_i, if X_{i−1} ⊂ X_i and sup(X_i) = sup(X_{i−1}), then we add X_i \ X_{i−1} to B_{X_{i−1}}, which is initially set to empty for each frequent key itemset X_{i−1}, and we set X_i.prev = X_{i−1} if X_i ∈ K_i. In such a case, clearly X_i is not a key itemset. We shall show that the union of X_{i−1} and the final value of B_{X_{i−1}}, once all X_i ∈ K_i ∪ L_i have been considered, is the closure of X_{i−1}.

Example 1 (continued) We compute the closed keys representation of the dataset in Table a, with respect to minsup = 2/5. Reading the dataset gives the supports (A, 4/5), (B, 3/5), (C, 3/5), (D, 1/5), (E, 4/5); as D is not frequent, L1 = K1 = {(A, 4/5), (B, 3/5), (C, 3/5), (E, 4/5)}; B∅ = ∅ and CK0 = {(∅, ∅, 1)}.

GenCandidate(F_1, C_1) yields:


Figure 3. ALGORITHM GenCandidate
Input: F_{k-1}, C_{k-1}: lists of frequent (k-1)-itemsets.
Output: F_k: list of frequent k-itemsets; C_k: list of k-itemsets, candidates for frequent keys.
Method:
F_k := ∅; C_k := ∅;
For each pair of itemsets (X, supp(X)), (Y, supp(Y)) ∈ F_{k-1} ∪ C_{k-1},
such that X \ Y and Y \ X are singletons and X \ Y < Y \ X in lexicographic order, do
  If not X.key then begin
    Z := X.prev;
    // Z is a (k-2)-itemset: Z ⊂ X and supp(Z) = supp(X)
    If Z ∪ (Y \ X) is a frequent (k-1)-itemset then begin
      (X ∪ Y).key := false; supp(X ∪ Y) := supp(Z ∪ (Y \ X));
      (X ∪ Y).prev := Z ∪ (Y \ X); insert (X ∪ Y, supp(X ∪ Y)) in F_k
    end
  End
  Else If not Y.key then begin
    Z := Y.prev;
    // Z is a (k-2)-itemset: Z ⊂ Y and supp(Z) = supp(Y)
    If Z ∪ (X \ Y) is a frequent (k-1)-itemset then begin
      (X ∪ Y).key := false; supp(X ∪ Y) := supp(Z ∪ (X \ Y));
      (X ∪ Y).prev := Z ∪ (X \ Y); insert (X ∪ Y, supp(X ∪ Y)) in F_k
    end
  End
  Else If there exists a frequent non-key (k-1)-itemset Z such that Z ⊂ X ∪ Y
  then begin
    T := Z.prev;
    // T is a (k-2)-itemset: T ⊂ Z and supp(T) = supp(Z)
    If T ∪ ((X ∪ Y) \ Z) is a frequent (k-1)-itemset then begin
      (X ∪ Y).key := false; supp(X ∪ Y) := supp(T ∪ ((X ∪ Y) \ Z));
      (X ∪ Y).prev := T ∪ ((X ∪ Y) \ Z); insert (X ∪ Y, supp(X ∪ Y)) in F_k
    end
  End
  Else If all (k-1)-itemsets which are subsets of X ∪ Y are frequent then begin
    supp(X ∪ Y) := 0; // actual support computed later by a dataset pass
    insert (X ∪ Y, supp(X ∪ Y)) into C_k
  end
// C_k: list of k-itemset candidates
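To make the case analysis of Section 4.1 executable, here is a minimal Python sketch (ours, not the authors' pseudocode; names are our own choices). Itemsets are frozensets, and each (k-1)-itemset of F_{k-1} ∪ C_{k-1} maps to a record with the supp, key and prev fields defined above:

from itertools import combinations

def gen_candidate(level_prev):
    # level_prev: dict mapping each frozenset in F_{k-1} U C_{k-1} to a record
    # {'supp': float, 'key': bool, 'prev': frozenset or None}.
    f_k, c_k = {}, {}
    for x, y in combinations(sorted(level_prev, key=sorted), 2):
        if len(x - y) != 1 or len(y - x) != 1:
            continue                            # X\Y and Y\X must be singletons
        xy = x | y
        if xy in f_k or xy in c_k:
            continue
        # Case (a): X (or Y) is not a key; infer supp(X∪Y) through its prev field.
        nonkey = x if not level_prev[x]['key'] else (
                 y if not level_prev[y]['key'] else None)
        if nonkey is None:
            # Case (b.1): a frequent non-key (k-1)-subset Z of X∪Y, if any.
            nonkey = next((z for z in level_prev
                           if not level_prev[z]['key'] and z < xy), None)
        if nonkey is not None:
            z = level_prev[nonkey]['prev']      # supp(z) = supp(nonkey)
            probe = z | (xy - nonkey)           # a (k-1)-itemset
            if probe in level_prev:             # frequent => X∪Y frequent
                f_k[xy] = {'supp': level_prev[probe]['supp'],
                           'key': False, 'prev': probe}
            continue                            # else X∪Y is infrequent
        # Case (b.2): all (k-1)-subsets must be frequent (and they are keys here).
        if all((xy - {i}) in level_prev for i in xy):
            c_k[xy] = {'supp': None, 'key': True, 'prev': None}
    return f_k, c_k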

This yields C_2, the 2-itemset candidates built from pairs of frequent key 1-itemsets, and F_2 = ∅. One pass over the dataset then computes the actual support of each itemset in C_2.

For each key itemset in C_1:


Figure 4. ALGORITHM FClosedKeys
Input: A dataset D, and a support threshold minsup.
Output: The closed keys representation of D, with respect to minsup.
Method:
∅.key := true; F_0 := {(∅, 1)}; B_0 := ∅; C_1 := ∅;
Read the dataset to compute the support of each 1-itemset;
F_1 := {(x_1, supp(x_1)) | supp(x_1) ≥ minsup};
For each (x_1, supp(x_1)) ∈ F_1 do
  If supp(x_1) = 1 then begin B_0 := B_0 ∪ x_1; x_1.prev := ∅; x_1.key := false end
  else begin x_1.key := true; C_1 := C_1 ∪ {(x_1, supp(x_1))} end;
R_0 := {(∅, B_0, 1)};
k := 2; R := R_0; R_{k-1} := ∅;
While C_{k-1} ≠ ∅ do begin
  GenCandidate(F_{k-1}, C_{k-1}); // Results in F_k and C_k.
  // F_k: set of frequent non-key k-itemsets; C_k: set of k-itemset candidates.
  If C_k ≠ ∅ then begin
    Read the dataset to compute the support of each k-itemset in C_k;
    Delete from C_k all itemsets X_k such that supp(X_k) < minsup;
    For each (X_k, supp(X_k)) ∈ C_k do X_k.key := true;
  end;
  For each (X_{k-1}, supp(X_{k-1})) ∈ C_{k-1} do begin
    B_{k-1} := ∅;
    For each (X_k, supp(X_k)) ∈ C_k do
      If X_{k-1} ⊂ X_k and supp(X_{k-1}) = supp(X_k) then begin
        B_{k-1} := B_{k-1} ∪ (X_k \ X_{k-1});
        X_k.prev := X_{k-1}; X_k.key := false
      end;
    For each (X_k, supp(X_k)) ∈ F_k do
      If X_{k-1} ⊂ X_k and supp(X_{k-1}) = supp(X_k)
      then B_{k-1} := B_{k-1} ∪ (X_k \ X_{k-1});
    R_{k-1} := R_{k-1} ∪ {(X_{k-1}, B_{k-1}, supp(X_{k-1}))};
  end;
  R := R ∪ R_{k-1}; C_k := {(X_k, supp(X_k)) ∈ C_k | X_k.key = true};
  k := k + 1; R_{k-1} := ∅
end;
Return(R).
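A sketch of this level-wise driver under the same assumptions (transactions are sets of items; gen_candidate is the sketch given after Figure 3); it returns the triples (K, cl(K), supp(K)) of the representation:

def supp(itemset, dataset):
    # Relative support: fraction of transactions containing the itemset.
    return sum(itemset <= t for t in dataset) / len(dataset)

def fclosed_keys(dataset, minsup):
    items = sorted(set().union(*dataset))
    result, b0, level = [], set(), {}
    for i in items:                             # first dataset pass
        s = supp(frozenset([i]), dataset)
        if s < minsup:
            continue
        if s == 1.0:                            # item in every transaction: cl(∅)
            b0.add(i)
            level[frozenset([i])] = {'supp': s, 'key': False, 'prev': frozenset()}
        else:
            level[frozenset([i])] = {'supp': s, 'key': True, 'prev': None}
    result.append((frozenset(), frozenset(b0), 1.0))    # R_0 = {(∅, B_0, 1)}
    keys = {x: r for x, r in level.items() if r['key']}
    while keys:                                 # while C_{k-1} is not empty
        f_k, c_k = gen_candidate(level)
        for x in list(c_k):                     # one dataset pass for C_k
            s = supp(x, dataset)
            if s < minsup:
                del c_k[x]
            else:
                c_k[x]['supp'] = s
        nxt = dict(f_k); nxt.update(c_k)
        for key, rk in keys.items():            # assemble the closure of each key
            b = set()
            for x, rx in nxt.items():
                if key < x and rx['supp'] == rk['supp']:
                    b |= x - key
                    if x in c_k:                # candidate proved non-key
                        rx['key'], rx['prev'] = False, key
            result.append((key, key | frozenset(b), rk['supp']))
        level = nxt
        keys = {x: r for x, r in level.items() if r['key']}
    return result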

For the first key 1-itemset, B_1 = ∅. For the second, one 2-itemset of C_2 has the same support: its extra item is added to B_1, its prev field is set, and it is not a key itemset. For the third, two 2-itemsets of C_2 have the same support: their extra items are added to B_1, their prev fields are set, and they are not key itemsets. For the fourth, B_1 = ∅.

Hence R_1 consists of four triples (X_1, B_1, supp(X_1)), one per frequent key 1-itemset, and C_2 retains the three 2-itemsets still marked as keys, with their supports.


GenCandidate(F_2, C_2) yields:

F_3 = {the four frequent non-key 3-itemsets, with their supports} and C_3 = ∅. For each key itemset in C_2:

For the first key 2-itemset, B_2 receives the extra items of two 3-itemsets of F_3 having the same support; for the second, of one; for the third, of two. The corresponding 3-itemsets are not key itemsets.

Hence R_2 consists of the three triples (X_2, B_2, supp(X_2)) for the frequent key 2-itemsets. Comments: In this step, C_3 is empty: every generated 3-itemset either is built from a 2-itemset that is not a key, or contains a frequent non-key 2-itemset; in each case GenCandidate infers its support from a 2-itemset of equal support, sets its prev field, and places it in F_3 rather than among the candidates.

As C_3 = ∅, R_3 is also empty and the algorithm stops. The closed keys representation of the dataset, with respect to minsup = 2/6, is

R = R_0 ∪ R_1 ∪ R_2, the union of the triples computed in the steps above.

Proposition 6. In FClosedKeys, for an itemset X_k ∈ C_k, if X_k.key is not set to false, then X_k is a frequent key itemset.

Proof. An itemset X_k ∈ C_k is generated from two key (k-1)-itemsets X and Y (X_k = X ∪ Y) by GenCandidate when no frequent non-key (k-1)-itemset is included in X ∪ Y, and all (k-1)-itemsets included in X ∪ Y are frequent key itemsets. The latter are the maximal key itemsets (strictly) included in X ∪ Y. If FClosedKeys does not set X_k.key to false, then there is no maximal key itemset X_{k-1} included in X_k such that supp(X_k) = supp(X_{k-1}). Following Lemma 2, X_k is a key itemset. Clearly, it is frequent.

Proposition 7. FClosedKeys finds all frequent key itemsets.

Proof. Each frequent key k-itemset X_k can only appear in C_k and not in F_k. Moreover, C_k is the set of all candidates X_k such that no frequent non-key (k-1)-itemset is included in X_k and all (k-1)-itemsets included in X_k are frequent key itemsets (see GenCandidate). If X_k is actually a key, then there exists no frequent key (k-1)-itemset X_{k-1} such that X_{k-1} ⊂ X_k and supp(X_{k-1}) = supp(X_k). If so, FClosedKeys never sets X_k.key to false. Hence, (X_k, supp(X_k)) is kept in C_k.

Proposition 8. For each triple (X_{k-1}, B_{k-1}, supp(X_{k-1})) ∈ R_{k-1}, X_{k-1} is a frequent key itemset and

cl(X_{k-1}) = X_{k-1} ∪ B_{k-1}.


Proof. By Proposition 6, X_{k-1} is a frequent key itemset. For each item a ∈ cl(X_{k-1}) \ X_{k-1}, X_{k-1} ∪ {a} is a frequent k-itemset, which is either in F_k or in C_k. For each frequent key (k-1)-itemset X_{k-1}, FClosedKeys considers all k-itemsets X_k in C_k and F_k. If X_{k-1} ⊂ X_k and supp(X_{k-1}) = supp(X_k), then it sets B_{k-1} := B_{k-1} ∪ (X_k \ X_{k-1}), where B_{k-1} is initially set to empty. Thus,

cl(X_{k-1}) = X_{k-1} ∪ B_{k-1}.

Theorem 2. The set R, returned by FClosedKeys, consists of all triples (X_k, cl(X_k) \ X_k, supp(X_k)), where X_k is a frequent key itemset of the given dataset. That is, R is the closed keys representation of the dataset, with respect to the support threshold minsup.

The proof of Theorem 2 is immediate from Propositions 7 and 8.
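Theorem 2 also makes the representation directly queryable: an itemset is frequent iff it is included in the closure of some frequent key, and its support is then the largest support among such keys (the support of its smallest closed superset). A minimal Python sketch of this inference (our own illustration, with names of our choosing):

def support_of(itemset, triples):
    # triples: the closed keys representation, as tuples (key, closure, supp).
    # Returns the support of itemset, or None if itemset is not frequent.
    best = None
    for _key, closure, s in triples:
        if itemset <= closure and (best is None or s > best):
            best = s
    return best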

We report in Table 1 some experimental results on closed keys representations. The Mushrooms dataset, obtained from the UCI KDD Archive (http://kdd.ics.uci.edu/), consists of 8124 objects. Each object is described by 23 attributes, with values coded in 128 items. The Chess and Connect-4 datasets are obtained from IBM Almaden (www.almaden.ibm.com/cs/quest/demos.html). Chess has 3196 objects. Each object is described by 37 attributes, with values coded in 76 items. Connect-4 has 67557 objects. Each object is described by 43 attributes, with values coded in 130 items.

Notations: In Table 1, the following notations are used:

Sup: support threshold; NbFq: number of all frequent itemsets with respect to a support threshold; MaxFq: maximal size of the frequent itemsets; NbCK: number of all keys in the closed keys representation, with respect to a support threshold; MaxCK: maximal size of the keys in the closed keys representation.

Table 1.
Database    Sup   NbFq   MaxFq  NbCK   MaxCK
Mushrooms   30%   2735    9      558    6
Mushrooms   20%   53663   15     1747   7
Chess       80%   8282    10     5114   9
Chess       70%   48969   13     23992  11
Connect-4   97%   487     6      285    5
Connect-4   90%   27127   12     3487   7

Let us conclude this section with some remarks about the complexity of the closed keys representation and of the algorithms. For each triple (X_k, cl(X_k) \ X_k, supp(X_k)) in the closed keys representation, the memory space required to store X_k and cl(X_k) \ X_k is equal to the memory space required to store cl(X_k), the closure of X_k. In each step k ≥ 2 of FClosedKeys, the generation of a k-itemset candidate X_k, based on two frequent (k-1)-itemsets X and Y, needs to know the (k-2)-itemsets X.prev, Y.prev or Z.prev, where Z is a frequent (k-1)-itemset included in X ∪ Y. These (k-2)-itemsets are associated with the (k-1)-itemsets in step (k-1). Thus in step k, the algorithm needs only the information stored in step (k-1), and not the


information stored in step (k-2). The support of a frequent non-key k-itemset X_k is obtained immediately, as is its field X_k.prev. Only for frequent key itemset candidates (case (b.2) of GenCandidate) is the computation costly: checking that all (k-1)-itemsets X_{k-1} ⊂ X_k are frequent key itemsets, accessing the dataset to compute the support of X_k, and checking that X_k is a key itemset. Finally, we observe that in FClosedKeys, the number of dataset accesses is at most equal to the maximal size of the frequent key itemsets.

5. Related work

5.1. Methods for computing frequent closed itemsets

Many algorithms for computing frequent closed itemsets have been proposed in recent years. Apriori-Close [PAS 99a], an extension of Apriori [AGR 94], computes frequent and frequent closed itemsets simultaneously. In step k of the iteration, k ≥ 2, the k-itemsets are generated from the frequent (k-1)-itemsets as in Apriori. The dataset is scanned to compute the supports of the generated k-itemsets, and infrequent k-itemsets are removed. For each remaining k-itemset, if there exists a frequent (k-1)-subset with the same support, then that (k-1)-itemset is marked as non-closed. At the end of step k, all frequent (k-1)-itemsets that are not marked as non-closed are effectively closed. Apriori-Close stops at step k+1, where k is the maximal size of the frequent itemsets.

CHARM [ZAK 99] is also a bottom-up algorithm that computes frequent closed itemsets. The main differences from Apriori-Close are the following. For each frequent item, CHARM maintains the list of identifiers of the transactions in which the item occurs. CHARM explores both the itemset space and the transaction identifier set space. The main operations are the union of itemsets and the intersection of transaction identifier sets (tidsets). When combining two itemsets X_1 and X_2, CHARM applies the following optimizations to avoid enumerating all possible subsets of a closed itemset, where t(X) denotes the tidset of X: (1) if t(X_1) = t(X_2), then every occurrence of X_1 is replaced by X_1 ∪ X_2, as t(X_1 ∪ X_2) = t(X_1) ∩ t(X_2) = t(X_1) = t(X_2), and X_2 is discarded from further consideration, since the closure of X_2 is identical to the closure of X_1 ∪ X_2; (2) if t(X_1) ⊂ t(X_2), then every occurrence of X_1 is replaced by X_1 ∪ X_2, as t(X_1 ∪ X_2) = t(X_1) ∩ t(X_2) = t(X_1); a similar optimization is applied if t(X_2) ⊂ t(X_1).
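A compact Python paraphrase of these tidset-based optimizations (our own sketch of the idea, not CHARM's actual data structures; tidsets are positions in the transaction list):

def tidset(itemset, dataset):
    # Identifiers (positions) of the transactions containing the itemset.
    return frozenset(i for i, t in enumerate(dataset) if itemset <= t)

def combine(x1, x2, dataset):
    # Decide how CHARM-style pruning would treat the pair (x1, x2).
    t1, t2 = tidset(x1, dataset), tidset(x2, dataset)
    if t1 == t2:
        return ('replace x1 by x1|x2 and discard x2', x1 | x2)
    if t1 < t2:
        return ('replace x1 by x1|x2', x1 | x2)
    if t2 < t1:
        return ('replace x2 by x1|x2', x1 | x2)
    return ('keep both, add x1|x2 as a new node', x1 | x2)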

CLOSET is a different method for computing frequent closed itemsets [PEI 00]. First, the dataset is scanned to build the list of frequent items sorted in descending order of their supports, called the f-list. Next, the dataset is scanned again to construct a compact representation of the frequent itemsets, called the FP-tree (Frequent Pattern tree), organized as a prefix tree using the order of the f-list. The tree has a header table whose entries are used to link the items in the FP-tree. The search space of frequent closed itemsets is divided into x-conditional datasets, where x is an item in the f-list, such that no transaction in the x-conditional dataset contains any item less frequent than the item x. Based on the FP-tree, the x-conditional dataset can


be efficiently built. The method for searching frequent closed itemsets is recursive, with two main points in each recursion step: the construction of FP-trees and the division of the search space into conditional datasets, following the items in the f-lists of the current conditional dataset.

The above algorithms for computing frequent closed itemsets differ from our algorithm in two respects: they need to compute up to the frequent closed itemsets of maximal size, and they do not compute the frequent key itemsets. Though the set of all frequent itemsets, with their supports, can be determined from the frequent closed itemsets, it is not straightforward to determine the set of all frequent key itemsets, with their supports, from the closed itemsets. In our approach, the frequent key itemsets are directly available in the representation.

5.2. Methods for computing frequent key itemsets

Dually to Apriori-Close, Pascal [BAS 00b] is a bottom-up algorithm which computes frequent and frequent key itemsets. In step k of the iteration, k ≥ 2, k-itemsets are generated from the frequent (k-1)-itemsets as in the Apriori method, with the following optimization. Let X_k be a generated k-itemset. If there exists a (k-1)-itemset X_{k-1} ⊂ X_k such that X_{k-1} is infrequent, then delete X_k. Otherwise, let p_min be the minimum of the supports of the (k-1)-itemsets contained in X_k. If some X_{k-1} is not a key, then mark X_k as non-key and set supp(X_k) := p_min (support inference). For the remaining k-itemsets, which are not marked as non-key, scan the dataset to compute their supports. If the support of a k-itemset X_k is equal to p_min, then X_k is marked as non-key. At the end of step k, all remaining k-itemsets which are not marked as non-key are effectively frequent key itemsets. Pascal stops when all frequent itemsets are computed, even if all frequent key itemsets have already been discovered many steps before.

Our algorithm is distinct from Pascal on three points:

(i) When generating candidates in step k ≥ 2, Pascal verifies, for every candidate X_k, that all (k-1)-itemsets which are subsets of X_k are frequent. In our algorithm, this verification only occurs for the candidates of case (b.2) (see GenCandidate).

(ii) If all X_{k-1} ⊂ X_k are frequent, and such an X_{k-1} is not a key, then, following Pascal, the support of X_k is obtained by computing the minimum of the supports of the X_{k-1} ⊂ X_k. In such a case, our algorithm knows which itemset X_{k-1} ⊂ X_k satisfies supp(X_k) = supp(X_{k-1}), thanks to the field prev associated with the frequent (k-1)-itemsets.

(iii) Pascal stops when no frequent itemset can be found. Our algorithm is designed to compute the frequent key itemsets together with additional information for computing the frequent closed itemsets. It stops when no frequent key itemset can be found. The additional information consists of the support of each key itemset and its closure.

On these three points, we argue that our algorithm is more efficient and yields more information than Pascal.


AClose [PAS 99a] is an algorithm that computes both frequent key and closed itemsets. It works in two phases. In the first phase, AClose uses a bottom-up search to identify the frequent key itemsets. In step k of the iteration, k ≥ 2, k-itemsets are generated from the frequent (k-1)-itemsets. The dataset is scanned to compute the supports of the generated k-itemsets, and infrequent k-itemsets are removed. For each remaining k-itemset, if there exists a frequent (k-1)-itemset with the same support, then the k-itemset is not a key and is deleted. At the end of step k, all remaining frequent k-itemsets are effectively frequent key itemsets. AClose stops when no more keys can be generated.

In the second phase, AClose computes the closures of all frequent key itemsets generated in the first phase. For each frequent key itemset X, the closure of X is computed as the intersection of all transactions in which X occurs.

Our algorithm is distinct from AClose on two main points: (i) the optimizations by support inference, as discussed in the comparison with Pascal; (ii) our algorithm computes frequent key and closed itemsets in a single phase, and intersection is not used in the computation of the frequent closed itemsets.

5.3. Other approaches to concise representations of frequent itemsets

The disjunction-free sets representation [BYK 01] is an approach based on the concept of frequent disjunction-free itemset, which is defined as follows: an itemset X is called frequent disjunction-free if X is frequent and there are no items a, b ∈ X such that X \ {a, b} → a ∨ b is an exact rule. Otherwise, X is called disjunctive. Every subset of a disjunction-free itemset is also disjunction-free, and every superset of a disjunctive itemset is also disjunctive [BYK 01]. The disjunction-free sets representation consists of (a) the set of all frequent disjunction-free itemsets, with their supports, (b) the negative border of the frequent disjunction-free itemsets, denoted by DFreeBd−, which is the set of all itemsets X, with their supports, such that X is not frequent disjunction-free but every proper subset of X is frequent disjunction-free, and (c) the set of all items that occur in the dataset.
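As an executable reading of this definition (our sketch, under the assumption that a and b need not be distinct, so that the key condition X \ {a} → a appears as a special case; transactions are sets of items):

from itertools import combinations_with_replacement

def is_disjunction_free(x, dataset):
    # x is disjunction-free iff no exact rule  x \ {a,b} -> a OR b  holds:
    # i.e. no pair a, b in x such that every transaction containing x \ {a,b}
    # also contains a or b (rules with an uncovered body are not counted).
    for a, b in combinations_with_replacement(sorted(x), 2):
        body = x - {a, b}
        covering = [t for t in dataset if body <= t]
        if covering and all(a in t or b in t for t in covering):
            return False
    return True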

Work in [KRY 01b] has proposed a more concise representation, called the disjunction-free generators representation, which consists of:

(a) The set of frequent keys (generators) which are disjunction-free, denoted by FDFreeGen.

(b) The negative border of the frequent disjunction-free generators, which is the set of all frequent key itemsets X such that X is not disjunction-free but every proper subset of X is disjunction-free, and which is denoted by FDFreeGenBd−.

(c) The negative border of the infrequent disjunction-free generators, which is the set of all key itemsets X such that X is not frequent but every proper subset of X is frequent and disjunction-free, and which is denoted by IDFreeGenBd−.


(d) And the set I of items that occur in the dataset.

In the disjunction-free generators representation, the itemsets in FDFreeGen and FDFreeGenBd− are enriched with their supports.

Example 1 (continued). The disjunction-free generators representation of the dataset in Table a, with respect to minsup = 2/6, consists of:

FDFreeGen: six itemsets with their supports, among them (∅, 1). (The number next to an itemset is its support. Two frequent key itemsets, of supports 2/6 and 3/6, do not belong to FDFreeGen, because an exact disjunctive rule holds for each of them.)

FDFreeGenBd−: the two frequent key itemsets above. IDFreeGenBd−: one itemset. I: the items occurring in the dataset. Determining whether an itemset X is frequent and, if so, computing its support requires us to

search for the subsets of X in FDFreeGenBd−, IDFreeGenBd− and FDFreeGen, and to compute the minimum of the supports of the generators included in X. Moreover, when X is disjunctive, computing supp(X) needs information from exact disjunctive rules; such information, however, is not encoded in the disjunction-free generators representation.

We can see that determining whether an itemset is frequent and, if so, inferring its support from the disjunction-free generators representation is a complex operation and costs more than doing so from the closed keys representation. One may argue that the number of itemsets in the disjunction-free generators representation is smaller than in the closed keys representation. To examine this, consider the disjunction-free generators representation of the dataset in Table a, with respect to minsup = 4/6: FDFreeGen now contains (∅, 1) and two generators of support 4/6; FDFreeGenBd−: empty.

IDFreeGenBd− contains the remaining borderline generators, and I is the set of items occurring in the dataset. Explanation: the key itemsets whose supports lie below 4/6 are no longer frequent with respect to minsup = 4/6 and are deleted from FDFreeGen. However, they are still disjunction-free (the concept of disjunction-freeness is independent of the support threshold); those whose proper subsets all belong to FDFreeGen are therefore added to IDFreeGenBd−, while the others are not. Similarly, the two generators of FDFreeGenBd− are not frequent with respect to minsup = 4/6 and are deleted from it, which becomes empty; one of them is added to IDFreeGenBd−, because all of its subsets are in FDFreeGen, whereas the other is not, because one of its subsets is not in FDFreeGen. Thus, there are eight itemsets in the disjunction-free generators representation with respect to minsup = 4/6, whereas, for the same minsup, there are only six itemsets in the closed keys representation. The latter is obtained by simply selecting the triples having supports greater than or equal to 4/6 in the closed keys representation computed with respect to minsup = 2/6.


In the algorithm generating the disjunction-free generators representation, each candidate is checked for disjunction-freeness [KRY 01b]. This task costs more than checking whether an itemset X_k is a key itemset and, if not, computing cl(X_{k-1}) \ X_{k-1} (see Algorithm FClosedKeys), on two points: (i) checking whether an itemset X_k is disjunction-free requires knowledge from the two previous steps (k-1 and k-2), whereas checking whether an itemset X_k is a key itemset requires only C_{k-1} (one previous step); and (ii) computing the confidence of a disjunctive rule costs more than the simple comparison supp(X_{k-1}) = supp(X_k).

With the above arguments, we can conclude that the closed keys representation is more efficient than the disjunction-free generators representation, both in terms of the algorithms computing the representations and in terms of inference: determining the frequent itemsets, with their supports, from the representations.

6. Conclusions and Remarks

We have presented FClosedKeys, an algorithm for computing both frequent key and frequent closed itemsets in a single phase. The correctness and completeness of the algorithm have been proved. The result of the algorithm, the closed keys representation, is a concise representation of the frequent itemsets. The representation is lossless: given an itemset X, we can determine from the representation whether X is frequent and, if so, its support is available in the representation. We have presented a lattice structure that is useful to speed up the inference of frequent itemsets. The discussion of related work has shown that our approach has many advantages over existing approaches to concise representations of frequent itemsets, in terms of algorithms and of frequent itemset inference. Experimental results show that the representation is compact and valuable. In particular, all frequent key and frequent closed itemsets, which are very useful in many applications as concise representations of association rules and in online data mining, are available in the closed keys representation. In [AGG 98], an approach to online mining of association rules was proposed, in which an adjacency lattice structure is used to store all frequent itemsets in main memory. This structure not only allows association rules to be mined efficiently, but also avoids generating redundant rules. For the problem of online mining of association rules, instead of storing the adjacency lattice structure, we believe it would be more efficient to store the Galois lattice structure of the closed keys representation (see Section 3.3): the closed keys representation is clearly more compact than the adjacency lattice structure, yet the support of any frequent itemset is immediate in the representation, without any computation. This work is currently under investigation.

7. References

[AGR 93] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. of the 1993 ACM SIGMOD Int'l Conf. on Management of Data, May 1993, pp. 207-216.


[AGR 94] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", Proc. of the 20th Int'l Conference on Very Large Data Bases (VLDB), June 1994, pp. 478-499.

[AGG 98] C. C. Aggarwal, P. S. Yu, "Online Generation of Association Rules", Proc. of the Int'l Conference on Data Engineering, Orlando, Florida, Feb. 1998.

[BAS 00a] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme and L. Lakhal, "Mining minimal non-redundant association rules using frequent closed itemsets", Proc. of the 6th Int'l Conf. on Deductive and Object Databases (DOOD'00), Stream of the 1st Int'l Conf. on Computational Logic (CL'00), LNCS 1861, Springer Verlag, pp. 972-986, London, UK, 2000.

[BAS 00b] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme and L. Lakhal, "Mining frequent patterns with counting inference", ACM SIGKDD Explorations, vol. 2(2), December 2000, pp. 66-75.

[BOU 00] J.-F. Boulicaut and A. Bykowski, "Frequent closures as a concise representation for binary data mining", Proc. of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), Kyoto, Japan, April 2000, LNAI 1805, pp. 62-73.

[BYK 01] A. Bykowski and C. Rigotti, "A condensed representation to find frequent patterns", Proc. of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'01), May 2001.

[GAN 99] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer, 1999.

[KRY 01a] M. Kryszkiewicz, "Closed set based discovery of representative association rules", Proc. of IDA'01, Springer, September 2001.

[KRY 01b] M. Kryszkiewicz, "Concise Representation of Frequent Patterns based on Disjunction-free Generators", Proc. of the 2001 IEEE International Conference on Data Mining (ICDM'01), Nov. 29 - Dec. 2, 2001, San Jose, California, USA, pp. 305-312.

[PAS 99a] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules", Proc. of the 7th Int'l Conf. on Database Theory (ICDT), Jan. 1999, pp. 398-416.

[PAS 99b] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Efficient mining of association rules using closed itemset lattices", Information Systems, 24(1), 1999, pp. 25-46.

[PEI 00] J. Pei, J. Han, R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", Proc. Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), May 2000, pp. 21-30.

[PHA 01a] V. Phan-Luong, "The Representative Basis for Association Rules", Proc. of the 2001 IEEE International Conference on Data Mining (ICDM'01), Nov. 29 - Dec. 2, 2001, San Jose, California, USA, poster paper, pp. 639-640.

[PHA 01b] V. Phan-Luong, "Reasoning on Association Rules", Actes des 17èmes Journées Bases de Données Avancées (BDA'2001), Cépaduès Edition, pp. 299-310.

[ZAK 99] M. J. Zaki, C. J. Hsiao, "CHARM: An efficient algorithm for closed association rules mining", Technical report 99-10, Computer Science Dept., Rensselaer Polytechnic Institute, Oct. 1999.

[ZAK 00] M. J. Zaki, "Generating non-redundant association rules", Proc. of the 6th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD 2000), Boston, MA, August 2000, pp. 34-43.


Relational Lattice: An Algebraic Structure for Multidimensional Data Mining

Alain CASALI, Rosine CICCHETTI, Lotfi LAKHAL

Laboratoire d'Informatique Fondamentale de Marseille (LIF), CNRS UMR 6166
Université de la Méditerranée, Case 901
163 Avenue de Luminy
13288 Marseille Cedex 9

casali,cicchetti,[email protected]

RÉSUMÉ. Constrained multidimensional patterns differ from the frequent patterns well known in data mining because they are conceptually and logically structured and support various types of constraints. Classical data mining techniques based on the power set lattice do not lend themselves well to an extension to the multidimensional context, since the search and solution spaces turn out to be ill-suited to these new problems. In this article, we propose a theoretical framework for various multidimensional data mining problems by introducing a new algebraic structure: the relational lattice, which characterizes the search space to be explored. We take into account the monotone and/or anti-monotone constraints enforced when extracting multidimensional patterns, propose interval representations of the constrained relational lattice, which is a convex space, and present an algorithm for computing them. Finally, we highlight the advantages of the relational lattice over the power set lattice in multidimensional data mining.

ABSTRACT. Constrained multidimensional patterns differ from the well-known frequent patterns from a conceptual and logical point of view because they are structured and support various types of constraints. Classical data mining techniques are based on the power set lattice and, even extended, are not suitable for addressing the discovery of multidimensional patterns. In this paper we propose a foundation for various multidimensional data mining problems by introducing a new algebraic structure, called the relational lattice, which characterizes the search space to be explored. We take into consideration monotone and/or anti-monotone constraints enforced when mining multidimensional patterns. In addition, we propose interval representations of the constrained relational lattice, which is a convex space, and present an algorithm for computing them. Finally, we place emphasis on the advantages of the relational lattice compared to the power set lattice in multidimensional data mining.

MOTS-CLÉS : Multidimensional data mining, database theory, lattices.

KEYWORDS: Multidimensional data mining, database theory, lattices.


1. Introduction

The extraction of constrained multidimensional patterns, i.e. data mining over n-ary databases, makes it possible to address various problems such as mining multidimensional association rules [LU 00], Roll-Up dependencies [WIJ 99], constrained multidimensional gradients [DON 01], classification rules [LIU 98, LI 01], and correlation rules [BRI 97, GRA 00]. It is also the fundamental step in computing (complete or partial) materialized data cubes for OLAP [AGA 96, GRA 97, HO 97, BEY 99, CIC 01, HAN 01]. Adapting approaches and algorithms that are proven and efficient on binary databases to this new n-ary context is possible but of little relevance. Yet this adaptation is commonly used for quantitative association analysis [SRI 96] and classification [LIU 98, LI 01]. [BEY 99, HAN 01] used extensions of Apriori [AGR 96] to compute iceberg data cubes and observed "dramatically bad" response times. The reasons for these failures are the following. First, each n-ary attribute must be replaced by a set of binary attributes, each representing a single value of the n-ary attribute [SRI 96]. If the attribute domains are large (as is the case in data warehouses [BEY 99]), this substitution can lead to working with a very large number of attribute values. Now the search space considered in data mining over binary databases is the power set lattice of the set of binary attributes (called items in this context). This already considerable search space includes a large number of solutions which, in an n-ary context, are known to be semantically erroneous. Suppose that, in a relation, an attribute A admits k values; it is replaced by k binary attributes a1, ..., ak. In the power set lattice, all the pairs (ai, aj) with 1 ≤ i, j ≤ k and i ≠ j are considered, and computing their support, for instance, requires a costly scan of the binary database. Yet we know that the original database contains no pattern (ai, aj), simply because the initial attribute A is single-valued and its values ai and aj are therefore exclusive. Thus, if an anti-monotone constraint (e.g. Freq() ≥ threshold) is used to extract frequent multidimensional patterns, the original complexity of level-wise algorithms is affected, because the negative border¹ (a fundamental concept for complexity analysis [MAN 97]) is enlarged by a possibly voluminous set of useless combinations. The problem worsens if a monotone constraint (e.g. Freq() ≤ threshold) is considered because, although invalid, the pattern (ai, aj) introduced above is part of the result.

In this article, we introduce and characterize the search space for multidimensional pattern extraction, i.e. for the various multidimensional data mining problems mentioned above. In this space, only solutions that are semantically valid in our working context are retained. By introducing an order relation between the elements of this space and by proposing two construction

1. The set of minimal infrequent patterns.


operators, we define a new algebraic structure, called the relational lattice. Within this new framework, multidimensional patterns can be extracted using conjunctions of monotone and/or anti-monotone constraints. The second aspect of our contribution concerns compact representations [MAN 96] of the relational lattice. Interval, or border, representations avoid enumerating all the solutions [MIT 82, MAN 97]. We define several equivalent interval representations of the constrained relational lattice. Their practical advantage is that they limit the memory explosion problem, which is particularly critical when extracting constrained multidimensional patterns. Moreover, these interval representations of course make it possible to rebuild the whole solution space or, without performing this reconstruction, to decide whether a given multidimensional pattern is a solution. We show that the relational lattice under monotone and/or anti-monotone constraints is a convex space. We then describe a level-wise algorithm, without backtracking, for computing these representations. As there is, to our knowledge, no general approach for extracting the various kinds of multidimensional patterns, we propose a comparison between our approach and the extensions, to the n-ary context, of binary data mining approaches. In particular, we show (i) the relevance of our search and solution spaces compared to those considered by such extensions, and (ii) the preservation of the complexity of level-wise algorithms in our approach and its degradation for the considered extensions.

Organization of the paper: In Section 2 we detail the structure of the relational lattice. In Section 3 we study its interval representations for the different cases of conjunctions of constraints. The algorithm computing a representation of the relational lattice in the presence of the various types of constraints is given in Section 4. We then compare the relational lattice with the power set lattice in Section 5. In conclusion, we summarize our contribution and present research perspectives.

2. Relational Lattices

Unlike the patterns sought in binary data mining, multidimensional patterns have a structure, and it is important to characterize them by exhibiting this structure. Moreover, the links existing between such patterns convey important semantics in an n-ary working context. On the first point, our answer consists in introducing the notion of multidimensional space, whose elements represent the multidimensional patterns. To formalize their links, we rely on an order relation and define two fundamental construction operators. The search space for n-ary data mining problems can then be characterized, which we do by defining the concept of relational lattice, whose properties we study. In this multidimensional context, the relational lattice is a new concept description language which can be


used not only in multidimensional data mining but also in machine learning [MIT 97].

2.1. Multidimensional spaces

Throughout this article, we rely on the following assumptions and notations. Let r be a relation over a schema R. The attributes of R are divided into two sets: (i) D, the set of dimensions, also called category attributes, corresponding to the analysis criteria for OLAP, classification or concept learning [MIT 97], and (ii) M, the set of measure attributes or classification attributes. Moreover, ∀A ∈ D, we denote by Dim(A) the projection of r on A and we assume, in the remainder of the article, that ∀A ∈ D, |Dim(A)| ≥ 2.

The multidimensional space of the relation r gathers all the semantically valid combinations of the sets of values existing in r for the attributes of D, these sets being enriched with the symbolic value ALL. The latter, introduced in [GRA 97] for the definition of the Cube-By operator, is a generalization of all the possible values of the projection on an attribute. Thus ∀A ∈ D, ∀a ∈ Dim(A), a ⊂ ALL.

Definition 2.1. The multidimensional space of r is denoted and defined as follows: Espace(r) = ×_{A ∈ D} (Dim(A) ∪ {ALL}), where × stands for the Cartesian product. Any element of the multidimensional space is called a d-tuple and represents a multidimensional pattern.

Let X ⊆ D and let t be a d-tuple. We denote by t[X] the restriction of t to X. Any tuple of r is a d-tuple, since r ⊆ ×_{A ∈ D} Dim(A) ⊆ Espace(r).

Example 1. Table 1 presents the example relation used throughout this article to illustrate the concepts introduced. In this relation, A, B, C are category attributes and M is a measure attribute. The multidimensional space of this relation is shown in Table 2.

N Tuple  A   B   C   M
1        a1  b1  c1  3
2        a1  b1  c2  2
3        a1  b2  c1  2
4        a1  b2  c2  2
5        a2  b1  c1  1
6        a3  b1  c1  1

Table 1. Example relation r


N Tuple  A    B    C
1        a1   b1   c1
2        a1   b1   c2
3        a1   b2   c1
4        a1   b2   c2
5        a2   b1   c1
6        a2   b1   c2
7        a2   b2   c1
8        a2   b2   c2
9        a3   b1   c1
10       a3   b1   c2
11       a3   b2   c1
12       a3   b2   c2
13       a1   b1   ALL
14       a1   b2   ALL
15       a1   ALL  c1
16       a1   ALL  c2
17       a2   b1   ALL
18       a2   b2   ALL
19       a2   ALL  c1
20       a2   ALL  c2
21       a3   b1   ALL
22       a3   b2   ALL
23       a3   ALL  c1
24       a3   ALL  c2
25       ALL  b1   c1
26       ALL  b1   c2
27       ALL  b2   c1
28       ALL  b2   c2
29       a1   ALL  ALL
30       a2   ALL  ALL
31       a3   ALL  ALL
32       ALL  b1   ALL
33       ALL  b2   ALL
34       ALL  ALL  c1
35       ALL  ALL  c2
36       ALL  ALL  ALL
37       ∅    ∅    ∅

Table 2. Multidimensional space of the relation r

The d-tuples with identifiers 1, ..., 12 and 37 are possible tuples of r (since their values are real values), even though the tuple with identifier 37 is not explicitly present in the relation r. In this tuple, the symbol ∅ means "empty value". The other d-tuples (identifiers 13, ..., 36) cannot be tuples of r.
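As an illustration, the following short Python sketch (ours, not part of the paper) builds Espace(r) for the example relation; the ALL marker is encoded as a plain string and None stands for the empty value ∅:

from itertools import product

ALL = 'ALL'

def espace(r, dims):
    # Multidimensional space of r: one value per dimension, drawn from the
    # values present in r for that dimension plus ALL, plus the empty d-tuple.
    domains = [sorted({t[a] for t in r}) + [ALL] for a in dims]
    space = [dict(zip(dims, combo)) for combo in product(*domains)]
    space.append({a: None for a in dims})     # <∅, ..., ∅>
    return space

r = [{'A': 'a1', 'B': 'b1', 'C': 'c1'}, {'A': 'a1', 'B': 'b1', 'C': 'c2'},
     {'A': 'a1', 'B': 'b2', 'C': 'c1'}, {'A': 'a1', 'B': 'b2', 'C': 'c2'},
     {'A': 'a2', 'B': 'b1', 'C': 'c1'}, {'A': 'a3', 'B': 'b1', 'C': 'c1'}]
print(len(espace(r, ['A', 'B', 'C'])))        # 37, the number of d-tuples in Table 2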

2.2. Generalization order

d-tuples convey information at different levels of granularity and, in the multidimensional space, they are by nature redundant. Indeed, any d-tuple of the initial relation generally participates in the construction of several d-tuples (first level of data summarization); the latter can in turn be summarized, and so on. At the most general level, the summary consists of a single d-tuple whose values are all ALL and which summarizes the initial relation r in the most compact, but also the coarsest, way. In this context, it is therefore important to formalize this "participates in the construction of" link between d-tuples, which we do by defining an order relation between d-tuples, called the generalization relation. This order relation was originally


introduced by T. Mitchell [MIT 82] in a machine learning context, to characterize the version space.

Definition 2.2. Let u, v be two d-tuples of the multidimensional space of r:

u ≥g v ⇔ ∀A ∈ D, v[A] ⊆ u[A], or v = <∅, ..., ∅>.

If u[A] and v[A] ≠ ALL, u[A] and v[A] are singleton sets. We say that u is more generic than v in Espace(r).

The generalization relation ≥g is the dual order of the specialization (≤s), or subsumption, relation.

Example 2. In the multidimensional space of our example relation (cf. Table 2), with ti denoting the d-tuple of identifier i, we have t13 ≥g t1 and t25 ≥g t1. Hence t13 and t25 are more generic than t1, and t1 is more specific than t13 and t25.

Applied to a set of d-tuples, the operators min and max return, respectively, the most generic and the most specific d-tuples of the considered set.

Definition 2.3. Let T ⊆ Espace(r) be a set of d-tuples:

– min(T) = {t ∈ T | ∄u ∈ T, u ≠ t : u ≥g t}
– max(T) = {t ∈ T | ∄u ∈ T, u ≠ t : t ≥g u}

Example 3. In our example multidimensional space (cf. Table 2), let T = {t1, t13, t25}. Then min(T) = {t13, t25} and max(T) = {t1}.

2.3. Basic operators

We introduce two new operators for constructing d-tuples: Sum (denoted +) and Product (denoted •). The sum of two d-tuples returns the most specific d-tuple generalizing both operands. It is defined as follows.

Definition 2.4. Let u and v be two d-tuples of Espace(r).

t = u + v ⇔ ∀A ∈ D, t[A] = u[A] if u[A] = v[A], and t[A] = ALL otherwise.

We say that t is the sum of the d-tuples u and v.


Example 4. In the example multidimensional space (cf. Table 2), we have t1 + t2 = t13, which means that t13 is built from the d-tuples t1 and t2. Moreover, (t13 + t14) + (t15 + t16) = t13 + t14 = t15 + t16 = t29.

The product of two d-tuples returns the most generic d-tuple specializing both operands. If, for these two d-tuples, there exists an attribute A taking distinct real values (i.e. values existing in the initial relation), then only the d-tuple <∅, ..., ∅> specializes them (apart from this d-tuple, the sets from which they are built are disjoint).

Definition 2.5. Let u and v be two d-tuples of Espace(r). Define the d-tuple z as follows: ∀A ∈ D, z[A] = u[A] ∩ v[A].

t = u • v ⇔ t = z if ∄A ∈ D such that z[A] = ∅, and t = <∅, ..., ∅> otherwise.

We say that t is the product of the d-tuples u and v.

Example 5. In the example multidimensional space (cf. Table 2), we have t13 • t25 = t1 and t1 • t2 = <∅, ∅, ∅>, which means that t13 and t25 generalize t1 and that t1 participates (directly or indirectly) in the construction of t13 and t25. The d-tuples t1 and t2 have no common point other than the empty d-tuple.
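To make the order and the two operators concrete, here is a small self-contained Python sketch (our own illustration, not from the paper); d-tuples are dicts from attribute names to values, ALL is the string 'ALL', and None encodes the empty value ∅:

ALL = 'ALL'

def ge_g(u, v):
    # u >=_g v: u generalizes v (Definition 2.2); the empty d-tuple is below all.
    if all(x is None for x in v.values()):
        return True
    if any(x is None for x in u.values()):
        return False
    return all(u[a] == ALL or u[a] == v[a] for a in u)

def som(u, v):
    # Sum u + v: the most specific d-tuple generalizing both operands (Def. 2.4).
    return {a: u[a] if u[a] == v[a] else ALL for a in u}

def produit(u, v):
    # Product u . v: the most generic d-tuple specializing both operands (Def. 2.5).
    z = {}
    for a in u:
        if u[a] == v[a] or v[a] == ALL:
            z[a] = u[a]
        elif u[a] == ALL:
            z[a] = v[a]
        else:                           # two distinct real values: empty intersection
            return {b: None for b in u}
    return z

t1 = {'A': 'a1', 'B': 'b1', 'C': 'c1'}
t2 = {'A': 'a1', 'B': 'b1', 'C': 'c2'}
print(som(t1, t2))      # {'A': 'a1', 'B': 'b1', 'C': 'ALL'} = t13, cf. Example 4
print(produit(t1, t2))  # the empty d-tuple <∅, ∅, ∅>, cf. Example 5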

2.4. Characterization of the relational lattice

By equipping the multidimensional space Espace(r) with the generalization relation between d-tuples and using the Product and Sum operators, we introduce an algebraic structure called the relational lattice, which provides a general theoretical framework for n-ary data mining. The fundamental properties of the relational lattice are stated in Theorem 2.1 and in Propositions 2.1 and 2.2, whose proofs are given in [CAS 02].

Theorem 2.1. Let R be a set of attributes and r a relation over R. The ordered set TR(r) = ⟨Espace(r), ⪰⟩ is a complete, atomic, co-atomic and non-distributive lattice, called the relational lattice, in which:

1) ∀t, u ∈ TR(r): t ⪰ u ⇔ t ≥g u;

2) ∀T ⊆ TR(r): ∧T = +_{t ∈ T} t, where ∧ denotes the infimum;

3) ∀T ⊆ TR(r): ∨T = •_{t ∈ T} t, where ∨ denotes the supremum.

Example 6. Figure 1 shows the relational lattice of the projection of our example relation (cf. Table 1) on the attributes A and B. In this diagram, the edges represent the generalization or specialization links between d-tuples.


Figure 1. Hasse diagram of the relational lattice of the projection of r on AB (from <∅, ∅> at the bottom, through the d-tuples <ai, bj>, <ai, ALL> and <ALL, bj>, up to <ALL, ALL> at the top; edges represent generalization/specialization links)

In this relational lattice, we have:

1) Setting aside <∅, ∅>, the co-atoms (whose set is denoted CAt) are d-tuples conveying information at the most detailed level, i.e. that of the real values of the dimensions. In other words, the co-atoms can be tuples of r. Thus we have: CAt(TR(r)) = {<a1, b1>, <a2, b1>, <a3, b1>, <a1, b2>, <a2, b2>, <a3, b2>}.

2) The atoms (whose set is denoted At) of the lattice provide the most summarized information, apart from the d-tuple <ALL, ALL>. Since we consider only two dimension attributes here, every atom contains at least the summary value ALL (for one or the other attribute). More precisely, we have: At(TR(r)) = {<a1, ALL>, <a2, ALL>, <a3, ALL>, <ALL, b1>, <ALL, b2>}.

Proposition 2.1. Let L(r) be the power set lattice of the relation r, i.e. the lattice ⟨P(∪_{A ∈ D} Dim(A)), ⊆⟩ (P(X) denoting the power set of the set X). Then there exists an order embedding

Φ : TR(r) → L(r), t ↦ {t[A], A ∈ D | t[A] ≠ ALL}.

Proposition 2.2. The height of the relational lattice is |D| + 1. Its number of elements at level i (i ∈ 1..|D|) is:

Σ_{X ⊆ D, |X| = i} ( Π_{A ∈ X} |Dim(A)| ) ≤ (|D| choose i) · max_{A ∈ D} (|Dim(A)|)^i.

Its total number of elements is therefore:

Σ_{i = 1..|D|} ( Σ_{X ⊆ D, |X| = i} ( Π_{A ∈ X} |Dim(A)| ) ) + 2 = ( Π_{A ∈ D} (|Dim(A)| + 1) ) + 1.

The rank of a d-tuple t is: rank(t) = |Φ(t)|.
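As a check on the running example (Table 1): |Dim(A)| = 3 and |Dim(B)| = |Dim(C)| = 2, so the level counts are 3 + 2 + 2 = 7, 3·2 + 3·2 + 2·2 = 16 and 3·2·2 = 12, and the total is 7 + 16 + 12 + 2 = 37 = (3 + 1)(2 + 1)(2 + 1) + 1, which is exactly the number of d-tuples in Table 2.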

3. Interval representations of the constrained relational lattice

The relational lattice is a new algebraic structure defining the search space for various data mining problems over n-ary databases. In this section, our first objective is to take into account the monotone and/or anti-monotone constraints most commonly used in data mining [NG 98, BAY 00, GRA 00, LOP 02]. These constraints may concern:

– interestingness measures [BAY 99] such as pattern frequency, confidence, or correlation: in this case, only the category attributes of R are needed;

– aggregates over the measure attributes M, computed using additive statistical functions (count, sum, min, ...);

– class prediction measures for supervised classification approaches: in this case, besides the category attributes of D, R also contains class attributes (also denoted M).

Moreover, we are concerned with the size of the representation of the solution space, i.e. of the set of d-tuples satisfying the stated constraints. Indeed, in a database or data warehouse setting, this space can be of very large dimension. However, there exist representations that do not exhaustively list all the solutions but provide "bounds" from which the solution space can be rebuilt. Such representations are called interval representations. Like T. Mitchell when characterizing the version space [MIT 82], and H. Mannila et al. [MAN 97] for the characterization of frequent patterns, we use bounds to define the interval representations of the relational lattice under a conjunction of constraints. However, T. Mitchell uses only two bounds (S and G), as do H. Mannila et al. (Bd+ and Bd−), whereas we introduce four: S+, S−, G+ and G−. S+ is the set of the most specific d-tuples of the relational lattice satisfying the conjunction of constraints, and S− the set of the most specific d-tuples not satisfying it. Symmetrically, G+ is the set of the most generic d-tuples satisfying the conjunction of constraints, and G− the set of the most generic d-tuples not satisfying it.

After recalling the definitions of monotone and anti-monotone constraints with respect to an order, we propose, for each type of conjunction of constraints (monotone, anti-monotone, monotone and anti-monotone), the interval representations that compactly define the solution space and make it possible to decide whether a d-tuple t belongs to this space or not. Finally, we characterize the best representation for such a decision.

Definition 3.1.
1) A constraint Cont is said to be monotone with respect to the order ⪰ if and only if:
∀t, u ∈ TR(r): [t ⪰ u and Cont(t)] ⇒ Cont(u).
2) A constraint Cont is said to be anti-monotone with respect to the order ⪰ if and only if:
∀t, u ∈ TR(r): [t ⪰ u and Cont(u)] ⇒ Cont(t).

Notations: We denote by ccm (respectively ccam) a conjunction of monotone (respectively anti-monotone) constraints, and by cch a hybrid conjunction of constraints (monotone and anti-monotone). Depending on the case considered, the bounds introduced are subscripted with the type of constraint considered. For instance, S+ccam stands for the set of the most specific d-tuples satisfying the conjunction of anti-monotone constraints.

Remarks (extreme cases):

– We assume in the sequel that the d-tuple <ALL, ..., ALL> always satisfies the conjunction of anti-monotone constraints and that the d-tuple <∅, ..., ∅> always satisfies the conjunction of monotone constraints. Under these assumptions, the solution space contains at least one element (possibly the empty-valued d-tuple).

– Moreover, we assume that the d-tuple <ALL, ..., ALL> never satisfies the conjunction of monotone constraints and that the d-tuple <∅, ..., ∅> never satisfies the conjunction of anti-monotone constraints, since otherwise the solution space is Espace(r).

Example 7. In the example relational space (cf. Table 2), we want to know all the d-tuples whose sum of values for the measure attribute M is greater than or equal to 3. The constraint "sum(M) ≥ 3" is an anti-monotone constraint. Likewise, if we want to know all the d-tuples whose sum of values for the attribute M is less than or equal to 5, the constraint "sum(M) ≤ 5" is monotone.
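To make these two constraints concrete, here is a small Python sketch (our own illustration); sum_m aggregates the measure M of the example relation over the tuples matched by a d-tuple, with ALL and None encoded as in the earlier sketches:

ALL = 'ALL'

r = [{'A': 'a1', 'B': 'b1', 'C': 'c1', 'M': 3}, {'A': 'a1', 'B': 'b1', 'C': 'c2', 'M': 2},
     {'A': 'a1', 'B': 'b2', 'C': 'c1', 'M': 2}, {'A': 'a1', 'B': 'b2', 'C': 'c2', 'M': 2},
     {'A': 'a2', 'B': 'b1', 'C': 'c1', 'M': 1}, {'A': 'a3', 'B': 'b1', 'C': 'c1', 'M': 1}]

def sum_m(t):
    # Aggregate sum(M) over the tuples of r matched by the d-tuple t.
    if any(v is None for v in t.values()):     # the empty d-tuple matches nothing
        return 0
    return sum(row['M'] for row in r
               if all(t[a] == ALL or row[a] == t[a] for a in t))

def cont_am(t):
    # "sum(M) >= 3": anti-monotone, i.e. inherited by every more generic d-tuple.
    return sum_m(t) >= 3

def cont_m(t):
    # "sum(M) <= 5": monotone, i.e. inherited by every more specific d-tuple.
    return sum_m(t) <= 5

print(sum_m({'A': 'a1', 'B': ALL, 'C': ALL}))  # 9: satisfies cont_am, violates cont_m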

3.1. Interval representations for conjunctions of monotone constraints

Before defining the various equivalent interval representations for conjunctions of monotone constraints, we formalize the associated bounds S+ccm, S−ccm, G+ccm and G−ccm.

Definition 3.2. Let r be a relation over R, TR(r) the relational lattice of this relation, and ccm a conjunction of monotone constraints. Then we have:

1) S+ccm = {<∅, ..., ∅>}

2) S−ccm = {t ∈ TR(r) | ¬ccm(t) and ∄t′ ∈ TR(r), t′ ≠ t : ¬ccm(t′) and t ≥g t′}

3) G+ccm = {t ∈ TR(r) | ccm(t) and ∄t′ ∈ TR(r), t′ ≠ t : ccm(t′) and t′ ≥g t}

4) G−ccm = {t ∈ At(TR(r)) | ¬ccm(t)}

The various interval representations allowing the reconstruction of the solution space combine two of the defined bounds. We therefore study all the possible pairs of bounds through the following lemma.

Lemma 3.1. Let TR(r)ccm denote the solution space. Then TR(r)ccm can be characterized as follows:

1) Representation with (S+ccm, G+ccm):
TR(r)ccm = {t ∈ TR(r) | ∃u ∈ S+ccm and ∃v ∈ G+ccm : t ≥g u and v ≥g t}. Thus every resulting d-tuple generalizes the most specific solution and specializes one of the most generic solutions.

2) Representation with (S+ccm, S−ccm):
TR(r)ccm = {t ∈ TR(r) | ∃u ∈ S+ccm and ∄v ∈ S−ccm : t ≥g u and t ≥g v}. The resulting d-tuples generalize the most specific solution and generalize none of the most specific d-tuples violating the constraint.

3) Representation with (S−ccm, G+ccm):
TR(r)ccm = {t ∈ TR(r) | ∄u ∈ S−ccm and ∃v ∈ G+ccm : t ≥g u and v ≥g t}. The resulting d-tuples specialize one of the most generic solutions without generalizing any of the most specific d-tuples not satisfying ccm.

4) Representation with (S−ccm, G−ccm):
TR(r)ccm = {t ∈ TR(r) | ∄u ∈ S−ccm and ∄v ∈ G−ccm : t ≥g u and t ≥g v}. The solutions generalize neither a most specific d-tuple violating ccm nor a most generic d-tuple violating ccm.

5) Representation with (G+ccm, G−ccm):
TR(r)ccm = {t ∈ TR(r) | ∃u ∈ G+ccm and ∄v ∈ G−ccm : u ≥g t and t ≥g v}. The characterized d-tuples specialize one of the most generic solutions and generalize none of the most generic d-tuples violating the constraint.

6) Representation with (S+ccm, G−ccm):
No representation is possible, since a d-tuple generalizing the most specific solution of S+ccm may or may not satisfy the constraint, and a d-tuple specializing an element of G−ccm may or may not be a solution. Hence, from these bounds, nothing can be decided.

Proof. Let Solccm denote the set {t ∈ TR(r) | ccm(t)}. We show that, for each interval representation, Solccm = TR(r)ccm.


1) Representation with (S+ccm, G+ccm):

- Let t ∈ TR(r)ccm. By definition, there exists v ∈ G+ccm such that ccm(v) and v ≥g t. Since ccm is a monotone constraint, we have ccm(t). Consequently t ∈ Solccm, so TR(r)ccm ⊆ Solccm. (a)

- Let t ∈ Solccm. There necessarily exists v ∈ G+ccm such that v ≥g t, since G+ccm contains the minimal (most generic) d-tuples satisfying ccm. Moreover, the condition ∃u ∈ S+ccm | t ≥g u is always satisfied. Hence t ∈ TR(r)ccm and Solccm ⊆ TR(r)ccm. (b)

(a) and (b) ⇒ Solccm = TR(r)ccm.

2) Representation with (S+ccm, S−ccm):

- Let t ∈ TR(r)ccm. Since S−ccm contains the maximal (most specific) d-tuples violating ccm and ∄v ∈ S−ccm | t ≥g v, we deduce that t ∈ Solccm, the condition ∃u ∈ S+ccm | t ≥g u being always satisfied. Hence TR(r)ccm ⊆ Solccm. (a)

- Let t ∈ Solccm. The condition ∃u ∈ S+ccm | t ≥g u is trivial. If there existed v ∈ S−ccm | t ≥g v then, since ccm(t) holds, we would have ccm(v), which is impossible because v ∈ S−ccm. Hence Solccm ⊆ TR(r)ccm. (b)

(a) and (b) ⇒ Solccm = TR(r)ccm.

3) Representation with (S−ccm, G+ccm):
This representation combines the two previous ones, so Solccm = TR(r)ccm by (1) and (2).

4) Representation with (S−ccm, G−ccm):
The condition ∄v ∈ G−ccm, t ≥g v is always satisfied: either t ∈ G−ccm, in which case t does not belong to the solution space, or t = <ALL, ..., ALL>, which by definition does not satisfy the conjunction of monotone constraints. Hence this representation is equivalent to the second one, and Solccm = TR(r)ccm by (2).

5) Representation with (G+ccm, G−ccm):
The condition ∃u ∈ S+ccm, t ≥g u is always satisfied, since every d-tuple is more generic than <∅, ..., ∅>. Hence this representation is equivalent to the first one, and Solccm = TR(r)ccm by (1).

Example 9 - Table 3 shows the bounds S+ccm, S−ccm, G+ccm and G−ccm of the example relational lattice (cf. Table 2) for the monotone constraint "sum(M) ≤ 5".

S+ccm : <∅, ∅, ∅>
S−ccm : <a1,ALL,ALL>  <ALL,b1,ALL>  <ALL,ALL,c1>
G+ccm : <a2,ALL,ALL>  <a3,ALL,ALL>  <ALL,b2,ALL>  <ALL,ALL,c2>  <a1,b1,ALL>  <a1,ALL,c1>  <ALL,b1,c1>
G−ccm : <a1,ALL,ALL>  <ALL,b1,ALL>  <ALL,ALL,c1>

Table 3. Bounds of TR(r) for "sum(M) ≤ 5"


Consider, for instance, the interval-based representation (S+ccm, G+ccm) and the d-tuple t19 = <a2, ALL, c1>. Like any d-tuple, it generalizes <∅, ∅, ∅>; moreover, it specializes the d-tuple <a2,ALL,ALL> belonging to G+ccm. Hence t19 satisfies the constraint and belongs to the solution space.
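To make this membership test concrete, the following Python sketch replays the check on t19. The encoding of d-tuples and of the generalization order ≥g (a component matches when it is ALL or equal, with a BOT marker standing for <∅, ∅, ∅>) are our assumptions, not definitions taken from the paper.

def geq_g(t, u):
    # t >=_g u: t is more generic than u (assumed component-wise order)
    if u == "BOT":                 # <∅,...,∅> lies below every d-tuple
        return True
    if t == "BOT":
        return False
    return all(x == "ALL" or x == y for x, y in zip(t, u))

def in_solution_space(t, s_plus, g_plus):
    # representation (S+, G+): t generalizes some u of S+ and specializes some v of G+
    return any(geq_g(t, u) for u in s_plus) and any(geq_g(v, t) for v in g_plus)

s_plus = ["BOT"]                                          # S+ccm = {<∅,∅,∅>}
g_plus = [("a2","ALL","ALL"), ("a3","ALL","ALL"), ("ALL","b2","ALL"),
          ("ALL","ALL","c2"), ("a1","b1","ALL"), ("a1","ALL","c1"),
          ("ALL","b1","c1")]                              # G+ccm of Table 3
t19 = ("a2", "ALL", "c1")
print(in_solution_space(t19, s_plus, g_plus))             # True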

Interval-based representations of minimal cardinality: In the presence of monotone constraints, several equivalent interval-based representations of the constrained relational lattice are possible. In our working context, where the solution space may be very large, choosing a compact representation matters. As emphasized in the proof of Lemma 3.1, the sets S+ccm and G−ccm are small compared with S−ccm and G+ccm: the former is reduced to the single tuple of empty values, and the latter gathers the most generic d-tuples that do not satisfy the constraints. Between these two bounds, S+ccm is of course chosen, since its cardinality is minimal. Once this choice is made, two interval-based representations remain possible for conjunctions of monotone constraints: (S+ccm, G+ccm) or (S+ccm, S−ccm).

3.2. Interval-based representations for conjunctions of anti-monotone constraints

As for monotone constraints, we characterize the bounds of the relational lattice in the presence of a conjunction of anti-monotone constraints, then study the various equivalent interval-based representations of such a constrained lattice, illustrate them on our example, and give criteria for choosing an optimal representation.

Definition 3.3
Let r be a relation over R, TR(r) the relational lattice of this relation and ccam a conjunction of anti-monotone constraints. Then we have:

1) S+ccam = {t ∈ TR(r) | ccam(t) and ∄t′ ∈ TR(r) : ccam(t′) and t ≥g t′}

2) S−ccam = {t ∈ CAt(TR(r)) | ¬ccam(t)}

3) G+ccam = {<ALL,...,ALL>}

4) G−ccam = {t ∈ TR(r) | ¬ccam(t) and ∄t′ ∈ TR(r) : ¬ccam(t′) and t′ ≥g t}

Lemma 3.2
Let r be a relation over R, TR(r) its relational lattice and ccam a conjunction of anti-monotone constraints. Let TR(r)ccam denote the solution space. Then TR(r)ccam can be characterized as follows:

1) Representation with (S+ccam, G+ccam):
TR(r)ccam = {t ∈ TR(r) | ∃u ∈ S+ccam and ∃v ∈ G+ccam : t ≥g u and v ≥g t}

2) Representation with (G+ccam, G−ccam):
TR(r)ccam = {t ∈ TR(r) | ∃u ∈ G+ccam and ∄v ∈ G−ccam : u ≥g t and v ≥g t}

3) Representation with (S+ccam, G−ccam):
TR(r)ccam = {t ∈ TR(r) | ∃u ∈ S+ccam and ∄v ∈ G−ccam : t ≥g u and v ≥g t}

4) Representation with (S+ccam, S−ccam):
TR(r)ccam = {t ∈ TR(r) | ∃u ∈ S+ccam and ∄v ∈ S−ccam : t ≥g u and v ≥g t}

5) Representation with (S−ccam, G−ccam):
TR(r)ccam = {t ∈ TR(r) | ∄u ∈ S−ccam and ∄v ∈ G−ccam : u ≥g t and v ≥g t}

6) (S−ccam, G+ccam): no representation is possible.

Proof - Dual to the proof of Lemma 3.1.

Example 11 - Table 4 gives S+ccam, S−ccam, G+ccam and G−ccam of the example relational lattice (cf. Table 2) for the anti-monotone constraint "sum(M) ≥ 3".

S+ccam : <a1,b1,c1>  <a1,b2,ALL>  <a1,ALL,c2>
S−ccam : <a1,b1,c2>  <a1,b2,c1>  <a1,b2,c2>  <a2,b1,c1>  <a2,b1,c2>  <a2,b2,c1>  <a2,b2,c2>  <a3,b1,c1>  <a3,b1,c2>  <a3,b2,c1>  <a3,b2,c2>
G+ccam : <ALL,ALL,ALL>
G−ccam : <a2,ALL,ALL>  <a3,ALL,ALL>  <ALL,b1,c2>  <ALL,b2,c1>  <ALL,b2,c2>

Table 4. Bounds of TR(r) for "sum(M) ≥ 3"

Since <a2,ALL,ALL> belongs to G−ccam, every d-tuple specializing it violates the constraint; this is the case of <a2, ALL, c1>.

Interval-based representations of minimal cardinality: As indicated in the proof of Lemma 3.2, the sets S−ccam and G+ccam have lower cardinality than S+ccam and G−ccam: indeed, G+ccam is reduced to the single d-tuple whose values are all ALL, and S−ccam gathers the most specific d-tuples violating the constraints. We are thus left with a choice between two representations for conjunctions of anti-monotone constraints: (S+ccam, G+ccam) or (G+ccam, G−ccam).

3.3. Interval-based representations for conjunctions of hybrid constraints (monotone and anti-monotone)

To take conjunctions of hybrid constraints into account, we rely on some of the bounds introduced above.

Definition 3.4
Let r be a relation over R, TR(r) the relational lattice of this relation and cch a hybrid conjunction of constraints (i.e. cch = ccm ∧ ccam). We have:

1) S+cch = S+ccam

2) G−cch = G−ccam

3) S−cch = S−ccm

4) G+cch = {t ∈ G+ccm | ccam(t)}

Lemma 3.3
Let TR(r)cch denote the solution space. Then TR(r)cch can be characterized as follows:

1) Representation with (S+cch, G+cch):
TR(r)cch = {t ∈ TR(r) | ∃u ∈ S+ccam and ∃v ∈ G+cch : t ≥g u and v ≥g t}

2) Representation with (G−cch, G+cch):
TR(r)cch = {t ∈ TR(r) | ∄u ∈ G−ccam and ∃v ∈ G+cch : u ≥g t and v ≥g t}

3) Representation with (S−cch, G−cch):
TR(r)cch = {t ∈ TR(r) | ∄u ∈ S−ccm and ∄v ∈ G−ccam : t ≥g u and v ≥g t}

4) Representation with (S−cch, S+cch):
TR(r)cch = {t ∈ TR(r) | ∄u ∈ S−ccm and ∃v ∈ S+ccam : t ≥g u and t ≥g v}

5) Representation with (S+cch, G−cch): no representation is possible.

6) Representation with (S−cch, G+cch): no representation is possible.

Proof - Let Solcch denote the set {t ∈ TR(r) | cch(t)}. The result follows from Lemmas 3.1 and 3.2 and from the fact that Solcch = Solccm ∩ Solccam.

Interval-based representations of minimal cardinality: In the case of hybrid conjunctions, it is difficult to characterize an interval-based representation of minimal cardinality, since none of the bounds is reduced to a single d-tuple. In the version-space setting, T. Mitchell [MIT 82] shows that the borders S and G he defines provide an adequate representation, which is minimal according to [MEL 92]. Adapting this observation to our working context, we choose the bounds equivalent to S and G, i.e. S+ccam and G+cch. In our representation, every solution satisfying a hybrid constraint must generalize a solution satisfying the anti-monotone constraints and specialize a solution satisfying the conjunction of monotone and anti-monotone constraints. G+cch corresponds exactly to G, whereas S+ccam is a superset of S; the solution space, however, is identical. The choice of S+ccam is motivated by the goal of answering queries about the d-tuples satisfying only the conjunction of anti-monotone constraints (e.g. representing frequent multi-dimensional patterns), which the representation with Mitchell's border S does not allow. To recover S, it suffices to prune from S+ccam the elements that specialize no element of G+cch.

The representation by S+ccam and S−ccm turns out to be very useful, since it can serve as an intermediate representation for the extraction of S+ccam and G+cch. The soundness of this method follows from the convex structure of the constrained relational lattice defined in the next subsection. Moreover, this representation makes it possible to answer queries involving only the conjunction of monotone constraints (cf. Lemma 3.1) or only the conjunction of anti-monotone constraints (cf. Lemma 3.2).

3.4. Structure of the constrained relational lattice

A constrained relational lattice is not necessarily a lattice. In this subsection we show that this partially ordered set nevertheless has a mathematical structure, namely that of a convex space [VEL 93].

Definition 3.5
Let 〈P, ≤〉 be a partially ordered set. C ⊆ P is a convex space if ∀x, y, z ∈ P, x ≤ y ≤ z and x, z ∈ C ⇒ y ∈ C. C is thus bounded by two sets: a majorant or "upper set" defined by max(C), and a minorant or "lower set" defined by min(C).

Theorem 3.1
Every relational lattice with monotone and/or anti-monotone constraints is a convex space. Its majorant (respectively minorant) is max≥g(TRcont(r)) (respectively min≥g(TRcont(r))), cont ∈ {ccm, ccam, cch}.

Proof - By Lemmas 3.1, 3.2 and 3.3, the constrained relational lattice is bounded, hence it is a convex space.

Considering the representations with S+cch and S−cch, applying Theorem 3.1 readily yields G+cch = min≥g({t ∈ TR(r) | ∄u ∈ S−ccm and ∃v ∈ S+ccam : t ≥g u and t ≥g v}).
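For the running example, this extraction is a short filter in Python. The sketch below is ours: the domains are read off Tables 3 and 4, <∅, ∅, ∅> is left out of the enumeration (it satisfies neither condition), and the component-wise generalization order is an assumption.

from itertools import product as cartesian

ALL = "ALL"
doms = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2"]]    # toy domains

def geq_g(t, u):                    # t >=_g u, assumed component-wise order
    return all(x == ALL or x == y for x, y in zip(t, u))

space = list(cartesian(*[d + [ALL] for d in doms]))        # TR(r) without <∅,∅,∅>
s_minus_ccm = [("a1", ALL, ALL), (ALL, "b1", ALL), (ALL, ALL, "c1")]      # Table 3
s_plus_ccam = [("a1", "b1", "c1"), ("a1", "b2", ALL), ("a1", ALL, "c2")]  # Table 4

cand = [t for t in space
        if not any(geq_g(t, u) for u in s_minus_ccm)       # t generalizes no u of S-ccm
        and any(geq_g(t, v) for v in s_plus_ccam)]         # t generalizes some v of S+ccam
g_plus_cch = [t for t in cand
              if not any(u != t and geq_g(u, t) for u in cand)]  # min_{>=g}: most generic
print(g_plus_cch)   # five d-tuples, each indeed an element of G+ccm of Table 3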

4. The EVA algorithm

Levelwise algorithms have proved efficient when very large datasets stored on disk must be processed. Recent improvements regarding the number of database scans and candidate generation are given in [BAS 00, GEE 01, STU 02]. For these reasons, we adopt their principles to propose an algorithm computing the bounds of the interval-based representations of the relational lattice. More precisely, the EVA algorithm returns the bounds S+ and G+ corresponding to the type of constraints handled and, depending on the case, uses G− or G+ for pruning. Its general principles follow those of levelwise algorithms; it differs in (i) the level-construction phase, based in our case on the Product operator, and (ii) the pruning phase, which is performed without backtracking, as in [DEH 99]. The initialization phase (lines 1 to 5) specifies, according to the constraints at hand, the three relevant bounds, relying on Definitions 3.2, 3.3 and 3.4. If the conjunction of constraints includes no monotone constraint, then G+ is the most generic d-tuple (line 1). Otherwise, the initial candidate set L1 consists of the most generic d-tuples apart from <ALL,...,ALL>, i.e. the atoms of TR(r); only the atoms satisfying both ccm and ccam are kept in G+ (line 2). If there is no anti-monotone constraint, the bound S+ is initialized to <∅, ..., ∅> (line 3); otherwise it is identical to the candidate set, i.e. the set of atoms satisfying ccam, and in that case G− gathers all the atoms violating ccam (line 4).

At each level, two cases are examined: absence or presence of anti-monotone constraints. In the first case, the current candidate set Li is pruned by removing all the elements of G+ (line 8), so that only the d-tuples violating the monotone constraints remain in Li. The new candidate set Li+1 is built by taking the Product of distinct d-tuples of Li (line 9), and the resulting d-tuples satisfying the monotone constraints are added to G+ (line 10).

In the presence of anti-monotone constraints, every new candidate is computed as the Product of two candidates of the previous level and must specialize no d-tuple of G− (line 12). The new candidates violating ccam are added to G− (line 13) and removed from Li+1, which then contains only d-tuples satisfying the anti-monotone constraints (line 14). Finally, the bounds S+ and G+ are recomputed by keeping, respectively, the most specific d-tuples of S+ ∪ Li+1 (max operator, line 15) and the most generic ones (min operator, line 16).

Complexity of the EVA algorithm

The complexity of levelwise algorithms for an anti-monotone constraint was studied by H. Mannila et al. [MAN 97] using the positive and negative borders. The theoretical complexity is given in [GUN 97] as: O((|BD−(Sol)| + |Sol|) ∗ cost of testing ccam), where BD− is the negative border (e.g. the minimal non-frequent patterns in frequent-pattern mining) and Sol is the set of solutions of the problem. This complexity generalizes readily to the levelwise algorithm EVA, which handles conjunctions of monotone and/or anti-monotone constraints. We also take the cost of database scans into account. The complexity of EVA is therefore O((|P| + |Q|) ∗ max(cost of testing ccam, cost of testing ccm) + |D| ∗ scan cost), where:

– if ccam = ∅: P = G+ccm and Q = {t ∈ TR(r) | ¬ccm(t)};

– otherwise: P = G−ccam and Q = TR(r)ccam.

The precise complexity of EVA for extracting frequent, correlated, emerging, etc. multi-dimensional patterns can be improved by using advanced data structures, approaching the complexity of the operations given in the theorem of D.M. Yellin [YEL 94] stated below:


Theorem 4.1
A sequence of n operations, where each operation is an addition, a deletion, a membership test or an insertion, can be performed in the worst case in O(√n log(n)) time per operation, each operation using O(n^(3/2)) memory space.

Hence the complexity of EVA for this type of constraints is O((|P| + |Q|) √|r| log(|r|) + b ∗ scan cost), where b = max{card(rank(t)) | t ∈ S+} (b being the maximum number of scans of the relation).

Alg. 1 EVA algorithm: levelwise construction of the intervals S+, G+ representing the relational lattice under conjunctions of monotone, anti-monotone or hybrid constraints
Input: relation r over R, ccam and ccm
Output: S+, G+

1:  if ccm = ∅ then G+ = {<ALL,...,ALL>}
2:  else L1 = At(TR(r))
        G+ = {t ∈ At(TR(r)) | ccm(t) and ccam(t)}
3:  if ccam = ∅ then S+ = {<∅, ..., ∅>}
4:  else L1 = {t ∈ At(TR(r)) | ccam(t)}
        S+ = L1
        G− = {t ∈ At(TR(r)) | ¬ccam(t)}
5:  i = 1
6:  while (Li ≠ ∅) do
7:    if (ccam = ∅) then
8:      Li = Li \ G+
9:      Li+1 = {v = t • t′ | t ≠ t′, t, t′ ∈ Li, v ≠ <∅, ..., ∅> and ∄u ∈ G+ : u ≥g v}
10:     G+ = {t ∈ Li+1 | ccm(t)}
11:   else
12:     Li+1 = {v = t • t′ | t ≠ t′, t, t′ ∈ Li, v ≠ <∅, ..., ∅> and ∄u ∈ G− : u ≥g v}
13:     G− = {t ∈ Li+1 | ¬ccam(t)}
14:     Li+1 = Li+1 \ G−
15:     S+ = max(S+ ∪ Li+1)
16:     if ccm ≠ ∅ then G+ = min(G+ ∪ {t ∈ Li+1 | ccm(t)})
17:   end if
18:   i = i + 1
19: end while
20: S+ = {t ∈ S+ | ∃t′ ∈ G+ : t′ ≥g t}   \\ to obtain S+ of minimal cardinality if needed
21: return S+, G+
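As an illustration, here is a compact Python transcription of the anti-monotone branch of Alg. 1 (lines 12 to 15) for the frequency constraint ccam(t) ≡ support(t) ≥ minsup; the monotone bound G+ (lines 10 and 16) is omitted for brevity. The d-tuple encoding, the Product operator • (component-wise merge, collapsing to <∅, ..., ∅> on conflicting values) and the support computation are our assumptions, consistent with, but not spelled out in, this excerpt.

from itertools import combinations

ALL, BOT = "ALL", "BOT"

def geq_g(t, u):                  # t >=_g u: t is more generic than u
    if u == BOT: return True
    if t == BOT: return False
    return all(x == ALL or x == y for x, y in zip(t, u))

def product(t, u):                # assumed Product: merge components, conflict -> BOT
    out = []
    for x, y in zip(t, u):
        if x == ALL:
            out.append(y)
        elif y == ALL or x == y:
            out.append(x)
        else:
            return BOT
    return tuple(out)

def support(t, r):                # number of tuples of r that t generalizes
    return sum(geq_g(t, row) for row in r)

def eva_antimonotone(r, minsup):
    d = len(r[0])
    atoms = {tuple(row[i] if j == i else ALL for j in range(d))
             for row in r for i in range(d)}
    level = {t for t in atoms if support(t, r) >= minsup}          # line 4
    s_plus, g_minus = set(level), atoms - level
    while level:                                                   # lines 6-19
        nxt = {product(t, u) for t, u in combinations(level, 2)}
        nxt = {v for v in nxt if v != BOT
               and not any(geq_g(u, v) for u in g_minus)}          # line 12
        bad = {v for v in nxt if support(v, r) < minsup}           # line 13
        g_minus |= bad
        nxt -= bad                                                 # line 14
        pool = s_plus | nxt                                        # line 15: max(...)
        s_plus = {t for t in pool
                  if not any(t != u and geq_g(t, u) for u in pool)}
        level = nxt
    return s_plus, g_minus

For instance, eva_antimonotone([("a1","b1","c1"), ("a1","b1","c2"), ("a2","b1","c1")], 2) returns S+ = {("a1","b1",ALL), (ALL,"b1","c1")}, the most specific d-tuples with support at least 2.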


5. Relational lattices vs powerset lattices in multi-dimensional data mining

To the best of our knowledge, there is no general approach aiming at an algebraic foundation for multi-dimensional data mining. Researchers have previously tried to extend to the n-ary context solutions proven over binary databases, using the powerset lattice as the search space, notably for quantitative association rules [SRI 96] and for classification [GAN 93, LIU 98, DON 99, LI 01]; but these attempts are conclusive in a multi-dimensional context neither from a formal nor from an experimental point of view. For instance, on the experimental side, extensions of Apriori to the computation of iceberg data cubes [BEY 99, HAN 01] have produced extremely poor execution times.

On the theoretical side, we propose in this section a comparative analysis of this kind of approach and of ours, studying the search spaces to be explored, the solution spaces, and the behavior of levelwise algorithms.

Considering the powerset lattice L(r) and the relational lattice TR(r) as search spaces for constrained multi-dimensional pattern extraction, we propose a comparison along four distinct lines: the size of the lattices, their characteristics, the soundness of the solutions obtained for conjunctions of constraints, and the complexity of levelwise algorithms.

– Size of the lattices:
Let us examine the size of the two lattices, as well as the maximal size of their levels. |L(r)| = 2^(ΣA∈D |Dim(A)|), whereas |TR(r)| = ΠA∈D(|Dim(A)| + 1) + 1 (Proposition 2.2). A non-exponential upper bound on the cardinality of the relational lattice is O((maxA∈D(|Dim(A)| + 1))^|D|). Consider for example a relation with 5 attributes, each with 10 values: |L(r)| = 2^50 = 1 125 899 906 842 624, whereas |TR(r)| = 11^5 + 1 = 161 052.

Let n = ΣA∈D |Dim(A)|. The size of the widest level in L(r) is bounded by the binomial coefficient C(n, n/2), which is asymptotically 2^n √(2/(πn)) [BEE 84], whereas the maximal size of a level of the relational lattice is C(|D|, |D|/2) · maxA∈D(|Dim(A)|)^|D|, which is asymptotically 2^|D| √(2/(π|D|)) · maxA∈D(|Dim(A)|)^|D|.

In conclusion, the size of the widest level in L(r) is exponential in the number of attribute values of the relation (i.e. ΣA∈D |Dim(A)|), whereas that of TR(r) is exponential in the number of attributes (i.e. |D|). Note that in practice |D| is generally a constant.
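The figures above are easy to check mechanically; note that the widest-level expressions are the bounds quoted in the text, not exact counts (Python):

from math import comb

dims = [10] * 5                          # 5 attributes with 10 values each
n = sum(dims)                            # total number of attribute values
size_L = 2 ** n                          # |L(r)| = 2^50 = 1125899906842624
size_TR = 1
for d in dims:
    size_TR *= d + 1
size_TR += 1                             # |TR(r)| = 11^5 + 1 = 161052
widest_L = comb(n, n // 2)               # C(50, 25): widest level of L(r)
widest_TR = comb(len(dims), len(dims) // 2) * max(dims) ** len(dims)
print(size_L, size_TR, widest_L, widest_TR)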


– Characteristics of the lattices:
Since the order relation on the two lattices is not the same, two consequences follow:

- The operators ∧ and ∨ differ between the two lattices: in the powerset lattice, ∧ = ∩ and ∨ = ∪, whereas in the relational lattice, ∧ = + and ∨ = •.

- The powerset lattice is distributive, whereas the relational lattice is atomic, co-atomic and non-distributive (cf. Theorem 2.1).

From a more conceptual standpoint, the powerset lattice is the lattice of a set of atoms (attribute values), whereas the relational lattice is the lattice of a set of molecules (d-tuples).

– Soundness of the solutions:
We explained in the introduction that, for multi-dimensional problems, the powerset lattice L(r) includes semantically erroneous solutions, whereas the relational lattice TR(r) is exactly the valid search space. More precisely, the order embedding Φ (Proposition 2.1) shows that for every d-tuple of the relational lattice there exists an equivalent element in the powerset lattice, with correct semantics, whereas the converse is false since Φ is injective but not bijective: ∀t ∈ TR(r), ∄ai, aj ∈ Φ(t) with ai, aj ∈ Dim(Ak), by Definition 2.1. Yet ∀Ak ∈ D, ∀ai, aj ∈ Dim(Ak) with i ≠ j, {ai, aj} ∈ L(r), even though we know such combinations are invalid. If an anti-monotone constraint is used for the extraction, an Apriori-like algorithm prunes these invalid pairs at the second level; if the constraint is monotone, however, these pairs may end up in the result.

– Complexity of levelwise algorithms:
Generating erroneous patterns obviously has consequences for the underlying algorithms. We highlight them by comparing the sizes of the relevant borders for monotone or anti-monotone constraints. Consider the most generic solutions satisfying ccm for L(r) and TR(r). We have: |G+ccm(L(r))| = O(|G+ccm(TR(r))| + ΣA∈D (|Dim(A)|² − |Dim(A)|)/2). For anti-monotone constraints, the negative border for L(r) also contains erroneous multi-dimensional patterns (the pairs of values of one and the same attribute), so its size is larger than that of G−ccam for TR(r). In fact, the number of additional elements in the border G−ccam for L(r) is exactly the maximal number given above (the same pairs are involved), i.e. |G−ccam(L(r))| = |G−ccam(TR(r))| + ΣA∈D (|Dim(A)|² − |Dim(A)|)/2. This growth of the negative border harms the complexity of levelwise algorithms all the more as the attributes have large value sets. Hence the inefficiency, in a multi-dimensional context, of levelwise algorithms exploring L(r), which at the second level build all possible combinations, compute their support, and inflate the borders.


6. Conclusion

In this article, we have introduced a formal framework for solving various data mining problems over n-ary databases. We have proposed an original algebraic structure, the relational lattice, as a search space. We have studied the representations of the relational lattice under monotone and/or anti-monotone constraints by drawing up a typology of its interval-based representations; we have shown that the constrained relational lattice is a convex space; and we have proposed an algorithm that preserves the complexity of levelwise algorithms while computing a representation for the various conjunctions of constraints. Finally, we have compared the relational lattice with the powerset lattice to show that our structure represents only the semantically valid multi-dimensional patterns, whereas the powerset lattice offers a superset of them, adjoining erroneous multi-dimensional patterns; the consequence for levelwise search algorithms has been studied.
An ongoing extension of this work is the definition of a concise representation, in the context of constrained multi-dimensional patterns, allowing the reconstruction not only of the solution space but also of the associated measures (or supports).

7. Bibliography

[AGA 96] AGARWAL S., AGRAWAL R., DESHPANDE P., GUPTA A., NAUGHTON J., RAMAKRISHNAN R., SARAWAGI S., « On the Computation of Multidimensional Aggregates », VLDB'96, 1996, p. 506-521.

[AGR 96] AGRAWAL R., MANNILA H., SRIKANT R., TOIVONEN H., VERKAMO A. I., « Fast Discovery of Association Rules », Advances in Knowledge Discovery and Data Mining, Cambridge, MA: AAAI/MIT Press, 1996, p. 307-328.

[BAS 00] BASTIDE Y., TAOUIL R., PASQUIER N., STUMME G., LAKHAL L., « Mining Frequent Patterns with Counting Inference », vol. 2(2), SIGKDD Explorations, 2000, p. 66-75.

[BAY 99] BAYARDO R., AGRAWAL R., « Mining the Most Interesting Rules », KDD, 1999, p. 145-154.

[BAY 00] BAYARDO R., AGRAWAL R., GUNOPULOS D., « Constraint-Based Rule Mining in Large, Dense Databases », vol. 4(2/3), Data Mining and Knowledge Discovery, 2000, p. 217-240.

[BEE 84] BEERI C., DOWD M., FAGIN R., STATMAN R., « On the Structure of Armstrong Relations for Functional Dependencies », vol. 31(1), JACM, 1984, p. 30-46.

[BEY 99] BEYER K., RAMAKRISHNAN R., « Bottom-Up Computation of Sparse and Iceberg Cubes », ACM SIGMOD, 1999, p. 359-370.

[BRI 97] BRIN S., MOTWANI R., SILVERSTEIN C., « Beyond Market Baskets: Generalizing Association Rules to Correlations », ACM SIGMOD, 1997, p. 265-276.

[CAS 02] CASALI A., CICCHETTI R., LAKHAL L., « Treillis Relationnel et Data Mining Multi-Dimensionnel », report, July 2002, LIF, CNRS UMR 6166.

[CIC 01] CICCHETTI R., NOVELLI N., LAKHAL L., « APIC: an Efficient Algorithm for Computing Iceberg Datacubes », BDA, 2001, p. 229-242.

[DEH 99] DEHASPE L., TOIVONEN H., « Discovery of Frequent DATALOG Patterns », vol. 3, Data Mining and Knowledge Discovery, 1999, p. 7-36.

[DON 99] DONG G., LI J., « Efficient Mining of Emerging Patterns: Discovering Trends and Differences », KDD, 1999, p. 43-52.

[DON 01] DONG G., HAN J., LAM J., PEI J., WANG K., « Multi-Dimensional Constrained Gradients in Data Cubes », VLDB'01, Italy, 2001, p. 321-330.

[GAN 93] GANASCIA J.-G., « TDIS: an Algebraic Formalization », IJCAI, 1993, p. 1008-1015.

[GEE 01] GEERTS F., GOETHALS B., VAN DEN BUSSCHE J., « A Tight Upper Bound on the Number of Candidate Patterns », ICDM, 2001, p. 155-162.

[GRA 97] GRAY J., CHAUDHURI S., BOSWORTH A., LAYMAN A., REICHART D., VENKATRAO M., PELLOW F., PIRAHESH H., « Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals », vol. 1(1), Data Mining and Knowledge Discovery, 1997, p. 29-53.

[GRA 00] GRAHNE G., LAKSHMANAN L. V. S., WANG X., « Efficient Mining of Constrained Correlated Sets », ICDE, 2000, p. 512-521.

[GUN 97] GUNOPULOS D., KHARDON R., MANNILA H., TOIVONEN H., « Data Mining, Hypergraph Transversals, and Machine Learning », ACM PODS, 1997, p. 209-216.

[HAN 01] HAN J., PEI J., DONG G., WANG K., « Efficient Computation of Iceberg Cubes with Complex Measures », ACM SIGMOD, 2001.

[HO 97] HO C.-T., AGRAWAL R., MEGIDDO N., SRIKANT R., « Range Queries in OLAP Data Cubes », ACM SIGMOD, 1997, p. 73-88.

[LI 01] LI W., HAN J., PEI J., « CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules », ICDM, 2001, p. 369-376.

[LIU 98] LIU B., HSU W., MA Y., « Integrating Classification and Association Rule Mining », KDD, 1998, p. 80-86.

[LOP 02] LOPES S., PETIT J.-M., LAKHAL L., « Functional and Approximate Dependency Mining: Databases and FCA Points of View », Journal of Experimental and Theoretical Artificial Intelligence (JETAI), Special Issue on Concept Lattice-based Theory, Methods and Tools for Knowledge Discovery in Databases, 2002.

[LU 00] LU H., FENG L., HAN J., « Beyond Intratransaction Association Analysis: Mining Multidimensional Intertransaction Association Rules », vol. 18(4), ACM TOIS, 2000, p. 423-454.

[MAN 96] MANNILA H., TOIVONEN H., « Multiple Uses of Frequent Sets and Condensed Representations », KDD, 1996, p. 189-194.

[MAN 97] MANNILA H., TOIVONEN H., « Levelwise Search and Borders of Theories in Knowledge Discovery », vol. 1(3), Data Mining and Knowledge Discovery, 1997, p. 241-258.

[MEL 92] MELLISH C., « The Description Identification Problem », vol. 52(2), Artificial Intelligence, 1992, p. 151-167.

[MIT 82] MITCHELL T. M., « Generalization as Search », vol. 18(2), Artificial Intelligence, 1982, p. 203-226.

[MIT 97] MITCHELL T. M., « Machine Learning », McGraw-Hill Series in Computer Science, 1997.

[NG 98] NG R. T., LAKSHMANAN L. V. S., HAN J., PANG A., « Exploratory Mining and Pruning Optimizations of Constrained Association Rules », ACM SIGMOD, 1998, p. 13-24.

[SRI 96] SRIKANT R., AGRAWAL R., « Mining Quantitative Association Rules in Large Relational Tables », ACM SIGMOD, 1996, p. 1-12.

[STU 02] STUMME G., TAOUIL R., BASTIDE Y., PASQUIER N., LAKHAL L., « Computing Iceberg Concept Lattices with Titanic », vol. 42(2), Data and Knowledge Engineering, 2002, p. 189-222.

[VEL 93] VAN DE VEL M., « Theory of Convex Structures », North-Holland, Amsterdam, 1993.

[WIJ 99] WIJSEN J., NG R. T., CALDERS T., « Discovering Roll-Up Dependencies », KDD, 1999, p. 213-222.

[YEL 94] YELLIN D. M., « An Algorithm for Dynamic Subset and Intersection Testing », vol. 129(2), TCS, 1994, p. 397-406.


ε-functional dependency inference: application to DNA microarray expression data

Discovery of ε-functional dependencies: application to the analysis of gene expression

Alexandre Aussem, Jean-Marc Petit

Laboratoire LIMOS (UMR-CNRS)
Université Blaise Pascal - Clermont-Ferrand II
63 173 Aubière cedex, France.
E-mail: alex,[email protected]

ABSTRACT. Nowadays, DNA microarray technology provides biologists with the ability to measure the expression levels of thousands of genes in a single experiment. As data from such experiments accumulates, it appears possible to attempt a reverse engineering of the underlying regulatory interactions from the expression data itself. This may be achieved by using sophisticated data mining techniques that have been successfully employed in the realm of database knowledge discovery. To this aim, we point out a new data mining problem, called the ε-functional dependency inference problem. This problem is closely related to functional dependency inference. Application of this ongoing work is illustrated on expression profiles from a sub-sample of genes from budding yeast Sacharomyces cerevisiae data, and preliminary results are given.

RÉSUMÉ. DNA microarray technology (biochips) now allows biologists to measure the expression levels of thousands of genes in a single experiment. Given the collected data, a major challenge is to elucidate the regulatory mechanisms between genes in order to identify potential new drugs. Classical data mining techniques, as used in knowledge discovery, can help to better understand these regulatory mechanisms. In this spirit, we exhibit a new data mining problem, called ε-functional dependency discovery. The inference method is first presented, then a preliminary implementation is applied to the analysis of the expression profile of a subset of genes of the yeast Sacharomyces cerevisiae.

KEYWORDS: Gene expression, Microarrays, Genomics, Functional dependencies, Data mining.

MOTS-CLÉS: Genetics, Biochips, Genomics, Functional dependencies, Data mining.


1. Introduction

DNA microarray technology provides biologists with the ability to measure the expression levels of thousands of genes in a single experiment. It is believed that genes of similar function yield similar expression patterns in microarray hybridization experiments [SHE 95]. As data from such experiments accumulates, it is essential to have accurate means for assigning functions to genes. Also, the interpretation of large-scale gene expression data provides opportunities for developing novel mining methods for selecting - for example - good drug candidates (all genes are potentially drug targets) from among tens of thousands of expression patterns [FUH 00, SCH 00]. Therefore, high performance knowledge discovery and data mining techniques play an increasingly important role in the analysis and discovery of sequence, structure and functional patterns or models from large databases of genetic sequences.

In this paper, we propose to adapt to this problem a sophisticated data mining technique that has been successfully employed in the realm of database knowledge discovery [HUH 98, LOP 00, NOV 01, MAN 94b, WYS 01]. Due to particularities of the data, we exhibit a new data mining problem, called the ε-functional dependency discovery problem. This problem is closely related to functional dependency inference. This paper first addresses the theory and techniques for extracting these ε-functional dependencies. In the last section, we then exhibit the significant ε-functional dependencies from the expression data of budding yeast Sacharomyces cerevisiae, measured in different DNA microarray hybridization experiments.

Paper organization

Section 2 introduces related contributions towards analysing gene expression data. The terminology used in this paper is given in Section 3. In Section 4, we point out a new data mining problem - the discovery of ε-functional dependencies - and give a framework in which this problem fits and can be resolved. Section 5 describes some experimental results obtained so far. We conclude and sketch some perspectives of this ongoing work in Section 6.

2. Related work

A number of standard and more sophisticated data analysis techniques have already been employed in the literature for analysing gene expression data from DNA microarray experiments [BRO 97, D'H 99, FUH 00, SPE 98, WOO 00].

Genes may be grouped in an unsupervised fashion based on a measure of similarity between expression patterns, narrowing the field of candidate drug targets and increasing the efficiency of the gene selection process (e.g. hierarchical clustering [SPE 98] or self-organizing maps). When prior knowledge of gene function is available, supervised learning techniques may be employed for functionally classifying genes. Supervised learning techniques use a training set to specify in advance which data should cluster together according to expert knowledge. They learn to discriminate between members of a given functional class based on positive and negative examples, e.g. support vector machines (SVM) [BRO 97]. A fuzzy logic approach to analyzing gene expression data is also discussed in [WOO 00].

However, the real challenge lies in inferring important functional relationships from these data. Beyond cluster analysis [SPE 98, PHA 02] lies the more ambitious purpose of genetic inference: finding out the underlying regulatory interactions from the expression data, using efficient inference procedures. In this setting, two unsupervised techniques that might be useful for tackling this difficult problem come to mind: association rules and functional dependencies.

Mining association rules is a very popular method that has attracted a lot of research over the last decade (see e.g. [HAN 00] for a survey). However, the main problem we face here is discretization, since we must perform a pre-treatment of the data to obtain a binary relation between experiences and some interesting subsets of gene expression values. The number of items will be larger than initially, which is very bad news, especially when large frequent itemsets (sets of genes) exist.

Mining functional dependencies is less popular than association rule mining and, for the time being, has not received much attention from the KDD community. Efficient algorithms for inferring functional dependencies from a relation exist; for instance, we quote Tane [HUH 98], Fdep [FLA 99], Dep-Miner [LOP 00], FastFD [WYS 01] and Fun [NOV 01]. First of all, it is worth noting that functional dependency inference could be achieved without any pre-treatment. However, the obtained knowledge would be useless, since most of the values of the relation lie in [−2, 2] and almost all values differ from each other (in the second digit). In other words, almost all genes are keys, which is of no interest to biologists. Nevertheless, we believe - and this is the main contribution of this paper - that functional dependency inference offers a nice setting for DNA microarray analysis, whenever noise in the data is taken into account in the definition of functional dependency satisfaction.

Approximate functional dependencies are studied in [KIV 95]. The authors propose several error measures, but none of them includes a proximity relation between the values. Rather, they consider to what extent a functional dependency is true in a relation (number of violating pairs, number of tuples to delete to obtain a satisfied functional dependency). In our context, such measures do not fit the application requirements, due to the small number of tuples.

3. Terminology

For the sake of clarity, we introduce a few notations for our problem, which are almost standard in the database community (see e.g. [MAN 94a, LEV 99]).


r     g0     g1     g2     g3     g4     g5
t1   -0.20   0.03  -0.29   0.18  -0.38   0.00
t2    0.00   0.41  -0.18  -0.07  -0.36  -0.18
t3   -0.07   0.38   0.07   0.07  -0.25  -0.10
t4   -0.23  -0.04  -0.30  -0.31   0.04   0.01
t5    0.00  -0.17  -0.20  -0.04  -0.29   0.10
t6   -1.12   0.24  -0.56   0.19   0.15   1.27
t7   -1.18   0.30  -0.15   0.26   0.06   1.01
t8   -0.56   0.31  -0.86   0.23   0.01   0.55

Table 1. A running example

Let G be a finite set ofgenes. Each geneG ∈ G takes its possible values inR,the set of real numbers. An experience (ortuple) overG is a mappingt : G → Rn.A relation is a set of experiences. We say thatr is a relationover G andG is therelation schemaof r. Let X ⊆ G be a gene set and t be a tuple; we denote byt[X] the restriction of t to X. Thedeclarationof a functional dependencyoverG is anexpressionG1 → G2 whereG1, G2 ⊆ G. Thesatisfactionof a functional dependencyG1 → G2 is defined with respect to a relationr: G1 → G2 is satisfied inr, denotedby r |= G1 → G2, if and only if ∀t, u ∈ r, if t[G1] = u[G1] thent[G2] = u[G2]. Afunctional dependencyG1 → G2 is minimal if G2 is not functionally dependent onany proper subset ofG1 andG1 → G2 is canonicalif |G2| = 1.

The application domain being DNA microarray analysis, two underlying "con-straints" have to be understood: firstly, the number of experiments is small (a fewhundreds at most) whereas the number of genes is large (several thousands). Suchconstraints differ widely from those usually held in databases where the number oftuples can be huge whereas the number of attributes rarely exceeds fifty.

Example 1- Let us consider a toy example made of a set of 8 experiences over a setof 6 genes as depicted in Table 1.

Throughout this paper, we will illustrate our approach with this example. 2

4. Mining ε-functional dependencies

It is worth noting that DNA microarray technology delivers numerical values with a relatively small confidence on these values; biologists then have to interpret the data, for example as levels of expression (i.e. discretization). In this setting, a question comes immediately to mind: how can we relax the functional dependency satisfaction definition to take into account the particularity of the data?


Keep in mind that the declaration of a functional dependency is not changed: we are only interested in defining a new satisfaction definition of a functional dependency in a relation.

4.1. Relaxing functional dependency satisfaction

To take noise in the data into account, the satisfaction of a functional dependency in a relation has to be relaxed. Since r |= X → Y can be rephrased as "equal X-values correspond to equal Y-values", we would like to obtain something like "close X-values correspond to close Y-values". Thus, instead of requiring strict equality between attribute values, we allow the absolute value of the difference to be at most ε. This leads to the following definition.

Definition 1 The functional dependency is epsilon-satisfied in a relation r, denoted by r |= G1 →ε G2, if and only if ∀t, u ∈ r, if ∀G ∈ G1, |t[G] − u[G]| ≤ ε then ∀G ∈ G2, |t[G] − u[G]| ≤ ε.

Note that in the above definition we have chosen a particular vector norm without loss of generality; the following discussion still holds for the Euclidean norm and/or the maximum norm. In the sequel, a canonical minimal ε-functional dependency will be referred to as an ε-functional dependency. Classical satisfaction of functional dependencies is recovered when ε = 0.

Thus, G1 →ε G2 can be interpreted in our context as follows: for each gene A of G1, if A has the same expression level (up to ε) in two experiments, then each gene B of G2 also has the same expression level (up to ε) in those experiments.
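As an illustration, Definition 1 transcribes directly into Python; the function name and the index-based data layout below are ours, not the authors':

def eps_satisfied(r, lhs, rhs, eps):
    # r |= lhs ->eps rhs: every pair of tuples close on all lhs genes
    # must also be close on all rhs genes (Definition 1)
    for t in r:
        for u in r:
            if all(abs(t[g] - u[g]) <= eps for g in lhs):
                if not all(abs(t[g] - u[g]) <= eps for g in rhs):
                    return False
    return True

# the tuples of Table 1, genes g0..g5 indexed by position
r = [(-0.20,  0.03, -0.29,  0.18, -0.38,  0.00),
     ( 0.00,  0.41, -0.18, -0.07, -0.36, -0.18),
     (-0.07,  0.38,  0.07,  0.07, -0.25, -0.10),
     (-0.23, -0.04, -0.30, -0.31,  0.04,  0.01),
     ( 0.00, -0.17, -0.20, -0.04, -0.29,  0.10),
     (-1.12,  0.24, -0.56,  0.19,  0.15,  1.27),
     (-1.18,  0.30, -0.15,  0.26,  0.06,  1.01),
     (-0.56,  0.31, -0.86,  0.23,  0.01,  0.55)]

print(eps_satisfied(r, {0, 2, 5}, {4}, 0.05))
# False: t1 and t4 are 0.05-close on g0, g2 and g5 but not on g4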

4.2. A new data mining task

The problem we are interested in can now be formulated as follows:

"Given a relation r over G and a user-specified threshold ε, find the functional dependencies epsilon-satisfied in r."

The basic ideas underlying our approach rely on: 1) the definition of a closure operator on G with respect to the data, implying the notion of closed set; 2) the definition of a sub-family of closed sets, the so-called epsilon-agree sets, obtained from the data; and 3) the characterization of the left-hand sides of epsilon-functional dependencies with respect to the minimal sub-family of closed sets. Basically, we make intensive use in the sequel of the equivalence between a closure operator, a family of closed sets and implication rules [DEM 92].


4.2.1. A closure operator on G wrt a relation

Given a user-specified threshold ε and a relation r over G, we define a mapping .+r from 2^G to 2^G as follows:

X+r = {A ∈ G | r |= X →ε A}, X ⊆ G

One can easily verify that this mapping is a closure operator on G, i.e. extensive (X ⊆ X+r), monotone (X ⊆ Y ⇒ X+r ⊆ Y+r) and idempotent (X+r = (X+r)+r).
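In code, the mapping reduces to one line on top of the eps_satisfied sketch given in Section 4.1 (again an illustration of ours):

def closure(X, r, eps, n_genes=6):
    # X+_r: the genes A such that r |= X ->eps {A}
    return {a for a in range(n_genes) if eps_satisfied(r, X, {a}, eps)}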

Let X ⊆ G, X is aclosed setwith respect tor if X = X+r . Let CLε(r) be the

set of closed set with respect to.+r andGENε(r) be the set of meet-irreducible sets ofCLε(r) (which is unique and minimal). From that point,ε-functional dependenciescan be obtained from several approaches. In this paper, we adapt propositions initiallymade in [MAN 94b, DEM 95] and improved in a data mining setting in [LOP 00,WYS 01].

4.2.2. Computing a subset of the closed sets

We define ε-agree sets both on two tuples of a relation and on a whole relation.

Definition 2 Let ε be a user-specified threshold, r a relation over G, ti, tj ∈ r and X ⊆ G. The ε-agree set of ti and tj is defined as follows: agε(ti, tj) = {G ∈ G | |ti[G] − tj[G]| ≤ ε}.

Note that this definition gives us a convenient way to describe functional dependencies that do not hold in the relation: let X = agε(ti, tj); then X yields all the functional dependencies not satisfied in {ti, tj}, since for each gene g in G \ X, X → g is not satisfied in {ti, tj} by definition of agε(ti, tj).

Definition 3 Let ε be a user-specified threshold and r a relation over G. The ε-agree sets of r are defined as follows: agε(r) = {agε(ti, tj) | ti, tj ∈ r, ti ≠ tj}.

Example 2 - From Example 1 with a threshold ε = 0.05 we have, for instance, ag0.05(t1, t2) = {g4} and ag0.05(t1, t4) = {g0, g2, g5}. Once every couple of tuples has been considered, we get the agree sets of r:

ag0.05(r) = {{g4}, ∅, {g0, g2, g5}, {g3}, {g1}, {g0, g2, g3}, {g2}, {g1, g3, g4}} □
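Definitions 2 and 3 also transcribe directly into Python (reusing the relation r encoded earlier; the helper names are ours):

from itertools import combinations

def agree_set(t, u, eps):
    # ag_eps(t, u): the genes on which t and u are eps-close (Definition 2)
    return frozenset(g for g in range(len(t)) if abs(t[g] - u[g]) <= eps)

def agree_sets(r, eps):
    # ag_eps(r): agree sets over all pairs of distinct tuples (Definition 3)
    return {agree_set(t, u, eps) for t, u in combinations(r, 2)}

print(sorted(map(sorted, agree_sets(r, 0.05))))
# yields the eight sets of Example 2, e.g. [4] = ag(t1, t2) and [0, 2, 5] = ag(t1, t4)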

From the ε-agree sets, we can easily make the connection with the closed sets previously defined.

Proposition 1 Let ε be a user-specified threshold and r a relation over G. Then:

agε(r) ⊆ CLε(r)

In other words, we can define closed sets from which implication rules can be derived, it being understood that such implication rules are, in our context, ε-functional dependencies and convey new and surprising knowledge on gene interaction.

4.2.3. Generating ε-functional dependencies

We first reformulate, in our framework, a result relating meet-irreducible sets to a sub-family of closed sets. Since we are interested in generating canonical minimal functional dependencies, we only have to characterize the set of minimal left-hand sides of functional dependencies for a given attribute.

Proposition 2 Let ε be a user-specified threshold, r a relation over G and A ∈ G.

GENε(A, r) = max⊆{X ∈ agε(r) | A ∉ X}

It is interesting to note that such sets allow us to describe excluded functional dependencies, i.e. ε-functional dependencies which do not hold in r, since ε-functional dependencies of the form Y →ε A, Y ∈ GENε(A, r), are never satisfied in r (by construction, as mentioned above). This kind of cover has a lot of nice features, but their description goes beyond the scope of this paper (see e.g. [GOT 90, LOP 01]).

Example 3 - Continuing with the previous example, let us consider the gene g4. The set GEN0.05(g4, r) can easily be computed from ag0.05(r):

GEN0.05(g4, r) = {{g0, g2, g5}, {g0, g2, g3}, {g1}}

In other words, we can give three excluded ε-functional dependencies for g4 in r:
r ⊭ {g0, g2, g5} → g4,
r ⊭ {g0, g2, g3} → g4,
r ⊭ {g1} → g4. □

Then, given a gene A, minimal left hand sides ofε-functional dependencies havingA in right hand side can be derived, for instance as given in Proposition 3.

Proposition 3 Let ε be a user-specified threshold, r a relation over G and A ∈ G.

lhs(A, r) = min⊆ {X ⊆ G | ∀Y ∈ GENε(A, r), X ⊈ Y}

Algorithmic considerations on these results can be found in [MAN 94b, KAV 99, LOP 00, WYS 01]. The key step is the computation of lhs(A, r), which is exponential in the number of genes. When implementing this step, we currently use a depth-first strategy to reduce the memory requirement (our algorithm is based on the propositions made in [KAV 99]).
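Computing lhs(A, r) amounts to enumerating the minimal transversals of the hypergraph whose edges are the complements G \ Y, for Y ∈ GENε(A, r): X ⊈ Y holds exactly when X intersects G \ Y. The following naive level-wise Java sketch illustrates the idea on small inputs (the actual implementation is in C++ and follows [KAV 99]; all names here are ours):

```java
import java.util.*;

/** Naive level-wise computation of lhs(A, r) as a minimal hypergraph
 *  transversal problem — an illustrative sketch, exponential as noted above. */
public class LhsSketch {

    static Set<Set<String>> lhs(List<String> genes, Set<Set<String>> genA) {
        // X must satisfy X ⊈ Y for every Y in GENε(A, r),
        // i.e. X must intersect every complement G \ Y.
        List<Set<String>> edges = new ArrayList<>();
        for (Set<String> y : genA) {
            Set<String> e = new HashSet<>(genes);
            e.removeAll(y);
            edges.add(e);
        }
        Set<Set<String>> minimal = new HashSet<>();
        // Enumerate candidates by increasing size; a candidate containing
        // an already-found transversal cannot be minimal.
        for (int size = 1; size <= genes.size(); size++)
            search(genes, new ArrayDeque<>(), 0, size, edges, minimal);
        return minimal;
    }

    static void search(List<String> genes, Deque<String> current, int from,
                       int size, List<Set<String>> edges, Set<Set<String>> out) {
        if (current.size() == size) {
            Set<String> x = new HashSet<>(current);
            for (Set<String> found : out)
                if (x.containsAll(found)) return;       // not minimal
            for (Set<String> e : edges)
                if (Collections.disjoint(x, e)) return; // misses an edge
            out.add(x);
            return;
        }
        for (int i = from; i < genes.size(); i++) {
            current.addLast(genes.get(i));
            search(genes, current, i + 1, size, edges, out);
            current.removeLast();
        }
    }
}
```

On the data of Example 3 (genes g0 to g5), this sketch returns exactly the six left-hand sides of Example 4 below.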


Example 4 - Let us again consider the gene g4. We get 6 minimal 0.05-functional dependencies in r for g4:

LHS0.05(g4, r) = {{g0, g1}, {g1, g2}, {g1, g3}, {g1, g5}, {g3, g5}, {g4}}  □

An important practical feature is described through the following example, where the threshold is larger.

Example 5 - From the running example, if the threshold had been set to a larger value, e.g. ε = 2.0, ε-functional dependencies would have been of the form ∅ →2.0 g, for every g ∈ G. Roughly speaking, the variation of the expression level of g remains within a window of size ε.  □

In this way, biologists have the opportunity to set the threshold from large values down to small ones. Thus, in an iterative process, they can easily identify genes whose variation of expression level overshoots the window. Once subsets of genes are identified, special care may be devoted to their analysis by considering each of them one by one as a potential right-hand side.

5. Experimental results: Yeast cell cycle analysis

For purposes of illustration, a number of ε-functional dependencies are extracted from a subset of the cell cycle-regulated genes of the yeast Saccharomyces cerevisiae identified by microarray hybridization. The yeast cell cycle analysis project's goal is to identify all genes whose mRNA levels are regulated by the cell cycle (see [SPE 98]). The data is available for download at genome-www.stanford.edu/cellcycle.

Unfortunately, the commendable desire to take into account all the genes in the database does not survive combinatorial considerations. Therefore, we were forced to restrict our attention to a subset of the genes. Fortunately, genes that share similar expression profiles have already been grouped in [SPE 98] by a hierarchical clustering algorithm. Using their classification tree, we selected a subset of genes sharing similar expression profiles, resulting in a 'relation schema' made up of 179 genes and a relation of 80 experiments. Our purpose here was to exhibit all ε-functional dependencies among those genes for varying ε values. Given the size of the relation, the ε-agree sets were easily computed.

All the experiments were carried out on a 900 MHz Intel Pentium III PC running Windows XP with 250 MB of memory. The code was implemented in C++ using the STL (Standard Template Library). The thresholds, the elapsed times, the numbers of agree sets and the numbers of functional dependencies of the form ∅ →ε A are reported in Table 2. Execution times exceeding 1 hour are marked with a "*". These results show that the algorithm performs relatively well (less than one hour) when the threshold is greater than 2.5.


Table 2. Times (in seconds) to extract ε-functional dependencies in a 179×80 relation.

threshold   elapsed time (s.)   # Agree   # non-standard FD
8           7.55                4         178
4           8.53                187       143
3.5         9.96                336       126
3           17.72               530       104
2.9         31.60               600       99
2.8         175.64              660       91
2.7         872.38              756       82
2.6         602.09              848       76
2.5         1596.5              941       71
2           *                   1589      *
0           *                   1885      *

Biological interpretation

For the time being, it seems difficult to draw any firm conclusion about the biological meaning of the numerous extracted ε-functional dependencies among the genes, even for experts in the field. Nevertheless, according to ongoing discussions with biologists, ε-functional dependencies are a new form of knowledge which raises a lot of interest for them. Indeed, such an approach differs widely from the classical ones used so far, such as clustering or classification.

Finally, we would like to stress that the interactive Web site genome-www.stanford.edu provides biologists with the opportunity to assess, to some extent, the biological plausibility of the proposed ε-functional dependencies by comparing the functional classes of the genes appearing in every individual ε-functional dependency. This requires considerable time and effort, though.

6. Conclusion

We discussed a new data mining problem, called the ε-functional dependency inference problem, in order to attempt a reverse engineering of the underlying regulatory gene interactions from the expression levels obtained in DNA microarray hybridization experiments. This new problem formulation differs from previous works, which attempt to classify or cluster experiments and/or genes.

The proposed approach is closely related to the theoretical framework introduced in [MAN 94b, DEM 95].

Note that, in another context, fuzzy functional dependencies have been studied (see e.g. [BOS 98]) and several new definitions of functional dependencies have been given. For instance, a sound and complete inference rule system is given for fuzzy functional dependencies in [CHE 94]. In this setting, ε-functional dependencies can be seen as a special case of fuzzy functional dependencies. Nevertheless, to the best of our knowledge, no serious contribution exists to infer fuzzy functional dependencies from a given relation.

One of the key features of this approach is to avoid the cumbersome pre-processing phase required before the data mining process can take place, either by defining fuzzy terms or by discretizing values into meaningful intervals. The method is illustrated on expression profiles from a subsample of genes from budding yeast Saccharomyces cerevisiae data.

This ongoing work raises a lot of perspectives. They can be roughly classified in two directions: the former concerns the problem of ε-functional dependency inference, in which many issues can be tackled, such as scalability, post-processing or new definitions of functional dependency satisfaction. The latter concerns the biological meaning of our results. We are currently pursuing collaborations with biologists to assess our propositions.

7. References

[BOS 98] BOSC P., DUBOIS D., PRADE H., "Fuzzy Functional Dependencies - An Overview and a Critical Discussion", Journal of the American Society for Information Science, vol. 49, 1998, p. 217–235.

[BRO 97] BROWN M., GRUNDY W., LIN D., CRISTIANINI N., SUGNET C., FUREY T., ARES M., HAUSSLER D., "Knowledge-based analysis of microarray gene expression data using support vector machines", Proceedings of the National Academy of Sciences, 1997.

[CHE 94] CHEN G., KERRE E., VANDENBULCKE J., "A Computational Algorithm for the Fuzzy FD Transitive Closure and a Complete Axiomatization of Fuzzy FD", Journal of Intelligent Systems, vol. 9, 1994, p. 421–439.

[DEM 92] DEMETROVICS J., LIBKIN L., MUCHNIK I. B., "Functional Dependencies in Relational Databases: A Lattice Point of View", Discrete Applied Mathematics, vol. 40, 1992, p. 155–185.

[DEM 95] DEMETROVICS J., THI V. D., "Some Remarks on Generating Armstrong and Inferring Functional Dependencies Relation", Acta Cybernetica, vol. 12, num. 2, 1995, p. 167–180.

[D'H 99] D'HAESELEER P., LIANG S., SOMOGYI R., "Gene expression data analysis and modelling", Pacific Symposium on Biocomputing, Hawaii, 1999.

[FLA 99] FLACH P. A., SAVNIK I., "Database Dependency Discovery: A Machine Learning Approach", AI Communications, vol. 12, num. 3, 1999, p. 139–160.

[FUH 00] FUHRMAN S., CUNNINGHAM M., WEN X., ZWEIGER G., SEILHAMER J., SOMOGYI R., "The application of Shannon entropy in the prediction of putative drug targets", BioSystems, vol. 55, 2000, p. 5–14.

[GOT 90] GOTTLOB G., LIBKIN L., "Investigations on Armstrong Relations, Dependency Inference, and Excluded Functional Dependencies", Acta Cybernetica, vol. 9, num. 4, 1990, p. 385–402.

[HAN 00] HAN J., KAMBER M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, first edition, 2000.

[HUH 98] HUHTALA Y., KÄRKKÄINEN J., PORKKA P., TOIVONEN H., "Efficient Discovery of Functional and Approximate Dependencies Using Partitions", Proc. of the 14th IEEE ICDE, 1998, p. 392–401.

[KAV 99] KAVVADIAS D. J., STAVROPOULOS E. C., "Evaluation of an Algorithm for the Transversal Hypergraph Problem", VITTER J. S., ZAROLIAGIS C. D., Eds., 3rd International Workshop on Algorithm Engineering, WAE'99, vol. 1668 of LNCS, London, UK, 1999, Springer.

[KIV 95] KIVINEN J., MANNILA H., "Approximate inference of functional dependencies from relations", TCS, vol. 149, num. 1, 1995, p. 129–149.

[LEV 99] LEVENE M., LOIZOU G., A Guided Tour of Relational Databases and Beyond, Springer-Verlag, 1999.

[LOP 00] LOPES S., PETIT J.-M., LAKHAL L., "Efficient Discovery of Functional Dependencies and Armstrong Relations", Proc. of the 6th EDBT, vol. 1777 of LNCS, Konstanz, Germany, 2000, Springer, p. 350–364.

[LOP 01] LOPES S., PETIT J.-M., LAKHAL L., "A framework for understanding existing databases", Proc. of the 6th Conf. on IDEAS, France, IEEE CS, 2001.

[MAN 94a] MANNILA H., RÄIHÄ K.-J., The Design of Relational Databases, Addison-Wesley, second edition, 1994.

[MAN 94b] MANNILA H., RÄIHÄ K.-J., "Algorithms for Inferring Functional Dependencies from Relations", DKE, vol. 12, num. 1, 1994, p. 83–99.

[NOV 01] NOVELLI N., CICCHETTI R., "FUN: An Efficient Algorithm for Mining Functional and Embedded Dependencies", Proc. of the ICDT, London, UK, vol. 1973 of LNCS, Springer-Verlag, 2001, p. 189–203.

[PHA 02] PHAN J. M., NG R. T., "GEA: A Toolkit for Gene Expression Analysis (demo)", SIGMOD'2002, Madison, USA, ACM, 2002.

[SCH 00] SCHERF U., et al., "A gene expression database for the molecular pharmacology of cancer", Nature Genetics, vol. 24, num. 3, 2000, p. 236–244.

[SHE 95] SHENA M., et al., "Quantitative Monitoring of Gene Expression Patterns with a cDNA Microarray", Science, num. 270, 1995, p. 467–470.

[SPE 98] SPELLMAN et al., "Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization", Molecular Biology of the Cell, vol. 9, 1998, p. 3273–3297.

[WOO 00] WOOLF P., WANG Y., "A fuzzy logic approach to analyzing gene expression data", Physiol. Genomics, vol. 3, 2000, p. 9–15.

[WYS 01] WYSS C., GIANNELLA C., ROBERTSON E. L., "FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances", KAMBAYASHI Y., WINIWARTER W., ARIKAWA M., Eds., Proc. of the 3rd DAWAK, Munich, Germany, September 5-7, vol. 2114 of LNCS, Springer-Verlag, 2001, p. 101–110.


Demonstrations


A Platform for Experimenting Disconnected Objects on Mobile Hand-Held Devices

Denis Conan — Sophie Chabridon — Olivier Villin — Guy Bernard

Institut National des Télécommunications, 9 rue Charles Fourier, 91011 Évry cedex, France

Denis.Conan, Sophie.Chabridon, Olivier.Villin, [email protected]

ABSTRACT. Mobile databases and distributed systems relying on wireless communication networks have to deal with variable levels of connectivity. This demonstration proposes a generic Disconnected Object Management (DOM) service that enables work continuity even when weakly connected or disconnected. For better agility and fidelity, we offer both application-aware and application-transparent adaptations.

RÉSUMÉ. Mobile databases and distributed systems using wireless networks must cope with strong variations in the level of connectivity. This demonstration proposes a generic disconnected object management service that makes it possible to keep working in weakly connected or even disconnected modes. To improve agility and fidelity, we offer an adaptation that is transparent to the application as well as an adaptation based on collaboration between the application and the system.

KEYWORDS: Mobility, disconnection, CORBA, wireless networks.

MOTS-CLÉS: Mobility, disconnection, CORBA, wireless networks.


1. Introduction

An important characteristic of mobile environments is that they suffer from frequent disconnections. A disconnection is a normal event in such environments and should not be considered as a failure. This has a profound impact on how transaction management is implemented and how data consistency is guaranteed in such environments [BAR 99]. We distinguish between two kinds of disconnections: voluntary disconnections, when the user decides to work on their own to save battery or communication costs, or when radio transmissions are prohibited, as aboard a plane, and involuntary disconnections, due to physical wireless communication breakdowns, such as in an uncovered area or when the user has moved out of the reach of a base station. We also handle the case where communication is still possible but not at an optimal level. It corresponds to what has been called weak connectivity [MUM 95]; it results from intermittent communication, low-bandwidth, high-latency or expensive networks.

The weak connectivity of mobile environments, in conjunction with the relative resource poverty of hand-held devices, leads to a trade-off between autonomous applications and interdependent distributed applications. This trade-off is well explained in [NOB 97], where the range of strategies for adaptation brings out three design alternatives: no system support (laissez-faire strategy), collaboration between the applications and the system (application-aware strategy), and no changes to the applications (application-transparent strategy). Previous works [JOS 97, MUM 95, NOB 97, PET 97] have demonstrated the possibility that a system can provide good performance even when the network bandwidth varies over several orders of magnitude, but also the need for application intervention to improve agility (speed and accuracy) in reaction to changes in resource availability and to specify fidelity in terms of data consistency.

The contribution of this demonstration is to propose a Disconnected Object Management (DOM) service for application-aware adaptation in addition to application-transparent adaptation. It is a service that enables work continuity in a transparent manner even when weakly connected or disconnected. This work should be seen as a first step towards a complete mobile information management system; it defines the main basic components of a generic infrastructure to experiment with deployment, replication and various reconciliation strategies.

The remainder of this presentation details the architecture (Section 2), the example application (Section 3) and the demonstration (Section 4).

2. Architecture

In a classical distributed application with strong connectivity, the graphical user interface is loaded on the mobile terminal and the server objects are hosted on machines of the wired network. Service continuity while disconnected implies transferring some elements of the servers to the mobile terminal before losing connectivity, logging operations or state changes during the disconnection, and re-integrating when re-connecting. In order to support multiple applications concurrently, some parts of resource management and log management are centralised and application-transparent. For application-awareness, these services are realised by objects that accept requests from applications. The application-aware resource management service abstracts to applications the connectivity information provided by the operating system, and applications can specify which resources and resource levels correspond to bad, weak, or strong connectivity, thus improving agility.

Figure 1 presents the architecture of the DOM service. More precisely, it depicts UML-like collaboration diagrams of the client sending a request to a remote object when the connectivity is strong (case 2.a) and then sending a request in the case of weak connectivity (case 2.b).

[Figure 1: a UML-like collaboration diagram involving the Client, an Interceptor, the disconnected object DO, the managers DOM, RM, CM and LM, and the Remote Object. Legend: 1: getCMInfo(); 2.a: Client's request to the Remote Object; 2.b: Client's request to the DO; 2.b.1: getCMInfo(); 2.b.2.a: DO's request to the Remote Object; 2.b.2.b: addLog(); 2.b.2.b.1: <<periodic>> getCMInfo(); 2.b.2.b.2: DO's request to the remote object.]

Figure 1. The DOM service architecture.

All the rectangles in Figure 1 represent objects. All the requests from and the responses to the client are intercepted. On request sending, the interceptor acts as a switch between the disconnected object DO and the remote object. On response reception, the interceptor detects possible communication failures between the sending of the request and the reception of the response. A disconnected object is an object which is similar in design and implementation to the remote object, but specifically built for supporting disconnection and weak connectivity. It is the application designer's responsibility to balance between an easy design and a more complex one that adapts better to connectivity variations.


Disconnected objects are associated with the client via application-transparent portable interceptors. If the client wants application-aware adaptation, it obtains the reference of the disconnected object manager DOM, for example from a file stored on the mobile terminal. The DOM is the entry point of the DOM service to find the other managers. The resource manager RM is a factory of connectivity managers CMs. A CM realises the abstraction of connectivity information related to one resource. The policy currently implemented associates a CM per logical link between a client and a remote object; it is the finest granularity at the middleware level.

The interceptor obtains the connectivity information from the CM (1) and then decides where the client's request must be issued. When the connectivity is strong, i.e. in the connected mode, the client's request leaves the mobile terminal to reach the remote object (2.a). As a result, the DO cannot keep up to date with the latest requests. Therefore, the DO should periodically call the remote object for an incremental state transfer [KHU 02].

When the connectivity becomes weak or null, forcing the client to enter the partially connected or disconnected mode respectively, the client's request is issued to the DO (2.b). The requests that follow in the scenario are application-dependent because the DO is built by the application's designer. The DO updates its state and prepares a new request, called a DO request, for the remote object. The simplest case is that the DO request is equivalent in parameters' content and operation name to the client's request. Next, the DO asks its CM for connectivity information (2.b.1). We give two possible ends to this scenario: 2.b.2.a and 2.b.2.b.

In the partially connected mode (2.b.2.a), the operations are executed locally and remotely; the DO sends the DO request to the remote object and updates its state again if necessary. In the disconnected mode (2.b.2.b), the operations are executed only locally. If appropriate, the DO encodes the DO request in a data container and sends it to the log manager LM. Periodically, the LM decodes the logged request with a method provided previously by the DO, tests the connectivity (2.b.2.b.1) and, if possible, i.e. in the partially connected mode, forwards the DO request to the remote object (2.b.2.b.2). Clearly, the execution when disconnected is not equivalent to an execution while connected or partially connected. This is acceptable provided that the connectivity information is visualised by an iconic image in the client's user interface. This is very close to the relaxed check-out mode proposed in [HOL 00], where the users know that they are viewing stale data but it is still useful for them to do so.
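As an illustration of this switching behaviour, here is a minimal Java sketch (the real service is built with CORBA portable interceptors; the types and names below are our assumptions, not the platform's API):

```java
// Illustrative sketch of the interceptor's switching logic.
enum Connectivity { STRONG, WEAK, NONE }

interface ConnectivityManager { Connectivity getCMInfo(); }

interface Invocable { Object invoke(String operation, Object[] args); }

final class InterceptorSketch implements Invocable {
    private final Invocable remoteObject;       // object on the wired network
    private final Invocable disconnectedObject; // local DO on the terminal
    private final ConnectivityManager cm;       // CM for this logical link

    InterceptorSketch(Invocable remote, Invocable local, ConnectivityManager cm) {
        this.remoteObject = remote;
        this.disconnectedObject = local;
        this.cm = cm;
    }

    @Override
    public Object invoke(String operation, Object[] args) {
        // (1) ask the CM for the connectivity level, then route the request:
        // strong connectivity -> remote object (2.a);
        // weak or no connectivity -> disconnected object (2.b), which may
        // forward a DO request remotely (2.b.2.a) or log it via the LM (2.b.2.b).
        if (cm.getCMInfo() == Connectivity.STRONG) {
            return remoteObject.invoke(operation, args);
        }
        return disconnectedObject.invoke(operation, args);
    }
}
```

The real interceptor additionally detects communication failures on response reception, as described above.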

In addition, users can disconnect or re-connect voluntarily by calling the corresponding operations; these calls are addressed to the remote objects, intercepted, and (re-)directed to the DO. In case of voluntary disconnection, the DO is responsible for initiating the loading of the state from the remote object. When re-connecting voluntarily, the DO asks the LM to flush the logged data pessimistically, that is, the re-connection is successful only if the log is empty when the call returns.

For the partially connected and disconnected modes, we log and propagate operations instead of state contents/changes as in [PET 97]. In fact, this is application-dependent since the log manager does not interpret logged requests. The code that can parse and forward the logged requests is provided at initialization time by disconnected objects (via objects by value (OBV) in the CORBA environment) that we name DO request interpreters. Disconnected objects and DO request interpreters can log and propagate either operations or state changes. The log manager receives as an in parameter a DO request interpreter, that is a description of the state and the code of the object that is responsible for the interpretation of future logged requests, and a new instance is automatically created in the execution entity of the log manager. Provided that all the DO request interpreters inherit the same abstract interface, the log manager is generic and application-independent.

For more details on the DOM service, more especially on interfaces for application-aware adaptation, the reader can refer to [Viv02].

3. Example application: A wireless email browser

This section illustrates the adaptation of an email browser to wireless environments. Our email browser offers the basic functionalities of well-known software such as Netscape Messenger or Microsoft IE. The user handles messages composed of a body and a header, itself divided into an identifier, the names of the sender and the receiver, a subject, the date of sending, and a status (read or unread). The main functionalities provided by the graphical user interface (GUI) are sending, replying to, forwarding, receiving and deleting a message.

In the first version of the email browser, named "centralised", the GUI is executed in the same execution entity as the user mailbox object. The mailbox object plays two roles: (1) the mailbox object stores received emails; (2) for sending a message, the GUI sends the message to the mailbox object, the latter gets the receiver's address from the mailbox manager object and forwards the message. A second execution entity contains the mailbox manager, which is responsible for creating, deleting and localising the mailbox objects.

The second version of the email browser, named "distributed", is obtained by separating the GUI and the user mailbox object into different execution entities. The GUI is launched by the user on the mobile terminal and communicates with the corresponding mailbox object via wireless links. Since some of the data that are copies of the mailbox object's data are locally stored within the GUI, the distribution leads to the separation of the GUI's operations into two groups: the operations that only impact local (GUI's) data and the ones that are carried over to the mailbox object straight after being executed by the GUI. A typical operation of the first group is the GUI operation that marks a message as read or unread. In the centralised version, this operation is synchronously performed on the mailbox object. Now, it is applied and logged by the GUI in order to avoid generating too many requests on the wireless network. At the next remote operation execution, for instance a call to the operation that retrieves new messages, the log of local operations is transmitted as an argument and applied to the mailbox object before the processing of the remote operation. So, the effects of the status-change operation are seen before the effects of the message-retrieval operation, as in the centralised version. Therefore, the GUI logs all its local operations since the last remote operation that do not need to be processed remotely in a synchronous way. Another consequence is that all the remote operations have as their first argument an array containing the list of local operations. This design pattern is rather simple and can be applied to distributed applications that are piece-wise deterministic.

Whatever the quality of the wireless link, in order to load data from the mailbox object at a convenient rate and quantity, messages are read in two steps: first the header, next the content. The user browses the set of headers and loads only the desired contents. The reasons for this distinction are that contents are usually much larger than headers and that, being optimistic, users will not read every content whereas they read all the headers. Another way to adapt to the wireless link is the addition of "collective" operations that read and delete groups of messages: e.g. all the read/unread messages, or all the messages.

In the connected mode, the client's requests are directly sent to the remote object. Hence, the state of the disconnected object on the mobile terminal does not evolve. The advantage of this mode is that there is no indirection and the state of the disconnected object can be empty, thus saving memory. When the mobile terminal becomes partially connected, the interceptor calls the reconnection operation on the disconnected object, which in turn calls the same operation on the remote object to transfer the state.

In the partially connected mode, the operations are executed locally and remotely. If the prototype of the operation contains only in parameters, the operation is executed locally first and then remotely, so that the disconnected object remains up to date. If the prototype contains only out parameters and a return type, the operation is executed remotely first and then locally. The consequence is that the disconnected object remains up to date with the data loaded from the remote object before it responds to the client. The mixing of in, inout and out parameters and a return value is left as an open issue in our first study. Another open issue is the support of exceptions thrown by the servers and sent as responses to the clients.

In the disconnected mode, the operations are executed only locally. If the prototype of the operation contains only in parameters, the operation is logged. If the prototype contains only out parameters and a return type, whether or not the operation is logged depends on whether the state of the target object changes. The mixing of in, inout and out parameters and a return value, and the throwing of exceptions, raise the same difficulties as mentioned previously. In addition, recall that every operation has as its first argument an array representing a log of operations that were local to the GUI. This log is also added to the log of the local copy. Of course, this first argument is an in parameter but does not take part in the previous discussions. Finally, an important hypothesis of this study is that the remote object cannot be accessed concurrently by other clients while the current client is disconnected. Thus, the reconciliation is eased and kept simple. The transition between the disconnected mode and the partially connected mode corresponds to the replay of the operations logged by the local copy.
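The mode- and signature-dependent rules of the two previous paragraphs can be summarized by the following Java sketch (the names and the string encoding of the outcomes are ours; the open issues mentioned above are left out):

```java
// Hypothetical summary of the mode/parameter-mode rules described above.
enum Mode { CONNECTED, PARTIALLY_CONNECTED, DISCONNECTED }

enum Signature { IN_ONLY, OUT_ONLY } // mixed in/inout/out is an open issue

class DispatchRules {
    /** Where, in which order, and whether an operation is logged. */
    static String dispatch(Mode mode, Signature sig, boolean changesTargetState) {
        switch (mode) {
            case CONNECTED:
                return "remote only";                  // DO state may stay empty
            case PARTIALLY_CONNECTED:
                return sig == Signature.IN_ONLY
                    ? "local then remote"              // DO stays up to date
                    : "remote then local";             // DO refreshed before reply
            default: // DISCONNECTED
                return sig == Signature.IN_ONLY
                    ? "local, logged"                  // replayed on re-connection
                    : (changesTargetState ? "local, logged" : "local only");
        }
    }
}
```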

4. Demonstration

[Figure 2: an IEEE 802.11b LAN connecting three clients (an iPAQ under Windows CE with ORBacus 4.1.0 and IBM J9/VAME; a PC under Windows 2000 with ORBacus 4.1.0, Sun JDK 1.3.1 and Swing; a PC under Linux with ORBacus 4.1.0, Sun JDK 1.3.1 and Swing) and one server (a PC under Linux with ORBacus 4.1.0 and Sun JDK 1.3.1).]

Figure 2. The demonstration with the wireless email browser.

For implementing and validating the DOM service, CORBA [OMG 01] has been chosen for its ability to be used in multiple domains and for providing extensibility mechanisms such as portable interceptors, to build application-transparent services, and objects by value, to build application-aware services. Disconnected objects, being CORBA objects, are accessible from everywhere, so from all the applications of the mobile terminal. Another rationale for letting the disconnected object be a CORBA object is that it can use standard CORBA services such as naming, event notification or transactions independently of the framework. Details of the design and implementation can be found in [CON 02].

We have conducted series of performance measures in different software and hardware combinations (laptop PC and iPAQ PDA, running Windows or Linux). For wireless communications, a Compaq IEEE 802.11b WL110 card at 11 Mbps was plugged into all devices and we used a software base station. The demonstration shows the wireless email browser example application involving three clients with three different hardware and software configurations and one server. Figure 2 gives an overview of the demonstration.


5. Conclusion

This demonstration proposes a generic service for Disconnected Object Management that can benefit distributed applications running in mobile environments. It decomposes into a disconnected object management interface, a resource manager, a connectivity manager and a log manager. In the example application we have presented, disconnected objects are proxies of server objects, but they could be proxies of databases. Similarly, the log manager logs operations on server objects, but it could log database operations. The prototype we have realised demonstrates that some of today's hand-held devices, and a fortiori future mobile devices, can embed our framework, which includes a complete ORB. We have presented a prototype of an email browser example application. Tests were run on both a laptop PC and an iPAQ PDA. The performance results show that the DOM service overhead is negligible for the end user.

6. References

[BAR 99] BARBARÁ D., "Mobile Computing and Databases - A Survey", IEEE Transactions on Knowledge and Data Engineering, num. 1, 1999, p. 108–117.

[CON 02] CONAN D., CHABRIDON S., BERNARD G., "Disconnected Operations in Mobile Environments", Proc. 2nd IPDPS Work. on Par. and Dist. Comp. Issues in Wireless Networks and Mobile Computing, Ft. Lauderdale, Florida, April 2002.

[HOL 00] HOLLIDAY J., AGRAWAL D., EL ABBADI A., "Planned Disconnections for Mobile Databases", 3rd DEXA Int. Work. on Mobility in Databases and Distributed Systems, Greenwich, U.K., Sep. 2000.

[JOS 97] JOSEPH A., TAUBER J., KAASHOEK F., "Mobile Computing with the Rover Toolkit", IEEE Trans. on Comp., vol. 46, num. 3, 1997.

[KHU 02] KHUSHRAJ A., HELAL A., ZHANG J., "Incremental Hoarding and Reintegration in Mobile Environments", Proc. of the Int. Symp. on Appli. and the Internet, Nara, Japan, Jan. 2002.

[MUM 95] MUMMERT L., EBLING M., SATYANARAYANAN M., "Exploiting Weak Connectivity for Mobile File Access", Proc. of the 15th ACM Symp. on Oper. Syst. Princ., Copper Mountain Resort, CO, Dec. 1995.

[NOB 97] NOBLE B., SATYANARAYANAN M., NARAYANAN D., TILTON J., FLINN J., WALKER K., "Agile Application-Aware Adaptation for Mobility", Proc. of the 16th ACM Symp. on Oper. Syst. Princ., 1997.

[OMG 01] OMG, "The Common Object Request Broker - Architecture and Specifications. Revision 2.4.2", OMG Document formal/01-02-01, Feb. 2001, Object Management Group.

[PET 97] PETERSEN K., SPREITZER M., TERRY D., THEIMER M., DEMERS A., "Flexible Update Propagation for Weakly Consistent Replication", Proc. 16th ACM Symp. on Princ. of Dist. Comp., Saint Malo, France, Oct. 1997, p. 288–301.

[Viv02] "The Vivian Consortium. VIVIAN deliverable report: Platform Services Specifications", report, Sep. 2002, http://www-nrc.nokia.com/Vivian.


A distance-learning platform based on reusable and customizable components

John-Freddy Duitama*+, Amel Bouzeghoub*, Claire Carpentier*, Bruno Defude*

*Institut National des Télécommunications, 9, rue Charles Fourier, 91011 Évry Cedex
+University of Antioquia, Medellin, Colombia

Pré[email protected]

ABSTRACT. We propose an environment for the creation and delivery of multimedia educational content relying mainly on Web technologies (the HTTP protocol, HTML, XML, Web browsers). To promote reuse, we define a model of educational components organized according to a domain model. Content personalization is achieved by means of learner profiles.

KEYWORDS: educational component base, personalization, creation and reuse of educational content, learner profiles, Web application, XML.


1. Introduction

Distance learning has gained new momentum with advances in the Internet and multimedia [WEBE97, BRUS98, NEJD99, SEEB99, SADD01]. There are two categories of tools: synchronous tools (videoconferencing) and asynchronous tools (e.g., tutorials). We work in an asynchronous context. Our goal is to provide an environment for the creation and delivery of multimedia educational content relying mainly on Web technologies (the HTTP protocol, HTML, XML, Web browsers). This environment must remain simple to use while increasing the teacher's productivity, in particular by promoting the reuse of content already developed in other contexts (by the teacher or by other teachers). A good educational tool must also match the learners' expectations and take their prior knowledge into account.

To meet these main objectives, we need an environment that hides the complexity of the tools from users and promotes reuse via a component model. We have likewise chosen to personalize content according to learner profiles [BRUS 96].

The article is structured as follows. Section 2 briefly describes the set of information models used in the environment, with emphasis on the component model. Section 3 describes the software architecture chosen for the prototype. In Section 4, we describe the main functionalities implemented. Finally, in Section 5, we give some points of comparison with other approaches and some perspectives for extensions.

2. The information models used

We need to describe three types of information: the covered teaching domain, the learners and the components.

The domain model must describe the corpus of knowledge being taught; it is the prerequisite to any use of the platform. We chose a relatively simple model, based on a hierarchical description of concepts denoting that one concept is more general than another concept.

Learners are described via a profile with two facets: the first gives general information about the learner (identification, user-interface preferences such as language, font choices, etc.) and the second describes the knowledge already acquired by the learner. This second facet is defined from the domain model, since it consists of instantiating it by valuing each concept with an integer between 0 and 100 (100 indicating complete mastery of the concept and 0 complete ignorance). Profiles are initialized either by using stereotypes or by interacting with the learner at their first connection. A learner's profile is dynamic and evolves as the learner uses the environment.
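As an illustration, here is a minimal Java sketch of this second facet of the profile (class and method names are ours, not the platform's):

```java
import java.util.*;

/** Minimal sketch of a learner profile: each domain concept is valued
 *  between 0 (complete ignorance) and 100 (complete mastery). */
class LearnerProfile {
    private final Map<String, Integer> mastery = new HashMap<>();

    /** Initialization, e.g. from a stereotype or a first-connection dialog;
     *  also used to update the profile at the end of each session. */
    void setLevel(String concept, int level) {
        mastery.put(concept, Math.max(0, Math.min(100, level)));
    }

    /** Returns 0 for a concept that has never been valued. */
    int level(String concept) {
        return mastery.getOrDefault(concept, 0);
    }
}
```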

To describe the components, standards for educational components already exist, notably LOM (Learning Object Model) [IEEE 01]. However, these standards take little account of the components' semantics (for example, the description of their content). We propose to extend these standards by adding semantic descriptions of their content (based on the model of the taught domain) and of their possible interactions with other components (description of the components' inputs and outputs). For us, a component is composed of a content (the URL of a Web resource provided by the component's author) and is described by meta-information comprising that of LOM, as well as the description of its content (the set of covered concepts), of its input (representing the prerequisite concepts), of its output (representing the concepts acquired once the component has been successfully completed by the learner) and its success condition. Note that the granularity of a component is set by its author and that we define no constraint on the component's content. We have formalized composition operators allowing a complex component to be built (possibly recursively) from elementary components (or from complex components in the recursive case). A component can also be defined as a course; it then becomes deliverable to learners.
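A minimal sketch of such a component description in Java (field names are ours; the LOM metadata fields are omitted):

```java
import java.net.URI;
import java.util.Set;

/** Sketch of an educational component: a Web resource plus the semantic
 *  descriptions discussed above. LOM metadata is not reproduced here. */
class EducationalComponent {
    URI content;              // URL of the resource supplied by the author
    Set<String> covers;       // concepts treated by the component
    Set<String> input;        // prerequisite concepts
    Set<String> output;       // concepts acquired on successful completion
    String successCondition;  // e.g. a minimal score on an exercise
    boolean isCourse;         // declared deliverable to learners
}
```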

3. Software architecture

The architecture of our prototype (Figure 1) is defined from the one proposed by [Wu et al. 98]. At the lowest level we find the information described in the previous section, which is manipulated via management/navigation modules. Storage is handled by a DBMS (in our case Oracle8i) and manipulation is done via XML. The pedagogical module selects the courses best suited to a given learner with respect to the concepts they have selected. The navigation module uses hypermedia navigation techniques to build the presentation of the components selected for a learner; in particular, it automatically builds a table of contents and a concept map (a graphical representation of the course's concepts and their semantic relationships). All the modules are implemented as Java servlets and access to the DBMS is done via the Java/JDBC interface.


[Figure 1: administrator, author and learner interfaces on top of navigation and profile management modules; a pedagogical module, a navigation module and a component management module; an educational component base, a learner model and a domain model underneath; the whole accessible through the Web.]

Figure 1 - Functional architecture

4. Prototype functionalities

There are three main categories of users (administrator, authors and learners), each with a set of dedicated functionalities.

4.1. Administrator interface

The administrator creates the domain model, declares the authorized authors and creates the learner stereotypes.


4.2. Author interface

Teachers/authors can either add new elementary components, add new complex components, or create courses. Adding an elementary component consists of describing it by giving its metadata, its inputs and its outputs, and providing the URL of the resource that realizes its content.

Adding a complex component is done by composing existing components, giving their composition graph. The nodes represent components and the arcs are typed: they model either a sequence (to access this component, one must have accessed that other one beforehand), an alternative (this component can be used interchangeably in place of that other one), or an illustration in the form of an exercise (this component is an application exercise for that other one). A node can represent either a single component (designated by a unique identifier) or a set of components (designated by a selection expression over the set of components); implicitly, in the latter case, all the elements of the set are assumed to be alternatives of one another. The inputs and outputs of the complex component are computed automatically, using the values of the assembled components and the semantics of the composition operators.
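Structurally, the composition graph can be sketched as follows (an illustration only: the semantics of the operators, including the automatic computation of inputs and outputs, is not reproduced, and all names are ours):

```java
import java.util.*;

/** Structural sketch of a composition graph with typed arcs. */
class CompositionGraph {
    enum ArcType { SEQUENCE, ALTERNATIVE, EXERCISE }

    /** A node designates a single component by its unique identifier,
     *  or a set of components by a selection expression. */
    record Node(String componentIdOrSelection) {}

    record Arc(Node from, Node to, ArcType type) {}

    final Set<Node> nodes = new HashSet<>();
    final List<Arc> arcs = new ArrayList<>();

    void add(Node from, Node to, ArcType type) {
        nodes.add(from);
        nodes.add(to);
        arcs.add(new Arc(from, to, type));
    }
}
```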

Creating a course simply consists of choosing a component and declaring that it can be delivered as a course. In this case, from the graph of this component, the system can automatically derive a table of contents, a table of exercises or the conceptual navigation map.

4.3. Learner interface

At each session, the learner must log in and then choose the concept they want to study. The system then proposes a list of courses matching both the requested concept and the user's profile. This list is sorted by decreasing relevance. The learner can then either choose a course from this list or reformulate their request (to do so, they can navigate the domain model).

Once a course is chosen, it is delivered taking the user's profile into account. Delivery filters out some alternative components of the course (for example, if the learner does not want video, all the alternative components including video are removed). Moreover, the learner is explicitly shown that they have already acquired some concepts of the course, and hence that some components are not necessarily useful for them.

Navigation within a course can be done in two ways: either via a table of contents, or via a concept map which models the semantic relationships between the different parts of the course and makes it possible to better situate the course in its global context.

At the end of a session, the user's level is updated to take into account what they did during it. If the course has not been fully explored, the next session will place the learner back at the same point in the course.

5. Assessment and perspectives

Many distance-learning platforms already exist, both in the commercial domain (WebCT, Learning Space) and in the academic world. The originality of our approach lies in the joint use of a component model incorporating a semantic level and of user profiles. The prototype is operational and allows a first validation of our ideas. An experimentation phase with teachers and students of our institutions is under way. At the functional level, many extensions remain to be made, in particular to take more advantage of the semantic description of components (verification of the validity of component compositions and computation of the acquisitions a learner must make if they wish to follow a given course). Finally, we want to study the contribution of peer-to-peer technologies to build a distributed version of the system.

Acknowledgements

This project is a joint project of the Institut National des Télécommunications and the University of Antioquia (Medellin, Colombia). The prototype was implemented by final-year students in computer systems engineering at the University of Antioquia.

Bibliography

[BRUS96] Peter Brusilovsky. "Adaptive Hypermedia: an Attempt to Analyze and Generalize". In P. Brusilovsky, P. Kommers and N. Streitz (eds.): Multimedia, Hypermedia and Virtual Reality. Lecture Notes in Computer Science, vol. 1077, Berlin: Springer-Verlag, pp. 288–304, 1996.

[BRUS98] Peter Brusilovsky, John Eklund and Elmar Schwarz. "Web-based education for all: a tool for developing adaptive courseware". 7th International World Wide Web Conference, Brisbane, Australia, April 14-18, 1998. Available from: http://www7.scu.edu.au/programme/fullpapers/1893/com1893.htm

[IEEE 01] IEEE Learning Technology Standards Committee (LTSC). IEEE P1484.12 Learning Object Metadata (LOM), Draft Document, 2001. Available at http://ltsc.ieee.org/

[NEJD99] Nejdl, W. and Wolpers, M. "KBS-Hyperbook - a data-driven information system on the web". In Proceedings of the 8th International Conference on World Wide Web (Toronto, Ont.), 1999.

[ROSC99] Roschelle, J., DiGiano, C., Koutlis, M., Repenning, A., Phillips, J., Jackiw, N., and Suthers, D. "Developing educational software components". IEEE Computer 32, 9, Sept. 1999, pp. 50–58.

[SADD01] Abdulmotaleb El Saddik, Stephan Fischer, Ralf Steinmetz. "Reusability and Adaptability of Interactive Resources in Web-Based Educational Systems". ACM Journal of Educational Resources in Computing, vol. 1, num. 1, Spring 2001.

[SEEB99] Cornelia Seeberg, Abdulmotaleb El Saddik, Achim Steinacker, Klaus Reichenberger, Stephan Fischer and Ralf Steinmetz. "From the User's Needs to Adaptive Documents". In Proceedings of the Integrated Design & Process Technology conference "IDPT'99", June 1999.

[WEBE97] Weber, Gerhard and Specht, Marcus. "User Modeling and Adaptive Navigation Support in WWW-Based Tutoring". In Anthony Jameson, Cécile Paris and Carlo Tasso (eds.), User Modeling: Proceedings of the Sixth International Conference, UM97. Vienna, New York: Springer Wien New York, © CISM, 1997. Available online from http://um.org.

[Wu-H98] Wu, H., Houben, G.J., De Bra, P. "AHAM: A Reference Model to Support Adaptive Hypermedia Authoring". Proc. of the "Zesde Interdisciplinaire Conferentie Informatiewetenschap", pp. 77–88, Antwerp, 1998.


Experiencing Persistent Object Management Customization

Luciano García-Bañuelos 1 — Phuong-Quynh Duong —Tanguy Nedelec — Christine Collet

LSR/IMAG Laboratory, 681 rue de la Passerelle, 38400 Saint Martin d'Hères, France

Luciano.Garcia, Phuong-Quynh.Duong, Tanguy.Nedelec, [email protected]

ABSTRACT. We defined a component-based infrastructure for building customized persistent object managers with three levels of reliability: unreliable, reliable without transactions, and reliable with transactions. Our goal is to allow the deployment of components according to the application requirements and the runtime environment constraints. This paper presents the infrastructure architecture and shows how it can be used. To this end, we developed two applications: a video game and a client-server product catalog manager. Each application uses different persistence managers: a reliable non-transactional one for the game, a transactional one for the server side of the second application, and an unreliable manager for the client side. The applications were deployed in two different runtime environments: desktop PCs and PDAs.

RÉSUMÉ. We defined a component infrastructure for building customized persistent object managers with three levels of reliability: unreliable, reliable and transactional. Our goal is to allow the deployment of the required set of components according to the application's needs and its execution environment. This paper presents the architecture and shows how the infrastructure can be used. For this, we chose two applications: a video game and a catalog management application with a client/server architecture. These two applications have different needs: a non-transactional manager for the game, a transactional manager on the server side of the second application and an unreliable manager on the client side. The two applications run in two environments with different resources: desktop computers and PDAs.

KEYWORDS: Database services, DBMS architectures, Component-based architectures, Persistence management, Cache management, Multi-level transactions.

MOTS-CLÉS: Database services, DBMS architectures, Component-based architectures, Persistence management, Cache management, Multi-level transactions.

1. Supported by the CONACyT Scholarship program of the Mexican Government.


1. Introduction

Database technology has an extremely successful track record as a backbone of information technology throughout the last three decades. Database management systems (DBMS) largely contributed to this success by providing bundles of these technologies. However, as DBMSs incorporate more and more features, their complexity increases tremendously.

The small subset of functionalities needed for a given application can be swamped by a mass of irrelevant functionalities. Thus, using a full-fledged DBMS sometimes becomes cumbersome. Mobile computers/devices are examples of environments where data management is constrained by limited resources (e.g. memory, CPU). In such cases, we need data management systems with exactly the set of required functionalities. Unfortunately, DBMS architectures are not designed for this purpose. Their remarkable complexity makes adding and/or removing functionalities a hard engineering/research task.

DBMS kernel toolkits were a first attempt to provide more adaptable DBMS architectures. However, most of those earlier works have kept the fully functional DBMS as the fundamental unit of packaging for installation and operation.

Our approach, in the NODS (Networked Open Database Services) project [COL 00], is to unbundle the DBMS into a set of cooperating services. As a result, we provide a global framework from which system programmers can build data management infrastructures.

One of the first efforts concerns the isolation of persistence-related services from the whole DBMS machinery. The resulting infrastructure is described in [GAR 02a, GAR 02b].

This paper overviews the persistence management infrastructure and demonstrates its usage through two applications. Each one is based on a particular configuration of the infrastructure relevant to the application and the environment where it is deployed.

2. A component infrastructure

Our approach is based on the identification of functional dimensions characterizing persistence management. This process needs a deep study because of subtle interdependencies between modules of a DBMS. The boundaries and dependencies we found led us to the definition of a component-based infrastructure. The framework is designed to promote component reuse, with a focus on the genericity of each component.

The architecture is organized in a hierarchical layered way, as shown in Figure 1. Each layer extends the services provided by the layer immediately above it.

Based on this approach, different persistent object managers can be configured by deploying software components of one or more layers. A quick walk-through of each layer follows.


[Figure 1: a layered component diagram. Layer 0 (base components) comprises CacheManager, StorageManager and PersistenceManager, exposing ICacheManager, IStorageManager and IPersistenceManager. Layer 1 (crash resilience) adds LogManager and ReliablePersistMgr, exposing ILogManager, ICheckpointManager and IPersistenceManager. Layer 2 (transaction support) adds ConcurrencyControl and TransPersistMgr, exposing IConcurrencyControl, ITxResourceManager and IPersistenceManager.]

Figure 1. System architecture

Layer 0 - Unreliable, base components

– The CacheManager component makes it possible to maintain popular data objects in main memory.

– The StorageManager component provides a unified interface to access, in an object-based fashion, a given underlying storage product, e.g. a DBMS.

– The PersistenceManager coordinates a CacheManager and a StorageManager. It mediates object materialization, as well as update propagation.

Layer 1 - System crash resilience

– The LogManager component defines the interface for a general-purpose logging facility.

– The ReliablePersistMgr, meaning reliable persistence manager, adds system crash reliability to the base components.

Layer 2 - Transaction support

– The ConcurrencyControl component provides an interface allowing a large range of concurrency control methods to be accommodated.

– The TransPersistMgr, which stands for transactional persistence manager, adds transactional resource management capabilities to the infrastructure.

We developed a prototype of the infrastructure using the Java language. Taking advantage of the high portability of the code, we are able to deploy our prototype on different hardware and operating systems.


3. Demonstration

The objective of the demonstrations is to show the flexibility of our infrastructure. This has been done by deploying two types of applications in different execution environments. The first application is a video game, the popular arcade "asteroids" game, and the second one is a product catalog management application.

3.1. Asteroid application

We modified a freely available Java version of the popular "asteroids" game so that a game session can be stopped and resumed later on. To this end, we added a persistence manager to make all objects in the game area persistent, i.e. asteroids and spaceship.

Two configurations of the infrastructure have been deployed in two different execution environments. The first one is for handheld computers while the second one is for desktop computers.

The version for handheld computers includes an unreliable persistence manager, i.e. layer 0 of the infrastructure. Objects of the game area are stored only when the player stops the game. Other events causing the game to stop, such as a system crash, do not trigger the storage process. Thus, the game session is recoverable only in the case where it has been stopped properly. Unreliability is justified by the limited memory of a handheld computer.

The persistence manager configuration is quite simple. It includes a StorageManager instance and a CacheManager instance. The instantiation of the corresponding persistence manager is as follows: UnreliablePersistenceManager(StorageManager aStorageMgr, CacheManager aCacheMgr).

The desktop configuration incorporates a reliable persistence manager, i.e. layers 1 and 0 of the infrastructure. Objects are written periodically to permanent storage so that the game session can be restored after any event that stopped it.

For this configuration, we added a LogManager as follows: ReliablePersistenceManager(StorageManager aStorageMgr, CacheManager aCacheMgr, LogManager aLogMgr). Note that we reused the same base code for both the application and the infrastructure while porting the handheld computer version to the desktop platform. Only a log manager has been added to the code used in the handheld computer version.
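Putting the two configurations side by side, a compilable Java sketch (the constructor signatures are those quoted above; the stub classes and the main method are ours):

```java
// Minimal stubs so the sketch compiles; the real components come from the
// NODS infrastructure and are far richer.
class StorageManager {}
class CacheManager {}
class LogManager {}
class UnreliablePersistenceManager {
    UnreliablePersistenceManager(StorageManager s, CacheManager c) {}
}
class ReliablePersistenceManager {
    ReliablePersistenceManager(StorageManager s, CacheManager c, LogManager l) {}
}

public class GameConfigSketch {
    public static void main(String[] args) {
        StorageManager storageMgr = new StorageManager();
        CacheManager cacheMgr = new CacheManager();

        // Handheld version: layer 0 only; state is saved only on a clean stop.
        UnreliablePersistenceManager handheldPm =
            new UnreliablePersistenceManager(storageMgr, cacheMgr);

        // Desktop version: layers 0 and 1; the log makes sessions
        // recoverable after a crash.
        LogManager logMgr = new LogManager();
        ReliablePersistenceManager desktopPm =
            new ReliablePersistenceManager(storageMgr, cacheMgr, logMgr);
    }
}
```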

3.2. Product catalog management system

Let us consider a simple scenario where a company wants to manage its product catalog and let its employees access it at their respective work places. The system has a client-server architecture. The server manages product catalog data. Clients are able to send requests to the server in order to query and/or update catalogs.


Such a system can be deployed as follows: a server, and clients on handheld computers. Both the client and the server applications are coded in Java, with remote method invocation used to transmit requests and any results between them.

This product catalog management system serves as a demonstration vehicle emphasizing the following aspects of our infrastructure:

– The ability to support storage managers other than our ad hoc implementation,

– The use of transactions in an object context,

– The ability to instantiate different configurations within the same system.

We chose to build the server-side application on top of a relational DBMS. To this end, we deployed a StorageManager component that wraps an object-mapping subsystem (built on top of the open source Java mapping software [OJB]). This object-mapping subsystem, in turn, enables access to any JDBC-compliant source.

Since we need support for transactions, we use a transactional persistence manager, i.e. layers 2, 1 and 0 of the infrastructure. This persistence manager is instantiated as follows: TransactionalPersistenceManager(LockManager concurrencyControl, ReliablePersistenceManager persistenceMgr, LogManager logMgr).
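A corresponding sketch of the server-side assembly, again with hypothetical concrete classes around the constructor quoted above (OjbStorageManager is an invented name for the OJB-wrapping storage manager described earlier):

class CatalogServerDeployment {
    // Builds the layer-0+1+2 (transactional) configuration used by the server.
    static TransactionalPersistenceManager configure() {
        // The storage manager wraps the OJB-based object-relational mapping
        // subsystem; OJB reads its mapping metadata from a repository
        // descriptor and can reach any JDBC-compliant source.
        StorageManager storageMgr = new OjbStorageManager("repository.xml");
        CacheManager cacheMgr = new LruCacheManager(1024);
        LogManager logMgr = new FileLogManager("catalog.log");
        ReliablePersistenceManager persistenceMgr =
            new ReliablePersistenceManager(storageMgr, cacheMgr, logMgr);
        LockManager concurrencyControl = new LockManager(); // a ConcurrencyControl
        return new TransactionalPersistenceManager(concurrencyControl,
                                                   persistenceMgr, logMgr);
    }
}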

The client side itself has an unreliable persistence manager, i.e. layer 0 of the infrastructure, to maintain a local persistent cache. Thus, interesting data, such as the list of IDs of the current top-ten products, can be stored locally when the session terminates. Later on, this list gives users faster access to other information related to these products.

4. References

[COL 00] COLLET C., “The NODS Project: Networked Open Database Services”, ECOOP Symposium on Objects and Databases, 2000.

[GAR 02a] GARCIA-BANUELOS L., “An Adaptable Infrastructure for Customized Persistent Object Management”, Proceedings of the EDBT 2002 PhD Workshop, March 2002.

[GAR 02b] GARCIA-BANUELOS L., DUONG P.-Q., COLLET C., “A Component-based Infrastructure for Customized Persistent Object Management”, Proceedings of the BDA’02 French Conference, Evry, France, October 2002.

[OJB] OJB TEAM, “Object/Relational Bridge”, http://jakarta.apache.org/ojb/.


DBA Companion: a tool for database analysis

Stéphane Lopes — Fabien De Marchi — Jean-Marc Petit

Laboratoire PRiSM, CNRS FRE-2510, Université de Versailles St-Quentin en Yvelines, 45, avenue des États-Unis, 78035 Versailles Cedex

Laboratoire LIMOS, CNRS UMR-2239, Université Blaise Pascal - Clermont-Ferrand II, Complexe scientifique des Cézeaux, 63177 Aubière Cedex

RÉSUMÉ. Understanding data semantics in existing relational databases (DBs) is an important task for many applications, such as database analysis and maintenance, database reverse engineering, or query optimization. Data semantics is mainly carried by integrity constraints. For most operational databases, particularly the oldest ones, we cannot assume that this knowledge is available. In this paper, we present a prototype called DBA Companion that can help in understanding existing relational databases. This tool integrates algorithms for extracting integrity constraints, as well as for several related problems. Among the various possible applications, we focus on logical database tuning.

ABSTRACT. Understanding data semantics from existing relational databases is important for several applications such as database maintenance and analysis, database re-engineering, or query optimization. Data semantics is carried in particular by integrity constraints. For most operational databases, particularly the oldest ones, we cannot assume that this knowledge is available. In this paper, we present a tool called DBA Companion which can help in understanding existing relational databases. This tool integrates algorithms for extracting integrity constraints and for several related problems. From this mined knowledge, the logical tuning of databases can be achieved.

MOTS-CLÉS: logical database tuning, database schema analysis, integrity constraints, functional dependency inference, inclusion dependency inference.

KEYWORDS: logical database tuning, database schema analysis, integrity constraints, functional dependency inference, inclusion dependency inference.


1. Introduction

Understanding data semantics in existing relational databases (DBs) is an important task for many applications, such as database analysis and maintenance, reverse engineering, data warehouse construction, or query optimization. Data semantics is mainly carried by integrity constraints. Among these constraints, functional dependencies (FDs), which generalize the notion of key, and inclusion dependencies (INDs), which generalize the notion of foreign key, are the most common integrity constraints [MAN 94]. In the best case, these constraints were specified when the database was designed and are therefore available in the DBMS. However, we cannot assume that this ideal situation holds for an operational database, especially for the oldest ones. In the latter case, this knowledge must be extracted from the database itself. Several sources of information are relevant for such a task, notably the physical schema of the database, the database extension, and the application programs.

In this paper, we present a prototype called DBA Companion that can help in understanding existing databases. This tool integrates several algorithms dedicated to database analysis. The analysis relies on data mining techniques, which allowed us to design efficient algorithms. The emphasis is on efficiency, so that real-world situations can be tackled. We address three main problems (FD inference, IND inference, and selection of interesting dependencies) as well as several related problems (generation of Armstrong databases, inference of approximate dependencies). The idea is to provide the database administrator with relevant information for improving application performance and ensuring data consistency.

Among the applications that can benefit from the understanding of existing databases, we chose to focus on logical database tuning. Our prototype was built to assist a database administrator in this task.

Logical database tuning. Nowadays, database administrators must monitor and tune a large number of parameters for their databases to run optimally. The difficulty of this task is widely acknowledged, yet many companies do not have a full-time administrator. Simplifying DBMS administration has thus become a new challenge for the database community.

Physical database tuning has been studied intensively to improve system performance, notably by assisting the administrator in creating indexes or by automatically collecting statistics for query optimization.

Here we consider logical database tuning, i.e. the analysis and optimization of the database schema.


Providing the database administrator with knowledge about the dependencies satisfied by a database can help to carry out tasks such as specifying candidate keys or detecting denormalized relations [LOP 00, LOP 01a]. For example, a denormalized relation may result from a design error or from a poorly controlled evolution of the database, and not always from optimization needs. Note also that approximate dependencies can give clues about inconsistent data: for example, an FD violated by a small number of tuples may indicate a data-entry error in those tuples.

2. A framework for database schema analysis

Starting from a database instance, FDs and INDs can be discovered efficiently using data mining techniques [LOP 00, MAR 02a]. Among the discovered knowledge, the most relevant dependencies still have to be selected. Indeed, some dependencies extracted from the data may not be meaningful. For example, accidental dependencies, i.e. dependencies that hold only for a particular database instance, may exist and must therefore not be taken into account later on. We propose two alternatives to help choose the interesting dependencies. The first is based on example databases: an Armstrong database¹ is an alternative representation of a set of dependencies and thus allows the discovered dependencies to be visualized [MAR 02b]. The second alternative relies on the analysis of application programs, i.e. a workload of SQL queries: for example, the attributes involved in join clauses give clues for selecting the relevant dependencies [LOP 01b].
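For reference, both constraint classes mined by the tool have standard formal definitions in the dependency literature [MAN 94]; the formulation below is that standard one, not something specific to DBA Companion. A relation $r$ satisfies the FD $X \to Y$, and a pair of relations $(r, s)$ satisfies the IND $R[X] \subseteq S[Y]$, when:

$$r \models X \to Y \iff \forall t_1, t_2 \in r:\ t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y]$$

$$(r, s) \models R[X] \subseteq S[Y] \iff \pi_X(r) \subseteq \pi_Y(s)$$

For approximate dependencies, a standard error measure (an assumption here, since this summary does not name the measure used by the tool) is the minimum fraction of tuples to remove from $r$ for the dependency to hold exactly; a dependency is reported as approximate when this fraction stays below a user-chosen threshold $\varepsilon$.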

3. A tool for logical database tuning

The prototype was implemented in C++/STL (Standard Template Library)/MFC (Microsoft Foundation Classes) and runs under Windows. Database access is performed through the ODBC (Open DataBase Connectivity) API (Application Programming Interface). Figure 1 shows part of the tool's functionality.

The user first connects to an ODBC data source and then selects the relation(s) to work on. Various algorithms can then be run on this selection (inference of exact and approximate FDs, key inference, inference of exact and approximate INDs, generation of Armstrong relations). The tool collaborates closely with the DBMS for these treatments. For example, Armstrong relations are generated inside the DBMS, which allows the user to modify them and then iterate the analysis process on these relations.

1. Informally, an Armstrong database is a database satisfying exactly a given set of dependencies.


Figure 1. DBA Companion

A number of extensions could improve the application. For example, automatically generating SQL scripts from the extracted knowledge would make it possible to enforce integrity constraints on the database while still allowing modifications by the administrator; indeed, the administrator will probably want to review the proposed modifications before applying them. The discovery of inconsistencies in the database can also rely on our tool: from the approximate dependencies, inconsistencies in the data can be highlighted and corrections can be proposed by the tool.

4. References

[LOP 00] LOPES S., PETIT J.-M., LAKHAL L., « Efficient Discovery of Functional Dependencies and Armstrong Relations », Proc. of EDBT 2000, Konstanz, Germany, vol. 1777 of LNCS, Springer, 2000, p. 350–364.

[LOP 01a] LOPES S., PETIT J.-M., LAKHAL L., « A Framework for Understanding ExistingDatabases », Proc. of IDEAS 2001, Grenoble, France, IEEE, 2001, p. 330–338.


[LOP 01b] LOPES S., PETIT J.-M., TOUMANI F., « Discovering Interesting Inclusion Dependencies: Application to Logical Database Tuning », J. of Information Systems, vol. 27, n° 1, 2001, p. 1–19, Elsevier Science.

[MAN 94] MANNILA H., RÄIHÄ K.-J., The Design of Relational Databases, Addison Wesley, 1994.

[MAR 02a] MARCHI F. D., LOPES S., PETIT J.-M., « Efficient Algorithms for Mining Inclusion Dependencies », Proc. of EDBT 2002, Prague, Czech Republic, vol. 2287 of LNCS, Springer, 2002, p. 464–476.

[MAR 02b] MARCHI F. D., LOPES S., PETIT J.-M., « Samples for Understanding Data Semantics in Relations », Proc. of ISMIS 2002, Lyon, France (to appear), 2002.


Indexing and querying press photos described in MPEG-7

Emmanuel Bruno — Jacques Le Maitre — Elisabeth Murisasco

Laboratoire SIS, Equipe Informatique, Université de Toulon et du Var, Bâtiment R, BP 132, 83957 La Garde cedex

bruno, lemaitre, [email protected]

RÉSUMÉ. This demonstration presents an interface, built on top of a Web server, dedicated to indexing and querying press photos represented as MPEG-7 documents and stored in an XML database. The three main features of the interface are presented: (i) textual and visual indexing of the data, which produces an MPEG-7 document; (ii) querying: a query is entered in a form and then translated into an XQuery query submitted through a mediator to the XML database; (iii) presentation and classification of the answers.

ABSTRACT. This demonstration presents a user interface, built on top of a Web server, dedicated to querying a catalogue of news photos described as MPEG-7 documents stored in an XML database. This paper focuses on the three main features of the interface: (i) textual and visual data indexing, which produces MPEG-7 documents; (ii) data querying: queries are captured in query forms and translated into XQuery queries sent to the XML database through a mediator; (iii) answer presentation and classification.

MOTS-CLÉS: multimedia data, indexing, query language, MPEG-7, XML, XQuery.

KEYWORDS: multimedia data, indexing, query language, MPEG-7, XML, XQuery.


1. Introduction

This demonstration presents an interface for indexing and querying the photographs of a press agency, described as MPEG-7 documents (Martinez, 2002), hence as XML documents, and stored in an XML database. Indexing is done in two steps: (i) a textual description of the photos, consisting of identification data, descriptors chosen from a thesaurus, and free keywords; (ii) an automatic indexing of the visual content of the photos, carried out by image analysis, which extracts colour and texture descriptors as well as characteristics of the shot. Querying is done, classically, through forms used to enter criteria on the textual description of the photos and on their visual content. The answers are ranked according to a similarity coefficient computed from the textual description and the visual content. The representative photos of each class are displayed as a table of thumbnails, from which the user can reformulate the query by specializing or generalizing it. An important characteristic of this interface is its “all-XML” approach: the photos, the thesaurus, and the forms are described in XML and stored in an XML database queried through XQuery queries (Boag, 2002). The interface is implemented using the Cocoon XML publishing environment integrated into an Apache Web server. It was developed within the RNTL MUSE project, whose objective is to build a search engine for querying multimedia data stored in an XML database¹.

2. Indexing

Indexing proceeds in two steps. In the first, the MPEG-7 document is created and then filled with the textual information entered through an indexing form, itself described in XML. Figure 2 shows the indexing form associated with the photo of Figure 1. Note in particular the free keywords, which are extracted from the “Subject” field, and the descriptors from the thesaurus. The thesaurus has a classical structure (Lefèvre, 2000): it is a set of terms equipped with synonymy, genericity, and association relations and their inverses. In the second step, which is transparent for the user, the MPEG-7 document is completed with the descriptors extracted from the visual content of the photo by image analysis. The visual aspects taken into account are colour, shot type, camera angle, and image orientation.

1 http://sis.univ-tln.fr/MUSE


3. Querying

Two query modes are offered to users: a quick mode, which consists in navigating the thesaurus, and a form-based mode, which we present in more detail.

A query form is used to query a catalogue consisting of a list of records, each described by an XML element. This catalogue is extracted from an XML database by an XQuery query. In the application presented here, the records are the MPEG-7 documents that describe the photos. A query form is thus a view over a catalogue record.

Figure 1. Photo (©Editing)

Figure 2. Indexing form


The record is itself a view over the XML database that contains the catalogue data, as summarized in Figure 3.

The query entered in the form of Figure 4, for example, is translated into the following XQuery query, in which the XPath expressions refer to the names of elements present in the MPEG-7 description of an indexing record:

for $card in catalog
let $s1 := $card/ContentDescription//Creator/Agent//FamilyName,
    $s2 := $card/ContentDescription//CreationAbstract/KeywordAnnotation[2]/Keyword/text(),
    $s3 := xf:substring-before($card/ContentDescription//CreationCoordinates/CreationDate/text(), "-")
where $s1 = "Lefèvre" and $s2 = "municipales"
      and ($s2 = "Paris" or $s2 = "Lyon") and $s3 < 2002
return $card

Besides the classical operators on character strings, the interface provides the user with specific operators that widen a query on a thesaurus descriptor to its specific, generic, or associated descriptors, as well as high-level predicates for querying the visual content of the photos.

Figure 3. Catalogue and query form: the query form is a view over a record, the record belongs to the catalogue, the catalogue is itself a view over the XML database, and each view is materialized by an XQuery query; a query zone of the form is a view over part of a record.

Figure 4. Query form and query (criteria entered: Author "Lefèvre"; Keywords "municipales" & ("Paris" + "Lyon"); Date < 2002)



4. Answer presentation and query reformulation

A query returns the MPEG-7 documents describing the photos that answer it. To present the user with a complete overview of these photos without drowning them under their number, the photos are partitioned into a number of classes fixed a priori, and only the most representative photo of each class is displayed. The user can then ask for all the photos of a class to be displayed and, if need be, reformulate the query. The classification program is currently being implemented. Each class groups the photos whose similarity coefficient is above a given threshold. This similarity coefficient is built, for each pair of photos, by combining a visual similarity coefficient and a textual similarity coefficient.

Figure 5. Architecture of the interface: the client (Web browser) talks to the interface (Apache Web server + Cocoon), which hosts the indexing form, the query form, the XQuery query generator, and the XSLT + CSS presentation of the ranked answers; the interface submits queries to the mediator (an XQuery query evaluator), which calls the XML data manager hosting the XML database (forms + thesaurus + MPEG-7 documents).
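The summary above does not give the exact combination rule. A common choice, shown here purely as an illustrative assumption, is a convex combination of the two coefficients:

$$\mathrm{sim}(p, q) = \alpha \cdot \mathrm{sim}_{\mathrm{vis}}(p, q) + (1 - \alpha) \cdot \mathrm{sim}_{\mathrm{txt}}(p, q), \qquad \alpha \in [0, 1],$$

two photos $p$ and $q$ being placed in the same class whenever $\mathrm{sim}(p, q)$ exceeds the chosen threshold.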

5. System architecture

Figure 5 shows the architecture of the interface and its connection with the mediator and the XML data manager. The interface is implemented within an Apache Web server. The indexing and query forms are described in XML and dynamically converted into HTML forms when a session is opened (1). A query entered in a query form is translated into XQuery and then submitted to the mediator (2). The mediator evaluates the query by calling the XML data manager (3) and returns to the interface the MPEG-7 records answering the query (4). The presentation of the answer is built by applying an XSLT style sheet and then a CSS style sheet to these records (5). On seeing it, the user can return to the query form to reformulate the query.

6. Conclusion

The two strong points of this interface are (i) the combination of textual and visual indexing and (ii) the “all-XML” programming. The development of this interface is not finished. We must first complete its integration with the tools developed by the other partners of the MUSE project, notably the mediator and the XML data manager. A first prototype will be available at the end of 2002.

7. References

Boag S. et al., XQuery 1.0: An XML Query Language, W3C Working Draft, http://www.w3.org/TR/2002/WD-XQuery-20020430, 2002.

Lefèvre P., La recherche d'informations, du texte intégral au thésaurus, Editions Hermès, 2000.

Martinez J., Overview of the MPEG-7 Standard (version 6.0), http://mpeg.telecomitalia.com/standards/mpeg-7/mpeg-7.htm.


A rapid prototyping environment for Web database applications

Bruno Defude
Institut National des Télécommunications
9 rue Charles Fourier, 91011 Evry
[email protected]

RÉSUMÉ. We describe a rapid prototyping environment for Web database applications which, starting from an entity-relationship schema, automatically derives the equivalent database schema and a Web application. The environment consists of three tools that can be used either in an integrated way or separately. It is used mostly in a teaching context, since it allows students to develop database applications very quickly and very simply.

MOTS-CLÉS: rapid prototyping, Web application, tool for teaching databases, CGI programming, Oracle, XML


1. Introduction

The Web has de facto become the ideal platform for deploying applications, since it relies on proven standards both at the protocol level (HTTP) and at the level of document description languages (HTML and now XML).

Many development tools have appeared, coming either from the Web world (programming language interpreters attached to an HTTP server, such as Microsoft's ASP [ASP], Java's JSP [JSP] and servlets, or PHP [PHP] from the open source community) or from programming languages (Java's J2EE model [J2EE], for example). These solutions are powerful and cover most needs, but they are not very easy to use for non-computer-scientists. Nor are they very database-centric; they focus more on the user interface or on the application logic.

In the database field, several proposals have been made, both in the commercial world (Oracle's WebDB [WEBDB], for example) and in the academic world (Strudel [FLO 00], for example). Finally, there are Web-based database administration tools (PHPMyAdmin [PHPMYADMIN] for MySQL is one example), but they do not completely fulfil our objectives, since they do not really support application development.

The objectives of our project are to provide a complete software suite (from the design of the database schemas to the code of the Web application used to manipulate the database) that is free, modifiable, easy to deploy, and easy for our students to use. The produced application is not necessarily meant to be the final application (it is rather a throwaway prototype that at least allows the database schema to be validated). Our approach is very close to that of WebDB, but we try to be more independent of the DBMS, and our vision of the Web/DBMS interaction is quite different (Oracle tries to do as much work as possible inside the DBMS, whereas we prefer to move some functions out of it, such as HTML rendering).

The rest of the paper is structured as follows. Section 2 describes the logical architecture of the environment and its usage principles. Section 3 describes more precisely the functionality of the Web/DBMS gateway. In Section 4 we draw conclusions and present prospects for the evolution of this environment.

2. Logical architecture

The logical architecture and usage principles of our environment are described in Figure 1. The software suite comprises three integrated tools:


- a database schema design tool: classically, starting from an entity-relationship schema, it generates the equivalent SQL code for a given DBMS (Oracle and MySQL for now), as well as an XML description of the application to be produced, expressed in the input language of the generator (this description can be modified by the programmer, since it does not necessarily match his needs). The SQL code can be run on the DBMS by any SQL interpreter. This tool is implemented in Java and Swing. The internal representation of schemas is in XML, which allows the SQL code to be generated by XSLT style sheets (this makes it easy to add new targets);

Figure 1: Logical architecture and usage principles (the entity-relationship editor produces the database schema in SQL and the XML description of the application; the application generator turns this description into HTML + JavaScript forms; the client Web browser submits them through an HTTP server to the Web/DBMS gateway via HTTP-CGI, which queries the DBMS and returns HTML)

- a gateway managing the interface with the DBMS, which allows one or more SQL queries to be interpreted and the produced result to be rendered in HTML;

- a Web application generator which, from an XML description of the application, builds the HTML forms that interface with the gateway. These forms can be stored on any machine (not necessarily the one hosting the DBMS) and interpreted by Web browsers supporting JavaScript. The generator is developed in Java, using the JAXP programming interface to manipulate XML documents.



3. Web/DBMS gateway

Its role is to interpret one or more SQL queries against the DBMS and to return the retrieved result rendered in HTML. All the SQL queries are interpreted within the same transaction. The originality of this gateway lies in the rather advanced parameterization it allows for the HTML rendering. The idea is to use the data dictionary to obtain information useful for the generation. For example, from the mere name of a relation, a form for inserting a tuple into that relation can be generated.

For now, the gateway is implemented as a CGI program written in C using Oracle's Pro*C programming interface. On the client side, the gateway is programmed through a set of variables with fixed names (for example, UID for the Oracle connection string, SQLSTATEMENT for the SQL statements to interpret, etc.).

To extend the functionality of the gateway, we have introduced several interaction modes:

- normal mode: the gateway interprets the SQL statements and renders the result in HTML;

- insert/delete/update modes: the gateway generates an HTML form and JavaScript code for inserting/deleting/updating a tuple in a relation given as a parameter;

- QBE mode: the gateway generates an HTML form and JavaScript code for QBE-style querying of a relation given as a parameter;

- hypertext mode: the gateway interprets the SQL statements and, for SELECT queries, generates for each foreign key value a form giving access to the tuple referenced by the key. This mode makes it possible to browse a database "hypertext-style" via its foreign keys;

- copy mode: the gateway interprets a SELECT query and generates for each returned value a form allowing that value to be copied into a form variable. This mode makes it possible to build drop-down menus whose values are extracted from the database.

4. Conclusions and prospects

This prototyping environment is fully operational and has been used in many student projects at INT for several years now. We have, for example, developed an entirely Web-based SQL interpreter, which allows us to dispense with products such as Oracle's sqlplus. We also use the environment in lab sessions, where it enables students to build a database application in a very short time (3 to 6 hours). It is also used for internal management applications at INT.


It is available upon simple request to the author for higher-education institutions. The most important ongoing evolutions concern independence from the DBMS. The gateway is being redeveloped as a Java servlet, which will make it possible to choose the DBMS, to manage DBMS connections more efficiently, and to use multithreading.
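As an illustration of this migration, here is a minimal sketch of what the servlet version of the gateway's normal mode could look like. Reusing the SQLSTATEMENT request variable, the class name, and the JDBC wiring are all assumptions made for the example; the paper only states that a servlet re-implementation is in progress.

import java.io.PrintWriter;
import java.sql.*;
import javax.servlet.http.*;

// Hypothetical servlet version of the Web/DBMS gateway ("normal" mode only):
// it reads an SQL query from the SQLSTATEMENT variable, runs it over JDBC,
// and renders the result set as an HTML table.
public class GatewayServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws java.io.IOException {
        String sql = req.getParameter("SQLSTATEMENT");
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body><table border='1'>");
        // A real gateway would take the connection from a pool; it is
        // hard-coded here to keep the sketch self-contained.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@localhost:1521:db", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            out.println("<tr>");
            for (int i = 1; i <= md.getColumnCount(); i++)
                out.println("<th>" + md.getColumnName(i) + "</th>");
            out.println("</tr>");
            while (rs.next()) {
                out.println("<tr>");
                for (int i = 1; i <= md.getColumnCount(); i++)
                    out.println("<td>" + rs.getString(i) + "</td>");
                out.println("</tr>");
            }
        } catch (SQLException e) {
            out.println("<tr><td>SQL error: " + e.getMessage() + "</td></tr>");
        }
        out.println("</table></body></html>");
    }
}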

Another ongoing work consists in providing a JSP (Java Server Pages) implementation of the gateway. In this context, we propose to define a library of specialized tags (a JSP taglib) encapsulating the gateway's main interaction modes. This should make the programming of JSP pages much easier, while retaining the JSP advantage of better control over the HTML rendering.

Acknowledgements

The implementation of this environment was carried out in part through numerous INT student projects, and we thank the students involved for the work they did.

References

[ASP] http://msdn.microsoft.com/asp

[FLO 00] D. Florescu, I. Manolescu, A. Levy, D. Suciu, « Declarative Specification of Web Sites with Strudel », VLDB Journal, Vol. 9, No. 1, 2000

[J2EE] http://java.sun.com/products/j2ee

[JSP] http://java.sun.com/products/jsp

[PHP] http://www.php.net

[PHPMYADMIN] http://sourceforge.net/projects/phpmyadmin

[WEBDB] http://otn.oracle.com/products/webdb/content.html


Author index

ABDESSAL T. 19
ABITEBOUL S. 65, 229, 327
AKOKA J. 171, 197
AMANN B. 39
AUSSEM A. 487
BEERI C. 39
BENJELLOUN O. 229
BERNARD G. 501
BOUGANIM L. 257
BOUZEGHOUB A. 509
BRUNO E. 529
CARPENTIER C. 509
CASALI A. 445
CHADRIDON S. 501
CICCHETTI R. 445
COBENA G. 19, 65, 327
COLLET C. 83, 151, 517
COMYN-WATTIAU I. 197
CONAN D. 501
DECHAMBOUX P. 129
DEFUDE B. 509, 535
DE MARCHI F. 523
DRAPEAU S. 129
DU L. 343
DUITAMA J. F. 509
DUONG P-Q. 83, 151, 517
EL ABBADI A. 15
FABRET F. 257
FUNDULAKI I. 39
GANÇARSKI S. 105
GARCIA-BANUELOS L. 83, 517
HACID M-S. 283
HINNACH Y. 19
JEN T-Y. 469
JOMIER G. 307
LAKHAL L. 445
LEGER A. 283
LE MAITRE J. 529
LOPES S. 529
LOYER Y. 367, 405
LUCIA RONCANCIO C. 129
MANOLESCU I. 229, 257
MANOUVRIER M. 307
MASANES J. 327
MILO T. 229
MOUADDIB N. 383
MURISASCO E. 529
NAACKE H. 105
NEDELEC T. 517
NGUYEN B. 65
PACITTI E. 105
PEREZ-CORTES E. 151
PETIT J-M. 487, 523
PHAN LUONG V. 425
PIVERT O. 343
POGGI A. 65
PRAT N. 171
RASHIA G. 383
REY C. 283
RUKOZ M. 307
SAINT-PAUL R. 383
SCHOLL M. 39
SEDRATI G. 327
SHAPIRO M. 225
SIMON E. 257
SI-SAID CHERFI S. 197
SPYRATOS N. 367, 469
STRACCIA U. 405
TANAKA Y. 469
TOUMANI F. 283
VALDURIEZ P. 105
VIARA E. 363
VILLIN O. 501
WEBER R. 229
