
Designing Multimodal Interactive Systems Using EyesWeb XMI

Gualtiero Volpe, Paolo Alborno, Antonio Camurri, Paolo Coletta, Simone Ghisio
University of Genova, DIBRIS, Genova, Italy

Maurizio Mancini, Alberto Massari, Radoslaw Niewiadomski, Stefano Piana, Roberto Sagoleo
University of Genova, DIBRIS, Genova, Italy

Copyright is held by the author/owner(s). AVI, June 07–10, 2016, Bari, Italy.

Abstract
This paper introduces the EyesWeb XMI platform (for eXtended Multimodal Interaction) as a tool for fast prototyping of multimodal systems, including the interconnection of multiple smart devices, e.g., smartphones. EyesWeb is endowed with a visual programming language enabling users to compose modules into applications. Modules are collected in several libraries and include support for many input devices (e.g., video, audio, motion capture, accelerometers, and physiological sensors), output devices (e.g., video, audio, 2D and 3D graphics), and synchronized multimodal data processing. Specific libraries are devoted to real-time analysis of nonverbal expressive motor and social behavior. The EyesWeb platform encompasses further tools such as EyesWeb Mobile, which supports the development of customized Graphical User Interfaces for specific classes of users. The paper reviews the EyesWeb platform and its components, starting from its historical origins, and with a particular focus on the Human-Computer Interaction aspects.

Author Keywords
Multimodal interactive systems; visual programming languages; EyesWeb XMI

ACM Classification Keywords
H.5.2 [Information interfaces and presentation (HCI)]: User interfaces



Introduction
Summer 1999, opening of the Salzburg Festival, Austria: in the music theatre opera Cronaca del Luogo by Italian composer Luciano Berio, a major singer (David Moss) plays a schizophrenic character, at times appearing wise and calm, at other times crazy, with nervous and jerky movements. Some of his movement qualities are automatically extracted by using sensors embedded in his clothes and a flashing infrared light on his helmet, synchronized with video cameras positioned above the stage. This information is used to morph the singer's voice from a profound (wise) to a harsh, sharp (crazy) timbre. The encounter with such concrete real-world applications of multimodal analysis and mapping was of paramount importance in shaping the requirements for our first publicly available version of EyesWeb [1].

In particular, the need for fast prototyping tools made us abandon the concept of EyesWeb as a monolithic application, to be recompiled and rebuilt after any minor change, and move to a more flexible approach, which was already being adopted by other software platforms, both in the tradition of computer music programming languages and tools and in other domains such as simulation tools for system engineering. EyesWeb was thus conceived as a modular software platform, where a user assembles single modules into an application by means of a visual programming language. As such, EyesWeb supports its users in designing and developing interactive multimodal systems in several ways, for example (i) by providing built-in input/output capabilities for a broad range of sensor and capture systems, (ii) by enabling users to easily define and customize how data is processed and feedback is generated, and (iii) by offering tools for creating a wide palette of interfaces for different classes of users.

Since then, EyesWeb has been reworked, improved, and extended over the years and has gone through five major versions, always available for free1. Nowadays, it is employed in various application domains, going beyond the original area of computer music and performing arts, and including, for example, active experience of cultural heritage, exergaming, education and technology-enhanced learning, and therapy and rehabilitation.

This paper is organized as follows: the next section presents some related work, i.e., other modular platforms endowed with a visual programming language, with particular reference to multimedia and multimodal systems; then, the major components of the EyesWeb platform are introduced; finally, the different classes of users for EyesWeb and the reasons that make it suitable for fast prototyping of applications, including the interconnection of smart objects, are discussed under an HCI perspective.

Related work
Whereas general-purpose tools such as, for example, Mathworks' Simulink have existed for a long time, platforms especially devoted to (possibly real-time) analysis of multimodal signals are far less common. Max [9] is a platform and a visual programming language for music and multimedia, originally conceived by Miller Puckette at IRCAM, Paris, and nowadays developed and maintained by Cycling '74. Born for sound and music processing in interactive computer music, it is also endowed with packages for real-time video, 3D graphics, and matrix processing. Pd (Pure Data) [10] is similar in scope and design to Max. It also includes a visual programming language and is intended to support the development of interactive sound and music computing applications.

1 http://www.casapaganini.org. The currently released version is EyesWeb 5.6.0.0.



The addition of GEM (Graphics Environment for Multimedia) enables real-time generation and processing of video, OpenGL graphics, images, and so on. Moreover, Pd is natively designed to enable live collaboration across networks or the Internet. vvvv [13] is a hybrid graphical/textual programming environment for easy prototyping and development. It has a special focus on real-time video synthesis and is designed to facilitate the handling of large media environments with physical interfaces, real-time motion graphics, audio, and video that can interact with many users simultaneously. Isadora [6] is an interactive media presentation tool created by composer and media artist Mark Coniglio. It mainly includes video generation, processing, and effects, and is intended to support artists in developing interactive performances. In the same field of performing arts, Eyecon [4] aims at facilitating interactive performances and installations in which the motion of human bodies is used to trigger or control various other media, e.g., music, sounds, photos, films, lighting changes, and so on. The Social Signal Interpretation framework (SSI) [14] offers tools to record, analyze, and recognize human behavior in real-time, such as gestures, mimics, head nods, and emotional speech. Following a patch-based design, pipelines are set up from autonomic components and allow the parallel and synchronized processing of sensor data from multiple input devices.

Whereas Max and Pd are especially devoted to audio processing, and vvvv to video processing, EyesWeb has a special focus on higher-level nonverbal communication, i.e., EyesWeb provides modules to automatically compute features describing the expressive, emotional, and affective content that multimodal signals convey, with particular reference to full-body movement and gesture. EyesWeb also has a somewhat wider scope with respect to Isadora, which especially addresses interactive artistic performances, and to SSI, which is particularly suited for the analysis of social interaction.

Figure 1: The overall architecture of the EyesWeb XMI platform

Finally, with respect to its previous versions, the current version of EyesWeb XMI encompasses enhanced synchronization mechanisms, improved management and analysis of time-series (e.g., with novel modules for the analysis of synchronization and coupling), extended scripting capabilities (e.g., a module whose behavior can be controlled through Python scripts), and a reorganization of the EyesWeb libraries, including newly supported I/O devices (e.g., Kinect V2) and modules for expressive gesture processing.

EyesWeb kernel, tools, libraries, and devices
Figure 1 shows the overall architecture of EyesWeb (the current version is named XMI, i.e., eXtended Multimodal Interaction). The EyesWeb Graphic Development Environment tool (GDE), shown in Figure 2, manages the interaction with the user and supports the design of applications (patches). An EyesWeb patch consists of interconnected modules (blocks).



Figure 2: A view of the EyesWeb Graphic Development Environment (GDE), with a sample patch for synchronizing motion capture, audio, and video recordings of the music performance of a string quartet.

A patch can be defined as a structured network of blocks that channels and manipulates a digital input dataflow, resulting in a desired output dataflow. Data manipulation can be done either automatically or through real-time interaction with a user. For example, to create a simple video processing patch, an EyesWeb developer would drag and drop an input video block (e.g., to capture the video stream from a webcam), a processing block (e.g., a video effect), and an output block (e.g., a video display). The developer would then connect the blocks and may also include some interaction with the user, e.g., by adding a block that computes the energy of the user's movement and connecting it to the video effect block to control the amount of effect to be applied.
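
As a rough textual analogue of that example (for readers more familiar with code than with patches), the following Python/OpenCV sketch reproduces the same dataflow outside EyesWeb: webcam input, a crude movement-energy estimate based on frame differencing, an effect whose strength is driven by that energy, and a video display. The energy measure and parameter values are illustrative assumptions, not the platform's actual modules.

    import cv2
    import numpy as np

    # Input block: webcam capture.
    cap = cv2.VideoCapture(0)
    ok, prev = cap.read()
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        # "Energy" block: mean absolute frame difference as a crude motion measure.
        energy = float(np.mean(cv2.absdiff(frame, prev))) / 255.0
        prev = frame
        # "Effect" block: blur strength grows with the user's movement energy.
        k = 1 + 2 * int(energy * 20)          # odd kernel size
        out = cv2.GaussianBlur(frame, (k, k), 0)
        # Output block: video display (press Esc to quit).
        cv2.imshow("output", out)
        if cv2.waitKey(1) == 27:
            break
    cap.release()
    cv2.destroyAllWindows()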

The build-up of an EyesWeb patch in many ways resembles that of an object-oriented program (a network of clusters, classes, and objects). A patch could therefore be interpreted as "a small program or application".

The sample patch displayed in Figure 2 was built for performing synchronized playback of multimodal data. In particular, this example displays motion capture (MoCap) data obtained from a MoCap system (Qualisys). The recorded MoCap data is synchronized with the audio and video tracks of the same music performance (a string quartet).

The EyesWeb kernel
The core of EyesWeb is its kernel. It manages the execution of patches by scheduling each block, handles data flow, notifies events to the user interface, and is responsible for enumerating and organizing blocks into catalogs consisting of sets of coherent libraries. The kernel works as a finite state machine with two major states: design-time and run-time. At design-time, users design and develop their patches, which are then executed at run-time. Patches are internally represented by a graph whose nodes are the blocks and whose edges are the links between the blocks. In the transition between design and run time, the kernel performs a topological sort of the graph associated with the patch to be started and establishes the order for scheduling the execution of each block.
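
A minimal sketch of that last step is shown below: a topological sort (Kahn's algorithm, in Python) applied to a patch graph to derive a block execution order. The graph representation and block names are hypothetical and do not reflect the kernel's internal data structures.

    from collections import defaultdict, deque

    def schedule(blocks, links):
        # blocks: iterable of block names; links: (source, target) connections.
        indegree = {b: 0 for b in blocks}
        successors = defaultdict(list)
        for src, dst in links:
            successors[src].append(dst)
            indegree[dst] += 1
        # Repeatedly emit blocks whose inputs have all been scheduled.
        ready = deque(b for b, d in indegree.items() if d == 0)
        order = []
        while ready:
            block = ready.popleft()
            order.append(block)
            for nxt in successors[block]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
        if len(order) != len(indegree):
            raise ValueError("patch graph contains a cycle")
        return order

    # Example: webcam -> energy -> effect -> display
    print(schedule(
        ["webcam", "energy", "effect", "display"],
        [("webcam", "energy"), ("webcam", "effect"),
         ("energy", "effect"), ("effect", "display")],
    ))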

The EyesWeb libraries
EyesWeb libraries include modules for image and video processing, audio processing, mathematical operations on scalars and matrices, string processing, time-series analysis, and machine learning (e.g., SVMs, clustering, neural networks, Kohonen maps, and so on). They also implement basic data structures (e.g., lists, queues, and labeled sets) and enable interaction with the operating system (e.g., for launching processes or operating on the filesystem). Particularly relevant in EyesWeb are the libraries for real-time analysis of nonverbal full-body movement and expressive gesture of single and multiple users [2], and the libraries for the analysis of nonverbal social interaction within groups [12].



The former include modules for computing features describing human full-body movement both at a local temporal granularity (e.g., kinematics, energy, postural contraction and symmetry, smoothness, and so on) and at the level of entire movement units (e.g., directness, lightness, suddenness, impulsivity, equilibrium, fluidity, coordination, and so on). The latter implement techniques that have been employed for analyzing features of the social behavior of a group, such as physical and affective entrainment and leadership. Techniques include, e.g., Recurrence Quantification Analysis [8], Event Synchronization [11], SPIKE Synchronization [7], and nonlinear asymmetric measures of interdependence between time-series.
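
To give a flavor of one such technique, the sketch below is a simplified, fixed-window variant of Event Synchronization [11] in Python: it counts quasi-simultaneous events (e.g., gesture onsets) in two event trains and normalizes by the number of events. It is an independent illustration for this text, not the code of the EyesWeb module.

    import numpy as np

    def event_synchronization(tx, ty, tau):
        """Fixed-window Event Synchronization (after Quian Quiroga et al. [11]).

        tx, ty: arrays of event times for two signals; tau: coincidence window.
        Q near 0 means no coincidences; Q near 1 means the two trains are
        tightly synchronized (assuming events are sparser than tau).
        """
        tx, ty = np.asarray(tx, dtype=float), np.asarray(ty, dtype=float)

        def count(a, b):
            # c(a|b): events in `a` that follow an event in `b` within tau
            # (exactly simultaneous events count as 1/2).
            c = 0.0
            for ta in a:
                dt = ta - b
                c += np.sum((dt > 0) & (dt <= tau)) + 0.5 * np.sum(dt == 0)
            return c

        denom = np.sqrt(len(tx) * len(ty))
        return (count(tx, ty) + count(ty, tx)) / denom if denom else 0.0

    # Two nearly synchronous event trains (times in seconds) give Q = 1.0.
    print(event_synchronization([1.0, 2.0, 3.1, 4.0], [1.05, 2.1, 3.0, 4.2], tau=0.25))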

Available Tools
Referring to Figure 1, in addition to the EyesWeb GDE already described above, the further available tools are:

• EyesWeb Mobile: an external tool supporting the design of Graphical User Interfaces linked to an EyesWeb patch. EyesWeb Mobile is composed of a designer and a runtime component. The former is used to design the visual layout of the interfaces; the latter runs together with an EyesWeb patch and communicates over the network with the EyesWeb kernel to receive results and control the patch remotely.

• EyesWeb Console: a tool for running patches from the command line, reducing the GDE overhead. On Windows, it runs either as a standard application or as a Windows service; additionally, it can be used to run patches on Linux and OS X.

• EyesWeb Query: a tool to automatically generate documentation from a specific EyesWeb installation, including the icon, description, and input and output datatypes of each block. Documentation can be generated as LaTeX (PDF), text, or MySQL.

• EyesWeb Register Module: a tool for adding new blocks (provided in a DLL) to an existing EyesWeb installation and using them to build and run patches. The new blocks extend the platform and may be developed and distributed by third parties.

Supported devices
EyesWeb supports a broad range of input and output devices, which are managed by a dedicated layer (devices) located between the kernel and the operating system. In addition to usual computer peripherals (mouse, keyboard, joystick and game controllers, and so on), input devices include audio (from low-cost motherboard-integrated to professional audio cards), video (from low-cost webcams to professional video cameras), motion capture systems (providing high-accuracy 3D coordinates of markers in environments with a fairly controlled setup), RGB-D sensors (e.g., Kinect for Xbox One, also known as Kinect V2, extracting 2D and 3D coordinates of relevant body joints, and capturing the RGB image, grayscale depth image, and infrared image of the scene), Leap Motion (a sensor device capturing hand and finger movement), accelerometers (e.g., onboard Android smartphones and connected via network by means of the Mobiles to EyesWeb app, see Figure 3; X-OSC sensors, and so on), Arduino, Nintendo Wiimote, RFID (Radio-Frequency Identification) devices (e.g., used to detect the presence of a user in a specific area), and biometric sensors (e.g., respiration, heart rate, skin conductivity, and so on). Output includes audio, video, 2D and 3D graphics, and possibly control signals to actuator devices (e.g., haptic devices, robots). Moreover, EyesWeb implements standard networking protocols such as, for example, TCP, UDP, OSC (Open Sound Control), and ActiveMQ. In this way, it can receive and send data from/to any device, including smart objects, endowed with network communication capabilities.
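
As an illustration of this network-based interconnection, the following Python sketch streams accelerometer-like samples over UDP towards a patch that exposes a network input. The host, port, and JSON payload layout are assumptions for illustration only; the actual message format depends on the network blocks used in the patch (OSC messages can be used for the same purpose).

    import json
    import socket
    import time

    HOST, PORT = "192.168.1.50", 7000          # hypothetical patch endpoint
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    for _ in range(100):
        # Placeholder sample; a real sender would read the device here.
        sample = {"t": time.time(), "acc": [0.0, 0.0, 9.81]}
        sock.sendto(json.dumps(sample).encode("utf-8"), (HOST, PORT))
        time.sleep(0.02)                       # ~50 Hz

    sock.close()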



Figure 3: A view of the Mobiles to EyesWeb Android application on the Google Play Store

Classes of users and interfaces
Users of EyesWeb can ideally be grouped into several classes.

EyesWeb end users usually do not deal directly with the platform, but rather experience the multimodal interactive systems that are implemented using it. Their interface is therefore a natural interface that they operate, e.g., by means of their expressive movement and gesture, to act upon multimedia content. In this way, end users can also interact with smart devices they may wear or hold (e.g., smartphones). Smart devices can either be used to directly operate on content, or the data they collect can be presented back to end users, e.g., using data sonification [5] and visualization technologies.

EyesWeb developers make applications (patches) using the EyesWeb GDE and its visual programming language. This language implements all the basic constructs of programming languages, such as sequences (a chain of interconnected blocks), conditional instructions (e.g., by means of switch blocks that direct the flow of data to a specific part of a patch when a given condition is matched), iterative instructions (by means of a specific mechanism that allows a given sub-patch to be executed repetitively), and subprograms (implemented as sub-patches). Because of the visual programming paradigm, EyesWeb developers do not need to be computer scientists or expert programmers. In our experience, EyesWeb patches have been developed by artists, the technological staff of artists (e.g., sound technicians), designers, content creators, students in performing arts and digital humanities, and so on. Still, some skills in the usage of computer tools and, especially, in algorithmic thinking are required. EyesWeb developers can exploit EyesWeb as a tool for fast prototyping of applications for smart objects: EyesWeb can receive data from such objects by means of its input devices, and the task of the developer is to design and implement in a patch the control flow of the application.

Supervisors of EyesWeb patches are users who supervise and control the execution of an EyesWeb application. They can both set parameters to customize the execution of the patch before running it and act upon the patch (e.g., by changing the value of some parameters) while it is running. Consider, for example, a teacher who customizes an educational serious game for her pupils and operates on the game as a child is playing with it, or a therapist who sets, e.g., the target and difficulty level of an exergame for a patient. Supervisors of EyesWeb patches do not need any particular skill in computer programming; indeed, they are usually a special kind of end user with specific expertise in the given application area. The EyesWeb Mobile tool endows them with traditional Graphical User Interfaces they can use for their task of customizing and controlling the execution of a patch. In this way, they do not need to go into the details of the EyesWeb GDE and they work with an interaction paradigm that is more familiar to them.



Figure 4: An EyesWeb Mobile interface developed for enabling a teacher to customize a serious game for children.

Moreover, the patch is protected from possible unwanted modifications. The EyesWeb Mobile interfaces can also work on mobile devices (e.g., tablets) to facilitate operations when the operator cannot stay at the computer (e.g., a therapist who needs to participate in the therapy session). This feature makes EyesWeb Mobile a tool that also suits remote configuration of smart object applications. Figure 4 shows an EyesWeb Mobile interface developed for enabling a teacher to customize a serious game for children.

EyesWeb programmers develop new software modules for the EyesWeb platform. They need to be skilled C++ programmers and are provided with the EyesWeb SDK, which enables them to extend the platform with third-party modules. In particular, the EyesWeb SDK enables the inclusion of new modules for interfacing possible smart objects that, e.g., do not communicate through standard networking protocols or are not yet supported by the platform.

Conclusion
EyesWeb is nowadays employed by thousands of users across several application domains. It has been adopted in both research and industrial projects by research centers, universities, and companies. A one-week tutorial, the EyesWeb Week, is organized every two years at our research center. In our experience at Casa Paganini - InfoMus, EyesWeb has been used in research projects that combine scientific research in information and communications technology (ICT) with artistic and humanistic research [3]. In this context, the platform provided the technological ground for artistic performances (e.g., in Allegoria dell'opinione verbale by R. Doati, Medea by A. Guarnieri, and Invisible Line by A. Cera, to name just some of them), for multimodal interactive systems supporting new ways of experiencing art and cultural heritage (e.g., the Museum of Bali in Fano, Italy; the Museum of La Roche d'Oëtre in Normandy, France; the Enrico Caruso Museum near Florence, Italy), for serious games in educational settings (e.g., The Potter and BeSound), and for exergames for therapy and rehabilitation (e.g., in an ongoing collaboration with the children's hospital Giannina Gaslini in Genova, Italy). EyesWeb is currently being improved and extended in the framework of the EU-H2020-ICT DANCE Project, investigating how sound and music can express, represent, and analyze the affective and relational qualities of body movement.

The need to support such a broad range of application domains required a trade-off between implementing general-purpose mechanisms and exploiting domain-specific knowledge. For example, on the one hand, support for generality sometimes entails somewhat reduced learnability and usability of the platform for less skilled EyesWeb developers, due to the increased complexity of the implemented mechanisms. On the other hand, some EyesWeb modules developed on purpose for specific application scenarios have a limited scope and are difficult to reuse. Future directions and open challenges include increased cross-platform interoperability and a tighter integration with cloud services and storage technologies.



Acknowledgements
This research has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 645553 (H2020-ICT Project DANCE, http://dance.dibris.unige.it). DANCE investigates how affective and relational qualities of body movement can be expressed, represented, and analyzed by the auditory channel.

REFERENCES
1. Antonio Camurri, Shuji Hashimoto, Matteo Ricchetti, Andrea Ricci, Kenji Suzuki, Riccardo Trocca, and Gualtiero Volpe. 2000. EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems. Computer Music Journal 24, 1 (April 2000), 57–69.

2. Antonio Camurri, Barbara Mazzarino, and Gualtiero Volpe. 2004. Analysis of expressive gesture: The EyesWeb expressive gesture processing library. In Gesture-based communication in human-computer interaction. Springer, 460–467.

3. Antonio Camurri and Gualtiero Volpe. 2016. The Intersection of Art and Technology. IEEE MultiMedia 23, 1 (Jan 2016), 10–17.

4. Eyecon. 2008. http://eyecon.palindrome.de/. (2008). Accessed: 2016-03-25.

5. Thomas Hermann. 2008. Taxonomy and Definitions for Sonification and Auditory Display. In Proceedings of the 14th International Conference on Auditory Display (ICAD 2008), Patrick Susini and Olivier Warusfel (Eds.). IRCAM.

6. Isadora. 2002. http://troikatronix.com/. (2002). Accessed: 2016-03-25.

7. Thomas Kreuz, Daniel Chicharro, Conor Houghton, Ralph G. Andrzejak, and Florian Mormann. 2012. Monitoring spike train synchrony. Journal of Neurophysiology (2012).

8. Norbert Marwan, M. Carmen Romano, Marco Thiel, and Jürgen Kurths. 2007. Recurrence plots for the analysis of complex systems. Physics Reports 438, 5–6 (2007), 237–329.

9. Max. 1988. http://cycling74.com/products/max/. (1988). Accessed: 2016-03-25.

10. Miller Puckette. 1996. Pure Data. In Proceedings of the International Computer Music Conference. 224–227.

11. Rodrigo Quian Quiroga, Thomas Kreuz, and Peter Grassberger. 2002. Event synchronization: A simple and fast method to measure synchronicity and time delay patterns. Physical Review E 66, 4 (Oct 2002), 041904.

12. Giovanna Varni, Gualtiero Volpe, and Antonio Camurri. 2010. A system for real-time multimodal analysis of nonverbal affective social interaction in user-centric media. IEEE Transactions on Multimedia 12, 6 (2010), 576–590.

13. vvvv. 1998. http://vvvv.org/. (1998). Accessed: 2016-03-25.

14. Johannes Wagner, Florian Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, and Elisabeth André. 2013. The Social Signal Interpretation (SSI) Framework: Multimodal Signal Processing and Recognition in Real-time. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 831–834.
