New challenges of Search Engines
Katarzyna Wegrzyn-Wolska
ESIGETEL, École Supérieure d'Ingénieurs en Informatique et Génie des Télécommunications
Bratislava, 31 March 2008
Outline
Introduction
SE: general problems
  Future of Search
  Economics of SE
  Search Quality
  Personalisation and Profiling
  Privacy and Search Engines
  Intellectual Property and Copyright
  Detecting Spam Indexing
  Multimedia Search
  Mobility, Local and Social Media
Conclusion
Introduction
Subject: New challenges of future search engines
Motivation
  Importance of topic ...
  Participation in European Commission projects
    Expert in FP6 & FP7
Objectives
  Discuss the problems
    How to search and evaluate the data?
  Solutions or ...?
SE a Key Enabling Technology
85% of all Internet traffic comes from search engines
10 Bn text pages accessible through SE like Google
9.6 Bn searches (US, December 2007)
  up 15% over last year (Google: 30%)
  total: 113 billion searches in 2007
SE ads: > 10 Bn € worldwide today, expected 22 Bn € in 2010
  extremely cost-effective for business players, with clear and measurable ROI (Return on Investment)
Cultural data: digital libraries (indexed by powerful search tools)
  SE: key to ensuring cultural and language diversity
Other problems ...
Multiplicity of data formats and indexing:
  text, images, audio, 3D...
Integration of other technologies:
  satellite/airplane pictures (e.g. Google Earth)
New forms of data exchange:
  peer-to-peer vs client-server
Integration with protected/encrypted formats:
  DRM interfaces
User-tagged networks of knowledge
Personalization according to user search "history"
Search Landscape in 2007
Three major "mainframes": Google, Yahoo, and MSN
  >800 M searches daily
  60% international
  10^6 machines
$20 Bn in paid search revenues
Large indices
  Billions of documents
  Petabytes of data
Source: Search Engine Land: US web search share, NetRatings, August 2007
Enid Burns, Search Engine Watch, Feb 1, 2008
Power of GOOGLE ?
What we expect from SE in future
Most challenging (research) issues:
  economic opportunity and markets
  cultural diversity "spectrum"
  data/formats and content explosion in the future
  demands for audiovisual (multimedia)
  mobile search
  impact of user behaviour and the way users interact with online information systems
  specialisation versus generic search models and technologies (vertical search)
    (e.g. in the context of specific application environments such as health or education)
Future of Search
John Battelle: SearchBlog and Battelle Media
SE as a Platform… and more than a Platform
  a new interface to computing
  beginning of a new customer-driven culture
Rise of Conversational Media
  users interact with services…
  conversational models (conversation economy)
Business transition to conversational models
  smart companies see an opportunity online…
  possibility to have a conversation with the customers…
Web 2.0: Architecture of Participation
  user-generated content
  the force of many to create advantage and build network effects
Future of Search: SE as a Platform
John Battelle: SearchBlog and Battelle Media
SE, a new platform:
  Remember DOS?
  After DOS... Windows...
  And now?
    Search is an interface
  In future? A new platform for computing?
Future of Search: SE as a Platform
Platform to computing
  Like Spotlight (Mac OS)
Future of Search : Conversation
John Battelle: SearchBlog and Battelle Media
[Figure: participants (mm, axis ticks 10 to 1,000) and industry size ($bb, axis ticks 50 to 5,000) from 1970 to 2010, in three phases: talk with back office; talk between front and back office; talk with customers (Web 2.0…)]
Economics of SE
Prof. Hal Varian; Chief Economist, Google, and Professor at UC Berkeley
What services do search engines provide?
  Google as matchmaker
    Matches up those seeking info to those having info
    Matches up buyers with sellers
  Ads are highly effective due to high relevance
  But even so, advertising still requires scale
    2% of ads might get clicks
    2% of clicks might convert
    So only 4 out of ten thousand who see an ad actually buy
    Price per click (PPC) will not be large
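The funnel arithmetic on this slide can be checked with a short Python sketch. The two 2% rates come from the slide; the impression count, the profit per sale, and the function names `expected_buyers` and `break_even_cpc` are illustrative assumptions, not figures from the talk.

```python
def expected_buyers(impressions: int, click_rate: float, conversion_rate: float) -> float:
    """Expected number of purchases from a given number of ad impressions."""
    return impressions * click_rate * conversion_rate


def break_even_cpc(profit_per_sale: float, conversion_rate: float) -> float:
    """Highest price per click an advertiser can pay without losing money."""
    return profit_per_sale * conversion_rate


# Rates quoted on the slide; the impression count is an arbitrary example.
buyers = expected_buyers(10_000, click_rate=0.02, conversion_rate=0.02)

# Hypothetical profit of 50 (any currency) per sale.
max_cpc = break_even_cpc(profit_per_sale=50.0, conversion_rate=0.02)

print(f"{buyers:.0f} buyers per 10,000 impressions")
print(f"break-even price per click: {max_cpc:.2f}")
```

Under these assumed numbers an advertiser cannot pay more than a small fraction of the sale profit per click, which is one way to read the slide's point that the price per click stays low and that advertising needs scale.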
Economics of SE
Google
  Brin & Page tried to sell their algorithm to Yahoo for $1 million
    Yahoo wouldn't buy
  Formed Google with no real idea of how they would make money
  Put a lot of effort into improving the algorithm
Availability of real-time data allows for fine tuning, constant improvement:
  each query is tested on 4000 new algorithms (Google)
Why online businesses are different
  Online businesses can continually experiment
    Japanese term: kaizen = "continuous improvement"
  Hard to really do continuously for offline companies
    Manufacturing, services
  Very easy to do online
    Leads to very rapid (and subtle) improvement
Search Quality
What's the goal? User satisfaction
  Understand user intent
    Problems: ambiguity and context
  Generate relevant matches
    Problems: scale and accuracy
  Present useful information
    Problems: ranking and presentation
Quality dimensions
  Ranking
  Freshness
  Presentation
Search Quality
Dr. Jan Pedersen; Yahoo Search
Eye-tracking studies
  Golden Triangle
    Top left corner
  Quick scan
    For candidates
  Longer scan
    For relevance
Search Quality
Dr. Daniel Russell; Google
What do people think when searching?
  "Jaguar": Mac OS? car? cat?
    Central America: more likely the cat (probably not the car)
    a good response is a personalisation problem
Special studies:
  how people think
  mental model
  qualitative reactions
  expectations
  analysing user behaviour:
    e.g. why 50% of clicks to the Advanced Search page?
Personalisation and Profiling
Dr. Jaime Teevan: Microsoft Research
User profiling: user data
  Persistent demographic information (age, gender, zip code, …)
  Dynamic interests (music, travel, …)
  User environment (locations, browser, connection speed, …)
  User transaction history (seasonal purchases, spending patterns, …)
  User behavior at the Web site
The means of gathering user data vary widely:
  Static form-based profile:
    input by the user (explicit involvement of the user)
  Dynamic profile:
    automatically derived by the server by tracking user behavior (implicit involvement of the user)
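The difference between the two kinds of profile can be sketched in Python. This is a minimal illustration, not a description of any system named in the talk: the class names, field names, and the topic-counting scheme are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StaticProfile:
    """Explicit data the user typed into a form (field names are hypothetical)."""
    age: int
    gender: str
    zip_code: str


@dataclass
class DynamicProfile:
    """Implicit data derived by the server from tracked behavior."""
    interest_counts: Counter = field(default_factory=Counter)

    def record_visit(self, topic: str) -> None:
        # Each tracked page view increments the counter for its topic.
        self.interest_counts[topic] += 1

    def top_interests(self, n: int = 2) -> list[str]:
        # The most visited topics stand in for the user's "dynamic interests".
        return [topic for topic, _ in self.interest_counts.most_common(n)]


static = StaticProfile(age=30, gender="f", zip_code="77210")
dynamic = DynamicProfile()
for topic in ["music", "travel", "music", "news", "music", "travel"]:
    dynamic.record_visit(topic)
print(dynamic.top_interests())
```

The static profile never changes unless the user edits the form, while the dynamic one is rebuilt from behavior without the user doing anything, which is the explicit/implicit distinction the slide draws.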
Privacy and Search Engines
Chris Jay Hoofnagle; Samuelson Clinic; Berkeley Ctr. for Law and Tech
Collecting personal data?
  Search engines mediate access to content
    central point of privacy vulnerability
  Search query:
    access or retention?
What is personally identifiable information?
  Information to identify?
  Metadata and data about others may identify you too
Personalization to customization
  Tracking is present, even on sites with "sensitive" topics
  Goal: to present ads across multiple platforms (desktop, laptop, Xbox)
Privacy and Search Engines
AOL query search
  AOL published 20M queries from 600,000 users (users are uniquely enumerated)
  Uncensored queries for three months of the AOL search service, spring 2006
  Essentially public domain
  Contains dangerous private information
    Some users are easy to identify
    Users vanity-searched their name, SSN
  E.g. grep for credit-card patterns produces the following:
    grep -i -e "[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}-[0-9]\{4\}" *.txt
* 9006-0512-xxxx-xxx
* 1550-0905-xxxx-xxxx
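The same search can be sketched in Python. grep's basic-regex repetition `\{4\}` is written `{4}` in Python's `re` syntax; the sample lines, the `mask` helper, and the masking format are made up for illustration and contain no real data.

```python
import re

# Python equivalent of the grep pattern: four groups of four digits.
CARD_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}")


def mask(match: re.Match) -> str:
    """Keep the first two groups and hide the rest, as on the slide."""
    return match.group(0)[:9] + "-xxxx-xxxx"


# Made-up query log lines; no real data.
queries = [
    "cheap flights paris",
    "is 1234-5678-9012-3456 a valid card",
    "order 1234-5678 status",
]

# Keep only the lines that match, with the matched number masked.
hits = [CARD_PATTERN.sub(mask, q) for q in queries if CARD_PATTERN.search(q)]
print(hits)
```

A pattern this simple only finds dash-separated numbers, so it is a lower bound on what such a log may expose.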
Privacy and Search Engines
Looking for SSNs:
  grep -i -e "\b[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}\b" *.txt
* kristy nicole vega hammond la. social secruity number 437-67-xxxx birth date 03 08 xx drivers license number la. 00765xxxx address 41178 rene dr. hammond la.
* pamela button 079-60-xxxx
* thomas j finney socsec 370-40-xxxx
* 419-94-xxxx thomas black
* 458-87-xxxx seguro social
Grep for email addresses:
  ([a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.)
  returns another 60 results
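A hedged Python sketch of the SSN and email searches, counting matching lines rather than printing them. Both patterns are transcribed from the slide into Python's `re` syntax; the sample queries are invented and contain no real data.

```python
import re

# SSN-shaped strings: three digits, two digits, four digits, on word boundaries.
SSN_PATTERN = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")
# The slide's email pattern, unchanged; "*" also accepts an empty name or domain.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_\-]*@[a-zA-Z0-9_\-]*\.")

# Made-up query log lines; no real data.
queries = [
    "john doe 000-12-3456",
    "contact jane_doe@example.org",
    "phone 555-1234",
    "weather bratislava",
]

ssn_hits = [q for q in queries if SSN_PATTERN.search(q)]
email_hits = [q for q in queries if EMAIL_PATTERN.search(q)]
print(len(ssn_hits), len(email_hits))
```

Because the email pattern uses `*` rather than `+`, it would also match a bare `@.`; replacing `*` with `+` is a stricter variant, at the cost of departing from the pattern shown on the slide.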
Privacy and Search Engines
Google's Search Policy
Source: Search Privacy Practices: A Work In Progress, CDT Report, August 2007
Intellectual Property and Copyright
Jason Schultz, Intellectual Property Attorney, Electronic Frontier Foundation (EFF)
Copyright threats to search
  Search engines copy, index, and distribute information to millions of people
  What about spiders, linking, images, books?
Search engine strategies
  Implied permission
  Linking, not hosting (for the most part)
    Linking to copyrighted works is generally not an infringement, unless
      you knew the link leads directly to infringing material
Intellectual Property and Copyright
Image Search
Copyright: Image Search
Copyright issues in image search
  Capturing the image
  Making and storing a thumbnail
  Displaying thumbnails in response to keyword searches
  Providing a link to the original picture page
Is it legal? Perfect 10 v. Google
  Google says:
    They spider everything
    They can't tell who's infringing until somebody notifies them
    It's fair use to make an image directory
    Image search is an important public resource
  Court says:
    First decision: it's legal
    P10 opposition (opinion amended on December 3, 2007):
      "Image Search" tool illegally reproduced and displayed P10 photos when it returned thumbnail results and framed third-party websites in response to search terms
Intellectual Property and Copyright
Google Book Search
3 kinds of books:
  classic, totally public, without copyright
  with copyright & publisher permission to index
  with copyright & without publisher permission to index
Copyright: Google Book Search
Authors Guild v. Google
  Authors Guild says:
    We sell books
    You borrowed books from the libraries and copied them without paying us
    You make money
    We want money
    Pay us
    This will help you sell books
  Google says:
    We had to copy books to make an index
    No one sees more than a few lines at a time
    We link to where you can buy/borrow
    Book search is important to public access
    This will help you sell books
Detecting Spam Indexing
Dr. Marc Najork: Microsoft Research
Only highly placed sites in SE results (for some queries) benefit from SE referrals
How to increase SE referrals:
  Buy keyword-based advertisements
  Improve the ranking of your pages
    Provide genuinely better content, or
    "Game" the system
SEO business (Search Engine Optimization)
  Some SEOs are ethical
  Some are not …
Taxonomy of web spam techniques: keyword stuffing, link spam, cloaking
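As an illustration of the first technique in the taxonomy, here is a deliberately naive keyword-stuffing check in Python. It is not the detection method from the talk: the heuristic (share of the page taken by its most frequent word), the function names, and the 0.3 threshold are all assumptions chosen for the example.

```python
import re
from collections import Counter


def top_term_share(text: str) -> float:
    """Fraction of all words taken up by the single most frequent word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return Counter(words).most_common(1)[0][1] / len(words)


def looks_stuffed(text: str, threshold: float = 0.3) -> bool:
    # The 0.3 threshold is an arbitrary illustrative value.
    return top_term_share(text) > threshold


normal = "We compare search engines by ranking quality, freshness and presentation."
stuffed = "cheap watches cheap watches buy cheap watches cheap cheap watches online"
print(looks_stuffed(normal), looks_stuffed(stuffed))
```

On longer real pages a common word such as "the" would dominate the count, so a practical version would at least ignore stop words; link spam and cloaking need entirely different signals (link graph structure, and comparing what crawlers and browsers are served).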
Multimedia Search
Dr. Lynn Wilcox: FXPal
What is multimedia?
  ACM Special Interest Group on Multimedia 2003
    More than one medium (text, images, audio, video) that are correlated
    Examples:
      Time-correlated: video with text transcript of the audio
      Spatially correlated: images on a page with associated text
  A less strict definition: not "just" text: images, audio, video
Interface to search:
  Images, audio, video
Multimedia Search
Text search
  Keywords
Image search
  Search based on tags (Flickr, Facebook)
  Search based on surrounding text (Google)
  Content-based search
    Using image features
    Using faces
Audio search
  Search based on metadata (iTunes)
  Content-based search (MuscleFish, Foote)
Video search
  Search based on text (Google/YouTube)
  Search based on associated media (lectures with slides)
  Search based on content (TRECVID news search)
    MediaMagic
How about Mobility & Mobile Search?
~2 Bn mobile users today, 1.5 Bn GSM users worldwide (3 Bn in 2010)
~75% of terminals equipped with Internet access in the medium term
Mobility imposes very specific search
  content search & other technologies (location)
  heterogeneous mobile-fixed environments
Mobile search
  iPhone: mobile traffic has become a real possibility for real-time search needs
  WML (Wireless Markup Language)
  real chance to thread local search into mobile media needs
Local Search
30% of all search engine queries contained a zip code, city name, or state
Local needs and mobile search:
  potential to turn local search into a "modern Yellow Pages" in real time
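How a query might be recognised as "local" (containing a zip code, city name, or state) can be sketched in Python. This is a toy illustration under stated assumptions: the four-entry place list, the US-style five-digit zip pattern, and the function name `has_local_intent` are hypothetical, and the sample queries are invented.

```python
import re

# Tiny illustrative gazetteer; a real system would use a full place-name database.
KNOWN_PLACES = {"bratislava", "paris", "california", "texas"}
ZIP_PATTERN = re.compile(r"\b[0-9]{5}\b")  # US-style five-digit zip code


def has_local_intent(query: str) -> bool:
    """True if the query contains a zip code or a known place name."""
    if ZIP_PATTERN.search(query):
        return True
    return any(word in KNOWN_PLACES for word in re.findall(r"[a-z]+", query.lower()))


queries = ["pizza 94704", "hotels in Bratislava", "python tutorial", "plumber texas"]
local_share = sum(has_local_intent(q) for q in queries) / len(queries)
print(f"{local_share:.0%} of sample queries look local")
```

Word-by-word lookup misses multi-word names such as "New York" and will misread any five-digit number as a zip code, so the sketch only shows the shape of the measurement behind a figure like the 30% quoted on the slide.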
Social Media Explosion
http://www.visualcomplexity.com
Social Media: Explosion
Facebook
And others …
How do they earn money?
  access to the data for ad-targeting purposes
Social Media: Explosion of the Blogs
> 60 million blogs
Link connections
  green: one-way
  blue: reciprocal
White dots: individual blogs
1 - DailyKos 500K/day
2 - Boingboing
3 - LiveJournal (isolated community)
4 - "blue blob": balanced discourse (most links are reciprocal)
5&6 - "outlying blue island"
The Hyperbolic Blogosphere 2007, Matthew Hurst, http://tinyurl.com/2nbwo6
SE : 2007 Predictions and Scorecard
Sage Lewis, Search Engine Watch, Jan 17, 2008
ReadWriteWeb, Web Marketing, Watch WebProNews
Top SE year-end predictions of 2007:
  RSS will go mainstream in a big way
  The explosion of widgets
  Semantic Web products (Twine)
  Browser wars between IE7 and Firefox
  Virtual world businesses
  AOL acquired
  And most of all: the social revolution!
2007 scorecard:
  interesting and thoughtful
Search Marketing Predictions for 2008
Kevin Newcomb, Search Engine Watch, Jan 23, 2008
Local search starts to make an impact
Social search will finally be useful
  not just for friends
Education and training will be important
Google's policy on privacy issues
"People-driven", "brain-power" SE: success or failure?
  Cha-Cha, Mahalo
Vertical searches will have shake-ups (maybe health?)
Widgets: another online presence
  now just like a website, a blog, or a social page
China's share of global search will increase
  Baidu, or even Google and Yahoo China
Yahoo will be someone to watch in 2008
Search Engine: General Predictions
What will the future scorecard be?
Who knows the answers?
Thank you very much for your time today and for your attention