Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach
Abhishek Gattani1, Digvijay S. Lamba1, Nikesh Garera1, Mitul Tiwari2, Xiaoyong Chai1, Sanjib Das3, Sri Subramaniam1, Anand Rajaraman4, Venky Harinarayan4, AnHai Doan1,3
1@WalmartLabs, 2LinkedIn, 3University of Wisconsin-Madison, 4Cambrian Ventures
ABSTRACT
Many applications that process social data, such as tweets, must extract entities from tweets (e.g., “Obama” and “Hawaii” in “Obama went to Hawaii”), link them to entities in a knowledge base (e.g., Wikipedia), classify tweets into a set of predefined topics, and assign descriptive tags to tweets. Few solutions exist today to solve these problems for social data, and they are limited in important ways. Further, even though several industrial systems such as OpenCalais have been deployed to solve these problems for text data, little, if anything, has been published about them, and it is unclear whether any of these systems has been tailored for social media.
In this paper we describe in depth an end-to-end industrial system that solves these problems for social data. The system has been developed and used heavily in the past three years, first at Kosmix, a startup, and later at WalmartLabs. We show how our system uses a Wikipedia-based global “real-time” knowledge base that is well suited for social data, how we interleave the tasks in a synergistic fashion, how we generate and use contexts and social signals to improve task accuracy, and how we scale the system to the entire Twitter firehose. We describe experiments that show that our system outperforms current approaches. Finally we describe applications of the system at Kosmix and WalmartLabs, and lessons learned.
1. INTRODUCTION
Social media refers to user generated data such as tweets, Facebook updates, blogs, and Foursquare checkins. Such data has now become pervasive, and has motivated numerous applications in e-commerce, entertainment, government, health care, and e-science, among others.
Many such applications need to perform entity extraction, linking, classification, and tagging over social data. For example, given a tweet such as “Obama gave an immigration speech while on vacation in Hawaii”, entity extraction determines that string “Obama” is a person name, and that “Hawaii” is a location. Entity linking goes one step further, inferring that “Obama” actually refers to an entity in a knowledge base, for example, the entity at URL en.wikipedia.org/wiki/Barack_Obama, and that “Hawaii” refers to the entity at URL en.wikipedia.org/wiki/Hawaii. Classification assigns a set of predefined topics to the tweet, such as “politics” and “travel”. Finally, tagging assigns descriptive tags to the tweet, such as “politics”, “tourism”, “vacation”, “President Obama”, “immigration”, and “Hawaii”, the way a person may tag a tweet today.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th-30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 11. Copyright 2013 VLDB Endowment 2150-8097/13/09... $10.00.
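To make the four tasks concrete, the sketch below shows the kind of output each task might produce for the example tweet, as a simple dictionary. The schema and field names are hypothetical, for illustration only; they are not the paper's actual data model.

```python
# Hypothetical output of the four tasks for the example tweet.
# All field names ("mentions", "links", "topics", "tags") are invented
# for this sketch; the paper does not specify an output schema.

tweet = "Obama gave an immigration speech while on vacation in Hawaii"

result = {
    # entity extraction: mention strings with coarse types
    "mentions": [
        {"text": "Obama", "type": "person"},
        {"text": "Hawaii", "type": "location"},
    ],
    # entity linking: mention -> knowledge-base (Wikipedia) URL
    "links": {
        "Obama": "en.wikipedia.org/wiki/Barack_Obama",
        "Hawaii": "en.wikipedia.org/wiki/Hawaii",
    },
    # classification: topics drawn from a predefined set
    "topics": ["politics", "travel"],
    # tagging: free-form descriptive tags, as a person might assign
    "tags": ["politics", "tourism", "vacation", "President Obama",
             "immigration", "Hawaii"],
}

for mention in result["mentions"]:
    print(mention["text"], "->", result["links"][mention["text"]])
```

Note that classification draws from a closed topic vocabulary, while tagging is open-ended; the two tasks overlap (e.g., “politics” appears in both) but are not the same.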
Entity extraction, a.k.a. named entity recognition (NER), and text classification are well-known problems that have been around for decades (e.g., [4, 23]), while entity linking and tweet tagging are newer problems that emerged in the past few years. Nevertheless, because of their importance to a large variety of text-centric applications, these problems have received significant and increasing attention.
Despite this attention, few solutions exist today to solve these problems for social media, and these solutions are limited in several important ways. First, the solutions often “recycle” techniques developed for well-formed English texts. A significant amount of social data, however, consists of misspelled, ungrammatical short sentence fragments, and is thus ill-suited for these techniques. Second, the solutions often employ computation-intensive techniques that do not scale to high-speed tweet streams of 3000-6000 tweets per second.
Third, existing solutions typically do not exploit context information, such as topics that a Twitter user often tweets about. As we show in this paper, since many types of social data (especially tweets and Facebook updates) are often very short (e.g., “go Giants!”), it is critical that we infer and exploit context information to improve accuracy. Fourth, existing solutions typically do not exploit social signals, such as traffic on social sites (e.g., Wikipedia, Pinterest), even though such signals can greatly improve accuracy.
Finally, most current solutions address only a single problem, in isolation, even though as we show later in this paper, addressing all four problems in a synergistic fashion can further improve the overall performance.
In the past few years, several industrial systems to extract, link, classify and tag text data, such as OpenCalais at opencalais.com, have also been deployed on the Web (see the related work section). However, little, if any, has been published about these systems, and as far as we know, none of these deployed systems has been specifically tailored for social media.
In this paper we describe an end-to-end industrial system that extracts, links, classifies, and tags social data. To the best of our knowledge, this is the first paper that describes such a system in depth. The system has been developed and used heavily since 2010, first at Kosmix, a startup that performed semantic analysis of social data, then later, since mid-2011, at WalmartLabs, a research and development lab for Walmart (which acquired Kosmix). At WalmartLabs, the system has been used extensively to process tweets, Facebook updates, and other types of social data, to power a variety of e-commerce applications (see Section 5).
Even though our system can handle many types of social data (as well as a variety of text documents, see Section 5), for expository reasons in this paper we will focus on handling tweets. Our system differs from current systems in the following important ways:
Using a Global and “Real-Time” Knowledge Base: Our knowledge base (which we use to find and link to entities mentioned in tweets) is built from Wikipedia. Wikipedia is global, in that it contains most concepts and instances judged important in the world. Thus, it provides good coverage for the tasks. More importantly, it is “real time” in that contributors continuously update it with new entities that have just appeared in real-world events. This “real time” nature makes it especially well suited for processing social data, and in fact, we take certain measures to make it even more “real time” (see Section 3.1). In contrast, many current solutions use knowledge bases that are updated less frequently.
Synergistic Combination of the Tasks: Our system interleaves the four tasks of extraction, linking, classification, and tagging in a synergistic fashion. For example, given a tweet, we begin by performing a preliminary extraction and linking of entity mentions in that tweet. Suppose many such mentions link to many nodes under the subtree “Technology” in our knowledge base (KB). Then we can infer that “Technology” is a likely topic for the tweet, thereby helping classification. In return, if we have determined that “Technology” is indeed a topic for the tweet, then we can infer that string “apple” in the tweet likely refers to the node “Apple Corp.” in the KB, not the node “Apple (fruit)”, thereby helping entity linking.
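The interleaving idea above can be sketched in a few lines: confidently linked mentions vote for the tweet's topic, and the inferred topic in turn disambiguates ambiguous mentions. This is a minimal toy sketch, not the system's actual code; the KB contents and the topic-majority heuristic are assumptions for illustration.

```python
# Minimal sketch of extraction/linking <-> classification synergy.
# The toy KB below is invented: each candidate node records the topic
# subtree it lives under in the knowledge base.
from collections import Counter

KB = {
    "Apple Corp.":   {"topic": "Technology"},
    "Apple (fruit)": {"topic": "Food"},
    "iPhone":        {"topic": "Technology"},
    "Google":        {"topic": "Technology"},
}

def infer_topic(linked_nodes):
    """Classification helped by linking: the KB subtree containing
    the most confidently linked nodes becomes the tweet's likely topic."""
    votes = Counter(KB[node]["topic"] for node in linked_nodes)
    return votes.most_common(1)[0][0]

def disambiguate(candidates, topic):
    """Linking helped by classification: prefer the candidate node
    that lives under the tweet's inferred topic subtree."""
    for node in candidates:
        if KB[node]["topic"] == topic:
            return node
    return candidates[0]  # fall back to the first candidate

# Unambiguous mentions link first; "apple" has two candidate nodes.
confident_links = ["iPhone", "Google"]
topic = infer_topic(confident_links)
apple = disambiguate(["Apple (fruit)", "Apple Corp."], topic)
print(topic, apple)  # Technology Apple Corp.
```

The real system interleaves the tasks over a far larger KB and with richer scoring, but the feedback loop has this shape: each task's preliminary output constrains the others.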
Using Contexts and Social Information: Given a tweet such as “go Giants!”, without some contexts, such as knowing that this user often tweets about the New York Giants football team, it is virtually impossible to extract and link entities accurately. As another example, it is not possible to process the tweet “mel crashed, maserati gone” in isolation: we have no idea which person named Mel the user is referring to. However, if we know that in the past one hour, when people tweeted about Mel Gibson, they often mentioned the words “crash” and “maserati” (a car brand), then we can infer that “mel” likely refers to the node Mel Gibson in the KB. Our system exploits such intuitions. It collects contexts for tweets, Twitter users, hashtags, Web domains, and nodes in the KB. It also collects a large number of social signals (e.g., traffic on Wikipedia and Pinterest pages). The system uses these contexts and signals to improve the accuracy of the tasks.
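The “mel crashed, maserati gone” example can be sketched as a simple context-overlap score: each candidate KB node carries a context of words that recently co-occurred with it, and the candidate whose context best overlaps the tweet wins. The data structures, weights, and stemming here are all invented for illustration; the paper's actual context representation and scoring are richer.

```python
# Hypothetical sketch of context-based disambiguation. The per-node
# contexts below are invented; in the real system they would be mined
# from recent tweet streams (e.g., the past hour) and other signals.

node_contexts = {
    "Mel Gibson": {"crash": 0.9, "maserati": 0.8, "movie": 0.4},
    "Mel Brooks": {"movie": 0.7, "comedy": 0.6},
}

def score(node, tweet_words):
    """Sum the context weights of the tweet's words for a candidate node."""
    ctx = node_contexts[node]
    return sum(ctx.get(word, 0.0) for word in tweet_words)

words = ["mel", "crashed", "maserati", "gone"]
# Crude normalization for illustration only: map "crashed" to "crash".
stems = ["crash" if w.startswith("crash") else w for w in words]

best = max(node_contexts, key=lambda node: score(node, stems))
print(best)  # Mel Gibson
```

Because “crash” and “maserati” co-occurred heavily with Mel Gibson in the recent stream, his node outscores other Mels even though the tweet itself contains no disambiguating name.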
Other important features of our system include a minimal use of complex, time-intensive techniques, to ensure that we can process tweets in real time (at the rate of up to 6000 tweets per second), and the use of hand-crafted rules at various places in the processing pipeline to exert fine-grained control and improve system accuracy.

Figure 1: A tiny example of a KB

In the rest of this paper we first define the problems of
entity extraction and linking, and tweet classification and tagging. We then describe the end-to-end system in detail. Next, we present experiments that show that the current system outperforms