Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach
Abhishek Gattani1, Digvijay S. Lamba1, Nikesh Garera1, Mitul Tiwari2, Xiaoyong Chai1, Sanjib Das3, Sri Subramaniam1, Anand Rajaraman4, Venky Harinarayan4, AnHai Doan1,3
1@WalmartLabs, 2LinkedIn, 3University of Wisconsin-Madison, 4Cambrian Ventures
ABSTRACT
Many applications that process social data, such as tweets, must extract entities from tweets (e.g., “Obama” and “Hawaii” in “Obama went to Hawaii”), link them to entities in a knowledge base (e.g., Wikipedia), classify tweets into a set of predefined topics, and assign descriptive tags to tweets. Few solutions exist today to solve these problems for social data, and they are limited in important ways. Further, even though several industrial systems such as OpenCalais have been deployed to solve these problems for text data, little, if anything, has been published about them, and it is unclear whether any of these systems has been tailored for social media.
In this paper we describe in depth an end-to-end industrial system that solves these problems for social data. The system has been developed and used heavily in the past three years, first at Kosmix, a startup, and later at WalmartLabs. We show how our system uses a Wikipedia-based global “real-time” knowledge base that is well suited for social data, how we interleave the tasks in a synergistic fashion, how we generate and use contexts and social signals to improve task accuracy, and how we scale the system to the entire Twitter firehose. We describe experiments that show that our system outperforms current approaches. Finally we describe applications of the system at Kosmix and WalmartLabs, and lessons learned.
1. INTRODUCTION
Social media refers to user generated data such as tweets, Facebook updates, blogs, and Foursquare checkins. Such data has now become pervasive, and has motivated numerous applications in e-commerce, entertainment, government, health care, and e-science, among others.
Many such applications need to perform entity extraction, linking, classification, and tagging over social data. For example, given a tweet such as “Obama gave an immigration speech while on vacation in Hawaii”, entity extraction determines that string “Obama” is a person name, and that “Hawaii” is a location. Entity linking goes one step further, inferring that “Obama” actually refers to an entity in a knowledge base, for example, the entity at URL en.wikipedia.org/wiki/Barack_Obama, and that “Hawaii” refers to the entity at URL en.wikipedia.org/wiki/Hawaii. Classification assigns a set of predefined topics to the tweet, such as “politics” and “travel”. Finally, tagging assigns descriptive tags to the tweet, such as “politics”, “tourism”, “vacation”, “President Obama”, “immigration”, and “Hawaii”, the way a person may tag a tweet today.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th-30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 11. Copyright 2013 VLDB Endowment 2150-8097/13/09... $10.00.
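To make the four tasks concrete, the sketch below shows the kind of output each task might produce for the example tweet, as a simple dictionary. The schema and field names are hypothetical, for illustration only; they are not the paper's actual data model.

```python
# Hypothetical output of the four tasks for the example tweet.
# All field names ("mentions", "links", "topics", "tags") are invented
# for this sketch; the paper does not specify an output schema.

tweet = "Obama gave an immigration speech while on vacation in Hawaii"

result = {
    # entity extraction: mention strings with coarse types
    "mentions": [
        {"text": "Obama", "type": "person"},
        {"text": "Hawaii", "type": "location"},
    ],
    # entity linking: mention -> knowledge-base (Wikipedia) URL
    "links": {
        "Obama": "en.wikipedia.org/wiki/Barack_Obama",
        "Hawaii": "en.wikipedia.org/wiki/Hawaii",
    },
    # classification: topics drawn from a predefined set
    "topics": ["politics", "travel"],
    # tagging: free-form descriptive tags, as a person might assign
    "tags": ["politics", "tourism", "vacation", "President Obama",
             "immigration", "Hawaii"],
}

for mention in result["mentions"]:
    print(mention["text"], "->", result["links"][mention["text"]])
```

Note that classification draws from a closed topic vocabulary, while tagging is open-ended; the two tasks overlap (e.g., “politics” appears in both) but are not the same.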
Entity extraction, a.k.a. named entity recognition (NER), and text classification are well-known problems that have been around for decades (e.g., [4, 23]), while entity linking and tweet tagging are newer problems that emerged in the past few years. Nevertheless, because of their importance to a large variety of text-centric applications, these problems have received significant and increasing attention.
Despite this attention, few solutions exist today to solve these problems for social media, and these solutions are limited in several important ways. First, the solutions often “recycle” techniques developed for well-formed English texts. A significant amount of social data, however, consists of misspelled, ungrammatical short sentence fragments, and is thus ill-suited for these techniques. Second, the solutions often employ computation-intensive techniques that do not scale to high-speed tweet streams of 3000-6000 tweets per second.
Third, existing solutions typically do not exploit context information, such as topics that a Twitter user often tweets about. As we show in this paper, since many types of social data (especially tweets and Facebook updates) are often very short (e.g., “go Giants!”), it is critical that we infer and exploit context information to improve accuracy. Fourth, existing solutions typically do not exploit social signals, such as traffic on social sites (e.g., Wikipedia, Pinterest), even though such signals can greatly improve accuracy.
Finally, most current solutions address only a single problem, in isolation, even though as we show later in this paper, addressing all four problems in a synergistic fashion can further improve the overall performance.
In the past few years, several industrial systems to extract, link, classify and tag text data, such as OpenCalais at opencalais.com, have also been deployed on the Web (see the related work section). However, little, if any, has been published about these systems, and as far as we know, none of these deployed systems has been specifically tailored for social media.
In this paper we describe an end-to-end industrial system that extracts, links, classifies, and tags social data. To the best of our knowledge, this is the first paper that describes such a system in depth. The system has been developed and used heavily since 2010, first at Kosmix, a startup that performed semantic analysis of social data, then later, since mid-2011, at WalmartLabs, a research and development lab for Walmart (which acquired Kosmix). At WalmartLabs, the system has been used extensively to process tweets, Facebook updates, and other types of social data, to power a variety of e-commerce applications (see Section 5).
Even though our system can handle many types of social data (as well as a variety of text documents, see Section 5), for expository reasons in this paper we will focus on handling tweets. Our system differs from current systems in the following important ways:
Using a Global and “Real-Time” Knowledge Base: Our knowledge base (which we use to find and link to entities mentioned in tweets) is built from Wikipedia. Wikipedia is global, in that it contains most concepts and instances judged important in the world. Thus, it provides good coverage for the tasks. More importantly, it is “real time” in that contributors continuously update it with new entities that have just appeared in real-world events. This “real time” nature makes it especially well suited for processing social data, and in fact, we take certain measures to make it even more “real time” (see Section 3.1). In contrast, many current solutions use knowledge bases that are updated less frequently.
Synergistic Combination of the Tasks: Our system interleaves the four tasks of extraction, linking, classification, and tagging in a synergistic fashion. For example, given a tweet, we begin by performing a preliminary extraction and linking of entity mentions in that tweet. Suppose many such mentions link to many nodes under the subtree “Technology” in our knowledge base (KB). Then we can infer that “Technology” is a likely topic for the tweet, thereby helping classification. In return, if we have determined that “Technology” is indeed a topic for the tweet, then we can infer that string “apple” in the tweet likely refers to the node “Apple Corp.” in the KB, not the node “Apple (fruit)”, thereby helping entity linking.
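The interleaving idea above can be sketched in a few lines: confidently linked mentions vote for the tweet's topic, and the inferred topic in turn disambiguates ambiguous mentions. This is a minimal toy sketch, not the system's actual code; the KB contents and the topic-majority heuristic are assumptions for illustration.

```python
# Minimal sketch of extraction/linking <-> classification synergy.
# The toy KB below is invented: each candidate node records the topic
# subtree it lives under in the knowledge base.
from collections import Counter

KB = {
    "Apple Corp.":   {"topic": "Technology"},
    "Apple (fruit)": {"topic": "Food"},
    "iPhone":        {"topic": "Technology"},
    "Google":        {"topic": "Technology"},
}

def infer_topic(linked_nodes):
    """Classification helped by linking: the KB subtree containing
    the most confidently linked nodes becomes the tweet's likely topic."""
    votes = Counter(KB[node]["topic"] for node in linked_nodes)
    return votes.most_common(1)[0][0]

def disambiguate(candidates, topic):
    """Linking helped by classification: prefer the candidate node
    that lives under the tweet's inferred topic subtree."""
    for node in candidates:
        if KB[node]["topic"] == topic:
            return node
    return candidates[0]  # fall back to the first candidate

# Unambiguous mentions link first; "apple" has two candidate nodes.
confident_links = ["iPhone", "Google"]
topic = infer_topic(confident_links)
apple = disambiguate(["Apple (fruit)", "Apple Corp."], topic)
print(topic, apple)  # Technology Apple Corp.
```

The real system interleaves the tasks over a far larger KB and with richer scoring, but the feedback loop has this shape: each task's preliminary output constrains the others.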
Using Contexts and Social Information: Given a tweet such as “go Giants!”, without some contexts, such as knowing that this user often tweets about the New York Giants football team, it is virtually impossible to extract and link entities accurately. As another example, it is not possible to process the tweet “mel crashed, maserati gone” in isolation: we have no idea which person named Mel the user is referring to. However, if we know that in the past one hour, when people tweeted about Mel Gibson, they often mentioned the words “crash” and “maserati” (a car brand), then we can infer that “mel” likely refers to the node Mel Gibson in the KB. Our system exploits such intuitions. It collects contexts for tweets, Twitter users, hashtags, Web domains, and nodes in the KB. It also collects a large number of social signals (e.g., traffic on Wikipedia and Pinterest pages). The system uses these contexts and signals to improve the accuracy of the tasks.
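The “mel crashed, maserati gone” example can be sketched as a simple context-overlap score: each candidate KB node carries a context of words that recently co-occurred with it, and the candidate whose context best overlaps the tweet wins. The data structures, weights, and stemming here are all invented for illustration; the paper's actual context representation and scoring are richer.

```python
# Hypothetical sketch of context-based disambiguation. The per-node
# contexts below are invented; in the real system they would be mined
# from recent tweet streams (e.g., the past hour) and other signals.

node_contexts = {
    "Mel Gibson": {"crash": 0.9, "maserati": 0.8, "movie": 0.4},
    "Mel Brooks": {"movie": 0.7, "comedy": 0.6},
}

def score(node, tweet_words):
    """Sum the context weights of the tweet's words for a candidate node."""
    ctx = node_contexts[node]
    return sum(ctx.get(word, 0.0) for word in tweet_words)

words = ["mel", "crashed", "maserati", "gone"]
# Crude normalization for illustration only: map "crashed" to "crash".
stems = ["crash" if w.startswith("crash") else w for w in words]

best = max(node_contexts, key=lambda node: score(node, stems))
print(best)  # Mel Gibson
```

Because “crash” and “maserati” co-occurred heavily with Mel Gibson in the recent stream, his node outscores other Mels even though the tweet itself contains no disambiguating name.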
Other important features of our system include a minimal use of complex, time-intensive techniques, to ensure that we can process tweets in real time (at the rate of up to 6000 tweets per second), and the use of hand-crafted rules at various places in the processing pipeline to exert fine-grained control and improve system accuracy.

Figure 1: A tiny example of a KB

In the rest of this paper we first define the problems of
entity extraction and linking, and tweet classification and tagging. We then describe the end-to-end system in detail. Next, we present experiments that show that the current system outperforms