Coreference Resolution - IIT Bombay

Coreference Resolution - IIT Bombay

Coreference Resolution Seminar by Satpreet Arora (07D05003) What is Coreference? In linguistics, co-reference occurs when multiple expressions in a sentence or document refer to the same entity. Example: Aditya went to Videorec to buy a DVD for himself. He had frequented the store

for many years now. Here, Aditya, himself and He are coreferent. Also Videorec and store are coreferent. Coreference Resolution. Determine which entities in a document/ discourse have the same referent. In NLP, we usually deal with Coreference resolution of NPs. The coreference system has to form equivalence classes of NPs that have the same real-world entity as its referent.

Coreference Relation is both transitive and symmetric. A Coreference System Input of a coreference system: John Simon, Chief Financial Ocer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-yearold also became the nancial-service companys president. Output of a coreference system: [JS John Simon], [JS Chief Financial Ocer] of [PC Prime Corp.] since 1986, saw [JS his] pay jump 20%, to $1.3 million, as [JS

the 37-year-old] also became the [PC nancial-service company]s [JS president] Equivalence classes: JS: {John Simon, Chief Financial Ocer, his, the 37-year-old, president} PC: {Prime Corp., nancial-service company} Anaphora Resolution Anaphora refers to the linguistic phenomenon of having a noun phrase refer to a previously

mentioned entity in a text for its semantic interpretation In other words, a pair of NPs constitutes an anaphoric relationship if i < j and npj depends on npi for its interpretation, where npk denotes the kth NP in a document. For instance, the NP pair forms an anaphoric relationshipin our example. Queen Elizabeth set about transforming her husband Coreference Resolution - an Important Problem Applications include: Question answering systems and Information

Retrieval: Consider the query- Where was Mozart born?. A question answering system may rst retrieve the sentence He was born in Salzberg from a document talking about Mozart. In this case, the system will return the correct answer if it can determine that the pronoun He is coreferent with Mozart. Machine Translation. Anaphora resolution comes into play when discrepancies exist between the two languages with respect to anaphor selection. Coreference Resolution - an Important Problem (contd)

Text Summarization Text summarisation tools using coreference resolution not only include in the summary those sentences that contain a term appearing in the query, they also incorporate sentences containing a noun phrase that is coreferent with a term occurring in a sentence already selected by the system. Cross-document coreference is particularly useful for text summarization systems that need to identify and merge the same piece of information about an entity mentioned in different documents in order to avoid repetition.

Coreference Resolution - a Hard Problem The diculty of the problem lies in its dependence on sophisticated semantic and world knowledge. The policemen refused the women a permit for the demonstration because they feared violence. The policemen refused the women a permit for the demonstration because they advocated violence. Observe how they is referring two two different entities in the two sentences depending upon the context. Its easy for humans but dicult for

machines. Coreference Resolution - a Hard Problem (contd) Many sources of information play a role: Lexical information such as head noun matches (as in Lord Spencer and Mr. Spencer) is an indicator of coreference, although it is not an absolute indicator (e.g. Lord Spencer and Diana Spencer are not coreferent). Knowledge sources such as gender and number, semantic class, discourse focus,

and world knowledge also play a role in determining whether two discourse entities are coreferent. Coreference Resolution - a Hard Problem (contd) No single source of knowledge is completely reliable indicator: For example, two semantically compatible NPs are potentially coreferent (e.g. Diana Spencer and the princess) but whether the NPs are actually coreferent depends on other factors (such as contextual information).

Linguistic constraints indicating (non-)coreference such as number (dis)agreement is not absolutely hard (e.g. the singular NP assassination (of her bodyguards) can be coreferent with the plural NP these murders. Coreference Resolution - a Hard Problem (contd) Coreference strategies differ depending on the type of NP: Denite NPs are more likely to be anaphoric

than their non-denite counterparts (e.g. the article immediately preceding man in the sentence Diana saw the/a photographer following her secretly determines whether the NP has an existential or denite reading). Pronoun resolution is dicult because resolution strategies differ for each type of pronouns (e.g. reflexives versus possessives) and also because some pronouns such as pleonastic pronouns are semantically empty (e.g. the pronoun it in the sentence Camilla went outside and it was raining is The Algorithm A lot of different algorithms use different

approaches to solve the problem. However, they all share some basic components. (for non machine learning based algorithms) Step 1: Identication of Discourse Entities (NPs) Step 2: Representation of NPs (as a set of features); Step3: Calculating distances between NPs using the distance metric Step 4:Creating equivalence classes using a clustering algorithm or other classication tools. IDENTIFICATION OF DISCOURSE ENTITIES For coreference resolution algorithms, the task is

to identify all of the noun phrases in the text. Textual elements which can be denite noun phrases, demonstrative noun phrases, proper names, appositives, sub-noun phrases that act as modiers, pronouns and so on. The basic structure of the identication is as follows: Identication of Discourse Entities (contd) Tokenization and Morphological Processing: Splitting the text to sentences and stemming words to their root form

POS tagging: Hidden Markov Model based statistical POS tagger. Noun Phrase Identication: Determines noun phrase boundaries based on POS tagger. Named Entity recognition: May also be HMM based, learns from a tagged corpus of named entities. If there are overlaps boundaries are adjusted. Nested Noun Phrases Extraction : Accepts noun phrases and determines the nested phrases (if any) Representation of NPs Representation of NPs : A set of features used to construct the feature vector. individual words in the NP;

head noun: last word of the NP; position in the document; pronoun type: nominative, accusative, possessive, ambiguous; article: indenite, denite, none; appositive: based on heuristics (commas, etc.); number: plural, singular; proper name: based on heuristics (capitalization, etc.); semantic class based on Wordnet; gender: masculine, feminine, either, neuter; animacy: based on semantic class. Representation of NPs (contd)

Calculating distances between NPs Different algorithms use different distance metrics. We here present one from Cardie et al. (1999). Noun phrase coreference as clustering and the corresponding clustering algorithm. The distance between noun phrases NP 1 and NP2 is dened as: dist(NP1,NP2) = wf * incompatibilityf(NP1,NP2) The summation is over f F F: set of features wf: weight of feature f incompatibilityf: degree of incompatibility

between NP1 and NP2 Calculating distances between NPs (contd) Clustering Properties of the clustering algorithm: start from the end of the document and work backwards; if distance between two NPs is less than r , then their equivalence classes are considered for merging; classes can be merged unless they contain incompatible NPs; the algorithm automatically computes the

transitive closure of the coreference relation; two NPs can be coreferent even if dist(NP1,NP2) > r, as long as dist(NP1,NP2) ; r is a free parameters of the algorithm. Clustering Machine Learning algorithms In the algorithm we just saw, the weights of each feature are xed manually. Unlike manual approaches, machine learning approaches to coreference resolution induce a

model that determines the probability that two NPs are coreferent from annotated data automatically via the use of learning algorithms. They can be characterized in terms of the knowledge sources being employed, the method of training data creation, as well as the learning algorithm and the clustering algorithm being chosen. Machine Learning algorithms Training Data Creation:

Positive instances:Two different methods are used to create positive training instances: from Aone et al. (1995) 1)Transitive (i.e. an instance is formed between an NP and each of its preceding NPs in the same anaphoric chain) and 2)Non-transitive (i.e. an instance is formed between an NP and its closest preceding NP in the same anaphoric chain Negative instances: 1) Negative instances are generated by pairing an NP with each preceding NP that does not have an anaphoric relationship with it.- Aone et al. (1995) 2) To reduce the ratio of negative instances to positive instances a negative instance is created by pairing an anaphoric NP, npj , with each NP appearing between np j and its closest preceding antecedent-Soon et al. (2001)

Learning Algorithm A lot of recent research involving machine learning techniques use decision trees for classifying NP pairs. Soon et al. used C5 tree classier. Learning Algorithm (contd) Each pair of NPs is presented to the classier and the classier returns a probability or certainty value of the pair being coreferent.

All pairs which receive a probability value greater than a threshold value are considered as being coreferent. The algorithm then constructs the transitive closure of all the pairs and thus achieves partitioning of all the NPs into coreferent classes. Clustering algorithms are also used in machine learning approaches. The relative importance of all the factors discussed previously is learnt from the training corpus instead of being xed

manually. This allows for consideration of a larger number of factors. In principle, learning-based systems are more robust than knowledge-based systems in the face of noisy input (meaning there are exceptions to rules). Also machine learning algorithms adapt easier to different topics. Conclusion Machine learning approaches to coreference resolution have been shown to be a promising way to build robust coreference resolution systems. Despite the successful application of machine learning

techniques to coreference resolution, the problem is far from being solved. Linguistics combined with machine learning techniques can prove effective in solving the coreference problem. Coreference resolution is one of the most dicult problems in language understanding. Given that NLP is often said to be AI-complete we might be able to conclude that coreference resolution is among the hardest of the hardest problems in articial intelligence. References Vincent N G (2002):Machine Learning for Coreference

Resolution:Recent Successes and Future Challenges Cardie, Claire and Kiri Wagstaff. 1999. Noun phrase coreference as clustering. Byron, D. and J. Tetreault, 1999. A Flexible Architecture for Reference Resolution. In Proc. of the 9th EACL. Unsupervised Models for Co-reference Resolution (Vincent Ng 2008). Wee Meng Soon, Daniel Chung Yong Lim, Hwee Tou Ng(2001) DSO National Laboratories :A Machine Learning Approach to Coreference Resolution of Noun Phrases http://www.cs.tau.ac.il/~nachumd/NLP/2010/Anaphora.pdf http://www.inf.ed.ac.uk/teaching/courses/nlu/lectures/nlu_l1 6.pdf http://www.dfki.de/~loeckelt/ss2010/presentations/corefere nce_resolution.pdf

Wikipedia Questions

Recently Viewed Presentations

  • Performance - Institute of Computer Engineering (E191)

    Performance - Institute of Computer Engineering (E191)

    Bitcoin Mining "We should have a gentleman's agreement to postpone the GPU arms race as long as we can for the good of the network." - Satoshi Nakamoto, Dez 2009
  • New Media and Health in the Era of Personalisation

    New Media and Health in the Era of Personalisation

    Social Values have been estimated by Daniel Fujiwara. Based analysis of British Household Survey (35,000 people) Relating behaviour to Life Satisfaction Scores. To value wellbeing arising from social and community factors. In terms of increase in household incomes.
  • IP AND GLOBALISATION OF INDIAN PHARMACEUTICAL INDUSTRY: EMERGING

    IP AND GLOBALISATION OF INDIAN PHARMACEUTICAL INDUSTRY: EMERGING

    Mixed Responses from commentators. Keayla (1994) and Amit Sengupta (1994) cautioned about the possible threats and ramifications of new policy reforms.
  • 1 Objectives 1. To examine U.S. currency and

    1 Objectives 1. To examine U.S. currency and

    Can be calculated with the following formula: current period CPI - prior period CPI. prior period CPI . consumer price index (CPI) ... develop banking structure in alignment with global policies regarding exchange, debt and cash flow management . Problems...
  • Linking Occupation, Health, and OT through Outcomes Research

    Linking Occupation, Health, and OT through Outcomes Research

    Linking Occupation, Health, and OT through Outcomes Research. Matthew Geddie, OTR, PhD. Rachel Kahlig, MOTS. INTRODUCTION. Lets get started. Thank you for joining me this afternoon, For those of you who don't know me, my name is Mariel Parra, I...
  • Enduring Understandings for 6-8 (ppt) - Troup County School ...

    Enduring Understandings for 6-8 (ppt) - Troup County School ...

    Enduring Understandings Matching Activity. Instructional Approach(s): The teacher should facilitate the Enduring Understandings Matching Activity [linked on the resource page] in which students match an Enduring Understanding with its description and images that illustrate the Enduring Understanding.
  • Solutions to TSP using Convex Hull - University of North ...

    Solutions to TSP using Convex Hull - University of North ...

    Traveling Salesman Problem. Distances between n cities are stores in a distance matrix D with elements d ij where i, j = 1 …n and the diagonal elements d ii are zero. A tour can be represented by a cyclic...
  • EC3027 Corporate Social Responsibility

    EC3027 Corporate Social Responsibility

    Arial Wingdings Calibri Times New Roman Watermark 1_Watermark 6BU004 Corporate Social Responsibility and Ethics Session objectives Group activity Definitions of 'Stakeholders' Definitions cont. Perspectives on the firm Shareholder perspectives on firm Stakeholder perspectives on firm Problems with stakeholder perspectives Prioritisation...