Classifying Noun Countability Using Google Ngrams

Classifying Noun Countability Using Google Ngrams

N-gram Search Engine on Wikipedia Satoshi Sekine (NYU) Kapil Dalwani (JHU) July 30th, 2009 Lexical Knowledge from Ngrams 1 Hammer : Fast and multifunctional n-gram search engine Search ngram: FAST ng r am s INPUT: token, POS, chunk, NE July 30th, 2009 OUTPUT: frequency to text

Lexical Knowledge from Ngrams 22 Characteristics Search up to 7 grams with wildcards Multi-level input Token, POS, chunk, NE, combinations NOT, OR for POS, chunk, NE Multi-level output Token, POS, chunk, NE document information Original sentences, KWIC, ngram Display Show the results in the order of frequency Running Environment Single CPU, PC-Linux, 400MB process, 500GB disk July 30th, 2009 Lexical Knowledge from Ngrams 33

Demo http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2 July 30th, 2009 Lexical Knowledge from Ngrams 4 Available for you Web system At NYU http://nlp.cs.nyu.edu/nsearch At JHU? USB Hard drive July 30th, 2009 Lexical Knowledge from Ngrams 5 Implementation: Overview 3. Display

2. Filtering 1. Search candidates N-gram data Search request Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams 6 Implementation: Overview

1. Search candidates N-gram data Search request Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams 7 From n-gram to Inverted Index Example: 3-grams Ngram ID

Position=1 Position=2 Position=3 1 A B C 2 A B B 3 B A C

Posting list A pos=1 1 A pos=2 3 B pos=1 3 B pos=2 1 B pos=3 2 C pos=3 1 July 30th, 2009 2

2 3 Lexical Knowledge from Ngrams 8 Posting list Wide variation of posting list size (in 7-gram: 1.27B) #EOS# (100,906,888), , (55,644,989), the (33,762,672) conscipcuous, consiety, Mizuk, (1) 3 types for faster speed and smaller index size Bitmap (freq >1%) 1 0 0 :#EOS# 1.27B bits (bitmap) <-> 3.2B bits (list) 0 1 1

0 1 0 0 0 0 1 0 0 1 List of ngramID C pos=3 1 3 Encoded into pointer (freq=1) C pos=3 5

July 30th, 2009 Lexical Knowledge from Ngrams 9 Search Given an n-gram request (A B C) Get posting lists for A, B and C Search intersections of posting lists Use look ahead to speed up the search Look ahead size = Sqrt(size of posting list) Moffat and Zobel (1996) 4 33 34 55 76 80 89 92 99 4 SKIP 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98 July 30th, 2009 Lexical Knowledge from Ngrams

10 Implementation: Overview 2. Filtering 1 Search candidates . Search request N-gram data Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams

11 Filtering Not all candidate ngramIDs match the request A Freq=123 NN VB Freq=5 Freq=10 B PERSON LOC We need frequency, sentence information to matched n-grams POS, chunk and NE information is presented as ID Reduce the index more than 200GB July 30th, 2009 Lexical Knowledge from Ngrams

12 Implementation: Overview 3. Display 2. Filtering 1. Search candidates N-gram data Search request Inverted index for n-gram data Suffix array for text Wikipedia text POS, chunk, NE for N-gram data Wikipedia POS, chunk, NE July 30th, 2009 Lexical Knowledge from Ngrams

13 Display N-gram will be displayed in the descending order of frequency N-gram ID is ordered by the frequency Sentences are searched using suffix array POS, chunk, NE are displayed with sentence, KWIC, ngram Doc ID, title of Wikipedia (and possible features of doc) is displayed with sentences and KWIC July 30th, 2009 Lexical Knowledge from Ngrams 14 Size of data Text 1.7 G words 200M sentences 2.4M articles Total 530GB

108 GB 8 GB 260 GB Suffix array For text N-gram data 8 GB Ngram 1: 8M 2: 93M 3: 377M 4: 733M 5: 1.00B 6: 1.17B 7: 1.27B Inverted index for n-gram data 40 GB 100 GB POS, chunk, NE for N-gram data

Others July 30th, 2009 Lexical Knowledge from Ngrams Wikipedia text 6 GB Wikipedia POS, chunk, NE 15 Future Work Other information (ex: parse, coref, relation, genre, discourse) Longer n-gram Compress index, dictionary Ease the indexing load Now we need a big memory machine Distributing indexing Union operation for tokens July 30th, 2009 Lexical Knowledge from

Ngrams 16 Available for you Web demo At NYU http://nlp.cs.nyu.edu/nsearch At JHU? USB Hard drive July 30th, 2009 Lexical Knowledge from Ngrams 17

Recently Viewed Presentations

  • Coronary Artery Disease - Family Medicine

    Coronary Artery Disease - Family Medicine

    Intermediate probability from around 10% to 90%. Can patient exercise? Stress test is a first-line option for most men and women. Stress imaging test if baseline EKG, prior revascularization, and if diabetes mellitus
  • Ecology - Mrs. DeNicola&#x27;s Science Corner

    Ecology - Mrs. DeNicola's Science Corner

    Oak trees alter the pH of the soil, making the forest better suited for shrubs and other trees. Roots of shrubs proliferate in the soil of the forest and prevent the oak trees from obtaining water. Oak trees succumb to...
  • Integrals - robeson.k12.nc.us

    Integrals - robeson.k12.nc.us

    Integration Techniques. Integration is the process of finding an indefinite or diefinite integral . Integral is the definite integral is the fundamental concept of the integral calculus. It is written as. Where f(x) is the integrand, a and b are...
  • Corporate Creativity

    Corporate Creativity

    nd. estates to meet with National Assembly. 3. rd. estate . doubled. their numbers. 1. st. and 2. nd. sat on right, 3. rd. sat on left. Formed municipal (city) government. National Assembly (1789-1793) Locked out of the Estates-General, the...
  • Sir Gawain and the Green Knight - Loudoun County Public Schools

    Sir Gawain and the Green Knight - Loudoun County Public Schools

    Sir Gawain and the Green Knight Cultural and character backgrounds The Author The author of the poem is known only as "The Pearl Poet." Written in northwestern England around 1370 The language and topics indicate that the author was most...
  • Cybersecurity

    Cybersecurity

    Analytics in Cybersecurity. Where we are. Where we want to be. Analyzing previous cyber threats/attacks. Predict and prevent future attacks. Big Data can store large amounts of data and help analysts examine, observe, and detect irregularities within a network.
  • Design and test of a low-cost turbidity meter

    Design and test of a low-cost turbidity meter

    Design and test of a low-cost turbidity meter using an LED. A. Louise Ferris; Thomas Aumer; Theodore A. CorcovilosDepartment of Physics, Duquesne University. Introduction. Acknowledgements. Funding for this project and support for this project was provided by the Duquesne Physics...
  • Advertising and Promotion - Hogan Lovells

    Advertising and Promotion - Hogan Lovells

    Health Care Economic Information (HCEI) HCEI is defined as "any analysis that identifies, measures, or compares the economic consequences, including the costs of the represented health outcomes, of the use of a drug to the use of another drug, to...