Classifying Noun Countability Using Google Ngrams

N-gram Search Engine on Wikipedia
Satoshi Sekine (NYU), Kapil Dalwani (JHU)
July 30th, 2009, Lexical Knowledge from Ngrams

Hammer: Fast and multifunctional n-gram search engine
  • Search ngram: FAST
  • INPUT: token, POS, chunk, NE
  • OUTPUT: frequency to text

Characteristics
  • Search up to 7 grams with wildcards
  • Multi-level input
      • Token, POS, chunk, NE, combinations
      • NOT, OR for POS, chunk, NE
  • Multi-level output
      • Token, POS, chunk, NE
      • document information
      • Original sentences, KWIC, ngram
  • Display
      • Show the results in the order of frequency
  • Running Environment
      • Single CPU, PC-Linux, 400MB process, 500GB disk

Demo
  • http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2

Available for you
  • Web system
      • At NYU: http://nlp.cs.nyu.edu/nsearch
      • At JHU?
  • USB Hard drive

Implementation: Overview
  • Steps
      • 1. Search candidates
      • 2. Filtering
      • 3. Display
  • Components (from the diagram)
      • Search request
      • N-gram data
      • Inverted index for n-gram data
      • POS, chunk, NE for N-gram data
      • Suffix array for text
      • Wikipedia text
      • Wikipedia POS, chunk, NE

Implementation: Overview (1. Search candidates)

From n-gram to Inverted Index
Example: 3-grams

Ngram ID   Position=1   Position=2   Position=3
1          A            B            C
2          A            B            B
3          B            A            C

Posting list (token and position, then ngram IDs)
  A pos=1: 1, 2
  A pos=2: 3
  B pos=1: 3
  B pos=2: 1, 2
  B pos=3: 2
  C pos=3: 1, 3

Posting list
  • Wide variation of posting list size (in 7-gram: 1.27B)
      • #EOS# (100,906,888), "," (55,644,989), the (33,762,672)
      • conscipcuous, consiety, Mizuk (1)
  • 3 types for faster speed and smaller index size
      • Bitmap (freq >1%), e.g. #EOS#: 1.27B bits (bitmap) <-> 3.2B bits (list)
      • List of ngramID, e.g. C pos=3: 1, 3
      • Encoded into pointer (freq=1), e.g. C pos=3: 5
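The 3-gram example and the three posting-list types can be sketched in a few lines of Python. This is only an illustration, not the engine's actual code: the function names are hypothetical, the 1% threshold is assumed to be relative to the total number of n-grams, and the size comparison assumes 32-bit n-gram IDs (100,906,888 x 32 is about 3.2B bits, which matches the slide's figure for #EOS#).

```python
from collections import defaultdict


def build_positional_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs (positions start at 1)."""
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):
        for pos, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, pos)].append(ngram_id)
    return dict(index)


def posting_type(freq, total_ngrams):
    """Choose one of the three representations named on the slide.

    Assumption: "freq > 1%" means more than 1% of all n-grams.
    """
    if freq == 1:
        return "pointer"   # the single ID is stored in place of a list pointer
    if freq > 0.01 * total_ngrams:
        return "bitmap"    # one bit per n-gram ID
    return "list"          # explicit list of n-gram IDs


# The 3-gram table from the slide.
ngrams = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}
index = build_positional_index(ngrams)

# Size check for #EOS# in the 7-gram index, assuming 32-bit IDs.
bitmap_bits = 1_270_000_000
list_bits = 100_906_888 * 32
```

With these inputs, `index[("C", 3)]` is `[1, 3]`, the same entry the slide uses to illustrate the list representation.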

Search
  • Given an n-gram request (A B C)
  • Get posting lists for A, B and C
  • Search intersections of posting lists
  • Use look ahead to speed up the search
      • Look ahead size = Sqrt(size of posting list)
      • Moffat and Zobel (1996)
  • Example posting lists (figure, with a SKIP pointer marked)
      4 33 34 55 76 80 89 92 99
      4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
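The intersection with look ahead can be sketched as follows. This is a minimal illustration under assumptions: posting lists are plain sorted Python lists of unique n-gram IDs, and the look ahead is done by array indexing rather than by skip pointers stored inside a compressed list. The function names are hypothetical.

```python
import math


def intersect_with_skips(short, long):
    """Intersect two sorted lists of unique IDs, jumping ahead in `long`."""
    skip = max(1, math.isqrt(len(long)))  # look ahead size = sqrt(list size)
    result = []
    j = 0
    for x in short:
        # Jump a whole block while the entry at the jump target is still <= x.
        while j + skip < len(long) and long[j + skip] <= x:
            j += skip
        # Finish with a short linear scan inside the block.
        while j < len(long) and long[j] < x:
            j += 1
        if j == len(long):
            break
        if long[j] == x:
            result.append(x)
    return result


def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for lst in ordered[1:]:
        result = intersect_with_skips(result, lst)
    return result


# The two lists shown in the slide's figure.
list_a = [4, 33, 34, 55, 76, 80, 89, 92, 99]
list_b = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
common = intersect_with_skips(list_a, list_b)  # [4, 33, 76, 89]

# Request (A B C) on the 3-gram example: A pos=1, B pos=2, C pos=3.
match = intersect_all([[1, 2], [1, 2], [1, 3]])  # [1]
```

Starting from the shortest list keeps the intermediate result small, which matters when one token has a posting list of a few entries and another has millions.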

Implementation: Overview (2. Filtering)

Filtering
  • Not all candidate ngramIDs match the request
      • [Figure: example with labels A, B, Freq=123, NN, VB, Freq=5, Freq=10, PERSON, LOC]
  • We need frequency and sentence information for the matched n-grams
  • POS, chunk and NE information is represented as IDs
      • Reduces the index by more than 200GB
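One way to picture the filtering step is a pass over the candidate n-gram IDs that keeps only the annotated variants satisfying the request. The sketch below is an assumption-heavy illustration: the data layout (`annotations` as a dict of tag tuples with frequencies), the request format, and all names are hypothetical and not taken from the system. It only covers POS tags, with the OR and NOT operators mentioned under Characteristics.

```python
ANY = None  # wildcard: no constraint at this position


def position_ok(tag, constraint):
    """Check one tag against one constraint: ANY, ("OR", tags) or ("NOT", tags)."""
    if constraint is ANY:
        return True
    op, tags = constraint
    if op == "OR":
        return tag in tags
    if op == "NOT":
        return tag not in tags
    raise ValueError(f"unknown operator: {op}")


def filter_candidates(candidates, annotations, request):
    """Keep (ngram_id, tags, freq) rows whose tags satisfy every position."""
    hits = []
    for ngram_id in candidates:
        for tags, freq in annotations.get(ngram_id, []):
            if len(tags) == len(request) and all(
                position_ok(t, c) for t, c in zip(tags, request)
            ):
                hits.append((ngram_id, tags, freq))
    return hits


# Hypothetical annotation table: each token n-gram has several tagged variants.
annotations = {
    1: [(("NN", "VB"), 5), (("NN", "NN"), 10)],
    2: [(("JJ", "NN"), 7)],
}
# Request: first position NN or JJ, second position anything but NN.
request = [("OR", {"NN", "JJ"}), ("NOT", {"NN"})]
hits = filter_candidates([1, 2], annotations, request)
```

Here only the NN VB variant of candidate 1 survives, with its own frequency of 5 rather than the frequency of the token n-gram as a whole.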

Implementation: Overview (3. Display)
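The implementation overview includes a suffix array over the Wikipedia text, used to retrieve sentences for a matched n-gram. The sketch below shows the general technique on a toy token list. It is not the system's implementation: the construction is the naive sort of all suffixes, which would not scale to 1.7G words, and the function names and KWIC format are hypothetical.

```python
def build_suffix_array(tokens):
    """Return start positions of all suffixes, sorted lexicographically (naive)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, sa, query):
    """Binary-search the suffix array for every position where `query` occurs."""
    q = list(query)
    n = len(q)
    # Lower bound: first suffix whose first n tokens are >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < q:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose first n tokens are > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= q:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def kwic(tokens, pos, n, width=2):
    """Format one hit as keyword-in-context with `width` tokens on each side."""
    left = " ".join(tokens[max(0, pos - width):pos])
    hit = " ".join(tokens[pos:pos + n])
    right = " ".join(tokens[pos + n:pos + n + width])
    return f"{left} [{hit}] {right}".strip()


tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)
positions = find_occurrences(tokens, sa, ["the", "cat"])  # [0, 6]
lines = [kwic(tokens, p, 2) for p in positions]
```

Because all suffixes that start with the query sit next to each other in the sorted array, two binary searches are enough to find every occurrence.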

Display
  • N-grams are displayed in descending order of frequency
      • N-gram IDs are ordered by frequency
  • Sentences are searched using the suffix array
  • POS, chunk, NE are displayed with sentence, KWIC, ngram
  • Doc ID and title of the Wikipedia article (and possible features of the doc) are displayed with sentences and KWIC

Size of data
  • Text: 1.7G words, 200M sentences, 2.4M articles
  • Total: 530GB
  • Ngram counts: 1: 8M, 2: 93M, 3: 377M, 4: 733M, 5: 1.00B, 6: 1.17B, 7: 1.27B
  • Components (from the diagram)
      • Suffix array for text
      • N-gram data
      • Inverted index for n-gram data
      • POS, chunk, NE for N-gram data
      • Wikipedia text
      • Wikipedia POS, chunk, NE
      • Others
  • Sizes shown in the diagram, in the order extracted: 108 GB, 8 GB, 260 GB, 8 GB, 40 GB, 100 GB, 6 GB

Future Work
  • Other information (ex: parse, coref, relation, genre, discourse)
  • Longer n-gram
  • Compress index, dictionary
  • Ease the indexing load
      • Now we need a big memory machine
      • Distributing indexing
  • Union operation for tokens

Available for you
  • Web demo
      • At NYU: http://nlp.cs.nyu.edu/nsearch
      • At JHU?
  • USB Hard drive
