CS 430 / INFO 430 Information Retrieval Lecture

CS 430 / INFO 430 Information Retrieval Lecture

CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5 1 Course Administration 2 CS 430 / INFO 430 Information Retrieval Completion of Lecture 4 3 Word List

On disk If a word list is held on disk, search time is dominated by the number of disk accesses. In memory Suppose that a word list has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer. 4 File Structures for Inverted Files: Linear Index Advantages Can be searched quickly, e.g., by binary search, O(log n) Good for lexicographic processing, e.g., comp* Convenient for batch updating Economical use of storage Disadvantages Index must be rebuilt if an extra term is added

5 File Structures for Inverted Files: Binary Tree Input: elk, hog, bee, fox, cat, gnu, ant, dog elk bee ant hog cat fox dog 6 gnu File Structures for Inverted Files: Binary Tree

Advantages Can be searched quickly Convenient for batch updating Easy to add an extra term Economical use of storage Disadvantages Less good for lexicographic processing, e.g., comp* Tree tends to become unbalanced If the index is held on disk, important to optimize the number of disk accesses 7 File Structures for Inverted Files: Binary Tree Calculation of maximum depth of tree. Worst case: depth = n O(n) Ideal case: depth = log(n + 1)/log 2 O(log n) Illustrates importance of balanced trees.

8 File Structures for Inverted Files: Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's inorder predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Can be used for lexicographic processing. A good data structure when held in memory Knuth vol 1, 2.3.1, page 325. 9 File Structures for Inverted Files: Right Threaded Binary Tree dog gnu

bee ant cat hog elk fox 10 NULL File Structures for Inverted Files: B-trees B-tree of order m: A balanced, multiway search tree: Each node stores many keys

Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. If ki is the ith key in a given internal node -> all keys in the (i-1)th child are smaller than ki -> all keys in the ith child are bigger than ki All leaves are at the same depth 11 File Structures for Inverted Files: B-trees B-tree example (order 2) 50 65 55 59 10 19 35 36 47 1 5 8 9 12 14 18

12 21 24 28 70 90 98 66 68 91 95 97 72 73 Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers. File Structures for Inverted Files: B+-tree Example: B+-tree of order 2, bucket size 4 A B-tree is used as an index Data is stored in the leaves of the tree, known as buckets 50 65 10 25

... D9 55 59 D51 ... D54 70 81 90 D66... (Implementation of B+-trees is covered in CS 432.) 13 D81 ... CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5

14 SMART System An experimental system for automatic information retrieval automatic indexing to assign terms to documents and queries collect related documents into common subject classes identify documents to be retrieved by calculating similarities between documents and queries procedures for producing an improved search query

based on information obtained from earlier searches Gerald Salton and colleagues Harvard 1964-1968 Cornell 1968-1988 15 Indexing Subsystem Documents text assign document IDs break into tokens tokens *Indicates optional operation. documents

stop list* non-stoplist stemming* tokens stemmed terms term weighting* terms with weights 16 document numbers and *field numbers Index database

Search Subsystem query parse query ranked document set query tokens stop list* non-stoplist tokens ranking* stemming* *Indicates optional operation. 17

Boolean retrieved operations* document set relevant document set stemmed terms Index database Decisions in Building the Word List: What is a Term? 18 Underlying character set, e.g., printable ASCII, Unicode, UTF8.

Is there a controlled vocabulary? If so, what words are included? List of stopwords. Rules to decide the beginning and end of words, e.g., spaces or punctuation. Character sequences not to be indexed, e.g., sequences of numbers. Lexical Analysis: Term

What is a term? Free text indexing A term is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word. Usually, terms are strings of letters, digits or other specified characters, separated by punctuation, spaces, etc. 19 Oxford English Dictionary 20 Lexical Analysis: Choices Punctuation: In technical contexts, punctuation may be used as a character within a term, e.g., wordlist.txt. Case: Case of letters is usually not significant. Hyphens: (a) Treat as separators: state-of-art is treated as state of art.

(b) Ignore: on-line is treated as online. (c) Retain: Knuth-Morris-Pratt Algorithm is unchanged. Digits: Most numbers do not make good terms, but some are parts of proper nouns or technical terms: CS430, Opus 22. 21 Lexical Analysis: Choices The modern tendency, for free text searching, is to map upper and lower case letters together in index terms, but otherwise to minimize the changes made at the lexical analysis stage. 22 Lexical Analysis Example: Query Analyzer A term is a letter followed by a sequence of letters and digits. Upper case letters are mapped into the lower case equivalents. The following characters have significance as operators: (

23 ) & | Lexical Analysis: Transition Diagram letter, digit 1 space letter ( 2 3

) & 0 | other end-of-string 24 4 7 5 6 Lexical Analysis: Transition Table State space letter

0 1 0 1 1 1 ( ) & | other end-of string digit 2

1 3 1 4 1 5 1 6 1 States in red are final states. 25 6 1

7 1 Changing the Lexical Analyzer This use of a transition table allows the system administrator to establish differ lexical choices for different collections of documents. Example: To change the lexical analyzer to accept tokens that begin with a digit, change the top right element of the table to 1. 26 Stop Lists Very common words, such as of, and, the, are rarely of use in information retrieval. A stop list is a list of such words that are removed during lexical analysis. A long stop list saves space in indexes, speeds

processing, and eliminates many false hits. However, common words are sometimes significant in information retrieval, which is an argument for a short stop list. (Consider the query, "To be or not to be?") 27 Suggestions for Including Words in a Stop List Include the most common words in the English language (perhaps 50 to 250 words). Do not include words that might be important for retrieval (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world). In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents). 28

Example: Stop List for Assignment 1 a are but has in more one that this which 29 about as by have is new or the

to will an at for he it of said their was with and be from his its on say

they who you Example: the WAIS stop list (first 84 of 363 multi-letter words) 30 about above according after afterwards again alone along already among amongst an anyone anything anywhere at be

becomes becoming been beginning behind being between beyond can can't could couldn't did didn't do down during each either else elsewhere across against also

another are became before below billion cannot actually adj all almost although always any anyhow aren't around because become beforehand begin beside

besides both but by caption co does eg end doesn't eight ending don't eighty enough Stop list policies How many words should be in the stop list? Long list lowers recall

Which words should be in list? Some common words may have retrieval importance: -- home, life, water, war, world In certain domains, some words are very common: -- computer, program, source, machine, language There is very little systematic evidence to use in selecting a stop list. 31 Stop Lists in Practice The modern tendency is: (a) have very short stop lists for broad-ranging or multi-lingual document collections, especially when the users are not trained. (b) have longer stop lists for document collections in well-defined fields, especially when the users are trained professional. 32 Stemming

Morphological variants of a word (morphemes). Similar terms derived from a common stem: engineer, engineered, engineering use, user, users, used, using Stemming in Information Retrieval. Grouping words with a common stem together. For example, a search on reads, also finds read, reading, and readable Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed. 33 Categories of Stemmer The following diagram illustrate the various categories of stemmer. Porter's algorithm is shown by the red path. Conflation methods Manual Automatic (stemmers) Affix

removal Longest match 34 Successor variety Simple removal Table lookup n-gram Porter Stemmer A multi-step, longest-match stemmer. M. F. Porter, An algorithm for suffix stripping. (Originally

published in Program, 14 no. 3, pp 130-137, July 1980.) http:// www.tartarus.org/~martin/PorterStemmer/def.txt Notation v c (vc)m vowel constant (vowel followed by a constant) repeated m times Any word can be written: [c](vc)m[v] 35 m is called the measure of the word Porter's Stemmer Porter Stemming Algorithm Complex suffixes Complex suffixes are removed bit by bit in the different steps. Thus:

GENERALIZATIONS becomes becomes becomes becomes 36 GENERALIZATION (Step 1) GENERALIZE (Step 2) GENERAL (Step 3) GENER (Step 4). Porter Stemmer: Step 1a Suffix sses ss caresses -> caress

ies i ponies ties -> poni -> ti ss caress -> caress ss s 37

Replacement Examples cats -> cat Porter Stemmer: Step 1b Conditions Suffix Replacement Examples (m > 0) eed ee feed -> feed agreed -> agree

(*v*) ed null plastered -> plaster bled -> bled (*v*) ing null motoring -> motor sing -> sing *v* - the stem contains a vowel 38

Porter Stemmer: Step 5a Step 5a is defined as follows. (m>1) E -> (m=1 and not *o) E -> probate -> probat rate -> rate cease -> ceas *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). 39 Stemming in Practice

Evaluation studies have found that stemming can affect retrieval performance, usually for the better, but the results are mixed. Effectiveness is dependent on the vocabulary. Fine distinctions may be lost through stemming. Automatic stemming is as effective as manual conflation. Performance of various algorithms is similar. Porter's Algorithm is entirely empirical, but has proved to be an effective algorithm for stemming English text with trained users. 40 Selection of tokens, weights, stop lists and stemming Special purpose collections (e.g., law, medicine, monographs) Best results are obtained by tuning the search engine for the characteristics of the collections and the expected queries. It is valuable to use a training set of queries, with lists of relevant documents, to tune the system for each application. General purpose collections (e.g., news articles) The modern practice is to use a basic weighting scheme (e.g., tf.idf), a simple definition of token, a short stop list and little

stemming except for plurals, with minimal conflation. 41 Web searching combine similarity ranking with ranking based on document importance.

Recently Viewed Presentations

  • Health Promotion Project Proposal - PSYC DWEEB

    Health Promotion Project Proposal - PSYC DWEEB

    Health Promotion Project Proposal. My name is John Roop. I am here today to present an idea that I believe to be important to the health and well-being of the students at NGCSU.
  • Product Training Date - Allogy

    Product Training Date - Allogy

    Product Training HemCon ChitoGauze® PRO * ChitoGauze® PRO Overview #1018 / #30-0049 ChitoGauze PRO 3in x 4yds NSN# 6510-01-591-7740 #1017 ChitoGauze 4in x 4yds FDA: 510(k): K102546 FDA Indication: ChitoGauze PRO is a hemostatic dressing for the external, temporary control...
  • Computer Problems Slowing You Down?

    Computer Problems Slowing You Down?

    Computer Problems Slowing You Down? GIVE US A CALL! 392-HELP (4357) Laptop not working? CALL THE UF COMPUTING HELP DESK! WE DIAGNOSE AND HELP REPAIR LAPTOPS!
  • Industrial and Organizational Psychology

    Industrial and Organizational Psychology

    Times New Roman Courier New Default Design Industrial and Organizational Psychology Work Groups and Work Teams Working Together Group/Team Concepts Group/Team Concepts 2 Group Performance How Does Group Compare To Individuals? Group Diversity Group Interventions Autonomous Work Teams Team Building
  • Moroni 9-10 - Brigham Young University-Idaho

    Moroni 9-10 - Brigham Young University-Idaho

    Moroni 10 is the Book of Mormon's concluding doctrinal dissertation. It defines the relationship among grace, gifts, and godhood. The grace that flows from the Savior's atoning sacrifice opens the gate to the divine road, the gifts are the vehicle,...
  • BioHydrogen Mark D. Redwood m.d.redwood@bham.ac.uk Research Fellow Unit

    BioHydrogen Mark D. Redwood [email protected] Research Fellow Unit

    Integration of a biohydrogen reactor with a solid-state H-store (Collaboration with A. Bevan and D. Book) Comparison of learning curves for future hydrogen production technologies (Collaboration with R. Green and H. Hu). Conference talks (wrt SCRATCH) Society for General Microbiology...
  • Teorias Y Modelos De Enfermeria

    Teorias Y Modelos De Enfermeria

    FAYE ABDELLAH. "TEORÍA DE TIPOLOGÍA DE LOS PROBLEMAS DE ENFERMERÍA" LA RESOLUCION DE PROBLEMAS es el vehículo para la definición de problemas de enfermaría en el proceso de curación del paciente. Enfermería: es un arte y una ciencia que moldea...
  • Language Registers - Madison County Schools / Overview

    Language Registers - Madison County Schools / Overview

    LANGUAGE REGISTERS EVERY LANGUAGE HAS 5 REGISTERS Frozen Formal Consultative Casual Intimate FROZEN Pledge of Allegiance Lord's Prayer Preamble to Constitution FORMAL Interviews Academic language in classroom Public speaking CASUAL Talking with friends Slang Abbreviations Drafts CONSULTATIVE Used when speaking...