INFO 340 - L14 - April 12, 2004 Notes By: Egaas, Fortier, Prins Admin: > Readings - This week (1 & 2) - Next week (3) > Upcoming - P2, May 17 - A4: IR Matching/Ranking, May 26 - Final, June 2 ** DOCUMENT INDEXING ** Indexing Documents > Determining terms > Folding case > Stemming terms > Removing stop words > Preserving necessary metadata Folding Case > Examples - NSF > nsf | Van > van | Morrison > morrison > Pro - Reduces vocabulary size - Improves 'recall' > Con - Loose semantic information of potential value Removing Stop/Noise Words > Stopwords are function words that occur with great frequency > Examples: to | be | or | not > Pro: Reduce vocabulary size > Con: Stop words can often be important in search: 'to be or not to be' Basic Search Algorithm see Class 14 - slide 16 For(every doc in corpus) { while (more terms in doc) Do { token <- GetNextToken(doc) Then if(!isNoiseWord (token)) { token <- Stem(token) postingList <- AddPosting(term,doc) } } } Summary > Indexing documents is like a pipeline or distillation process Doucments -> Taokens -> Terms -> ... -> [Final Term + Posting] The small details of this normalization process are extremely important Combination of Terms > Morrison AND Moon AND DANCE 1 > Morrison AND Moon 3 > Morrison AND Dance 3 > Dance AND Moon 5 > Dance 1000 > Moon 100 > Morrison 30 > Observation > Using a Boolean system, a user has to enter a large number of queries - A complex endeavor > Or, the system has to generate a large number of queries and show them to the user in some sensible fashion Exact Match > Exact Match - An IR system where only documents that exactly match the query are returned * Boolean systems (& SQL, of course) Another Approach > Best Match - An IR system that returns a ranked list of results *Search Engines How do Best Match Systems Work > Several strategies but Inverse Document Frequency is a very important once > Basic Idea - If a document contains a word that rarely occurs elsewhere in the collection of documents then that word is significant... Example > Corpus on Artificial Intelligence - If you type 'computer science' how many documents would be returned? - What about 'artificial intelligence'? - What about 'pink robot vacuum cleaners'? Exercise > Given an inverted file, what are some interesting/useful statistics? - Total # of docs that a keywords appears (moon in TWO docs) - Total # of docs - Given a document, # of times that they keyword appears - Total # of words - # of times keyword appears in field Statistics that are Hulpful > NDoc - Number of documents in a system > Fk - Number of occurrences of keyword across all documents > Dk - Number of documents that contain keyword > Fkd - Number of occurrences of keyword, k, in document, d Importance of a Term, Wkd > Wkd = Fkd * discrimk Weight of a keyword in a document is the frequence of the keyword times 'how discriminating' they keyword is > Discrimk = log (NDoc / Dk) + 1 Summary > Structure of inverted file > Best match versus exact match retrieval systems > Document / term statistics > Notion of weighting