INFO 340 - May 17, 2004 - L15 Notes
By: Egaas, Fortier, Prins

Admin
> Reading
  - This week Ch 3
  - Next week: Modern Information Retrieval, Chapter 3 (Baeza-Yates & Ribeiro-Neto)
> Upcoming
  - P2: Functions #1 and #2 Today
  - A4: IR Matching/Ranking 26 May
  - Final Project 2 June
  - Exam 7 June

Topics
> Continue with matching Process
> Assignment #3 - ER Modeling

Recall: Best Match Versus Exact Match
> Exact Match
  - Where boolean operators are used to control the matching process
    * skiing AND (water OR snow)
> Best Match
  - A weighting algorithm is used to order items in a 'good' ranking
    * skiing water snow

Recall: Word Statistics Enable the Weighting Process
> Key statistics:
  - NDocs - Number of docs
  - Fk - Frequency of the keyword
  - Dk - # of documents that have the keyword in them
  - Fkd - The frequency of the keyword in a document
> These are readily available from the inverted file
  - Be sure to know how

Document Scores
> To score documents we compute weights for each keyword and sum the weights
> Assume:
  - Query, q, which consists of 1 or more keywords
  - And available statistics

$$ s_d = \sum_{k \in (q \cap d)} w_{kd} $$

score of the document = the sum of the weights associated with the document

Document Scores
> The score s_d, for a document d,
> is equal to the sum of
> weights, w_kd, for a keyword, k, in relation to the document, d
> only look at keywords that are in both the query and the document.

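
The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the weight values, the document ids (d1, d2, d3) and the function name score_document are made up for the example and are not from the lecture. It assumes the weights w_kd have already been computed for each document.

```python
def score_document(query_keywords, doc_weights):
    """s_d = sum of w_kd over keywords k that are in both the query and the document."""
    matched = set(query_keywords) & set(doc_weights)  # k in (q intersect d)
    return sum(doc_weights[k] for k in matched)


# Hypothetical weights w_kd: document id -> {keyword: weight}
weights = {
    "d1": {"skiing": 2.0, "snow": 1.5},
    "d2": {"skiing": 1.0, "water": 3.0, "boat": 4.0},
    "d3": {"boat": 2.5},
}

query = ["skiing", "water", "snow"]

# Score every document, then rank best match first.
scores = {doc_id: score_document(query, w) for doc_id, w in weights.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(doc_id, scores[doc_id])
```

Note that "boat" never contributes to a score because it is not in the query, and d3 scores 0 because it shares no keyword with the query.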
Calculating the Weights
> Wkd = Fkd * discrimk
  - The weight is the frequency of a keyword in a document multiplied by how discriminating the keyword is

Wkd = Fkd * discrimk
1) Fkd: Documents that contain many occurrences of a word are better than documents that contain fewer occurrences of a word
2) discrimk: A keyword that is relatively rare is better than a keyword ***NOT COMPLETE***
> A common method for computing the discriminating power of a keyword is inverse document frequency
  - Discrimk = log(NDocs / Dk) (Estimate how rare the keyword is)

Exercise #1
> Wkd = Fkd * log(NDocs / Dk) + 1
> Assume:
  - Document collection: 1,000,000 documents
  - 'Convivial' occurs three times in document 101 and 1 time in document 104
> Question:
  - What is the weight Wkd for the word 'convivial' in document 101 and 104?
> Answer (log is base 10; Dk = 2 because only documents 101 and 104 contain the word):
  - For 101: 3 * log(1,000,000 / 2) + 1 = 18.09691001300806 ~ 18
  - For 104: 1 * log(1,000,000 / 2) + 1 = 6.698970004336019 ~ 7
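
Exercise #1 can be checked with a short Python sketch. It assumes the log is base 10 (this matches the worked values) and that 'convivial' appears only in documents 101 and 104, so Dk = 2. The function name keyword_weight and the postings dictionary are illustrative names, not from the lecture.

```python
import math


def keyword_weight(f_kd, n_docs, d_k):
    """Wkd = Fkd * log10(NDocs / Dk) + 1, the formula used in Exercise #1."""
    return f_kd * math.log10(n_docs / d_k) + 1


N_DOCS = 1_000_000

# Assumed postings for 'convivial': document id -> Fkd
postings = {101: 3, 104: 1}

# Dk is the number of documents that contain the keyword.
d_k = len(postings)

exercise_weights = {
    doc_id: keyword_weight(f_kd, N_DOCS, d_k) for doc_id, f_kd in postings.items()
}

for doc_id, w in exercise_weights.items():
    print(f"document {doc_id}: Wkd = {w:.3f}")
```

Document 101 gets the larger weight only because of its higher Fkd; the discrimk factor log(NDocs / Dk) is the same for both documents since it depends on the keyword, not the document.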