INFO 340 - L14 - April 12, 2004
Notes By: Egaas, Fortier, Prins

Admin:
    > Readings
        - This week (1 & 2)
        - Next week (3)
    > Upcoming
        - P2, May 17
        - A4: IR Matching/Ranking, May 26
        - Final, June 2

** DOCUMENT INDEXING **

Indexing Documents
    > Determining terms
    > Folding case
    > Stemming terms
    > Removing stop words
    > Preserving necessary metadata
    
Folding Case
    > Examples
        - NSF > nsf | Van > van | Morrison > morrison
    > Pro
        - Reduces vocabulary size
        - Improves 'recall'
    > Con
        - Loose semantic information of potential value

Removing Stop/Noise Words
    > Stopwords are function words that occur with great frequency
    > Examples: to | be | or | not
    > Pro: Reduce vocabulary size
    > Con: Stop words can often be important in search: 'to be or not to be'


Basic Search Algorithm

see Class 14 - slide 16

For(every doc in corpus) {
    while (more terms in doc) Do {
        token <- GetNextToken(doc) Then
        if(!isNoiseWord (token)) {        
                token <- Stem(token)
                postingList <- AddPosting(term,doc)
        }
    }
}

Summary
    > Indexing documents is like a pipeline or distillation process
        Doucments -> Taokens -> Terms -> ... ->
                [Final Term + Posting]
    The small details of this normalization process are extremely important

Combination of Terms
    > Morrison AND Moon AND DANCE   1
    > Morrison AND Moon             3
    > Morrison AND Dance            3
    > Dance AND Moon                5
    > Dance                         1000
    > Moon                          100
    > Morrison                      30
    > <empty>

Observation
    > Using a Boolean system, a user has to enter a large number of queries
        - A complex endeavor
    > Or, the system has to generate a large number of queries and show them to the user in some sensible fashion

Exact Match
    > Exact Match
        - An IR system where only documents that exactly match the query are returned
            * Boolean systems (& SQL, of course)

Another Approach
    > Best Match
        - An IR system that returns a ranked list of results
            *Search Engines

How do Best Match Systems Work
    > Several strategies but Inverse Document Frequency is a very important once
    > Basic Idea
        - If a document contains a word that rarely occurs  elsewhere in the collection of documents then  that word is significant...

Example
    > Corpus on Artificial Intelligence
        - If you type 'computer science' how many documents would be returned?
        - What about 'artificial intelligence'?
        - What about 'pink robot vacuum cleaners'?

Exercise
    > Given an inverted file, what are some interesting/useful statistics?
        - Total # of docs that a keywords appears (moon in TWO docs)
        - Total # of docs
        - Given a document, # of times that they keyword appears
        - Total # of words
        - # of times keyword appears in field
        
Statistics that are Hulpful
    > NDoc
        - Number of documents in a system
    > Fk
        - Number of occurrences of keyword across all documents
    > Dk
        - Number of documents that contain keyword
    > Fkd
        - Number of occurrences of keyword, k, in document, d
        
Importance of a Term, Wkd
    > Wkd = Fkd * discrimk
        
        Weight of a keyword in a document is the frequence of the keyword times 'how discriminating' they keyword is
    
    > Discrimk = log (NDoc / Dk) + 1

Summary
    > Structure of inverted file
    > Best match versus exact match retrieval systems
    > Document / term statistics
    > Notion of weighting