INFO 340 - May 19, 2004 - L16 Notes By: Fortier, Prins Know the difference between aggregation, composition, and a regular. Topics > Zipf's Law > Page Rank > Assignment #4 Zipf's Law > The frequency of occurrence of some event, F(r) is a function of its rank, r F(r) = C/r·µÉ a’âà1 C’âà0.1 a and C are parameters Example > The TIME collection consists of 423 short news stories (1.6 MB) > Parse each store into words > Count the number of time each word occurs (word frequency) > Rank the word frequencies from highest to lowest Graphs and stuff... Look at the slides F(r) = C/r·µÉ a’âà1 C’âà0.1 log(F(r)) = log(C/r·µÉ) = log(Cr¬Ø·µÉ) = -a’ãÖlog(C’ãÖr) = log(C)-a’ãÖlog(r) "Cool eh?" Take the log o each side of the equation and simplify by following rules of logarithms We end up with an equation in the forum of Y = -mX + b A straight line which has a negative slope COOL!!!!111one upper cutt-off |- | <---frequent stuff | \_ stop words | \_ | \_ | lower cut-off | \_ <-- significant words |_________|_____<----rare stuff NICE thanks Summary > Scoring documents by inverse document frequency 'works' because the use of works in text follows Zipf's distribution. Web Structure > Topics - Hubs versus authorities (J. Kleinberg) - Google's PageRank algorithm Three Regions in the Web * Upstream Nodes --> core Nodes --> Downstream Nodes * Tendrils (connect the upstream and downstream nodes) > The core contains the major sites > Upstream nodes lead to the core be they cannot be reached from the cure > Downstream nodes can be reached from the core but the core can no be read from What Might Connectivity Reveal? > High link density among a set of pages might be an indication of a common topic Hubs vs. Authorities > Hubs are pages that point to many authorities > Hubs and authorities are defined by the link connectivity structure Analysis of Web Structure > One research project sought to uncover these structures of hubs and authorities, finding 100,000 'communities' > This 'latent' structure, bottom-up structure, can be exploited by search engines Google *** NOT INCLUDED *** PageRank *** NOT INCLUDED *** How PageRank Works? *** NOT INCLUDED *** PageRank <<-- t2 --> A <-- t1 --> > Consider page A with incoming links from two other pages, t1 and t2 > The PageRank for A is calculated using the PageRank values for t1 and t2. Why? Algorithm PR(A) = (1-d)+d*[(PR(t1))/(C(t1)) +... (PR(tn))/(C(tn))] PR(A) : PageRank of page A C(A) : Number of outbound links d : Damping factor (set to .85)