Shared Flashcard Set

Details

COMP3017 - Lecture 9 (B) - Information Retreival
COMP3017 - Lecture 9 (B) - Information Retreival
19
Computer Science
Undergraduate 2
05/04/2014

Additional Computer Science Flashcards

 


 

Cards

Term
What is a Lexicon?
Definition
  • Data structure listing all the terms that appear in the document collection
    • Lexicon construction carried out after term selection
    • Usually hash table for fast lookup
Term
What is an inverted index?
Definition
  • Data structure that relates a word in the lexicon to a document in which it appeasrs (a posting)
  • May also include:
    • Occurrence count within document
    • Pointer to location of word within document

[image]

 

Term
Give examples of searching for single words
Definition
  1. Example 1: If postings consist only of document identifiers, no rankin of results possible
    1. Lookup query term in lexicon
    2. Use inverted index to get address of posting list
    3. Return full posting address
  2. Example 2: If postings also contain term count, we can do some primitive ranking of results
    1. Lookup query term in lexicon
    2. Use inverted index to get address of posting lsit
    3. Create empty priority queue with maximum length R
    4. For each document/count pair in posting list
      1. If priority queue has fewer than R elements, add (doc, count) pair to queue
      2. If count is larger than lowest entry in queue, delete lowest entry and add the new pair
Term
What are the characteristics of the boolean model?
Definition
  • Lookup each query term in lexicon and inverted index
  • Apply boolean set operators to posting lists for query terms to identify set of relevant documents
    • AND - set intersection
    • OR - set union
    • NOT - set complement
Term
Give an example of the boolean model
Definition
  • Document collection:
    • d1="Three quarks for Master Mark"
    • d2="The strange history of quark cheese"
    • d3="Strange quark plasmas"
    • d4="Strange Quark XPress algorithm"
  • Lexicon and inverted index:
    • three   →   {d1}
    • quark   →   {d1, d2, d3, d4}
    • master  →   {d1}
    • mark   →   {d1}
    • strange  →   {d2, d3, d4}
    • history   →   {d2}
    • cheese   →   {d2}
    • plasma   →   {d3}
    • xpress   →   {d4}
    • problem   →   {d4}
  • Query: "strange" AND "quark" AND NOT "cheese"
  • Result set:
  • {D2, D3, D4} ∩ {D1, D2, D3, D4} ∩ {D1, D3, D4} = {D3, D4}
Term
What are the principles of document length normalisation?
Definition
  • In a large collection, document sizes vary widely
    • Large documents are more likely to be considered relevant
  • Adjust answer set ranking - divide rank by document length
    • Size in bytes
    • Number of words
    • Vector norm
Term
What are some characteristics of positional indexes?
Definition
  • If postings include term locations, we can also support a term proximity operator
    • Term1/k term2
    • effectively an extended AND operator
    • term1 occurs in a document within k words of term2
  • When calculating intersection of posting lists, examine term locations
Term
What are Biword Indexes?
Definition
  • Can extend boolean model to handle phrase queries by indexing biwords (pairs of consecutive terms)
  • Consider a document containing the phrase "electronics and computer science"
    • After removing stop words and normalising, index and biwords "electronic computer" and "computer science"
    • When querying on phrases longer than two terms, use AND to join constituent biwords
Term
What are the disadvantages of the boolean model?
Definition
  • A document either matches a query or not
    • Need better criteria for ranking hits (similarity is not binary)
    • Partial matches
    • Term weighting
Term
What are the principles of the Vector model?
Definition
  • Documents and queries are reprensented as vectors of term-based features (occurrence of terms in collection)
    • dj=(ti,j, t2,j, ... tn,j)
    • qj=(ti,k, t2,k, ... tn,k)
  • Features may be binary
    • ti,j = 1 if term tm is present in document dj
    • ti,j = 0 otherwise
Term
How is it possible to assess similarity between query and document as number of terms in common?
Definition
[image]
Term
What are the principles of the weighted vector model?
Definition
  • Not all terms are equally interesting
    • 'the' vs 'system' vs 'Bernes-Lee'
  • Replace binary features with weights

[image]

 

View documents and queries as vectors in multidimensional space

Term
How can similarity be determined in weighted vector models?
Definition

Using the dot product (i.e. angle between vectors in multidimensional space)

 

[image]

Term
What are some basis for term weighting?
Definition
  • Need basis for assigning weights to terms in documents
  • How common is the term in the document? 
    • Within document measure
    • Term frequency
  • How common is the term in document collection? 
  • Collection frequency
  • Inverse document frequency 
Term
What are the formulas for Term Frequency (TF) and Inverse Document Frequency (IDF)
Definition
[image]
Term
Give an example of TF-IDF
Definition
  • Consider the following document collection

[image]

[image]

Term
What are the principles of the probabilistic model?
Definition

Calculate similarity between query and documet as the odds that the document is relevant to the query:

[image]

P(relevant|q) and P(~relevant|q) are the same for all documents in teh collection

Term
What are the principles of query refinement?
Definition
  • Typical queries very short, ambiguous
    • Add more terms to disambiguate, improve
  • Often difficult for users to express their information needs as a query
    • Easier to say which results are/aren't relevant than to write a better query
Supporting users have an ad free experience!