Shared Flashcard Set


COMP3017 - Lecture 9 (A) - Information Retreival
COMP3017 - Lecture 9 (A) - Information Retreival
Computer Science
Undergraduate 2

Additional Computer Science Flashcards




What is the definition of Information Retreival?
  • An information need is a topic about which a user desires to know more
  • A query is what a user conveys to the computer in an attempt to communicate their information need
  • A document is relevant if it is one that the user perceives as containing the information of value with respect to their personal information need
  • In other words IR is the taks of finding relevant documents in a collection
  • Best known modern examples of IR are Web Search Engines
What is the information retreival problem?
The primary goal of an information retreival system is to retreive all the documents tha are relevant to a user query while retreiving as few non-relevant documents as possible
Draw a diagram for a high level system architecture



What are some key concepts in IR Systems?
  • Document collection
    • Document granularity: paragraph, page or multi-page text
  • Query
    • Expressed in a query language
    • List of words [AI book], a phrase ["AI Book"], contain boolean operators [AI AND Book] and non-boolean operators [AI NEAR book]
  • Answer set
    • Set of documents judged to be relevant to the query
    • Ranked in decreasing order of relevance
What are some key points on IR system implementations?
  • User expectations of information retrieval systems are exceedingly demanding
    • Need to index billion document collections
    • Need to return top results these collctions in a fraction of a second
  • Design of appropriate document representations and data structures is critical
What are some characteristics of Information Retreival Models?
  • Information retreival systems are distinguished by how they represent documents to queries
  • Several common models: Boolean (Set-theoretic) Algebraic (Vector spaces), probabilistic
  • An information retreival model consists of
    • A set D of representations of the documents in the collection
    • A set Q of representation of user information needs (queries)
    • A ranking function R(di,qi) that associates a real number with a query representation qi in Q and a document representation di in D
What are the principles of implementing Term Selection?

Information retreival systems either:

  • Index all words contained in documents (full-text indexing)
  • Selectively extract index terms from the documents that are to be indexes


What are the principles of implementing Tokenization?
  • Identify distinct words
    • Separate on whitespace, turn each document into list of tokens
    • Discard punctuation (but consider tokenisation of O'Neill, and hyphens)
    • Fold case, remove diacritics
  • Tokenisation is language-specific
    • Identify language first (using character n-gram model, Markov model)
    • Languages with compound words (e.g. German, Finnish) may require special treatment (compound splitting)
    • Languages which do not leave spaces between words (e.g. Chinese, Japanese) may require special treatment (word segmentation)
Draw Zipf's Law on Word significance



What are the concepts of Stop Word Removal?
  • Extremely common words have little use for discriminating between documents
    • e.g. the, od, and , a, but, etc
  • Construct a stop list
    • Short terms by collection frequency
    • Identify high frequency stop words (possibly hand filtering)
    • Use stop list discard terms during indexing
What is noun grouping?

Most meaning is carried by nouns


Identify adjacent nouns to index terms (special case of biword indexing)

What are the concepts of stemming?
  • Terms in user query might not exactly match document terms
    • Query contains 'computer', document contains 'computer'
  • Terms have wide variety of sintactic variations
    • Affixes added to word stems, that are common to all inflected forms
    • For english, typically suffixes
    • e.g. connect is the stem of: connected, connecting, connects, etc.
  • Use stemming algorithm:¬†
    • To remove affixes from terms before indexing
    • When generating query terms from query
What are some stemming algorithms?
  • Lookup table
    • Contains mappings from inflexted forms to uninflected stems
  • Suffix-stripping
    • Rules for identifying and removing suffixes
    • e.g. '-sses' --> '-ss'
    • Porter Stemming Algorithm
  • Lemmatisation
    • Identify part of speech (noun, verb, adverb)
    • Apply POS-specific normalisation rules to yield lemmas
What are some disadvantages of Stemming?
  • Yields small (~2%) increase in recall for English
    • More important in other language
  • Can harm precision
    • 'Stock' versus 'stocking'
  • Difficult in agglomerative languages (German, Finnish)
  • Difficult in languages with many exceptions (Spanish)
    • Can use dictionary
What are some characteristics of manual indexing?
  • Identification of indexing terms by a (subject expert)
  • Indexing terms typically taken from a controlled vocabulary
    • May be a flat list of terms, or hierarchical thesaurus
    • May be structured terms
      • (e.g. names or structures of chemical compounds)
What are some characteristics on result presentation?
  • Probability ranking principle
    • List oredered by probabilities of relevance
    • Good for speed
    • Doesn't consider utility
    • If two copies of most relevant document, second has equal relevance but zero utility once you've seen the first
  • Many IR systems eliminate results that are too similar
Supporting users have an ad free experience!