Shared Flashcard Set

Details

Search Engines 11-442 Midterm 2
A flash card set for studying for the final of the CMU 11-X42 course on Search Engines
Total Cards: 15
Subject: Education
Level: Graduate
Created: 12/06/2023

Cards

Term
Index Construction
Definition

Partition the index across machines (sharding)

Add tiers of indexes

Cache query results (checked first)

Cache index terms, i.e. their inverted lists (checked second, on a query-cache miss)

Use MapReduce to build the index
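
The lookup order above (query cache first, term cache second, index shards last) can be sketched as follows. This is a minimal illustration, not how any production engine does it: the toy collection, the shard count, the `doc_id % NUM_SHARDS` partitioning rule, and all function names are assumptions made for the example, and tiering and MapReduce are left out.

```python
from collections import defaultdict

NUM_SHARDS = 2  # assumed shard count, for illustration only

# Toy collection (made up for this sketch).
DOCS = {
    0: "cheap flights",
    1: "cheap hotels",
    2: "flights to pittsburgh",
    3: "pittsburgh hotels",
}


def build_shards(docs):
    """Partition documents across shards; each shard has its own inverted index."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % NUM_SHARDS]  # assumed partitioning rule
        for term in text.split():
            shard[term].add(doc_id)
    return shards


SHARDS = build_shards(DOCS)
query_cache = {}  # query string -> ranked result list
term_cache = {}   # term -> merged postings from all shards
stats = {"query_hits": 0, "term_hits": 0, "shard_reads": 0}


def postings(term):
    """Check the term cache second; only on a miss, read every shard."""
    if term in term_cache:
        stats["term_hits"] += 1
        return term_cache[term]
    stats["shard_reads"] += 1
    merged = set()
    for shard in SHARDS:
        merged |= shard.get(term, set())
    term_cache[term] = merged
    return merged


def search(query):
    """Boolean AND search that checks the query cache first."""
    if query in query_cache:
        stats["query_hits"] += 1
        return query_cache[query]
    terms = query.split()
    matches = set.intersection(*(postings(t) for t in terms)) if terms else set()
    result = sorted(matches)
    query_cache[query] = result
    return result
```

Running `search("cheap flights")` and then `search("cheap hotels")` reuses the cached postings for "cheap"; repeating the first query is answered from the query cache without touching any shard.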

 

 

Term

Dual-Encoder Models

(Representation-based models)

Definition

Encode the query and the document separately (the two encoding functions can be the same or different).

Then match the two encodings, typically with a simple match function such as dot product or cosine similarity.
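
A minimal sketch of the encode-separately-then-match pattern. The bag-of-words `encode` function is only a toy stand-in for a learned neural encoder, and `rank`, `encode_query`, and `encode_doc` are names invented for this example.

```python
import math
from collections import Counter


def encode(text):
    """Toy encoder: a bag-of-words count vector (stand-in for a neural encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Simple match function between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs, encode_query=encode, encode_doc=encode):
    """Encode query and documents independently, then score each pair."""
    doc_vecs = [encode_doc(d) for d in docs]  # could be precomputed offline
    q_vec = encode_query(query)               # computed at query time
    scored = [(cosine(q_vec, d_vec), d) for d_vec, d in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because the document never sees the query during encoding, document vectors can be built ahead of time, which is the main practical appeal of this design.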

Term
Deep Structured Semantic Models (DSSM)
Definition

Type of Dual-Encoder model

Uses a vocabulary of 500k terms and ignores all others.

Maps each word to a vector by word hashing; it does not use pretrained embeddings such as word2vec.

Instead, it breaks words into letter trigrams (e.g., best -> #best# -> #be, bes, est, st#). This hashing is robust to spelling differences, but not to conceptual differences.

[image]

More effective than BM25, but less effective than a good learning-to-rank (LTR) model.
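
The letter-trigram step can be sketched in a few lines. The function names and the Jaccard overlap measure are assumptions added for illustration; the real model feeds the trigram counts into a neural network.

```python
def letter_trigrams(word):
    """Add boundary markers, then slide a 3-character window over the word."""
    marked = f"#{word.lower()}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def trigram_overlap(a, b):
    """Jaccard overlap of two words' trigram sets (illustrative measure only)."""
    ta, tb = set(letter_trigrams(a)), set(letter_trigrams(b))
    return len(ta & tb) / len(ta | tb)
```

`letter_trigrams("best")` yields the four trigrams from the card. Spelling variants such as "color" and "colour" share several trigrams, while synonyms such as "car" and "automobile" share none, which is the spelling-versus-concept limitation noted above.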

Term
Deep Relevance Matching Model (DRMM)
Definition

Interaction-based neural model: compute local matches between pieces of text (e.g., cosine similarity of word embeddings), then learn patterns of interaction.

Convert all words to word2vec embeddings.

Compare every query word to every document word.

Use histogram pooling to get a constant number of inputs (e.g., the log of the number of document words whose match score is between 0.8 and 0.9 is one input feature).

Pass that through a feed-forward neural net to get a score.

[image]

Too expensive for initial retrieval.
Its effectiveness is comparable to good LTR models.
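
A rough sketch of the histogram pooling step for a single query word. The bin count, the equal-width bins over [-1, 1], and the use of log(1 + count) to avoid log(0) are assumptions made for this example, and the feed-forward network that consumes the histogram is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def matching_histogram(query_vec, doc_vecs, num_bins=5):
    """Bucket the similarities of one query word against every document word.

    The output length is always num_bins, regardless of document length.
    """
    counts = [0] * num_bins
    for doc_vec in doc_vecs:
        sim = cosine(query_vec, doc_vec)
        # Map [-1, 1] onto bin indices 0..num_bins-1 (sim == 1 goes in the last bin).
        idx = min(int((sim + 1.0) / 2.0 * num_bins), num_bins - 1)
        counts[idx] += 1
    return [math.log1p(c) for c in counts]
```

A 4-word document and a 40-word document both produce a 5-element feature vector here, which is what lets a fixed-size network score documents of any length.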

 

 

Term
BERT Ranking
Definition
Term
DeepCT
Definition

The idea is to fine-tune BERT to produce an importance score for each word in its context, then use those scores in place of term frequency. When a word receives more than one score, use the maximum.

This improves Indri and BM25 rankings. It is a preprocessing step, so it can be done offline.
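
A small sketch of the "replace term frequency with importance scores" step. The per-token scores below are made-up numbers standing in for the fine-tuned BERT output, and the `scale` factor used to turn scores into integer pseudo term frequencies is an assumption for this example.

```python
def pseudo_term_frequencies(tokens, scores, scale=100):
    """Keep the max score per word, then scale it to an integer pseudo-tf."""
    best = {}
    for token, score in zip(tokens, scores):
        best[token] = max(best.get(token, 0.0), score)
    return {token: round(score * scale) for token, score in best.items()}


# Hypothetical model output for a short passage (not real BERT scores).
tokens = ["the", "apple", "pie", "the", "apple"]
scores = [0.02, 0.9, 0.6, 0.01, 0.4]
weights = pseudo_term_frequencies(tokens, scores)
```

The resulting weights can be stored in an ordinary inverted index, so the retrieval model itself (BM25 or Indri) does not need to change.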

Term
Doc2Query/DocT5Query
Definition

The idea is to automatically generate questions that a document could answer and append them to the end of the document, then index and search with traditional techniques. In other words, we augment the documents.

This is a lexicon-based approach to document expansion.

This improves BM25 by 15%; DocT5Query, which uses the stronger T5 transformer, improves it by 25%.
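
The expansion step itself is simple once the questions exist. In the sketch below, `fake_query_generator` is a hypothetical stand-in that returns canned questions; a real system would call a trained sequence-to-sequence model there.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fake_query_generator(document, n=2):
    """Hypothetical stand-in for a trained query-generation model."""
    canned = ["what is aspirin used for", "does aspirin help with pain"]
    return canned[:n]


def expand_document(document, generator=fake_query_generator):
    """Append the generated questions to the end of the document text."""
    queries = generator(document)
    if not queries:
        return document
    return document + " " + " ".join(queries)


doc = "Aspirin relieves headaches and reduces fever."
expanded = expand_document(doc)
```

The original text never mentions "pain", so a lexical query containing that word would miss it; the expanded text does contain it, and it can be indexed with BM25 as usual.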

Term
COIL
Definition

Contextualized Inverted Lists

The idea is that words don't convey meaning on their own, but BERT can be trained to produce contextualized embeddings, which do a better job. Let's use those instead.

 [image]
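
A minimal sketch of scoring with contextualized vectors stored per token. It assumes the usual description of COIL scoring: only exact token matches interact, each query token takes its best-matching occurrence in the document, and the per-token maxima are summed. The 2-dimensional vectors are invented for illustration, not real BERT outputs.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def coil_score(query, doc):
    """query and doc are lists of (token, contextualized_vector) pairs."""
    total = 0.0
    for q_token, q_vec in query:
        # Only document tokens that match the query token exactly can contribute.
        sims = [dot(q_vec, d_vec) for d_token, d_vec in doc if d_token == q_token]
        if sims:
            total += max(sims)
    return total


# Toy vectors: dimension 0 ~ "river sense", dimension 1 ~ "finance sense".
query = [("bank", [1.0, 0.0])]
river_doc = [("river", [0.8, 0.2]), ("bank", [0.9, 0.1])]
money_doc = [("bank", [0.1, 0.9]), ("account", [0.0, 1.0])]
```

Both documents contain the word "bank", so plain term matching cannot separate them, but `coil_score(query, river_doc)` is higher than `coil_score(query, money_doc)` because the contextualized vectors differ.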

Term
SPLADE
Definition

The idea is to learn a representative bag of words for the text, rather than using only the words that actually appear in it. This is done by projecting the output of a BERT model into a vocabulary-sized vector.

 

[image]
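
A sketch of how vocabulary-sized outputs become a sparse bag of words. It assumes one commonly described variant: apply log(1 + ReLU(x)) to each logit and max-pool over the input tokens (a sum-pooling variant also exists). The 4-word vocabulary and the logits are made up; a real model produces one logit per vocabulary entry for every input token.

```python
import math

VOCAB = ["car", "vehicle", "auto", "banana"]  # toy vocabulary for illustration


def splade_weights(token_logits):
    """token_logits holds one row of vocabulary-sized logits per input token."""
    weights = []
    for j in range(len(VOCAB)):
        # log-saturated ReLU per token, then max-pool over tokens.
        weights.append(max(math.log1p(max(row[j], 0.0)) for row in token_logits))
    return weights


def sparse_bow(weights, threshold=0.0):
    """Keep only the vocabulary terms with a positive weight."""
    return {VOCAB[j]: w for j, w in enumerate(weights) if w > threshold}


# Hypothetical logits for a two-token input.
logits = [[2.0, 1.0, 0.5, -3.0], [0.5, 3.0, 0.0, -1.0]]
bow = sparse_bow(splade_weights(logits))
```

The learned bag of words can contain terms that never appeared in the input, and terms with negative logits drop out entirely, so the result still fits in an inverted index.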

Term
FAISS
Definition
[image]
Term
ANCE
Definition

Hard Negative Mining

[image]
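
Hard negative mining can be sketched as: retrieve the top-scoring documents with the current model, drop the known relevant ones, and train against what is left. The brute-force dot-product search below stands in for an approximate nearest neighbor index, and the vectors and document IDs are invented for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(query_vec, doc_vecs, positive_ids, k=2):
    """Return the k highest-scoring documents that are not known positives."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in positive_ids][:k]


# Toy embeddings from a hypothetical current model checkpoint.
query_vec = [1.0, 0.0]
doc_vecs = {
    "d1": [0.9, 0.1],   # labeled relevant
    "d2": [0.8, 0.2],   # not relevant, but scores high: a hard negative
    "d3": [0.7, 0.0],
    "d4": [-1.0, 0.0],  # easy negative, scores very low
}
negatives = mine_hard_negatives(query_vec, doc_vecs, positive_ids={"d1"})
```

Documents like "d4" teach the model little because it already ranks them low; the mined negatives are the ones it currently confuses with relevant documents.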

Term
Condenser
Definition
[image]
Term
HyDE
Definition
Use an LLM to generate a hypothetical document that would be a good response to the query, then use it to augment the query.
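
A toy sketch of the augmentation step. `fake_llm` is a hypothetical stand-in that returns a canned passage instead of calling a real model, and term overlap is used as a simple stand-in for whatever retrieval model follows.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fake_llm(query):
    """Hypothetical stand-in for an LLM call; returns a canned passage."""
    return "Plants make food through photosynthesis, using sunlight to produce glucose."


def hyde_augment(query, llm=fake_llm):
    """Append the generated hypothetical document to the original query."""
    return query + " " + llm(query)


def overlap(query, doc):
    """Number of distinct terms shared by the query and the document."""
    return len(tokenize(query) & tokenize(doc))


query = "how do plants make food"
target = "photosynthesis converts sunlight into glucose in chloroplasts"
augmented = hyde_augment(query)
```

The original query shares no terms with the target document, while the augmented query shares several, even though the generated passage was never checked for factual accuracy.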
Term
Multi-hop Search
Definition

Questions that can only be answered by first answering subquestions. LLMs can help with this.

Example: Who was the fourth president's wife?

Need to know who the fourth president was.

Need to know whom he was married to.
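
The two hops in the example can be sketched as chained lookups. The tiny fact table assumes the question is about the United States, and the decomposition into subquestions is hard-coded here; in practice that is the part an LLM would help with.

```python
# Tiny fact table for illustration (assumes the question is about the United States).
FACTS = {
    ("United States", "fourth president"): "James Madison",
    ("James Madison", "wife"): "Dolley Madison",
}


def answer_multi_hop(start, relations):
    """Answer each subquestion in turn, feeding each answer into the next hop."""
    entity = start
    trail = []
    for relation in relations:
        entity = FACTS[(entity, relation)]
        trail.append(entity)
    return entity, trail


answer, trail = answer_multi_hop("United States", ["fourth president", "wife"])
```

The second lookup cannot even be phrased until the first one returns, which is why a single retrieval pass often fails on this kind of question.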

Term
RAG
Definition

Retrieval Augmented Generation

Given a query, have an LLM (or a similar system) research it by retrieving relevant documents first, then generate an answer from them.
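
A minimal retrieve-then-generate sketch. The term-overlap retriever, the prompt wording, and the `llm` callable are all assumptions for illustration; any retrieval model and any text generator could fill those roles.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, corpus, k=2):
    """Rank passages by term overlap with the query (toy retriever)."""
    q_terms = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q_terms & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the passages below.\n{context}\nQuestion: {query}\nAnswer:"


def rag_answer(query, corpus, llm, k=2):
    """Retrieve first, then hand the augmented prompt to the generator."""
    return llm(build_prompt(query, retrieve(query, corpus, k)))


corpus = [
    "BM25 is a lexical ranking function.",
    "FAISS is a library for vector similarity search.",
    "Pasta is boiled in salted water.",
]
```

Passing `llm=lambda prompt: prompt` is a handy way to inspect exactly what the generator would see before wiring in a real model.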
