Shared Flashcard Set

Details

Hadoop Review
General review for Hadoop Final
24
Computer Science
Undergraduate 4
05/12/2014


Cards

Term
HDFS
Definition
Hadoop Distributed File System - high-performance distributed file system for storing data.
Term
YARN
Definition
MapReduce 2.0 (MRv2) - splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into a global ResourceManager and a per-application ApplicationMaster.
Term
Sqoop
Definition
Tool for migrating data between structured data stores (such as relational databases) and HDFS/Hadoop storage
Term
Apache Pig
Definition
Interpreted high-level language for data analysis, layered over MapReduce
Term
Hive
Definition
Data warehouse that facilitates querying and managing large datasets - its query language (HiveQL) mimics relational database (SQL) syntax
Term
Hadoop Streaming
Definition
Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer
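
As an illustration, a word-count mapper and reducer written for Hadoop Streaming might look like the sketch below. Streaming hands records to the script as text lines and expects tab-separated key/value lines back; the function names `map_lines` and `reduce_lines` are hypothetical, and the shuffle is simulated here with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort (shuffle) -> reduce.
mapped = sorted(map_lines(["Hadoop hdfs", "hadoop yarn"]))
reduced = list(reduce_lines(mapped))
```

In a real job each function would live in its own script that reads `sys.stdin` and prints to stdout, and the two scripts would be passed to the streaming utility as the mapper and the reducer.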
Term
Apache HBase
Definition
Distributed, scalable big data store - stores data as sorted key/value pairs, with the key consisting of the row and column - used for fast lookups
Term
Apache Accumulo
Definition
Robust, scalable, high-performance key/value store for data storage and retrieval, with cell-based access controls
Term
Apache Avro
Definition
Serialization framework that compresses and serializes data for storage or transfer. Relies heavily on schemas
Term
Parquet
Definition
columnar storage format for Hadoop.
Term
Apache Mahout
Definition
Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce
Term
Storm
Definition
Distributed real-time computation system - processes streaming data in real time in memory, making it extremely fast
Term
ZooKeeper
Definition
centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
Term
Redis, Memcached
Definition
Open-source in-memory key/value stores
Term
Spark
Definition
fast, general engine for large-scale data processing
Term
Azkaban
Definition
Batch workflow job scheduler for running Hadoop jobs
Term
Apache Cassandra
Definition
NoSQL database for managing large amounts of structured, semi-structured, and unstructured data
Term
Numerical Summarization
Definition
Design pattern - group records together by a field or set of fields and calculate a numerical aggregate per group...

Uses a mapper, partitioner, and reducer.
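
The pattern can be sketched in plain Python as below, with the shuffle simulated by a dictionary. The record fields `user` and `length` and all function names are assumptions chosen for the example; a real job would implement the same roles as Hadoop mapper and reducer classes.

```python
from collections import defaultdict


def map_record(record):
    """Mapper: emit (group key, numeric value) for one record."""
    return record["user"], record["length"]


def reduce_group(values):
    """Reducer: compute the numerical aggregates for one group."""
    return {
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "avg": sum(values) / len(values),
    }


def summarize(records):
    """Simulate the shuffle: collect values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        key, value = map_record(record)
        groups[key].append(value)
    return {key: reduce_group(vals) for key, vals in groups.items()}


summary = summarize([
    {"user": "a", "length": 10},
    {"user": "a", "length": 30},
    {"user": "b", "length": 5},
])
```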
Term
Inverted Index
Definition
Design pattern - generate an index from a data set to enable fast searches or data enrichment. Building the index takes time but greatly reduces search times; the output can be ingested into a key/value store
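
A minimal in-memory sketch of the idea, assuming documents arrive as a mapping from a document ID to its text (the function name and input shape are hypothetical): the "map" step emits (term, document ID) pairs and the "reduce" step collects the IDs per term.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)  # conceptually: emit (term, doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index({
    1: "Hadoop stores data",
    2: "Spark processes data",
})
```

Looking up a term afterwards is a single dictionary access, which is what makes the up-front indexing cost worthwhile.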
Term
Combiner
Definition
Design pattern - used to do concatenation prior to the reduce phase
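
To make the effect concrete, the sketch below simulates two input splits in a word count. It assumes the combiner sums counts locally on each split before anything reaches the reducer; all function names are hypothetical.

```python
from collections import Counter


def map_split(lines):
    """Mapper: emit (word, 1) for each word in one input split."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: merge counts locally so fewer pairs cross the network."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_all(combined_outputs):
    """Reducer: merge the already-combined output of every split."""
    total = Counter()
    for pairs in combined_outputs:
        for word, count in pairs:
            total[word] += count
    return dict(total)


split_1 = map_split(["a a b"])
split_2 = map_split(["a b b"])
totals = reduce_all([combine(split_1), combine(split_2)])
```

Each split produces three pairs from the mapper but only two after combining, and the final totals are unchanged.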
Term
Counting with counters
Definition
Design pattern - use the MapReduce framework's counter utility to calculate a global sum entirely on the map side, producing no output
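
The sketch below imitates the pattern in plain Python. A `collections.Counter` stands in for the framework's counters, and `merge_counters` stands in for the framework adding up the counters of all map tasks; the `state` field and both function names are assumptions for the example.

```python
from collections import Counter


def map_with_counters(records, counters):
    """Mapper: increment a named counter per record and emit nothing."""
    for record in records:
        counters[record.get("state", "UNKNOWN")] += 1
    return []  # no key/value output; the result lives only in the counters


def merge_counters(task_counters):
    """Stand-in for the framework aggregating counters across map tasks."""
    return sum(task_counters, Counter())


task_1, task_2 = Counter(), Counter()
out_1 = map_with_counters([{"state": "CA"}, {"state": "NY"}], task_1)
out_2 = map_with_counters([{"state": "CA"}, {}], task_2)
totals = merge_counters([task_1, task_2])
```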
Term
Basic filtering
Definition
Filtering pattern - map-side filtering: the mapper checks each record against a condition and emits only the records that pass
Term
Bloom filtering
Definition
Filtering pattern - keep records that are members of a large predefined set of values - tiny possibility of false positives. Example: filtering out comments that don't contain a keyword
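
A small sketch of the keyword example, assuming a hand-rolled Bloom filter built on SHA-256 (the class, its default sizes, and `filter_comments` are hypothetical, not a Hadoop API). Members are never rejected, but a non-member can occasionally slip through as a false positive.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        """Derive num_hashes bit positions by seeding one hash function."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos] for pos in self._positions(item))


def filter_comments(comments, keywords):
    """Map-side filter: keep comments with a word the filter may contain."""
    bloom = BloomFilter()
    for keyword in keywords:
        bloom.add(keyword)
    return [c for c in comments
            if any(bloom.might_contain(w) for w in c.lower().split())]


kept = filter_comments(["I love Hadoop", "nice weather"], ["hadoop"])
```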
Term
Reduce side join
Definition