Shared Flashcard Set

Details

Intro Data Science
102
Mathematics
12th Grade
11/03/2017

Cards

Term
algorithm
Definition
A series of repeatable steps for carrying out a certain type of task with data. As with data structures, people studying computer science learn about different algorithms and their suitability for various tasks. Specific data structures often play a role in how certain algorithms get implemented.
Term
histogram
Definition
A graphical representation of the distribution of a set of numeric data, usually a vertical bar graph.
Term
coefficient
Definition
“A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (Ex.: x in x(y + z), 6 in 6ab)” [Webster's]. When graphing an equation such as y = 3x + 4, the coefficient of x determines the line's slope. Discussions of statistics often mention specific coefficients for specific tasks, such as the correlation coefficient, Cramér’s coefficient, and the Gini coefficient.
Term
data science
Definition
The ability to extract knowledge and insights from large and complex data sets
Term
mode
Definition
“The value that occurs most often in a sample of data. Like the median, the mode cannot be directly calculated”
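As an illustration, the mode can be found by counting how often each value occurs. This is a minimal Python sketch; the function name `most_common_value` and the sample data are made up for this card, and ties are resolved in favor of the value seen first.

```python
from collections import Counter


def most_common_value(values):
    """Return the value that occurs most often in `values`."""
    counts = Counter(values)
    # most_common(1) returns a one-element list holding a (value, count) pair.
    return counts.most_common(1)[0][0]


sample = [3, 7, 7, 2, 9, 7, 2]
print(most_common_value(sample))  # 7
```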
Term
Dashboard
Definition
A graphical representation of the analyses performed by the algorithms
Term
Data Aggregation
Definition
The act of collecting data from multiple sources for the purpose of reporting or analysis.
Term
Database
Definition
A digital collection of data and the structure around which the data is organized. The data is typically entered into and accessed via a database management system
Term
Data center
Definition
A physical facility that houses a large number of servers and data storage devices. Data centers might belong to a single organization or sell their services to many organizations.
Term
data cleansing
Definition
The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data, and provide more consistency.
Term
Data collection
Definition
Any process that captures any type of data.
Term
Data Set
Definition
A collection of data, typically in tabular form.
Term
data visualization
Definition
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.
Term
demographic data
Definition
Data relating to the characteristics of a human population.
Term
distributed objects
Definition
A software module designed to work with other distributed objects stored on other computers.
Term
Distributed processing
Definition
The execution of a process across multiple computers connected by a computer network.
Term
Data point
Definition
An individual item on a graph or a chart.
Term
Data quality
Definition
The measure of data to determine its worthiness for decision making, planning, or operations.
Term
Data science
Definition
A recent term that has multiple definitions, but is generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.
Term
External data
Definition
Data that exists outside of a system.
Term
Exploratory analysis
Definition
Finding patterns within data without standard procedures or methods. It is a means of exploring the data and finding the data set's main characteristics.
Term
In-database analytics
Definition
The integration of data analytics into the data warehouse
Term
In-memory database
Definition
Any database system that relies on memory for data storage.
Term
Key Value Databases
Definition
They store each record under a primary key that uniquely identifies it, which makes lookups easy and fast. The data stored in a key-value database is normally some kind of primitive of the programming language.
Term
Load balancing
Definition
The process of distributing workload across a computer network or computer cluster to optimize performance.
Term
Location analytics
Definition
Location analytics brings mapping and map-driven analytics to enterprise business systems and data warehouses. It allows you to associate geospatial information with data sets.
Term
Location data
Definition
Data that describes a geographic location.
Term
Machine-generated data
Definition
Any data that is automatically created from a computer process, application, or other non-human source.
Term
Network analyzing
Definition
Viewing relationships among the nodes in terms of the network or graph theory, meaning analyzing connections between nodes in a network and the strength of the ties.
Term
Parallel method invocation
Definition
Allows programming code to call multiple functions in parallel.
Term
Parallel processing
Definition
The ability to execute multiple tasks at the same time.
Term
Parallel query
Definition
A query that is executed over multiple system threads for faster performance.
Term
Public data
Definition
Public information or data sets that were created with public funding
Term
Real-time data
Definition
Data that is created, processed, stored, analysed and visualized within milliseconds
Term
Reference data
Definition
Data that describes an object and its properties. The object may be physical or virtual.
Term
Risk analysis
Definition
The application of statistical methods on one or more data sets to determine the likely risk of a project, action, or decision.
Term
Search data
Definition
Aggregated data about search terms used over time.
Term
Sentiment analysis
Definition
The application of statistical functions on comments people make on the web and through social networks to determine how they feel about a product or company.
Term
Server
Definition
A physical or virtual computer that serves requests for a software application and delivers those requests over a network.
Term
Storage
Definition
Any means of storing data persistently.
Term
Software as a service
Definition
Application software that is used over the web by a thin client or web browser. Salesforce is a well-known example.
Term
Structured data
Definition
Data that is organized by a predetermined structure
Term
Text analytics
Definition
The application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.
Term
Transactional data
Definition
Data that changes unpredictably. Examples include accounts payable and receivable data, or data about product shipments.
Term
Unstructured data
Definition
Data that has no identifiable structure–for example, the text of email messages.
Term
Value
Definition
All that available data will create a lot of value for organizations, societies and consumers. Big data means big business and every industry will reap the benefits from big data
Term
Volume
Definition
The amount of data, ranging from megabytes to brontobytes
Term
Visualization
Definition
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.
Term
Schema
Definition
The structure that defines the organization of data in a database system.
Term
ACID test
Definition
A test applied to data for atomicity, consistency, isolation, and durability
Term
Behavioral analytics
Definition
Using data about people’s behavior to understand intent and predict future actions
Term
Business Intelligence
Definition
The general term used for the identification, extraction, and analysis of data.
Term
Cell phone data
Definition
Cell phones generate a tremendous amount of data, and much of it is available for use with analytical applications.
Term
Classification analysis
Definition
A systematic process for obtaining important and relevant information about data, also called metadata: data about data.
Term
Cold data storage
Definition
Storing old data that is hardly used on low-power servers. Retrieving the data will take longer
Term
control
Definition
in an experiment, the standard that is used for comparison
Term
data
Definition
information gathered from observations, a collection of facts from which conclusions may be drawn
Term
deduction
Definition
Reasoning from general to specific
Term
experiment
Definition
the act of conducting a controlled test or investigation
Term
hypothesis
Definition
possible explanation for a set of observations or possible answer to a scientific question, a prediction that can be tested
Term
induction
Definition
reasoning from detailed facts to general principles
Term
infer
Definition
conclude by reasoning
Term
model
Definition
the act of representing something (usually on a smaller scale)
Term
theory
Definition
well-tested explanation that unifies a broad range of observations
Term
Binary Variable
Definition
Binary variables are those variables which can have only two unique values.
Term
Categorical Variable
Definition
Categorical variables (or nominal variables) are those variables which have discrete qualitative values. For example, names of cities are categorical like Delhi, Mumbai, Kolkata.
Term
Classification
Definition
It is a supervised learning method where the output variable is a category.
Term
Clustering
Definition
Clustering is an unsupervised learning method used to discover the inherent groupings in the data.
Term
Confidence Interval
Definition
A confidence interval is a range, computed from a sample, that is used to estimate a population value, such as what percent of a population fits a category.
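One common way to compute such an interval for a percentage is the normal approximation. The sketch below assumes a 95% confidence level (z = 1.96) and a sample large enough for the approximation to be reasonable; the function name and the sample counts are illustrative only.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a population proportion."""
    p_hat = successes / n                                # sample proportion
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)  # spread of p_hat
    margin = z * standard_error                          # half-width of the interval
    return p_hat - margin, p_hat + margin


# 40 of 100 sampled people fit the category.
low, high = proportion_confidence_interval(40, 100)
print(f"{low:.3f} to {high:.3f}")  # 0.304 to 0.496
```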
Term
Continuous Variable
Definition
Continuous variables are those variables which can have an infinite number of values, but only within a specific range.
Term
Data Mining
Definition
Data mining is a study of extracting useful information from structured/unstructured data taken from various sources.
Term
Data Transformation
Definition
Data transformation is the process to convert data from one form to the other. This is usually done at a preprocessing step.
Term
Decision Tree
Definition
A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population (or sample) into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator in the input variables.
Term
Deep Learning
Definition
Deep learning is associated with a machine learning algorithm, the artificial neural network (ANN), which borrows the concept of the human brain to model arbitrary functions. An ANN requires a vast amount of data, and the algorithm is highly flexible when it comes to modeling multiple outputs simultaneously.
Term
Descriptive Statistics
Definition
Descriptive statistics comprises those values which explain the spread and central tendency of data.
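For instance, the mean and median describe central tendency, while the standard deviation and range describe spread. A small sketch using Python's standard `statistics` module; the data values are invented for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
center_mean = statistics.mean(data)
center_median = statistics.median(data)

# Spread: population standard deviation and range
spread_stdev = statistics.pstdev(data)
spread_range = max(data) - min(data)

print(center_mean, center_median, spread_stdev, spread_range)
```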
Term
Dependent Variable
Definition
A dependent variable is what you measure and which is affected by independent / input variable(s). It is called dependent because it “depends” on the independent variable. For example, let’s say we want to predict the smoking habits of people. Then whether the person smokes (“yes” or “no”) is the dependent variable.
Term
Dummy Variable
Definition
Dummy Variable is another name for Boolean variable.
Term
Feature Selection
Definition
Feature Selection is a process of choosing those features which are required to explain the predictive power of a statistical model and dropping out irrelevant features
Term
Frequentist Statistics
Definition
Frequentist statistics tests whether an event (hypothesis) occurs or not. It calculates the probability of an event in the long run of the experiment (i.e., the experiment is repeated under the same conditions to obtain the outcome).
Term
Imputation
Definition
Imputation is a technique used for handling missing values in the data. This is done either by statistical metrics, such as mean/mode imputation, or by machine learning techniques, such as kNN imputation.
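Mean imputation, the simplest of these, can be sketched as follows. The sketch assumes missing values are represented by `None`; the function name and the data are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 40, None]
print(impute_mean(ages))
```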
Term
Inferential Statistics
Definition
In inferential statistics, we try to hypothesize about the population by only looking at a sample of it.
Term
IQR
Definition
IQR (or interquartile range) is a measure of variability based on dividing the rank-ordered data set into four equal parts. It can be derived as Quartile3 – Quartile1.
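A short sketch of the calculation, using `statistics.quantiles` from the Python standard library (available from Python 3.8). Several conventions exist for computing quartiles, so other tools may give slightly different numbers; the data below is made up.

```python
import statistics


def interquartile_range(values):
    """Return Quartile3 - Quartile1 for the given data."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


data = [2, 4, 6, 8, 10, 12, 14, 16]
print(interquartile_range(data))  # 9.0
```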
Term
K-Means
Definition
It is a type of unsupervised algorithm which solves the clustering problem. It follows a simple procedure to group a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to the other clusters.
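The procedure alternates between assigning each point to its nearest cluster center and moving each center to the mean of its points. The sketch below works on one-dimensional data and takes the starting centers from the caller; both are simplifying assumptions, and the function name is hypothetical.

```python
def k_means_1d(points, centers, iterations=10):
    """Cluster 1-D points around len(centers) centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0.0, 5.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```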
Term
Lasso Regression
Definition
Lasso regression performs L1 regularization, i.e. it adds a penalty equal to the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following:

Objective = RSS + α * (sum of absolute values of coefficients)

Here, α (alpha) works similarly to the α of ridge regression and provides a trade-off between RSS and the magnitude of the coefficients. As with ridge, α can take various values:

α = 0 : Same coefficients as simple linear regression
α = ∞ : All coefficients zero (same logic as for ridge)
0 < α < ∞ : Coefficients between 0 and those of simple linear regression
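To make the objective concrete, the sketch below evaluates it for a given set of coefficients. It only computes the value being minimized and does not fit a model. The function name, the unpenalized intercept, and the tiny data set are assumptions for illustration.

```python
def lasso_objective(rows, targets, coefficients, intercept, alpha):
    """Return RSS + alpha * (sum of absolute values of coefficients)."""
    rss = 0.0
    for row, y in zip(rows, targets):
        prediction = intercept + sum(c * x for c, x in zip(coefficients, row))
        rss += (y - prediction) ** 2
    # The L1 penalty grows with the size of the coefficients.
    penalty = alpha * sum(abs(c) for c in coefficients)
    return rss + penalty


rows = [[1.0], [2.0]]
targets = [2.0, 4.0]
# A perfect fit (RSS = 0) still pays the penalty 0.5 * |2.0| = 1.0.
print(lasso_objective(rows, targets, [2.0], 0.0, alpha=0.5))  # 1.0
```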
Term
Linear Regression
Definition
The best way to understand linear regression is to relive an experience from childhood. Let us say you ask a child in fifth grade to arrange the people in their class in increasing order of weight, without asking them their weight! What do you think the child will do? He or she would likely look at (visually analyze) the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life. The child has actually figured out that height and build are correlated with weight by a relationship that looks like the equation below.

Y=aX+b

where:

Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
These coefficients a and b are derived by minimizing the sum of squared differences between the data points and the regression line.

For example, suppose the best-fit line for a set of height and weight data has the equation y=0.2811x+13.9. Using this equation, we can estimate the weight of a person from their height.
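For a single independent variable, the slope and intercept that minimize the squared differences have a closed form. The sketch below applies it to a small invented data set; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = aX + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```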
Term
Logistic Regression
Definition
In simple words, it predicts the probability of occurrence of an event by fitting data to a logistic function. It is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).
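The logistic function is what keeps the output between 0 and 1. A minimal sketch with one input variable; the weight and bias values are arbitrary placeholders, not fitted parameters.

```python
import math


def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weight, bias):
    """Probability of the event for input x under a one-variable model."""
    return logistic(weight * x + bias)


print(logistic(0))  # 0.5
print(predict_probability(2.0, weight=1.5, bias=-1.0))
```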
Term
Machine learning
Definition
Machine learning refers to the techniques involved in dealing with vast data in the most intelligent fashion (by developing algorithms) to derive actionable insights. In these techniques, we expect the algorithms to learn by themselves without being explicitly programmed.
Term
Naive Bayes
Definition
It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.
Term
Natural Language Processing
Definition
In simple words, Natural Language Processing is a field which aims to make computer systems understand human speech. NLP comprises techniques to process, structure, and categorize raw text and to extract information.
A chatbot is a classic example of NLP, where sentences are first processed, cleaned, and converted to a machine-understandable format.
Term
Normal Distribution
Definition
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the bell curve because of its characteristic bell shape. A binomial distribution with many trials looks similar to a normal distribution. The difference between the two is that the binomial distribution is discrete, while the normal distribution is continuous.
Term
Ordinal Variable
Definition
Ordinal variables are those variables which have discrete values but have some order involved.
Term
Outlier
Definition
An outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.
Term
Precision and Recall
Definition
Precision measures how many of the positive predictions were correct.
It can be represented as:

Precision = TP / (TP + FP)

Recall measures how many of the total actual positive cases were predicted correctly.

It can be represented as:

Recall = TP / (TP + FN)
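Both formulas can be computed directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The counts in this sketch are made up for illustration.

```python
def precision(tp, fp):
    """TP / (TP + FP)"""
    return tp / (tp + fp)


def recall(tp, fn):
    """TP / (TP + FN)"""
    return tp / (tp + fn)


# Example counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```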
Term
Predictor Variable
Definition
A predictor variable is used to make a prediction for the dependent variable.
Term
P-Value
Definition
The p-value is the probability of getting a result at least as extreme as the observed value when the null hypothesis is true.
Term
Regression
Definition
It is a supervised learning method where the output variable is a real value, such as “amount” or “weight”.
Term
Reinforcement Learning
Definition
It is an example of machine learning where the machine is trained to take specific decisions based on the business requirement, with the sole aim of maximizing efficiency (performance). The idea involved in reinforcement learning is: the machine / software agent trains itself on a continual basis based on the environment it is exposed to, and applies its enriched knowledge to solve business problems. This continual learning process requires less involvement of human expertise, which in turn saves a lot of time.

Important Note: There is a subtle difference between Supervised Learning and Reinforcement Learning (RL). RL essentially involves learning by interacting with an environment. An RL agent learns from its past experience, or rather from its continual trial-and-error learning process, whereas in supervised learning an external supervisor provides examples.

A good example for understanding the difference is self-driving cars. Self-driving cars use reinforcement learning to make decisions continuously: which route to take and what speed to drive at are some of the questions that are decided after interacting with the environment. A simple example of supervised learning would be predicting the total fare of a cab at the end of a journey.
Term
Response Variable
Definition
Response variable (or dependent variable) is that variable whose variation depends on other variables.
Term
Ridge Regression
Definition
Ridge regression performs ‘L2 regularization’, i.e. it adds a penalty equal to the sum of squares of the coefficients to the optimization objective. Thus, ridge regression optimizes the following:

Objective = RSS + α * (sum of square of coefficients)

Here, α (alpha) is the parameter which balances the amount of emphasis given to minimizing RSS vs minimizing sum of squares of coefficients. α can take various values:

α = 0:
The objective becomes same as simple linear regression.
We’ll get the same coefficients as simple linear regression.
α = ∞:
The coefficients will be zero. Because the weight on the squared coefficients is infinite, any coefficient other than zero would make the objective infinite.
0 < α < ∞:
The magnitude of α will decide the weightage given to different parts of objective.
The coefficients will be somewhere between 0 and those of simple linear regression.
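The effect of α is easy to see in the simplest case: one predictor and no intercept (a simplifying assumption made only for this sketch). Minimizing RSS + α * slope² then gives slope = Σxy / (Σx² + α), so a larger α shrinks the slope toward zero. The function name and data are illustrative.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - slope*x)^2) + alpha * slope^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(ridge_slope(xs, ys, alpha=0.0))   # 2.0, same as simple linear regression
print(ridge_slope(xs, ys, alpha=14.0))  # 1.0, shrunk toward zero
print(ridge_slope(xs, ys, alpha=1e9))   # very close to zero
```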
Term
Statistics
Definition
It is the study of the collection, analysis, interpretation, presentation, and organisation of data.