Shared Flashcard Set

Details

Data Science Interview
Prep for Data Scientist Interview
12
Computer Science
Professional
03/29/2022

Cards

Term
supervised learning
Definition
Uses known and labeled data as input.
Supervised learning has a feedback mechanism.
The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines.
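A minimal sketch of the idea, using a decision tree from scikit-learn (one of the algorithms the card names). The toy feature matrix and labels are illustrative only.

```python
# Supervised learning sketch: a decision tree fit on known, labeled data.
from sklearn.tree import DecisionTreeClassifier

# Labeled input: each row of X is a feature vector, y holds the known labels.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]  # in this toy data, the label follows the first feature

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)  # the "feedback mechanism": labels guide the fit
print(clf.predict([[1, 1], [0, 0]]))
```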
Term
Unsupervised Learning
Definition
Uses unlabeled data as input.
Unsupervised learning has no feedback mechanism.
The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm.
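A minimal sketch using k-means from scikit-learn (one of the algorithms the card names). The two obvious point clusters are illustrative only.

```python
# Unsupervised learning sketch: k-means groups unlabeled points.
from sklearn.cluster import KMeans

X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]  # no labels provided
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Nearby points end up with the same cluster id; no labels or feedback used.
print(km.labels_)
```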
Term
How can you avoid overfitting your model?
Definition
Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
Use cross-validation techniques, such as k-fold cross-validation
Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting
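Two of the guards above can be sketched together with scikit-learn: k-fold cross-validation to estimate out-of-sample fit, and LASSO (L1) regularization to shrink noisy coefficients. The synthetic data below, with only two informative features, is illustrative.

```python
# Overfitting guards: k-fold cross-validation + LASSO regularization.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 10 features, only 2 actually informative
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)  # L1 penalty shrinks irrelevant coefficients to zero
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)  # out-of-fold R^2 per split
print(scores.mean())

model.fit(X, y)
print((model.coef_ != 0).sum())  # LASSO keeps only the useful features
```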
Term
What are the feature selection methods used to select the right variables?
Definition
Filter methods and wrapper methods
Term
Filter method for variable selection
Definition
Linear discriminant analysis
ANOVA
Chi-Square
The best analogy for selecting features is "bad data in, bad answer out." When we're limiting or selecting the features, it's all about cleaning up the data coming in
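A filter method scores each feature against the target independently of any model. A minimal sketch with scikit-learn, using the chi-square test the card lists; the iris data set is illustrative (chi-square requires non-negative features, which it satisfies).

```python
# Filter-method sketch: rank features by a chi-square test against the label.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(chi2, k=2).fit(X, y)  # keep the 2 highest-scoring features
print(selector.get_support())                # boolean mask of kept features
```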
Term
Wrapper method for variable selection
Definition
Forward Selection: We test one feature at a time and keep adding them until we get a good fit
Backward Selection: We test all the features and start removing them to see what works better
Recursive Feature Elimination: Recursively looks through all the different features and how they pair together
Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
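A minimal sketch of one wrapper method from the list, Recursive Feature Elimination, with scikit-learn: it repeatedly refits a model and drops the weakest feature, which is why wrapper methods cost so much compute. The synthetic data is illustrative.

```python
# Wrapper-method sketch: RFE refits the model and eliminates features recursively.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print(rfe.support_)  # True for the 3 features the wrapper kept
```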
Term
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
Definition
If the data set is large, we can simply remove the rows with missing values.
For smaller data sets, we can substitute missing values with the mean of the rest of the column using a pandas DataFrame in Python, e.g. by computing the column means with df.mean() and filling with df.fillna(df.mean()).
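Both strategies from the card in a minimal pandas sketch; the tiny DataFrame is illustrative.

```python
# Handling missing values: drop rows (large data) or impute the mean (small data).
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Large data set: remove rows containing any missing value.
dropped = df.dropna()

# Small data set: replace each missing value with its column's mean.
imputed = df.fillna(df.mean())
print(dropped.shape, imputed["a"].tolist())
```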
Term
What are dimensionality reduction and its benefits?
Definition
Dimensionality reduction refers to the process of converting a data set with many dimensions (fields) into one with fewer dimensions that conveys similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there's no point in storing a value in two different units (meters and inches).
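The card names no specific technique; PCA is one common choice, sketched here with scikit-learn on the illustrative iris data: four dimensions are projected to two while most of the variance is retained.

```python
# Dimensionality-reduction sketch with PCA: 4 features compressed to 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)       # 150 rows x 4 features
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)                   # 150 rows x 2 components
print(X2.shape, pca.explained_variance_ratio_.sum())
```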
Term
Difference between Point Estimates and Confidence Interval
Definition
Confidence Interval: A confidence interval gives a range of values likely to contain the population parameter, and tells us how likely that interval is to contain it. The confidence coefficient (or confidence level) is denoted by 1 - alpha and gives that probability; alpha is the level of significance.

Point Estimates: A point estimate is a single value used to estimate the population parameter. Popular methods for deriving point estimators of population parameters include the maximum likelihood estimator and the method of moments.

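A stdlib-only sketch of the distinction: the sample mean as a point estimate, and a 95% confidence interval around it. The sample values and the 1.96 z-value (for alpha = 0.05, treating the standard error as known) are illustrative.

```python
# Point estimate vs. confidence interval for a population mean.
import math
import statistics

sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]

point_estimate = statistics.mean(sample)            # single-value estimate of mu
se = statistics.stdev(sample) / math.sqrt(len(sample))
ci = (point_estimate - 1.96 * se, point_estimate + 1.96 * se)  # 95% interval
print(point_estimate, ci)
```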
Term
Standardization
Definition
The technique of rescaling data so that it has a mean of 0 and a standard deviation of 1.
If the data is normally distributed, standardization makes it follow the standard normal distribution.
Standardization formula -
X’ = (X - μ) / σ

Here,

μ - feature’s mean,

σ - feature’s standard deviation
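The formula in a minimal NumPy sketch; the sample values are illustrative.

```python
# Standardization: X' = (X - mean) / std, giving mean 0 and std 1.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())
```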
Term
Normalization
Definition
The technique of converting all data values to lie between 0 and 1 is known as Normalization. This is also known as min-max scaling.
Normalization rescales the data into the 0 to 1 range.
Normalization formula -
X’ = (X - Xmin) / (Xmax - Xmin)

Here,

Xmin - feature’s minimum value,

Xmax - feature’s maximum value
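The min-max formula in a minimal NumPy sketch; the sample values are illustrative.

```python
# Normalization (min-max scaling): X' = (X - Xmin) / (Xmax - Xmin), into [0, 1].
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm.tolist())
```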
Term
Why is R used in Data Visualization?
Definition
R is widely used in data visualization for the following reasons:

We can create almost any type of graph using R.
R has multiple libraries, such as lattice, ggplot2, and leaflet, as well as many built-in plotting functions.
It is easier to customize graphics in R than in Python.
R is also used in feature engineering and exploratory data analysis.