Some key takeaways from the 2nd time I went through the course.
Bias/Variance:
First compare the model's error with human-level performance; if the gap is large, diagnose bias and variance:
If high bias (underfitting the training data): use a bigger network (or train longer).
If high variance (overfitting): add regularization or get more data.
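A minimal sketch of this diagnosis, assuming we already have training/dev error rates (the function and numbers below are my own illustration, not from the course):

```python
# Rough bias/variance diagnosis from error rates (illustrative only).
def diagnose(train_err, dev_err, human_err=0.0):
    bias_gap = train_err - human_err      # avoidable bias
    variance_gap = dev_err - train_err    # generalization gap
    if bias_gap > variance_gap:
        return "High bias: try a bigger network / train longer"
    return "High variance: try regularization / more data"

print(diagnose(train_err=0.15, dev_err=0.16, human_err=0.01))  # -> high bias
print(diagnose(train_err=0.02, dev_err=0.12, human_err=0.01))  # -> high variance
```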
Why regularization reduces overfitting:
L1/L2 regularization adds a penalty term to the cost function, which penalizes large weights.
If lambda (the regularization coefficient) is large, the weights are pushed toward 0, which makes some neurons less influential, so the model moves from overfitting toward underfitting.
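A minimal sketch of the L2-regularized cost, assuming binary cross-entropy loss and a dict `parameters` holding weight matrices `W1, W2, ...` (these names are my own, not from the course code):

```python
import numpy as np

def compute_cost_with_l2(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 penalty (lambd / 2m) * sum ||W[l]||^2."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    l2_penalty = sum(np.sum(np.square(W))
                     for name, W in parameters.items() if name.startswith("W"))
    return cross_entropy + (lambd / (2 * m)) * l2_penalty
```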
Vanishing/exploding gradients:
This happens in very deep networks: the gradients become too large or too small, which prevents the model from learning effectively.
If exploding gradients -> the weights (and their updates) blow up (too large).
If vanishing gradients -> the weight updates shrink toward zero (too small), so early layers barely learn.
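A tiny numpy sketch (my own illustration) of why depth causes this: repeatedly multiplying by weights slightly larger or smaller than 1 scales the signal exponentially with the number of layers.

```python
import numpy as np

x = np.ones((4, 1))
W_big = 1.5 * np.eye(4)     # weights a bit "too large"
W_small = 0.5 * np.eye(4)   # weights a bit "too small"

a_big, a_small = x, x
for _ in range(50):          # 50 layers with linear activations
    a_big = W_big @ a_big
    a_small = W_small @ a_small

print(a_big[0, 0])    # ~1.5**50 -> explodes
print(a_small[0, 0])  # ~0.5**50 -> vanishes
```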
One solution (not the best) is to use careful weight initialization.
E.g.:
If activation func is ReLU -> Wl = np.random.randn(shape) * np.sqrt(2./layers_dims[l-1])  (He initialization)
If activation func is tanh -> Wl = np.random.randn(shape) * np.sqrt(1./layers_dims[l-1])  (Xavier initialization)
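A minimal sketch of initializing every layer this way, assuming a list `layers_dims` of layer sizes (He initialization for ReLU shown; swap the 2. for 1. for tanh):

```python
import numpy as np

def initialize_parameters_he(layers_dims):
    """He initialization: scale randn by sqrt(2 / fan_in) for ReLU layers."""
    parameters = {}
    for l in range(1, len(layers_dims)):
        parameters["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                    * np.sqrt(2. / layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

params = initialize_parameters_he([5, 4, 3, 1])
```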
What is gradient descent?
An iterative method for finding the values of a function's parameters (coefficients) that minimize the cost function as much as possible.
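A minimal one-parameter sketch (my own example): minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).

```python
w = 0.0              # initial guess
alpha = 0.1          # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # derivative of (w - 3)**2
    w -= alpha * grad    # gradient descent update
print(w)  # converges to ~3.0, the minimizer
```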
Batch vs. mini-batch vs. stochastic gradient descent:
Batch gradient descent: go through the whole training set, then do one gradient descent step.
Mini-batch: split the training set into smaller mini-batches; do one gradient descent step per mini-batch.
Stochastic gradient descent: mini-batch size = 1.
1 epoch definition: 1 pass through the training set. With batch gradient descent, 1 epoch gives one gradient descent step; with mini-batch gradient descent, 1 epoch gives many gradient descent steps (one per mini-batch).
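A minimal sketch of one epoch of mini-batch gradient descent; `forward_backward` (returns cost and gradients) and `update_parameters` are assumed helpers, not real course functions.

```python
import numpy as np

def one_epoch(X, Y, parameters, alpha, mini_batch_size=64, seed=0):
    """One pass over the training set; one gradient step per mini-batch."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)                  # shuffle examples
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]
    for k in range(0, m, mini_batch_size):
        X_batch = X_shuf[:, k:k + mini_batch_size]
        Y_batch = Y_shuf[:, k:k + mini_batch_size]
        cost, grads = forward_backward(X_batch, Y_batch, parameters)  # assumed helper
        parameters = update_parameters(parameters, grads, alpha)      # assumed helper
    return parameters
```

Setting mini_batch_size = m recovers batch gradient descent; setting it to 1 gives stochastic gradient descent.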
Hyper-parameter tuning:
Tuning process: randomly sample hyper-parameters, going from coarse to fine.
Use an appropriate scale to pick hyper-parameters, e.g. for the learning rate, sample on a log scale: random numbers in [0.0004, 0.004], then in [0.004, 0.04], etc., rather than uniformly over the whole range.
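A minimal sketch of sampling the learning rate on a log scale, so each decade such as [0.0004, 0.004] and [0.004, 0.04] gets roughly equal probability (the exact range is just the one from the example above):

```python
import numpy as np

# Sample alpha uniformly in log10-space between 0.0004 and 0.04.
r = np.random.uniform(np.log10(0.0004), np.log10(0.04))
alpha = 10 ** r
print(alpha)
```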
What is Batch norm:
Input features are normalized so that training is faster and it is easier to find the minimum of the loss. Likewise, batch norm normalizes each layer's pre-activations: Z_tilde[l] = a * Z_norm[l] + b, where a and b (gamma and beta in the course notation) are learnable parameters.
Batch norm also helps the network learn under a shifting input distribution (covariate shift), e.g. previously trained on black cats but now seeing colored cats.
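A minimal sketch of the batch-norm computation for one layer's pre-activations Z (gamma and beta correspond to the learnable a and b above; epsilon avoids division by zero):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z across the batch, then rescale/shift with learnable gamma, beta."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta   # Z_tilde, fed into the activation function
```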