Some key takeaways from my second time going through the course.

Bias/Variance:

First, compare the model's performance with human-level performance; if they differ a lot, then we diagnose bias and variance.

If high bias (underfitting): use a bigger network.

If high variance (overfitting): use regularization or more data.

Why regularization reduces overfitting:

L1/L2 regularization adds a penalty term to the cost function, which penalizes large weights.

If lambda (the regularization coefficient) is big, it pushes the weights toward 0, making some neurons less important, so the model shifts from overfitting toward underfitting.
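As a minimal sketch of what that penalty looks like (the function name and arguments here are my own for illustration, not from the course code), L2 regularization just adds lambda/(2m) times the sum of squared weights to the base cost:

```python
import numpy as np

def compute_cost_with_l2(base_cost, weights, lambd, m):
    """Add the L2 penalty (lambda / (2m)) * sum(||W[l]||^2) to the unregularized cost."""
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return base_cost + l2_penalty
```

The bigger lambd is, the more this term dominates the cost, so gradient descent is pushed to keep every W small.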

Vanishing/exploding gradients:

This happens in very deep networks: gradients become too big or too small, which prevents the model from learning.

If exploding gradients -> the weights blow up (become too large).

If vanishing gradients -> the weight updates shrink toward zero (become too small).

One solution (not the best) is to optimize the weight initialization.

E.g:

If the activation function is ReLU -> W[l] = np.random.randn(shape) * np.sqrt(2./layers_dims[l-1]) (He initialization)

If the activation function is tanh -> W[l] = np.random.randn(shape) * np.sqrt(1./layers_dims[l-1]) (Xavier initialization)
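A hedged sketch of how that initialization could be applied to every layer, assuming layers_dims is a list of layer sizes as in the course exercises (the function name is illustrative):

```python
import numpy as np

def initialize_parameters(layers_dims, activation="relu"):
    """He initialization (2/n_prev) for ReLU, Xavier (1/n_prev) for tanh,
    to keep the variance of Z roughly constant across layers."""
    params = {}
    for l in range(1, len(layers_dims)):
        scale = 2. if activation == "relu" else 1.
        params["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                * np.sqrt(scale / layers_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return params
```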

What is gradient descent?

Gradient descent finds the values of a function's parameters (coefficients) that minimize the cost function as far as possible.
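In code, one gradient descent step is just moving each parameter a small step against its gradient. A generic sketch (the dictionary layout with keys like "W1"/"dW1" and the learning-rate value are assumptions for illustration):

```python
def gradient_descent_step(params, grads, learning_rate=0.01):
    """theta = theta - learning_rate * dtheta for every parameter."""
    for key in params:
        params[key] = params[key] - learning_rate * grads["d" + key]
    return params
```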

Batch vs. mini-batch vs. stochastic gradient descent:

Batch gradient descent: go through the whole training set, then do one gradient descent step.

Mini-batch gradient descent: split the training set into smaller mini-batches and do one gradient descent step per mini-batch.

Stochastic gradient descent: mini-batch size = 1.

1 epoch definition: 1 pass through the training set. With batch gradient descent, 1 epoch contains one gradient descent step; with mini-batch gradient descent, 1 epoch contains many steps.
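A rough sketch of the mini-batch loop. Since the post doesn't include the network code, I use a simple linear model with squared loss as a stand-in; the function name, shapes (features x examples, as in the course), and default values are my own assumptions:

```python
import numpy as np

def minibatch_gradient_descent(X, Y, batch_size=64, epochs=10, lr=0.01):
    """Mini-batch GD on a toy linear model Y_hat = w @ X + b.
    One epoch = one pass over X; every mini-batch triggers one gradient step.
    batch_size = m gives batch GD, batch_size = 1 gives stochastic GD."""
    n, m = X.shape
    w, b = np.zeros((1, n)), 0.0
    for epoch in range(epochs):
        perm = np.random.permutation(m)          # shuffle every epoch
        X_shuf, Y_shuf = X[:, perm], Y[:, perm]
        for start in range(0, m, batch_size):
            Xb = X_shuf[:, start:start + batch_size]
            Yb = Y_shuf[:, start:start + batch_size]
            err = w @ Xb + b - Yb                # prediction error on this mini-batch
            dw = err @ Xb.T / Xb.shape[1]        # gradient of the squared loss w.r.t. w
            db = err.mean()                      # gradient w.r.t. b
            w -= lr * dw                         # one gradient descent step per mini-batch
            b -= lr * db
    return w, b
```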

Hyper-parameter tuning:

Tuning process: randomly sample hyper-parameters, then go from coarse to fine.

Use an appropriate scale to pick hyper-parameters. E.g. for the learning rate, sample random numbers in [0.0004, 0.004] and random numbers in [0.004, 0.04] with equal probability (a log scale), rather than uniformly over the whole range.
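Concretely, "appropriate scale" means sampling the exponent uniformly instead of the value itself. A small sketch using the ranges above (the sample size is arbitrary):

```python
import numpy as np

# Sample the exponent uniformly so that [0.0004, 0.004] and [0.004, 0.04]
# are equally likely, instead of sampling the learning rate itself uniformly.
r = np.random.uniform(np.log10(0.0004), np.log10(0.04), size=10)
learning_rates = 10 ** r
```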

What is Batch norm:

Input features need normalization so that training is faster and it is easier to find the minimal loss. Likewise, batch norm normalizes the pre-activations of each layer: Z~[l] = a * Z_norm[l] + b, where a and b are learnable parameters.
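A minimal sketch of that step for one layer's pre-activations Z, assuming examples are columns; gamma and beta play the roles of the learnable a and b above, and eps is just there to avoid division by zero:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z over the mini-batch, then rescale and shift with learnable gamma, beta."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta              # Z_tilde = gamma * Z_norm + beta
```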

Batch norm is also helpful when the input distribution shifts (covariate shift), e.g. training on black cats but later seeing colored cats.

--

George S

senior ML researcher, sharing knowledge and news in AI