
Logistic unit

In simple Logistic Regression, we have the cost function

L(a, y) = -y \ln(a) - (1 - y) \ln(1 - a)

where a is the predicted value and y is the ground-truth label on the training set, y ∈ {0, 1}.

Why this function? The intuitive version is that if y = 1 the second term goes away, and the first term that remains is small when a → 1 (i.e. the predicted value approaches the correct ground truth), yet large when a → 0 (i.e. we are punished with a large cost when the prediction moves away from the ground truth). Similarly for the other term.
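To put some numbers on this, here is a minimal numpy sketch (the values of a are just illustrative):

```python
import numpy as np

def loss(a, y):
    # Logistic regression cost for a single example
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

print(loss(0.99, 1))  # ~0.01: confident and correct is cheap
print(loss(0.01, 1))  # ~4.6: confident and wrong is punished hard
```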

Stats aside

However, the more mathematically rigorous answer is that this function comes from the MLE (Maximum Likelihood Estimator). Recall that if we have a Bernoulli model, then we can write

L(x_1, \dots, x_n; p) = p^{\sum_{i=1}^{n} x_i} (1 - p)^{n - \sum_{i=1}^{n} x_i}

Recall we want the argmax over p of this function, but since the logarithm is monotonically increasing this amounts to the same as finding the max of the log-likelihood

\ell = \log L(x_1, \dots, x_n; p) = \left( \sum_{i=1}^{n} x_i \right) \log p + \left( n - \sum_{i=1}^{n} x_i \right) \log(1 - p)

In a regular statistics class, we'd now find the maximum by taking derivatives and setting them equal to 0. This would give us the MLE estimator of our parameter p, which not too surprisingly turns out to be the average

\hat{p}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i
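We can sanity-check this numerically. The sketch below (the sample size and true p are arbitrary choices of mine) grid-searches the log-likelihood and compares the maximizer against the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)  # Bernoulli draws, true p = 0.3

def log_likelihood(p):
    s, n = x.sum(), len(x)
    return s * np.log(p) + (n - s) * np.log(1 - p)

ps = np.linspace(0.001, 0.999, 999)
print(ps[np.argmax(log_likelihood(ps))], x.mean())  # agree up to grid spacing
```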

For a single training example the log-likelihood would be

\ell_1 = x \log p + (1 - x) \log(1 - p)

and we note that finding the maximum of this function is equivalent to finding the minimum of its negative. With a small change of notation, we see the negative is exactly our usual logistic regression cost function.

Back to ML

Backprop

We will need to know how our cost function varies as our network weights vary. To that end, let's compute a few things

\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}

We also use the sigmoid activation function

a = \sigma(z) = \frac{1}{1 + e^{-z}}

meaning

\frac{da}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{1 + e^{-z} - 1}{1 + e^{-z}} = a(1 - a)
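A quick finite-difference check (a minimal sketch, with an arbitrary choice of z) confirms the result:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
a = sigmoid(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(numeric, a * (1 - a))  # both ~0.2217
```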

Now we can simply use the chain rule to get the dependence on z:

\frac{dL}{dz} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} = a(1 - a) \left[ -\frac{y}{a} + \frac{1 - y}{1 - a} \right] = -y(1 - a) + a(1 - y) = a - y

The final step

Now, z depends on the network's weights w_1, w_2, and b and on the training features x_1 and x_2 as follows

z = w_1 x_1 + w_2 x_2 + b

so that

\frac{dz}{dw_i} = x_i

and

\frac{dz}{db} = 1

Applying the chain rule just one more time, we get the final result

dw_i = \frac{dL}{dw_i} = x_i (a - y)

and

db = \frac{dL}{db} = (a - y)
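Putting the single-example pieces together, here is a minimal numpy sketch of the forward and backward pass (all parameter and feature values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.5, -0.3])   # illustrative weights
b = 0.1
x = np.array([1.0, 2.0])    # illustrative features
y = 1

z = np.dot(w, x) + b        # forward pass
a = sigmoid(z)

dz = a - y                  # backward pass
dw = x * dz                 # dw_i = x_i (a - y)
db = dz                     # db = (a - y)
print(dw, db)
```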

Gradient descent update rule

So to perform gradient descent we must update our parameters as follows:

\begin{aligned}
w_1 &= w_1 - \alpha \, dw_1 \\
w_2 &= w_2 - \alpha \, dw_2 \\
b &= b - \alpha \, db
\end{aligned}

where α is the learning rate and controls the size of the steps we take.

Vectorizing Logistic Regression

So for the i’th training example, the forward step would look like

\begin{aligned}
z^{(i)} &= w^T x^{(i)} + b \\
a^{(i)} &= \sigma(z^{(i)})
\end{aligned}

so for m training examples we'd traditionally have a for-loop iterating over each one. However, this is inefficient.
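For concreteness, the loop-based version we want to avoid would look roughly like this sketch (toy data of my own, with examples stored as columns of a matrix X, formalized next):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: n = 2 features, m = 3 examples, one example per column
X = np.array([[1.0, 2.0, 3.0],
              [0.5, 1.5, 2.5]])
w = np.array([0.5, -0.3])
b = 0.1
m = X.shape[1]

# The explicit loop we want to get rid of
Z = np.zeros(m)
for i in range(m):
    Z[i] = np.dot(w, X[:, i]) + b
A = sigmoid(Z)
```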

Consider the design matrix X:

X = \begin{pmatrix}
    \vert & \dots & \vert \\
    x^{(1)} & \dots & x^{(m)} \\
    \vert & \dots & \vert
\end{pmatrix}

This is an n×m matrix, where we have m training examples and n features.

Using this matrix, we can do the forward step over all examples at once

[z^{(1)}, z^{(2)}, \dots, z^{(m)}] = w^T X + [b, b, \dots, b]

so

Z = w^T X + b

or in code

```python
# Note: Python will broadcast the scalar b
Z = np.dot(w.T, X) + b
```

Gradient computation

dz^{(i)} = a^{(i)} - y^{(i)}

and we define the 1×m vector

dZ = [dz^{(1)}, \dots, dz^{(m)}]

so vectorizing our computation, and with similar definitions for the vectors A and Y, we’d have

dZ = A - Y

Averaging over the m training examples, we have

db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}

which translates to

```python
db = np.sum(dZ) / m
```

and

dw = \frac{1}{m} X \, dZ^T

expanding this

\begin{aligned}
dw &= \frac{1}{m}
\begin{pmatrix}
    \vert & \dots & \vert \\
    x^{(1)} & \dots & x^{(m)} \\
    \vert & \dots & \vert
\end{pmatrix}
\begin{pmatrix}
    dz^{(1)} \\
    \vdots \\
    dz^{(m)}
\end{pmatrix} \\
&= \frac{1}{m} \left[ x^{(1)} dz^{(1)} + \dots + x^{(m)} dz^{(m)} \right]
\end{aligned}

Note that here the resulting dw is an n×1 vector.
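To convince ourselves the matrix product really equals this sum, here is a minimal check with toy numbers of my own:

```python
import numpy as np

# Toy numbers: n = 2 features, m = 3 examples
X = np.array([[1.0, 2.0, 3.0],
              [0.5, 1.5, 2.5]])
dZ = np.array([[0.2, -0.1, 0.4]])   # shape (1, m)
m = X.shape[1]

dw_vec = np.dot(X, dZ.T) / m                            # the vectorized form
dw_sum = sum(X[:, i] * dZ[0, i] for i in range(m)) / m  # the expanded sum
print(dw_vec.ravel(), dw_sum)                           # identical values
```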

Altogether, in code:

```python
Z = np.dot(w.T, X) + b        # forward pass
A = sigmoid(Z)                # element-wise sigmoid: 1 / (1 + np.exp(-Z))
dZ = A - Y                    # backward pass
dw = (1/m) * np.dot(X, dZ.T)
db = (1/m) * np.sum(dZ)
```

Then the gradient descent update goes as

```python
w = w - alpha * dw
b = b - alpha * db
```

We still need an outer for-loop over all the iterations of gradient descent, but aside from that we have vectorized all that is possible.
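For completeness, here is a minimal end-to-end sketch of the vectorized training loop (the synthetic data and hyperparameters are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic data: n = 2 features, m = 200 examples, one per column
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

m = X.shape[1]
w = np.zeros((2, 1))
b = 0.0
alpha = 0.1

for _ in range(1000):            # the outer loop we cannot vectorize away
    Z = np.dot(w.T, X) + b       # forward pass
    A = sigmoid(Z)
    dZ = A - Y                   # backward pass
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    w = w - alpha * dw           # gradient descent update
    b = b - alpha * db
```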
