We implement multiclass logistic regression from scratch in Python, using stochastic gradient descent, and try it out on the MNIST dataset.
If you would like more detail about how the formula for the gradient of L is derived, here is a worksheet I made that walks through the calculation: https://drive.google.com/file/d/15vngdpZZJH5OK8GVAUmxmY9qaSLbKA0V/view?usp=sharing
Correction: At 23:42, transpose(xiHat) is 1 x (d + 1), not 1 x K.
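
To make the shapes in the correction concrete, here is a minimal sketch of a single stochastic gradient descent step for multiclass logistic regression with the cross-entropy loss. It is not necessarily the code from the video. The names `softmax`, `sgd_step`, `beta`, `xi_hat` and `lr` are my own for illustration. It assumes beta is stored as a K x (d + 1) matrix and that xiHat is formed by prepending a 1 to the feature vector. K = 10 and d = 784 match MNIST (10 digits, 28 x 28 pixels), and a random vector stands in for a real image.

```python
import numpy as np

def softmax(u):
    # Subtracting the max avoids overflow and does not change the result.
    e = np.exp(u - np.max(u))
    return e / e.sum()

def sgd_step(beta, xi, yi, lr):
    """One SGD update on a single example (a sketch, not the video's code).

    beta: K x (d + 1) weight matrix (assumed layout)
    xi:   length-d feature vector
    yi:   length-K one-hot label
    lr:   learning rate
    """
    # Augmented feature vector xiHat as a column: (d + 1) x 1.
    # Prepending the 1 (rather than appending it) is an assumption.
    xi_hat = np.concatenate(([1.0], xi)).reshape(-1, 1)
    p = softmax(beta @ xi_hat)            # predicted probabilities, K x 1
    residual = p - yi.reshape(-1, 1)      # K x 1
    # transpose(xiHat) is 1 x (d + 1), so the gradient is
    # (K x 1) times (1 x (d + 1)) = K x (d + 1), the same shape as beta.
    grad = residual @ xi_hat.T
    return beta - lr * grad, grad

rng = np.random.default_rng(0)
K, d = 10, 784
beta = np.zeros((K, d + 1))
xi = rng.random(d)        # stand-in for one flattened image
yi = np.eye(K)[3]         # one-hot label for class 3
new_beta, grad = sgd_step(beta, xi, yi, lr=0.01)
print(grad.shape)
```

Because the gradient has the same shape as beta, the update beta - lr * grad is well defined. That would not be the case if transpose(xiHat) were 1 x K.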