## Softmax Regression

*This is part 2 of a 5-article series:*

1. *Training an Architectural Classifier: Motivations*
2. *Training an Architectural Classifier: Softmax Regression*
3. *Training an Architectural Classifier: Deep Neural Networks*
4. *Training an Architectural Classifier: Convolutional Networks*
5. *Training an Architectural Classifier: Transfer Learning*

*A personal side goal of this project is becoming more acquainted with deep learning frameworks, so although sklearn and the like may have a Logistic Regression module, I’ll be doing this more manually in TensorFlow. You’ll also see tf.slim and Keras.*

Working from the idea that it’s better to start with a **simpler, more explicable model** and only add complexity if necessary, I’ll start by trying simple logistic/softmax regression. In short, logistic regression makes a prediction by taking an input image, multiplying each of its features (pixels, in this case) by a positive or negative weight, summing the results, and adding a bit of bias.

This should sound familiar to anyone with some math experience: it’s the equation of a line, y = mx + b, except that in this case our line exists in VERY high-dimensional space (m becomes a matrix of weights, and x and b become vectors, instead of the scalars you used in school). This makes some intuitive sense when you consider that what we are attempting to do is draw a line, or hyperplane, through space that can separate images of one class from those of another.
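As a rough sketch of that high-dimensional version of y = mx + b, the snippet below computes one score per class for a single flattened image using NumPy. The image size (8x8 grayscale), the three classes, and the random weights are illustrative assumptions, not values from this project.

```python
import numpy as np

rng = np.random.default_rng(0)

num_pixels = 8 * 8   # assumed: a tiny 8x8 grayscale image, flattened
num_classes = 3      # assumed: three classes

x = rng.random(num_pixels)                                   # pixel values in [0, 1)
W = rng.normal(scale=0.01, size=(num_classes, num_pixels))   # one row of weights per class
b = np.zeros(num_classes)                                    # one bias per class

# The matrix analogue of y = mx + b: every pixel is weighted, summed, and biased
scores = W @ x + b
print(scores.shape)  # (3,)
```

Each entry of `scores` is the weighted sum of all pixels for one class, which is why `W` needs one row per class and one column per pixel.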

These weights represent how strongly the model has learned that a pixel contributes positively or negatively to the overall image being in a certain class. Thus the value of the pixel, multiplied by the learned weight, gives a kind of “vote” towards the final result. Using a softmax function, these votes are then converted to probabilities that an image belongs to a given class. Although I’ve used the terms logistic and softmax interchangeably, this is the primary difference between them: softmax regression accomplishes what logistic regression does, but across multiple classes.
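To make the "votes to probabilities" step concrete, here is a minimal NumPy sketch of a softmax function. The vote values are made up for illustration, and the helper names are my own rather than anything from the notebook.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

def sigmoid(z):
    """The logistic function used in two-class logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

votes = np.array([2.0, 1.0, -1.0])  # hypothetical scores for three classes
probs = softmax(votes)
print(probs.round(3), probs.sum())

# With only two classes, softmax reduces to the logistic function
two_class = softmax(np.array([1.5, 0.0]))
print(np.isclose(two_class[0], sigmoid(1.5)))
```

The largest vote ends up with the largest probability (roughly 0.71 here), and the two-class check shows why the logistic and softmax labels are so closely related.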

The weights are learned through an iterative process called *gradient descent* and *back-propagation*, whereby error is attributed to specific weights with each prediction, and each weight is nudged up or down before trying again. In this case, we use an error metric called cross-entropy: for each image, multiply the true (one-hot) category by the negative log of the predicted probability for each class and sum the results, then average over all images.
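The training loop can be sketched in plain NumPy as follows. This is not the TensorFlow code from the notebook: the synthetic clustered data, the toy sizes, the learning rate, and the step count are all assumptions chosen so the example runs quickly.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot):
    # Per example: sum over classes of true * -log(predicted); then average
    return -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))

rng = np.random.default_rng(0)
num_examples, num_features, num_classes = 90, 4, 3  # assumed toy sizes

# Synthetic data: each class is a cluster around its own random center
labels = np.repeat(np.arange(num_classes), num_examples // num_classes)
centers = rng.normal(scale=2.0, size=(num_classes, num_features))
X = centers[labels] + rng.normal(size=(num_examples, num_features))
Y = np.eye(num_classes)[labels]  # one-hot true categories

W = np.zeros((num_features, num_classes))
b = np.zeros(num_classes)
learning_rate = 0.01  # assumed

losses = []
for step in range(500):
    probs = softmax(X @ W + b)
    losses.append(cross_entropy(probs, Y))
    # Chain rule (back-propagation for this one-layer model):
    # the gradient of the averaged loss w.r.t. the scores is (probs - Y) / N
    grad_scores = (probs - Y) / num_examples
    W -= learning_rate * (X.T @ grad_scores)       # nudge each weight downhill
    b -= learning_rate * grad_scores.sum(axis=0)   # same for the biases

# The first loss is ln(3), about 1.099, because zero weights predict uniformly
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Frameworks like TensorFlow derive the gradient automatically, but for a model this small it can be written by hand, which makes it easier to see how error gets attributed back to each weight.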

It’s a simple model, so I’ll let the notebook do the rest of the talking here:

Bottom line, we can do better. Best-case accuracy, even after longer training periods, was about **57%**. Here’s the TensorBoard graph of accuracy over 5000 epochs:

The model’s accuracy, even on training data, is well below human accuracy, indicating that the model is likely not complex enough to extract meaningful information from the huge number of features it is being given. The big split between training and validation accuracy also indicates that the model is **overfitting** to the information that it *is* able to extract.

So what if we expand this single-layer classifier into a deep neural net? That will come in my next post.