
Softmax function, softmax regression.

The softmax function is also called the normalized exponential function. 

It is a generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that add up to 1.

In probability theory, the output of the softmax function can be used to represent a categorical distribution, that is, a probability distribution over K different possible outcomes.
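To make the definition concrete, here is a minimal sketch of the function in plain Python. The name `softmax` and the sample input are only illustrative, and subtracting the maximum before exponentiating is an extra numerical-stability step that is not part of the definition above (it cancels out in the ratio, so the result is unchanged).

```python
import math

def softmax(z):
    """Map a list of real numbers to positive values that sum to 1."""
    # Assumption: shifting by the maximum is added here only to keep
    # exp() from overflowing on large inputs; it does not change the result.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # illustrative input values
print(probs)                        # three values, each between 0 and 1
print(sum(probs))                   # 1.0 up to floating-point rounding
```

The largest input ends up with the largest share, and the ordering of the inputs is preserved in the outputs.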

Example: 

We know that every image in MNIST is of a handwritten digit between zero and nine. So there are only ten possible things that a given image can be. We want to be able to look at an image and give the probabilities for it being each digit. For example, our model might look at a picture of a nine and be 85% sure it's a nine, but give a 5% chance to it being an eight (because of the top loop) and a bit of probability to all the others because it isn't 100% sure.

This is a classic case where a softmax regression is a natural, simple model. If you want to assign probabilities to an object being one of several different things, softmax is the thing to do, because softmax gives us a list of values between 0 and 1 that add up to 1. Even to train more sophisticated models, the final step is usually a layer of softmax.
A softmax regression has two steps: first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.
To tally up the evidence that a given image is in a particular class, we do a weighted sum of the pixel intensities. The weight is negative if that pixel having a high intensity is evidence against the image being in that class, and positive if it is evidence in favor.


We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independent of the input. The result is that the evidence for a class i given an input x is:

$$\text{evidence}_i = \sum_j W_{i,j} x_j + b_i$$
where $W_i$ are the weights and $b_i$ is the bias for class i, and j is an index for summing over the pixels in our input image x. We then convert the evidence tallies into our predicted probabilities y using the "softmax" function:

y=softmax(evidence)
Here softmax is serving as an "activation" or "link" function, shaping the output of our linear function into the form we want -- in this case, a probability distribution over 10 cases. You can think of it as converting tallies of evidence into probabilities of our input being in each class. It's defined as:

softmax(evidence)=normalize(exp(evidence))
If you expand that equation out, you get:

$$\text{softmax}(\text{evidence})_i = \frac{\exp(\text{evidence}_i)}{\sum_j \exp(\text{evidence}_j)}$$
But it's often more helpful to think of softmax the first way: exponentiating its inputs and then normalizing them. The exponentiation means that one more unit of evidence increases the weight given to any hypothesis multiplicatively. And conversely, having one less unit of evidence means that a hypothesis gets a fraction of its earlier weight. No hypothesis ever has zero or negative weight. Softmax then normalizes these weights, so that they add up to one, forming a valid probability distribution. (To get more intuition about the softmax function, check out the section on it in Michael Nielsen's book, complete with an interactive visualization.)
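The "exponentiate, then normalize" view can be sketched in a few lines of Python. The helper name `normalize` mirrors the formula above, and the evidence values are made up for illustration. The sketch checks that each extra unit of evidence multiplies the unnormalized weight by e (about 2.718).

```python
import math

def normalize(weights):
    """Scale positive weights so that they add up to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def softmax(evidence):
    """softmax(evidence) = normalize(exp(evidence))."""
    return normalize([math.exp(e) for e in evidence])

evidence = [1.0, 2.0, 3.0]                  # hypothetical evidence tallies
weights = [math.exp(e) for e in evidence]   # always strictly positive

# Neighbouring entries differ by one unit of evidence, so each ratio is e.
ratios = [weights[i + 1] / weights[i] for i in range(len(weights) - 1)]
probs = softmax(evidence)

print(ratios)
print(probs, sum(probs))
```

Normalizing divides every weight by the same total, so the ratios between hypotheses are the same before and after normalization.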
You can picture our softmax regression as follows, although a real model has a lot more xs: for each output, we compute a weighted sum of the xs, add a bias, and then apply softmax.
If we write that out as equations, we get:
[y1, y2, y3] = softmax(W11*x1 + W12*x2 + W13*x3 + b1,  W21*x1 + W22*x2 + W23*x3 + b2,  W31*x1 + W32*x2 + W33*x3 + b3)
We can "vectorize" this procedure, turning it into a matrix multiplication and vector addition. This is helpful for computational efficiency. (It's also a useful way to think.)
[y1, y2, y3] = softmax([[W11, W12, W13], [W21, W22, W23], [W31, W32, W33]]*[x1, x2, x3] + [b1, b2, b3])
More compactly, we can just write:

y=softmax(Wx+b)
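As a rough sketch of y = softmax(Wx + b) for the three-input, three-class case written out above, the following uses plain Python lists in place of a real array library. The function name `softmax_regression` and all the numbers in `W`, `b` and `x` are invented for illustration, not trained values.

```python
import math

def softmax(z):
    """Exponentiate and normalize a list of real numbers."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_regression(W, x, b):
    """Compute y = softmax(Wx + b) for a weight matrix W given as rows."""
    # evidence_i = sum_j W[i][j] * x[j] + b[i]
    evidence = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]
    return softmax(evidence)

# Hypothetical weights, biases and input, chosen only for the example.
W = [[0.5, -0.2, 0.1],
     [-0.3, 0.8, 0.0],
     [0.2, 0.1, -0.5]]
b = [0.1, 0.0, -0.1]
x = [1.0, 0.5, 0.25]

y = softmax_regression(W, x, b)
print(y)                    # one probability per class
print(y.index(max(y)))      # index of the most likely class
```

With an array library the list comprehension collapses into a single matrix-vector product, which is where the efficiency gain of the vectorized form comes from.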

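The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs

$$z^L_j = \sum_k w^L_{jk} a^{L-1}_k + b^L_j,$$

where $w^L_{jk}$ are the weights and $b^L_j$ the biases of the output layer, and $a^{L-1}_k$ are the activations of the previous layer. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the $z^L_j$. According to this function, the activation $a^L_j$ of the jth output neuron is

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{78}$$

where in the denominator we sum over all the output neurons.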


In fact, if you increase or decrease one weighted input $z^L_j$, you'll see that in both cases the total change in the other activations exactly compensates for the change in $a^L_j$. The reason is that the output activations are guaranteed to always sum up to 1, as we can prove using Equation (78) and a little algebra:

$$\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1.$$
As a result, if $a^L_j$ increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains 1. And, of course, similar statements hold for all the other activations.
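This compensation can be checked numerically. The sketch below uses four made-up weighted inputs, raises one of them, and compares the change in that neuron's activation with the total change in the others. The variable names are illustrative.

```python
import math

def softmax(z):
    """Softmax activations for a list of weighted inputs."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.5, -1.0, 3.2, 0.5]     # hypothetical weighted inputs
target = 3                    # index of the input we increase

before = softmax(z)
z_up = list(z)
z_up[target] += 1.0           # increase one weighted input
after = softmax(z_up)

change_in_target = after[target] - before[target]
change_in_others = sum(
    after[i] - before[i] for i in range(len(z)) if i != target
)

print(change_in_target)       # positive: the target activation goes up
print(change_in_others)       # negative, with the same magnitude
```

Every other activation shrinks because only the denominator of its fraction grew, and the two totals cancel because both sets of activations sum to 1.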

Equation (78) also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to 1. In other words, the output from the softmax layer can be thought of as a probability distribution.

The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$. So, for instance, in the MNIST classification problem, we can interpret $a^L_j$ as the network's estimated probability that the correct digit classification is $j$.
