In this video, we will talk about first text classification model on top of features that we have described.
And let’s continue with the sentiment classification. We can actually take the IMDB movie reviews dataset, that you can download, it is freely available. It contains 25,000 positive and 25,000 negative reviews. And how did that dataset appear? You can actually look at IMDB website and you can see that people write reviews there, and they actually also provide the number of stars from one star to ten star. They actually rate the movie and write the review. And if you take all those reviews from IMDB website, you can actually use that as a dataset for text classification because you have a text and you have a number of stars, and you can actually think of stars as sentiment. If we have at least seven stars, you can label it as positive sentiment. If it has at most four stars, that means that is a bad movie for a particular person and that is a negative sentiment. And that’s how you get the dataset for sentiment classification for free. It contains at most 30 reviews per movie just to make it less biased for any particular movie.
These dataset also provides a 50/50 train test split so that future researchers can use the same split and reproduce their results and enhance the model. For evaluation, you can use accuracy and that actually happens because we have the same number of positive and negative reviews. So our dataset is balanced in terms of the size of the classes so we can evaluate accuracy here.
Okay, so let’s start with first model. Let’s takes features, let’s take bag 1-grams with TF-IDF values. And in the result, we will have a matrix of features, 25,000 rows and 75,000 columns, and that is a pretty huge feature matrix. And what is more, it is extremely sparse. If you look at how many 0s are there, then you will see that 99.8% of all values in that matrix are 0s. So that actually applies some restrictions on the models that we can use on top of these features.
And the model that is usable for these features is logistic regression, which works like the following. It tries to predict the probability of a review being a positive one given the features that we gave that model for that particular review. And the features that we use, let me remind you, is the vector of TF-IDF values. And what you actually can do is you can find the weight for every feature of that bag of force representation. You can multiply each value, each TF-IDF value by that weight, sum all of that things and pass it through a sigmoid activation function and that’s how you get logistic regression model.
And it’s actually a linear classification model and what’s good about that is since it’s linear, it can handle sparse data. It’s really fast to train and what’s more, the weights that we get after the training can be interpreted.
And let’s look at that sigmoid graph at the bottom of the slide. If you have a linear combination that is close to 0, that means that sigmoid will output 0.5. So the probability of a review being positive is 0.5. So we really don’t know whether it’s positive or negative. But if that linear combination in the argument of our sigmoid function starts to become more and more positive, so it goes further away from zero. Then you see that the probability of a review being positive actually grows really fast. And that means that if we get the weight of our features that are positive, then those weights will likely correspond to the words that a positive. And if you take negative weights, they will correspond to the words that are negative like disgusting or awful.