Yelp Review Rating Prediction

Jane Yan
3 min read · Feb 19, 2021

We will go through a method to predict the customer rating for businesses like restaurants or bars based on the customer review text.

Data Pre-processing and Exploratory Data Analysis (EDA)

We prepare the training set with the following steps:

  • Load the stopwords from the file and store them in a set, which will be used to filter out uninformative words
  • Extract the useful fields from the training set
  • Tokenize the text field and remove the stop words
  • Calculate some statistics to verify your implementation

Figure: Label (stars) distribution
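
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the exact implementation: the field names `stars` and `text`, the one-word-per-line stopword file format, and the regex tokenizer are all assumptions, and the three sample reviews are made up.

```python
import re
from collections import Counter

def load_stopwords(path):
    """Read one stopword per line (assumed file format) into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def tokenize(text, stopwords):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stopwords]

def preprocess(records, stopwords):
    """Keep only the useful fields: the star label and the filtered tokens."""
    return [(r["stars"], tokenize(r["text"], stopwords)) for r in records]

# Tiny in-memory sample standing in for the real training file (hypothetical data).
stopwords = {"the", "was", "and", "a"}
records = [
    {"stars": 5, "text": "The food was great and the staff was friendly"},
    {"stars": 1, "text": "Terrible service, a cold burger"},
    {"stars": 5, "text": "Great great pizza"},
]

data = preprocess(records, stopwords)

# Simple statistics to sanity-check the implementation.
label_counts = Counter(stars for stars, _ in data)
token_counts = Counter(t for _, tokens in data for t in tokens)
print(label_counts)
print(token_counts.most_common(3))
```

Storing the stopwords in a set keeps each membership test constant-time, which matters when filtering every token of every review.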

Basic Feature Engineering

How can we get the Bag-of-words (BoW) feature?

As a baseline approach, let us use a bag-of-words to represent each document, i.e., a bag of tokens with the corresponding counts in each review, ignoring the order of the tokens. For this approach, we first need to define a dictionary. In this baseline model, you can take up to the 500 most frequent tokens as the dictionary and map all the training instances into sparse feature vectors.

So how do we encode a feature using the dictionary and the token counts? For example, if your dictionary is [a, b, c, d, e, f, g], and a training instance contains tokens {a:2, b:2, d:1, g:1}, your feature vector will be [2, 2, 0, 1, 0, 0, 1].

Note that the order of the tokens in the dictionary does not matter, as long as the same order is used for all the instances.
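
As a rough sketch of the encoding described above: the helper names `build_dictionary`, `to_sparse_vector` and `to_dense_vector` are hypothetical, and ignoring out-of-dictionary tokens is an assumption. The example at the bottom reproduces the [a, b, c, d, e, f, g] case.

```python
from collections import Counter

def build_dictionary(token_lists, max_size=500):
    """Map each of the most frequent tokens to a fixed column index."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    # The index assigned here fixes one ordering that every instance reuses.
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(max_size))}

def to_sparse_vector(tokens, dictionary):
    """Return {column index: count}; tokens outside the dictionary are ignored."""
    vec = Counter()
    for t in tokens:
        idx = dictionary.get(t)
        if idx is not None:
            vec[idx] += 1
    return dict(vec)

def to_dense_vector(tokens, dictionary):
    """Expand the sparse representation into a full-length list of counts."""
    dense = [0] * len(dictionary)
    for idx, count in to_sparse_vector(tokens, dictionary).items():
        dense[idx] = count
    return dense

# The worked example from the text: dictionary [a..g], tokens {a:2, b:2, d:1, g:1}.
dictionary = {tok: i for i, tok in enumerate("abcdefg")}
tokens = ["a", "a", "b", "b", "d", "g"]
print(to_dense_vector(tokens, dictionary))  # [2, 2, 0, 1, 0, 0, 1]
```

The sparse form stores only the non-zero counts, which is usually much smaller than 500 entries for a single review.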

By now we have implemented the feature extraction function to build a feature vector from review tokens.

Model Design and Implementation

How do we predict the review rating?

Our goal here is to predict the ratings (from 1 star to 5 stars) over items (restaurants, shops, pharmacies, etc.) based on the review text from each user on that item.

We can consider this as a multi-class classification problem where the model takes the feature vector of each review (which we constructed in the previous part) as the input and predicts the rating (from 1 to 5) as the class label for the item being reviewed.

Specifically, we implement, train, and evaluate a regularized multi-class logistic regression (RMLR) method on the training and validation sets, and finally report the results on the test set.
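
For reference, one common way to write down such a model is shown below. The text does not say which regularizer is used, so the L2 penalty with strength \lambda is an assumption, as are the symbols: \mathbf{w}_c is the weight vector of class c, \mathbf{x}_i the feature vector of review i, y_i its star label, and N the number of training reviews.

```latex
\begin{aligned}
P(y = c \mid \mathbf{x}) &= \frac{\exp(\mathbf{w}_c^\top \mathbf{x})}{\sum_{k=1}^{5} \exp(\mathbf{w}_k^\top \mathbf{x})} \\
L(W) &= -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid \mathbf{x}_i) + \frac{\lambda}{2} \sum_{c=1}^{5} \lVert \mathbf{w}_c \rVert^2 \\
\nabla_{\mathbf{w}_c} L &= \frac{1}{N} \sum_{i=1}^{N} \bigl( P(y = c \mid \mathbf{x}_i) - \mathbb{1}[y_i = c] \bigr)\, \mathbf{x}_i + \lambda \mathbf{w}_c
\end{aligned}
```

The first line is the softmax probability of each star class, the second is the regularized negative log-likelihood to minimize, and the third is its gradient with respect to one class's weights.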

Generally speaking, Stochastic Gradient Descent (SGD), which updates the weights from one instance at a time, needs many more iterations to converge than full-batch Gradient Descent (GD), although each of its iterations is much cheaper. A middle ground is to use batched SGD, which means dividing the training set into equally sized subsets (e.g., 100 instances per subset) and computing the gradient on one subset per iteration. Clearly, there is a trade-off between the number of iterations and the cost of computing the gradient per iteration.
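
A minimal sketch of batched SGD for a softmax classifier might look like this. It assumes NumPy, an L2 penalty, labels encoded as 0 to 4 (stars minus one), no bias term, and hand-picked hyperparameters; the function names and the toy data are hypothetical and only meant to show the update loop.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    scores = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)

def train_rmlr(X, y, num_classes=5, lr=0.5, lam=1e-3,
               batch_size=100, epochs=50, seed=0):
    """Train softmax regression with an L2 penalty using mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, num_classes))
    Y = np.eye(num_classes)[y]  # one-hot labels, y must be in 0..num_classes-1
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the instances every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            probs = softmax(X[idx] @ W)
            # Gradient of the mean cross-entropy on this batch plus the L2 term.
            grad = X[idx].T @ (probs - Y[idx]) / len(idx) + lam * W
            W -= lr * grad
    return W

def predict(X, W):
    """Return the highest-scoring class index (add 1 to get stars)."""
    return np.argmax(X @ W, axis=1)

# Toy data: each class is marked by exactly one feature (hypothetical).
X = np.tile(np.eye(5), (40, 1))   # 200 instances, 5 features
y = np.tile(np.arange(5), 40)     # labels 0..4
W = train_rmlr(X, y)
print((predict(X, W) == y).mean())
```

Setting `batch_size=1` recovers plain SGD and `batch_size=len(X)` recovers full-batch GD, so the same loop can be used to explore the trade-off described above.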

Model Evaluation

The hard metric is the overall accuracy (under zero/one loss) of your multi-class classifier, defined as the number of instances whose star rating you predict correctly divided by the size of the dataset. In addition, you may also want to compute the per-class accuracy to understand how your model performs on each class.

The soft metric is the Root Mean Square Error (RMSE) of your prediction results, i.e., the square root of the mean squared difference between the predicted and the true ratings.
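
The metrics above could be computed along these lines. This is a small sketch with made-up labels and predictions; the function names are hypothetical, and `rmse` accepts real-valued predictions as well as integer star ratings.

```python
import math
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted star equals the true star."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately over the instances of each true class."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in sorted(total)}

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between the ratings."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Made-up example: five reviews with true and predicted stars.
y_true = [5, 4, 1, 3, 5]
y_pred = [5, 5, 1, 1, 4]
print(accuracy(y_true, y_pred))            # 0.4
print(per_class_accuracy(y_true, y_pred))  # {1: 1.0, 3: 0.0, 4: 0.0, 5: 0.5}
print(rmse(y_true, y_pred))
```

Unlike accuracy, RMSE penalizes a prediction of 1 star for a 3-star review more than a prediction of 4 stars for a 5-star review, which is why the two metrics can rank models differently.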

Summary

In this story, we have built a machine learning model from scratch for solving a real-world problem.

We should also have a good sense of how a machine learning library can help speed up our modeling work! In most of the other assignments, we will primarily use a machine learning library instead of reinventing the wheel, and the challenge will then become how to use the library well :)
