Have you ever read a review and questioned its legitimacy?
Google and Amazon alone have removed over 450 million fake reviews in the last 2 years.
Surprisingly enough, hundreds of thousands of reviews are written by bots.
The issue is severe enough to have attracted the attention of mainstream media and governments. For example, the BBC and the New York Times have reported that “fake reviews are becoming a common problem on the Web.”
How are Fake Reviews Generated?
Businesses gain fake reviews either by incentivizing their users or paying fake review providers to review their products artificially.
In fact, an illegal industry of fake review providers has emerged. Calling themselves digital marketers, these providers sell fake reviews and support that help vendors and service providers improve their ratings and rankings in the stores.
They either write fake reviews manually or use bots that exploit Text Generative Models (TGMs).
TGMs are state-of-the-art Natural Language Processing models that generate human-like text, and they are now used extensively to generate deceptive reviews.
Detecting Fake Reviews:
Because fake reviews are a violation and a threat to consumers’ trust in purchasing online services and products, identifying them has been an active research topic, and significant work has been done to tackle the issue.
Two main categories of features have been exploited in fake review detection research: textual and behavioral features. Several datasets (e.g., movie reviews and the Yelp dataset) and machine learning algorithms (Naive Bayes, SVM, decision trees, RNNs, LSTMs) have been applied to build reliable models for this purpose.
This article proposes the use of Natural Language Processing (NLP) to create a model that can detect fake restaurant reviews generated by Generative Pre-trained Transformers.
Our approach follows the CRISP-DM methodology, chosen for its practicality, flexibility, and usefulness when applying analytics to thorny business problems, to create an efficient pipeline for production.
We created a dataset of 200,000 labeled observations of Restaurant reviews.
100,000 Restaurant Real (Human) Reviews - Out of 226,652 Real Reviews provided by Objection.co (review source: scraped from the Web)
100,000 Bot Reviews – generated by our Bot Review Generator Ver. 1.0
Labeled bot reviews as 0 and human reviews as 1.
Stored into an SQLite3 database
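As a minimal sketch of the storage step, the labeled reviews could be kept in a single SQLite table. The table name, column names, and helper functions below are hypothetical and only illustrate the labeling convention (0 for bot, 1 for human); the actual schema is not described in this article.

```python
import sqlite3

def create_review_store(path=":memory:"):
    """Open a database and create a (hypothetical) reviews table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "id INTEGER PRIMARY KEY, "
        "text TEXT NOT NULL, "
        "label INTEGER NOT NULL CHECK (label IN (0, 1)))"
    )
    return conn

def add_reviews(conn, texts, label):
    """Insert reviews with one label: 0 = bot, 1 = human."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO reviews (text, label) VALUES (?, ?)",
            [(text, label) for text in texts],
        )

def label_counts(conn):
    """Return {label: number_of_reviews} to check class balance."""
    rows = conn.execute("SELECT label, COUNT(*) FROM reviews GROUP BY label")
    return dict(rows.fetchall())

conn = create_review_store()
add_reviews(conn, ["The pasta was great.", "We waited an hour."], label=1)
add_reviews(conn, ["A cozy place with cheap food."], label=0)
counts = label_counts(conn)
```

Checking `label_counts` after loading is a cheap way to confirm that the two classes stay balanced.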
Bot Review Generator:
We used our bot to generate the fake reviews for our dataset. The core of the bot is the text completion capability of Generative Pre-trained Transformers.
Create Input Prompt for Generative Pre-trained Transformers
Restaurant Names - list of European Restaurants (source: Kaggle)
Starting words - 23 words selected randomly (e.g., A, The, We, I, etc)
Topic words - 12 words selected randomly (e.g. great, expensive, cheap, etc.)
Fix Maximum Token Length (we took 100)
Enter the input.
The bot generates a review after processing.
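The steps above can be sketched roughly as follows. The prompt template, the sample word lists, and the `complete` callable are all assumptions made for illustration; the article does not specify the exact prompt format or the API used to call the model.

```python
import random

# Illustrative stand-ins; the real lists (restaurant names from Kaggle,
# 23 starting words, 12 topic words) are not reproduced here.
RESTAURANTS = ["Trattoria Roma", "Le Petit Bistro"]
STARTING_WORDS = ["A", "The", "We", "I"]
TOPIC_WORDS = ["great", "expensive", "cheap"]
MAX_TOKENS = 100  # the fixed maximum token length mentioned above

def build_prompt(rng):
    """Assemble one text-completion prompt (the template is an assumption)."""
    restaurant = rng.choice(RESTAURANTS)
    topic = rng.choice(TOPIC_WORDS)
    start = rng.choice(STARTING_WORDS)
    return f"Restaurant: {restaurant}\nTopic: {topic}\nReview: {start}"

def generate_review(prompt, complete, max_tokens=MAX_TOKENS):
    """`complete` is any callable that wraps a text-generation model."""
    return complete(prompt, max_tokens)

rng = random.Random(0)
prompt = build_prompt(rng)
# Dummy completion function so the sketch runs without a model.
review = generate_review(prompt, lambda p, n: p + " [generated text]")
```

Passing the model in as a callable keeps the prompt logic independent of whichever text-generation service is used.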
We generated 100,000 fake reviews with our Bot and stored them in the SQLite3 database to be used after appending them to our dataset.
Initial Data Analysis:
We analyzed the quality of the fake reviews generated by the TGM of our bot. This analysis checks how consistent the generated reviews are with the real restaurant reviews. The following visuals show the relationship between real and fake reviews.
The conclusion from Initial Data Analysis:
Analyzing the initial dataset, we noticed some flaws. The generated reviews did not match the real restaurant reviews, and many of them were repeated, which would reduce the performance of the model.
We made the following improvements in the dataset creation process:
Used randomly selected restaurant names from the real reviews’ dataset
Used the first 3 words from a randomly selected real review
Max token length is based on the length of the randomly selected real review.
Used a topic modeling algorithm on the real review dataset to generate topic words.
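A minimal sketch of the revised prompt construction is shown below. It assumes the same hypothetical prompt template as a simple "Restaurant / Topic / Review" layout and approximates token length by whitespace word count; neither detail is confirmed by the article, and the topic modeling step is not shown.

```python
import random

def prompt_from_real_review(real_review, restaurant, topic_word):
    """Seed the prompt with the first 3 words of a real review and
    derive the token budget from that review's length."""
    words = real_review.split()
    seed = " ".join(words[:3])
    # Assumption: whitespace word count approximates the token length.
    max_tokens = len(words)
    prompt = f"Restaurant: {restaurant}\nTopic: {topic_word}\nReview: {seed}"
    return prompt, max_tokens

# Toy (restaurant name, real review) pairs for illustration only.
real_reviews = [
    ("Casa Verde", "We loved the grilled fish and the friendly staff."),
    ("Blue Door", "The service was slow but the desserts made up for it."),
]
rng = random.Random(1)
restaurant, text = rng.choice(real_reviews)
prompt, max_tokens = prompt_from_real_review(text, restaurant, "service")
```

Tying both the opening words and the length to a real review is what pushes the generated text toward the distribution of the human-written reviews.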
With these improvements, the duplicate and unrealistic reviews were removed. We did not apply stemming, stop-word removal, punctuation removal, or lowercasing, because these properties were determined to be features of the reviews. We now have 200,134 labeled observations of restaurant reviews:
100,035 Restaurant Real (Human) Reviews - Out of 226,652 Real Reviews provided by Objection.co (review source: scraped from the Web)
100,099 Bot Reviews – generated by our Bot Review Generator Ver. 2.0
Labeled bot reviews as 0 and human reviews as 1.
Stored into an SQLite3 database
Baseline Training / Testing Pipeline:
Now it’s time to fit our baseline models on the data we have prepared. We split the data into train (70%) and test (30%) sets, a common practice in training machine learning models, and used five common classifiers for training, testing, and comparison: Passive-Aggressive Classifier, Support Vector Machine, XGBoost, Naive Bayes, and Random Forest.
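The following sketch shows what a 70/30 split and one baseline classifier could look like with scikit-learn. The TF-IDF features, the choice of `LinearSVC` as the SVM, the random seed, and the toy data are assumptions; the article does not state which feature extraction or hyperparameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data only; label 0 = bot, 1 = human.
texts = [
    "The pasta was great and the staff were kind.",
    "We waited an hour but the pizza was worth it.",
    "I would come back for the tiramisu alone.",
    "Our table was ready early and the wine was lovely.",
    "My kids loved the burgers and the milkshakes.",
    "A great great place great food great place.",
    "The restaurant is a restaurant with cheap food food.",
    "We we went to the the place and it was cheap.",
    "I the food is expensive expensive and great.",
    "A place that is a place that is a place.",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 70% train / 30% test, stratified so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)

# lowercase=False keeps casing, which is treated as a feature of the reviews.
model = make_pipeline(TfidfVectorizer(lowercase=False), LinearSVC())
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

Swapping `LinearSVC` for another estimator in the pipeline is enough to compare the other baseline classifiers on the same split.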
Baseline Models Analysis:
Here is the comparison between the Passive-Aggressive Classifier and SVM. (A detailed comparison with the other machine learning and deep learning models is in the models summary.)
Deep Learning (Fine-tuned BERT Model):
We then fed the dataset into a classifier built on the deep learning BERT model (from Hugging Face). Transfer learning saved us a great deal of hardware resources and time compared with training from scratch on such a big dataset.
We fine-tuned the BERT model with our dataset to create variant models:
myBERT150 - using the maximum token length of 150
myBERT300 - using the maximum token length of 300
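The difference between the two variants is the maximum token length applied when reviews are tokenized. Below is a hedged sketch using the Hugging Face `transformers` library; the `bert-base-uncased` checkpoint and the helper names are assumptions, since the article does not say which BERT checkpoint was fine-tuned.

```python
# Maximum token lengths of the two variants described above.
VARIANTS = {"myBERT150": 150, "myBERT300": 300}

def load_tokenizer_and_model(checkpoint="bert-base-uncased"):
    """Load a pre-trained BERT with a 2-class head (checkpoint is assumed)."""
    # Imported lazily so the helpers below can be used without the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2  # 0 = bot, 1 = human
    )
    return tokenizer, model

def encode_reviews(tokenizer, texts, variant):
    """Truncate or pad every review to the variant's maximum token length."""
    return tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=VARIANTS[variant],
    )
```

With the library installed, `tokenizer, model = load_tokenizer_and_model()` followed by `encode_reviews(tokenizer, reviews, "myBERT300")` would produce fixed-length inputs for fine-tuning. A longer maximum length preserves more of each review at the cost of more memory and compute per batch.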
The performance comparison of these deep learning models is shown here:
To evaluate the overall performance, we took into consideration the accuracy of each classifier along with their precision, recall, and F1 score as given in the table below:
Among the machine learning models, the Passive-Aggressive Classifier scored highest overall on all the performance metrics, whereas our BERT variant myBERT300 is the top-performing model across both the machine learning and deep learning classifiers (BERT variants).
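For reference, the four metrics used in this comparison can be computed from true and predicted labels as in the sketch below. The function name and the toy labels are illustrative only and are not the article's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels: 1 = human, 0 = bot.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

On a balanced dataset like this one, accuracy is a reasonable headline number, while precision and recall show whether errors lean toward flagging human reviews or missing bot reviews.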
Final Selected Model for Deployment:
Our fine-tuned BERT variant (myBERT300, using the maximum token length of 300) outperformed the rest of the classifiers, achieving the best accuracy (99.432%), precision (99%), recall (99%), and F1 score (99%). It is therefore our choice for deployment as the fake review detector.
Although we used the pre-trained BERT model from Hugging Face and only fine-tuned it for our classification task, fine-tuning still requires powerful resources (a strong GPU or TPU). Without them, the process can be very time-consuming and frustrating (well, exaggerating a bit).
The runners-up among the classifiers, based on accuracy and run time, are myBERT150 (98.731% accurate), Passive-Aggressive Classifier (97.417% accurate), and Support Vector Machine (97.332% accurate).
Now we are ready to deploy our model in production. We chose Flask, a popular web application framework written in Python. It has multiple modules that make it easier for machine learning developers to turn their models into working applications without having to worry about details such as protocol management and thread management.
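A minimal sketch of how a classifier might be exposed through Flask is given below. The `/predict` route, the JSON field names, and the `predict_label` placeholder are assumptions for illustration; the deployed application is not described in detail in this article, and the placeholder rule stands in for the fine-tuned model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_label(review_text):
    """Placeholder for the fine-tuned classifier: 0 = bot, 1 = human.
    The rule below is a dummy so the sketch runs without a model."""
    return 1 if len(review_text.split()) > 3 else 0

@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a malformed body.
    payload = request.get_json(silent=True)
    review = payload.get("review") if isinstance(payload, dict) else None
    if not isinstance(review, str) or not review.strip():
        return jsonify(error="field 'review' is required"), 400
    label = predict_label(review)
    return jsonify(label=label, verdict="human" if label == 1 else "bot")
```

A client would POST a JSON body such as `{"review": "The food was really good here"}` and receive the predicted label in the response. In a real deployment, `predict_label` would load the fine-tuned model once at startup and reuse it for every request.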
We improved the performance of the classifier by removing the flaws in the data, i.e., the inconsistency of generated fake reviews with the real reviews and the significant repetition of generated reviews. Our team has also identified a few measures that could further improve the classifier’s performance and reduce its run time and resource usage:
Create a more diverse dataset.
Tune baseline model parameters to get a model with a faster run time.
In a nutshell, fake review detection has received significant attention in both business and academia due to the potential impact fake reviews have on consumer behavior and purchasing decisions as well as on businesses themselves. Crucial advancements have been made toward solving this problem with the help of different machine learning and deep learning models and the availability of large amounts of data.
In this article, we proposed the use of one of the most powerful deep learning architectures, Bidirectional Encoder Representations from Transformers (BERT), on the dataset we created with the help of our bot using Generative Pre-trained Transformers. We took the pre-trained BERT from Hugging Face and fine-tuned it to make two variants. Comparing the performance of these variants with each other and with the other machine learning models brought us to the conclusion that our variant myBERT300 outperformed all the other detectors, with an accuracy of 99.432%.
We expect that a more diverse dataset and further tuning of the baseline model can improve the performance and resource efficiency of the classifier even more.
Collaborators in this article:
Objection.co Data Science Interns: Junylou Daniels, Albina Cako, Dung Tran