
How to analyse customer reviews with NLP: a case study

Alexandra Mendes
Vítor Bernardes
Rui Melo

November 27, 2023


Problem

One of the most critical aspects of understanding a business is understanding its strengths and weaknesses. Analyzing why it is thriving or not is key to the longevity of that business. Hotels are no strangers to this scenario.

As a business owner, it is essential to understand why some customers might not return to the hotel, what put them off, and what stood out to them positively.

To perform this research, we gathered a dataset of hotel reviews and focused our attention on a specific hotel: Britannia International Hotel Canary Wharf.

Britannia International Hotel Canary Wharf.

The dataset was gathered from the Kaggle platform and contains over 515,000 customer reviews and scores for 1,493 luxury hotels across Europe.

Solution

Motivation and Objectives

To gain insights into the hotel reviews and understand customers' feelings and feedback more accurately, we first needed to understand how customer opinions and customer segments were distributed in the available data.

Additionally, the large corpus of customer feedback makes it time-consuming to manually review them to capture customers' preferences and pain points. Therefore, we also proceeded to analyze the review texts with Natural Language Processing techniques to understand the intrinsic feelings and emotions behind reviews and recognize which aspects of the hotel required improvements.

While we applied this process to the hospitality industry, this type of analysis can be readily implemented in any other industry that captures customer feedback, or even applied to customer comments collected from social media posts.

Overview

We started by evaluating the available data, with particular attention to the format and soundness of each field. As is typical when dealing with datasets, especially ones that involve user-generated data, some data needed cleaning. This is an important step in every data analysis process to ensure that the data we work with and use as a foundation for insights is sound and therefore leads to reasonable and representative conclusions.

In the specific case of this dataset, the actual review text needed some minor cleaning to remove redundant whitespace. However, we also noticed a significant issue: all punctuation was missing from the reviews. Therefore, it was necessary to perform a pre-processing step. We proceeded to recover some of the structure provided by that punctuation to ensure we could use Natural Language Processing techniques and obtain relevant results. A simple yet effective method was to approximate that structure by adding a period before each word beginning with a capital letter.

The effectiveness of that method also stemmed from our additional processing, where we filtered known acronyms and named entities, so we would not add unnecessary periods. To achieve that, we employed automatic named entity recognition, a process that attempts to identify named entities in a given piece of text automatically. In the NLP context, named entities are real-world objects that can be identified with a proper name, including cities, individuals, organizations, etc.
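The pre-processing step described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the `PROTECTED` set is a hypothetical, hard-coded stand-in for the acronym list and the output of automatic named entity recognition, and the function name `restore_periods` is ours.

```python
import re

# Hypothetical protected terms. In the case study these came from a list of
# known acronyms and from automatic named entity recognition; here they are
# hard-coded purely for illustration.
PROTECTED = {"Canary", "Wharf", "London", "TV", "I"}


def restore_periods(text, protected=PROTECTED):
    """Approximate sentence boundaries in text that has lost its punctuation."""
    # Collapse redundant whitespace first.
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    out = []
    for i, word in enumerate(words):
        starts_sentence = (
            i > 0
            and word[:1].isupper()
            and word not in protected
        )
        # A capitalized, unprotected word is taken as the start of a sentence,
        # so the previous word gets a period.
        if starts_sentence and not out[-1].endswith("."):
            out[-1] += "."
        out.append(word)
    # Close the final sentence.
    if out and out[-1] and not out[-1].endswith("."):
        out[-1] += "."
    return " ".join(out)


print(restore_periods("  The room was small   Staff at Canary Wharf were nice "))
```

Without the protected set, "Canary" and "Wharf" would each trigger a spurious period, which is why filtering acronyms and named entities mattered.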

Analysis

Data profiling

The next step was creating our working dataset by filtering the full dataset down to the reviews of our specific hotel.

The dataset contains the review date and the score given to that stay. It also contains the reviewer's nationality and tags describing the characteristics of the visit, such as whether the stay was in a double or a single room and how long it lasted. In addition, it contains separate negative and positive review texts for each stay.

To bring the available data closer to a real scenario, we randomly merged the negative and positive reviews into a single column to analyze later.
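A rough sketch of the filtering and merging steps is shown below. The column names (`Hotel_Name`, `Negative_Review`, `Positive_Review`, `Reviewer_Score`) and the "No Negative" / "No Positive" placeholders reflect the public Kaggle CSV as we understand it and should be treated as assumptions; the function name and the output layout are hypothetical.

```python
import random

# Assumed column names and placeholder values from the public Kaggle CSV.
HOTEL = "Britannia International Hotel Canary Wharf"
PLACEHOLDERS = {"No Negative", "No Positive", ""}


def build_review_column(rows, hotel=HOTEL, seed=0):
    """Keep one hotel's rows and pool both review fields into one shuffled list."""
    pooled = []
    for row in rows:
        if row["Hotel_Name"] != hotel:
            continue
        for column, label in (
            ("Negative_Review", "negative"),
            ("Positive_Review", "positive"),
        ):
            text = row[column].strip()
            if text not in PLACEHOLDERS:
                # The source column is kept as a label so that a sentiment
                # model can be evaluated against it later.
                pooled.append(
                    {"text": text, "label": label, "score": row["Reviewer_Score"]}
                )
    # A seeded shuffle mixes the two kinds of review reproducibly.
    random.Random(seed).shuffle(pooled)
    return pooled
```

The rows can come from `csv.DictReader` or any other source that yields dictionaries with these keys.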

Distribution Analysis

The first task was to look at review ratings by date, which could reveal periods where the ratings dipped. Such dips could stem from a seasonal factor, such as the lack of air conditioning in the summer, or from the impact of a specific employee.

This approach was not fruitful, so we applied the same logic to the tags and nationalities. Through the tags, we could identify, for instance, whether customers who stayed in an Executive Double Room left bad reviews. That comparison can be visualized with boxplots. We analyzed all the different tags and found that most of them had similar distributions, which prevented us from obtaining relevant insights.

Boxplots with reviewer score for different hotel accommodations.
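The numbers behind such boxplots are the five-number summary of the scores for each tag. The sketch below computes them with the standard library only; the `tags` and `score` keys are a hypothetical record layout, not the dataset's own.

```python
from collections import defaultdict
from statistics import quantiles


def score_summary_by_tag(reviews):
    """Return (min, Q1, median, Q3, max) of reviewer scores for each tag."""
    scores = defaultdict(list)
    for review in reviews:
        for tag in review["tags"]:
            scores[tag].append(review["score"])

    summary = {}
    for tag, values in scores.items():
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[tag] = (min(values), q1, median, q3, max(values))
    return summary
```

Comparing these tuples across tags is a quick way to check, before plotting, whether any tag's distribution differs noticeably from the others.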

Regarding the nationalities, it was essential to analyze the distribution of our customers. This could provide insights into the marketing team's effectiveness in some markets. Excluding the UK customers, who represent 80% of all customers, we get the following world map overview, where darker shades indicate a higher number of reviewers from that nationality:

World map overview indicating reviewers' nationalities.

Sentiment Analysis

To further understand the feeling behind the reviews, we used a language model hosted on the Hugging Face platform to determine whether each review was positive or negative. The multilingual XLM-RoBERTa-base model was trained on ~198M tweets and fine-tuned for sentiment analysis in 8 languages.

With the ability to split the reviews into positive and negative with a reasonable confidence level (0.76 accuracy in our dataset), we tried to analyze patterns within those reviews. A straightforward way to visualize the words is through word clouds. Below are the word clouds for negative and positive reviews.

   

Negative reviews

Positive reviews
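Because each review originally came from a "negative" or a "positive" field, a classifier's predictions can be scored against those labels. The sketch below shows how such an accuracy figure can be computed. The checkpoint name in `load_transformer_classifier` is our guess at a publicly hosted model matching the description above, not a confirmed detail of the project, and such a model may also return a neutral label, which this sketch simply counts as a miss. `toy_classifier` is a deliberately naive stand-in so the evaluation logic can be tried offline.

```python
def accuracy(classify, labelled_reviews):
    """Share of reviews whose predicted sentiment matches the known label."""
    if not labelled_reviews:
        raise ValueError("need at least one labelled review")
    hits = sum(1 for text, label in labelled_reviews if classify(text) == label)
    return hits / len(labelled_reviews)


def load_transformer_classifier(
    model_name="cardiffnlp/twitter-xlm-roberta-base-sentiment",  # assumed checkpoint
):
    """Wrap a Hugging Face pipeline as classify(text) -> lowercase label.

    Calling this downloads the model, so it is not run in this sketch.
    """
    from transformers import pipeline  # lazy import: optional heavy dependency

    pipe = pipeline("sentiment-analysis", model=model_name)
    return lambda text: pipe(text, truncation=True)[0]["label"].lower()


def toy_classifier(text):
    """Naive stand-in: any review containing the word 'not' is negative."""
    return "negative" if "not" in text.lower().split() else "positive"


labelled = [
    ("Room was not clean", "negative"),
    ("Great location", "positive"),
    ("Bed too hard", "negative"),
    ("Friendly staff", "positive"),
]
print(accuracy(toy_classifier, labelled))
```

Passing the classifier in as a function keeps the evaluation independent of any particular model.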

   

There is much information to be gained from analyzing the dynamics between positive and negative customer reviews. Customers surely want to have their say, as demonstrated by our data set, where negative reviews are, on average, over twice as long as positive reviews. Additionally, by looking at the evolution of the average number of reviews over time, we can see a potential slight increasing trend in the number of negative reviews, which the business should be attentive to.

Three-month moving average of the number of reviews.
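A moving average like the one plotted here can be computed in a few lines. This sketch assumes review dates have already been converted to ISO strings (`YYYY-MM-DD`) and uses a trailing window; both are choices made for illustration.

```python
from collections import Counter


def monthly_counts(dates):
    """Count reviews per calendar month; dates are ISO strings like '2017-03-21'."""
    return Counter(date[:7] for date in dates)


def moving_average(values, window=3):
    """Trailing moving average; the first window-1 points are left out."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


counts = monthly_counts(["2017-01-05", "2017-01-20", "2017-02-01"])
series = [counts[month] for month in sorted(counts)]
print(series, moving_average([3, 6, 9, 12]))
```

Months with no reviews do not appear in the counter, so a real series should be padded with zeros for those months before smoothing.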

Emotion Analysis

Besides identifying the sentiment behind a text, another technique in NLP is to identify the emotion behind it. To achieve this, we used the NRCLex library, which allows us to recognize emotions in texts, such as fear, anger, or surprise. This analysis allows us to more accurately understand how customers feel about a specific service or product.

As with sentiment, once the emotions associated with each review are identified, we can build a word cloud for each emotion within the positive or negative reviews. For example, the word cloud generated from the trust emotion within the positive reviews is as follows:

Word cloud generated from trust emotion within positive reviews

This process allows us to have some idea of what triggers which customer emotion.
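Lexicon-based emotion detection boils down to looking each word up in a word-to-emotion table and counting the hits. The sketch below illustrates the idea with a tiny hand-made lexicon (`TOY_LEXICON` is hypothetical and far smaller than a real emotion lexicon); it does not reproduce the library we used.

```python
from collections import Counter
import re

# Tiny hand-made lexicon for illustration only. A real word-emotion lexicon
# has thousands of entries, and a word may map to several emotions.
TOY_LEXICON = {
    "friendly": {"trust", "joy"},
    "helpful": {"trust"},
    "safe": {"trust"},
    "dirty": {"disgust"},
    "rude": {"anger"},
    "noisy": {"anger"},
}


def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count how many emotion-bearing words of each emotion a text contains."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        counts.update(lexicon.get(word, ()))
    return counts


def words_for_emotion(texts, emotion, lexicon=TOY_LEXICON):
    """Word frequencies behind one emotion: the input for a per-emotion word cloud."""
    words = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if emotion in lexicon.get(word, ()):
                words[word] += 1
    return words
```

The frequencies returned by `words_for_emotion` are what a word cloud for a single emotion, such as trust, would be drawn from.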

Keyword Analysis

To further analyze the reviews, we wanted to identify the main objects of customer comments in their reviews. To achieve that, we extracted relevant keywords from the set of positive and negative reviews using YAKE, an unsupervised automatic keyword extraction method.

This method computes statistical features for each term in a review, including word case, position, frequency, and context, and weights each term according to these features.

Finally, a score is computed indicating the significance of each term as a potential keyword. This is a powerful yet lightweight method that, due to its fully unsupervised nature, can be employed in different domains and even with other languages.

Additionally, we employed a pure frequency-based approach to uncover the most common objects mentioned in reviews. The results were similar to our keyword analysis, reaffirming its validity and reliability.
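The frequency-based cross-check can be as simple as counting tokens after removing stop words. In this sketch the stop-word list is a short hypothetical one, and no attempt is made to keep only nouns, so it is a simplification of the approach described above.

```python
from collections import Counter
import re

# Hypothetical stop-word list, kept short for illustration.
STOPWORDS = {
    "the", "a", "an", "and", "was", "were", "is", "in", "of", "to",
    "very", "but", "not", "too",
}


def top_terms(reviews, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stop-word terms across all reviews."""
    counts = Counter()
    for review in reviews:
        # Lowercase tokens; hyphens are kept so that "wi-fi" stays one term.
        tokens = re.findall(r"[a-z][a-z\-]*", review.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


sample = ["The room was small", "Room too hot and the bed was hard", "The bed and the room"]
print(top_terms(sample, n=2))
```

Running the same function separately on the positive and negative reviews gives two ranked lists that can be compared with the keyword extractor's output.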

These were the keywords identified for positive and negative reviews:

  • Positive: hotel, location, staff, view, room, breakfast
  • Negative: hotel, staff, room, breakfast, window, bed, Wi-Fi

As expected, the identified keywords are common points addressed in the hospitality industry reviews. They already constitute a good indicator of adequate service or potential areas of improvement for the hotel.

However, we wanted to go deeper into the analysis and uncover exactly what it was about these objects that was, or was not, working as expected by customers. For example, why were windows such a prominent aspect of negative reviews?

To that end, we used another technique from Natural Language Processing: syntactic dependency parsing. We employed spaCy, a fast, comprehensive, and production-ready NLP library for Python, to create a syntactic dependency tree, which connects all terms in the input text according to their syntactic relation. Then, we queried this tree to pinpoint precisely what it was about a given keyword (for example, "room" or "location") that customers did or did not especially like.

Syntactic dependency parsing process.

The result was a list of modifiers for each keyword. For example, we could learn that customers might consider a "room" to be "spacious" or the "location" to be "convenient." This resulting list of modifiers enabled us to create word clouds to visualize the frequency of each modifier for the given keyword, such as the word cloud below, for the keyword "room":

Word cloud for the keyword room
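Querying a dependency tree for a keyword's modifiers can be sketched as below. To keep the example self-contained, the parse is written by hand as a flat list of tokens. In practice those fields would come from the parser (spaCy exposes them as `token.dep_` and `token.head`). The `Token` tuple, the function name, and the two patterns checked (`amod`, and `acomp` sharing a verb with an `nsubj`) are illustrative assumptions, not the exact query we ran.

```python
from collections import Counter, namedtuple

# Flattened parse: one record per token, with its dependency label and the
# index of its syntactic head within the same list.
Token = namedtuple("Token", "text dep head")


def modifiers_of(tokens, keyword):
    """Collect adjectives attached to `keyword` directly or through a verb."""
    found = Counter()
    for tok in tokens:
        head = tokens[tok.head]
        # "spacious room": adjectival modifier whose head is the keyword.
        if tok.dep == "amod" and head.text.lower() == keyword:
            found[tok.text.lower()] += 1
        # "the room was clean": complement and subject share the same verb.
        elif tok.dep == "acomp":
            subjects = [t for t in tokens if t.dep == "nsubj" and t.head == tok.head]
            if any(s.text.lower() == keyword for s in subjects):
                found[tok.text.lower()] += 1
    return found


# Hand-written parse of "The spacious room was clean" (hypothetical output).
parsed = [
    Token("The", "det", 2),
    Token("spacious", "amod", 2),
    Token("room", "nsubj", 3),
    Token("was", "ROOT", 3),
    Token("clean", "acomp", 3),
]
print(modifiers_of(parsed, "room"))
```

Summing these counters over all reviews yields the modifier frequencies that feed a word cloud for a given keyword.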

Analyzing these frequent modifiers for each keyword, along with their relevance and weight, separately for positive and negative reviews gave us deeper insight into what customers like best, and what they like less. We present the results below.


Outcomes

Upon analyzing the data set as described above, we were able to identify some positive aspects of the business, as well as essential areas for improvement.

One notable comment from customers, which frequently appears in both positive and negative reviews, is that some consider the hotel dated. The three main modifiers used to describe the hotel in negative reviews pertain to that quality. This suggests the business may want to look into renovation to address those pain points.

Modifiers for hotel keyword in negative reviews
Modifiers for hotel keyword in positive reviews.

The keyword analysis reveals customers' most common points when posting their reviews. As one would expect, the room features prominently in both negative and positive reviews. While it is mentioned regularly in negative reviews throughout the period we analyzed, in approximately the last six months, there was a surge in room mentions in positive reviews, a potentially favorable trend the business should be aware of. In positive reviews, the most common comments refer to rooms as clean and spacious. There are also references to being overall comfortable and cheap.

The beds were also frequently mentioned, with some users considering them stiff and uncomfortable. The prevalence of this comment also suggests an immediate area for improvement. On that note, some customers also pointed out that they found the hotel noisy.

Top modifiers for negative reviews for bed.

In addition to that, another major issue reported by customers is the heating, ventilation, and air conditioning system in place at the hotel: "hot" and "cold" were the main concerns from customers regarding their rooms. One particular pain point was the room window, which was mentioned frequently enough to be identified as one of our keywords, especially because some rooms' windows required staff assistance to open.

Word cloud with main concerns from customers.

In that sense, the staff was frequently brought up in positive and negative reviews, with some customers considering them rude. However, more often than not, they were considered friendly and helpful, although one particular point of interest is that many customers thought the hotel was understaffed. Finally, the mention of the staff in reviews remains relatively constant over time.

The hotel location was another prominent factor in positive reviews. It was predominantly perceived as a positive aspect, receiving many general compliments and being considered convenient and centrally located. However, one crucial trend the business should be aware of is that, over time, location has been mentioned less frequently in positive reviews while increasingly referred to in negative reviews. While this may relate to the surrounding area and, therefore, to factors outside of the hotel's immediate control, it is a potential trend worth keeping an eye on.

Finally, it is worth mentioning that a significant number of negative reviews commented upon the hotel's Wi-Fi, mainly due to it being paid and not free.

Keyword mentions in reviews.
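Trends such as the ones described for "room" and "location" can be tracked by measuring, month by month, what fraction of reviews mention a keyword. The sketch below is one simple way to do that; the `date` and `text` keys are a hypothetical record layout, and matching on whole whitespace-separated words is a simplification.

```python
from collections import defaultdict


def mention_share_by_month(reviews, keyword):
    """Fraction of each month's reviews that mention `keyword` (case-insensitive)."""
    totals = defaultdict(int)
    mentions = defaultdict(int)
    for review in reviews:
        month = review["date"][:7]  # assumes ISO dates such as "2017-03-21"
        totals[month] += 1
        if keyword.lower() in review["text"].lower().split():
            mentions[month] += 1
    return {month: mentions[month] / totals[month] for month in sorted(totals)}
```

Computing this separately for positive and negative reviews makes it possible to see a keyword drifting from one group to the other over time.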

Applications

Business intelligence and sentiment analysis projects such as this can bring value to many use cases.

E-commerce

Nowadays, a significant portion of shopping is done online. E-commerce represents a growing trend of nearly unlimited access to resources, markets, and products in real time from anywhere on the planet. Understanding the reach of marketing efforts in terms of customer segmentation is very important for a business to adjust those efforts and reach the desired target audience.

Almost every e-commerce platform contains a reviews section where customers can comment on the products they bought. This comment section represents a valuable data source that can bring value to the business.

Through NLP techniques, it is possible to acquire insights into what the customer likes or dislikes about the products. These insights can help identify flaws in, or potential improvements to, the product and/or the platform. We can identify key aspects that bring insecurity or other emotions to the customer, so we can act on them.

It also becomes possible to see the evolution of the user sentiment on the product over time and measure how changes affected the customers' overall opinion.

Hospitality Industry

The hospitality industry is a very competitive sector where little details can prove to be essential edges over competitors.

Booking, Trivago, Google, and other platforms often list establishments. The common aspect between these platforms is that customers often use them to leave reviews. By analyzing the review scores and comments, it is possible to gather insights into customers' opinions on key aspects of the businesses.

This data allows us to interpret which aspects of the business need changing or attention, what parts customers value, and possibly foresee some adjustments we should consider.

Food services industry

Restaurants, coffee shops, and bars increasingly rely on their online presence to attract customers. This involves being listed on several platforms like Yelp, Google, Zomato, and Tripadvisor, which allow users to leave ratings and written reviews. Often, clients choose which new places to try based solely on these reviews, making them a key to understanding how the business is performing.

It is in these establishments' best interest to use all this feedback to find ways to get an edge over their competitors. Analyzing possible customer pain points helps invest in worthwhile improvements, and tracking consumer sentiment over time ensures that the investments are paying off.

Any establishment that grows beyond a specific size must rely on Data Science techniques to analyze the many reviews it may receive on different platforms. This process can be automated, providing quick feedback and a broad vision of what is attracting or disenchanting customers. This will help managers take their food services to the next level.

Entertainment Industry

The entertainment industry is broad, including everything from movies, TV shows, and YouTube channels to amusement parks and circus acts. Common to all of these businesses, especially in the digital age, is that they are subject to reviews and comments, both from critics and spectators.

As the business grows, the number of reviews might become unmanageable, making it difficult to understand the overall sentiment of the population. This is where NLP techniques should come into play, allowing many comments to be parsed and analyzed to extract valuable and actionable insights.


Endnotes

In summary, we analyzed customer feedback about their stay in a hotel using Natural Language Processing techniques and uncovered actionable insights that can directly impact business decision-making. This analysis and the underlying processes can be used for many other applications, bringing value to businesses across many sectors.

This project was completed in 3 days with a team of 2 Imaginary Cloud Data Scientists. Imaginary Cloud provides Data Science and AI development services, focusing on bringing the highest value to its clients through tailored solutions and an agile process.

Contact us if you need a custom Data Science or AI solution:

Alexandra Mendes

Content writer with a big curiosity about the impact of technology on society. Always surrounded by books and music.

Vítor Bernardes

Data scientist passionate about data science and watchful of its ethical implications. Besides work, I love nerding out on music and reading a good story.

Rui Melo

Data Scientist who loves exploring problems. In my free time, I teach basketball to kids and enjoy going to the beach.
