Can You Trust Online Reviews? Here's What's Wrong With Them

When did four stars become a bad review? Online product reviews and rating systems are broken, and here's how we can fix them.


PHOTO ILLUSTRATIONS BY SARINA FINKELSTEIN

ON A RECENT UBER TRIP FROM MY BROOKLYN APARTMENT TO JFK AIRPORT, I realized the driver had prematurely crossed into Queens on a route that Google Maps does not recommend. Annoyed at the driver for taking the long way, but cognizant that my review could hurt his livelihood, I gave him three stars—what I considered not awful, but not stellar.

Later, I found out I had effectively given him a potentially career-ending review because, according to its fine print, Uber says it may kick drivers off the system if they can’t maintain a 4.6-star average. Out of five stars.

That’s right: You need to get the equivalent of a 92% grade-point average just to continue to be an Uber driver.
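To put numbers on that, here is a rough back-of-the-envelope sketch in Python. It assumes the cutoff is a simple average of all of a driver's ratings, which is a simplification for illustration and not a description of how Uber actually computes it. The function names are made up for this example.

```python
import math
from fractions import Fraction

# The 4.6-out-of-5 cutoff cited above; exact fractions avoid rounding noise.
THRESHOLD = Fraction("4.6")
MAX_STARS = 5


def as_percent(average, max_stars=MAX_STARS):
    """Express a star average as a percentage of the maximum possible score."""
    return float(Fraction(average) / max_stars * 100)


def fives_needed(low_rating, threshold=THRESHOLD, max_stars=MAX_STARS):
    """Perfect ratings needed so one low rating plus those averages to the threshold.

    Assumption: a plain mean. Solve (low + n * max) / (n + 1) >= threshold for n.
    """
    n = (Fraction(threshold) - low_rating) / (max_stars - Fraction(threshold))
    return max(0, math.ceil(n))


print(as_percent(THRESHOLD))  # the cutoff as a percentage
print(fives_needed(3))        # perfect rides needed to offset one three-star review
print(fives_needed(4))        # perfect rides needed to offset one four-star review
```

Under that simple-average assumption, a single three-star review takes four flawless rides to cancel out, and even a four-star review takes two.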

Another bizarre iteration of the five-star review system caught my eye while planning the trip. Scrolling through the first 300 reviews of Airbnb listings for San Francisco, I couldn’t find a single one with fewer than four stars.

Clearly, this ain’t your grandfather’s five-star review system. At some point, maybe around when orange became the new black, four stars became the new zero. Or in Uber’s case, four-point-six. For consumers who grew up with Leonard Maltin’s movie guide, this is very confusing.

These days every transaction or relationship can be graded or rated, and the masses regard this ability as essential to their rights as empowered consumers. Doctors on ZocDoc; rides in a car on Uber and Lyft; stays at hotels and makeshift hotels; products and sellers on Etsy, eBay, and Amazon. It’s like a new version of Murphy’s Law: Everything that can be rated will be.


As a result, almost everyone may be influenced by online reviews before they buy, yet because what’s considered a good rating varies so widely from site to site (not to mention the huge problem of fake reviews), every review and ratings system must be taken with a large grain of salt. At best, they are flawed but vaguely useful indications of a service or product’s quality; at worst, they are nearly meaningless.


Do We Even Know What We’re Reviewing?

SERVICES LIKE UBER AND LYFT are rated with the same five-star system as an Airbnb or a movie on Netflix. And yet people think about basic transportation in a very different light than they do highly qualitative, complicated subjects like movies, restaurant meals, or vacation destinations. After all, as long as your Uber driver gets you to your destination in one piece and in a reasonable amount of time without harassing you, you’re probably happy.

Given that, drivers don’t earn stars so much as potentially lose them—which makes five stars the default review. Ask any driver and they’ll tell you: Anything less than perfect is bad. That may be an effective tool for making sure drivers are at the top of their game, but it seems overly punitive given the number of customers who aren’t in the know about their proprietary five-star system.

It doesn’t make much sense that ratings for more qualitative-dependent things operate in the same five-star review system as binary “did it deliver or not?” services. When you stay at a hotel, you’re grading the experience and its details, not whether or not it fulfilled the promise of a bed, roof, and bathroom. Ditto for a film or restaurant: You aren’t looking for confirmation that Dazed and Confused is a feature film or that Per Se will in fact give you edible food—that’s assumed. You want to know if the experience is deserving of your time and money.

People review first and ask questions later.

Complicating matters is the fact that it’s not clear how reviewers should be doling out ratings. Netflix, which uses your star reviews to curate better individual recommendations, is considering ditching its five-star rating system due to the concern that people are basing their reviews not on their enjoyment of the film they’re watching, but on their sense of its reputation. It seems that many reviewers get what might be called “critic syndrome,” handing out stars based on perceived quality rather than personal enjoyment. The result is a broken system in which reviews are nothing more than an amplification of conventional wisdom.

Sometimes people are even confused about exactly what they are rating. Joey Gerhards, manager at popular eBay-based bike shop The Pro’s Closet, says that it’s common for online reviewers of his shop to mix up seller reviews with product reviews: “We often see that buyers use the feedback system as the channel for rating an item or product, instead of the actual service provided,” he told me.


What’s a “Good” Review, Anyway?

MUCH IN THE SAME WAY that it’s impossible for a parent to know if their child’s C grade is average or the worst in the class, it’s hard to tell what a certain online rating really says about quality, especially when everyone rates differently.

Among film rating sites, for example, review aggregator Rotten Tomatoes sorts each review into a binary positive or negative, and then scores the movie based on the percentage of positive reviews. Competitor Metacritic assigns reviews a grade on a scale of 100 and then averages them. Fandango, meanwhile, sticks to the venerable five-star scale of movie guides past, but “Uber-izes” it, so that almost no films get the equivalent of failing grades. “For all intents and purposes, Fandango is using a 3- to 5-star scale,” FiveThirtyEight’s Walt Hickey wrote last year, noting this system’s dangerous potential for destroying a first date because of a mis-rated piece of trash. (Fandango acquired Rotten Tomatoes and now displays that rating, ameliorating that problem.)
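The gap between the first two approaches is easier to see with numbers. The Python sketch below is only an illustration of the two ideas as described above, not either site's actual formula. The review scores are invented, and the cutoff of 60 for counting a review as "positive" is an assumption made for the example.

```python
from statistics import mean


def percent_positive(scores, cutoff=60):
    """Binary approach: the share of reviews at or above a cutoff, as a percentage.

    The cutoff of 60 is a hypothetical value chosen for this example.
    """
    positive = sum(1 for score in scores if score >= cutoff)
    return 100 * positive / len(scores)


def average_score(scores):
    """Averaging approach: the plain mean of review grades on a 0-100 scale."""
    return mean(scores)


# Invented review grades for two imaginary films.
lukewarm = [62, 65, 61, 63, 64]   # every critic mildly liked it
divisive = [100, 95, 98, 20, 15]  # critics either loved it or hated it

for name, scores in [("lukewarm", lukewarm), ("divisive", divisive)]:
    print(name, percent_positive(scores), average_score(scores))
```

In this made-up case, the film everyone found merely fine gets a perfect score under the binary method, while the averaging method ranks the divisive film slightly higher. Same reviews, opposite verdicts.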

As a result of the “Uber-ization” of the five-star system, very few online review ecosystems seem immune to some kind of grade inflation. An exhaustive 2014 analysis of 1.2 million products on Amazon found more than half the reviews got a five-star rating. And yet, not everyone got the memo: Plenty of thorough, three-star reviews read like raves.

After Uber, Airbnb has been the biggest player in the “sharing” economy, and it’s similarly afflicted with review inflation. Consistent with my experience looking at San Francisco, a 2015 study of 600,000 listings by Boston University found that 95% of Airbnbs got 4.5 stars or above, suggesting that people rate it more like Uber: Unless the listing was dishonest or something went very wrong, Airbnb reviewers give full or almost-full star ratings. “Virtually none have less than a 3.5 star rating,” reads the study, which contrasts this with Yelp and TripAdvisor’s hotel reviews, which average a far more normalized 3.8 and 3.9 stars, respectively. Still, there are twice as many five-star reviews as any other category on Yelp. The second-most is four stars.

Using the same five-star system for everything has everybody confused. It’s not hard to picture a “Curb Your Enthusiasm” scenario with Larry and Cheryl after an Uber ride:

“Gave him a four, thought he was pretty, pritttayy good!” Larry might say.

“You gave him a four? Larry, he could lose his job, what did you expect, a limousine? What was wrong?”

“What was wrong? Nothing! Four is good! Back in my day a four was ‘very good.’ Five was ‘excellent.’ You think he should get a five for that? I save my fives!”

Guilt Destroys Honesty In Reviews

ONE THEORY BEHIND THE DISCONNECT between the ratings distributions of new sharing-economy upstart Airbnb and those of hotel-finding stalwarts like TripAdvisor and Yelp is that customers feel guilty leaving anything less than the maximum when they deal with an ordinary citizen instead of a conventional business. In Airbnb’s case, guests frequently break bread and share a space with hosts. People are very susceptible to guilt; just look at “tip creep” in coffee shops when the Square-enabled iPad is turned towards you asking how much you want to tip.

This is something David Infante, a culture writer at Thrillist, recently experienced firsthand with an Airbnb rental gone wrong.


“The one we settled on for this trip had a bevy of 4- and 5-star ratings, and there were no red flags—seemed legit,” he said. Having covered hospitality for years, he had done his homework.

However, the $200-a-night place was not good. The advertised washing machine was really a coin-operated affair in the basement, and out of order at the time of the stay. The front door of the apartment was propped open by its deadbolt due to a broken knob, and there were exposed wires. “For the price, and for the rating, it was absolutely baffling. How could other guests possibly not feel the same degree of discomfort that we did?” Infante mused.

Perhaps the previous guests would have felt bad giving more honest reviews, warts and all. “In the absence of some heinous, objective shortcoming (cockroaches, for example), a guest feels this uncomfortable pressure to withhold their subjective reactions to their Airbnb,” Infante said. “Hell, when we were later writing our review, we went through some minor hand-wringing ourselves. ‘Was it really that bad? I feel like two stars is mean. I mean, it was clean and well-located, after all.’”

With Uber and Lyft, the review-inflating guilt factor is even more pronounced. Giving a four-star review (or worse) on Uber could literally cost a driver his livelihood, not just a little extra income from excess property. Many people I’ve talked to—and lots of people on Quora—feel guilty leaving anything other than the highest review, even if the driver took a less-than-ideal route or spent the whole trip on the phone. “I only give a 4-star review in extreme circumstances,” my friend Alex, a frequent Uber user, told me. “I’ve had terrible rides where the person got lost and cost me a lot of extra money due to incompetence. But if they’re nice about it they get five stars.”

In the end, it’s a vicious cycle that pushes even the mediocre towards four and five stars. And when everything gets a five, nothing gets a five, and you lose the ability to sort by quality, making it impossible to really know what you’re getting into.


Ratings Are Dominated By Fives And Ones

AIRBNB DOESN’T BUY THE IDEA that guilt causes ratings inflation. The company contends that users are more likely to be critical about a listing because they’d want others to be honest, too. “I think everybody takes it very seriously because that’s how you decide where you’re going to stay,” Airbnb spokesperson Nick Shapiro said.

So why are Airbnb ratings overwhelmingly high? The company says reviews are skewed heavily toward four- and five-star ratings because it removes poorly reviewed listings from the system.

Dropping bad listings is great news for customers who want assurance that all Airbnbs meet a certain baseline standard, but the heavily-skewed positive listings make it impossible for customers to filter by quality. In an interview with Harvard Business Review, Harvard Business School professor Frances Frei said of a system saturated by five stars: “It’s close to useless…At that point, any notion that the company would be soliciting useful feedback, it’s gone.”

Not all companies’ review systems cluster around fives. Many products on Amazon are loaded with both five- and one-star ratings and largely lack anything in the middle. “On the macro level it’s not good to have a rating system that’s heavily skewed positive or negative from a distribution standpoint,” Luther Lowe, VP of Public Policy at Yelp, told me. “It doesn’t make sense to have this 1-5 scale if 90% of your ratings are four or five stars, you should probably switch to a thumbs up, thumbs down system.” (Yelp itself isn’t immune from the problem; its average rating is 3.8, and 44% of all ratings are five stars.)

“At YouTube … it didn’t feel like people were motivated to use lower ratings.”—Shiva Rajaraman, former YouTube product manager

Shiva Rajaraman, who now works at Spotify, told me that Uber and Airbnb’s ratings situation reminds him of his days as a product manager at YouTube, when it was obvious the five-star system wasn’t working. “So few people were using one, two, three, four—and rarely would we see the scores correlate with quality of the video,” he said. “The output was one of largely zero value.”

Eventually, Rajaraman says, YouTube abandoned stars in favor of likes/dislikes because the latter “was an easier system everyone could understand because that binary feedback is easy to act on.” Plus, with the advent of better metrics and hard data—like time watched—subjective feedback was becoming less necessary. For a company like Uber that has GPS data giving route and trip time, certain things are already known.


You Really Need To Read Reviews Because Ratings Are So Flawed

AIRBNB ALSO DEFENDS its ratings system by saying that the stars are just a part of the review—and not the most useful part. “What’s important is what they’re saying about it—personal experiences people have actually had,” Airbnb’s Shapiro told me. “We hope all listings become five stars.”

In a sea of five-star ratings, the only way to differentiate is to read the nitty-gritty of each review and try to interpret what the guests were really thinking. Did they give a good review because they were just being nice and don’t want to hurt the host’s cash flow? Or did they genuinely consider their stay to be a top-notch experience? Also: Does their version of a five-star experience resemble what you consider to be a five-star experience?

The text in reviews is certainly helpful, but with that as your main tool it’s impossible to glean quality without reading individual reviews en masse—a huge burden for the customer.

So What Are The Alternatives?

RAJARAMAN READILY ADMITS that simplified systems like the one he helped develop at YouTube sometimes come up short. “There’s something missing in Uber and a lot of these things, which is how to tell you this was exceptional,” he said. Users can explain that a driver was amazing or that a host gave them the best night of their lives, but in terms of pure ratings this experience won’t stand out from any other five-star review. And you can’t give two thumbs up.

EBay’s solution to star review issues has been a multi-factor feedback system that includes positive-neutral-negative ratings, star ratings for detailed feedback, and comments. In theory, this makes sense. In reality, the resulting complexity is hard for some consumers to understand, as Gerhards noted.

According to Laura Chambers, a VP of Global Consumer Trust at eBay, “top sellers” have 99.5% positive ratings, while all sellers average 98%. Getting into the top tier has its perks: top sellers enjoy fee discounts and good placement in search results.

Hotel Tonight has a clever solution: not rating a luxury resort and a hostel with the same scale.

Consumers, meanwhile, should understand that 99.5% positive doesn’t mean 99.5% perfect, but they may not. This leads to the online business version of grade-grubbing, in which many listings by serious sellers have a paragraph saying: “If you feel that this transaction was anything less than five stars, please contact us before leaving feedback and we will certainly try to address the problem to your satisfaction.” Anything less than perfect has the capacity to damage.

The professional hotel industry might not have quite the same problems with the five-star system as Airbnb, but booking site Hotel Tonight decided to eschew it anyway. “I think they’ve become inflated to the point that they’re not useful or trustworthy,” Hotel Tonight CEO Sam Shank told me. “You would see a big box chain hotel with a 4- or 4.5-star rating, which just really doesn’t equate to the level of service you’re going to be getting.” The company now opts for the binary like/dislike system. If a property gets below 80% positive feedback, it is put on probation and a dialogue opens. If things don’t turn around, Hotel Tonight says goodbye.

Such a system may seem oversimplified for a quality-based product, but Hotel Tonight has an obvious solution: cataloguing hotels by service category, because you shouldn’t rate a luxury resort and a hostel with the same scale. The elegance of this system is that it allows for the “everyone gets a five” utopia of Airbnb, while preserving the consumer’s ability to filter and sort by useful measures of quality (hotel categories, characteristics, amenities, and price points). What’s more, the ease of rating facilitates mass participation and thus, more utility for the system.

It’s unclear whether gig-economy star inflators—Airbnb, Lyft, Uber—and the other players will realize their review systems’ flaws and evolve them into something with more utility. But every month they don’t solidifies the skewed, confusing, and arguably meaningless rating system as the standard.
