
What Twitter Says to Linguists

Katy Steinmetz

There’s more in a tweet than 140 characters. Among the 500 million messages sent each day on Twitter, there’s a tsunami of slang terms and textspeak. There are hashtags, emoticons and links. Many tweets contain geotags that identify where on earth a person stood when pressing SEND. That may sound like just a lot of noise, but for linguists making ever more sophisticated use of it all, Twitter is providing the most enormous stream of data they have ever had at their disposal.

Gone are the days when a language researcher had to interview subjects in a lab or go door to door in the hope of gaining a few insights about a limited sample of people. Academics in the U.S. and Europe are using the seven-year-old microblogging platform to put millions of examples under the microscope in an instant. “It’s unprecedented,” says Ben Zimmer, the executive producer at Vocabulary.com, “the sheer amount of text you can look at at one time and the number of people you can analyze at once.” Hidden in tweets are insights about how we portray our identity in a few short sentences. There are clues to long-standing mysteries, like how slang spreads. And there is a new form of communication to study. If language is the archive of history, as Ralph Waldo Emerson once said, social media should get its own shelf.

Data extracted from Twitter come with caveats–like the fact that some tweets are written by automated bots–and great minds are still in the early days of building computer programs that can understand tweets in all their humor and nuance. But what is being said on Twitter is invaluable to scholars as well as to researchers with agendas, like advertisers and campaign managers. “We can talk about culture and community, but language gives you a way to really observe those things,” says Jacob Eisenstein, a computational linguist at Georgia Tech, “if you know how to pull that signal out of all that text.” There are more people trying to locate the signal every day: upwards of 150 Twitter-based studies have come out in 2013 so far.

Language researchers have found that women are more likely to use first-person terms (like I and my) and exclamation points, especially repeated ones. Men typically share more links and use more technology-related words. But social networks matter too: a woman who follows and tweets to a largely male audience is more likely to use features, like numbers, associated with the boys. And vice versa for men.
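To make those markers concrete, here is a minimal Python sketch of how a program might count them in a single tweet. It is an illustration only, not the researchers' method: the function name, the short pronoun list and the regular expressions are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately small word list; real studies use far richer features.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

URL_RE = re.compile(r"https?://\S+")          # crude link matcher (assumption)
REPEATED_BANG_RE = re.compile(r"!{2,}")       # "!!" or longer runs
WORD_RE = re.compile(r"[a-z']+")
NUMBER_RE = re.compile(r"\d+")


def gender_linked_features(tweet: str) -> dict:
    """Count the surface markers mentioned in the article for one tweet."""
    links = URL_RE.findall(tweet)
    # Remove links first so digits and punctuation inside URLs are not counted.
    text = URL_RE.sub(" ", tweet)
    words = WORD_RE.findall(text.lower())
    return {
        "first_person": sum(w in FIRST_PERSON for w in words),
        "exclamations": text.count("!"),
        "repeated_exclamations": len(REPEATED_BANG_RE.findall(text)),
        "links": len(links),
        "numbers": len(NUMBER_RE.findall(text)),
    }


if __name__ == "__main__":
    print(gender_linked_features("I love my new phone!!! http://t.co/abc 2 days"))
```

Counts like these say nothing about any one person. They only become informative when compared across very large numbers of tweets.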

A Stanford University linguist found that older tweeters tend to use emoticons with noses–:-) instead of :)–an action tied to their preference for conventional language. Youthful “no nose” tweeters tend to use more swear words. In a study released this June, Dutch researchers at the University of Twente found that young tweeters were more apt to type all-capital words and to use expressive lengthening, like writing “niiiiiiice” instead of “nice.” The older crowd is more apt to tweet well-wishing phrases like good morning and take care, to send longer tweets and to use more prepositions.
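Features like these are simple enough to detect with pattern matching. The Python sketch below shows one way it could be done. The patterns and the function name are assumptions made for illustration, not the code used by the Stanford or Twente researchers, and they cover only a handful of common emoticons.

```python
import re

# Illustrative patterns only; each is an assumption, not a published definition.
NOSE_RE = re.compile(r"[:;]-[)(DPp]")        # emoticons with a nose, e.g. :-)
NO_NOSE_RE = re.compile(r"[:;][)(DPp]")      # emoticons without one, e.g. :)
# A word containing the same letter three or more times in a row ("niiiice").
LENGTHENED_RE = re.compile(r"\b[A-Za-z]*([A-Za-z])\1{2,}[A-Za-z]*\b")
# A word of two or more capital letters ("SO"); single letters such as "I" are skipped.
ALL_CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")


def style_features(tweet: str) -> dict:
    """Return the age-linked style markers found in one tweet."""
    return {
        "nose_emoticons": len(NOSE_RE.findall(tweet)),
        "no_nose_emoticons": len(NO_NOSE_RE.findall(tweet)),
        "lengthened": [m.group(0) for m in LENGTHENED_RE.finditer(tweet)],
        "all_caps": ALL_CAPS_RE.findall(tweet),
    }


if __name__ == "__main__":
    print(style_features("Good morning :-) that is niiiice, SO cool :)"))
```

A real study would need a much longer emoticon inventory and would have to separate acronyms from shouted words, which this sketch does not attempt.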

Then there are geography, income and race. For instance, the term suttin (a variant of something) has been associated with Boston-area tweets, while the acronym ikr (an expression meaning “I know, right?”) is popular in the Detroit area. Tweets containing the word awesome, one study showed, are more likely to emanate from wealthy neighborhoods. Emoticons often appear in tweets sent from areas with a large Hispanic population.

These facts may seem frivolous. But research like this provides insights into how people purposefully and unwittingly use words to signal who they are. “Language is really a window into people’s sense of personal identity,” Eisenstein says. Tweet trends also make it possible to guess the demographics of senders when no information is explicitly provided–a huge asset for anyone trying to use Twitter to sell products, get out a message or collect statistics. For example, researchers at the Mitre Corp., a nonprofit science and technology group, came up with an algorithm that could correctly determine someone’s sex 75% of the time on the basis of just their tweets. It outperformed humans, at a much faster speed. “We can’t personally read all the tweets,” says Carnegie Mellon professor Noah Smith, who has used Twitter to study economic confidence and presidential approval rates. “But you can write a computer program.”

Smith–a specialist in natural language processing–and Eisenstein are researching the diffusion of new words, a pursuit that was much more painstaking in the pre-Internet days. Using Twitter, their team is constructing what Smith describes as “subway maps around the United States showing where words tend to move.” What they’ve found is that race may matter as much as geography. A term coined in Jackson, Miss., for instance, might turn up in Memphis–both places that have a high percentage of African Americans–but not spread to Nashville, where the majority is white. Other researchers are following trails of tweets to investigate how rumors and urban legends change as they’re passed from person to person.

Tweeters are generally oblivious to the possibility that their messages might be scrutinized, which is a boon to researchers who want to analyze natural speech rather than the kind of edited text you find in the pages of a magazine. “They don’t feel like they’re being observed by people in white lab coats,” Smith says. “They really are just doing their thing.” But that presents an ethical quandary too. Even if tweets are meant to be public–and are presented anonymously in academic papers–tweeters typically aren’t consenting to be part of a study. As Zimmer says, “It’s a gray area.”

The data also have drawbacks. Though Twitter users number more than 200 million, they are not a random sample; they skew young and urban. People can lie about themselves. And no matter how much information academics can guess at, there are details–like income and education level–that they’d be able to get in a traditional study but can’t get from a microblog. “When you’re using these brute-force techniques of data collection, the picture of the individual gets lost,” says Zimmer.

Another complication is that people write on Twitter in ways they never have before, which is why researchers at Carnegie Mellon developed an automated tagger that can identify bits of tweetspeak that aren’t standard English, like Ima (which serves as a subject, verb and preposition to convey “I am going to”). The need for that program illustrates how hard it can be to process tweets but also shows how revolutionary the medium has been. “For almost its entire history, written language has had this weird bias, where it’s only used in formal situations,” says Georgia Tech’s Eisenstein. “Social media has taken the informal peer-to-peer interaction that might have been almost exclusively spoken and put it in a written form. The result of that is a burst of creativity.” And, of course, a language bonanza.
