How much can one express in 140 characters? Comparison between English and other languages like Chinese

Major media outlets such as the BBC (Hewitt, 2012) and The Atlantic (Rosen, 2011) have reported that one can express much more content in languages other than English under the arbitrary limits imposed by certain online media or computer-mediated communication platforms, notably the 140-character limit of Twitter and the 140-byte limit of the Short Message Service (SMS). Chinese (Chan, 2010) and Japanese (Summers, 2010) are often cited as “more expressive” languages under the 140-character limit. However, a definite answer to “how much” more expressive, based on systematic and quantifiable measures, has yet to be provided. Taking an information-theoretic approach, a recent AAAI paper, “How much is said in a tweet?”, provides arguably by far the most mathematically and theoretically sound attempt (with some other interesting findings as well), but it did not rely on parallel text data (Neubig & Duh, 2013). The BBC report says only that 140 Chinese characters amount to 70 to 80 words (Hewitt, 2012). Arguably the most systematic and data-intensive attempt was made by a UK information systems consultant (Summers, 2010), who used Google’s machine translation and Twitter’s data. Despite its various merits, that study uses machine translation as a proxy and leaves out the Chinese language. One blog post has claimed, without evidence, that “140 Chinese characters could contain 5 times more content” (Ruby, 2012), and another that “140 Characters On Chinese Twitter Is More Like 500 Characters On Twitter.com” (Dugan, 2011). I therefore think it is necessary to propose a better research design for such comparisons.

Update: thanks to a reminder from Scott Hale, I learned of a similar attempt to answer this exact question from an information-theoretic angle: “How much is said in a tweet? A multilingual, information-theoretic perspective” (Neubig & Duh, 2013).

A more systematic approach using linguistic corpora

In fact, even before Twitter came into existence, Professor Mark Liberman, Director of the Linguistic Data Consortium (a major keeper of linguistic research data), conducted a well-designed Chinese-English comparison using the LDC parallel Chinese/English corpora, producing ratios ranging from 1.19 to 2.27 (Liberman, 2005). The basic idea behind a corpus-based research design is simple: by comparing the actual storage space of texts with the same content in different languages, one can see how much “space” is required to store or convey that content.
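
The storage-space comparison can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Liberman’s actual procedure: the function name `size_ratio` and the one-sentence pair are hypothetical stand-ins for a real parallel corpus.

```python
def size_ratio(text_a, text_b, encoding="utf-8"):
    """Return bytes(text_a) / bytes(text_b) when both are stored in `encoding`."""
    size_a = len(text_a.encode(encoding))
    size_b = len(text_b.encode(encoding))
    if size_b == 0:
        raise ValueError("text_b must not be empty")
    return size_a / size_b


# Hypothetical one-sentence "parallel corpus" (not taken from the LDC data).
english = "I love you"
chinese = "我爱你"

ratio = size_ratio(english, chinese)
print(f"English/Chinese size ratio in UTF-8: {ratio:.2f}")
```

With a real corpus, the two arguments would be the full aligned texts, and the encoding argument becomes part of the research design rather than an afterthought.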

However, while the research design is systematic and well-founded in computational linguistics, Liberman’s attempt suffers from two drawbacks.

The first is a major one, and it highlights the issue of encoding standards. The Chinese-language text in Liberman’s research is encoded not in one of the current mainstream Unicode encoding schemes but in the Chinese national standard GB-2312. As shown later, this contributes to why Liberman’s estimated ratios are much lower than recent individual observations.
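
To see why the choice of encoding matters, the sketch below encodes the same short Chinese string in GB-2312 and in two Unicode encoding schemes using Python’s built-in codecs. The sample string and the helper name `encoded_sizes` are arbitrary choices for illustration. The character count stays the same while the byte count changes.

```python
sample = "中文微博"  # 4 common Chinese characters, arbitrary illustrative text


def encoded_sizes(text, encodings):
    """Map each encoding name to the number of bytes needed to store `text`."""
    return {name: len(text.encode(name)) for name in encodings}


# "utf-16-le" is used so that no byte-order mark is added to the count.
sizes = encoded_sizes(sample, ["gb2312", "utf-8", "utf-16-le"])
print(len(sample), sizes)
```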

The second is a minor one. The data Liberman uses are formal and legalistic in nature. While it is interesting and methodologically sound to compare different genres of parliamentary, legal and news texts, the data might not be adequate for understanding online communication, which is more conversational. It is a minor drawback, however, because researchers must strike a balance between data availability and suitability.

A proposed approach using user-generated translation

Aiming to match the rigor of Liberman’s computational-linguistic approach while also taking into account encoding standards and the availability of data that match current online communication practices, I propose a basic and generic research design to measure “content equivalence” across languages when arbitrary character limits are imposed:

  • Technical: Any comparison must consider the variation in encoding standards and their distinction from character sets. Note that even within the current mainstream industry standard, Unicode, several encoding schemes exist that may produce different outcomes.
  • Content-wise: Any comparison must consider whether the data are suitable for understanding human communication online (as opposed to machine translation) and whether data collection is scalable enough. Scalable data collection facilitates practical research management and enhances the generalizability of knowledge claims.
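
The distinction between a character set and an encoding in the technical point can be made concrete with a short Python sketch. It reports one code point per character (the character set) alongside the number of bytes that three Unicode encoding schemes need for it. The helper name `describe` and the sample characters are illustrative choices; the last character lies outside the Basic Multilingual Plane, so UTF-16 needs two 16-bit code units for it.

```python
def describe(char):
    """Return the code point and per-encoding byte lengths of one character."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-le")),  # -le avoids the 2-byte BOM
        "utf-32": len(char.encode("utf-32-le")),  # -le avoids the 4-byte BOM
    }


# "\U00020BB7" is a CJK ideograph from a supplementary plane.
for ch in ["A", "中", "\U00020BB7"]:
    print(ch, describe(ch))
```

A platform that counts “characters” therefore has to decide whether it means code points, UTF-16 code units or bytes, and the answer changes how much fits under the limit.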

The above generic research concerns lead me to look for openly available data generated by human translators. Note that many linguistic corpora are not free. In addition, bilingual (not to mention multilingual) corpora are more difficult to find. User-generated content projects such as Wikipedia do not offer an “exact translation” of the same content, even though the content there is free to access. Also, encyclopedic writing styles differ from online conversations.

The TED Open Translation Project seems to be a suitable option. First, it is a project initiated by a well-known and internet-friendly organization that used a professional human translation service to kick-start its outsourced, open-content-style translation of subtitles (sadly *NOT* open content; see its terms and conditions). Second, it provides a multilingual corpus (at least in theory) of “exactly the same content”. Third, it provides textual data that are relatively close to online communication. One can still argue that the “speech” of TED talks differs in style from “net-speak”, but it can also be argued that the speaking style in TED talks is closer to online speech than the legal, government or news broadcasting data commonly used in computational linguistics because of their institutional availability.

Findings based on a pilot study of one TED Talk

To test the water, I conducted a pilot study on one TED Talk, Hans Rosling’s “Stats that reshape your worldview”. I downloaded the transcripts in four language versions (English, Japanese, Simplified Chinese and Traditional Chinese) and then encoded them using the major encoding standards that were typically used at different points in computing history.

The first set of findings, based on Unicode encoding standards, shows that the English/Chinese ratio is 3.63 and the Japanese/Chinese ratio is 1.34. This means that a 140-character tweet (or Sina Weibo message) in Chinese can convey 3.63 times as much content as one in English, or 1.34 times as much as one in Japanese, at least in the case of Rosling’s speech. Put differently, to convey the content of a full 140-character English-language tweet, Chinese-language users need just under 40 characters, and Japanese-language users just over 50.

Table 1: Language, characters and information capacity

| | English | Japanese | Chinese, simplified | Chinese, traditional |
|---|---|---|---|---|
| Equivalent characters needed to express 140-character English content | 140 | 51.79 | 38.91 | 38.55 |
| Number of “Tweets” needed to convey the same message | 3.63 | 1.34 | 1.01 | 1 |
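
As a check on the arithmetic, the “Number of Tweets” figures in Table 1 can be recomputed from the “Equivalent characters” figures: each one is a language’s equivalent character count divided by the smallest count (Traditional Chinese). The sketch below uses only the numbers reported in the table; the variable names are illustrative.

```python
# Equivalent characters needed for 140 English characters (from Table 1).
equivalent_chars = {
    "English": 140.0,
    "Japanese": 51.79,
    "Chinese, simplified": 38.91,
    "Chinese, traditional": 38.55,
}

# The most compact language serves as the baseline of one "tweet".
baseline = min(equivalent_chars.values())

# Ratio of each language to the baseline, rounded to two decimals as in the table.
ratios = {lang: round(n / baseline, 2) for lang, n in equivalent_chars.items()}

for lang, r in ratios.items():
    print(f"{lang}: {r}")
```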

However, it should be pointed out that the findings are actually encoding-dependent. If we turn the clock back ten years, to when Chinese-language content was encoded mostly in Big5 or GB-2312, Japanese in Shift-JIS, and English in ASCII, the numbers are different. The above findings are based on Unicode encoded in UTF-8 or UTF-16.
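
To illustrate this encoding dependence, the following sketch converts the character-based English/Chinese ratio of 3.63 into a byte-based ratio under different encodings. It rests on simplifying assumptions rather than measurements: every English character is taken to be plain ASCII, and every Chinese character is taken to occupy the typical width of a common character in that encoding (2 bytes in GB-2312 or Big5, 3 bytes in UTF-8, 2 bytes in UTF-16). The names `BYTES_PER_CHAR` and `byte_ratio` are hypothetical.

```python
CHAR_RATIO = 3.63  # English/Chinese character ratio from the pilot study

# Assumed bytes per character: (English, Chinese). Typical widths only;
# real text mixes punctuation, digits and rarer characters.
BYTES_PER_CHAR = {
    "ASCII vs GB-2312": (1, 2),
    "ASCII vs Big5": (1, 2),
    "UTF-8": (1, 3),
    "UTF-16": (2, 2),
}


def byte_ratio(char_ratio, en_width, zh_width):
    """English/Chinese ratio measured in bytes instead of characters."""
    return char_ratio * en_width / zh_width


for name, (en_w, zh_w) in BYTES_PER_CHAR.items():
    print(f"{name}: {byte_ratio(CHAR_RATIO, en_w, zh_w):.2f}")
```

Under these assumptions the GB-2312 figure comes out near 1.8, inside the 1.19 to 2.27 range reported by Liberman (2005), though this back-of-the-envelope conversion is no substitute for re-encoding the actual transcripts.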

Future research can use the research design and data collection strategy here to compare the more than 1,000 sample TED talks available for Japanese-English and Chinese-English pairs. I would assume that the overall outcome should not be too far from the pilot findings based on one sample. I would also assume that, given the improved research design and strategy, the ratio of 3.63 is a better estimate than the previous claims of 2 or 5.

It is worth mentioning that the AAAI paper does not seem to provide such ratios, but it does have a figure titled “Figure 1: Entropy per character in a tweet for each of the languages.” under the heading “Information Content per Character”, based on experimental results using 50,000 tweets across all examined languages (Neubig & Duh, 2013). Reading from the figure, the rough Chinese/English ratio would be around 5.2/3.3, or about 1.58. I am not sure why it is a bit off from the other research findings.
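
For readers unfamiliar with the measure, entropy per character can be illustrated with a simple unigram estimate: count how often each character occurs and compute the Shannon entropy of that distribution in bits. This is only a sketch of the concept. I am not claiming that it reproduces the estimation procedure in Neubig and Duh (2013), and the sample strings are toy inputs.

```python
import math
from collections import Counter


def entropy_per_char(text):
    """Unigram Shannon entropy of `text`, in bits per character."""
    if not text:
        raise ValueError("text must not be empty")
    counts = Counter(text)
    total = len(text)
    # Sum of p * log2(1/p) over the observed characters.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy inputs: four equally likely symbols carry 2 bits each,
# while a single repeated symbol carries none.
print(entropy_per_char("abcd"))
print(entropy_per_char("aaaa"))
```

A language with a larger inventory of commonly used characters, such as Chinese, can spread probability over more symbols, which is why its per-character entropy tends to be higher than that of English.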

Major implications

No matter what the exact number is, the above findings have historical, cultural and political implications beyond their technicality: digital Chinese characters not only set the first major milestone for a multilingual Internet, but also symbolize a historical turn for Chinese (and probably East Asian) modernity regarding modern media and literacy (for details, see another blog post).

Further reading

This is a work in progress and thus I welcome any comments. It is difficult for one to claim to have exhausted all the available sources on a topic like this, but you can help me by providing more references.

Thanks to my colleague at the OII, Scott Hale, I learned of a more mathematically advanced attempt to answer this exact question from an information-theoretic angle: “How much is said in a tweet? A multilingual, information-theoretic perspective” (Neubig & Duh, 2013). It will take me some time to digest this paper in full, so I list it here first as future reading.

Also, do try Ben Summers’ Tweet Measurer with your Twitter account to help answer the question: “What’s the equivalent of Twitter’s 140 character limit for non-Latin character sets?” (Summers, 2010).

Note for citation

If you want to cite this post for its estimates, note that its content will be presented in a panel discussion on “Chinese Web data” at the Chinese Internet Research Conference 2013 (CIRC 2013).

References

Chan, Y. (2010, October 12). Microblogs reshape news in China. China Media Project. Retrieved April 16, 2013, from http://cmp.hku.hk/2010/10/12/8021/
Dugan, L. (2011, July 27). 140 Characters On Chinese Twitter Is More Like 500 Characters On Twitter.com. AllTwitter. Retrieved April 15, 2013, from http://www.mediabistro.com/alltwitter/140-characters-on-chinese-twitter-is-more-like-500-characters-on-twitter-com_b11951
Hewitt, D. (2012, July 31). Has Weibo really changed China? BBC. Retrieved from http://www.bbc.co.uk/news/magazine-18887804
Liberman, M. (2005, August 5). Language Log: One world, how many bytes? Language Log. Retrieved April 16, 2013, from http://itre.cis.upenn.edu/~myl/languagelog/archives/002379.html
Neubig, G., & Duh, K. (2013). How much is said in a tweet? A multilingual, information-theoretic perspective. In AAAI Spring Symposium on Analyzing Microtext. Presented at the AAAI Spring Symposium on Analyzing Microtext, Stanford, California. Retrieved from http://www.phontron.com/paper/neubig13sam.pdf
Rosen, R. J. (2011, September 3). How Much Can You Say in 140 Characters? A Lot, if You Speak Japanese. The Atlantic. Retrieved from http://www.theatlantic.com/technology/archive/2011/09/how-much-can-you-say-in-140-characters-a-lot-if-you-speak-japanese/245199/
Ruby, B. (2012, November 16). Twitter versus Weibo: What You Need To Know. The Fearless Group. Retrieved April 15, 2013, from http://thefearlessgroup.com/twitter-versus-weibo-what-you-need-to-know/
Summers, B. (2010, February 1). What’s the equivalent of Twitter’s 140 character limit for non-Latin character sets? Ben Summers’ blog. Information management. Retrieved April 15, 2013, from http://bens.me.uk/2010/twitter-charset-experiment

2 thoughts on “How much can one express in 140 characters? Comparison between English and other languages like Chinese”

  1. Dr. Liao,

    Have you looked into the possibility of analyzing the Press Release corpus? Many companies publish very accurate translations of their press releases in many languages. I think this would be a good set to analyze.

    Thanks,
    Saqib

  2. Hello Saqib, using a press release corpus would have some foreseeable advantages and disadvantages. Obvious advantages would include reflecting commercial interests and public images, as well as the implications for multilingual SEO and communication. However, the main disadvantage would be data availability: I am not sure whether any company is giving its releases out for free research or archival purposes. Still, multilingual press releases from corporations could provide some interesting answers. Thank you for your suggestions!