Basic geo-linguistic analysis of Chinese search engine result pages (SERPs)

This blog post provides a basic geo-linguistic analysis of the findings explained in the previous blog post. Geo-linguistic information can be extracted from the data to examine the geographic and linguistic characteristics of the web links (Liao & Petzold, 2010).

Geo-IP

First, geographic information is extracted from the IP address of each web link, using MaxMind’s GeoIP database. The two tables below show the unweighted and weighted results respectively. The weighted results assign higher visibility scores to the top-ranking search results, based on industry data on click-through rates, whereas the unweighted results assign an equal visibility score to each of the top-10 search results.

Geo-IP distribution: unweighted


Geo-IP distribution: weighted


Across all search engine variants, web pages hosted in the US and in the Chinese-speaking regions of mainland China (CN), Hong Kong (HK) and Taiwan (TW) account for over 97% of the unweighted outcome (97.5% if weighted). This suggests a clear geographic concentration. The makeup of that concentration, however, differs among search engine variants. Overall, Baidu and Yahoo provide the most divergent outcomes, while Google provides more moderate ones. For example, websites hosted in Taiwan are hardly visible on Baidu_CN and Yahoo_CN, whereas those hosted in mainland China are visible on Google_HK and Google_TW.

US- and Taiwan-hosted websites are not visible on Baidu_CN and Yahoo_CN. A comparison of the two tables above also shows that weighting increases the share of US-based websites, particularly for Google_HK, Google_TW, Yahoo_HK and Yahoo_TW.
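The difference between the unweighted and weighted tables can be illustrated with a minimal sketch. It assumes each search result has already been reduced to an (engine variant, rank, hosting country) record. The click-through rates in `ASSUMED_CTR`, the function name `visibility_shares` and the sample records are all hypothetical and are not the industry data or code used for this post.

```python
from collections import defaultdict

# Hypothetical click-through rates for ranks 1..10 (illustrative values only,
# NOT the industry data behind the weighted table in this post).
ASSUMED_CTR = [0.30, 0.15, 0.10, 0.07, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02]


def visibility_shares(results, weights=None):
    """Return {engine: {country: share}} for (engine, rank, country) records.

    With weights=None every rank counts equally (unweighted); otherwise
    weights[rank - 1] is used as the visibility score of that rank.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for engine, rank, country in results:
        score = 1.0 if weights is None else weights[rank - 1]
        totals[engine][country] += score

    shares = {}
    for engine, by_country in totals.items():
        engine_total = sum(by_country.values())
        shares[engine] = {c: s / engine_total for c, s in by_country.items()}
    return shares


# Toy records: (search engine variant, rank, hosting country code).
sample = [
    ("Google_HK", 1, "US"),
    ("Google_HK", 2, "HK"),
    ("Google_HK", 3, "CN"),
    ("Google_HK", 4, "HK"),
]

unweighted = visibility_shares(sample)
weighted = visibility_shares(sample, ASSUMED_CTR)
```

In this toy data the US-hosted page sits at rank 1, so its share is larger in `weighted` than in `unweighted`, which is the kind of shift described above for US-based websites.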

Language scripts

Second, linguistic information is extracted from the summary “snippet” of each search result, including whether the text is written mostly in simplified or traditional Chinese characters. The two tables below show the unweighted and weighted results respectively. If over 90% of the characters that distinguish traditional from simplified Chinese are traditional, the snippet is categorized as traditional (Trad); if over 90% of such characters are simplified, it is categorized as simplified (Simp).

Linguistic distribution: unweighted


Linguistic distribution: weighted


Across all search engine variants, well over 95% of the SERP outcome is written in Chinese (97.5% if weighted). This suggests a clear concentration in the choice of language, regardless of location or search engine. However, the makeup of this concentration on Chinese-language content differs among search engine variants. Again, Baidu and Yahoo provide the most divergent outcomes, while Google provides more moderate ones. For example, traditional Chinese content is hardly visible on Baidu_CN and Yahoo_CN, whereas simplified Chinese content is quite visible on Google_HK and Google_TW.

A comparison of the two tables above also shows that weighting increases the share of US-based websites for Google_HK, Google_TW, Yahoo_HK and Yahoo_TW.
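The 90% rule for labelling a snippet as Trad or Simp can be sketched as follows. This is only an illustration under stated assumptions: the ten-pair `SIMP_TO_TRAD` table stands in for a full list of distinguishing characters, and the function name `classify_script` and the "Mixed" and "Other" fallback labels are hypothetical rather than taken from the analysis itself.

```python
# Tiny illustrative table of simplified -> traditional character pairs.
# A real analysis would need a far larger list of distinguishing characters.
SIMP_TO_TRAD = {
    "国": "國", "语": "語", "汉": "漢", "书": "書", "门": "門",
    "马": "馬", "这": "這", "们": "們", "说": "說", "学": "學",
}
SIMP_CHARS = set(SIMP_TO_TRAD)
TRAD_CHARS = set(SIMP_TO_TRAD.values())


def classify_script(snippet, threshold=0.9):
    """Label a snippet by the share of distinguishing characters per script."""
    trad = sum(ch in TRAD_CHARS for ch in snippet)
    simp = sum(ch in SIMP_CHARS for ch in snippet)
    total = trad + simp
    if total == 0:
        return "Other"   # assumed label: no distinguishing characters found
    if trad / total > threshold:
        return "Trad"
    if simp / total > threshold:
        return "Simp"
    return "Mixed"       # assumed label: neither script exceeds the threshold


labels = [classify_script(s) for s in ("這是國語書", "这是国语书", "國国", "hello")]
```

Characters shared by both scripts (such as 是) are ignored, so only characters that actually distinguish the two scripts influence the label.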

The geo-linguistic analysis of the overall findings indicates a concentration (or convergence) on Chinese-language content hosted in the US and the three main Chinese-speaking regions of mainland China, Hong Kong and Taiwan. However, the findings also diverge in their geo-linguistic makeup. Baidu_CN and Yahoo_CN provide SERPs that are mostly simplified Chinese content hosted in mainland China, whereas Google_HK and Google_TW provide SERPs that are much more mixed.

Chinese Internet?

The overall findings thus indicate a concentration, or convergence, on Chinese-language content hosted in specific regions, particularly mainland China, the US, Hong Kong and Taiwan. However, different localized interfaces also generate different geo-linguistic proportions of content. Google is much more moderate, providing more “cosmopolitan Chinese” results, whereas Yahoo and Baidu provide more “locally Chinese” results.

References

Liao, H.-T., & Petzold, T. (2010). Analysing geo-linguistic dynamics of the World Wide Web: The use of cartograms and network analysis to understand linguistic development in Wikipedia. Cultural Science, 3(2).
