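2008 Taiwan Legislative Election Maps (also: Regional Maps of the 2008 Taiwan/Republic of China Legislative Election Results)
When covering elections, the media often use county- and city-level administrative areas to illustrate the "blue north, green south" pattern of political support (blue refers to the Kuomintang and the People First Party; green refers to the Democratic Progressive Party and the Taiwan Solidarity Union). I find this frequently misleading, so I used the 2008 legislative election results to redraw several maps, described below. I argue that these maps help readers understand the structure of Taiwan's electoral districts, highlight urban-rural differences, and return to the basic democratic idea of counting votes head by head rather than the feudal notion of occupying territory.
(I would actually love to make maps of the latest 2012 legislative and 2012 presidential elections, but the Central Election Commission has not yet produced source tables in an easy-to-use form. If anyone would like to help, please gather the data online and fill in this spreadsheet: Republic of China/Taiwan 2008 and 2012 Election Results. If everyone fills in a little, I will update the maps once it is complete.)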
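Mapping by the finer-grained electoral district rather than by county or city is already an improvement. In the map on the left below, for example, you can see where green appears within the blue north and where blue appears within the green south. Notably, the percentage margin between blue and green is encoded directly in the shade of the color, so you can tell which districts have large margins and which have small ones. Beyond that, I advocate going a step further and using a cartogram, as in the map on the right below, which shows at a glance the number of votes cast in each district. The map on the right rescales the area of each district according to the number of votes cast there: the more total votes a district has, the larger its area. Twice the votes means twice the area.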
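The "twice the votes, twice the area" rule can be sketched as a small calculation. This is only an illustration of the target areas a cartogram aims for, not of the map deformation itself; the function name `cartogram_areas`, the district names, and the vote counts are all hypothetical.

```python
def cartogram_areas(votes, total_area):
    """Return the target area of each district, proportional to its votes.

    votes: dict mapping district name -> number of votes cast (hypothetical data)
    total_area: total map area to be shared out among the districts
    """
    total_votes = sum(votes.values())
    return {district: total_area * v / total_votes for district, v in votes.items()}


# Hypothetical example: district B has twice the votes of district A.
areas = cartogram_areas({"A": 100_000, "B": 200_000}, total_area=300.0)
print(areas)
```

In this made-up example district B ends up with twice the area of district A, regardless of how large either district is geographically.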
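Some may find cartograms unfamiliar, or feel that they distort and mislead. Unfamiliarity may be a real problem, but for conveying election information a cartogram is closer to the facts than a conventional map that presents political territory by geographic area.
In other words, it is the conventional map that distorts when presenting elections. The densely populated five special municipalities not only have multiple electoral districts but also a proportionally larger share of voters and votes, yet they can often barely be seen on the map, so their importance goes unnoticed. The map on the left below marks Taiwan's electoral districts (thin black lines) within each county and city, colored by region: north, central, south, east, and outlying islands. Hualien County, Taitung County, Penghu County, Kinmen County, and Lienchiang County, in green, each have only one district, whereas Taipei City, New Taipei City, Taichung City, Tainan City, and Kaohsiung City (highlighted in darker shades) have many districts, packed so densely that they are hard to show on the map. By comparison, the advantage of the cartogram is obvious in the map on the right below. The green Hualien-Taitung and outlying island counties shrink to reflect their vote counts, while the densely populated five municipalities, along with Taoyuan County, are visibly enlarged, showing in proportion the information an election map should care about most: votes.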
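Comparing the two maps above, one can immediately read off the population structure and political geography. Northern Taiwan does have the largest voting population; the central and southern regions have somewhat fewer voters but still a substantial number; Hualien, Taitung, and the outlying islands have very few by comparison. This reflects basic facts of Taiwan's population structure and political geography.
Some may object that drawing maps on this factual basis marginalizes Hualien, Taitung, and the outlying islands (although such a charge may rest more on aesthetic feeling than on fact). The following two maps likewise use cartograms, this time to show the number of district legislators, that is, one legislator per district (Taiwan has adopted a single-member district system, in which each district elects one legislator). The map on the left below shows the north, central, south, and east regions and the five municipalities; the map on the right shows the blue-green election results. Because legislators are the ones who actually represent voters in the Legislative Yuan, comparing the vote-count cartogram above with the legislator-count cartogram below shows that the green Hualien-Taitung and outlying island counties are advantaged in legislative elections (I personally think this is an appropriate political design for balancing regional development). In other words, the smaller the population of a single-member district, the more weight each vote carries in electing a district legislator.
For the northern, central, and southern regions outside Hualien, Taitung, and the outlying islands, the vote-count and legislator-count cartograms differ little. This reflects the basic principle of equal-value votes behind the drawing of single-member districts. Taiwan is about to carry out its once-a-decade redistricting, whose basic spirit is likewise to reconcile the distribution of votes with the distribution of legislators.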
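This brings me back to the motivation and effort behind spending three working days on these maps instead of finishing my doctoral dissertation. In political news reporting and election analysis we often use maps, but the way we use them is still a bit like conquering territory in a war (for gamers, like playing the strategy game Romance of the Three Kingdoms). However backward and feudal the media's election vocabulary may be, such as "election battle", "assassin", and "lord of a hundred li", and whatever the visual bandwagon effect ("the watermelon leans to the bigger side") when we look at maps, what matters most is still votes and political representation. I personally hope that Taiwan's elections get better analysis, discussion, and reporting, rather than being reduced to whether green or blue covers more area, or to the media's ritual laziness of looking only at blue versus green and north versus south while ignoring urban-rural differences and fine-grained district boundaries.
When reporting and discussing campaigns and election results, let us return to the basic democratic practice of counting heads, not the feudal conquest of planting flags.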
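Note 1: Readers with a deeper interest in cartograms and election results can see the cartograms of US elections, such as Maps of the 2008 US presidential election results and Maps of the 2010 US midterm elections.
Note 2: There are many methods and types of cartograms; the one used here is the diffusion method. Imagine the original boundaries as balloons that diffuse and deform until the density (that is, quantity divided by area) is equal on both sides of every boundary, producing a map in which area is proportional to quantity. The following two figures use a grid to show the result after areas of high population density expand and areas of low population density are compressed: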
Geo-linguistic factors matter when Google and Baidu compete for global reach
Published by Han-Teng Liao, October 1st, 2011, in *OIINEWS

Though I have argued elsewhere that geo-linguistic analysis matters for understanding the inter-linking dynamics of the World Wide Web, it is often overlooked by skeptics who believe that the Web will eventually be dominated by either English or Chinese. Two recent news events should provide further evidence that geo-linguistic factors matter, even for US-based Google and China-based Baidu: Google plans its very first Asian data centers in Hong Kong, Taiwan and Singapore, and Baidu has just received Beijing’s approval to test overseas markets in the Thai-speaking and Arabic-speaking worlds. Geo-linguistic factors matter not only for understanding the dynamics of the Internet; they also shift those dynamics.
Do you think that Google’s choice of Hong Kong, Taiwan and Singapore is just an arbitrary one? Is it just a coincidence that these places happen to host many Chinese-speaking Internet users? When Google “migrated” from mainland China to Hong Kong, I suggested (and predicted) in an op-ed piece published in a Taiwanese newspaper that Google should establish other centers in East Asian cities such as Hong Kong, Tokyo and Taipei, so as to demonstrate that the move out of Beijing was not a defeat but rather the beginning of a “long march”. I specifically used the historical term “long march” as an ironic but strong metaphor to make the case that Google should continue to serve both Chinese-speaking and non-Chinese-speaking populations in East Asia. While not discounting the number of Internet users in China, it is still very important to see the young and cosmopolitan crowds of East Asia as highly interconnected peoples in the region.
Similar evidence can be found in Europe. Google’s EU headquarters in Dublin has openings for localization positions, including “Localization Product Team Lead”. If you look at the job description and note the business processes and professionals that this role involves, both inside and outside Google, you get a sense of how much localization matters to the everyday business of Internet companies. Localization, or L10n, originally referred just to a set of computer code and configuration texts that provide a localized interface for software and websites. Now localization means much more. With the wide adoption of language- and geography-based technologies, localization now captures markets, and potentially the hearts and minds of people, through familiar language and location-based sensibility. If Internet companies do it right, they can be heavily rewarded, with a geo-linguistic-demographic segment of Internet users within their corporate reach. If they do it wrong, they face backlash such as these angry comments on Google’s language detection software.
It is about time for researchers to theorize “Internet users”, or more specifically “information-searchers”. Borrowing Ien Ang’s concepts of TV viewers as “audience-as-public” and “audience-as-market”, I argue that Internet users can likewise be conceptualized as “searchers-as-public” and “searchers-as-market”.
In the globalized TV industry, the “audience-as-market” is first zoned by multinationals into “geo-linguistic regions”. Then, within each smaller geographically bounded region (after all, TV distribution channels are not as universal as the World Wide Web), the audience is segmented into demographics, and the channels and programming of TV products are organized and measured according to those demographics. Advertising money and expertise can thus operate within the knowledge system of managing the “audience-as-market”.
In the globalized Internet industry, the “searchers-as-market” is likewise partitioned into “geo-linguistic profiles”, such as zh-TW for the traditional Chinese used in Taiwan and en-GB for the English used in the UK. Unlike the globalized TV industry, where broadcast and cable TV are still geographically bounded, these “geo-linguistic profiles” are configurable and malleable. The content that shows up on your laptop or computer is not as predetermined as the content on your TV screen. If your computer is configured with a specific “geo-linguistic profile”, and the website or online service you are using provides a service tailored to that profile, you are not limited by geographic constraints. This is why I can use the Taiwanese, Hong Kong and mainland Chinese versions of the Google search engine even though I am not in Taiwan, Hong Kong or mainland China.
The power of digital and networked technologies does not stop there. In addition to overcoming some geographic constraints, the basis of search-engine-related industries, i.e. search keywords and the links around them, provides a much more flexible calculating device for segmenting searchers than the “people meter” used in the TV industry. Instead of relying only on the elusive and sometimes intangible guesswork of demographics derived from quantitative sociology, the Internet industry can use keywords to measure and analyze searchers. In effect, keywords are the equivalent of “channels” as a distribution mechanism. Think about the role of hashtags on Twitter and other microblogs. Think about Google AdWords. Think about Microsoft adCenter. They all use the power of keywords to reach a certain geo-linguistic demographic, narrowing down even more effectively with specific keywords that circulate among users who have the right geo-linguistic profile and the right search words. Here we witness the birth of the “searchers-as-public” and “searchers-as-market”. This is probably the underlying structure on which revolutions can (or cannot) snowball, and it is indeed the dominant underlying structure through which Internet advertising money is gathered and distributed. Try the tools provided by Google AdWords and Microsoft adCenter, and you will realise that each keyword (or set of keywords) in each geo-linguistic profile (or geo-linguistic virtual region) already has a price tag on it for bidders.
We need more theories and empirical work on the knowledge and production of the “searchers-as-public” and “searchers-as-market”.
Data is data. Period?


Both of the maps above visualize the 2008 US presidential election results by coloring each county with different colors: red for the Republican candidate John McCain, and blue for the Democratic candidate Barack Obama. Which is more “distorted” and which one is closer to the truth?
The difference between the two maps is the size at which each county is shown. The area of a county on the left follows the usual expectation of a geographic map, where the spatial size of the geographic area is represented. The area of a county on the right, however, is adjusted (or distorted) to represent the number of votes in that particular area.
The map on the left may give the wrong impression that the election should have been a win for the red Republican candidate. In contrast, the map on the right gives a better indication that the popular vote was won by the blue Democratic candidate.
Thus, while the map on the left distorts little in terms of geography, the map on the right is the better representation in this case, where the number of votes counts for more than the area of a county when comparing election results.
The above is one of an increasing number of cases that raise issues in data visualization, including persuasion versus manipulation. Some argue that “good charts merely present data, and leave the analysis (obvious though it may be) to the viewer”, while others argue that additional data visualization effort can improve viewers’ comprehension and recall.
Other OII people and collaborators, including Mark Graham, Scott A. Hale, Taylor Shelton, Matthew Zook and Monica Stephens, have made an impressive effort to visualise important Internet indicators in the “Visualizing Data” project. Thanks to the environment provided by Academia Sinica and the Oxford Internet Institute, I have used both network analysis software and Geographic Information Systems (GIS) to visualise data for my research projects: a comparative study of Baidu Baike and Chinese Wikipedia, and a linguistic networked readiness index for selected languages in India and China. However, these visualizations may receive a mixed reception from audiences with different expectations of graphs, maps and charts. It seems to me that, with the exciting new open data and Internet data, we need to revisit the old philosophical issues in social science regarding “evidence”, “explanations”, “presentations” and “interpretations”. Data visualizations are just too powerful either to be ignored or to be taken at face value.
Some are closer than the other: some patterns of inter-language links among different language versions of Wikipedia
Published by Han-Teng Liao, September 5th, 2011, in *OIINEWS

Interlanguage links are the basic cross-language (and thus potentially cross-cultural) links between different language versions of Wikipedia. They are everywhere when you read an encyclopedia article on Wikipedia. They show up in the sidebar to the left of the article, leading readers to another language version of Wikipedia that provides the exact or equivalent entry.
Thanks to the data and tools provided by Toolserver, operated by Wikimedia Deutschland e.V. with assistance from the Wikimedia Foundation, I have managed to process the data and visualize the outcome with my co-author Thomas Petzold. What follows are a few selected languages, each shown with the other languages that have the strongest connections to (or dependency on) it arranged around it.
This is for Russian language version (ru):
This is for Chinese language version (zh):
This is for Arabic language version (ar):
This is for Turkish language version (tr):
There are many potential reasons (e.g. language script, political or cultural ties) why some languages are closer than others. What can we learn from the inter-language linkage among different versions of Wikipedia? How can we foster more critical mutual understanding and cultural exchange among certain languages? How do we understand the need for a separate language version of Wikipedia, such as the Egyptian Arabic version in relation to the Arabic version? The story of inter-language linkage within the global Wikipedia may still be unfolding.
Difference in “proportional emphasis”- Baidu Baike and Chinese Wikipedia Comparison
Published by Han-Teng Liao, September 4th, 2011, in *OIINEWS, Chinese-written Internet, geo-linguistic analysis, thesis, Wikipedia

The major difference in world coverage between Baidu Baike and Chinese Wikipedia, particularly in terms of proportional emphasis in regions where both have web links, can be shown on a map with an adequate comparative measurement. Region by region, Figure 5‑7 maps the absolute number of external links that Baidu Baike and Chinese Wikipedia have. These numbers are compared in each region in order to show their comparative proportional emphasis. For each region, the colour of the bubble (and area) represents which encyclopedia has more links, while the size of the bubble indicates how many more links one online encyclopedia has than the other. First, a region is coloured blue if Chinese Wikipedia has more links, and red if Baidu Baike has more. It is already evident that Chinese Wikipedia has more links in a massive number of regions, whereas Baidu Baike has somewhat more links in a few scattered regions: mainland China and some regions in Africa.
Figure 5‑7. Coverage of the world: comparison based on absolute numbers
Second, Figure 5‑7 also shows the difference in proportional emphasis through the size of the bubbles. Recall that in Table 5‑5 the thesis defined a simple measurement of difference for comparative purposes. For each region where Baidu Baike and Chinese Wikipedia both have links, the level of difference between the two can thus be calculated and visualised as the size of the bubble, in proportion to the level of difference defined in Table 5‑5. The bigger the bubble, the larger the difference. For example, a difference value of 1.00, as shown in the map legend, means that one encyclopedia has ten times as many links as the other.
The map clearly shows that the coverage difference in regions such as China and the US is relatively small compared to the difference in Europe, the Middle East and South America. Though there is a clear difference in mainland China, where Baidu Baike has more links than Chinese Wikipedia, and the other way around in the US, the difference is not as salient as in other areas. Again, among the generally “link-have-less” regions such as South and Central Asia, the Middle East, and southern and northern African countries, Baidu Baike seems to perform even more poorly than Chinese Wikipedia. Among the “link-have-more” regions such as East Asia, Europe and North America, Baidu Baike has more links only in mainland China. Hence, Chinese Wikipedia has an overall advantage in world coverage, based on both the ccTLD and the geo-IP lookup results.
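Table 5‑5 is not reproduced in this post, so the exact measurement is not shown here. The legend’s statement that a value of 1.00 corresponds to a tenfold difference would be consistent with a base-10 logarithm of the ratio of link counts, and the sketch below simply assumes that definition. The function name `log10_difference` and the sample numbers are hypothetical.

```python
import math


def log10_difference(baidu_links, wikipedia_links):
    """Signed difference under the ASSUMED definition log10(wikipedia / baidu).

    Positive: Chinese Wikipedia has more links (blue on the map).
    Negative: Baidu Baike has more links (red on the map).
    The absolute value would drive the bubble size.
    Only defined for regions where both encyclopedias have links.
    """
    if baidu_links <= 0 or wikipedia_links <= 0:
        raise ValueError("both encyclopedias must have at least one link")
    return math.log10(wikipedia_links / baidu_links)


# Hypothetical region: 100 links from Baidu Baike, 1,000 from Chinese Wikipedia.
print(log10_difference(100, 1000))
```

Under this assumed definition, swapping the two counts flips the sign but keeps the magnitude, so the bubble size stays the same while the colour changes.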
Similarities and differences in the coverage of the world by Baidu Baike and Chinese Wikipedia
Published by Han-Teng Liao, September 4th, 2011, in *OIINEWS, geo-linguistic analysis, thesis, Wikipedia

A series of additional maps also reveals that, although both encyclopedias show similar patterns in which regions have more links and which have fewer, Chinese Wikipedia has more links than Baidu Baike in most regions, including the “link-have-rich” regions. In other words, Chinese Wikipedia has wider coverage of the world than Baidu Baike.
Figure 5‑5 shows the overall coverage of external links for each region based on the ccTLD categorisation results. The bigger the bubble for a given region, the more external links have been identified for that region. The “link-have-rich” areas, including East Asian countries such as China, Japan and Taiwan, and major European and North American countries, receive relatively more links from both Baidu Baike and Chinese Wikipedia, in contrast to the “link-have-less” areas in Latin America and Africa. The similarity might be attributed either to the skewed geographic distribution of Internet development, which favours East Asia, Europe and North America, or to the relative interest of Chinese-language Internet users in different parts of the world.
The two additional choropleth maps in Figure 5‑6, showing the geo-IP lookup categorisation results, exhibit a similar pattern (more links for regions in East Asia, Europe and North America, and fewer for South America and Africa), with one exception: the US. The US is much more prominent in the geo-IP lookup results than in the ccTLD results. This significant visual difference between the two sets is expected, for two major reasons. First, US websites in general do not often use the ccTLD “.us”. Instead, most US websites use gTLDs such as “.com” and “.edu” directly, as if these were “American”, so they are included in the geo-IP lookup results but missing from the ccTLD results. Second, many international organizations and multinational corporations host their region-specific websites in the US. For example, many Chinese companies with “.cn” domain names choose to host some of their web servers in the US. Examples are shown in Table 5‑3, which includes both the organ newspaper of the Chinese Communist Party, the People’s Daily, and one of the major Chinese portal websites, Sina.com. Thus, the US as a region with the country code “.us” is expected to be under-represented in the ccTLD categorisation and at the same time over-represented in the geo-IP lookup. This is an important caveat for any researcher analysing the geographic features of web links.
Figure 5‑6. Coverage of the world: distribution of geo-IP lookup
Comprehensive coverage versus sole mainland emphasis
Published by Han-Teng Liao, September 4th, 2011, in *OIINEWS, Chinese-written Internet, Wikipedia

The results show that Chinese Wikipedia has more comprehensive coverage of the world, whereas Baidu Baike places its sole emphasis on mainland China.
First, based simply on the number of distinct geographic categories (or country codes) among the external links of Baidu Baike and Chinese Wikipedia, Chinese Wikipedia has more comprehensive coverage of the world even though it has fewer external links in total than Baidu Baike, as summarized in Table 5‑6. The ccTLD categorisation shows that Baidu Baike covers 139 distinct regions whereas Chinese Wikipedia covers 225. The geo-IP lookup categorisation shows that Baidu Baike has only 113 distinct country codes while Chinese Wikipedia has 196.
Table 5‑6
Numbers of non-zero categories in ccTLD and geo-IP lookup outcome
| Encyclopedia | External web links | Country codes (via ccTLD) | Country codes (via geo-IP lookup) |
| --- | --- | --- | --- |
| Baidu Baike | 1,303,240 | 139 | 113 |
| Chinese Wikipedia | 719,016 | 225 | 196 |
Second, the two choropleth maps in Figure 5‑4 visualise which regions are “not” covered, providing further evidence of the greater coverage of Chinese Wikipedia. Since both encyclopedias cover more than 100 regions, it is easier to visualize the opposite (what is not covered) in order to show the contrast in their respective coverage. Neither encyclopedia yet covers some African countries such as Chad, Congo, Gabon, and Ethiopia. Still, Chinese Wikipedia covers more African, Asian, European and Latin American countries than Baidu Baike. It is unexpected that Nepal, a bordering country diplomatically friendly to China, does not receive any external web link from Baidu Baike under either the ccTLD or the geo-IP categorisation, whereas Chinese Wikipedia’s coverage does include Nepal.
Figure 5‑4. Regions that are not covered
Both Table 5‑6 and Figure 5‑4 demonstrate that Baidu Baike does not cover as many regions as Chinese Wikipedia does. As expected, the map based on the ccTLD categorisation approach does not fully match the one based on the geo-IP data set. For example, Sudan does not receive any web links under its ccTLD “.sd”, but according to the geo-IP data set it receives web links from both Baidu Baike and Chinese Wikipedia. Baidu Baike’s link, http://www.krcsd.com, is the official website of Khartoum Refinery Company, a Sino-Sudanese joint venture between the Ministry of Energy & Mining of Sudan and the government-owned China National Petroleum. Chinese Wikipedia’s link is http://www.sudanair.com, the official website of the national airline of Sudan. Neither website uses the ccTLD “.sd”, but both are hosted in Sudan according to the geo-IP lookup of the actual geographic locations of their web servers.
Geographic categorisation based on ccTLD and geo-IP lookup
Published by Han-Teng Liao, September 4th, 2011, in geo-linguistic analysis

In general, two geographic categorisation methods can be applied to web links. One extracts the nominal country code top-level domain name (ccTLD) directly from the URL; the other looks up the geographical location of the IP address (geo-IP lookup) where the web server is hosted.
- ccTLD-based categorisation
Country code top-level domain names (ccTLDs), along with other top-level domain names such as generic top-level domain names (gTLDs), provide an explicit way to categorise web links based on the relevant geographic and institutional information. A web link contains a web address written according to the naming convention, or web standard, called the Uniform Resource Locator (URL). Given a typical example like “http://people.oii.ox.ac.uk/hanteng/about/cv.htm”, experienced web users in the UK will immediately notice, from the subcomponent “ac.uk”, that the web address belongs to an academic institution in the UK.
Although detailed knowledge of URLs may sometimes be obscure and technical for average users, most web users have some experience of typing a URL into a browser to access a web page or download a music file. Thus, just as users of mail services and telephones may have only a working knowledge of zip codes (for mail addresses) or international and local area codes (for phone numbers), experienced Internet users can be expected to recognise these geographic and institutional markers inside a URL, with different levels of competency and familiarity.
Since top-level domain names usually include markers of institutions and countries, researchers can use them to categorise web links. The most widely recognised institutional top-level domain names include “.com” for companies, “.org” for organisations and “.gov” for governments. Widely recognised country top-level domain names include “.uk” for the UK, “.ru” for Russia and “.cn” for China. There are also less well-known ones such as “.cat” for Catalan. Since the thesis is concerned with boundary-making on the Chinese-language Internet, top-level domain names such as “.cn” (mainland China), “.hk” (Hong Kong), “.mo” (Macao), “.sg” (Singapore), and “.tw” (Taiwan) can roughly indicate some general tendencies among the external links inside Baidu Baike’s and Chinese Wikipedia’s articles.
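As a rough illustration of ccTLD-based categorisation, the sketch below pulls the last label out of a URL’s hostname and checks it against a small set of country codes. It is not the thesis’s actual script: the function name `tld_category` is hypothetical, and the `CCTLDS` set is a deliberately tiny sample taken from the examples above rather than the full list of country codes.

```python
from urllib.parse import urlparse

# Illustrative subset only; a real analysis would need the full ccTLD list.
CCTLDS = {"uk", "ru", "cn", "hk", "mo", "sg", "tw"}


def tld_category(url):
    """Return ("ccTLD", code) or ("other", tld) for an absolute URL.

    Returns ("unknown", None) when no hostname can be parsed,
    for example when the URL has no scheme.
    """
    host = urlparse(url).hostname
    if not host:
        return ("unknown", None)
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in CCTLDS:
        return ("ccTLD", tld)
    return ("other", tld)


print(tld_category("http://people.oii.ox.ac.uk/hanteng/about/cv.htm"))
print(tld_category("http://www.sudanair.com"))
```

Links that fall into the “other” bucket, such as “.com” addresses, are exactly the ones this approach cannot place geographically.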
- geo-IP lookup-based categorisation
Geographic location lookup by Internet Protocol address (geo-IP lookup) provides another way of categorising web links geographically. Instead of relying on the domain names embedded in a web address, the geo-IP lookup approach uses the corresponding Internet Protocol (IP) addresses to identify geographic locations. Going from a given web link to a geographic location involves two procedures. First, the IP address (the standard numerical address that underlies all web applications) needs to be looked up from the domain name of the external link. Then the geographic information (e.g. country, region, or even city) can be looked up from that IP address. Table 5‑3 shows some examples of geo-IP lookup, in which some hostnames with the ccTLD “.cn” turn out to be hosted in California or Georgia in the US.
A quick note on the services and database used for the geo-IP lookup in the thesis. The Google Public Domain Name System (DNS) resolution service is used for the first procedure, and the free GeoLite Country database provided by MaxMind for the second. Though these two are not the only lookup services and databases available, they have been used by many users, including researchers, and they are free. The Google Public DNS resolution service promises absolutely no redirection (or DNS hijacking, which refers to falsely returning alternative IP addresses for a given domain name lookup), and MaxMind promises 99.5% accuracy in looking up geographic locations with its free GeoLite Country database.
Unlike the ccTLD approach, which uses the explicit domain names, the geo-IP lookup approach uses the actual IP addresses of the web servers to identify geographic locations. For this reason, all websites can be assigned a geographic category, including those that have no ccTLD such as “.cn” but only a gTLD such as “.com” or “.net”. For example, as shown in Table 5‑3, the website “sina.com” is identified as hosted in California. In addition, the geo-IP lookup approach provides geographic information about the web servers that sometimes differs from the explicit geographic markers used in the ccTLD approach. As also shown in Table 5‑3, the official website of China’s party newspaper, the People’s Daily, is hosted both in China and in the US, with the domain name “people.com.cn” hosted in Beijing and the other, “www.people.com.cn”, hosted on servers in the US operated by a content-delivery company called “ChinaCache North America”. Similarly, one of the major Chinese portal websites, Sina.com, also has different servers hosted in China and the US for different domain names. This additional geographic and hosting information, potentially related to jurisdiction-dependent regulations, is particularly important for the Chinese-language Internet, with websites hosted in different corners of the world, a point to which the thesis will return later.
The pros and cons of the ccTLD and geo-IP lookup approaches are summarised below, before the external links of Baidu Baike and Chinese Wikipedia are categorised. The major advantage of the ccTLD approach is that the hosts of web pages explicitly signal their geographic attributes and/or targets. For example, a web link containing the top-level domain name “.cn”, the ccTLD for mainland China, in effect explicitly suggests that the link may target users in China, may lead to a server in China, or may simply host web content about China. However, the major drawback of this first approach is that it cannot determine the geographic locations of web links that do not contain ccTLDs. It can be complemented by the greater coverage of the geo-IP lookup approach, which provides geographic information based on the IP addresses of the web servers; this is the major advantage of the second approach. An additional significant advantage of the geo-IP lookup approach is that it can potentially provide finer-grained geographic information, from the level of countries down to the level of provinces and cities. This is the main reason why the Internet industry has gathered such information to provide services such as fraud prevention, geographic targeting and content delivery. The major drawback of mapping IP addresses to geographical locations is that its dynamic nature requires more up-to-date data. For the purpose of this section, which compares the world coverage of the two encyclopedias, both categorisation schemes are applied to provide complementary views of the geographic distribution of the external-link data set.
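The two procedures can be sketched as two small functions chained together. This is a minimal illustration under stated assumptions, not the thesis’s actual pipeline: it uses the operating system’s resolver rather than a specific DNS service, and `TOY_GEO_DB` is a made-up stand-in for a real geo-IP database, mapping reserved documentation networks to arbitrary country codes. All names here are hypothetical.

```python
import ipaddress
import socket

# Made-up stand-in for a geo-IP database: documentation-only networks
# mapped to arbitrary country codes, purely for illustration.
TOY_GEO_DB = [
    (ipaddress.ip_network("192.0.2.0/24"), "CN"),
    (ipaddress.ip_network("198.51.100.0/24"), "US"),
]


def resolve(hostname):
    """Procedure 1: hostname -> IPv4 address (system resolver, needs network)."""
    return socket.gethostbyname(hostname)


def country_of(ip, geo_db=TOY_GEO_DB):
    """Procedure 2: IP address -> country code, or None if nothing matches."""
    addr = ipaddress.ip_address(ip)
    for network, country in geo_db:
        if addr in network:
            return country
    return None


def geo_ip_lookup(hostname, resolver=resolve, geo_db=TOY_GEO_DB):
    """Chain the two procedures; the resolver is injectable for offline use."""
    return country_of(resolver(hostname), geo_db)


# Offline demonstration with a fake resolver: a hypothetical ".cn" hostname
# whose server address falls in the toy "US" network.
fake_resolver = {"www.example.cn": "198.51.100.7"}.get
print(geo_ip_lookup("www.example.cn", resolver=fake_resolver))
```

Keeping the resolver as a parameter makes it easy to swap in a different DNS service or to replay previously recorded lookups, which matters because the mapping from hostnames to IP addresses changes over time.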
Data selection and analytical strategies – Comparing the content of Baidu Baike and Chinese Wikipedia (2/2)
Published by Han-Teng Liao, September 4th, 2011, in Chinese-written Internet, Wikipedia

Data selection of external links. Before the collected data set is categorised and analysed further, some notes follow on the systematic way in which valid “external links” were gathered and selected. First, only the web links inside the text of an article page (including references) are collected, leaving out the navigation links in other parts of the web page (such as sidebars). This choice is justified because the aim here is to examine the text itself, not the navigation structure of the website. Second, the definition of external links here is stricter than the usual technical one. To be included in the data set, the web links contained in the text have to be not only technically external links (i.e. web links written as absolute references rather than local or relative ones), but also actual external links to the websites of other organisations. Thus, the collected data set excludes web links that lead to other websites hosted by the same organisation that hosts the encyclopedia in question. As shown in Table 5‑2, for Baidu Baike, any web links that lead to other Baidu websites, including various services and discussion forums, are excluded. Similarly, for Chinese Wikipedia, any web links that point to other language versions or to sister projects hosted by the Wikimedia Foundation are also excluded.
Table 5‑2
Exclusion criteria for the external web links
| Encyclopedia | Web links treated as internal links |
| --- | --- |
| Baidu Baike | *.baidu.com |
| Chinese Wikipedia | *.wikipedia.org, *.wiktionary.org, *.wikibooks.org, *.wikisource.org, *.wikimedia.org, *.wikimedia.de, *.wikinews.org, *.wikiquote.org, *.mediawiki.org, *.wikimediafoundation.org |
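The exclusion rule in Table 5‑2 can be sketched as a wildcard match on the hostname of each link. This is a hedged illustration rather than the thesis’s actual filter: the function name `is_external` is hypothetical, the pattern lists are abbreviated (see Table 5‑2 for the full set), and the patterns are applied literally, so a bare hostname such as “baidu.com” without a subdomain would not match “*.baidu.com”.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Abbreviated pattern lists for illustration; Table 5-2 has the full set.
INTERNAL_PATTERNS = {
    "Baidu Baike": ["*.baidu.com"],
    "Chinese Wikipedia": ["*.wikipedia.org", "*.wikimedia.org", "*.wikibooks.org"],
}


def is_external(url, encyclopedia):
    """True if the link's hostname matches none of the internal patterns."""
    host = (urlparse(url).hostname or "").lower()
    patterns = INTERNAL_PATTERNS[encyclopedia]
    return not any(fnmatch(host, pattern) for pattern in patterns)


print(is_external("http://zhidao.baidu.com/question/1", "Baidu Baike"))
print(is_external("http://www.sudanair.com", "Chinese Wikipedia"))
```

The first call reports a Baidu-hosted link as internal and the second reports an outside website as external, which mirrors the stricter definition of “external link” described above.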
With the aim of showing and comparing how Baidu Baike and Chinese Wikipedia cover the world differently, the following sections explain how this thesis extracted the geographic information from the web links, visualized the extracted geographic data, and finally analysed how differently or similarly these web links are distributed across geographic regions for the two encyclopedias.
Analytical Strategies. So that readers can easily grasp the overall world coverage of the external-link data set without being overwhelmed too quickly by technical and statistical details, the following sections tell the story from the middle of the action (in medias res). Since the data processing involves extracting, categorising, visualizing and comparing the web links at hand, the following sections start from the middle, that is, by showing the distribution of web links across regions from “link-have-rich” categories to “link-have-less” ones. Readers can navigate the process from the middle: upwards to see how the data is extracted (see the next section on geographic categorisation), and downwards to examine how the data is visualized, compared and analysed (see the section after that).
Using the geo-IP lookup data set as an example (see also the next section on geographic categorisation based on the geo-locations of IP addresses), Figure 5-2 shows the distribution of links across regions in descending order, from the region with the most links on the left to the region with the fewest on the right. The distribution, for both Baidu Baike (shown in red) and Chinese Wikipedia (shown in blue), is typically skewed, in the sense that a few “link-have-more” regions have an enormous number of links, whereas a massive number of “link-have-less” regions receive few links. For example, the top five regions, shown in further detail in the first graph of the second row in Figure 5-2, receive a large number of links (over 40,000 links from Chinese Wikipedia and over 10,000 links from Baidu Baike). In contrast, the 65th to 70th regions, as also shown in the first graph of the second row in Figure 5-2, receive only a small number of links (fewer than 100 links from Chinese Wikipedia and about 5 links from Baidu Baike). Proportion-wise, the top five regions account for 73% of the total links for Chinese Wikipedia and as much as 93% for Baidu Baike. Since the top five regions are only a small fraction of the total number of regions (more than 200 regions are covered by the data set), as shown in the top graph of Figure 5-2, the distribution is skewed from the few “link-have-more” regions to the massive number of “link-have-less” ones.
Figure 5-2. Typical statistical distribution: the Geo-IP lookup data set
Figure 5-2 also shows how Baidu Baike and Chinese Wikipedia differ statistically in their distribution of external links. Baidu Baike has a much more skewed distribution than Chinese Wikipedia, because Baidu Baike has many more links only in the top two regions and fewer links in all other regions. Although it is already intriguing to see China and the United States take the top two ranks in both encyclopedias, in a different order, further analysis and interpretation require a basic understanding of the nature of the data (the geo-IP lookup, see the next section) and sound analytical strategies for comparing the overall skewed distribution.
The skewed distribution presents a challenge for meaningful data reduction and comparison. On the one hand, researchers can focus their analysis only on the top “link-have-more” regions, while overlooking the large number of “link-have-less” regions. On the other hand, researchers can compare the coverage by considering the “link-have-less” or even “link-nothing” regions, while missing the details of how the “link-have-more” regions compare among themselves. Although there is no easy general answer to the issue, it is clear that for the purpose of the thesis the issue has to be resolved at least for the data set at hand. For example, the second and third graphs in Figure 5-2 show that even the middle-ranking regions are significant in showing how the two encyclopedias cover the world. In particular, Baidu Baike has the tag line “covering all knowledge domains, serving all Internet users” (涵盖所有领域知识、服务所有互联网用户), and it would be interesting to see whether Baidu Baike, with nearly twice the number of external links, has better coverage of the world.
To make sense of data with such a geographically skewed distribution, it is necessary to revisit the rationale of this comparative research on the Chinese-written user-generated encyclopedias. On the one hand, a comprehensive coverage of the world is expected because of the “encyclopedic” aim of collecting all human knowledge in the world. On the other hand, a proportional coverage of the world, based on what counts as “relevant” and “notable” knowledge for Chinese-language users, is also expected because the body of general knowledge is written by and for a certain linguistic group of users. Thus, the thesis should compare first the comprehensiveness and then the proportionality of the coverage. For the comprehensiveness comparison, the focus is on which regions are covered. For the proportionality comparison, the focus is on which regions receive proportionally more attention from Baidu Baike or Chinese Wikipedia. This is thus a pragmatic strategy both to resolve the general issue of dealing with a geographically skewed data set, and to answer the empirical and theoretical questions regarding the comprehensive coverage and proportional emphasis of the content of Chinese Wikipedia and Baidu Baike.
For the comprehensive coverage comparison, a series of choropleth maps is produced to show the geographic distribution of the links. Since a choropleth map shows how a statistical variable is distributed across different areas, the choropleth maps based on the external-link data set should provide a world-wide comparison of the coverage of the two encyclopedias. For the proportionality comparison, a few regions are selected for detailed analysis to complement the comparative coverage choropleth maps.
Now that the overall analytical direction has been settled, namely to compare the comprehensive coverage and the proportional emphasis of the external links in turn, it is time to describe how the categorisation is conducted by extracting geographic information in the first place. Two geographic categorization schemes are designed and applied to the external-link data set. The first is based on country-code top-level domain names (ccTLD) and the second on the geographic locations of IP addresses (geo-IP lookup). Although each has certain advantages and limitations (which will be discussed briefly in the following section), the two combined should provide a complementary and arguably best-effort geographic analysis based on general web links.
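The "top five regions" shares quoted above can be computed with a few lines of code. The following is a minimal Python sketch, not the tooling used for the thesis: the function name `top_k_share` and the per-region counts are hypothetical and serve only to illustrate the calculation.

```python
def top_k_share(link_counts, k=5):
    """Return the fraction of all links held by the k regions with the most links."""
    ranked = sorted(link_counts.values(), reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0  # no links at all, so no region holds any share
    return sum(ranked[:k]) / total


# Hypothetical per-region link counts, NOT the thesis data.
example_counts = {"A": 500, "B": 300, "C": 100, "D": 50, "E": 30, "F": 15, "G": 5}

# Share of links held by the two best-linked regions in the toy data.
print(top_k_share(example_counts, k=2))
```

A higher top-k share for the same k indicates a more skewed distribution, which is the sense in which Baidu Baike (93% in the top five) is more skewed than Chinese Wikipedia (73%).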
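The two comparisons can be made concrete in code. The sketch below is only an illustration under stated assumptions: the helper names `coverage` and `shares`, the region labels and all counts are hypothetical, and the thesis may have operationalised both notions differently.

```python
def coverage(link_counts):
    """Comprehensiveness: the set of regions that receive at least one link."""
    return {region for region, n in link_counts.items() if n > 0}


def shares(link_counts):
    """Proportionality: each region's share of the encyclopedia's total links."""
    total = sum(link_counts.values())
    return {r: (n / total if total else 0.0) for r, n in link_counts.items()}


# Hypothetical counts for two encyclopedias over four made-up regions.
enc_a = {"R1": 900, "R2": 90, "R3": 10, "R4": 0}
enc_b = {"R1": 300, "R2": 400, "R3": 200, "R4": 100}

# Comprehensiveness: regions covered by one encyclopedia but not the other.
only_in_b = coverage(enc_b) - coverage(enc_a)

# Proportionality: positive values mean enc_a gives the region a larger share.
share_a, share_b = shares(enc_a), shares(enc_b)
share_gap = {r: share_a[r] - share_b[r] for r in enc_a}

print(only_in_b)
print(share_gap)
```

Comparing shares rather than raw counts keeps the comparison meaningful even though one encyclopedia has many more links in total.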
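The first scheme can be sketched in a few lines of Python using only the standard library. This is an assumption-laden illustration rather than the thesis's actual code: the function name `cctld_of` is hypothetical, any two-letter ASCII top-level label is treated as a country code without checking it against the official list of assigned ccTLDs, and the example URLs are made up. The geo-IP lookup is not sketched because it depends on an external IP-location database.

```python
from urllib.parse import urlparse


def cctld_of(url):
    """Return the two-letter country-code TLD of a URL, or None if there is none."""
    host = urlparse(url).hostname  # lower-cased host name, or None if absent
    if not host:
        return None
    host = host.rstrip(".")
    if "." not in host:
        return None
    last_label = host.rsplit(".", 1)[-1]
    # Assumption: a two-letter ASCII label is a ccTLD; generic TLDs such as
    # .com or .org carry no country information and yield None.
    if len(last_label) == 2 and last_label.isascii() and last_label.isalpha():
        return last_label
    return None


for url in ["http://www.example.cn/page", "https://example.com/", "http://192.0.2.1/"]:
    print(url, cctld_of(url))
```

Links under generic domains and bare IP addresses fall outside this scheme, which is one reason a second, IP-based scheme is a useful complement.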
Data selection and analytical strategies – Comparing the content of Baidu Baike and Chinese Wikipedia (1/2)
Published by Han-Teng Liao, September 4th, 2011, in Wikipedia
To assess the geographic and linguistic features of the external links of Baidu Baike and Chinese Wikipedia, a set of data was collected in June 2010 for both websites. The data set is in turn analysed according to its geographic and linguistic features, using a set of geo-linguistic analytical tools that were developed particularly for the thesis but are potentially applicable to other geographic and linguistic contexts. Special attention has been paid to the skewed distribution of the data across geographic and linguistic differences. The data selection and analytical strategies discussed here thus lay the groundwork for the next section on geographic categorisation and visualisation, which demonstrates the difference in the world coverage of the two encyclopedias.
Data set. The data set includes three components: (1) the encyclopedia article pages, (2) the external web links extracted from the article pages, and (3) the externally linked web pages. By 2010, the then four-year-old Baidu Baike had more encyclopedia articles and more external web links than the then eight-year-old Chinese Wikipedia, as shown in Table 5-1. Baidu Baike had about six times the number of articles and nearly twice the number of external links.
Table 5‑1
Numbers of collected article pages, external web links and pages
| Encyclopedia | Articles | External web links | External web pages |
| Baidu Baike | 2,160,620 | 1,303,240 | 1,174,039 |
| Chinese Wikipedia | 362,213 | 719,016 | 673,790 |
Note that not every external web link is well-formed according to common Web standards, and not every well-formed link leads to an external web page that actually existed during the data collection period. Thus, the numbers of collected external web pages, as shown in the last column of Table 5‑1, are not 100% but rather slightly over 90% of the numbers of external web links. The numbers confirm that Baidu Baike has more articles than Chinese Wikipedia and further show that its number of web links is not proportionally larger: six times the number of articles but less than twice the number of external links. If more external web links indicate articles that are potentially better sourced with more citations, the numbers show that Chinese Wikipedia is indeed better sourced per article than Baidu Baike, which may support the notion that Chinese Wikipedia has better quality. Indeed, if we further examine the word count (the length of articles) and the distribution of external links, the Chinese Wikipedia articles seem to be subject to more editorial scrutiny to avoid short articles, as detailed in the following paragraphs with the evidence shown in Figure 5‑1.
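The ratios discussed above follow directly from Table 5‑1. As a quick check, the following Python sketch recomputes them from the table's figures; the variable names are illustrative only.

```python
# Figures copied from Table 5-1.
table = {
    "Baidu Baike": {"articles": 2_160_620, "links": 1_303_240, "pages": 1_174_039},
    "Chinese Wikipedia": {"articles": 362_213, "links": 719_016, "pages": 673_790},
}

# Share of external links that led to a collected web page.
retrieved = {name: row["pages"] / row["links"] for name, row in table.items()}

# Average number of external links per article.
links_per_article = {name: row["links"] / row["articles"] for name, row in table.items()}

baike, zhwiki = table["Baidu Baike"], table["Chinese Wikipedia"]
article_ratio = baike["articles"] / zhwiki["articles"]  # "about six times"
link_ratio = baike["links"] / zhwiki["links"]           # "less than twice"

for name in table:
    print(name, round(retrieved[name], 3), round(links_per_article[name], 2))
print(round(article_ratio, 2), round(link_ratio, 2))
```

The retrieval rates come out at roughly 90% for Baidu Baike and 94% for Chinese Wikipedia, and the links-per-article averages at roughly 0.60 and 1.99 respectively.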
Figure 5‑1. Number of entries and external links, and their normalized frequency
First, the two graphs in the left column of Figure 5‑1 compare the distribution of the number of articles (on the y-axis) across word counts, that is, the length of articles (on the x-axis). There is no clear difference between the two when it comes to the average word count; however, a clear difference exists in the distribution pattern. Although the data show that on average Chinese Wikipedia has slightly longer articles, with an average word count of 676 (or 10^2.83) words as opposed to 603 (or 10^2.78) for Baidu Baike, as also shown in the graphs, the difference is relatively small. Still, the contrast in the distributions is clear. The length of Baidu Baike article entries (the top graph, shown in red) is normally distributed along the x-axis of word counts, from small numbers such as tens (10) and hundreds (10^2) to large ones near ten thousand (10^4). In contrast, Chinese Wikipedia has few or no short articles below a hundred words, as highlighted in the second graph in the left column of Figure 5‑1. Simply put, Chinese Wikipedia does not have very short articles.
The most plausible cause of the difference in distribution is the difference in their editorial policies towards short articles. Chinese Wikipedia has a quality-control editorial policy on “stub” articles, flagging articles that are too short so that further action, either improvement or deletion, can be taken. The policy is common practice in other language versions of Wikipedia, and its rationale is to strike a balance between encouraging new articles (often short in the beginning) and expanding them into fuller ones. An example of a “stub” article is a new entry with just one sentence such as “Salvador Allende is the Chilean president from 1970 to 1973” (the example provided by the guidelines of Chinese Wikipedia). Thus in Chinese Wikipedia, flagging an article as a “stub” signals the need for further editorial action, including content expansion, merging with other existing articles, or deletion. In contrast, Baidu Baike does have entries that are very short, reflecting the absence of similar editorial policies or practices on short articles, which explains the normal distribution shown in Figure 5‑1.
Second, the two graphs in the right column of Figure 5‑1 show that Chinese Wikipedia has more web links per entry article than Baidu Baike. On average, Chinese Wikipedia has 1.99 external web links per article, which is considerably more than the average of 0.60 in Baidu Baike. As shown in the first bar of the graphs, Baidu Baike has proportionally more entry articles (over 60%) that contain no external links at all.
Hence, the two basic indicators of average word count and external web links show that Chinese Wikipedia in general shies away from short articles and moves towards more external links, suggesting more editorial scrutiny as compared to Baidu Baike.
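The paired figures given for the averages read naturally as powers of ten, that is, positions on a logarithmic word-count axis; this reading is an assumption about the original notation. The Python sketch below checks that correspondence and shows one way to bin word counts by order of magnitude. The function name `log10_bin` and the sample word counts are hypothetical.

```python
import math
from collections import Counter


def log10_bin(word_count):
    """Order of magnitude of a positive word count: 1 for tens, 2 for hundreds, ..."""
    return math.floor(math.log10(word_count))


# The exponents correspond to the quoted average word counts.
print(round(10 ** 2.83))  # 676
print(round(10 ** 2.78))  # 603

# Hypothetical article lengths, NOT the thesis data.
sample_lengths = [12, 50, 250, 676, 900, 5000]
histogram = Counter(log10_bin(n) for n in sample_lengths)
print(sorted(histogram.items()))
```

On such an axis, a distribution that lacks articles shorter than about a hundred words simply has nearly empty bins below 2.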
About
Han-Teng Liao, Oxford Internet Institute