Thanks to the questions raised by veteran Wikipedian Federico Leva (Nemo) and Wikimedia research analyst Oliver Keyes (see the discussion thread “[Wiki-research-l] Wikipedia traffic: selected language versions”), I have the chance to clarify my suggestions for curating and analysing Wikipedia traffic data. In this blog post I will use English Wikipedia traffic data to explain what can be done with geolinguistic normalization and why the results are potentially useful.
First, let me describe what can be done with the current traffic data reports and what their limitations are. With a bit of web scraping and visualization work, I have prototyped a historical visualization of the English Wikipedia viewing/editing traffic. It is no surprise that the U.S. and the U.K. are among the regions with the largest proportions of viewing/editing activity.
However, because the current data are released only as percentages, there is no way for researchers or Wikipedians to see growth or decline in absolute numbers. The current data release only allows comparison in proportions: e.g. the visits to English Wikipedia from the U.K. as a proportion of all visits. I have detailed this point and others in this mailing list post here.
Taking it a step further, researchers and Wikipedians may want to explore the *per capita* numbers, or per speaker numbers, across regions. To do so, we first need speaker population data for different languages across different countries. Using the Unicode language territory information data table, one can get such data to “normalize” the Wikipedia traffic data for further comparison. I call this method geolinguistic normalization, borrowing and advancing the GIS mapping strategy called geographic normalization (see Harvard University’s GIS manual here for an accessible introduction). It is rather important, in my mind, to account for the migrants and diasporas of the world and to avoid “methodological nationalism”, and the following per speaker results of the English Wikipedia’s viewing and editing charts illustrate this.
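Before turning to the charts, the normalization step itself can be sketched in a few lines of Python. This is only a minimal illustration, not the exact procedure behind the charts: the region labels and every number below are made-up placeholders, and the function name `per_speaker_index` is hypothetical. Because the released traffic data are proportions rather than absolute counts, the sketch divides each region's share of traffic by its speaker population, which is enough to rank regions against one another.

```python
# Hypothetical placeholder inputs: neither the regions nor the numbers are real data.
# traffic_share: each region's share of all English Wikipedia page views (0..1).
traffic_share = {"Region A": 0.40, "Region B": 0.10, "Region C": 0.02}
# speakers: estimated English-speaking population per region
# (in practice this could come from a language-territory table such as CLDR's).
speakers = {"Region A": 250_000_000, "Region B": 50_000_000, "Region C": 4_000_000}


def per_speaker_index(traffic_share, speakers, per=1_000_000):
    """Return each region's traffic share per `per` speakers.

    Regions with a missing or zero speaker population are skipped,
    since the ratio is undefined for them.
    """
    result = {}
    for region, share in traffic_share.items():
        population = speakers.get(region)
        if not population:
            continue
        result[region] = share / population * per
    return result


index = per_speaker_index(traffic_share, speakers)
# Rank regions from highest to lowest per-speaker share.
ranking = sorted(index, key=index.get, reverse=True)
print(ranking)
```

With these placeholder figures, the region with the largest raw share ends up last once speaker population is taken into account, which is the kind of reordering the per speaker charts are meant to reveal.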
First, the per speaker viewing chart for English Wikipedia:
Second, the per speaker editing chart for English Wikipedia:
Immediately we see that there is more editing and viewing activity, per speaker, in some English-speaking countries than in the U.S. For English Wikipedia’s per-speaker viewing activities, the top five regions are Canada, the U.K., New Zealand, Australia and Ireland, followed by the U.S., Malaysia, the Netherlands, and other European countries.
For English Wikipedia’s per-speaker editing activities, the top region is, interestingly, Israel. I am not sure whether and how the results make sense. It could mean that the speaker population data listed in the Unicode CLDR language-territory information might underestimate the English-speaking population in Israel, thereby amplifying its per speaker results. It could simply mean that there are indeed more per speaker English Wikipedians in Israel (to me it is unlikely).
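The first explanation can be made concrete with a short calculation. The symbols here are my own shorthand for illustration, not notation from the data sources: let $e$ be a region's share of edits, $s$ its true speaker population, and $\hat{s} = k\,s$ the listed population, with $0 < k < 1$ when the listed figure is an underestimate.

```latex
\hat{r} \;=\; \frac{e}{\hat{s}} \;=\; \frac{e}{k\,s} \;=\; \frac{1}{k}\cdot\frac{e}{s} \;=\; \frac{r}{k}
```

In other words, the reported per speaker value $\hat{r}$ is the true value $r$ inflated by a factor of $1/k$: if the listed population were, say, only half the true one ($k = 0.5$), the per speaker figure would be doubled.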
For the sake of comparing the other regions, the chart below removes Israel from the English Wikipedia’s per-speaker editing activities.
Somewhat more consistent with the per speaker viewing traffic data, the editing traffic data shows that the remaining top-five editing regions are the U.K., New Zealand, Canada, and Ireland. The U.S. ranks only seventh, below sixth-placed Australia and above the rest of the top ten: the Netherlands, Italy and Spain.
The geolinguistic normalization results above are potentially useful for developing strategies for the Wikipedia movement with more targeted information. Immediately, more “developed” regions and less “developed” regions (in terms of Wikipedia development) are identified. It can be argued that the per speaker usage rates in the U.S., the country where the Wikipedia movement started, are much lower than in other developed (in the general sense of economic development) countries such as Canada, the U.K., New Zealand and Ireland. A possible strategy is to mobilize the cognitive surplus and experience of the more Wikipedia-developed regions to foster outreach in the less Wikipedia-developed regions. For instance, the above charts suggest that Malaysia ranks much higher in the English Wikipedia’s viewing traffic data than in its editing traffic data. The results may suggest a need to convert readers of English Wikipedia in Malaysia into new editors.
The per speaker results can also foster cross-national or transnational competition in recruiting new users and getting new free content and data. Note that Wikimedia governance has a divided structure in terms of geolinguistic and national boundaries. Wikipedia content is divided overwhelmingly by language and governed by active editors and administrators who can span several countries. Local Wikimedia chapters, by contrast, are divided along national borders, which is likely a consequence of the legal jurisdictions in which a civic non-profit organization has to be set up. The per speaker traffic data thus provide some important data points for measuring and comparing outreach (or even fundraising) outcomes. The competition scenario would be more sensible among regions with similar geolinguistic demography and economic development, say between Canada, the U.K., New Zealand, Australia, etc.
Imagine the benefits the Arabic, Uyghur or Kurdish Wikipedias could gain by targeting their potential editors and readers with such information. Note that because Uyghur and Kurdish are minority languages with diaspora populations around the world and without a nation state of their own, their Wikipedia development might be more difficult than that of other language versions with local chapters. It is my personal belief that, at the very least, Wikipedia and the Wikimedia Foundation can provide data for them to make their own development strategies. As I have blogged earlier, I am a believer in the notion that open data and open knowledge may lead to open solutions.