<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tobias Escher at the OII &#187; e-Social Science</title>
	<atom:link href="http://people.oii.ox.ac.uk/escher/category/e-social-science/feed/" rel="self" type="application/rss+xml" />
	<link>http://people.oii.ox.ac.uk/escher</link>
	<description>is a Research Assistant and a DPhil Student</description>
	<lastBuildDate>Wed, 15 Jun 2011 20:09:25 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>The joy of a searchable Hansard or Why open data matters for research!</title>
		<link>http://people.oii.ox.ac.uk/escher/2010/02/17/the-joy-of-a-searchable-hansard-or-why-open-data-matters-for-research/</link>
		<comments>http://people.oii.ox.ac.uk/escher/2010/02/17/the-joy-of-a-searchable-hansard-or-why-open-data-matters-for-research/#comments</comments>
		<pubDate>Wed, 17 Feb 2010 16:41:38 +0000</pubDate>
		<dc:creator>tobias.escher</dc:creator>
				<category><![CDATA[*OIINEWS]]></category>
		<category><![CDATA[DPhil]]></category>
		<category><![CDATA[e-Social Science]]></category>
		<category><![CDATA[mySociety]]></category>
		<category><![CDATA[political participation]]></category>

		<guid isPermaLink="false">http://people.oii.ox.ac.uk/escher/?p=367</guid>
		<description><![CDATA[It is no secret that I&#8217;m a great admirer of mySociety&#8217;s work and I even try to contribute a little bit to it myself through some of the research I do for them but today I would just like to share briefly an example of how much difference it can make to research whether or [...]]]></description>
			<content:encoded><![CDATA[<p>It is no secret that I&#8217;m a great admirer of mySociety&#8217;s work, and I even try to contribute a little to it myself through some of the research I do for them. Today, though, I would just like to share briefly an example of how much difference it can make to research whether or not data is available online, in a well-structured manner and with an intelligent search built on top of it.</p>
<p>In my doctoral research I look at the communication between constituents and their Members of Parliament. I was looking for a simple way to judge the relevance of the mail that MPs receive from their constituents. As I found, MPs tend to refer to their &#8220;postbag&#8221; in order to emphasize the importance of an issue, as Simon Hughes did, for example, in a recent debate on climate change (see video below):</p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="320" height="230" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="flashvars" value="gid=debate/2009-11-24b.431.1&amp;file=17062&amp;start=5830" /><param name="src" value="http://www.theyworkforyou.com/video/parlvid.swf" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="320" height="230" src="http://www.theyworkforyou.com/video/parlvid.swf" allowfullscreen="true" flashvars="gid=debate/2009-11-24b.431.1&amp;file=17062&amp;start=5830"></embed></object></p>
<p>So in order to judge how often that happens, I needed to have a look at <a href="http://en.wikipedia.org/wiki/Hansard">Hansard</a>, the written record of proceedings in parliament. Now the <a href="http://www.publications.parliament.uk/pa/cm/cmhansrd.htm">main online Hansard record at the UK Parliament website</a> is rather difficult to use and does not provide any search functionality, so I turned to <a href="http://www.theyworkforyou.com">TheyWorkForYou.com</a>, one of mySociety&#8217;s projects, which not only provides detailed information on MPs but also offers a nicely formatted, searchable version of Hansard, now dating back to 1935 (!).</p>
<p>In this way it took only a matter of seconds to find out how often MPs and Lords have mentioned their postbags in parliamentary proceedings since 1935 (<a href="http://www.theyworkforyou.com/search/?s=postbag&amp;o=d">it was 1,621 times</a>). A multitude of options allows you to filter your search, so I now know that the majority of these mentions were made during House of Commons debates (<a href="http://www.theyworkforyou.com/search/?s=postbag&amp;from=&amp;to=&amp;person=&amp;section=debates&amp;column=">989</a>) and that in the current parliament Conservative MP Mark Field leads the table (<a href="http://www.theyworkforyou.com/search/?s=postbag+2005-05-05..2010-02-17&amp;o=p">with 9 references to his postbags</a>). Finally, I could quickly produce a figure showing how references to constituent mail have developed over time:</p>
<p><a href="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2010/02/parliamentary_postbag_mentions.png"><img class="aligncenter size-full wp-image-376" title="parliamentary_postbag_mentions" src="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2010/02/parliamentary_postbag_mentions.png" alt="" width="474" height="291" /></a></p>
<p><em>btw the search is intelligent enough to look for the word &#8220;postbag&#8221; as well as similar words such as &#8220;postbags&#8221; or &#8220;post bag&#8221;</em></p>
<p>I cannot begin to imagine how long it would have taken to produce this figure with the limited capabilities of the official Hansard, and it would not have been possible at all at the time when all this data was really only a <em>written</em> record in the literal sense. I am not saying that this particular piece of information is a world-changing discovery, but it is a good example of how the availability of data in a structured and searchable format (!) can nevertheless contribute to scholarship. In this respect the various Open Data initiatives by governments offer huge potential for social scientists with the appropriate statistical and computational skills to offer fresh insights. See for example the <a href="http://www.guardian.co.uk/world-government-data">Guardian World Government Data Initiative</a>, which offers the datasets opened up by various governments in a uniform format.</p>
]]></content:encoded>
			<wfw:commentRss>http://people.oii.ox.ac.uk/escher/2010/02/17/the-joy-of-a-searchable-hansard-or-why-open-data-matters-for-research/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Five lessons on how Google Blogsearch works (or doesn&#8217;t) and how to use it for research</title>
		<link>http://people.oii.ox.ac.uk/escher/2008/02/28/google-blogsearch-howto/</link>
		<comments>http://people.oii.ox.ac.uk/escher/2008/02/28/google-blogsearch-howto/#comments</comments>
		<pubDate>Thu, 28 Feb 2008 22:03:34 +0000</pubDate>
		<dc:creator>tobias.escher</dc:creator>
				<category><![CDATA[*OIINEWS]]></category>
		<category><![CDATA[blogging]]></category>
		<category><![CDATA[e-Social Science]]></category>

		<guid isPermaLink="false">http://people.oii.ox.ac.uk/escher/2008/02/28/google-blogsearch-howto/</guid>
		<description><![CDATA[After I have recently spent a couple of days racking my brain about it, I thought I better share to give others a head start. Ok, here is the task: Automatically query Google Blogsearch to find the number of &#8220;all&#8221; English-language blog posts containing a number of words that have been published during a certain [...]]]></description>
			<content:encoded><![CDATA[<p>Having recently spent a couple of days racking my brain over this, I thought I had better share what I found to give others a head start.</p>
<p><em>Ok, here is the task: <strong>Automatically query</strong> Google Blogsearch to find the number of &#8220;all&#8221; <strong>English-language</strong> blog posts containing a number of words that have been published <strong>during a certain period</strong>. Should be easy enough? Well, you have no idea&#8230;</em></p>
<p class="MsoNormal">Before I start, why use <a href="http://blogsearch.google.com/">Google Blogsearch</a> at all? This is why: <a href="http://technorati.com/search?advanced">Technorati</a> does not allow you to limit searches by time, <a href="http://www.blogpulse.com/search.html#advanced">Blogpulse</a> does not offer a language filter and is very slow anyway, and last but not least <a href="http://www.bloglines.com/advsearch?q=">Bloglines</a>, while doing a good job in general, reports fewer results than Google and also gives very erroneous estimates of total post counts.</p>
<p class="MsoNormal"><strong>First problem: Index update intervals</strong></p>
<p class="MsoNormal">At first sight Google Blogsearch is great. For any kind of search Google will usually show you <a href="http://www.mattcutts.com/blog/minty-fresh-indexing/">some posts that have just been published</a>. That is nice, but clearly Google cannot index all blogs at the same time. The crawler will visit less frequently updated blogs less frequently, and even then we don&#8217;t know when Google updates the Blogsearch index. If you are interested in &#8220;all&#8221; posts that have been published in the last week (or at least all posts published on blogs Google tends to index), you had better wait to query Google Blogsearch until the index has been updated to include those posts that were written in the last week but have yet to make it into the index.</p>
<p class="MsoNormal">But how long should you wait? I used a random sample of 500 queries that were first submitted to Google exactly two days after the period of interest and then repeated those searches later. I divided each newly obtained result by the initial one (so the ratio is 1 if the total post count did not change and larger than 1 if more posts were found in the second query) and plotted the ratios by the number of days that had passed since the date of interest.</p>
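<p>The ratio calculation described above can be sketched roughly as follows. This is only an illustration, not the script used for the plot: the tuple layout and the function name <code>ratios_by_day</code> are assumptions, and summarising each day by its median is my choice here (the plot shows every individual ratio).</p>

```python
from collections import defaultdict
from statistics import median


def ratios_by_day(observations):
    """Group repeat/initial count ratios by days elapsed since the period of interest.

    `observations` is an iterable of (days_elapsed, initial_count, repeat_count)
    tuples. This layout is an assumption made for the sketch.
    """
    grouped = defaultdict(list)
    for days_elapsed, initial_count, repeat_count in observations:
        if initial_count == 0:
            continue  # the ratio is undefined when the first query found nothing
        grouped[days_elapsed].append(repeat_count / initial_count)
    # One summary value per day: 1 means no change, above 1 means more posts found.
    return {day: median(values) for day, values in sorted(grouped.items())}


# Made-up numbers, purely to show the call.
sample = [(3, 100, 90), (3, 200, 220), (12, 100, 130), (12, 50, 60)]
print(ratios_by_day(sample))
```

<p>A summary value that stays near 1 for early days and rises for later days would correspond to the delayed index update discussed here.</p>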
<p class="MsoNormal"><a href="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/google-blogsearch-index-update.png"><img src="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/google-blogsearch-index-update.png" /></a></p>
<p class="MsoNormal">What you can see first of all is that there is a huge variation in the counts, and usually you will even receive a lower total count if you query Google Blogsearch after some days. However, this is a problem with Google&#8217;s estimation techniques and I will deal with that later. What can also be seen is a clear increase in posts starting about 10-12 days after the period of interest. Ergo, it is probably safe to assume that Google has completely updated the Blogsearch index after a two-week interval &#8211; assuming that the index is updated continuously. Clearly this experiment should be repeated to make sure that we did not just hit the middle of, for example, a 4-week cycle.</p>
<p class="MsoNormal">In contrast, below is the corresponding picture for <a href="http://news.yahoo.com/">Yahoo News</a>. As one would expect, there is less variation because this search has fewer outlets to index and crucially relies on being up to date.</p>
<p><a href="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/yahoo-news-index-update.png"><img src="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/yahoo-news-index-update.png" /></a></p>
<p class="MsoNormal"><em>First lesson learned: If you really want all posts published on day X from all blogs Google indexes, wait about two weeks after day X before you run your query.</em></p>
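<p>If the queries are scheduled by a script, the waiting rule is a one-liner. A minimal sketch, assuming a fixed lag of 14 days; the constant and the function name are made up for illustration, and the lag itself is only the rough estimate from the experiment above.</p>

```python
from datetime import date, timedelta

INDEX_LAG_DAYS = 14  # "about two weeks"; an estimate, adjust if a repeat experiment suggests otherwise


def earliest_query_date(period_end, lag_days=INDEX_LAG_DAYS):
    """Return the first day on which a query for a period ending on `period_end` should run."""
    return period_end + timedelta(days=lag_days)


print(earliest_query_date(date(2008, 2, 14)))  # 2008-02-28
```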
<p class="MsoNormal"><strong>Second problem: total result estimates</strong></p>
<p class="MsoNormal">As you probably know, the number Google reports for the total number of results is an estimate &#8211; and a VERY bad one. The reason is that Google cannot be bothered to really search for ALL posts containing a word if 99.9% of people are only interested in the top 10, i.e. the most relevant ones. Therefore it just checks the most relevant ones and by some magic produces an estimate of what the total number is likely to be. Once you start browsing through the list of results you will likely find that there are far fewer results than originally indicated. Need proof? See the screenshot below:</p>
<p class="MsoNormal"><a href="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/google-blogsearch-28-02-2008.png"><img src="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/google-blogsearch-28-02-2008.png" /></a></p>
<p class="MsoNormal">The interesting bit is that Google will give you 649 results (as you can see, there is no way to request more) but still claims there is a total of 2,088 results, which is clearly rubbish (note that this already includes the duplicates!).</p>
<p class="MsoNormal"><em>Second lesson learned: Never, under any circumstances, use the total result estimate. Instead, use the number of the last result Google returns. The drawback is that Google will never return more than 1,000 results.</em></p>
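<p>One consequence of the 1,000-result limit is that a count sitting exactly at the limit only tells you &#8220;at least 1,000&#8221;. A small sketch of how one might record that; the function name and the returned flag are my own illustration, not anything Google provides.</p>

```python
RESULT_CAP = 1000  # no more than 1,000 results are ever returned


def observed_count(last_result_number, cap=RESULT_CAP):
    """Return (count, is_lower_bound) for the number of the last result returned."""
    count = min(last_result_number, cap)
    # A count at the cap is only a lower bound on the true number of posts.
    return count, count == cap


print(observed_count(649))   # (649, False)
print(observed_count(1000))  # (1000, True)
```

<p>Keeping the flag next to the count makes it easy to exclude or treat separately those queries that hit the limit.</p>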
<p class="MsoNormal"><strong>Third problem: The same query will give different results (just keep trying)</strong></p>
<p class="MsoNormal">Taking the lessons learned from problems one and two, you could just construct a query that asks Google for the last results page straight away (i.e. you append <code>&amp;start=990</code> to your query). However, if you run the same query again you often obtain a different, usually higher, number of results. <a href="http://blogsearch.google.com/blogsearch?hl=en&amp;q=hillary+clinton+john+mccain+&amp;as_maxm=2&amp;as_miny=2008&amp;as_maxy=2008&amp;as_minm=1&amp;as_mind=30&amp;as_maxd=3&amp;as_drrb=b&amp;ctz=-120&amp;ie=utf-8&amp;num=10&amp;start=990&amp;lr=lang_en">Give this one a try</a>. Initially it brought something in the region of 600 results, then it was 900, and if you play long enough (by browsing a bit through the list) you may get to 1,000. Again it seems that Google does not put much energy into the first query, but once you run the second it already has a head start and will give you more.</p>
<p class="MsoNormal"><em>Third lesson learned: In order to get a more accurate number from Google, do a query for the top ten results first and after a few seconds, do another one for the last results page.</em></p>
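<p>In a script, this two-step routine could look something like the sketch below. The actual fetching is deliberately left out: <code>fetch_last_result_number</code> is a hypothetical function you would have to supply yourself, and the five-second pause is just a guess at &#8220;a few seconds&#8221;.</p>

```python
import time


def two_step_count(fetch_last_result_number, query, pause_seconds=5.0):
    """Warm up with a first-page request, pause, then ask for the last results page.

    `fetch_last_result_number(query, start)` is a hypothetical caller-supplied
    function returning the number of the last result on the page that begins
    at offset `start`.
    """
    fetch_last_result_number(query, 0)  # first page: top ten results
    time.sleep(pause_seconds)  # give the service a few seconds
    return fetch_last_result_number(query, 990)  # last page, as with start=990


# Stand-in fetcher for demonstration only; it performs no network access.
calls = []


def fake_fetch(query, start):
    calls.append(start)
    return 10 if start == 0 else 900


print(two_step_count(fake_fetch, "hillary clinton john mccain", pause_seconds=0))
```

<p>Passing the fetcher in as an argument keeps the routine independent of whether the results come from the user interface or the feed.</p>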
<p class="MsoNormal"><strong>Fourth problem: different results from feed and interface</strong></p>
<p class="MsoNormal">This really, really, really is just annoying. You will obtain different results if you do exactly the same query and access the results via the <a href="http://blogsearch.google.com/blogsearch?hl=en&amp;ie=UTF-8&amp;num=10&amp;lr=lang_en&amp;q=boris+spassky+bobby+fischer&amp;btnG=Search+Blogs&amp;start=990">user interface</a> or via the <a href="http://blogsearch.google.com/blogsearch_feeds?hl=en&amp;lr=lang_en&amp;q=boris+spassky+bobby+fischer&amp;ie=utf-8&amp;num=10&amp;output=rss&amp;start=990">XML feed</a>.</p>
<p class="MsoNormal"><em>Fourth lesson learned: Decide on one way of accessing the results and stick with it so that the counts you obtain remain comparable. The feed seems to return the higher number of results.</em></p>
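<p>To stick with one access route, it helps to build every query URL in a single place. The sketch below uses only the two endpoints and the parameters visible in the links above; the function name and its arguments are made up, and it merely assembles the URL string without sending anything.</p>

```python
from urllib.parse import urlencode

INTERFACE_URL = "http://blogsearch.google.com/blogsearch"
FEED_URL = "http://blogsearch.google.com/blogsearch_feeds"


def build_query_url(words, start=0, use_feed=True):
    """Build a Blogsearch URL for English-language posts containing `words`."""
    params = {
        "hl": "en",
        "lr": "lang_en",  # restrict to English-language posts
        "q": " ".join(words),
        "ie": "utf-8",
        "num": 10,
        "start": start,  # 990 asks for the last results page
    }
    if use_feed:
        params["output"] = "rss"
    base = FEED_URL if use_feed else INTERFACE_URL
    return base + "?" + urlencode(params)


print(build_query_url(["boris", "spassky", "bobby", "fischer"], start=990))
```

<p>Switching <code>use_feed</code> once, rather than editing URLs by hand, makes it harder to mix counts from the two routes by accident.</p>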
<p class="MsoNormal"><strong>Fifth problem: Google is sorry&#8230;</strong></p>
<p class="MsoNormal">Well, just when you think you have solved all your problems (at least those related to this search thing&#8230;), Google blocks your access. I have yet to figure out why exactly, as the policy is somewhat inconsistent (sometimes it happens very quickly, sometimes it takes half a day), but apparently Google does not like queries for the last results page &#8211; which is no wonder as they require more computing power. When only checking for the top 10 results I never got the following page:</p>
<p><a href="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/google-blogsearch-error.png"><img src="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/google-blogsearch-error.png" /></a></p>
<p>This seems to be unrelated to whether one is using the normal web interface or the XML feed.</p>
<p><em>No lesson to be learned here&#8230;</em></p>
<p><strong>On SPAM</strong></p>
<p class="MsoNormal">One problem that keeps cropping up is SPAM. You can never totally eliminate bogus blogs that were just set up to generate some ad revenue and lure unsuspecting people to dubious business offers. But why posts like the one below get included in droves for some queries is totally beyond me.</p>
<p class="MsoNormal"><a href="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/spamblog.png"><img src="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2008/02/spamblog.png" /></a></p>
<p>This seems to be a particular problem with blogs on (Google&#8217;s) blogspot platform, which hosts loads of automatically generated blogs. I try to treat SPAM posts as random error, hoping that they will affect all queries equally (and this really is only a hope&#8230;).</p>
<p><strong>Last words</strong></p>
<p>I hope this is going to help some people who also need some reliable counts on the number of posts dealing with a certain topic. In my research area, this certainly is an issue. What is more, let me know if you have more or different experiences so that we can learn from each other. btw most of these findings should also be true for Google search more generally.</p>
<p>Finally, of course one has to be somewhat grateful to Google for at least providing a service like this for free (yeah, I know they do ads). But what really sucks is that Google sits on the data and could give you all the numbers you need straight away with 99% accuracy. So to end with my usual plea: open up your index and make it accessible (at least for research purposes)!!!!</p>
]]></content:encoded>
			<wfw:commentRss>http://people.oii.ox.ac.uk/escher/2008/02/28/google-blogsearch-howto/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Agenda Setting Online: Comparing Traditional Media and the Blogosphere</title>
		<link>http://people.oii.ox.ac.uk/escher/2007/09/18/agenda-setting-online-comparing-traditional-media-and-the-blogosphere/</link>
		<comments>http://people.oii.ox.ac.uk/escher/2007/09/18/agenda-setting-online-comparing-traditional-media-and-the-blogosphere/#comments</comments>
		<pubDate>Tue, 18 Sep 2007 19:48:51 +0000</pubDate>
		<dc:creator>tobias.escher</dc:creator>
				<category><![CDATA[*OIINEWS]]></category>
		<category><![CDATA[blogging]]></category>
		<category><![CDATA[DPhil]]></category>
		<category><![CDATA[e-Social Science]]></category>
		<category><![CDATA[google]]></category>

		<guid isPermaLink="false">http://people.oii.ox.ac.uk/escher/2007/09/18/agenda-setting-online-comparing-traditional-media-and-the-blogosphere/</guid>
		<description><![CDATA[Some time ago I started working on a paper that is analysing how blogs and citizen journalism might change the traditional agenda setting process. The agenda setting theory states in a nutshell that the media might not tell people WHAT TO THINK but rather WHAT TO THINK ABOUT. One of the hopes inscribed into blogs [...]]]></description>
			<content:encoded><![CDATA[<p>Some time ago I started working on a paper that is analysing how blogs and citizen journalism might change the traditional agenda setting process. The agenda setting theory states in a nutshell that <em>the media might not tell people WHAT TO THINK but rather WHAT TO THINK ABOUT</em>. One of the hopes inscribed into blogs has been that they would facilitate an alternative public sphere that provides news different(ly) from the traditional mass media.</p>
<p>I have been thinking of a way to test whether the blogosphere really does constitute a counter public. I have developed a tool that compares the media agenda &#8211; that is a ranking of stories reported within 24 hours &#8211; to the blogging agenda and measures the overlap between the two. The main objective is to find out whether bloggers are applying different criteria to rank the importance (salience) of a news story than traditional journalists.</p>
<p><a href="http://people.oii.ox.ac.uk/escher/wp-content/uploads/2007/09/Escher_Blog_Agenda_Setting.pdf">My paper describes it in more detail</a> but basically I construct the agenda from the stories on <a href="http://news.google.com">Google News</a>, extract the key words for each story with the <a href="http://developer.yahoo.com/search/content/V1/termExtraction.html">Yahoo Term Extractor</a> and use <a href="http://blogsearch.google.com/">Google Blog Search</a> to count how many posts cover each story. You can <a href="http://uggeshall.adastral.ucl.ac.uk/blogagenda/query_agenda.pl">have a look at the data on this website</a>, but you will notice that data collection stopped some time ago.</p>
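<p>To give a rough idea of what &#8220;measuring the overlap&#8221; can mean, here is one simple possibility: the share of the top stories on the media agenda that are also among the most blogged stories. This is an illustrative assumption and not necessarily the measure used in the paper; the function name, the inputs and the toy data are all made up.</p>

```python
def top_k_overlap(media_ranking, blog_post_counts, k=10):
    """Share of the top-k media stories that are also among the k most blogged stories.

    `media_ranking` lists story ids from most to least prominent;
    `blog_post_counts` maps each story id to the number of blog posts found.
    """
    media_top = set(media_ranking[:k])
    blog_ranking = sorted(blog_post_counts, key=blog_post_counts.get, reverse=True)
    blog_top = set(blog_ranking[:k])
    if not media_top:
        return 0.0
    return len(media_top.intersection(blog_top)) / len(media_top)


# Toy data: four stories ranked by the media, with invented blog post counts.
media = ["a", "b", "c", "d"]
blogs = {"a": 5, "b": 120, "c": 40, "d": 300}
print(top_k_overlap(media, blogs, k=2))  # 0.5
```

<p>A value close to 1 would suggest that bloggers and journalists rank stories similarly, a value close to 0 that the two agendas diverge.</p>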
<p>In due course there will be an update of the tool along with the paper that will improve the data reliability but for now I very much welcome your feedback on it!</p>
]]></content:encoded>
			<wfw:commentRss>http://people.oii.ox.ac.uk/escher/2007/09/18/agenda-setting-online-comparing-traditional-media-and-the-blogosphere/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

