If you are interested in screen scraping, have a look at my page here too.
crawlers
- Nutch is an Open Source Java-based crawler and indexer (so you could build your own search engine). We use it for a number of purposes (will add our configuration here soon…)
- WIRE is a crawler that comes with tools for generating statistics and reports on the collected crawl. I only found out about it recently (thanks Ricardo) and have not been able to test it yet, but it promises to be a very fast way of collecting web pages and their link structure.
- UbiCrawler by the Laboratory for Web Algorithmics at the University of Milan is a fully distributed crawler that can be obtained by contacting the author. Does anyone have experience with it? They also provide a variety of other tools for handling large graphs, so it is worth checking out!
- SocSciBot 3 is a Windows application from the Statistical Cybermetrics Research Group at the University of Wolverhampton that you can use to crawl small sites (< 5,000 pages). They also have a number of other tools on their website.
- VOSON is a project of the Australian National University that aims to combine different tools so that social scientists can mine the web, visualize the data and extract information from it. The software is not yet publicly available but there is a presentation by Robert Ackland on its (planned) features.
- Heritrix is the web crawler of the Internet Archive.
- HTTrack is a website copier that lets you download an entire site onto your computer. It comes in Windows and Linux versions. It looks useful if you only care about a small number of sites, but it does not give you any option to analyse their structure.
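
Several of the crawlers above collect not only pages but also their link structure. As a rough illustration of what that means for pages you have already downloaded (for example with a site copier), here is a minimal sketch in Python using only the standard library. It is not how any of the tools listed above work internally; the names `LinkExtractor`, `extract_links` and `build_link_graph` are made up for this example, and the sketch assumes you have the HTML of each page as a string keyed by its URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Keep only web links (skips mailto:, javascript:, ...).
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute URLs linked from one page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def build_link_graph(pages):
    """pages: dict mapping URL -> HTML text.

    Returns a dict mapping each URL to a sorted list of its distinct out-links.
    """
    return {url: sorted(set(extract_links(html, url)))
            for url, html in pages.items()}
```

Calling `build_link_graph` on a dictionary of downloaded pages gives a simple adjacency list that can be fed into whatever graph or statistics tool you prefer.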
datasets
(of crawls, web graphs, link structure … )
- The dataset for this year’s Web SPAM Challenge at WWW2007 is a web graph of the .uk web containing 77m pages.
- Datasets from the Laboratory for Web Algorithmics in Milan are very large, date from 2000 to 2005, and each mainly focuses on a single country top-level domain.
- The Academic Web Link Database Project is based at the University of Wolverhampton’s Cybermetrics Research Group and does exactly what it says on the tin: it provides regularly updated crawls of universities from around the world (UK, Australia, New Zealand, US, Spain, Taiwan, China).
- The Web Research collections of the Text Retrieval Conference (TREC) include two crawls (2002 and 2004) of the .gov domain as well as blogs (2006). However, they are only available for a fee. There is a collection of about 100,000 emails from the SPAM track that is available for free.
- Several large crawls (11,000,000 documents) of the .jp domain are available from the Japanese National Institute of Informatics, but access has to be applied for.
- I often forget about it, but the Internet Archive is a highly useful resource, even though its coverage is far from universal and its snapshots are not taken at regular intervals. Unfortunately access to their tools via a software interface is currently (since 2002 … ) not available.
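
Once you have one of these web graphs, a typical first step is to look at how many links point to each page. The datasets above each come in their own format, which I do not describe here; the sketch below simply assumes the graph has already been exported as plain text with one `source target` pair per line (an assumption for this example, not the format of any particular dataset). The function name `in_degrees` is likewise made up.

```python
from collections import Counter


def in_degrees(edge_lines):
    """Count incoming links per node from lines of the form 'source target'."""
    counts = Counter()
    for line in edge_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()
        if len(parts) < 2:
            continue  # skip malformed lines with no target
        target = parts[1]
        counts[target] += 1
    return counts
```

Because the function accepts any iterable of lines, you can pass it an open file and then ask for, say, the ten most linked-to pages with `in_degrees(f).most_common(10)`.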
