If you are interested in screen scraping, have a look at my page here too.

crawlers

  • Nutch is an open-source, Java-based crawler and indexer (so you could build your own search engine). We use it for a number of purposes (will add our configuration here soon…)
  • WIRE is a crawler that comes with tools for generating statistics and reports on the collected crawl. I found out about it only recently (thanks, Ricardo) and have not been able to test it yet, but it promises to be a very fast way of collecting web pages and their link structure.
  • UbiCrawler by the Laboratory for Web Algorithmics at the University of Milan is a fully distributed crawler that can be obtained by contacting the author. Does anyone have experience with it? They also provide a variety of other tools for handling large graphs, so it is worth checking out!
  • SocSciBot 3 is a Windows application from the Statistical Cybermetrics Research Group at the University of Wolverhampton that you can use to crawl small sites (< 5,000 pages). They also have a number of other tools on their website.
  • VOSON is a project of the Australian National University that aims to combine different tools so that social scientists can mine the web, visualize the data and extract information from it. The software is not yet publicly available, but there is a presentation by Robert Ackland on its (planned) features.
  • Heritrix – the web crawler of the Internet Archive project.
  • HTTrack is a website copier that allows you to download an entire site onto your computer. It comes in a Windows as well as a Linux version. It seems useful if you only care about a small number of sites, but it does not give you any option to analyse their structure.
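
Several of the tools above collect not just pages but also their link structure. To make that idea concrete, here is a rough sketch in Python of how a link graph could be built from pages that have already been downloaded. It is not how any of the listed crawlers actually work: the names `LinkExtractor`, `build_link_graph` and `PAGES`, as well as the example.org URLs, are made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and
                # drop the "#fragment" part so each page is one node.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def build_link_graph(pages):
    """Map each page URL to the sorted list of distinct URLs it links to.

    `pages` is a dict {url: html_text}; no network access happens here.
    """
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        parser.close()
        graph[url] = sorted(set(parser.links))
    return graph


# Hypothetical, hand-written pages standing in for a real crawl.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.org/a.html": '<a href="http://example.org/">home</a>',
}

if __name__ == "__main__":
    for source, targets in build_link_graph(PAGES).items():
        print(source, "->", targets)
```

The result is a plain adjacency list, which is a format most graph tools can read after a simple conversion. A real crawl would also need polite fetching (robots.txt, rate limits) and duplicate detection, which the sketch leaves out.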

datasets

(of crawls, web graphs, link structure …)