Introduction to Web Scraping Tools

Building on our recent exploration of tools for scraping and analyzing Twitter data, this post expands into more general web scrapers: a range of tools that let you “scrape,” or extract, data from the websites of your choice.

Before you begin scraping, make sure that your target website allows it. You can check by appending /robots.txt to the site's domain and reading the rules listed there (e.g., facebook.com/robots.txt indicates that scraping is not allowed).
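If you would rather check these rules from a script than read them by eye, Python's standard library includes a robots.txt parser. The snippet below is a minimal sketch, not a definitive method: the sample rules, the example.com URLs, the "my-research-bot" user agent, and the is_allowed helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="my-research-bot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/articles/1"))    # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))  # False
```

To check a live site instead, the same parser can download the file itself: call set_url() with the site's /robots.txt address and then read() before asking can_fetch(). Keep in mind that robots.txt is only one signal; a site's terms of service may add further restrictions.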

For the beginner web scraper, the Chrome browser extension Simplescraper could be exactly what you are looking for. Simply go to your target website and highlight the elements you want to scrape. Once finished, the data can be saved as a CSV file. The extension comes with a free tier that allows you to scrape up to 100 elements per month.

If you are working on a larger research project, ParseHub could meet all of your requirements. A piece of software that runs on both Mac and Windows, ParseHub lets users browse sites in its own built-in browser and then choose which elements to scrape. It has good documentation, and I found that after ten minutes of reading tutorials I was able to scrape a site with relative ease. The free tier allows you to scrape 200 websites per month.

Finally, the most feature-rich, but also the most advanced, piece of software on this list is Scrapy. Scrapy is a free and open-source Python framework that allows users to write small scrapers (called “spiders”) that can be pointed at any website. Scrapy is a very open-ended tool, and users can write Python code that lets it communicate with other pieces of software (e.g., you could scrape a weather website and send that data to a Python script that texts the information to your phone every morning). The downside to Scrapy is its higher barrier to entry: you cannot use it without at least a basic knowledge of how Python works.

For more web scraping tools and pros and cons of each, I have made a tool comparison table available here.

Please feel free to reach out to me (oavery@conncoll.edu) if you would like help getting started with any of these tools!

If you’re interested in social media data and web scraping, join the Data Mining Faculty Interest Group sponsored by the library! Our next meeting will be held April 22 at noon, and we will look more at some of the tools described above. Register here.