What is Web Scraping?

Web Scraping is a technique used to extract information from websites.

How It Works

Web Scraping usually happens in two parts:

  1. A program sends automated requests to a website
  2. The site's HTML response is parsed to extract data


Sending Requests

A web scraper will often send many requests to a targeted website. Sometimes, it will be easy for the web server to tell that these requests are coming from a bot. Other times, the web scraper will intentionally disguise itself so that it appears to be a normal human visitor.
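As a rough sketch of that difference, the snippet below uses Python's standard urllib.request to build one request with no custom headers and one with a browser-like User-Agent header. The URL, the user-agent string, and the helper name build_request are illustrative assumptions, not values from any particular scraper. When no User-Agent is set, urllib identifies itself as "Python-urllib/x.y" at send time, which is one easy way for a server to spot a bot.

```python
import urllib.request

# Hypothetical target URL, used only for illustration.
URL = "http://example.com/"

# Illustrative browser-like User-Agent string (an assumption, not a required value).
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


# No explicit User-Agent: urllib adds its own default when the request is sent.
plain = build_request(URL)

# Browser-like User-Agent: the request looks more like a normal visitor.
disguised = build_request(URL, BROWSER_UA)

# To actually send a request (requires network access):
# with urllib.request.urlopen(disguised, timeout=10) as response:
#     html = response.read().decode("utf-8", errors="replace")
```

Nothing is sent over the network until urlopen is called, so the two request objects can be inspected safely before use.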


Parsing HTML

A web scraper looks for patterns in a website's HTML and uses those patterns to locate and extract the data it needs.

For example, the HTML for each result on a Google results page has followed a pattern like this:

<li class="g">
<h3 class="r">
<a href="http://blog.hartleybrody.com/web-scraping/">
I Don't Need No Stinking API: Web Scraping For Fun and Profit
</a>
</h3>
<cite>
blog.hartleybrody.com/web-scraping/
</cite>
<span class="st">
Sometimes you need to pull data from a service that doesn't have an API. Not to fear! Here's how (and why) you should consider web scraping.
</span>
</li>

That markup is repeated for each result listed on a Google results page. To scrape a list of results, a web scraper could select every <li class="g"> element on the page, and then read the href attribute of the <a> element inside each <h3 class="r"> element.
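The extraction described here can be sketched with Python's built-in html.parser module. This is a minimal illustration, not a production scraper: the class name ResultLinkParser, the helper extract_results, and the SAMPLE_HTML string (a trimmed copy of the markup above) are assumptions made for the example, and real pages change their markup often.

```python
from html.parser import HTMLParser

# Sample markup: a trimmed copy of the search-result snippet shown earlier.
SAMPLE_HTML = """
<ol>
<li class="g">
<h3 class="r">
<a href="http://blog.hartleybrody.com/web-scraping/">
I Don't Need No Stinking API: Web Scraping For Fun and Profit
</a>
</h3>
<cite>blog.hartleybrody.com/web-scraping/</cite>
</li>
</ol>
"""


class ResultLinkParser(HTMLParser):
    """Collect (href, title) pairs from <li class="g"> / <h3 class="r"> / <a>."""

    def __init__(self):
        super().__init__()
        self.results = []          # finished (href, title) pairs
        self._in_result = False    # currently inside <li class="g">
        self._in_heading = False   # currently inside <h3 class="r">
        self._href = None          # href of the <a> being read, if any
        self._text = []            # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "g" in classes:
            self._in_result = True
        elif tag == "h3" and self._in_result and "r" in classes:
            self._in_heading = True
        elif tag == "a" and self._in_heading and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse newlines and repeated spaces in the link text.
            title = " ".join("".join(self._text).split())
            self.results.append((self._href, title))
            self._href = None
        elif tag == "h3":
            self._in_heading = False
        elif tag == "li":
            self._in_result = False


def extract_results(html):
    """Return a list of (href, title) tuples found in the given HTML."""
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results


results = extract_results(SAMPLE_HTML)
for href, title in results:
    print(href, "|", title)
```

Tracking the enclosing <li> and <h3> elements keeps unrelated links elsewhere on the page out of the results. Many scrapers use a third-party parsing library with selector support for this step instead, which shortens the code considerably.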

More Information

There is plenty of material available if you want to learn more about web scraping. The resources below are a good place to start.


The Ultimate Guide to Web Scraping

This is basically the web scraping Bible. The Ultimate Guide to Web Scraping examines various ways that information is sent from a website to your computer, and how that can be intercepted and parsed. It also looks at common traps and anti-scraping tactics and how you might be able to thwart them.



More Books & Resources


© 2013 What is Web Scraping