What is Web Scraping?

Web scraping is a technique for extracting information from websites automatically, using a program rather than a person copying data by hand.

How It Works

Web scraping usually happens in two steps:

  1. A program sends automated requests to a website
  2. The site's HTML response is parsed to extract data

Sending Requests

A web scraper will often send many requests to a targeted website. Sometimes, it will be easy for the web server to tell that these requests are coming from a bot. Other times, the web scraper will intentionally disguise itself so that it appears to be a normal human visitor.
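
As a rough illustration of the request step, here is a minimal sketch using Python's standard urllib.request module. One of the things a server can inspect is the User-Agent header; by default urllib announces itself as "Python-urllib/3.x", which makes a bot easy to spot. The identifier "example-scraper/0.1" and the helper names build_request and fetch are made up for this example.

```python
import urllib.request

# Hypothetical identifier for illustration; not a real browser or bot name.
USER_AGENT = "example-scraper/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Prepare a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=10):
    """Send the request and return the response body as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch("http://example.com/") would return that page's HTML as a string. A scraper that wants to look like a normal visitor would pass a browser-like user_agent value instead of the default shown here.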

Parsing HTML

A web scraper will attempt to find patterns in a website's HTML and use those patterns to extract the data it is looking for.

For example, the HTML for a single Google search result has looked like this:

<li class="g">
  <h3 class="r">
    <a href="http://blog.hartleybrody.com/web-scraping/">I Don't Need No Stinking API: Web Scraping For Fun and Profit</a>
  </h3>
  <span class="st">Sometimes you need to pull data from a service that doesn't have an API. Not to fear! Here's how (and why) you should consider web scraping.</span>
</li>

That code appears for each result listed on a Google results page. To scrape a list of results, a web scraper could collect every <li class="g"> element on the page, and then pull the link from the <a> tag inside each one's <h3 class="r"> element.
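
The extraction just described can be sketched with Python's standard html.parser module. This is only one possible approach, and it assumes the page really uses the li.g / h3.r structure shown above. The class ResultLinkParser, the function extract_links, and the SAMPLE markup are names and data invented for this example.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect the href of each <a> inside <li class="g"> / <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_result = False  # currently inside <li class="g">
        self._in_title = False   # currently inside <h3 class="r">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "g" in classes:
            self._in_result = True
        elif tag == "h3" and "r" in classes and self._in_result:
            self._in_title = True
        elif tag == "a" and self._in_title and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Simple flags are enough for flat result lists;
        # nested <li> elements would need a depth counter instead.
        if tag == "h3":
            self._in_title = False
        elif tag == "li":
            self._in_result = False


def extract_links(html):
    """Return the result links found in an HTML string."""
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page: one matching result and one item that should be skipped.
SAMPLE = """
<ol>
  <li class="g">
    <h3 class="r"><a href="http://blog.hartleybrody.com/web-scraping/">I Don't Need No Stinking API</a></h3>
    <span class="st">Sometimes you need to pull data from a service...</span>
  </li>
  <li class="other">
    <h3 class="r"><a href="http://example.com/ignored">Not a result</a></h3>
  </li>
</ol>
"""

print(extract_links(SAMPLE))
```

Only the link inside the <li class="g"> item is collected; the second item is skipped because its class does not match. If the site changes its markup, a pattern like this stops matching and the scraper has to be updated.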

