Do you ever find yourself in a situation where you need to get information out of a website that conveniently doesn’t have an export option?
This happened to a client of mine who desperately needed lists of email addresses from a platform that did not allow you to export your own data and hid the data behind a series of UI hurdles. This client was about to pay through the nose for a data-entry worker to copy each email out by hand. Luckily, she remembered that web scraping is the way of the future, and it happens to be one of my favorite ways to rebel against “big brother”. I hacked something out fast (15 minutes) and saved her a lot of money. I know others out there face similar issues, so I wanted to share how to write a program that uses the web browser like you would and takes (back) the data!
We will practice this together with a simple example: scraping a Google search. Sorry, not very creative 🙂 But it’s a good way to start.
You will need:
- Python (I use 2.7)
- Pandas
- Splinter (based on Selenium)
- Chrome & Chromedriver
If you don’t have Pandas and are lazy, I recommend heading over to Anaconda to get their distribution of Python that includes this essential & super useful library.
Otherwise, install it and all of its dependencies with pip from the terminal/command line (pip install pandas).
If you don’t have Splinter (and are not using Anaconda’s Python), install it the same way (pip install splinter).
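Before going further, it can help to confirm that both libraries are actually importable. Here is a small sketch of such a check. The helper name missing_packages is my own invention for illustration, not part of pandas or Splinter.

```python
import importlib


def missing_packages(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Usage: an empty list means everything is installed.
# print(missing_packages(["pandas", "splinter"]))
```

If the list that comes back is not empty, install the named packages with pip and run the check again.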
If you want to set this up in a virtual environment (which has many advantages) but don’t know where to start, try reading our other blog post about virtual environments.
Step 1: The Libraries & Browser
Here we will import all the libraries we need and set up a browser object.
If the page you are trying to scrape is responsive, use set_window_size to ensure all the elements you need are displayed.
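A minimal sketch of this step, assuming Splinter's Browser class and the Selenium driver it exposes as browser.driver. The function names and the 1280x800 window size are placeholders I picked for illustration, so adjust them to the page you are scraping.

```python
def resize_window(browser, width, height):
    """Resize the window through the Selenium driver that Splinter wraps."""
    browser.driver.set_window_size(width, height)


def open_browser(width=1280, height=800):
    """Open Chrome through Splinter and give the window a fixed size."""
    # Imported here so this file still loads where Splinter is not installed.
    from splinter import Browser

    browser = Browser("chrome")  # requires Chrome and Chromedriver
    resize_window(browser, width, height)
    return browser


# Usage:
# browser = open_browser()
```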
The code above will open a Google Chrome browser. Now that the browser is all set up, let’s visit Google.
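Visiting a page is a single call to the browser's visit() method. In this sketch the browser argument is assumed to be a Splinter Browser object, and visit_google is just a name I chose for the wrapper.

```python
GOOGLE_URL = "https://www.google.com"


def visit_google(browser):
    """Point the browser at Google's front page."""
    browser.visit(GOOGLE_URL)


# Usage, with a Splinter browser already open:
# visit_google(browser)
```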
Step 2: Explore the Website
Great, so far we have made it to the front page. Now we need to focus on how to navigate the website. There are two main steps to achieving this:
- Find something (an HTML element)
- Perform an action on it
To find an HTML element you need to use the Chrome developer tools. Right click on the website and select “Inspect”. This will open a box on the right side of the Chrome browser. Then click on the inspect icon (highlighted in red).
Next use the inspector cursor to click on a section of the website that you want to control. When you have clicked, the HTML that creates that section will be highlighted on the right. In the photo below, I have clicked on the search bar which is an input.
Next, right click on the HTML element and select “Copy” -> “Copy XPath”.
Congrats! You’ve now got the keys to the kingdom. Let’s move on to how to use Splinter to control that HTML element from Python.
Step 3: Control the Website
That XPath is the most important piece of information! First, keep this XPath safe by pasting it into a variable in Python.
Next we will pass this XPath to a great method from the Splinter Browser object: find_by_xpath(). This method will extract all the elements that match the XPath you pass it and return a list of Element objects. If there is only one element, it will return a list of length 1. There are other methods such as find_by_tag(), find_by_name(), find_by_text(), etc.
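Here is a rough sketch of those two steps. The XPath string is only a placeholder (replace it with whatever you copied from Chrome), and find_first is a hypothetical helper of mine, not a Splinter method.

```python
# Placeholder only: paste the XPath you copied from Chrome here.
SEARCH_BAR_XPATH = '//input[@name="q"]'


def find_first(browser, xpath):
    """Return the first element matching `xpath`, or raise if nothing matches."""
    matches = browser.find_by_xpath(xpath)  # list-like, even for a single match
    if len(matches) == 0:
        raise LookupError("No element matches XPath: {}".format(xpath))
    return matches[0]


# Usage, with a Splinter browser already open:
# search_bar = find_first(browser, SEARCH_BAR_XPATH)
```

Checking the length first gives a clearer error message when the XPath has gone stale, which tends to happen whenever a site changes its markup.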
The code above now gives you control of this individual HTML element. There are two useful methods I use for crawling: fill() and click().
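Something along these lines would do it. This is a sketch only: run_search is a name I made up, and both XPaths are passed in as arguments because the ones you copy from Chrome will differ from mine.

```python
def run_search(browser, search_bar_xpath, search_button_xpath, query):
    """Type `query` into the search bar, then click the search button."""
    search_bar = browser.find_by_xpath(search_bar_xpath)[0]
    search_bar.fill(query)

    search_button = browser.find_by_xpath(search_button_xpath)[0]
    search_button.click()


# Usage, with the two XPaths you copied from Chrome:
# run_search(browser, search_bar_xpath, search_button_xpath, "CodingStartups.com")
```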
The code above types CodingStartups.com into the search bar and clicks the search button. Once you execute the last line, you will be brought to the search results page!
Tip: Use fill() and click() to navigate login pages 😉
Step 4: Scrape!
For the purpose of this exercise, we will scrape the titles and links for each search result on the first page.
Notice that each search result is stored within an h3 tag with the class “r”. Also take note that both the title and the link we want are stored within an a tag.
The XPath of that highlighted a tag is:
But this is just the first link. We want all of the links on the search page, not just the first one. So we are going to change this a bit to make sure our find_by_xpath method returns all of the search results in a list. Here is how to do it. See the code below:
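A sketch of such an XPath, reconstructed from the structure described in this step: every h3 tag with the class “r”, then the a tag inside it. Google changes its markup over time, so treat the class name as an assumption and check it in the inspector first. find_search_results is my own helper name.

```python
# Every <a> tag that sits directly inside an <h3 class="r">.
SEARCH_RESULTS_XPATH = '//h3[@class="r"]/a'


def find_search_results(browser):
    """Return all matching result links as a list-like of elements."""
    return browser.find_by_xpath(SEARCH_RESULTS_XPATH)


# Usage, on the search results page:
# search_results = find_search_results(browser)
```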
This XPath tells Python to look for all h3-tags with a class “r”. Then inside each of them, extract the a-tag & all its data.
Now, let’s iterate through the search result link elements that the find_by_xpath method returned. We will extract the title and link for each search result. It’s very simple:
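The loop might look like the sketch below. It assumes each item is a Splinter element, which exposes its visible text as .text and its HTML attributes by key. The function name is mine.

```python
def extract_titles_and_links(search_results):
    """Collect a (title, link) pair from each search result element."""
    scraped_data = []
    for search_result in search_results:
        title = search_result.text    # visible text of the a tag
        link = search_result["href"]  # attribute lookup by key
        scraped_data.append((title, link))
    return scraped_data


# Usage:
# scraped_data = extract_titles_and_links(search_results)
```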
Cleaning the data in search_result.text can sometimes be the most frustrating part. Text on the web is very messy. Here are some helpful methods for cleaning data:
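As one illustration, the built-in string methods strip(), replace(), split() and join() cover a lot of everyday mess. The sketch below is written for Python 3 and clean_text is a hypothetical helper name, so adapt it to whatever junk your own pages contain.

```python
def clean_text(raw):
    """Tidy scraped text: trim the ends and collapse runs of whitespace."""
    text = raw.strip()                # drop leading and trailing whitespace
    text = text.replace("\xa0", " ")  # non-breaking spaces become plain spaces
    text = " ".join(text.split())     # collapse newlines, tabs, repeated spaces
    return text


# Usage:
# title = clean_text(search_result.text)
```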
All of the titles and links are now in the scraped_data list. Now to export our data to csv. Instead of the csv library chaos, I like to use a pandas dataframe. It’s 2 lines:
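A sketch of those two lines, wrapped in a function so the output path can be changed. The file name links.csv and the function name are my assumptions. I pass index=False so the file holds only the Title and Link columns, without pandas' row numbers.

```python
def save_to_csv(scraped_data, path="links.csv"):
    """Write (title, link) pairs to a CSV file with Title, Link headers."""
    # Imported here so this file still loads where pandas is not installed.
    import pandas as pd

    df = pd.DataFrame(data=scraped_data, columns=["Title", "Link"])
    df.to_csv(path, index=False, encoding="utf-8")


# Usage:
# save_to_csv(scraped_data)
```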
The code above creates a csv file with the headers Title, Link and then all of the data that was in the scraped_data list. Congrats! Now go forth and take (back) the data!
In case you want a big picture view, here is the full code available on our GitHub account.
Thanks for reading! If you have questions feel free to comment & I will try to get back to you.