Scraping Articles from the Washington Post
In the last section we introduced how to connect to APIs. However, most websites do not provide APIs, which take a lot of effort to maintain. Web crawling is a popular technique for collecting online data, especially when there is no other systematic way (e.g., an API or public database dumps) to obtain the content of a website.
Whenever you decide to write a Web crawler targeting a specific website or a collection of websites, first read the website's Robots exclusion standard and make sure that your action is legal. The Robots exclusion standard is also called robots.txt: a text file usually placed at the root of the website hierarchy, for example, https://en.wikipedia.org/robots.txt. This file contains instructions for robots to follow, such as the Web pages the owner does not want to be fetched. The absence of this file implies that web robots are allowed to crawl the entire site.
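As a minimal sketch, you can check robots.txt programmatically with Python's built-in robotparser module; whether a particular page is allowed depends on the site's current robots.txt, so the result below is purely illustrative.
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.washingtonpost.com/robots.txt')
rp.read()  # download and parse the robots.txt file

# check whether a generic crawler may fetch the blog page we target below
print rp.can_fetch('*', 'http://www.washingtonpost.com/blogs/right-turn/')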
To build a Web crawler we need two modules, urllib2 and BeautifulSoup. The former allows us to communicate with the Web server, and the latter helps parse the downloaded webpage data, which are usually in HTML or XML format.
import re, urllib2
import numpy as np
from bs4 import BeautifulSoup
Retrieve article URLs
Web crawling is simple when it is small in scale. All you need to do is to get the web address (URL) of a page, download the entire page, and parse the downloaded content.
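Here is a minimal sketch of these three steps; the Wikipedia front page is used only as a stand-in URL.
# 1. the web address of a page
url = 'https://en.wikipedia.org/wiki/Main_Page'
# 2. download the entire page
html = urllib2.urlopen(url).read()
# 3. parse the downloaded content
soup = BeautifulSoup(html)
print soup.title.text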
To demonstrate how to download a collection of articles, we use Jennifer Rubin's blog at the Washington Post. From the blog's front page, we will get a list of article addresses, and then use those addresses to retrieve the text of the articles.
I will skip the details of the urllib2 and BeautifulSoup modules and simply give the following function, which collects the article hyperlinks on a given page.
blogAddress = 'http://www.washingtonpost.com/blogs/right-turn/'

def articleUrls(blogAddress):
    # download the blog front page and parse it
    html = urllib2.urlopen(blogAddress).read()
    soup = BeautifulSoup(html)
    # collect all hyperlinks on the page, dropping duplicates
    links = soup.findAll("a")
    t = set([link["href"] for link in links if link.has_attr('href')])
    # keep only links that point to individual Right Turn articles
    l = []
    for s in t:
        a = s.split('/')
        if len(a) == 11 and a[4] == 'right-turn' and a[-1] != '#comments':
            l.append(s)
    return l

l = articleUrls(blogAddress)
The above code gives us a list of article URLs from the blog's front page.
Scrape text from Web pages
Using the URLs we retrieved in the last step, we can easily repeat our earlier action: download the Web pages and parse them to get clean text.
First, let's define a small function that is very useful for cleaning text.
clean = lambda s: str(re.sub(r'[\W_]+', ' ', s))  # collapse runs of non-alphanumeric characters into single spaces
This function takes a piece of text, removes all non-alphanumeric characters (replacing each run of them with a single space), and returns only the Latin letters and Arabic digits.
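For instance, a quick illustrative call (not from the crawled data):
example = clean("It's a test -- 2015!")
print example   # "It s a test 2015 "  (punctuation collapsed to single spaces)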
Now we are ready to proceed:
def getText(articleUrl):
    # download and parse an individual article page
    html = urllib2.urlopen(articleUrl).read()
    soup = BeautifulSoup(html)
    # the article body sits inside the first <article> tag;
    # join the cleaned text of all its paragraphs
    article = soup.body.findAll('article')
    text = ' '.join([clean(s.text) for s in article[0].findAll('p')])
    return text

# download every article and store its text, keyed by URL
blogs = {}
for i in l:
    text = getText(i)
    blogs[i] = text
Running the above script gives us a dictionary called "blogs", in which the keys are URLs and the values are the text of the corresponding articles. We keep the Web addresses so that we can double-check our data set in the future.
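A quick sanity check on the result might look like this (illustrative only; the actual counts depend on what the front page listed when crawled):
print len(blogs)              # number of articles collected
print blogs.keys()[0]         # one of the article URLs
print blogs.values()[0][:80]  # first 80 characters of that article's text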
Save Data
Finally, we can save the collected articles in a text file for further analysis (e.g., text mining, sentiment analysis).
f = open('/Users/csid/Desktop/blogs.txt', "wb")
for i in blogs:
    # one article per line: URL, a tab, then the article text
    f.write(i + '\t' + blogs[i] + '\n')
f.close()
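To use the data later, a minimal sketch of reading the file back into a dictionary (assuming the same tab-separated, one-article-per-line layout written above):
blogs = {}
with open('/Users/csid/Desktop/blogs.txt') as f:
    for line in f:
        # split the line into the URL and the article text
        url, text = line.rstrip('\n').split('\t', 1)
        blogs[url] = text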
Here is one of the blogs we crawled:
In our text file it looks like this: