Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a parser of your choice (here Python's built-in html.parser) to provide idiomatic ways of navigating, searching, and modifying the parse tree.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
We can fetch an HTML document directly from a server:
url = 'http://quotes.toscrape.com/page/1/'
# send an HTTP GET request to the server and get the response
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
For this tutorial we use the simple HTML document below, built around the Albert Einstein quote from http://quotes.toscrape.com/page/2/
html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>Quotes to Scrape</title>
</head>
<body>
<div id="quote" class="textNormal">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="simplicity,understand">
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
<a class="tag" href="/tag/understand/page/1/">understand</a>
</div>
</div>
</div>
</body>
</html>
"""
Parse the HTML document and create a soup object
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
Quotes to Scrape
</title>
</head>
<body>
<div class="textNormal" id="quote">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">
“If you can't explain it to a six year old, you don't understand it yourself.”
</span>
<span>
by
<small class="author" itemprop="author">
Albert Einstein
</small>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="simplicity,understand" itemprop="keywords"/>
<a class="tag" href="/tag/simplicity/page/1/">
simplicity
</a>
<a class="tag" href="/tag/understand/page/1/">
understand
</a>
</div>
</div>
</div>
</body>
</html>
Get title
soup.title.string
'Quotes to Scrape'
Find all links
len(soup.find_all('a'))
2
soup.find(id='quote')
<div class="textNormal" id="quote">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="simplicity,understand" itemprop="keywords"/>
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
<a class="tag" href="/tag/understand/page/1/">understand</a>
</div>
</div>
</div>
Extracting all the URLs found within a page’s <a> tags
for link in soup.find_all('a'):
    print(link.get('href'))
/tag/simplicity/page/1/
/tag/understand/page/1/
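The hrefs printed above are relative paths. If absolute URLs are needed, one possible approach is to resolve each href against the page URL with `urllib.parse.urljoin` from the standard library. The sketch below is only an illustration: the `base_url` value and the small inline `snippet` are assumptions made so the example runs on its own.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL: the page the sample quote was taken from.
base_url = "http://quotes.toscrape.com/page/2/"

# A cut-down copy of the tags block from the sample document.
snippet = """
<div class="tags">
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
<a class="tag" href="/tag/understand/page/1/">understand</a>
</div>
"""
links_soup = BeautifulSoup(snippet, "html.parser")

# href=True skips any <a> tag that has no href attribute.
absolute_urls = [
    urljoin(base_url, a["href"])
    for a in links_soup.find_all("a", href=True)
]
print(absolute_urls)
```

Because both hrefs start with a slash, `urljoin` keeps only the scheme and host of `base_url` and replaces the path.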
Extract all the text from the page
# soup.getText() is an older alias for the same method
soup.get_text()
"\n\n\n\nQuotes to Scrape\n\n\n\n\n“If you can't explain it to a six year old, you don't understand it yourself.”\nby Albert Einstein\n\n\n Tags:\n \nsimplicity\nunderstand\n\n\n\n\n\n"
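The raw output above keeps every newline from the source markup. When cleaner text is wanted, `get_text()` accepts `separator` and `strip` arguments. The following is a small sketch on an assumed inline snippet rather than the full sample document.

```python
from bs4 import BeautifulSoup

# A cut-down copy of the author line from the sample document.
snippet = '<span>by <small class="author">Albert Einstein</small>\n</span>'
span_soup = BeautifulSoup(snippet, "html.parser")

# Default behaviour: strings are concatenated exactly as they appear.
raw_text = span_soup.get_text()

# strip=True trims each string and drops the empty ones;
# separator is placed between the strings that remain.
clean_text = span_soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```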
Tag
HTML tags can be accessed as attributes of the soup object
soup.div
<div class="textNormal" id="quote">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="simplicity,understand" itemprop="keywords"/>
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
<a class="tag" href="/tag/understand/page/1/">understand</a>
</div>
</div>
</div>
soup.body.a
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
A tag has a name and any number of attributes, and the attributes can be accessed like dictionary entries
soup.div.name
'div'
soup.div['id']
'quote'
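A few more details of the dictionary-style access may be worth seeing in one place: `.attrs` exposes all attributes as a dict, `class` is treated as a multi-valued attribute and comes back as a list, and `.get()` returns `None` for a missing attribute where square brackets would raise `KeyError`. In this sketch the second class name, `highlighted`, is a hypothetical addition used only to show the list behaviour.

```python
from bs4 import BeautifulSoup

# "highlighted" is a made-up class, added to show multi-valued attributes.
snippet = '<div id="quote" class="textNormal highlighted">text</div>'
div = BeautifulSoup(snippet, "html.parser").div

all_attrs = div.attrs          # every attribute as a plain dict
classes = div["class"]         # multi-valued attribute, returned as a list
title = div.get("title")       # missing attribute: None instead of KeyError

print(all_attrs)
print(classes)
print(title)
```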
Navigating using tag names
soup.head.title
<title>Quotes to Scrape</title>
soup.head.title.string
'Quotes to Scrape'
You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <span> tag beneath the <body> tag:
soup.body.span
<span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
soup.body.span.text
"“If you can't explain it to a six year old, you don't understand it yourself.”"
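One caveat with chaining tag names: when no tag of that name exists, the attribute evaluates to `None`, so a further `.text` or `.string` on it raises `AttributeError`. A minimal sketch of guarding against that, using an assumed inline snippet that has a <span> but no <b>:

```python
from bs4 import BeautifulSoup

snippet = "<body><span>quote text</span></body>"
body = BeautifulSoup(snippet, "html.parser").body

# There is no <b> tag in the snippet, so this is None rather than an error.
missing = body.b

# Check for None before going one level deeper.
span_text = body.span.get_text() if body.span is not None else ""
bold_text = body.b.get_text() if body.b is not None else ""

print(missing, repr(span_text), repr(bold_text))
```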
Using a tag name as an attribute will give you only the first tag by that name
soup.a
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
soup.find_all('a') # to get all the <a> tags
[<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>,
<a class="tag" href="/tag/understand/page/1/">understand</a>]
A tag’s children are available in a list called .contents
soup.a.contents
['simplicity']
Iterate over a tag’s children using the .children generator
for child in soup.a.children:
    print(child)
simplicity
The .contents and .children attributes only consider a tag’s direct children. For instance, the <head> tag here has three direct children: the <title> tag and the newline strings on either side of it.
The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:
head_tag = soup.head
head_tag.contents
['\n', <title>Quotes to Scrape</title>, '\n']
for child in head_tag.descendants:
    print(child)
<title>Quotes to Scrape</title>
Quotes to Scrape
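To make the difference between direct children and descendants concrete, the sketch below counts both for a stand-alone copy of the <head> element. The inline `snippet` is an assumption that mirrors the sample document.

```python
from bs4 import BeautifulSoup

snippet = "<head>\n<title>Quotes to Scrape</title>\n</head>"
head = BeautifulSoup(snippet, "html.parser").head

# Direct children: newline, <title>, newline.
direct = list(head.children)

# Descendants also include the string inside <title>.
nested = list(head.descendants)

print(len(direct), len(nested))
```

The extra descendant is the title text itself, which is a child of <title> and therefore only a grandchild of <head>.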
len(list(soup.children)), len(list(soup.descendants))
(5, 41)
# the first child is the newline string before <title>, not the <title> tag itself
first_child = head_tag.contents[0]
first_child
'\n'
for string in soup.stripped_strings:
    print(repr(string))
'Quotes to Scrape'
"“If you can't explain it to a six year old, you don't understand it yourself.”"
'by'
'Albert Einstein'
'Tags:'
'simplicity'
'understand'
for string in soup.stripped_strings:
    print(string)
Quotes to Scrape
“If you can't explain it to a six year old, you don't understand it yourself.”
by
Albert Einstein
Tags:
simplicity
understand
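For comparison, `.strings` yields every string including whitespace-only ones, while `.stripped_strings` trims each string and skips the empty results. A short sketch on an assumed inline snippet:

```python
from bs4 import BeautifulSoup

snippet = "<div>\n<span>by</span>\n<small>Albert Einstein</small>\n</div>"
div = BeautifulSoup(snippet, "html.parser").div

# Every string in document order, newlines included.
all_strings = list(div.strings)

# Whitespace-only strings are dropped, the rest are trimmed.
stripped = list(div.stripped_strings)

print(all_strings)
print(stripped)
```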
story = soup.find(id='quote')
for ch in story.children:
    print(ch.name)
None
div
None
for ch in story.descendants:
    print(ch.name)
None
div
None
span
None
None
span
None
small
None
None
None
div
None
meta
None
a
None
None
a
None
None
None
None
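The `None` entries above come from text nodes, which have no tag name. If only element names are of interest, one option is to keep the nodes that are instances of `Tag`. This is a sketch on an assumed, shortened version of the quote block:

```python
from bs4 import BeautifulSoup, Tag

snippet = (
    '<div id="quote">\n'
    "<span>text</span>\n"
    '<div class="tags">\n<a>simplicity</a>\n</div>\n'
    "</div>"
)
quote = BeautifulSoup(snippet, "html.parser").find(id="quote")

# Text nodes are NavigableString objects, not Tag, so they are filtered out.
tag_names = [node.name for node in quote.descendants if isinstance(node, Tag)]
print(tag_names)
```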
soup.find('small', attrs = {'class':'author'}).text
'Albert Einstein'
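The same lookup can be written in a couple of other ways. The sketch below compares the `attrs` dictionary with the `class_` keyword argument and with a CSS selector via `select_one()`. It assumes a Beautiful Soup 4 install that includes CSS selector support, and it uses an inline snippet instead of the full document.

```python
from bs4 import BeautifulSoup

snippet = '<span>by <small class="author" itemprop="author">Albert Einstein</small></span>'
author_soup = BeautifulSoup(snippet, "html.parser")

# Three equivalent ways to find the <small class="author"> tag.
by_attrs = author_soup.find("small", attrs={"class": "author"})
by_keyword = author_soup.find("small", class_="author")  # class is a reserved word, hence class_
by_css = author_soup.select_one("small.author")

print(by_attrs.text, by_keyword.text, by_css.text)
```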
Use decompose to remove the unwanted tags
story.find('div', attrs = {'class':'tags'}).decompose()
The Quote
for ch in story.stripped_strings:
    print(ch)
“If you can't explain it to a six year old, you don't understand it yourself.”
by
Albert Einstein
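`decompose()` destroys the removed tag. If the removed part is still needed afterwards, `extract()` may be a better fit, since it detaches the tag from the tree and returns it. A minimal sketch on an assumed, shortened quote block:

```python
from bs4 import BeautifulSoup

snippet = (
    '<div id="quote"><span>quote text</span>'
    '<div class="tags">Tags: <a>simplicity</a></div></div>'
)
quote = BeautifulSoup(snippet, "html.parser").find(id="quote")

# extract() removes the tag from the tree and hands it back.
removed = quote.find("div", class_="tags").extract()

remaining = list(quote.stripped_strings)
removed_strings = list(removed.stripped_strings)

print(remaining)
print(removed_strings)
```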