Web scraping using BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a parser of your choice (here Python's built-in html.parser) to provide idiomatic ways of navigating, searching, and modifying the parse tree.

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

We can fetch an HTML document directly from a server:

url = 'http://quotes.toscrape.com/page/1/'
# send an HTTP GET request to the server and get the response
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
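A real request can fail or hang, so it may help to wrap the fetch in a small helper. The following is a minimal sketch rather than part of the original tutorial: the name `fetch_soup`, its `session` parameter, and the 10-second timeout are illustrative assumptions.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10.0):
    """Fetch `url` and return a parsed soup (hypothetical helper)."""
    # Assumption: `session` is any object with a requests-style get();
    # by default the top-level requests module is used.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # A real requests response raises HTTPError here on 4xx/5xx,
    # so an error page is never parsed by mistake.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup('http://quotes.toscrape.com/page/1/')` should give the same kind of soup object as the code above, provided the site is reachable.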

This tutorial uses the simple HTML document below, built around an Albert Einstein quote taken from http://quotes.toscrape.com/page/2/

html_doc = """
<!DOCTYPE html>
<html>
   <head>
      <title>Quotes to Scrape</title>
   </head>
   <body>
      <div id="quote" class="textNormal">
         <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
            <span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
            <span>by <small class="author" itemprop="author">Albert Einstein</small>
            </span>
            <div class="tags">
               Tags:
               <meta class="keywords" itemprop="keywords" content="simplicity,understand">
               <a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
               <a class="tag" href="/tag/understand/page/1/">understand</a>
            </div>
         </div>
      </div>
   </body>
</html>
"""

Parse an HTML document and create a soup object

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Quotes to Scrape
  </title>
 </head>
 <body>
  <div class="textNormal" id="quote">
   <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">
     “If you can't explain it to a six year old, you don't understand it yourself.”
    </span>
    <span>
     by
     <small class="author" itemprop="author">
      Albert Einstein
     </small>
    </span>
    <div class="tags">
     Tags:
     <meta class="keywords" content="simplicity,understand" itemprop="keywords"/>
     <a class="tag" href="/tag/simplicity/page/1/">
      simplicity
     </a>
     <a class="tag" href="/tag/understand/page/1/">
      understand
     </a>
    </div>
   </div>
  </div>
 </body>
</html>

Get title

soup.title.string

'Quotes to Scrape'

Find all links and count them

len(soup.find_all('a'))

Find the element with a given id

soup.find(id='quote')

<div class="textNormal" id="quote">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
</span>
<div class="tags">
               Tags:
               <meta class="keywords" content="simplicity,understand" itemprop="keywords"/>
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
<a class="tag" href="/tag/understand/page/1/">understand</a>
</div>
</div>
</div>

Extracting all the URLs found within a page’s <a> tags

for link in soup.find_all('a'):
    print(link.get('href'))

/tag/simplicity/page/1/
/tag/understand/page/1/
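The hrefs above are relative to the site root. A minimal sketch of turning them into absolute URLs with the standard library's `urljoin` follows; it assumes the page was served from the URL mentioned earlier in this tutorial, and `base_url` and `absolute_links` are illustrative names.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumption: the page was served from this URL (taken from the tutorial text).
base_url = "http://quotes.toscrape.com/page/2/"
html = """
<div class="tags">
  <a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
  <a class="tag" href="/tag/understand/page/1/">understand</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True skips <a> tags that have no href attribute at all.
absolute_links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
for link in absolute_links:
    print(link)
```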

Extract all the text from the page

# soup.getText()
soup.get_text()

"\n\n\n\nQuotes to Scrape\n\n\n\n\n“If you can't explain it to a six year old, you don't understand it yourself.”\nby Albert Einstein\n\n\n               Tags:\n               \nsimplicity\nunderstand\n\n\n\n\n\n"

Tag

HTML tags can be accessed as attributes of the soup object

soup.div

<div class="textNormal" id="quote">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
</span>
<div class="tags">
               Tags:
               <meta class="keywords" content="simplicity,understand" itemprop="keywords"/>
<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>
<a class="tag" href="/tag/understand/page/1/">understand</a>
</div>
</div>
</div>

soup.body.a

<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>

A tag has a name and any number of attributes, which can be accessed like dictionary keys

soup.div.name

'div'

soup.div['id']

'quote'
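Dictionary-style access raises a KeyError when the attribute is absent, so `.get()` is often the safer choice. The sketch below uses a small stand-in document (an assumption, not the tutorial's full HTML) to show both styles and how the multi-valued `class` attribute comes back as a list.

```python
from bs4 import BeautifulSoup

html = '<div id="quote" class="textNormal big"><a href="/tag/simplicity/page/1/">simplicity</a></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

div_id = div["id"]             # plain string attribute
div_classes = div["class"]     # class is multi-valued, so this is a list
missing = div.get("title")     # None instead of a KeyError
has_href = soup.a.has_attr("href")

print(div_id, div_classes, missing, has_href)
print(div.attrs)               # all attributes as a dict
```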

Navigating using tag names

soup.head.title

<title>Quotes to Scrape</title>

soup.head.title.string

'Quotes to Scrape'

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <span> tag beneath the <body> tag:

soup.body.span

<span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span>

soup.body.span.text

"“If you can't explain it to a six year old, you don't understand it yourself.”"

Using a tag name as an attribute will give you only the first tag by that name

soup.a

<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>

soup.find_all('a') # to get all the <a> tags

[<a class="tag" href="/tag/simplicity/page/1/">simplicity</a>,
 <a class="tag" href="/tag/understand/page/1/">understand</a>]
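Attribute-style access behaves like a call to `find()`, which means a tag that does not exist yields `None` rather than an error. A short sketch on a stand-in snippet (the HTML here is an assumption made for illustration):

```python
from bs4 import BeautifulSoup

html = '<p><a href="/one">one</a> <a href="/two">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# soup.a and soup.find('a') both return the first <a> tag.
same = soup.a == soup.find("a")

# There is no <b> tag, so attribute access gives None;
# chaining further (soup.b.string) would raise AttributeError.
no_b = soup.b

# limit stops the search early; the result is still a list.
first_only = soup.find_all("a", limit=1)

print(same, no_b, first_only)
```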

A tag’s children are available in a list called .contents

soup.a.contents

['simplicity']

Iterate over a tag’s children using the .children generator

for child in soup.a.children:
    print(child)

simplicity

The .contents and .children attributes only consider a tag’s direct children. For instance, the only tag directly beneath <head> is <title>; the newline strings around it also count as children.

The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

head_tag = soup.head
head_tag.contents

['\n', <title>Quotes to Scrape</title>, '\n']

for child in head_tag.descendants:
    print(child)

<title>Quotes to Scrape</title>
Quotes to Scrape
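To make the difference concrete, the sketch below counts direct children and descendants of a stand-in <head> element and filters out the whitespace strings with an `isinstance` check. The variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<head>\n<title>Quotes to Scrape</title>\n</head>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

# Direct children: the <title> tag plus the newline strings around it.
direct_children = list(head.children)
# Keep only real tags, dropping the whitespace-only strings.
child_tags = [c.name for c in direct_children if isinstance(c, Tag)]
# Descendants additionally include the text inside <title>.
all_descendants = list(head.descendants)

print(len(direct_children), child_tags, len(all_descendants))
```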

len(list(soup.children)), len(list(soup.descendants))

(5, 41)

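# contents[0] is the newline string before <title>; the tag itself is at index 1
title_tag = head_tag.contents[1]
title_tag

<title>Quotes to Scrape</title>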

for string in soup.stripped_strings:
    print(repr(string))

'Quotes to Scrape'
"“If you can't explain it to a six year old, you don't understand it yourself.”"
'by'
'Albert Einstein'
'Tags:'
'simplicity'
'understand'

for string in soup.stripped_strings:
    print(string)

Quotes to Scrape
“If you can't explain it to a six year old, you don't understand it yourself.”
by
Albert Einstein
Tags:
simplicity
understand
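The loops above rely on `.stripped_strings`, which trims each string and skips the whitespace-only ones. As a rough comparison on a stand-in snippet (an assumption for illustration), `.strings` keeps the raw text, and `get_text()` with a separator and `strip=True` joins the cleaned pieces into one string.

```python
from bs4 import BeautifulSoup

html = "<p> simplicity </p>\n<p>understand</p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes padding and the bare newline
clean = list(soup.stripped_strings)  # trimmed, whitespace-only strings dropped
joined = soup.get_text(" ", strip=True)

print(raw)
print(clean)
print(joined)
```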

story = soup.find(id='quote')

for ch in story.children:
    print(ch.name)

None
div
None

for ch in story.descendants:
    print(ch.name)

None
div
None
span
None
None
span
None
small
None
None
None
div
None
meta
None
a
None
None
a
None
None
None
None
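The `None` entries come from text nodes, which have no tag name. If only tags are of interest, one option is `find_all(True)`, which matches every tag; adding `recursive=False` restricts the search to direct children. A sketch on a shortened stand-in for the quote block:

```python
from bs4 import BeautifulSoup

html = (
    '<div id="quote">\n'
    "<span>by <small>Albert Einstein</small></span>\n"
    '<a href="/tag/simplicity/page/1/">simplicity</a>\n'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
story = soup.find(id="quote")

# True matches any tag, so text nodes never appear in the result.
descendant_tags = [t.name for t in story.find_all(True)]
# recursive=False looks at direct children only.
child_tags = [t.name for t in story.find_all(True, recursive=False)]

print(descendant_tags)
print(child_tags)
```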

soup.find('small', attrs = {'class':'author'}).text

'Albert Einstein'
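`find()` returns `None` when nothing matches, so reading `.text` straight off the result can fail on pages that lack the element. The sketch below uses the `class_` keyword, which is equivalent to the `attrs` dictionary above, and guards against a missing tag. The `publisher` class is a made-up example of something that is not in the document.

```python
from bs4 import BeautifulSoup

html = '<span>by <small class="author" itemprop="author">Albert Einstein</small></span>'
soup = BeautifulSoup(html, "html.parser")

# class_ (with the trailing underscore) avoids clashing with Python's keyword.
author_tag = soup.find("small", class_="author")
author = author_tag.get_text(strip=True) if author_tag is not None else None

# Hypothetical class that does not exist in this snippet.
publisher_tag = soup.find("small", class_="publisher")
publisher = publisher_tag.get_text(strip=True) if publisher_tag is not None else None

print(author, publisher)
```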

Use `decompose` to remove the unwanted tags

story.find('div', attrs = {'class':'tags'}).decompose()

The Quote

for ch in story.stripped_strings:
    print(ch)

“If you can't explain it to a six year old, you don't understand it yourself.”
by
Albert Einstein
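`decompose()` destroys the removed tag. When the removed part is still needed, `extract()` is an alternative that takes the tag out of the tree and returns it. A minimal sketch on a shortened stand-in for the quote block, with a guard in case the tag is missing:

```python
from bs4 import BeautifulSoup

html = (
    '<div id="quote"><span class="text">Quote</span>'
    '<div class="tags">Tags: <a href="/tag/simplicity/page/1/">simplicity</a></div></div>'
)
soup = BeautifulSoup(html, "html.parser")
story = soup.find(id="quote")

tags_div = story.find("div", class_="tags")
removed_text = None
if tags_div is not None:
    # extract() detaches the tag from the tree and hands it back.
    removed = tags_div.extract()
    removed_text = removed.get_text(" ", strip=True)

remaining = list(story.stripped_strings)
print(remaining, removed_text)
```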

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
