Learn Web Scraping with Python

Web Scraping Libraries

Python has a vast collection of libraries and provides several that are very useful for web scraping. Let's look at the libraries required for web scraping.

 

Libraries used for web scraping

1. Selenium 

Selenium is an open-source automated testing library. It is used to automate browser activities, such as opening pages, clicking buttons, and filling in forms. To install this library, type the following command in your terminal.

cmd>pip install selenium  

Note - It is good to use the PyCharm IDE.
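
For instance, here is a minimal sketch of what Selenium browser automation looks like. It assumes Google Chrome and a compatible driver are available on your system; the URL is only an illustration.

from selenium import webdriver

driver = webdriver.Chrome()            # launch a Chrome browser session
driver.get("https://example.com")      # open the page to be scraped
print(driver.title)                    # title of the loaded page
html = driver.page_source              # full HTML of the page, ready to be parsed
driver.quit()                          # close the browser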

 


2. Pandas

The Pandas library is used for data manipulation and analysis. In web scraping, it is typically used to store the extracted data in the desired format, such as a CSV file.

cmd>pip install pandas
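
As a minimal sketch of how Pandas fits into a scraping workflow, the snippet below puts a couple of made-up records into a DataFrame and writes them to a CSV file. The column names and file name are purely illustrative.

import pandas as pd

# records collected by a scraper (made-up values for illustration)
records = [
    {"title": "Post one", "author": "Alice"},
    {"title": "Post two", "author": "Bob"},
]

df = pd.DataFrame(records)                    # build a table from the scraped rows
df.to_csv("scraped_data.csv", index=False)    # store the data in the desired format
print(df.head())                              # preview the first rows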

 


3. BeautifulSoup

BeautifulSoup is a Python library that is used to pull data out of HTML and XML files. It is mainly designed for web scraping. It works with a parser to provide a natural way of navigating, searching, and modifying the parse tree. The latest version of BeautifulSoup is 4.8.1.


Let's understand the BeautifulSoup library in detail.

 Installation of BeautifulSoup

cmd>pip install bs4  

 


Installing a parser

BeautifulSoup supports the HTML parser included in Python's standard library as well as several third-party Python parsers. You can install any of them according to your needs. The list of BeautifulSoup's parsers is the following:

Parser                  Typical usage
Python's html.parser    BeautifulSoup(markup, "html.parser")
lxml's HTML parser      BeautifulSoup(markup, "lxml")
lxml's XML parser       BeautifulSoup(markup, "lxml-xml")
html5lib                BeautifulSoup(markup, "html5lib")

 

We recommend installing the html5lib parser because it works well with newer versions of Python; alternatively, you can install the lxml parser.

cmd>pip install html5lib  
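
Whichever parser you install, you select it with the second argument of the BeautifulSoup constructor, as in this small sketch:

from bs4 import BeautifulSoup

markup = "<html><body><p>Hello, parser!</p></body></html>"

soup_builtin = BeautifulSoup(markup, "html.parser")   # Python's built-in parser
soup_html5 = BeautifulSoup(markup, "html5lib")        # third-party parser installed above

print(soup_builtin.p.text)   # Hello, parser!
print(soup_html5.p.text)     # Hello, parser!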

 


BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, but there are a few essential object types that are used most often:

Tag

A Tag object corresponds to an XML or HTML tag in the original document.

E.g:-

import bs4

soup = bs4.BeautifulSoup('<b class="boldtest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)

Output:

<class 'bs4.element.Tag'>

 

A tag has many attributes and methods, but the most important features of a tag are its name and attributes.

 

Name

Every tag has a name, accessible as .name:

E.g:-

tag.name
# 'b'

 

Attributes

A tag may have any number of attributes. The tag <b id="boldtest"> has an attribute "id" whose value is "boldtest". We can access a tag's attributes by treating the tag as a dictionary.

E.g:-

tag['id']

 

We can add, remove, and modify a tag's attributes. This is also done by treating the tag as a dictionary.

E.g:-

# add or modify attributes
tag['id'] = 'very-very-bold'
tag['attribute2'] = 1
print(tag)

# delete an attribute
del tag['id']

 

Multi-valued Attributes

In HTML, some attributes can have multiple values. The class attribute (which can hold more than one CSS class) is the most common multi-valued attribute. Other multi-valued attributes are rel, rev, accept-charset, headers, and accesskey.

E.g:-

from bs4 import BeautifulSoup

class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# ['body', 'strikeout']

 

NavigableString

A string in BeautifulSoup refers to text within a tag. BeautifulSoup uses the NavigableString class to contain these bits of text.

E.g:-

tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

 

A string is immutable, which means it can't be edited in place. However, it can be replaced with another string using replace_with().

E.g:-

tag.string.replace_with("No longer bold")  

print(tag)

 

In some cases, if you want to use a NavigableString outside of BeautifulSoup, you can convert it into a normal Python string using str() (or unicode() in Python 2).
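
For example, continuing with the tag from the earlier examples:

plain_text = str(tag.string)   # plain Python string, detached from the parse tree
type(plain_text)
# <class 'str'>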

 

BeautifulSoup object

The BeautifulSoup object represents the complete parsed document as a whole. In many cases, we can use it as a Tag object, which means it supports most of the methods described for navigating and searching the tree.

E.g:-

from bs4 import BeautifulSoup

# the "xml" parser requires the lxml package
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(text="INSERT FOOTER HERE").replace_with(footer)
print(doc)

Output:

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>