Learn Web Scraping with Python

GET request to fetch the raw HTML content


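Let's look at another example: we will make a GET request to a URL and create a parse tree object (soup) using BeautifulSoup and the "html5lib" parser. Note that html5lib is not part of the Python standard library; it is a separate package that must be installed (for example with pip install html5lib). The parser that ships with Python is "html.parser".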

Here we will scrape the home page at the given link (http://cfamilycomputers.com/). Consider the following code:

# Import the libraries
from bs4 import BeautifulSoup
import requests

url = "http://www.cfamilycomputers.com/"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the HTML content (requires the html5lib package to be installed)
soup = BeautifulSoup(html_content, "html5lib")
print(soup.prettify())  # print the parsed HTML, indented for readability

 

The above code prints the complete HTML of the cfamilycomputers home page.
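
The one-line request above assumes the server always answers successfully. As a hedged sketch, the fetch step could be wrapped as shown here. The helper name fetch_soup, the 10-second timeout, and the use of Python's built-in "html.parser" (instead of html5lib) are assumptions made for illustration and are not part of the original example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch url and return a BeautifulSoup tree, or None if the request fails.

    fetch_soup is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn HTTP error statuses (4xx, 5xx) into exceptions.
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    # "html.parser" ships with Python, so no extra parser package is needed.
    return BeautifulSoup(response.text, "html.parser")
```

For example, soup = fetch_soup("http://www.cfamilycomputers.com/") would return the parsed page, or None if the request could not be completed.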

Using the BeautifulSoup object (soup), we can extract the data we need. Let's start by printing the title of the web page:

print(soup.title)  

Output:

<title>CfamilyComputers :: A self-paced training portal</title>

 

In the above output, the HTML tag is included with the title. If you want the text without the tag, use the following code:

print(soup.title.text)  

Output:

CfamilyComputers :: A self-paced training portal
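
Note that soup.title is None when a page has no <title> tag, so soup.title.text would raise an AttributeError in that case. The sketch below illustrates both situations on small inline HTML strings. The sample markup, the helper name page_title, and the "html.parser" choice are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample documents, used here instead of a live page.
with_title = BeautifulSoup(
    "<html><head><title> Sample Page </title></head><body></body></html>",
    "html.parser",
)
without_title = BeautifulSoup(
    "<html><body><p>No title here</p></body></html>",
    "html.parser",
)


def page_title(soup):
    """Return the stripped title text, or None if the page has no <title> tag."""
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


print(page_title(with_title))
print(page_title(without_title))
```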

We can get every link on the page along with its attributes, such as href and title, and its inner text by looping over the anchor tags returned by soup.find_all("a").
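
As a minimal sketch of this step, the loop below visits every <a> tag returned by soup.find_all("a") and prints its inner text, title attribute, and href in the same format as the output that follows. The inline sample HTML and the "html.parser" parser are assumptions so that the snippet runs without network access; with the soup object built earlier, only the loop is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<nav>
  <a href="/">Home</a>
  <a href="/news/" title="Latest news">News</a>
  <a>No link target</a>
</nav>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# link.get(...) returns None when the attribute is missing,
# whereas link["href"] would raise a KeyError.
for link in soup.find_all("a"):
    print("Inner Text is: {}".format(link.text))
    print("Title is: {}".format(link.get("title")))
    print("href is: {}".format(link.get("href")))
```

link.text keeps any whitespace around the link text as it appears in the markup; link.get_text(strip=True) returns a trimmed version.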


Output: It will print all links along with their attributes. Here we display a few of them:

Inner Text is: CFamilyComputers
Title is: None
href is: /
Inner Text is: Home
Title is: None
href is: /
Inner Text is: News
Title is: None
href is: /news/
Inner Text is: Bigdata
Title is: None
href is: /categories/bigdata/
Inner Text is: Learn Hadoop Development
Title is: None
href is: /course/learn_hadoop_development/
Inner Text is: Learn Apache Kafka
Title is: None
href is: /course/learn_apache_kafka/
Inner Text is: Learn Spark
Title is: None
href is: /course/learn_spark/
Inner Text is: Learn Apache Spark and Scala
Title is: None
href is: /course/learn_apache_spark_and_scala/
Inner Text is: Learn Hadoop Administration
Title is: None
href is: /course/learn_hadoop_administration/
Inner Text is: DevOps
Title is: None
href is: /categories/devops/
Inner Text is: Learn Git
Title is: None
href is: /course/learn_git/
Inner Text is: Learn Jenkins
Title is: None
href is: /course/learn_jenkins/
Inner Text is: Learn Apache Maven
Title is: None
href is: /course/learn_apache_maven/
Inner Text is: Learn Docker
Title is: None
href is: /course/learn_docker/
Inner Text is: Web Development
Title is: None
href is: /categories/web_development/
Inner Text is: Learn JavaScript
Title is: None
href is: /course/learn_javascript/
Inner Text is: Learn CSS
Title is: None
href is: /categories/scala_and_scala_frameworks/
Inner Text is: Scala and Scala Frameworks
Title is: None
href is: /categories/scala_and_scala_frameworks/
Inner Text is: CFamily Channel
Title is: None
href is: https://www.youtube.com/cfamily
Inner Text is: PythonAcademy Channel
Title is: None
href is: https://www.youtube.com/pythonacademyIN
Inner Text is: TechPrimers
Title is: None
href is: http://www.techprimers.co/
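
Many of the href values above are relative paths such as /course/learn_git/. If absolute URLs are needed, one possible approach is urllib.parse.urljoin from the standard library. In this sketch the base URL and the short hrefs list are taken from the example above, while the variable names are illustrative.

```python
from urllib.parse import urljoin

base_url = "http://www.cfamilycomputers.com/"
# A few href values copied from the output above.
hrefs = ["/", "/course/learn_git/", "https://www.youtube.com/cfamily"]

# urljoin resolves relative paths against base_url
# and leaves absolute URLs unchanged.
absolute_urls = [urljoin(base_url, href) for href in hrefs]
for absolute_url in absolute_urls:
    print(absolute_url)
```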