Web scraping is the process of using bots to extract content and data from a website.
We're going to make use of Beautiful Soup, lxml and requests.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
lxml is a Python library for easy handling of XML and HTML files; it can also be used for web scraping and for parsing files or data.
The requests library is the de facto standard for making HTTP requests in Python
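Before we set anything up, here's a minimal sketch of what Beautiful Soup does with an HTML string (this one uses Python's built-in html.parser so it runs even without lxml installed; the rest of the tutorial uses lxml):

```python
from bs4 import BeautifulSoup

# A tiny HTML snippet to parse.
html = "<html><head><title>Hello</title></head><body><p>First paragraph</p></body></html>"

# html.parser is Python's built-in parser.
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.text)  # Hello
print(soup.p.text)      # First paragraph
```

As you can see, once the HTML is parsed you can reach into tags by name and pull out their text.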
Now on to our project. First we have to install those libraries.
For Beautiful Soup you can run py -m pip install bs4 or pip install beautifulsoup4. Then:
pip install lxml
pip install requests
## Scraping from our local HTML file
from bs4 import BeautifulSoup
Remember, we're getting our HTML file from our local PC, so we don't need the requests library.
Here's our local HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample website</title>
</head>
<body>
    <h1 id="site_title">Test Website</h1>
    <hr><hr>
    <div class="article">
        <h2><a href="#">Article</a></h2>
        <p>This is the summary of article</p>
    </div>
    <hr><hr>
    <div class="article">
        <h2><a href="#">Article2</a></h2>
        <p>This is the summary of article2</p>
    </div>
    <footer class="footer">
        Footer information
    </footer>
</body>
</html>
Back to our code.
from bs4 import BeautifulSoup

# Open our local HTML file.
with open('sample.html') as html_file:
    # Notice we pass the "lxml" parser as the second argument.
    soup = BeautifulSoup(html_file, 'lxml')
If you go ahead and print soup, you'll get the whole HTML file, but not in a readable layout. So we use prettify() to print the file in a well-arranged way.
Back to our code.

from bs4 import BeautifulSoup

# Open our local HTML file.
with open('sample.html') as html_file:
    # Notice we used the "lxml" parser.
    soup = BeautifulSoup(html_file, 'lxml')

print(soup.prettify())
Let's assume we want just the first article. Then we make use of find(). find() returns only the first tag matching the name and attributes passed to it.
Back to our code.

from bs4 import BeautifulSoup

# Open our local HTML file.
with open('sample.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Remember, the articles are inside a **div** with a class name of **"article"**.
first_art = soup.find('div', class_="article")
print(first_art.prettify())
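One thing worth knowing: find() returns None when nothing matches, so calling .text or .prettify() on a missed lookup raises an AttributeError. Here's a small sketch of a guard, using a deliberately made-up class name to force a miss:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="article"><p>Hi</p></div>', 'html.parser')

# "no-such-class" doesn't exist in the snippet, so find() returns None.
missing = soup.find('div', class_="no-such-class")

if missing is not None:
    print(missing.text)
else:
    print("No matching div found")
```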
Now we want to get all the articles, so we use find_all(). It returns every matching tag as a list. We'll also loop through the articles to print each title and summary.
from bs4 import BeautifulSoup

# Open our local HTML file.
with open('sample.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Remember, the articles are inside a **div** with a class name of **"article"**.
for article in soup.find_all('div', class_="article"):
    headline = article.h2.text
    print(headline)

    summary = article.p.text
    print(summary)
    print()
The reason we use .text is that we want to display the actual text without the surrounding HTML tags.
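To see the difference, here's a small sketch contrasting the tag object itself with its .text (using an inline snippet and the built-in html.parser rather than the sample file):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h2><a href="#">Article</a></h2>', 'html.parser')

print(soup.h2)       # the full tag, markup included
print(soup.h2.text)  # just the text: Article
```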
Here's the same approach on a hosted website, but this time we use requests to pull the data.
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.bellanaija.com/').text
soup = BeautifulSoup(source, 'lxml')

artTitle = soup.find_all('div', class_='mvp-feat1-pop-cont')

filename = "blogpost.csv"
file = open(filename, 'w')
file.write("Title\n")

for i in artTitle:
    title = i.find('div', class_="mvp-feat1-pop-text").text
    file.write(title + "\n")

file.close()
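A note on the CSV step: writing raw strings with file.write() breaks as soon as a title contains a comma or a newline. Python's built-in csv module handles the quoting for you. Here's a sketch of the same loop rewritten with it, run against an inline snippet that mimics the site's structure (in the real script this HTML would come from requests.get(...).text, and the class names may change whenever the site is redesigned):

```python
import csv
from bs4 import BeautifulSoup

# Inline HTML standing in for the live page, for illustration only.
html = '''
<div class="mvp-feat1-pop-cont">
  <div class="mvp-feat1-pop-text">First post, with a comma</div>
</div>
<div class="mvp-feat1-pop-cont">
  <div class="mvp-feat1-pop-text">Second post</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

with open('blogpost.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title'])
    for block in soup.find_all('div', class_='mvp-feat1-pop-cont'):
        title_div = block.find('div', class_='mvp-feat1-pop-text')
        if title_div is not None:  # skip blocks without a title div
            writer.writerow([title_div.text.strip()])
```

The with statement also closes the file for us, so there's no separate file.close() call to forget.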