# Web scraping in Python

Web scraping is the process of using automated scripts (bots) to extract content and data from websites.

We're going to make use of Beautiful Soup, lxml, and Requests.

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

lxml is a Python library that makes it easy to handle XML and HTML files; here we'll use it as the parser behind Beautiful Soup.

The requests library is the de facto standard for making HTTP requests in Python.
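As a quick sketch of what a request looks like (example.com is only a stand-in URL for illustration):

```python
import requests

# Fetch a page and check the HTTP status before parsing anything.
response = requests.get('https://example.com', timeout=10)
print(response.status_code)   # 200 means success
html = response.text          # the raw HTML as one string
print(html[:50])              # peek at the start of the page
```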

Now to our project. We have to install those libraries:

```shell
pip install beautifulsoup4
pip install lxml
pip install requests
```

(On Windows you can also run these as `py -m pip install ...`.)

## Scraping from our local HTML

```python
from bs4 import BeautifulSoup
```

Remember, we're getting our HTML file from our local PC, so we don't need the requests library here.

And here's our local HTML file (`sample.html`):

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample website</title>
</head>
<body>
    <h1 id="site_title">Test Website</h1>
    <hr><hr>
    <div class="article">
        <h2><a href="#">Article</a></h2>
        <p>This is the summary of article</p>
    </div>
    <hr><hr>
    <div class="article">
        <h2><a href="#">Article2</a></h2>
        <p>This is the summary of article2</p>
    </div>

    <footer class="footer">
        Footer information
    </footer>

</body>
</html>
```

Back to our code.

```python
from bs4 import BeautifulSoup

# Open our local HTML file and parse it.
with open('sample.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    # Notice we used "lxml" as our parser.
```

If you go ahead and print soup, you'll get the whole HTML file, but not in a readable layout. So we use prettify() to print the file with proper indentation.

Back to our code:

```python
from bs4 import BeautifulSoup

with open('sample.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

print(soup.prettify())
```

Let's assume we want just the first article. Then we're going to make use of find(). find() returns only the first tag that matches the name and attributes you pass to it.

Back to our code:

```python
from bs4 import BeautifulSoup

with open('sample.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Remember, the articles are inside a div with a class name of "article".
first_art = soup.find('div', class_="article")
print(first_art.prettify())
```

Now we want to get all the articles, so we're going to use find_all(), which returns every matching tag as a list. We'll also loop through the articles to print each title and summary.

```python
from bs4 import BeautifulSoup

with open('sample.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Loop over every div with a class name of "article".
for article in soup.find_all('div', class_="article"):
    headline = article.h2.text
    print(headline)

    summary = article.p.text
    print(summary)

    print()
```

The reason we use .text is that we want to display the actual text content without the surrounding HTML tags.
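To make the difference concrete, here's a tiny self-contained sketch (using the built-in html.parser here so it runs even without lxml installed):

```python
from bs4 import BeautifulSoup

snippet = "<h2><a href='#'>Article</a></h2>"
soup = BeautifulSoup(snippet, 'html.parser')

print(soup.h2)        # the whole tag: <h2><a href="#">Article</a></h2>
print(soup.h2.text)   # just the text: Article
```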

Here's the same approach on a hosted website, but this time we use requests to pull the data:

```python
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.bellanaija.com/').text

soup = BeautifulSoup(source, 'lxml')
artTitle = soup.find_all('div', class_='mvp-feat1-pop-cont')

filename = "blogpost.csv"
file = open(filename, 'w')
headers = "Title\n"

file.write(headers)

for i in artTitle:
    title = i.find('div', class_="mvp-feat1-pop-text").text
    file.write(title + "\n")
file.close()
```
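One caveat: writing raw titles straight into a .csv file breaks as soon as a title contains a comma or a newline. A safer sketch uses Python's built-in csv module. The class names below are the same ones assumed above, and the HTML string is a made-up stand-in so the example is self-contained:

```python
import csv
from bs4 import BeautifulSoup

# A made-up fragment mimicking the site's structure, so this runs offline.
html = """
<div class="mvp-feat1-pop-cont">
  <div class="mvp-feat1-pop-text">First headline, with a comma</div>
</div>
<div class="mvp-feat1-pop-cont">
  <div class="mvp-feat1-pop-text">Second headline</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

with open('blogpost.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title'])   # header row
    for cont in soup.find_all('div', class_='mvp-feat1-pop-cont'):
        title = cont.find('div', class_='mvp-feat1-pop-text').text.strip()
        writer.writerow([title])  # csv handles quoting of commas for us
```

Using `with open(...)` also closes the file automatically, so there's no need for an explicit `file.close()`.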