Qayyum Siddiqui

Web Scraper using Flask in Python

#python

Hi Everyone, I hope you are doing well. In this blog post, we will be building a web scraper using Flask. Flask is a small and lightweight Python web framework that provides useful tools and features that make creating web applications in Python easier. In other words, the minimalist's dream and the procrastinator's best friend! Imagine if building a web app was like assembling IKEA furniture, but instead of 57 confusing steps, you get a single sheet that says, "Here's some wood, nails, and a hammer. Go nuts!" That's Flask for you.

It's like the Swiss Army knife of Python frameworks – small enough to fit in your pocket (or a single Python file), yet packed with enough tools to whip up a web app faster than you can say "over-engineered". Plus, it's so flexible, it's practically doing yoga. You want to structure your project like a maze? Go for it! Prefer it as chaotic as a toddler's toy box? Flask won't judge.

In short, if web frameworks were pets, Flask would be that easy-going goldfish that doesn't mind if you forget to feed it for a day or two. It's there to help, not to dictate your life choices. Happy coding, or should I say, happy Flask-ing!

One more advantage is that we can build both the frontend and the backend: Flask uses the Jinja template engine to dynamically build HTML pages using familiar Python concepts such as variables, loops, lists, and so on. You'll use these templates as part of this project.
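To get a feel for what Jinja brings before we use it in the project, here is a minimal sketch of rendering a variable and a loop. (It uses the jinja2 package directly, which Flask installs as a dependency; inside Flask you'd normally go through render_template instead.)

```python
# Minimal Jinja sketch: the same engine Flask's render_template uses.
from jinja2 import Template

template = Template(
    "<h1>{{ heading }}</h1>"
    "<ul>{% for item in items %}<li>{{ item }}</li>{% endfor %}</ul>"
)

# Python values flow straight into the HTML.
html = template.render(heading="Tools", items=["variables", "loops", "lists"])
print(html)
# <h1>Tools</h1><ul><li>variables</li><li>loops</li><li>lists</li></ul>
```

The `{{ ... }}` placeholders are substituted with Python values and `{% for %}` works just like a Python loop, which is exactly how our index.html template will be rendered later.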

In this tutorial, we'll build a small web scraper that scrapes data from a provided URL. We will build a simple form for entering the URL using the Jinja template engine, and we will create a route that sends the request to the backend for processing and returns the response to the frontend.

Prerequisites

Before we start, we need:

  • A local Python programming environment; you can install Python from https://www.python.org/
  • Basic knowledge of the Python programming language. Even if you are new to Python, this tutorial is very beginner friendly.

Setup your environment

1. Setup

  • First of all, create a folder with any name; I'm calling mine web-scraper.

  • Then we need to set up a virtual environment to isolate our dependencies.

  1. Python Version: If you're using Python 3, virtualenv might not be installed by default. Python 3 uses venv instead.

  2. Installation of virtualenv: If you prefer virtualenv, you might need to install it first.

Here's how you can tackle this:

  • For Python 3's venv:

    python -m venv env
  • If you want to use virtualenv:

    • Install it using pip:
      pip install virtualenv
    • Then create your environment:
      virtualenv env

I'm using python -m venv env. When we run this command in our terminal, it creates an env folder which will hold all our project dependencies.

2. Isolate project dependencies

Excellent! You've successfully navigated the treacherous waters of virtual environment creation. Now, you're all set to isolate your project's dependencies. Here's what you can do next:

  • Activate your environment:
    • On Windows:

      env\Scripts\activate
      This activates the virtual environment; your prompt should now be prefixed with (env).
    • On macOS/Linux:

      source env/bin/activate

Remember, with great virtual environments comes great responsibility... or at least, a cleaner project setup. Enjoy your Flask adventure, and may your code be bug-free and your environments always isolated!

3. Flask Installation

  • Install Flask:
    pip install Flask

4. Check if the installation was successful

python -c "import flask; print(flask.__version__)"

#output (your Flask version may differ)
3.0.3

5. Start your Flask app

Create a new file with any name; I'll call it server.py.

# From the flask package, import the Flask class
from flask import Flask
# Create an app instance of the Flask class
app = Flask(__name__)

# This is our default route
@app.route('/')
def hello_world():
    return 'Flask App is running, Well done'

if __name__ == '__main__':
    app.run(debug=True)

Then run:

python server.py (replace server.py with your filename)

Go to http://127.0.0.1:5000 in your browser; you should see the message from our default route.

Now write another route, /hi:

@app.route('/hi')
def greet():
    return "Hi, How are you?"

Restart the server with python server.py and go to http://127.0.0.1:5000/hi.

Now that our backend server is working, instead of just returning plain text, let's render HTML templates using the Jinja template engine. First, let's create a templates folder and then an index.html file inside it.

Run these commands:

mkdir templates
touch templates/index.html

Templates (Frontend)

Let's create an index.html template to render on the UI:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Scraper</title>
    <!-- Removed jQuery since it's not used in the script -->
</head>

<body>
    <h1>Web Scraper</h1>
    <form id="scrapeForm">
        <label for="url">Enter URL:</label>
        <input type="url" id="url" name="url" required>
        <button type="submit">Scrape Data</button>
    </form>
    <div id="result">Result : </div>

    <script>
        function isValidUrl(string) {
            try {
                new URL(string);
                return true;
            } catch (_) {
                return false;
            }
        }

        document.getElementById('scrapeForm').addEventListener('submit', function (e) {
            e.preventDefault();
            var url = document.getElementById('url').value;

            try {
                if (isValidUrl(url)) {
                    fetch('/scrape', {
                        method: 'POST',
                        headers: {
                            'Content-Type': 'application/x-www-form-urlencoded',
                        },
                        body: new URLSearchParams({ url: url }),
                    })
                        .then(response => {
                            if (!response.ok) {
                                throw new Error('Network response was not ok');
                            }
                            return response.json();
                        })
                        .then(data => {
                            document.getElementById('result').innerHTML = data.title ? `<h2>${data.title}</h2>` : '<p>No title found or an error occurred.</p>';
                        })
                        .catch(error => {
                            console.error('Error in fetch:', error);
                            document.getElementById('result').innerHTML = '<p>Error: ' + (error.message || error.toString()) + '</p>';
                        });
                } else {
                    throw new Error('Invalid URL entered');
                }
            } catch (error) {
                console.error('Unexpected error:', error);
                document.getElementById('result').innerHTML = '<p>Unexpected Error: ' + error.message + '</p>';
            }
        });
    </script>
</body>

</html>

Let's break down what's happening in this HTML code with a bit of humor:

The HTML Structure

  • Head Section:

    • Sets up the document's metadata, like character encoding and viewport settings. It's like setting the stage before the play starts.
  • Body Section:

    • Here's where the action happens. You've got your title, a form for user input, and a spot for results.

The Form

  • <form id="scrapeForm">: This is where users will enter a URL they want to scrape. It's like a magic portal where you type in a URL, and poof! Data appears.

  • <input type="url" id="url" name="url" required>: This is where users input the URL. The required attribute means, "Hey, you can't leave this blank!"

  • <button type="submit">Scrape Data</button>: When clicked, this button submits the form. It's the "Let's do this!" button of our web scraper.

The JavaScript Magic

  • Function isValidUrl: Before we do anything, we check if what the user entered is actually a URL. If not, it's like trying to start a car with a banana instead of a key.

  • Event Listener on Form Submission:

    • e.preventDefault(); stops the form from doing its default action (which would be to reload the page). We want to handle this ourselves.
  • URL Validation:

    • If the URL is valid, we proceed. If not, it's like trying to bake a cake with no oven.
  • Fetch Request:

    • POST to /scrape: We send the URL to our Flask backend via a POST request. It's like sending a letter to your friend asking for the latest gossip from a specific website.
    • Headers and Body: We set up how we're sending this data. Think of it as packaging your letter in an envelope with the right address.
  • Handling the Response:

    • .then(response => response.json()): We're waiting for a response from our server, expecting it to be JSON data.
    • Displaying Results: If we get a title, we show it in a <h2>. If not, we display an error message. It's like opening your friend's letter and either finding juicy gossip or a note saying, "Sorry, no gossip today."
  • Error Handling:

    • If anything goes wrong with the network request or if the URL isn't valid, we catch the error and display it. It's like having a backup plan when your friend doesn't reply or sends back gibberish.

Summary

This HTML page sets up a simple user interface for a web scraper. Users enter a URL, and through JavaScript, it sends this URL to a Flask backend for processing. The backend then sends back data (in this case, just the title), which is displayed on the page. If anything goes awry, error messages are shown to keep users informed. It's like sending a request to a friend to check out a place for you and report back, with some techy magic in between!

Now update the server.py code. Note the two new imports below; install those packages first with pip install beautifulsoup4 requests.

from flask import Flask, request, jsonify, render_template
from bs4 import BeautifulSoup
import requests

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/hi')
def greet():
    return "Hi, How are you?"

@app.route('/scrape', methods=['POST'])
def scrape():
    url = request.form.get('url')
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes
        soup = BeautifulSoup(response.text, 'html.parser')
        data = soup.prettify()
        # Here you would typically parse the soup to extract data
        # For simplicity, let's just return the title of the page
        title = soup.title.string if soup.title else "No title found"
        return jsonify({"title": title})
    except requests.RequestException as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)

In the above code, if you notice, I've used render_template('index.html'), which automatically looks for the templates folder and the index.html file inside it and renders it on the UI.

Here's what's happening in this Flask route ("/scrape"):

  • Route Declaration:

    @app.route('/scrape', methods=['POST'])

    This line tells Flask to listen for POST requests at the /scrape URL. It's like setting up a cosmic mailbox where only specific letters (POST requests) get delivered.

  • Function Definition:

    def scrape():

    This is where the magic happens. It's like opening the letter and starting to process it.

  • Getting the URL:

    url = request.form.get('url')

    Here, we're grabbing the URL from the form data sent in the POST request. Think of it as pulling out the address from the envelope.

  • The Try Block:

    try:

    This is like putting on a safety helmet before diving into potentially dangerous territory.

    • Fetching the Webpage:

      response = requests.get(url)

      We're sending out a space probe (or in this case, a requests.get) to fetch the webpage.

    • Checking if the Probe Returned:

      response.raise_for_status()

      This checks if our space probe got back in one piece. If not, it throws a tantrum (raises an exception).

    • Parsing the HTML:

      soup = BeautifulSoup(response.text, 'html.parser')

      Here, BeautifulSoup comes in like a cosmic chef, turning raw HTML into a delicious soup of structured data.

    • Prettifying the Soup:

      data = soup.prettify()

      This step is like formatting the soup into an aesthetically pleasing dish, though in this case, we're not using it further.

    • Extracting the Title:

      title = soup.title.string if soup.title else "No title found"

      We're looking for the title of the webpage, which is like finding the name of the dish. If there's no title, we're just saying, "This dish is unnamed."

    • Sending Back the Data:

      return jsonify({"title": title})

      We wrap up the title in JSON, like packaging our dish for delivery back to the user.

  • The Except Block:

    except requests.RequestException as e:

    If anything goes wrong in our cosmic journey (like the probe getting lost or eaten by a black hole), we catch the error.

    • Error Handling:
      return jsonify({"error": str(e)}), 500
      We send back an error message, like a note saying, "Sorry, your space dish couldn't be delivered due to unforeseen cosmic events."

In essence, this Flask route is a miniature space mission: we launch a probe to fetch data, process it, and return with the spoils (or an error if things go awry). It's like ordering takeout from space, but instead of food, you get web data!
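The comment in our route, "Here you would typically parse the soup to extract data", hints at going further than the title. As an illustrative sketch (the HTML below is a canned example, not fetched from anywhere), BeautifulSoup's find_all can pull every link out of a page:

```python
from bs4 import BeautifulSoup

# Canned HTML standing in for response.text, so this sketch runs offline.
html = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <a href="/about">About</a>
    <a href="https://example.com">Example</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
title = soup.title.string if soup.title else "No title found"

# find_all('a', href=True) returns every anchor tag that has an href attribute.
links = [{"text": a.get_text(strip=True), "href": a["href"]}
         for a in soup.find_all('a', href=True)]

print(title)   # Demo Page
print(links)
```

Dropping something like this into the scrape() route (and returning links in the jsonify payload) is one easy way to extend the scraper beyond the page title.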

Now, I typed the URL https://www.qayyumsiddiqui.in, clicked the Scrape Data button, and the output is my site's title.
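If you'd like to verify the /scrape contract without a browser (or without network access), Flask's built-in test client works nicely. The sketch below mirrors the route from server.py, but parses a canned page instead of calling requests.get, so it runs offline; it's an illustration, not the project code itself:

```python
from flask import Flask, request, jsonify
from bs4 import BeautifulSoup

app = Flask(__name__)

# Canned page standing in for requests.get(url).text, so the check is offline.
CANNED_HTML = "<html><head><title>Qayyum Siddiqui</title></head><body></body></html>"

@app.route('/scrape', methods=['POST'])
def scrape():
    url = request.form.get('url')  # same form field the frontend sends
    soup = BeautifulSoup(CANNED_HTML, 'html.parser')
    title = soup.title.string if soup.title else "No title found"
    return jsonify({"title": title})

# Exercise the endpoint exactly as the frontend's fetch() call would.
with app.test_client() as client:
    resp = client.post('/scrape', data={'url': 'https://www.qayyumsiddiqui.in'})
    print(resp.get_json())
    # {'title': 'Qayyum Siddiqui'}
```

The test client sends the same application/x-www-form-urlencoded body the JavaScript builds with URLSearchParams, so what you see here matches what the browser gets back.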

Conclusion:

If you have noticed, we wrote the entire application in just two files, server.py and templates/index.html. This is the best part about Flask: it's easy to build with. Watch this space, as I'm going to add more functionality to it.

Source code:

Github Source Code : Web Scraper - Flask