Automatically Summarizing Online Articles using Python and BART Model

Jayson Gent

Today, we’re automating the summarization of online articles using Python and the BART model from Hugging Face. Let’s dive right in!

Getting Started

Make sure you have the torch, transformers, coloredlogs, requests, and beautifulsoup4 libraries installed. You can do this using pip:

pip install torch
pip install transformers
pip install coloredlogs
pip install requests
pip install beautifulsoup4

Initializing the Summarizer and Setting Up Our Logger

We begin by clearing the CUDA cache, then checking whether we can run our operations on a GPU or need to fall back to the CPU. The pipeline function creates our summarizer:

import torch
import logging
import coloredlogs
from transformers import pipeline

# Initialize the logger with colors
logger = logging.getLogger(__name__)
coloredlogs.install(level='INFO', logger=logger)

# Free any cached GPU memory, then pick the best available device
torch.cuda.empty_cache()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)

The model behind the pipeline above, facebook/bart-large-cnn, is a BART model.
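
With the pipeline in place, summarizing any string is a one-liner. Here’s a quick sanity check — the sample text below is just a placeholder you’d swap for your own:

text = (
    "The James Webb Space Telescope has captured new images of distant galaxies, "
    "giving astronomers an unprecedented look at the early universe. Researchers "
    "say the data will take years to analyze in full."
)
# Returns a list of dicts; the summary lives under "summary_text"
print(summarizer(text, max_length=60, min_length=10, do_sample=False)[0]["summary_text"])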

BART stands for Bidirectional and Auto-Regressive Transformers. It’s a model developed by Facebook’s AI team. Unlike traditional Transformer models that process input sequences in one direction (either left-to-right or right-to-left), BART processes inputs in both directions, which allows for a richer understanding of context.

The ‘auto-regressive’ part refers to the way BART generates output sequences: it predicts each subsequent token based on the ones it has already generated, in addition to the input.

BART performs strongly across a variety of tasks, including text generation, translation, and summarization. For our use case, we’re leveraging a variant, bart-large-cnn, which has been fine-tuned on the CNN/Daily Mail dataset specifically for summarizing news articles.

This model has been made available through the Hugging Face transformers library, a popular repository of pre-trained models for natural language processing tasks.
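
For the curious, the pipeline is a thin convenience wrapper. Here’s a rough sketch of the same summarization done with the tokenizer and model directly, which makes the auto-regressive generate() step explicit (article_text is assumed to be a variable holding the text you want to condense):

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Encode the article, truncating to BART's 1024-token input limit
# (article_text is a placeholder for your own string)
inputs = tokenizer(article_text, max_length=1024, truncation=True, return_tensors="pt")

# generate() builds the summary one token at a time, i.e. auto-regressively
summary_ids = model.generate(inputs["input_ids"], max_length=130, min_length=30, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))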

Building the Summarizer Function

Next, we’ll define a function called summarize_news(). It takes the URL of a news article, along with our summarizer, and, well, summarizes it!

    def summarize_news(url, summarizer, max_length=130, min_length=30):
        article_text = scrape_website(url)
        if not article_text:
            logger.warning(f"No text scraped from url: {url}. Skipping summarization.")
            return None
        try:
            # Split the article text into 1024-character chunks, a rough
            # character-level proxy for the model's 1024-token input limit
            chunks = [article_text[i:i + 1024] for i in range(0, len(article_text), 1024)]

            # Summarize each chunk separately
            summaries = []
            for idx, chunk in enumerate(chunks, start=1):
                logger.info(f"Summarizing chunk {idx} of {len(chunks)}")
                summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)[0]["summary_text"]
                summaries.append(summary)

            # Combine all chunk summaries into one summary
            combined_summary = ' '.join(summaries)

            return combined_summary
        except Exception as e:
            logger.error(f"Error occurred during summarization: {e}")
            return None
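
Slicing by characters works, but it can cut words in half and typically under-fills BART’s context window (1,024 characters is far fewer than 1,024 tokens). If you want tighter chunks, here’s a minimal sketch of token-based chunking that reuses the pipeline’s own tokenizer — the helper name chunk_by_tokens is mine, and the 1,000-token budget simply leaves headroom for special tokens:

    def chunk_by_tokens(text, tokenizer, max_tokens=1000):
        # Encode once, then slice the token ids into model-sized windows
        ids = tokenizer.encode(text, add_special_tokens=False)
        return [tokenizer.decode(ids[i:i + max_tokens], skip_special_tokens=True)
                for i in range(0, len(ids), max_tokens)]

    # Drop-in replacement for the character-based list comprehension above
    chunks = chunk_by_tokens(article_text, summarizer.tokenizer)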

The Scraper Function

To fetch content from the website, we have another function, scrape_website(). It sends a GET request to the provided URL, and if the response is a success (HTTP status 200), it uses BeautifulSoup to parse the HTML.

    import re

    import requests
    from bs4 import BeautifulSoup

    def scrape_website(url):
        logger.info(f"Scraping {url}...")
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            relevant_text = []
            elements = soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'strong', 'em'])
            for element in elements:
                text = element.get_text().strip()
                if len(text) > 20 and not re.search(r'^\d+\s*$', text):
                    relevant_text.append(text)
            if relevant_text:
                article_text = ' '.join(relevant_text)
                return article_text
            else:
                logger.error("Failed to find relevant text.")
                return None
        elif response.status_code == 404:
            logger.error(f"Website not found: {url}")
            return None
        else:
            logger.error(f"Failed to fetch website. Status code: {response.status_code}")
            return None

Adjustments may be needed to the script above depending on the site you intend to scrape.
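
For example, some sites reject requests that lack a browser-like User-Agent header, and a slow server can otherwise hang the script indefinitely. A hedged tweak for the requests.get call (the header string is purely illustrative):

headers = {"User-Agent": "Mozilla/5.0 (compatible; ArticleSummarizer/1.0)"}
response = requests.get(url, headers=headers, timeout=10)

With both functions in place, summarizing an article takes just two lines: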

url = "https://www.example.com/news/article"
print(summarize_news(url, summarizer))

And there you have it! An automatic article summarizer built with Python. Feel free to tweak and customize it according to your needs. Happy coding!

Drowning in data but not sure how to make sense of it all? You’re not alone, and I am here to help! At Epoch Insights, I turn your data into actionable insights that drive decision-making. Don’t wait for tomorrow to unlock the power of your data. Start your journey towards data enlightenment by booking a consultation! I can’t wait to help you thrive in this data-driven world.