Deep learning for text summarization using NLP for automated news digest

In this study, we evaluated how well four state-of-the-art deep learning models perform at producing accurate and succinct summaries of a news dataset. The models used in this paper are T5-base, T5-large, BART-large-CNN, and PEGASUS-large. Before delving into the detailed analysis of these models, we discuss the evaluation metrics used to identify the most effective model for text summarization. Specifically, we used ROUGE and BLEU scores as our primary metrics for performance assessment. These metrics and how we applied them in our study are examined in the sections below.

ROUGE score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores are essential for assessing how effectively a text summarization algorithm works. These metrics serve as benchmarks for the quality of summarization algorithms by objectively measuring the degree to which machine-generated summaries align with human-written summaries. Each ROUGE score ranges from 0 to 1; the closer the score is to 1, the more accurately the generated summary matches the reference. In the field of NLP and summarization, ROUGE scores are used to validate the efficacy of proposed methods, ensuring transparency and reproducibility of research findings.

In this paper, we used the following three ROUGE metrics:

  1. ROUGE-1 (unigram overlap) ROUGE-1 measures the overlap of unigrams (single words) between the generated summary and the reference summary. Precision, recall, and the F1-score are computed from the count of overlapping unigrams (the F1 formula is given after this list).

    $$\text{Precision} = \frac{\text{Number of overlapping unigrams}}{\text{Number of unigrams in generated summary}}$$

    $$\text{Recall} = \frac{\text{Number of overlapping unigrams}}{\text{Number of unigrams in reference summary}}$$

  2. ROUGE-2 (bigram overlap) ROUGE-2 measures the overlap of bigrams (pairs of adjacent words) between the reference summary and the generated summary, broadening the assessment to capture additional contextual information.

  3. ROUGE-L (longest common subsequence) ROUGE-L measures the longest common subsequence (LCS) between the reference summary and the generated summary. Instead of concentrating on exact word overlap, it considers the longest sequence of words that appears in both summaries.
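As noted in item 1 above, precision and recall are combined into the F1-score in the standard way:

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$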

By evaluating the degree to which generated summaries match summaries authored by humans, ROUGE scores provide insight into the quality of these summaries. Higher ROUGE scores indicate better summarization performance, and each metric offers a distinct perspective on the summarization process. ROUGE-2 builds on ROUGE-1's surface-level lexical overlap by including bigrams and thus more contextual information. By analysing the longest common subsequence, ROUGE-L places a strong emphasis on content overlap, which is essential for accurately summarising the original text. Taken together, these metrics are essential for directing the development and assessment of text summarization systems, since they offer an objective measure of summary quality.
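As an illustrative sketch (assuming the open-source `rouge_score` package, not necessarily the exact tooling used in this study), the three ROUGE variants can be computed as follows; the example texts are placeholders:

```python
# Compute ROUGE-1, ROUGE-2, and ROUGE-L for a generated summary.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "The central bank raised interest rates to curb rising inflation."  # placeholder
generated = "The bank raised rates to fight inflation."                         # placeholder

# use_stemmer=True reduces words to their stems before matching.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # score(target, prediction)

for name, s in scores.items():
    # Each entry carries precision, recall, and F1 (fmeasure), all in [0, 1].
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```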

BLEU score

BLEU (Bilingual Evaluation Understudy) scores are a standard means of evaluating the text quality of machine translation and text summarization systems. They evaluate the degree to which the generated text is comparable to reference texts, which are usually summaries or translations provided by humans. Although BLEU is more frequently associated with machine translation, text summarization also makes use of it. The scores range from 0 to 1, where 1 indicates a perfect match between the generated summary and the reference summary; however, achieving a score close to 1 is very rare.

Role in Text Summarization:

  • Precision-based metric When comparing the generated summary with the reference summaries, BLEU analyses the precision of n-grams, i.e. sequences of n words. This measures how many n-grams in the generated text match those in the reference text.

  • Multi-n-gram analysis By considering unigrams, bigrams, trigrams, and higher-order n-grams, BLEU captures both surface-level vocabulary matches and more complex contextual matches. This offers a well-rounded assessment of the substance and flow of the summary.

BLEU scores offer a widely accepted standard for evaluating the efficacy of different summarization methods, making it possible to compare approaches and improve the field.

$$\text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} \omega_{n} \log p_{n}\right)$$

where BP is the brevity penalty, which penalises generated texts that are shorter than the reference; $p_{n}$ is the modified n-gram precision for order $n$; and $\omega_{n}$ is the weight assigned to each order, typically uniform with $\omega_{n} = 1/N$.
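As a minimal sketch (assuming NLTK, which provides a standard sentence-level BLEU implementation), the score can be computed as follows; smoothing is applied because short summaries often contain no higher-order n-gram matches, and the example texts are placeholders:

```python
# Sentence-level BLEU with uniform weights over 1- to 4-grams.
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the central bank raised interest rates to curb inflation".split()  # placeholder
generated = "the bank raised rates to fight inflation".split()                  # placeholder

# Smoothing avoids a zero score when some n-gram orders have no matches.
smooth = SmoothingFunction().method1
score = sentence_bleu(
    [reference],                       # list of reference token lists
    generated,
    weights=(0.25, 0.25, 0.25, 0.25),  # uniform w_n = 1/N with N = 4
    smoothing_function=smooth,
)
print(f"BLEU: {score:.4f}")
```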

T5-base

Text summarization is greatly influenced by the T5 (Text-to-Text Transfer Transformer) model, particularly in its T5-base version, because of its robust and flexible architecture. Created by Google Research, T5 standardises NLP tasks into a text-to-text format, making a variety of linguistic operations, including summarization, more efficient. During pre-training, the model learns to reconstruct masked text spans of varying length, which enhances its contextual and semantic understanding. The model's ability to produce clear and coherent summaries is then optimised through fine-tuning on specific summarization datasets.

Its role in text summarization lies in handling long inputs, generating summaries, customisation, and versatility. By recognising connections and relationships within the text, the transformer architecture of T5-base effectively handles extended text sequences, which is essential for summarization tasks. Its encoder-decoder architecture, strengthened by intensive pre-training, enables it to generate summaries that accurately capture the major ideas of the original text. It is also flexible enough to be used across a variety of domains and can be adapted to varied summarization demands, such as summarising news articles. With its transformer-based design and unified text-to-text approach, T5-base can handle lengthy texts, produce coherent output, and adjust to diverse tasks, making it an important tool in NLP.
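As a minimal usage sketch (assuming the Hugging Face transformers library and its t5-base checkpoint, a plausible but not confirmed detail of our setup), summarization with T5-base looks like the following; note the "summarize: " task prefix expected by T5's text-to-text format:

```python
# Minimal T5-base summarization sketch using Hugging Face transformers.
# Requires: pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

article = "Placeholder news article text goes here..."  # placeholder input

# T5 casts every task as text-to-text, so summarization uses a task prefix.
inputs = tokenizer("summarize: " + article, return_tensors="pt",
                   max_length=512, truncation=True)
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=128,     # cap on summary length in tokens
    num_beams=4,        # beam search for more coherent output
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```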

Initial testing of the dataset using the T5-base model produced a moderate baseline result, with an average ROUGE score of approximately 0.29 and a BLEU score of 0.0431. Table 3 shows the average scores obtained during testing. The scores improved considerably after the model was fine-tuned on a task-specific dataset, demonstrating how fine-tuning helps the model accurately capture and reproduce important aspects of the source text in its summaries. These outcomes show how well the T5-base model can summarise text, particularly when adapted to the unique features of the dataset, producing summaries that are more precise and coherent.

Table 3 Pre-training for T5-base.

T5-large

A member of the transformer family, the T5-large model was developed by Google Research. The T5 architecture treats every NLP task as a text-to-text task, meaning that text strings are consistently used for both the input and the output. This approach allows a single framework to serve a variety of tasks, including text summarization, question answering, and translation. With 24 layers and 770 million parameters, T5-large outperforms smaller models and is capable of handling complex language interpretation and generation tasks. Using a denoising auto-encoding objective, the model is pre-trained on a large corpus of diverse text data, where it learns to reconstruct missing tokens in a corrupted input sequence. This extensive pre-training underpins T5's strong linguistic knowledge.

In text summarization, T5-large excels at generating clear, understandable summaries of lengthy texts. It can capture complex connections and interpretations within the text because of its immense parameter count and thorough pre-training. The text-to-text framework is highly beneficial for summarization, since it enables the model to treat the task naturally as a sequence-to-sequence problem: it converts an input document (the source text) into a brief version (the summary), preserving important context and information.

To use T5-large for summarization, the pre-trained model needs to be fine-tuned on a specific summarization dataset. In the process of fine-tuning, the model develops the capacity to produce summaries that match the style and requirements of the dataset. T5-large's strong language-processing abilities ensure that the generated summaries are not only grammatically correct and natural but also informative. It is a useful tool for managing the massive quantity of textual information encountered in modern applications, since it can produce high-quality, coherent, and concise summaries by leveraging its extensive pre-training and well-designed architecture. The model structure of T5 is shown in Fig. 5.
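A condensed fine-tuning sketch using the transformers Seq2SeqTrainer is shown below; the CNN/DailyMail dataset stands in for our news dataset, and the hyperparameters are illustrative assumptions rather than the exact configuration used in this study:

```python
# Illustrative fine-tuning sketch for T5-large on a summarization dataset.
# Requires: pip install transformers datasets sentencepiece torch
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

# CNN/DailyMail ("article"/"highlights" columns) as a stand-in news dataset.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

def preprocess(batch):
    # Tokenize inputs with the T5 task prefix and targets as labels.
    inputs = tokenizer(["summarize: " + a for a in batch["article"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["highlights"],
                       max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-large-news",       # illustrative hyperparameters
    per_device_train_batch_size=2,
    learning_rate=3e-5,
    num_train_epochs=1,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```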

Fig. 5

Model structure of T5.

In the initial testing of the dataset using T5-large, the average ROUGE score was approximately 0.2429 and the BLEU score was 0.0338. Table 4 shows the average ROUGE and BLEU scores from testing the dataset. As these results show, T5-base scored higher than T5-large.

Table 4 Pre-training for T5-large.

BART-large-CNN

Facebook AI developed the transformer-based BART (Bidirectional and Auto-Regressive Transformers) model specifically for sequence-to-sequence tasks like text summarization. The "large-CNN" variant is the BART-large architecture fine-tuned on the CNN/Daily Mail news dataset, which makes it particularly well suited to news summarization. BART merges the benefits of bidirectional and auto-regressive transformers: it consists of a bidirectional encoder, like BERT, that scans the full input text to grasp context, and a left-to-right auto-regressive decoder, like GPT, that generates output text one token at a time. Because of this dual nature, BART can handle tasks that call for both understanding context (the input sequence) and producing coherent, contextually relevant text (the output sequence).

In the summarization process, BART-large-CNN plays an important role by transforming lengthy and complex text into a shorter summary while retaining the original meaning and key information. It works in two stages: encoding and decoding. In encoding, the bidirectional encoder reads the whole input text and identifies the complex dependencies and relations present in it; this full understanding of context is essential for producing accurate summaries. Utilising the encoder's context, the auto-regressive decoder then produces the summary by predicting each word individually. This step-by-step generation keeps the output summary coherent and fluent.

Its advantage lies in its denoising autoencoder pre-training: BART is pre-trained on a sizeable corpus of text, where it learns to reconstruct original text from corrupted versions, which improves both its text comprehension and generation skills. Its summarization ability is further improved by fine-tuning on specific summarization datasets. Overall, BART leverages its sophisticated architecture and pre-training to produce higher-quality text summaries, making it an efficient tool in NLP. The workflow of the BART-large-CNN model is shown in Fig. 6.
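As a brief illustration (assuming the transformers library, which hosts the facebook/bart-large-cnn checkpoint), the high-level pipeline API can produce a summary in a few lines; the article text is a placeholder:

```python
# Summarization with BART-large-CNN via the transformers pipeline API.
# Requires: pip install transformers torch
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "Placeholder news article text goes here..."  # placeholder input

# min_length/max_length bound the summary size in tokens.
result = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```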

Fig. 6

Workflow of the BART-large-CNN model.

When we performed the initial test using BART-large-CNN to establish baseline performance metrics, the average ROUGE score and BLEU score were roughly 0.1352 and 0.0182, respectively. The average ROUGE and BLEU scores obtained from testing the dataset are given in Table 5.

Table 5 Pre-training for BART-large-CNN.

PEGASUS-large

PEGASUS (Pre-training with Extracted Gap-Sentences for Abstractive Summarization Sequence-to-Sequence) is a state-of-the-art model built specifically for text summarization tasks. Developed by researchers at Google Research, PEGASUS is a highly efficient method for producing high-quality summaries because it uses a new pre-training objective designed for summarization. The Transformer architecture, a deep learning model built for NLP tasks, constitutes the basis for PEGASUS. The encoder-decoder structure of PEGASUS is similar to that of other Transformer models: the encoder analyses the input text, and the decoder generates the summary.

PEGASUS's pre-training methodology is its main innovation. In contrast to traditional models pre-trained with generic language-modelling tasks, it includes a novel technique known as gap-sentence generation (GSG): important sentences are masked out of the input document, and the model learns to generate these gaps from the rest of the text. By closely aligning the pre-training task with summarization, this methodology improves the model's capacity to produce concise and understandable summaries. PEGASUS can generate summaries of excellent quality, is flexible enough to work across a variety of domains, and minimises the amount of human labour required for producing text summaries.
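A minimal usage sketch, assuming the transformers library and the google/pegasus-large checkpoint, with a placeholder article:

```python
# Summarization with PEGASUS-large using Hugging Face transformers.
# Requires: pip install transformers sentencepiece torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

article = "Placeholder news article text goes here..."  # placeholder input

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```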

When we performed initial testing with PEGASUS-large to establish baseline performance metrics, the average ROUGE score and BLEU score obtained were 0.2493 and 0.0317, respectively. Table 6 shows the values corresponding to pre-training for PEGASUS-large, and Table 7 shows the baseline performance metrics for all the models.

Table 6 Pre-training for PEGASUS-large.
Table 7 Pre-training results.
