
5 Best Python Libraries To Extract Keywords From Text

Here is the List of 5 Best Open-Source Python Libraries for Keyword Extraction

Keyword extraction is a text analysis technique in data science that surfaces the key ideas in a piece of text in a short period of time. With it, you can quickly identify the relevant terms in any document instead of sifting through the whole thing. The technique is useful in a lot of different ways.

For example, you can use it to automatically extract keywords from a blog post you have just written, so if you’re feeling lazy or less creative than usual, you can let a library do the work.

A more practical example is product reviews. When a product is on sale and customers review it, keyword extraction can automatically surface recurring problems across a large number of reviews without anyone having to read them all.

Here are 5 of the most useful Python libraries for automatically extracting keywords from text in many languages.

Python Libraries To Extract Keywords

1. KeyBERT

KeyBERT is one of the most user-friendly libraries. It uses BERT embeddings to find the keywords and key phrases that are most similar to the document as a whole.

Installing this library with pip is simple:

pip install keybert

To use it in your scripts, import the KeyBERT model and extract keywords from a variable that contains the plain text:

from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
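# Load the default embedding model and extract keywords from the document
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

# keywords is a list of (keyword, score) tuples
print(keywords)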
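
The idea of ranking candidate phrases by how similar they are to the whole document can be illustrated without the library. The sketch below is not KeyBERT's code: `toy_embed` is a hypothetical stand-in that builds a bag-of-words count vector where KeyBERT would use a BERT embedding, and `rank_candidates` is a name made up for this example. Only the ranking step (embed, compare with cosine similarity, keep the top results) is meant to carry over.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a BERT embedding (assumption): a sparse word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(doc, candidates, embed=toy_embed, top_n=3):
    """Return the top_n candidates whose vectors are closest to the document's."""
    doc_vec = embed(doc)
    scored = [(cand, cosine(doc_vec, embed(cand))) for cand in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


text = "Supervised learning infers a function from labeled training data."
print(rank_candidates(text, ["supervised learning", "training data", "weather"]))
```

With a word-count vector the scores only reflect word overlap, so an unrelated candidate such as "weather" scores zero. Swapping `embed` for a real sentence-embedding function is what would make the ranking semantic.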

2. MultiRake

MultiRake is a multilingual RAKE library for Python that offers:

  • Automatic keyword extraction from text written in any language
  • No need to know the language of the text beforehand
  • No need to have a list of stopwords
  • Built-in support for 26 languages; for the rest, stopwords are generated from the provided text
  • Simple usage: configure RAKE, plug in the text, and get keywords

This implementation stands out for its multilingual support. You can provide text (written in a Cyrillic or Latin alphabet) without knowing its language and without a stopword list, and still get good results. The best results, however, come from a carefully crafted list of stopwords. If you know the language of the text, passing its language code when initializing RAKE is enough.
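
When a language has no built-in stopword list, the stopwords have to come from the text itself. One plausible way to do that, shown here as a sketch rather than as MultiRake's actual implementation, is to treat short words that repeat often as stopwords. The function name `guess_stopwords` and its thresholds `max_len` and `min_freq` are assumptions made for this illustration.

```python
import re
from collections import Counter


def guess_stopwords(text, max_len=3, min_freq=2):
    """Guess stopwords: words of at most max_len letters seen at least min_freq times."""
    # \w+ matches Unicode letters, so this also works for Cyrillic text
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return {
        word
        for word, freq in counts.items()
        if len(word) <= max_len and freq >= min_freq
    }


sample = "the cat and the dog and the bird saw a keyword extraction library"
print(sorted(guess_stopwords(sample)))
```

On the sample sentence this flags `and` and `the`, which repeat, and leaves one-off short words such as `cat` alone. A heuristic like this is crude, which is consistent with the advice that a hand-crafted stopword list gives the best results.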

3. PKE

PKE is an open-source, Python-based keyphrase extraction toolkit. It provides a keyphrase extraction pipeline in which each component can be readily modified or extended to create new models. It also includes supervised models trained on the SemEval-2010 dataset and makes it easy to benchmark the latest keyphrase extraction models. Install it with the following pip command (it requires Python 3.6+):

pip install git+https://github.com/boudinfl/pke.git

To make it work, you’ll also need to download the following NLTK resources and the spaCy English model:

python -m nltk.downloader stopwords
python -m nltk.downloader universal_tagset
python -m spacy download en_core_web_sm # download the english model

It’s possible to extract keyphrases from a document using PKE’s standardized API. Use it as shown in the script below, for example:

# script.py
import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
extractor.load_document(input='/path/to/input.txt', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)

4. YAKE

YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical features of a single document to select its most important keywords. It does not depend on dictionaries, external corpora, text size, language, or domain. To show its strengths and relevance, its authors compare it against unsupervised approaches (including TF.IDF, KP-Miner, RAKE, and TextRank) and one supervised method (KEA).

According to the authors, YAKE! outperforms state-of-the-art methods across collections of varying sizes, languages, and domains.


5. RAKE

A Python implementation of the RAKE (Rapid Automatic Keyword Extraction) algorithm described in Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications.
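
To give a feel for what the algorithm does, here is a minimal pure-Python sketch of RAKE's core steps: split the text into candidate phrases at stopwords and punctuation, score each word as its degree divided by its frequency, and score each phrase as the sum of its word scores. It is an illustration written for this article, not the code of any particular RAKE package, and the tiny `STOPWORDS` set is an assumption chosen to keep the example short.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); real use needs a fuller one.
STOPWORDS = {"a", "an", "and", "of", "the", "is", "in", "for", "to", "from", "on"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stopwords."""
    phrases = []
    # Punctuation ends a phrase, so cut the text into fragments first
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, stopwords=STOPWORDS):
    """Score each candidate phrase as the sum of degree(w) / frequency(w)."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Co-occurrences within the phrase, counting the word itself
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = "Keyword extraction is the automatic identification of important terms in a document."
for phrase, score in rake(text):
    print(f"{score:.1f}  {phrase}")
```

Because the score adds up over the words of a phrase, longer phrases tend to rank above single words, which is why "keyword extraction" scores higher than "document" in this example.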

These are the 5 best Python libraries for extracting keywords from text.

Noor Qureshi

Experienced Founder with a demonstrated history of working in the computer software industry. Skilled in Network Security and Information Security.
