Text Classification with Google’s Natural Language API

As a premier Google partner, SpringML specializes in Google Cloud products on the Google Cloud Platform (GCP) to solve your business’s problems.  Today, I would like to highlight Google’s Natural Language API for classification.  This API can be used to quickly group your news articles, blog posts, videos, and documents into classes, and to sort your organization’s data with fast results. Using this API bypasses traditional training methods, which means you won’t be bogged down with modeling, hyperparameter tuning, or text pre-processing. The official page for Google’s NLP API can be found here.

Objectives & Prerequisites:

By the end of this read, you will learn how to:

  • Use Google’s Natural Language API.
  • Extract text from a PDF using Textract.
  • Extract text from a speech within an MP4 video.
  • Classify extracted text from a video.

Before you begin, you will need:

  • Some basic understanding of machine learning and Python programming.  The assumption is that you also have a Google account set up already.
  • To read my previous blog (found here). That post shows you how to set up a model with the Vision API; however, the GCP set-up process will not be much different here.  The only variation is that, at the end, you have to enable a different API on your Google Cloud Platform:
Cloud Natural Language API on GCP
Rather than enabling the Vision API as outlined in my last blog (see link above), you’ll be enabling this API for the demos below.
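If you prefer the command line to the console, the same API can be enabled with the gcloud CLI (a small sketch, assuming the CLI is installed and authenticated against your project):

```shell
# Enable the Cloud Natural Language API for the currently active project
# (assumes the gcloud CLI is installed and you have run `gcloud auth login`)
gcloud services enable language.googleapis.com
```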

Why use Google’s Natural Language API?

Let’s look at the Traditional Text Classification process:

Traditional Natural Language Classification
Traditional Natural Language Classification
This assumes you decide to go with Naive Bayes, a Support Vector Classifier, or a Random Forest. It’s a bit of a painstaking process, as you can see.
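To make the comparison concrete, here is a minimal sketch of the traditional route using scikit-learn (a library choice and toy dataset of my own, not part of the original pipeline): vectorize the text, split the data, train a Naive Bayes model, and evaluate it, with each step being yours to maintain and tune.

```python
# A minimal sketch of the "traditional" classification route:
# text pre-processing, train/test split, model training, evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for a labeled corpus like the BBC dataset
texts = ["stocks rallied after the earnings report",
         "the striker scored twice in the final",
         "shares fell on weak quarterly profits",
         "the team won the championship match"]
labels = ["business", "sport", "business", "sport"]

X = TfidfVectorizer().fit_transform(texts)           # text pre-processing
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

model = MultinomialNB().fit(X_train, y_train)        # model training
accuracy = model.score(X_test, y_test)               # evaluation
```

Every one of these stages (feature extraction, splitting, model choice, tuning) disappears when you call the pre-trained Natural Language API instead.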

Now, let’s compare it to Google’s Natural Language API:

As you can see, the NLP (Natural Language Processing) API process is a lot simpler from a developer’s standpoint, and it gets you from raw files to predictions much faster.

Using the Natural Language API to Extract Text from a TXT File to Classify.

Now I am going to pass raw, unprocessed text through the NLP API, and it should give me relatively accurate predictions as a result. I am going to start with the BBC news dataset found here. After downloading the news article dataset, you can read an article from the set like this:

with open("bbc/business/057.txt", 'r') as f:
    file = f.read().replace("\'", " ").replace("\n", " ")  # replace escaped apostrophes and newlines from the txt file

After we have read the TXT file, we can use the sample code for the NLP API (originally found here).
import six
from google.cloud import language
from google.cloud.language import enums, types

def classify_text(text):
    """Classifies content categories of the provided text."""
    client = language.LanguageServiceClient()

    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)

    categories = client.classify_text(document).categories

    for category in categories:
        print(u'=' * 20)
        print(u'{:<16}: {}'.format('name', category.name))
        print(u'{:<16}: {}'.format('confidence', category.confidence))

classify_text(file)

After you have this code saved in a .py file, you can export your API key credentials from your bash terminal and run the script.
export GOOGLE_APPLICATION_CREDENTIALS=key.json

This is also covered in my previous post on Google’s Vision API.

Example 1: Natural Language API in Action

Here is a sample of the text from the BBC dataset labeled as ‘business’ news:

“Economy  stronger than forecast.   The UK economy probably grew at a faster rate in the third quarter than the 0.4% reported, according to Bank of England deputy governor Rachel Lomax.  Private sector business surveys suggest a stronger economy than official estimates”.

And here is the model’s classification of this text by category and sub-category:

Natural Language API In Action - Example 1

As you can see, the API correctly identified the category and subcategory with high confidence with no training required.

Example 2: Natural Language API in Action

Here is another sample of text from the same dataset with a sports article:

“African double in Edinburgh World 5000m champion Eliud Kipchoge won the 9.2km race at the View From Great Edinburgh Cross Country.  The Kenyan, who was second when Newcastle hosted the race last year, was in front from the outset. Ethiopian duo Gebre Gebremariam and Dejene Berhanu made last-gasp efforts to overtake him, but Kipchoge responded and a burst of speed clinched victory.”

And here, again, is the model’s classification of this text by category and sub-category:

Natural Language API In Action - Example 2

The most interesting thing to note is that the API has not only identified the text category as ‘sports,’ but has also placed the article in the subcategory ‘Track and Field.’  This demonstrates the level of detail of the classification within the Natural Language API system.

 

Using the Natural Language API to Extract Text from a PDF to Classify.

Sample datasets, like the one from the BBC, are usually curated by a team of professionals.  More often than not, you won’t get clean data from a TXT file, and you will have to extract data from somewhere within a document or publication file instead. Amidst the slew of forms and business documents that exist on the Internet today, it is very common to need data pulled from a PDF for analysis.  The easiest way that I found to extract text from a PDF is to use Textract.  The extracted data can come out a bit messy, so I wrote a regex to clean it up.  Here is the code:

import textract
import re

def removePunctuation(text):
    '''
    input: string of words
    output: string with punctuation and special characters removed
    '''
    for c in '!"#$%&\'()*+,-/.:;<=>?@[]^_`{|}~\\':
        text = text.replace(c, " ")
    return text.strip().lower()

def pdftext(fname):
    '''
    input: pdf with selectable text
    output: extracted text from that document
    '''
    text = str(textract.process(fname, encoding='ascii'))
    # method='tesseract' for scanned images in a PDF -- or the Vision API, even better!
    text = re.sub(r"http[s]*|www.*", ' ', text.lower())  # remove URLs
    text = text.replace('\\n', ' ').replace('\n', ' ')   # remove newline escapes
    text = text.replace('x0c', ' ').replace('x94', ' ')  # remove stray escape codes before classification
    text = removePunctuation(text)                       # remove special characters
    text = re.sub(r"\d+", '', text)                      # remove numbers
    return ' '.join(i for i in text.split() if len(i) > 1)

pdftext('yourfile.pdf')
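To see what the clean-up steps actually do, here is the same pipeline applied to a short string that mimics raw Textract output (a self-contained sketch with a made-up input, so no PDF or Textract install is required):

```python
import re

def remove_punctuation(text):
    # same idea as removePunctuation above: swap punctuation for spaces
    for c in '!"#$%&\'()*+,-/.:;<=>?@[]^_`{|}~\\':
        text = text.replace(c, " ")
    return text.strip().lower()

# A short string mimicking raw textract output (escaped newlines, form feeds)
raw = "Sonnet 18, by William Shakespeare!\\nShall I compare thee...\\x0c"

text = re.sub(r"http[s]*|www.*", " ", raw.lower())   # remove URLs
text = text.replace("\\n", " ").replace("\n", " ")   # remove newline escapes
text = text.replace("x0c", " ").replace("x94", " ")  # remove stray escape codes
text = remove_punctuation(text)                      # remove special characters
text = re.sub(r"\d+", "", text)                      # remove numbers
cleaned = " ".join(w for w in text.split() if len(w) > 1)
# cleaned == "sonnet by william shakespeare shall compare thee"
```

Note that the final step drops single-character words such as “I”, which is why the sonnet extract in the next example reads “shall compare thee to summers day”.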

Example 3: Natural Language API in Action

Let’s take a look at how the NLP API classifies Shakespeare’s Sonnet 18 from a PDF. The text extracted from the file is, of course, Shakespearean English, and is quoted below:

‘sonnet by william shakespeare shall compare thee to summers day thou art more lovely and more temperate rough winds do shake the darling buds of may and summers lease hath all too short date sometime too hot the eye of heaven shines and often is his gold complexion dimmed and every fair from fair sometime declines by chance or natures changing course untrimmed but thy eternal summer shall not fade nor lose possession of that fair thou owst nor shall death brag thou wandrest in his shade when in eternal lines to time thou growst so long as men can breathe or eyes can see so long lives this and this gives life to thee’

Here is the model’s classification of the text by category and sub-category, once again:

Natural Language API In Action - Example 3

Based on the text, the classification makes sense to me and would certainly make even Shakespeare proud, especially since the poem was placed in the “Poetry” subcategory too.

 

Using the Natural Language API to Extract Text from a Speech Within a Video to Classify.

Video classification can be a great way to sort out videos quickly, and one of several ways to distinguish videos from one another is by their speech content.  We can use Google’s Speech-to-Text API to extract the text from a conversation within a video and then classify it with the Natural Language API.

 

For example, say you want to classify this speech from President Trump:

Here are the steps to follow to prepare the video’s audio for transcription:
  1. You’ll need to download the file as an MP4 using a video downloader application (we used ClipGrab).
    • Editor’s Note: ClipGrab is a safe video downloader tool that can also pull videos from Facebook and Twitter if need be. We especially like this program because it is open source and it won’t send you to spam websites like other online tools.
  2. Then, after downloading the video, you’ll need to extract the audio from the clip and save it as a WAV file:

import moviepy.editor as mp
clip = mp.VideoFileClip("file.mp4")
clip.audio.write_audiofile("file.wav")

The WAV format is how the Speech API takes in data.  Now that I have the right file format, I am going to use SpeechRecognition, which is a nice wrapper around Google’s Speech-to-Text API. It also interfaces with other Speech-to-Text APIs to give you a range of choices.  Since the focus of this article is on Google’s APIs, I based the code on this example here to extract President Trump’s speech as text:
import speech_recognition as sr

with open("apikey.json") as f:  # your Google API key here
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
file = 'file.wav'

with sr.AudioFile(file) as source:
    audio = r.record(source)

# Transcribe the audio file
text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, show_all=True)

 

Here is the Speech-to-Text output from the code above:

{'results': [{'alternatives': [{'confidence': 0.9735198,
'transcript': 'we believe that only American citizens should vote in American elections'}]},
{'alternatives': [{'confidence': 0.9794967,
'transcript': ' which is why the time has come for voter ID like everything else voter ID'}]},
{'alternatives': [{'confidence': 0.9675131,
'transcript': ' can I have your go out and you want to buy groceries you need a picture on a card you need ID you go out and you want to buy anything you need ID and you need your picture in this country'}]},
{'alternatives': [{'confidence': 0.94627315,
'transcript': " the only time you don't need it in many cases is when you want to vote for a president when you want to vote for a senator when you want to vote for a governor or Congressman its prey"}]}]}

 

With a bit of cleanup using the following code…

print(''.join([i['alternatives'][0]['transcript'] for i in text['results']]))

This is the final content from the Speech-to-Text conversion:

“we believe that only American citizens should vote in American elections which is why the time has come for voter ID like everything else voter ID can I have your go out and you want to buy groceries you need a picture on a card you need ID you go out and you want to buy anything you need ID and you need your picture in this country the only time you don’t need it in many cases is when you want to vote for a president when you want to vote for a senator when you want to vote for a governor or Congressman”
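To close the loop, the joined transcript can be fed straight back into the classify_text helper from the TXT example. Here is a small sketch using an abbreviated version of the Speech-to-Text response above (trimmed to two segments for brevity):

```python
# Join the top transcript from each result segment into one string,
# then hand it to the Natural Language API helper defined earlier.
response = {
    "results": [
        {"alternatives": [{"confidence": 0.97,
                           "transcript": "we believe that only American citizens "
                                         "should vote in American elections"}]},
        {"alternatives": [{"confidence": 0.98,
                           "transcript": " which is why the time has come for voter ID"}]},
    ]
}

transcript = "".join(seg["alternatives"][0]["transcript"]
                     for seg in response["results"]).strip()

# classify_text(transcript)  # same helper as in the TXT example above
```

The leading space on each follow-on segment keeps the words from running together when the pieces are concatenated.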

Example 4: Natural Language API in Action

After passing the information through the Natural Language API, here are the results from my bash terminal:

Natural Language API In Action - Example 4

Based on the results from the video above, the classification from the Natural Language API, and even the text extraction from the Speech-to-Text API, looks very accurate.

 

When is it time for a custom model?

The API is a very good start, and you can see the benefits of it right away.  But, at the end of the day, it may still be best to go with a custom model, depending on the situation.

Here are some reasons for a custom model:

  1. The API can tend to be too general, and oftentimes our clients will want to categorize things that the pre-built API does not.
  2. You may need finer-grained categories and classification beyond what the API provides.
  3. Custom models often produce higher accuracy for the categories of your choice, as they have been trained specifically for your use case’s needs.
  4. You gain a sense of control and personalization over your results. You know your model is tailored to meet your needs, and you gain greater confidence in using it to influence your business’s use case.

 

We at SpringML can not only apply powerful Google APIs to your business, but we can also create custom deep learning models, such as Using Object Detection to Find Potholes in TensorFlow, and deploy them on the Google Cloud Platform.  If you want to reach out to us for our domain expertise in machine learning, you can reach us at info@springml.com.