Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 1)

I haven’t spent much time managing marketing teams or content developers, creating campaigns, or running projects like that. But from the outside looking in, I can imagine the immense amount of work and intuition required to produce good marketing content. Some might say the author should just ‘know their audience’ or ‘just create what they want to create’. In general, I agree. However, platforms like YouTube have changed the game somewhat in recent years. Audiences have never been larger, more diverse, more targeted, or more accessible. It would be nearly impossible to grow a marketing strategy through intuition alone and sustain its success. Creators and authors should have tools that give them a chance to keep up with a very fickle audience. With a little data, a few lines of code, and some visualizations, these creatives just might have a chance to get ahead of the curve faster.

In this multi-part series, we are going to spin up a simple project to demonstrate a neat use case. Consider for a moment that you are a marketing leader or content creator responsible for generating campaigns, marketing videos, podcasts, and the like. We’ll also assume that while this leader has a lot of experience, they are quite pragmatic and aware there are opportunities to start off on the right foot: we can try to understand the landscape for the company’s products and content before investing dollars into a project. Seeing as this company is heavily B2C, YouTube is a keystone in delivering good content to large audiences quickly.

In this series, we will cover the following:

  • Set up a project in Google Cloud to activate the YouTube Data API v3

  • Get familiar with the APIs we will need to use and what the APIs provide

  • Make calls to the appropriate APIs and store the results

  • Clean the comment data so it can be processed by the VADER lexicon from NLTK

  • Create a simple Power BI dashboard to visualize/analyze some of the results

  • Discuss some future improvements that can be made

Setting up your project with Google:

To begin, you’ll have to set up a project with Google to activate the API, get your API keys, configure OAuth, and so on.

  1. If you don’t have an account with Google/Google Cloud, sign up for a Google account

  2. Sign into Google Cloud

  3. Create the project and name it according to your needs

  4. Once the project has been created, search for ‘youtube’ in the search bar at the top of the Google Cloud Workspace

  5. One of the top results will be: YouTube Data API v3

  6. Select the API to move forward

  7. Enable the API for your project and move on to the next screen

  8. At the top right, click the button to create credentials

  9. Choose the YouTube Data API v3 from the list under the question: “Which API are you using?”

  10. For the next question: “Where will you be calling the API from?” use the answer: “Other UI (e.g. Windows, CLI tool)”

  11. For the next question: “What data will you be accessing?” use the answer: “Public Data”

    1. Unless, of course, you know the app will need personal information. For this use case, we will not need personal data. Plus, there are extra levels of scrutiny involved with accessing personal data

  12. After proceeding to the next page, you will be presented with an API key. Make sure to save this key (see the snippet after this list for one way to keep it out of your code)

  13. Next, click on the ‘CREATE CREDENTIALS’ button. You’ll want to create an OAuth Client ID

  14. Click on ‘Configure Consent Screen’

  15. I chose an External User Type and selected Create

  16. Fill out the form and continue

  17. Add the first three scopes. If you are only going after the comments, you won’t need most of these scopes, and you will not need to add any sensitive or restricted scopes

  18. Add users as necessary

  19. I published my app in the following screens because there seemed to be issues when running code if the project was not published. We are only testing anyway, so there is little risk of a huge impact

  20. Under the Credentials tab, create an OAuth client ID

  21. Set the application type to ‘Desktop app’ and name the client any way you want

  22. Save your client ID and client secret

  23. Download the .json file from the ‘OAuth 2.0 Client IDs’ and save the file in the same directory that you will be developing within

  24. Optional: Add a service account

It takes a bit of time to set up the API and OAuth, but it is worth it in the end. There are tons of other walkthroughs on YouTube if you get stuck.
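One small aside before writing any code (my own suggestion, not one of Google’s steps): keep the API key out of your source files. A minimal sketch, assuming you export the key as an environment variable first:

import os

# Assumes you ran `export YOUTUBE_API_KEY=<your key>` in your shell first;
# the variable name YOUTUBE_API_KEY is a hypothetical choice.
api_key = os.environ['YOUTUBE_API_KEY']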

Starting to Build the Script:

Import libraries and define some variables/functions that enable OAuth to operate:

import os
import re
import pickle

import numpy as np

import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

CLIENT_SECRETS_FILE = "client_secret_2.json"

SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

def get_authenticated_service():
    credentials = None
    # Load cached credentials from a previous run, if they exist
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    # Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Refresh expired credentials instead of prompting the user again
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            # Note: newer releases of google-auth-oauthlib removed run_console();
            # use flow.run_local_server(port=0) there instead
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials=credentials)

if __name__ == '__main__':
    # Allow OAuth over plain HTTP for local testing only
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()

A key part of this block of code is saving off the credentials for the authenticated service: you do not want to re-authenticate every single time you run the script, and caching the token in token.pickle avoids that.

api_key = '<API KEY>'
youtube=build('youtube','v3',developerKey=api_key)

Define your API key here in these lines of code. This builds the service object you will then use to call the different parts of the API, like comments, the search function, etc.

Using the Search Function to get the Channels we will analyze:

You can use the search method to pull a set of results from YouTube search. You should familiarize yourself with the documentation: https://developers.google.com/youtube/v3/getting-started

We will not explore the API methods in depth in this article. Here is an example of using the search method of the API:

snippets = youtube.search().list(part='id,snippet',type='channel',q='t.rex arms').execute()

channelId = snippets['items'][0]['snippet']['channelId']
print(channelId)
>> UCU-ljC8EvKZFhJ-pct_5rMQ

I cut a bit of a corner here: I knew that the channel ID I wanted was at index 0. This channel ID will be used to kick off the rest of the script. The concept is as follows:

Search YouTube for a channel or channels (the seed) >> extract the stats and the uploads playlist ID from channels().list() >> get the list of videos and their IDs from playlistItems().list() >> use the channel/video IDs to get the comments from each video. The next lines of code demonstrate the results that the YouTube API returns:

stats = youtube.channels().list(part='statistics',id=channelId).execute()
stats['items']

>> [{'kind': 'youtube#channel',
  'etag': '6p3MzT5MtiAPsl3LjZUa1Jrfp78',
  'id': 'UCU-ljC8EvKZFhJ-pct_5rMQ',
  'statistics': {'viewCount': '103419822',
   'subscriberCount': '975000',
   'hiddenSubscriberCount': False,
   'videoCount': '145'}}]

content = youtube.channels().list(id = channelId, part='contentDetails').execute()
content['items']

>> [{'kind': 'youtube#channel',
  'etag': 'NHEVnfNtoeJIhQaZFf68M1xiH9c',
  'id': 'UCU-ljC8EvKZFhJ-pct_5rMQ',
  'contentDetails': {'relatedPlaylists': {'likes': '',
    'favorites': '',
    'uploads': 'UUU-ljC8EvKZFhJ-pct_5rMQ'}}}]

uploadId = content['items'][0]['contentDetails']['relatedPlaylists']['uploads']
uploadId

>> 'UUU-ljC8EvKZFhJ-pct_5rMQ'

After getting the uploads playlist ID, we can go get the videos from that playlist. If there were more than one playlist, you could simply write the playlist IDs to a list and loop through all of them to get the videos (a sketch of this appears after the next few code blocks). Next, we get the videos from the playlist:

allVideos = []
nextPage_token = None
while True:
    res = youtube.playlistItems().list(playlistId=uploadId, maxResults=50,
                                       part='snippet', pageToken=nextPage_token).execute()
    allVideos += res['items']
    nextPage_token = res.get('nextPageToken')
    if nextPage_token is None:
        break

video_ids = []
channelId = []
for item in allVideos:  # iterate over however many videos came back
    video_ids.append(item['snippet']['resourceId']['videoId'])
    channelId.append(item['snippet']['channelId'])

stats = []
for i in range(0, len(video_ids), 40):  # videos().list accepts a comma-separated batch of IDs
    res = youtube.videos().list(id=','.join(video_ids[i:i+40]), part='statistics').execute()
    stats += res['items']

A while loop grabs any and all videos. Depending on your own use case, there might be a need to stop after so many calls; remember, you only get 10,000 quota units per day by default. The two blocks that follow simply append data to lists for post-processing later. Avoid hard-coding an index range when collecting the video IDs, since playlists will have different numbers of videos; iterating over allVideos handles any length.
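To make the multiple-playlist idea mentioned earlier concrete, here is a minimal sketch (the get_playlist_videos helper and the MAX_PAGES cap are my own hypothetical additions): loop over a list of playlist IDs and cap the pages fetched per playlist to protect your daily quota.

MAX_PAGES = 20  # hypothetical safeguard; tune to your quota budget

def get_playlist_videos(youtube, playlist_ids):
    videos = []
    for playlist_id in playlist_ids:
        page_token = None
        for _ in range(MAX_PAGES):
            res = youtube.playlistItems().list(playlistId=playlist_id,
                                               maxResults=50,
                                               part='snippet',
                                               pageToken=page_token).execute()
            videos += res['items']
            page_token = res.get('nextPageToken')
            if page_token is None:
                break
    return videos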

Next, we deconstruct the results of the earlier calls into separate ‘columns’ to be used in a dataframe/table.
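The dataframe code below assumes column lists like title and liked already exist. As a hedged sketch of how they could be derived from allVideos and stats (assuming the two line up one-to-one):

# Sketch only: build the column lists from the earlier API results.
title = [v['snippet']['title'] for v in allVideos]
videoid = video_ids  # already collected above
video_description = [v['snippet']['description'] for v in allVideos]
publishedDate = [v['snippet']['publishedAt'] for v in allVideos]

# .get() defaults guard against missing fields; note that the API stopped
# returning dislikeCount publicly in late 2021, so 'dislikes' will usually be 0.
liked = [int(s['statistics'].get('likeCount', 0)) for s in stats]
disliked = [int(s['statistics'].get('dislikeCount', 0)) for s in stats]
views = [int(s['statistics'].get('viewCount', 0)) for s in stats]
comment = [int(s['statistics'].get('commentCount', 0)) for s in stats]

With those in hand, we can assemble the dataframe: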

import pandas as pd

data = {'title': title, 'video_id': videoid, 'video_description': video_description,
        'publishedDate': publishedDate, 'likes': liked, 'dislikes': disliked,
        'views': views, 'comment_count': comment}
df = pd.DataFrame(data)
df.head()

We go after the comments with the following lines of code:

channelId = list(set(channelId))
allComments = []
video_id_pop = []
channel_id_pop = []
video_title_pop = []
video_desc_pop = []
comments_pop = []
comment_id_pop = []
reply_count_pop = []
like_count_pop = []

for channel in channelId:
    res=youtube.commentThreads().list(allThreadsRelatedToChannelId=channel,
                                      part='id,snippet',
                                      maxResults=100).execute()

    try:
        nextPageToken = res['nextPageToken']

    except KeyError:
        nextPageToken = None

    except TypeError:
        nextPageToken = None
    
    comments_temp = []
    comment_id_temp = []
    reply_count_temp = []
    like_count_temp = []
    channel_id_temp = []
    video_id_temp = []

    for item in res['items']:
        allComments.append(item)  # append the single thread, not the whole page of results
        comments_temp.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
        comment_id_temp.append(item['snippet']['topLevelComment']['id'])
        reply_count_temp.append(item['snippet']['totalReplyCount'])
        like_count_temp.append(item['snippet']['topLevelComment']['snippet']['likeCount'])
        channel_id_temp.append(item['snippet']['channelId'])
        video_id_temp.append(item['snippet']['videoId'])

    comments_pop.extend(comments_temp)
    comment_id_pop.extend(comment_id_temp)
    reply_count_pop.extend(reply_count_temp)
    like_count_pop.extend(like_count_temp)
    channel_id_pop.extend(channel_id_temp)
    video_id_pop.extend(video_id_temp)
    
    while (nextPageToken):
        try:
            res=youtube.commentThreads().list(allThreadsRelatedToChannelId=channel,
                                      part='id,snippet',
                                      maxResults=100,pageToken=nextPageToken).execute()
            
            comments_temp = []
            comment_id_temp = []
            reply_count_temp = []
            like_count_temp = []
            channel_id_temp = []
            video_id_temp = []

            for item in res['items']:
                allComments.append(item)  # append the single thread, not the whole page of results
                comments_temp.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
                comment_id_temp.append(item['snippet']['topLevelComment']['id'])
                reply_count_temp.append(item['snippet']['totalReplyCount'])
                like_count_temp.append(item['snippet']['topLevelComment']['snippet']['likeCount'])
                channel_id_temp.append(item['snippet']['channelId'])
                video_id_temp.append(item['snippet']['videoId'])

            comments_pop.extend(comments_temp)
            comment_id_pop.extend(comment_id_temp)
            reply_count_pop.extend(reply_count_temp)
            like_count_pop.extend(like_count_temp)
            channel_id_pop.extend(channel_id_temp)
            video_id_pop.extend(video_id_temp)
            
            nextPageToken = res['nextPageToken']
            
        except KeyError:
            break

data_threads = {'comment': comments_pop, 'comment_id': comment_id_pop,
                'reply_count': reply_count_pop, 'like_count': like_count_pop,
                'channel_id': channel_id_pop, 'video_id': video_id_pop}
threads = pd.DataFrame(data_threads)
threads.head()

The code above looks complicated, but there isn’t much to it. YouTube provides a filter, allThreadsRelatedToChannelId, to grab all the comment threads related to a channel ID. Because the results are paginated, you need to incorporate the nextPageToken and loop through the pages until complete. In some applications, you may want to cut the calls off early, especially if the channel has a ton of engagement.
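As one way to cut the calls off early, here is a compact pagination sketch built on the client library’s list_next() helper (the iter_comment_threads name and max_pages parameter are hypothetical additions, not part of the script above):

def iter_comment_threads(youtube, channel_id, max_pages=None):
    # Build the first request, then let list_next() chase the page tokens.
    request = youtube.commentThreads().list(
        allThreadsRelatedToChannelId=channel_id,
        part='id,snippet',
        maxResults=100)
    pages = 0
    while request is not None:
        response = request.execute()
        for item in response['items']:
            yield item
        pages += 1
        if max_pages is not None and pages >= max_pages:
            break  # stop early to conserve quota on high-engagement channels
        request = youtube.commentThreads().list_next(request, response)

For example, list(iter_comment_threads(youtube, channelId[0], max_pages=5)) would fetch at most 500 top-level threads for the first channel.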

We’ll cover data cleaning, feature engineering, the use of NLTK/VADER for sentiment analysis, and a simple Power BI dashboard in the next article!
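As a small teaser of that VADER step, here is a sketch assuming the threads dataframe from above and NLTK’s standard vader module:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
# compound ranges from -1 (most negative) to +1 (most positive)
threads['sentiment'] = threads['comment'].apply(
    lambda c: sia.polarity_scores(c)['compound'])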