Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 2)

Last time, we looked at the process for setting up a project in Google Cloud, enabling the API, and utilizing the API to get data that we can analyze. In this article, we are going to do a bit of data cleaning, analysis, NLP/unsupervised sentiment analysis, and data visualization in PowerBI.

Where we left off…

Last time, we created a Pandas DataFrame that houses the commentary from a YouTube Channel:

data_threads = {'comment': comments_pop, 'comment_id': comment_id_pop, 'reply_count': reply_count_pop,
                'like_count': like_count_pop, 'channel_id': channel_id_pop, 'video_id': video_id_pop}
threads = pd.DataFrame(data_threads)
threads.head()

After creating this table, I removed duplicates in the event that any actually exist. This is more of a best practice than a strict need, although earlier versions of the script did occasionally generate duplicate comments. It also ensures that when we get to calculating metrics and counting, there is no risk of artificially inflating values.

threads.drop_duplicates(inplace=True)

Next, we merge the high-level statistics with the comments:

result = pd.merge(threads, df, how="inner", on=["video_id"])

Cleaning the Comment Text:

Before applying sentiment analysis, or any analysis for that matter, we should clean the comments; an analysis across hundreds of thousands of comments will not scale without some cleaning. Let us start by removing tags:

import re

def remove_tags(string):
    # Strip anything that looks like an HTML/XML tag, e.g. <br>
    result = re.sub('<.*?>', '', string)
    return result

result['comment'] = result['comment'].apply(remove_tags)

Emojis and emoticons convey sentiment through a pictogram, but a lexicon cannot interpret a picture, so they need to be converted into words. We can do that like this:

from emot.emo_unicode import UNICODE_EMO, EMOTICONS

def convert_emojis(text):
    # Replace each emoji with an underscore-joined description of it
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",", "").replace(":", "").split()))
    return text

def convert_emoticons(text):
    # Replace each emoticon (e.g. ":)") with an underscore-joined description of it
    for emot in EMOTICONS:
        text = re.sub(u'(' + re.escape(emot) + ')', "_".join(EMOTICONS[emot].replace(",", "").split()), text)
    return text

result['comment'] = result['comment'].apply(convert_emoticons)
result['comment'] = result['comment'].apply(convert_emojis)

URLs do not carry interpretable sentiment, so we will remove them:

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

result['comment'] = result['comment'].apply(remove_urls)

HTML markup is another piece that should be cleaned up:

from bs4 import BeautifulSoup

def html(text):
    # Parse with BeautifulSoup and keep only the visible text
    return BeautifulSoup(text, "lxml").text

result['comment'] = result['comment'].apply(html)

The next few blocks were written with future Natural Language Processing (NLP) work in mind and are not strictly needed here, but they are useful functions to refer back to if you happen to be on this journey as well:

  • Remove Punctuation

  • Tokenize

  • Remove Stop Words

  • Lemmatize

  • Generate the number of words in a comment

  • Generate the number of sentences in a comment

import string

def remove_punctuation(text):
    # Drop every character that appears in string.punctuation
    return ''.join(ch for ch in text if ch not in string.punctuation)

result['comment_no_punc'] = result['comment'].apply(remove_punctuation)

def tokenize(text):
    # Split on any run of non-word characters
    return re.split(r"\W+", text)

result['comment_no_punc_tokens'] = result['comment_no_punc'].apply(lambda x: tokenize(x.lower()))
result.head(1)

# Importing stopwords from the nltk library
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

# Function to remove the stopwords from a list of tokens
def remove_stopwords(tokens):
    return " ".join([word for word in tokens if word not in STOPWORDS])

result['comment_wo_punct_split_wo_stopwords'] = result['comment_no_punc_tokens'].apply(remove_stopwords)

import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV} 

def lemmatize_words(text):
    # POS-tag each word, map the tag's first letter to a WordNet POS (defaulting to noun), then lemmatize
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

result['comment_wo_punct_split_wo_stopwords_lemma'] = result['comment_wo_punct_split_wo_stopwords'].apply(lemmatize_words)

result['num_words'] = result['comment'].apply(lambda x: len(x.split()))
# Approximate sentence count by splitting on sentence-ending punctuation
result['num_sentences'] = result['comment'].apply(lambda x: len([s for s in re.split(r'[.!?]+', x) if s.strip()]))

Implement NLTK VADER Lexicon:

One thing to note here is that the optimal approach is actually to LABEL a sample of the data manually BEFORE attempting automated sentiment analysis. By doing this, you ensure that the labeling is specific to your own use case, a point we return to in the considerations below. A quick sketch of how you might pull a sample for hand-labeling follows, and after that comes the code to run the comments through the NLTK VADER lexicon.
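A minimal sketch, assuming you simply export a random subset of comments to a CSV for a human to score (the sample size, file name, and label column are all illustrative):

# Export a random sample of comments for manual labeling. The sample size and
# file name are illustrative; adjust them to your own data volume.
label_sample = result[['comment_id', 'comment']].sample(n=min(200, len(result)), random_state=42)
label_sample['manual_label'] = ''  # to be filled in by a human reviewer
label_sample.to_csv('comments_to_label.csv', index=False)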

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
nltk.downloader.download('vader_lexicon')

sid = SIA()

sentiment = []

for comment in result['comment']:
    sentiment.append(sid.polarity_scores(comment))

result['sentiment'] = sentiment

This piece of the code can take a while, depending on the number of comments. In this block, we simply loop through the comments, write the results from SentimentIntensityAnalyzer to the empty sentiment list, and then add that list as a new column on the result/comments table. The sentiment analyzer returns a dictionary for each comment. We can split this dictionary into separate columns by executing the following code:

result = result.drop(columns='sentiment').assign(**pd.DataFrame(result['sentiment'].tolist(), index=result.index))

Finally, we write the table out to a CSV file so we can connect PowerBI to it (a quick sketch of the export is below). We are now going to create a simple visualization that allows a user to track the sentiment on their YouTube content.
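The file name here is arbitrary; point it wherever PowerBI will read from:

# Write the enriched comment table out for PowerBI to connect to.
result.to_csv('youtube_comment_sentiment.csv', index=False)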

Visualizing the YouTube Data Collected:

[Image: Dashboard view of the PowerBI report]

Since the goal was to measure engagement on the channel, the more basic stats are shown at the top. Comments, Likes, Dislikes, and Views measure a general level of engagement with the topic of the video. One could even create a weighted-average calculation that combines these with the average sentiment, producing a single engagement score (see the sketch just below this paragraph). The bar chart simply shows the top 10 videos; naturally, it makes sense to focus on the most-viewed videos when analyzing why a video may have been more successful. Comments are listed below along with their predicted sentiment values. NOTE: the compound score is a kind of ‘blended’ sentiment score. This Stack Overflow thread explains the mathematics of the scoring well enough: https://stackoverflow.com/questions/40325980/how-is-the-vader-compound-polarity-score-calculated-in-python-nltk. Finally, there is a dual-axis chart that shows the average compound score over time relative to views. There is actually little correlation between the two, but juxtaposing these values is a good way for an analyst to notice large swings in sentiment over time. Large swings in sentiment would be key areas to investigate why the content was or was not well received.
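As a rough sketch of that single engagement score idea, here is one way it could be computed in pandas before the export. Only columns created earlier in this script are used (comment counts and the VADER compound score); the per-video view and like statistics from Part 1 could be blended in the same way, and the 50/50 weights are entirely illustrative:

# Hypothetical engagement score per video: blend comment volume with average
# compound sentiment. The weights below are illustrative, not tuned.
per_video = result.groupby('video_id').agg(
    comments=('comment', 'count'),
    avg_compound=('compound', 'mean'),
)

def min_max(s):
    # Scale a series to the 0-1 range; constant columns collapse to zero
    return (s - s.min()) / (s.max() - s.min()) if s.max() > s.min() else s * 0

per_video['engagement_score'] = (
    0.5 * min_max(per_video['comments'])
    + 0.5 * (per_video['avg_compound'] + 1) / 2  # map compound from [-1, 1] to [0, 1]
)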

Considerations:

  • The sentiment analyzer struggles with context, particularly with colloquialisms like ‘bad ass’. In most contexts, this phrase is actually ‘positive’ for this kind of content, even though the individual words might technically be considered ‘negative’. The underlying limitation is that the lexicon was scored by people without domain expertise in your subject matter. On aggregate, I think this methodology is reliable and the benefits outweigh the risk of mis-labeling; one mitigation is to add domain-specific terms to the lexicon, as sketched after this list.

  • Take steps to do human labeling before attempting a thorough sentiment analysis. The approach we employ here is a ‘quick and dirty’ way of measuring sentiment.
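As a small example of how you might soften the context problem, VADER's lexicon (a plain Python dict on the analyzer) can be extended with domain-specific single-word terms. The words and valence values below are assumptions you would want to validate against your own hand-labeled sample, and note that multi-word phrases such as ‘bad ass’ will not be matched this way because the lexicon is token-based:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# Hypothetical domain-specific terms and valence scores (VADER's lexicon uses
# a roughly -4 to +4 scale). These values are illustrative, not tuned.
sid.lexicon.update({
    'badass': 2.5,     # colloquially positive for this audience
    'clickbait': -1.5,
})

print(sid.polarity_scores('that edit was badass'))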