
Sentiment Analysis to Drive Content Strategy on your YouTube Marketing Channel (Part 2)

Last time, we looked at the process for setting up a project in Google Cloud, enabling the API, and utilizing the API to get data that we can analyze. In this article, we are going to do a bit of data cleaning, analysis, NLP/unsupervised sentiment analysis, and data visualization in PowerBI.

Where we left off…

Last time, we created a Pandas DataFrame that houses the commentary from a YouTube Channel:

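The embedded table from the original post did not carry over, so here is a minimal sketch of what the Part 1 output might look like. The column names (video_id, comment_id, comment_text) are assumptions for illustration, not necessarily the exact schema from the original script:

```python
import pandas as pd

# Hypothetical preview of the Part 1 output; the real column names come
# from the YouTube Data API fields collected in the previous article.
comments_df = pd.DataFrame({
    "video_id": ["abc123", "def456", "abc123"],
    "comment_id": ["c1", "c2", "c1"],  # "c1" repeated to mimic the duplicate bug
    "comment_text": [
        "Great video! 🔥",
        "Check out https://example.com <br>thanks @SomeUser",
        "Great video! 🔥",
    ],
})

print(comments_df.head())
```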

After creating this table, I removed duplicates in the event that any actually exist. This is more of a best practice than a strict need, but a previous version of the script did generate duplicate comments. Deduplicating also ensures that when we get to calculating metrics and counting, there is no risk of artificially inflating values.

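A one-line sketch of the deduplication step, continuing the hypothetical DataFrame above:

```python
# Drop fully duplicated rows so later counts and averages are not inflated.
comments_df = comments_df.drop_duplicates().reset_index(drop=True)
```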

Next, we merge the high-level statistics with the comments:

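The original merge code is not shown here; a sketch of the idea, with a hypothetical stats_df of per-video statistics keyed on video_id, might look like this:

```python
# Hypothetical per-video statistics pulled from the videos endpoint.
stats_df = pd.DataFrame({
    "video_id": ["abc123", "def456"],
    "views": [10_500, 3_200],
    "likes": [850, 120],
})

# Left join keeps every comment and attaches its video's statistics.
merged_df = comments_df.merge(stats_df, on="video_id", how="left")
```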

Cleaning the Comment Text:

Before applying sentiment analysis, or any analysis in general, we absolutely should clean the comments. We would not be able to scale an analysis across hundreds of thousands of comments without some sort of cleaning. Let us start with removing tags, for example:

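The original snippet did not survive extraction. Assuming "tags" refers to @-mentions in comment text, a regex-based sketch could look like this:

```python
import re

def remove_tags(text: str) -> str:
    # Strip @-mentions such as "@SomeUser" from the comment text.
    return re.sub(r"@[\w.-]+", "", text)

merged_df["clean_text"] = merged_df["comment_text"].apply(remove_tags)
```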

There are lots of things we can do with emojis and emoticons. They convey sentiment via a pictogram, but unfortunately a lexicon cannot interpret a picture, so the emojis need to be converted to phrases. We can do that this way:

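One common approach (an assumption here, since the original code is not shown) is the third-party emoji package, which replaces each pictogram with its text name:

```python
import emoji  # third-party package: pip install emoji

def demojize_text(text: str) -> str:
    # Replace each emoji with its text name, e.g. "🔥" -> "fire",
    # so a word-based lexicon has something it can score.
    return emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")

merged_df["clean_text"] = merged_df["clean_text"].apply(demojize_text)
```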

URLs cannot be interpreted as much of anything. We will remove them:

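A regex sketch of the URL removal step (pattern assumed, since the original is not shown):

```python
import re

def remove_urls(text: str) -> str:
    # Drop http(s) links and bare "www." addresses.
    return re.sub(r"(https?://\S+|www\.\S+)", "", text)

merged_df["clean_text"] = merged_df["clean_text"].apply(remove_urls)
```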

HTML markup is another piece that should be cleaned up:

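A sketch of the HTML cleanup, assuming both escaped entities and residual tags appear in the comments:

```python
import html
import re

def remove_html(text: str) -> str:
    # Unescape entities like "&amp;" and strip residual tags like "<br>".
    return re.sub(r"<[^>]+>", " ", html.unescape(text))

merged_df["clean_text"] = merged_df["clean_text"].apply(remove_html)
```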

The next few functions were generated for future studies in Natural Language Processing (NLP), but are not necessarily used here. However, they are useful functions to reference back to later if you happen to be on this journey as well (a sketch of these helpers follows the list):

  • Remove Punctuation

  • Tokenize

  • Remove Stop Words

  • Lemmatize

  • Generate the number of words in a comment

  • Generate the number of sentences in a comment

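A sketch of those helpers using NLTK, assuming English-language comments (the function names are mine, not necessarily the originals):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads for the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))

def tokenize(text: str) -> list:
    return word_tokenize(text.lower())

def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens if t not in stop_words]

def lemmatize(tokens: list) -> list:
    return [lemmatizer.lemmatize(t) for t in tokens]

# Simple per-comment length features.
merged_df["word_count"] = merged_df["clean_text"].apply(lambda t: len(word_tokenize(t)))
merged_df["sentence_count"] = merged_df["clean_text"].apply(lambda t: len(sent_tokenize(t)))
```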

Implement NLTK VADER Lexicon:

One thing to note here is that the optimal approach to sentiment analysis is actually LABELING the data manually BEFORE attempting automated scoring. By doing this, you ensure that the labeling is unique to your own use case, which will be explored a bit later. The code to run the comments through the NLTK analyzer can be found below:

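A minimal sketch of that loop, matching the description in the next paragraph:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()

# Score each cleaned comment; polarity_scores returns a dict with
# "neg", "neu", "pos", and "compound" keys.
sentiment = []
for comment in merged_df["clean_text"]:
    sentiment.append(analyzer.polarity_scores(str(comment)))

merged_df["sentiment"] = sentiment
```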

This piece of the code will take a while, depending on the number of comments. In this block, we simply loop through the comments, write the results from SentimentIntensityAnalyzer to an empty sentiment list, and then add that list as a new column on the results/comments table. The sentiment analyzer returns a dictionary for each comment. We can split this dictionary into separate columns by executing the following code:

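One idiomatic way to do the split (the original code is not shown) is to expand the dictionaries with pandas:

```python
# Expand each sentiment dict into its own neg/neu/pos/compound columns.
scores = merged_df["sentiment"].apply(pd.Series)
merged_df = pd.concat([merged_df.drop(columns="sentiment"), scores], axis=1)
```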

Finally, we write the table to a CSV file so we can connect PowerBI. We are now going to create a simple visualization that allows a user to track the sentiment on their YouTube content.
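The export itself is a one-liner (the file name here is assumed):

```python
# Export for PowerBI; drop the index so it does not become a column.
merged_df.to_csv("youtube_comment_sentiment.csv", index=False)
```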

Visualizing the YouTube Data Collected:

Since the goal was to measure engagement on the channel, the more basic stats are shown at the top. Comments, Likes, Dislikes, and Views measure a general level of engagement with the topic of the video. One could even create a weighted average calculation with the average sentiment, thereby producing a single engagement score (a sketch follows below). The bar chart simply shows the top 10 videos; naturally, it makes sense to focus on the most-viewed videos to analyze why they may have been more successful. Comments are listed below along with the predicted sentiment values. NOTE: the compound score is a kind of ‘blended’ sentiment score. This Stack Overflow thread explains the mathematics of the scoring well enough: https://stackoverflow.com/questions/40325980/how-is-the-vader-compound-polarity-score-calculated-in-python-nltk.

Finally, there is a dual-axis chart that shows the average compound score over time relative to the views. There is actually little correlation between the two, but juxtaposing these values is a good way for an analyst to notice large swings in sentiment over time. Large contrasts in sentiment are key areas to investigate why the content was either well received or not.
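As a rough illustration of that weighted engagement-score idea, here is a sketch; the weights and aggregation choices are assumptions for illustration, not part of the original dashboard:

```python
# Hypothetical single engagement score: a weighted blend of normalized
# engagement counts and average compound sentiment per video.
per_video = merged_df.groupby("video_id").agg(
    views=("views", "first"),
    likes=("likes", "first"),
    comments=("comment_id", "count"),
    avg_compound=("compound", "mean"),
)

# Scale counts to [0, 1] so no single metric dominates.
normalized = per_video[["views", "likes", "comments"]].apply(
    lambda col: col / col.max()
)

# Illustrative weights only; tune these to your own use case.
per_video["engagement_score"] = (
    0.4 * normalized["views"]
    + 0.3 * normalized["likes"]
    + 0.2 * normalized["comments"]
    + 0.1 * per_video["avg_compound"]
)
```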

Considerations:

  • The sentiment analyzer struggles with context, particularly with colloquialisms like ‘bad ass’. In most contexts, that phrase is actually a ‘positive’ for some content even though the words technically might be considered ‘negative’. The underlying limitation is that people without domain expertise in the subject matter categorized the lexicon’s words as negative, neutral, or positive. In aggregate, I think this methodology is reliable and the benefits outweigh the risks of mislabeling.

  • Take steps to do human labeling before attempting a thorough sentiment analysis. The approach employed here is a somewhat ‘quick and dirty’ way of measuring sentiment.