How to: Parse Android Logs for Analytics and Machine Learning Applications
Introduction: What are Logs?
Building Android apps, or any software for that matter, eventually comes down to understanding why a bug is occurring. Bugs are just a natural part of software development. A key tool for understanding the state of your software at the time an issue happens is the log. Think of logs as a ledger for what is happening while the code is running. Engineers can print almost anything to the logs that might help them understand problems that pop up in the future.
Given that logs are often structured, contain a wealth of useful data, are easy to acquire, and are central to development, software logs are ripe for sophisticated analysis, and maybe even for applying machine learning. There are lots of tools for log analytics, such as Scalyr, Logz.io, Sematext, GrayLog, Nagios, and many others (https://opensource.com/article/19/4/log-analysis-tools). In many cases, a pre-built open-source tool will work in a pinch and be fairly reliable when a mission-critical bug plagues the backlog. However, it can be useful to have a way of creating your own customized solution.
Android LogCat Logs:
The structure of the Android logs is as follows. The main files that can be analyzed are the radio, main, event, and system logs. Each log file captures different characteristics of the system at any given time.
Each message in the log consists of the following elements:
A tag indicating the part of the system or application that the message came from
A timestamp (when the message was logged)
The message log level (or priority of the event represented by the message)
The log message itself (a detailed description of the error, exception, or information)
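As a quick sketch of how these elements line up, here is a hypothetical logcat line (in the common `threadtime` layout) split into the fields described above. The sample line and field names are made up for illustration:

```python
# A hypothetical logcat line: date, time, pid, tid, priority, tag, message
sample = "03-17 16:13:38.811 1702 4851 D WifiService: starting scan"

fields = sample.split()
date, time_, pid, tid, priority, app = fields[:6]
# everything after the first six fields is the free-text message
message = " ".join(fields[6:])

print(date, priority, app, message)
```

This whitespace split is exactly the approach the parsing code below relies on.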
There are a few different log types:
Application log -
Utilize the android.util.Log class methods to write messages of different priority to the log file
Java classes declare their tag statically as a string and can be many layers deep
System log -
Utilize the android.util.Slog class
Many frameworks use the system logs to separate certain messages from a potentially messy application log
Event log -
Event logs messages are created using android.util.EventLog class
Log entries consist of binary tag codes followed by binary parameters
The message tag codes are stored on the system at: /system/etc/event-log-tags
Radio log
Used for radio- and phone- (modem-) related information
Log entries consist of a binary tag code and a message containing network information
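Each of these log types lives in its own logcat buffer, which can be dumped with `adb logcat -b <buffer> -d` (`-d` dumps and exits rather than streaming). A small sketch of building those commands from Python, assuming `adb` and a connected device (the helper name is made up):

```python
import subprocess

# Logcat buffers corresponding to the log types described above.
BUFFERS = ["main", "system", "events", "radio", "crash"]

def logcat_dump_command(buffer_name):
    """Build the adb command that dumps one log buffer in threadtime format."""
    return ["adb", "logcat", "-b", buffer_name, "-d", "-v", "threadtime"]

# Example (requires adb on the PATH and a connected device):
# out = subprocess.run(logcat_dump_command("main"),
#                      capture_output=True, text=True).stdout
print(logcat_dump_command("radio"))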
Android Log Structure:
tv_sec tv_nsec priority pid tid tag messageLen Message
tag: log tag
tv_sec & tv_nsec: the timestamp of the log messages
In the logs we are going to parse, we extract the date and timestamp (down to the millisecond)
pid: process Id
tid: thread id
Priority value is one of the following character values:
V: Verbose (lowest priority)
D: Debug
I: Info
W: Warning
E: Error
F: Fatal
S: Silent (highest priority, on which nothing is ever printed)
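Because these priorities form an ordered scale, one common trick once the logs are parsed is to filter by a minimum severity. A minimal sketch (the `at_least` helper and sample entries are made up for illustration):

```python
# Logcat priority characters ordered from lowest to highest severity.
PRIORITY_ORDER = "VDIWEFS"

def at_least(priority, threshold):
    """True if `priority` is at or above `threshold` severity."""
    return PRIORITY_ORDER.index(priority) >= PRIORITY_ORDER.index(threshold)

# Keep only warnings and above from a list of (priority, message) pairs.
entries = [("D", "scan started"), ("W", "weak signal"), ("E", "socket closed")]
severe = [msg for p, msg in entries if at_least(p, "W")]
print(severe)  # ['weak signal', 'socket closed']
```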
Code for Parsing:
The parsing of the files is fairly straightforward—especially because the text files are delimited by simple whitespace.
```python
import pandas as pd
import numpy as np
import seaborn as sns
import re
import os, zipfile
import gzip
import shutil
import datetime
import matplotlib.pyplot as plt
```
After importing the key libraries, check the working directory and assign it to a variable. This assumes the script is placed in the directory containing the log files:
```python
# define the current working directory as a variable for extracting all the log files
cwd = os.getcwd()
# define the search path for the rest of the script to reference
search_path = os.getcwd()
print(cwd)
```
The cwd should be the folder where the log files are located. We'll define a function, used later, that pads the lists out to equal lengths. Then we get to work decompressing the log files so everything ends up as a text file:
```python
# Function to make the array lengths the same later
def pad_dict_list(dict_list, padel):
    lmax = 0
    for lname in dict_list.keys():
        lmax = max(lmax, len(dict_list[lname]))
    for lname in dict_list.keys():
        ll = len(dict_list[lname])
        if ll < lmax:
            dict_list[lname] += [padel] * (lmax - ll)
    return dict_list

# decompress every .gz archive in the directory into a plain .log text file
file_type = ".gz"
for fname in os.listdir(path=search_path):
    if fname.endswith(file_type):
        with gzip.open(fname, 'rb') as f_in:
            with open(fname + '.log', 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
```
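To see what `pad_dict_list` does, here is a self-contained check (the function body is restated so the snippet runs on its own, and the toy dictionary is made up):

```python
def pad_dict_list(dict_list, padel):
    # Pad every list in the dict to the length of the longest one.
    lmax = max(len(v) for v in dict_list.values())
    for k in dict_list:
        dict_list[k] += [padel] * (lmax - len(dict_list[k]))
    return dict_list

d = {"a": [1, 2, 3], "b": [1]}
pad_dict_list(d, "x")
print(d)  # {'a': [1, 2, 3], 'b': [1, 'x', 'x']}
```

The shorter list is padded in place with the filler value until it matches the longest one, which is exactly what a pandas DataFrame constructor needs.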
The next lines do the following:
need to get a list of all the main.log files into a list
need to loop through the list
read / parse each file
append each parsed line to the appropriate empty list
strip out some of the files from the list of files we are going to loop over and read
```python
mainLogs = []
keyword = 'main'
for fname in os.listdir(cwd):
    if keyword in fname:
        mainLogs.append(fname)
mainLogs = [item for item in mainLogs if not item.endswith('.gz')]

date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for main in mainLogs:
    with open(main, encoding='utf8', errors='surrogateescape', newline='\n') as logs:
        try:
            for line in logs:
                lines = line.split()
                # for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
            pass
```
After we have written our parsed fields to the lists, we need to recombine the messages and tags, since we split on whitespace. This next little piece of code recombines tags and text into a human-readable string:
```python
tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist) + " ")
```
The next lines of code assess the length of each list. For a dictionary of lists to be transformed into a pandas dataframe, each of the lists must be the same length.
```python
print("length of Date" + ' ' + str(len(date)))
print("length of Time" + ' ' + str(len(time)))
print("length of processID" + ' ' + str(len(processID)))
print("length of threadID" + ' ' + str(len(threadID)))
print("length of priority" + ' ' + str(len(priority)))
print("length of app" + ' ' + str(len(app)))
print("length of tagsText" + ' ' + str(len(tagsText)))
print("length of tagsTextComb" + ' ' + str(len(tagsTextComb)))
```

```
length of Date 3829775
length of Time 3829775
length of processID 3829775
length of threadID 3829775
length of priority 3829775
length of app 3829770
length of tagsText 3829770
length of tagsTextComb 3829770
```
The following code finalizes the processing of the main log:
Combine the lists into a dictionary
Call the function that pads the lists and evens them out
Create the dataframe for the main log
```python
mainDict = {'date': date, 'time': time, 'processID': processID,
            'threadID': threadID, 'priority': priority, 'app': app,
            'tagsText': tagsTextComb}
pad_dict_list(mainDict, 'x')
dfMain = pd.DataFrame(mainDict)
```
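Since the intro mentioned parsing the date and timestamp down to the millisecond, it is worth noting that the `date` and `time` columns can be combined into a real datetime column. A sketch on a hypothetical two-row frame; logcat dates carry no year, so one is assumed here:

```python
import pandas as pd

# Hypothetical parsed rows standing in for dfMain.
df = pd.DataFrame({
    "date": ["03-17", "03-17"],
    "time": ["16:13:38.811", "16:13:39.002"],
})

# Prepend an assumed year, then parse down to the millisecond.
df["timestamp"] = pd.to_datetime(
    "2021-" + df["date"] + " " + df["time"],
    format="%Y-%m-%d %H:%M:%S.%f",
)
print(df["timestamp"].dtype)
```

A proper datetime column makes it possible to sort, window, and resample log events later on.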
For the remainder of this post, we will process the remaining log files, combine them together, and clean them up for a bit of analysis:
```python
# crash logs
crashLogs = []
keyword = 'crash'
for fname in os.listdir(cwd):
    if keyword in fname:
        crashLogs.append(fname)
crashLogs = [item for item in crashLogs if item.endswith('.log')]

crashDate = []
crashTime = []
crashProcessID = []
crashThreadID = []
crashPriority = []
crashApp = []
crashTagsText = []
crashReadLine = []

for crash in crashLogs:
    with open(crash, encoding='utf8', errors='surrogateescape', newline='\n') as logs:
        next(logs)
        try:
            for line in logs:
                lines = line.split()
                # for debugging
                crashReadLine.append(lines)
                crashDate.append(lines[0])
                crashTime.append(lines[1])
                crashProcessID.append(lines[2])
                crashThreadID.append(lines[3])
                crashPriority.append(lines[4])
                crashApp.append(lines[5])
                crashTagsText.append(lines[6:])
        except IndexError:
            pass

crashTagsTextComb = []
for innerlist in crashTagsText:
    crashTagsTextComb.append(' '.join(innerlist) + " ")

crashDict = {'date': crashDate, 'time': crashTime, 'processID': crashProcessID,
             'threadID': crashThreadID, 'priority': crashPriority,
             'app': crashApp, 'tagsText': crashTagsTextComb}
pad_dict_list(crashDict, 'x')
dfCrash = pd.DataFrame(crashDict)

# event logs
eventsLogs = []
keyword = 'event'
for fname in os.listdir(cwd):
    if keyword in fname:
        eventsLogs.append(fname)
eventsLogs = [item for item in eventsLogs if not item.endswith('.gz')]

date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for event in eventsLogs:
    with open(event, encoding='utf8', errors='surrogateescape', newline='\n') as logs:
        next(logs)
        try:
            for line in logs:
                lines = line.split()
                # for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
            pass

tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist) + " ")

eventsDict = {'date': date, 'time': time, 'processID': processID,
              'threadID': threadID, 'priority': priority, 'app': app,
              'tagsText': tagsTextComb}
pad_dict_list(eventsDict, 'x')
dfEvents = pd.DataFrame(eventsDict)

# system logs
sysLogs = []
keyword = 'system'
for fname in os.listdir(cwd):
    if keyword in fname:
        sysLogs.append(fname)
sysLogs = [item for item in sysLogs if not item.endswith('.gz')]

date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for sysLog in sysLogs:
    with open(sysLog, encoding='utf8', errors='surrogateescape', newline='\n') as logs:
        try:
            for line in logs:
                lines = line.split()
                # for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
            pass

tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist) + " ")

sysDicts = {'date': date, 'time': time, 'processID': processID,
            'threadID': threadID, 'priority': priority, 'app': app,
            'tagsText': tagsTextComb}
pad_dict_list(sysDicts, 'x')
dfSys = pd.DataFrame(sysDicts)

# radio logs
radioLogs = []
keyword = 'radio'
for fname in os.listdir(cwd):
    if keyword in fname:
        radioLogs.append(fname)
radioLogs = [item for item in radioLogs if not item.endswith('.gz')]

date = []
time = []
processID = []
threadID = []
priority = []
app = []
tagsText = []
readLine = []

for radio in radioLogs:
    with open(radio, encoding='utf8', errors='surrogateescape', newline='\n') as logs:
        try:
            for line in logs:
                lines = line.split()
                # for debugging
                readLine.append(lines)
                date.append(lines[0])
                time.append(lines[1])
                processID.append(lines[2])
                threadID.append(lines[3])
                priority.append(lines[4])
                app.append(lines[5])
                tagsText.append(lines[6:])
        except IndexError:
            pass

tagsTextComb = []
for innerlist in tagsText:
    tagsTextComb.append(' '.join(innerlist) + " ")

radioDicts = {'date': date, 'time': time, 'processID': processID,
              'threadID': threadID, 'priority': priority, 'app': app,
              'tagsText': tagsTextComb}
pad_dict_list(radioDicts, 'x')
dfRadio = pd.DataFrame(radioDicts)

# combine everything into a single dataframe
frames = [dfRadio, dfSys, dfMain, dfCrash, dfEvents]
df = pd.concat(frames)
```
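Once the frames are concatenated, a quick sanity check is to count rows and priorities across the combined dataframe. A minimal sketch with tiny made-up frames standing in for `dfMain` and `dfCrash`:

```python
import pandas as pd

# Hypothetical mini-frames standing in for the parsed log dataframes.
dfMain = pd.DataFrame({"priority": ["D", "W"], "tagsText": ["scan ", "weak "]})
dfCrash = pd.DataFrame({"priority": ["E"], "tagsText": ["npe "]})

df = pd.concat([dfMain, dfCrash], ignore_index=True)

# Row count plus a breakdown of message priorities.
print(len(df), df["priority"].value_counts().to_dict())
```

`ignore_index=True` is used here so the combined frame gets a clean 0..n-1 index instead of repeating the indices of the source frames.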
This code should help you get started! In a follow-up piece, we'll go over some basic analytics, cleaning, and applications.