Functional Decomposition and its Utility for Data Scientists & ML Practitioners
Revisiting Code to Parse Android Log Files
In the last post (https://www.conaxon.org/projects/how-to-parse-android-logs-for-analytics-and-machine-learning-applications), we walked through how you might parse an Android log file to produce analytics that make log mining and analysis more efficient. The script we produced was close to 300 lines and not very efficient: depending on the size and number of log files, it took many minutes to finish. There were some key failures that needed addressing:
The script was not efficient with memory: we tried to load too much data into memory all at once instead of processing, filtering, and transforming the text files line by line
Repeating lines / functions over and over: instead of abstracting our code into interrelated functions, the script repeated the same processes again and again, ballooning the number of lines required to complete a simple task, creating hard-to-read code, and weakening the script's ability to handle variability in future use
Hard-Coding Variables: Sometimes hard-coding variables is necessary, and even fairly efficient, when doing a simple one-time analysis. However, for tools that are meant to scale across teams and use cases, hard-coding a ton of variables becomes very limiting and introduces the risk that a script will fail. Avoid coding only for the happy path; train yourself to consider the full range of use cases when designing code. A sketch of what these fixes can look like in practice follows this list.
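To make those fixes concrete, here is a minimal sketch of the kind of parser the refactor moved toward: it streams the file line by line, wraps the logic in a reusable function, and takes its filter criteria as parameters rather than hard-coded values. The regex, function names, and the `logcat.txt` path are illustrative assumptions (a standard logcat "threadtime" layout), not excerpts from the original script.

```python
import re
from pathlib import Path
from typing import Iterator

# Illustrative pattern for a logcat "threadtime" line, e.g.:
# "03-17 16:13:38.811  1702  2395 D WindowManager: freezing display ..."
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}-\d{2})\s+(?P<time>[\d:.]+)\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+(?P<level>[VDIWEF])\s+"
    r"(?P<tag>\S+)\s*:\s?(?P<message>.*)$"
)

def parse_log(path: Path, levels: set[str] | None = None) -> Iterator[dict]:
    """Yield parsed entries one at a time instead of loading the whole file.

    `levels` is passed in rather than hard-coded, so callers decide what to keep.
    """
    with path.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:                      # line-by-line: constant memory
            match = LOG_LINE.match(line)
            if not match:                        # tolerate non-happy-path lines
                continue
            entry = match.groupdict()
            if levels is None or entry["level"] in levels:
                yield entry

if __name__ == "__main__":
    # Example usage (hypothetical file name): count error/fatal lines
    # without ever holding the file in memory.
    errors = sum(1 for _ in parse_log(Path("logcat.txt"), levels={"E", "F"}))
    print(f"error/fatal lines: {errors}")
```

Because the filter levels and the file path are arguments, the same function can serve a quick one-off analysis or a scheduled pipeline without edits to its body.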
Increasingly, Machine Learning Engineers, Data Scientists, Data Engineers, and Analysts are pushed to conceptualize the ‘full stack’ when developing solutions. Many machine learning products are deployed as tools rather than one-time analyses. Simply put: use should inform design. My background is not in Software Engineering, but the more analytics and machine learning tools I deploy, the more software-engineering first principles matter when considering how a problem should be tackled.
After looking at the first version of the parsing script, it was clear that a refactoring was required.
What is ‘Functional Decomposition’?
I cannot take all the credit for ‘dreaming’ up this improved approach. Many hours were spent with two colleagues to get the flow right. A thought did cross my mind as I was being blown away by their teachings: “What exactly are we doing here?” I will admit, my scripts are often an amalgamation of StackOverflow threads, Udemy courses, and CodeMentor.
So, what exactly is Functional Decomposition and why does it matter?
In the context of programming, the breakdown of complex systems into smaller functional components is called Functional Decomposition (source: https://stackoverflow.com/questions/947874/what-is-functional-decomposition)
The benefit of functional decomposition is that once you start coding, you are working on the simplest components you can possibly work with for your application. Developing and testing those components therefore becomes much easier (not to mention you are better able to architect your code and project to fit your needs).
Conceptually, the functional decomposition methodology makes a lot of sense. Yes, it takes time up front to conceptualize the components first, but the work pays off down the road when you test and add new functionality later. The main goal is to make your solution modular so it can scale over time. Your analytics solutions should reuse and recycle as much as possible while, of course, making some improvements along the way.
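As a concrete illustration of decomposing the log-analysis task, the sketch below splits it into single-purpose functions, reading lines, extracting a priority level, aggregating, plus a thin orchestrator that composes them. The function names and the threadtime field positions are assumptions for illustration, not the actual structure of the refactored script.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator

# Each function does one thing; the orchestrator at the bottom composes them.

def read_lines(path: Path) -> Iterator[str]:
    """Stream raw lines from a log file."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        yield from handle

def extract_level(line: str) -> str | None:
    """Pull the single-character priority (V/D/I/W/E/F) out of a threadtime line."""
    parts = line.split()
    return parts[4] if len(parts) > 5 and parts[4] in "VDIWEF" else None

def summarize_levels(levels: Iterable[str]) -> Counter:
    """Aggregate a stream of priorities into counts."""
    return Counter(levels)

def analyze(path: Path) -> Counter:
    """Orchestrator: compose the small pieces into the full task."""
    levels = (extract_level(line) for line in read_lines(path))
    return summarize_levels(level for level in levels if level is not None)
```

Each piece can be swapped independently: a different log format only touches `extract_level`, and a different summary only touches `summarize_levels`.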
Modular code makes testing easier when features or functionality are added later in the development process. On the flip side, modular code also allows pieces to be unplugged quickly if a defect turns out to be specific to one part of the code base. Take some time to sit with stakeholders and developers to mind-map the functionality and break it down into areas; areas can then be broken down into the individual actions or processes that must take place for the tool to deliver a certain piece of functionality. Keep in mind that there is a point where too much decomposition creates a code base that is disjointed and tough to navigate. Each team will need to find an appropriate balance of granularity and readability.
Breaking code out into functions not only makes it easier to plug code in and out, but also to test it. Being able to test new code, and catch regressions in older parts of the code base, is supremely important to keeping a code base manageable. Having the ability to isolate one function, run a test, and determine whether it works is key to saving time and reducing the need to open up other areas of the code, which in turn reduces the risk that a bug escapes.
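For example, once the extraction logic lives in its own function, it can be tested in isolation with a handful of one-line cases. A hypothetical pytest module, assuming the functions above live in a module named `log_parsing`, might look like this:

```python
# test_log_parsing.py -- illustrative pytest tests against the sketch above.
# Because each function is isolated, a defect in extract_level can be found
# and fixed without touching file handling or aggregation code.
from collections import Counter

from log_parsing import extract_level, summarize_levels  # hypothetical module name

def test_extract_level_reads_priority():
    line = "03-17 16:13:38.811  1702  2395 D WindowManager: freezing display"
    assert extract_level(line) == "D"

def test_extract_level_rejects_garbage():
    assert extract_level("--------- beginning of main") is None

def test_summarize_levels_counts():
    assert summarize_levels(["E", "E", "W"]) == Counter({"E": 2, "W": 1})
```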
What Utility does Functional Decomposition have for Analytics / Machine Learning?
Machine Learning, Artificial Intelligence, and (to a slightly lesser degree) Analytics are not just model building. In a large corporation like Facebook, Zillow, Apple, or Google, there might be teams whose only mandate is building models or dashboards. However, most companies are not large enough to support a team that focuses solely on models. The reality is that software engineering principles are becoming a core part of the job for AIML Engineers, Data Engineers, BI Engineers, and Data Scientists. Fast approaching is a future where companies will want ‘Full-Stack AIML Developers’ who can conceptualize the delivery of an entire solution built around AI, ML, and Analytics.
The Full-Stack AIML Developer makes sense, even though it is difficult to imagine many people having all of this knowledge at their fingertips. Companies like Microsoft have taken impressive steps toward automating much of the data analytics, modeling, engineering, and pipelining required to create an AIML product through their Azure offering. Couple those technologies with an experienced software engineer to wrap them into a cohesive package and magic will happen. Admittedly, Azure can only do so much for ‘generic’ business problems, and a need still exists for customization in AIML modeling; however, Azure continues to add new pretrained models all the time, along with the ability to fine-tune them for specific applications. With that kind of momentum, truly customized AIML development might not make sense once the costs and benefits are taken into account. Automation will free up bandwidth in a software engineer's personal development stack and allow them to deliver AIML to the corporate masses quickly.
When AIML developers are creating models, training them, and evaluating performance, it pays dividends to build modularity into their code, making it easier for their work to be integrated into downstream processes and systems. There is a significant difference between the AIML developer who focuses only on the science and the one who also invests in the software engineering. AIML practitioners should keep improving their code so that it can stand on its own. A very common criticism of Data Science departments is that deployment teams often have to rewrite models for production use; I do not think this needs to be the case. AIML teams should devote time to building and honing their software engineering capabilities so that their work can be leveraged by the rest of the organization as independent functions or platforms without a ton of rework. Part of starting down that path is writing code that is easy to read, modular, easily tested, and explainable.
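As a closing illustration, here is one possible sketch of what “standing on its own” can mean for a model: a small wrapper class that exposes a clean scoring interface so downstream services consume an artifact rather than a notebook. The class name, artifact path, and use of scikit-learn-style models with joblib are assumptions for illustration, not a prescribed design.

```python
# Illustrative sketch: the trained model lives behind a small, well-defined
# interface so a deployment team can call it without rewriting the research code.
from dataclasses import dataclass

import joblib
import numpy as np

@dataclass
class LogAnomalyScorer:
    """Wraps a trained classifier; downstream services only see score()."""
    model_path: str  # hypothetical path to a serialized scikit-learn model

    def __post_init__(self) -> None:
        self._model = joblib.load(self.model_path)

    def score(self, features: np.ndarray) -> np.ndarray:
        """Return anomaly probabilities for a batch of feature vectors."""
        return self._model.predict_proba(features)[:, 1]

# Usage (hypothetical): the serving layer imports the class, not the notebook.
# scorer = LogAnomalyScorer(model_path="models/log_anomaly.joblib")
# scores = scorer.score(feature_matrix)
```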