One of the key enablers of the current meteoric advances in AI has been the free and open disclosure of machine learning (ML) concepts and results through mechanisms such as the arXiv technical paper repository. Another critical, but often overlooked, driver has been the sharing of machine learning software tools and libraries among AI researchers.
There are innumerable examples of this. An obvious one is the PyTorch framework. In addition to providing basic tools to apply AI to images and text, PyTorch provides powerful methods to manipulate tensors – the lifeblood of machine learning operations – and to quickly and automatically compute gradients over the computational graph that defines the AI algorithm. This capability for automatic gradient computation alone can shave months or even years off an ML project. On top of frameworks such as PyTorch, one can then layer a sometimes-bewildering array of powerful AI/ML libraries. For example, scikit-learn provides algorithms for classification, regression, clustering and other more esoteric but critical ML functions. Another example is Keras, which provides quick and easy access to basic ML operators and algorithms. Yet another, perhaps more humble but equally important package is Matplotlib, which allows ML researchers to communicate complex results through beautifully rendered graphs and images.
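To illustrate why automatic gradient computation is so valuable, here is a minimal PyTorch sketch (the tensor values are arbitrary, chosen only for the example): gradients over the entire computational graph come from a single call to backward(), with no hand-derived calculus.

```python
import torch

# Two parameters tracked by PyTorch's autograd engine.
x = torch.tensor([2.0, 3.0], requires_grad=True)

# A tiny computational graph: y = x1^2 + x2^2
y = (x ** 2).sum()

# One call walks the graph backwards and fills in dy/dx = 2x.
y.backward()
print(x.grad)  # tensor([4., 6.])
```

The same one-line backward pass scales to graphs with billions of parameters, which is precisely what makes modern deep learning development tractable.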
Despite their incredible utility and value, these packages are usually available to ML researchers and developers at zero cost. And they can be obtained with just a few mouse clicks. Usually, this is done with sophisticated package managers built into, or designed for use with, the Python programming language, which itself forms the basis for the vast majority of AI/ML software development. Package management systems such as pip or conda support easy downloading, installation and automatic updating of toolkits and libraries. Currently, around 1,000 packages are available through conda alone.
There is, therefore, a lot of outstanding AI/ML software available, just waiting to be used. ML developers can quickly and easily build, test, and deploy their algorithms on top of a deep hierarchy of powerful ML software that represents, in effect, thousands of person-years of development effort. Building on top of the latest and most powerful research is thus effectively built into the AI/ML development process, and as a consequence the whole industry moves forward at lightning speed, with new AI wonders delivered seemingly every week.
Are there undesirable consequences, however, that come along with enabling this fantastic AI/ML development machinery? Unfortunately, the answer is a resounding ‘yes’. Consider first the set of software dependencies built into a finished AI/ML program. That program might directly depend on class libraries from dozens of packages, and those packages may, in turn, depend on other packages in a hierarchical fashion. As a result, the finished AI/ML program incorporates millions of lines of code, any portion of which could be malicious. Here, of course, the intrinsic strength of the open-source software development process that generates many of these packages really helps: the open-source community is adept at scrutinizing code for malicious intent. Nevertheless, the security of the resulting code base depends on the quality and care of those open-source developers and maintainers, who are often unpaid and working on multiple projects at a time, so there are many opportunities for malware to slip through. A robust and well-designed software bill-of-materials (SBOM) management process is a critical first step to addressing this.
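As a starting point, the dependency surface of a Python environment can be enumerated directly from package metadata. The following is a minimal sketch – not a full SBOM tool – that lists every installed distribution and its declared requirements using the standard-library importlib.metadata module:

```python
from importlib.metadata import distributions

# Walk every package installed in the current Python environment and print
# its name, version, and declared (direct) dependencies.
for dist in distributions():
    name = dist.metadata["Name"]
    requires = dist.requires or []  # requirement strings from the package metadata
    print(f"{name}=={dist.version}")
    for req in requires:
        print(f"    depends on: {req}")
```

Real SBOM tooling goes much further – hashes, provenance, known-vulnerability feeds – but even this simple listing makes the depth of the dependency hierarchy visible.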
Then, leaving aside the program code that makes up much of an AI/ML model, it is important to understand that the model also comprises the critical parameters, or weights, that drive the algorithm. These parameters are usually passive data – that is, they do not execute program code and so don’t usually enable malware insertion. Nevertheless, they can be changed by a malicious actor, leading to degraded or even entirely dysfunctional model operation. Or they can simply be stolen. Interestingly, major frameworks, such as the aforementioned PyTorch framework, do not support the digital signing or encryption of model parameters out of the box. As a result, in a naïve implementation, there is no easy way to verify that the model parameters are indeed the correct model parameters, or to prevent the parameters from being extracted if a malicious actor gains access to the model.
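One lightweight mitigation – assuming a checkpoint saved with torch.save and an expected digest published through a trusted channel, both assumptions for this sketch – is to verify a cryptographic hash of the weights file before loading it:

```python
import hashlib
import torch

# Expected SHA-256 digest of the checkpoint, obtained from a trusted source
# (hypothetical placeholder value for this sketch).
EXPECTED_SHA256 = "replace-with-published-digest"

def load_verified(path: str, expected_digest: str):
    # Hash the raw checkpoint bytes before deserializing anything.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_digest:
        raise ValueError(f"Checkpoint digest mismatch: {digest}")
    # weights_only=True restricts loading to plain tensors and containers
    # (available in recent PyTorch releases).
    return torch.load(path, weights_only=True)

state_dict = load_verified("model_weights.pt", EXPECTED_SHA256)
```

This checks integrity, but not authenticity or confidentiality; a proper signature scheme and encrypted storage would be needed for those.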
And yet there is a bigger issue: the use of Python as the core development language. Python is a so-called interpreted language. That is, when a host computer runs a Python program, the interpreter first compiles the source code into an intermediate bytecode form, which it then executes on the fly. This makes Python a wonderfully flexible and efficient development platform, but it also means the source code must remain readable to the interpreter, so encrypting Python programs is effectively impractical. Consequently, encryption of the program components of AI/ML models expressed in Python is also impractical. And since the bulk of AI/ML software is developed in Python, it is not easy to encrypt and secure the code base of a project. There have been efforts within the major ML frameworks to address this issue. For example, PyTorch supports TorchScript, in which the dynamic computational graph that represents the AI/ML model is captured and can then be exported to a more locked-down deployment environment, such as one based on C++. Nevertheless, these efforts are just beginning, and they are still cumbersome.
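To make the TorchScript route concrete, here is a minimal sketch – the toy model and file name are purely illustrative – that scripts a module and saves it as a self-contained archive, which a C++ application can later load via torch::jit::load without shipping any Python source:

```python
import torch

class TinyNet(torch.nn.Module):
    """A hypothetical toy model used only to illustrate the export step."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet().eval()

# Compile the module and its computational graph to TorchScript.
scripted = torch.jit.script(model)

# Save a self-contained archive; a C++ program can load it with
# torch::jit::load("tiny_net.pt") and run it without the Python source.
scripted.save("tiny_net.pt")
```

Even so, the workflow remains clunky for large, dynamic codebases, which is exactly the gap the article goes on to describe.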
Where there is a gap in capability, however, there is an opportunity to innovate. Given the size and importance of the AI/ML software development market, there is a real business opportunity for inventive startups to pursue. This aligns with SineWave’s focus on two key areas of our thesis: AI/ML and software security. We are excited about the possibilities.