PID for Data Science
A couple of weeks ago a colleague and I were sharing stories about the paths that led us to our current understanding of technology. I mentioned that my formative years as a technologist were spent in the control systems field, where I programmed PLCs and industrial control systems for an advanced manufacturing facility. During that time I learned about a nifty little algorithm called a “PID controller”, and my colleague and I wondered aloud whether PID controllers had any applicability in the data science domain. I immediately began to think of potential cybersecurity applications, but initially thought that PID wouldn’t add much value, since anomaly detection over the volume of data generated by a typical IDS sensor would be better suited to a regression algorithm or decision tree. I’ve thought about this a lot since our initial conversation, and I must admit that my position has changed. I now believe there is great potential for the application of a modified PID controller in data science.
If you are not familiar with PID controllers (PID stands for Proportional-Integral-Derivative), they are essentially feedback control loops that automate the control of a system by adjusting an input and monitoring the output of a given process. If there’s a difference between the measured output and the desired output, the input is adjusted to account for the error. The proportional term reacts to the current error, the integral term to the error accumulated over time, and the derivative term to how quickly the error is changing. The first PID designs were employed for steering ships, automating the task of maintaining a heading while accounting for changing winds and unforeseen currents.
PID controllers have been around since the 1920s and are very common today. In fact, you likely rely on this kind of feedback control every day and don’t even know it. The thermostat in your home or office is a close relative: the desired output is the temperature you set, and the control system measures the inside temperature and decides whether to turn on your air conditioner or heater, although most household units simply switch on and off rather than computing the full PID terms. Just about every modern automobile comes standard with cruise control, which is another feedback controller, typically PID-based, that we use every day.
I have an extensive background with control systems and know from experience that PID controllers are quite useful. They are well documented, easy to implement, and once tuned require very little adjustment. Unfortunately, unless you are in the control systems or engineering field, you’ve probably never heard of them. They certainly weren’t covered in my statistics, computer science, or calculus classes. My only formal exposure to PID was through vendor training: an Allen-Bradley PLC training workshop provided by my employer.
What is most notable is that PID seems to be completely absent from the modern data science toolkit. I could not find a PID controller package for R; I managed to find a blog post showing how to create one, but no ready-made package. The scikit-learn Python library doesn’t have a PID controller, and I couldn’t find PID in Apache Hadoop or the Spark framework either. Why? I suppose it is because these kinds of data science packages and libraries are designed to make sense of big datasets with lots of variables, whereas a PID controller is designed to read one thing and control another. However, just because PID controllers are not inherently designed to work with big data doesn’t mean that they can’t be incredibly useful in applications that need to work with data.
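For readers who have never seen one, here is a minimal sketch of what "read one thing and control another" looks like in Python. It is a textbook discrete PID update, not code from any library; the class name `PID`, the gains, and the toy room-temperature model in `simulate` are all assumptions chosen for illustration, and a production controller would also need safeguards such as integral anti-windup.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative, hypothetical helper)."""
    kp: float        # proportional gain
    ki: float        # integral gain
    kd: float        # derivative gain
    setpoint: float  # desired value of the measured output
    _integral: float = field(default=0.0, init=False, repr=False)
    _prev_error: Optional[float] = field(default=None, init=False, repr=False)

    def update(self, measurement: float, dt: float) -> float:
        """Return the new control signal for one reading taken dt seconds after the last."""
        error = self.setpoint - measurement
        self._integral += error * dt
        # No derivative on the first call, because there is no previous error yet.
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def simulate(steps: int = 200, dt: float = 1.0) -> float:
    """Drive a made-up room model toward 21 degrees and return the final temperature."""
    pid = PID(kp=0.5, ki=0.05, kd=0.1, setpoint=21.0)
    temp, outside = 15.0, 10.0
    for _ in range(steps):
        # Clamp the heater power to what a (hypothetical) heater can deliver.
        power = max(0.0, min(5.0, pid.update(temp, dt)))
        # Assumed plant: heating raises the temperature, heat leaks to the outside.
        temp += (power - 0.1 * (temp - outside)) * dt
    return temp


if __name__ == "__main__":
    print(f"final temperature: {simulate():.2f}")
```

The whole controller is one measurement in and one control signal out, which is why it takes only a few lines to roll your own.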
I was a self-taught programmer and was writing code long before I was ever formally trained to do so. In hindsight I think I’m better off for it, because I believe that building knowledge through hypotheses and experimentation is far more beneficial than classroom-based didactic methods - at least for me. I have tried and failed a lot over the years. As a result, I have become very, very good at troubleshooting. I had to figure out how to get things done out of necessity, so I was never limited to what the libraries had to offer. I say this because I believe that with enough domain experience, many data science problems can be simplified and broken down such that a PID controller can easily accomplish the task, even if you have to ‘roll your own’ controller from scratch.
Keep in mind that PID is not a method for classification or clustering. There is no model being created, and this isn’t your standard Bayesian approach. While PID controllers can definitely be useful for solving many data science related problems, they will never be able to predict anything. So, what are the potential use cases where you might choose a PID controller? Simply put, anywhere there is a direct relationship between a variable you can control and an output that you can monitor. Perhaps optimizing load balancers. Perhaps throttling. Perhaps auto-provisioning in a DevSecOps pipeline. The possibilities are limitless.
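To make the throttling case concrete, here is a rough sketch of a loop that nudges an admitted request rate until a measured latency settles at a target. It uses the incremental ("velocity") form of a PI controller, with no derivative term, which adjusts the current rate rather than recomputing it from scratch. Everything here is an assumption for illustration: the function names, the gains, and especially `fake_latency_ms`, which stands in for a real latency measurement with a made-up linear model.

```python
def throttle_step(rate, error, prev_error, kp=0.2, ki=0.4, dt=1.0,
                  min_rate=1.0, max_rate=1000.0):
    """Return the next admitted rate using an incremental PI update."""
    # Proportional part reacts to the change in error, integral part to the error itself.
    delta = kp * (error - prev_error) + ki * error * dt
    # Keep the rate inside sane bounds (assumed limits).
    return max(min_rate, min(max_rate, rate + delta))


def fake_latency_ms(rate):
    """Hypothetical service model: 50 ms base latency plus 0.5 ms per request/second."""
    return 50.0 + 0.5 * rate


def run(target_ms=200.0, steps=100):
    """Adjust the rate for a number of control intervals and return where it settles."""
    rate = 100.0
    prev_error = 0.0
    for _ in range(steps):
        # Positive error means latency is below target, so more traffic can be admitted.
        error = target_ms - fake_latency_ms(rate)
        rate = throttle_step(rate, error, prev_error)
        prev_error = error
    return rate


if __name__ == "__main__":
    print(f"settled rate: {run():.1f} requests/second")
```

Under the assumed model, a 200 ms target corresponds to 300 requests per second, and the loop finds that rate without ever being told the model. That is the appeal: one variable you control, one output you monitor, and no training data.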