A lesson in the infrastructure paradox: Security vs. User-friendliness
In the process of creating a great service, one has to deal with a multitude of variables that could spoil that endeavour. And as the complexity of the system grows, so do the potential known and unknown issues. One may even say that planning and architecting a complex system is a never-ending struggle. Sometimes it surely feels like it. It also feels like constantly trying to foresee the future and think of all the ways our decisions from today will come back to kick us in the knee a few months down the road.
One such complex problem for us at Gilion has been data storage. AiM's purpose is to give highly valuable insights into the wellbeing of our customers' businesses. In order to do that, we need large amounts of data. Collecting and storing such data is technically not a problem these days; keeping it safe and compliant with various regulations is a completely different issue. When we architected AiM, our goal was to achieve ease of ingestion and processing of large amounts of data for our engineers, but also to keep the data safe, easy to remove if needed, and to make sure that our clients' data never mixes. Not even by accident.
To achieve this, we've decided to treat each client as a completely independent project within our system. We've avoided storing mixed customer data in a centralized database and then cherry-picking what we need. Instead, we opted for creating a separate Google Cloud Platform (GCP) project for every new client. In practice, this means that each new account created in AiM gets its own GCP project. We have developed our own services that take care of this in the background, making sure that proper privileges are granted where necessary. Additionally, this makes it easy for us to stay GDPR compliant, as data deletion upon request becomes less of a burden. We have also made sure that the incoming data from the various sources our customers connect their accounts to is automatically stripped of any Personally Identifiable Information before it even reaches our storage.
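To make the idea concrete, here is a minimal sketch of what per-client provisioning could boil down to. The function, naming scheme and folder ID are hypothetical, and our actual services do much more (IAM, budgets, monitoring), but at its core a new AiM account translates into a `gcloud projects create` call:

```python
import re
import subprocess
import uuid


def provision_client_project(client_name: str, folder_id: str) -> str:
    """Create an isolated GCP project for a newly registered client.

    Illustrative sketch only: the naming scheme and folder layout are
    hypothetical, and error handling and IAM setup are omitted.
    """
    # GCP project IDs must be 6-30 chars: lowercase letters, digits, hyphens.
    slug = re.sub(r"[^a-z0-9-]", "-", client_name.lower())[:18]
    project_id = f"aim-{slug}-{uuid.uuid4().hex[:6]}"

    # One project per client keeps data physically separated from day one.
    subprocess.run(
        [
            "gcloud", "projects", "create", project_id,
            f"--folder={folder_id}",       # keep all client projects under one folder
            f"--name=AiM {client_name}",   # human-readable name in the console
        ],
        check=True,
    )
    return project_id
```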
It sometimes seems that tight security and user-friendliness don't really go hand in hand, though I personally refuse to accept this. But in order to use all of that secured and compartmentalized data, we needed a system that is both secure and easy to work with. After all, what is the use of AiM as a platform if we can't easily process all the data and give back valuable insights?
Our analytics and data engineers write all sorts of sophisticated solutions that grant us what sometimes feels like the power to see into the future. My task as an infrastructure engineer was to find the most efficient way to empower engineers while maintaining the strict security on which we are building our system. On top of that, we also need to worry about costs, both in money and in time. I want my engineers to be able to process all the data they want, as efficiently as possible, at the lowest possible cost. How hard can that be, right?
To achieve all of these goals, we decided early on to build our system around Kubernetes. Our guiding idea was that we wanted to be able to scale out very fast when needed, yet still decide how much money we burn along the way. Having Kubernetes as the foundation gives us the power to choose which types of machines we use for different purposes, which in turn lets us control how much money we spend.
I firmly believe in automation and therefore, at every step of architecting a system, I try to make myself, and other humans, obsolete. Or at least bring the necessity for our presence down to a bare minimum. With that in mind, I've decided to build our data processing system on GitOps principles. GitOps is a set of practices for managing infrastructure and application configuration using Git as the single source of truth. Simplified, that means that changes both to the underlying systems, such as the Kubernetes cluster, service accounts and databases, and to the services our engineers are building are made through GitHub. Whether an engineer wants to grant privileges to an application or to a certain client's data, it is done through code that someone needs to review and approve before it becomes reality. In other words, any and all manual changes to the system are prohibited.
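As a hedged example of what "privileges as code" can look like, the snippet below describes a read-only grant on one client's project as plain data, plus the gcloud call automation could run once the PR is merged. The project, service account and helper function are made up for illustration:

```python
import subprocess

# Hypothetical declarative privilege grant that lives in the Git repository.
# Nothing is applied by hand: a reviewer approves the PR, and automation
# reconciles this desired state against GCP.
BIGQUERY_READ_GRANT = {
    "project": "aim-client-acme-3f9a1c",   # the client's isolated GCP project
    "role": "roles/bigquery.dataViewer",   # read-only access to that client's data
    "member": "serviceAccount:churn-model@aim-ml.iam.gserviceaccount.com",
}


def apply_grant(grant: dict) -> None:
    """Apply a single IAM binding; illustrative, without error handling."""
    subprocess.run(
        [
            "gcloud", "projects", "add-iam-policy-binding", grant["project"],
            f"--member={grant['member']}",
            f"--role={grant['role']}",
        ],
        check=True,
    )
```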
The diagram above tries to illustrate the GitOps flow. Any code change, be it to the ML models we're building or to the infrastructure we're using, goes through the following steps:
- Make code changes
- Push the changes to GitHub
- Create a PR
- Wait for the PR approval
- Merge
After that, everything happens without the engineers needing to do anything. We have decided to build our system using tools from the Argo Project. We use ArgoCD for continuous delivery of code changes; it also allows us to quickly roll back if the need arises. The ML models are then deployed as Argo Workflows. This simple (at first glance only) yet powerful tool empowers our analytics and data teams to customize different jobs using simple and intuitive templates. And since Argo Workflows was created with Kubernetes in mind, our engineers can tweak the resources they need right in the very template they wrote for the ML job.
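For illustration only (this is not one of our actual templates), an Argo WorkflowTemplate is just another Kubernetes resource, so the engineer who writes the job also declares its resource needs and target node pool in the same definition. In the repository this is plain YAML; here it is shown as a Python dict, with all names and numbers made up:

```python
# Minimal sketch of an Argo WorkflowTemplate, expressed as a Python dict.
# Image, names and resource figures are placeholders for illustration.
churn_training_template = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "WorkflowTemplate",
    "metadata": {"name": "train-churn-model"},
    "spec": {
        "entrypoint": "train",
        "templates": [
            {
                "name": "train",
                # The engineer who owns the model decides how big the job is...
                "container": {
                    "image": "europe-docker.pkg.dev/aim-ml/models/churn:latest",
                    "command": ["python", "train.py"],
                    "resources": {
                        "requests": {"cpu": "4", "memory": "16Gi"},
                        "limits": {"cpu": "8", "memory": "32Gi"},
                    },
                },
                # ...and which node pool it should land on.
                "nodeSelector": {"cloud.google.com/gke-nodepool": "ml-highmem"},
            }
        ],
    },
}
```

Because the template is versioned in Git like everything else, bumping the memory request for a model is itself a reviewed PR rather than a manual tweak on the cluster.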
Keeping things as simple as possible has given us the power to iterate on code changes quickly and efficiently, and to build and even break things if necessary, because the engineers are confident that rolling back changes will always be easy.
When it comes to cost control, running things in Kubernetes gives us a lot of flexibility with the underlying machines that run our ML models. Not all models are created equal, so knowing exactly how many resources each of them will need is not always easy. My idea was to avoid blocking the engineers, but also to avoid burning unnecessary money. Being able to split the jobs across various node pools within Kubernetes, and to pay for the bare minimum when the system is not fully utilized, has made cost control very easy.
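As a rough sketch of what that means in practice, a dedicated node pool for heavy ML jobs can autoscale all the way down to zero nodes when nothing is running. The cluster, pool, region and machine type below are placeholders, not our actual setup:

```python
import subprocess

# Illustrative only: creates a GKE node pool for heavy ML jobs that scales
# to zero when idle, so we pay the bare minimum outside of training runs.
subprocess.run(
    [
        "gcloud", "container", "node-pools", "create", "ml-highmem",
        "--cluster=aim-data-processing",
        "--region=europe-north1",
        "--machine-type=n2-highmem-8",   # sized for memory-hungry models
        "--enable-autoscaling",
        "--min-nodes=0",                 # no jobs, no nodes, (almost) no cost
        "--max-nodes=10",
        "--spot",                        # spot VMs keep the price down for batch work
    ],
    check=True,
)
```

Jobs are steered onto the pool via the nodeSelector shown in the workflow template sketch above; when the queue is empty, the pool shrinks back to zero and stops costing money.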
While our job is to create powerful insights, metrics and predictions, my job hasn't changed much since this system was put in place. Trying to predict the future and figuring out what in this system will come back to kick me in the knee down the road is something I deal with a lot. But don't get me wrong, I love what I'm doing. It is a never-ending source of the most interesting challenges and a great way to keep myself humble, because no matter how smart I think the solution I've made is, there is a great chance that the future will prove me wrong in one way or another.