The Journey of a System
It has been a long time since my last blog post. The world has seen a lot of change since then: technological innovation, a pandemic, and above all a shift in how we communicate with each other and with digital products.
This post is a brief look at the journey of one of our products, one we have been deeply involved with since its beginning three years ago. Today Cakap is one of the leading e-learning platforms in Indonesia, helping thousands, if not millions, of students participate in online classes. The content of this post comes from my latest presentation at the company, and I am happy to share it here as a moment to look back at what we have done throughout these years.
An Ancient Monolith
The very old Cakap (formerly Squline) ran on just a few EC2 instances, with the database on RDS. The application was deployed by a small script that pulled code from a specific GitLab branch onto the instance and ran a few commands. Data was stored locally on the instance as well. So, yeah, quite traditional, but at least it worked for a while.
When we joined the team, it was clear we had to move away from this traditional setup, for two main reasons:
- It slowed us down, especially as the development team grew in size
- The system became unresponsive whenever traffic rose even slightly, and we had no visibility at all into what was happening
So the first step we took was to completely redesign the system, guided by two principles:
- The new system must not only provide our users with better response times and stability, but also satisfy the needs of the development teams, while ensuring other teams can rely on it to expand the business and its operations
- It must balance effort, cost, performance, and the maintenance processes
A More Modern System, but Not Cloud Native Yet
After three years and a few iterations, Cakap's infrastructure is now powered by three different cloud providers and fully written as Infrastructure-as-Code. There is still some diversity, such as relying on Ansible to configure bare VMs in the cloud, but most of our workloads now run in a managed Kubernetes cluster, which gives us scalability and high availability while keeping the cost, in both human effort and money, as low as possible.
Cloud Providers
Written in Terraform, our infrastructure spans AWS, DigitalOcean, and Alicloud at the same time, thanks to IaC definitions that can be upgraded and modified from the command line with one single `terraform apply`.
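To give a flavor of what this looks like, here is a minimal, hypothetical sketch of a multi-provider setup. The provider sources are real, but the resource names and regions are illustrative placeholders, not our actual configuration:

```hcl
# A minimal multi-provider sketch; names and regions are placeholders.
terraform {
  required_providers {
    aws          = { source = "hashicorp/aws" }
    digitalocean = { source = "digitalocean/digitalocean" }
    alicloud     = { source = "aliyun/alicloud" }
  }
}

provider "aws" {
  region = "ap-southeast-1"
}

provider "digitalocean" {}

provider "alicloud" {
  region = "ap-southeast-5"
}

# Resources on different clouds live side by side in one codebase,
# so a single `terraform plan` / `terraform apply` covers them all.
resource "aws_s3_bucket" "assets" {
  bucket = "example-assets-bucket"
}

resource "digitalocean_droplet" "worker" {
  name   = "example-worker"
  image  = "ubuntu-20-04-x64"
  region = "sgp1"
  size   = "s-1vcpu-1gb"
}
```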
Configuration Management
Some workloads that are not yet in Kubernetes are fully managed by Ansible, so we rarely need to touch the instances ourselves. All software and configuration, including the backup mechanism and network routing inside the instances, is written declaratively in Ansible roles and playbooks. So, again, one single `ansible-playbook` command is enough to provision as many servers as we want.
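As a rough illustration (the role names, host group, and inventory path below are hypothetical, not our real setup), a playbook like this describes a whole class of servers declaratively:

```yaml
# site.yml - a minimal, hypothetical playbook sketch
- name: Provision application servers
  hosts: app_servers
  become: true
  roles:
    - common   # base packages, users, SSH hardening
    - nginx    # web server and routing inside the instance
    - backup   # scheduled backup jobs
```

Running it against any number of machines is then just `ansible-playbook -i inventories/production site.yml`.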
Kubernetes
By leveraging a managed service from AWS, we are able to run a Kubernetes cluster that handles deployments to all environments (development, staging, production, etc.) with just a `helm upgrade` command. Scaling is done via Kubernetes add-ons like `cluster-autoscaler`, and high availability is ensured by the scheduling and health checks that come natively with Kubernetes itself. We are at 30-40 different applications per environment for now, but thanks to Kubernetes, we can add a new one in just a few minutes.
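The health checks mentioned above are the standard Kubernetes liveness and readiness probes. Here is a simplified, hypothetical Deployment, with placeholder names, image, and thresholds, showing the pieces that give us self-healing and spare capacity:

```yaml
# A simplified, hypothetical Deployment; names and values are placeholders.
# Kubernetes restarts containers that fail the liveness probe and keeps
# the requested number of replicas running across nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: example-api
          image: registry.example.com/example-api:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
```

In practice, manifests like this are templated into Helm charts, so rolling out a change boils down to something like `helm upgrade --install example-api ./charts/example-api -f values-production.yaml`.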
Continuous Integration
We make heavy use of GitLab CI/CD pipelines to have all applications automatically integrated, tested, and deployed across different environments. For production, of course, we have a stricter process to ensure business continuity. At this point we can have 20-30 concurrent jobs running at the same time for various technology stacks, like Java, Angular, Android, Node.js, etc., at minimal cost, as all those jobs are scheduled by GitLab Runner, which runs in our Kubernetes cluster as well.
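The shape of such a pipeline, in a simplified and hypothetical `.gitlab-ci.yml` (the stage names, images, and chart paths are placeholders, not one of our real pipelines), might look like this; note the manual gate that makes production deploys stricter:

```yaml
# A simplified, hypothetical .gitlab-ci.yml sketch
stages:
  - test
  - build
  - deploy

test:
  stage: test
  image: node:16
  script:
    - npm ci
    - npm test

build:
  stage: build
  image: docker:20.10
  services:
    - docker:20.10-dind
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

deploy_staging:
  stage: deploy
  image: alpine/helm:3.8.0
  script:
    - helm upgrade --install example-api ./charts/example-api
        --namespace staging
        --set image.tag="$CI_COMMIT_SHORT_SHA"
  environment: staging
  only:
    - develop

deploy_production:
  stage: deploy
  image: alpine/helm:3.8.0
  script:
    - helm upgrade --install example-api ./charts/example-api
        --namespace production
        --set image.tag="$CI_COMMIT_SHORT_SHA"
  environment: production
  when: manual   # production requires an explicit, human-approved step
  only:
    - main
```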
Growth Comes at a Cost
Of course there will always be risks, especially when a lot of traffic arrives without any warning. So a strong foundation of an observability stack is very important. That includes, but is not limited to:
- Logging: logs must be easy to reach, easy to parse, and consistent across different applications
- Metrics: both historical and real-time; we need a way to look at everything related to the infrastructure: CPU, memory, processes, network, bandwidth, etc. (see the small sketch after this list)
- Tracing: how applications interact with each other and with each client; how a request enters the system, how data flows through it, and where the room for improvement is
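To illustrate the "consistent across applications" idea, assuming a Prometheus-style metrics setup (an assumption for illustration, not necessarily our exact stack), every workload can expose metrics the same way via a common set of pod annotations:

```yaml
# Hypothetical pod-template annotations for a Prometheus-style scraper;
# the port and path are placeholders. Any application following this
# convention is picked up automatically, with no per-app wiring.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"
```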
Having grown from just a few EC2 instances, we now face a lot of other problems, and each of them comes with various solutions, each with its own pros and cons. An important lesson learned here is to accept the risk, spend the effort to try things out, and always have a backup plan. With a few customers, you may be on a lucky streak where nobody complains about a few hours of downtime; try to imagine serving a million.
Where Do We Go from Here?
There are always plenty of opportunities to explore, and our ultimate goal is always to satisfy all our users: not only the students and teachers, but also other team members, stakeholders, the board of management, investors, etc. We are still at the very beginning of the journey, and we expect to see the next million users around the corner soon. Staying prepared and weighing risk against opportunity is the battlefield we fight on. Let's see what we can achieve next year!