Migrating your infrastructure to the cloud is usually a no-brainer for companies looking to make long-term savings, improve performance and increase scalability. However, moving to the cloud can be a daunting prospect, and companies can run into huge problems when they begin a migration project ill-prepared. In this article, I’ll go over eight things you should consider when planning your strategy to migrate to the cloud.
Getting discovery right
Probably the most important element of discovery is creating a comprehensive inventory of all the servers, applications and dependencies that together make up your environments. This step is sometimes overlooked when planning a migration, but it’s crucial to get it right: discovery should be done methodically and thoroughly, capturing as much information about your existing environment as possible.
Discovery will show you important things like the scale of your environment, operating systems that might not be supported post-migration, software that may not be compatible, license requirements that may change, systems that can be decommissioned and potentially problematic systems that will require special attention. Problematic systems, you ask? How about the twenty-year-old PC under someone’s desk that no one dares reboot because it might be running the payroll system? You’re going to need to come up with a way to migrate that.
Discovery can be daunting and extremely time and resource consuming, so the temptation to rush it is understandable. The problem is that a rushed discovery ultimately leads to a delayed project, more engineer days and more expense. The phrase “if only we’d known that at the beginning of the project” is pretty disheartening when you’re in the middle of a multi-phase data centre migration.
Fortunately, discovery can be made much simpler. Tools such as CloudPhysics or Cloudamize give you a clearer insight into what your environment looks like, enabling you to make data-driven decisions about your infrastructure migration. Building a detailed inventory of your systems and applications will inform the next phase of your migration plan.
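To make the idea of an inventory a little more concrete, here is a minimal sketch in Python of what one might look like and how it could flag the kinds of systems mentioned above. Everything in it is an assumption for illustration: the `System` fields, the `SUPPORTED_OS` list, the server names and the `triage` function are all made up, and a real discovery tool will capture far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """One inventory entry. The fields are illustrative, not a standard schema."""
    name: str
    os: str
    depends_on: list[str] = field(default_factory=list)
    in_use: bool = True


# Hypothetical list of operating systems the target platform will run.
SUPPORTED_OS = {"Windows Server 2016", "Ubuntu 20.04"}


def triage(inventory: list[System]) -> dict[str, list[str]]:
    """Sort systems into the buckets that discovery is meant to surface."""
    known = {s.name for s in inventory}
    report = {"decommission": [], "unsupported_os": [], "unknown_dependency": []}
    for s in inventory:
        if not s.in_use:
            # Unused systems can be retired instead of migrated.
            report["decommission"].append(s.name)
            continue
        if s.os not in SUPPORTED_OS:
            report["unsupported_os"].append(s.name)
        if any(dep not in known for dep in s.depends_on):
            # A dependency that is missing from the inventory needs investigating.
            report["unknown_dependency"].append(s.name)
    return report


# Made-up inventory, including the PC under someone's desk.
inventory = [
    System("web-01", "Ubuntu 20.04", depends_on=["db-01"]),
    System("db-01", "Windows Server 2016"),
    System("payroll-pc", "Windows XP", depends_on=["mystery-share"]),
    System("old-test-box", "Ubuntu 20.04", in_use=False),
]
report = triage(inventory)
print(report)
```

Even a simple structure like this makes it obvious which systems need special attention before a single server is moved.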
Choosing a migration approach
Typically there are two main approaches when migrating systems to the cloud. These are usually referred to as “Big Bang” and “Phased”. There are pros and cons to both so it’s important to assess which approach will work best for you.
As the name suggests, the big bang approach is where you migrate all your systems and services at the same time, effectively switching over to the new environment in one go. This approach, as daunting as it sounds, can be quite effective. If your environment is reasonably small - say a handful of VMs and a simple network - then big bang might be the best approach. Splitting up such a migration, for example by moving your web servers to the cloud while your database server remains on-premise to be migrated separately, will probably leave you maintaining a VPN and other complex but temporary network solutions. For a system that is simply a handful of web servers and a database, big bang is probably the best approach.
The main issues with this approach begin to occur when the systems and services to be migrated are numerous, the networking is complex, there are multiple databases to consider etc. In these scenarios a big bang approach would likely be unmanageable, require a huge amount of engineering resources and cause a lot of problems.
For large-scale or complex environments, a phased approach is generally the recommended migration strategy.
This is where detailed discovery becomes essential; you will first need to divide your systems into logical units or waves. There are many different ways to divide your systems and consideration should be taken as to the most logical way to divide yours.
For example, let’s say you have an application named “App1” which features four web servers behind a load balancer, an API server and a database. You also have another application called “App2” which features an API server and another database but also requires a connection to the database for “App1”.
In this scenario, it makes sense to treat “App1” as the first wave and “App2” as the second wave. By creating alternate temporary DNS entries for the applications, you can migrate the infrastructure for “App1” and thoroughly test it with some old data before switching it over to production. The second wave then allows you to migrate the infrastructure for “App2” under another alternate DNS entry and establish the connection to the “App1” database. Once all the infrastructure is migrated, the existing applications are still happily running on-premise while you run tests on their cloud-based counterparts. Once your testing is complete, the “go live” is simply a DNS change and database restore. Et voilà, you have migrated your applications to the cloud and put them into production.
While proper planning is best practice and its benefits are fairly obvious, there is still a chance you might experience a “Help, everything has gone horribly wrong!” moment during a migration. Technology is complex, people are not perfect and sometimes stuff just goes wrong. A phased approach reduces the blast radius of a failed migration: only dealing with a handful of systems at a time means fewer things to undo.
When planning a migration, it’s important to ask the question “How do I get this back online if the migration fails?” The answer usually depends on the state your systems are in prior to migration. Some simple things that can aid a quick recovery from a migration failure:
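If you have more than a couple of applications, working out the waves by hand gets fiddly. As a rough sketch, the ordering can be derived from the dependency information gathered during discovery. The example below uses Python’s standard `graphlib` module (Python 3.9+); the `dependencies` mapping and the `plan_waves` function are assumptions of mine for illustration, using the “App1” and “App2” example.

```python
from graphlib import TopologicalSorter

# Each application maps to the set of applications it depends on.
# App1 and App2 come from the example; the data structure is an assumption.
dependencies = {
    "App1": set(),
    "App2": {"App1"},  # App2 needs a connection to the App1 database
}


def plan_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group applications into waves so that dependencies always migrate first."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        # Everything whose dependencies have already been migrated.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = plan_waves(dependencies)
for number, apps in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(apps)}")
```

Applications that land in the same wave have no dependencies on each other, so in principle they could be migrated together or split further to keep each wave small.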
- Take a backup - this is crucial before any kind of migration
- Verify you can restore data from the backup - sometimes backups are corrupt, so make sure yours isn’t
- Schedule some downtime with your users where possible
- Carry out the migration during a quiet period
- If the servers are clustered, fail over to one server and migrate the other
- Snapshot the source server (if your hypervisor has this functionality)
- Try to script as many of the migration steps as possible with something like Bash, Python or PowerShell rather than using a GUI - scripts will do exactly what you tell them to do, allow you to test them out beforehand and have your colleagues peer review them. Conversely, it might be hard to remember what you clicked on at 03:00, especially if it’s gone wrong!
- Check that the TTL on your DNS is set nice and low (I recommend 5 minutes) about 48 hours in advance of the migration - this should speed up the DNS switchover
- Make sure that the necessary stakeholders have been informed
- Make sure monitoring systems have been updated
- Make sure the poor guy who’s on-call knows you’re migrating servers tonight!
- Check that backup again, take another one just to be sure
- Define some success criteria - a set of tests that can be performed post-migration to determine if the migration was successful
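Success criteria work best when they can be run as a script, for the same reason as the migration steps themselves. Here is a minimal sketch of what that could look like in Python. The helper names (`port_is_open`, `run_checks`, `migration_succeeded`) and the check names are hypothetical, and the checks in the demo are stand-ins that return fixed values so the sketch runs anywhere.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run every check, treating an exception as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def migration_succeeded(results: dict[str, bool]) -> bool:
    """The migration only passes if there is at least one check and all pass."""
    return bool(results) and all(results.values())


# Stand-in checks with fixed results. Real ones might call port_is_open()
# against the migrated server or query the application itself.
checks = {
    "web tier answers": lambda: True,
    "database reachable": lambda: True,
    "row counts match backup": lambda: False,
}
results = run_checks(checks)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
print("Migration OK" if migration_succeeded(results) else "Roll back!")
```

Agreeing on the list of checks before the migration window means nobody has to decide at 03:00 whether “mostly working” counts as a success.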
Another element that should inform your migration strategy is a risk assessment. There are many ways of conducting a risk assessment but ultimately you need to decide what you determine to be a risk and what your appetite for that risk is.
Some useful factors for defining risk can be things like cost to business, recovery time, reputational impact, legal and compliance implications etc. Assuming you are taking a phased approach, a worthwhile exercise during planning would be to create a calculator which assigns a score to each of your core risk factors. This can be done in a simple spreadsheet, no need to write a fancy application to do this.
Working to a scenario of an imaginary two-hour outage, calculate the risk score for each of your applications. In a rudimentary example where “App3” scores the highest, it should be treated as the highest risk.
To get the best picture of the risk landscape, have the risk assessment completed by several people who have exposure to the applications and take an average of their scores. This should help to counteract the bias of having a single person assign the risk assessment scores.
Once you’ve assessed each of your applications and services, you can use the results to make data-driven decisions about the priority order of your migrations, i.e. which of them goes first? Which of your applications carries the highest risk score? Which of them are low risk and potentially quick wins?
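A spreadsheet is plenty for this, but to show the arithmetic, here is a small sketch of an averaged risk score in Python. All of the numbers, the assessor names and the 1 to 5 scale are invented for illustration, and summing the per-factor averages is just one possible scoring scheme.

```python
from statistics import mean

# Risk factors, in the order the scores are listed below.
FACTORS = ["cost_to_business", "recovery_time", "reputational_impact", "legal_compliance"]

# assessments[app][assessor] = one score per factor, 1 (low) to 5 (high).
# Every number here is made up for the example.
assessments = {
    "App1": {"alice": [2, 1, 2, 1], "bob": [3, 2, 2, 1]},
    "App2": {"alice": [3, 3, 2, 2], "bob": [2, 3, 3, 2]},
    "App3": {"alice": [5, 4, 4, 3], "bob": [4, 4, 5, 3]},
}


def risk_score(scores_by_assessor: dict[str, list[int]]) -> float:
    """Average each factor across assessors, then sum the averages."""
    per_factor = zip(*scores_by_assessor.values())
    return sum(mean(factor_scores) for factor_scores in per_factor)


scores = {app: risk_score(by_assessor) for app, by_assessor in assessments.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for app in ranked:
    print(f"{app}: {scores[app]}")
```

With these made-up numbers “App3” comes out on top, so it would be treated as the highest risk application.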
What not to migrate
We’ve all heard that “the future is cloud” and this can often drive a company’s decision to migrate their systems. There are, however, times when migrating a particular system might not be practical. The Google Cloud whitepaper “CIO’s Guide to Application Migration” asks three questions to determine whether your application is suitable to be migrated:
- Are the components of my application stack virtualized or virtualizable?
- Can my application stack run in a cloud environment while still supporting any and all licensing, security, privacy and compliance requirements?
- Can all application dependencies (e.g. 3rd party languages, frameworks, libraries etc.) be supported in the cloud?
Generally speaking, if the answer to any of those questions is “no”, then a “lift and shift” migration is not appropriate for this workload as it is. You should consider leaving it on-premise until the later stages of your migration, giving you more time to evaluate a “move and improve” solution.
Lift and Shift vs Move and Improve
You’ve probably heard the term “Lift and Shift” when it comes to migrations: the idea of pretty much redeploying everything as-is in a new environment. This mentality often forms the basis of a large scale enterprise migration; “Server-A” will still be called “Server-A”, it will still be running Windows Server 2016 and it will still talk to “Master-DB” on the same ports as it did when it was in your data centre. The networking might look a little different but for the most part, the virtual machines are the same, doing the same job they always do. Same workload, different bit of tin.
You may not have heard the term “Move and Improve”. This mentality usually forms the basis of “infrastructure modernisation” which is the process of taking an existing or “legacy” workload and reworking it to make use of modern tools and techniques or “cloud native” offerings. This can make for a streamlined and efficient application which is often far more scalable and cost effective.
The principle of “Move and Improve” starts with assessing which of your applications or workloads could be replaced by a better technology, even a “cloud native” technology.
Let’s say, for example, you have a 100TB dataset that you’re running analytics against. Your existing system uses traditional SQL servers and the analysis model takes several days to process. This system would likely benefit from being transformed into a BigQuery workload in GCP. After some query optimisation, the analytics model could take minutes or seconds to process rather than days, and the whole system could be vastly more cost effective.
Post migration tasks
Once your virtual machine has been migrated to the cloud, there’s a fair chance that things won’t work perfectly out of the box. There will be monitoring and logging agents to install, firewall configuration to set up, connectivity to other services to check etc. With this in mind, take the time to assess what your server may require to run correctly in its new environment.
For example, when migrating a server to GCP there’s the Google Cloud SDK, Stackdriver logging and monitoring agents, VSS snapshot functionality, Google Metadata Server entries for the hosts file etc. If you used a tool like CloudEndure to migrate the machine to GCP, you’re going to need to find a way to install all the components that it doesn’t install but that are required to run correctly in GCP.
Once you have identified all of these components, you should write them into a post-migration install and configuration script. Depending on your chosen migration tool you might even be able to pass this script to the instance being migrated to GCP to be run at boot time as a post-migration task.
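As a rough idea of the shape such a script could take, here is a sketch in Python that runs a list of steps in order and stops at the first failure. The step names and commands are placeholders that only print a message, so the sketch runs anywhere; the real installers and configuration commands will depend on your servers and your migration tool, and `run_steps` is a name I’ve made up.

```python
import subprocess
import sys

# Placeholder steps: each command just prints a message.
# Real steps would call the installers and configuration commands you identified.
STEPS = [
    ("install logging agent", [sys.executable, "-c", "print('logging agent installed')"]),
    ("configure firewall", [sys.executable, "-c", "print('firewall configured')"]),
    ("check connectivity", [sys.executable, "-c", "print('connectivity ok')"]),
]


def run_steps(steps):
    """Run the steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED: {name}: {result.stderr.strip()}")
            break
        print(f"ok: {name}: {result.stdout.strip()}")
        completed.append(name)
    return completed


completed = run_steps(STEPS)
```

Stopping at the first failure keeps the server in a known state, which makes it much easier to work out what still needs to be done by hand.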
One final (but by no means insignificant) thing that should be considered prior to an infrastructure migration is to adopt modern IT practices in order to maintain your resources post migration. One of the key milestones for modern IT practices is the adoption of “desired state configuration” tools. These tools serve as a simple and consistent way to define what your infrastructure should look like and ensure that it maintains your “desired state”. Below is a more detailed look at these tools.
Infrastructure as Code
Infrastructure as Code is a term that does what it says on the tin. It’s your servers, networking, firewalls, security, storage etc defined in a text file. The most well known and widely used tool for Infrastructure as Code is HashiCorp’s Terraform.
Terraform gives the user a simple and human readable language to define “resources”. These resources can then be deployed to your cloud platform, and a record of what exists is kept in a “state file”, which can be stored remotely. Because everything is defined in plain text, the entire infrastructure is self-documenting. It also allows for better enforcement of good practices by requiring changes to the infrastructure to be made in code and applied via Terraform, which in turn updates the “state file”.
If you’re not doing this already, you should definitely look to move to an Infrastructure as Code model, either as part of your migration or as a post-migration task. If you’re interested in Terraform, check out HashiCorp’s learning page for Terraform.
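To illustrate the general idea of comparing what you have defined with what is recorded as existing, here is a toy sketch in Python. To be clear, this is not Terraform’s language or how Terraform is implemented; the `plan` function, the resource names and their attributes are all invented to show the concept of a “desired state”.

```python
def plan(desired: dict[str, dict], recorded: dict[str, dict]) -> dict[str, list[str]]:
    """Compare the desired definitions with the recorded state and list the changes."""
    return {
        # Defined in code but not recorded as existing yet.
        "create": sorted(desired.keys() - recorded.keys()),
        # Present in both, but the definition has changed.
        "update": sorted(
            name for name in desired.keys() & recorded.keys()
            if desired[name] != recorded[name]
        ),
        # Recorded as existing but no longer defined in code.
        "delete": sorted(recorded.keys() - desired.keys()),
    }


# What the code says should exist (made-up resources).
desired = {
    "web-01": {"machine_type": "medium", "disk_gb": 50},
    "web-02": {"machine_type": "medium", "disk_gb": 50},
    "db-01": {"machine_type": "large", "disk_gb": 500},
}

# What the record of existing resources says is there right now.
recorded = {
    "web-01": {"machine_type": "small", "disk_gb": 50},
    "db-01": {"machine_type": "large", "disk_gb": 500},
    "legacy-ftp": {"machine_type": "small", "disk_gb": 20},
}

changes = plan(desired, recorded)
print(changes)
```

The important part is that the definitions are the source of truth: anything that differs from them shows up as a change to be reviewed before it is applied.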
Once you’ve deployed your servers you will need a way to configure them. How do you tell a web server that its function is to be a web server? You tell it with configuration management. Much the same way as Infrastructure as Code is the infrastructure defined in text files, the same can be said for configuration management.
Tools like Puppet, Ansible, Chef, PowerShell DSC etc are desired state configuration tools for on-server configuration. Taking the same file based approach, configuration management allows you to define how a virtual machine should be configured. Based on these definitions the tools will configure your machines as you define them.
To learn more about configuration management, check out this article on opensource.com.
“But where do I put all of this code?!” you ask. In a Git repository.
Git is a version control system for code. It’s also distributed so there’s no longer just one copy of “daves_super_important_script.sh”. Instead, it will be safely preserved in a repository, able to be called down whenever required like a golden spear from the heavens…
Or more practically it allows you to actually keep track of what version of the script is currently in use. When people ask you “Where’s the script for X?” or say, “I think there might be a problem with the code for Y” you can point them in the direction of a handy place where all that code is stored and kept safe. You can restrict who has access to that code and when people want to make changes to it you make sure they have to submit them for scrutiny and approval.
There are many Git providers out there and which one you should choose depends on your requirements. This post is not going to deep dive into Git because it’s a huge subject. If you’re not already using Git, then basically any provider will do, be that GitHub, GitLab, Bitbucket etc. I’d recommend checking out GitHub’s very helpful documentation.
Of course there’s a lot more to planning an infrastructure migration than what’s detailed in this article but hopefully, it will help to steer you in the right direction or perhaps even prompt you to think about something you haven't considered yet.
If you’d like to know more about migrating your services to Google Cloud Platform and would like a trained GCP Professional Architect to help plan your journey to the cloud, why not get in touch!
05 Aug 2020