The Change Window
Learn how the problem started in the airline incident and the basic flow of a Core-Facilities (CF) failover.
As was the case with most big clients, a local team of engineers dedicated to the account operated the airline’s infrastructure. In fact, that team had been doing most of the work for more than three years when this happened.
Start of the problem
On the night the problem started, the local engineers had executed a manual database failover from CF database 1 to CF database 2 (see diagram below).
They used Veritas to migrate the active database from one host to the other, which allowed them to do some routine maintenance on the first host. They had done this procedure dozens of times in the past.
This was back when planned downtime was common, which is not how systems are operated now. Veritas Cluster Server orchestrated the failover. Within one minute, it could:
- Shut down the Oracle server on database 1.
- Unmount the filesystems from the RAID array.
- Remount them on database 2.
- Start Oracle there.
- Reassign the virtual IP address to database 2.
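The ordering of these steps matters: the database must be stopped and its filesystems released on the first host before the second host takes them over. The following is a minimal Python sketch of that sequence. It is a toy model only and does not use any Veritas Cluster Server API; the `Host` class and the `fail_over` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Toy model of one database host (hypothetical, not a Veritas object)."""
    name: str
    oracle_running: bool = False
    mounted: bool = False
    has_virtual_ip: bool = False


def fail_over(source: Host, target: Host) -> list[str]:
    """Move the active database role from source to target, step by step."""
    steps = []

    # 1. Shut down the Oracle server on the currently active host.
    source.oracle_running = False
    steps.append(f"stop Oracle on {source.name}")

    # 2. Unmount the filesystems from the RAID array.
    source.mounted = False
    steps.append(f"unmount filesystems on {source.name}")

    # 3. Remount them on the standby host.
    target.mounted = True
    steps.append(f"mount filesystems on {target.name}")

    # 4. Start Oracle on the standby host.
    target.oracle_running = True
    steps.append(f"start Oracle on {target.name}")

    # 5. Reassign the virtual IP address to the new active host.
    source.has_virtual_ip = False
    target.has_virtual_ip = True
    steps.append(f"move virtual IP to {target.name}")

    return steps


db1 = Host("database 1", oracle_running=True, mounted=True, has_virtual_ip=True)
db2 = Host("database 2")

for step in fail_over(db1, db2):
    print(step)
```

In this sketch, the virtual IP moves last, so clients are only directed to database 2 once Oracle is already running there.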
The application servers couldn’t even tell that anything had changed, because they were configured to connect to the virtual IP address only.
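To illustrate why the application servers noticed nothing, here is a small Python sketch of the indirection a virtual IP provides. Everything in it is an assumption made for illustration: the address `192.0.2.10` comes from a range reserved for documentation, and the `Cluster` and `AppServer` classes are hypothetical.

```python
VIRTUAL_IP = "192.0.2.10"  # hypothetical address from a documentation-only range


class Cluster:
    """Tracks which physical host currently answers for the virtual IP."""

    def __init__(self, active_host: str) -> None:
        self.vip_owner = {VIRTUAL_IP: active_host}

    def reassign(self, new_host: str) -> None:
        # The last failover step: point the virtual IP at the new host.
        self.vip_owner[VIRTUAL_IP] = new_host

    def route(self, address: str) -> str:
        return self.vip_owner[address]


class AppServer:
    """An application server configured with the virtual IP only."""

    def __init__(self, db_address: str) -> None:
        self.db_address = db_address

    def query(self, cluster: Cluster) -> str:
        return f"answered by {cluster.route(self.db_address)}"


cluster = Cluster(active_host="database 1")
app = AppServer(db_address=VIRTUAL_IP)

before = app.query(cluster)
cluster.reassign("database 2")
after = app.query(cluster)

print(before)
print(after)
print(app.db_address == VIRTUAL_IP)
```

The application server's configuration is identical before and after the failover. Only the mapping behind the virtual IP changes.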