The Situation

Recently, a customer approached Addteq for assistance with their Bitbucket Data Center environments, specifically their Development and Production environments, both of which see frequent usage. According to the customer, the environments were slow and unstable and often needed to be restarted. Addteq's investigation found that the Bitbucket version in both environments was badly out of date and end-of-life, so our primary recommendation was to upgrade to the latest enterprise-supported version, along with several improvements to the application JVM and operating system tuning.

The Challenges

• Discovery revealed that while the Dev environment had a fairly standard Data Center configuration, the Prod environment configuration was quite unorthodox in its design, employing two data centers in different geographic locations, the first being called Production and the second being called DR. 

• Each data center had two active Bitbucket nodes for a total of four across both data centers. 

• Each data center had its own shared drive, MSSQL server, and single Elasticsearch node. The Production shared drive and MSSQL database were being mirrored to the DR shared drive and MSSQL server, while the DR Elasticsearch server lay dormant and unused. 

• The DR Bitbucket nodes, however, were not utilizing the DR shared drive and MSSQL database, and instead were connected to Production's. In the event of a disaster affecting the Production data center, the instance would be stopped, and the DR nodes would be remapped to the DR shared drive, MSSQL database, and Elasticsearch. 

See the diagram below for the complete layout.

Data Center Diagram.png
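
As a rough illustration of the layout described above, the node-to-backend mapping can be sketched in a few lines of Python. All host and node names here are hypothetical placeholders, not the customer's actual names, and the sketch assumes that the DR nodes also used Production's Elasticsearch during normal operation, which the description implies but does not state outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backend:
    shared_drive: str
    database: str
    search: str


# Hypothetical names, for illustration only.
PROD_BACKEND = Backend("prod-shared-drive", "prod-mssql", "prod-elasticsearch")
DR_BACKEND = Backend("dr-shared-drive", "dr-mssql", "dr-elasticsearch")

# Two active Bitbucket nodes per data center, four in total.
NODES = {
    "prod-node-1": "production",
    "prod-node-2": "production",
    "dr-node-1": "dr",
    "dr-node-2": "dr",
}


def backend_for(node: str, disaster: bool) -> Optional[Backend]:
    """Return the backend a node uses, or None if the node is unavailable."""
    site = NODES[node]
    if not disaster:
        # Normal operation: all four nodes connect to Production's services.
        return PROD_BACKEND
    # Disaster scenario: only the DR nodes run, remapped to the DR services.
    return DR_BACKEND if site == "dr" else None
```

The point of the sketch is that the DR backend services exist and receive mirrored data at all times, but no node actually uses them until a disaster forces the remapping.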

Further complicating matters, the customer's industry imposed extremely tight and restrictive change management and separation-of-duties policies. These policies resulted in the use of legacy operating systems, and because an OS upgrade would not be approved within the upgrade window, several modifications and workarounds were required to support the upgraded application as well as the new version of Git that Bitbucket required. Additionally, the fragmented separation of duties made scheduling and coordinating the upgrades difficult.

Finally, the customer's management required minimal downtime for the production upgrade, as many critical business functions and automated processes depended on Bitbucket being available so that they could check out code.

The Plan

Addteq developed a plan to upgrade both the Dev and Production environments to the latest available Bitbucket version, upgrade Git and Elasticsearch, and automate as much of the process as possible using Ansible.

For Dev, the plan was a standard Bitbucket Data Center upgrade process. However, due to the complexity of the Production environment and the customer management's minimal downtime requirement, we had to do something different for Production. 

The plan for Production consisted of separating the DR environment from Production so that it was running completely independently of the Production data center in the Disaster Recovery Scenario shown in the diagram above. 

• Once the DR environment was validated to be working, write permissions were revoked in the DR environment to approximate a read-only configuration and the Global Load Balancer was configured to only direct traffic to the DR environment. 

• The upgrade process then commenced on the Production-side environment while the DR continued to serve read-only access for the business-critical automated processes that needed it. 

• Once the upgraded Production-side was validated to be working, the Global Load Balancer was configured to direct traffic solely to the Production nodes, and the upgrade on the DR-side was performed. 

When all upgrades were completed, the pre-upgrade configuration of the environment was restored. 
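
The traffic-switching sequence above can be summarized as a small Python sketch. This is an illustration under stated assumptions, not the Ansible automation Addteq actually used: the `Phase` structure and the site labels are hypothetical, and the sketch assumes that the upgraded Production side served normal read-write traffic while DR was being upgraded and that the restored configuration routed traffic to both sites, neither of which the write-up specifies.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Phase:
    name: str
    traffic_to: Tuple[str, ...]  # sites the global load balancer sends traffic to
    read_only: bool              # whether write permissions are revoked
    upgrading: Optional[str]     # site being upgraded, if any


PLAN: List[Phase] = [
    # DR, already separated and validated, serves read-only traffic.
    Phase("upgrade Production", ("dr",), True, "production"),
    # Assumption: upgraded Production serves read-write traffic here.
    Phase("upgrade DR", ("production",), False, "dr"),
    # Assumption: the restored configuration routes to both sites.
    Phase("restore pre-upgrade configuration", ("production", "dr"), False, None),
]


def unsafe_phases(plan: List[Phase]) -> List[str]:
    """Return names of phases that route traffic to a site under upgrade."""
    return [p.name for p in plan if p.upgrading in p.traffic_to]
```

Calling `unsafe_phases(PLAN)` returns an empty list, which captures the idea behind the plan: at no point does the load balancer send users or automated processes to the side that is being upgraded.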

The Outcome

Here are just a few of the great outcomes of the upgrade:

• Both Production and Dev environments were upgraded to the latest version of Bitbucket.

• The upgrade of the application nodes was automated while Addteq made the modifications and workarounds on the legacy OS.

• Production's complete downtime (no application access) totaled under half an hour across the entire process, because the read-only DR data center location served traffic while the main Production data center location was upgraded.

• Critical business functions were not impacted during the upgrade as the application was available in a read-only mode.

• Application instability and performance issues were remediated; between the application upgrade, the updates to the installed plugins, and the tuning of the operating system and application JVM, the application is now very performant and stable. 

• The customer is very, very happy.

Addteq provided amazing value to the customer in their time of need, getting the badly-needed upgrades installed despite the complex, non-standard configuration of their environment, and the difficult conditions of the customer's change management process and restrictive separation of duties, all with hardly any downtime or impact to the business.