Several years ago I set up a client's web app for his training courses. It was a Django app and I was quite proud of it. In addition to the online training course he also wanted a mobile version, so an API was created and the mobile app pulls from that API.
Everything has worked fine, with the occasional update required for Heroku to keep the web dyno running and to support new versions of iOS. Since the last app update, which also included changing over to Swift, nothing except screenshots has been needed. Today everything failed.
I slept in today. Most Saturdays I am up before 8am, but today I had nothing to do. I finally got up around 10:30. I was awake before then, thanks to emails about Labor Day sales making my phone buzz across the nightstand, but I was not ready to do anything until then. Just before 11am my client emailed me: his app wasn’t updating and the web training wasn’t coming up.
I logged into Heroku figuring the dyno needed a reboot for some reason. Hit reboot and nothing. Checked the logs and the dyno was failing to reboot. Why the hell wasn’t it rebooting? Googled the error code and discovered the reboot was taking longer than 75 seconds and timing out. Nothing has changed on this app in years, so why was it slow to reboot?
After more reboot attempts I finally got a reply saying reboots were disabled, and only then checked the Heroku status page. On the screen was the last thing I wanted to see: “Widespread Platform Issues”. I should have checked there first, but I assumed it was something I was doing that kept the server from rebooting. It had a stack update, so I thought maybe the previous stack had hit its EOL and required me to update before it would restart. Installing the Heroku CLI and trying to push a build to finish the stack update took a long time. It was actually the slow build that finally made me think to check the status page.
I started trying to pull down a copy of the database so I could attempt to spin up another copy on DigitalOcean. The database export was garbled. I’m not sure if it was because of the platform issues or because the format Heroku uses differs from the standard Postgres style. I started setting up the DO server anyway, thinking it was better to get the app launched and then figure out the data than to eventually get the data and not be able to fix things immediately.
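One possible explanation for a "garbled" export, which nothing here confirms, is that the file was a `pg_dump` custom-format archive rather than plain SQL. Those archives are binary and start with the bytes `PGDMP`. The sketch below checks for that signature and picks a matching restore command. The function names are hypothetical, and it only builds the command without running it.

```python
# Custom-format archives written by `pg_dump -Fc` begin with these five bytes.
PG_CUSTOM_MAGIC = b"PGDMP"


def is_custom_format_dump(path):
    """Return True if the file looks like a pg_dump custom-format archive."""
    with open(path, "rb") as handle:
        return handle.read(len(PG_CUSTOM_MAGIC)) == PG_CUSTOM_MAGIC


def restore_command(path, database):
    """Build (but do not run) a restore command suited to the dump's format."""
    if is_custom_format_dump(path):
        # Binary archive: needs pg_restore, and looks like garbage in an editor.
        return ["pg_restore", "--no-owner", "--dbname", database, str(path)]
    # Otherwise assume a plain SQL script that psql can replay.
    return ["psql", "--dbname", database, "--file", str(path)]
```

If the check returns `False` and the file still is not readable SQL, the export itself was probably damaged.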
Shortly before 4pm, when I had just finished getting the server set up, I checked the Heroku status.
Our engineers have resolved the issue causing dynos to be unable to start. We are still monitoring for other impact, but the platform appears to have stabilized. We have re-enabled dyno restarts, including automatic cycling. Dynos that crashed or were otherwise restarted during this incident will automatically restart within the next few hours, or can be manually restarted with the following CLI command…
Back to the Heroku dashboard, restart dyno, and everything was back. Six hours after the issue started it was fixed, and during that time I was unable to solve the issue even temporarily.
Last week this same client had a similar issue with his other app. That app runs on his web server, and the issue was caused by a third party migrating him to a new server without bothering to ask any questions before the move. Last December he emailed me because he had someone doing WordPress work for him and wanted to make sure that nothing happened to the apps. I emailed the developer detailed information on what needed to be avoided to ensure the app did not go down.
They moved the site, set up a new database, and didn’t copy over the tables for the application. They then tried ramming in a new copy of CodeIgniter because the PHP version was moving from 5.x to 7.x and it was throwing errors. The worst part: I got the email about the issue on Sunday morning, emailed the developer, and didn’t hear anything until Tuesday. I finally got everything working Tuesday afternoon after getting the access I wanted.
Seeing as both applications went down within a week of each other, I need to make changes. The first change is daily database backups to both my computer and my server. I’m also building a new shared admin for the apps, which they can be switched over to on the next Apple update.
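A minimal sketch of what the daily backup job could look like, assuming a Postgres database reachable through a connection URL. The file naming scheme, the 14-day retention, and the function names are all assumptions of mine. It only builds the `pg_dump` command and works out which old dumps to delete, so the actual execution and scheduling (cron, for example) are left out.

```python
import datetime
from pathlib import Path


def backup_path(backup_dir, day):
    """Dated file name, e.g. app-2024-01-31.dump (naming scheme is an assumption)."""
    return Path(backup_dir) / f"app-{day.isoformat()}.dump"


def pg_dump_command(database_url, target):
    """Build (but do not run) a pg_dump call that writes a custom-format archive."""
    return ["pg_dump", "--format=custom", "--no-owner",
            f"--file={target}", database_url]


def backups_to_prune(backup_dir, keep=14):
    """Return the oldest dumps beyond the newest `keep`; ISO dates sort by name."""
    dumps = sorted(Path(backup_dir).glob("app-*.dump"))
    return dumps[:-keep] if keep > 0 else dumps
```

Running the same job on both the laptop and the server gives two independent copies, so losing one machine does not lose the backups.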
The apps are going to check the server for config files, which will contain file URLs, last modified dates, and an MD5 checksum for each file. If the last modified date and the checksum differ from the version in the app, the app will fetch the updated files. If they have not changed, or the app fails to contact the server or fetch the files, it will continue using the existing ones.
The admin, web training, and all app-related data are moving over to a DigitalOcean droplet that routinely takes snapshots and uses a floating IP. If the server has issues, a new one can be spun up from the latest snapshot and the floating IP can be pointed to it. No DNS worries and no six-hour or multi-day downtimes.
With these changes, should DigitalOcean completely fail, I can set up a new server somewhere else with the latest database backup, clone the repository, and update the config files to point the apps to the new server in minutes.
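The update check can be sketched roughly as follows. The real apps are written in Swift, so this Python version only illustrates the logic. The manifest layout (a dict of file name to `url`, `last_modified`, and `md5`) and the function names are assumptions. A failed fetch or a checksum mismatch leaves the existing file in place.

```python
import hashlib


def md5_of(data):
    """Hex MD5 digest of a bytes payload."""
    return hashlib.md5(data).hexdigest()


def files_to_update(local, remote):
    """Names whose date and checksum both differ, plus files the app lacks.

    `remote` is None when the server could not be reached.
    """
    if remote is None:
        return []  # keep using the existing files
    stale = []
    for name, entry in remote.items():
        current = local.get(name)
        if current is None or (
            current["last_modified"] != entry["last_modified"]
            and current["md5"] != entry["md5"]
        ):
            stale.append(name)
    return stale


def apply_update(local, name, entry, fetch):
    """Fetch one file; record it only if the download matches its checksum."""
    try:
        data = fetch(entry["url"])
    except OSError:
        return False  # network failure: existing file stays in use
    if md5_of(data) != entry["md5"]:
        return False  # corrupt or partial download
    local[name] = dict(entry)
    return True
```

Here `fetch` is any callable that takes a URL and returns bytes, which keeps the comparison logic separate from the networking code.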
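Pointing the floating IP at a replacement droplet could also be done from a script. This sketch assumes DigitalOcean's v2 API and its floating IP "assign" action. The IP address, droplet ID, and token are placeholders, and the helper name is made up. It only builds the request, so nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com/v2"


def build_assign_request(floating_ip, droplet_id, token):
    """Build a POST request that asks the API to assign the IP to a droplet."""
    body = json.dumps({"type": "assign", "droplet_id": droplet_id}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/floating_ips/{floating_ip}/actions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values; sending would be urllib.request.urlopen(request).
request = build_assign_request("203.0.113.10", 123456, "example-token")
```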