Apex Outage / Fix
The problems that occurred and the steps taken to rectify them
June 07, 2023 13:09
We recently faced a two-fold problem with our application: a memory issue on Heroku, and a rough migration to a new hosting provider. While we were already planning to move for a number of reasons, we figured this was a good opportunity to switch providers and solve the memory issue at the same time.
The first issue we encountered involved Heroku's memory limits. Our app had been running fine for some time, but after pushing some major changes to the code base, it started running at 95-96% of its allowed memory. This, coupled with heavy throughput and request volume, caused us to exceed Heroku's memory quota. We suspected a memory leak, a bad query, or inefficient asset loading (such as images and videos), and used tools like derailed (the derailed_benchmarks gem), New Relic, and Sentry to identify the culprit. We also checked whether our gems were up to date and whether any had known issues. Using derailed to check how much memory each gem consumes at require time:
bundle exec derailed bundle:mem

TOP: 85.5508 MiB
  rails/all: 32.9922 MiB
    action_mailbox/engine: 11.8125 MiB (Also required by: ohmysmtp-rails)
      action_mailbox: 11.8125 MiB
        action_mailbox/mail_ext: 11.8125 MiB
          action_mailbox/mail_ext/address_equality.rb: 11.1445 MiB
Everything looked fine, except for one gem weighing in at over 21 MB: the one our stack uses to handle images with vips.
ruby-vips: 21.5234 MiB
  vips: 21.5234 MiB
    vips/operation: 0.3047 MiB
We opted not to get rid of it, as we did not yet know how to manage image processing any better.
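For anyone retracing these steps, derailed is not part of a stock Rails app; it comes from the derailed_benchmarks gem. A minimal sketch of how it can be pulled in (the group placement here is our choice, not a requirement of the gem):

# Gemfile
group :development do
  gem "derailed_benchmarks"
end

After a bundle install, the bundle:mem task above (and the perf tasks below) are available through bundle exec derailed.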
We then tested the code for memory leaks using
bundle exec derailed exec perf:mem_over_time
and hit the URLs that received the most traffic in production. We also used Sentry and New Relic to check which endpoints were being hit, and tried to reproduce the memory leak locally, without success. We inspected production metrics in New Relic for any sign of the memory issue, but everything seemed to be running fine, apart from a couple of spikes in response times.
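For reference, derailed's perf tasks can be pointed at a specific route through an environment variable; if memory serves, PATH_TO_HIT selects the endpoint to exercise. A sketch with a placeholder route rather than one of our real ones:

PATH_TO_HIT=/some/busy/route bundle exec derailed exec perf:mem_over_time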
Although the average response time was not great (~300ms), it was acceptable for a small app, and the user experience was not significantly impacted.
We decided to migrate to Render, which offered better services and cost savings. However, we ran into major problems because of how we went about the migration: we did not do a trial run to work out the kinks that come with learning a new procedure, we changed DNS settings without fully understanding how SSL certificate issuance works, and we lacked a solid understanding of CORS. Combined, these caused a disaster.
During the move itself, we set up all our processes and configured the necessary .rb files as described in Render's guide. We also had to adjust a couple of other settings to clear some errors that were cropping up and to account for how Render handles domain rerouting. Because of SSL certificate issuance rate limits, we unwittingly stalled the process for 48 hours. Once that window passed, the certificate was issued and traffic was able to reach the site again on 03/20/23. The next couple of hours were spent fixing CORS errors, which came down to setting a few environment variables.
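As one concrete example of the kind of setting involved (a sketch, not necessarily our exact change), Rails' host authorization usually has to be told about a new provider's domain. This assumes Render's default .onrender.com subdomain and the RENDER environment variable that, as we understand it, Render sets on its services:

# config/environments/production.rb (inside the Rails.application.configure block)
# Allow traffic arriving via Render's default subdomain in addition to our own domain
config.hosts << ".onrender.com" if ENV["RENDER"].present?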
We used built-in Rails configuration methods to set the headers and pointed them at environment variables (ENV), which let us manage the actual values through the Render dashboard. Here's an example of how we used this:
config.public_file_server.headers = { 'Access-Control-Allow-Origin' => ENV['ENV_VARIABLE_HERE'] }
config.asset_host = ENV['VARIABLE_HERE']
This snippet ensures that the Access-Control-Allow-Origin header is set on static files and that assets are served from the correct host, keeping asset delivery efficient and secure for users.
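Pulling it together, this is roughly what that portion of our config/environments/production.rb ends up looking like (a sketch: the variable names and the fallback origin below are placeholders, not our real values):

# config/environments/production.rb
Rails.application.configure do
  # Serve precompiled assets from the app itself, with a CORS header so the
  # origin configured in the Render dashboard is allowed to load them
  config.public_file_server.enabled = ENV["RAILS_SERVE_STATIC_FILES"].present?
  config.public_file_server.headers = {
    "Access-Control-Allow-Origin" => ENV.fetch("ALLOWED_ORIGIN", "https://www.example.com")
  }

  # Point generated asset URLs at the host set in the Render dashboard
  config.asset_host = ENV["ASSET_HOST"] if ENV["ASSET_HOST"].present?
end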
Once we had all of this in place, the site was back up and running, response times were low, and all assets were being served without any problems. It was a headache, but we learned a lot along the way about asset-serving configuration, metrics tooling (New Relic, derailed), and MRSK.
I wrote this article to help visitors to our site understand what happened during our migration, and to give other developers facing similar issues a starting point; the methods above may help them identify and resolve their own problems.
If you have any questions, DM us on Twitter or Instagram, or send us a message through Support.