Apex Outage / Fix
The problems that occurred and the steps taken to rectify them
We recently had two things to deal with in our application at the same time:
  • it was maxing out memory on Heroku
  • we were planning to migrate to Render
Since we were already planning to move for a number of reasons, we figured this was a good opportunity to switch providers while solving the memory issue.

The first issue we encountered involved Heroku's memory limits. Our app had been running fine for some time, but after we pushed some major changes to the code base, it started running at 95-96% of its allowed memory. Combined with heavy request throughput, this pushed us over Heroku's memory quota. We suspected a memory leak, a bad query, or inefficient asset loading (such as images or videos). We used tools like Derailed, New Relic, and Sentry to identify the culprit. We also checked whether our gems were up to date and whether any of them had known issues. We used Derailed to check how much memory each gem takes when it is required:

 bundle exec derailed bundle:mem

TOP: 85.5508 MiB
  rails/all: 32.9922 MiB
    action_mailbox/engine: 11.8125 MiB (Also required by: ohmysmtp-rails)
      action_mailbox: 11.8125 MiB
        action_mailbox/mail_ext: 11.8125 MiB
          action_mailbox/mail_ext/address_equality.rb: 11.1445 MiB
Everything looked fine except for one gem that took 21 MB on its own. Our stack uses it to handle images with vips:
  ruby-vips: 21.5234 MiB
    vips: 21.5234 MiB
      vips/operation: 0.3047 MiB
We opted not to get rid of it, as we did not know how to manage its memory use any better.
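When the report gets long, it can help to filter it down to the largest entries. The following is a minimal sketch, not part of Derailed itself: heavy_requires is a hypothetical helper, and it assumes each line of the report looks like "name: 12.3456 MiB" as in the output above.

```ruby
# Hypothetical helper: list the entries of a `derailed bundle:mem` report
# that are at or above a size threshold.
# Assumes each line looks like "name: 12.3456 MiB", optionally followed by
# an "(Also required by: ...)" note.
def heavy_requires(report, min_mib: 10.0)
  pattern = /\A\s*(?<name>.+?): (?<mib>\d+(?:\.\d+)?) MiB/

  report.each_line.filter_map do |line|
    match = pattern.match(line)
    next unless match                 # ignore lines that are not entries
    next if match[:name] == 'TOP'     # skip the overall total

    mib = match[:mib].to_f
    [match[:name], mib] if mib >= min_mib
  end
end

# Sample input copied from the report above.
report = <<~REPORT
  TOP: 85.5508 MiB
    rails/all: 32.9922 MiB
      action_mailbox/engine: 11.8125 MiB (Also required by: ohmysmtp-rails)
    ruby-vips: 21.5234 MiB
      vips: 21.5234 MiB
        vips/operation: 0.3047 MiB
REPORT

heavy_requires(report, min_mib: 20.0).each do |name, mib|
  puts format('%-12s %8.4f MiB', name, mib)
end
```

Nested entries are counted inside their parents, so ruby-vips and vips showing up together here is the same 21 MB listed twice, not 42 MB.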

We then tested the code for memory leaks against the URLs that receive the most traffic in production, using
bundle exec derailed exec perf:mem_over_time
We also used Sentry and New Relic to see which endpoints were being hit and tried to reproduce the memory leak locally, but without success. We inspected production metrics in New Relic for any hints of the memory issue, but everything seemed to be running fine apart from a couple of spikes in response times.
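A leak check like this comes down to one question: does memory level off after warm-up, or does it keep climbing with every request? One rough way to put a number on that is a least-squares slope over the readings. This is a sketch under assumptions: memory_slope is a hypothetical helper, the readings are made-up values in MiB rather than measurements from our app, and you would need to collect the samples into an array yourself.

```ruby
# Hypothetical helper: least-squares slope of memory readings,
# in MiB per sample. A slope near zero suggests memory levels off;
# a steadily positive slope over many requests hints at a leak
# (it is a hint, not proof).
def memory_slope(samples_mib)
  n = samples_mib.size
  raise ArgumentError, 'need at least two samples' if n < 2

  xs = (0...n).to_a
  mean_x = xs.sum.to_f / n
  mean_y = samples_mib.sum.to_f / n

  covariance = xs.zip(samples_mib).sum { |x, y| (x - mean_x) * (y - mean_y) }
  variance = xs.sum { |x| (x - mean_x)**2 }
  covariance / variance
end

# Made-up readings for illustration only.
steady  = [210.0, 214.0, 213.0, 215.0, 214.0, 214.0]
growing = [210.0, 216.0, 222.0, 228.0, 234.0, 240.0]

puts format('steady:  %.2f MiB per sample', memory_slope(steady))
puts format('growing: %.2f MiB per sample', memory_slope(growing))
```

A short run will always show some growth while the app warms up, so the slope only means something over a long enough series of requests.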
[Figure: 7-day response times]

Although the average response time was not great (~300ms), it was acceptable for a small app, and the user experience was not significantly impacted.
[Figure: exercises, 7 days]

We decided to migrate to Render, which provided better services and cost savings. However, we ran into major problems because of the way we carried out the migration. We did not do a test run to work out the kinks that come with learning a new procedure, we changed DNS settings without fully understanding how SSL certificates are issued, and we lacked a solid understanding of CORS. Combined, these caused a disaster.

During the move, we set up all our processes correctly and configured the necessary .rb files as described in Render's guide. We also had to adjust a couple of other settings to clear some errors related to how Render handles domain rerouting. By running into rate limits on SSL certificate issuance, we unwittingly stalled the process for 48 hours. After that time, the certificate was in place, and traffic was able to reach the site again on 03/20/23. The next couple of hours were spent fixing CORS errors, which came down to setting up some environment variables.

We used built-in Rails configuration options to set the headers and read their values from environment variables (ENV). That let us configure the appropriate values through the Render dashboard. Here's an example of how we used them:

config.public_file_server.headers = { 'Access-Control-Allow-Origin' => ENV['ENV_VARIABLE_HERE'] }

config.asset_host = ENV['VARIABLE_HERE']

This snippet sets the CORS header on the static files Rails serves and points asset URLs at the configured host, so browsers are allowed to load assets from the correct location.
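One thing to watch with this approach is that ENV['...'] quietly returns nil when a variable is missing, so a typo in the dashboard only shows up later as a CORS error in the browser. A possible safeguard, sketched here under assumptions, is to fail at boot instead. static_file_headers is a hypothetical helper and CORS_ALLOWED_ORIGIN is a made-up variable name, not the one we actually use.

```ruby
# Hypothetical helper: build the static-file headers from an environment
# hash and fail loudly if the origin is missing or blank.
# CORS_ALLOWED_ORIGIN is a placeholder variable name.
def static_file_headers(env = ENV)
  origin = env.fetch('CORS_ALLOWED_ORIGIN') # raises KeyError if unset
  raise ArgumentError, 'CORS_ALLOWED_ORIGIN is empty' if origin.strip.empty?

  { 'Access-Control-Allow-Origin' => origin }
end

# Demonstration with a plain hash standing in for ENV.
example_env = { 'CORS_ALLOWED_ORIGIN' => 'https://example.com' }
p static_file_headers(example_env)

# In a Rails environment file this might be wired up as:
#   config.public_file_server.headers = static_file_headers
```

Passing the environment in as an argument keeps the helper easy to check without touching the real ENV.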

Once we had all of this in place, the site was back up and running, response times were low, and all assets were being served without any problems. This was a headache, but we learned a lot along the way about asset serving configuration, metrics tools (New Relic, Derailed), and MRSK.

I wrote this article to help visitors of our site understand what happened during our migration process, and to provide guidance to other developers who may face similar issues. By following the methods mentioned above, they may be able to identify and resolve their own problems.

If you have any questions, DM us on Twitter or Instagram, or send us a message using Support.