What else is running on my EC2 instance?…

Aside

At Thinkear we use a lot of Amazon Webservices. We use Elastic Compute Cloud (EC2) to host our Apache Tomcat server running Java 6. We recently had some major performance issues with our service, which lead us to analyze our EC2 hosts and figure out what was running on it.

At peak hours our servers handle ~35K requests/second and we have a 100 ms SLA to maintain with our partners. In this type of environment performance is the top priority and we were surprised by some of the things we found running on our EC2 instances. I thought I would share what we found. Some of it was surprising.  Some of it was documented after you knew what to look for – but I find the Amazon docs hard to navigate. Throughout our interactions with Amazon support, we found that not all representatives were aware of some of these points below.

For an AMI we have the Amazon Linux  X86 64-bit (ami-e8249881) running a Tomcat 7 Linux Configuration.

1. Apache runs as a proxy.

Our load balancer (AWS Elastic Load Balancing), directs port 80 traffic to port 80 on the hosts. Then Apache runs a proxy that forwards requests from port 80 to port 8080 (default Tomcat port).

Config files are located in /etc/httpd/conf.d and /etc/httpd/conf. We had to tweak settings in /etc/httpd/conf/httpd.conf based on our use case. These settings were the root cause of our issues. We had never looked into them because everything seemed to work.

We tried by-passing Apache because we didn’t need the features it brings. Unfortunately, we had issues with our servers on deployment when we by-passed apache. We haven’t found the root cause of this as of yet.

2. Logrotate Elastic Beanstalk Log Files

elasticbeanstalk.conf in /etc/httpd/conf.d/ defines the ErrorLog and AccessLog properties for Apache. These files are then rotated out by /etc/cron.hourly/logrotate-elasticbeanstalk-httpd. The problem was that we didn’t know these log files existed and we felt the settings were too aggressive for us.

These are our current settings: https://gist.github.com/KamilMroczek/7296477. We changed the size parameter to be 50 MB and to only keep 1 rotated file. Smaller files take less time to compress. We didn’t need all those extra copies.

3. Logrotate Tomcat Log Files

logrotate-elasticbeanstalk in /etc/cron.hourly defines rotating catalina.out and localhost_access_log.txt out of the Tomcat logs directory! As nice as it is for them to do that, we had no idea. It didn’t have a large impact on us, since we handled the rotating of our log files ourselves already at shorter intervals. We ended up removing this unnecessary step anyway.

Original Log rotate script: https://gist.github.com/KamilMroczek/7296539

4. Publishing logs to S3

We noticed that we had CPU spikes at 20 minute intervals on our hosts at 10, 30 and 50 minutes passed the hour. We couldn’t explain these. When we looked at our CPU usage through top we found the culprit.

/etc/cron.d/publish_logs is a python script that publishes:

  • /var/log/httpd/*.gz (#2 above)
  • /var/log/tomcat7/*.gz (#3 above)

I originally thought that we were uploading the same files a ton since the logrotate only rotated every hour and kept the last 9 copies, but the publishing happened 3 times an hour. But we found out that the code has de-duplicating logic.

We removed this cron task because we didn’t need the data uploaded. We already uploaded our tomcat logs separately and the beanstalk logs were of no use to us at the time. Nor have we ever used them to troubleshoot issues.

5. Amazon Host Configuration

The entire configuration for your environment can be found through Amazon Cloud Formation. There is a describe-stacks (or cfn-describe-stacks depending on the CLI version) call that allows you to pull the entire configuration for an environment. We are in the process of auditing ours. More complete instructions are here:

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-describing-stacks.html

As will all tough problems, after troubleshooting them you inevitably get a deeper understanding of your system and its architecture. When you own and provision your own servers you understand everything on them because you are responsible for creating the template. When you use a hosted solution such as Amazon Webservices, you can run into the problem of not knowing everything about your image. But we learned that you need to take the time to understand what you are getting.