Businesses today want to keep an eye on their carbon emissions and do their bit to tackle the climate crisis. To do that, they need to understand and reduce all of their emissions, including those from cloud computing.

You might imagine that the cloud providers, with their omniscient observability, would be able to provide accurate, real-time carbon and energy reporting to each of their customers. Unfortunately, they don't. There is basic carbon reporting, but it's inconsistent across providers and often lags behind by several weeks, if not months. That's fine for annual reports, but it makes it frustratingly hard for customers to see whether tweaks to their infrastructure produce any meaningful change. This blog explains how we worked around that.

This blog is one in a series about an internal project undertaken here at Scott Logic. The aim of the project was to investigate the carbon footprint of running code on mobile devices versus on a server. We won't go into the actual results, as that's a topic for a different blog post. Instead, we want to talk through the steps we took to solve the measurement problem for our use case.

Background

By the time we started the server work, we already had mobile apps in development for Android and iOS that could run the chosen CPU benchmarks. To compare mobile with server, we needed to build a test harness to run the same benchmarks, ideally using the same benchmark code. We also needed to work out a way of actually measuring or calculating the energy used. Despite the project name including the words 'Carbon Footprint', our actual point of comparison was energy consumption in watt-hours (Wh). The simple reason is that the carbon footprint depends largely on the source of the electricity, so by measuring energy consumption we can compare results more directly.

Approaches considered but ultimately rejected

During the research phase, we considered a few potential approaches that were ultimately rejected.

The first was to use the carbon footprint report generated by GCP. Rather than test all cloud providers from the start, we decided to go with GCP because we believed the carbon report would give us the information we needed. Our hypothesis was that we could take the carbon footprint amount, given in kilograms of CO2 equivalent, and, by making assumptions about the carbon intensity of the electricity, work backwards to the amount of energy used. It became clear quite quickly that this approach would not work. The unknowns started to pile up and we would have had to make a lot of assumptions. But the biggest issue was that the report is only generated monthly. We needed a better solution, so our search continued.

Another approach we considered was the Etsy Cloud Jewels methodology, which estimates energy use by applying wattage coefficients to billed usage such as vCPU-hours and storage. This had more legs than the GCP Carbon Footprint report option; at this point we were still considering GCP. Ultimately we dismissed it after finding an approach more suitable for our use case.

Our Approach

Up until now we had been focusing on GCP. However, since we were no longer pursuing the carbon footprint report idea, there was no reason to stick with GCP, so we expanded our search to AWS.

This led us to our actual solution, based on the work done by Teads to build a carbon footprint calculator for AWS EC2 instances. As part of their research, they created a Google Sheets spreadsheet which includes energy consumption figures for almost every EC2 instance type. Helpfully, they include consumption at idle, 10%, 50%, and 100% load for the whole instance, as well as for the CPU, GPU, and memory individually. By combining these figures with utilisation data from CloudWatch, we could estimate energy consumption.

Running Benchmarks

As previously mentioned, by the time work on the server started, the benchmark apps for iOS and Android were already being built, so implementations of the benchmarks already existed in Swift and Java. The benchmarks used for this project were:

  • Fannkuch (complexity of 12)
  • Mandelbrot (complexity of 30,000)
  • Spectral (complexity of 32,000)

To get the fairest comparison with mobile, we used the same benchmark code as the mobile apps with the same complexity values. Complexity is an integer supplied as an argument to each benchmark function that determines how much CPU and memory the run uses: the higher the complexity value, the harder the calculations become and the more resources and time they take.
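As a rough illustration of that shape (the interface and class names here are hypothetical, not taken from our codebase), each benchmark on the Java side boils down to a function of a single complexity argument:

```java
// Hypothetical sketch of the benchmark entry point shape: one integer
// "complexity" argument that scales how much work each run does.
public interface Benchmark {
    void run(int complexity);
}

class SpectralBenchmark implements Benchmark {
    @Override
    public void run(int complexity) {
        // e.g. approximate the spectral norm of a complexity x complexity matrix;
        // the real implementation is the same code used by the mobile apps
    }
}
```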

Now armed with a methodology for calculating energy and the benchmark code, we started building the test harness. We knew we needed:

  1. A test harness that could start each benchmark
  2. A way of deploying the test harness to AWS
  3. A way of simplifying the creation and destruction of infrastructure
  4. Some way of measuring utilisation and calculating energy consumption
  5. The ability to run benchmarks in different languages

Cloud Deployment

We knew we needed to be running on EC2 in order to use the Teads data. That basically left us with two options: run on a standard EC2 instance, or Dockerise and use ECS. We went with the latter. Either way would've been fine, but we found it easier to build a Docker image with all the bits (technical term) included and run it on ECS, using EC2 for compute rather than Fargate. The performance penalty was negligible.

Test Harness App

With that sorted, attention turned to what our test harness app would actually look like. Given we were starting with Java, we went with a very, very simple Spring Boot app consisting of little more than a REST controller and a couple of services. This meant we could start and configure each run from Postman with a simple GET request rather than messing around with SSH. It also gave us scope to add some sort of web UI later if we no longer wanted to use Postman.
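A minimal sketch of what that controller looked like (the endpoint path, parameters, and service wiring are illustrative rather than copied from the real harness):

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Minimal sketch of the test harness controller: a GET request kicks off a run.
@RestController
public class BenchmarkController {

    private final BenchmarkRunnerService runner; // hypothetical service that runs the benchmarks

    public BenchmarkController(BenchmarkRunnerService runner) {
        this.runner = runner;
    }

    // e.g. GET /run?benchmark=spectral&repetitions=10 from Postman
    @GetMapping("/run")
    public String run(@RequestParam String benchmark,
                      @RequestParam(defaultValue = "10") int repetitions) {
        return runner.start(benchmark, repetitions);
    }
}
```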

Simplifying infrastructure provisioning

To save money, and because our infrastructure didn't need to exist any longer than it took to run the benchmarks, we would be building it up and tearing it down a lot. To make our lives easier, and to ensure everything was consistent between builds, we used Terraform. We put together a simple deployment that included the bare minimum needed. Admittedly it wasn't very robust and probably (definitely) not up to best practice, but each deployment would only last a couple of hours at most. Using Terraform was probably the single best decision we took; as an indication of how useful it was, there were over 50 revisions of the ECS Task Definition alone.

EC2 Instances

We started with t2.large instances, which gave us 2 vCPUs and 8 GB of memory. Memory wasn't really a concern for our purposes, and our average utilisation reflected this, holding steady at around 9.5%. Having 2 vCPUs was beneficial: it highlighted the difference between the single-threaded and multi-threaded workloads. However, the T-series instances had one key disadvantage for us: they can burst. Normally this would be a good thing, but we needed to keep the run conditions as consistent as possible. Because bursting is driven by CPU credits and happens automatically, we couldn't control for it, and we were seeing inconsistencies in our run times.

Another downside to the T-series instances is that they run on one of two different Intel CPU models: the Xeon E5-2676 v3 or the Xeon E5-2686 v4. The v4 was able to complete a Spectral benchmark about 100 seconds faster than the v3. This was an issue because we couldn't specify which model we wanted each time we built up the infrastructure, and we would sometimes have to terminate the instance several times before we got the model we needed. Again, consistency was the problem.

We switched to the M4 instance family, specifically m4.large. This gave us stable run times, albeit slightly longer ones due to the lack of bursting. Although the AWS EC2 documentation lists two different CPU models for this instance type, our instances only ever used the Xeon E5-2686 v4, removing the CPU lottery problem we had been having.

M4 worked fine for Java and Swift. However, for some reason we were never fully able to explain, it didn't play nicely with our WebAssembly benchmarks: every attempt to run them ended with exit code 132 (which typically means the process was killed with SIGILL, an illegal instruction). Eventually this was solved by changing the instance type to m6i.large. Admittedly, going with M4 was probably a mistake. With hindsight, we should've just gone with the newest instance type we had Teads data for from the beginning; it would've saved a lot of time.

Running benchmarks in other languages

While the Java Spring app did a lot to simplify our workflow for the Java benchmarks, it also effectively developed us into a corner. When it came time to implement the Swift benchmarks we faced a dilemma: either rebuild our test harness as something Swift-native (including all the energy calculation logic, which we cover in the next section), or find some way of integrating Swift with the existing test runner.

Ultimately we went with the latter. We took the raw benchmark code from the iOS app and bundled it into a simple Swift app. In our Dockerfile we added a stage to copy the Swift app into the image, added a controller endpoint to our Spring app, and created a service that used the ProcessBuilder API to let the Spring app execute the Swift app.

We took pretty much the same approach for the WebAssembly benchmarks. We included a WebAssembly interpreter and the WASM binaries in the image, created another endpoint, and used the ProcessBuilder API to create a service to execute the benchmarks.
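Both of those services boiled down to the same ProcessBuilder pattern. A minimal sketch, with the binary path, arguments, and timeout all illustrative rather than our exact setup:

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Service;

// Sketch of the service that shells out to the non-Java benchmark binaries
// bundled into the Docker image (the Swift executable or the Wasm runtime).
@Service
public class ExternalBenchmarkService {

    public int runExternal(String binaryPath, String benchmarkName, int complexity)
            throws IOException, InterruptedException {
        // e.g. /opt/benchmarks/swift-benchmarks spectral 32000
        ProcessBuilder pb = new ProcessBuilder(binaryPath, benchmarkName, String.valueOf(complexity));
        pb.inheritIO(); // stream the benchmark output into the container logs
        Process process = pb.start();
        if (!process.waitFor(2, TimeUnit.HOURS)) {
            process.destroyForcibly();
            throw new IllegalStateException("Benchmark timed out: " + benchmarkName);
        }
        return process.exitValue();
    }
}
```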

Calculating Energy Consumption

The Teads spreadsheet mentioned earlier gave us the following power figures (in watts) at different load levels:

| Instance  | Component | Idle (W) | 10% (W) | 50% (W) | 100% (W) |
|-----------|-----------|----------|---------|---------|----------|
| t2.large  | CPU       | 0.97     | 2.77    | 5.71    | 7.81     |
| t2.large  | Memory    | 1.6      | 2.4     | 3.2     | 4.8      |
| t2.large  | Instance  | 4.2      | 6.8     | 10.5    | 14.2     |
| m4.large  | CPU       | 0.97     | 2.77    | 5.71    | 7.81     |
| m4.large  | Memory    | 1.6      | 2.4     | 3.2     | 4.8      |
| m4.large  | Instance  | 4.2      | 6.8     | 10.5    | 14.2     |
| m6i.large | CPU       | 1.09     | 2.98    | 7.05    | 9.55     |
| m6i.large | Memory    | 1.6      | 2.4     | 3.2     | 4.8      |
| m6i.large | Instance  | 4.6      | 7.3     | 12.1    | 16.2     |

Source: Teads Engineering

We can see that the memory figures are the same for all three instance types, and the CPU figures are identical for t2.large and m4.large, which makes sense given that both instance types use the same CPU models.
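As a rough worked example (a hypothetical run, not one of our measurements): an m6i.large held at 50% average utilisation for 10 minutes would, using the whole-instance figure of 12.1 W, consume roughly 12.1 × (10 / 60) ≈ 2.0 Wh.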

As alluded to earlier, we used the data in the table above as the basis for our energy calculations. To make the calculations more accurate, we incorporated data from AWS CloudWatch metrics, specifically CPU utilisation. We also attempted to include a custom memory metric using the CloudWatch Agent, but it was a pain to get working and inconsistent at best.
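One way to turn those figures into an estimate is to interpolate linearly between the idle/10%/50%/100% points and multiply by the run time. Below is a minimal sketch of that kind of calculation (illustrative of the approach rather than our exact formula), using the m6i.large CPU column from the table:

```java
// Sketch: estimate energy from the Teads power curve and an average CPU utilisation.
// Class and method names are illustrative.
public class EnergyEstimator {

    // m6i.large CPU power draw (W) at idle, 10%, 50% and 100% load (from the table above).
    private static final double[] LOAD_PERCENT = {0.0, 10.0, 50.0, 100.0};
    private static final double[] POWER_WATTS = {1.09, 2.98, 7.05, 9.55};

    /** Linearly interpolate power draw (W) for a given average CPU utilisation (%). */
    public static double powerAtUtilisation(double utilisation) {
        for (int i = 1; i < LOAD_PERCENT.length; i++) {
            if (utilisation <= LOAD_PERCENT[i]) {
                double fraction = (utilisation - LOAD_PERCENT[i - 1])
                        / (LOAD_PERCENT[i] - LOAD_PERCENT[i - 1]);
                return POWER_WATTS[i - 1] + fraction * (POWER_WATTS[i] - POWER_WATTS[i - 1]);
            }
        }
        return POWER_WATTS[POWER_WATTS.length - 1];
    }

    /** Energy (Wh) = interpolated power (W) x run time (hours). */
    public static double energyWattHours(double utilisation, double runTimeSeconds) {
        return powerAtUtilisation(utilisation) * (runTimeSeconds / 3600.0);
    }
}
```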

Because our test harness was a Spring app, we were able to use the AWS SDK for Java to retrieve utilisation metrics programmatically. The caveat is that because the data is updated every minute, and each datapoint is an average of the previous minute's utilisation, the benchmarks needed to run for several minutes to give accurate results. This could be achieved in two ways: increasing the complexity, or running several benchmarks sequentially. Because we needed consistent data across platforms, the complexity value was fixed for each benchmark, so we opted to run benchmarks in collections and average the results.
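For reference, retrieving the average CPU utilisation looks roughly like this with the AWS SDK for Java v2 (the instance ID, time window, and client setup are placeholders; our real service wrapped this up alongside the energy calculation):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Datapoint;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class CpuUtilisationFetcher {

    /** Average CPUUtilization (%) for an instance over the last `minutes` minutes. */
    public static double averageCpuUtilisation(String instanceId, int minutes) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            GetMetricStatisticsRequest request = GetMetricStatisticsRequest.builder()
                    .namespace("AWS/EC2")
                    .metricName("CPUUtilization")
                    .dimensions(Dimension.builder().name("InstanceId").value(instanceId).build())
                    .startTime(Instant.now().minus(minutes, ChronoUnit.MINUTES))
                    .endTime(Instant.now())
                    .period(60)                    // one-minute datapoints (needs detailed monitoring)
                    .statistics(Statistic.AVERAGE)
                    .build();

            return cloudWatch.getMetricStatistics(request).datapoints().stream()
                    .mapToDouble(Datapoint::average)
                    .average()
                    .orElse(0.0);
        }
    }
}
```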

Potential Future Steps

As is the case with most research projects, the work is seldom finished. Our server experimentation only covers CPU benchmarks, while the mobile work had started looking at JavaScript, GPU benchmarks, and video encoding. Expanding the server testing to cover these areas is the logical next step. Unfortunately we ran out of time to complete them as part of this project, but perhaps it's something we or another team can circle back to.