Research-related computation has an environmental impact just as serious as that of plastic and hazardous chemicals. After reading about the CO2 footprint of e-mails, I started thinking about the costs of data storage and processing. Did you know that data centres already account for more CO2 emissions than the US aviation sector? In a journey full of surprises, I found out that we can all optimise our IT use. And it is NOT about turning off your camera during Zoom meetings.
Can data be greener? Follow Estel to Serverland
Growing up, April made me happy because it brought new books and old legends. And it’s green-thought time… so I will tell you the story of an unfinished adventure. Read carefully, because you get to choose how it ends!
This journey started when I read that an average e-mail is equivalent to 4 g CO2e. I was flabbergasted, and determined to find more numbers. My plan was to convince the whole institute to mass-delete old e-mails, as well as old, redundant, untidy datasets. We’d save so much energy! Greener and wiser ever after.
Right? No! Of course not. I did not know how far out of my comfort zone I was stepping, how exasperating it would be, and yet how instructive. I think you also ought to know.
Up to the cloud and down the server hole
Remember our blog about carbon-neutral 2030? We’d heard about a place called Forum, where the Radboudumc servers consumed 690,000 kWh in 2019. To be honest, the name “Forum” brought my thoughts to ancient Rome. I couldn’t help but imagine a few men in white togas, discussing politics amongst rows of servers, cables and blinking lights.
Luckily, after a few more e-mails I got to the people “behind the servers”. We met online to see what we could learn from each other. The first thing I learned is that the RU has three (small) data centres: one at the Radboudumc (R), one at the sports centre (G), and… yes, you guessed it: Forum (F). For some reason, R is all ours, but we also use space in F and G.
So the annual energy expenditure in F, which equals that of 230,000 household kitchens, is not the whole picture. 70% of the Radboudumc’s (local) data lives in R. That includes the H:, Z: or whatever-letter drive from your department. However, one energy-intensive piece, the “computation cluster”, lives in G, and it is the cluster’s back-up copy that lives in F. Furthermore, some departments have their own servers. It’s not that they don’t use the shared ones: it is yet another back-up.
This led me to one of my first questions. How much CO2 does data storage cost? What’s a good compromise between safe data storage and too much duplication? I was expecting some numbers and good advice. Instead, a silence filled the virtual room.
I started to panic. This server hole was deep and ugly, and the information I found was terribly complicated. I just understood one thing: uncertainty. Error. Assumptions. More assumptions. I stopped reading, I stopped writing, and told my colleagues that my story would be boring and uninformative.
That’s the cliché intermezzo: my lovely team threw a rope into the server hole and rescued me. That’s your catharsis, they said, a powerful change of perspective! Share it with the rest. Maybe you will even find someone who has more answers?
Server maintenance and sustainability
What had lost me is that the CO2 equivalent of servers is not only about the power they use. As Mo Tiel (project manager at the RU) pointed out, manufacturing, transport and maintenance can make a big difference. Optimising the life cycle of hardware is also partly in our hands. At the RU, physical servers are used for about 10 years: the oldest and least efficient servers are repurposed for less intensive tasks, while the new ones perform the demanding tasks. You can find more information on their website.
That’s a good start! But at the Radboudumc, things are quite different. Servers are replaced every 5 years instead of every 10, and they are not repurposed. That’s because the umc’s data usage is growing so fast that it’s cheaper to modernise all servers every 5 years than to try to maximise their lifetime. The same happens at a global scale.
However, it is important to realise one thing: data centres already consume more energy than the US aviation and automotive industries. And, as you probably realise, the amount of data produced and analysed will continue to grow. Shouldn’t we pay more attention? Shouldn’t we make sure we use our resources in a meaningful way? My team were right back when I was in the server hole: in time, I found people who could give me directions. Let me share them with you.
A long run to a greener cloud?
Thanks to some good connections and Twitter magic, I (virtually) stumbled upon the work of Jason, Loïc, Mike and collaborators. They estimated the footprint of several bioinformatic processes, and explained beautifully how they did it, and why.
Computation is an important part of the environmental impact of research, much like plastic and hazardous chemicals. Therefore, it’s important to identify the points where we can (and should) make the right choices. Even if you are not familiar with bioinformatics tools, the core message applies to all of us. For me, it was an eye-opener to read about the striking impact of software and hardware choice, run duration, parallelisation, memory allocation, using a CPU or a GPU, and server location.
I chose two examples that caught my attention. A genome assembly of short reads with SSPACE costs 0.0027 kg CO2e, but the same assembly done with SGA costs 0.13 kg CO2e. That’s roughly a fifty-fold increase in carbon footprint! Now a very power-intensive one: 100 ns of a molecular dynamics simulation of the Satellite Tobacco Mosaic Virus can cost 17.8 kg CO2e or 95 kg CO2e, depending on the software used.
Knowledge is power (quite literally!). If I can optimise my experiments to use fewer plates, fewer tubes, or image more conditions in one go… you know. You might be able to rethink some of your runs. The fun part is that you can use this online tool to estimate the impact of your own computational work. Others already did. On a different note, in this graph from the article, you can see how server location greatly affects the footprint of a GWAS analysis. Enough things to consider!
But let’s go back to where my journey started: our e-mails in the Microsoft cloud. According to what I’d read, an average e-mail (0.004 kg CO2e) could have a higher footprint than a genome assembly! How can this be? Well, we do need to be cautious about how estimates are made. As we already said, hardware (servers, networks) is becoming more efficient very quickly. Furthermore, electricity sources are getting greener, and this varies between countries as well: 2018 estimates put the US at 429 g CO2e per kWh, and the UK at 283 g CO2e per kWh.
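The arithmetic behind such estimates is simple: energy used times the carbon intensity of the local grid. Here is a minimal sketch in Python, using only the 2018 intensity figures quoted above; the function name and everything else are illustrative, not from any official calculator.

```python
# Carbon footprint = energy used (kWh) x carbon intensity of the grid (g CO2e/kWh).
# The intensities below are the 2018 estimates quoted in the text.

GRID_INTENSITY = {  # g CO2e per kWh (2018 estimates)
    "US": 429,
    "UK": 283,
}

def footprint_g(energy_kwh: float, country: str) -> float:
    """Footprint in grams CO2e for a given energy use on a given grid."""
    return energy_kwh * GRID_INTENSITY[country]

# The same 1 kWh job emits about 50% more CO2e on the 2018 US grid than on the UK one:
print(footprint_g(1.0, "US"))  # 429.0
print(footprint_g(1.0, "UK"))  # 283.0
```

The same calculation explains the GWAS graph above: moving an identical run to a country with a greener grid shrinks its footprint, with no change to the analysis itself.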
On top of this, we need a reliable conversion from information transfer to electricity consumption. In a blog from 2015, I found that transmitting 1 GB of data cost 13 kWh; in a blog from 2020, however, the figure is 0.025 kWh per GB transmitted. So, when it comes to the “popular” facts about e-mails and the footprint of videocalling and streaming, we may have to take them with a pinch of salt.
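To see how much that one assumption matters, here is a back-of-envelope comparison using the two per-GB figures quoted above and the 2018 UK grid intensity. The 1 MB attachment size is an invented example, not a number from either blog.

```python
# How much does the per-GB energy assumption change a transfer's footprint?
# The two energy figures come from the 2015 and 2020 blogs cited in the text.

KWH_PER_GB_2015 = 13.0    # older estimate, kWh per GB transmitted
KWH_PER_GB_2020 = 0.025   # newer estimate, kWh per GB transmitted
INTENSITY_UK = 283        # g CO2e per kWh (2018 UK grid)

def transfer_co2e_g(gigabytes: float, kwh_per_gb: float,
                    intensity: float = INTENSITY_UK) -> float:
    """Grams CO2e to transmit `gigabytes` under a given energy assumption."""
    return gigabytes * kwh_per_gb * intensity

attachment_gb = 0.001  # a hypothetical 1 MB e-mail attachment
old = transfer_co2e_g(attachment_gb, KWH_PER_GB_2015)
new = transfer_co2e_g(attachment_gb, KWH_PER_GB_2020)
print(f"2015 assumption: {old:.3f} g CO2e; 2020 assumption: {new:.4f} g CO2e")
```

Under the 2015 figure the hypothetical attachment costs a few grams of CO2e, in the same ballpark as the “4 g per e-mail” claim; under the 2020 figure it is roughly 500 times smaller. Same message, wildly different footprint: hence the pinch of salt.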
Importantly, in big data centres (data farms), lighting and cooling are used more efficiently than in small ones, and their energy expenditure is more stable. However, with such growing demand, energy efficiency might not be enough. This is why we should do our part: collecting and analysing data meaningfully, pressing hardware producers to use sustainable sources, repurposing hardware (parts), and learning how to make costly runs greener, for example.
Information is your friend. Treat your data right.
There’s another thing we can all contribute to. Have you heard of the data lifecycle? For those of you thinking about the FAIR data criteria, here are some thought experiments (yes, now you choose the end!):
-Findable. Imagine you publish a dataset and make it available on an external server. Do you need to keep two copies at the Radboudumc? And should I be convincing you to delete outdated datasets? I know what my answer to the last one would be…
-Accessible. An external hard drive is not instantaneously accessible to everyone in the world, but it is not constantly consuming energy either. Could we store pre-processed data in “low-impact” storage, while making sure that methods, processed data, publications, and contact details are available to everyone?
-Interoperable. Recently, there was some consternation among researchers because the GraphPad licence was expiring. Licensed software is not only expensive: once you lose access to it, the format is useless. From the FAIR data point of view, would you have invested in a new licence, or in switching to more “interoperable” software?
-Re-usable. If an analysis is already done, and it’s done well, maybe there’s no need to repeat it. Plan, share, report… and we might be able to avoid some runs.
Extra tip for free: For more detailed information, you may find the website of the UK Data Service useful.
Greener and wiser ever after
This journey is coming to an end (for now), and it’s time for the moral of the story. Data, repositories, e-mails and videocalls are not only great tools: they’re indispensable. We should use them well! If we all think about how we produce, process, store and share data, we’ll push change at the institutional level. Change is meaningful growth, not growth avoidance.
Remember, words and stories are powerful because they have meaning, and data is powerful because it leads to knowledge. Maximise knowledge, minimise space. Your turn now!
The Green Lab Initiative brings together all green enthusiasts from the Radboudumc to push the transition to sustainable research. In our blog series “Green intentions of 2021”, we share tips and inspiration every month. Your small steps matter! Find us on Twitter, or send an e-mail to greenlabinitiative@radboudumc.nl.