[sepia] cloud usage audit

Sage Weil sweil at redhat.com
Fri Oct 19 14:20:52 PDT 2018

On Fri, 19 Oct 2018, David Galloway wrote:
> On 10/11/2018 06:04 PM, Sage Weil wrote:
> > Sometime soon I'm hoping we can take a careful look at what cloud 
> > resources we're using (at OVH or elsewhere) so we can (1) see what there 
> > is, (2) see if it is all necessary or if some of it can be scaled back, 
> > and (3) get a handle on what the underlying requirements are.
> > 
> > A few different organizations have suggested that they can donate some 
> > cloud resources.  Any VMs we can shift over to those other clouds would 
> > let us reduce our costs (or get more work done).  The first step is to see 
> > what we have and have a clear set of requirements to share so we know what 
> > is involved to make use of those resources!
> > 
> > Among other things, I have a suspicion that the teuthology worker nodes on 
> > OVH are underutilized.  Maybe we can scale that back, or figure out a way 
> > to scale up/down on demand?  It's expensive, and we're about to have a lot 
> > more eyes on the spend and questions about what is/isn't necessary than we 
> > have now :).
> > 
> I compiled numbers for the past four months of OVH usage.
> https://docs.google.com/spreadsheets/d/1oUHqujqK5TBTyBaSVxuyYZTxzdO1Ptt6j0WZ9eFbksM/edit#gid=0
> For those who can't access the doc, here's the cliff notes in USD.
> Average Monthly Costs for:
>  - Public Infra: $601.99
>  - Permanent CI: $5,405.26
>  - Ephemeral   : $17,736.41
>  - Other       : $220.00
> Total          : $23,963.66
> Of the Ephemeral Instance usage, there's about a 60/40 split between
> Jenkins Slaves and Sepia Teuthology testnodes, respectively.
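
To put the quoted figures in proportion, here is a rough sketch that
recomputes the total, each category's share of it, and a dollar estimate
for the two ephemeral consumers. The category amounts are copied from the
summary above; treating "about 60/40" as exactly 60%/40% is an assumption,
and the variable names are made up for illustration.

```python
# Average monthly OVH costs in USD, copied from the quoted summary.
monthly_costs = {
    "Public Infra": 601.99,
    "Permanent CI": 5405.26,
    "Ephemeral": 17736.41,
    "Other": 220.00,
}

# Recompute the total as a sanity check against the quoted figure.
total = round(sum(monthly_costs.values()), 2)

# Share of the total bill per category, as a percentage.
shares = {
    name: round(100 * cost / total, 1)
    for name, cost in monthly_costs.items()
}

# Assumption: the "about 60/40" split is taken as exactly 60% / 40%.
ephemeral = monthly_costs["Ephemeral"]
jenkins_estimate = round(0.60 * ephemeral, 2)
teuthology_estimate = round(0.40 * ephemeral, 2)

for name, cost in monthly_costs.items():
    print(f"{name:<13}: ${cost:>10,.2f}  ({shares[name]}% of total)")
print(f"{'Total':<13}: ${total:>10,.2f}")
print(f"Ephemeral Jenkins slaves (est.)  : ${jenkins_estimate:,.2f}")
print(f"Ephemeral teuthology nodes (est.): ${teuthology_estimate:,.2f}")
```

Under that assumption, the ephemeral instances account for roughly three
quarters of the bill, and the teuthology testnodes alone would cost more
per month than all of the permanent CI.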

The teuthology nodes seem like the biggest target.  I see lots of cron 
jobs scheduling runs against them, but I'm not sure if those tests have to 
run there vs on bare metal, or if the results are even looked at closely.  
I'd advocate minimizing the teuthology footprint to what we need (e.g., 
tests that require the broader range of distros than what we have via 

> Public Infra = {tracker,docs,www,download,status.sepia}.ceph.com
> Permanent CI = chacra (x6), shaman (x3), jenkins (x2), prado
> Ephemeral    = Ephemeral Jenkins slaves and ovh### Sepia testnodes
> Other        = A couple VMs called packages-repository and teuthology
> (unsure if these are used for anything anymore)
> We could potentially run the permanent CI in the Sepia lab if we used
> 20-30TB of space on the LRC for Chacra nodes.

Are those nodes serving data to the outside world, or is that all coming 
from download.ceph.com?  The bandwidth in/out of the lab is somewhat 
limited IIRC.  That issue aside, I'm all for pulling this in.  We might 
need to be careful to avoid the lab cluster, though: we dogfood release 
candidates there, and I don't want to get into a situation where we can't 
build ourselves a fix.


