[sepia] cloud usage audit

David Galloway dgallowa at redhat.com
Fri Oct 19 16:06:11 PDT 2018

On 10/19/2018 05:20 PM, Sage Weil wrote:
> On Fri, 19 Oct 2018, David Galloway wrote:
>> On 10/11/2018 06:04 PM, Sage Weil wrote:
>>> Hi David, everyone,
>>> Sometime soon I'm hoping we can take a careful look at what cloud 
>>> resources we're using (at OVH or elsewhere) so we can (1) see what there 
>>> is, (2) see if it is all necessary or if some of it can be scaled back, 
>>> and (3) get a handle on what the underlying requirements are.
>>> A few different organizations have suggested that they can donate some 
>>> cloud resources.  Any VMs we can shift over to those other clouds would 
>>> let us reduce our costs (or get more work done).  The first step is to see 
>>> what we have and have a clear set of requirements to share so we know what 
>>> is involved to make use of those resources!
>>> Among other things, I have a suspicion that the teuthology worker nodes on 
>>> OVH are underutilized.  Maybe we can scale that back, or figure out a way 
>>> to scale up/down on demand?  It's expensive, and we're about to have a lot 
>>> more eyes on the spend and questions about what is/isn't necessary than we 
>>> have now :).
>> I compiled numbers for the past four months of OVH usage.
>> https://docs.google.com/spreadsheets/d/1oUHqujqK5TBTyBaSVxuyYZTxzdO1Ptt6j0WZ9eFbksM/edit#gid=0
>> For those who can't access the doc, here's the cliff notes in USD.
>> Average Monthly Costs for:
>>  - Public Infra: $601.99
>>  - Permanent CI: $5,405.26
>>  - Ephemeral   : $17,736.41
>>  - Other       : $220.00
>> Total          : $23,963.66
>> Of the Ephemeral Instance usage, there's about a 60/40 split between
>> Jenkins Slaves and Sepia Teuthology testnodes, respectively.
> The teuthology nodes seem like the biggest target.  I see lots of cron 
> jobs scheduling runs against them, but I'm not sure if those tests have to 
> run there vs on bare metal, or if the results are even looked at closely.  
> I'd advocate minimizing the teuthology footprint to what we need (e.g., 
> tests that require the broader range of distros than what we have via 
> fog).
>> Public Infra = {tracker,docs,www,download,status.sepia}.ceph.com
>> Permanent CI = chacra (x6), shaman (x3), jenkins (x2), prado
>> Ephemeral    = Ephemeral Jenkins slaves and ovh### Sepia testnodes
>> Other        = A couple VMs called packages-repository and teuthology
>> (unsure if these are used for anything anymore)
>> We could potentially run the permanent CI in the Sepia lab if we used
>> 20-30TB of space on the LRC for Chacra nodes.
> Are those nodes serving data to the outside world, or is that all coming 
> from download.ceph.com?  The bandwidth in/out of the lab is somewhat 
> limited IIRC.  That issue aside, I'm all for pulling this in.  

The CI is mostly creating dev packages for the Sepia lab to consume.
So, in theory, we wouldn't need a ton of bandwidth since the dev
packages -> testnodes will only be a few switch hops away.

> We might 
> need to be careful to avoid the lab cluster as we dogfood release 
> candidates there and I don't want to get into a situation where we can't 
> build ourselves a fix.

Great point.  Another option would be setting up a separate Ceph or
Gluster cluster on existing bare metal, but all we have are the mira
machines, which are 8 (9? 10?) years old.
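
For anyone who wants to sanity-check the cost summary quoted above, here
is a minimal Python sketch.  The dollar figures are the monthly averages
from this thread; the variable names are illustrative only, and the
Jenkins/testnode numbers are rough estimates that assume the "about
60/40" split holds exactly.

```python
# Average monthly OVH costs in USD, as quoted in this thread.
monthly_costs = {
    "Public Infra": 601.99,
    "Permanent CI": 5405.26,
    "Ephemeral": 17736.41,
    "Other": 220.00,
}

# Recompute the total to confirm it matches the quoted figure.
total = round(sum(monthly_costs.values()), 2)

# Fraction of the total spend attributable to each category.
shares = {name: cost / total for name, cost in monthly_costs.items()}

# Assumption: treat the "about 60/40" ephemeral split as exact.
jenkins_estimate = round(monthly_costs["Ephemeral"] * 0.60, 2)
testnode_estimate = round(monthly_costs["Ephemeral"] * 0.40, 2)

for name, cost in monthly_costs.items():
    print(f"{name:<13} ${cost:>10,.2f}  {shares[name]:6.1%}")
print(f"{'Total':<13} ${total:>10,.2f}")
print(f"Ephemeral Jenkins slaves (est.): ${jenkins_estimate:,.2f}")
print(f"Ephemeral testnodes (est.):      ${testnode_estimate:,.2f}")
```

The recomputed total matches the $23,963.66 above, and the ephemeral
instances come out to roughly 74% of the monthly spend.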
