[users at bb.net] Running (docker) buildslaves in a cluster with SLURM

Dominic Kempf dominic.kempf at iwr.uni-heidelberg.de
Mon Nov 16 14:05:51 UTC 2015


Hey Pierre, hey David,

thanks for your input.

As to why using SLURM:

We have an existing cluster for our research group (we develop codes
for numerical simulation) that already uses SLURM. It is best to
integrate our testing hardware into that cluster, so that the hardware
can be used for both testing and production.

Furthermore, we want to do performance testing (and in particular
scalability testing) in the future. To do that, we need fine control
over which cores our processes run on (because, for example, the
performance of a memory-bandwidth-limited code depends heavily on the
available memory controllers). SLURM offers ways to tackle this kind of
thing, while buildbot does not (to my knowledge).
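
To make the core-placement point concrete, here is a minimal sketch of
how a scalability test could ask SLURM to pin tasks to cores. The helper
name srun_command and the benchmark path ./bench are made up for
illustration. The srun options are real, but older SLURM releases spell
the binding option --cpu_bind, so check your version.

```python
def srun_command(program, ntasks, cpus_per_task=1, bind="cores"):
    """Build an srun command line that binds each task to physical cores.

    `program` is the benchmark command as a list of strings (hypothetical).
    """
    return [
        "srun",
        "--ntasks=%d" % ntasks,
        "--cpus-per-task=%d" % cpus_per_task,
        # Ask SLURM to bind every task to its own core(s).
        "--cpu-bind=%s" % bind,
    ] + list(program)


# One command per point of a strong-scaling series.
scaling_series = [srun_command(["./bench"], n) for n in (1, 2, 4, 8)]
```

Each entry of scaling_series can be handed to subprocess from inside a
SLURM allocation.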

@David:
Unfortunately, we cannot use such an approach. The Compile step
definitely cannot be done on the head node, as it usually consumes as
many or even more resources than the execution step. This brings us
back to the problem of combining build steps into one SLURM job.

@Pierre:
Actually, I think I have figured it out by now:

I will write a SlurmLatentBuildSlave that submits an sbatch file to
SLURM (maybe through a small server running on the head node). The
process in that sbatch file will (once it has been allocated the
resources) spin up the build slave.
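
A rough sketch of the submission side, assuming it runs somewhere that
can call sbatch directly (the small head-node server is left out). The
function names, the resource defaults and the job name prefix are
hypothetical. A real SlurmLatentBuildSlave would call something like
submit() when the latent slave is started and cancel() when it is
stopped. Note that the slave password ends up in the job script here.

```python
import shlex
import subprocess


def render_sbatch_script(slave_name, basedir, master, password,
                         cores=4, time_limit="02:00:00"):
    """Return the text of an sbatch file that starts one build slave."""
    args = " ".join(shlex.quote(a)
                    for a in (basedir, master, slave_name, password))
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=bb-%s" % slave_name,
        "#SBATCH --nodes=1",
        "#SBATCH --cpus-per-task=%d" % cores,
        "#SBATCH --time=%s" % time_limit,
        # Everything below runs on the compute node after allocation.
        "buildslave create-slave " + args,
        # --nodaemon keeps the slave in the foreground, so the SLURM job
        # lives exactly as long as the slave does.
        "exec buildslave start --nodaemon " + shlex.quote(basedir),
    ]
    return "\n".join(lines) + "\n"


def parse_job_id(sbatch_output):
    """Parse `sbatch --parsable` output: "<jobid>" or "<jobid>;<cluster>"."""
    return int(sbatch_output.strip().split(";")[0])


def submit(script_text):
    """Pipe the script to sbatch on stdin and return the SLURM job id."""
    proc = subprocess.run(["sbatch", "--parsable"], input=script_text,
                          stdout=subprocess.PIPE, universal_newlines=True,
                          check=True)
    return parse_job_id(proc.stdout)


def cancel(job_id):
    """Stop the slave by cancelling its SLURM job."""
    subprocess.run(["scancel", str(job_id)], check=True)
```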

In a second step, I would write a DockerSlurmLatentBuildSlave that, in
the sbatch file, spins up a docker container with the build slave
running inside.
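
For the docker variant, the job script could look roughly like what the
sketch below generates. It assumes that the image already has buildslave
installed and that the job's user is allowed to run docker on the compute
node. The function name, the /slave directory inside the container and
the container name prefix are hypothetical.

```python
import shlex


def render_docker_sbatch_script(image, slave_name, master, password, cores=4):
    """Return an sbatch file that runs the build slave inside a container."""
    # Command executed inside the container: create the slave directory,
    # then run the slave in the foreground.
    inner = "buildslave create-slave /slave %s %s %s && " % (
        shlex.quote(master), shlex.quote(slave_name), shlex.quote(password))
    inner += "exec buildslave start --nodaemon /slave"
    docker_cmd = ["docker", "run", "--rm", "--name", "bb-" + slave_name,
                  image, "/bin/sh", "-c", inner]
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=bb-%s" % slave_name,
        "#SBATCH --nodes=1",
        "#SBATCH --cpus-per-task=%d" % cores,
        # --rm removes the container once the slave exits.
        "exec " + " ".join(shlex.quote(a) for a in docker_cmd),
    ]
    return "\n".join(lines) + "\n"
```

Because docker run stays in the foreground, cancelling the SLURM job is
meant to take the container down with it. Whether that works reliably
depends on how signals reach the docker client, so treat it as something
to test.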

It seems clear and easy now, but I was really confused last week. Thanks
for your answers, they helped me clear things up.

Best,
Dominic


On 14.11.2015 22:27, David Strubbe wrote:
> Hi, I am running buildbot with SLURM jobs too. For example, 
> http://www.tddft.org/programs/octopus/buildbot (specifically the ones 
> called hbar). But we only submit jobs for the test step, the 
> compilation is run on the head node. You may find this script I wrote 
> helpful:
>
> http://web.mit.edu/~dstrubbe/www/queue_monitor.pl 
>
> It is BSD-licensed and manages submission of jobs with PBS or SLURM. 
> It is being used for the Octopus testsuite above, as well as for 
> another project, BerkeleyGW (BSD-licensed) from which the attached 
> script comes.
>
> David
>
> On Fri, Nov 13, 2015 at 8:45 AM, Dominic Kempf
> <dominic.kempf at iwr.uni-heidelberg.de> wrote:
>
>     Dear Buildbot list,
>
>     I am currently working on a buildbot setup that wants to run
>     buildslaves integrated into a small cluster that is using a SLURM
>     scheduling system. I have trouble mapping my requirements to
>     buildbot concepts in a suitable way.
>
>     Problems arise from:
>     * At first, I thought I can have just one buildslave on the cluster
>       frontend, that passes all build requests to a queue. But it seems
>       that I rather need one such slave on the frontend per job in the
>       queue (sounds like a job for a latent slave). Correct?
>     * I have no clue yet on how to handle separate build steps, because
>       either
>       - the job as submitted to SLURM must contain all build steps at
>         once - which makes a separation of logs etc. a pain
>       - every build step must be submitted to SLURM separately, with the
>         jobs depending on each other correctly - which is also a pain,
>         because I cannot guarantee things running on the same node.
>
>     To further complicate things, I also want to run my builds in
>     docker containers that we use to model heterogeneous userlands.
>     Note that in the above context, this is different than for example
>     in a DockerLatentBuildSlave: With the latter, the slave runs and
>     builds its commands inside a docker container. In my approach, a
>     (potentially also dockerized) buildslave submits a job to a queue,
>     which, when executed on some node, spins up another docker
>     container there and runs the job inside that one.
>
>     I am open to any sort of input and discussion!
>     Thanks in advance,
>
>     Dominic Kempf
>     _______________________________________________
>     users mailing list
>     users at buildbot.net
>     https://lists.buildbot.net/mailman/listinfo/users
>
>
