[devel at bb.net] deadlock upon errors on log.addRawLines

Thu Apr 28 15:17:40 UTC 2016

Hi Ion,

In my view, the db.logs module should never fail. If a retry is needed,
it should probably be implemented in the db layer.

I think it would make sense to first get a better idea of the root cause
of the deadlock.
Vladimir (@rutsky) has implemented some powerful SQL logging; it may be
worth activating it.

I agree that there can be unrecoverable errors that prevent the logs
from being written (a full disk on the SQL server, for example).
That is more of a panic situation. There are several options
I can think of:

- stop the step with exception status.
   If we can't write logs, will we be able to change the step status to
   exception?
- panic the master, and stop it.
   We let an upper orchestration layer handle the availability issue.
- retry forever until a db admin restores the SQL server.
   This has the advantage of not failing any build.
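
To make the retry option concrete, here is a rough sketch in plain Python.
It is not Buildbot code: TransientDbError, retry_write, make_flaky_writer and
the delay values are hypothetical names chosen for illustration, and the
simulated writer stands in for whatever db-layer call actually fails. Passing
max_attempts=None gives the "retry forever" behaviour; a finite value gives up
and re-raises.

```python
import random
import time


class TransientDbError(Exception):
    """Hypothetical stand-in for a retryable error such as MySQL 1213."""


def retry_write(write_chunk, max_attempts=None, base_delay=0.1,
                max_delay=5.0, sleep=time.sleep):
    """Call write_chunk() until it succeeds and return its result.

    max_attempts=None retries forever; otherwise the last error is
    re-raised once max_attempts calls have failed.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return write_chunk()
        except TransientDbError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Exponential backoff capped at max_delay, with random jitter
            # so that concurrent writers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


def make_flaky_writer(failures):
    """Return a writer that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def write_chunk():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientDbError("deadlock (simulated)")
        return state["calls"]

    return write_chunk
```

The blocking sleep only shows the control flow. In a Twisted-based master a
blocking call would stall the reactor, so a real implementation would have to
wait asynchronously instead.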

About addStdout: in Buildbot nine and Buildbot 0.8.12, for "new style"
steps, addLog and addStdout already return a Deferred, which only fires
when the write has been done.
http://docs.buildbot.net/latest/manual/new-style-steps.html
So yes, you should yield the addStdout calls, and make sure your steps are
considered "new style" (which means they have a run() method instead of
start()).

Pierre

On Thu, Apr 28, 2016 at 13:54, Ion Alberdi <nolaridebi at gmail.com> wrote:

> Hello to all,
> As the error might require more analysis, I copy/paste the issue mentioned
> on irc.
>
> From my understanding,
> when an error appears in the buildbot<->db conversation to add a log line:
>
> https://github.com/buildbot/buildbot/blob/master/master/buildbot/process/log.py#L76
>
> the step calling addStdout will not be able to finish, as it will wait
> for the lock to be released, forever.
>
> I see two solutions for now:
> 1. return the deferred in log.addStdout calls, and let the developer
>     handle the issue if there is any. It has the drawback of changing
>     the way steps are implemented:
>     before:
>         log.addStdout
>         log.addStdout
>     after:
>         yield log.addStdout
>         yield log.addStdout
>
> 2. implement a retry mechanism in addRawLines (with a random sleep between)
>     and raise an error if the issue is not solved (that the developer will
> handle or not).
>     This aims at reducing the number of discarded logs so that they could
> become tolerable.
>
> gracinet (correct me if i'm wrong) prefers 2, inputs are welcome :)
>
>
> P.S:
> I'm currently testing solution 2 in a use case that stresses the logs long
> enough
> to make the db (mysql innodb) report the following issues:
>
> - sqlalchemy.exc.OperationalError: (OperationalError) (1213, 'Deadlock
> found when trying to get lock; try restarting transaction') None None
> - sqlalchemy.exc.OperationalError: (OperationalError) (1213, 'Deadlock
> found when trying to get lock; try restarting transaction') 'INSERT INTO
> logchunks (logid, first_line, last_line, content, compressed) VALUES (%s,
> %s, %s, %s, %s)' (924L, 315L, 315L,
> 'x\xdaKU\x80\x02\xbf|\x85\x92\xcc\xdcT\x85\xaa\xfc\xbcT\x85\xcc\xbc\xb4|
> \xa1P\x92\x91Y\xac\x90\x92X\x92\xaa\xa0ad`h\xa6k`\xa2kd\xa1```\x05F\x9a:\n\xc5\xa9%%\x99y\xe9\n%\xf9\n\xa1!\xce\x00BM\x15\x96',
> 1)
>
>
>
> --
> Ion