* Never ending stream of bitbake exceptions when the builder runs out of disk space
@ 2017-06-15  6:48 Martin Jansa
  2017-06-27  8:08 ` Patrick Ohly
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Jansa @ 2017-06-15  6:48 UTC (permalink / raw)
  To: Patches and discussions about the oe-core layer


This issue has existed for a very long time.

I know that when the builder runs out of disk space, multiple things can
go wrong (I've seen bad archives on premirrors and bad sstate archives
caused by this), so this issue isn't the main problem, but it would still
be nice to fail faster.

In the last build, which ran for some 9 hours, it was building for maybe
2 hours before it ran out of disk space, and this morning there is a 50 MB
log of bitbake output stored on the jenkins master, repeating the
following message very quickly:

# grep -c "Errno 28" consoleText.txt
42986

ERROR: Running command [['world'], 'build']
Traceback (most recent call last):
  File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 211, in fire(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):

    >    fire_class_handlers(event, d)
         if worker_fire:
  File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 134, in fire_class_handlers(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
                         continue
    >            execute_handler(name, handler, event, d)

  File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 106, in execute_handler(name='runqueue_stats', handler=<function runqueue_stats at 0x7fd0020c6158>, event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
         try:
    >        ret = handler(event)
         except (bb.parse.SkipRecipe, bb.BBHandledException):
  File "/home/jenkins/oe/world/shr-core/openembedded-core/meta/classes/buildstats.bbclass", line 212, in runqueue_stats(e=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>):
             done = isinstance(e, bb.event.BuildCompleted)
    >        system_stats.sample(e, force=done)
             if done:
  File "/home/jenkins/oe/world/shr-core/openembedded-core/meta/lib/buildstats.py", line 148, in SystemStats.sample(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, force=False):
                                      data +
    >                                 b'\n')
                 self.last_proc = now
OSError: [Errno 28] No space left on device

It would be better to exit completely when something as bad as Errno 28
happens.
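
As a rough sketch of what failing faster could look like (illustration
only, not existing bitbake code; the function name is made up and bb.fatal
is assumed to be the usual way of raising a fatal error), the handler
execution could check for ENOSPC and end the build instead of logging the
same traceback on every heartbeat:

    # Sketch only: treat ENOSPC from an event handler as fatal instead of
    # printing the same traceback on every heartbeat.
    import errno
    import bb

    def execute_handler_failfast(name, handler, event, d):
        try:
            handler(event)
        except OSError as exc:
            if exc.errno == errno.ENOSPC:
                # Retrying on the next heartbeat cannot succeed, so end the build.
                bb.fatal("No space left on device while running event handler %s" % name)
            raise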



* Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
  2017-06-15  6:48 Never ending stream of bitbake exceptions when the builder runs out of disk space Martin Jansa
@ 2017-06-27  8:08 ` Patrick Ohly
  2017-06-27  8:12   ` Martin Jansa
  2017-06-27  9:41   ` Richard Purdie
  0 siblings, 2 replies; 8+ messages in thread
From: Patrick Ohly @ 2017-06-27  8:08 UTC (permalink / raw)
  To: Martin Jansa; +Cc: Patches and discussions about the oe-core layer

On Thu, 2017-06-15 at 08:48 +0200, Martin Jansa wrote:
> This issue has existed for a very long time.
> 
> 
> I know that when the builder runs out of disk space, multiple things
> can go wrong (I've seen bad archives on premirrors and bad sstate
> archives caused by this), so this issue isn't the main problem, but it
> would still be nice to fail faster.
> 
> 
> In the last build, which ran for some 9 hours, it was building for
> maybe 2 hours before it ran out of disk space, and this morning there
> is a 50 MB log of bitbake output stored on the jenkins master,
> repeating the following message very quickly:
> 
> 
> # grep -c "Errno 28" consoleText.txt 
> 42986
> 
> 
> ERROR: Running command [['world'], 'build']
> Traceback (most recent call last):
> [...]
> OSError: [Errno 28] No space left on device
> 
> 
> It would be better to exit completely when something as bad as Errno
> 28 happens.

Do you have BB_DISKMON_DIRS active? Probably yes.

The reason why it did not trigger here might be that the build ran out
of disk space so quickly that the disk monitoring had no chance to
detect the problem before system stat sampling itself started failing
with the error above.

System stat sampling and disk monitoring are hooking into the same
event, so my theory is that once the system stat sampling fails, disk
monitoring code no longer runs.

I'm not sure what exactly the right fix is: detect uncaught OSError like
28 in the bitbake event loop and abort the build, and/or catch the error
in buildstats.py and ignore it so that the normal disk monitoring can
happen?

I know how to do the latter, but not the former.
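
A minimal sketch of the latter, assuming the actual write in
SystemStats.sample() is wrapped roughly like this (the _write_sample
helper is a placeholder, not the real code in meta/lib/buildstats.py):

    # Sketch only: ignore ENOSPC during stat sampling so that later handlers
    # for the same HeartbeatEvent (e.g. disk monitoring) still run.
    import errno

    class SystemStats:
        def sample(self, event, force=False):
            try:
                self._write_sample(event, force)  # placeholder for the real sampling code
            except OSError as exc:
                if exc.errno != errno.ENOSPC:
                    raise
                # Disk full: drop this sample and let the disk monitor act on the event.

That way the sample for one heartbeat is simply dropped, and the disk
monitor, which fires on the same event, still gets a chance to stop the
build.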

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.






* Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
  2017-06-27  8:08 ` Patrick Ohly
@ 2017-06-27  8:12   ` Martin Jansa
  2017-06-27  8:25     ` Richard Purdie
  2017-06-27  9:41   ` Richard Purdie
  1 sibling, 1 reply; 8+ messages in thread
From: Martin Jansa @ 2017-06-27  8:12 UTC (permalink / raw)
  To: Patrick Ohly; +Cc: Patches and discussions about the oe-core layer


Is BB_DISKMON_DIRS enabled by default?

Quick grep shows it only in local.conf.sample*:
meta/conf/local.conf.sample:BB_DISKMON_DIRS = "\
meta/conf/local.conf.sample.extended:# inode is running low, it is enabled when BB_DISKMON_DIRS is set.
meta/conf/local.conf.sample.extended:#BB_DISKMON_DIRS = "STOPTASKS,${TMPDIR},1G,100K WARN,${SSTATE_DIR},1G,100K"

and my jenkins builds are very close to default oe-core nodistro config, so
I don't think I have that enabled.

Maybe I should enable it, or maybe it should be enabled by default if we
cannot fix this exception stream.
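
For reference, the kind of local.conf setting meant here follows the
ACTION,dir,min-free-space,min-free-inodes format quoted above; the
thresholds below are only illustrative:

    BB_DISKMON_DIRS = "\
        STOPTASKS,${TMPDIR},1G,100K \
        STOPTASKS,${DL_DIR},1G,100K \
        STOPTASKS,${SSTATE_DIR},1G,100K \
        ABORT,${TMPDIR},100M,1K \
        ABORT,${DL_DIR},100M,1K \
        ABORT,${SSTATE_DIR},100M,1K"

With two actions per directory, bitbake first stops scheduling new tasks
(STOPTASKS) and then aborts the build outright (ABORT) before the disk is
completely full.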

Thanks

On Tue, Jun 27, 2017 at 10:08 AM, Patrick Ohly <patrick.ohly@intel.com>
wrote:

> On Thu, 2017-06-15 at 08:48 +0200, Martin Jansa wrote:
> > [...]
>
> Do you have BB_DISKMON_DIRS active? Probably yes.
>
> The reason why it did not trigger here might be that the build ran out
> of disk space so quickly that the disk monitoring had no chance to
> detect the problem before system stat sampling itself started failing
> with the error above.
>
> System stat sampling and disk monitoring are hooking into the same
> event, so my theory is that once the system stat sampling fails, disk
> monitoring code no longer runs.
>
> I'm not sure what exactly the right fix is: detect uncaught OSError like
> 28 in the bitbake event loop and abort the build, and/or catch the error
> in buildstats.py and ignore it so that the normal disk monitoring can
> happen?
>
> I know how to do the latter, but not the former.
>
> --
> Best Regards, Patrick Ohly
>
> The content of this message is my personal opinion only and although
> I am an employee of Intel, the statements I make here in no way
> represent Intel's position on the issue, nor am I authorized to speak
> on behalf of Intel on this matter.
>
>
>
>



* Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
  2017-06-27  8:12   ` Martin Jansa
@ 2017-06-27  8:25     ` Richard Purdie
  2017-06-27  9:21       ` Patrick Ohly
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Purdie @ 2017-06-27  8:25 UTC (permalink / raw)
  To: Martin Jansa, Patrick Ohly
  Cc: Patches and discussions about the oe-core layer

On Tue, 2017-06-27 at 10:12 +0200, Martin Jansa wrote:
> Is BB_DISKMON_DIRS enabled by default?
> 
> Quick grep shows it only in local.conf.sample*:
> meta/conf/local.conf.sample:BB_DISKMON_DIRS = "\
> meta/conf/local.conf.sample.extended:# inode is running low, it is
> enabled when BB_DISKMON_DIRS is set.
> meta/conf/local.conf.sample.extended:#BB_DISKMON_DIRS =
> "STOPTASKS,${TMPDIR},1G,100K WARN,${SSTATE_DIR},1G,100K"
> 
> and my jenkins builds are very close to default oe-core nodistro
> config, so I don't think I have that enabled.
> 
> Maybe I should enable it, or maybe it should be enabled by default if
> we cannot fix this exception stream.

We should run with disk monitoring on by default.

The ways a system can fail in an out-of-disk (or out-of-inode) situation
are pretty widespread; you could certainly "fix" this exception path, but
another would come along, and you'd be forever trying to fix all of them
with no real way to cover every case.

The actual damage running out of space does is also quite nasty: usually
zero-length files in places where things outside our control don't expect
them (e.g. gcc).

This is why we wrote the disk monitoring code in the first place, with
two levels of action: warning the user, then hard-stopping the build to
try to prevent the above corruption.

We did also try and ensure that corruption in DL_DIR and SSTATE_DIR is
at least detectable (checksums in downloads) or avoided (sstate uses
atomic moves).

So if it's not default, we should make it default and I'd encourage
people to use it.

Cheers,

Richard





* Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
  2017-06-27  8:25     ` Richard Purdie
@ 2017-06-27  9:21       ` Patrick Ohly
  2017-06-27  9:37         ` Richard Purdie
  0 siblings, 1 reply; 8+ messages in thread
From: Patrick Ohly @ 2017-06-27  9:21 UTC (permalink / raw)
  To: Richard Purdie; +Cc: Patches and discussions about the oe-core layer

On Tue, 2017-06-27 at 09:25 +0100, Richard Purdie wrote:
> So if it's not default, we should make it default and I'd encourage
> people to use it.

It's not the default at the moment because it's only in
local.conf.sample, which people might not use.

So should a default for BB_DISKMON_DIRS be set in bitbake.conf?

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.






* Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
  2017-06-27  9:21       ` Patrick Ohly
@ 2017-06-27  9:37         ` Richard Purdie
  2017-06-27 13:00           ` Martin Jansa
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Purdie @ 2017-06-27  9:37 UTC (permalink / raw)
  To: Patrick Ohly; +Cc: Patches and discussions about the oe-core layer

On Tue, 2017-06-27 at 11:21 +0200, Patrick Ohly wrote:
> On Tue, 2017-06-27 at 09:25 +0100, Richard Purdie wrote:
> > 
> > So if it's not default, we should make it default and I'd encourage
> > people to use it.
> It's not the default at the moment because it's only in
> local.conf.sample, which people might not use.
> 
> So should a default for BB_DISKMON_DIRS be set in bitbake.conf?

No, I don't think this belongs in bitbake.conf.

If you use a nodistro setup you do get a reasonable default value for
this, you're right and it is in the default local.conf.sample. It's up
to distros and users to select the pieces they need/want beyond that.
Whilst Martin is close to nodistro it sounds like this config was
missing and I'd suggest using it.

Cheers,

Richard




* Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
  2017-06-27  8:08 ` Patrick Ohly
  2017-06-27  8:12   ` Martin Jansa
@ 2017-06-27  9:41   ` Richard Purdie
  1 sibling, 0 replies; 8+ messages in thread
From: Richard Purdie @ 2017-06-27  9:41 UTC (permalink / raw)
  To: Patrick Ohly, Martin Jansa
  Cc: Patches and discussions about the oe-core layer

On Tue, 2017-06-27 at 10:08 +0200, Patrick Ohly wrote:
> On Thu, 2017-06-15 at 08:48 +0200, Martin Jansa wrote:
> > 
> > [...]
> > OSError: [Errno 28] No space left on device
> > 
> > 
> > It would be better to exit completely when something as bad as Errno
> > 28 happens.
> Do you have BB_DISKMON_DIRS active? Probably yes.
> 
> The reason why it did not trigger here might be that the build ran out
> of disk space so quickly that the disk monitoring had no chance to
> detect the problem before system stat sampling itself started failing
> with the error above.
> 
> System stat sampling and disk monitoring are hooking into the same
> event, so my theory is that once the system stat sampling fails, disk
> monitoring code no longer runs.
> 
> I'm not sure what exactly the right fix is: detect uncaught OSError like
> 28 in the bitbake event loop and abort the build, and/or catch the error
> in buildstats.py and ignore it so that the normal disk monitoring can
> happen?
> 
> I know how to do the latter, but not the former.


Incidentally, looking at this trace, I think bitbake should drop the
event handler that is triggering exceptions in a case like this, to try
and avoid looping quite so badly. We should probably have a bug for that.
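
Purely as an illustration of that idea (not actual bb/event.py code; the
handler registry and logger names are assumptions), the handler execution
could unregister a handler the first time it raises an unexpected
exception:

    # Sketch only: drop a class handler after it raises an unexpected exception,
    # so the same failure is not repeated on every subsequent event.
    import logging
    import bb.parse

    logger = logging.getLogger("BitBake.Event")
    _handlers = {}  # name -> handler function, assumed registry of class handlers

    def execute_handler_drop_broken(name, handler, event, d):
        try:
            handler(event)
        except (bb.parse.SkipRecipe, bb.BBHandledException):
            raise
        except Exception:
            logger.exception("Event handler %s failed, disabling it for the rest of the build", name)
            _handlers.pop(name, None)
            raise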

Cheers,

Richard





* Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
  2017-06-27  9:37         ` Richard Purdie
@ 2017-06-27 13:00           ` Martin Jansa
  0 siblings, 0 replies; 8+ messages in thread
From: Martin Jansa @ 2017-06-27 13:00 UTC (permalink / raw)
  To: Richard Purdie; +Cc: Patches and discussions about the oe-core layer


OK, I've updated my jenkins build-setup job to enable BB_DISKMON_DIRS to
prevent such issues in the future:

https://github.com/shr-project/jenkins-jobs/commit/a8c06243d6296d294d5e79abd7d25d3b8a56d040

Once I've verified that it works as expected, I'll update all the builds
I maintain.

On Tue, Jun 27, 2017 at 11:37 AM, Richard Purdie <
richard.purdie@linuxfoundation.org> wrote:

> On Tue, 2017-06-27 at 11:21 +0200, Patrick Ohly wrote:
> > On Tue, 2017-06-27 at 09:25 +0100, Richard Purdie wrote:
> > >
> > > So if it's not default, we should make it default and I'd encourage
> > > people to use it.
> > It's not the default at the moment because it's only in
> > local.conf.sample, which people might not use.
> >
> > So should a default for BB_DISKMON_DIRS be set in bitbake.conf?
>
> No, I don't think this belongs in bitbake.conf.
>
> If you use a nodistro setup you do get a reasonable default value for
> this, you're right and it is in the default local.conf.sample. It's up
> to distros and users to select the pieces they need/want beyond that.
> Whilst Martin is close to nodistro it sounds like this config was
> missing and I'd suggest using it.
>
> Cheers,
>
> Richard
>
>



