From: "Mikko Rapeli" <mikko.rapeli@bmw.de>
To: <sakib.sajal@windriver.com>
Cc: <openembedded-core@lists.openembedded.org>
Subject: Re: [OE-core] [PATCH v2] buildstats.bbclass: add functionality to collect build system stats
Date: Wed, 11 Nov 2020 07:24:37 +0000
Message-ID: <20201111072436.GN1246345@korppu>
In-Reply-To: <20201110230744.30544-1-sakib.sajal@windriver.com>

Hi,

On Tue, Nov 10, 2020 at 06:07:44PM -0500, Sakib Sajal wrote:
> There are a number of timeout and hang defects where
> it would be useful to collect statistics about what
> is running on a build host when that condition occurs.
> 
> This adds functionality to collect build system stats
> on a regular interval and/or on task failure. Both
> features are disabled by default.
> 
> To enable logging on a regular interval, set:
> BB_HEARTBEAT_EVENT = "<interval>"
> BB_LOG_HOST_STAT_ON_INTERVAL = <boolean>
> Logs are stored in ${BUILDSTATS_BASE}/<build_name>/host_stats
> 
> To enable logging on a task failure, set:
> BB_LOG_HOST_STAT_ON_FAILURE = "<boolean>"
> Logs are stored in ${BUILDSTATS_BASE}/<build_name>/build_stats
> 
> The list of commands, along with the desired options, need
> to be specified in the BB_LOG_HOST_STAT_CMDS variable
> delimited by ; as such:
> BB_LOG_HOST_STAT_CMDS = "command1 ; command2 ;... ;"
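[Editor's sketch: combining the variables described in the quoted patch into a
single hypothetical local.conf fragment. The interval and the command list are
illustrative values chosen for this example, not defaults from the patch.]

```conf
# Hypothetical local.conf snippet: log host stats every 60 seconds
# and also on task failure, using a few common diagnostic commands.
BB_HEARTBEAT_EVENT = "60"
BB_LOG_HOST_STAT_ON_INTERVAL = "1"
BB_LOG_HOST_STAT_ON_FAILURE = "1"
BB_LOG_HOST_STAT_CMDS = "uptime ; free -h ; df -h ; top -b -n 1 ;"
```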

I can understand why, and I have been debugging crashing and hanging build machines
myself, but I would not have found this change useful. Do you have more concrete
examples of how this could be used?

Instead, I found that normal Linux server admin practices were best:

 * collect build machine kernel, journald and syslog messages on a remote host, e.g.
   with rsyslog
 * monitor CPU, memory, IO, network etc. performance, also on a remote host, e.g. with
   pcp.io tooling or collectd
 * collect bitbake build logs with system timestamps on a remote host, i.e. don't trust
   Jenkins and its timestamps
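[Editor's sketch of the first point: a minimal rsyslog client fragment forwarding
all messages to a remote collector. The file path and hostname are hypothetical,
and the disk-assisted queue settings are one reasonable choice, not the only one.]

```conf
# /etc/rsyslog.d/50-forward.conf (hypothetical path and collector hostname)
# Queue messages to disk if the collector is unreachable, retry forever,
# then forward everything over TCP ("@@" means TCP, "@" would be UDP).
$ActionQueueType LinkedList
$ActionQueueFileName fwdq
$ActionResumeRetryCount -1
*.* @@logs.example.com:514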

With those, I have been able to find problems in Linux kernels, bugs in the VMware
cloud storage stack triggering IO hangs, stalls and eventually kernel crashes, and
broken hardware such as faulty memory. And of course basic things like full disks, a
full /tmp, and the kernel OOM killer kicking in when build slaves ran out of RAM during
bitbake builds, which results in either build machine changes or tuning of parallel
build flags to also account for physical RAM.

Without a full remote logging infrastructure, I could not have solved any of these.
Running individual commands is not enough when only the full kernel dmesg of the
affected machine can show that the IO stack has a hang or an Oops, or that a disk has
been remounted read-only due to errors.
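[Editor's sketch of the kind of post-mortem scan that full remote logs enable: a small
Python helper that flags log lines matching the failure modes mentioned above (OOM
killer, hung tasks, IO errors, read-only remounts). The pattern list is illustrative,
not exhaustive.]

```python
import re

# Failure modes mentioned above: OOM killer, IO hangs/Oopses,
# and filesystems remounted read-only after errors.
PATTERNS = re.compile(
    r"oom-killer|hung task|i/o error|oops|remount(ed|ing).*read-only",
    re.IGNORECASE,
)

def scan_log(lines):
    """Return (line_number, line) pairs matching a known failure pattern."""
    return [(n, line) for n, line in enumerate(lines, 1) if PATTERNS.search(line)]
```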

Cheers,

-Mikko

