* [PATCH v2] buildstats.bbclass: add functionality to collect build system stats
@ 2020-11-10 23:07 Sakib Sajal
  2020-11-11  7:24 ` [OE-core] " Mikko Rapeli
  0 siblings, 1 reply; 2+ messages in thread
From: Sakib Sajal @ 2020-11-10 23:07 UTC (permalink / raw)
  To: openembedded-core

There are a number of timeout and hang defects where
it would be useful to collect statistics about what
is running on a build host when that condition occurs.

This adds functionality to collect build system stats
on a regular interval and/or on task failure. Both
features are disabled by default.

To enable logging on a regular interval, set:
BB_HEARTBEAT_EVENT = "<interval>"
BB_LOG_HOST_STAT_ON_INTERVAL = "<boolean>"
Logs are stored in ${BUILDSTATS_BASE}/<build_name>/host_stats

To enable logging on a task failure, set:
BB_LOG_HOST_STAT_ON_FAILURE = "<boolean>"
Logs are stored in ${BUILDSTATS_BASE}/<build_name>/build_stats

The list of commands, along with the desired options, needs
to be specified in the BB_LOG_HOST_STAT_CMDS variable,
delimited by ";", as such:
BB_LOG_HOST_STAT_CMDS = "command1 ; command2 ;... ;"

Signed-off-by: Sakib Sajal <sakib.sajal@windriver.com>
---
 meta/classes/buildstats.bbclass | 40 ++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)
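
The handling of BB_LOG_HOST_STAT_CMDS described in the commit message can be sketched outside of bitbake as below. This is a minimal illustration modelled on write_host_data() in the diff, not the patch itself: the helper name run_host_stat_cmds and its timeout parameter are hypothetical. Unlike the patch, it also skips whitespace-only entries, on the assumption that a string such as "command1 ; " should not produce an empty argument list.

```python
import subprocess


def run_host_stat_cmds(cmds, timeout=1):
    """Run each ';'-separated command and return (command, output) pairs.

    Hypothetical helper for illustration only; it mirrors the splitting and
    error handling of write_host_data() but returns results instead of
    writing them to a log file.
    """
    results = []
    for cmd in cmds.split(";"):
        argv = cmd.split()
        if not argv:
            # Assumption: skip empty and whitespace-only entries, such as
            # the one left behind by a trailing ';'.
            continue
        try:
            # No shell is involved, so pipes and redirections in a command
            # are passed to the program as plain arguments.
            output = subprocess.check_output(
                argv, stderr=subprocess.STDOUT, timeout=timeout
            ).decode("utf-8")
        except (subprocess.CalledProcessError,
                subprocess.TimeoutExpired,
                FileNotFoundError) as err:
            output = "Error running command: %s\n%s\n" % (cmd.strip(), err)
        results.append((cmd.strip(), output))
    return results
```

Each command is split on whitespace and executed directly, so an entry in BB_LOG_HOST_STAT_CMDS is limited to a program name followed by its options.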

diff --git a/meta/classes/buildstats.bbclass b/meta/classes/buildstats.bbclass
index 6f87187233..a8ee6e69a6 100644
--- a/meta/classes/buildstats.bbclass
+++ b/meta/classes/buildstats.bbclass
@@ -104,14 +104,46 @@ def write_task_data(status, logfile, e, d):
             f.write("Status: FAILED \n")
         f.write("Ended: %0.2f \n" % e.time)
 
+def write_host_data(logfile, e, d):
+    import subprocess, os, datetime
+    cmds = d.getVar('BB_LOG_HOST_STAT_CMDS')
+    if cmds is None:
+        d.setVar("BB_LOG_HOST_STAT_ON_INTERVAL", "0")
+        d.setVar("BB_LOG_HOST_STAT_ON_FAILURE", "0")
+        bb.warn("buildstats: Collecting host data failed. Set BB_LOG_HOST_STAT_CMDS=\"command1 ; command2 ; ... \" in conf\/local.conf\n")
+        return
+    path = d.getVar("PATH")
+    opath = d.getVar("BB_ORIGENV", False).getVar("PATH")
+    ospath = os.environ['PATH']
+    os.environ['PATH'] = path + ":" + opath + ":" + ospath
+    with open(logfile, "a") as f:
+        f.write("Event Time: %f\nDate: %s\n" % (e.time, datetime.datetime.now()))
+        for cmd in cmds.split(";"):
+            if len(cmd) == 0:
+                continue
+            try:
+                output = subprocess.check_output(cmd.split(), stderr=subprocess.STDOUT, timeout=1).decode('utf-8')
+            except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError) as err:
+                output = "Error running command: %s\n%s\n" % (cmd, err)
+            f.write("%s\n%s\n" % (cmd, output))
+    os.environ['PATH'] = ospath
+
 python run_buildstats () {
     import bb.build
     import bb.event
     import time, subprocess, platform
 
     bn = d.getVar('BUILDNAME')
-    bsdir = os.path.join(d.getVar('BUILDSTATS_BASE'), bn)
-    taskdir = os.path.join(bsdir, d.getVar('PF'))
+    ########################################################################
+    # bitbake fires HeartbeatEvent even before a build has been
+    # triggered, causing BUILDNAME to be None
+    ########################################################################
+    if bn is not None:
+        bsdir = os.path.join(d.getVar('BUILDSTATS_BASE'), bn)
+        taskdir = os.path.join(bsdir, d.getVar('PF'))
+        if isinstance(e, bb.event.HeartbeatEvent) and bb.utils.to_boolean(d.getVar("BB_LOG_HOST_STAT_ON_INTERVAL")):
+            bb.utils.mkdirhier(bsdir)
+            write_host_data(os.path.join(bsdir, "host_stats"), e, d)
 
     if isinstance(e, bb.event.BuildStarted):
         ########################################################################
@@ -186,10 +218,12 @@ python run_buildstats () {
         build_status = os.path.join(bsdir, "build_stats")
         with open(build_status, "a") as f:
             f.write(d.expand("Failed at: ${PF} at task: %s \n" % e.task))
+            if bb.utils.to_boolean(d.getVar("BB_LOG_HOST_STAT_ON_FAILURE")):
+                write_host_data(build_status, e, d)
 }
 
 addhandler run_buildstats
-run_buildstats[eventmask] = "bb.event.BuildStarted bb.event.BuildCompleted bb.build.TaskStarted bb.build.TaskSucceeded bb.build.TaskFailed"
+run_buildstats[eventmask] = "bb.event.BuildStarted bb.event.BuildCompleted bb.event.HeartbeatEvent bb.build.TaskStarted bb.build.TaskSucceeded bb.build.TaskFailed"
 
 python runqueue_stats () {
     import buildstats
-- 
2.29.2



* Re: [OE-core] [PATCH v2] buildstats.bbclass: add functionality to collect build system stats
  2020-11-10 23:07 [PATCH v2] buildstats.bbclass: add functionality to collect build system stats Sakib Sajal
@ 2020-11-11  7:24 ` Mikko Rapeli
  0 siblings, 0 replies; 2+ messages in thread
From: Mikko Rapeli @ 2020-11-11  7:24 UTC (permalink / raw)
  To: sakib.sajal; +Cc: openembedded-core

Hi,

On Tue, Nov 10, 2020 at 06:07:44PM -0500, Sakib Sajal wrote:
> There are a number of timeout and hang defects where
> it would be useful to collect statistics about what
> is running on a build host when that condition occurs.
> 
> This adds functionality to collect build system stats
> on a regular interval and/or on task failure. Both
> features are disabled by default.
> 
> To enable logging on a regular interval, set:
> BB_HEARTBEAT_EVENT = "<interval>"
> BB_LOG_HOST_STAT_ON_INTERVAL = "<boolean>"
> Logs are stored in ${BUILDSTATS_BASE}/<build_name>/host_stats
> 
> To enable logging on a task failure, set:
> BB_LOG_HOST_STAT_ON_FAILURE = "<boolean>"
> Logs are stored in ${BUILDSTATS_BASE}/<build_name>/build_stats
> 
> The list of commands, along with the desired options, needs
> to be specified in the BB_LOG_HOST_STAT_CMDS variable,
> delimited by ";", as such:
> BB_LOG_HOST_STAT_CMDS = "command1 ; command2 ;... ;"

I understand the motivation, and I have debugged crashing and hanging build
machines myself, but I would not have found this change useful. Do you have more
concrete examples of how this could be used?

Instead, I found that normal Linux server admin practices worked best:

 * collect the build machine's kernel, journald and syslog logs to a remote host,
   e.g. with rsyslog
 * monitor CPU, memory, IO, network etc. performance, also to a remote host, e.g.
   with pcp.io tooling or collectd
 * collect bitbake build logs with system timestamps to a remote host, i.e. don't
   trust Jenkins and its timestamps

With those, I have been able to find problems in Linux kernels, bugs in the
VMware cloud storage stack triggering IO hangs, stalls and eventually kernel
crashes, and broken HW such as memory. And of course basic things like full
disks, a full /tmp, and the kernel OOM killer kicking in when build slaves ran
out of RAM during a bitbake build. The fix for the latter was either changing
the build machines or tuning the parallel build flags to also account for
physical RAM.
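
As a rough illustration of tuning parallelism against physical RAM, the sketch below caps the job count by both CPU count and memory. The function names and the 2 GiB per-job budget are assumptions made for this example, not values from this thread; the result would be the kind of number one feeds into BB_NUMBER_THREADS or PARALLEL_MAKE.

```python
import os


def parallel_jobs(total_ram_bytes, cpu_count, bytes_per_job=2 * 1024**3):
    """Return a job count limited by both CPU count and available RAM.

    bytes_per_job is an assumed memory budget per parallel job; a real
    value depends on what is being built.
    """
    by_ram = total_ram_bytes // bytes_per_job
    # Never go below one job, even on a very small machine.
    return max(1, min(cpu_count, by_ram))


def physical_ram_bytes():
    """Physical RAM of the current host in bytes (Linux sysconf names)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

For example, parallel_jobs(physical_ram_bytes(), os.cpu_count() or 1) yields a job count that does not exceed the core count and leaves the assumed budget of RAM per job.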

Without full remote logging infrastructure, I could not have solved anything.
Running individual commands is not enough when only the full kernel dmesg of the
affected machine can tell that the IO stack has a hang or an Oops, or that a disk
has been remounted read-only due to errors.
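
To make the point about kernel logs concrete, here is a small sketch that scans collected dmesg or syslog lines for the kinds of events mentioned above. The pattern set and the function name classify_kernel_log are illustrative assumptions; real kernel message wording varies between kernel versions and filesystems, so the regular expressions would need adjusting for a given setup.

```python
import re

# Assumed, illustrative patterns; not an exhaustive or authoritative list.
PATTERNS = {
    "oops": re.compile(r"\bOops\b"),
    "oom_kill": re.compile(r"Out of memory", re.IGNORECASE),
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
    "readonly_remount": re.compile(r"remount\w* .*read-only", re.IGNORECASE),
}


def classify_kernel_log(lines):
    """Group kernel log lines by the kind of problem they indicate."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

Run against logs already shipped to a remote host, a scan like this still works after the build machine itself has hung or crashed.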

Cheers,

-Mikko
