All of lore.kernel.org
 help / color / mirror / Atom feed
* [yocto] Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: Oct 7, 2021
@ 2021-10-07 13:34 Randy MacLeod
  0 siblings, 0 replies; only message in thread
From: Randy MacLeod @ 2021-10-07 13:34 UTC (permalink / raw)
  To: Sakib Sajal, alexandre.belloni, richard.purdie, Wold, Saul,
	Trevor Gamblin, Surendran, Kiran
  Cc: yocto


YP AB Intermittent failures meeting
===================================
https://windriver.zoom.us/j/3696693975

Attendees: Richard, Trevor, Randy, Saul


Summary:
========

Ptest results continue to improve yet again but there's still room
for even more improvement.

Alex made a graph of the number of AB INT issues per week:
  https://bootlin.com/~alexandre/SWAT_stats.png
We assume that week 15, 16 was when the RCU bug in he kernel
started being a problem and week 29 was when it go fixed but
more careful analysis is required.

The make/ninja load average limit is in but it's not clear
if it's effective yet and it breaks dunfell.
Trevor has a build of dunfell that with some patches appears to work.


If anyone wants to help, we could use more eyes on the logs,
particularly the summary logs and understanding iostat #
when the dd test times out.



Plans for the week:
===================

      Richard: QA results for M4, etc.
      Alex: ?
      Sakib: hook more responsive load average in to latency test. (v3)
      Trevor: patch to set PARALLEL_MAKE : -l 50
        -> dunfell, gatesgarth, hardknott (Aug 5, Oct 7)
        Confirm that dunfell works now, test other branches.
      Saul: SBOM
      Randy: # processes graph of full builds, patch ninja, graph it.
      Kiran: SBOM


Nothing much new below here. Keeping the list since it's still to-do.

../Randy

Meeting Notes:
==============

1. job server

- ninja could be patched with make's more responsive algorithm
      next or is this good enough?

    Aug 26:
    Randy made some graphs that show that the -l NUM results
    in the number of compile jobs oscillates *wildly* between 0 and 200
    on a 192 core builder compiling chromium. What I did was:
    $ bitbake -c cleansstate chromium-x11
    $ bitbake -c configure chromium-x11
    $ bitbake -c compile chromium-x11
    and while that compile was running:
    $ while [ ! -f /tmp/compiling-chromium-is-done ]; do \
         cat /proc/loadavg >> procs-load.log   ; sleep 0.5 ;
      done
    Results so far:
       https://postimg.cc/gallery/3hjfYfG/f8f46c97
    Next step is either:
    a. collect data as above for an image build and see if the sub-optimal
       ninja behaviour makes a difference
    and/or
    b. patch ninja with make's more responsive load avg
       algorithm:
          https://git.savannah.gnu.org/cgit/make.git/commit/?id=d8728efc8


- Richard suggested that we extract make's code for measuring the load
      average to a separate binary and run it in the periodic io latency
      test. Also can we translate it to python?
      - Trevor is working on this and had some problems so next week.
        (Aug 19 - Trevor is back from vaction so maybe next week.)

- Trevor to see if the load average change really did reduce load
    on WR build systems. (Aug 19)

2. AB status

       Trevor is learning about buildbot and working on a scheduling bug
       (CentOS worker?)

       bitbake layer setup tool should allow multiple backends:
         eg: kas, a y-a-helper.

       ptest cases are improving, we may be close to done!
       Let's wait a week to see how things go.
       (July29, Aug 5, Aug 19,  we're not done...)

       - lttng-tools ptest is failing. RP is working on it with upstream.
         The timeout (done on Aug 5) increase hasn't helped.


3. Sakib's improvements to the logging are merged.

       Sakib generated a summary of all high latency 'top' logs from
       ~July 23->July 29 by just running his summary script on the
       merged raw top logs.

      More analysis required....


Still relevant parts of
Previous Meeting Notes:
=======================


4. bitbake server timeout ( no change july 29, Aug 19, Oct 7)

       "Timeout while waiting for a reply from the bitbake server (60s)"

5. io stalls (no update: July 29, Oct 7)

       Richard said that it would make sense to write an ftrace utility
       / script to monitor io latency and we could install it with sudo
       Ch^W mentioned ftrace on IRC.
       Sakib and Randy will work on that but not for a week or two
       or longer! (Aug 19).

       Randy collected iostat data on 3 build server:
           https://postimg.cc/gallery/8cN6LYB
       We agreed that having -ty-2 be ~ 100 utilization for many hours
       in a row is not acceptable and that a threshold of ~ 10 minutes
       at 100% utilization may be a reasonable limt. I need to figure out
       if I can get data on the fraction of IO done per IO clas since
       we do use ionice to do clean-up and other activities.


../Randy






^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2021-10-07 13:34 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-07 13:34 [yocto] Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: Oct 7, 2021 Randy MacLeod

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.