Summary:

Build: 20210420-1 had 23 triggers and isn't really worth
analyzing. Stay tuned for what is hopefully better data
with a higher timeout threshold.

There is a nice looking pair of graph attached though! :)

../Randy

On 2021-04-15 4:48 p.m., Randy MacLeod wrote:
> On 2021-04-15 1:55 p.m., Randy MacLeod wrote:


 > Should we  increase the interval to 30, 60, ore more seconds?

I've bumped the interval to 60 seconds.


I spent some time looking at the first bit of data along with
Sakib and Saul from time to time.
> General conclusions:
> 
> 1. It seems like ALL triggers involve oe-selftest being active.

Still true.
> 
> 2. xz might be a problem but we're not sure yet. 

Added to graph attached.

> 
> 3. We need more data and tools and time to think about it.



> To Do:
> 
> 1. increase top cmdline length from 512 to  16K

Still to do, maybe. It's not clear that it's useful yet.

> 
> 2. sometimes we see:
> 
>      Command '['oe-time-dd-test.sh', '100']' timed out after 10.0 seconds
> 
> That should not happen so we should understand why and either increase
> the time between runs or fix the tooling. This seems to happen under load
> so it's hiding the interesting data that we are looking for!

With a 60 second interval, this does NOT happen.

> 
> 3. tail the cooker console in addition to top. Present that before top.
> 
>     It would be nice to have a top equivalent for bitbake.

Still to do.
This can be done via the cooker log.
We can parse the file and do:
1. list recent tasks regardless of whether they have completed:
       tail -20?
2. list tasks that have started but not completed.



This email is about:
    https://autobuilder.yocto.io/pub/non-release/20210420-1/
It was a 'quick' build. There was only 1 log file produced.
It used the master so the timeout was still 5 seconds.

There were 23 (!!) times that the dd time exceeded the
15 (ACTUALLY 5) second limit out of a total of 1504 invocations
and those triggers were captured by 1 log file:
    testresults/qemux86-world/2021-04-20--03-43/host_stats_0_top.txt


Splitting each top output into a separate file as before,
now we have 23 log files:

How big are these files, ie how many process/kernel threads were running
when top ran?

$ wc -l host-stats-0* | sort -n
     699 host-stats-03:42:40--23.log
     704 host-stats-02:20:41--5.log
     720 host-stats-03:33:41--21.log
     733 host-stats-02:19:40--4.log
     743 host-stats-03:32:41--20.log
     752 host-stats-02:25:45--6.log
     753 host-stats-02:02:46--3.log
     760 host-stats-03:36:07--22.log
     784 host-stats-02:01:48--2.log
     784 host-stats-03:30:43--18.log
     802 host-stats-02:45:55--10.log
     807 host-stats-03:31:42--19.log
     816 host-stats-03:29:49--17.log
     829 host-stats-03:19:50--15.log
     845 host-stats-02:41:40--8.log
     851 host-stats-02:00:57--1.log
     899 host-stats-02:32:47--7.log
     906 host-stats-03:18:41--14.log
     925 host-stats-03:28:24--16.log
     947 host-stats-02:42:41--9.log
    1084 host-stats-03:16:50--13.log
    1204 host-stats-02:54:43--11.log
    1314 host-stats-03:03:41--12.log
   19661 total



I noticed that several but not all log files were running xz:

$ for i in `ls host-stats-*`; do echo -n $i ": "; grep "xz " $i | wc -l; 
done
host-stats-02:00:57--1.log : 0
host-stats-02:01:48--2.log : 0
host-stats-02:02:46--3.log : 0
host-stats-02:19:40--4.log : 2
host-stats-02:20:41--5.log : 0
host-stats-02:25:45--6.log : 2
host-stats-02:32:47--7.log : 0
host-stats-02:41:40--8.log : 1
host-stats-02:42:41--9.log : 2
host-stats-02:45:55--10.log : 1
host-stats-02:54:43--11.log : 0
host-stats-03:03:41--12.log : 0
host-stats-03:16:50--13.log : 47
host-stats-03:18:41--14.log : 0
host-stats-03:19:50--15.log : 9
host-stats-03:28:24--16.log : 15
host-stats-03:29:49--17.log : 0
host-stats-03:30:43--18.log : 0
host-stats-03:31:42--19.log : 7
host-stats-03:32:41--20.log : 7
host-stats-03:33:41--21.log : 11
host-stats-03:36:07--22.log : 15
host-stats-03:42:40--23.log : 9


I've plotted the number of xz processes along with the load average
and the dd time  in the attached graph.


All of the top output logs seems to be running oe-selftest:

for i in host-stats-0*; do grep -H -c "DISPLAY.*oe-selftest " $i ; done
host-stats-02:00:57--1.log:1
host-stats-02:01:48--2.log:1
host-stats-02:02:46--3.log:1
host-stats-02:19:40--4.log:1
host-stats-02:20:41--5.log:1
host-stats-02:25:45--6.log:1
host-stats-02:32:47--7.log:1
host-stats-02:41:40--8.log:2
host-stats-02:42:41--9.log:2
host-stats-02:45:55--10.log:2
host-stats-02:54:43--11.log:2
host-stats-03:03:41--12.log:2
host-stats-03:16:50--13.log:2
host-stats-03:18:41--14.log:2
host-stats-03:19:50--15.log:2
host-stats-03:28:24--16.log:2
host-stats-03:29:49--17.log:2
host-stats-03:30:43--18.log:2
host-stats-03:31:42--19.log:2
host-stats-03:32:41--20.log:2
host-stats-03:33:41--21.log:2
host-stats-03:36:07--22.log:2
host-stats-03:42:40--23.log:2



The logs DO seem to be clustered in that there are several 1 minute 
adjacent intervals where the dd time exceeded the threshold.

Data here but you can see this in the attached graph.

$ for i in host-stats-*; do echo -n $i ": "; head -1 $i | cut -c -15; done
host-stats-02:00:57--1.log : top - 02:00:57
host-stats-02:01:48--2.log : top - 02:01:48
host-stats-02:02:46--3.log : top - 02:02:46
host-stats-02:19:40--4.log : top - 02:19:40
host-stats-02:20:41--5.log : top - 02:20:41
host-stats-02:25:45--6.log : top - 02:25:45
host-stats-02:32:47--7.log : top - 02:32:47
host-stats-02:41:40--8.log : top - 02:41:40
host-stats-02:42:41--9.log : top - 02:42:41
host-stats-02:45:55--10.log : top - 02:45:55
host-stats-02:54:43--11.log : top - 02:54:43
host-stats-03:03:41--12.log : top - 03:03:41
host-stats-03:16:50--13.log : top - 03:16:50
host-stats-03:18:41--14.log : top - 03:18:41
host-stats-03:19:50--15.log : top - 03:19:50
host-stats-03:28:24--16.log : top - 03:28:24
host-stats-03:29:49--17.log : top - 03:29:49
host-stats-03:30:43--18.log : top - 03:30:43
host-stats-03:31:42--19.log : top - 03:31:42
host-stats-03:32:41--20.log : top - 03:32:41
host-stats-03:33:41--21.log : top - 03:33:41
host-stats-03:36:07--22.log : top - 03:36:07
host-stats-03:42:40--23.log : top - 03:42:40



*** Add time distribution of dd.
Oddly enough, the output of dd does not seem to be consistent:

102400 bytes (102 kB) copied, 0.0163975 s, 6.2 MB/s
102400 bytes (102 kB) copied, 0.0054732 s, 18.7 MB/s
102400 bytes (102 kB, 100 KiB) copied, 0.0984687 s, 1.0 MB/s
102400 bytes (102 kB, 100 KiB) copied, 0.304233 s, 337 kB/s

I'm not sure what's going on there, yet. Sakib?


I think the cooker output will be quite useful.
I'm not going to put any more time into this data set
since the time threshold is still 5 seconds.

All for now.

../Randy


>
>>
>> ../Randy
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Richard
>>>>
>>>> The template is not used anywhere, yet, the initial patchset 
>>>> enables the
>>>> data collection by default.
>>>>
>>>> I have left the template in case the data collection is removed from
>>>> defaults and need to be used on a case by case basis.
>>>>
>>>> I am not entirely sure if nesting templates work. I have not seen any
>>>> examples of it, neither did i try it myself. If nesting does work, the
>>>> template should be useful.
>>>
>>> I had a quick look at the code and sadly, it doesn't appear I 
>>> implemented
>>> nesting so this wouldn't be that useful as things stand.
>>>
>>> Cheers,
>>>
>>> Richard
>>>
>>>
>>>
>>>
>>> 
>>>
>>
>>
>
>

-- 
# Randy MacLeod
# Wind River Linux