Summary: Build: 20210420-1 had 23 triggers and isn't really worth analyzing. Stay tuned for what is hopefully better data with a higher timeout threshold. There is a nice looking pair of graph attached though! :) ../Randy On 2021-04-15 4:48 p.m., Randy MacLeod wrote: > On 2021-04-15 1:55 p.m., Randy MacLeod wrote: > Should we  increase the interval to 30, 60, ore more seconds? I've bumped the interval to 60 seconds. I spent some time looking at the first bit of data along with Sakib and Saul from time to time. > General conclusions: > > 1. It seems like ALL triggers involve oe-selftest being active. Still true. > > 2. xz might be a problem but we're not sure yet. Added to graph attached. > > 3. We need more data and tools and time to think about it. > To Do: > > 1. increase top cmdline length from 512 to 16K Still to do, maybe. It's not clear that it's useful yet. > > 2. sometimes we see: > > Command '['oe-time-dd-test.sh', '100']' timed out after 10.0 seconds > > That should not happen so we should understand why and either increase > the time between runs or fix the tooling. This seems to happen under load > so it's hiding the interesting data that we are looking for! With a 60 second interval, this does NOT happen. > > 3. tail the cooker console in addition to top. Present that before top. > > It would be nice to have a top equivalent for bitbake. Still to do. This can be done via the cooker log. We can parse the file and do: 1. list recent tasks regardless of whether they have completed: tail -20? 2. list tasks that have started but not completed. This email is about: https://autobuilder.yocto.io/pub/non-release/20210420-1/ It was a 'quick' build. There was only 1 log file produced. It used the master so the timeout was still 5 seconds. There were 23 (!!) times that the dd time exceeded the 15 (ACTUALLY 5) second limit out of a total of 1504 invocations and those triggers were captured by 1 log file: testresults/qemux86-world/2021-04-20--03-43/host_stats_0_top.txt Splitting each top output into a separate file as before, now we have 23 log files: How big are these files, ie how many process/kernel threads were running when top ran? $ wc -l host-stats-0* | sort -n 699 host-stats-03:42:40--23.log 704 host-stats-02:20:41--5.log 720 host-stats-03:33:41--21.log 733 host-stats-02:19:40--4.log 743 host-stats-03:32:41--20.log 752 host-stats-02:25:45--6.log 753 host-stats-02:02:46--3.log 760 host-stats-03:36:07--22.log 784 host-stats-02:01:48--2.log 784 host-stats-03:30:43--18.log 802 host-stats-02:45:55--10.log 807 host-stats-03:31:42--19.log 816 host-stats-03:29:49--17.log 829 host-stats-03:19:50--15.log 845 host-stats-02:41:40--8.log 851 host-stats-02:00:57--1.log 899 host-stats-02:32:47--7.log 906 host-stats-03:18:41--14.log 925 host-stats-03:28:24--16.log 947 host-stats-02:42:41--9.log 1084 host-stats-03:16:50--13.log 1204 host-stats-02:54:43--11.log 1314 host-stats-03:03:41--12.log 19661 total I noticed that several but not all log files were running xz: $ for i in `ls host-stats-*`; do echo -n $i ": "; grep "xz " $i | wc -l; done host-stats-02:00:57--1.log : 0 host-stats-02:01:48--2.log : 0 host-stats-02:02:46--3.log : 0 host-stats-02:19:40--4.log : 2 host-stats-02:20:41--5.log : 0 host-stats-02:25:45--6.log : 2 host-stats-02:32:47--7.log : 0 host-stats-02:41:40--8.log : 1 host-stats-02:42:41--9.log : 2 host-stats-02:45:55--10.log : 1 host-stats-02:54:43--11.log : 0 host-stats-03:03:41--12.log : 0 host-stats-03:16:50--13.log : 47 host-stats-03:18:41--14.log : 0 host-stats-03:19:50--15.log : 9 host-stats-03:28:24--16.log : 15 host-stats-03:29:49--17.log : 0 host-stats-03:30:43--18.log : 0 host-stats-03:31:42--19.log : 7 host-stats-03:32:41--20.log : 7 host-stats-03:33:41--21.log : 11 host-stats-03:36:07--22.log : 15 host-stats-03:42:40--23.log : 9 I've plotted the number of xz processes along with the load average and the dd time in the attached graph. All of the top output logs seems to be running oe-selftest: for i in host-stats-0*; do grep -H -c "DISPLAY.*oe-selftest " $i ; done host-stats-02:00:57--1.log:1 host-stats-02:01:48--2.log:1 host-stats-02:02:46--3.log:1 host-stats-02:19:40--4.log:1 host-stats-02:20:41--5.log:1 host-stats-02:25:45--6.log:1 host-stats-02:32:47--7.log:1 host-stats-02:41:40--8.log:2 host-stats-02:42:41--9.log:2 host-stats-02:45:55--10.log:2 host-stats-02:54:43--11.log:2 host-stats-03:03:41--12.log:2 host-stats-03:16:50--13.log:2 host-stats-03:18:41--14.log:2 host-stats-03:19:50--15.log:2 host-stats-03:28:24--16.log:2 host-stats-03:29:49--17.log:2 host-stats-03:30:43--18.log:2 host-stats-03:31:42--19.log:2 host-stats-03:32:41--20.log:2 host-stats-03:33:41--21.log:2 host-stats-03:36:07--22.log:2 host-stats-03:42:40--23.log:2 The logs DO seem to be clustered in that there are several 1 minute adjacent intervals where the dd time exceeded the threshold. Data here but you can see this in the attached graph. $ for i in host-stats-*; do echo -n $i ": "; head -1 $i | cut -c -15; done host-stats-02:00:57--1.log : top - 02:00:57 host-stats-02:01:48--2.log : top - 02:01:48 host-stats-02:02:46--3.log : top - 02:02:46 host-stats-02:19:40--4.log : top - 02:19:40 host-stats-02:20:41--5.log : top - 02:20:41 host-stats-02:25:45--6.log : top - 02:25:45 host-stats-02:32:47--7.log : top - 02:32:47 host-stats-02:41:40--8.log : top - 02:41:40 host-stats-02:42:41--9.log : top - 02:42:41 host-stats-02:45:55--10.log : top - 02:45:55 host-stats-02:54:43--11.log : top - 02:54:43 host-stats-03:03:41--12.log : top - 03:03:41 host-stats-03:16:50--13.log : top - 03:16:50 host-stats-03:18:41--14.log : top - 03:18:41 host-stats-03:19:50--15.log : top - 03:19:50 host-stats-03:28:24--16.log : top - 03:28:24 host-stats-03:29:49--17.log : top - 03:29:49 host-stats-03:30:43--18.log : top - 03:30:43 host-stats-03:31:42--19.log : top - 03:31:42 host-stats-03:32:41--20.log : top - 03:32:41 host-stats-03:33:41--21.log : top - 03:33:41 host-stats-03:36:07--22.log : top - 03:36:07 host-stats-03:42:40--23.log : top - 03:42:40 *** Add time distribution of dd. Oddly enough, the output of dd does not seem to be consistent: 102400 bytes (102 kB) copied, 0.0163975 s, 6.2 MB/s 102400 bytes (102 kB) copied, 0.0054732 s, 18.7 MB/s 102400 bytes (102 kB, 100 KiB) copied, 0.0984687 s, 1.0 MB/s 102400 bytes (102 kB, 100 KiB) copied, 0.304233 s, 337 kB/s I'm not sure what's going on there, yet. Sakib? I think the cooker output will be quite useful. I'm not going to put any more time into this data set since the time threshold is still 5 seconds. All for now. ../Randy > >> >> ../Randy >>>>> >>>>> Cheers, >>>>> >>>>> Richard >>>> >>>> The template is not used anywhere, yet, the initial patchset >>>> enables the >>>> data collection by default. >>>> >>>> I have left the template in case the data collection is removed from >>>> defaults and need to be used on a case by case basis. >>>> >>>> I am not entirely sure if nesting templates work. I have not seen any >>>> examples of it, neither did i try it myself. If nesting does work, the >>>> template should be useful. >>> >>> I had a quick look at the code and sadly, it doesn't appear I >>> implemented >>> nesting so this wouldn't be that useful as things stand. >>> >>> Cheers, >>> >>> Richard >>> >>> >>> >>> >>> >>> >> >> > > -- # Randy MacLeod # Wind River Linux