All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] scripts/oe-timeout-dd-test.sh: add script
@ 2021-03-07 20:58 Sakib Sajal
  2021-03-08 14:50 ` [OE-core] " Randy MacLeod
  0 siblings, 1 reply; 6+ messages in thread
From: Sakib Sajal @ 2021-03-07 20:58 UTC (permalink / raw)
  To: openembedded-core

oe-timeout-dd-test tries to write <count> number of
kilobytes to the filesystem within a specified
<timeout>. The purporse of this script is to find
which part of the build system puts stress on the
filesystem io and log all the processes.

To use and monitor a build, add to local.conf:
BB_HEARTBEAT_EVENT = "<interval to log data>"
BB_LOG_HOST_STAT_ON_INTERVAL = "1"
BB_LOG_HOST_STAT_CMDS = "oe-timeout-dd-test.sh <timeout> <count>"

Log can be found at:
$TMPDIR/buildstats/<build_number>/host_stats file.

Signed-off-by: Sakib Sajal <sakib.sajal@windriver.com>
---
 scripts/oe-timeout-dd-test.sh | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100644 scripts/oe-timeout-dd-test.sh

diff --git a/scripts/oe-timeout-dd-test.sh b/scripts/oe-timeout-dd-test.sh
new file mode 100644
index 0000000000..40b3948c32
--- /dev/null
+++ b/scripts/oe-timeout-dd-test.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+#
+# oe-timeout-dd-test tries to write <count> number of
+# kilobytes to the filesystem within a specified time
+# constraint. The purporse of this script is to find
+# which part of the build system puts stress on the
+# filesystem io and log all the processes.
+
+usage() {
+        echo "Usage: $0 <timeout> <count>"
+}
+
+if [ $# -ne 2 ]; then
+        usage
+        exit 1
+fi
+
+timeout $1 dd if=/dev/zero of=/tmp/foo bs=1024 count=$2 >/dev/null 2>&1
+
+if [ $? -ne 0 ]
+then
+        top -b -n 1
+else
+        echo "success"
+fi
+
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [OE-core] [PATCH] scripts/oe-timeout-dd-test.sh: add script
  2021-03-07 20:58 [PATCH] scripts/oe-timeout-dd-test.sh: add script Sakib Sajal
@ 2021-03-08 14:50 ` Randy MacLeod
  2021-03-08 17:33   ` Richard Purdie
  0 siblings, 1 reply; 6+ messages in thread
From: Randy MacLeod @ 2021-03-08 14:50 UTC (permalink / raw)
  To: Sakib Sajal, openembedded-core

On 2021-03-07 3:58 p.m., Sakib Sajal wrote:
> oe-timeout-dd-test tries to write <count> number of
> kilobytes to the filesystem within a specified
> <timeout>. The purporse of this script is to find
s/purporse/purpose/
> which part of the build system puts stress on the
> filesystem io and log all the processes.

> 
> To use and monitor a build, add to local.conf:
> BB_HEARTBEAT_EVENT = "<interval to log data>"
> BB_LOG_HOST_STAT_ON_INTERVAL = "1"
> BB_LOG_HOST_STAT_CMDS = "oe-timeout-dd-test.sh <timeout> <count>"
> 
> Log can be found at:
> $TMPDIR/buildstats/<build_number>/host_stats file.
> 
> Signed-off-by: Sakib Sajal <sakib.sajal@windriver.com>
> ---
>   scripts/oe-timeout-dd-test.sh | 26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
>   create mode 100644 scripts/oe-timeout-dd-test.sh
> 
> diff --git a/scripts/oe-timeout-dd-test.sh b/scripts/oe-timeout-dd-test.sh
> new file mode 100644
> index 0000000000..40b3948c32
> --- /dev/null
> +++ b/scripts/oe-timeout-dd-test.sh
> @@ -0,0 +1,26 @@
> +#!/bin/sh
> +#
> +# oe-timeout-dd-test tries to write <count> number of
> +# kilobytes to the filesystem within a specified time
> +# constraint. The purporse of this script is to find
s/purporse/purpose/
> +# which part of the build system puts stress on the
> +# filesystem io and log all the processes.
s/./when that happens./

It's probably worth pointing out that <timeout> is a floating point
number and maybe provide typical values such as with a HB interval of
15 seconds, and 0.5s timeout and 100 KB IO, the result is
no triggers on an isolated system whereas say ?0.1?s and 100 KB IO
results in several triggers. The idea is to give people a starting
point.
> +
> +usage() {
> +        echo "Usage: $0 <timeout> <count>"
> +}
> +
> +if [ $# -ne 2 ]; then
> +        usage
> +        exit 1
> +fi
> +
> +timeout $1 dd if=/dev/zero of=/tmp/foo bs=1024 count=$2 >/dev/null 2>&1
> +
> +if [ $? -ne 0 ]
> +then
> +        top -b -n 1
> +else
> +        echo "success"

Do we need this else part? It'll just fill up the logs?

> +fi
> +

What values have you tried and how did that work out?
On Friday, a build with 100KB and was it .5 seconds and didn't
see any event exceed the timeout, right? We also ran several builds
out of a shared local sstate-cache and didn't see any event exceed the 
timeout, iirc.

../Randy


> 
> 
> 
> 
> 


-- 
# Randy MacLeod
# Wind River Linux

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [OE-core] [PATCH] scripts/oe-timeout-dd-test.sh: add script
  2021-03-08 14:50 ` [OE-core] " Randy MacLeod
@ 2021-03-08 17:33   ` Richard Purdie
  2021-03-08 20:46     ` Sakib Sajal
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Purdie @ 2021-03-08 17:33 UTC (permalink / raw)
  To: Randy MacLeod, Sakib Sajal, openembedded-core

On Mon, 2021-03-08 at 09:50 -0500, Randy MacLeod wrote:
> On 2021-03-07 3:58 p.m., Sakib Sajal wrote:
> > +timeout $1 dd if=/dev/zero of=/tmp/foo bs=1024 count=$2 >/dev/null 2>&1
> > +
> > +if [ $? -ne 0 ]
> > +then
> > +        top -b -n 1
> > +else
> > +        echo "success"
> 
> Do we need this else part? It'll just fill up the logs?
> 
> > +fi
> > +
> 
> What values have you tried and how did that work out?
> On Friday, a build with 100KB and was it .5 seconds and didn't
> see any event exceed the timeout, right? We also ran several builds
> out of a shared local sstate-cache and didn't see any event exceed the 
> timeout, iirc.

It would really help me to have an idea of what we're proposing we configure
on the autobuilder and how to capture the result...

Cheers,

Richard


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [OE-core] [PATCH] scripts/oe-timeout-dd-test.sh: add script
  2021-03-08 17:33   ` Richard Purdie
@ 2021-03-08 20:46     ` Sakib Sajal
  2021-03-08 22:10       ` Richard Purdie
  0 siblings, 1 reply; 6+ messages in thread
From: Sakib Sajal @ 2021-03-08 20:46 UTC (permalink / raw)
  To: Richard Purdie, Randy MacLeod, openembedded-core


On 2021-03-08 12:33 p.m., Richard Purdie wrote:
> [Please note: This e-mail is from an EXTERNAL e-mail address]
>
> On Mon, 2021-03-08 at 09:50 -0500, Randy MacLeod wrote:
>> On 2021-03-07 3:58 p.m., Sakib Sajal wrote:
>>> +timeout $1 dd if=/dev/zero of=/tmp/foo bs=1024 count=$2 >/dev/null 2>&1
>>> +
>>> +if [ $? -ne 0 ]
>>> +then
>>> +        top -b -n 1
>>> +else
>>> +        echo "success"
>> Do we need this else part? It'll just fill up the logs?
>>
>>> +fi
>>> +
>> What values have you tried and how did that work out?
>> On Friday, a build with 100KB and was it .5 seconds and didn't
>> see any event exceed the timeout, right? We also ran several builds
>> out of a shared local sstate-cache and didn't see any event exceed the
>> timeout, iirc.
> It would really help me to have an idea of what we're proposing we configure
> on the autobuilder and how to capture the result...
>
> Cheers,
>
> Richard

Hi Richard,

Randy and I could directly use autobuilders and make the necessary 
changes to run the tests/experiments, if you prefer to do it yourself, 
read on for more details.

We have been working on a way to collect more data to deal with the 
various intermittent failures that we've been having. We put together a 
simple script that tries to write a specified amount of data to the 
filesystem within a specified time. This script can be used to determine 
if there is io stress on the file-system, if so it captures the output 
of top to see which processes are running.


To use the script to monitor the filesystem during a build, add this to 
local.conf:

BB_HEARTBEAT_EVENT = "<interval at which data will be logged>"
BB_LOG_HOST_STAT_ON_INTERVAL = "1"
BB_LOG_HOST_STAT_CMDS = "oe-timeout-dd-test.sh <timeout> <no. of 
kilobytes to write>"

The logs are stored in tmp-glibc/buildstats/*/host_stats file.

Sample log output:

Event Time: 1615158368.577454
Date: 2021-03-07 23:06:08.586541
oe-timeout-dd-test.sh 0.1 1000
success

Event Time: 1615158392.393126
Date: 2021-03-07 23:06:32.480006
oe-timeout-dd-test.sh 0.1 1000
success

Event Time: 1615159999.497806
Date: 2021-03-07 23:33:19.508317
oe-timeout-dd-test.sh 0.1 1000
top - 15:33:20 up 136 days,  8:11, 16 users,  load average: 641.54, 
792.91, 704.
....


The builds that exceeded the timeout can be found by:

grep "top " tmp-glibc/buildstats/*/host_stats

For example:

build$ grep -r  "top " data_collect*/tmp-glibc/buildstats/
data_collect1/tmp-glibc/buildstats/20210307205832/host_stats:top - 
16:22:44 up 136 days,  9:01, 16 users,  load average: 1653.48, 774.33, 548
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 
15:33:20 up 136 days,  8:11, 16 users,  load average: 641.54, 792.91, 704.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 
16:21:50 up 136 days,  9:00, 16 users,  load average: 859.48, 535.06, 464.
data_collect2/tmp-glibc/buildstats/20210307205836/host_stats:top - 
16:23:50 up 136 days,  9:02, 16 users,  load average: 1300.73, 863.23, 596
data_collect3/tmp-glibc/buildstats/20210307205840/host_stats:top - 
14:57:51 up 136 days,  7:36, 16 users,  load average: 343.42, 281.75, 243.
data_collect4/tmp-glibc/buildstats/20210307205840/host_stats:top - 
15:07:42 up 136 days,  7:46, 16 users,  load average: 607.03, 447.76, 324.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 
14:57:35 up 136 days,  7:35, 16 users,  load average: 304.50, 270.72, 239.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 
15:04:53 up 136 days,  7:43, 16 users,  load average: 491.61, 307.05, 262.
data_collect5/tmp-glibc/buildstats/20210307205840/host_stats:top - 
16:27:34 up 136 days,  9:05, 16 users,  load average: 457.19, 639.73, 563.


I ran 5 builds simultaneously using a shared sstate-cache with 0.5s 
timeout and 100kb write, which did not trigger the script for io lag for 
any build.
I reran the test without shared sstate-cache and all 5 builds 
encountered io lag at least once, some 3 or 4 times, as shown in the 
grep output above.

Looking at the logs, it looked like the machine was swapping.

The <timeout> and the <count> variables may need to be adjusted for each 
machine.

We found a bug with the data collection mechanism, where if the scripts 
hasn't returned within a second, it times out and is killed. This bug 
should not hamper with you carrying out tests, I will send a fix.

The bug looks like following:

Error running command: oe-timeout-dd-test.sh 0.1 1000
Command '['oe-timeout-dd-test.sh', '0.1', '1000']' timed out after 1 seconds


Sincerely,

Sakib


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [OE-core] [PATCH] scripts/oe-timeout-dd-test.sh: add script
  2021-03-08 20:46     ` Sakib Sajal
@ 2021-03-08 22:10       ` Richard Purdie
  2021-03-11 20:53         ` Yi Fan Yu
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Purdie @ 2021-03-08 22:10 UTC (permalink / raw)
  To: Sakib Sajal, Randy MacLeod, openembedded-core

Hi Sakib,

On Mon, 2021-03-08 at 15:46 -0500, Sakib Sajal wrote:
> Randy and I could directly use autobuilders and make the necessary 
> changes to run the tests/experiments, if you prefer to do it yourself, 
> read on for more details.

I'm happy to have the help. I'm thinking that rather than instrument 
one build or a specific test case, we should add something to 
yocto-autobuilder-helper on a test branch, that way we can run an 
"a-full" or "a-quick" target on the autobuilder under the test and see
if we can get any interesting data.

The autobuilders have a shared output directory we could use to collate
the information, it is exported to the environment as 
OEQA_DEBUGGING_SAVED_OUTPUT in config.json in -helper.

I think we therefore need:

a) poky-contrib branch with this change in
b) yocto-autobuilder-helper branch with the config enabled in config.json
c) a further patch to collect up any interesting data and save it to
OEQA_DEBUGGING_SAVED_OUTPUT as a post build event handler for further
analysis. We may be able to patch the existing results collection script
to do this in helper? Obviously the naming of the files on 
OEQA_DEBUGGING_SAVED_OUTPUT needs to be multi build safe.

Does that sound like a reasonable approach?

Cheers,

Richard





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [OE-core] [PATCH] scripts/oe-timeout-dd-test.sh: add script
  2021-03-08 22:10       ` Richard Purdie
@ 2021-03-11 20:53         ` Yi Fan Yu
  0 siblings, 0 replies; 6+ messages in thread
From: Yi Fan Yu @ 2021-03-11 20:53 UTC (permalink / raw)
  To: Richard Purdie, Sakib Sajal, Randy MacLeod, openembedded-core

Randy & Sakib & I took a look at using `flock` instead of `dd`

Initial results to see if it works:
the results are the following when tested with `stress --timeout 20'

stress: info: [44041] dispatching hogs: 800 cpu, 0 io, 800 vm, 800 hdd:
0.74 seconds to flock my home directory
at this load, opening a shell takes around 3-4 seconds


stress: info: [1424] dispatching hogs: 1000 cpu, 0 io, 1000 vm, 1000 hdd
20 seconds: the stress test times out before flock could return
at this load, opening a shell takes forever


I think it is a good indicator of stress if flock takes more than 0.5s 
to complete


yifan

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-03-11 20:53 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-07 20:58 [PATCH] scripts/oe-timeout-dd-test.sh: add script Sakib Sajal
2021-03-08 14:50 ` [OE-core] " Randy MacLeod
2021-03-08 17:33   ` Richard Purdie
2021-03-08 20:46     ` Sakib Sajal
2021-03-08 22:10       ` Richard Purdie
2021-03-11 20:53         ` Yi Fan Yu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.