[PATCH] selftest/reproducible: Don't call sync between each file compare

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] selftest/reproducible: Don't call sync between each file compare
@ 2021-02-26 17:47 Richard Purdie
  2021-03-01  7:59 ` [OE-core] " Mikko Rapeli
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Purdie @ 2021-02-26 17:47 UTC (permalink / raw)
  To: openembedded-core

Calling sync between each file compare is horrible performance wise
as we compare thousands of files. We don't care about IO latency here
so disable.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 meta/lib/oeqa/selftest/cases/reproducible.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meta/lib/oeqa/selftest/cases/reproducible.py b/meta/lib/oeqa/selftest/cases/reproducible.py
index ce4e8ebe06b..05fc4b7fa0a 100644
--- a/meta/lib/oeqa/selftest/cases/reproducible.py
+++ b/meta/lib/oeqa/selftest/cases/reproducible.py
@@ -123,7 +123,7 @@ def compare_file(reference, test, diffutils_sysroot):
         result.status = MISSING
         return result
 
-    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True)
+    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True, sync=False)
 
     if r.status:
         result.status = DIFFERENT
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [OE-core] [PATCH] selftest/reproducible: Don't call sync between each file compare
  2021-02-26 17:47 [PATCH] selftest/reproducible: Don't call sync between each file compare Richard Purdie
@ 2021-03-01  7:59 ` Mikko Rapeli
  2021-03-01  8:37   ` Richard Purdie
  0 siblings, 1 reply; 5+ messages in thread
From: Mikko Rapeli @ 2021-03-01  7:59 UTC (permalink / raw)
  To: richard.purdie; +Cc: openembedded-core

Hi,

On Fri, Feb 26, 2021 at 05:47:52PM +0000, Richard Purdie wrote:
> Calling sync between each file compare is horrible performance wise
> as we compare thousands of files. We don't care about IO latency here
> so disable.
> 
> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> ---
>  meta/lib/oeqa/selftest/cases/reproducible.py | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/meta/lib/oeqa/selftest/cases/reproducible.py b/meta/lib/oeqa/selftest/cases/reproducible.py
> index ce4e8ebe06b..05fc4b7fa0a 100644
> --- a/meta/lib/oeqa/selftest/cases/reproducible.py
> +++ b/meta/lib/oeqa/selftest/cases/reproducible.py
> @@ -123,7 +123,7 @@ def compare_file(reference, test, diffutils_sysroot):
>          result.status = MISSING
>          return result
>
> -    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True)
> +    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True, sync=False)

Uh, sync is horrible for performance. Why don't we change it to false
by default:

--- a/meta/lib/oeqa/utils/commands.py
+++ b/meta/lib/oeqa/utils/commands.py
@@ -167,7 +167,7 @@ class Result(object):
     pass
 
 
-def runCmd(command, ignore_status=False, timeout=None, assert_error=True, sync=True,
+def runCmd(command, ignore_status=False, timeout=None, assert_error=True, sync=False,
           native_sysroot=None, limit_exc_output=0, output_log=None, **options):
     result = Result()
 

Based on comment:

    # tests can be heavy on IO and if bitbake can't write out its caches, we see timeouts.
    # call sync around the tests to ensure the IO queue doesn't get too large, taking any IO
    # hit here rather than in bitbake shutdown.
    if sync:
        p = os.environ['PATH']
        os.environ['PATH'] = "/usr/bin:/bin:/usr/sbin:/sbin:" + p
        os.system("sync")
        os.environ['PATH'] = p

this looks like a workaround for some other bug.

Cheers,

-Mikko

>      if r.status:
>          result.status = DIFFERENT
> -- 
> 2.27.0
> 

> 
> 
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [OE-core] [PATCH] selftest/reproducible: Don't call sync between each file compare
  2021-03-01  7:59 ` [OE-core] " Mikko Rapeli
@ 2021-03-01  8:37   ` Richard Purdie
  2021-03-01  8:43     ` Mikko Rapeli
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Purdie @ 2021-03-01  8:37 UTC (permalink / raw)
  To: Mikko.Rapeli; +Cc: openembedded-core

On Mon, 2021-03-01 at 07:59 +0000, Mikko.Rapeli@bmw.de wrote:
> Hi,
> 
> On Fri, Feb 26, 2021 at 05:47:52PM +0000, Richard Purdie wrote:
> > Calling sync between each file compare is horrible performance wise
> > as we compare thousands of files. We don't care about IO latency here
> > so disable.
> > 
> > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> > ---
> >  meta/lib/oeqa/selftest/cases/reproducible.py | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/meta/lib/oeqa/selftest/cases/reproducible.py b/meta/lib/oeqa/selftest/cases/reproducible.py
> > index ce4e8ebe06b..05fc4b7fa0a 100644
> > --- a/meta/lib/oeqa/selftest/cases/reproducible.py
> > +++ b/meta/lib/oeqa/selftest/cases/reproducible.py
> > @@ -123,7 +123,7 @@ def compare_file(reference, test, diffutils_sysroot):
> >          result.status = MISSING
> >          return result
> >  
> > 
> > -    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True)
> > +    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True, sync=False)
> 
> Uh, sync is horrible for performance. Why don't we change it to false
> by default:
> 
> --- a/meta/lib/oeqa/utils/commands.py
> +++ b/meta/lib/oeqa/utils/commands.py
> @@ -167,7 +167,7 @@ class Result(object):
>      pass
>  
> 
>  
> 
> -def runCmd(command, ignore_status=False, timeout=None, assert_error=True, sync=True,
> +def runCmd(command, ignore_status=False, timeout=None, assert_error=True, sync=False,
>            native_sysroot=None, limit_exc_output=0, output_log=None, **options):
>      result = Result()
>  
> 
> 
> Based on comment:
> 
>     # tests can be heavy on IO and if bitbake can't write out its caches, we see timeouts.
>     # call sync around the tests to ensure the IO queue doesn't get too large, taking any IO
>     # hit here rather than in bitbake shutdown.
>     if sync:
>         p = os.environ['PATH']
>         os.environ['PATH'] = "/usr/bin:/bin:/usr/sbin:/sbin:" + p
>         os.system("sync")
>         os.environ['PATH'] = p
> 
> this looks like a workaround for some other bug.

The wider issue is the randomly failing ptests and other issues with the automated
testing, particularly things running under qemu system mode. Those seemed partly to 
be caused by huge IO queues building up. The idea of the sync calls was to keep the 
UI backlog manageable.

In most cases we execute occasional commands so this should work/help. In the case of
the reproducible ptest, we execute a cmp on every generated package which really didn't
work well hence it makes sense to disable it there.

We don't see this issue with any other tests as far as I've observed. I really don't
want to make the runtime testing results worse, anyone who's looked at swat or attended
bug triage will know how much trouble we're having with the issues this code helps
mitigate :/.

Cheers,

Richard






^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [OE-core] [PATCH] selftest/reproducible: Don't call sync between each file compare
  2021-03-01  8:37   ` Richard Purdie
@ 2021-03-01  8:43     ` Mikko Rapeli
  2021-03-01 11:23       ` Richard Purdie
  0 siblings, 1 reply; 5+ messages in thread
From: Mikko Rapeli @ 2021-03-01  8:43 UTC (permalink / raw)
  To: richard.purdie; +Cc: openembedded-core

On Mon, Mar 01, 2021 at 08:37:46AM +0000, Richard Purdie wrote:
> On Mon, 2021-03-01 at 07:59 +0000, Mikko.Rapeli@bmw.de wrote:
> > Hi,
> > 
> > On Fri, Feb 26, 2021 at 05:47:52PM +0000, Richard Purdie wrote:
> > > Calling sync between each file compare is horrible performance wise
> > > as we compare thousands of files. We don't care about IO latency here
> > > so disable.
> > > 
> > > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> > > ---
> > >  meta/lib/oeqa/selftest/cases/reproducible.py | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/meta/lib/oeqa/selftest/cases/reproducible.py b/meta/lib/oeqa/selftest/cases/reproducible.py
> > > index ce4e8ebe06b..05fc4b7fa0a 100644
> > > --- a/meta/lib/oeqa/selftest/cases/reproducible.py
> > > +++ b/meta/lib/oeqa/selftest/cases/reproducible.py
> > > @@ -123,7 +123,7 @@ def compare_file(reference, test, diffutils_sysroot):
> > >          result.status = MISSING
> > >          return result
> > >  
> > > 
> > > -    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True)
> > > +    r = runCmd(['cmp', '--quiet', reference, test], native_sysroot=diffutils_sysroot, ignore_status=True, sync=False)
> > 
> > Uh, sync is horrible for performance. Why don't we change it to false
> > by default:
> > 
> > --- a/meta/lib/oeqa/utils/commands.py
> > +++ b/meta/lib/oeqa/utils/commands.py
> > @@ -167,7 +167,7 @@ class Result(object):
> >      pass
> >  
> > 
> >  
> > 
> > -def runCmd(command, ignore_status=False, timeout=None, assert_error=True, sync=True,
> > +def runCmd(command, ignore_status=False, timeout=None, assert_error=True, sync=False,
> >            native_sysroot=None, limit_exc_output=0, output_log=None, **options):
> >      result = Result()
> >  
> > 
> > 
> > Based on comment:
> > 
> >     # tests can be heavy on IO and if bitbake can't write out its caches, we see timeouts.
> >     # call sync around the tests to ensure the IO queue doesn't get too large, taking any IO
> >     # hit here rather than in bitbake shutdown.
> >     if sync:
> >         p = os.environ['PATH']
> >         os.environ['PATH'] = "/usr/bin:/bin:/usr/sbin:/sbin:" + p
> >         os.system("sync")
> >         os.environ['PATH'] = p
> > 
> > this looks like a workaround for some other bug.
> 
> The wider issue is the randomly failing ptests and other issues with the automated
> testing, particularly things running under qemu system mode. Those seemed partly to 
> be caused by huge IO queues building up. The idea of the sync calls was to keep the 
> UI backlog manageable.
> 
> In most cases we execute occasional commands so this should work/help. In the case of
> the reproducible ptest, we execute a cmp on every generated package which really didn't
> work well hence it makes sense to disable it there.
> 
> We don't see this issue with any other tests as far as I've observed. I really don't
> want to make the runtime testing results worse, anyone who's looked at swat or attended
> bug triage will know how much trouble we're having with the issues this code helps
> mitigate :/.

Ok, understood. You need to keep the system and tests running.

But IO flushes can take any amount of time. So sounds like timeouts are
too short, or should not exists at all. Also, systems are doing too much
parallel work or are actually short on RAM if slow IO to disk starts
failing things.

-Mikko

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [OE-core] [PATCH] selftest/reproducible: Don't call sync between each file compare
  2021-03-01  8:43     ` Mikko Rapeli
@ 2021-03-01 11:23       ` Richard Purdie
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Purdie @ 2021-03-01 11:23 UTC (permalink / raw)
  To: Mikko.Rapeli; +Cc: openembedded-core

On Mon, 2021-03-01 at 08:43 +0000, Mikko.Rapeli@bmw.de wrote:
> On Mon, Mar 01, 2021 at 08:37:46AM +0000, Richard Purdie wrote:
> > On Mon, 2021-03-01 at 07:59 +0000, Mikko.Rapeli@bmw.de wrote:
> > > On Fri, Feb 26, 2021 at 05:47:52PM +0000, Richard Purdie wrote:
> > > 
> > > 
> > > Based on comment:
> > > 
> > >     # tests can be heavy on IO and if bitbake can't write out its caches, we see timeouts.
> > >     # call sync around the tests to ensure the IO queue doesn't get too large, taking any IO
> > >     # hit here rather than in bitbake shutdown.
> > >     if sync:
> > >         p = os.environ['PATH']
> > >         os.environ['PATH'] = "/usr/bin:/bin:/usr/sbin:/sbin:" + p
> > >         os.system("sync")
> > >         os.environ['PATH'] = p
> > > 
> > > this looks like a workaround for some other bug.
> > 
> > The wider issue is the randomly failing ptests and other issues with the automated
> > testing, particularly things running under qemu system mode. Those seemed partly to 
> > be caused by huge IO queues building up. The idea of the sync calls was to keep the 
> > UI backlog manageable.
> > 
> > In most cases we execute occasional commands so this should work/help. In the case of
> > the reproducible ptest, we execute a cmp on every generated package which really didn't
> > work well hence it makes sense to disable it there.
> > 
> > We don't see this issue with any other tests as far as I've observed. I really don't
> > want to make the runtime testing results worse, anyone who's looked at swat or attended
> > bug triage will know how much trouble we're having with the issues this code helps
> > mitigate :/.
> 
> Ok, understood. You need to keep the system and tests running.
> 
> But IO flushes can take any amount of time. So sounds like timeouts are
> too short, or should not exists at all. Also, systems are doing too much
> parallel work or are actually short on RAM if slow IO to disk starts
> failing things.

I'm far from happy with where we're at with the autobuilder intermittent failures. 
Along with the bug triage team, I've spent the best part of a couple of years trying
to get to grips with them. The sync work around does help to a degree and that is in
itself helpful and reduces the load on build failure handling.

We do already have a ton of memory in the workers. We're reluctant to reduce the 
parallelism numbers as in general things work well, we 'just' see a single failure
in every other build (where the builds report 2 million test results over 4 arches 
each in 32+64 bit, different init systems and so on).

The "timeouts" are things like kernel rcu failures in guests, bitbake event test
issues (where a 5s timeout increased to over 60s doesn't help). Another example is
valgrind's ptests. Where possible we're trying to rewrite not to have the timeouts.

I'd love some help in getting to the bottom of the issues but we have already 
tried the simpler stuff sadly :(. The issues are very very rare but annoying when
they do happen.

Cheers,

Richard

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-03-01 11:23 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-26 17:47 [PATCH] selftest/reproducible: Don't call sync between each file compare Richard Purdie
2021-03-01  7:59 ` [OE-core] " Mikko Rapeli
2021-03-01  8:37   ` Richard Purdie
2021-03-01  8:43     ` Mikko Rapeli
2021-03-01 11:23       ` Richard Purdie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.