linux-kselftest.vger.kernel.org archive mirror
From: dancol@google.com (Daniel Colascione)
Subject: [PATCH v1 2/2] Add selftests for pidfd polling
Date: Fri, 26 Apr 2019 12:35:40 -0700	[thread overview]
Message-ID: <CAKOZuetFw9MnZCrXbcFTCFD9Jq8bZwzdcEi_OYHMMv2R_sgNNA@mail.gmail.com> (raw)
In-Reply-To: <20190426172602.GD261279@google.com>

On Fri, Apr 26, 2019 at 10:26 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> On Thu, Apr 25, 2019 at 03:07:48PM -0700, Daniel Colascione wrote:
> > On Thu, Apr 25, 2019 at 2:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > This timing-based testing seems kinda odd to be honest. Can't we do
> > > something better than this?
> >
> > Agreed. Timing-based tests have a substantial risk of becoming flaky.
> > We ought to be able to make these tests fully deterministic and not
> > subject to breakage from odd scheduling outcomes. We don't have
> > sleepable events for everything, granted, but sleep-waiting on a
> > condition with exponential backoff is fine in test code. In general,
> > if you start with a robust test, you can insert a sleep(100) anywhere
> > and not break the logic. Violating this rule always causes pain sooner
> > or later.
>
> I prefer if you can be more specific about how to redesign the test. Please
> go through the code and make suggestions there. The tests have not been flaky
> in my experience.

You've been running them in an ideal environment.

> Some tests do depend on timing like the preemptoff tests,
> that can't be helped. Or a performance test that calculates framedrops.

Performance tests are *about* timing. This is a functional test. Here,
we care about sequencing, not timing, and using a bare sleep instead
of sleeping with a condition check (see below) is always flaky.

> In this case, we want to make sure that the poll unblocks at the right "time"
> that is when the non-leader thread exits, and not when the leader thread
> exits (test 1), or when the non-leader thread exits and not when the same
> non-leader previously did an execve (test 2).

Instead of sleeping, you want to wait for some condition. Right now,
in a bunch of places, the test does something like this:

do_something()
sleep(SOME_TIMEOUT)
check(some_condition())

You can replace each of these clauses with something like this:

do_something()
start_time = now()
while(!some_condition() && now() - start_time < LONG_TIMEOUT)
  sleep(SHORT_DELAY)
check(some_condition())

This way, you're insensitive to timing, up to LONG_TIMEOUT (which can
be something like a minute). Yes, you could always write a bare
sleep(LONG_TIMEOUT) instead, but a good, robust value of LONG_TIMEOUT
(which should be tens of seconds) would make the test take far too
long to run in the happy case.
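Concretely, in C the loop might look something like the sketch below.
The names and constants (wait_for_condition(), some_condition(),
LONG_TIMEOUT_SEC, SHORT_DELAY_USEC) are illustrative, not anything in
the existing test:

#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define LONG_TIMEOUT_SEC 60            /* generous upper bound */
#define SHORT_DELAY_USEC (10 * 1000)   /* 10 ms between checks */

/* Hypothetical predicate supplied by the test, e.g. "the task is a zombie". */
extern bool some_condition(void);

/*
 * Sleep-wait until some_condition() holds or LONG_TIMEOUT_SEC elapses.
 * Returns true if the condition became true in time.  In the happy case
 * this returns after a nap or two, so the test stays fast while still
 * tolerating arbitrary scheduling delays up to the long timeout.
 */
static bool wait_for_condition(void)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        if (some_condition())
            return true;
        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_sec - start.tv_sec >= LONG_TIMEOUT_SEC)
            return false;
        usleep(SHORT_DELAY_USEC);
    }
}

The check(some_condition()) after the loop stays exactly as it is; the
loop only decides how long we're willing to wait before letting it fail.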

Note that this code is fine:

check(!some_condition())
sleep(SOME_REASONABLE_TIMEOUT)
check(!some_condition())

It's okay to sleep for a little while and check that something did
*not* happen, but it's not okay for the test to *fail* due to
scheduling delays. The difference is that
sleeping-and-checking-that-something-didn't-happen can only produce
false negatives (a missed bug), and from a code health perspective it's
much better for a test to occasionally fail to detect a bug than for it
to fail occasionally when there's no bug actually present.
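In C, that safe direction looks something like this (again with
illustrative names; SOME_REASONABLE_TIMEOUT is just a short, fixed nap):

#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

extern bool some_condition(void);      /* same hypothetical predicate as above */

/*
 * Assert that the event has NOT happened, nap briefly, and assert again.
 * A scheduling delay here can only make us miss a real bug (a false
 * negative); it can never make a correct kernel look broken.
 */
static void check_did_not_happen(void)
{
    assert(!some_condition());
    sleep(2);                          /* SOME_REASONABLE_TIMEOUT */
    assert(!some_condition());
}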

> These are inherently timing related.

No they aren't. We don't care how long these operations take. We only
care that they happen in the right order.

(Well, we do care about performance, but not for the purposes of this
functional test.)

> Yes it is true that if this runs in a VM
> and if the VM CPU is preempted for a couple seconds, then the test can fail
> falsely. Still I would argue such a failure scenario of a multi-second CPU
> lock-up can cause more serious issues like RCU stalls, and that's not a test
> issue. We can increase the sleep intervals if you want, to reduce the risk of
> such scenarios.
>
> I would love to make the test not depend on timing, but I don't know how.

For threads, implement some_condition() above by opening the /proc
directory for the task you want. You can check for death by looking
for zombie status in stat or for ESRCH.
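A sketch of such a predicate for a specific task (the helper name and
buffer sizes are mine; the parsing relies on the /proc/<pid>/task/<tid>/stat
format, where the one-character state field follows the last ')'):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Returns true once the given task is gone or has become a zombie.
 * If the stat file no longer exists the task has been fully reaped;
 * otherwise the state character ('Z' for zombie) follows the last ')',
 * which safely skips a comm containing spaces or parentheses.
 */
static bool task_exited(pid_t pid, pid_t tid)
{
    char path[64], buf[512];
    FILE *f;
    char *p;

    snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", (int)pid, (int)tid);
    f = fopen(path, "r");
    if (!f)
        return errno == ENOENT;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return true;                   /* task vanished mid-read */
    }
    fclose(f);
    p = strrchr(buf, ')');
    return p && p[1] == ' ' && p[2] == 'Z';
}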

If you want to test that poll() actually unblocks on exit (as opposed
to EPOLLIN-ing immediately when the waited process is already dead),
do something like this (a rough C sketch follows the list):

- [Main test thread] Start subprocess, getting a pidfd
- [Subprocess] Wait forever
- [Main test thread] Start a waiter thread
- [Waiter test thread] poll(2) (or epoll, if you insist) on process exit
- [Main test thread] sleep(FAIRLY_SHORT_TIMEOUT)
- [Main test thread] Check that the subprocess is alive
- [Main test thread] pthread_tryjoin_np (make sure the waiter thread
is still alive)
- [Main test thread] Kill the subprocess (or one of its threads, for
testing the leader-exit case)
- [Main test thread] pthread_timedjoin_np(LONG_TIMEOUT) the waiter thread
- [Waiter test thread] poll(2) returns and thread exits
- [Main test thread] pthread_join returns: the test succeeds (or, if
pthread_timedjoin_np fails with ETIMEDOUT, poll(2) didn't unblock and
the test should fail).
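Here's roughly what that sequence looks like as a standalone C program.
To be clear, this is a sketch, not the series' selftest: it assumes
poll(2) raises POLLIN on the pidfd when the process exits (the behavior
being added), it grabs the pidfd with a pidfd_open(2) syscall purely
for illustration (the real test may obtain it differently), and the
timeouts are placeholders:

#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static int pidfd = -1;

/* Waiter thread: block in poll(2) until the pidfd reports the exit. */
static void *waiter(void *arg)
{
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

    while (poll(&pfd, 1, -1) < 0 && errno == EINTR)
        ;
    return NULL;
}

int main(void)
{
    struct timespec deadline;
    pthread_t thr;
    pid_t child;

    child = fork();
    if (child == 0) {
        pause();                        /* subprocess: wait forever */
        _exit(0);
    }

    /* One possible source of the pidfd; purely illustrative here. */
    pidfd = syscall(SYS_pidfd_open, child, 0);
    if (pidfd < 0) { perror("pidfd_open"); return 1; }

    pthread_create(&thr, NULL, waiter, NULL);

    sleep(1);                           /* FAIRLY_SHORT_TIMEOUT */
    if (kill(child, 0) != 0) {          /* subprocess should still be around */
        fprintf(stderr, "FAIL: subprocess died early\n");
        return 1;
    }
    if (pthread_tryjoin_np(thr, NULL) != EBUSY) {
        fprintf(stderr, "FAIL: poll() unblocked before any exit\n");
        return 1;
    }

    kill(child, SIGKILL);               /* now make the subprocess exit */

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 60;              /* LONG_TIMEOUT */
    if (pthread_timedjoin_np(thr, NULL, &deadline) != 0) {
        fprintf(stderr, "FAIL: poll() did not unblock on exit\n");
        return 1;
    }

    waitpid(child, NULL, 0);
    printf("PASS\n");
    return 0;
}

Build with -pthread. Note that the only remaining sleep is the
sleep(FAIRLY_SHORT_TIMEOUT), and it sits on the "prove nothing happened
yet" side, so a slow machine can only cost us coverage, never produce a
spurious failure.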

Tests that sleep for synchronization *do* end up being flaky. That the
flakiness doesn't show up in local iterative testing doesn't mean that
the test is adequately robust.
