All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Van Haaren, Harry" <harry.van.haaren@intel.com>
To: Aaron Conole <aconole@redhat.com>
Cc: David Marchand <david.marchand@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
Date: Mon, 14 Oct 2019 16:48:57 +0000	[thread overview]
Message-ID: <E923DB57A917B54B9182A2E928D00FA67591A3F3@IRSMSX102.ger.corp.intel.com> (raw)
In-Reply-To: <f7ttv8bs7ib.fsf@dhcp-25.97.bos.redhat.com>

> -----Original Message-----
> From: Aaron Conole [mailto:aconole@redhat.com]
> Sent: Monday, October 14, 2019 3:54 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: David Marchand <david.marchand@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> service_autotest failing
> 
> Aaron Conole <aconole@redhat.com> writes:
> 
> > "Van Haaren, Harry" <harry.van.haaren@intel.com> writes:
> >
> >>> -----Original Message-----
> >>> From: Aaron Conole [mailto:aconole@redhat.com]
> >>> Sent: Wednesday, September 4, 2019 8:56 PM
> >>> To: David Marchand <david.marchand@redhat.com>
> >>> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> >>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> service_autotest
> >>> failing
<snip lots of backlog>
> >>> > real    2m42.884s
> >>> > user    5m1.902s
> >>> > sys    0m2.208s
> >>>
> >>> I can confirm - takes about 1m to fail.
> >>
> >>
> >> Hi Aaron and David,
> >>
> >> I've been attempting to reproduce this, still no errors here.
> >>
> >> Given the nature of service-cores, and the difficulty to reproduce
> >> here this feels like a race-condition - one that may not exist in all
> >> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
> >>
> >> I'm using Meson to build, so reproducing using this instead of the
> command
> >> as provided above. There should be no difference in reproducing due to
> this:
> >
> > The command runs far more iterations than meson does (I think).
> >
> > I still see it periodically occur in the travis environment.
> >
> > I did see at least one missing memory barrier (I believe).  Please
> > review the following code change (and if you agree I can submit it
> > formally):
> >
> > -----
> > --- a/lib/librte_eal/common/eal_common_launch.c
> > +++ b/lib/librte_eal/common/eal_common_launch.c
> > @@ -21,8 +21,10 @@
> >  int
> >  rte_eal_wait_lcore(unsigned slave_id)
> >  {
> > -       if (lcore_config[slave_id].state == WAIT)
> > +       if (lcore_config[slave_id].state == WAIT) {
> > +               rte_rmb();
> >                 return 0;
> > +       }
> >
> >         while (lcore_config[slave_id].state != WAIT &&
> >                lcore_config[slave_id].state != FINISHED)
> > -----
> >
> > This is because in lib/librte_eal/linux/eal/eal_thread.c:
> >
> > -----
> > 		/* when a service core returns, it should go directly to WAIT
> > 		 * state, because the application will not lcore_wait() for it.
> > 		 */
> > 		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> > 			lcore_config[lcore_id].state = WAIT;
> > 		else
> > 			lcore_config[lcore_id].state = FINISHED;
> > -----
> >
> > NOTE that the service core skips the rte_eal_wait_lcore() code from
> > making the FINISHED->WAIT transition.  So I think at least that read
> > barrier will be needed (maybe I miss the pairing, though?).
> >
> > Additionally, I'm wondering if there is an additional write or sync
> > barrier needed to ensure that some of the transitions are properly
> > recorded when using lcore as a service lcore function.  The fact that
> > this only happens occasionally tells me that it's either a race (which
> > is possible... because the variable update in the test might not be
> > sync'd across cores or something), or some other missing
> > synchronization.
> >
> >> $ meson test service_autotest --repeat 50
> >>
> >> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
> >> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
> >> ...
> >> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
> >>
> >> OK:        50
> >> FAIL:       0
> >> SKIP:       0
> >> TIMEOUT:    0
> >>
> >> I'll keep it running for a few hours but I have little faith if it only
> >> takes 1 minute on your machines...
> >
> > Please try the flat command.
> 
> Not sure if you've had any time to look at this.

Apologies for delay in response - I've ran the existing tests a few 1000's of times during the week, with one reproduction. That's not enough for confidence in debug/fix for me.


> I think there's a change we can make, but not sure about how it fits in
> the overall service lcore design.

This suggestion is only changing the test code correct?


> The proposal is to use a pthread_cond variable which blocks the thread
> requesting the service function to run.  The service function merely
> sets the condition.  The requesting thread does a timed wait (up to 5s?)
> and if the timeout is exceeded can throw an error.  Otherwise, it will
> unblock and can assume that the test passes.  WDYT?  I think it works
> better than the racy code in the test case for now.

The idea/concept is right above, but I think that's what the test is
approximating anyway? The main thread does an "mp_wait_lcore()" until
the service core has returned, essentially a blocking call.

The test fails if the flag is not == 1 (as that indidcates failure in launching
an application function on a previously-use-as-service-core lthread).

I think your RMB suggestion is likely to be the correct, but I'd like to dig into it a bit more.

Thanks for the ping on this thread.

  reply	other threads:[~2019-10-14 16:49 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-03 14:45 [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing Aaron Conole
2019-09-04  9:41 ` Van Haaren, Harry
2019-09-04 10:04   ` David Marchand
2019-09-04 10:38     ` David Marchand
2019-09-04 19:56       ` Aaron Conole
2019-10-07  9:50         ` Van Haaren, Harry
2019-10-07 12:38           ` Aaron Conole
2019-10-14 14:53             ` Aaron Conole
2019-10-14 16:48               ` Van Haaren, Harry [this message]
2019-10-15 16:42                 ` Van Haaren, Harry
2019-09-04  9:55 ` David Marchand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=E923DB57A917B54B9182A2E928D00FA67591A3F3@IRSMSX102.ger.corp.intel.com \
    --to=harry.van.haaren@intel.com \
    --cc=aconole@redhat.com \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.