From mboxrd@z Thu Jan  1 00:00:00 1970
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: Re: [RFC PATCH] dpci: Put the dpci back on the list if
 running on another CPU.
Date: Mon, 2 Feb 2015 12:44:46 -0500
Message-ID: <20150202174446.GA8348@l.oracle.com>
References: <1421081130-22959-1-git-send-email-konrad.wilk@oracle.com>
	<54B4FF600200007800054376@mail.emea.novell.com>
	<20150202142920.GA661@l.oracle.com>
	<54CFA395020000780005C072@mail.emea.novell.com>
	<20150202153159.GA3163@l.oracle.com>
	<54CFAA53020000780005C103@mail.emea.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta5.messagelabs.com ([195.245.231.135])
	by lists.xen.org with esmtp (Exim 4.72)
	(envelope-from <konrad.wilk@oracle.com>) id 1YIL3Q-0000Jn-8C
	for xen-devel@lists.xenproject.org; Mon, 02 Feb 2015 17:44:56 +0000
Content-Disposition: inline
In-Reply-To: <54CFAA53020000780005C103@mail.emea.novell.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich <JBeulich@suse.com>
Cc: andrew.cooper3@citrix.com, malcolm.crossley@citrix.com, xen-devel@lists.xenproject.org, linux@eikelenboom.it
List-Id: xen-devel@lists.xenproject.org

On Mon, Feb 02, 2015 at 03:48:19PM +0000, Jan Beulich wrote:
> >>> On 02.02.15 at 16:31, <konrad.wilk@oracle.com> wrote:
> > On Mon, Feb 02, 2015 at 03:19:33PM +0000, Jan Beulich wrote:
> >> >>> On 02.02.15 at 15:29, <konrad.wilk@oracle.com> wrote:
> >> > --- a/xen/drivers/passthrough/io.c
> >> > +++ b/xen/drivers/passthrough/io.c
> >> > @@ -63,10 +63,37 @@ enum {
> >> >  static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
> >> >  {
> >> >      unsigned long flags;
> >> > +    unsigned long old, new, val = pirq_dpci->state;
> >> >  
> >> > -    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> >> > -        return;
> >> > +    /*
> >> > +     * This cmpxch is a more clear version of:
> >> > +     * if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> >> > +     *         return;
> >> > +     * since it also checks for STATE_RUN conditions.
> >> > +     */
> >> > +    for ( ;; )
> >> > +    {
> >> > +        new = 1 << STATE_SCHED;
> >> > +        if ( val )
> >> > +            new |= val;
> >> 
> >> Why the if()?
> > 
> > To 'or' the variable new with '1 << STATE_RUN' in case 'val' changed from
> > the first read ('val = pirq_dpci->state') to the moment when
> > we do the cmpxchg.
> 
> But if "val" is zero, the | simply will do nothing.

Correct. Keep in mind that 'new' is set to '1 << STATE_SCHED' at every
loop iteration - so it ends up old = cmpxchg(.., 0, 1 << STATE_SCHED)
(and old == 0, val == 0, so we end up breaking out of the loop).


> 
> >> > +        old = cmpxchg(&pirq_dpci->state, val, new);
> >> > +        switch ( old )
> >> > +        {
> >> > +        case (1 << STATE_SCHED):
> >> > +        case (1 << STATE_RUN) | (1 << STATE_SCHED):
> >> > +            /*
> >> > +             * Whenever STATE_SCHED is set we MUST not schedule it.
> >> > +             */
> >> > +            return;
> >> > +        }
> >> > +        /*
> >> > +         * If the 'state' is 0 or (1 << STATE_RUN) we can schedule it.
> >> > +         */
> >> 
> >> Really? Wasn't the intention to _not_ schedule when STATE_RUN is
> >> set? (Also the above is a only line comment, i.e. wants different style.)
> > 
> > I must confess I must have misread your last review. You said:
> > 
> > 	> Here is a patch that does this. I don't yet have an setup to test
> > 	> the failing scenario (working on that). I removed the part in
> > 	> the softirq because with this patch I cannot see a way it would
> > 	> ever get there (in the softirq hitting the BUG).
> > 
> > 	Hmm, but how do you now schedule the second instance that needed ...
> > 
> > 	> +    case (1 << STATE_RUN):
> > 
> > 	... in this case?
> > 
> > which to me implied you still want to schedule an 'dpci' when STATE_RUN is 
> > set?
> > 
> > My thinking is that we should still schedule an 'dpci' when STATE_RUN is set
> > as that is inline with what the old tasklet code did. It would
> > schedule the tasklet the moment said tasklet was off the global list (but
> > it would not add it in the global list - that would be the job of the
> > tasklet function wrapper to detect and insert said tasklet back on
> > the global list).
> 
> Didn't the original discussion (and issue) revolve around scheduling
> while STATE_RUN was set? Hence the intention to wait for the flag

Yes.
> to clear - but preferably in an explicit rather than implicit (as your
> current and previous patch do) manner.

If we do explicitly we run risk of dead-lock. See below of an draft
(not even compiled tested) of what I think you mean.

The issue is that 'raise_softirq_for' ends up being called from do_IRQ.
And we might have an IRQ coming in just as we are in the dpci_softirq
having set the 1 << STATE_SCHED. Our spin-and-wait in raise_softirq_for
(in this code below) will spin forever.

One way to not get in that quagmire is to well, do what the previous
patches did.

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..706a636 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -63,10 +63,35 @@ enum {
 static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
 {
     unsigned long flags;
+    unsigned long old, new, val = pirq_dpci->state;
 
-    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
-        return;
-
+    for ( ;; )
+    {
+        old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
+        switch ( old )
+        {
+        case (1 << STATE_SCHED):
+            /*
+             * Whenever STATE_SCHED is set we MUST not schedule it.
+             */
+            return;
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+        case (1 << STATE_RUN):
+            /* Getting close to finish. Spin. */
+            continue;
+        }
+        /*
+         * If the 'state' is 0 we can schedule it.
+         */
+        if ( old == 0 )
+            break;
+    }
     get_knownalive_domain(pirq_dpci->dom);
 
     local_irq_save(flags);
> 
> Jan
>