All of lore.kernel.org
 help / color / mirror / Atom feed
From: Niklas Schnelle <schnelle@linux.ibm.com>
To: Halil Pasic <pasic@linux.ibm.com>, Cornelia Huck <cohuck@redhat.com>
Cc: Eric Farman <farman@linux.ibm.com>,
	Matthew Rosato <mjrosato@linux.ibm.com>,
	Jared Rossi <jrossi@linux.ibm.com>,
	linux-s390@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [RFC PATCH v4 0/4] vfio-ccw: Fix interrupt handling for HALT/CLEAR
Date: Fri, 23 Apr 2021 15:28:21 +0200	[thread overview]
Message-ID: <6b671a9cc1e4b0206e9048f96b87182db188beec.camel@linux.ibm.com> (raw)
In-Reply-To: <20210423152311.20570945.pasic@linux.ibm.com>

On Fri, 2021-04-23 at 15:23 +0200, Halil Pasic wrote:
> On Fri, 23 Apr 2021 13:06:16 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Thu, 22 Apr 2021 16:49:21 -0400
> > Eric Farman <farman@linux.ibm.com> wrote:
> > 
> > > On Thu, 2021-04-22 at 02:52 +0200, Halil Pasic wrote:  
> > > > On Tue, 13 Apr 2021 20:24:06 +0200
> > > > Eric Farman <farman@linux.ibm.com> wrote:
> > > >     
> > > > > Hi Conny, Halil,
> > > > > 
> > > > > Let's restart our discussion about the collision between interrupts
> > > > > for
> > > > > START SUBCHANNEL and HALT/CLEAR SUBCHANNEL. It's been a quarter
> > > > > million
> > > > > minutes (give or take), so here is the problematic scenario again:
> > > > > 
> > > > > 	CPU 1			CPU 2
> > > > >  1	CLEAR SUBCHANNEL
> > > > >  2	fsm_irq()
> > > > >  3				START SUBCHANNEL
> > > > >  4	vfio_ccw_sch_io_todo()
> > > > >  5				fsm_irq()
> > > > >  6				vfio_ccw_sch_io_todo()
> > > > > 
> > > > > From the channel subsystem's point of view the CLEAR SUBCHANNEL
> > > > > (step 1)
> > > > > is complete once step 2 is called, as the Interrupt Response Block
> > > > > (IRB)
> > > > > has been presented and the TEST SUBCHANNEL was driven by the cio
> > > > > layer.
> > > > > Thus, the START SUBCHANNEL (step 3) is submitted [1] and gets a
> > > > > cc=0 to
> > > > > indicate the I/O was accepted. However, step 2 stacks the bulk of
> > > > > the
> > > > > actual work onto a workqueue for when the subchannel lock is NOT
> > > > > held,
> > > > > and is unqueued at step 4. That code misidentifies the data in the
> > > > > IRB
> > > > > as being associated with the newly active I/O, and may release
> > > > > memory
> > > > > that is actively in use by the channel subsystem and/or device.
> > > > > Eww.
> > > > > 
> > > > > In this version...
> > > > > 
> > > > > Patch 1 and 2 are defensive checks. Patch 2 was part of v3 [2], but
> > > > > I
> > > > > would love a better option here to guard between steps 2 and 4.
> > > > > 
> > > > > Patch 3 is a subset of the removal of the CP_PENDING FSM state in
> > > > > v3.
> > > > > I've obviously gone away from this idea, but I thought this piece
> > > > > is
> > > > > still valuable.
> > > > > 
> > > > > Patch 4 collapses the code on the interrupt path so that changes to
> > > > > the FSM state and the channel_program struct are handled at the
> > > > > same
> > > > > point, rather than separated by a mutex boundary. Because of the
> > > > > possibility of a START and HALT/CLEAR running concurrently, it does
> > > > > not make sense to split them here.
> > > > > 
> > > > > With the above patches, maybe it then makes sense to hold the
> > > > > io_mutex
> > > > > across the entirety of vfio_ccw_sch_io_todo(). But I'm not
> > > > > completely
> > > > > sure that would be acceptable.
> > > > > 
> > > > > So... Thoughts?    
> > > > 
> > > > I believe we should address    
> > > 
> > > Who is the "we" here?
> > >   
> > > >  the concurrency, encapsulation and layering
> > > > issues in the subchannel/ccw pass-through code (vfio-ccw) by taking a
> > > > holistic approach as soon as possible.  
> > 
> > Let me also ask: what is "holistic"? If that's a complete rewrite, I
> > definitely don't have the capacity for that; if others want to take
> > over the code, feel free.
> > 
> 
> In general: https://en.wikipedia.org/wiki/Holism
> 
> In this context I mean:
> * Fix all data races in in the vfio-ccw module instead of making the
> "race window" smaller. Reasoning about the behavior of racy programs
> is very difficult.
> * The passed-through subchannel of the VM, as seen by the guest OS is an
> overlay of the host subchannel (which we have to assume is within the
> specification), the vfio-ccw kernel module, and an userspace emulator.
> The interface between the kernel module and the userspace emulator is
> something the authors of the vfio-ccw kernel module design, and while
> doing so we have to think about the interface the whole solution needs
> to implemnet. For example we should ask ourselves: what is the right
> response in kernel when we encounter the situation described by the
> steps 1-3 of Eric's scenario. Our VMs subchannel needs to reward the
> SSC with a cc=2 if the subchannel has the clear FC bit set. If we detect
> the described condition, does it mean that the userspace emulator is
> broken? Or is the userspace emulator allowed to rely on the vfio-ccw
> kernel module to detect this condition and return an -EBUSY (which
> corresponds to cc=2 because that is apart of the definition of the
> interface between the kernel and the userspace)? When is the FC bit
> of our VMs subchannel cleared? I read patch 2 like it is trying to catch
> the condition and return an -EBUSY, but I don't see it catching all
> the possible cases. I.e. what if another CPU is executing the first
> instruction of vfio_ccw_sch_io_todo() when we check 
> work_pending(&private->io_work) in fsm_io_helper()?
> 
> [..]
> 
> > >   
> > > > Moreover patch 4 seems to rely on
> > > > private->state which, AFAIR is still used in a racy fashion.
> > > > 
> > > > But if strong empirical evidence shows that it performs better (stops
> > > > the bleeding), I think we can go ahead with it.    
> > > 
> > > Again with the bleeding. Is there a Doctor in the house? :)  
> > 
> > No idea, seen any blue boxes around? :)
> > 
> 
> Let me also ask what: blue boxes do you mean? If you mean
> https://en.wikipedia.org/wiki/Blue_box
> then, I'm not sure I can follow your association. Are you looking for
> phone to call a doctor?

I think he's talking about this phone box: 
https://en.wikipedia.org/wiki/TARDIS

> 
> Regards,
> Halil


  reply	other threads:[~2021-04-23 13:29 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-13 18:24 [RFC PATCH v4 0/4] vfio-ccw: Fix interrupt handling for HALT/CLEAR Eric Farman
2021-04-13 18:24 ` [RFC PATCH v4 1/4] vfio-ccw: Check initialized flag in cp_init() Eric Farman
2021-04-14 16:30   ` Cornelia Huck
2021-04-13 18:24 ` [RFC PATCH v4 2/4] vfio-ccw: Check workqueue before doing START Eric Farman
2021-04-15 10:51   ` Cornelia Huck
2021-04-15 13:48     ` Eric Farman
2021-04-15 16:19       ` Cornelia Huck
2021-04-15 18:42         ` Eric Farman
2021-04-16 14:41           ` Cornelia Huck
2021-04-13 18:24 ` [RFC PATCH v4 3/4] vfio-ccw: Reset FSM state to IDLE inside FSM Eric Farman
2021-04-15 10:54   ` Cornelia Huck
2021-04-13 18:24 ` [RFC PATCH v4 4/4] vfio-ccw: Reset FSM state to IDLE before io_mutex Eric Farman
2021-04-21 10:25   ` Cornelia Huck
2021-04-21 12:58     ` Eric Farman
2021-04-22 16:16       ` Eric Farman
2021-04-22  0:52 ` [RFC PATCH v4 0/4] vfio-ccw: Fix interrupt handling for HALT/CLEAR Halil Pasic
2021-04-22 20:49   ` Eric Farman
2021-04-23 11:06     ` Cornelia Huck
2021-04-23 13:23       ` Halil Pasic
2021-04-23 13:28         ` Niklas Schnelle [this message]
2021-04-23 15:53         ` Eric Farman
2021-04-23 11:50     ` Halil Pasic
2021-04-23 15:53       ` Eric Farman
2021-04-23 17:08         ` Halil Pasic
2021-04-23 19:07           ` Eric Farman
2021-04-24  0:18             ` Halil Pasic

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6b671a9cc1e4b0206e9048f96b87182db188beec.camel@linux.ibm.com \
    --to=schnelle@linux.ibm.com \
    --cc=cohuck@redhat.com \
    --cc=farman@linux.ibm.com \
    --cc=jrossi@linux.ibm.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mjrosato@linux.ibm.com \
    --cc=pasic@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.