kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Cornelia Huck <cohuck@redhat.com>
To: Eric Farman <farman@linux.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>,
	"Jason J . Herne" <jjherne@linux.ibm.com>,
	Jared Rossi <jrossi@linux.ibm.com>,
	linux-s390@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH v1 1/1] vfio-ccw: Don't free channel programs for unrelated interrupts
Date: Wed, 29 Jan 2020 13:00:48 +0100	[thread overview]
Message-ID: <20200129130048.39e1b898.cohuck@redhat.com> (raw)
In-Reply-To: <9635c45f-4652-c837-d256-46f426737a5e@linux.ibm.com>

On Tue, 28 Jan 2020 23:13:30 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 1/28/20 9:42 AM, Eric Farman wrote:
> > 
> > 
> > On 1/28/20 4:58 AM, Cornelia Huck wrote:  
> >> On Mon, 27 Jan 2020 16:28:18 -0500  
> 
> ...snip...
> 
> >>
> >> cp_init checking cp->initialized would probably be good to catch
> >> errors, in any case. (Maybe put a trace there, just to see if it fires?)  
> > 
> > I did this last night, and got frustrated.  The unfortunate thing was
> > that once it fires, we end up flooding our trace buffers with errors as
> > the guest continually retries.  So I need to either make a smarter trace
> > that is rate limited or just crash my host once this condition occurs.
> > Will try to do that between meetings today.
> >   
> 
> I reverted the subject patch, and simply triggered
> BUG_ON(cp->initialized) in cp_init().  It sprung VERY quickly (all
> traces are for the same device):
> 
> 366.399682 03 ...sch_io_todo state=4 o.cpa=03017810
>                              i.w0=00c04007 i.cpa=03017818 i.w2=0c000000
> 366.399832 03 ...sch_io_todo state=3 o.cpa=7f53dd30 UNSOLICITED
>                              i.w0=00c00011 i.cpa=03017818 i.w2=85000000
> 366.400086 03 ...sch_io_todo state=2 o.cpa=03017930
>                              i.w0=00c04007 i.cpa=03017938 i.w2=0c000000
> 366.400313 03 ...sch_io_todo state=3 o.cpa=03017930
>                              i.w0=00001001 i.cpa=03017938 i.w2=00000000
> 
> Ah, of course...  Unsolicited interrupts DO reset private->state back to
> idle, but leave cp->initialized and any channel_program struct remains
> allocated.  So there's one problem (a memory leak), and an easy one to
> rectify.

For a moment, I suspected a deferred condition code here, but it seems
to be a pure unsolicited interrupt.

But that got me thinking: If we get an unsolicited interrupt while
building the cp, it means that the guest is currently executing ssch.
We need to get the unsolicited interrupt to the guest, while not
executing the ssch. So maybe we need to do the following:

- deliver the unsolicited interrupt to the guest
- make sure we don't execute the ssch, but relay a cc 1 for it back to
  the guest
- clean up the cp

Maybe not avoiding issuing the ssch is what gets us in that pickle? We
either leak memory or free too much, it seems.

> 
> After more than a few silly rabbit holes, I had this trace:
> 
> 429.928480 07 ...sch_io_todo state=4 init=1 o.cpa=7fed8e10
>                              i.w0=00001001 i.cpa=7fed8e18 i.w2=00000000
> 429.929132 07 ...sch_io_todo state=4 init=1 o.cpa=0305aed0
>                              i.w0=00c04007 i.cpa=0305aed8 i.w2=0c000000
> 429.929538 07 ...sch_io_todo state=4 init=1 o.cpa=0305af30
>                              i.w0=00c04007 i.cpa=0305af38 i.w2=0c000000
> 467.339389 07   ...chp_event mask=0x80 event=1
> 467.339865 03 ...sch_io_todo state=3 init=0 o.cpa=01814548
>                              i.w0=00c02001 i.cpa=0305af38 i.w2=00000000
> 
> So my trace is at the beginning of vfio_ccw_sch_io_todo(), but the
> BUG_ON() is at the end of that function where private->state is
> (possibly) updated.  Looking at the contents of the vfio_ccw_private
> struct in the dump, the failing device is currently state=4 init=1
> instead of 3/0 as in the above trace.  So an I/O was being built in
> parallel here, and there's no serializing action within the stacked
> vfio_ccw_sch_io_todo() call to ensure they don't stomp on one another.
> The io_mutex handles the region changes, and the subchannel lock handles
> the start/halt/clear subchannel instructions, but nothing on the
> interrupt side, nor contention between them.  Sigh.

I feel we've been here a few times already, and never seem to come up
with a complete solution :(

There had been some changes by Pierre regarding locking the fsm; maybe
that's what's needed here?

> 
> My brain hurts.  I re-applied this patch (with some validation that the
> cpa is valid) to my current franken-code, and will let it run overnight.
>  I think it's going to be racing other CPUs and I'll find a dead system
> by morning, but who knows.  Maybe not.  :)
> 

I can relate to the brain hurting part :)


  reply	other threads:[~2020-01-29 12:01 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-24 14:54 [PATCH v1 0/1] vfio-ccw: Fix interrupt handling for HALT/CLEAR Eric Farman
2020-01-24 14:54 ` [PATCH v1 1/1] vfio-ccw: Don't free channel programs for unrelated interrupts Eric Farman
2020-01-24 15:33   ` Cornelia Huck
2020-01-24 16:08     ` Eric Farman
2020-01-27 12:52       ` Cornelia Huck
2020-01-27 21:28         ` Eric Farman
2020-01-28  9:58           ` Cornelia Huck
2020-01-28 14:42             ` Eric Farman
2020-01-29  4:13               ` Eric Farman
2020-01-29 12:00                 ` Cornelia Huck [this message]
2020-01-29 16:48                   ` Eric Farman
2020-01-24 15:28 ` [PATCH v1 0/1] vfio-ccw: Fix interrupt handling for HALT/CLEAR Cornelia Huck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200129130048.39e1b898.cohuck@redhat.com \
    --to=cohuck@redhat.com \
    --cc=farman@linux.ibm.com \
    --cc=jjherne@linux.ibm.com \
    --cc=jrossi@linux.ibm.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=pasic@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).