linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Paul Mackerras <paulus@ozlabs.org>
To: "Cédric Le Goater" <clg@kaod.org>
Cc: kvm@vger.kernel.org, kvm-ppc@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
Date: Wed, 30 Jan 2019 17:20:41 +1100	[thread overview]
Message-ID: <20190130062041.GD27109@blackberry> (raw)
In-Reply-To: <f92f71ee-5c8e-f6b1-dbbf-436bbb0624cd@kaod.org>

On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
> On 1/29/19 3:45 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> >>> Would we end up with too many VMAs if we just used mmap() to
> >>> change the mappings from the software-generated pages to the
> >>> hardware-generated interrupt pages?  
> >> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> >> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> > 
> > Confused.  You say the number space has 32768 entries but then imply
> > there are only 8K entries.  Do you mean that the API allows for 15-bit
> > IRQ numbers but we are only making using of 8192 of them?
> 
> Ouch. My bad. Let's do it again. 
> 
> The sPAPR IRQ number space is 0x2000 wide :
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
> 
> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
> devices (which can be extended if needed).
> 
> So that's 8192 x 2 ESB pages.
> 
> >> extend the last range if needed as these are for MSIs. Dynamic 
> >> extensions under KVM should work also.
> >>
> >> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
> >> lot of mmap(), too much. Also the KVM model needs to be compatible
> > 
> > I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> > space would be covered by a single mmap, overlaid by subsequent mmaps
> > where we need to map real device interrupts.
> 
> ok. The same fault handler could be used to populate the VMA with the 
> ESB pages. 
> 
> But it would mean extra work on the QEMU side, which is not needed 
> with this patch. 

Maybe, but just storing a single vma pointer in our private data is
not a feasible approach.  First, you have no control on the lifetime
of the vma and thus this is a use-after-free waiting to happen, and
secondly, there could be multiple vmas that you need to worry about.
Userspace could do multiple mmaps, or could do mprotect or similar on
part of an existing mmap, which would split the vma for the mmap into
multiple vmas.  You don't get notified about munmap either as far as I
can tell, so the vma is liable to go away.  And it doesn't matter if
QEMU would never do such things; if userspace can provoke a
use-after-free in the kernel using KVM then someone will write a
specially crafted user program to do that.

So we either solve it in userspace, or we have to write and maintain
complex kernel code with deep links into the MM subsystem.  I'd much
rather solve it in userspace.

> >> with the QEMU emulated one and it was simpler to have one overall
> >> memory region for the IPI ESBs, one for the END ESBs (if we support
> >> that one day) and one for the TIMA.
> >>
> >>> Are the necessary pages for a PCI
> >>> passthrough device contiguous in both host real space 
> >>
> >> They should as they are the PHB4 ESBs.
> >>
> >>> and guest real space ? 
> >>
> >> also. That's how we organized the mapping. 
> > 
> > "How we organized the mapping" is a significant design decision that I
> > haven't seen documented anywhere, and is really needed for
> > understanding what's going on.
> 
> OK. I will add comments on that. See below for some description.
> 
> There is nothing fancy, it's simply indexed with the interrupt number,
> like for HW, or for the QEMU XIVE emulated model.
> 
> >>> If so we'd only need one mmap() for all the device's interrupt
> >>> pages.
> >>
> >> Ah. So we would want to make a special case for the passthrough 
> >> device and have a mmap() and a memory region for its ESBs. Hmm.
> >>
> >> Wouldn't that require to hot plug a memory region under the guest ? 
> > 
> > No; the way that a memory region works is that userspace can do
> > whatever disparate mappings it likes within the region on the user
> > process side, and the corresponding region of guest real address space
> > follows that automatically.
> 
> yes. I suppose this should work also for 'ram device' memory mappings.
> 
> So when the passthrough device is added to the guest, we would add a 
> new 'ram device' memory region for the device interrupt ESB pages 
> that would overlap with the initial guest ESB pages.  

Not knowing the QEMU internals all that well, I don't at all
understand why a new ram device is necesssary.  I would see it as a
single virtual area mapping the ESB pages of guest hwirqs that are in
use, and we manage those mappings with mmap and munmap.

> This is really a different approach.
> 
> >> which means that we need to provision an address space/container 
> >> region for theses regions. What are the benefits ? 
> >>
> >> Is clearing the PTEs and repopulating the VMA unsafe ? 
> > 
> > Explicitly unmapping parts of the VMA seems like the wrong way to do
> > it.  If you have a device mmap where the device wants to change the
> > physical page underlying parts of the mapping, there should be a way
> > for it to do that explicitly (but I don't know off the top of my head
> > what the interface to do that is).
> > 
> > However, I still haven't seen a clear and concise explanation of what
> > is being changed, when, and why we need to do that.
> 
> Yes. I agree on that. The problem is not very different from what we 
> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
> 
> 
> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
> implement an IRQ space for the guest using the generic IPI interrupts 
> of the XIVE IC controller. These interrupts are allocated at the OPAL
> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
> Interrupt management is performed in the XIVE way: using loads and 
> stores on the addresses of the XIVE IPI interrupt ESB pages.
> 
> Both KVM devices share the same internal structure caching information 
> on the interrupts, among which the xive_irq_data struct containing the 
> addresses of the IPI ESB pages and an extra one in case of passthrough. 
> The later contains the addresses of the ESB pages of the underlying HW 
> controller interrupts, PHB4 in all cases for now.    
> 
> A guest when running in the XICS legacy interrupt mode lets the KVM 
> XICS-over-XIVE device "handle" interrupt management, that is to perform  
> the loads and stores on the addresses of the ESB pages of the guest 
> interrupts. 
> 
> However, when running in XIVE native exploitation mode, the KVM XIVE 
> native device exposes the interrupt ESB pages to the guest and lets 
> the guest perform directly the loads and stores. 
> 
> The VMA exposing the ESB pages make use of a custom VM fault handler
> which role is to populate the VMA with appropriate pages. When a fault
> occurs, the guest IRQ number is deduced from the offset, and the ESB 
> pages of associated XIVE IPI interrupt are inserted in the VMA (using
> the internal structure caching information on the interrupts).
> 
> Supporting device passthrough in the guest running in XIVE native 
> exploitation mode adds some extra refinements because the ESB pages 
> of a different HW controller (PHB4) need to be exposed to the guest 
> along with the initial IPI ESB pages of the XIVE IC controller. But
> the overall mechanic is the same. 
> 
> When the device HW irqs are mapped into or unmapped from the guest
> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
> and kvmppc_xive_clr_mapped(), are called to record or clear the 
> passthrough interrupt information and to perform the switch.
> 
> The approach taken by this patch is to clear the ESB pages of the 
> guest IRQ number being mapped and let the VM fault handler repopulate. 
> The handler will insert the ESB page corresponding to the HW interrupt 
> of the device being passed-through or the initial IPI ESB page if the
> device is being removed.   

That's a much better write-up.  Thanks.

Paul.

  reply	other threads:[~2019-01-30  6:24 UTC|newest]

Thread overview: 135+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
2019-01-07 18:43 ` [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls Cédric Le Goater
2019-01-09  3:33   ` David Gibson
2019-01-09 13:08   ` Michael Ellerman
2019-01-09 13:38     ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support Cédric Le Goater
2019-01-09  4:26   ` David Gibson
2019-01-07 18:43 ` [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type Cédric Le Goater
2019-01-09  4:27   ` David Gibson
2019-01-22  4:56   ` Paul Mackerras
2019-01-23 16:24     ` Cédric Le Goater
2019-02-04  0:50       ` David Gibson
2019-02-04 10:16         ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device Cédric Le Goater
2019-01-11  4:09   ` David Gibson
2019-01-07 18:43 ` [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode Cédric Le Goater
2019-01-22  5:05   ` Paul Mackerras
2019-01-23 16:28     ` Cédric Le Goater
2019-01-28 17:35     ` Cédric Le Goater
2019-01-30  4:29       ` Paul Mackerras
2019-01-30  7:01         ` Cédric Le Goater
2019-01-31  3:01           ` Paul Mackerras
2019-02-01 17:03             ` Cédric Le Goater
2019-02-04  4:25   ` David Gibson
2019-02-04 11:19     ` Cédric Le Goater
2019-02-05  5:26       ` David Gibson
2019-01-07 18:43 ` [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device Cédric Le Goater
2019-01-22  5:09   ` Paul Mackerras
2019-01-23 16:48     ` Cédric Le Goater
2019-02-04  4:45   ` David Gibson
2019-02-04 11:30     ` Cédric Le Goater
2019-02-05  5:28       ` David Gibson
2019-02-05 12:55         ` Cédric Le Goater
2019-02-06  1:23           ` David Gibson
2019-02-06  7:21             ` Cédric Le Goater
2019-02-07  2:49               ` David Gibson
2019-02-07  9:03                 ` Cédric Le Goater
2019-02-08  5:15                   ` David Gibson
2019-02-08  7:58                     ` Cédric Le Goater
2019-02-08 21:53                       ` Paul Mackerras
2019-02-09  9:41                         ` Cédric Le Goater
2019-02-11  2:38                           ` David Gibson
2019-02-11  6:42                             ` Benjamin Herrenschmidt
2019-02-12 22:07                               ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 07/19] KVM: PPC: Book3S HV: add a GET_TIMA_FD control to " Cédric Le Goater
2019-01-07 18:43 ` [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the " Cédric Le Goater
2019-01-22  5:14   ` Paul Mackerras
2019-01-23 16:56     ` Cédric Le Goater
2019-02-04  4:49       ` David Gibson
2019-02-04 15:36         ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE " Cédric Le Goater
2019-02-04  4:57   ` David Gibson
2019-02-04 19:07     ` Cédric Le Goater
2019-02-05  5:35       ` David Gibson
2019-02-05 13:39         ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 10/19] KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state Cédric Le Goater
2019-01-07 18:43 ` [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls Cédric Le Goater
2019-01-22  5:23   ` Paul Mackerras
2019-01-23  6:44     ` Benjamin Herrenschmidt
2019-01-23  8:48       ` Cédric Le Goater
2019-01-23 10:26         ` Paul Mackerras
2019-01-23 10:48           ` Cédric Le Goater
2019-01-23 21:23           ` Benjamin Herrenschmidt
2019-01-07 18:43 ` [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address Cédric Le Goater
2019-02-04  5:15   ` David Gibson
2019-02-04 15:37     ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration Cédric Le Goater
2019-02-04  5:17   ` David Gibson
2019-02-04 15:39     ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty Cédric Le Goater
2019-02-04  5:18   ` David Gibson
2019-02-04 15:46     ` Cédric Le Goater
2019-02-05  5:30       ` David Gibson
2019-01-07 18:43 ` [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration Cédric Le Goater
2019-02-04  5:21   ` David Gibson
2019-02-04 16:07     ` Cédric Le Goater
2019-02-05  5:32       ` David Gibson
2019-02-05 13:03         ` Cédric Le Goater
2019-02-06  1:23           ` David Gibson
2019-02-06  1:24             ` David Gibson
2019-02-06  7:07               ` Cédric Le Goater
2019-02-07  2:48                 ` David Gibson
2019-02-07  9:13                   ` Cédric Le Goater
2019-02-08  5:15                     ` David Gibson
2019-02-14 16:50                       ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration Cédric Le Goater
2019-02-04  5:24   ` David Gibson
2019-02-05 17:45     ` Cédric Le Goater
2019-01-07 19:10 ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state Cédric Le Goater
2019-01-07 19:10   ` [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support Cédric Le Goater
2019-01-22  5:26     ` Paul Mackerras
2019-01-23  6:45       ` Benjamin Herrenschmidt
2019-01-23 10:30         ` Paul Mackerras
2019-01-23 11:07           ` Cédric Le Goater
2019-01-28  6:13             ` Paul Mackerras
2019-01-28 18:26               ` Cédric Le Goater
2019-01-29  2:45                 ` Paul Mackerras
2019-01-29 13:47                   ` Cédric Le Goater
2019-01-30  6:20                     ` Paul Mackerras [this message]
2019-01-30 15:54                       ` Cédric Le Goater
2019-01-31  2:48                         ` Paul Mackerras
2019-01-29  4:12                 ` Paul Mackerras
2019-01-29 17:44                   ` Cédric Le Goater
2019-01-30  5:55                     ` Paul Mackerras
2019-01-30  7:06                       ` Cédric Le Goater
2019-01-23 21:25           ` Benjamin Herrenschmidt
2019-01-24  8:41             ` Cédric Le Goater
2019-01-28  4:43             ` Paul Mackerras
2019-01-29 13:46               ` Cédric Le Goater
2019-01-07 19:10   ` [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl Cédric Le Goater
2019-01-22  5:42     ` Paul Mackerras
2019-01-23 18:39       ` Cédric Le Goater
2019-01-23 21:32         ` Benjamin Herrenschmidt
2019-02-04  5:26   ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state David Gibson
2019-02-04 18:57     ` Cédric Le Goater
2019-02-05  5:33       ` David Gibson
2019-02-05 11:58         ` Cédric Le Goater
2019-02-06  1:19           ` David Gibson
2019-01-22  4:46 ` [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Paul Mackerras
2019-01-23 19:07   ` Cédric Le Goater
2019-01-23 21:35     ` Benjamin Herrenschmidt
2019-01-26  8:25       ` Cédric Le Goater
2019-02-04  5:36         ` David Gibson
2019-02-05 11:31           ` Cédric Le Goater
2019-02-05 22:13             ` Paul Mackerras
2019-02-06  1:18               ` David Gibson
2019-02-06  7:35                 ` Cédric Le Goater
2019-02-07  2:51                   ` David Gibson
2019-02-07  8:31                     ` Cédric Le Goater
2019-02-08  5:07                       ` David Gibson
2019-02-08  7:38                         ` Cédric Le Goater
2019-01-28  5:51     ` Paul Mackerras
2019-01-29 13:51       ` Cédric Le Goater
2019-01-30  5:40         ` Paul Mackerras
2019-01-30 15:36           ` Cédric Le Goater

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190130062041.GD27109@blackberry \
    --to=paulus@ozlabs.org \
    --cc=clg@kaod.org \
    --cc=david@gibson.dropbear.id.au \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).