From mboxrd@z Thu Jan 1 00:00:00 1970
From: =?UTF-8?Q?C=c3=a9dric_Le_Goater?=
Subject: Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
Date: Wed, 30 Jan 2019 16:54:23 +0100
Message-ID:
References: <20190107191006.10648-1-clg@kaod.org>
 <20190107191006.10648-2-clg@kaod.org>
 <20190122052657.GG15124@blackberry>
 <20190123103009.GB29826@blackberry>
 <75762dbe-0f08-5b06-e376-744ff87ff4cb@kaod.org>
 <20190128061353.GD3237@blackberry>
 <38b0244d-beb6-3aee-c638-279ca570633c@kaod.org>
 <20190129024503.GA11368@blackberry>
 <20190130062041.GD27109@blackberry>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: kvm@vger.kernel.org, kvm-ppc@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, David Gibson
To: Paul Mackerras
In-Reply-To: <20190130062041.GD27109@blackberry>
Content-Language: en-US
List-Id: kvm.vger.kernel.org

On 1/30/19 7:20 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 3:45 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>>>> Would we end up with too many VMAs if we just used mmap() to
>>>>> change the mappings from the software-generated pages to the
>>>>> hardware-generated interrupt pages?
>>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are
>>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can
>>>
>>> Confused. You say the number space has 32768 entries but then imply
>>> there are only 8K entries. Do you mean that the API allows for 15-bit
>>> IRQ numbers but we are only making use of 8192 of them?
>>
>> Ouch. My bad. Let's do it again.
>>
>> The sPAPR IRQ number space is 0x2000 wide:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
>>
>> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for
>> devices (which can be extended if needed).
>>
>> So that's 8192 x 2 ESB pages.
>>
>>>> extend the last range if needed as these are for MSIs. Dynamic
>>>> extensions under KVM should work also.
>>>>
>>>> That is to say, we have 8K x 2 (trigger+EOI) pages. This is a
>>>> lot of mmap(), too much. Also the KVM model needs to be compatible
>>>
>>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
>>> space would be covered by a single mmap, overlaid by subsequent mmaps
>>> where we need to map real device interrupts.
>>
>> ok. The same fault handler could be used to populate the VMA with the
>> ESB pages.
>>
>> But it would mean extra work on the QEMU side, which is not needed
>> with this patch.
>
> Maybe, but just storing a single vma pointer in our private data is
> not a feasible approach. First, you have no control over the lifetime
> of the vma and thus this is a use-after-free waiting to happen, and
> secondly, there could be multiple vmas that you need to worry about.

I fully agree. That's why I was uncomfortable with the solution. There
are a few other drivers (GPUs if I recall) doing that, but it feels
wrong.

> Userspace could do multiple mmaps, or could do mprotect or similar on
> part of an existing mmap, which would split the vma for the mmap into
> multiple vmas.
> You don't get notified about munmap either as far as I
> can tell, so the vma is liable to go away.

yes ...

> And it doesn't matter if
> QEMU would never do such things; if userspace can provoke a
> use-after-free in the kernel using KVM then someone will write a
> specially crafted user program to do that.
>
> So we either solve it in userspace, or we have to write and maintain
> complex kernel code with deep links into the MM subsystem.
>
> I'd much rather solve it in userspace.

OK, then. I have been reluctant to do so, but it seems there is no other
in-kernel solution.

>>>> with the QEMU emulated one and it was simpler to have one overall
>>>> memory region for the IPI ESBs, one for the END ESBs (if we support
>>>> that one day) and one for the TIMA.
>>>>
>>>>> Are the necessary pages for a PCI
>>>>> passthrough device contiguous in both host real space
>>>>
>>>> They should be, as they are the PHB4 ESBs.
>>>>> and guest real space ?
>>>>
>>>> also. That's how we organized the mapping.
>>>
>>> "How we organized the mapping" is a significant design decision that I
>>> haven't seen documented anywhere, and is really needed for
>>> understanding what's going on.
>>
>> OK. I will add comments on that. See below for some description.
>>
>> There is nothing fancy, it's simply indexed with the interrupt number,
>> like for HW, or for the QEMU XIVE emulated model.
>>
>>>>> If so we'd only need one mmap() for all the device's interrupt
>>>>> pages.
>>>>
>>>> Ah. So we would want to make a special case for the passthrough
>>>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>>>
>>>> Wouldn't that require hot plugging a memory region under the guest?
>>>
>>> No; the way that a memory region works is that userspace can do
>>> whatever disparate mappings it likes within the region on the user
>>> process side, and the corresponding region of guest real address space
>>> follows that automatically.
>>
>> yes. I suppose this should work also for 'ram device' memory mappings.
>>
>> So when the passthrough device is added to the guest, we would add a
>> new 'ram device' memory region for the device interrupt ESB pages
>> that would overlap with the initial guest ESB pages.
>
> Not knowing the QEMU internals all that well, I don't at all
> understand why a new ram device is necessary.

'ram device' memory regions are a special type of region used to map the
result of a mmap() in QEMU directly into the guest. This is how we
propagate the XIVE ESB pages from HW (and the TIMA) to the guest and the
Linux kernel. It has other uses with VFIO.

> I would see it as a
> single virtual area mapping the ESB pages of guest hwirqs that are in
> use, and we manage those mappings with mmap and munmap.

Yes, I think I understand the idea. I will give it a try. I need to find
the right place to do so in QEMU. See the other email thread.

>> This is really a different approach.
>>
>>>> which means that we need to provision an address space/container
>>>> region for these regions. What are the benefits?
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe?
>>>
>>> Explicitly unmapping parts of the VMA seems like the wrong way to do
>>> it. If you have a device mmap where the device wants to change the
>>> physical page underlying parts of the mapping, there should be a way
>>> for it to do that explicitly (but I don't know off the top of my head
>>> what the interface to do that is).
>>>
>>> However, I still haven't seen a clear and concise explanation of what
>>> is being changed, when, and why we need to do that.
>>
>> Yes. I agree on that. The problem is not very different from what we
>> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
>>
>>
>> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
>> implement an IRQ space for the guest using the generic IPI interrupts
>> of the XIVE IC controller. These interrupts are allocated at the OPAL
>> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
>> Interrupt management is performed in the XIVE way: using loads and
>> stores on the addresses of the XIVE IPI interrupt ESB pages.
>>
>> Both KVM devices share the same internal structure caching information
>> on the interrupts, among which the xive_irq_data struct containing the
>> addresses of the IPI ESB pages and an extra one in case of passthrough.
>> The latter contains the addresses of the ESB pages of the underlying HW
>> controller interrupts, PHB4 in all cases for now.
>>
>> A guest running in the XICS legacy interrupt mode lets the KVM
>> XICS-over-XIVE device "handle" interrupt management, that is, perform
>> the loads and stores on the addresses of the ESB pages of the guest
>> interrupts.
>>
>> However, when running in XIVE native exploitation mode, the KVM XIVE
>> native device exposes the interrupt ESB pages to the guest and lets
>> the guest perform the loads and stores directly.
>>
>> The VMA exposing the ESB pages makes use of a custom VM fault handler
>> whose role is to populate the VMA with the appropriate pages. When a
>> fault occurs, the guest IRQ number is deduced from the offset, and the
>> ESB pages of the associated XIVE IPI interrupt are inserted in the VMA
>> (using the internal structure caching information on the interrupts).
>>
>> Supporting device passthrough in a guest running in XIVE native
>> exploitation mode adds some extra refinements because the ESB pages
>> of a different HW controller (PHB4) need to be exposed to the guest
>> along with the initial IPI ESB pages of the XIVE IC controller. But
>> the overall mechanics are the same.
>>
>> When the device HW irqs are mapped into or unmapped from the guest
>> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
>> and kvmppc_xive_clr_mapped(), are called to record or clear the
>> passthrough interrupt information and to perform the switch.
>>
>> The approach taken by this patch is to clear the ESB pages of the
>> guest IRQ number being mapped and let the VM fault handler repopulate
>> them. The handler will insert the ESB page corresponding to the HW
>> interrupt of the device being passed through, or the initial IPI ESB
>> page if the device is being removed.
>
> That's a much better write-up. Thanks.

OK. I will reuse it when the time comes.

Thanks,

C.
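As an addendum to the write-up above, here is a rough sketch of what such a
VM fault handler could look like. This is not the patch code:
kvmppc_xive_find_irq_data() and the way the device state is retrieved are
placeholders, while struct vm_fault, vmf_insert_pfn() and
unmap_mapping_range() are the generic kernel interfaces, and xive_irq_data
is the per-interrupt cache mentioned in the write-up.

#include <linux/mm.h>
#include <linux/kvm_host.h>
#include <asm/xive.h>

/* Placeholder for the lookup in the per-interrupt cache: returns the XIVE
 * IPI data or, if the guest IRQ has been mapped to a passed-through
 * device, the PHB4 data.
 */
static struct xive_irq_data *kvmppc_xive_find_irq_data(void *xive, u32 hwirq);

static vm_fault_t xive_native_esb_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct kvm_device *dev = vma->vm_file->private_data; /* assumption */
	u64 page_offset = vmf->pgoff - vma->vm_pgoff;
	u32 hwirq = page_offset / 2;	/* 2 ESB pages (EOI + trigger) per IRQ */
	struct xive_irq_data *xd;
	u64 page;

	xd = kvmppc_xive_find_irq_data(dev->private, hwirq);
	if (!xd)
		return VM_FAULT_SIGBUS;

	/* Odd page of the pair is the trigger page, even page the EOI page. */
	page = (page_offset % 2) ? xd->trig_page : xd->eoi_page;

	return vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
}

Clearing the ESB pages of a guest IRQ when kvmppc_xive_set_mapped() or
kvmppc_xive_clr_mapped() performs the switch would then boil down to
dropping the two stale PTEs, for instance with:

	unmap_mapping_range(mapping, hwirq * 2 * PAGE_SIZE, 2 * PAGE_SIZE, 1);

where 'mapping' is the address_space backing the mmap of the ESB region, so
that the next guest access faults back into the handler and picks up the
new page.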