From mboxrd@z Thu Jan 1 00:00:00 1970
From: =?UTF-8?Q?C=c3=a9dric_Le_Goater?=
Subject: Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
Date: Wed, 30 Jan 2019 16:54:23 +0100
Message-ID:
References: <20190107191006.10648-1-clg@kaod.org>
 <20190107191006.10648-2-clg@kaod.org>
 <20190122052657.GG15124@blackberry>
 <20190123103009.GB29826@blackberry>
 <75762dbe-0f08-5b06-e376-744ff87ff4cb@kaod.org>
 <20190128061353.GD3237@blackberry>
 <38b0244d-beb6-3aee-c638-279ca570633c@kaod.org>
 <20190129024503.GA11368@blackberry>
 <20190130062041.GD27109@blackberry>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: kvm@vger.kernel.org, kvm-ppc@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, David Gibson
To: Paul Mackerras
In-Reply-To: <20190130062041.GD27109@blackberry>
Content-Language: en-US
List-Id: kvm.vger.kernel.org

On 1/30/19 7:20 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 3:45 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>>>> Would we end up with too many VMAs if we just used mmap() to
>>>>> change the mappings from the software-generated pages to the
>>>>> hardware-generated interrupt pages?
>>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are
>>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can
>>>
>>> Confused. You say the number space has 32768 entries but then imply
>>> there are only 8K entries. Do you mean that the API allows for 15-bit
>>> IRQ numbers but we are only making use of 8192 of them?
>>
>> Ouch. My bad. Let's do it again.
>>
>> The sPAPR IRQ number space is 0x2000 wide:
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
>>
>> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for
>> devices (which can be extended if needed).
>>
>> So that's 8192 x 2 ESB pages.
>>
>>>> extend the last range if needed as these are for MSIs. Dynamic
>>>> extensions under KVM should work also.
>>>>
>>>> That is to say, we have 8K x 2 (trigger+EOI) pages. This is a
>>>> lot of mmap(), too much. Also the KVM model needs to be compatible
>>>
>>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
>>> space would be covered by a single mmap, overlaid by subsequent mmaps
>>> where we need to map real device interrupts.
>>
>> ok. The same fault handler could be used to populate the VMA with the
>> ESB pages.
>>
>> But it would mean extra work on the QEMU side, which is not needed
>> with this patch.
>
> Maybe, but just storing a single vma pointer in our private data is
> not a feasible approach. First, you have no control over the lifetime
> of the vma and thus this is a use-after-free waiting to happen, and
> secondly, there could be multiple vmas that you need to worry about.

I fully agree. That's why I was uncomfortable with the solution. There
are a few other drivers (GPUs if I recall) doing that, but it feels
wrong.

> Userspace could do multiple mmaps, or could do mprotect or similar on
> part of an existing mmap, which would split the vma for the mmap into
> multiple vmas.
> You don't get notified about munmap either as far as I
> can tell, so the vma is liable to go away.

yes ...

> And it doesn't matter if
> QEMU would never do such things; if userspace can provoke a
> use-after-free in the kernel using KVM then someone will write a
> specially crafted user program to do that.
>
> So we either solve it in userspace, or we have to write and maintain
> complex kernel code with deep links into the MM subsystem.
>
> I'd much rather solve it in userspace.

OK, then. I have been reluctant to do so, but it seems there is no other
in-kernel solution.

>>>> with the QEMU emulated one and it was simpler to have one overall
>>>> memory region for the IPI ESBs, one for the END ESBs (if we support
>>>> that one day) and one for the TIMA.
>>>>
>>>>> Are the necessary pages for a PCI
>>>>> passthrough device contiguous in both host real space
>>>>
>>>> They should be, as they are the PHB4 ESBs.
>>>>> and guest real space ?
>>>>
>>>> also. That's how we organized the mapping.
>>>
>>> "How we organized the mapping" is a significant design decision that I
>>> haven't seen documented anywhere, and is really needed for
>>> understanding what's going on.
>>
>> OK. I will add comments on that. See below for some description.
>>
>> There is nothing fancy, it's simply indexed with the interrupt number,
>> like for HW, or for the QEMU XIVE emulated model.
>>
>>>>> If so we'd only need one mmap() for all the device's interrupt
>>>>> pages.
>>>>
>>>> Ah. So we would want to make a special case for the passthrough
>>>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>>>
>>>> Wouldn't that require hot plugging a memory region under the guest?
>>>
>>> No; the way that a memory region works is that userspace can do
>>> whatever disparate mappings it likes within the region on the user
>>> process side, and the corresponding region of guest real address space
>>> follows that automatically.
>>
>> yes. I suppose this should work also for 'ram device' memory mappings.
>>
>> So when the passthrough device is added to the guest, we would add a
>> new 'ram device' memory region for the device interrupt ESB pages
>> that would overlap with the initial guest ESB pages.
>
> Not knowing the QEMU internals all that well, I don't at all
> understand why a new ram device is necessary.

'ram device' memory regions are a special type of region used to map the
result of a mmap() in QEMU directly into the guest. This is how we
propagate the XIVE ESB pages from HW (and the TIMA) to the guest and the
Linux kernel. It has other uses with VFIO.

> I would see it as a
> single virtual area mapping the ESB pages of guest hwirqs that are in
> use, and we manage those mappings with mmap and munmap.

Yes, I think I understand the idea. I will give it a try. I need to find
the right place to do so in QEMU. See the other email thread.

>> This is really a different approach.
>>
>>>> which means that we need to provision an address space/container
>>>> region for these regions. What are the benefits?
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe?
>>>
>>> Explicitly unmapping parts of the VMA seems like the wrong way to do
>>> it. If you have a device mmap where the device wants to change the
>>> physical page underlying parts of the mapping, there should be a way
>>> for it to do that explicitly (but I don't know off the top of my head
>>> what the interface to do that is).
>>>
>>> However, I still haven't seen a clear and concise explanation of what
>>> is being changed, when, and why we need to do that.
>>
>> Yes. I agree on that. The problem is not very different from what we
>> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
>>
>>
>> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
>> implement an IRQ space for the guest using the generic IPI interrupts
>> of the XIVE IC controller. These interrupts are allocated at the OPAL
>> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
>> Interrupt management is performed in the XIVE way: using loads and
>> stores on the addresses of the XIVE IPI interrupt ESB pages.
>>
>> Both KVM devices share the same internal structure caching information
>> on the interrupts, among which the xive_irq_data struct containing the
>> addresses of the IPI ESB pages and an extra one in case of passthrough.
>> The latter contains the addresses of the ESB pages of the underlying HW
>> controller interrupts, PHB4 in all cases for now.
>>
>> A guest running in the XICS legacy interrupt mode lets the KVM
>> XICS-over-XIVE device "handle" interrupt management, that is, perform
>> the loads and stores on the addresses of the ESB pages of the guest
>> interrupts.
>>
>> However, when running in XIVE native exploitation mode, the KVM XIVE
>> native device exposes the interrupt ESB pages to the guest and lets
>> the guest perform the loads and stores directly.
>>
>> The VMA exposing the ESB pages makes use of a custom VM fault handler
>> whose role is to populate the VMA with the appropriate pages. When a
>> fault occurs, the guest IRQ number is deduced from the offset, and the
>> ESB pages of the associated XIVE IPI interrupt are inserted in the VMA
>> (using the internal structure caching information on the interrupts).
>>
>> Supporting device passthrough in a guest running in XIVE native
>> exploitation mode adds some extra refinements because the ESB pages
>> of a different HW controller (PHB4) need to be exposed to the guest
>> along with the initial IPI ESB pages of the XIVE IC controller. But
>> the overall mechanics are the same.
>>
>> When the device HW irqs are mapped into or unmapped from the guest
>> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
>> and kvmppc_xive_clr_mapped(), are called to record or clear the
>> passthrough interrupt information and to perform the switch.
>>
>> The approach taken by this patch is to clear the ESB pages of the
>> guest IRQ number being mapped and let the VM fault handler repopulate
>> them. The handler will insert the ESB page corresponding to the HW
>> interrupt of the device being passed through, or the initial IPI ESB
>> page if the device is being removed.
>
> That's a much better write-up. Thanks.

OK. I will reuse it when the time comes.

Thanks,

C.
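As an addendum to the write-up above, here is a rough sketch of what such a
VM fault handler could look like. This is not the patch code:
kvmppc_xive_find_irq_data() and the way the device state is retrieved are
placeholders, while struct vm_fault, vmf_insert_pfn() and
unmap_mapping_range() are the generic kernel interfaces, and xive_irq_data
is the per-interrupt cache mentioned in the write-up.

#include <linux/mm.h>
#include <linux/kvm_host.h>
#include <asm/xive.h>

/* Placeholder for the lookup in the per-interrupt cache: returns the XIVE
 * IPI data or, if the guest IRQ has been mapped to a passed-through
 * device, the PHB4 data.
 */
static struct xive_irq_data *kvmppc_xive_find_irq_data(void *xive, u32 hwirq);

static vm_fault_t xive_native_esb_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct kvm_device *dev = vma->vm_file->private_data; /* assumption */
	u64 page_offset = vmf->pgoff - vma->vm_pgoff;
	u32 hwirq = page_offset / 2;	/* 2 ESB pages (EOI + trigger) per IRQ */
	struct xive_irq_data *xd;
	u64 page;

	xd = kvmppc_xive_find_irq_data(dev->private, hwirq);
	if (!xd)
		return VM_FAULT_SIGBUS;

	/* Odd page of the pair is the trigger page, even page the EOI page. */
	page = (page_offset % 2) ? xd->trig_page : xd->eoi_page;

	return vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
}

Clearing the ESB pages of a guest IRQ when kvmppc_xive_set_mapped() or
kvmppc_xive_clr_mapped() performs the switch would then boil down to
dropping the two stale PTEs, for instance with:

	unmap_mapping_range(mapping, hwirq * 2 * PAGE_SIZE, 2 * PAGE_SIZE, 1);

where 'mapping' is the address_space backing the mmap of the ESB region, so
that the next guest access faults back into the handler and picks up the
new page.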