[RFC,2/3] vfio/hisilicon: register the driver to vfio

Message ID 1618284983-55581-3-git-send-email-liulongfang@huawei.com
State New
Series
  • vfio/hisilicon: add acc live migration driver

Commit Message

liulongfang April 13, 2021, 3:36 a.m. UTC
Register the live migration driver of the accelerator module to vfio

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
 drivers/vfio/pci/vfio_pci_private.h |  9 +++++++++
 2 files changed, 20 insertions(+)

Comments

Jason Gunthorpe April 15, 2021, 10:01 p.m. UTC | #1
On Tue, Apr 13, 2021 at 11:36:22AM +0800, Longfang Liu wrote:
> Register the live migration driver of the accelerator module to vfio
> 
> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
>  drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |  9 +++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b..e1b0e37 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -407,6 +407,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		}
>  	}
>  
> +	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
> +		ret = vfio_pci_hisilicon_acc_init(vdev);

This is exactly what we want to avoid with the work we are doing on
the vfio-pci modularity.

It is a complete mess to just keep adding more stuff here, and as
we've discussed to death really ugly to have such a ridiculously wide
ID match.
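The objection can be made concrete with a small userspace model of ID matching. The mock types below are illustrative stand-ins, not the real kernel structures, and 0xa001 is a made-up device ID; only PCI_VENDOR_ID_HUAWEI (0x19e5) is real. The patch's vendor-only test behaves like a wildcard table entry that claims every HiSilicon function, where a split driver in the modular vfio-pci scheme would carry a precise id table:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for PCI_ANY_ID / PCI_VENDOR_ID_HUAWEI and struct pci_device_id. */
#define MOCK_PCI_ANY_ID        0xffffu
#define MOCK_VENDOR_ID_HUAWEI  0x19e5u

struct mock_pci_id { unsigned int vendor; unsigned int device; };

/* Returns nonzero if (vendor, device) matches any table entry, mirroring
 * how the PCI core walks a driver's id_table at probe time. */
static int mock_id_match(const struct mock_pci_id *tbl, size_t n,
                         unsigned int vendor, unsigned int device)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].vendor == vendor &&
            (tbl[i].device == MOCK_PCI_ANY_ID || tbl[i].device == device))
            return 1;
    return 0;
}

/* The patch's vendor check is equivalent to this wildcard entry: every
 * HiSilicon function that binds to vfio-pci takes the migration path. */
static const struct mock_pci_id wide_match[] = {
    { MOCK_VENDOR_ID_HUAWEI, MOCK_PCI_ANY_ID },
};

/* A dedicated vendor driver would instead list only the accelerator VFs
 * it actually supports (0xa001 is a hypothetical device ID). */
static const struct mock_pci_id narrow_match[] = {
    { MOCK_VENDOR_ID_HUAWEI, 0xa001 },
};
```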

You should be working with us on that project and base your work on
top of Max's series.. Alex given the interest here I think we should
find some way forward while we work on completed version of the mlx5
pci vfio driver..

I'm also confused how this works securely at all, as a general rule a
VFIO PCI driver cannot access the MMIO memory of the function it is
planning to assign to the guest. There is a lot of danger that the
guest could access that MMIO space one way or another.

Here I see the driver obtaining a devm_ioremap() of the same pdev it
is going to assign (and I really wonder why pci_release_mem_regions()
exists at all..)

This is why the mlx5 RFC was showing how to juggle two PCI devices via
the aux device connector.

Jason
liulongfang April 19, 2021, 12:24 p.m. UTC | #2
On 2021/4/16 6:01, Jason Gunthorpe wrote:
> On Tue, Apr 13, 2021 at 11:36:22AM +0800, Longfang Liu wrote:
>> Register the live migration driver of the accelerator module to vfio
>>
>> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
>>  drivers/vfio/pci/vfio_pci.c         | 11 +++++++++++
>>  drivers/vfio/pci/vfio_pci_private.h |  9 +++++++++
>>  2 files changed, 20 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index 65e7e6b..e1b0e37 100644
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -407,6 +407,17 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>  		}
>>  	}
>>  
>> +	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
>> +		ret = vfio_pci_hisilicon_acc_init(vdev);
> 
> This is exactly what we want to avoid with the work we are doing on
> the vfio-pci modularity.
> 
> It is a complete mess to just keep adding more stuff here, and as
> we've discussed to death really ugly to have such a ridiculously wide
> ID match.
> 
> You should be working with us on that project and base your work on
> top of Max's series.. Alex given the interest here I think we should
> find some way forward while we work on completed version of the mlx5
> pci vfio driver..
> 
> I'm also confused how this works securely at all, as a general rule a
> VFIO PCI driver cannot access the MMIO memory of the function it is
> planning to assign to the guest. There is a lot of danger that the
> guest could access that MMIO space one way or another.
> 
VF's MMIO memory is divided into two parts, one is the guest part,
and the other is the live migration part. They do not affect each other,
so there is no security problem.

> Here I see the driver obtaining a devm_ioremap() of the same pdev it
> is going to assign (and I really wonder why pci_release_mem_regions()
> exists at all..)
> 
The driver here maps the VF's MMIO memory through devm_ioremap(),
and pci_release_mem_regions() is added so that the guest can still
claim the MMIO regions after the host driver has mapped them.

If pci_release_mem_regions() is not called here,
pci_request_mem_regions() in the guest will return an error,
and the guest will not be able to obtain the MMIO address of the VF.
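The ownership conflict being described can be sketched as a tiny userspace model, under the assumption that the BAR regions behave like a single-owner resource; the mock names below are stand-ins, not the real PCI core API:

```c
#include <assert.h>

#define MOCK_EBUSY 16  /* stand-in for the kernel's -EBUSY */

/* A BAR's memory regions have one owner at a time in the resource tree;
 * this models that with a single flag. */
struct mock_bar { int claimed; };

/* Models pci_request_mem_regions(): fails if someone already holds it. */
static int mock_request_regions(struct mock_bar *bar)
{
    if (bar->claimed)
        return -MOCK_EBUSY;
    bar->claimed = 1;
    return 0;
}

/* Models the pci_release_mem_regions() step described above: the host
 * drops its claim so a later request for the regions can succeed. */
static void mock_release_regions(struct mock_bar *bar)
{
    bar->claimed = 0;
}
```

This is exactly the sharing Jason objects to in the next message: the release exists only to let two drivers use regions of the same function.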

> This is why the mlx5 RFC was showing how to juggle two PCI devices via
> the aux device connector.
> 
I have also noticed the mlx5 RFC, but it will take some time to analyze it.
If the analysis goes well, I will adopt the mlx5 RFC approach in our driver
and use it to test and verify our live migration function.

If we use the mlx5 RFC, what should we pay attention to?

> Jason
> .
> 
Thanks.
Longfang.
Jason Gunthorpe April 19, 2021, 12:33 p.m. UTC | #3
On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:

> > I'm also confused how this works securely at all, as a general rule a
> > VFIO PCI driver cannot access the MMIO memory of the function it is
> > planning to assign to the guest. There is a lot of danger that the
> > guest could access that MMIO space one way or another.
> 
> VF's MMIO memory is divided into two parts, one is the guest part,
> and the other is the live migration part. They do not affect each other,
> so there is no security problem.

AFAIK there are several scenarios where a guest can access this MMIO
memory using DMA even if it is not mapped into the guest for CPU
access.

> If pci_release_mem_regions() is not called here,
> pci_request_mem_regions() in the guest will return an error,
> and the guest will not be able to obtain the MMIO address of the VF.

Which is why VFIO has this protection to prevent sharing MMIO regions
on the VF assigned to the guest

Jason
liulongfang April 20, 2021, 12:50 p.m. UTC | #4
On 2021/4/19 20:33, Jason Gunthorpe wrote:
> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> 
>>> I'm also confused how this works securely at all, as a general rule a
>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>> planning to assign to the guest. There is a lot of danger that the
>>> guest could access that MMIO space one way or another.
>>
>> VF's MMIO memory is divided into two parts, one is the guest part,
>> and the other is the live migration part. They do not affect each other,
>> so there is no security problem.
> 
> AFAIK there are several scenarios where a guest can access this MMIO
> memory using DMA even if it is not mapped into the guest for CPU
> access.
> 
The hardware divides the VF's MMIO memory into two parts. The live migration
driver in the host uses the live migration part, and the device driver in
the guest uses the guest part. Each driver obtains the address of its part
of the VF's MMIO memory. Although the two parts are contiguous on the
hardware device, by design neither driver operates on the other part's
memory, and the device hardware responds to the operation commands of the
two parts independently.
So, I still don't understand what the security risk you are talking about is,
and what do you think the security design should look like?
Can you elaborate on it?

>> If pci_release_mem_regions() is not called here,
>> pci_request_mem_regions() in the guest will return an error,
>> and the guest will not be able to obtain the MMIO address of the VF.
> 
> Which is why VFIO has this protection to prevent sharing MMIO regions
> on the VF assigned to the guest
> 
> Jason
> .
> 

Thanks.
Longfang.
Jason Gunthorpe April 20, 2021, 12:59 p.m. UTC | #5
On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
> On 2021/4/19 20:33, Jason Gunthorpe wrote:
> > On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> > 
> >>> I'm also confused how this works securely at all, as a general rule a
> >>> VFIO PCI driver cannot access the MMIO memory of the function it is
> >>> planning to assign to the guest. There is a lot of danger that the
> >>> guest could access that MMIO space one way or another.
> >>
> >> VF's MMIO memory is divided into two parts, one is the guest part,
> >> and the other is the live migration part. They do not affect each other,
> >> so there is no security problem.
> > 
> > AFAIK there are several scenarios where a guest can access this MMIO
> > memory using DMA even if it is not mapped into the guest for CPU
> > access.
> > 
> The hardware divides the VF's MMIO memory into two parts. The live migration
> driver in the host uses the live migration part, and the device driver in
> the guest uses the guest part. Each driver obtains the address of its part
> of the VF's MMIO memory. Although the two parts are contiguous on the
> hardware device, by design neither driver operates on the other part's
> memory, and the device hardware responds to the operation commands of the
> two parts independently.

It doesn't matter, the memory is still under the same PCI BDF and VFIO
supports scenarios where devices in the same IOMMU group are not
isolated from each other.

This is why the granularity of isolation is a PCI BDF - VFIO directly
blocks kernel drivers from attaching to PCI BDFs that are not
completely isolated from VFIO BDF.

Bypassing this prevention and attaching a kernel driver directly to
the same BDF being exposed to the guest breaks that isolation model.
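The rule being described can be modeled as a check over IOMMU groups. These are mock structures for illustration; the real enforcement is VFIO's group viability test in the kernel:

```c
#include <assert.h>

struct mock_dev {
    int iommu_group;   /* devices sharing a group are not isolated */
    int user_owned;    /* bound to vfio and exposed to a guest */
};

/* A host kernel driver may bind to devs[target] only if no other device
 * in the same IOMMU group is owned by userspace/a guest; otherwise the
 * isolation model described above is broken. */
static int mock_safe_to_bind(const struct mock_dev *devs, int n, int target)
{
    for (int i = 0; i < n; i++)
        if (i != target && devs[i].user_owned &&
            devs[i].iommu_group == devs[target].iommu_group)
            return 0;
    return 1;
}
```

The hisilicon patch sidesteps this check entirely by putting both roles on one BDF, which is why the group model cannot see the conflict.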

> So, I still don't understand what the security risk you are talking about is,
> and what do you think the security design should look like?
> Can you elaborate on it?

Each security domain must have its own PCI BDF.

The migration control registers must be on a different VF from the VF
being plugged into a guest and the two VFs have to be in different
IOMMU groups to ensure they are isolated from each other.

Jason
liulongfang April 20, 2021, 1:28 p.m. UTC | #6
On 2021/4/20 20:59, Jason Gunthorpe wrote:
> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
>> On 2021/4/19 20:33, Jason Gunthorpe wrote:
>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
>>>
>>>>> I'm also confused how this works securely at all, as a general rule a
>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>>>> planning to assign to the guest. There is a lot of danger that the
>>>>> guest could access that MMIO space one way or another.
>>>>
>>>> VF's MMIO memory is divided into two parts, one is the guest part,
>>>> and the other is the live migration part. They do not affect each other,
>>>> so there is no security problem.
>>>
>>> AFAIK there are several scenarios where a guest can access this MMIO
>>> memory using DMA even if it is not mapped into the guest for CPU
>>> access.
>>>
>> The hardware divides the VF's MMIO memory into two parts. The live migration
>> driver in the host uses the live migration part, and the device driver in
>> the guest uses the guest part. Each driver obtains the address of its part
>> of the VF's MMIO memory. Although the two parts are contiguous on the
>> hardware device, by design neither driver operates on the other part's
>> memory, and the device hardware responds to the operation commands of the
>> two parts independently.
> 
> It doesn't matter, the memory is still under the same PCI BDF and VFIO
> supports scenarios where devices in the same IOMMU group are not
> isolated from each other.
> 
> This is why the granularity of isolation is a PCI BDF - VFIO directly
> blocks kernel drivers from attaching to PCI BDFs that are not
> completely isolated from VFIO BDF.
> 
> Bypassing this prevention and attaching a kernel driver directly to
> the same BDF being exposed to the guest breaks that isolation model.
> 
>> So, I still don't understand what the security risk you are talking about is,
>> and what do you think the security design should look like?
>> Can you elaborate on it?
> 
> Each security domain must have its own PCI BDF.
> 
The basic unit that performs live migration is the VF, and the basic
function of the VF is the business function of the device. If the live
migration function and the business function are completely separated and
bound to separate VFs, then a VF that can execute business functions in
the guest cannot perform live migration, and a VF with the live migration
function cannot perform business functions in the guest.

In fact, the scenario that requires live migration is migrating a VF's
business functions to the VFs of other hosts when its business functions
are heavily loaded.
This usage scenario itself requires the VF to have both functions at the
same time.

> The migration control registers must be on a different VF from the VF
> being plugged into a guest and the two VFs have to be in different
> IOMMU groups to ensure they are isolated from each other.
> 
> Jason
> .
> 

Thanks.
Longfang.
Jason Gunthorpe April 20, 2021, 2:55 p.m. UTC | #7
On Tue, Apr 20, 2021 at 09:28:46PM +0800, liulongfang wrote:
> >> So, I still don't understand what the security risk you are talking about is,
> >> and what do you think the security design should look like?
> >> Can you elaborate on it?
> > 
> > Each security domain must have its own PCI BDF.
> > 
> The basic unit that performs live migration is the VF, and the basic
> function of the VF is the business function of the device. If the live
> migration function and the business function are completely separated and
> bound to separate VFs, then a VF that can execute business functions in
> the guest cannot perform live migration, and a VF with the live migration
> function cannot perform business functions in the guest.
> 
> In fact, the scenario that requires live migration is migrating a VF's
> business functions to the VFs of other hosts when its business functions
> are heavily loaded.
> This usage scenario itself requires the VF to have both functions at the
> same time.

Other vendors are managing, it is not an unsolvable problem.

I think these patches can't advance in any form without somehow
addressing the isolation issue.

Jason
Alex Williamson April 20, 2021, 10:04 p.m. UTC | #8
On Tue, 20 Apr 2021 09:59:57 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
> > On 2021/4/19 20:33, Jason Gunthorpe wrote:  
> > > On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> > >   
> > >>> I'm also confused how this works securely at all, as a general rule a
> > >>> VFIO PCI driver cannot access the MMIO memory of the function it is
> > >>> planning to assign to the guest. There is a lot of danger that the
> > >>> guest could access that MMIO space one way or another.  
> > >>
> > >> VF's MMIO memory is divided into two parts, one is the guest part,
> > >> and the other is the live migration part. They do not affect each other,
> > >> so there is no security problem.  
> > > 
> > > AFAIK there are several scenarios where a guest can access this MMIO
> > > memory using DMA even if it is not mapped into the guest for CPU
> > > access.
> > >   
> > The hardware divides the VF's MMIO memory into two parts. The live migration
> > driver in the host uses the live migration part, and the device driver in
> > the guest uses the guest part. Each driver obtains the address of its part
> > of the VF's MMIO memory. Although the two parts are contiguous on the
> > hardware device, by design neither driver operates on the other part's
> > memory, and the device hardware responds to the operation commands of the
> > two parts independently.
> 
> It doesn't matter, the memory is still under the same PCI BDF and VFIO
> supports scenarios where devices in the same IOMMU group are not
> isolated from each other.
> 
> This is why the granularity of isolation is a PCI BDF - VFIO directly
> blocks kernel drivers from attaching to PCI BDFs that are not
> completely isolated from VFIO BDF.
> 
> Bypassing this prevention and attaching a kernel driver directly to
> the same BDF being exposed to the guest breaks that isolation model.
> 
> > So, I still don't understand what the security risk you are talking about is,
> > and what do you think the security design should look like?
> > Can you elaborate on it?  
> 
> Each security domain must have its own PCI BDF.
> 
> The migration control registers must be on a different VF from the VF
> being plugged into a guest and the two VFs have to be in different
> IOMMU groups to ensure they are isolated from each other.

I think that's a solution, I don't know if it's the only solution.
AIUI, the issue here is that we have a device specific kernel driver
extending vfio-pci with migration support for this device by using an
MMIO region of the same device.  This is susceptible to DMA
manipulation by the user device.   Whether that's a security issue or
not depends on how the user can break the device.  If the scope is
limited to breaking their own device, they can do that any number of
ways and it's not very interesting.  If the user can manipulate device
state in order to trigger an exploit of the host-side kernel driver,
that's obviously more of a problem.

The other side of this is that if migration support can be implemented
entirely within the VF using this portion of the device MMIO space, why
do we need the host kernel to support this rather than implementing it
in userspace?  For example, QEMU could know about this device,
manipulate the BAR size to expose only the operational portion of MMIO
to the VM and use the remainder to support migration itself.  I'm
afraid that just like mdev, the vfio migration uAPI is going to be used
as an excuse to create kernel drivers simply to be able to make use of
that uAPI.  I haven't looked at this driver to know if it has some
other reason to exist beyond what could be done through vfio-pci and
userspace migration support.  Thanks,

Alex
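Alex's userspace alternative amounts to window arithmetic over the BAR: QEMU would report a truncated BAR size to the VM and keep the tail of the region for its own migration use. A sketch of that split, with made-up sizes for illustration (mock names, not QEMU's actual region API):

```c
#include <assert.h>
#include <stddef.h>

struct mock_bar_split {
    size_t guest_size;   /* what the VM sees as the BAR size */
    size_t mig_offset;   /* start of the region QEMU keeps private */
    size_t mig_size;
};

/* Splits a BAR of total_size at split: [0, split) is exposed to the VM
 * as the operational MMIO, [split, total_size) stays with QEMU for
 * migration. Returns 0 on success, -1 for a bogus split point. */
static int mock_split_bar(size_t total_size, size_t split,
                          struct mock_bar_split *out)
{
    if (split == 0 || split >= total_size)
        return -1;
    out->guest_size = split;
    out->mig_offset = split;
    out->mig_size   = total_size - split;
    return 0;
}
```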
Jason Gunthorpe April 20, 2021, 11:18 p.m. UTC | #9
On Tue, Apr 20, 2021 at 04:04:57PM -0600, Alex Williamson wrote:

> > The migration control registers must be on a different VF from the VF
> > being plugged into a guest and the two VFs have to be in different
> > IOMMU groups to ensure they are isolated from each other.
> 
> I think that's a solution, I don't know if it's the only solution.

Maybe, but that approach does offer DMA access for the migration. For
instance to implement something that needs a lot of data like
migrating a complicated device state, or dirty page tracking or
whatever.

This driver seems very simple - it has only 17 state elements - and
doesn't use DMA.

I can't quite tell, but does this pass the hypervisor BAR into the
guest anyhow? That would certainly be an adequate statement that it is
safe, assuming someone did a good security analysis.

> ways and it's not very interesting.  If the user can manipulate device
> state in order to trigger an exploit of the host-side kernel driver,
> that's obviously more of a problem.

Well, for instance, we have an implementation of
(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING) which means the
guest CPUs are still running and a hostile guest can be manipulating
the device.

But this driver is running code, like vf_qm_state_pre_save() in this
state. Looks very suspicious.

One quick attack I can imagine is to use the guest CPU to DOS the
migration and permanently block it, eg by causing qm_mb() or other
looping functions to fail.

There may be worse things possible, it is a bit hard to tell just from
the code.
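The concern about looping functions can be illustrated with a polling helper. Without a retry bound, a guest that keeps the status busy stalls the migration thread forever; a bounded loop at least converts the stall into an error. The poll callbacks and constants below are hypothetical stand-ins, not the real QM mailbox interface:

```c
#include <assert.h>

#define MOCK_ETIMEDOUT 110  /* stand-in for the kernel's -ETIMEDOUT */

/* Polls a completion callback up to max_tries times. A while (!poll())
 * loop with no bound is the shape of the attack: guest-controlled
 * hardware state can make poll() return 0 forever. */
static int mock_wait_mailbox(int (*poll)(void *), void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++)
        if (poll(ctx))
            return 0;
    return -MOCK_ETIMEDOUT;
}

/* A status a hostile guest pins busy forever. */
static int mock_always_busy(void *ctx) { (void)ctx; return 0; }

/* A well-behaved device that completes on the third poll. */
static int mock_done_on_third(void *ctx)
{
    int *calls = ctx;
    return ++(*calls) >= 3;
}
```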

.. also drivers should not be open coding ARM assembly as in
qm_mb_write()

.. and also, code cannot randomly call pci_get_drvdata() on a struct
device it isn't attached to without verifying the right driver is
bound, or locking correctly.

> manipulate the BAR size to expose only the operational portion of MMIO
> to the VM and use the remainder to support migration itself.  I'm
> afraid that just like mdev, the vfio migration uAPI is going to be used
> as an excuse to create kernel drivers simply to be able to make use of
> that uAPI.

I thought that is the general direction people had agreed on during
the IDXD mdev discussion?

People want the IOCTLs from VFIO to be the single API to program all
the VMMs to and to not implement user space drivers..

This actually seems like a great candidate for a userspace driver.

I would like to know we are still settled on this direction as the
mlx5 drivers we are working on also have some complicated option to be
user space only.

Jason
liulongfang April 21, 2021, 9:59 a.m. UTC | #10
On 2021/4/21 6:04, Alex Williamson wrote:
> On Tue, 20 Apr 2021 09:59:57 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
>>> On 2021/4/19 20:33, Jason Gunthorpe wrote:  
>>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
>>>>   
>>>>>> I'm also confused how this works securely at all, as a general rule a
>>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>>>>> planning to assign to the guest. There is a lot of danger that the
>>>>>> guest could access that MMIO space one way or another.  
>>>>>
>>>>> VF's MMIO memory is divided into two parts, one is the guest part,
>>>>> and the other is the live migration part. They do not affect each other,
>>>>> so there is no security problem.  
>>>>
>>>> AFAIK there are several scenarios where a guest can access this MMIO
>>>> memory using DMA even if it is not mapped into the guest for CPU
>>>> access.
>>>>   
>>> The hardware divides the VF's MMIO memory into two parts. The live migration
>>> driver in the host uses the live migration part, and the device driver in
>>> the guest uses the guest part. Each driver obtains the address of its part
>>> of the VF's MMIO memory. Although the two parts are contiguous on the
>>> hardware device, by design neither driver operates on the other part's
>>> memory, and the device hardware responds to the operation commands of the
>>> two parts independently.
>>
>> It doesn't matter, the memory is still under the same PCI BDF and VFIO
>> supports scenarios where devices in the same IOMMU group are not
>> isolated from each other.
>>
>> This is why the granularity of isolation is a PCI BDF - VFIO directly
>> blocks kernel drivers from attaching to PCI BDFs that are not
>> completely isolated from VFIO BDF.
>>
>> Bypassing this prevention and attaching a kernel driver directly to
>> the same BDF being exposed to the guest breaks that isolation model.
>>
>>> So, I still don't understand what the security risk you are talking about is,
>>> and what do you think the security design should look like?
>>> Can you elaborate on it?  
>>
>> Each security domain must have its own PCI BDF.
>>
>> The migration control registers must be on a different VF from the VF
>> being plugged into a guest and the two VFs have to be in different
>> IOMMU groups to ensure they are isolated from each other.
> 
> I think that's a solution, I don't know if it's the only solution.
> AIUI, the issue here is that we have a device specific kernel driver
> extending vfio-pci with migration support for this device by using an

If the two parts of the MMIO region are split into different BAR spaces on
the device, the MMIO region of the business function stays in BAR2,
and the MMIO region of the live migration function moves to BAR4.
Only BAR2 is mapped in the guest, and only BAR4 is mapped in the host.
This can solve the security issue.

> MMIO region of the same device.  This is susceptible to DMA
> manipulation by the user device.   Whether that's a security issue or
> not depends on how the user can break the device.  If the scope is
> limited to breaking their own device, they can do that any number of
> ways and it's not very interesting.  If the user can manipulate device
> state in order to trigger an exploit of the host-side kernel driver,
> that's obviously more of a problem.
> 
> The other side of this is that if migration support can be implemented
> entirely within the VF using this portion of the device MMIO space, why
> do we need the host kernel to support this rather than implementing it
> in userspace?  For example, QEMU could know about this device,
> manipulate the BAR size to expose only the operational portion of MMIO
> to the VM and use the remainder to support migration itself.  I'm
> afraid that just like mdev, the vfio migration uAPI is going to be used
> as an excuse to create kernel drivers simply to be able to make use of
> that uAPI.  I haven't looked at this driver to know if it has some

When the accelerator device was designed to support the live migration
function, it was built around the migration-region uAPI, so the live
migration function requires a driver that connects to that uAPI.
Is this set of interfaces not open to us now?

> other reason to exist beyond what could be done through vfio-pci and
> userspace migration support.  Thanks,
> 
> Alex
> 
> .
> 
Thanks,
Longfang.
liulongfang April 21, 2021, 9:59 a.m. UTC | #11
On 2021/4/21 7:18, Jason Gunthorpe wrote:
> On Tue, Apr 20, 2021 at 04:04:57PM -0600, Alex Williamson wrote:
> 
>>> The migration control registers must be on a different VF from the VF
>>> being plugged into a guest and the two VFs have to be in different
>>> IOMMU groups to ensure they are isolated from each other.
>>
>> I think that's a solution, I don't know if it's the only solution.
> 
> Maybe, but that approach does offer DMA access for the migration. For
> instance to implement something that needs a lot of data like
> migrating a complicated device state, or dirty page tracking or
> whatever.
> 
> This driver seems very simple - it has only 17 state elements - and
> doesn't use DMA.
> 
Yes, this driver operates on MMIO addresses rather than DMA addresses,
but the hardware's internal DMA addresses are carried as part of the
migration data.

> I can't quite tell, but does this pass the hypervisor BAR into the
> guest anyhow? That would certainly be an adequate statement that it is
> safe, assuming someone did a good security analysis.
> 
>> ways and it's not very interesting.  If the user can manipulate device
>> state in order to trigger an exploit of the host-side kernel driver,
>> that's obviously more of a problem.
> 
> Well, for instance, we have an implementation of
> (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING) which means the
> guest CPUs are still running and a hostile guest can be manipulating
> the device.
> 
> But this driver is running code, like vf_qm_state_pre_save() in this
> state. Looks very suspicious.
> 
> One quick attack I can imagine is to use the guest CPU to DOS the
> migration and permanently block it, eg by causing qm_mb() or other
> looping functions to fail.
> 
> There may be worse things possible, it is a bit hard to tell just from
> the code.
> 
> .. also drivers should not be open coding ARM assembly as in
> qm_mb_write()
> 
OK, this code needs to be encapsulated and should not appear in
this driver.

> .. and also, code cannot randomly call pci_get_drvdata() on a struct
> device it isn't attached to without verifying the right driver is
> bound, or locking correctly.
> 
Yes, this call needs to be placed behind an encapsulated interface,
and access protection needs to be added.

>> manipulate the BAR size to expose only the operational portion of MMIO
>> to the VM and use the remainder to support migration itself.  I'm
>> afraid that just like mdev, the vfio migration uAPI is going to be used
>> as an excuse to create kernel drivers simply to be able to make use of
>> that uAPI.
> 
> I thought that is the general direction people had agreed on during
> the IDXD mdev discussion?
> 
> People want the IOCTLs from VFIO to be the single API to program all
> the VMMs to and to not implement user space drivers..
> 
> This actually seems like a great candidate for a userspace driver.
> 
> I would like to know we are still settled on this direction as the
> mlx5 drivers we are working on also have some complicated option to be
> user space only.
> 
> Jason
> .
>
Alex Williamson April 21, 2021, 6:12 p.m. UTC | #12
On Wed, 21 Apr 2021 17:59:02 +0800
liulongfang <liulongfang@huawei.com> wrote:

> On 2021/4/21 6:04, Alex Williamson wrote:
> > On Tue, 20 Apr 2021 09:59:57 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> >> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:  
> >>> On 2021/4/19 20:33, Jason Gunthorpe wrote:    
> >>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
> >>>>     
> >>>>>> I'm also confused how this works securely at all, as a general rule a
> >>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
> >>>>>> planning to assign to the guest. There is a lot of danger that the
> >>>>>> guest could access that MMIO space one way or another.    
> >>>>>
> >>>>> VF's MMIO memory is divided into two parts, one is the guest part,
> >>>>> and the other is the live migration part. They do not affect each other,
> >>>>> so there is no security problem.    
> >>>>
> >>>> AFAIK there are several scenarios where a guest can access this MMIO
> >>>> memory using DMA even if it is not mapped into the guest for CPU
> >>>> access.
> >>>>     
> >>> The hardware divides the VF's MMIO memory into two parts. The live migration
> >>> driver in the host uses the live migration part, and the device driver in
> >>> the guest uses the guest part. Each driver obtains the address of its part
> >>> of the VF's MMIO memory. Although the two parts are contiguous on the
> >>> hardware device, by design neither driver operates on the other part's
> >>> memory, and the device hardware responds to the operation commands of the
> >>> two parts independently.
> >>
> >> It doesn't matter, the memory is still under the same PCI BDF and VFIO
> >> supports scenarios where devices in the same IOMMU group are not
> >> isolated from each other.
> >>
> >> This is why the granularity of isolation is a PCI BDF - VFIO directly
> >> blocks kernel drivers from attaching to PCI BDFs that are not
> >> completely isolated from VFIO BDF.
> >>
> >> Bypassing this prevention and attaching a kernel driver directly to
> >> the same BDF being exposed to the guest breaks that isolation model.
> >>  
> >>> So, I still don't understand what the security risk you are talking about is,
> >>> and what do you think the security design should look like?
> >>> Can you elaborate on it?    
> >>
> >> Each security domain must have its own PCI BDF.
> >>
> >> The migration control registers must be on a different VF from the VF
> >> being plugged into a guest and the two VFs have to be in different
> >> IOMMU groups to ensure they are isolated from each other.  
> > 
> > I think that's a solution, I don't know if it's the only solution.
> > AIUI, the issue here is that we have a device specific kernel driver
> > extending vfio-pci with migration support for this device by using an  
> 
> If the two parts of the MMIO region are split into different BAR spaces on
> the device, the MMIO region of the business function stays in BAR2,
> and the MMIO region of the live migration function moves to BAR4.
> Only BAR2 is mapped in the guest, and only BAR4 is mapped in the host.
> This can solve the security issue.

The concern is really the "on the device" part rather than whether the
resources are within the same BAR or not.  We need to assume that a
user driver can generate a DMA targeting any address in the system,
including in this case the user driver could generate a DMA targeting
this migration BAR.  Ideally this would be routed upstream to the IOMMU
where it would be blocked for lack of a translation entry.  However,
because this range resides on the same PCIe requester ID, it's
logically more similar to a two-function device where the functions are
considered non-isolated and are therefore exposed within the same IOMMU
group.  We would not allow a kernel driver for one of those functions
and a userspace driver for the other.  In this case those drivers are
strongly related, but we still need to consider to what extent a
malicious user driver can interfere with or exploit the kernel side
driver.


> > MMIO region of the same device.  This is susceptible to DMA
> > manipulation by the user device.  Whether that's a security issue or
> > not depends on how the user can break the device.  If the scope is
> > limited to breaking their own device, they can do that any number of
> > ways and it's not very interesting.  If the user can manipulate device
> > state in order to trigger an exploit of the host-side kernel driver,
> > that's obviously more of a problem.
> > 
> > The other side of this is that if migration support can be implemented
> > entirely within the VF using this portion of the device MMIO space, why
> > do we need the host kernel to support this rather than implementing it
> > in userspace?  For example, QEMU could know about this device,
> > manipulate the BAR size to expose only the operational portion of MMIO
> > to the VM and use the remainder to support migration itself.  I'm
> > afraid that just like mdev, the vfio migration uAPI is going to be used
> > as an excuse to create kernel drivers simply to be able to make use of
> > that uAPI.  I haven't looked at this driver to know if it has some  
> 
> When the accelerator device was designed to support live migration, the
> implementation was built on the migration region uAPI, so the live
> migration function requires a driver that connects to this uAPI.
> Is this set of interfaces not open to us now?

In your model, if both BARs are exposed to userspace and a device
specific extension in QEMU claims the migration BAR rather than
exposing it to the VM, could that driver mimic the migration region
uAPI from userspace?  For example, you don't need page pinning to
interact with the IOMMU, you don't need resources beyond the scope
of the endpoint device itself, and the migration info BAR is safe for
userspace to manage?  If so, then a kernel-based driver to expose a
migration uAPI seems like it's only a risk for the kernel, ie. moving
what could be a userspace driver into the kernel for the convenience of
re-using a kernel uAPI.  Thanks,

Alex
liulongfang April 26, 2021, 11:49 a.m. UTC | #13
On 2021/4/22 2:12, Alex Williamson wrote:
> On Wed, 21 Apr 2021 17:59:02 +0800
> liulongfang <liulongfang@huawei.com> wrote:
> 
>> On 2021/4/21 6:04, Alex Williamson wrote:
>>> On Tue, 20 Apr 2021 09:59:57 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>   
>>>> On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:  
>>>>> On 2021/4/19 20:33, Jason Gunthorpe wrote:    
>>>>>> On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
>>>>>>     
>>>>>>>> I'm also confused how this works securely at all, as a general rule a
>>>>>>>> VFIO PCI driver cannot access the MMIO memory of the function it is
>>>>>>>> planning to assign to the guest. There is a lot of danger that the
>>>>>>>> guest could access that MMIO space one way or another.    
>>>>>>>
>>>>>>> VF's MMIO memory is divided into two parts, one is the guest part,
>>>>>>> and the other is the live migration part. They do not affect each other,
>>>>>>> so there is no security problem.    
>>>>>>
>>>>>> AFAIK there are several scenarios where a guest can access this MMIO
>>>>>> memory using DMA even if it is not mapped into the guest for CPU
>>>>>> access.
>>>>>>     
>>>>> The hardware divides VF's MMIO memory into two parts. The live migration
>>>>> driver in the host uses the live migration part, and the device driver in
>>>>> the guest uses the guest part. They obtain the address of VF's MMIO memory
>>>>> in their respective drivers. Although these two parts of the memory are
>>>>> contiguous on the hardware device, the needs of the driver functions mean
>>>>> that neither will operate on the other part's memory, and the device
>>>>> hardware also responds independently to the operation commands of the
>>>>> two parts.    
>>>>
>>>> It doesn't matter, the memory is still under the same PCI BDF and VFIO
>>>> supports scenarios where devices in the same IOMMU group are not
>>>> isolated from each other.
>>>>
>>>> This is why the granularity of isolation is a PCI BDF - VFIO directly
>>>> blocks kernel drivers from attaching to PCI BDFs that are not
>>>> completely isolated from VFIO BDF.
>>>>
>>>> Bypassing this prevention and attaching a kernel driver directly to
>>>> the same BDF being exposed to the guest breaks that isolation model.
>>>>  
>>>>> So, I still don't understand what the security risk you are talking about is,
>>>>> and what do you think the security design should look like?
>>>>> Can you elaborate on it?    
>>>>
>>>> Each security domain must have its own PCI BDF.
>>>>
>>>> The migration control registers must be on a different VF from the VF
>>>> being plugged into a guest and the two VFs have to be in different
>>>> IOMMU groups to ensure they are isolated from each other.  
>>>
>>> I think that's a solution, I don't know if it's the only solution.
>>> AIUI, the issue here is that we have a device specific kernel driver
>>> extending vfio-pci with migration support for this device by using an  
>>
>> If the two parts of the MMIO region are split into different BAR spaces on
>> the device, the MMIO region of the business function is still placed in BAR2,
>> and the MMIO region of the live migration function is moved to BAR4.
>> Only BAR2 is mapped in the guest, and only BAR4 is mapped in the host.
>> This can solve this security issue.
> 
> The concern is really the "on the device" part rather than whether the
> resources are within the same BAR or not.  We need to assume that a
> user driver can generate a DMA targeting any address in the system,
> including in this case the user driver could generate a DMA targeting
> this migration BAR.  Ideally this would be routed upstream to the IOMMU
> where it would be blocked for lack of a translation entry.  However,
> because this range resides on the same PCIe requester ID, it's
> logically more similar to a two-function device where the functions are
> considered non-isolated and are therefore exposed within the same IOMMU
> group.  We would not allow a kernel driver for one of those functions
> and a userspace driver for the other.  In this case those drivers are
> strongly related, but we still need to consider to what extent a
> malicious user driver can interfere with or exploit the kernel side
> driver.
> 
> 
>>> MMIO region of the same device.  This is susceptible to DMA
>>> manipulation by the user device.  Whether that's a security issue or
>>> not depends on how the user can break the device.  If the scope is
>>> limited to breaking their own device, they can do that any number of
>>> ways and it's not very interesting.  If the user can manipulate device
>>> state in order to trigger an exploit of the host-side kernel driver,
>>> that's obviously more of a problem.
>>>
>>> The other side of this is that if migration support can be implemented
>>> entirely within the VF using this portion of the device MMIO space, why
>>> do we need the host kernel to support this rather than implementing it
>>> in userspace?  For example, QEMU could know about this device,
>>> manipulate the BAR size to expose only the operational portion of MMIO
>>> to the VM and use the remainder to support migration itself.  I'm
>>> afraid that just like mdev, the vfio migration uAPI is going to be used
>>> as an excuse to create kernel drivers simply to be able to make use of
>>> that uAPI.  I haven't looked at this driver to know if it has some  
>>
>> When the accelerator device was designed to support live migration, the
>> implementation was built on the migration region uAPI, so the live
>> migration function requires a driver that connects to this uAPI.
>> Is this set of interfaces not open to us now?
> 
> In your model, if both BARs are exposed to userspace and a device
> specific extension in QEMU claims the migration BAR rather than
> exposing it to the VM, could that driver mimic the migration region
> uAPI from userspace?  For example, you don't need page pinning to
> interact with the IOMMU, you don't need resources beyond the scope
> of the endpoint device itself, and the migration info BAR is safe for
> userspace to manage?  If so, then a kernel-based driver to expose a
> migration uAPI seems like it's only a risk for the kernel, ie. moving
> what could be a userspace driver into the kernel for the convenience of
> re-using a kernel uAPI.  Thanks,
> 
> Alex
> 
> .
> 
Thanks for your review and suggestions.

We discussed the security issues you mentioned and have come up with some new
solutions. These are now being implemented; once they pass testing, they will
be sent out in the second version of the RFC.

Thanks,
Longfang.

Patch
diff mbox series

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b..e1b0e37 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -407,6 +407,17 @@  static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_HUAWEI &&
+	    IS_ENABLED(CONFIG_VFIO_PCI_HISI_MIGRATION)) {
+		ret = vfio_pci_hisilicon_acc_init(vdev);
+		if (ret && ret != -ENODEV) {
+			dev_warn(&vdev->pdev->dev,
+				 "Failed to setup Hisilicon ACC region\n");
+			vfio_pci_disable(vdev);
+			return ret;
+		}
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 9cd1882..83c51be 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -214,6 +214,15 @@  static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 }
 #endif
 
+#ifdef CONFIG_VFIO_PCI_HISI_MIGRATION
+extern int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev);
+#else
+static inline int vfio_pci_hisilicon_acc_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
+#endif
+
 #ifdef CONFIG_S390
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				       struct vfio_info_cap *caps);