All of lore.kernel.org
 help / color / mirror / Atom feed
* Live Migration of Virtio Virtual Function
@ 2021-08-12 12:08 Max Gurtovoy
  2021-08-17  8:51 ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-12 12:08 UTC (permalink / raw)
  To: virtio-comment, Jason Wang, Michael S. Tsirkin, cohuck
  Cc: Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

[-- Attachment #1: Type: text/plain, Size: 1880 bytes --]

Hi all,
Live migration is one of the most important features of virtualization and virtio devices are oftenly found in virtual environments.

The migration process is managed by a migration SW that is running on the hypervisor and the VM is not aware of the process at all.

Unlike the vDPA case, a real pci Virtual Function state resides in the HW.

In our vision, in order to fulfil the Live migration requirements for virtual functions, each physical function device must implement migration operations. Using these operations, it will be able to master the migration process for the virtual function devices. Each capable physical function device has a supervisor permissions to change the virtual function operational states, save/restore its internal state and start/stop dirty pages tracking.

An example of this approach can be seen in the way NVIDIA performs live migration of a ConnectX NIC function:
https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci

NVIDIAs SNAP technology enables hardware-accelerated software defined PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage and networking solutions. The host OS/hypervisor uses its standard drivers that are implemented according to a well-known VIRTIO specifications.

In order to implement Live Migration for these virtual function devices, that use a standard drivers as mentioned, the specification should define how HW vendor should build their devices and for SW developers to adjust the drivers.
This will enable specification compliant vendor agnostic solution.

This is exactly how we built the migration driver for ConnectX (internal HW design doc) and I guess that this is the way other vendors work.

For that, I would like to know if the approach of "PF that controls the VF live migration process" is acceptable by the VIRTIO technical group ?

Cheers,
-Max.

[-- Attachment #2: Type: text/html, Size: 4357 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-12 12:08 Live Migration of Virtio Virtual Function Max Gurtovoy
@ 2021-08-17  8:51 ` Jason Wang
  2021-08-17  9:11   ` Max Gurtovoy
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-17  8:51 UTC (permalink / raw)
  To: Max Gurtovoy, virtio-comment, Michael S. Tsirkin, cohuck
  Cc: Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>
> Hi all,
>
> Live migration is one of the most important features of virtualization 
> and virtio devices are oftenly found in virtual environments.
>
> The migration process is managed by a migration SW that is running on 
> the hypervisor and the VM is not aware of the process at all.
>
> Unlike the vDPA case, a real pci Virtual Function state resides in the HW.
>

vDPA doesn't prevent you from having HW states. Actually from the view 
of the VMM(Qemu), it doesn't care whether or not a state is stored in 
the software or hardware. A well designed VMM should be able to hide the 
virtio device implementation from the migration layer, that is how Qemu 
is wrote who doesn't care about whether or not it's a software 
virtio/vDPA device or not.


> In our vision, in order to fulfil the Live migration requirements for 
> virtual functions, each physical function device must implement 
> migration operations. Using these operations, it will be able to 
> master the migration process for the virtual function devices. Each 
> capable physical function device has a supervisor permissions to 
> change the virtual function operational states, save/restore its 
> internal state and start/stop dirty pages tracking.
>

For "supervisor permissions", is this from the software point of view? 
Maybe it's better to give an example for this.


> An example of this approach can be seen in the way NVIDIA performs 
> live migration of a ConnectX NIC function:
>
> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci 
> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>
> NVIDIAs SNAP technology enables hardware-accelerated software defined 
> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage 
> and networking solutions. The host OS/hypervisor uses its standard 
> drivers that are implemented according to a well-known VIRTIO 
> specifications.
>
> In order to implement Live Migration for these virtual function 
> devices, that use a standard drivers as mentioned, the specification 
> should define how HW vendor should build their devices and for SW 
> developers to adjust the drivers.
>
> This will enable specification compliant vendor agnostic solution.
>
> This is exactly how we built the migration driver for ConnectX 
> (internal HW design doc) and I guess that this is the way other 
> vendors work.
>
> For that, I would like to know if the approach of “PF that controls 
> the VF live migration process” is acceptable by the VIRTIO technical 
> group ?
>

I'm not sure but I think it's better to start from the general facility 
for all transports, then develop features for a specific transport.

Thanks


> Cheers,
>
> -Max.
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-17  8:51 ` [virtio-comment] " Jason Wang
@ 2021-08-17  9:11   ` Max Gurtovoy
  2021-08-17  9:44     ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-17  9:11 UTC (permalink / raw)
  To: Jason Wang, virtio-comment, Michael S. Tsirkin, cohuck
  Cc: Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


On 8/17/2021 11:51 AM, Jason Wang wrote:
>
> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>
>> Hi all,
>>
>> Live migration is one of the most important features of 
>> virtualization and virtio devices are oftenly found in virtual 
>> environments.
>>
>> The migration process is managed by a migration SW that is running on 
>> the hypervisor and the VM is not aware of the process at all.
>>
>> Unlike the vDPA case, a real pci Virtual Function state resides in 
>> the HW.
>>
>
> vDPA doesn't prevent you from having HW states. Actually from the view 
> of the VMM(Qemu), it doesn't care whether or not a state is stored in 
> the software or hardware. A well designed VMM should be able to hide 
> the virtio device implementation from the migration layer, that is how 
> Qemu is wrote who doesn't care about whether or not it's a software 
> virtio/vDPA device or not.
>
>
>> In our vision, in order to fulfil the Live migration requirements for 
>> virtual functions, each physical function device must implement 
>> migration operations. Using these operations, it will be able to 
>> master the migration process for the virtual function devices. Each 
>> capable physical function device has a supervisor permissions to 
>> change the virtual function operational states, save/restore its 
>> internal state and start/stop dirty pages tracking.
>>
>
> For "supervisor permissions", is this from the software point of view? 
> Maybe it's better to give an example for this.

A permission to a PF device for quiesce and freeze a VF device for example.

>
>
>> An example of this approach can be seen in the way NVIDIA performs 
>> live migration of a ConnectX NIC function:
>>
>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci 
>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>
>> NVIDIAs SNAP technology enables hardware-accelerated software defined 
>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage 
>> and networking solutions. The host OS/hypervisor uses its standard 
>> drivers that are implemented according to a well-known VIRTIO 
>> specifications.
>>
>> In order to implement Live Migration for these virtual function 
>> devices, that use a standard drivers as mentioned, the specification 
>> should define how HW vendor should build their devices and for SW 
>> developers to adjust the drivers.
>>
>> This will enable specification compliant vendor agnostic solution.
>>
>> This is exactly how we built the migration driver for ConnectX 
>> (internal HW design doc) and I guess that this is the way other 
>> vendors work.
>>
>> For that, I would like to know if the approach of “PF that controls 
>> the VF live migration process” is acceptable by the VIRTIO technical 
>> group ?
>>
>
> I'm not sure but I think it's better to start from the general 
> facility for all transports, then develop features for a specific 
> transport.

a general facility for all transports can be a generic admin queue ?


>
> Thanks
>
>
>> Cheers,
>>
>> -Max.
>>
>

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-17  9:11   ` Max Gurtovoy
@ 2021-08-17  9:44     ` Jason Wang
  2021-08-18  9:15       ` Max Gurtovoy
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-17  9:44 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: virtio-comment, Michael S. Tsirkin, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>
> On 8/17/2021 11:51 AM, Jason Wang wrote:
> >
> > 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> >>
> >> Hi all,
> >>
> >> Live migration is one of the most important features of
> >> virtualization and virtio devices are oftenly found in virtual
> >> environments.
> >>
> >> The migration process is managed by a migration SW that is running on
> >> the hypervisor and the VM is not aware of the process at all.
> >>
> >> Unlike the vDPA case, a real pci Virtual Function state resides in
> >> the HW.
> >>
> >
> > vDPA doesn't prevent you from having HW states. Actually from the view
> > of the VMM(Qemu), it doesn't care whether or not a state is stored in
> > the software or hardware. A well designed VMM should be able to hide
> > the virtio device implementation from the migration layer, that is how
> > Qemu is wrote who doesn't care about whether or not it's a software
> > virtio/vDPA device or not.
> >
> >
> >> In our vision, in order to fulfil the Live migration requirements for
> >> virtual functions, each physical function device must implement
> >> migration operations. Using these operations, it will be able to
> >> master the migration process for the virtual function devices. Each
> >> capable physical function device has a supervisor permissions to
> >> change the virtual function operational states, save/restore its
> >> internal state and start/stop dirty pages tracking.
> >>
> >
> > For "supervisor permissions", is this from the software point of view?
> > Maybe it's better to give an example for this.
>
> A permission to a PF device for quiesce and freeze a VF device for example.

Note that for safety, VMM (e.g Qemu) is usually running without any privileges.

>
> >
> >
> >> An example of this approach can be seen in the way NVIDIA performs
> >> live migration of a ConnectX NIC function:
> >>
> >> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> >> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> >>
> >> NVIDIAs SNAP technology enables hardware-accelerated software defined
> >> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
> >> and networking solutions. The host OS/hypervisor uses its standard
> >> drivers that are implemented according to a well-known VIRTIO
> >> specifications.
> >>
> >> In order to implement Live Migration for these virtual function
> >> devices, that use a standard drivers as mentioned, the specification
> >> should define how HW vendor should build their devices and for SW
> >> developers to adjust the drivers.
> >>
> >> This will enable specification compliant vendor agnostic solution.
> >>
> >> This is exactly how we built the migration driver for ConnectX
> >> (internal HW design doc) and I guess that this is the way other
> >> vendors work.
> >>
> >> For that, I would like to know if the approach of “PF that controls
> >> the VF live migration process” is acceptable by the VIRTIO technical
> >> group ?
> >>
> >
> > I'm not sure but I think it's better to start from the general
> > facility for all transports, then develop features for a specific
> > transport.
>
> a general facility for all transports can be a generic admin queue ?

It could be a virtqueue or a transport specific method (pcie capability).

E.g we can define what needs to be migrated for the virtio-blk first
(the device state). Then we can define the interface to get and set
those states via admin virtqueue. Such decoupling may ease the future
development of the transport specific migration interface.

Thanks

>
>
> >
> > Thanks
> >
> >
> >> Cheers,
> >>
> >> -Max.
> >>
> >
>
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
>
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
>
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-17  9:44     ` Jason Wang
@ 2021-08-18  9:15       ` Max Gurtovoy
  2021-08-18 10:46         ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-18  9:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, Michael S. Tsirkin, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


On 8/17/2021 12:44 PM, Jason Wang wrote:
> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>
>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>> Hi all,
>>>>
>>>> Live migration is one of the most important features of
>>>> virtualization and virtio devices are oftenly found in virtual
>>>> environments.
>>>>
>>>> The migration process is managed by a migration SW that is running on
>>>> the hypervisor and the VM is not aware of the process at all.
>>>>
>>>> Unlike the vDPA case, a real pci Virtual Function state resides in
>>>> the HW.
>>>>
>>> vDPA doesn't prevent you from having HW states. Actually from the view
>>> of the VMM(Qemu), it doesn't care whether or not a state is stored in
>>> the software or hardware. A well designed VMM should be able to hide
>>> the virtio device implementation from the migration layer, that is how
>>> Qemu is wrote who doesn't care about whether or not it's a software
>>> virtio/vDPA device or not.
>>>
>>>
>>>> In our vision, in order to fulfil the Live migration requirements for
>>>> virtual functions, each physical function device must implement
>>>> migration operations. Using these operations, it will be able to
>>>> master the migration process for the virtual function devices. Each
>>>> capable physical function device has a supervisor permissions to
>>>> change the virtual function operational states, save/restore its
>>>> internal state and start/stop dirty pages tracking.
>>>>
>>> For "supervisor permissions", is this from the software point of view?
>>> Maybe it's better to give an example for this.
>> A permission to a PF device for quiesce and freeze a VF device for example.
> Note that for safety, VMM (e.g Qemu) is usually running without any privileges.

You're mixing layers here.

QEMU is not involved here. It's only sending IOCTLs to migration driver. 
The migration driver will control the migration process of the VF using 
the PF communication channel.


>
>>>
>>>> An example of this approach can be seen in the way NVIDIA performs
>>>> live migration of a ConnectX NIC function:
>>>>
>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>
>>>> NVIDIAs SNAP technology enables hardware-accelerated software defined
>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
>>>> and networking solutions. The host OS/hypervisor uses its standard
>>>> drivers that are implemented according to a well-known VIRTIO
>>>> specifications.
>>>>
>>>> In order to implement Live Migration for these virtual function
>>>> devices, that use a standard drivers as mentioned, the specification
>>>> should define how HW vendor should build their devices and for SW
>>>> developers to adjust the drivers.
>>>>
>>>> This will enable specification compliant vendor agnostic solution.
>>>>
>>>> This is exactly how we built the migration driver for ConnectX
>>>> (internal HW design doc) and I guess that this is the way other
>>>> vendors work.
>>>>
>>>> For that, I would like to know if the approach of “PF that controls
>>>> the VF live migration process” is acceptable by the VIRTIO technical
>>>> group ?
>>>>
>>> I'm not sure but I think it's better to start from the general
>>> facility for all transports, then develop features for a specific
>>> transport.
>> a general facility for all transports can be a generic admin queue ?
> It could be a virtqueue or a transport specific method (pcie capability).

No. You said a general facility for all transports.

Transport specific is not general.

>
> E.g we can define what needs to be migrated for the virtio-blk first
> (the device state). Then we can define the interface to get and set
> those states via admin virtqueue. Such decoupling may ease the future
> development of the transport specific migration interface.

I asked a simple question here.

Lets stick to this.

I'm not referring to internal state definitions.

Can you please not change the subject of my initial intent in the email ?

Thanks.


>
> Thanks
>
>>
>>> Thanks
>>>
>>>
>>>> Cheers,
>>>>
>>>> -Max.
>>>>
>> This publicly archived list offers a means to provide input to the
>> OASIS Virtual I/O Device (VIRTIO) TC.
>>
>> In order to verify user consent to the Feedback License terms and
>> to minimize spam in the list archive, subscription is required
>> before posting.
>>
>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>> List help: virtio-comment-help@lists.oasis-open.org
>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ZvEbfxoQhPnFyy1m%2BM5%2FyD0UW%2FJ7YZnOliFepRsoU30%3D&amp;reserved=0
>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Xsf04HBGeQb4qOwgAfUQucHsizBHV5Vj2WOJVc0AYTs%3D&amp;reserved=0
>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=GUVBMKmEmZxbhacMofMW9OitOgMscoIOOvS%2BkyPTg1Y%3D&amp;reserved=0
>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=OTdKqw11N6FxaSNhxBlmLSheYYqnHGVGyfgyuoAzxOU%3D&amp;reserved=0
>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=e80lScamJ3Sn5NqsiDBZ3tpc8oHl%2F0dVOq39pGHH4v8%3D&amp;reserved=0
>>

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-18  9:15       ` Max Gurtovoy
@ 2021-08-18 10:46         ` Jason Wang
  2021-08-18 11:45           ` Max Gurtovoy
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-18 10:46 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: virtio-comment, Michael S. Tsirkin, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>
> On 8/17/2021 12:44 PM, Jason Wang wrote:
> > On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >>
> >> On 8/17/2021 11:51 AM, Jason Wang wrote:
> >>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> >>>> Hi all,
> >>>>
> >>>> Live migration is one of the most important features of
> >>>> virtualization and virtio devices are oftenly found in virtual
> >>>> environments.
> >>>>
> >>>> The migration process is managed by a migration SW that is running on
> >>>> the hypervisor and the VM is not aware of the process at all.
> >>>>
> >>>> Unlike the vDPA case, a real pci Virtual Function state resides in
> >>>> the HW.
> >>>>
> >>> vDPA doesn't prevent you from having HW states. Actually from the view
> >>> of the VMM(Qemu), it doesn't care whether or not a state is stored in
> >>> the software or hardware. A well designed VMM should be able to hide
> >>> the virtio device implementation from the migration layer, that is how
> >>> Qemu is wrote who doesn't care about whether or not it's a software
> >>> virtio/vDPA device or not.
> >>>
> >>>
> >>>> In our vision, in order to fulfil the Live migration requirements for
> >>>> virtual functions, each physical function device must implement
> >>>> migration operations. Using these operations, it will be able to
> >>>> master the migration process for the virtual function devices. Each
> >>>> capable physical function device has a supervisor permissions to
> >>>> change the virtual function operational states, save/restore its
> >>>> internal state and start/stop dirty pages tracking.
> >>>>
> >>> For "supervisor permissions", is this from the software point of view?
> >>> Maybe it's better to give an example for this.
> >> A permission to a PF device for quiesce and freeze a VF device for example.
> > Note that for safety, VMM (e.g Qemu) is usually running without any privileges.
>
> You're mixing layers here.
>
> QEMU is not involved here. It's only sending IOCTLs to migration driver.
> The migration driver will control the migration process of the VF using
> the PF communication channel.

So who will be granted the "permission" you mentioned here?

>
>
> >
> >>>
> >>>> An example of this approach can be seen in the way NVIDIA performs
> >>>> live migration of a ConnectX NIC function:
> >>>>
> >>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> >>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> >>>>
> >>>> NVIDIAs SNAP technology enables hardware-accelerated software defined
> >>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
> >>>> and networking solutions. The host OS/hypervisor uses its standard
> >>>> drivers that are implemented according to a well-known VIRTIO
> >>>> specifications.
> >>>>
> >>>> In order to implement Live Migration for these virtual function
> >>>> devices, that use a standard drivers as mentioned, the specification
> >>>> should define how HW vendor should build their devices and for SW
> >>>> developers to adjust the drivers.
> >>>>
> >>>> This will enable specification compliant vendor agnostic solution.
> >>>>
> >>>> This is exactly how we built the migration driver for ConnectX
> >>>> (internal HW design doc) and I guess that this is the way other
> >>>> vendors work.
> >>>>
> >>>> For that, I would like to know if the approach of “PF that controls
> >>>> the VF live migration process” is acceptable by the VIRTIO technical
> >>>> group ?
> >>>>
> >>> I'm not sure but I think it's better to start from the general
> >>> facility for all transports, then develop features for a specific
> >>> transport.
> >> a general facility for all transports can be a generic admin queue ?
> > It could be a virtqueue or a transport specific method (pcie capability).
>
> No. You said a general facility for all transports.

For general facility, I mean the chapter 2 of the spec which is general

"
2 Basic Facilities of a Virtio Device
"


>
> Transport specific is not general.

The transport is in charge of implementing the interface for those facilities.

>
> >
> > E.g we can define what needs to be migrated for the virtio-blk first
> > (the device state). Then we can define the interface to get and set
> > those states via admin virtqueue. Such decoupling may ease the future
> > development of the transport specific migration interface.
>
> I asked a simple question here.
>
> Lets stick to this.

I answered this question.  The virtqueue could be one of the
approaches. And it's your responsibility to convince the community
about that approach. Having an example may help people to understand
your proposal.

>
> I'm not referring to internal state definitions.

Without an example, how do we know if it can work well?

>
> Can you please not change the subject of my initial intent in the email ?

Did I? Basically, I'm asking how a virtio-blk can be migrated with
your proposal.

Thanks

>
> Thanks.
>
>
> >
> > Thanks
> >
> >>
> >>> Thanks
> >>>
> >>>
> >>>> Cheers,
> >>>>
> >>>> -Max.
> >>>>
> >> This publicly archived list offers a means to provide input to the
> >> OASIS Virtual I/O Device (VIRTIO) TC.
> >>
> >> In order to verify user consent to the Feedback License terms and
> >> to minimize spam in the list archive, subscription is required
> >> before posting.
> >>
> >> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >> List help: virtio-comment-help@lists.oasis-open.org
> >> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ZvEbfxoQhPnFyy1m%2BM5%2FyD0UW%2FJ7YZnOliFepRsoU30%3D&amp;reserved=0
> >> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Xsf04HBGeQb4qOwgAfUQucHsizBHV5Vj2WOJVc0AYTs%3D&amp;reserved=0
> >> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=GUVBMKmEmZxbhacMofMW9OitOgMscoIOOvS%2BkyPTg1Y%3D&amp;reserved=0
> >> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=OTdKqw11N6FxaSNhxBlmLSheYYqnHGVGyfgyuoAzxOU%3D&amp;reserved=0
> >> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0c1898923999420201b408d961639dee%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637647902764977086%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=e80lScamJ3Sn5NqsiDBZ3tpc8oHl%2F0dVOq39pGHH4v8%3D&amp;reserved=0
> >>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-18 10:46         ` Jason Wang
@ 2021-08-18 11:45           ` Max Gurtovoy
  2021-08-19  2:44             ` Jason Wang
  2021-08-19 11:12             ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-18 11:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, Michael S. Tsirkin, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


On 8/18/2021 1:46 PM, Jason Wang wrote:
> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>
>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>> Hi all,
>>>>>>
>>>>>> Live migration is one of the most important features of
>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>> environments.
>>>>>>
>>>>>> The migration process is managed by a migration SW that is running on
>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>
>>>>>> Unlike the vDPA case, a real pci Virtual Function state resides in
>>>>>> the HW.
>>>>>>
>>>>> vDPA doesn't prevent you from having HW states. Actually from the view
>>>>> of the VMM(Qemu), it doesn't care whether or not a state is stored in
>>>>> the software or hardware. A well designed VMM should be able to hide
>>>>> the virtio device implementation from the migration layer, that is how
>>>>> Qemu is wrote who doesn't care about whether or not it's a software
>>>>> virtio/vDPA device or not.
>>>>>
>>>>>
>>>>>> In our vision, in order to fulfil the Live migration requirements for
>>>>>> virtual functions, each physical function device must implement
>>>>>> migration operations. Using these operations, it will be able to
>>>>>> master the migration process for the virtual function devices. Each
>>>>>> capable physical function device has a supervisor permissions to
>>>>>> change the virtual function operational states, save/restore its
>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>
>>>>> For "supervisor permissions", is this from the software point of view?
>>>>> Maybe it's better to give an example for this.
>>>> A permission to a PF device for quiesce and freeze a VF device for example.
>>> Note that for safety, VMM (e.g Qemu) is usually running without any privileges.
>> You're mixing layers here.
>>
>> QEMU is not involved here. It's only sending IOCTLs to migration driver.
>> The migration driver will control the migration process of the VF using
>> the PF communication channel.
> So who will be granted the "permission" you mentioned here?

This is just an expression.

What is not clear ?

The PF device will have an option to quiesce/freeze the VF device.

This is simple. Why are you looking for some sophisticated problems ?

>>
>>>>>> An example of this approach can be seen in the way NVIDIA performs
>>>>>> live migration of a ConnectX NIC function:
>>>>>>
>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>
>>>>>> NVIDIAs SNAP technology enables hardware-accelerated software defined
>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
>>>>>> and networking solutions. The host OS/hypervisor uses its standard
>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>> specifications.
>>>>>>
>>>>>> In order to implement Live Migration for these virtual function
>>>>>> devices, that use a standard drivers as mentioned, the specification
>>>>>> should define how HW vendor should build their devices and for SW
>>>>>> developers to adjust the drivers.
>>>>>>
>>>>>> This will enable specification compliant vendor agnostic solution.
>>>>>>
>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>> (internal HW design doc) and I guess that this is the way other
>>>>>> vendors work.
>>>>>>
>>>>>> For that, I would like to know if the approach of “PF that controls
>>>>>> the VF live migration process” is acceptable by the VIRTIO technical
>>>>>> group ?
>>>>>>
>>>>> I'm not sure but I think it's better to start from the general
>>>>> facility for all transports, then develop features for a specific
>>>>> transport.
>>>> a general facility for all transports can be a generic admin queue ?
>>> It could be a virtqueue or a transport specific method (pcie capability).
>> No. You said a general facility for all transports.
> For general facility, I mean the chapter 2 of the spec which is general
>
> "
> 2 Basic Facilities of a Virtio Device
> "
>
It will be in chapter 2. Right after "2.11 Exporting Object" I can add 
"2.12 Admin Virtqueues" and this is what I did in the RFC.

>> Transport specific is not general.
> The transport is in charge of implementing the interface for those facilities.

Transport specific is not general.


>
>>> E.g we can define what needs to be migrated for the virtio-blk first
>>> (the device state). Then we can define the interface to get and set
>>> those states via admin virtqueue. Such decoupling may ease the future
>>> development of the transport specific migration interface.
>> I asked a simple question here.
>>
>> Lets stick to this.
> I answered this question.

No you didn't answer.

I asked  if the approach of “PF that controls the VF live migration 
process” is acceptable by the VIRTIO technical group ?

And you take the discussion to your direction instead of answering a 
Yes/No question.

>    The virtqueue could be one of the
> approaches. And it's your responsibility to convince the community
> about that approach. Having an example may help people to understand
> your proposal.
>
>> I'm not referring to internal state definitions.
> Without an example, how do we know if it can work well?
>
>> Can you please not change the subject of my initial intent in the email ?
> Did I? Basically, I'm asking how a virtio-blk can be migrated with
> your proposal.

The virtio-blk PF admin queue will be used to manage the virtio-blk VF 
migration.

This is the whole discussion. I don't want to get into resolution.

Since you already know the answer as I published 4 RFCs already with all 
the flow.

Lets stick to my question.

> Thanks
>
>> Thanks.
>>
>>
>>> Thanks
>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> -Max.
>>>>>>
>>>> This publicly archived list offers a means to provide input to the
>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>
>>>> In order to verify user consent to the Feedback License terms and
>>>> to minimize spam in the list archive, subscription is required
>>>> before posting.
>>>>
>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586291850%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=bsdgv6XEcsFCSLo0G00WxKUaSQzj0xh4TLlOR2v4c8Y%3D&amp;reserved=0
>>>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=P0NLoCAtirxtRJT6%2FhLir%2BHAPgZkFOIaKKLf3wgzRpE%3D&amp;reserved=0
>>>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gOcr8NTEiA0142OTIayO5C%2FnKfaROqSYtCpBYEfyrds%3D&amp;reserved=0
>>>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=EbQ3NmU7YDLLetvoS41PxtADJx1TmWK90INGjZozrkk%3D&amp;reserved=0
>>>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=xWF9jQStdg9SBspPSs8w5KYcZS08G72tfEKpd9bir2g%3D&amp;reserved=0
>>>>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-18 11:45           ` Max Gurtovoy
@ 2021-08-19  2:44             ` Jason Wang
  2021-08-19 14:58               ` Michael S. Tsirkin
  2021-08-19 11:12             ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-19  2:44 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: virtio-comment, Michael S. Tsirkin, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


在 2021/8/18 下午7:45, Max Gurtovoy 写道:
>
> On 8/18/2021 1:46 PM, Jason Wang wrote:
>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> 
>> wrote:
>>>
>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> 
>>>> wrote:
>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Live migration is one of the most important features of
>>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>>> environments.
>>>>>>>
>>>>>>> The migration process is managed by a migration SW that is 
>>>>>>> running on
>>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>>
>>>>>>> Unlike the vDPA case, a real pci Virtual Function state resides in
>>>>>>> the HW.
>>>>>>>
>>>>>> vDPA doesn't prevent you from having HW states. Actually from the 
>>>>>> view
>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is 
>>>>>> stored in
>>>>>> the software or hardware. A well designed VMM should be able to hide
>>>>>> the virtio device implementation from the migration layer, that 
>>>>>> is how
>>>>>> Qemu is wrote who doesn't care about whether or not it's a software
>>>>>> virtio/vDPA device or not.
>>>>>>
>>>>>>
>>>>>>> In our vision, in order to fulfil the Live migration 
>>>>>>> requirements for
>>>>>>> virtual functions, each physical function device must implement
>>>>>>> migration operations. Using these operations, it will be able to
>>>>>>> master the migration process for the virtual function devices. Each
>>>>>>> capable physical function device has a supervisor permissions to
>>>>>>> change the virtual function operational states, save/restore its
>>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>>
>>>>>> For "supervisor permissions", is this from the software point of 
>>>>>> view?
>>>>>> Maybe it's better to give an example for this.
>>>>> A permission to a PF device for quiesce and freeze a VF device for 
>>>>> example.
>>>> Note that for safety, VMM (e.g Qemu) is usually running without any 
>>>> privileges.
>>> You're mixing layers here.
>>>
>>> QEMU is not involved here. It's only sending IOCTLs to migration 
>>> driver.
>>> The migration driver will control the migration process of the VF using
>>> the PF communication channel.
>> So who will be granted the "permission" you mentioned here?
>
> This is just an expression.
>
> What is not clear ?


Well, the "supervisor permission" usually means it must be done that way 
otherwise it may have security implication.

But your answer sounds nothing related to that which is confusing.


>
> The PF device will have an option to quiesce/freeze the VF device.


Is such design a must? If no, why not simply introduce those functions 
in the VF? If yes, what's the reason for making virtio different (e.g 
VCPU live migration is not designed like that)?


>
> This is simple. Why are you looking for some sophisticated problems ?


It's pretty natural that people may review the patch or proposal from 
different angles. But it looks to me it's not something you want to see? 
If you mandate people to think the same as you, that's not how the 
community work. And it makes the conversation very hard. Before we 
moving forward, I think we should agree on some basic code-of-conduct as 
what Linux had: 
https://www.kernel.org/doc/html/latest/process/code-of-conduct.html. 
Especially the second standard: "Being respectful of differing 
viewpoints and experiences".

In the mean time, it's your duty to explain the motivation in a clear 
way or explain it to the reviewers. I suggest you to re-visit how to 
submit patches: 
https://www.kernel.org/doc/html/latest/process/submitting-patches.html


>
>>>
>>>>>>> An example of this approach can be seen in the way NVIDIA performs
>>>>>>> live migration of a ConnectX NIC function:
>>>>>>>
>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>>
>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated software 
>>>>>>> defined
>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
>>>>>>> and networking solutions. The host OS/hypervisor uses its standard
>>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>>> specifications.
>>>>>>>
>>>>>>> In order to implement Live Migration for these virtual function
>>>>>>> devices, that use a standard drivers as mentioned, the 
>>>>>>> specification
>>>>>>> should define how HW vendor should build their devices and for SW
>>>>>>> developers to adjust the drivers.
>>>>>>>
>>>>>>> This will enable specification compliant vendor agnostic solution.
>>>>>>>
>>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>>> (internal HW design doc) and I guess that this is the way other
>>>>>>> vendors work.
>>>>>>>
>>>>>>> For that, I would like to know if the approach of “PF that controls
>>>>>>> the VF live migration process” is acceptable by the VIRTIO 
>>>>>>> technical
>>>>>>> group ?
>>>>>>>
>>>>>> I'm not sure but I think it's better to start from the general
>>>>>> facility for all transports, then develop features for a specific
>>>>>> transport.
>>>>> a general facility for all transports can be a generic admin queue ?
>>>> It could be a virtqueue or a transport specific method (pcie 
>>>> capability).
>>> No. You said a general facility for all transports.
>> For general facility, I mean the chapter 2 of the spec which is general
>>
>> "
>> 2 Basic Facilities of a Virtio Device
>> "
>>
> It will be in chapter 2. Right after "2.11 Exporting Object" I can add 
> "2.12 Admin Virtqueues" and this is what I did in the RFC.


The point is, migration should be an independent facility and it's 
possible to be done in transport specific way other than the admin 
virtqueue.


>
>>> Transport specific is not general.
>> The transport is in charge of implementing the interface for those 
>> facilities.
>
> Transport specific is not general.
>
>
>>
>>>> E.g we can define what needs to be migrated for the virtio-blk first
>>>> (the device state). Then we can define the interface to get and set
>>>> those states via admin virtqueue. Such decoupling may ease the future
>>>> development of the transport specific migration interface.
>>> I asked a simple question here.
>>>
>>> Lets stick to this.
>> I answered this question.
>
> No you didn't answer.


I answered "I'm not sure". Or are you expecting the answer like yes or 
no? Of course I can't answer like that since it depends on whether your 
proposal is agreed by the vast majority of the members and the other 
procedure e.g voting before it is merged.

You may refer this doc to see about the procedure:

https://github.com/oasis-tcs/virtio-admin/blob/master/README.md


>
> I asked  if the approach of “PF that controls the VF live migration 
> process” is acceptable by the VIRTIO technical group ?
>
> And you take the discussion to your direction instead of answering a 
> Yes/No question.


I don't get the point of this question. If the reviewer think a 
direction may help, the review has the right to do that.

And what I want to say is:

1) I'm not sure it can be acceptable (I can't speak for the whole TC)
2) but I have idea to help people to understand the proposal (start form 
an example)


>
>>    The virtqueue could be one of the
>> approaches. And it's your responsibility to convince the community
>> about that approach. Having an example may help people to understand
>> your proposal.
>>
>>> I'm not referring to internal state definitions.
>> Without an example, how do we know if it can work well?
>>
>>> Can you please not change the subject of my initial intent in the 
>>> email ?
>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
>> your proposal.
>
> The virtio-blk PF admin queue will be used to manage the virtio-blk VF 
> migration.
>
> This is the whole discussion. I don't want to get into resolution.
>
> Since you already know the answer as I published 4 RFCs already with 
> all the flow.


No I don't, especially the part of device states that need to be 
migrated. Even if I knew the answer, it doesn't mean other people can 
easily understand that. You only add a github link for to your mlx5e 
development tree, it's really hard to see the connections. And you don't 
even mention the 4 RFCS you've posted (and a lot of comments were not 
addressed there).


>
> Lets stick to my question.


I don't think you expectation can be met through "Hey, I have an idea, 
and you know how it work, does it make sense?". Especially consider it's 
a complicated issue.

Thanks


>
>> Thanks
>>
>>> Thanks.
>>>
>>>
>>>> Thanks
>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> -Max.
>>>>>>>
>>>>> This publicly archived list offers a means to provide input to the
>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>
>>>>> In order to verify user consent to the Feedback License terms and
>>>>> to minimize spam in the list archive, subscription is required
>>>>> before posting.
>>>>>
>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>> List archive: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586291850%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=bsdgv6XEcsFCSLo0G00WxKUaSQzj0xh4TLlOR2v4c8Y%3D&amp;reserved=0
>>>>> Feedback License: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=P0NLoCAtirxtRJT6%2FhLir%2BHAPgZkFOIaKKLf3wgzRpE%3D&amp;reserved=0
>>>>> List Guidelines: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gOcr8NTEiA0142OTIayO5C%2FnKfaROqSYtCpBYEfyrds%3D&amp;reserved=0
>>>>> Committee: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=EbQ3NmU7YDLLetvoS41PxtADJx1TmWK90INGjZozrkk%3D&amp;reserved=0
>>>>> Join OASIS: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=xWF9jQStdg9SBspPSs8w5KYcZS08G72tfEKpd9bir2g%3D&amp;reserved=0
>>>>>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-18 11:45           ` Max Gurtovoy
  2021-08-19  2:44             ` Jason Wang
@ 2021-08-19 11:12             ` Dr. David Alan Gilbert
  2021-08-19 14:16               ` Max Gurtovoy
  1 sibling, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2021-08-19 11:12 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Wang, virtio-comment, Michael S. Tsirkin, cohuck,
	Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

* Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> 
> On 8/18/2021 1:46 PM, Jason Wang wrote:
> > On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > > 
> > > On 8/17/2021 12:44 PM, Jason Wang wrote:
> > > > On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > > > > On 8/17/2021 11:51 AM, Jason Wang wrote:
> > > > > > 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > Live migration is one of the most important features of
> > > > > > > virtualization and virtio devices are oftenly found in virtual
> > > > > > > environments.
> > > > > > > 
> > > > > > > The migration process is managed by a migration SW that is running on
> > > > > > > the hypervisor and the VM is not aware of the process at all.
> > > > > > > 
> > > > > > > Unlike the vDPA case, a real pci Virtual Function state resides in
> > > > > > > the HW.
> > > > > > > 
> > > > > > vDPA doesn't prevent you from having HW states. Actually from the view
> > > > > > of the VMM(Qemu), it doesn't care whether or not a state is stored in
> > > > > > the software or hardware. A well designed VMM should be able to hide
> > > > > > the virtio device implementation from the migration layer, that is how
> > > > > > Qemu is wrote who doesn't care about whether or not it's a software
> > > > > > virtio/vDPA device or not.
> > > > > > 
> > > > > > 
> > > > > > > In our vision, in order to fulfil the Live migration requirements for
> > > > > > > virtual functions, each physical function device must implement
> > > > > > > migration operations. Using these operations, it will be able to
> > > > > > > master the migration process for the virtual function devices. Each
> > > > > > > capable physical function device has a supervisor permissions to
> > > > > > > change the virtual function operational states, save/restore its
> > > > > > > internal state and start/stop dirty pages tracking.
> > > > > > > 
> > > > > > For "supervisor permissions", is this from the software point of view?
> > > > > > Maybe it's better to give an example for this.
> > > > > A permission to a PF device for quiesce and freeze a VF device for example.
> > > > Note that for safety, VMM (e.g Qemu) is usually running without any privileges.
> > > You're mixing layers here.
> > > 
> > > QEMU is not involved here. It's only sending IOCTLs to migration driver.
> > > The migration driver will control the migration process of the VF using
> > > the PF communication channel.
> > So who will be granted the "permission" you mentioned here?
> 
> This is just an expression.
> 
> What is not clear ?
> 
> The PF device will have an option to quiesce/freeze the VF device.
> 
> This is simple. Why are you looking for some sophisticated problems ?

I'm trying to follow along here and have not completely; but I think the issue is a
security separation one.
The VMM (e.g. qemu) that has been given access to one of the VF's is
isolated and shouldn't be able to go poking at other devices; so it
can't go poking at the PF (it probably doesn't even have the PF device
node accessible) - so then the question is who has access to the
migration driver and how do you make sure it can only deal with VF's
that it's supposed to be able to migrate.

Dave

> > > 
> > > > > > > An example of this approach can be seen in the way NVIDIA performs
> > > > > > > live migration of a ConnectX NIC function:
> > > > > > > 
> > > > > > > https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> > > > > > > <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> > > > > > > 
> > > > > > > NVIDIAs SNAP technology enables hardware-accelerated software defined
> > > > > > > PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
> > > > > > > and networking solutions. The host OS/hypervisor uses its standard
> > > > > > > drivers that are implemented according to a well-known VIRTIO
> > > > > > > specifications.
> > > > > > > 
> > > > > > > In order to implement Live Migration for these virtual function
> > > > > > > devices, that use a standard drivers as mentioned, the specification
> > > > > > > should define how HW vendor should build their devices and for SW
> > > > > > > developers to adjust the drivers.
> > > > > > > 
> > > > > > > This will enable specification compliant vendor agnostic solution.
> > > > > > > 
> > > > > > > This is exactly how we built the migration driver for ConnectX
> > > > > > > (internal HW design doc) and I guess that this is the way other
> > > > > > > vendors work.
> > > > > > > 
> > > > > > > For that, I would like to know if the approach of “PF that controls
> > > > > > > the VF live migration process” is acceptable by the VIRTIO technical
> > > > > > > group ?
> > > > > > > 
> > > > > > I'm not sure but I think it's better to start from the general
> > > > > > facility for all transports, then develop features for a specific
> > > > > > transport.
> > > > > a general facility for all transports can be a generic admin queue ?
> > > > It could be a virtqueue or a transport specific method (pcie capability).
> > > No. You said a general facility for all transports.
> > For general facility, I mean the chapter 2 of the spec which is general
> > 
> > "
> > 2 Basic Facilities of a Virtio Device
> > "
> > 
> It will be in chapter 2. Right after "2.11 Exporting Object" I can add "2.12
> Admin Virtqueues" and this is what I did in the RFC.
> 
> > > Transport specific is not general.
> > The transport is in charge of implementing the interface for those facilities.
> 
> Transport specific is not general.
> 
> 
> > 
> > > > E.g we can define what needs to be migrated for the virtio-blk first
> > > > (the device state). Then we can define the interface to get and set
> > > > those states via admin virtqueue. Such decoupling may ease the future
> > > > development of the transport specific migration interface.
> > > I asked a simple question here.
> > > 
> > > Lets stick to this.
> > I answered this question.
> 
> No you didn't answer.
> 
> I asked  if the approach of “PF that controls the VF live migration process”
> is acceptable by the VIRTIO technical group ?
> 
> And you take the discussion to your direction instead of answering a Yes/No
> question.
> 
> >    The virtqueue could be one of the
> > approaches. And it's your responsibility to convince the community
> > about that approach. Having an example may help people to understand
> > your proposal.
> > 
> > > I'm not referring to internal state definitions.
> > Without an example, how do we know if it can work well?
> > 
> > > Can you please not change the subject of my initial intent in the email ?
> > Did I? Basically, I'm asking how a virtio-blk can be migrated with
> > your proposal.
> 
> The virtio-blk PF admin queue will be used to manage the virtio-blk VF
> migration.
> 
> This is the whole discussion. I don't want to get into resolution.
> 
> Since you already know the answer as I published 4 RFCs already with all the
> flow.
> 
> Lets stick to my question.
> 
> > Thanks
> > 
> > > Thanks.
> > > 
> > > 
> > > > Thanks
> > > > 
> > > > > > Thanks
> > > > > > 
> > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > -Max.
> > > > > > > 
> > > > > This publicly archived list offers a means to provide input to the
> > > > > OASIS Virtual I/O Device (VIRTIO) TC.
> > > > > 
> > > > > In order to verify user consent to the Feedback License terms and
> > > > > to minimize spam in the list archive, subscription is required
> > > > > before posting.
> > > > > 
> > > > > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > > > > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > > > > List help: virtio-comment-help@lists.oasis-open.org
> > > > > List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586291850%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=bsdgv6XEcsFCSLo0G00WxKUaSQzj0xh4TLlOR2v4c8Y%3D&amp;reserved=0
> > > > > Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=P0NLoCAtirxtRJT6%2FhLir%2BHAPgZkFOIaKKLf3wgzRpE%3D&amp;reserved=0
> > > > > List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gOcr8NTEiA0142OTIayO5C%2FnKfaROqSYtCpBYEfyrds%3D&amp;reserved=0
> > > > > Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=EbQ3NmU7YDLLetvoS41PxtADJx1TmWK90INGjZozrkk%3D&amp;reserved=0
> > > > > Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cb8808c97b68a409d091a08d962356c9c%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637648804586301810%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=xWF9jQStdg9SBspPSs8w5KYcZS08G72tfEKpd9bir2g%3D&amp;reserved=0
> > > > > 
> 
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-19 11:12             ` Dr. David Alan Gilbert
@ 2021-08-19 14:16               ` Max Gurtovoy
  2021-08-19 14:24                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-19 14:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Wang, virtio-comment, Michael S. Tsirkin, cohuck,
	Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>> On 8/18/2021 1:46 PM, Jason Wang wrote:
>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Live migration is one of the most important features of
>>>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>>>> environments.
>>>>>>>>
>>>>>>>> The migration process is managed by a migration SW that is running on
>>>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>>>
>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state resides in
>>>>>>>> the HW.
>>>>>>>>
>>>>>>> vDPA doesn't prevent you from having HW states. Actually from the view
>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is stored in
>>>>>>> the software or hardware. A well designed VMM should be able to hide
>>>>>>> the virtio device implementation from the migration layer, that is how
>>>>>>> Qemu is wrote who doesn't care about whether or not it's a software
>>>>>>> virtio/vDPA device or not.
>>>>>>>
>>>>>>>
>>>>>>>> In our vision, in order to fulfil the Live migration requirements for
>>>>>>>> virtual functions, each physical function device must implement
>>>>>>>> migration operations. Using these operations, it will be able to
>>>>>>>> master the migration process for the virtual function devices. Each
>>>>>>>> capable physical function device has a supervisor permissions to
>>>>>>>> change the virtual function operational states, save/restore its
>>>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>>>
>>>>>>> For "supervisor permissions", is this from the software point of view?
>>>>>>> Maybe it's better to give an example for this.
>>>>>> A permission to a PF device for quiesce and freeze a VF device for example.
>>>>> Note that for safety, VMM (e.g Qemu) is usually running without any privileges.
>>>> You're mixing layers here.
>>>>
>>>> QEMU is not involved here. It's only sending IOCTLs to migration driver.
>>>> The migration driver will control the migration process of the VF using
>>>> the PF communication channel.
>>> So who will be granted the "permission" you mentioned here?
>> This is just an expression.
>>
>> What is not clear ?
>>
>> The PF device will have an option to quiesce/freeze the VF device.
>>
>> This is simple. Why are you looking for some sophisticated problems ?
> I'm trying to follow along here and have not completely; but I think the issue is a
> security separation one.
> The VMM (e.g. qemu) that has been given access to one of the VF's is
> isolated and shouldn't be able to go poking at other devices; so it
> can't go poking at the PF (it probably doesn't even have the PF device
> node accessible) - so then the question is who has access to the
> migration driver and how do you make sure it can only deal with VF's
> that it's supposed to be able to migrate.

The QEMU/userspace doesn't know or care about the PF connection and 
internal virtio_vfio_pci driver implementation.

You shouldn't change 1 line of code in the VM driver nor in QEMU.

QEMU does not have access to the PF. Only the kernel driver that has 
access to the VF will have access to the PF communication channel.  
There is no permission problem here.

The kernel driver of the VF will do this internally, and make sure that 
the commands it build will only impact the VF originating them.

We already do this in mlx5 NIC migration. The kernel is secured and QEMU 
interface is the VF.

> Dave
>
>>>>>>>> An example of this approach can be seen in the way NVIDIA performs
>>>>>>>> live migration of a ConnectX NIC function:
>>>>>>>>
>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>>>
>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated software defined
>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
>>>>>>>> and networking solutions. The host OS/hypervisor uses its standard
>>>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>>>> specifications.
>>>>>>>>
>>>>>>>> In order to implement Live Migration for these virtual function
>>>>>>>> devices, that use a standard drivers as mentioned, the specification
>>>>>>>> should define how HW vendor should build their devices and for SW
>>>>>>>> developers to adjust the drivers.
>>>>>>>>
>>>>>>>> This will enable specification compliant vendor agnostic solution.
>>>>>>>>
>>>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>>>> (internal HW design doc) and I guess that this is the way other
>>>>>>>> vendors work.
>>>>>>>>
>>>>>>>> For that, I would like to know if the approach of “PF that controls
>>>>>>>> the VF live migration process” is acceptable by the VIRTIO technical
>>>>>>>> group ?
>>>>>>>>
>>>>>>> I'm not sure but I think it's better to start from the general
>>>>>>> facility for all transports, then develop features for a specific
>>>>>>> transport.
>>>>>> a general facility for all transports can be a generic admin queue ?
>>>>> It could be a virtqueue or a transport specific method (pcie capability).
>>>> No. You said a general facility for all transports.
>>> For general facility, I mean the chapter 2 of the spec which is general
>>>
>>> "
>>> 2 Basic Facilities of a Virtio Device
>>> "
>>>
>> It will be in chapter 2. Right after "2.11 Exporting Object" I can add "2.12
>> Admin Virtqueues" and this is what I did in the RFC.
>>
>>>> Transport specific is not general.
>>> The transport is in charge of implementing the interface for those facilities.
>> Transport specific is not general.
>>
>>
>>>>> E.g we can define what needs to be migrated for the virtio-blk first
>>>>> (the device state). Then we can define the interface to get and set
>>>>> those states via admin virtqueue. Such decoupling may ease the future
>>>>> development of the transport specific migration interface.
>>>> I asked a simple question here.
>>>>
>>>> Lets stick to this.
>>> I answered this question.
>> No you didn't answer.
>>
>> I asked  if the approach of “PF that controls the VF live migration process”
>> is acceptable by the VIRTIO technical group ?
>>
>> And you take the discussion to your direction instead of answering a Yes/No
>> question.
>>
>>>     The virtqueue could be one of the
>>> approaches. And it's your responsibility to convince the community
>>> about that approach. Having an example may help people to understand
>>> your proposal.
>>>
>>>> I'm not referring to internal state definitions.
>>> Without an example, how do we know if it can work well?
>>>
>>>> Can you please not change the subject of my initial intent in the email ?
>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
>>> your proposal.
>> The virtio-blk PF admin queue will be used to manage the virtio-blk VF
>> migration.
>>
>> This is the whole discussion. I don't want to get into resolution.
>>
>> Since you already know the answer as I published 4 RFCs already with all the
>> flow.
>>
>> Lets stick to my question.
>>
>>> Thanks
>>>
>>>> Thanks.
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> -Max.
>>>>>>>>
>>>>>> This publicly archived list offers a means to provide input to the
>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>
>>>>>> In order to verify user consent to the Feedback License terms and
>>>>>> to minimize spam in the list archive, subscription is required
>>>>>> before posting.
>>>>>>
>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113731308%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=l%2FVw755FQgN6%2BunOXoNRummuOWky%2Feh0xcITkNemXvE%3D&amp;reserved=0
>>>>>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113731308%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=tlaAvXlxAXCdN3E%2F%2BH1%2Bb3t2C3jgQUofXTXGmrq5ug8%3D&amp;reserved=0
>>>>>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113741261%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gWzhW9NEeeYiE3oJl5ZTzBui1tIqEzYw%2FuV59qiBqzw%3D&amp;reserved=0
>>>>>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ccZibMmYiCnFVUQ4QeCexz8OQUsvdeC9dlr1aaRnXhg%3D&amp;reserved=0
>>>>>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=O7T8D2Y6xzCLp%2FceWqiRoriTh0M6IBeMPLpfb7d3zvA%3D&amp;reserved=0
>>>>>>
>> This publicly archived list offers a means to provide input to the
>> OASIS Virtual I/O Device (VIRTIO) TC.
>>
>> In order to verify user consent to the Feedback License terms and
>> to minimize spam in the list archive, subscription is required
>> before posting.
>>
>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>> List help: virtio-comment-help@lists.oasis-open.org
>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=apuVb9a9YN2itMje3TXj1ZtwFWm9%2FPQ%2BVfrL%2BoNHzOc%3D&amp;reserved=0
>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Pl4yjzImSQV3YdXl8o7LUOI70I6%2BS78kORP1%2BbOZaU%3D&amp;reserved=0
>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=IcvHERzgz0h2oUleoQXpPk%2FPpf0CGVpNA74zc%2FpSDRQ%3D&amp;reserved=0
>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ccZibMmYiCnFVUQ4QeCexz8OQUsvdeC9dlr1aaRnXhg%3D&amp;reserved=0
>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=O7T8D2Y6xzCLp%2FceWqiRoriTh0M6IBeMPLpfb7d3zvA%3D&amp;reserved=0
>>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-19 14:16               ` Max Gurtovoy
@ 2021-08-19 14:24                 ` Dr. David Alan Gilbert
  2021-08-19 15:20                   ` Max Gurtovoy
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2021-08-19 14:24 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Wang, virtio-comment, Michael S. Tsirkin, cohuck,
	Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

* Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> 
> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
> > * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> > > On 8/18/2021 1:46 PM, Jason Wang wrote:
> > > > On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > > > > On 8/17/2021 12:44 PM, Jason Wang wrote:
> > > > > > On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > > > > > > On 8/17/2021 11:51 AM, Jason Wang wrote:
> > > > > > > > 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> > > > > > > > > Hi all,
> > > > > > > > > 
> > > > > > > > > Live migration is one of the most important features of
> > > > > > > > > virtualization and virtio devices are oftenly found in virtual
> > > > > > > > > environments.
> > > > > > > > > 
> > > > > > > > > The migration process is managed by a migration SW that is running on
> > > > > > > > > the hypervisor and the VM is not aware of the process at all.
> > > > > > > > > 
> > > > > > > > > Unlike the vDPA case, a real pci Virtual Function state resides in
> > > > > > > > > the HW.
> > > > > > > > > 
> > > > > > > > vDPA doesn't prevent you from having HW states. Actually from the view
> > > > > > > > of the VMM(Qemu), it doesn't care whether or not a state is stored in
> > > > > > > > the software or hardware. A well designed VMM should be able to hide
> > > > > > > > the virtio device implementation from the migration layer, that is how
> > > > > > > > Qemu is wrote who doesn't care about whether or not it's a software
> > > > > > > > virtio/vDPA device or not.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > In our vision, in order to fulfil the Live migration requirements for
> > > > > > > > > virtual functions, each physical function device must implement
> > > > > > > > > migration operations. Using these operations, it will be able to
> > > > > > > > > master the migration process for the virtual function devices. Each
> > > > > > > > > capable physical function device has a supervisor permissions to
> > > > > > > > > change the virtual function operational states, save/restore its
> > > > > > > > > internal state and start/stop dirty pages tracking.
> > > > > > > > > 
> > > > > > > > For "supervisor permissions", is this from the software point of view?
> > > > > > > > Maybe it's better to give an example for this.
> > > > > > > A permission to a PF device for quiesce and freeze a VF device for example.
> > > > > > Note that for safety, VMM (e.g Qemu) is usually running without any privileges.
> > > > > You're mixing layers here.
> > > > > 
> > > > > QEMU is not involved here. It's only sending IOCTLs to migration driver.
> > > > > The migration driver will control the migration process of the VF using
> > > > > the PF communication channel.
> > > > So who will be granted the "permission" you mentioned here?
> > > This is just an expression.
> > > 
> > > What is not clear ?
> > > 
> > > The PF device will have an option to quiesce/freeze the VF device.
> > > 
> > > This is simple. Why are you looking for some sophisticated problems ?
> > I'm trying to follow along here and have not completely; but I think the issue is a
> > security separation one.
> > The VMM (e.g. qemu) that has been given access to one of the VF's is
> > isolated and shouldn't be able to go poking at other devices; so it
> > can't go poking at the PF (it probably doesn't even have the PF device
> > node accessible) - so then the question is who has access to the
> > migration driver and how do you make sure it can only deal with VF's
> > that it's supposed to be able to migrate.
> 
> The QEMU/userspace doesn't know or care about the PF connection and internal
> virtio_vfio_pci driver implementation.

OK

> You shouldn't change 1 line of code in the VM driver nor in QEMU.

Hmm OK.

> QEMU does not have access to the PF. Only the kernel driver that has access
> to the VF will have access to the PF communication channel.  There is no
> permission problem here.
>
> The kernel driver of the VF will do this internally, and make sure that the
> commands it build will only impact the VF originating them.
> 

Now that confuses me; isn't the kernel driver that has access to the VF
running inside the guest?  If it's inside the guest we can't trust it to
do anything about stopping impact to other devices.

Dave


> We already do this in mlx5 NIC migration. The kernel is secured and QEMU
> interface is the VF.
> 
> > Dave
> > 
> > > > > > > > > An example of this approach can be seen in the way NVIDIA performs
> > > > > > > > > live migration of a ConnectX NIC function:
> > > > > > > > > 
> > > > > > > > > https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> > > > > > > > > <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> > > > > > > > > 
> > > > > > > > > NVIDIAs SNAP technology enables hardware-accelerated software defined
> > > > > > > > > PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
> > > > > > > > > and networking solutions. The host OS/hypervisor uses its standard
> > > > > > > > > drivers that are implemented according to a well-known VIRTIO
> > > > > > > > > specifications.
> > > > > > > > > 
> > > > > > > > > In order to implement Live Migration for these virtual function
> > > > > > > > > devices, that use a standard drivers as mentioned, the specification
> > > > > > > > > should define how HW vendor should build their devices and for SW
> > > > > > > > > developers to adjust the drivers.
> > > > > > > > > 
> > > > > > > > > This will enable specification compliant vendor agnostic solution.
> > > > > > > > > 
> > > > > > > > > This is exactly how we built the migration driver for ConnectX
> > > > > > > > > (internal HW design doc) and I guess that this is the way other
> > > > > > > > > vendors work.
> > > > > > > > > 
> > > > > > > > > For that, I would like to know if the approach of “PF that controls
> > > > > > > > > the VF live migration process” is acceptable by the VIRTIO technical
> > > > > > > > > group ?
> > > > > > > > > 
> > > > > > > > I'm not sure but I think it's better to start from the general
> > > > > > > > facility for all transports, then develop features for a specific
> > > > > > > > transport.
> > > > > > > a general facility for all transports can be a generic admin queue ?
> > > > > > It could be a virtqueue or a transport specific method (pcie capability).
> > > > > No. You said a general facility for all transports.
> > > > For general facility, I mean the chapter 2 of the spec which is general
> > > > 
> > > > "
> > > > 2 Basic Facilities of a Virtio Device
> > > > "
> > > > 
> > > It will be in chapter 2. Right after "2.11 Exporting Object" I can add "2.12
> > > Admin Virtqueues" and this is what I did in the RFC.
> > > 
> > > > > Transport specific is not general.
> > > > The transport is in charge of implementing the interface for those facilities.
> > > Transport specific is not general.
> > > 
> > > 
> > > > > > E.g we can define what needs to be migrated for the virtio-blk first
> > > > > > (the device state). Then we can define the interface to get and set
> > > > > > those states via admin virtqueue. Such decoupling may ease the future
> > > > > > development of the transport specific migration interface.
> > > > > I asked a simple question here.
> > > > > 
> > > > > Lets stick to this.
> > > > I answered this question.
> > > No you didn't answer.
> > > 
> > > I asked  if the approach of “PF that controls the VF live migration process”
> > > is acceptable by the VIRTIO technical group ?
> > > 
> > > And you take the discussion to your direction instead of answering a Yes/No
> > > question.
> > > 
> > > >     The virtqueue could be one of the
> > > > approaches. And it's your responsibility to convince the community
> > > > about that approach. Having an example may help people to understand
> > > > your proposal.
> > > > 
> > > > > I'm not referring to internal state definitions.
> > > > Without an example, how do we know if it can work well?
> > > > 
> > > > > Can you please not change the subject of my initial intent in the email ?
> > > > Did I? Basically, I'm asking how a virtio-blk can be migrated with
> > > > your proposal.
> > > The virtio-blk PF admin queue will be used to manage the virtio-blk VF
> > > migration.
> > > 
> > > This is the whole discussion. I don't want to get into resolution.
> > > 
> > > Since you already know the answer as I published 4 RFCs already with all the
> > > flow.
> > > 
> > > Lets stick to my question.
> > > 
> > > > Thanks
> > > > 
> > > > > Thanks.
> > > > > 
> > > > > 
> > > > > > Thanks
> > > > > > 
> > > > > > > > Thanks
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Cheers,
> > > > > > > > > 
> > > > > > > > > -Max.
> > > > > > > > > 
> > > > > > > This publicly archived list offers a means to provide input to the
> > > > > > > OASIS Virtual I/O Device (VIRTIO) TC.
> > > > > > > 
> > > > > > > In order to verify user consent to the Feedback License terms and
> > > > > > > to minimize spam in the list archive, subscription is required
> > > > > > > before posting.
> > > > > > > 
> > > > > > > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > > > > > > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > > > > > > List help: virtio-comment-help@lists.oasis-open.org
> > > > > > > List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113731308%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=l%2FVw755FQgN6%2BunOXoNRummuOWky%2Feh0xcITkNemXvE%3D&amp;reserved=0
> > > > > > > Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113731308%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=tlaAvXlxAXCdN3E%2F%2BH1%2Bb3t2C3jgQUofXTXGmrq5ug8%3D&amp;reserved=0
> > > > > > > List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113741261%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gWzhW9NEeeYiE3oJl5ZTzBui1tIqEzYw%2FuV59qiBqzw%3D&amp;reserved=0
> > > > > > > Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ccZibMmYiCnFVUQ4QeCexz8OQUsvdeC9dlr1aaRnXhg%3D&amp;reserved=0
> > > > > > > Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=O7T8D2Y6xzCLp%2FceWqiRoriTh0M6IBeMPLpfb7d3zvA%3D&amp;reserved=0
> > > > > > > 
> > > This publicly archived list offers a means to provide input to the
> > > OASIS Virtual I/O Device (VIRTIO) TC.
> > > 
> > > In order to verify user consent to the Feedback License terms and
> > > to minimize spam in the list archive, subscription is required
> > > before posting.
> > > 
> > > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > > List help: virtio-comment-help@lists.oasis-open.org
> > > List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=apuVb9a9YN2itMje3TXj1ZtwFWm9%2FPQ%2BVfrL%2BoNHzOc%3D&amp;reserved=0
> > > Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Pl4yjzImSQV3YdXl8o7LUOI70I6%2BS78kORP1%2BbOZaU%3D&amp;reserved=0
> > > List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=IcvHERzgz0h2oUleoQXpPk%2FPpf0CGVpNA74zc%2FpSDRQ%3D&amp;reserved=0
> > > Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ccZibMmYiCnFVUQ4QeCexz8OQUsvdeC9dlr1aaRnXhg%3D&amp;reserved=0
> > > Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C012299d734484885c90308d96302337e%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649684113930443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=O7T8D2Y6xzCLp%2FceWqiRoriTh0M6IBeMPLpfb7d3zvA%3D&amp;reserved=0
> > > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-19  2:44             ` Jason Wang
@ 2021-08-19 14:58               ` Michael S. Tsirkin
  2021-08-20  2:17                 ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2021-08-19 14:58 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, virtio-comment, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > 
> > The PF device will have an option to quiesce/freeze the VF device.
> 
> 
> Is such design a must? If no, why not simply introduce those functions in
> the VF?

Many IOMMUs only support protections at the function level.
Thus we need ability to have one device (e.g. a PF)
to control migration of another (e.g. a VF).
This is because allowing VF to access hypervisor memory used for
migration is not a good idea. 
For IOMMUs that support subfunctions, these "devices" could be
subfunctions.

The only alternative is to keep things in device memory which
does not need an IOMMU.
I guess we'd end up with something like a VQ in device memory which might
be tricky from multiple points of view, but yes, this could be
useful and people did ask for such a capability in the past.

> If yes, what's the reason for making virtio different (e.g VCPU live
> migration is not designed like that)?

I think the main difference is we need PF's help for memory
tracking for pre-copy migration anyway. Might as well integrate
the rest of state in the same channel.

Another answer is that CPUs trivially switch between
functions by switching the active page tables. For PCI DMA
it is all much trickier sine the page tables can be separate
from the device, and assumed to be mostly static.
So if you want to create something like the VMCS then
again you either need some help from another device or
put it in device memory.


-- 
MST


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-19 14:24                 ` Dr. David Alan Gilbert
@ 2021-08-19 15:20                   ` Max Gurtovoy
  2021-08-20  2:24                     ` Jason Wang
  2021-08-23 12:18                     ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-19 15:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Wang, virtio-comment, Michael S. Tsirkin, cohuck,
	Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Live migration is one of the most important features of
>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>>>>>> environments.
>>>>>>>>>>
>>>>>>>>>> The migration process is managed by a migration SW that is running on
>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>>>>>
>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state resides in
>>>>>>>>>> the HW.
>>>>>>>>>>
>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually from the view
>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is stored in
>>>>>>>>> the software or hardware. A well designed VMM should be able to hide
>>>>>>>>> the virtio device implementation from the migration layer, that is how
>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a software
>>>>>>>>> virtio/vDPA device or not.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> In our vision, in order to fulfil the Live migration requirements for
>>>>>>>>>> virtual functions, each physical function device must implement
>>>>>>>>>> migration operations. Using these operations, it will be able to
>>>>>>>>>> master the migration process for the virtual function devices. Each
>>>>>>>>>> capable physical function device has a supervisor permissions to
>>>>>>>>>> change the virtual function operational states, save/restore its
>>>>>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>>>>>
>>>>>>>>> For "supervisor permissions", is this from the software point of view?
>>>>>>>>> Maybe it's better to give an example for this.
>>>>>>>> A permission to a PF device for quiesce and freeze a VF device for example.
>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running without any privileges.
>>>>>> You're mixing layers here.
>>>>>>
>>>>>> QEMU is not involved here. It's only sending IOCTLs to migration driver.
>>>>>> The migration driver will control the migration process of the VF using
>>>>>> the PF communication channel.
>>>>> So who will be granted the "permission" you mentioned here?
>>>> This is just an expression.
>>>>
>>>> What is not clear ?
>>>>
>>>> The PF device will have an option to quiesce/freeze the VF device.
>>>>
>>>> This is simple. Why are you looking for some sophisticated problems ?
>>> I'm trying to follow along here and have not completely; but I think the issue is a
>>> security separation one.
>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
>>> isolated and shouldn't be able to go poking at other devices; so it
>>> can't go poking at the PF (it probably doesn't even have the PF device
>>> node accessible) - so then the question is who has access to the
>>> migration driver and how do you make sure it can only deal with VF's
>>> that it's supposed to be able to migrate.
>> The QEMU/userspace doesn't know or care about the PF connection and internal
>> virtio_vfio_pci driver implementation.
> OK
>
>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
> Hmm OK.
>
>> QEMU does not have access to the PF. Only the kernel driver that has access
>> to the VF will have access to the PF communication channel.  There is no
>> permission problem here.
>>
>> The kernel driver of the VF will do this internally, and make sure that the
>> commands it build will only impact the VF originating them.
>>
> Now that confuses me; isn't the kernel driver that has access to the VF
> running inside the guest?  If it's inside the guest we can't trust it to
> do anything about stopping impact to other devices.

No. The driver is in the hypervisor (virtio_vfio_pci). This is the 
migration driver, right ?

The guest is running as usual. It doesn't aware on the migration at all.

This is the point I try to make here. I don't (and I can't) change even 
1 line of code in the guest.

e.g:

QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor 
(bounded to VF5) --> send admin command on PF adminq to start tracking 
dirty pages for VF5 --> PF device will do it

QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor 
(bounded to VF5) --> send admin command on PF adminq to quiesce VF5 --> 
PF device will do it

You can take a look how we implement mlx5_vfio_pci in the link I provided.

>
> Dave
>
>
>> We already do this in mlx5 NIC migration. The kernel is secured and QEMU
>> interface is the VF.
>>
>>> Dave
>>>
>>>>>>>>>> An example of this approach can be seen in the way NVIDIA performs
>>>>>>>>>> live migration of a ConnectX NIC function:
>>>>>>>>>>
>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>>>>>
>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated software defined
>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its standard
>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>>>>>> specifications.
>>>>>>>>>>
>>>>>>>>>> In order to implement Live Migration for these virtual function
>>>>>>>>>> devices, that use a standard drivers as mentioned, the specification
>>>>>>>>>> should define how HW vendor should build their devices and for SW
>>>>>>>>>> developers to adjust the drivers.
>>>>>>>>>>
>>>>>>>>>> This will enable specification compliant vendor agnostic solution.
>>>>>>>>>>
>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>>>>>> (internal HW design doc) and I guess that this is the way other
>>>>>>>>>> vendors work.
>>>>>>>>>>
>>>>>>>>>> For that, I would like to know if the approach of “PF that controls
>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO technical
>>>>>>>>>> group ?
>>>>>>>>>>
>>>>>>>>> I'm not sure but I think it's better to start from the general
>>>>>>>>> facility for all transports, then develop features for a specific
>>>>>>>>> transport.
>>>>>>>> a general facility for all transports can be a generic admin queue ?
>>>>>>> It could be a virtqueue or a transport specific method (pcie capability).
>>>>>> No. You said a general facility for all transports.
>>>>> For general facility, I mean the chapter 2 of the spec which is general
>>>>>
>>>>> "
>>>>> 2 Basic Facilities of a Virtio Device
>>>>> "
>>>>>
>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I can add "2.12
>>>> Admin Virtqueues" and this is what I did in the RFC.
>>>>
>>>>>> Transport specific is not general.
>>>>> The transport is in charge of implementing the interface for those facilities.
>>>> Transport specific is not general.
>>>>
>>>>
>>>>>>> E.g we can define what needs to be migrated for the virtio-blk first
>>>>>>> (the device state). Then we can define the interface to get and set
>>>>>>> those states via admin virtqueue. Such decoupling may ease the future
>>>>>>> development of the transport specific migration interface.
>>>>>> I asked a simple question here.
>>>>>>
>>>>>> Lets stick to this.
>>>>> I answered this question.
>>>> No you didn't answer.
>>>>
>>>> I asked  if the approach of “PF that controls the VF live migration process”
>>>> is acceptable by the VIRTIO technical group ?
>>>>
>>>> And you take the discussion to your direction instead of answering a Yes/No
>>>> question.
>>>>
>>>>>      The virtqueue could be one of the
>>>>> approaches. And it's your responsibility to convince the community
>>>>> about that approach. Having an example may help people to understand
>>>>> your proposal.
>>>>>
>>>>>> I'm not referring to internal state definitions.
>>>>> Without an example, how do we know if it can work well?
>>>>>
>>>>>> Can you please not change the subject of my initial intent in the email ?
>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
>>>>> your proposal.
>>>> The virtio-blk PF admin queue will be used to manage the virtio-blk VF
>>>> migration.
>>>>
>>>> This is the whole discussion. I don't want to get into resolution.
>>>>
>>>> Since you already know the answer as I published 4 RFCs already with all the
>>>> flow.
>>>>
>>>> Lets stick to my question.
>>>>
>>>>> Thanks
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> -Max.
>>>>>>>>>>
>>>>>>>> This publicly archived list offers a means to provide input to the
>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>>>
>>>>>>>> In order to verify user consent to the Feedback License terms and
>>>>>>>> to minimize spam in the list archive, subscription is required
>>>>>>>> before posting.
>>>>>>>>
>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>>>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gVamhvYG3lbMVVyMz%2F%2Fq3VBMZKY47pqxvRi94Mp%2B%2B7I%3D&amp;reserved=0
>>>>>>>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=U%2FxmgTCEaUTSgG%2BLohnAAuNXLncUOKU8yBxhkEMpmQk%3D&amp;reserved=0
>>>>>>>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nXzbdkD4B4TzAFXbD%2B4Jap8rWmzX2CZ8fVnEE2f4Tdc%3D&amp;reserved=0
>>>>>>>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Issa0O7V4p6MnuuJOcLDN4MAG77cSMSJ7MSZqvXol4%3D&amp;reserved=0
>>>>>>>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pPqcruawglqgMjakkslrpSVZzaOu%2FCvfkTSuUfMiEh0%3D&amp;reserved=0
>>>>>>>>
>>>> This publicly archived list offers a means to provide input to the
>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>
>>>> In order to verify user consent to the Feedback License terms and
>>>> to minimize spam in the list archive, subscription is required
>>>> before posting.
>>>>
>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gVamhvYG3lbMVVyMz%2F%2Fq3VBMZKY47pqxvRi94Mp%2B%2B7I%3D&amp;reserved=0
>>>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=U%2FxmgTCEaUTSgG%2BLohnAAuNXLncUOKU8yBxhkEMpmQk%3D&amp;reserved=0
>>>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nXzbdkD4B4TzAFXbD%2B4Jap8rWmzX2CZ8fVnEE2f4Tdc%3D&amp;reserved=0
>>>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Issa0O7V4p6MnuuJOcLDN4MAG77cSMSJ7MSZqvXol4%3D&amp;reserved=0
>>>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pPqcruawglqgMjakkslrpSVZzaOu%2FCvfkTSuUfMiEh0%3D&amp;reserved=0
>>>>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-19 14:58               ` Michael S. Tsirkin
@ 2021-08-20  2:17                 ` Jason Wang
  2021-08-20  7:03                   ` Michael S. Tsirkin
  2021-08-23 12:08                   ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 33+ messages in thread
From: Jason Wang @ 2021-08-20  2:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Max Gurtovoy, virtio-comment, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


在 2021/8/19 下午10:58, Michael S. Tsirkin 写道:
> On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
>>> The PF device will have an option to quiesce/freeze the VF device.
>>
>> Is such design a must? If no, why not simply introduce those functions in
>> the VF?
> Many IOMMUs only support protections at the function level.
> Thus we need ability to have one device (e.g. a PF)
> to control migration of another (e.g. a VF).


So as discussed previously, the only possible "advantage" is that the 
DMA is isolated.


> This is because allowing VF to access hypervisor memory used for
> migration is not a good idea.
> For IOMMUs that support subfunctions, these "devices" could be
> subfunctions.
>
> The only alternative is to keep things in device memory which
> does not need an IOMMU.
> I guess we'd end up with something like a VQ in device memory which might
> be tricky from multiple points of view, but yes, this could be
> useful and people did ask for such a capability in the past.


I assume the spec already support this. We probably need some 
clarification at the transport layer. But it's as simple as setting MMIO 
are as virtqueue address?

Except for the dirty bit tracking, we don't have bulk data that needs to 
be transferred during migration. So a virtqueue is not must even in this 
case.


>
>> If yes, what's the reason for making virtio different (e.g VCPU live
>> migration is not designed like that)?
> I think the main difference is we need PF's help for memory
> tracking for pre-copy migration anyway.


Such kind of memory tracking is not a must. KVM uses software assisted 
technologies (write protection) and it works very well. For virtio, 
technology like shadow virtqueue has been used by DPDK and prototyped by 
Eugenio.

Even if we want to go with hardware technology, we have many 
alternatives (as we've discussed in the past):

1) IOMMU dirty bit (E.g modern IOMMU have EA bit for logging external 
device write)
2) Write protection via IOMMU or device MMU
3) Address space ID for isolating DMAs

Using physical function is sub-optimal that all of the above since:

1) limited to a specific transport or implementation and it doesn't work 
for device or transport without PF
2) the virtio level function is not self contained, this makes any 
feature that ties to PF impossible to be used in the nested layer
3) more complicated than leveraging the existing facilities provided by 
the platform or transport

Consider (P)ASID will be ready very soon, workaround the platform 
limitation via PF is not a good idea for me. Especially consider it's 
not a must and we had already prototype the software assisted technology.


>   Might as well integrate
> the rest of state in the same channel.


That's another question. I think for the function that is a must for 
doing live migration, introducing them in the function itself is the 
most natural way since we did all the other facilities there. This ease 
the function that can be used in the nested layer.

And using the channel in the PF is not coming for free. It requires 
synchronization in the software or even QOS.

Or we can just separate the dirty page tracking into PF (but need to 
define them as basic facility for future extension).


>
> Another answer is that CPUs trivially switch between
> functions by switching the active page tables. For PCI DMA
> it is all much trickier sine the page tables can be separate
> from the device, and assumed to be mostly static.


I don't see much different, the page table is also separated from the 
CPU. If the device supports state save and restore we can scheduling the 
multiple VMs/VCPUs on the same device.


> So if you want to create something like the VMCS then
> again you either need some help from another device or
> put it in device memory.


For CPU virtualization, the states could be saved and restored via MSRs. 
For virtio, accessing them via registers is also possible and much more 
simple.

Thanks


>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-19 15:20                   ` Max Gurtovoy
@ 2021-08-20  2:24                     ` Jason Wang
  2021-08-20 10:26                       ` Max Gurtovoy
  2021-08-23 12:18                     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-20  2:24 UTC (permalink / raw)
  To: Max Gurtovoy, Dr. David Alan Gilbert
  Cc: virtio-comment, Michael S. Tsirkin, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


在 2021/8/19 下午11:20, Max Gurtovoy 写道:
>
> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
>>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy 
>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy 
>>>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Live migration is one of the most important features of
>>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>>>>>>> environments.
>>>>>>>>>>>
>>>>>>>>>>> The migration process is managed by a migration SW that is 
>>>>>>>>>>> running on
>>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>>>>>>
>>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state 
>>>>>>>>>>> resides in
>>>>>>>>>>> the HW.
>>>>>>>>>>>
>>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually from 
>>>>>>>>>> the view
>>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is 
>>>>>>>>>> stored in
>>>>>>>>>> the software or hardware. A well designed VMM should be able 
>>>>>>>>>> to hide
>>>>>>>>>> the virtio device implementation from the migration layer, 
>>>>>>>>>> that is how
>>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a 
>>>>>>>>>> software
>>>>>>>>>> virtio/vDPA device or not.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> In our vision, in order to fulfil the Live migration 
>>>>>>>>>>> requirements for
>>>>>>>>>>> virtual functions, each physical function device must implement
>>>>>>>>>>> migration operations. Using these operations, it will be 
>>>>>>>>>>> able to
>>>>>>>>>>> master the migration process for the virtual function 
>>>>>>>>>>> devices. Each
>>>>>>>>>>> capable physical function device has a supervisor 
>>>>>>>>>>> permissions to
>>>>>>>>>>> change the virtual function operational states, save/restore 
>>>>>>>>>>> its
>>>>>>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>>>>>>
>>>>>>>>>> For "supervisor permissions", is this from the software point 
>>>>>>>>>> of view?
>>>>>>>>>> Maybe it's better to give an example for this.
>>>>>>>>> A permission to a PF device for quiesce and freeze a VF device 
>>>>>>>>> for example.
>>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running without 
>>>>>>>> any privileges.
>>>>>>> You're mixing layers here.
>>>>>>>
>>>>>>> QEMU is not involved here. It's only sending IOCTLs to migration 
>>>>>>> driver.
>>>>>>> The migration driver will control the migration process of the 
>>>>>>> VF using
>>>>>>> the PF communication channel.
>>>>>> So who will be granted the "permission" you mentioned here?
>>>>> This is just an expression.
>>>>>
>>>>> What is not clear ?
>>>>>
>>>>> The PF device will have an option to quiesce/freeze the VF device.
>>>>>
>>>>> This is simple. Why are you looking for some sophisticated problems ?
>>>> I'm trying to follow along here and have not completely; but I 
>>>> think the issue is a
>>>> security separation one.
>>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
>>>> isolated and shouldn't be able to go poking at other devices; so it
>>>> can't go poking at the PF (it probably doesn't even have the PF device
>>>> node accessible) - so then the question is who has access to the
>>>> migration driver and how do you make sure it can only deal with VF's
>>>> that it's supposed to be able to migrate.
>>> The QEMU/userspace doesn't know or care about the PF connection and 
>>> internal
>>> virtio_vfio_pci driver implementation.
>> OK
>>
>>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
>> Hmm OK.
>>
>>> QEMU does not have access to the PF. Only the kernel driver that has 
>>> access
>>> to the VF will have access to the PF communication channel. There is no
>>> permission problem here.
>>>
>>> The kernel driver of the VF will do this internally, and make sure 
>>> that the
>>> commands it build will only impact the VF originating them.
>>>
>> Now that confuses me; isn't the kernel driver that has access to the VF
>> running inside the guest?  If it's inside the guest we can't trust it to
>> do anything about stopping impact to other devices.
>
> No. The driver is in the hypervisor (virtio_vfio_pci). This is the 
> migration driver, right ?


Well, talking things like virtio_vfio_pci that is not mentioned before 
and not justified on the list may easily confuse people. As pointed out 
in another thread, it has too many disadvantages over the existing 
virtio-pci vdpa driver. And it just duplicates a partial function of 
what virtio-pci vdpa driver can do. I don't think we will go that way.

Thanks


>
> The guest is running as usual. It doesn't aware on the migration at all.
>
> This is the point I try to make here. I don't (and I can't) change 
> even 1 line of code in the guest.
>
> e.g:
>
> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor 
> (bounded to VF5) --> send admin command on PF adminq to start tracking 
> dirty pages for VF5 --> PF device will do it
>
> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor 
> (bounded to VF5) --> send admin command on PF adminq to quiesce VF5 
> --> PF device will do it
>
> You can take a look how we implement mlx5_vfio_pci in the link I 
> provided.
>
>>
>> Dave
>>
>>
>>> We already do this in mlx5 NIC migration. The kernel is secured and 
>>> QEMU
>>> interface is the VF.
>>>
>>>> Dave
>>>>
>>>>>>>>>>> An example of this approach can be seen in the way NVIDIA 
>>>>>>>>>>> performs
>>>>>>>>>>> live migration of a ConnectX NIC function:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>>>>>>
>>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated 
>>>>>>>>>>> software defined
>>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for 
>>>>>>>>>>> storage
>>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its 
>>>>>>>>>>> standard
>>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>>>>>>> specifications.
>>>>>>>>>>>
>>>>>>>>>>> In order to implement Live Migration for these virtual function
>>>>>>>>>>> devices, that use a standard drivers as mentioned, the 
>>>>>>>>>>> specification
>>>>>>>>>>> should define how HW vendor should build their devices and 
>>>>>>>>>>> for SW
>>>>>>>>>>> developers to adjust the drivers.
>>>>>>>>>>>
>>>>>>>>>>> This will enable specification compliant vendor agnostic 
>>>>>>>>>>> solution.
>>>>>>>>>>>
>>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>>>>>>> (internal HW design doc) and I guess that this is the way other
>>>>>>>>>>> vendors work.
>>>>>>>>>>>
>>>>>>>>>>> For that, I would like to know if the approach of “PF that 
>>>>>>>>>>> controls
>>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO 
>>>>>>>>>>> technical
>>>>>>>>>>> group ?
>>>>>>>>>>>
>>>>>>>>>> I'm not sure but I think it's better to start from the general
>>>>>>>>>> facility for all transports, then develop features for a 
>>>>>>>>>> specific
>>>>>>>>>> transport.
>>>>>>>>> a general facility for all transports can be a generic admin 
>>>>>>>>> queue ?
>>>>>>>> It could be a virtqueue or a transport specific method (pcie 
>>>>>>>> capability).
>>>>>>> No. You said a general facility for all transports.
>>>>>> For general facility, I mean the chapter 2 of the spec which is 
>>>>>> general
>>>>>>
>>>>>> "
>>>>>> 2 Basic Facilities of a Virtio Device
>>>>>> "
>>>>>>
>>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I can 
>>>>> add "2.12
>>>>> Admin Virtqueues" and this is what I did in the RFC.
>>>>>
>>>>>>> Transport specific is not general.
>>>>>> The transport is in charge of implementing the interface for 
>>>>>> those facilities.
>>>>> Transport specific is not general.
>>>>>
>>>>>
>>>>>>>> E.g we can define what needs to be migrated for the virtio-blk 
>>>>>>>> first
>>>>>>>> (the device state). Then we can define the interface to get and 
>>>>>>>> set
>>>>>>>> those states via admin virtqueue. Such decoupling may ease the 
>>>>>>>> future
>>>>>>>> development of the transport specific migration interface.
>>>>>>> I asked a simple question here.
>>>>>>>
>>>>>>> Lets stick to this.
>>>>>> I answered this question.
>>>>> No you didn't answer.
>>>>>
>>>>> I asked  if the approach of “PF that controls the VF live 
>>>>> migration process”
>>>>> is acceptable by the VIRTIO technical group ?
>>>>>
>>>>> And you take the discussion to your direction instead of answering 
>>>>> a Yes/No
>>>>> question.
>>>>>
>>>>>>      The virtqueue could be one of the
>>>>>> approaches. And it's your responsibility to convince the community
>>>>>> about that approach. Having an example may help people to understand
>>>>>> your proposal.
>>>>>>
>>>>>>> I'm not referring to internal state definitions.
>>>>>> Without an example, how do we know if it can work well?
>>>>>>
>>>>>>> Can you please not change the subject of my initial intent in 
>>>>>>> the email ?
>>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
>>>>>> your proposal.
>>>>> The virtio-blk PF admin queue will be used to manage the 
>>>>> virtio-blk VF
>>>>> migration.
>>>>>
>>>>> This is the whole discussion. I don't want to get into resolution.
>>>>>
>>>>> Since you already know the answer as I published 4 RFCs already 
>>>>> with all the
>>>>> flow.
>>>>>
>>>>> Lets stick to my question.
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> -Max.
>>>>>>>>>>>
>>>>>>>>> This publicly archived list offers a means to provide input to 
>>>>>>>>> the
>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>>>>
>>>>>>>>> In order to verify user consent to the Feedback License terms and
>>>>>>>>> to minimize spam in the list archive, subscription is required
>>>>>>>>> before posting.
>>>>>>>>>
>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>>>>> List archive: 
>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gVamhvYG3lbMVVyMz%2F%2Fq3VBMZKY47pqxvRi94Mp%2B%2B7I%3D&amp;reserved=0
>>>>>>>>> Feedback License: 
>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=U%2FxmgTCEaUTSgG%2BLohnAAuNXLncUOKU8yBxhkEMpmQk%3D&amp;reserved=0
>>>>>>>>> List Guidelines: 
>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nXzbdkD4B4TzAFXbD%2B4Jap8rWmzX2CZ8fVnEE2f4Tdc%3D&amp;reserved=0
>>>>>>>>> Committee: 
>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Issa0O7V4p6MnuuJOcLDN4MAG77cSMSJ7MSZqvXol4%3D&amp;reserved=0
>>>>>>>>> Join OASIS: 
>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pPqcruawglqgMjakkslrpSVZzaOu%2FCvfkTSuUfMiEh0%3D&amp;reserved=0
>>>>>>>>>
>>>>> This publicly archived list offers a means to provide input to the
>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>
>>>>> In order to verify user consent to the Feedback License terms and
>>>>> to minimize spam in the list archive, subscription is required
>>>>> before posting.
>>>>>
>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>> List archive: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gVamhvYG3lbMVVyMz%2F%2Fq3VBMZKY47pqxvRi94Mp%2B%2B7I%3D&amp;reserved=0
>>>>> Feedback License: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=U%2FxmgTCEaUTSgG%2BLohnAAuNXLncUOKU8yBxhkEMpmQk%3D&amp;reserved=0
>>>>> List Guidelines: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nXzbdkD4B4TzAFXbD%2B4Jap8rWmzX2CZ8fVnEE2f4Tdc%3D&amp;reserved=0
>>>>> Committee: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Issa0O7V4p6MnuuJOcLDN4MAG77cSMSJ7MSZqvXol4%3D&amp;reserved=0
>>>>> Join OASIS: 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pPqcruawglqgMjakkslrpSVZzaOu%2FCvfkTSuUfMiEh0%3D&amp;reserved=0
>>>>>
>
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
>
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
>
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: 
> https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20  2:17                 ` Jason Wang
@ 2021-08-20  7:03                   ` Michael S. Tsirkin
  2021-08-20  7:49                     ` Jason Wang
  2021-08-23 12:08                   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2021-08-20  7:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, virtio-comment, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Fri, Aug 20, 2021 at 10:17:05AM +0800, Jason Wang wrote:
> 
> 在 2021/8/19 下午10:58, Michael S. Tsirkin 写道:
> > On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > > > The PF device will have an option to quiesce/freeze the VF device.
> > > 
> > > Is such design a must? If no, why not simply introduce those functions in
> > > the VF?
> > Many IOMMUs only support protections at the function level.
> > Thus we need ability to have one device (e.g. a PF)
> > to control migration of another (e.g. a VF).
> 
> 
> So as discussed previously, the only possible "advantage" is that the DMA is
> isolated.
> 
> 
> > This is because allowing VF to access hypervisor memory used for
> > migration is not a good idea.
> > For IOMMUs that support subfunctions, these "devices" could be
> > subfunctions.
> > 
> > The only alternative is to keep things in device memory which
> > does not need an IOMMU.
> > I guess we'd end up with something like a VQ in device memory which might
> > be tricky from multiple points of view, but yes, this could be
> > useful and people did ask for such a capability in the past.
> 
> 
> I assume the spec already support this. We probably need some clarification
> at the transport layer. But it's as simple as setting MMIO are as virtqueue
> address?

Several issues
- we do not support changing VQ address. Devices do need to support
  changing memory addresses.
- Ordering becomes tricky especially .
  E.g. when device reads descriptor in VQ
  memory it suddenly does not flush out writes into buffer
  that is potentially in RAM. We might also need even stronger
  barriers on the driver side. We used dma_wmb but now it's
  probably need to be wmb.
  Reading multibyte structures from device memory is slow.
  To get reasonable performance we might need to mark this device memory
  WB or WC. That generally makes things even trickier.


> Except for the dirty bit tracking, we don't have bulk data that needs to be
> transferred during migration. So a virtqueue is not must even in this case.

Main traffic is write tracking.


> 
> > 
> > > If yes, what's the reason for making virtio different (e.g VCPU live
> > > migration is not designed like that)?
> > I think the main difference is we need PF's help for memory
> > tracking for pre-copy migration anyway.
> 
> 
> Such kind of memory tracking is not a must. KVM uses software assisted
> technologies (write protection) and it works very well.

So page-fault support is absolutely a viable option IMHO.
To work well we need VIRTIO_F_PARTIAL_ORDER - there was not
a lot of excitement but sure I will finalize and repost it.


However we need support for reporting and handling faults.
Again this is data path stuff and needs to be under
hypervisor control so I guess we get right back
to having this in the PF?





> For virtio,
> technology like shadow virtqueue has been used by DPDK and prototyped by
> Eugenio.

That's ok but I think since it affects performance at 100% of the
time when active we can not rely on this as the only solution.


> Even if we want to go with hardware technology, we have many alternatives
> (as we've discussed in the past):
> 
> 1) IOMMU dirty bit (E.g modern IOMMU have EA bit for logging external device
> write)
> 2) Write protection via IOMMU or device MMU
> 3) Address space ID for isolating DMAs

Not all systems support any of the above unfortunately.

Also some systems might have a limited # of PASIDs.
So burning up a extra PASID per VF halving their
number might not be great as the only option.


> 
> Using physical function is sub-optimal that all of the above since:
> 
> 1) limited to a specific transport or implementation and it doesn't work for
> device or transport without PF
> 2) the virtio level function is not self contained, this makes any feature
> that ties to PF impossible to be used in the nested layer
> 3) more complicated than leveraging the existing facilities provided by the
> platform or transport

I think I disagree with 2 and 3 above simply because controlling VFs through
a PF is how all other devices did this. About 1 - well this is
just about us being smart and writing this in a way that is
generic enough, right? E.g. include options for PASIDs too.

Note that support for cross-device addressing is useful
even outside of migration.  We also have things like
priority where it is useful to adjust properties of
a VF on the fly while it is active. Again the normal way
all devices do this is through a PF. Yes a bunch of tricks
in QEMU is possible but having a driver in host kernel
and just handle it in a contained way is way cleaner.


> Consider (P)ASID will be ready very soon, workaround the platform limitation
> via PF is not a good idea for me. Especially consider it's not a must and we
> had already prototype the software assisted technology.

Well PASID is just one technology.


> 
> >   Might as well integrate
> > the rest of state in the same channel.
> 
> 
> That's another question. I think for the function that is a must for doing
> live migration, introducing them in the function itself is the most natural
> way since we did all the other facilities there. This ease the function that
> can be used in the nested layer.
> 
> And using the channel in the PF is not coming for free. It requires
> synchronization in the software or even QOS.
> 
> Or we can just separate the dirty page tracking into PF (but need to define
> them as basic facility for future extension).

Well maybe just start focusing on write tracking, sure.
Once there's a proposal for this we can see whether
adding other state there is easier or harder.


> 
> > 
> > Another answer is that CPUs trivially switch between
> > functions by switching the active page tables. For PCI DMA
> > it is all much trickier sine the page tables can be separate
> > from the device, and assumed to be mostly static.
> 
> 
> I don't see much different, the page table is also separated from the CPU.
> If the device supports state save and restore we can scheduling the multiple
> VMs/VCPUs on the same device.

It's just that performance is terrible. If you keep losing packets
migration might as well not be live.

> 
> > So if you want to create something like the VMCS then
> > again you either need some help from another device or
> > put it in device memory.
> 
> 
> For CPU virtualization, the states could be saved and restored via MSRs. For
> virtio, accessing them via registers is also possible and much more simple.
> 
> Thanks

IMy guess is performance is going to be bad. MSRs are part of the
same CPU that is executing the accesses....

> 
> > 
> > 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20  7:03                   ` Michael S. Tsirkin
@ 2021-08-20  7:49                     ` Jason Wang
  2021-08-20 11:06                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-20  7:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Max Gurtovoy, virtio-comment, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Fri, Aug 20, 2021 at 3:04 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Aug 20, 2021 at 10:17:05AM +0800, Jason Wang wrote:
> >
> > 在 2021/8/19 下午10:58, Michael S. Tsirkin 写道:
> > > On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > > > > The PF device will have an option to quiesce/freeze the VF device.
> > > >
> > > > Is such design a must? If no, why not simply introduce those functions in
> > > > the VF?
> > > Many IOMMUs only support protections at the function level.
> > > Thus we need ability to have one device (e.g. a PF)
> > > to control migration of another (e.g. a VF).
> >
> >
> > So as discussed previously, the only possible "advantage" is that the DMA is
> > isolated.
> >
> >
> > > This is because allowing VF to access hypervisor memory used for
> > > migration is not a good idea.
> > > For IOMMUs that support subfunctions, these "devices" could be
> > > subfunctions.
> > >
> > > The only alternative is to keep things in device memory which
> > > does not need an IOMMU.
> > > I guess we'd end up with something like a VQ in device memory which might
> > > be tricky from multiple points of view, but yes, this could be
> > > useful and people did ask for such a capability in the past.
> >
> >
> > I assume the spec already support this. We probably need some clarification
> > at the transport layer. But it's as simple as setting MMIO are as virtqueue
> > address?
>
> Several issues
> - we do not support changing VQ address. Devices do need to support
>   changing memory addresses.

So it looks like a transport specific requirement (PCI-E) instead of a
general issue.

> - Ordering becomes tricky especially .
>   E.g. when device reads descriptor in VQ
>   memory it suddenly does not flush out writes into buffer
>   that is potentially in RAM. We might also need even stronger
>   barriers on the driver side. We used dma_wmb but now it's
>   probably need to be wmb.
>   Reading multibyte structures from device memory is slow.
>   To get reasonable performance we might need to mark this device memory
>   WB or WC. That generally makes things even trickier.

I agree, but still they are all transport specific requirements. If we
do that in a PCI-E BAR, the driver must obey the ordering rule for PCI
to make it work.

>
>
> > Except for the dirty bit tracking, we don't have bulk data that needs to be
> > transferred during migration. So a virtqueue is not must even in this case.
>
> Main traffic is write tracking.

Right.

>
>
> >
> > >
> > > > If yes, what's the reason for making virtio different (e.g VCPU live
> > > > migration is not designed like that)?
> > > I think the main difference is we need PF's help for memory
> > > tracking for pre-copy migration anyway.
> >
> >
> > Such kind of memory tracking is not a must. KVM uses software assisted
> > technologies (write protection) and it works very well.
>
> So page-fault support is absolutely a viable option IMHO.
> To work well we need VIRTIO_F_PARTIAL_ORDER - there was not
> a lot of excitement but sure I will finalize and repost it.

As discussed before, it looks like a performance optimization but not a must?

I guess we don't do that for KVM and it works well.

>
>
> However we need support for reporting and handling faults.
> Again this is data path stuff and needs to be under
> hypervisor control so I guess we get right back
> to having this in the PF?

So it depends on whether it requires a DMA. If it's just something
like a CR2 register, we don't need PF.

>
>
>
>
>
> > For virtio,
> > technology like shadow virtqueue has been used by DPDK and prototyped by
> > Eugenio.
>
> That's ok but I think since it affects performance at 100% of the
> time when active we can not rely on this as the only solution.

This part I don't understand:

- KVM writes protect the pages, so it loses performance as well.
- If we are using virtqueue for reporting dirty bitmap, it can easily
run out of space and we will lose the performance as well
- If we are using bitmap/bytemap, we may also losing the performance
(e.g the huge footprint) or at PCI level

So I'm not against the idea, what I think makes more sense is not
limit the facilities like device states, dirty page tracking to the
PF.

>
>
> > Even if we want to go with hardware technology, we have many alternatives
> > (as we've discussed in the past):
> >
> > 1) IOMMU dirty bit (E.g modern IOMMU have EA bit for logging external device
> > write)
> > 2) Write protection via IOMMU or device MMU
> > 3) Address space ID for isolating DMAs
>
> Not all systems support any of the above unfortunately.
>

Yes. But we know the platform (AMD/Intel/ARM) will be ready soon for
them in the near future.

> Also some systems might have a limited # of PASIDs.
> So burning up a extra PASID per VF halving their
> number might not be great as the only option.

Yes, so I think we agree that we should not limit the spec to work on
a specific configuration (e.g the device with PF).

>
>
> >
> > Using physical function is sub-optimal that all of the above since:
> >
> > 1) limited to a specific transport or implementation and it doesn't work for
> > device or transport without PF
> > 2) the virtio level function is not self contained, this makes any feature
> > that ties to PF impossible to be used in the nested layer
> > 3) more complicated than leveraging the existing facilities provided by the
> > platform or transport
>
> I think I disagree with 2 and 3 above simply because controlling VFs through
> a PF is how all other devices did this.

For management and provision yes. For other features, the answer is
not. This is simply because most hardware vendors don't consider
whether or not a feature could be virtualized. That's fine for them
but not us. E.g if we limit the feature A to PF. It means feature A
can't be used by guests. My understanding is that we'd better not
introduce a feature that is hard to be virtualized.

> About 1 - well this is
> just about us being smart and writing this in a way that is
> generic enough, right?

That's exactly my question and my point, I know it can be done in the
PF. What I'm asking is "why it must be in the PF".

And I'm trying to convince Max to introduce those features as "basic
device facilities" instead of doing that in the "admin virtqueue" or
other stuff that belongs to PF.

> E.g. include options for PASIDs too.
>
> Note that support for cross-device addressing is useful
> even outside of migration.  We also have things like
> priority where it is useful to adjust properties of
> a VF on the fly while it is active. Again the normal way
> all devices do this is through a PF. Yes a bunch of tricks
> in QEMU is possible but having a driver in host kernel
> and just handle it in a contained way is way cleaner.
>
>
> > Consider (P)ASID will be ready very soon, workaround the platform limitation
> > via PF is not a good idea for me. Especially consider it's not a must and we
> > had already prototype the software assisted technology.
>
> Well PASID is just one technology.

Yes, devices are allowed to have their own function to isolate DMA. I
mentioned PASID just because it is the most popular technology.

>
>
> >
> > >   Might as well integrate
> > > the rest of state in the same channel.
> >
> >
> > That's another question. I think for the function that is a must for doing
> > live migration, introducing them in the function itself is the most natural
> > way since we did all the other facilities there. This ease the function that
> > can be used in the nested layer.
> >
> > And using the channel in the PF is not coming for free. It requires
> > synchronization in the software or even QOS.
> >
> > Or we can just separate the dirty page tracking into PF (but need to define
> > them as basic facility for future extension).
>
> Well maybe just start focusing on write tracking, sure.
> Once there's a proposal for this we can see whether
> adding other state there is easier or harder.

Fine with me.

>
>
> >
> > >
> > > Another answer is that CPUs trivially switch between
> > > functions by switching the active page tables. For PCI DMA
> > > it is all much trickier sine the page tables can be separate
> > > from the device, and assumed to be mostly static.
> >
> >
> > I don't see much different, the page table is also separated from the CPU.
> > If the device supports state save and restore we can scheduling the multiple
> > VMs/VCPUs on the same device.
>
> It's just that performance is terrible. If you keep losing packets
> migration might as well not be live.

I don't measure the performance. But I believe the shadow virtqueue
should perform better than kernel vhost-net backends.

If it's not, we can switch to vhost-net if necessary and we know it
works well for the live migration.

>
> >
> > > So if you want to create something like the VMCS then
> > > again you either need some help from another device or
> > > put it in device memory.
> >
> >
> > For CPU virtualization, the states could be saved and restored via MSRs. For
> > virtio, accessing them via registers is also possible and much more simple.
> >
> > Thanks
>
> IMy guess is performance is going to be bad. MSRs are part of the
> same CPU that is executing the accesses....

I'm not sure but it's how current VMX or SVM did.

Thanks

>
> >
> > >
> > >
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20  2:24                     ` Jason Wang
@ 2021-08-20 10:26                       ` Max Gurtovoy
  2021-08-20 11:16                         ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-20 10:26 UTC (permalink / raw)
  To: Jason Wang, Dr. David Alan Gilbert
  Cc: virtio-comment, Michael S. Tsirkin, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer


On 8/20/2021 5:24 AM, Jason Wang wrote:
>
> 在 2021/8/19 下午11:20, Max Gurtovoy 写道:
>>
>> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
>>>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy 
>>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy 
>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> Live migration is one of the most important features of
>>>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>>>>>>>> environments.
>>>>>>>>>>>>
>>>>>>>>>>>> The migration process is managed by a migration SW that is 
>>>>>>>>>>>> running on
>>>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>>>>>>>
>>>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state 
>>>>>>>>>>>> resides in
>>>>>>>>>>>> the HW.
>>>>>>>>>>>>
>>>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually 
>>>>>>>>>>> from the view
>>>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is 
>>>>>>>>>>> stored in
>>>>>>>>>>> the software or hardware. A well designed VMM should be able 
>>>>>>>>>>> to hide
>>>>>>>>>>> the virtio device implementation from the migration layer, 
>>>>>>>>>>> that is how
>>>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a 
>>>>>>>>>>> software
>>>>>>>>>>> virtio/vDPA device or not.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> In our vision, in order to fulfil the Live migration 
>>>>>>>>>>>> requirements for
>>>>>>>>>>>> virtual functions, each physical function device must 
>>>>>>>>>>>> implement
>>>>>>>>>>>> migration operations. Using these operations, it will be 
>>>>>>>>>>>> able to
>>>>>>>>>>>> master the migration process for the virtual function 
>>>>>>>>>>>> devices. Each
>>>>>>>>>>>> capable physical function device has a supervisor 
>>>>>>>>>>>> permissions to
>>>>>>>>>>>> change the virtual function operational states, 
>>>>>>>>>>>> save/restore its
>>>>>>>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>>>>>>>
>>>>>>>>>>> For "supervisor permissions", is this from the software 
>>>>>>>>>>> point of view?
>>>>>>>>>>> Maybe it's better to give an example for this.
>>>>>>>>>> A permission to a PF device for quiesce and freeze a VF 
>>>>>>>>>> device for example.
>>>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running 
>>>>>>>>> without any privileges.
>>>>>>>> You're mixing layers here.
>>>>>>>>
>>>>>>>> QEMU is not involved here. It's only sending IOCTLs to 
>>>>>>>> migration driver.
>>>>>>>> The migration driver will control the migration process of the 
>>>>>>>> VF using
>>>>>>>> the PF communication channel.
>>>>>>> So who will be granted the "permission" you mentioned here?
>>>>>> This is just an expression.
>>>>>>
>>>>>> What is not clear ?
>>>>>>
>>>>>> The PF device will have an option to quiesce/freeze the VF device.
>>>>>>
>>>>>> This is simple. Why are you looking for some sophisticated 
>>>>>> problems ?
>>>>> I'm trying to follow along here and have not completely; but I 
>>>>> think the issue is a
>>>>> security separation one.
>>>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
>>>>> isolated and shouldn't be able to go poking at other devices; so it
>>>>> can't go poking at the PF (it probably doesn't even have the PF 
>>>>> device
>>>>> node accessible) - so then the question is who has access to the
>>>>> migration driver and how do you make sure it can only deal with VF's
>>>>> that it's supposed to be able to migrate.
>>>> The QEMU/userspace doesn't know or care about the PF connection and 
>>>> internal
>>>> virtio_vfio_pci driver implementation.
>>> OK
>>>
>>>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
>>> Hmm OK.
>>>
>>>> QEMU does not have access to the PF. Only the kernel driver that 
>>>> has access
>>>> to the VF will have access to the PF communication channel. There 
>>>> is no
>>>> permission problem here.
>>>>
>>>> The kernel driver of the VF will do this internally, and make sure 
>>>> that the
>>>> commands it build will only impact the VF originating them.
>>>>
>>> Now that confuses me; isn't the kernel driver that has access to the VF
>>> running inside the guest?  If it's inside the guest we can't trust 
>>> it to
>>> do anything about stopping impact to other devices.
>>
>> No. The driver is in the hypervisor (virtio_vfio_pci). This is the 
>> migration driver, right ?
>
>
> Well, talking things like virtio_vfio_pci that is not mentioned before 
> and not justified on the list may easily confuse people. As pointed 
> out in another thread, it has too many disadvantages over the existing 
> virtio-pci vdpa driver. And it just duplicates a partial function of 
> what virtio-pci vdpa driver can do. I don't think we will go that way.

This was just an example for David to help with understanding the 
solution since he thought that the guest drivers somehow should be changed.

David I'm sorry if I confused you.

Again Jason, you try to propose your vDPA solution that is not what 
we're trying to achieve in this work. Think of a world without vDPA. 
Also I don't understand how vDPA is related to virtio specification 
decisions ? make vDPA into virtio and then we can open a discussion.

I'm interesting in virtio migration of HW devices.

The proposal in this thread is actually get support from Michal AFAIU 
and also others were happy with. All beside of you.

We do it in mlx5 and we didn't see any issues with that design.

I don't think you can say that we "go that way".

You're trying to build a complementary solution for creating scalable 
functions and for some reason trying to sabotage NVIDIA efforts to add 
new important functionality to virtio.

This also sabotage the evolvment of virtio as a standard.

You're trying to enforce some un-finished idea that should work on some 
future specific HW platform instead of helping defining a good spec for 
virtio.

And all is for having users to choose vDPA framework instead of using 
plain virtio.

We believe in our solution and we have a working prototype. We'll 
continue with our discussion to convince the community with it.

Thanks.

>
> Thanks
>
>
>>
>> The guest is running as usual. It doesn't aware on the migration at all.
>>
>> This is the point I try to make here. I don't (and I can't) change 
>> even 1 line of code in the guest.
>>
>> e.g:
>>
>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor 
>> (bounded to VF5) --> send admin command on PF adminq to start 
>> tracking dirty pages for VF5 --> PF device will do it
>>
>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor 
>> (bounded to VF5) --> send admin command on PF adminq to quiesce VF5 
>> --> PF device will do it
>>
>> You can take a look how we implement mlx5_vfio_pci in the link I 
>> provided.
>>
>>>
>>> Dave
>>>
>>>
>>>> We already do this in mlx5 NIC migration. The kernel is secured and 
>>>> QEMU
>>>> interface is the VF.
>>>>
>>>>> Dave
>>>>>
>>>>>>>>>>>> An example of this approach can be seen in the way NVIDIA 
>>>>>>>>>>>> performs
>>>>>>>>>>>> live migration of a ConnectX NIC function:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>>>>>>>
>>>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated 
>>>>>>>>>>>> software defined
>>>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for 
>>>>>>>>>>>> storage
>>>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its 
>>>>>>>>>>>> standard
>>>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>>>>>>>> specifications.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to implement Live Migration for these virtual 
>>>>>>>>>>>> function
>>>>>>>>>>>> devices, that use a standard drivers as mentioned, the 
>>>>>>>>>>>> specification
>>>>>>>>>>>> should define how HW vendor should build their devices and 
>>>>>>>>>>>> for SW
>>>>>>>>>>>> developers to adjust the drivers.
>>>>>>>>>>>>
>>>>>>>>>>>> This will enable specification compliant vendor agnostic 
>>>>>>>>>>>> solution.
>>>>>>>>>>>>
>>>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>>>>>>>> (internal HW design doc) and I guess that this is the way 
>>>>>>>>>>>> other
>>>>>>>>>>>> vendors work.
>>>>>>>>>>>>
>>>>>>>>>>>> For that, I would like to know if the approach of “PF that 
>>>>>>>>>>>> controls
>>>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO 
>>>>>>>>>>>> technical
>>>>>>>>>>>> group ?
>>>>>>>>>>>>
>>>>>>>>>>> I'm not sure but I think it's better to start from the general
>>>>>>>>>>> facility for all transports, then develop features for a 
>>>>>>>>>>> specific
>>>>>>>>>>> transport.
>>>>>>>>>> a general facility for all transports can be a generic admin 
>>>>>>>>>> queue ?
>>>>>>>>> It could be a virtqueue or a transport specific method (pcie 
>>>>>>>>> capability).
>>>>>>>> No. You said a general facility for all transports.
>>>>>>> For general facility, I mean the chapter 2 of the spec which is 
>>>>>>> general
>>>>>>>
>>>>>>> "
>>>>>>> 2 Basic Facilities of a Virtio Device
>>>>>>> "
>>>>>>>
>>>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I 
>>>>>> can add "2.12
>>>>>> Admin Virtqueues" and this is what I did in the RFC.
>>>>>>
>>>>>>>> Transport specific is not general.
>>>>>>> The transport is in charge of implementing the interface for 
>>>>>>> those facilities.
>>>>>> Transport specific is not general.
>>>>>>
>>>>>>
>>>>>>>>> E.g we can define what needs to be migrated for the virtio-blk 
>>>>>>>>> first
>>>>>>>>> (the device state). Then we can define the interface to get 
>>>>>>>>> and set
>>>>>>>>> those states via admin virtqueue. Such decoupling may ease the 
>>>>>>>>> future
>>>>>>>>> development of the transport specific migration interface.
>>>>>>>> I asked a simple question here.
>>>>>>>>
>>>>>>>> Lets stick to this.
>>>>>>> I answered this question.
>>>>>> No you didn't answer.
>>>>>>
>>>>>> I asked  if the approach of “PF that controls the VF live 
>>>>>> migration process”
>>>>>> is acceptable by the VIRTIO technical group ?
>>>>>>
>>>>>> And you take the discussion to your direction instead of 
>>>>>> answering a Yes/No
>>>>>> question.
>>>>>>
>>>>>>>      The virtqueue could be one of the
>>>>>>> approaches. And it's your responsibility to convince the community
>>>>>>> about that approach. Having an example may help people to 
>>>>>>> understand
>>>>>>> your proposal.
>>>>>>>
>>>>>>>> I'm not referring to internal state definitions.
>>>>>>> Without an example, how do we know if it can work well?
>>>>>>>
>>>>>>>> Can you please not change the subject of my initial intent in 
>>>>>>>> the email ?
>>>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
>>>>>>> your proposal.
>>>>>> The virtio-blk PF admin queue will be used to manage the 
>>>>>> virtio-blk VF
>>>>>> migration.
>>>>>>
>>>>>> This is the whole discussion. I don't want to get into resolution.
>>>>>>
>>>>>> Since you already know the answer as I published 4 RFCs already 
>>>>>> with all the
>>>>>> flow.
>>>>>>
>>>>>> Lets stick to my question.
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> -Max.
>>>>>>>>>>>>
>>>>>>>>>> This publicly archived list offers a means to provide input 
>>>>>>>>>> to the
>>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>>>>>
>>>>>>>>>> In order to verify user consent to the Feedback License terms 
>>>>>>>>>> and
>>>>>>>>>> to minimize spam in the list archive, subscription is required
>>>>>>>>>> before posting.
>>>>>>>>>>
>>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>>>>>> List archive: 
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=tqLsKdPEjF4uMUeqrwpeBD7J4q%2BrWKH1xRUXozJ4KEI%3D&amp;reserved=0
>>>>>>>>>> Feedback License: 
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2tg8X5uT17Vtj1hWZ7kEskf%2BeG9jV5HhJA3yepZksrg%3D&amp;reserved=0
>>>>>>>>>> List Guidelines: 
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ynuvd9xCIEDhTjjxuAgMFVCek%2B6wQGhGjnUyrsKCamM%3D&amp;reserved=0
>>>>>>>>>> Committee: 
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QyJw7q7fwlq8oFjiaI%2BZICxARQEvXhrhRbiKkIyFmik%3D&amp;reserved=0
>>>>>>>>>> Join OASIS: 
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=jebnTI98aw8LE0zC0BF2hiP6Rg3YsoKKunSVZz5GAlU%3D&amp;reserved=0
>>>>>>>>>>
>>>>>> This publicly archived list offers a means to provide input to the
>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>
>>>>>> In order to verify user consent to the Feedback License terms and
>>>>>> to minimize spam in the list archive, subscription is required
>>>>>> before posting.
>>>>>>
>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>> List archive: 
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=T5gTfTBJ9blLPnGVlidRZFIeqrgi%2BcHyOwnVP5xtBPY%3D&amp;reserved=0
>>>>>> Feedback License: 
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SmRyku3IZ8O5vITCXasVxF%2BEjcG7zb%2BDi8sZinSuvp4%3D&amp;reserved=0
>>>>>> List Guidelines: 
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=IV4C1Y%2BE%2Bq9OPzjyYYvoVrPp6975qyNZ%2Fn1%2B0eD58%2FA%3D&amp;reserved=0
>>>>>> Committee: 
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Bh0ufU4zKnzwspmmF0vN3tW3Ppi4RYfcT8pFzo%2FTT8%3D&amp;reserved=0
>>>>>> Join OASIS: 
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=jebnTI98aw8LE0zC0BF2hiP6Rg3YsoKKunSVZz5GAlU%3D&amp;reserved=0
>>>>>>
>>
>> This publicly archived list offers a means to provide input to the
>> OASIS Virtual I/O Device (VIRTIO) TC.
>>
>> In order to verify user consent to the Feedback License terms and
>> to minimize spam in the list archive, subscription is required
>> before posting.
>>
>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>> List help: virtio-comment-help@lists.oasis-open.org
>> List archive: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=T5gTfTBJ9blLPnGVlidRZFIeqrgi%2BcHyOwnVP5xtBPY%3D&amp;reserved=0
>> Feedback License: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SmRyku3IZ8O5vITCXasVxF%2BEjcG7zb%2BDi8sZinSuvp4%3D&amp;reserved=0
>> List Guidelines: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=IV4C1Y%2BE%2Bq9OPzjyYYvoVrPp6975qyNZ%2Fn1%2B0eD58%2FA%3D&amp;reserved=0
>> Committee: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Bh0ufU4zKnzwspmmF0vN3tW3Ppi4RYfcT8pFzo%2FTT8%3D&amp;reserved=0
>> Join OASIS: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=jebnTI98aw8LE0zC0BF2hiP6Rg3YsoKKunSVZz5GAlU%3D&amp;reserved=0
>>
>

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20  7:49                     ` Jason Wang
@ 2021-08-20 11:06                       ` Michael S. Tsirkin
  2021-08-23  3:20                         ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2021-08-20 11:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, virtio-comment, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Fri, Aug 20, 2021 at 03:49:55PM +0800, Jason Wang wrote:
> On Fri, Aug 20, 2021 at 3:04 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Aug 20, 2021 at 10:17:05AM +0800, Jason Wang wrote:
> > >
> > > 在 2021/8/19 下午10:58, Michael S. Tsirkin 写道:
> > > > On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > > > > > The PF device will have an option to quiesce/freeze the VF device.
> > > > >
> > > > > Is such design a must? If no, why not simply introduce those functions in
> > > > > the VF?
> > > > Many IOMMUs only support protections at the function level.
> > > > Thus we need ability to have one device (e.g. a PF)
> > > > to control migration of another (e.g. a VF).
> > >
> > >
> > > So as discussed previously, the only possible "advantage" is that the DMA is
> > > isolated.
> > >
> > >
> > > > This is because allowing VF to access hypervisor memory used for
> > > > migration is not a good idea.
> > > > For IOMMUs that support subfunctions, these "devices" could be
> > > > subfunctions.
> > > >
> > > > The only alternative is to keep things in device memory which
> > > > does not need an IOMMU.
> > > > I guess we'd end up with something like a VQ in device memory which might
> > > > be tricky from multiple points of view, but yes, this could be
> > > > useful and people did ask for such a capability in the past.
> > >
> > >
> > > I assume the spec already support this. We probably need some clarification
> > > at the transport layer. But it's as simple as setting MMIO are as virtqueue
> > > address?
> >
> > Several issues
> > - we do not support changing VQ address. Devices do need to support
> >   changing memory addresses.
> 
> So it looks like a transport specific requirement (PCI-E) instead of a
> general issue.
> 
> > - Ordering becomes tricky especially .
> >   E.g. when device reads descriptor in VQ
> >   memory it suddenly does not flush out writes into buffer
> >   that is potentially in RAM. We might also need even stronger
> >   barriers on the driver side. We used dma_wmb but now it's
> >   probably need to be wmb.
> >   Reading multibyte structures from device memory is slow.
> >   To get reasonable performance we might need to mark this device memory
> >   WB or WC. That generally makes things even trickier.
> 
> I agree, but still they are all transport specific requirements. If we
> do that in a PCI-E BAR, the driver must obey the ordering rule for PCI
> to make it work.
> >
> >
> > > Except for the dirty bit tracking, we don't have bulk data that needs to be
> > > transferred during migration. So a virtqueue is not must even in this case.
> >
> > Main traffic is write tracking.
> 
> Right.
> 
> >
> >
> > >
> > > >
> > > > > If yes, what's the reason for making virtio different (e.g VCPU live
> > > > > migration is not designed like that)?
> > > > I think the main difference is we need PF's help for memory
> > > > tracking for pre-copy migration anyway.
> > >
> > >
> > > Such kind of memory tracking is not a must. KVM uses software assisted
> > > technologies (write protection) and it works very well.
> >
> > So page-fault support is absolutely a viable option IMHO.
> > To work well we need VIRTIO_F_PARTIAL_ORDER - there was not
> > a lot of excitement but sure I will finalize and repost it.
> 
> As discussed before, it looks like a performance optimization but not a must?
> 
> I guess we don't do that for KVM and it works well.

Depends on type of device. For networking it's a problem because it is
driven by outside events so it keeps going leading to packet drops which
is a quality of implementation issue, not an optimization.
Same thing with e.g. audio I suspect. Maybe graphics.  For KVM and
e.g. storage it's more of a performance issue.


> >
> >
> > However we need support for reporting and handling faults.
> > Again this is data path stuff and needs to be under
> > hypervisor control so I guess we get right back
> > to having this in the PF?
> 
> So it depends on whether it requires a DMA. If it's just something
> like a CR2 register, we don't need PF.

We won't strictly need it but it is a well understood model,
working well with e.g. vfio. It makes sense to support it.

> >
> >
> >
> >
> >
> > > For virtio,
> > > technology like shadow virtqueue has been used by DPDK and prototyped by
> > > Eugenio.
> >
> > That's ok but I think since it affects performance at 100% of the
> > time when active we can not rely on this as the only solution.
> 
> This part I don't understand:
> 
> - KVM writes protect the pages, so it loses performance as well.
> - If we are using virtqueue for reporting dirty bitmap, it can easily
> run out of space and we will lose the performance as well
> - If we are using bitmap/bytemap, we may also losing the performance
> (e.g the huge footprint) or at PCI level
> 
> So I'm not against the idea, what I think makes more sense is not
> limit the facilities like device states, dirty page tracking to the
> PF.

It could be a cross-device facility that can support PF but
also other forms of communication, yes.


> >
> >
> > > Even if we want to go with hardware technology, we have many alternatives
> > > (as we've discussed in the past):
> > >
> > > 1) IOMMU dirty bit (E.g modern IOMMU have EA bit for logging external device
> > > write)
> > > 2) Write protection via IOMMU or device MMU
> > > 3) Address space ID for isolating DMAs
> >
> > Not all systems support any of the above unfortunately.
> >
> 
> Yes. But we know the platform (AMD/Intel/ARM) will be ready soon for
> them in the near future.

know and future in the same sentence make an oxymoron ;)

> > Also some systems might have a limited # of PASIDs.
> > So burning up a extra PASID per VF halving their
> > number might not be great as the only option.
> 
> Yes, so I think we agree that we should not limit the spec to work on
> a specific configuration (e.g the device with PF).

That makes sense to me.

> >
> >
> > >
> > > Using physical function is sub-optimal that all of the above since:
> > >
> > > 1) limited to a specific transport or implementation and it doesn't work for
> > > device or transport without PF
> > > 2) the virtio level function is not self contained, this makes any feature
> > > that ties to PF impossible to be used in the nested layer
> > > 3) more complicated than leveraging the existing facilities provided by the
> > > platform or transport
> >
> > I think I disagree with 2 and 3 above simply because controlling VFs through
> > a PF is how all other devices did this.
> 
> For management and provision yes. For other features, the answer is
> not. This is simply because most hardware vendors don't consider
> whether or not a feature could be virtualized. That's fine for them
> but not us. E.g if we limit the feature A to PF. It means feature A
> can't be used by guests. My understanding is that we'd better not
> introduce a feature that is hard to be virtualized.

I'm not sure what do you mean when you say management but I guess
at least stuff that ip link does normally:


               [ vf NUM [ mac LLADDR ]
                        [ VFVLAN-LIST ]
                        [ rate TXRATE ]
                        [ max_tx_rate TXRATE ]
                        [ min_tx_rate TXRATE ]
                        [ spoofchk { on | off } ]
                        [ query_rss { on | off } ]
                        [ state { auto | enable | disable } ]
                        [ trust { on | off } ]
                        [ node_guid eui64 ]
                        [ port_guid eui64 ] ]


is fair game ...

> > About 1 - well this is
> > just about us being smart and writing this in a way that is
> > generic enough, right?
> 
> That's exactly my question and my point, I know it can be done in the
> PF. What I'm asking is "why it must be in the PF".
> 
> And I'm trying to convince Max to introduce those features as "basic
> device facilities" instead of doing that in the "admin virtqueue" or
> other stuff that belongs to PF.

Let's say it's not in a PF, I think it needs some way to be separate so
we don't need lots of logic in the hypervisor to handle that.
So from that POV admin queue is ok. In fact
from my POV admin queue is suffering in that it does not focus on cross
device communication enough, not that it's doing that too much.

> > E.g. include options for PASIDs too.
> >
> > Note that support for cross-device addressing is useful
> > even outside of migration.  We also have things like
> > priority where it is useful to adjust properties of
> > a VF on the fly while it is active. Again the normal way
> > all devices do this is through a PF. Yes a bunch of tricks
> > in QEMU is possible but having a driver in host kernel
> > and just handle it in a contained way is way cleaner.
> >
> >
> > > Consider (P)ASID will be ready very soon, workaround the platform limitation
> > > via PF is not a good idea for me. Especially consider it's not a must and we
> > > had already prototype the software assisted technology.
> >
> > Well PASID is just one technology.
> 
> Yes, devices are allowed to have their own function to isolate DMA. I
> mentioned PASID just because it is the most popular technology.
> 
> >
> >
> > >
> > > >   Might as well integrate
> > > > the rest of state in the same channel.
> > >
> > >
> > > That's another question. I think for the function that is a must for doing
> > > live migration, introducing them in the function itself is the most natural
> > > way since we did all the other facilities there. This ease the function that
> > > can be used in the nested layer.
> > >
> > > And using the channel in the PF is not coming for free. It requires
> > > synchronization in the software or even QOS.
> > >
> > > Or we can just separate the dirty page tracking into PF (but need to define
> > > them as basic facility for future extension).
> >
> > Well maybe just start focusing on write tracking, sure.
> > Once there's a proposal for this we can see whether
> > adding other state there is easier or harder.
> 
> Fine with me.
> 
> >
> >
> > >
> > > >
> > > > Another answer is that CPUs trivially switch between
> > > > functions by switching the active page tables. For PCI DMA
> > > > it is all much trickier sine the page tables can be separate
> > > > from the device, and assumed to be mostly static.
> > >
> > >
> > > I don't see much different, the page table is also separated from the CPU.
> > > If the device supports state save and restore we can scheduling the multiple
> > > VMs/VCPUs on the same device.
> >
> > It's just that performance is terrible. If you keep losing packets
> > migration might as well not be live.
> 
> I don't measure the performance. But I believe the shadow virtqueue
> should perform better than kernel vhost-net backends.
> 
> If it's not, we can switch to vhost-net if necessary and we know it
> works well for the live migration.

Well but not as fast as hardware offloads with faults would be,
which can potentially go full speed as long as you are lucky
and do not hit too many faults.

> >
> > >
> > > > So if you want to create something like the VMCS then
> > > > again you either need some help from another device or
> > > > put it in device memory.
> > >
> > >
> > > For CPU virtualization, the states could be saved and restored via MSRs. For
> > > virtio, accessing them via registers is also possible and much more simple.
> > >
> > > Thanks
> >
> > IMy guess is performance is going to be bad. MSRs are part of the
> > same CPU that is executing the accesses....
> 
> I'm not sure but it's how current VMX or SVM did.
> 
> Thanks

Yes but again. moving state of the CPU around is faster than
pulling it across the PCI-E bus.

> >
> > >
> > > >
> > > >
> >


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20 10:26                       ` Max Gurtovoy
@ 2021-08-20 11:16                         ` Jason Wang
  2021-08-22 10:05                           ` Max Gurtovoy
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-20 11:16 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Dr. David Alan Gilbert, virtio-comment, Michael S. Tsirkin,
	cohuck, Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan,
	Bodong Wang, Jason Gunthorpe, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Fri, Aug 20, 2021 at 6:26 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>
> On 8/20/2021 5:24 AM, Jason Wang wrote:
> >
> > 在 2021/8/19 下午11:20, Max Gurtovoy 写道:
> >>
> >> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
> >>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> >>>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
> >>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> >>>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
> >>>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy
> >>>>>>> <mgurtovoy@nvidia.com> wrote:
> >>>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
> >>>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy
> >>>>>>>>> <mgurtovoy@nvidia.com> wrote:
> >>>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
> >>>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Live migration is one of the most important features of
> >>>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
> >>>>>>>>>>>> environments.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The migration process is managed by a migration SW that is
> >>>>>>>>>>>> running on
> >>>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state
> >>>>>>>>>>>> resides in
> >>>>>>>>>>>> the HW.
> >>>>>>>>>>>>
> >>>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually
> >>>>>>>>>>> from the view
> >>>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is
> >>>>>>>>>>> stored in
> >>>>>>>>>>> the software or hardware. A well designed VMM should be able
> >>>>>>>>>>> to hide
> >>>>>>>>>>> the virtio device implementation from the migration layer,
> >>>>>>>>>>> that is how
> >>>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a
> >>>>>>>>>>> software
> >>>>>>>>>>> virtio/vDPA device or not.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> In our vision, in order to fulfil the Live migration
> >>>>>>>>>>>> requirements for
> >>>>>>>>>>>> virtual functions, each physical function device must
> >>>>>>>>>>>> implement
> >>>>>>>>>>>> migration operations. Using these operations, it will be
> >>>>>>>>>>>> able to
> >>>>>>>>>>>> master the migration process for the virtual function
> >>>>>>>>>>>> devices. Each
> >>>>>>>>>>>> capable physical function device has a supervisor
> >>>>>>>>>>>> permissions to
> >>>>>>>>>>>> change the virtual function operational states,
> >>>>>>>>>>>> save/restore its
> >>>>>>>>>>>> internal state and start/stop dirty pages tracking.
> >>>>>>>>>>>>
> >>>>>>>>>>> For "supervisor permissions", is this from the software
> >>>>>>>>>>> point of view?
> >>>>>>>>>>> Maybe it's better to give an example for this.
> >>>>>>>>>> A permission to a PF device for quiesce and freeze a VF
> >>>>>>>>>> device for example.
> >>>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running
> >>>>>>>>> without any privileges.
> >>>>>>>> You're mixing layers here.
> >>>>>>>>
> >>>>>>>> QEMU is not involved here. It's only sending IOCTLs to
> >>>>>>>> migration driver.
> >>>>>>>> The migration driver will control the migration process of the
> >>>>>>>> VF using
> >>>>>>>> the PF communication channel.
> >>>>>>> So who will be granted the "permission" you mentioned here?
> >>>>>> This is just an expression.
> >>>>>>
> >>>>>> What is not clear ?
> >>>>>>
> >>>>>> The PF device will have an option to quiesce/freeze the VF device.
> >>>>>>
> >>>>>> This is simple. Why are you looking for some sophisticated
> >>>>>> problems ?
> >>>>> I'm trying to follow along here and have not completely; but I
> >>>>> think the issue is a
> >>>>> security separation one.
> >>>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
> >>>>> isolated and shouldn't be able to go poking at other devices; so it
> >>>>> can't go poking at the PF (it probably doesn't even have the PF
> >>>>> device
> >>>>> node accessible) - so then the question is who has access to the
> >>>>> migration driver and how do you make sure it can only deal with VF's
> >>>>> that it's supposed to be able to migrate.
> >>>> The QEMU/userspace doesn't know or care about the PF connection and
> >>>> internal
> >>>> virtio_vfio_pci driver implementation.
> >>> OK
> >>>
> >>>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
> >>> Hmm OK.
> >>>
> >>>> QEMU does not have access to the PF. Only the kernel driver that
> >>>> has access
> >>>> to the VF will have access to the PF communication channel. There
> >>>> is no
> >>>> permission problem here.
> >>>>
> >>>> The kernel driver of the VF will do this internally, and make sure
> >>>> that the
> >>>> commands it build will only impact the VF originating them.
> >>>>
> >>> Now that confuses me; isn't the kernel driver that has access to the VF
> >>> running inside the guest?  If it's inside the guest we can't trust
> >>> it to
> >>> do anything about stopping impact to other devices.
> >>
> >> No. The driver is in the hypervisor (virtio_vfio_pci). This is the
> >> migration driver, right ?
> >
> >
> > Well, talking things like virtio_vfio_pci that is not mentioned before
> > and not justified on the list may easily confuse people. As pointed
> > out in another thread, it has too many disadvantages over the existing
> > virtio-pci vdpa driver. And it just duplicates a partial function of
> > what virtio-pci vdpa driver can do. I don't think we will go that way.
>
> This was just an example for David to help with understanding the
> solution since he thought that the guest drivers somehow should be changed.
>
> David I'm sorry if I confused you.
>
> Again Jason, you try to propose your vDPA solution that is not what
> we're trying to achieve in this work. Think of a world without vDPA.

Well, I'd say, let's think vDPA a superset of virtio, not just the
acceleration technologies.

> Also I don't understand how vDPA is related to virtio specification
> decisions ?

So how is VFIO related to virtio specific decisions? That's why I
think we should avoid talking about software architecture here. It's
the wrong community.

>  make vDPA into virtio and then we can open a discussion.
>
> I'm interesting in virtio migration of HW devices.
>
> The proposal in this thread is actually get support from Michal AFAIU
> and also others were happy with. All beside of you.

So I think I've clairfied my several times :(

- I'm fairly ok with the proposal
- but we decouple the basic facility out of the admin virtqueue and
this seems agreed by Michael:

Let's take the dirty page tracking as an example:

1) let's first define that as one of the basic facility
2) then we can introduce admin virtqueue or other stuffs as an
interface for that facility

Does this work for you?

>
> We do it in mlx5 and we didn't see any issues with that design.
>

If we seperate things as I suggested, I'm totally fine.

> I don't think you can say that we "go that way".

For "go that way" I meant the method of using vfio_virtio_pci, it has
nothing related to the discussion of "using PF to control VF" on the
spec.

>
> You're trying to build a complementary solution for creating scalable
> functions and for some reason trying to sabotage NVIDIA efforts to add
> new important functionality to virtio.

Well, it's a completely different topic. And it doesn't conflict with
anything that is proposed here by you. I think I've stated this
several times.  I don't think we block each other, it's just some
unification work if one of the proposals is merged first. I sent them
recently because it will be used as a material for my talk on the KVM
Forum which is really near.

>
> This also sabotage the evolvment of virtio as a standard.
>
> You're trying to enforce some un-finished idea that should work on some
> future specific HW platform instead of helping defining a good spec for
> virtio.

Let's open another thread for this if you wish, it has nothing related
to the spec but how it is implemented in Linux. If you search the
archive, something similar to "vfio_virtio_pci" has been proposed
several years before by Intel. The idea has been rejected, and we have
leveraged Linux vDPA bus for virtio-pci devices.

>
> And all is for having users to choose vDPA framework instead of using
> plain virtio.
>
> We believe in our solution and we have a working prototype. We'll
> continue with our discussion to convince the community with it.

Again, it looks like there's a lot of misunderstanding. Let's open a
thread on the suitable list instead of talking about any specific
software solution or architecture here. This will speed up things.

Thanks

>
> Thanks.
>
> >
> > Thanks
> >
> >
> >>
> >> The guest is running as usual. It doesn't aware on the migration at all.
> >>
> >> This is the point I try to make here. I don't (and I can't) change
> >> even 1 line of code in the guest.
> >>
> >> e.g:
> >>
> >> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
> >> (bounded to VF5) --> send admin command on PF adminq to start
> >> tracking dirty pages for VF5 --> PF device will do it
> >>
> >> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
> >> (bounded to VF5) --> send admin command on PF adminq to quiesce VF5
> >> --> PF device will do it
> >>
> >> You can take a look how we implement mlx5_vfio_pci in the link I
> >> provided.
> >>
> >>>
> >>> Dave
> >>>
> >>>
> >>>> We already do this in mlx5 NIC migration. The kernel is secured and
> >>>> QEMU
> >>>> interface is the VF.
> >>>>
> >>>>> Dave
> >>>>>
> >>>>>>>>>>>> An example of this approach can be seen in the way NVIDIA
> >>>>>>>>>>>> performs
> >>>>>>>>>>>> live migration of a ConnectX NIC function:
> >>>>>>>>>>>>
> >>>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> >>>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> >>>>>>>>>>>>
> >>>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated
> >>>>>>>>>>>> software defined
> >>>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for
> >>>>>>>>>>>> storage
> >>>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its
> >>>>>>>>>>>> standard
> >>>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
> >>>>>>>>>>>> specifications.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In order to implement Live Migration for these virtual
> >>>>>>>>>>>> function
> >>>>>>>>>>>> devices, that use a standard drivers as mentioned, the
> >>>>>>>>>>>> specification
> >>>>>>>>>>>> should define how HW vendor should build their devices and
> >>>>>>>>>>>> for SW
> >>>>>>>>>>>> developers to adjust the drivers.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This will enable specification compliant vendor agnostic
> >>>>>>>>>>>> solution.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
> >>>>>>>>>>>> (internal HW design doc) and I guess that this is the way
> >>>>>>>>>>>> other
> >>>>>>>>>>>> vendors work.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For that, I would like to know if the approach of “PF that
> >>>>>>>>>>>> controls
> >>>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO
> >>>>>>>>>>>> technical
> >>>>>>>>>>>> group ?
> >>>>>>>>>>>>
> >>>>>>>>>>> I'm not sure but I think it's better to start from the general
> >>>>>>>>>>> facility for all transports, then develop features for a
> >>>>>>>>>>> specific
> >>>>>>>>>>> transport.
> >>>>>>>>>> a general facility for all transports can be a generic admin
> >>>>>>>>>> queue ?
> >>>>>>>>> It could be a virtqueue or a transport specific method (pcie
> >>>>>>>>> capability).
> >>>>>>>> No. You said a general facility for all transports.
> >>>>>>> For general facility, I mean the chapter 2 of the spec which is
> >>>>>>> general
> >>>>>>>
> >>>>>>> "
> >>>>>>> 2 Basic Facilities of a Virtio Device
> >>>>>>> "
> >>>>>>>
> >>>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I
> >>>>>> can add "2.12
> >>>>>> Admin Virtqueues" and this is what I did in the RFC.
> >>>>>>
> >>>>>>>> Transport specific is not general.
> >>>>>>> The transport is in charge of implementing the interface for
> >>>>>>> those facilities.
> >>>>>> Transport specific is not general.
> >>>>>>
> >>>>>>
> >>>>>>>>> E.g we can define what needs to be migrated for the virtio-blk
> >>>>>>>>> first
> >>>>>>>>> (the device state). Then we can define the interface to get
> >>>>>>>>> and set
> >>>>>>>>> those states via admin virtqueue. Such decoupling may ease the
> >>>>>>>>> future
> >>>>>>>>> development of the transport specific migration interface.
> >>>>>>>> I asked a simple question here.
> >>>>>>>>
> >>>>>>>> Lets stick to this.
> >>>>>>> I answered this question.
> >>>>>> No you didn't answer.
> >>>>>>
> >>>>>> I asked  if the approach of “PF that controls the VF live
> >>>>>> migration process”
> >>>>>> is acceptable by the VIRTIO technical group ?
> >>>>>>
> >>>>>> And you take the discussion to your direction instead of
> >>>>>> answering a Yes/No
> >>>>>> question.
> >>>>>>
> >>>>>>>      The virtqueue could be one of the
> >>>>>>> approaches. And it's your responsibility to convince the community
> >>>>>>> about that approach. Having an example may help people to
> >>>>>>> understand
> >>>>>>> your proposal.
> >>>>>>>
> >>>>>>>> I'm not referring to internal state definitions.
> >>>>>>> Without an example, how do we know if it can work well?
> >>>>>>>
> >>>>>>>> Can you please not change the subject of my initial intent in
> >>>>>>>> the email ?
> >>>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
> >>>>>>> your proposal.
> >>>>>> The virtio-blk PF admin queue will be used to manage the
> >>>>>> virtio-blk VF
> >>>>>> migration.
> >>>>>>
> >>>>>> This is the whole discussion. I don't want to get into resolution.
> >>>>>>
> >>>>>> Since you already know the answer as I published 4 RFCs already
> >>>>>> with all the
> >>>>>> flow.
> >>>>>>
> >>>>>> Lets stick to my question.
> >>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>>> Thanks.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Max.
> >>>>>>>>>>>>
> >>>>>>>>>> This publicly archived list offers a means to provide input
> >>>>>>>>>> to the
> >>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>>>>>>>
> >>>>>>>>>> In order to verify user consent to the Feedback License terms
> >>>>>>>>>> and
> >>>>>>>>>> to minimize spam in the list archive, subscription is required
> >>>>>>>>>> before posting.
> >>>>>>>>>>
> >>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>>>>>>>> List archive:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=tqLsKdPEjF4uMUeqrwpeBD7J4q%2BrWKH1xRUXozJ4KEI%3D&amp;reserved=0
> >>>>>>>>>> Feedback License:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2tg8X5uT17Vtj1hWZ7kEskf%2BeG9jV5HhJA3yepZksrg%3D&amp;reserved=0
> >>>>>>>>>> List Guidelines:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ynuvd9xCIEDhTjjxuAgMFVCek%2B6wQGhGjnUyrsKCamM%3D&amp;reserved=0
> >>>>>>>>>> Committee:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017968126%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QyJw7q7fwlq8oFjiaI%2BZICxARQEvXhrhRbiKkIyFmik%3D&amp;reserved=0
> >>>>>>>>>> Join OASIS:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=jebnTI98aw8LE0zC0BF2hiP6Rg3YsoKKunSVZz5GAlU%3D&amp;reserved=0
> >>>>>>>>>>
> >>>>>> This publicly archived list offers a means to provide input to the
> >>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>>>
> >>>>>> In order to verify user consent to the Feedback License terms and
> >>>>>> to minimize spam in the list archive, subscription is required
> >>>>>> before posting.
> >>>>>>
> >>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>>>> List archive:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=T5gTfTBJ9blLPnGVlidRZFIeqrgi%2BcHyOwnVP5xtBPY%3D&amp;reserved=0
> >>>>>> Feedback License:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SmRyku3IZ8O5vITCXasVxF%2BEjcG7zb%2BDi8sZinSuvp4%3D&amp;reserved=0
> >>>>>> List Guidelines:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=IV4C1Y%2BE%2Bq9OPzjyYYvoVrPp6975qyNZ%2Fn1%2B0eD58%2FA%3D&amp;reserved=0
> >>>>>> Committee:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Bh0ufU4zKnzwspmmF0vN3tW3Ppi4RYfcT8pFzo%2FTT8%3D&amp;reserved=0
> >>>>>> Join OASIS:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=jebnTI98aw8LE0zC0BF2hiP6Rg3YsoKKunSVZz5GAlU%3D&amp;reserved=0
> >>>>>>
> >>
> >> This publicly archived list offers a means to provide input to the
> >> OASIS Virtual I/O Device (VIRTIO) TC.
> >>
> >> In order to verify user consent to the Feedback License terms and
> >> to minimize spam in the list archive, subscription is required
> >> before posting.
> >>
> >> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >> List help: virtio-comment-help@lists.oasis-open.org
> >> List archive:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=T5gTfTBJ9blLPnGVlidRZFIeqrgi%2BcHyOwnVP5xtBPY%3D&amp;reserved=0
> >> Feedback License:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SmRyku3IZ8O5vITCXasVxF%2BEjcG7zb%2BDi8sZinSuvp4%3D&amp;reserved=0
> >> List Guidelines:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=IV4C1Y%2BE%2Bq9OPzjyYYvoVrPp6975qyNZ%2Fn1%2B0eD58%2FA%3D&amp;reserved=0
> >> Committee:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Bh0ufU4zKnzwspmmF0vN3tW3Ppi4RYfcT8pFzo%2FTT8%3D&amp;reserved=0
> >> Join OASIS:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C12187bc5e0d94abdd7bb08d96381b438%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650231017978092%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=jebnTI98aw8LE0zC0BF2hiP6Rg3YsoKKunSVZz5GAlU%3D&amp;reserved=0
> >>
> >
>
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
>
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
>
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20 11:16                         ` Jason Wang
@ 2021-08-22 10:05                           ` Max Gurtovoy
  2021-08-23  3:10                             ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-22 10:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Dr. David Alan Gilbert, virtio-comment, Michael S. Tsirkin,
	cohuck, Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan,
	Bodong Wang, Jason Gunthorpe, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer


On 8/20/2021 2:16 PM, Jason Wang wrote:
> On Fri, Aug 20, 2021 at 6:26 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>
>> On 8/20/2021 5:24 AM, Jason Wang wrote:
>>> 在 2021/8/19 下午11:20, Max Gurtovoy 写道:
>>>> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>>>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
>>>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>>>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
>>>>>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy
>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>>>>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy
>>>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Live migration is one of the most important features of
>>>>>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>>>>>>>>>> environments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The migration process is managed by a migration SW that is
>>>>>>>>>>>>>> running on
>>>>>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state
>>>>>>>>>>>>>> resides in
>>>>>>>>>>>>>> the HW.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually
>>>>>>>>>>>>> from the view
>>>>>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is
>>>>>>>>>>>>> stored in
>>>>>>>>>>>>> the software or hardware. A well designed VMM should be able
>>>>>>>>>>>>> to hide
>>>>>>>>>>>>> the virtio device implementation from the migration layer,
>>>>>>>>>>>>> that is how
>>>>>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a
>>>>>>>>>>>>> software
>>>>>>>>>>>>> virtio/vDPA device or not.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our vision, in order to fulfil the Live migration
>>>>>>>>>>>>>> requirements for
>>>>>>>>>>>>>> virtual functions, each physical function device must
>>>>>>>>>>>>>> implement
>>>>>>>>>>>>>> migration operations. Using these operations, it will be
>>>>>>>>>>>>>> able to
>>>>>>>>>>>>>> master the migration process for the virtual function
>>>>>>>>>>>>>> devices. Each
>>>>>>>>>>>>>> capable physical function device has a supervisor
>>>>>>>>>>>>>> permissions to
>>>>>>>>>>>>>> change the virtual function operational states,
>>>>>>>>>>>>>> save/restore its
>>>>>>>>>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> For "supervisor permissions", is this from the software
>>>>>>>>>>>>> point of view?
>>>>>>>>>>>>> Maybe it's better to give an example for this.
>>>>>>>>>>>> A permission to a PF device for quiesce and freeze a VF
>>>>>>>>>>>> device for example.
>>>>>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running
>>>>>>>>>>> without any privileges.
>>>>>>>>>> You're mixing layers here.
>>>>>>>>>>
>>>>>>>>>> QEMU is not involved here. It's only sending IOCTLs to
>>>>>>>>>> migration driver.
>>>>>>>>>> The migration driver will control the migration process of the
>>>>>>>>>> VF using
>>>>>>>>>> the PF communication channel.
>>>>>>>>> So who will be granted the "permission" you mentioned here?
>>>>>>>> This is just an expression.
>>>>>>>>
>>>>>>>> What is not clear ?
>>>>>>>>
>>>>>>>> The PF device will have an option to quiesce/freeze the VF device.
>>>>>>>>
>>>>>>>> This is simple. Why are you looking for some sophisticated
>>>>>>>> problems ?
>>>>>>> I'm trying to follow along here and have not completely; but I
>>>>>>> think the issue is a
>>>>>>> security separation one.
>>>>>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
>>>>>>> isolated and shouldn't be able to go poking at other devices; so it
>>>>>>> can't go poking at the PF (it probably doesn't even have the PF
>>>>>>> device
>>>>>>> node accessible) - so then the question is who has access to the
>>>>>>> migration driver and how do you make sure it can only deal with VF's
>>>>>>> that it's supposed to be able to migrate.
>>>>>> The QEMU/userspace doesn't know or care about the PF connection and
>>>>>> internal
>>>>>> virtio_vfio_pci driver implementation.
>>>>> OK
>>>>>
>>>>>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
>>>>> Hmm OK.
>>>>>
>>>>>> QEMU does not have access to the PF. Only the kernel driver that
>>>>>> has access
>>>>>> to the VF will have access to the PF communication channel. There
>>>>>> is no
>>>>>> permission problem here.
>>>>>>
>>>>>> The kernel driver of the VF will do this internally, and make sure
>>>>>> that the
>>>>>> commands it build will only impact the VF originating them.
>>>>>>
>>>>> Now that confuses me; isn't the kernel driver that has access to the VF
>>>>> running inside the guest?  If it's inside the guest we can't trust
>>>>> it to
>>>>> do anything about stopping impact to other devices.
>>>> No. The driver is in the hypervisor (virtio_vfio_pci). This is the
>>>> migration driver, right ?
>>>
>>> Well, talking things like virtio_vfio_pci that is not mentioned before
>>> and not justified on the list may easily confuse people. As pointed
>>> out in another thread, it has too many disadvantages over the existing
>>> virtio-pci vdpa driver. And it just duplicates a partial function of
>>> what virtio-pci vdpa driver can do. I don't think we will go that way.
>> This was just an example for David to help with understanding the
>> solution since he thought that the guest drivers somehow should be changed.
>>
>> David I'm sorry if I confused you.
>>
>> Again Jason, you try to propose your vDPA solution that is not what
>> we're trying to achieve in this work. Think of a world without vDPA.
> Well, I'd say, let's think vDPA a superset of virtio, not just the
> acceleration technologies.

I'm sorry but vDPA is not relevant to this discussion.

Anyhow, I don't see any problem for vDPA driver to work on top of the 
design proposed here.

>> Also I don't understand how vDPA is related to virtio specification
>> decisions ?
> So how is VFIO related to virtio specific decisions? That's why I
> think we should avoid talking about software architecture here. It's
> the wrong community.

VFIO is not related to virtio spec.

It was an example for David. What is the problem with giving examples to 
ease on people to understand the solution ?

Where did you see that the design is referring to VFIO ?

>
>>   make vDPA into virtio and then we can open a discussion.
>>
>> I'm interesting in virtio migration of HW devices.
>>
>> The proposal in this thread is actually get support from Michal AFAIU
>> and also others were happy with. All beside of you.
> So I think I've clairfied my several times :(
>
> - I'm fairly ok with the proposal

It doesn't seems like that.

> - but we decouple the basic facility out of the admin virtqueue and
> this seems agreed by Michael:
>
> Let's take the dirty page tracking as an example:
>
> 1) let's first define that as one of the basic facility
> 2) then we can introduce admin virtqueue or other stuffs as an
> interface for that facility
>
> Does this work for you?

What I really want is to agree that the right way to manage migration 
process of a virtio VF. My proposal is doing so by creating a 
communication channel in its parent PF.

I think I got a confirmation here.

This communication channel is not introduced in this thread, but 
obviously it should be an adminq.

For your future scalable functions, the Parent Device (lets call it PD) 
will manage the creation/migration/destruction process for its Virtual 
Devices (lets call them VDs) using the PD adminq.

Agreed ?

Please don't answer that this is not a "must". This is my proposal. If 
you have another proposal, please propose.

>
>> We do it in mlx5 and we didn't see any issues with that design.
>>
> If we seperate things as I suggested, I'm totally fine.

separate what ?

Why should I create different interfaces for different management tasks.

I have a virtual/scalable device that I want to  refer to from the 
physical/parent device using some interface.

This interface is adminq. This interface will be used for dirty_page 
tracking and operational state changing and get/set internal state as 
well. And more (create/destroy SF for example).

You can think of this in some other way, i'm fine with it. As long as 
the final conclusion is the same.

>
>> I don't think you can say that we "go that way".
> For "go that way" I meant the method of using vfio_virtio_pci, it has
> nothing related to the discussion of "using PF to control VF" on the
> spec.

This was an example. Please leave it as an example for David.


>> You're trying to build a complementary solution for creating scalable
>> functions and for some reason trying to sabotage NVIDIA efforts to add
>> new important functionality to virtio.
> Well, it's a completely different topic. And it doesn't conflict with
> anything that is proposed here by you. I think I've stated this
> several times.  I don't think we block each other, it's just some
> unification work if one of the proposals is merged first. I sent them
> recently because it will be used as a material for my talk on the KVM
> Forum which is really near.

In theory you're right. We shouldn't block each other, and I don't block 
you. But for some reason I see that you do try to block my proposal and 
I don't understand why.

I feel like I wasted 2 months on a discussion instead of progressing.

But now I do see a progress. A PF to manage VF migration is the way to 
go forward.

And the following RFC will take this into consideration.

>
>> This also sabotage the evolvment of virtio as a standard.
>>
>> You're trying to enforce some un-finished idea that should work on some
>> future specific HW platform instead of helping defining a good spec for
>> virtio.
> Let's open another thread for this if you wish, it has nothing related
> to the spec but how it is implemented in Linux. If you search the
> archive, something similar to "vfio_virtio_pci" has been proposed
> several years before by Intel. The idea has been rejected, and we have
> leveraged Linux vDPA bus for virtio-pci devices.

I don't know this history. And I will happy to hear about it one day.

But for our discussion in Linux, virtio_vfio_pci will happen. And it 
will implement the migration logic of a virtio device with PCI transport 
for VFs using the PF admin queue.

We at NVIDIA, currently upstreaming (alongside with AlexW and Cornelia) 
a vfio-pci separation that will enable an easy creation of vfio-pci 
vendor/protocol drivers to do some specific tasks.

New drivers such as mlx5_vfio_pci, hns_vfio_pci, virtio_vfio_pci and 
nvme_vfio_pci should be implemented in the near future in Linux to 
enable migration of these devices.

This is just an example. And it's not related to the spec nor the 
proposal at all.

>
>> And all is for having users to choose vDPA framework instead of using
>> plain virtio.
>>
>> We believe in our solution and we have a working prototype. We'll
>> continue with our discussion to convince the community with it.
> Again, it looks like there's a lot of misunderstanding. Let's open a
> thread on the suitable list instead of talking about any specific
> software solution or architecture here. This will speed up things.

I prefer to finish the specification first. SW arch is clear for us in 
Linux. We did it already for mlx5 devices and it will be the same for 
virtio if the spec changes will be accepted.

Thanks.


>
> Thanks
>
>> Thanks.
>>
>>> Thanks
>>>
>>>
>>>> The guest is running as usual. It doesn't aware on the migration at all.
>>>>
>>>> This is the point I try to make here. I don't (and I can't) change
>>>> even 1 line of code in the guest.
>>>>
>>>> e.g:
>>>>
>>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
>>>> (bounded to VF5) --> send admin command on PF adminq to start
>>>> tracking dirty pages for VF5 --> PF device will do it
>>>>
>>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
>>>> (bounded to VF5) --> send admin command on PF adminq to quiesce VF5
>>>> --> PF device will do it
>>>>
>>>> You can take a look how we implement mlx5_vfio_pci in the link I
>>>> provided.
>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>>> We already do this in mlx5 NIC migration. The kernel is secured and
>>>>>> QEMU
>>>>>> interface is the VF.
>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>>>>>>>> An example of this approach can be seen in the way NVIDIA
>>>>>>>>>>>>>> performs
>>>>>>>>>>>>>> live migration of a ConnectX NIC function:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated
>>>>>>>>>>>>>> software defined
>>>>>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for
>>>>>>>>>>>>>> storage
>>>>>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its
>>>>>>>>>>>>>> standard
>>>>>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>>>>>>>>>> specifications.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to implement Live Migration for these virtual
>>>>>>>>>>>>>> function
>>>>>>>>>>>>>> devices, that use a standard drivers as mentioned, the
>>>>>>>>>>>>>> specification
>>>>>>>>>>>>>> should define how HW vendor should build their devices and
>>>>>>>>>>>>>> for SW
>>>>>>>>>>>>>> developers to adjust the drivers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This will enable specification compliant vendor agnostic
>>>>>>>>>>>>>> solution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>>>>>>>>>> (internal HW design doc) and I guess that this is the way
>>>>>>>>>>>>>> other
>>>>>>>>>>>>>> vendors work.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For that, I would like to know if the approach of “PF that
>>>>>>>>>>>>>> controls
>>>>>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO
>>>>>>>>>>>>>> technical
>>>>>>>>>>>>>> group ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure but I think it's better to start from the general
>>>>>>>>>>>>> facility for all transports, then develop features for a
>>>>>>>>>>>>> specific
>>>>>>>>>>>>> transport.
>>>>>>>>>>>> a general facility for all transports can be a generic admin
>>>>>>>>>>>> queue ?
>>>>>>>>>>> It could be a virtqueue or a transport specific method (pcie
>>>>>>>>>>> capability).
>>>>>>>>>> No. You said a general facility for all transports.
>>>>>>>>> For general facility, I mean the chapter 2 of the spec which is
>>>>>>>>> general
>>>>>>>>>
>>>>>>>>> "
>>>>>>>>> 2 Basic Facilities of a Virtio Device
>>>>>>>>> "
>>>>>>>>>
>>>>>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I
>>>>>>>> can add "2.12
>>>>>>>> Admin Virtqueues" and this is what I did in the RFC.
>>>>>>>>
>>>>>>>>>> Transport specific is not general.
>>>>>>>>> The transport is in charge of implementing the interface for
>>>>>>>>> those facilities.
>>>>>>>> Transport specific is not general.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> E.g we can define what needs to be migrated for the virtio-blk
>>>>>>>>>>> first
>>>>>>>>>>> (the device state). Then we can define the interface to get
>>>>>>>>>>> and set
>>>>>>>>>>> those states via admin virtqueue. Such decoupling may ease the
>>>>>>>>>>> future
>>>>>>>>>>> development of the transport specific migration interface.
>>>>>>>>>> I asked a simple question here.
>>>>>>>>>>
>>>>>>>>>> Lets stick to this.
>>>>>>>>> I answered this question.
>>>>>>>> No you didn't answer.
>>>>>>>>
>>>>>>>> I asked  if the approach of “PF that controls the VF live
>>>>>>>> migration process”
>>>>>>>> is acceptable by the VIRTIO technical group ?
>>>>>>>>
>>>>>>>> And you take the discussion to your direction instead of
>>>>>>>> answering a Yes/No
>>>>>>>> question.
>>>>>>>>
>>>>>>>>>       The virtqueue could be one of the
>>>>>>>>> approaches. And it's your responsibility to convince the community
>>>>>>>>> about that approach. Having an example may help people to
>>>>>>>>> understand
>>>>>>>>> your proposal.
>>>>>>>>>
>>>>>>>>>> I'm not referring to internal state definitions.
>>>>>>>>> Without an example, how do we know if it can work well?
>>>>>>>>>
>>>>>>>>>> Can you please not change the subject of my initial intent in
>>>>>>>>>> the email ?
>>>>>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
>>>>>>>>> your proposal.
>>>>>>>> The virtio-blk PF admin queue will be used to manage the
>>>>>>>> virtio-blk VF
>>>>>>>> migration.
>>>>>>>>
>>>>>>>> This is the whole discussion. I don't want to get into resolution.
>>>>>>>>
>>>>>>>> Since you already know the answer as I published 4 RFCs already
>>>>>>>> with all the
>>>>>>>> flow.
>>>>>>>>
>>>>>>>> Lets stick to my question.
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Max.
>>>>>>>>>>>>>>
>>>>>>>>>>>> This publicly archived list offers a means to provide input
>>>>>>>>>>>> to the
>>>>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to verify user consent to the Feedback License terms
>>>>>>>>>>>> and
>>>>>>>>>>>> to minimize spam in the list archive, subscription is required
>>>>>>>>>>>> before posting.
>>>>>>>>>>>>
>>>>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>>>>>>>> List archive:
>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ZOtYROgA9trASrl6vZpAgWKqd69wSO%2FIIehl%2F%2FV9Ji0%3D&amp;reserved=0
>>>>>>>>>>>> Feedback License:
>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=RdR%2Bs7zilTU%2BwJii1XwwpObuMlBr0ThWy4ur0TacYYo%3D&amp;reserved=0
>>>>>>>>>>>> List Guidelines:
>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=z2bXjUR8N0unB5xCki960drEzcAOM0oM6ULLh58B%2Bw0%3D&amp;reserved=0
>>>>>>>>>>>> Committee:
>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2BMvgIFetSa3nZiMQ9NQNwQ23nf61tfoHF%2F7ZK3%2BjtVA%3D&amp;reserved=0
>>>>>>>>>>>> Join OASIS:
>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
>>>>>>>>>>>>
>>>>>>>> This publicly archived list offers a means to provide input to the
>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>>>
>>>>>>>> In order to verify user consent to the Feedback License terms and
>>>>>>>> to minimize spam in the list archive, subscription is required
>>>>>>>> before posting.
>>>>>>>>
>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>>>> List archive:
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LMnn0oZKj%2Bf2kE9pVeC2uSfROFFSawi6MJYUHmclJb0%3D&amp;reserved=0
>>>>>>>> Feedback License:
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Z7M42HYu%2FZ%2B8JdnMo3mp%2FV%2Bwz1VHLnjQJZqum4fwY0M%3D&amp;reserved=0
>>>>>>>> List Guidelines:
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ev2Z09T2ADqE9oxTw%2Bj5rlhrc939Xp4vd7D5j3Sa19M%3D&amp;reserved=0
>>>>>>>> Committee:
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=w%2FY1e4h4QFkGQg8PQRbZKIC4FhjSYE9%2FU9l4mX7E%2Fq4%3D&amp;reserved=0
>>>>>>>> Join OASIS:
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
>>>>>>>>
>>>> This publicly archived list offers a means to provide input to the
>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>
>>>> In order to verify user consent to the Feedback License terms and
>>>> to minimize spam in the list archive, subscription is required
>>>> before posting.
>>>>
>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>> List archive:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LMnn0oZKj%2Bf2kE9pVeC2uSfROFFSawi6MJYUHmclJb0%3D&amp;reserved=0
>>>> Feedback License:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Z7M42HYu%2FZ%2B8JdnMo3mp%2FV%2Bwz1VHLnjQJZqum4fwY0M%3D&amp;reserved=0
>>>> List Guidelines:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ev2Z09T2ADqE9oxTw%2Bj5rlhrc939Xp4vd7D5j3Sa19M%3D&amp;reserved=0
>>>> Committee:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=w%2FY1e4h4QFkGQg8PQRbZKIC4FhjSYE9%2FU9l4mX7E%2Fq4%3D&amp;reserved=0
>>>> Join OASIS:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
>>>>
>> This publicly archived list offers a means to provide input to the
>> OASIS Virtual I/O Device (VIRTIO) TC.
>>
>> In order to verify user consent to the Feedback License terms and
>> to minimize spam in the list archive, subscription is required
>> before posting.
>>
>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>> List help: virtio-comment-help@lists.oasis-open.org
>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LMnn0oZKj%2Bf2kE9pVeC2uSfROFFSawi6MJYUHmclJb0%3D&amp;reserved=0
>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Z7M42HYu%2FZ%2B8JdnMo3mp%2FV%2Bwz1VHLnjQJZqum4fwY0M%3D&amp;reserved=0
>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ev2Z09T2ADqE9oxTw%2Bj5rlhrc939Xp4vd7D5j3Sa19M%3D&amp;reserved=0
>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=w%2FY1e4h4QFkGQg8PQRbZKIC4FhjSYE9%2FU9l4mX7E%2Fq4%3D&amp;reserved=0
>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
>>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-22 10:05                           ` Max Gurtovoy
@ 2021-08-23  3:10                             ` Jason Wang
  2021-08-23  8:55                               ` Max Gurtovoy
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-23  3:10 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Dr. David Alan Gilbert, virtio-comment, Michael S. Tsirkin,
	cohuck, Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan,
	Bodong Wang, Jason Gunthorpe, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Sun, Aug 22, 2021 at 6:05 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>
> On 8/20/2021 2:16 PM, Jason Wang wrote:
> > On Fri, Aug 20, 2021 at 6:26 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >>
> >> On 8/20/2021 5:24 AM, Jason Wang wrote:
> >>> 在 2021/8/19 下午11:20, Max Gurtovoy 写道:
> >>>> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
> >>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> >>>>>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
> >>>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> >>>>>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
> >>>>>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy
> >>>>>>>>> <mgurtovoy@nvidia.com> wrote:
> >>>>>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
> >>>>>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy
> >>>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
> >>>>>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
> >>>>>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Live migration is one of the most important features of
> >>>>>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
> >>>>>>>>>>>>>> environments.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The migration process is managed by a migration SW that is
> >>>>>>>>>>>>>> running on
> >>>>>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state
> >>>>>>>>>>>>>> resides in
> >>>>>>>>>>>>>> the HW.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually
> >>>>>>>>>>>>> from the view
> >>>>>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is
> >>>>>>>>>>>>> stored in
> >>>>>>>>>>>>> the software or hardware. A well designed VMM should be able
> >>>>>>>>>>>>> to hide
> >>>>>>>>>>>>> the virtio device implementation from the migration layer,
> >>>>>>>>>>>>> that is how
> >>>>>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a
> >>>>>>>>>>>>> software
> >>>>>>>>>>>>> virtio/vDPA device or not.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> In our vision, in order to fulfil the Live migration
> >>>>>>>>>>>>>> requirements for
> >>>>>>>>>>>>>> virtual functions, each physical function device must
> >>>>>>>>>>>>>> implement
> >>>>>>>>>>>>>> migration operations. Using these operations, it will be
> >>>>>>>>>>>>>> able to
> >>>>>>>>>>>>>> master the migration process for the virtual function
> >>>>>>>>>>>>>> devices. Each
> >>>>>>>>>>>>>> capable physical function device has a supervisor
> >>>>>>>>>>>>>> permissions to
> >>>>>>>>>>>>>> change the virtual function operational states,
> >>>>>>>>>>>>>> save/restore its
> >>>>>>>>>>>>>> internal state and start/stop dirty pages tracking.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> For "supervisor permissions", is this from the software
> >>>>>>>>>>>>> point of view?
> >>>>>>>>>>>>> Maybe it's better to give an example for this.
> >>>>>>>>>>>> A permission to a PF device for quiesce and freeze a VF
> >>>>>>>>>>>> device for example.
> >>>>>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running
> >>>>>>>>>>> without any privileges.
> >>>>>>>>>> You're mixing layers here.
> >>>>>>>>>>
> >>>>>>>>>> QEMU is not involved here. It's only sending IOCTLs to
> >>>>>>>>>> migration driver.
> >>>>>>>>>> The migration driver will control the migration process of the
> >>>>>>>>>> VF using
> >>>>>>>>>> the PF communication channel.
> >>>>>>>>> So who will be granted the "permission" you mentioned here?
> >>>>>>>> This is just an expression.
> >>>>>>>>
> >>>>>>>> What is not clear ?
> >>>>>>>>
> >>>>>>>> The PF device will have an option to quiesce/freeze the VF device.
> >>>>>>>>
> >>>>>>>> This is simple. Why are you looking for some sophisticated
> >>>>>>>> problems ?
> >>>>>>> I'm trying to follow along here and have not completely; but I
> >>>>>>> think the issue is a
> >>>>>>> security separation one.
> >>>>>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
> >>>>>>> isolated and shouldn't be able to go poking at other devices; so it
> >>>>>>> can't go poking at the PF (it probably doesn't even have the PF
> >>>>>>> device
> >>>>>>> node accessible) - so then the question is who has access to the
> >>>>>>> migration driver and how do you make sure it can only deal with VF's
> >>>>>>> that it's supposed to be able to migrate.
> >>>>>> The QEMU/userspace doesn't know or care about the PF connection and
> >>>>>> internal
> >>>>>> virtio_vfio_pci driver implementation.
> >>>>> OK
> >>>>>
> >>>>>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
> >>>>> Hmm OK.
> >>>>>
> >>>>>> QEMU does not have access to the PF. Only the kernel driver that
> >>>>>> has access
> >>>>>> to the VF will have access to the PF communication channel. There
> >>>>>> is no
> >>>>>> permission problem here.
> >>>>>>
> >>>>>> The kernel driver of the VF will do this internally, and make sure
> >>>>>> that the
> >>>>>> commands it build will only impact the VF originating them.
> >>>>>>
> >>>>> Now that confuses me; isn't the kernel driver that has access to the VF
> >>>>> running inside the guest?  If it's inside the guest we can't trust
> >>>>> it to
> >>>>> do anything about stopping impact to other devices.
> >>>> No. The driver is in the hypervisor (virtio_vfio_pci). This is the
> >>>> migration driver, right ?
> >>>
> >>> Well, talking things like virtio_vfio_pci that is not mentioned before
> >>> and not justified on the list may easily confuse people. As pointed
> >>> out in another thread, it has too many disadvantages over the existing
> >>> virtio-pci vdpa driver. And it just duplicates a partial function of
> >>> what virtio-pci vdpa driver can do. I don't think we will go that way.
> >> This was just an example for David to help with understanding the
> >> solution since he thought that the guest drivers somehow should be changed.
> >>
> >> David I'm sorry if I confused you.
> >>
> >> Again Jason, you try to propose your vDPA solution that is not what
> >> we're trying to achieve in this work. Think of a world without vDPA.
> > Well, I'd say, let's think vDPA a superset of virtio, not just the
> > acceleration technologies.
>
> I'm sorry but vDPA is not relevant to this discussion.

Well, it's you that mention the software things like VFIO first.

>
> Anyhow, I don't see any problem for vDPA driver to work on top of the
> design proposed here.
>
> >> Also I don't understand how vDPA is related to virtio specification
> >> decisions ?
> > So how is VFIO related to virtio specific decisions? That's why I
> > think we should avoid talking about software architecture here. It's
> > the wrong community.
>
> VFIO is not related to virtio spec.

Of course.

>
> It was an example for David. What is the problem with giving examples to
> ease on people to understand the solution ?

I don't think your example ease the understanding.

>
> Where did you see that the design is referring to VFIO ?
>
> >
> >>   make vDPA into virtio and then we can open a discussion.
> >>
> >> I'm interesting in virtio migration of HW devices.
> >>
> >> The proposal in this thread is actually get support from Michal AFAIU
> >> and also others were happy with. All beside of you.
> > So I think I've clairfied my several times :(
> >
> > - I'm fairly ok with the proposal
>
> It doesn't seems like that.
>
> > - but we decouple the basic facility out of the admin virtqueue and
> > this seems agreed by Michael:
> >
> > Let's take the dirty page tracking as an example:
> >
> > 1) let's first define that as one of the basic facility
> > 2) then we can introduce admin virtqueue or other stuffs as an
> > interface for that facility
> >
> > Does this work for you?
>
> What I really want is to agree that the right way to manage migration
> process of a virtio VF. My proposal is doing so by creating a
> communication channel in its parent PF.

It looks to me you never answer the question "why it must be done by PF".

All the functions provided by PF so far for software is not expected
to be used by VMM like Qemu. Those functions usually requires
capability or privileges for the management software to use. You
mentioned things like "supervisor" and "permission", but it looks to
me you are still unaware how it connect to the security stuffs.

>
> I think I got a confirmation here.
>
> This communication channel is not introduced in this thread, but
> obviously it should be an adminq.

Let me clarify. What I want to say is admin should be one of the
possible channels.

>
> For your future scalable functions, the Parent Device (lets call it PD)
> will manage the creation/migration/destruction process for its Virtual
> Devices (lets call them VDs) using the PD adminq.
>
> Agreed ?

They are two different set of functions:

- provisioning/creation/destruction: requires privilege and we don't
have any plan to expose them to the guest. It should be done via PF or
PD for security as you mentioned above.
- migration: doesn't require privilege, and it can be expose to the
guest, if can be done in either PF or VF. To me using VF is much more
natural,  but using PF is also fine.

An exception for the migration is the dirty page tracking, without DMA
isolation, we may end up with security issue if we do that in the VF.

>
> Please don't answer that this is not a "must". This is my proposal. If
> you have another proposal, please propose.

Well, you are asking for the comments instead of enforcing things right?

And it's as simple as:

1) introduce admin virtqueue, and bind migration features to admin virtqueue

or

2) introduce migration features and admin virtqueue independently

What's the problem of do trivial modifications like 2)? Is that
conflict with your proposal?

>
> >
> >> We do it in mlx5 and we didn't see any issues with that design.
> >>
> > If we seperate things as I suggested, I'm totally fine.
>
> separate what ?
>
> Why should I create different interfaces for different management tasks.

I don't say you need to create different interfaces. It's for future extensions:

1) When VIRTIO_F_ADMIN_VQ is negotiated, the interface is admin virtqueue
2) When other features is negotiated, the interface is other.

In order to make 2) work, we need introduce migration and admin
virtqueue separately.

Migration is not management task which doesn't require any privilege.

>
> I have a virtual/scalable device that I want to  refer to from the
> physical/parent device using some interface.
>
> This interface is adminq. This interface will be used for dirty_page
> tracking and operational state changing and get/set internal state as
> well. And more (create/destroy SF for example).
>
> You can think of this in some other way, i'm fine with it. As long as
> the final conclusion is the same.
>
> >
> >> I don't think you can say that we "go that way".
> > For "go that way" I meant the method of using vfio_virtio_pci, it has
> > nothing related to the discussion of "using PF to control VF" on the
> > spec.
>
> This was an example. Please leave it as an example for David.
>
>
> >> You're trying to build a complementary solution for creating scalable
> >> functions and for some reason trying to sabotage NVIDIA efforts to add
> >> new important functionality to virtio.
> > Well, it's a completely different topic. And it doesn't conflict with
> > anything that is proposed here by you. I think I've stated this
> > several times.  I don't think we block each other, it's just some
> > unification work if one of the proposals is merged first. I sent them
> > recently because it will be used as a material for my talk on the KVM
> > Forum which is really near.
>
> In theory you're right. We shouldn't block each other, and I don't block
> you. But for some reason I see that you do try to block my proposal and
> I don't understand why.

I don't want to block your proposal, let's decouple the migration
feature out of admin virtqueue. Then it's fine.

The problem I see is that, you tend to refuse such a trivial but
beneficial change. That's what I don't understand.

>
> I feel like I wasted 2 months on a discussion instead of progressing.

Well, I'm not sure 2 months is short, but it's usually take more than
a year for huge project in Linux.

Patience may help us to understand the points of each other better.

>
> But now I do see a progress. A PF to manage VF migration is the way to
> go forward.
>
> And the following RFC will take this into consideration.
>
> >
> >> This also sabotage the evolvment of virtio as a standard.
> >>
> >> You're trying to enforce some un-finished idea that should work on some
> >> future specific HW platform instead of helping defining a good spec for
> >> virtio.
> > Let's open another thread for this if you wish, it has nothing related
> > to the spec but how it is implemented in Linux. If you search the
> > archive, something similar to "vfio_virtio_pci" has been proposed
> > several years before by Intel. The idea has been rejected, and we have
> > leveraged Linux vDPA bus for virtio-pci devices.
>
> I don't know this history. And I will happy to hear about it one day.
>
> But for our discussion in Linux, virtio_vfio_pci will happen. And it
> will implement the migration logic of a virtio device with PCI transport
> for VFs using the PF admin queue.
>
> We at NVIDIA, currently upstreaming (alongside with AlexW and Cornelia)
> a vfio-pci separation that will enable an easy creation of vfio-pci
> vendor/protocol drivers to do some specific tasks.
>
> New drivers such as mlx5_vfio_pci, hns_vfio_pci, virtio_vfio_pci and
> nvme_vfio_pci should be implemented in the near future in Linux to
> enable migration of these devices.
>
> This is just an example. And it's not related to the spec nor the
> proposal at all.

Let's move those discussions to the right list. I'm pretty sure there
will a long debate there. Please prepare for that.

>
> >
> >> And all is for having users to choose vDPA framework instead of using
> >> plain virtio.
> >>
> >> We believe in our solution and we have a working prototype. We'll
> >> continue with our discussion to convince the community with it.
> > Again, it looks like there's a lot of misunderstanding. Let's open a
> > thread on the suitable list instead of talking about any specific
> > software solution or architecture here. This will speed up things.
>
> I prefer to finish the specification first. SW arch is clear for us in
> Linux. We did it already for mlx5 devices and it will be the same for
> virtio if the spec changes will be accepted.

I disagree, but let's separate software discussion out of the spec
discussion here.

Thanks

>
> Thanks.
>
>
> >
> > Thanks
> >
> >> Thanks.
> >>
> >>> Thanks
> >>>
> >>>
> >>>> The guest is running as usual. It doesn't aware on the migration at all.
> >>>>
> >>>> This is the point I try to make here. I don't (and I can't) change
> >>>> even 1 line of code in the guest.
> >>>>
> >>>> e.g:
> >>>>
> >>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
> >>>> (bounded to VF5) --> send admin command on PF adminq to start
> >>>> tracking dirty pages for VF5 --> PF device will do it
> >>>>
> >>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
> >>>> (bounded to VF5) --> send admin command on PF adminq to quiesce VF5
> >>>> --> PF device will do it
> >>>>
> >>>> You can take a look how we implement mlx5_vfio_pci in the link I
> >>>> provided.
> >>>>
> >>>>> Dave
> >>>>>
> >>>>>
> >>>>>> We already do this in mlx5 NIC migration. The kernel is secured and
> >>>>>> QEMU
> >>>>>> interface is the VF.
> >>>>>>
> >>>>>>> Dave
> >>>>>>>
> >>>>>>>>>>>>>> An example of this approach can be seen in the way NVIDIA
> >>>>>>>>>>>>>> performs
> >>>>>>>>>>>>>> live migration of a ConnectX NIC function:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> >>>>>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated
> >>>>>>>>>>>>>> software defined
> >>>>>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for
> >>>>>>>>>>>>>> storage
> >>>>>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its
> >>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
> >>>>>>>>>>>>>> specifications.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In order to implement Live Migration for these virtual
> >>>>>>>>>>>>>> function
> >>>>>>>>>>>>>> devices, that use a standard drivers as mentioned, the
> >>>>>>>>>>>>>> specification
> >>>>>>>>>>>>>> should define how HW vendor should build their devices and
> >>>>>>>>>>>>>> for SW
> >>>>>>>>>>>>>> developers to adjust the drivers.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This will enable specification compliant vendor agnostic
> >>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
> >>>>>>>>>>>>>> (internal HW design doc) and I guess that this is the way
> >>>>>>>>>>>>>> other
> >>>>>>>>>>>>>> vendors work.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For that, I would like to know if the approach of “PF that
> >>>>>>>>>>>>>> controls
> >>>>>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO
> >>>>>>>>>>>>>> technical
> >>>>>>>>>>>>>> group ?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not sure but I think it's better to start from the general
> >>>>>>>>>>>>> facility for all transports, then develop features for a
> >>>>>>>>>>>>> specific
> >>>>>>>>>>>>> transport.
> >>>>>>>>>>>> a general facility for all transports can be a generic admin
> >>>>>>>>>>>> queue ?
> >>>>>>>>>>> It could be a virtqueue or a transport specific method (pcie
> >>>>>>>>>>> capability).
> >>>>>>>>>> No. You said a general facility for all transports.
> >>>>>>>>> For general facility, I mean the chapter 2 of the spec which is
> >>>>>>>>> general
> >>>>>>>>>
> >>>>>>>>> "
> >>>>>>>>> 2 Basic Facilities of a Virtio Device
> >>>>>>>>> "
> >>>>>>>>>
> >>>>>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I
> >>>>>>>> can add "2.12
> >>>>>>>> Admin Virtqueues" and this is what I did in the RFC.
> >>>>>>>>
> >>>>>>>>>> Transport specific is not general.
> >>>>>>>>> The transport is in charge of implementing the interface for
> >>>>>>>>> those facilities.
> >>>>>>>> Transport specific is not general.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>> E.g we can define what needs to be migrated for the virtio-blk
> >>>>>>>>>>> first
> >>>>>>>>>>> (the device state). Then we can define the interface to get
> >>>>>>>>>>> and set
> >>>>>>>>>>> those states via admin virtqueue. Such decoupling may ease the
> >>>>>>>>>>> future
> >>>>>>>>>>> development of the transport specific migration interface.
> >>>>>>>>>> I asked a simple question here.
> >>>>>>>>>>
> >>>>>>>>>> Lets stick to this.
> >>>>>>>>> I answered this question.
> >>>>>>>> No you didn't answer.
> >>>>>>>>
> >>>>>>>> I asked  if the approach of “PF that controls the VF live
> >>>>>>>> migration process”
> >>>>>>>> is acceptable by the VIRTIO technical group ?
> >>>>>>>>
> >>>>>>>> And you take the discussion to your direction instead of
> >>>>>>>> answering a Yes/No
> >>>>>>>> question.
> >>>>>>>>
> >>>>>>>>>       The virtqueue could be one of the
> >>>>>>>>> approaches. And it's your responsibility to convince the community
> >>>>>>>>> about that approach. Having an example may help people to
> >>>>>>>>> understand
> >>>>>>>>> your proposal.
> >>>>>>>>>
> >>>>>>>>>> I'm not referring to internal state definitions.
> >>>>>>>>> Without an example, how do we know if it can work well?
> >>>>>>>>>
> >>>>>>>>>> Can you please not change the subject of my initial intent in
> >>>>>>>>>> the email ?
> >>>>>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
> >>>>>>>>> your proposal.
> >>>>>>>> The virtio-blk PF admin queue will be used to manage the
> >>>>>>>> virtio-blk VF
> >>>>>>>> migration.
> >>>>>>>>
> >>>>>>>> This is the whole discussion. I don't want to get into resolution.
> >>>>>>>>
> >>>>>>>> Since you already know the answer as I published 4 RFCs already
> >>>>>>>> with all the
> >>>>>>>> flow.
> >>>>>>>>
> >>>>>>>> Lets stick to my question.
> >>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>>> Thanks.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Max.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>> This publicly archived list offers a means to provide input
> >>>>>>>>>>>> to the
> >>>>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In order to verify user consent to the Feedback License terms
> >>>>>>>>>>>> and
> >>>>>>>>>>>> to minimize spam in the list archive, subscription is required
> >>>>>>>>>>>> before posting.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>>>>>>>>>> List archive:
> >>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ZOtYROgA9trASrl6vZpAgWKqd69wSO%2FIIehl%2F%2FV9Ji0%3D&amp;reserved=0
> >>>>>>>>>>>> Feedback License:
> >>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=RdR%2Bs7zilTU%2BwJii1XwwpObuMlBr0ThWy4ur0TacYYo%3D&amp;reserved=0
> >>>>>>>>>>>> List Guidelines:
> >>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=z2bXjUR8N0unB5xCki960drEzcAOM0oM6ULLh58B%2Bw0%3D&amp;reserved=0
> >>>>>>>>>>>> Committee:
> >>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011068512%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2BMvgIFetSa3nZiMQ9NQNwQ23nf61tfoHF%2F7ZK3%2BjtVA%3D&amp;reserved=0
> >>>>>>>>>>>> Join OASIS:
> >>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
> >>>>>>>>>>>>
> >>>>>>>> This publicly archived list offers a means to provide input to the
> >>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>>>>>
> >>>>>>>> In order to verify user consent to the Feedback License terms and
> >>>>>>>> to minimize spam in the list archive, subscription is required
> >>>>>>>> before posting.
> >>>>>>>>
> >>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>>>>>> List archive:
> >>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LMnn0oZKj%2Bf2kE9pVeC2uSfROFFSawi6MJYUHmclJb0%3D&amp;reserved=0
> >>>>>>>> Feedback License:
> >>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Z7M42HYu%2FZ%2B8JdnMo3mp%2FV%2Bwz1VHLnjQJZqum4fwY0M%3D&amp;reserved=0
> >>>>>>>> List Guidelines:
> >>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ev2Z09T2ADqE9oxTw%2Bj5rlhrc939Xp4vd7D5j3Sa19M%3D&amp;reserved=0
> >>>>>>>> Committee:
> >>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=w%2FY1e4h4QFkGQg8PQRbZKIC4FhjSYE9%2FU9l4mX7E%2Fq4%3D&amp;reserved=0
> >>>>>>>> Join OASIS:
> >>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
> >>>>>>>>
> >>>> This publicly archived list offers a means to provide input to the
> >>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>
> >>>> In order to verify user consent to the Feedback License terms and
> >>>> to minimize spam in the list archive, subscription is required
> >>>> before posting.
> >>>>
> >>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>> List archive:
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LMnn0oZKj%2Bf2kE9pVeC2uSfROFFSawi6MJYUHmclJb0%3D&amp;reserved=0
> >>>> Feedback License:
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Z7M42HYu%2FZ%2B8JdnMo3mp%2FV%2Bwz1VHLnjQJZqum4fwY0M%3D&amp;reserved=0
> >>>> List Guidelines:
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ev2Z09T2ADqE9oxTw%2Bj5rlhrc939Xp4vd7D5j3Sa19M%3D&amp;reserved=0
> >>>> Committee:
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=w%2FY1e4h4QFkGQg8PQRbZKIC4FhjSYE9%2FU9l4mX7E%2Fq4%3D&amp;reserved=0
> >>>> Join OASIS:
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
> >>>>
> >> This publicly archived list offers a means to provide input to the
> >> OASIS Virtual I/O Device (VIRTIO) TC.
> >>
> >> In order to verify user consent to the Feedback License terms and
> >> to minimize spam in the list archive, subscription is required
> >> before posting.
> >>
> >> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >> List help: virtio-comment-help@lists.oasis-open.org
> >> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LMnn0oZKj%2Bf2kE9pVeC2uSfROFFSawi6MJYUHmclJb0%3D&amp;reserved=0
> >> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Z7M42HYu%2FZ%2B8JdnMo3mp%2FV%2Bwz1VHLnjQJZqum4fwY0M%3D&amp;reserved=0
> >> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ev2Z09T2ADqE9oxTw%2Bj5rlhrc939Xp4vd7D5j3Sa19M%3D&amp;reserved=0
> >> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=w%2FY1e4h4QFkGQg8PQRbZKIC4FhjSYE9%2FU9l4mX7E%2Fq4%3D&amp;reserved=0
> >> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7Cf31455a4a77448afbf4208d963cbfaf7%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637650550011078463%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dWHo4Wz8oQ9o7VZDui%2Fsdg9iQbrFi1syTJuBDULZ%2BWs%3D&amp;reserved=0
> >>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20 11:06                       ` Michael S. Tsirkin
@ 2021-08-23  3:20                         ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2021-08-23  3:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Max Gurtovoy, virtio-comment, cohuck, Parav Pandit,
	Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Fri, Aug 20, 2021 at 7:06 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Aug 20, 2021 at 03:49:55PM +0800, Jason Wang wrote:
> > On Fri, Aug 20, 2021 at 3:04 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Fri, Aug 20, 2021 at 10:17:05AM +0800, Jason Wang wrote:
> > > >
> > > > 在 2021/8/19 下午10:58, Michael S. Tsirkin 写道:
> > > > > On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > > > > > > The PF device will have an option to quiesce/freeze the VF device.
> > > > > >
> > > > > > Is such design a must? If no, why not simply introduce those functions in
> > > > > > the VF?
> > > > > Many IOMMUs only support protections at the function level.
> > > > > Thus we need ability to have one device (e.g. a PF)
> > > > > to control migration of another (e.g. a VF).
> > > >
> > > >
> > > > So as discussed previously, the only possible "advantage" is that the DMA is
> > > > isolated.
> > > >
> > > >
> > > > > This is because allowing VF to access hypervisor memory used for
> > > > > migration is not a good idea.
> > > > > For IOMMUs that support subfunctions, these "devices" could be
> > > > > subfunctions.
> > > > >
> > > > > The only alternative is to keep things in device memory which
> > > > > does not need an IOMMU.
> > > > > I guess we'd end up with something like a VQ in device memory which might
> > > > > be tricky from multiple points of view, but yes, this could be
> > > > > useful and people did ask for such a capability in the past.
> > > >
> > > >
> > > > I assume the spec already support this. We probably need some clarification
> > > > at the transport layer. But it's as simple as setting MMIO are as virtqueue
> > > > address?
> > >
> > > Several issues
> > > - we do not support changing VQ address. Devices do need to support
> > >   changing memory addresses.
> >
> > So it looks like a transport specific requirement (PCI-E) instead of a
> > general issue.
> >
> > > - Ordering becomes tricky especially .
> > >   E.g. when device reads descriptor in VQ
> > >   memory it suddenly does not flush out writes into buffer
> > >   that is potentially in RAM. We might also need even stronger
> > >   barriers on the driver side. We used dma_wmb but now it's
> > >   probably need to be wmb.
> > >   Reading multibyte structures from device memory is slow.
> > >   To get reasonable performance we might need to mark this device memory
> > >   WB or WC. That generally makes things even trickier.
> >
> > I agree, but still they are all transport specific requirements. If we
> > do that in a PCI-E BAR, the driver must obey the ordering rule for PCI
> > to make it work.
> > >
> > >
> > > > Except for the dirty bit tracking, we don't have bulk data that needs to be
> > > > transferred during migration. So a virtqueue is not must even in this case.
> > >
> > > Main traffic is write tracking.
> >
> > Right.
> >
> > >
> > >
> > > >
> > > > >
> > > > > > If yes, what's the reason for making virtio different (e.g VCPU live
> > > > > > migration is not designed like that)?
> > > > > I think the main difference is we need PF's help for memory
> > > > > tracking for pre-copy migration anyway.
> > > >
> > > >
> > > > Such kind of memory tracking is not a must. KVM uses software assisted
> > > > technologies (write protection) and it works very well.
> > >
> > > So page-fault support is absolutely a viable option IMHO.
> > > To work well we need VIRTIO_F_PARTIAL_ORDER - there was not
> > > a lot of excitement but sure I will finalize and repost it.
> >
> > As discussed before, it looks like a performance optimization but not a must?
> >
> > I guess we don't do that for KVM and it works well.
>
> Depends on type of device. For networking it's a problem because it is
> driven by outside events so it keeps going leading to packet drops which
> is a quality of implementation issue, not an optimization.

So it looks to me it's a factor of how device page faults perform. E.g
we may suffer from packet drops during live migration when KVM is
logger dirty pages as well.

> Same thing with e.g. audio I suspect. Maybe graphics.

I wonder even with this, it may not work for those real time tasks.

> For KVM and
> e.g. storage it's more of a performance issue.
>
>
> > >
> > >
> > > However we need support for reporting and handling faults.
> > > Again this is data path stuff and needs to be under
> > > hypervisor control so I guess we get right back
> > > to having this in the PF?
> >
> > So it depends on whether it requires a DMA. If it's just something
> > like a CR2 register, we don't need PF.
>
> We won't strictly need it but it is a well understood model,
> working well with e.g. vfio. It makes sense to support it.
>
> > >
> > >
> > >
> > >
> > >
> > > > For virtio,
> > > > technology like shadow virtqueue has been used by DPDK and prototyped by
> > > > Eugenio.
> > >
> > > That's ok but I think since it affects performance at 100% of the
> > > time when active we can not rely on this as the only solution.
> >
> > This part I don't understand:
> >
> > - KVM writes protect the pages, so it loses performance as well.
> > - If we are using virtqueue for reporting dirty bitmap, it can easily
> > run out of space and we will lose the performance as well
> > - If we are using bitmap/bytemap, we may also losing the performance
> > (e.g the huge footprint) or at PCI level
> >
> > So I'm not against the idea, what I think makes more sense is not
> > limit the facilities like device states, dirty page tracking to the
> > PF.
>
> It could be a cross-device facility that can support PF but
> also other forms of communication, yes.

That's my understanding as well.

>
>
> > >
> > >
> > > > Even if we want to go with hardware technology, we have many alternatives
> > > > (as we've discussed in the past):
> > > >
> > > > 1) IOMMU dirty bit (E.g modern IOMMU have EA bit for logging external device
> > > > write)
> > > > 2) Write protection via IOMMU or device MMU
> > > > 3) Address space ID for isolating DMAs
> > >
> > > Not all systems support any of the above unfortunately.
> > >
> >
> > Yes. But we know the platform (AMD/Intel/ARM) will be ready soon for
> > them in the near future.
>
> know and future in the same sentence make an oxymoron ;)
>
> > > Also some systems might have a limited # of PASIDs.
> > > So burning up a extra PASID per VF halving their
> > > number might not be great as the only option.
> >
> > Yes, so I think we agree that we should not limit the spec to work on
> > a specific configuration (e.g the device with PF).
>
> That makes sense to me.
>
> > >
> > >
> > > >
> > > > Using physical function is sub-optimal that all of the above since:
> > > >
> > > > 1) limited to a specific transport or implementation and it doesn't work for
> > > > device or transport without PF
> > > > 2) the virtio level function is not self contained, this makes any feature
> > > > that ties to PF impossible to be used in the nested layer
> > > > 3) more complicated than leveraging the existing facilities provided by the
> > > > platform or transport
> > >
> > > I think I disagree with 2 and 3 above simply because controlling VFs through
> > > a PF is how all other devices did this.
> >
> > For management and provision yes. For other features, the answer is
> > not. This is simply because most hardware vendors don't consider
> > whether or not a feature could be virtualized. That's fine for them
> > but not us. E.g if we limit the feature A to PF. It means feature A
> > can't be used by guests. My understanding is that we'd better not
> > introduce a feature that is hard to be virtualized.
>
> I'm not sure what do you mean when you say management but I guess
> at least stuff that ip link does normally:
>
>
>                [ vf NUM [ mac LLADDR ]
>                         [ VFVLAN-LIST ]
>                         [ rate TXRATE ]
>                         [ max_tx_rate TXRATE ]
>                         [ min_tx_rate TXRATE ]
>                         [ spoofchk { on | off } ]
>                         [ query_rss { on | off } ]
>                         [ state { auto | enable | disable } ]
>                         [ trust { on | off } ]
>                         [ node_guid eui64 ]
>                         [ port_guid eui64 ] ]
>
>
> is fair game ...

That's the example of the management tasks:

1) is not expected or exposed for the guest
2) requires capabilities (CAP_NET_ADMIN) for security
3) won't be used by Qemu

But live migration seems different

1) it can be exposed for the guest for nested live migration
2) doesn't require capabilities, no security concern
3) will be used by Qemu


>
> > > About 1 - well this is
> > > just about us being smart and writing this in a way that is
> > > generic enough, right?
> >
> > That's exactly my question and my point, I know it can be done in the
> > PF. What I'm asking is "why it must be in the PF".
> >
> > And I'm trying to convince Max to introduce those features as "basic
> > device facilities" instead of doing that in the "admin virtqueue" or
> > other stuff that belongs to PF.
>
> Let's say it's not in a PF, I think it needs some way to be separate so
> we don't need lots of logic in the hypervisor to handle that.

We don't need a lot I think:

1) stop/freeze the device
2) device state set and get

> So from that POV admin queue is ok. In fact
> from my POV admin queue is suffering in that it does not focus on cross
> device communication enough, not that it's doing that too much.

Ok.

>
> > > E.g. include options for PASIDs too.
> > >
> > > Note that support for cross-device addressing is useful
> > > even outside of migration.  We also have things like
> > > priority where it is useful to adjust properties of
> > > a VF on the fly while it is active. Again the normal way
> > > all devices do this is through a PF. Yes a bunch of tricks
> > > in QEMU is possible but having a driver in host kernel
> > > and just handle it in a contained way is way cleaner.
> > >
> > >
> > > > Consider (P)ASID will be ready very soon, workaround the platform limitation
> > > > via PF is not a good idea for me. Especially consider it's not a must and we
> > > > had already prototype the software assisted technology.
> > >
> > > Well PASID is just one technology.
> >
> > Yes, devices are allowed to have their own function to isolate DMA. I
> > mentioned PASID just because it is the most popular technology.
> >
> > >
> > >
> > > >
> > > > >   Might as well integrate
> > > > > the rest of state in the same channel.
> > > >
> > > >
> > > > That's another question. I think for the function that is a must for doing
> > > > live migration, introducing them in the function itself is the most natural
> > > > way since we did all the other facilities there. This ease the function that
> > > > can be used in the nested layer.
> > > >
> > > > And using the channel in the PF is not coming for free. It requires
> > > > synchronization in the software or even QOS.
> > > >
> > > > Or we can just separate the dirty page tracking into PF (but need to define
> > > > them as basic facility for future extension).
> > >
> > > Well maybe just start focusing on write tracking, sure.
> > > Once there's a proposal for this we can see whether
> > > adding other state there is easier or harder.
> >
> > Fine with me.
> >
> > >
> > >
> > > >
> > > > >
> > > > > Another answer is that CPUs trivially switch between
> > > > > functions by switching the active page tables. For PCI DMA
> > > > > it is all much trickier sine the page tables can be separate
> > > > > from the device, and assumed to be mostly static.
> > > >
> > > >
> > > > I don't see much different, the page table is also separated from the CPU.
> > > > If the device supports state save and restore we can scheduling the multiple
> > > > VMs/VCPUs on the same device.
> > >
> > > It's just that performance is terrible. If you keep losing packets
> > > migration might as well not be live.
> >
> > I don't measure the performance. But I believe the shadow virtqueue
> > should perform better than kernel vhost-net backends.
> >
> > If it's not, we can switch to vhost-net if necessary and we know it
> > works well for the live migration.
>
> Well but not as fast as hardware offloads with faults would be,
> which can potentially go full speed as long as you are lucky
> and do not hit too many faults.

Yes, but for live migration, I agree that we need better performance,
but if we go full speed, that may break the convergence.

Anyhow, we can see how well shadow virtqueue performs.

>
> > >
> > > >
> > > > > So if you want to create something like the VMCS then
> > > > > again you either need some help from another device or
> > > > > put it in device memory.
> > > >
> > > >
> > > > For CPU virtualization, the states could be saved and restored via MSRs. For
> > > > virtio, accessing them via registers is also possible and much more simple.
> > > >
> > > > Thanks
> > >
> > > IMy guess is performance is going to be bad. MSRs are part of the
> > > same CPU that is executing the accesses....
> >
> > I'm not sure but it's how current VMX or SVM did.
> >
> > Thanks
>
> Yes but again. moving state of the CPU around is faster than
> pulling it across the PCI-E bus.

Right.

Thanks

>
> > >
> > > >
> > > > >
> > > > >
> > >
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-23  3:10                             ` Jason Wang
@ 2021-08-23  8:55                               ` Max Gurtovoy
  2021-08-24  2:41                                 ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Max Gurtovoy @ 2021-08-23  8:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Dr. David Alan Gilbert, virtio-comment, Michael S. Tsirkin,
	cohuck, Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan,
	Bodong Wang, Jason Gunthorpe, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer


On 8/23/2021 6:10 AM, Jason Wang wrote:
> On Sun, Aug 22, 2021 at 6:05 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>
>> On 8/20/2021 2:16 PM, Jason Wang wrote:
>>> On Fri, Aug 20, 2021 at 6:26 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>> On 8/20/2021 5:24 AM, Jason Wang wrote:
>>>>> 在 2021/8/19 下午11:20, Max Gurtovoy 写道:
>>>>>> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
>>>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>>>>>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
>>>>>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
>>>>>>>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
>>>>>>>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy
>>>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
>>>>>>>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy
>>>>>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
>>>>>>>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
>>>>>>>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Live migration is one of the most important features of
>>>>>>>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
>>>>>>>>>>>>>>>> environments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The migration process is managed by a migration SW that is
>>>>>>>>>>>>>>>> running on
>>>>>>>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state
>>>>>>>>>>>>>>>> resides in
>>>>>>>>>>>>>>>> the HW.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually
>>>>>>>>>>>>>>> from the view
>>>>>>>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is
>>>>>>>>>>>>>>> stored in
>>>>>>>>>>>>>>> the software or hardware. A well designed VMM should be able
>>>>>>>>>>>>>>> to hide
>>>>>>>>>>>>>>> the virtio device implementation from the migration layer,
>>>>>>>>>>>>>>> that is how
>>>>>>>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a
>>>>>>>>>>>>>>> software
>>>>>>>>>>>>>>> virtio/vDPA device or not.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In our vision, in order to fulfil the Live migration
>>>>>>>>>>>>>>>> requirements for
>>>>>>>>>>>>>>>> virtual functions, each physical function device must
>>>>>>>>>>>>>>>> implement
>>>>>>>>>>>>>>>> migration operations. Using these operations, it will be
>>>>>>>>>>>>>>>> able to
>>>>>>>>>>>>>>>> master the migration process for the virtual function
>>>>>>>>>>>>>>>> devices. Each
>>>>>>>>>>>>>>>> capable physical function device has a supervisor
>>>>>>>>>>>>>>>> permissions to
>>>>>>>>>>>>>>>> change the virtual function operational states,
>>>>>>>>>>>>>>>> save/restore its
>>>>>>>>>>>>>>>> internal state and start/stop dirty pages tracking.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For "supervisor permissions", is this from the software
>>>>>>>>>>>>>>> point of view?
>>>>>>>>>>>>>>> Maybe it's better to give an example for this.
>>>>>>>>>>>>>> A permission to a PF device for quiesce and freeze a VF
>>>>>>>>>>>>>> device for example.
>>>>>>>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running
>>>>>>>>>>>>> without any privileges.
>>>>>>>>>>>> You're mixing layers here.
>>>>>>>>>>>>
>>>>>>>>>>>> QEMU is not involved here. It's only sending IOCTLs to
>>>>>>>>>>>> migration driver.
>>>>>>>>>>>> The migration driver will control the migration process of the
>>>>>>>>>>>> VF using
>>>>>>>>>>>> the PF communication channel.
>>>>>>>>>>> So who will be granted the "permission" you mentioned here?
>>>>>>>>>> This is just an expression.
>>>>>>>>>>
>>>>>>>>>> What is not clear ?
>>>>>>>>>>
>>>>>>>>>> The PF device will have an option to quiesce/freeze the VF device.
>>>>>>>>>>
>>>>>>>>>> This is simple. Why are you looking for some sophisticated
>>>>>>>>>> problems ?
>>>>>>>>> I'm trying to follow along here and have not completely; but I
>>>>>>>>> think the issue is a
>>>>>>>>> security separation one.
>>>>>>>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
>>>>>>>>> isolated and shouldn't be able to go poking at other devices; so it
>>>>>>>>> can't go poking at the PF (it probably doesn't even have the PF
>>>>>>>>> device
>>>>>>>>> node accessible) - so then the question is who has access to the
>>>>>>>>> migration driver and how do you make sure it can only deal with VF's
>>>>>>>>> that it's supposed to be able to migrate.
>>>>>>>> The QEMU/userspace doesn't know or care about the PF connection and
>>>>>>>> internal
>>>>>>>> virtio_vfio_pci driver implementation.
>>>>>>> OK
>>>>>>>
>>>>>>>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
>>>>>>> Hmm OK.
>>>>>>>
>>>>>>>> QEMU does not have access to the PF. Only the kernel driver that
>>>>>>>> has access
>>>>>>>> to the VF will have access to the PF communication channel. There
>>>>>>>> is no
>>>>>>>> permission problem here.
>>>>>>>>
>>>>>>>> The kernel driver of the VF will do this internally, and make sure
>>>>>>>> that the
>>>>>>>> commands it build will only impact the VF originating them.
>>>>>>>>
>>>>>>> Now that confuses me; isn't the kernel driver that has access to the VF
>>>>>>> running inside the guest?  If it's inside the guest we can't trust
>>>>>>> it to
>>>>>>> do anything about stopping impact to other devices.
>>>>>> No. The driver is in the hypervisor (virtio_vfio_pci). This is the
>>>>>> migration driver, right ?
>>>>> Well, talking things like virtio_vfio_pci that is not mentioned before
>>>>> and not justified on the list may easily confuse people. As pointed
>>>>> out in another thread, it has too many disadvantages over the existing
>>>>> virtio-pci vdpa driver. And it just duplicates a partial function of
>>>>> what virtio-pci vdpa driver can do. I don't think we will go that way.
>>>> This was just an example for David to help with understanding the
>>>> solution since he thought that the guest drivers somehow should be changed.
>>>>
>>>> David I'm sorry if I confused you.
>>>>
>>>> Again Jason, you try to propose your vDPA solution that is not what
>>>> we're trying to achieve in this work. Think of a world without vDPA.
>>> Well, I'd say, let's think vDPA a superset of virtio, not just the
>>> acceleration technologies.
>> I'm sorry but vDPA is not relevant to this discussion.
> Well, it's you that mention the software things like VFIO first.
>
>> Anyhow, I don't see any problem for vDPA driver to work on top of the
>> design proposed here.
>>
>>>> Also I don't understand how vDPA is related to virtio specification
>>>> decisions ?
>>> So how is VFIO related to virtio specific decisions? That's why I
>>> think we should avoid talking about software architecture here. It's
>>> the wrong community.
>> VFIO is not related to virtio spec.
> Of course.
>
>> It was an example for David. What is the problem with giving examples to
>> ease on people to understand the solution ?
> I don't think your example ease the understanding.
>
>> Where did you see that the design is referring to VFIO ?
>>
>>>>    make vDPA into virtio and then we can open a discussion.
>>>>
>>>> I'm interesting in virtio migration of HW devices.
>>>>
>>>> The proposal in this thread is actually get support from Michal AFAIU
>>>> and also others were happy with. All beside of you.
>>> So I think I've clairfied my several times :(
>>>
>>> - I'm fairly ok with the proposal
>> It doesn't seems like that.
>>
>>> - but we decouple the basic facility out of the admin virtqueue and
>>> this seems agreed by Michael:
>>>
>>> Let's take the dirty page tracking as an example:
>>>
>>> 1) let's first define that as one of the basic facility
>>> 2) then we can introduce admin virtqueue or other stuffs as an
>>> interface for that facility
>>>
>>> Does this work for you?
>> What I really want is to agree that the right way to manage migration
>> process of a virtio VF. My proposal is doing so by creating a
>> communication channel in its parent PF.
> It looks to me you never answer the question "why it must be done by PF".

This is not relevant question. In our profession you can solve a problem 
with more than 1 way.

We need to find the robust one.

>
> All the functions provided by PF so far for software is not expected
> to be used by VMM like Qemu. Those functions usually requires
> capability or privileges for the management software to use. You
> mentioned things like "supervisor" and "permission", but it looks to
> me you are still unaware how it connect to the security stuffs.

I now see that you don't understand at all what I'm proposing here.

Maybe you can go back to the questions David asked and read my answers 
to get better understanding of the solution.

>
>> I think I got a confirmation here.
>>
>> This communication channel is not introduced in this thread, but
>> obviously it should be an adminq.
> Let me clarify. What I want to say is admin should be one of the
> possible channels.

if you want to fork and create more than 1 way to do things we can check 
other options.

BTW, In the 2019 conference I saw that MST talked about adding LM to the 
spec and hint that the PF should manage the VF.

Adding some non-ready HW platforms consideration, future technologies 
and hypervisor hacks in the design of virtio LM sounds weird to me.

I still don't understand why you can't do all the things you wish doing 
with simple commands sent via the admin-q and insist of splitting 
devices and splitting config spaces and bunch of other hacks.

Don't you prefer a robust solution to work with any existing platform 
today ? or do you aim for future solution ?

>> For your future scalable functions, the Parent Device (lets call it PD)
>> will manage the creation/migration/destruction process for its Virtual
>> Devices (lets call them VDs) using the PD adminq.
>>
>> Agreed ?
> They are two different set of functions:
>
> - provisioning/creation/destruction: requires privilege and we don't
> have any plan to expose them to the guest. It should be done via PF or
> PD for security as you mentioned above.
> - migration: doesn't require privilege, and it can be expose to the
> guest, if can be done in either PF or VF. To me using VF is much more
> natural,  but using PF is also fine.

migration exposed to the guest ? No.

This is a basic assumption, really.

I think this is the problem in the whole discussion.

I think all the community agree that guest shouldn't be aware of 
migration. You must understand this.

Once you do, all this process will be easier and we'll progress instead 
of running in circles.

>
> An exception for the migration is the dirty page tracking, without DMA
> isolation, we may end up with security issue if we do that in the VF.

Lets start with basic migration first.

In my model the Hypervisor kernel control this. No security issue since 
the kernel is a secured entity.

This is what we do already in our solution for NIC devices.

I don't want virtio to be behind.

>
>> Please don't answer that this is not a "must". This is my proposal. If
>> you have another proposal, please propose.
> Well, you are asking for the comments instead of enforcing things right?
>
> And it's as simple as:
>
> 1) introduce admin virtqueue, and bind migration features to admin virtqueue
>
> or
>
> 2) introduce migration features and admin virtqueue independently
>
> What's the problem of do trivial modifications like 2)? Is that
> conflict with your proposal?

I did #2 already and then you asked me to do #1.

If I do #1 you'll ask #2.

I'm progressing towards final solution. I got the feedback I need.

>
>>>> We do it in mlx5 and we didn't see any issues with that design.
>>>>
>>> If we seperate things as I suggested, I'm totally fine.
>> separate what ?
>>
>> Why should I create different interfaces for different management tasks.
> I don't say you need to create different interfaces. It's for future extensions:
>
> 1) When VIRTIO_F_ADMIN_VQ is negotiated, the interface is admin virtqueue
> 2) When other features is negotiated, the interface is other.
>
> In order to make 2) work, we need introduce migration and admin
> virtqueue separately.
>
> Migration is not management task which doesn't require any privilege.

You need to control the operational state of a device, track its dirty 
pages, save/restore internal HW state.

If you think that anyone can do it to a virtio device so lets see this 
magic works (I believe that only the parent/management device can do it 
on behalf of the migration software).

>
>> I have a virtual/scalable device that I want to  refer to from the
>> physical/parent device using some interface.
>>
>> This interface is adminq. This interface will be used for dirty_page
>> tracking and operational state changing and get/set internal state as
>> well. And more (create/destroy SF for example).
>>
>> You can think of this in some other way, i'm fine with it. As long as
>> the final conclusion is the same.
>>
>>>> I don't think you can say that we "go that way".
>>> For "go that way" I meant the method of using vfio_virtio_pci, it has
>>> nothing related to the discussion of "using PF to control VF" on the
>>> spec.
>> This was an example. Please leave it as an example for David.
>>
>>
>>>> You're trying to build a complementary solution for creating scalable
>>>> functions and for some reason trying to sabotage NVIDIA efforts to add
>>>> new important functionality to virtio.
>>> Well, it's a completely different topic. And it doesn't conflict with
>>> anything that is proposed here by you. I think I've stated this
>>> several times.  I don't think we block each other, it's just some
>>> unification work if one of the proposals is merged first. I sent them
>>> recently because it will be used as a material for my talk on the KVM
>>> Forum which is really near.
>> In theory you're right. We shouldn't block each other, and I don't block
>> you. But for some reason I see that you do try to block my proposal and
>> I don't understand why.
> I don't want to block your proposal, let's decouple the migration
> feature out of admin virtqueue. Then it's fine.
>
> The problem I see is that, you tend to refuse such a trivial but
> beneficial change. That's what I don't understand.

I thought I explained it. Nothing keeps you happy. If we A, you ask for 
B. if we do B you as for A.

I continue with the feedback I get from MST.

>
>> I feel like I wasted 2 months on a discussion instead of progressing.
> Well, I'm not sure 2 months is short, but it's usually take more than
> a year for huge project in Linux.

But if you go in circles it will never end, right ?

>
> Patience may help us to understand the points of each other better.

first I want to agree on the above migration concepts I wrote.

If we don't agree on that, the discussion is useless.

>
>> But now I do see a progress. A PF to manage VF migration is the way to
>> go forward.
>>
>> And the following RFC will take this into consideration.
>>
>>>> This also sabotage the evolvment of virtio as a standard.
>>>>
>>>> You're trying to enforce some un-finished idea that should work on some
>>>> future specific HW platform instead of helping defining a good spec for
>>>> virtio.
>>> Let's open another thread for this if you wish, it has nothing related
>>> to the spec but how it is implemented in Linux. If you search the
>>> archive, something similar to "vfio_virtio_pci" has been proposed
>>> several years before by Intel. The idea has been rejected, and we have
>>> leveraged Linux vDPA bus for virtio-pci devices.
>> I don't know this history. And I will happy to hear about it one day.
>>
>> But for our discussion in Linux, virtio_vfio_pci will happen. And it
>> will implement the migration logic of a virtio device with PCI transport
>> for VFs using the PF admin queue.
>>
>> We at NVIDIA, currently upstreaming (alongside with AlexW and Cornelia)
>> a vfio-pci separation that will enable an easy creation of vfio-pci
>> vendor/protocol drivers to do some specific tasks.
>>
>> New drivers such as mlx5_vfio_pci, hns_vfio_pci, virtio_vfio_pci and
>> nvme_vfio_pci should be implemented in the near future in Linux to
>> enable migration of these devices.
>>
>> This is just an example. And it's not related to the spec nor the
>> proposal at all.
> Let's move those discussions to the right list. I'm pretty sure there
> will a long debate there. Please prepare for that.

We already discussed this with AlexW, Cornelia, JasonG, ChristophH and 
others.

And before we have a virtio spec for LM we can't discuss about it in the 
Linux mailing list.

It will waste everyone's time.

>
>>>> And all is for having users to choose vDPA framework instead of using
>>>> plain virtio.
>>>>
>>>> We believe in our solution and we have a working prototype. We'll
>>>> continue with our discussion to convince the community with it.
>>> Again, it looks like there's a lot of misunderstanding. Let's open a
>>> thread on the suitable list instead of talking about any specific
>>> software solution or architecture here. This will speed up things.
>> I prefer to finish the specification first. SW arch is clear for us in
>> Linux. We did it already for mlx5 devices and it will be the same for
>> virtio if the spec changes will be accepted.
> I disagree, but let's separate software discussion out of the spec
> discussion here.
>
> Thanks
>
>> Thanks.
>>
>>
>>> Thanks
>>>
>>>> Thanks.
>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> The guest is running as usual. It doesn't aware on the migration at all.
>>>>>>
>>>>>> This is the point I try to make here. I don't (and I can't) change
>>>>>> even 1 line of code in the guest.
>>>>>>
>>>>>> e.g:
>>>>>>
>>>>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
>>>>>> (bounded to VF5) --> send admin command on PF adminq to start
>>>>>> tracking dirty pages for VF5 --> PF device will do it
>>>>>>
>>>>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
>>>>>> (bounded to VF5) --> send admin command on PF adminq to quiesce VF5
>>>>>> --> PF device will do it
>>>>>>
>>>>>> You can take a look how we implement mlx5_vfio_pci in the link I
>>>>>> provided.
>>>>>>
>>>>>>> Dave
>>>>>>>
>>>>>>>
>>>>>>>> We already do this in mlx5 NIC migration. The kernel is secured and
>>>>>>>> QEMU
>>>>>>>> interface is the VF.
>>>>>>>>
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>>>>>>>> An example of this approach can be seen in the way NVIDIA
>>>>>>>>>>>>>>>> performs
>>>>>>>>>>>>>>>> live migration of a ConnectX NIC function:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>>>>>>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated
>>>>>>>>>>>>>>>> software defined
>>>>>>>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for
>>>>>>>>>>>>>>>> storage
>>>>>>>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its
>>>>>>>>>>>>>>>> standard
>>>>>>>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
>>>>>>>>>>>>>>>> specifications.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In order to implement Live Migration for these virtual
>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>> devices, that use a standard drivers as mentioned, the
>>>>>>>>>>>>>>>> specification
>>>>>>>>>>>>>>>> should define how HW vendor should build their devices and
>>>>>>>>>>>>>>>> for SW
>>>>>>>>>>>>>>>> developers to adjust the drivers.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This will enable specification compliant vendor agnostic
>>>>>>>>>>>>>>>> solution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
>>>>>>>>>>>>>>>> (internal HW design doc) and I guess that this is the way
>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>> vendors work.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For that, I would like to know if the approach of “PF that
>>>>>>>>>>>>>>>> controls
>>>>>>>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO
>>>>>>>>>>>>>>>> technical
>>>>>>>>>>>>>>>> group ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure but I think it's better to start from the general
>>>>>>>>>>>>>>> facility for all transports, then develop features for a
>>>>>>>>>>>>>>> specific
>>>>>>>>>>>>>>> transport.
>>>>>>>>>>>>>> a general facility for all transports can be a generic admin
>>>>>>>>>>>>>> queue ?
>>>>>>>>>>>>> It could be a virtqueue or a transport specific method (pcie
>>>>>>>>>>>>> capability).
>>>>>>>>>>>> No. You said a general facility for all transports.
>>>>>>>>>>> For general facility, I mean the chapter 2 of the spec which is
>>>>>>>>>>> general
>>>>>>>>>>>
>>>>>>>>>>> "
>>>>>>>>>>> 2 Basic Facilities of a Virtio Device
>>>>>>>>>>> "
>>>>>>>>>>>
>>>>>>>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I
>>>>>>>>>> can add "2.12
>>>>>>>>>> Admin Virtqueues" and this is what I did in the RFC.
>>>>>>>>>>
>>>>>>>>>>>> Transport specific is not general.
>>>>>>>>>>> The transport is in charge of implementing the interface for
>>>>>>>>>>> those facilities.
>>>>>>>>>> Transport specific is not general.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>> E.g we can define what needs to be migrated for the virtio-blk
>>>>>>>>>>>>> first
>>>>>>>>>>>>> (the device state). Then we can define the interface to get
>>>>>>>>>>>>> and set
>>>>>>>>>>>>> those states via admin virtqueue. Such decoupling may ease the
>>>>>>>>>>>>> future
>>>>>>>>>>>>> development of the transport specific migration interface.
>>>>>>>>>>>> I asked a simple question here.
>>>>>>>>>>>>
>>>>>>>>>>>> Lets stick to this.
>>>>>>>>>>> I answered this question.
>>>>>>>>>> No you didn't answer.
>>>>>>>>>>
>>>>>>>>>> I asked  if the approach of “PF that controls the VF live
>>>>>>>>>> migration process”
>>>>>>>>>> is acceptable by the VIRTIO technical group ?
>>>>>>>>>>
>>>>>>>>>> And you take the discussion to your direction instead of
>>>>>>>>>> answering a Yes/No
>>>>>>>>>> question.
>>>>>>>>>>
>>>>>>>>>>>        The virtqueue could be one of the
>>>>>>>>>>> approaches. And it's your responsibility to convince the community
>>>>>>>>>>> about that approach. Having an example may help people to
>>>>>>>>>>> understand
>>>>>>>>>>> your proposal.
>>>>>>>>>>>
>>>>>>>>>>>> I'm not referring to internal state definitions.
>>>>>>>>>>> Without an example, how do we know if it can work well?
>>>>>>>>>>>
>>>>>>>>>>>> Can you please not change the subject of my initial intent in
>>>>>>>>>>>> the email ?
>>>>>>>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
>>>>>>>>>>> your proposal.
>>>>>>>>>> The virtio-blk PF admin queue will be used to manage the
>>>>>>>>>> virtio-blk VF
>>>>>>>>>> migration.
>>>>>>>>>>
>>>>>>>>>> This is the whole discussion. I don't want to get into resolution.
>>>>>>>>>>
>>>>>>>>>> Since you already know the answer as I published 4 RFCs already
>>>>>>>>>> with all the
>>>>>>>>>> flow.
>>>>>>>>>>
>>>>>>>>>> Lets stick to my question.
>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Max.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This publicly archived list offers a means to provide input
>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to verify user consent to the Feedback License terms
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> to minimize spam in the list archive, subscription is required
>>>>>>>>>>>>>> before posting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>>>>>>>>>> List archive:
>>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1%2FZMUIr0f%2FfEgDjB8MQ9a2lHiXr4SkLqSG44r6kgJeQ%3D&amp;reserved=0
>>>>>>>>>>>>>> Feedback License:
>>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=m%2BnMdp9ZA%2BqBU8PzC8HWJB1ouzyUx35VQApAFV8HeSg%3D&amp;reserved=0
>>>>>>>>>>>>>> List Guidelines:
>>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=57RGeurJjgZWxGajJPJh6BeR1vQ2OYLYdjTNba2HsPM%3D&amp;reserved=0
>>>>>>>>>>>>>> Committee:
>>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Wlkao282wqvghnPNiM2ER6I6GKO%2Fhe2LbDCFH%2FOnkko%3D&amp;reserved=0
>>>>>>>>>>>>>> Join OASIS:
>>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=0q2g8y0CtJh5dqRKNE%2FzDC3wOC5kqn%2FDVjnNhj3FFGo%3D&amp;reserved=0
>>>>>>>>>>>>>>
>>>>>>>>>> This publicly archived list offers a means to provide input to the
>>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>>>>>
>>>>>>>>>> In order to verify user consent to the Feedback License terms and
>>>>>>>>>> to minimize spam in the list archive, subscription is required
>>>>>>>>>> before posting.
>>>>>>>>>>
>>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>>>>>> List archive:
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1%2FZMUIr0f%2FfEgDjB8MQ9a2lHiXr4SkLqSG44r6kgJeQ%3D&amp;reserved=0
>>>>>>>>>> Feedback License:
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=m%2BnMdp9ZA%2BqBU8PzC8HWJB1ouzyUx35VQApAFV8HeSg%3D&amp;reserved=0
>>>>>>>>>> List Guidelines:
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=57RGeurJjgZWxGajJPJh6BeR1vQ2OYLYdjTNba2HsPM%3D&amp;reserved=0
>>>>>>>>>> Committee:
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Wlkao282wqvghnPNiM2ER6I6GKO%2Fhe2LbDCFH%2FOnkko%3D&amp;reserved=0
>>>>>>>>>> Join OASIS:
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=0q2g8y0CtJh5dqRKNE%2FzDC3wOC5kqn%2FDVjnNhj3FFGo%3D&amp;reserved=0
>>>>>>>>>>
>>>>>> This publicly archived list offers a means to provide input to the
>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>>>
>>>>>> In order to verify user consent to the Feedback License terms and
>>>>>> to minimize spam in the list archive, subscription is required
>>>>>> before posting.
>>>>>>
>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>>>> List archive:
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1%2FZMUIr0f%2FfEgDjB8MQ9a2lHiXr4SkLqSG44r6kgJeQ%3D&amp;reserved=0
>>>>>> Feedback License:
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=m%2BnMdp9ZA%2BqBU8PzC8HWJB1ouzyUx35VQApAFV8HeSg%3D&amp;reserved=0
>>>>>> List Guidelines:
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=57RGeurJjgZWxGajJPJh6BeR1vQ2OYLYdjTNba2HsPM%3D&amp;reserved=0
>>>>>> Committee:
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Wlkao282wqvghnPNiM2ER6I6GKO%2Fhe2LbDCFH%2FOnkko%3D&amp;reserved=0
>>>>>> Join OASIS:
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QKSFAtueKKrXjhe2pIE1yVJ3pjNC0F%2FGvcXotSbnlCw%3D&amp;reserved=0
>>>>>>
>>>> This publicly archived list offers a means to provide input to the
>>>> OASIS Virtual I/O Device (VIRTIO) TC.
>>>>
>>>> In order to verify user consent to the Feedback License terms and
>>>> to minimize spam in the list archive, subscription is required
>>>> before posting.
>>>>
>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
>>>> List help: virtio-comment-help@lists.oasis-open.org
>>>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2FvradAyVbbFzSdoy7vFrIo61VQNV%2Fgn9swdf5kTaiQU%3D&amp;reserved=0
>>>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1inerAxzEQMA7QcJL5SmBE88VcW98PyxM0qJ5k%2B2B1c%3D&amp;reserved=0
>>>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=uLlc0YNz7wXRXwO99ieHX25nBwKCyTBqVatNoc6BbSg%3D&amp;reserved=0
>>>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=rGmEPe%2FZR%2FaxPUMfZdnLuQdszA4l39gccvEQkNUl9ds%3D&amp;reserved=0
>>>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QKSFAtueKKrXjhe2pIE1yVJ3pjNC0F%2FGvcXotSbnlCw%3D&amp;reserved=0
>>>>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-20  2:17                 ` Jason Wang
  2021-08-20  7:03                   ` Michael S. Tsirkin
@ 2021-08-23 12:08                   ` Dr. David Alan Gilbert
  2021-08-24  3:00                     ` Jason Wang
  1 sibling, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2021-08-23 12:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Max Gurtovoy, virtio-comment, cohuck,
	Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

* Jason Wang (jasowang@redhat.com) wrote:
> 
> 在 2021/8/19 下午10:58, Michael S. Tsirkin 写道:
> > On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > > > The PF device will have an option to quiesce/freeze the VF device.
> > > 
> > > Is such design a must? If no, why not simply introduce those functions in
> > > the VF?
> > Many IOMMUs only support protections at the function level.
> > Thus we need ability to have one device (e.g. a PF)
> > to control migration of another (e.g. a VF).
> 
> 
> So as discussed previously, the only possible "advantage" is that the DMA is
> isolated.
> 
> 
> > This is because allowing VF to access hypervisor memory used for
> > migration is not a good idea.
> > For IOMMUs that support subfunctions, these "devices" could be
> > subfunctions.
> > 
> > The only alternative is to keep things in device memory which
> > does not need an IOMMU.
> > I guess we'd end up with something like a VQ in device memory which might
> > be tricky from multiple points of view, but yes, this could be
> > useful and people did ask for such a capability in the past.
> 
> 
> I assume the spec already support this. We probably need some clarification
> at the transport layer. But it's as simple as setting MMIO are as virtqueue
> address?
> 
> Except for the dirty bit tracking, we don't have bulk data that needs to be
> transferred during migration. So a virtqueue is not must even in this case.
> 
> 
> > 
> > > If yes, what's the reason for making virtio different (e.g VCPU live
> > > migration is not designed like that)?
> > I think the main difference is we need PF's help for memory
> > tracking for pre-copy migration anyway.
> 
> 
> Such kind of memory tracking is not a must. KVM uses software assisted
> technologies (write protection) and it works very well. For virtio,
> technology like shadow virtqueue has been used by DPDK and prototyped by
> Eugenio.
> 
> Even if we want to go with hardware technology, we have many alternatives
> (as we've discussed in the past):
> 
> 1) IOMMU dirty bit (E.g modern IOMMU have EA bit for logging external device
> write)
> 2) Write protection via IOMMU or device MMU
> 3) Address space ID for isolating DMAs

What's the state of those - last time I chatted to anyone about IOMMUs
doing protection, things were at the 'in the future' stage.

Dave

> Using physical function is sub-optimal that all of the above since:
> 
> 1) limited to a specific transport or implementation and it doesn't work for
> device or transport without PF
> 2) the virtio level function is not self contained, this makes any feature
> that ties to PF impossible to be used in the nested layer
> 3) more complicated than leveraging the existing facilities provided by the
> platform or transport
> 
> Consider (P)ASID will be ready very soon, workaround the platform limitation
> via PF is not a good idea for me. Especially consider it's not a must and we
> had already prototype the software assisted technology.
> 
> 
> >   Might as well integrate
> > the rest of state in the same channel.
> 
> 
> That's another question. I think for the function that is a must for doing
> live migration, introducing them in the function itself is the most natural
> way since we did all the other facilities there. This ease the function that
> can be used in the nested layer.
> 
> And using the channel in the PF is not coming for free. It requires
> synchronization in the software or even QOS.
> 
> Or we can just separate the dirty page tracking into PF (but need to define
> them as basic facility for future extension).
> 
> 
> > 
> > Another answer is that CPUs trivially switch between
> > functions by switching the active page tables. For PCI DMA
> > it is all much trickier sine the page tables can be separate
> > from the device, and assumed to be mostly static.
> 
> 
> I don't see much different, the page table is also separated from the CPU.
> If the device supports state save and restore we can scheduling the multiple
> VMs/VCPUs on the same device.
> 
> 
> > So if you want to create something like the VMCS then
> > again you either need some help from another device or
> > put it in device memory.
> 
> 
> For CPU virtualization, the states could be saved and restored via MSRs. For
> virtio, accessing them via registers is also possible and much more simple.
> 
> Thanks
> 
> 
> > 
> > 
> 
> 
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-19 15:20                   ` Max Gurtovoy
  2021-08-20  2:24                     ` Jason Wang
@ 2021-08-23 12:18                     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2021-08-23 12:18 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Wang, virtio-comment, Michael S. Tsirkin, cohuck,
	Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

* Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> 
> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
> > * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> > > On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
> > > > * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> > > > > On 8/18/2021 1:46 PM, Jason Wang wrote:
> > > > > > On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > > > > > > On 8/17/2021 12:44 PM, Jason Wang wrote:
> > > > > > > > On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > > > > > > > > On 8/17/2021 11:51 AM, Jason Wang wrote:
> > > > > > > > > > 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> > > > > > > > > > > Hi all,
> > > > > > > > > > > 
> > > > > > > > > > > Live migration is one of the most important features of
> > > > > > > > > > > virtualization and virtio devices are oftenly found in virtual
> > > > > > > > > > > environments.
> > > > > > > > > > > 
> > > > > > > > > > > The migration process is managed by a migration SW that is running on
> > > > > > > > > > > the hypervisor and the VM is not aware of the process at all.
> > > > > > > > > > > 
> > > > > > > > > > > Unlike the vDPA case, a real pci Virtual Function state resides in
> > > > > > > > > > > the HW.
> > > > > > > > > > > 
> > > > > > > > > > vDPA doesn't prevent you from having HW states. Actually from the view
> > > > > > > > > > of the VMM(Qemu), it doesn't care whether or not a state is stored in
> > > > > > > > > > the software or hardware. A well designed VMM should be able to hide
> > > > > > > > > > the virtio device implementation from the migration layer, that is how
> > > > > > > > > > Qemu is wrote who doesn't care about whether or not it's a software
> > > > > > > > > > virtio/vDPA device or not.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > In our vision, in order to fulfil the Live migration requirements for
> > > > > > > > > > > virtual functions, each physical function device must implement
> > > > > > > > > > > migration operations. Using these operations, it will be able to
> > > > > > > > > > > master the migration process for the virtual function devices. Each
> > > > > > > > > > > capable physical function device has a supervisor permissions to
> > > > > > > > > > > change the virtual function operational states, save/restore its
> > > > > > > > > > > internal state and start/stop dirty pages tracking.
> > > > > > > > > > > 
> > > > > > > > > > For "supervisor permissions", is this from the software point of view?
> > > > > > > > > > Maybe it's better to give an example for this.
> > > > > > > > > A permission to a PF device for quiesce and freeze a VF device for example.
> > > > > > > > Note that for safety, VMM (e.g Qemu) is usually running without any privileges.
> > > > > > > You're mixing layers here.
> > > > > > > 
> > > > > > > QEMU is not involved here. It's only sending IOCTLs to migration driver.
> > > > > > > The migration driver will control the migration process of the VF using
> > > > > > > the PF communication channel.
> > > > > > So who will be granted the "permission" you mentioned here?
> > > > > This is just an expression.
> > > > > 
> > > > > What is not clear ?
> > > > > 
> > > > > The PF device will have an option to quiesce/freeze the VF device.
> > > > > 
> > > > > This is simple. Why are you looking for some sophisticated problems ?
> > > > I'm trying to follow along here and have not completely; but I think the issue is a
> > > > security separation one.
> > > > The VMM (e.g. qemu) that has been given access to one of the VF's is
> > > > isolated and shouldn't be able to go poking at other devices; so it
> > > > can't go poking at the PF (it probably doesn't even have the PF device
> > > > node accessible) - so then the question is who has access to the
> > > > migration driver and how do you make sure it can only deal with VF's
> > > > that it's supposed to be able to migrate.
> > > The QEMU/userspace doesn't know or care about the PF connection and internal
> > > virtio_vfio_pci driver implementation.
> > OK
> > 
> > > You shouldn't change 1 line of code in the VM driver nor in QEMU.
> > Hmm OK.
> > 
> > > QEMU does not have access to the PF. Only the kernel driver that has access
> > > to the VF will have access to the PF communication channel.  There is no
> > > permission problem here.
> > > 
> > > The kernel driver of the VF will do this internally, and make sure that the
> > > commands it build will only impact the VF originating them.
> > > 
> > Now that confuses me; isn't the kernel driver that has access to the VF
> > running inside the guest?  If it's inside the guest we can't trust it to
> > do anything about stopping impact to other devices.
> 
> No. The driver is in the hypervisor (virtio_vfio_pci). This is the migration
> driver, right ?

Ah OK, the '*host* kernel driver of the VF' - that makes more sense to
me, especially with that just being VFIO.

> The guest is running as usual. It doesn't aware on the migration at all.
> 
> This is the point I try to make here. I don't (and I can't) change even 1
> line of code in the guest.
> 
> e.g:
> 
> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor (bounded
> to VF5) --> send admin command on PF adminq to start tracking dirty pages
> for VF5 --> PF device will do it
> 
> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor (bounded
> to VF5) --> send admin command on PF adminq to quiesce VF5 --> PF device
> will do it

Yeh that makes more sense.

Dave

> You can take a look how we implement mlx5_vfio_pci in the link I provided.
> 
> > 
> > Dave
> > 
> > 
> > > We already do this in mlx5 NIC migration. The kernel is secured and QEMU
> > > interface is the VF.
> > > 
> > > > Dave
> > > > 
> > > > > > > > > > > An example of this approach can be seen in the way NVIDIA performs
> > > > > > > > > > > live migration of a ConnectX NIC function:
> > > > > > > > > > > 
> > > > > > > > > > > https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> > > > > > > > > > > <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> > > > > > > > > > > 
> > > > > > > > > > > NVIDIAs SNAP technology enables hardware-accelerated software defined
> > > > > > > > > > > PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for storage
> > > > > > > > > > > and networking solutions. The host OS/hypervisor uses its standard
> > > > > > > > > > > drivers that are implemented according to a well-known VIRTIO
> > > > > > > > > > > specifications.
> > > > > > > > > > > 
> > > > > > > > > > > In order to implement Live Migration for these virtual function
> > > > > > > > > > > devices, that use a standard drivers as mentioned, the specification
> > > > > > > > > > > should define how HW vendor should build their devices and for SW
> > > > > > > > > > > developers to adjust the drivers.
> > > > > > > > > > > 
> > > > > > > > > > > This will enable specification compliant vendor agnostic solution.
> > > > > > > > > > > 
> > > > > > > > > > > This is exactly how we built the migration driver for ConnectX
> > > > > > > > > > > (internal HW design doc) and I guess that this is the way other
> > > > > > > > > > > vendors work.
> > > > > > > > > > > 
> > > > > > > > > > > For that, I would like to know if the approach of “PF that controls
> > > > > > > > > > > the VF live migration process” is acceptable by the VIRTIO technical
> > > > > > > > > > > group ?
> > > > > > > > > > > 
> > > > > > > > > > I'm not sure but I think it's better to start from the general
> > > > > > > > > > facility for all transports, then develop features for a specific
> > > > > > > > > > transport.
> > > > > > > > > a general facility for all transports can be a generic admin queue ?
> > > > > > > > It could be a virtqueue or a transport specific method (pcie capability).
> > > > > > > No. You said a general facility for all transports.
> > > > > > For general facility, I mean the chapter 2 of the spec which is general
> > > > > > 
> > > > > > "
> > > > > > 2 Basic Facilities of a Virtio Device
> > > > > > "
> > > > > > 
> > > > > It will be in chapter 2. Right after "2.11 Exporting Object" I can add "2.12
> > > > > Admin Virtqueues" and this is what I did in the RFC.
> > > > > 
> > > > > > > Transport specific is not general.
> > > > > > The transport is in charge of implementing the interface for those facilities.
> > > > > Transport specific is not general.
> > > > > 
> > > > > 
> > > > > > > > E.g we can define what needs to be migrated for the virtio-blk first
> > > > > > > > (the device state). Then we can define the interface to get and set
> > > > > > > > those states via admin virtqueue. Such decoupling may ease the future
> > > > > > > > development of the transport specific migration interface.
> > > > > > > I asked a simple question here.
> > > > > > > 
> > > > > > > Lets stick to this.
> > > > > > I answered this question.
> > > > > No you didn't answer.
> > > > > 
> > > > > I asked  if the approach of “PF that controls the VF live migration process”
> > > > > is acceptable by the VIRTIO technical group ?
> > > > > 
> > > > > And you take the discussion to your direction instead of answering a Yes/No
> > > > > question.
> > > > > 
> > > > > >      The virtqueue could be one of the
> > > > > > approaches. And it's your responsibility to convince the community
> > > > > > about that approach. Having an example may help people to understand
> > > > > > your proposal.
> > > > > > 
> > > > > > > I'm not referring to internal state definitions.
> > > > > > Without an example, how do we know if it can work well?
> > > > > > 
> > > > > > > Can you please not change the subject of my initial intent in the email ?
> > > > > > Did I? Basically, I'm asking how a virtio-blk can be migrated with
> > > > > > your proposal.
> > > > > The virtio-blk PF admin queue will be used to manage the virtio-blk VF
> > > > > migration.
> > > > > 
> > > > > This is the whole discussion. I don't want to get into resolution.
> > > > > 
> > > > > Since you already know the answer as I published 4 RFCs already with all the
> > > > > flow.
> > > > > 
> > > > > Lets stick to my question.
> > > > > 
> > > > > > Thanks
> > > > > > 
> > > > > > > Thanks.
> > > > > > > 
> > > > > > > 
> > > > > > > > Thanks
> > > > > > > > 
> > > > > > > > > > Thanks
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > Cheers,
> > > > > > > > > > > 
> > > > > > > > > > > -Max.
> > > > > > > > > > > 
> > > > > > > > > This publicly archived list offers a means to provide input to the
> > > > > > > > > OASIS Virtual I/O Device (VIRTIO) TC.
> > > > > > > > > 
> > > > > > > > > In order to verify user consent to the Feedback License terms and
> > > > > > > > > to minimize spam in the list archive, subscription is required
> > > > > > > > > before posting.
> > > > > > > > > 
> > > > > > > > > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > > > > > > > > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > > > > > > > > List help: virtio-comment-help@lists.oasis-open.org
> > > > > > > > > List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gVamhvYG3lbMVVyMz%2F%2Fq3VBMZKY47pqxvRi94Mp%2B%2B7I%3D&amp;reserved=0
> > > > > > > > > Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=U%2FxmgTCEaUTSgG%2BLohnAAuNXLncUOKU8yBxhkEMpmQk%3D&amp;reserved=0
> > > > > > > > > List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nXzbdkD4B4TzAFXbD%2B4Jap8rWmzX2CZ8fVnEE2f4Tdc%3D&amp;reserved=0
> > > > > > > > > Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Issa0O7V4p6MnuuJOcLDN4MAG77cSMSJ7MSZqvXol4%3D&amp;reserved=0
> > > > > > > > > Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pPqcruawglqgMjakkslrpSVZzaOu%2FCvfkTSuUfMiEh0%3D&amp;reserved=0
> > > > > > > > > 
> > > > > This publicly archived list offers a means to provide input to the
> > > > > OASIS Virtual I/O Device (VIRTIO) TC.
> > > > > 
> > > > > In order to verify user consent to the Feedback License terms and
> > > > > to minimize spam in the list archive, subscription is required
> > > > > before posting.
> > > > > 
> > > > > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > > > > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > > > > List help: virtio-comment-help@lists.oasis-open.org
> > > > > List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gVamhvYG3lbMVVyMz%2F%2Fq3VBMZKY47pqxvRi94Mp%2B%2B7I%3D&amp;reserved=0
> > > > > Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=U%2FxmgTCEaUTSgG%2BLohnAAuNXLncUOKU8yBxhkEMpmQk%3D&amp;reserved=0
> > > > > List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nXzbdkD4B4TzAFXbD%2B4Jap8rWmzX2CZ8fVnEE2f4Tdc%3D&amp;reserved=0
> > > > > Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8Issa0O7V4p6MnuuJOcLDN4MAG77cSMSJ7MSZqvXol4%3D&amp;reserved=0
> > > > > Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C9d6634c268e84039d18308d9631d0220%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637649798517296420%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pPqcruawglqgMjakkslrpSVZzaOu%2FCvfkTSuUfMiEh0%3D&amp;reserved=0
> > > > > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-23  8:55                               ` Max Gurtovoy
@ 2021-08-24  2:41                                 ` Jason Wang
  2021-08-24 13:10                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-24  2:41 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Dr. David Alan Gilbert, virtio-comment, Michael S. Tsirkin,
	cohuck, Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan,
	Bodong Wang, Jason Gunthorpe, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Mon, Aug 23, 2021 at 4:55 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>
>
> On 8/23/2021 6:10 AM, Jason Wang wrote:
> > On Sun, Aug 22, 2021 at 6:05 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >>
> >> On 8/20/2021 2:16 PM, Jason Wang wrote:
> >>> On Fri, Aug 20, 2021 at 6:26 PM Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >>>> On 8/20/2021 5:24 AM, Jason Wang wrote:
> >>>>> 在 2021/8/19 下午11:20, Max Gurtovoy 写道:
> >>>>>> On 8/19/2021 5:24 PM, Dr. David Alan Gilbert wrote:
> >>>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> >>>>>>>> On 8/19/2021 2:12 PM, Dr. David Alan Gilbert wrote:
> >>>>>>>>> * Max Gurtovoy (mgurtovoy@nvidia.com) wrote:
> >>>>>>>>>> On 8/18/2021 1:46 PM, Jason Wang wrote:
> >>>>>>>>>>> On Wed, Aug 18, 2021 at 5:16 PM Max Gurtovoy
> >>>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
> >>>>>>>>>>>> On 8/17/2021 12:44 PM, Jason Wang wrote:
> >>>>>>>>>>>>> On Tue, Aug 17, 2021 at 5:11 PM Max Gurtovoy
> >>>>>>>>>>>>> <mgurtovoy@nvidia.com> wrote:
> >>>>>>>>>>>>>> On 8/17/2021 11:51 AM, Jason Wang wrote:
> >>>>>>>>>>>>>>> 在 2021/8/12 下午8:08, Max Gurtovoy 写道:
> >>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Live migration is one of the most important features of
> >>>>>>>>>>>>>>>> virtualization and virtio devices are oftenly found in virtual
> >>>>>>>>>>>>>>>> environments.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The migration process is managed by a migration SW that is
> >>>>>>>>>>>>>>>> running on
> >>>>>>>>>>>>>>>> the hypervisor and the VM is not aware of the process at all.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Unlike the vDPA case, a real pci Virtual Function state
> >>>>>>>>>>>>>>>> resides in
> >>>>>>>>>>>>>>>> the HW.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> vDPA doesn't prevent you from having HW states. Actually
> >>>>>>>>>>>>>>> from the view
> >>>>>>>>>>>>>>> of the VMM(Qemu), it doesn't care whether or not a state is
> >>>>>>>>>>>>>>> stored in
> >>>>>>>>>>>>>>> the software or hardware. A well designed VMM should be able
> >>>>>>>>>>>>>>> to hide
> >>>>>>>>>>>>>>> the virtio device implementation from the migration layer,
> >>>>>>>>>>>>>>> that is how
> >>>>>>>>>>>>>>> Qemu is wrote who doesn't care about whether or not it's a
> >>>>>>>>>>>>>>> software
> >>>>>>>>>>>>>>> virtio/vDPA device or not.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In our vision, in order to fulfil the Live migration
> >>>>>>>>>>>>>>>> requirements for
> >>>>>>>>>>>>>>>> virtual functions, each physical function device must
> >>>>>>>>>>>>>>>> implement
> >>>>>>>>>>>>>>>> migration operations. Using these operations, it will be
> >>>>>>>>>>>>>>>> able to
> >>>>>>>>>>>>>>>> master the migration process for the virtual function
> >>>>>>>>>>>>>>>> devices. Each
> >>>>>>>>>>>>>>>> capable physical function device has a supervisor
> >>>>>>>>>>>>>>>> permissions to
> >>>>>>>>>>>>>>>> change the virtual function operational states,
> >>>>>>>>>>>>>>>> save/restore its
> >>>>>>>>>>>>>>>> internal state and start/stop dirty pages tracking.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> For "supervisor permissions", is this from the software
> >>>>>>>>>>>>>>> point of view?
> >>>>>>>>>>>>>>> Maybe it's better to give an example for this.
> >>>>>>>>>>>>>> A permission to a PF device for quiesce and freeze a VF
> >>>>>>>>>>>>>> device for example.
> >>>>>>>>>>>>> Note that for safety, VMM (e.g Qemu) is usually running
> >>>>>>>>>>>>> without any privileges.
> >>>>>>>>>>>> You're mixing layers here.
> >>>>>>>>>>>>
> >>>>>>>>>>>> QEMU is not involved here. It's only sending IOCTLs to
> >>>>>>>>>>>> migration driver.
> >>>>>>>>>>>> The migration driver will control the migration process of the
> >>>>>>>>>>>> VF using
> >>>>>>>>>>>> the PF communication channel.
> >>>>>>>>>>> So who will be granted the "permission" you mentioned here?
> >>>>>>>>>> This is just an expression.
> >>>>>>>>>>
> >>>>>>>>>> What is not clear ?
> >>>>>>>>>>
> >>>>>>>>>> The PF device will have an option to quiesce/freeze the VF device.
> >>>>>>>>>>
> >>>>>>>>>> This is simple. Why are you looking for some sophisticated
> >>>>>>>>>> problems ?
> >>>>>>>>> I'm trying to follow along here and have not completely; but I
> >>>>>>>>> think the issue is a
> >>>>>>>>> security separation one.
> >>>>>>>>> The VMM (e.g. qemu) that has been given access to one of the VF's is
> >>>>>>>>> isolated and shouldn't be able to go poking at other devices; so it
> >>>>>>>>> can't go poking at the PF (it probably doesn't even have the PF
> >>>>>>>>> device
> >>>>>>>>> node accessible) - so then the question is who has access to the
> >>>>>>>>> migration driver and how do you make sure it can only deal with VF's
> >>>>>>>>> that it's supposed to be able to migrate.
> >>>>>>>> The QEMU/userspace doesn't know or care about the PF connection and
> >>>>>>>> internal
> >>>>>>>> virtio_vfio_pci driver implementation.
> >>>>>>> OK
> >>>>>>>
> >>>>>>>> You shouldn't change 1 line of code in the VM driver nor in QEMU.
> >>>>>>> Hmm OK.
> >>>>>>>
> >>>>>>>> QEMU does not have access to the PF. Only the kernel driver that
> >>>>>>>> has access
> >>>>>>>> to the VF will have access to the PF communication channel. There
> >>>>>>>> is no
> >>>>>>>> permission problem here.
> >>>>>>>>
> >>>>>>>> The kernel driver of the VF will do this internally, and make sure
> >>>>>>>> that the
> >>>>>>>> commands it build will only impact the VF originating them.
> >>>>>>>>
> >>>>>>> Now that confuses me; isn't the kernel driver that has access to the VF
> >>>>>>> running inside the guest?  If it's inside the guest we can't trust
> >>>>>>> it to
> >>>>>>> do anything about stopping impact to other devices.
> >>>>>> No. The driver is in the hypervisor (virtio_vfio_pci). This is the
> >>>>>> migration driver, right ?
> >>>>> Well, talking things like virtio_vfio_pci that is not mentioned before
> >>>>> and not justified on the list may easily confuse people. As pointed
> >>>>> out in another thread, it has too many disadvantages over the existing
> >>>>> virtio-pci vdpa driver. And it just duplicates a partial function of
> >>>>> what virtio-pci vdpa driver can do. I don't think we will go that way.
> >>>> This was just an example for David to help with understanding the
> >>>> solution since he thought that the guest drivers somehow should be changed.
> >>>>
> >>>> David I'm sorry if I confused you.
> >>>>
> >>>> Again Jason, you try to propose your vDPA solution that is not what
> >>>> we're trying to achieve in this work. Think of a world without vDPA.
> >>> Well, I'd say, let's think vDPA a superset of virtio, not just the
> >>> acceleration technologies.
> >> I'm sorry but vDPA is not relevant to this discussion.
> > Well, it's you that mention the software things like VFIO first.
> >
> >> Anyhow, I don't see any problem for vDPA driver to work on top of the
> >> design proposed here.
> >>
> >>>> Also I don't understand how vDPA is related to virtio specification
> >>>> decisions ?
> >>> So how is VFIO related to virtio specific decisions? That's why I
> >>> think we should avoid talking about software architecture here. It's
> >>> the wrong community.
> >> VFIO is not related to virtio spec.
> > Of course.
> >
> >> It was an example for David. What is the problem with giving examples to
> >> ease on people to understand the solution ?
> > I don't think your example ease the understanding.
> >
> >> Where did you see that the design is referring to VFIO ?
> >>
> >>>>    make vDPA into virtio and then we can open a discussion.
> >>>>
> >>>> I'm interesting in virtio migration of HW devices.
> >>>>
> >>>> The proposal in this thread is actually get support from Michal AFAIU
> >>>> and also others were happy with. All beside of you.
> >>> So I think I've clairfied my several times :(
> >>>
> >>> - I'm fairly ok with the proposal
> >> It doesn't seems like that.
> >>
> >>> - but we decouple the basic facility out of the admin virtqueue and
> >>> this seems agreed by Michael:
> >>>
> >>> Let's take the dirty page tracking as an example:
> >>>
> >>> 1) let's first define that as one of the basic facility
> >>> 2) then we can introduce admin virtqueue or other stuffs as an
> >>> interface for that facility
> >>>
> >>> Does this work for you?
> >> What I really want is to agree that the right way to manage migration
> >> process of a virtio VF. My proposal is doing so by creating a
> >> communication channel in its parent PF.
> > It looks to me you never answer the question "why it must be done by PF".
>
> This is not relevant question. In our profession you can solve a problem
> with more than 1 way.
>
> We need to find the robust one.

Then you need to prove how robust your proposal is.

>
> >
> > All the functions provided by PF so far for software is not expected
> > to be used by VMM like Qemu. Those functions usually requires
> > capability or privileges for the management software to use. You
> > mentioned things like "supervisor" and "permission", but it looks to
> > me you are still unaware how it connect to the security stuffs.
>
> I now see that you don't understand at all what I'm proposing here.
>
> Maybe you can go back to the questions David asked and read my answers
> to get better understanding of the solution.
>
> >
> >> I think I got a confirmation here.
> >>
> >> This communication channel is not introduced in this thread, but
> >> obviously it should be an adminq.
> > Let me clarify. What I want to say is admin should be one of the
> > possible channels.
>
> if you want to fork and create more than 1 way to do things we can check
> other options.
>
> BTW, In the 2019 conference I saw that MST talked about adding LM to the
> spec and hint that the PF should manage the VF.
>
> Adding some non-ready HW platforms consideration, future technologies
> and hypervisor hacks in the design of virtio LM sounds weird to me.

What do you mean by "non-ready"?

I don't think I suggest you add anything, it's just about a
restructure of your current proposal.

>
> I still don't understand why you can't do all the things you wish doing
> with simple commands sent via the admin-q and insist of splitting
> devices and splitting config spaces and bunch of other hacks.
>
> Don't you prefer a robust solution to work with any existing platform
> today ? or do you aim for future solution ?

You never explain why it is robust. That's why I ask why it must be
done in that way.

>
> >> For your future scalable functions, the Parent Device (lets call it PD)
> >> will manage the creation/migration/destruction process for its Virtual
> >> Devices (lets call them VDs) using the PD adminq.
> >>
> >> Agreed ?
> > They are two different set of functions:
> >
> > - provisioning/creation/destruction: requires privilege and we don't
> > have any plan to expose them to the guest. It should be done via PF or
> > PD for security as you mentioned above.
> > - migration: doesn't require privilege, and it can be expose to the
> > guest, if can be done in either PF or VF. To me using VF is much more
> > natural,  but using PF is also fine.
>
> migration exposed to the guest ? No.

Can you explain why?

>
> This is a basic assumption, really.

That's just your assumption. Nested virt has been supported by some
cloud vendors.

>
> I think this is the problem in the whole discussion.

No, if you tie any feature to the admin virtqueue, it can't be used by
the guest. Migration is just an example.

>
> I think all the community agree that guest shouldn't be aware of
> migration. You must understand this.

I just make a minimal effort so we can enable this capability in the
future, why not?

>
> Once you do, all this process will be easier and we'll progress instead
> of running in circles.

I gave you a simple suggestion to make nested migration to work.

>
> >
> > An exception for the migration is the dirty page tracking, without DMA
> > isolation, we may end up with security issue if we do that in the VF.
>
> Lets start with basic migration first.
>
> In my model the Hypervisor kernel control this. No security issue since
> the kernel is a secured entity.

Well, if you do things in VF, is it unsafe?

>
> This is what we do already in our solution for NIC devices.
>
> I don't want virtio to be behind.
>
> >
> >> Please don't answer that this is not a "must". This is my proposal. If
> >> you have another proposal, please propose.
> > Well, you are asking for the comments instead of enforcing things right?
> >
> > And it's as simple as:
> >
> > 1) introduce admin virtqueue, and bind migration features to admin virtqueue
> >
> > or
> >
> > 2) introduce migration features and admin virtqueue independently
> >
> > What's the problem of do trivial modifications like 2)? Is that
> > conflict with your proposal?
>
> I did #2 already and then you asked me to do #1.

Where? I don't think you decouple migration out of the admin virtqueue
in any of the previous versions. If you think I do that, I would like
to clarify once again then.

>
> If I do #1 you'll ask #2.

How do you know that?

>
> I'm progressing towards final solution. I got the feedback I need.
>
> >
> >>>> We do it in mlx5 and we didn't see any issues with that design.
> >>>>
> >>> If we seperate things as I suggested, I'm totally fine.
> >> separate what ?
> >>
> >> Why should I create different interfaces for different management tasks.
> > I don't say you need to create different interfaces. It's for future extensions:
> >
> > 1) When VIRTIO_F_ADMIN_VQ is negotiated, the interface is admin virtqueue
> > 2) When other features is negotiated, the interface is other.
> >
> > In order to make 2) work, we need introduce migration and admin
> > virtqueue separately.
> >
> > Migration is not management task which doesn't require any privilege.
>
> You need to control the operational state of a device, track its dirty
> pages, save/restore internal HW state.
>
> If you think that anyone can do it to a virtio device so lets see this
> magic works (I believe that only the parent/management device can do it
> on behalf of the migration software).

Well, I think both of us want to make progress, let's do:

1) decouple the migration features out of admin virtqueue, this has
been agreed by Michael
2) introduce admin virtqueue as the interface for this

Then that's all fine.

>
> >
> >> I have a virtual/scalable device that I want to  refer to from the
> >> physical/parent device using some interface.
> >>
> >> This interface is adminq. This interface will be used for dirty_page
> >> tracking and operational state changing and get/set internal state as
> >> well. And more (create/destroy SF for example).
> >>
> >> You can think of this in some other way, i'm fine with it. As long as
> >> the final conclusion is the same.
> >>
> >>>> I don't think you can say that we "go that way".
> >>> For "go that way" I meant the method of using vfio_virtio_pci, it has
> >>> nothing related to the discussion of "using PF to control VF" on the
> >>> spec.
> >> This was an example. Please leave it as an example for David.
> >>
> >>
> >>>> You're trying to build a complementary solution for creating scalable
> >>>> functions and for some reason trying to sabotage NVIDIA efforts to add
> >>>> new important functionality to virtio.
> >>> Well, it's a completely different topic. And it doesn't conflict with
> >>> anything that is proposed here by you. I think I've stated this
> >>> several times.  I don't think we block each other, it's just some
> >>> unification work if one of the proposals is merged first. I sent them
> >>> recently because it will be used as a material for my talk on the KVM
> >>> Forum which is really near.
> >> In theory you're right. We shouldn't block each other, and I don't block
> >> you. But for some reason I see that you do try to block my proposal and
> >> I don't understand why.
> > I don't want to block your proposal, let's decouple the migration
> > feature out of admin virtqueue. Then it's fine.
> >
> > The problem I see is that, you tend to refuse such a trivial but
> > beneficial change. That's what I don't understand.
>
> I thought I explained it. Nothing keeps you happy. If we A, you ask for
> B. if we do B you as for A.

Firstly, I never do that, as mentioned I can clarify things if you
give a pointer to the previous discussion that can prove this.

Secondly, for technical discussion, it's not rare:

1) we start from A, and get the comments to see if we can go B
2) when we propose B, people think it's too complicated and ask us to
go back to A
3) a new version goes back to A

That's pretty natural, and it's not an endless circular.

>
> I continue with the feedback I get from MST.

Michael agreed to decouple the basic function out of admin virtqueue.

>
> >
> >> I feel like I wasted 2 months on a discussion instead of progressing.
> > Well, I'm not sure 2 months is short, but it's usually take more than
> > a year for huge project in Linux.
>
> But if you go in circles it will never end, right ?

See above.


>
> >
> > Patience may help us to understand the points of each other better.
>
> first I want to agree on the above migration concepts I wrote.
>
> If we don't agree on that, the discussion is useless.
>
> >
> >> But now I do see a progress. A PF to manage VF migration is the way to
> >> go forward.
> >>
> >> And the following RFC will take this into consideration.
> >>
> >>>> This also sabotage the evolvment of virtio as a standard.
> >>>>
> >>>> You're trying to enforce some un-finished idea that should work on some
> >>>> future specific HW platform instead of helping defining a good spec for
> >>>> virtio.
> >>> Let's open another thread for this if you wish, it has nothing related
> >>> to the spec but how it is implemented in Linux. If you search the
> >>> archive, something similar to "vfio_virtio_pci" has been proposed
> >>> several years before by Intel. The idea has been rejected, and we have
> >>> leveraged Linux vDPA bus for virtio-pci devices.
> >> I don't know this history. And I will happy to hear about it one day.
> >>
> >> But for our discussion in Linux, virtio_vfio_pci will happen. And it
> >> will implement the migration logic of a virtio device with PCI transport
> >> for VFs using the PF admin queue.
> >>
> >> We at NVIDIA, currently upstreaming (alongside with AlexW and Cornelia)
> >> a vfio-pci separation that will enable an easy creation of vfio-pci
> >> vendor/protocol drivers to do some specific tasks.
> >>
> >> New drivers such as mlx5_vfio_pci, hns_vfio_pci, virtio_vfio_pci and
> >> nvme_vfio_pci should be implemented in the near future in Linux to
> >> enable migration of these devices.
> >>
> >> This is just an example. And it's not related to the spec nor the
> >> proposal at all.
> > Let's move those discussions to the right list. I'm pretty sure there
> > will a long debate there. Please prepare for that.
>
> We already discussed this with AlexW, Cornelia, JasonG, ChristophH and
> others.

Vendor specific drivers are not interested here. And google
"nvme_vfio_pci" gives nothing to me. Where is the discussion?

For virtio, I need to make sure the design is generic with sufficient
ability to be extended in the future instead of a feature that can
only work for some specific vendor or platform.

Your proposal works only for PCI with SR-IOV. And I want to leverage
it to be useful for other platforms or transport. That's all my
motivation.

Thanks

>
> And before we have a virtio spec for LM we can't discuss about it in the
> Linux mailing list.
>
> It will waste everyone's time.
>
> >
> >>>> And all is for having users to choose vDPA framework instead of using
> >>>> plain virtio.
> >>>>
> >>>> We believe in our solution and we have a working prototype. We'll
> >>>> continue with our discussion to convince the community with it.
> >>> Again, it looks like there's a lot of misunderstanding. Let's open a
> >>> thread on the suitable list instead of talking about any specific
> >>> software solution or architecture here. This will speed up things.
> >> I prefer to finish the specification first. SW arch is clear for us in
> >> Linux. We did it already for mlx5 devices and it will be the same for
> >> virtio if the spec changes will be accepted.
> > I disagree, but let's separate software discussion out of the spec
> > discussion here.
> >
> > Thanks
> >
> >> Thanks.
> >>
> >>
> >>> Thanks
> >>>
> >>>> Thanks.
> >>>>
> >>>>> Thanks
> >>>>>
> >>>>>
> >>>>>> The guest is running as usual. It doesn't aware on the migration at all.
> >>>>>>
> >>>>>> This is the point I try to make here. I don't (and I can't) change
> >>>>>> even 1 line of code in the guest.
> >>>>>>
> >>>>>> e.g:
> >>>>>>
> >>>>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
> >>>>>> (bounded to VF5) --> send admin command on PF adminq to start
> >>>>>> tracking dirty pages for VF5 --> PF device will do it
> >>>>>>
> >>>>>> QEMU ioctl --> vfio (hypervisor) --> virtio_vfio_pci on hypervisor
> >>>>>> (bounded to VF5) --> send admin command on PF adminq to quiesce VF5
> >>>>>> --> PF device will do it
> >>>>>>
> >>>>>> You can take a look how we implement mlx5_vfio_pci in the link I
> >>>>>> provided.
> >>>>>>
> >>>>>>> Dave
> >>>>>>>
> >>>>>>>
> >>>>>>>> We already do this in mlx5 NIC migration. The kernel is secured and
> >>>>>>>> QEMU
> >>>>>>>> interface is the VF.
> >>>>>>>>
> >>>>>>>>> Dave
> >>>>>>>>>
> >>>>>>>>>>>>>>>> An example of this approach can be seen in the way NVIDIA
> >>>>>>>>>>>>>>>> performs
> >>>>>>>>>>>>>>>> live migration of a ConnectX NIC function:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> >>>>>>>>>>>>>>>> <https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> NVIDIAs SNAP technology enables hardware-accelerated
> >>>>>>>>>>>>>>>> software defined
> >>>>>>>>>>>>>>>> PCIe devices. virtio-blk/virtio-net/virtio-fs SNAP used for
> >>>>>>>>>>>>>>>> storage
> >>>>>>>>>>>>>>>> and networking solutions. The host OS/hypervisor uses its
> >>>>>>>>>>>>>>>> standard
> >>>>>>>>>>>>>>>> drivers that are implemented according to a well-known VIRTIO
> >>>>>>>>>>>>>>>> specifications.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In order to implement Live Migration for these virtual
> >>>>>>>>>>>>>>>> function
> >>>>>>>>>>>>>>>> devices, that use a standard drivers as mentioned, the
> >>>>>>>>>>>>>>>> specification
> >>>>>>>>>>>>>>>> should define how HW vendor should build their devices and
> >>>>>>>>>>>>>>>> for SW
> >>>>>>>>>>>>>>>> developers to adjust the drivers.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This will enable specification compliant vendor agnostic
> >>>>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is exactly how we built the migration driver for ConnectX
> >>>>>>>>>>>>>>>> (internal HW design doc) and I guess that this is the way
> >>>>>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>> vendors work.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For that, I would like to know if the approach of “PF that
> >>>>>>>>>>>>>>>> controls
> >>>>>>>>>>>>>>>> the VF live migration process” is acceptable by the VIRTIO
> >>>>>>>>>>>>>>>> technical
> >>>>>>>>>>>>>>>> group ?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm not sure but I think it's better to start from the general
> >>>>>>>>>>>>>>> facility for all transports, then develop features for a
> >>>>>>>>>>>>>>> specific
> >>>>>>>>>>>>>>> transport.
> >>>>>>>>>>>>>> a general facility for all transports can be a generic admin
> >>>>>>>>>>>>>> queue ?
> >>>>>>>>>>>>> It could be a virtqueue or a transport specific method (pcie
> >>>>>>>>>>>>> capability).
> >>>>>>>>>>>> No. You said a general facility for all transports.
> >>>>>>>>>>> For general facility, I mean the chapter 2 of the spec which is
> >>>>>>>>>>> general
> >>>>>>>>>>>
> >>>>>>>>>>> "
> >>>>>>>>>>> 2 Basic Facilities of a Virtio Device
> >>>>>>>>>>> "
> >>>>>>>>>>>
> >>>>>>>>>> It will be in chapter 2. Right after "2.11 Exporting Object" I
> >>>>>>>>>> can add "2.12
> >>>>>>>>>> Admin Virtqueues" and this is what I did in the RFC.
> >>>>>>>>>>
> >>>>>>>>>>>> Transport specific is not general.
> >>>>>>>>>>> The transport is in charge of implementing the interface for
> >>>>>>>>>>> those facilities.
> >>>>>>>>>> Transport specific is not general.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>> E.g we can define what needs to be migrated for the virtio-blk
> >>>>>>>>>>>>> first
> >>>>>>>>>>>>> (the device state). Then we can define the interface to get
> >>>>>>>>>>>>> and set
> >>>>>>>>>>>>> those states via admin virtqueue. Such decoupling may ease the
> >>>>>>>>>>>>> future
> >>>>>>>>>>>>> development of the transport specific migration interface.
> >>>>>>>>>>>> I asked a simple question here.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Lets stick to this.
> >>>>>>>>>>> I answered this question.
> >>>>>>>>>> No you didn't answer.
> >>>>>>>>>>
> >>>>>>>>>> I asked  if the approach of “PF that controls the VF live
> >>>>>>>>>> migration process”
> >>>>>>>>>> is acceptable by the VIRTIO technical group ?
> >>>>>>>>>>
> >>>>>>>>>> And you take the discussion to your direction instead of
> >>>>>>>>>> answering a Yes/No
> >>>>>>>>>> question.
> >>>>>>>>>>
> >>>>>>>>>>>        The virtqueue could be one of the
> >>>>>>>>>>> approaches. And it's your responsibility to convince the community
> >>>>>>>>>>> about that approach. Having an example may help people to
> >>>>>>>>>>> understand
> >>>>>>>>>>> your proposal.
> >>>>>>>>>>>
> >>>>>>>>>>>> I'm not referring to internal state definitions.
> >>>>>>>>>>> Without an example, how do we know if it can work well?
> >>>>>>>>>>>
> >>>>>>>>>>>> Can you please not change the subject of my initial intent in
> >>>>>>>>>>>> the email ?
> >>>>>>>>>>> Did I? Basically, I'm asking how a virtio-blk can be migrated with
> >>>>>>>>>>> your proposal.
> >>>>>>>>>> The virtio-blk PF admin queue will be used to manage the
> >>>>>>>>>> virtio-blk VF
> >>>>>>>>>> migration.
> >>>>>>>>>>
> >>>>>>>>>> This is the whole discussion. I don't want to get into resolution.
> >>>>>>>>>>
> >>>>>>>>>> Since you already know the answer as I published 4 RFCs already
> >>>>>>>>>> with all the
> >>>>>>>>>> flow.
> >>>>>>>>>>
> >>>>>>>>>> Lets stick to my question.
> >>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -Max.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This publicly archived list offers a means to provide input
> >>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In order to verify user consent to the Feedback License terms
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>> to minimize spam in the list archive, subscription is required
> >>>>>>>>>>>>>> before posting.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>>>>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>>>>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>>>>>>>>>>>> List archive:
> >>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1%2FZMUIr0f%2FfEgDjB8MQ9a2lHiXr4SkLqSG44r6kgJeQ%3D&amp;reserved=0
> >>>>>>>>>>>>>> Feedback License:
> >>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=m%2BnMdp9ZA%2BqBU8PzC8HWJB1ouzyUx35VQApAFV8HeSg%3D&amp;reserved=0
> >>>>>>>>>>>>>> List Guidelines:
> >>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=57RGeurJjgZWxGajJPJh6BeR1vQ2OYLYdjTNba2HsPM%3D&amp;reserved=0
> >>>>>>>>>>>>>> Committee:
> >>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Wlkao282wqvghnPNiM2ER6I6GKO%2Fhe2LbDCFH%2FOnkko%3D&amp;reserved=0
> >>>>>>>>>>>>>> Join OASIS:
> >>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=0q2g8y0CtJh5dqRKNE%2FzDC3wOC5kqn%2FDVjnNhj3FFGo%3D&amp;reserved=0
> >>>>>>>>>>>>>>
> >>>>>>>>>> This publicly archived list offers a means to provide input to the
> >>>>>>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>>>>>>>
> >>>>>>>>>> In order to verify user consent to the Feedback License terms and
> >>>>>>>>>> to minimize spam in the list archive, subscription is required
> >>>>>>>>>> before posting.
> >>>>>>>>>>
> >>>>>>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>>>>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>>>>>>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>>>>>>>> List archive:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1%2FZMUIr0f%2FfEgDjB8MQ9a2lHiXr4SkLqSG44r6kgJeQ%3D&amp;reserved=0
> >>>>>>>>>> Feedback License:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=m%2BnMdp9ZA%2BqBU8PzC8HWJB1ouzyUx35VQApAFV8HeSg%3D&amp;reserved=0
> >>>>>>>>>> List Guidelines:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=57RGeurJjgZWxGajJPJh6BeR1vQ2OYLYdjTNba2HsPM%3D&amp;reserved=0
> >>>>>>>>>> Committee:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Wlkao282wqvghnPNiM2ER6I6GKO%2Fhe2LbDCFH%2FOnkko%3D&amp;reserved=0
> >>>>>>>>>> Join OASIS:
> >>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=0q2g8y0CtJh5dqRKNE%2FzDC3wOC5kqn%2FDVjnNhj3FFGo%3D&amp;reserved=0
> >>>>>>>>>>
> >>>>>> This publicly archived list offers a means to provide input to the
> >>>>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>>>
> >>>>>> In order to verify user consent to the Feedback License terms and
> >>>>>> to minimize spam in the list archive, subscription is required
> >>>>>> before posting.
> >>>>>>
> >>>>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>>>> List archive:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1%2FZMUIr0f%2FfEgDjB8MQ9a2lHiXr4SkLqSG44r6kgJeQ%3D&amp;reserved=0
> >>>>>> Feedback License:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=m%2BnMdp9ZA%2BqBU8PzC8HWJB1ouzyUx35VQApAFV8HeSg%3D&amp;reserved=0
> >>>>>> List Guidelines:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=57RGeurJjgZWxGajJPJh6BeR1vQ2OYLYdjTNba2HsPM%3D&amp;reserved=0
> >>>>>> Committee:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190365210%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Wlkao282wqvghnPNiM2ER6I6GKO%2Fhe2LbDCFH%2FOnkko%3D&amp;reserved=0
> >>>>>> Join OASIS:
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QKSFAtueKKrXjhe2pIE1yVJ3pjNC0F%2FGvcXotSbnlCw%3D&amp;reserved=0
> >>>>>>
> >>>> This publicly archived list offers a means to provide input to the
> >>>> OASIS Virtual I/O Device (VIRTIO) TC.
> >>>>
> >>>> In order to verify user consent to the Feedback License terms and
> >>>> to minimize spam in the list archive, subscription is required
> >>>> before posting.
> >>>>
> >>>> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> >>>> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> >>>> List help: virtio-comment-help@lists.oasis-open.org
> >>>> List archive: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.oasis-open.org%2Farchives%2Fvirtio-comment%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2FvradAyVbbFzSdoy7vFrIo61VQNV%2Fgn9swdf5kTaiQU%3D&amp;reserved=0
> >>>> Feedback License: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fwho%2Fipr%2Ffeedback_license.pdf&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1inerAxzEQMA7QcJL5SmBE88VcW98PyxM0qJ5k%2B2B1c%3D&amp;reserved=0
> >>>> List Guidelines: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fpolicies-guidelines%2Fmailing-lists&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=uLlc0YNz7wXRXwO99ieHX25nBwKCyTBqVatNoc6BbSg%3D&amp;reserved=0
> >>>> Committee: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fcommittees%2Fvirtio%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=rGmEPe%2FZR%2FaxPUMfZdnLuQdszA4l39gccvEQkNUl9ds%3D&amp;reserved=0
> >>>> Join OASIS: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.oasis-open.org%2Fjoin%2F&amp;data=04%7C01%7Cmgurtovoy%40nvidia.com%7C0110223ec8a341665c2c08d965e38ef4%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637652851190375162%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QKSFAtueKKrXjhe2pIE1yVJ3pjNC0F%2FGvcXotSbnlCw%3D&amp;reserved=0
> >>>>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-23 12:08                   ` Dr. David Alan Gilbert
@ 2021-08-24  3:00                     ` Jason Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Wang @ 2021-08-24  3:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael S. Tsirkin, Max Gurtovoy, virtio-comment, cohuck,
	Parav Pandit, Shahaf Shuler, Ariel Adam, Amnon Ilan, Bodong Wang,
	Jason Gunthorpe, Stefan Hajnoczi, Eugenio Perez Martin,
	Liran Liss, Oren Duer

On Mon, Aug 23, 2021 at 8:08 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Jason Wang (jasowang@redhat.com) wrote:
> >
> > 在 2021/8/19 下午10:58, Michael S. Tsirkin 写道:
> > > On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > > > > The PF device will have an option to quiesce/freeze the VF device.
> > > >
> > > > Is such design a must? If no, why not simply introduce those functions in
> > > > the VF?
> > > Many IOMMUs only support protections at the function level.
> > > Thus we need ability to have one device (e.g. a PF)
> > > to control migration of another (e.g. a VF).
> >
> >
> > So as discussed previously, the only possible "advantage" is that the DMA is
> > isolated.
> >
> >
> > > This is because allowing VF to access hypervisor memory used for
> > > migration is not a good idea.
> > > For IOMMUs that support subfunctions, these "devices" could be
> > > subfunctions.
> > >
> > > The only alternative is to keep things in device memory which
> > > does not need an IOMMU.
> > > I guess we'd end up with something like a VQ in device memory which might
> > > be tricky from multiple points of view, but yes, this could be
> > > useful and people did ask for such a capability in the past.
> >
> >
> > I assume the spec already support this. We probably need some clarification
> > at the transport layer. But it's as simple as setting MMIO are as virtqueue
> > address?
> >
> > Except for the dirty bit tracking, we don't have bulk data that needs to be
> > transferred during migration. So a virtqueue is not must even in this case.
> >
> >
> > >
> > > > If yes, what's the reason for making virtio different (e.g VCPU live
> > > > migration is not designed like that)?
> > > I think the main difference is we need PF's help for memory
> > > tracking for pre-copy migration anyway.
> >
> >
> > Such kind of memory tracking is not a must. KVM uses software assisted
> > technologies (write protection) and it works very well. For virtio,
> > technology like shadow virtqueue has been used by DPDK and prototyped by
> > Eugenio.
> >
> > Even if we want to go with hardware technology, we have many alternatives
> > (as we've discussed in the past):
> >
> > 1) IOMMU dirty bit (E.g modern IOMMU have EA bit for logging external device
> > write)
> > 2) Write protection via IOMMU or device MMU
> > 3) Address space ID for isolating DMAs
>
> What's the state of those - last time I chatted to anyone about IOMMUs
> doing protection, things were at the 'in the future' stage.

For the IOMMU dirty bit, I don't check the hardware but it claims to
be supported by the vtd spec several years before.

For write protection via IOMMU, I think PRI or ATS has been supported
by some devices, especially the PRI allows a vendor specific way for
reporting page faults.

For device MMU, it has been supported by some vendors.

For ASID, PASID requires cpu and platform vendor support, but AFAIK,
it should be ready very soon, it could be the end of this year but I'm
not sure.

Thanks

>
> Dave
>
> > Using physical function is sub-optimal that all of the above since:
> >
> > 1) limited to a specific transport or implementation and it doesn't work for
> > device or transport without PF
> > 2) the virtio level function is not self contained, this makes any feature
> > that ties to PF impossible to be used in the nested layer
> > 3) more complicated than leveraging the existing facilities provided by the
> > platform or transport
> >
> > Consider (P)ASID will be ready very soon, workaround the platform limitation
> > via PF is not a good idea for me. Especially consider it's not a must and we
> > had already prototype the software assisted technology.
> >
> >
> > >   Might as well integrate
> > > the rest of state in the same channel.
> >
> >
> > That's another question. I think for the function that is a must for doing
> > live migration, introducing them in the function itself is the most natural
> > way since we did all the other facilities there. This ease the function that
> > can be used in the nested layer.
> >
> > And using the channel in the PF is not coming for free. It requires
> > synchronization in the software or even QOS.
> >
> > Or we can just separate the dirty page tracking into PF (but need to define
> > them as basic facility for future extension).
> >
> >
> > >
> > > Another answer is that CPUs trivially switch between
> > > functions by switching the active page tables. For PCI DMA
> > > it is all much trickier sine the page tables can be separate
> > > from the device, and assumed to be mostly static.
> >
> >
> > I don't see much different, the page table is also separated from the CPU.
> > If the device supports state save and restore we can scheduling the multiple
> > VMs/VCPUs on the same device.
> >
> >
> > > So if you want to create something like the VMCS then
> > > again you either need some help from another device or
> > > put it in device memory.
> >
> >
> > For CPU virtualization, the states could be saved and restored via MSRs. For
> > virtio, accessing them via registers is also possible and much more simple.
> >
> > Thanks
> >
> >
> > >
> > >
> >
> >
> > This publicly archived list offers a means to provide input to the
> > OASIS Virtual I/O Device (VIRTIO) TC.
> >
> > In order to verify user consent to the Feedback License terms and
> > to minimize spam in the list archive, subscription is required
> > before posting.
> >
> > Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> > Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> > List help: virtio-comment-help@lists.oasis-open.org
> > List archive: https://lists.oasis-open.org/archives/virtio-comment/
> > Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> > List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> > Committee: https://www.oasis-open.org/committees/virtio/
> > Join OASIS: https://www.oasis-open.org/join/
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-24  2:41                                 ` Jason Wang
@ 2021-08-24 13:10                                   ` Jason Gunthorpe
  2021-08-25  4:58                                     ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2021-08-24 13:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, Dr. David Alan Gilbert, virtio-comment,
	Michael S. Tsirkin, cohuck, Parav Pandit, Shahaf Shuler,
	Ariel Adam, Amnon Ilan, Bodong Wang, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Tue, Aug 24, 2021 at 10:41:54AM +0800, Jason Wang wrote:

> > migration exposed to the guest ? No.
> 
> Can you explain why?

For the SRIOV case migration is a privileged operation of the
hypervisor. The guest must not be allowed to interact with it in any
way otherwise the hypervisor migration could be attacked from the
guest and this has definite security implications.

In practice this means that nothing related to migration can be
located on the MMIO pages/queues/etc of the VF. The reasons for this
are a bit complicated and has to do with the limitations of IO
isolation with VFIO - eg you can't reliably split a single PCI BDF
into hypervisor/guest security domains without PASID.

We recently revisited this concept again with a HNS vfio driver. IIRC
Intel messed it up in their mdev driver too.

> > >>> Let's open another thread for this if you wish, it has nothing related
> > >>> to the spec but how it is implemented in Linux. If you search the
> > >>> archive, something similar to "vfio_virtio_pci" has been proposed
> > >>> several years before by Intel. The idea has been rejected, and we have
> > >>> leveraged Linux vDPA bus for virtio-pci devices.

That was largely because Intel was proposing to use mdevs to create an
entire VDPA subsystem hidden inside VFIO.

We've invested in a pure VFIO solution which should be merged soon:

https://lore.kernel.org/kvm/20210819161914.7ad2e80e.alex.williamson@redhat.com/

It does not rely on mdevs. It is not trying to recreate VDPA. Instead
the HW provides a fully functional virto VF and the solution uses
normal SRIOV approaches.

You can contrast this with the two virtio-net solutions mlx5 will
support:

- One is the existing hypervisor assisted VDPA solution where the mlx5
  driver does HW accelerated queue processing.

- The other one is a full PCI VF that provides a virtio-net function
  without any hypervisor assistance. In this case we will have a VFIO
  migration driver as above that to provide SRIOV VF live migration.

I see in this thread that these two things are becoming quite
confused. They are very different, have different security postures
and use different parts of the hypervisor stack, and intended for
quite different use cases.

> Your proposal works only for PCI with SR-IOV. And I want to leverage
> it to be useful for other platforms or transport. That's all my
> motivation.

I've read most of the emails here I still don't see what the use case
is for this beyond PCI SRIOV.

In a general sense it requires virtio to specify how PASID works. No
matter what we must create a split secure/guest world where DMAs from
each world are uniquely tagged. In the pure PCI world this means
either using PF/VF or VF/PASID.

In general PASID still has a long road to go before it is working in
Linux:

https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/

So, IMHO, it make sense to focus on the PF/VF definition for spec
purposes.

I agree it would be good spec design to have a general concept of a
secure and guest world and specific sections that defines how it works
for different scenarios, but that seems like a language remark and not
one about the design. For instance the admin queue Max is adding is
clearly part of the secure world and putting it on the PF is the only
option for the SRIOV mode.

Jason


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-24 13:10                                   ` Jason Gunthorpe
@ 2021-08-25  4:58                                     ` Jason Wang
  2021-08-25 18:13                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-25  4:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, Dr. David Alan Gilbert, virtio-comment,
	Michael S. Tsirkin, cohuck, Parav Pandit, Shahaf Shuler,
	Ariel Adam, Amnon Ilan, Bodong Wang, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Tue, Aug 24, 2021 at 9:10 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 24, 2021 at 10:41:54AM +0800, Jason Wang wrote:
>
> > > migration exposed to the guest ? No.
> >
> > Can you explain why?
>
> For the SRIOV case migration is a privileged operation of the
> hypervisor. The guest must not be allowed to interact with it in any
> way otherwise the hypervisor migration could be attacked from the
> guest and this has definite security implications.
>
> In practice this means that nothing related to migration can be
> located on the MMIO pages/queues/etc of the VF. The reasons for this
> are a bit complicated and has to do with the limitations of IO
> isolation with VFIO - eg you can't reliably split a single PCI BDF
> into hypervisor/guest security domains without PASID.

So exposing the migration function can be done indirectly:

In L0, the hardware implements the function via PF, Qemu will present
an emulated PCI device then Qemu can expose those functions via a
capability for L1 guests. When L1 driver tries to use those functions,
it goes:

L1 virtio-net driver -(emulated PCI-E BAR)-> Qemu -(ioctl)-> L0 kernel
VF driver -> L0 kernel PF driver -(virtio interface)-> virtio PF

In this approach, there's no way for the L1 driver to control the or
see what is implemented in the hardware (PF). The details were hidden
by Qemu. This works even if DMA is required for the L0 kernel PF
driver to talk with the hardware since for L1 we didn't present a DMA
interface. With the future PASID support, we can even present a DMA
interface to L1.

>
> We recently revisited this concept again with a HNS vfio driver. IIRC
> Intel messed it up in their mdev driver too.
>
> > > >>> Let's open another thread for this if you wish, it has nothing related
> > > >>> to the spec but how it is implemented in Linux. If you search the
> > > >>> archive, something similar to "vfio_virtio_pci" has been proposed
> > > >>> several years before by Intel. The idea has been rejected, and we have
> > > >>> leveraged Linux vDPA bus for virtio-pci devices.
>
> That was largely because Intel was proposing to use mdevs to create an
> entire VDPA subsystem hidden inside VFIO.
>
> We've invested in a pure VFIO solution which should be merged soon:
>
> https://lore.kernel.org/kvm/20210819161914.7ad2e80e.alex.williamson@redhat.com/
>
> It does not rely on mdevs. It is not trying to recreate VDPA. Instead
> the HW provides a fully functional virto VF and the solution uses
> normal SRIOV approaches.
>
> You can contrast this with the two virtio-net solutions mlx5 will
> support:
>
> - One is the existing hypervisor assisted VDPA solution where the mlx5
>   driver does HW accelerated queue processing.
>
> - The other one is a full PCI VF that provides a virtio-net function
>   without any hypervisor assistance. In this case we will have a VFIO
>   migration driver as above that to provide SRIOV VF live migration.

This part I understand.

>
> I see in this thread that these two things are becoming quite
> confused. They are very different, have different security postures
> and use different parts of the hypervisor stack, and intended for
> quite different use cases.

It looks like the full PCI VF could go via the virtio-pci vDPA driver
as well (drivers/vdpa/virtio-pci). So what's the advantages of
exposing the migration of virtio via vfio instead of vhost-vDPA? With
the vhost, we can have a lot of benefits:

1) migration compatibility with the existing software virtio and
vhost/vDPA implementations
2) presenting a virtio device instead of a virtio-pci device, this
makes it possible to be used by the case that guest doesn't need a PCI
at all  (firecracker or micro vm)
3) management infrastructure is almost ready (what Parav did)

>
> > Your proposal works only for PCI with SR-IOV. And I want to leverage
> > it to be useful for other platforms or transport. That's all my
> > motivation.
>
> I've read most of the emails here I still don't see what the use case
> is for this beyond PCI SRIOV.

So we have transports other than PCI. The basic functions for
migration is common:

- device freeze/stop
- device states
- dirty page tracking (not a must)

>
> In a general sense it requires virtio to specify how PASID works. No
> matter what we must create a split secure/guest world where DMAs from
> each world are uniquely tagged. In the pure PCI world this means
> either using PF/VF or VF/PASID.
>
> In general PASID still has a long road to go before it is working in
> Linux:
>
> https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/
>

Yes, I think we have agreed that it is something we want and vDPA will
support that for sure.

> So, IMHO, it make sense to focus on the PF/VF definition for spec
> purposes.

That's fine.

>
> I agree it would be good spec design to have a general concept of a
> secure and guest world and specific sections that defines how it works
> for different scenarios, but that seems like a language remark and not
> one about the design. For instance the admin queue Max is adding is
> clearly part of the secure world and putting it on the PF is the only
> option for the SRIOV mode.

Yes, but let's move common functionality that is required for all
transports to the chapter of "basic device facility". We don't need to
define how it works in other different scenarios now.

Thanks

>
> Jason
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-25  4:58                                     ` Jason Wang
@ 2021-08-25 18:13                                       ` Jason Gunthorpe
  2021-08-26  3:15                                         ` Jason Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Gunthorpe @ 2021-08-25 18:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, Dr. David Alan Gilbert, virtio-comment,
	Michael S. Tsirkin, cohuck, Parav Pandit, Shahaf Shuler,
	Ariel Adam, Amnon Ilan, Bodong Wang, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Wed, Aug 25, 2021 at 12:58:01PM +0800, Jason Wang wrote:
> On Tue, Aug 24, 2021 at 9:10 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Aug 24, 2021 at 10:41:54AM +0800, Jason Wang wrote:
> >
> > > > migration exposed to the guest ? No.
> > >
> > > Can you explain why?
> >
> > For the SRIOV case migration is a privileged operation of the
> > hypervisor. The guest must not be allowed to interact with it in any
> > way otherwise the hypervisor migration could be attacked from the
> > guest and this has definite security implications.
> >
> > In practice this means that nothing related to migration can be
> > located on the MMIO pages/queues/etc of the VF. The reasons for this
> > are a bit complicated and has to do with the limitations of IO
> > isolation with VFIO - eg you can't reliably split a single PCI BDF
> > into hypervisor/guest security domains without PASID.
> 
> So exposing the migration function can be done indirectly:
> 
> In L0, the hardware implements the function via PF, Qemu will present
> an emulated PCI device then Qemu can expose those functions via a
> capability for L1 guests. When L1 driver tries to use those functions,
> it goes:
> 
> L1 virtio-net driver -(emulated PCI-E BAR)-> Qemu -(ioctl)-> L0 kernel
> VF driver -> L0 kernel PF driver -(virtio interface)-> virtio PF
> 
> In this approach, there's no way for the L1 driver to control the or
> see what is implemented in the hardware (PF). The details were hidden
> by Qemu. This works even if DMA is required for the L0 kernel PF
> driver to talk with the hardware since for L1 we didn't present a DMA
> interface. With the future PASID support, we can even present a DMA
> interface to L1.

Sure, you can do this, but that isn't what is being talked about here,
and honestly seems like a highly contrived use case.

Further, in this mode I'd expect the hypervisor kernel driver to
provide the migration support without requiring any special HW
function.

> > I see in this thread that these two things are becoming quite
> > confused. They are very different, have different security postures
> > and use different parts of the hypervisor stack, and intended for
> > quite different use cases.
> 
> It looks like the full PCI VF could go via the virtio-pci vDPA driver
> as well (drivers/vdpa/virtio-pci). So what's the advantages of
> exposing the migration of virtio via vfio instead of vhost-vDPA? 

Can't say, both are possibly valid approaches with different trade
offs.

Off hand I think it is just unneeded complexity to use VDPA if the
device is already exposing a fully functional virtio-pci interface. I
see VDPA as being useful to create HW accelerated virtio interface
from HW that does not natively speak full virtio.

> 1) migration compatibility with the existing software virtio and
> vhost/vDPA implementations

IMHO the the virtio spec should define the format of the migration
state and I'd expect interworking between all the different
implementations.

> > I agree it would be good spec design to have a general concept of a
> > secure and guest world and specific sections that defines how it works
> > for different scenarios, but that seems like a language remark and not
> > one about the design. For instance the admin queue Max is adding is
> > clearly part of the secure world and putting it on the PF is the only
> > option for the SRIOV mode.
> 
> Yes, but let's move common functionality that is required for all
> transports to the chapter of "basic device facility". We don't need to
> define how it works in other different scenarios now.

It seems like a reasonable way to write the spec. I'd define a secure
admin queue and define how the ops on that queue work

Then seperately define how to instantiate the secure admin queue in
all the relevant scenarios.

Jason


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-25 18:13                                       ` Jason Gunthorpe
@ 2021-08-26  3:15                                         ` Jason Wang
  2021-08-26 12:27                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 33+ messages in thread
From: Jason Wang @ 2021-08-26  3:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Max Gurtovoy, Dr. David Alan Gilbert, virtio-comment,
	Michael S. Tsirkin, cohuck, Parav Pandit, Shahaf Shuler,
	Ariel Adam, Amnon Ilan, Bodong Wang, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Thu, Aug 26, 2021 at 2:13 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Aug 25, 2021 at 12:58:01PM +0800, Jason Wang wrote:
> > On Tue, Aug 24, 2021 at 9:10 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Aug 24, 2021 at 10:41:54AM +0800, Jason Wang wrote:
> > >
> > > > > migration exposed to the guest ? No.
> > > >
> > > > Can you explain why?
> > >
> > > For the SRIOV case migration is a privileged operation of the
> > > hypervisor. The guest must not be allowed to interact with it in any
> > > way otherwise the hypervisor migration could be attacked from the
> > > guest and this has definite security implications.
> > >
> > > In practice this means that nothing related to migration can be
> > > located on the MMIO pages/queues/etc of the VF. The reasons for this
> > > are a bit complicated and has to do with the limitations of IO
> > > isolation with VFIO - eg you can't reliably split a single PCI BDF
> > > into hypervisor/guest security domains without PASID.
> >
> > So exposing the migration function can be done indirectly:
> >
> > In L0, the hardware implements the function via PF, Qemu will present
> > an emulated PCI device then Qemu can expose those functions via a
> > capability for L1 guests. When L1 driver tries to use those functions,
> > it goes:
> >
> > L1 virtio-net driver -(emulated PCI-E BAR)-> Qemu -(ioctl)-> L0 kernel
> > VF driver -> L0 kernel PF driver -(virtio interface)-> virtio PF
> >
> > In this approach, there's no way for the L1 driver to control the or
> > see what is implemented in the hardware (PF). The details were hidden
> > by Qemu. This works even if DMA is required for the L0 kernel PF
> > driver to talk with the hardware since for L1 we didn't present a DMA
> > interface. With the future PASID support, we can even present a DMA
> > interface to L1.
>
> Sure, you can do this, but that isn't what is being talked about here,
> and honestly seems like a highly contrived use case.

It's basically how virtio-net / vhost is implemented so far in Qemu.
And if we want to do this sometime in the future, we need another
interface (e.g BAR or capability) in the spec for the emulated device
to allow the L1 to access those functions. That's another reason I
think we need to describe the migration in the chapter "basic device
facility". It eases the future extension of the spec.

>
> Further, in this mode I'd expect the hypervisor kernel driver to
> provide the migration support without requiring any special HW
> function.

For 'special HW function' do you mean PASID? If yes, I agree. But I
think we know that the PASID will be ready in the near future.

>
> > > I see in this thread that these two things are becoming quite
> > > confused. They are very different, have different security postures
> > > and use different parts of the hypervisor stack, and intended for
> > > quite different use cases.
> >
> > It looks like the full PCI VF could go via the virtio-pci vDPA driver
> > as well (drivers/vdpa/virtio-pci). So what's the advantages of
> > exposing the migration of virtio via vfio instead of vhost-vDPA?
>
> Can't say, both are possibly valid approaches with different trade
> offs.
>
> Off hand I think it is just unneeded complexity to use VDPA if the
> device is already exposing a fully functional virtio-pci interface. I
> see VDPA as being useful to create HW accelerated virtio interface
> from HW that does not natively speak full virtio.

I think it depends on how we view vDPA. If we treat vDPA as a vendor
specific control path and think the virtio spec is a "vendor" then
virtio can go within vDPA. For the complexity, it's true that we need
to build everything from scratch. But the virtio/vhost model has been
implemented in Qemu for more than 10 years, and the kernel has already
supported vhost-vDPA. So it's not a lot of engineering effort . Hiding
the hardware details via vhost may have broader use cases.
>
> > 1) migration compatibility with the existing software virtio and
> > vhost/vDPA implementations
>
> IMHO the the virtio spec should define the format of the migration
> state and I'd expect interworking between all the different
> implementations.

Yes, so assuming spec has defined the device state, the hypervisor
still can choose to convert them into another byte stream. Qemu has
already defined the migration stream format for the virtio-pci device,
and it works seamlessly with vhost(vDPA). For the vfio way, this means
it requires extra work in the Qemu (a dedicated migration module or
other) to make the migration work to convert the state to the existing
virtio-pci, and it needs to care about the migration compatibility
among different Qemu machine types and versions. And it needs to teach
the management to know that a migration between "-device vfio-pci" and
"-device virtio-net-pci" can work which is not easy.

>
> > > I agree it would be good spec design to have a general concept of a
> > > secure and guest world and specific sections that defines how it works
> > > for different scenarios, but that seems like a language remark and not
> > > one about the design. For instance the admin queue Max is adding is
> > > clearly part of the secure world and putting it on the PF is the only
> > > option for the SRIOV mode.
> >
> > Yes, but let's move common functionality that is required for all
> > transports to the chapter of "basic device facility". We don't need to
> > define how it works in other different scenarios now.
>
> It seems like a reasonable way to write the spec. I'd define a secure
> admin queue and define how the ops on that queue work
>

Yes.

> Then seperately define how to instantiate the secure admin queue in
> all the relevant scenarios.

I don't object to this. So just to clarify, what I meant is:

1) having one subsection in the "basic device facility" to describe
migration related functions: the dirty page tracking, device states.
2) having another subsection in the "basic device facility" to
describe the admin virtqueue and the ops for the migration functions
mentioned above

I think it doesn't conflict with what Max and you propose here. And it
eases the future extensions and makes sure the core migration facility
is stable.

Thanks

>
> Jason
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [virtio-comment] Live Migration of Virtio Virtual Function
  2021-08-26  3:15                                         ` Jason Wang
@ 2021-08-26 12:27                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 33+ messages in thread
From: Jason Gunthorpe @ 2021-08-26 12:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, Dr. David Alan Gilbert, virtio-comment,
	Michael S. Tsirkin, cohuck, Parav Pandit, Shahaf Shuler,
	Ariel Adam, Amnon Ilan, Bodong Wang, Stefan Hajnoczi,
	Eugenio Perez Martin, Liran Liss, Oren Duer

On Thu, Aug 26, 2021 at 11:15:25AM +0800, Jason Wang wrote:
> On Thu, Aug 26, 2021 at 2:13 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 12:58:01PM +0800, Jason Wang wrote:
> > > On Tue, Aug 24, 2021 at 9:10 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Tue, Aug 24, 2021 at 10:41:54AM +0800, Jason Wang wrote:
> > > >
> > > > > > migration exposed to the guest ? No.
> > > > >
> > > > > Can you explain why?
> > > >
> > > > For the SRIOV case migration is a privileged operation of the
> > > > hypervisor. The guest must not be allowed to interact with it in any
> > > > way otherwise the hypervisor migration could be attacked from the
> > > > guest and this has definite security implications.
> > > >
> > > > In practice this means that nothing related to migration can be
> > > > located on the MMIO pages/queues/etc of the VF. The reasons for this
> > > > are a bit complicated and has to do with the limitations of IO
> > > > isolation with VFIO - eg you can't reliably split a single PCI BDF
> > > > into hypervisor/guest security domains without PASID.
> > >
> > > So exposing the migration function can be done indirectly:
> > >
> > > In L0, the hardware implements the function via PF, Qemu will present
> > > an emulated PCI device then Qemu can expose those functions via a
> > > capability for L1 guests. When L1 driver tries to use those functions,
> > > it goes:
> > >
> > > L1 virtio-net driver -(emulated PCI-E BAR)-> Qemu -(ioctl)-> L0 kernel
> > > VF driver -> L0 kernel PF driver -(virtio interface)-> virtio PF
> > >
> > > In this approach, there's no way for the L1 driver to control the or
> > > see what is implemented in the hardware (PF). The details were hidden
> > > by Qemu. This works even if DMA is required for the L0 kernel PF
> > > driver to talk with the hardware since for L1 we didn't present a DMA
> > > interface. With the future PASID support, we can even present a DMA
> > > interface to L1.
> >
> > Sure, you can do this, but that isn't what is being talked about here,
> > and honestly seems like a highly contrived use case.
> 
> It's basically how virtio-net / vhost is implemented so far in Qemu.

Well, a "L1 no DMA interface" is completely not interesting for this
work. People that want a "no DMA" workflow can use the existing netdev
mechanisms and don't need HW assisted migration.

> And if we want to do this sometime in the future, we need another
> interface (e.g BAR or capability) in the spec for the emulated device
> to allow the L1 to access those functions. That's another reason I
> think we need to describe the migration in the chapter "basic device
> facility". It eases the future extension of the spec.

The L1 has the same issue as the bare metal, the migration function is
secure and how the two security domains are exposed and interact with
the vIOMMU must be defined.

The L0/L1 scenario above doesn't change anything, you still cannot
expose the migration function in the bar or capability block of the
virtio function because it becomes bundled with the security domain of
the function and rendered useless for its purpose.

> > Further, in this mode I'd expect the hypervisor kernel driver to
> > provide the migration support without requiring any special HW
> > function.
> 
> For 'special HW function' do you mean PASID? If yes, I agree. But I
> think we know that the PASID will be ready in the near future.

I mean the HW support to execute virtio suspend/resume/dirty page
tracking. If you have no DMA and a SW layer in the middle the
hypervisor driver can just do this directly in SW.

> I think it depends on how we view vDPA. If we treat vDPA as a vendor
> specific control path and think the virtio spec is a "vendor" then
> virtio can go within vDPA.

It can, but why? The whole point of vDPA is to create a virtio
interface, if I already have a perfectly functional virtio interface
why would I want to wrapper more software around it just to get back
to where I started?

This can only create problems in the long run.

Jason


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2021-08-26 12:27 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-12 12:08 Live Migration of Virtio Virtual Function Max Gurtovoy
2021-08-17  8:51 ` [virtio-comment] " Jason Wang
2021-08-17  9:11   ` Max Gurtovoy
2021-08-17  9:44     ` Jason Wang
2021-08-18  9:15       ` Max Gurtovoy
2021-08-18 10:46         ` Jason Wang
2021-08-18 11:45           ` Max Gurtovoy
2021-08-19  2:44             ` Jason Wang
2021-08-19 14:58               ` Michael S. Tsirkin
2021-08-20  2:17                 ` Jason Wang
2021-08-20  7:03                   ` Michael S. Tsirkin
2021-08-20  7:49                     ` Jason Wang
2021-08-20 11:06                       ` Michael S. Tsirkin
2021-08-23  3:20                         ` Jason Wang
2021-08-23 12:08                   ` Dr. David Alan Gilbert
2021-08-24  3:00                     ` Jason Wang
2021-08-19 11:12             ` Dr. David Alan Gilbert
2021-08-19 14:16               ` Max Gurtovoy
2021-08-19 14:24                 ` Dr. David Alan Gilbert
2021-08-19 15:20                   ` Max Gurtovoy
2021-08-20  2:24                     ` Jason Wang
2021-08-20 10:26                       ` Max Gurtovoy
2021-08-20 11:16                         ` Jason Wang
2021-08-22 10:05                           ` Max Gurtovoy
2021-08-23  3:10                             ` Jason Wang
2021-08-23  8:55                               ` Max Gurtovoy
2021-08-24  2:41                                 ` Jason Wang
2021-08-24 13:10                                   ` Jason Gunthorpe
2021-08-25  4:58                                     ` Jason Wang
2021-08-25 18:13                                       ` Jason Gunthorpe
2021-08-26  3:15                                         ` Jason Wang
2021-08-26 12:27                                           ` Jason Gunthorpe
2021-08-23 12:18                     ` Dr. David Alan Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.