From: Kirti Wankhede <kwankhede@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Zhengxiao.zx@Alibaba-inc.com, kevin.tian@intel.com,
	yi.l.liu@intel.com, cjia@nvidia.com, eskultet@redhat.com,
	ziye.yang@intel.com, qemu-devel@nongnu.org, cohuck@redhat.com,
	shuangtai.tst@alibaba-inc.com, dgilbert@redhat.com,
	zhi.a.wang@intel.com, mlevitsk@redhat.com, pasic@linux.ibm.com,
	aik@ozlabs.ru, eauger@redhat.com, felipe@nutanix.com,
	jonathan.davies@nutanix.com, yan.y.zhao@intel.com,
	changpeng.liu@intel.com, Ken.Xue@amd.com
Subject: Re: [Qemu-devel] [PATCH v4 01/13] vfio: KABI for migration interface
Date: Tue, 25 Jun 2019 00:22:16 +0530
Message-ID: <c26273f7-0f05-860c-fb4d-5e8f641140a4@nvidia.com>
In-Reply-To: <20190624092535.4b66bac5@x1.home>



On 6/24/2019 8:55 PM, Alex Williamson wrote:
> On Mon, 24 Jun 2019 20:30:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/22/2019 3:31 AM, Alex Williamson wrote:
>>> On Sat, 22 Jun 2019 02:00:08 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:  
>>>> On 6/22/2019 1:30 AM, Alex Williamson wrote:  
>>>>> On Sat, 22 Jun 2019 01:05:48 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 6/21/2019 8:33 PM, Alex Williamson wrote:    
>>>>>>> On Fri, 21 Jun 2019 11:22:15 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> On 6/20/2019 10:48 PM, Alex Williamson wrote:      
>>>>>>>>> On Thu, 20 Jun 2019 20:07:29 +0530
>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>         
>>>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>>>     Bit 0 => _RUNNING
>>>>>>>>>>     Bit 1 => _SAVING
>>>>>>>>>>     Bit 2 => _RESUMING
>>>>>>>>>>     Combination of these bits defines VFIO device's state during migration:
>>>>>>>>>>     _STOPPED => All bits 0 indicates the VFIO device is stopped.
>>>>>>>>>>     _RUNNING => Normal VFIO device running state.
>>>>>>>>>>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but starts
>>>>>>>>>>                           saving device state, i.e. pre-copy state
>>>>>>>>>>     _SAVING  => vCPUs are stopped, VFIO device should be stopped and
>>>>>>>>>>                           save device state, i.e. stop-and-copy state
>>>>>>>>>>     _RESUMING => VFIO device resuming state.
>>>>>>>>>>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>>>>>>>>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>>>>>>>>   offset of migration region to get/set VFIO device related information.
>>>>>>>>>>   Defined members of structure and usage on read/write access:
>>>>>>>>>>     * device_state: (read/write)
>>>>>>>>>>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>>>>>>>>>>         as of now.
>>>>>>>>>>     * pending_bytes: (read only)
>>>>>>>>>>         To get the number of bytes yet to be migrated for the VFIO device.
>>>>>>>>>>     * data_offset: (read only)
>>>>>>>>>>         To get the offset in the migration region from which data can be read
>>>>>>>>>>         during _SAVING and to which data should be written by the user space
>>>>>>>>>>         application during _RESUMING state.
>>>>>>>>>>     * data_size: (read/write)
>>>>>>>>>>         To get and set size of data copied in migration region during _SAVING
>>>>>>>>>>         and _RESUMING state.
>>>>>>>>>>     * start_pfn, page_size, total_pfns: (write only)
>>>>>>>>>>         To request the bitmap of dirty pages from the vendor driver for
>>>>>>>>>>         total_pfns starting from the given start address.
>>>>>>>>>>     * copied_pfns: (read only)
>>>>>>>>>>         To get the number of pfns for which the bitmap was copied to the
>>>>>>>>>>         migration region. Vendor driver should copy the bitmap with bits
>>>>>>>>>>         set only for pages to be marked dirty in migration region. Vendor
>>>>>>>>>>         driver should return 0 if there are 0 pages dirty in the requested
>>>>>>>>>>         range. Vendor driver should return -1 to mark all pages in the
>>>>>>>>>>         section as dirty.
>>>>>>>>>>
>>>>>>>>>> Migration region looks like:
>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>> |vfio_device_migration_info|    data section                      |
>>>>>>>>>> |                          |     ///////////////////////////////  |
>>>>>>>>>>  ------------------------------------------------------------------
>>>>>>>>>>  ^                              ^                              ^
>>>>>>>>>>  offset 0-trapped part        data_offset                 data_size
>>>>>>>>>>
>>>>>>>>>> The data section always follows the vfio_device_migration_info
>>>>>>>>>> structure in the region, so data_offset will always be non-zero.
>>>>>>>>>> The offset from which data is copied is decided by the kernel
>>>>>>>>>> driver; the data section can be trapped or mmapped depending on
>>>>>>>>>> how the kernel driver defines it. If mmapped, data_offset should
>>>>>>>>>> be page aligned, whereas the initial section, which contains the
>>>>>>>>>> vfio_device_migration_info structure, might not end at a
>>>>>>>>>> page-aligned offset.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>>>>> ---
>>>>>>>>>>  linux-headers/linux/vfio.h | 71 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>  1 file changed, 71 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>>>>> index 24f505199f83..274ec477eb82 100644
>>>>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>>>>> @@ -372,6 +372,77 @@ struct vfio_region_gfx_edid {
>>>>>>>>>>   */
>>>>>>>>>>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>>>>>>>>>>  
>>>>>>>>>> +/* Migration region type and sub-type */
>>>>>>>>>> +#define VFIO_REGION_TYPE_MIGRATION	        (2)
>>>>>>>>>> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>>>>>>>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>>>>>>>>> + * information. Field accesses from this structure are only supported at
>>>>>>>>>> + * their native width and alignment; other accesses should return an error.
>>>>>>>>>> + *
>>>>>>>>>> + * device_state: (read/write)
>>>>>>>>>> + *      To indicate to the vendor driver the state the VFIO device should be
>>>>>>>>>> + *      transitioned to. If a device state transition fails, a write to this
>>>>>>>>>> + *      field returns an error. It consists of 3 bits:
>>>>>>>>>> + *      - If bit 0 is set, it indicates the _RUNNING state; when it is clear,
>>>>>>>>>> + *        that indicates the _STOPPED state. When the device is changed to
>>>>>>>>>> + *        _STOPPED, the driver should stop the device before the write returns.
>>>>>>>>>> + *      - If bit 1 is set, it indicates the _SAVING state.
>>>>>>>>>> + *      - If bit 2 is set, it indicates the _RESUMING state.
>>>>>>>>>> + *
>>>>>>>>>> + * pending_bytes: (read only)
>>>>>>>>>> + *      Read the number of bytes yet to be migrated from the vendor driver.
>>>>>>>>>> + *
>>>>>>>>>> + * data_offset: (read only)
>>>>>>>>>> + *      The user application should read data_offset to learn the offset in
>>>>>>>>>> + *      the migration region from which to read data during the _SAVING state
>>>>>>>>>> + *      or to which to write data during the _RESUMING state.
>>>>>>>>>> + *
>>>>>>>>>> + * data_size: (read/write)
>>>>>>>>>> + *      The user application should read data_size to learn how much data was
>>>>>>>>>> + *      copied into the migration region during the _SAVING state, and should
>>>>>>>>>> + *      write the size of the data it copied during the _RESUMING state.
>>>>>>>>>> + *
>>>>>>>>>> + * start_pfn: (write only)
>>>>>>>>>> + *      Start address pfn from which to get the bitmap of dirty pages from
>>>>>>>>>> + *      the vendor driver during the _SAVING state.
>>>>>>>>>> + *
>>>>>>>>>> + * page_size: (write only)
>>>>>>>>>> + *      The user application should write the page size of the pfns.
>>>>>>>>>> + *
>>>>>>>>>> + * total_pfns: (write only)
>>>>>>>>>> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
>>>>>>>>>> + *
>>>>>>>>>> + * copied_pfns: (read only)
>>>>>>>>>> + *      The pfn count for which the dirty bitmap has been copied to the
>>>>>>>>>> + *      migration region. The vendor driver should copy the bitmap with bits
>>>>>>>>>> + *      set only for pages to be marked dirty.
>>>>>>>>>> + *      Vendor driver should return 0 if there are 0 pages dirty in requested
>>>>>>>>>> + *      range.
>>>>>>>>>> + *      Vendor driver should return -1 to mark all pages in the section as
>>>>>>>>>> + *      dirty.        
>>>>>>>>>
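
For reference, the fields described in the comment above would map onto a
structure along the following lines. This is only an illustrative sketch:
the field types, ordering, and padding here are my assumptions, not the
committed header.

#include <linux/types.h>

struct vfio_device_migration_info {
        __u32 device_state;     /* read/write: _RUNNING/_SAVING/_RESUMING bits */
        __u32 reserved;         /* assumed padding for 64-bit alignment */
        __u64 pending_bytes;    /* read only: bytes yet to be migrated */
        __u64 data_offset;      /* read only: offset of the data section */
        __u64 data_size;        /* read/write: size of data in the data section */
        __u64 start_pfn;        /* write only: first pfn of a dirty bitmap request */
        __u64 page_size;        /* write only: page size of the pfns */
        __u64 total_pfns;       /* write only: number of pfns requested */
        __u64 copied_pfns;      /* read only: pfns whose bitmap is in the region */
};
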
>>>>>>>>> Is the protocol that the user writes start_pfn/page_size/total_pfns in
>>>>>>>>> any order and then the read of copied_pfns is what triggers the
>>>>>>>>> snapshot?        
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>      
>>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>> re-reading copied_pfns?        
>>>>>>>>
>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>>>>>> start_pfn + total_pfns).
>>>>>>>> Re-reading copied_pfns handles the case where the area the vendor
>>>>>>>> driver reserved for the bitmap is smaller than the total bitmap size
>>>>>>>> for the range (start_pfn to start_pfn + total_pfns); the user then has
>>>>>>>> to iterate till copied_pfns == total_pfns or till copied_pfns == 0
>>>>>>>> (that is, there are no pages dirty in the rest of the range)
>>>>>>>
>>>>>>> So reading copied_pfns triggers the data range to be updated, but the
>>>>>>> caller cannot assume it to be synchronous and uses total_pfns to poll
>>>>>>> that the update is complete?  How does the vendor driver differentiate
>>>>>>> the user polling for the previous update to finish versus requesting a
>>>>>>> new update?
>>>>>>>       
>>>>>>
>>>>>> A write to start_pfn/page_size/total_pfns followed by a read of
>>>>>> copied_pfns indicates a new update, whereas a sequential read of
>>>>>> copied_pfns indicates polling for the previous update.
>>>>>
>>>>> Hmm, this seems to contradict the answer to my question above where I
>>>>> ask if the write fields are sticky so a user can trigger a refresh via
>>>>> copied_pfns.    
>>>>
>>>> Sorry, how does that contradict? Pasting it again below:
>>>>>>>>>  Are start_pfn/page_size/total_pfns sticky such that a user
>>>>>>>>> can write them once and get repeated refreshes of the dirty bitmap by
>>>>>>>>> re-reading copied_pfns?    
>>>>>>>>
>>>>>>>> Yes, and that bitmap should be for the given range (from start_pfn till
>>>>>>>> start_pfn + total_pfns).
>>>>>>>> Re-reading copied_pfns handles the case where the area the vendor
>>>>>>>> driver reserved for the bitmap is smaller than the total bitmap size
>>>>>>>> for the range (start_pfn to start_pfn + total_pfns); the user then has
>>>>>>>> to iterate till copied_pfns == total_pfns or till copied_pfns == 0
>>>>>>>> (that is, there are no pages dirty in the rest of the range)
>>>
>>> Sorry, I guess I misinterpreted again.  So the vendor driver can return
>>> copied_pfns < total_pfns if it has a buffer limitation, not as an
>>> indication of its background progress in writing out the bitmap.  Just
>>> as a proof of concept, let's say the vendor driver has a 1 bit buffer
>>> and I write 0 to start_pfn and 3 to total_pfns.  I read copied_pfns,
>>> which returns 1, so I read data_offset to find where this 1 bit is
>>> located and then read my bit from that location.  This is the dirty
>>> state of the first pfn.  I read copied_pfns again and it reports 2,  
>>
>> It should report 1 to indicate it has data for one pfn.
>>
>>> I again read data_offset to find where the data is located, and it's my
>>> job to remember that I've already read 1 bit, so 2 means there's only 1
>>> bit available and it's the second pfn.  
>>
>> No.
>> Here 'I' means User application, right?
> 
> Yes
> 
>> The user application knows for how many pfns it has already received the
>> bitmap, i.e. see 'count' in function vfio_get_dirty_page_list().
>>
>> Here copied_pfns is the number of pfns for which the bitmap is available
>> in the buffer. The start address for that bitmap is then calculated by
>> the user application as:
>> ((start_pfn + count) * page_size)
>>
>> Then QEMU calls:
>>
>> cpu_physical_memory_set_dirty_lebitmap((unsigned long *)buf,
>>                                        (start_pfn + count) * page_size,
>>                                         copied_pfns);
>>
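
To make the flow concrete, the user-space side could be sketched as below.
This is a minimal sketch: mig_read_u64(), mig_write_u64() and
mig_read_bitmap() are hypothetical helpers for accessing the migration
region, and the FIELD_* selectors are placeholders; only
cpu_physical_memory_set_dirty_lebitmap() is an existing QEMU function.

/* Sketch of the dirty bitmap retrieval loop; helper names are hypothetical. */
static void get_dirty_pages(uint64_t start_pfn, uint64_t page_size,
                            uint64_t total_pfns)
{
    uint64_t count = 0;

    /* Sticky request covering (start_pfn, start_pfn + total_pfns). */
    mig_write_u64(FIELD_START_PFN, start_pfn);
    mig_write_u64(FIELD_PAGE_SIZE, page_size);
    mig_write_u64(FIELD_TOTAL_PFNS, total_pfns);

    while (count < total_pfns) {
        /* Reading copied_pfns triggers or continues the bitmap update. */
        uint64_t copied = mig_read_u64(FIELD_COPIED_PFNS);

        if (copied == 0) {
            break;              /* no dirty pages in the rest of the range */
        }

        /* The next chunk of the bitmap sits at data_offset in the region. */
        uint64_t data_offset = mig_read_u64(FIELD_DATA_OFFSET);
        unsigned long *buf = mig_read_bitmap(data_offset, copied);

        /* Mark the corresponding guest pages dirty in QEMU. */
        cpu_physical_memory_set_dirty_lebitmap(buf,
                                               (start_pfn + count) * page_size,
                                               copied);
        count += copied;
    }
}
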
>>>  I read the bit.  I again read
>>> copied_pfns, which now reports 3, I read data_offset to find the
>>> location of the data, I remember that I've already read 2 bits, so I
>>> read my bit into the 3rd pfn.  This seems rather clumsy.
>>>  
>>
>> Hope above explanation helps.
> 
> Still seems rather clumsy: which bit(s) are available in the buffer can
> only be known from foreknowledge of which bits have already been read.
> That seems error prone for both the user and the vendor driver to stay
> in sync.
> 
>>> Now that copied_pfns == total_pfns, what happens if I read copied_pfns
>>> again?  This is actually what I thought I was asking previously.
>>>   
>>
>> It should return 0.
> 
> Are we assuming no new pages have been dirtied?  What if pages have
> been dirtied?
> 
>>> Should we expose the pfn buffer size and fault on writes larger than that
>>> size, requiring the user to iterate start_pfn themselves?
>>
>> Who should fault, vendor driver or user application?
>>
>> Here the vendor driver is writing data to the data section.
>> In the steps in this patch set, the user application is incrementing
>> start_pfn by adding the copied_pfns count.
> 
> The user app is writing total_pfns to get a range, correct?  The vendor
> driver could return errno on that write if total_pfns exceeds the
> available buffer size.
> 

OK. If the vendor driver returns an error, will the user application then
retry with a smaller size?
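
For illustration, the user side of such a retry could be as simple as the
sketch below (assuming the same hypothetical mig_write_u64() helper, here
returning a negative errno on failure):

/* Sketch: halve the requested range until the vendor driver accepts it. */
uint64_t chunk = total_pfns;

while (chunk > 0 && mig_write_u64(FIELD_TOTAL_PFNS, chunk) < 0) {
    chunk /= 2;
}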

Thanks,
Kirti

