Date: Mon, 16 Dec 2019 15:44:06 -0700
From: Alex Williamson
To: Kirti Wankhede
Subject: Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
Message-ID: <20191216154406.023f912b@x1.home>
In-Reply-To: <1576527700-21805-2-git-send-email-kwankhede@nvidia.com>
References: <1576527700-21805-1-git-send-email-kwankhede@nvidia.com>
 <1576527700-21805-2-git-send-email-kwankhede@nvidia.com>
Organization: Red Hat

On Tue, 17 Dec 2019 01:51:36 +0530
Kirti Wankhede wrote:

> - Defined MIGRATION region type and sub-type.
>
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access.
>
> - Defined device states and added state transition details in the comment.
>
> - Added sequence to be followed while saving and resuming VFIO device state
>
> Signed-off-by: Kirti Wankhede
> Reviewed-by: Neo Jia
> ---
>  include/uapi/linux/vfio.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 180 insertions(+)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..a0817ba267c1 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX			(1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION		(3)
>
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>
> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION	(1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write on this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its clear, that indicates

s/its/it's/

> + *        _STOP state. When device is changed to _STOP, driver should stop
> + *        device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user space application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. User should perform
> + *      read-modify-write operation on this field.
> + *
> + *      +------- _RESUMING
> + *      |+------ _SAVING
> + *      ||+----- _RUNNING
> + *      |||
> + *      000b => Device Stopped, not saving or resuming
> + *      001b => Device running state, default state
> + *      010b => Stop Device & save device state, stop-and-copy state
> + *      011b => Device running and save device state, pre-copy state
> + *      100b => Device stopped and device state is resuming
> + *      101b => Invalid state

Eventually this would be intended for post-copy, if supported by the
device, right?

> + *      110b => Invalid state
> + *      111b => Invalid state
> + *
> + * State transitions:
> + *
> + *            _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
> + *              (100b)     (001b)      (011b)       (010b)       (000b)
> + * 0. Running or Default state
> + *                           |
> + *
> + * 1. Normal Shutdown

Optional, userspace is under no obligation.

> + *                           |------------------------------------->|
> + *
> + * 2. Save state or Suspend
> + *                           |------------------------->|---------->|
> + *
> + * 3. Save state during live migration
> + *                           |----------->|------------>|---------->|
> + *
> + * 4. Resuming
> + *                |<---------|
> + *
> + * 5. Resumed
> + *                |--------->|
> + *
> + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts.
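
BTW, since the user is required to do a read-modify-write on
device_state, it might be worth spelling that out with an example in
the header.  A minimal sketch of entering pre-copy, assuming a
hypothetical device_fd and a region_off for the migration region;
names and error handling are illustrative only:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>	/* VFIO_DEVICE_STATE_SAVING, added by this patch */

/* device_state sits at offset 0 of the migration region */
static int enter_precopy(int device_fd, off_t region_off)
{
        uint32_t state;

        /* read-modify-write preserves reserved bits 3 - 31 */
        if (pread(device_fd, &state, sizeof(state), region_off) != sizeof(state))
                return -1;

        state |= VFIO_DEVICE_STATE_SAVING;      /* _RUNNING stays set: 011b */

        /* per this proposal, the write fails if the transition fails */
        if (pwrite(device_fd, &state, sizeof(state), region_off) != sizeof(state))
                return -1;

        return 0;
}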
> + * 1. During normal VFIO application shutdown, vfio device state changes
> + *    from _RUNNING to _STOP.

We cannot impose this requirement on existing userspace.  Userspace
may perform this action, but they are not required to and the vendor
driver must not require it.

> + * 2. When VFIO application save state or suspend application, VFIO device
> + *    state transition is from _RUNNING to stop-and-copy state and then to
> + *    _STOP.
> + *    On state transition from _RUNNING to stop-and-copy, driver must
> + *    stop device, save device state and send it to application through
> + *    migration region.
> + *    On _RUNNING to stop-and-copy state transition failure, application should
> + *    set VFIO device state to _RUNNING.

A state transition failure means that the user's write to device_state
failed, so is it the user's responsibility to set the next state?  Why
is it necessarily _RUNNING vs _STOP?

> + * 3. In VFIO application live migration, state transition is from _RUNNING
> + *    to pre-copy to stop-and-copy to _STOP.
> + *    On state transition from _RUNNING to pre-copy, driver should start
> + *    gathering device state while application is still running and send device
> + *    state data to application through migration region.
> + *    On state transition from pre-copy to stop-and-copy, driver must stop
> + *    device, save device state and send it to application through migration
> + *    region.
> + *    On any failure during any of these state transition, VFIO device state
> + *    should be set to _RUNNING.

Same comment as above regarding next state on failure.  Also, it seems
like it's the vendor driver's discretion to actually provide data
during the pre-copy phase.  As we've defined it, the vendor driver
needs to participate in the migration region regardless, they might
just always report no pending_bytes until we enter stop-and-copy.

> + * 4. To start resuming phase, VFIO device state should be transitioned from
> + *    _RUNNING to _RESUMING state.
> + *    In _RESUMING state, driver should use received device state data through
> + *    migration region to resume device.
> + *    On failure during this state transition, application should set _RUNNING
> + *    state.

Same comment regarding setting next state after failure.

> + * 5. On providing saved device data to driver, appliation should change state
> + *    from _RESUMING to _RUNNING.
> + *    On failure to transition to _RUNNING state, VFIO application should reset
> + *    the device and set _RUNNING state so that device doesn't remain in unknown
> + *    or bad state. On reset, driver must reset device and device should be
> + *    available in default usable state.

Didn't we discuss that the reset ioctl should return the device to the
initial state, including the transition to _RUNNING?  Also, as above,
it's the user write that triggers the failure, this register is listed
as read-write, so what value does the vendor driver report for the
state when read after a transition failure?  Is it reported as
_RESUMING as it was prior to the attempted transition, or may the
invalid states be used by the vendor driver to indicate the device is
broken?

> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
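
Given the "native width and alignment" requirement above, maybe an
access example would help here too.  A sketch, with hypothetical
device_fd/region_off names (region_off being the migration region's
offset within the device fd, from VFIO_DEVICE_GET_REGION_INFO), and
struct vfio_device_migration_info as defined by this patch:

#include <stddef.h>	/* offsetof */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

static uint64_t read_data_offset(int device_fd, off_t region_off)
{
        uint64_t data_offset = 0;

        /* native-width read at the field's fixed offset in the region */
        pread(device_fd, &data_offset, sizeof(data_offset), region_off +
              offsetof(struct vfio_device_migration_info, data_offset));

        return data_offset;
}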
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + *  |vfio_device_migration_info|    data section                      |
> + *  |                          |     ///////////////////////////////  |
> + *  ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned. The user is not required to access via mmap regardless
> + * of the region mmap capabilities.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes, indicates start of new iteration to get device data.
> + *    If there was previous iteration, then this read operation indicates
> + *    previous iteration is done. If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates kernel driver to make data available through
> + *    data section. Kernel driver should return this read operation only after
> + *    data is available from (region + data_offset) to (region + data_offset +
> + *    data_size).
> + * c. read data_size, amount of data in bytes available through migration
> + *    region.
> + * d. read data of data_size bytes from (region + data_offset) from migration
> + *    region.
> + * e. process data.
> + * f. Loop through a to e.

It seems we always need to end an iteration by reading pending_bytes
to signal to the vendor driver to release resources, so should the end
of the loop be:

e. Read pending_bytes
f. Goto b. or optionally restart next iteration at a.

I think this is defined such that reading data_offset commits
resources and reading pending_bytes frees them, allowing userspace to
restart at reading pending_bytes with no side-effects.  Therefore
reading pending_bytes repeatedly is supported.  Is the same true for
data_offset and data_size?  It seems reasonable that the vendor driver
can simply return offset and size for the current buffer if the user
reads these more than once.

How is a protocol or device error signaled?  For example, we can have
a user error where they read data_size before data_offset.  Should the
vendor driver generate a fault reading data_size in this case?  We can
also have internal errors in the vendor driver, should the vendor
driver use a special errno or update device_state autonomously to
indicate such an error?

I believe it's also part of the intended protocol that the user can
transition from _SAVING|_RUNNING to _SAVING at any point, regardless
of pending_bytes.  This should be noted.
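
For the record, the iteration I'm proposing would look something like
the sketch below (hypothetical names, error handling elided, struct
vfio_device_migration_info as defined by this patch; not meant as the
definitive implementation):

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

#define FIELD_OFF(f)	offsetof(struct vfio_device_migration_info, f)

static void save_device_data(int device_fd, off_t region_off,
                             void (*process)(void *, uint64_t))
{
        uint64_t pending, data_offset, data_size;

        for (;;) {
                /* a. reading pending_bytes ends the previous iteration,
                 * letting the vendor driver release its resources */
                pread(device_fd, &pending, sizeof(pending),
                      region_off + FIELD_OFF(pending_bytes));
                if (!pending)
                        break;

                /* b. reading data_offset commits resources; the driver
                 * returns only once data is actually available */
                pread(device_fd, &data_offset, sizeof(data_offset),
                      region_off + FIELD_OFF(data_offset));

                /* c. read data_size */
                pread(device_fd, &data_size, sizeof(data_size),
                      region_off + FIELD_OFF(data_size));

                /* d. read the data (trapped access shown; an mmap'd
                 * data section would work equally well) */
                void *buf = malloc(data_size);
                pread(device_fd, buf, data_size, region_off + data_offset);

                /* e. process data, then loop back to a. */
                process(buf, data_size);
                free(buf);
        }
}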
> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset.

Whose data_size, the _SAVING end or the _RESUMING end?  I think this
is intended to be the transaction size from the _SAVING source, but it
could easily be misinterpreted as reading data_size on the _RESUMING
end.

> + * c. write data_size which indicates vendor driver that data is written in
> + *    staging buffer. Vendor driver should read this data from migration
> + *    region and resume device's state.

I think we also need to define the error protocol.  The user could
mis-order transactions or there could be an internal error in the
vendor driver or device.  Are all read(2)/write(2) operations
susceptible to defined errnos to signal this?  Is it reflected in
device_state?  What's the recovery protocol?

> + *
> + * For user application, data is opaque. User should write data in the same
> + * order as received.

Order and transaction size, i.e. each data_size chunk is indivisible
by the user.

> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP      (1 << 0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)

Huh?  We should probably just refer to it consistently, i.e. _RUNNING
and !_RUNNING, otherwise we have the incongruity that setting the
_STOP value is actually the opposite of the necessary logic value
(_STOP = 1 is _RUNNING, _STOP = 0 is !_RUNNING).

> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> +					    VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> +					    VFIO_DEVICE_STATE_RESUMING)

Gack, we fixed these in the last iteration!

> +	__u32 reserved;
> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
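
Back on the _RESUMING sequence, my reading is a per-chunk flow like
the sketch below, where chunk_size is the data_size recorded by the
_SAVING source for this transaction (hypothetical names, error
handling elided):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/vfio.h>

#define FIELD_OFF(f)	offsetof(struct vfio_device_migration_info, f)

static void resume_write_chunk(int device_fd, off_t region_off,
                               void *chunk, uint64_t chunk_size)
{
        uint64_t data_offset;

        /* a. the driver reports where to stage the data */
        pread(device_fd, &data_offset, sizeof(data_offset),
              region_off + FIELD_OFF(data_offset));

        /* b. write the chunk, indivisible, in the order received */
        pwrite(device_fd, chunk, chunk_size, region_off + data_offset);

        /* c. writing data_size signals the driver to consume the
         * staging buffer */
        pwrite(device_fd, &chunk_size, sizeof(chunk_size),
               region_off + FIELD_OFF(data_size));
}

Thanks,
Alex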