From mboxrd@z Thu Jan 1 00:00:00 1970 From: Pankaj Gupta Subject: [PATCH 0/3] kvm "fake DAX" device Date: Fri, 31 Aug 2018 19:00:15 +0530 Message-ID: <20180831133019.27579-1-pagupta@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: linux-nvdimm-bounces-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org Sender: "Linux-nvdimm" To: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, qemu-devel-qX2TKyscuCcdnm+yROfE0A@public.gmane.org, linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org Cc: kwolf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, jack-AlSwsSmVLrQ@public.gmane.org, xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, riel-ebMLmSuQjDVBDgjK7y7TUQ@public.gmane.org, niteshnarayanlal-PkbjNfxxIARBDgjK7y7TUQ@public.gmane.org, david-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, ross.zwisler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org, lcapitulino-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, stefanha-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, imammedo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, nilal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, eblake-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org List-Id: linux-nvdimm@lists.01.org This patch series has implementation for "fake DAX". "fake DAX" is fake persistent memory(nvdimm) in guest which allows to bypass the guest page cache. This also implements a VIRTIO based asynchronous flush mechanism. Sharing guest driver and qemu device changes in separate patch sets for easy review and it has been tested together. Details of project idea for 'fake DAX' flushing interface is shared [2] & [3]. Implementation is divided into two parts: New virtio pmem guest driver and qemu code changes for new virtio pmem paravirtualized device. 1. Guest virtio-pmem kernel driver --------------------------------- - Reads persistent memory range from paravirt device and registers with 'nvdimm_bus'. - 'nvdimm/pmem' driver uses this information to allocate persistent memory region and setup filesystem operations to the allocated memory. - virtio pmem driver implements asynchronous flushing interface to flush from guest to host. 2. Qemu virtio-pmem device --------------------------------- - Creates virtio pmem device and exposes a memory range to KVM guest. - At host side this is file backed memory which acts as persistent memory. - Qemu side flush uses aio thread pool API's and virtio for asynchronous guest multi request handling. David Hildenbrand CCed also posted a modified version[4] of qemu virtio-pmem code based on updated Qemu memory device API. Virtio-pmem errors handling: ---------------------------------------- Checked behaviour of virtio-pmem for below types of errors Need suggestions on expected behaviour for handling these errors? - Hardware Errors: Uncorrectable recoverable Errors: a] virtio-pmem: - As per current logic if error page belongs to Qemu process, host MCE handler isolates(hwpoison) that page and send SIGBUS. Qemu SIGBUS handler injects exception to KVM guest. - KVM guest then isolates the page and send SIGBUS to guest userspace process which has mapped the page. b] Existing implementation for ACPI pmem driver: - Handles such errors with MCE notifier and creates a list of bad blocks. Read/direct access DAX operation return EIO if accessed memory page fall in bad block list. - It also starts backgound scrubbing. - Similar functionality can be reused in virtio-pmem with MCE notifier but without scrubbing(no ACPI/ARS)? Need inputs to confirm if this behaviour is ok or needs any change? Changes from RFC v3: [1] - Rebase to latest upstream - Luiz - Call ndregion->flush in place of nvdimm_flush- Luiz - kmalloc return check - Luiz - virtqueue full handling - Stefan - Don't map entire virtio_pmem_req to device - Stefan - request leak,correct sizeof req- Stefan - Move declaration to virtio_pmem.c Changes from RFC v2: - Add flush function in the nd_region in place of switching on a flag - Dan & Stefan - Add flush completion function with proper locking and wait for host side flush completion - Stefan & Dan - Keep userspace API in uapi header file - Stefan, MST - Use LE fields & New device id - MST - Indentation & spacing suggestions - MST & Eric - Remove extra header files & add licensing - Stefan Changes from RFC v1: - Reuse existing 'pmem' code for registering persistent memory and other operations instead of creating an entirely new block driver. - Use VIRTIO driver to register memory information with nvdimm_bus and create region_type accordingly. - Call VIRTIO flush from existing pmem driver. Pankaj Gupta (3): nd: move nd_region to common header libnvdimm: nd_region flush callback support virtio-pmem: Add virtio-pmem guest driver [1] https://lkml.org/lkml/2018/7/13/102 [2] https://www.spinics.net/lists/kvm/msg149761.html [3] https://www.spinics.net/lists/kvm/msg153095.html [4] https://marc.info/?l=qemu-devel&m=153555721901824&w=2 drivers/acpi/nfit/core.c | 7 - drivers/nvdimm/claim.c | 3 drivers/nvdimm/nd.h | 39 ----- drivers/nvdimm/pmem.c | 12 + drivers/nvdimm/region_devs.c | 12 + drivers/virtio/Kconfig | 9 + drivers/virtio/Makefile | 1 drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++ include/linux/libnvdimm.h | 4 include/linux/nd.h | 40 ++++++ include/uapi/linux/virtio_ids.h | 1 include/uapi/linux/virtio_pmem.h | 40 ++++++ 12 files changed, 374 insertions(+), 49 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0D461C433F4 for ; Fri, 31 Aug 2018 13:30:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 997A820652 for ; Fri, 31 Aug 2018 13:30:41 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 997A820652 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728107AbeHaRiJ (ORCPT ); Fri, 31 Aug 2018 13:38:09 -0400 Received: from mx3-rdu2.redhat.com ([66.187.233.73]:60312 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727294AbeHaRiJ (ORCPT ); Fri, 31 Aug 2018 13:38:09 -0400 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 124C68190286; Fri, 31 Aug 2018 13:30:38 +0000 (UTC) Received: from dhcp201-121.englab.pnq.redhat.com (dhcp193-198.pnq.redhat.com [10.65.193.198]) by smtp.corp.redhat.com (Postfix) with ESMTP id D102539DDC; Fri, 31 Aug 2018 13:30:22 +0000 (UTC) From: Pankaj Gupta To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, qemu-devel@nongnu.org, linux-nvdimm@ml01.01.org Cc: jack@suse.cz, stefanha@redhat.com, dan.j.williams@intel.com, riel@surriel.com, nilal@redhat.com, kwolf@redhat.com, pbonzini@redhat.com, ross.zwisler@intel.com, david@redhat.com, xiaoguangrong.eric@gmail.com, hch@infradead.org, mst@redhat.com, niteshnarayanlal@hotmail.com, lcapitulino@redhat.com, imammedo@redhat.com, eblake@redhat.com, pagupta@redhat.com Subject: [PATCH 0/3] kvm "fake DAX" device Date: Fri, 31 Aug 2018 19:00:15 +0530 Message-Id: <20180831133019.27579-1-pagupta@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.11.54.5 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.8]); Fri, 31 Aug 2018 13:30:38 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.8]); Fri, 31 Aug 2018 13:30:38 +0000 (UTC) for IP:'10.11.54.5' DOMAIN:'int-mx05.intmail.prod.int.rdu2.redhat.com' HELO:'smtp.corp.redhat.com' FROM:'pagupta@redhat.com' RCPT:'' Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch series has implementation for "fake DAX". "fake DAX" is fake persistent memory(nvdimm) in guest which allows to bypass the guest page cache. This also implements a VIRTIO based asynchronous flush mechanism. Sharing guest driver and qemu device changes in separate patch sets for easy review and it has been tested together. Details of project idea for 'fake DAX' flushing interface is shared [2] & [3]. Implementation is divided into two parts: New virtio pmem guest driver and qemu code changes for new virtio pmem paravirtualized device. 1. Guest virtio-pmem kernel driver --------------------------------- - Reads persistent memory range from paravirt device and registers with 'nvdimm_bus'. - 'nvdimm/pmem' driver uses this information to allocate persistent memory region and setup filesystem operations to the allocated memory. - virtio pmem driver implements asynchronous flushing interface to flush from guest to host. 2. Qemu virtio-pmem device --------------------------------- - Creates virtio pmem device and exposes a memory range to KVM guest. - At host side this is file backed memory which acts as persistent memory. - Qemu side flush uses aio thread pool API's and virtio for asynchronous guest multi request handling. David Hildenbrand CCed also posted a modified version[4] of qemu virtio-pmem code based on updated Qemu memory device API. Virtio-pmem errors handling: ---------------------------------------- Checked behaviour of virtio-pmem for below types of errors Need suggestions on expected behaviour for handling these errors? - Hardware Errors: Uncorrectable recoverable Errors: a] virtio-pmem: - As per current logic if error page belongs to Qemu process, host MCE handler isolates(hwpoison) that page and send SIGBUS. Qemu SIGBUS handler injects exception to KVM guest. - KVM guest then isolates the page and send SIGBUS to guest userspace process which has mapped the page. b] Existing implementation for ACPI pmem driver: - Handles such errors with MCE notifier and creates a list of bad blocks. Read/direct access DAX operation return EIO if accessed memory page fall in bad block list. - It also starts backgound scrubbing. - Similar functionality can be reused in virtio-pmem with MCE notifier but without scrubbing(no ACPI/ARS)? Need inputs to confirm if this behaviour is ok or needs any change? Changes from RFC v3: [1] - Rebase to latest upstream - Luiz - Call ndregion->flush in place of nvdimm_flush- Luiz - kmalloc return check - Luiz - virtqueue full handling - Stefan - Don't map entire virtio_pmem_req to device - Stefan - request leak,correct sizeof req- Stefan - Move declaration to virtio_pmem.c Changes from RFC v2: - Add flush function in the nd_region in place of switching on a flag - Dan & Stefan - Add flush completion function with proper locking and wait for host side flush completion - Stefan & Dan - Keep userspace API in uapi header file - Stefan, MST - Use LE fields & New device id - MST - Indentation & spacing suggestions - MST & Eric - Remove extra header files & add licensing - Stefan Changes from RFC v1: - Reuse existing 'pmem' code for registering persistent memory and other operations instead of creating an entirely new block driver. - Use VIRTIO driver to register memory information with nvdimm_bus and create region_type accordingly. - Call VIRTIO flush from existing pmem driver. Pankaj Gupta (3): nd: move nd_region to common header libnvdimm: nd_region flush callback support virtio-pmem: Add virtio-pmem guest driver [1] https://lkml.org/lkml/2018/7/13/102 [2] https://www.spinics.net/lists/kvm/msg149761.html [3] https://www.spinics.net/lists/kvm/msg153095.html [4] https://marc.info/?l=qemu-devel&m=153555721901824&w=2 drivers/acpi/nfit/core.c | 7 - drivers/nvdimm/claim.c | 3 drivers/nvdimm/nd.h | 39 ----- drivers/nvdimm/pmem.c | 12 + drivers/nvdimm/region_devs.c | 12 + drivers/virtio/Kconfig | 9 + drivers/virtio/Makefile | 1 drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++ include/linux/libnvdimm.h | 4 include/linux/nd.h | 40 ++++++ include/uapi/linux/virtio_ids.h | 1 include/uapi/linux/virtio_pmem.h | 40 ++++++ 12 files changed, 374 insertions(+), 49 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:56935) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fvjae-0005g5-1q for qemu-devel@nongnu.org; Fri, 31 Aug 2018 09:35:57 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fvjVY-0005td-U9 for qemu-devel@nongnu.org; Fri, 31 Aug 2018 09:30:45 -0400 Received: from mx3-rdu2.redhat.com ([66.187.233.73]:48454 helo=mx1.redhat.com) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fvjVX-0005sg-L8 for qemu-devel@nongnu.org; Fri, 31 Aug 2018 09:30:40 -0400 From: Pankaj Gupta Date: Fri, 31 Aug 2018 19:00:15 +0530 Message-Id: <20180831133019.27579-1-pagupta@redhat.com> Subject: [Qemu-devel] [PATCH 0/3] kvm "fake DAX" device List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, qemu-devel@nongnu.org, linux-nvdimm@ml01.01.org Cc: jack@suse.cz, stefanha@redhat.com, dan.j.williams@intel.com, riel@surriel.com, nilal@redhat.com, kwolf@redhat.com, pbonzini@redhat.com, ross.zwisler@intel.com, david@redhat.com, xiaoguangrong.eric@gmail.com, hch@infradead.org, mst@redhat.com, niteshnarayanlal@hotmail.com, lcapitulino@redhat.com, imammedo@redhat.com, eblake@redhat.com, pagupta@redhat.com This patch series has implementation for "fake DAX". "fake DAX" is fake persistent memory(nvdimm) in guest which allows to bypass the guest page cache. This also implements a VIRTIO based asynchronous flush mechanism. Sharing guest driver and qemu device changes in separate patch sets for easy review and it has been tested together. Details of project idea for 'fake DAX' flushing interface is shared [2] & [3]. Implementation is divided into two parts: New virtio pmem guest driver and qemu code changes for new virtio pmem paravirtualized device. 1. Guest virtio-pmem kernel driver --------------------------------- - Reads persistent memory range from paravirt device and registers with 'nvdimm_bus'. - 'nvdimm/pmem' driver uses this information to allocate persistent memory region and setup filesystem operations to the allocated memory. - virtio pmem driver implements asynchronous flushing interface to flush from guest to host. 2. Qemu virtio-pmem device --------------------------------- - Creates virtio pmem device and exposes a memory range to KVM guest. - At host side this is file backed memory which acts as persistent memory. - Qemu side flush uses aio thread pool API's and virtio for asynchronous guest multi request handling. David Hildenbrand CCed also posted a modified version[4] of qemu virtio-pmem code based on updated Qemu memory device API. Virtio-pmem errors handling: ---------------------------------------- Checked behaviour of virtio-pmem for below types of errors Need suggestions on expected behaviour for handling these errors? - Hardware Errors: Uncorrectable recoverable Errors: a] virtio-pmem: - As per current logic if error page belongs to Qemu process, host MCE handler isolates(hwpoison) that page and send SIGBUS. Qemu SIGBUS handler injects exception to KVM guest. - KVM guest then isolates the page and send SIGBUS to guest userspace process which has mapped the page. b] Existing implementation for ACPI pmem driver: - Handles such errors with MCE notifier and creates a list of bad blocks. Read/direct access DAX operation return EIO if accessed memory page fall in bad block list. - It also starts backgound scrubbing. - Similar functionality can be reused in virtio-pmem with MCE notifier but without scrubbing(no ACPI/ARS)? Need inputs to confirm if this behaviour is ok or needs any change? Changes from RFC v3: [1] - Rebase to latest upstream - Luiz - Call ndregion->flush in place of nvdimm_flush- Luiz - kmalloc return check - Luiz - virtqueue full handling - Stefan - Don't map entire virtio_pmem_req to device - Stefan - request leak,correct sizeof req- Stefan - Move declaration to virtio_pmem.c Changes from RFC v2: - Add flush function in the nd_region in place of switching on a flag - Dan & Stefan - Add flush completion function with proper locking and wait for host side flush completion - Stefan & Dan - Keep userspace API in uapi header file - Stefan, MST - Use LE fields & New device id - MST - Indentation & spacing suggestions - MST & Eric - Remove extra header files & add licensing - Stefan Changes from RFC v1: - Reuse existing 'pmem' code for registering persistent memory and other operations instead of creating an entirely new block driver. - Use VIRTIO driver to register memory information with nvdimm_bus and create region_type accordingly. - Call VIRTIO flush from existing pmem driver. Pankaj Gupta (3): nd: move nd_region to common header libnvdimm: nd_region flush callback support virtio-pmem: Add virtio-pmem guest driver [1] https://lkml.org/lkml/2018/7/13/102 [2] https://www.spinics.net/lists/kvm/msg149761.html [3] https://www.spinics.net/lists/kvm/msg153095.html [4] https://marc.info/?l=qemu-devel&m=153555721901824&w=2 drivers/acpi/nfit/core.c | 7 - drivers/nvdimm/claim.c | 3 drivers/nvdimm/nd.h | 39 ----- drivers/nvdimm/pmem.c | 12 + drivers/nvdimm/region_devs.c | 12 + drivers/virtio/Kconfig | 9 + drivers/virtio/Makefile | 1 drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++ include/linux/libnvdimm.h | 4 include/linux/nd.h | 40 ++++++ include/uapi/linux/virtio_ids.h | 1 include/uapi/linux/virtio_pmem.h | 40 ++++++ 12 files changed, 374 insertions(+), 49 deletions(-)