All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-04  8:15 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

If the device backend is not persistent memory for the nvdimm, there
is need for explicit IO flushes to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall to request for
an explicit flush from the guest when the backend is not pmem.
So, the approach here is to convey when the hcall flush is required
in a device tree property. The guest once it knows the device needs
explicit flushes, makes the hcall as and when required.

It was suggested to create a new device type to address the
explicit flush for such backends on PPC instead of extending the
generic nvdimm device with new property. So, the patch introduces
the spapr-nvdimm device. The new device inherits the nvdimm device
with the new bahviour such that if the backend has pmem=no, the
device tree property is set by default.

The below demonstration shows the map_sync behavior for non-pmem
backends.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
from spapr-nvdimm with pmem=off, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
Failed to mmap  with Operation not supported

First patch adds the realize/unrealize call backs to the generic device
for the new device's vmstate registration. The second patch implements
the hcall, adds the necessary vmstate properties to spapr machine structure
for carrying the hcall status during save-restore. The nature of the hcall
being asynchronus, the patch uses aio utilities to offload the flush. The
third patch introduces the spapr-nvdimm device, adds the device tree
property for the guest when spapr-nvdimm is used with pmem=no on the
backend. Also adds new property pmem-override(?, suggest if you have better
name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
on pmem backed devices.

The kernel changes to exploit this hcall is at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch

---
v6 - https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00322.html
Changes from v6:
      - Addressed commen from Daniel.
        Fixed a typo
        Fetch the memory backend FD in the flush_worker_cb(), updated hcall
        return values in the comments description)
      - Updated the signatures.

v5 - https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01741.html
Changes from v5:
      - Taken care of all comments from David
      - Moved the flush lists from spapr machine into the spapr-nvdimm device
        state structures. So, all corresponding data structures adjusted
	accordingly as required.
      - New property pmem-overrride is added to the spapr-nvdimm device. The
        hcall flushes are allowed when pmem-override is set for the device.
      - The flush for pmem backend devices are made to use pmem_persist().
      - The vmstate structures are also made part of device state instead of
        global spapr.
      - Passing the flush token to destination during migration, I think its
        better than finding, deriving it from the outstanding ones.

v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
Changes from v4:
      - Introduce spapr-nvdimm device with nvdimm device as the parent.
      - The new spapr-nvdimm has no new properties. As this is a new
        device and there is no migration related dependencies to be
        taken care of, the device behavior is made to set the device tree
        property and enable hcall when the device type spapr-nvdimm is
        used with pmem=off
      - Fixed commit messages
      - Added checks to ensure the backend is actualy file and not memory
      - Addressed things pointed out by Eric

v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
      - Fixed the forward declaration coding guideline violations in 1st patch.
      - Removed the code waiting for the flushes to complete during migration,
        instead restart the flush worker on destination qemu in post load.
      - Got rid of the randomization of the flush tokens, using simple
        counter.
      - Got rid of the redundant flush state lock, relying on the BQL now.
      - Handling the memory-backend-ram usage
      - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
	Added prevention code using 'writeback' on arm and x86_64.
      - Fixed all the miscellaneous comments.

v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoning flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellanious minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
      - Fixed a missed-out unlock
      - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
      nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
      spapr: nvdimm: Implement H_SCM_FLUSH hcall
      spapr: nvdimm: Introduce spapr-nvdimm device


 hw/mem/nvdimm.c               |  16 ++
 hw/mem/pc-dimm.c              |   5 +
 hw/ppc/spapr.c                |   2 +
 hw/ppc/spapr_nvdimm.c         | 394 ++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   2 +
 include/hw/mem/pc-dimm.h      |   1 +
 include/hw/ppc/spapr.h        |   4 +-
 include/hw/ppc/spapr_nvdimm.h |   1 +
 8 files changed, 424 insertions(+), 1 deletion(-)

--
Signature


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-04  8:15 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm

If the device backend is not persistent memory for the nvdimm, there
is need for explicit IO flushes to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall to request for
an explicit flush from the guest when the backend is not pmem.
So, the approach here is to convey when the hcall flush is required
in a device tree property. The guest once it knows the device needs
explicit flushes, makes the hcall as and when required.

It was suggested to create a new device type to address the
explicit flush for such backends on PPC instead of extending the
generic nvdimm device with new property. So, the patch introduces
the spapr-nvdimm device. The new device inherits the nvdimm device
with the new bahviour such that if the backend has pmem=no, the
device tree property is set by default.

The below demonstration shows the map_sync behavior for non-pmem
backends.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
from spapr-nvdimm with pmem=off, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
Failed to mmap  with Operation not supported

First patch adds the realize/unrealize call backs to the generic device
for the new device's vmstate registration. The second patch implements
the hcall, adds the necessary vmstate properties to spapr machine structure
for carrying the hcall status during save-restore. The nature of the hcall
being asynchronus, the patch uses aio utilities to offload the flush. The
third patch introduces the spapr-nvdimm device, adds the device tree
property for the guest when spapr-nvdimm is used with pmem=no on the
backend. Also adds new property pmem-override(?, suggest if you have better
name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
on pmem backed devices.

The kernel changes to exploit this hcall is at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch

---
v6 - https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00322.html
Changes from v6:
      - Addressed commen from Daniel.
        Fixed a typo
        Fetch the memory backend FD in the flush_worker_cb(), updated hcall
        return values in the comments description)
      - Updated the signatures.

v5 - https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01741.html
Changes from v5:
      - Taken care of all comments from David
      - Moved the flush lists from spapr machine into the spapr-nvdimm device
        state structures. So, all corresponding data structures adjusted
	accordingly as required.
      - New property pmem-overrride is added to the spapr-nvdimm device. The
        hcall flushes are allowed when pmem-override is set for the device.
      - The flush for pmem backend devices are made to use pmem_persist().
      - The vmstate structures are also made part of device state instead of
        global spapr.
      - Passing the flush token to destination during migration, I think its
        better than finding, deriving it from the outstanding ones.

v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
Changes from v4:
      - Introduce spapr-nvdimm device with nvdimm device as the parent.
      - The new spapr-nvdimm has no new properties. As this is a new
        device and there is no migration related dependencies to be
        taken care of, the device behavior is made to set the device tree
        property and enable hcall when the device type spapr-nvdimm is
        used with pmem=off
      - Fixed commit messages
      - Added checks to ensure the backend is actualy file and not memory
      - Addressed things pointed out by Eric

v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
      - Fixed the forward declaration coding guideline violations in 1st patch.
      - Removed the code waiting for the flushes to complete during migration,
        instead restart the flush worker on destination qemu in post load.
      - Got rid of the randomization of the flush tokens, using simple
        counter.
      - Got rid of the redundant flush state lock, relying on the BQL now.
      - Handling the memory-backend-ram usage
      - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
	Added prevention code using 'writeback' on arm and x86_64.
      - Fixed all the miscellaneous comments.

v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoning flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellanious minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
      - Fixed a missed-out unlock
      - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
      nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
      spapr: nvdimm: Implement H_SCM_FLUSH hcall
      spapr: nvdimm: Introduce spapr-nvdimm device


 hw/mem/nvdimm.c               |  16 ++
 hw/mem/pc-dimm.c              |   5 +
 hw/ppc/spapr.c                |   2 +
 hw/ppc/spapr_nvdimm.c         | 394 ++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   2 +
 include/hw/mem/pc-dimm.h      |   1 +
 include/hw/ppc/spapr.h        |   4 +-
 include/hw/ppc/spapr_nvdimm.h |   1 +
 8 files changed, 424 insertions(+), 1 deletion(-)

--
Signature



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-04  8:15 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

If the device backend is not persistent memory for the nvdimm, there
is need for explicit IO flushes to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall to request for
an explicit flush from the guest when the backend is not pmem.
So, the approach here is to convey when the hcall flush is required
in a device tree property. The guest once it knows the device needs
explicit flushes, makes the hcall as and when required.

It was suggested to create a new device type to address the
explicit flush for such backends on PPC instead of extending the
generic nvdimm device with new property. So, the patch introduces
the spapr-nvdimm device. The new device inherits the nvdimm device
with the new bahviour such that if the backend has pmem=no, the
device tree property is set by default.

The below demonstration shows the map_sync behavior for non-pmem
backends.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
from spapr-nvdimm with pmem=off, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
[root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
Failed to mmap  with Operation not supported

First patch adds the realize/unrealize call backs to the generic device
for the new device's vmstate registration. The second patch implements
the hcall, adds the necessary vmstate properties to spapr machine structure
for carrying the hcall status during save-restore. The nature of the hcall
being asynchronus, the patch uses aio utilities to offload the flush. The
third patch introduces the spapr-nvdimm device, adds the device tree
property for the guest when spapr-nvdimm is used with pmem=no on the
backend. Also adds new property pmem-override(?, suggest if you have better
name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
on pmem backed devices.

The kernel changes to exploit this hcall is at
https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch

---
v6 - https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00322.html
Changes from v6:
      - Addressed commen from Daniel.
        Fixed a typo
        Fetch the memory backend FD in the flush_worker_cb(), updated hcall
        return values in the comments description)
      - Updated the signatures.

v5 - https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01741.html
Changes from v5:
      - Taken care of all comments from David
      - Moved the flush lists from spapr machine into the spapr-nvdimm device
        state structures. So, all corresponding data structures adjusted
	accordingly as required.
      - New property pmem-overrride is added to the spapr-nvdimm device. The
        hcall flushes are allowed when pmem-override is set for the device.
      - The flush for pmem backend devices are made to use pmem_persist().
      - The vmstate structures are also made part of device state instead of
        global spapr.
      - Passing the flush token to destination during migration, I think its
        better than finding, deriving it from the outstanding ones.

v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
Changes from v4:
      - Introduce spapr-nvdimm device with nvdimm device as the parent.
      - The new spapr-nvdimm has no new properties. As this is a new
        device and there is no migration related dependencies to be
        taken care of, the device behavior is made to set the device tree
        property and enable hcall when the device type spapr-nvdimm is
        used with pmem=off
      - Fixed commit messages
      - Added checks to ensure the backend is actualy file and not memory
      - Addressed things pointed out by Eric

v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
Changes from v3:
      - Fixed the forward declaration coding guideline violations in 1st patch.
      - Removed the code waiting for the flushes to complete during migration,
        instead restart the flush worker on destination qemu in post load.
      - Got rid of the randomization of the flush tokens, using simple
        counter.
      - Got rid of the redundant flush state lock, relying on the BQL now.
      - Handling the memory-backend-ram usage
      - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
	Added prevention code using 'writeback' on arm and x86_64.
      - Fixed all the miscellaneous comments.

v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoning flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellanious minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
      - Fixed a missed-out unlock
      - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
      nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
      spapr: nvdimm: Implement H_SCM_FLUSH hcall
      spapr: nvdimm: Introduce spapr-nvdimm device


 hw/mem/nvdimm.c               |  16 ++
 hw/mem/pc-dimm.c              |   5 +
 hw/ppc/spapr.c                |   2 +
 hw/ppc/spapr_nvdimm.c         | 394 ++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   2 +
 include/hw/mem/pc-dimm.h      |   1 +
 include/hw/ppc/spapr.h        |   4 +-
 include/hw/ppc/spapr_nvdimm.h |   1 +
 8 files changed, 424 insertions(+), 1 deletion(-)

--
Signature

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v7 1/3] nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
  2022-02-04  8:15 ` Shivaprasad G Bhat
@ 2022-02-04  8:15   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

A new subclass inheriting NVDIMMDevice is going to be introduced in
subsequent patches. The new subclass uses the realize and unrealize
callbacks. Add them on NVDIMMClass to appropriately call them as part
of plug-unplug.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Acked-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/mem/nvdimm.c          |   16 ++++++++++++++++
 hw/mem/pc-dimm.c         |    5 +++++
 include/hw/mem/nvdimm.h  |    2 ++
 include/hw/mem/pc-dimm.h |    1 +
 4 files changed, 24 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..59959d5563 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -181,10 +181,25 @@ static MemoryRegion *nvdimm_md_get_memory_region(MemoryDeviceState *md,
 static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
 {
     NVDIMMDevice *nvdimm = NVDIMM(dimm);
+    NVDIMMClass *ndc = NVDIMM_GET_CLASS(nvdimm);
 
     if (!nvdimm->nvdimm_mr) {
         nvdimm_prepare_memory_region(nvdimm, errp);
     }
+
+    if (ndc->realize) {
+        ndc->realize(nvdimm, errp);
+    }
+}
+
+static void nvdimm_unrealize(PCDIMMDevice *dimm)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(dimm);
+    NVDIMMClass *ndc = NVDIMM_GET_CLASS(nvdimm);
+
+    if (ndc->unrealize) {
+        ndc->unrealize(nvdimm);
+    }
 }
 
 /*
@@ -240,6 +255,7 @@ static void nvdimm_class_init(ObjectClass *oc, void *data)
     DeviceClass *dc = DEVICE_CLASS(oc);
 
     ddc->realize = nvdimm_realize;
+    ddc->unrealize = nvdimm_unrealize;
     mdc->get_memory_region = nvdimm_md_get_memory_region;
     device_class_set_props(dc, nvdimm_properties);
 
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 48b913aba6..03bd0dd60e 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -216,6 +216,11 @@ static void pc_dimm_realize(DeviceState *dev, Error **errp)
 static void pc_dimm_unrealize(DeviceState *dev)
 {
     PCDIMMDevice *dimm = PC_DIMM(dev);
+    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
+
+    if (ddc->unrealize) {
+        ddc->unrealize(dimm);
+    }
 
     host_memory_backend_set_mapped(dimm->hostmem, false);
 }
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..cf8f59be44 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -103,6 +103,8 @@ struct NVDIMMClass {
     /* write @size bytes from @buf to NVDIMM label data at @offset. */
     void (*write_label_data)(NVDIMMDevice *nvdimm, const void *buf,
                              uint64_t size, uint64_t offset);
+    void (*realize)(NVDIMMDevice *nvdimm, Error **errp);
+    void (*unrealize)(NVDIMMDevice *nvdimm);
 };
 
 #define NVDIMM_DSM_MEM_FILE     "etc/acpi/nvdimm-mem"
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index 1473e6db62..322bebe555 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -63,6 +63,7 @@ struct PCDIMMDeviceClass {
 
     /* public */
     void (*realize)(PCDIMMDevice *dimm, Error **errp);
+    void (*unrealize)(PCDIMMDevice *dimm);
 };
 
 void pc_dimm_pre_plug(PCDIMMDevice *dimm, MachineState *machine,



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v7 1/3] nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
@ 2022-02-04  8:15   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm

A new subclass inheriting NVDIMMDevice is going to be introduced in
subsequent patches. The new subclass uses the realize and unrealize
callbacks. Add them on NVDIMMClass to appropriately call them as part
of plug-unplug.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Acked-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/mem/nvdimm.c          |   16 ++++++++++++++++
 hw/mem/pc-dimm.c         |    5 +++++
 include/hw/mem/nvdimm.h  |    2 ++
 include/hw/mem/pc-dimm.h |    1 +
 4 files changed, 24 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..59959d5563 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -181,10 +181,25 @@ static MemoryRegion *nvdimm_md_get_memory_region(MemoryDeviceState *md,
 static void nvdimm_realize(PCDIMMDevice *dimm, Error **errp)
 {
     NVDIMMDevice *nvdimm = NVDIMM(dimm);
+    NVDIMMClass *ndc = NVDIMM_GET_CLASS(nvdimm);
 
     if (!nvdimm->nvdimm_mr) {
         nvdimm_prepare_memory_region(nvdimm, errp);
     }
+
+    if (ndc->realize) {
+        ndc->realize(nvdimm, errp);
+    }
+}
+
+static void nvdimm_unrealize(PCDIMMDevice *dimm)
+{
+    NVDIMMDevice *nvdimm = NVDIMM(dimm);
+    NVDIMMClass *ndc = NVDIMM_GET_CLASS(nvdimm);
+
+    if (ndc->unrealize) {
+        ndc->unrealize(nvdimm);
+    }
 }
 
 /*
@@ -240,6 +255,7 @@ static void nvdimm_class_init(ObjectClass *oc, void *data)
     DeviceClass *dc = DEVICE_CLASS(oc);
 
     ddc->realize = nvdimm_realize;
+    ddc->unrealize = nvdimm_unrealize;
     mdc->get_memory_region = nvdimm_md_get_memory_region;
     device_class_set_props(dc, nvdimm_properties);
 
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 48b913aba6..03bd0dd60e 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -216,6 +216,11 @@ static void pc_dimm_realize(DeviceState *dev, Error **errp)
 static void pc_dimm_unrealize(DeviceState *dev)
 {
     PCDIMMDevice *dimm = PC_DIMM(dev);
+    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
+
+    if (ddc->unrealize) {
+        ddc->unrealize(dimm);
+    }
 
     host_memory_backend_set_mapped(dimm->hostmem, false);
 }
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..cf8f59be44 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -103,6 +103,8 @@ struct NVDIMMClass {
     /* write @size bytes from @buf to NVDIMM label data at @offset. */
     void (*write_label_data)(NVDIMMDevice *nvdimm, const void *buf,
                              uint64_t size, uint64_t offset);
+    void (*realize)(NVDIMMDevice *nvdimm, Error **errp);
+    void (*unrealize)(NVDIMMDevice *nvdimm);
 };
 
 #define NVDIMM_DSM_MEM_FILE     "etc/acpi/nvdimm-mem"
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index 1473e6db62..322bebe555 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -63,6 +63,7 @@ struct PCDIMMDeviceClass {
 
     /* public */
     void (*realize)(PCDIMMDevice *dimm, Error **errp);
+    void (*unrealize)(PCDIMMDevice *dimm);
 };
 
 void pc_dimm_pre_plug(PCDIMMDevice *dimm, MachineState *machine,




^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v7 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2022-02-04  8:15 ` Shivaprasad G Bhat
  (?)
@ 2022-02-04  8:15   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch. The
hcall is applicable only for new SPAPR specific device class which is
also introduced in this patch.

The hcall expects the semantics such that the flush to return with
H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer
time along with a continue_token. The hcall to be called again by providing
the continue_token to get the status. So, all fresh requests are put into
a 'pending' list and flush worker is submitted to the thread pool. The
thread pool completion callbacks move the requests to 'completed' list,
which are cleaned up after collecting the return status for the guest
in subsequent hcall from the guest.

The semantics makes it necessary to preserve the continue_tokens and
their return status across migrations. So, the completed flush states
are forwarded to the destination and the pending ones are restarted
at the destination in post_load. The necessary nvdimm flush specific
vmstate structures are also introduced in this patch which are to be
saved in the new SPAPR specific nvdimm device to be introduced in the
following patch.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    2 
 hw/ppc/spapr_nvdimm.c         |  260 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |    4 -
 include/hw/ppc/spapr_nvdimm.h |    1 
 4 files changed, 266 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 3d6ec309dd..9263985663 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1634,6 +1634,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes();
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 91de1052f2..ac44e00153 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
@@ -30,6 +31,9 @@
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
+#include "migration/vmstate.h"
+#include "qemu/pmem.h"
 
 /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
 /* SCM device is unable to persist memory contents */
@@ -47,6 +51,14 @@
 /* Have an explicit check for alignment */
 QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
 
+#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
+OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
+
+struct SPAPRNVDIMMClass {
+    /* private */
+    NVDIMMClass parent_class;
+};
+
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp)
 {
@@ -375,6 +387,253 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    uint32_t drcidx;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
+struct SpaprNVDIMMDevice {
+    NVDIMMDevice parent_obj;
+
+    uint64_t nvdimm_flush_token;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
+};
+
+static int flush_worker_cb(void *opaque)
+{
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
+    PCDIMMDevice *dimm = PC_DIMM(drc->dev);
+    HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);
+    int backend_fd = memory_region_get_fd(&backend->mr);
+
+    if (object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
+        MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
+        void *ptr = memory_region_get_ram_ptr(mr);
+        size_t size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
+                                               NULL);
+
+        /* flush pmem backend */
+        pmem_persist(ptr, size);
+    } else {
+        /* flush raw backing image */
+        if (qemu_fdatasync(backend_fd) < 0) {
+            error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                         strerror(errno));
+            return H_HARDWARE;
+        }
+    }
+
+    return H_SUCCESS;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(drc->dev);
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&s_nvdimm->completed_nvdimm_flush_states, state, node);
+}
+
+static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
+{
+    SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+    }
+
+    return 0;
+}
+
+static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
+     .name = "spapr_nvdimm_flush_state",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+const VMStateDescription vmstate_spapr_nvdimm_states = {
+    .name = "spapr_nvdimm_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .post_load = spapr_nvdimm_flush_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
+        VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_QLIST_V(pending_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Assign a token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
+                                                SpaprNVDIMMDevice *spapr_nvdimm)
+{
+    SpaprNVDIMMDeviceFlushState *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    spapr_nvdimm->nvdimm_flush_token++;
+    /* Token zero is presumed as no job pending. Assert on overflow to zero */
+    g_assert(spapr_nvdimm->nvdimm_flush_token != 0);
+
+    state->continue_token = spapr_nvdimm->nvdimm_flush_token;
+
+    QLIST_INSERT_HEAD(&spapr_nvdimm->pending_nvdimm_flush_states, state, node);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(void)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    GSList *list, *nvdimms;
+
+    /*
+     * Called on reset path, the main loop thread which calls
+     * the pending BHs has gotten out running in the reset path,
+     * finally reaching here. Other code path being guest
+     * h_client_architecture_support, thats early boot up.
+     */
+    nvdimms = nvdimm_get_device_list();
+    for (list = nvdimms; list; list = list->next) {
+        NVDIMMDevice *nvdimm = list->data;
+        if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
+            SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(nvdimm);
+            while (!QLIST_EMPTY(&s_nvdimm->pending_nvdimm_flush_states)) {
+                aio_poll(qemu_get_aio_context(), true);
+            }
+
+            QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
+                               node, next) {
+                QLIST_REMOVE(state, node);
+                g_free(state);
+            }
+        }
+    }
+    g_slist_free(nvdimms);
+}
+
+/*
+ * spapr_nvdimm_get_flush_status
+ *      Fetches the status of the hcall worker and returns
+ *      H_LONG_BUSY_ORDER_10_MSEC if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(SpaprNVDIMMDevice *s_nvdimm,
+                                         uint64_t token)
+{
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
+        if (state->continue_token == token) {
+            return H_LONG_BUSY_ORDER_10_MSEC;
+        }
+    }
+
+    QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
+                       node, node) {
+        if (state->continue_token == token) {
+            int ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            return ret;
+        }
+    }
+
+    /* If not found in complete list too, invalid token */
+    return H_P2;
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_Parameter, H_P2, H_LONG_BUSY_ORDER_10_MSEC,
+ *               H_UNSUPPORTED
+ *
+ * Given a DRC Index Flush the data to backend NVDIMM device. The hcall returns
+ * H_LONG_BUSY_ORDER_10_MSEC when the flush takes longer time and the hcall
+ * needs to be issued multiple times in order to be completely serviced. The
+ * continue-token from the output to be passed in the argument list of
+ * subsequent hcalls until the hcall is completely serviced at which point
+ * H_SUCCESS or other error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    int fd;
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    if (continue_token == 0) {
+        backend = MEMORY_BACKEND(dimm->hostmem);
+        fd = memory_region_get_fd(&backend->mr);
+
+        if (fd < 0) {
+            return H_UNSUPPORTED;
+        }
+
+        state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
+        if (!state) {
+            return H_HARDWARE;
+        }
+
+        state->drcidx = drc_index;
+
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+
+        continue_token = state->continue_token;
+    }
+
+    ret = spapr_nvdimm_get_flush_status(SPAPR_NVDIMM(dimm), continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -523,6 +782,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
     spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ee7504b976..727b2a0e7f 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -341,6 +341,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005
@@ -559,8 +560,9 @@ struct SpaprMachineState {
 #define H_SCM_UNBIND_ALL        0x3FC
 #define H_SCM_HEALTH            0x400
 #define H_RPT_INVALIDATE        0x448
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_RPT_INVALIDATE
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 764f999f54..e9436cb6ef 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -21,5 +21,6 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(void);
 
 #endif



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v7 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2022-02-04  8:15   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm

The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch. The
hcall is applicable only for new SPAPR specific device class which is
also introduced in this patch.

The hcall expects the semantics such that the flush to return with
H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer
time along with a continue_token. The hcall to be called again by providing
the continue_token to get the status. So, all fresh requests are put into
a 'pending' list and flush worker is submitted to the thread pool. The
thread pool completion callbacks move the requests to 'completed' list,
which are cleaned up after collecting the return status for the guest
in subsequent hcall from the guest.

The semantics makes it necessary to preserve the continue_tokens and
their return status across migrations. So, the completed flush states
are forwarded to the destination and the pending ones are restarted
at the destination in post_load. The necessary nvdimm flush specific
vmstate structures are also introduced in this patch which are to be
saved in the new SPAPR specific nvdimm device to be introduced in the
following patch.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    2 
 hw/ppc/spapr_nvdimm.c         |  260 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |    4 -
 include/hw/ppc/spapr_nvdimm.h |    1 
 4 files changed, 266 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 3d6ec309dd..9263985663 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1634,6 +1634,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes();
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 91de1052f2..ac44e00153 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
@@ -30,6 +31,9 @@
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
+#include "migration/vmstate.h"
+#include "qemu/pmem.h"
 
 /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
 /* SCM device is unable to persist memory contents */
@@ -47,6 +51,14 @@
 /* Have an explicit check for alignment */
 QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
 
+#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
+OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
+
+struct SPAPRNVDIMMClass {
+    /* private */
+    NVDIMMClass parent_class;
+};
+
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp)
 {
@@ -375,6 +387,253 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    uint32_t drcidx;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
+struct SpaprNVDIMMDevice {
+    NVDIMMDevice parent_obj;
+
+    uint64_t nvdimm_flush_token;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
+};
+
+static int flush_worker_cb(void *opaque)
+{
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
+    PCDIMMDevice *dimm = PC_DIMM(drc->dev);
+    HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);
+    int backend_fd = memory_region_get_fd(&backend->mr);
+
+    if (object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
+        MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
+        void *ptr = memory_region_get_ram_ptr(mr);
+        size_t size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
+                                               NULL);
+
+        /* flush pmem backend */
+        pmem_persist(ptr, size);
+    } else {
+        /* flush raw backing image */
+        if (qemu_fdatasync(backend_fd) < 0) {
+            error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                         strerror(errno));
+            return H_HARDWARE;
+        }
+    }
+
+    return H_SUCCESS;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(drc->dev);
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&s_nvdimm->completed_nvdimm_flush_states, state, node);
+}
+
+static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
+{
+    SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+    }
+
+    return 0;
+}
+
+static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
+     .name = "spapr_nvdimm_flush_state",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+const VMStateDescription vmstate_spapr_nvdimm_states = {
+    .name = "spapr_nvdimm_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .post_load = spapr_nvdimm_flush_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
+        VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_QLIST_V(pending_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Assign a token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
+                                                SpaprNVDIMMDevice *spapr_nvdimm)
+{
+    SpaprNVDIMMDeviceFlushState *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    spapr_nvdimm->nvdimm_flush_token++;
+    /* Token zero is presumed as no job pending. Assert on overflow to zero */
+    g_assert(spapr_nvdimm->nvdimm_flush_token != 0);
+
+    state->continue_token = spapr_nvdimm->nvdimm_flush_token;
+
+    QLIST_INSERT_HEAD(&spapr_nvdimm->pending_nvdimm_flush_states, state, node);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(void)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    GSList *list, *nvdimms;
+
+    /*
+     * Called on reset path, the main loop thread which calls
+     * the pending BHs has gotten out running in the reset path,
+     * finally reaching here. Other code path being guest
+     * h_client_architecture_support, thats early boot up.
+     */
+    nvdimms = nvdimm_get_device_list();
+    for (list = nvdimms; list; list = list->next) {
+        NVDIMMDevice *nvdimm = list->data;
+        if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
+            SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(nvdimm);
+            while (!QLIST_EMPTY(&s_nvdimm->pending_nvdimm_flush_states)) {
+                aio_poll(qemu_get_aio_context(), true);
+            }
+
+            QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
+                               node, next) {
+                QLIST_REMOVE(state, node);
+                g_free(state);
+            }
+        }
+    }
+    g_slist_free(nvdimms);
+}
+
+/*
+ * spapr_nvdimm_get_flush_status
+ *      Fetches the status of the hcall worker and returns
+ *      H_LONG_BUSY_ORDER_10_MSEC if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(SpaprNVDIMMDevice *s_nvdimm,
+                                         uint64_t token)
+{
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
+        if (state->continue_token == token) {
+            return H_LONG_BUSY_ORDER_10_MSEC;
+        }
+    }
+
+    QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
+                       node, node) {
+        if (state->continue_token == token) {
+            int ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            return ret;
+        }
+    }
+
+    /* If not found in complete list too, invalid token */
+    return H_P2;
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_Parameter, H_P2, H_LONG_BUSY_ORDER_10_MSEC,
+ *               H_UNSUPPORTED
+ *
+ * Given a DRC Index Flush the data to backend NVDIMM device. The hcall returns
+ * H_LONG_BUSY_ORDER_10_MSEC when the flush takes longer time and the hcall
+ * needs to be issued multiple times in order to be completely serviced. The
+ * continue-token from the output to be passed in the argument list of
+ * subsequent hcalls until the hcall is completely serviced at which point
+ * H_SUCCESS or other error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    int fd;
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    if (continue_token == 0) {
+        backend = MEMORY_BACKEND(dimm->hostmem);
+        fd = memory_region_get_fd(&backend->mr);
+
+        if (fd < 0) {
+            return H_UNSUPPORTED;
+        }
+
+        state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
+        if (!state) {
+            return H_HARDWARE;
+        }
+
+        state->drcidx = drc_index;
+
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+
+        continue_token = state->continue_token;
+    }
+
+    ret = spapr_nvdimm_get_flush_status(SPAPR_NVDIMM(dimm), continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -523,6 +782,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
     spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ee7504b976..727b2a0e7f 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -341,6 +341,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005
@@ -559,8 +560,9 @@ struct SpaprMachineState {
 #define H_SCM_UNBIND_ALL        0x3FC
 #define H_SCM_HEALTH            0x400
 #define H_RPT_INVALIDATE        0x448
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_RPT_INVALIDATE
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 764f999f54..e9436cb6ef 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -21,5 +21,6 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(void);
 
 #endif




^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v7 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2022-02-04  8:15   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:15 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch. The
hcall is applicable only for new SPAPR specific device class which is
also introduced in this patch.

The hcall expects the semantics such that the flush to return with
H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer
time along with a continue_token. The hcall to be called again by providing
the continue_token to get the status. So, all fresh requests are put into
a 'pending' list and flush worker is submitted to the thread pool. The
thread pool completion callbacks move the requests to 'completed' list,
which are cleaned up after collecting the return status for the guest
in subsequent hcall from the guest.

The semantics makes it necessary to preserve the continue_tokens and
their return status across migrations. So, the completed flush states
are forwarded to the destination and the pending ones are restarted
at the destination in post_load. The necessary nvdimm flush specific
vmstate structures are also introduced in this patch which are to be
saved in the new SPAPR specific nvdimm device to be introduced in the
following patch.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    2 
 hw/ppc/spapr_nvdimm.c         |  260 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |    4 -
 include/hw/ppc/spapr_nvdimm.h |    1 
 4 files changed, 266 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 3d6ec309dd..9263985663 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1634,6 +1634,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes();
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 91de1052f2..ac44e00153 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
@@ -30,6 +31,9 @@
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
+#include "migration/vmstate.h"
+#include "qemu/pmem.h"
 
 /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
 /* SCM device is unable to persist memory contents */
@@ -47,6 +51,14 @@
 /* Have an explicit check for alignment */
 QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
 
+#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
+OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
+
+struct SPAPRNVDIMMClass {
+    /* private */
+    NVDIMMClass parent_class;
+};
+
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp)
 {
@@ -375,6 +387,253 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    uint32_t drcidx;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
+struct SpaprNVDIMMDevice {
+    NVDIMMDevice parent_obj;
+
+    uint64_t nvdimm_flush_token;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
+};
+
+static int flush_worker_cb(void *opaque)
+{
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
+    PCDIMMDevice *dimm = PC_DIMM(drc->dev);
+    HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);
+    int backend_fd = memory_region_get_fd(&backend->mr);
+
+    if (object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
+        MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
+        void *ptr = memory_region_get_ram_ptr(mr);
+        size_t size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
+                                               NULL);
+
+        /* flush pmem backend */
+        pmem_persist(ptr, size);
+    } else {
+        /* flush raw backing image */
+        if (qemu_fdatasync(backend_fd) < 0) {
+            error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                         strerror(errno));
+            return H_HARDWARE;
+        }
+    }
+
+    return H_SUCCESS;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(drc->dev);
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&s_nvdimm->completed_nvdimm_flush_states, state, node);
+}
+
+static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
+{
+    SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+    }
+
+    return 0;
+}
+
+static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
+     .name = "spapr_nvdimm_flush_state",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+const VMStateDescription vmstate_spapr_nvdimm_states = {
+    .name = "spapr_nvdimm_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .post_load = spapr_nvdimm_flush_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
+        VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_QLIST_V(pending_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
+                        vmstate_spapr_nvdimm_flush_state,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Assign a token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
+                                                SpaprNVDIMMDevice *spapr_nvdimm)
+{
+    SpaprNVDIMMDeviceFlushState *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    spapr_nvdimm->nvdimm_flush_token++;
+    /* Token zero is presumed as no job pending. Assert on overflow to zero */
+    g_assert(spapr_nvdimm->nvdimm_flush_token != 0);
+
+    state->continue_token = spapr_nvdimm->nvdimm_flush_token;
+
+    QLIST_INSERT_HEAD(&spapr_nvdimm->pending_nvdimm_flush_states, state, node);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(void)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    GSList *list, *nvdimms;
+
+    /*
+     * Called on reset path, the main loop thread which calls
+     * the pending BHs has gotten out running in the reset path,
+     * finally reaching here. Other code path being guest
+     * h_client_architecture_support, thats early boot up.
+     */
+    nvdimms = nvdimm_get_device_list();
+    for (list = nvdimms; list; list = list->next) {
+        NVDIMMDevice *nvdimm = list->data;
+        if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
+            SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(nvdimm);
+            while (!QLIST_EMPTY(&s_nvdimm->pending_nvdimm_flush_states)) {
+                aio_poll(qemu_get_aio_context(), true);
+            }
+
+            QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
+                               node, next) {
+                QLIST_REMOVE(state, node);
+                g_free(state);
+            }
+        }
+    }
+    g_slist_free(nvdimms);
+}
+
+/*
+ * spapr_nvdimm_get_flush_status
+ *      Fetches the status of the hcall worker and returns
+ *      H_LONG_BUSY_ORDER_10_MSEC if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(SpaprNVDIMMDevice *s_nvdimm,
+                                         uint64_t token)
+{
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
+        if (state->continue_token = token) {
+            return H_LONG_BUSY_ORDER_10_MSEC;
+        }
+    }
+
+    QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
+                       node, node) {
+        if (state->continue_token = token) {
+            int ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            return ret;
+        }
+    }
+
+    /* If not found in complete list too, invalid token */
+    return H_P2;
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_Parameter, H_P2, H_LONG_BUSY_ORDER_10_MSEC,
+ *               H_UNSUPPORTED
+ *
+ * Given a DRC Index Flush the data to backend NVDIMM device. The hcall returns
+ * H_LONG_BUSY_ORDER_10_MSEC when the flush takes longer time and the hcall
+ * needs to be issued multiple times in order to be completely serviced. The
+ * continue-token from the output to be passed in the argument list of
+ * subsequent hcalls until the hcall is completely serviced at which point
+ * H_SUCCESS or other error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    int fd;
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    if (continue_token = 0) {
+        backend = MEMORY_BACKEND(dimm->hostmem);
+        fd = memory_region_get_fd(&backend->mr);
+
+        if (fd < 0) {
+            return H_UNSUPPORTED;
+        }
+
+        state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
+        if (!state) {
+            return H_HARDWARE;
+        }
+
+        state->drcidx = drc_index;
+
+        thread_pool_submit_aio(pool, flush_worker_cb, state,
+                               spapr_nvdimm_flush_completion_cb, state);
+
+        continue_token = state->continue_token;
+    }
+
+    ret = spapr_nvdimm_get_flush_status(SPAPR_NVDIMM(dimm), continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -523,6 +782,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
     spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ee7504b976..727b2a0e7f 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -341,6 +341,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005
@@ -559,8 +560,9 @@ struct SpaprMachineState {
 #define H_SCM_UNBIND_ALL        0x3FC
 #define H_SCM_HEALTH            0x400
 #define H_RPT_INVALIDATE        0x448
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_RPT_INVALIDATE
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 764f999f54..e9436cb6ef 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -21,5 +21,6 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(void);
 
 #endif


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v7 3/3] spapr: nvdimm: Introduce spapr-nvdimm device
  2022-02-04  8:15 ` Shivaprasad G Bhat
  (?)
@ 2022-02-04  8:16   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:16 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

If the device backend is not persistent memory for the nvdimm, there is
need for explicit IO flushes on the backend to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall to request for
an explicit flush from the guest when the backend is not pmem. So, the
approach here is to convey when the hcall flush is required in a device
tree property. The guest once it knows the device backend is not pmem,
makes the hcall whenever flush is required.

To set the device tree property, a new PAPR specific device type inheriting
the nvdimm device is implemented. When the backend doesn't have pmem=on
the device tree property "ibm,hcall-flush-required" is set, and the guest
makes hcall H_SCM_FLUSH requesting for an explicit flush. The new device
has boolean property pmem-override which when "on" advertises the device
tree property even when pmem=on for the backend. The flush function
invokes the fdatasync or pmem_persist() based on the type of backend.

The vmstate structures are made part of the spapr-nvdimm device object.
The patch attempts to keep the migration compatibility between source and
destination while rejecting the incompatibles ones with failures.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_nvdimm.c |  132 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index ac44e00153..c4c97da5de 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -34,6 +34,7 @@
 #include "block/thread-pool.h"
 #include "migration/vmstate.h"
 #include "qemu/pmem.h"
+#include "hw/qdev-properties.h"
 
 /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
 /* SCM device is unable to persist memory contents */
@@ -57,6 +58,10 @@ OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
 struct SPAPRNVDIMMClass {
     /* private */
     NVDIMMClass parent_class;
+
+    /* public */
+    void (*realize)(NVDIMMDevice *dimm, Error **errp);
+    void (*unrealize)(NVDIMMDevice *dimm, Error **errp);
 };
 
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
@@ -64,6 +69,8 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
 {
     const MachineClass *mc = MACHINE_GET_CLASS(hotplug_dev);
     const MachineState *ms = MACHINE(hotplug_dev);
+    PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+    MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
     g_autofree char *uuidstr = NULL;
     QemuUUID uuid;
     int ret;
@@ -101,6 +108,14 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
         return false;
     }
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM) &&
+        (memory_region_get_fd(mr) < 0)) {
+        error_setg(errp, "spapr-nvdimm device requires the "
+                   "memdev %s to be of memory-backend-file type",
+                   object_get_canonical_path_component(OBJECT(dimm->hostmem)));
+        return false;
+    }
+
     return true;
 }
 
@@ -172,6 +187,20 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
+        bool is_pmem = false, pmem_override = false;
+        PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+        HostMemoryBackend *hostmem = dimm->hostmem;
+
+        is_pmem = object_property_get_bool(OBJECT(hostmem), "pmem", NULL);
+        pmem_override = object_property_get_bool(OBJECT(nvdimm),
+                                                 "pmem-override", NULL);
+        if (!is_pmem || pmem_override) {
+            _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                             NULL, 0));
+        }
+    }
+
     return child_offset;
 }
 
@@ -397,11 +426,21 @@ typedef struct SpaprNVDIMMDeviceFlushState {
 
 typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
 struct SpaprNVDIMMDevice {
+    /* private */
     NVDIMMDevice parent_obj;
 
+    bool hcall_flush_required;
     uint64_t nvdimm_flush_token;
     QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
     QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
+
+    /* public */
+
+    /*
+     * The 'on' value for this property forced the qemu to enable the hcall
+     * flush for the nvdimm device even if the backend is a pmem
+     */
+    bool pmem_override;
 };
 
 static int flush_worker_cb(void *opaque)
@@ -448,6 +487,24 @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
     SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
     SpaprNVDIMMDeviceFlushState *state;
     ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(s_nvdimm)->hostmem);
+    bool is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL);
+    bool pmem_override = object_property_get_bool(OBJECT(s_nvdimm),
+                                                  "pmem-override", NULL);
+    bool dest_hcall_flush_required = pmem_override || !is_pmem;
+
+    if (!s_nvdimm->hcall_flush_required && dest_hcall_flush_required) {
+        error_report("The file backend for the spapr-nvdimm device %s at "
+                     "source is a pmem, use pmem=on and pmem-override=off to "
+                     "continue.", DEVICE(s_nvdimm)->id);
+        return -EINVAL;
+    }
+    if (s_nvdimm->hcall_flush_required && !dest_hcall_flush_required) {
+        error_report("The guest expects hcall-flush support for the "
+                     "spapr-nvdimm device %s, use pmem_override=on to "
+                     "continue.", DEVICE(s_nvdimm)->id);
+        return -EINVAL;
+    }
 
     QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
         thread_pool_submit_aio(pool, flush_worker_cb, state,
@@ -475,6 +532,7 @@ const VMStateDescription vmstate_spapr_nvdimm_states = {
     .minimum_version_id = 1,
     .post_load = spapr_nvdimm_flush_post_load,
     .fields = (VMStateField[]) {
+        VMSTATE_BOOL(hcall_flush_required, SpaprNVDIMMDevice),
         VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
         VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
                         vmstate_spapr_nvdimm_flush_state,
@@ -605,7 +663,11 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
     }
 
     dimm = PC_DIMM(drc->dev);
+    if (!object_dynamic_cast(OBJECT(dimm), TYPE_SPAPR_NVDIMM)) {
+        return H_PARAMETER;
+    }
     if (continue_token == 0) {
+        bool is_pmem = false, pmem_override = false;
         backend = MEMORY_BACKEND(dimm->hostmem);
         fd = memory_region_get_fd(&backend->mr);
 
@@ -613,6 +675,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
             return H_UNSUPPORTED;
         }
 
+        is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL);
+        pmem_override = object_property_get_bool(OBJECT(dimm),
+                                                "pmem-override", NULL);
+        if (is_pmem && !pmem_override) {
+            return H_UNSUPPORTED;
+        }
+
         state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
         if (!state) {
             return H_HARDWARE;
@@ -786,3 +855,66 @@ static void spapr_scm_register_types(void)
 }
 
 type_init(spapr_scm_register_types)
+
+static void spapr_nvdimm_realize(NVDIMMDevice *dimm, Error **errp)
+{
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(dimm);
+    HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(dimm)->hostmem);
+    bool is_pmem = object_property_get_bool(OBJECT(backend),  "pmem", NULL);
+    bool pmem_override = object_property_get_bool(OBJECT(dimm), "pmem-override",
+                                             NULL);
+    if (!is_pmem || pmem_override) {
+        s_nvdimm->hcall_flush_required = true;
+    }
+
+    vmstate_register(NULL, VMSTATE_INSTANCE_ID_ANY,
+                     &vmstate_spapr_nvdimm_states, dimm);
+}
+
+static void spapr_nvdimm_unrealize(NVDIMMDevice *dimm)
+{
+    vmstate_unregister(NULL, &vmstate_spapr_nvdimm_states, dimm);
+}
+
+static Property spapr_nvdimm_properties[] = {
+#ifdef CONFIG_LIBPMEM
+    DEFINE_PROP_BOOL("pmem-override", SpaprNVDIMMDevice, pmem_override, false),
+#endif
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_nvdimm_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+    NVDIMMClass *nvc = NVDIMM_CLASS(oc);
+
+    nvc->realize = spapr_nvdimm_realize;
+    nvc->unrealize = spapr_nvdimm_unrealize;
+
+    device_class_set_props(dc, spapr_nvdimm_properties);
+}
+
+static void spapr_nvdimm_init(Object *obj)
+{
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(obj);
+
+    s_nvdimm->hcall_flush_required = false;
+    QLIST_INIT(&s_nvdimm->pending_nvdimm_flush_states);
+    QLIST_INIT(&s_nvdimm->completed_nvdimm_flush_states);
+}
+
+static TypeInfo spapr_nvdimm_info = {
+    .name          = TYPE_SPAPR_NVDIMM,
+    .parent        = TYPE_NVDIMM,
+    .class_init    = spapr_nvdimm_class_init,
+    .class_size    = sizeof(SPAPRNVDIMMClass),
+    .instance_size = sizeof(SpaprNVDIMMDevice),
+    .instance_init = spapr_nvdimm_init,
+};
+
+static void spapr_nvdimm_register_types(void)
+{
+    type_register_static(&spapr_nvdimm_info);
+}
+
+type_init(spapr_nvdimm_register_types)



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v7 3/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-04  8:16   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:16 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm

If the device backend is not persistent memory for the nvdimm, there is
need for explicit IO flushes on the backend to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall to request for
an explicit flush from the guest when the backend is not pmem. So, the
approach here is to convey when the hcall flush is required in a device
tree property. The guest once it knows the device backend is not pmem,
makes the hcall whenever flush is required.

To set the device tree property, a new PAPR specific device type inheriting
the nvdimm device is implemented. When the backend doesn't have pmem=on
the device tree property "ibm,hcall-flush-required" is set, and the guest
makes hcall H_SCM_FLUSH requesting for an explicit flush. The new device
has boolean property pmem-override which when "on" advertises the device
tree property even when pmem=on for the backend. The flush function
invokes the fdatasync or pmem_persist() based on the type of backend.

The vmstate structures are made part of the spapr-nvdimm device object.
The patch attempts to keep the migration compatibility between source and
destination while rejecting the incompatibles ones with failures.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_nvdimm.c |  132 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index ac44e00153..c4c97da5de 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -34,6 +34,7 @@
 #include "block/thread-pool.h"
 #include "migration/vmstate.h"
 #include "qemu/pmem.h"
+#include "hw/qdev-properties.h"
 
 /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
 /* SCM device is unable to persist memory contents */
@@ -57,6 +58,10 @@ OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
 struct SPAPRNVDIMMClass {
     /* private */
     NVDIMMClass parent_class;
+
+    /* public */
+    void (*realize)(NVDIMMDevice *dimm, Error **errp);
+    void (*unrealize)(NVDIMMDevice *dimm, Error **errp);
 };
 
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
@@ -64,6 +69,8 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
 {
     const MachineClass *mc = MACHINE_GET_CLASS(hotplug_dev);
     const MachineState *ms = MACHINE(hotplug_dev);
+    PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+    MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
     g_autofree char *uuidstr = NULL;
     QemuUUID uuid;
     int ret;
@@ -101,6 +108,14 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
         return false;
     }
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM) &&
+        (memory_region_get_fd(mr) < 0)) {
+        error_setg(errp, "spapr-nvdimm device requires the "
+                   "memdev %s to be of memory-backend-file type",
+                   object_get_canonical_path_component(OBJECT(dimm->hostmem)));
+        return false;
+    }
+
     return true;
 }
 
@@ -172,6 +187,20 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
+        bool is_pmem = false, pmem_override = false;
+        PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+        HostMemoryBackend *hostmem = dimm->hostmem;
+
+        is_pmem = object_property_get_bool(OBJECT(hostmem), "pmem", NULL);
+        pmem_override = object_property_get_bool(OBJECT(nvdimm),
+                                                 "pmem-override", NULL);
+        if (!is_pmem || pmem_override) {
+            _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                             NULL, 0));
+        }
+    }
+
     return child_offset;
 }
 
@@ -397,11 +426,21 @@ typedef struct SpaprNVDIMMDeviceFlushState {
 
 typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
 struct SpaprNVDIMMDevice {
+    /* private */
     NVDIMMDevice parent_obj;
 
+    bool hcall_flush_required;
     uint64_t nvdimm_flush_token;
     QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
     QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
+
+    /* public */
+
+    /*
+     * The 'on' value for this property forced the qemu to enable the hcall
+     * flush for the nvdimm device even if the backend is a pmem
+     */
+    bool pmem_override;
 };
 
 static int flush_worker_cb(void *opaque)
@@ -448,6 +487,24 @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
     SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
     SpaprNVDIMMDeviceFlushState *state;
     ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(s_nvdimm)->hostmem);
+    bool is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL);
+    bool pmem_override = object_property_get_bool(OBJECT(s_nvdimm),
+                                                  "pmem-override", NULL);
+    bool dest_hcall_flush_required = pmem_override || !is_pmem;
+
+    if (!s_nvdimm->hcall_flush_required && dest_hcall_flush_required) {
+        error_report("The file backend for the spapr-nvdimm device %s at "
+                     "source is a pmem, use pmem=on and pmem-override=off to "
+                     "continue.", DEVICE(s_nvdimm)->id);
+        return -EINVAL;
+    }
+    if (s_nvdimm->hcall_flush_required && !dest_hcall_flush_required) {
+        error_report("The guest expects hcall-flush support for the "
+                     "spapr-nvdimm device %s, use pmem_override=on to "
+                     "continue.", DEVICE(s_nvdimm)->id);
+        return -EINVAL;
+    }
 
     QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
         thread_pool_submit_aio(pool, flush_worker_cb, state,
@@ -475,6 +532,7 @@ const VMStateDescription vmstate_spapr_nvdimm_states = {
     .minimum_version_id = 1,
     .post_load = spapr_nvdimm_flush_post_load,
     .fields = (VMStateField[]) {
+        VMSTATE_BOOL(hcall_flush_required, SpaprNVDIMMDevice),
         VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
         VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
                         vmstate_spapr_nvdimm_flush_state,
@@ -605,7 +663,11 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
     }
 
     dimm = PC_DIMM(drc->dev);
+    if (!object_dynamic_cast(OBJECT(dimm), TYPE_SPAPR_NVDIMM)) {
+        return H_PARAMETER;
+    }
     if (continue_token == 0) {
+        bool is_pmem = false, pmem_override = false;
         backend = MEMORY_BACKEND(dimm->hostmem);
         fd = memory_region_get_fd(&backend->mr);
 
@@ -613,6 +675,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
             return H_UNSUPPORTED;
         }
 
+        is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL);
+        pmem_override = object_property_get_bool(OBJECT(dimm),
+                                                "pmem-override", NULL);
+        if (is_pmem && !pmem_override) {
+            return H_UNSUPPORTED;
+        }
+
         state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
         if (!state) {
             return H_HARDWARE;
@@ -786,3 +855,66 @@ static void spapr_scm_register_types(void)
 }
 
 type_init(spapr_scm_register_types)
+
+static void spapr_nvdimm_realize(NVDIMMDevice *dimm, Error **errp)
+{
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(dimm);
+    HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(dimm)->hostmem);
+    bool is_pmem = object_property_get_bool(OBJECT(backend),  "pmem", NULL);
+    bool pmem_override = object_property_get_bool(OBJECT(dimm), "pmem-override",
+                                             NULL);
+    if (!is_pmem || pmem_override) {
+        s_nvdimm->hcall_flush_required = true;
+    }
+
+    vmstate_register(NULL, VMSTATE_INSTANCE_ID_ANY,
+                     &vmstate_spapr_nvdimm_states, dimm);
+}
+
+static void spapr_nvdimm_unrealize(NVDIMMDevice *dimm)
+{
+    vmstate_unregister(NULL, &vmstate_spapr_nvdimm_states, dimm);
+}
+
+static Property spapr_nvdimm_properties[] = {
+#ifdef CONFIG_LIBPMEM
+    DEFINE_PROP_BOOL("pmem-override", SpaprNVDIMMDevice, pmem_override, false),
+#endif
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_nvdimm_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+    NVDIMMClass *nvc = NVDIMM_CLASS(oc);
+
+    nvc->realize = spapr_nvdimm_realize;
+    nvc->unrealize = spapr_nvdimm_unrealize;
+
+    device_class_set_props(dc, spapr_nvdimm_properties);
+}
+
+static void spapr_nvdimm_init(Object *obj)
+{
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(obj);
+
+    s_nvdimm->hcall_flush_required = false;
+    QLIST_INIT(&s_nvdimm->pending_nvdimm_flush_states);
+    QLIST_INIT(&s_nvdimm->completed_nvdimm_flush_states);
+}
+
+static TypeInfo spapr_nvdimm_info = {
+    .name          = TYPE_SPAPR_NVDIMM,
+    .parent        = TYPE_NVDIMM,
+    .class_init    = spapr_nvdimm_class_init,
+    .class_size    = sizeof(SPAPRNVDIMMClass),
+    .instance_size = sizeof(SpaprNVDIMMDevice),
+    .instance_init = spapr_nvdimm_init,
+};
+
+static void spapr_nvdimm_register_types(void)
+{
+    type_register_static(&spapr_nvdimm_info);
+}
+
+type_init(spapr_nvdimm_register_types)




^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v7 3/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-04  8:16   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 20+ messages in thread
From: Shivaprasad G Bhat @ 2022-02-04  8:16 UTC (permalink / raw)
  To: clg, mst, ani, danielhb413, david, groug, imammedo,
	xiaoguangrong.eric, david, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

If the device backend is not persistent memory for the nvdimm, there is
need for explicit IO flushes on the backend to ensure persistence.

On SPAPR, the issue is addressed by adding a new hcall to request for
an explicit flush from the guest when the backend is not pmem. So, the
approach here is to convey when the hcall flush is required in a device
tree property. The guest once it knows the device backend is not pmem,
makes the hcall whenever flush is required.

To set the device tree property, a new PAPR specific device type inheriting
the nvdimm device is implemented. When the backend doesn't have pmem=on
the device tree property "ibm,hcall-flush-required" is set, and the guest
makes hcall H_SCM_FLUSH requesting for an explicit flush. The new device
has boolean property pmem-override which when "on" advertises the device
tree property even when pmem=on for the backend. The flush function
invokes the fdatasync or pmem_persist() based on the type of backend.

The vmstate structures are made part of the spapr-nvdimm device object.
The patch attempts to keep the migration compatibility between source and
destination while rejecting the incompatibles ones with failures.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_nvdimm.c |  132 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index ac44e00153..c4c97da5de 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -34,6 +34,7 @@
 #include "block/thread-pool.h"
 #include "migration/vmstate.h"
 #include "qemu/pmem.h"
+#include "hw/qdev-properties.h"
 
 /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
 /* SCM device is unable to persist memory contents */
@@ -57,6 +58,10 @@ OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
 struct SPAPRNVDIMMClass {
     /* private */
     NVDIMMClass parent_class;
+
+    /* public */
+    void (*realize)(NVDIMMDevice *dimm, Error **errp);
+    void (*unrealize)(NVDIMMDevice *dimm, Error **errp);
 };
 
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
@@ -64,6 +69,8 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
 {
     const MachineClass *mc = MACHINE_GET_CLASS(hotplug_dev);
     const MachineState *ms = MACHINE(hotplug_dev);
+    PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+    MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
     g_autofree char *uuidstr = NULL;
     QemuUUID uuid;
     int ret;
@@ -101,6 +108,14 @@ bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
         return false;
     }
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM) &&
+        (memory_region_get_fd(mr) < 0)) {
+        error_setg(errp, "spapr-nvdimm device requires the "
+                   "memdev %s to be of memory-backend-file type",
+                   object_get_canonical_path_component(OBJECT(dimm->hostmem)));
+        return false;
+    }
+
     return true;
 }
 
@@ -172,6 +187,20 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
+        bool is_pmem = false, pmem_override = false;
+        PCDIMMDevice *dimm = PC_DIMM(nvdimm);
+        HostMemoryBackend *hostmem = dimm->hostmem;
+
+        is_pmem = object_property_get_bool(OBJECT(hostmem), "pmem", NULL);
+        pmem_override = object_property_get_bool(OBJECT(nvdimm),
+                                                 "pmem-override", NULL);
+        if (!is_pmem || pmem_override) {
+            _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                             NULL, 0));
+        }
+    }
+
     return child_offset;
 }
 
@@ -397,11 +426,21 @@ typedef struct SpaprNVDIMMDeviceFlushState {
 
 typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
 struct SpaprNVDIMMDevice {
+    /* private */
     NVDIMMDevice parent_obj;
 
+    bool hcall_flush_required;
     uint64_t nvdimm_flush_token;
     QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
     QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
+
+    /* public */
+
+    /*
+     * The 'on' value for this property forced the qemu to enable the hcall
+     * flush for the nvdimm device even if the backend is a pmem
+     */
+    bool pmem_override;
 };
 
 static int flush_worker_cb(void *opaque)
@@ -448,6 +487,24 @@ static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
     SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
     SpaprNVDIMMDeviceFlushState *state;
     ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+    HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(s_nvdimm)->hostmem);
+    bool is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL);
+    bool pmem_override = object_property_get_bool(OBJECT(s_nvdimm),
+                                                  "pmem-override", NULL);
+    bool dest_hcall_flush_required = pmem_override || !is_pmem;
+
+    if (!s_nvdimm->hcall_flush_required && dest_hcall_flush_required) {
+        error_report("The file backend for the spapr-nvdimm device %s at "
+                     "source is a pmem, use pmem=on and pmem-override=off to "
+                     "continue.", DEVICE(s_nvdimm)->id);
+        return -EINVAL;
+    }
+    if (s_nvdimm->hcall_flush_required && !dest_hcall_flush_required) {
+        error_report("The guest expects hcall-flush support for the "
+                     "spapr-nvdimm device %s, use pmem_override=on to "
+                     "continue.", DEVICE(s_nvdimm)->id);
+        return -EINVAL;
+    }
 
     QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
         thread_pool_submit_aio(pool, flush_worker_cb, state,
@@ -475,6 +532,7 @@ const VMStateDescription vmstate_spapr_nvdimm_states = {
     .minimum_version_id = 1,
     .post_load = spapr_nvdimm_flush_post_load,
     .fields = (VMStateField[]) {
+        VMSTATE_BOOL(hcall_flush_required, SpaprNVDIMMDevice),
         VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
         VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
                         vmstate_spapr_nvdimm_flush_state,
@@ -605,7 +663,11 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
     }
 
     dimm = PC_DIMM(drc->dev);
+    if (!object_dynamic_cast(OBJECT(dimm), TYPE_SPAPR_NVDIMM)) {
+        return H_PARAMETER;
+    }
     if (continue_token = 0) {
+        bool is_pmem = false, pmem_override = false;
         backend = MEMORY_BACKEND(dimm->hostmem);
         fd = memory_region_get_fd(&backend->mr);
 
@@ -613,6 +675,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
             return H_UNSUPPORTED;
         }
 
+        is_pmem = object_property_get_bool(OBJECT(backend), "pmem", NULL);
+        pmem_override = object_property_get_bool(OBJECT(dimm),
+                                                "pmem-override", NULL);
+        if (is_pmem && !pmem_override) {
+            return H_UNSUPPORTED;
+        }
+
         state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
         if (!state) {
             return H_HARDWARE;
@@ -786,3 +855,66 @@ static void spapr_scm_register_types(void)
 }
 
 type_init(spapr_scm_register_types)
+
+static void spapr_nvdimm_realize(NVDIMMDevice *dimm, Error **errp)
+{
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(dimm);
+    HostMemoryBackend *backend = MEMORY_BACKEND(PC_DIMM(dimm)->hostmem);
+    bool is_pmem = object_property_get_bool(OBJECT(backend),  "pmem", NULL);
+    bool pmem_override = object_property_get_bool(OBJECT(dimm), "pmem-override",
+                                             NULL);
+    if (!is_pmem || pmem_override) {
+        s_nvdimm->hcall_flush_required = true;
+    }
+
+    vmstate_register(NULL, VMSTATE_INSTANCE_ID_ANY,
+                     &vmstate_spapr_nvdimm_states, dimm);
+}
+
+static void spapr_nvdimm_unrealize(NVDIMMDevice *dimm)
+{
+    vmstate_unregister(NULL, &vmstate_spapr_nvdimm_states, dimm);
+}
+
+static Property spapr_nvdimm_properties[] = {
+#ifdef CONFIG_LIBPMEM
+    DEFINE_PROP_BOOL("pmem-override", SpaprNVDIMMDevice, pmem_override, false),
+#endif
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_nvdimm_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+    NVDIMMClass *nvc = NVDIMM_CLASS(oc);
+
+    nvc->realize = spapr_nvdimm_realize;
+    nvc->unrealize = spapr_nvdimm_unrealize;
+
+    device_class_set_props(dc, spapr_nvdimm_properties);
+}
+
+static void spapr_nvdimm_init(Object *obj)
+{
+    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(obj);
+
+    s_nvdimm->hcall_flush_required = false;
+    QLIST_INIT(&s_nvdimm->pending_nvdimm_flush_states);
+    QLIST_INIT(&s_nvdimm->completed_nvdimm_flush_states);
+}
+
+static TypeInfo spapr_nvdimm_info = {
+    .name          = TYPE_SPAPR_NVDIMM,
+    .parent        = TYPE_NVDIMM,
+    .class_init    = spapr_nvdimm_class_init,
+    .class_size    = sizeof(SPAPRNVDIMMClass),
+    .instance_size = sizeof(SpaprNVDIMMDevice),
+    .instance_init = spapr_nvdimm_init,
+};
+
+static void spapr_nvdimm_register_types(void)
+{
+    type_register_static(&spapr_nvdimm_info);
+}
+
+type_init(spapr_nvdimm_register_types)


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2022-02-04  8:15   ` Shivaprasad G Bhat
  (?)
@ 2022-02-07 11:48     ` Daniel Henrique Barboza
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2022-02-07 11:48 UTC (permalink / raw)
  To: Shivaprasad G Bhat, clg, mst, ani, david, groug, imammedo,
	xiaoguangrong.eric, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc



On 2/4/22 05:15, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for the nvdimm devices.
> To be available for exploitation by guest through the next patch. The
> hcall is applicable only for new SPAPR specific device class which is
> also introduced in this patch.
> 
> The hcall expects the semantics such that the flush to return with
> H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer
> time along with a continue_token. The hcall to be called again by providing
> the continue_token to get the status. So, all fresh requests are put into
> a 'pending' list and flush worker is submitted to the thread pool. The
> thread pool completion callbacks move the requests to 'completed' list,
> which are cleaned up after collecting the return status for the guest
> in subsequent hcall from the guest.
> 
> The semantics makes it necessary to preserve the continue_tokens and
> their return status across migrations. So, the completed flush states
> are forwarded to the destination and the pending ones are restarted
> at the destination in post_load. The necessary nvdimm flush specific
> vmstate structures are also introduced in this patch which are to be
> saved in the new SPAPR specific nvdimm device to be introduced in the
> following patch.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

>   hw/ppc/spapr.c                |    2
>   hw/ppc/spapr_nvdimm.c         |  260 +++++++++++++++++++++++++++++++++++++++++
>   include/hw/ppc/spapr.h        |    4 -
>   include/hw/ppc/spapr_nvdimm.h |    1
>   4 files changed, 266 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 3d6ec309dd..9263985663 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1634,6 +1634,8 @@ static void spapr_machine_reset(MachineState *machine)
>           spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>       }
>   
> +    spapr_nvdimm_finish_flushes();
> +
>       /* DRC reset may cause a device to be unplugged. This will cause troubles
>        * if this device is used by another device (eg, a running vhost backend
>        * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 91de1052f2..ac44e00153 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,6 +22,7 @@
>    * THE SOFTWARE.
>    */
>   #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>   #include "qapi/error.h"
>   #include "hw/ppc/spapr_drc.h"
>   #include "hw/ppc/spapr_nvdimm.h"
> @@ -30,6 +31,9 @@
>   #include "hw/ppc/fdt.h"
>   #include "qemu/range.h"
>   #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
> +#include "migration/vmstate.h"
> +#include "qemu/pmem.h"
>   
>   /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
>   /* SCM device is unable to persist memory contents */
> @@ -47,6 +51,14 @@
>   /* Have an explicit check for alignment */
>   QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
>   
> +#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
> +OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
> +
> +struct SPAPRNVDIMMClass {
> +    /* private */
> +    NVDIMMClass parent_class;
> +};
> +
>   bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                              uint64_t size, Error **errp)
>   {
> @@ -375,6 +387,253 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>       return H_SUCCESS;
>   }
>   
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    uint32_t drcidx;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
> +struct SpaprNVDIMMDevice {
> +    NVDIMMDevice parent_obj;
> +
> +    uint64_t nvdimm_flush_token;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
> +};
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
> +    PCDIMMDevice *dimm = PC_DIMM(drc->dev);
> +    HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);
> +    int backend_fd = memory_region_get_fd(&backend->mr);
> +
> +    if (object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
> +        MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
> +        void *ptr = memory_region_get_ram_ptr(mr);
> +        size_t size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
> +                                               NULL);
> +
> +        /* flush pmem backend */
> +        pmem_persist(ptr, size);
> +    } else {
> +        /* flush raw backing image */
> +        if (qemu_fdatasync(backend_fd) < 0) {
> +            error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                         strerror(errno));
> +            return H_HARDWARE;
> +        }
> +    }
> +
> +    return H_SUCCESS;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
> +    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(drc->dev);
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&s_nvdimm->completed_nvdimm_flush_states, state, node);
> +}
> +
> +static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
> +{
> +    SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> +    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
> +     .name = "spapr_nvdimm_flush_state",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +const VMStateDescription vmstate_spapr_nvdimm_states = {
> +    .name = "spapr_nvdimm_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .post_load = spapr_nvdimm_flush_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
> +        VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_QLIST_V(pending_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Assign a token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
> +                                                SpaprNVDIMMDevice *spapr_nvdimm)
> +{
> +    SpaprNVDIMMDeviceFlushState *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    spapr_nvdimm->nvdimm_flush_token++;
> +    /* Token zero is presumed as no job pending. Assert on overflow to zero */
> +    g_assert(spapr_nvdimm->nvdimm_flush_token != 0);
> +
> +    state->continue_token = spapr_nvdimm->nvdimm_flush_token;
> +
> +    QLIST_INSERT_HEAD(&spapr_nvdimm->pending_nvdimm_flush_states, state, node);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(void)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    GSList *list, *nvdimms;
> +
> +    /*
> +     * Called on reset path, the main loop thread which calls
> +     * the pending BHs has gotten out running in the reset path,
> +     * finally reaching here. Other code path being guest
> +     * h_client_architecture_support, thats early boot up.
> +     */
> +    nvdimms = nvdimm_get_device_list();
> +    for (list = nvdimms; list; list = list->next) {
> +        NVDIMMDevice *nvdimm = list->data;
> +        if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
> +            SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(nvdimm);
> +            while (!QLIST_EMPTY(&s_nvdimm->pending_nvdimm_flush_states)) {
> +                aio_poll(qemu_get_aio_context(), true);
> +            }
> +
> +            QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
> +                               node, next) {
> +                QLIST_REMOVE(state, node);
> +                g_free(state);
> +            }
> +        }
> +    }
> +    g_slist_free(nvdimms);
> +}
> +
> +/*
> + * spapr_nvdimm_get_flush_status
> + *      Fetches the status of the hcall worker and returns
> + *      H_LONG_BUSY_ORDER_10_MSEC if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(SpaprNVDIMMDevice *s_nvdimm,
> +                                         uint64_t token)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
> +        if (state->continue_token == token) {
> +            return H_LONG_BUSY_ORDER_10_MSEC;
> +        }
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
> +                       node, node) {
> +        if (state->continue_token == token) {
> +            int ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);
> +            return ret;
> +        }
> +    }
> +
> +    /* If not found in complete list too, invalid token */
> +    return H_P2;
> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_LONG_BUSY_ORDER_10_MSEC,
> + *               H_UNSUPPORTED
> + *
> + * Given a DRC Index Flush the data to backend NVDIMM device. The hcall returns
> + * H_LONG_BUSY_ORDER_10_MSEC when the flush takes longer time and the hcall
> + * needs to be issued multiple times in order to be completely serviced. The
> + * continue-token from the output to be passed in the argument list of
> + * subsequent hcalls until the hcall is completely serviced at which point
> + * H_SUCCESS or other error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +    int fd;
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    if (continue_token == 0) {
> +        backend = MEMORY_BACKEND(dimm->hostmem);
> +        fd = memory_region_get_fd(&backend->mr);
> +
> +        if (fd < 0) {
> +            return H_UNSUPPORTED;
> +        }
> +
> +        state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
> +        if (!state) {
> +            return H_HARDWARE;
> +        }
> +
> +        state->drcidx = drc_index;
> +
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +
> +        continue_token = state->continue_token;
> +    }
> +
> +    ret = spapr_nvdimm_get_flush_status(SPAPR_NVDIMM(dimm), continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = continue_token;
> +    }
> +
> +    return ret;
> +}
> +
>   static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                        target_ulong opcode, target_ulong *args)
>   {
> @@ -523,6 +782,7 @@ static void spapr_scm_register_types(void)
>       spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>       spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
>       spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>   }
>   
>   type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index ee7504b976..727b2a0e7f 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -341,6 +341,7 @@ struct SpaprMachineState {
>   #define H_P7              -60
>   #define H_P8              -61
>   #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>   #define H_OVERLAP         -68
>   #define H_UNSUPPORTED_FLAG -256
>   #define H_MULTI_THREADS_ACTIVE -9005
> @@ -559,8 +560,9 @@ struct SpaprMachineState {
>   #define H_SCM_UNBIND_ALL        0x3FC
>   #define H_SCM_HEALTH            0x400
>   #define H_RPT_INVALIDATE        0x448
> +#define H_SCM_FLUSH             0x44C
>   
> -#define MAX_HCALL_OPCODE        H_RPT_INVALIDATE
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>   
>   /* The hcalls above are standardized in PAPR and implemented by pHyp
>    * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 764f999f54..e9436cb6ef 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -21,5 +21,6 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
>   bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                              uint64_t size, Error **errp);
>   void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(void);
>   
>   #endif
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2022-02-07 11:48     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2022-02-07 11:48 UTC (permalink / raw)
  To: Shivaprasad G Bhat, clg, mst, ani, david, groug, imammedo,
	xiaoguangrong.eric, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm



On 2/4/22 05:15, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for the nvdimm devices.
> To be available for exploitation by guest through the next patch. The
> hcall is applicable only for new SPAPR specific device class which is
> also introduced in this patch.
> 
> The hcall expects the semantics such that the flush to return with
> H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer
> time along with a continue_token. The hcall to be called again by providing
> the continue_token to get the status. So, all fresh requests are put into
> a 'pending' list and flush worker is submitted to the thread pool. The
> thread pool completion callbacks move the requests to 'completed' list,
> which are cleaned up after collecting the return status for the guest
> in subsequent hcall from the guest.
> 
> The semantics makes it necessary to preserve the continue_tokens and
> their return status across migrations. So, the completed flush states
> are forwarded to the destination and the pending ones are restarted
> at the destination in post_load. The necessary nvdimm flush specific
> vmstate structures are also introduced in this patch which are to be
> saved in the new SPAPR specific nvdimm device to be introduced in the
> following patch.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

>   hw/ppc/spapr.c                |    2
>   hw/ppc/spapr_nvdimm.c         |  260 +++++++++++++++++++++++++++++++++++++++++
>   include/hw/ppc/spapr.h        |    4 -
>   include/hw/ppc/spapr_nvdimm.h |    1
>   4 files changed, 266 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 3d6ec309dd..9263985663 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1634,6 +1634,8 @@ static void spapr_machine_reset(MachineState *machine)
>           spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>       }
>   
> +    spapr_nvdimm_finish_flushes();
> +
>       /* DRC reset may cause a device to be unplugged. This will cause troubles
>        * if this device is used by another device (eg, a running vhost backend
>        * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 91de1052f2..ac44e00153 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,6 +22,7 @@
>    * THE SOFTWARE.
>    */
>   #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>   #include "qapi/error.h"
>   #include "hw/ppc/spapr_drc.h"
>   #include "hw/ppc/spapr_nvdimm.h"
> @@ -30,6 +31,9 @@
>   #include "hw/ppc/fdt.h"
>   #include "qemu/range.h"
>   #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
> +#include "migration/vmstate.h"
> +#include "qemu/pmem.h"
>   
>   /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
>   /* SCM device is unable to persist memory contents */
> @@ -47,6 +51,14 @@
>   /* Have an explicit check for alignment */
>   QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
>   
> +#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
> +OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
> +
> +struct SPAPRNVDIMMClass {
> +    /* private */
> +    NVDIMMClass parent_class;
> +};
> +
>   bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                              uint64_t size, Error **errp)
>   {
> @@ -375,6 +387,253 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>       return H_SUCCESS;
>   }
>   
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    uint32_t drcidx;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
> +struct SpaprNVDIMMDevice {
> +    NVDIMMDevice parent_obj;
> +
> +    uint64_t nvdimm_flush_token;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
> +};
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
> +    PCDIMMDevice *dimm = PC_DIMM(drc->dev);
> +    HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);
> +    int backend_fd = memory_region_get_fd(&backend->mr);
> +
> +    if (object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
> +        MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
> +        void *ptr = memory_region_get_ram_ptr(mr);
> +        size_t size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
> +                                               NULL);
> +
> +        /* flush pmem backend */
> +        pmem_persist(ptr, size);
> +    } else {
> +        /* flush raw backing image */
> +        if (qemu_fdatasync(backend_fd) < 0) {
> +            error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                         strerror(errno));
> +            return H_HARDWARE;
> +        }
> +    }
> +
> +    return H_SUCCESS;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
> +    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(drc->dev);
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&s_nvdimm->completed_nvdimm_flush_states, state, node);
> +}
> +
> +static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
> +{
> +    SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> +    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
> +     .name = "spapr_nvdimm_flush_state",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +const VMStateDescription vmstate_spapr_nvdimm_states = {
> +    .name = "spapr_nvdimm_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .post_load = spapr_nvdimm_flush_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
> +        VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_QLIST_V(pending_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Assign a token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
> +                                                SpaprNVDIMMDevice *spapr_nvdimm)
> +{
> +    SpaprNVDIMMDeviceFlushState *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    spapr_nvdimm->nvdimm_flush_token++;
> +    /* Token zero is presumed as no job pending. Assert on overflow to zero */
> +    g_assert(spapr_nvdimm->nvdimm_flush_token != 0);
> +
> +    state->continue_token = spapr_nvdimm->nvdimm_flush_token;
> +
> +    QLIST_INSERT_HEAD(&spapr_nvdimm->pending_nvdimm_flush_states, state, node);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(void)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    GSList *list, *nvdimms;
> +
> +    /*
> +     * Called on reset path, the main loop thread which calls
> +     * the pending BHs has gotten out running in the reset path,
> +     * finally reaching here. Other code path being guest
> +     * h_client_architecture_support, thats early boot up.
> +     */
> +    nvdimms = nvdimm_get_device_list();
> +    for (list = nvdimms; list; list = list->next) {
> +        NVDIMMDevice *nvdimm = list->data;
> +        if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
> +            SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(nvdimm);
> +            while (!QLIST_EMPTY(&s_nvdimm->pending_nvdimm_flush_states)) {
> +                aio_poll(qemu_get_aio_context(), true);
> +            }
> +
> +            QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
> +                               node, next) {
> +                QLIST_REMOVE(state, node);
> +                g_free(state);
> +            }
> +        }
> +    }
> +    g_slist_free(nvdimms);
> +}
> +
> +/*
> + * spapr_nvdimm_get_flush_status
> + *      Fetches the status of the hcall worker and returns
> + *      H_LONG_BUSY_ORDER_10_MSEC if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(SpaprNVDIMMDevice *s_nvdimm,
> +                                         uint64_t token)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
> +        if (state->continue_token == token) {
> +            return H_LONG_BUSY_ORDER_10_MSEC;
> +        }
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
> +                       node, node) {
> +        if (state->continue_token == token) {
> +            int ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);
> +            return ret;
> +        }
> +    }
> +
> +    /* If not found in complete list too, invalid token */
> +    return H_P2;
> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_LONG_BUSY_ORDER_10_MSEC,
> + *               H_UNSUPPORTED
> + *
> + * Given a DRC Index Flush the data to backend NVDIMM device. The hcall returns
> + * H_LONG_BUSY_ORDER_10_MSEC when the flush takes longer time and the hcall
> + * needs to be issued multiple times in order to be completely serviced. The
> + * continue-token from the output to be passed in the argument list of
> + * subsequent hcalls until the hcall is completely serviced at which point
> + * H_SUCCESS or other error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +    int fd;
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    if (continue_token == 0) {
> +        backend = MEMORY_BACKEND(dimm->hostmem);
> +        fd = memory_region_get_fd(&backend->mr);
> +
> +        if (fd < 0) {
> +            return H_UNSUPPORTED;
> +        }
> +
> +        state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
> +        if (!state) {
> +            return H_HARDWARE;
> +        }
> +
> +        state->drcidx = drc_index;
> +
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +
> +        continue_token = state->continue_token;
> +    }
> +
> +    ret = spapr_nvdimm_get_flush_status(SPAPR_NVDIMM(dimm), continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = continue_token;
> +    }
> +
> +    return ret;
> +}
> +
>   static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                        target_ulong opcode, target_ulong *args)
>   {
> @@ -523,6 +782,7 @@ static void spapr_scm_register_types(void)
>       spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>       spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
>       spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>   }
>   
>   type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index ee7504b976..727b2a0e7f 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -341,6 +341,7 @@ struct SpaprMachineState {
>   #define H_P7              -60
>   #define H_P8              -61
>   #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>   #define H_OVERLAP         -68
>   #define H_UNSUPPORTED_FLAG -256
>   #define H_MULTI_THREADS_ACTIVE -9005
> @@ -559,8 +560,9 @@ struct SpaprMachineState {
>   #define H_SCM_UNBIND_ALL        0x3FC
>   #define H_SCM_HEALTH            0x400
>   #define H_RPT_INVALIDATE        0x448
> +#define H_SCM_FLUSH             0x44C
>   
> -#define MAX_HCALL_OPCODE        H_RPT_INVALIDATE
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>   
>   /* The hcalls above are standardized in PAPR and implemented by pHyp
>    * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 764f999f54..e9436cb6ef 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -21,5 +21,6 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
>   bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                              uint64_t size, Error **errp);
>   void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(void);
>   
>   #endif
> 
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2022-02-07 11:48     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2022-02-07 11:48 UTC (permalink / raw)
  To: Shivaprasad G Bhat, clg, mst, ani, david, groug, imammedo,
	xiaoguangrong.eric, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc



On 2/4/22 05:15, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for the nvdimm devices.
> To be available for exploitation by guest through the next patch. The
> hcall is applicable only for new SPAPR specific device class which is
> also introduced in this patch.
> 
> The hcall expects the semantics such that the flush to return with
> H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer
> time along with a continue_token. The hcall to be called again by providing
> the continue_token to get the status. So, all fresh requests are put into
> a 'pending' list and flush worker is submitted to the thread pool. The
> thread pool completion callbacks move the requests to 'completed' list,
> which are cleaned up after collecting the return status for the guest
> in subsequent hcall from the guest.
> 
> The semantics makes it necessary to preserve the continue_tokens and
> their return status across migrations. So, the completed flush states
> are forwarded to the destination and the pending ones are restarted
> at the destination in post_load. The necessary nvdimm flush specific
> vmstate structures are also introduced in this patch which are to be
> saved in the new SPAPR specific nvdimm device to be introduced in the
> following patch.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

>   hw/ppc/spapr.c                |    2
>   hw/ppc/spapr_nvdimm.c         |  260 +++++++++++++++++++++++++++++++++++++++++
>   include/hw/ppc/spapr.h        |    4 -
>   include/hw/ppc/spapr_nvdimm.h |    1
>   4 files changed, 266 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 3d6ec309dd..9263985663 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1634,6 +1634,8 @@ static void spapr_machine_reset(MachineState *machine)
>           spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>       }
>   
> +    spapr_nvdimm_finish_flushes();
> +
>       /* DRC reset may cause a device to be unplugged. This will cause troubles
>        * if this device is used by another device (eg, a running vhost backend
>        * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 91de1052f2..ac44e00153 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,6 +22,7 @@
>    * THE SOFTWARE.
>    */
>   #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>   #include "qapi/error.h"
>   #include "hw/ppc/spapr_drc.h"
>   #include "hw/ppc/spapr_nvdimm.h"
> @@ -30,6 +31,9 @@
>   #include "hw/ppc/fdt.h"
>   #include "qemu/range.h"
>   #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
> +#include "migration/vmstate.h"
> +#include "qemu/pmem.h"
>   
>   /* DIMM health bitmap bitmap indicators. Taken from kernel's papr_scm.c */
>   /* SCM device is unable to persist memory contents */
> @@ -47,6 +51,14 @@
>   /* Have an explicit check for alignment */
>   QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
>   
> +#define TYPE_SPAPR_NVDIMM "spapr-nvdimm"
> +OBJECT_DECLARE_TYPE(SpaprNVDIMMDevice, SPAPRNVDIMMClass, SPAPR_NVDIMM)
> +
> +struct SPAPRNVDIMMClass {
> +    /* private */
> +    NVDIMMClass parent_class;
> +};
> +
>   bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                              uint64_t size, Error **errp)
>   {
> @@ -375,6 +387,253 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>       return H_SUCCESS;
>   }
>   
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    uint32_t drcidx;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +typedef struct SpaprNVDIMMDevice SpaprNVDIMMDevice;
> +struct SpaprNVDIMMDevice {
> +    NVDIMMDevice parent_obj;
> +
> +    uint64_t nvdimm_flush_token;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_nvdimm_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_nvdimm_flush_states;
> +};
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
> +    PCDIMMDevice *dimm = PC_DIMM(drc->dev);
> +    HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);
> +    int backend_fd = memory_region_get_fd(&backend->mr);
> +
> +    if (object_property_get_bool(OBJECT(backend), "pmem", NULL)) {
> +        MemoryRegion *mr = host_memory_backend_get_memory(dimm->hostmem);
> +        void *ptr = memory_region_get_ram_ptr(mr);
> +        size_t size = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
> +                                               NULL);
> +
> +        /* flush pmem backend */
> +        pmem_persist(ptr, size);
> +    } else {
> +        /* flush raw backing image */
> +        if (qemu_fdatasync(backend_fd) < 0) {
> +            error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                         strerror(errno));
> +            return H_HARDWARE;
> +        }
> +    }
> +
> +    return H_SUCCESS;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +    SpaprDrc *drc = spapr_drc_by_index(state->drcidx);
> +    SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(drc->dev);
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&s_nvdimm->completed_nvdimm_flush_states, state, node);
> +}
> +
> +static int spapr_nvdimm_flush_post_load(void *opaque, int version_id)
> +{
> +    SpaprNVDIMMDevice *s_nvdimm = (SpaprNVDIMMDevice *)opaque;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> +    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateDescription vmstate_spapr_nvdimm_flush_state = {
> +     .name = "spapr_nvdimm_flush_state",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_UINT32(drcidx, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +const VMStateDescription vmstate_spapr_nvdimm_states = {
> +    .name = "spapr_nvdimm_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .post_load = spapr_nvdimm_flush_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT64(nvdimm_flush_token, SpaprNVDIMMDevice),
> +        VMSTATE_QLIST_V(completed_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_QLIST_V(pending_nvdimm_flush_states, SpaprNVDIMMDevice, 1,
> +                        vmstate_spapr_nvdimm_flush_state,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Assign a token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(
> +                                                SpaprNVDIMMDevice *spapr_nvdimm)
> +{
> +    SpaprNVDIMMDeviceFlushState *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    spapr_nvdimm->nvdimm_flush_token++;
> +    /* Token zero is presumed as no job pending. Assert on overflow to zero */
> +    g_assert(spapr_nvdimm->nvdimm_flush_token != 0);
> +
> +    state->continue_token = spapr_nvdimm->nvdimm_flush_token;
> +
> +    QLIST_INSERT_HEAD(&spapr_nvdimm->pending_nvdimm_flush_states, state, node);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(void)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    GSList *list, *nvdimms;
> +
> +    /*
> +     * Called on reset path, the main loop thread which calls
> +     * the pending BHs has gotten out running in the reset path,
> +     * finally reaching here. Other code path being guest
> +     * h_client_architecture_support, thats early boot up.
> +     */
> +    nvdimms = nvdimm_get_device_list();
> +    for (list = nvdimms; list; list = list->next) {
> +        NVDIMMDevice *nvdimm = list->data;
> +        if (object_dynamic_cast(OBJECT(nvdimm), TYPE_SPAPR_NVDIMM)) {
> +            SpaprNVDIMMDevice *s_nvdimm = SPAPR_NVDIMM(nvdimm);
> +            while (!QLIST_EMPTY(&s_nvdimm->pending_nvdimm_flush_states)) {
> +                aio_poll(qemu_get_aio_context(), true);
> +            }
> +
> +            QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
> +                               node, next) {
> +                QLIST_REMOVE(state, node);
> +                g_free(state);
> +            }
> +        }
> +    }
> +    g_slist_free(nvdimms);
> +}
> +
> +/*
> + * spapr_nvdimm_get_flush_status
> + *      Fetches the status of the hcall worker and returns
> + *      H_LONG_BUSY_ORDER_10_MSEC if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(SpaprNVDIMMDevice *s_nvdimm,
> +                                         uint64_t token)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    QLIST_FOREACH(state, &s_nvdimm->pending_nvdimm_flush_states, node) {
> +        if (state->continue_token = token) {
> +            return H_LONG_BUSY_ORDER_10_MSEC;
> +        }
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &s_nvdimm->completed_nvdimm_flush_states,
> +                       node, node) {
> +        if (state->continue_token = token) {
> +            int ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);
> +            return ret;
> +        }
> +    }
> +
> +    /* If not found in complete list too, invalid token */
> +    return H_P2;
> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_LONG_BUSY_ORDER_10_MSEC,
> + *               H_UNSUPPORTED
> + *
> + * Given a DRC Index Flush the data to backend NVDIMM device. The hcall returns
> + * H_LONG_BUSY_ORDER_10_MSEC when the flush takes longer time and the hcall
> + * needs to be issued multiple times in order to be completely serviced. The
> + * continue-token from the output to be passed in the argument list of
> + * subsequent hcalls until the hcall is completely serviced at which point
> + * H_SUCCESS or other error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +    int fd;
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    if (continue_token = 0) {
> +        backend = MEMORY_BACKEND(dimm->hostmem);
> +        fd = memory_region_get_fd(&backend->mr);
> +
> +        if (fd < 0) {
> +            return H_UNSUPPORTED;
> +        }
> +
> +        state = spapr_nvdimm_init_new_flush_state(SPAPR_NVDIMM(dimm));
> +        if (!state) {
> +            return H_HARDWARE;
> +        }
> +
> +        state->drcidx = drc_index;
> +
> +        thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                               spapr_nvdimm_flush_completion_cb, state);
> +
> +        continue_token = state->continue_token;
> +    }
> +
> +    ret = spapr_nvdimm_get_flush_status(SPAPR_NVDIMM(dimm), continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = continue_token;
> +    }
> +
> +    return ret;
> +}
> +
>   static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                        target_ulong opcode, target_ulong *args)
>   {
> @@ -523,6 +782,7 @@ static void spapr_scm_register_types(void)
>       spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>       spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
>       spapr_register_hypercall(H_SCM_HEALTH, h_scm_health);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>   }
>   
>   type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index ee7504b976..727b2a0e7f 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -341,6 +341,7 @@ struct SpaprMachineState {
>   #define H_P7              -60
>   #define H_P8              -61
>   #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>   #define H_OVERLAP         -68
>   #define H_UNSUPPORTED_FLAG -256
>   #define H_MULTI_THREADS_ACTIVE -9005
> @@ -559,8 +560,9 @@ struct SpaprMachineState {
>   #define H_SCM_UNBIND_ALL        0x3FC
>   #define H_SCM_HEALTH            0x400
>   #define H_RPT_INVALIDATE        0x448
> +#define H_SCM_FLUSH             0x44C
>   
> -#define MAX_HCALL_OPCODE        H_RPT_INVALIDATE
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>   
>   /* The hcalls above are standardized in PAPR and implemented by pHyp
>    * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 764f999f54..e9436cb6ef 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -21,5 +21,6 @@ void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
>   bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                              uint64_t size, Error **errp);
>   void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(void);
>   
>   #endif
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
  2022-02-04  8:15 ` Shivaprasad G Bhat
  (?)
@ 2022-02-07 11:53   ` Daniel Henrique Barboza
  -1 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2022-02-07 11:53 UTC (permalink / raw)
  To: Shivaprasad G Bhat, clg, mst, ani, david, groug, imammedo,
	xiaoguangrong.eric, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc



On 2/4/22 05:15, Shivaprasad G Bhat wrote:
> If the device backend is not persistent memory for the nvdimm, there
> is need for explicit IO flushes to ensure persistence.
> 
> On SPAPR, the issue is addressed by adding a new hcall to request for
> an explicit flush from the guest when the backend is not pmem.
> So, the approach here is to convey when the hcall flush is required
> in a device tree property. The guest once it knows the device needs
> explicit flushes, makes the hcall as and when required.
> 
> It was suggested to create a new device type to address the
> explicit flush for such backends on PPC instead of extending the
> generic nvdimm device with new property. So, the patch introduces
> the spapr-nvdimm device. The new device inherits the nvdimm device
> with the new bahviour such that if the backend has pmem=no, the
> device tree property is set by default.
> 
> The below demonstration shows the map_sync behavior for non-pmem
> backends.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
> 
> The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
> from spapr-nvdimm with pmem=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> 
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
> Failed to mmap  with Operation not supported
> 
> First patch adds the realize/unrealize call backs to the generic device
> for the new device's vmstate registration. The second patch implements
> the hcall, adds the necessary vmstate properties to spapr machine structure
> for carrying the hcall status during save-restore. The nature of the hcall
> being asynchronus, the patch uses aio utilities to offload the flush. The
> third patch introduces the spapr-nvdimm device, adds the device tree
> property for the guest when spapr-nvdimm is used with pmem=no on the
> backend. Also adds new property pmem-override(?, suggest if you have better
> name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
> on pmem backed devices.
> 
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
> 
> ---

I noted that we have only two nvdimm tests in QEMU, both in tests/qtest/bios-tables-test.c.
It would be a good future improvement to add some spapr-nvdimm tests there as well.


Thanks,


Daniel


> v6 - https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00322.html
> Changes from v6:
>        - Addressed commen from Daniel.
>          Fixed a typo
>          Fetch the memory backend FD in the flush_worker_cb(), updated hcall
>          return values in the comments description)
>        - Updated the signatures.
> 
> v5 - https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01741.html
> Changes from v5:
>        - Taken care of all comments from David
>        - Moved the flush lists from spapr machine into the spapr-nvdimm device
>          state structures. So, all corresponding data structures adjusted
> 	accordingly as required.
>        - New property pmem-overrride is added to the spapr-nvdimm device. The
>          hcall flushes are allowed when pmem-override is set for the device.
>        - The flush for pmem backend devices are made to use pmem_persist().
>        - The vmstate structures are also made part of device state instead of
>          global spapr.
>        - Passing the flush token to destination during migration, I think its
>          better than finding, deriving it from the outstanding ones.
> 
> v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
> Changes from v4:
>        - Introduce spapr-nvdimm device with nvdimm device as the parent.
>        - The new spapr-nvdimm has no new properties. As this is a new
>          device and there is no migration related dependencies to be
>          taken care of, the device behavior is made to set the device tree
>          property and enable hcall when the device type spapr-nvdimm is
>          used with pmem=off
>        - Fixed commit messages
>        - Added checks to ensure the backend is actualy file and not memory
>        - Addressed things pointed out by Eric
> 
> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
> Changes from v3:
>        - Fixed the forward declaration coding guideline violations in 1st patch.
>        - Removed the code waiting for the flushes to complete during migration,
>          instead restart the flush worker on destination qemu in post load.
>        - Got rid of the randomization of the flush tokens, using simple
>          counter.
>        - Got rid of the redundant flush state lock, relying on the BQL now.
>        - Handling the memory-backend-ram usage
>        - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
> 	Added prevention code using 'writeback' on arm and x86_64.
>        - Fixed all the miscellaneous comments.
> 
> v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
> Changes from v2:
>        - Using the thread pool based approach as suggested
>        - Moved the async hcall handling code to spapr_nvdimm.c along
>          with some simplifications
>        - Added vmstate to preserve the hcall status during save-restore
>          along with pre_save handler code to complete all ongoning flushes.
>        - Added hw_compat magic for sync-dax 'on' on previous machines.
>        - Miscellanious minor fixes.
> 
> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
> Changes from v1
>        - Fixed a missed-out unlock
>        - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token
> 
> Shivaprasad G Bhat (3):
>        nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
>        spapr: nvdimm: Implement H_SCM_FLUSH hcall
>        spapr: nvdimm: Introduce spapr-nvdimm device
> 
> 
>   hw/mem/nvdimm.c               |  16 ++
>   hw/mem/pc-dimm.c              |   5 +
>   hw/ppc/spapr.c                |   2 +
>   hw/ppc/spapr_nvdimm.c         | 394 ++++++++++++++++++++++++++++++++++
>   include/hw/mem/nvdimm.h       |   2 +
>   include/hw/mem/pc-dimm.h      |   1 +
>   include/hw/ppc/spapr.h        |   4 +-
>   include/hw/ppc/spapr_nvdimm.h |   1 +
>   8 files changed, 424 insertions(+), 1 deletion(-)
> 
> --
> Signature
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-07 11:53   ` Daniel Henrique Barboza
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2022-02-07 11:53 UTC (permalink / raw)
  To: Shivaprasad G Bhat, clg, mst, ani, david, groug, imammedo,
	xiaoguangrong.eric, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm



On 2/4/22 05:15, Shivaprasad G Bhat wrote:
> If the device backend is not persistent memory for the nvdimm, there
> is need for explicit IO flushes to ensure persistence.
> 
> On SPAPR, the issue is addressed by adding a new hcall to request for
> an explicit flush from the guest when the backend is not pmem.
> So, the approach here is to convey when the hcall flush is required
> in a device tree property. The guest once it knows the device needs
> explicit flushes, makes the hcall as and when required.
> 
> It was suggested to create a new device type to address the
> explicit flush for such backends on PPC instead of extending the
> generic nvdimm device with new property. So, the patch introduces
> the spapr-nvdimm device. The new device inherits the nvdimm device
> with the new bahviour such that if the backend has pmem=no, the
> device tree property is set by default.
> 
> The below demonstration shows the map_sync behavior for non-pmem
> backends.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
> 
> The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
> from spapr-nvdimm with pmem=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> 
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
> Failed to mmap  with Operation not supported
> 
> First patch adds the realize/unrealize call backs to the generic device
> for the new device's vmstate registration. The second patch implements
> the hcall, adds the necessary vmstate properties to spapr machine structure
> for carrying the hcall status during save-restore. The nature of the hcall
> being asynchronus, the patch uses aio utilities to offload the flush. The
> third patch introduces the spapr-nvdimm device, adds the device tree
> property for the guest when spapr-nvdimm is used with pmem=no on the
> backend. Also adds new property pmem-override(?, suggest if you have better
> name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
> on pmem backed devices.
> 
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
> 
> ---

I noted that we have only two nvdimm tests in QEMU, both in tests/qtest/bios-tables-test.c.
It would be a good future improvement to add some spapr-nvdimm tests there as well.


Thanks,


Daniel


> v6 - https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00322.html
> Changes from v6:
>        - Addressed commen from Daniel.
>          Fixed a typo
>          Fetch the memory backend FD in the flush_worker_cb(), updated hcall
>          return values in the comments description)
>        - Updated the signatures.
> 
> v5 - https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01741.html
> Changes from v5:
>        - Taken care of all comments from David
>        - Moved the flush lists from spapr machine into the spapr-nvdimm device
>          state structures. So, all corresponding data structures adjusted
> 	accordingly as required.
>        - New property pmem-overrride is added to the spapr-nvdimm device. The
>          hcall flushes are allowed when pmem-override is set for the device.
>        - The flush for pmem backend devices are made to use pmem_persist().
>        - The vmstate structures are also made part of device state instead of
>          global spapr.
>        - Passing the flush token to destination during migration, I think its
>          better than finding, deriving it from the outstanding ones.
> 
> v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
> Changes from v4:
>        - Introduce spapr-nvdimm device with nvdimm device as the parent.
>        - The new spapr-nvdimm has no new properties. As this is a new
>          device and there is no migration related dependencies to be
>          taken care of, the device behavior is made to set the device tree
>          property and enable hcall when the device type spapr-nvdimm is
>          used with pmem=off
>        - Fixed commit messages
>        - Added checks to ensure the backend is actualy file and not memory
>        - Addressed things pointed out by Eric
> 
> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
> Changes from v3:
>        - Fixed the forward declaration coding guideline violations in 1st patch.
>        - Removed the code waiting for the flushes to complete during migration,
>          instead restart the flush worker on destination qemu in post load.
>        - Got rid of the randomization of the flush tokens, using simple
>          counter.
>        - Got rid of the redundant flush state lock, relying on the BQL now.
>        - Handling the memory-backend-ram usage
>        - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
> 	Added prevention code using 'writeback' on arm and x86_64.
>        - Fixed all the miscellaneous comments.
> 
> v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
> Changes from v2:
>        - Using the thread pool based approach as suggested
>        - Moved the async hcall handling code to spapr_nvdimm.c along
>          with some simplifications
>        - Added vmstate to preserve the hcall status during save-restore
>          along with pre_save handler code to complete all ongoning flushes.
>        - Added hw_compat magic for sync-dax 'on' on previous machines.
>        - Miscellanious minor fixes.
> 
> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
> Changes from v1
>        - Fixed a missed-out unlock
>        - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token
> 
> Shivaprasad G Bhat (3):
>        nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
>        spapr: nvdimm: Implement H_SCM_FLUSH hcall
>        spapr: nvdimm: Introduce spapr-nvdimm device
> 
> 
>   hw/mem/nvdimm.c               |  16 ++
>   hw/mem/pc-dimm.c              |   5 +
>   hw/ppc/spapr.c                |   2 +
>   hw/ppc/spapr_nvdimm.c         | 394 ++++++++++++++++++++++++++++++++++
>   include/hw/mem/nvdimm.h       |   2 +
>   include/hw/mem/pc-dimm.h      |   1 +
>   include/hw/ppc/spapr.h        |   4 +-
>   include/hw/ppc/spapr_nvdimm.h |   1 +
>   8 files changed, 424 insertions(+), 1 deletion(-)
> 
> --
> Signature
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-07 11:53   ` Daniel Henrique Barboza
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Henrique Barboza @ 2022-02-07 11:53 UTC (permalink / raw)
  To: Shivaprasad G Bhat, clg, mst, ani, david, groug, imammedo,
	xiaoguangrong.eric, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc



On 2/4/22 05:15, Shivaprasad G Bhat wrote:
> If the device backend is not persistent memory for the nvdimm, there
> is need for explicit IO flushes to ensure persistence.
> 
> On SPAPR, the issue is addressed by adding a new hcall to request for
> an explicit flush from the guest when the backend is not pmem.
> So, the approach here is to convey when the hcall flush is required
> in a device tree property. The guest once it knows the device needs
> explicit flushes, makes the hcall as and when required.
> 
> It was suggested to create a new device type to address the
> explicit flush for such backends on PPC instead of extending the
> generic nvdimm device with new property. So, the patch introduces
> the spapr-nvdimm device. The new device inherits the nvdimm device
> with the new bahviour such that if the backend has pmem=no, the
> device tree property is set by default.
> 
> The below demonstration shows the map_sync behavior for non-pmem
> backends.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
> 
> The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
> from spapr-nvdimm with pmem=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)
> 
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
> Failed to mmap  with Operation not supported
> 
> First patch adds the realize/unrealize call backs to the generic device
> for the new device's vmstate registration. The second patch implements
> the hcall, adds the necessary vmstate properties to spapr machine structure
> for carrying the hcall status during save-restore. The nature of the hcall
> being asynchronus, the patch uses aio utilities to offload the flush. The
> third patch introduces the spapr-nvdimm device, adds the device tree
> property for the guest when spapr-nvdimm is used with pmem=no on the
> backend. Also adds new property pmem-override(?, suggest if you have better
> name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
> on pmem backed devices.
> 
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch
> 
> ---

I noted that we have only two nvdimm tests in QEMU, both in tests/qtest/bios-tables-test.c.
It would be a good future improvement to add some spapr-nvdimm tests there as well.


Thanks,


Daniel


> v6 - https://lists.gnu.org/archive/html/qemu-devel/2022-02/msg00322.html
> Changes from v6:
>        - Addressed commen from Daniel.
>          Fixed a typo
>          Fetch the memory backend FD in the flush_worker_cb(), updated hcall
>          return values in the comments description)
>        - Updated the signatures.
> 
> v5 - https://lists.gnu.org/archive/html/qemu-devel/2021-07/msg01741.html
> Changes from v5:
>        - Taken care of all comments from David
>        - Moved the flush lists from spapr machine into the spapr-nvdimm device
>          state structures. So, all corresponding data structures adjusted
> 	accordingly as required.
>        - New property pmem-overrride is added to the spapr-nvdimm device. The
>          hcall flushes are allowed when pmem-override is set for the device.
>        - The flush for pmem backend devices are made to use pmem_persist().
>        - The vmstate structures are also made part of device state instead of
>          global spapr.
>        - Passing the flush token to destination during migration, I think its
>          better than finding, deriving it from the outstanding ones.
> 
> v4 - https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg05982.html
> Changes from v4:
>        - Introduce spapr-nvdimm device with nvdimm device as the parent.
>        - The new spapr-nvdimm has no new properties. As this is a new
>          device and there is no migration related dependencies to be
>          taken care of, the device behavior is made to set the device tree
>          property and enable hcall when the device type spapr-nvdimm is
>          used with pmem=off
>        - Fixed commit messages
>        - Added checks to ensure the backend is actualy file and not memory
>        - Addressed things pointed out by Eric
> 
> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html
> Changes from v3:
>        - Fixed the forward declaration coding guideline violations in 1st patch.
>        - Removed the code waiting for the flushes to complete during migration,
>          instead restart the flush worker on destination qemu in post load.
>        - Got rid of the randomization of the flush tokens, using simple
>          counter.
>        - Got rid of the redundant flush state lock, relying on the BQL now.
>        - Handling the memory-backend-ram usage
>        - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'.
> 	Added prevention code using 'writeback' on arm and x86_64.
>        - Fixed all the miscellaneous comments.
> 
> v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
> Changes from v2:
>        - Using the thread pool based approach as suggested
>        - Moved the async hcall handling code to spapr_nvdimm.c along
>          with some simplifications
>        - Added vmstate to preserve the hcall status during save-restore
>          along with pre_save handler code to complete all ongoning flushes.
>        - Added hw_compat magic for sync-dax 'on' on previous machines.
>        - Miscellanious minor fixes.
> 
> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
> Changes from v1
>        - Fixed a missed-out unlock
>        - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token
> 
> Shivaprasad G Bhat (3):
>        nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class
>        spapr: nvdimm: Implement H_SCM_FLUSH hcall
>        spapr: nvdimm: Introduce spapr-nvdimm device
> 
> 
>   hw/mem/nvdimm.c               |  16 ++
>   hw/mem/pc-dimm.c              |   5 +
>   hw/ppc/spapr.c                |   2 +
>   hw/ppc/spapr_nvdimm.c         | 394 ++++++++++++++++++++++++++++++++++
>   include/hw/mem/nvdimm.h       |   2 +
>   include/hw/mem/pc-dimm.h      |   1 +
>   include/hw/ppc/spapr.h        |   4 +-
>   include/hw/ppc/spapr_nvdimm.h |   1 +
>   8 files changed, 424 insertions(+), 1 deletion(-)
> 
> --
> Signature
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
  2022-02-04  8:15 ` Shivaprasad G Bhat
  (?)
@ 2022-02-18  7:44   ` Cédric Le Goater
  -1 siblings, 0 replies; 20+ messages in thread
From: Cédric Le Goater @ 2022-02-18  7:44 UTC (permalink / raw)
  To: Shivaprasad G Bhat, mst, ani, danielhb413, david, groug,
	imammedo, xiaoguangrong.eric, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm

On 2/4/22 09:15, Shivaprasad G Bhat wrote:
> If the device backend is not persistent memory for the nvdimm, there
> is need for explicit IO flushes to ensure persistence.
> 
> On SPAPR, the issue is addressed by adding a new hcall to request for
> an explicit flush from the guest when the backend is not pmem.
> So, the approach here is to convey when the hcall flush is required
> in a device tree property. The guest once it knows the device needs
> explicit flushes, makes the hcall as and when required.
> 
> It was suggested to create a new device type to address the
> explicit flush for such backends on PPC instead of extending the
> generic nvdimm device with new property. So, the patch introduces
> the spapr-nvdimm device. The new device inherits the nvdimm device
> with the new bahviour such that if the backend has pmem=no, the
> device tree property is set by default.
> 
> The below demonstration shows the map_sync behavior for non-pmem
> backends.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
> 
> The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
> from spapr-nvdimm with pmem=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> 
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
> Failed to mmap  with Operation not supported
> 
> First patch adds the realize/unrealize call backs to the generic device
> for the new device's vmstate registration. The second patch implements
> the hcall, adds the necessary vmstate properties to spapr machine structure
> for carrying the hcall status during save-restore. The nature of the hcall
> being asynchronus, the patch uses aio utilities to offload the flush. The
> third patch introduces the spapr-nvdimm device, adds the device tree
> property for the guest when spapr-nvdimm is used with pmem=no on the
> backend. Also adds new property pmem-override(?, suggest if you have better
> name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
> on pmem backed devices.
> 
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch



Applied for ppc-7.0

Thanks,

C.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-18  7:44   ` Cédric Le Goater
  0 siblings, 0 replies; 20+ messages in thread
From: Cédric Le Goater @ 2022-02-18  7:44 UTC (permalink / raw)
  To: Shivaprasad G Bhat, mst, ani, danielhb413, david, groug,
	imammedo, xiaoguangrong.eric, qemu-ppc
  Cc: qemu-devel, aneesh.kumar, nvdimm, kvm-ppc

On 2/4/22 09:15, Shivaprasad G Bhat wrote:
> If the device backend is not persistent memory for the nvdimm, there
> is need for explicit IO flushes to ensure persistence.
> 
> On SPAPR, the issue is addressed by adding a new hcall to request for
> an explicit flush from the guest when the backend is not pmem.
> So, the approach here is to convey when the hcall flush is required
> in a device tree property. The guest once it knows the device needs
> explicit flushes, makes the hcall as and when required.
> 
> It was suggested to create a new device type to address the
> explicit flush for such backends on PPC instead of extending the
> generic nvdimm device with new property. So, the patch introduces
> the spapr-nvdimm device. The new device inherits the nvdimm device
> with the new bahviour such that if the backend has pmem=no, the
> device tree property is set by default.
> 
> The below demonstration shows the map_sync behavior for non-pmem
> backends.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
> 
> The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
> from spapr-nvdimm with pmem=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> 
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
> Failed to mmap  with Operation not supported
> 
> First patch adds the realize/unrealize call backs to the generic device
> for the new device's vmstate registration. The second patch implements
> the hcall, adds the necessary vmstate properties to spapr machine structure
> for carrying the hcall status during save-restore. The nature of the hcall
> being asynchronus, the patch uses aio utilities to offload the flush. The
> third patch introduces the spapr-nvdimm device, adds the device tree
> property for the guest when spapr-nvdimm is used with pmem=no on the
> backend. Also adds new property pmem-override(?, suggest if you have better
> name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
> on pmem backed devices.
> 
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch



Applied for ppc-7.0

Thanks,

C.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device
@ 2022-02-18  7:44   ` Cédric Le Goater
  0 siblings, 0 replies; 20+ messages in thread
From: Cédric Le Goater @ 2022-02-18  7:44 UTC (permalink / raw)
  To: Shivaprasad G Bhat, mst, ani, danielhb413, david, groug,
	imammedo, xiaoguangrong.eric, qemu-ppc
  Cc: aneesh.kumar, qemu-devel, kvm-ppc, nvdimm

On 2/4/22 09:15, Shivaprasad G Bhat wrote:
> If the device backend is not persistent memory for the nvdimm, there
> is need for explicit IO flushes to ensure persistence.
> 
> On SPAPR, the issue is addressed by adding a new hcall to request for
> an explicit flush from the guest when the backend is not pmem.
> So, the approach here is to convey when the hcall flush is required
> in a device tree property. The guest once it knows the device needs
> explicit flushes, makes the hcall as and when required.
> 
> It was suggested to create a new device type to address the
> explicit flush for such backends on PPC instead of extending the
> generic nvdimm device with new property. So, the patch introduces
> the spapr-nvdimm device. The new device inherits the nvdimm device
> with the new bahviour such that if the backend has pmem=no, the
> device tree property is set by default.
> 
> The below demonstration shows the map_sync behavior for non-pmem
> backends.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
> 
> The pmem0 is from spapr-nvdimm with with backend pmem=on, and pmem1 is
> from spapr-nvdimm with pmem=off, mounted as
> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)
> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)
> 
> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When pmem=on
> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when pmem=off
> Failed to mmap  with Operation not supported
> 
> First patch adds the realize/unrealize call backs to the generic device
> for the new device's vmstate registration. The second patch implements
> the hcall, adds the necessary vmstate properties to spapr machine structure
> for carrying the hcall status during save-restore. The nature of the hcall
> being asynchronus, the patch uses aio utilities to offload the flush. The
> third patch introduces the spapr-nvdimm device, adds the device tree
> property for the guest when spapr-nvdimm is used with pmem=no on the
> backend. Also adds new property pmem-override(?, suggest if you have better
> name) to the spapr-nvdimm which hints at forcing the hcall based flushes even
> on pmem backed devices.
> 
> The kernel changes to exploit this hcall is at
> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch



Applied for ppc-7.0

Thanks,

C.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-02-18 12:33 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-04  8:15 [PATCH v7 0/3] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
2022-02-04  8:15 ` Shivaprasad G Bhat
2022-02-04  8:15 ` Shivaprasad G Bhat
2022-02-04  8:15 ` [PATCH v7 1/3] nvdimm: Add realize, unrealize callbacks to NVDIMMDevice class Shivaprasad G Bhat
2022-02-04  8:15   ` Shivaprasad G Bhat
2022-02-04  8:15 ` [PATCH v7 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
2022-02-04  8:15   ` Shivaprasad G Bhat
2022-02-04  8:15   ` Shivaprasad G Bhat
2022-02-07 11:48   ` Daniel Henrique Barboza
2022-02-07 11:48     ` Daniel Henrique Barboza
2022-02-07 11:48     ` Daniel Henrique Barboza
2022-02-04  8:16 ` [PATCH v7 3/3] spapr: nvdimm: Introduce spapr-nvdimm device Shivaprasad G Bhat
2022-02-04  8:16   ` Shivaprasad G Bhat
2022-02-04  8:16   ` Shivaprasad G Bhat
2022-02-07 11:53 ` [PATCH v7 0/3] " Daniel Henrique Barboza
2022-02-07 11:53   ` Daniel Henrique Barboza
2022-02-07 11:53   ` Daniel Henrique Barboza
2022-02-18  7:44 ` Cédric Le Goater
2022-02-18  7:44   ` Cédric Le Goater
2022-02-18  7:44   ` Cédric Le Goater

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.