All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] spapr: nvdimm: Enable sync-dax property for nvdimm
@ 2021-03-23 13:47 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The nvdimm devices are expected to ensure write persistence during power
failure kind of scenarios.

The libpmem has architecture specific instructions like dcbf on power
to flush the cache data to backend nvdimm device during normal writes.

Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
doesn't traslate to actual flush to the backend file on the host in case
of file backed v-nvdimms. This is addressed by virtio-pmem in case of x86_64
by making explicit flushes translating to fdatasync at qemu.

On PAPR, the issue is addressed by adding a new hcall to
request for an explicit flush from the guest ndctl driver when the backend
nvdimm cannot ensure write persistence with dcbf alone. So, the approach
here is to convey when the hcall flush is required in a device tree
property. The guest makes the hcall when the property is found, instead
of relying on dcbf.

The first patch adds the necessary asynchronous hcall support infrastructure
code at the DRC level. Second patch implements the hcall using the
infrastructure.

Hcall number and semantics finalized, so dropping the RFC prefix.

A new device property sync-dax is added to the nvdimm device. When the 
sync-dax is off(default), device property "hcall-flush-required" is set,
and the guest makes hcall H_SCM_FLUSH requesting for an explicit flush. 

By default, sync-dax is "off" on all new pseries machines and prior to
5.2 its "on",

The below demonstration shows the map_sync behavior with sync-dax on & off.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from nvdimm with With sync-dax=on, and pmem1 is from nvdimm with syn-dax=off, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile    ----> When sync-dax=off
[root@atest-guest ~]# ./mapsync /mnt2/newfile    ----> when sync-dax=on
Failed to mmap  with Operation not supported

The first patch does the header file cleanup necessary for the
subsequent ones. Second patch implements the hcall, adds the necessary
vmstate properties to spapr machine structure for carrying the hcall
status during save-restore. The nature of the hcall being asynchronus,
the patch uses aio utilities to offload the flush. The third patch adds
the 'sync-dax' device property and enables the device tree property
for the guest to utilise the hcall.

---
v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested by Greg
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoning flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellanious minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
      - Fixed a missed-out unlock
      - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
      spapr: nvdimm: Forward declare and move the definitions
      spapr: nvdimm: Impletment scm flush hcall
      spapr: nvdimm: Enable sync-dax device property for nvdimm


 hw/core/machine.c             |    1 
 hw/mem/nvdimm.c               |    1 
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  269 +++++++++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   10 ++
 include/hw/ppc/spapr.h        |   12 ++
 include/hw/ppc/spapr_nvdimm.h |   34 +++--
 7 files changed, 317 insertions(+), 16 deletions(-)

--
Signature


_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 0/3] spapr: nvdimm: Enable sync-dax property for nvdimm
@ 2021-03-23 13:47 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The nvdimm devices are expected to ensure write persistence during power
failure kind of scenarios.

The libpmem has architecture specific instructions like dcbf on power
to flush the cache data to backend nvdimm device during normal writes.

Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
doesn't traslate to actual flush to the backend file on the host in case
of file backed v-nvdimms. This is addressed by virtio-pmem in case of x86_64
by making explicit flushes translating to fdatasync at qemu.

On PAPR, the issue is addressed by adding a new hcall to
request for an explicit flush from the guest ndctl driver when the backend
nvdimm cannot ensure write persistence with dcbf alone. So, the approach
here is to convey when the hcall flush is required in a device tree
property. The guest makes the hcall when the property is found, instead
of relying on dcbf.

The first patch adds the necessary asynchronous hcall support infrastructure
code at the DRC level. Second patch implements the hcall using the
infrastructure.

Hcall number and semantics finalized, so dropping the RFC prefix.

A new device property sync-dax is added to the nvdimm device. When the 
sync-dax is off(default), device property "hcall-flush-required" is set,
and the guest makes hcall H_SCM_FLUSH requesting for an explicit flush. 

By default, sync-dax is "off" on all new pseries machines and prior to
5.2 its "on",

The below demonstration shows the map_sync behavior with sync-dax on & off.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from nvdimm with With sync-dax=on, and pmem1 is from nvdimm with syn-dax=off, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile    ----> When sync-dax=off
[root@atest-guest ~]# ./mapsync /mnt2/newfile    ----> when sync-dax=on
Failed to mmap  with Operation not supported

The first patch does the header file cleanup necessary for the
subsequent ones. Second patch implements the hcall, adds the necessary
vmstate properties to spapr machine structure for carrying the hcall
status during save-restore. The nature of the hcall being asynchronus,
the patch uses aio utilities to offload the flush. The third patch adds
the 'sync-dax' device property and enables the device tree property
for the guest to utilise the hcall.

---
v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested by Greg
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoning flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellanious minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
      - Fixed a missed-out unlock
      - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
      spapr: nvdimm: Forward declare and move the definitions
      spapr: nvdimm: Impletment scm flush hcall
      spapr: nvdimm: Enable sync-dax device property for nvdimm


 hw/core/machine.c             |    1 
 hw/mem/nvdimm.c               |    1 
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  269 +++++++++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   10 ++
 include/hw/ppc/spapr.h        |   12 ++
 include/hw/ppc/spapr_nvdimm.h |   34 +++--
 7 files changed, 317 insertions(+), 16 deletions(-)

--
Signature





^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 0/3] spapr: nvdimm: Enable sync-dax property for nvdimm
@ 2021-03-23 13:47 ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The nvdimm devices are expected to ensure write persistence during power
failure kind of scenarios.

The libpmem has architecture specific instructions like dcbf on power
to flush the cache data to backend nvdimm device during normal writes.

Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
doesn't traslate to actual flush to the backend file on the host in case
of file backed v-nvdimms. This is addressed by virtio-pmem in case of x86_64
by making explicit flushes translating to fdatasync at qemu.

On PAPR, the issue is addressed by adding a new hcall to
request for an explicit flush from the guest ndctl driver when the backend
nvdimm cannot ensure write persistence with dcbf alone. So, the approach
here is to convey when the hcall flush is required in a device tree
property. The guest makes the hcall when the property is found, instead
of relying on dcbf.

The first patch adds the necessary asynchronous hcall support infrastructure
code at the DRC level. Second patch implements the hcall using the
infrastructure.

Hcall number and semantics finalized, so dropping the RFC prefix.

A new device property sync-dax is added to the nvdimm device. When the 
sync-dax is off(default), device property "hcall-flush-required" is set,
and the guest makes hcall H_SCM_FLUSH requesting for an explicit flush. 

By default, sync-dax is "off" on all new pseries machines and prior to
5.2 its "on",

The below demonstration shows the map_sync behavior with sync-dax on & off.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from nvdimm with With sync-dax=on, and pmem1 is from nvdimm with syn-dax=off, mounted as
/dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)
/dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize2k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile    ----> When sync-dax=off
[root@atest-guest ~]# ./mapsync /mnt2/newfile    ----> when sync-dax=on
Failed to mmap  with Operation not supported

The first patch does the header file cleanup necessary for the
subsequent ones. Second patch implements the hcall, adds the necessary
vmstate properties to spapr machine structure for carrying the hcall
status during save-restore. The nature of the hcall being asynchronus,
the patch uses aio utilities to offload the flush. The third patch adds
the 'sync-dax' device property and enables the device tree property
for the guest to utilise the hcall.

---
v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html
Changes from v2:
      - Using the thread pool based approach as suggested by Greg
      - Moved the async hcall handling code to spapr_nvdimm.c along
        with some simplifications
      - Added vmstate to preserve the hcall status during save-restore
        along with pre_save handler code to complete all ongoning flushes.
      - Added hw_compat magic for sync-dax 'on' on previous machines.
      - Miscellanious minor fixes.

v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
      - Fixed a missed-out unlock
      - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (3):
      spapr: nvdimm: Forward declare and move the definitions
      spapr: nvdimm: Impletment scm flush hcall
      spapr: nvdimm: Enable sync-dax device property for nvdimm


 hw/core/machine.c             |    1 
 hw/mem/nvdimm.c               |    1 
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  269 +++++++++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h       |   10 ++
 include/hw/ppc/spapr.h        |   12 ++
 include/hw/ppc/spapr_nvdimm.h |   34 +++--
 7 files changed, 317 insertions(+), 16 deletions(-)

--
Signature



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions
  2021-03-23 13:47 ` Shivaprasad G Bhat
  (?)
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The subsequent patches add definitions which tend to
get the compilation to cyclic dependency. So, prepare
with forward declarations, move the defitions and clean up.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
 include/hw/ppc/spapr_nvdimm.h |   21 ++++++---------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index b46c36917c..8cf3fb2ffb 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -31,6 +31,18 @@
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
 
+/*
+ * The nvdimm size should be aligned to SCM block size.
+ * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
+ * inorder to have SCM regions not to overlap with dimm memory regions.
+ * The SCM devices can have variable block sizes. For now, fixing the
+ * block size to the minimum value.
+ */
+#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
+
+/* Have an explicit check for alignment */
+QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp)
 {
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 73be250e2a..abcacda5d7 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,23 +11,14 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
-#include "hw/ppc/spapr.h"
 
-/*
- * The nvdimm size should be aligned to SCM block size.
- * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
- * inorder to have SCM regions not to overlap with dimm memory regions.
- * The SCM devices can have variable block sizes. For now, fixing the
- * block size to the minimum value.
- */
-#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
-
-/* Have an explicit check for alignment */
-QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+struct SpaprDrc;
+struct SpaprMachineState;
 
-int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
-                           void *fdt, int *fdt_start_offset, Error **errp);
-void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
+int spapr_pmem_dt_populate(struct SpaprDrc *drc,
+                           struct SpaprMachineState *spapr, void *fdt,
+                           int *fdt_start_offset, Error **errp);
+void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The subsequent patches add definitions which tend to
get the compilation to cyclic dependency. So, prepare
with forward declarations, move the defitions and clean up.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
 include/hw/ppc/spapr_nvdimm.h |   21 ++++++---------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index b46c36917c..8cf3fb2ffb 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -31,6 +31,18 @@
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
 
+/*
+ * The nvdimm size should be aligned to SCM block size.
+ * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
+ * inorder to have SCM regions not to overlap with dimm memory regions.
+ * The SCM devices can have variable block sizes. For now, fixing the
+ * block size to the minimum value.
+ */
+#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
+
+/* Have an explicit check for alignment */
+QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp)
 {
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 73be250e2a..abcacda5d7 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,23 +11,14 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
-#include "hw/ppc/spapr.h"
 
-/*
- * The nvdimm size should be aligned to SCM block size.
- * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
- * inorder to have SCM regions not to overlap with dimm memory regions.
- * The SCM devices can have variable block sizes. For now, fixing the
- * block size to the minimum value.
- */
-#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
-
-/* Have an explicit check for alignment */
-QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+struct SpaprDrc;
+struct SpaprMachineState;
 
-int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
-                           void *fdt, int *fdt_start_offset, Error **errp);
-void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
+int spapr_pmem_dt_populate(struct SpaprDrc *drc,
+                           struct SpaprMachineState *spapr, void *fdt,
+                           int *fdt_start_offset, Error **errp);
+void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The subsequent patches add definitions which tend to
get the compilation to cyclic dependency. So, prepare
with forward declarations, move the defitions and clean up.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
 include/hw/ppc/spapr_nvdimm.h |   21 ++++++---------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index b46c36917c..8cf3fb2ffb 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -31,6 +31,18 @@
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
 
+/*
+ * The nvdimm size should be aligned to SCM block size.
+ * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
+ * inorder to have SCM regions not to overlap with dimm memory regions.
+ * The SCM devices can have variable block sizes. For now, fixing the
+ * block size to the minimum value.
+ */
+#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
+
+/* Have an explicit check for alignment */
+QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp)
 {
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index 73be250e2a..abcacda5d7 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,23 +11,14 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
-#include "hw/ppc/spapr.h"
 
-/*
- * The nvdimm size should be aligned to SCM block size.
- * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
- * inorder to have SCM regions not to overlap with dimm memory regions.
- * The SCM devices can have variable block sizes. For now, fixing the
- * block size to the minimum value.
- */
-#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
-
-/* Have an explicit check for alignment */
-QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
+struct SpaprDrc;
+struct SpaprMachineState;
 
-int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
-                           void *fdt, int *fdt_start_offset, Error **errp);
-void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
+int spapr_pmem_dt_populate(struct SpaprDrc *drc,
+                           struct SpaprMachineState *spapr, void *fdt,
+                           int *fdt_start_offset, Error **errp);
+void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-03-23 13:47 ` Shivaprasad G Bhat
  (?)
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch.

The hcall expects the semantics such that the flush to return
with H_BUSY when the operation is expected to take longer time along
with a continue_token. The hcall to be called again providing the
continue_token to get the status. So, all fresh requsts are put into
a 'pending' list and flush worker is submitted to the thread pool.
The thread pool completion callbacks move the requests to 'completed'
list, which are cleaned up after reporting to guest in subsequent
hcalls to get the status.

The semantics makes it necessary to preserve the continue_tokens
and their return status even across migrations. So, the pre_save
handler for the device waits for the flush worker to complete and
collects all the hcall states from 'completed' list. The necessary
nvdimm flush specific vmstate structures are added to the spapr
machine vmstate.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |   11 ++
 include/hw/ppc/spapr_nvdimm.h |   12 ++
 4 files changed, 268 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d56418ca29..fdb0c73a2c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes();
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
@@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
         &vmstate_spapr_cap_ccf_assist,
         &vmstate_spapr_cap_fwnmi,
         &vmstate_spapr_fwnmi,
+        &vmstate_spapr_nvdimm_flush_states,
         NULL
     }
 };
@@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
     }
 
     qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
+    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
+    QLIST_INIT(&spapr->pending_flush_states);
+    QLIST_INIT(&spapr->completed_flush_states);
 }
 
 #define DEFAULT_KVM_TYPE "auto"
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 8cf3fb2ffb..883317c1ed 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,14 +22,17 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
 #include "hw/mem/nvdimm.h"
+#include "qemu/guest-random.h"
 #include "qemu/nvdimm-utils.h"
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
 
 /*
  * The nvdimm size should be aligned to SCM block size.
@@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static const VMStateDescription vmstate_spapr_nvdimm_entry = {
+     .name = "spapr_nvdimm_states",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+static bool spapr_nvdimm_states_needed(void *opaque)
+{
+     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
+             !QLIST_EMPTY(&spapr->completed_flush_states));
+}
+
+static int spapr_nvdimm_pre_save(void *opaque)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    return 0;
+}
+
+const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
+    .name = "spapr_nvdimm_hcall_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_nvdimm_states_needed,
+    .pre_save = spapr_nvdimm_pre_save,
+    .fields = (VMStateField[]) {
+        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
+                        vmstate_spapr_nvdimm_entry,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Acquire a unique token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
+{
+    Error *err = NULL;
+    uint64_t token;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+retry:
+    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
+        error_report_err(err);
+        g_free(state);
+        qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+        return NULL;
+    }
+
+    if (!token) /* Token should be non-zero */
+        goto retry;
+
+    /* If the token already in use, get a new one */
+    QLIST_FOREACH_SAFE(tmp, &(spapr->pending_flush_states), node, next) {
+        if (tmp->continue_token == token) {
+            goto retry;
+        }
+    }
+    QLIST_FOREACH_SAFE(tmp, &(spapr->completed_flush_states), node, next) {
+        if (tmp->continue_token == token) {
+            goto retry;
+        }
+    }
+
+    state->continue_token = token;
+    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
+
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(void)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    /*
+     * No contention here when called on reset path, the main loop thread
+     * which calls the pending BHs has gotten out running in the reset path,
+     * finally reaching here. Other code path being guest
+     * h_client_architecture_support, thats early boot up.
+     */
+
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
+        QLIST_REMOVE(state, node);
+        g_free(state);
+    }
+}
+
+/*
+ * spapr_nvdimm_get_hcall_status
+ *      Fetches the status of the hcall worker and returns H_BUSY
+ *      if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(uint64_t token)
+{
+    int ret = H_LONG_BUSY_ORDER_10_MSEC;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
+        if (state->continue_token == token) {
+            goto exit;
+        }
+    }
+    ret = H_P2; /* If not found in complete list too, invalid token */
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
+        if (state->continue_token == token) {
+            ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            break;
+        }
+    }
+exit:
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    return ret;
+}
+
+static int flush_worker_cb(void *opaque)
+{
+    int ret = H_SUCCESS;
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    /* flush raw backing image */
+    if (qemu_fdatasync(state->backend_fd) < 0) {
+        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                     strerror(errno));
+        ret = H_HARDWARE;
+    }
+
+    return ret;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
+
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
+ *
+ * Given a DRC Index Flush the data to backend NVDIMM device.
+ * The hcall returns H_BUSY when the flush takes longer time and the hcall
+ * needs to be issued multiple times in order to be completely serviced.
+ * The continue-token from the output to be passed in the argument list of
+ * subsequent hcalls until the hcall is completely serviced at which
+ * point H_SUCCESS or other error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                      target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    if (continue_token != 0) {
+        ret = spapr_nvdimm_get_flush_status(continue_token);
+        if (H_IS_LONG_BUSY(ret)) {
+            args[0] = continue_token;
+        }
+
+        return ret;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    backend = MEMORY_BACKEND(dimm->hostmem);
+
+    state = spapr_nvdimm_init_new_flush_state();
+    if (!state) {
+        return H_P2;
+    }
+
+    state->backend_fd = memory_region_get_fd(&backend->mr);
+
+    thread_pool_submit_aio(pool, flush_worker_cb, state,
+                           spapr_nvdimm_flush_completion_cb, state);
+
+    ret = spapr_nvdimm_get_flush_status(state->continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = state->continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -487,6 +726,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 47cebaf3ac..7c27fb3e2d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -12,10 +12,12 @@
 #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
 #include "hw/ppc/xics.h"        /* For ICSState */
 #include "hw/ppc/spapr_tpm_proxy.h"
+#include "hw/ppc/spapr_nvdimm.h"
 
 struct SpaprVioBus;
 struct SpaprPhbState;
 struct SpaprNvram;
+struct SpaprNVDIMMDeviceFlushState;
 
 typedef struct SpaprEventLogEntry SpaprEventLogEntry;
 typedef struct SpaprEventSource SpaprEventSource;
@@ -245,6 +247,12 @@ struct SpaprMachineState {
     uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
 
     Error *fwnmi_migration_blocker;
+
+    /* nvdimm flush states */
+    QemuMutex spapr_nvdimm_flush_states_lock;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
+
 };
 
 #define H_SUCCESS         0
@@ -538,8 +546,9 @@ struct SpaprMachineState {
 #define H_SCM_BIND_MEM          0x3EC
 #define H_SCM_UNBIND_MEM        0x3F0
 #define H_SCM_UNBIND_ALL        0x3FC
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index abcacda5d7..c88df2c590 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,6 +11,7 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
+#include "migration/vmstate.h"
 
 struct SpaprDrc;
 struct SpaprMachineState;
@@ -22,5 +23,16 @@ void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(void);
+
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    int backend_fd;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+extern const VMStateDescription vmstate_spapr_nvdimm_flush_states;
 
 #endif

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch.

The hcall expects the semantics such that the flush to return
with H_BUSY when the operation is expected to take longer time along
with a continue_token. The hcall to be called again providing the
continue_token to get the status. So, all fresh requsts are put into
a 'pending' list and flush worker is submitted to the thread pool.
The thread pool completion callbacks move the requests to 'completed'
list, which are cleaned up after reporting to guest in subsequent
hcalls to get the status.

The semantics makes it necessary to preserve the continue_tokens
and their return status even across migrations. So, the pre_save
handler for the device waits for the flush worker to complete and
collects all the hcall states from 'completed' list. The necessary
nvdimm flush specific vmstate structures are added to the spapr
machine vmstate.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |   11 ++
 include/hw/ppc/spapr_nvdimm.h |   12 ++
 4 files changed, 268 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d56418ca29..fdb0c73a2c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes();
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
@@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
         &vmstate_spapr_cap_ccf_assist,
         &vmstate_spapr_cap_fwnmi,
         &vmstate_spapr_fwnmi,
+        &vmstate_spapr_nvdimm_flush_states,
         NULL
     }
 };
@@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
     }
 
     qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
+    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
+    QLIST_INIT(&spapr->pending_flush_states);
+    QLIST_INIT(&spapr->completed_flush_states);
 }
 
 #define DEFAULT_KVM_TYPE "auto"
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 8cf3fb2ffb..883317c1ed 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,14 +22,17 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
 #include "hw/mem/nvdimm.h"
+#include "qemu/guest-random.h"
 #include "qemu/nvdimm-utils.h"
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
 
 /*
  * The nvdimm size should be aligned to SCM block size.
@@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static const VMStateDescription vmstate_spapr_nvdimm_entry = {
+     .name = "spapr_nvdimm_states",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+static bool spapr_nvdimm_states_needed(void *opaque)
+{
+     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
+             !QLIST_EMPTY(&spapr->completed_flush_states));
+}
+
+static int spapr_nvdimm_pre_save(void *opaque)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    return 0;
+}
+
+const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
+    .name = "spapr_nvdimm_hcall_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_nvdimm_states_needed,
+    .pre_save = spapr_nvdimm_pre_save,
+    .fields = (VMStateField[]) {
+        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
+                        vmstate_spapr_nvdimm_entry,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Acquire a unique token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
+{
+    Error *err = NULL;
+    uint64_t token;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+retry:
+    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
+        error_report_err(err);
+        g_free(state);
+        qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+        return NULL;
+    }
+
+    if (!token) /* Token should be non-zero */
+        goto retry;
+
+    /* If the token already in use, get a new one */
+    QLIST_FOREACH_SAFE(tmp, &(spapr->pending_flush_states), node, next) {
+        if (tmp->continue_token == token) {
+            goto retry;
+        }
+    }
+    QLIST_FOREACH_SAFE(tmp, &(spapr->completed_flush_states), node, next) {
+        if (tmp->continue_token == token) {
+            goto retry;
+        }
+    }
+
+    state->continue_token = token;
+    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
+
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(void)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    /*
+     * No contention here when called on reset path, the main loop thread
+     * which calls the pending BHs has gotten out running in the reset path,
+     * finally reaching here. Other code path being guest
+     * h_client_architecture_support, thats early boot up.
+     */
+
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
+        QLIST_REMOVE(state, node);
+        g_free(state);
+    }
+}
+
+/*
+ * spapr_nvdimm_get_hcall_status
+ *      Fetches the status of the hcall worker and returns H_BUSY
+ *      if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(uint64_t token)
+{
+    int ret = H_LONG_BUSY_ORDER_10_MSEC;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
+        if (state->continue_token == token) {
+            goto exit;
+        }
+    }
+    ret = H_P2; /* If not found in complete list too, invalid token */
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
+        if (state->continue_token == token) {
+            ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            break;
+        }
+    }
+exit:
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    return ret;
+}
+
+static int flush_worker_cb(void *opaque)
+{
+    int ret = H_SUCCESS;
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    /* flush raw backing image */
+    if (qemu_fdatasync(state->backend_fd) < 0) {
+        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                     strerror(errno));
+        ret = H_HARDWARE;
+    }
+
+    return ret;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
+
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
+ *
+ * Given a DRC Index Flush the data to backend NVDIMM device.
+ * The hcall returns H_BUSY when the flush takes longer time and the hcall
+ * needs to be issued multiple times in order to be completely serviced.
+ * The continue-token from the output to be passed in the argument list of
+ * subsequent hcalls until the hcall is completely serviced at which
+ * point H_SUCCESS or other error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                      target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    if (continue_token != 0) {
+        ret = spapr_nvdimm_get_flush_status(continue_token);
+        if (H_IS_LONG_BUSY(ret)) {
+            args[0] = continue_token;
+        }
+
+        return ret;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    backend = MEMORY_BACKEND(dimm->hostmem);
+
+    state = spapr_nvdimm_init_new_flush_state();
+    if (!state) {
+        return H_P2;
+    }
+
+    state->backend_fd = memory_region_get_fd(&backend->mr);
+
+    thread_pool_submit_aio(pool, flush_worker_cb, state,
+                           spapr_nvdimm_flush_completion_cb, state);
+
+    ret = spapr_nvdimm_get_flush_status(state->continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = state->continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -487,6 +726,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 47cebaf3ac..7c27fb3e2d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -12,10 +12,12 @@
 #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
 #include "hw/ppc/xics.h"        /* For ICSState */
 #include "hw/ppc/spapr_tpm_proxy.h"
+#include "hw/ppc/spapr_nvdimm.h"
 
 struct SpaprVioBus;
 struct SpaprPhbState;
 struct SpaprNvram;
+struct SpaprNVDIMMDeviceFlushState;
 
 typedef struct SpaprEventLogEntry SpaprEventLogEntry;
 typedef struct SpaprEventSource SpaprEventSource;
@@ -245,6 +247,12 @@ struct SpaprMachineState {
     uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
 
     Error *fwnmi_migration_blocker;
+
+    /* nvdimm flush states */
+    QemuMutex spapr_nvdimm_flush_states_lock;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
+
 };
 
 #define H_SUCCESS         0
@@ -538,8 +546,9 @@ struct SpaprMachineState {
 #define H_SCM_BIND_MEM          0x3EC
 #define H_SCM_UNBIND_MEM        0x3F0
 #define H_SCM_UNBIND_ALL        0x3FC
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index abcacda5d7..c88df2c590 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,6 +11,7 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
+#include "migration/vmstate.h"
 
 struct SpaprDrc;
 struct SpaprMachineState;
@@ -22,5 +23,16 @@ void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(void);
+
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    int backend_fd;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+extern const VMStateDescription vmstate_spapr_nvdimm_flush_states;
 
 #endif




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch.

The hcall expects the semantics such that the flush to return
with H_BUSY when the operation is expected to take longer time along
with a continue_token. The hcall to be called again providing the
continue_token to get the status. So, all fresh requsts are put into
a 'pending' list and flush worker is submitted to the thread pool.
The thread pool completion callbacks move the requests to 'completed'
list, which are cleaned up after reporting to guest in subsequent
hcalls to get the status.

The semantics makes it necessary to preserve the continue_tokens
and their return status even across migrations. So, the pre_save
handler for the device waits for the flush worker to complete and
collects all the hcall states from 'completed' list. The necessary
nvdimm flush specific vmstate structures are added to the spapr
machine vmstate.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/ppc/spapr.c                |    6 +
 hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h        |   11 ++
 include/hw/ppc/spapr_nvdimm.h |   12 ++
 4 files changed, 268 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d56418ca29..fdb0c73a2c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
     }
 
+    spapr_nvdimm_finish_flushes();
+
     /* DRC reset may cause a device to be unplugged. This will cause troubles
      * if this device is used by another device (eg, a running vhost backend
      * will crash QEMU if the DIMM holding the vring goes away). To avoid such
@@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
         &vmstate_spapr_cap_ccf_assist,
         &vmstate_spapr_cap_fwnmi,
         &vmstate_spapr_fwnmi,
+        &vmstate_spapr_nvdimm_flush_states,
         NULL
     }
 };
@@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
     }
 
     qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
+    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
+    QLIST_INIT(&spapr->pending_flush_states);
+    QLIST_INIT(&spapr->completed_flush_states);
 }
 
 #define DEFAULT_KVM_TYPE "auto"
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 8cf3fb2ffb..883317c1ed 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,14 +22,17 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
 #include "hw/mem/nvdimm.h"
+#include "qemu/guest-random.h"
 #include "qemu/nvdimm-utils.h"
 #include "hw/ppc/fdt.h"
 #include "qemu/range.h"
 #include "hw/ppc/spapr_numa.h"
+#include "block/thread-pool.h"
 
 /*
  * The nvdimm size should be aligned to SCM block size.
@@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static const VMStateDescription vmstate_spapr_nvdimm_entry = {
+     .name = "spapr_nvdimm_states",
+     .version_id = 1,
+     .minimum_version_id = 1,
+     .fields = (VMStateField[]) {
+         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
+         VMSTATE_END_OF_LIST()
+     },
+};
+
+static bool spapr_nvdimm_states_needed(void *opaque)
+{
+     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
+             !QLIST_EMPTY(&spapr->completed_flush_states));
+}
+
+static int spapr_nvdimm_pre_save(void *opaque)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    return 0;
+}
+
+const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
+    .name = "spapr_nvdimm_hcall_states",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = spapr_nvdimm_states_needed,
+    .pre_save = spapr_nvdimm_pre_save,
+    .fields = (VMStateField[]) {
+        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
+                        vmstate_spapr_nvdimm_entry,
+                        SpaprNVDIMMDeviceFlushState, node),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * Acquire a unique token and reserve it for the new flush state.
+ */
+static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
+{
+    Error *err = NULL;
+    uint64_t token;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
+
+    state = g_malloc0(sizeof(*state));
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+retry:
+    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
+        error_report_err(err);
+        g_free(state);
+        qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+        return NULL;
+    }
+
+    if (!token) /* Token should be non-zero */
+        goto retry;
+
+    /* If the token already in use, get a new one */
+    QLIST_FOREACH_SAFE(tmp, &(spapr->pending_flush_states), node, next) {
+        if (tmp->continue_token = token) {
+            goto retry;
+        }
+    }
+    QLIST_FOREACH_SAFE(tmp, &(spapr->completed_flush_states), node, next) {
+        if (tmp->continue_token = token) {
+            goto retry;
+        }
+    }
+
+    state->continue_token = token;
+    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
+
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    return state;
+}
+
+/*
+ * spapr_nvdimm_finish_flushes
+ *      Waits for all pending flush requests to complete
+ *      their execution and free the states
+ */
+void spapr_nvdimm_finish_flushes(void)
+{
+    SpaprNVDIMMDeviceFlushState *state, *next;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    /*
+     * No contention here when called on reset path, the main loop thread
+     * which calls the pending BHs has gotten out running in the reset path,
+     * finally reaching here. Other code path being guest
+     * h_client_architecture_support, thats early boot up.
+     */
+
+    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
+        aio_poll(qemu_get_aio_context(), true);
+    }
+
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
+        QLIST_REMOVE(state, node);
+        g_free(state);
+    }
+}
+
+/*
+ * spapr_nvdimm_get_hcall_status
+ *      Fetches the status of the hcall worker and returns H_BUSY
+ *      if the worker is still running.
+ */
+static int spapr_nvdimm_get_flush_status(uint64_t token)
+{
+    int ret = H_LONG_BUSY_ORDER_10_MSEC;
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state, *node;
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
+        if (state->continue_token = token) {
+            goto exit;
+        }
+    }
+    ret = H_P2; /* If not found in complete list too, invalid token */
+    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
+        if (state->continue_token = token) {
+            ret = state->hcall_ret;
+            QLIST_REMOVE(state, node);
+            g_free(state);
+            break;
+        }
+    }
+exit:
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    return ret;
+}
+
+static int flush_worker_cb(void *opaque)
+{
+    int ret = H_SUCCESS;
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    /* flush raw backing image */
+    if (qemu_fdatasync(state->backend_fd) < 0) {
+        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+                     strerror(errno));
+        ret = H_HARDWARE;
+    }
+
+    return ret;
+}
+
+static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    SpaprNVDIMMDeviceFlushState *state = opaque;
+
+    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
+
+    state->hcall_ret = hcall_ret;
+    QLIST_REMOVE(state, node);
+    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
+
+    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
+}
+
+/*
+ * H_SCM_FLUSH
+ * Input: drc_index, continue-token
+ * Out: continue-token
+ * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
+ *
+ * Given a DRC Index Flush the data to backend NVDIMM device.
+ * The hcall returns H_BUSY when the flush takes longer time and the hcall
+ * needs to be issued multiple times in order to be completely serviced.
+ * The continue-token from the output to be passed in the argument list of
+ * subsequent hcalls until the hcall is completely serviced at which
+ * point H_SUCCESS or other error is returned.
+ */
+static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                                      target_ulong opcode, target_ulong *args)
+{
+    int ret;
+    uint32_t drc_index = args[0];
+    uint64_t continue_token = args[1];
+    SpaprDrc *drc = spapr_drc_by_index(drc_index);
+    PCDIMMDevice *dimm;
+    HostMemoryBackend *backend = NULL;
+    SpaprNVDIMMDeviceFlushState *state;
+    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+    if (!drc || !drc->dev ||
+        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+        return H_PARAMETER;
+    }
+
+    if (continue_token != 0) {
+        ret = spapr_nvdimm_get_flush_status(continue_token);
+        if (H_IS_LONG_BUSY(ret)) {
+            args[0] = continue_token;
+        }
+
+        return ret;
+    }
+
+    dimm = PC_DIMM(drc->dev);
+    backend = MEMORY_BACKEND(dimm->hostmem);
+
+    state = spapr_nvdimm_init_new_flush_state();
+    if (!state) {
+        return H_P2;
+    }
+
+    state->backend_fd = memory_region_get_fd(&backend->mr);
+
+    thread_pool_submit_aio(pool, flush_worker_cb, state,
+                           spapr_nvdimm_flush_completion_cb, state);
+
+    ret = spapr_nvdimm_get_flush_status(state->continue_token);
+    if (H_IS_LONG_BUSY(ret)) {
+        args[0] = state->continue_token;
+    }
+
+    return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      target_ulong opcode, target_ulong *args)
 {
@@ -487,6 +726,7 @@ static void spapr_scm_register_types(void)
     spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
     spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
+    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 47cebaf3ac..7c27fb3e2d 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -12,10 +12,12 @@
 #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
 #include "hw/ppc/xics.h"        /* For ICSState */
 #include "hw/ppc/spapr_tpm_proxy.h"
+#include "hw/ppc/spapr_nvdimm.h"
 
 struct SpaprVioBus;
 struct SpaprPhbState;
 struct SpaprNvram;
+struct SpaprNVDIMMDeviceFlushState;
 
 typedef struct SpaprEventLogEntry SpaprEventLogEntry;
 typedef struct SpaprEventSource SpaprEventSource;
@@ -245,6 +247,12 @@ struct SpaprMachineState {
     uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
 
     Error *fwnmi_migration_blocker;
+
+    /* nvdimm flush states */
+    QemuMutex spapr_nvdimm_flush_states_lock;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
+    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
+
 };
 
 #define H_SUCCESS         0
@@ -538,8 +546,9 @@ struct SpaprMachineState {
 #define H_SCM_BIND_MEM          0x3EC
 #define H_SCM_UNBIND_MEM        0x3F0
 #define H_SCM_UNBIND_ALL        0x3FC
+#define H_SCM_FLUSH             0x44C
 
-#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
+#define MAX_HCALL_OPCODE        H_SCM_FLUSH
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
index abcacda5d7..c88df2c590 100644
--- a/include/hw/ppc/spapr_nvdimm.h
+++ b/include/hw/ppc/spapr_nvdimm.h
@@ -11,6 +11,7 @@
 #define HW_SPAPR_NVDIMM_H
 
 #include "hw/mem/nvdimm.h"
+#include "migration/vmstate.h"
 
 struct SpaprDrc;
 struct SpaprMachineState;
@@ -22,5 +23,16 @@ void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
 bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
                            uint64_t size, Error **errp);
 void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
+void spapr_nvdimm_finish_flushes(void);
+
+typedef struct SpaprNVDIMMDeviceFlushState {
+    uint64_t continue_token;
+    int64_t hcall_ret;
+    int backend_fd;
+
+    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
+} SpaprNVDIMMDeviceFlushState;
+
+extern const VMStateDescription vmstate_spapr_nvdimm_flush_states;
 
 #endif


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
  2021-03-23 13:47 ` Shivaprasad G Bhat
  (?)
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds the 'sync-dax' property to the nvdimm device.

When the sync-dax is 'off', the device tree property
"hcall-flush-required" is added to the nvdimm node which makes the
guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
This would be the default behaviour without sync-dax property set
for the nvdimm device.

The sync-dax="on" would mean the guest need not make flush requests
to the qemu. On previous machine versions the sync-dax is set to be
"on" by default using the hw_compat magic.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/core/machine.c       |    1 +
 hw/mem/nvdimm.c         |    1 +
 hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
 include/hw/mem/nvdimm.h |   10 ++++++++++
 include/hw/ppc/spapr.h  |    1 +
 5 files changed, 30 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 257a664ea2..f843643574 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
     { "PIIX4_PM", "smm-compat", "on"},
     { "virtio-blk-device", "report-discard-granularity", "off" },
     { "virtio-net-pci", "vectors", "3"},
+    { "nvdimm", "sync-dax", "on" },
 };
 const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
 
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..8f0e29b191 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
 
 static Property nvdimm_properties[] = {
     DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
+    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 883317c1ed..dd1c90251b 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
     uint64_t lsize = nvdimm->label_size;
     uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
                                             NULL);
+    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
+                                             NVDIMM_SYNC_DAX_PROP,
+                                             &error_abort);
 
     drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
     g_assert(drc);
@@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (!sync_dax) {
+        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                         NULL, 0));
+    }
+
     return child_offset;
 }
 
@@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                       target_ulong opcode, target_ulong *args)
 {
     int ret;
+    bool sync_dax;
     uint32_t drc_index = args[0];
     uint64_t continue_token = args[1];
     SpaprDrc *drc = spapr_drc_by_index(drc_index);
     PCDIMMDevice *dimm;
+    NVDIMMDevice *nvdimm;
     HostMemoryBackend *backend = NULL;
     SpaprNVDIMMDeviceFlushState *state;
     ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
@@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
         return H_PARAMETER;
     }
 
+    nvdimm = NVDIMM(drc->dev);
+    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
+                                        &error_abort);
+    if (sync_dax) {
+        return H_UNSUPPORTED;
+    }
+
     if (continue_token != 0) {
         ret = spapr_nvdimm_get_flush_status(continue_token);
         if (H_IS_LONG_BUSY(ret)) {
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..f82979cf2f 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
 #define NVDIMM_UUID_PROP       "uuid"
 #define NVDIMM_UNARMED_PROP    "unarmed"
+#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
 
 struct NVDIMMDevice {
     /* private */
@@ -85,6 +86,15 @@ struct NVDIMMDevice {
      */
     bool unarmed;
 
+    /*
+     * On PPC64,
+     * The 'off' value results in the hcall-flush-required property set
+     * in the device tree for pseries machines. When 'off', the guest
+     * initiates explicit flush requests to the backend device ensuring
+     * write persistence.
+     */
+    bool sync_dax;
+
     /*
      * The PPC64 - spapr requires each nvdimm device have a uuid.
      */
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 7c27fb3e2d..51c35488a4 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -333,6 +333,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: linux-nvdimm, aneesh.kumar, qemu-devel, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds the 'sync-dax' property to the nvdimm device.

When the sync-dax is 'off', the device tree property
"hcall-flush-required" is added to the nvdimm node which makes the
guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
This would be the default behaviour without sync-dax property set
for the nvdimm device.

The sync-dax="on" would mean the guest need not make flush requests
to the qemu. On previous machine versions the sync-dax is set to be
"on" by default using the hw_compat magic.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/core/machine.c       |    1 +
 hw/mem/nvdimm.c         |    1 +
 hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
 include/hw/mem/nvdimm.h |   10 ++++++++++
 include/hw/ppc/spapr.h  |    1 +
 5 files changed, 30 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 257a664ea2..f843643574 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
     { "PIIX4_PM", "smm-compat", "on"},
     { "virtio-blk-device", "report-discard-granularity", "off" },
     { "virtio-net-pci", "vectors", "3"},
+    { "nvdimm", "sync-dax", "on" },
 };
 const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
 
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..8f0e29b191 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
 
 static Property nvdimm_properties[] = {
     DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
+    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 883317c1ed..dd1c90251b 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
     uint64_t lsize = nvdimm->label_size;
     uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
                                             NULL);
+    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
+                                             NVDIMM_SYNC_DAX_PROP,
+                                             &error_abort);
 
     drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
     g_assert(drc);
@@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (!sync_dax) {
+        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                         NULL, 0));
+    }
+
     return child_offset;
 }
 
@@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                       target_ulong opcode, target_ulong *args)
 {
     int ret;
+    bool sync_dax;
     uint32_t drc_index = args[0];
     uint64_t continue_token = args[1];
     SpaprDrc *drc = spapr_drc_by_index(drc_index);
     PCDIMMDevice *dimm;
+    NVDIMMDevice *nvdimm;
     HostMemoryBackend *backend = NULL;
     SpaprNVDIMMDeviceFlushState *state;
     ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
@@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
         return H_PARAMETER;
     }
 
+    nvdimm = NVDIMM(drc->dev);
+    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
+                                        &error_abort);
+    if (sync_dax) {
+        return H_UNSUPPORTED;
+    }
+
     if (continue_token != 0) {
         ret = spapr_nvdimm_get_flush_status(continue_token);
         if (H_IS_LONG_BUSY(ret)) {
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..f82979cf2f 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
 #define NVDIMM_UUID_PROP       "uuid"
 #define NVDIMM_UNARMED_PROP    "unarmed"
+#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
 
 struct NVDIMMDevice {
     /* private */
@@ -85,6 +86,15 @@ struct NVDIMMDevice {
      */
     bool unarmed;
 
+    /*
+     * On PPC64,
+     * The 'off' value results in the hcall-flush-required property set
+     * in the device tree for pseries machines. When 'off', the guest
+     * initiates explicit flush requests to the backend device ensuring
+     * write persistence.
+     */
+    bool sync_dax;
+
     /*
      * The PPC64 - spapr requires each nvdimm device have a uuid.
      */
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 7c27fb3e2d..51c35488a4 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -333,6 +333,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
@ 2021-03-23 13:47   ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-23 13:47 UTC (permalink / raw)
  To: sbhat, david, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric
  Cc: qemu-devel, aneesh.kumar, linux-nvdimm, kvm-ppc, shivaprasadbhat,
	bharata

The patch adds the 'sync-dax' property to the nvdimm device.

When the sync-dax is 'off', the device tree property
"hcall-flush-required" is added to the nvdimm node which makes the
guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
This would be the default behaviour without sync-dax property set
for the nvdimm device.

The sync-dax="on" would mean the guest need not make flush requests
to the qemu. On previous machine versions the sync-dax is set to be
"on" by default using the hw_compat magic.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
---
 hw/core/machine.c       |    1 +
 hw/mem/nvdimm.c         |    1 +
 hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
 include/hw/mem/nvdimm.h |   10 ++++++++++
 include/hw/ppc/spapr.h  |    1 +
 5 files changed, 30 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 257a664ea2..f843643574 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
     { "PIIX4_PM", "smm-compat", "on"},
     { "virtio-blk-device", "report-discard-granularity", "off" },
     { "virtio-net-pci", "vectors", "3"},
+    { "nvdimm", "sync-dax", "on" },
 };
 const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
 
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..8f0e29b191 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
 
 static Property nvdimm_properties[] = {
     DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
+    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index 883317c1ed..dd1c90251b 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
     uint64_t lsize = nvdimm->label_size;
     uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
                                             NULL);
+    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
+                                             NVDIMM_SYNC_DAX_PROP,
+                                             &error_abort);
 
     drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
     g_assert(drc);
@@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
                              "operating-system")));
     _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+    if (!sync_dax) {
+        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
+                         NULL, 0));
+    }
+
     return child_offset;
 }
 
@@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                       target_ulong opcode, target_ulong *args)
 {
     int ret;
+    bool sync_dax;
     uint32_t drc_index = args[0];
     uint64_t continue_token = args[1];
     SpaprDrc *drc = spapr_drc_by_index(drc_index);
     PCDIMMDevice *dimm;
+    NVDIMMDevice *nvdimm;
     HostMemoryBackend *backend = NULL;
     SpaprNVDIMMDeviceFlushState *state;
     ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
@@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
         return H_PARAMETER;
     }
 
+    nvdimm = NVDIMM(drc->dev);
+    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
+                                        &error_abort);
+    if (sync_dax) {
+        return H_UNSUPPORTED;
+    }
+
     if (continue_token != 0) {
         ret = spapr_nvdimm_get_flush_status(continue_token);
         if (H_IS_LONG_BUSY(ret)) {
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..f82979cf2f 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
 #define NVDIMM_UUID_PROP       "uuid"
 #define NVDIMM_UNARMED_PROP    "unarmed"
+#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
 
 struct NVDIMMDevice {
     /* private */
@@ -85,6 +86,15 @@ struct NVDIMMDevice {
      */
     bool unarmed;
 
+    /*
+     * On PPC64,
+     * The 'off' value results in the hcall-flush-required property set
+     * in the device tree for pseries machines. When 'off', the guest
+     * initiates explicit flush requests to the backend device ensuring
+     * write persistence.
+     */
+    bool sync_dax;
+
     /*
      * The PPC64 - spapr requires each nvdimm device have a uuid.
      */
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 7c27fb3e2d..51c35488a4 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -333,6 +333,7 @@ struct SpaprMachineState {
 #define H_P7              -60
 #define H_P8              -61
 #define H_P9              -62
+#define H_UNSUPPORTED     -67
 #define H_OVERLAP         -68
 #define H_UNSUPPORTED_FLAG -256
 #define H_MULTI_THREADS_ACTIVE -9005


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions
  2021-03-23 13:47   ` Shivaprasad G Bhat
  (?)
@ 2021-03-24  2:30     ` David Gibson
  -1 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  2:30 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata


[-- Attachment #1.1: Type: text/plain, Size: 3442 bytes --]

On Tue, Mar 23, 2021 at 09:47:23AM -0400, Shivaprasad G Bhat wrote:
> The subsequent patches add definitions which tend to
> get the compilation to cyclic dependency. So, prepare
> with forward declarations, move the defitions and clean up.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
>  include/hw/ppc/spapr_nvdimm.h |   21 ++++++---------------
>  2 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index b46c36917c..8cf3fb2ffb 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -31,6 +31,18 @@
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
>  
> +/*
> + * The nvdimm size should be aligned to SCM block size.
> + * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> + * inorder to have SCM regions not to overlap with dimm memory regions.
> + * The SCM devices can have variable block sizes. For now, fixing the
> + * block size to the minimum value.
> + */
> +#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> +
> +/* Have an explicit check for alignment */
> +QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
> +
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp)
>  {
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 73be250e2a..abcacda5d7 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,23 +11,14 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> -#include "hw/ppc/spapr.h"
>  
> -/*
> - * The nvdimm size should be aligned to SCM block size.
> - * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> - * inorder to have SCM regions not to overlap with dimm memory regions.
> - * The SCM devices can have variable block sizes. For now, fixing the
> - * block size to the minimum value.
> - */
> -#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> -
> -/* Have an explicit check for alignment */
> -QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
> +struct SpaprDrc;
> +struct SpaprMachineState;
>  
> -int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
> -                           void *fdt, int *fdt_start_offset, Error **errp);
> -void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
> +int spapr_pmem_dt_populate(struct SpaprDrc *drc,

Using explicit struct tags is against qemu coding style.   You should
put a typedef on the forward decl so you don't need to do it here (see
examples in spapr_pci.c amongst other places).

> +                           struct SpaprMachineState *spapr, void *fdt,
> +                           int *fdt_start_offset, Error **errp);
> +void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 167 bytes --]

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions
@ 2021-03-24  2:30     ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  2:30 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: ehabkost, mst, aneesh.kumar, bharata, linux-nvdimm, groug,
	kvm-ppc, qemu-devel, shivaprasadbhat, qemu-ppc, imammedo, sbhat,
	xiaoguangrong.eric

[-- Attachment #1: Type: text/plain, Size: 3442 bytes --]

On Tue, Mar 23, 2021 at 09:47:23AM -0400, Shivaprasad G Bhat wrote:
> The subsequent patches add definitions which tend to
> get the compilation to cyclic dependency. So, prepare
> with forward declarations, move the defitions and clean up.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
>  include/hw/ppc/spapr_nvdimm.h |   21 ++++++---------------
>  2 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index b46c36917c..8cf3fb2ffb 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -31,6 +31,18 @@
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
>  
> +/*
> + * The nvdimm size should be aligned to SCM block size.
> + * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> + * inorder to have SCM regions not to overlap with dimm memory regions.
> + * The SCM devices can have variable block sizes. For now, fixing the
> + * block size to the minimum value.
> + */
> +#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> +
> +/* Have an explicit check for alignment */
> +QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
> +
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp)
>  {
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 73be250e2a..abcacda5d7 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,23 +11,14 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> -#include "hw/ppc/spapr.h"
>  
> -/*
> - * The nvdimm size should be aligned to SCM block size.
> - * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> - * inorder to have SCM regions not to overlap with dimm memory regions.
> - * The SCM devices can have variable block sizes. For now, fixing the
> - * block size to the minimum value.
> - */
> -#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> -
> -/* Have an explicit check for alignment */
> -QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
> +struct SpaprDrc;
> +struct SpaprMachineState;
>  
> -int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
> -                           void *fdt, int *fdt_start_offset, Error **errp);
> -void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
> +int spapr_pmem_dt_populate(struct SpaprDrc *drc,

Using explicit struct tags is against qemu coding style.   You should
put a typedef on the forward decl so you don't need to do it here (see
examples in spapr_pci.c amongst other places).

> +                           struct SpaprMachineState *spapr, void *fdt,
> +                           int *fdt_start_offset, Error **errp);
> +void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions
@ 2021-03-24  2:30     ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  2:30 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata

[-- Attachment #1: Type: text/plain, Size: 3442 bytes --]

On Tue, Mar 23, 2021 at 09:47:23AM -0400, Shivaprasad G Bhat wrote:
> The subsequent patches add definitions which tend to
> get the compilation to cyclic dependency. So, prepare
> with forward declarations, move the defitions and clean up.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/ppc/spapr_nvdimm.c         |   12 ++++++++++++
>  include/hw/ppc/spapr_nvdimm.h |   21 ++++++---------------
>  2 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index b46c36917c..8cf3fb2ffb 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -31,6 +31,18 @@
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
>  
> +/*
> + * The nvdimm size should be aligned to SCM block size.
> + * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> + * inorder to have SCM regions not to overlap with dimm memory regions.
> + * The SCM devices can have variable block sizes. For now, fixing the
> + * block size to the minimum value.
> + */
> +#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> +
> +/* Have an explicit check for alignment */
> +QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
> +
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp)
>  {
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index 73be250e2a..abcacda5d7 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,23 +11,14 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> -#include "hw/ppc/spapr.h"
>  
> -/*
> - * The nvdimm size should be aligned to SCM block size.
> - * The SCM block size should be aligned to SPAPR_MEMORY_BLOCK_SIZE
> - * inorder to have SCM regions not to overlap with dimm memory regions.
> - * The SCM devices can have variable block sizes. For now, fixing the
> - * block size to the minimum value.
> - */
> -#define SPAPR_MINIMUM_SCM_BLOCK_SIZE SPAPR_MEMORY_BLOCK_SIZE
> -
> -/* Have an explicit check for alignment */
> -QEMU_BUILD_BUG_ON(SPAPR_MINIMUM_SCM_BLOCK_SIZE % SPAPR_MEMORY_BLOCK_SIZE);
> +struct SpaprDrc;
> +struct SpaprMachineState;
>  
> -int spapr_pmem_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
> -                           void *fdt, int *fdt_start_offset, Error **errp);
> -void spapr_dt_persistent_memory(SpaprMachineState *spapr, void *fdt);
> +int spapr_pmem_dt_populate(struct SpaprDrc *drc,

Using explicit struct tags is against qemu coding style.   You should
put a typedef on the forward decl so you don't need to do it here (see
examples in spapr_pci.c amongst other places).

> +                           struct SpaprMachineState *spapr, void *fdt,
> +                           int *fdt_start_offset, Error **errp);
> +void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-03-23 13:47   ` Shivaprasad G Bhat
  (?)
@ 2021-03-24  3:07     ` David Gibson
  -1 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  3:07 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata


[-- Attachment #1.1: Type: text/plain, Size: 15855 bytes --]

On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for the nvdimm devices.
> To be available for exploitation by guest through the next patch.
> 
> The hcall expects the semantics such that the flush to return
> with H_BUSY when the operation is expected to take longer time along
> with a continue_token. The hcall to be called again providing the
> continue_token to get the status. So, all fresh requsts are put into
> a 'pending' list and flush worker is submitted to the thread pool.
> The thread pool completion callbacks move the requests to 'completed'
> list, which are cleaned up after reporting to guest in subsequent
> hcalls to get the status.
> 
> The semantics makes it necessary to preserve the continue_tokens
> and their return status even across migrations. So, the pre_save
> handler for the device waits for the flush worker to complete and
> collects all the hcall states from 'completed' list. The necessary
> nvdimm flush specific vmstate structures are added to the spapr
> machine vmstate.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>

An overal question: surely the same issue must arise on x86 with
file-backed NVDIMMs.  How do they handle this case?

> ---
>  hw/ppc/spapr.c                |    6 +
>  hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h        |   11 ++
>  include/hw/ppc/spapr_nvdimm.h |   12 ++
>  4 files changed, 268 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d56418ca29..fdb0c73a2c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>      }
>  
> +    spapr_nvdimm_finish_flushes();
> +
>      /* DRC reset may cause a device to be unplugged. This will cause troubles
>       * if this device is used by another device (eg, a running vhost backend
>       * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> @@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
>          &vmstate_spapr_cap_ccf_assist,
>          &vmstate_spapr_cap_fwnmi,
>          &vmstate_spapr_fwnmi,
> +        &vmstate_spapr_nvdimm_flush_states,
>          NULL
>      }
>  };
> @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
>      }
>  
>      qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);

Do you actually need an extra mutex, or can you rely on the BQL?

> +    QLIST_INIT(&spapr->pending_flush_states);
> +    QLIST_INIT(&spapr->completed_flush_states);
>  }
>  
>  #define DEFAULT_KVM_TYPE "auto"
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 8cf3fb2ffb..883317c1ed 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,14 +22,17 @@
>   * THE SOFTWARE.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "hw/ppc/spapr_nvdimm.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/guest-random.h"
>  #include "qemu/nvdimm-utils.h"
>  #include "hw/ppc/fdt.h"
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
>  
>  /*
>   * The nvdimm size should be aligned to SCM block size.
> @@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static const VMStateDescription vmstate_spapr_nvdimm_entry = {
> +     .name = "spapr_nvdimm_states",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +static bool spapr_nvdimm_states_needed(void *opaque)
> +{
> +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> +             !QLIST_EMPTY(&spapr->completed_flush_states));
> +}
> +
> +static int spapr_nvdimm_pre_save(void *opaque)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);

Hmm... how long could waiting for all the pending flushes to complete
take?  This could add substanially to the guest's migration downtime,
couldn't it?

> +    }
> +
> +    return 0;
> +}
> +
> +const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
> +    .name = "spapr_nvdimm_hcall_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_nvdimm_states_needed,
> +    .pre_save = spapr_nvdimm_pre_save,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
> +                        vmstate_spapr_nvdimm_entry,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Acquire a unique token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
> +{
> +    Error *err = NULL;
> +    uint64_t token;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +retry:
> +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {

Using getrandom seems like overkill, why not just use a counter?

> +        error_report_err(err);
> +        g_free(state);
> +        qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +        return NULL;
> +    }
> +
> +    if (!token) /* Token should be non-zero */
> +        goto retry;
> +
> +    /* If the token already in use, get a new one */
> +    QLIST_FOREACH_SAFE(tmp, &(spapr->pending_flush_states), node, next) {
> +        if (tmp->continue_token == token) {
> +            goto retry;
> +        }
> +    }
> +    QLIST_FOREACH_SAFE(tmp, &(spapr->completed_flush_states), node, next) {
> +        if (tmp->continue_token == token) {
> +            goto retry;
> +        }
> +    }
> +
> +    state->continue_token = token;
> +    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
> +
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(void)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The caller has natural access to the machine, so pass it in rather
than using the global.

> +
> +    /*
> +     * No contention here when called on reset path, the main loop thread
> +     * which calls the pending BHs has gotten out running in the reset path,
> +     * finally reaching here. Other code path being guest
> +     * h_client_architecture_support, thats early boot up.
> +     */
> +
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
> +        QLIST_REMOVE(state, node);
> +        g_free(state);
> +    }
> +}
> +
> +/*
> + * spapr_nvdimm_get_hcall_status
> + *      Fetches the status of the hcall worker and returns H_BUSY
> + *      if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(uint64_t token)
> +{
> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The callers have natural access to spapr, so pass it in rather than
using the global.

> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            goto exit;
> +        }
> +    }
> +    ret = H_P2; /* If not found in complete list too, invalid token */
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);
> +            break;
> +        }
> +    }
> +exit:
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    return ret;
> +}
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +    int ret = H_SUCCESS;
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    /* flush raw backing image */
> +    if (qemu_fdatasync(state->backend_fd) < 0) {
> +        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                     strerror(errno));
> +        ret = H_HARDWARE;
> +    }
> +
> +    return ret;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
> +
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
> + *
> + * Given a DRC Index Flush the data to backend NVDIMM device.
> + * The hcall returns H_BUSY when the flush takes longer time and the hcall

It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
it?


> + * needs to be issued multiple times in order to be completely serviced.
> + * The continue-token from the output to be passed in the argument list of
> + * subsequent hcalls until the hcall is completely serviced at which
> + * point H_SUCCESS or other error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                      target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (continue_token != 0) {
> +        ret = spapr_nvdimm_get_flush_status(continue_token);
> +        if (H_IS_LONG_BUSY(ret)) {
> +            args[0] = continue_token;
> +        }
> +
> +        return ret;
> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    backend = MEMORY_BACKEND(dimm->hostmem);
> +
> +    state = spapr_nvdimm_init_new_flush_state();
> +    if (!state) {
> +        return H_P2;

AFAICT the only way init_new_flush_state() fails is a failure in the
RNG, which definitely isn't a parameter problem.

> +    }
> +
> +    state->backend_fd = memory_region_get_fd(&backend->mr);

Is this guaranteed to return a usable fd in all configurations?

> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                           spapr_nvdimm_flush_completion_cb, state);
> +
> +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = state->continue_token;
> +    }
> +
> +    return ret;

I believe you can rearrange this so the get_flush_status / check /
return is shared between the args[0] == 0 and args[0] == token paths.

> +}
> +
>  static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                       target_ulong opcode, target_ulong *args)
>  {
> @@ -487,6 +726,7 @@ static void spapr_scm_register_types(void)
>      spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>  }
>  
>  type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 47cebaf3ac..7c27fb3e2d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -12,10 +12,12 @@
>  #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
>  #include "hw/ppc/xics.h"        /* For ICSState */
>  #include "hw/ppc/spapr_tpm_proxy.h"
> +#include "hw/ppc/spapr_nvdimm.h"
>  
>  struct SpaprVioBus;
>  struct SpaprPhbState;
>  struct SpaprNvram;
> +struct SpaprNVDIMMDeviceFlushState;
>  
>  typedef struct SpaprEventLogEntry SpaprEventLogEntry;
>  typedef struct SpaprEventSource SpaprEventSource;
> @@ -245,6 +247,12 @@ struct SpaprMachineState {
>      uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
>  
>      Error *fwnmi_migration_blocker;
> +
> +    /* nvdimm flush states */
> +    QemuMutex spapr_nvdimm_flush_states_lock;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
> +
>  };
>  
>  #define H_SUCCESS         0
> @@ -538,8 +546,9 @@ struct SpaprMachineState {
>  #define H_SCM_BIND_MEM          0x3EC
>  #define H_SCM_UNBIND_MEM        0x3F0
>  #define H_SCM_UNBIND_ALL        0x3FC
> +#define H_SCM_FLUSH             0x44C
>  
> -#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index abcacda5d7..c88df2c590 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,6 +11,7 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> +#include "migration/vmstate.h"
>  
>  struct SpaprDrc;
>  struct SpaprMachineState;
> @@ -22,5 +23,16 @@ void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(void);
> +
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    int backend_fd;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +extern const VMStateDescription vmstate_spapr_nvdimm_flush_states;
>  
>  #endif
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 167 bytes --]

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-24  3:07     ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  3:07 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: ehabkost, mst, aneesh.kumar, bharata, linux-nvdimm, groug,
	kvm-ppc, qemu-devel, shivaprasadbhat, qemu-ppc, imammedo, sbhat,
	xiaoguangrong.eric

[-- Attachment #1: Type: text/plain, Size: 15855 bytes --]

On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for the nvdimm devices.
> To be available for exploitation by guest through the next patch.
> 
> The hcall expects the semantics such that the flush to return
> with H_BUSY when the operation is expected to take longer time along
> with a continue_token. The hcall to be called again providing the
> continue_token to get the status. So, all fresh requsts are put into
> a 'pending' list and flush worker is submitted to the thread pool.
> The thread pool completion callbacks move the requests to 'completed'
> list, which are cleaned up after reporting to guest in subsequent
> hcalls to get the status.
> 
> The semantics makes it necessary to preserve the continue_tokens
> and their return status even across migrations. So, the pre_save
> handler for the device waits for the flush worker to complete and
> collects all the hcall states from 'completed' list. The necessary
> nvdimm flush specific vmstate structures are added to the spapr
> machine vmstate.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>

An overal question: surely the same issue must arise on x86 with
file-backed NVDIMMs.  How do they handle this case?

> ---
>  hw/ppc/spapr.c                |    6 +
>  hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h        |   11 ++
>  include/hw/ppc/spapr_nvdimm.h |   12 ++
>  4 files changed, 268 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d56418ca29..fdb0c73a2c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>      }
>  
> +    spapr_nvdimm_finish_flushes();
> +
>      /* DRC reset may cause a device to be unplugged. This will cause troubles
>       * if this device is used by another device (eg, a running vhost backend
>       * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> @@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
>          &vmstate_spapr_cap_ccf_assist,
>          &vmstate_spapr_cap_fwnmi,
>          &vmstate_spapr_fwnmi,
> +        &vmstate_spapr_nvdimm_flush_states,
>          NULL
>      }
>  };
> @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
>      }
>  
>      qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);

Do you actually need an extra mutex, or can you rely on the BQL?

> +    QLIST_INIT(&spapr->pending_flush_states);
> +    QLIST_INIT(&spapr->completed_flush_states);
>  }
>  
>  #define DEFAULT_KVM_TYPE "auto"
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 8cf3fb2ffb..883317c1ed 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,14 +22,17 @@
>   * THE SOFTWARE.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "hw/ppc/spapr_nvdimm.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/guest-random.h"
>  #include "qemu/nvdimm-utils.h"
>  #include "hw/ppc/fdt.h"
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
>  
>  /*
>   * The nvdimm size should be aligned to SCM block size.
> @@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static const VMStateDescription vmstate_spapr_nvdimm_entry = {
> +     .name = "spapr_nvdimm_states",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +static bool spapr_nvdimm_states_needed(void *opaque)
> +{
> +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> +             !QLIST_EMPTY(&spapr->completed_flush_states));
> +}
> +
> +static int spapr_nvdimm_pre_save(void *opaque)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);

Hmm... how long could waiting for all the pending flushes to complete
take?  This could add substanially to the guest's migration downtime,
couldn't it?

> +    }
> +
> +    return 0;
> +}
> +
> +const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
> +    .name = "spapr_nvdimm_hcall_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_nvdimm_states_needed,
> +    .pre_save = spapr_nvdimm_pre_save,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
> +                        vmstate_spapr_nvdimm_entry,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Acquire a unique token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
> +{
> +    Error *err = NULL;
> +    uint64_t token;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +retry:
> +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {

Using getrandom seems like overkill, why not just use a counter?

> +        error_report_err(err);
> +        g_free(state);
> +        qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +        return NULL;
> +    }
> +
> +    if (!token) /* Token should be non-zero */
> +        goto retry;
> +
> +    /* If the token already in use, get a new one */
> +    QLIST_FOREACH_SAFE(tmp, &(spapr->pending_flush_states), node, next) {
> +        if (tmp->continue_token == token) {
> +            goto retry;
> +        }
> +    }
> +    QLIST_FOREACH_SAFE(tmp, &(spapr->completed_flush_states), node, next) {
> +        if (tmp->continue_token == token) {
> +            goto retry;
> +        }
> +    }
> +
> +    state->continue_token = token;
> +    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
> +
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(void)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The caller has natural access to the machine, so pass it in rather
than using the global.

> +
> +    /*
> +     * No contention here when called on reset path, the main loop thread
> +     * which calls the pending BHs has gotten out running in the reset path,
> +     * finally reaching here. Other code path being guest
> +     * h_client_architecture_support, thats early boot up.
> +     */
> +
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
> +        QLIST_REMOVE(state, node);
> +        g_free(state);
> +    }
> +}
> +
> +/*
> + * spapr_nvdimm_get_hcall_status
> + *      Fetches the status of the hcall worker and returns H_BUSY
> + *      if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(uint64_t token)
> +{
> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The callers have natural access to spapr, so pass it in rather than
using the global.

> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            goto exit;
> +        }
> +    }
> +    ret = H_P2; /* If not found in complete list too, invalid token */
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);
> +            break;
> +        }
> +    }
> +exit:
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    return ret;
> +}
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +    int ret = H_SUCCESS;
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    /* flush raw backing image */
> +    if (qemu_fdatasync(state->backend_fd) < 0) {
> +        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                     strerror(errno));
> +        ret = H_HARDWARE;
> +    }
> +
> +    return ret;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
> +
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
> + *
> + * Given a DRC Index Flush the data to backend NVDIMM device.
> + * The hcall returns H_BUSY when the flush takes longer time and the hcall

It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
it?


> + * needs to be issued multiple times in order to be completely serviced.
> + * The continue-token from the output to be passed in the argument list of
> + * subsequent hcalls until the hcall is completely serviced at which
> + * point H_SUCCESS or other error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                      target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (continue_token != 0) {
> +        ret = spapr_nvdimm_get_flush_status(continue_token);
> +        if (H_IS_LONG_BUSY(ret)) {
> +            args[0] = continue_token;
> +        }
> +
> +        return ret;
> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    backend = MEMORY_BACKEND(dimm->hostmem);
> +
> +    state = spapr_nvdimm_init_new_flush_state();
> +    if (!state) {
> +        return H_P2;

AFAICT the only way init_new_flush_state() fails is a failure in the
RNG, which definitely isn't a parameter problem.

> +    }
> +
> +    state->backend_fd = memory_region_get_fd(&backend->mr);

Is this guaranteed to return a usable fd in all configurations?

> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                           spapr_nvdimm_flush_completion_cb, state);
> +
> +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = state->continue_token;
> +    }
> +
> +    return ret;

I believe you can rearrange this so the get_flush_status / check /
return is shared between the args[0] == 0 and args[0] == token paths.

> +}
> +
>  static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                       target_ulong opcode, target_ulong *args)
>  {
> @@ -487,6 +726,7 @@ static void spapr_scm_register_types(void)
>      spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>  }
>  
>  type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 47cebaf3ac..7c27fb3e2d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -12,10 +12,12 @@
>  #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
>  #include "hw/ppc/xics.h"        /* For ICSState */
>  #include "hw/ppc/spapr_tpm_proxy.h"
> +#include "hw/ppc/spapr_nvdimm.h"
>  
>  struct SpaprVioBus;
>  struct SpaprPhbState;
>  struct SpaprNvram;
> +struct SpaprNVDIMMDeviceFlushState;
>  
>  typedef struct SpaprEventLogEntry SpaprEventLogEntry;
>  typedef struct SpaprEventSource SpaprEventSource;
> @@ -245,6 +247,12 @@ struct SpaprMachineState {
>      uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
>  
>      Error *fwnmi_migration_blocker;
> +
> +    /* nvdimm flush states */
> +    QemuMutex spapr_nvdimm_flush_states_lock;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
> +
>  };
>  
>  #define H_SUCCESS         0
> @@ -538,8 +546,9 @@ struct SpaprMachineState {
>  #define H_SCM_BIND_MEM          0x3EC
>  #define H_SCM_UNBIND_MEM        0x3F0
>  #define H_SCM_UNBIND_ALL        0x3FC
> +#define H_SCM_FLUSH             0x44C
>  
> -#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index abcacda5d7..c88df2c590 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,6 +11,7 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> +#include "migration/vmstate.h"
>  
>  struct SpaprDrc;
>  struct SpaprMachineState;
> @@ -22,5 +23,16 @@ void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(void);
> +
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    int backend_fd;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +extern const VMStateDescription vmstate_spapr_nvdimm_flush_states;
>  
>  #endif
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-24  3:07     ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  3:07 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata

[-- Attachment #1: Type: text/plain, Size: 15855 bytes --]

On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> The patch adds support for the SCM flush hcall for the nvdimm devices.
> To be available for exploitation by guest through the next patch.
> 
> The hcall expects the semantics such that the flush to return
> with H_BUSY when the operation is expected to take longer time along
> with a continue_token. The hcall to be called again providing the
> continue_token to get the status. So, all fresh requsts are put into
> a 'pending' list and flush worker is submitted to the thread pool.
> The thread pool completion callbacks move the requests to 'completed'
> list, which are cleaned up after reporting to guest in subsequent
> hcalls to get the status.
> 
> The semantics makes it necessary to preserve the continue_tokens
> and their return status even across migrations. So, the pre_save
> handler for the device waits for the flush worker to complete and
> collects all the hcall states from 'completed' list. The necessary
> nvdimm flush specific vmstate structures are added to the spapr
> machine vmstate.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>

An overal question: surely the same issue must arise on x86 with
file-backed NVDIMMs.  How do they handle this case?

> ---
>  hw/ppc/spapr.c                |    6 +
>  hw/ppc/spapr_nvdimm.c         |  240 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h        |   11 ++
>  include/hw/ppc/spapr_nvdimm.h |   12 ++
>  4 files changed, 268 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d56418ca29..fdb0c73a2c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1607,6 +1607,8 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr->ov5_cas = spapr_ovec_clone(spapr->ov5);
>      }
>  
> +    spapr_nvdimm_finish_flushes();
> +
>      /* DRC reset may cause a device to be unplugged. This will cause troubles
>       * if this device is used by another device (eg, a running vhost backend
>       * will crash QEMU if the DIMM holding the vring goes away). To avoid such
> @@ -2003,6 +2005,7 @@ static const VMStateDescription vmstate_spapr = {
>          &vmstate_spapr_cap_ccf_assist,
>          &vmstate_spapr_cap_fwnmi,
>          &vmstate_spapr_fwnmi,
> +        &vmstate_spapr_nvdimm_flush_states,
>          NULL
>      }
>  };
> @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
>      }
>  
>      qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);

Do you actually need an extra mutex, or can you rely on the BQL?

> +    QLIST_INIT(&spapr->pending_flush_states);
> +    QLIST_INIT(&spapr->completed_flush_states);
>  }
>  
>  #define DEFAULT_KVM_TYPE "auto"
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 8cf3fb2ffb..883317c1ed 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -22,14 +22,17 @@
>   * THE SOFTWARE.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/cutils.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr_drc.h"
>  #include "hw/ppc/spapr_nvdimm.h"
>  #include "hw/mem/nvdimm.h"
> +#include "qemu/guest-random.h"
>  #include "qemu/nvdimm-utils.h"
>  #include "hw/ppc/fdt.h"
>  #include "qemu/range.h"
>  #include "hw/ppc/spapr_numa.h"
> +#include "block/thread-pool.h"
>  
>  /*
>   * The nvdimm size should be aligned to SCM block size.
> @@ -371,6 +374,242 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static const VMStateDescription vmstate_spapr_nvdimm_entry = {
> +     .name = "spapr_nvdimm_states",
> +     .version_id = 1,
> +     .minimum_version_id = 1,
> +     .fields = (VMStateField[]) {
> +         VMSTATE_UINT64(continue_token, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_INT64(hcall_ret, SpaprNVDIMMDeviceFlushState),
> +         VMSTATE_END_OF_LIST()
> +     },
> +};
> +
> +static bool spapr_nvdimm_states_needed(void *opaque)
> +{
> +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> +             !QLIST_EMPTY(&spapr->completed_flush_states));
> +}
> +
> +static int spapr_nvdimm_pre_save(void *opaque)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);

Hmm... how long could waiting for all the pending flushes to complete
take?  This could add substanially to the guest's migration downtime,
couldn't it?

> +    }
> +
> +    return 0;
> +}
> +
> +const VMStateDescription vmstate_spapr_nvdimm_flush_states = {
> +    .name = "spapr_nvdimm_hcall_states",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_nvdimm_states_needed,
> +    .pre_save = spapr_nvdimm_pre_save,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_QLIST_V(completed_flush_states, SpaprMachineState, 1,
> +                        vmstate_spapr_nvdimm_entry,
> +                        SpaprNVDIMMDeviceFlushState, node),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * Acquire a unique token and reserve it for the new flush state.
> + */
> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
> +{
> +    Error *err = NULL;
> +    uint64_t token;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
> +
> +    state = g_malloc0(sizeof(*state));
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +retry:
> +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {

Using getrandom seems like overkill, why not just use a counter?

> +        error_report_err(err);
> +        g_free(state);
> +        qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +        return NULL;
> +    }
> +
> +    if (!token) /* Token should be non-zero */
> +        goto retry;
> +
> +    /* If the token already in use, get a new one */
> +    QLIST_FOREACH_SAFE(tmp, &(spapr->pending_flush_states), node, next) {
> +        if (tmp->continue_token == token) {
> +            goto retry;
> +        }
> +    }
> +    QLIST_FOREACH_SAFE(tmp, &(spapr->completed_flush_states), node, next) {
> +        if (tmp->continue_token == token) {
> +            goto retry;
> +        }
> +    }
> +
> +    state->continue_token = token;
> +    QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);
> +
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    return state;
> +}
> +
> +/*
> + * spapr_nvdimm_finish_flushes
> + *      Waits for all pending flush requests to complete
> + *      their execution and free the states
> + */
> +void spapr_nvdimm_finish_flushes(void)
> +{
> +    SpaprNVDIMMDeviceFlushState *state, *next;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The caller has natural access to the machine, so pass it in rather
than using the global.

> +
> +    /*
> +     * No contention here when called on reset path, the main loop thread
> +     * which calls the pending BHs has gotten out running in the reset path,
> +     * finally reaching here. Other code path being guest
> +     * h_client_architecture_support, thats early boot up.
> +     */
> +
> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> +        aio_poll(qemu_get_aio_context(), true);
> +    }
> +
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
> +        QLIST_REMOVE(state, node);
> +        g_free(state);
> +    }
> +}
> +
> +/*
> + * spapr_nvdimm_get_hcall_status
> + *      Fetches the status of the hcall worker and returns H_BUSY
> + *      if the worker is still running.
> + */
> +static int spapr_nvdimm_get_flush_status(uint64_t token)
> +{
> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());

The callers have natural access to spapr, so pass it in rather than
using the global.

> +    SpaprNVDIMMDeviceFlushState *state, *node;
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +    QLIST_FOREACH_SAFE(state, &spapr->pending_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            goto exit;
> +        }
> +    }
> +    ret = H_P2; /* If not found in complete list too, invalid token */
> +    QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, node) {
> +        if (state->continue_token == token) {
> +            ret = state->hcall_ret;
> +            QLIST_REMOVE(state, node);
> +            g_free(state);
> +            break;
> +        }
> +    }
> +exit:
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    return ret;
> +}
> +
> +static int flush_worker_cb(void *opaque)
> +{
> +    int ret = H_SUCCESS;
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    /* flush raw backing image */
> +    if (qemu_fdatasync(state->backend_fd) < 0) {
> +        error_report("papr_scm: Could not sync nvdimm to backend file: %s",
> +                     strerror(errno));
> +        ret = H_HARDWARE;
> +    }
> +
> +    return ret;
> +}
> +
> +static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> +    SpaprNVDIMMDeviceFlushState *state = opaque;
> +
> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> +
> +    state->hcall_ret = hcall_ret;
> +    QLIST_REMOVE(state, node);
> +    QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
> +
> +    qemu_mutex_unlock(&spapr->spapr_nvdimm_flush_states_lock);
> +}
> +
> +/*
> + * H_SCM_FLUSH
> + * Input: drc_index, continue-token
> + * Out: continue-token
> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
> + *
> + * Given a DRC Index Flush the data to backend NVDIMM device.
> + * The hcall returns H_BUSY when the flush takes longer time and the hcall

It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
it?


> + * needs to be issued multiple times in order to be completely serviced.
> + * The continue-token from the output to be passed in the argument list of
> + * subsequent hcalls until the hcall is completely serviced at which
> + * point H_SUCCESS or other error is returned.
> + */
> +static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                      target_ulong opcode, target_ulong *args)
> +{
> +    int ret;
> +    uint32_t drc_index = args[0];
> +    uint64_t continue_token = args[1];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    PCDIMMDevice *dimm;
> +    HostMemoryBackend *backend = NULL;
> +    SpaprNVDIMMDeviceFlushState *state;
> +    ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> +    if (!drc || !drc->dev ||
> +        spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (continue_token != 0) {
> +        ret = spapr_nvdimm_get_flush_status(continue_token);
> +        if (H_IS_LONG_BUSY(ret)) {
> +            args[0] = continue_token;
> +        }
> +
> +        return ret;
> +    }
> +
> +    dimm = PC_DIMM(drc->dev);
> +    backend = MEMORY_BACKEND(dimm->hostmem);
> +
> +    state = spapr_nvdimm_init_new_flush_state();
> +    if (!state) {
> +        return H_P2;

AFAICT the only way init_new_flush_state() fails is a failure in the
RNG, which definitely isn't a parameter problem.

> +    }
> +
> +    state->backend_fd = memory_region_get_fd(&backend->mr);

Is this guaranteed to return a usable fd in all configurations?

> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> +                           spapr_nvdimm_flush_completion_cb, state);
> +
> +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
> +    if (H_IS_LONG_BUSY(ret)) {
> +        args[0] = state->continue_token;
> +    }
> +
> +    return ret;

I believe you can rearrange this so the get_flush_status / check /
return is shared between the args[0] == 0 and args[0] == token paths.

> +}
> +
>  static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                       target_ulong opcode, target_ulong *args)
>  {
> @@ -487,6 +726,7 @@ static void spapr_scm_register_types(void)
>      spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
>      spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
> +    spapr_register_hypercall(H_SCM_FLUSH, h_scm_flush);
>  }
>  
>  type_init(spapr_scm_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 47cebaf3ac..7c27fb3e2d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -12,10 +12,12 @@
>  #include "hw/ppc/spapr_xive.h"  /* For SpaprXive */
>  #include "hw/ppc/xics.h"        /* For ICSState */
>  #include "hw/ppc/spapr_tpm_proxy.h"
> +#include "hw/ppc/spapr_nvdimm.h"
>  
>  struct SpaprVioBus;
>  struct SpaprPhbState;
>  struct SpaprNvram;
> +struct SpaprNVDIMMDeviceFlushState;
>  
>  typedef struct SpaprEventLogEntry SpaprEventLogEntry;
>  typedef struct SpaprEventSource SpaprEventSource;
> @@ -245,6 +247,12 @@ struct SpaprMachineState {
>      uint32_t numa_assoc_array[MAX_NODES + NVGPU_MAX_NUM][NUMA_ASSOC_SIZE];
>  
>      Error *fwnmi_migration_blocker;
> +
> +    /* nvdimm flush states */
> +    QemuMutex spapr_nvdimm_flush_states_lock;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) pending_flush_states;
> +    QLIST_HEAD(, SpaprNVDIMMDeviceFlushState) completed_flush_states;
> +
>  };
>  
>  #define H_SUCCESS         0
> @@ -538,8 +546,9 @@ struct SpaprMachineState {
>  #define H_SCM_BIND_MEM          0x3EC
>  #define H_SCM_UNBIND_MEM        0x3F0
>  #define H_SCM_UNBIND_ALL        0x3FC
> +#define H_SCM_FLUSH             0x44C
>  
> -#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
> +#define MAX_HCALL_OPCODE        H_SCM_FLUSH
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> diff --git a/include/hw/ppc/spapr_nvdimm.h b/include/hw/ppc/spapr_nvdimm.h
> index abcacda5d7..c88df2c590 100644
> --- a/include/hw/ppc/spapr_nvdimm.h
> +++ b/include/hw/ppc/spapr_nvdimm.h
> @@ -11,6 +11,7 @@
>  #define HW_SPAPR_NVDIMM_H
>  
>  #include "hw/mem/nvdimm.h"
> +#include "migration/vmstate.h"
>  
>  struct SpaprDrc;
>  struct SpaprMachineState;
> @@ -22,5 +23,16 @@ void spapr_dt_persistent_memory(struct SpaprMachineState *spapr, void *fdt);
>  bool spapr_nvdimm_validate(HotplugHandler *hotplug_dev, NVDIMMDevice *nvdimm,
>                             uint64_t size, Error **errp);
>  void spapr_add_nvdimm(DeviceState *dev, uint64_t slot);
> +void spapr_nvdimm_finish_flushes(void);
> +
> +typedef struct SpaprNVDIMMDeviceFlushState {
> +    uint64_t continue_token;
> +    int64_t hcall_ret;
> +    int backend_fd;
> +
> +    QLIST_ENTRY(SpaprNVDIMMDeviceFlushState) node;
> +} SpaprNVDIMMDeviceFlushState;
> +
> +extern const VMStateDescription vmstate_spapr_nvdimm_flush_states;
>  
>  #endif
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
  2021-03-23 13:47   ` Shivaprasad G Bhat
  (?)
@ 2021-03-24  3:09     ` David Gibson
  -1 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  3:09 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata


[-- Attachment #1.1: Type: text/plain, Size: 6078 bytes --]

On Tue, Mar 23, 2021 at 09:47:55AM -0400, Shivaprasad G Bhat wrote:
> The patch adds the 'sync-dax' property to the nvdimm device.
> 
> When the sync-dax is 'off', the device tree property
> "hcall-flush-required" is added to the nvdimm node which makes the
> guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
> This would be the default behaviour without sync-dax property set
> for the nvdimm device.
> 
> The sync-dax="on" would mean the guest need not make flush requests
> to the qemu. On previous machine versions the sync-dax is set to be
> "on" by default using the hw_compat magic.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/core/machine.c       |    1 +
>  hw/mem/nvdimm.c         |    1 +
>  hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
>  include/hw/mem/nvdimm.h |   10 ++++++++++
>  include/hw/ppc/spapr.h  |    1 +
>  5 files changed, 30 insertions(+)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 257a664ea2..f843643574 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
>      { "PIIX4_PM", "smm-compat", "on"},
>      { "virtio-blk-device", "report-discard-granularity", "off" },
>      { "virtio-net-pci", "vectors", "3"},
> +    { "nvdimm", "sync-dax", "on" },
>  };
>  const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
>  
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index 7397b67156..8f0e29b191 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>  
>  static Property nvdimm_properties[] = {
>      DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
> +    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),

I'm a bit uncomfortable adding this base NVDIMM property without at
least some logic about how it's handled on non-PAPR platforms.

>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 883317c1ed..dd1c90251b 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>      uint64_t lsize = nvdimm->label_size;
>      uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>                                              NULL);
> +    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
> +                                             NVDIMM_SYNC_DAX_PROP,
> +                                             &error_abort);
>  
>      drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>      g_assert(drc);
> @@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>                               "operating-system")));
>      _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>  
> +    if (!sync_dax) {
> +        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
> +                         NULL, 0));
> +    }
> +
>      return child_offset;
>  }
>  
> @@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                        target_ulong opcode, target_ulong *args)
>  {
>      int ret;
> +    bool sync_dax;
>      uint32_t drc_index = args[0];
>      uint64_t continue_token = args[1];
>      SpaprDrc *drc = spapr_drc_by_index(drc_index);
>      PCDIMMDevice *dimm;
> +    NVDIMMDevice *nvdimm;
>      HostMemoryBackend *backend = NULL;
>      SpaprNVDIMMDeviceFlushState *state;
>      ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> @@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>          return H_PARAMETER;
>      }
>  
> +    nvdimm = NVDIMM(drc->dev);
> +    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
> +                                        &error_abort);
> +    if (sync_dax) {
> +        return H_UNSUPPORTED;

Do you want to return UNSUPPORTED here, or just H_SUCCESS, since the
flush should be a no-op in this case.

> +    }
> +
>      if (continue_token != 0) {
>          ret = spapr_nvdimm_get_flush_status(continue_token);
>          if (H_IS_LONG_BUSY(ret)) {
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index bcf62f825c..f82979cf2f 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
>  #define NVDIMM_LABEL_SIZE_PROP "label-size"
>  #define NVDIMM_UUID_PROP       "uuid"
>  #define NVDIMM_UNARMED_PROP    "unarmed"
> +#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
>  
>  struct NVDIMMDevice {
>      /* private */
> @@ -85,6 +86,15 @@ struct NVDIMMDevice {
>       */
>      bool unarmed;
>  
> +    /*
> +     * On PPC64,
> +     * The 'off' value results in the hcall-flush-required property set
> +     * in the device tree for pseries machines. When 'off', the guest
> +     * initiates explicit flush requests to the backend device ensuring
> +     * write persistence.
> +     */
> +    bool sync_dax;
> +
>      /*
>       * The PPC64 - spapr requires each nvdimm device have a uuid.
>       */
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 7c27fb3e2d..51c35488a4 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -333,6 +333,7 @@ struct SpaprMachineState {
>  #define H_P7              -60
>  #define H_P8              -61
>  #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>  #define H_OVERLAP         -68
>  #define H_UNSUPPORTED_FLAG -256
>  #define H_MULTI_THREADS_ACTIVE -9005
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 167 bytes --]

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
@ 2021-03-24  3:09     ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  3:09 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: ehabkost, mst, aneesh.kumar, bharata, linux-nvdimm, groug,
	kvm-ppc, qemu-devel, shivaprasadbhat, qemu-ppc, imammedo, sbhat,
	xiaoguangrong.eric

[-- Attachment #1: Type: text/plain, Size: 6078 bytes --]

On Tue, Mar 23, 2021 at 09:47:55AM -0400, Shivaprasad G Bhat wrote:
> The patch adds the 'sync-dax' property to the nvdimm device.
> 
> When the sync-dax is 'off', the device tree property
> "hcall-flush-required" is added to the nvdimm node which makes the
> guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
> This would be the default behaviour without sync-dax property set
> for the nvdimm device.
> 
> The sync-dax="on" would mean the guest need not make flush requests
> to the qemu. On previous machine versions the sync-dax is set to be
> "on" by default using the hw_compat magic.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/core/machine.c       |    1 +
>  hw/mem/nvdimm.c         |    1 +
>  hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
>  include/hw/mem/nvdimm.h |   10 ++++++++++
>  include/hw/ppc/spapr.h  |    1 +
>  5 files changed, 30 insertions(+)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 257a664ea2..f843643574 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
>      { "PIIX4_PM", "smm-compat", "on"},
>      { "virtio-blk-device", "report-discard-granularity", "off" },
>      { "virtio-net-pci", "vectors", "3"},
> +    { "nvdimm", "sync-dax", "on" },
>  };
>  const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
>  
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index 7397b67156..8f0e29b191 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>  
>  static Property nvdimm_properties[] = {
>      DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
> +    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),

I'm a bit uncomfortable adding this base NVDIMM property without at
least some logic about how it's handled on non-PAPR platforms.

>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 883317c1ed..dd1c90251b 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>      uint64_t lsize = nvdimm->label_size;
>      uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>                                              NULL);
> +    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
> +                                             NVDIMM_SYNC_DAX_PROP,
> +                                             &error_abort);
>  
>      drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>      g_assert(drc);
> @@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>                               "operating-system")));
>      _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>  
> +    if (!sync_dax) {
> +        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
> +                         NULL, 0));
> +    }
> +
>      return child_offset;
>  }
>  
> @@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                        target_ulong opcode, target_ulong *args)
>  {
>      int ret;
> +    bool sync_dax;
>      uint32_t drc_index = args[0];
>      uint64_t continue_token = args[1];
>      SpaprDrc *drc = spapr_drc_by_index(drc_index);
>      PCDIMMDevice *dimm;
> +    NVDIMMDevice *nvdimm;
>      HostMemoryBackend *backend = NULL;
>      SpaprNVDIMMDeviceFlushState *state;
>      ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> @@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>          return H_PARAMETER;
>      }
>  
> +    nvdimm = NVDIMM(drc->dev);
> +    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
> +                                        &error_abort);
> +    if (sync_dax) {
> +        return H_UNSUPPORTED;

Do you want to return UNSUPPORTED here, or just H_SUCCESS, since the
flush should be a no-op in this case.

> +    }
> +
>      if (continue_token != 0) {
>          ret = spapr_nvdimm_get_flush_status(continue_token);
>          if (H_IS_LONG_BUSY(ret)) {
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index bcf62f825c..f82979cf2f 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
>  #define NVDIMM_LABEL_SIZE_PROP "label-size"
>  #define NVDIMM_UUID_PROP       "uuid"
>  #define NVDIMM_UNARMED_PROP    "unarmed"
> +#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
>  
>  struct NVDIMMDevice {
>      /* private */
> @@ -85,6 +86,15 @@ struct NVDIMMDevice {
>       */
>      bool unarmed;
>  
> +    /*
> +     * On PPC64,
> +     * The 'off' value results in the hcall-flush-required property set
> +     * in the device tree for pseries machines. When 'off', the guest
> +     * initiates explicit flush requests to the backend device ensuring
> +     * write persistence.
> +     */
> +    bool sync_dax;
> +
>      /*
>       * The PPC64 - spapr requires each nvdimm device have a uuid.
>       */
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 7c27fb3e2d..51c35488a4 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -333,6 +333,7 @@ struct SpaprMachineState {
>  #define H_P7              -60
>  #define H_P8              -61
>  #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>  #define H_OVERLAP         -68
>  #define H_UNSUPPORTED_FLAG -256
>  #define H_MULTI_THREADS_ACTIVE -9005
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
@ 2021-03-24  3:09     ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-24  3:09 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata

[-- Attachment #1: Type: text/plain, Size: 6078 bytes --]

On Tue, Mar 23, 2021 at 09:47:55AM -0400, Shivaprasad G Bhat wrote:
> The patch adds the 'sync-dax' property to the nvdimm device.
> 
> When the sync-dax is 'off', the device tree property
> "hcall-flush-required" is added to the nvdimm node which makes the
> guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
> This would be the default behaviour without sync-dax property set
> for the nvdimm device.
> 
> The sync-dax="on" would mean the guest need not make flush requests
> to the qemu. On previous machine versions the sync-dax is set to be
> "on" by default using the hw_compat magic.
> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> ---
>  hw/core/machine.c       |    1 +
>  hw/mem/nvdimm.c         |    1 +
>  hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
>  include/hw/mem/nvdimm.h |   10 ++++++++++
>  include/hw/ppc/spapr.h  |    1 +
>  5 files changed, 30 insertions(+)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 257a664ea2..f843643574 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
>      { "PIIX4_PM", "smm-compat", "on"},
>      { "virtio-blk-device", "report-discard-granularity", "off" },
>      { "virtio-net-pci", "vectors", "3"},
> +    { "nvdimm", "sync-dax", "on" },
>  };
>  const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
>  
> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> index 7397b67156..8f0e29b191 100644
> --- a/hw/mem/nvdimm.c
> +++ b/hw/mem/nvdimm.c
> @@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>  
>  static Property nvdimm_properties[] = {
>      DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
> +    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),

I'm a bit uncomfortable adding this base NVDIMM property without at
least some logic about how it's handled on non-PAPR platforms.

>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
> index 883317c1ed..dd1c90251b 100644
> --- a/hw/ppc/spapr_nvdimm.c
> +++ b/hw/ppc/spapr_nvdimm.c
> @@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>      uint64_t lsize = nvdimm->label_size;
>      uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>                                              NULL);
> +    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
> +                                             NVDIMM_SYNC_DAX_PROP,
> +                                             &error_abort);
>  
>      drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>      g_assert(drc);
> @@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>                               "operating-system")));
>      _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>  
> +    if (!sync_dax) {
> +        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
> +                         NULL, 0));
> +    }
> +
>      return child_offset;
>  }
>  
> @@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>                                        target_ulong opcode, target_ulong *args)
>  {
>      int ret;
> +    bool sync_dax;
>      uint32_t drc_index = args[0];
>      uint64_t continue_token = args[1];
>      SpaprDrc *drc = spapr_drc_by_index(drc_index);
>      PCDIMMDevice *dimm;
> +    NVDIMMDevice *nvdimm;
>      HostMemoryBackend *backend = NULL;
>      SpaprNVDIMMDeviceFlushState *state;
>      ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> @@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>          return H_PARAMETER;
>      }
>  
> +    nvdimm = NVDIMM(drc->dev);
> +    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
> +                                        &error_abort);
> +    if (sync_dax) {
> +        return H_UNSUPPORTED;

Do you want to return UNSUPPORTED here, or just H_SUCCESS, since the
flush should be a no-op in this case.

> +    }
> +
>      if (continue_token != 0) {
>          ret = spapr_nvdimm_get_flush_status(continue_token);
>          if (H_IS_LONG_BUSY(ret)) {
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index bcf62f825c..f82979cf2f 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
>  #define NVDIMM_LABEL_SIZE_PROP "label-size"
>  #define NVDIMM_UUID_PROP       "uuid"
>  #define NVDIMM_UNARMED_PROP    "unarmed"
> +#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
>  
>  struct NVDIMMDevice {
>      /* private */
> @@ -85,6 +86,15 @@ struct NVDIMMDevice {
>       */
>      bool unarmed;
>  
> +    /*
> +     * On PPC64,
> +     * The 'off' value results in the hcall-flush-required property set
> +     * in the device tree for pseries machines. When 'off', the guest
> +     * initiates explicit flush requests to the backend device ensuring
> +     * write persistence.
> +     */
> +    bool sync_dax;
> +
>      /*
>       * The PPC64 - spapr requires each nvdimm device have a uuid.
>       */
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 7c27fb3e2d..51c35488a4 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -333,6 +333,7 @@ struct SpaprMachineState {
>  #define H_P7              -60
>  #define H_P8              -61
>  #define H_P9              -62
> +#define H_UNSUPPORTED     -67
>  #define H_OVERLAP         -68
>  #define H_UNSUPPORTED_FLAG -256
>  #define H_MULTI_THREADS_ACTIVE -9005
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-03-24  3:07     ` David Gibson
  (?)
@ 2021-03-24  4:04       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2021-03-24  4:04 UTC (permalink / raw)
  To: David Gibson, Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, linux-nvdimm, kvm-ppc,
	shivaprasadbhat, bharata

On 3/24/21 8:37 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>> The patch adds support for the SCM flush hcall for the nvdimm devices.
>> To be available for exploitation by guest through the next patch.
>>
>> The hcall expects the semantics such that the flush to return
>> with H_BUSY when the operation is expected to take longer time along
>> with a continue_token. The hcall to be called again providing the
>> continue_token to get the status. So, all fresh requsts are put into
>> a 'pending' list and flush worker is submitted to the thread pool.
>> The thread pool completion callbacks move the requests to 'completed'
>> list, which are cleaned up after reporting to guest in subsequent
>> hcalls to get the status.
>>
>> The semantics makes it necessary to preserve the continue_tokens
>> and their return status even across migrations. So, the pre_save
>> handler for the device waits for the flush worker to complete and
>> collects all the hcall states from 'completed' list. The necessary
>> nvdimm flush specific vmstate structures are added to the spapr
>> machine vmstate.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> 
> An overal question: surely the same issue must arise on x86 with
> file-backed NVDIMMs.  How do they handle this case?

On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 
map and virtio_pmem. Among these virio_pmem always operated with 
synchronous dax disabled and both ACPI and e820 doesn't have the ability 
to differentiate support for synchronous dax.

With that I would expect users to use virtio_pmem when using using file 
backed NVDIMMS

-aneesh
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-24  4:04       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2021-03-24  4:04 UTC (permalink / raw)
  To: David Gibson, Shivaprasad G Bhat
  Cc: ehabkost, mst, bharata, linux-nvdimm, groug, kvm-ppc, qemu-devel,
	shivaprasadbhat, qemu-ppc, imammedo, sbhat, xiaoguangrong.eric

On 3/24/21 8:37 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>> The patch adds support for the SCM flush hcall for the nvdimm devices.
>> To be available for exploitation by guest through the next patch.
>>
>> The hcall expects the semantics such that the flush to return
>> with H_BUSY when the operation is expected to take longer time along
>> with a continue_token. The hcall to be called again providing the
>> continue_token to get the status. So, all fresh requsts are put into
>> a 'pending' list and flush worker is submitted to the thread pool.
>> The thread pool completion callbacks move the requests to 'completed'
>> list, which are cleaned up after reporting to guest in subsequent
>> hcalls to get the status.
>>
>> The semantics makes it necessary to preserve the continue_tokens
>> and their return status even across migrations. So, the pre_save
>> handler for the device waits for the flush worker to complete and
>> collects all the hcall states from 'completed' list. The necessary
>> nvdimm flush specific vmstate structures are added to the spapr
>> machine vmstate.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> 
> An overal question: surely the same issue must arise on x86 with
> file-backed NVDIMMs.  How do they handle this case?

On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 
map and virtio_pmem. Among these virio_pmem always operated with 
synchronous dax disabled and both ACPI and e820 doesn't have the ability 
to differentiate support for synchronous dax.

With that I would expect users to use virtio_pmem when using using file 
backed NVDIMMS

-aneesh


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
  2021-03-24  3:09     ` David Gibson
  (?)
@ 2021-03-24  4:09       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2021-03-24  4:09 UTC (permalink / raw)
  To: David Gibson, Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, linux-nvdimm, kvm-ppc,
	shivaprasadbhat, bharata

On 3/24/21 8:39 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:55AM -0400, Shivaprasad G Bhat wrote:
>> The patch adds the 'sync-dax' property to the nvdimm device.
>>
>> When the sync-dax is 'off', the device tree property
>> "hcall-flush-required" is added to the nvdimm node which makes the
>> guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
>> This would be the default behaviour without sync-dax property set
>> for the nvdimm device.
>>
>> The sync-dax="on" would mean the guest need not make flush requests
>> to the qemu. On previous machine versions the sync-dax is set to be
>> "on" by default using the hw_compat magic.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>> ---
>>   hw/core/machine.c       |    1 +
>>   hw/mem/nvdimm.c         |    1 +
>>   hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
>>   include/hw/mem/nvdimm.h |   10 ++++++++++
>>   include/hw/ppc/spapr.h  |    1 +
>>   5 files changed, 30 insertions(+)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 257a664ea2..f843643574 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
>>       { "PIIX4_PM", "smm-compat", "on"},
>>       { "virtio-blk-device", "report-discard-granularity", "off" },
>>       { "virtio-net-pci", "vectors", "3"},
>> +    { "nvdimm", "sync-dax", "on" },
>>   };
>>   const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
>>   
>> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
>> index 7397b67156..8f0e29b191 100644
>> --- a/hw/mem/nvdimm.c
>> +++ b/hw/mem/nvdimm.c
>> @@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>>   
>>   static Property nvdimm_properties[] = {
>>       DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
>> +    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
> 
> I'm a bit uncomfortable adding this base NVDIMM property without at
> least some logic about how it's handled on non-PAPR platforms.

yes these should be specific to PAPR. These are there to handle 
migration. with older guest. We can use the backing file to determine 
synchronous dax support. if it is a file backed nvdimm on a fsdax mount 
point, we can do synchronous dax. If it is one on a non dax file system 
synchronous dax can be disabled.

> 
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>   
>> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
>> index 883317c1ed..dd1c90251b 100644
>> --- a/hw/ppc/spapr_nvdimm.c
>> +++ b/hw/ppc/spapr_nvdimm.c
>> @@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>>       uint64_t lsize = nvdimm->label_size;
>>       uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>>                                               NULL);
>> +    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
>> +                                             NVDIMM_SYNC_DAX_PROP,
>> +                                             &error_abort);
>>   
>>       drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>>       g_assert(drc);
>> @@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>>                                "operating-system")));
>>       _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>>   
>> +    if (!sync_dax) {
>> +        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
>> +                         NULL, 0));
>> +    }
>> +
>>       return child_offset;
>>   }
>>   
>> @@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>                                         target_ulong opcode, target_ulong *args)
>>   {
>>       int ret;
>> +    bool sync_dax;
>>       uint32_t drc_index = args[0];
>>       uint64_t continue_token = args[1];
>>       SpaprDrc *drc = spapr_drc_by_index(drc_index);
>>       PCDIMMDevice *dimm;
>> +    NVDIMMDevice *nvdimm;
>>       HostMemoryBackend *backend = NULL;
>>       SpaprNVDIMMDeviceFlushState *state;
>>       ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
>> @@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>           return H_PARAMETER;
>>       }
>>   
>> +    nvdimm = NVDIMM(drc->dev);
>> +    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
>> +                                        &error_abort);
>> +    if (sync_dax) {
>> +        return H_UNSUPPORTED;
> 
> Do you want to return UNSUPPORTED here, or just H_SUCCESS, since the
> flush should be a no-op in this case.


The reason to handle this as error is to indicate the OS that it is 
using a wrong mechanism to flush.

> 
>> +    }
>> +
>>       if (continue_token != 0) {
>>           ret = spapr_nvdimm_get_flush_status(continue_token);
>>           if (H_IS_LONG_BUSY(ret)) {
>> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
>> index bcf62f825c..f82979cf2f 100644
>> --- a/include/hw/mem/nvdimm.h
>> +++ b/include/hw/mem/nvdimm.h
>> @@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
>>   #define NVDIMM_LABEL_SIZE_PROP "label-size"
>>   #define NVDIMM_UUID_PROP       "uuid"
>>   #define NVDIMM_UNARMED_PROP    "unarmed"
>> +#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
>>   
>>   struct NVDIMMDevice {
>>       /* private */
>> @@ -85,6 +86,15 @@ struct NVDIMMDevice {
>>        */
>>       bool unarmed;
>>   
>> +    /*
>> +     * On PPC64,
>> +     * The 'off' value results in the hcall-flush-required property set
>> +     * in the device tree for pseries machines. When 'off', the guest
>> +     * initiates explicit flush requests to the backend device ensuring
>> +     * write persistence.
>> +     */
>> +    bool sync_dax;
>> +
>>       /*
>>        * The PPC64 - spapr requires each nvdimm device have a uuid.
>>        */
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 7c27fb3e2d..51c35488a4 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -333,6 +333,7 @@ struct SpaprMachineState {
>>   #define H_P7              -60
>>   #define H_P8              -61
>>   #define H_P9              -62
>> +#define H_UNSUPPORTED     -67
>>   #define H_OVERLAP         -68
>>   #define H_UNSUPPORTED_FLAG -256
>>   #define H_MULTI_THREADS_ACTIVE -9005
>>
>>
> 
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
@ 2021-03-24  4:09       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2021-03-24  4:09 UTC (permalink / raw)
  To: David Gibson, Shivaprasad G Bhat
  Cc: ehabkost, mst, bharata, linux-nvdimm, groug, kvm-ppc, qemu-devel,
	shivaprasadbhat, qemu-ppc, imammedo, sbhat, xiaoguangrong.eric

On 3/24/21 8:39 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:55AM -0400, Shivaprasad G Bhat wrote:
>> The patch adds the 'sync-dax' property to the nvdimm device.
>>
>> When the sync-dax is 'off', the device tree property
>> "hcall-flush-required" is added to the nvdimm node which makes the
>> guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
>> This would be the default behaviour without sync-dax property set
>> for the nvdimm device.
>>
>> The sync-dax="on" would mean the guest need not make flush requests
>> to the qemu. On previous machine versions the sync-dax is set to be
>> "on" by default using the hw_compat magic.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>> ---
>>   hw/core/machine.c       |    1 +
>>   hw/mem/nvdimm.c         |    1 +
>>   hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
>>   include/hw/mem/nvdimm.h |   10 ++++++++++
>>   include/hw/ppc/spapr.h  |    1 +
>>   5 files changed, 30 insertions(+)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 257a664ea2..f843643574 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
>>       { "PIIX4_PM", "smm-compat", "on"},
>>       { "virtio-blk-device", "report-discard-granularity", "off" },
>>       { "virtio-net-pci", "vectors", "3"},
>> +    { "nvdimm", "sync-dax", "on" },
>>   };
>>   const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
>>   
>> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
>> index 7397b67156..8f0e29b191 100644
>> --- a/hw/mem/nvdimm.c
>> +++ b/hw/mem/nvdimm.c
>> @@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>>   
>>   static Property nvdimm_properties[] = {
>>       DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
>> +    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
> 
> I'm a bit uncomfortable adding this base NVDIMM property without at
> least some logic about how it's handled on non-PAPR platforms.

yes these should be specific to PAPR. These are there to handle 
migration. with older guest. We can use the backing file to determine 
synchronous dax support. if it is a file backed nvdimm on a fsdax mount 
point, we can do synchronous dax. If it is one on a non dax file system 
synchronous dax can be disabled.

> 
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>   
>> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
>> index 883317c1ed..dd1c90251b 100644
>> --- a/hw/ppc/spapr_nvdimm.c
>> +++ b/hw/ppc/spapr_nvdimm.c
>> @@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>>       uint64_t lsize = nvdimm->label_size;
>>       uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>>                                               NULL);
>> +    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
>> +                                             NVDIMM_SYNC_DAX_PROP,
>> +                                             &error_abort);
>>   
>>       drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>>       g_assert(drc);
>> @@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>>                                "operating-system")));
>>       _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>>   
>> +    if (!sync_dax) {
>> +        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
>> +                         NULL, 0));
>> +    }
>> +
>>       return child_offset;
>>   }
>>   
>> @@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>                                         target_ulong opcode, target_ulong *args)
>>   {
>>       int ret;
>> +    bool sync_dax;
>>       uint32_t drc_index = args[0];
>>       uint64_t continue_token = args[1];
>>       SpaprDrc *drc = spapr_drc_by_index(drc_index);
>>       PCDIMMDevice *dimm;
>> +    NVDIMMDevice *nvdimm;
>>       HostMemoryBackend *backend = NULL;
>>       SpaprNVDIMMDeviceFlushState *state;
>>       ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
>> @@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>           return H_PARAMETER;
>>       }
>>   
>> +    nvdimm = NVDIMM(drc->dev);
>> +    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
>> +                                        &error_abort);
>> +    if (sync_dax) {
>> +        return H_UNSUPPORTED;
> 
> Do you want to return UNSUPPORTED here, or just H_SUCCESS, since the
> flush should be a no-op in this case.


The reason to handle this as error is to indicate the OS that it is 
using a wrong mechanism to flush.

> 
>> +    }
>> +
>>       if (continue_token != 0) {
>>           ret = spapr_nvdimm_get_flush_status(continue_token);
>>           if (H_IS_LONG_BUSY(ret)) {
>> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
>> index bcf62f825c..f82979cf2f 100644
>> --- a/include/hw/mem/nvdimm.h
>> +++ b/include/hw/mem/nvdimm.h
>> @@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
>>   #define NVDIMM_LABEL_SIZE_PROP "label-size"
>>   #define NVDIMM_UUID_PROP       "uuid"
>>   #define NVDIMM_UNARMED_PROP    "unarmed"
>> +#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
>>   
>>   struct NVDIMMDevice {
>>       /* private */
>> @@ -85,6 +86,15 @@ struct NVDIMMDevice {
>>        */
>>       bool unarmed;
>>   
>> +    /*
>> +     * On PPC64,
>> +     * The 'off' value results in the hcall-flush-required property set
>> +     * in the device tree for pseries machines. When 'off', the guest
>> +     * initiates explicit flush requests to the backend device ensuring
>> +     * write persistence.
>> +     */
>> +    bool sync_dax;
>> +
>>       /*
>>        * The PPC64 - spapr requires each nvdimm device have a uuid.
>>        */
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 7c27fb3e2d..51c35488a4 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -333,6 +333,7 @@ struct SpaprMachineState {
>>   #define H_P7              -60
>>   #define H_P8              -61
>>   #define H_P9              -62
>> +#define H_UNSUPPORTED     -67
>>   #define H_OVERLAP         -68
>>   #define H_UNSUPPORTED_FLAG -256
>>   #define H_MULTI_THREADS_ACTIVE -9005
>>
>>
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-24  4:04       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2021-03-24  4:16 UTC (permalink / raw)
  To: David Gibson, Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, linux-nvdimm, kvm-ppc,
	shivaprasadbhat, bharata

On 3/24/21 8:37 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>> The patch adds support for the SCM flush hcall for the nvdimm devices.
>> To be available for exploitation by guest through the next patch.
>>
>> The hcall expects the semantics such that the flush to return
>> with H_BUSY when the operation is expected to take longer time along
>> with a continue_token. The hcall to be called again providing the
>> continue_token to get the status. So, all fresh requsts are put into
>> a 'pending' list and flush worker is submitted to the thread pool.
>> The thread pool completion callbacks move the requests to 'completed'
>> list, which are cleaned up after reporting to guest in subsequent
>> hcalls to get the status.
>>
>> The semantics makes it necessary to preserve the continue_tokens
>> and their return status even across migrations. So, the pre_save
>> handler for the device waits for the flush worker to complete and
>> collects all the hcall states from 'completed' list. The necessary
>> nvdimm flush specific vmstate structures are added to the spapr
>> machine vmstate.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> 
> An overal question: surely the same issue must arise on x86 with
> file-backed NVDIMMs.  How do they handle this case?

On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 
map and virtio_pmem. Among these virio_pmem always operated with 
synchronous dax disabled and both ACPI and e820 doesn't have the ability 
to differentiate support for synchronous dax.

With that I would expect users to use virtio_pmem when using using file 
backed NVDIMMS

-aneesh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm
@ 2021-03-24  4:09       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 38+ messages in thread
From: Aneesh Kumar K.V @ 2021-03-24  4:21 UTC (permalink / raw)
  To: David Gibson, Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, linux-nvdimm, kvm-ppc,
	shivaprasadbhat, bharata

On 3/24/21 8:39 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:55AM -0400, Shivaprasad G Bhat wrote:
>> The patch adds the 'sync-dax' property to the nvdimm device.
>>
>> When the sync-dax is 'off', the device tree property
>> "hcall-flush-required" is added to the nvdimm node which makes the
>> guest to issue H_SCM_FLUSH hcalls to request for flushes explicitly.
>> This would be the default behaviour without sync-dax property set
>> for the nvdimm device.
>>
>> The sync-dax="on" would mean the guest need not make flush requests
>> to the qemu. On previous machine versions the sync-dax is set to be
>> "on" by default using the hw_compat magic.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>> ---
>>   hw/core/machine.c       |    1 +
>>   hw/mem/nvdimm.c         |    1 +
>>   hw/ppc/spapr_nvdimm.c   |   17 +++++++++++++++++
>>   include/hw/mem/nvdimm.h |   10 ++++++++++
>>   include/hw/ppc/spapr.h  |    1 +
>>   5 files changed, 30 insertions(+)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 257a664ea2..f843643574 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -41,6 +41,7 @@ GlobalProperty hw_compat_5_2[] = {
>>       { "PIIX4_PM", "smm-compat", "on"},
>>       { "virtio-blk-device", "report-discard-granularity", "off" },
>>       { "virtio-net-pci", "vectors", "3"},
>> +    { "nvdimm", "sync-dax", "on" },
>>   };
>>   const size_t hw_compat_5_2_len = G_N_ELEMENTS(hw_compat_5_2);
>>   
>> diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
>> index 7397b67156..8f0e29b191 100644
>> --- a/hw/mem/nvdimm.c
>> +++ b/hw/mem/nvdimm.c
>> @@ -229,6 +229,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
>>   
>>   static Property nvdimm_properties[] = {
>>       DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
>> +    DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
> 
> I'm a bit uncomfortable adding this base NVDIMM property without at
> least some logic about how it's handled on non-PAPR platforms.

yes these should be specific to PAPR. These are there to handle 
migration. with older guest. We can use the backing file to determine 
synchronous dax support. if it is a file backed nvdimm on a fsdax mount 
point, we can do synchronous dax. If it is one on a non dax file system 
synchronous dax can be disabled.

> 
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>   
>> diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
>> index 883317c1ed..dd1c90251b 100644
>> --- a/hw/ppc/spapr_nvdimm.c
>> +++ b/hw/ppc/spapr_nvdimm.c
>> @@ -125,6 +125,9 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>>       uint64_t lsize = nvdimm->label_size;
>>       uint64_t size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
>>                                               NULL);
>> +    bool sync_dax = object_property_get_bool(OBJECT(nvdimm),
>> +                                             NVDIMM_SYNC_DAX_PROP,
>> +                                             &error_abort);
>>   
>>       drc = spapr_drc_by_id(TYPE_SPAPR_DRC_PMEM, slot);
>>       g_assert(drc);
>> @@ -159,6 +162,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void *fdt,
>>                                "operating-system")));
>>       _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
>>   
>> +    if (!sync_dax) {
>> +        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
>> +                         NULL, 0));
>> +    }
>> +
>>       return child_offset;
>>   }
>>   
>> @@ -567,10 +575,12 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>                                         target_ulong opcode, target_ulong *args)
>>   {
>>       int ret;
>> +    bool sync_dax;
>>       uint32_t drc_index = args[0];
>>       uint64_t continue_token = args[1];
>>       SpaprDrc *drc = spapr_drc_by_index(drc_index);
>>       PCDIMMDevice *dimm;
>> +    NVDIMMDevice *nvdimm;
>>       HostMemoryBackend *backend = NULL;
>>       SpaprNVDIMMDeviceFlushState *state;
>>       ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
>> @@ -580,6 +590,13 @@ static target_ulong h_scm_flush(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>           return H_PARAMETER;
>>       }
>>   
>> +    nvdimm = NVDIMM(drc->dev);
>> +    sync_dax = object_property_get_bool(OBJECT(nvdimm), NVDIMM_SYNC_DAX_PROP,
>> +                                        &error_abort);
>> +    if (sync_dax) {
>> +        return H_UNSUPPORTED;
> 
> Do you want to return UNSUPPORTED here, or just H_SUCCESS, since the
> flush should be a no-op in this case.


The reason to handle this as error is to indicate the OS that it is 
using a wrong mechanism to flush.

> 
>> +    }
>> +
>>       if (continue_token != 0) {
>>           ret = spapr_nvdimm_get_flush_status(continue_token);
>>           if (H_IS_LONG_BUSY(ret)) {
>> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
>> index bcf62f825c..f82979cf2f 100644
>> --- a/include/hw/mem/nvdimm.h
>> +++ b/include/hw/mem/nvdimm.h
>> @@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
>>   #define NVDIMM_LABEL_SIZE_PROP "label-size"
>>   #define NVDIMM_UUID_PROP       "uuid"
>>   #define NVDIMM_UNARMED_PROP    "unarmed"
>> +#define NVDIMM_SYNC_DAX_PROP   "sync-dax"
>>   
>>   struct NVDIMMDevice {
>>       /* private */
>> @@ -85,6 +86,15 @@ struct NVDIMMDevice {
>>        */
>>       bool unarmed;
>>   
>> +    /*
>> +     * On PPC64,
>> +     * The 'off' value results in the hcall-flush-required property set
>> +     * in the device tree for pseries machines. When 'off', the guest
>> +     * initiates explicit flush requests to the backend device ensuring
>> +     * write persistence.
>> +     */
>> +    bool sync_dax;
>> +
>>       /*
>>        * The PPC64 - spapr requires each nvdimm device have a uuid.
>>        */
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 7c27fb3e2d..51c35488a4 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -333,6 +333,7 @@ struct SpaprMachineState {
>>   #define H_P7              -60
>>   #define H_P8              -61
>>   #define H_P9              -62
>> +#define H_UNSUPPORTED     -67
>>   #define H_OVERLAP         -68
>>   #define H_UNSUPPORTED_FLAG -256
>>   #define H_MULTI_THREADS_ACTIVE -9005
>>
>>
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-03-24  4:04       ` Aneesh Kumar K.V
  (?)
@ 2021-03-25  1:51         ` David Gibson
  -1 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-25  1:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Shivaprasad G Bhat, sbhat, groug, qemu-ppc, ehabkost,
	marcel.apfelbaum, mst, imammedo, xiaoguangrong.eric, qemu-devel,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata


[-- Attachment #1.1: Type: text/plain, Size: 2249 bytes --]

On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > The patch adds support for the SCM flush hcall for the nvdimm devices.
> > > To be available for exploitation by guest through the next patch.
> > > 
> > > The hcall expects the semantics such that the flush to return
> > > with H_BUSY when the operation is expected to take longer time along
> > > with a continue_token. The hcall to be called again providing the
> > > continue_token to get the status. So, all fresh requsts are put into
> > > a 'pending' list and flush worker is submitted to the thread pool.
> > > The thread pool completion callbacks move the requests to 'completed'
> > > list, which are cleaned up after reporting to guest in subsequent
> > > hcalls to get the status.
> > > 
> > > The semantics makes it necessary to preserve the continue_tokens
> > > and their return status even across migrations. So, the pre_save
> > > handler for the device waits for the flush worker to complete and
> > > collects all the hcall states from 'completed' list. The necessary
> > > nvdimm flush specific vmstate structures are added to the spapr
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> > 
> > An overal question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 map
> and virtio_pmem. Among these virio_pmem always operated with synchronous dax
> disabled and both ACPI and e820 doesn't have the ability to differentiate
> support for synchronous dax.

Ok.  And for the virtio-pmem case, how are the extra flushes actually
done on x86?

> With that I would expect users to use virtio_pmem when using using file
> backed NVDIMMS

So... should we prevent advertising an NVDIMM through ACPI or e820 if
it doesn't have sync-dax enabled?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 167 bytes --]

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-25  1:51         ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-25  1:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: ehabkost, Shivaprasad G Bhat, mst, bharata, linux-nvdimm, groug,
	kvm-ppc, qemu-devel, shivaprasadbhat, qemu-ppc, imammedo, sbhat,
	xiaoguangrong.eric

[-- Attachment #1: Type: text/plain, Size: 2249 bytes --]

On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > The patch adds support for the SCM flush hcall for the nvdimm devices.
> > > To be available for exploitation by guest through the next patch.
> > > 
> > > The hcall expects the semantics such that the flush to return
> > > with H_BUSY when the operation is expected to take longer time along
> > > with a continue_token. The hcall to be called again providing the
> > > continue_token to get the status. So, all fresh requsts are put into
> > > a 'pending' list and flush worker is submitted to the thread pool.
> > > The thread pool completion callbacks move the requests to 'completed'
> > > list, which are cleaned up after reporting to guest in subsequent
> > > hcalls to get the status.
> > > 
> > > The semantics makes it necessary to preserve the continue_tokens
> > > and their return status even across migrations. So, the pre_save
> > > handler for the device waits for the flush worker to complete and
> > > collects all the hcall states from 'completed' list. The necessary
> > > nvdimm flush specific vmstate structures are added to the spapr
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> > 
> > An overal question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 map
> and virtio_pmem. Among these virio_pmem always operated with synchronous dax
> disabled and both ACPI and e820 doesn't have the ability to differentiate
> support for synchronous dax.

Ok.  And for the virtio-pmem case, how are the extra flushes actually
done on x86?

> With that I would expect users to use virtio_pmem when using using file
> backed NVDIMMS

So... should we prevent advertising an NVDIMM through ACPI or e820 if
it doesn't have sync-dax enabled?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-25  1:51         ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-25  1:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Shivaprasad G Bhat, sbhat, groug, qemu-ppc, ehabkost,
	marcel.apfelbaum, mst, imammedo, xiaoguangrong.eric, qemu-devel,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata

[-- Attachment #1: Type: text/plain, Size: 2249 bytes --]

On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > The patch adds support for the SCM flush hcall for the nvdimm devices.
> > > To be available for exploitation by guest through the next patch.
> > > 
> > > The hcall expects the semantics such that the flush to return
> > > with H_BUSY when the operation is expected to take longer time along
> > > with a continue_token. The hcall to be called again providing the
> > > continue_token to get the status. So, all fresh requsts are put into
> > > a 'pending' list and flush worker is submitted to the thread pool.
> > > The thread pool completion callbacks move the requests to 'completed'
> > > list, which are cleaned up after reporting to guest in subsequent
> > > hcalls to get the status.
> > > 
> > > The semantics makes it necessary to preserve the continue_tokens
> > > and their return status even across migrations. So, the pre_save
> > > handler for the device waits for the flush worker to complete and
> > > collects all the hcall states from 'completed' list. The necessary
> > > nvdimm flush specific vmstate structures are added to the spapr
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> > 
> > An overal question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 map
> and virtio_pmem. Among these virio_pmem always operated with synchronous dax
> disabled and both ACPI and e820 doesn't have the ability to differentiate
> support for synchronous dax.

Ok.  And for the virtio-pmem case, how are the extra flushes actually
done on x86?

> With that I would expect users to use virtio_pmem when using using file
> backed NVDIMMS

So... should we prevent advertising an NVDIMM through ACPI or e820 if
it doesn't have sync-dax enabled?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-03-25  1:51         ` David Gibson
  (?)
@ 2021-03-26 13:45           ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-26 13:45 UTC (permalink / raw)
  To: David Gibson, Aneesh Kumar K.V
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, linux-nvdimm, kvm-ppc,
	shivaprasadbhat, bharata

On 3/25/21 7:21 AM, David Gibson wrote:
> On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:
>> On 3/24/21 8:37 AM, David Gibson wrote:
>>> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>>>> The patch adds support for the SCM flush hcall for the nvdimm devices.
...
>>>> collects all the hcall states from 'completed' list. The necessary
>>>> nvdimm flush specific vmstate structures are added to the spapr
>>>> machine vmstate.
>>>>
>>>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>>> An overal question: surely the same issue must arise on x86 with
>>> file-backed NVDIMMs.  How do they handle this case?
>> On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 map
>> and virtio_pmem. Among these virio_pmem always operated with synchronous dax
>> disabled and both ACPI and e820 doesn't have the ability to differentiate
>> support for synchronous dax.
> Ok.  And for the virtio-pmem case, how are the extra flushes actually
> done on x86?


virtio-pmem device has virtqueue with virtio_pmem_flush() as the handler

which gets called for all flush requests from guest. virtio_pmem_flush() is

offloading the flush to thread pool with a worker doing fsync() and the

completion callback notifying the guest with response.


>> With that I would expect users to use virtio_pmem when using using file
>> backed NVDIMMS
> So... should we prevent advertising an NVDIMM through ACPI or e820 if
> it doesn't have sync-dax enabled?


Is it possible to have different defaults for sync-dax based on 
architecture ?

The behaviour on x86 is sync-dax=on for nvdimms. So, it would be correct to

have the default as "on" for x86. For pseries -  "off" for new machines.

Looking at code, I didnt find much ways to achieve this. Can you suggest

what can be done ?
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-26 13:45           ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-26 13:45 UTC (permalink / raw)
  To: David Gibson, Aneesh Kumar K.V
  Cc: ehabkost, mst, bharata, linux-nvdimm, groug, kvm-ppc, qemu-devel,
	shivaprasadbhat, qemu-ppc, imammedo, sbhat, xiaoguangrong.eric

On 3/25/21 7:21 AM, David Gibson wrote:
> On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:
>> On 3/24/21 8:37 AM, David Gibson wrote:
>>> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>>>> The patch adds support for the SCM flush hcall for the nvdimm devices.
...
>>>> collects all the hcall states from 'completed' list. The necessary
>>>> nvdimm flush specific vmstate structures are added to the spapr
>>>> machine vmstate.
>>>>
>>>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>>> An overal question: surely the same issue must arise on x86 with
>>> file-backed NVDIMMs.  How do they handle this case?
>> On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 map
>> and virtio_pmem. Among these virio_pmem always operated with synchronous dax
>> disabled and both ACPI and e820 doesn't have the ability to differentiate
>> support for synchronous dax.
> Ok.  And for the virtio-pmem case, how are the extra flushes actually
> done on x86?


virtio-pmem device has virtqueue with virtio_pmem_flush() as the handler

which gets called for all flush requests from guest. virtio_pmem_flush() is

offloading the flush to thread pool with a worker doing fsync() and the

completion callback notifying the guest with response.


>> With that I would expect users to use virtio_pmem when using using file
>> backed NVDIMMS
> So... should we prevent advertising an NVDIMM through ACPI or e820 if
> it doesn't have sync-dax enabled?


Is it possible to have different defaults for sync-dax based on 
architecture ?

The behaviour on x86 is sync-dax=on for nvdimms. So, it would be correct to

have the default as "on" for x86. For pseries -  "off" for new machines.

Looking at code, I didnt find much ways to achieve this. Can you suggest

what can be done ?



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-26 13:45           ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-26 13:57 UTC (permalink / raw)
  To: David Gibson, Aneesh Kumar K.V
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, linux-nvdimm, kvm-ppc,
	shivaprasadbhat, bharata

On 3/25/21 7:21 AM, David Gibson wrote:
> On Wed, Mar 24, 2021 at 09:34:06AM +0530, Aneesh Kumar K.V wrote:
>> On 3/24/21 8:37 AM, David Gibson wrote:
>>> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>>>> The patch adds support for the SCM flush hcall for the nvdimm devices.
...
>>>> collects all the hcall states from 'completed' list. The necessary
>>>> nvdimm flush specific vmstate structures are added to the spapr
>>>> machine vmstate.
>>>>
>>>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>>> An overal question: surely the same issue must arise on x86 with
>>> file-backed NVDIMMs.  How do they handle this case?
>> On x86 we have different ways nvdimm can be discovered. ACPI NFIT, e820 map
>> and virtio_pmem. Among these virio_pmem always operated with synchronous dax
>> disabled and both ACPI and e820 doesn't have the ability to differentiate
>> support for synchronous dax.
> Ok.  And for the virtio-pmem case, how are the extra flushes actually
> done on x86?


virtio-pmem device has virtqueue with virtio_pmem_flush() as the handler

which gets called for all flush requests from guest. virtio_pmem_flush() is

offloading the flush to thread pool with a worker doing fsync() and the

completion callback notifying the guest with response.


>> With that I would expect users to use virtio_pmem when using using file
>> backed NVDIMMS
> So... should we prevent advertising an NVDIMM through ACPI or e820 if
> it doesn't have sync-dax enabled?


Is it possible to have different defaults for sync-dax based on 
architecture ?

The behaviour on x86 is sync-dax=on for nvdimms. So, it would be correct to

have the default as "on" for x86. For pseries -  "off" for new machines.

Looking at code, I didnt find much ways to achieve this. Can you suggest

what can be done ?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-03-24  3:07     ` David Gibson
@ 2021-03-29  9:23       ` Shivaprasad G Bhat
  -1 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-29  9:23 UTC (permalink / raw)
  To: David Gibson
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata


On 3/24/21 8:37 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>> machine vmstate.
>>
>> Signed-off-by: Shivaprasad G Bhat<sbhat@linux.ibm.com>
> An overal question: surely the same issue must arise on x86 with
> file-backed NVDIMMs.  How do they handle this case?

Discussed in other threads..

....

>>   };
>> @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
>>       }
>>   
>>       qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
>> +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
> Do you actually need an extra mutex, or can you rely on the BQL?

I verified BQL is held at all places where it matters in the context of 
this patch.

Safe to get rid of this extra mutex.

...

>
>> +{
>> +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +
>> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
>> +             !QLIST_EMPTY(&spapr->completed_flush_states));
>> +}
>> +
>> +static int spapr_nvdimm_pre_save(void *opaque)
>> +{
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +
>> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
>> +        aio_poll(qemu_get_aio_context(), true);
> Hmm... how long could waiting for all the pending flushes to complete
> take?  This could add substanially to the guest's migration downtime,
> couldn't it?


The time taken depends on the number of dirtied pages and the disk io 
write speed. The number of dirty pages on host is configureable with 
tunables vm.dirty_background_ratio (10% default on Fedora 32, Ubuntu 
20.04), vm.dirty_ratio(20%) of host memory and|or 
vm.dirty_expire_centisecs(30 seconds). So, the host itself would be 
flushing the mmaped file on its own from time to time. For guests using 
the nvdimms with filesystem, the flushes would have come frequently and 
the number of dirty pages might be less. The pmem applications can use 
the nvdimms without a filesystem. And for such guests, the chances that 
a flush request can come from pmem applications at the time of migration 
is less or is random. But, the host would have flushed the pagecache on 
its own when vm.dirty_background_ratio is crossed or 
vm.dirty_expire_centisecs expired. So, the worst case would stands at 
disk io latency for writing the dirtied pages in the last 
vm.dirty_expire_centisecs on host OR latency for writing maximum 
vm.dirty_background_ratio(10%) of host RAM. If you want me to calibrate 
any particular size, scenario and get the numbers please let me know.

...
>> +
>> +/*
>> + * Acquire a unique token and reserve it for the new flush state.
>> + */
>> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
>> +{
>> +    Error *err = NULL;
>> +    uint64_t token;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
>> +
>> +    state = g_malloc0(sizeof(*state));
>> +
>> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
>> +retry:
>> +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> Using getrandom seems like overkill, why not just use a counter?

I didnt want a spurious guest to abuse by consuming the return value 
providing

a valid "guess-able" counter and the real driver failing subsequently. 
Also, across

guest migrations carrying the global counter to destination is another 
thing to ponder.


Let me know if you want me to reconsider using counter.

...

>> mm_flush_states_lock);
>> +
>> +    return state;
>> +}
>> +
>> +/*
>> + * spapr_nvdimm_finish_flushes
>> + *      Waits for all pending flush requests to complete
>> + *      their execution and free the states
>> + */
>> +void spapr_nvdimm_finish_flushes(void)
>> +{
>> +    SpaprNVDIMMDeviceFlushState *state, *next;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> The caller has natural access to the machine, so pass it in rather
> than using the global.

okay

...

>> +
>> +/*
>> + * spapr_nvdimm_get_hcall_status
>> + *      Fetches the status of the hcall worker and returns H_BUSY
>> + *      if the worker is still running.
>> + */
>> +static int spapr_nvdimm_get_flush_status(uint64_t token)
>> +{
>> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> The callers have natural access to spapr, so pass it in rather than
> using the global.

Okay

...

>> +
>> +/*
>> + * H_SCM_FLUSH
>> + * Input: drc_index, continue-token
>> + * Out: continue-token
>> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
>> + *
>> + * Given a DRC Index Flush the data to backend NVDIMM device.
>> + * The hcall returns H_BUSY when the flush takes longer time and the hcall
> It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
> it?

Yes. I thought its okay to call it just H_BUSY in a generic way. Will 
fix it.


>> + * needs to be issued multiple times in order to be completely serviced.
>> +        }
>> +
>> +        return ret;
>> +    }
>> +
>> +    dimm = PC_DIMM(drc->dev);
>> +    backend = MEMORY_BACKEND(dimm->hostmem);
>> +
>> +    state = spapr_nvdimm_init_new_flush_state();
>> +    if (!state) {
>> +        return H_P2;
> AFAICT the only way init_new_flush_state() fails is a failure in the
> RNG, which definitely isn't a parameter problem.

Will change it to H_HARDWARE.


>> +    }
>> +
>> +    state->backend_fd = memory_region_get_fd(&backend->mr);
> Is this guaranteed to return a usable fd in all configurations?

Right, for memory-backend-ram this wont work. I think we should

not set the hcall-flush-required too for memory-backend-ram. Will fix this.

>> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
>> +                           spapr_nvdimm_flush_completion_cb, state);
>> +
>> +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
>> +    if (H_IS_LONG_BUSY(ret)) {
>> +        args[0] = state->continue_token;
>> +    }
>> +
>> +    return ret;
> I believe you can rearrange this so the get_flush_status / check /
> return is shared between the args[0] == 0 and args[0] == token paths.
okay.

Thanks,

Shiva

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-29  9:23       ` Shivaprasad G Bhat
  0 siblings, 0 replies; 38+ messages in thread
From: Shivaprasad G Bhat @ 2021-03-29  9:23 UTC (permalink / raw)
  To: David Gibson
  Cc: ehabkost, mst, aneesh.kumar, bharata, linux-nvdimm, groug,
	kvm-ppc, qemu-devel, shivaprasadbhat, qemu-ppc, imammedo, sbhat,
	xiaoguangrong.eric

[-- Attachment #1: Type: text/plain, Size: 6193 bytes --]


On 3/24/21 8:37 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>> machine vmstate.
>>
>> Signed-off-by: Shivaprasad G Bhat<sbhat@linux.ibm.com>
> An overal question: surely the same issue must arise on x86 with
> file-backed NVDIMMs.  How do they handle this case?

Discussed in other threads..

....

>>   };
>> @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
>>       }
>>   
>>       qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
>> +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
> Do you actually need an extra mutex, or can you rely on the BQL?

I verified BQL is held at all places where it matters in the context of 
this patch.

Safe to get rid of this extra mutex.

...

>
>> +{
>> +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +
>> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
>> +             !QLIST_EMPTY(&spapr->completed_flush_states));
>> +}
>> +
>> +static int spapr_nvdimm_pre_save(void *opaque)
>> +{
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +
>> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
>> +        aio_poll(qemu_get_aio_context(), true);
> Hmm... how long could waiting for all the pending flushes to complete
> take?  This could add substanially to the guest's migration downtime,
> couldn't it?


The time taken depends on the number of dirtied pages and the disk io 
write speed. The number of dirty pages on host is configureable with 
tunables vm.dirty_background_ratio (10% default on Fedora 32, Ubuntu 
20.04), vm.dirty_ratio(20%) of host memory and|or 
vm.dirty_expire_centisecs(30 seconds). So, the host itself would be 
flushing the mmaped file on its own from time to time. For guests using 
the nvdimms with filesystem, the flushes would have come frequently and 
the number of dirty pages might be less. The pmem applications can use 
the nvdimms without a filesystem. And for such guests, the chances that 
a flush request can come from pmem applications at the time of migration 
is less or is random. But, the host would have flushed the pagecache on 
its own when vm.dirty_background_ratio is crossed or 
vm.dirty_expire_centisecs expired. So, the worst case would stands at 
disk io latency for writing the dirtied pages in the last 
vm.dirty_expire_centisecs on host OR latency for writing maximum 
vm.dirty_background_ratio(10%) of host RAM. If you want me to calibrate 
any particular size, scenario and get the numbers please let me know.

...
>> +
>> +/*
>> + * Acquire a unique token and reserve it for the new flush state.
>> + */
>> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
>> +{
>> +    Error *err = NULL;
>> +    uint64_t token;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
>> +
>> +    state = g_malloc0(sizeof(*state));
>> +
>> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
>> +retry:
>> +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> Using getrandom seems like overkill, why not just use a counter?

I didnt want a spurious guest to abuse by consuming the return value 
providing

a valid "guess-able" counter and the real driver failing subsequently. 
Also, across

guest migrations carrying the global counter to destination is another 
thing to ponder.


Let me know if you want me to reconsider using counter.

...

>> mm_flush_states_lock);
>> +
>> +    return state;
>> +}
>> +
>> +/*
>> + * spapr_nvdimm_finish_flushes
>> + *      Waits for all pending flush requests to complete
>> + *      their execution and free the states
>> + */
>> +void spapr_nvdimm_finish_flushes(void)
>> +{
>> +    SpaprNVDIMMDeviceFlushState *state, *next;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> The caller has natural access to the machine, so pass it in rather
> than using the global.

okay

...

>> +
>> +/*
>> + * spapr_nvdimm_get_hcall_status
>> + *      Fetches the status of the hcall worker and returns H_BUSY
>> + *      if the worker is still running.
>> + */
>> +static int spapr_nvdimm_get_flush_status(uint64_t token)
>> +{
>> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> The callers have natural access to spapr, so pass it in rather than
> using the global.

Okay

...

>> +
>> +/*
>> + * H_SCM_FLUSH
>> + * Input: drc_index, continue-token
>> + * Out: continue-token
>> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
>> + *
>> + * Given a DRC Index Flush the data to backend NVDIMM device.
>> + * The hcall returns H_BUSY when the flush takes longer time and the hcall
> It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
> it?

Yes. I thought its okay to call it just H_BUSY in a generic way. Will 
fix it.


>> + * needs to be issued multiple times in order to be completely serviced.
>> +        }
>> +
>> +        return ret;
>> +    }
>> +
>> +    dimm = PC_DIMM(drc->dev);
>> +    backend = MEMORY_BACKEND(dimm->hostmem);
>> +
>> +    state = spapr_nvdimm_init_new_flush_state();
>> +    if (!state) {
>> +        return H_P2;
> AFAICT the only way init_new_flush_state() fails is a failure in the
> RNG, which definitely isn't a parameter problem.

Will change it to H_HARDWARE.


>> +    }
>> +
>> +    state->backend_fd = memory_region_get_fd(&backend->mr);
> Is this guaranteed to return a usable fd in all configurations?

Right, for memory-backend-ram this wont work. I think we should

not set the hcall-flush-required too for memory-backend-ram. Will fix this.

>> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
>> +                           spapr_nvdimm_flush_completion_cb, state);
>> +
>> +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
>> +    if (H_IS_LONG_BUSY(ret)) {
>> +        args[0] = state->continue_token;
>> +    }
>> +
>> +    return ret;
> I believe you can rearrange this so the get_flush_status / check /
> return is shared between the args[0] == 0 and args[0] == token paths.
okay.

Thanks,

Shiva


[-- Attachment #2: Type: text/html, Size: 14194 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
  2021-03-29  9:23       ` Shivaprasad G Bhat
  (?)
@ 2021-03-30 23:57         ` David Gibson
  -1 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-30 23:57 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata


[-- Attachment #1.1: Type: text/plain, Size: 8123 bytes --]

On Mon, Mar 29, 2021 at 02:53:47PM +0530, Shivaprasad G Bhat wrote:
> 
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat<sbhat@linux.ibm.com>
> > An overal question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> Discussed in other threads..
> 
> ....
> 
> > >   };
> > > @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
> > >       }
> > >       qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> > > +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
> > Do you actually need an extra mutex, or can you rely on the BQL?
> 
> I verified BQL is held at all places where it matters in the context of this
> patch.
> 
> Safe to get rid of this extra mutex.
> 
> ...
> 
> > 
> > > +{
> > > +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> > > +             !QLIST_EMPTY(&spapr->completed_flush_states));
> > > +}
> > > +
> > > +static int spapr_nvdimm_pre_save(void *opaque)
> > > +{
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> > > +        aio_poll(qemu_get_aio_context(), true);
> > Hmm... how long could waiting for all the pending flushes to complete
> > take?  This could add substanially to the guest's migration downtime,
> > couldn't it?
> 
> 
> The time taken depends on the number of dirtied pages and the disk io write
> speed. The number of dirty pages on host is configureable with tunables

Well, sure, I'm just trying to get an order of magnitude here.

> vm.dirty_background_ratio (10% default on Fedora 32, Ubuntu 20.04),
> vm.dirty_ratio(20%) of host memory and|or vm.dirty_expire_centisecs(30
> seconds). So, the host itself would be flushing the mmaped file on its own
> from time to time. For guests using the nvdimms with filesystem, the flushes
> would have come frequently and the number of dirty pages might be
> less. The

I'm not sure that necessarily follows.

> pmem applications can use the nvdimms without a filesystem. And for such
> guests, the chances that a flush request can come from pmem applications at
> the time of migration is less or is random. But, the host would have flushed
> the pagecache on its own when vm.dirty_background_ratio is crossed or
> vm.dirty_expire_centisecs expired. So, the worst case would stands at disk
> io latency for writing the dirtied pages in the last
> vm.dirty_expire_centisecs on host OR latency for writing maximum
> vm.dirty_background_ratio(10%) of host RAM. If you want me to calibrate any
> particular size, scenario and get the numbers please let me know.

Hmm.  I feel like this could still easily be 10s, maybe 100s of
milliseconds, yes?  Given that typical migration downtime caps are
also in the 100s of milliseconds, this still seems troublesome.  Since
this is part of the device migration itself, this flushing will all
happen during the downtime, but won't be factored into the downtime
estimations.

> ...
> > > +
> > > +/*
> > > + * Acquire a unique token and reserve it for the new flush state.
> > > + */
> > > +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
> > > +{
> > > +    Error *err = NULL;
> > > +    uint64_t token;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
> > > +
> > > +    state = g_malloc0(sizeof(*state));
> > > +
> > > +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> > > +retry:
> > > +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> > Using getrandom seems like overkill, why not just use a counter?
> 
> I didnt want a spurious guest to abuse by consuming the return value
> providing
> 
> a valid "guess-able" counter and the real driver failing
> subsequently.

Why would a guessable value be bad?  The counter would be per-guest,
so AFAICT all a malicious guest could do is mess itself up.

> Also,
> across
> 
> guest migrations carrying the global counter to destination is another thing
> to ponder.

I don't think you need to: if there are pending flushes on migration
you can set the dest counter to the max id of those, if not you can
reset it to 1 without harm.

Actually.. come to think of it, can't you just use the current max id
of pending flushes + 1 as a new id.


> Let me know if you want me to reconsider using counter.
> 
> ...
> 
> > > mm_flush_states_lock);
> > > +
> > > +    return state;
> > > +}
> > > +
> > > +/*
> > > + * spapr_nvdimm_finish_flushes
> > > + *      Waits for all pending flush requests to complete
> > > + *      their execution and free the states
> > > + */
> > > +void spapr_nvdimm_finish_flushes(void)
> > > +{
> > > +    SpaprNVDIMMDeviceFlushState *state, *next;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > The caller has natural access to the machine, so pass it in rather
> > than using the global.
> 
> okay
> 
> ...
> 
> > > +
> > > +/*
> > > + * spapr_nvdimm_get_hcall_status
> > > + *      Fetches the status of the hcall worker and returns H_BUSY
> > > + *      if the worker is still running.
> > > + */
> > > +static int spapr_nvdimm_get_flush_status(uint64_t token)
> > > +{
> > > +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > The callers have natural access to spapr, so pass it in rather than
> > using the global.
> 
> Okay
> 
> ...
> 
> > > +
> > > +/*
> > > + * H_SCM_FLUSH
> > > + * Input: drc_index, continue-token
> > > + * Out: continue-token
> > > + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
> > > + *
> > > + * Given a DRC Index Flush the data to backend NVDIMM device.
> > > + * The hcall returns H_BUSY when the flush takes longer time and the hcall
> > It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
> > it?
> 
> Yes. I thought its okay to call it just H_BUSY in a generic way. Will fix
> it.
> 
> 
> > > + * needs to be issued multiple times in order to be completely serviced.
> > > +        }
> > > +
> > > +        return ret;
> > > +    }
> > > +
> > > +    dimm = PC_DIMM(drc->dev);
> > > +    backend = MEMORY_BACKEND(dimm->hostmem);
> > > +
> > > +    state = spapr_nvdimm_init_new_flush_state();
> > > +    if (!state) {
> > > +        return H_P2;
> > AFAICT the only way init_new_flush_state() fails is a failure in the
> > RNG, which definitely isn't a parameter problem.
> 
> Will change it to H_HARDWARE.
> 
> 
> > > +    }
> > > +
> > > +    state->backend_fd = memory_region_get_fd(&backend->mr);
> > Is this guaranteed to return a usable fd in all configurations?
> 
> Right, for memory-backend-ram this wont work. I think we should
> 
> not set the hcall-flush-required too for memory-backend-ram. Will fix this.

Right, but it's good to be defensive here.  I think a bad guest could
initiate this path even if it's not supposed to yes?

> > > +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> > > +                           spapr_nvdimm_flush_completion_cb, state);
> > > +
> > > +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
> > > +    if (H_IS_LONG_BUSY(ret)) {
> > > +        args[0] = state->continue_token;
> > > +    }
> > > +
> > > +    return ret;
> > I believe you can rearrange this so the get_flush_status / check /
> > return is shared between the args[0] == 0 and args[0] == token paths.
> okay.
> 
> Thanks,
> 
> Shiva
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 167 bytes --]

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-30 23:57         ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-30 23:57 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: ehabkost, mst, aneesh.kumar, bharata, linux-nvdimm, groug,
	kvm-ppc, qemu-devel, shivaprasadbhat, qemu-ppc, imammedo, sbhat,
	xiaoguangrong.eric

[-- Attachment #1: Type: text/plain, Size: 8123 bytes --]

On Mon, Mar 29, 2021 at 02:53:47PM +0530, Shivaprasad G Bhat wrote:
> 
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat<sbhat@linux.ibm.com>
> > An overal question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> Discussed in other threads..
> 
> ....
> 
> > >   };
> > > @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
> > >       }
> > >       qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> > > +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
> > Do you actually need an extra mutex, or can you rely on the BQL?
> 
> I verified BQL is held at all places where it matters in the context of this
> patch.
> 
> Safe to get rid of this extra mutex.
> 
> ...
> 
> > 
> > > +{
> > > +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> > > +             !QLIST_EMPTY(&spapr->completed_flush_states));
> > > +}
> > > +
> > > +static int spapr_nvdimm_pre_save(void *opaque)
> > > +{
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> > > +        aio_poll(qemu_get_aio_context(), true);
> > Hmm... how long could waiting for all the pending flushes to complete
> > take?  This could add substanially to the guest's migration downtime,
> > couldn't it?
> 
> 
> The time taken depends on the number of dirtied pages and the disk io write
> speed. The number of dirty pages on host is configureable with tunables

Well, sure, I'm just trying to get an order of magnitude here.

> vm.dirty_background_ratio (10% default on Fedora 32, Ubuntu 20.04),
> vm.dirty_ratio(20%) of host memory and|or vm.dirty_expire_centisecs(30
> seconds). So, the host itself would be flushing the mmaped file on its own
> from time to time. For guests using the nvdimms with filesystem, the flushes
> would have come frequently and the number of dirty pages might be
> less. The

I'm not sure that necessarily follows.

> pmem applications can use the nvdimms without a filesystem. And for such
> guests, the chances that a flush request can come from pmem applications at
> the time of migration is less or is random. But, the host would have flushed
> the pagecache on its own when vm.dirty_background_ratio is crossed or
> vm.dirty_expire_centisecs expired. So, the worst case would stands at disk
> io latency for writing the dirtied pages in the last
> vm.dirty_expire_centisecs on host OR latency for writing maximum
> vm.dirty_background_ratio(10%) of host RAM. If you want me to calibrate any
> particular size, scenario and get the numbers please let me know.

Hmm.  I feel like this could still easily be 10s, maybe 100s of
milliseconds, yes?  Given that typical migration downtime caps are
also in the 100s of milliseconds, this still seems troublesome.  Since
this is part of the device migration itself, this flushing will all
happen during the downtime, but won't be factored into the downtime
estimations.

> ...
> > > +
> > > +/*
> > > + * Acquire a unique token and reserve it for the new flush state.
> > > + */
> > > +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
> > > +{
> > > +    Error *err = NULL;
> > > +    uint64_t token;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
> > > +
> > > +    state = g_malloc0(sizeof(*state));
> > > +
> > > +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> > > +retry:
> > > +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> > Using getrandom seems like overkill, why not just use a counter?
> 
> I didnt want a spurious guest to abuse by consuming the return value
> providing
> 
> a valid "guess-able" counter and the real driver failing
> subsequently.

Why would a guessable value be bad?  The counter would be per-guest,
so AFAICT all a malicious guest could do is mess itself up.

> Also,
> across
> 
> guest migrations carrying the global counter to destination is another thing
> to ponder.

I don't think you need to: if there are pending flushes on migration
you can set the dest counter to the max id of those, if not you can
reset it to 1 without harm.

Actually.. come to think of it, can't you just use the current max id
of pending flushes + 1 as a new id.


> Let me know if you want me to reconsider using counter.
> 
> ...
> 
> > > mm_flush_states_lock);
> > > +
> > > +    return state;
> > > +}
> > > +
> > > +/*
> > > + * spapr_nvdimm_finish_flushes
> > > + *      Waits for all pending flush requests to complete
> > > + *      their execution and free the states
> > > + */
> > > +void spapr_nvdimm_finish_flushes(void)
> > > +{
> > > +    SpaprNVDIMMDeviceFlushState *state, *next;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > The caller has natural access to the machine, so pass it in rather
> > than using the global.
> 
> okay
> 
> ...
> 
> > > +
> > > +/*
> > > + * spapr_nvdimm_get_hcall_status
> > > + *      Fetches the status of the hcall worker and returns H_BUSY
> > > + *      if the worker is still running.
> > > + */
> > > +static int spapr_nvdimm_get_flush_status(uint64_t token)
> > > +{
> > > +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > The callers have natural access to spapr, so pass it in rather than
> > using the global.
> 
> Okay
> 
> ...
> 
> > > +
> > > +/*
> > > + * H_SCM_FLUSH
> > > + * Input: drc_index, continue-token
> > > + * Out: continue-token
> > > + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
> > > + *
> > > + * Given a DRC Index Flush the data to backend NVDIMM device.
> > > + * The hcall returns H_BUSY when the flush takes longer time and the hcall
> > It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
> > it?
> 
> Yes. I thought its okay to call it just H_BUSY in a generic way. Will fix
> it.
> 
> 
> > > + * needs to be issued multiple times in order to be completely serviced.
> > > +        }
> > > +
> > > +        return ret;
> > > +    }
> > > +
> > > +    dimm = PC_DIMM(drc->dev);
> > > +    backend = MEMORY_BACKEND(dimm->hostmem);
> > > +
> > > +    state = spapr_nvdimm_init_new_flush_state();
> > > +    if (!state) {
> > > +        return H_P2;
> > AFAICT the only way init_new_flush_state() fails is a failure in the
> > RNG, which definitely isn't a parameter problem.
> 
> Will change it to H_HARDWARE.
> 
> 
> > > +    }
> > > +
> > > +    state->backend_fd = memory_region_get_fd(&backend->mr);
> > Is this guaranteed to return a usable fd in all configurations?
> 
> Right, for memory-backend-ram this wont work. I think we should
> 
> not set the hcall-flush-required too for memory-backend-ram. Will fix this.

Right, but it's good to be defensive here.  I think a bad guest could
initiate this path even if it's not supposed to yes?

> > > +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> > > +                           spapr_nvdimm_flush_completion_cb, state);
> > > +
> > > +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
> > > +    if (H_IS_LONG_BUSY(ret)) {
> > > +        args[0] = state->continue_token;
> > > +    }
> > > +
> > > +    return ret;
> > I believe you can rearrange this so the get_flush_status / check /
> > return is shared between the args[0] == 0 and args[0] == token paths.
> okay.
> 
> Thanks,
> 
> Shiva
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
@ 2021-03-30 23:57         ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-30 23:57 UTC (permalink / raw)
  To: Shivaprasad G Bhat
  Cc: sbhat, groug, qemu-ppc, ehabkost, marcel.apfelbaum, mst,
	imammedo, xiaoguangrong.eric, qemu-devel, aneesh.kumar,
	linux-nvdimm, kvm-ppc, shivaprasadbhat, bharata

[-- Attachment #1: Type: text/plain, Size: 8123 bytes --]

On Mon, Mar 29, 2021 at 02:53:47PM +0530, Shivaprasad G Bhat wrote:
> 
> On 3/24/21 8:37 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
> > > machine vmstate.
> > > 
> > > Signed-off-by: Shivaprasad G Bhat<sbhat@linux.ibm.com>
> > An overal question: surely the same issue must arise on x86 with
> > file-backed NVDIMMs.  How do they handle this case?
> 
> Discussed in other threads..
> 
> ....
> 
> > >   };
> > > @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
> > >       }
> > >       qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
> > > +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
> > Do you actually need an extra mutex, or can you rely on the BQL?
> 
> I verified BQL is held at all places where it matters in the context of this
> patch.
> 
> Safe to get rid of this extra mutex.
> 
> ...
> 
> > 
> > > +{
> > > +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
> > > +             !QLIST_EMPTY(&spapr->completed_flush_states));
> > > +}
> > > +
> > > +static int spapr_nvdimm_pre_save(void *opaque)
> > > +{
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +
> > > +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
> > > +        aio_poll(qemu_get_aio_context(), true);
> > Hmm... how long could waiting for all the pending flushes to complete
> > take?  This could add substanially to the guest's migration downtime,
> > couldn't it?
> 
> 
> The time taken depends on the number of dirtied pages and the disk io write
> speed. The number of dirty pages on host is configureable with tunables

Well, sure, I'm just trying to get an order of magnitude here.

> vm.dirty_background_ratio (10% default on Fedora 32, Ubuntu 20.04),
> vm.dirty_ratio(20%) of host memory and|or vm.dirty_expire_centisecs(30
> seconds). So, the host itself would be flushing the mmaped file on its own
> from time to time. For guests using the nvdimms with filesystem, the flushes
> would have come frequently and the number of dirty pages might be
> less. The

I'm not sure that necessarily follows.

> pmem applications can use the nvdimms without a filesystem. And for such
> guests, the chances that a flush request can come from pmem applications at
> the time of migration is less or is random. But, the host would have flushed
> the pagecache on its own when vm.dirty_background_ratio is crossed or
> vm.dirty_expire_centisecs expired. So, the worst case would stands at disk
> io latency for writing the dirtied pages in the last
> vm.dirty_expire_centisecs on host OR latency for writing maximum
> vm.dirty_background_ratio(10%) of host RAM. If you want me to calibrate any
> particular size, scenario and get the numbers please let me know.

Hmm.  I feel like this could still easily be 10s, maybe 100s of
milliseconds, yes?  Given that typical migration downtime caps are
also in the 100s of milliseconds, this still seems troublesome.  Since
this is part of the device migration itself, this flushing will all
happen during the downtime, but won't be factored into the downtime
estimations.

> ...
> > > +
> > > +/*
> > > + * Acquire a unique token and reserve it for the new flush state.
> > > + */
> > > +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
> > > +{
> > > +    Error *err = NULL;
> > > +    uint64_t token;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > > +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
> > > +
> > > +    state = g_malloc0(sizeof(*state));
> > > +
> > > +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
> > > +retry:
> > > +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> > Using getrandom seems like overkill, why not just use a counter?
> 
> I didnt want a spurious guest to abuse by consuming the return value
> providing
> 
> a valid "guess-able" counter and the real driver failing
> subsequently.

Why would a guessable value be bad?  The counter would be per-guest,
so AFAICT all a malicious guest could do is mess itself up.

> Also,
> across
> 
> guest migrations carrying the global counter to destination is another thing
> to ponder.

I don't think you need to: if there are pending flushes on migration
you can set the dest counter to the max id of those, if not you can
reset it to 1 without harm.

Actually.. come to think of it, can't you just use the current max id
of pending flushes + 1 as a new id.


> Let me know if you want me to reconsider using counter.
> 
> ...
> 
> > > mm_flush_states_lock);
> > > +
> > > +    return state;
> > > +}
> > > +
> > > +/*
> > > + * spapr_nvdimm_finish_flushes
> > > + *      Waits for all pending flush requests to complete
> > > + *      their execution and free the states
> > > + */
> > > +void spapr_nvdimm_finish_flushes(void)
> > > +{
> > > +    SpaprNVDIMMDeviceFlushState *state, *next;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > The caller has natural access to the machine, so pass it in rather
> > than using the global.
> 
> okay
> 
> ...
> 
> > > +
> > > +/*
> > > + * spapr_nvdimm_get_hcall_status
> > > + *      Fetches the status of the hcall worker and returns H_BUSY
> > > + *      if the worker is still running.
> > > + */
> > > +static int spapr_nvdimm_get_flush_status(uint64_t token)
> > > +{
> > > +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
> > > +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> > The callers have natural access to spapr, so pass it in rather than
> > using the global.
> 
> Okay
> 
> ...
> 
> > > +
> > > +/*
> > > + * H_SCM_FLUSH
> > > + * Input: drc_index, continue-token
> > > + * Out: continue-token
> > > + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
> > > + *
> > > + * Given a DRC Index Flush the data to backend NVDIMM device.
> > > + * The hcall returns H_BUSY when the flush takes longer time and the hcall
> > It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
> > it?
> 
> Yes. I thought its okay to call it just H_BUSY in a generic way. Will fix
> it.
> 
> 
> > > + * needs to be issued multiple times in order to be completely serviced.
> > > +        }
> > > +
> > > +        return ret;
> > > +    }
> > > +
> > > +    dimm = PC_DIMM(drc->dev);
> > > +    backend = MEMORY_BACKEND(dimm->hostmem);
> > > +
> > > +    state = spapr_nvdimm_init_new_flush_state();
> > > +    if (!state) {
> > > +        return H_P2;
> > AFAICT the only way init_new_flush_state() fails is a failure in the
> > RNG, which definitely isn't a parameter problem.
> 
> Will change it to H_HARDWARE.
> 
> 
> > > +    }
> > > +
> > > +    state->backend_fd = memory_region_get_fd(&backend->mr);
> > Is this guaranteed to return a usable fd in all configurations?
> 
> Right, for memory-backend-ram this wont work. I think we should
> 
> not set the hcall-flush-required too for memory-backend-ram. Will fix this.

Right, but it's good to be defensive here.  I think a bad guest could
initiate this path even if it's not supposed to yes?

> > > +    thread_pool_submit_aio(pool, flush_worker_cb, state,
> > > +                           spapr_nvdimm_flush_completion_cb, state);
> > > +
> > > +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
> > > +    if (H_IS_LONG_BUSY(ret)) {
> > > +        args[0] = state->continue_token;
> > > +    }
> > > +
> > > +    return ret;
> > I believe you can rearrange this so the get_flush_status / check /
> > return is shared between the args[0] == 0 and args[0] == token paths.
> okay.
> 
> Thanks,
> 
> Shiva
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2021-03-31  0:16 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-23 13:47 [PATCH v3 0/3] spapr: nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
2021-03-23 13:47 ` Shivaprasad G Bhat
2021-03-23 13:47 ` Shivaprasad G Bhat
2021-03-23 13:47 ` [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions Shivaprasad G Bhat
2021-03-23 13:47   ` Shivaprasad G Bhat
2021-03-23 13:47   ` Shivaprasad G Bhat
2021-03-24  2:30   ` David Gibson
2021-03-24  2:30     ` David Gibson
2021-03-24  2:30     ` David Gibson
2021-03-23 13:47 ` [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
2021-03-23 13:47   ` Shivaprasad G Bhat
2021-03-23 13:47   ` Shivaprasad G Bhat
2021-03-24  3:07   ` David Gibson
2021-03-24  3:07     ` David Gibson
2021-03-24  3:07     ` David Gibson
2021-03-24  4:04     ` Aneesh Kumar K.V
2021-03-24  4:16       ` Aneesh Kumar K.V
2021-03-24  4:04       ` Aneesh Kumar K.V
2021-03-25  1:51       ` David Gibson
2021-03-25  1:51         ` David Gibson
2021-03-25  1:51         ` David Gibson
2021-03-26 13:45         ` Shivaprasad G Bhat
2021-03-26 13:57           ` Shivaprasad G Bhat
2021-03-26 13:45           ` Shivaprasad G Bhat
2021-03-29  9:23     ` Shivaprasad G Bhat
2021-03-29  9:23       ` Shivaprasad G Bhat
2021-03-30 23:57       ` David Gibson
2021-03-30 23:57         ` David Gibson
2021-03-30 23:57         ` David Gibson
2021-03-23 13:47 ` [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm Shivaprasad G Bhat
2021-03-23 13:47   ` Shivaprasad G Bhat
2021-03-23 13:47   ` Shivaprasad G Bhat
2021-03-24  3:09   ` David Gibson
2021-03-24  3:09     ` David Gibson
2021-03-24  3:09     ` David Gibson
2021-03-24  4:09     ` Aneesh Kumar K.V
2021-03-24  4:21       ` Aneesh Kumar K.V
2021-03-24  4:09       ` Aneesh Kumar K.V

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.