* [PATCH v9 0/5] implement vNVDIMM
@ 2015-12-02  7:20 ` Xiao Guangrong
  0 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-02  7:20 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel, Xiao Guangrong

This patchset can be found at:
      https://github.com/xiaogr/qemu.git nvdimm-v9

It is based on pci branch on Michael's tree and the top commit is:
commit 0c73277af7 (vhost-user-test: fix crash with glib < 2.36).

Changelog in v9:
- the changes address Michael's comments:
  1) move the control parameter to -machine; it is off by default and can
     be enabled by, for example, -machine pc,nvdimm
  2) introduce a macro to define "NCAL"
  3) abstract the function, nvdimm_build_device_dsm(), to clean up the
     code
  4) adjust the code style of dsm method
  5) add spec reference in the code comment

other:
  pick up Stefan's Reviewed-by
  
Changelog in v8:
We split the long patch series into smaller parts; as you can see, this
is the first part, which enables NVDIMM without label data support.

The command line has changed because some patches that simplify things
have not been included in this series, so you should specify the file
size explicitly using parameters as follows:
   -object memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
   -device nvdimm,memdev=mem1,id=nv1

Changelog in v7:
- changes from Vladimir Sementsov-Ogievskiy's comments:
  1) let gethugepagesize() report failure if fstat() fails instead of
     falling back to the normal page size
  2) rename open_file_path to open_ram_file_path
  3) log better error messages by using error_setg_errno()
  4) update the commit log to explain hugepage detection on
     Windows

- changes from Eduardo Habkost's comments:
  1) use 'Error**' to collect error message for qemu_file_get_page_size()
  2) move the gethugepagesize() replacement to the same patch to make it
     easier to review
  3) introduce qemu_get_file_size to unify the code with raw_getlength()

- changes from Stefan's comments:
  1) check that the memory region is large enough to contain the DSM output
     buffer

- changes from Eric Blake's comments:
  1) update the shell command in the commit log used to generate the patch
     that drops the 'pc-dimm' prefix
  
- others:
  pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
  Eric Blake.

Changelog in v6:
- changes from Stefan's comments:
  1) fix code style of struct naming to use CamelCase
  2) fix offset + length overflow when reading/writing label data
  3) compile hw/acpi/nvdimm.c per target so that TARGET_PAGE_SIZE can
     be used to replace getpagesize()

Changelog in v5:
- changes from Michael's comments:
  1) prefix nvdimm_ to everything in NVDIMM source files
  2) make parsing of _DSM Arg3 clearer
  3) fix comment style
  4) drop a singly-used definition
  5) fix dirty DSM buffer loss caused by memory writes happening on the host
  6) check that the DSM buffer is big enough to contain the input data
  7) use build_append_int_noprefix to store a single value into a GArray

- changes from Michael's and Igor's comments:
  1) introduce the 'nvdimm-support' parameter to control nvdimm
     enablement; it is disabled for 2.4 and earlier versions
     to keep live migration compatible
  2) only reserve 1 RAM page and a 4-byte IO port for NVDIMM ACPI
     virtualization

- changes from Stefan's comments:
  1) do endian adjustment for the buffer length

- changes from Bharata B Rao's comments:
  1) fix compile on ppc

- others:
  1) the buffer length is obtained directly from the IO read rather than
     from DSM memory
  2) fix dirty label data loss caused by memory writes happening on the host

Changelog in v4:
- changes from Michael's comments:
  1) show the message, "Memory is not allocated from HugeTlbfs", if
     file-based memory is not allocated from hugetlbfs.
  2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
     from Machine.
  3) statically define the UUID and make its handling clearer
  4) use GArray to build device structures to avoid potential buffer
     overflow
  5) improve comments in the code
  6) improve code style

- changes from Igor's comments:
  1) add NVDIMM ACPI spec document
  2) use serialized method to avoid Mutex
  3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
  4) introduce a common ASL method used by _DSM for all devices to reduce
     the ACPI table size
  5) handle UUID in ACPI AML code. BTW, I'd keep handling the revision in
     QEMU; it's better to upgrade QEMU to support Rev2 in the future

- changes from Stefan's comments:
  1) copy input data from DSM memory to a local buffer to avoid potential
     issues, as DSM memory is visible to the guest. Output data is handled
     in a similar way

- changes from Dan's comments:
  1) drop the static namespace as Linux already supports label-less
     nvdimm devices

- changes from Vladimir's comments:
  1) print a better message, "failed to get file size for %s, can't create
     backend on it", if any file operation fails to obtain the file size

- others:
  create a git repo on github.com for better review/test

Also, thanks to Eric Blake for his review on the QAPI side.

Thanks to all of you for reviewing this patchset.

Changelog in v3:
There are huge changes in this version. Thanks to Igor, Stefan, Paolo, Eduardo,
and Michael for their valuable comments; the patchset is finally in better shape.
- changes from Igor's comments:
  1) abstract a dimm device type out of pc-dimm and create the nvdimm device
     based on dimm; it then uses a memory backend device as the nvdimm's
     memory, so NUMA support comes easily.
  2) let the file-backend device support any kind of filesystem, not only
     hugetlbfs, and let it work on a file, not only a directory; this is
     achieved by extending 'mem-path' - if it's a directory it behaves as
     before, otherwise if it's a file memory is allocated directly from it.
  3) we use an unused memory hole below 4G, 0xFF00000 ~ 0xFFF00000; this
     range is large enough for NVDIMM ACPI, because building a 64-bit
     ACPI SSDT/DSDT table would break Windows XP.
     BTW, making only SSDT.rev = 2 cannot work, since the integer width
     depends only on DSDT.rev, per 19.6.28 DefinitionBlock (Declare
     Definition Block) in the ACPI spec:
| Note: For compatibility with ACPI versions before ACPI 2.0, the bit 
| width of Integer objects is dependent on the ComplianceRevision of the DSDT.
| If the ComplianceRevision is less than 2, all integers are restricted to 32 
| bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets 
| the global integer width for all integers, including integers in SSDTs.
  4) use the lowest ACPI spec version to document AML terms.
  5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"

- changes from Stefan's comments:
  1) do not do endian adjustment in place since _DSM memory is visible to
     the guest
  2) use the target platform's page size instead of a fixed PAGE_SIZE
     definition
  3) lots of code style improvement and typo fixes.
  4) live migration fix
- changes from Paolo's comments:
  1) improve the name of memory region
  
- other changes:
  1) return exact buffer size for _DSM method instead of the page size.
  2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
     devices.
  3) NUMA support
  4) implement _FIT method
  5) rename "configdata" to "reserve-label-data"
  6) simplify _DSM arg3 determination
  7) update the main changelog to reflect v3.

Changelog in v2:
- Use little endian for the DSM method; thanks to Stefan for the suggestion

- introduce a new parameter, @configdata; if it's false, QEMU will
  build a static and read-only namespace in memory and use it to serve
  DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
  reserved region is needed at the end of the @file, which is good for
  users who want to pass the whole nvdimm device through and make its data
  completely visible to the guest

- divide the source code into separate files and add maintainer info

BTW, PCOMMIT virtualization on the KVM side is a work in progress; hopefully
it will be posted next week

====== Background ======
NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be supported
on Intel's platform. NVDIMMs are discovered via ACPI and configured by the
_DSM method of the NVDIMM device in ACPI. Supporting documents can be
found at:
ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Currently, the NVDIMM driver has been merged into the upstream Linux kernel,
and this patchset tries to enable it in the virtualization field.

====== Design ======
NVDIMM supports two access modes: PMEM, which maps the NVDIMM into the CPU's
address space so the CPU can directly access it as normal memory, and BLK,
which exposes it as a block device to reduce the amount of CPU address space
it occupies.

BLK mode accesses the NVDIMM via a Command Register window and a Data Register
window. BLK virtualization has high overhead since each sector access causes
at least two VM-EXITs, so we currently only implement vPMEM in this patchset.

--- vPMEM design ---
We introduce a new device named "nvdimm"; it uses a memory backend device as
NVDIMM memory. The file backing the file-backend device can be a regular file
or a block device. We can use any file for testing or emulation; however,
in the real world, the files passed to the guest are:
- a regular file created on a DAX-enabled filesystem on an NVDIMM device
  on the host
- the raw PMEM device on the host, e.g. /dev/pmem0
Memory accesses through addresses created by mmap on these kinds of files
directly reach the NVDIMM device on the host.
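
As a minimal host-side sketch of that last point (illustrative only, not part
of this series; the path and length are placeholders), ordinary loads and
stores through such a mapping go straight to the NVDIMM:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder path: a DAX file or a raw PMEM device on the host. */
    int fd = open("/dev/pmem0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    size_t len = 4096;
    void *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (va == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* Stores through 'va' reach the NVDIMM directly (no page-cache copy
     * when the mapping is DAX-backed). */
    memcpy(va, "hello", 5);

    munmap(va, len);
    close(fd);
    return 0;
}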

--- vConfigure data area design ---
Each NVDIMM device has a configuration data area which is used to store label
namespace data. In order to emulate this area, we divide the file into two
parts:
- the first part, [0, size - 128K), is used as PMEM
- the last 128K of the file is used as the Label Data Area
This way, the label namespace data stays persistent across power loss or
system failure.

We also support passing the whole file to the guest without reserving any
region for the label data area; this is controlled by the "reserve-label-data"
parameter. If it is false, QEMU builds a static and read-only namespace in
memory, and that namespace covers the whole file size. The parameter is false
by default.
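
The size accounting implied by the two paragraphs above can be sketched as
follows (illustrative only; the constant and function names here are not taken
from the patches):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative: the 128K label data area described above. */
#define NVDIMM_LABEL_AREA_SIZE   (128 * 1024)

/* How much of the backing file is exposed to the guest as PMEM. */
static uint64_t nvdimm_pmem_size(uint64_t file_size, bool reserve_label_data)
{
    if (!reserve_label_data) {
        /* No label area reserved: the whole file backs the (static,
         * read-only namespace) PMEM region. */
        return file_size;
    }
    /* The last 128K of the file is kept back as the Label Data Area. */
    return file_size - NVDIMM_LABEL_AREA_SIZE;
}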

--- _DSM method design ---
_DSM in ACPI is used to configure the NVDIMM; currently we only allow access
to label namespace data, i.e., Get Namespace Label Size (Function Index 4),
Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
(Function Index 6).

_DSM uses two pages to transfer data between ACPI and QEMU: the first page
is RAM-based and is used to save the input of the _DSM method, and QEMU reuses
it to store the output. The other page is MMIO-based; ACPI writes to this
page to transfer control to QEMU.
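
A rough sketch of that handshake from the guest's point of view (the structure
layout and names below are invented for illustration, not taken from the
patches):

#include <stdint.h>

/* Invented layout: the RAM page shared between guest ACPI code and QEMU. */
struct dsm_shared_page {
    uint32_t function;      /* e.g. 4/5/6 for label size/get/set */
    uint8_t  payload[4092]; /* _DSM input on entry, output on return */
};

static void guest_invoke_dsm(volatile struct dsm_shared_page *ram_page,
                             volatile uint32_t *mmio_reg,
                             uint32_t function)
{
    /* 1. Guest-side ACPI code fills the input into the RAM page. */
    ram_page->function = function;

    /* 2. A write to the MMIO page traps to QEMU (a VM-EXIT); QEMU reads the
     *    input from the RAM page and overwrites it with the output. */
    *mmio_reg = 1;

    /* 3. On return, the guest reads the result back from the same RAM page. */
}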

====== Test ======
In the host:
1) create a memory-backed file, e.g. # dd if=/dev/zero of=/tmp/nvdimm bs=1G count=10
2) append "-object memory-backend-file,share,id=mem1,
   mem-path=/tmp/nvdimm -device nvdimm,memdev=mem1,reserve-label-data,
   id=nv1" to the QEMU command line

In the guest, download the latest upstream kernel (4.2 merge window) and enable
ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
1) insmod drivers/nvdimm/libnvdimm.ko
2) insmod drivers/acpi/nfit.ko
3) insmod drivers/nvdimm/nd_btt.ko
4) insmod drivers/nvdimm/nd_pmem.ko
You can see the whole nvdimm device used as a single namespace and /dev/pmem0
appears. You can do whatever you want on /dev/pmem0, including DAX access.

Currently the Linux NVDIMM driver does not support namespace operations on this
kind of PMEM; apply the changes below to support dynamic namespaces:

@@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
                        continue;
                }
 
-               if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+               //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+               if (nfit_mem->memdev_pmem)
                        flags |= NDD_ALIASING;

You can append another NVDIMM device in the guest and do:
# cd /sys/bus/nd/devices/
# cd namespace1.0/
# echo `uuidgen` > uuid
# echo `expr 1024 \* 1024 \* 128` > size
then reload nd_pmem.ko

You can see that /dev/pmem1 appears.

Xiao Guangrong (5):
  nvdimm: implement NVDIMM device abstract
  acpi: support specified oem table id for build_header
  nvdimm acpi: build ACPI NFIT table
  nvdimm acpi: build ACPI nvdimm devices
  nvdimm: add maintain info

 MAINTAINERS                        |   7 +
 default-configs/i386-softmmu.mak   |   2 +
 default-configs/x86_64-softmmu.mak |   2 +
 hw/acpi/Makefile.objs              |   1 +
 hw/acpi/aml-build.c                |  15 +-
 hw/acpi/memory_hotplug.c           |   5 +
 hw/acpi/nvdimm.c                   | 488 +++++++++++++++++++++++++++++++++++++
 hw/arm/virt-acpi-build.c           |  13 +-
 hw/i386/acpi-build.c               |  32 ++-
 hw/i386/pc.c                       |  19 ++
 hw/mem/Makefile.objs               |   1 +
 hw/mem/nvdimm.c                    |  46 ++++
 include/hw/acpi/aml-build.h        |   3 +-
 include/hw/i386/pc.h               |   2 +
 include/hw/mem/nvdimm.h            |  32 +++
 qemu-options.hx                    |   5 +-
 16 files changed, 651 insertions(+), 22 deletions(-)
 create mode 100644 hw/acpi/nvdimm.c
 create mode 100644 hw/mem/nvdimm.c
 create mode 100644 include/hw/mem/nvdimm.h

-- 
1.8.3.1


* [PATCH v9 1/5] nvdimm: implement NVDIMM device abstract
  2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-02  7:20   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-02  7:20 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel, Xiao Guangrong

Introduce "nvdimm" device which is based on pc-dimm device type

Currently, nothing is specific for nvdimm but hotplug is disabled

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
---
 default-configs/i386-softmmu.mak   |  1 +
 default-configs/x86_64-softmmu.mak |  1 +
 hw/acpi/memory_hotplug.c           |  5 +++++
 hw/mem/Makefile.objs               |  1 +
 hw/mem/nvdimm.c                    | 46 ++++++++++++++++++++++++++++++++++++++
 include/hw/mem/nvdimm.h            | 29 ++++++++++++++++++++++++
 6 files changed, 83 insertions(+)
 create mode 100644 hw/mem/nvdimm.c
 create mode 100644 include/hw/mem/nvdimm.h

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 43c96d1..4c79d3b 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -46,6 +46,7 @@ CONFIG_APIC=y
 CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/default-configs/x86_64-softmmu.mak b/default-configs/x86_64-softmmu.mak
index dfb8095..e42d2fc 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -46,6 +46,7 @@ CONFIG_APIC=y
 CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
index e4b9a01..298e868 100644
--- a/hw/acpi/memory_hotplug.c
+++ b/hw/acpi/memory_hotplug.c
@@ -231,6 +231,11 @@ void acpi_memory_plug_cb(ACPIREGS *ar, qemu_irq irq, MemHotplugState *mem_st,
                          DeviceState *dev, Error **errp)
 {
     MemStatus *mdev;
+    DeviceClass *dc = DEVICE_GET_CLASS(dev);
+
+    if (!dc->hotpluggable) {
+        return;
+    }
 
     mdev = acpi_memory_slot_status(mem_st, dev, errp);
     if (!mdev) {
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index b000fb4..f12f8b9 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1 +1,2 @@
 common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
+common-obj-$(CONFIG_NVDIMM) += nvdimm.o
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
new file mode 100644
index 0000000..4fd397f
--- /dev/null
+++ b/hw/mem/nvdimm.c
@@ -0,0 +1,46 @@
+/*
+ * Non-Volatile Dual In-line Memory Module Virtualization Implementation
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ *
+ * Currently, it only supports PMEM Virtualization.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#include "hw/mem/nvdimm.h"
+
+static void nvdimm_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+
+    /* nvdimm hotplug has not been supported yet. */
+    dc->hotpluggable = false;
+}
+
+static TypeInfo nvdimm_info = {
+    .name          = TYPE_NVDIMM,
+    .parent        = TYPE_PC_DIMM,
+    .class_init    = nvdimm_class_init,
+};
+
+static void nvdimm_register_types(void)
+{
+    type_register_static(&nvdimm_info);
+}
+
+type_init(nvdimm_register_types)
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
new file mode 100644
index 0000000..dbfa8d6
--- /dev/null
+++ b/include/hw/mem/nvdimm.h
@@ -0,0 +1,29 @@
+/*
+ * Non-Volatile Dual In-line Memory Module Virtualization Implementation
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ *
+ * NVDIMM specifications and some documents can be found at:
+ * NVDIMM ACPI device and NFIT are introduced in ACPI 6:
+ *      http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+ * NVDIMM Namespace specification:
+ *      http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+ * DSM Interface Example:
+ *      http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+ * Driver Writer's Guide:
+ *      http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_NVDIMM_H
+#define QEMU_NVDIMM_H
+
+#include "hw/mem/pc-dimm.h"
+
+#define TYPE_NVDIMM      "nvdimm"
+#endif
-- 
1.8.3.1


* [PATCH v9 2/5] acpi: support specified oem table id for build_header
  2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-02  7:20   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-02  7:20 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel, Xiao Guangrong

Let build_header() support a specified OEM table id so that we can build
multiple SSDTs later.

If the OEM table id is not specified (i.e., NULL), we use the default id
instead, preserving the previous behavior.
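
For example, a later table builder could then pass its own id (a sketch only,
not a hunk from this series; "NVDIMM" is a sample id and ssdt_start a
placeholder offset):

    build_header(linker, table_data,
                 (void *)(table_data->data + ssdt_start), "SSDT",
                 table_data->len - ssdt_start, 1, "NVDIMM");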

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
---
 hw/acpi/aml-build.c         | 15 +++++++++++----
 hw/arm/virt-acpi-build.c    | 13 +++++++------
 hw/i386/acpi-build.c        | 20 ++++++++++----------
 include/hw/acpi/aml-build.h |  3 ++-
 4 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index a00a0ab..92873bb 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1137,14 +1137,21 @@ Aml *aml_unicode(const char *str)
 
 void
 build_header(GArray *linker, GArray *table_data,
-             AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
+             AcpiTableHeader *h, const char *sig, int len, uint8_t rev,
+             const char *oem_table_id)
 {
     memcpy(&h->signature, sig, 4);
     h->length = cpu_to_le32(len);
     h->revision = rev;
     memcpy(h->oem_id, ACPI_BUILD_APPNAME6, 6);
-    memcpy(h->oem_table_id, ACPI_BUILD_APPNAME4, 4);
-    memcpy(h->oem_table_id + 4, sig, 4);
+
+    if (oem_table_id) {
+        strncpy((char *)h->oem_table_id, oem_table_id, sizeof(h->oem_table_id));
+    } else {
+        memcpy(h->oem_table_id, ACPI_BUILD_APPNAME4, 4);
+        memcpy(h->oem_table_id + 4, sig, 4);
+    }
+
     h->oem_revision = cpu_to_le32(1);
     memcpy(h->asl_compiler_id, ACPI_BUILD_APPNAME4, 4);
     h->asl_compiler_revision = cpu_to_le32(1);
@@ -1211,5 +1218,5 @@ build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets)
                                        sizeof(uint32_t));
     }
     build_header(linker, table_data,
-                 (void *)rsdt, "RSDT", rsdt_len, 1);
+                 (void *)rsdt, "RSDT", rsdt_len, 1, NULL);
 }
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 3c2c5d6..da17779 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -381,7 +381,8 @@ build_spcr(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info)
     spcr->pci_device_id = 0xffff;  /* PCI Device ID: not a PCI device */
     spcr->pci_vendor_id = 0xffff;  /* PCI Vendor ID: not a PCI device */
 
-    build_header(linker, table_data, (void *)spcr, "SPCR", sizeof(*spcr), 2);
+    build_header(linker, table_data, (void *)spcr, "SPCR", sizeof(*spcr), 2,
+                 NULL);
 }
 
 static void
@@ -400,7 +401,7 @@ build_mcfg(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info)
     mcfg->allocation[0].end_bus_number = (memmap[VIRT_PCIE_ECAM].size
                                           / PCIE_MMCFG_SIZE_MIN) - 1;
 
-    build_header(linker, table_data, (void *)mcfg, "MCFG", len, 1);
+    build_header(linker, table_data, (void *)mcfg, "MCFG", len, 1, NULL);
 }
 
 /* GTDT */
@@ -426,7 +427,7 @@ build_gtdt(GArray *table_data, GArray *linker)
 
     build_header(linker, table_data,
                  (void *)(table_data->data + gtdt_start), "GTDT",
-                 table_data->len - gtdt_start, 2);
+                 table_data->len - gtdt_start, 2, NULL);
 }
 
 /* MADT */
@@ -488,7 +489,7 @@ build_madt(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info,
 
     build_header(linker, table_data,
                  (void *)(table_data->data + madt_start), "APIC",
-                 table_data->len - madt_start, 3);
+                 table_data->len - madt_start, 3, NULL);
 }
 
 /* FADT */
@@ -513,7 +514,7 @@ build_fadt(GArray *table_data, GArray *linker, unsigned dsdt)
                                    sizeof fadt->dsdt);
 
     build_header(linker, table_data,
-                 (void *)fadt, "FACP", sizeof(*fadt), 5);
+                 (void *)fadt, "FACP", sizeof(*fadt), 5, NULL);
 }
 
 /* DSDT */
@@ -546,7 +547,7 @@ build_dsdt(GArray *table_data, GArray *linker, VirtGuestInfo *guest_info)
     g_array_append_vals(table_data, dsdt->buf->data, dsdt->buf->len);
     build_header(linker, table_data,
         (void *)(table_data->data + table_data->len - dsdt->buf->len),
-        "DSDT", dsdt->buf->len, 2);
+        "DSDT", dsdt->buf->len, 2, NULL);
     free_aml_allocator();
 }
 
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 95e0c65..215b58c 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -361,7 +361,7 @@ build_fadt(GArray *table_data, GArray *linker, AcpiPmInfo *pm,
     fadt_setup(fadt, pm);
 
     build_header(linker, table_data,
-                 (void *)fadt, "FACP", sizeof(*fadt), 1);
+                 (void *)fadt, "FACP", sizeof(*fadt), 1, NULL);
 }
 
 static void
@@ -431,7 +431,7 @@ build_madt(GArray *table_data, GArray *linker, AcpiCpuInfo *cpu,
 
     build_header(linker, table_data,
                  (void *)(table_data->data + madt_start), "APIC",
-                 table_data->len - madt_start, 1);
+                 table_data->len - madt_start, 1, NULL);
 }
 
 /* Assign BSEL property to all buses.  In the future, this can be changed
@@ -1349,7 +1349,7 @@ build_ssdt(GArray *table_data, GArray *linker,
     g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
     build_header(linker, table_data,
         (void *)(table_data->data + table_data->len - ssdt->buf->len),
-        "SSDT", ssdt->buf->len, 1);
+        "SSDT", ssdt->buf->len, 1, NULL);
     free_aml_allocator();
 }
 
@@ -1365,7 +1365,7 @@ build_hpet(GArray *table_data, GArray *linker)
     hpet->timer_block_id = cpu_to_le32(0x8086a201);
     hpet->addr.address = cpu_to_le64(HPET_BASE);
     build_header(linker, table_data,
-                 (void *)hpet, "HPET", sizeof(*hpet), 1);
+                 (void *)hpet, "HPET", sizeof(*hpet), 1, NULL);
 }
 
 static void
@@ -1388,7 +1388,7 @@ build_tpm_tcpa(GArray *table_data, GArray *linker, GArray *tcpalog)
                                    sizeof(tcpa->log_area_start_address));
 
     build_header(linker, table_data,
-                 (void *)tcpa, "TCPA", sizeof(*tcpa), 2);
+                 (void *)tcpa, "TCPA", sizeof(*tcpa), 2, NULL);
 
     acpi_data_push(tcpalog, TPM_LOG_AREA_MINIMUM_SIZE);
 }
@@ -1405,7 +1405,7 @@ build_tpm2(GArray *table_data, GArray *linker)
     tpm2_ptr->start_method = cpu_to_le32(TPM2_START_METHOD_MMIO);
 
     build_header(linker, table_data,
-                 (void *)tpm2_ptr, "TPM2", sizeof(*tpm2_ptr), 4);
+                 (void *)tpm2_ptr, "TPM2", sizeof(*tpm2_ptr), 4, NULL);
 }
 
 typedef enum {
@@ -1519,7 +1519,7 @@ build_srat(GArray *table_data, GArray *linker, PcGuestInfo *guest_info)
     build_header(linker, table_data,
                  (void *)(table_data->data + srat_start),
                  "SRAT",
-                 table_data->len - srat_start, 1);
+                 table_data->len - srat_start, 1, NULL);
 }
 
 static void
@@ -1548,7 +1548,7 @@ build_mcfg_q35(GArray *table_data, GArray *linker, AcpiMcfgInfo *info)
     } else {
         sig = "MCFG";
     }
-    build_header(linker, table_data, (void *)mcfg, sig, len, 1);
+    build_header(linker, table_data, (void *)mcfg, sig, len, 1, NULL);
 }
 
 static void
@@ -1572,7 +1572,7 @@ build_dmar_q35(GArray *table_data, GArray *linker)
     drhd->address = cpu_to_le64(Q35_HOST_BRIDGE_IOMMU_ADDR);
 
     build_header(linker, table_data, (void *)(table_data->data + dmar_start),
-                 "DMAR", table_data->len - dmar_start, 1);
+                 "DMAR", table_data->len - dmar_start, 1, NULL);
 }
 
 static void
@@ -1587,7 +1587,7 @@ build_dsdt(GArray *table_data, GArray *linker, AcpiMiscInfo *misc)
 
     memset(dsdt, 0, sizeof *dsdt);
     build_header(linker, table_data, dsdt, "DSDT",
-                 misc->dsdt_size, 1);
+                 misc->dsdt_size, 1, NULL);
 }
 
 static GArray *
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 1b632dc..e587b26 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -277,7 +277,8 @@ Aml *aml_unicode(const char *str);
 
 void
 build_header(GArray *linker, GArray *table_data,
-             AcpiTableHeader *h, const char *sig, int len, uint8_t rev);
+             AcpiTableHeader *h, const char *sig, int len, uint8_t rev,
+             const char *oem_table_id);
 void *acpi_data_push(GArray *table_data, unsigned size);
 unsigned acpi_data_len(GArray *table);
 void acpi_add_table(GArray *table_offsets, GArray *table_data);
-- 
1.8.3.1


* [PATCH v9 3/5] nvdimm acpi: build ACPI NFIT table
  2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-02  7:20   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-02  7:20 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel, Xiao Guangrong

NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)

Currently, we only support PMEM mode. Each device has 3 structures:
- SPA structure, which defines the PMEM region info

- MEM DEV structure, which has the @handle used to associate the ACPI NVDIMM
  device that we will introduce in a later patch.
  We can also happily ignore the memory device's interleave, since the real
  nvdimm hardware access is hidden behind the host

- DCR structure, which defines the vendor ID used to associate the specific
  vendor's nvdimm driver. Since we only implement PMEM mode this time, the
  Command window and Data window are not needed

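As an illustration (not part of the patch itself), the three structures for
one DIMM are tied together purely by ids derived from the DIMM's slot
number. A minimal, self-contained sketch of that scheme, mirroring the
helpers added in hw/acpi/nvdimm.c below:

#include <stdio.h>
#include <stdint.h>

/* handle 0 is reserved for the NVDIMM root device. */
static uint32_t slot_to_handle(int slot)    { return slot + 1; }
/* index 0 means "structure not present", so start from (slot + 1). */
static uint16_t slot_to_spa_index(int slot) { return (slot + 1) << 1; }
static uint16_t slot_to_dcr_index(int slot) { return slot_to_spa_index(slot) + 1; }

int main(void)
{
    for (int slot = 0; slot < 3; slot++) {
        /* The MEM DEV structure stores all three ids, linking the SPA
         * range and the control region that describe the same DIMM. */
        printf("slot %d: handle=%u spa_index=%u dcr_index=%u\n",
               slot, (unsigned)slot_to_handle(slot),
               (unsigned)slot_to_spa_index(slot),
               (unsigned)slot_to_dcr_index(slot));
    }
    return 0;
}
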
The NVDIMM functionality is controlled by the machine parameter 'nvdimm';
here is an example that enables it:
-machine pc,nvdimm -m 8G,maxmem=100G,slots=100  -object \
memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G -device \
nvdimm,memdev=mem1,id=nv1

It is disabled by default

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
---
 default-configs/i386-softmmu.mak   |   1 +
 default-configs/x86_64-softmmu.mak |   1 +
 hw/acpi/Makefile.objs              |   1 +
 hw/acpi/nvdimm.c                   | 382 +++++++++++++++++++++++++++++++++++++
 hw/i386/acpi-build.c               |  12 ++
 hw/i386/pc.c                       |  19 ++
 include/hw/i386/pc.h               |   2 +
 include/hw/mem/nvdimm.h            |   3 +
 qemu-options.hx                    |   5 +-
 9 files changed, 425 insertions(+), 1 deletion(-)
 create mode 100644 hw/acpi/nvdimm.c

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 4c79d3b..53fb517 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -47,6 +47,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/default-configs/x86_64-softmmu.mak b/default-configs/x86_64-softmmu.mak
index e42d2fc..766c27c 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -47,6 +47,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
index 7d3230c..095597f 100644
--- a/hw/acpi/Makefile.objs
+++ b/hw/acpi/Makefile.objs
@@ -2,6 +2,7 @@ common-obj-$(CONFIG_ACPI_X86) += core.o piix4.o pcihp.o
 common-obj-$(CONFIG_ACPI_X86_ICH) += ich9.o tco.o
 common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu_hotplug.o
 common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
+common-obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
 common-obj-$(CONFIG_ACPI) += acpi_interface.o
 common-obj-$(CONFIG_ACPI) += bios-linker-loader.o
 common-obj-$(CONFIG_ACPI) += aml-build.o
diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
new file mode 100644
index 0000000..98c004d
--- /dev/null
+++ b/hw/acpi/nvdimm.c
@@ -0,0 +1,382 @@
+/*
+ * NVDIMM ACPI Implementation
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ *
+ * NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
+ * and the DSM specification can be found at:
+ *       http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+ *
+ * Currently, it only supports PMEM Virtualization.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
+#include "hw/mem/nvdimm.h"
+
+static int nvdimm_plugged_device_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
+        DeviceState *dev = DEVICE(obj);
+
+        if (dev->realized) { /* only realized NVDIMMs matter */
+            *list = g_slist_append(*list, DEVICE(obj));
+        }
+    }
+
+    object_child_foreach(obj, nvdimm_plugged_device_list, opaque);
+    return 0;
+}
+
+/*
+ * inquire plugged NVDIMM devices and link them into the list which is
+ * returned to the caller.
+ *
+ * Note: it is the caller's responsibility to free the list to avoid
+ * memory leak.
+ */
+static GSList *nvdimm_get_plugged_device_list(void)
+{
+    GSList *list = NULL;
+
+    object_child_foreach(qdev_get_machine(), nvdimm_plugged_device_list,
+                         &list);
+    return list;
+}
+
+#define NVDIMM_UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)             \
+   { (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
+     (b) & 0xff, ((b) >> 8) & 0xff, (c) & 0xff, ((c) >> 8) & 0xff,          \
+     (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }
+
+/*
+ * define Byte Addressable Persistent Memory (PM) Region according to
+ * ACPI 6.0: 5.2.25.1 System Physical Address Range Structure.
+ */
+static const uint8_t nvdimm_nfit_spa_uuid[] =
+      NVDIMM_UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d, 0x33,
+                     0x18, 0xb7, 0x8c, 0xdb);
+
+/*
+ * NVDIMM Firmware Interface Table
+ * @signature: "NFIT"
+ *
+ * It provides information that allows OSPM to enumerate NVDIMM present in
+ * the platform and associate system physical address ranges created by the
+ * NVDIMMs.
+ *
+ * It is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
+ */
+struct NvdimmNfitHeader {
+    ACPI_TABLE_HEADER_DEF
+    uint32_t reserved;
+} QEMU_PACKED;
+typedef struct NvdimmNfitHeader NvdimmNfitHeader;
+
+/*
+ * define NFIT structures according to ACPI 6.0: 5.2.25 NVDIMM Firmware
+ * Interface Table (NFIT).
+ */
+
+/*
+ * System Physical Address Range Structure
+ *
+ * It describes the system physical address ranges occupied by NVDIMMs and
+ * the types of the regions.
+ */
+struct NvdimmNfitSpa {
+    uint16_t type;
+    uint16_t length;
+    uint16_t spa_index;
+    uint16_t flags;
+    uint32_t reserved;
+    uint32_t proximity_domain;
+    uint8_t type_guid[16];
+    uint64_t spa_base;
+    uint64_t spa_length;
+    uint64_t mem_attr;
+} QEMU_PACKED;
+typedef struct NvdimmNfitSpa NvdimmNfitSpa;
+
+/*
+ * Memory Device to System Physical Address Range Mapping Structure
+ *
+ * It enables identifying each NVDIMM region and the corresponding SPA
+ * describing the memory interleave
+ */
+struct NvdimmNfitMemDev {
+    uint16_t type;
+    uint16_t length;
+    uint32_t nfit_handle;
+    uint16_t phys_id;
+    uint16_t region_id;
+    uint16_t spa_index;
+    uint16_t dcr_index;
+    uint64_t region_len;
+    uint64_t region_offset;
+    uint64_t region_dpa;
+    uint16_t interleave_index;
+    uint16_t interleave_ways;
+    uint16_t flags;
+    uint16_t reserved;
+} QEMU_PACKED;
+typedef struct NvdimmNfitMemDev NvdimmNfitMemDev;
+
+/*
+ * NVDIMM Control Region Structure
+ *
+ * It describes the NVDIMM and if applicable, Block Control Window.
+ */
+struct NvdimmNfitControlRegion {
+    uint16_t type;
+    uint16_t length;
+    uint16_t dcr_index;
+    uint16_t vendor_id;
+    uint16_t device_id;
+    uint16_t revision_id;
+    uint16_t sub_vendor_id;
+    uint16_t sub_device_id;
+    uint16_t sub_revision_id;
+    uint8_t reserved[6];
+    uint32_t serial_number;
+    uint16_t fic;
+    uint16_t num_bcw;
+    uint64_t bcw_size;
+    uint64_t cmd_offset;
+    uint64_t cmd_size;
+    uint64_t status_offset;
+    uint64_t status_size;
+    uint16_t flags;
+    uint8_t reserved2[6];
+} QEMU_PACKED;
+typedef struct NvdimmNfitControlRegion NvdimmNfitControlRegion;
+
+/*
+ * Module serial number is a unique number for each device. We use the
+ * slot id of NVDIMM device to generate this number so that each device
+ * associates with a different number.
+ *
+ * 0x123456 is a magic number we arbitrarily chose.
+ */
+static uint32_t nvdimm_slot_to_sn(int slot)
+{
+    return 0x123456 + slot;
+}
+
+/*
+ * handle is used to uniquely associate nfit_memdev structure with NVDIMM
+ * ACPI device - nfit_memdev.nfit_handle matches with the value returned
+ * by ACPI device _ADR method.
+ *
+ * We generate the handle with the slot id of NVDIMM device and reserve
+ * 0 for NVDIMM root device.
+ */
+static uint32_t nvdimm_slot_to_handle(int slot)
+{
+    return slot + 1;
+}
+
+/*
+ * index uniquely identifies the structure, 0 is reserved which indicates
+ * that the structure is not valid or the associated structure is not
+ * present.
+ *
+ * Each NVDIMM device needs two indexes, one for nfit_spa and another for
+ * nfit_dc which are generated by the slot id of NVDIMM device.
+ */
+static uint16_t nvdimm_slot_to_spa_index(int slot)
+{
+    return (slot + 1) << 1;
+}
+
+/* See the comments of nvdimm_slot_to_spa_index(). */
+static uint32_t nvdimm_slot_to_dcr_index(int slot)
+{
+    return nvdimm_slot_to_spa_index(slot) + 1;
+}
+
+/* ACPI 6.0: 5.2.25.1 System Physical Address Range Structure */
+static void
+nvdimm_build_structure_spa(GArray *structures, DeviceState *dev)
+{
+    NvdimmNfitSpa *nfit_spa;
+    uint64_t addr = object_property_get_int(OBJECT(dev), PC_DIMM_ADDR_PROP,
+                                            NULL);
+    uint64_t size = object_property_get_int(OBJECT(dev), PC_DIMM_SIZE_PROP,
+                                            NULL);
+    uint32_t node = object_property_get_int(OBJECT(dev), PC_DIMM_NODE_PROP,
+                                            NULL);
+    int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
+                                            NULL);
+
+    nfit_spa = acpi_data_push(structures, sizeof(*nfit_spa));
+
+    nfit_spa->type = cpu_to_le16(0 /* System Physical Address Range
+                                      Structure */);
+    nfit_spa->length = cpu_to_le16(sizeof(*nfit_spa));
+    nfit_spa->spa_index = cpu_to_le16(nvdimm_slot_to_spa_index(slot));
+
+    /*
+     * Control region is strict as all the device info, such as SN, index,
+     * is associated with slot id.
+     */
+    nfit_spa->flags = cpu_to_le16(1 /* Control region is strictly for
+                                       management during hot add/online
+                                       operation */ |
+                                  2 /* Data in Proximity Domain field is
+                                       valid*/);
+
+    /* NUMA node. */
+    nfit_spa->proximity_domain = cpu_to_le32(node);
+    /* the region reported as PMEM. */
+    memcpy(nfit_spa->type_guid, nvdimm_nfit_spa_uuid,
+           sizeof(nvdimm_nfit_spa_uuid));
+
+    nfit_spa->spa_base = cpu_to_le64(addr);
+    nfit_spa->spa_length = cpu_to_le64(size);
+
+    /* It is the PMEM and can be cached as writeback. */
+    nfit_spa->mem_attr = cpu_to_le64(0x8ULL /* EFI_MEMORY_WB */ |
+                                     0x8000ULL /* EFI_MEMORY_NV */);
+}
+
+/*
+ * ACPI 6.0: 5.2.25.2 Memory Device to System Physical Address Range Mapping
+ * Structure
+ */
+static void
+nvdimm_build_structure_memdev(GArray *structures, DeviceState *dev)
+{
+    NvdimmNfitMemDev *nfit_memdev;
+    uint64_t addr = object_property_get_int(OBJECT(dev), PC_DIMM_ADDR_PROP,
+                                            NULL);
+    uint64_t size = object_property_get_int(OBJECT(dev), PC_DIMM_SIZE_PROP,
+                                            NULL);
+    int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
+                                            NULL);
+    uint32_t handle = nvdimm_slot_to_handle(slot);
+
+    nfit_memdev = acpi_data_push(structures, sizeof(*nfit_memdev));
+
+    nfit_memdev->type = cpu_to_le16(1 /* Memory Device to System Address
+                                         Range Map Structure*/);
+    nfit_memdev->length = cpu_to_le16(sizeof(*nfit_memdev));
+    nfit_memdev->nfit_handle = cpu_to_le32(handle);
+
+    /*
+     * associate memory device with System Physical Address Range
+     * Structure.
+     */
+    nfit_memdev->spa_index = cpu_to_le16(nvdimm_slot_to_spa_index(slot));
+    /* associate memory device with Control Region Structure. */
+    nfit_memdev->dcr_index = cpu_to_le16(nvdimm_slot_to_dcr_index(slot));
+
+    /* The memory region on the device. */
+    nfit_memdev->region_len = cpu_to_le64(size);
+    nfit_memdev->region_dpa = cpu_to_le64(addr);
+
+    /* Only one interleave for PMEM. */
+    nfit_memdev->interleave_ways = cpu_to_le16(1);
+}
+
+/*
+ * ACPI 6.0: 5.2.25.5 NVDIMM Control Region Structure.
+ */
+static void nvdimm_build_structure_dcr(GArray *structures, DeviceState *dev)
+{
+    NvdimmNfitControlRegion *nfit_dcr;
+    int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
+                                       NULL);
+    uint32_t sn = nvdimm_slot_to_sn(slot);
+
+    nfit_dcr = acpi_data_push(structures, sizeof(*nfit_dcr));
+
+    nfit_dcr->type = cpu_to_le16(4 /* NVDIMM Control Region Structure */);
+    nfit_dcr->length = cpu_to_le16(sizeof(*nfit_dcr));
+    nfit_dcr->dcr_index = cpu_to_le16(nvdimm_slot_to_dcr_index(slot));
+
+    /* vendor: Intel. */
+    nfit_dcr->vendor_id = cpu_to_le16(0x8086);
+    nfit_dcr->device_id = cpu_to_le16(1);
+
+    /* The _DSM method is following Intel's DSM specification. */
+    nfit_dcr->revision_id = cpu_to_le16(1 /* Current Revision supported
+                                             in ACPI 6.0 is 1. */);
+    nfit_dcr->serial_number = cpu_to_le32(sn);
+    nfit_dcr->fic = cpu_to_le16(0x201 /* Format Interface Code. See Chapter
+                                         2: NVDIMM Device Specific Method
+                                         (DSM) in DSM Spec Rev1.*/);
+}
+
+static GArray *nvdimm_build_device_structure(GSList *device_list)
+{
+    GArray *structures = g_array_new(false, true /* clear */, 1);
+
+    for (; device_list; device_list = device_list->next) {
+        DeviceState *dev = device_list->data;
+
+        /* build System Physical Address Range Structure. */
+        nvdimm_build_structure_spa(structures, dev);
+
+        /*
+         * build Memory Device to System Physical Address Range Mapping
+         * Structure.
+         */
+        nvdimm_build_structure_memdev(structures, dev);
+
+        /* build NVDIMM Control Region Structure. */
+        nvdimm_build_structure_dcr(structures, dev);
+    }
+
+    return structures;
+}
+
+static void nvdimm_build_nfit(GSList *device_list, GArray *table_offsets,
+                              GArray *table_data, GArray *linker)
+{
+    GArray *structures = nvdimm_build_device_structure(device_list);
+    void *header;
+
+    acpi_add_table(table_offsets, table_data);
+
+    /* NFIT header. */
+    header = acpi_data_push(table_data, sizeof(NvdimmNfitHeader));
+    /* NVDIMM device structures. */
+    g_array_append_vals(table_data, structures->data, structures->len);
+
+    build_header(linker, table_data, header, "NFIT",
+                 sizeof(NvdimmNfitHeader) + structures->len, 1, NULL);
+    g_array_free(structures, true);
+}
+
+void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
+                       GArray *linker)
+{
+    GSList *device_list;
+
+    /* no NVDIMM device is plugged. */
+    device_list = nvdimm_get_plugged_device_list();
+    if (!device_list) {
+        return;
+    }
+    nvdimm_build_nfit(device_list, table_offsets, table_data, linker);
+    g_slist_free(device_list);
+}
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 215b58c..b55659d 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -39,6 +39,7 @@
 #include "hw/loader.h"
 #include "hw/isa/isa.h"
 #include "hw/acpi/memory_hotplug.h"
+#include "hw/mem/nvdimm.h"
 #include "sysemu/tpm.h"
 #include "hw/acpi/tpm.h"
 #include "sysemu/tpm_backend.h"
@@ -1658,6 +1659,13 @@ static bool acpi_has_iommu(void)
     return intel_iommu && !ambiguous;
 }
 
+static bool acpi_has_nvdimm(void)
+{
+    PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
+
+    return pcms->nvdimm;
+}
+
 static
 void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
 {
@@ -1742,6 +1750,10 @@ void acpi_build(PcGuestInfo *guest_info, AcpiBuildTables *tables)
         build_dmar_q35(tables_blob, tables->linker);
     }
 
+    if (acpi_has_nvdimm()) {
+        nvdimm_build_acpi(table_offsets, tables_blob, tables->linker);
+    }
+
     /* Add tables supplied by user (if any) */
     for (u = acpi_table_first(); u; u = acpi_table_next(u)) {
         unsigned len = acpi_table_len(u);
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 5e20e07..7a9ea0a 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1877,6 +1877,20 @@ static bool pc_machine_get_aligned_dimm(Object *obj, Error **errp)
     return pcms->enforce_aligned_dimm;
 }
 
+static bool pc_machine_get_nvdimm(Object *obj, Error **errp)
+{
+    PCMachineState *pcms = PC_MACHINE(obj);
+
+    return pcms->nvdimm;
+}
+
+static void pc_machine_set_nvdimm(Object *obj, bool value, Error **errp)
+{
+    PCMachineState *pcms = PC_MACHINE(obj);
+
+    pcms->nvdimm = value;
+}
+
 static void pc_machine_initfn(Object *obj)
 {
     PCMachineState *pcms = PC_MACHINE(obj);
@@ -1916,6 +1930,11 @@ static void pc_machine_initfn(Object *obj)
     object_property_add_bool(obj, PC_MACHINE_ENFORCE_ALIGNED_DIMM,
                              pc_machine_get_aligned_dimm,
                              NULL, &error_abort);
+
+    /* nvdimm is disabled on default. */
+    pcms->nvdimm = false;
+    object_property_add_bool(obj, PC_MACHINE_NVDIMM, pc_machine_get_nvdimm,
+                             pc_machine_set_nvdimm, &error_abort);
 }
 
 static void pc_machine_reset(void)
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 854c330..1b8d52b 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -40,6 +40,7 @@ struct PCMachineState {
     OnOffAuto vmport;
     OnOffAuto smm;
     bool enforce_aligned_dimm;
+    bool nvdimm;
     ram_addr_t below_4g_mem_size, above_4g_mem_size;
 };
 
@@ -49,6 +50,7 @@ struct PCMachineState {
 #define PC_MACHINE_VMPORT           "vmport"
 #define PC_MACHINE_SMM              "smm"
 #define PC_MACHINE_ENFORCE_ALIGNED_DIMM "enforce-aligned-dimm"
+#define PC_MACHINE_NVDIMM           "nvdimm"
 
 /**
  * PCMachineClass:
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index dbfa8d6..49183c1 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -26,4 +26,7 @@
 #include "hw/mem/pc-dimm.h"
 
 #define TYPE_NVDIMM      "nvdimm"
+
+void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
+                       GArray *linker);
 #endif
diff --git a/qemu-options.hx b/qemu-options.hx
index 0eea4ee..a6c92c7 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -41,7 +41,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                igd-passthru=on|off controls IGD GFX passthrough support (default=off)\n"
     "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
     "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
-    "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n",
+    "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
+    "                nvdimm=on|off controls NVDIMM support (default=off)\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
@@ -80,6 +81,8 @@ execution of AES cryptographic functions.  The default is on.
 Enables or disables DEA key wrapping support on s390-ccw hosts. This feature
 controls whether DEA wrapping keys will be created to allow
 execution of DEA cryptographic functions.  The default is on.
+@item nvdimm=on|off
+Enables or disables NVDIMM support. The default is off.
 @end table
 ETEXI
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v9 4/5] nvdimm acpi: build ACPI nvdimm devices
  2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-02  7:20   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-02  7:20 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel, Xiao Guangrong

NVDIMM devices are defined in ACPI 6.0: 9.20 NVDIMM Devices

There is a root device under \_SB, and the individual NVDIMM devices sit
under that root device. Each NVDIMM device has an _ADR object which returns
its handle, used to associate it with the MEMDEV structure in the NFIT

Currently, we do not support any function on _DSM, which means NVDIMM
label data is not supported yet

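For illustration only (this sketch is not QEMU code), the behaviour of the
generated _DSM stub can be modelled as below: function 0 is the standard
"which functions are supported" query and yields a one-byte buffer holding
0x00, while any other function index yields 0x01, matching the AML emitted
by nvdimm_build_common_dsm() in this patch:

#include <stdint.h>
#include <stdio.h>

/* Model of the one-byte buffer the stub _DSM returns to the guest. */
static uint8_t nvdimm_dsm_stub(uint32_t function)
{
    if (function == 0) {
        return 0x00;    /* query: no function supported yet */
    }
    return 0x01;        /* any concrete function: not supported */
}

int main(void)
{
    printf("query -> 0x%02x, function 1 -> 0x%02x\n",
           (unsigned)nvdimm_dsm_stub(0), (unsigned)nvdimm_dsm_stub(1));
    return 0;
}
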
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
---
 hw/acpi/nvdimm.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 98c004d..d2fad01 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -367,6 +367,111 @@ static void nvdimm_build_nfit(GSList *device_list, GArray *table_offsets,
     g_array_free(structures, true);
 }
 
+#define NVDIMM_COMMON_DSM      "NCAL"
+
+static void nvdimm_build_common_dsm(Aml *dev)
+{
+    Aml *method, *ifctx, *function;
+    uint8_t byte_list[1];
+
+    method = aml_method(NVDIMM_COMMON_DSM, 4);
+    function = aml_arg(2);
+
+    /*
+     * function 0 is called to inquire what functions are supported by
+     * OSPM
+     */
+    ifctx = aml_if(aml_equal(function, aml_int(0)));
+    byte_list[0] = 0 /* No function Supported */;
+    aml_append(ifctx, aml_return(aml_buffer(1, byte_list)));
+    aml_append(method, ifctx);
+
+    /* No function is supported yet. */
+    byte_list[0] = 1 /* Not Supported */;
+    aml_append(method, aml_return(aml_buffer(1, byte_list)));
+
+    aml_append(dev, method);
+}
+
+static void nvdimm_build_device_dsm(Aml *dev)
+{
+    Aml *method;
+
+    method = aml_method("_DSM", 4);
+    aml_append(method, aml_return(aml_call4(NVDIMM_COMMON_DSM, aml_arg(0),
+                                  aml_arg(1), aml_arg(2), aml_arg(3))));
+    aml_append(dev, method);
+}
+
+static void nvdimm_build_nvdimm_devices(GSList *device_list, Aml *root_dev)
+{
+    for (; device_list; device_list = device_list->next) {
+        DeviceState *dev = device_list->data;
+        int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
+                                           NULL);
+        uint32_t handle = nvdimm_slot_to_handle(slot);
+        Aml *nvdimm_dev;
+
+        nvdimm_dev = aml_device("NV%02X", slot);
+
+        /*
+         * ACPI 6.0: 9.20 NVDIMM Devices:
+         *
+         * _ADR object that is used to supply OSPM with unique address
+         * of the NVDIMM device. This is done by returning the NFIT Device
+         * handle that is used to identify the associated entries in ACPI
+         * table NFIT or _FIT.
+         */
+        aml_append(nvdimm_dev, aml_name_decl("_ADR", aml_int(handle)));
+
+        nvdimm_build_device_dsm(nvdimm_dev);
+        aml_append(root_dev, nvdimm_dev);
+    }
+}
+
+static void nvdimm_build_ssdt(GSList *device_list, GArray *table_offsets,
+                              GArray *table_data, GArray *linker)
+{
+    Aml *ssdt, *sb_scope, *dev;
+
+    acpi_add_table(table_offsets, table_data);
+
+    ssdt = init_aml_allocator();
+    acpi_data_push(ssdt->buf, sizeof(AcpiTableHeader));
+
+    sb_scope = aml_scope("\\_SB");
+
+    dev = aml_device("NVDR");
+
+    /*
+     * ACPI 6.0: 9.20 NVDIMM Devices:
+     *
+     * The ACPI Name Space device uses _HID of ACPI0012 to identify the root
+     * NVDIMM interface device. Platform firmware is required to contain one
+     * such device in _SB scope if NVDIMMs support is exposed by platform to
+     * OSPM.
+     * For each NVDIMM present or intended to be supported by platform,
+     * platform firmware also exposes an ACPI Namespace Device under the
+     * root device.
+     */
+    aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
+
+    nvdimm_build_common_dsm(dev);
+    nvdimm_build_device_dsm(dev);
+
+    nvdimm_build_nvdimm_devices(device_list, dev);
+
+    aml_append(sb_scope, dev);
+
+    aml_append(ssdt, sb_scope);
+    /* copy AML table into ACPI tables blob and patch header there */
+    g_array_append_vals(table_data, ssdt->buf->data, ssdt->buf->len);
+    build_header(linker, table_data,
+        (void *)(table_data->data + table_data->len - ssdt->buf->len),
+        "SSDT", ssdt->buf->len, 1, "NVDIMM");
+    free_aml_allocator();
+}
+
 void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
                        GArray *linker)
 {
@@ -378,5 +483,6 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
         return;
     }
     nvdimm_build_nfit(device_list, table_offsets, table_data, linker);
+    nvdimm_build_ssdt(device_list, table_offsets, table_data, linker);
     g_slist_free(device_list);
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v9 5/5] nvdimm: add maintain info
  2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-02  7:21   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-02  7:21 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel, Xiao Guangrong

Add NVDIMM maintainer

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index bb1f3e4..7e82340 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -940,6 +940,13 @@ M: Jiri Pirko <jiri@resnulli.us>
 S: Maintained
 F: hw/net/rocker/
 
+NVDIMM
+M: Xiao Guangrong <guangrong.xiao@linux.intel.com>
+S: Maintained
+F: hw/acpi/nvdimm.c
+F: hw/mem/nvdimm.c
+F: include/hw/mem/nvdimm.h
+
 Subsystems
 ----------
 Audio
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v9 0/5] implement vNVDIMM
  2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-10  3:11   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-10  3:11 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel


New version, new week, and unfortunate new ping... :(


On 12/02/2015 03:20 PM, Xiao Guangrong wrote:
> This patchset can be found at:
>        https://github.com/xiaogr/qemu.git nvdimm-v9
>
> It is based on pci branch on Michael's tree and the top commit is:
> commit 0c73277af7 (vhost-user-test: fix crash with glib < 2.36).
>
> Changelog in v9:
> - the changes address Michael's comments:
>    1) move the control parameter to -machine and it is off on default, then
>       it can be enabled by, for example, -machine pc,nvdimm
>    2) introduce a macro to define "NCAL"
>    3) abstract the function, nvdimm_build_device_dsm(), to clean up the
>       code
>    4) adjust the code style of dsm method
>    5) add spec reference in the code comment
>
> other:
>    pick up Stefan's Reviewed-by
>
> Changelog in v8:
> We split the long patch series into the small parts, as you see now, this
> is the first part which enables NVDIMM without label data support.
>
> The command line has been changed because some patches simplifying the
> things have not been included into this series, you should specify the
> file size exactly using the parameters as follows:
>     memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
>     -device nvdimm,memdev=mem1,id=nv1
>
> Changelog in v7:
> - changes from Vladimir Sementsov-Ogievskiy's comments:
>    1) let gethugepagesize() realize if fstat is failed instead of get
>       normal page size
>    2) rename  open_file_path to open_ram_file_path
>    3) better log the error message by using error_setg_errno
>    4) update commit in the commit log to explain hugepage detection on
>       Windows
>
> - changes from Eduardo Habkost's comments:
>    1) use 'Error**' to collect error message for qemu_file_get_page_size()
>    2) move gethugepagesize() replacement to the same patch to make it
>       better for review
>    3) introduce qemu_get_file_size to unity the code with raw_getlength()
>
> - changes from Stefan's comments:
>    1) check the memory region is large enough to contain DSM output
>       buffer
>
> - changes from Eric Blake's comments:
>    1) update the shell command in the commit log to generate the patch
>       which drops 'pc-dimm' prefix
>
> - others:
>    pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
>    Eric Blake.
>
> Changelog in v6:
> - changes from Stefan's comments:
>    1) fix code style of struct naming by CamelCase way
>    2) fix offset + length overflow when read/write label data
>    3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
>       be used to replace getpagesize()
>
> Changelog in v5:
> - changes from Michael's comments:
>    1) prefix nvdimm_ to everything in NVDIMM source files
>    2) make parsing _DSM Arg3 more clear
>    3) comment style fix
>    5) drop single used definition
>    6) fix dirty dsm buffer lost due to memory write happened on host
>    7) check dsm buffer if it is big enough to contain input data
>    8) use build_append_int_noprefix to store single value to GArray
>
> - changes from Michael's and Igor's comments:
>    1) introduce 'nvdimm-support' parameter to control nvdimm
>       enablement and it is disabled for 2.4 and its earlier versions
>       to make live migration compatible
>    2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
>       virtualization
>
> - changes from Stefan's comments:
>    1) do endian adjustment for the buffer length
>
> - changes from Bharata B Rao's comments:
>    1) fix compile on ppc
>
> - others:
>    1) the buffer length is directly got from IO read rather than got
>       from dsm memory
>    2) fix dirty label data lost due to memory write happened on host
>
> Changelog in v4:
> - changes from Michael's comments:
>    1) show the message, "Memory is not allocated from HugeTlbfs", if file
>       based memory is not allocated from hugetlbfs.
>    2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
>       from Machine.
>    3) statically define UUID and make its operation more clear
>    4) use GArray to build device structures to avoid potential buffer
>       overflow
>    5) improve comments in the code
>    6) improve code style
>
> - changes from Igor's comments:
>    1) add NVDIMM ACPI spec document
>    2) use serialized method to avoid Mutex
>    3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
>    4) introduce a common ASL method used by _DSM for all devices to reduce
>       ACPI size
>    5) handle UUID in ACPI AML code. BTW, I'd keep handling the revision in QEMU;
>       it's better to upgrade QEMU to support Rev2 in the future
>
> - changes from Stefan's comments:
>    1) copy input data from DSM memory to local buffer to avoid potential
>       issues as DSM memory is visible to guest. Output data is handled
>       in a similar way
>
> - changes from Dan's comments:
>    1) drop static namespace as Linux has already supported label-less
>       nvdimm devices
>
> - changes from Vladimir's comments:
>    1) print better message, "failed to get file size for %s, can't create
>       backend on it", if any file operation failed to obtain the file size
>
> - others:
>    create a git repo on github.com for better review/test
>
> Also, thanks for Eric Blake's review on QAPI's side.
>
> Thank all of you to review this patchset.
>
> Changelog in v3:
> There are huge changes in this version; thanks to Igor, Stefan, Paolo, Eduardo,
> and Michael for their valuable comments, the patchset finally gets into better shape.
> - changes from Igor's comments:
>    1) abstract a dimm device type from pc-dimm and create the nvdimm device
>       based on dimm; it then uses a memory backend device as the nvdimm's
>       memory, and NUMA is easily implemented.
>    2) let the file-backend device support any kind of filesystem, not only
>       hugetlbfs, and let it work on a file, not only a directory, which is
>       achieved by extending 'mem-path' - if it's a directory then it works as
>       before, otherwise if it's a file then memory is allocated directly
>       from it.
>    3) we figured out an unused memory hole below 4G, 0xFF00000 ~
>       0xFFF00000; this range is large enough for NVDIMM ACPI, as building a
>       64-bit ACPI SSDT/DSDT table would break Windows XP.
>       BTW, only making SSDT.rev = 2 cannot work since the width depends only
>       on DSDT.rev, per 19.6.28 DefinitionBlock (Declare Definition Block)
>       in the ACPI spec:
> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit
> | width of Integer objects is dependent on the ComplianceRevision of the DSDT.
> | If the ComplianceRevision is less than 2, all integers are restricted to 32
> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets
> | the global integer width for all integers, including integers in SSDTs.
>    4) use the lowest ACPI spec version to document AML terms.
>    5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"
>
> - changes from Stefan's comments:
>    1) do not do endian adjustment in-place since _DSM memory is visible to guest
>    2) use target platform's target page size instead of fixed PAGE_SIZE
>       definition
>    3) lots of code style improvement and typo fixes.
>    4) live migration fix
> - changes from Paolo's comments:
>    1) improve the name of memory region
>
> - other changes:
>    1) return exact buffer size for _DSM method instead of the page size.
>    2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
>       devices.
>    3) NUMA support
>    4) implement _FIT method
>    5) rename "configdata" to "reserve-label-data"
>    6) simplify _DSM arg3 determination
>    7) main changelog update to let it reflect v3.
>
> Changelog in v2:
> - Use little endian for the DSM method, thanks to Stefan for the suggestion
>
> - introduce a new parameter, @configdata; if it's false, QEMU will
>    build a static and readonly namespace in memory and use it to serve
>    DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
>    reserved region is needed at the end of the @file, which is good for
>    a user who wants to pass the whole nvdimm device and make its data
>    completely visible to the guest
>
> - divide the source code into separate files and add maintainer info
>
> BTW, PCOMMIT virtualization on the KVM side is work in progress; hopefully it
> will be posted next week
>
> ====== Background ======
> NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be supported
> on Intel's platform. NVDIMMs are discovered via ACPI and configured by the
> _DSM method of the NVDIMM device in ACPI. Some supporting documents can be
> found at:
> ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
> Currently, the NVDIMM driver has been merged into the upstream Linux kernel, and
> this patchset tries to enable it in the virtualization field
>
> ====== Design ======
> NVDIMM supports two access modes: one is PMEM, which maps the NVDIMM into the
> CPU's address space so the CPU can directly access it as normal memory; the
> other is BLK, which is used as a block device to reduce the consumption of
> CPU address space
>
> BLK mode accesses the NVDIMM via a Command Register window and a Data Register
> window. BLK virtualization has high overhead since each sector access causes at
> least two VM-EXITs. So we currently only implement vPMEM in this patchset
>
> --- vPMEM design ---
> We introduce a new device named "nvdimm"; it uses a memory backend device as
> NVDIMM memory. The file in the file-backend device can be a regular file or a
> block device. We can use any file for test or emulation; however,
> in the real world, the files passed to the guest are:
> - a regular file created on a DAX-enabled filesystem on an NVDIMM device
>    on the host
> - the raw PMEM device on the host, e.g. /dev/pmem0
> Memory accesses on addresses created by mmap on these kinds of files can
> directly reach the NVDIMM device on the host.
>
> --- vConfigure data area design ---
> Each NVDIMM device has a config data area which is used to store label
> namespace data. In order to emulate this area, we divide the file into two
> parts:
> - the first part is (0, size - 128K], which is used as PMEM
> - the 128K at the end of the file, which is used as the Label Data Area
> This way the label namespace data stays persistent across power loss or system
> failure.
>
> We also support passing the whole file to the guest without reserving any
> region for the label data area, which is achieved by the "reserve-label-data"
> parameter - if it's false then QEMU will build a static and readonly namespace
> in memory, and that namespace covers the whole file size. The parameter is
> false by default.
>
> --- _DSM method design ---
> _DSM in ACPI is used to configure the NVDIMM; currently we only allow access to
> label namespace data, i.e. Get Namespace Label Size (Function Index 4),
> Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
> (Function Index 6)
>
> _DSM uses two pages to transfer data between ACPI and QEMU: the first page
> is RAM-based and is used to save the input info of the _DSM method, and QEMU
> reuses it to store the output info; the other page is MMIO-based, and ACPI
> writes data to this page to transfer control to QEMU
>
> ====== Test ======
> On the host:
> 1) create a memory-backed file, e.g. # dd if=/dev/zero of=/tmp/nvdimm bs=1G count=10
> 2) append "-object memory-backend-file,share,id=mem1,
>     mem-path=/tmp/nvdimm -device nvdimm,memdev=mem1,reserve-label-data,
>     id=nv1" in QEMU command line
>
> In the guest, download the latest upstream kernel (4.2 merge window) and enable
> ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
> 1) insmod drivers/nvdimm/libnvdimm.ko
> 2) insmod drivers/acpi/nfit.ko
> 3) insmod drivers/nvdimm/nd_btt.ko
> 4) insmod drivers/nvdimm/nd_pmem.ko
> You can see the whole nvdimm device used as a single namespace, and /dev/pmem0
> appears. You can do whatever you like on /dev/pmem0, including DAX access.
>
> Currently the Linux NVDIMM driver does not support namespace operations on this
> kind of PMEM; apply the changes below to support dynamic namespaces:
>
> @@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
>                          continue;
>                  }
>
> -               if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               if (nfit_mem->memdev_pmem)
>                          flags |= NDD_ALIASING;
>
> You can append another NVDIMM device in guest and do:
> # cd /sys/bus/nd/devices/
> # cd namespace1.0/
> # echo `uuidgen` > uuid
> # echo `expr 1024 \* 1024 \* 128` > size
> then reload nd_pmem.ko
>
> You can see /dev/pmem1 appears
>
> Xiao Guangrong (5):
>    nvdimm: implement NVDIMM device abstract
>    acpi: support specified oem table id for build_header
>    nvdimm acpi: build ACPI NFIT table
>    nvdimm acpi: build ACPI nvdimm devices
>    nvdimm: add maintain info
>
>   MAINTAINERS                        |   7 +
>   default-configs/i386-softmmu.mak   |   2 +
>   default-configs/x86_64-softmmu.mak |   2 +
>   hw/acpi/Makefile.objs              |   1 +
>   hw/acpi/aml-build.c                |  15 +-
>   hw/acpi/memory_hotplug.c           |   5 +
>   hw/acpi/nvdimm.c                   | 488 +++++++++++++++++++++++++++++++++++++
>   hw/arm/virt-acpi-build.c           |  13 +-
>   hw/i386/acpi-build.c               |  32 ++-
>   hw/i386/pc.c                       |  19 ++
>   hw/mem/Makefile.objs               |   1 +
>   hw/mem/nvdimm.c                    |  46 ++++
>   include/hw/acpi/aml-build.h        |   3 +-
>   include/hw/i386/pc.h               |   2 +
>   include/hw/mem/nvdimm.h            |  32 +++
>   qemu-options.hx                    |   5 +-
>   16 files changed, 651 insertions(+), 22 deletions(-)
>   create mode 100644 hw/acpi/nvdimm.c
>   create mode 100644 hw/mem/nvdimm.c
>   create mode 100644 include/hw/mem/nvdimm.h
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v9 0/5] implement vNVDIMM
  2015-12-10  3:11   ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-21 14:13     ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-21 14:13 UTC (permalink / raw)
  To: pbonzini, imammedo
  Cc: gleb, mtosatti, stefanha, mst, rth, ehabkost, dan.j.williams,
	kvm, qemu-devel



On 12/10/2015 11:11 AM, Xiao Guangrong wrote:
>
> New version, new week, and unfortunate new ping... :(

Ping again to see what happened...



^ permalink raw reply	[flat|nested] 59+ messages in thread

* How to reserve guest physical region for ACPI
  2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-28  2:39   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2015-12-28  2:39 UTC (permalink / raw)
  To: pbonzini, imammedo, mst
  Cc: gleb, mtosatti, stefanha, rth, ehabkost, dan.j.williams, kvm, qemu-devel


Hi Michael, Paolo,

Now it is time to return to the challenge of how to reserve the guest
physical region internally used by ACPI.

Igor suggested that:
| An alternative place to allocate reserve from could be high memory.
| For pc we have "reserved-memory-end" which currently makes sure
| that hotpluggable memory range isn't used by firmware
(https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)

he also came up with a way to use 64-bit addresses with DSDT/SSDT.rev = 1:
| when writing ASL one shall make sure that only XP supported
| features are in global scope, which is evaluated when tables
| are loaded and features of rev2 and higher are inside methods.
| That way XP doesn't crash as far as it doesn't evaluate unsupported
| opcode and one can guard those opcodes checking _REV object if neccesary.
(https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)

Michael, Paolo, what do you think about these ideas?

Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2015-12-28  2:39   ` [Qemu-devel] " Xiao Guangrong
@ 2015-12-28 12:50     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2015-12-28 12:50 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: ehabkost, kvm, gleb, mtosatti, qemu-devel, stefanha, imammedo,
	pbonzini, dan.j.williams, Laszlo Ersek, rth

On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> 
> Hi Michael, Paolo,
> 
> Now it is the time to return to the challenge that how to reserve guest
> physical region internally used by ACPI.
> 
> Igor suggested that:
> | An alternative place to allocate reserve from could be high memory.
> | For pc we have "reserved-memory-end" which currently makes sure
> | that hotpluggable memory range isn't used by firmware
> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)

I don't want to tie things to reserved-memory-end because this
does not scale: next time we need to reserve memory,
we'll need to find yet another way to figure out what is where.

I would like the ./hw/acpi/bios-linker-loader.c interface to be extended to
support 64-bit RAM instead (and maybe a way to allocate and
zero-initialize a buffer without loading it through fwcfg); this way the BIOS
does the allocation, and addresses can be patched into the ACPI tables.
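
Roughly, such an extended allocation command could look like the sketch
below (this is just an illustration - the command id, struct and field
names here are made up and are not the existing bios-linker-loader ABI):

#include <stdint.h>

enum {
    LOADER_CMD_ALLOCATE_RAM64 = 0x10,   /* hypothetical command id */
};

/* BIOS allocates a zero-initialized buffer in 64-bit RAM, associates it
 * with a fw_cfg file name, and later ADD_POINTER-style commands patch
 * the allocated address into the ACPI tables. */
struct LoaderAllocRam64 {
    uint32_t command;      /* LOADER_CMD_ALLOCATE_RAM64 */
    char     file[56];     /* fw_cfg file name identifying the buffer */
    uint64_t size;         /* number of bytes to allocate and zero */
    uint64_t align;        /* required alignment of the allocation */
} __attribute__((packed));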

See patch at the bottom that might be handy.

> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> | when writing ASL one shall make sure that only XP supported
> | features are in global scope, which is evaluated when tables
> | are loaded and features of rev2 and higher are inside methods.
> | That way XP doesn't crash as far as it doesn't evaluate unsupported
> | opcode and one can guard those opcodes checking _REV object if neccesary.
> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)

Yes, this technique works.

An alternative is to add an XSDT, XP ignores that.
XSDT at the moment breaks OVMF (because it loads both
the RSDT and the XSDT, which is wrong), but I think
Laszlo was working on a fix for that.

> Michael, Paolo, what do you think about these ideas?
> 
> Thanks!



So using the patch below, we can add Name(PQRS, 0x0) at the top of the
SSDT (or at the bottom, or in a separate SSDT just for that).  It returns the
current offset so we can add that to the linker.

This won't work if you append the Name to the Aml structure (these can be
nested to arbitrary depth using aml_append), so using a plain GArray for
this API makes sense to me.
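
For example, the returned offset could be used roughly like this (a sketch
only, assuming the build_append_named_qword() helper from the patch below;
the final fixup step is hypothetical):

#include "hw/acpi/aml-build.h"

static void example_build_ssdt_placeholder(GArray *ssdt)
{
    /* emit the named 8-byte zero placeholder and remember where it is */
    int pqrs_offset = build_append_named_qword(ssdt, "PQRS");

    /* ... append the rest of the SSDT as usual ... */

    /* a later fixup step - e.g. a linker/loader command that knows the
     * allocated guest physical address - overwrites the 8 zero bytes at
     * ssdt->data + pqrs_offset with the real 64-bit address */
    (void)pqrs_offset;
}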

--->

acpi: add build_append_named_dword, returning an offset in buffer

This is a very limited form of support for runtime patching -
similar in functionality to what we can do with ACPI_EXTRACT
macros in python, but implemented in C.

This is to allow ACPI code direct access to data tables -
which is exactly what DataTableRegion is there for, except
no known windows release so far implements DataTableRegion.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 1b632dc..f8998ea 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
 void
 build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
 
+int
+build_append_named_dword(GArray *array, const char *name_format, ...);
+
 #endif
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 0d4b324..7f9fa65 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
     }
 }
 
+/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
+ * and return the offset to 0x0 for runtime patching.
+ *
+ * Warning: runtime patching is best avoided. Only use this as
+ * a replacement for DataTableRegion (for guests that don't
+ * support it).
+ */
+int
+build_append_named_qword(GArray *array, const char *name_format, ...)
+{
+    int offset;
+    va_list ap;
+
+    va_start(ap, name_format);
+    build_append_namestringv(array, name_format, ap);
+    va_end(ap);
+
+    build_append_byte(array, 0x0E); /* QWordPrefix */
+
+    offset = array->len;
+    build_append_int_noprefix(array, 0x0, 8);
+    assert(array->len == offset + 8);
+
+    return offset;
+}
+
 static GPtrArray *alloc_list;
 
 static Aml *aml_alloc(void)

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2015-12-28 12:50     ` [Qemu-devel] " Michael S. Tsirkin
@ 2015-12-30 15:55       ` Igor Mammedov
  -1 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2015-12-30 15:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Xiao Guangrong, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel, Laszlo Ersek

On Mon, 28 Dec 2015 14:50:15 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> > 
> > Hi Michael, Paolo,
> > 
> > Now it is the time to return to the challenge that how to reserve guest
> > physical region internally used by ACPI.
> > 
> > Igor suggested that:
> > | An alternative place to allocate reserve from could be high memory.
> > | For pc we have "reserved-memory-end" which currently makes sure
> > | that hotpluggable memory range isn't used by firmware
> > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> 
> I don't want to tie things to reserved-memory-end because this
> does not scale: next time we need to reserve memory,
> we'll need to find yet another way to figure out what is where.
Could you elaborate a bit more on the problem you're seeing?

To me it looks like it scales rather well.
For example, let's imagine that we are adding a device
that has some on-device memory that should be mapped into GPA;
the code to do so would look like:

  pc_machine_device_plug_cb(dev)
  {
   ...
   if (dev == OUR_NEW_DEVICE_TYPE) {
       memory_region_add_subregion(as, current_reserved_end, &dev->mr);
       set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
   }
  }

we can practically add any number of new devices that way.

 
> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> support 64 bit RAM instead (and maybe a way to allocate and
> zero-initialize buffer without loading it through fwcfg), this way bios
> does the allocation, and addresses can be patched into acpi.
and then the guest side needs to parse/execute some AML that would
initialize the QEMU side so it knows where to write data.

bios-linker-loader is a great interface for initializing some
guest-owned data and linking it together, but I think it adds
unnecessary complexity and is misused if it's used to handle
device-owned data/on-device memory in this and the VMGID case.

There was an RFC on the list to make the BIOS boot from NVDIMM, already
doing some ACPI table lookup/parsing. Now if those patches were forced
to also parse and execute AML to initialize QEMU with a guest-allocated
address, that would complicate them quite a bit.
Whereas with the NVDIMM control memory region mapped directly by QEMU,
the respective patches don't need to initialize QEMU in any way;
all they need to do is read the necessary data from the control region.

Also, using bios-linker-loader takes away some usable RAM
from the guest, and in the end that doesn't scale:
the more devices I add, the less usable RAM is left for the guest OS,
while all the device needs is a piece of GPA address space
that would belong to it.

> 
> See patch at the bottom that might be handy.
> 
> > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > | when writing ASL one shall make sure that only XP supported
> > | features are in global scope, which is evaluated when tables
> > | are loaded and features of rev2 and higher are inside methods.
> > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
> 
> Yes, this technique works.
> 
> An alternative is to add an XSDT, XP ignores that.
> XSDT at the moment breaks OVMF (because it loads both
> the RSDT and the XSDT, which is wrong), but I think
> Laszlo was working on a fix for that.
Using XSDT would increase the RAM occupied by ACPI tables,
as it would duplicate the DSDT + non-XP-supported AML
at the global namespace.

So far we've managed to keep the DSDT compatible with XP while
introducing features from v2 and higher ACPI revisions as
AML that is only evaluated on demand.
We can continue doing so unless we have to unconditionally
add incompatible AML at global scope.


> 
> > Michael, Paolo, what do you think about these ideas?
> > 
> > Thanks!
> 
> 
> 
> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> current offset so we can add that to the linker.
> 
> Won't work if you append the Name to the Aml structure (these can be
> nested to arbitrary depth using aml_append), so using plain GArray for
> this API makes sense to me.
> 
> --->
> 
> acpi: add build_append_named_dword, returning an offset in buffer
> 
> This is a very limited form of support for runtime patching -
> similar in functionality to what we can do with ACPI_EXTRACT
> macros in python, but implemented in C.
> 
> This is to allow ACPI code direct access to data tables -
> which is exactly what DataTableRegion is there for, except
> no known windows release so far implements DataTableRegion.
Unsupported means Windows will BSOD, so it's practically
unusable unless MS patches currently existing Windows
versions.

Another thing about DataTableRegion is that ACPI tables are
supposed to have static content which matches the checksum in
the table header, while you are trying to use it for dynamic
data. It would be cleaner/more compatible to teach
bios-linker-loader to just allocate memory and patch the AML
with the allocated address.

Also, if OperationRegion() is used, then one has to patch
DefOpRegion directly, as RegionOffset must be an Integer;
using variable names is not permitted there.
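
For illustration, patching a DefOpRegion offset with the same trick could
look roughly like this (a sketch only: the helper and the "NRAM" name are
made up; it would sit next to the other build_append_* helpers in
hw/acpi/aml-build.c and assumes build_append_namestring() is available
there alongside the build_append_namestringv() used in the patch):

static int example_append_opregion_placeholder(GArray *array, uint64_t len)
{
    int offset;

    build_append_byte(array, 0x5B);            /* ExtOpPrefix */
    build_append_byte(array, 0x80);            /* OpRegionOp  */
    build_append_namestring(array, "NRAM");    /* RegionName  */
    build_append_byte(array, 0x00);            /* RegionSpace: SystemMemory */

    build_append_byte(array, 0x0E);            /* QWordPrefix */
    offset = array->len;                       /* RegionOffset placeholder */
    build_append_int_noprefix(array, 0x0, 8);

    build_append_byte(array, 0x0E);            /* QWordPrefix */
    build_append_int_noprefix(array, len, 8);  /* RegionLen */

    return offset;                             /* patch 8 bytes here later */
}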

> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> 
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 1b632dc..f8998ea 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>  void
>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>  
> +int
> +build_append_named_dword(GArray *array, const char *name_format, ...);
> +
>  #endif
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 0d4b324..7f9fa65 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
>      }
>  }
>  
> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> + * and return the offset to 0x0 for runtime patching.
> + *
> + * Warning: runtime patching is best avoided. Only use this as
> + * a replacement for DataTableRegion (for guests that don't
> + * support it).
> + */
> +int
> +build_append_named_qword(GArray *array, const char *name_format, ...)
> +{
> +    int offset;
> +    va_list ap;
> +
> +    va_start(ap, name_format);
> +    build_append_namestringv(array, name_format, ap);
> +    va_end(ap);
> +
> +    build_append_byte(array, 0x0E); /* QWordPrefix */
> +
> +    offset = array->len;
> +    build_append_int_noprefix(array, 0x0, 8);
> +    assert(array->len == offset + 8);
> +
> +    return offset;
> +}
> +
>  static GPtrArray *alloc_list;
>  
>  static Aml *aml_alloc(void)
> 
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2015-12-30 15:55       ` [Qemu-devel] " Igor Mammedov
@ 2015-12-30 19:52         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2015-12-30 19:52 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Xiao Guangrong, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel, Laszlo Ersek

On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
> On Mon, 28 Dec 2015 14:50:15 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
> > > 
> > > Hi Michael, Paolo,
> > > 
> > > Now it is the time to return to the challenge that how to reserve guest
> > > physical region internally used by ACPI.
> > > 
> > > Igor suggested that:
> > > | An alternative place to allocate reserve from could be high memory.
> > > | For pc we have "reserved-memory-end" which currently makes sure
> > > | that hotpluggable memory range isn't used by firmware
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
> > 
> > I don't want to tie things to reserved-memory-end because this
> > does not scale: next time we need to reserve memory,
> > we'll need to find yet another way to figure out what is where.
> Could you elaborate a bit more on a problem you're seeing?
> 
> To me it looks like it scales rather well.
> For example lets imagine that we adding a device
> that has some on device memory that should be mapped into GPA
> code to do so would look like:
> 
>   pc_machine_device_plug_cb(dev)
>   {
>    ...
>    if (dev == OUR_NEW_DEVICE_TYPE) {
>        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
>    }
>   }
> 
> we can practically add any number of new devices that way.

Yes, but we'll have to build a host-side allocator for these, and that's
nasty. We'll also have to maintain these addresses indefinitely (at
least per machine version) as they are guest-visible.
Not only that, there's no way for the guest to know if we move things
around, so basically we'll never be able to change addresses.


>  
> > I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> > support 64 bit RAM instead (and maybe a way to allocate and
> > zero-initialize buffer without loading it through fwcfg), this way bios
> > does the allocation, and addresses can be patched into acpi.
> and then guest side needs to parse/execute some AML that would
> initialize QEMU side so it would know where to write data.

Well not really - we can put it in a data table, by itself
so it's easy to find.

AML is only needed if access from ACPI is desired.


> bios-linker-loader is a great interface for initializing some
> guest owned data and linking it together but I think it adds
> unnecessary complexity and is misused if it's used to handle
> device owned data/on device memory in this and VMGID cases.

I want a generic interface for the guest to enumerate these things.  The linker
seems quite reasonable, but if you see a reason why it won't do, or want
to propose a better interface, fine.

PCI would do, too - though windows guys had concerns about
returning PCI BARs from ACPI.


> There was RFC on list to make BIOS boot from NVDIMM already
> doing some ACPI table lookup/parsing. Now if they were forced
> to also parse and execute AML to initialize QEMU with guest
> allocated address that would complicate them quite a bit.

If they just need to find a table by name, it won't be
too bad, will it?

> While with NVDIMM control memory region mapped directly by QEMU,
> respective patches don't need in any way to initialize QEMU,
> all they would need just read necessary data from control region.
> 
> Also using bios-linker-loader takes away some usable RAM
> from guest and in the end that doesn't scale,
> the more devices I add the less usable RAM is left for guest OS
> while all the device needs is a piece of GPA address space
> that would belong to it.

I don't get this comment. I don't think it's MMIO that is wanted.
If it's backed by qemu virtual memory then it's RAM.

> > 
> > See patch at the bottom that might be handy.
> > 
> > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > | when writing ASL one shall make sure that only XP supported
> > > | features are in global scope, which is evaluated when tables
> > > | are loaded and features of rev2 and higher are inside methods.
> > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
> > 
> > Yes, this technique works.
> > 
> > An alternative is to add an XSDT, XP ignores that.
> > XSDT at the moment breaks OVMF (because it loads both
> > the RSDT and the XSDT, which is wrong), but I think
> > Laszlo was working on a fix for that.
> Using XSDT would increase ACPI tables occupied RAM
> as it would duplicate DSDT + non XP supported AML
> at global namespace.

Not at all - I posted patches linking to the same
tables from both the RSDT and the XSDT at some point.
Only the list of pointers would be different.

> So far we've managed keep DSDT compatible with XP while
> introducing features from v2 and higher ACPI revisions as
> AML that is only evaluated on demand.
> We can continue doing so unless we have to unconditionally
> add incompatible AML at global scope.
> 

Yes.

> > 
> > > Michael, Paolo, what do you think about these ideas?
> > > 
> > > Thanks!
> > 
> > 
> > 
> > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > current offset so we can add that to the linker.
> > 
> > Won't work if you append the Name to the Aml structure (these can be
> > nested to arbitrary depth using aml_append), so using plain GArray for
> > this API makes sense to me.
> > 
> > --->
> > 
> > acpi: add build_append_named_dword, returning an offset in buffer
> > 
> > This is a very limited form of support for runtime patching -
> > similar in functionality to what we can do with ACPI_EXTRACT
> > macros in python, but implemented in C.
> > 
> > This is to allow ACPI code direct access to data tables -
> > which is exactly what DataTableRegion is there for, except
> > no known windows release so far implements DataTableRegion.
> unsupported means Windows will BSOD, so it's practically
> unusable unless MS will patch currently existing Windows
> versions.

Yes. That's why my patch allows patching SSDT without using
DataTableRegion.

> Another thing about DataTableRegion is that ACPI tables are
> supposed to have static content which matches checksum in
> table the header while you are trying to use it for dynamic
> data. It would be cleaner/more compatible to teach
> bios-linker-loader to just allocate memory and patch AML
> with the allocated address.

Yes - if the address is static, you need to put it outside
the table. It can come right before or right after the table.

> Also if OperationRegion() is used, then one has to patch
> DefOpRegion directly as RegionOffset must be Integer,
> using variable names is not permitted there.

I am not sure the comment was understood correctly.
The comment really says "we can't use DataTableRegion,
so here is an alternative".

> 
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > index 1b632dc..f8998ea 100644
> > --- a/include/hw/acpi/aml-build.h
> > +++ b/include/hw/acpi/aml-build.h
> > @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> >  void
> >  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> >  
> > +int
> > +build_append_named_dword(GArray *array, const char *name_format, ...);
> > +
> >  #endif
> > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > index 0d4b324..7f9fa65 100644
> > --- a/hw/acpi/aml-build.c
> > +++ b/hw/acpi/aml-build.c
> > @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> >      }
> >  }
> >  
> > +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> > + * and return the offset to 0x0 for runtime patching.
> > + *
> > + * Warning: runtime patching is best avoided. Only use this as
> > + * a replacement for DataTableRegion (for guests that don't
> > + * support it).
> > + */
> > +int
> > +build_append_named_qword(GArray *array, const char *name_format, ...)
> > +{
> > +    int offset;
> > +    va_list ap;
> > +
> > +    va_start(ap, name_format);
> > +    build_append_namestringv(array, name_format, ap);
> > +    va_end(ap);
> > +
> > +    build_append_byte(array, 0x0E); /* QWordPrefix */
> > +
> > +    offset = array->len;
> > +    build_append_int_noprefix(array, 0x0, 8);
> > +    assert(array->len == offset + 8);
> > +
> > +    return offset;
> > +}
> > +
> >  static GPtrArray *alloc_list;
> >  
> >  static Aml *aml_alloc(void)
> > 
> > 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2015-12-30 19:52         ` [Qemu-devel] " Michael S. Tsirkin
@ 2016-01-04 20:17           ` Laszlo Ersek
  -1 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-04 20:17 UTC (permalink / raw)
  To: Michael S. Tsirkin, Igor Mammedov, Xiao Guangrong
  Cc: pbonzini, gleb, mtosatti, stefanha, rth, ehabkost,
	dan.j.williams, kvm, qemu-devel

Michael CC'd me on the grandparent of the email below. I'll try to add
my thoughts in a single go, with regard to OVMF.

On 12/30/15 20:52, Michael S. Tsirkin wrote:
> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
>> On Mon, 28 Dec 2015 14:50:15 +0200
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
>>>>
>>>> Hi Michael, Paolo,
>>>>
>>>> Now it is the time to return to the challenge that how to reserve guest
>>>> physical region internally used by ACPI.
>>>>
>>>> Igor suggested that:
>>>> | An alternative place to allocate reserve from could be high memory.
>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>> | that hotpluggable memory range isn't used by firmware
>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)

OVMF has no support for the "reserved-memory-end" fw_cfg file. The
reason is that nobody wrote that patch, nor asked for the patch to be
written. (Not implying that just requesting the patch would be
sufficient for the patch to be written.)

>>> I don't want to tie things to reserved-memory-end because this
>>> does not scale: next time we need to reserve memory,
>>> we'll need to find yet another way to figure out what is where.
>> Could you elaborate a bit more on a problem you're seeing?
>>
>> To me it looks like it scales rather well.
>> For example lets imagine that we adding a device
>> that has some on device memory that should be mapped into GPA
>> code to do so would look like:
>>
>>   pc_machine_device_plug_cb(dev)
>>   {
>>    ...
>>    if (dev == OUR_NEW_DEVICE_TYPE) {
>>        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>>        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
>>    }
>>   }
>>
>> we can practically add any number of new devices that way.
> 
> Yes but we'll have to build a host side allocator for these, and that's
> nasty. We'll also have to maintain these addresses indefinitely (at
> least per machine version) as they are guest visible.
> Not only that, there's no way for guest to know if we move things
> around, so basically we'll never be able to change addresses.
> 
> 
>>  
>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
>>> support 64 bit RAM instead

This looks quite doable in OVMF, as long as the blob to allocate from
high memory contains *zero* ACPI tables.

(
Namely, each ACPI table is installed from the containing fw_cfg blob
with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
own allocation policy for the *copies* of ACPI tables it installs.

This allocation policy is left unspecified in the section of the UEFI
spec that governs EFI_ACPI_TABLE_PROTOCOL.

The current policy in edk2 (= the reference implementation) seems to be
"allocate from under 4GB". It is currently being changed to "try to
allocate from under 4GB, and if that fails, retry from high memory". (It
is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
)

>>> (and maybe a way to allocate and
>>> zero-initialize buffer without loading it through fwcfg),

Sounds reasonable.

>>> this way bios
>>> does the allocation, and addresses can be patched into acpi.
>> and then guest side needs to parse/execute some AML that would
>> initialize QEMU side so it would know where to write data.
> 
> Well not really - we can put it in a data table, by itself
> so it's easy to find.

Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
acpi_get_table_with_size()?

> 
> AML is only needed if access from ACPI is desired.
> 
> 
>> bios-linker-loader is a great interface for initializing some
>> guest owned data and linking it together but I think it adds
>> unnecessary complexity and is misused if it's used to handle
>> device owned data/on device memory in this and VMGID cases.
> 
> I want a generic interface for guest to enumerate these things.  linker
> seems quite reasonable but if you see a reason why it won't do, or want
> to propose a better interface, fine.

* The guest could do the following:
- while processing the ALLOCATE commands, it would make a note where in
GPA space each fw_cfg blob gets allocated
- at the end the guest would prepare a temporary array with a predefined
record format that associates each fw_cfg blob's name with the concrete
allocation address
- it would create an FWCfgDmaAccess structure pointing at this array,
with a new "control" bit set (or something similar)
- the guest could write the address of the FWCfgDmaAccess struct to the
appropriate register, as always.

* Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command,
specifying:
- the fw_cfg blob's name, for which to retrieve the guest-allocated
  address (this command could only follow the matching ALLOCATE
  command, never precede it)
- a flag whether the address should be written to IO or MMIO space
  (would be likely IO on x86, MMIO on ARM)
- a unique uint64_t key (could be the 16-bit fw_cfg selector value that
  identifies the blob, actually!)
- a uint64_t (IO or MMIO) address to write the unique key and then the
  allocation address to.

Either way, QEMU could learn about all the relevant guest-side
allocation addresses in a low number of traps. In addition, AML code
wouldn't have to reflect any allocation addresses to QEMU, ever.
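
Just to make these a bit more concrete, a rough sketch (all names, sizes
and the command value below are made up for illustration; neither is an
existing fw_cfg or linker/loader structure):

    #include <stdint.h>

    /* Idea 1: one record per fw_cfg blob, reported back to QEMU in a
     * single DMA operation. */
    struct FwCfgAllocReportSketch {
        char     file[56];        /* fw_cfg blob name, NUL terminated */
        uint64_t base;            /* GPA the guest allocated for it   */
    } __attribute__((packed));

    /* Idea 2: a GET_ALLOCATION_ADDRESS linker/loader command record. */
    #define LOADER_CMD_GET_ALLOCATION_ADDRESS 0x5   /* assumed value */

    struct GetAllocationAddressSketch {
        uint32_t command;         /* LOADER_CMD_GET_ALLOCATION_ADDRESS */
        char     file[56];        /* fw_cfg blob name, NUL terminated  */
        uint64_t key;             /* e.g. the blob's fw_cfg selector   */
        uint64_t result_address;  /* IO or MMIO address to write the
                                     key and allocation address to     */
        uint8_t  result_is_mmio;  /* 0 = IO space, 1 = MMIO space      */
    } __attribute__((packed));

In the second case the firmware would execute the command right after the
matching ALLOCATE, writing first the key and then the blob's allocation
address to result_address.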

> 
> PCI would do, too - though windows guys had concerns about
> returning PCI BARs from ACPI.
> 
> 
>> There was RFC on list to make BIOS boot from NVDIMM already
>> doing some ACPI table lookup/parsing. Now if they were forced
>> to also parse and execute AML to initialize QEMU with guest
>> allocated address that would complicate them quite a bit.
> 
> If they just need to find a table by name, it won't be
> too bad, would it?
> 
>> While with NVDIMM control memory region mapped directly by QEMU,
>> respective patches don't need in any way to initialize QEMU,
>> all they would need just read necessary data from control region.
>>
>> Also using bios-linker-loader takes away some usable RAM
>> from guest and in the end that doesn't scale,
>> the more devices I add the less usable RAM is left for guest OS
>> while all the device needs is a piece of GPA address space
>> that would belong to it.
> 
> I don't get this comment. I don't think it's MMIO that is wanted.
> If it's backed by qemu virtual memory then it's RAM.
> 
>>>
>>> See patch at the bottom that might be handy.

I've given up on Microsoft implementing DataTableRegion. (It's sad, really.)

From last year I have a WIP version of "docs/vmgenid.txt" that is based
on Michael's build_append_named_dword() function. If
GET_ALLOCATION_ADDRESS above looks good, then I could simplify the ACPI
stuff in that text file (and hopefully post it soon after for comments?)

>>>
>>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
>>>> | when writing ASL one shall make sure that only XP supported
>>>> | features are in global scope, which is evaluated when tables
>>>> | are loaded and features of rev2 and higher are inside methods.
>>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
>>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
>>>
>>> Yes, this technique works.

Agreed.

>>>
>>> An alternative is to add an XSDT, XP ignores that.
>>> XSDT at the moment breaks OVMF (because it loads both
>>> the RSDT and the XSDT, which is wrong), but I think
>>> Laszlo was working on a fix for that.

We have to distinguish two use cases here.

* The first is the case when QEMU prepares both an XSDT and an RSDT, and
links at least one common ACPI table from both. This would cause OVMF to
pass the same source (= to-be-copied) table to
EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() twice, with one of the
following outcomes:

- there would be two instances of the same table (think e.g. SSDT)
- the second attempt would be rejected (e.g. FADT) and that error would
  terminate the linker-loader procedure.

This issue would not be too hard to overcome, with a simple "memoization
technique". After the initial loading & linking of the tables, OVMF
could remember the addresses of the "source" ACPI tables, and could
avoid passing already installed source tables to
EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() for a second time.
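
(As a minimal sketch of that memoization, with plain C stand-ins instead
of the real UEFI types and purely illustrative names:

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_SOURCE_TABLES 128

    static const void *seen_sources[MAX_SOURCE_TABLES];
    static size_t      seen_count;

    /* Return true the first time a given source table pointer is seen,
     * false on repeats. */
    static bool first_time_seen(const void *source_table)
    {
        for (size_t i = 0; i < seen_count; i++) {
            if (seen_sources[i] == source_table) {
                return false;     /* already installed once, skip it */
            }
        }
        if (seen_count < MAX_SOURCE_TABLES) {
            seen_sources[seen_count++] = source_table;
        }
        return true;
    }

The linker-loader code would call first_time_seen() on every to-be-copied
source table and only invoke EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable()
when it returns true.)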

* The second use case is when an ACPI table is linked *only* from QEMU's
XSDT. This is much harder to fix, because
EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() in edk2 links the copy of the
passed-in table into *both* RSDT and XSDT, automatically. And, again,
the UEFI spec doesn't provide a way to control this from the caller
(i.e. from within OVMF).

I have tried earlier to effect a change in the specification of
EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), on the ASWG and USWG mailing
lists. (At that time I was trying to expose UEFI memory *type* to the
caller, from which the copy of the ACPI table being installed should be
allocated from.) Alas, I received no answers at all.

All in all I strongly recommend the "place rev2+ objects in method
scope" trick, over the "link it from the XSDT only" trick.

>> Using XSDT would increase ACPI tables occupied RAM
>> as it would duplicate DSDT + non XP supported AML
>> at global namespace.
> 
> Not at all - I posted patches linking to same
> tables from both RSDT and XSDT at some point.

Yes, at <http://thread.gmane.org/gmane.comp.emulators.qemu/342559>. This
could be made to work in OVMF with the above-mentioned memoization stuff.

> Only the list of pointers would be different.

I don't recommend that, see the second case above.

Thanks
Laszlo

>> So far we've managed keep DSDT compatible with XP while
>> introducing features from v2 and higher ACPI revisions as
>> AML that is only evaluated on demand.
>> We can continue doing so unless we have to unconditionally
>> add incompatible AML at global scope.
>>
> 
> Yes.
> 
>>>
>>>> Michael, Paolo, what do you think about these ideas?
>>>>
>>>> Thanks!
>>>
>>>
>>>
>>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
>>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
>>> current offset so we can add that to the linker.
>>>
>>> Won't work if you append the Name to the Aml structure (these can be
>>> nested to arbitrary depth using aml_append), so using plain GArray for
>>> this API makes sense to me.
>>>
>>> --->
>>>
>>> acpi: add build_append_named_dword, returning an offset in buffer
>>>
>>> This is a very limited form of support for runtime patching -
>>> similar in functionality to what we can do with ACPI_EXTRACT
>>> macros in python, but implemented in C.
>>>
>>> This is to allow ACPI code direct access to data tables -
>>> which is exactly what DataTableRegion is there for, except
>>> no known windows release so far implements DataTableRegion.
>> unsupported means Windows will BSOD, so it's practically
>> unusable unless MS will patch currently existing Windows
>> versions.
> 
> Yes. That's why my patch allows patching SSDT without using
> DataTableRegion.
> 
>> Another thing about DataTableRegion is that ACPI tables are
>> supposed to have static content which matches checksum in
>> table the header while you are trying to use it for dynamic
>> data. It would be cleaner/more compatible to teach
>> bios-linker-loader to just allocate memory and patch AML
>> with the allocated address.
> 
> Yes - if address is static, you need to put it outside
> the table. Can come right before or right after this.
> 
>> Also if OperationRegion() is used, then one has to patch
>> DefOpRegion directly as RegionOffset must be Integer,
>> using variable names is not permitted there.
> 
> I am not sure the comment was understood correctly.
> The comment says really "we can't use DataTableRegion
> so here is an alternative".
> 
>>
>>>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>
>>> ---
>>>
>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
>>> index 1b632dc..f8998ea 100644
>>> --- a/include/hw/acpi/aml-build.h
>>> +++ b/include/hw/acpi/aml-build.h
>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>>>  void
>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>>>  
>>> +int
>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
>>> +
>>>  #endif
>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
>>> index 0d4b324..7f9fa65 100644
>>> --- a/hw/acpi/aml-build.c
>>> +++ b/hw/acpi/aml-build.c
>>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
>>>      }
>>>  }
>>>  
>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
>>> + * and return the offset to 0x0 for runtime patching.
>>> + *
>>> + * Warning: runtime patching is best avoided. Only use this as
>>> + * a replacement for DataTableRegion (for guests that don't
>>> + * support it).
>>> + */
>>> +int
>>> +build_append_named_qword(GArray *array, const char *name_format, ...)
>>> +{
>>> +    int offset;
>>> +    va_list ap;
>>> +
>>> +    va_start(ap, name_format);
>>> +    build_append_namestringv(array, name_format, ap);
>>> +    va_end(ap);
>>> +
>>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
>>> +
>>> +    offset = array->len;
>>> +    build_append_int_noprefix(array, 0x0, 8);
>>> +    assert(array->len == offset + 8);
>>> +
>>> +    return offset;
>>> +}
>>> +
>>>  static GPtrArray *alloc_list;
>>>  
>>>  static Aml *aml_alloc(void)
>>>
>>>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2015-12-30 19:52         ` [Qemu-devel] " Michael S. Tsirkin
@ 2016-01-05 16:30           ` Igor Mammedov
  -1 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2016-01-05 16:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Xiao Guangrong, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel, Laszlo Ersek

On Wed, 30 Dec 2015 21:52:32 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
> > On Mon, 28 Dec 2015 14:50:15 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
> > > > 
> > > > Hi Michael, Paolo,
> > > > 
> > > > Now it is the time to return to the challenge that how to reserve guest
> > > > physical region internally used by ACPI.
> > > > 
> > > > Igor suggested that:
> > > > | An alternative place to allocate reserve from could be high memory.
> > > > | For pc we have "reserved-memory-end" which currently makes sure
> > > > | that hotpluggable memory range isn't used by firmware
> > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)  
> > > 
> > > I don't want to tie things to reserved-memory-end because this
> > > does not scale: next time we need to reserve memory,
> > > we'll need to find yet another way to figure out what is where.  
> > Could you elaborate a bit more on a problem you're seeing?
> > 
> > To me it looks like it scales rather well.
> > For example lets imagine that we adding a device
> > that has some on device memory that should be mapped into GPA
> > code to do so would look like:
> > 
> >   pc_machine_device_plug_cb(dev)
> >   {
> >    ...
> >    if (dev == OUR_NEW_DEVICE_TYPE) {
> >        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
> >        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
> >    }
> >   }
> > 
> > we can practically add any number of new devices that way.  
> 
> Yes but we'll have to build a host side allocator for these, and that's
> nasty. We'll also have to maintain these addresses indefinitely (at
> least per machine version) as they are guest visible.
> Not only that, there's no way for guest to know if we move things
> around, so basically we'll never be able to change addresses.
The simplistic GPA allocator in the snippet above does the job.

If one unconditionally adds a device in a new version then yes,
the code has to carry compat handling based on the machine version.
But that applies to any device that has state to migrate
or to any address space layout change.

However, a device that directly maps addresses doesn't have to
have a fixed address; it could behave the same way as a
PCI device with BARs, the only difference being that its
MemoryRegions are mapped before the guest is running vs
BARs mapped by the BIOS.
It could be worthwhile to create a generic base device class
that would do the above. Then it could be inherited from and
extended by concrete device implementations.
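
To spell out what I mean by "simplistic" (names below are illustrative,
not existing QEMU code), the whole allocator is basically:

    #include <stdint.h>

    /* Grows upward from reserved-memory-end as devices are plugged. */
    static uint64_t current_reserved_end;

    /* Carve the next 'size' bytes of GPA space out of the reserved
     * area; 'align' is assumed to be a power of two. */
    static uint64_t reserve_gpa_range(uint64_t size, uint64_t align)
    {
        uint64_t base = (current_reserved_end + align - 1) & ~(align - 1);

        current_reserved_end = base + size;
        return base;
    }

Each plugged device just gets the next aligned chunk and the end marker
moves up; nothing more is needed on the QEMU side.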

> >    
> > > I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> > > support 64 bit RAM instead (and maybe a way to allocate and
> > > zero-initialize buffer without loading it through fwcfg), this way bios
> > > does the allocation, and addresses can be patched into acpi.  
> > and then guest side needs to parse/execute some AML that would
> > initialize QEMU side so it would know where to write data.  
> 
> Well not really - we can put it in a data table, by itself
> so it's easy to find.
> 
> AML is only needed if access from ACPI is desired.
In both cases (VMGEN, NVDIMM) access from ACPI is required,
at a minimum to write the address back to QEMU, and for NVDIMM
to pass _DSM method data between the guest and QEMU.

> 
> 
> > bios-linker-loader is a great interface for initializing some
> > guest owned data and linking it together but I think it adds
> > unnecessary complexity and is misused if it's used to handle
> > device owned data/on device memory in this and VMGID cases.  
> 
> I want a generic interface for guest to enumerate these things.  linker
> seems quite reasonable but if you see a reason why it won't do, or want
> to propose a better interface, fine.
> 
> PCI would do, too - though windows guys had concerns about
> returning PCI BARs from ACPI.
There were potential issues with the pSeries bootloader, which treated
PCI_CLASS_MEMORY_RAM as conventional RAM, but that was fixed.
Could you point me to the discussion about the Windows issues?

What the VMGEN patches that used PCI for mapping purposes were
stuck at was that it was suggested to use the PCI_CLASS_MEMORY_RAM
class id, but we couldn't agree on it.

VMGEN v13 with the full discussion is here:
https://patchwork.ozlabs.org/patch/443554/
So to continue with this route we would need to pick some other
driverless class id so Windows won't prompt for a driver, or
maybe supply our own driver stub to guarantee that no one
would touch it. Any suggestions?

> 
> 
> > There was RFC on list to make BIOS boot from NVDIMM already
> > doing some ACPI table lookup/parsing. Now if they were forced
> > to also parse and execute AML to initialize QEMU with guest
> > allocated address that would complicate them quite a bit.  
> 
> If they just need to find a table by name, it won't be
> too bad, would it?
That's what they were doing: scanning memory for a static NVDIMM table.
However, if it were a DataTable, the BIOS side would have to execute
AML so that the table address could be told to QEMU.

In the case of direct mapping or a PCI BAR there is no need to initialize
the QEMU side from AML.
That also saves us the IO port this address would have to be written to
if the bios-linker-loader approach were used.

> 
> > While with NVDIMM control memory region mapped directly by QEMU,
> > respective patches don't need in any way to initialize QEMU,
> > all they would need just read necessary data from control region.
> > 
> > Also using bios-linker-loader takes away some usable RAM
> > from guest and in the end that doesn't scale,
> > the more devices I add the less usable RAM is left for guest OS
> > while all the device needs is a piece of GPA address space
> > that would belong to it.  
> 
> I don't get this comment. I don't think it's MMIO that is wanted.
> If it's backed by qemu virtual memory then it's RAM.
Then why not allocate video card VRAM the same way and try to explain
to the user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
only has 64Mb of available RAM because we think that on-device VRAM
is also RAM?

Maybe I've used the MMIO term wrongly here, but it roughly reflects the idea
that on-device memory (whether it's VRAM, an NVDIMM control block or a VMGEN
area) is not allocated from the guest's usable RAM (as described in E820)
but rather directly mapped into the guest's GPA and doesn't consume available
RAM as the guest sees it. That's also the way it's done on real hardware.

What we need in the case of VMGEN ID and NVDIMM is on-device memory
that can be directly accessed by the guest.
Either direct mapping or a PCI BAR does that job, and we could use simple
static AML without any patching.

> > > 
> > > See patch at the bottom that might be handy.
> > >   
> > > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > > | when writing ASL one shall make sure that only XP supported
> > > > | features are in global scope, which is evaluated when tables
> > > > | are loaded and features of rev2 and higher are inside methods.
> > > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)  
> > > 
> > > Yes, this technique works.
> > > 
> > > An alternative is to add an XSDT, XP ignores that.
> > > XSDT at the moment breaks OVMF (because it loads both
> > > the RSDT and the XSDT, which is wrong), but I think
> > > Laszlo was working on a fix for that.  
> > Using XSDT would increase ACPI tables occupied RAM
> > as it would duplicate DSDT + non XP supported AML
> > at global namespace.  
> 
> Not at all - I posted patches linking to same
> tables from both RSDT and XSDT at some point.
> Only the list of pointers would be different.
If you put XP-incompatible AML in a separate SSDT and link it
only from the XSDT then that would work, but if the incompatibility
is in the DSDT, one would have to provide a compatible DSDT for the RSDT
and an incompatible DSDT for the XSDT.

So far the policy was: don't try to run a guest OS on a QEMU
configuration that isn't supported by it.
For example we use VAR_PACKAGE when running with more
than 255 VCPUs (commit b4f4d5481), which BSODs XP.

So we can continue with that policy without resorting to
using both RSDT and XSDT.
It would be even easier, as all AML would be dynamically
generated and the DSDT would only contain AML elements for
a concrete QEMU configuration.

> > So far we've managed keep DSDT compatible with XP while
> > introducing features from v2 and higher ACPI revisions as
> > AML that is only evaluated on demand.
> > We can continue doing so unless we have to unconditionally
> > add incompatible AML at global scope.
> >   
> 
> Yes.
> 
> > >   
> > > > Michael, Paolo, what do you think about these ideas?
> > > > 
> > > > Thanks!  
> > > 
> > > 
> > > 
> > > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > > current offset so we can add that to the linker.
> > > 
> > > Won't work if you append the Name to the Aml structure (these can be
> > > nested to arbitrary depth using aml_append), so using plain GArray for
> > > this API makes sense to me.
> > >   
> > > --->  
> > > 
> > > acpi: add build_append_named_dword, returning an offset in buffer
> > > 
> > > This is a very limited form of support for runtime patching -
> > > similar in functionality to what we can do with ACPI_EXTRACT
> > > macros in python, but implemented in C.
> > > 
> > > This is to allow ACPI code direct access to data tables -
> > > which is exactly what DataTableRegion is there for, except
> > > no known windows release so far implements DataTableRegion.  
> > unsupported means Windows will BSOD, so it's practically
> > unusable unless MS will patch currently existing Windows
> > versions.  
> 
> Yes. That's why my patch allows patching SSDT without using
> DataTableRegion.
> 
> > Another thing about DataTableRegion is that ACPI tables are
> > supposed to have static content which matches checksum in
> > table the header while you are trying to use it for dynamic
> > data. It would be cleaner/more compatible to teach
> > bios-linker-loader to just allocate memory and patch AML
> > with the allocated address.  
> 
> Yes - if address is static, you need to put it outside
> the table. Can come right before or right after this.
> 
> > Also if OperationRegion() is used, then one has to patch
> > DefOpRegion directly as RegionOffset must be Integer,
> > using variable names is not permitted there.  
> 
> I am not sure the comment was understood correctly.
> The comment says really "we can't use DataTableRegion
> so here is an alternative".
So how are you going to access the data that the patched
NameString points to?
For that you'd need a normal patched OperationRegion
as well, since DataTableRegion isn't usable.

> 
> >   
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > 
> > > ---
> > > 
> > > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > > index 1b632dc..f8998ea 100644
> > > --- a/include/hw/acpi/aml-build.h
> > > +++ b/include/hw/acpi/aml-build.h
> > > @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> > >  void
> > >  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> > >  
> > > +int
> > > +build_append_named_dword(GArray *array, const char *name_format, ...);
> > > +
> > >  #endif
> > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > > index 0d4b324..7f9fa65 100644
> > > --- a/hw/acpi/aml-build.c
> > > +++ b/hw/acpi/aml-build.c
> > > @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> > >      }
> > >  }
> > >  
> > > +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> > > + * and return the offset to 0x0 for runtime patching.
> > > + *
> > > + * Warning: runtime patching is best avoided. Only use this as
> > > + * a replacement for DataTableRegion (for guests that don't
> > > + * support it).
> > > + */
> > > +int
> > > +build_append_named_qword(GArray *array, const char *name_format, ...)
> > > +{
> > > +    int offset;
> > > +    va_list ap;
> > > +
> > > +    va_start(ap, name_format);
> > > +    build_append_namestringv(array, name_format, ap);
> > > +    va_end(ap);
> > > +
> > > +    build_append_byte(array, 0x0E); /* QWordPrefix */
> > > +
> > > +    offset = array->len;
> > > +    build_append_int_noprefix(array, 0x0, 8);
> > > +    assert(array->len == offset + 8);
> > > +
> > > +    return offset;
> > > +}
> > > +
> > >  static GPtrArray *alloc_list;
> > >  
> > >  static Aml *aml_alloc(void)
> > > 
> > >   


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] How to reserve guest physical region for ACPI
@ 2016-01-05 16:30           ` Igor Mammedov
  0 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2016-01-05 16:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Xiao Guangrong, ehabkost, kvm, gleb, mtosatti, qemu-devel,
	stefanha, pbonzini, dan.j.williams, Laszlo Ersek, rth

On Wed, 30 Dec 2015 21:52:32 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
> > On Mon, 28 Dec 2015 14:50:15 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
> > > > 
> > > > Hi Michael, Paolo,
> > > > 
> > > > Now it is the time to return to the challenge that how to reserve guest
> > > > physical region internally used by ACPI.
> > > > 
> > > > Igor suggested that:
> > > > | An alternative place to allocate reserve from could be high memory.
> > > > | For pc we have "reserved-memory-end" which currently makes sure
> > > > | that hotpluggable memory range isn't used by firmware
> > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)  
> > > 
> > > I don't want to tie things to reserved-memory-end because this
> > > does not scale: next time we need to reserve memory,
> > > we'll need to find yet another way to figure out what is where.  
> > Could you elaborate a bit more on a problem you're seeing?
> > 
> > To me it looks like it scales rather well.
> > For example lets imagine that we adding a device
> > that has some on device memory that should be mapped into GPA
> > code to do so would look like:
> > 
> >   pc_machine_device_plug_cb(dev)
> >   {
> >    ...
> >    if (dev == OUR_NEW_DEVICE_TYPE) {
> >        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
> >        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
> >    }
> >   }
> > 
> > we can practically add any number of new devices that way.  
> 
> Yes but we'll have to build a host side allocator for these, and that's
> nasty. We'll also have to maintain these addresses indefinitely (at
> least per machine version) as they are guest visible.
> Not only that, there's no way for guest to know if we move things
> around, so basically we'll never be able to change addresses.
The simplistic GPA allocator in the snippet above does the job.

If one unconditionally adds a device in a new version then yes, the
code has to carry compat handling based on the machine version.
But that applies to any device that has state to migrate,
or to any address-space layout change.

However, a device that directly maps its memory doesn't have to
have a fixed address; it could behave the same way as a
PCI device with BARs, the only difference being that its
MemoryRegions are mapped before the guest runs instead of
BARs being mapped by the BIOS.
It could be worth creating a generic base device class
that does the above; concrete device implementations could then
inherit from and extend it.
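
A rough sketch of such a base class, assuming the usual QEMU headers
(exec/memory.h, hw/qdev-core.h, hw/i386/pc.h); the type name, the
reserved_end field and pc_map_dmm() are made up for illustration and
not taken from any posted series:

    /* hypothetical TYPE_DIRECT_MAPPED_MEM: owns one RAM-backed
     * MemoryRegion and records where the machine mapped it */
    typedef struct DirectMappedMemDevice {
        DeviceState parent_obj;
        MemoryRegion mr;
        uint64_t size;
        hwaddr gpa;                     /* set by the machine plug handler */
    } DirectMappedMemDevice;

    static void dmm_realize(DeviceState *dev, Error **errp)
    {
        DirectMappedMemDevice *dmm = (DirectMappedMemDevice *)dev;

        memory_region_init_ram(&dmm->mr, OBJECT(dev), "dmm-buf",
                               dmm->size, errp);
    }

    /* machine-side hook, in the spirit of the snippet quoted above */
    static void pc_map_dmm(PCMachineState *pcms, DirectMappedMemDevice *dmm)
    {
        dmm->gpa = pcms->reserved_end;              /* illustrative field */
        memory_region_add_subregion(get_system_memory(), dmm->gpa, &dmm->mr);
        pcms->reserved_end += memory_region_size(&dmm->mr);
    }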

> >    
> > > I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> > > support 64 bit RAM instead (and maybe a way to allocate and
> > > zero-initialize buffer without loading it through fwcfg), this way bios
> > > does the allocation, and addresses can be patched into acpi.  
> > and then guest side needs to parse/execute some AML that would
> > initialize QEMU side so it would know where to write data.  
> 
> Well not really - we can put it in a data table, by itself
> so it's easy to find.
> 
> AML is only needed if access from ACPI is desired.
In both cases (VMGEN, NVDIMM) access from ACPI is required:
at a minimum to write the address back to QEMU, and for NVDIMM
also to pass _DSM method data between guest and QEMU.

> 
> 
> > bios-linker-loader is a great interface for initializing some
> > guest owned data and linking it together but I think it adds
> > unnecessary complexity and is misused if it's used to handle
> > device owned data/on device memory in this and VMGID cases.  
> 
> I want a generic interface for guest to enumerate these things.  linker
> seems quite reasonable but if you see a reason why it won't do, or want
> to propose a better interface, fine.
> 
> PCI would do, too - though windows guys had concerns about
> returning PCI BARs from ACPI.
There were potential issues with the pSeries bootloader, which treated
PCI_CLASS_MEMORY_RAM as conventional RAM, but that was fixed.
Could you point me to the discussion about the Windows issues?

Where the VMGEN patches that used PCI for mapping got stuck was that
PCI_CLASS_MEMORY_RAM was suggested as the class id, but we couldn't
agree on it.

VMGEN v13 with the full discussion is here:
https://patchwork.ozlabs.org/patch/443554/
So to continue with this route we would need to pick some other
driverless class id so Windows won't prompt for a driver, or maybe
supply our own driver stub to guarantee that no one would touch it.
Any suggestions?

> 
> 
> > There was RFC on list to make BIOS boot from NVDIMM already
> > doing some ACPI table lookup/parsing. Now if they were forced
> > to also parse and execute AML to initialize QEMU with guest
> > allocated address that would complicate them quite a bit.  
> 
> If they just need to find a table by name, it won't be
> too bad, would it?
That's what they were doing: scanning memory for the static NVDIMM
table. However, if it were a DataTable, the BIOS side would have to
execute AML so that the table address could be told to QEMU.

In the case of direct mapping or a PCI BAR there is no need to
initialize the QEMU side from AML.
That also saves us the IO port this address would have to be written
to if the bios-linker-loader approach were used.

> 
> > While with NVDIMM control memory region mapped directly by QEMU,
> > respective patches don't need in any way to initialize QEMU,
> > all they would need just read necessary data from control region.
> > 
> > Also using bios-linker-loader takes away some usable RAM
> > from guest and in the end that doesn't scale,
> > the more devices I add the less usable RAM is left for guest OS
> > while all the device needs is a piece of GPA address space
> > that would belong to it.  
> 
> I don't get this comment. I don't think it's MMIO that is wanted.
> If it's backed by qemu virtual memory then it's RAM.
Then why don't we allocate video card VRAM the same way and try to
explain to the user that a guest started with
'-m 128 -device cirrus-vga,vgamem_mb=64Mb' only has 64Mb of available
RAM, because we think that on-device VRAM is also RAM?

Maybe I've used the term MMIO wrongly here, but it roughly reflects
the idea that on-device memory (whether it's VRAM, an NVDIMM control
block or a VMGEN area) is not allocated from the guest's usable RAM
(as described in E820) but rather directly mapped into the guest's GPA
space, and doesn't consume available RAM as the guest sees it.
That's also the way it's done on real hardware.

What we need in the case of VMGEN ID and NVDIMM is on-device memory
that can be directly accessed by the guest.
Either direct mapping or a PCI BAR does that job, and we could use
simple static AML without any patching.
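
For illustration, such a fixed mapping can be described with plain,
static AML using the existing helpers; address, size and _HID below
are placeholders, and a QWordMemory descriptor would be needed instead
for a region above 4G:

    Aml *dev = aml_device("VGIA");
    Aml *crs = aml_resource_template();

    aml_append(dev, aml_name_decl("_HID", aml_string("QEMU0099")));
    aml_append(crs, aml_memory32_fixed(0xFEDC0000, 0x1000, AML_READ_WRITE));
    aml_append(dev, aml_name_decl("_CRS", crs));
    aml_append(ssdt, dev);              /* 'ssdt': the table being built */

No linker patching is involved; the guest just reads the address from
_CRS like it would for any other motherboard resource.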

> > > 
> > > See patch at the bottom that might be handy.
> > >   
> > > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > > | when writing ASL one shall make sure that only XP supported
> > > > | features are in global scope, which is evaluated when tables
> > > > | are loaded and features of rev2 and higher are inside methods.
> > > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)  
> > > 
> > > Yes, this technique works.
> > > 
> > > An alternative is to add an XSDT, XP ignores that.
> > > XSDT at the moment breaks OVMF (because it loads both
> > > the RSDT and the XSDT, which is wrong), but I think
> > > Laszlo was working on a fix for that.  
> > Using XSDT would increase ACPI tables occupied RAM
> > as it would duplicate DSDT + non XP supported AML
> > at global namespace.  
> 
> Not at all - I posted patches linking to same
> tables from both RSDT and XSDT at some point.
> Only the list of pointers would be different.
If you put the XP-incompatible AML in a separate SSDT and link it
only from the XSDT then that would work, but if the incompatibility
is in the DSDT, one would have to provide a compat DSDT for the RSDT
and an incompat DSDT for the XSDT.

So far the policy has been: don't try to run a guest OS on a QEMU
configuration that isn't supported by it.
For example we use VAR_PACKAGE when running with more
than 255 VCPUs (commit b4f4d5481), which BSODs XP.

So we can continue with that policy without resorting to
using both RSDT and XSDT.
It would be even easier as all AML would be dynamically
generated and the DSDT would only contain AML elements for
a concrete QEMU configuration.

> > So far we've managed keep DSDT compatible with XP while
> > introducing features from v2 and higher ACPI revisions as
> > AML that is only evaluated on demand.
> > We can continue doing so unless we have to unconditionally
> > add incompatible AML at global scope.
> >   
> 
> Yes.
> 
> > >   
> > > > Michael, Paolo, what do you think about these ideas?
> > > > 
> > > > Thanks!  
> > > 
> > > 
> > > 
> > > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > > current offset so we can add that to the linker.
> > > 
> > > Won't work if you append the Name to the Aml structure (these can be
> > > nested to arbitrary depth using aml_append), so using plain GArray for
> > > this API makes sense to me.
> > >   
> > > --->  
> > > 
> > > acpi: add build_append_named_dword, returning an offset in buffer
> > > 
> > > This is a very limited form of support for runtime patching -
> > > similar in functionality to what we can do with ACPI_EXTRACT
> > > macros in python, but implemented in C.
> > > 
> > > This is to allow ACPI code direct access to data tables -
> > > which is exactly what DataTableRegion is there for, except
> > > no known windows release so far implements DataTableRegion.  
> > unsupported means Windows will BSOD, so it's practically
> > unusable unless MS will patch currently existing Windows
> > versions.  
> 
> Yes. That's why my patch allows patching SSDT without using
> DataTableRegion.
> 
> > Another thing about DataTableRegion is that ACPI tables are
> > supposed to have static content which matches checksum in
> > table the header while you are trying to use it for dynamic
> > data. It would be cleaner/more compatible to teach
> > bios-linker-loader to just allocate memory and patch AML
> > with the allocated address.  
> 
> Yes - if address is static, you need to put it outside
> the table. Can come right before or right after this.
> 
> > Also if OperationRegion() is used, then one has to patch
> > DefOpRegion directly as RegionOffset must be Integer,
> > using variable names is not permitted there.  
> 
> I am not sure the comment was understood correctly.
> The comment says really "we can't use DataTableRegion
> so here is an alternative".
So how are you going to access the data that the patched
NameString points to?
For that you'd need a normal patched OperationRegion as well,
since DataTableRegion isn't usable.

> 
> >   
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > 
> > > ---
> > > 
> > > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > > index 1b632dc..f8998ea 100644
> > > --- a/include/hw/acpi/aml-build.h
> > > +++ b/include/hw/acpi/aml-build.h
> > > @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> > >  void
> > >  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> > >  
> > > +int
> > > +build_append_named_dword(GArray *array, const char *name_format, ...);
> > > +
> > >  #endif
> > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > > index 0d4b324..7f9fa65 100644
> > > --- a/hw/acpi/aml-build.c
> > > +++ b/hw/acpi/aml-build.c
> > > @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> > >      }
> > >  }
> > >  
> > > +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> > > + * and return the offset to 0x0 for runtime patching.
> > > + *
> > > + * Warning: runtime patching is best avoided. Only use this as
> > > + * a replacement for DataTableRegion (for guests that don't
> > > + * support it).
> > > + */
> > > +int
> > > +build_append_named_qword(GArray *array, const char *name_format, ...)
> > > +{
> > > +    int offset;
> > > +    va_list ap;
> > > +
> > > +    va_start(ap, name_format);
> > > +    build_append_namestringv(array, name_format, ap);
> > > +    va_end(ap);
> > > +
> > > +    build_append_byte(array, 0x0E); /* QWordPrefix */
> > > +
> > > +    offset = array->len;
> > > +    build_append_int_noprefix(array, 0x0, 8);
> > > +    assert(array->len == offset + 8);
> > > +
> > > +    return offset;
> > > +}
> > > +
> > >  static GPtrArray *alloc_list;
> > >  
> > >  static Aml *aml_alloc(void)
> > > 
> > >   

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-05 16:30           ` [Qemu-devel] " Igor Mammedov
@ 2016-01-05 16:43             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2016-01-05 16:43 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Xiao Guangrong, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel, Laszlo Ersek

On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
> > > bios-linker-loader is a great interface for initializing some
> > > guest owned data and linking it together but I think it adds
> > > unnecessary complexity and is misused if it's used to handle
> > > device owned data/on device memory in this and VMGID cases.  
> > 
> > I want a generic interface for guest to enumerate these things.  linker
> > seems quite reasonable but if you see a reason why it won't do, or want
> > to propose a better interface, fine.
> > 
> > PCI would do, too - though windows guys had concerns about
> > returning PCI BARs from ACPI.
> There were potential issues with pSeries bootloader that treated
> PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
> Could you point out to discussion about windows issues?
> 
> What VMGEN patches that used PCI for mapping purposes were
> stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
> class id but we couldn't agree on it.
> 
> VMGEN v13 with full discussion is here
> https://patchwork.ozlabs.org/patch/443554/
> So to continue with this route we would need to pick some other
> driver less class id so windows won't prompt for driver or
> maybe supply our own driver stub to guarantee that no one
> would touch it. Any suggestions?

Pick any device/vendor id pair for which Windows specifies no driver.
There's a small risk that this will conflict with some guest,
but I think it's minimal.


> > 
> > 
> > > There was RFC on list to make BIOS boot from NVDIMM already
> > > doing some ACPI table lookup/parsing. Now if they were forced
> > > to also parse and execute AML to initialize QEMU with guest
> > > allocated address that would complicate them quite a bit.  
> > 
> > If they just need to find a table by name, it won't be
> > too bad, would it?
> that's what they were doing scanning memory for static NVDIMM table.
> However if it were DataTable, BIOS side would have to execute
> AML so that the table address could be told to QEMU.

Not at all. You can find any table by its signature without
parsing AML.
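
A firmware-side sketch of that lookup, assuming the RSDT has already
been found via the RSDP (the header layout is the standard 36-byte one
from the ACPI spec, nothing QEMU-specific):

    #include <stdint.h>
    #include <string.h>

    struct acpi_table_header {
        char     signature[4];
        uint32_t length;
        uint8_t  revision;
        uint8_t  checksum;
        char     oem_id[6];
        char     oem_table_id[8];
        uint32_t oem_revision;
        uint32_t creator_id;
        uint32_t creator_revision;
    } __attribute__((packed));

    /* RSDT body: 32-bit physical pointers to the other tables */
    static struct acpi_table_header *
    acpi_find_table(struct acpi_table_header *rsdt, const char sig[4])
    {
        uint32_t *entry = (uint32_t *)((uint8_t *)rsdt + sizeof(*rsdt));
        unsigned n = (rsdt->length - sizeof(*rsdt)) / sizeof(uint32_t);

        for (unsigned i = 0; i < n; i++) {
            struct acpi_table_header *t =
                (struct acpi_table_header *)(uintptr_t)entry[i];
            if (!memcmp(t->signature, sig, 4)) {
                return t;
            }
        }
        return NULL;
    }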


> In case of direct mapping or PCI BAR there is no need to initialize
> QEMU side from AML.
> That also saves us IO port where this address should be written
> if bios-linker-loader approach is used.
> 
> > 
> > > While with NVDIMM control memory region mapped directly by QEMU,
> > > respective patches don't need in any way to initialize QEMU,
> > > all they would need just read necessary data from control region.
> > > 
> > > Also using bios-linker-loader takes away some usable RAM
> > > from guest and in the end that doesn't scale,
> > > the more devices I add the less usable RAM is left for guest OS
> > > while all the device needs is a piece of GPA address space
> > > that would belong to it.  
> > 
> > I don't get this comment. I don't think it's MMIO that is wanted.
> > If it's backed by qemu virtual memory then it's RAM.
> Then why don't allocate video card VRAM the same way and try to explain
> user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
> only has 64Mb of available RAM because of we think that on device VRAM
> is also RAM.
> 
> Maybe I've used MMIO term wrongly here but it roughly reflects the idea
> that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
> area) is not allocated from guest's usable RAM (as described in E820)
> but rather directly mapped in guest's GPA and doesn't consume available
> RAM as guest sees it. That's also the way it's done on real hardware.
> 
> What we need in case of VMGEN ID and NVDIMM is on device memory
> that could be directly accessed by guest.
> Both direct mapping or PCI BAR do that job and we could use simple
> static AML without any patching.

At least with VMGEN the issue is that there's an AML method
that returns the physical address.
Then if the guest OS moves the BAR (which is legal), it will break,
since the caller has no way to know it's related to the BAR.


> > > > 
> > > > See patch at the bottom that might be handy.
> > > >   
> > > > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > > > | when writing ASL one shall make sure that only XP supported
> > > > > | features are in global scope, which is evaluated when tables
> > > > > | are loaded and features of rev2 and higher are inside methods.
> > > > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > > > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)  
> > > > 
> > > > Yes, this technique works.
> > > > 
> > > > An alternative is to add an XSDT, XP ignores that.
> > > > XSDT at the moment breaks OVMF (because it loads both
> > > > the RSDT and the XSDT, which is wrong), but I think
> > > > Laszlo was working on a fix for that.  
> > > Using XSDT would increase ACPI tables occupied RAM
> > > as it would duplicate DSDT + non XP supported AML
> > > at global namespace.  
> > 
> > Not at all - I posted patches linking to same
> > tables from both RSDT and XSDT at some point.
> > Only the list of pointers would be different.
> if you put XP incompatible AML in separate SSDT and link it
> only from XSDT than that would work but if incompatibility
> is in DSDT, one would have to provide compat DSDT for RSDT
> an incompat DSDT for XSDT.

So don't do this.

> So far policy was don't try to run guest OS on QEMU
> configuration that isn't supported by it.

It's better if guests don't see some features but
don't crash. It's not always possible of course but
we should try to avoid this.

> For example we use VAR_PACKAGE when running with more
> than 255 VCPUs (commit b4f4d5481) which BSODs XP.

Yes. And it's because we violate the spec, DSDT
should not have this stuff.

> So we can continue with that policy with out resorting to
> using both RSDT and XSDT,
> It would be even easier as all AML would be dynamically
> generated and DSDT would only contain AML elements for
> a concrete QEMU configuration.

I'd prefer XSDT but I won't nack it if you do it in DSDT.
I think it's not spec compliant but guests do not
seem to care.

> > > So far we've managed keep DSDT compatible with XP while
> > > introducing features from v2 and higher ACPI revisions as
> > > AML that is only evaluated on demand.
> > > We can continue doing so unless we have to unconditionally
> > > add incompatible AML at global scope.
> > >   
> > 
> > Yes.
> > 
> > > >   
> > > > > Michael, Paolo, what do you think about these ideas?
> > > > > 
> > > > > Thanks!  
> > > > 
> > > > 
> > > > 
> > > > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > > > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > > > current offset so we can add that to the linker.
> > > > 
> > > > Won't work if you append the Name to the Aml structure (these can be
> > > > nested to arbitrary depth using aml_append), so using plain GArray for
> > > > this API makes sense to me.
> > > >   
> > > > --->  
> > > > 
> > > > acpi: add build_append_named_dword, returning an offset in buffer
> > > > 
> > > > This is a very limited form of support for runtime patching -
> > > > similar in functionality to what we can do with ACPI_EXTRACT
> > > > macros in python, but implemented in C.
> > > > 
> > > > This is to allow ACPI code direct access to data tables -
> > > > which is exactly what DataTableRegion is there for, except
> > > > no known windows release so far implements DataTableRegion.  
> > > unsupported means Windows will BSOD, so it's practically
> > > unusable unless MS will patch currently existing Windows
> > > versions.  
> > 
> > Yes. That's why my patch allows patching SSDT without using
> > DataTableRegion.
> > 
> > > Another thing about DataTableRegion is that ACPI tables are
> > > supposed to have static content which matches checksum in
> > > table the header while you are trying to use it for dynamic
> > > data. It would be cleaner/more compatible to teach
> > > bios-linker-loader to just allocate memory and patch AML
> > > with the allocated address.  
> > 
> > Yes - if address is static, you need to put it outside
> > the table. Can come right before or right after this.
> > 
> > > Also if OperationRegion() is used, then one has to patch
> > > DefOpRegion directly as RegionOffset must be Integer,
> > > using variable names is not permitted there.  
> > 
> > I am not sure the comment was understood correctly.
> > The comment says really "we can't use DataTableRegion
> > so here is an alternative".
> so how are you going to access data at which patched
> NameString point to?
> for that you'd need a normal patched OperationRegion
> as well since DataTableRegion isn't usable.

For VMGENID you would patch the method that
returns the address - you do not need an op region
as you never access it.

I don't know about NVDIMM. Maybe OperationRegion can
use the patched NameString? Will need some thought.
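
Concretely, with the helper from the patch quoted below (the header
declares it as build_append_named_dword while the definition emits a
qword; whichever name is kept), usage would look roughly like this --
the linker hookup that actually patches the value is not shown:

    int mema_off = build_append_named_dword(ssdt_data, "MEMA");
    /* 'mema_off' is the offset of the 8-byte zero inside NAME(MEMA, 0x0);
     * a bios-linker-loader command would later write the guest-allocated
     * address there.  'ssdt_data' is the GArray for the table being built. */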

> > 
> > >   
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > 
> > > > ---
> > > > 
> > > > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > > > index 1b632dc..f8998ea 100644
> > > > --- a/include/hw/acpi/aml-build.h
> > > > +++ b/include/hw/acpi/aml-build.h
> > > > @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> > > >  void
> > > >  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> > > >  
> > > > +int
> > > > +build_append_named_dword(GArray *array, const char *name_format, ...);
> > > > +
> > > >  #endif
> > > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > > > index 0d4b324..7f9fa65 100644
> > > > --- a/hw/acpi/aml-build.c
> > > > +++ b/hw/acpi/aml-build.c
> > > > @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> > > >      }
> > > >  }
> > > >  
> > > > +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> > > > + * and return the offset to 0x0 for runtime patching.
> > > > + *
> > > > + * Warning: runtime patching is best avoided. Only use this as
> > > > + * a replacement for DataTableRegion (for guests that don't
> > > > + * support it).
> > > > + */
> > > > +int
> > > > +build_append_named_qword(GArray *array, const char *name_format, ...)
> > > > +{
> > > > +    int offset;
> > > > +    va_list ap;
> > > > +
> > > > +    va_start(ap, name_format);
> > > > +    build_append_namestringv(array, name_format, ap);
> > > > +    va_end(ap);
> > > > +
> > > > +    build_append_byte(array, 0x0E); /* QWordPrefix */
> > > > +
> > > > +    offset = array->len;
> > > > +    build_append_int_noprefix(array, 0x0, 8);
> > > > +    assert(array->len == offset + 8);
> > > > +
> > > > +    return offset;
> > > > +}
> > > > +
> > > >  static GPtrArray *alloc_list;
> > > >  
> > > >  static Aml *aml_alloc(void)
> > > > 
> > > >   

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-05 16:43             ` [Qemu-devel] " Michael S. Tsirkin
@ 2016-01-05 17:07               ` Laszlo Ersek
  -1 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-05 17:07 UTC (permalink / raw)
  To: Michael S. Tsirkin, Igor Mammedov
  Cc: Xiao Guangrong, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel

On 01/05/16 17:43, Michael S. Tsirkin wrote:
> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
>>>> bios-linker-loader is a great interface for initializing some
>>>> guest owned data and linking it together but I think it adds
>>>> unnecessary complexity and is misused if it's used to handle
>>>> device owned data/on device memory in this and VMGID cases.  
>>>
>>> I want a generic interface for guest to enumerate these things.  linker
>>> seems quite reasonable but if you see a reason why it won't do, or want
>>> to propose a better interface, fine.
>>>
>>> PCI would do, too - though windows guys had concerns about
>>> returning PCI BARs from ACPI.
>> There were potential issues with pSeries bootloader that treated
>> PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
>> Could you point out to discussion about windows issues?
>>
>> What VMGEN patches that used PCI for mapping purposes were
>> stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
>> class id but we couldn't agree on it.
>>
>> VMGEN v13 with full discussion is here
>> https://patchwork.ozlabs.org/patch/443554/
>> So to continue with this route we would need to pick some other
>> driver less class id so windows won't prompt for driver or
>> maybe supply our own driver stub to guarantee that no one
>> would touch it. Any suggestions?
> 
> Pick any device/vendor id pair for which windows specifies no driver.
> There's a small risk that this will conflict with some
> guest but I think it's minimal.
> 
> 
>>>
>>>
>>>> There was RFC on list to make BIOS boot from NVDIMM already
>>>> doing some ACPI table lookup/parsing. Now if they were forced
>>>> to also parse and execute AML to initialize QEMU with guest
>>>> allocated address that would complicate them quite a bit.  
>>>
>>> If they just need to find a table by name, it won't be
>>> too bad, would it?
>> that's what they were doing scanning memory for static NVDIMM table.
>> However if it were DataTable, BIOS side would have to execute
>> AML so that the table address could be told to QEMU.
> 
> Not at all. You can find any table by its signature without
> parsing AML.
> 
> 
>> In case of direct mapping or PCI BAR there is no need to initialize
>> QEMU side from AML.
>> That also saves us IO port where this address should be written
>> if bios-linker-loader approach is used.
>>
>>>
>>>> While with NVDIMM control memory region mapped directly by QEMU,
>>>> respective patches don't need in any way to initialize QEMU,
>>>> all they would need just read necessary data from control region.
>>>>
>>>> Also using bios-linker-loader takes away some usable RAM
>>>> from guest and in the end that doesn't scale,
>>>> the more devices I add the less usable RAM is left for guest OS
>>>> while all the device needs is a piece of GPA address space
>>>> that would belong to it.  
>>>
>>> I don't get this comment. I don't think it's MMIO that is wanted.
>>> If it's backed by qemu virtual memory then it's RAM.
>> Then why don't allocate video card VRAM the same way and try to explain
>> user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
>> only has 64Mb of available RAM because of we think that on device VRAM
>> is also RAM.
>>
>> Maybe I've used MMIO term wrongly here but it roughly reflects the idea
>> that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
>> area) is not allocated from guest's usable RAM (as described in E820)
>> but rather directly mapped in guest's GPA and doesn't consume available
>> RAM as guest sees it. That's also the way it's done on real hardware.
>>
>> What we need in case of VMGEN ID and NVDIMM is on device memory
>> that could be directly accessed by guest.
>> Both direct mapping or PCI BAR do that job and we could use simple
>> static AML without any patching.
> 
> At least with VMGEN the issue is that there's an AML method
> that returns the physical address.
> Then if guest OS moves the BAR (which is legal), it will break
> since caller has no way to know it's related to the BAR.
> 
> 
>>>>>
>>>>> See patch at the bottom that might be handy.
>>>>>   
>>>>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
>>>>>> | when writing ASL one shall make sure that only XP supported
>>>>>> | features are in global scope, which is evaluated when tables
>>>>>> | are loaded and features of rev2 and higher are inside methods.
>>>>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
>>>>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)  
>>>>>
>>>>> Yes, this technique works.
>>>>>
>>>>> An alternative is to add an XSDT, XP ignores that.
>>>>> XSDT at the moment breaks OVMF (because it loads both
>>>>> the RSDT and the XSDT, which is wrong), but I think
>>>>> Laszlo was working on a fix for that.  
>>>> Using XSDT would increase ACPI tables occupied RAM
>>>> as it would duplicate DSDT + non XP supported AML
>>>> at global namespace.  
>>>
>>> Not at all - I posted patches linking to same
>>> tables from both RSDT and XSDT at some point.
>>> Only the list of pointers would be different.
>> if you put XP incompatible AML in separate SSDT and link it
>> only from XSDT than that would work but if incompatibility
>> is in DSDT, one would have to provide compat DSDT for RSDT
>> an incompat DSDT for XSDT.
> 
> So don't do this.
> 
>> So far policy was don't try to run guest OS on QEMU
>> configuration that isn't supported by it.
> 
> It's better if guests don't see some features but
> don't crash. It's not always possible of course but
> we should try to avoid this.
> 
>> For example we use VAR_PACKAGE when running with more
>> than 255 VCPUs (commit b4f4d5481) which BSODs XP.
> 
> Yes. And it's because we violate the spec, DSDT
> should not have this stuff.
> 
>> So we can continue with that policy with out resorting to
>> using both RSDT and XSDT,
>> It would be even easier as all AML would be dynamically
>> generated and DSDT would only contain AML elements for
>> a concrete QEMU configuration.
> 
> I'd prefer XSDT but I won't nack it if you do it in DSDT.
> I think it's not spec compliant but guests do not
> seem to care.
> 
>>>> So far we've managed keep DSDT compatible with XP while
>>>> introducing features from v2 and higher ACPI revisions as
>>>> AML that is only evaluated on demand.
>>>> We can continue doing so unless we have to unconditionally
>>>> add incompatible AML at global scope.
>>>>   
>>>
>>> Yes.
>>>
>>>>>   
>>>>>> Michael, Paolo, what do you think about these ideas?
>>>>>>
>>>>>> Thanks!  
>>>>>
>>>>>
>>>>>
>>>>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
>>>>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
>>>>> current offset so we can add that to the linker.
>>>>>
>>>>> Won't work if you append the Name to the Aml structure (these can be
>>>>> nested to arbitrary depth using aml_append), so using plain GArray for
>>>>> this API makes sense to me.
>>>>>   
>>>>> --->  
>>>>>
>>>>> acpi: add build_append_named_dword, returning an offset in buffer
>>>>>
>>>>> This is a very limited form of support for runtime patching -
>>>>> similar in functionality to what we can do with ACPI_EXTRACT
>>>>> macros in python, but implemented in C.
>>>>>
>>>>> This is to allow ACPI code direct access to data tables -
>>>>> which is exactly what DataTableRegion is there for, except
>>>>> no known windows release so far implements DataTableRegion.  
>>>> unsupported means Windows will BSOD, so it's practically
>>>> unusable unless MS will patch currently existing Windows
>>>> versions.  
>>>
>>> Yes. That's why my patch allows patching SSDT without using
>>> DataTableRegion.
>>>
>>>> Another thing about DataTableRegion is that ACPI tables are
>>>> supposed to have static content which matches checksum in
>>>> table the header while you are trying to use it for dynamic
>>>> data. It would be cleaner/more compatible to teach
>>>> bios-linker-loader to just allocate memory and patch AML
>>>> with the allocated address.  
>>>
>>> Yes - if address is static, you need to put it outside
>>> the table. Can come right before or right after this.
>>>
>>>> Also if OperationRegion() is used, then one has to patch
>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>> using variable names is not permitted there.  
>>>
>>> I am not sure the comment was understood correctly.
>>> The comment says really "we can't use DataTableRegion
>>> so here is an alternative".
>> so how are you going to access data at which patched
>> NameString point to?
>> for that you'd need a normal patched OperationRegion
>> as well since DataTableRegion isn't usable.
> 
> For VMGENID you would patch the method that
> returns the address - you do not need an op region
> as you never access it.
> 
> I don't know about NVDIMM. Maybe OperationRegion can
> use the patched NameString? Will need some thought.

Xiao Guangrong has patches on the list that already solve this.

  [Qemu-devel] [PATCH 0/6] NVDIMM ACPI: introduce the framework of QEMU
                           emulated DSM

  http://thread.gmane.org/gmane.comp.emulators.kvm.devel/145138

I very briefly skimmed that series.

(Side note: I sort of dislike that with the approach seen in that
series, nvdimm and vmgenid would *both* have to have their own ioports
for telling QEMU about the guest-allocated address. See the rough
GET_ALLOCATION_ADDRESS idea in my earlier post in this thread for one
way to generalize this.)

In any case, in order to stay on topic, AFAICS in patch 3/6, Xiao
Guangrong creates a method called "MEMA". That method consists of a
single return statement that returns a 64-bit integer constant. This
returned constant is patched by the linker/loader.

Then in patch 5/6, there seems to be another method (named "NCAL"?) that
calls MEMA, then uses MEMA's return value to dynamically create the NRAM
operation region, apparently scoped to the NCAL method.

This is possible because the <RegionOffset> symbol (from the expansion
of <DefOpRegion>) is "TermArg => Integer". Patch 4/6 modifies
aml_operation_region() so that it exposes this capability.

... Again, this is just from a superficial skimming; it would have
helped quite a bit if Xiao Guangrong had appended a decompiled ACPI dump
to the 0/6 blurb (or even a documentation file).
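
Put differently, the construct would presumably be built along these
lines (illustrative only: it assumes aml_operation_region() extended to
take an Aml* offset, as patch 4/6 is described to do, and the current
two-argument aml_method(); names and sizes are placeholders):

    Aml *method = aml_method("NCAL", 1);
    Aml *addr = aml_name("MEMA");   /* the patched 64-bit value; a Method
                                     * returning it, in the series */

    /* RegionOffset is a TermArg, so it can refer to MEMA instead of a
     * hard-coded integer */
    aml_append(method,
               aml_operation_region("NRAM", AML_SYSTEM_MEMORY, addr, 4096));
    /* ... fields over NRAM and the actual _DSM handling would follow ... */
    aml_append(ssdt, method);

i.e. the region offset is whatever address the loader patched into
MEMA, evaluated when the method runs.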

Thanks
Laszlo

> 
>>>
>>>>   
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
>>>>> index 1b632dc..f8998ea 100644
>>>>> --- a/include/hw/acpi/aml-build.h
>>>>> +++ b/include/hw/acpi/aml-build.h
>>>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>>>>>  void
>>>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>>>>>  
>>>>> +int
>>>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
>>>>> +
>>>>>  #endif
>>>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
>>>>> index 0d4b324..7f9fa65 100644
>>>>> --- a/hw/acpi/aml-build.c
>>>>> +++ b/hw/acpi/aml-build.c
>>>>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
>>>>>      }
>>>>>  }
>>>>>  
>>>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
>>>>> + * and return the offset to 0x0 for runtime patching.
>>>>> + *
>>>>> + * Warning: runtime patching is best avoided. Only use this as
>>>>> + * a replacement for DataTableRegion (for guests that don't
>>>>> + * support it).
>>>>> + */
>>>>> +int
>>>>> +build_append_named_qword(GArray *array, const char *name_format, ...)
>>>>> +{
>>>>> +    int offset;
>>>>> +    va_list ap;
>>>>> +
>>>>> +    va_start(ap, name_format);
>>>>> +    build_append_namestringv(array, name_format, ap);
>>>>> +    va_end(ap);
>>>>> +
>>>>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
>>>>> +
>>>>> +    offset = array->len;
>>>>> +    build_append_int_noprefix(array, 0x0, 8);
>>>>> +    assert(array->len == offset + 8);
>>>>> +
>>>>> +    return offset;
>>>>> +}
>>>>> +
>>>>>  static GPtrArray *alloc_list;
>>>>>  
>>>>>  static Aml *aml_alloc(void)
>>>>>
>>>>>   


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] How to reserve guest physical region for ACPI
@ 2016-01-05 17:07               ` Laszlo Ersek
  0 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-05 17:07 UTC (permalink / raw)
  To: Michael S. Tsirkin, Igor Mammedov
  Cc: Xiao Guangrong, ehabkost, kvm, gleb, mtosatti, qemu-devel,
	stefanha, pbonzini, dan.j.williams, rth

On 01/05/16 17:43, Michael S. Tsirkin wrote:
> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
>>>> bios-linker-loader is a great interface for initializing some
>>>> guest owned data and linking it together but I think it adds
>>>> unnecessary complexity and is misused if it's used to handle
>>>> device owned data/on device memory in this and VMGID cases.  
>>>
>>> I want a generic interface for guest to enumerate these things.  linker
>>> seems quite reasonable but if you see a reason why it won't do, or want
>>> to propose a better interface, fine.
>>>
>>> PCI would do, too - though windows guys had concerns about
>>> returning PCI BARs from ACPI.
>> There were potential issues with pSeries bootloader that treated
>> PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
>> Could you point out to discussion about windows issues?
>>
>> What VMGEN patches that used PCI for mapping purposes were
>> stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
>> class id but we couldn't agree on it.
>>
>> VMGEN v13 with full discussion is here
>> https://patchwork.ozlabs.org/patch/443554/
>> So to continue with this route we would need to pick some other
>> driver less class id so windows won't prompt for driver or
>> maybe supply our own driver stub to guarantee that no one
>> would touch it. Any suggestions?
> 
> Pick any device/vendor id pair for which windows specifies no driver.
> There's a small risk that this will conflict with some
> guest but I think it's minimal.
> 
> 
>>>
>>>
>>>> There was RFC on list to make BIOS boot from NVDIMM already
>>>> doing some ACPI table lookup/parsing. Now if they were forced
>>>> to also parse and execute AML to initialize QEMU with guest
>>>> allocated address that would complicate them quite a bit.  
>>>
>>> If they just need to find a table by name, it won't be
>>> too bad, would it?
>> that's what they were doing scanning memory for static NVDIMM table.
>> However if it were DataTable, BIOS side would have to execute
>> AML so that the table address could be told to QEMU.
> 
> Not at all. You can find any table by its signature without
> parsing AML.
> 
> 
>> In case of direct mapping or PCI BAR there is no need to initialize
>> QEMU side from AML.
>> That also saves us IO port where this address should be written
>> if bios-linker-loader approach is used.
>>
>>>
>>>> While with NVDIMM control memory region mapped directly by QEMU,
>>>> respective patches don't need in any way to initialize QEMU,
>>>> all they would need just read necessary data from control region.
>>>>
>>>> Also using bios-linker-loader takes away some usable RAM
>>>> from guest and in the end that doesn't scale,
>>>> the more devices I add the less usable RAM is left for guest OS
>>>> while all the device needs is a piece of GPA address space
>>>> that would belong to it.  
>>>
>>> I don't get this comment. I don't think it's MMIO that is wanted.
>>> If it's backed by qemu virtual memory then it's RAM.
>> Then why don't allocate video card VRAM the same way and try to explain
>> user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
>> only has 64Mb of available RAM because of we think that on device VRAM
>> is also RAM.
>>
>> Maybe I've used MMIO term wrongly here but it roughly reflects the idea
>> that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
>> area) is not allocated from guest's usable RAM (as described in E820)
>> but rather directly mapped in guest's GPA and doesn't consume available
>> RAM as guest sees it. That's also the way it's done on real hardware.
>>
>> What we need in case of VMGEN ID and NVDIMM is on device memory
>> that could be directly accessed by guest.
>> Both direct mapping or PCI BAR do that job and we could use simple
>> static AML without any patching.
> 
> At least with VMGEN the issue is that there's an AML method
> that returns the physical address.
> Then if guest OS moves the BAR (which is legal), it will break
> since caller has no way to know it's related to the BAR.
> 
> 
>>>>>
>>>>> See patch at the bottom that might be handy.
>>>>>   
>>>>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
>>>>>> | when writing ASL one shall make sure that only XP supported
>>>>>> | features are in global scope, which is evaluated when tables
>>>>>> | are loaded and features of rev2 and higher are inside methods.
>>>>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
>>>>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)  
>>>>>
>>>>> Yes, this technique works.
>>>>>
>>>>> An alternative is to add an XSDT, XP ignores that.
>>>>> XSDT at the moment breaks OVMF (because it loads both
>>>>> the RSDT and the XSDT, which is wrong), but I think
>>>>> Laszlo was working on a fix for that.  
>>>> Using XSDT would increase ACPI tables occupied RAM
>>>> as it would duplicate DSDT + non XP supported AML
>>>> at global namespace.  
>>>
>>> Not at all - I posted patches linking to same
>>> tables from both RSDT and XSDT at some point.
>>> Only the list of pointers would be different.
>> if you put XP incompatible AML in separate SSDT and link it
>> only from XSDT than that would work but if incompatibility
>> is in DSDT, one would have to provide compat DSDT for RSDT
>> an incompat DSDT for XSDT.
> 
> So don't do this.
> 
>> So far policy was don't try to run guest OS on QEMU
>> configuration that isn't supported by it.
> 
> It's better if guests don't see some features but
> don't crash. It's not always possible of course but
> we should try to avoid this.
> 
>> For example we use VAR_PACKAGE when running with more
>> than 255 VCPUs (commit b4f4d5481) which BSODs XP.
> 
> Yes. And it's because we violate the spec, DSDT
> should not have this stuff.
> 
>> So we can continue with that policy with out resorting to
>> using both RSDT and XSDT,
>> It would be even easier as all AML would be dynamically
>> generated and DSDT would only contain AML elements for
>> a concrete QEMU configuration.
> 
> I'd prefer XSDT but I won't nack it if you do it in DSDT.
> I think it's not spec compliant but guests do not
> seem to care.
> 
>>>> So far we've managed keep DSDT compatible with XP while
>>>> introducing features from v2 and higher ACPI revisions as
>>>> AML that is only evaluated on demand.
>>>> We can continue doing so unless we have to unconditionally
>>>> add incompatible AML at global scope.
>>>>   
>>>
>>> Yes.
>>>
>>>>>   
>>>>>> Michael, Paolo, what do you think about these ideas?
>>>>>>
>>>>>> Thanks!  
>>>>>
>>>>>
>>>>>
>>>>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
>>>>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
>>>>> current offset so we can add that to the linker.
>>>>>
>>>>> Won't work if you append the Name to the Aml structure (these can be
>>>>> nested to arbitrary depth using aml_append), so using plain GArray for
>>>>> this API makes sense to me.
>>>>>   
>>>>> --->  
>>>>>
>>>>> acpi: add build_append_named_dword, returning an offset in buffer
>>>>>
>>>>> This is a very limited form of support for runtime patching -
>>>>> similar in functionality to what we can do with ACPI_EXTRACT
>>>>> macros in python, but implemented in C.
>>>>>
>>>>> This is to allow ACPI code direct access to data tables -
>>>>> which is exactly what DataTableRegion is there for, except
>>>>> no known windows release so far implements DataTableRegion.  
>>>> unsupported means Windows will BSOD, so it's practically
>>>> unusable unless MS will patch currently existing Windows
>>>> versions.  
>>>
>>> Yes. That's why my patch allows patching SSDT without using
>>> DataTableRegion.
>>>
>>>> Another thing about DataTableRegion is that ACPI tables are
>>>> supposed to have static content which matches checksum in
>>>> table the header while you are trying to use it for dynamic
>>>> data. It would be cleaner/more compatible to teach
>>>> bios-linker-loader to just allocate memory and patch AML
>>>> with the allocated address.  
>>>
>>> Yes - if address is static, you need to put it outside
>>> the table. Can come right before or right after this.
>>>
>>>> Also if OperationRegion() is used, then one has to patch
>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>> using variable names is not permitted there.  
>>>
>>> I am not sure the comment was understood correctly.
>>> The comment says really "we can't use DataTableRegion
>>> so here is an alternative".
>> so how are you going to access data at which patched
>> NameString point to?
>> for that you'd need a normal patched OperationRegion
>> as well since DataTableRegion isn't usable.
> 
> For VMGENID you would patch the method that
> returns the address - you do not need an op region
> as you never access it.
> 
> I don't know about NVDIMM. Maybe OperationRegion can
> use the patched NameString? Will need some thought.

Xiao Guangrong has patches on the list that already solve this.

  [Qemu-devel] [PATCH 0/6] NVDIMM ACPI: introduce the framework of QEMU
                           emulated DSM

  http://thread.gmane.org/gmane.comp.emulators.kvm.devel/145138

I very briefly skimmed that series.

(Side note: I sort of dislike that with the approach seen in that
series, nvdimm and vmgenid would *both* have to have their own ioports
for telling QEMU about the guest-allocated address. See the rough
GET_ALLOCATION_ADDRESS idea in my earlier post in this thread for one
way to generalize this.)

In any case, in order to stay on topic, AFAICS in patch 3/6, Xiao
Guangrong creates a method called "MEMA". That method consists of a
single return statement that returns a 64-bit integer constant. This
returned constant is patched by the linker/loader.

Then in patch 5/6, there seems to be another method (named "NCAL"?) that
calls MEMA, then uses MEMA's return value to dynamically create the NRAM
operation region, apparently scoped to the NCAL method.

This is possible because the <RegionOffset> symbol (from the expansion
of <DefOpRegion>) is "TermArg => Integer". Patch 4/6 modifies
aml_operation_region() so that it exposes this capability.

... Again, this is just from a superficial skimming; it would have
helped quite a bit if Xiao Guangrong had appended a decompiled ACPI dump
to the 0/6 blurb (or even a documentation file).
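My best guess, expressed with the aml_* helpers rather than as decompiled
ASL, is sketched below. This is purely illustrative -- it is not code from
the series; only the MEMA / NCAL / NRAM names come from the patches, the
helper name and everything else is made up by me:

    static void build_mema_ncal_sketch(Aml *ssdt)
    {
        Aml *mema, *ncal, *dsm_mem;

        /* patch 3/6 (as I read it): a method whose single statement
         * returns a 64-bit constant; that constant is the placeholder
         * the linker/loader patches with the allocated address.  Note
         * that going through the Aml API, as done here, does not hand
         * you the byte offset of the constant -- tracking that offset
         * is what the plain-GArray helper discussed earlier is for. */
        mema = aml_method("MEMA", 0);
        aml_append(mema, aml_return(aml_int(0x0)));
        aml_append(ssdt, mema);

        /* patch 5/6 (as I read it): NCAL evaluates MEMA and uses the
         * returned integer as the RegionOffset of a method-scoped
         * operation region. */
        ncal = aml_method("NCAL", 0);
        dsm_mem = aml_local(0);
        aml_append(ncal, aml_store(aml_call0("MEMA"), dsm_mem));
        aml_append(ncal, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
                                              dsm_mem, TARGET_PAGE_SIZE));
        /* ... field declarations and the actual DSM dispatch follow
         * in the real series ... */
        aml_append(ssdt, ncal);
    }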

Thanks
Laszlo

> 
>>>
>>>>   
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
>>>>> index 1b632dc..f8998ea 100644
>>>>> --- a/include/hw/acpi/aml-build.h
>>>>> +++ b/include/hw/acpi/aml-build.h
>>>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>>>>>  void
>>>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>>>>>  
>>>>> +int
>>>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
>>>>> +
>>>>>  #endif
>>>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
>>>>> index 0d4b324..7f9fa65 100644
>>>>> --- a/hw/acpi/aml-build.c
>>>>> +++ b/hw/acpi/aml-build.c
>>>>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
>>>>>      }
>>>>>  }
>>>>>  
>>>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
>>>>> + * and return the offset to 0x0 for runtime patching.
>>>>> + *
>>>>> + * Warning: runtime patching is best avoided. Only use this as
>>>>> + * a replacement for DataTableRegion (for guests that don't
>>>>> + * support it).
>>>>> + */
>>>>> +int
>>>>> +build_append_named_qword(GArray *array, const char *name_format, ...)
>>>>> +{
>>>>> +    int offset;
>>>>> +    va_list ap;
>>>>> +
>>>>> +    va_start(ap, name_format);
>>>>> +    build_append_namestringv(array, name_format, ap);
>>>>> +    va_end(ap);
>>>>> +
>>>>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
>>>>> +
>>>>> +    offset = array->len;
>>>>> +    build_append_int_noprefix(array, 0x0, 8);
>>>>> +    assert(array->len == offset + 8);
>>>>> +
>>>>> +    return offset;
>>>>> +}
>>>>> +
>>>>>  static GPtrArray *alloc_list;
>>>>>  
>>>>>  static Aml *aml_alloc(void)
>>>>>
>>>>>   

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-05 16:43             ` [Qemu-devel] " Michael S. Tsirkin
@ 2016-01-05 17:07               ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2016-01-05 17:07 UTC (permalink / raw)
  To: Michael S. Tsirkin, Igor Mammedov
  Cc: pbonzini, gleb, mtosatti, stefanha, rth, ehabkost,
	dan.j.williams, kvm, qemu-devel, Laszlo Ersek



On 01/06/2016 12:43 AM, Michael S. Tsirkin wrote:

>>> Yes - if address is static, you need to put it outside
>>> the table. Can come right before or right after this.
>>>
>>>> Also if OperationRegion() is used, then one has to patch
>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>> using variable names is not permitted there.
>>>
>>> I am not sure the comment was understood correctly.
>>> The comment says really "we can't use DataTableRegion
>>> so here is an alternative".
>> so how are you going to access data at which patched
>> NameString point to?
>> for that you'd need a normal patched OperationRegion
>> as well since DataTableRegion isn't usable.
>
> For VMGENID you would patch the method that
> returns the address - you do not need an op region
> as you never access it.
>
> I don't know about NVDIMM. Maybe OperationRegion can
> use the patched NameString? Will need some thought.

The ACPI spec says that the offsetTerm in OperationRegion
is evaluated as an Integer, so a named object is allowed to be
used in OperationRegion; that is exactly what my patchset
is doing (http://marc.info/?l=kvm&m=145193395624537&w=2):

+    dsm_mem = aml_arg(3);
+    aml_append(method, aml_store(aml_call0(NVDIMM_GET_DSM_MEM), dsm_mem));

+    aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
+                                            dsm_mem, TARGET_PAGE_SIZE));

We hide the int64 object, which is patched by the BIOS, in the method
NVDIMM_GET_DSM_MEM, to make Windows XP happy.

However, the disadvantages I see are:
a) as Igor pointed out, we need a way to tell QEMU what the patched
    address is; in NVDIMM ACPI, we use a 64-bit IO port to pass the address
    to QEMU.

b) BIOS-allocated memory is RAM-based, so it prevents us from using MMIO
    in ACPI; MMIO is a more scalable resource than an IO port, as it covers
    a larger region and supports 64-bit operations.
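To illustrate (a): on the QEMU side this is essentially just an IO region
whose write handler records the address the guest hands over. A much
simplified sketch -- the names, the missing error handling and the port
number 0x0a18 below are illustrative, not the exact code from the series:

    static uint64_t dsm_mem_addr;          /* address patched by the BIOS */
    static MemoryRegion dsm_addr_io_mr;

    static void dsm_addr_write(void *opaque, hwaddr addr,
                               uint64_t val, unsigned size)
    {
        /* the guest's AML stores the patched buffer address here */
        dsm_mem_addr = val;
    }

    static const MemoryRegionOps dsm_addr_ops = {
        .write = dsm_addr_write,
        .valid.min_access_size = 8,
        .valid.max_access_size = 8,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    /* called once from machine/device init code: */
    static void dsm_addr_io_init(void)
    {
        memory_region_init_io(&dsm_addr_io_mr, NULL, &dsm_addr_ops,
                              NULL, "nvdimm-dsm-addr", 8);
        memory_region_add_subregion(get_system_io(), 0x0a18,
                                    &dsm_addr_io_mr);
    }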

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-04 20:17           ` [Qemu-devel] " Laszlo Ersek
@ 2016-01-05 17:08             ` Igor Mammedov
  -1 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2016-01-05 17:08 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Michael S. Tsirkin, Xiao Guangrong, pbonzini, gleb, mtosatti,
	stefanha, rth, ehabkost, dan.j.williams, kvm, qemu-devel

On Mon, 4 Jan 2016 21:17:31 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> Michael CC'd me on the grandparent of the email below. I'll try to add
> my thoughts in a single go, with regard to OVMF.
> 
> On 12/30/15 20:52, Michael S. Tsirkin wrote:
> > On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
> >> On Mon, 28 Dec 2015 14:50:15 +0200
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>  
> >>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
> >>>>
> >>>> Hi Michael, Paolo,
> >>>>
> >>>> Now it is the time to return to the challenge that how to reserve guest
> >>>> physical region internally used by ACPI.
> >>>>
> >>>> Igor suggested that:
> >>>> | An alternative place to allocate reserve from could be high memory.
> >>>> | For pc we have "reserved-memory-end" which currently makes sure
> >>>> | that hotpluggable memory range isn't used by firmware
> >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)  
> 
> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
> reason is that nobody wrote that patch, nor asked for the patch to be
> written. (Not implying that just requesting the patch would be
> sufficient for the patch to be written.)
> 
> >>> I don't want to tie things to reserved-memory-end because this
> >>> does not scale: next time we need to reserve memory,
> >>> we'll need to find yet another way to figure out what is where.  
> >> Could you elaborate a bit more on a problem you're seeing?
> >>
> >> To me it looks like it scales rather well.
> >> For example lets imagine that we adding a device
> >> that has some on device memory that should be mapped into GPA
> >> code to do so would look like:
> >>
> >>   pc_machine_device_plug_cb(dev)
> >>   {
> >>    ...
> >>    if (dev == OUR_NEW_DEVICE_TYPE) {
> >>        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
> >>        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
> >>    }
> >>   }
> >>
> >> we can practically add any number of new devices that way.  
> > 
> > Yes but we'll have to build a host side allocator for these, and that's
> > nasty. We'll also have to maintain these addresses indefinitely (at
> > least per machine version) as they are guest visible.
> > Not only that, there's no way for guest to know if we move things
> > around, so basically we'll never be able to change addresses.
> > 
> >   
> >>    
> >>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> >>> support 64 bit RAM instead  
> 
> This looks quite doable in OVMF, as long as the blob to allocate from
> high memory contains *zero* ACPI tables.
> 
> (
> Namely, each ACPI table is installed from the containing fw_cfg blob
> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
> own allocation policy for the *copies* of ACPI tables it installs.
> 
> This allocation policy is left unspecified in the section of the UEFI
> spec that governs EFI_ACPI_TABLE_PROTOCOL.
> 
> The current policy in edk2 (= the reference implementation) seems to be
> "allocate from under 4GB". It is currently being changed to "try to
> allocate from under 4GB, and if that fails, retry from high memory". (It
> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
> )
> 
> >>> (and maybe a way to allocate and
> >>> zero-initialize buffer without loading it through fwcfg),  
> 
> Sounds reasonable.
> 
> >>> this way bios
> >>> does the allocation, and addresses can be patched into acpi.  
> >> and then guest side needs to parse/execute some AML that would
> >> initialize QEMU side so it would know where to write data.  
> > 
> > Well not really - we can put it in a data table, by itself
> > so it's easy to find.  
> 
> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
> acpi_get_table_with_size()?
> 
> > 
> > AML is only needed if access from ACPI is desired.
> > 
> >   
> >> bios-linker-loader is a great interface for initializing some
> >> guest owned data and linking it together but I think it adds
> >> unnecessary complexity and is misused if it's used to handle
> >> device owned data/on device memory in this and VMGID cases.  
> > 
> > I want a generic interface for guest to enumerate these things.  linker
> > seems quite reasonable but if you see a reason why it won't do, or want
> > to propose a better interface, fine.  
> 
> * The guest could do the following:
> - while processing the ALLOCATE commands, it would make a note where in
> GPA space each fw_cfg blob gets allocated
> - at the end the guest would prepare a temporary array with a predefined
> record format, that associates each fw_cfg blob's name with the concrete
> allocation address
> - it would create an FWCfgDmaAccess stucture pointing at this array,
> with a new "control" bit set (or something similar)
> - the guest could write the address of the FWCfgDmaAccess struct to the
> appropriate register, as always.
> 
> * Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command,
> specifying:
> - the fw_cfg blob's name, for which to retrieve the guest-allocated
>   address (this command could only follow the matching ALLOCATE
>   command, never precede it)
> - a flag whether the address should be written to IO or MMIO space
>   (would be likely IO on x86, MMIO on ARM)
> - a unique uint64_t key (could be the 16-bit fw_cfg selector value that
>   identifies the blob, actually!)
> - a uint64_t (IO or MMIO) address to write the unique key and then the
>   allocation address to.
> 
> Either way, QEMU could learn about all the relevant guest-side
> allocation addresses in a low number of traps. In addition, AML code
> wouldn't have to reflect any allocation addresses to QEMU, ever.
That would be a nice trick. I see 2 issues here:
 1. The ACPI tables blob is built atomically when a guest tries to read it
    from fw_cfg, so the patched addresses would have to be communicated
    to QEMU before that.
 2. More importantly, I think we are misusing the linker-loader
    interface here, trying to allocate a buffer in guest RAM
    and thus consuming it, while all we need is a window into device
    memory mapped somewhere outside of the RAM-occupied address space.

> 
> > 
> > PCI would do, too - though windows guys had concerns about
> > returning PCI BARs from ACPI.
> > 
> >   
> >> There was RFC on list to make BIOS boot from NVDIMM already
> >> doing some ACPI table lookup/parsing. Now if they were forced
> >> to also parse and execute AML to initialize QEMU with guest
> >> allocated address that would complicate them quite a bit.  
> > 
> > If they just need to find a table by name, it won't be
> > too bad, would it?
> >   
> >> While with NVDIMM control memory region mapped directly by QEMU,
> >> respective patches don't need in any way to initialize QEMU,
> >> all they would need just read necessary data from control region.
> >>
> >> Also using bios-linker-loader takes away some usable RAM
> >> from guest and in the end that doesn't scale,
> >> the more devices I add the less usable RAM is left for guest OS
> >> while all the device needs is a piece of GPA address space
> >> that would belong to it.  
> > 
> > I don't get this comment. I don't think it's MMIO that is wanted.
> > If it's backed by qemu virtual memory then it's RAM.
> >   
> >>>
> >>> See patch at the bottom that might be handy.  
> 
> I've given up on Microsoft implementing DataTableRegion. (It's sad, really.)
> 
> From last year I have a WIP version of "docs/vmgenid.txt" that is based
> on Michael's build_append_named_dword() function. If
> GET_ALLOCATION_ADDRESS above looks good, then I could simplify the ACPI
> stuff in that text file (and hopefully post it soon after for comments?)
> 
> >>>  
> >>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> >>>> | when writing ASL one shall make sure that only XP supported
> >>>> | features are in global scope, which is evaluated when tables
> >>>> | are loaded and features of rev2 and higher are inside methods.
> >>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
> >>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
> >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)  
> >>>
> >>> Yes, this technique works.  
> 
> Agreed.
> 
> >>>
> >>> An alternative is to add an XSDT, XP ignores that.
> >>> XSDT at the moment breaks OVMF (because it loads both
> >>> the RSDT and the XSDT, which is wrong), but I think
> >>> Laszlo was working on a fix for that.  
> 
> We have to distinguish two use cases here.
> 
> * The first is the case when QEMU prepares both an XSDT and an RSDT, and
> links at least one common ACPI table from both. This would cause OVMF to
> pass the same source (= to-be-copied) table to
> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() twice, with one of the
> following outcomes:
> 
> - there would be two instances of the same table (think e.g. SSDT)
> - the second attempt would be rejected (e.g. FADT) and that error would
>   terminate the linker-loader procedure.
> 
> This issue would not be too hard to overcome, with a simple "memoization
> technique". After the initial loading & linking of the tables, OVMF
> could remember the addresses of the "source" ACPI tables, and could
> avoid passing already installed source tables to
> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() for a second time.
> 
> * The second use case is when an ACPI table is linked *only* from QEMU's
> XSDT. This is much harder to fix, because
> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() in edk2 links the copy of the
> passed-in table into *both* RSDT and XSDT, automatically. And, again,
> the UEFI spec doesn't provide a way to control this from the caller
> (i.e. from within OVMF).
> 
> I have tried earlier to effect a change in the specification of
> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), on the ASWG and USWG mailing
> lists. (At that time I was trying to expose UEFI memory *type* to the
> caller, from which the copy of the ACPI table being installed should be
> allocated from.) Alas, I received no answers at all.
> 
> All in all I strongly recommend the "place rev2+ objects in method
> scope" trick, over the "link it from the XSDT only" trick.
> 
> >> Using XSDT would increase ACPI tables occupied RAM
> >> as it would duplicate DSDT + non XP supported AML
> >> at global namespace.  
> > 
> > Not at all - I posted patches linking to same
> > tables from both RSDT and XSDT at some point.  
> 
> Yes, at <http://thread.gmane.org/gmane.comp.emulators.qemu/342559>. This
> could be made work in OVMF with the above mentioned memoization stuff.
> 
> > Only the list of pointers would be different.  
> 
> I don't recommend that, see the second case above.
> 
> Thanks
> Laszlo
> 
> >> So far we've managed keep DSDT compatible with XP while
> >> introducing features from v2 and higher ACPI revisions as
> >> AML that is only evaluated on demand.
> >> We can continue doing so unless we have to unconditionally
> >> add incompatible AML at global scope.
> >>  
> > 
> > Yes.
> >   
> >>>  
> >>>> Michael, Paolo, what do you think about these ideas?
> >>>>
> >>>> Thanks!  
> >>>
> >>>
> >>>
> >>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> >>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> >>> current offset so we can add that to the linker.
> >>>
> >>> Won't work if you append the Name to the Aml structure (these can be
> >>> nested to arbitrary depth using aml_append), so using plain GArray for
> >>> this API makes sense to me.
> >>>  
> >>> --->  
> >>>
> >>> acpi: add build_append_named_dword, returning an offset in buffer
> >>>
> >>> This is a very limited form of support for runtime patching -
> >>> similar in functionality to what we can do with ACPI_EXTRACT
> >>> macros in python, but implemented in C.
> >>>
> >>> This is to allow ACPI code direct access to data tables -
> >>> which is exactly what DataTableRegion is there for, except
> >>> no known windows release so far implements DataTableRegion.  
> >> unsupported means Windows will BSOD, so it's practically
> >> unusable unless MS will patch currently existing Windows
> >> versions.  
> > 
> > Yes. That's why my patch allows patching SSDT without using
> > DataTableRegion.
> >   
> >> Another thing about DataTableRegion is that ACPI tables are
> >> supposed to have static content which matches checksum in
> >> table the header while you are trying to use it for dynamic
> >> data. It would be cleaner/more compatible to teach
> >> bios-linker-loader to just allocate memory and patch AML
> >> with the allocated address.  
> > 
> > Yes - if address is static, you need to put it outside
> > the table. Can come right before or right after this.
> >   
> >> Also if OperationRegion() is used, then one has to patch
> >> DefOpRegion directly as RegionOffset must be Integer,
> >> using variable names is not permitted there.  
> > 
> > I am not sure the comment was understood correctly.
> > The comment says really "we can't use DataTableRegion
> > so here is an alternative".
> >   
> >>  
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>
> >>> ---
> >>>
> >>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> >>> index 1b632dc..f8998ea 100644
> >>> --- a/include/hw/acpi/aml-build.h
> >>> +++ b/include/hw/acpi/aml-build.h
> >>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> >>>  void
> >>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> >>>  
> >>> +int
> >>> +build_append_named_dword(GArray *array, const char *name_format, ...);
> >>> +
> >>>  #endif
> >>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> >>> index 0d4b324..7f9fa65 100644
> >>> --- a/hw/acpi/aml-build.c
> >>> +++ b/hw/acpi/aml-build.c
> >>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> >>>      }
> >>>  }
> >>>  
> >>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> >>> + * and return the offset to 0x0 for runtime patching.
> >>> + *
> >>> + * Warning: runtime patching is best avoided. Only use this as
> >>> + * a replacement for DataTableRegion (for guests that don't
> >>> + * support it).
> >>> + */
> >>> +int
> >>> +build_append_named_qword(GArray *array, const char *name_format, ...)
> >>> +{
> >>> +    int offset;
> >>> +    va_list ap;
> >>> +
> >>> +    va_start(ap, name_format);
> >>> +    build_append_namestringv(array, name_format, ap);
> >>> +    va_end(ap);
> >>> +
> >>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
> >>> +
> >>> +    offset = array->len;
> >>> +    build_append_int_noprefix(array, 0x0, 8);
> >>> +    assert(array->len == offset + 8);
> >>> +
> >>> +    return offset;
> >>> +}
> >>> +
> >>>  static GPtrArray *alloc_list;
> >>>  
> >>>  static Aml *aml_alloc(void)
> >>>
> >>>  
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-05 17:08             ` [Qemu-devel] " Igor Mammedov
@ 2016-01-05 17:22               ` Laszlo Ersek
  -1 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-05 17:22 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael S. Tsirkin, Xiao Guangrong, pbonzini, gleb, mtosatti,
	stefanha, rth, ehabkost, dan.j.williams, kvm, qemu-devel

On 01/05/16 18:08, Igor Mammedov wrote:
> On Mon, 4 Jan 2016 21:17:31 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> Michael CC'd me on the grandparent of the email below. I'll try to add
>> my thoughts in a single go, with regard to OVMF.
>>
>> On 12/30/15 20:52, Michael S. Tsirkin wrote:
>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>  
>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
>>>>>>
>>>>>> Hi Michael, Paolo,
>>>>>>
>>>>>> Now it is the time to return to the challenge that how to reserve guest
>>>>>> physical region internally used by ACPI.
>>>>>>
>>>>>> Igor suggested that:
>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)  
>>
>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
>> reason is that nobody wrote that patch, nor asked for the patch to be
>> written. (Not implying that just requesting the patch would be
>> sufficient for the patch to be written.)
>>
>>>>> I don't want to tie things to reserved-memory-end because this
>>>>> does not scale: next time we need to reserve memory,
>>>>> we'll need to find yet another way to figure out what is where.  
>>>> Could you elaborate a bit more on a problem you're seeing?
>>>>
>>>> To me it looks like it scales rather well.
>>>> For example lets imagine that we adding a device
>>>> that has some on device memory that should be mapped into GPA
>>>> code to do so would look like:
>>>>
>>>>   pc_machine_device_plug_cb(dev)
>>>>   {
>>>>    ...
>>>>    if (dev == OUR_NEW_DEVICE_TYPE) {
>>>>        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>>>>        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
>>>>    }
>>>>   }
>>>>
>>>> we can practically add any number of new devices that way.  
>>>
>>> Yes but we'll have to build a host side allocator for these, and that's
>>> nasty. We'll also have to maintain these addresses indefinitely (at
>>> least per machine version) as they are guest visible.
>>> Not only that, there's no way for guest to know if we move things
>>> around, so basically we'll never be able to change addresses.
>>>
>>>   
>>>>    
>>>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
>>>>> support 64 bit RAM instead  
>>
>> This looks quite doable in OVMF, as long as the blob to allocate from
>> high memory contains *zero* ACPI tables.
>>
>> (
>> Namely, each ACPI table is installed from the containing fw_cfg blob
>> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
>> own allocation policy for the *copies* of ACPI tables it installs.
>>
>> This allocation policy is left unspecified in the section of the UEFI
>> spec that governs EFI_ACPI_TABLE_PROTOCOL.
>>
>> The current policy in edk2 (= the reference implementation) seems to be
>> "allocate from under 4GB". It is currently being changed to "try to
>> allocate from under 4GB, and if that fails, retry from high memory". (It
>> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
>> )
>>
>>>>> (and maybe a way to allocate and
>>>>> zero-initialize buffer without loading it through fwcfg),  
>>
>> Sounds reasonable.
>>
>>>>> this way bios
>>>>> does the allocation, and addresses can be patched into acpi.  
>>>> and then guest side needs to parse/execute some AML that would
>>>> initialize QEMU side so it would know where to write data.  
>>>
>>> Well not really - we can put it in a data table, by itself
>>> so it's easy to find.  
>>
>> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
>> acpi_get_table_with_size()?
>>
>>>
>>> AML is only needed if access from ACPI is desired.
>>>
>>>   
>>>> bios-linker-loader is a great interface for initializing some
>>>> guest owned data and linking it together but I think it adds
>>>> unnecessary complexity and is misused if it's used to handle
>>>> device owned data/on device memory in this and VMGID cases.  
>>>
>>> I want a generic interface for guest to enumerate these things.  linker
>>> seems quite reasonable but if you see a reason why it won't do, or want
>>> to propose a better interface, fine.  
>>
>> * The guest could do the following:
>> - while processing the ALLOCATE commands, it would make a note where in
>> GPA space each fw_cfg blob gets allocated
>> - at the end the guest would prepare a temporary array with a predefined
>> record format, that associates each fw_cfg blob's name with the concrete
>> allocation address
>> - it would create an FWCfgDmaAccess stucture pointing at this array,
>> with a new "control" bit set (or something similar)
>> - the guest could write the address of the FWCfgDmaAccess struct to the
>> appropriate register, as always.
>>
>> * Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command,
>> specifying:
>> - the fw_cfg blob's name, for which to retrieve the guest-allocated
>>   address (this command could only follow the matching ALLOCATE
>>   command, never precede it)
>> - a flag whether the address should be written to IO or MMIO space
>>   (would be likely IO on x86, MMIO on ARM)
>> - a unique uint64_t key (could be the 16-bit fw_cfg selector value that
>>   identifies the blob, actually!)
>> - a uint64_t (IO or MMIO) address to write the unique key and then the
>>   allocation address to.
>>
>> Either way, QEMU could learn about all the relevant guest-side
>> allocation addresses in a low number of traps. In addition, AML code
>> wouldn't have to reflect any allocation addresses to QEMU, ever.

> That would be nice trick. I see 2 issues here:
>  1. ACPI tables blob is build atomically when one guest tries to read it
>     from fw_cfg so patched addresses have to be communicated
>     to QEMU before that.

I don't understand issue #1. I think it is okay if the allocation
happens strictly after QEMU refreshes / regenerates the ACPI payload.
Namely, the guest-allocated addresses have two uses:
- references from within the ACPI payload
- references from the QEMU side, for device operation.

The first purpose is covered by the linker/loader itself (that is,
GET_ALLOCATION_ADDRESS would be used *in addition* to ADD_POINTER). The
second purpose would be covered by GET_ALLOCATION_ADDRESS.
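
(For concreteness, here is a hypothetical layout for such a
GET_ALLOCATION_ADDRESS command record. Nothing like this exists in
bios-linker-loader.c today; every field name and size below is made up,
only the information content follows the list I gave earlier:

    struct GetAllocationAddressCmd {
        char     file[56];       /* fw_cfg blob name whose guest-allocated
                                    address is requested; the command must
                                    follow that blob's ALLOCATE command */
        uint8_t  address_space;  /* 0: write via IO port, 1: via MMIO */
        uint64_t key;            /* unique key, e.g. the blob's 16-bit
                                    fw_cfg selector value */
        uint64_t register_addr;  /* IO/MMIO address to which the guest
                                    writes the key and then the
                                    allocation address */
    };
)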

>  2. Mo important I think that we are miss-using linker-loader
>     interface here, trying to from allocate buffer in guest RAM
>     an so consuming it while all we need a window into device
>     memory mapped somewhere outside of RAM occupied  address space.

But, more importantly, I definitely see your point with issue #2. I'm
neutral on the question whether this should be solved with the ACPI
linker/loader or with something else. I'm perfectly fine with "something
else", as long as it is generic enough. The above GET_ALLOCATION_ADDRESS
idea is relevant *only if* the ACPI linker/loader is deemed the best
solution here.

(Heck, if the linker/loader avenue is rejected here, that's the least
work for me! :))

Thanks
Laszlo

> 
>>
>>>
>>> PCI would do, too - though windows guys had concerns about
>>> returning PCI BARs from ACPI.
>>>
>>>   
>>>> There was RFC on list to make BIOS boot from NVDIMM already
>>>> doing some ACPI table lookup/parsing. Now if they were forced
>>>> to also parse and execute AML to initialize QEMU with guest
>>>> allocated address that would complicate them quite a bit.  
>>>
>>> If they just need to find a table by name, it won't be
>>> too bad, would it?
>>>   
>>>> While with NVDIMM control memory region mapped directly by QEMU,
>>>> respective patches don't need in any way to initialize QEMU,
>>>> all they would need just read necessary data from control region.
>>>>
>>>> Also using bios-linker-loader takes away some usable RAM
>>>> from guest and in the end that doesn't scale,
>>>> the more devices I add the less usable RAM is left for guest OS
>>>> while all the device needs is a piece of GPA address space
>>>> that would belong to it.  
>>>
>>> I don't get this comment. I don't think it's MMIO that is wanted.
>>> If it's backed by qemu virtual memory then it's RAM.
>>>   
>>>>>
>>>>> See patch at the bottom that might be handy.  
>>
>> I've given up on Microsoft implementing DataTableRegion. (It's sad, really.)
>>
>> From last year I have a WIP version of "docs/vmgenid.txt" that is based
>> on Michael's build_append_named_dword() function. If
>> GET_ALLOCATION_ADDRESS above looks good, then I could simplify the ACPI
>> stuff in that text file (and hopefully post it soon after for comments?)
>>
>>>>>  
>>>>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
>>>>>> | when writing ASL one shall make sure that only XP supported
>>>>>> | features are in global scope, which is evaluated when tables
>>>>>> | are loaded and features of rev2 and higher are inside methods.
>>>>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
>>>>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)  
>>>>>
>>>>> Yes, this technique works.  
>>
>> Agreed.
>>
>>>>>
>>>>> An alternative is to add an XSDT, XP ignores that.
>>>>> XSDT at the moment breaks OVMF (because it loads both
>>>>> the RSDT and the XSDT, which is wrong), but I think
>>>>> Laszlo was working on a fix for that.  
>>
>> We have to distinguish two use cases here.
>>
>> * The first is the case when QEMU prepares both an XSDT and an RSDT, and
>> links at least one common ACPI table from both. This would cause OVMF to
>> pass the same source (= to-be-copied) table to
>> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() twice, with one of the
>> following outcomes:
>>
>> - there would be two instances of the same table (think e.g. SSDT)
>> - the second attempt would be rejected (e.g. FADT) and that error would
>>   terminate the linker-loader procedure.
>>
>> This issue would not be too hard to overcome, with a simple "memoization
>> technique". After the initial loading & linking of the tables, OVMF
>> could remember the addresses of the "source" ACPI tables, and could
>> avoid passing already installed source tables to
>> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() for a second time.
>>
>> * The second use case is when an ACPI table is linked *only* from QEMU's
>> XSDT. This is much harder to fix, because
>> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() in edk2 links the copy of the
>> passed-in table into *both* RSDT and XSDT, automatically. And, again,
>> the UEFI spec doesn't provide a way to control this from the caller
>> (i.e. from within OVMF).
>>
>> I have tried earlier to effect a change in the specification of
>> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), on the ASWG and USWG mailing
>> lists. (At that time I was trying to expose UEFI memory *type* to the
>> caller, from which the copy of the ACPI table being installed should be
>> allocated from.) Alas, I received no answers at all.
>>
>> All in all I strongly recommend the "place rev2+ objects in method
>> scope" trick, over the "link it from the XSDT only" trick.
>>
>>>> Using XSDT would increase ACPI tables occupied RAM
>>>> as it would duplicate DSDT + non XP supported AML
>>>> at global namespace.  
>>>
>>> Not at all - I posted patches linking to same
>>> tables from both RSDT and XSDT at some point.  
>>
>> Yes, at <http://thread.gmane.org/gmane.comp.emulators.qemu/342559>. This
>> could be made work in OVMF with the above mentioned memoization stuff.
>>
>>> Only the list of pointers would be different.  
>>
>> I don't recommend that, see the second case above.
>>
>> Thanks
>> Laszlo
>>
>>>> So far we've managed keep DSDT compatible with XP while
>>>> introducing features from v2 and higher ACPI revisions as
>>>> AML that is only evaluated on demand.
>>>> We can continue doing so unless we have to unconditionally
>>>> add incompatible AML at global scope.
>>>>  
>>>
>>> Yes.
>>>   
>>>>>  
>>>>>> Michael, Paolo, what do you think about these ideas?
>>>>>>
>>>>>> Thanks!  
>>>>>
>>>>>
>>>>>
>>>>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
>>>>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
>>>>> current offset so we can add that to the linker.
>>>>>
>>>>> Won't work if you append the Name to the Aml structure (these can be
>>>>> nested to arbitrary depth using aml_append), so using plain GArray for
>>>>> this API makes sense to me.
>>>>>  
>>>>> --->  
>>>>>
>>>>> acpi: add build_append_named_dword, returning an offset in buffer
>>>>>
>>>>> This is a very limited form of support for runtime patching -
>>>>> similar in functionality to what we can do with ACPI_EXTRACT
>>>>> macros in python, but implemented in C.
>>>>>
>>>>> This is to allow ACPI code direct access to data tables -
>>>>> which is exactly what DataTableRegion is there for, except
>>>>> no known windows release so far implements DataTableRegion.  
>>>> unsupported means Windows will BSOD, so it's practically
>>>> unusable unless MS will patch currently existing Windows
>>>> versions.  
>>>
>>> Yes. That's why my patch allows patching SSDT without using
>>> DataTableRegion.
>>>   
>>>> Another thing about DataTableRegion is that ACPI tables are
>>>> supposed to have static content which matches checksum in
>>>> table the header while you are trying to use it for dynamic
>>>> data. It would be cleaner/more compatible to teach
>>>> bios-linker-loader to just allocate memory and patch AML
>>>> with the allocated address.  
>>>
>>> Yes - if address is static, you need to put it outside
>>> the table. Can come right before or right after this.
>>>   
>>>> Also if OperationRegion() is used, then one has to patch
>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>> using variable names is not permitted there.  
>>>
>>> I am not sure the comment was understood correctly.
>>> The comment says really "we can't use DataTableRegion
>>> so here is an alternative".
>>>   
>>>>  
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
>>>>> index 1b632dc..f8998ea 100644
>>>>> --- a/include/hw/acpi/aml-build.h
>>>>> +++ b/include/hw/acpi/aml-build.h
>>>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>>>>>  void
>>>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>>>>>  
>>>>> +int
>>>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
>>>>> +
>>>>>  #endif
>>>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
>>>>> index 0d4b324..7f9fa65 100644
>>>>> --- a/hw/acpi/aml-build.c
>>>>> +++ b/hw/acpi/aml-build.c
>>>>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
>>>>>      }
>>>>>  }
>>>>>  
>>>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
>>>>> + * and return the offset to 0x0 for runtime patching.
>>>>> + *
>>>>> + * Warning: runtime patching is best avoided. Only use this as
>>>>> + * a replacement for DataTableRegion (for guests that don't
>>>>> + * support it).
>>>>> + */
>>>>> +int
>>>>> +build_append_named_qword(GArray *array, const char *name_format, ...)
>>>>> +{
>>>>> +    int offset;
>>>>> +    va_list ap;
>>>>> +
>>>>> +    va_start(ap, name_format);
>>>>> +    build_append_namestringv(array, name_format, ap);
>>>>> +    va_end(ap);
>>>>> +
>>>>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
>>>>> +
>>>>> +    offset = array->len;
>>>>> +    build_append_int_noprefix(array, 0x0, 8);
>>>>> +    assert(array->len == offset + 8);
>>>>> +
>>>>> +    return offset;
>>>>> +}
>>>>> +
>>>>>  static GPtrArray *alloc_list;
>>>>>  
>>>>>  static Aml *aml_alloc(void)
>>>>>
>>>>>  
>>
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-05 17:22               ` [Qemu-devel] " Laszlo Ersek
@ 2016-01-06 13:39                 ` Igor Mammedov
  -1 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2016-01-06 13:39 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Michael S. Tsirkin, Xiao Guangrong, pbonzini, gleb, mtosatti,
	stefanha, rth, ehabkost, dan.j.williams, kvm, qemu-devel

On Tue, 5 Jan 2016 18:22:33 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> On 01/05/16 18:08, Igor Mammedov wrote:
> > On Mon, 4 Jan 2016 21:17:31 +0100
> > Laszlo Ersek <lersek@redhat.com> wrote:
> >   
> >> Michael CC'd me on the grandparent of the email below. I'll try to add
> >> my thoughts in a single go, with regard to OVMF.
> >>
> >> On 12/30/15 20:52, Michael S. Tsirkin wrote:  
> >>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:    
> >>>> On Mon, 28 Dec 2015 14:50:15 +0200
> >>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>    
> >>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:    
> >>>>>>
> >>>>>> Hi Michael, Paolo,
> >>>>>>
> >>>>>> Now it is the time to return to the challenge that how to reserve guest
> >>>>>> physical region internally used by ACPI.
> >>>>>>
> >>>>>> Igor suggested that:
> >>>>>> | An alternative place to allocate reserve from could be high memory.
> >>>>>> | For pc we have "reserved-memory-end" which currently makes sure
> >>>>>> | that hotpluggable memory range isn't used by firmware
> >>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)    
> >>
> >> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
> >> reason is that nobody wrote that patch, nor asked for the patch to be
> >> written. (Not implying that just requesting the patch would be
> >> sufficient for the patch to be written.)
> >>  
> >>>>> I don't want to tie things to reserved-memory-end because this
> >>>>> does not scale: next time we need to reserve memory,
> >>>>> we'll need to find yet another way to figure out what is where.    
> >>>> Could you elaborate a bit more on a problem you're seeing?
> >>>>
> >>>> To me it looks like it scales rather well.
> >>>> For example lets imagine that we adding a device
> >>>> that has some on device memory that should be mapped into GPA
> >>>> code to do so would look like:
> >>>>
> >>>>   pc_machine_device_plug_cb(dev)
> >>>>   {
> >>>>    ...
> >>>>    if (dev == OUR_NEW_DEVICE_TYPE) {
> >>>>        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
> >>>>        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
> >>>>    }
> >>>>   }
> >>>>
> >>>> we can practically add any number of new devices that way.    
> >>>
> >>> Yes but we'll have to build a host side allocator for these, and that's
> >>> nasty. We'll also have to maintain these addresses indefinitely (at
> >>> least per machine version) as they are guest visible.
> >>> Not only that, there's no way for guest to know if we move things
> >>> around, so basically we'll never be able to change addresses.
> >>>
> >>>     
> >>>>      
> >>>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
> >>>>> support 64 bit RAM instead    
> >>
> >> This looks quite doable in OVMF, as long as the blob to allocate from
> >> high memory contains *zero* ACPI tables.
> >>
> >> (
> >> Namely, each ACPI table is installed from the containing fw_cfg blob
> >> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
> >> own allocation policy for the *copies* of ACPI tables it installs.
> >>
> >> This allocation policy is left unspecified in the section of the UEFI
> >> spec that governs EFI_ACPI_TABLE_PROTOCOL.
> >>
> >> The current policy in edk2 (= the reference implementation) seems to be
> >> "allocate from under 4GB". It is currently being changed to "try to
> >> allocate from under 4GB, and if that fails, retry from high memory". (It
> >> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
> >> )
> >>  
> >>>>> (and maybe a way to allocate and
> >>>>> zero-initialize buffer without loading it through fwcfg),    
> >>
> >> Sounds reasonable.
> >>  
> >>>>> this way bios
> >>>>> does the allocation, and addresses can be patched into acpi.    
> >>>> and then guest side needs to parse/execute some AML that would
> >>>> initialize QEMU side so it would know where to write data.    
> >>>
> >>> Well not really - we can put it in a data table, by itself
> >>> so it's easy to find.    
> >>
> >> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
> >> acpi_get_table_with_size()?
> >>  
> >>>
> >>> AML is only needed if access from ACPI is desired.
> >>>
> >>>     
> >>>> bios-linker-loader is a great interface for initializing some
> >>>> guest owned data and linking it together but I think it adds
> >>>> unnecessary complexity and is misused if it's used to handle
> >>>> device owned data/on device memory in this and VMGID cases.    
> >>>
> >>> I want a generic interface for guest to enumerate these things.  linker
> >>> seems quite reasonable but if you see a reason why it won't do, or want
> >>> to propose a better interface, fine.    
> >>
> >> * The guest could do the following:
> >> - while processing the ALLOCATE commands, it would make a note where in
> >> GPA space each fw_cfg blob gets allocated
> >> - at the end the guest would prepare a temporary array with a predefined
> >> record format, that associates each fw_cfg blob's name with the concrete
> >> allocation address
> >> - it would create an FWCfgDmaAccess structure pointing at this array,
> >> with a new "control" bit set (or something similar)
> >> - the guest could write the address of the FWCfgDmaAccess struct to the
> >> appropriate register, as always.
> >>
> >> * Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command,
> >> specifying:
> >> - the fw_cfg blob's name, for which to retrieve the guest-allocated
> >>   address (this command could only follow the matching ALLOCATE
> >>   command, never precede it)
> >> - a flag whether the address should be written to IO or MMIO space
> >>   (would be likely IO on x86, MMIO on ARM)
> >> - a unique uint64_t key (could be the 16-bit fw_cfg selector value that
> >>   identifies the blob, actually!)
> >> - a uint64_t (IO or MMIO) address to write the unique key and then the
> >>   allocation address to.
> >>
> >> Either way, QEMU could learn about all the relevant guest-side
> >> allocation addresses in a low number of traps. In addition, AML code
> >> wouldn't have to reflect any allocation addresses to QEMU, ever.  
> 
> > That would be a nice trick. I see 2 issues here:
> >  1. The ACPI tables blob is built atomically when a guest tries to read it
> >     from fw_cfg, so patched addresses have to be communicated
> >     to QEMU before that.  
> 
> I don't understand issue #1. I think it is okay if the allocation
> happens strictly after QEMU refreshes / regenerates the ACPI payload.
> Namely, the guest-allocated addresses have two uses:
> - references from within the ACPI payload
If the references are from AML, then the AML has to be patched by the
linker, which is tricky and forces us to invent a duplicate AML API
that can tell the linker where an AML object should be patched
(Michael's patch in this thread is an example).

It would be better if the linker communicated the addresses to QEMU
before the AML is built, so that the AML could use addresses already
present in QEMU and wouldn't have to be patched at all.

> - references from the QEMU side, for device operation.
> 
> The first purpose is covered by the linker/loader itself (that is,
> GET_ALLOCATION_ADDRESS would be used *in addition* to ADD_POINTER). The
> second purpose would be covered by GET_ALLOCATION_ADDRESS.
> 
> >  2. More important, I think that we are misusing the linker-loader
> >     interface here, trying to allocate a buffer from guest RAM and
> >     so consuming it, while all we need is a window into device
> >     memory mapped somewhere outside of the RAM-occupied address space.  
> 
> But, more importantly, I definitely see your point with issue #2. I'm
> neutral on the question whether this should be solved with the ACPI
> linker/loader or with something else. I'm perfectly fine with "something
> else", as long as it is generic enough. The above GET_ALLOCATION_ADDRESS
> idea is relevant *only if* the ACPI linker/loader is deemed the best
> solution here.
> 
> (Heck, if the linker/loader avenue is rejected here, that's the least
> work for me! :))
> 
> Thanks
> Laszlo
> 
> >   
> >>  
> >>>
> >>> PCI would do, too - though windows guys had concerns about
> >>> returning PCI BARs from ACPI.
> >>>
> >>>     
> >>>> There was RFC on list to make BIOS boot from NVDIMM already
> >>>> doing some ACPI table lookup/parsing. Now if they were forced
> >>>> to also parse and execute AML to initialize QEMU with guest
> >>>> allocated address that would complicate them quite a bit.    
> >>>
> >>> If they just need to find a table by name, it won't be
> >>> too bad, would it?
> >>>     
> >>>> While with NVDIMM control memory region mapped directly by QEMU,
> >>>> respective patches don't need in any way to initialize QEMU,
> >>>> all they would need just read necessary data from control region.
> >>>>
> >>>> Also using bios-linker-loader takes away some usable RAM
> >>>> from guest and in the end that doesn't scale,
> >>>> the more devices I add the less usable RAM is left for guest OS
> >>>> while all the device needs is a piece of GPA address space
> >>>> that would belong to it.    
> >>>
> >>> I don't get this comment. I don't think it's MMIO that is wanted.
> >>> If it's backed by qemu virtual memory then it's RAM.
> >>>     
> >>>>>
> >>>>> See patch at the bottom that might be handy.    
> >>
> >> I've given up on Microsoft implementing DataTableRegion. (It's sad, really.)
> >>
> >> From last year I have a WIP version of "docs/vmgenid.txt" that is based
> >> on Michael's build_append_named_dword() function. If
> >> GET_ALLOCATION_ADDRESS above looks good, then I could simplify the ACPI
> >> stuff in that text file (and hopefully post it soon after for comments?)
> >>  
> >>>>>    
> >>>>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> >>>>>> | when writing ASL one shall make sure that only XP supported
> >>>>>> | features are in global scope, which is evaluated when tables
> >>>>>> | are loaded and features of rev2 and higher are inside methods.
> >>>>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
> >>>>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
> >>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)    
> >>>>>
> >>>>> Yes, this technique works.    
> >>
> >> Agreed.
> >>  
> >>>>>
> >>>>> An alternative is to add an XSDT, XP ignores that.
> >>>>> XSDT at the moment breaks OVMF (because it loads both
> >>>>> the RSDT and the XSDT, which is wrong), but I think
> >>>>> Laszlo was working on a fix for that.    
> >>
> >> We have to distinguish two use cases here.
> >>
> >> * The first is the case when QEMU prepares both an XSDT and an RSDT, and
> >> links at least one common ACPI table from both. This would cause OVMF to
> >> pass the same source (= to-be-copied) table to
> >> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() twice, with one of the
> >> following outcomes:
> >>
> >> - there would be two instances of the same table (think e.g. SSDT)
> >> - the second attempt would be rejected (e.g. FADT) and that error would
> >>   terminate the linker-loader procedure.
> >>
> >> This issue would not be too hard to overcome, with a simple "memoization
> >> technique". After the initial loading & linking of the tables, OVMF
> >> could remember the addresses of the "source" ACPI tables, and could
> >> avoid passing already installed source tables to
> >> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() for a second time.
> >>
> >> * The second use case is when an ACPI table is linked *only* from QEMU's
> >> XSDT. This is much harder to fix, because
> >> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() in edk2 links the copy of the
> >> passed-in table into *both* RSDT and XSDT, automatically. And, again,
> >> the UEFI spec doesn't provide a way to control this from the caller
> >> (i.e. from within OVMF).
> >>
> >> I have tried earlier to effect a change in the specification of
> >> EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), on the ASWG and USWG mailing
> >> lists. (At that time I was trying to expose UEFI memory *type* to the
> >> caller, from which the copy of the ACPI table being installed should be
> >> allocated from.) Alas, I received no answers at all.
> >>
> >> All in all I strongly recommend the "place rev2+ objects in method
> >> scope" trick, over the "link it from the XSDT only" trick.
> >>  
> >>>> Using XSDT would increase ACPI tables occupied RAM
> >>>> as it would duplicate DSDT + non XP supported AML
> >>>> at global namespace.    
> >>>
> >>> Not at all - I posted patches linking to same
> >>> tables from both RSDT and XSDT at some point.    
> >>
> >> Yes, at <http://thread.gmane.org/gmane.comp.emulators.qemu/342559>. This
> >> could be made work in OVMF with the above mentioned memoization stuff.
> >>  
> >>> Only the list of pointers would be different.    
> >>
> >> I don't recommend that, see the second case above.
> >>
> >> Thanks
> >> Laszlo
> >>  
> >>>> So far we've managed to keep DSDT compatible with XP while
> >>>> introducing features from v2 and higher ACPI revisions as
> >>>> AML that is only evaluated on demand.
> >>>> We can continue doing so unless we have to unconditionally
> >>>> add incompatible AML at global scope.
> >>>>    
> >>>
> >>> Yes.
> >>>     
> >>>>>    
> >>>>>> Michael, Paolo, what do you think about these ideas?
> >>>>>>
> >>>>>> Thanks!    
> >>>>>
> >>>>>
> >>>>>
> >>>>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> >>>>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> >>>>> current offset so we can add that to the linker.
> >>>>>
> >>>>> Won't work if you append the Name to the Aml structure (these can be
> >>>>> nested to arbitrary depth using aml_append), so using plain GArray for
> >>>>> this API makes sense to me.
> >>>>>    
> >>>>> --->    
> >>>>>
> >>>>> acpi: add build_append_named_dword, returning an offset in buffer
> >>>>>
> >>>>> This is a very limited form of support for runtime patching -
> >>>>> similar in functionality to what we can do with ACPI_EXTRACT
> >>>>> macros in python, but implemented in C.
> >>>>>
> >>>>> This is to allow ACPI code direct access to data tables -
> >>>>> which is exactly what DataTableRegion is there for, except
> >>>>> no known windows release so far implements DataTableRegion.    
> >>>> unsupported means Windows will BSOD, so it's practically
> >>>> unusable unless MS will patch currently existing Windows
> >>>> versions.    
> >>>
> >>> Yes. That's why my patch allows patching SSDT without using
> >>> DataTableRegion.
> >>>     
> >>>> Another thing about DataTableRegion is that ACPI tables are
> >>>> supposed to have static content which matches the checksum in
> >>>> the table header while you are trying to use it for dynamic
> >>>> data. It would be cleaner/more compatible to teach
> >>>> bios-linker-loader to just allocate memory and patch AML
> >>>> with the allocated address.    
> >>>
> >>> Yes - if address is static, you need to put it outside
> >>> the table. Can come right before or right after this.
> >>>     
> >>>> Also if OperationRegion() is used, then one has to patch
> >>>> DefOpRegion directly as RegionOffset must be Integer,
> >>>> using variable names is not permitted there.    
> >>>
> >>> I am not sure the comment was understood correctly.
> >>> The comment says really "we can't use DataTableRegion
> >>> so here is an alternative".
> >>>     
> >>>>    
> >>>>>
> >>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>
> >>>>> ---
> >>>>>
> >>>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> >>>>> index 1b632dc..f8998ea 100644
> >>>>> --- a/include/hw/acpi/aml-build.h
> >>>>> +++ b/include/hw/acpi/aml-build.h
> >>>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> >>>>>  void
> >>>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> >>>>>  
> >>>>> +int
> >>>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
> >>>>> +
> >>>>>  #endif
> >>>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> >>>>> index 0d4b324..7f9fa65 100644
> >>>>> --- a/hw/acpi/aml-build.c
> >>>>> +++ b/hw/acpi/aml-build.c
> >>>>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> >>>>>      }
> >>>>>  }
> >>>>>  
> >>>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> >>>>> + * and return the offset to 0x0 for runtime patching.
> >>>>> + *
> >>>>> + * Warning: runtime patching is best avoided. Only use this as
> >>>>> + * a replacement for DataTableRegion (for guests that don't
> >>>>> + * support it).
> >>>>> + */
> >>>>> +int
> >>>>> +build_append_named_qword(GArray *array, const char *name_format, ...)
> >>>>> +{
> >>>>> +    int offset;
> >>>>> +    va_list ap;
> >>>>> +
> >>>>> +    va_start(ap, name_format);
> >>>>> +    build_append_namestringv(array, name_format, ap);
> >>>>> +    va_end(ap);
> >>>>> +
> >>>>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
> >>>>> +
> >>>>> +    offset = array->len;
> >>>>> +    build_append_int_noprefix(array, 0x0, 8);
> >>>>> +    assert(array->len == offset + 8);
> >>>>> +
> >>>>> +    return offset;
> >>>>> +}
> >>>>> +
> >>>>>  static GPtrArray *alloc_list;
> >>>>>  
> >>>>>  static Aml *aml_alloc(void)
> >>>>>
> >>>>>    
> >>  
> >   
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-06 13:39                 ` [Qemu-devel] " Igor Mammedov
@ 2016-01-06 14:43                   ` Laszlo Ersek
  -1 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-06 14:43 UTC (permalink / raw)
  To: Igor Mammedov, Michael S. Tsirkin, Xiao Guangrong
  Cc: pbonzini, gleb, mtosatti, stefanha, rth, ehabkost,
	dan.j.williams, kvm, qemu-devel

On 01/06/16 14:39, Igor Mammedov wrote:
> On Tue, 5 Jan 2016 18:22:33 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> On 01/05/16 18:08, Igor Mammedov wrote:
>>> On Mon, 4 Jan 2016 21:17:31 +0100
>>> Laszlo Ersek <lersek@redhat.com> wrote:
>>>   
>>>> Michael CC'd me on the grandparent of the email below. I'll try to add
>>>> my thoughts in a single go, with regard to OVMF.
>>>>
>>>> On 12/30/15 20:52, Michael S. Tsirkin wrote:  
>>>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:    
>>>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>    
>>>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:    
>>>>>>>>
>>>>>>>> Hi Michael, Paolo,
>>>>>>>>
>>>>>>>> Now it is the time to return to the challenge that how to reserve guest
>>>>>>>> physical region internally used by ACPI.
>>>>>>>>
>>>>>>>> Igor suggested that:
>>>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)    
>>>>
>>>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
>>>> reason is that nobody wrote that patch, nor asked for the patch to be
>>>> written. (Not implying that just requesting the patch would be
>>>> sufficient for the patch to be written.)
>>>>  
>>>>>>> I don't want to tie things to reserved-memory-end because this
>>>>>>> does not scale: next time we need to reserve memory,
>>>>>>> we'll need to find yet another way to figure out what is where.    
>>>>>> Could you elaborate a bit more on a problem you're seeing?
>>>>>>
>>>>>> To me it looks like it scales rather well.
>>>>>> For example lets imagine that we adding a device
>>>>>> that has some on device memory that should be mapped into GPA
>>>>>> code to do so would look like:
>>>>>>
>>>>>>   pc_machine_device_plug_cb(dev)
>>>>>>   {
>>>>>>    ...
>>>>>>    if (dev == OUR_NEW_DEVICE_TYPE) {
>>>>>>        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>>>>>>        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
>>>>>>    }
>>>>>>   }
>>>>>>
>>>>>> we can practically add any number of new devices that way.    
>>>>>
>>>>> Yes but we'll have to build a host side allocator for these, and that's
>>>>> nasty. We'll also have to maintain these addresses indefinitely (at
>>>>> least per machine version) as they are guest visible.
>>>>> Not only that, there's no way for guest to know if we move things
>>>>> around, so basically we'll never be able to change addresses.
>>>>>
>>>>>     
>>>>>>      
>>>>>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
>>>>>>> support 64 bit RAM instead    
>>>>
>>>> This looks quite doable in OVMF, as long as the blob to allocate from
>>>> high memory contains *zero* ACPI tables.
>>>>
>>>> (
>>>> Namely, each ACPI table is installed from the containing fw_cfg blob
>>>> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
>>>> own allocation policy for the *copies* of ACPI tables it installs.
>>>>
>>>> This allocation policy is left unspecified in the section of the UEFI
>>>> spec that governs EFI_ACPI_TABLE_PROTOCOL.
>>>>
>>>> The current policy in edk2 (= the reference implementation) seems to be
>>>> "allocate from under 4GB". It is currently being changed to "try to
>>>> allocate from under 4GB, and if that fails, retry from high memory". (It
>>>> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
>>>> )
>>>>  
>>>>>>> (and maybe a way to allocate and
>>>>>>> zero-initialize buffer without loading it through fwcfg),    
>>>>
>>>> Sounds reasonable.
>>>>  
>>>>>>> this way bios
>>>>>>> does the allocation, and addresses can be patched into acpi.    
>>>>>> and then guest side needs to parse/execute some AML that would
>>>>>> initialize QEMU side so it would know where to write data.    
>>>>>
>>>>> Well not really - we can put it in a data table, by itself
>>>>> so it's easy to find.    
>>>>
>>>> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
>>>> acpi_get_table_with_size()?
>>>>  
>>>>>
>>>>> AML is only needed if access from ACPI is desired.
>>>>>
>>>>>     
>>>>>> bios-linker-loader is a great interface for initializing some
>>>>>> guest owned data and linking it together but I think it adds
>>>>>> unnecessary complexity and is misused if it's used to handle
>>>>>> device owned data/on device memory in this and VMGID cases.    
>>>>>
>>>>> I want a generic interface for guest to enumerate these things.  linker
>>>>> seems quite reasonable but if you see a reason why it won't do, or want
>>>>> to propose a better interface, fine.    
>>>>
>>>> * The guest could do the following:
>>>> - while processing the ALLOCATE commands, it would make a note where in
>>>> GPA space each fw_cfg blob gets allocated
>>>> - at the end the guest would prepare a temporary array with a predefined
>>>> record format, that associates each fw_cfg blob's name with the concrete
>>>> allocation address
>>>> - it would create an FWCfgDmaAccess structure pointing at this array,
>>>> with a new "control" bit set (or something similar)
>>>> - the guest could write the address of the FWCfgDmaAccess struct to the
>>>> appropriate register, as always.
>>>>
>>>> * Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command,
>>>> specifying:
>>>> - the fw_cfg blob's name, for which to retrieve the guest-allocated
>>>>   address (this command could only follow the matching ALLOCATE
>>>>   command, never precede it)
>>>> - a flag whether the address should be written to IO or MMIO space
>>>>   (would be likely IO on x86, MMIO on ARM)
>>>> - a unique uint64_t key (could be the 16-bit fw_cfg selector value that
>>>>   identifies the blob, actually!)
>>>> - a uint64_t (IO or MMIO) address to write the unique key and then the
>>>>   allocation address to.
>>>>
>>>> Either way, QEMU could learn about all the relevant guest-side
>>>> allocation addresses in a low number of traps. In addition, AML code
>>>> wouldn't have to reflect any allocation addresses to QEMU, ever.  
>>
>>> That would be nice trick. I see 2 issues here:
>>>  1. ACPI tables blob is built atomically when a guest tries to read it
>>>     from fw_cfg so patched addresses have to be communicated
>>>     to QEMU before that.  
>>
>> I don't understand issue #1. I think it is okay if the allocation
>> happens strictly after QEMU refreshes / regenerates the ACPI payload.
>> Namely, the guest-allocated addresses have two uses:
>> - references from within the ACPI payload
> If references are from AML, then AML should be patched by linker,
> which is tricky and forces us to invent duplicate AML API that
> would be able to tell linker where AML object should be patched
> (Michael's patch in this thread as example)

Yes, such minimal AML patching is necessary.

> It would be better if linker would communicate addresses to QEMU
> before AML is built, so that AML would use already present
> in QEMU addresses and doesn't have to be patched at all.

I dislike this.

First, this would duplicate part of the linker's functionality in the host.

Second, it would lead to an ugly ping-pong between host and guest. First
QEMU has to create the full ACPI payload, with placeholder constants in
the AML. Then the guest could retrieve the *size* of the ACPI payload
(the fw_cfg blobs), and perform the allocations. Then QEMU would fix up
the AML. Then the guest would download the fw_cfg blobs. Then the guest
linker would fix up the data tables. Ugly ugly ugly.

I think Michael's and Xiao Guangrong's solutions to the minimal AML
patching (= patch a named dword / qword object, or the constant return
value in a minimal method) are quite feasible.

How about this:

+------------------+            +-----------------------+
|Single DWORD      |            | 4KB system memory     |
|object or         |            | operation region      | ---------+
|DWORD-returning   |            | hosting a single      |          |
|method in AML,    | ---------> | "QEMU parameter"      | -----+   |
|to be patched with|            | structure, with       |      |   |
|Michael's or      |            | pointers, small       |      |   |
|Xiao Guangrong's  |            | scalars, and padding. |      |   |
|trick             |            | Call this QPRM ("QEMU |      |   |
+------------------+            | parameters").         |      |   |
                                +-----------------------+      |   |
                                                               |   |
                                +-----------------------+ <----+   |
                                | "NRAM" operation      |          |
                                | region for NVDIMM     |          |
                                +-----------------------+          |
                                                                   |
                                +--------------------------+       |
                                | Another operation region | <-----+
                                | for another device       |
                                +--------------------------+

                                ...


Here's the idea formulated in a sequence of steps:

(1) In QEMU, create a single DWORD object, or DWORD-returning simple
method, that we *do* patch, with Michael's or Xiao Guangrong's trick,
using the ACPI linker. This would be the *only* such trick.

(2) This object or method would provide the GPA of a 4KB fw_cfg blob.
This fw_cfg blob would start with 36 zero bytes (for reasons specific to
OVMF; let me skip those for now). The rest of the blob would carry a
structure that we would actually define in the QEMU source code, as a type.

Fields of this structure would be:
- pointers (4-byte or 8-byte)
- small scalars (like a 128-bit GUID)
- padding

This structure would be named QPRM ("QEMU parameters").
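
Purely to illustrate (every field name below is invented; this is not code
from any existing series), the QEMU-side type could look roughly like:

    /* hypothetical sketch of the QPRM layout; a real definition would live
     * in QEMU and be kept in sync with the AML operation region */
    typedef struct QPRM {
        uint8_t  ovmf_reserved[36];  /* the 36 zero bytes mentioned above */
        uint64_t nvdimm_nram_gpa;    /* pointer, relocated with ADD_POINTER */
        uint8_t  vmgenid_guid[16];   /* example of a small scalar kept inline */
        uint8_t  padding[4096 - 36 - 8 - 16];
    } QEMU_PACKED QPRM;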

(3) We add an *unconditional*, do-nothing device to the DSDT whose
initialization function evaluates the DWORD object (or DWORD-returning
method), and writes the result (= the guest-allocated address of QPRM)
to a hard-coded IO (or MMIO) port range.

(4) This port range would be backed by a *single* MemoryRegion in QEMU,
and the address written by the guest would be stored in a global
variable (or a singleton object anyway).
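
To sketch the QEMU side of steps (3)-(4), just an illustration on top of the
usual memory API (the qprm_* names and the port size are invented here):

    static uint64_t qprm_gpa;      /* step (4): stashed guest address of QPRM */
    static MemoryRegion qprm_mr;

    static void qprm_io_write(void *opaque, hwaddr addr,
                              uint64_t val, unsigned size)
    {
        /* written once, by the do-nothing device's AML init (step 3) */
        qprm_gpa = val;
    }

    static const MemoryRegionOps qprm_io_ops = {
        .write = qprm_io_write,
        .valid.min_access_size = 4,
        .valid.max_access_size = 8,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    static void qprm_register_io(Object *owner, hwaddr qprm_io_base)
    {
        memory_region_init_io(&qprm_mr, owner, &qprm_io_ops, NULL,
                              "qprm-addr", 8);
        memory_region_add_subregion(get_system_io(), qprm_io_base, &qprm_mr);
    }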

(5) In the guest AML, an unconditional QPRM operation region would
overlay the blob, with fields matching the QEMU structure type.

(6) Whenever a new device is introduced in QEMU that needs a dedicated
system memory operation region in the guest (nvdimm, vmgenid), we add a
new field to QPRM.

If the required region is very small (just a few scalars, like with
vmgenid), then the field is placed directly in QPRM (with the necessary
padding).

Otherwise (see the NRAM operation region for nvdimm) we add a new fw_cfg
blob, and an ADD_POINTER command for relocating the referencing field in
QPRM.

(7) The device models in QEMU can follow the pointers in guest memory,
from the initially stashed address of QPRM, through the necessary
pointer fields in QPRM, to the final operation regions.
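
For step (7), the device model side could be as simple as the following
sketch (reusing the invented names from the snippets above):

    static uint64_t qprm_read_nvdimm_nram(void)
    {
        uint64_t nram_gpa;

        /* qprm_gpa was stashed by qprm_io_write() in the sketch above */
        cpu_physical_memory_read(qprm_gpa + offsetof(QPRM, nvdimm_nram_gpa),
                                 &nram_gpa, sizeof(nram_gpa));
        return le64_to_cpu(nram_gpa);
    }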

(8) The device model-specific AML in the guest can do the same
traversal. It can fetch the right pointer field from QPRM, and define a
new operation region (like NRAM) based on that value.
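
For step (8), assuming the aml_* builders as used elsewhere in this thread
(including Xiao's change that lets aml_operation_region() take an Aml*
offset), the generated AML could be built along these lines; "NVDA" is an
invented name for the NVDIMM pointer field of the QPRM operation region:

    /* sketch only, inside the device-specific method generation */
    Aml *nram_addr = aml_local(0);
    /* fetch the NVDIMM pointer field out of the QPRM operation region */
    aml_append(method, aml_store(aml_name("NVDA"), nram_addr));
    /* then overlay the NVDIMM control area with its own operation region */
    aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
                                            nram_addr, TARGET_PAGE_SIZE));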


All in all this is just another layer of indirection, same as the
DataTableRegion idea, except that the parameter table would be located
by a central patched DWORD object or method, not by ACPI SDT signature /
OEM ID / OEM table ID.

If we can agree on this, I could work on the device model-independent
steps (1-5), and perhaps do (6) and (8) for vmgenid on top.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-05 17:07               ` [Qemu-devel] " Xiao Guangrong
@ 2016-01-07  9:21                 ` Igor Mammedov
  -1 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2016-01-07  9:21 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Michael S. Tsirkin, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel, Laszlo Ersek

On Wed, 6 Jan 2016 01:07:45 +0800
Xiao Guangrong <guangrong.xiao@linux.intel.com> wrote:

> On 01/06/2016 12:43 AM, Michael S. Tsirkin wrote:
> 
> >>> Yes - if address is static, you need to put it outside
> >>> the table. Can come right before or right after this.
> >>>  
> >>>> Also if OperationRegion() is used, then one has to patch
> >>>> DefOpRegion directly as RegionOffset must be Integer,
> >>>> using variable names is not permitted there.  
> >>>
> >>> I am not sure the comment was understood correctly.
> >>> The comment says really "we can't use DataTableRegion
> >>> so here is an alternative".  
> >> so how are you going to access data at which patched
> >> NameString point to?
> >> for that you'd need a normal patched OperationRegion
> >> as well since DataTableRegion isn't usable.  
> >
> > For VMGENID you would patch the method that
> > returns the address - you do not need an op region
> > as you never access it.
> >
> > I don't know about NVDIMM. Maybe OperationRegion can
> > use the patched NameString? Will need some thought.  
> 
> The ACPI spec says that the offsetTerm in OperationRegion
> is evaluated as Int, so the named object is allowed to be
> used in OperationRegion, that is exact what my patchset
> is doing (http://marc.info/?l=kvm&m=145193395624537&w=2):
that's not my reading of spec:
"
DefOpRegion := OpRegionOp NameString RegionSpace RegionOffset RegionLen
RegionOffset := TermArg => Integer
TermArg := Type2Opcode | DataObject | ArgObj | LocalObj
"

A Named object is not allowed per spec, but you've used ArgObj, which is
allowed; even Windows is OK with such a dynamic OperationRegion.

> 
> +    dsm_mem = aml_arg(3);
> +    aml_append(method, aml_store(aml_call0(NVDIMM_GET_DSM_MEM), dsm_mem));
> 
> +    aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
> +                                            dsm_mem, TARGET_PAGE_SIZE));
> 
> We hide the int64 object which is patched by BIOS in the method,
> NVDIMM_GET_DSM_MEM, to make windows XP happy.
Considering that NRAM is allocated in low memory, it's even fine to move
the OperationRegion into object scope to get rid of the IASL warnings
about declaring a Named object inside a method, but then you'd need to
patch it directly, as the only choice for RegionOffset would be a DataObject.

> 
> However, the disadvantages i see are:
> a) as Igor pointed out, we need a way to tell QEMU what is the patched
>     address, in NVDIMM ACPI, we used a 64 bit IO ports to pass the address
>     to QEMU.
> 
> b) BIOS allocated memory is RAM based so it stops us to use MMIO in ACPI,
>     MMIO is the more scalable resource than IO port as it has larger region
>     and supports 64 bits operation.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] How to reserve guest physical region for ACPI
  2016-01-05 16:43             ` [Qemu-devel] " Michael S. Tsirkin
@ 2016-01-07 10:30             ` Igor Mammedov
  2016-01-07 10:54               ` Michael S. Tsirkin
  -1 siblings, 1 reply; 59+ messages in thread
From: Igor Mammedov @ 2016-01-07 10:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Xiao Guangrong, ehabkost, kvm, gleb, mtosatti, qemu-devel,
	stefanha, pbonzini, dan.j.williams, Laszlo Ersek, rth

On Tue, 5 Jan 2016 18:43:02 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
> > > > bios-linker-loader is a great interface for initializing some
> > > > guest owned data and linking it together but I think it adds
> > > > unnecessary complexity and is misused if it's used to handle
> > > > device owned data/on device memory in this and VMGID cases.    
> > > 
> > > I want a generic interface for guest to enumerate these things.  linker
> > > seems quite reasonable but if you see a reason why it won't do, or want
> > > to propose a better interface, fine.
> > > 
> > > PCI would do, too - though windows guys had concerns about
> > > returning PCI BARs from ACPI.  
> > There were potential issues with pSeries bootloader that treated
> > PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
> > Could you point out to discussion about windows issues?
> > 
> > What VMGEN patches that used PCI for mapping purposes were
> > stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
> > class id but we couldn't agree on it.
> > 
> > VMGEN v13 with full discussion is here
> > https://patchwork.ozlabs.org/patch/443554/
> > So to continue with this route we would need to pick some other
> > driver less class id so windows won't prompt for driver or
> > maybe supply our own driver stub to guarantee that no one
> > would touch it. Any suggestions?  
> 
> Pick any device/vendor id pair for which windows specifies no driver.
> There's a small risk that this will conflict with some
> guest but I think it's minimal.
The device/vendor id pair was QEMU specific, so it doesn't conflict with anything.
The issue we were trying to solve was to prevent Windows asking for a driver,
even though it does so only once if told not to ask again.

That's why PCI_CLASS_MEMORY_RAM was selected: it's the generic driver-less
device descriptor in the INF file, which matches as the last resort if
there isn't any other driver that matches the device by its device/vendor id pair.

> 
> 
> > > 
> > >   
> > > > There was RFC on list to make BIOS boot from NVDIMM already
> > > > doing some ACPI table lookup/parsing. Now if they were forced
> > > > to also parse and execute AML to initialize QEMU with guest
> > > > allocated address that would complicate them quite a bit.    
> > > 
> > > If they just need to find a table by name, it won't be
> > > too bad, would it?  
> > that's what they were doing scanning memory for static NVDIMM table.
> > However if it were DataTable, BIOS side would have to execute
> > AML so that the table address could be told to QEMU.  
> 
> Not at all. You can find any table by its signature without
> parsing AML.
Yep, and then the BIOS would need to tell QEMU its address by
writing to an IO port which is allocated statically in QEMU
for this purpose and is described in AML only on the guest side.

> 
> 
> > In case of direct mapping or PCI BAR there is no need to initialize
> > QEMU side from AML.
> > That also saves us IO port where this address should be written
> > if bios-linker-loader approach is used.
> >   
> > >   
> > > > While with NVDIMM control memory region mapped directly by QEMU,
> > > > respective patches don't need in any way to initialize QEMU,
> > > > all they would need just read necessary data from control region.
> > > > 
> > > > Also using bios-linker-loader takes away some usable RAM
> > > > from guest and in the end that doesn't scale,
> > > > the more devices I add the less usable RAM is left for guest OS
> > > > while all the device needs is a piece of GPA address space
> > > > that would belong to it.    
> > > 
> > > I don't get this comment. I don't think it's MMIO that is wanted.
> > > If it's backed by qemu virtual memory then it's RAM.  
> > Then why don't allocate video card VRAM the same way and try to explain
> > user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
> > only has 64Mb of available RAM because of we think that on device VRAM
> > is also RAM.
> > 
> > Maybe I've used MMIO term wrongly here but it roughly reflects the idea
> > that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
> > area) is not allocated from guest's usable RAM (as described in E820)
> > but rather directly mapped in guest's GPA and doesn't consume available
> > RAM as guest sees it. That's also the way it's done on real hardware.
> > 
> > What we need in case of VMGEN ID and NVDIMM is on device memory
> > that could be directly accessed by guest.
> > Both direct mapping or PCI BAR do that job and we could use simple
> > static AML without any patching.  
> 
> At least with VMGEN the issue is that there's an AML method
> that returns the physical address.
> Then if guest OS moves the BAR (which is legal), it will break
> since caller has no way to know it's related to the BAR.
I've found the following MS doc, "Firmware Allocation of PCI Device Resources
in Windows". It looks like when MS implemented resource rebalancing in
Vista they pushed a compat change to the PCI specs.
That ECN is called "Ignore PCI Boot Configuration _DSM Function"
and can be found here:
https://pcisig.com/sites/default/files/specification_documents/ECR-Ignorebootconfig-final.pdf

It looks like it's possible to forbid rebalancing per
device/bridge if it has a _DSM method that returns "do not
ignore the boot configuration of PCI resources".

 
> > > > > 
> > > > > See patch at the bottom that might be handy.
> > > > >     
> > > > > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > > > > | when writing ASL one shall make sure that only XP supported
> > > > > > | features are in global scope, which is evaluated when tables
> > > > > > | are loaded and features of rev2 and higher are inside methods.
> > > > > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > > > > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > > > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)    
> > > > > 
> > > > > Yes, this technique works.
> > > > > 
> > > > > An alternative is to add an XSDT, XP ignores that.
> > > > > XSDT at the moment breaks OVMF (because it loads both
> > > > > the RSDT and the XSDT, which is wrong), but I think
> > > > > Laszlo was working on a fix for that.    
> > > > Using XSDT would increase ACPI tables occupied RAM
> > > > as it would duplicate DSDT + non XP supported AML
> > > > at global namespace.    
> > > 
> > > Not at all - I posted patches linking to same
> > > tables from both RSDT and XSDT at some point.
> > > Only the list of pointers would be different.  
> > if you put XP incompatible AML in separate SSDT and link it
> > only from XSDT than that would work but if incompatibility
> > is in DSDT, one would have to provide compat DSDT for RSDT
> > an incompat DSDT for XSDT.  
> 
> So don't do this.
Well, the spec says "An ACPI-compatible OS must use the XSDT if present",
which I read as: tables pointed to by the RSDT MUST be pointed to by the XSDT
as well, and the RSDT MUST NOT be used.

So if we put the incompatible changes in a separate SSDT and put
it only in the XSDT, that might work. The showstopper here is OVMF, which
has issues with it, as Laszlo pointed out.

Also, since Windows implements only a subset of the spec, the XSDT trick
would cover only XP-based versions, while the rest will see and
use the XSDT-pointed tables, which could still contain AML that is
incompatible with some of the later Windows versions.


> 
> > So far policy was don't try to run guest OS on QEMU
> > configuration that isn't supported by it.  
> 
> It's better if guests don't see some features but
> don't crash. It's not always possible of course but
> we should try to avoid this.
> 
> > For example we use VAR_PACKAGE when running with more
> > than 255 VCPUs (commit b4f4d5481) which BSODs XP.  
> 
> Yes. And it's because we violate the spec, DSDT
> should not have this stuff.
> 
> > So we can continue with that policy with out resorting to
> > using both RSDT and XSDT,
> > It would be even easier as all AML would be dynamically
> > generated and DSDT would only contain AML elements for
> > a concrete QEMU configuration.  
> 
> I'd prefer XSDT but I won't nack it if you do it in DSDT.
> I think it's not spec compliant but guests do not
> seem to care.
> 
> > > > So far we've managed keep DSDT compatible with XP while
> > > > introducing features from v2 and higher ACPI revisions as
> > > > AML that is only evaluated on demand.
> > > > We can continue doing so unless we have to unconditionally
> > > > add incompatible AML at global scope.
> > > >     
> > > 
> > > Yes.
> > >   
> > > > >     
> > > > > > Michael, Paolo, what do you think about these ideas?
> > > > > > 
> > > > > > Thanks!    
> > > > > 
> > > > > 
> > > > > 
> > > > > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > > > > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > > > > current offset so we can add that to the linker.
> > > > > 
> > > > > Won't work if you append the Name to the Aml structure (these can be
> > > > > nested to arbitrary depth using aml_append), so using plain GArray for
> > > > > this API makes sense to me.
> > > > >     
> > > > > --->    
> > > > > 
> > > > > acpi: add build_append_named_dword, returning an offset in buffer
> > > > > 
> > > > > This is a very limited form of support for runtime patching -
> > > > > similar in functionality to what we can do with ACPI_EXTRACT
> > > > > macros in python, but implemented in C.
> > > > > 
> > > > > This is to allow ACPI code direct access to data tables -
> > > > > which is exactly what DataTableRegion is there for, except
> > > > > no known windows release so far implements DataTableRegion.    
> > > > unsupported means Windows will BSOD, so it's practically
> > > > unusable unless MS will patch currently existing Windows
> > > > versions.    
> > > 
> > > Yes. That's why my patch allows patching SSDT without using
> > > DataTableRegion.
> > >   
> > > > Another thing about DataTableRegion is that ACPI tables are
> > > > supposed to have static content which matches checksum in
> > > > table the header while you are trying to use it for dynamic
> > > > data. It would be cleaner/more compatible to teach
> > > > bios-linker-loader to just allocate memory and patch AML
> > > > with the allocated address.    
> > > 
> > > Yes - if address is static, you need to put it outside
> > > the table. Can come right before or right after this.
> > >   
> > > > Also if OperationRegion() is used, then one has to patch
> > > > DefOpRegion directly as RegionOffset must be Integer,
> > > > using variable names is not permitted there.    
> > > 
> > > I am not sure the comment was understood correctly.
> > > The comment says really "we can't use DataTableRegion
> > > so here is an alternative".  
> > so how are you going to access data at which patched
> > NameString point to?
> > for that you'd need a normal patched OperationRegion
> > as well since DataTableRegion isn't usable.  
> 
> For VMGENID you would patch the method that
> returns the address - you do not need an op region
> as you never access it.
> 
> I don't know about NVDIMM. Maybe OperationRegion can
> use the patched NameString? Will need some thought.
> 
> > >   
> > > >     
> > > > > 
> > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > 
> > > > > ---
> > > > > 
> > > > > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > > > > index 1b632dc..f8998ea 100644
> > > > > --- a/include/hw/acpi/aml-build.h
> > > > > +++ b/include/hw/acpi/aml-build.h
> > > > > @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> > > > >  void
> > > > >  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> > > > >  
> > > > > +int
> > > > > +build_append_named_dword(GArray *array, const char *name_format, ...);
> > > > > +
> > > > >  #endif
> > > > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > > > > index 0d4b324..7f9fa65 100644
> > > > > --- a/hw/acpi/aml-build.c
> > > > > +++ b/hw/acpi/aml-build.c
> > > > > @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> > > > >      }
> > > > >  }
> > > > >  
> > > > > +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> > > > > + * and return the offset to 0x0 for runtime patching.
> > > > > + *
> > > > > + * Warning: runtime patching is best avoided. Only use this as
> > > > > + * a replacement for DataTableRegion (for guests that don't
> > > > > + * support it).
> > > > > + */
> > > > > +int
> > > > > +build_append_named_qword(GArray *array, const char *name_format, ...)
> > > > > +{
> > > > > +    int offset;
> > > > > +    va_list ap;
> > > > > +
> > > > > +    va_start(ap, name_format);
> > > > > +    build_append_namestringv(array, name_format, ap);
> > > > > +    va_end(ap);
> > > > > +
> > > > > +    build_append_byte(array, 0x0E); /* QWordPrefix */
> > > > > +
> > > > > +    offset = array->len;
> > > > > +    build_append_int_noprefix(array, 0x0, 8);
> > > > > +    assert(array->len == offset + 8);
> > > > > +
> > > > > +    return offset;
> > > > > +}
> > > > > +
> > > > >  static GPtrArray *alloc_list;
> > > > >  
> > > > >  static Aml *aml_alloc(void)
> > > > > 
> > > > >     
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html  
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] How to reserve guest physical region for ACPI
  2016-01-07 10:30             ` Igor Mammedov
@ 2016-01-07 10:54               ` Michael S. Tsirkin
  2016-01-07 13:42                 ` Igor Mammedov
  2016-01-07 17:08                 ` Laszlo Ersek
  0 siblings, 2 replies; 59+ messages in thread
From: Michael S. Tsirkin @ 2016-01-07 10:54 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Xiao Guangrong, ehabkost, kvm, gleb, mtosatti, qemu-devel,
	stefanha, pbonzini, dan.j.williams, Laszlo Ersek, rth

On Thu, Jan 07, 2016 at 11:30:25AM +0100, Igor Mammedov wrote:
> On Tue, 5 Jan 2016 18:43:02 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:
> > > > > bios-linker-loader is a great interface for initializing some
> > > > > guest owned data and linking it together but I think it adds
> > > > > unnecessary complexity and is misused if it's used to handle
> > > > > device owned data/on device memory in this and VMGID cases.    
> > > > 
> > > > I want a generic interface for guest to enumerate these things.  linker
> > > > seems quite reasonable but if you see a reason why it won't do, or want
> > > > to propose a better interface, fine.
> > > > 
> > > > PCI would do, too - though windows guys had concerns about
> > > > returning PCI BARs from ACPI.  
> > > There were potential issues with pSeries bootloader that treated
> > > PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
> > > Could you point out to discussion about windows issues?
> > > 
> > > What VMGEN patches that used PCI for mapping purposes were
> > > stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
> > > class id but we couldn't agree on it.
> > > 
> > > VMGEN v13 with full discussion is here
> > > https://patchwork.ozlabs.org/patch/443554/
> > > So to continue with this route we would need to pick some other
> > > driver less class id so windows won't prompt for driver or
> > > maybe supply our own driver stub to guarantee that no one
> > > would touch it. Any suggestions?  
> > 
> > Pick any device/vendor id pair for which windows specifies no driver.
> > There's a small risk that this will conflict with some
> > guest but I think it's minimal.
> device/vendor id pair was QEMU specific so doesn't conflicts with anything
> issue we were trying to solve was to prevent windows asking for driver
> even though it does so only once if told not to ask again.
> 
> That's why PCI_CLASS_MEMORY_RAM was selected as it's generic driver-less
> device descriptor in INF file which matches as the last resort if
> there isn't any other diver that's matched device by device/vendor id pair.

I think this is the only such class in this INF.
If you can't use it, you must use an existing device/vendor id pair;
there's some risk involved, but probably not much.

> > 
> > 
> > > > 
> > > >   
> > > > > There was RFC on list to make BIOS boot from NVDIMM already
> > > > > doing some ACPI table lookup/parsing. Now if they were forced
> > > > > to also parse and execute AML to initialize QEMU with guest
> > > > > allocated address that would complicate them quite a bit.    
> > > > 
> > > > If they just need to find a table by name, it won't be
> > > > too bad, would it?  
> > > that's what they were doing scanning memory for static NVDIMM table.
> > > However if it were DataTable, BIOS side would have to execute
> > > AML so that the table address could be told to QEMU.  
> > 
> > Not at all. You can find any table by its signature without
> > parsing AML.
> yep, and then BIOS would need to tell its address to QEMU
> writing to IO port which is allocated statically in QEMU
> for this purpose and is described in AML only on guest side.

IO ports are an ABI too, but they are way easier to
maintain.

> > 
> > 
> > > In case of direct mapping or PCI BAR there is no need to initialize
> > > QEMU side from AML.
> > > That also saves us IO port where this address should be written
> > > if bios-linker-loader approach is used.
> > >   
> > > >   
> > > > > While with NVDIMM control memory region mapped directly by QEMU,
> > > > > respective patches don't need in any way to initialize QEMU,
> > > > > all they would need just read necessary data from control region.
> > > > > 
> > > > > Also using bios-linker-loader takes away some usable RAM
> > > > > from guest and in the end that doesn't scale,
> > > > > the more devices I add the less usable RAM is left for guest OS
> > > > > while all the device needs is a piece of GPA address space
> > > > > that would belong to it.    
> > > > 
> > > > I don't get this comment. I don't think it's MMIO that is wanted.
> > > > If it's backed by qemu virtual memory then it's RAM.  
> > > Then why don't allocate video card VRAM the same way and try to explain
> > > user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
> > > only has 64Mb of available RAM because of we think that on device VRAM
> > > is also RAM.
> > > 
> > > Maybe I've used MMIO term wrongly here but it roughly reflects the idea
> > > that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
> > > area) is not allocated from guest's usable RAM (as described in E820)
> > > but rather directly mapped in guest's GPA and doesn't consume available
> > > RAM as guest sees it. That's also the way it's done on real hardware.
> > > 
> > > What we need in case of VMGEN ID and NVDIMM is on device memory
> > > that could be directly accessed by guest.
> > > Both direct mapping or PCI BAR do that job and we could use simple
> > > static AML without any patching.  
> > 
> > At least with VMGEN the issue is that there's an AML method
> > that returns the physical address.
> > Then if guest OS moves the BAR (which is legal), it will break
> > since caller has no way to know it's related to the BAR.
> I've found a following MS doc "Firmware Allocation of PCI Device Resources in Windows". It looks like when MS implemented resource rebalancing in
> Vista they pushed a compat change to PCI specs.
> That ECN is called "Ignore PCI Boot Configuration_DSM Function"
> and can be found here:
> https://pcisig.com/sites/default/files/specification_documents/ECR-Ignorebootconfig-final.pdf
> 
> It looks like it's possible to forbid rebalancing per
> device/bridge if it has _DMS method that returns "do not
> ignore the boot configuration of PCI resources".

I'll have to study this but we don't want that
globally, do we?
This restricts hotplug functionality significantly.

>  
> > > > > > 
> > > > > > See patch at the bottom that might be handy.
> > > > > >     
> > > > > > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > > > > > | when writing ASL one shall make sure that only XP supported
> > > > > > > | features are in global scope, which is evaluated when tables
> > > > > > > | are loaded and features of rev2 and higher are inside methods.
> > > > > > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > > > > > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > > > > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)    
> > > > > > 
> > > > > > Yes, this technique works.
> > > > > > 
> > > > > > An alternative is to add an XSDT, XP ignores that.
> > > > > > XSDT at the moment breaks OVMF (because it loads both
> > > > > > the RSDT and the XSDT, which is wrong), but I think
> > > > > > Laszlo was working on a fix for that.    
> > > > > Using XSDT would increase ACPI tables occupied RAM
> > > > > as it would duplicate DSDT + non XP supported AML
> > > > > at global namespace.    
> > > > 
> > > > Not at all - I posted patches linking to same
> > > > tables from both RSDT and XSDT at some point.
> > > > Only the list of pointers would be different.  
> > > if you put XP incompatible AML in separate SSDT and link it
> > > only from XSDT than that would work but if incompatibility
> > > is in DSDT, one would have to provide compat DSDT for RSDT
> > > an incompat DSDT for XSDT.  
> > 
> > So don't do this.
> well spec says "An ACPI-compatible OS must use the XSDT if present",
> which I read as tables pointed by RSDT MUST be pointed by XSDT
> as well and RSDT MUST NOT not be used.
>
> so if we put incompatible changes in a separate SSDT and put
> it only in XSDT that might work. Showstopper here is OVMF which
> has issues with it as Laszlo pointed out.

But that's just a bug.

> Also since Windows implements only subset of spec XSDT trick
> would cover only XP based versions while the rest will see and
> use XSDT pointed tables which still could have incompatible
> AML with some of the later windows versions.

We'll have to see what these are exactly.
If it's methods in an SSDT, we can check the ACPI revision supported
by the OSPM.

> 
> > 
> > > So far policy was don't try to run guest OS on QEMU
> > > configuration that isn't supported by it.  
> > 
> > It's better if guests don't see some features but
> > don't crash. It's not always possible of course but
> > we should try to avoid this.
> > 
> > > For example we use VAR_PACKAGE when running with more
> > > than 255 VCPUs (commit b4f4d5481) which BSODs XP.  
> > 
> > Yes. And it's because we violate the spec, DSDT
> > should not have this stuff.
> > 
> > > So we can continue with that policy with out resorting to
> > > using both RSDT and XSDT,
> > > It would be even easier as all AML would be dynamically
> > > generated and DSDT would only contain AML elements for
> > > a concrete QEMU configuration.  
> > 
> > I'd prefer XSDT but I won't nack it if you do it in DSDT.
> > I think it's not spec compliant but guests do not
> > seem to care.
> > 
> > > > > So far we've managed keep DSDT compatible with XP while
> > > > > introducing features from v2 and higher ACPI revisions as
> > > > > AML that is only evaluated on demand.
> > > > > We can continue doing so unless we have to unconditionally
> > > > > add incompatible AML at global scope.
> > > > >     
> > > > 
> > > > Yes.
> > > >   
> > > > > >     
> > > > > > > Michael, Paolo, what do you think about these ideas?
> > > > > > > 
> > > > > > > Thanks!    
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > > > > > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > > > > > current offset so we can add that to the linker.
> > > > > > 
> > > > > > Won't work if you append the Name to the Aml structure (these can be
> > > > > > nested to arbitrary depth using aml_append), so using plain GArray for
> > > > > > this API makes sense to me.
> > > > > >     
> > > > > > --->    
> > > > > > 
> > > > > > acpi: add build_append_named_dword, returning an offset in buffer
> > > > > > 
> > > > > > This is a very limited form of support for runtime patching -
> > > > > > similar in functionality to what we can do with ACPI_EXTRACT
> > > > > > macros in python, but implemented in C.
> > > > > > 
> > > > > > This is to allow ACPI code direct access to data tables -
> > > > > > which is exactly what DataTableRegion is there for, except
> > > > > > no known windows release so far implements DataTableRegion.    
> > > > > unsupported means Windows will BSOD, so it's practically
> > > > > unusable unless MS will patch currently existing Windows
> > > > > versions.    
> > > > 
> > > > Yes. That's why my patch allows patching SSDT without using
> > > > DataTableRegion.
> > > >   
> > > > > Another thing about DataTableRegion is that ACPI tables are
> > > > > supposed to have static content which matches checksum in
> > > > > table the header while you are trying to use it for dynamic
> > > > > data. It would be cleaner/more compatible to teach
> > > > > bios-linker-loader to just allocate memory and patch AML
> > > > > with the allocated address.    
> > > > 
> > > > Yes - if address is static, you need to put it outside
> > > > the table. Can come right before or right after this.
> > > >   
> > > > > Also if OperationRegion() is used, then one has to patch
> > > > > DefOpRegion directly as RegionOffset must be Integer,
> > > > > using variable names is not permitted there.    
> > > > 
> > > > I am not sure the comment was understood correctly.
> > > > The comment says really "we can't use DataTableRegion
> > > > so here is an alternative".  
> > > so how are you going to access data at which patched
> > > NameString point to?
> > > for that you'd need a normal patched OperationRegion
> > > as well since DataTableRegion isn't usable.  
> > 
> > For VMGENID you would patch the method that
> > returns the address - you do not need an op region
> > as you never access it.
> > 
> > I don't know about NVDIMM. Maybe OperationRegion can
> > use the patched NameString? Will need some thought.
> > 
> > > >   
> > > > >     
> > > > > > 
> > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > 
> > > > > > ---
> > > > > > 
> > > > > > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > > > > > index 1b632dc..f8998ea 100644
> > > > > > --- a/include/hw/acpi/aml-build.h
> > > > > > +++ b/include/hw/acpi/aml-build.h
> > > > > > @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> > > > > >  void
> > > > > >  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> > > > > >  
> > > > > > +int
> > > > > > +build_append_named_dword(GArray *array, const char *name_format, ...);
> > > > > > +
> > > > > >  #endif
> > > > > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > > > > > index 0d4b324..7f9fa65 100644
> > > > > > --- a/hw/acpi/aml-build.c
> > > > > > +++ b/hw/acpi/aml-build.c
> > > > > > @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> > > > > >      }
> > > > > >  }
> > > > > >  
> > > > > > +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> > > > > > + * and return the offset to 0x0 for runtime patching.
> > > > > > + *
> > > > > > + * Warning: runtime patching is best avoided. Only use this as
> > > > > > + * a replacement for DataTableRegion (for guests that don't
> > > > > > + * support it).
> > > > > > + */
> > > > > > +int
> > > > > > +build_append_named_qword(GArray *array, const char *name_format, ...)
> > > > > > +{
> > > > > > +    int offset;
> > > > > > +    va_list ap;
> > > > > > +
> > > > > > +    va_start(ap, name_format);
> > > > > > +    build_append_namestringv(array, name_format, ap);
> > > > > > +    va_end(ap);
> > > > > > +
> > > > > > +    build_append_byte(array, 0x0E); /* QWordPrefix */
> > > > > > +
> > > > > > +    offset = array->len;
> > > > > > +    build_append_int_noprefix(array, 0x0, 8);
> > > > > > +    assert(array->len == offset + 8);
> > > > > > +
> > > > > > +    return offset;
> > > > > > +}
> > > > > > +
> > > > > >  static GPtrArray *alloc_list;
> > > > > >  
> > > > > >  static Aml *aml_alloc(void)
> > > > > > 
> > > > > >     
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html  
> > 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] How to reserve guest physical region for ACPI
  2016-01-07 10:54               ` Michael S. Tsirkin
@ 2016-01-07 13:42                 ` Igor Mammedov
  2016-01-07 17:11                   ` Laszlo Ersek
  2016-01-07 17:08                 ` Laszlo Ersek
  1 sibling, 1 reply; 59+ messages in thread
From: Igor Mammedov @ 2016-01-07 13:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Xiao Guangrong, ehabkost, kvm, gleb, mtosatti, qemu-devel,
	stefanha, pbonzini, dan.j.williams, Laszlo Ersek, rth

On Thu, 7 Jan 2016 12:54:30 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jan 07, 2016 at 11:30:25AM +0100, Igor Mammedov wrote:
> > On Tue, 5 Jan 2016 18:43:02 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:  
> > > > > > bios-linker-loader is a great interface for initializing some
> > > > > > guest owned data and linking it together but I think it adds
> > > > > > unnecessary complexity and is misused if it's used to handle
> > > > > > device owned data/on device memory in this and VMGID cases.      
> > > > > 
> > > > > I want a generic interface for guest to enumerate these things.  linker
> > > > > seems quite reasonable but if you see a reason why it won't do, or want
> > > > > to propose a better interface, fine.
> > > > > 
> > > > > PCI would do, too - though windows guys had concerns about
> > > > > returning PCI BARs from ACPI.    
> > > > There were potential issues with pSeries bootloader that treated
> > > > PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
> > > > Could you point out to discussion about windows issues?
> > > > 
> > > > What VMGEN patches that used PCI for mapping purposes were
> > > > stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
> > > > class id but we couldn't agree on it.
> > > > 
> > > > VMGEN v13 with full discussion is here
> > > > https://patchwork.ozlabs.org/patch/443554/
> > > > So to continue with this route we would need to pick some other
> > > > driver less class id so windows won't prompt for driver or
> > > > maybe supply our own driver stub to guarantee that no one
> > > > would touch it. Any suggestions?    
> > > 
> > > Pick any device/vendor id pair for which windows specifies no driver.
> > > There's a small risk that this will conflict with some
> > > guest but I think it's minimal.  
> > device/vendor id pair was QEMU specific so doesn't conflicts with anything
> > issue we were trying to solve was to prevent windows asking for driver
> > even though it does so only once if told not to ask again.
> > 
> > That's why PCI_CLASS_MEMORY_RAM was selected as it's generic driver-less
> > device descriptor in INF file which matches as the last resort if
> > there isn't any other diver that's matched device by device/vendor id pair.  
> 
> I think this is the only class in this inf.
> If you can't use it, you must use an existing device/vendor id pair,
> there's some risk involved but probably not much.
I can't wrap my head around this answer, could you rephrase it?

As far as I can see, we can use PCI_CLASS_MEMORY_RAM with QEMU's device/vendor ids.
In that case Windows associates the device with a dummy "Generic RAM controller".

The same happens with some NVIDIA cards when NVIDIA drivers are not installed;
once the drivers are installed, Windows binds NVIDIA's PCI_CLASS_MEMORY_RAM device
to a concrete driver that manages VRAM the way NVIDIA wants it.

So I think we can use it with low risk.
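
For illustration, here is a minimal sketch (not from any posted series) of
what such a driver-less device could look like on the QEMU side; the
"vmgen-dev" name, the device id and the BAR size are made up for the
example, and the realize hook / helper signatures are only approximate
since they changed between QEMU versions:

    /* Sketch only: a PCI device exposing on-device memory through a BAR,
     * advertised as PCI_CLASS_MEMORY_RAM so Windows binds it to its
     * generic, driver-less "RAM controller" INF entry.
     */
    #include "hw/pci/pci.h"
    #include "exec/memory.h"

    typedef struct VmgenDevState {
        PCIDevice parent_obj;
        MemoryRegion bar;              /* on-device memory, not guest RAM */
    } VmgenDevState;

    static void vmgen_dev_realize(PCIDevice *dev, Error **errp)
    {
        VmgenDevState *s = DO_UPCAST(VmgenDevState, parent_obj, dev);

        memory_region_init_ram(&s->bar, OBJECT(dev), "vmgen-bar", 4096, errp);
        pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar);
    }

    static void vmgen_dev_class_init(ObjectClass *klass, void *data)
    {
        PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);

        k->realize   = vmgen_dev_realize;
        k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET; /* example ids only */
        k->device_id = 0x10f0;                        /* hypothetical */
        k->class_id  = PCI_CLASS_MEMORY_RAM;  /* "Generic RAM controller" */
    }

    static const TypeInfo vmgen_dev_info = {
        .name          = "vmgen-dev",             /* hypothetical name */
        .parent        = TYPE_PCI_DEVICE,
        .instance_size = sizeof(VmgenDevState),
        .class_init    = vmgen_dev_class_init,
    };

    static void vmgen_dev_register_types(void)
    {
        type_register_static(&vmgen_dev_info);
    }
    type_init(vmgen_dev_register_types)

The BAR contents would become reachable once the guest OS assigns the BAR,
without consuming any of the guest's usable RAM.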

If we use an existing device/vendor id pair that already has a driver, the
driver will fail to initialize and, at a minimum, the device would be marked
as not working in Device Manager. Anyway, if you have a concrete existing
device/vendor id pair in mind, feel free to suggest it.

> 
> > > 
> > >   
> > > > > 
> > > > >     
> > > > > > There was RFC on list to make BIOS boot from NVDIMM already
> > > > > > doing some ACPI table lookup/parsing. Now if they were forced
> > > > > > to also parse and execute AML to initialize QEMU with guest
> > > > > > allocated address that would complicate them quite a bit.      
> > > > > 
> > > > > If they just need to find a table by name, it won't be
> > > > > too bad, would it?    
> > > > that's what they were doing scanning memory for static NVDIMM table.
> > > > However if it were DataTable, BIOS side would have to execute
> > > > AML so that the table address could be told to QEMU.    
> > > 
> > > Not at all. You can find any table by its signature without
> > > parsing AML.  
> > yep, and then BIOS would need to tell its address to QEMU
> > writing to IO port which is allocated statically in QEMU
> > for this purpose and is described in AML only on guest side.  
> 
> io ports are an ABI too but they are way easier to
> maintain.
It's pretty much the same as GPA addresses, except that IO ports are a much more limited resource.
Otherwise one has to do the same tricks to maintain the ABI.

> 
> > > 
> > >   
> > > > In case of direct mapping or PCI BAR there is no need to initialize
> > > > QEMU side from AML.
> > > > That also saves us IO port where this address should be written
> > > > if bios-linker-loader approach is used.
> > > >     
> > > > >     
> > > > > > While with NVDIMM control memory region mapped directly by QEMU,
> > > > > > respective patches don't need in any way to initialize QEMU,
> > > > > > all they would need just read necessary data from control region.
> > > > > > 
> > > > > > Also using bios-linker-loader takes away some usable RAM
> > > > > > from guest and in the end that doesn't scale,
> > > > > > the more devices I add the less usable RAM is left for guest OS
> > > > > > while all the device needs is a piece of GPA address space
> > > > > > that would belong to it.      
> > > > > 
> > > > > I don't get this comment. I don't think it's MMIO that is wanted.
> > > > > If it's backed by qemu virtual memory then it's RAM.    
> > > > Then why don't allocate video card VRAM the same way and try to explain
> > > > user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
> > > > only has 64Mb of available RAM because of we think that on device VRAM
> > > > is also RAM.
> > > > 
> > > > Maybe I've used MMIO term wrongly here but it roughly reflects the idea
> > > > that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
> > > > area) is not allocated from guest's usable RAM (as described in E820)
> > > > but rather directly mapped in guest's GPA and doesn't consume available
> > > > RAM as guest sees it. That's also the way it's done on real hardware.
> > > > 
> > > > What we need in case of VMGEN ID and NVDIMM is on device memory
> > > > that could be directly accessed by guest.
> > > > Both direct mapping or PCI BAR do that job and we could use simple
> > > > static AML without any patching.    
> > > 
> > > At least with VMGEN the issue is that there's an AML method
> > > that returns the physical address.
> > > Then if guest OS moves the BAR (which is legal), it will break
> > > since caller has no way to know it's related to the BAR.  
> > I've found a following MS doc "Firmware Allocation of PCI Device Resources in Windows". It looks like when MS implemented resource rebalancing in
> > Vista they pushed a compat change to PCI specs.
> > That ECN is called "Ignore PCI Boot Configuration_DSM Function"
> > and can be found here:
> > https://pcisig.com/sites/default/files/specification_documents/ECR-Ignorebootconfig-final.pdf
> > 
> > It looks like it's possible to forbid rebalancing per
> > device/bridge if it has _DMS method that returns "do not
> > ignore the boot configuration of PCI resources".  
> 
> I'll have to study this but we don't want that
> globally, do we?
No need to do it globally; adding a _DSM to a device we don't wish
to be rebalanced should be sufficient to lock down specific resources.

Actually, the existence of the spec implies that if there is a boot-configured
device with resources described in an ACPI table and there isn't a _DSM
method enabling rebalancing for it, then rebalancing is not permitted.
It should be easy to run an experiment to verify what Windows actually does.
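
To make this concrete, here is a rough sketch (not from any posted patch)
of how such a per-device _DSM could be generated with QEMU's AML builder.
The UUID is the PCI firmware _DSM UUID and function index 5 is the "Ignore
PCI Boot Configuration" function from the ECN; the exact return-value
encoding should be checked against that document, and the aml_* signatures
shown are only approximate (aml_method(), for instance, later grew a
serialization flag):

    #include "hw/acpi/aml-build.h"

    /* Sketch: _DSM asking the OS to keep the boot-time resource
     * assignment of this device, i.e. not to rebalance it.
     */
    static Aml *build_keep_boot_config_dsm(void)
    {
        Aml *method, *ifuuid, *ifquery, *iffunc;
        uint8_t supported[] = { 0x21 };   /* functions 0 and 5 */
        uint8_t none[]      = { 0x00 };

        method = aml_method("_DSM", 4);   /* Arg0..Arg3 */

        ifuuid = aml_if(aml_equal(aml_arg(0),
                        aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D")));

        /* function 0: bitmap of implemented functions */
        ifquery = aml_if(aml_equal(aml_arg(2), aml_int(0)));
        aml_append(ifquery,
                   aml_return(aml_buffer(sizeof(supported), supported)));
        aml_append(ifuuid, ifquery);

        /* function 5: returning 0 is read as "keep the boot config" */
        iffunc = aml_if(aml_equal(aml_arg(2), aml_int(5)));
        aml_append(iffunc, aml_return(aml_int(0)));
        aml_append(ifuuid, iffunc);

        aml_append(method, ifuuid);

        /* unknown UUID or function index: nothing supported */
        aml_append(method, aml_return(aml_buffer(sizeof(none), none)));
        return method;
    }

The device's AML block would then simply get this method appended next to
its _ADR.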

So if this approach works and we agree on going with it, I could work
on redoing the VMGEN v13 series using _DSM as described.
That would simplify implementing this kind of device compared to the
bios-linker approach, i.e.:
 - free the RAM occupied by the linker blob
 - free the IO port
 - avoid 2 or 3 layers of indirection, which makes the code much easier
   to understand
 - avoid runtime AML patching and simplify the AML and its composing parts
 - there won't be any need for the BIOS to get an IO port from fw_cfg and
   write the GPA there, and no need for a table lookup either
 - much easier to write unit tests, i.e. use the same qtest device testing
   technique without the need to run actual guest code
   (no binary code blobs like we have for running the bios-tables test in
   TCG mode)


> This restricts hotplug functionality significantly.
> 
> >    
> > > > > > > 
> > > > > > > See patch at the bottom that might be handy.
> > > > > > >       
> > > > > > > > he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
> > > > > > > > | when writing ASL one shall make sure that only XP supported
> > > > > > > > | features are in global scope, which is evaluated when tables
> > > > > > > > | are loaded and features of rev2 and higher are inside methods.
> > > > > > > > | That way XP doesn't crash as far as it doesn't evaluate unsupported
> > > > > > > > | opcode and one can guard those opcodes checking _REV object if neccesary.
> > > > > > > > (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)      
> > > > > > > 
> > > > > > > Yes, this technique works.
> > > > > > > 
> > > > > > > An alternative is to add an XSDT, XP ignores that.
> > > > > > > XSDT at the moment breaks OVMF (because it loads both
> > > > > > > the RSDT and the XSDT, which is wrong), but I think
> > > > > > > Laszlo was working on a fix for that.      
> > > > > > Using XSDT would increase ACPI tables occupied RAM
> > > > > > as it would duplicate DSDT + non XP supported AML
> > > > > > at global namespace.      
> > > > > 
> > > > > Not at all - I posted patches linking to same
> > > > > tables from both RSDT and XSDT at some point.
> > > > > Only the list of pointers would be different.    
> > > > if you put XP incompatible AML in separate SSDT and link it
> > > > only from XSDT than that would work but if incompatibility
> > > > is in DSDT, one would have to provide compat DSDT for RSDT
> > > > an incompat DSDT for XSDT.    
> > > 
> > > So don't do this.  
> > well spec says "An ACPI-compatible OS must use the XSDT if present",
> > which I read as tables pointed by RSDT MUST be pointed by XSDT
> > as well and RSDT MUST NOT not be used.
> >
> > so if we put incompatible changes in a separate SSDT and put
> > it only in XSDT that might work. Showstopper here is OVMF which
> > has issues with it as Laszlo pointed out.  
> 
> But that's just a bug.
> 
> > Also since Windows implements only subset of spec XSDT trick
> > would cover only XP based versions while the rest will see and
> > use XSDT pointed tables which still could have incompatible
> > AML with some of the later windows versions.  
> 
> We'll have to see what these are exactly.
> If it's methods in SSDT we can check the version supported
> by the ASPM.
I see only VAR_PACKAGE as such an object so far.

A 64-bit PCI0._CRS probably won't crash 32-bit Vista and later, as they
should be able to parse 64-bit Integers as defined by ACPI 2.0.

> 
> >   
> > >   
> > > > So far policy was don't try to run guest OS on QEMU
> > > > configuration that isn't supported by it.    
> > > 
> > > It's better if guests don't see some features but
> > > don't crash. It's not always possible of course but
> > > we should try to avoid this.
> > >   
> > > > For example we use VAR_PACKAGE when running with more
> > > > than 255 VCPUs (commit b4f4d5481) which BSODs XP.    
> > > 
> > > Yes. And it's because we violate the spec, DSDT
> > > should not have this stuff.
> > >   
> > > > So we can continue with that policy with out resorting to
> > > > using both RSDT and XSDT,
> > > > It would be even easier as all AML would be dynamically
> > > > generated and DSDT would only contain AML elements for
> > > > a concrete QEMU configuration.    
> > > 
> > > I'd prefer XSDT but I won't nack it if you do it in DSDT.
> > > I think it's not spec compliant but guests do not
> > > seem to care.
> > >   
> > > > > > So far we've managed keep DSDT compatible with XP while
> > > > > > introducing features from v2 and higher ACPI revisions as
> > > > > > AML that is only evaluated on demand.
> > > > > > We can continue doing so unless we have to unconditionally
> > > > > > add incompatible AML at global scope.
> > > > > >       
> > > > > 
> > > > > Yes.
> > > > >     
> > > > > > >       
> > > > > > > > Michael, Paolo, what do you think about these ideas?
> > > > > > > > 
> > > > > > > > Thanks!      
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > So using a patch below, we can add Name(PQRS, 0x0) at the top of the
> > > > > > > SSDT (or bottom, or add a separate SSDT just for that).  It returns the
> > > > > > > current offset so we can add that to the linker.
> > > > > > > 
> > > > > > > Won't work if you append the Name to the Aml structure (these can be
> > > > > > > nested to arbitrary depth using aml_append), so using plain GArray for
> > > > > > > this API makes sense to me.
> > > > > > >       
> > > > > > > --->      
> > > > > > > 
> > > > > > > acpi: add build_append_named_dword, returning an offset in buffer
> > > > > > > 
> > > > > > > This is a very limited form of support for runtime patching -
> > > > > > > similar in functionality to what we can do with ACPI_EXTRACT
> > > > > > > macros in python, but implemented in C.
> > > > > > > 
> > > > > > > This is to allow ACPI code direct access to data tables -
> > > > > > > which is exactly what DataTableRegion is there for, except
> > > > > > > no known windows release so far implements DataTableRegion.      
> > > > > > unsupported means Windows will BSOD, so it's practically
> > > > > > unusable unless MS will patch currently existing Windows
> > > > > > versions.      
> > > > > 
> > > > > Yes. That's why my patch allows patching SSDT without using
> > > > > DataTableRegion.
> > > > >     
> > > > > > Another thing about DataTableRegion is that ACPI tables are
> > > > > > supposed to have static content which matches checksum in
> > > > > > table the header while you are trying to use it for dynamic
> > > > > > data. It would be cleaner/more compatible to teach
> > > > > > bios-linker-loader to just allocate memory and patch AML
> > > > > > with the allocated address.      
> > > > > 
> > > > > Yes - if address is static, you need to put it outside
> > > > > the table. Can come right before or right after this.
> > > > >     
> > > > > > Also if OperationRegion() is used, then one has to patch
> > > > > > DefOpRegion directly as RegionOffset must be Integer,
> > > > > > using variable names is not permitted there.      
> > > > > 
> > > > > I am not sure the comment was understood correctly.
> > > > > The comment says really "we can't use DataTableRegion
> > > > > so here is an alternative".    
> > > > so how are you going to access data at which patched
> > > > NameString point to?
> > > > for that you'd need a normal patched OperationRegion
> > > > as well since DataTableRegion isn't usable.    
> > > 
> > > For VMGENID you would patch the method that
> > > returns the address - you do not need an op region
> > > as you never access it.
> > > 
> > > I don't know about NVDIMM. Maybe OperationRegion can
> > > use the patched NameString? Will need some thought.
> > >   
> > > > >     
> > > > > >       
> > > > > > > 
> > > > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > 
> > > > > > > ---
> > > > > > > 
> > > > > > > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > > > > > > index 1b632dc..f8998ea 100644
> > > > > > > --- a/include/hw/acpi/aml-build.h
> > > > > > > +++ b/include/hw/acpi/aml-build.h
> > > > > > > @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
> > > > > > >  void
> > > > > > >  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
> > > > > > >  
> > > > > > > +int
> > > > > > > +build_append_named_dword(GArray *array, const char *name_format, ...);
> > > > > > > +
> > > > > > >  #endif
> > > > > > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > > > > > > index 0d4b324..7f9fa65 100644
> > > > > > > --- a/hw/acpi/aml-build.c
> > > > > > > +++ b/hw/acpi/aml-build.c
> > > > > > > @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
> > > > > > >      }
> > > > > > >  }
> > > > > > >  
> > > > > > > +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
> > > > > > > + * and return the offset to 0x0 for runtime patching.
> > > > > > > + *
> > > > > > > + * Warning: runtime patching is best avoided. Only use this as
> > > > > > > + * a replacement for DataTableRegion (for guests that don't
> > > > > > > + * support it).
> > > > > > > + */
> > > > > > > +int
> > > > > > > +build_append_named_qword(GArray *array, const char *name_format, ...)
> > > > > > > +{
> > > > > > > +    int offset;
> > > > > > > +    va_list ap;
> > > > > > > +
> > > > > > > +    va_start(ap, name_format);
> > > > > > > +    build_append_namestringv(array, name_format, ap);
> > > > > > > +    va_end(ap);
> > > > > > > +
> > > > > > > +    build_append_byte(array, 0x0E); /* QWordPrefix */
> > > > > > > +
> > > > > > > +    offset = array->len;
> > > > > > > +    build_append_int_noprefix(array, 0x0, 8);
> > > > > > > +    assert(array->len == offset + 8);
> > > > > > > +
> > > > > > > +    return offset;
> > > > > > > +}
> > > > > > > +
> > > > > > >  static GPtrArray *alloc_list;
> > > > > > >  
> > > > > > >  static Aml *aml_alloc(void)
> > > > > > > 
> > > > > > >       
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html    
> > >   


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-04 20:17           ` [Qemu-devel] " Laszlo Ersek
@ 2016-01-07 13:51             ` Igor Mammedov
  -1 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2016-01-07 13:51 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Michael S. Tsirkin, Xiao Guangrong, pbonzini, gleb, mtosatti,
	stefanha, rth, ehabkost, dan.j.williams, kvm, qemu-devel

On Mon, 4 Jan 2016 21:17:31 +0100
Laszlo Ersek <lersek@redhat.com> wrote:

> Michael CC'd me on the grandparent of the email below. I'll try to add
> my thoughts in a single go, with regard to OVMF.
> 
> On 12/30/15 20:52, Michael S. Tsirkin wrote:
> > On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
> >> On Mon, 28 Dec 2015 14:50:15 +0200
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>  
> >>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
> >>>>
> >>>> Hi Michael, Paolo,
> >>>>
> >>>> Now it is the time to return to the challenge that how to reserve guest
> >>>> physical region internally used by ACPI.
> >>>>
> >>>> Igor suggested that:
> >>>> | An alternative place to allocate reserve from could be high memory.
> >>>> | For pc we have "reserved-memory-end" which currently makes sure
> >>>> | that hotpluggable memory range isn't used by firmware
> >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)  
> 
> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
> reason is that nobody wrote that patch, nor asked for the patch to be
> written. (Not implying that just requesting the patch would be
> sufficient for the patch to be written.)
Hijacking this part of the thread to check whether OVMF would work with
memory hotplug and whether it needs "reserved-memory-end" support at all.

How does OVMF determine which GPA ranges to use for initializing PCI BARs
at boot time, more specifically 64-bit BARs?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] How to reserve guest physical region for ACPI
  2016-01-07 10:54               ` Michael S. Tsirkin
  2016-01-07 13:42                 ` Igor Mammedov
@ 2016-01-07 17:08                 ` Laszlo Ersek
  1 sibling, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-07 17:08 UTC (permalink / raw)
  To: Michael S. Tsirkin, Igor Mammedov
  Cc: Xiao Guangrong, ehabkost, kvm, gleb, mtosatti, qemu-devel,
	stefanha, pbonzini, dan.j.williams, rth

On 01/07/16 11:54, Michael S. Tsirkin wrote:
> On Thu, Jan 07, 2016 at 11:30:25AM +0100, Igor Mammedov wrote:
>> On Tue, 5 Jan 2016 18:43:02 +0200
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:

...

>>>>>>> An alternative is to add an XSDT, XP ignores that.
>>>>>>> XSDT at the moment breaks OVMF (because it loads both
>>>>>>> the RSDT and the XSDT, which is wrong), but I think
>>>>>>> Laszlo was working on a fix for that.    
>>>>>> Using XSDT would increase ACPI tables occupied RAM
>>>>>> as it would duplicate DSDT + non XP supported AML
>>>>>> at global namespace.    
>>>>>
>>>>> Not at all - I posted patches linking to same
>>>>> tables from both RSDT and XSDT at some point.
>>>>> Only the list of pointers would be different.  
>>>> if you put XP incompatible AML in separate SSDT and link it
>>>> only from XSDT than that would work but if incompatibility
>>>> is in DSDT, one would have to provide compat DSDT for RSDT
>>>> an incompat DSDT for XSDT.  
>>>
>>> So don't do this.
>> well spec says "An ACPI-compatible OS must use the XSDT if present",
>> which I read as tables pointed by RSDT MUST be pointed by XSDT
>> as well and RSDT MUST NOT not be used.
>>
>> so if we put incompatible changes in a separate SSDT and put
>> it only in XSDT that might work. Showstopper here is OVMF which
>> has issues with it as Laszlo pointed out.
> 
> But that's just a bug.

Yes, but the bug (actually: lack of feature) is in the UEFI
specification. The current EFI_ACPI_TABLE_PROTOCOL implementation in
edk2 conforms to the specification. In order to expose the functionality
that the above trick needs, the UEFI spec has to be changed. In my
(limited, admittedly) experience, that's an uphill battle.

[...]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [Qemu-devel] How to reserve guest physical region for ACPI
  2016-01-07 13:42                 ` Igor Mammedov
@ 2016-01-07 17:11                   ` Laszlo Ersek
  0 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-07 17:11 UTC (permalink / raw)
  To: Igor Mammedov, Michael S. Tsirkin
  Cc: Xiao Guangrong, ehabkost, kvm, gleb, mtosatti, qemu-devel,
	stefanha, pbonzini, dan.j.williams, rth

On 01/07/16 14:42, Igor Mammedov wrote:
> On Thu, 7 Jan 2016 12:54:30 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>> On Thu, Jan 07, 2016 at 11:30:25AM +0100, Igor Mammedov wrote:
>>> On Tue, 5 Jan 2016 18:43:02 +0200
>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>   
>>>> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:  
>>>>>>> bios-linker-loader is a great interface for initializing some
>>>>>>> guest owned data and linking it together but I think it adds
>>>>>>> unnecessary complexity and is misused if it's used to handle
>>>>>>> device owned data/on device memory in this and VMGID cases.      
>>>>>>
>>>>>> I want a generic interface for guest to enumerate these things.  linker
>>>>>> seems quite reasonable but if you see a reason why it won't do, or want
>>>>>> to propose a better interface, fine.
>>>>>>
>>>>>> PCI would do, too - though windows guys had concerns about
>>>>>> returning PCI BARs from ACPI.    
>>>>> There were potential issues with pSeries bootloader that treated
>>>>> PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
>>>>> Could you point out to discussion about windows issues?
>>>>>
>>>>> What VMGEN patches that used PCI for mapping purposes were
>>>>> stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
>>>>> class id but we couldn't agree on it.
>>>>>
>>>>> VMGEN v13 with full discussion is here
>>>>> https://patchwork.ozlabs.org/patch/443554/
>>>>> So to continue with this route we would need to pick some other
>>>>> driver less class id so windows won't prompt for driver or
>>>>> maybe supply our own driver stub to guarantee that no one
>>>>> would touch it. Any suggestions?    
>>>>
>>>> Pick any device/vendor id pair for which windows specifies no driver.
>>>> There's a small risk that this will conflict with some
>>>> guest but I think it's minimal.  
>>> device/vendor id pair was QEMU specific so doesn't conflicts with anything
>>> issue we were trying to solve was to prevent windows asking for driver
>>> even though it does so only once if told not to ask again.
>>>
>>> That's why PCI_CLASS_MEMORY_RAM was selected as it's generic driver-less
>>> device descriptor in INF file which matches as the last resort if
>>> there isn't any other diver that's matched device by device/vendor id pair.  
>>
>> I think this is the only class in this inf.
>> If you can't use it, you must use an existing device/vendor id pair,
>> there's some risk involved but probably not much.
> I can't wrap my head around this answer, could you rephrase it?
> 
> As far as I see we can use PCI_CLASS_MEMORY_RAM with qemu's device/vendor ids.
> In that case Windows associates it with dummy "Generic RAM controller".
> 
> The same happens with some NVIDIA cards if NVIDIA drivers are not installed,
> if we install drivers then Windows binds NVIDIA's PCI_CLASS_MEMORY_RAM with
> concrete driver that manages VRAM the way NVIDIA wants it.
> 
> So I think we can use it with low risk.
> 
> If we use existing device/vendor id pair with some driver then driver
> will fail to initialize and as minimum we would get device marked as
> not working in Device-Manager. Any way if you have in mind a concrete
> existing device/vendor id pair feel free to suggest it.
> 
>>
>>>>
>>>>   
>>>>>>
>>>>>>     
>>>>>>> There was RFC on list to make BIOS boot from NVDIMM already
>>>>>>> doing some ACPI table lookup/parsing. Now if they were forced
>>>>>>> to also parse and execute AML to initialize QEMU with guest
>>>>>>> allocated address that would complicate them quite a bit.      
>>>>>>
>>>>>> If they just need to find a table by name, it won't be
>>>>>> too bad, would it?    
>>>>> that's what they were doing scanning memory for static NVDIMM table.
>>>>> However if it were DataTable, BIOS side would have to execute
>>>>> AML so that the table address could be told to QEMU.    
>>>>
>>>> Not at all. You can find any table by its signature without
>>>> parsing AML.  
>>> yep, and then BIOS would need to tell its address to QEMU
>>> writing to IO port which is allocated statically in QEMU
>>> for this purpose and is described in AML only on guest side.  
>>
>> io ports are an ABI too but they are way easier to
>> maintain.
> It's pretty much the same as GPA addresses only it's much more limited resource.
> Otherwise one has to do the same tricks to maintain ABI.
> 
>>
>>>>
>>>>   
>>>>> In case of direct mapping or PCI BAR there is no need to initialize
>>>>> QEMU side from AML.
>>>>> That also saves us IO port where this address should be written
>>>>> if bios-linker-loader approach is used.
>>>>>     
>>>>>>     
>>>>>>> While with NVDIMM control memory region mapped directly by QEMU,
>>>>>>> respective patches don't need in any way to initialize QEMU,
>>>>>>> all they would need just read necessary data from control region.
>>>>>>>
>>>>>>> Also using bios-linker-loader takes away some usable RAM
>>>>>>> from guest and in the end that doesn't scale,
>>>>>>> the more devices I add the less usable RAM is left for guest OS
>>>>>>> while all the device needs is a piece of GPA address space
>>>>>>> that would belong to it.      
>>>>>>
>>>>>> I don't get this comment. I don't think it's MMIO that is wanted.
>>>>>> If it's backed by qemu virtual memory then it's RAM.    
>>>>> Then why don't allocate video card VRAM the same way and try to explain
>>>>> user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
>>>>> only has 64Mb of available RAM because of we think that on device VRAM
>>>>> is also RAM.
>>>>>
>>>>> Maybe I've used MMIO term wrongly here but it roughly reflects the idea
>>>>> that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
>>>>> area) is not allocated from guest's usable RAM (as described in E820)
>>>>> but rather directly mapped in guest's GPA and doesn't consume available
>>>>> RAM as guest sees it. That's also the way it's done on real hardware.
>>>>>
>>>>> What we need in case of VMGEN ID and NVDIMM is on device memory
>>>>> that could be directly accessed by guest.
>>>>> Both direct mapping or PCI BAR do that job and we could use simple
>>>>> static AML without any patching.    
>>>>
>>>> At least with VMGEN the issue is that there's an AML method
>>>> that returns the physical address.
>>>> Then if guest OS moves the BAR (which is legal), it will break
>>>> since caller has no way to know it's related to the BAR.  
>>> I've found a following MS doc "Firmware Allocation of PCI Device Resources in Windows". It looks like when MS implemented resource rebalancing in
>>> Vista they pushed a compat change to PCI specs.
>>> That ECN is called "Ignore PCI Boot Configuration_DSM Function"
>>> and can be found here:
>>> https://pcisig.com/sites/default/files/specification_documents/ECR-Ignorebootconfig-final.pdf
>>>
>>> It looks like it's possible to forbid rebalancing per
>>> device/bridge if it has _DMS method that returns "do not
>>> ignore the boot configuration of PCI resources".  
>>
>> I'll have to study this but we don't want that
>> globally, do we?
> no need to do it globally, adding _DSM to a device, we don't wish
> to be rebalanced, should be sufficient to lock down specific resources.
> 
> actually existence of spec implies that if there is a boot configured
> device with resources described in ACPI table and there isn't _DSM
> method enabling rebalancing for it, then rebalancing is not permitted.
> It should be easy to make an experiment to verify what Windows would do.
> 
> So if this approach would work and we agree on going with it, I could work
> on redoing VMGENv13 series using _DSM as described.
> That would simplify implementing this kind of devices vs bios-linker approach i.e.:
>  - free RAM occupied by linker blob
>  - free IO port
>  - avoid 2 or 3 layers of indirection - which makes understanding of code much easier
>  - avoid runtime AML patching and simplify AML and its composing parts
>  - there won't be need for BIOS to get IO port from fw_cfg and write
>    there GPA as well no need for table lookup.
>  - much easier to write unit tests, i.e. use the same qtest device testing
>    technique without necessity of running actual guest code.
>    i.e. no binary code blobs like we have for running bios-tables test in TCG mode.

No objections on my part. If it works, it works for me!

Laszlo


> 
> 
>> This restricts hotplug functionality significantly.
>>
>>>    
>>>>>>>>
>>>>>>>> See patch at the bottom that might be handy.
>>>>>>>>       
>>>>>>>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
>>>>>>>>> | when writing ASL one shall make sure that only XP supported
>>>>>>>>> | features are in global scope, which is evaluated when tables
>>>>>>>>> | are loaded and features of rev2 and higher are inside methods.
>>>>>>>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
>>>>>>>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
>>>>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)      
>>>>>>>>
>>>>>>>> Yes, this technique works.
>>>>>>>>
>>>>>>>> An alternative is to add an XSDT, XP ignores that.
>>>>>>>> XSDT at the moment breaks OVMF (because it loads both
>>>>>>>> the RSDT and the XSDT, which is wrong), but I think
>>>>>>>> Laszlo was working on a fix for that.      
>>>>>>> Using XSDT would increase ACPI tables occupied RAM
>>>>>>> as it would duplicate DSDT + non XP supported AML
>>>>>>> at global namespace.      
>>>>>>
>>>>>> Not at all - I posted patches linking to same
>>>>>> tables from both RSDT and XSDT at some point.
>>>>>> Only the list of pointers would be different.    
>>>>> if you put XP incompatible AML in separate SSDT and link it
>>>>> only from XSDT than that would work but if incompatibility
>>>>> is in DSDT, one would have to provide compat DSDT for RSDT
>>>>> an incompat DSDT for XSDT.    
>>>>
>>>> So don't do this.  
>>> well spec says "An ACPI-compatible OS must use the XSDT if present",
>>> which I read as tables pointed by RSDT MUST be pointed by XSDT
>>> as well and RSDT MUST NOT not be used.
>>>
>>> so if we put incompatible changes in a separate SSDT and put
>>> it only in XSDT that might work. Showstopper here is OVMF which
>>> has issues with it as Laszlo pointed out.  
>>
>> But that's just a bug.
>>
>>> Also since Windows implements only subset of spec XSDT trick
>>> would cover only XP based versions while the rest will see and
>>> use XSDT pointed tables which still could have incompatible
>>> AML with some of the later windows versions.  
>>
>> We'll have to see what these are exactly.
>> If it's methods in SSDT we can check the version supported
>> by the ASPM.
> I see only VAR_PACKAGE as such object so far.
> 
> 64-bit PCI0._CRS probably won't crash 32-bit Vista and later as
> it should be able to parse 64-bit Integers as defined by ACPI 2.0.
> 
>>
>>>   
>>>>   
>>>>> So far policy was don't try to run guest OS on QEMU
>>>>> configuration that isn't supported by it.    
>>>>
>>>> It's better if guests don't see some features but
>>>> don't crash. It's not always possible of course but
>>>> we should try to avoid this.
>>>>   
>>>>> For example we use VAR_PACKAGE when running with more
>>>>> than 255 VCPUs (commit b4f4d5481) which BSODs XP.    
>>>>
>>>> Yes. And it's because we violate the spec, DSDT
>>>> should not have this stuff.
>>>>   
>>>>> So we can continue with that policy with out resorting to
>>>>> using both RSDT and XSDT,
>>>>> It would be even easier as all AML would be dynamically
>>>>> generated and DSDT would only contain AML elements for
>>>>> a concrete QEMU configuration.    
>>>>
>>>> I'd prefer XSDT but I won't nack it if you do it in DSDT.
>>>> I think it's not spec compliant but guests do not
>>>> seem to care.
>>>>   
>>>>>>> So far we've managed keep DSDT compatible with XP while
>>>>>>> introducing features from v2 and higher ACPI revisions as
>>>>>>> AML that is only evaluated on demand.
>>>>>>> We can continue doing so unless we have to unconditionally
>>>>>>> add incompatible AML at global scope.
>>>>>>>       
>>>>>>
>>>>>> Yes.
>>>>>>     
>>>>>>>>       
>>>>>>>>> Michael, Paolo, what do you think about these ideas?
>>>>>>>>>
>>>>>>>>> Thanks!      
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
>>>>>>>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
>>>>>>>> current offset so we can add that to the linker.
>>>>>>>>
>>>>>>>> Won't work if you append the Name to the Aml structure (these can be
>>>>>>>> nested to arbitrary depth using aml_append), so using plain GArray for
>>>>>>>> this API makes sense to me.
>>>>>>>>       
>>>>>>>> --->      
>>>>>>>>
>>>>>>>> acpi: add build_append_named_dword, returning an offset in buffer
>>>>>>>>
>>>>>>>> This is a very limited form of support for runtime patching -
>>>>>>>> similar in functionality to what we can do with ACPI_EXTRACT
>>>>>>>> macros in python, but implemented in C.
>>>>>>>>
>>>>>>>> This is to allow ACPI code direct access to data tables -
>>>>>>>> which is exactly what DataTableRegion is there for, except
>>>>>>>> no known windows release so far implements DataTableRegion.      
>>>>>>> unsupported means Windows will BSOD, so it's practically
>>>>>>> unusable unless MS will patch currently existing Windows
>>>>>>> versions.      
>>>>>>
>>>>>> Yes. That's why my patch allows patching SSDT without using
>>>>>> DataTableRegion.
>>>>>>     
>>>>>>> Another thing about DataTableRegion is that ACPI tables are
>>>>>>> supposed to have static content which matches checksum in
>>>>>>> table the header while you are trying to use it for dynamic
>>>>>>> data. It would be cleaner/more compatible to teach
>>>>>>> bios-linker-loader to just allocate memory and patch AML
>>>>>>> with the allocated address.      
>>>>>>
>>>>>> Yes - if address is static, you need to put it outside
>>>>>> the table. Can come right before or right after this.
>>>>>>     
>>>>>>> Also if OperationRegion() is used, then one has to patch
>>>>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>>>>> using variable names is not permitted there.      
>>>>>>
>>>>>> I am not sure the comment was understood correctly.
>>>>>> The comment says really "we can't use DataTableRegion
>>>>>> so here is an alternative".    
>>>>> so how are you going to access data at which patched
>>>>> NameString point to?
>>>>> for that you'd need a normal patched OperationRegion
>>>>> as well since DataTableRegion isn't usable.    
>>>>
>>>> For VMGENID you would patch the method that
>>>> returns the address - you do not need an op region
>>>> as you never access it.
>>>>
>>>> I don't know about NVDIMM. Maybe OperationRegion can
>>>> use the patched NameString? Will need some thought.
>>>>   
>>>>>>     
>>>>>>>       
>>>>>>>>
>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
>>>>>>>> index 1b632dc..f8998ea 100644
>>>>>>>> --- a/include/hw/acpi/aml-build.h
>>>>>>>> +++ b/include/hw/acpi/aml-build.h
>>>>>>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>>>>>>>>  void
>>>>>>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>>>>>>>>  
>>>>>>>> +int
>>>>>>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
>>>>>>>> +
>>>>>>>>  #endif
>>>>>>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
>>>>>>>> index 0d4b324..7f9fa65 100644
>>>>>>>> --- a/hw/acpi/aml-build.c
>>>>>>>> +++ b/hw/acpi/aml-build.c
>>>>>>>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
>>>>>>>>      }
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
>>>>>>>> + * and return the offset to 0x0 for runtime patching.
>>>>>>>> + *
>>>>>>>> + * Warning: runtime patching is best avoided. Only use this as
>>>>>>>> + * a replacement for DataTableRegion (for guests that don't
>>>>>>>> + * support it).
>>>>>>>> + */
>>>>>>>> +int
>>>>>>>> +build_append_named_qword(GArray *array, const char *name_format, ...)
>>>>>>>> +{
>>>>>>>> +    int offset;
>>>>>>>> +    va_list ap;
>>>>>>>> +
>>>>>>>> +    va_start(ap, name_format);
>>>>>>>> +    build_append_namestringv(array, name_format, ap);
>>>>>>>> +    va_end(ap);
>>>>>>>> +
>>>>>>>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
>>>>>>>> +
>>>>>>>> +    offset = array->len;
>>>>>>>> +    build_append_int_noprefix(array, 0x0, 8);
>>>>>>>> +    assert(array->len == offset + 8);
>>>>>>>> +
>>>>>>>> +    return offset;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static GPtrArray *alloc_list;
>>>>>>>>  
>>>>>>>>  static Aml *aml_alloc(void)
>>>>>>>>
>>>>>>>>       
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html    
>>>>   
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-07 13:51             ` [Qemu-devel] " Igor Mammedov
@ 2016-01-07 17:33               ` Laszlo Ersek
  -1 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-07 17:33 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael S. Tsirkin, Xiao Guangrong, pbonzini, gleb, mtosatti,
	stefanha, rth, ehabkost, dan.j.williams, kvm, qemu-devel,
	Marcel Apfelbaum

On 01/07/16 14:51, Igor Mammedov wrote:
> On Mon, 4 Jan 2016 21:17:31 +0100
> Laszlo Ersek <lersek@redhat.com> wrote:
> 
>> Michael CC'd me on the grandparent of the email below. I'll try to add
>> my thoughts in a single go, with regard to OVMF.
>>
>> On 12/30/15 20:52, Michael S. Tsirkin wrote:
>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:  
>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>  
>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:  
>>>>>>
>>>>>> Hi Michael, Paolo,
>>>>>>
>>>>>> Now it is the time to return to the challenge that how to reserve guest
>>>>>> physical region internally used by ACPI.
>>>>>>
>>>>>> Igor suggested that:
>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)  
>>
>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
>> reason is that nobody wrote that patch, nor asked for the patch to be
>> written. (Not implying that just requesting the patch would be
>> sufficient for the patch to be written.)
> Hijacking this part of thread to check if OVMF would work with memory-hotplug
> and if it needs "reserved-memory-end" support at all.
> 
> How OVMF determines which GPA ranges to use for initializing PCI BARs
> at boot time,

I'm glad you asked this question. This is an utterly sordid area that goes back quite a bit. We've discussed it several times in the past; for example: if you recall the "etc/pci-info" discussion...

The fact is, OVMF has no way to dynamically determine the PCI MMIO aperture to allocate BARs from. (Obviously parsing AML is out of the question, especially at the early stage of the firmware where this information would be needed. Plus, that would be a chicken-and-egg problem anyway: QEMU composes the _CRS in the AML *based on* the enumeration that was completed by the guest.)

Search "OvmfPkg/PlatformPei/Platform.c" for the string "PciBase"; it all originates there. I can also quote it:

    UINT32  TopOfLowRam;
    UINT32  PciBase;

    TopOfLowRam = GetSystemMemorySizeBelow4gb ();
    if (mHostBridgeDevId == INTEL_Q35_MCH_DEVICE_ID) {
      //
      // A 3GB base will always fall into Q35's 32-bit PCI host aperture,
      // regardless of the Q35 MMCONFIG BAR. Correspondingly, QEMU never lets
      // the RAM below 4 GB exceed it.
      //
      PciBase = BASE_2GB + BASE_1GB;
      ASSERT (TopOfLowRam <= PciBase);
    } else {
      PciBase = (TopOfLowRam < BASE_2GB) ? BASE_2GB : TopOfLowRam;
    }

    ...

    AddIoMemoryRangeHob (PciBase, 0xFC000000);

That's it.

In the past, it has repeatedly occurred that OVMF's calculation wouldn't match QEMU's calculation. Then PCI MMIO BARs were allocated outside of QEMU's actual MMIO aperture. This caused two things:
- video display not working (due to the framebuffer being accessed in a bogus place),
- Windows and Linux guests noticing that the BARs were outside of the range exposed in the _CRS, and disabling devices etc.

We kept duct-taping this, with patches in both OVMF and QEMU (see e.g. Gerd's QEMU commit ddaaefb4dd42).

It has been working fine for quite a long time now, but it is still not dynamic -- the calculations are duplicated between QEMU and OVMF.

To this day, I maintain that the "etc/pci-info" fw_cfg file would have been ideal for OVMF's purposes; and I still don't understand why it was ultimately removed.
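
For context, a rough sketch of how OVMF could have consumed such a
host-provided description through the existing QemuFwCfgLib interface;
the structure layout below is invented for the example and does not
necessarily match whatever format the removed "etc/pci-info" file had:

    /* Sketch only: read a (hypothetical) PCI window description from QEMU
     * instead of duplicating the aperture calculation in the firmware.
     */
    #include <Library/QemuFwCfgLib.h>

    typedef struct {
      UINT64 PciWindow32Start;
      UINT64 PciWindow32Size;
      UINT64 PciWindow64Start;
      UINT64 PciWindow64Size;
    } HYPOTHETICAL_PCI_INFO;

    STATIC
    RETURN_STATUS
    GetPciInfoFromQemu (
      OUT HYPOTHETICAL_PCI_INFO  *PciInfo
      )
    {
      FIRMWARE_CONFIG_ITEM Item;
      UINTN                Size;
      RETURN_STATUS        Status;

      Status = QemuFwCfgFindFile ("etc/pci-info", &Item, &Size);
      if (RETURN_ERROR (Status) || Size != sizeof *PciInfo) {
        return RETURN_NOT_FOUND;  /* fall back to the static calculation */
      }
      QemuFwCfgSelectItem (Item);
      QemuFwCfgReadBytes (Size, PciInfo);
      return RETURN_SUCCESS;
    }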

> more specifically 64-bit BARs.

Ha. Haha. Hahahaha.

OVMF doesn't support 64-bit BARs *at all*. In order to implement that, I would have to (1) understand PCI about ten billion percent better than I do now, (2) extend the mostly *impenetrable* PCI host bridge / root bridge driver in "OvmfPkg/PciHostBridgeDxe" to support this functionality.

Unfortunately, the parts of the UEFI & Platform Init specs that seem to talk about this functionality are super complex and obscure.

We have plans with Marcel and others to understand this better and perhaps do something about it.

Anyway, the basic premise bears repeating: even for the 32-bit case, OVMF has no way to dynamically retrieve the PCI hole's boundaries from QEMU.

Honestly, I'm confused. If "reserved-memory-end" is exposed over fw_cfg, and it -- apparently! -- partakes in communicating the 64-bit PCI hole to the guest, then why again was "etc/pci-info" removed in the first place?

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-07  9:21                 ` [Qemu-devel] " Igor Mammedov
@ 2016-01-08  4:21                   ` Xiao Guangrong
  -1 siblings, 0 replies; 59+ messages in thread
From: Xiao Guangrong @ 2016-01-08  4:21 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Michael S. Tsirkin, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel, Laszlo Ersek



On 01/07/2016 05:21 PM, Igor Mammedov wrote:
> On Wed, 6 Jan 2016 01:07:45 +0800
> Xiao Guangrong <guangrong.xiao@linux.intel.com> wrote:
>
>> On 01/06/2016 12:43 AM, Michael S. Tsirkin wrote:
>>
>>>>> Yes - if address is static, you need to put it outside
>>>>> the table. Can come right before or right after this.
>>>>>
>>>>>> Also if OperationRegion() is used, then one has to patch
>>>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>>>> using variable names is not permitted there.
>>>>>
>>>>> I am not sure the comment was understood correctly.
>>>>> The comment says really "we can't use DataTableRegion
>>>>> so here is an alternative".
>>>> so how are you going to access data at which patched
>>>> NameString point to?
>>>> for that you'd need a normal patched OperationRegion
>>>> as well since DataTableRegion isn't usable.
>>>
>>> For VMGENID you would patch the method that
>>> returns the address - you do not need an op region
>>> as you never access it.
>>>
>>> I don't know about NVDIMM. Maybe OperationRegion can
>>> use the patched NameString? Will need some thought.
>>
>> The ACPI spec says that the offsetTerm in OperationRegion
>> is evaluated as Int, so the named object is allowed to be
>> used in OperationRegion, that is exact what my patchset
>> is doing (http://marc.info/?l=kvm&m=145193395624537&w=2):
> that's not my reading of spec:
> "
> DefOpRegion := OpRegionOp NameString RegionSpace RegionOffset RegionLen
> RegionOffset := TermArg => Integer
> TermArg := Type2Opcode | DataObject | ArgObj | LocalObj
> "
>
> Named object is not allowed per spec, but you've used ArgObj which is
> allowed, even Windows ok with such dynamic OperationRegion.

Sorry, the Named object I was talking about is something like this:
Name("SOTH", int(0x10000))

I checked the ACPI spec, and this is indeed a formal NamedObj definition
in that spec; my fault.

>
>>
>> +    dsm_mem = aml_arg(3);
>> +    aml_append(method, aml_store(aml_call0(NVDIMM_GET_DSM_MEM), dsm_mem));
>>
>> +    aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
>> +                                            dsm_mem, TARGET_PAGE_SIZE));
>>
>> We hide the int64 object which is patched by BIOS in the method,
>> NVDIMM_GET_DSM_MEM, to make windows XP happy.
> considering that NRAM is allocated in low mem it's even fine to move
> OperationRegion into object scope to get rid of IASL warnings
> about declariong Named object inside method, but the you'd need to
> patch it directly as the only choice for RegionOffset would be DataObject
>

Yes, it is. So it depends on the question in my reply in another thread:
http://marc.info/?l=kvm&m=145222487605390&w=2
Can we assume that BIOS allocated address is always 32 bits?

If yes, we also need not make ssdt as v2.
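
If the firmware can indeed be assumed to allocate below 4 GB, the patched
placeholder only needs to be a dword and the SSDT can stay at revision 1.
A rough sketch of how the build side could look, assuming the 2.5-era
bios-linker-loader API; the "MEMA" name and the "etc/acpi/dsm-mem" fw_cfg
file are made up for this example (note also that Michael's patch as
posted declares build_append_named_dword() in the header but defines
build_append_named_qword() in the C file, so the name needs reconciling):

    /* Sketch: let the firmware allocate the DSM buffer and patch its
     * 32-bit address into a named dword placeholder in the SSDT.
     */
    static void example_patch_dsm_mem(GArray *table_data, GArray *linker)
    {
        int mem_addr_offset;

        /* emits NAME(MEMA, 0x00000000), returns the offset of the 0x0 */
        mem_addr_offset = build_append_named_dword(table_data, "MEMA");

        /* ask the firmware to allocate the buffer, page aligned */
        bios_linker_loader_alloc(linker, "etc/acpi/dsm-mem", 4096,
                                 false /* no need for the F-segment */);

        /* ADD_POINTER: rewrite the placeholder with the allocated address */
        bios_linker_loader_add_pointer(linker,
                                       ACPI_BUILD_TABLE_FILE,
                                       "etc/acpi/dsm-mem",
                                       table_data,
                                       table_data->data + mem_addr_offset,
                                       sizeof(uint32_t));
    }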

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-08  4:21                   ` [Qemu-devel] " Xiao Guangrong
@ 2016-01-08  9:42                     ` Laszlo Ersek
  -1 siblings, 0 replies; 59+ messages in thread
From: Laszlo Ersek @ 2016-01-08  9:42 UTC (permalink / raw)
  To: Xiao Guangrong, Igor Mammedov
  Cc: Michael S. Tsirkin, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel

On 01/08/16 05:21, Xiao Guangrong wrote:
> 
> 
> On 01/07/2016 05:21 PM, Igor Mammedov wrote:
>> On Wed, 6 Jan 2016 01:07:45 +0800
>> Xiao Guangrong <guangrong.xiao@linux.intel.com> wrote:
>>
>>> On 01/06/2016 12:43 AM, Michael S. Tsirkin wrote:
>>>
>>>>>> Yes - if address is static, you need to put it outside
>>>>>> the table. Can come right before or right after this.
>>>>>>
>>>>>>> Also if OperationRegion() is used, then one has to patch
>>>>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>>>>> using variable names is not permitted there.
>>>>>>
>>>>>> I am not sure the comment was understood correctly.
>>>>>> The comment says really "we can't use DataTableRegion
>>>>>> so here is an alternative".
>>>>> so how are you going to access data at which patched
>>>>> NameString point to?
>>>>> for that you'd need a normal patched OperationRegion
>>>>> as well since DataTableRegion isn't usable.
>>>>
>>>> For VMGENID you would patch the method that
>>>> returns the address - you do not need an op region
>>>> as you never access it.
>>>>
>>>> I don't know about NVDIMM. Maybe OperationRegion can
>>>> use the patched NameString? Will need some thought.
>>>
>>> The ACPI spec says that the offsetTerm in OperationRegion
>>> is evaluated as Int, so the named object is allowed to be
>>> used in OperationRegion, that is exact what my patchset
>>> is doing (http://marc.info/?l=kvm&m=145193395624537&w=2):
>> that's not my reading of spec:
>> "
>> DefOpRegion := OpRegionOp NameString RegionSpace RegionOffset RegionLen
>> RegionOffset := TermArg => Integer
>> TermArg := Type2Opcode | DataObject | ArgObj | LocalObj
>> "
>>
>> Named object is not allowed per spec, but you've used ArgObj which is
>> allowed, even Windows ok with such dynamic OperationRegion.
> 
> Sorry, Named object i was talking about is something like this:
> Name("SOTH", int(0x10000))
> 
> I am checking acpi spec, and this is a formal NamedObj definition in
> that spec, my fault.
> 
>>
>>>
>>> +    dsm_mem = aml_arg(3);
>>> +    aml_append(method, aml_store(aml_call0(NVDIMM_GET_DSM_MEM),
>>> dsm_mem));
>>>
>>> +    aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
>>> +                                            dsm_mem,
>>> TARGET_PAGE_SIZE));
>>>
>>> We hide the int64 object which is patched by BIOS in the method,
>>> NVDIMM_GET_DSM_MEM, to make windows XP happy.
>> considering that NRAM is allocated in low mem it's even fine to move
>> OperationRegion into object scope to get rid of IASL warnings
>> about declariong Named object inside method, but the you'd need to
>> patch it directly as the only choice for RegionOffset would be DataObject
>>
> 
> Yes, it is. So it is depends on the question in my reply of another thread:
> http://marc.info/?l=kvm&m=145222487605390&w=2
> Can we assume that BIOS allocated address is always 32 bits?

As far as OVMF is concerned: you can assume this at the moment, yes.

Thanks
Laszlo

> If yes, we also need not make ssdt as v2.
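
To make the fragment quoted above easier to follow, here is a rough sketch of
the surrounding method (the method name and its argument count are
assumptions; NVDIMM_GET_DSM_MEM is the macro used in the quoted patchset, and
the builder signatures are the ones that quoted code relies on):

    Aml *method, *dsm_mem;

    method  = aml_method("NDSM", 4);               /* name/arg count assumed */
    dsm_mem = aml_arg(3);
    /* fetch the BIOS-patched buffer address and keep it in Arg3 */
    aml_append(method, aml_store(aml_call0(NVDIMM_GET_DSM_MEM), dsm_mem));
    /* Arg3 is an ArgObj, which is a legal RegionOffset TermArg, so the
       region can be declared dynamically inside the method */
    aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
                                            dsm_mem, TARGET_PAGE_SIZE));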


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: How to reserve guest physical region for ACPI
  2016-01-08  4:21                   ` [Qemu-devel] " Xiao Guangrong
@ 2016-01-08 15:59                     ` Igor Mammedov
  -1 siblings, 0 replies; 59+ messages in thread
From: Igor Mammedov @ 2016-01-08 15:59 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Michael S. Tsirkin, pbonzini, gleb, mtosatti, stefanha, rth,
	ehabkost, dan.j.williams, kvm, qemu-devel, Laszlo Ersek

On Fri, 8 Jan 2016 12:21:09 +0800
Xiao Guangrong <guangrong.xiao@linux.intel.com> wrote:

> On 01/07/2016 05:21 PM, Igor Mammedov wrote:
> > On Wed, 6 Jan 2016 01:07:45 +0800
> > Xiao Guangrong <guangrong.xiao@linux.intel.com> wrote:
> >  
> >> On 01/06/2016 12:43 AM, Michael S. Tsirkin wrote:
> >>  
> >>>>> Yes - if address is static, you need to put it outside
> >>>>> the table. Can come right before or right after this.
> >>>>>  
> >>>>>> Also if OperationRegion() is used, then one has to patch
> >>>>>> DefOpRegion directly as RegionOffset must be Integer,
> >>>>>> using variable names is not permitted there.  
> >>>>>
> >>>>> I am not sure the comment was understood correctly.
> >>>>> The comment says really "we can't use DataTableRegion
> >>>>> so here is an alternative".  
> >>>> so how are you going to access data at which patched
> >>>> NameString point to?
> >>>> for that you'd need a normal patched OperationRegion
> >>>> as well since DataTableRegion isn't usable.  
> >>>
> >>> For VMGENID you would patch the method that
> >>> returns the address - you do not need an op region
> >>> as you never access it.
> >>>
> >>> I don't know about NVDIMM. Maybe OperationRegion can
> >>> use the patched NameString? Will need some thought.  
> >>
> >> The ACPI spec says that the offsetTerm in OperationRegion
> >> is evaluated as Int, so the named object is allowed to be
> >> used in OperationRegion, that is exact what my patchset
> >> is doing (http://marc.info/?l=kvm&m=145193395624537&w=2):  
> > that's not my reading of spec:
> > "
> > DefOpRegion := OpRegionOp NameString RegionSpace RegionOffset RegionLen
> > RegionOffset := TermArg => Integer
> > TermArg := Type2Opcode | DataObject | ArgObj | LocalObj
> > "
> >
> > Named object is not allowed per spec, but you've used ArgObj which is
> > allowed, even Windows ok with such dynamic OperationRegion.  
> 
> Sorry, Named object i was talking about is something like this:
> Name("SOTH", int(0x10000))
> 
> I am checking acpi spec, and this is a formal NamedObj definition in
> that spec, my fault.
> 
> >  
> >>
> >> +    dsm_mem = aml_arg(3);
> >> +    aml_append(method, aml_store(aml_call0(NVDIMM_GET_DSM_MEM), dsm_mem));
> >>
> >> +    aml_append(method, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
> >> +                                            dsm_mem, TARGET_PAGE_SIZE));
> >>
> >> We hide the int64 object which is patched by BIOS in the method,
> >> NVDIMM_GET_DSM_MEM, to make windows XP happy.  
> > considering that NRAM is allocated in low mem it's even fine to move
> > OperationRegion into object scope to get rid of IASL warnings
> > about declariong Named object inside method, but the you'd need to
> > patch it directly as the only choice for RegionOffset would be DataObject
> >  
> 
> Yes, it is. So it is depends on the question in my reply of another thread:
> http://marc.info/?l=kvm&m=145222487605390&w=2
> Can we assume that BIOS allocated address is always 32 bits?
> 
> If yes, we also need not make ssdt as v2.
For SeaBIOS that is the case, at least for now.
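
The object-scope alternative mentioned in the quote above would, roughly,
replace the in-method region with something like the following sketch (the
device variable and the constant are placeholders; the offset then has to be
a plain DataObject that the BIOS linker patches directly):

    /* OperationRegion at Device/Scope level: RegionOffset must then be an
       integer constant, patched in place by the firmware */
    aml_append(dev, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
                                         aml_int(0x12345678) /* patched */,
                                         TARGET_PAGE_SIZE));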



^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-01-08 15:59 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-02  7:20 [PATCH v9 0/5] implement vNVDIMM Xiao Guangrong
2015-12-02  7:20 ` [Qemu-devel] " Xiao Guangrong
2015-12-02  7:20 ` [PATCH v9 1/5] nvdimm: implement NVDIMM device abstract Xiao Guangrong
2015-12-02  7:20   ` [Qemu-devel] " Xiao Guangrong
2015-12-02  7:20 ` [PATCH v9 2/5] acpi: support specified oem table id for build_header Xiao Guangrong
2015-12-02  7:20   ` [Qemu-devel] " Xiao Guangrong
2015-12-02  7:20 ` [PATCH v9 3/5] nvdimm acpi: build ACPI NFIT table Xiao Guangrong
2015-12-02  7:20   ` [Qemu-devel] " Xiao Guangrong
2015-12-02  7:20 ` [PATCH v9 4/5] nvdimm acpi: build ACPI nvdimm devices Xiao Guangrong
2015-12-02  7:20   ` [Qemu-devel] " Xiao Guangrong
2015-12-02  7:21 ` [PATCH v9 5/5] nvdimm: add maintain info Xiao Guangrong
2015-12-02  7:21   ` [Qemu-devel] " Xiao Guangrong
2015-12-10  3:11 ` [PATCH v9 0/5] implement vNVDIMM Xiao Guangrong
2015-12-10  3:11   ` [Qemu-devel] " Xiao Guangrong
2015-12-21 14:13   ` Xiao Guangrong
2015-12-21 14:13     ` [Qemu-devel] " Xiao Guangrong
2015-12-28  2:39 ` How to reserve guest physical region for ACPI Xiao Guangrong
2015-12-28  2:39   ` [Qemu-devel] " Xiao Guangrong
2015-12-28 12:50   ` Michael S. Tsirkin
2015-12-28 12:50     ` [Qemu-devel] " Michael S. Tsirkin
2015-12-30 15:55     ` Igor Mammedov
2015-12-30 15:55       ` [Qemu-devel] " Igor Mammedov
2015-12-30 19:52       ` Michael S. Tsirkin
2015-12-30 19:52         ` [Qemu-devel] " Michael S. Tsirkin
2016-01-04 20:17         ` Laszlo Ersek
2016-01-04 20:17           ` [Qemu-devel] " Laszlo Ersek
2016-01-05 17:08           ` Igor Mammedov
2016-01-05 17:08             ` [Qemu-devel] " Igor Mammedov
2016-01-05 17:22             ` Laszlo Ersek
2016-01-05 17:22               ` [Qemu-devel] " Laszlo Ersek
2016-01-06 13:39               ` Igor Mammedov
2016-01-06 13:39                 ` [Qemu-devel] " Igor Mammedov
2016-01-06 14:43                 ` Laszlo Ersek
2016-01-06 14:43                   ` [Qemu-devel] " Laszlo Ersek
2016-01-07 13:51           ` Igor Mammedov
2016-01-07 13:51             ` [Qemu-devel] " Igor Mammedov
2016-01-07 17:33             ` Laszlo Ersek
2016-01-07 17:33               ` [Qemu-devel] " Laszlo Ersek
2016-01-05 16:30         ` Igor Mammedov
2016-01-05 16:30           ` [Qemu-devel] " Igor Mammedov
2016-01-05 16:43           ` Michael S. Tsirkin
2016-01-05 16:43             ` [Qemu-devel] " Michael S. Tsirkin
2016-01-05 17:07             ` Laszlo Ersek
2016-01-05 17:07               ` [Qemu-devel] " Laszlo Ersek
2016-01-05 17:07             ` Xiao Guangrong
2016-01-05 17:07               ` [Qemu-devel] " Xiao Guangrong
2016-01-07  9:21               ` Igor Mammedov
2016-01-07  9:21                 ` [Qemu-devel] " Igor Mammedov
2016-01-08  4:21                 ` Xiao Guangrong
2016-01-08  4:21                   ` [Qemu-devel] " Xiao Guangrong
2016-01-08  9:42                   ` Laszlo Ersek
2016-01-08  9:42                     ` [Qemu-devel] " Laszlo Ersek
2016-01-08 15:59                   ` Igor Mammedov
2016-01-08 15:59                     ` [Qemu-devel] " Igor Mammedov
2016-01-07 10:30             ` Igor Mammedov
2016-01-07 10:54               ` Michael S. Tsirkin
2016-01-07 13:42                 ` Igor Mammedov
2016-01-07 17:11                   ` Laszlo Ersek
2016-01-07 17:08                 ` Laszlo Ersek
