linux-mm.kvack.org archive mirror
* [RFC] virtio-mem: paravirtualized memory
@ 2017-06-16 14:20 David Hildenbrand
  2017-06-16 15:04 ` Michael S. Tsirkin
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: David Hildenbrand @ 2017-06-16 14:20 UTC (permalink / raw)
  To: KVM, virtualization, qemu-devel, linux-mm
  Cc: Michael S. Tsirkin, Andrea Arcangeli

Hi,

this is an idea that is based on Andrea Arcangeli's original idea to
host enforce guest access to memory given up using virtio-balloon using
userfaultfd in the hypervisor. While looking into the details, I
realized that host-enforcing virtio-balloon would result in way too many
problems (mainly backwards compatibility) and would also have some
conceptual restrictions that I want to avoid. So I developed the idea of
virtio-mem - "paravirtualized memory".

The basic idea is to add memory to the guest via a paravirtualized
mechanism (so the guest can hotplug it) and remove memory via a
mechanism similar to a balloon. This avoids having to online memory as
"online-movable" in the guest and allows more fain grained memory
hot(un)plug. In addition, migrating QEMU guests after adding/removing
memory gets a lot easier.

Actually, this has a lot in common with the XEN balloon or the Hyper-V
balloon (namely: paravirtualized hotplug and ballooning), but is very
different when going into the details.

Getting this all implemented properly will take quite some effort;
that's why I want to get some early feedback regarding the general
concept. If you have alternative ideas, or ideas on how to modify this
concept, I'll be happy to discuss. Just please make sure to have a look
at the requirements first.

-----------------------------------------------------------------------
0. Outline:
-----------------------------------------------------------------------
- I.    General concept
- II.   Use cases
- III.  Identified requirements
- IV.   Possible modifications
- V.    Prototype
- VI.   Problems to solve / things to sort out / missing in prototype
- VII.  Questions
- VIII. Q&A

------------------------------------------------------------------------
I. General concept
------------------------------------------------------------------------

We expose memory regions to the guest via a paravirtualized interface.
So instead of e.g. a DIMM on x86, such memory is not announced via ACPI.
Unmodified guests (without a virtio-mem driver) won't be able to see/use
this memory. The virtio-mem guest driver is needed to detect and manage
these memory areas. What makes this memory special is that it can grow
while the guest is running ("plug memory") and might shrink on a reboot
(to compensate "unplugged" memory - see next paragraph). Each virtio-mem
device manages exactly one such memory area. By having multiple ones
assigned to different NUMA nodes, we can modify memory on a NUMA basis.

Of course, we cannot shrink these memory areas while the guest is
running. To be able to unplug memory, we do something like a balloon
does, however limited to this very memory area that belongs to the
virtio-mem device. The guest will hand back small chunks of memory. If
we want to add memory to the guest, we first "replug" memory that has
previously been given up by the guest, before we grow our memory area.

On a reboot, we want to avoid any memory holes in our memory, therefore
we resize our memory area (shrink it) to compensate for memory that has
been unplugged. This greatly simplifies hotplugging memory in the guest
(hotplugging memory with random memory holes is basically impossible).

We have to make sure that all memory chunks the guest hands back on
unplug requests will not consume memory in the host. We do this by
write-protecting that memory chunk in the host and then dropping the
backing pages. The guest can read this memory (reading from the ZERO
page) but no longer write to it. For now, this will only work on
anonymous memory. We will use userfaultfd WP (write-protect mode) to
avoid creating too many VMAs. Huge pages will require more effort (no
explicit ZERO page).
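
As a rough illustration, a minimal host-side sketch of what "dropping
the backing pages" and write-protecting could look like for anonymous
memory (the function name and error handling are purely illustrative;
the prototype uses mprotect() as shown here, userfaultfd WP would
replace it later):

#include <stdio.h>
#include <sys/mman.h>

/* Sketch: discard and write-protect one unplugged chunk of anonymous
 * guest memory in the hypervisor process. userfaultfd WP would later
 * replace the mprotect() call, to avoid creating one VMA per chunk. */
static int discard_and_protect(void *chunk, size_t len)
{
    /* Drop the backing pages; later reads fault in the shared ZERO page. */
    if (madvise(chunk, len, MADV_DONTNEED)) {
        perror("madvise");
        return -1;
    }
    /* Forbid writes so the guest cannot silently repopulate the chunk. */
    if (mprotect(chunk, len, PROT_READ)) {
        perror("mprotect");
        return -1;
    }
    return 0;
}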

As we unplug memory on a fine grained basis (and e.g. not on
a complete DIMM basis), there is no need to online virtio-mem memory
as online-movable. Also, memory unplug for Windows might become
possible that way. You can find more details in the Q/A section below.


The important points here are:
- After a reboot, all memory the guest sees can be accessed and used.
  (in contrast to e.g. the XEN balloon, see Q/A for more details)
- Rebooting into an unmodified guest will not result in random
  crashes. The guest will simply not be able to use all memory without a
  virtio-mem driver.
- Adding/Removing memory will not require modifying the QEMU command
  line on the migration target. Migration simply works (re-sizing memory
  areas is already part of the migration protocol!). Essentially, this
  makes adding/removing memory to/from a guest way simpler and
  independent of the underlying architecture. If the guest OS can online
  new memory, we can add more memory this way.
- Unplugged memory can be read. This allows e.g. kexec() without nasty
  modifications. Especially relevant for Windows' kexec() variant.
- It will play nicely with other things mapped into the address space,
  e.g. also other DIMMs or NVDIMM. virtio-mem will only work on its own
  memory region (in contrast e.g. to virtio-balloon). In particular, it
  will not give up ("allocate") memory on other DIMMs, which would
  hinder them from getting unplugged the ACPI way.
- We can add/remove memory without running into KVM memory slot or other
  (e.g. ACPI slot) restrictions. The granularity in which we can add
  memory is only limited by the granularity the guest can add memory
  (e.g. Windows 2MB, Linux on x86 128MB for now).
- By not having to online memory as online-movable we don't run into any
  memory restrictions in the guest. E.g. page tables can only be created
  on !movable memory. So while there might be plenty of online-movable
  memory left, allocation of page tables might fail. See Q/A for more
  details.
- The admin will not have to set memory offline in the guest first in
  order to unplug it. virtio-mem will handle this internally and not
  require interaction with an admin or a guest-agent.

Important restrictions of this concept:
- Guests without a virtio-mem guest driver can't see that memory.
- We will always require some boot memory that cannot get unplugged.
  Also, virtio-mem memory (as all other hotplugged memory) cannot become
  DMA memory under Linux. So the boot memory also defines the amount of
  DMA memory.
- Hibernation/Sleep+Restore while virtio-mem is active is not supported.
  On a reboot/fresh start, the size of the virtio-mem memory area might
  change and a running/loaded guest can't deal with that.
- Unplug support for hugetlbfs/shmem will take quite some time to
  implement. The larger the used page size, the harder it is for the
  guest to give up memory. We can still use DIMM based hotplug for that.
- Gigantic huge pages are problematic, as the guest would have to give up
  e.g. 1GB chunks. This is not expected to be supported. We can still
  use DIMM based hotplug for setups that require that.
- For any memory we unplug using this mechanism, for now we will still
  have struct pages allocated in the guest. With a 64-byte struct page
  per 4 KiB page, this means that roughly 1.6% of the unplugged memory
  will remain allocated in the guest, being unusable.


------------------------------------------------------------------------
II. Use cases
------------------------------------------------------------------------

Of course, we want to deny any access to unplugged memory. In contrast
to virtio-balloon or other similar ideas (free page hinting), this is
not about cooperative memory management, but about guarantees. The idea
is that both concepts can coexist.

So one use case is of course cloud providers. Customers can add
or remove memory to/from a VM without having to care about how to
online memory, or about the amount in which memory has to be added in
the first place so that it can be removed again. In cloud environments,
we care about
guarantees. E.g. for virtio-balloon a malicious guest can simply reuse
any deflated memory, and the hypervisor can't even tell if the guest is
malicious (e.g. a harmless guest reboot might look like a malicious
guest). For virtio-mem, we guarantee that the guest can't reuse any
memory that it previously gave up.

But also for ordinary VMs (!cloud), this avoids having to online memory
in the guest as online-movable and therefore avoids running into
allocation problems if there are e.g. many processes needing many page
tables on !movable memory. Also here, we don't have to know how much
memory we want to remove at some point in the future before we add
memory (e.g. if we add a 128GB DIMM, we can only remove that 128GB DIMM
- if we are lucky).

We might be able to support memory unplug for Windows (as for now,
ACPI unplug is not supported), more details have to be clarified.

As we can grow these memory areas quite easily, another use case might
be guests that tell us they need more memory. Thinking about VMs to
protect containers, there seems to be the general problem that we don't
know how much memory the container will actually need. We could
implement a mechanism (in virtio-mem or guest driver), by which the
guest can request more memory. If the hypervisor agrees, it can simply
give the guest more memory. As this is all handled within QEMU,
migration is not a problem. Adding more memory will not result in new
DIMM devices.


------------------------------------------------------------------------
III. Identified requirements
------------------------------------------------------------------------

I considered the following requirements.

NUMA aware:
  We want to be able to add/remove memory to/from NUMA nodes.
Different page-size support:
  We want to be able to support different page sizes, e.g. because of
  huge pages in the hypervisor or because host and guest have different
  page sizes (powerpc 64k vs 4k).
Guarantees:
  There has to be no way the guest can reuse unplugged memory without
  host consent. Still, we could implement a mechanism for the guest to
  request more memory. The hypervisor then has to decide how it wants to
  handle that request.
Architecture independence:
  We want this to work independently of other technologies bound to
  specific architectures, like ACPI.
Avoid online-movable:
  We don't want to have to online memory in the guest as online-movable
  just to be able to unplug (at least parts of) it again.
Migration support:
  Be able to migrate without too much hassle. Especially, to handle it
  completely within QEMU (not having to add new devices to the target
  command line).
Windows support:
  We definitely want to support Windows guests in the long run.
Coexistence with other hotplug mechanisms:
  Allow hotplugging of DIMMs / NVDIMMs, i.e. sharing the "hotplug"
  address space part with other devices.
Backwards compatibility:
  Don't break if rebooting into an unmodified guest after having
  unplugged some memory. The memory a freshly booted guest sees must
  not contain holes that will crash it when accessed.


------------------------------------------------------------------------
IV. Possible modifications
------------------------------------------------------------------------

Adding a guest->host request mechanism would make sense to e.g. be able
to request further memory from the hypervisor directly from the guest.
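
Just to make that idea concrete, a purely hypothetical request layout
(there is no spec yet; every field name below is made up solely for
illustration):

#include <stdint.h>

/* Hypothetical guest->host request: "please resize my memory area".
 * Nothing like this is specified yet; it only illustrates the idea. */
struct virtio_mem_guest_req {
    uint64_t requested_size; /* desired total size of the memory area */
    uint16_t node;           /* NUMA node / device the request targets */
    uint16_t flags;          /* e.g. urgency hints, reserved for now */
    uint32_t reserved;
};

The hypervisor would then either grant the request (grow the area /
allow replug) or simply ignore it.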

Adding memory will be much easier than removing memory. We can split
this up and first introduce "adding memory" and later add "removing
memory". Removing memory will require userfaultfd WP in the hypervisor
and a special fancy allocator in the guest. So this will take some time.

Adding a mechanism to trade in memory blocks might make sense to allow
some sort of memory compaction. However I expect this to be highly
complicated and basically not feasible.

Being able to unplug "any" memory instead of only memory
belonging to the virtio-mem device sounds tempting (and simplifies
certain parts), however it has a couple of side effects I want to avoid.
You can read more about that in the Q/A below.


------------------------------------------------------------------------
V. Prototype
------------------------------------------------------------------------

To identify potential problems I developed a very basic prototype. It
is incomplete, full of hacks and most probably broken in various ways.
I used it only in the given setup, only on x86 and only with an initrd.

It uses a fixed page size of 256k for now, has a very ugly allocator
hack in the guest, the virtio protocol really needs some tuning and
an async job interface towards the user is missing. Instead of using
userfaultfd WP, I am simply using mprotect() in this prototype. Basic
migration works (not involving userfaultfd).

Please, don't even try to review it (that's why I will also not attach
any patches to this mail :) ), just use it as an inspiration for what
this could look like. You can find the latest hack at:

QEMU: https://github.com/davidhildenbrand/qemu/tree/virtio-mem

Kernel: https://github.com/davidhildenbrand/linux/tree/virtio-mem

Use the kernel in the guest and make sure to compile the virtio-mem
driver into the kernel (CONFIG_VIRTIO_MEM=y). A host kernel patch is
contained to allow atomic resize of KVM memory regions, however it is
pretty much untested.


1. Starting a guest with virtio-mem memory:
   We will create a guest with 2 NUMA nodes and 4GB of "boot + DMA"
   memory. This memory is visible also to guests without virtio-mem.
   Also, we will add 4GB to NUMA node 0 and 3GB to NUMA node 1 using
   virtio-mem. We allow both virtio-mem devices to grow up to 8GB. The
   last 4 lines are the important part.

--> qemu/x86_64-softmmu/qemu-system-x86_64 \
	--enable-kvm \
	-m 4G,maxmem=20G \
	-smp sockets=2,cores=2 \
	-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
	-machine pc \
	-kernel linux/arch/x86_64/boot/bzImage \
	-nodefaults \
	-chardev stdio,id=serial \
	-device isa-serial,chardev=serial \
	-append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0" \
	-initrd /boot/initramfs-4.10.8-200.fc25.x86_64.img \
	-chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
	-mon chardev=monitor,mode=readline \
	-object memory-backend-ram,id=mem0,size=4G,max-size=8G \
	-device virtio-mem-pci,id=reg0,memdev=mem0,node=0 \
	-object memory-backend-ram,id=mem1,size=3G,max-size=8G \
	-device virtio-mem-pci,id=reg1,memdev=mem1,node=1

2. Listing current memory assignment:

--> (qemu) info memory-devices
	Memory device [virtio-mem]: "reg0"
	  addr: 0x140000000
	  node: 0
	  size: 4294967296
	  max-size: 8589934592
	  memdev: /objects/mem0
	Memory device [virtio-mem]: "reg1"
	  addr: 0x340000000
	  node: 1
	  size: 3221225472
	  max-size: 8589934592
	  memdev: /objects/mem1
--> (qemu) info numa
	2 nodes
	node 0 cpus: 0 1
	node 0 size: 6144 MB
	node 1 cpus: 2 3
	node 1 size: 5120 MB

3. Resize a virtio-mem device: Unplugging memory.
   Setting reg0 to 2G (remove 2G from NUMA node 0)

--> (qemu) virtio-mem reg0 2048
	virtio-mem reg0 2048
--> (qemu) info numa
	info numa
	2 nodes
	node 0 cpus: 0 1
	node 0 size: 4096 MB
	node 1 cpus: 2 3
	node 1 size: 5120 MB

4. Resize a virtio-mem device: Plugging memory
   Setting reg0 to 8G (adding 6G to NUMA node 0) will replug 2G and plug
   4G, automatically re-sizing the memory area. You might experience
   random crashes at this point if the host kernel is missing the KVM
   patch (as the memory slot is not re-sized in an atomic fashion).

--> (qemu) virtio-mem reg0 8192
	virtio-mem reg0 8192
--> (qemu) info numa
	info numa
	2 nodes
	node 0 cpus: 0 1
	node 0 size: 10240 MB
	node 1 cpus: 2 3
	node 1 size: 5120 MB

5. Resize a virtio-mem device: Try to unplug all memory.
   Setting reg0 to 0G (removing 8G from NUMA node 0) will not work. The
   guest will not be able to unplug all memory. In my example, 164M
   cannot be unplugged (out of memory).

--> (qemu) virtio-mem reg0 0
	virtio-mem reg0 0
--> (qemu) info numa
	info numa
	2 nodes
	node 0 cpus: 0 1
	node 0 size: 2212 MB
	node 1 cpus: 2 3
	node 1 size: 5120 MB
--> (qemu) info virtio-mem reg0
	info virtio-mem reg0
	Status: ready
	Request status: vm-oom
	Page size: 2097152 bytes
--> (qemu) info memory-devices
	Memory device [virtio-mem]: "reg0"
	  addr: 0x140000000
	  node: 0
	  size: 171966464
	  max-size: 8589934592
	  memdev: /objects/mem0
	Memory device [virtio-mem]: "reg1"
	  addr: 0x340000000
	  node: 1
	  size: 3221225472
	  max-size: 8589934592
	  memdev: /objects/mem1

At any point, we can migrate our guest without having to care about
modifying the QEMU command line on the target side. Simply start the
target e.g. with an additional '-incoming "exec: cat IMAGE"' and you're
done.

------------------------------------------------------------------------
VI. Problems to solve / things to sort out / missing in prototype
------------------------------------------------------------------------

General:
- We need an async job API to send the unplug/replug/plug requests to
  the guest and query the state. [medium/hard]
- Handle various alignment problems. [medium]
- We need a virtio spec.

Relevant for plug:
- Resize QEMU memory regions while the guest is running (esp. grow).
  While I implemented a demo solution for KVM memory slots, something
  similar would be needed for vhost. Re-sizing of memory slots has to be
  an atomic operation. [medium]
- NUMA: Most probably the NUMA node should not be part of the virtio-mem
  device, this should rather be indicated via e.g. ACPI. [medium]
- x86: Add the complete possible memory to the e820 map as reserved.
  [medium]
- x86/powerpc/...: Indicate to which NUMA node the memory belongs using
  ACPI. [medium]
- x86/powerpc/...: Share address space with ordinary DIMMs/NVDIMMs, for
  now this is blocked for simplicity. [medium/hard]
- If the bitmaps become too big, migrate them like memory. [medium]

Relevant for unplug:
- Allocate memory in Linux from a specific memory range. Windows has a
  nice interface for that (at least it looks nice when reading the API).
  This could be done using fake NUMA nodes or a new ZONE. My prototype
  just uses a very ugly hack. [very hard]
- Use userfaultfd WP (write-protect) instead of mprotect(). Especially,
  have multiple userfaultfd users in QEMU at a time (postcopy).
  [medium/hard]

Stuff for the future:
- Huge pages are problematic (no ZERO page support). This might not be
  trivial to support. [hard/very hard]
- Try to free struct pages, to avoid the 1.6% overhead [very very hard]


------------------------------------------------------------------------
VII. Questions
------------------------------------------------------------------------

Getting unplug working properly will require quite some effort; that's
why I want to get some basic feedback before continuing to work on an
RFC implementation + RFC virtio spec.

a) Did I miss anything important? Are there any ultimate blockers that I
   ignored? Any concepts that are broken?

b) Are there any alternatives? Any modifications that could make life
   easier while still taking care of the requirements?

c) Are there other use cases we should care about and focus on?

d) Am I missing any requirements? What else could be important for
   !cloud and cloud?

e) Are there any possible solutions to the allocator problem (allocating
   memory from a specific memory area)? Please speak up!

f) Anything unclear?

g) Any feelings about this? Yay or nay?


As you reached this point: Thanks for having a look!!! Highly appreciated!


------------------------------------------------------------------------
VIII. Q&A
------------------------------------------------------------------------

---
Q: What's the problem with ordinary memory hot(un)plug?

A: 1. We can only unplug in the granularity we plugged. So we have to
      know in advance how much memory we want to remove later on. If we
      plug a 2G dimm, we can only unplug a 2G dimm.
   2. We might run out of memory slots. Although very unlikely, this
      would strike if we try to always plug small modules in order to be
      able to unplug again (e.g. loads of 128MB modules).
   3. Any locked page in the guest can hinder us from unplugging a dimm.
      Even if memory was onlined as online_movable, a single locked page
      can hinder us from unplugging that memory dimm.
   4. Memory has to be onlined as online_movable. If we don't put that
      memory into the movable zone, any non-movable kernel allocation
      could end up on it, making it impossible to unplug the complete
      dimm. As certain allocations cannot go into the movable zone (e.g.
      page tables), the ratio between online_movable/online memory
      depends on the workload in the guest. Ratios of 50%-70% are
      usually fine. But it could happen that there is plenty of memory
      available, but kernel allocations fail. (source: Andrea Arcangeli)
   5. Unplugging might require several attempts. It takes some time to
      migrate all memory from the dimm. At that point, it is then not
      really obvious why it failed, and whether it could ever succeed.
   6. Windows does support memory hotplug but not memory hotunplug. So
      this could be a way to support it also for Windows.
---
Q: Will this work with Windows?

A: Most probably not in the current form. Memory has to at least be
   added to the e820 map and ACPI (NUMA). The Hyper-V balloon is also
   able to hotadd memory using a paravirtualized interface, so there are
   very good chances that this will work. But we won't know for sure
   until we also start prototyping.
---
Q: How does this compare to virtio-balloon?

A: In contrast to virtio-balloon, virtio-mem
   1. Supports multiple page sizes, even different ones for different
      virtio-mem devices in a guest.
   2. Is NUMA aware.
   3. Is able to add more memory.
   4. Doesn't work on all memory, but only on the managed one.
   5. Has guarantees. There is no way for the guest to reclaim memory.
---
Q: How does this compare to XEN balloon?

A: XEN balloon also has a way to hotplug new memory. However, on a
   reboot, the guest will "see" more memory than it actually has.
   Compared to XEN balloon, virtio-mem:
   1. Supports multiple page sizes.
   2. Is NUMA aware.
   3. The guest can survive a reboot into a system without the guest
      driver. If the XEN guest driver doesn't come up, the guest will
      get killed once it touches too much memory.
   4. Reboots don't require any hacks.
   5. The guest knows which memory is special. And it remains special
      during a reboot. Hotplugged memory does not suddenly become base
      memory. The balloon mechanism will only work on a specific memory
      area.
---
Q: How does this compare to Hyper-V balloon?

A: Based on the code from the Linux Hyper-V balloon driver, I can say
   that Hyper-V also has a way to hotplug new memory. However, memory
   will remain plugged on a reboot. Therefore, the guest will see more
   memory than the hypervisor actually wants to assign to it.
   Virtio-mem in contrast:
   1. Supports multiple page sizes.
   2. Is NUMA aware.
   3. I have no idea what happens under Hyper-V when
      a) rebooting into a guest without a fitting guest driver
      b) kexec() touches all memory
      c) the guest misbehaves
   4. The guest knows which memory is special. And it remains special
      during a reboot. Hotplugged memory does not suddenly become base
      memory. The balloon mechanism will only work on a specific memory
      area.
   In general, it looks like the hypervisor has to deal with malicious
   guests trying to access more memory than desired by providing enough
   swap space.
---
Q: How is virtio-mem NUMA aware?

A: Each virtio-mem device belongs exactly to one NUMA node (if NUMA is
   enabled). As we can resize these regions separately, we can control
   from/to which node to remove/add memory.
---
Q: Why do we need support for multiple page sizes?

A: If huge pages are used in the host, we can only guarantee that they
   are no longer accessible by the guest if the guest gives up memory in
   that granularity. We prepare for that. Also, powerpc can have 64k
   pages in the host but 4k pages in the guest. So the guest must only
   give up 64k chunks. In addition, unplugging 4k pages might be bad
   when it comes to fragmentation. My prototype currently uses 256k. We
   can make this configurable - and it can vary for each virtio-mem
   device.
---
Q: What are the limitations with paravirtualized memory hotplug?

A: The same as for DIMM based hotplug, but we don't run out of any
   memory/ACPI slots. E.g. on x86 Linux, only 128MB chunks can be
   hotplugged, on x86 Windows it's 2MB. In addition, of course we
   have to take care of maximum address limits in the guest. The idea
   is to communicate these limits to the hypervisor via virtio-mem,
   to give hints when trying to add/remove memory.
---
Q: Why not simply unplug *any* memory like virtio-balloon does?

A: This could be done and a previous prototype did it like that.
   However, there are some points to consider here.
   1. If we combine this with ordinary memory hotplug (DIMM), we most
      likely won't be able to unplug DIMMs anymore as virtio-mem memory
      gets "allocated" on these.
   2. All guests using virtio-mem cannot use huge pages as backing
      storage at all (as virtio-mem only supports anonymous pages).
   3. We need to track unplugged memory for the complete address space,
      so we need a global state in QEMU. Bitmaps get bigger. We will not
      be able to dynamically grow the bitmaps for a virtio-mem device.
   4. Resolving/checking memory to be unplugged gets significantly
      harder. How should the guest know which memory it can unplug for a
      specific virtio-mem device? E.g. if NUMA is active, only that NUMA
      node to which a virtio-mem device belongs can be used.
   5. We will need a userfaultfd handler for the complete address space,
      not just for the virtio-mem managed memory.
      Especially, if somebody hotplugs a DIMM, we will have to enable the
      userfaultfd handler dynamically.
   6. What shall we do if somebody hotplugs a DIMM with huge pages? How
      should we tell the guest that this memory cannot be used for
      unplugging?
   In summary: This concept is way cleaner, but also harder to
   implement.
---
Q: Why not reuse virtio-balloon?

A: virtio-balloon is for cooperative memory management. It has a fixed
   page size and will deflate in certain situations. Any change we
   introduce will break backwards compatibility. virtio-balloon was not
   designed to give guarantees. Nobody can hinder the guest from
   deflating/reusing inflated memory. In addition, it might make perfect
   sense to have both, virtio-balloon and virtio-mem at the same time,
   especially looking at the DEFLATE_ON_OOM or STATS features of
   virtio-balloon. While virtio-mem is all about guarantees, virtio-
   balloon is about cooperation.
---
Q: Why not reuse ACPI hotplug?

A: We can easily run out of slots, migration in QEMU will just be
   horrible and we don't want to bind virtio* to architecture specific
   technologies.
   E.g. thinking about s390x - no ACPI. Also, mixing an ACPI driver with
   a virtio-driver sounds very weird. If the virtio-driver performs the
   hotplug itself, we might later perform some extra tricks: e.g.
   actually unplug certain regions to give up some struct pages.

   We want to manage the way memory is added/removed completely in QEMU.
   We cannot simply add new devices from within QEMU and expect that
   migration in QEMU will work.
---
Q: Why do we need resizable memory regions?

A: Migration in QEMU is special. Any device we have on our source VM has
   to already be around on our target VM. So simply creating random
   devices internally in QEMU is not going to work. The concept of
   resizable memory regions in QEMU already exists and is part of the
   migration protocol. Before memory is migrated, the memory is resized.
   So in essence, this makes migration support _a lot_ easier.

   In addition, we won't run into any slot number restrictions when
   automatically managing how to add memory in QEMU.
---
Q: Why do we have to resize memory regions on a reboot?

A: We have to compensate for all memory that has been unplugged from
   that area by shrinking it, so that a fresh guest can use all memory
   when initializing the virtio-mem device.
---
Q: Why do we need userfaultfd?

A: mprotect() will create a lot of VMAs in the kernel. This will degrade
   performance and might even fail at one point. userfaultfd avoids this
   by not creating a new VMA for every protected range. userfaultfd WP
   is currently still under development and suffers from false positives
   that make it currently impossible to properly integrate this into the
   prototype.
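
   For illustration, registering a range for write-protect faults via
   userfaultfd could look roughly like the sketch below. The ioctl names
   and structures follow the proposed userfaultfd WP patches and may
   still change; treat this as an assumption-laden sketch, not a final
   API.

   #include <fcntl.h>
   #include <linux/userfaultfd.h>
   #include <stdio.h>
   #include <sys/ioctl.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   /* Sketch: write-protect [start, start + len) via userfaultfd WP
    * instead of mprotect(), so no additional VMAs are created. */
   static int uffd_wp_range(unsigned long start, unsigned long len)
   {
       struct uffdio_api api = { .api = UFFD_API };
       struct uffdio_register reg = {
           .range = { .start = start, .len = len },
           .mode = UFFDIO_REGISTER_MODE_WP,
       };
       struct uffdio_writeprotect wp = {
           .range = { .start = start, .len = len },
           .mode = UFFDIO_WRITEPROTECT_MODE_WP,
       };
       int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

       if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
           ioctl(uffd, UFFDIO_REGISTER, &reg) ||
           ioctl(uffd, UFFDIO_WRITEPROTECT, &wp)) {
           perror("userfaultfd WP");
           return -1;
       }
       /* Write faults for the range are now reported on uffd and have
        * to be resolved by the hypervisor. */
       return 0;
   }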
---
Q: Why do we have to allow reading unplugged memory?

A: E.g. if the guest crashes and wants to write a memory dump, it will
   blindly access all memory. While we could find ways to fix up kexec,
   Windows dumps might be more problematic. Allowing the guest to read
   all memory (resulting in reading all 0's) saves us from a lot of
   trouble.

   The downside is that page tables full of zero pages might be
   created (we might be able to find ways to optimize this).
---
Q: Will this work with postcopy live-migration?

A: Not in the current form. And it doesn't really make sense to spend
   time on it as long as we don't use userfaultfd. Combining both
   handlers will be interesting. It can be done with some effort on the
   QEMU side.
---
Q: What's the problem with shmem/hugetlbfs?

A: We currently rely on the ZERO page to be mapped when the guest tries
   to read unplugged memory. For shmem/hugetlbfs, there is no ZERO page,
   so read access would result in memory getting populated. We could
   either introduce an explicit ZERO page, or manage it using one dummy
   ZERO page (using regular userfaultfd, allowing only one such page to
   be mapped at a time). For now, only anonymous memory is supported.
---
Q: Ripping out random page ranges, won't this fragment our guest memory?

A: Yes, but depending on the virtio-mem page size, this might be more or
   less problematic. The smaller the virtio-mem page size, the more we
   fragment and make small allocations fail. The bigger the virtio-mem
   page size, the higher the chance that we can't unplug any more
   memory.
---
Q: Why can't we use memory compaction like virtio-balloon?

A: If the virtio-mem page size > PAGE_SIZE, we can't do ordinary
   page migration; migration would have to be done in blocks. We could
   later add a guest->host virtqueue, via which the guest can
   "exchange" memory ranges. However, mm would also have to support
   this kind of migration. So it is not completely out of scope, but
   will require quite some work.
---
Q: Do we really need yet another paravirtualized interface for this?

A: You tell me :)
---

Thanks,

David


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-16 14:20 [RFC] virtio-mem: paravirtualized memory David Hildenbrand
@ 2017-06-16 15:04 ` Michael S. Tsirkin
  2017-06-16 15:59   ` David Hildenbrand
  2017-06-19 10:08 ` Stefan Hajnoczi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Michael S. Tsirkin @ 2017-06-16 15:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli

On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
> Hi,
> 
> this is an idea that is based on Andrea Arcangeli's original idea to
> host enforce guest access to memory given up using virtio-balloon using
> userfaultfd in the hypervisor. While looking into the details, I
> realized that host-enforcing virtio-balloon would result in way too many
> problems (mainly backwards compatibility) and would also have some
> conceptual restrictions that I want to avoid. So I developed the idea of
> virtio-mem - "paravirtualized memory".

Thanks! I went over this quickly, will read some more in the
coming days. I would like to ask for some clarifications
on one part meanwhile:

> Q: Why not reuse virtio-balloon?
> 
> A: virtio-balloon is for cooperative memory management. It has a fixed
>    page size

We are fixing that with VIRTIO_BALLOON_F_PAGE_CHUNKS btw.
I would appreciate you looking into that patchset.

> and will deflate in certain situations.

What does this refer to?

> Any change we
>    introduce will break backwards compatibility.

Why does this have to be the case?

> virtio-balloon was not
>    designed to give guarantees. Nobody can hinder the guest from
>    deflating/reusing inflated memory.

Reusing without deflate is forbidden with TELL_HOST, right?

>    In addition, it might make perfect
>    sense to have both, virtio-balloon and virtio-mem at the same time,
>    especially looking at the DEFLATE_ON_OOM or STATS features of
>    virtio-balloon. While virtio-mem is all about guarantees, virtio-
>    balloon is about cooperation.

Thanks, and I intend to look more into this next week.

-- 
MST


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-16 15:04 ` Michael S. Tsirkin
@ 2017-06-16 15:59   ` David Hildenbrand
  2017-06-16 20:19     ` Michael S. Tsirkin
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2017-06-16 15:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli

On 16.06.2017 17:04, Michael S. Tsirkin wrote:
> On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
>> Hi,
>>
>> this is an idea that is based on Andrea Arcangeli's original idea to
>> host enforce guest access to memory given up using virtio-balloon using
>> userfaultfd in the hypervisor. While looking into the details, I
>> realized that host-enforcing virtio-balloon would result in way too many
>> problems (mainly backwards compatibility) and would also have some
>> conceptual restrictions that I want to avoid. So I developed the idea of
>> virtio-mem - "paravirtualized memory".
> 
> Thanks! I went over this quickly, will read some more in the
> coming days. I would like to ask for some clarifications
> on one part meanwhile:

Thanks for looking into it that fast! :)

In general, this section is all about why not to simply host-enforce
virtio-balloon.

> 
>> Q: Why not reuse virtio-balloon?
>>
>> A: virtio-balloon is for cooperative memory management. It has a fixed
>>    page size
> 
> We are fixing that with VIRTIO_BALLOON_F_PAGE_CHUNKS btw.
> I would appreciate you looking into that patchset.

Will do, thanks. Problem is that there is no "enforcement" on the page
size. VIRTIO_BALLOON_F_PAGE_CHUNKS simply allows sending bigger chunks.
Nobody hinders the guest (especially legacy virtio-balloon drivers) from
sending 4k pages.

So this doesn't really fix the issue we have here, it just allows
speeding up the transfer. Which is a good thing, but does not help with
enforcement at all. So, yes, support for page sizes > 4k, but no way to
enforce it.

> 
>> and will deflate in certain situations.
> 
> What does this refer to?

A Linux guest will deflate the balloon (all or some pages) in the
following scenarios:
a) page migration
b) unload virtio-balloon kernel module
c) hibernate/suspension
d) (DEFLATE_ON_OOM)

A Linux guest will touch memory without deflating:
a) During a kexec() dump
d) On reboots (regular, after kexec(), system_reset)

> 
>> Any change we
>>    introduce will break backwards compatibility.
> 
> Why does this have to be the case
If we suddenly enforce the existing virtio-balloon, we will break legacy
guests.

Simple example:
Guest with inflated virtio-balloon reboots. Touches inflated memory.
Gets killed at some random point.

Of course, another discussion would be "can't we move virtio-mem
functionality into virtio-balloon instead of changing virtio-balloon".
With the current concept this is also not possible (one region per
device vs. one virtio-balloon device). And I think while similar, these
are two different concepts.

> 
>> virtio-balloon was not
>>    designed to give guarantees. Nobody can hinder the guest from
>>    deflating/reusing inflated memory.
> 
> Reusing without deflate is forbidden with TELL_HOST, right?

TELL_HOST just means "please inform me". There is no way to NACK a
request. It is not a permission to do so, just a "friendly
notification". And this is exactly not what we want when host enforcing
memory access.


> 
>>    In addition, it might make perfect
>>    sense to have both, virtio-balloon and virtio-mem at the same time,
>>    especially looking at the DEFLATE_ON_OOM or STATS features of
>>    virtio-balloon. While virtio-mem is all about guarantees, virtio-
>>    balloon is about cooperation.
> 
> Thanks, and I intend to look more into this next week.
> 

I know that it is tempting to force this concept into virtio-balloon. I
spent quite some time thinking about this (and possible other techniques
like implicit memory deflation on reboots) and decided not to do it. We
just end up trying to hack around all possible things that could go
wrong, while still not being able to handle all requirements properly.

-- 

Thanks,

David


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-16 15:59   ` David Hildenbrand
@ 2017-06-16 20:19     ` Michael S. Tsirkin
  2017-06-18 10:17       ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Michael S. Tsirkin @ 2017-06-16 20:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli

On Fri, Jun 16, 2017 at 05:59:07PM +0200, David Hildenbrand wrote:
> On 16.06.2017 17:04, Michael S. Tsirkin wrote:
> > On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
> >> Hi,
> >>
> >> this is an idea that is based on Andrea Arcangeli's original idea to
> >> host enforce guest access to memory given up using virtio-balloon using
> >> userfaultfd in the hypervisor. While looking into the details, I
> >> realized that host-enforcing virtio-balloon would result in way too many
> >> problems (mainly backwards compatibility) and would also have some
> >> conceptual restrictions that I want to avoid. So I developed the idea of
> >> virtio-mem - "paravirtualized memory".
> > 
> > Thanks! I went over this quickly, will read some more in the
> > coming days. I would like to ask for some clarifications
> > on one part meanwhile:
> 
> Thanks for looking into it that fast! :)
> 
> In general, this section is all about why not to simply host-enforce
> virtio-balloon.
> > 
> >> Q: Why not reuse virtio-balloon?
> >>
> >> A: virtio-balloon is for cooperative memory management. It has a fixed
> >>    page size
> > 
> > We are fixing that with VIRTIO_BALLOON_F_PAGE_CHUNKS btw.
> > I would appreciate you looking into that patchset.
> 
> Will do, thanks. Problem is that there is no "enforcement" on the page
> size. VIRTIO_BALLOON_F_PAGE_CHUNKS simply allows sending bigger chunks.
> Nobody hinders the guest (especially legacy virtio-balloon drivers) from
> sending 4k pages.
> 
> So this doesn't really fix the issue we have here, it just allows
> speeding up the transfer. Which is a good thing, but does not help with
> enforcement at all. So, yes, support for page sizes > 4k, but no way to
> enforce it.
> 
> > 
> >> and will deflate in certain situations.
> > 
> > What does this refer to?
> 
> A Linux guest will deflate the balloon (all or some pages) in the
> following scenarios:
> a) page migration

It inflates it first, doesn't it?

> b) unload virtio-balloon kernel module
> c) hibernate/suspension
> d) (DEFLATE_ON_OOM)

You need to set a flag in the balloon to allow this, right?

> A Linux guest will touch memory without deflating:
> a) During a kexec() dump
> d) On reboots (regular, after kexec(), system_reset)
> > 
> >> Any change we
> >>    introduce will break backwards compatibility.
> > 
> > Why does this have to be the case
> If we suddenly enforce the existing virtio-balloon, we will break legacy
> guests.

Can't we do it with a feature flag?

> Simple example:
> Guest with inflated virtio-balloon reboots. Touches inflated memory.
> Gets killed at some random point.
> 
> Of course, another discussion would be "can't we move virtio-mem
> functionality into virtio-balloon instead of changing virtio-balloon".
> With the current concept this is also not possible (one region per
> device vs. one virtio-balloon device). And I think while similar, these
> are two different concepts.
> 
> > 
> >> virtio-balloon was not
> >>    designed to give guarantees. Nobody can hinder the guest from
> >>    deflating/reusing inflated memory.
> > 
> > Reusing without deflate is forbidden with TELL_HOST, right?
> 
> TELL_HOST just means "please inform me". There is no way to NACK a
> request. It is not a permission to do so, just a "friendly
> notification". And this is exactly not what we want when host enforcing
> memory access.
> 
> 
> > 
> >>    In addition, it might make perfect
> >>    sense to have both, virtio-balloon and virtio-mem at the same time,
> >>    especially looking at the DEFLATE_ON_OOM or STATS features of
> >>    virtio-balloon. While virtio-mem is all about guarantees, virtio-
> >>    balloon is about cooperation.
> > 
> > Thanks, and I intend to look more into this next week.
> > 
> 
> I know that it is tempting to force this concept into virtio-balloon. I
> spent quite some time thinking about this (and possible other techniques
> like implicit memory deflation on reboots) and decided not to do it. We
> just end up trying to hack around all possible things that could go
> wrong, while still not being able to handle all requirements properly.

I agree there's a large # of requirements here not addressed by the balloon.

One other thing that would be helpful here is pointing out the
similarities between virtio-mem and the balloon. I'll ponder it
over the weekend.

The biggest worry for me is inability to support DMA into this memory.
Is this hard to fix?


Thanks!



> -- 
> 
> Thanks,
> 
> David


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-16 20:19     ` Michael S. Tsirkin
@ 2017-06-18 10:17       ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2017-06-18 10:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli

>> A Linux guest will deflate the balloon (all or some pages) in the
>> following scenarios:
>> a) page migration
> 
> It inflates it first, doesn't it?

Yes, that is true. I was just listing all scenarios.

> 
>> b) unload virtio-balloon kernel module
>> c) hibernate/suspension
>> d) (DEFLATE_ON_OOM)
> 
> You need to set a flag in the balloon to allow this, right?

Yes, has to be enabled in QEMU and will propagate to the guest. It is
used in various setups and you could either go for DEFLATE_ON_OOM
(cooperative memory management) or memory unplug, not both.

> 
>> A Linux guest will touch memory without deflating:
>> a) During a kexec() dump
>> d) On reboots (regular, after kexec(), system_reset)
>>>
>>>> Any change we
>>>>    introduce will break backwards compatibility.
>>>
>>> Why does this have to be the case
>> If we suddenly enforce the existing virtio-balloon, we will break legacy
>> guests.
> 
> Can't we do it with a feature flag?

I haven't found an easy way to do that without rendering all existing
virtio-balloon implementations useless. But honestly, whatever you do,
you will be confronted with the very basic problems of this approach:

Random memory holes on a reboot and the chance that the guest that comes up

a) contains a legacy virtio-balloon
b) contains no virtio-balloon at all
c) starts up virtio-balloon too late to fill the holes

Now, there are various possible approaches that require their own hacks
and only solve a subset of these problems. Just a very short version of
it all:

1) A very early virtio-balloon that queries a bitmap of inflated memory
via some interface. This is just a giant hack (e.g. what about Windows?)
and even the BIOS might already touch inflated memory. Still breaks at
least b) and c). No good.

2) Do "implicit" balloon inflation on a reboot. Any page the guest
touches is marked as inflated. This requires a lot of quirks in the host
and still breaks at least b) and c). Basically no good for us.

You can read more about the involved problems at
https://blog.xenproject.org/2014/02/14/ballooning-rebooting-and-the-feature-youve-never-heard-of/

3) Try to mark inflated pages as reserved in the e820 map and make
the balloon hotplug these. Well, this is x86-specific and has some other
problems (e.g. what to do with ACPI hotplugged memory?). Also, how to
handle this on Windows? Exploding size of the e820 map. No good.

4) Try to resize the guest main memory to compensate for unplugged memory.
While this sounds promising, there are elementary problems to solve: How
to deal with ACPI hotplugged memory? What to resize? And there has to be
ACPI hotplug, otherwise you cannot add more memory to a guest. While we
could solve some x86 specific problems here, migration on the QEMU side
will also be "fun". virtio-mem heavily simplifies that all by only
working on its own memory.

But again, these are all hacks, and at least I don't want to create a
giant hack and call it virtio-*, that is restricted to some very
specific use cases and/or architectures. Let's just do it in a clean way
if possible.

[...]

> I agree there's a large # of requirements here not addressed by the
> balloon.

Exactly, and it tries to solve the basic problem of rebooting into a
guest that does not contain a fitting guest driver.

>
> One other thing that would be helpful here is pointing out the
> similarities between virtio-mem and the balloon. I'll ponder it
> over the weekend.

There is much more difference here than similarity. The only thing they
share is allocating/freeing memory and telling the host about it. But
already how/from where memory is allocated is different. I think even
the general use case is different. Again, I think both concepts make
sense to coexist.

> 
> The biggest worry for me is inability to support DMA into this memory.
> Is this hard to fix?

As a short term solution: Always give your (x86) guest at least 3.x G of
base memory. And this is the exact same situation you have with
ordinary ACPI based memory hotplug right now. That memory will also never
become DMA memory. So it is not worse compared to what we have right now.

Long term solution: I think this was never a use case. Usually, all
memory you "add", you theoretically want to be able to "remove" again.
So from that point, it does not make sense to mark it as DMA and feed it
to some driver that will not let go of it. I haven't had a deep look at
it, but I at least think it could be done with some effort. Not sure
about Windows.

Thanks!

-- 

Thanks,

David


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-16 14:20 [RFC] virtio-mem: paravirtualized memory David Hildenbrand
  2017-06-16 15:04 ` Michael S. Tsirkin
@ 2017-06-19 10:08 ` Stefan Hajnoczi
  2017-06-19 10:26   ` David Hildenbrand
  2017-07-25  8:21 ` David Hildenbrand
  2017-07-28 11:09 ` David Hildenbrand
  3 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2017-06-19 10:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli,
	Michael S. Tsirkin

On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
> Important restrictions of this concept:
> - Guests without a virtio-mem guest driver can't see that memory.
> - We will always require some boot memory that cannot get unplugged.
>   Also, virtio-mem memory (as all other hotplugged memory) cannot become
>   DMA memory under Linux. So the boot memory also defines the amount of
>   DMA memory.

I didn't know that hotplug memory cannot become DMA memory.

Ouch.  Zero-copy disk I/O with O_DIRECT and network I/O with virtio-net
won't be possible.

When running an application that uses O_DIRECT file I/O this probably
means we now have 2 copies of pages in memory: 1. in the application and
2. in the kernel page cache.

So this increases pressure on the page cache and reduces performance :(.

Stefan


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-19 10:08 ` Stefan Hajnoczi
@ 2017-06-19 10:26   ` David Hildenbrand
  2017-06-21 11:08     ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2017-06-19 10:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli,
	Michael S. Tsirkin

On 19.06.2017 12:08, Stefan Hajnoczi wrote:
> On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
>> Important restrictions of this concept:
>> - Guests without a virtio-mem guest driver can't see that memory.
>> - We will always require some boot memory that cannot get unplugged.
>>   Also, virtio-mem memory (as all other hotplugged memory) cannot become
>>   DMA memory under Linux. So the boot memory also defines the amount of
>>   DMA memory.
> 
> I didn't know that hotplug memory cannot become DMA memory.
> 
> Ouch.  Zero-copy disk I/O with O_DIRECT and network I/O with virtio-net
> won't be possible.
> 
> When running an application that uses O_DIRECT file I/O this probably
> means we now have 2 copies of pages in memory: 1. in the application and
> 2. in the kernel page cache.
> 
> So this increases pressure on the page cache and reduces performance :(.
> 
> Stefan
> 

arch/x86/mm/init_64.c:

/*
 * Memory is added always to NORMAL zone. This means you will never get
 * additional DMA/DMA32 memory.
 */
int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{

This is for sure something to work on in the future. Until then, base
memory of 3.X GB should be sufficient, right?

-- 

Thanks,

David


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-19 10:26   ` David Hildenbrand
@ 2017-06-21 11:08     ` Stefan Hajnoczi
  2017-06-21 12:32       ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Hajnoczi @ 2017-06-21 11:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli,
	Michael S. Tsirkin

On Mon, Jun 19, 2017 at 12:26:52PM +0200, David Hildenbrand wrote:
> On 19.06.2017 12:08, Stefan Hajnoczi wrote:
> > On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
> >> Important restrictions of this concept:
> >> - Guests without a virtio-mem guest driver can't see that memory.
> >> - We will always require some boot memory that cannot get unplugged.
> >>   Also, virtio-mem memory (as all other hotplugged memory) cannot become
> >>   DMA memory under Linux. So the boot memory also defines the amount of
> >>   DMA memory.
> > 
> > I didn't know that hotplug memory cannot become DMA memory.
> > 
> > Ouch.  Zero-copy disk I/O with O_DIRECT and network I/O with virtio-net
> > won't be possible.
> > 
> > When running an application that uses O_DIRECT file I/O this probably
> > means we now have 2 copies of pages in memory: 1. in the application and
> > 2. in the kernel page cache.
> > 
> > So this increases pressure on the page cache and reduces performance :(.
> > 
> > Stefan
> > 
> 
> arch/x86/mm/init_64.c:
> 
> /*
>  * Memory is added always to NORMAL zone. This means you will never get
>  * additional DMA/DMA32 memory.
>  */
> int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> {
> 
> This is for sure something to work on in the future. Until then, base
> memory of 3.X GB should be sufficient, right?

I'm not sure that helps because applications typically don't control
where their buffers are located?

Stefan


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-21 11:08     ` Stefan Hajnoczi
@ 2017-06-21 12:32       ` David Hildenbrand
  2017-06-23 12:45         ` Stefan Hajnoczi
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2017-06-21 12:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli,
	Michael S. Tsirkin

On 21.06.2017 13:08, Stefan Hajnoczi wrote:
> On Mon, Jun 19, 2017 at 12:26:52PM +0200, David Hildenbrand wrote:
>> On 19.06.2017 12:08, Stefan Hajnoczi wrote:
>>> On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
>>>> Important restrictions of this concept:
>>>> - Guests without a virtio-mem guest driver can't see that memory.
>>>> - We will always require some boot memory that cannot get unplugged.
>>>>   Also, virtio-mem memory (as all other hotplugged memory) cannot become
>>>>   DMA memory under Linux. So the boot memory also defines the amount of
>>>>   DMA memory.
>>>
>>> I didn't know that hotplug memory cannot become DMA memory.
>>>
>>> Ouch.  Zero-copy disk I/O with O_DIRECT and network I/O with virtio-net
>>> won't be possible.
>>>
>>> When running an application that uses O_DIRECT file I/O this probably
>>> means we now have 2 copies of pages in memory: 1. in the application and
>>> 2. in the kernel page cache.
>>>
>>> So this increases pressure on the page cache and reduces performance :(.
>>>
>>> Stefan
>>>
>>
>> arch/x86/mm/init_64.c:
>>
>> /*
>>  * Memory is added always to NORMAL zone. This means you will never get
>>  * additional DMA/DMA32 memory.
>>  */
>> int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>> {
>>
>> This is for sure something to work on in the future. Until then, base
>> memory of 3.X GB should be sufficient, right?
> 
> I'm not sure that helps because applications typically don't control
> where their buffers are located?

Okay, let me try to explain what is going on here (no expert, please
someone correct me if I am wrong).

There is a difference between DMA and DMA memory in Linux. DMA memory is
simply memory with special addresses. DMA is the general technique of a
device directly copying data to RAM, bypassing the CPU.

ZONE_DMA contains all* memory < 16MB
ZONE_DMA32 contains all* memory < 4G
* meaning available at boot via the e820 map, not hotplugged.

So memory from these zones can be used by devices that can only deal
with 24bit/32bit addresses.

Hotplugged memory is never added to ZONE_DMA/DMA32, but to
ZONE_NORMAL. That means kmalloc(.., GFP_DMA) will not be able to use
hotplugged memory. Say you have 1GB of main storage and hotplug 1G (at
address 1G). This memory will not be available in ZONE_DMA32, although
it is below 4G.

Memory in ZONE_NORMAL is used for ordinary kmalloc(), so all of this
memory can be used for DMA, but you are not guaranteed to get 32-bit
capable addresses. I pretty much assume that virtio-net can deal with
64-bit addresses.


My understanding of O_DIRECT:

The user space buffers (O_DIRECT) are used directly for DMA. This will
work just fine as long as the device can deal with 64-bit addresses. I
guess this is the case for virtio-net; otherwise there would already be
the exact same problem without virtio-mem.

Summary:

virtio-mem memory can be used for DMA, it will simply not be added to
ZONE_DMA/DMA32 and therefore won't be available for kmalloc(...,
GFP_DMA). This should work just fine with O_DIRECT as before.

If necessary, we could try to add memory to the ZONE_DMA later on,
however for now I would rate this a minor problem. By simply using 3.X
GB of base memory, basically all memory that could go to ZONE_DMA/DMA32
already is in these zones without virtio-mem.
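
To make the zone placement concrete, here is a minimal, hypothetical
kernel-module sketch (not part of the prototype; alloc_pages() and the
GFP flags are standard kernel API, but the module itself is only an
illustration). A GFP_DMA32 allocation must be served from boot memory
below 4G, while a plain GFP_KERNEL allocation may land on hotplugged
(e.g. virtio-mem) memory in ZONE_NORMAL:

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static int __init zone_demo_init(void)
{
	/* Must come from ZONE_DMA/DMA32, i.e. boot memory below 4G. */
	struct page *low = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);
	/* May come from ZONE_NORMAL, i.e. possibly hotplugged memory. */
	struct page *any = alloc_pages(GFP_KERNEL, 0);

	if (low)
		pr_info("GFP_DMA32 page at pfn %lu\n", page_to_pfn(low));
	if (any)
		pr_info("GFP_KERNEL page at pfn %lu\n", page_to_pfn(any));

	if (low)
		__free_pages(low, 0);
	if (any)
		__free_pages(any, 0);
	return 0;
}

static void __exit zone_demo_exit(void)
{
}

module_init(zone_demo_init);
module_exit(zone_demo_exit);
MODULE_LICENSE("GPL");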

Thanks!

> 
> Stefan
> 


-- 

Thanks,

David


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-21 12:32       ` David Hildenbrand
@ 2017-06-23 12:45         ` Stefan Hajnoczi
  0 siblings, 0 replies; 16+ messages in thread
From: Stefan Hajnoczi @ 2017-06-23 12:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: KVM, virtualization, qemu-devel, linux-mm, Andrea Arcangeli,
	Michael S. Tsirkin

On Wed, Jun 21, 2017 at 02:32:48PM +0200, David Hildenbrand wrote:
> On 21.06.2017 13:08, Stefan Hajnoczi wrote:
> > On Mon, Jun 19, 2017 at 12:26:52PM +0200, David Hildenbrand wrote:
> >> On 19.06.2017 12:08, Stefan Hajnoczi wrote:
> >>> On Fri, Jun 16, 2017 at 04:20:02PM +0200, David Hildenbrand wrote:
> >>>> Important restrictions of this concept:
> >>>> - Guests without a virtio-mem guest driver can't see that memory.
> >>>> - We will always require some boot memory that cannot get unplugged.
> >>>>   Also, virtio-mem memory (as all other hotplugged memory) cannot become
> >>>>   DMA memory under Linux. So the boot memory also defines the amount of
> >>>>   DMA memory.
> >>>
> >>> I didn't know that hotplug memory cannot become DMA memory.
> >>>
> >>> Ouch.  Zero-copy disk I/O with O_DIRECT and network I/O with virtio-net
> >>> won't be possible.
> >>>
> >>> When running an application that uses O_DIRECT file I/O this probably
> >>> means we now have 2 copies of pages in memory: 1. in the application and
> >>> 2. in the kernel page cache.
> >>>
> >>> So this increases pressure on the page cache and reduces performance :(.
> >>>
> >>> Stefan
> >>>
> >>
> >> arch/x86/mm/init_64.c:
> >>
> >> /*
> >>  * Memory is added always to NORMAL zone. This means you will never get
> >>  * additional DMA/DMA32 memory.
> >>  */
> >> int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> >> {
> >>
> >> This is for sure something to work on in the future. Until then, base
> >> memory of 3.X GB should be sufficient, right?
> > 
> > I'm not sure that helps because applications typically don't control
> > where their buffers are located?
> 
> Okay, let me try to explain what is going on here (no expert, please
> someone correct me if I am wrong).
> 
> There is a difference between DMA and "DMA memory" in Linux. DMA is the
> general technique of a device copying data directly to/from RAM,
> bypassing the CPU. "DMA memory" is simply memory with special (low)
> physical addresses.
> 
> ZONE_DMA contains all* memory < 16MB
> ZONE_DMA32 contains all* memory < 4G
> * meaning memory available at boot via the e820 map, not hotplugged memory.
> 
> So memory from these zones can be used by devices that can only deal
> with 24bit/32bit addresses.
> 
> Hotplugged memory is never added to ZONE_DMA/DMA32, but to
> ZONE_NORMAL. That means kmalloc(..., GFP_DMA) will not be able to use
> hotplugged memory. Say you have 1GB of boot memory and hotplug another
> 1GB (at address 1G). This memory will not be available in
> ZONE_DMA/DMA32, although it is below 4G.
> 
> Memory in ZONE_NORMAL is used for ordinary kmalloc(), so all of this
> memory can be used for DMA, but you are not guaranteed to get 32-bit
> capable addresses. I pretty much assume that virtio-net can deal with
> 64-bit addresses.
> 
> 
> My understanding of O_DIRECT:
> 
> The user space buffers (O_DIRECT) are used directly for DMA. This will
> work just fine as long as the device can deal with 64-bit addresses. I
> guess this is the case for virtio-net; otherwise there would already be
> the exact same problem without virtio-mem.
> 
> Summary:
> 
> virtio-mem memory can be used for DMA, it will simply not be added to
> ZONE_DMA/DMA32 and therefore won't be available for kmalloc(...,
> GFP_DMA). This should work just fine with O_DIRECT as before.
> 
> If necessary, we could try to add memory to the ZONE_DMA later on,
> however for now I would rate this a minor problem. By simply using 3.X
> GB of base memory, basically all memory that could go to ZONE_DMA/DMA32
> already is in these zones without virtio-mem.

Nice, thanks for clearing this up!

Stefan


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-16 14:20 [RFC] virtio-mem: paravirtualized memory David Hildenbrand
  2017-06-16 15:04 ` Michael S. Tsirkin
  2017-06-19 10:08 ` Stefan Hajnoczi
@ 2017-07-25  8:21 ` David Hildenbrand
  2017-07-28 11:09 ` David Hildenbrand
  3 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2017-07-25  8:21 UTC (permalink / raw)
  To: KVM, virtualization, qemu-devel, linux-mm
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Paolo Bonzini, Radim Krcmar

(ping)

Hi,

this has been on these lists for quite some time now. I want to start
preparing a virtio spec for virtio-mem soon.

So if you have any more comments/ideas/objections/questions, now is the
right time to post them :)

Thanks!


On 16.06.2017 16:20, David Hildenbrand wrote:
> Hi,
> 
> this is an idea that is based on Andrea Arcangeli's original idea to
> host enforce guest access to memory given up using virtio-balloon using
> userfaultfd in the hypervisor. While looking into the details, I
> realized that host-enforcing virtio-balloon would result in way too many
> problems (mainly backwards compatibility) and would also have some
> conceptual restrictions that I want to avoid. So I developed the idea of
> virtio-mem - "paravirtualized memory".
> 
> The basic idea is to add memory to the guest via a paravirtualized
> mechanism (so the guest can hotplug it) and remove memory via a
> mechanism similar to a balloon. This avoids having to online memory as
> "online-movable" in the guest and allows more fain grained memory
> hot(un)plug. In addition, migrating QEMU guests after adding/removing
> memory gets a lot easier.
> 
> Actually, this has a lot in common with the XEN balloon or the Hyper-V
> balloon (namely: paravirtualized hotplug and ballooning), but is very
> different when going into the details.
> 
> Getting this all implemented properly will take quite some effort,
> that's why I want to get some early feedback regarding the general
> concept. If you have some alternative ideas, or ideas how to modify this
> concept, I'll be happy to discuss. Just please make sure to have a look
> at the requirements first.
> 
> -----------------------------------------------------------------------
> 0. Outline:
> -----------------------------------------------------------------------
> - I.    General concept
> - II.   Use cases
> - III.  Identified requirements
> - IV.   Possible modifications
> - V.    Prototype
> - VI.   Problems to solve / things to sort out / missing in prototype
> - VII.  Questions
> - VIII. Q&A
> 
> ------------------------------------------------------------------------
> I. General concept
> ------------------------------------------------------------------------
> 
> We expose memory regions to the guest via a paravirtualize interface. So
> instead of e.g. a DIMM on x86, such memory is not anounced via ACPI.
> Unmodified guests (without a virtio-mem driver) won't be able to see/use
> this memory. The virtio-mem guest driver is needed to detect and manage
> these memory areas. What makes this memory special is that it can grow
> while the guest is running ("plug memory") and might shrink on a reboot
> (to compensate "unplugged" memory - see next paragraph). Each virtio-mem
> device manages exactly one such memory area. By having multiple ones
> assigned to different NUMA nodes, we can modify memory on a NUMA basis.
> 
> Of course, we cannot shrink these memory areas while the guest is
> running. To be able to unplug memory, we do something like a balloon
> does, however limited to this very memory area that belongs to the
> virtio-mem device. The guest will hand back small chunks of memory. If
> we want to add memory to the guest, we first "replug" memory that has
> previously been given up by the guest, before we grow our memory area.
> 
> On a reboot, we want to avoid any memory holes in our memory, therefore
> we resize our memory area (shrink it) to compensate memory that has been
> unplugged. This highly simplifies hotplugging memory in the guest (
> hotplugging memory with random memory holes is basically impossible).
> 
> We have to make sure that all memory chunks the guest hands back on
> unplug requests will not consume memory in the host. We do this by
> write-protecting that memory chunk in the host and then dropping the
> backing pages. The guest can read this memory (reading from the ZERO
> page) but no longer write to it. For now, this will only work on
> anonymous memory. We will use userfaultfd WP (write-protect mode) to
> avoid creating too many VMAs. Huge pages will require more effort (no
> explicit ZERO page).
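
To illustrate the "write-protect, then drop the backing pages" step on
anonymous memory, here is a minimal userspace sketch of roughly what the
prototype does with mprotect() (the 256k chunk size and the standalone
program are purely illustrative, this is not the actual QEMU code):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t chunk = 256 * 1024;	/* one virtio-mem page */
	char *mem = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return 1;
	memset(mem, 0xaa, chunk);	/* the guest used this memory */

	/* "Unplug": forbid writes, then drop the anonymous backing pages. */
	if (mprotect(mem, chunk, PROT_READ) ||
	    madvise(mem, chunk, MADV_DONTNEED))
		return 1;

	/* Reads are now served from the shared ZERO page ... */
	printf("first byte after unplug: %d\n", mem[0]);
	/* ... while any write to "mem" would fault (SIGSEGV). */
	return 0;
}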
> 
> As we unplug memory on a fine grained basis (and e.g. not on
> a complete DIMM basis), there is no need to online virtio-mem memory
> as online-movable. Also, memory unplug for Windows might become possible
> that way. You can find more details in the Q/A section below.
> 
> 
> The important points here are:
> - After a reboot, every memory the guest sees can be accessed and used.
>   (in contrast to e.g. the XEN balloon, see Q/A for more details)
> - Rebooting into an unmodified guest will not result in random
>   crashes. The guest will simply not be able to use all memory without a
>   virtio-mem driver.
> - Adding/Removing memory will not require modifying the QEMU command
>   line on the migration target. Migration simply works (re-sizing memory
>   areas is already part of the migration protocol!). Essentially, this
>   makes adding/removing memory to/from a guest way simpler and
>   independent of the underlying architecture. If the guest OS can online
>   new memory, we can add more memory this way.
> - Unplugged memory can be read. This allows e.g. kexec() without nasty
>   modifications. Especially relevant for Windows' kexec() variant.
> - It will play nicely with other things mapped into the address space,
>   e.g. also other DIMMs or NVDIMM. virtio-mem will only work on its own
>   memory region (in contrast e.g. to virtio-balloon). Especially it will
>   not give up ("allocate") memory on other DIMMs, hindering them from
>   getting unplugged the ACPI way.
> - We can add/remove memory without running into KVM memory slot or other
>   (e.g. ACPI slot) restrictions. The granularity in which we can add
>   memory is only limited by the granularity the guest can add memory
>   (e.g. Windows 2MB, Linux on x86 128MB for now).
> - By not having to online memory as online-movable we don't run into any
>   memory restrictions in the guest. E.g. page tables can only be created
>   on !movable memory. So while there might be plenty of online-movable
>   memory left, allocation of page tables might fail. See Q/A for more
>   details.
> - The admin will not have to set memory offline in the guest first in
>   order to unplug it. virtio-mem will handle this internally and not
>   require interaction with an admin or a guest-agent.
> 
> Important restrictions of this concept:
> - Guests without a virtio-mem guest driver can't see that memory.
> - We will always require some boot memory that cannot get unplugged.
>   Also, virtio-mem memory (as all other hotplugged memory) cannot become
>   DMA memory under Linux. So the boot memory also defines the amount of
>   DMA memory.
> - Hibernation/Sleep+Restore while virtio-mem is active is not supported.
>   On a reboot/fresh start, the size of the virtio-mem memory area might
>   change and a running/loaded guest can't deal with that.
> - Unplug support for hugetlbfs/shmem will take quite some time to
>   support. The larger the used page size, the harder for the guest to
>   give up memory. We can still use DIMM based hotplug for that.
> - Gigantic huge pages are problematic, as the guest would have to give up
>   e.g. 1GB chunks. This is not expected to be supported. We can still
>   use DIMM based hotplug for setups that require that.
> - For any memory we unplug using this mechanism, for now we will still
>   have struct pages allocated in the guest. This means that roughly
>   1.6% of unplugged memory (one 64-byte struct page per 4 KiB page)
>   will still be allocated in the guest and remain unusable.
> 
> 
> ------------------------------------------------------------------------
> II. Use cases
> ------------------------------------------------------------------------
> 
> Of course, we want to deny any access to unplugged memory. In contrast
> to virtio-balloon or other similar ideas (free page hinting), this is
> not about cooperative memory management, but about guarantees. The idea
> is, that both concepts can coexist.
> 
> So one use case is of course cloud providers. Customers can add
> or remove memory to/from a VM without having to care about how to
> online memory or in which amount to add memory in the first place in
> order to remove it again. In cloud environments, we care about
> guarantees. E.g. for virtio-balloon a malicious guest can simply reuse
> any deflated memory, and the hypervisor can't even tell if the guest is
> malicious (e.g. a harmless guest reboot might look like a malicious
> guest). For virtio-mem, we guarantee that the guest can't reuse any
> memory that it previously gave up.
> 
> But also for ordinary VMs (!cloud), this avoids having to online memory
> in the guest as online-movable and therefore avoids running into
> allocation problems if there are e.g. many processes needing many page
> tables on !movable memory. Also here, we don't have to know how much
> memory we want to remove at some point in the future before we add
> memory. (e.g. if we add a 128GB DIMM, we can only remove that 128GB
> DIMM - if we are lucky).
> 
> We might be able to support memory unplug for Windows (as for now,
> ACPI unplug is not supported), more details have to be clarified.
> 
> As we can grow these memory areas quite easily, another use case might
> be guests that tell us they need more memory. Thinking about VMs to
> protect containers, there seems to be the general problem that we don't
> know how much memory the container will actually need. We could
> implement a mechanism (in virtio-mem or guest driver), by which the
> guest can request more memory. If the hypervisor agrees, it can simply
> give the guest more memory. As this is all handled within QEMU,
> migration is not a problem. Adding more memory will not result in new
> DIMM devices.
> 
> 
> ------------------------------------------------------------------------
> III. Identified requirements
> ------------------------------------------------------------------------
> 
> I considered the following requirements.
> 
> NUMA aware:
>   We want to be able to add/remove memory to/from NUMA nodes.
> Different page-size support:
>   We want to be able to support different page sizes, e.g. because of
>   huge pages in the hypervisor or because host and guest have different
>   page sizes (powerpc 64k vs 4k).
> Guarantees:
>   There has to be no way the guest can reuse unplugged memory without
>   host consent. Still, we could implement a mechanism for the guest to
>   request more memory. The hypervisor then has to decide how it wants to
>   handle that request.
> Architecture independence:
>   We want this to work independently of other technologies bound to
>   specific architectures, like ACPI.
> Avoid online-movable:
>   We don't want to have to online memory in the guest as online-movable
>   just to be able to unplug (at least parts of) it again.
> Migration support:
>   Be able to migrate without too much hassle. Especially, to handle it
>   completely within QEMU (not having to add new devices to the target
>   command line).
> Windows support:
>   We definitely want to support Windows guests in the long run.
> Coexistence with other hotplug mechanisms:
>   Allow to hotplug DIMMs / NVDIMMs, therefore to share the "hotplug"
>   address space part with other devices.
> Backwards compatibility:
>   Don't break if rebooting into an unmodified guest after having
>   unplugged some memory. All memory a freshly booted guest sees must not
>   contain memory holes that will crash it if it tries to access it.
> 
> 
> ------------------------------------------------------------------------
> IV. Possible modifications
> ------------------------------------------------------------------------
> 
> Adding a guest->host request mechanism would make sense to e.g. be able
> to request further memory from the hypervisor directly from the guest.
> 
> Adding memory will be much easier than removing memory. We can split
> this up and first introduce "adding memory" and later add "removing
> memory". Removing memory will require userfaultfd WP in the hypervisor
> and a special fancy allocator in the guest. So this will take some time.
> 
> Adding a mechanism to trade in memory blocks might make sense to allow
> some sort of memory compaction. However I expect this to be highly
> complicated and basically not feasible.
> 
> Being able to unplug "any" memory instead of only memory
> belonging to the virtio-mem device sounds tempting (and simplifies
> certain parts), however it has a couple of side effects I want to avoid.
> You can read more about that in the Q/A below.
> 
> 
> ------------------------------------------------------------------------
> V. Prototype
> ------------------------------------------------------------------------
> 
> To identify potential problems I developed a very basic prototype. It
> is incomplete, full of hacks and most probably broken in various ways.
> I used it only in the given setup, only on x86 and only with an initrd.
> 
> It uses a fixed page size of 256k for now, has a very ugly allocator
> hack in the guest, the virtio protocol really needs some tuning and
> an async job interface towards the user is missing. Instead of using
> userfaultfd WP, I am using simply mprotect() in this prototype. Basic
> migration works (not involving userfaultfd).
> 
> Please, don't even try to review it (that's why I will also not attach
> any patches to this mail :) ), just use this as an inspiration what this
> could look like. You can find the latest hack at:
> 
> QEMU: https://github.com/davidhildenbrand/qemu/tree/virtio-mem
> 
> Kernel: https://github.com/davidhildenbrand/linux/tree/virtio-mem
> 
> Use the kernel in the guest and make sure to compile the virtio-mem
> driver into the kernel (CONFIG_VIRTIO_MEM=y). A host kernel patch is
> contained to allow atomic resize of KVM memory regions, however it is
> pretty much untested.
> 
> 
> 1. Starting a guest with virtio-mem memory:
>    We will create a guest with 2 NUMA nodes and 4GB of "boot + DMA"
>    memory. This memory is visible also to guests without virtio-mem.
>    Also, we will add 4GB to NUMA node 0 and 3GB to NUMA node 1 using
>    virtio-mem. We allow both virtio-mem devices to grow up to 8GB. The
>    last 4 lines are the important part.
> 
> --> qemu/x86_64-softmmu/qemu-system-x86_64 \
> 	--enable-kvm \
> 	-m 4G,maxmem=20G \
> 	-smp sockets=2,cores=2 \
> 	-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
> 	-machine pc \
> 	-kernel linux/arch/x86_64/boot/bzImage \
> 	-nodefaults \
> 	-chardev stdio,id=serial \
> 	-device isa-serial,chardev=serial \
> 	-append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0" \
> 	-initrd /boot/initramfs-4.10.8-200.fc25.x86_64.img \
> 	-chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
> 	-mon chardev=monitor,mode=readline \
> 	-object memory-backend-ram,id=mem0,size=4G,max-size=8G \
> 	-device virtio-mem-pci,id=reg0,memdev=mem0,node=0 \
> 	-object memory-backend-ram,id=mem1,size=3G,max-size=8G \
> 	-device virtio-mem-pci,id=reg1,memdev=mem1,node=1
> 
> 2. Listing current memory assignment:
> 
> --> (qemu) info memory-devices
> 	Memory device [virtio-mem]: "reg0"
> 	  addr: 0x140000000
> 	  node: 0
> 	  size: 4294967296
> 	  max-size: 8589934592
> 	  memdev: /objects/mem0
> 	Memory device [virtio-mem]: "reg1"
> 	  addr: 0x340000000
> 	  node: 1
> 	  size: 3221225472
> 	  max-size: 8589934592
> 	  memdev: /objects/mem1
> --> (qemu) info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 6144 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> 
> 3. Resize a virtio-mem device: Unplugging memory.
>    Setting reg0 to 2G (remove 2G from NUMA node 0)
> 
> --> (qemu) virtio-mem reg0 2048
> 	virtio-mem reg0 2048
> --> (qemu) info numa
> 	info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 4096 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> 
> 4. Resize a virtio-mem device: Plugging memory
>    Setting reg0 to 8G (adding 6G to NUMA node 0) will replug 2G and plug
>    4G, automatically re-sizing the memory area. You might experience
>    random crashes at this point if the host kernel missed a KVM patch
>    (as the memory slot is not re-sized in an atomic fashion).
> 
> --> (qemu) virtio-mem reg0 8192
> 	virtio-mem reg0 8192
> --> (qemu) info numa
> 	info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 10240 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> 
> 5. Resize a virtio-mem device: Try to unplug all memory.
>    Setting reg0 to 0G (removing 8G from NUMA node 0) will not work. The
>    guest will not be able to unplug all memory. In my example, 164M
>    cannot be unplugged (out of memory).
> 
> --> (qemu) virtio-mem reg0 0
> 	virtio-mem reg0 0
> --> (qemu) info numa
> 	info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 2212 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> --> (qemu) info virtio-mem reg0
> 	info virtio-mem reg0
> 	Status: ready
> 	Request status: vm-oom
> 	Page size: 2097152 bytes
> --> (qemu) info memory-devices
> 	Memory device [virtio-mem]: "reg0"
> 	  addr: 0x140000000
> 	  node: 0
> 	  size: 171966464
> 	  max-size: 8589934592
> 	  memdev: /objects/mem0
> 	Memory device [virtio-mem]: "reg1"
> 	  addr: 0x340000000
> 	  node: 1
> 	  size: 3221225472
> 	  max-size: 8589934592
> 	  memdev: /objects/mem1
> 
> At any point, we can migrate our guest without having to care about
> modifying the QEMU command line on the target side. Simply start the
> target e.g. with an additional '-incoming "exec: cat IMAGE"' and you're
> done.
> 
> ------------------------------------------------------------------------
> VI. Problems to solve / things to sort out / missing in prototype
> ------------------------------------------------------------------------
> 
> General:
> - We need an async job API to send the unplug/replug/plug requests to
>   the guest and query the state. [medium/hard]
> - Handle various alignment problems. [medium]
> - We need a virtio spec
> 
> Relevant for plug:
> - Resize QEMU memory regions while the guest is running (esp. grow).
>   While I implemented a demo solution for KVM memory slots, something
>   similar would be needed for vhost. Re-sizing of memory slots has to be
>   an atomic operation. [medium]
> - NUMA: Most probably the NUMA node should not be part of the virtio-mem
>   device, this should rather be indicated via e.g. ACPI. [medium]
> - x86: Add the complete possible memory to the e820 map as reserved.
>   [medium]
> - x86/powerpc/...: Indicate to which NUMA node the memory belongs using
>   ACPI. [medium]
> - x86/powerpc/...: Share address space with ordinary DIMMS/NVDIMMs, for
>   now this is blocked for simplicity. [medium/hard]
> - If the bitmaps become too big, migrate them like memory. [medium]
> 
> Relevant for unplug:
> - Allocate memory in Linux from a specific memory range. Windows has a
>   nice interface for that (at least it looks nice when reading the API).
>   This could be done using fake NUMA nodes or a new ZONE. My prototype
>   just uses a very ugly hack. [very hard]
> - Use userfaultfd WP (write-protect) instead of mprotect. Especially,
>   have multiple userfaultfd users in QEMU at a time (postcopy).
>   [medium/hard]
> 
> Stuff for the future:
> - Huge pages are problematic (no ZERO page support). This might not be
>   trivial to support. [hard/very hard]
> - Try to free struct pages, to avoid the 1.6% overhead [very very hard]
> 
> 
> ------------------------------------------------------------------------
> VII. Questions
> ------------------------------------------------------------------------
> 
> To get unplug working properly, it will require quite some effort,
> that's why I want to get some basic feedback before continuing working
> on a RFC implementation + RFC virtio spec.
> 
> a) Did I miss anything important? Are there any ultimate blockers that I
>    ignored? Any concepts that are broken?
> 
> b) Are there any alternatives? Any modifications that could make life
>    easier while still taking care of the requirements?
> 
> c) Are there other use cases we should care about and focus on?
> 
> d) Am I missing any requirements? What else could be important for
>    !cloud and cloud?
> 
> e) Are there any possible solutions to the allocator problem (allocating
>    memory from a specific memory area)? Please speak up!
> 
> f) Anything unclear?
> 
> g) Any feelings about this? Yay or nay?
> 
> 
> As you reached this point: Thanks for having a look!!! Highly appreciated!
> 
> 
> ------------------------------------------------------------------------
> VIII. Q&A
> ------------------------------------------------------------------------
> 
> ---
> Q: What's the problem with ordinary memory hot(un)plug?
> 
> A: 1. We can only unplug in the granularity we plugged. So we have to
>       know in advance, how much memory we want to remove later on. If we
>       plug a 2G dimm, we can only unplug a 2G dimm.
>    2. We might run out of memory slots. Although very unlikely, this
>       would strike if we try to always plug small modules in order to be
>       able to unplug again (e.g. loads of 128MB modules).
>    3. Any locked page in the guest can hinder us from unplugging a dimm.
>       Even if memory was onlined as online_movable, a single locked page
>       can hinder us from unplugging that memory dimm.
>    4. Memory has to be onlined as online_movable. If we don't put that
>       memory into the movable zone, any non-movable kernel allocation
>       could end up on it, turning the complete dimm unpluggable. As
>       certain allocations cannot go into the movable zone (e.g. page
>       tables), the ratio between online_movable/online memory depends on
>       the workload in the guest. Ratios of 50%-70% are usually fine.
>       But it could happen, that there is plenty of memory available,
>       but kernel allocations fail. (source: Andrea Arcangeli)
>    5. Unplugging might require several attempts. It takes some time to
>       migrate all memory from the dimm. At that point, it is then not
>       really obvious why it failed, and whether it could ever succeed.
>    6. Windows does support memory hotplug but not memory hotunplug. So
>       this could be a way to support it also for Windows.
> ---
> Q: Will this work with Windows?
> 
> A: Most probably not in the current form. Memory has to be at least
>    added to the e820 map and ACPI (NUMA). The Hyper-V balloon is also able to
>    hotadd memory using a paravirtualized interface, so there are very
>    good chances that this will work. But we won't know for sure until we
>    also start prototyping.
> ---
> Q: How does this compare to virtio-balloon?
> 
> A: In contrast to virtio-balloon, virtio-mem
>    1. Supports multiple page sizes, even different ones for different
>       virtio-mem devices in a guest.
>    2. Is NUMA aware.
>    3. Is able to add more memory.
>    4. Doesn't work on all memory, but only on the managed one.
>    5. Has guarantees. There is no way for the guest to reclaim memory.
> ---
> Q: How does this compare to XEN balloon?
> 
> A: XEN balloon also has a way to hotplug new memory. However, on a
>    reboot, the guest will "see" more memory than it actually has.
>    Compared to XEN balloon, virtio-mem:
>    1. Supports multiple page sizes.
>    2. Is NUMA aware.
>    3. The guest can survive a reboot into a system without the guest
>       driver. If the XEN guest driver doesn't come up, the guest will
>       get killed once it touches too much memory.
>    4. Reboots don't require any hacks.
>    5. The guest knows which memory is special. And it remains special
>       during a reboot. Hotplugged memory does not suddenly become base
>       memory. The balloon mechanism will only work on a specific memory
>       area.
> ---
> Q: How does this compare to Hyper-V balloon?
> 
> A: Based on the code from the Linux Hyper-V balloon driver, I can say
>    that Hyper-V also has a way to hotplug new memory. However, memory
>    will remain plugged on a reboot. Therefore, the guest will see more
>    memory than the hypervisor actually wants to assign to it.
>    Virtio-mem in contrast:
>    1. Supports multiple page sizes.
>    2. Is NUMA aware.
>    3. I have no idea what happens under Hyper-v when
>       a) rebooting into a guest without a fitting guest driver
>       b) kexec() touches all memory
>       c) the guest misbehaves
>    4. The guest knows which memory is special. And it remains special
>       during a reboot. Hotplugged memory does not suddenly become base
>       memory. The balloon mechanism will only work on a specific memory
>       area.
>    In general, it looks like the hypervisor has to deal with malicious
>    guests trying to access more memory than desired by providing enough
>    swap space.
> ---
> Q: How is virtio-mem NUMA aware?
> 
> A: Each virtio-mem device belongs exactly to one NUMA node (if NUMA is
>    enabled). As we can resize these regions separately, we can control
>    from/to which node to remove/add memory.
> ---
> Q: Why do we need support for multiple page sizes?
> 
> A: If huge pages are used in the host, we can only guarantee that they
>    are not accessible by the guest anymore, if the guest gives up memory
>    in this granularity. We prepare for that. Also, powerpc can have 64k
>    pages in the host but 4k pages in the guest. So the guest must only
>    give up 64k chunks. In addition, unplugging 4k pages might be bad
>    when it comes to fragmentation. My prototype currently uses 256k. We
>    can make this configurable - and it can vary for each virtio-mem
>    device.
> ---
> Q: What are the limitations with paravirtualized memory hotplug?
> 
> A: The same as for DIMM based hotplug, but we don't run out of any
>    memory/ACPI slots. E.g. on x86 Linux, only 128MB chunks can be
>    hotplugged, on x86 Windows it's 2MB. In addition, of course we
>    have to take care of maximum address limits in the guest. The idea
>    is to communicate these limits to the hypervisor via virtio-mem,
>    to give hints when trying to add/remove memory.
> ---
> Q: Why not simply unplug *any* memory like virtio-balloon does?
> 
> A: This could be done and a previous prototype did it like that.
>    However, there are some points to consider here.
>    1. If we combine this with ordinary memory hotplug (DIMM), we most
>       likely won't be able to unplug DIMMs anymore as virtio-mem memory
>       gets "allocated" on these.
>    2. All guests using virtio-mem cannot use huge pages as backing
>       storage at all (as virtio-mem only supports anonymous pages).
>    3. We need to track unplugged memory for the complete address space,
>       so we need a global state in QEMU. Bitmaps get bigger. We will not
>       be able to dynamically grow the bitmaps for a virtio-mem device.
>    4. Resolving/checking memory to be unplugged gets significantly
>       harder. How should the guest know which memory it can unplug for a
>       specific virtio-mem device? E.g. if NUMA is active, only that NUMA
>       node to which a virtio-mem device belongs can be used.
>    5. We will need userfaultfd handler for the complete address space,
>       not just for the virtio-mem managed memory.
>       Especially, if somebody hotplugs a DIMM, we dynamically will have
>       to enable the userfaultfd handler.
>    6. What shall we do if somebody hotplugs a DIMM with huge pages? How
>       should we tell the guest, that this memory cannot be used for
>       unplugging?
>    In summary: This concept is way cleaner, but also harder to
>    implement.
> ---
> Q: Why not reuse virtio-balloon?
> 
> A: virtio-balloon is for cooperative memory management. It has a fixed
>    page size and will deflate in certain situations. Any change we
>    introduce will break backwards compatibility. virtio-balloon was not
>    designed to give guarantees. Nobody can hinder the guest from
>    deflating/reusing inflated memory. In addition, it might make perfect
>    sense to have both, virtio-balloon and virtio-mem at the same time,
>    especially looking at the DEFLATE_ON_OOM or STATS features of
>    virtio-balloon. While virtio-mem is all about guarantees, virtio-
>    balloon is about cooperation.
> ---
> Q: Why not reuse acpi hotplug?
> 
> A: We can easily run out of slots, migration in QEMU will just be
>    horrible and we don't want to bind virtio* to architecture specific
>    technologies.
>    E.g. thinking about s390x - no ACPI. Also, mixing an ACPI driver with
>    a virtio-driver sounds very weird. If the virtio-driver performs the
>    hotplug itself, we might later perform some extra tricks: e.g.
>    actually unplug certain regions to give up some struct pages.
> 
>    We want to manage the way memory is added/removed completely in QEMU.
>    We cannot simply add new devices from within QEMU and expect that
>    migration in QEMU will work.
> ---
> Q: Why do we need resizable memory regions?
> 
> A: Migration in QEMU is special. Any device we have on our source VM has
>    to already be around on our target VM. So simply creating random
>    devices internally in QEMU is not going to work. The concept of
>    resizable memory regions in QEMU already exists and is part of the
>    migration protocol. Before memory is migrated, the memory is resized.
>    So in essence, this makes migration support _a lot_ easier.
> 
>    In addition, we won't run in any slot number restriction when
>    automatically managing how to add memory in QEMU.
> ---
> Q: Why do we have to resize memory regions on a reboot?
> 
> A: We have to compensate all memory that has been unplugged for that
>    area by shrinking it, so that a fresh guest can use all memory when
>    initializing the virtio-mem device.
> ---
> Q: Why do we need userfaultfd?
> 
> A: mprotect() will create a lot of VMAs in the kernel. This will degrade
>    performance and might even fail at one point. userfaultfd avoids this
>    by not creating a new VMA for every protected range. userfaultfd WP
>    is currently still under development and suffers from false positives
>    that make it currently impossible to properly integrate this into the
>    prototype.
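
For reference, a rough sketch of how registering a memory range for
write-protect faults could look from QEMU's side. The flags and structs
below are taken from the userfaultfd uAPI as it later stabilized in
linux/userfaultfd.h; since WP mode was still under development at the
time, treat the exact interface as an assumption rather than a given:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int wp_range(void *addr, size_t len)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) ||
	    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;
	/* Writes to [addr, addr + len) now raise uffd events instead of
	 * requiring one mprotect()ed VMA per protected range. */
	return uffd;
}
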
> ---
> Q: Why do we have to allow reading unplugged memory?
> 
> A: E.g. if the guest crashes and wants to write a memory dump, it will
>    blindly access all memory. While we could find ways to fixup kexec,
>    Windows dumps might be more problematic. Allowing the guest to read
>    all memory (resulting in reading all 0's) saves us from a lot of
>    trouble.
> 
>    The downside is, that page tables full of zero pages might be
>    created. (we might be able to find ways to optimize this)
> ---
> Q: Will this work with postcopy live-migration?
> 
> A: Not in the current form. And it doesn't really make sense to spend
>    time on it as long as we don't use userfaultfd. Combining both
>    handlers will be interesting. It can be done with some effort on the
>    QEMU side.
> ---
> Q: What's the problem with shmem/hugetlbfs?
> 
> A: We currently rely on the ZERO page to be mapped when the guest tries
>    to read unplugged memory. For shmem/hugetlbfs, there is no ZERO page,
>    so read access would result in memory getting populated. We could
>    either introduce an explicit ZERO page, or manage it using one dummy
>    ZERO page (using regular userfaultfd, allowing only one such page to be
>    mapped at a time). For now, only anonymous memory.
> ---
> Q: Ripping out random page ranges, won't this fragment our guest memory?
> 
> A: Yes, but depending on the virtio-mem page size, this might be more or
>    less problematic. The smaller the virtio-mem page size, the more we
>    fragment and make small allocations fail. The bigger the virtio-mem
>    page size, the higher the chance that we can't unplug any more
>    memory.
> ---
> Q: Why can't we use memory compaction like virtio-balloon?
> 
> A: If the virtio-mem page size > PAGE_SIZE, we can't do ordinary
>    page migration, migration would have to be done in blocks. We could
>    later add a guest->host virtqueue, via which the guest can
>    "exchange" memory ranges. However, also mm has to support this kind
>    of migration. So it is not completely out of scope, but will require
>    quite some work.
> ---
> Q: Do we really need yet another paravirtualized interface for this?
> 
> A: You tell me :)
> ---
> 
> Thanks,
> 
> David
> 


-- 

Thanks,

David


* Re: [RFC] virtio-mem: paravirtualized memory
  2017-06-16 14:20 [RFC] virtio-mem: paravirtualized memory David Hildenbrand
                   ` (2 preceding siblings ...)
  2017-07-25  8:21 ` David Hildenbrand
@ 2017-07-28 11:09 ` David Hildenbrand
  2017-07-28 15:16   ` Dan Williams
  3 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2017-07-28 11:09 UTC (permalink / raw)
  To: KVM, virtualization, qemu-devel, linux-mm
  Cc: Michael S. Tsirkin, Andrea Arcangeli, Pankaj Gupta, Rik van Riel,
	Dan Williams

Btw, I am thinking about the following addition to the concept:

1. Add a type to each virtio-mem device.

This describes the type of the memory region we expose to the guest.
Initially, we could have RAM and RAM_HUGE. The latter would be
interesting in case we ever want to expose different RAM types to a
guest: the guest would know that this memory is backed by huge pages
(and could conclude that it might be faster and is also best used with
huge pages in the guest). But we could also simply start with RAM only.


2. Also adding a guest -> host command queue.

This can be used to request something from the host or to notify it
about something. As written in the original proposal, for ordinary RAM
this could be used to request more/less memory from within the guest.


This might come in handy for other memory regions we just want to expose
to the guest via a paravirtualized interface. The resize features
(adding/removing memory) might not apply to these, but we can simply
restrict that to certain types.

E.g. if we want to expose a PMEM memory region to a guest using a
paravirtualized interface (or anything else that can be mapped into
guest memory in the form of memory regions), we could use this. The
guest->host control queue can be used for tasks that typically cannot be
done when modeling something like this using ordinary ACPI DIMMs
(flushing etc.).
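
To make the idea a bit more tangible, here is a purely hypothetical
sketch of what a typed device config and a guest->host request could
look like. There is no spec yet; every name, field and the layout below
are invented for illustration only:

#include <linux/types.h>

#define VIRTIO_MEM_TYPE_RAM		0  /* ordinary RAM */
#define VIRTIO_MEM_TYPE_RAM_HUGE	1  /* RAM backed by huge pages in the host */

struct virtio_mem_config {
	__le64 region_addr;	/* start of the managed memory area */
	__le64 region_size;	/* current size, may shrink on reboot */
	__le64 region_max_size;	/* maximum size the area may grow to */
	__le64 page_size;	/* (un)plug granularity, e.g. 256k */
	__le16 node_id;		/* NUMA node, if not indicated via ACPI */
	__le16 type;		/* VIRTIO_MEM_TYPE_* */
	__le32 reserved;
};

/* Request sent on the guest->host command queue. */
struct virtio_mem_guest_req {
	__le16 cmd;		/* e.g. REQUEST_MORE, REQUEST_LESS, FLUSH */
	__le16 reserved;
	__le32 padding;
	__le64 value;		/* e.g. amount of memory requested */
};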

CCing a couple of people who have just been thinking about something
like this in the context of fake DAX.


On 16.06.2017 16:20, David Hildenbrand wrote:
> Hi,
> 
> this is an idea that is based on Andrea Arcangeli's original idea to
> host enforce guest access to memory given up using virtio-balloon using
> userfaultfd in the hypervisor. While looking into the details, I
> realized that host-enforcing virtio-balloon would result in way too many
> problems (mainly backwards compatibility) and would also have some
> conceptual restrictions that I want to avoid. So I developed the idea of
> virtio-mem - "paravirtualized memory".
> 
> The basic idea is to add memory to the guest via a paravirtualized
> mechanism (so the guest can hotplug it) and remove memory via a
> mechanism similar to a balloon. This avoids having to online memory as
> "online-movable" in the guest and allows more fain grained memory
> hot(un)plug. In addition, migrating QEMU guests after adding/removing
> memory gets a lot easier.
> 
> Actually, this has a lot in common with the XEN balloon or the Hyper-V
> balloon (namely: paravirtualized hotplug and ballooning), but is very
> different when going into the details.
> 
> Getting this all implemented properly will take quite some effort,
> that's why I want to get some early feedback regarding the general
> concept. If you have some alternative ideas, or ideas how to modify this
> concept, I'll be happy to discuss. Just please make sure to have a look
> at the requirements first.
> 
> -----------------------------------------------------------------------
> 0. Outline:
> -----------------------------------------------------------------------
> - I.    General concept
> - II.   Use cases
> - III.  Identified requirements
> - IV.   Possible modifications
> - V.    Prototype
> - VI.   Problems to solve / things to sort out / missing in prototype
> - VII.  Questions
> - VIII. Q&A
> 
> ------------------------------------------------------------------------
> I. General concept
> ------------------------------------------------------------------------
> 
> We expose memory regions to the guest via a paravirtualize interface. So
> instead of e.g. a DIMM on x86, such memory is not anounced via ACPI.
> Unmodified guests (without a virtio-mem driver) won't be able to see/use
> this memory. The virtio-mem guest driver is needed to detect and manage
> these memory areas. What makes this memory special is that it can grow
> while the guest is running ("plug memory") and might shrink on a reboot
> (to compensate "unplugged" memory - see next paragraph). Each virtio-mem
> device manages exactly one such memory area. By having multiple ones
> assigned to different NUMA nodes, we can modify memory on a NUMA basis.
> 
> Of course, we cannot shrink these memory areas while the guest is
> running. To be able to unplug memory, we do something like a balloon
> does, however limited to this very memory area that belongs to the
> virtio-mem device. The guest will hand back small chunks of memory. If
> we want to add memory to the guest, we first "replug" memory that has
> previously been given up by the guest, before we grow our memory area.
> 
> On a reboot, we want to avoid any memory holes in our memory, therefore
> we resize our memory area (shrink it) to compensate memory that has been
> unplugged. This highly simplifies hotplugging memory in the guest (
> hotplugging memory with random memory holes is basically impossible).
> 
> We have to make sure that all memory chunks the guest hands back on
> unplug requests will not consume memory in the host. We do this by
> write-protecting that memory chunk in the host and then dropping the
> backing pages. The guest can read this memory (reading from the ZERO
> page) but no longer write to it. For now, this will only work on
> anonymous memory. We will use userfaultfd WP (write-protect mode) to
> avoid creating too many VMAs. Huge pages will require more effort (no
> explicit ZERO page).
> 
> As we unplug memory on a fine grained basis (and e.g. not on
> a complete DIMM basis), there is no need to online virtio-mem memory
> as online-movable. Also, memory unplug for Windows might become possible
> that way. You can find more details in the Q/A section below.
> 
> 
> The important points here are:
> - After a reboot, every memory the guest sees can be accessed and used.
>   (in contrast to e.g. the XEN balloon, see Q/A for more details)
> - Rebooting into an unmodified guest will not result in random
>   crashes. The guest will simply not be able to use all memory without a
>   virtio-mem driver.
> - Adding/Removing memory will not require modifying the QEMU command
>   line on the migration target. Migration simply works (re-sizing memory
>   areas is already part of the migration protocol!). Essentially, this
>   makes adding/removing memory to/from a guest way simpler and
>   independent of the underlying architecture. If the guest OS can online
>   new memory, we can add more memory this way.
> - Unplugged memory can be read. This allows e.g. kexec() without nasty
>   modifications. Especially relevant for Windows' kexec() variant.
> - It will play nicely with other things mapped into the address space,
>   e.g. also other DIMMs or NVDIMM. virtio-mem will only work on its own
>   memory region (in contrast e.g. to virtio-balloon). Especially it will
>   not give up ("allocate") memory on other DIMMs, hindering them from
>   getting unplugged the ACPI way.
> - We can add/remove memory without running into KVM memory slot or other
>   (e.g. ACPI slot) restrictions. The granularity in which we can add
>   memory is only limited by the granularity the guest can add memory
>   (e.g. Windows 2MB, Linux on x86 128MB for now).
> - By not having to online memory as online-movable we don't run into any
>   memory restrictions in the guest. E.g. page tables can only be created
>   on !movable memory. So while there might be plenty of online-movable
>   memory left, allocation of page tables might fail. See Q/A for more
>   details.
> - The admin will not have to set memory offline in the guest first in
>   order to unplug it. virtio-mem will handle this internally and not
>   require interaction with an admin or a guest-agent.
> 
> Important restrictions of this concept:
> - Guests without a virtio-mem guest driver can't see that memory.
> - We will always require some boot memory that cannot get unplugged.
>   Also, virtio-mem memory (as all other hotplugged memory) cannot become
>   DMA memory under Linux. So the boot memory also defines the amount of
>   DMA memory.
> - Hibernation/Sleep+Restore while virtio-mem is active is not supported.
>   On a reboot/fresh start, the size of the virtio-mem memory area might
>   change and a running/loaded guest can't deal with that.
> - Unplug support for hugetlbfs/shmem will take quite some time to
>   support. The larger the used page size, the harder for the guest to
>   give up memory. We can still use DIMM based hotplug for that.
> - Gigantic huge pages are problematic, as the guest would have to give up
>   e.g. 1GB chunks. This is not expected to be supported. We can still
>   use DIMM based hotplug for setups that require that.
> - For any memory we unplug using this mechanism, for now we will still
>   have struct pages allocated in the guest. This means that roughly
>   1.6% of unplugged memory (one 64-byte struct page per 4 KiB page)
>   will still be allocated in the guest and remain unusable.
> 
> 
> ------------------------------------------------------------------------
> II. Use cases
> ------------------------------------------------------------------------
> 
> Of course, we want to deny any access to unplugged memory. In contrast
> to virtio-balloon or other similar ideas (free page hinting), this is
> not about cooperative memory management, but about guarantees. The idea
> is, that both concepts can coexist.
> 
> So one use case is of course cloud providers. Customers can add
> or remove memory to/from a VM without having to care about how to
> online memory or in which amount to add memory in the first place in
> order to remove it again. In cloud environments, we care about
> guarantees. E.g. for virtio-balloon a malicious guest can simply reuse
> any deflated memory, and the hypervisor can't even tell if the guest is
> malicious (e.g. a harmless guest reboot might look like a malicious
> guest). For virtio-mem, we guarantee that the guest can't reuse any
> memory that it previously gave up.
> 
> But also for ordinary VMs (!cloud), this avoids having to online memory
> in the guest as online-movable and therefore avoids running into
> allocation problems if there are e.g. many processes needing many page
> tables on !movable memory. Also here, we don't have to know how much
> memory we want to remove at some point in the future before we add
> memory. (e.g. if we add a 128GB DIMM, we can only remove that 128GB
> DIMM - if we are lucky).
> 
> We might be able to support memory unplug for Windows (as for now,
> ACPI unplug is not supported), more details have to be clarified.
> 
> As we can grow these memory areas quite easily, another use case might
> be guests that tell us they need more memory. Thinking about VMs to
> protect containers, there seems to be the general problem that we don't
> know how much memory the container will actually need. We could
> implement a mechanism (in virtio-mem or guest driver), by which the
> guest can request more memory. If the hypervisor agrees, it can simply
> give the guest more memory. As this is all handled within QEMU,
> migration is not a problem. Adding more memory will not result in new
> DIMM devices.
> 
> 
> ------------------------------------------------------------------------
> III. Identified requirements
> ------------------------------------------------------------------------
> 
> I considered the following requirements.
> 
> NUMA aware:
>   We want to be able to add/remove memory to/from NUMA nodes.
> Different page-size support:
>   We want to be able to support different page sizes, e.g. because of
>   huge pages in the hypervisor or because host and guest have different
>   page sizes (powerpc 64k vs 4k).
> Guarantees:
>   There has to be no way the guest can reuse unplugged memory without
>   host consent. Still, we could implement a mechanism for the guest to
>   request more memory. The hypervisor then has to decide how it wants to
>   handle that request.
> Architecture independence:
>   We want this to work independently of other technologies bound to
>   specific architectures, like ACPI.
> Avoid online-movable:
>   We don't want to have to online memory in the guest as online-movable
>   just to be able to unplug (at least parts of) it again.
> Migration support:
>   Be able to migrate without too much hassle. Especially, to handle it
>   completely within QEMU (not having to add new devices to the target
>   command line).
> Windows support:
>   We definitely want to support Windows guests in the long run.
> Coexistence with other hotplug mechanisms:
>   Allow to hotplug DIMMs / NVDIMMs, therefore to share the "hotplug"
>   address space part with other devices.
> Backwards compatibility:
>   Don't break if rebooting into an unmodified guest after having
>   unplugged some memory. All memory a freshly booted guest sees must not
>   contain memory holes that will crash it if it tries to access it.
> 
> 
> ------------------------------------------------------------------------
> IV. Possible modifications
> ------------------------------------------------------------------------
> 
> Adding a guest->host request mechanism would make sense to e.g. be able
> to request further memory from the hypervisor directly from the guest.
> 
> Adding memory will be much easier than removing memory. We can split
> this up and first introduce "adding memory" and later add "removing
> memory". Removing memory will require userfaultfd WP in the hypervisor
> and a special fancy allocator in the guest. So this will take some time.
> 
> Adding a mechanism to trade in memory blocks might make sense to allow
> some sort of memory compaction. However I expect this to be highly
> complicated and basically not feasible.
> 
> Being able to unplug "any" memory instead of only memory
> belonging to the virtio-mem device sounds tempting (and simplifies
> certain parts), however it has a couple of side effects I want to avoid.
> You can read more about that in the Q/A below.
> 
> 
> ------------------------------------------------------------------------
> V. Prototype
> ------------------------------------------------------------------------
> 
> To identify potential problems I developed a very basic prototype. It
> is incomplete, full of hacks and most probably broken in various ways.
> I used it only in the given setup, only on x86 and only with an initrd.
> 
> It uses a fixed page size of 256k for now, has a very ugly allocator
> hack in the guest, the virtio protocol really needs some tuning and
> an async job interface towards the user is missing. Instead of using
> userfaultfd WP, I am using simply mprotect() in this prototype. Basic
> migration works (not involving userfaultfd).
> 
> Please, don't even try to review it (that's why I will also not attach
> any patches to this mail :) ), just use this as an inspiration what this
> could look like. You can find the latest hack at:
> 
> QEMU: https://github.com/davidhildenbrand/qemu/tree/virtio-mem
> 
> Kernel: https://github.com/davidhildenbrand/linux/tree/virtio-mem
> 
> Use the kernel in the guest and make sure to compile the virtio-mem
> driver into the kernel (CONFIG_VIRTIO_MEM=y). A host kernel patch is
> contained to allow atomic resize of KVM memory regions, however it is
> pretty much untested.
> 
> 
> 1. Starting a guest with virtio-mem memory:
>    We will create a guest with 2 NUMA nodes and 4GB of "boot + DMA"
>    memory. This memory is visible also to guests without virtio-mem.
>    Also, we will add 4GB to NUMA node 0 and 3GB to NUMA node 1 using
>    virtio-mem. We allow both virtio-mem devices to grow up to 8GB. The
>    last 4 lines are the important part.
> 
> --> qemu/x86_64-softmmu/qemu-system-x86_64 \
> 	--enable-kvm \
> 	-m 4G,maxmem=20G \
> 	-smp sockets=2,cores=2 \
> 	-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
> 	-machine pc \
> 	-kernel linux/arch/x86_64/boot/bzImage \
> 	-nodefaults \
> 	-chardev stdio,id=serial \
> 	-device isa-serial,chardev=serial \
> 	-append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0" \
> 	-initrd /boot/initramfs-4.10.8-200.fc25.x86_64.img \
> 	-chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
> 	-mon chardev=monitor,mode=readline \
> 	-object memory-backend-ram,id=mem0,size=4G,max-size=8G \
> 	-device virtio-mem-pci,id=reg0,memdev=mem0,node=0 \
> 	-object memory-backend-ram,id=mem1,size=3G,max-size=8G \
> 	-device virtio-mem-pci,id=reg1,memdev=mem1,node=1
> 
> 2. Listing current memory assignment:
> 
> --> (qemu) info memory-devices
> 	Memory device [virtio-mem]: "reg0"
> 	  addr: 0x140000000
> 	  node: 0
> 	  size: 4294967296
> 	  max-size: 8589934592
> 	  memdev: /objects/mem0
> 	Memory device [virtio-mem]: "reg1"
> 	  addr: 0x340000000
> 	  node: 1
> 	  size: 3221225472
> 	  max-size: 8589934592
> 	  memdev: /objects/mem1
> --> (qemu) info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 6144 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> 
> 3. Resize a virtio-mem device: Unplugging memory.
>    Setting reg0 to 2G (remove 2G from NUMA node 0)
> 
> --> (qemu) virtio-mem reg0 2048
> 	virtio-mem reg0 2048
> --> (qemu) info numa
> 	info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 4096 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> 
> 4. Resize a virtio-mem device: Plugging memory
>    Setting reg0 to 8G (adding 6G to NUMA node 0) will replug 2G and plug
>    4G, automatically re-sizing the memory area. You might experience
>    random crashes at this point if the host kernel missed a KVM patch
>    (as the memory slot is not re-sized in an atomic fashion).
> 
> --> (qemu) virtio-mem reg0 8192
> 	virtio-mem reg0 8192
> --> (qemu) info numa
> 	info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 10240 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> 
> 5. Resize a virtio-mem device: Try to unplug all memory.
>    Setting reg0 to 0G (removing 8G from NUMA node 0) will not work. The
>    guest will not be able to unplug all memory. In my example, 164M
>    cannot be unplugged (out of memory).
> 
> --> (qemu) virtio-mem reg0 0
> 	virtio-mem reg0 0
> --> (qemu) info numa
> 	info numa
> 	2 nodes
> 	node 0 cpus: 0 1
> 	node 0 size: 2212 MB
> 	node 1 cpus: 2 3
> 	node 1 size: 5120 MB
> --> (qemu) info virtio-mem reg0
> 	info virtio-mem reg0
> 	Status: ready
> 	Request status: vm-oom
> 	Page size: 2097152 bytes
> --> (qemu) info memory-devices
> 	Memory device [virtio-mem]: "reg0"
> 	  addr: 0x140000000
> 	  node: 0
> 	  size: 171966464
> 	  max-size: 8589934592
> 	  memdev: /objects/mem0
> 	Memory device [virtio-mem]: "reg1"
> 	  addr: 0x340000000
> 	  node: 1
> 	  size: 3221225472
> 	  max-size: 8589934592
> 	  memdev: /objects/mem1
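> 
>    (reg0's remaining size of 171966464 bytes is exactly the 164M that
>    could not be unplugged - 82 chunks of 2097152 bytes - matching the
>    2212 MB shown for node 0 above.)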
> 
> At any point, we can migrate our guest without having to care about
> modifying the QEMU command line on the target side. Simply start the
> target e.g. with an additional '-incoming "exec: cat IMAGE"' and you're
> done.
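> 
>    For example, to migrate into a file and start from it again (IMAGE is
>    just a placeholder path; treat this as a rough sketch):
> 
> --> (qemu) migrate "exec: cat > IMAGE"
> 
>    and on the target, reuse the command line from 1. plus:
> 
>         -incoming "exec: cat IMAGE"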
> 
> ------------------------------------------------------------------------
> VI. Problems to solve / things to sort out / missing in prototype
> ------------------------------------------------------------------------
> 
> General:
> - We need an async job API to send the unplug/replug/plug requests to
>   the guest and query the state. [medium/hard]
> - Handle various alignment problems. [medium]
> - We need a virtio spec
> 
> Relevant for plug:
> - Resize QEMU memory regions while the guest is running (esp. grow).
>   While I implemented a demo solution for KVM memory slots, something
>   similar would be needed for vhost. Re-sizing of memory slots has to be
>   an atomic operation. [medium]
> - NUMA: Most probably the NUMA node should not be part of the virtio-mem
>   device, this should rather be indicated via e.g. ACPI. [medium]
> - x86: Add the complete possible memory to the e820 map as reserved.
>   [medium]
> - x86/powerpc/...: Indicate to which NUMA node the memory belongs using
>   ACPI. [medium]
> - x86/powerpc/...: Share address space with ordinary DIMMs/NVDIMMs, for
>   now this is blocked for simplicity. [medium/hard]
> - If the bitmaps become too big, migrate them like memory. [medium]
> 
> Relevant for unplug:
> - Allocate memory in Linux from a specific memory range. Windows has a
>   nice interface for that (at least it looks nice when reading the API).
>   This could be done using fake NUMA nodes or a new ZONE. My prototype
>   just uses a very ugly hack. [very hard]
> - Use userfaultfd WP (write-protect) instead of mprotect. Especially,
>   support multiple userfaultfd users in QEMU at a time (postcopy).
>   [medium/hard]
> 
> Stuff for the future:
> - Huge pages are problematic (no ZERO page support). This might not be
>   trivial to support. [hard/very hard]
> - Try to free struct pages, to avoid the 1.6% overhead [very very hard]
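>   (The 1.6% is simply the per-page metadata: with a 64-byte struct page
>   (on x86-64) for every 4096-byte base page, roughly 1.56% of all managed
>   memory goes into struct pages.)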
> 
> 
> ------------------------------------------------------------------------
> VII. Questions
> ------------------------------------------------------------------------
> 
> Getting unplug to work properly will require quite some effort; that's
> why I want to get some basic feedback before continuing to work on an
> RFC implementation + RFC virtio spec.
> 
> a) Did I miss anything important? Are there any ultimate blockers that I
>    ignored? Any concepts that are broken?
> 
> b) Are there any alternatives? Any modifications that could make life
>    easier while still taking care of the requirements?
> 
> c) Are there other use cases we should care about and focus on?
> 
> d) Am I missing any requirements? What else could be important for
>    !cloud and cloud?
> 
> e) Are there any possible solutions to the allocator problem (allocating
>    memory from a specific memory area)? Please speak up!
> 
> f) Anything unclear?
> 
> g) Any feelings about this? Yay or nay?
> 
> 
> As you reached this point: Thanks for having a look!!! Highly appreciated!
> 
> 
> ------------------------------------------------------------------------
> VIII. Q&A
> ------------------------------------------------------------------------
> 
> ---
> Q: What's the problem with ordinary memory hot(un)plug?
> 
> A: 1. We can only unplug in the granularity we plugged. So we have to
>       know in advance, how much memory we want to remove later on. If we
>       plug a 2G dimm, we can only unplug a 2G dimm.
>    2. We might run out of memory slots. Although very unlikely, this
>       would strike if we always try to plug small modules in order to be
>       able to unplug them again later (e.g. loads of 128MB modules).
>    3. Any locked page in the guest can hinder us from unplugging a DIMM.
>       Even if memory was onlined as online_movable, a single locked page
>       can block removal of that whole DIMM.
>    4. Memory has to be onlined as online_movable. If we don't put that
>       memory into the movable zone, any non-movable kernel allocation
>       could end up on it, turning the complete dimm unpluggable. As
>       certain allocations cannot go into the movable zone (e.g. page
>       tables), the ratio between online_movable/online memory depends on
>       the workload in the guest. Ratios of 50%-70% are usually fine.
>       But it could happen that there is plenty of memory available,
>       but kernel allocations fail. (source: Andrea Arcangeli)
>    5. Unplugging might require several attempts. It takes some time to
>       migrate all memory off the DIMM, and when it fails it is not
>       really obvious why, or whether it could ever succeed.
>    6. Windows does support memory hotplug but not memory hotunplug. So
>       this could be a way to also support unplug for Windows.
> ---
> Q: Will this work with Windows?
> 
> A: Most probably not in the current form. Memory has to be at least
>    added to the e820 map and ACPI (NUMA). Hyper-V balloon is also able to
>    hotadd memory using a paravirtualized interface, so there are very
>    good chances that this will work. But we won't know for sure until we
>    also start prototyping.
> ---
> Q: How does this compare to virtio-balloon?
> 
> A: In contrast to virtio-balloon, virtio-mem
>    1. Supports multiple page sizes, even different ones for different
>       virtio-mem devices in a guest.
>    2. Is NUMA aware.
>    3. Is able to add more memory.
>    4. Doesn't work on all memory, but only on the managed one.
>    5. Has guarantees. There is no way for the guest to reclaim memory.
> ---
> Q: How does this compare to XEN balloon?
> 
> A: XEN balloon also has a way to hotplug new memory. However, on a
>    reboot, the guest will "see" more memory than it actually has.
>    Compared to XEN balloon, virtio-mem:
>    1. Supports multiple page sizes.
>    2. Is NUMA aware.
>    3. The guest can survive a reboot into a system without the guest
>       driver. If the XEN guest driver doesn't come up, the guest will
>       get killed once it touches too much memory.
>    4. Reboots don't require any hacks.
>    5. The guest knows which memory is special. And it remains special
>       during a reboot. Hotplugged memory does not suddenly become base
>       memory. The balloon mechanism will only work on a specific memory
>       area.
> ---
> Q: How does this compare to Hyper-V balloon?
> 
> A: Based on the code from the Linux Hyper-V balloon driver, I can say
>    that Hyper-V also has a way to hotplug new memory. However, memory
>    will remain plugged on a reboot. Therefore, the guest will see more
>    memory than the hypervisor actually wants to assign to it.
>    Virtio-mem in contrast:
>    1. Supports multiple page sizes.
>    2. Is NUMA aware.
>    3. I have no idea what happens under Hyper-V when
>       a) rebooting into a guest without a fitting guest driver
>       b) kexec() touches all memory
>       c) the guest misbehaves
>    4. The guest knows which memory is special. And it remains special
>       during a reboot. Hotplugged memory does not suddenly become base
>       memory. The balloon mechanism will only work on a specific memory
>       area.
>    In general, it looks like the hypervisor deals with malicious guests
>    (guests trying to access more memory than desired) simply by providing
>    enough swap space.
> ---
> Q: How is virtio-mem NUMA aware?
> 
> A: Each virtio-mem device belongs exactly to one NUMA node (if NUMA is
>    enabled). As we can resize these regions separately, we can control
>    from/to which node to remove/add memory.
> ---
> Q: Why do we need support for multiple page sizes?
> 
> A: If huge pages are used in the host, we can only guarantee that they
>    are no longer accessible by the guest if the guest gives up memory
>    in that granularity. We prepare for that. Also, powerpc can have 64k
>    pages in the host but 4k pages in the guest. So the guest must only
>    give up 64k chunks. In addition, unplugging 4k pages might be bad
>    when it comes to fragmentation. My prototype currently uses 256k. We
>    can make this configurable - and it can vary for each virtio-mem
>    device.
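>    (With 64k host pages and 4k guest pages, for example, the guest has to
>    hand back 16 contiguous 4k pages before the host can actually free a
>    single 64k page.)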
> ---
> Q: What are the limitations with paravirtualized memory hotplug?
> 
> A: The same as for DIMM based hotplug, but we don't run out of any
>    memory/ACPI slots. E.g. on x86 Linux, only 128MB chunks can be
>    hotplugged, on x86 Windows it's 2MB. In addition, of course we
>    have to take care of maximum address limits in the guest. The idea
>    is to communicate these limits to the hypervisor via virtio-mem,
>    to give hints when trying to add/remove memory.
> ---
> Q: Why not simply unplug *any* memory like virtio-balloon does?
> 
> A: This could be done and a previous prototype did it like that.
>    However, there are some points to consider here.
>    1. If we combine this with ordinary memory hotplug (DIMM), we most
>       likely won't be able to unplug DIMMs anymore as virtio-mem memory
>       gets "allocated" on these.
>    2. All guests using virtio-mem cannot use huge pages as backing
>       storage at all (as virtio-mem only supports anonymous pages).
>    3. We need to track unplugged memory for the complete address space,
>       so we need a global state in QEMU. Bitmaps get bigger. We will not
>       be able to dynamically grow the bitmaps for a virtio-mem device.
>    4. Resolving/checking memory to be unplugged gets significantly
>       harder. How should the guest know which memory it can unplug for a
>       specific virtio-mem device? E.g. if NUMA is active, only the NUMA
>       node to which the virtio-mem device belongs can be used.
>    5. We will need a userfaultfd handler for the complete address space,
>       not just for the virtio-mem managed memory. Especially, if somebody
>       hotplugs a DIMM, we will have to enable the userfaultfd handler
>       dynamically.
>    6. What shall we do if somebody hotplugs a DIMM with huge pages? How
>       should we tell the guest that this memory cannot be used for
>       unplugging?
>    In summary: This concept is way cleaner, but also harder to
>    implement.
> ---
> Q: Why not reuse virtio-balloon?
> 
> A: virtio-balloon is for cooperative memory management. It has a fixed
>    page size and will deflate in certain situations. Any change we
>    introduce will break backwards compatibility. virtio-balloon was not
>    designed to give guarantees. Nobody can hinder the guest from
>    deflating/reusing inflated memory. In addition, it might make perfect
>    sense to have both virtio-balloon and virtio-mem at the same time,
>    especially looking at the DEFLATE_ON_OOM or STATS features of
>    virtio-balloon. While virtio-mem is all about guarantees, virtio-
>    balloon is about cooperation.
> ---
> Q: Why not reuse ACPI hotplug?
> 
> A: We can easily run out of slots, migration in QEMU will just be
>    horrible, and we don't want to bind virtio* to architecture-specific
>    technologies.
>    E.g. thinking about s390x - no ACPI. Also, mixing an ACPI driver with
>    a virtio-driver sounds very weird. If the virtio-driver performs the
>    hotplug itself, we might later perform some extra tricks: e.g.
>    actually unplug certain regions to give up some struct pages.
> 
>    We want to manage the way memory is added/removed completely in QEMU.
>    We cannot simply add new devices from within QEMU and expect that
>    migration in QEMU will work.
> ---
> Q: Why do we need resizable memory regions?
> 
> A: Migration in QEMU is special. Any device we have on our source VM has
>    to already be around on our target VM. So simply creating random
>    devices internally in QEMU is not going to work. The concept of
>    resizable memory regions in QEMU already exists and is part of the
>    migration protocol. Before memory is migrated, the memory is resized.
>    So in essence, this makes migration support _a lot_ easier.
> 
>    In addition, we won't run in any slot number restriction when
>    automatically managing how to add memory in QEMU.
> ---
> Q: Why do we have to resize memory regions on a reboot?
> 
> A: We have to compensate for all memory that has been unplugged from that
>    area by shrinking it, so that a fresh guest can use all memory when
>    initializing the virtio-mem device.
> ---
> Q: Why do we need userfaultfd?
> 
> A: mprotect() will create a lot of VMAs in the kernel. This will degrade
>    performance and might even fail at one point. userfaultfd avoids this
>    by not creating a new VMA for every protected range. userfaultfd WP
>    is currently still under development and suffers from false positives
>    that make it currently impossible to properly integrate this into the
>    prototype.
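> 
>    To illustrate the VMA problem, this is roughly what the prototype-style
>    mprotect() approach boils down to (a minimal sketch, not the actual
>    prototype code; the chunk size, helper names and the exact protection
>    flags are assumptions):
> 
>         #include <stdint.h>
>         #include <sys/mman.h>
> 
>         #define CHUNK_SIZE (256 * 1024) /* assumed virtio-mem page size */
> 
>         /* Drop the backing pages of an unplugged chunk and make it
>          * read-only: reads then fault in the shared ZERO page, writes are
>          * blocked. Every non-contiguous protected chunk splits the memdev
>          * mapping into more VMAs - exactly what userfaultfd WP would avoid. */
>         static int chunk_unplug(uint8_t *memdev_base, uint64_t chunk_idx)
>         {
>                 uint8_t *addr = memdev_base + chunk_idx * CHUNK_SIZE;
> 
>                 if (madvise(addr, CHUNK_SIZE, MADV_DONTNEED))
>                         return -1;
>                 return mprotect(addr, CHUNK_SIZE, PROT_READ);
>         }
> 
>         /* Make a replugged chunk fully accessible again. */
>         static int chunk_plug(uint8_t *memdev_base, uint64_t chunk_idx)
>         {
>                 return mprotect(memdev_base + chunk_idx * CHUNK_SIZE,
>                                 CHUNK_SIZE, PROT_READ | PROT_WRITE);
>         }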
> ---
> Q: Why do we have to allow reading unplugged memory?
> 
> A: E.g. if the guest crashes and wants to write a memory dump, it will
>    blindly access all memory. While we could find ways to fix up kexec,
>    Windows dumps might be more problematic. Allowing the guest to read
>    all memory (resulting in reading all 0's) saves us from a lot of
>    trouble.
> 
>    The downside is that page tables full of zero pages might be
>    created (we might be able to find ways to optimize this).
> ---
> Q: Will this work with postcopy live-migration?
> 
> A: Not in the current form. And it doesn't really make sense to spend
>    time on it as long as we don't use userfaultfd. Combining both
>    handlers will be interesting. It can be done with some effort on the
>    QEMU side.
> ---
> Q: What's the problem with shmem/hugetlbfs?
> 
> A: We currently rely on the ZERO page to be mapped when the guest tries
>    to read unplugged memory. For shmem/hugetlbfs, there is no ZERO page,
>    so read access would result in memory getting populated. We could
>    either introduce an explicit ZERO page, or manage it using one dummy
>    ZERO page (using regular userfaultfd, allowing only one such page to be
>    mapped at a time). For now, only anonymous memory.
> ---
> Q: Ripping out random page ranges, won't this fragment our guest memory?
> 
> A: Yes, but depending on the virtio-mem page size, this might be more or
>    less problematic. The smaller the virtio-mem page size, the more we
>    fragment and make small allocations fail. The bigger the virtio-mem
>    page size, the higher the chance that we can't unplug any more
>    memory.
> ---
> Q: Why can't we use memory compaction like virtio-balloon?
> 
> A: If the virtio-mem page size > PAGE_SIZE, we can't do ordinary
>    page migration; migration would have to be done in blocks. We could
>    later add a guest->host virtqueue, via which the guest can
>    "exchange" memory ranges. However, mm would also have to support this
>    kind of migration. So it is not completely out of scope, but will require
>    quite some work.
> ---
> Q: Do we really need yet another paravirtualized interface for this?
> 
> A: You tell me :)
> ---
> 
> Thanks,
> 
> David
> 


-- 

Thanks,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] virtio-mem: paravirtualized memory
  2017-07-28 11:09 ` David Hildenbrand
@ 2017-07-28 15:16   ` Dan Williams
  2017-07-28 15:48     ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2017-07-28 15:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: KVM, virtualization, qemu-devel, linux-mm, Michael S. Tsirkin,
	Andrea Arcangeli, Pankaj Gupta, Rik van Riel

On Fri, Jul 28, 2017 at 4:09 AM, David Hildenbrand <david@redhat.com> wrote:
> Btw, I am thinking about the following addition to the concept:
>
> 1. Add a type to each virtio-mem device.
>
> This describes the type of the memory region we expose to the guest.
> Initially, we could have RAM and RAM_HUGE. The latter one would be
> interesting, because the guest would know that this memory is based on
> huge pages in case we would ever want to expose different RAM types to a
> guest (the guest could conclude that this memory might be faster and
> would also best be used with huge pages in the guest). But we could also
> simply start only with RAM.

I think it's up to the hypervisor to manage whether the guest is
getting huge pages or not and the guest need not know. As for
communicating differentiated memory media performance we have the ACPI
HMAT (Heterogeneous Memory Attributes Table) for that.

> 2. Adding also a guest -> host command queue.
>
> That can be used to request/notify the host about something. As written
> in the original proposal, for ordinary RAM this could be used to request
> more/less memory out of the guest.

I would hope that where possible we minimize paravirtualized
interfaces and just use standardized interfaces. In the case of memory
hotplug, ACPI already defines that interface.

> This might come in handy for other memory regions we just want to expose
> to the guest via a paravirtualized interface. The resize features
> (adding/removing memory) might not apply to these, but we can simply
> restrict that to certain types.
>
> E.g. if we want to expose PMEM memory region to a guest using a
> paravirtualized interface (or anything else that can be mapped into
> guest memory in the form of memory regions), we could use this. The
> guest->host control queue can be used for tasks that typically cannot be
> done if modelling something like this using ordinary ACPI DIMMs
> (flushing etc).
>
> CCing a couple of people that just thought about something like this in
> the concept of fake DAX.

I'm not convinced that there is a use case for paravirtualized PMEM
commands beyond this "fake-DAX" use case. Everything would seem to
have a path through the standard ACPI platform communication
mechanisms.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] virtio-mem: paravirtualized memory
  2017-07-28 15:16   ` Dan Williams
@ 2017-07-28 15:48     ` David Hildenbrand
  2017-07-31 14:12       ` Michael S. Tsirkin
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2017-07-28 15:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: KVM, virtualization, qemu-devel, linux-mm, Michael S. Tsirkin,
	Andrea Arcangeli, Pankaj Gupta, Rik van Riel

On 28.07.2017 17:16, Dan Williams wrote:
> On Fri, Jul 28, 2017 at 4:09 AM, David Hildenbrand <david@redhat.com> wrote:
>> Btw, I am thinking about the following addition to the concept:
>>
>> 1. Add a type to each virtio-mem device.
>>
>> This describes the type of the memory region we expose to the guest.
>> Initially, we could have RAM and RAM_HUGE. The latter one would be
>> interesting, because the guest would know that this memory is based on
>> huge pages in case we would ever want to expose different RAM types to a
>> guest (the guest could conclude that this memory might be faster and
>> would also best be used with huge pages in the guest). But we could also
>> simply start only with RAM.
> 
> I think it's up to the hypervisor to manage whether the guest is
> getting huge pages or not and the guest need not know. As for
> communicating differentiated memory media performance we have the ACPI
> HMAT (Heterogeneous Memory Attributes Table) for that.

Yes, in the world of ACPI I agree.

> 
>> 2. Adding also a guest -> host command queue.
>>
>> That can be used to request/notify the host about something. As written
>> in the original proposal, for ordinary RAM this could be used to request
>> more/less memory out of the guest.
> 
> I would hope that where possible we minimize paravirtualized
> interfaces and just use standardized interfaces. In the case of memory
> hotplug, ACPI already defines that interface.

I partly agree in the world of ACPI. If you just want to add/remove
memory in the form of DIMMs, yes. This already works just fine. For
other approaches in the context of virtualization (e.g. ballooners that
XEN or Hyper-V use, or also what virtio-mem tries to achieve), this is
not enough. They need a different way of memory hotplug (as e.g. XEN and
Hyper-V already have).

Especially when trying to standardize stuff in the form of virtio - binding
it to a technology specific to a handful of architectures is not
desired. Until now (as far as I remember), all but 2 virtio types
(virtio-balloon and virtio-iommu) operate on their own system resources
only, not on some resources exposed/detected via different interfaces.

> 
>> This might come in handy for other memory regions we just want to expose
>> to the guest via a paravirtualized interface. The resize features
>> (adding/removing memory) might not apply to these, but we can simply
>> restrict that to certain types.
>>
>> E.g. if we want to expose PMEM memory region to a guest using a
>> paravirtualized interface (or anything else that can be mapped into
>> guest memory in the form of memory regions), we could use this. The
>> guest->host control queue can be used for tasks that typically cannot be
>> done if modelling something like this using ordinary ACPI DIMMs
>> (flushing etc).
>>
>> CCing a couple of people that just thought about something like this in
>> the concept of fake DAX.
> 
> I'm not convinced that there is a use case for paravirtualized PMEM
> commands beyond this "fake-DAX" use case. Everything would seem to
> have a path through the standard ACPI platform communication
> mechanisms.

I don't know about further commands, most likely not really many more in
this scenario. I just pinged you guys to have a look when I heard the
term virtio-pmem.

In general, a paravirtualized interface (for detection of PMEM regions)
might have one big advantage: not limited to certain architectures.

With a paravirtualized interface, we can even support* fake DAX on
architectures that don't provide a "real HW" interface for it. I think
this sounds interesting.

*quite some effort will most likely be necessary for other architectures.

-- 

Thanks,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] virtio-mem: paravirtualized memory
  2017-07-28 15:48     ` David Hildenbrand
@ 2017-07-31 14:12       ` Michael S. Tsirkin
  2017-07-31 15:04         ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Michael S. Tsirkin @ 2017-07-31 14:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dan Williams, KVM, virtualization, qemu-devel, linux-mm,
	Andrea Arcangeli, Pankaj Gupta, Rik van Riel

On Fri, Jul 28, 2017 at 05:48:07PM +0200, David Hildenbrand wrote:
> In general, a paravirtualized interface (for detection of PMEM regions)
> might have one big advantage: not limited to certain architectures.

What follows is a generic rant, and slightly offtopic - sorry about that.
I thought it's worth replying to the above since people sometimes propose
random PV devices and portability is often the argument. I'd claim that if
it's the only argument, it's not a very good one.

One of the points of KVM is to try and reuse the infrastructure in Linux
that runs containers/bare metal anyway.  The more paravirtualized
interfaces you build, the more effort you get to spend to maintain
various aspects of the system. As optimizations force more and more
paravirtualization into the picture, our solution has been to try to
localize their effect, so you can mix and match paravirtualization and
emulation, as well as enable a subset of PV things that makes sense. For
example, virtio devices look more or less like PCI devices on systems
that have PCI.

It's not clear it applies here - memory overcommit on bare metal is
kind of different.

-- 
MST


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] virtio-mem: paravirtualized memory
  2017-07-31 14:12       ` Michael S. Tsirkin
@ 2017-07-31 15:04         ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2017-07-31 15:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Dan Williams, KVM, virtualization, qemu-devel, linux-mm,
	Andrea Arcangeli, Pankaj Gupta, Rik van Riel,
	Christian Borntraeger

On 31.07.2017 16:12, Michael S. Tsirkin wrote:
> On Fri, Jul 28, 2017 at 05:48:07PM +0200, David Hildenbrand wrote:
>> In general, a paravirtualized interface (for detection of PMEM regions)
>> might have one big advantage: not limited to certain architectures.
> 
> What follows is a generic rant, and slightly offtopic - sorry about that.
> I thought it's worth replying to the above since people sometimes propose
> random PV devices and portability is often the argument. I'd claim that if
> it's the only argument, it's not a very good one.

Very good point. Thanks for that comment. I totally agree that we have
to decide for which parts we really need a paravirtualized interface. We
already paravirtualized quite a lot (starting with clocks and mmios,
ending with network devices).

Related to fake DAX, think about this example (cc'ing Christian, so I
don't talk too much nonsense):

s390x hardware cannot map anything into the address space. Not even PCI
devices on s390x work that way. So the whole architecture (as far as I am
aware) is built on this fact. We can indicate "valid memory
regions" to the guest via a standardized interface, but the guest will
simply treat it like RAM.

With virtualization, this is different. We can map whatever we want into
the guest address space, but we have to tell the guest that this area is
special. There is no ACPI on s390x to do that.

This implies that, for s390x, we could not support fake DAX, simply
because we don't have ACPI. No fake DAX means no avoiding the page cache
in the guest - which is something we _could_ avoid quite easily.

> 
> One of the points of KVM is to try and reuse the infrastructure in Linux
> that runs containers/bare metal anyway.  The more paravirtualized
> interfaces you build, the more effort you get to spend to maintain
> various aspects of the system. As optimizations force more and more
> paravirtualization into the picture, our solution has been to try to
> localize their effect, so you can mix and match paravirtualization and
> emulation, as well as enable a subset of PV things that makes sense. For
> example, virtio devices look more or less like PCI devices on systems
> that have PCI.
We make paravirtualized devices look like them, but what we (in most
cases) don't do is the following: detect and use devices in a
!paravirtualized way and later on decide "oh, this device is special"
and suddenly treat it like a paravirtualized device*.

E.g. virtio-scsi: an unmodified guest will not simply detect and use,
say, a virtio-scsi-attached disk (unless I am very wrong on that point ;) ).

*This might work for devices where paravirtualization is e.g. just a
way to speed things up. But if paravirtualization is part of the concept
(fake DAX - we need that flush), this will not work.

In s390x terms: indicate fake DAX as ordinary RAM. The guest will
hotplug it and use it like ordinary RAM. At that point, it is too late
to logically convert it into a disk.


What I think is the following: We need a way to advertise devices that
are mapped into the address space in a paravirtualized way. This could
e.g. be fake DAX devices or what virtio-mem's memory hotplug approach
tries to solve.

Basically what virtio-mem proposed: indicating devices in the form of
memory regions that are special to the guest via a paravirtualized
interface and providing paravirtualized features that can (and even
have to) be used along with these special devices.

> 
> It's not clear it applies here - memory overcommit on bare metal is
> kind of different.

Yes, there is no such thing as fine-grained memory hot(un)plug on real
hardware.

-- 

Thanks,

David


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-07-31 15:04 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-16 14:20 [RFC] virtio-mem: paravirtualized memory David Hildenbrand
2017-06-16 15:04 ` Michael S. Tsirkin
2017-06-16 15:59   ` David Hildenbrand
2017-06-16 20:19     ` Michael S. Tsirkin
2017-06-18 10:17       ` David Hildenbrand
2017-06-19 10:08 ` Stefan Hajnoczi
2017-06-19 10:26   ` David Hildenbrand
2017-06-21 11:08     ` Stefan Hajnoczi
2017-06-21 12:32       ` David Hildenbrand
2017-06-23 12:45         ` Stefan Hajnoczi
2017-07-25  8:21 ` David Hildenbrand
2017-07-28 11:09 ` David Hildenbrand
2017-07-28 15:16   ` Dan Williams
2017-07-28 15:48     ` David Hildenbrand
2017-07-31 14:12       ` Michael S. Tsirkin
2017-07-31 15:04         ` David Hildenbrand
