* arm64: dropping prevent_bootmem_remove_notifier
@ 2020-10-16 23:11 ` Sudarshan Rajagopalan
  0 siblings, 0 replies; 10+ messages in thread
From: Sudarshan Rajagopalan @ 2020-10-16 23:11 UTC (permalink / raw)
  To: Anshuman Khandual, Catalin Marinas, Will Deacon,
	linux-arm-kernel, linux-kernel
  Cc: Suren Baghdasaryan, pratikp, Gavin Shan, Mark Rutland,
	Logan Gunthorpe, David Hildenbrand, Andrew Morton, Steven Price


Hello Anshuman,

In the patch that enables memory hot-remove (commit bbd6ec605c0f 
("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier 
put in place that prevents boot memory from being offlined and removed. 
The commit text also mentions that boot memory on arm64 cannot be 
removed. We wanted to understand more about the reasoning for this. 
x86 and other architectures don’t seem to have this prevention. There’s 
also a comment in the code that this notifier could be dropped in the 
future if and when boot memory can be removed.
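
For reference, the notifier in question looks roughly like this (our 
paraphrase of arm64/mm/mmu.c from commit bbd6ec605c0f; details may 
differ in current trees). It rejects any offline request that touches 
an "early" (boot) section:

static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
                                           unsigned long action, void *data)
{
        struct mem_section *ms;
        struct memory_notify *arg = data;
        unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
        unsigned long pfn = arg->start_pfn;

        /* Only veto offline attempts; all other transitions pass. */
        if (action != MEM_GOING_OFFLINE)
                return NOTIFY_OK;

        for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
                ms = __pfn_to_section(pfn);
                /* Boot-time discovered sections carry SECTION_IS_EARLY. */
                if (early_section(ms))
                        return NOTIFY_BAD;
        }
        return NOTIFY_OK;
}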

The current logic is that only “new” memory blocks which are hot-added 
can later be offlined and removed. The memory that the system booted up 
with cannot be offlined and removed. But there could be many use cases, 
such as inter-VM memory sharing, where a primary VM could offline and 
hot-remove a block/section of memory and lend it to a secondary VM, 
which could hot-add it. After the use case is done, the reverse happens: 
the secondary VM hot-removes the memory and gives it back to the 
primary, which can hot-add it back. In such cases, the present logic 
for arm64 doesn’t allow this hot-remove in the primary to happen.
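
To make the flow concrete: the offline step in the primary VM goes 
through the standard memory hotplug sysfs interface. A minimal 
userspace sketch (the block number is made up for illustration; real 
blocks are enumerated under /sys/devices/system/memory/):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/devices/system/memory/memory42/online", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* Asks the core to migrate and isolate the block's pages;
         * on arm64 this is what the notifier above can veto. */
        fputs("offline", f);
        return fclose(f) ? 1 : 0;
}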

Also, on systems with a movable zone, which largely guarantees that 
pages can be migrated and isolated so that blocks can be offlined, this 
logic defeats the purpose of having a movable zone that the system can 
rely on for memory hot-plugging (which, say, virtio-mem also relies on 
for fully plugged memory blocks).

I understand that some regions of boot RAM shouldn’t be allowed to be 
removed, but such regions won’t be allowed to be offlined in the first 
place, since their pages cannot be migrated and isolated (reserved 
pages, for example).

So we’re trying to understand the reasoning for such a prevention put 
in place for the arm64 arch alone.

One possible way to solve this is to mark the required sections as 
“non-early” by clearing the SECTION_IS_EARLY bit in their 
section_mem_map. This treats these sections as “memory hotpluggable”: 
they can be offlined-removed and added-onlined while still being part 
of boot RAM itself, without any extra blocks needing to be hot-added. 
This way of marking certain sections as “non-early” could be exported 
so that module drivers can mark the required number of sections as 
“memory hotpluggable”. Certain checks could be put in place to restrict 
which sections are allowed; for example, only movable-zone sections 
could be marked “non-early”.
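
A rough sketch of the idea, purely for discussion 
(mark_section_hotpluggable() is a hypothetical helper, not an existing 
kernel API; locking and most validation are omitted):

#include <linux/mmzone.h>

/* Hypothetical: make a boot-memory section look hot-added. */
static int mark_section_hotpluggable(unsigned long section_nr)
{
        struct mem_section *ms = __nr_to_section(section_nr);

        if (!valid_section(ms))
                return -EINVAL;

        /* Policy checks would go here, e.g. movable-zone sections only. */

        /* Without SECTION_IS_EARLY the section is treated like one
         * created via add_memory(), so offline + remove would be
         * permitted by the notifier above. */
        ms->section_mem_map &= ~SECTION_IS_EARLY;
        return 0;
}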

Your thoughts on this? We are also looking at different ways to solve 
the problem without completely dropping this notifier; we just want to 
raise the concern here that the notifier logic breaks our use case, 
which is a generic memory-sharing use case built on the memory hotplug 
feature.


Sudarshan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project

* Re: arm64: dropping prevent_bootmem_remove_notifier
  2020-10-16 23:11 ` Sudarshan Rajagopalan
@ 2020-10-17  9:35   ` David Hildenbrand
  0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2020-10-17  9:35 UTC (permalink / raw)
  To: Sudarshan Rajagopalan, Anshuman Khandual, Catalin Marinas,
	Will Deacon, linux-arm-kernel, linux-kernel
  Cc: Suren Baghdasaryan, pratikp, Gavin Shan, Mark Rutland,
	Logan Gunthorpe, Andrew Morton, Steven Price, Muchun Song

On 17.10.20 01:11, Sudarshan Rajagopalan wrote:
> 
> Hello Anshuman,
> 
David here,

In general, if your driver offlines+removes random memory, it is doing
something *very* wrong and dangerous. You shouldn't ever be
offlining+removing memory unless
a) you own that boot memory after boot. E.g., the ACPI driver owns DIMMs
after a reboot.
b) you added that memory via add_memory() and friends.

Even trusting that offline memory can be used by your driver is wrong.

Just imagine racing with actual memory hot(un)plug: you'll be in
*big* trouble. For example,

1. You offlined memory and assume you can use it. A DIMM can simply get
unplugged. You're doomed.
2. You offlined+removed memory and assume you can use it. A DIMM can
simply get unplugged and the whole machine would crash.

Or imagine your driver running on a system that has virtio-mem, which
will try to remove/offline+remove memory that was added by virtio-mem/
is under its control.

Long story short: don't do it.

There is *one* instance in Linux where we currently allow it for legacy
reasons. It is powernv/memtrace code that offlines+removes boot memory.
But here we are guaranteed to run in an environment (HW) without any
actual memory hot(un)plug.

I guess you're going to say "but in our environment we don't have ..." -
this is not a valid argument to change such generic things upstream /
introduce such hacks.

> In the patch that enables memory hot-remove (commit bbd6ec605c0f 
> ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier 
> put in place that prevents boot memory from being offlined and removed. 
> The commit text also mentions that boot memory on arm64 cannot be 
> removed. We wanted to understand more about the reasoning for this. 
> x86 and other architectures don’t seem to have this prevention. There’s 
> also a comment in the code that this notifier could be dropped in the 
> future if and when boot memory can be removed.

The issue is that with *actual* memory hotunplug (which is what the whole
machinery should be used for), that memory/DIMM will be gone. And as you
cannot fix up the initial memmap, if you were to reboot that machine, you
would simply crash immediately.

On x86, you can have that easily: hotplug DIMMs on bare metal and
reboot. The DIMMs will be exposed via e820 during boot, so they are
"early", although if done right (movable_node, movablecore and
similar), they can get hotunplugged later. Important in environments
where you want to hotunplug whole nodes. But as HW on x86 will properly
adjust the initial memmap / e820, there is no such issue as on arm64.

> 
> The current logic is that only “new” memory blocks which are hot-added 
> can later be offlined and removed. The memory that the system booted up 
> with cannot be offlined and removed. But there could be many use cases, 
> such as inter-VM memory sharing, where a primary VM could offline and 
> hot-remove a block/section of memory and lend it to a secondary VM, 
> which could hot-add it. After the use case is done, the reverse happens: 

That use case is using the wrong mechanisms. It shouldn't be
offlining+removing memory. Read below.

> the secondary VM hot-removes the memory and gives it back to the 
> primary, which can hot-add it back. In such cases, the present logic 
> for arm64 doesn’t allow this hot-remove in the primary to happen.
> 
> Also, on systems with a movable zone, which largely guarantees that 
> pages can be migrated and isolated so that blocks can be offlined, this 
> logic defeats the purpose of having a movable zone that the system can 
> rely on for memory hot-plugging (which, say, virtio-mem also relies on 
> for fully plugged memory blocks).

ZONE_MOVABLE is *not* just for better guarantees when trying to
hotunplug memory. It also increases the number of THP/huge pages. And
that part works just fine.

> 
> So we’re trying to understand the reasoning for such a prevention put 
> in place for the arm64 arch alone.
> 
> One possible way to solve this is to mark the required sections as 
> “non-early” by clearing the SECTION_IS_EARLY bit in their 
> section_mem_map. This treats these sections as “memory hotpluggable”: 
> they can be offlined-removed and added-onlined while still being part 
> of boot RAM itself, without any extra blocks needing to be hot-added. 
> This way of marking certain sections as “non-early” could be exported 
> so that module drivers can mark the required number of sections as 
> “memory 

Oh please no. No driver should be doing that. That's just hacking around
the root issue: you're not supposed to do that.

> hotpluggable”. Certain checks could be put in place to restrict which 
> sections are allowed; for example, only movable-zone sections could be 
> marked “non-early”.
> 

I assume what your use case wants to achieve is starting VMs with
large, contiguous memory backings, without wasting memory on the memmap
in the hypervisor.

The "traditional" way of doing that is using the "mem=" boot parameter,
and starting VMs with memory within the "never exposed to Linux" part.
While that in general works, I consider it an ugly hack. And it doesn't
really allow the hypervisor the reuse unexposed memory.

The obvious way for a driver to *allocate* memory (because that's what
you want to do!) is using alloc_contig_range(). I know that there are
no guarantees. So you could be using CMA, ... but then, you still have
the memmap consuming memory in your hypervisor.
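
A rough sketch of that route (range and error handling illustrative;
alloc_contig_range() may fail at any time):

#include <linux/gfp.h>
#include <linux/mmzone.h>

static int grab_contig(unsigned long start_pfn, unsigned long nr_pages)
{
        int ret;

        ret = alloc_contig_range(start_pfn, start_pfn + nr_pages,
                                 MIGRATE_MOVABLE, GFP_KERNEL);
        if (ret)
                return ret;     /* pages could not be isolated/migrated */

        /* ... hand the range to the VM ... */

        free_contig_range(start_pfn, nr_pages);
        return 0;
}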

What you could try instead is

1. Using hugetlbfs with huge (2MB) / gigantic (1GB) (on x86) pages for
backing your guest.
2. To free up the memmap, you could then go into the direction proposed
by Muchun Song [1].

That's then a clean way for a driver to allocate/use memory without
abusing memory hot(un)plug infrastructure, minimizing the memmap
consumption.
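
For 1., a minimal userspace sketch of huge-page-backed guest memory
(sizes illustrative; anonymous MAP_HUGETLB shown here, a hugetlbfs
mount works similarly):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define GUEST_MEM_SIZE (1UL << 30)      /* 1 GiB, illustrative */

int main(void)
{
        /* Requires huge pages to be reserved first, e.g. via
         * /proc/sys/vm/nr_hugepages. */
        void *mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* ... hand 'mem' to the VM as its RAM backing ... */
        munmap(mem, GUEST_MEM_SIZE);
        return 0;
}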


[1]
https://lkml.kernel.org/r/20200915125947.26204-1-songmuchun@bytedance.com

-- 
Thanks,

David / dhildenb


* Re: arm64: dropping prevent_bootmem_remove_notifier
  2020-10-16 23:11 ` Sudarshan Rajagopalan
@ 2020-10-19  5:37   ` Anshuman Khandual
  0 siblings, 0 replies; 10+ messages in thread
From: Anshuman Khandual @ 2020-10-19  5:37 UTC (permalink / raw)
  To: Sudarshan Rajagopalan, Catalin Marinas, Will Deacon,
	linux-arm-kernel, linux-kernel
  Cc: Suren Baghdasaryan, pratikp, Gavin Shan, Mark Rutland,
	Logan Gunthorpe, David Hildenbrand, Andrew Morton, Steven Price

Hello Sudarshan,

On 10/17/2020 04:41 AM, Sudarshan Rajagopalan wrote:
> 
> Hello Anshuman,
> 
> In the patch that enables memory hot-remove (commit bbd6ec605c0f ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier put in place that prevents boot memory from being offlined and removed. The commit text also mentions that boot memory on arm64 cannot be removed. We wanted to understand more about the reasoning for this. x86 and other architectures don’t seem to have this prevention. There’s also a comment in the code that this notifier could be dropped in the future if and when boot memory can be removed.

Right, and till then the notifier cannot be dropped. There were a lot of
discussions around this topic during multiple iterations of the memory
hot-remove series. Hence, I would request you to please go through them
first. This list here is from one such series
(https://lwn.net/Articles/809179/) but might not be exhaustive.

-----------------
On arm64 platform, it is essential to ensure that the boot time discovered
memory couldn't be hot-removed so that,

1. FW data structures used across kexec are idempotent
   e.g. the EFI memory map.

2. linear map or vmemmap would not have to be dynamically split, and can
   map boot memory at a large granularity

3. Avoid penalizing paths that have to walk page tables, where we can be
   certain that the memory is not hot-removable
-----------------

The primary reason being kexec, which would need substantial rework otherwise.

> 
> The current logic is that only “new” memory blocks which are hot-added can later be offlined and removed. The memory that the system booted up with cannot be offlined and removed. But there could be many use cases, such as inter-VM memory sharing, where a primary VM could offline and hot-remove a block/section of memory and lend it to a secondary VM, which could hot-add it. After the use case is done, the reverse happens: the secondary VM hot-removes the memory and gives it back to the primary, which can hot-add it back. In such cases, the present logic for arm64 doesn’t allow this hot-remove in the primary to happen.

That is not true. Each VM could just boot with a minimum boot memory,
which cannot be offlined or removed, and then a possibly larger portion
of memory can be hot-added during the boot process itself, making it
available for any future inter-VM sharing purpose. Hence this problem
could easily be solved in user space itself.

> 
> Also, on systems with a movable zone, which largely guarantees that pages can be migrated and isolated so that blocks can be offlined, this logic defeats the purpose of having a movable zone that the system can rely on for memory hot-plugging (which, say, virtio-mem also relies on for fully plugged memory blocks).

ZONE_MOVABLE does not really guarantee migration, isolation and removal.
There are reasons an offline request might just fail. I agree that those
reasons are normally not platform related, but core memory gives the
platform an opportunity to decline an offlining request via a notifier.
Hence a ZONE_MOVABLE offline can be denied. Semantics-wise we are still
okay.

This might look a bit inconsistent: with
movablecore/kernelcore/movable_node, and with firmware sending in 'hot
pluggable' memory (IIRC arm64 does not really support this yet), the
system might end up with ZONE_MOVABLE-marked boot memory which cannot be
offlined or removed. But an offline notifier action is orthogonal. Hence
we did not block the kernel command line paths that create ZONE_MOVABLE
during boot, to preserve existing behavior.

> 
> I understand that some regions of boot RAM shouldn’t be allowed to be removed, but such regions won’t be allowed to be offlined in the first place, since their pages cannot be migrated and isolated (reserved pages, for example).
> 
> So we’re trying to understand the reasoning for such a prevention put in place for the arm64 arch alone.

The primary reason is kexec. During kexec on arm64, the next kernel's
memory map is derived from firmware and not from the currently running
kernel. So the next kernel will crash if it accesses memory that was
removed in the running kernel. Until kexec on arm64 changes substantially
and takes into account the memory actually available in the current
kernel, boot memory cannot be removed.

> 
> One possible way to solve this is to mark the required sections as “non-early” by clearing the SECTION_IS_EARLY bit in their section_mem_map.

That is too intrusive from a core memory perspective.

> This treats these sections as “memory hotpluggable”: they can be offlined-removed and added-onlined while still being part of boot RAM itself, without any extra blocks needing to be hot-added. This way of marking certain sections as “non-early” could be exported so that module drivers can mark the required number of sections as “memory hotpluggable”. Certain checks could be put in place to restrict which sections are allowed; for example, only movable-zone sections could be marked “non-early”.

Giving modules the right to mark memory hotpluggable? That is too
intrusive and would still not solve the problem with kexec.

> 
> Your thoughts on this? We are also looking at different ways to solve the problem without completely dropping this notifier; we just want to raise the concern here that the notifier logic breaks our use case, which is a generic memory-sharing use case built on the memory hotplug feature.

Completely preventing boot memory offline and removal is essential for
kexec to work as expected afterwards. As suggested previously, splitting
the VM memory into boot and non-boot chunks during init can help work
around this restriction effectively in userspace itself and would not
require any kernel changes.

- Anshuman

* Re: arm64: dropping prevent_bootmem_remove_notifier
  2020-10-17  9:35   ` David Hildenbrand
@ 2020-10-19  6:30     ` Anshuman Khandual
  0 siblings, 0 replies; 10+ messages in thread
From: Anshuman Khandual @ 2020-10-19  6:30 UTC (permalink / raw)
  To: David Hildenbrand, Sudarshan Rajagopalan, Catalin Marinas,
	Will Deacon, linux-arm-kernel, linux-kernel
  Cc: Suren Baghdasaryan, pratikp, Gavin Shan, Mark Rutland,
	Logan Gunthorpe, Andrew Morton, Steven Price, Muchun Song



On 10/17/2020 03:05 PM, David Hildenbrand wrote:
> On 17.10.20 01:11, Sudarshan Rajagopalan wrote:
>>
>> Hello Anshuman,
>>
> David here,
> 
> In general, if your driver offlines+removes random memory, it is doing
> something *very* wrong and dangerous. You shouldn't ever be
> offlining+removing memory unless
> a) you own that boot memory after boot. E.g., the ACPI driver owns DIMMs
> after a reboot.
> b) you added that memory via add_memory() and friends.

Right.

> 
> Even trusting that offline memory can be used by your driver is wrong.

Right.

> 
> Just imagine racing with actual memory hot(un)plug: you'll be in
> *big* trouble. For example,
> 
> 1. You offlined memory and assume you can use it. A DIMM can simply get
> unplugged. You're doomed.
> 2. You offlined+removed memory and assume you can use it. A DIMM can
> simply get unplugged and the whole machine would crash.
> 
> Or imagine your driver running on a system that has virtio-mem, which
> will try to remove/offline+remove memory that was added by virtio-mem/
> is under its control.
> 
> Long story short: don't do it.
> 
> There is *one* instance in Linux where we currently allow it for legacy
> reasons. It is powernv/memtrace code that offlines+removes boot memory.
> But here we are guaranteed to run in an environment (HW) without any
> actual memory hot(un)plug.
> 
> I guess you're going to say "but in our environment we don't have ..." -
> this is not a valid argument to change such generic things upstream /
> introduce such hacks.

Agreed.

> 
>> In the patch that enables memory hot-remove (commit bbd6ec605c0f 
>> ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier 
>> put in place that prevents boot memory from being offlined and removed. 
>> The commit text also mentions that boot memory on arm64 cannot be 
>> removed. We wanted to understand more about the reasoning for this. 
>> x86 and other architectures don’t seem to have this prevention. There’s 
>> also a comment in the code that this notifier could be dropped in the 
>> future if and when boot memory can be removed.
> 
> The issue is that with *actual* memory hotunplug (which is what the whole
> machinery should be used for), that memory/DIMM will be gone. And as you
> cannot fix up the initial memmap, if you were to reboot that machine, you
> would simply crash immediately.

Right.

> 
> On x86, you can have that easily: hotplug DIMMs on bare metal and
> reboot. The DIMMs will be exposed via e820 during boot, so they are
> "early", although if done right (movable_node, movablecore and
> similar), they can get hotunplugged later. Important in environments
> where you want to hotunplug whole nodes. But as HW on x86 will properly
> adjust the initial memmap / e820, there is no such issue as on arm64.

That is the primary problem.

> 
>>
>> The current logic is that only “new” memory blocks which are hot-added 
>> can later be offlined and removed. The memory that the system booted up 
>> with cannot be offlined and removed. But there could be many use cases, 
>> such as inter-VM memory sharing, where a primary VM could offline and 
>> hot-remove a block/section of memory and lend it to a secondary VM, 
>> which could hot-add it. After the use case is done, the reverse happens: 
> 
> That use case is using the wrong mechanisms. It shouldn't be
> offlining+removing memory. Read below.
> 
>> the secondary VM hot-removes the memory and gives it back to the 
>> primary, which can hot-add it back. In such cases, the present logic 
>> for arm64 doesn’t allow this hot-remove in the primary to happen.
>>
>> Also, on systems with a movable zone, which largely guarantees that 
>> pages can be migrated and isolated so that blocks can be offlined, this 
>> logic defeats the purpose of having a movable zone that the system can 
>> rely on for memory hot-plugging (which, say, virtio-mem also relies on 
>> for fully plugged memory blocks).
> 
> ZONE_MOVABLE is *not* just for better guarantees when trying to
> hotunplug memory. It also increases the number of THP/huge pages. And
> that part works just fine.

Right.

> 
>>
>> So we’re trying to understand the reasoning for such a prevention put 
>> in place for the arm64 arch alone.
>>
>> One possible way to solve this is to mark the required sections as 
>> “non-early” by clearing the SECTION_IS_EARLY bit in their 
>> section_mem_map. This treats these sections as “memory hotpluggable”: 
>> they can be offlined-removed and added-onlined while still being part 
>> of boot RAM itself, without any extra blocks needing to be hot-added. 
>> This way of marking certain sections as “non-early” could be exported 
>> so that module drivers can mark the required number of sections as 
>> “memory 
> 
> Oh please no. No driver should be doing that. That's just hacking around
> the root issue: you're not supposed to do that.
> 
>> hotpluggable”. Certain checks could be put in place to restrict which 
>> sections are allowed; for example, only movable-zone sections could be 
>> marked “non-early”.
>>
> 
> I assume what your use case wants to achieve is starting VMs with
> large, contiguous memory backings, without wasting memory on the memmap
> in the hypervisor.
> 
> The "traditional" way of doing that is using the "mem=" boot parameter,
> and starting VMs with memory within the "never exposed to Linux" part.

I suggested something similar. Memory not exposed to Linux via this method
can be hot added or removed later on.

> While that in general works, I consider it an ugly hack. And it doesn't
> really allow the hypervisor to reuse unexposed memory.
> 
> The obvious way for a driver to *allocate* memory (because that's what
> you want to do!) is using alloc_contig_range(). I know that there are
> no guarantees. So you could be using CMA, ... but then, you still have
> the memmap consuming memory in your hypervisor.
> 
> What you could try instead is
> 
> 1. Using hugetlbfs with huge (2MB) / gigantic (1GB) (on x86) pages for
> backing your guest.
> 2. To free up the memmap, you could then go into the direction proposed
> by Muchun Song [1].
> 
> That's then a clean way for a driver to allocate/use memory without
> abusing memory hot(un)plug infrastructure, minimizing the memmap
> consumption.
> 
> 
> [1]
> https://lkml.kernel.org/r/20200915125947.26204-1-songmuchun@bytedance.com
> 

* Re: arm64: dropping prevent_bootmem_remove_notifier
  2020-10-19  5:37   ` Anshuman Khandual
@ 2020-10-29 21:02     ` Sudarshan Rajagopalan
  0 siblings, 0 replies; 10+ messages in thread
From: Sudarshan Rajagopalan @ 2020-10-29 21:02 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	Suren Baghdasaryan, pratikp, Gavin Shan, Mark Rutland,
	Logan Gunthorpe, David Hildenbrand, Andrew Morton, Steven Price



Hi Anshuman, David,

Thanks for all the detailed explanations of the reasoning for keeping 
bootmem protected from being removed. Also, I do agree that drivers 
being able to mark memory sections isn't the right thing to do.

We went ahead with the approach of using "mem=" as you suggested to 
limit the bootmem, and we add the remaining blocks using 
add_memory_driver_managed() so that the driver has ownership of these 
blocks.
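
Roughly, what we do looks like this (the address range, node id and 
resource name are made up for illustration; note that 
add_memory_driver_managed() gained an extra mhp_t flags argument in 
later kernels):

#include <linux/memory_hotplug.h>
#include <linux/module.h>
#include <linux/sizes.h>

/* Hand a range held back via mem= to the kernel as driver-managed
 * memory blocks; the range must lie above the mem= limit and be
 * owned by this driver. */
static int __init claim_spare_ram(void)
{
        u64 start = 0x880000000ULL;     /* illustrative */
        u64 size = SZ_512M;             /* illustrative */

        /* The resource name must follow "System RAM ($DRIVER)". */
        return add_memory_driver_managed(0 /* nid */, start, size,
                                         "System RAM (example_driver)");
}
module_init(claim_spare_ram);
MODULE_LICENSE("GPL");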

We do have some follow-up questions regarding this - will initiate a 
discussion soon.


On 2020-10-18 22:37, Anshuman Khandual wrote:
> Hello Sudarshan,
> 
> On 10/17/2020 04:41 AM, Sudarshan Rajagopalan wrote:
>> 
>> Hello Anshuman,
>> 
>> In the patch that enables memory hot-remove (commit bbd6ec605c0f 
>> ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier 
>> put in place that prevents boot memory from being offlined and 
>> removed. The commit text also mentions that boot memory on arm64 
>> cannot be removed. We wanted to understand more about the reasoning 
>> for this. x86 and other architectures don’t seem to have this 
>> prevention. There’s also a comment in the code that this notifier 
>> could be dropped in the future if and when boot memory can be removed.
> 
> Right, and till then the notifier cannot be dropped. There were a lot
> of discussions around this topic during multiple iterations of the
> memory hot-remove series. Hence, I would request you to please go
> through them first. This list here is from one such series
> (https://lwn.net/Articles/809179/) but might not be exhaustive.
> 
> -----------------
> On arm64 platform, it is essential to ensure that the boot time 
> discovered
> memory couldn't be hot-removed so that,
> 
> 1. FW data structures used across kexec are idempotent
>    e.g. the EFI memory map.
> 
> 2. linear map or vmemmap would not have to be dynamically split, and 
> can
>    map boot memory at a large granularity
> 
> 3. Avoid penalizing paths that have to walk page tables, where we can 
> be
>    certain that the memory is not hot-removable
> -----------------
> 
> The primary reason being kexec, which would need substantial rework 
> otherwise.
> 
>> 
>> The current logic is that only “new” memory blocks which are hot-added 
>> can later be offlined and removed. The memory that the system booted 
>> up with cannot be offlined and removed. But there could be many use 
>> cases, such as inter-VM memory sharing, where a primary VM could 
>> offline and hot-remove a block/section of memory and lend it to a 
>> secondary VM, which could hot-add it. After the use case is done, the 
>> reverse happens: the secondary VM hot-removes the memory and gives it 
>> back to the primary, which can hot-add it back. In such cases, the 
>> present logic for arm64 doesn’t allow this hot-remove in the primary 
>> to happen.
> 
> That is not true. Each VM could just boot with a minimum boot memory, 
> which cannot be offlined or removed, and then a possibly larger portion 
> of memory can be hot-added during the boot process itself, making it 
> available for any future inter-VM sharing purpose. Hence this problem 
> could easily be solved in user space itself.
> 
>> 
>> Also, on systems with a movable zone, which largely guarantees that 
>> pages can be migrated and isolated so that blocks can be offlined, 
>> this logic defeats the purpose of having a movable zone that the 
>> system can rely on for memory hot-plugging (which, say, virtio-mem 
>> also relies on for fully plugged memory blocks).
> 
> ZONE_MOVABLE does not really guarantee migration, isolation and 
> removal. There are reasons an offline request might just fail. I agree 
> that those reasons are normally not platform related, but core memory 
> gives the platform an opportunity to decline an offlining request via 
> a notifier. Hence a ZONE_MOVABLE offline can be denied. Semantics-wise 
> we are still okay.
> 
> This might look a bit inconsistent: with 
> movablecore/kernelcore/movable_node, and with firmware sending in 'hot 
> pluggable' memory (IIRC arm64 does not really support this yet), the 
> system might end up with ZONE_MOVABLE-marked boot memory which cannot 
> be offlined or removed. But an offline notifier action is orthogonal. 
> Hence we did not block the kernel command line paths that create 
> ZONE_MOVABLE during boot, to preserve existing behavior.
> 
>> 
>> I understand that some regions of boot RAM shouldn’t be allowed to be 
>> removed, but such regions won’t be allowed to be offlined in the first 
>> place, since their pages cannot be migrated and isolated (reserved 
>> pages, for example).
>> 
>> So we’re trying to understand the reasoning for such a prevention put 
>> in place for the arm64 arch alone.
> 
> The primary reason is kexec. During kexec on arm64, the next kernel's 
> memory map is derived from firmware and not from the currently running 
> kernel. So the next kernel will crash if it accesses memory that was 
> removed in the running kernel. Until kexec on arm64 changes 
> substantially and takes into account the memory actually available in 
> the current kernel, boot memory cannot be removed.
> 
>> 
>> One possible way to solve this is to mark the required sections as 
>> “non-early” by clearing the SECTION_IS_EARLY bit in their 
>> section_mem_map.
> 
> That is too intrusive from a core memory perspective.
> 
>> This treats these sections as “memory hotpluggable”: they can be
>> offlined-removed and added-onlined while still being part of boot RAM
>> itself, without any extra blocks needing to be hot-added. This way of
>> marking certain sections as “non-early” could be exported so that
>> module drivers can mark the required number of sections as “memory
>> hotpluggable”. Certain checks could be put in place to restrict which
>> sections are allowed; for example, only movable-zone sections could be
>> marked “non-early”.
> 
> Giving modules the right to mark memory hotpluggable? That is too 
> intrusive and would still not solve the problem with kexec.
> 
>> 
>> Your thoughts on this? We are also looking at different ways to solve 
>> the problem without completely dropping this notifier; we just want to 
>> raise the concern here that the notifier logic breaks our use case, 
>> which is a generic memory-sharing use case built on the memory hotplug 
>> feature.
> 
> Completely preventing boot memory offline and removal is essential for 
> kexec to work as expected afterwards. As suggested previously, 
> splitting the VM memory into boot and non-boot chunks during init can 
> help work around this restriction effectively in userspace itself and 
> would not require any kernel changes.
> 
> - Anshuman


Sudarshan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project
