All of lore.kernel.org
* Question: Using online_pages/offline_pages() with granularity < mem section size
@ 2018-03-02 15:23 David Hildenbrand
  2018-03-03 17:53 ` Dan Williams
  0 siblings, 1 reply; 3+ messages in thread
From: David Hildenbrand @ 2018-03-02 15:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vlastimil Babka, Dan Williams, Reza Arbab, linux-mm

Hi,

in the context of virtualization, I am currently experimenting with an
approach to plug/unplug memory using a paravirtualized interface (not
ACPI). While doing so, I stumbled over a few things in the memory
hot(un)plug code.

The big picture:

A paravirtualized device provides a physical memory region to the guest.
We could have multiple such devices. Each device is assigned to a NUMA
node. We want to control how much memory in such a region the guest is
allowed to use. This way, we can dynamically add memory to and remove
memory from NUMA nodes, and make sure a guest cannot use more memory
than requested.

In particular: we decide in the kernel which memory blocks to
online/offline.


The basic mechanism:

The hypervisor provides a physical memory region to the guest. This
memory region can be used by the guest to plug/unplug memory. The
hypervisor asks for a certain amount of used memory and the guest should
try to reach that goal, by plugging/unplugging memory. Whenever the
guest wants to plug/unplug a block, it has to communicate that to the
hypervisor.

The hypervisor can grant or deny requests to plug/unplug a block of
memory. In particular, the guest must not take more memory than
requested. Reading unplugged memory succeeds (e.g. for kdump), while
writing to it is prohibited.

Memory blocks can be of any granularity, but 1-4MB looks like a sane
size that does not fragment memory too much. If the guest can't find
free memory blocks, no unplug is possible.
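The control loop sketched above could look roughly like the following. This is a userspace illustration only: struct pv_mem_dev and pv_mem_step() are hypothetical names standing in for the real paravirtualized interface, not an existing API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state; all names here are illustrative. */
struct pv_mem_dev {
	size_t block_size; /* plug/unplug granularity, e.g. 4MB */
	size_t plugged;    /* bytes currently plugged */
	size_t target;     /* bytes the hypervisor asked us to reach */
};

/*
 * One step toward the hypervisor's target: plug or unplug a single
 * block. Returns false once we are within one block of the target.
 * In a real driver, each step would be a request that the hypervisor
 * may grant or deny.
 */
static bool pv_mem_step(struct pv_mem_dev *dev)
{
	if (dev->plugged + dev->block_size <= dev->target) {
		dev->plugged += dev->block_size; /* plug one block */
		return true;
	}
	if (dev->plugged >= dev->target + dev->block_size) {
		dev->plugged -= dev->block_size; /* unplug one block */
		return true;
	}
	return false;
}
```

Driving pv_mem_step() in a loop until it returns false converges on the requested amount, in either direction, one block at a time.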


In the guest, I add_memory() new memory blocks to the NORMAL zone. The
NORMAL zone makes it harder to remove memory, but we don't run into
problems such as having too little NORMAL memory (e.g. for page
tables). Now, these chunks are fairly big (>= 128MB), and there seems
to be no way to plug/unplug smaller chunks in Linux using official
interfaces ("memory segments"). Trying to remove >= 128MB of NORMAL
memory will usually not succeed, so I thought about manually removing
parts of a memory section.

Yes, this sounds similar to a balloon, but it is different: I have to
offline memory in a specific memory range, not just any memory in the
system. So I cannot simply use kmalloc() - no allocator guarantees an
allocation from a given physical range.

So instead I went ahead and thought about simply onlining/offlining
parts of a memory segment manually - in particular, "page blocks". I
do my own bookkeeping about which parts of a memory segment are
online/offline and use that information to find blocks to
plug/unplug. The offline_pages() interface made me assume that this
should work with blocks the size of pageblock_nr_pages.
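A minimal sketch of such bookkeeping, as plain userspace C: one bit per chunk of a segment. The 4MB chunk size and 128MB segment size are assumptions for illustration, and nothing here is kernel API.

```c
#include <stdbool.h>

/* One bit per 4MB chunk of a 128MB memory segment => 32 chunks,
 * which fits in a single unsigned long on 64-bit. Purely
 * illustrative bookkeeping, not kernel code. */
#define CHUNKS_PER_SEGMENT 32

struct segment_state {
	unsigned long online; /* bit i set => chunk i is online */
};

/* Find an offline chunk that could be plugged (onlined); -1 if none. */
static int find_offline_chunk(const struct segment_state *s)
{
	for (int i = 0; i < CHUNKS_PER_SEGMENT; i++)
		if (!(s->online & (1UL << i)))
			return i;
	return -1;
}

static void mark_chunk(struct segment_state *s, int i, bool online)
{
	if (online)
		s->online |= 1UL << i;
	else
		s->online &= ~(1UL << i);
}
```

The driver would update the bitmap after each successful online_pages()/offline_pages() call and consult it when picking the next chunk to plug or unplug.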


I stumbled over the following two problems:

1. __offline_isolated_pages() doesn't care about page blocks; it simply
calls offline_mem_sections(), which marks the whole section as offline,
although the section has to remain online until all of its pages have
been offlined. This can be handled by moving the offline_mem_sections()
logic further out, to the caller of offline_pages().

2. While offlining 2MB blocks (the page block size), I discovered that
more memory was marked as reserved than expected. In particular, a page
block can contain pages of order 10 (4MB), which implies that two page
blocks are "bound together". This also happens in
__offline_isolated_pages(): offlining 2MB will result in 4MB being
marked as reserved.

Now that I have switched to 4MB, my manual use of
online_pages()/offline_pages() seems to work fine so far.
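The arithmetic behind the 2MB failure can be spelled out as follows. The values assume x86-64 defaults (4KB pages, MAX_ORDER 11, pageblock_order 9); they are hardcoded assumptions here, not queried from a kernel.

```c
/* With 4KB pages and MAX_ORDER 11 (x86-64 defaults), the largest
 * buddy allocation is order MAX_ORDER - 1 = 10, i.e. 2^10 pages.
 * That is 4MB - exactly two 2MB page blocks - so an offline range
 * has to cover whole order-10 pages, not just whole page blocks. */
#define ASSUMED_PAGE_SIZE 4096UL
#define ASSUMED_MAX_ORDER 11
#define ASSUMED_PAGEBLOCK_ORDER 9

static unsigned long max_buddy_bytes(void)
{
	return (1UL << (ASSUMED_MAX_ORDER - 1)) * ASSUMED_PAGE_SIZE;
}

static unsigned long pageblock_bytes(void)
{
	return (1UL << ASSUMED_PAGEBLOCK_ORDER) * ASSUMED_PAGE_SIZE;
}
```

Since one maximum-order buddy spans two page blocks, a 2MB (one-pageblock) offline range can land in the middle of a 4MB page, which matches the observed "4MB reserved when offlining 2MB" behavior.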

So my questions are:

Can I assume that online_pages()/offline_pages() works reliably with
"MAX_ORDER - 1" sizes? Should the checks in these functions be
updated? Page blocks do not seem to be the right granularity.

Is there a better approach to allocating memory from a specific memory
range (without fake NUMA nodes)? Then I could avoid
online_pages()/offline_pages() and instead do it like a balloon driver
(mark the pages as reserved myself).


Thanks a lot!

-- 

Thanks,

David / dhildenb

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: Question: Using online_pages/offline_pages() with granularity < mem section size
  2018-03-02 15:23 Question: Using online_pages/offline_pages() with granularity < mem section size David Hildenbrand
@ 2018-03-03 17:53 ` Dan Williams
  2018-04-04  9:08   ` David Hildenbrand
  0 siblings, 1 reply; 3+ messages in thread
From: Dan Williams @ 2018-03-03 17:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Reza Arbab, linux-mm

On Fri, Mar 2, 2018 at 7:23 AM, David Hildenbrand <david@redhat.com> wrote:
> [...]
>
> Can I assume that online_pages/offline_pages() works with "MAX_ORDER -
> 1" sizes reliably? Should the checks in these functions be updated?
>
> Any better approach to allocate memory in a specific memory range
> (without fake numa nodes)?
> [...]

Not sure this answers your questions, but I did play with sub-section
memory hotplug in this patch set last year; it fell to the bottom of
my queue. At least at the time, it seemed possible to remove the
section-alignment constraints of memory hotplug.

https://lists.01.org/pipermail/linux-nvdimm/2017-March/009167.html



* Re: Question: Using online_pages/offline_pages() with granularity < mem section size
  2018-03-03 17:53 ` Dan Williams
@ 2018-04-04  9:08   ` David Hildenbrand
  0 siblings, 0 replies; 3+ messages in thread
From: David Hildenbrand @ 2018-04-04  9:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Reza Arbab, linux-mm

On 03.03.2018 18:53, Dan Williams wrote:
> On Fri, Mar 2, 2018 at 7:23 AM, David Hildenbrand <david@redhat.com> wrote:
>> [...]
> 
> Not sure this answers your questions, but I did play with sub-section
> memory hotplug last year in this patch set, but it fell to the bottom
> of my queue. At least at the time it seemed possible to remove the
> section alignment constraints of memory hotplug.
> 
> https://lists.01.org/pipermail/linux-nvdimm/2017-March/009167.html
> 

Thanks, that goes in a similar direction, but it seems to be more about
being able to add a persistent memory device with bad alignment. The
!persistent-memory part seems to be more complicated (e.g. struct pages
are allocated per section).

In the meantime, I managed to make online_pages()/offline_pages() work
reliably with 4MB chunks. So I can e.g. add_memory() 128MB but
online/offline only 4MB chunks of it, which is sufficient for what I
need right now.
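The pfn arithmetic for picking one such chunk inside a 128MB block might look like this. A sketch under assumed geometry (4KB pages, 4MB chunks); chunk_start_pfn() is an illustrative helper, not a kernel function.

```c
/* Assumed geometry: 4KB pages, 4MB online/offline chunks. */
#define ASSUMED_PAGE_SHIFT 12
#define CHUNK_BYTES (4UL << 20)
#define PAGES_PER_CHUNK (CHUNK_BYTES >> ASSUMED_PAGE_SHIFT) /* 1024 */

/* First pfn of chunk number 'chunk' inside a memory block that
 * starts at block_start_pfn. The resulting (start_pfn,
 * PAGES_PER_CHUNK) pair is what one would hand to
 * online_pages()/offline_pages(). */
static unsigned long chunk_start_pfn(unsigned long block_start_pfn,
				     unsigned int chunk)
{
	return block_start_pfn + (unsigned long)chunk * PAGES_PER_CHUNK;
}
```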

Will send some patches soon. Thanks!

-- 

Thanks,

David / dhildenb

