* Re: [PATCH v2] virtio_balloon: add param to skip adjusting pages
       [not found] <20211118091130.3817665-1-stevensd@google.com>
@ 2021-11-18 11:17 ` David Hildenbrand
       [not found]   ` <CAD=HUj7i7foyPE8a6dhj+=UR2jn5_vaQx-3jjKtjYrY8iSJWzw@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2021-11-18 11:17 UTC (permalink / raw)
  To: David Stevens, virtualization; +Cc: Michael S. Tsirkin

On 18.11.21 10:11, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>

Hi David,

> 
> Add a module parameter to virtio_balloon to allow specifying whether or
> not the driver should call adjust_managed_page_count. If the parameter
> is set, it overrides the default behavior inferred from the deflate on
> OOM flag. This allows the balloon to operate without changing the amount
> of memory visible to userspace via /proc/meminfo or sysinfo, even on a
> system that cannot set the deflate on OOM flag.
> 
> The motivation for this patch is to allow userspace to more accurately
> take advantage of virtio_balloon's cooperative memory control on a
> system without the ability to use deflate on OOM. As it stands,
> userspace has no way to know how much memory may be available on such a
> system, which makes tasks such as sizing caches impossible.

But that user space also has no idea "when" that memory will become
available, it could be never. This problem is similar to memory hotplug,
where we don't know "when" more memory might get hotplugged.

With deflate-on-OOM this behavior makes sense, because the guest can use
that memory whenever it wants -- it's actually available as soon as we
need it.

> 
> When deflate on OOM is not enabled, the current behavior of the
> virtio_balloon more or less resembles hotplugging individual pages, at
> least from an accounting perspective. This is basically hardcoding the
> requirement that totalram_pages must be available to the guest
> immediately, regardless of what the host does. While that is a valid
> policy, on Linux (which supports memory overcommit) with virtio_balloon
> (which is designed to facilitate overcommit in the host), it is not the
> only possible policy.
> 
> The param added by this patch allows the guest to operate under the
> assumption that pages in the virtio_balloon will generally be made
> available when needed. This assumption may not always hold, but when it
> is violated, the guest will just fall back to the normal mechanisms for
> dealing with overcommitted memory.
> 
> Independent of what policy the guest wants, the virtio_balloon device
> does not consider pages in the balloon as contributing to the guest's
> total amount of memory if deflate on OOM is not enabled. Ensure that the
> reported stats are consistent with this by adjusting totalram if a
> guest without deflate on OOM is skipping the calls to
> adjust_managed_page_count.
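
For concreteness, I read this as wiring something like the following
into the driver (the parameter name and helper below are invented for
illustration, not necessarily how the patch implements it):

        /* -1: keep today's default derived from VIRTIO_BALLOON_F_DEFLATE_ON_OOM */
        static int adjust_pages = -1;
        module_param(adjust_pages, int, 0444);

        /* Should inflate/deflate call adjust_managed_page_count()? */
        static bool virtballoon_adjust_pages(struct virtio_balloon *vb)
        {
                if (adjust_pages >= 0)
                        return adjust_pages;
                return !virtio_has_feature(vb->vdev,
                                           VIRTIO_BALLOON_F_DEFLATE_ON_OOM);
        }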

What about simply exposing the number of inflated balloon pages
("logically offline pages"), e.g., via /proc/meminfo, to user space? It's
then up to user space to be smart about memory that is not available
right now and might never become available -- but that it still wants
to optimize for eventually.
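
Roughly, in fs/proc/meminfo.c (the field name and helper are made up
here, just to illustrate the idea):

        /* meminfo_proc_show(), illustrative hunk only */
        show_val_kb(m, "MemTotal:       ", i.totalram);
        show_val_kb(m, "MemFree:        ", i.freeram);
        ...
        /* hypothetical helper returning the current balloon size */
        show_val_kb(m, "BalloonInflated:", balloon_inflated_pages());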

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2] virtio_balloon: add param to skip adjusting pages
       [not found]   ` <CAD=HUj7i7foyPE8a6dhj+=UR2jn5_vaQx-3jjKtjYrY8iSJWzw@mail.gmail.com>
@ 2021-11-19 13:36     ` David Hildenbrand
  2021-11-19 13:53       ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2021-11-19 13:36 UTC (permalink / raw)
  To: David Stevens; +Cc: Michael S. Tsirkin, virtualization

On 19.11.21 08:22, David Stevens wrote:
> On Thu, Nov 18, 2021 at 8:17 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.11.21 10:11, David Stevens wrote:
>>> From: David Stevens <stevensd@chromium.org>
>>
>> Hi David,
>>
>>>
>>> Add a module parameter to virtio_balloon to allow specifying whether or
>>> not the driver should call adjust_managed_page_count. If the parameter
>>> is set, it overrides the default behavior inferred from the deflate on
>>> OOM flag. This allows the balloon to operate without changing the amount
>>> of memory visible to userspace via /proc/meminfo or sysinfo, even on a
>>> system that cannot set the deflate on OOM flag.
>>>
>>> The motivation for this patch is to allow userspace to more accurately
>>> take advantage of virtio_balloon's cooperative memory control on a
>>> system without the ability to use deflate on OOM. As it stands,
>>> userspace has no way to know how much memory may be available on such a
>>> system, which makes tasks such as sizing caches impossible.
>>
>> But that user space also has no idea "when" that memory will become
>> available, it could be never.
> 
> Isn't this statement always true with respect to
> MemTotal/sysinfo.totalram? The kernel allocates and reserves memory,
> so there will always be some amount of memory that will never be
> available to userspace.

Please note that early allocations, most importantly the memmap,
are not accounted to MemTotal. This memory is similarly not managed
by the buddy (thus the name adjust_managed_page_count()).

But yes, there will always be some memory that will be accounted to
MemTotal that can never get freed. With memory ballooning it can easily
be in the range of gigabytes.
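
To make that concrete, this is roughly what the inflate/deflate paths
in drivers/virtio/virtio_balloon.c do today (paraphrased, not a
verbatim copy):

        /* fill_balloon(): the page left the buddy; without deflate-on-OOM it
         * is also subtracted from the managed page count, i.e. from MemTotal */
        if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
                adjust_managed_page_count(page, -1);

        /* leak_balloon(): the page goes back to the buddy and MemTotal grows */
        if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
                adjust_managed_page_count(page, 1);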

> And if you look at things from the context of
> a specific userspace process, there will be other processes running
> and using memory. So while that statement is true with respect to this
> change, that is also true without this change. The specific details
> might be changed by the proposed parameter, but it wouldn't be
> introducing any fundamentally new behavior to Linux.
> 

Please note that the hyper-v balloon just recently switched to using
adjust_managed_page_count() for proper accounting reasons:

commit d1df458cbfdb0c3384c03c7fbcb1689bc02a746c
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date:   Wed Dec 2 17:12:45 2020 +0100

    hv_balloon: do adjust_managed_page_count() when ballooning/un-ballooning
    
    Unlike virtio_balloon/virtio_mem/xen balloon drivers, Hyper-V balloon driver
    does not adjust managed pages count when ballooning/un-ballooning and this leads
    to incorrect stats being reported, e.g. unexpected 'free' output.
    
    Note, the calculation in post_status() seems to remain correct: ballooned out
    pages are never 'available' and we manually add dm->num_pages_ballooned to
    'commited'.


>> This problem is similar to memory hotplug,
>> where we don't know "when" more memory might get hotplugged.
> 
> There are some meaningful differences with respect to hotplug. First,
> with the balloon, the kernel knows exactly what the maximum amount of
> memory the guest could have is.

Not quite. There are hypervisors that add more memory using
DIMMs (and, as far as I have heard, virtio-mem) when required, but
always fake-unplug memory using virtio-balloon.

> Since the VM was created with that
> specific amount of memory, assuming that the host will be able to
> provide that amount of memory to the guest if the guest needs it is a
> relatively safe assumption (and if the system administrator doesn't
> want to make that assumption, they can skip using this parameter). On
> the other hand, with hotplug, the only maximum value the kernel has is
> the theoretical limit allowed by the hardware. That is clearly a
> different sort of limit, and of markedly less value.

So what you actually want to expose to user space is the initial VM size.
We cannot reliably say what the maximum will be in the future
(memory hotplug); it can even be lower than the initial VM size. So it's
a pure heuristic after all.

> 
> Second, there's also the difference as to the nature of the event. If
> someone went through the trouble to configure a VM to use the
> virtio_balloon, then I think it's fair to say that inflating/deflating
> the balloon is a 'normal' event that can happen with some regularity.

I disagree. Libvirt adds virtio-balloon by default to each and every VM.
Some people use virtio-balloon to logically unplug memory and never
hotplug it again.

> On the other hand, I don't think anyone would say that hotplug of
> physical memory is a normal event.

Well, I would say that :) The hyper-v balloon is a good example, and the
xen balloon also hotplugs actual memory on demand. And it's
becoming more popular in the virtio world with virtio-mem as well
(which some cloud providers already support).

> 
>> With deflate-on-OOM this behavior makes sense, because the guest can use
>> that memory whenever it wants -- it's actually available as soon as we
>> need it.
> 
> Well, by some definition of 'need'. It's available as soon as the
> kernel is about to OOM some process. By that point, we've probably
> already evicted a lot of caches, and the end user is going to be
> having a bad time. If we get to this point, given how poorly Linux
> usually behaves under low memory conditions, I think it is not an
> unreasonable viewpoint to prefer to OOM kill something, rather than to
> pull a paltry 1MB out of the balloon, at least on systems that are
> relatively resilient to OOMs.

Don't get me wrong, I am absolutely not a fan of deflate-on-oom. I think
it was a mistake, but apparently, there are some actual users.

> 
>>>
>>> When deflate on OOM is not enabled, the current behavior of the
>>> virtio_balloon more or less resembles hotplugging individual pages, at
>>> least from an accounting perspective. This is basically hardcoding the
>>> requirement that totalram_pages must be available to the guest
>>> immediately, regardless of what the host does. While that is a valid
>>> policy, on Linux (which supports memory overcommit) with virtio_balloon
>>> (which is designed to facilitate overcommit in the host), it is not the
>>> only possible policy.
>>>
>>> The param added by this patch allows the guest to operate under the
>>> assumption that pages in the virtio_balloon will generally be made
>>> available when needed. This assumption may not always hold, but when it
>>> is violated, the guest will just fall back to the normal mechanisms for
>>> dealing with overcommitted memory.
>>>
>>> Independent of what policy the guest wants, the virtio_balloon device
>>> does not consider pages in the balloon as contributing to the guest's
>>> total amount of memory if deflate on OOM is not enabled. Ensure that the
>>> reported stats are consistent with this by adjusting totalram if a
>>> guest without deflate on OOM is skipping the calls to
>>> adjust_managed_page_count.
>>
>> What about simply exposing the number of inflated balloon pages
>> ("logically offline pages"), e.g., via /proc/meminfo, to user space? It's
>> then up to user space to be smart about memory that is not available
>> right now and might never become available -- but that it still wants
>> to optimize for eventually.
> 
> That approach would require a lot of changes to userspace - probably
> nearly everywhere that uses _SC_PHYS_PAGES or get_phys_pages, or
> anywhere that parses /proc/meminfo. Actually properly using "logically
> offline pages" would require an additional API for monitoring changes
> to the value, and updating to that sort of listener API would not be a
> localized change, especially since most programs do not account for
> memory hotplug and just use the amount of physical memory during
> initialization. Realistically, nearly all of the callers would simply
> add together "logically offline pages" and MemTotal.

I'd appreciate a more generic approach for user space to figure out the
"initial memory size" in a virtualized environment than adding
some module parameter to virtio-balloon -- if that makes sense.

MemTotal as is expresses how much memory the buddy currently manages,
for example, excluding early allocations during boot, excluding actually
unplugged memory and excluding logically unplugged memory. Adjusting that
value makes perfect sense for virtio-balloon without deflate-on-oom.

Instead of changing MemTotal semantics, I'd say we introduce some other
mechanism to figure out the initial VM size -- "logically offline" memory
is just an example. "MemMax" might be misleading and easily wrong.
"MemInitial" might be an feasible option.

Yes, some special user space applications would have to be adjusted.
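
For reference, the pattern that would have to learn about such a new
field typically looks like this (illustrative only):

        #include <unistd.h>

        /* typical cache sizing at startup; on glibc, _SC_PHYS_PAGES is
         * derived from MemTotal / sysinfo.totalram */
        static long long phys_mem_bytes(void)
        {
                return (long long)sysconf(_SC_PHYS_PAGES) *
                       sysconf(_SC_PAGE_SIZE);
        }

        /* e.g.: cache_bytes = phys_mem_bytes() / 5; */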

> 
> It's also not clear to me what utility the extra information would
> provide to userspace. If userspace wants to know how much memory is
> available, they should use MemAvailable. If userspace wants to have a
> rough estimate for the maximum amount of memory in the system, they
> would add together MemTotal and "logically offline pages". The value
> of MemTotal with a no-deflate-on-oom virtio-balloon is a value with a
> vague meaning that lies somewhere between the "maximum amount of
> memory" and the "current amount of memory". I don't really see any
> situations where it should clearly be used over one of MemAvailable or
> MemTotal + "logically offline pages".

The issue is that any application that relies on MemTotal in a virtualized
environment is most probably already suboptimal in some cases. You can
rely on it and actually later someone will unplug (inflate balloon)
memory or plug (deflate balloon) memory. Even MemAvailable is suboptimal
because what about two applications that rely on that information at
the same time?

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2] virtio_balloon: add param to skip adjusting pages
  2021-11-19 13:36     ` David Hildenbrand
@ 2021-11-19 13:53       ` David Hildenbrand
       [not found]         ` <CAD=HUj5wPYLKJxsjgcnMu_NYQ6eMwmd-VDU0gbWbqgzOPkV6fg@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2021-11-19 13:53 UTC (permalink / raw)
  To: David Stevens; +Cc: Michael S. Tsirkin, virtualization

On 19.11.21 14:36, David Hildenbrand wrote:
> [...]
> 
> The issue is that any application that relies on MemTotal in a virtualized
> environment is most probably already suboptimal in some cases. You can
> rely on it and actually later someone will unplug (inflate balloon)
> memory or plug (deflate balloon) memory. Even MemAvailable is suboptimal
> because what about two applications that rely on that information at
> the same time?
> 

BTW, the general issue here is that "we don't know what the hypervisor
will do".

Maybe "MemMax" actually could make sense, where we expose the maximum
"MemTotal" we had so far since we were up an running. So the semantics
wouldn't be "maximum possible", because we don't know that, but instead
"maximum we had".

-- 
Thanks,

David / dhildenb


* Re: [PATCH v2] virtio_balloon: add param to skip adjusting pages
       [not found]         ` <CAD=HUj5wPYLKJxsjgcnMu_NYQ6eMwmd-VDU0gbWbqgzOPkV6fg@mail.gmail.com>
@ 2021-11-24  8:37           ` Michael S. Tsirkin
  2021-11-24 10:00             ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Michael S. Tsirkin @ 2021-11-24  8:37 UTC (permalink / raw)
  To: David Stevens; +Cc: virtualization

On Wed, Nov 24, 2021 at 01:55:16PM +0900, David Stevens wrote:
> > >> And if you look at things from the context of
> > >> a specific userspace process, there will be other processes running
> > >> and using memory. So while that statement is true with respect to this
> > >> change, that is also true without this change. The specific details
> > >> might be changed by the proposed parameter, but it wouldn't be
> > >> introducing any fundamentally new behavior to Linux.
> > >>
> > >
> > > Please note that the hyper-v balloon just recently switched to using
> > > adjust_managed_page_count() for proper accounting reasons:
> > >
> > > commit d1df458cbfdb0c3384c03c7fbcb1689bc02a746c
> > > Author: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > Date:   Wed Dec 2 17:12:45 2020 +0100
> > >
> > >     hv_balloon: do adjust_managed_page_count() when ballooning/un-ballooning
> > >
> > >     Unlike virtio_balloon/virtio_mem/xen balloon drivers, Hyper-V balloon driver
> > >     does not adjust managed pages count when ballooning/un-ballooning and this leads
> > >     to incorrect stats being reported, e.g. unexpected 'free' output.
> > >
> > >     Note, the calculation in post_status() seems to remain correct: ballooned out
> > >     pages are never 'available' and we manually add dm->num_pages_ballooned to
> > >     'commited'.
> > >
> 
> I saw this commit, but it wasn't entirely clear to me what problem it
> was addressing. Is it the issue Michael pointed out on v1 of my
> patch set, where memory in the balloon shouldn't be included in the
> free stat reported to the device? This version of my patch should
> address that specific issue. Managed page count is linux kernel
> specific metadata, so there's no fundamental reason that it needs to
> line up exactly with anything reported via the virtio-balloon API.
> 
> > >> That approach would require a lot of changes to userspace - probably
> > >> nearly everywhere that uses _SC_PHYS_PAGES or get_phys_pages, or
> > >> anywhere that parses /proc/meminfo. Actually properly using "logically
> > >> offline pages" would require an additional API for monitoring changes
> > >> to the value, and updating to that sort of listener API would not be a
> > >> localized change, especially since most programs do not account for
> > >> memory hotplug and just use the amount of physical memory during
> > >> initialization. Realistically, nearly all of the callers would simply
> > >> add together "logically offline pages" and MemTotal.
> > >
> > > I'd appreciate a more generic approach for user space to figure out the
> > > "initial memory size" in a virtualized environment than adding
> > > some module parameter to virtio-balloon -- if that makes sense.
> > >
> > > MemTotal as is expresses how much memory the buddy currently manages,
> > > for example, excluding early allocations during boot, excluding actually
> > > unplugged memory and excluding logically unplugged memory. Adjusting that
> > > value makes perfect sense for virtio-balloon without deflate-on-oom.
> > >
> 
> That's a definition of how MemTotal is implemented, but it's not
> really a specification of the MemTotal API. The closest thing to a
> real specification I can find is "Total usable RAM (i.e., physical RAM
> minus a few reserved bits and the kernel binary code)", from the proc
> man pages. I think there is quite a bit of leeway in changing how
> exactly MemTotal is implemented without violating the (quite vague)
> specification or changing any observable semantics of the API. In
> particular, leaving the pages in the balloon as part of MemTotal is
> essentially indistinguishable from simply having a non-OOM killable
> process locking an equivalent amount of memory. So this proposal isn't
> really introducing any fundamentally new behavior to the Linux kernel.
> 
> > >> It's also not clear to me what utility the extra information would
> > >> provide to userspace. If userspace wants to know how much memory is
> > >> available, they should use MemAvailable. If userspace wants to have a
> > >> rough estimate for the maximum amount of memory in the system, they
> > >> would add together MemTotal and "logically offline pages". The value
> > >> of MemTotal with a no-deflate-on-oom virtio-balloon is a value with a
> > >> vague meaning that lies somewhere between the "maximum amount of
> > >> memory" and the "current amount of memory". I don't really see any
> > >> situations where it should clearly be used over one of MemAvailable or
> > >> MemTotal + "logically offline pages".
> > >
> > > The issue is that any application that relies on MemTotal in a virtualized
> > > environment is most probably already suboptimal in some cases. You can
> > > rely on it and actually later someone will unplug (inflate balloon)
> > > memory or plug (deflate balloon) memory. Even MemAvailable is suboptimal
> > > because what about two applications that rely on that information at
> > > the same time?
> > >
> >
> > BTW, the general issue here is that "we don't know what the hypervisor
> > will do".
> 
> I do agree that this is a significant problem. I would expand on it a
> bit more, to be "since we don't know what the hypervisor will do, we
> don't know how to treat memory in the balloon". The proposed module
> parameter is more or less a mechanism to allow the system
> administrator to tell the virtio_balloon driver how the hypervisor
> behaves.


Now that you put it that way, it looks more like this should
be a feature bit not a module parameter.



> And if the hypervisor will give memory back to the guest when
> the guest needs it, then I don't think it's necessary to logically
> unplug the memory.

Ideally we would also pair this with sending a signal to the device
that memory is needed.

> It might be a bit cleaner to explicitly address this in the
> virtio_balloon protocol. We could add a min_num_pages field to the
> balloon config, with semantics along the lines of "The host will
> respond to memory pressure in the guest by deflating the balloon down
> to min_num_pages, unless it would cause system instability in the
> host". Given that feature, I think it would be reasonable to only
> consider min_num_pages as logically unplugged.

Okay. I think I would do it a bit differently though, make num_pages be
the min_num_pages, and add an extra_num_pages field for memory that is
nice to have but ok to drop.


As long as we are here, can we add a page_shift field please
so more than 2^44 bytes can be requested?
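
Just to make that concrete, the config space could end up looking
something like this (purely hypothetical layout, not part of the
virtio spec):

        struct virtio_balloon_config {
                __le32 num_pages;       /* would become the "must give back" part */
                __le32 actual;
                __le32 free_page_hint_cmd_id;
                __le32 poison_val;
                /* hypothetical additions: */
                __le32 extra_num_pages; /* nice to have, ok for the guest to keep */
                __le32 page_shift;      /* balloon page size as a shift */
        };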


> > Maybe "MemMax" actually could make sense, where we expose the maximum
> > "MemTotal" we had so far since we were up an running. So the semantics
> > wouldn't be "maximum possible", because we don't know that, but instead
> > "maximum we had".
> 
> Rather than add a new API, I think it is much better to make existing
> APIs behave closer to how they behave in a non-virtualized
> environment. It is true that we could go through and fix a limited
> number of special user space applications, but sysconf(_SC_PHYS_PAGES)
> and /proc/meminfo are not special APIs. Fixing every application that
> uses them is not feasible, especially when taking into account systems
> with closed-source applications (e.g. Android). Also, while MemMax is
> well defined, it has the same issues you brought up earlier -
> specifically, applications don't know whether the hypervisor will
> actually ever provide MemMax again, and they don't know whether MemMax
> is actually the real maximum amount of memory that could be available
> in the future. It's not clear to me that it's significantly better or
> more useful to userspace than simply changing how MemTotal is
> implemented.
> 
> -David

Agree on trying to avoid changing applications, limiting changes
to device and guest kernel, this has a lot of value.

-- 
MST


* Re: [PATCH v2] virtio_balloon: add param to skip adjusting pages
  2021-11-24  8:37           ` Michael S. Tsirkin
@ 2021-11-24 10:00             ` David Hildenbrand
  0 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2021-11-24 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Stevens; +Cc: virtualization

>>>> I'd appreciate a more generic approach for user space to figure out the
>>>> "initial memory size" in a virtualized environment than adding
>>>> some module parameter to virtio-balloon -- if that makes sense.
>>>>
>>>> MemTotal as is expresses how much memory the buddy currently manages,
>>>> for example, excluding early allocations during boot, excluding actually
>>>> unplugged memory and excluding logically unplugged memory. Adjusting that
>>>> value makes perfect sense for virtio-balloon without deflate-on-oom.
>>>>
>>
>> That's a definition of how MemTotal is implemented, but it's not
>> really a specification of the MemTotal API. The closest thing to a
>> real specification I can find is "Total usable RAM (i.e., physical RAM
>> minus a few reserved bits and the kernel binary code)", from the proc
>> man pages. I think there is quite a bit of leeway in changing how
>> exactly MemTotal is implemented without violating the (quite vague)
>> specification or changing any observable semantics of the API. In
>> particular, leaving the pages in the balloon as part of MemTotal is
>> essentially indistinguishable from simply having a non-OOM killable
>> process locking an equivalent amount of memory. So this proposal isn't
>> really introducing any fundamentally new behavior to the Linux kernel.

What MemTotal should indicate depends entirely on the intended semantics:
using balloon inflation to logically unplug memory vs. some kind of
cooperative memory management with the hypervisor.

For cooperative management I would strongly advise using free page
reporting instead if possible. It can't drain the pagecache so far, but
there are approaches being discussed on how to make that happen (e.g.,
using DAMON, or avoiding the guest page cache using virtio-pmem).
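
For reference, this is roughly how the existing driver opts into free
page reporting (paraphrased from drivers/virtio/virtio_balloon.c):

        if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
                /* the guest hands batches of free pages to the device, which
                 * can discard them and fault them back in on demand */
                vb->pr_dev_info.report = virtballoon_free_page_report;
                err = page_reporting_register(&vb->pr_dev_info);
        }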

>>
>>>>> It's also not clear to me what utility the extra information would
>>>>> provide to userspace. If userspace wants to know how much memory is
>>>>> available, they should use MemAvailable. If userspace wants to have a
>>>>> rough estimate for the maximum amount of memory in the system, they
>>>>> would add together MemTotal and "logically offline pages". The value
>>>>> of MemTotal with a no-deflate-on-oom virtio-balloon is a value with a
>>>>> vague meaning that lies somewhere between the "maximum amount of
>>>>> memory" and the "current amount of memory". I don't really see any
>>>>> situations where it should clearly be used over one of MemAvailable or
>>>>> MemTotal + "logically offline pages".
>>>>
>>>> The issue is that any application that relies on MemTotal in a virtualized
>>>> environment is most probably already suboptimal in some cases. You can
>>>> rely on it and actually later someone will unplug (inflate balloon)
>>>> memory or plug (deflate balloon) memory. Even MemAvailable is suboptimal
>>>> because what about two applications that rely on that information at
>>>> the same time?
>>>>
>>>
>>> BTW, the general issue here is that "we don't know what the hypervisor
>>> will do".
>>
>> I do agree that this is a significant problem. I would expand on it a
>> bit more, to be "since we don't know what the hypervisor will do, we
>> don't know how to treat memory in the balloon". The proposed module
>> parameter is more or less a mechanism to allow the system
>> administrator to tell the virtio_balloon driver how the hypervisor
>> behaves.
> 
> 
> Now that you put it that way, it looks more like this should
> be a feature bit not a module parameter.

It will be slightly better. At least the hypervisor can indicate
what it intends to do.

>> And if the hypervisor will give memory back to the guest when
>> the guest needs it, then I don't think it's necessary to logically
>> unplug the memory.
> 
> Ideally we would also pair this with sending a signal to the device
> that memory is needed.

Such approaches are in general problematic because once the guest is
already OOM, the hypervisor will most likely not react in time and it's
essentially too late.

So you need some policy somewhere that monitors memory consumption and
makes smart decisions. Usually this is implemented in the hypervisor by
monitoring VM stats.

IMHO the device is the wrong place. I recently discussed something
similar offline with potential virtio-mem users.

> 
>> It might be a bit cleaner to explicitly address this in the
>> virtio_balloon protocol. We could add a min_num_pages field to the
>> balloon config, with semantics along the lines of "The host will
>> respond to memory pressure in the guest by deflating the balloon down
>> to min_num_pages, unless it would cause system instability in the
>> host". Given that feature, I think it would be reasonable to only
>> consider min_num_pages as logically unplugged.
> 
> Okay. I think I would do it a bit differently though, make num_pages be
> the min_num_pages, and add an extra_num_pages field for memory that is
> nice to have but ok to drop.
> 
> 
> As long as we are here, can we add a page_shift field please
> so more than 2^44 bytes can be requested?
> 
> 
>>> Maybe "MemMax" actually could make sense, where we expose the maximum
>>> "MemTotal" we had so far since we were up an running. So the semantics
>>> wouldn't be "maximum possible", because we don't know that, but instead
>>> "maximum we had".
>>
>> Rather than add a new API, I think it is much better to make existing
>> APIs behave closer to how they behave in a non-virtualized
>> environment. It is true that we could go through and fix a limited
>> number of special user space applications, but sysconf(_SC_PHYS_PAGES)
>> and /proc/meminfo are not special APIs. Fixing every application that
>> uses them is not feasible, especially when taking into account systems
>> with closed-source applications (e.g. Android). Also, while MemMax is
>> well defined, it has the same issues you brought up earlier -
>> specifically, applications don't know whether the hypervisor will
>> actually ever provide MemMax again, and they don't know whether MemMax
>> is actually the real maximum amount of memory that could be available
>> in the future. It's not clear to me that it's significantly better or
>> more useful to userspace than simply changing how MemTotal is
>> implemented.
>>
>> -David
> 
> Agree on trying to avoid changing applications, limiting changes
> to device and guest kernel, this has a lot of value.

With free page reporting in place I barely see a future for such
features, but these are just my 2 cents.

Gluing it to a feature bit like "I, the device, will monitor your memory
consumption and adjust if you're in need of more memory" could be done.

-- 
Thanks,

David / dhildenb

