* [Qemu-devel] When it's okay to treat OOM as fatal?
@ 2018-10-16 13:01 Markus Armbruster
  2018-10-16 13:20 ` Daniel P. Berrangé
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Markus Armbruster @ 2018-10-16 13:01 UTC (permalink / raw)
  To: qemu-devel

We sometimes use g_new() & friends, which abort() on OOM, and sometimes
g_try_new() & friends, which can fail, and therefore require error
handling.

HACKING points out the difference, but is mum on when to use what:

    3. Low level memory management

    Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
    APIs is not allowed in the QEMU codebase. Instead of these routines,
    use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
    g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
    APIs.

    Please note that g_malloc will exit on allocation failure, so there
    is no need to test for failure (as you would have to with malloc).
    Calling g_malloc with a zero size is valid and will return NULL.

    Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
    reasons:

      a. It catches multiplication overflowing size_t;
      b. It returns T * instead of void *, letting compiler catch more type
         errors.

    Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.

    Memory allocated by qemu_memalign or qemu_blockalign must be freed with
    qemu_vfree, since breaking this will cause problems on Win32.
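
For illustration, the difference between the two styles boils down to this (a
minimal sketch; the struct and function names are made up, not actual QEMU
code):

    #include <glib.h>

    typedef struct Widget {
        guint8 *buf;
        gsize buf_len;
    } Widget;

    /* Small, internal allocation: g_new0() aborts on OOM, no check needed. */
    Widget *widget_new(void)
    {
        return g_new0(Widget, 1);
    }

    /* Large, input-sized allocation: g_try_new0() can return NULL, so the
     * caller has to be prepared to handle the failure. */
    gboolean widget_set_buf(Widget *w, gsize len)
    {
        w->buf = g_try_new0(guint8, len);
        if (!w->buf && len != 0) {
            return FALSE;       /* report the failure instead of aborting */
        }
        w->buf_len = len;
        return TRUE;
    }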

Now, in my personal opinion, handling OOM gracefully is worth the
(commonly considerable) trouble when you're coding for an Apple II or
similar.  Anything that pages commonly becomes unusable long before
allocations fail.  Anything that overcommits will send you a (commonly
lethal) signal instead.  Anything that tries handling OOM gracefully,
and manages to dodge both these bullets somehow, will commonly get it
wrong and crash.

But others are entitled to their opinions as much as I am.  I just want
to know what our rules are, preferably in the form of a patch to
HACKING.

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-16 13:01 [Qemu-devel] When it's okay to treat OOM as fatal? Markus Armbruster
@ 2018-10-16 13:20 ` Daniel P. Berrangé
  2018-10-18 13:06   ` Markus Armbruster
  2018-10-16 13:33 ` Dr. David Alan Gilbert
  2018-10-17 10:05 ` Stefan Hajnoczi
  2 siblings, 1 reply; 13+ messages in thread
From: Daniel P. Berrangé @ 2018-10-16 13:20 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel

On Tue, Oct 16, 2018 at 03:01:29PM +0200, Markus Armbruster wrote:
> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
> g_try_new() & friends, which can fail, and therefore require error
> handling.
> 
> HACKING points out the difference, but is mum on when to use what:
> 
>     3. Low level memory management
> 
>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
>     APIs is not allowed in the QEMU codebase. Instead of these routines,
>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
>     APIs.
> 
>     Please note that g_malloc will exit on allocation failure, so there
>     is no need to test for failure (as you would have to with malloc).
>     Calling g_malloc with a zero size is valid and will return NULL.
> 
>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
>     reasons:
> 
>       a. It catches multiplication overflowing size_t;
>       b. It returns T * instead of void *, letting compiler catch more type
>          errors.
> 
>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
> 
>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
>     qemu_vfree, since breaking this will cause problems on Win32.
> 
> Now, in my personal opinion, handling OOM gracefully is worth the
> (commonly considerable) trouble when you're coding for an Apple II or
> similar.  Anything that pages commonly becomes unusable long before
> allocations fail.  Anything that overcommits will send you a (commonly
> lethal) signal instead.  Anything that tries handling OOM gracefully,
> and manages to dodge both these bullets somehow, will commonly get it
> wrong and crash.

FWIW, with the cgroups memory controller (with or without containers)
you can be in an environment where there's a memory cap. This can
conceivably cause QEMU to see ENOMEM, while the host OS in general
is operating normally with no swap usage / paging.

That said, no one has ever been able to come up with an algorithm that
reliably predicts the "normal" QEMU peak memory usage. So any time the
cgroups memory cap has been used, it has typically resulted in QEMU
unreasonably aborting in normal operation. This makes it impractical
to try to confine QEMU's memory usage with cgroups IMHO.

> But others are entitled to their opinions as much as I am.  I just want
> to know what our rules are, preferably in the form of a patch to
> HACKING.

I vaguely recall it being said that we should use g_try_new in code
paths that can be triggered from monitor commands that would cause
allocation of "significant" amounts of RAM, for some arbitrary
definition of what "significant" means.

eg hotplug a QXL PCI video card with 256 MB of video RAM, you might
use g_try_new() for allocating this 256 MB chunk and return gracefully
on failure, rather than the hotplug op causing QEMU to abort.

The problem with OOM handling is proving that the cleanup paths you
take actually do something sensible / correct, rather than result
in cascading failures due to further OOMs. You're going to need test
cases that exercise the relevant codepaths, and a way to inject OOM
at each individual malloc, or across a sequence of mallocs. This is
extraordinarily expensive to test as it becomes a combinatorial
problem.
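
The injection mechanism itself is the easy bit; a sketch of the usual
countdown approach (purely illustrative, not libvirt's or QEMU's actual test
harness):

    #include <stdlib.h>

    /* Fail the allocation with index nth_alloc (counting from zero); a test
     * loop can sweep nth_alloc across every allocation the code performs. */
    static long fail_countdown = -1;    /* -1 means "never fail" */

    void oom_inject_arm(long nth_alloc)
    {
        fail_countdown = nth_alloc;
    }

    void *test_malloc(size_t size)
    {
        if (fail_countdown >= 0 && fail_countdown-- == 0) {
            return NULL;                /* simulated OOM */
        }
        return malloc(size);
    }

The combinatorial expense comes from sweeping nth_alloc, and sequences of
such failures, across every allocation site in the code under test.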

We've done such exhaustive malloc failure testing in libvirt before
but it takes such a long time and it is hard to characterize "correct"
output of the test suite. This meant we caught obvious mistakes that
led to SEGVs for the test, but needed hand inspection to identify
cases where we incorrectly carried on executing with critical data
missing due to the OOM.  It has been a while since I last tried to do
OOM testing of libvirt, so I don't have high confidence in us doing
something sensible. The only thing in our favour is that we've designed
our malloc API replacements so that the pointer to allocated memory is
returned to the caller separately from the success/failure status.
Combined with attribute((return_check)) this let us get compile time
validation that we are actually checking for malloc failures. GLib's
g_try_new APIs don't allow such compile time checking as they still
overload the pointer with the success/failure status.
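
Roughly, that pattern looks like this (an illustrative sketch, not libvirt's
actual API; the attribute in question corresponds to GCC's
warn_unused_result):

    #include <stdlib.h>

    /* The status is the return value, the pointer comes back through an
     * out-parameter, and the compiler warns if the caller ignores the
     * status. */
    __attribute__((warn_unused_result))
    int alloc_n(void **ptr, size_t count, size_t size)
    {
        /* calloc checks count * size for overflow */
        *ptr = calloc(count, size);
        if (!*ptr && count != 0 && size != 0) {
            return -1;                  /* OOM */
        }
        return 0;
    }

In libvirt the cast and the sizeof are hidden behind macros (VIR_ALLOC() and
friends, at the time of this thread), so ignoring the status is a compile
time warning rather than something a reviewer has to spot.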

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-16 13:01 [Qemu-devel] When it's okay to treat OOM as fatal? Markus Armbruster
  2018-10-16 13:20 ` Daniel P. Berrangé
@ 2018-10-16 13:33 ` Dr. David Alan Gilbert
  2018-10-18 14:46   ` Markus Armbruster
  2018-10-17 10:05 ` Stefan Hajnoczi
  2 siblings, 1 reply; 13+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-16 13:33 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel

* Markus Armbruster (armbru@redhat.com) wrote:
> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
> g_try_new() & friends, which can fail, and therefore require error
> handling.
> 
> HACKING points out the difference, but is mum on when to use what:
> 
>     3. Low level memory management
> 
>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
>     APIs is not allowed in the QEMU codebase. Instead of these routines,
>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
>     APIs.
> 
>     Please note that g_malloc will exit on allocation failure, so there
>     is no need to test for failure (as you would have to with malloc).
>     Calling g_malloc with a zero size is valid and will return NULL.
> 
>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
>     reasons:
> 
>       a. It catches multiplication overflowing size_t;
>       b. It returns T * instead of void *, letting compiler catch more type
>          errors.
> 
>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
> 
>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
>     qemu_vfree, since breaking this will cause problems on Win32.
> 
> Now, in my personal opinion, handling OOM gracefully is worth the
> (commonly considerable) trouble when you're coding for an Apple II or
> similar.  Anything that pages commonly becomes unusable long before
> allocations fail.

That's not always my experience; I've seen cases where you suddenly
allocate a load more memory and hit OOM fairly quickly on that hot
process.  Most of the time on the desktop you're right.

> Anything that overcommits will send you a (commonly
> lethal) signal instead.  Anything that tries handling OOM gracefully,
> and manages to dodge both these bullets somehow, will commonly get it
> wrong and crash.

If your qemu has mapped its main memory from hugetlbfs or similar pools
then we're looking at the other memory allocations; and that's a bit of
an interesting difference where those other allocations should be a lot
smaller.

> But others are entitled to their opinions as much as I am.  I just want
> to know what our rules are, preferably in the form of a patch to
> HACKING.

My rule is to try not to break a happily running VM by some new
activity; I don't worry about it during startup.

So for example, I don't like it when starting a migration, allocates
some more memory and kills the VM - the user had a happy stable VM
up to that point.  Migration gets the blame at this point.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-16 13:01 [Qemu-devel] When it's okay to treat OOM as fatal? Markus Armbruster
  2018-10-16 13:20 ` Daniel P. Berrangé
  2018-10-16 13:33 ` Dr. David Alan Gilbert
@ 2018-10-17 10:05 ` Stefan Hajnoczi
  2 siblings, 0 replies; 13+ messages in thread
From: Stefan Hajnoczi @ 2018-10-17 10:05 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel

On Tue, Oct 16, 2018 at 03:01:29PM +0200, Markus Armbruster wrote:
> Anything that pages commonly becomes unusable long before
> allocations fail.  Anything that overcommits will send you a (commonly
> lethal) signal instead.  Anything that tries handling OOM gracefully,
> and manages to dodge both these bullets somehow, will commonly get it
> wrong and crash.

In the block layer blk_try_blockalign() (previously
qemu_try_blockalign()) is used because significant amounts of memory can
be allocated by the untrusted guest or untrusted disk image files.  I
think the error handling is reasonable in those cases:
1. QEMU startup or disk hotplug fail with a nice error message
OR
2. An I/O request is failed (ultimately just EIO error reporting but
   it's better than killing the QEMU process!)

I'm pretty sure ENOMEM errors are possible even when memory overcommit
is enabled.

My thinking has been to use g_new() for small QEMU-internal structures
and g_try_new() for large amounts of memory allocated in response to
untrusted inputs.  (Untrusted inputs must never be used for unbounded
allocation sizes but those bounded sizes can still be large.)
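
As a sketch of that policy (illustrative only; the real code goes through
blk_try_blockalign() and the block layer's usual error reporting):

    #include <glib.h>
    #include <errno.h>

    #define REQUEST_MAX_BYTES (64 * 1024 * 1024)    /* made-up bound */

    /* guest_len is untrusted (guest- or image-controlled). */
    int request_alloc_buf(void **buf, guint64 guest_len)
    {
        if (guest_len > REQUEST_MAX_BYTES) {
            return -EINVAL;     /* reject unbounded requests outright */
        }
        *buf = g_try_malloc(guest_len);
        if (!*buf && guest_len != 0) {
            return -ENOMEM;     /* fail just this request, not the process */
        }
        return 0;
    }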

Stefan

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-16 13:20 ` Daniel P. Berrangé
@ 2018-10-18 13:06   ` Markus Armbruster
  2018-10-18 14:28     ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Markus Armbruster @ 2018-10-18 13:06 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: qemu-devel

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Oct 16, 2018 at 03:01:29PM +0200, Markus Armbruster wrote:
>> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
>> g_try_new() & friends, which can fail, and therefore require error
>> handling.
>> 
>> HACKING points out the difference, but is mum on when to use what:
>> 
>>     3. Low level memory management
>> 
>>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
>>     APIs is not allowed in the QEMU codebase. Instead of these routines,
>>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
>>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
>>     APIs.
>> 
>>     Please note that g_malloc will exit on allocation failure, so there
>>     is no need to test for failure (as you would have to with malloc).
>>     Calling g_malloc with a zero size is valid and will return NULL.
>> 
>>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
>>     reasons:
>> 
>>       a. It catches multiplication overflowing size_t;
>>       b. It returns T * instead of void *, letting compiler catch more type
>>          errors.
>> 
>>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
>> 
>>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
>>     qemu_vfree, since breaking this will cause problems on Win32.
>> 
>> Now, in my personal opinion, handling OOM gracefully is worth the
>> (commonly considerable) trouble when you're coding for an Apple II or
>> similar.  Anything that pages commonly becomes unusable long before
>> allocations fail.  Anything that overcommits will send you a (commonly
>> lethal) signal instead.  Anything that tries handling OOM gracefully,
>> and manages to dodge both these bullets somehow, will commonly get it
>> wrong and crash.
>
> FWIW, with the cgroups memory controller (with or without containers)
> you can be in an environment where there's a memory cap. This can
> conceivably cause QEMU to see ENOMEM, while the host OS in general
> is operating normally with no swap usage / paging.
>
> That said, no one has ever been able to come up with an algorithm that
> reliably predicts the "normal" QEMU peak memory usage. So any time the
> cgroups memory cap has been used, it has typically resulted in QEMU
> unreasonably aborting in normal operation. This makes it impractical
> to try to confine QEMU's memory usage with cgroups IMHO.
>
>> But others are entitled to their opinions as much as I am.  I just want
>> to know what our rules are, preferably in the form of a patch to
>> HACKING.
>
> I vaguely recall it being said that we should use g_try_new in code
> paths that can be triggered from monitor commands that would cause
> allocation of "significant" amounts of RAM, for some arbitrary
> definition of what "significant" means.
>
> eg hotplug a QXL PCI video card with 256 MB of video RAM, you might
> use g_try_new() for allocating this 256 MB chunk and return gracefully
> on failure, rather than the hotplug op causing QEMU to abort.

Funny you picked this example.  It happens to be one of the devices that
made me ask.

Device "qxl" creates a memory region "qxl.vgavram" with a size taken
from uint32_t property "ram_size", silently rounded up to the next power
of two.  It uses &error_fatal for error handling.

Let's play with it.

    $ upstream-qemu -monitor stdio -display none -device qxl,ram_size=2147483648
    QEMU 3.0.50 monitor - type 'help' for more information
    (qemu) info qtree
    bus: main-system-bus
      [...]
      dev: i440FX-pcihost, id ""
        pci-hole64-size = 2147483648 (2 GiB)
        short_root_bus = 0 (0x0)
        x-pci-hole64-fix = true
        bus: pci.0
          type PCI
          dev: qxl, id ""
--->        ram_size = 2147483648 (0x80000000)
            vram_size = 67108864 (0x4000000)
            [...]

Happily allocates 2GiB of RAM.  I could do this with a monitor command
(qxl is hot-pluggable), but I'm too lazy for that.

Adding another 26 of them for a total of 54 GiB also succeeds.  That's
more than this box has RAM and swap space combined.

Fun: scratch -display none, and Gtk starts spitting messages at seven
qxl devices, and SEGVs at eight.

Cherry on top:

    $ upstream-qemu -device qxl,ram_size=2147483649
    upstream-qemu: /home/armbru/work/qemu/exec.c:1891: find_ram_offset: Assertion `size != 0' failed.
    Aborted (core dumped)
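
Most likely what happens here (worth double-checking against the qxl code):
2147483649 is 0x80000001, the next power of two above it is 0x100000000, and
that no longer fits into the 32-bit ram_size, so the region size ends up 0
and trips the assertion.  In miniature:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t rounded = UINT64_C(0x100000000);   /* pow2 ceiling of 0x80000001 */
        uint32_t ram_size = rounded;                /* truncated to 32 bits */
        printf("0x%x\n", (unsigned)ram_size);       /* prints 0x0 */
        return 0;
    }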

My points are:

1. Even if we 'should use g_try_new in code paths that can be triggered
   from monitor commands that would cause allocation of "significant"
   amounts of RAM', we actually don't, at least not anywhere near
   consistently.

2. And even when we don't, that's not the actual problem, simply because
   allocation stubbornly refuses to fail.  Instead we die of other
   causes.

> The problem with OOM handling is proving that the cleanup paths you
> take actually do something sensible / correct, rather than result
> in cascading failures due to further OOMs. You're going to need test
> cases that exercise the relevant codepaths, and a way to inject OOM
> at each individual malloc, or across a sequence of mallocs. This is
> extraordinarily expensive to test as it becomes a combinatorial
> problem.

Exactly.

> We've done such exhaustive malloc failure testing in libvirt before
> but it takes such a long time and it is hard to characterize "correct"
> output of the test suite. This meant we caught obvious mistakes that
> led to SEGVs for the test, but needed hand inspection to identify
> cases where we incorrectly carried on executing with critical data
> missing due to the OOM.  It has been a while since I last tried to do
> OOM testing of libvirt, so I don't have high confidence in us doing
> something sensible.

If "extraordinary expensive" work results in low confidence, decaying
quickly to even lower confidence unless you expensively maintain it,
then it's a bad investment.

>                     The only thing in our favour is that we've designed
> our malloc API replacements so that the pointer to allocated memory is
> returned to the caller separately from the success/failure status.
> Combined with attribute((return_check)) this let us get compile time
> validation that we are actually checking for malloc failures. GLib's
> g_try_new APIs don't allow such compile time checking as they still
> overload the pointer with the success/failure status.

Forcing error handling into existence is the easy part.  Making sure it
actually works is much, much harder.

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-18 13:06   ` Markus Armbruster
@ 2018-10-18 14:28     ` Paolo Bonzini
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2018-10-18 14:28 UTC (permalink / raw)
  To: Markus Armbruster, Daniel P. Berrangé; +Cc: qemu-devel

On 18/10/2018 15:06, Markus Armbruster wrote:
> Device "qxl" creates a memory region "qxl.vgavram" with a size taken
> from uint32_t property "ram_size", silently rounded up to the next power
> of two.  It uses &error_fatal for error handling.

That's good to some extent---it means that the core code _is_ ready for
handling ENOMEM in this part of QEMU, it's just the device that doesn't
use it.
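
In other words, the fix would be roughly to let realize propagate the error
instead of passing &error_fatal (a sketch only; the call site and field names
in qxl are assumptions):

    static void qxl_realize_sketch(PCIDevice *dev, Error **errp)
    {
        PCIQXLDevice *qxl = PCI_QXL(dev);
        Error *local_err = NULL;

        memory_region_init_ram(&qxl->vga.vram, OBJECT(dev), "qxl.vgavram",
                               qxl->vga.vram_size, &local_err);
        if (local_err) {
            error_propagate(errp, local_err);   /* instead of &error_fatal */
            return;
        }
        /* ... */
    }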

Paolo

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-16 13:33 ` Dr. David Alan Gilbert
@ 2018-10-18 14:46   ` Markus Armbruster
  2018-10-18 14:54     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 13+ messages in thread
From: Markus Armbruster @ 2018-10-18 14:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
>> g_try_new() & friends, which can fail, and therefore require error
>> handling.
>> 
>> HACKING points out the difference, but is mum on when to use what:
>> 
>>     3. Low level memory management
>> 
>>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
>>     APIs is not allowed in the QEMU codebase. Instead of these routines,
>>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
>>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
>>     APIs.
>> 
>>     Please note that g_malloc will exit on allocation failure, so there
>>     is no need to test for failure (as you would have to with malloc).
>>     Calling g_malloc with a zero size is valid and will return NULL.
>> 
>>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
>>     reasons:
>> 
>>       a. It catches multiplication overflowing size_t;
>>       b. It returns T * instead of void *, letting compiler catch more type
>>          errors.
>> 
>>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
>> 
>>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
>>     qemu_vfree, since breaking this will cause problems on Win32.
>> 
>> Now, in my personal opinion, handling OOM gracefully is worth the
>> (commonly considerable) trouble when you're coding for an Apple II or
>> similar.  Anything that pages commonly becomes unusable long before
>> allocations fail.
>
> That's not always my experience; I've seen cases where you suddenly
> allocate a load more memory and hit OOM fairly quickly on that hot
> process.  Most of the time on the desktop you're right.
>
>> Anything that overcommits will send you a (commonly
>> lethal) signal instead.  Anything that tries handling OOM gracefully,
>> and manages to dodge both these bullets somehow, will commonly get it
>> wrong and crash.
>
> If your qemu has mapped its main memory from hugetlbfs or similar pools
> then we're looking at the other memory allocations; and that's a bit of
> an interesting difference where those other allocations should be a lot
> smaller.
>
>> But others are entitled to their opinions as much as I am.  I just want
>> to know what our rules are, preferably in the form of a patch to
>> HACKING.
>
> My rule is to try not to break a happily running VM by some new
> activity; I don't worry about it during startup.
>
> So for example, I don't like it when starting a migration, allocates
> some more memory and kills the VM - the user had a happy stable VM
> up to that point.  Migration gets the blame at this point.

I don't doubt reliable OOM handling would be nice.  I do doubt it's
practical for an application like QEMU.

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-18 14:46   ` Markus Armbruster
@ 2018-10-18 14:54     ` Dr. David Alan Gilbert
  2018-10-18 17:26       ` Markus Armbruster
  0 siblings, 1 reply; 13+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-18 14:54 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
> >> g_try_new() & friends, which can fail, and therefore require error
> >> handling.
> >> 
> >> HACKING points out the difference, but is mum on when to use what:
> >> 
> >>     3. Low level memory management
> >> 
> >>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
> >>     APIs is not allowed in the QEMU codebase. Instead of these routines,
> >>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
> >>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
> >>     APIs.
> >> 
> >>     Please note that g_malloc will exit on allocation failure, so there
> >>     is no need to test for failure (as you would have to with malloc).
> >>     Calling g_malloc with a zero size is valid and will return NULL.
> >> 
> >>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
> >>     reasons:
> >> 
> >>       a. It catches multiplication overflowing size_t;
> >>       b. It returns T * instead of void *, letting compiler catch more type
> >>          errors.
> >> 
> >>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
> >> 
> >>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
> >>     qemu_vfree, since breaking this will cause problems on Win32.
> >> 
> >> Now, in my personal opinion, handling OOM gracefully is worth the
> >> (commonly considerable) trouble when you're coding for an Apple II or
> >> similar.  Anything that pages commonly becomes unusable long before
> >> allocations fail.
> >
> > That's not always my experience; I've seen cases where you suddenly
> > allocate a load more memory and hit OOM fairly quickly on that hot
> > process.  Most of the time on the desktop you're right.
> >
> >> Anything that overcommits will send you a (commonly
> >> lethal) signal instead.  Anything that tries handling OOM gracefully,
> >> and manages to dodge both these bullets somehow, will commonly get it
> >> wrong and crash.
> >
> > If your qemu has mapped its main memory from hugetlbfs or similar pools
> > then we're looking at the other memory allocations; and that's a bit of
> > an interesting difference where those other allocations should be a lot
> > smaller.
> >
> >> But others are entitled to their opinions as much as I am.  I just want
> >> to know what our rules are, preferably in the form of a patch to
> >> HACKING.
> >
> > My rule is to try not to break a happily running VM by some new
> > activity; I don't worry about it during startup.
> >
> > So for example, I don't like it when starting a migration, allocates
> > some more memory and kills the VM - the user had a happy stable VM
> > up to that point.  Migration gets the blame at this point.
> 
> I don't doubt reliable OOM handling would be nice.  I do doubt it's
> practical for an application like QEMU.

Well, our use of glib certainly makes it much much harder.
I just try and make sure anywhere that I'm allocating a non-trivial
amount of memory (especially anything guest or user controlled) uses
the _try_ variants.  That should keep a lot of the larger allocations.
However, it scares me that we've got things that can return big chunks
of JSON for example, and I don't think they're being careful about it.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-18 14:54     ` Dr. David Alan Gilbert
@ 2018-10-18 17:26       ` Markus Armbruster
  2018-10-18 18:01         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 13+ messages in thread
From: Markus Armbruster @ 2018-10-18 17:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>> 
>> > * Markus Armbruster (armbru@redhat.com) wrote:
>> >> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
>> >> g_try_new() & friends, which can fail, and therefore require error
>> >> handling.
>> >> 
>> >> HACKING points out the difference, but is mum on when to use what:
>> >> 
>> >>     3. Low level memory management
>> >> 
>> >>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
>> >>     APIs is not allowed in the QEMU codebase. Instead of these routines,
>> >>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
>> >>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
>> >>     APIs.
>> >> 
>> >>     Please note that g_malloc will exit on allocation failure, so there
>> >>     is no need to test for failure (as you would have to with malloc).
>> >>     Calling g_malloc with a zero size is valid and will return NULL.
>> >> 
>> >>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
>> >>     reasons:
>> >> 
>> >>       a. It catches multiplication overflowing size_t;
>> >>       b. It returns T * instead of void *, letting compiler catch more type
>> >>          errors.
>> >> 
>> >>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
>> >> 
>> >>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
>> >>     qemu_vfree, since breaking this will cause problems on Win32.
>> >> 
>> >> Now, in my personal opinion, handling OOM gracefully is worth the
>> >> (commonly considerable) trouble when you're coding for an Apple II or
>> >> similar.  Anything that pages commonly becomes unusable long before
>> >> allocations fail.
>> >
>> > That's not always my experience; I've seen cases where you suddenly
>> > allocate a load more memory and hit OOM fairly quickly on that hot
>> > process.  Most of the time on the desktop you're right.
>> >
>> >> Anything that overcommits will send you a (commonly
>> >> lethal) signal instead.  Anything that tries handling OOM gracefully,
>> >> and manages to dodge both these bullets somehow, will commonly get it
>> >> wrong and crash.
>> >
>> > If your qemu has mapped its main memory from hugetlbfs or similar pools
>> > then we're looking at the other memory allocations; and that's a bit of
>> > an interesting difference where those other allocations should be a lot
>> > smaller.
>> >
>> >> But others are entitled to their opinions as much as I am.  I just want
>> >> to know what our rules are, preferably in the form of a patch to
>> >> HACKING.
>> >
>> > My rule is to try not to break a happily running VM by some new
>> > activity; I don't worry about it during startup.
>> >
>> > So for example, I don't like it when starting a migration, allocates
>> > some more memory and kills the VM - the user had a happy stable VM
>> > up to that point.  Migration gets the blame at this point.
>> 
>> I don't doubt reliable OOM handling would be nice.  I do doubt it's
>> practical for an application like QEMU.
>
> Well, our use of glib certainly makes it much much harder.
> I just try and make sure anywhere that I'm allocating a non-trivial
> amount of memory (especially anything guest or user controlled) uses
> the _try_ variants.  That should keep a lot of the larger allocations.

Matters only when your g_try_new()s actually fail (which they won't, at
least not reliably), and your error paths actually work (which they
won't unless you test them, no offense).

> However, it scares me that we've got things that can return big chunks
> of JSON for example, and I don't think they're being careful about it.

We got countless allocations small and large (large as in Gigabytes)
that kill QEMU on OOM.  Some of the small allocations add up to
Megabytes (QObjects for JSON work, for example).

Yet the *practical* problem isn't lack of graceful handling when these
allocations fail.  Because they pretty much don't.

The practical problem I see is general confusion on what to do about
OOM.  There's no written guidance.  Vague rules of thumb on when to
handle OOM are floating around.  Code gets copied.  Unsurprisingly, OOM
handling is a haphazard affair.

In this state, whatever OOM handling we have is too unreliable to be
worth much, since it can only help when (1) allocations actually fail
(they generally don't), and (2) the allocation that fails is actually
handled (they generally aren't), and (3) the handling actually works (we
don't test OOM, so it generally doesn't).

For the sake of the argument, let's assume there's a practical way to
run QEMU so that memory allocations actually fail.  We then still need
to find a way to increase the probability for failed allocations to be
actually handled, and the probability for the error handling to actually
work, both to a useful level.  This will require rules on OOM handling,
a strategy to make them stick, a strategy to test OOM, and resources to
implement all that.

Will the benefits be worth the effort?  Arguing about that in the
near-total vacuum we have now is unlikely to be productive.  To ground
the debate at least somewhat, I'd like those of us in favour of OOM
handling to propose a first draft of OOM handling rules.

If we can't do even that, I'll be tempted to shoot down OOM handling in
patches to code I maintain.

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-18 17:26       ` Markus Armbruster
@ 2018-10-18 18:01         ` Dr. David Alan Gilbert
  2018-10-19  5:43           ` Markus Armbruster
  0 siblings, 1 reply; 13+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-18 18:01 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >> 
> >> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> >> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
> >> >> g_try_new() & friends, which can fail, and therefore require error
> >> >> handling.
> >> >> 
> >> >> HACKING points out the difference, but is mum on when to use what:
> >> >> 
> >> >>     3. Low level memory management
> >> >> 
> >> >>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
> >> >>     APIs is not allowed in the QEMU codebase. Instead of these routines,
> >> >>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
> >> >>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
> >> >>     APIs.
> >> >> 
> >> >>     Please note that g_malloc will exit on allocation failure, so there
> >> >>     is no need to test for failure (as you would have to with malloc).
> >> >>     Calling g_malloc with a zero size is valid and will return NULL.
> >> >> 
> >> >>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
> >> >>     reasons:
> >> >> 
> >> >>       a. It catches multiplication overflowing size_t;
> >> >>       b. It returns T * instead of void *, letting compiler catch more type
> >> >>          errors.
> >> >> 
> >> >>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
> >> >> 
> >> >>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
> >> >>     qemu_vfree, since breaking this will cause problems on Win32.
> >> >> 
> >> >> Now, in my personal opinion, handling OOM gracefully is worth the
> >> >> (commonly considerable) trouble when you're coding for an Apple II or
> >> >> similar.  Anything that pages commonly becomes unusable long before
> >> >> allocations fail.
> >> >
> >> > That's not always my experience; I've seen cases where you suddenly
> >> > allocate a load more memory and hit OOM fairly quickly on that hot
> >> > process.  Most of the time on the desktop you're right.
> >> >
> >> >> Anything that overcommits will send you a (commonly
> >> >> lethal) signal instead.  Anything that tries handling OOM gracefully,
> >> >> and manages to dodge both these bullets somehow, will commonly get it
> >> >> wrong and crash.
> >> >
> >> > If your qemu has mapped its main memory from hugetlbfs or similar pools
> >> > then we're looking at the other memory allocations; and that's a bit of
> >> > an interesting difference where those other allocations should be a lot
> >> > smaller.
> >> >
> >> >> But others are entitled to their opinions as much as I am.  I just want
> >> >> to know what our rules are, preferably in the form of a patch to
> >> >> HACKING.
> >> >
> >> > My rule is to try not to break a happily running VM by some new
> >> > activity; I don't worry about it during startup.
> >> >
> >> > So for example, I don't like it when starting a migration, allocates
> >> > some more memory and kills the VM - the user had a happy stable VM
> >> > up to that point.  Migration gets the blame at this point.
> >> 
> >> I don't doubt reliable OOM handling would be nice.  I do doubt it's
> >> practical for an application like QEMU.
> >
> > Well, our use of glib certainly makes it much much harder.
> > I just try and make sure anywhere that I'm allocating a non-trivial
> > amount of memory (especially anything guest or user controlled) uses
> > the _try_ variants.  That should keep a lot of the larger allocations.
> 
> Matters only when your g_try_new()s actually fail (which they won't, at
> least not reliably), and your error paths actually work (which they
> won't unless you test them, no offense).
> 
> > However, it scares me that we've got things that can return big chunks
> > of JSON for example, and I don't think they're being careful about it.
> 
> We got countless allocations small and large (large as in Gigabytes)
> that kill QEMU on OOM.  Some of the small allocations add up to
> Megabytes (QObjects for JSON work, for example).
> 
> Yet the *practical* problem isn't lack of graceful handling when these
> allocations fail.  Because they pretty much don't.
> 
> The practical problem I see is general confusion on what to do about
> OOM.  There's no written guidance.  Vague rules of thumb on when to
> handle OOM are floating around.  Code gets copied.  Unsurprisingly, OOM
> handling is a haphazard affair.

> In this state, whatever OOM handling we have is too unreliable to be
> worth much, since it can only help when (1) allocations actually fail
> (they generally don't), and (2) the allocation that fails is actually
> handled (they generally aren't), and (3) the handling actually works (we
> don't test OOM, so it generally doesn't).
> 
> For the sake of the argument, let's assume there's a practical way to
> run QEMU so that memory allocations actually fail.  We then still need
> to find a way to increase the probability for failed allocations to be
> actually handled, and the probability for the error handling to actually
> work, both to a useful level.  This will require rules on OOM handling,
> a strategy to make them stick, a strategy to test OOM, and resources to
> implement all that.

There's probably no way to guarantee we've got all paths, however we
can test in restricted memory environments.
For example we could set up a test environment that runs a series of
hotplug or migration tests (say avocado or something) in cgroups
or nested VMs with random reduced amounts of RAM.  These will blow up
spectacularly and we can slowly attack some of the more common paths.

If we can find common cases then perhaps we can identify things to use
static checkers for.

We can also try setting up tests in environments closer to the way
OpenStack and oVirt configure their hosts; they seem to jump through
hoops to get a feeling of how much spare memory to allocate, but of
course since we don't define how much we use they can't really do that.

Using  mlock would probably make the allocations more likely to
fail rather than fault later?

> Will the benefits be worth the effort?  Arguing about that in the
> near-total vacuum we have now is unlikely to be productive.  To ground
> the debate at least somewhat, I'd like those of us in favour of OOM
> handling to propose a first draft of OOM handling rules.

Well, I'm up to give it a go; but before I do, can you define a bit more
what you want. Firstly what do you define as 'OOM handling' and secondly
what type of level of rules do you want.

> If we can't do even that, I'll be tempted to shoot down OOM handling in
> patches to code I maintain.

Please please don't do that;  getting it right in the monitor path and
QMP is important for those cases where we generate big chunks of JSON
(it would be better if we didn't generate big chunks of JSON, but that's
a partially separate problem).

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-18 18:01         ` Dr. David Alan Gilbert
@ 2018-10-19  5:43           ` Markus Armbruster
  2018-10-19 10:07             ` Dr. David Alan Gilbert
  2018-10-22 13:40             ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 13+ messages in thread
From: Markus Armbruster @ 2018-10-19  5:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Markus Armbruster, qemu-devel

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>> 
>> > * Markus Armbruster (armbru@redhat.com) wrote:
>> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>> >> 
>> >> > * Markus Armbruster (armbru@redhat.com) wrote:
>> >> >> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
>> >> >> g_try_new() & friends, which can fail, and therefore require error
>> >> >> handling.
>> >> >> 
>> >> >> HACKING points out the difference, but is mum on when to use what:
>> >> >> 
>> >> >>     3. Low level memory management
>> >> >> 
>> >> >>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
>> >> >>     APIs is not allowed in the QEMU codebase. Instead of these routines,
>> >> >>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
>> >> >>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
>> >> >>     APIs.
>> >> >> 
>> >> >>     Please note that g_malloc will exit on allocation failure, so there
>> >> >>     is no need to test for failure (as you would have to with malloc).
>> >> >>     Calling g_malloc with a zero size is valid and will return NULL.
>> >> >> 
>> >> >>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
>> >> >>     reasons:
>> >> >> 
>> >> >>       a. It catches multiplication overflowing size_t;
>> >> >>       b. It returns T * instead of void *, letting compiler catch more type
>> >> >>          errors.
>> >> >> 
>> >> >>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
>> >> >> 
>> >> >>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
>> >> >>     qemu_vfree, since breaking this will cause problems on Win32.
>> >> >> 
>> >> >> Now, in my personal opinion, handling OOM gracefully is worth the
>> >> >> (commonly considerable) trouble when you're coding for an Apple II or
>> >> >> similar.  Anything that pages commonly becomes unusable long before
>> >> >> allocations fail.
>> >> >
>> >> > That's not always my experience; I've seen cases where you suddenly
>> >> > allocate a load more memory and hit OOM fairly quickly on that hot
>> >> > process.  Most of the time on the desktop you're right.
>> >> >
>> >> >> Anything that overcommits will send you a (commonly
>> >> >> lethal) signal instead.  Anything that tries handling OOM gracefully,
>> >> >> and manages to dodge both these bullets somehow, will commonly get it
>> >> >> wrong and crash.
>> >> >
>> >> > If your qemu has mapped its main memory from hugetlbfs or similar pools
>> >> > then we're looking at the other memory allocations; and that's a bit of
>> >> > an interesting difference where those other allocations should be a lot
>> >> > smaller.
>> >> >
>> >> >> But others are entitled to their opinions as much as I am.  I just want
>> >> >> to know what our rules are, preferably in the form of a patch to
>> >> >> HACKING.
>> >> >
>> >> > My rule is to try not to break a happily running VM by some new
>> >> > activity; I don't worry about it during startup.
>> >> >
>> >> > So for example, I don't like it when starting a migration, allocates
>> >> > some more memory and kills the VM - the user had a happy stable VM
>> >> > up to that point.  Migration gets the blame at this point.
>> >> 
>> >> I don't doubt reliable OOM handling would be nice.  I do doubt it's
>> >> practical for an application like QEMU.
>> >
>> > Well, our use of glib certainly makes it much much harder.
>> > I just try and make sure anywhere that I'm allocating a non-trivial
>> > amount of memory (especially anything guest or user controlled) uses
>> > the _try_ variants.  That should keep a lot of the larger allocations.
>> 
>> Matters only when your g_try_new()s actually fail (which they won't, at
>> least not reliably), and your error paths actually work (which they
>> won't unless you test them, no offense).
>> 
>> > However, it scares me that we've got things that can return big chunks
>> > of JSON for example, and I don't think they're being careful about it.
>> 
>> We got countless allocations small and large (large as in Gigabytes)
>> that kill QEMU on OOM.  Some of the small allocations add up to
>> Megabytes (QObjects for JSON work, for example).
>> 
>> Yet the *practical* problem isn't lack of graceful handling when these
>> allocations fail.  Because they pretty much don't.
>> 
>> The practical problem I see is general confusion on what to do about
>> OOM.  There's no written guidance.  Vague rules of thumb on when to
>> handle OOM are floating around.  Code gets copied.  Unsurprisingly, OOM
>> handling is a haphazard affair.
>
>> In this state, whatever OOM handling we have is too unreliable to be
>> worth much, since it can only help when (1) allocations actually fail
>> (they generally don't), and (2) the allocation that fails is actually
>> handled (they generally aren't), and (3) the handling actually works (we
>> don't test OOM, so it generally doesn't).
>> 
>> For the sake of the argument, let's assume there's a practical way to
>> run QEMU so that memory allocations actually fail.  We then still need
>> to find a way to increase the probability for failed allocations to be
>> actually handled, and the probability for the error handling to actually
>> work, both to a useful level.  This will require rules on OOM handling,
>> a strategy to make them stick, a strategy to test OOM, and resources to
>> implement all that.
>
> There's probably no way to guarantee we've got all paths, however we
> can test in restricted memory environments.
> For example we could set up a test environment that runs a series of
> hotplug or migration tests (say avocado or something) in cgroups
> or nested VMs with random reduced amounts of RAM.  These will blow up
> spectacularly and we can slowly attack some of the more common paths.

There's also fault injection.  It's more targeted.  Bonus: it lets you
make only the allocations fail you deem likely to fail, i.e. keep the
unchecked ones working ;-P

> If we can find common cases then perhaps we can identify things to use
> static checkers for.
>
> We can also try setting up tests in environments closer to the way
> OpenStack and oVirt configure they're hosts;  they seem to jump through
> hoops to get a feeling of how much spare memory to allocate, but of
> course since we don't define how much we use they can't really do that.
>
> Using  mlock would probably make the allocations more likely to
> fail rather than fault later?

"More likely" as in "at all likely".  Without a solution here, all the
other work is on unreachable code.  My box lets me allocate Gigabytes of
memory I don't have.  In case you find my example involving -device qxl
is too opaque, I append a test program.  It successfully allocates one
Terabyte in 1024 Gigabyte chunks for me.  It behaves exactly the same
with g_malloc() instead of malloc().

The "normal" way to disable memory overcommit is
/proc/sys/vm/overcommit_memory, but it's system-wide, and requires root.
That's a big hammer.  A more precise tool could be more useful.  To
actually matter, we need a tool that libvirt can apply to production
VMs.

>> Will the benefits be worth the effort?  Arguing about that in the
>> near-total vacuum we have now is unlikely to be productive.  To ground
>> the debate at least somewhat, I'd like those of us in favour of OOM
>> handling to propose a first draft of OOM handling rules.
>
> Well, I'm up to give it a go; but before I do, can you define a bit more
> what you want. Firstly what do you define as 'OOM handling' and secondly
> what type of level of rules do you want.

0. OOM is failure to allocate a chunk of memory.

1. When is it okay to terminate the process on OOM?

   Write down rules that let people decide whether a given allocation
   needs to be handled gracefully.

   If your rules involve small vs. large allocations, then make sure to
   define "small".

   If your rules involve "in response to untrusted input", spell that
   out.

   If your rules involve "in response to trusted input (think QMP)",
   spell that out.

   Also: exit() or abort()?  If I remember correctly, GLib aborts.

2. How to handle OOM gracefully [skip for first draft]

   The usual: revert the functions side effects, return failure to
   caller, repeat for caller until reaching the caller that consumes the
   error.

3. Coding conventions [definitely skip for first draft]

   This is part of the "strategy to make the rules stick".

>> If we can't do even that, I'll be tempted to shoot down OOM handling in
>> patches to code I maintain.
>
> Please please don't do that;  getting it right in the monitor path and
> QMP is important for those cases where we generate big chunks of JSON
> (it would be better if we didn't generate big chunks of JSON, but that's
> a partially separate problem).

As long as allocations don't fail, this is all mental masturbation
(pardon my french).



#include <stdlib.h>
#include <stdio.h>

int
main(void)
{
    size_t GiB = 1024 * 1024 * 1024;
    int i;
    void *p;

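    /* The chunks are only allocated, never touched, so with memory
     * overcommit the allocations keep on succeeding. */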
    for (i = 0; i < 1024; i++) {
	printf("%d\n", i);
	p = malloc(GiB);
	if (!p) {
	    printf("OOM\n");
	    break;
	}
    }
    return 0;
}

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-19  5:43           ` Markus Armbruster
@ 2018-10-19 10:07             ` Dr. David Alan Gilbert
  2018-10-22 13:40             ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 13+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-19 10:07 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >> 
> >> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >> >> 
> >> >> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> >> >> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
> >> >> >> g_try_new() & friends, which can fail, and therefore require error
> >> >> >> handling.
> >> >> >> 
> >> >> >> HACKING points out the difference, but is mum on when to use what:
> >> >> >> 
> >> >> >>     3. Low level memory management
> >> >> >> 
> >> >> >>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
> >> >> >>     APIs is not allowed in the QEMU codebase. Instead of these routines,
> >> >> >>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
> >> >> >>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
> >> >> >>     APIs.
> >> >> >> 
> >> >> >>     Please note that g_malloc will exit on allocation failure, so there
> >> >> >>     is no need to test for failure (as you would have to with malloc).
> >> >> >>     Calling g_malloc with a zero size is valid and will return NULL.
> >> >> >> 
> >> >> >>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
> >> >> >>     reasons:
> >> >> >> 
> >> >> >>       a. It catches multiplication overflowing size_t;
> >> >> >>       b. It returns T * instead of void *, letting compiler catch more type
> >> >> >>          errors.
> >> >> >> 
> >> >> >>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
> >> >> >> 
> >> >> >>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
> >> >> >>     qemu_vfree, since breaking this will cause problems on Win32.
> >> >> >> 
> >> >> >> Now, in my personal opinion, handling OOM gracefully is worth the
> >> >> >> (commonly considerable) trouble when you're coding for an Apple II or
> >> >> >> similar.  Anything that pages commonly becomes unusable long before
> >> >> >> allocations fail.
> >> >> >
> >> >> > That's not always my experience; I've seen cases where you suddenly
> >> >> > allocate a load more memory and hit OOM fairly quickly on that hot
> >> >> > process.  Most of the time on the desktop you're right.
> >> >> >
> >> >> >> Anything that overcommits will send you a (commonly
> >> >> >> lethal) signal instead.  Anything that tries handling OOM gracefully,
> >> >> >> and manages to dodge both these bullets somehow, will commonly get it
> >> >> >> wrong and crash.
> >> >> >
> >> >> > If your qemu has mapped its main memory from hugetlbfs or similar pools
> >> >> > then we're looking at the other memory allocations; and that's a bit of
> >> >> > an interesting difference where those other allocations should be a lot
> >> >> > smaller.
> >> >> >
> >> >> >> But others are entitled to their opinions as much as I am.  I just want
> >> >> >> to know what our rules are, preferably in the form of a patch to
> >> >> >> HACKING.
> >> >> >
> >> >> > My rule is to try not to break a happily running VM by some new
> >> >> > activity; I don't worry about it during startup.
> >> >> >
> >> >> > So for example, I don't like it when starting a migration, allocates
> >> >> > some more memory and kills the VM - the user had a happy stable VM
> >> >> > up to that point.  Migration gets the blame at this point.
> >> >> 
> >> >> I don't doubt reliable OOM handling would be nice.  I do doubt it's
> >> >> practical for an application like QEMU.
> >> >
> >> > Well, our use of glib certainly makes it much much harder.
> >> > I just try and make sure anywhere that I'm allocating a non-trivial
> >> > amount of memory (especially anything guest or user controlled) uses
> >> > the _try_ variants.  That should keep a lot of the larger allocations.
> >> 
> >> Matters only when your g_try_new()s actually fail (which they won't, at
> >> least not reliably), and your error paths actually work (which they
> >> won't unless you test them, no offense).
> >> 
> >> > However, it scares me that we've got things that can return big chunks
> >> > of JSON for example, and I don't think they're being careful about it.
> >> 
> >> We got countless allocations small and large (large as in Gigabytes)
> >> that kill QEMU on OOM.  Some of the small allocations add up to
> >> Megabytes (QObjects for JSON work, for example).
> >> 
> >> Yet the *practical* problem isn't lack of graceful handling when these
> >> allocations fail.  Because they pretty much don't.
> >> 
> >> The practical problem I see is general confusion on what to do about
> >> OOM.  There's no written guidance.  Vague rules of thumb on when to
> >> handle OOM are floating around.  Code gets copied.  Unsurprisingly, OOM
> >> handling is a haphazard affair.
> >
> >> In this state, whatever OOM handling we have is too unreliable to be
> >> worth much, since it can only help when (1) allocations actually fail
> >> (they generally don't), and (2) the allocation that fails is actually
> >> handled (they generally aren't), and (3) the handling actually works (we
> >> don't test OOM, so it generally doesn't).
> >> 
> >> For the sake of the argument, let's assume there's a practical way to
> >> run QEMU so that memory allocations actually fail.  We then still need
> >> to find a way to increase the probability for failed allocations to be
> >> actually handled, and the probability for the error handling to actually
> >> work, both to a useful level.  This will require rules on OOM handling,
> >> a strategy to make them stick, a strategy to test OOM, and resources to
> >> implement all that.
> >
> > There's probably no way to guarantee we've got all paths, however we
> > can test in restricted memory environments.
> > For example we could set up a test environment that runs a series of
> > hotplug or migration tests (say avocado or something) in cgroups
> > or nested VMs with random reduced amounts of RAM.  These will blow up
> > spectacularly and we can slowly attack some of the more common paths.
> 
> There's also fault injection.  It's more targeted.  Bonus: it lets you
> make only the allocations fail you deem likely to fail, i.e. keep the
> unchecked ones working ;-P

Yes, although I'm more worried about the paths we've forgotten about,
so I like the idea of random testing to find those.
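
To make the targeted side concrete, here's roughly what I have in mind -
an untested sketch, where the wrapper name and the QEMU_FAIL_ALLOC_EVERY
variable are invented for illustration (nothing in QEMU is called that
today).  Only call sites converted to go through the wrapper see
injected failures, so we get to pick exactly which allocations can fail:

#include <glib.h>
#include <stdlib.h>

/* Fail every Nth allocation routed through this wrapper; 0 (or an
 * unset QEMU_FAIL_ALLOC_EVERY) means never inject a failure.  Not
 * thread-safe as written - it only illustrates the idea. */
gpointer try_malloc_with_injection(gsize size)
{
    static gsize count;
    static gsize fail_every = G_MAXSIZE;   /* sentinel: not configured */

    if (fail_every == G_MAXSIZE) {
        const char *s = getenv("QEMU_FAIL_ALLOC_EVERY");
        fail_every = s ? g_ascii_strtoull(s, NULL, 10) : 0;
    }
    if (fail_every && ++count % fail_every == 0) {
        return NULL;                       /* injected failure */
    }
    return g_try_malloc(size);
}

Run the same hotplug/migration tests with that set to a few different
values and we should shake out error paths that pure low-memory runs
would only hit by luck.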

> > If we can find common cases then perhaps we can identify things to use
> > static checkers for.
> >
> > We can also try setting up tests in environments closer to the way
> > OpenStack and oVirt configure their hosts;  they seem to jump through
> > hoops to get a feeling of how much spare memory to allocate, but of
> > course since we don't define how much we use they can't really do that.
> >
> > Using  mlock would probably make the allocations more likely to
> > fail rather than fault later?
> 
> "More likely" as in "at all likely".  Without a solution here, all the
> other work is on unreachable code.  My box lets me allocate Gigabytes of
> memory I don't have.  In case you find my example involving -device qxl
> is too opaque, I append a test program.  It successfully allocates one
> Terabyte in 1024 Gigabyte chunks for me.  It behaves exactly the same
> with g_malloc() instead of malloc().

Yes, it's tricky.
The only thing that I got to work was  ulimit -v 4000000  - with that,
your test prints OOM and the qxl test prints:
-device qxl,ram_size=2147483648: cannot set up guest memory 'qxl.vgavram': Cannot allocate memory
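
For reference, the in-process equivalent of that experiment - a rough
sketch, with the ~4000 MiB cap picked arbitrarily to mirror the ulimit
above - should make malloc() start returning NULL instead of letting
the kernel overcommit:

#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    /* Roughly what ulimit -v does, but set from inside the process:
     * cap the address space so big allocations fail rather than get
     * overcommitted. */
    struct rlimit rl = { .rlim_cur = 4000UL << 20, .rlim_max = 4000UL << 20 };
    size_t GiB = 1024 * 1024 * 1024;
    int i;
    void *p;

    if (setrlimit(RLIMIT_AS, &rl) < 0) {
        perror("setrlimit");
        return 1;
    }
    for (i = 0; i < 1024; i++) {
        p = malloc(GiB);
        if (!p) {
            printf("OOM after %d GiB\n", i);
            break;
        }
    }
    return 0;
}

Whether libvirt could safely put an RLIMIT_AS cap on a production guest
is another question, of course - the guest's own RAM mapping counts
against it.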

> The "normal" way to disable memory overcommit is
> /proc/sys/vm/overcommit_memory, but it's system-wide, and requires root.
> That's a big hammer.  A more precise tool could be more useful.  To
> actually matter, we need a tool that libvirt can apply to production
> VMs.

I had a play with some other things that I thought should work but
couldn't persuade them to, and I'd like to understand why.
I expected ulimit -l together with QEMU's -realtime mlock=on to work,
but it didn't seem to - that one really worries me.

I thought cgroup limits would work, but again they didn't seem to.
libvirt can set up both the mlock ulimit and some cgroup limits.

While overcommit_memory is a big hammer, it's not necessarily a problem
having a root-only hammer, because that still solves the problem for
dedicated hypervisor machines that are common in both OpenStack and
oVirt.

I guess we should also take a step back and ask whether this whole
approach is a Linux-ism.

> >> Will the benefits be worth the effort?  Arguing about that in the
> >> near-total vacuum we have now is unlikely to be productive.  To ground
> >> the debate at least somewhat, I'd like those of us in favour of OOM
> >> handling to propose a first draft of OOM handling rules.
> >
> > Well, I'm up to give it a go; but before I do, can you define a bit more
> > what you want. Firstly what do you define as 'OOM handling' and secondly
> > what type of level of rules do you want.
> 
> 0. OOM is failure to allocate a chunk of memory.
> 
> 1. When is it okay to terminate the process on OOM?
> 
>    Write down rules that let people decide whether a given allocation
>    needs to be handled gracefully.
> 
>    If your rules involve small vs. large allocations, then make sure to
>    define "small".
> 
>    If your rules involve "in response to untrusted input", spell that
>    out.
> 
>    If your rules involve "in response to trusted input (think QMP)",
>    spell that out.
> 
>    Also: exit() or abort()?  If I remember correctly, GLib aborts.
> 
> 2. How to handle OOM gracefully [skip for first draft]
> 
>    The usual: revert the function's side effects, return failure to
>    caller, repeat for caller until reaching the caller that consumes the
>    error.
> 
> 3. Coding conventions [definitely skip for first draft]
> 
>    This is part of the "strategy to make the rules stick".

OK, that's reasonable (although I might try to avoid using 'OOM' as the
name, because people think of the kernel OOM killer, and that gets
confused with the thing we're actually trying to avoid).

> >> If we can't do even that, I'll be tempted to shoot down OOM handling in
> >> patches to code I maintain.
> >
> > Please please don't do that;  getting it right in the monitor path and
> > QMP is important for those cases where we generate big chunks of JSON
> > (it would be better if we didn't generate big chunks of JSON, but that's
> > a partially separate problem).
> 
> As long as allocations don't fail, this is all mental masturbation
> (pardon my french).

No need to blame the French.

Dave
> 
> 
> #include <stdlib.h>
> #include <stdio.h>
> 
> int
> main(void)
> {
>     size_t GiB = 1024 * 1024 * 1024;
>     int i;
>     void *p;
> 
>     for (i = 0; i < 1024; i++) {
> 	printf("%d\n", i);
> 	p = malloc(GiB);
> 	if (!p) {
> 	    printf("OOM\n");
> 	    break;
> 	}
>     }
>     return 0;
> }
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] When it's okay to treat OOM as fatal?
  2018-10-19  5:43           ` Markus Armbruster
  2018-10-19 10:07             ` Dr. David Alan Gilbert
@ 2018-10-22 13:40             ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 13+ messages in thread
From: Dr. David Alan Gilbert @ 2018-10-22 13:40 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >> 
> >> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >> >> 
> >> >> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> >> >> We sometimes use g_new() & friends, which abort() on OOM, and sometimes
> >> >> >> g_try_new() & friends, which can fail, and therefore require error
> >> >> >> handling.
> >> >> >> 
> >> >> >> HACKING points out the difference, but is mum on when to use what:
> >> >> >> 
> >> >> >>     3. Low level memory management
> >> >> >> 
> >> >> >>     Use of the malloc/free/realloc/calloc/valloc/memalign/posix_memalign
> >> >> >>     APIs is not allowed in the QEMU codebase. Instead of these routines,
> >> >> >>     use the GLib memory allocation routines g_malloc/g_malloc0/g_new/
> >> >> >>     g_new0/g_realloc/g_free or QEMU's qemu_memalign/qemu_blockalign/qemu_vfree
> >> >> >>     APIs.
> >> >> >> 
> >> >> >>     Please note that g_malloc will exit on allocation failure, so there
> >> >> >>     is no need to test for failure (as you would have to with malloc).
> >> >> >>     Calling g_malloc with a zero size is valid and will return NULL.
> >> >> >> 
> >> >> >>     Prefer g_new(T, n) instead of g_malloc(sizeof(T) * n) for the following
> >> >> >>     reasons:
> >> >> >> 
> >> >> >>       a. It catches multiplication overflowing size_t;
> >> >> >>       b. It returns T * instead of void *, letting compiler catch more type
> >> >> >>          errors.
> >> >> >> 
> >> >> >>     Declarations like T *v = g_malloc(sizeof(*v)) are acceptable, though.
> >> >> >> 
> >> >> >>     Memory allocated by qemu_memalign or qemu_blockalign must be freed with
> >> >> >>     qemu_vfree, since breaking this will cause problems on Win32.
> >> >> >> 
> >> >> >> Now, in my personal opinion, handling OOM gracefully is worth the
> >> >> >> (commonly considerable) trouble when you're coding for an Apple II or
> >> >> >> similar.  Anything that pages commonly becomes unusable long before
> >> >> >> allocations fail.
> >> >> >
> >> >> > That's not always my experience; I've seen cases where you suddenly
> >> >> > allocate a load more memory and hit OOM fairly quickly on that hot
> >> >> > process.  Most of the time on the desktop you're right.
> >> >> >
> >> >> >> Anything that overcommits will send you a (commonly
> >> >> >> lethal) signal instead.  Anything that tries handling OOM gracefully,
> >> >> >> and manages to dodge both these bullets somehow, will commonly get it
> >> >> >> wrong and crash.
> >> >> >
> >> >> > If your qemu has mapped its main memory from hugetlbfs or similar pools
> >> >> > then we're looking at the other memory allocations; and that's a bit of
> >> >> > an interesting difference where those other allocations should be a lot
> >> >> > smaller.
> >> >> >
> >> >> >> But others are entitled to their opinions as much as I am.  I just want
> >> >> >> to know what our rules are, preferably in the form of a patch to
> >> >> >> HACKING.
> >> >> >
> >> >> > My rule is to try not to break a happily running VM by some new
> >> >> > activity; I don't worry about it during startup.
> >> >> >
> >> >> > So for example, I don't like it when starting a migration allocates
> >> >> > some more memory and kills the VM - the user had a happy stable VM
> >> >> > up to that point.  Migration gets the blame at this point.
> >> >> 
> >> >> I don't doubt reliable OOM handling would be nice.  I do doubt it's
> >> >> practical for an application like QEMU.
> >> >
> >> > Well, our use of glib certainly makes it much much harder.
> >> > I just try and make sure anywhere that I'm allocating a non-trivial
> >> > amount of memory (especially anything guest or user controlled) uses
> >> > the _try_ variants.  That should keep a lot of the larger allocations.
> >> 
> >> Matters only when your g_try_new()s actually fail (which they won't, at
> >> least not reliably), and your error paths actually work (which they
> >> won't unless you test them, no offense).
> >> 
> >> > However, it scares me that we've got things that can return big chunks
> >> > of JSON for example, and I don't think they're being careful about it.
> >> 
> >> We got countless allocations small and large (large as in Gigabytes)
> >> that kill QEMU on OOM.  Some of the small allocations add up to
> >> Megabytes (QObjects for JSON work, for example).
> >> 
> >> Yet the *practical* problem isn't lack of graceful handling when these
> >> allocations fail.  Because they pretty much don't.
> >> 
> >> The practical problem I see is general confusion on what to do about
> >> OOM.  There's no written guidance.  Vague rules of thumb on when to
> >> handle OOM are floating around.  Code gets copied.  Unsurprisingly, OOM
> >> handling is a haphazard affair.
> >
> >> In this state, whatever OOM handling we have is too unreliable to be
> >> worth much, since it can only help when (1) allocations actually fail
> >> (they generally don't), and (2) the allocation that fails is actually
> >> handled (they generally aren't), and (3) the handling actually works (we
> >> don't test OOM, so it generally doesn't).
> >> 
> >> For the sake of the argument, let's assume there's a practical way to
> >> run QEMU so that memory allocations actually fail.  We then still need
> >> to find a way to increase the probability for failed allocations to be
> >> actually handled, and the probability for the error handling to actually
> >> work, both to a useful level.  This will require rules on OOM handling,
> >> a strategy to make them stick, a strategy to test OOM, and resources to
> >> implement all that.
> >
> > There's probably no way to guarantee we've got all paths, however we
> > can test in restricted memory environments.
> > For example we could set up a test environment that runs a series of
> > hotplug or migration tests (say avocado or something) in cgroups
> > or nested VMs with random reduced amounts of RAM.  These will blow up
> > spectacularly and we can slowly attack some of the more common paths.
> 
> There's also fault injection.  It's more targeted.  Bonus: it lets you
> make only the allocations fail you deem likely to fail, i.e. keep the
> unchecked ones working ;-P
> 
> > If we can find common cases then perhaps we can identify things to use
> > static checkers for.
> >
> > We can also try setting up tests in environments closer to the way
> > OpenStack and oVirt configure their hosts;  they seem to jump through
> > hoops to get a feeling of how much spare memory to allocate, but of
> > course since we don't define how much we use they can't really do that.
> >
> > Using  mlock would probably make the allocations more likely to
> > fail rather than fault later?
> 
> "More likely" as in "at all likely".  Without a solution here, all the
> other work is on unreachable code.  My box lets me allocate Gigabytes of
> memory I don't have.  In case you find my example involving -device qxl
> is too opaque, I append a test program.  It successfully allocates one
> Terabyte in 1024 Gigabyte chunks for me.  It behaves exactly the same
> with g_malloc() instead of malloc().
> 
> The "normal" way to disable memory overcommit is
> /proc/sys/vm/overcommit_memory, but it's system-wide, and requires root.
> That's a big hammer.  A more precise tool could be more useful.  To
> actually matter, we need a tool that libvirt can apply to production
> VMs.
> 
> >> Will the benefits be worth the effort?  Arguing about that in the
> >> near-total vacuum we have now is unlikely to be productive.  To ground
> >> the debate at least somewhat, I'd like those of us in favour of OOM
> >> handling to propose a first draft of OOM handling rules.
> >
> > Well, I'm up to give it a go; but before I do, can you define a bit more
> > what you want. Firstly what do you define as 'OOM handling' and secondly
> > what type of level of rules do you want.
> 
> 0. OOM is failure to allocate a chunk of memory.
> 
> 1. When is it okay to terminate the process on OOM?
> 
>    Write down rules that let people decide whether a given allocation
>    needs to be handled gracefully.
> 
>    If your rules involve small vs. large allocations, then make sure to
>    define "small".
> 
>    If your rules involve "in response to untrusted input", spell that
>    out.
> 
>    If your rules involve "in response to trusted input (think QMP)",
>    spell that out.
> 
>    Also: exit() or abort()?  If I remember correctly, GLib aborts.
> 
> 2. How to handle OOM gracefully [skip for first draft]
> 
>    The usual: revert the function's side effects, return failure to
>    caller, repeat for caller until reaching the caller that consumes the
>    error.
> 
> 3. Coding conventions [definitely skip for first draft]
> 
>    This is part of the "strategy to make the rules stick".

OK, how about this as a strawman:

https://wiki.qemu.org/Features/AllocationFailures

Dave

> >> If we can't do even that, I'll be tempted to shoot down OOM handling in
> >> patches to code I maintain.
> >
> > Please please don't do that;  getting it right in the monitor path and
> > QMP is important for those cases where we generate big chunks of JSON
> > (it would be better if we didn't generate big chunks of JSON, but that's
> > a partially separate problem).
> 
> As long as allocations don't fail, this is all mental masturbation
> (pardon my french).
> 
> 
> 
> #include <stdlib.h>
> #include <stdio.h>
> 
> int
> main(void)
> {
>     size_t GiB = 1024 * 1024 * 1024;
>     int i;
>     void *p;
> 
>     for (i = 0; i < 1024; i++) {
> 	printf("%d\n", i);
> 	p = malloc(GiB);
> 	if (!p) {
> 	    printf("OOM\n");
> 	    break;
> 	}
>     }
>     return 0;
> }
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-10-22 13:40 UTC | newest]

Thread overview: 13+ messages
2018-10-16 13:01 [Qemu-devel] When it's okay to treat OOM as fatal? Markus Armbruster
2018-10-16 13:20 ` Daniel P. Berrangé
2018-10-18 13:06   ` Markus Armbruster
2018-10-18 14:28     ` Paolo Bonzini
2018-10-16 13:33 ` Dr. David Alan Gilbert
2018-10-18 14:46   ` Markus Armbruster
2018-10-18 14:54     ` Dr. David Alan Gilbert
2018-10-18 17:26       ` Markus Armbruster
2018-10-18 18:01         ` Dr. David Alan Gilbert
2018-10-19  5:43           ` Markus Armbruster
2018-10-19 10:07             ` Dr. David Alan Gilbert
2018-10-22 13:40             ` Dr. David Alan Gilbert
2018-10-17 10:05 ` Stefan Hajnoczi
