qemu-devel.nongnu.org archive mirror
* Logging, abnormal cases, ...
@ 2020-12-17 18:21 Michael Tokarev
  2020-12-17 20:34 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 2+ messages in thread
From: Michael Tokarev @ 2020-12-17 18:21 UTC (permalink / raw)
  To: qemu-devel

Hi!

I want to bring up a topic which is mostly neglected in qemu but is
a very important one, especially for serious usage of qemu.

This is about logging of various events, including unexpected events
coming from the guest.

Let's take, for example, a very easy-to-understand situation: the virtual
disk image of a VM runs out of space _on the host_, so that qemu does
not have space to write newly allocated blocks to.  What does it do?
By default it pauses the VM and just sits there waiting. But does it
let anyone know that it ran into something it can't handle?
Does it write at least some information about it somewhere?
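(As far as I can tell, qemu does emit QMP events in this case - BLOCK_IO_ERROR
and STOP are real QMP event names - but only a management client that happens
to be listening sees them. A minimal sketch of turning such events into log
lines; the exact field set of BLOCK_IO_ERROR shown here is an assumption and
should be checked against the QMP reference for the qemu version in use:)

```python
def describe_abnormal_event(event):
    """Return a human-readable log line for 'VM got paused' events, or None."""
    name = event.get("event")
    data = event.get("data", {})
    if name == "BLOCK_IO_ERROR":
        # "nospace" flagging ENOSPC and "action" ("stop"/"report"/"ignore")
        # are assumed field names for this sketch.
        reason = "no space on host" if data.get("nospace") else "I/O error"
        return "block error on %s: %s (action=%s)" % (
            data.get("device", "?"), reason, data.get("action", "?"))
    if name == "STOP":
        return "guest execution stopped"
    return None  # not an event we care about here
```

Something like this, fed from the QMP socket and writing to a persistent log,
would already answer the "why did it pause?" question after the fact.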

I know this very situation should not happen in any serious environment
to start with, but everything happens, and this should ring a very loud
bell somewhere so that people can react sooner rather than later, instead
of spending numerous attempts trying to _understand_ what's going on.

What I'm talking about is that we should have some standard way of
logging such events at least, or a way of notifying someone about them.
This includes, for example, start/stop, suspend/resume, migration
out/migration in, etc. While these might look like the job of some
management layer on top of qemu (like libvirt), it's still better to do
that in qemu so that all tools using it will benefit.
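(To make the idea concrete, here is a sketch of forwarding lifecycle
notifications to an ordinary host log. The QMP event names used below do
exist in qemu, but treat the exact set as an assumption, and the helper
names are made up for this example:)

```python
import json

# QMP lifecycle events worth recording persistently (assumed complete
# enough for illustration; qemu defines more).
LIFECYCLE_EVENTS = {"STOP", "RESUME", "SHUTDOWN", "RESET", "MIGRATION"}

def lifecycle_log_line(qmp_line):
    """Turn one line of QMP event output into a log message, or None to skip."""
    msg = json.loads(qmp_line)
    if msg.get("event") in LIFECYCLE_EVENTS:
        return "lifecycle: %s %s" % (msg["event"],
                                     json.dumps(msg.get("data", {}), sort_keys=True))
    return None  # command replies, other events: not our concern here
```

The point is not this particular helper but that the mapping would live in
qemu (or ship with it), so every deployment gets it, not only those behind
a full management stack.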

But it is not restricted to global events like the ones outlined already;
there are much more interesting cases. For another example, there are
a lot of places where qemu checks the validity of various requests
coming to the virtual hardware it emulates - that DMA is done to a
valid address, that a register holds a valid index, etc. The majority of
these checks result in a simple error return or even an abort().
But it would be very useful to be able to log these events, maybe
with some context, and to be able to turn that logging on/off at runtime.

The thing is, there are at least two types of guests out there. One is
the possibly "hostile" guest - this is what all the recent CVEs are about:
it is possible to abuse qemu in one way or another by doing some
strange things within a guest. The other is entirely different -
a "friendly" guest which is there just for easy management, so that
various migrations, expansions and reconfigurations are possible.
Such a guest should not do evil things, and if it does, we'd better
know about that, as it is a possible bug which might turn into a
smoking gun one day.

The other day we had quite a bad system crash in a VMware environment
which resulted in massive data loss, interruption of service and
lots of really hard work for everyone.  A crash which actually started
after the vmware side discovered that it did not have any place to
put a new write for a (production, "friendly") guest. After that issue
was fixed, several more conditions _like_ this happened - not a lack
of free space, but the same freezing: vmware stopped the guest for a
few seconds several times, and I saw a brief change in the guest's
icon color (turning red) in the monitor at least twice during these
freezes. But there's nothing in the logs - *why* did it stop the
guest, what happened? - nothing's left.  And finally it crashed
hard after live-migrating the VM to another host with larger-capacity
storage - the moment the migration finished, the VM froze for a few
seconds, there were tons of kernel messages about stuck processes and
kernel threads after that, and, among other things, there were a few
error messages about failed writes to the (virtual) storage. These,
in turn, hit some bugs in oracle; in the end the db became corrupt and
the oracle instance refused to open it in any way, crashing with sigsegvs
at startup, so we had to resort to the harder way.

The thing was that there was absolutely nothing on the vmware side;
even the fact of the migration wasn't obvious in its logs. I bet there
were some inconsistencies in data request handling from the guest
at least, or if the system clocks were different this could have been
logged too, or anything else not-so-expected. But there's nothing,
so we still don't know what happened. Surely there is a bug somewhere,
maybe several bugs at once (vmware tools are running in the vm and
the hypervisor surely can assume a safe state of the vm while doing
the last steps of the migration, oracle should not sigsegv on bad blocks,
and so on). But if there were logs, we'd have had at least some chance
to debug and maybe fix it, or to work around it if fixing isn't possible.
But there's nothing.

I think we can do much better in qemu.

Comments?

Thanks,

/mjt



* Re: Logging, abnormal cases, ...
  2020-12-17 18:21 Logging, abnormal cases, Michael Tokarev
@ 2020-12-17 20:34 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 2+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-17 20:34 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: qemu-devel

* Michael Tokarev (mjt@tls.msk.ru) wrote:
> Hi!
> 
> I want to bring up a topic which is mostly neglected in qemu but is
> a very important one, especially for serious usage of qemu.
> 
> This is about logging of various events, including unexpected events
> coming from the guest.
> 
> Let's take, for example, a very easy-to-understand situation: the virtual
> disk image of a VM runs out of space _on the host_, so that qemu does
> not have space to write newly allocated blocks to.  What does it do?
> By default it pauses the VM and just sits there waiting. But does it
> let anyone know that it ran into something it can't handle?
> Does it write at least some information about it somewhere?
> 
> I know this very situation should not happen in any serious environment
> to start with, but everything happens, and this should ring a very loud
> bell somewhere so that people can react sooner rather than later, instead
> of spending numerous attempts trying to _understand_ what's going on.
> 
> What I'm talking about is that we should have some standard way of
> logging such events at least, or a way of notifying someone about them.
> This includes, for example, start/stop, suspend/resume, migration
> out/migration in, etc. While these might look like the job of some
> management layer on top of qemu (like libvirt), it's still better to do
> that in qemu so that all tools using it will benefit.
> 
> But it is not restricted to global events like the ones outlined already;
> there are much more interesting cases. For another example, there are
> a lot of places where qemu checks the validity of various requests
> coming to the virtual hardware it emulates - that DMA is done to a
> valid address, that a register holds a valid index, etc. The majority of
> these checks result in a simple error return or even an abort().
> But it would be very useful to be able to log these events, maybe
> with some context, and to be able to turn that logging on/off at runtime.
> 
> The thing is, there are at least two types of guests out there. One is
> the possibly "hostile" guest - this is what all the recent CVEs are about:
> it is possible to abuse qemu in one way or another by doing some
> strange things within a guest. The other is entirely different -
> a "friendly" guest which is there just for easy management, so that
> various migrations, expansions and reconfigurations are possible.
> Such a guest should not do evil things, and if it does, we'd better
> know about that, as it is a possible bug which might turn into a
> smoking gun one day.
> 
> The other day we had quite a bad system crash in a VMware environment
> which resulted in massive data loss, interruption of service and
> lots of really hard work for everyone.  A crash which actually started
> after the vmware side discovered that it did not have any place to
> put a new write for a (production, "friendly") guest. After that issue
> was fixed, several more conditions _like_ this happened - not a lack
> of free space, but the same freezing: vmware stopped the guest for a
> few seconds several times, and I saw a brief change in the guest's
> icon color (turning red) in the monitor at least twice during these
> freezes. But there's nothing in the logs - *why* did it stop the
> guest, what happened? - nothing's left.  And finally it crashed
> hard after live-migrating the VM to another host with larger-capacity
> storage - the moment the migration finished, the VM froze for a few
> seconds, there were tons of kernel messages about stuck processes and
> kernel threads after that, and, among other things, there were a few
> error messages about failed writes to the (virtual) storage. These,
> in turn, hit some bugs in oracle; in the end the db became corrupt and
> the oracle instance refused to open it in any way, crashing with sigsegvs
> at startup, so we had to resort to the harder way.

I'm glad it's not just me who has hard migration problems!

> The thing was that there was absolutely nothing on the vmware side;
> even the fact of the migration wasn't obvious in its logs. I bet there
> were some inconsistencies in data request handling from the guest
> at least, or if the system clocks were different this could have been
> logged too, or anything else not-so-expected. But there's nothing,
> so we still don't know what happened. Surely there is a bug somewhere,
> maybe several bugs at once (vmware tools are running in the vm and
> the hypervisor surely can assume a safe state of the vm while doing
> the last steps of the migration, oracle should not sigsegv on bad blocks,
> and so on). But if there were logs, we'd have had at least some chance
> to debug and maybe fix it, or to work around it if fixing isn't possible.
> But there's nothing.
> 
> I think we can do much better in qemu.
> 
> Comments?

We've got lots of things; the problem is deciding what to use when.
We can:
  a) Send QMP events - that's pretty good for a 'hey, I've stopped,
disk full'.
  b) Use the logging mechanism - we've got one, but I think it's
rarely used.
  c) Use trace events - they're good for fine-grained debugging, but
not what you want all the time.
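(For what it's worth, (c) can at least be toggled at runtime over QMP via
the trace-event-set-state command; a trivial helper building that request
is below. The command exists, but the trace-event name used in the usage
note is only an example, and whether "name" accepts glob patterns should
be checked against the QMP reference:)

```python
import json

def trace_toggle_cmd(name, enable):
    """Build a QMP request to enable/disable the named trace event(s)."""
    return json.dumps({
        "execute": "trace-event-set-state",
        "arguments": {"name": name, "enable": enable},
    }, sort_keys=True)
```

E.g. sending trace_toggle_cmd("virtio_blk_req_complete", True) over the QMP
socket would turn that event on in a running qemu.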

For migration, if something fails, I always try to print to stderr
so I get it in the /var/log/libvirt/qemu/... log; the problem is
that this only happens if you know you've screwed up - if you corrupt
some state during the migration, a hung guest is often the outcome.

Dave

> Thanks,
> 
> /mjt
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


