Logging, abnormal cases, ...

* Logging, abnormal cases, ...
@ 2020-12-17 18:21 Michael Tokarev
  2020-12-17 20:34 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 2+ messages in thread
From: Michael Tokarev @ 2020-12-17 18:21 UTC (permalink / raw)
  To: qemu-devel

Hi!

I want to bring a topic which is mostly neglected in qemu but it is
one of very important, especially for serious usage of qemu.

This is about logging of various events or even some unexpected events
coming from guest.

Let's see for example a very easy to understand situation: a virtual
disk image of a VM runs out of space _on the host_, so that qemu does
not have space to write newly allocated blocks to.  What does it do?
By default it pauses the VM and just sits there waiting. But does it
lets someone to know that it run onto something which it can't handle?
Does it write at least some information about it somewhere?

I know this very situation should not happen in any serious environment
to start with, but everything happens, and the should ring a very loud
bell somewhere so that people can react sooner than later after numerous
attempt to _understand_ what's going on.

What I'm talking about is that we should have some standard way of
logging such events, at least, or a way of notifying something about these.
Including for example start/stop, suspend/resume, migration out/migration
in etc - while these might look like a job of some management layer on
top of qemu (like libvirt), it's still better do that in qemu so that
all possible tools using it will benefit.

But it is not restricted to such global events like outlined already,
there are much more interesting cases there. For another example,
there's a lot of places where qemu checks validity of various requests
coming to virtual hardware it emulates, - so that DMA is done to a
valid address, register holds valid index, etc etc. A majority of
these checks results in a simple error return or even abort().
But it'd be very interesting to be able to log these events, maybe
with some context, and be able to turn that logging on/off at runtime.

The thing is - there ara at least 2 types of guests out there, one
is possible "hostile" - this is what all the recent CVEs are about,
it is possible to abuse qemu in one way or another by doing some
strange things within a guest. And another is entirely different -
a "friendly" guest which is here just for easy management, so that
various migrations, expansions and reconfigurations are possible.
And there, it should not do evil things, and if it does, we'd better
know about that, as it is a possible bug which might turn into a
smoking gun one day.

The other day we had a quite bad system crash in a VMWare environment
which resulted in a massive data loss, interruption of service and
lots of really hard work for everyone.  A crash which actually started
after vmware side discovered that it does not have any place where to
put a new write for a (production, "friendly") guest. After that issue
were fixed, there were several more conditions _like_ this happened, -
not with lack of free space, but with the same freezing, vmware stopped
the guest for a few seconds several times, and I've seen brief change
in the guest's icon color (turning to red) in the monitor at least twice,
during these freezes. But there's nothing in the logs - *why* it stopped
the guest, what happened?? - nothing's left.  And finally it crashed
hard after live-migrating the VM into another host with larger capacity
storage - at the moment the migration finished, the VM freeze for a few
seconds, there was tons of kernel messages about stuck processes and
kernel threads after that, and, among other things, there were a few
error messages about failed writes into the (virtual) storage. Which,
in turn, hit some bugs in oracle and finally the db become corrupt and
oracle instance refused to open it in any way, crashing with sigsegvs
at startup, so we had to resort to the harder way.

The thing was that there was absolutely nothing on the vmware side,
even the fact of migration weren't obvious in its logs. I bet there
were some inconsistences within data request handling from the guest
at least, or if system clocks were different this could have been
logged too, or anything else not-so-expected. But there's nothing,
so we don't sill know what happened. Sure it is a bug somewhere,
maybe several bugs at once (vmware tools are running in the vm and
the hypervisor surely can assume a safe state of the vm while doing
last steps of the migration, oracle should not sigsegv on bad blocks,
and so on). But if there were logs, we had at least some chance to
debug and maybe fix it, or to work around this if fixing isn't possible.
But there's nothing.

I think we can do much better in qemu.

Comments?

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 2+ messages in thread