[Qemu-devel] [Bug 1810603] [NEW] QEMU QCow Images crow dramatically

* [Qemu-devel] [Bug 1810603] [NEW] QEMU QCow Images crow dramatically
@ 2019-01-05 15:32 Lenny Helpline
  2019-01-05 16:19 ` Eric Blake
                   ` (9 more replies)
  0 siblings, 10 replies; 13+ messages in thread
From: Lenny Helpline @ 2019-01-05 15:32 UTC (permalink / raw)
  To: qemu-devel

Public bug reported:

I've recently migrated our VM infrastructure (~200 guest on 15 hosts)
from vbox to Qemu (using KVM / libvirt). We have a master image (QEMU
QCow v3) from which we spawn multiple instances (linked clones). All
guests are being revert once per hour for security reasons.

About 2 weeks after we successfully migrated to Qemu, we noticed that
almost all disks went full across all 15 hosts. Our investigation showed
that the initial qcow disk images blow up from a few gigabytes to 100GB
and more. This should not happen, as we revert all VMs back to the
initial snapshot once per hour and hence all changes that have been made
to disks must be reverted too.

We did an addition test with 24 hour time frame with which we could
reproduce this bug as documented below.

Initial disk image size (created on Jan 04):
-rw-r--r-- 1 root root 7.1G Jan  4 15:59 W10-TS01-0.img
-rw-r--r-- 1 root root 7.3G Jan  4 15:59 W10-TS02-0.img
-rw-r--r-- 1 root root 7.4G Jan  4 15:59 W10-TS03-0.img
-rw-r--r-- 1 root root 8.3G Jan  4 16:02 W10-CLIENT01-0.img
-rw-r--r-- 1 root root 8.6G Jan  4 16:05 W10-CLIENT02-0.img
-rw-r--r-- 1 root root 8.0G Jan  4 16:05 W10-CLIENT03-0.img
-rw-r--r-- 1 root root 8.3G Jan  4 16:08 W10-CLIENT04-0.img
-rw-r--r-- 1 root root 8.1G Jan  4 16:12 W10-CLIENT05-0.img
-rw-r--r-- 1 root root 8.0G Jan  4 16:12 W10-CLIENT06-0.img
-rw-r--r-- 1 root root 8.1G Jan  4 16:16 W10-CLIENT07-0.img
-rw-r--r-- 1 root root 7.6G Jan  4 16:16 W10-CLIENT08-0.img
-rw-r--r-- 1 root root 7.6G Jan  4 16:19 W10-CLIENT09-0.img
-rw-r--r-- 1 root root 7.5G Jan  4 16:21 W10-ROUTER-0.img
-rw-r--r-- 1 root root  18G Jan  4 16:25 W10-MASTER-IMG.qcow2

Disk image size after 24 hours (printed on Jan 05):
-rw-r--r-- 1 root root  13G Jan  5 15:07 W10-TS01-0.img
-rw-r--r-- 1 root root 8.9G Jan  5 14:20 W10-TS02-0.img
-rw-r--r-- 1 root root 9.0G Jan  5 15:07 W10-TS03-0.img
-rw-r--r-- 1 root root  10G Jan  5 15:08 W10-CLIENT01-0.img
-rw-r--r-- 1 root root  11G Jan  5 15:08 W10-CLIENT02-0.img
-rw-r--r-- 1 root root  11G Jan  5 15:08 W10-CLIENT03-0.img
-rw-r--r-- 1 root root  11G Jan  5 15:08 W10-CLIENT04-0.img
-rw-r--r-- 1 root root  19G Jan  5 15:07 W10-CLIENT05-0.img
-rw-r--r-- 1 root root  14G Jan  5 15:08 W10-CLIENT06-0.img
-rw-r--r-- 1 root root 9.7G Jan  5 15:07 W10-CLIENT07-0.img
-rw-r--r-- 1 root root  35G Jan  5 15:08 W10-CLIENT08-0.img
-rw-r--r-- 1 root root 9.2G Jan  5 15:07 W10-CLIENT09-0.img
-rw-r--r-- 1 root root  41G Jan  5 15:08 W10-ROUTER-0.img
-rw-r--r-- 1 root root  18G Jan  4 16:25 W10-MASTER-IMG.qcow2

You can reproduce this bug as follow:
1) create an initial disk image
2) create a linked clone
3) create a snapshot of the linked clone
4) revert the snapshot every X minutes / hours

Due the described behavior / bug, our VM farm is completely down at the
moment (as we run out of disk space on all host systems). A quick fix
for this bug would be much appreciated.

Host OS: Ubuntu 18.04.01 LTS
Kernel: 4.15.0-43-generic
Qemu: 3.1.0
libvirt: 4.10.0
Guest OS: Windows 10 64bit

** Affects: qemu
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1810603

Title:
  QEMU QCow Images crow dramatically

Status in QEMU:
  New

Bug description:
  I've recently migrated our VM infrastructure (~200 guest on 15 hosts)
  from vbox to Qemu (using KVM / libvirt). We have a master image (QEMU
  QCow v3) from which we spawn multiple instances (linked clones). All
  guests are being revert once per hour for security reasons.

  About 2 weeks after we successfully migrated to Qemu, we noticed that
  almost all disks went full across all 15 hosts. Our investigation
  showed that the initial qcow disk images blow up from a few gigabytes
  to 100GB and more. This should not happen, as we revert all VMs back
  to the initial snapshot once per hour and hence all changes that have
  been made to disks must be reverted too.

  We did an addition test with 24 hour time frame with which we could
  reproduce this bug as documented below.

  Initial disk image size (created on Jan 04):
  -rw-r--r-- 1 root root 7.1G Jan  4 15:59 W10-TS01-0.img
  -rw-r--r-- 1 root root 7.3G Jan  4 15:59 W10-TS02-0.img
  -rw-r--r-- 1 root root 7.4G Jan  4 15:59 W10-TS03-0.img
  -rw-r--r-- 1 root root 8.3G Jan  4 16:02 W10-CLIENT01-0.img
  -rw-r--r-- 1 root root 8.6G Jan  4 16:05 W10-CLIENT02-0.img
  -rw-r--r-- 1 root root 8.0G Jan  4 16:05 W10-CLIENT03-0.img
  -rw-r--r-- 1 root root 8.3G Jan  4 16:08 W10-CLIENT04-0.img
  -rw-r--r-- 1 root root 8.1G Jan  4 16:12 W10-CLIENT05-0.img
  -rw-r--r-- 1 root root 8.0G Jan  4 16:12 W10-CLIENT06-0.img
  -rw-r--r-- 1 root root 8.1G Jan  4 16:16 W10-CLIENT07-0.img
  -rw-r--r-- 1 root root 7.6G Jan  4 16:16 W10-CLIENT08-0.img
  -rw-r--r-- 1 root root 7.6G Jan  4 16:19 W10-CLIENT09-0.img
  -rw-r--r-- 1 root root 7.5G Jan  4 16:21 W10-ROUTER-0.img
  -rw-r--r-- 1 root root  18G Jan  4 16:25 W10-MASTER-IMG.qcow2

  Disk image size after 24 hours (printed on Jan 05):
  -rw-r--r-- 1 root root  13G Jan  5 15:07 W10-TS01-0.img
  -rw-r--r-- 1 root root 8.9G Jan  5 14:20 W10-TS02-0.img
  -rw-r--r-- 1 root root 9.0G Jan  5 15:07 W10-TS03-0.img
  -rw-r--r-- 1 root root  10G Jan  5 15:08 W10-CLIENT01-0.img
  -rw-r--r-- 1 root root  11G Jan  5 15:08 W10-CLIENT02-0.img
  -rw-r--r-- 1 root root  11G Jan  5 15:08 W10-CLIENT03-0.img
  -rw-r--r-- 1 root root  11G Jan  5 15:08 W10-CLIENT04-0.img
  -rw-r--r-- 1 root root  19G Jan  5 15:07 W10-CLIENT05-0.img
  -rw-r--r-- 1 root root  14G Jan  5 15:08 W10-CLIENT06-0.img
  -rw-r--r-- 1 root root 9.7G Jan  5 15:07 W10-CLIENT07-0.img
  -rw-r--r-- 1 root root  35G Jan  5 15:08 W10-CLIENT08-0.img
  -rw-r--r-- 1 root root 9.2G Jan  5 15:07 W10-CLIENT09-0.img
  -rw-r--r-- 1 root root  41G Jan  5 15:08 W10-ROUTER-0.img
  -rw-r--r-- 1 root root  18G Jan  4 16:25 W10-MASTER-IMG.qcow2

  You can reproduce this bug as follow:
  1) create an initial disk image
  2) create a linked clone
  3) create a snapshot of the linked clone
  4) revert the snapshot every X minutes / hours

  Due the described behavior / bug, our VM farm is completely down at
  the moment (as we run out of disk space on all host systems). A quick
  fix for this bug would be much appreciated.

  Host OS: Ubuntu 18.04.01 LTS
  Kernel: 4.15.0-43-generic
  Qemu: 3.1.0
  libvirt: 4.10.0
  Guest OS: Windows 10 64bit

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1810603/+subscriptions

^ permalink raw reply	[flat|nested] 13+ messages in thread