From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: qemu block <qemu-block@nongnu.org>
Cc: qemu-devel <qemu-devel@nongnu.org>, Max Reitz <mreitz@redhat.com>,
	Kevin Wolf <kwolf@redhat.com>, "Denis V. Lunev" <den@openvz.org>,
	Eric Blake <eblake@redhat.com>
Subject: What prevents discarding a cluster during rewrite?
Date: Tue, 23 Feb 2021 00:30:53 +0300
Message-ID: <198596cd-4867-3da5-cd8f-68c1c570a52b@virtuozzo.com>

Hi all!

While thinking about how to prevent a host cluster's refcount from dropping to zero (and the cluster being discarded) during a flush of the compressed cache (which I'm working on now), I ran into a question: what prevents the same thing for normal writes?
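For context, the window in the normal write path looks roughly like this, as far as I understand it. This is a simplified paraphrase, not verbatim qcow2 code: function names and signatures are approximate, and error handling and the partial-cluster logic are elided.

qemu_co_mutex_lock(&s->lock);
/* translate the guest offset into a host cluster offset
 * (the cluster is already allocated by the initial write) */
ret = qcow2_alloc_host_offset(bs, offset, &cur_bytes,
                              &host_offset, &l2meta);
qemu_co_mutex_unlock(&s->lock);

/* <-- nothing seems to pin host_offset here: a concurrent discard
 *     may free the cluster, and a later allocation may hand the
 *     very same cluster out again */

ret = bdrv_co_pwritev(s->data_file, host_offset, cur_bytes, &qiov, 0);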

A simple interactive qemu-io session on the master branch:

./qemu-img create -f qcow2 x 1M

[root@kvm build]# ./qemu-io blkdebug::x


Do the initial write:

qemu-io> write -P 1 0 64K
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.12 sec (556.453 KiB/sec and 8.6946 ops/sec)


Rewrite it, but break before the data write (assume the write is stalled for a long time by the filesystem or hardware for some reason):

qemu-io> break write_aio A
qemu-io> aio_write -P 2 0 64K
blkdebug: Suspended request 'A'


OK, we have stopped before the data write. Everything was already allocated by the initial write, and the qcow2 mutex has been released again while the request is suspended. Now we suddenly do a discard:

qemu-io> discard 0 64K
discard 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.00 sec (146.034 MiB/sec and 2336.5414 ops/sec)


Now start another write, to a different place. But it allocates the same host cluster, because the discard dropped that cluster's refcount to zero!

qemu-io> write -P 3 128K 64K
wrote 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.08 sec (787.122 KiB/sec and 12.2988 ops/sec)

Check it:

qemu-io> read -P 3 128K 64K
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (188.238 MiB/sec and 3011.8033 ops/sec)


Resume our old write:

qemu-io> resume A
blkdebug: Resuming request 'A'
qemu-io> wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0:05:07.10 (213.400382 bytes/sec and 0.0033 ops/sec)


Of course it doesn't affect the first cluster, as that one was discarded:

qemu-io> read -P 2 0 64K
Pattern verification failed at offset 0, 65536 bytes
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.00 sec (726.246 MiB/sec and 11619.9352 ops/sec)
qemu-io> read -P 0 0 64K
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.00 sec (632.348 MiB/sec and 10117.5661 ops/sec)


But the data in the third guest cluster is now corrupted:

qemu-io> read -P 3 128K 64K
Pattern verification failed at offset 131072, 65536 bytes
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (163.922 MiB/sec and 2622.7444 ops/sec)
qemu-io> read -P 2 128K 64K
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (257.058 MiB/sec and 4112.9245 ops/sec)


So, that's a classic use-after-free... From the user's point of view, a racy write/discard on one cluster can corrupt another cluster. It may be even worse if the use-after-free corrupts metadata.
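To make the mechanism explicit, here is a toy, self-contained simulation of the same sequence of events using plain pthreads. Nothing below is QEMU code: the cluster table, the refcounts and all the names are made up purely for illustration.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Toy image: 4 host clusters holding 1 byte of "data" each. */
static char data[4];
static int refcount[4];
static int l2_table[2] = { -1, -1 };   /* guest cluster -> host cluster */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* First-fit allocator: reuses the lowest free host cluster. */
static int alloc_cluster(void)
{
    for (int i = 0; i < 4; i++) {
        if (refcount[i] == 0) {
            refcount[i] = 1;
            return i;
        }
    }
    return -1;
}

/* The suspended rewrite: the host offset is resolved under the lock,
 * but the data write itself happens much later, with the lock dropped. */
static void *slow_rewrite(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    int host = l2_table[0];
    pthread_mutex_unlock(&lock);

    sleep(1);              /* blkdebug suspend of 'write_aio' */
    data[host] = '2';      /* use-after-free: host may be reused by now */
    return NULL;
}

int main(void)
{
    l2_table[0] = alloc_cluster();      /* initial "write -P 1 0 64K" */
    data[l2_table[0]] = '1';

    pthread_t t;                        /* "aio_write -P 2 0 64K" ... */
    pthread_create(&t, NULL, slow_rewrite, NULL);
    usleep(100 * 1000);                 /* ... suspended after the lookup */

    pthread_mutex_lock(&lock);
    refcount[l2_table[0]] = 0;          /* "discard 0 64K" */
    l2_table[0] = -1;
    l2_table[1] = alloc_cluster();      /* "write -P 3 128K 64K" reuses */
    data[l2_table[1]] = '3';            /* the freed host cluster       */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);              /* "resume A" */
    printf("guest cluster at 128K contains '%c', expected '3'\n",
           data[l2_table[1]]);
    return 0;
}

Built with 'cc -pthread', it prints '2': the stalled rewrite of the first guest cluster lands in the host cluster that was meanwhile handed to the write at 128K, which is exactly what the qemu-io session above shows.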

Note that the initial write is significant: when we allocate a new cluster, the L2 entry is written only after the data write (as I understand it), so the race doesn't happen there. Compressed writes, however, allocate everything before the data write (see the sketch after the transcript below). Let's check:

[root@kvm build]# ./qemu-img create -f qcow2 x 1M; ./qemu-io blkdebug::x
Formatting 'x', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1048576 lazy_refcounts=off refcount_bits=16
qemu-io> break write_compressed A
qemu-io> aio_write -c -P 1 0 64K
qemu-io> compressed: 327680 79
blkdebug: Suspended request 'A'

qemu-io> discard 0 64K
discarded: 327680
discard 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.01 sec (7.102 MiB/sec and 113.6297 ops/sec)
qemu-io> write -P 3 128K 64K
normal cluster alloc: 327680
wrote 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.06 sec (1.005 MiB/sec and 16.0774 ops/sec)
qemu-io> resume A
blkdebug: Resuming request 'A'
qemu-io> wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0:00:15.90 (4.026 KiB/sec and 0.0629 ops/sec)

qemu-io> read -P 3 128K 64K
Pattern verification failed at offset 131072, 65536 bytes
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (237.791 MiB/sec and 3804.6539 ops/sec)


(Strange: for several runs it didn't fail for me, but now it fails run after run... Anyway, it's all not good.)
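Note how the debug output above shows the same host offset (327680) being allocated for the compressed write, discarded, and then handed out again to the normal write. The ordering difference between the two paths, spelled out as pseudocode (again a simplification from my reading of the code, not a verbatim call sequence):

normal allocating write:
    allocate a fresh host cluster        /* refcount becomes 1, under s->lock */
    write the guest data into it         /* lock dropped */
    update the L2 entry to point to it   /* under s->lock again */
    /* a discard in the middle still sees the old mapping, and the new
       cluster's refcount is already 1, so it cannot be reused */

compressed write:
    allocate cluster + update L2 entry   /* under s->lock */
    write the compressed data            /* lock dropped */
    /* a discard in the middle frees the cluster through the new L2
       entry, and the next allocation can reuse it before the data
       lands */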


-- 
Best regards,
Vladimir

