* What prevents discarding a cluster during rewrite?
From: Vladimir Sementsov-Ogievskiy @ 2021-02-22 21:30 UTC
  To: qemu block; +Cc: qemu-devel, Max Reitz, Kevin Wolf, Denis V. Lunev, Eric Blake

Hi all!

While thinking about how to prevent a host cluster from being dereferenced to zero (and discarded) during a flush of the compressed cache (which I'm working on now), I have a question: what prevents this for normal writes?

A simple interactive qemu-io session on master branch:

./qemu-img create -f qcow2 x 1M

[root@kvm build]# ./qemu-io blkdebug::x


do initial write:

qemu-io> write -P 1 0 64K
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.12 sec (556.453 KiB/sec and 8.6946 ops/sec)


Rewrite, and break before the write (assume the write takes a long time in the fs or hardware for some reason):

qemu-io> break write_aio A
qemu-io> aio_write -P 2 0 64K
blkdebug: Suspended request 'A'


OK, we are stopped just before the write. Everything was already allocated by the initial write, and the mutex has been released. And suddenly we do a discard:

qemu-io> discard 0 64K
discard 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.00 sec (146.034 MiB/sec and 2336.5414 ops/sec)


Now, start another write, to another place.. But it will allocate the same host cluster!

qemu-io> write -P 3 128K 64K
wrote 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.08 sec (787.122 KiB/sec and 12.2988 ops/sec)

Check it:

qemu-io> read -P 3 128K 64K
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (188.238 MiB/sec and 3011.8033 ops/sec)


resume our old write:

qemu-io> resume A
blkdebug: Resuming request 'A'
qemu-io> wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0:05:07.10 (213.400382 bytes/sec and 0.0033 ops/sec)


Of course it doesn't affect the first cluster, as it was discarded:

qemu-io> read -P 2 0 64K
Pattern verification failed at offset 0, 65536 bytes
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.00 sec (726.246 MiB/sec and 11619.9352 ops/sec)
qemu-io> read -P 0 0 64K
read 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.00 sec (632.348 MiB/sec and 10117.5661 ops/sec)


But the data in the 3rd cluster is now corrupted:

qemu-io> read -P 3 128K 64K
Pattern verification failed at offset 131072, 65536 bytes
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (163.922 MiB/sec and 2622.7444 ops/sec)
qemu-io> read -P 2 128K 64K
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (257.058 MiB/sec and 4112.9245 ops/sec)


So, that's a classic use-after-free... For the user it looks like a racy write/discard to one cluster may corrupt another cluster... It may be even worse if the use-after-free corrupts metadata.
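
If I read the code correctly, the rewrite path looks roughly like this (a heavily simplified sketch, not the real call graph; get_host_offset_locked() is a made-up name for the lookup qcow2 does under s->lock):

    qemu_co_mutex_lock(&s->lock);
    /* look up the host cluster for the guest offset (hypothetical helper) */
    host_offset = get_host_offset_locked(bs, guest_offset);
    qemu_co_mutex_unlock(&s->lock);

    /*
     * Nothing pins host_offset here: a concurrent discard can drop the
     * refcount of this host cluster to zero, and a later allocation can
     * hand the same cluster to another request while this write is
     * still in flight.
     */
    ret = bdrv_co_pwritev(s->data_file, host_offset, bytes, qiov, 0);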

Note that the initial write is significant: when we allocate a cluster, we write the L2 entry after the data write (as I understand it), so the race doesn't happen there. But compressed writes allocate everything before the write. Let's check:

[root@kvm build]# ./qemu-img create -f qcow2 x 1M; ./qemu-io blkdebug::x
Formatting 'x', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1048576 lazy_refcounts=off refcount_bits=16
qemu-io> break write_compressed A
qemu-io> aio_write -c -P 1 0 64K
qemu-io> compressed: 327680 79
blkdebug: Suspended request 'A'

qemu-io> discard 0 64K
discarded: 327680
discard 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.01 sec (7.102 MiB/sec and 113.6297 ops/sec)
qemu-io> write -P 3 128K 64K
normal cluster alloc: 327680
wrote 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.06 sec (1.005 MiB/sec and 16.0774 ops/sec)
qemu-io> resume A
blkdebug: Resuming request 'A'
qemu-io> wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0:00:15.90 (4.026 KiB/sec and 0.0629 ops/sec)

qemu-io> read -P 3 128K 64K
Pattern verification failed at offset 131072, 65536 bytes
read 65536/65536 bytes at offset 131072
64 KiB, 1 ops; 00.00 sec (237.791 MiB/sec and 3804.6539 ops/sec)


(Strange: it didn't fail on several attempts for me, but now it fails several times in a row... Anyway, it's all not good.)


-- 
Best regards,
Vladimir



* Re: What prevents discarding a cluster during rewrite?
From: Vladimir Sementsov-Ogievskiy @ 2021-02-22 21:42 UTC
  To: qemu block; +Cc: qemu-devel, Max Reitz, Kevin Wolf, Denis V. Lunev, Eric Blake

23.02.2021 00:30, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> While thinking about how to prevent a host cluster from being dereferenced to zero (and discarded) during a flush of the compressed cache (which I'm working on now), I have a question: what prevents this for normal writes?
> 

I have no idea why it hasn't failed for years. Maybe I'm missing something?

I have an idea for a fix: increase the refcount of the host cluster before the write to data_file (it's important to increase the refcount in the same s->lock critical section where we get host_offset) and dereference it after the write. It should help. Any thoughts?
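
Roughly like this (an untested sketch of the idea, not a patch; error handling omitted, get_host_offset_locked() is a placeholder, and I'm assuming qcow2_update_cluster_refcount() keeps its current signature):

    qemu_co_mutex_lock(&s->lock);
    host_offset = get_host_offset_locked(bs, guest_offset);
    /* pin the host cluster so a discard can't free it under us */
    qcow2_update_cluster_refcount(bs, host_offset >> s->cluster_bits,
                                  1, false, QCOW2_DISCARD_NEVER);
    qemu_co_mutex_unlock(&s->lock);

    ret = bdrv_co_pwritev(s->data_file, host_offset, bytes, qiov, 0);

    qemu_co_mutex_lock(&s->lock);
    /* drop the extra reference again after the data write */
    qcow2_update_cluster_refcount(bs, host_offset >> s->cluster_bits,
                                  1, true, QCOW2_DISCARD_OTHER);
    qemu_co_mutex_unlock(&s->lock);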

-- 
Best regards,
Vladimir



* Re: What prevents discarding a cluster during rewrite?
From: Kevin Wolf @ 2021-02-23 10:35 UTC
  To: Vladimir Sementsov-Ogievskiy
  Cc: Denis V. Lunev, qemu-devel, qemu block, Max Reitz

On 22.02.2021 at 22:42, Vladimir Sementsov-Ogievskiy wrote:
> 23.02.2021 00:30, Vladimir Sementsov-Ogievskiy wrote:
> > While thinking about how to prevent a host cluster from being
> > dereferenced to zero (and discarded) during a flush of the compressed
> > cache (which I'm working on now), I have a question: what prevents
> > this for normal writes?
> 
> I have no idea why it hasn't failed for years. Maybe I'm missing
> something?

Ouch. I suppose the reason why we never ran into it is that afaik Linux
drains the queue before sending discard requests.

Of course, we still need to protect against this in QEMU. We can't just
unref a cluster that is still in use.

> I have idea of fixing: increase the refcount of host cluster before
> write to data_file (it's important to increase refacount in same
> s->lock critical section where we get host_offset) and dereference it
> after write.. It should help. Any thoughts?

It would cause metadata updates for rewrites. I don't know whether the
intermediate value would ever be written to disk, but at least we'd
rewrite the same refcounts as before. I don't think we want that.

Discard requests might be rare enough that not considering the cluster
at all could work. We could then take a reader CoRwlock during most
operations, and a writer lock for discard.

Actually, maybe s->lock should be this CoRwlock, and instead of dropping
it temporarily like we do now we would just upgrade and downgrade it as
needed. Maybe this would allow finer grained locking in other places,
too.
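
Something like this, I guess (untested sketch; assuming s->lock became a CoRwlock, helper names are placeholders):

    /* Guest write (simplified): keep the reader lock across the data
     * write, so a discard, which needs the writer lock, has to wait. */
    qemu_co_rwlock_rdlock(&s->lock);
    host_offset = get_host_offset_locked(bs, guest_offset);  /* placeholder */
    ret = bdrv_co_pwritev(s->data_file, host_offset, bytes, qiov, 0);
    qemu_co_rwlock_unlock(&s->lock);

    /* Places that currently retake s->lock for metadata updates would
     * upgrade/downgrade instead of unlocking and re-locking: */
    qemu_co_rwlock_upgrade(&s->lock);     /* reader -> writer */
    /* ... update L2/refcount metadata ... */
    qemu_co_rwlock_downgrade(&s->lock);   /* writer -> reader */

    /* Discard takes the writer lock and therefore waits for all
     * in-flight writes: */
    qemu_co_rwlock_wrlock(&s->lock);
    ret = qcow2_cluster_discard(bs, offset, bytes, type, full_discard);
    qemu_co_rwlock_unlock(&s->lock);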

Kevin




* Re: What prevents discarding a cluster during rewrite?
From: Vladimir Sementsov-Ogievskiy @ 2021-02-23 15:23 UTC
  To: Kevin Wolf; +Cc: qemu block, qemu-devel, Max Reitz, Denis V. Lunev, Eric Blake



On 23.02.2021 13:35, Kevin Wolf wrote:
> On 22.02.2021 at 22:42, Vladimir Sementsov-Ogievskiy wrote:
>> 23.02.2021 00:30, Vladimir Sementsov-Ogievskiy wrote:
>>> While thinking about how to prevent a host cluster from being
>>> dereferenced to zero (and discarded) during a flush of the compressed
>>> cache (which I'm working on now), I have a question: what prevents
>>> this for normal writes?
>>
>> I have no idea why it hasn't failed for years. Maybe I'm missing
>> something?
> 
> Ouch. I suppose the reason why we never ran into it is that afaik Linux
> drains the queue before sending discard requests.
> 
> Of course, we still need to protect against this in QEMU. We can't just
> unref a cluster that is still in use.
> 
>> I have an idea for a fix: increase the refcount of the host cluster
>> before the write to data_file (it's important to increase the refcount
>> in the same s->lock critical section where we get host_offset) and
>> dereference it after the write. It should help. Any thoughts?
> 
> It would cause metadata updates for rewrites. I don't know whether the
> intermediate value would ever be written to disk, but at least we'd
> rewrite the same refcounts as before. I don't think we want that.
> 

Hmm, if that can provoke an extra refcount cache flush, that's bad.

Maybe we need something like additional "dynamic" refcounts, so that
total_refcount = normal_refcount + dynamic_refcount. For in-flight
clusters dynamic_refcount is > 0. We could store the dynamic refcounts
in a GHashTable(cluster -> refcount).
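
Very rough shape of what I mean (untested sketch against plain GLib; names are invented, and in reality the table would live in BDRVQcow2State and be protected by s->lock):

    #include <glib.h>

    /* cluster index -> number of in-flight data writes ("dynamic" refcount) */
    static GHashTable *dyn_refs;

    static void dyn_refs_init(void)
    {
        /* 64-bit keys are stored as allocated gint64, freed with the entry */
        dyn_refs = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                         g_free, NULL);
    }

    static void dyn_ref_inc(gint64 cluster)
    {
        guint cnt = GPOINTER_TO_UINT(g_hash_table_lookup(dyn_refs, &cluster));
        gint64 *key = g_new(gint64, 1);

        *key = cluster;
        g_hash_table_replace(dyn_refs, key, GUINT_TO_POINTER(cnt + 1));
    }

    static void dyn_ref_dec(gint64 cluster)
    {
        guint cnt = GPOINTER_TO_UINT(g_hash_table_lookup(dyn_refs, &cluster));
        gint64 *key;

        if (cnt <= 1) {
            g_hash_table_remove(dyn_refs, &cluster);
            return;
        }
        key = g_new(gint64, 1);
        *key = cluster;
        g_hash_table_replace(dyn_refs, key, GUINT_TO_POINTER(cnt - 1));
    }

    /* discard would refuse to free a cluster while this returns TRUE */
    static gboolean dyn_ref_held(gint64 cluster)
    {
        return g_hash_table_contains(dyn_refs, &cluster);
    }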



> Discard requests might be rare enough that not considering the cluster
> at all could work. We could then take a reader CoRwlock during most
> operations, and a writer lock for discard.
> 
> Actually, maybe s->lock should be this CoRwlock, and instead of dropping
> it temporarily like we do now we would just upgrade and downgrade it as
> needed. Maybe this would allow finer grained locking in other places,
> too.

In this case many operations would be blocked during data writing, like
allocation of another cluster. That doesn't seem good. A separate
CoRwLock looks more feasible.
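
I.e. something along these lines (untested sketch; inflight_writes_lock is a made-up new field, completely separate from s->lock):

    /* hypothetical new member in BDRVQcow2State */
    CoRwlock inflight_writes_lock;

    /* data write path: take the lock shared, only around the data I/O */
    qemu_co_rwlock_rdlock(&s->inflight_writes_lock);
    ret = bdrv_co_pwritev(s->data_file, host_offset, bytes, qiov, 0);
    qemu_co_rwlock_unlock(&s->inflight_writes_lock);

    /* discard path: take it exclusive, so it waits for in-flight data
     * writes, while everything that only needs s->lock keeps running */
    qemu_co_rwlock_wrlock(&s->inflight_writes_lock);
    ret = qcow2_cluster_discard(bs, offset, bytes, type, full_discard);
    qemu_co_rwlock_unlock(&s->inflight_writes_lock);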

--
Best regards,
Vladimir



* Re: What prevents discarding a cluster during rewrite?
From: Kevin Wolf @ 2021-02-23 16:29 UTC
  To: Vladimir Sementsov-Ogievskiy
  Cc: Denis V. Lunev, qemu-devel, qemu block, Max Reitz

On 23.02.2021 at 16:23, Vladimir Sementsov-Ogievskiy wrote:
> 
> 
> On 23.02.2021 13:35, Kevin Wolf wrote:
> > On 22.02.2021 at 22:42, Vladimir Sementsov-Ogievskiy wrote:
> > > 23.02.2021 00:30, Vladimir Sementsov-Ogievskiy wrote:
> > > > While thinking about how to prevent a host cluster from being
> > > > dereferenced to zero (and discarded) during a flush of the
> > > > compressed cache (which I'm working on now), I have a question:
> > > > what prevents this for normal writes?
> > > 
> > > I have no idea why it hasn't failed for years. Maybe I'm missing
> > > something?
> > 
> > Ouch. I suppose the reason why we never ran into it is that afaik Linux
> > drains the queue before sending discard requests.
> > 
> > Of course, we still need to protect against this in QEMU. We can't just
> > unref a cluster that is still in use.
> > 
> > > I have an idea for a fix: increase the refcount of the host cluster
> > > before the write to data_file (it's important to increase the
> > > refcount in the same s->lock critical section where we get
> > > host_offset) and dereference it after the write. It should help.
> > > Any thoughts?
> > 
> > It would cause metadata updates for rewrites. I don't know whether the
> > intermediate value would ever be written to disk, but at least we'd
> > rewrite the same refcounts as before. I don't think we want that.
> 
> Hmm, if that can provoke an extra refcount cache flush, that's bad.
> 
> Maybe we need something like additional "dynamic" refcounts, so that
> total_refcount = normal_refcount + dynamic_refcount. For in-flight
> clusters dynamic_refcount is > 0. We could store the dynamic refcounts
> in a GHashTable(cluster -> refcount).

Do you think it's worth the complexity? The qcow2 driver is already
relatively complicated today.

> > Discard requests might be rare enough that not considering the cluster
> > at all could work. We could then take a reader CoRwlock during most
> > operations, and a writer lock for discard.
> > 
> > Actually, maybe s->lock should be this CoRwlock, and instead of dropping
> > it temporarily like we do now we would just upgrade and downgrade it as
> > needed. Maybe this would allow finer grained locking in other places,
> > too.
> 
> In this case many operations would be blocked during data writing, like
> allocation of another cluster. That doesn't seem good.

You're right, that would be too restrictive.

> A separate CoRwLock looks more feasible.

Maybe that then.

Kevin




* Re: What prevents discarding a cluster during rewrite?
From: Vladimir Sementsov-Ogievskiy @ 2021-02-24 6:04 UTC
  To: Kevin Wolf; +Cc: qemu block, qemu-devel, Max Reitz, Denis V. Lunev, Eric Blake

23.02.2021 19:29, Kevin Wolf wrote:
> On 23.02.2021 at 16:23, Vladimir Sementsov-Ogievskiy wrote:
>>
>>
>> On 23.02.2021 13:35, Kevin Wolf wrote:
>>> On 22.02.2021 at 22:42, Vladimir Sementsov-Ogievskiy wrote:
>>>> 23.02.2021 00:30, Vladimir Sementsov-Ogievskiy wrote:
>>>>> While thinking about how to prevent a host cluster from being
>>>>> dereferenced to zero (and discarded) during a flush of the
>>>>> compressed cache (which I'm working on now), I have a question:
>>>>> what prevents this for normal writes?
>>>>
>>>> I have no idea why it hasn't failed for years. Maybe I'm missing
>>>> something?
>>>
>>> Ouch. I suppose the reason why we never ran into it is that afaik Linux
>>> drains the queue before sending discard requests.
>>>
>>> Of course, we still need to protect against this in QEMU. We can't just
>>> unref a cluster that is still in use.
>>>
>>>> I have an idea for a fix: increase the refcount of the host cluster
>>>> before the write to data_file (it's important to increase the
>>>> refcount in the same s->lock critical section where we get
>>>> host_offset) and dereference it after the write. It should help.
>>>> Any thoughts?
>>>
>>> It would cause metadata updates for rewrites. I don't know whether the
>>> intermediate value would ever be written to disk, but at least we'd
>>> rewrite the same refcounts as before. I don't think we want that.
>>
>> Hmm, if that can provoke an extra refcount cache flush, that's bad.
>>
>> Maybe we need something like additional "dynamic" refcounts, so that
>> total_refcount = normal_refcount + dynamic_refcount. For in-flight
>> clusters dynamic_refcount is > 0. We could store the dynamic refcounts
>> in a GHashTable(cluster -> refcount).
> 
> Do you think it's worth the complexity? The qcow2 driver is already
> relatively complicated today.

I started to implement it, and actually it doesn't seem much more complicated
than an additional CoRwLock. So I think I will make two patches (refcounts vs
CoRwLock) and we'll compare.

> 
>>> Discard requests might be rare enough that not considering the cluster
>>> at all could work. We could then take a reader CoRwlock during most
>>> operations, and a writer lock for discard.
>>>
>>> Actually, maybe s->lock should be this CoRwlock, and instead of dropping
>>> it temporarily like we do now we would just upgrade and downgrade it as
>>> needed. Maybe this would allow finer grained locking in other places,
>>> too.
>>
>> In this case many operations would be blocked during data writing, like
>> allocation of another cluster. That doesn't seem good.
> 
> You're right, that would be too restrictive.
> 
>> A separate CoRwLock looks more feasible.
> 
> Maybe that then.
> 
> Kevin
> 


-- 
Best regards,
Vladimir

