From: dovgaluk <dovgaluk@ispras.ru>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Cc: kwolf@redhat.com, qemu-devel@nongnu.org, mreitz@redhat.com
Subject: Re: Race condition in overlayed qcow2?
Date: Fri, 21 Feb 2020 12:49:59 +0300
Message-ID: <b3405d429e42bdf03177db1b8f7531ee@ispras.ru>
In-Reply-To: <0cbd2c7a-44e1-272f-9995-1ff7e2fb9e36@virtuozzo.com>

Vladimir Sementsov-Ogievskiy wrote on 2020-02-20 12:36:
> 20.02.2020 12:05, Vladimir Sementsov-Ogievskiy wrote:
>> 20.02.2020 11:31, dovgaluk wrote:
>>> Vladimir Sementsov-Ogievskiy wrote on 2020-02-19 19:07:
>>>> 19.02.2020 17:32, dovgaluk wrote:
>>>>> I encountered a problem with record/replay of QEMU execution and 
>>>>> figured out the following. QEMU is started with one virtual disk 
>>>>> connected to a qcow2 image with the 'snapshot' option applied.
>>>>> 
>>>>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: 
>>>>> introduce parallel subrequest handling in read and write"
>>>>> introduces some kind of race condition, which causes a difference 
>>>>> in the data read from the disk.
>>>>> 
>>>>> I detected this by adding the following code, which logs a checksum 
>>>>> of each IO operation. This checksum may differ between runs of the 
>>>>> same recorded execution.
>>>>> 
>>>>> logging in blk_aio_complete function:
>>>>> 
>>>>>     qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
>>>>>     QEMUIOVector *qiov = acb->rwco.iobuf;
>>>>>     if (qiov && qiov->iov) {
>>>>>         size_t i, j;
>>>>>         uint64_t sum = 0;
>>>>>         int count = 0;
>>>>>         for (i = 0; i < qiov->niov; ++i) {
>>>>>             for (j = 0; j < qiov->iov[i].iov_len; ++j) {
>>>>>                 sum += ((uint8_t *)qiov->iov[i].iov_base)[j];
>>>>>                 ++count;
>>>>>             }
>>>>>         }
>>>>>         qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n",
>>>>>                  acb->rwco.offset, count, sum);
>>>>>     }
>>>>> 
>>>>> I tried to get rid of the aio tasks by patching qcow2_co_preadv_part 
>>>>> to call the task function directly:
>>>>> 
>>>>>     ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset,
>>>>>                                cur_bytes, qiov, qiov_offset);
>>>>> 
>>>>> That change fixed the bug, but I have no idea what to debug next to 
>>>>> figure out the exact reason for the failure.
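
(For context, the experiment above boils down to the following change in
qcow2_co_preadv_part() -- a rough sketch against my tree, so the exact
context and qcow2_add_task() arguments may differ:)

    -            ret = qcow2_add_task(bs, aio, qcow2_co_preadv_task_entry, ret,
    -                                 cluster_offset, offset, cur_bytes,
    -                                 qiov, qiov_offset, NULL);
    +            /* bypass the task pool: run each subrequest synchronously */
    +            ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset,
    +                                       cur_bytes, qiov, qiov_offset);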
>>>>> 
>>>>> Do you have any ideas or hints?
>>>>> 
>>>> 
>>>> Hi!
>>>> 
>>>> Hmm, do you mean that a read from the disk may return wrong data? That
>>>> would be very bad, of course :(
>>>> Could you provide a reproducer, so that I can look at it and debug?
>>> 
>>> It is just a winxp-32 image. I record the execution and replay it 
>>> with the following command lines:
>>> 
qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M \
>>>   -drive file=xp.qcow2,if=none,id=device-34-file,snapshot \
>>>   -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver \
>>>   -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
>>> 
>>> qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M \
>>>   -drive file=xp.qcow2,if=none,id=device-34-file,snapshot \
>>>   -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver \
>>>   -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
>>> 
>>> Replay stalls at some point due to non-determinism in the execution 
>>> (probably caused by the wrong data being read).
>> 
>> Hmm.. I tried it (with x86_64 qemu and a centos image). I let the first 
>> command run for a while, then Ctrl+C'd it. After that, replay.bin was 
>> 4M. Then I started the second command. It works: it neither fails nor 
>> finishes. Is that bad? What is the expected behavior and what is wrong?
>> 
>>> 
>>>> What exactly is the case? Maybe you have other parallel aio
>>>> operations on the same region?
>>> 
>>> As far as I understand, all aio operations initiated by the IDE 
>>> controller are performed one by one.
>>> I don't see anything else in the logs.
>>> 
>>>> Ideas to experiment:
>>>> 
>>>> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
>>> 
>>> 1 and 2 are OK, while 4 or 8 lead to the failures.
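
(For these experiments I simply edited the constant between runs; in my
tree it lives in the qcow2 header, e.g.:)

    #define QCOW2_MAX_WORKERS 2   /* default is 8; 1 and 2 behave deterministically */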
>>> 
>>>> 2. understand what the case is in the code: is it a read from one or 
>>>> several clusters, is it aligned,
>>>> what is the type of the clusters, is encryption in use, or compression?
>>> 
>>> There is no encryption and I think compression is not enabled either.
>>> Clusters are read from the temporary overlay:
>>> 
>>> blk_aio_prwv
>>> blk_aio_read_entry
>>> bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
>>> bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>> 
>>> 
>>>> 3. understand what kind of data corruption it is. What do we read
>>>> instead of the correct data? Just garbage, or maybe zeroes, or what?
>>> 
>>> Most bytes are the same, but some are different:
>>> 
>>> < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
>>> < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
>>> < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
>>> < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
>>> < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>> ---
>>>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
>>>> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
>>>> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
>>>> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
>>>> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>> 
>>> That is strange. I could have thought it was caused by bugs in
>>> deterministic CPU execution, but the first difference in the logs
>>> occurs in a READ operation (I dump read/write buffers in 
>>> blk_aio_complete).
>>> 
>> 
>> Aha, yes, looks strange.
>> 
>> Then next steps:
>> 
>> 1. Does the problem hit the same offset every time?
>> 2. Do we write to this region before this strange read?
>> 
>> 2.1. If yes, we need to check that we read what we wrote. You say you 
>> dump buffers in blk_aio_complete... I think it would be more reliable 
>> to dump at the start of bdrv_co_pwritev and at the end of 
>> bdrv_co_preadv. Also, the guest may modify its buffers during the 
>> operation, which would be strange but possible.
>> 
>> 2.2 If not, hmm...
>> 
>> 
> 
> Another idea to check: use blkverify
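
I haven't tried blkverify yet. If I read the docs right, it could be
wired in with something like this (untested sketch; it needs a raw copy
of the image, here hypothetically xp.raw, to compare against, and the
exact -drive syntax may need adjusting):

qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M \
  -drive file=blkverify:xp.raw:xp.qcow2,if=none,id=device-34-file,snapshot \
  -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver \
  -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none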

I added logging of the file descriptor and discovered that different 
results are obtained when reading from the backing file.
Even more: replay runs of the same recording produce different results.
The logs show that there is a preadv race, but I can't figure out the 
source of the failure.

Log1:
preadv c 30467e00
preadv c 30960000
--- sum = a2e1e
bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
--- sum = 10cdee
bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00

Log2:
preadv c 30467e00
--- sum = a2e1e
bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
preadv c 30960000
--- sum = f094f
bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00


Checksum calculation was added to preadv in file-posix.c
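
Roughly like this (reconstructed from memory, the actual patch may differ
slightly) -- in the qemu_preadv() wrapper that file-posix.c uses when
CONFIG_PREADV is set:

static ssize_t
qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
{
    ssize_t ret;
    uint64_t sum = 0;
    ssize_t left;
    int i;

    /* log fd and offset before the call ("preadv <fd> <offset>") */
    qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
    ret = preadv(fd, iov, nr_iov, offset);
    if (ret <= 0) {
        return ret;
    }
    /* byte-wise checksum of the data actually read */
    left = ret;
    for (i = 0; i < nr_iov && left > 0; i++) {
        ssize_t len = MIN((ssize_t)iov[i].iov_len, left);
        ssize_t j;
        for (j = 0; j < len; j++) {
            sum += ((uint8_t *)iov[i].iov_base)[j];
        }
        left -= len;
    }
    qemu_log("--- sum = %"PRIx64"\n", sum);
    return ret;
}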

Pavel Dovgalyuk



Thread overview: 19+ messages
2020-02-19 14:32 Race condition in overlayed qcow2? dovgaluk
2020-02-19 16:07 ` Vladimir Sementsov-Ogievskiy
2020-02-20  8:31   ` dovgaluk
2020-02-20  9:05     ` Vladimir Sementsov-Ogievskiy
2020-02-20  9:36       ` Vladimir Sementsov-Ogievskiy
2020-02-21  9:49         ` dovgaluk [this message]
2020-02-21 10:09           ` Vladimir Sementsov-Ogievskiy
2020-02-21 12:35             ` dovgaluk
2020-02-21 13:23               ` Vladimir Sementsov-Ogievskiy
2020-02-25  5:58                 ` dovgaluk
2020-02-25  7:27                   ` Vladimir Sementsov-Ogievskiy
2020-02-25  7:56                     ` dovgaluk
2020-02-25  9:19                       ` Vladimir Sementsov-Ogievskiy
2020-02-25  9:26                         ` Pavel Dovgalyuk
2020-02-25 10:07                         ` Pavel Dovgalyuk
2020-02-25 11:47                           ` Kevin Wolf
2020-02-20 10:00       ` Pavel Dovgalyuk
2020-02-20 11:26         ` Vladimir Sementsov-Ogievskiy
2020-02-20 11:48           ` Pavel Dovgalyuk
