* Race condition in overlayed qcow2?
@ 2020-02-19 14:32 dovgaluk
  2020-02-19 16:07 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 19+ messages in thread
From: dovgaluk @ 2020-02-19 14:32 UTC (permalink / raw)
  To: vsementsov, qemu-devel, mreitz, kwolf

Hi!

I encountered a problem with record/replay of QEMU execution and figured 
out the following: it shows up when QEMU is started with one virtual disk 
connected to a qcow2 image with the 'snapshot' option applied.

The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: 
introduce parallel subrequest handling in read and write" introduces 
some kind of race condition, which causes differences in the data read 
from the disk.

I detected this by adding the following code, which logs a checksum of 
each IO operation. This checksum may differ between runs of the same 
recorded execution.

logging in blk_aio_complete function:
         qemu_log("%"PRId64": blk_aio_complete\n", 
replay_get_current_icount());
         QEMUIOVector *qiov = acb->rwco.iobuf;
         if (qiov && qiov->iov) {
             size_t i, j;
             uint64_t sum = 0;
             int count = 0;
             for (i = 0 ; i < qiov->niov ; ++i) {
                 for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
                     sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
                     ++count;
                 }
             }
             qemu_log("--- iobuf offset %"PRIx64" len %x sum: 
%"PRIx64"\n", acb->rwco.offset, count, sum);
         }

I tried to get rid of the aio task by patching qcow2_co_preadv_part:
ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes,
                           qiov, qiov_offset);
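In context, the change was roughly the following (only a sketch: the
surrounding lines are reproduced from memory of the cluster loop in
block/qcow2.c, so treat them as approximate):

    } else {
        /* was: queue the subrequest into the task pool, roughly:
         *   ret = qcow2_add_task(bs, aio, qcow2_co_preadv_task_entry, ret,
         *                        cluster_offset, offset, cur_bytes,
         *                        qiov, qiov_offset, NULL);
         * now: run it synchronously in the calling coroutine: */
        ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset,
                                   cur_bytes, qiov, qiov_offset);
    }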

That change fixed the bug, but I have no idea what to debug next to 
figure out the exact cause of the failure.

Do you have any ideas or hints?

Pavel Dovgalyuk



* Re: Race condition in overlayed qcow2?
  2020-02-19 14:32 Race condition in overlayed qcow2? dovgaluk
@ 2020-02-19 16:07 ` Vladimir Sementsov-Ogievskiy
  2020-02-20  8:31   ` dovgaluk
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-19 16:07 UTC (permalink / raw)
  To: dovgaluk, qemu-devel, mreitz, kwolf

19.02.2020 17:32, dovgaluk wrote:
> Hi!
> 
> I encountered a problem with record/replay of QEMU execution and figured out the following, when
> QEMU is started with one virtual disk connected to the qcow2 image with applied 'snapshot' option.
> 
> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel subrequest handling in read and write"
> introduces some kind of race condition, which causes difference in the data read from the disk.
> 
> I detected this by adding the following code, which logs IO operation checksum. And this checksum may be different in different runs of the same recorded execution.
> 
> logging in blk_aio_complete function:
>          qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
>          QEMUIOVector *qiov = acb->rwco.iobuf;
>          if (qiov && qiov->iov) {
>              size_t i, j;
>              uint64_t sum = 0;
>              int count = 0;
>              for (i = 0 ; i < qiov->niov ; ++i) {
>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
>                      ++count;
>                  }
>              }
>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb->rwco.offset, count, sum);
>          }
> 
> I tried to get rid of aio task by patching qcow2_co_preadv_part:
> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov, qiov_offset);
> 
> That change fixed a bug, but I have no idea what to debug next to figure out the exact reason of the failure.
> 
> Do you have any ideas or hints?
> 

Hi!

Hmm, do you mean that a read from the disk may return wrong data? That would be very bad, of course :(
Could you provide a reproducer, so that I can look at it and debug?

What exactly is the case? Maybe you have other parallel aio operations to the same region?

Ideas to experiment:

1. Change QCOW2_MAX_WORKERS to 1 or to 2 (see the note below); does that help?
2. Understand what the case is in the code: is the read from one cluster or several,
is it aligned, what is the cluster type, is encryption in use, or compression?
3. Understand what kind of data corruption it is. What do we read instead of the
correct data? Just garbage, or maybe zeroes, or something else?

And of course the best thing would be a small reproducer, or a test in tests/qemu-iotests.
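For idea 1: the constant is QCOW2_MAX_WORKERS (in block/qcow2.h, if I
remember correctly, with a default of 8), so the experiment is just a
one-line change:

    #define QCOW2_MAX_WORKERS 1   /* instead of the default 8, IIRC */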


-- 
Best regards,
Vladimir



* Re: Race condition in overlayed qcow2?
  2020-02-19 16:07 ` Vladimir Sementsov-Ogievskiy
@ 2020-02-20  8:31   ` dovgaluk
  2020-02-20  9:05     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 19+ messages in thread
From: dovgaluk @ 2020-02-20  8:31 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: kwolf, dovgaluk, qemu-devel, mreitz

Vladimir Sementsov-Ogievskiy wrote on 2020-02-19 19:07:
> 19.02.2020 17:32, dovgaluk wrote:
>> I encountered a problem with record/replay of QEMU execution and 
>> figured out the following, when
>> QEMU is started with one virtual disk connected to the qcow2 image 
>> with applied 'snapshot' option.
>> 
>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: 
>> introduce parallel subrequest handling in read and write"
>> introduces some kind of race condition, which causes difference in the 
>> data read from the disk.
>> 
>> I detected this by adding the following code, which logs IO operation 
>> checksum. And this checksum may be different in different runs of the 
>> same recorded execution.
>> 
>> logging in blk_aio_complete function:
>>          qemu_log("%"PRId64": blk_aio_complete\n", 
>> replay_get_current_icount());
>>          QEMUIOVector *qiov = acb->rwco.iobuf;
>>          if (qiov && qiov->iov) {
>>              size_t i, j;
>>              uint64_t sum = 0;
>>              int count = 0;
>>              for (i = 0 ; i < qiov->niov ; ++i) {
>>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
>>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
>>                      ++count;
>>                  }
>>              }
>>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: 
>> %"PRIx64"\n", acb->rwco.offset, count, sum);
>>          }
>> 
>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, 
>> qiov, qiov_offset);
>> 
>> That change fixed a bug, but I have no idea what to debug next to 
>> figure out the exact reason of the failure.
>> 
>> Do you have any ideas or hints?
>> 
> 
> Hi!
> 
> Hmm, do mean that read from the disk may return wrong data? It would
> be very bad of course :(
> Could you provide a reproducer, so that I can look at it and debug?

It is just a winxp-32 image. I record the execution and replay it with 
the following command lines:

qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none

qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none

Replay stalls at some point due to the non-determinism of the execution 
(probably caused by the wrong data being read).

> What is exactly the case? May be you have other parallel aio
> operations to the same region?

As far as I understand, all aio operations initiated by the IDE 
controller are performed one by one.
I don't see anything else in the logs.

> Ideas to experiment:
> 
> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?

1 or 2 are ok, while 4 or 8 lead to failures.

> 2. understand what is the case in code: is it read from one or several
> clusters, is it aligned,
> what is the type of clusters, is encryption in use, compression?

There is no encryption, and I think compression is not enabled either.
Clusters are read from the temporary overlay:

blk_aio_prwv
blk_aio_read_entry
bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000


> 3. understand what kind of data corruption. What we read instead of
> correct data? Just garbage, or may be zeroes, or what..

Most bytes are the same, but some are different:

< 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
< 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
< 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
< 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
< 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
---
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00

That is strange. I could believe that it was caused by bugs in
deterministic CPU execution, but the first difference in the logs
occurs in a READ operation (I dump read/write buffers in blk_aio_complete).

Maybe I missed logging in one of the functions?

> and of course best thing would be creating small reproducer, or test
> in tests/qemu-iotests


Pavel Dovgalyuk



* Re: Race condition in overlayed qcow2?
  2020-02-20  8:31   ` dovgaluk
@ 2020-02-20  9:05     ` Vladimir Sementsov-Ogievskiy
  2020-02-20  9:36       ` Vladimir Sementsov-Ogievskiy
  2020-02-20 10:00       ` Pavel Dovgalyuk
  0 siblings, 2 replies; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-20  9:05 UTC (permalink / raw)
  To: dovgaluk; +Cc: kwolf, qemu-devel, mreitz

20.02.2020 11:31, dovgaluk wrote:
> Vladimir Sementsov-Ogievskiy писал 2020-02-19 19:07:
>> 19.02.2020 17:32, dovgaluk wrote:
>>> I encountered a problem with record/replay of QEMU execution and figured out the following, when
>>> QEMU is started with one virtual disk connected to the qcow2 image with applied 'snapshot' option.
>>>
>>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel subrequest handling in read and write"
>>> introduces some kind of race condition, which causes difference in the data read from the disk.
>>>
>>> I detected this by adding the following code, which logs IO operation checksum. And this checksum may be different in different runs of the same recorded execution.
>>>
>>> logging in blk_aio_complete function:
>>>          qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
>>>          QEMUIOVector *qiov = acb->rwco.iobuf;
>>>          if (qiov && qiov->iov) {
>>>              size_t i, j;
>>>              uint64_t sum = 0;
>>>              int count = 0;
>>>              for (i = 0 ; i < qiov->niov ; ++i) {
>>>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
>>>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
>>>                      ++count;
>>>                  }
>>>              }
>>>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb->rwco.offset, count, sum);
>>>          }
>>>
>>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
>>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov, qiov_offset);
>>>
>>> That change fixed a bug, but I have no idea what to debug next to figure out the exact reason of the failure.
>>>
>>> Do you have any ideas or hints?
>>>
>>
>> Hi!
>>
>> Hmm, do mean that read from the disk may return wrong data? It would
>> be very bad of course :(
>> Could you provide a reproducer, so that I can look at it and debug?
> 
> It is just a winxp-32 image. I record the execution and replay it with the following command lines:
> 
> qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
> 
> qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
> 
> Replay stalls at some moment due to the non-determinism of the execution (probably caused by the wrong data read).

Hmm.. I tried it (with x86_64 qemu and a centos image). I waited for some time on the first command, then Ctrl+C'd it. After that, replay.bin was 4M. Then I started the second command. It keeps running, neither failing nor finishing. Is that bad? What is the expected behavior and what is wrong?

> 
>> What is exactly the case? May be you have other parallel aio
>> operations to the same region?
> 
> As far as I understand, all aio operations, initiated by IDE controller, are performed one-by-one.
> I don't see anything else in the logs.
> 
>> Ideas to experiment:
>>
>> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
> 
> 1 or 2 are ok, and 4 or 8 lead to the failures.
> 
>> 2. understand what is the case in code: is it read from one or several
>> clusters, is it aligned,
>> what is the type of clusters, is encryption in use, compression?
> 
> There is no encryption and I thinks compression is not enabled too.
> Clusters are read from the temporary overlay:
> 
> blk_aio_prwv
> blk_aio_read_entry
> bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
> bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> 
> 
>> 3. understand what kind of data corruption. What we read instead of
>> correct data? Just garbage, or may be zeroes, or what..
> 
> Most bytes are the same, but some are different:
> 
> < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
> < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
> < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
> < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
> < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
> ---
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
>> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
>> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
>> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
>> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
> 
> That is strange. I could think, that it was caused by the bugs in
> deterministic CPU execution, but the first difference in logs
> occur in READ operation (I dump read/write buffers in blk_aio_complete).
> 

Aha, yes, looks strange.

Then next steps:

1. Does the problem hit the same offset every time?
2. Do we write to this region before this strange read?

2.1. If yes, we need to check that we read what we wrote. You say you dump buffers
in blk_aio_complete... I think it would be more reliable to dump at the start of
bdrv_co_pwritev and at the end of bdrv_co_preadv (see the sketch below). Also, the
guest may modify its buffers during the operation, which would be strange but possible.

2.2. If not, hmm...
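Something like this could serve as the dump helper (just a sketch: the
function name is made up, and it reuses your checksum idea):

    /* Hypothetical helper: call it at the start of bdrv_co_pwritev() and at
     * the end of bdrv_co_preadv() (or the *_part variants, passing their
     * qiov_offset), with a "write"/"read" tag so the logs can be diffed. */
    static void dump_req_sum(const char *tag, int64_t offset, size_t bytes,
                             QEMUIOVector *qiov, size_t qiov_offset)
    {
        uint8_t *buf = g_malloc(bytes);
        uint64_t sum = 0;
        size_t i;

        /* flatten the request range, then checksum it byte by byte */
        qemu_iovec_to_buf(qiov, qiov_offset, buf, bytes);
        for (i = 0; i < bytes; i++) {
            sum += buf[i];
        }
        qemu_log("%s offset %"PRIx64" bytes %zx sum %"PRIx64"\n",
                 tag, (uint64_t)offset, bytes, sum);
        g_free(buf);
    }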


-- 
Best regards,
Vladimir



* Re: Race condition in overlayed qcow2?
  2020-02-20  9:05     ` Vladimir Sementsov-Ogievskiy
@ 2020-02-20  9:36       ` Vladimir Sementsov-Ogievskiy
  2020-02-21  9:49         ` dovgaluk
  2020-02-20 10:00       ` Pavel Dovgalyuk
  1 sibling, 1 reply; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-20  9:36 UTC (permalink / raw)
  To: dovgaluk; +Cc: kwolf, qemu-devel, mreitz

20.02.2020 12:05, Vladimir Sementsov-Ogievskiy wrote:
> 20.02.2020 11:31, dovgaluk wrote:
>> Vladimir Sementsov-Ogievskiy писал 2020-02-19 19:07:
>>> 19.02.2020 17:32, dovgaluk wrote:
>>>> I encountered a problem with record/replay of QEMU execution and figured out the following, when
>>>> QEMU is started with one virtual disk connected to the qcow2 image with applied 'snapshot' option.
>>>>
>>>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel subrequest handling in read and write"
>>>> introduces some kind of race condition, which causes difference in the data read from the disk.
>>>>
>>>> I detected this by adding the following code, which logs IO operation checksum. And this checksum may be different in different runs of the same recorded execution.
>>>>
>>>> logging in blk_aio_complete function:
>>>>          qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
>>>>          QEMUIOVector *qiov = acb->rwco.iobuf;
>>>>          if (qiov && qiov->iov) {
>>>>              size_t i, j;
>>>>              uint64_t sum = 0;
>>>>              int count = 0;
>>>>              for (i = 0 ; i < qiov->niov ; ++i) {
>>>>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
>>>>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
>>>>                      ++count;
>>>>                  }
>>>>              }
>>>>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb->rwco.offset, count, sum);
>>>>          }
>>>>
>>>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
>>>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov, qiov_offset);
>>>>
>>>> That change fixed a bug, but I have no idea what to debug next to figure out the exact reason of the failure.
>>>>
>>>> Do you have any ideas or hints?
>>>>
>>>
>>> Hi!
>>>
>>> Hmm, do mean that read from the disk may return wrong data? It would
>>> be very bad of course :(
>>> Could you provide a reproducer, so that I can look at it and debug?
>>
>> It is just a winxp-32 image. I record the execution and replay it with the following command lines:
>>
>> qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
>>
>> qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
>>
>> Replay stalls at some moment due to the non-determinism of the execution (probably caused by the wrong data read).
> 
> Hmm.. I tried it  (with x86_64 qemu and centos image). I waited for some time for a first command, than Ctrl+C it. After it replay.bin was 4M. Than started the second command. It works, not failing, not finishing. Is it bad? What is expected behavior and what is wrong?
> 
>>
>>> What is exactly the case? May be you have other parallel aio
>>> operations to the same region?
>>
>> As far as I understand, all aio operations, initiated by IDE controller, are performed one-by-one.
>> I don't see anything else in the logs.
>>
>>> Ideas to experiment:
>>>
>>> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
>>
>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>
>>> 2. understand what is the case in code: is it read from one or several
>>> clusters, is it aligned,
>>> what is the type of clusters, is encryption in use, compression?
>>
>> There is no encryption and I thinks compression is not enabled too.
>> Clusters are read from the temporary overlay:
>>
>> blk_aio_prwv
>> blk_aio_read_entry
>> bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
>> bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>
>>
>>> 3. understand what kind of data corruption. What we read instead of
>>> correct data? Just garbage, or may be zeroes, or what..
>>
>> Most bytes are the same, but some are different:
>>
>> < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
>> < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
>> < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
>> < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
>> < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>> ---
>>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
>>> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
>>> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
>>> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
>>> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>
>> That is strange. I could think, that it was caused by the bugs in
>> deterministic CPU execution, but the first difference in logs
>> occur in READ operation (I dump read/write buffers in blk_aio_complete).
>>
> 
> Aha, yes, looks strange.
> 
> Then next steps:
> 
> 1. Does problem hit into the same offset every time?
> 2. Do we write to this region before this strange read?
> 
> 2.1. If yes, we need to check that we read what we write.. You say you dump buffers
> in blk_aio_complete... I think it would be more reliable to dump at start of
> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify its buffers
> during operation which would be strange but possible.
> 
> 2.2 If not, hmm...
> 
> 

Another idea to check: use blkverify

blkverify is a filter driver which, on write, writes to two separate files, and on read,
checks that it reads the same data from both files (and prints an error and exits if they differ).

So, I imagine creating two overlays by hand, like
qemu-img create -f qcow2 -b /path/to/xp.qcow2 a.qcow2
qemu-img create -f qcow2 -b /path/to/xp.qcow2 b.qcow2

and then starting the VM with something like this:

-drive driver=blkverify,raw.filename=/work/a.qcow2,test.file.filename=/work/b.qcow2



-- 
Best regards,
Vladimir



* RE: Race condition in overlayed qcow2?
  2020-02-20  9:05     ` Vladimir Sementsov-Ogievskiy
  2020-02-20  9:36       ` Vladimir Sementsov-Ogievskiy
@ 2020-02-20 10:00       ` Pavel Dovgalyuk
  2020-02-20 11:26         ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-02-20 10:00 UTC (permalink / raw)
  To: 'Vladimir Sementsov-Ogievskiy'; +Cc: kwolf, qemu-devel, mreitz

> From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
> 20.02.2020 11:31, dovgaluk wrote:
> > Vladimir Sementsov-Ogievskiy писал 2020-02-19 19:07:
> >> 19.02.2020 17:32, dovgaluk wrote:
> >>> I encountered a problem with record/replay of QEMU execution and figured out the
> following, when
> >>> QEMU is started with one virtual disk connected to the qcow2 image with applied 'snapshot'
> option.
> >>>
> >>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel
> subrequest handling in read and write"
> >>> introduces some kind of race condition, which causes difference in the data read from the
> disk.
> >>>
> >>> I detected this by adding the following code, which logs IO operation checksum. And this
> checksum may be different in different runs of the same recorded execution.
> >>>
> >>> logging in blk_aio_complete function:
> >>>          qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
> >>>          QEMUIOVector *qiov = acb->rwco.iobuf;
> >>>          if (qiov && qiov->iov) {
> >>>              size_t i, j;
> >>>              uint64_t sum = 0;
> >>>              int count = 0;
> >>>              for (i = 0 ; i < qiov->niov ; ++i) {
> >>>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
> >>>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
> >>>                      ++count;
> >>>                  }
> >>>              }
> >>>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb-
> >rwco.offset, count, sum);
> >>>          }
> >>>
> >>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
> >>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov, qiov_offset);
> >>>
> >>> That change fixed a bug, but I have no idea what to debug next to figure out the exact
> reason of the failure.
> >>>
> >>> Do you have any ideas or hints?
> >>>
> >>
> >> Hi!
> >>
> >> Hmm, do mean that read from the disk may return wrong data? It would
> >> be very bad of course :(
> >> Could you provide a reproducer, so that I can look at it and debug?
> >
> > It is just a winxp-32 image. I record the execution and replay it with the following command
> lines:
> >
> > qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive
> file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-
> 34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net
> none
> >
> > qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive
> file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-
> 34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net
> none
> >
> > Replay stalls at some moment due to the non-determinism of the execution (probably caused by
> the wrong data read).
> 
> Hmm.. I tried it  (with x86_64 qemu and centos image). I waited for some time for a first
> command, than Ctrl+C it. After it replay.bin was 4M. Than started the second command. It
> works, not failing, not finishing. Is it bad? What is expected behavior and what is wrong?

The second command should finish. There is no replay introspection yet (in master), but you can
stop qemu with gdb and inspect the replay_state.current_icount field. It should increase with every
executed virtual CPU instruction. If that counter has stopped, it means that replay hangs.

> >> What is exactly the case? May be you have other parallel aio
> >> operations to the same region?
> >
> > As far as I understand, all aio operations, initiated by IDE controller, are performed one-
> by-one.
> > I don't see anything else in the logs.
> >
> >> Ideas to experiment:
> >>
> >> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
> >
> > 1 or 2 are ok, and 4 or 8 lead to the failures.
> >
> >> 2. understand what is the case in code: is it read from one or several
> >> clusters, is it aligned,
> >> what is the type of clusters, is encryption in use, compression?
> >
> > There is no encryption and I thinks compression is not enabled too.
> > Clusters are read from the temporary overlay:
> >
> > blk_aio_prwv
> > blk_aio_read_entry
> > bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
> > bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
> > bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> > bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> > bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> >
> >
> >> 3. understand what kind of data corruption. What we read instead of
> >> correct data? Just garbage, or may be zeroes, or what..
> >
> > Most bytes are the same, but some are different:
> >
> > < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
> > < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
> > < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
> > < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
> > < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
> > ---
> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
> >> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
> >> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
> >> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
> >> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
> >
> > That is strange. I could think, that it was caused by the bugs in
> > deterministic CPU execution, but the first difference in logs
> > occur in READ operation (I dump read/write buffers in blk_aio_complete).
> >
> 
> Aha, yes, looks strange.
> 
> Then next steps:
> 
> 1. Does problem hit into the same offset every time?

Yes, almost the same offset, almost the same phase of the execution.

> 2. Do we write to this region before this strange read?

No.

> 2.1. If yes, we need to check that we read what we write.. You say you dump buffers
> in blk_aio_complete... I think it would be more reliable to dump at start of
> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify its buffers
> during operation which would be strange but possible.

I dumped every write in file-posix.c (handle_aiocb_rw_linear and qemu_pwritev)
and found no difference between executions.
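The write-side hook looked roughly like this (a reconstructed sketch, not the
exact code; upstream qemu_pwritev is just a wrapper around pwritev()):

    static ssize_t
    qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset)
    {
        ssize_t res = pwritev(fd, iov, nr_iov, offset);
        /* debug logging: checksum of everything just submitted for writing */
        uint32_t sum = 0;
        int cnt = 0;
        int i;

        for (i = 0; i < nr_iov; ++i) {
            size_t j;
            for (j = 0; j < iov[i].iov_len; ++j) {
                sum += ((uint8_t *)iov[i].iov_base)[j];
                ++cnt;
            }
        }
        qemu_log("pwritev %x %"PRIx64" size: %x sum: %x\n",
                 fd, (uint64_t)offset, cnt, sum);
        return res;
    }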

> 2.2 If not, hmm...

Exactly.

Pavel Dovgalyuk




* Re: Race condition in overlayed qcow2?
  2020-02-20 10:00       ` Pavel Dovgalyuk
@ 2020-02-20 11:26         ` Vladimir Sementsov-Ogievskiy
  2020-02-20 11:48           ` Pavel Dovgalyuk
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-20 11:26 UTC (permalink / raw)
  To: Pavel Dovgalyuk; +Cc: kwolf, qemu-devel, mreitz

20.02.2020 13:00, Pavel Dovgalyuk wrote:
>> From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
>> 20.02.2020 11:31, dovgaluk wrote:
>>> Vladimir Sementsov-Ogievskiy писал 2020-02-19 19:07:
>>>> 19.02.2020 17:32, dovgaluk wrote:
>>>>> I encountered a problem with record/replay of QEMU execution and figured out the
>> following, when
>>>>> QEMU is started with one virtual disk connected to the qcow2 image with applied 'snapshot'
>> option.
>>>>>
>>>>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel
>> subrequest handling in read and write"
>>>>> introduces some kind of race condition, which causes difference in the data read from the
>> disk.
>>>>>
>>>>> I detected this by adding the following code, which logs IO operation checksum. And this
>> checksum may be different in different runs of the same recorded execution.
>>>>>
>>>>> logging in blk_aio_complete function:
>>>>>           qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
>>>>>           QEMUIOVector *qiov = acb->rwco.iobuf;
>>>>>           if (qiov && qiov->iov) {
>>>>>               size_t i, j;
>>>>>               uint64_t sum = 0;
>>>>>               int count = 0;
>>>>>               for (i = 0 ; i < qiov->niov ; ++i) {
>>>>>                   for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
>>>>>                       sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
>>>>>                       ++count;
>>>>>                   }
>>>>>               }
>>>>>               qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb-
>>> rwco.offset, count, sum);
>>>>>           }
>>>>>
>>>>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
>>>>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov, qiov_offset);
>>>>>
>>>>> That change fixed a bug, but I have no idea what to debug next to figure out the exact
>> reason of the failure.
>>>>>
>>>>> Do you have any ideas or hints?
>>>>>
>>>>
>>>> Hi!
>>>>
>>>> Hmm, do mean that read from the disk may return wrong data? It would
>>>> be very bad of course :(
>>>> Could you provide a reproducer, so that I can look at it and debug?
>>>
>>> It is just a winxp-32 image. I record the execution and replay it with the following command
>> lines:
>>>
>>> qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive
>> file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-
>> 34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net
>> none
>>>
>>> qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive
>> file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-
>> 34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net
>> none
>>>
>>> Replay stalls at some moment due to the non-determinism of the execution (probably caused by
>> the wrong data read).
>>
>> Hmm.. I tried it  (with x86_64 qemu and centos image). I waited for some time for a first
>> command, than Ctrl+C it. After it replay.bin was 4M. Than started the second command. It
>> works, not failing, not finishing. Is it bad? What is expected behavior and what is wrong?
> 
> The second command should finish. There is no replay introspection yet (in master), but you can
> stop qemu with gdb and inspect replay_state.current_icount field. It should increase with every
> virtual CPU instruction execution. If that counter has stopped, it means that replay hangs.

It hangs for me even with QCOW2_MAX_WORKERS = 1..

> 
>>>> What is exactly the case? May be you have other parallel aio
>>>> operations to the same region?
>>>
>>> As far as I understand, all aio operations, initiated by IDE controller, are performed one-
>> by-one.
>>> I don't see anything else in the logs.
>>>
>>>> Ideas to experiment:
>>>>
>>>> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
>>>
>>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>>
>>>> 2. understand what is the case in code: is it read from one or several
>>>> clusters, is it aligned,
>>>> what is the type of clusters, is encryption in use, compression?
>>>
>>> There is no encryption and I thinks compression is not enabled too.
>>> Clusters are read from the temporary overlay:
>>>
>>> blk_aio_prwv
>>> blk_aio_read_entry
>>> bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
>>> bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>>
>>>
>>>> 3. understand what kind of data corruption. What we read instead of
>>>> correct data? Just garbage, or may be zeroes, or what..
>>>
>>> Most bytes are the same, but some are different:
>>>
>>> < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
>>> < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
>>> < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
>>> < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
>>> < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>> ---
>>>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
>>>> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
>>>> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
>>>> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
>>>> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>>
>>> That is strange. I could think, that it was caused by the bugs in
>>> deterministic CPU execution, but the first difference in logs
>>> occur in READ operation (I dump read/write buffers in blk_aio_complete).
>>>
>>
>> Aha, yes, looks strange.
>>
>> Then next steps:
>>
>> 1. Does problem hit into the same offset every time?
> 
> Yes, almost the same offset, almost the same phase of the execution.
> 
>> 2. Do we write to this region before this strange read?
> 
> No.
> 
>> 2.1. If yes, we need to check that we read what we write.. You say you dump buffers
>> in blk_aio_complete... I think it would be more reliable to dump at start of
>> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify its buffers
>> during operation which would be strange but possible.
> 
> I dumped every write in file-posix.c handle_aiocb_rw_linear and qemu_pwritev
> and found no difference in executions.
> 
>> 2.2 If not, hmm...
> 
> Exactly.
> 
> Pavel Dovgalyuk
> 


-- 
Best regards,
Vladimir



* RE: Race condition in overlayed qcow2?
  2020-02-20 11:26         ` Vladimir Sementsov-Ogievskiy
@ 2020-02-20 11:48           ` Pavel Dovgalyuk
  0 siblings, 0 replies; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-02-20 11:48 UTC (permalink / raw)
  To: 'Vladimir Sementsov-Ogievskiy'; +Cc: kwolf, qemu-devel, mreitz

> From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
> 20.02.2020 13:00, Pavel Dovgalyuk wrote:
> >> From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
> >> 20.02.2020 11:31, dovgaluk wrote:
> >>> Vladimir Sementsov-Ogievskiy писал 2020-02-19 19:07:
> >>>> 19.02.2020 17:32, dovgaluk wrote:
> >>>>> I encountered a problem with record/replay of QEMU execution and figured out the
> >> following, when
> >>>>> QEMU is started with one virtual disk connected to the qcow2 image with applied
> 'snapshot'
> >> option.
> >>>>>
> >>>>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel
> >> subrequest handling in read and write"
> >>>>> introduces some kind of race condition, which causes difference in the data read from
> the
> >> disk.
> >>>>>
> >>>>> I detected this by adding the following code, which logs IO operation checksum. And this
> >> checksum may be different in different runs of the same recorded execution.
> >>>>>
> >>>>> logging in blk_aio_complete function:
> >>>>>           qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
> >>>>>           QEMUIOVector *qiov = acb->rwco.iobuf;
> >>>>>           if (qiov && qiov->iov) {
> >>>>>               size_t i, j;
> >>>>>               uint64_t sum = 0;
> >>>>>               int count = 0;
> >>>>>               for (i = 0 ; i < qiov->niov ; ++i) {
> >>>>>                   for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
> >>>>>                       sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
> >>>>>                       ++count;
> >>>>>                   }
> >>>>>               }
> >>>>>               qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb-
> >>> rwco.offset, count, sum);
> >>>>>           }
> >>>>>
> >>>>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
> >>>>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov,
> qiov_offset);
> >>>>>
> >>>>> That change fixed a bug, but I have no idea what to debug next to figure out the exact
> >> reason of the failure.
> >>>>>
> >>>>> Do you have any ideas or hints?
> >>>>>
> >>>>
> >>>> Hi!
> >>>>
> >>>> Hmm, do mean that read from the disk may return wrong data? It would
> >>>> be very bad of course :(
> >>>> Could you provide a reproducer, so that I can look at it and debug?
> >>>
> >>> It is just a winxp-32 image. I record the execution and replay it with the following
> command
> >> lines:
> >>>
> >>> qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive
> >> file=xp.qcow2,if=none,id=device-34-file,snapshot -drive
> driver=blkreplay,if=none,image=device-
> >> 34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -
> net
> >> none
> >>>
> >>> qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive
> >> file=xp.qcow2,if=none,id=device-34-file,snapshot -drive
> driver=blkreplay,if=none,image=device-
> >> 34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -
> net
> >> none
> >>>
> >>> Replay stalls at some moment due to the non-determinism of the execution (probably caused
> by
> >> the wrong data read).
> >>
> >> Hmm.. I tried it  (with x86_64 qemu and centos image). I waited for some time for a first
> >> command, than Ctrl+C it. After it replay.bin was 4M. Than started the second command. It
> >> works, not failing, not finishing. Is it bad? What is expected behavior and what is wrong?
> >
> > The second command should finish. There is no replay introspection yet (in master), but you
> can
> > stop qemu with gdb and inspect replay_state.current_icount field. It should increase with
> every
> > virtual CPU instruction execution. If that counter has stopped, it means that replay hangs.
> 
> It hangs for me even with QCOW2_MAX_WORKERS = 1..


There could be some other bugs in record/replay.
To be sure, try winxp on i386.

Pavel Dovgalyuk




* Re: Race condition in overlayed qcow2?
  2020-02-20  9:36       ` Vladimir Sementsov-Ogievskiy
@ 2020-02-21  9:49         ` dovgaluk
  2020-02-21 10:09           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 19+ messages in thread
From: dovgaluk @ 2020-02-21  9:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: kwolf, qemu-devel, mreitz

Vladimir Sementsov-Ogievskiy wrote on 2020-02-20 12:36:
> 20.02.2020 12:05, Vladimir Sementsov-Ogievskiy wrote:
>> 20.02.2020 11:31, dovgaluk wrote:
>>> Vladimir Sementsov-Ogievskiy писал 2020-02-19 19:07:
>>>> 19.02.2020 17:32, dovgaluk wrote:
>>>>> I encountered a problem with record/replay of QEMU execution and 
>>>>> figured out the following, when
>>>>> QEMU is started with one virtual disk connected to the qcow2 image 
>>>>> with applied 'snapshot' option.
>>>>> 
>>>>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: 
>>>>> introduce parallel subrequest handling in read and write"
>>>>> introduces some kind of race condition, which causes difference in 
>>>>> the data read from the disk.
>>>>> 
>>>>> I detected this by adding the following code, which logs IO 
>>>>> operation checksum. And this checksum may be different in different 
>>>>> runs of the same recorded execution.
>>>>> 
>>>>> logging in blk_aio_complete function:
>>>>>          qemu_log("%"PRId64": blk_aio_complete\n", 
>>>>> replay_get_current_icount());
>>>>>          QEMUIOVector *qiov = acb->rwco.iobuf;
>>>>>          if (qiov && qiov->iov) {
>>>>>              size_t i, j;
>>>>>              uint64_t sum = 0;
>>>>>              int count = 0;
>>>>>              for (i = 0 ; i < qiov->niov ; ++i) {
>>>>>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
>>>>>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
>>>>>                      ++count;
>>>>>                  }
>>>>>              }
>>>>>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: 
>>>>> %"PRIx64"\n", acb->rwco.offset, count, sum);
>>>>>          }
>>>>> 
>>>>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
>>>>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, 
>>>>> cur_bytes, qiov, qiov_offset);
>>>>> 
>>>>> That change fixed a bug, but I have no idea what to debug next to 
>>>>> figure out the exact reason of the failure.
>>>>> 
>>>>> Do you have any ideas or hints?
>>>>> 
>>>> 
>>>> Hi!
>>>> 
>>>> Hmm, do mean that read from the disk may return wrong data? It would
>>>> be very bad of course :(
>>>> Could you provide a reproducer, so that I can look at it and debug?
>>> 
>>> It is just a winxp-32 image. I record the execution and replay it 
>>> with the following command lines:
>>> 
>>> qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M 
>>> -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive 
>>> driver=blkreplay,if=none,image=device-34-file,id=device-34-driver 
>>> -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net 
>>> none
>>> 
>>> qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M 
>>> -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive 
>>> driver=blkreplay,if=none,image=device-34-file,id=device-34-driver 
>>> -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net 
>>> none
>>> 
>>> Replay stalls at some moment due to the non-determinism of the 
>>> execution (probably caused by the wrong data read).
>> 
>> Hmm.. I tried it  (with x86_64 qemu and centos image). I waited for 
>> some time for a first command, than Ctrl+C it. After it replay.bin was 
>> 4M. Than started the second command. It works, not failing, not 
>> finishing. Is it bad? What is expected behavior and what is wrong?
>> 
>>> 
>>>> What is exactly the case? May be you have other parallel aio
>>>> operations to the same region?
>>> 
>>> As far as I understand, all aio operations, initiated by IDE 
>>> controller, are performed one-by-one.
>>> I don't see anything else in the logs.
>>> 
>>>> Ideas to experiment:
>>>> 
>>>> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
>>> 
>>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>> 
>>>> 2. understand what is the case in code: is it read from one or 
>>>> several
>>>> clusters, is it aligned,
>>>> what is the type of clusters, is encryption in use, compression?
>>> 
>>> There is no encryption and I thinks compression is not enabled too.
>>> Clusters are read from the temporary overlay:
>>> 
>>> blk_aio_prwv
>>> blk_aio_read_entry
>>> bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 
>>> 1e00
>>> bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 
>>> 1c200
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 
>>> 1e000
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 
>>> 1e000
>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 
>>> 1e000
>>> 
>>> 
>>>> 3. understand what kind of data corruption. What we read instead of
>>>> correct data? Just garbage, or may be zeroes, or what..
>>> 
>>> Most bytes are the same, but some are different:
>>> 
>>> < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
>>> < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
>>> < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
>>> < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
>>> < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>> ---
>>>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
>>>> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
>>>> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
>>>> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
>>>> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>> 
>>> That is strange. I could think, that it was caused by the bugs in
>>> deterministic CPU execution, but the first difference in logs
>>> occur in READ operation (I dump read/write buffers in 
>>> blk_aio_complete).
>>> 
>> 
>> Aha, yes, looks strange.
>> 
>> Then next steps:
>> 
>> 1. Does problem hit into the same offset every time?
>> 2. Do we write to this region before this strange read?
>> 
>> 2.1. If yes, we need to check that we read what we write.. You say you 
>> dump buffers
>> in blk_aio_complete... I think it would be more reliable to dump at 
>> start of
>> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify 
>> its buffers
>> during operation which would be strange but possible.
>> 
>> 2.2 If not, hmm...
>> 
>> 
> 
> Another idea to check: use blkverify

I added logging of the file descriptor and discovered that different 
results are obtained when reading from the backing file.
Even more: replay runs of the same recording produce different results.
The logs show that there is a preadv race, but I can't figure out the 
source of the failure.

Log1:
preadv c 30467e00
preadv c 30960000
--- sum = a2e1e
bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
--- sum = 10cdee
bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00

Log2:
preadv c 30467e00
--- sum = a2e1e
bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
preadv c 30960000
--- sum = f094f
bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00


Checksum calculation was added to preadv in file-posix.c

Pavel Dovgalyuk



* Re: Race condition in overlayed qcow2?
  2020-02-21  9:49         ` dovgaluk
@ 2020-02-21 10:09           ` Vladimir Sementsov-Ogievskiy
  2020-02-21 12:35             ` dovgaluk
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-21 10:09 UTC (permalink / raw)
  To: dovgaluk; +Cc: kwolf, qemu-devel, mreitz

21.02.2020 12:49, dovgaluk wrote:
> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
>> 20.02.2020 12:05, Vladimir Sementsov-Ogievskiy wrote:
>>> 20.02.2020 11:31, dovgaluk wrote:
>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-19 19:07:
>>>>> 19.02.2020 17:32, dovgaluk wrote:
>>>>>> I encountered a problem with record/replay of QEMU execution and figured out the following, when
>>>>>> QEMU is started with one virtual disk connected to the qcow2 image with applied 'snapshot' option.
>>>>>>
>>>>>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel subrequest handling in read and write"
>>>>>> introduces some kind of race condition, which causes difference in the data read from the disk.
>>>>>>
>>>>>> I detected this by adding the following code, which logs IO operation checksum. And this checksum may be different in different runs of the same recorded execution.
>>>>>>
>>>>>> logging in blk_aio_complete function:
>>>>>>          qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
>>>>>>          QEMUIOVector *qiov = acb->rwco.iobuf;
>>>>>>          if (qiov && qiov->iov) {
>>>>>>              size_t i, j;
>>>>>>              uint64_t sum = 0;
>>>>>>              int count = 0;
>>>>>>              for (i = 0 ; i < qiov->niov ; ++i) {
>>>>>>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
>>>>>>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
>>>>>>                      ++count;
>>>>>>                  }
>>>>>>              }
>>>>>>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb->rwco.offset, count, sum);
>>>>>>          }
>>>>>>
>>>>>> I tried to get rid of aio task by patching qcow2_co_preadv_part:
>>>>>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov, qiov_offset);
>>>>>>
>>>>>> That change fixed a bug, but I have no idea what to debug next to figure out the exact reason of the failure.
>>>>>>
>>>>>> Do you have any ideas or hints?
>>>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>> Hmm, do mean that read from the disk may return wrong data? It would
>>>>> be very bad of course :(
>>>>> Could you provide a reproducer, so that I can look at it and debug?
>>>>
>>>> It is just a winxp-32 image. I record the execution and replay it with the following command lines:
>>>>
>>>> qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
>>>>
>>>> qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
>>>>
>>>> Replay stalls at some moment due to the non-determinism of the execution (probably caused by the wrong data read).
>>>
>>> Hmm.. I tried it  (with x86_64 qemu and centos image). I waited for some time for a first command, than Ctrl+C it. After it replay.bin was 4M. Than started the second command. It works, not failing, not finishing. Is it bad? What is expected behavior and what is wrong?
>>>
>>>>
>>>>> What is exactly the case? May be you have other parallel aio
>>>>> operations to the same region?
>>>>
>>>> As far as I understand, all aio operations, initiated by IDE controller, are performed one-by-one.
>>>> I don't see anything else in the logs.
>>>>
>>>>> Ideas to experiment:
>>>>>
>>>>> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
>>>>
>>>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>>>
>>>>> 2. understand what is the case in code: is it read from one or several
>>>>> clusters, is it aligned,
>>>>> what is the type of clusters, is encryption in use, compression?
>>>>
>>>> There is no encryption and I thinks compression is not enabled too.
>>>> Clusters are read from the temporary overlay:
>>>>
>>>> blk_aio_prwv
>>>> blk_aio_read_entry
>>>> bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
>>>> bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
>>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>>> bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
>>>>
>>>>
>>>>> 3. understand what kind of data corruption. What we read instead of
>>>>> correct data? Just garbage, or may be zeroes, or what..
>>>>
>>>> Most bytes are the same, but some are different:
>>>>
>>>> < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
>>>> < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
>>>> < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
>>>> < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
>>>> < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>>> ---
>>>>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
>>>>> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
>>>>> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
>>>>> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
>>>>> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
>>>>
>>>> That is strange. I could think, that it was caused by the bugs in
>>>> deterministic CPU execution, but the first difference in logs
>>>> occur in READ operation (I dump read/write buffers in blk_aio_complete).
>>>>
>>>
>>> Aha, yes, looks strange.
>>>
>>> Then next steps:
>>>
>>> 1. Does problem hit into the same offset every time?
>>> 2. Do we write to this region before this strange read?
>>>
>>> 2.1. If yes, we need to check that we read what we write.. You say you dump buffers
>>> in blk_aio_complete... I think it would be more reliable to dump at start of
>>> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify its buffers
>>> during operation which would be strange but possible.
>>>
>>> 2.2 If not, hmm...
>>>
>>>
>>
>> Another idea to check: use blkverify
> 
> I added logging of file descriptor and discovered that different results are obtained
> when reading from the backing file.
> And even more - replay runs of the same recording produce different results.
> Logs show that there is a preadv race, but I can't figure out the source of the failure.
> 
> Log1:
> preadv c 30467e00
> preadv c 30960000
> --- sum = a2e1e
> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
> --- sum = 10cdee
> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00
> 
> Log2:
> preadv c 30467e00
> --- sum = a2e1e
> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
> preadv c 30960000
> --- sum = f094f
> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00
> 
> 
> Checksum calculation was added to preadv in file-posix.c
> 

So, preadv in file-posix.c returns different results for the same offset, for a file which is always opened in RO mode? That sounds impossible :)


-- 
Best regards,
Vladimir



* Re: Race condition in overlayed qcow2?
  2020-02-21 10:09           ` Vladimir Sementsov-Ogievskiy
@ 2020-02-21 12:35             ` dovgaluk
  2020-02-21 13:23               ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 19+ messages in thread
From: dovgaluk @ 2020-02-21 12:35 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: kwolf, qemu-devel, mreitz

Vladimir Sementsov-Ogievskiy wrote on 2020-02-21 13:09:
> 21.02.2020 12:49, dovgaluk wrote:
>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
>>>>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>>>> 
>>>>> 
>>>>> That is strange. I could think, that it was caused by the bugs in
>>>>> deterministic CPU execution, but the first difference in logs
>>>>> occur in READ operation (I dump read/write buffers in 
>>>>> blk_aio_complete).
>>>>> 
>>>> 
>>>> Aha, yes, looks strange.
>>>> 
>>>> Then next steps:
>>>> 
>>>> 1. Does problem hit into the same offset every time?
>>>> 2. Do we write to this region before this strange read?
>>>> 
>>>> 2.1. If yes, we need to check that we read what we write.. You say 
>>>> you dump buffers
>>>> in blk_aio_complete... I think it would be more reliable to dump at 
>>>> start of
>>>> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify 
>>>> its buffers
>>>> during operation which would be strange but possible.
>>>> 
>>>> 2.2 If not, hmm...
>>>> 
>>>> 
>>> 
>>> Another idea to check: use blkverify
>> 
>> I added logging of file descriptor and discovered that different 
>> results are obtained
>> when reading from the backing file.
>> And even more - replay runs of the same recording produce different 
>> results.
>> Logs show that there is a preadv race, but I can't figure out the 
>> source of the failure.
>> 
>> Log1:
>> preadv c 30467e00
>> preadv c 30960000
>> --- sum = a2e1e
>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
>> --- sum = 10cdee
>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: 
>> ee00
>> 
>> Log2:
>> preadv c 30467e00
>> --- sum = a2e1e
>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
>> preadv c 30960000
>> --- sum = f094f
>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: 
>> ee00
>> 
>> 
>> Checksum calculation was added to preadv in file-posix.c
>> 
> 
> So, preadv in file-posix.c returns different results for the same
> offset, for file which is always opened in RO mode? Sounds impossible
> :)

True.
Maybe my logging is wrong?

static ssize_t
qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
{
     ssize_t res = preadv(fd, iov, nr_iov, offset);
     qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
     int i;
     uint32_t sum = 0;
     int cnt = 0;
     for (i = 0 ; i < nr_iov ; ++i) {
         int j;
         for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
         {
             sum += ((uint8_t*)iov[i].iov_base)[j];
             ++cnt;
         }
     }
     qemu_log("size: %x sum: %x\n", cnt, sum);
     assert(cnt == res);
     return res;
}

This code prints the preadv checksum.
But when I calculate the same checksum with a standalone program, it gives
me different values for the same offsets:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

unsigned char buf[0x100000];

int main(int argc, char **argv)
{
   if (argc < 4) return 1;
   int f = open(argv[1], O_RDONLY);
   if (f < 0) { perror("open"); return 1; }
   unsigned int cnt;
   unsigned int offs;
   /* offset and size are given in hex, as printed in the QEMU log */
   sscanf(argv[2], "%x", &offs);
   sscanf(argv[3], "%x", &cnt);
   printf("file: %s offset: %x size: %x\n", argv[1], offs, cnt);
   struct iovec iov = {buf, (size_t)cnt};
   ssize_t sz = preadv(f, &iov, 1, offs);
   printf("read %x\n", (int)sz);
   unsigned int i;
   unsigned int sum = 0;
   for (i = 0 ; i < cnt ; ++i)
     sum += buf[i];
   printf("sum = %x\n", sum);
   close(f);
   return 0;
}
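
(For reference: assuming the program above is compiled as ./checkread, the offsets from the logs can be re-checked directly against the backing file, e.g. "./checkread backing.img 30960000 ee00"; the image file name here is only a placeholder.)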



Pavel Dovgalyuk


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Race condition in overlayed qcow2?
  2020-02-21 12:35             ` dovgaluk
@ 2020-02-21 13:23               ` Vladimir Sementsov-Ogievskiy
  2020-02-25  5:58                 ` dovgaluk
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-21 13:23 UTC (permalink / raw)
  To: dovgaluk; +Cc: kwolf, qemu-devel, mreitz

21.02.2020 15:35, dovgaluk wrote:
> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
>> 21.02.2020 12:49, dovgaluk wrote:
>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
>>>>>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>>>>>
>>>>>>
>>>>>> That is strange. I could think, that it was caused by the bugs in
>>>>>> deterministic CPU execution, but the first difference in logs
>>>>>> occur in READ operation (I dump read/write buffers in blk_aio_complete).
>>>>>>
>>>>>
>>>>> Aha, yes, looks strange.
>>>>>
>>>>> Then next steps:
>>>>>
>>>>> 1. Does problem hit into the same offset every time?
>>>>> 2. Do we write to this region before this strange read?
>>>>>
>>>>> 2.1. If yes, we need to check that we read what we write.. You say you dump buffers
>>>>> in blk_aio_complete... I think it would be more reliable to dump at start of
>>>>> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify its buffers
>>>>> during operation which would be strange but possible.
>>>>>
>>>>> 2.2 If not, hmm...
>>>>>
>>>>>
>>>>
>>>> Another idea to check: use blkverify
>>>
>>> I added logging of file descriptor and discovered that different results are obtained
>>> when reading from the backing file.
>>> And even more - replay runs of the same recording produce different results.
>>> Logs show that there is a preadv race, but I can't figure out the source of the failure.
>>>
>>> Log1:
>>> preadv c 30467e00
>>> preadv c 30960000
>>> --- sum = a2e1e
>>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
>>> --- sum = 10cdee
>>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00
>>>
>>> Log2:
>>> preadv c 30467e00
>>> --- sum = a2e1e
>>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
>>> preadv c 30960000
>>> --- sum = f094f
>>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00
>>>
>>>
>>> Checksum calculation was added to preadv in file-posix.c
>>>
>>
>> So, preadv in file-posix.c returns different results for the same
>> offset, for file which is always opened in RO mode? Sounds impossible
>> :)
> 
> True.
> Maybe my logging is wrong?
> 
> static ssize_t
> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
> {
>      ssize_t res = preadv(fd, iov, nr_iov, offset);
>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
>      int i;
>      uint32_t sum = 0;
>      int cnt = 0;
>      for (i = 0 ; i < nr_iov ; ++i) {
>          int j;
>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
>          {
>              sum += ((uint8_t*)iov[i].iov_base)[j];
>              ++cnt;
>          }
>      }
>      qemu_log("size: %x sum: %x\n", cnt, sum);
>      assert(cnt == res);
>      return res;
> }
> 
> This code prints preadv checksum.
> But when I calculate the same with the standalone program, then it gives me another values of the checksums for the same offsets:
> 
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/uio.h>
> 
> unsigned char buf[0x100000];
> 
> int main(int argc, char **argv)
> {
>    if (argc < 4) return 1;
>    int f = open(argv[1], O_RDONLY);
>    unsigned int cnt;
>    unsigned int offs;
>    sscanf(argv[2], "%x", &offs);
>    sscanf(argv[3], "%x", &cnt);
>    printf("file: %s offset: %x size: %x\n", argv[1], offs, cnt);
>    struct iovec iov = {buf, (size_t)cnt};
>    size_t sz = preadv(f, &iov, 1, offs);
>    printf("read %x\n", (int)sz);
>    int i;
>    unsigned int sum = 0;
>    for (i = 0 ; i < cnt ; ++i)
>      sum += buf[i];
>    printf("sum = %x\n", sum);
> }
> 


Hmm, I don't see any issues here..

Are you absolutely sure that all these reads are from the backing file, which is read-only and never changed (maybe by other processes)?
Hmm, maybe it is worth making the backing file read-only at the file-system level?
If so, I can imagine only two things:
1. a bug in the file system (as all we are doing is a preadv syscall from the same place in a never-modified file)
2. the guest modifies its buffers during the operation (you can catch it if you allocate a private buffer for preadv, then calculate the checksum, then memcpy to the guest buffer; see the sketch below)
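
Something like this minimal sketch is what I mean for check 2, if it were wired in place of the plain preadv() call in qemu_preadv(); all names here are illustrative, not the real QEMU helpers:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Read into a private bounce buffer first, checksum it, and only then
 * scatter the data into the caller's (guest's) iovec. */
static ssize_t preadv_via_bounce(int fd, const struct iovec *iov,
                                 int nr_iov, off_t offset)
{
    size_t total = 0;
    int i;
    for (i = 0; i < nr_iov; i++) {
        total += iov[i].iov_len;
    }

    unsigned char *bounce = malloc(total);
    if (!bounce) {
        return -1;
    }

    struct iovec one = { bounce, total };
    ssize_t res = preadv(fd, &one, 1, offset);

    if (res > 0) {
        uint32_t sum = 0;
        ssize_t j;
        for (j = 0; j < res; j++) {
            sum += bounce[j];
        }
        fprintf(stderr, "bounce preadv fd=%d offset=%jx size=%zd sum=%x\n",
                fd, (uintmax_t)offset, res, sum);

        /* Copy to the destination only after the checksum was taken. */
        size_t done = 0;
        for (i = 0; i < nr_iov && done < (size_t)res; i++) {
            size_t chunk = iov[i].iov_len;
            if (chunk > (size_t)res - done) {
                chunk = (size_t)res - done;
            }
            memcpy(iov[i].iov_base, bounce + done, chunk);
            done += chunk;
        }
    }

    free(bounce);
    return res;
}

If the checksum logged from the bounce buffer stays stable across replay runs while the one computed over the guest iovec does not, then something is overwriting the guest buffer while the read is in flight.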


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Race condition in overlayed qcow2?
  2020-02-21 13:23               ` Vladimir Sementsov-Ogievskiy
@ 2020-02-25  5:58                 ` dovgaluk
  2020-02-25  7:27                   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 19+ messages in thread
From: dovgaluk @ 2020-02-25  5:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: kwolf, qemu-devel, mreitz

Vladimir Sementsov-Ogievskiy wrote on 2020-02-21 16:23:
> 21.02.2020 15:35, dovgaluk wrote:
>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
>>> 21.02.2020 12:49, dovgaluk wrote:
>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
>>>>>>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>>>>>> 
>>>>>>> 
>>>>>>> That is strange. I could think, that it was caused by the bugs in
>>>>>>> deterministic CPU execution, but the first difference in logs
>>>>>>> occur in READ operation (I dump read/write buffers in 
>>>>>>> blk_aio_complete).
>>>>>>> 
>>>>>> 
>>>>>> Aha, yes, looks strange.
>>>>>> 
>>>>>> Then next steps:
>>>>>> 
>>>>>> 1. Does problem hit into the same offset every time?
>>>>>> 2. Do we write to this region before this strange read?
>>>>>> 
>>>>>> 2.1. If yes, we need to check that we read what we write.. You say 
>>>>>> you dump buffers
>>>>>> in blk_aio_complete... I think it would be more reliable to dump 
>>>>>> at start of
>>>>>> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may 
>>>>>> modify its buffers
>>>>>> during operation which would be strange but possible.
>>>>>> 
>>>>>> 2.2 If not, hmm...
>>>>>> 
>>>>>> 
>>>>> 
>>>>> Another idea to check: use blkverify
>>>> 
>>>> I added logging of file descriptor and discovered that different 
>>>> results are obtained
>>>> when reading from the backing file.
>>>> And even more - replay runs of the same recording produce different 
>>>> results.
>>>> Logs show that there is a preadv race, but I can't figure out the 
>>>> source of the failure.
>>>> 
>>>> Log1:
>>>> preadv c 30467e00
>>>> preadv c 30960000
>>>> --- sum = a2e1e
>>>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 
>>>> 8200
>>>> --- sum = 10cdee
>>>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: 
>>>> ee00
>>>> 
>>>> Log2:
>>>> preadv c 30467e00
>>>> --- sum = a2e1e
>>>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 
>>>> 8200
>>>> preadv c 30960000
>>>> --- sum = f094f
>>>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: 
>>>> ee00
>>>> 
>>>> 
>>>> Checksum calculation was added to preadv in file-posix.c
>>>> 
>>> 
>>> So, preadv in file-posix.c returns different results for the same
>>> offset, for file which is always opened in RO mode? Sounds impossible
>>> :)
>> 
>> True.
>> Maybe my logging is wrong?
>> 
>> static ssize_t
>> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
>> {
>>      ssize_t res = preadv(fd, iov, nr_iov, offset);
>>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
>>      int i;
>>      uint32_t sum = 0;
>>      int cnt = 0;
>>      for (i = 0 ; i < nr_iov ; ++i) {
>>          int j;
>>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
>>          {
>>              sum += ((uint8_t*)iov[i].iov_base)[j];
>>              ++cnt;
>>          }
>>      }
>>      qemu_log("size: %x sum: %x\n", cnt, sum);
>>      assert(cnt == res);
>>      return res;
>> }
>> 
> 
> Hmm, I don't see any issues here..
> 
> Are you absolutely sure, that all these reads are from backing file,
> which is read-only and never changed (may be by other processes)?

Yes, I made a copy and compared the files with binwalk.

> 2. guest modifies buffers during operation (you can catch it if
> allocate personal buffer for preadv, than calculate checksum, then
> memcpy to guest buffer)

I added the following to the qemu_preadv:

     // do it again
     unsigned char *buf = g_malloc(cnt);
     struct iovec v = {buf, cnt};
     res = preadv(fd, &v, 1, offset);
     assert(cnt == res);
     uint32_t sum2 = 0;
     for (i = 0 ; i < cnt ; ++i)
         sum2 += buf[i];
     g_free(buf);
     qemu_log("--- sum2 = %x\n", sum2);
     assert(sum2 == sum);

These two reads give different results.
But who can modify the buffer while the qcow2 workers are filling it with
data from the disk?



Pavel Dovgalyuk


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Race condition in overlayed qcow2?
  2020-02-25  5:58                 ` dovgaluk
@ 2020-02-25  7:27                   ` Vladimir Sementsov-Ogievskiy
  2020-02-25  7:56                     ` dovgaluk
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-25  7:27 UTC (permalink / raw)
  To: dovgaluk; +Cc: kwolf, qemu-devel, mreitz

25.02.2020 8:58, dovgaluk wrote:
> Vladimir Sementsov-Ogievskiy писал 2020-02-21 16:23:
>> 21.02.2020 15:35, dovgaluk wrote:
>>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
>>>> 21.02.2020 12:49, dovgaluk wrote:
>>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
>>>>>>>> 1 or 2 are ok, and 4 or 8 lead to the failures.
>>>>>>>>
>>>>>>>>
>>>>>>>> That is strange. I could think, that it was caused by the bugs in
>>>>>>>> deterministic CPU execution, but the first difference in logs
>>>>>>>> occur in READ operation (I dump read/write buffers in blk_aio_complete).
>>>>>>>>
>>>>>>>
>>>>>>> Aha, yes, looks strange.
>>>>>>>
>>>>>>> Then next steps:
>>>>>>>
>>>>>>> 1. Does problem hit into the same offset every time?
>>>>>>> 2. Do we write to this region before this strange read?
>>>>>>>
>>>>>>> 2.1. If yes, we need to check that we read what we write.. You say you dump buffers
>>>>>>> in blk_aio_complete... I think it would be more reliable to dump at start of
>>>>>>> bdrv_co_pwritev and at end of bdrv_co_preadv. Also, guest may modify its buffers
>>>>>>> during operation which would be strange but possible.
>>>>>>>
>>>>>>> 2.2 If not, hmm...
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Another idea to check: use blkverify
>>>>>
>>>>> I added logging of file descriptor and discovered that different results are obtained
>>>>> when reading from the backing file.
>>>>> And even more - replay runs of the same recording produce different results.
>>>>> Logs show that there is a preadv race, but I can't figure out the source of the failure.
>>>>>
>>>>> Log1:
>>>>> preadv c 30467e00
>>>>> preadv c 30960000
>>>>> --- sum = a2e1e
>>>>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
>>>>> --- sum = 10cdee
>>>>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00
>>>>>
>>>>> Log2:
>>>>> preadv c 30467e00
>>>>> --- sum = a2e1e
>>>>> bdrv_co_preadv_part complete offset: 30467e00 qiov_offset: 0 len: 8200
>>>>> preadv c 30960000
>>>>> --- sum = f094f
>>>>> bdrv_co_preadv_part complete offset: 30960000 qiov_offset: 8200 len: ee00
>>>>>
>>>>>
>>>>> Checksum calculation was added to preadv in file-posix.c
>>>>>
>>>>
>>>> So, preadv in file-posix.c returns different results for the same
>>>> offset, for file which is always opened in RO mode? Sounds impossible
>>>> :)
>>>
>>> True.
>>> Maybe my logging is wrong?
>>>
>>> static ssize_t
>>> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
>>> {
>>>      ssize_t res = preadv(fd, iov, nr_iov, offset);
>>>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
>>>      int i;
>>>      uint32_t sum = 0;
>>>      int cnt = 0;
>>>      for (i = 0 ; i < nr_iov ; ++i) {
>>>          int j;
>>>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
>>>          {
>>>              sum += ((uint8_t*)iov[i].iov_base)[j];
>>>              ++cnt;
>>>          }
>>>      }
>>>      qemu_log("size: %x sum: %x\n", cnt, sum);
>>>      assert(cnt == res);
>>>      return res;
>>> }
>>>
>>
>> Hmm, I don't see any issues here..
>>
>> Are you absolutely sure, that all these reads are from backing file,
>> which is read-only and never changed (may be by other processes)?
> 
> Yes, I made a copy and compared the files with binwalk.
> 
>> 2. guest modifies buffers during operation (you can catch it if
>> allocate personal buffer for preadv, than calculate checksum, then
>> memcpy to guest buffer)
> 
> I added the following to the qemu_preadv:
> 
>      // do it again
>      unsigned char *buf = g_malloc(cnt);
>      struct iovec v = {buf, cnt};
>      res = preadv(fd, &v, 1, offset);
>      assert(cnt == res);
>      uint32_t sum2 = 0;
>      for (i = 0 ; i < cnt ; ++i)
>          sum2 += buf[i];
>      g_free(buf);
>      qemu_log("--- sum2 = %x\n", sum2);
>      assert(sum2 == sum);
> 
> These two reads give different results.
> But who can modify the buffer while qcow2 workers filling it with data from the disk?
> 

As far as I know, it's the guest's buffer, and the guest may modify it during the operation. So, it may be WinXP :)



-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Race condition in overlayed qcow2?
  2020-02-25  7:27                   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-25  7:56                     ` dovgaluk
  2020-02-25  9:19                       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 19+ messages in thread
From: dovgaluk @ 2020-02-25  7:56 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy; +Cc: kwolf, qemu-devel, mreitz

Vladimir Sementsov-Ogievskiy wrote on 2020-02-25 10:27:
> 25.02.2020 8:58, dovgaluk wrote:
>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 16:23:
>>> 21.02.2020 15:35, dovgaluk wrote:
>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
>>>>> 21.02.2020 12:49, dovgaluk wrote:
>>>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
>>>>> 
>>>>> So, preadv in file-posix.c returns different results for the same
>>>>> offset, for file which is always opened in RO mode? Sounds 
>>>>> impossible
>>>>> :)
>>>> 
>>>> True.
>>>> Maybe my logging is wrong?
>>>> 
>>>> static ssize_t
>>>> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t 
>>>> offset)
>>>> {
>>>>      ssize_t res = preadv(fd, iov, nr_iov, offset);
>>>>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
>>>>      int i;
>>>>      uint32_t sum = 0;
>>>>      int cnt = 0;
>>>>      for (i = 0 ; i < nr_iov ; ++i) {
>>>>          int j;
>>>>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
>>>>          {
>>>>              sum += ((uint8_t*)iov[i].iov_base)[j];
>>>>              ++cnt;
>>>>          }
>>>>      }
>>>>      qemu_log("size: %x sum: %x\n", cnt, sum);
>>>>      assert(cnt == res);
>>>>      return res;
>>>> }
>>>> 
>>> 
>>> Hmm, I don't see any issues here..
>>> 
>>> Are you absolutely sure, that all these reads are from backing file,
>>> which is read-only and never changed (may be by other processes)?
>> 
>> Yes, I made a copy and compared the files with binwalk.
>> 
>>> 2. guest modifies buffers during operation (you can catch it if
>>> allocate personal buffer for preadv, than calculate checksum, then
>>> memcpy to guest buffer)
>> 
>> I added the following to the qemu_preadv:
>> 
>>      // do it again
>>      unsigned char *buf = g_malloc(cnt);
>>      struct iovec v = {buf, cnt};
>>      res = preadv(fd, &v, 1, offset);
>>      assert(cnt == res);
>>      uint32_t sum2 = 0;
>>      for (i = 0 ; i < cnt ; ++i)
>>          sum2 += buf[i];
>>      g_free(buf);
>>      qemu_log("--- sum2 = %x\n", sum2);
>>      assert(sum2 == sum);
>> 
>> These two reads give different results.
>> But who can modify the buffer while qcow2 workers filling it with data 
>> from the disk?
>> 
> 
> As far as I know, it's guest's buffer, and guest may modify it during
> the operation. So, it may be winxp :)

True, but normally the guest won't do it.

But I noticed that the DMA operation which causes the problems has the
following set of buffers:
dma read sg size 20000 offset: c000fe00
--- sg: base: 2eb1000 len: 1000
--- sg: base: 3000000 len: 1000
--- sg: base: 2eb2000 len: 3000
--- sg: base: 3000000 len: 1000
--- sg: base: 2eb5000 len: b000
--- sg: base: 3040000 len: 1000
--- sg: base: 2f41000 len: 3000
--- sg: base: 3000000 len: 1000
--- sg: base: 2f44000 len: 4000
--- sg: base: 3000000 len: 1000
--- sg: base: 2f48000 len: 2000
--- sg: base: 3000000 len: 1000
--- sg: base: 3000000 len: 1000
--- sg: base: 3000000 len: 1000


It means that one DMA transaction performs multiple reads into the same
address.
No race is possible when there is only one qcow2 worker.
When there are many of them, they can fill this buffer simultaneously.
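
To make the hazard concrete, here is a minimal standalone sketch (not QEMU code, just the shape of the problem; build with -pthread): two workers scatter different data into the same destination range in parallel, so the final contents, and any checksum over them, depend on scheduling.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Shared destination range, standing in for a guest buffer that several
 * SG entries of one DMA transaction point at. */
static unsigned char shared[4096];

static void *worker(void *arg)
{
    unsigned char pattern = (unsigned char)(uintptr_t)arg;
    size_t i;
    /* Byte-by-byte so that interleaving with the other worker is likely. */
    for (i = 0; i < sizeof(shared); i++) {
        shared[i] = pattern + (unsigned char)i;
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *)(uintptr_t)0x11);
    pthread_create(&b, NULL, worker, (void *)(uintptr_t)0x77);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    uint32_t sum = 0;
    size_t i;
    for (i = 0; i < sizeof(shared); i++) {
        sum += shared[i];
    }
    /* With a single worker the result is whatever was written last; with
     * two of them the buffer may hold a scheduling-dependent mix, so the
     * sum can differ from run to run. */
    printf("sum = %x\n", sum);
    return 0;
}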

Pavel Dovgalyuk


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Race condition in overlayed qcow2?
  2020-02-25  7:56                     ` dovgaluk
@ 2020-02-25  9:19                       ` Vladimir Sementsov-Ogievskiy
  2020-02-25  9:26                         ` Pavel Dovgalyuk
  2020-02-25 10:07                         ` Pavel Dovgalyuk
  0 siblings, 2 replies; 19+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-25  9:19 UTC (permalink / raw)
  To: dovgaluk; +Cc: kwolf, qemu-devel, mreitz

25.02.2020 10:56, dovgaluk wrote:
> Vladimir Sementsov-Ogievskiy писал 2020-02-25 10:27:
>> 25.02.2020 8:58, dovgaluk wrote:
>>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 16:23:
>>>> 21.02.2020 15:35, dovgaluk wrote:
>>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
>>>>>> 21.02.2020 12:49, dovgaluk wrote:
>>>>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
>>>>>>
>>>>>> So, preadv in file-posix.c returns different results for the same
>>>>>> offset, for file which is always opened in RO mode? Sounds impossible
>>>>>> :)
>>>>>
>>>>> True.
>>>>> Maybe my logging is wrong?
>>>>>
>>>>> static ssize_t
>>>>> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
>>>>> {
>>>>>      ssize_t res = preadv(fd, iov, nr_iov, offset);
>>>>>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
>>>>>      int i;
>>>>>      uint32_t sum = 0;
>>>>>      int cnt = 0;
>>>>>      for (i = 0 ; i < nr_iov ; ++i) {
>>>>>          int j;
>>>>>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
>>>>>          {
>>>>>              sum += ((uint8_t*)iov[i].iov_base)[j];
>>>>>              ++cnt;
>>>>>          }
>>>>>      }
>>>>>      qemu_log("size: %x sum: %x\n", cnt, sum);
>>>>>      assert(cnt == res);
>>>>>      return res;
>>>>> }
>>>>>
>>>>
>>>> Hmm, I don't see any issues here..
>>>>
>>>> Are you absolutely sure, that all these reads are from backing file,
>>>> which is read-only and never changed (may be by other processes)?
>>>
>>> Yes, I made a copy and compared the files with binwalk.
>>>
>>>> 2. guest modifies buffers during operation (you can catch it if
>>>> allocate personal buffer for preadv, than calculate checksum, then
>>>> memcpy to guest buffer)
>>>
>>> I added the following to the qemu_preadv:
>>>
>>>      // do it again
>>>      unsigned char *buf = g_malloc(cnt);
>>>      struct iovec v = {buf, cnt};
>>>      res = preadv(fd, &v, 1, offset);
>>>      assert(cnt == res);
>>>      uint32_t sum2 = 0;
>>>      for (i = 0 ; i < cnt ; ++i)
>>>          sum2 += buf[i];
>>>      g_free(buf);
>>>      qemu_log("--- sum2 = %x\n", sum2);
>>>      assert(sum2 == sum);
>>>
>>> These two reads give different results.
>>> But who can modify the buffer while qcow2 workers filling it with data from the disk?
>>>
>>
>> As far as I know, it's guest's buffer, and guest may modify it during
>> the operation. So, it may be winxp :)
> 
> True, but normally the guest won't do it.
> 
> But I noticed that DMA operation which causes the problems has the following set of the buffers:
> dma read sg size 20000 offset: c000fe00
> --- sg: base: 2eb1000 len: 1000
> --- sg: base: 3000000 len: 1000
> --- sg: base: 2eb2000 len: 3000
> --- sg: base: 3000000 len: 1000
> --- sg: base: 2eb5000 len: b000
> --- sg: base: 3040000 len: 1000
> --- sg: base: 2f41000 len: 3000
> --- sg: base: 3000000 len: 1000
> --- sg: base: 2f44000 len: 4000
> --- sg: base: 3000000 len: 1000
> --- sg: base: 2f48000 len: 2000
> --- sg: base: 3000000 len: 1000
> --- sg: base: 3000000 len: 1000
> --- sg: base: 3000000 len: 1000
> 
> 
> It means that one DMA transaction performs multiple reads into the same address.
> And no races is possible, when there is only one qcow2 worker.
> When there are many of them - they can fill this buffer simultaneously.
> 

Hmm, actually if the guest starts parallel reads into the same buffer from different offsets, races are possible anyway, as different requests run in parallel even with one worker, because MAX_WORKERS is a per-request value, not a total... But several workers may increase the probability of races or introduce new ones.

So, actually, several workers of one request can write to the same buffer only if the guest provides a broken iovec which references the same buffer several times (if that is possible at all).



-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Race condition in overlayed qcow2?
  2020-02-25  9:19                       ` Vladimir Sementsov-Ogievskiy
@ 2020-02-25  9:26                         ` Pavel Dovgalyuk
  2020-02-25 10:07                         ` Pavel Dovgalyuk
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-02-25  9:26 UTC (permalink / raw)
  To: kwolf; +Cc: 'Vladimir Sementsov-Ogievskiy', qemu-devel, mreitz

Kevin, what do you think about it?

What is the guest intended to receive when it requests multiple reads into the same buffer in a single DMA transaction?

Should it be the first SG part? The last one?
Or just a random mix of bytes? (But then why is it reading this data at all?)

Pavel Dovgalyuk

> -----Original Message-----
> From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
> Sent: Tuesday, February 25, 2020 12:19 PM
> To: dovgaluk
> Cc: qemu-devel@nongnu.org; mreitz@redhat.com; kwolf@redhat.com
> Subject: Re: Race condition in overlayed qcow2?
> 
> 25.02.2020 10:56, dovgaluk wrote:
> > Vladimir Sementsov-Ogievskiy писал 2020-02-25 10:27:
> >> 25.02.2020 8:58, dovgaluk wrote:
> >>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 16:23:
> >>>> 21.02.2020 15:35, dovgaluk wrote:
> >>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
> >>>>>> 21.02.2020 12:49, dovgaluk wrote:
> >>>>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
> >>>>>>
> >>>>>> So, preadv in file-posix.c returns different results for the same
> >>>>>> offset, for file which is always opened in RO mode? Sounds impossible
> >>>>>> :)
> >>>>>
> >>>>> True.
> >>>>> Maybe my logging is wrong?
> >>>>>
> >>>>> static ssize_t
> >>>>> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
> >>>>> {
> >>>>>      ssize_t res = preadv(fd, iov, nr_iov, offset);
> >>>>>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
> >>>>>      int i;
> >>>>>      uint32_t sum = 0;
> >>>>>      int cnt = 0;
> >>>>>      for (i = 0 ; i < nr_iov ; ++i) {
> >>>>>          int j;
> >>>>>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
> >>>>>          {
> >>>>>              sum += ((uint8_t*)iov[i].iov_base)[j];
> >>>>>              ++cnt;
> >>>>>          }
> >>>>>      }
> >>>>>      qemu_log("size: %x sum: %x\n", cnt, sum);
> >>>>>      assert(cnt == res);
> >>>>>      return res;
> >>>>> }
> >>>>>
> >>>>
> >>>> Hmm, I don't see any issues here..
> >>>>
> >>>> Are you absolutely sure, that all these reads are from backing file,
> >>>> which is read-only and never changed (may be by other processes)?
> >>>
> >>> Yes, I made a copy and compared the files with binwalk.
> >>>
> >>>> 2. guest modifies buffers during operation (you can catch it if
> >>>> allocate personal buffer for preadv, than calculate checksum, then
> >>>> memcpy to guest buffer)
> >>>
> >>> I added the following to the qemu_preadv:
> >>>
> >>>      // do it again
> >>>      unsigned char *buf = g_malloc(cnt);
> >>>      struct iovec v = {buf, cnt};
> >>>      res = preadv(fd, &v, 1, offset);
> >>>      assert(cnt == res);
> >>>      uint32_t sum2 = 0;
> >>>      for (i = 0 ; i < cnt ; ++i)
> >>>          sum2 += buf[i];
> >>>      g_free(buf);
> >>>      qemu_log("--- sum2 = %x\n", sum2);
> >>>      assert(sum2 == sum);
> >>>
> >>> These two reads give different results.
> >>> But who can modify the buffer while qcow2 workers filling it with data from the disk?
> >>>
> >>
> >> As far as I know, it's guest's buffer, and guest may modify it during
> >> the operation. So, it may be winxp :)
> >
> > True, but normally the guest won't do it.
> >
> > But I noticed that DMA operation which causes the problems has the following set of the
> buffers:
> > dma read sg size 20000 offset: c000fe00
> > --- sg: base: 2eb1000 len: 1000
> > --- sg: base: 3000000 len: 1000
> > --- sg: base: 2eb2000 len: 3000
> > --- sg: base: 3000000 len: 1000
> > --- sg: base: 2eb5000 len: b000
> > --- sg: base: 3040000 len: 1000
> > --- sg: base: 2f41000 len: 3000
> > --- sg: base: 3000000 len: 1000
> > --- sg: base: 2f44000 len: 4000
> > --- sg: base: 3000000 len: 1000
> > --- sg: base: 2f48000 len: 2000
> > --- sg: base: 3000000 len: 1000
> > --- sg: base: 3000000 len: 1000
> > --- sg: base: 3000000 len: 1000
> >
> >
> > It means that one DMA transaction performs multiple reads into the same address.
> > And no races is possible, when there is only one qcow2 worker.
> > When there are many of them - they can fill this buffer simultaneously.
> >
> 
> Hmm, actually if guest start parallel reads into same buffer from different offsets, races are
> possible anyway, as different requests run in parallel even with one worker, because
> MAX_WORKERS is per-request value, not total... But several workers may increase probability of
> races or introduce new ones.
> 
> So, actually, several workers of one request can write to the same buffer only if guest
> provides broken iovec, which references the same buffer several times (if it is possible at
> all).
> 
> 
> 
> --
> Best regards,
> Vladimir



^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Race condition in overlayed qcow2?
  2020-02-25  9:19                       ` Vladimir Sementsov-Ogievskiy
  2020-02-25  9:26                         ` Pavel Dovgalyuk
@ 2020-02-25 10:07                         ` Pavel Dovgalyuk
  2020-02-25 11:47                           ` Kevin Wolf
  1 sibling, 1 reply; 19+ messages in thread
From: Pavel Dovgalyuk @ 2020-02-25 10:07 UTC (permalink / raw)
  To: stefanha, kwolf
  Cc: 'Vladimir Sementsov-Ogievskiy', 'Pavel Dovgalyuk',
	qemu-devel, mreitz

CC'ing Stefan due to the same question back in 2010:

https://lists.gnu.org/archive/html/qemu-devel/2010-09/msg01996.html

I also encountered this with a Windows guest.
E.g., there were requests like:

Read 2000 bytes:
addr=A, size=1000
addr=A, size=1000

I.e. only 1000 distinct bytes of memory are filled, but the purpose of such a request is unclear.
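
For illustration, a standalone sketch of such a request against an ordinary file (the file name is just a placeholder, this is not QEMU code): a 0x2000-byte read whose two iovec entries point at the same 0x1000-byte buffer. With one serial preadv the buffer simply ends up holding whichever part was copied in last; once the read is split into parallel sub-requests, even that becomes racy.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

static unsigned char buf[0x1000];

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Two SG entries of one "0x2000-byte read", both aimed at buf. */
    struct iovec iov[2] = {
        { buf, sizeof(buf) },
        { buf, sizeof(buf) },
    };

    ssize_t res = preadv(fd, iov, 2, 0);
    printf("preadv returned %zd\n", res);

    uint32_t sum = 0;
    size_t i;
    for (i = 0; i < sizeof(buf); i++) {
        sum += buf[i];
    }
    printf("sum = %x\n", sum);
    close(fd);
    return 0;
}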

Pavel Dovgalyuk


> -----Original Message-----
> From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> Sent: Tuesday, February 25, 2020 12:27 PM
> To: 'kwolf@redhat.com'
> Cc: 'qemu-devel@nongnu.org'; 'mreitz@redhat.com'; 'Vladimir Sementsov-Ogievskiy'
> Subject: RE: Race condition in overlayed qcow2?
> 
> Kevin, what do you think about it?
> 
> What guest is intended to receive, when it requests multiple reads to the same buffer in a
> single DMA transaction?
> 
> Should it be the first SG part? The last one?
> Or just a random set of bytes? (Then why it is reading this data in that case?)
> 
> Pavel Dovgalyuk
> 
> > -----Original Message-----
> > From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
> > Sent: Tuesday, February 25, 2020 12:19 PM
> > To: dovgaluk
> > Cc: qemu-devel@nongnu.org; mreitz@redhat.com; kwolf@redhat.com
> > Subject: Re: Race condition in overlayed qcow2?
> >
> > 25.02.2020 10:56, dovgaluk wrote:
> > > Vladimir Sementsov-Ogievskiy писал 2020-02-25 10:27:
> > >> 25.02.2020 8:58, dovgaluk wrote:
> > >>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 16:23:
> > >>>> 21.02.2020 15:35, dovgaluk wrote:
> > >>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
> > >>>>>> 21.02.2020 12:49, dovgaluk wrote:
> > >>>>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
> > >>>>>>
> > >>>>>> So, preadv in file-posix.c returns different results for the same
> > >>>>>> offset, for file which is always opened in RO mode? Sounds impossible
> > >>>>>> :)
> > >>>>>
> > >>>>> True.
> > >>>>> Maybe my logging is wrong?
> > >>>>>
> > >>>>> static ssize_t
> > >>>>> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
> > >>>>> {
> > >>>>>      ssize_t res = preadv(fd, iov, nr_iov, offset);
> > >>>>>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
> > >>>>>      int i;
> > >>>>>      uint32_t sum = 0;
> > >>>>>      int cnt = 0;
> > >>>>>      for (i = 0 ; i < nr_iov ; ++i) {
> > >>>>>          int j;
> > >>>>>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
> > >>>>>          {
> > >>>>>              sum += ((uint8_t*)iov[i].iov_base)[j];
> > >>>>>              ++cnt;
> > >>>>>          }
> > >>>>>      }
> > >>>>>      qemu_log("size: %x sum: %x\n", cnt, sum);
> > >>>>>      assert(cnt == res);
> > >>>>>      return res;
> > >>>>> }
> > >>>>>
> > >>>>
> > >>>> Hmm, I don't see any issues here..
> > >>>>
> > >>>> Are you absolutely sure, that all these reads are from backing file,
> > >>>> which is read-only and never changed (may be by other processes)?
> > >>>
> > >>> Yes, I made a copy and compared the files with binwalk.
> > >>>
> > >>>> 2. guest modifies buffers during operation (you can catch it if
> > >>>> allocate personal buffer for preadv, than calculate checksum, then
> > >>>> memcpy to guest buffer)
> > >>>
> > >>> I added the following to the qemu_preadv:
> > >>>
> > >>>      // do it again
> > >>>      unsigned char *buf = g_malloc(cnt);
> > >>>      struct iovec v = {buf, cnt};
> > >>>      res = preadv(fd, &v, 1, offset);
> > >>>      assert(cnt == res);
> > >>>      uint32_t sum2 = 0;
> > >>>      for (i = 0 ; i < cnt ; ++i)
> > >>>          sum2 += buf[i];
> > >>>      g_free(buf);
> > >>>      qemu_log("--- sum2 = %x\n", sum2);
> > >>>      assert(sum2 == sum);
> > >>>
> > >>> These two reads give different results.
> > >>> But who can modify the buffer while qcow2 workers filling it with data from the disk?
> > >>>
> > >>
> > >> As far as I know, it's guest's buffer, and guest may modify it during
> > >> the operation. So, it may be winxp :)
> > >
> > > True, but normally the guest won't do it.
> > >
> > > But I noticed that DMA operation which causes the problems has the following set of the
> > buffers:
> > > dma read sg size 20000 offset: c000fe00
> > > --- sg: base: 2eb1000 len: 1000
> > > --- sg: base: 3000000 len: 1000
> > > --- sg: base: 2eb2000 len: 3000
> > > --- sg: base: 3000000 len: 1000
> > > --- sg: base: 2eb5000 len: b000
> > > --- sg: base: 3040000 len: 1000
> > > --- sg: base: 2f41000 len: 3000
> > > --- sg: base: 3000000 len: 1000
> > > --- sg: base: 2f44000 len: 4000
> > > --- sg: base: 3000000 len: 1000
> > > --- sg: base: 2f48000 len: 2000
> > > --- sg: base: 3000000 len: 1000
> > > --- sg: base: 3000000 len: 1000
> > > --- sg: base: 3000000 len: 1000
> > >
> > >
> > > It means that one DMA transaction performs multiple reads into the same address.
> > > And no races is possible, when there is only one qcow2 worker.
> > > When there are many of them - they can fill this buffer simultaneously.
> > >
> >
> > Hmm, actually if guest start parallel reads into same buffer from different offsets, races
> are
> > possible anyway, as different requests run in parallel even with one worker, because
> > MAX_WORKERS is per-request value, not total... But several workers may increase probability
> of
> > races or introduce new ones.
> >
> > So, actually, several workers of one request can write to the same buffer only if guest
> > provides broken iovec, which references the same buffer several times (if it is possible at
> > all).
> >
> >
> >
> > --
> > Best regards,
> > Vladimir



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Race condition in overlayed qcow2?
  2020-02-25 10:07                         ` Pavel Dovgalyuk
@ 2020-02-25 11:47                           ` Kevin Wolf
  0 siblings, 0 replies; 19+ messages in thread
From: Kevin Wolf @ 2020-02-25 11:47 UTC (permalink / raw)
  To: Pavel Dovgalyuk
  Cc: 'Vladimir Sementsov-Ogievskiy', qemu-devel, stefanha, mreitz

On 25.02.2020 at 11:07, Pavel Dovgalyuk wrote:
> CC'ing Stefan due to the same question back in 2010:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2010-09/msg01996.html
> 
> I also encountered this with Windows guest.
> E.g., there were the requests like:
> 
> Read 2000 bytes:
> addr=A, size=1000
> addr=A, size=1000
> 
> I.e. reading 1000 bytes in real, but the purpose of such request is unclear.

I think the conclusion back then was that the result is undefined (i.e.
you can get any mix between both parts). I'm not sure if we ever found
out why Windows is even issuing such requests. Maybe Stefan knows.

Kevin

> > -----Original Message-----
> > From: Pavel Dovgalyuk [mailto:dovgaluk@ispras.ru]
> > Sent: Tuesday, February 25, 2020 12:27 PM
> > To: 'kwolf@redhat.com'
> > Cc: 'qemu-devel@nongnu.org'; 'mreitz@redhat.com'; 'Vladimir Sementsov-Ogievskiy'
> > Subject: RE: Race condition in overlayed qcow2?
> > 
> > Kevin, what do you think about it?
> > 
> > What guest is intended to receive, when it requests multiple reads to the same buffer in a
> > single DMA transaction?
> > 
> > Should it be the first SG part? The last one?
> > Or just a random set of bytes? (Then why it is reading this data in that case?)
> > 
> > Pavel Dovgalyuk
> > 
> > > -----Original Message-----
> > > From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
> > > Sent: Tuesday, February 25, 2020 12:19 PM
> > > To: dovgaluk
> > > Cc: qemu-devel@nongnu.org; mreitz@redhat.com; kwolf@redhat.com
> > > Subject: Re: Race condition in overlayed qcow2?
> > >
> > > 25.02.2020 10:56, dovgaluk wrote:
> > > > Vladimir Sementsov-Ogievskiy писал 2020-02-25 10:27:
> > > >> 25.02.2020 8:58, dovgaluk wrote:
> > > >>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 16:23:
> > > >>>> 21.02.2020 15:35, dovgaluk wrote:
> > > >>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-21 13:09:
> > > >>>>>> 21.02.2020 12:49, dovgaluk wrote:
> > > >>>>>>> Vladimir Sementsov-Ogievskiy писал 2020-02-20 12:36:
> > > >>>>>>
> > > >>>>>> So, preadv in file-posix.c returns different results for the same
> > > >>>>>> offset, for file which is always opened in RO mode? Sounds impossible
> > > >>>>>> :)
> > > >>>>>
> > > >>>>> True.
> > > >>>>> Maybe my logging is wrong?
> > > >>>>>
> > > >>>>> static ssize_t
> > > >>>>> qemu_preadv(int fd, const struct iovec *iov, int nr_iov, off_t offset)
> > > >>>>> {
> > > >>>>>      ssize_t res = preadv(fd, iov, nr_iov, offset);
> > > >>>>>      qemu_log("preadv %x %"PRIx64"\n", fd, (uint64_t)offset);
> > > >>>>>      int i;
> > > >>>>>      uint32_t sum = 0;
> > > >>>>>      int cnt = 0;
> > > >>>>>      for (i = 0 ; i < nr_iov ; ++i) {
> > > >>>>>          int j;
> > > >>>>>          for (j = 0 ; j < (int)iov[i].iov_len ; ++j)
> > > >>>>>          {
> > > >>>>>              sum += ((uint8_t*)iov[i].iov_base)[j];
> > > >>>>>              ++cnt;
> > > >>>>>          }
> > > >>>>>      }
> > > >>>>>      qemu_log("size: %x sum: %x\n", cnt, sum);
> > > >>>>>      assert(cnt == res);
> > > >>>>>      return res;
> > > >>>>> }
> > > >>>>>
> > > >>>>
> > > >>>> Hmm, I don't see any issues here..
> > > >>>>
> > > >>>> Are you absolutely sure, that all these reads are from backing file,
> > > >>>> which is read-only and never changed (may be by other processes)?
> > > >>>
> > > >>> Yes, I made a copy and compared the files with binwalk.
> > > >>>
> > > >>>> 2. guest modifies buffers during operation (you can catch it if
> > > >>>> allocate personal buffer for preadv, than calculate checksum, then
> > > >>>> memcpy to guest buffer)
> > > >>>
> > > >>> I added the following to the qemu_preadv:
> > > >>>
> > > >>>      // do it again
> > > >>>      unsigned char *buf = g_malloc(cnt);
> > > >>>      struct iovec v = {buf, cnt};
> > > >>>      res = preadv(fd, &v, 1, offset);
> > > >>>      assert(cnt == res);
> > > >>>      uint32_t sum2 = 0;
> > > >>>      for (i = 0 ; i < cnt ; ++i)
> > > >>>          sum2 += buf[i];
> > > >>>      g_free(buf);
> > > >>>      qemu_log("--- sum2 = %x\n", sum2);
> > > >>>      assert(sum2 == sum);
> > > >>>
> > > >>> These two reads give different results.
> > > >>> But who can modify the buffer while qcow2 workers filling it with data from the disk?
> > > >>>
> > > >>
> > > >> As far as I know, it's guest's buffer, and guest may modify it during
> > > >> the operation. So, it may be winxp :)
> > > >
> > > > True, but normally the guest won't do it.
> > > >
> > > > But I noticed that DMA operation which causes the problems has the following set of the
> > > buffers:
> > > > dma read sg size 20000 offset: c000fe00
> > > > --- sg: base: 2eb1000 len: 1000
> > > > --- sg: base: 3000000 len: 1000
> > > > --- sg: base: 2eb2000 len: 3000
> > > > --- sg: base: 3000000 len: 1000
> > > > --- sg: base: 2eb5000 len: b000
> > > > --- sg: base: 3040000 len: 1000
> > > > --- sg: base: 2f41000 len: 3000
> > > > --- sg: base: 3000000 len: 1000
> > > > --- sg: base: 2f44000 len: 4000
> > > > --- sg: base: 3000000 len: 1000
> > > > --- sg: base: 2f48000 len: 2000
> > > > --- sg: base: 3000000 len: 1000
> > > > --- sg: base: 3000000 len: 1000
> > > > --- sg: base: 3000000 len: 1000
> > > >
> > > >
> > > > It means that one DMA transaction performs multiple reads into the same address.
> > > > And no races is possible, when there is only one qcow2 worker.
> > > > When there are many of them - they can fill this buffer simultaneously.
> > > >
> > >
> > > Hmm, actually if guest start parallel reads into same buffer from different offsets, races
> > are
> > > possible anyway, as different requests run in parallel even with one worker, because
> > > MAX_WORKERS is per-request value, not total... But several workers may increase probability
> > of
> > > races or introduce new ones.
> > >
> > > So, actually, several workers of one request can write to the same buffer only if guest
> > > provides broken iovec, which references the same buffer several times (if it is possible at
> > > all).
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Vladimir
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-02-25 11:48 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-19 14:32 Race condition in overlayed qcow2? dovgaluk
2020-02-19 16:07 ` Vladimir Sementsov-Ogievskiy
2020-02-20  8:31   ` dovgaluk
2020-02-20  9:05     ` Vladimir Sementsov-Ogievskiy
2020-02-20  9:36       ` Vladimir Sementsov-Ogievskiy
2020-02-21  9:49         ` dovgaluk
2020-02-21 10:09           ` Vladimir Sementsov-Ogievskiy
2020-02-21 12:35             ` dovgaluk
2020-02-21 13:23               ` Vladimir Sementsov-Ogievskiy
2020-02-25  5:58                 ` dovgaluk
2020-02-25  7:27                   ` Vladimir Sementsov-Ogievskiy
2020-02-25  7:56                     ` dovgaluk
2020-02-25  9:19                       ` Vladimir Sementsov-Ogievskiy
2020-02-25  9:26                         ` Pavel Dovgalyuk
2020-02-25 10:07                         ` Pavel Dovgalyuk
2020-02-25 11:47                           ` Kevin Wolf
2020-02-20 10:00       ` Pavel Dovgalyuk
2020-02-20 11:26         ` Vladimir Sementsov-Ogievskiy
2020-02-20 11:48           ` Pavel Dovgalyuk
