* [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
@ 2018-05-15 21:25 Daniel Henrique Barboza
  2018-05-16  7:47 ` Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Daniel Henrique Barboza @ 2018-05-15 21:25 UTC (permalink / raw)
  To: qemu-devel, qemu-block; +Cc: Fam Zheng, Paolo Bonzini, Kevin Wolf

Hi,

I've been working for the last two months on a miscompare issue that
happens when using a RAID device and a SATA drive as scsi-hd (emulated
SCSI) with cache=none and io=threads during a hardware stress test. I'll
summarize it here as best as I can without creating a great wall of text
- Red Hat folks can check [1] for all the details.

Using the following setup:

- Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le, 
qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)

- Guest is RHEL 7.5-alt using the same kernel as the host, with two
storage disks (a 1.8 TB RAID device and a 446 GB SATA drive) defined as
follows:

     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
       <target dev='sdc' bus='scsi'/>
       <alias name='scsi0-0-0-2'/>
       <address type='drive' controller='0' bus='0' target='0' unit='2'/>
     </disk>

Both block devices have WCE off in the host.
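
(For reference, this is how the cache mode the kernel reports for the
device can be cross-checked; a minimal C sketch, with the device name
being a placeholder:)

/* Sketch: read the cache mode the kernel reports for a block device.
 * "write through" means the device is treated as having no volatile
 * write cache (WCE off); "write back" means a volatile cache is
 * assumed.  The device name is a placeholder. */
#include <stdio.h>

int main(void)
{
    char line[64];
    FILE *f = fopen("/sys/block/sdc/queue/write_cache", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(line, sizeof(line), f)) {
        printf("cache mode: %s", line);
    }
    fclose(f);
    return 0;
}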

With this environment, we found problems when running a stress test
called HTX [2]. At some point (usually after 24+ hours of testing) HTX
finds a data miscompare on one of the devices. This is an example:

-------

Device name: /dev/sdb
Total blocks: 0x74706daf, Block size: 0x200
Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
Number of Rulefile passes (cycle) completed: 0
Stanza running: rule_6, Thread no.: 8
Oper performed: wrc, Current seek type: SEQ
LBA no. where IO started: 0x94fa
Transfer size: 0x8400

Miscompare Summary:
===================
LBA no. where miscomapre started:     0x94fa
LBA no. where miscomapre ended:       0x94ff
Miscompare start offset (in bytes):   0x8
Miscomapre end offset (in bytes):     0xbff
Miscompare size (in bytes):           0xbf8

Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
Actual data (at miscomapre offset): 889aea5a736462000000000000007275

-----


This means that the test executed a write at LBA 0x94fa and, after
confirming that the write was completed, issued 2 reads on the same LBA
to verify the written contents, and found a mismatch.


I've tested all sorts of configurations of disk vs. LUN, cache modes
and AIO. My findings are:

- using device='lun' instead of device='disk', I can't reproduce the
issue, no matter what the other configuration options are;
- using device='disk' but with cache='writethrough', the issue doesn't
happen (I haven't checked other cache modes);
- using device='disk', cache='none' and io='native', the issue doesn't
happen.


The issue seems to be tied to the combination device=disk + cache=none
+ io=threads. I've started digging into the SCSI layer all the way down
to the block backend. With a shameful amount of logging I've discovered
that, for the write in which the test finds a miscompare, in
block/file-posix.c:

- when doing the write, handle_aiocb_rw_vector() returns success and
pwritev() reports that all bytes were written;
- in both reads after the write, handle_aiocb_rw_vector() returns
success and all bytes are read by preadv(). In both reads, the data
read is different from the data written by the pwritev() that happened
before.

In the discussions at [1], Fam Zheng suggested a test in which we would
reduce the number of threads created in the POSIX thread pool from 64
to 1. The idea is to ensure that we're using the same thread to write
and read. There was a suspicion that the kernel can't guarantee data
coherency between different threads, even when using the same fd, when
using pwritev() and preadv(). This would explain why the subsequent
reads on the same fd fail to retrieve the same data that was written
before. After making this modification, the miscompare didn't reproduce.
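
To make the suspicion concrete, the pattern being probed boils down to
the sketch below (standalone C, not the QEMU thread pool code; the
device path, offset and transfer size are placeholders). The writer
thread is joined before the reader thread starts, so ordering is not in
question - only whether a preadv() issued from a different thread is
guaranteed to observe the completed pwritev() on the same O_DIRECT fd:

/* Cross-thread O_DIRECT sketch: pwritev() on one thread, preadv() of the
 * same range on another thread, same fd.  pthread_join() guarantees the
 * write call has returned before the read is issued.  Placeholder device
 * path, offset and size; build with -pthread. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

static int fd;
static void *wbuf, *rbuf;
static const size_t len = 4096;
static const off_t off = 0x94fa * 512L;

static void *writer(void *arg)
{
    struct iovec v = { .iov_base = wbuf, .iov_len = len };
    (void)arg;
    if (pwritev(fd, &v, 1, off) != (ssize_t)len)
        perror("pwritev");
    return NULL;
}

static void *reader(void *arg)
{
    struct iovec v = { .iov_base = rbuf, .iov_len = len };
    (void)arg;
    if (preadv(fd, &v, 1, off) != (ssize_t)len)
        perror("preadv");
    else if (memcmp(wbuf, rbuf, len) != 0)
        fprintf(stderr, "miscompare\n");
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    fd = open("/dev/sdX", O_RDWR | O_DIRECT);  /* destructive: test disk only */
    if (fd < 0) { perror("open"); return 1; }
    if (posix_memalign(&wbuf, 4096, len) || posix_memalign(&rbuf, 4096, len))
        return 1;
    memset(wbuf, 0xaa, len);

    pthread_create(&a, NULL, writer, NULL);
    pthread_join(a, NULL);                     /* write has completed here... */
    pthread_create(&b, NULL, reader, NULL);    /* ...before the read is issued */
    pthread_join(b, NULL);

    close(fd);
    return 0;
}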

After reverting the thread pool change, I made a couple of attempts at
flushing before the read() and flushing after the write(). Both attempts
failed - the miscompare appears in both scenarios. This reinforces the
suspicion above - if data coherency can't be guaranteed between
different threads, flushing from different threads wouldn't make a
difference either. I've also tested a suggestion from Fam where I
started the disks with "cache.direct=on,cache.no-flush=off" - the bug
still reproduces.


This is the current status of this investigation. I decided to start a
discussion here, to see if someone can point out something that I
overlooked or got wrong, before I start changing the POSIX thread pool
behavior to see if I can force one specific POSIX thread to do the
read() when a write() was done on the same fd. Any suggestions?



ps: it is worth mentioning that I was able to reproduce this same bug on
a POWER8 system running Ubuntu 18.04. Given that the code we're dealing
with doesn't have any arch-specific behavior, I wouldn't be surprised if
this bug is also reproducible on other architectures such as x86.


Thanks,

Daniel

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
[2] https://github.com/open-power/HTX


* Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-15 21:25 [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads Daniel Henrique Barboza
@ 2018-05-16  7:47 ` Paolo Bonzini
  2018-05-16 21:35   ` Daniel Henrique Barboza
  2018-05-16  9:47 ` Dr. David Alan Gilbert
  2018-05-24 14:04 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  2 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2018-05-16  7:47 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, qemu-block; +Cc: Fam Zheng, Kevin Wolf

On 15/05/2018 23:25, Daniel Henrique Barboza wrote:
> This is the current status of this investigation. I decided to start a
> discussion here, see if someone can point me something that I overlooked
> or got it wrong, before I started changing the POSIX thread pool
> behavior to see if I can enforce one specific POSIX thread to do a
> read() if we had a write() done in the same fd. Any suggestions?

Copying from the bug:

> Unless we learn something new, my understanding is that we're dealing
> with a host side limitation/bug when calling pwritev() in a different
> thread than a following preadv(), using the same file descriptor
> opened with O_DIRECT and no WCE in the host side, the kernel can't
> grant data coherency, e.g:
> 
> - thread A executes a pwritev() writing dataA in the disk
> 
> - thread B executes a preadv() call to read the data, but this
> preadv() call isn't aware of the previous pwritev() call done in
> thread A, thus the guarantee of the preadv() call reading dataA isn't
> assured (as opposed to what is described in man 3 write)
> 
> - the physical disk, due to the heavy load of the stress test, didn't
> finish writing up dataA. Since the disk itself doesn't have any
> internal cache to rely on, the preadv() call goes in and read an old
> data that differs from dataA.

There is a problem in the reasoning of the third point: if the physical
disk hasn't yet finished writing up dataA, pwritev() shouldn't have
returned.  This could be a bug in the kernel, or even in the disk.  I
suspect the kernel because SCSI passthrough doesn't show the bug; SCSI
passthrough uses ioctl() which completes exactly when the disk tells
QEMU that the command is done---it cannot report completion too early.
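
For comparison, a passthrough request looks roughly like the sketch
below (a minimal SG_IO READ(10) in C; the device node, LBA and transfer
length are placeholders, and this is not the actual QEMU code path).
The ioctl() returns only after the device has reported the command
complete, so completion cannot be signalled early:

/* Sketch: a SCSI READ(10) issued through SG_IO.  The ioctl blocks until
 * the device itself reports command completion.  Placeholders only. */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    unsigned char cdb[10] = { 0x28 };            /* READ(10) */
    unsigned char data[512], sense[32];
    unsigned int lba = 0x94fa;
    struct sg_io_hdr hdr;

    int fd = open("/dev/sdX", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    cdb[2] = (lba >> 24) & 0xff;                 /* LBA, big-endian */
    cdb[3] = (lba >> 16) & 0xff;
    cdb[4] = (lba >> 8) & 0xff;
    cdb[5] = lba & 0xff;
    cdb[8] = 1;                                  /* transfer length: 1 block */

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.cmdp = cdb;
    hdr.cmd_len = sizeof(cdb);
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.dxferp = data;
    hdr.dxfer_len = sizeof(data);
    hdr.sbp = sense;
    hdr.mx_sb_len = sizeof(sense);
    hdr.timeout = 30000;                         /* milliseconds */

    if (ioctl(fd, SG_IO, &hdr) < 0) { perror("SG_IO"); return 1; }
    printf("scsi status 0x%x, host 0x%x, driver 0x%x\n",
           hdr.status, hdr.host_status, hdr.driver_status);

    close(fd);
    return 0;
}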

(Another small problem in the third point is that the disk actually does
have a cache.  But the cache should be transparent; if it weren't, the
bug would be in the disk firmware.)

It has to be debugged and fixed in the kernel.  The thread pool is
just... a thread pool, and shouldn't be working around bugs, especially
ones as serious as this.

A more likely possibility: maybe the disk has 4K sectors and QEMU is
doing read-modify-write cycles to emulate 512 byte sectors?  In this
case, mismatches are not expected, since QEMU serializes RMW cycles, but
at least we would know that the bug would be in QEMU, and where.
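
As a quick check of that hypothesis against the report above, here is
the arithmetic, assuming hypothetical 4K physical sectors (the C below
is only an illustration of the alignment test):

/* Sketch: would the miscompared request be misaligned on a hypothetical
 * 4K-sector disk?  LBA and transfer size are taken from the HTX report. */
#include <stdio.h>

int main(void)
{
    const unsigned long long lba   = 0x94fa;     /* 512-byte LBA */
    const unsigned long long bytes = lba * 512;  /* byte offset on disk */
    const unsigned long long xfer  = 0x8400;     /* transfer size in bytes */
    const unsigned long long phys  = 4096;       /* hypothetical sector size */

    printf("offset %% 4K = %llu, length %% 4K = %llu -> %s\n",
           bytes % phys, xfer % phys,
           (bytes % phys || xfer % phys) ? "unaligned, RMW needed"
                                         : "aligned, no RMW");
    return 0;
}

With those values, both the start offset and the length land 0x400 bytes
off a 4K boundary, so under that assumption the request would indeed
have to go through the RMW path.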

Paolo


* Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-15 21:25 [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads Daniel Henrique Barboza
  2018-05-16  7:47 ` Paolo Bonzini
@ 2018-05-16  9:47 ` Dr. David Alan Gilbert
  2018-05-16 21:40   ` Daniel Henrique Barboza
  2018-05-24 14:04 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  2 siblings, 1 reply; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2018-05-16  9:47 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-block, Kevin Wolf, Paolo Bonzini, Fam Zheng

* Daniel Henrique Barboza (danielhb@linux.ibm.com) wrote:
> Hi,
> 
> I've been working in the last two months in a miscompare issue that happens
> when using a raid device and a SATA as scsi-hd (emulated SCSI) with
> cache=none and io=threads during a hardware stress test. I'll summarize it
> here as best as I can without creating a great wall of text - Red Hat folks
> can check [1] for all the details.
> 
> Using the following setup:
> 
> - Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
> qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)
> 
> - Guest is RHEL 7.5-alt using the same kernel as the host, using two storage
> disks (a 1.8 Tb raid and a 446Gb SATA drive) as follows:
> 
>     <disk type='block' device='disk'>
>       <driver name='qemu' type='raw' cache='none'/>
>       <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
>       <target dev='sdc' bus='scsi'/>
>       <alias name='scsi0-0-0-2'/>
>       <address type='drive' controller='0' bus='0' target='0' unit='2'/>
>     </disk>
> 
> Both block devices have WCE off in the host.
> 
> With this env, we found problems when running a stress test called HTX [2].
> At a given time (usually after 24+ hours of test) HTX finds a data
> miscompare in one of the devices. This is an example:
> 
> -------
> 
> Device name: /dev/sdb
> Total blocks: 0x74706daf, Block size: 0x200
> Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
> Number of Rulefile passes (cycle) completed: 0
> Stanza running: rule_6, Thread no.: 8
> Oper performed: wrc, Current seek type: SEQ
> LBA no. where IO started: 0x94fa
> Transfer size: 0x8400
> 
> Miscompare Summary:
> ===================
> LBA no. where miscomapre started:     0x94fa
> LBA no. where miscomapre ended:       0x94ff
> Miscompare start offset (in bytes):   0x8
> Miscomapre end offset (in bytes):     0xbff
> Miscompare size (in bytes):           0xbf8
> 
> Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
> Actual data (at miscomapre offset): 889aea5a736462000000000000007275

Are all the miscompares single-bit errors like that one?
Is the test doing single-bit manipulation, or is that coming out of the
blue?

Dave

> -----
> 
> 
> This means that the test executed a write at  LBA 0x94fa and, after
> confirming that the write was completed, issue 2 reads in the same LBA to
> assert the written contents and found out a mismatch.
> 
> 
> I've tested all sort of configurations between disk vs LUN, cache modes and
> AIO. My findings are:
> 
> - using device='lun' instead of device='disk', I can't reproduce the issue
> doesn't matter what other configurations are;
> - using device='disk' but with cache='writethrough', issue doesn't happen
> (haven't checked other cache modes);
> - using device='disk', cache='none' and io='native', issue doesn't happen.
> 
> 
> The issue seems to be tied with the combination device=disk + cache=none +
> io=threads. I've started digging into the SCSI layer all the way down to the
> block backend. With a shameful amount of logs I've discovered that, in the
> write that the test finds a miscompare, in block/file-posix.c:
> 
> - when doing the write, handle_aiocb_rw_vector() returns success, pwritev()
> reports that all bytes were written
> - in both reads after the write, handle_aiocb_rw_vector returns success, all
> bytes read by preadv(). In both reads, the data read is different from the
> data written by  the pwritev() that happened before
> 
> In the discussions at [1], Fam Zheng suggested a test in which we would take
> down the number of threads created in the POSIX thread pool from 64 to 1.
> The idea is to ensure that we're using the same thread to write and read.
> There was a suspicion that the kernel can't guarantee data coherency between
> different threads, even if using the same fd, when using pwritev() and
> preadv(). This would explain why the following reads in the same fd would
> fail to retrieve the same data that was written before. After doing this
> modification, the miscompare didn't reproduce.
> 
> After reverting the thread pool number change, I've made a couple of
> attempts trying to flush before read() and flushing after write(). Both
> attempts failed - the miscompare appears in both scenarios. This enforces
> the suspicion we have above - if data coherency can't be granted between
> different threads, flushing in different threads wouldn't make a difference
> too. I've also tested a suggestion from Fam where I started the disks with
> "cache.direct=on,cache.no-flush=off" - bug still reproduces.
> 
> 
> This is the current status of this investigation. I decided to start a
> discussion here, see if someone can point me something that I overlooked or
> got it wrong, before I started changing the POSIX thread pool behavior to
> see if I can enforce one specific POSIX thread to do a read() if we had a
> write() done in the same fd. Any suggestions?
> 
> 
> 
> ps: it is worth mentioning that I was able to reproduce this same bug in a
> POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
> doesn't have any arch-specific behavior I wouldn't be surprised if this bug
> is also reproducible in other archs like x86.
> 
> 
> Thanks,
> 
> Daniel
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
> [2] https://github.com/open-power/HTX
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-16  7:47 ` Paolo Bonzini
@ 2018-05-16 21:35   ` Daniel Henrique Barboza
  2018-05-16 22:12     ` Daniel Henrique Barboza
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Henrique Barboza @ 2018-05-16 21:35 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: Kevin Wolf, Fam Zheng



On 05/16/2018 04:47 AM, Paolo Bonzini wrote:
> On 15/05/2018 23:25, Daniel Henrique Barboza wrote:
>> This is the current status of this investigation. I decided to start a
>> discussion here, see if someone can point me something that I overlooked
>> or got it wrong, before I started changing the POSIX thread pool
>> behavior to see if I can enforce one specific POSIX thread to do a
>> read() if we had a write() done in the same fd. Any suggestions?
> Copying from the bug:
>
>> Unless we learn something new, my understanding is that we're dealing
>> with a host side limitation/bug when calling pwritev() in a different
>> thread than a following preadv(), using the same file descriptor
>> opened with O_DIRECT and no WCE in the host side, the kernel can't
>> grant data coherency, e.g:
>>
>> - thread A executes a pwritev() writing dataA in the disk
>>
>> - thread B executes a preadv() call to read the data, but this
>> preadv() call isn't aware of the previous pwritev() call done in
>> thread A, thus the guarantee of the preadv() call reading dataA isn't
>> assured (as opposed to what is described in man 3 write)
>>
>> - the physical disk, due to the heavy load of the stress test, didn't
>> finish writing up dataA. Since the disk itself doesn't have any
>> internal cache to rely on, the preadv() call goes in and read an old
>> data that differs from dataA.
> There is a problem in the reasoning of the third point: if the physical
> disk hasn't yet finished writing up dataA, pwritev() shouldn't have
> returned.  This could be a bug in the kernel, or even in the disk.  I
> suspect the kernel because SCSI passthrough doesn't show the bug; SCSI
> passthrough uses ioctl() which completes exactly when the disk tells
> QEMU that the command is done---it cannot report completion too early.
>
> (Another small problem in the third point is that the disk actually does
> have a cache.  But the cache should be transparent, if it weren't the
> bug would be in the disk firmware).
>
> It has to be debugged and fixed in the kernel.  The thread pool is
> just... a thread pool, and shouldn't be working around bugs, especially
> as serious as these.

Fixing this in the thread pool would only make sense if we were sure
that the kernel is working as intended. I think the next step is to
look at it at the kernel level and see what is not working there.

>
> A more likely possibility: maybe the disk has 4K sectors and QEMU is
> doing read-modify-write cycles to emulate 512 byte sectors?  In this
> case, mismatches are not expected, since QEMU serializes RMW cycles, but
> at least we would know that the bug would be in QEMU, and where.

I hadn't considered this possibility. I'll check whether the disk has
4K sectors and whether QEMU is emulating 512-byte sectors.
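
(The quickest way I know to check that from userspace is something like
the sketch below, using the standard block-device ioctls; the device
path is a placeholder:)

/* Sketch: query the logical and physical block sizes the kernel reports
 * for a block device (placeholder path).  This shows whether the guest's
 * 512-byte view sits on top of a larger host sector size. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int logical = 0;
    unsigned int physical = 0;

    int fd = open("/dev/sdc", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (ioctl(fd, BLKSSZGET, &logical) < 0) { perror("BLKSSZGET"); return 1; }
    if (ioctl(fd, BLKPBSZGET, &physical) < 0) { perror("BLKPBSZGET"); return 1; }

    printf("logical %d bytes, physical %u bytes\n", logical, physical);
    close(fd);
    return 0;
}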


>
> Paolo
>


* Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-16  9:47 ` Dr. David Alan Gilbert
@ 2018-05-16 21:40   ` Daniel Henrique Barboza
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Henrique Barboza @ 2018-05-16 21:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Kevin Wolf, Paolo Bonzini, Fam Zheng, qemu-devel, qemu-block



On 05/16/2018 06:47 AM, Dr. David Alan Gilbert wrote:
> * Daniel Henrique Barboza (danielhb@linux.ibm.com) wrote:
>> Hi,
>>
>> I've been working in the last two months in a miscompare issue that happens
>> when using a raid device and a SATA as scsi-hd (emulated SCSI) with
>> cache=none and io=threads during a hardware stress test. I'll summarize it
>> here as best as I can without creating a great wall of text - Red Hat folks
>> can check [1] for all the details.
>>
>> Using the following setup:
>>
>> - Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
>> qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)
>>
>> - Guest is RHEL 7.5-alt using the same kernel as the host, using two storage
>> disks (a 1.8 Tb raid and a 446Gb SATA drive) as follows:
>>
>>      <disk type='block' device='disk'>
>>        <driver name='qemu' type='raw' cache='none'/>
>>        <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
>>        <target dev='sdc' bus='scsi'/>
>>        <alias name='scsi0-0-0-2'/>
>>        <address type='drive' controller='0' bus='0' target='0' unit='2'/>
>>      </disk>
>>
>> Both block devices have WCE off in the host.
>>
>> With this env, we found problems when running a stress test called HTX [2].
>> At a given time (usually after 24+ hours of test) HTX finds a data
>> miscompare in one of the devices. This is an example:
>>
>> -------
>>
>> Device name: /dev/sdb
>> Total blocks: 0x74706daf, Block size: 0x200
>> Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
>> Number of Rulefile passes (cycle) completed: 0
>> Stanza running: rule_6, Thread no.: 8
>> Oper performed: wrc, Current seek type: SEQ
>> LBA no. where IO started: 0x94fa
>> Transfer size: 0x8400
>>
>> Miscompare Summary:
>> ===================
>> LBA no. where miscomapre started:     0x94fa
>> LBA no. where miscomapre ended:       0x94ff
>> Miscompare start offset (in bytes):   0x8
>> Miscomapre end offset (in bytes):     0xbff
>> Miscompare size (in bytes):           0xbf8
>>
>> Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
>> Actual data (at miscomapre offset): 889aea5a736462000000000000007275
> Are all the miscompares single bit errors like that one?

The miscompares differ in size. What is displayed here is the first
snippet of the miscompare data; in this case the miscompare is 0xbf8
bytes long.

I've seen cases where the miscompare has the same size as the data
written - the test initializes the disk with a known pattern (bbbbbbb,
for example), then a miscompare happens and it turns out that the disk
still contained the starting pattern.


> Is the test doing single bit manipulation or is that coming out of the
> blue?

As far as I've read in the test suite code, it writes several sectors
at once and then asserts that the contents were written.

>
> Dave
>
>> -----
>>
>>
>> This means that the test executed a write at  LBA 0x94fa and, after
>> confirming that the write was completed, issue 2 reads in the same LBA to
>> assert the written contents and found out a mismatch.
>>
>>
>> I've tested all sort of configurations between disk vs LUN, cache modes and
>> AIO. My findings are:
>>
>> - using device='lun' instead of device='disk', I can't reproduce the issue
>> doesn't matter what other configurations are;
>> - using device='disk' but with cache='writethrough', issue doesn't happen
>> (haven't checked other cache modes);
>> - using device='disk', cache='none' and io='native', issue doesn't happen.
>>
>>
>> The issue seems to be tied with the combination device=disk + cache=none +
>> io=threads. I've started digging into the SCSI layer all the way down to the
>> block backend. With a shameful amount of logs I've discovered that, in the
>> write that the test finds a miscompare, in block/file-posix.c:
>>
>> - when doing the write, handle_aiocb_rw_vector() returns success, pwritev()
>> reports that all bytes were written
>> - in both reads after the write, handle_aiocb_rw_vector returns success, all
>> bytes read by preadv(). In both reads, the data read is different from the
>> data written by  the pwritev() that happened before
>>
>> In the discussions at [1], Fam Zheng suggested a test in which we would take
>> down the number of threads created in the POSIX thread pool from 64 to 1.
>> The idea is to ensure that we're using the same thread to write and read.
>> There was a suspicion that the kernel can't guarantee data coherency between
>> different threads, even if using the same fd, when using pwritev() and
>> preadv(). This would explain why the following reads in the same fd would
>> fail to retrieve the same data that was written before. After doing this
>> modification, the miscompare didn't reproduce.
>>
>> After reverting the thread pool number change, I've made a couple of
>> attempts trying to flush before read() and flushing after write(). Both
>> attempts failed - the miscompare appears in both scenarios. This enforces
>> the suspicion we have above - if data coherency can't be granted between
>> different threads, flushing in different threads wouldn't make a difference
>> too. I've also tested a suggestion from Fam where I started the disks with
>> "cache.direct=on,cache.no-flush=off" - bug still reproduces.
>>
>>
>> This is the current status of this investigation. I decided to start a
>> discussion here, see if someone can point me something that I overlooked or
>> got it wrong, before I started changing the POSIX thread pool behavior to
>> see if I can enforce one specific POSIX thread to do a read() if we had a
>> write() done in the same fd. Any suggestions?
>>
>>
>>
>> ps: it is worth mentioning that I was able to reproduce this same bug in a
>> POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
>> doesn't have any arch-specific behavior I wouldn't be surprised if this bug
>> is also reproducible in other archs like x86.
>>
>>
>> Thanks,
>>
>> Daniel
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
>> [2] https://github.com/open-power/HTX
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>


* Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-16 21:35   ` Daniel Henrique Barboza
@ 2018-05-16 22:12     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Henrique Barboza @ 2018-05-16 22:12 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: Kevin Wolf, Fam Zheng



On 05/16/2018 06:35 PM, Daniel Henrique Barboza wrote:
>
>
> On 05/16/2018 04:47 AM, Paolo Bonzini wrote:
>> On 15/05/2018 23:25, Daniel Henrique Barboza wrote:
>>> This is the current status of this investigation. I decided to start a
>>> discussion here, see if someone can point me something that I 
>>> overlooked
>>> or got it wrong, before I started changing the POSIX thread pool
>>> behavior to see if I can enforce one specific POSIX thread to do a
>>> read() if we had a write() done in the same fd. Any suggestions?
>> Copying from the bug:
>>
>>> Unless we learn something new, my understanding is that we're dealing
>>> with a host side limitation/bug when calling pwritev() in a different
>>> thread than a following preadv(), using the same file descriptor
>>> opened with O_DIRECT and no WCE in the host side, the kernel can't
>>> grant data coherency, e.g:
>>>
>>> - thread A executes a pwritev() writing dataA in the disk
>>>
>>> - thread B executes a preadv() call to read the data, but this
>>> preadv() call isn't aware of the previous pwritev() call done in
>>> thread A, thus the guarantee of the preadv() call reading dataA isn't
>>> assured (as opposed to what is described in man 3 write)
>>>
>>> - the physical disk, due to the heavy load of the stress test, didn't
>>> finish writing up dataA. Since the disk itself doesn't have any
>>> internal cache to rely on, the preadv() call goes in and read an old
>>> data that differs from dataA.
>> There is a problem in the reasoning of the third point: if the physical
>> disk hasn't yet finished writing up dataA, pwritev() shouldn't have
>> returned.  This could be a bug in the kernel, or even in the disk.  I
>> suspect the kernel because SCSI passthrough doesn't show the bug; SCSI
>> passthrough uses ioctl() which completes exactly when the disk tells
>> QEMU that the command is done---it cannot report completion too early.
>>
>> (Another small problem in the third point is that the disk actually does
>> have a cache.  But the cache should be transparent, if it weren't the
>> bug would be in the disk firmware).
>>
>> It has to be debugged and fixed in the kernel.  The thread pool is
>> just... a thread pool, and shouldn't be working around bugs, especially
>> as serious as these.
>
> Fixing in the thread pool would only make sense if we were sure that
> the kernel was working as intended. I think the next step would be to
> look it in the kernel level and see what is not working there.
>
>>
>> A more likely possibility: maybe the disk has 4K sectors and QEMU is
>> doing read-modify-write cycles to emulate 512 byte sectors?  In this
>> case, mismatches are not expected, since QEMU serializes RMW cycles, but
>> at least we would know that the bug would be in QEMU, and where.
>
> Haven't considered this possibility. I'll look it up if the disk has 4k
> sectors and whether QEMU is emulating 512 bytes sectors.

There are several differences between the guest and the host regarding
the device's kernel queue parameters. This is how the guest configured
the SATA disk:


# grep . /sys/block/sdb/queue/*
/sys/block/sdb/queue/add_random:1
/sys/block/sdb/queue/chunk_sectors:0
/sys/block/sdb/queue/dax:0
/sys/block/sdb/queue/discard_granularity:4096
/sys/block/sdb/queue/discard_max_bytes:1073741824
/sys/block/sdb/queue/discard_max_hw_bytes:1073741824
/sys/block/sdb/queue/discard_zeroes_data:0
/sys/block/sdb/queue/hw_sector_size:512
/sys/block/sdb/queue/io_poll:0
/sys/block/sdb/queue/io_poll_delay:0
grep: /sys/block/sdb/queue/iosched: Is a directory
/sys/block/sdb/queue/iostats:1
/sys/block/sdb/queue/logical_block_size:512
/sys/block/sdb/queue/max_discard_segments:1
/sys/block/sdb/queue/max_hw_sectors_kb:32767
/sys/block/sdb/queue/max_integrity_segments:0
/sys/block/sdb/queue/max_sectors_kb:256
/sys/block/sdb/queue/max_segments:126
/sys/block/sdb/queue/max_segment_size:65536
/sys/block/sdb/queue/minimum_io_size:262144
/sys/block/sdb/queue/nomerges:0
/sys/block/sdb/queue/nr_requests:128
/sys/block/sdb/queue/optimal_io_size:262144
/sys/block/sdb/queue/physical_block_size:512
/sys/block/sdb/queue/read_ahead_kb:4096
/sys/block/sdb/queue/rotational:1
/sys/block/sdb/queue/rq_affinity:1
/sys/block/sdb/queue/scheduler:noop [deadline] cfq
/sys/block/sdb/queue/unpriv_sgio:0
grep: /sys/block/sdb/queue/wbt_lat_usec: Invalid argument
/sys/block/sdb/queue/write_cache:write back
/sys/block/sdb/queue/write_same_max_bytes:262144
/sys/block/sdb/queue/write_zeroes_max_bytes:262144
/sys/block/sdb/queue/zoned:none


The same device in the host:

$ grep . /sys/block/sdc/queue/*
/sys/block/sdc/queue/add_random:1
/sys/block/sdc/queue/chunk_sectors:0
/sys/block/sdc/queue/dax:0
/sys/block/sdc/queue/discard_granularity:0
/sys/block/sdc/queue/discard_max_bytes:0
/sys/block/sdc/queue/discard_max_hw_bytes:0
/sys/block/sdc/queue/discard_zeroes_data:0
/sys/block/sdc/queue/hw_sector_size:512
/sys/block/sdc/queue/io_poll:0
/sys/block/sdc/queue/io_poll_delay:0
grep: /sys/block/sdc/queue/iosched: Is a directory
/sys/block/sdc/queue/iostats:1
/sys/block/sdc/queue/logical_block_size:512
/sys/block/sdc/queue/max_discard_segments:1
/sys/block/sdc/queue/max_hw_sectors_kb:256
/sys/block/sdc/queue/max_integrity_segments:0
/sys/block/sdc/queue/max_sectors_kb:256
/sys/block/sdc/queue/max_segments:64
/sys/block/sdc/queue/max_segment_size:65536
/sys/block/sdc/queue/minimum_io_size:512
/sys/block/sdc/queue/nomerges:0
/sys/block/sdc/queue/nr_requests:128
/sys/block/sdc/queue/optimal_io_size:0
/sys/block/sdc/queue/physical_block_size:512
/sys/block/sdc/queue/read_ahead_kb:4096
/sys/block/sdc/queue/rotational:1
/sys/block/sdc/queue/rq_affinity:1
/sys/block/sdc/queue/scheduler:noop [deadline] cfq
/sys/block/sdc/queue/unpriv_sgio:0
grep: /sys/block/sdc/queue/wbt_lat_usec: Invalid argument
/sys/block/sdc/queue/write_cache:write through
/sys/block/sdc/queue/write_same_max_bytes:0
/sys/block/sdc/queue/write_zeroes_max_bytes:0
/sys/block/sdc/queue/zoned:none



The physical block size is 512 in both guest and host, but there are a
lot of differences in how the guest sees the device. I'm not sure
whether any of these differences sheds light on the problem, though.


Daniel


>
>
>>
>> Paolo
>>
>


* Re: [Qemu-devel] [Qemu-block] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-15 21:25 [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads Daniel Henrique Barboza
  2018-05-16  7:47 ` Paolo Bonzini
  2018-05-16  9:47 ` Dr. David Alan Gilbert
@ 2018-05-24 14:04 ` Stefan Hajnoczi
  2018-05-24 21:30   ` Daniel Henrique Barboza
  2 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2018-05-24 14:04 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-block, Kevin Wolf, Paolo Bonzini, Fam Zheng


On Tue, May 15, 2018 at 06:25:46PM -0300, Daniel Henrique Barboza wrote:
> This means that the test executed a write at  LBA 0x94fa and, after
> confirming that the write was completed, issue 2 reads in the same LBA to
> assert the written contents and found out a mismatch.

Have you confirmed this pattern at various levels in the stack:
1. Application inside the guest (strace)
2. Guest kernel block layer (blktrace)
3. QEMU (strace)
4. Host kernel block layer (blktrace)

The key thing is that the write completes before the 2 reads are
submitted.

Have you tried running the test on bare metal?

Stefan



* Re: [Qemu-devel] [Qemu-block] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-24 14:04 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2018-05-24 21:30   ` Daniel Henrique Barboza
  2018-06-01 11:49     ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Henrique Barboza @ 2018-05-24 21:30 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Kevin Wolf, Paolo Bonzini, Fam Zheng



On 05/24/2018 11:04 AM, Stefan Hajnoczi wrote:
> On Tue, May 15, 2018 at 06:25:46PM -0300, Daniel Henrique Barboza wrote:
>> This means that the test executed a write at  LBA 0x94fa and, after
>> confirming that the write was completed, issue 2 reads in the same LBA to
>> assert the written contents and found out a mismatch.
> Have you confirmed this pattern at various levels in the stack:
> 1. Application inside the guest (strace)
> 2. Guest kernel block layer (blktrace)
> 3. QEMU (strace)
> 4. Host kernel block layer (blktrace)

Tested (3). In that case the bug stops reproducing. I'm not sure whether
this is because strace adds a delay around the preadv/pwritev calls, but
the act of attaching strace to the QEMU process changed the behavior.

I haven't tried the other three scenarios. (2) and (4) are definitely
worth trying out, especially (4).

> The key thing is that the write completes before the 2 reads are
> submitted.
>
> Have you tried running the test on bare metal?

Yes. The stress test does not reproduce the miscompare issue when
running on bare metal.


Thanks,

Daniel

>
> Stefan


* Re: [Qemu-devel] [Qemu-block] Problem with data miscompare using scsi-hd, cache=none and io=threads
  2018-05-24 21:30   ` Daniel Henrique Barboza
@ 2018-06-01 11:49     ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2018-06-01 11:49 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: qemu-devel, qemu-block, Kevin Wolf, Paolo Bonzini, Fam Zheng


On Thu, May 24, 2018 at 06:30:10PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 05/24/2018 11:04 AM, Stefan Hajnoczi wrote:
> > On Tue, May 15, 2018 at 06:25:46PM -0300, Daniel Henrique Barboza wrote:
> > > This means that the test executed a write at  LBA 0x94fa and, after
> > > confirming that the write was completed, issue 2 reads in the same LBA to
> > > assert the written contents and found out a mismatch.
> > Have you confirmed this pattern at various levels in the stack:
> > 1. Application inside the guest (strace)
> > 2. Guest kernel block layer (blktrace)
> > 3. QEMU (strace)
> > 4. Host kernel block layer (blktrace)
> 
> Tested (3). In this case the bug stop reproducing. Not sure if there is
> anything related with strace adding a delay back and forth the
> preadv/pwritev calls, but the act of attaching strace to the QEMU
> process changed the behavior.
> 
> Haven't tried the other 3 scenarios. (2) and (4) are definitely worth trying
> it
> out, specially (4).

Yes, host blktrace is probably the most important piece of evidence.

> > The key thing is that the write completes before the 2 reads are
> > submitted.
> > 
> > Have you tried running the test on bare metal?
> 
> Yes. The stress test does not reproduce the miscompare issue when
> running on bare metal.

Great.

Stefan



end of thread, other threads:[~2018-06-01 11:49 UTC | newest]

Thread overview: 9+ messages
2018-05-15 21:25 [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads Daniel Henrique Barboza
2018-05-16  7:47 ` Paolo Bonzini
2018-05-16 21:35   ` Daniel Henrique Barboza
2018-05-16 22:12     ` Daniel Henrique Barboza
2018-05-16  9:47 ` Dr. David Alan Gilbert
2018-05-16 21:40   ` Daniel Henrique Barboza
2018-05-24 14:04 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2018-05-24 21:30   ` Daniel Henrique Barboza
2018-06-01 11:49     ` Stefan Hajnoczi
