* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
       [not found] <CAAxjCEzHQz4cG_8m7S6=CwCBoN5daQs+KVyuU5GL5Tq3Bky1NA@mail.gmail.com>
@ 2020-02-24 12:35 ` Stefan Ring
  2020-02-24 13:10   ` Stefan Ring
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Stefan Ring @ 2020-02-24 12:35 UTC (permalink / raw)
  To: qemu-block, qemu-devel, integration

On Thu, Feb 20, 2020 at 10:19 AM Stefan Ring <stefanrin@gmail.com> wrote:
>
> Hi,
>
> I have a very curious problem on an oVirt-like virtualization host
> whose storage lives on gluster (as qcow2).
>
> The problem is that, of the writes done by ZFS (whose sizes according
> to blktrace are a mixture of 8, 16, 24, ... 256 (512 byte) blocks),
> sometimes the first 4KB or more, but at least the first 4KB, end up
> zeroed out when read back later from storage. To clarify: ZFS
> is only used in the guest. In my current test scenario, I write
> approx. 3GB to the guest machine, which takes roughly a minute.
> Actually it’s around 35 GB which gets compressed down to 3GB by lz4.
> Within that, I end up with close to 100 data errors when I read it
> back from storage afterwards (zpool scrub).
>
> There are quite a few machines running on this host, and we have not
> experienced other problems so far. So right now, only ZFS is able to
> trigger this for some reason. The guest has 8 virtual cores. I also
> tried writing directly to the affected device from user space in
> patterns mimicking what I see in blktrace, but so far have been unable
> to trigger the same issue that way. Of the many ZFS knobs, I know at
> least one that makes a huge difference: When I set
> zfs_vdev_async_write_max_active to 1 (as opposed to 2 or 10), the
> error count goes through the roof (11,000). Curiously, when I switch
> off ZFS compression, the data amount written increases almost
> 10-fold, while the absolute error amount drops to close to, but not
> entirely, zero. Which I guess supports my suspicion that this must be
> somehow related to timing.
>
> Switching the guest storage driver between scsi and virtio does not
> make a difference.
>
> Switching the storage backend to file on glusterfs-fuse does make a
> difference, i.e. the problem disappears.
>
> Any hints? I'm still trying to investigate a few things, but what bugs
> me most is that only ZFS seems to trigger this behavior, although I am
> almost sure that ZFS is not really at fault here.
>
> Software versions used:
>
> Host
> kernel 3.10.0-957.12.1.el7.x86_64
> qemu-kvm-ev-2.12.0-18.el7_6.3.1.x86_64
> glusterfs-api-5.6-1.el7.x86_64
>
> Guest
> kernel 3.10.0-1062.12.1.el7.x86_64
> kmod-zfs-0.8.3-1.el7.x86_64 (from the official ZoL binaries)

I can actually reproduce this on my Fedora 31 home machine with 3 VMs.
All 3 running CentOS 7.7. Two for glusterd, one for ZFS. Briefly, I
also got rid of the 2 glusterd VMs altogether, i.e. running glusterd
(the Fedora version) directly on the host, and it would still occur.
So my impression is that the server side of GlusterFS does not matter
much – I’ve seen it happen on 4.x, 6.x, 7.2 and 7.3. Also, as it
happens in the same way on a Fedora 31 qemu as well as a CentOS 7 one,
the qemu version is equally irrelevant.

The main conclusion so far is that it has to do with growing the qcow2
image. With a fully pre-populated image, I cannot trigger it.

I poked around a little in the glfs api integration, but trying to
make sense of two unknown asynchronous io systems (QEMU's and
GlusterFS's) interacting with each other is demanding a bit much for a
single weekend ;). The one thing I did verify so far is that there is
only one thread ever calling qemu_gluster_co_rw. As already stated in
the original post, the problem only occurs with multiple parallel
write requests happening.

What I plan to do next is look at the block ranges being written in
the hope of finding overlaps there.
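
The check I have in mind is roughly the following (just a sketch; the
record layout is whatever my own instrumentation dumps, nothing that
QEMU defines):

    #include <stdint.h>
    #include <stdio.h>

    /* One record per write request from the trace (field names are mine). */
    struct wr {
        uint64_t off;     /* byte offset          */
        uint64_t len;     /* byte length          */
        uint64_t t_sub;   /* submission timestamp */
        uint64_t t_done;  /* completion timestamp */
    };

    static int ranges_overlap(const struct wr *a, const struct wr *b)
    {
        return a->off < b->off + b->len && b->off < a->off + a->len;
    }

    static int in_flight_together(const struct wr *a, const struct wr *b)
    {
        return a->t_sub < b->t_done && b->t_sub < a->t_done;
    }

    /* Report every pair of requests that touches overlapping byte ranges
     * while both are in flight at the same time. */
    static void report_overlaps(const struct wr *w, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = i + 1; j < n; j++)
                if (ranges_overlap(&w[i], &w[j]) &&
                    in_flight_together(&w[i], &w[j]))
                    printf("requests %zu and %zu overlap in flight\n", i, j);
    }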



* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-24 12:35 ` Strange data corruption issue with gluster (libgfapi) and ZFS Stefan Ring
@ 2020-02-24 13:10   ` Stefan Ring
  2020-02-24 13:26   ` Kevin Wolf
  2020-02-25 14:12   ` Stefan Ring
  2 siblings, 0 replies; 9+ messages in thread
From: Stefan Ring @ 2020-02-24 13:10 UTC (permalink / raw)
  To: qemu-block, qemu-devel, integration

On Mon, Feb 24, 2020 at 1:35 PM Stefan Ring <stefanrin@gmail.com> wrote:
>
> [...]. As already stated in
> the original post, the problem only occurs with multiple parallel
> write requests happening.

Actually I did not state that. Anyway, the corruption does not happen
when I restrict the ZFS io scheduler to only 1 request at a time
(zfs_vdev_max_active), therefore I assume that this is somehow related
to the ordering of asynchronously scheduled writes.
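
Setting these module parameters in the guest amounts to writing the
value into sysfs; in C that would be something like the following (a
sketch, assuming the parameter is writable at runtime):

    #include <stdio.h>

    /* Write a ZFS module parameter, e.g. zfs_vdev_max_active = 1. */
    static int set_zfs_tunable(const char *name, long value)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/module/zfs/parameters/%s", name);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%ld\n", value);
        return fclose(f);
    }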



* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-24 12:35 ` Strange data corruption issue with gluster (libgfapi) and ZFS Stefan Ring
  2020-02-24 13:10   ` Stefan Ring
@ 2020-02-24 13:26   ` Kevin Wolf
  2020-02-24 15:50     ` Stefan Ring
  2020-02-25 14:12   ` Stefan Ring
  2 siblings, 1 reply; 9+ messages in thread
From: Kevin Wolf @ 2020-02-24 13:26 UTC (permalink / raw)
  To: Stefan Ring; +Cc: integration, qemu-devel, qemu-block

Am 24.02.2020 um 13:35 hat Stefan Ring geschrieben:
> On Thu, Feb 20, 2020 at 10:19 AM Stefan Ring <stefanrin@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a very curious problem on an oVirt-like virtualization host
> > whose storage lives on gluster (as qcow2).
> >
> > The problem is that, of the writes done by ZFS (whose sizes according
> > to blktrace are a mixture of 8, 16, 24, ... 256 (512 byte) blocks),
> > sometimes the first 4KB or more, but at least the first 4KB, end up
> > zeroed out when read back later from storage. To clarify: ZFS
> > is only used in the guest. In my current test scenario, I write
> > approx. 3GB to the guest machine, which takes roughly a minute.
> > Actually it’s around 35 GB which gets compressed down to 3GB by lz4.
> > Within that, I end up with close to 100 data errors when I read it
> > back from storage afterwards (zpool scrub).
> >
> > There are quite a few machines running on this host, and we have not
> > experienced other problems so far. So right now, only ZFS is able to
> > trigger this for some reason. The guest has 8 virtual cores. I also
> > tried writing directly to the affected device from user space in
> > patterns mimicking what I see in blktrace, but so far have been unable
> > to trigger the same issue that way. Of the many ZFS knobs, I know at
> > least one that makes a huge difference: When I set
> > zfs_vdev_async_write_max_active to 1 (as opposed to 2 or 10), the
> > error count goes through the roof (11,000).

Wait, what does this setting actually do? Does it mean that QEMU should
never see more than a single active write request at the same time?
So if this makes the error a lot more prominent, does this mean that
async I/O actually makes the problem _less_ likely to occur?

This sounds weird, so probably I'm misunderstanding the setting?

> > Curiously, when I switch
> > off ZFS compression, the data amount written increases almost
> > 10-fold, while the absolute error amount drops to close to, but not
> > entirely, zero. Which I guess supports my suspicion that this must be
> > somehow related to timing.
> >
> > Switching the guest storage driver between scsi and virtio does not
> > make a difference.
> >
> > Switching the storage backend to file on glusterfs-fuse does make a
> > difference, i.e. the problem disappears.
> >
> > Any hints? I'm still trying to investigate a few things, but what bugs
> > me most is that only ZFS seems to trigger this behavior, although I am
> > almost sure that ZFS is not really at fault here.
> >
> > Software versions used:
> >
> > Host
> > kernel 3.10.0-957.12.1.el7.x86_64
> > qemu-kvm-ev-2.12.0-18.el7_6.3.1.x86_64
> > glusterfs-api-5.6-1.el7.x86_64
> >
> > Guest
> > kernel 3.10.0-1062.12.1.el7.x86_64
> > kmod-zfs-0.8.3-1.el7.x86_64 (from the official ZoL binaries)
> 
> I can actually reproduce this on my Fedora 31 home machine with 3 VMs.

This means QEMU 4.1.1, right?

> All 3 running CentOS 7.7. Two for glusterd, one for ZFS. Briefly, I
> also got rid of the 2 glusterd VMs altogether, i.e. running glusterd
> (the Fedora version) directly on the host, and it would still occur.
> So my impression is that the server side of GlusterFS does not matter
> much – I’ve seen it happen on 4.x, 6.x, 7.2 and 7.3. Also, as it
> happens in the same way on a Fedora 31 qemu as well as a CentOS 7 one,
> the qemu version is equally irrelevant.
> 
> The main conclusion so far is that it has to do with growing the qcow2
> image. With a fully pre-populated image, I cannot trigger it.

Ok, that's a good data point.

Is the corruption that you're seeing only in the guest data or is qcow2
metadata also affected (does 'qemu-img check' report errors)?

> I poked around a little in the glfs api integration, but trying to
> make sense of two unknown asynchronous io systems (QEMU's and
> GlusterFS's) interacting with each other is demanding a bit much for a
> single weekend ;). The one thing I did verify so far is that there is
> only one thread ever calling qemu_gluster_co_rw. As already stated in
> the original post, the problem only occurs with multiple parallel
> write requests happening.
> 
> What I plan to do next is look at the block ranges being written in
> the hope of finding overlaps there.

Either that, or other interesting patterns.

Did you try to remove the guest from the equation? If you say that the
problem is with multiple parallel requests, maybe 'qemu-img bench' can
cause the same kind of corruption? (Though if it's only corruption in
the content rather than qcow2 metadata, it may be hard to detect.
Giving qemu-io an explicit list of requests could still be an option
once we have a suspicion of what pattern creates the problem.)

Kevin




* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-24 13:26   ` Kevin Wolf
@ 2020-02-24 15:50     ` Stefan Ring
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Ring @ 2020-02-24 15:50 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: integration, qemu-devel, qemu-block

On Mon, Feb 24, 2020 at 2:27 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > There are quite a few machines running on this host, and we have not
> > > experienced other problems so far. So right now, only ZFS is able to
> > > trigger this for some reason. The guest has 8 virtual cores. I also
> > > tried writing directly to the affected device from user space in
> > > patterns mimicking what I see in blktrace, but so far have been unable
> > > to trigger the same issue that way. Of the many ZFS knobs, I know at
> > > least one that makes a huge difference: When I set
> > > zfs_vdev_async_write_max_active to 1 (as opposed to 2 or 10), the
> > > error count goes through the roof (11,000).
>
> Wait, what does this setting actually do? Does it mean that QEMU should
> never see more than a single active write request at the same time?
> So if this makes the error a lot more prominent, does this mean that
> async I/O actually makes the problem _less_ likely to occur?
>
> This sounds weird, so probably I'm misunderstanding the setting?

Yes, this is strange, and I will not pursue this, as I cannot reproduce
it on my home setup. Let’s just hope that it’s some kind of anomaly
that will go away once the real issue has been eliminated ;).

> > I can actually reproduce this on my Fedora 31 home machine with 3 VMs.
>
> This means QEMU 4.1.1, right?

Yes, qemu-system-x86-4.1.1-1.fc31.x86_64.

> > All 3 running CentOS 7.7. Two for glusterd, one for ZFS. Briefly, I
> > also got rid of the 2 glusterd VMs altogether, i.e. running glusterd
> > (the Fedora version) directly on the host, and it would still occur.
> > So my impression is that the server side of GlusterFS does not matter
> > much – I’ve seen it happen on 4.x, 6.x, 7.2 and 7.3. Also, as it
> > happens in the same way on a Fedora 31 qemu as well as a CentOS 7 one,
> > the qemu version is equally irrelevant.
> >
> > The main conclusion so far is that it has to do with growing the qcow2
> > image. With a fully pre-populated image, I cannot trigger it.
>
> Ok, that's a good data point.
>
> Is the corruption that you're seeing only in the guest data or is qcow2
> metadata also affected (does 'qemu-img check' report errors)?

"No errors were found on the image."

I don’t entirely rule out the possibility of qcow2 metadata corruption,
but at least it seems to be very unlikely compared to guest data
corruption.

> > What I plan to do next is look at the block ranges being written in
> > the hope of finding overlaps there.
>
> Either that, or other interesting patterns.
>
> Did you try to remove the guest from the equation? If you say that the
> problem is with multiple parallel requests, maybe 'qemu-img bench' can
> cause the same kind of corruption? (Though if it's only corruption in
> the content rather than qcow2 metadata, it may be hard to detect.
> Giving qemu-io an explicit list of requests could still be an option
> once we have a suspicion of what pattern creates the problem.)

Did not know about qemu-img bench, but narrowing it down as much as
possible – and that entails removing the guest VM – is my number one
priority here.



* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-24 12:35 ` Strange data corruption issue with gluster (libgfapi) and ZFS Stefan Ring
  2020-02-24 13:10   ` Stefan Ring
  2020-02-24 13:26   ` Kevin Wolf
@ 2020-02-25 14:12   ` Stefan Ring
  2020-02-27 21:12     ` Stefan Ring
  2 siblings, 1 reply; 9+ messages in thread
From: Stefan Ring @ 2020-02-25 14:12 UTC (permalink / raw)
  To: qemu-block, qemu-devel, integration

On Mon, Feb 24, 2020 at 1:35 PM Stefan Ring <stefanrin@gmail.com> wrote:
>
> What I plan to do next is look at the block ranges being written in
> the hope of finding overlaps there.

Status update:

I still have not found out what is actually causing this. I have not
found concurrent writes to overlapping file areas. But what I can say
is that by switching qemu_gluster_co_rw to the synchronous glusterfs
api (glfs_pwritev), the problem goes away.
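
For reference, the substitution boils down to something like this (a
sketch only, not the actual QEMU code; the coroutine plumbing around it
is omitted):

    #include <glusterfs/api/glfs.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Instead of glfs_pwritev_async() plus a completion callback, issue
     * the write with the blocking variant, which only returns once the
     * write has completed. */
    static ssize_t write_blocking(glfs_fd_t *fd, struct iovec *iov,
                                  int niov, off_t offset)
    {
        return glfs_pwritev(fd, iov, niov, offset, 0);
    }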

Unfortunately, I have not yet been able to find exactly how the qcow2
file is grown. It looks like this happens just as a consequence of
writing beyond the end, because, contrary to my expectations, neither
qemu_gluster_co_pwrite_zeroes nor qemu_gluster_do_truncate is ever
called. My current line of thinking is that there must be something
special about in-flight writes which grow the file.

I find many instances with the following pattern:

- current file length (= max position + size written): p
- write request n writes from (p + hole_size), thus leaving a hole
- request n+1 writes exactly hole_size, starting from p, thus completely
  filling the hole
- The two requests' in-flight times overlap.
- hole_size can be almost any value (7-127).

I see fewer data errors than instances of this pattern though.
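
Expressed as a condition over two neighboring requests, the pattern is
roughly this (a sketch; all values are in 512-byte blocks, and eof is
the file length before request n was issued):

    #include <stdint.h>

    static int is_hole_fill_pair(uint64_t eof,
                                 uint64_t n_off,                   /* request n   */
                                 uint64_t n1_off, uint64_t n1_len, /* request n+1 */
                                 int in_flight_overlap)
    {
        return n_off > eof &&            /* n starts beyond the old EOF     */
               n1_off == eof &&          /* n+1 starts right at the old EOF */
               n1_len == n_off - eof &&  /* and fills the hole exactly      */
               in_flight_overlap;        /* while both are in flight        */
    }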



* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-25 14:12   ` Stefan Ring
@ 2020-02-27 21:12     ` Stefan Ring
  2020-02-27 22:25       ` Stefan Ring
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Ring @ 2020-02-27 21:12 UTC (permalink / raw)
  To: qemu-block, qemu-devel, integration

On Tue, Feb 25, 2020 at 3:12 PM Stefan Ring <stefanrin@gmail.com> wrote:
>
> I find many instances with the following pattern:
>
> - current file length (= max position + size written): p
> - write request n writes from (p + hole_size), thus leaving a hole
> - request n+1 writes exactly hole_size, starting from p, thus completely
>   filling the hole
> - The two requests' in-flight times overlap.
> - hole_size can be almost any value (7-127).

Victory! I have a reproducer in the form of a plain C libgfapi client.

However, I have not been able to trigger corruption by just executing
the simple pattern in an artificial way. Currently, I need to feed my
reproducer 2 GB of data that I streamed out of the qemu block driver.
I get two possible end states out of my reproducer: The correct one or
a corrupted one, where 48 KB are zeroed out. It takes no more than 10
runs to get each of them at least once. The corrupted end state is
exactly the same that I got from the real qemu process from where I
obtained the streamed trace. This gives me a lot of confidence in the
soundness of my reproducer.
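
Structurally, the reproducer is nothing more than a plain libgfapi
client along these lines (a minimal sketch; the volume name, host and
image path are placeholders, and the replay loop is where the recorded
write requests get issued):

    #include <glusterfs/api/glfs.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        glfs_t *fs = glfs_new("testvol");         /* placeholder volume */
        glfs_fd_t *fd;

        glfs_set_volfile_server(fs, "tcp", "gluster-host", 24007);
        if (glfs_init(fs) != 0) {
            perror("glfs_init");
            return 1;
        }

        fd = glfs_open(fs, "/test.img", O_RDWR);  /* placeholder path */

        /* ... replay the recorded write requests against fd here ... */

        glfs_close(fd);
        glfs_fini(fs);
        return 0;
    }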

More details will follow.



* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-27 21:12     ` Stefan Ring
@ 2020-02-27 22:25       ` Stefan Ring
  2020-02-28 11:10         ` Kevin Wolf
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Ring @ 2020-02-27 22:25 UTC (permalink / raw)
  To: qemu-block, qemu-devel, integration

On Thu, Feb 27, 2020 at 10:12 PM Stefan Ring <stefanrin@gmail.com> wrote:
> Victory! I have a reproducer in the form of a plain C libgfapi client.
>
> However, I have not been able to trigger corruption by just executing
> the simple pattern in an artificial way. Currently, I need to feed my
> reproducer 2 GB of data that I streamed out of the qemu block driver.
> I get two possible end states out of my reproducer: The correct one or
> a corrupted one, where 48 KB are zeroed out. It takes no more than 10
> runs to get each of them at least once. The corrupted end state is
> exactly the same that I got from the real qemu process from where I
> obtained the streamed trace. This gives me a lot of confidence in the
> soundness of my reproducer.
>
> More details will follow.

Ok, so the exact sequence of activity around the corruption is this:

8700 and so on are the sequential request numbers. All of these
requests are writes. Blocks are 512 bytes.

8700
  grows the file to a certain size (2134144 blocks)

<8700 retires, nothing in flight>

8701
  writes 55 blocks inside currently allocated file range, close to the
end (7 blocks short)

8702
  writes 54 blocks from the end of 8701, growing the file by 47 blocks

<8702 retires, 8701 remains in flight>

8703
  writes from the end of 8702, growing the file by 81 blocks

<8703 retires, 8701 remains in flight>

8704
  writes 1623 blocks also from the end of 8702, growing the file by 1542 blocks

<8701 retires>
<8704 retires>

The exact range covered by 8703 ends up zeroed out.

If 8701 retires earlier (before 8702 is issued), everything is fine.
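
As said before, executing such a pattern in isolation has not been
enough to trigger the corruption for me; purely for illustration, the
sequence corresponds roughly to the following libgfapi calls (a sketch
only: setup, waiting for completions and teardown are omitted, offsets
are block numbers derived from the trace, and the glfs_io_cbk signature
shown is the pre-6.0 one; newer libgfapi passes two extra struct
glfs_stat pointers):

    #include <glusterfs/api/glfs.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #define BLK 512

    static void done(glfs_fd_t *fd, ssize_t ret, void *data)
    {
        (void)fd; (void)ret;
        *(int *)data = 1;                          /* mark request as retired */
    }

    /* Submit one asynchronous write of "blocks" 512-byte blocks at block
     * offset "off". The buffer is not freed here to keep the sketch short. */
    static void submit(glfs_fd_t *fd, long off, long blocks, int *flag)
    {
        struct iovec iov = {
            .iov_base = malloc(blocks * BLK),
            .iov_len  = (size_t)(blocks * BLK),
        };
        memset(iov.iov_base, 0xaa, iov.iov_len);   /* non-zero payload */
        glfs_pwritev_async(fd, &iov, 1, (off_t)off * BLK, 0, done, flag);
    }

    /* Replay of requests 8701-8704; eof is the file length after 8700. */
    static void replay(glfs_fd_t *fd)
    {
        long eof = 2134144;
        int r8701 = 0, r8702 = 0, r8703 = 0, r8704 = 0;

        submit(fd, eof - 7 - 55, 55, &r8701);  /* ends 7 blocks short of EOF */
        submit(fd, eof - 7,      54, &r8702);  /* grows the file by 47       */
        /* 8702 retires here, 8701 is still in flight */
        submit(fd, eof + 47,     81, &r8703);  /* grows the file by 81       */
        /* 8703 retires, 8701 is still in flight */
        submit(fd, eof + 47,   1623, &r8704);  /* overlaps 8703 completely   */
        /* 8701 and 8704 retire last; in the corrupted outcome, the range
         * written by 8703 reads back as zeroes */
    }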



* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-27 22:25       ` Stefan Ring
@ 2020-02-28 11:10         ` Kevin Wolf
  2020-02-28 11:41           ` Stefan Ring
  0 siblings, 1 reply; 9+ messages in thread
From: Kevin Wolf @ 2020-02-28 11:10 UTC (permalink / raw)
  To: Stefan Ring; +Cc: integration, qemu-devel, qemu-block

Am 27.02.2020 um 23:25 hat Stefan Ring geschrieben:
> On Thu, Feb 27, 2020 at 10:12 PM Stefan Ring <stefanrin@gmail.com> wrote:
> > Victory! I have a reproducer in the form of a plain C libgfapi client.
> >
> > However, I have not been able to trigger corruption by just executing
> > the simple pattern in an artificial way. Currently, I need to feed my
> > reproducer 2 GB of data that I streamed out of the qemu block driver.
> > I get two possible end states out of my reproducer: The correct one or
> > a corrupted one, where 48 KB are zeroed out. It takes no more than 10
> > runs to get each of them at least once. The corrupted end state is
> > exactly the same that I got from the real qemu process from where I
> > obtained the streamed trace. This gives me a lot of confidence in the
> > soundness of my reproducer.
> >
> > More details will follow.
> 
> Ok, so the exact sequence of activity around the corruption is this:
> 
> 8700 and so on are the sequential request numbers. All of these
> requests are writes. Blocks are 512 bytes.
> 
> 8700
>   grows the file to a certain size (2134144 blocks)
> 
> <8700 retires, nothing in flight>
> 
> 8701
>   writes 55 blocks inside currently allocated file range, close to the
> end (7 blocks short)
> 
> 8702
>   writes 54 blocks from the end of 8701, growing the file by 47 blocks
> 
> <8702 retires, 8701 remains in flight>
> 
> 8703
>   writes from the end of 8702, growing the file by 81 blocks
> 
> <8703 retires, 8701 remains in flight>
> 
> 8704
>   writes 1623 blocks also from the end of 8702, growing the file by 1542 blocks
> 
> <8701 retires>
> <8704 retires>
> 
> The exact range covered by 8703 ends up zeroed out.
> 
> If 8701 retires earlier (before 8702 is issued), everything is fine.

This sounds almost like two other bugs we got fixed recently (in the
QEMU file-posix driver and in the XFS kernel driver) where two writes
extending the file size were in flight in parallel, but if the shorter
one completed last, instead of extending the file, it would end up
truncating it.

I'm not sure, though, why 8701 would try to change the file size because
it's entirely inside the already allocated file range. But maybe adding
the current file size at the start and completion of each request to
your debug output could give us more data points?
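
Something along these lines around each request might already be enough
(a sketch; it assumes glfs_fstat() is usable in your reproducer and just
prints to stderr):

    #include <glusterfs/api/glfs.h>
    #include <stdio.h>
    #include <sys/stat.h>

    /* Log the current file size, tagged with the request number and
     * whether we are at submission or completion. */
    static void log_file_size(glfs_fd_t *fd, const char *tag, int req)
    {
        struct stat st;

        if (glfs_fstat(fd, &st) == 0)
            fprintf(stderr, "req %d %s: file size %lld\n",
                    req, tag, (long long)st.st_size);
    }

    /* Intended use: log_file_size(fd, "submit", n) right before issuing
     * request n, and log_file_size(fd, "complete", n) right after it is
     * seen to complete. */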

Kevin




* Re: Strange data corruption issue with gluster (libgfapi) and ZFS
  2020-02-28 11:10         ` Kevin Wolf
@ 2020-02-28 11:41           ` Stefan Ring
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Ring @ 2020-02-28 11:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: integration, qemu-devel, qemu-block

On Fri, Feb 28, 2020 at 12:10 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
> This sounds almost like two other bugs we got fixed recently (in the
> QEMU file-posix driver and in the XFS kernel driver) where two writes
> extending the file size were in flight in parallel, but if the shorter
> one completed last, instead of extending the file, it would end up
> truncating it.

Hmm, XFS is used on the gluster server, with a CentOS 7 kernel. I can
switch to ext4 and check if it changes anything.

At this point I actually wanted to take this investigation away from
qemu, because it’s not involved anymore in my reproducer.


