* Possible data corruption with dm-thin
@ 2016-06-21  7:56 Dennis Yang
  2016-06-21  8:59 ` Zdenek Kabelac
  2016-06-24 13:55 ` Edward Thornber
  0 siblings, 2 replies; 8+ messages in thread
From: Dennis Yang @ 2016-06-21  7:56 UTC (permalink / raw)
  To: device-mapper development



Hi,

We have been dealing with a data corruption issue when we run our in-house
I/O test suite against multiple thin devices built on top of a thin-pool.
In our test suite, we create multiple thin devices, continually write to
them, check the file checksums, and, if no checksum error occurs, delete
all files and issue DISCARDs to reclaim space.

We found that one data access pattern can corrupt the data.
Suppose there are two thin devices A and B, and device A receives
a DISCARD bio for a physical (pool) block 100. Device A will quiesce
all previous I/O and hold both the virtual and physical data cells before it
actually removes the corresponding data mapping. After the data mapping
is removed, both data cells are released and the DISCARD bio is
passed down to the underlying devices. If device B tries to allocate
a new block at that very moment, it can reuse block 100, which
has just been discarded by device A (assuming a metadata commit has
been triggered, since a block cannot be reused within the same transaction).
In this case, there is a race between the WRITE bio coming from
device B and the DISCARD bio coming from device A. If the WRITE
bio completes before the DISCARD bio, device B sees a checksum error.
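
To make the timing concrete, here is a rough user-space model of the race
(only an illustration: a byte array stands in for pool block 100, a delayed
zero-fill stands in for the passed-down DISCARD, and none of this is the
actual dm-thin code):

# Illustrative model of the race, not dm-thin code.
import hashlib
import threading
import time

BLOCK_SIZE = 4096
pool_block_100 = bytearray(BLOCK_SIZE)   # stands in for the shared pool block

def discard_from_device_a():
    # Device A's mapping is already gone and the block already freed, but the
    # passed-down DISCARD is still in flight in the lower layers.
    time.sleep(0.1)                       # lower-level delay (e.g. tiering)
    pool_block_100[:] = b"\x00" * BLOCK_SIZE

def write_from_device_b():
    # Device B reallocates block 100 (after a commit) and writes new data.
    data = b"B" * BLOCK_SIZE
    pool_block_100[:] = data
    return hashlib.sha1(data).hexdigest()

discard = threading.Thread(target=discard_from_device_a)
discard.start()
expected = write_from_device_b()          # the WRITE completes first...
discard.join()                            # ...then the late DISCARD lands
actual = hashlib.sha1(bytes(pool_block_100)).hexdigest()
print("checksum ok" if actual == expected else "checksum MISMATCH (race)")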

So my question is: does dm-thin have any mechanism to eliminate this race
when a discarded block is reused right away by another device?

Any help would be greatly appreciated.
Thanks,

Dennis


-- 
Dennis Yang
QNAP Systems, Inc.
Skype: qnap.dennis.yang
Email: dennisyang@qnap.com
Tel: (+886)-2-2393-5152 ext. 15018
Address: 13F., No.56, Sec. 1, Xinsheng S. Rd., Zhongzheng Dist., Taipei
City, Taiwan


* Re: Possible data corruption with dm-thin
  2016-06-21  7:56 Possible data corruption with dm-thin Dennis Yang
@ 2016-06-21  8:59 ` Zdenek Kabelac
  2016-06-21 10:40   ` Dennis Yang
  2016-06-24 13:55 ` Edward Thornber
  1 sibling, 1 reply; 8+ messages in thread
From: Zdenek Kabelac @ 2016-06-21  8:59 UTC (permalink / raw)
  To: dm-devel

On 21.6.2016 09:56, Dennis Yang wrote:
> Hi,
>
> We have been dealing with a data corruption issue when we run our in-house
> I/O test suite against multiple thin devices built on top of a thin-pool.
> In our test suite, we create multiple thin devices, continually write to
> them, check the file checksums, and, if no checksum error occurs, delete
> all files and issue DISCARDs to reclaim space.
>
> We found that one data access pattern can corrupt the data.
> Suppose there are two thin devices A and B, and device A receives
> a DISCARD bio for a physical (pool) block 100. Device A will quiesce
> all previous I/O and hold both the virtual and physical data cells before it
> actually removes the corresponding data mapping. After the data mapping
> is removed, both data cells are released and the DISCARD bio is
> passed down to the underlying devices. If device B tries to allocate
> a new block at that very moment, it can reuse block 100, which
> has just been discarded by device A (assuming a metadata commit has
> been triggered, since a block cannot be reused within the same transaction).
> In this case, there is a race between the WRITE bio coming from
> device B and the DISCARD bio coming from device A. If the WRITE
> bio completes before the DISCARD bio, device B sees a checksum error.
>
> So my question is: does dm-thin have any mechanism to eliminate this race
> when a discarded block is reused right away by another device?
>
> Any help would be greatly appreciated.
> Thanks,


Please provide the kernel version and surrounding tools (OS release version).
Also, are you using 'lvm2', or do you use 'dmsetup'/ioctl directly?
(In the latter case we would need to see the exact sequence of operations.)

Also, please provide a reproducer script.


Regards

Zdenek


* Re: Possible data corruption with dm-thin
  2016-06-21  8:59 ` Zdenek Kabelac
@ 2016-06-21 10:40   ` Dennis Yang
  2016-06-21 10:46     ` Zdenek Kabelac
  0 siblings, 1 reply; 8+ messages in thread
From: Dennis Yang @ 2016-06-21 10:40 UTC (permalink / raw)
  To: device-mapper development



2016-06-21 16:59 GMT+08:00 Zdenek Kabelac <zkabelac@redhat.com>:

> On 21.6.2016 09:56, Dennis Yang wrote:
>
>> [...]
>
>
> Please provide the kernel version and surrounding tools (OS release
> version).
> Also, are you using 'lvm2', or do you use 'dmsetup'/ioctl directly?
> (In the latter case we would need to see the exact sequence of operations.)
>
> Also, please provide a reproducer script.
>
>
> Regards
>
> Zdenek
>


Hi Zdenek,

We are using a customized dm-thin driver based on Linux 3.19.8 running
on our QNAP NAS. Also, we create all our thin devices with lvm2. I am
afraid I cannot provide a reproducer script, since we reproduce this by
running an I/O stress test suite on Windows clients against all thin
devices exported to them via Samba and iSCSI.

The following is the thin-pool trace we dumped via blktrace. The data
corruption takes place in the sector range 310150144 to 310150144 + 832.

252,19   1   154916   184.875465510 29959  Q   W 310150144 + 1024 [kworker/u8:0]
252,19   0   205964   185.496309521     0      C   W 310150144 + 1024 [0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
At first, the pool receives a 1024-sector WRITE bio which allocated a pool block.

252,19   3   353811   656.542481344 30280  Q   D 310150144 + 1024 [kworker/u8:8]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The pool receives a 1024-sector (the thin block size) DISCARD bio passed down by
one of the thin devices.

252,19   1   495204   656.558652936 30280  Q   W 310150144 + 832 [kworker/u8:8]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Another thin device passed down an 832-sector WRITE bio to the exact same place.

252,19   3   353820   656.564140283     0      C   W 310150144 + 832 [0]
252,19   0   697455   656.770883592     0      C   D 310150144 + 1024 [0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Although the DISCARD bio was queued before the WRITE bio, their completions
were reordered, which corrupts the data.

252,19   1   515212   684.425478220 20751  A   R 310150144 + 80 <- (252,22) 28932096
252,19   1   515213   684.425478325 20751  Q   R 310150144 + 80 [smbd]
252,19   0   725274   684.425741079 23937  C   R 310150144 + 80 [0]
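
In case it is useful, a reordering like the one above can also be spotted
mechanically in a blkparse text dump. The sketch below is only an
illustration (the field layout is assumed from the lines shown, and the
Q/C matching is deliberately naive):

# Sketch: flag DISCARD/WRITE completion reordering in a blkparse text dump.
# Assumes lines shaped like:
#   252,19  1  154916  184.875465510 29959  Q  W 310150144 + 1024 [kworker/u8:0]
import re
import sys
from collections import defaultdict

LINE = re.compile(
    r"^\s*\d+,\d+\s+\d+\s+\d+\s+(?P<ts>[\d.]+)\s+\d+\s+"
    r"(?P<act>[A-Z])\s+(?P<rwbs>\S+)\s+(?P<sector>\d+)\s+\+\s+(?P<n>\d+)"
)

queued = defaultdict(list)      # (kind, sector, nr_sectors) -> queue timestamps
completed = defaultdict(list)   # (kind, sector, nr_sectors) -> completion timestamps

def kind(rwbs):
    return "D" if "D" in rwbs else ("W" if "W" in rwbs else None)

for line in sys.stdin:
    m = LINE.match(line)
    if not m or kind(m["rwbs"]) is None:
        continue
    key = (kind(m["rwbs"]), int(m["sector"]), int(m["n"]))
    if m["act"] == "Q":
        queued[key].append(float(m["ts"]))
    elif m["act"] == "C":
        completed[key].append(float(m["ts"]))

def overlaps(a, b):
    return a[1] < b[1] + b[2] and b[1] < a[1] + a[2]

for d_key, d_q in queued.items():
    if d_key[0] != "D" or d_key not in completed:
        continue
    for w_key, w_q in queued.items():
        if w_key[0] != "W" or w_key not in completed or not overlaps(d_key, w_key):
            continue
        # A DISCARD was queued before an overlapping WRITE but completed after it.
        if min(d_q) < min(w_q) and min(completed[w_key]) < min(completed[d_key]):
            print(f"reorder: D {d_key[1]}+{d_key[2]} vs W {w_key[1]}+{w_key[2]}")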

Hope this helps.
Thanks,

Dennis

-- 
Dennis Yang
QNAP Systems, Inc.
Skype: qnap.dennis.yang
Email: dennisyang@qnap.com
Tel: (+886)-2-2393-5152 ext. 15018
Address: 13F., No.56, Sec. 1, Xinsheng S. Rd., Zhongzheng Dist., Taipei
City, Taiwan


* Re: Possible data corruption with dm-thin
  2016-06-21 10:40   ` Dennis Yang
@ 2016-06-21 10:46     ` Zdenek Kabelac
  0 siblings, 0 replies; 8+ messages in thread
From: Zdenek Kabelac @ 2016-06-21 10:46 UTC (permalink / raw)
  To: Dennis Yang, device-mapper development

On 21.6.2016 12:40, Dennis Yang wrote:
>
> [...]
>
> Hi Zdenek,
>
> We are using a customized dm-thin driver based on Linux 3.19.8 running
> on our QNAP NAS. Also, we create all our thin devices with lvm2.

Please try to reproduce with a recent kernel (4.6).

Regards

Zdenek

> [...]


* Re: Possible data corruption with dm-thin
  2016-06-21  7:56 Possible data corruption with dm-thin Dennis Yang
  2016-06-21  8:59 ` Zdenek Kabelac
@ 2016-06-24 13:55 ` Edward Thornber
  2016-06-27  9:32   ` Dennis Yang
  1 sibling, 1 reply; 8+ messages in thread
From: Edward Thornber @ 2016-06-24 13:55 UTC (permalink / raw)
  To: Dennis Yang; +Cc: device-mapper development

Hi Dennis,

On Tue, Jun 21, 2016 at 03:56:26PM +0800, Dennis Yang wrote:
> So my question is: does dm-thin have any mechanism to eliminate this race
> when a discarded block is reused right away by another device?

I'll try and recreate your scenario.  The transaction manager will not
reallocate a block that's been freed within a transaction, so that we
can always roll back.  So as long as there hasn't been a commit,
reallocation shouldn't be possible.  This is what normally guards
allocation, but in your case I think you may be onto something and we
may have to hold onto the data cell until the passdown discards are
complete.
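
To spell out that invariant, it behaves roughly like the toy model below
(just an illustration, not the real space-map/transaction-manager code):

# Toy model: blocks freed within a transaction only become reusable at commit.
class ToySpaceMap:
    def __init__(self, nr_blocks):
        self.allocatable = set(range(nr_blocks))   # free and safe to hand out
        self.pending_free = set()                  # freed since the last commit

    def alloc(self):
        # Never hands out a block freed in the current transaction,
        # so the on-disk metadata can always be rolled back.
        return self.allocatable.pop()

    def free(self, block):
        self.pending_free.add(block)

    def commit(self):
        # The commit boundary is where freed blocks become reusable.
        self.allocatable |= self.pending_free
        self.pending_free.clear()

sm = ToySpaceMap(nr_blocks=8)
b = sm.alloc()
sm.free(b)
assert b not in sm.allocatable     # cannot be reused within the transaction
sm.commit()
assert b in sm.allocatable         # reusable after commit -- Dennis's window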

- Joe


* Re: Possible data corruption with dm-thin
  2016-06-24 13:55 ` Edward Thornber
@ 2016-06-27  9:32   ` Dennis Yang
  2016-07-01 13:18     ` Edward Thornber
  0 siblings, 1 reply; 8+ messages in thread
From: Dennis Yang @ 2016-06-27  9:32 UTC (permalink / raw)
  To: Dennis Yang, device-mapper development



Hi Joe,

2016-06-24 21:55 GMT+08:00 Edward Thornber <thornber@redhat.com>:

> Hi Dennis,
>
> On Tue, Jun 21, 2016 at 03:56:26PM +0800, Dennis Yang wrote:
> > So my question is, does dm-thin have any mechanism to eliminate the race
> > when
> > discarded block is reused right away by another device?
>
> I'll try and recreate your scenario.  The transaction manager will not
> reallocate a block that's been freed within a transaction, so that we
> can always roll back.  So as long as there hasn't been a commit,
> reallocation shouldn't be possible.  This is what normally guards
> allocation, but in your case I think you may be onto something and we
> may have to hold onto the data cell until the passdown discards are
> complete.
>
> - Joe
>

In my experience, this issue is pretty hard to reproduce with a thin-pool
built directly on top of a regular hard disk or RAID, since I rarely observe
the DISCARD and WRITE coming from different thin devices being reordered
at the lower level. In my system, the thin-pool is built on top of another
device-mapper device providing data tiering functionality, which can delay
the DISCARD request a little and lead to such a reordering.

Based on the logs I have, I suspect there was a metadata commit before
another thin device reused the discarded block, while the DISCARD request
was still being processed by the lower layers. In this case, holding the
data cell until the passdown discards are complete seems to protect the
discarded block only from being reallocated by the same thin device that
allocated it in the first place, because only the write I/Os going to that
same device will be imprisoned in its cell.

In my opinion, the to-be-discarded block should only be freed after the
passdown discards have completed. I have written a patch along these lines
and have been running it on our systems since last week, and the issue has
not been seen for 72 hours. I am not very confident about whether this
patch causes any side effects, so I would highly appreciate it if you could
share your concerns with me.
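
To make the ordering I have in mind explicit, here is a self-contained toy
sketch (all names are made up for illustration; this is neither my actual
patch nor the real dm-thin code paths):

# Toy model contrasting when a discarded block becomes reusable.
class PoolModel:
    def __init__(self, nr_blocks):
        self.allocatable = set(range(nr_blocks))
        self.pending_discards = []                 # (block, on_complete)

    def alloc(self):
        return self.allocatable.pop()

    def passdown_discard(self, block, on_complete):
        # The lower layers may complete this much later (e.g. behind tiering).
        self.pending_discards.append((block, on_complete))

    def lower_layers_complete_all(self):
        for _, on_complete in self.pending_discards:
            on_complete()
        self.pending_discards.clear()

def discard_current(pool, block):
    # Today: the block is freed before the passed-down DISCARD completes.
    pool.allocatable.add(block)
    pool.passdown_discard(block, on_complete=lambda: None)

def discard_proposed(pool, block):
    # Proposed: free the block only once the passed-down DISCARD completes.
    pool.passdown_discard(block, on_complete=lambda: pool.allocatable.add(block))

pool = PoolModel(nr_blocks=4)

b = pool.alloc()
discard_current(pool, b)
print("current:  reusable before discard completes?", b in pool.allocatable)   # True
pool.lower_layers_complete_all()

b = pool.alloc()
discard_proposed(pool, b)
print("proposed: reusable before discard completes?", b in pool.allocatable)   # False
pool.lower_layers_complete_all()
print("proposed: reusable after discard completes? ", b in pool.allocatable)   # True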

Thanks for your reply,
Dennis




-- 
Dennis Yang
QNAP Systems, Inc.
Skype: qnap.dennis.yang
Email: dennisyang@qnap.com
Tel: (+886)-2-2393-5152 ext. 15018
Address: 13F., No.56, Sec. 1, Xinsheng S. Rd., Zhongzheng Dist., Taipei
City, Taiwan


* Re: Possible data corruption with dm-thin
  2016-06-27  9:32   ` Dennis Yang
@ 2016-07-01 13:18     ` Edward Thornber
  2016-09-06  4:51       ` Dennis Yang
  0 siblings, 1 reply; 8+ messages in thread
From: Edward Thornber @ 2016-07-01 13:18 UTC (permalink / raw)
  To: Dennis Yang; +Cc: device-mapper development

Hi Dennis,

Sorry to take so long getting back to you.  There was a bug in
the btree library that I mistakenly thought was caused by my fix for
your problem.  Anyway, could you please try to recreate your problem with
these two patches applied (the patches are against v4.7-rc5 plus my
usual thin-dev extras):

    https://github.com/jthornber/linux-2.6/commit/ebdd99fc12acd0e9178831549896633171d5af1d
    https://github.com/jthornber/linux-2.6/commit/c05989b59d091270c2bf7dd05bdb7a3575223529

Thanks,

- Joe


* Re: Possible data corruption with dm-thin
  2016-07-01 13:18     ` Edward Thornber
@ 2016-09-06  4:51       ` Dennis Yang
  0 siblings, 0 replies; 8+ messages in thread
From: Dennis Yang @ 2016-09-06  4:51 UTC (permalink / raw)
  To: Dennis Yang, device-mapper development



Hi Joe,

Sorry for the late reply. We have been fixing bugs and conducting many
experiments with heavy I/O to verify the correctness and stability of our
storage. The problem I mentioned in this thread can no longer be reproduced
with these two patches (I backported them to 3.19.8 and 4.2.8, and both
work fine).

Thanks for your help.

Dennis

2016-07-01 21:18 GMT+08:00 Edward Thornber <thornber@redhat.com>:

> Hi Dennis,
>
> Sorry to take so long getting back to you.  There was a bug in
> the btree library that I mistakenly thought was caused by my fix for
> your problem.  Anyway, could you please try to recreate your problem with
> these two patches applied (the patches are against v4.7-rc5 plus my
> usual thin-dev extras):
>
>     https://github.com/jthornber/linux-2.6/commit/ebdd99fc12acd0e9178831549896633171d5af1d
>     https://github.com/jthornber/linux-2.6/commit/c05989b59d091270c2bf7dd05bdb7a3575223529
>
> Thanks,
>
> - Joe
>



-- 
Dennis Yang
QNAP Systems, Inc.
Skype: qnap.dennis.yang
Email: dennisyang@qnap.com
Tel: (+886)-2-2393-5152 ext. 15018
Address: 13F., No.56, Sec. 1, Xinsheng S. Rd., Zhongzheng Dist., Taipei
City, Taiwan

