From: Dennis Yang
Subject: Re: Possible data corruption with dm-thin
Date: Tue, 21 Jun 2016 18:40:27 +0800
In-Reply-To: <54221955-a21e-5152-00e3-d6b78e0c78ef@redhat.com>
To: device-mapper development <dm-devel@redhat.com>
List-Id: dm-devel.ids

2016-06-21 16:59 GMT+08:00 Zdenek Kabelac <zkabelac@redhat.com>:
> On 21.6.2016 09:56, Dennis Yang wrote:
>> Hi,
>>
>> We have been dealing with a data corruption issue when running an I/O
>> test suite of our own against multiple thin devices built on top of a
>> thin-pool. In the test suite, we create multiple thin devices,
>> continually write to them, check the file checksums, and, if no
>> checksum error occurs, delete all files and issue DISCARDs to reclaim
>> space.
>>
>> We found one data access pattern that can corrupt data.
>> Suppose there are two thin devices A and B, and device A receives a
>> DISCARD bio for a physical (pool) block 100. Device A will quiesce all
>> previous I/O and hold both the virtual and physical data cells before
>> it actually removes the corresponding data mapping. After the mapping
>> is removed, both data cells are released and the DISCARD bio is passed
>> down to the underlying devices. If device B tries to allocate a new
>> block at that very moment, it can reuse block 100, which was just
>> discarded by device A (assuming a metadata commit has been triggered,
>> since a block cannot be reused within the same transaction). In this
>> case, we have a race between the WRITE bio coming from device B and
>> the DISCARD bio coming from device A. If the WRITE bio completes
>> before the DISCARD bio, device B will see a checksum error.
>>
>> So my question is: does dm-thin have any mechanism to eliminate this
>> race when a discarded block is reused right away by another device?
>>
>> Any help would be appreciated.
>> Thanks,
>
> Please provide the version of the kernel and surrounding tools (OS
> release version). Also, are you using 'lvm2', or do you use
> 'dmsetup/ioctl' directly? (In the latter case we would need to see the
> exact sequencing of operations.)
>
> Also please provide a reproducer script.
>
> Regards
>
> Zdenek
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

Hi Zdenek,

We are using a customized dm-thin driver based on Linux 3.19.8 running on
our QNAP NAS. Also, we create all our thin devices with "lvm2". I am
afraid I cannot provide a reproducer script, since we reproduce this by
running the I/O stress test suite from Windows hosts against all thin
devices exported to them via Samba and iSCSI.
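To make the suspected ordering concrete, here is a minimal userspace sketch
of the race. This is not dm-thin code; the two threads, the in-memory
pool_block buffer, and the sleep-based timing are all invented for
illustration. It only models the case where a DISCARD that was queued first
completes after a WRITE that reused the same pool block, so a later read
sees clobbered data.

/*
 * Userspace illustration only -- NOT dm-thin code.  Thread A plays the
 * role of device A's passed-down DISCARD, thread B the role of device B
 * reusing the same pool block and writing to it.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

static unsigned char pool_block[BLOCK_SIZE];   /* shared pool block "100" */

/* Device A: the DISCARD was queued first but is still in flight. */
static void *discard_thread(void *arg)
{
    (void)arg;
    usleep(1000);                       /* completes late...               */
    memset(pool_block, 0, BLOCK_SIZE);  /* ...and clobbers the reused block */
    return NULL;
}

/* Device B: reuses the block, writes new data, then reads it back later. */
static void *write_thread(void *arg)
{
    (void)arg;
    memset(pool_block, 0xAB, BLOCK_SIZE);   /* WRITE completes first        */
    usleep(2000);                           /* later read-back, e.g. by smbd */
    if (pool_block[0] != 0xAB)
        printf("checksum error: data clobbered by stale DISCARD\n");
    else
        printf("data intact\n");
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, discard_thread, NULL);
    pthread_create(&b, NULL, write_thread, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

In the real pool the ordering is of course decided by the bios in flight
rather than by sleeps; the sketch is only meant to show why there appears
to be nothing ordering the passed-down DISCARD against the new WRITE once
the block has been handed to device B.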
The following is the trace of the thin-pool we dumped via blktrace. The data
corruption takes place from sector address 310150144 to 310150144 + 832.

252,19   1   154916   184.875465510 29959  Q   W 310150144 + 1024 [kworker/u8:0]
252,19   0   205964   185.496309521     0  C   W 310150144 + 1024 [0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
First, the pool receives a 1024-sector WRITE bio for which a pool block had
been allocated.

252,19   3   353811   656.542481344 30280  Q   D 310150144 + 1024 [kworker/u8:8]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The pool receives a 1024-sector (thin block size) DISCARD bio passed down by
one of the thin devices.

252,19   1   495204   656.558652936 30280  Q   W 310150144 + 832 [kworker/u8:8]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Another thin device passed down an 832-sector WRITE bio to the exact same place.

252,19   3   353820   656.564140283     0  C   W 310150144 + 832 [0]
252,19   0   697455   656.770883592     0  C   D 310150144 + 1024 [0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Although the DISCARD bio was queued before the WRITE bio, their completions
were reordered, which could corrupt the data.

252,19   1   515212   684.425478220 20751  A   R 310150144 + 80 <- (252,22) 28932096
252,19   1   515213   684.425478325 20751  Q   R 310150144 + 80 [smbd]
252,19   0   725274   684.425741079 23937  C   R 310150144 + 80 [0]

Hope this helps.
Thanks,

Dennis

--
Dennis Yang
QNAP Systems, Inc.
Skype: qnap.dennis.yang
Email: dennisyang@qnap.com
Tel: (+886)-2-2393-5152 ext. 15018
Address: 13F., No.56, Sec. 1, Xinsheng S. Rd., Zhongzheng Dist., Taipei City, Taiwan