* Massive filesystem corruption after balance + fstrim on Linux 5.1.2
@ 2019-05-16 22:16 Michael Laß
2019-05-16 23:41 ` Qu Wenruo
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Michael Laß @ 2019-05-16 22:16 UTC (permalink / raw)
To: linux-btrfs
Hi.
Today I managed to destroy my btrfs root filesystem using the following
sequence of commands:
sync
btrfs balance start -dusage 75 -musage 75 /
sync
fstrim -v /
Shortly after, the kernel spewed out lots of messages like the following:
BTRFS warning (device dm-5): csum failed root 257 ino 16634085 off
21504884736 csum 0xd47cc2a2 expected csum 0xcebd791b mirror 1
A btrfs scrub shows roughly 27000 unrecoverable csum errors and lots of
data on that system is not accessible anymore.
I'm running Linux 5.1.2 on Arch Linux. Their kernel pretty much
matches upstream with only one non-btrfs-related patch on top:
https://git.archlinux.org/linux.git/log/?h=v5.1.2-arch1
The btrfs file system was mounted with compress=lzo. The underlying
storage device is a LUKS volume, on top of an LVM logical volume and the
underlying physical volume is a Samsung 830 SSD. The LUKS volume is
opened with the option "discard" so that trim commands are passed to the
device.
SMART shows no errors on the SSD itself. I never had issues with
balancing or trimming the btrfs volume before, even the exact same
sequence of commands as above never caused any issues. Until now.
Does anyone have an idea of what happened here? Could this be a bug in
btrfs?
I have made a copy of that volume so I can get further information out
of it if necessary. I already ran btrfs check on it (using the slightly
outdated version 4.19.1) and it did not show any errors. So it seems
like only data has been corrupted.
Please tell me if I can provide any more useful information on this.
Cheers,
Michael
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-16 22:16 Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Michael Laß
@ 2019-05-16 23:41 ` Qu Wenruo
2019-05-16 23:42 ` Chris Murphy
2019-05-28 12:36 ` Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Christoph Anton Mitterer
2 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2019-05-16 23:41 UTC (permalink / raw)
To: Michael Laß, linux-btrfs
On 2019/5/17 6:16 AM, Michael Laß wrote:
> Hi.
>
> Today I managed to destroy my btrfs root filesystem using the following
> sequence of commands:
I don't have a filled root fs, but I have a btrfs filesystem holding a
compiled Linux kernel tree, filling 5G of a total 10G.
I'm using that fs in my VM to try to reproduce.
>
> sync
> btrfs balance start -dusage 75 -musage 75 /
> sync
> fstrim -v /
I tried the same, though I used --full-balance for that balance to ensure
all chunks get relocated.
>
> Shortly after, the kernel spew out lots of messages like the following:
>
> BTRFS warning (device dm-5): csum failed root 257 ino 16634085 off
> 21504884736 csum 0xd47cc2a2 expected csum 0xcebd791b mirror 1
>
> A btrfs scrub shows roughly 27000 unrecoverable csum errors and lots of
> data on that system is not accessible anymore.
After the above operations, scrub reported no errors:
$ sudo btrfs scrub start -B /mnt/btrfs/
scrub done for 1dd1bcf6-4392-4be1-8c0e-0bfd16321ade
scrub started at Fri May 17 07:34:26 2019 and finished after 00:00:02
total bytes scrubbed: 4.19GiB with 0 errors
>
> I'm running Linux 5.1.2 on an Arch Linux. Their kernel pretty much
> matches upstream with only one non btrfs-related patch on top:
> https://git.archlinux.org/linux.git/log/?h=v5.1.2-arch1
>
> The btrfs file system was mounted with compress=lzo. The underlying
> storage device is a LUKS volume, on top of an LVM logical volume and the
> underlying physical volume is a Samsung 830 SSD. The LUKS volume is
> opened with the option "discard" so that trim commands are passed to the
> device.
I'm not sure whether LUKS or btrfs is to blame.
In my test environment, I'm using LVM but without LUKS.
My LVM setup has issue_discards = 1 set.
Would you please try to verify the behavior on a plain partition to rule
out possible interference?
Thanks,
Qu
>
> SMART shows no errors on the SSD itself. I never had issues with
> balancing or trimming the btrfs volume before, even the exact same
> sequence of commands as above never caused any issues. Until now.
>
> Does anyone have an idea of what happened here? Could this be a bug in
> btrfs?
>
> I have made a copy of that volume so I can get further information out
> of it if necessary. I already ran btrfs check on it (using the slightly
> outdated version 4.19.1) and it did not show any errors. So it seems
> like only data has been corrupted.
>
> Please tell me if I can provide any more useful information on this.
>
> Cheers,
> Michael
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-16 22:16 Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Michael Laß
2019-05-16 23:41 ` Qu Wenruo
@ 2019-05-16 23:42 ` Chris Murphy
2019-05-17 17:37 ` Michael Laß
2019-05-28 12:36 ` Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Christoph Anton Mitterer
2 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2019-05-16 23:42 UTC (permalink / raw)
To: Michael Laß; +Cc: Btrfs BTRFS
On Thu, May 16, 2019 at 4:26 PM Michael Laß <bevan@bi-co.net> wrote:
>
> Hi.
>
> Today I managed to destroy my btrfs root filesystem using the following
> sequence of commands:
>
> sync
> btrfs balance start -dusage 75 -musage 75 /
> sync
> fstrim -v /
>
> Shortly after, the kernel spew out lots of messages like the following:
>
> BTRFS warning (device dm-5): csum failed root 257 ino 16634085 off
> 21504884736 csum 0xd47cc2a2 expected csum 0xcebd791b mirror 1
>
> A btrfs scrub shows roughly 27000 unrecoverable csum errors and lots of
> data on that system is not accessible anymore.
>
> I'm running Linux 5.1.2 on an Arch Linux. Their kernel pretty much
> matches upstream with only one non btrfs-related patch on top:
> https://git.archlinux.org/linux.git/log/?h=v5.1.2-arch1
>
> The btrfs file system was mounted with compress=lzo. The underlying
> storage device is a LUKS volume, on top of an LVM logical volume and the
> underlying physical volume is a Samsung 830 SSD. The LUKS volume is
> opened with the option "discard" so that trim commands are passed to the
> device.
>
> SMART shows no errors on the SSD itself. I never had issues with
> balancing or trimming the btrfs volume before, even the exact same
> sequence of commands as above never caused any issues. Until now.
>
> Does anyone have an idea of what happened here? Could this be a bug in
> btrfs?
I suspect there's a regression somewhere; the question is where. I've used
a Samsung 830 SSD extensively with Btrfs and fstrim in the past, but
without dm-crypt. I'm using Btrfs extensively with dm-crypt but on
hard drives. So I can't test this.
Btrfs balance is supposed to be COW: a block group is not
dereferenced until it is copied successfully and the metadata is updated.
So it sounds like the fstrim happened before the metadata was updated.
But I don't see how that's possible in normal operation even without a
sync, let alone with the sync.
The most reliable way to test it is to keep everything else the same, do
a new mkfs.btrfs, and try to reproduce the problem. And then do a
bisect. That will find it for sure, whether it's btrfs or something
else that changed in the kernel. But it's also a bit tedious.
I'm not sure how to test this with any other filesystem on top of your
existing storage stack instead of btrfs, to see if it's btrfs or
something else. And you'll still have to do a lot of iteration. So it
doesn't make things that much easier than doing a kernel bisect.
Neither ext4 nor XFS has block group moves like Btrfs does. LVM does,
however, with pvmove. But that makes the testing more complicated and
introduces more factors. So... I still vote for bisect.
But even if you can't bisect, if you can reproduce, that might help
someone else who can do the bisect.
Your stack looks like this?
Btrfs
LUKS/dmcrypt
LVM
Samsung SSD
--
Chris Murphy
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-16 23:42 ` Chris Murphy
@ 2019-05-17 17:37 ` Michael Laß
2019-05-18 4:09 ` Chris Murphy
0 siblings, 1 reply; 24+ messages in thread
From: Michael Laß @ 2019-05-17 17:37 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS
> On 17.05.2019 at 01:42, Chris Murphy <lists@colorremedies.com> wrote:
>
> Btrfs balance is supposed to be COW. So a block group is not
> dereferenced until it is copied successfully and metadata is updated.
> So it sounds like the fstrim happened before the metadata was updated.
> But I don't see how that's possible in normal operation even without a
> sync, let alone with the sync.
Balance is indeed not to blame here. See below.
> The most reliable way to test it, ideally keep everything the same, do
> a new mkfs.btrfs, and try to reproduce the problem. And then do a
> bisect. That for sure will find it, whether it's btrfs or something
> else that's changed in the kernel. But it's also a bit tedious.
>
> I'm not sure how to test this with any other filesystem on top of your
> existing storage stack instead of btrfs, to see if it's btrfs or
> something else. And you'll still have to do a lot of iteration. So it
> doesn't make things that much easier than doing a kernel bisect.
> Neither ext4 nor XFS have block group move like Btrfs does. LVM does
> however, with pvmove. But that makes the testing more complicated,
> introduces more factors. So...I still vote for bisect.
>
> But even if you can't bisect, if you can reproduce, that might help
> someone else who can do the bisect.
I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
fstrim: /: FITRIM ioctl failed: Input/output error
Now it gets interesting: after this, the btrfs file system was fine. However, two other LVM logical volumes formatted with ext4 were destroyed. I cannot reproduce this issue with an older Linux 4.19 live CD, so I assume that it is not an issue with the SSD itself. I’ll start bisecting now. It could take a while, since every “successful” (i.e., destructive) test requires me to recreate the system.
> Your stack looks like this?
>
> Btrfs
> LUKS/dmcrypt
> LVM
> Samsung SSD
To be precise, there’s an MBR partition in the game as well:
Btrfs
LUKS/dmcrypt
LVM
MBR partition
Samsung SSD
Cheers,
Michael
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-17 17:37 ` Michael Laß
@ 2019-05-18 4:09 ` Chris Murphy
2019-05-18 9:18 ` Michael Laß
0 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2019-05-18 4:09 UTC (permalink / raw)
To: Michael Laß; +Cc: Btrfs BTRFS
On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
>
>
> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
>
> fstrim: /: FITRIM ioctl failed: Input/output error
Huh. Any kernel message at the same time? I would expect any fstrim
user space error message to also have a kernel message. Any i/o error
suggests some kind of storage stack failure - which could be hardware
or software, you can't know without seeing the kernel messages.
--
Chris Murphy
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-18 4:09 ` Chris Murphy
@ 2019-05-18 9:18 ` Michael Laß
2019-05-18 9:31 ` Roman Mamedov
2019-05-18 10:26 ` Qu Wenruo
0 siblings, 2 replies; 24+ messages in thread
From: Michael Laß @ 2019-05-18 9:18 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS
> On 18.05.2019 at 06:09, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
>>
>>
>> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
>>
>> fstrim: /: FITRIM ioctl failed: Input/output error
>
> Huh. Any kernel message at the same time? I would expect any fstrim
> user space error message to also have a kernel message. Any i/o error
> suggests some kind of storage stack failure - which could be hardware
> or software, you can't know without seeing the kernel messages.
I missed that. The kernel messages are:
attempt to access beyond end of device
sda1: rw=16387, want=252755893, limit=250067632
BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
Here is some more information on the partitions and LVM physical segments:
fdisk -l /dev/sda:
Device Boot Start End Sectors Size Id Type
/dev/sda1 * 2048 250069679 250067632 119.2G 8e Linux LVM
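The numbers line up: as an illustrative sanity check (constants copied from the fdisk output and the kernel warning above), the rejected request reaches well past the end of sda1:

```python
# Sector numbers from "fdisk -l /dev/sda" and the kernel warning.
start, end = 2048, 250069679     # sda1 boundaries (512-byte sectors)
limit = end - start + 1          # partition size in sectors
want = 252755893                 # sector the rejected request wanted

assert limit == 250067632        # matches the "limit=" in the warning
overshoot = want - limit         # how far past the device end
print(overshoot)                          # 2688261 sectors
print(round(overshoot * 512 / 2**30, 2))  # ≈ 1.28 GiB beyond sda1
```

So the discard request addressed roughly 1.28 GiB beyond the end of the partition.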
pvdisplay -m:
--- Physical volume ---
PV Name /dev/sda1
VG Name vg_system
PV Size 119.24 GiB / not usable <22.34 MiB
Allocatable yes (but full)
PE Size 32.00 MiB
Total PE 3815
Free PE 0
Allocated PE 3815
PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
--- Physical Segments ---
Physical extent 0 to 1248:
Logical volume /dev/vg_system/btrfs
Logical extents 2231 to 3479
Physical extent 1249 to 1728:
Logical volume /dev/vg_system/btrfs
Logical extents 640 to 1119
Physical extent 1729 to 1760:
Logical volume /dev/vg_system/grml-images
Logical extents 0 to 31
Physical extent 1761 to 2016:
Logical volume /dev/vg_system/swap
Logical extents 0 to 255
Physical extent 2017 to 2047:
Logical volume /dev/vg_system/btrfs
Logical extents 3480 to 3510
Physical extent 2048 to 2687:
Logical volume /dev/vg_system/btrfs
Logical extents 0 to 639
Physical extent 2688 to 3007:
Logical volume /dev/vg_system/btrfs
Logical extents 1911 to 2230
Physical extent 3008 to 3320:
Logical volume /dev/vg_system/btrfs
Logical extents 1120 to 1432
Physical extent 3321 to 3336:
Logical volume /dev/vg_system/boot
Logical extents 0 to 15
Physical extent 3337 to 3814:
Logical volume /dev/vg_system/btrfs
Logical extents 1433 to 1910
Would btrfs even be able to accidentally trim parts of other LVs or does this clearly hint towards a LVM/dm issue? Is there an easy way to somehow trace the trim through the different layers so one can see where it goes wrong?
Cheers,
Michael
PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-18 9:18 ` Michael Laß
@ 2019-05-18 9:31 ` Roman Mamedov
2019-05-18 10:09 ` Michael Laß
2019-05-18 10:26 ` Qu Wenruo
1 sibling, 1 reply; 24+ messages in thread
From: Roman Mamedov @ 2019-05-18 9:31 UTC (permalink / raw)
To: Michael Laß; +Cc: Chris Murphy, Btrfs BTRFS
On Sat, 18 May 2019 11:18:31 +0200
Michael Laß <bevan@bi-co.net> wrote:
>
> > On 18.05.2019 at 06:09, Chris Murphy <lists@colorremedies.com> wrote:
> >
> > On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
> >>
> >>
> >> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
> >>
> >> fstrim: /: FITRIM ioctl failed: Input/output error
> >
> > Huh. Any kernel message at the same time? I would expect any fstrim
> > user space error message to also have a kernel message. Any i/o error
> > suggests some kind of storage stack failure - which could be hardware
> > or software, you can't know without seeing the kernel messages.
>
> I missed that. The kernel messages are:
>
> attempt to access beyond end of device
> sda1: rw=16387, want=252755893, limit=250067632
> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>
> Here are some more information on the partitions and LVM physical segments:
>
> fdisk -l /dev/sda:
>
> Device Boot Start End Sectors Size Id Type
> /dev/sda1 * 2048 250069679 250067632 119.2G 8e Linux LVM
>
> pvdisplay -m:
>
> --- Physical volume ---
> PV Name /dev/sda1
> VG Name vg_system
> PV Size 119.24 GiB / not usable <22.34 MiB
> Allocatable yes (but full)
> PE Size 32.00 MiB
> Total PE 3815
> Free PE 0
> Allocated PE 3815
> PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
Such a peculiar physical layout suggests you resize your LVs up and down a lot;
is there any chance you recently shrunk the LV without first
resizing down all the layers above it (Btrfs and LUKS) in the proper order?
> --- Physical Segments ---
> Physical extent 0 to 1248:
> Logical volume /dev/vg_system/btrfs
> Logical extents 2231 to 3479
> Physical extent 1249 to 1728:
> Logical volume /dev/vg_system/btrfs
> Logical extents 640 to 1119
> Physical extent 1729 to 1760:
> Logical volume /dev/vg_system/grml-images
> Logical extents 0 to 31
> Physical extent 1761 to 2016:
> Logical volume /dev/vg_system/swap
> Logical extents 0 to 255
> Physical extent 2017 to 2047:
> Logical volume /dev/vg_system/btrfs
> Logical extents 3480 to 3510
> Physical extent 2048 to 2687:
> Logical volume /dev/vg_system/btrfs
> Logical extents 0 to 639
> Physical extent 2688 to 3007:
> Logical volume /dev/vg_system/btrfs
> Logical extents 1911 to 2230
> Physical extent 3008 to 3320:
> Logical volume /dev/vg_system/btrfs
> Logical extents 1120 to 1432
> Physical extent 3321 to 3336:
> Logical volume /dev/vg_system/boot
> Logical extents 0 to 15
> Physical extent 3337 to 3814:
> Logical volume /dev/vg_system/btrfs
> Logical extents 1433 to 1910
>
>
> Would btrfs even be able to accidentally trim parts of other LVs or does this clearly hint towards a LVM/dm issue? Is there an easy way to somehow trace the trim through the different layers so one can see where it goes wrong?
>
> Cheers,
> Michael
>
> PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.
--
With respect,
Roman
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-18 9:31 ` Roman Mamedov
@ 2019-05-18 10:09 ` Michael Laß
0 siblings, 0 replies; 24+ messages in thread
From: Michael Laß @ 2019-05-18 10:09 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Chris Murphy, Btrfs BTRFS
> On 18.05.2019 at 11:31, Roman Mamedov <rm@romanrm.net> wrote:
>
> On Sat, 18 May 2019 11:18:31 +0200
> Michael Laß <bevan@bi-co.net> wrote:
>>
>> pvdisplay -m:
>>
>> --- Physical volume ---
>> PV Name /dev/sda1
>> VG Name vg_system
>> PV Size 119.24 GiB / not usable <22.34 MiB
>> Allocatable yes (but full)
>> PE Size 32.00 MiB
>> Total PE 3815
>> Free PE 0
>> Allocated PE 3815
>> PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
>
> Such peculiar physical layout suggests you resize your LVs up and down a lot,
> is there any chance you could have recently shrinked the LV without first
> resizing down all the layers above it (Btrfs and LUKS) in proper order?
This is mostly a result of my transition from several ext4 volumes to one btrfs volume, during which I extended the new btrfs volume several times. I quickly checked my shell history and it was something like this:
cryptsetup luksFormat /dev/mapper/vg_system-btrfs
cryptsetup luksOpen --allow-discards /dev/mapper/vg_system-btrfs cryptsystem
mkfs.btrfs -L system /dev/mapper/cryptsystem
lvextend -l100%free /dev/vg_system/btrfs
cryptsetup resize cryptsystem
btrfs fi resize max /
The previous ext4 volumes had been resized a couple of times before as well. However, the last resize operation was in 2015 and has never caused any issues since.
The btrfs file system which I now use to reproduce the issue is freshly created. So if there is any fallout from these resize operations, it would have to be in dm-crypt or LVM. Just to double-check, I compared the output of “cryptsetup status” and “lvdisplay”:
lvdisplay shows me that vg_system/btrfs uses 3511 LE. Each of those is 32MiB which makes
3511 * 32 * 1024 * 1024 / 512 = 230096896 sectors
cryptsetup shows me that the volume has a size of 230092800 sectors and an offset of 4096 which makes
230092800 + 4096 = 230096896 sectors
So this seems to match perfectly.
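The comparison above can be written out explicitly (illustrative Python; all numbers come from the lvdisplay and cryptsetup output as described):

```python
SECTOR = 512
LE = 32 * 1024 * 1024                 # one 32 MiB logical extent, bytes

lv_sectors = 3511 * LE // SECTOR      # vg_system/btrfs, per lvdisplay
assert lv_sectors == 230096896

luks_payload = 230092800              # sectors, per "cryptsetup status"
luks_offset = 4096                    # LUKS header offset, sectors
assert luks_payload + luks_offset == lv_sectors  # sizes match exactly
```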
>> --- Physical Segments ---
>> Physical extent 0 to 1248:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 2231 to 3479
>> Physical extent 1249 to 1728:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 640 to 1119
>> Physical extent 1729 to 1760:
>> Logical volume /dev/vg_system/grml-images
>> Logical extents 0 to 31
>> Physical extent 1761 to 2016:
>> Logical volume /dev/vg_system/swap
>> Logical extents 0 to 255
>> Physical extent 2017 to 2047:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 3480 to 3510
>> Physical extent 2048 to 2687:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 0 to 639
>> Physical extent 2688 to 3007:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 1911 to 2230
>> Physical extent 3008 to 3320:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 1120 to 1432
>> Physical extent 3321 to 3336:
>> Logical volume /dev/vg_system/boot
>> Logical extents 0 to 15
>> Physical extent 3337 to 3814:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 1433 to 1910
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-18 9:18 ` Michael Laß
2019-05-18 9:31 ` Roman Mamedov
@ 2019-05-18 10:26 ` Qu Wenruo
2019-05-19 19:55 ` fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss Michael Laß
1 sibling, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2019-05-18 10:26 UTC (permalink / raw)
To: Michael Laß, Chris Murphy; +Cc: Btrfs BTRFS
On 2019/5/18 5:18 PM, Michael Laß wrote:
>
>> On 18.05.2019 at 06:09, Chris Murphy <lists@colorremedies.com> wrote:
>>
>> On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
>>>
>>>
>>> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
>>>
>>> fstrim: /: FITRIM ioctl failed: Input/output error
>>
>> Huh. Any kernel message at the same time? I would expect any fstrim
>> user space error message to also have a kernel message. Any i/o error
>> suggests some kind of storage stack failure - which could be hardware
>> or software, you can't know without seeing the kernel messages.
>
> I missed that. The kernel messages are:
>
> attempt to access beyond end of device
> sda1: rw=16387, want=252755893, limit=250067632
> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>
> Here are some more information on the partitions and LVM physical segments:
>
> fdisk -l /dev/sda:
>
> Device Boot Start End Sectors Size Id Type
> /dev/sda1 * 2048 250069679 250067632 119.2G 8e Linux LVM
>
> pvdisplay -m:
>
> --- Physical volume ---
> PV Name /dev/sda1
> VG Name vg_system
> PV Size 119.24 GiB / not usable <22.34 MiB
> Allocatable yes (but full)
> PE Size 32.00 MiB
> Total PE 3815
> Free PE 0
> Allocated PE 3815
> PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
>
> --- Physical Segments ---
> Physical extent 0 to 1248:
> Logical volume /dev/vg_system/btrfs
> Logical extents 2231 to 3479
> Physical extent 1249 to 1728:
> Logical volume /dev/vg_system/btrfs
> Logical extents 640 to 1119
> Physical extent 1729 to 1760:
> Logical volume /dev/vg_system/grml-images
> Logical extents 0 to 31
> Physical extent 1761 to 2016:
> Logical volume /dev/vg_system/swap
> Logical extents 0 to 255
> Physical extent 2017 to 2047:
> Logical volume /dev/vg_system/btrfs
> Logical extents 3480 to 3510
> Physical extent 2048 to 2687:
> Logical volume /dev/vg_system/btrfs
> Logical extents 0 to 639
> Physical extent 2688 to 3007:
> Logical volume /dev/vg_system/btrfs
> Logical extents 1911 to 2230
> Physical extent 3008 to 3320:
> Logical volume /dev/vg_system/btrfs
> Logical extents 1120 to 1432
> Physical extent 3321 to 3336:
> Logical volume /dev/vg_system/boot
> Logical extents 0 to 15
> Physical extent 3337 to 3814:
> Logical volume /dev/vg_system/btrfs
> Logical extents 1433 to 1910
>
>
> Would btrfs even be able to accidentally trim parts of other LVs or does this clearly hint towards a LVM/dm issue?
I can't say for sure, but (at least on the latest kernel) btrfs performs a
lot of extra mount-time self checks, including checking chunk stripes
against the underlying device, so the possibility shouldn't be that high
for btrfs.
> Is there an easy way to somehow trace the trim through the different layers so one can see where it goes wrong?
Sure, you could use dm-log-writes.
It records all reads and writes (including trims) for later replay.
So in your case, you can build the storage stack like:
Btrfs
<dm-log-writes>
LUKS/dmcrypt
LVM
MBR partition
Samsung SSD
Then replay the log (using src/log-writes/replay-log from fstests) with
verbose output; that way you can verify every trim operation against the
dmcrypt device size.
If all trim are fine, then move the dm-log-writes a layer lower, until
you find which layer is causing the problem.
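For reference, a rough sketch of what inserting that layer could look like. The device names and the exact replay-log invocation are assumptions here; see the kernel's dm-log-writes documentation and fstests' src/log-writes for the authoritative usage:

```shell
# Hypothetical device names - adapt to the real stack.
DEV=/dev/mapper/cryptsystem        # layer to interpose on
LOGDEV=/dev/sdb1                   # scratch device receiving the log
SIZE=$(blockdev --getsz "$DEV")    # target size in 512-byte sectors

# Stack a log-writes target on top of the dm-crypt device, then
# mount btrfs from /dev/mapper/logged instead of $DEV.
dmsetup create logged --table "0 $SIZE log-writes $DEV $LOGDEV"

# Reproduce (mount, fstrim, unmount), mark the log, then replay it
# verbosely so each discard can be checked against the device size.
dmsetup message logged 0 mark after-fstrim
./src/log-writes/replay-log --log "$LOGDEV" --replay "$DEV" -v
```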
Thanks,
Qu
>
> Cheers,
> Michael
>
> PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.
>
* fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-18 10:26 ` Qu Wenruo
@ 2019-05-19 19:55 ` Michael Laß
2019-05-20 11:38 ` [dm-devel] " Michael Laß
[not found] ` <CAK-xaQYPs62v971zm1McXw_FGzDmh_vpz3KLEbxzkmrsSgTfXw@mail.gmail.com>
0 siblings, 2 replies; 24+ messages in thread
From: Michael Laß @ 2019-05-19 19:55 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Chris Murphy, Btrfs BTRFS, dm-devel
CC'ing dm-devel, as this seems to be a dm-related issue. Short summary for new readers:
On Linux 5.1 (tested up to 5.1.3), fstrim may discard too many blocks, leading to data loss. I have the following storage stack:
btrfs
dm-crypt (LUKS)
LVM logical volume
LVM single physical volume
MBR partition
Samsung 830 SSD
The mapping between logical volumes and physical segments is a bit mixed up; see below for the output of “pvdisplay -m”. When I issue fstrim on the mounted btrfs volume, I get the following kernel messages:
attempt to access beyond end of device
sda1: rw=16387, want=252755893, limit=250067632
BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
At the same time, other logical volumes on the same physical volume are destroyed. Also the btrfs volume itself may be damaged (this seems to depend on the actual usage).
I can easily reproduce this issue locally and I’m currently bisecting. So far I have narrowed the range of commits down to:
Good: 92fff53b7191cae566be9ca6752069426c7f8241
Bad: 225557446856448039a9e495da37b72c20071ef2
In this range of commits, there are only dm-related changes.
So far, I have not reproduced the issue with other file systems or a simplified stack. I first want to continue bisecting but this may take another day.
> On 18.05.2019 at 12:26, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> On 2019/5/18 5:18 PM, Michael Laß wrote:
>>
>>> On 18.05.2019 at 06:09, Chris Murphy <lists@colorremedies.com> wrote:
>>>
>>> On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
>>>>
>>>>
>>>> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
>>>>
>>>> fstrim: /: FITRIM ioctl failed: Input/output error
>>>
>>> Huh. Any kernel message at the same time? I would expect any fstrim
>>> user space error message to also have a kernel message. Any i/o error
>>> suggests some kind of storage stack failure - which could be hardware
>>> or software, you can't know without seeing the kernel messages.
>>
>> I missed that. The kernel messages are:
>>
>> attempt to access beyond end of device
>> sda1: rw=16387, want=252755893, limit=250067632
>> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>>
>> Here are some more information on the partitions and LVM physical segments:
>>
>> fdisk -l /dev/sda:
>>
>> Device Boot Start End Sectors Size Id Type
>> /dev/sda1 * 2048 250069679 250067632 119.2G 8e Linux LVM
>>
>> pvdisplay -m:
>>
>> --- Physical volume ---
>> PV Name /dev/sda1
>> VG Name vg_system
>> PV Size 119.24 GiB / not usable <22.34 MiB
>> Allocatable yes (but full)
>> PE Size 32.00 MiB
>> Total PE 3815
>> Free PE 0
>> Allocated PE 3815
>> PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
>>
>> --- Physical Segments ---
>> Physical extent 0 to 1248:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 2231 to 3479
>> Physical extent 1249 to 1728:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 640 to 1119
>> Physical extent 1729 to 1760:
>> Logical volume /dev/vg_system/grml-images
>> Logical extents 0 to 31
>> Physical extent 1761 to 2016:
>> Logical volume /dev/vg_system/swap
>> Logical extents 0 to 255
>> Physical extent 2017 to 2047:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 3480 to 3510
>> Physical extent 2048 to 2687:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 0 to 639
>> Physical extent 2688 to 3007:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 1911 to 2230
>> Physical extent 3008 to 3320:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 1120 to 1432
>> Physical extent 3321 to 3336:
>> Logical volume /dev/vg_system/boot
>> Logical extents 0 to 15
>> Physical extent 3337 to 3814:
>> Logical volume /dev/vg_system/btrfs
>> Logical extents 1433 to 1910
>>
>>
>> Would btrfs even be able to accidentally trim parts of other LVs or does this clearly hint towards a LVM/dm issue?
>
> I can't speak sure, but (at least for latest kernel) btrfs has a lot of
> extra mount time self check, including chunk stripe check against
> underlying device, thus the possibility shouldn't be that high for btrfs.
Indeed, bisecting the issue led me to a range of commits that only contains dm-related and no btrfs-related changes. So I assume this is a bug in dm.
>> Is there an easy way to somehow trace the trim through the different layers so one can see where it goes wrong?
>
> Sure, you could use dm-log-writes.
> It will record all read/write (including trim) for later replay.
>
> So in your case, you can build the storage stack like:
>
> Btrfs
> <dm-log-writes>
> LUKS/dmcrypt
> LVM
> MBR partition
> Samsung SSD
>
> Then replay the log (using src/log-write/replay-log in fstests) with
> verbose output, you can verify every trim operation against the dmcrypt
> device size.
>
> If all trim are fine, then move the dm-log-writes a layer lower, until
> you find which layer is causing the problem.
That sounds like a plan! However, I first want to continue bisecting as I am afraid to lose my reproducer by changing parts of my storage stack.
Cheers,
Michael
>
> Thanks,
> Qu
>>
>> Cheers,
>> Michael
>>
>> PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.
* Re: [dm-devel] fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-19 19:55 ` fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss Michael Laß
@ 2019-05-20 11:38 ` Michael Laß
2019-05-21 16:46 ` Michael Laß
[not found] ` <CAK-xaQYPs62v971zm1McXw_FGzDmh_vpz3KLEbxzkmrsSgTfXw@mail.gmail.com>
1 sibling, 1 reply; 24+ messages in thread
From: Michael Laß @ 2019-05-20 11:38 UTC (permalink / raw)
To: dm-devel; +Cc: Chris Murphy, Btrfs BTRFS, Qu Wenruo
> On 19.05.2019 at 21:55, Michael Laß <bevan@bi-co.net> wrote:
>
> CC'ing dm-devel, as this seems to be a dm-related issue. Short summary for new readers:
>
> On Linux 5.1 (tested up to 5.1.3), fstrim may discard too many blocks, leading to data loss. I have the following storage stack:
>
> btrfs
> dm-crypt (LUKS)
> LVM logical volume
> LVM single physical volume
> MBR partition
> Samsung 830 SSD
>
> The mapping between logical volumes and physical segments is a bit mixed up. See below for the output for “pvdisplay -m”. When I issue fstrim on the mounted btrfs volume, I get the following kernel messages:
>
> attempt to access beyond end of device
> sda1: rw=16387, want=252755893, limit=250067632
> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>
> At the same time, other logical volumes on the same physical volume are destroyed. Also the btrfs volume itself may be damaged (this seems to depend on the actual usage).
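The "want"/"limit" pair in the kernel message above already quantifies how far the discard ran past the partition; a quick check of the arithmetic (assuming the usual 512-byte sectors the kernel reports in):

```python
SECTOR = 512  # bytes; the kernel reports want/limit in 512-byte sectors

want = 252_755_893    # sector the discard tried to reach ("want=")
limit = 250_067_632   # size of sda1 in sectors ("limit=")

overshoot = want - limit
print(overshoot)                    # 2688261 sectors
print(overshoot * SECTOR / 2**30)   # ~1.28 GiB beyond the partition end
```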
>
> I can easily reproduce this issue locally and I’m currently bisecting. So far I could narrow down the range of commits to:
> Good: 92fff53b7191cae566be9ca6752069426c7f8241
> Bad: 225557446856448039a9e495da37b72c20071ef2
I finished bisecting. Here’s the responsible commit:
commit 61697a6abd24acba941359c6268a94f4afe4a53d
Author: Mike Snitzer <snitzer@redhat.com>
Date: Fri Jan 18 14:19:26 2019 -0500
dm: eliminate 'split_discard_bios' flag from DM target interface
There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits. A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Maybe the assumption taken here ("A DM target just needs to set max_discard_sectors, discard_granularity, etc, in queue_limits.") isn't valid in my case? Does anyone have an idea?
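For readers unfamiliar with the mechanics: the commit moves discard splitting out of DM core and relies on blk_queue_split(), which chunks a bio purely by queue_limits such as max_discard_sectors. A toy model of that size-based splitting (not the kernel code, and deliberately ignoring discard_granularity and alignment):

```python
def split_discard(sector, nr_sectors, max_discard_sectors):
    """Chunk one discard into pieces of at most max_discard_sectors.

    This mimics only the size limit. Crucially, nothing here knows about
    dm target/segment boundaries - boundary-aware splitting is what the
    removed 'split_discard_bios' behavior used to provide in DM core.
    """
    chunks = []
    while nr_sectors > 0:
        n = min(nr_sectors, max_discard_sectors)
        chunks.append((sector, n))
        sector += n
        nr_sectors -= n
    return chunks

print(split_discard(0, 10, 4))  # [(0, 4), (4, 4), (8, 2)]
```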
>
> In this range of commits, there are only dm-related changes.
>
> So far, I have not reproduced the issue with other file systems or a simplified stack. I first want to continue bisecting but this may take another day.
>
>
>> On 18.05.2019 at 12:26, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2019/5/18 5:18 PM, Michael Laß wrote:
>>>
>>>> On 18.05.2019 at 06:09, Chris Murphy <lists@colorremedies.com> wrote:
>>>>
>>>> On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
>>>>>
>>>>>
>>>>> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
>>>>>
>>>>> fstrim: /: FITRIM ioctl failed: Input/output error
>>>>
>>>> Huh. Any kernel message at the same time? I would expect any fstrim
>>>> user space error message to also have a kernel message. Any i/o error
>>>> suggests some kind of storage stack failure - which could be hardware
>>>> or software, you can't know without seeing the kernel messages.
>>>
>>> I missed that. The kernel messages are:
>>>
>>> attempt to access beyond end of device
>>> sda1: rw=16387, want=252755893, limit=250067632
>>> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>>>
>>> Here is some more information on the partitions and LVM physical segments:
>>>
>>> fdisk -l /dev/sda:
>>>
>>> Device Boot Start End Sectors Size Id Type
>>> /dev/sda1 * 2048 250069679 250067632 119.2G 8e Linux LVM
>>>
>>> pvdisplay -m:
>>>
>>> --- Physical volume ---
>>> PV Name /dev/sda1
>>> VG Name vg_system
>>> PV Size 119.24 GiB / not usable <22.34 MiB
>>> Allocatable yes (but full)
>>> PE Size 32.00 MiB
>>> Total PE 3815
>>> Free PE 0
>>> Allocated PE 3815
>>> PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
>>>
>>> --- Physical Segments ---
>>> Physical extent 0 to 1248:
>>> Logical volume /dev/vg_system/btrfs
>>> Logical extents 2231 to 3479
>>> Physical extent 1249 to 1728:
>>> Logical volume /dev/vg_system/btrfs
>>> Logical extents 640 to 1119
>>> Physical extent 1729 to 1760:
>>> Logical volume /dev/vg_system/grml-images
>>> Logical extents 0 to 31
>>> Physical extent 1761 to 2016:
>>> Logical volume /dev/vg_system/swap
>>> Logical extents 0 to 255
>>> Physical extent 2017 to 2047:
>>> Logical volume /dev/vg_system/btrfs
>>> Logical extents 3480 to 3510
>>> Physical extent 2048 to 2687:
>>> Logical volume /dev/vg_system/btrfs
>>> Logical extents 0 to 639
>>> Physical extent 2688 to 3007:
>>> Logical volume /dev/vg_system/btrfs
>>> Logical extents 1911 to 2230
>>> Physical extent 3008 to 3320:
>>> Logical volume /dev/vg_system/btrfs
>>> Logical extents 1120 to 1432
>>> Physical extent 3321 to 3336:
>>> Logical volume /dev/vg_system/boot
>>> Logical extents 0 to 15
>>> Physical extent 3337 to 3814:
>>> Logical volume /dev/vg_system/btrfs
>>> Logical extents 1433 to 1910
>>>
>>>
>>> Would btrfs even be able to accidentally trim parts of other LVs or does this clearly hint towards a LVM/dm issue?
>>
>> I can't say for sure, but (at least for the latest kernel) btrfs has a lot
>> of extra mount-time self checks, including a chunk stripe check against the
>> underlying device, so the possibility shouldn't be that high for btrfs.
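To make the danger concrete, the pvdisplay segment table quoted above can be turned into a small lookup: adjacent logical extents of the btrfs LV land in non-adjacent physical extents, so a discard that crosses a segment boundary and is simply continued linearly runs into foreign LVs or off the end of the PV. A sketch (table transcribed from the output above; PE size 32 MiB):

```python
# (pe_start, pe_end, lv_name, le_start) per physical segment, from pvdisplay -m
SEGMENTS = [
    (0,    1248, "btrfs",       2231),
    (1249, 1728, "btrfs",       640),
    (1729, 1760, "grml-images", 0),
    (1761, 2016, "swap",        0),
    (2017, 2047, "btrfs",       3480),
    (2048, 2687, "btrfs",       0),
    (2688, 3007, "btrfs",       1911),
    (3008, 3320, "btrfs",       1120),
    (3321, 3336, "boot",        0),
    (3337, 3814, "btrfs",       1433),
]

def lv_to_pe(lv, le):
    """Map one logical extent of an LV to its physical extent."""
    for pe_start, pe_end, name, le_start in SEGMENTS:
        if name == lv and le_start <= le <= le_start + (pe_end - pe_start):
            return pe_start + (le - le_start)
    raise LookupError(f"{lv} LE {le} not mapped")

# Adjacent LEs of the btrfs LV jump between distant physical segments:
print(lv_to_pe("btrfs", 639), lv_to_pe("btrfs", 640))  # 2687 1249
```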
>
> Indeed, bisecting the issue led me to a range of commits that only contains dm-related and no btrfs-related changes. So I assume this is a bug in dm.
>
>>> Is there an easy way to somehow trace the trim through the different layers so one can see where it goes wrong?
>>
>> Sure, you could use dm-log-writes.
>> It will record all read/write (including trim) for later replay.
>>
>> So in your case, you can build the storage stack like:
>>
>> Btrfs
>> <dm-log-writes>
>> LUKS/dmcrypt
>> LVM
>> MBR partition
>> Samsung SSD
>>
>> Then replay the log (using src/log-write/replay-log in fstests) with
>> verbose output, you can verify every trim operation against the dmcrypt
>> device size.
>>
>> If all trims are fine, then move dm-log-writes one layer lower, until
>> you find which layer is causing the problem.
>
> That sounds like a plan! However, I first want to continue bisecting as I am afraid to lose my reproducer by changing parts of my storage stack.
>
> Cheers,
> Michael
>
>>
>> Thanks,
>> Qu
>>>
>>> Cheers,
>>> Michael
>>>
>>> PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.
>
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
[not found] ` <CAK-xaQYPs62v971zm1McXw_FGzDmh_vpz3KLEbxzkmrsSgTfXw@mail.gmail.com>
@ 2019-05-20 13:58 ` Michael Laß
2019-05-20 14:53 ` Andrea Gelmini
0 siblings, 1 reply; 24+ messages in thread
From: Michael Laß @ 2019-05-20 13:58 UTC (permalink / raw)
To: Andrea Gelmini; +Cc: Qu Wenruo, Chris Murphy, Btrfs BTRFS, dm-devel
> On 20.05.2019 at 15:53, Andrea Gelmini <andrea.gelmini@gmail.com> wrote:
>
> Had the same issue on a similar (well, almost exactly the same) setup, on a machine in production.
> But it is more than 4 TB of data, so in the end I re-dd'd the image and restarted; sticking to the 5.0.y branch, I never had the problem.
> I was able to replicate it. Samsung SSD, a more recent model.
> Not with btrfs but with ext4, by the way.
Thanks for the info, that eliminates one variable. So you also used dm-crypt on top of LVM?
Cheers,
Michael
> I saw the discard of a big initial part of the LVM partition. I can't find superblock copies in the first half, only towards the end of the logical volume.
>
> Sorry, I can't play with it again, but I have the whole (4 TB) dd image with the bug.
>
>
> Ciao,
> Gelma
>
> On Mon, 20 May 2019 at 02:38, Michael Laß <bevan@bi-co.net> wrote:
> CC'ing dm-devel, as this seems to be a dm-related issue. Short summary for new readers:
>
> On Linux 5.1 (tested up to 5.1.3), fstrim may discard too many blocks, leading to data loss. I have the following storage stack:
>
> btrfs
> dm-crypt (LUKS)
> LVM logical volume
> LVM single physical volume
> MBR partition
> Samsung 830 SSD
>
> The mapping between logical volumes and physical segments is a bit mixed up. See below for the output for “pvdisplay -m”. When I issue fstrim on the mounted btrfs volume, I get the following kernel messages:
>
> attempt to access beyond end of device
> sda1: rw=16387, want=252755893, limit=250067632
> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>
> At the same time, other logical volumes on the same physical volume are destroyed. Also the btrfs volume itself may be damaged (this seems to depend on the actual usage).
>
> I can easily reproduce this issue locally and I’m currently bisecting. So far I could narrow down the range of commits to:
> Good: 92fff53b7191cae566be9ca6752069426c7f8241
> Bad: 225557446856448039a9e495da37b72c20071ef2
>
> In this range of commits, there are only dm-related changes.
>
> So far, I have not reproduced the issue with other file systems or a simplified stack. I first want to continue bisecting but this may take another day.
>
>
> > On 18.05.2019 at 12:26, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > On 2019/5/18 5:18 PM, Michael Laß wrote:
> >>
> >>> On 18.05.2019 at 06:09, Chris Murphy <lists@colorremedies.com> wrote:
> >>>
> >>> On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
> >>>>
> >>>>
> >>>> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
> >>>>
> >>>> fstrim: /: FITRIM ioctl failed: Input/output error
> >>>
> >>> Huh. Any kernel message at the same time? I would expect any fstrim
> >>> user space error message to also have a kernel message. Any i/o error
> >>> suggests some kind of storage stack failure - which could be hardware
> >>> or software, you can't know without seeing the kernel messages.
> >>
> >> I missed that. The kernel messages are:
> >>
> >> attempt to access beyond end of device
> >> sda1: rw=16387, want=252755893, limit=250067632
> >> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
> >>
> >> Here is some more information on the partitions and LVM physical segments:
> >>
> >> fdisk -l /dev/sda:
> >>
> >> Device Boot Start End Sectors Size Id Type
> >> /dev/sda1 * 2048 250069679 250067632 119.2G 8e Linux LVM
> >>
> >> pvdisplay -m:
> >>
> >> --- Physical volume ---
> >> PV Name /dev/sda1
> >> VG Name vg_system
> >> PV Size 119.24 GiB / not usable <22.34 MiB
> >> Allocatable yes (but full)
> >> PE Size 32.00 MiB
> >> Total PE 3815
> >> Free PE 0
> >> Allocated PE 3815
> >> PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
> >>
> >> --- Physical Segments ---
> >> Physical extent 0 to 1248:
> >> Logical volume /dev/vg_system/btrfs
> >> Logical extents 2231 to 3479
> >> Physical extent 1249 to 1728:
> >> Logical volume /dev/vg_system/btrfs
> >> Logical extents 640 to 1119
> >> Physical extent 1729 to 1760:
> >> Logical volume /dev/vg_system/grml-images
> >> Logical extents 0 to 31
> >> Physical extent 1761 to 2016:
> >> Logical volume /dev/vg_system/swap
> >> Logical extents 0 to 255
> >> Physical extent 2017 to 2047:
> >> Logical volume /dev/vg_system/btrfs
> >> Logical extents 3480 to 3510
> >> Physical extent 2048 to 2687:
> >> Logical volume /dev/vg_system/btrfs
> >> Logical extents 0 to 639
> >> Physical extent 2688 to 3007:
> >> Logical volume /dev/vg_system/btrfs
> >> Logical extents 1911 to 2230
> >> Physical extent 3008 to 3320:
> >> Logical volume /dev/vg_system/btrfs
> >> Logical extents 1120 to 1432
> >> Physical extent 3321 to 3336:
> >> Logical volume /dev/vg_system/boot
> >> Logical extents 0 to 15
> >> Physical extent 3337 to 3814:
> >> Logical volume /dev/vg_system/btrfs
> >> Logical extents 1433 to 1910
> >>
> >>
> >> Would btrfs even be able to accidentally trim parts of other LVs or does this clearly hint towards a LVM/dm issue?
> >
> > I can't say for sure, but (at least for the latest kernel) btrfs has a lot
> > of extra mount-time self checks, including a chunk stripe check against the
> > underlying device, so the possibility shouldn't be that high for btrfs.
>
> Indeed, bisecting the issue led me to a range of commits that only contains dm-related and no btrfs-related changes. So I assume this is a bug in dm.
>
> >> Is there an easy way to somehow trace the trim through the different layers so one can see where it goes wrong?
> >
> > Sure, you could use dm-log-writes.
> > It will record all read/write (including trim) for later replay.
> >
> > So in your case, you can build the storage stack like:
> >
> > Btrfs
> > <dm-log-writes>
> > LUKS/dmcrypt
> > LVM
> > MBR partition
> > Samsung SSD
> >
> > Then replay the log (using src/log-write/replay-log in fstests) with
> > verbose output, you can verify every trim operation against the dmcrypt
> > device size.
> >
> > If all trims are fine, then move dm-log-writes one layer lower, until
> > you find which layer is causing the problem.
>
> That sounds like a plan! However, I first want to continue bisecting as I am afraid to lose my reproducer by changing parts of my storage stack.
>
> Cheers,
> Michael
>
> >
> > Thanks,
> > Qu
> >>
> >> Cheers,
> >> Michael
> >>
> >> PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.
>
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-20 13:58 ` Michael Laß
@ 2019-05-20 14:53 ` Andrea Gelmini
2019-05-20 16:45 ` Milan Broz
0 siblings, 1 reply; 24+ messages in thread
From: Andrea Gelmini @ 2019-05-20 14:53 UTC (permalink / raw)
To: Michael Laß; +Cc: Qu Wenruo, Chris Murphy, Btrfs BTRFS, dm-devel
On Mon, 20 May 2019 at 15:58, Michael Laß <bevan@bi-co.net> wrote:
>
>
> > On 20.05.2019 at 15:53, Andrea Gelmini <andrea.gelmini@gmail.com> wrote:
> >
> > Had the same issue on a similar (well, almost exactly the same) setup, on a machine in production.
> > But it is more than 4 TB of data, so in the end I re-dd'd the image and restarted; sticking to the 5.0.y branch, I never had the problem.
> > I was able to replicate it. Samsung SSD, a more recent model.
> > Not with btrfs but with ext4, by the way.
>
> Thanks for the info, that eliminates one variable. So you also used dm-crypt on top of LVM?
root@glet:~# lsblk |grep -v loop
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 3,7T 0 disk
├─sda1 8:1 0 260M 0 part /boot/efi
├─sda2 8:2 0 16M 0 part
├─sda3 8:3 0 67,6G 0 part
├─sda4 8:4 0 883M 0 part
├─sda5 8:5 0 1,9G 0 part /boot
└─sda6 8:6 0 3,5T 0 part
└─sda6_crypt 254:0 0 3,5T 0 crypt
├─cry-root 254:1 0 28G 0 lvm /
├─cry-swap 254:2 0 70G 0 lvm [SWAP]
└─cry-home 254:3 0 2,7T 0 lvm /home
nvme0n1 259:0 0 119,2G 0 disk
├─nvme0n1p1 259:1 0 97,8G 0 part /mnt/nvme
└─nvme0n1p2 259:2 0 21,5G 0 part [SWAP]
root@glet:~#
Booting with a kernel > 5.0, it discards the first big part of cry-home.
root@glet:~# lvdisplay -vv
devices/global_filter not found in config: defaulting to
global_filter = [ "a|.*/|" ]
Setting global/locking_type to 1
Setting global/use_lvmetad to 1
global/lvmetad_update_wait_time not found in config: defaulting to 10
Setting response to OK
Setting protocol to lvmetad
Setting version to 1
Setting global/use_lvmpolld to 1
Setting devices/sysfs_scan to 1
Setting devices/multipath_component_detection to 1
Setting devices/md_component_detection to 1
Setting devices/fw_raid_component_detection to 0
Setting devices/ignore_suspended_devices to 0
Setting devices/ignore_lvm_mirrors to 1
devices/filter not found in config: defaulting to filter = [ "a|.*/|" ]
Setting devices/cache_dir to /run/lvm
Setting devices/cache_file_prefix to
devices/cache not found in config: defaulting to /run/lvm/.cache
Setting devices/write_cache_state to 1
Setting global/use_lvmetad to 1
Setting activation/activation_mode to degraded
metadata/record_lvs_history not found in config: defaulting to 0
Setting activation/monitoring to 1
Setting global/locking_type to 1
Setting global/wait_for_locks to 1
File-based locking selected.
Setting global/prioritise_write_locks to 1
Setting global/locking_dir to /run/lock/lvm
Setting global/use_lvmlockd to 0
Setting response to OK
Setting token to filter:3239235440
Setting daemon_pid to 650
Setting response to OK
Setting global_disable to 0
report/output_format not found in config: defaulting to basic
log/report_command_log not found in config: defaulting to 0
Setting response to OK
Setting response to OK
Setting response to OK
Setting name to cry
Processing VG cry Orkwof-zq16-e1qM-rUMt-vKV1-Lc13-CgiKYp
Locking /run/lock/lvm/V_cry RB
Reading VG cry Orkwofzq16e1qMrUMtvKV1Lc13CgiKYp
Setting response to OK
Setting response to OK
Setting response to OK
Setting name to cry
Setting metadata/format to lvm2
Setting id to OtoEfX-bpWN-l9gd-kLJW-1xca-PaHR-ARrSKr
Setting format to lvm2
Setting device to 65024
Setting dev_size to 7465840640
Setting label_sector to 1
Setting ext_flags to 1
Setting ext_version to 2
Setting size to 1044480
Setting start to 4096
Setting ignore to 0
Setting response to OK
Setting response to OK
Setting response to OK
/dev/mapper/sda6_crypt: size is 7465842688 sectors
Adding cry/root to the list of LVs to be processed.
Adding cry/swap to the list of LVs to be processed.
Adding cry/home to the list of LVs to be processed.
Processing LV root in VG cry.
--- Logical volume ---
global/lvdisplay_shows_full_device_path not found in config:
defaulting to 0
LV Path /dev/cry/root
LV Name root
VG Name cry
LV UUID J0vJ5D-Rzyt-9fOm-cJVU-bwjc-6pc1-jGqIhc
LV Write Access read/write
LV Creation host, time glet, 2018-11-02 17:51:35 +0100
LV Status available
# open 1
LV Size <27,94 GiB
Current LE 7152
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:1
Processing LV swap in VG cry.
--- Logical volume ---
global/lvdisplay_shows_full_device_path not found in config:
defaulting to 0
LV Path /dev/cry/swap
LV Name swap
VG Name cry
LV UUID c4iLex-xxMu-Quyr-4qkt-hFk2-uOb5-BDF5ls
LV Write Access read/write
LV Creation host, time glet, 2018-11-02 17:51:43 +0100
LV Status available
# open 2
LV Size 70,00 GiB
Current LE 17920
Segments 2
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:2
Processing LV home in VG cry.
--- Logical volume ---
global/lvdisplay_shows_full_device_path not found in config:
defaulting to 0
LV Path /dev/cry/home
LV Name home
VG Name cry
LV UUID jycl7w-59lN-F3Ne-DBDa-G21g-CAmb-ROvIaX
LV Write Access read/write
LV Creation host, time glet, 2018-11-02 17:51:50 +0100
LV Status available
# open 1
LV Size <2,71 TiB
Current LE 709591
Segments 2
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:3
Unlocking /run/lock/lvm/V_cry
Setting global/notify_dbus to 1
Also, changing crypttab:
root@glet:~# cat /etc/crypttab
sda6_crypt UUID=fe03e2e6-b8b1-4672-8a3e-b536ac4e1539 none luks,discard
removing discard didn't solve the issue.
In my setup it was enough to boot the system, after which it complained that
mounting /home was impossible (no filesystem found).
Well, keep in mind that at boot I have a few things, like:
root@glet:~# grep -i swap /etc/fstab
/dev/mapper/cry-swap none swap sw,discard=once,pri=0
0 0
/dev/nvme0n1p2 none swap sw,discard=once,pri=1
And other stuff in cron and so on.
So with my changes I can trigger the problem at boot (Ubuntu 19.04).
Hope it helps.
Uhm, by the way, my SSD (latest firmware):
root@glet:~# hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: Samsung SSD 860 EVO 4TB
Serial Number: S3YPNWAK101163T
Firmware Revision: RVT02B6Q
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II
Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Used: unknown (minor revision code 0x005e)
Supported: 11 8 7 6 5
Likely used: 11
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 7814037168
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 3815447 MBytes
device size with M = 1000*1000: 4000787 MBytes (4000 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 1 Current = 1
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
SET_MAX security extension
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
* Device-initiated interface power management
* Asynchronous notification (eg. media change)
* Software settings preservation
* Device Sleep (DEVSLP)
* SMART Command Transport (SCT) feature set
* SCT Write Same (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* reserved 69[4]
* DOWNLOAD MICROCODE DMA command
* SET MAX SETPASSWORD/UNLOCK DMA commands
* WRITE BUFFER DMA command
* READ BUFFER DMA command
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
4min for SECURITY ERASE UNIT. 8min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5002538e7001e8a7
NAA : 5
IEEE OUI : 002538
Unique ID : e7001e8a7
Device Sleep:
DEVSLP Exit Timeout (DETO): 50 ms (drive)
Minimum DEVSLP Assertion Time (MDAT): 30 ms (drive)
Checksum: correct
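The hdparm line "Data Set Management TRIM supported (limit 8 blocks)" bounds how much a single TRIM command can cover. Per the ATA DSM command format, each 512-byte payload block holds 64 eight-byte LBA range entries, and each entry addresses at most 65535 sectors; the rough arithmetic (assuming 512-byte sectors):

```python
BLOCKS = 8                  # "limit 8 blocks" from hdparm -I
ENTRIES_PER_BLOCK = 64      # 512-byte block / 8-byte LBA range entry
MAX_SECTORS_PER_ENTRY = 65535
SECTOR = 512                # bytes

max_ranges = BLOCKS * ENTRIES_PER_BLOCK
max_bytes = max_ranges * MAX_SECTORS_PER_ENTRY * SECTOR
print(max_ranges)            # 512 ranges per TRIM command
print(max_bytes / 2**30)     # just under 16 GiB per command
```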
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-20 14:53 ` Andrea Gelmini
@ 2019-05-20 16:45 ` Milan Broz
2019-05-20 19:58 ` Michael Laß
2019-05-21 18:54 ` Andrea Gelmini
0 siblings, 2 replies; 24+ messages in thread
From: Milan Broz @ 2019-05-20 16:45 UTC (permalink / raw)
To: Andrea Gelmini, Michael Laß
Cc: Qu Wenruo, Chris Murphy, Btrfs BTRFS, dm-devel
On 20/05/2019 16:53, Andrea Gelmini wrote:
...
> Also, changing crypttab:
> root@glet:~# cat /etc/crypttab
> sda6_crypt UUID=fe03e2e6-b8b1-4672-8a3e-b536ac4e1539 none luks,discard
>
> removing discard didn't solve the issue.
This is very strange, disabling discard should reject every discard IO
on the dmcrypt layer. Are you sure it was really disabled?
Note, it is the root filesystem, so you have to regenerate initramfs
to update crypttab inside it.
Could you paste "dmsetup table" and "lsblk -D" to verify that discard flag
is not there?
(I mean dmsetup table with the zeroed key, as a default and safe output.)
Milan
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-20 16:45 ` Milan Broz
@ 2019-05-20 19:58 ` Michael Laß
2019-05-21 18:54 ` Andrea Gelmini
1 sibling, 0 replies; 24+ messages in thread
From: Michael Laß @ 2019-05-20 19:58 UTC (permalink / raw)
To: Milan Broz; +Cc: Andrea Gelmini, Qu Wenruo, Chris Murphy, Btrfs BTRFS, dm-devel
> On 20.05.2019 at 18:45, Milan Broz <gmazyland@gmail.com> wrote:
>
> On 20/05/2019 16:53, Andrea Gelmini wrote:
> ...
>> Also, changing crypttab:
>> root@glet:~# cat /etc/crypttab
>> sda6_crypt UUID=fe03e2e6-b8b1-4672-8a3e-b536ac4e1539 none luks,discard
>>
>> removing discard didn't solve the issue.
>
> This is very strange, disabling discard should reject every discard IO
> on the dmcrypt layer. Are you sure it was really disabled?
>
> Note, it is the root filesystem, so you have to regenerate initramfs
> to update crypttab inside it.
For me, I cannot reproduce the issue when I remove the discard option from the crypttab (and regenerate the initramfs). When trying fstrim I just get “the discard operation is not supported”, as I would expect. No damage is done to other logical volumes.
However, my stack differs from Andrea’s in that I have dm-crypt on an LVM logical volume and not dm-crypt as a physical volume for LVM. Not sure if that makes a difference here.
Cheers,
Michael
> Could you paste "dmsetup table" and "lsblk -D" to verify that discard flag
> is not there?
> (I mean dmsetup table with the zeroed key, as a default and safe output.)
>
> Milan
* Re: [dm-devel] fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-20 11:38 ` [dm-devel] " Michael Laß
@ 2019-05-21 16:46 ` Michael Laß
2019-05-21 19:00 ` Andrea Gelmini
0 siblings, 1 reply; 24+ messages in thread
From: Michael Laß @ 2019-05-21 16:46 UTC (permalink / raw)
To: dm-devel; +Cc: Chris Murphy, Qu Wenruo, Btrfs BTRFS
> On 20.05.2019 at 13:38, Michael Laß <bevan@bi-co.net> wrote:
>
>>
>> On 19.05.2019 at 21:55, Michael Laß <bevan@bi-co.net> wrote:
>>
>> CC'ing dm-devel, as this seems to be a dm-related issue. Short summary for new readers:
>>
>> On Linux 5.1 (tested up to 5.1.3), fstrim may discard too many blocks, leading to data loss. I have the following storage stack:
>>
>> btrfs
>> dm-crypt (LUKS)
>> LVM logical volume
>> LVM single physical volume
>> MBR partition
>> Samsung 830 SSD
>>
>> The mapping between logical volumes and physical segments is a bit mixed up. See below for the output for “pvdisplay -m”. When I issue fstrim on the mounted btrfs volume, I get the following kernel messages:
>>
>> attempt to access beyond end of device
>> sda1: rw=16387, want=252755893, limit=250067632
>> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>>
>> At the same time, other logical volumes on the same physical volume are destroyed. Also the btrfs volume itself may be damaged (this seems to depend on the actual usage).
>>
>> I can easily reproduce this issue locally and I’m currently bisecting. So far I could narrow down the range of commits to:
>> Good: 92fff53b7191cae566be9ca6752069426c7f8241
>> Bad: 225557446856448039a9e495da37b72c20071ef2
>
> I finished bisecting. Here’s the responsible commit:
>
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Fri Jan 18 14:19:26 2019 -0500
>
> dm: eliminate 'split_discard_bios' flag from DM target interface
>
> There is no need to have DM core split discards on behalf of a DM target
> now that blk_queue_split() handles splitting discards based on the
> queue_limits. A DM target just needs to set max_discard_sectors,
> discard_granularity, etc, in queue_limits.
>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reverting that commit solves the issue for me on Linux 5.1.3. Would that be an option until the root cause has been identified? I’d rather not let more people run into this issue.
Cheers,
Michael
> Maybe the assumption taken here ("A DM target just needs to set max_discard_sectors, discard_granularity, etc, in queue_limits.") isn't valid in my case? Does anyone have an idea?
>
>
>>
>> In this range of commits, there are only dm-related changes.
>>
>> So far, I have not reproduced the issue with other file systems or a simplified stack. I first want to continue bisecting but this may take another day.
>>
>>
>>> On 18.05.2019 at 12:26, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> On 2019/5/18 5:18 PM, Michael Laß wrote:
>>>>
>>>>> On 18.05.2019 at 06:09, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>
>>>>> On Fri, May 17, 2019 at 11:37 AM Michael Laß <bevan@bi-co.net> wrote:
>>>>>>
>>>>>>
>>>>>> I tried to reproduce this issue: I recreated the btrfs file system, set up a minimal system and issued fstrim again. It printed the following error message:
>>>>>>
>>>>>> fstrim: /: FITRIM ioctl failed: Input/output error
>>>>>
>>>>> Huh. Any kernel message at the same time? I would expect any fstrim
>>>>> user space error message to also have a kernel message. Any i/o error
>>>>> suggests some kind of storage stack failure - which could be hardware
>>>>> or software, you can't know without seeing the kernel messages.
>>>>
>>>> I missed that. The kernel messages are:
>>>>
>>>> attempt to access beyond end of device
>>>> sda1: rw=16387, want=252755893, limit=250067632
>>>> BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
>>>>
>>>> Here is some more information on the partitions and LVM physical segments:
>>>>
>>>> fdisk -l /dev/sda:
>>>>
>>>> Device Boot Start End Sectors Size Id Type
>>>> /dev/sda1 * 2048 250069679 250067632 119.2G 8e Linux LVM
>>>>
>>>> pvdisplay -m:
>>>>
>>>> --- Physical volume ---
>>>> PV Name /dev/sda1
>>>> VG Name vg_system
>>>> PV Size 119.24 GiB / not usable <22.34 MiB
>>>> Allocatable yes (but full)
>>>> PE Size 32.00 MiB
>>>> Total PE 3815
>>>> Free PE 0
>>>> Allocated PE 3815
>>>> PV UUID mqCLFy-iDnt-NfdC-lfSv-Maor-V1Ih-RlG8lP
>>>>
>>>> --- Physical Segments ---
>>>> Physical extent 0 to 1248:
>>>> Logical volume /dev/vg_system/btrfs
>>>> Logical extents 2231 to 3479
>>>> Physical extent 1249 to 1728:
>>>> Logical volume /dev/vg_system/btrfs
>>>> Logical extents 640 to 1119
>>>> Physical extent 1729 to 1760:
>>>> Logical volume /dev/vg_system/grml-images
>>>> Logical extents 0 to 31
>>>> Physical extent 1761 to 2016:
>>>> Logical volume /dev/vg_system/swap
>>>> Logical extents 0 to 255
>>>> Physical extent 2017 to 2047:
>>>> Logical volume /dev/vg_system/btrfs
>>>> Logical extents 3480 to 3510
>>>> Physical extent 2048 to 2687:
>>>> Logical volume /dev/vg_system/btrfs
>>>> Logical extents 0 to 639
>>>> Physical extent 2688 to 3007:
>>>> Logical volume /dev/vg_system/btrfs
>>>> Logical extents 1911 to 2230
>>>> Physical extent 3008 to 3320:
>>>> Logical volume /dev/vg_system/btrfs
>>>> Logical extents 1120 to 1432
>>>> Physical extent 3321 to 3336:
>>>> Logical volume /dev/vg_system/boot
>>>> Logical extents 0 to 15
>>>> Physical extent 3337 to 3814:
>>>> Logical volume /dev/vg_system/btrfs
>>>> Logical extents 1433 to 1910
>>>>
>>>>
>>>> Would btrfs even be able to accidentally trim parts of other LVs or does this clearly hint towards a LVM/dm issue?
>>>
>>> I can't say for sure, but (at least for the latest kernel) btrfs has a lot
>>> of extra mount-time self checks, including a chunk stripe check against the
>>> underlying device, so the possibility shouldn't be that high for btrfs.
>>
>> Indeed, bisecting the issue led me to a range of commits that only contains dm-related and no btrfs-related changes. So I assume this is a bug in dm.
>>
>>>> Is there an easy way to somehow trace the trim through the different layers so one can see where it goes wrong?
>>>
>>> Sure, you could use dm-log-writes.
>>> It will record all read/write (including trim) for later replay.
>>>
>>> So in your case, you can build the storage stack like:
>>>
>>> Btrfs
>>> <dm-log-writes>
>>> LUKS/dmcrypt
>>> LVM
>>> MBR partition
>>> Samsung SSD
>>>
>>> Then replay the log (using src/log-writes/replay-log in fstests) with
>>> verbose output, you can verify every trim operation against the dmcrypt
>>> device size.
>>>
>>> If all trim are fine, then move the dm-log-writes a layer lower, until
>>> you find which layer is causing the problem.
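The verification step described above can be scripted: feed replay-log's verbose output through a small filter and flag every trim that extends past the device. The log-line format below is an assumption modeled on replay-log's output, as are the sample lines; adjust the pattern to whatever your fstests build actually prints:

```python
import re

# Hypothetical line format -- check the real output of replay-log -v
# from your fstests build and adjust this regular expression.
LOG_RE = re.compile(
    r"replaying (\d+): sector (\d+), size (\d+), flags \d+ \((\w+)\)")

def out_of_range_trims(log_lines, device_sectors):
    """Return (entry, start_sector, nr_sectors) for every DISCARD that
    would reach past the end of a device of device_sectors 512-byte
    sectors -- e.g. the dmcrypt device in the stack above."""
    bad = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        entry, sector, size, op = m.groups()
        if op != "DISCARD":
            continue
        nr_sectors = int(size) // 512  # size assumed to be in bytes
        if int(sector) + nr_sectors > device_sectors:
            bad.append((int(entry), int(sector), nr_sectors))
    return bad

sample = [
    "replaying 7: sector 2048, size 1048576, flags 2 (DISCARD)",
    "replaying 8: sector 999424, size 4194304, flags 2 (DISCARD)",
]
print(out_of_range_trims(sample, device_sectors=1000000))
# -> [(8, 999424, 8192)]
```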
>>
>> That sounds like a plan! However, I first want to continue bisecting as I am afraid to lose my reproducer by changing parts of my storage stack.
>>
>> Cheers,
>> Michael
>>
>>>
>>> Thanks,
>>> Qu
>>>>
>>>> Cheers,
>>>> Michael
>>>>
>>>> PS: Current state of bisection: It looks like the error was introduced somewhere between b5dd0c658c31b469ccff1b637e5124851e7a4a1c and v5.1.
>>
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
>
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-20 16:45 ` Milan Broz
2019-05-20 19:58 ` Michael Laß
@ 2019-05-21 18:54 ` Andrea Gelmini
1 sibling, 0 replies; 24+ messages in thread
From: Andrea Gelmini @ 2019-05-21 18:54 UTC (permalink / raw)
To: Milan Broz
Cc: Michael Laß, Qu Wenruo, Chris Murphy, Btrfs BTRFS, dm-devel
On Mon, 20 May 2019 at 18:45, Milan Broz
<gmazyland@gmail.com> wrote:
> Note, it is the root filesystem, so you have to regenerate initramfs
> to update crypttab inside it.
Good catch. I didn't re-mkinitramfs.
> Could you paste "dmsetup table" and "lsblk -D" to verify that discard flag
> is not there?
> (I mean dmsetup table with the zeroed key, as a default and safe output.)
If I have time this weekend I'm going to re-test it. It takes a long
time to restore 4 TB.
Thanks a lot,
Andrea
* Re: [dm-devel] fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-21 16:46 ` Michael Laß
@ 2019-05-21 19:00 ` Andrea Gelmini
2019-05-21 19:59 ` Michael Laß
2019-05-21 20:12 ` Mike Snitzer
0 siblings, 2 replies; 24+ messages in thread
From: Andrea Gelmini @ 2019-05-21 19:00 UTC (permalink / raw)
To: Michael Laß
Cc: dm-devel, Chris Murphy, Qu Wenruo, Btrfs BTRFS, Mike Snitzer
On Tue, May 21, 2019 at 06:46:20PM +0200, Michael Laß wrote:
> > I finished bisecting. Here’s the responsible commit:
> >
> > commit 61697a6abd24acba941359c6268a94f4afe4a53d
> > Author: Mike Snitzer <snitzer@redhat.com>
> > Date: Fri Jan 18 14:19:26 2019 -0500
> >
> > dm: eliminate 'split_discard_bios' flag from DM target interface
> >
> > There is no need to have DM core split discards on behalf of a DM target
> > now that blk_queue_split() handles splitting discards based on the
> > queue_limits. A DM target just needs to set max_discard_sectors,
> > discard_granularity, etc, in queue_limits.
> >
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>
> Reverting that commit solves the issue for me on Linux 5.1.3. Would that be an option until the root cause has been identified? I’d rather not let more people run into this issue.
Thanks a lot Michael, for your time/work.
This kind of bisecting is very boring and time-consuming.
I CC: also the patch author.
Thanks again,
Andrea
* Re: [dm-devel] fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-21 19:00 ` Andrea Gelmini
@ 2019-05-21 19:59 ` Michael Laß
2019-05-21 20:12 ` Mike Snitzer
1 sibling, 0 replies; 24+ messages in thread
From: Michael Laß @ 2019-05-21 19:59 UTC (permalink / raw)
To: Andrea Gelmini
Cc: dm-devel, Chris Murphy, Qu Wenruo, Btrfs BTRFS, Mike Snitzer
> Am 21.05.2019 um 21:00 schrieb Andrea Gelmini <andrea.gelmini@linux.it>:
>
> On Tue, May 21, 2019 at 06:46:20PM +0200, Michael Laß wrote:
>>> I finished bisecting. Here’s the responsible commit:
>>>
>>> commit 61697a6abd24acba941359c6268a94f4afe4a53d
>>> Author: Mike Snitzer <snitzer@redhat.com>
>>> Date: Fri Jan 18 14:19:26 2019 -0500
>>>
>>> dm: eliminate 'split_discard_bios' flag from DM target interface
>>>
>>> There is no need to have DM core split discards on behalf of a DM target
>>> now that blk_queue_split() handles splitting discards based on the
>>> queue_limits. A DM target just needs to set max_discard_sectors,
>>> discard_granularity, etc, in queue_limits.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>
>> Reverting that commit solves the issue for me on Linux 5.1.3. Would that be an option until the root cause has been identified? I’d rather not let more people run into this issue.
>
> Thanks a lot Michael, for your time/work.
>
> This kind of bisecting is very boring and time-consuming.
I just sent a patch to dm-devel which fixes the issue for me. Maybe you can test that in your environment?
Cheers,
Michael
PS: Sorry if the patch was sent multiple times. I had some issues with git send-email.
> I CC: also the patch author.
>
> Thanks again,
> Andrea
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-21 19:00 ` Andrea Gelmini
2019-05-21 19:59 ` Michael Laß
@ 2019-05-21 20:12 ` Mike Snitzer
2019-05-24 15:00 ` Andrea Gelmini
1 sibling, 1 reply; 24+ messages in thread
From: Mike Snitzer @ 2019-05-21 20:12 UTC (permalink / raw)
To: Andrea Gelmini
Cc: Michael Laß, dm-devel, Chris Murphy, Qu Wenruo, Btrfs BTRFS
On Tue, May 21 2019 at 3:00pm -0400,
Andrea Gelmini <andrea.gelmini@linux.it> wrote:
> On Tue, May 21, 2019 at 06:46:20PM +0200, Michael Laß wrote:
> > > I finished bisecting. Here’s the responsible commit:
> > >
> > > commit 61697a6abd24acba941359c6268a94f4afe4a53d
> > > Author: Mike Snitzer <snitzer@redhat.com>
> > > Date: Fri Jan 18 14:19:26 2019 -0500
> > >
> > > dm: eliminate 'split_discard_bios' flag from DM target interface
> > >
> > > There is no need to have DM core split discards on behalf of a DM target
> > > now that blk_queue_split() handles splitting discards based on the
> > > queue_limits. A DM target just needs to set max_discard_sectors,
> > > discard_granularity, etc, in queue_limits.
> > >
> > > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >
> > Reverting that commit solves the issue for me on Linux 5.1.3. Would
> that be an option until the root cause has been identified? I’d rather
> not let more people run into this issue.
>
> Thanks a lot Michael, for your time/work.
>
> This kind of bisecting is very boring and time-consuming.
>
> I CC: also the patch author.
Thanks for cc'ing me, this thread didn't catch my eye.
Sorry for your troubles. Can you please try this patch?
Thanks,
Mike
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1fb1333fefec..997385c1ca54 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1469,7 +1469,7 @@ static unsigned get_num_write_zeroes_bios(struct dm_target *ti)
static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
unsigned num_bios)
{
- unsigned len = ci->sector_count;
+ unsigned len;
/*
* Even though the device advertised support for this type of
@@ -1480,6 +1480,8 @@ static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *
if (!num_bios)
return -EOPNOTSUPP;
+ len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti));
+
__send_duplicate_bios(ci, ti, num_bios, &len);
ci->sector += len;
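The two added lines clamp the discard length to the current target's remaining span. A toy model of a two-target linear map (hypothetical sector numbers, plain Python instead of kernel C) illustrates what the unclamped length did: the whole discard is mapped through the first target, so its tail is discarded past that target's backing region:

```python
def send_discard(targets, sector, count, clamp_to_boundary):
    """targets: list of (start, length, phys_offset) describing a
    linear map. Returns the (phys_start, length) chunks the discard
    is sent to. This is only a model of the arithmetic, not DM code."""
    mapped = []
    while count > 0:
        start, length, phys = next(
            t for t in targets if t[0] <= sector < t[0] + t[1])
        boundary = start + length - sector  # sectors left in this target
        n = min(count, boundary) if clamp_to_boundary else count
        mapped.append((phys + (sector - start), n))
        sector += n
        count -= n
    return mapped

# Two adjacent targets backed by distant physical regions: target 0
# owns physical sectors 1000..1099, target 1 owns 5000..5099.
targets = [(0, 100, 1000), (100, 100, 5000)]

buggy = send_discard(targets, 90, 20, clamp_to_boundary=False)
fixed = send_discard(targets, 90, 20, clamp_to_boundary=True)
print(buggy)  # [(1090, 20)] -- 10 sectors trimmed past target 0's backing
print(fixed)  # [(1090, 10), (5000, 10)]
```

In the buggy case the tail of the discard hits physical sectors 1100-1109, which belong to whatever happens to sit behind the next part of the disk, matching the cross-LV corruption reported earlier in the thread.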
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-21 20:12 ` Mike Snitzer
@ 2019-05-24 15:00 ` Andrea Gelmini
2019-05-24 15:10 ` Greg KH
0 siblings, 1 reply; 24+ messages in thread
From: Andrea Gelmini @ 2019-05-24 15:00 UTC (permalink / raw)
To: Mike Snitzer
Cc: Michael Laß, dm-devel, Chris Murphy, Qu Wenruo, Btrfs BTRFS, gregkh
Hi Mike,
I'm setting up to replicate and test the condition. I see your
patch is already in the 5.2 dev kernel.
I'm going to try with the latest git and see what happens. Anyway,
don't you think it would be good
to have this patch ( 51b86f9a8d1c4bb4e3862ee4b4c5f46072f7520d )
in the 5.1 stable branch as well?
Thanks a lot for your time,
Gelma
On Tue, 21 May 2019 at 22:12, Mike Snitzer
<snitzer@redhat.com> wrote:
>
> On Tue, May 21 2019 at 3:00pm -0400,
> Andrea Gelmini <andrea.gelmini@linux.it> wrote:
>
> > On Tue, May 21, 2019 at 06:46:20PM +0200, Michael Laß wrote:
> > > > I finished bisecting. Here’s the responsible commit:
> > > >
> > > > commit 61697a6abd24acba941359c6268a94f4afe4a53d
> > > > Author: Mike Snitzer <snitzer@redhat.com>
> > > > Date: Fri Jan 18 14:19:26 2019 -0500
> > > >
> > > > dm: eliminate 'split_discard_bios' flag from DM target interface
> > > >
> > > > There is no need to have DM core split discards on behalf of a DM target
> > > > now that blk_queue_split() handles splitting discards based on the
> > > > queue_limits. A DM target just needs to set max_discard_sectors,
> > > > discard_granularity, etc, in queue_limits.
> > > >
> > > > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > >
> > > Reverting that commit solves the issue for me on Linux 5.1.3. Would
> > that be an option until the root cause has been identified? I’d rather
> > not let more people run into this issue.
> >
> > Thanks a lot Michael, for your time/work.
> >
> > This kind of bisecting is very boring and time-consuming.
> >
> > I CC: also the patch author.
>
> Thanks for cc'ing me, this thread didn't catch my eye.
>
> Sorry for your troubles. Can you please try this patch?
>
> Thanks,
> Mike
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 1fb1333fefec..997385c1ca54 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1469,7 +1469,7 @@ static unsigned get_num_write_zeroes_bios(struct dm_target *ti)
> static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
> unsigned num_bios)
> {
> - unsigned len = ci->sector_count;
> + unsigned len;
>
> /*
> * Even though the device advertised support for this type of
> @@ -1480,6 +1480,8 @@ static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *
> if (!num_bios)
> return -EOPNOTSUPP;
>
> + len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti));
> +
> __send_duplicate_bios(ci, ti, num_bios, &len);
>
> ci->sector += len;
>
* Re: fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss
2019-05-24 15:00 ` Andrea Gelmini
@ 2019-05-24 15:10 ` Greg KH
0 siblings, 0 replies; 24+ messages in thread
From: Greg KH @ 2019-05-24 15:10 UTC (permalink / raw)
To: Andrea Gelmini
Cc: Mike Snitzer, Michael Laß,
dm-devel, Chris Murphy, Qu Wenruo, Btrfs BTRFS
On Fri, May 24, 2019 at 05:00:51PM +0200, Andrea Gelmini wrote:
> Hi Mike,
> I'm doing setup to replicate and test the condition. I see your
> patch is already in the 5.2 dev kernel.
> I'm going to try with latest git, and see what happens. Anyway,
> don't you think it would be good
> to have this patch ( 51b86f9a8d1c4bb4e3862ee4b4c5f46072f7520d )
> anyway in the 5.1 stable branch?
It's already in the 5.1 stable queue and will be in the next 5.1 release
in a day or so.
thanks,
greg k-h
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-16 22:16 Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Michael Laß
2019-05-16 23:41 ` Qu Wenruo
2019-05-16 23:42 ` Chris Murphy
@ 2019-05-28 12:36 ` Christoph Anton Mitterer
2019-05-28 12:43 ` Michael Laß
2 siblings, 1 reply; 24+ messages in thread
From: Christoph Anton Mitterer @ 2019-05-28 12:36 UTC (permalink / raw)
To: Michael Laß, linux-btrfs
Hey.
Just to be on the safe side...
AFAIU this issue only occurred in 5.1.2 and later, right?
Starting with which 5.1.x and 5.2.x versions has the fix been merged?
Cheers,
Chris.
* Re: Massive filesystem corruption after balance + fstrim on Linux 5.1.2
2019-05-28 12:36 ` Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Christoph Anton Mitterer
@ 2019-05-28 12:43 ` Michael Laß
0 siblings, 0 replies; 24+ messages in thread
From: Michael Laß @ 2019-05-28 12:43 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs
> Am 28.05.2019 um 14:36 schrieb Christoph Anton Mitterer <calestyo@scientia.net>:
>
> Hey.
>
> Just to be on the safe side...
>
> AFAIU this issue only occured in 5.1.2 and later, right?
No. The issue was already introduced in v5.1-rc1 (commit 61697a6abd24).
> Starting with which 5.1.x and 5.2.x versions has the fix been merged?
It's fixed in v5.2-rc2 (commit 51b86f9a8d1c) and v5.1.5 (commit 871e122d55e8).
Cheers,
Michael
end of thread, other threads:[~2019-05-28 12:43 UTC | newest]
Thread overview: 24+ messages
2019-05-16 22:16 Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Michael Laß
2019-05-16 23:41 ` Qu Wenruo
2019-05-16 23:42 ` Chris Murphy
2019-05-17 17:37 ` Michael Laß
2019-05-18 4:09 ` Chris Murphy
2019-05-18 9:18 ` Michael Laß
2019-05-18 9:31 ` Roman Mamedov
2019-05-18 10:09 ` Michael Laß
2019-05-18 10:26 ` Qu Wenruo
2019-05-19 19:55 ` fstrim discarding too many or wrong blocks on Linux 5.1, leading to data loss Michael Laß
2019-05-20 11:38 ` [dm-devel] " Michael Laß
2019-05-21 16:46 ` Michael Laß
2019-05-21 19:00 ` Andrea Gelmini
2019-05-21 19:59 ` Michael Laß
2019-05-21 20:12 ` Mike Snitzer
2019-05-24 15:00 ` Andrea Gelmini
2019-05-24 15:10 ` Greg KH
[not found] ` <CAK-xaQYPs62v971zm1McXw_FGzDmh_vpz3KLEbxzkmrsSgTfXw@mail.gmail.com>
2019-05-20 13:58 ` Michael Laß
2019-05-20 14:53 ` Andrea Gelmini
2019-05-20 16:45 ` Milan Broz
2019-05-20 19:58 ` Michael Laß
2019-05-21 18:54 ` Andrea Gelmini
2019-05-28 12:36 ` Massive filesystem corruption after balance + fstrim on Linux 5.1.2 Christoph Anton Mitterer
2019-05-28 12:43 ` Michael Laß