linux-kernel.vger.kernel.org archive mirror
* Regression: Disk corruption with dm-crypt and kernels >= 4.0
@ 2015-05-01  4:37 Abelardo Ricart III
  2015-05-01 21:17 ` Mike Snitzer
  2015-05-01 21:47 ` [dm-devel] " Alasdair G Kergon
  0 siblings, 2 replies; 11+ messages in thread
From: Abelardo Ricart III @ 2015-05-01  4:37 UTC (permalink / raw)
  To: dm-devel; +Cc: mpatocka, snitzer, linux-kernel

I made sure to run a completely vanilla kernel when testing why I was suddenly
seeing some nasty libata errors with all kernels >= v4.0. Here's a snippet:

-------------------->8--------------------
[  165.592136] ata5.00: exception Emask 0x60 SAct 0x7000 SErr 0x800 action 0x6 frozen
[  165.592140] ata5.00: irq_stat 0x20000000, host bus error
[  165.592143] ata5: SError: { HostInt }
[  165.592145] ata5.00: failed command: READ FPDMA QUEUED
[  165.592149] ata5.00: cmd 60/08:60:a0:0d:89/00:00:07:00:00/40 tag 12 ncq 4096 in
                        res 40/00:74:40:58:5d/00:00:00:00:00/40 Emask 0x60 (host bus error)
[  165.592151] ata5.00: status: { DRDY }
-------------------->8--------------------
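
For readers decoding the log above: SAct is the SActive bitmask of
outstanding NCQ tags, one bit per tag. A minimal sketch of the decoding
follows; the helper name active_ncq_tags is illustrative and not part of
any kernel tool.

```python
def active_ncq_tags(sact):
    """Return the NCQ tag numbers whose bits are set in the SActive mask."""
    return [tag for tag in range(32) if sact & (1 << tag)]


# SAct value taken from the log line above.
tags = active_ncq_tags(0x7000)
print(tags)
```

The failed command in the log carries tag 12, which is one of the tags
this mask marks as outstanding.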

After a few dozen of these errors, I'd suddenly find my system in read-only
mode with corrupted files throughout my encrypted filesystems (seemed like
either a read or a write would corrupt a file, though I could be mistaken). I
decided to do a git bisect with a random read-write-sync test to narrow down
the culprit, which turned out to be this commit (part of a series):

# first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: don't allocate pages for a partial request

Just to be sure, I created a patch to revert the entire nine patch series that
commit belonged to... and the bad behavior disappeared. I've now been running
kernel 4.0 for a few days without issue, and went so far as to stress test my
poor SSD for a few hours to be 100% positive.

Here's some more info on my setup.

-------------------->8--------------------
$ lsblk -f
NAME         FSTYPE      LABEL MOUNTPOINT
sda                  
├─sda1       vfat              /boot/EFI
├─sda2       ext4              /boot
└─sda3       LVM2_member
  ├─SSD-root crypto_LUKS
  │ └─root   f2fs              /
  └─SSD-home crypto_LUKS
    └─home   f2fs              /home

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-linux-memnix cryptdevice=/dev/SSD/root:root:allow-discards root=/dev/mapper/root acpi_osi=Linux security=tomoyo TOMOYO_trigger=/usr/lib/systemd/systemd intel_iommu=on modprobe.blacklist=nouveau rw quiet

$ cat /etc/lvm/lvm.conf | grep "issue_discards"
issue_discards = 1
-------------------->8--------------------

If there's anything else I can do to help diagnose the underlying problem, I'm
more than willing.

Thanks,

Abelardo Ricart.
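
A random read-write-sync test of the kind mentioned in the report could
look roughly like the sketch below. It is not the script used for the
bisect, which was never posted; the function name, block size, block
count and round count are all assumptions chosen for illustration.

```python
import os
import random


def write_sync_verify(path, rounds=50, block_size=4096, blocks=256, seed=0):
    """Write random blocks at random offsets, fsync each one, then read back.

    Returns the number of blocks whose read-back content differs from
    what was last written to that offset.
    """
    rng = random.Random(seed)      # seeded so the offsets are repeatable
    expected = {}                  # block index -> last data written there
    mismatches = 0
    with open(path, "w+b") as f:
        f.truncate(block_size * blocks)
        for _ in range(rounds):
            index = rng.randrange(blocks)
            data = os.urandom(block_size)
            f.seek(index * block_size)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the write down to the block device
            expected[index] = data
        for index, data in expected.items():
            f.seek(index * block_size)
            if f.read(block_size) != data:
                mismatches += 1
    return mismatches
```

Run against a file on the encrypted filesystem, a non-zero return value
would indicate corruption. Note that the read-back here may be served
from the page cache; a more realistic test would drop caches or reopen
the file with direct I/O before verifying.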

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-01  4:37 Regression: Disk corruption with dm-crypt and kernels >= 4.0 Abelardo Ricart III
@ 2015-05-01 21:17 ` Mike Snitzer
  2015-05-01 22:24   ` Abelardo Ricart III
  2015-05-01 21:47 ` [dm-devel] " Alasdair G Kergon
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2015-05-01 21:17 UTC (permalink / raw)
  To: Abelardo Ricart III; +Cc: dm-devel, mpatocka, linux-kernel

On Fri, May 01 2015 at 12:37am -0400,
Abelardo Ricart III <aricart@memnix.com> wrote:

> I made sure to run a completely vanilla kernel when testing why I was suddenly
> seeing some nasty libata errors with all kernels >= v4.0. Here's a snippet:
> 
> -------------------->8--------------------
> [  165.592136] ata5.00: exception Emask 0x60 SAct 0x7000 SErr 0x800 action 0x6
> frozen
> [  165.592140] ata5.00: irq_stat 0x20000000, host bus error
> [  165.592143] ata5: SError: { HostInt }
> [  165.592145] ata5.00: failed command: READ FPDMA QUEUED
> [  165.592149] ata5.00: cmd 60/08:60:a0:0d:89/00:00:07:00:00/40 tag 12 ncq 4096
> in
>                         res 40/00:74:40:58:5d/00:00:00:00:00/40 Emask 0x60
> (host bus error)
> [  165.592151] ata5.00: status: { DRDY }
> -------------------->8--------------------
> 
> After a few dozen of these errors, I'd suddenly find my system in read-only
> mode with corrupted files throughout my encrypted filesystems (seemed like
> either a read or a write would corrupt a file, though I could be mistaken). I
> decided to do a git bisect with a random read-write-sync test to narrow down
> the culprit, which turned out to be this commit (part of a series):
> 
> # first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: don't
> allocate pages for a partial request
> 
> Just to be sure, I created a patch to revert the entire nine patch series that
> commit belonged to... and the bad behavior disappeared. I've now been running
> kernel 4.0 for a few days without issue, and went so far as to stress test my
> poor SSD for a few hours to be 100% positive.
> 
> Here's some more info on my setup.
> 
> -------------------->8--------------------
> $ lsblk -f
> NAME         FSTYPE      LABEL MOUNTPOINT
> sda                  
> ├─sda1       vfat              /boot/EFI
> ├─sda2       ext4              /boot
> └─sda3       LVM2_member
>   ├─SSD-root crypto_LUKS
>   │ └─root   f2fs              /
>   └─SSD-home crypto_LUKS
>     └─home   f2fs              /home
> 
> $ cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-linux-memnix cryptdevice=/dev/SSD/root:root:allow-discards
> root=/dev/mapper/root acpi_osi=Linux security=tomoyo
> TOMOYO_trigger=/usr/lib/systemd/systemd intel_iommu=on
> modprobe.blacklist=nouveau rw quiet
> 
> $ cat /etc/lvm/lvm.conf | grep "issue_discards"
> issue_discards = 1
> -------------------->8--------------------
> 
> If there's anything else I can do to help diagnose the underlying problem, I'm
> more than willing.

The patchset in question was tested quite heavily so this is a
surprising report.  I'm noticing you are opting in to dm-crypt discard
support.  Have you tested without discards enabled?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dm-devel] Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-01  4:37 Regression: Disk corruption with dm-crypt and kernels >= 4.0 Abelardo Ricart III
  2015-05-01 21:17 ` Mike Snitzer
@ 2015-05-01 21:47 ` Alasdair G Kergon
  2015-05-02  0:19   ` Abelardo Ricart III
  1 sibling, 1 reply; 11+ messages in thread
From: Alasdair G Kergon @ 2015-05-01 21:47 UTC (permalink / raw)
  To: Abelardo Ricart III; +Cc: dm-devel, mpatocka, linux-kernel, snitzer

On Fri, May 01, 2015 at 12:37:07AM -0400, Abelardo Ricart III wrote:
> # first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: don't
> allocate pages for a partial request
 
That's not a particularly good commit to identify.

If you didn't already, can you confirm whether or not the code works at the
patch immediately following?

  7145c241a1bf2841952c3e297c4080b357b3e52d

Alasdair


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-01 21:17 ` Mike Snitzer
@ 2015-05-01 22:24   ` Abelardo Ricart III
  2015-05-01 23:42     ` Abelardo Ricart III
  0 siblings, 1 reply; 11+ messages in thread
From: Abelardo Ricart III @ 2015-05-01 22:24 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, mpatocka, linux-kernel

On Fri, 2015-05-01 at 17:17 -0400, Mike Snitzer wrote:
> On Fri, May 01 2015 at 12:37am -0400,
> Abelardo Ricart III <aricart@memnix.com> wrote:
> 
> > I made sure to run a completely vanilla kernel when testing why I was 
> > suddenly
> > seeing some nasty libata errors with all kernels >= v4.0. Here's a snippet:
> > 
> > -------------------->8--------------------
> > [  165.592136] ata5.00: exception Emask 0x60 SAct 0x7000 SErr 0x800 action 
> > 0x6
> > frozen
> > [  165.592140] ata5.00: irq_stat 0x20000000, host bus error
> > [  165.592143] ata5: SError: { HostInt }
> > [  165.592145] ata5.00: failed command: READ FPDMA QUEUED
> > [  165.592149] ata5.00: cmd 60/08:60:a0:0d:89/00:00:07:00:00/40 tag 12 ncq 
> > 4096
> > in
> >                         res 40/00:74:40:58:5d/00:00:00:00:00/40 Emask 0x60
> > (host bus error)
> > [  165.592151] ata5.00: status: { DRDY }
> > -------------------->8--------------------
> > 
> > After a few dozen of these errors, I'd suddenly find my system in read-only
> > mode with corrupted files throughout my encrypted filesystems (seemed like
> > either a read or a write would corrupt a file, though I could be mistaken). 
> > I
> > decided to do a git bisect with a random read-write-sync test to narrow down
> > the culprit, which turned out to be this commit (part of a series):
> > 
> > # first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: 
> > don't
> > allocate pages for a partial request
> > 
> > Just to be sure, I created a patch to revert the entire nine patch series 
> > that
> > commit belonged to... and the bad behavior disappeared. I've now been 
> > running
> > kernel 4.0 for a few days without issue, and went so far as to stress test 
> > my
> > poor SSD for a few hours to be 100% positive.
> > 
> > Here's some more info on my setup.
> > 
> > -------------------->8--------------------
> > $ lsblk -f
> > NAME         FSTYPE      LABEL MOUNTPOINT
> > sda                  
> > ├─sda1       vfat              /boot/EFI
> > ├─sda2       ext4              /boot
> > └─sda3       LVM2_member
> >   ├─SSD-root crypto_LUKS
> >   │ └─root   f2fs              /
> >   └─SSD-home crypto_LUKS
> >     └─home   f2fs              /home
> > 
> > $ cat /proc/cmdline
> > BOOT_IMAGE=/vmlinuz-linux-memnix cryptdevice=/dev/SSD/root:root:allow
> > -discards
> > root=/dev/mapper/root acpi_osi=Linux security=tomoyo
> > TOMOYO_trigger=/usr/lib/systemd/systemd intel_iommu=on
> > modprobe.blacklist=nouveau rw quiet
> > 
> > $ cat /etc/lvm/lvm.conf | grep "issue_discards"
> > issue_discards = 1
> > -------------------->8--------------------
> > 
> > If there's anything else I can do to help diagnose the underlying problem, 
> > I'm
> > more than willing.
> 
> The patchset in question was tested quite heavily so this is a
> surprising report.  I'm noticing you are opting in to dm-crypt discard
> support.  Have you tested without discards enabled?

I've disabled discards universally and rebuilt a vanilla kernel. After running
my heavy read-write-sync scripts, everything seems to be working fine now. I
suppose this could be something that used to fail silently before, but now
produces bad behavior? I seem to remember having something in my message log
about "discards not supported on this device" when running with it enabled
before.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-01 22:24   ` Abelardo Ricart III
@ 2015-05-01 23:42     ` Abelardo Ricart III
  2015-05-15 15:04       ` Brandon Smith
  0 siblings, 1 reply; 11+ messages in thread
From: Abelardo Ricart III @ 2015-05-01 23:42 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, mpatocka, linux-kernel

On Fri, 2015-05-01 at 18:24 -0400, Abelardo Ricart III wrote:
> On Fri, 2015-05-01 at 17:17 -0400, Mike Snitzer wrote:
> > On Fri, May 01 2015 at 12:37am -0400,
> > Abelardo Ricart III <aricart@memnix.com> wrote:
> > 
> > > I made sure to run a completely vanilla kernel when testing why I was 
> > > suddenly
> > > seeing some nasty libata errors with all kernels >= v4.0. Here's a 
> > > snippet:
> > > 
> > > -------------------->8--------------------
> > > [  165.592136] ata5.00: exception Emask 0x60 SAct 0x7000 SErr 0x800 
> > > action 
> > > 0x6
> > > frozen
> > > [  165.592140] ata5.00: irq_stat 0x20000000, host bus error
> > > [  165.592143] ata5: SError: { HostInt }
> > > [  165.592145] ata5.00: failed command: READ FPDMA QUEUED
> > > [  165.592149] ata5.00: cmd 60/08:60:a0:0d:89/00:00:07:00:00/40 tag 12 
> > > ncq 
> > > 4096
> > > in
> > >                         res 40/00:74:40:58:5d/00:00:00:00:00/40 Emask 0x60
> > > (host bus error)
> > > [  165.592151] ata5.00: status: { DRDY }
> > > -------------------->8--------------------
> > > 
> > > After a few dozen of these errors, I'd suddenly find my system in read
> > > -only
> > > mode with corrupted files throughout my encrypted filesystems (seemed like
> > > either a read or a write would corrupt a file, though I could be 
> > > mistaken). 
> > > I
> > > decided to do a git bisect with a random read-write-sync test to narrow 
> > > down
> > > the culprit, which turned out to be this commit (part of a series):
> > > 
> > > # first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: 
> > > don't
> > > allocate pages for a partial request
> > > 
> > > Just to be sure, I created a patch to revert the entire nine patch series 
> > > that
> > > commit belonged to... and the bad behavior disappeared. I've now been 
> > > running
> > > kernel 4.0 for a few days without issue, and went so far as to stress 
> > > test 
> > > my
> > > poor SSD for a few hours to be 100% positive.
> > > 
> > > Here's some more info on my setup.
> > > 
> > > -------------------->8--------------------
> > > $ lsblk -f
> > > NAME         FSTYPE      LABEL MOUNTPOINT
> > > sda                  
> > > ├─sda1       vfat              /boot/EFI
> > > ├─sda2       ext4              /boot
> > > └─sda3       LVM2_member
> > >   ├─SSD-root crypto_LUKS
> > >   │ └─root   f2fs              /
> > >   └─SSD-home crypto_LUKS
> > >     └─home   f2fs              /home
> > > 
> > > $ cat /proc/cmdline
> > > BOOT_IMAGE=/vmlinuz-linux-memnix cryptdevice=/dev/SSD/root:root:allow
> > > -discards
> > > root=/dev/mapper/root acpi_osi=Linux security=tomoyo
> > > TOMOYO_trigger=/usr/lib/systemd/systemd intel_iommu=on
> > > modprobe.blacklist=nouveau rw quiet
> > > 
> > > $ cat /etc/lvm/lvm.conf | grep "issue_discards"
> > > issue_discards = 1
> > > -------------------->8--------------------
> > > 
> > > If there's anything else I can do to help diagnose the underlying 
> > > problem, 
> > > I'm
> > > more than willing.
> > 
> > The patchset in question was tested quite heavily so this is a
> > surprising report.  I'm noticing you are opting in to dm-crypt discard
> > support.  Have you tested without discards enabled?
> 
> I've disabled discards universally and rebuilt a vanilla kernel. After running
> my heavy read-write-sync scripts, everything seems to be working fine now. I
> suppose this could be something that used to fail silently before, but now
> produces bad behavior? I seem to remember having something in my message log
> about "discards not supported on this device" when running with it enabled
> before.

Forgive me, but I spoke too soon. The corruption and libata errors are still
there, as became evident when I went to reboot and was treated to an eyeful of
"read-only filesystem" and ata errors.

So no, disabling discards unfortunately did nothing to help.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [dm-devel] Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-01 21:47 ` [dm-devel] " Alasdair G Kergon
@ 2015-05-02  0:19   ` Abelardo Ricart III
  0 siblings, 0 replies; 11+ messages in thread
From: Abelardo Ricart III @ 2015-05-02  0:19 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: dm-devel, mpatocka, linux-kernel, snitzer

On Fri, 2015-05-01 at 22:47 +0100, Alasdair G Kergon wrote:
> On Fri, May 01, 2015 at 12:37:07AM -0400, Abelardo Ricart III wrote:
> > # first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: 
> > don't
> > allocate pages for a partial request
>  
> That's not a particularly good commit to identify.
> 
> If you didn't already, can you confirm whether or not the code works at the
> patch immediately following?
> 
>   7145c241a1bf2841952c3e297c4080b357b3e52d
> 
> Alasdair
> 
Just built that revision and it failed almost immediately with more ata
errors. It also corrupted my testing log.

As an aside, here's my fstab in case it's of any use:

-------------------->8--------------------
/dev/mapper/root        /             f2fs              rw,relatime,flush_merge,background_gc=on,user_xattr,acl,active_logs=6   0 0

/dev/mapper/home        /home           f2fs            rw,relatime,flush_merge,background_gc=on,user_xattr,acl,active_logs=6   0 2

/dev/sda2               /boot           ext4            rw,relatime,data=ordered        0 2

/dev/sda1               /boot/EFI       vfat            rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro    0 2

tmpfs                   /scratch        tmpfs           nodev,nosuid,size=12G           0 0
-------------------->8--------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-01 23:42     ` Abelardo Ricart III
@ 2015-05-15 15:04       ` Brandon Smith
  2015-05-18 14:36         ` Abelardo Ricart III
  0 siblings, 1 reply; 11+ messages in thread
From: Brandon Smith @ 2015-05-15 15:04 UTC (permalink / raw)
  To: Abelardo Ricart III; +Cc: Mike Snitzer, dm-devel, mpatocka, linux-kernel

On 2015-05-01 (Fri) at 19:42:15 -0400, Abelardo Ricart III wrote:
> > > The patchset in question was tested quite heavily so this is a
> > > surprising report.  I'm noticing you are opting in to dm-crypt discard
> > > support.  Have you tested without discards enabled?
> > 
> > I've disabled discards universally and rebuilt a vanilla kernel. After running
> > my heavy read-write-sync scripts, everything seems to be working fine now. I
> > suppose this could be something that used to fail silently before, but now
> > produces bad behavior? I seem to remember having something in my message log
> > about "discards not supported on this device" when running with it enabled
> > before.
> 
> Forgive me, but I spoke too soon. The corruption and libata errors are still
> there, as was evidenced when I went to reboot and got treated to an eye full of
> "read-only filesystem" and ata errors.
> 
> So no, disabling discards unfortunately did nothing to help.

I've been experiencing the same problem: vanilla 4.0-series kernels,
dm-crypt, with or without discards, on a ThinkPad X1 Carbon with a
LiteOn LGT-256M6G SSD.

After some googling around, I found some chatter relating to changes
in NCQ on SSDs in 4.0.  I've been running w/o NCQ for a full kernel build
so far without issue.  Perhaps there's been some change in the interaction
between dm-crypt and NCQ?

Abelardo, can you try w/o NCQ and see if that helps your situation?

Best,

--Brandon

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-15 15:04       ` Brandon Smith
@ 2015-05-18 14:36         ` Abelardo Ricart III
  2015-06-02 17:51           ` Mikulas Patocka
  0 siblings, 1 reply; 11+ messages in thread
From: Abelardo Ricart III @ 2015-05-18 14:36 UTC (permalink / raw)
  To: Brandon Smith; +Cc: Mike Snitzer, dm-devel, mpatocka, linux-kernel

On Fri, 2015-05-15 at 08:04 -0700, Brandon Smith wrote:
> On 2015-05-01 (Fri) at 19:42:15 -0400, Abelardo Ricart III wrote:
> > > > The patchset in question was tested quite heavily so this is a
> > > > surprising report.  I'm noticing you are opting in to dm-crypt discard
> > > > support.  Have you tested without discards enabled?
> > > 
> > > I've disabled discards universally and rebuilt a vanilla kernel. After 
> > > running
> > > my heavy read-write-sync scripts, everything seems to be working fine now. 
> > > I
> > > suppose this could be something that used to fail silently before, but now
> > > produces bad behavior? I seem to remember having something in my message 
> > > log
> > > about "discards not supported on this device" when running with it enabled
> > > before.
> > 
> > Forgive me, but I spoke too soon. The corruption and libata errors are still
> > there, as was evidenced when I went to reboot and got treated to an eye full 
> > of
> > "read-only filesystem" and ata errors.
> > 
> > So no, disabling discards unfortunately did nothing to help.
> 
> I've been experiencing the same problem.  Vanilla 4.0 series kernels,
> dm-crypt, with/or without discards, on a ThinkPad X1 Carbon with a
> LiteOn LGT-256M6G SSD.   
> 
> After some of googling around, I found some chatter relating to changes
> in NCQ on SSDs in 4.0.   Been running w/o NCQ for a full kernel build so
> far without issue.  Perhaps there's been some change in the interaction
> between dm-crypt and NCQ?
> 
> Abelardo, can you try w/o NCQ and see if that helps your situation?
> 
> Best,
> 
> --Brandon

I've been running with NCQ disabled and stress testing for a while, and the
issue is indeed gone. Thanks for the workaround!

So it seems the issue is somehow related to the combination of NCQ, dm-crypt,
and possibly (some?) SSDs.
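
The thread does not record how NCQ was turned off. Two mechanisms that
exist on Linux are the boot parameter libata.force=noncq and writing 1 to
/sys/block/<dev>/device/queue_depth. The sketch below only checks the
second one; the function names and the sysfs_root parameter (added so the
check can be pointed at a test directory) are assumptions.

```python
from pathlib import Path


def ncq_queue_depth(device, sysfs_root="/sys"):
    """Read the queue depth the kernel is using for a disk such as 'sda'."""
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    return int(path.read_text().strip())


def ncq_effectively_disabled(device, sysfs_root="/sys"):
    """A queue depth of 1 means only one command is outstanding at a time."""
    return ncq_queue_depth(device, sysfs_root) == 1
```

Calling ncq_effectively_disabled("sda") before a stress run is a cheap
way to confirm that the workaround is actually in effect on the disk
under test.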

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-05-18 14:36         ` Abelardo Ricart III
@ 2015-06-02 17:51           ` Mikulas Patocka
  2015-06-03  2:21             ` Abelardo Ricart III
  0 siblings, 1 reply; 11+ messages in thread
From: Mikulas Patocka @ 2015-06-02 17:51 UTC (permalink / raw)
  To: Abelardo Ricart III; +Cc: Brandon Smith, Mike Snitzer, dm-devel, linux-kernel



On Mon, 18 May 2015, Abelardo Ricart III wrote:

> On Fri, 2015-05-15 at 08:04 -0700, Brandon Smith wrote:
> > On 2015-05-01 (Fri) at 19:42:15 -0400, Abelardo Ricart III wrote:
> > > > > The patchset in question was tested quite heavily so this is a
> > > > > surprising report.  I'm noticing you are opting in to dm-crypt discard
> > > > > support.  Have you tested without discards enabled?
> > > > 
> > > > I've disabled discards universally and rebuilt a vanilla kernel. After 
> > > > running
> > > > my heavy read-write-sync scripts, everything seems to be working fine now. 
> > > > I
> > > > suppose this could be something that used to fail silently before, but now
> > > > produces bad behavior? I seem to remember having something in my message 
> > > > log
> > > > about "discards not supported on this device" when running with it enabled
> > > > before.
> > > 
> > > Forgive me, but I spoke too soon. The corruption and libata errors are still
> > > there, as was evidenced when I went to reboot and got treated to an eye full 
> > > of
> > > "read-only filesystem" and ata errors.
> > > 
> > > So no, disabling discards unfortunately did nothing to help.
> > 
> > I've been experiencing the same problem.  Vanilla 4.0 series kernels,
> > dm-crypt, with/or without discards, on a ThinkPad X1 Carbon with a
> > LiteOn LGT-256M6G SSD.   
> > 
> > After some of googling around, I found some chatter relating to changes
> > in NCQ on SSDs in 4.0.   Been running w/o NCQ for a full kernel build so
> > far without issue.  Perhaps there's been some change in the interaction
> > between dm-crypt and NCQ?
> > 
> > Abelardo, can you try w/o NCQ and see if that helps your situation?
> > 
> > Best,
> > 
> > --Brandon
> 
> I've been running with NCQ disabled and been stress testing for awhile and the
> issue is indeed gone. Thanks for the workaround!
> 
> So it seems the issue is somehow related to the combination of NCQ, dm-crypt,
> and possibly (some?) SSDs.

Hi

I suspect that this is a bug in kernel NCQ processing or in SSD firmware 
and recent dm-crypt changes made the bug show up.

I suggest this:

If you have some test that reliably reproduces the bug, please do this: 
take kernel 3.19 or 3.18 and apply dm-crypt parallelization patches 
(commits f3396c58fd8442850e759843457d78b6ec3a9589, 
cf2f1abfbd0dba701f7f16ef619e4d2485de3366, 
7145c241a1bf2841952c3e297c4080b357b3e52d, 
94f5e0243c48aa01441c987743dc468e2d6eaca2, 
dc2676210c425ee8e5cb1bec5bc84d004ddf4179, 
0f5d8e6ee758f7023e4353cca75d785b2d4f6abe, 
b3c5fd3052492f1b8d060799d4f18be5a5438add) on it. If the bug doesn't show 
up with the older kernel and dm-crypt parallelization patches, use git 
bisect to find out which patch broke NCQ. When you test a kernel with 
bisect, apply the above mentioned patches to it.

Mikulas
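
The preparation step suggested above can be written down as a list of
git commands. This is only a sketch: it builds the command lines and
does not run them, the branch name is hypothetical, the commits are
taken in the order listed in the message, and whether they apply
cleanly to v3.19 has not been verified here.

```python
# Commit IDs exactly as listed in the message above.
PARALLELIZATION_COMMITS = [
    "f3396c58fd8442850e759843457d78b6ec3a9589",
    "cf2f1abfbd0dba701f7f16ef619e4d2485de3366",
    "7145c241a1bf2841952c3e297c4080b357b3e52d",
    "94f5e0243c48aa01441c987743dc468e2d6eaca2",
    "dc2676210c425ee8e5cb1bec5bc84d004ddf4179",
    "0f5d8e6ee758f7023e4353cca75d785b2d4f6abe",
    "b3c5fd3052492f1b8d060799d4f18be5a5438add",
]


def prepare_commands(base="v3.19", branch="dm-crypt-parallel-test"):
    """Return git commands that branch from `base` and apply the patches."""
    commands = [["git", "checkout", "-b", branch, base]]
    for commit in PARALLELIZATION_COMMITS:
        commands.append(["git", "cherry-pick", commit])
    return commands


for command in prepare_commands():
    print(" ".join(command))
```

The same cherry-picks would have to be reapplied on top of each revision
that git bisect checks out, since the bisect is then looking for the
non-dm-crypt change that interacts badly with these patches.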

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-06-02 17:51           ` Mikulas Patocka
@ 2015-06-03  2:21             ` Abelardo Ricart III
  2015-09-11 16:11               ` Mike Snitzer
  0 siblings, 1 reply; 11+ messages in thread
From: Abelardo Ricart III @ 2015-06-03  2:21 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Brandon Smith, Mike Snitzer, dm-devel, linux-kernel

On Tue, 2015-06-02 at 13:51 -0400, Mikulas Patocka wrote:
> 
> On Mon, 18 May 2015, Abelardo Ricart III wrote:
> 
> > On Fri, 2015-05-15 at 08:04 -0700, Brandon Smith wrote:
> > > On 2015-05-01 (Fri) at 19:42:15 -0400, Abelardo Ricart III wrote:
> > > > > > The patchset in question was tested quite heavily so this is a
> > > > > > surprising report.  I'm noticing you are opting in to dm-crypt 
> discard
> > > > > > support.  Have you tested without discards enabled?
> > > > > 
> > > > > I've disabled discards universally and rebuilt a vanilla kernel. After 
> 
> > > > > running
> > > > > my heavy read-write-sync scripts, everything seems to be working fine 
> now. 
> > > > > I
> > > > > suppose this could be something that used to fail silently before, but 
> now
> > > > > produces bad behavior? I seem to remember having something in my 
> message 
> > > > > log
> > > > > about "discards not supported on this device" when running with it 
> enabled
> > > > > before.
> > > > 
> > > > Forgive me, but I spoke too soon. The corruption and libata errors are 
> still
> > > > there, as was evidenced when I went to reboot and got treated to an eye 
> full 
> > > > of
> > > > "read-only filesystem" and ata errors.
> > > > 
> > > > So no, disabling discards unfortunately did nothing to help.
> > > 
> > > I've been experiencing the same problem.  Vanilla 4.0 series kernels,
> > > dm-crypt, with/or without discards, on a ThinkPad X1 Carbon with a
> > > LiteOn LGT-256M6G SSD.   
> > > 
> > > After some of googling around, I found some chatter relating to changes
> > > in NCQ on SSDs in 4.0.   Been running w/o NCQ for a full kernel build so
> > > far without issue.  Perhaps there's been some change in the interaction
> > > between dm-crypt and NCQ?
> > > 
> > > Abelardo, can you try w/o NCQ and see if that helps your situation?
> > > 
> > > Best,
> > > 
> > > --Brandon
> > 
> > I've been running with NCQ disabled and been stress testing for awhile and 
> the
> > issue is indeed gone. Thanks for the workaround!
> > 
> > So it seems the issue is somehow related to the combination of NCQ, dm
> -crypt,
> > and possibly (some?) SSDs.
> 
> Hi
> 
> I suspect that this is a bug in kernel NCQ processing or in SSD firmware 
> and recent dm-crypt changes made the bug show up.
> 
> I suggest this:
> 
> If you have some test that reliably reproduces the bug, please do this: 
> take kernel 3.19 or 3.18 and apply dm-crypt parallelization patches 
> (commits f3396c58fd8442850e759843457d78b6ec3a9589, 
> cf2f1abfbd0dba701f7f16ef619e4d2485de3366, 
> 7145c241a1bf2841952c3e297c4080b357b3e52d, 
> 94f5e0243c48aa01441c987743dc468e2d6eaca2, 
> dc2676210c425ee8e5cb1bec5bc84d004ddf4179, 
> 0f5d8e6ee758f7023e4353cca75d785b2d4f6abe, 
> b3c5fd3052492f1b8d060799d4f18be5a5438add) on it. If the bug doesn't show 
> up with the older kernel and dm-crypt parallelization patches, use git 
> bisect to find out which patch broken NCQ. When you test a kernel with 
> bisect, apply the above mentioned patches to it.
> 
> Mikulas

Alright, I'll try this next and report back soon.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0
  2015-06-03  2:21             ` Abelardo Ricart III
@ 2015-09-11 16:11               ` Mike Snitzer
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2015-09-11 16:11 UTC (permalink / raw)
  To: Abelardo Ricart III
  Cc: Mikulas Patocka, dm-devel, Brandon Smith, linux-kernel

Hi,

Could you please try the following patch (against any of the kernels you
saw the corruption with, be it 4.0, 4.1 or 4.2) to see if the regression
you reported goes away?  Thanks, Mike

From: Mike Snitzer <snitzer@redhat.com>
Date: Wed, 9 Sep 2015 21:34:51 -0400
Subject: [PATCH] dm crypt: constrain crypt device's max_segment_size to
 PAGE_SIZE

Unfortunate constraint that is required to avoid the potential for
exceeding underlying device's max_segments limits -- due to
crypt_alloc_buffer() possibly allocating pages for the encryption bio
that are not as physically contiguous as the original bio.

Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-crypt.c |   17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 76f1d6e..f717762 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -973,7 +973,8 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone);
 
 /*
  * Generate a new unfragmented bio with the given size
- * This should never violate the device limitations
+ * This should never violate the device limitations (but only because
+ * max_segment_size is being constrained to PAGE_SIZE).
  *
  * This function may be called concurrently. If we allocate from the mempool
  * concurrently, there is a possibility of deadlock. For example, if we have
@@ -2057,9 +2058,20 @@ static int crypt_iterate_devices(struct dm_target *ti,
 	return fn(ti, cc->dev, cc->start, ti->len, data);
 }
 
+static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	/*
+	 * Unfortunate constraint that is required to avoid the potential
+	 * for exceeding underlying device's max_segments limits -- due to
+	 * crypt_alloc_buffer() possibly allocating pages for the encryption
+	 * bio that are not as physically contiguous as the original bio.
+	 */
+	limits->max_segment_size = PAGE_SIZE;
+}
+
 static struct target_type crypt_target = {
 	.name   = "crypt",
-	.version = {1, 14, 0},
+	.version = {1, 14, 1},
 	.module = THIS_MODULE,
 	.ctr    = crypt_ctr,
 	.dtr    = crypt_dtr,
@@ -2071,6 +2083,7 @@ static struct target_type crypt_target = {
 	.message = crypt_message,
 	.merge  = crypt_merge,
 	.iterate_devices = crypt_iterate_devices,
+	.io_hints = crypt_io_hints,
 };
 
 static int __init dm_crypt_init(void)
-- 
1.7.4.4
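
The reasoning in the commit message can be illustrated with a toy model
of segment counting. This is not kernel code: count_segments, the page
frame numbers and the 64 KiB limit are assumptions made up for the
example. It only shows why a buffer built from scattered pages needs
more segments than a physically contiguous one of the same size, and
why capping max_segment_size at PAGE_SIZE removes that difference.

```python
PAGE_SIZE = 4096


def count_segments(page_frames, max_segment_size):
    """Count segments for a buffer made of the given page frame numbers.

    A page joins the current segment only if it is physically adjacent
    to the previous page and the segment stays within max_segment_size.
    """
    segments = 0
    current_size = 0
    previous = None
    for frame in page_frames:
        adjacent = previous is not None and frame == previous + 1
        if adjacent and current_size + PAGE_SIZE <= max_segment_size:
            current_size += PAGE_SIZE
        else:
            segments += 1
            current_size = PAGE_SIZE
        previous = frame
    return segments


original = list(range(100, 116))   # 16 physically contiguous pages
clone = list(range(0, 32, 2))      # 16 pages, none adjacent to the next

print(count_segments(original, 64 * 1024), count_segments(clone, 64 * 1024))
print(count_segments(original, PAGE_SIZE), count_segments(clone, PAGE_SIZE))
```

With a large segment size the original bio is counted as one segment
while its scattered clone needs sixteen, so a bio sized against the
device's max_segments limit can exceed it once re-allocated. With the
limit at PAGE_SIZE both are counted per page, so the clone never needs
more segments than the original was budgeted for.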


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-09-11 16:11 UTC | newest]
