All of lore.kernel.org
* Regression in 4.14: wrong data being read from bcache device
@ 2017-11-16  8:28 Pavel Goran
  2017-11-16 19:36 ` Campbell Steven
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Goran @ 2017-11-16  8:28 UTC (permalink / raw)
  To: linux-bcache

Hello list,

I encountered a severe problem when trying to switch to kernel version 4.14.
In short, reads from the bcache device produce different data in 4.14 and
4.13.

I'm running Gentoo Linux with manually-compiled kernels (normally it's
pf-kernel; after encountering this problem, I tried version 4.14 of the
regular Gentoo kernel, and the problem was still there). I'm using an embedded
initramfs (generated with dracut 2 years ago). bcache is compiled into the
kernel (not a module). I have been using bcache (and this device in
particular) for more than a year.

My setup is as follows:

* /dev/bcache0 uses /dev/sda9 as cache and /dev/sdb1 as backing device

* LVM Volume Group consists of /dev/bcache0

* /dev/dm-1 is an LVM Logical Volume containing a JFS filesystem

bcache is running in writearound mode. Data from priority_stats:

Unused:         64%
Clean:          33%
Dirty:          0%
Metadata:       0%
Average:        391
Sectors per Q:  412592
Quantiles:      [1 5 14 27 45 90 105 115 122 140 154 260 267 285 363 369 438 507 530 540 554 583 596 608 636 671 688 805 815 826 834]

On the first reboot into 4.14, /dev/dm-1 failed its fsck check; the error
message said something about secondary data structures not matching the
primary ones. After this, the filesystem failed to mount, and boot stopped.
I then rebooted under 4.13, and fsck did not find any errors, even after a
forced check; I'm now running 4.13 and actively accessing the filesystem
read-write without any problems.

I booted under 4.13 and 4.14 in single-user mode and took dumps of the first
16M of /dev/sdb1, /dev/bcache0 and /dev/dm-1 (the last one was actually 1M).
When I compare the 4.13 and 4.14 dumps of each device, I see that the dumps
from /dev/sdb1 are identical, but the dumps from /dev/bcache0 and /dev/dm-1
differ considerably: 1437954 differing bytes (of 16M) for bcache0, and
16317 differing bytes (of 1M) for dm-1.
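For reference, a comparison along these lines can be redone with dd and cmp.
A minimal sketch (the dd commands are commented out and only indicate how the
dumps would be taken on each kernel; the two small files below are stand-ins
so the counting step is safe to run anywhere):

```shell
# On each kernel one would capture the first 16M of a device, e.g.:
#   dd if=/dev/bcache0 of=bcache0-4.13.dump bs=1M count=16
#   dd if=/dev/bcache0 of=bcache0-4.14.dump bs=1M count=16
# Stand-in files simulating two dumps that differ in two bytes:
printf 'AAAA' > dump-4.13.bin
printf 'ABCA' > dump-4.14.bin

# cmp -l prints one line per differing byte, so counting its output
# lines gives the number of bytes that differ between the dumps:
diff_bytes=$(cmp -l dump-4.13.bin dump-4.14.bin | wc -l)
echo "differing bytes: $diff_bytes"
```

The same line count against real 16M dumps yields figures like the 1437954
reported above.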

The patterns of invalid (4.14) data vary: I observe (1) almost-normal data
(for example, the superblock of the JFS filesystem seems to be mostly OK, with
differences like the dirty bit set in 4.14 data and not set in 4.13 data, and
a different s_logserial value), (2) non-zero bytes in place of zeroes, (3)
zeroes or mostly zeroes in place of non-zero bytes, (4) FF bytes in place of
zeroes or mostly zeroes.

In both kernels, dmesg doesn't show any abnormal bcache-related records. With
both kernels, I get records like this:

[    3.999534] bcache: bch_journal_replay() journal replay done, 11 keys in 8 entries, seq 21829
[    3.999917] bcache: register_cache() registered cache device sda9
[    4.029076] bcache: register_bdev() registered backing device sdb1
[    4.031208] bcache: bch_cached_dev_attach() Caching sdb1 as bcache0 on set 42eb0424-c06b-4a0d-af25-a76fb0ff8289

And then:

[   12.423886] bcache: register_bcache() error opening /dev/sda9: device already registered
[   12.850808] bcache: register_bcache() error opening /dev/sdb1: device already registered

(This looks like udev repeatedly trying to register the already-registered
devices.)

Overall, this looks like a very serious bug that can corrupt or completely
destroy the data on bcache devices.

Pavel Goran

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression in 4.14: wrong data being read from bcache device
  2017-11-16  8:28 Regression in 4.14: wrong data being read from bcache device Pavel Goran
@ 2017-11-16 19:36 ` Campbell Steven
  2017-11-16 19:48   ` Michael Lyle
  0 siblings, 1 reply; 8+ messages in thread
From: Campbell Steven @ 2017-11-16 19:36 UTC (permalink / raw)
  To: Pavel Goran; +Cc: linux-bcache

On 16 November 2017 at 21:28, Pavel Goran <via-bcache@pvgoran.name> wrote:
> Hello list,
>
> I encountered a severe problem when trying to switch to kernel version 4.14.
> In short, reads from the bcache device produce different data in 4.14 and
> 4.13.
>
> I'm running Gentoo Linux with manually-compiled kernels (normally it's
> pf-kernel; after encountering this problem, I tried version 4.14 of the
> regular Gentoo kernel, and the problem was still there). I'm using an embedded
> initramfs (generated with dracut 2 years ago). bcache is compiled into the
> kernel (not a module). I have been using bcache (and this device in
> particular) for more than a year.
>
> My setup is as follows:
>
> * /dev/bcache0 uses /dev/sda9 as cache and /dev/sdb1 as backing device
>
> * LVM Volume Group consists of /dev/bcache0
>
> * /dev/dm-1 is an LVM Logical Volume containing a JFS filesystem
>
> bcache is running in writearound mode. Data from priority_stats:
>
> Unused:         64%
> Clean:          33%
> Dirty:          0%
> Metadata:       0%
> Average:        391
> Sectors per Q:  412592
> Quantiles:      [1 5 14 27 45 90 105 115 122 140 154 260 267 285 363 369 438 507 530 540 554 583 596 608 636 671 688 805 815 826 834]
>
> On first reboot in 4.14, /dev/dm-1 failed a fsck check. An error message was
> something about secondary data structures not matching primary. After this,
> the filesystem failed to mount, and boot stopped. Then I rebooted under 4.13
> and fsck did not find any errors, even after forced check; now I'm
> successfully running 4.13 and actively access the filesystem read-write
> without any problems.
>
> I booted under 4.13 and 4.14 in single-user mode and took dumps of the first
> 16M of /dev/sdb1, /dev/bcache0 and /dev/dm-1 (the last one was actually 1M).
> When I compare the 4.13 and 4.14 dumps of each device, I see that dumps from
> /dev/sdb1 are the same, but dumps from /dev/bcache0 and /dev/dm-1 are
> considerably different: it's 1437954 (of 16M) different bytes for bcache0, and
> 16317 (of 1M) different bytes for dm-1.
>
> The patterns of invalid (4.14) data vary: I observe (1) almost-normal data
> (for example, the superblock of the JFS filesystem seems to be mostly OK, with
> differences like the dirty bit set in 4.14 data and not set in 4.13 data, and
> a different s_logserial value), (2) non-zero bytes in place of zeroes, (3)
> zeroes or mostly zeroes in place of non-zero bytes, (4) FF bytes in place of
> zeroes or mostly zeroes.
>
> In both kernels, dmesg doesn't show any abnormal bcache-related records. With
> both kernels, I get records like this:
>
> [    3.999534] bcache: bch_journal_replay() journal replay done, 11 keys in 8 entries, seq 21829
> [    3.999917] bcache: register_cache() registered cache device sda9
> [    4.029076] bcache: register_bdev() registered backing device sdb1
> [    4.031208] bcache: bch_cached_dev_attach() Caching sdb1 as bcache0 on set 42eb0424-c06b-4a0d-af25-a76fb0ff8289
>
> And then:
>
> [   12.423886] bcache: register_bcache() error opening /dev/sda9: device already registered
> [   12.850808] bcache: register_bcache() error opening /dev/sdb1: device already registered
>
> (Looks like udev trying to repeatedly register the device).
>
> Overall, this looks like a very serious bug that can corrupt or completely
> destroy the data on bcache devices.
>
> Pavel Goran
>

Hi Pavel,

I'm kind of glad this is what you are seeing as well, as I have been
facing exactly the same issue. For the past few days I've been trying
to put together a summary like yours to post, but aside from the
filesystem (we are using XFS) we are seeing exactly the same issue
when we try to move from a 4.7 kernel up to 4.14. I have also tried
the bcache patches out of linux-next (slated for 4.15), which contain
all the latest fixes, but I see the same result.

I have a test system here, so I am happy to do any further testing or
debugging if developers are unable to reproduce this one or want more
information.

These are the messages I get on boot into 4.14 from 4.7 (which had a
working bcache before this boot):

[    7.987709] bcache: register_bdev() registered backing device sda1
[    8.039975] bcache: bch_journal_replay() journal replay done, 9 keys in 9 entries, seq 2695
[    8.053470] bcache: bch_cached_dev_attach() Caching sda1 as bcache0 on set 7d0cfd11-6191-4b6d-a68d-76428e36948d
[    8.053483] bcache: register_cache() registered cache device sdd2
[    8.064088] bcache: register_bdev() registered backing device sdb1
[    8.069736] bcache: bch_cached_dev_attach() Caching sdb1 as bcache1 on set 7d0cfd11-6191-4b6d-a68d-76428e36948d
[    8.069893] bcache: register_bdev() registered backing device sdc1
[    8.076092] bcache: bch_cached_dev_attach() Caching sdc1 as bcache2 on set 7d0cfd11-6191-4b6d-a68d-76428e36948d
[   13.532137] systemd[1]: Set hostname to <bcache-test>.
[   16.051024] bcache: register_bcache() error opening /dev/sdd2: device already registered
[   16.153337] bcache: register_bcache() error opening /dev/sdb1: device already registered
[   16.381834] bcache: register_bcache() error opening /dev/sdc1: device already registered
[   16.429065] bcache: register_bcache() error opening /dev/sda1: device already registered
[   16.535736] XFS (bcache2): Mounting V5 Filesystem
[   16.537606] XFS (bcache1): Mounting V5 Filesystem
[   16.573616] XFS (bcache1): Ending clean mount
[   16.574389] XFS (bcache0): Mounting V5 Filesystem
[   16.893474] XFS (bcache2): Ending clean mount
[   16.912354] XFS (bcache2): Metadata corruption detected at xfs_agi_read_verify+0x4f/0xe0 [xfs], xfs_agi block 0x746a32a
[   16.926493] XFS (bcache0): Ending clean mount
[   16.956459] XFS (bcache0): Metadata corruption detected at xfs_agi_read_verify+0x4f/0xe0 [xfs], xfs_agi block 0x747145a
[   16.956460] XFS (bcache0): Unmount and run xfs_repair
[   16.956461] XFS (bcache0): First 64 bytes of corrupted metadata buffer:
[   16.956499] XFS (bcache0): metadata I/O error: block 0x747145a ("xfs_trans_read_buf_map") error 117 numblks 1
[   16.967095] XFS (bcache0): Metadata corruption detected at xfs_agi_read_verify+0x4f/0xe0 [xfs], xfs_agi block 0xe8e28b2
[   16.967096] XFS (bcache0): Unmount and run xfs_repair
[   16.967097] XFS (bcache0): First 64 bytes of corrupted metadata buffer:
[   16.967108] XFS (bcache0): metadata I/O error: block 0xe8e28b2 ("xfs_trans_read_buf_map") error 117 numblks 1
[   16.976962] XFS (bcache0): Metadata corruption detected at xfs_agi_read_verify+0x4f/0xe0 [xfs], xfs_agi block 0x15d53d0a
[   16.976963] XFS (bcache0): Unmount and run xfs_repair
[   16.976964] XFS (bcache0): First 64 bytes of corrupted metadata buffer:
[   16.977000] XFS (bcache0): metadata I/O error: block 0x15d53d0a ("xfs_trans_read_buf_map") error 117 numblks 1
[   16.977004] XFS (bcache0): Error -117 reserving per-AG metadata reserve pool.
[   16.977005] XFS (bcache0): xfs_do_force_shutdown(0x8) called from line 1017 of file fs/xfs/xfs_fsops.c.  Return address = 0xffffffffc08695c0
[   16.977011] XFS (bcache0): Corruption of in-memory data detected. Shutting down filesystem
[   16.977011] XFS (bcache0): Please umount the filesystem and rectify the problem(s)
[   18.602638] XFS (bcache2): Unmount and run xfs_repair
[   18.602638] XFS (bcache2): First 64 bytes of corrupted metadata buffer:
[   18.602675] XFS (bcache2): metadata I/O error: block 0x746a32a ("xfs_trans_read_buf_map") error 117 numblks 1
[   18.622489] XFS (bcache2): Metadata corruption detected at xfs_agi_read_verify+0x4f/0xe0 [xfs], xfs_agi block 0xe8d4652
[   18.622490] XFS (bcache2): Unmount and run xfs_repair
[   18.622491] XFS (bcache2): First 64 bytes of corrupted metadata buffer:
[   18.622527] XFS (bcache2): metadata I/O error: block 0xe8d4652 ("xfs_trans_read_buf_map") error 117 numblks 1
[   18.635789] XFS (bcache2): Metadata corruption detected at xfs_agi_read_verify+0x4f/0xe0 [xfs], xfs_agi block 0x15d3e97a
[   18.635790] XFS (bcache2): Unmount and run xfs_repair
[   18.635791] XFS (bcache2): First 64 bytes of corrupted metadata buffer:
[   18.635827] XFS (bcache2): metadata I/O error: block 0x15d3e97a ("xfs_trans_read_buf_map") error 117 numblks 1
[   18.635830] XFS (bcache2): Error -117 reserving per-AG metadata reserve pool.
[   18.635832] XFS (bcache2): xfs_do_force_shutdown(0x8) called from line 1017 of file fs/xfs/xfs_fsops.c.  Return address = 0xffffffffc08695c0
[   18.635837] XFS (bcache2): Corruption of in-memory data detected. Shutting down filesystem
[   18.635838] XFS (bcache2): Please umount the filesystem and rectify the problem(s)

Campbell


* Re: Regression in 4.14: wrong data being read from bcache device
  2017-11-16 19:36 ` Campbell Steven
@ 2017-11-16 19:48   ` Michael Lyle
  2017-11-16 20:31     ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Lyle @ 2017-11-16 19:48 UTC (permalink / raw)
  To: Campbell Steven, Pavel Goran; +Cc: linux-bcache, linux-block, Kent Overstreet

Hi Campbell & Pavel---

On 11/16/2017 11:36 AM, Campbell Steven wrote:
> On 16 November 2017 at 21:28, Pavel Goran <via-bcache@pvgoran.name> wrote:
>> Hello list,
>>
>> I encountered a severe problem when trying to switch to kernel version 4.14.
>> In short, reads from the bcache device produce different data in 4.14 and
>> 4.13.

Thanks for the report.  I've heard about this once on the #bcache IRC
channel too, so it seems this is a real problem, though I've not
encountered it in my testing yet.

This is just a note to let you all know that I'm looking at this and
will be seeking a clean repro that's not too painful to bisect to
determine what's going on.

Most of the 4.14 work predates my involvement with bcache, so I'm coming
up to speed on it.  That said, it looks pretty boring/safe within
bcache--- and where there is functional change, it's almost all in the
write path, which wouldn't be relevant here.  So I'm somewhat fearful
that there is an interaction with something else in the block layer.

Mike


* Re: Regression in 4.14: wrong data being read from bcache device
  2017-11-16 19:48   ` Michael Lyle
@ 2017-11-16 20:31     ` Jens Axboe
  2017-11-16 23:02       ` Michael Lyle
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2017-11-16 20:31 UTC (permalink / raw)
  To: Michael Lyle, Campbell Steven, Pavel Goran
  Cc: linux-bcache, linux-block, Kent Overstreet

On 11/16/2017 12:48 PM, Michael Lyle wrote:
> Hi Campbell & Pavel---
> 
> On 11/16/2017 11:36 AM, Campbell Steven wrote:
>> On 16 November 2017 at 21:28, Pavel Goran <via-bcache@pvgoran.name> wrote:
>>> Hello list,
>>>
>>> I encountered a severe problem when trying to switch to kernel version 4.14.
>>> In short, reads from the bcache device produce different data in 4.14 and
>>> 4.13.
> 
> Thanks for the report.  I've heard about this once on the #bcache IRC
> channel too, so it seems this is a real problem, though I've not
> encountered it in my testing yet.
> 
> This is just a note to let you all know that I'm looking at this and
> will be seeking a clean repro that's not too painful to bisect to
> determine what's going on.
> 
> Most of the 4.14 work predates my involvement on bcache, so I'm coming
> up to speed on it.  That said, it looks pretty boring/safe within
> bcache--- and where there's functional change it's almost all in the
> write path which wouldn't be relevant here.  So I'm somewhat fearful
> that there is an interaction with something else within the block layer.

The 4.14 block layer changes were pretty boring and uneventful, so that'd
be surprising as well.

I'd suggest just doing a bisect between 4.13 and 4.14; it sounds like the
issue is trivially reproducible.

-- 
Jens Axboe


* Re: Regression in 4.14: wrong data being read from bcache device
  2017-11-16 20:31     ` Jens Axboe
@ 2017-11-16 23:02       ` Michael Lyle
  2017-11-17  3:07         ` Michael Lyle
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Lyle @ 2017-11-16 23:02 UTC (permalink / raw)
  To: Jens Axboe, Campbell Steven, Pavel Goran
  Cc: linux-bcache, linux-block, Kent Overstreet

On 11/16/2017 12:31 PM, Jens Axboe wrote:
> I'd suggest just doing a bisect between 4.13 and 4.14, sounds like the
> issue is trivially reproducible.

It took me a while to get a repro, but I have one now.  Will start
bisecting tonight.

Mike


* Re: Regression in 4.14: wrong data being read from bcache device
  2017-11-16 23:02       ` Michael Lyle
@ 2017-11-17  3:07         ` Michael Lyle
  2017-11-17 15:08           ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Lyle @ 2017-11-17  3:07 UTC (permalink / raw)
  To: Jens Axboe, Campbell Steven, Pavel Goran
  Cc: linux-bcache, linux-block, Kent Overstreet, hch

Jens, Christoph & everyone--

Hi-- I believe I have a general idea what's going on.

On Thu, Nov 16, 2017 at 3:02 PM, Michael Lyle <mlyle@lyle.org> wrote:
> On 11/16/2017 12:31 PM, Jens Axboe wrote:
> > I'd suggest just doing a bisect between 4.13 and 4.14, sounds like the
> > issue is trivially reproducible.
> Took me awhile to get a repro but I have one now.  Will start bisecting
> tonight.

It looks like the regression came from commit
74d46992e0d9dee7f1f376de0d56d31614c8a17a (3/3 repros with it; 0/3
repros with it removed and the tree hacked to build).

I don't have time to analyze this tonight, as I have a couple of other
things to do, but I will analyze it fully tomorrow morning.

I think the probable cause is this: we construct bios based on
previously completed bios in a few places, but blk_partition_remap
sets the partno to 0 and adjusts the disk offset.  It seems like a lot
of potential uses of bio_copy_dev could be dangerous for this reason.

Mike


* Re: Regression in 4.14: wrong data being read from bcache device
  2017-11-17  3:07         ` Michael Lyle
@ 2017-11-17 15:08           ` Christoph Hellwig
  2017-11-17 17:04             ` Michael Lyle
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2017-11-17 15:08 UTC (permalink / raw)
  To: Michael Lyle
  Cc: Jens Axboe, Campbell Steven, Pavel Goran, linux-bcache,
	linux-block, Kent Overstreet, hch

On Thu, Nov 16, 2017 at 07:07:40PM -0800, Michael Lyle wrote:
> I think the probable cause is this: we construct bios based on
> previously completed bios in a few places,

That is an extremely bad idea in many ways, so I think we'll need
to fix this as the priority.


* Re: Regression in 4.14: wrong data being read from bcache device
  2017-11-17 15:08           ` Christoph Hellwig
@ 2017-11-17 17:04             ` Michael Lyle
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Lyle @ 2017-11-17 17:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Campbell Steven, Pavel Goran, linux-bcache,
	linux-block, Kent Overstreet

On Fri, Nov 17, 2017 at 7:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Nov 16, 2017 at 07:07:40PM -0800, Michael Lyle wrote:
>> I think the probable cause is this: we construct bios based on
>> previously completed bios in a few places,
>
> That is an extremely bad idea in many ways, so I think we'll need
> to fix this as the priority.

Yah, sorry, this wasn't actually the case.  I was sleep-deprived and
hadn't been able to follow it all yet.

Bart points out that there are many remaining cases where the disk is
assigned but partno isn't, so I wonder if there are other potential
problems from this.

Mike

