* csum failure messages
@ 2013-11-05  1:24 Russell Coker
  2013-11-05  6:15 ` Hans-Kristian Bakke
  0 siblings, 1 reply; 12+ messages in thread
From: Russell Coker @ 2013-11-05  1:24 UTC (permalink / raw)
  To: linux-btrfs

The messages below are from dmesg on a system where "btrfs balance" just 
aborted.  It's running kernel 3.11.6 (the latest Debian package).

This seems to be telling me that Inode 388 is involved, but there are over 300 
subvols on that system which could contain such an Inode.

I think that more information is needed for such log messages.  We need to at 
least be able to identify the subvol (is it possible to extract this from the 
numbers in the log messages?).  Ideally we would be able to identify the file 
name as well.


[10751.637517] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
[10751.646390] BTRFS info (device sda3): csum failed ino 388 off 24104960 csum 5219137 private 2264608335
[10751.654472] BTRFS info (device sda3): csum failed ino 388 off 24154112 csum 4084831521 private 1792217768
[10751.731830] BTRFS info (device sda3): csum failed ino 388 off 23191552 csum 2566472073 private 3193692311
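
For anyone triaging a long dmesg, the messages above can be pulled apart mechanically. The following is a minimal sketch, not part of the original report: the regular expression is inferred only from the four lines quoted here, the function names are hypothetical, and nothing in it recovers the subvolume, which is exactly the information the log lacks.

```python
import re
from collections import defaultdict

# Pattern inferred from the messages quoted in this thread (kernel 3.11.6).
# Other kernel versions may word the message differently (assumption).
CSUM_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+BTRFS info \(device (?P<device>[^)]+)\):\s+"
    r"csum failed ino (?P<ino>\d+)\s+off (?P<off>\d+)\s+"
    r"csum\s+(?P<csum>\d+)\s+private\s+(?P<private>\d+)"
)


def parse_csum_failures(log_text):
    """Return one dict per 'csum failed' message found in log_text.

    The \\s+ separators also match a newline, so a message that a mail
    client wrapped after the 'csum' keyword is still recognised.
    """
    failures = []
    for m in CSUM_RE.finditer(log_text):
        failures.append({
            "time": float(m.group("time")),
            "device": m.group("device"),
            "ino": int(m.group("ino")),
            "off": int(m.group("off")),
            "csum": int(m.group("csum")),
            "private": int(m.group("private")),
        })
    return failures


def summarize(failures):
    """Map (device, inode) to the sorted list of distinct failing offsets."""
    offsets = defaultdict(set)
    for f in failures:
        offsets[(f["device"], f["ino"])].add(f["off"])
    return {key: sorted(offs) for key, offs in offsets.items()}
```

Feeding the output of `dmesg` to `parse_csum_failures` and then `summarize` collapses repeated reports of the same offset, which makes it easier to see how many distinct blocks are involved.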

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: csum failure messages
  2013-11-05  1:24 csum failure messages Russell Coker
@ 2013-11-05  6:15 ` Hans-Kristian Bakke
  2013-11-05 11:30   ` David Sterba
  2013-11-05 12:16   ` Russell Coker
  0 siblings, 2 replies; 12+ messages in thread
From: Hans-Kristian Bakke @ 2013-11-05  6:15 UTC (permalink / raw)
  To: linux-btrfs

As you were in the process of a rebalance these errors may actually be
caused by this serious bug "Btrfs: relocate csums properly with
prealloc extents".

I hit that myself with several preallocated files made by rtorrent
during a rebalance and I lost several huge files as a consequence. The
only way I could rebalance without large scale corruptions was to
manually patch the 3.11.6 kernel with the small patch that fixes the
issue.
For some reason this patch is not pushed upstream yet. I think that is
strange as it leads to corruption and actual data loss and it is 100%
reproducible with preallocated files. Only systemd logs are mentioned
in the bug reports, but in my case it was actually hitting several
terabytes of files created by rtorrent.

Best regards

Hans-Kristian Bakke




* Re: csum failure messages
  2013-11-05  6:15 ` Hans-Kristian Bakke
@ 2013-11-05 11:30   ` David Sterba
  2013-11-05 12:16   ` Russell Coker
  1 sibling, 0 replies; 12+ messages in thread
From: David Sterba @ 2013-11-05 11:30 UTC (permalink / raw)
  To: chris.mason; +Cc: Hans-Kristian Bakke, linux-btrfs

On Tue, Nov 05, 2013 at 07:15:57AM +0100, Hans-Kristian Bakke wrote:
> As you were in the process of a rebalance these errors may actually be
> caused by this serious bug "Btrfs: relocate csums properly with
> prealloc extents".
> 
> I hit that myself with several preallocated files made by rtorrent
> during a rebalance and I lost several huge files as a consequence. The
> only way I could rebalance without large scale corruptions was to
> manually patch the 3.11.6 kernel with the small patch that fixes the
> issue.
> For some reason this patch is not pushed upstream yet. I think that is
> strange as it leads to corruption and actual data loss and it is 100%
> reproducible with preallocated files. Only systemd logs are mentioned
> in the bug reports, but in my case it was actually hitting several
> terabytes of files created by rtorrent.

Thanks for the summary. There's no doubt that this is serious.

Chris, please can you somehow get the patch into stable sooner than it
gets to the 3.13 queue? The merge window will start in 1 week and based
on previous pull request schedule, the patch will be merged in ~2 weeks
from now. That's kind of a long time for an unfixed corruption bug that
reportedly affects common installations (with systemd) or use cases
(torrent) in combination with balance.

david


* Re: csum failure messages
  2013-11-05  6:15 ` Hans-Kristian Bakke
  2013-11-05 11:30   ` David Sterba
@ 2013-11-05 12:16   ` Russell Coker
  2013-11-05 12:37     ` Hans-Kristian Bakke
  2013-11-05 14:26     ` Chris Murphy
  1 sibling, 2 replies; 12+ messages in thread
From: Russell Coker @ 2013-11-05 12:16 UTC (permalink / raw)
  To: Hans-Kristian Bakke; +Cc: linux-btrfs

On Tue, 5 Nov 2013, "Hans-Kristian Bakke" <hkbakke@gmail.com> wrote:
> As you were in the process of a rebalance these errors may actually be
> caused by this serious bug "Btrfs: relocate csums properly with
> prealloc extents".
> 
> I hit that myself with several preallocated files made by rtorrent
> during a rebalance and I lost several huge files as a consequence. The
> only way I could rebalance without large scale corruptions was to
> manually patch the 3.11.6 kernel with the small patch that fixes the
> issue.
> For some reason this patch is not pushed upstream yet. I think that is
> strange as it leads to corruption and actual data loss and it is 100%
> reproducible with preallocated files. Only systemd logs are mentioned
> in the bug reports, but in my case it was actually hitting several
> terabytes of files created by rtorrent.

I run systemd so I guess it's the systemd logs.  That's fortunate as such logs 
aren't important to me.  Thanks for providing this information.

I've just run a scrub and I saw the following output.  There was nothing 
useful or apparently relevant in the kernel message log either.  So scrub is 
just telling me that there are 57 errors without giving me a clue as to which 
files might need to be restored from backup.

# btrfs scrub start -B /
scrub done for c55218a6-abb5-4e35-9a20-33fb1fa05879
        scrub started at Tue Nov  5 11:32:03 2013 and finished after 6762 seconds
        total bytes scrubbed: 140.06GB with 57 errors
        error details: csum=57
        corrected errors: 0, uncorrectable errors: 57, unverified errors: 0
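
The counters in this summary can be extracted for monitoring or for comparing runs. This is a minimal sketch that assumes the summary layout shown above (as printed by the btrfs-progs in use here; other versions may format it differently). The function name and the dictionary keys are hypothetical.

```python
import re


def parse_scrub_summary(text):
    """Extract error counters from a 'btrfs scrub start -B' summary.

    Only the fields visible in the output quoted in this thread are
    handled; anything else is ignored.
    """
    # Collapse all whitespace so that mail line wrapping does not matter.
    flat = " ".join(text.split())
    summary = {}

    m = re.search(r"with (\d+) errors", flat)
    if m:
        summary["total_errors"] = int(m.group(1))

    # "error details: csum=57" may list several key=value pairs.
    m = re.search(r"error details: (.*?) corrected errors:", flat)
    if m:
        summary["details"] = {
            key: int(value)
            for key, value in re.findall(r"(\w+)=(\d+)", m.group(1))
        }

    for key in ("corrected", "uncorrectable", "unverified"):
        m = re.search(r"\b" + key + r" errors: (\d+)", flat)
        if m:
            summary[key] = int(m.group(1))
    return summary
```

A result where `uncorrectable` equals `total_errors`, as in the run above, means scrub could not repair any of the blocks it flagged.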

I can imagine a balance operation being unable to conveniently display all the 
data that one might desire.  But a scrub really should go through everything 
and should know where the inconsistencies are.  In this case the scrub gave me 
less information than the balance.

I presume that my filesystem is still corrupt.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: csum failure messages
  2013-11-05 12:16   ` Russell Coker
@ 2013-11-05 12:37     ` Hans-Kristian Bakke
  2013-11-05 14:26     ` Chris Murphy
  1 sibling, 0 replies; 12+ messages in thread
From: Hans-Kristian Bakke @ 2013-11-05 12:37 UTC (permalink / raw)
  To: linux-btrfs

I gave up on getting the filesystem to a consistent state, but my
corruption was much more severe than yours: several hundred thousand
errors. As the fs was still usable and mountable, I just moved all the
files to another filesystem, patched the kernel, recreated the original
btrfs fs and ran a rebalance, this time without issues because of the
patch. As the corrupt files were rtorrent files in my case, I could just
rehash the torrents and make rtorrent redownload the corrupt blocks. Very
lucky indeed. The other files I could verify against backup.

Luckily the reason for the rebalance in the first place was to add
another 16TB of disk to the RAID10 array, so I just happened to have
enough temporary storage lying around. After patching the kernel and
rebalance I now have a 32TB btrfs RAID10 volume.
Best regards

Hans-Kristian Bakke




* Re: csum failure messages
  2013-11-05 12:16   ` Russell Coker
  2013-11-05 12:37     ` Hans-Kristian Bakke
@ 2013-11-05 14:26     ` Chris Murphy
  2013-11-05 14:34       ` Hugo Mills
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2013-11-05 14:26 UTC (permalink / raw)
  To: russell; +Cc: David Sterba, Btrfs BTRFS


On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
> 
> I presume that my filesystem is still corrupt.

I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
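
If scrub does log a path for each affected file, the list of files to delete or restore can be collected from dmesg. The sketch below is an illustration only: the thread does not quote such a message, so the assumed shape of the line (fields `root N, inode N` followed by a trailing `(path: ...)`) and the function name are assumptions that should be checked against real output before relying on it.

```python
import re

# ASSUMPTION: a scrub warning contains "root <id>, inode <n>, ..." and ends
# with "(path: <path relative to the subvolume>)". This format is not shown
# anywhere in the thread and may differ between kernel versions.
PATH_RE = re.compile(
    r"root (?P<root>\d+), inode (?P<inode>\d+),.*\(path: (?P<path>.+)\)\s*$"
)


def affected_files(dmesg_text):
    """Return sorted, de-duplicated (root, inode, path) tuples."""
    seen = set()
    for line in dmesg_text.splitlines():
        m = PATH_RE.search(line)
        if m:
            seen.add((int(m.group("root")), int(m.group("inode")), m.group("path")))
    return sorted(seen)


if __name__ == "__main__":
    # Hypothetical sample line, written to match the assumed format above.
    sample = (
        "[  123.456] btrfs: checksum error at logical 1000 on dev /dev/sdx1, "
        "sector 2000, root 5, inode 300, offset 0, length 4096, links 1 "
        "(path: var/log/journal/example.journal)"
    )
    for root, inode, path in affected_files(sample):
        print(root, inode, path)
```

Keeping the root (subvolume) id next to the inode matters on a system with hundreds of subvolumes, since the same inode number can exist in each of them.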

Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.

Chris Murphy


* Re: csum failure messages
  2013-11-05 14:26     ` Chris Murphy
@ 2013-11-05 14:34       ` Hugo Mills
  2013-11-06  0:20         ` John Williams
  2013-11-14 18:37         ` Chris Murphy
  0 siblings, 2 replies; 12+ messages in thread
From: Hugo Mills @ 2013-11-05 14:34 UTC (permalink / raw)
  To: Chris Murphy; +Cc: russell, David Sterba, Btrfs BTRFS


On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
> 
> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
> > 
> > I presume that my filesystem is still corrupt.
> 
> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
> 
> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.

   Someone else tripped over it on IRC last night, and I was surprised
to discover it hadn't made it upstream yet. :(

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Two things came out of Berkeley in the 1960s: LSD and Unix. ---   
                       This is not a coincidence.                        



* Re: csum failure messages
  2013-11-05 14:34       ` Hugo Mills
@ 2013-11-06  0:20         ` John Williams
  2013-11-06 12:19           ` Duncan
  2013-11-06 16:54           ` David Sterba
  2013-11-14 18:37         ` Chris Murphy
  1 sibling, 2 replies; 12+ messages in thread
From: John Williams @ 2013-11-06  0:20 UTC (permalink / raw)
  To: Hugo Mills, Btrfs BTRFS

On Tue, Nov 5, 2013 at 6:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>>
>> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>> >
>> > I presume that my filesystem is still corrupt.
>>
>> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
>>
>> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.
>
>    Someone else tripped over it on IRC last night, and I was surprised
> to discover it hadn't made it upstream yet. :(

Is there now a verification test that could detect an issue like this?
It seems like the sort of thing that needs to be added to automated
testing.


* Re: csum failure messages
  2013-11-06  0:20         ` John Williams
@ 2013-11-06 12:19           ` Duncan
  2013-11-06 16:54           ` David Sterba
  1 sibling, 0 replies; 12+ messages in thread
From: Duncan @ 2013-11-06 12:19 UTC (permalink / raw)
  To: linux-btrfs

John Williams posted on Tue, 05 Nov 2013 16:20:58 -0800 as excerpted:

> Is there now a verification test that could detect an issue like this?
> It seems like the sort of thing that needs to be added to automated
> testing.

[Your question is general, asking about automated testing without 
mentioning xfstests specifically, so I'm assuming a general answer is 
appropriate.]

I haven't tracked this specific issue, but in general, the btrfs devs are 
pretty strict about adding an xfstests test (xfstests is NOT used for just 
xfs; at least btrfs and ext4 use it too) for any regressions they find, 
and people ARE regularly running those tests on new code, so past issues 
don't happen again.

If you watch the list you'll see occasional patch rejections due to 
failed xfstests, regular patches adding new tests, and review discussion 
requesting that a test be added as appropriate.

So I'd be /very/ surprised if this bugfix didn't already have a 
corresponding new xfstests test.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: csum failure messages
  2013-11-06  0:20         ` John Williams
  2013-11-06 12:19           ` Duncan
@ 2013-11-06 16:54           ` David Sterba
  1 sibling, 0 replies; 12+ messages in thread
From: David Sterba @ 2013-11-06 16:54 UTC (permalink / raw)
  To: John Williams; +Cc: Hugo Mills, Btrfs BTRFS

On Tue, Nov 05, 2013 at 04:20:58PM -0800, John Williams wrote:
> Is there now a verification test that could detect an issue like this?
> It seems like the sort of thing that needs to be added to automated
> testing.

Yes there is:

xfstests/btrfs/013
https://bugzilla.kernel.org/show_bug.cgi?id=63411


* Re: csum failure messages
  2013-11-05 14:34       ` Hugo Mills
  2013-11-06  0:20         ` John Williams
@ 2013-11-14 18:37         ` Chris Murphy
  2013-11-14 18:44           ` David Sterba
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2013-11-14 18:37 UTC (permalink / raw)
  To: Hugo Mills; +Cc: russell, David Sterba, Btrfs BTRFS


On Nov 5, 2013, at 7:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:

> On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
>> 
>> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
>>> 
>>> I presume that my filesystem is still corrupt.
>> 
>> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
>> 
>> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.
> 
>   Someone else tripped over it on IRC last night, and I was surprised
> to discover it hadn't made it upstream yet. :(

I just checked kernel.org changelogs and I'm not seeing this fixed in either 3.11.7 or 3.11.8. It is listed in today's git pull by Chris for 3.12 as

Btrfs: relocate csums properly with prealloc extents


Chris Murphy


* Re: csum failure messages
  2013-11-14 18:37         ` Chris Murphy
@ 2013-11-14 18:44           ` David Sterba
  0 siblings, 0 replies; 12+ messages in thread
From: David Sterba @ 2013-11-14 18:44 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hugo Mills, russell, David Sterba, Btrfs BTRFS

On Thu, Nov 14, 2013 at 11:37:39AM -0700, Chris Murphy wrote:
> 
> On Nov 5, 2013, at 7:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> 
> > On Tue, Nov 05, 2013 at 07:26:54AM -0700, Chris Murphy wrote:
> >> 
> >> On Nov 5, 2013, at 5:16 AM, Russell Coker <russell@coker.com.au> wrote:
> >>> 
> >>> I presume that my filesystem is still corrupt.
> >> 
> >> I'm the original reporter of the bug. The file system itself isn't corrupt, but the affected files probably are. In my case, systemd journal files were reported as corrupt by systemd following a balance, as well as btrfs scrub. Upon scrub, dmesg contains a path for each affected file. Upon deleting those files, subsequent scrubs come up clean.
> >> 
> >> Fedora merged the fix for this bug with: 3.11.5-302.fc20, 3.11.6-200.fc19, 3.11.6-101.fc18. I thought for sure it was marked for stable to go into mainline 3.11.6, so somehow it's been missed.
> > 
> >   Someone else tripped over it on IRC last night, and I was surprised
> > to discover it hadn't made it upstream yet. :(
> 
> I just checked kernel.org changelogs and I'm not seeing this fixed in
> either 3.11.7 or 3.11.8. It is listed in today's git pull by Chris for
> 3.12 as
> 
> Btrfs: relocate csums properly with prealloc extents

The stable tree process does not normally accept patches that are not in
Linus' tree first. The patch has not been submitted to stable yet; that
will happen right after today's pull is merged.

david



Thread overview: 12+ messages
2013-11-05  1:24 csum failure messages Russell Coker
2013-11-05  6:15 ` Hans-Kristian Bakke
2013-11-05 11:30   ` David Sterba
2013-11-05 12:16   ` Russell Coker
2013-11-05 12:37     ` Hans-Kristian Bakke
2013-11-05 14:26     ` Chris Murphy
2013-11-05 14:34       ` Hugo Mills
2013-11-06  0:20         ` John Williams
2013-11-06 12:19           ` Duncan
2013-11-06 16:54           ` David Sterba
2013-11-14 18:37         ` Chris Murphy
2013-11-14 18:44           ` David Sterba
