* XFS: Internal error XFS_WANT_CORRUPTED_RETURN
@ 2013-12-11 17:27 Dave Jones
  2013-12-11 18:52 ` Chris Murphy
  2013-12-11 23:01 ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Jones @ 2013-12-11 17:27 UTC (permalink / raw)
  To: xfs

Powered up my desktop this morning and noticed I couldn't cd into ~/Mail.
dmesg didn't look good.  "XFS: Internal error XFS_WANT_CORRUPTED_RETURN"
http://codemonkey.org.uk/junk/xfs-1.txt

I rebooted into single user mode, and ran xfs_repair on /dev/sda3 (/home).
It fixed up a bunch of stuff, but ended up eating ~/.procmailrc entirely
(no sign of it in lost+found), and a bunch of filenames got garbled
(e.g. 'december' became 'decemcer').  Looks like a couple of kernel trees
ended up in lost+found.

After rebooting back into multi-user mode, I looked in dmesg again to be sure,
and this time sda2 was complaining:

http://codemonkey.org.uk/junk/xfs-2.txt

Same drill: reboot, xfs_repair. Looks like a bunch of man pages ended up in lost+found.

Thoughts? Could sda be dying? (It is a fairly old, crappy SSD.)

	Dave


* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-11 17:27 XFS: Internal error XFS_WANT_CORRUPTED_RETURN Dave Jones
@ 2013-12-11 18:52 ` Chris Murphy
  2013-12-11 18:57   ` Dave Jones
  2013-12-11 23:01 ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2013-12-11 18:52 UTC (permalink / raw)
  To: Dave Jones; +Cc: xfs


On Dec 11, 2013, at 10:27 AM, Dave Jones <davej@redhat.com> wrote:
> 
> Thoughts ? Could sda be dying ? (It is a fairly old crappy ssd)

It may reveal nothing useful, but please report the results from 'smartctl -x /dev/sda'; if the command isn't found, install the smartmontools package first.
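
On Fedora that would be roughly the following (package manager and device name assumed, adjust for your setup):

    sudo yum install smartmontools   # provides the smartctl binary
    sudo smartctl -x /dev/sda        # dump all device/SMART info it can get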


Chris Murphy

* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-11 18:52 ` Chris Murphy
@ 2013-12-11 18:57   ` Dave Jones
  2013-12-12  0:19     ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Jones @ 2013-12-11 18:57 UTC (permalink / raw)
  To: Chris Murphy; +Cc: xfs

On Wed, Dec 11, 2013 at 11:52:51AM -0700, Chris Murphy wrote:
 > 
 > On Dec 11, 2013, at 10:27 AM, Dave Jones <davej@redhat.com> wrote:
 > > 
 > > Thoughts ? Could sda be dying ? (It is a fairly old crappy ssd)
 > 
 > It may reveal nothing useful, but please report the results from 'smartctl -x /dev/sda' and if not found install smartmontools package.


I meant it when I said 'old' and 'crappy'.
It doesn't even support the interesting SMART commands.

	Dave


smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-300.fc20.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     G.SKILL 64GB SSD
Serial Number:    MK08085207D640017
Firmware Version: 02.10104
User Capacity:    64,105,742,336 bytes [64.1 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
Local Time is:    Wed Dec 11 13:56:50 2013 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Disabled
Write cache is:   Disabled
ATA Security is:  Unavailable
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x00) 	Offline data collection not supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x00)	Error logging NOT supported.
					No General Purpose Logging support.

SMART Attributes Data Structure revision number: 1280
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
 12 Power_Cycle_Count       -O--CK   100   100   000    -    5447
  9 Power_On_Hours          -O--CK   100   100   000    -    0
194 Temperature_Celsius     POS---   032   100   000    -    0
229 Unknown_Attribute       -O----   100   000   000    -    260003199309804
232 Available_Reservd_Space -O----   100   048   000    -    9028846498104
233 Media_Wearout_Indicator -O----   100   000   000    -    1122231520092
234 Unknown_Attribute       -O----   100   000   000    -    782120273690
235 Unknown_Attribute       -O----   100   000   000    -    1006826557
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Read SMART Log Directory failed: scsi error aborted command

ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: scsi error aborted command
Read GP Log Directory failed

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log not supported

SMART Extended Self-test Log (GP Log 0x07) not supported

SMART Self-test Log not supported

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP Log 0x04) not supported

ATA_READ_LOG_EXT (addr=0x11:0x00, page=0, n=1) failed: scsi error aborted command
Read SATA Phy Event Counters failed




* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-11 17:27 XFS: Internal error XFS_WANT_CORRUPTED_RETURN Dave Jones
  2013-12-11 18:52 ` Chris Murphy
@ 2013-12-11 23:01 ` Dave Chinner
  2013-12-12 16:14   ` Eric Sandeen
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2013-12-11 23:01 UTC (permalink / raw)
  To: Dave Jones; +Cc: xfs

On Wed, Dec 11, 2013 at 12:27:25PM -0500, Dave Jones wrote:
> Powered up my desktop this morning and noticed I couldn't cd into ~/Mail
> dmesg didn't look good.  "XFS: Internal error XFS_WANT_CORRUPTED_RETURN"
> http://codemonkey.org.uk/junk/xfs-1.txt

They came from xfs_dir3_block_verify() on read IO completion, which
indicates that the corruption was on disk and in the directory
structure. Yeah, definitely a verifier error:

XFS (sda3): metadata I/O error: block 0x2e790 ("xfs_trans_read_buf_map") error 117 numblks 8

Are you running a CRC enabled filesystem? (i.e. mkfs.xfs -m crc=1)
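
If you're not sure, checking is a one-liner against the mounted filesystem (mount point assumed):

    xfs_info /home | grep crc
    # a CRC-enabled (v5) filesystem shows crc=1 in the meta-data line;
    # older filesystems show crc=0, or no crc field at all with older xfsprogs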

Is there any evidence that this verifier has fired in the past on
write? If not, then there's a good chance that a media error is
causing this, because the same verifier runs when the metadata is
written to ensure we are not writing bad stuff to disk.
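
Something along these lines should show whether it has fired before (where the logs live depends on how the box is set up):

    dmesg | grep -iE 'XFS.*(error|corrupt)'                # current boot
    grep -iE 'XFS.*(error|corrupt)' /var/log/messages*     # earlier boots, if syslog keeps them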

> I rebooted into single user mode, and ran xfs_repair on /dev/sda3 (/home).
> It fixed up a bunch of stuff, but ended up eating ~/.procmailrc entirely
> (no sign of it in lost & found), and a bunch of filenames got garbled
> 'december' became 'decemcer' for eg.  Looks like a couple kernel trees ended
> up in lost & found.

Single bit errors in directory names? That really does point towards
media errors, not a filesystem error being the cause.
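
The 'b' -> 'c' swap above is exactly that kind of error; the two bytes are one bit apart, which is easy to confirm from a shell:

    $ printf '%#x %#x\n' "'b" "'c"
    0x62 0x63
    $ echo $((0x62 ^ 0x63))
    1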

> After rebooting back into multi-user mode, I looked in dmesg again to be sure
> and this time sda2 was complaining..
> 
> http://codemonkey.org.uk/junk/xfs-2.txt

Exactly the same: directory blocks failing read verification.

> Same drill, reboot, xfs_repair. Looks like a bunch of man pages ended up in lost & found.
> 
> Thoughts ? Could sda be dying ? (It is a fairly old crappy ssd)

I'd seriously be considering replacing the SSD as the first step.
If you then see failures on a known good drive, we'll need to dig
further.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-11 18:57   ` Dave Jones
@ 2013-12-12  0:19     ` Chris Murphy
  2013-12-13  9:46       ` Stan Hoeppner
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2013-12-12  0:19 UTC (permalink / raw)
  To: Dave Jones; +Cc: xfs


On Dec 11, 2013, at 11:57 AM, Dave Jones <davej@redhat.com> wrote:

> On Wed, Dec 11, 2013 at 11:52:51AM -0700, Chris Murphy wrote:
>> 
>> On Dec 11, 2013, at 10:27 AM, Dave Jones <davej@redhat.com> wrote:
>>> 
>>> Thoughts ? Could sda be dying ? (It is a fairly old crappy ssd)
>> 
>> It may reveal nothing useful, but please report the results from 'smartctl -x /dev/sda' and if not found install smartmontools package.
> 
> 
> I meant it when I said 'old' and 'crappy'.
> It doesn't even support the interesting SMART commands.


Oh well, it was worth a shot. The Available_Reservd_Space and Media_Wearout_Indicator attributes could be useful, but I don't know how trustworthy they are when both report a normalized value of 100, which is normally where these values start, yet carry large raw values that are meaningless without a reference. The Available_Reservd_Space value is currently 100 but its worst value was 48, which is sort of interesting: it dipped at some point, which seems to imply the drive gave up some reserved sectors. I'd expect that once reserve sectors are consumed they stay consumed, so this value should only go down, not bounce back to 100.
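
If you want to keep an eye on just those two attributes while the drive is still in service, something like this should do (attribute names taken from your output above):

    sudo smartctl -A /dev/sda | grep -E 'Available_Reservd_Space|Media_Wearout_Indicator'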

I suspect we've only just begun to see the myriad ways in which SSDs could fail. I ran across this article earlier today:
http://techreport.com/review/25681/the-ssd-endurance-experiment-testing-data-retention-at-300tb

What I thought was eye-opening was a hashed file failing multiple times in a row with *different* hash values, then being allowed to rest unpowered for five days and passing. Eeek. Talk about a great setup for a lot of weird transient problems with that kind of reversal. What I can't tell is whether there were read errors reported to the SATA driver, or whether (different) bad data from a particular page was sent to the driver.

Chris Murphy

* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-11 23:01 ` Dave Chinner
@ 2013-12-12 16:14   ` Eric Sandeen
  2013-12-12 16:20     ` Dave Jones
  2013-12-12 21:27     ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Sandeen @ 2013-12-12 16:14 UTC (permalink / raw)
  To: Dave Chinner, Dave Jones; +Cc: xfs

On 12/11/13, 5:01 PM, Dave Chinner wrote:
> On Wed, Dec 11, 2013 at 12:27:25PM -0500, Dave Jones wrote:
>> Powered up my desktop this morning and noticed I couldn't cd into ~/Mail
>> dmesg didn't look good.  "XFS: Internal error XFS_WANT_CORRUPTED_RETURN"
>> http://codemonkey.org.uk/junk/xfs-1.txt
> 
> They came from xfs_dir3_block_verify() on read IO completion, which
> indicates that the corruption was on disk and in the directory
> structure. Yeah, definitely a verifier error:
> 
> XFS (sda3): metadata I/O error: block 0x2e790 ("xfs_trans_read_buf_map") error 117 numblks 8
> 
> Are you running a CRC enabled filesystem? (i.e. mkfs.xfs -m crc=1)
> 
> Is there any evidence that this verifier has fired in the past on
> write? If not, then it's a good chance that it's a media error
> causing this, because the same verifier runs when the metadata is
> written to ensure we are not writing bad stuff to disk.

Dave C, have you given any thought to how to make the verifier errors more
actionable?  If davej throws up his hands, the rest of the world is obviously
in trouble.  ;)

To the inexperienced this looks like a "crash" thanks to the backtrace.
I do understand that it's necessary for bug reports, but I wonder if we
could preface it with something informative or instructive.

We also don't get a block number or inode number, although you or I can
dig the inode number out of the hexdump, in this case.

We also don't get any details of what the values in the failed check were;
not from the check macro itself or from the hexdump, necessarily, since
it only prints the first handful of bytes.

Any ideas here?

-Eric

>> I rebooted into single user mode, and ran xfs_repair on /dev/sda3 (/home).
>> It fixed up a bunch of stuff, but ended up eating ~/.procmailrc entirely
>> (no sign of it in lost & found), and a bunch of filenames got garbled
>> 'december' became 'decemcer' for eg.  Looks like a couple kernel trees ended
>> up in lost & found.
> 
> Single bit errors in directory names? That really does point towards
> media errors, not a filesystem error being the cause.
> 
>> After rebooting back into multi-user mode, I looked in dmesg again to be sure
>> and this time sda2 was complaining..
>>
>> http://codemonkey.org.uk/junk/xfs-2.txt
> 
> Exactly the same: directory blocks failing read verification.
> 
>> Same drill, reboot, xfs_repair. Looks like a bunch of man pages ended up in lost & found.
>>
>> Thoughts ? Could sda be dying ? (It is a fairly old crappy ssd)
> 
> I'd seriously be considering replacing the SSD as the first step.
> If you then see failures on a known good drive, we'll need to dig
> further.
> 
> Cheers,
> 
> Dave.
> 


* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-12 16:14   ` Eric Sandeen
@ 2013-12-12 16:20     ` Dave Jones
  2013-12-12 21:27     ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Jones @ 2013-12-12 16:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Thu, Dec 12, 2013 at 10:14:39AM -0600, Eric Sandeen wrote:
 > On 12/11/13, 5:01 PM, Dave Chinner wrote:
 > > On Wed, Dec 11, 2013 at 12:27:25PM -0500, Dave Jones wrote:
 > >> Powered up my desktop this morning and noticed I couldn't cd into ~/Mail
 > >> dmesg didn't look good.  "XFS: Internal error XFS_WANT_CORRUPTED_RETURN"
 > >> http://codemonkey.org.uk/junk/xfs-1.txt
 > > 
 > > They came from xfs_dir3_block_verify() on read IO completion, which
 > > indicates that the corruption was on disk and in the directory
 > > structure. Yeah, definitely a verifier error:
 > > 
 > > XFS (sda3): metadata I/O error: block 0x2e790 ("xfs_trans_read_buf_map") error 117 numblks 8
 > > 
 > > Are you running a CRC enabled filesystem? (i.e. mkfs.xfs -m crc=1)
 > > 
 > > Is there any evidence that this verifier has fired in the past on
 > > write? If not, then it's a good chance that it's a media error
 > > causing this, because the same verifier runs when the metadata is
 > > written to ensure we are not writing bad stuff to disk.
 > 
 > Dave C, have you given any thought to how to make the verifier errors more
 > actionable?  If davej throws up his hands, the rest of the world is obviously
 > in trouble.  ;)
 > 
 > To the inexperienced this looks like a "crash" thanks to the backtrace.
 > I do understand that it's necessary for bug reports, but I wonder if we
 > could preface it with something informative or instructive.
 > 
 > We also don't get a block number or inode number, although you or I can
 > dig the inode number out of the hexdump, in this case.
 > 
 > We also don't get any details of what the values in the failed check were;
 > not from the check macro itself or from the hexdump, necessarily, since
 > it only prints the first handful of bytes.

This morning, the same SSD spewed a bunch of other errors when find ran over a kernel tree:

http://paste.fedoraproject.org/61189/38686344

In that case I did get a block number (the irony of the failing block # is not lost on me).

As soon as the new one arrives, I'll try some destructive tests on the failing one.
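
Probably something along the lines of the following, once the data is safely off it (badblocks in write mode destroys everything on the device):

    badblocks -wsv /dev/sda   # write/verify the 0xaa, 0x55, 0xff, 0x00 patterns across the whole device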

I'm just happy it's stayed alive long enough for me to get the data off it.
When my Intel SSD failed earlier this year it was just a brick.

	Dave


* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-12 16:14   ` Eric Sandeen
  2013-12-12 16:20     ` Dave Jones
@ 2013-12-12 21:27     ` Dave Chinner
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2013-12-12 21:27 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Jones, xfs

On Thu, Dec 12, 2013 at 10:14:39AM -0600, Eric Sandeen wrote:
> On 12/11/13, 5:01 PM, Dave Chinner wrote:
> > On Wed, Dec 11, 2013 at 12:27:25PM -0500, Dave Jones wrote:
> >> Powered up my desktop this morning and noticed I couldn't cd into ~/Mail
> >> dmesg didn't look good.  "XFS: Internal error XFS_WANT_CORRUPTED_RETURN"
> >> http://codemonkey.org.uk/junk/xfs-1.txt
> > 
> > They came from xfs_dir3_block_verify() on read IO completion, which
> > indicates that the corruption was on disk and in the directory
> > structure. Yeah, definitely a verifier error:
> > 
> > XFS (sda3): metadata I/O error: block 0x2e790 ("xfs_trans_read_buf_map") error 117 numblks 8
> > 
> > Are you running a CRC enabled filesystem? (i.e. mkfs.xfs -m crc=1)
> > 
> > Is there any evidence that this verifier has fired in the past on
> > write? If not, then it's a good chance that it's a media error
> > causing this, because the same verifier runs when the metadata is
> > written to ensure we are not writing bad stuff to disk.
> 
> Dave C, have you given any thought to how to make the verifier errors more
> actionable?  If davej throws up his hands, the rest of the world is obviously
> in trouble.  ;)

The verifier behaviour is effectively boilerplate code.

> To the inexperienced this looks like a "crash" thanks to the backtrace.
> I do understand that it's necessary for bug reports, but I wonder if we
> could preface it with something informative or instructive.

Yup, it was done like that so it would scare people into reporting
verifier failures, so that we had good visibility of the problems they
were detecting. So, from that perspective, they are doing exactly what
they were intended to do.

In reality, the incidence of verifiers detecting corruption is no
different from the long term historical trends of corruptions being
reported. The only difference is that we are catching them
immediately as they come off disk, rather than later on in the code
when we can't tell if the problem is a code bug or an IO error.
So, again, the verifiers are doing exactly what they were intended
to do.

> We also don't get a block number or inode number, although you or I can
> dig the inode number out of the hexdump, in this case.

That comes from the higher layer error message. We don't get it from
the verifier simply because the boilerplate code doesn't report it.

> We also don't get any details of what the values in the failed check were;
> not from the check macro itself or from the hexdump, necessarily, since
> it only prints the first handful of bytes.

In most cases, the handful (64) of bytes is more than sufficient -
it is big enough to contain the entire self-describing header for
the object that failed, and that is enough to validate whether the
corruption is a bad metadata block or something internal to the
metadata structure itself. I.e. the hexdump has actually been
carefully sized to balance scary noise against usefulness for
debugging.

That said, we need to do some work on the verifiers - they need to
be converted to use WANT_CORRUPTED_RETURN or a similar new
corruption report. That way we know exactly what verifier test
failed from the line of code it dumped from.  A couple of the
verifiers already do this (in the directory code), but the rest need
to be converted across, too.  We can easily add more custom info
to the failure by doing this (e.g. block number, whether it is a
read or write verifier failure, etc); if we do this correctly then
the stack trace that is currently being dumped can go away.

We also need to distinguish between CRC validation errors and object
format validation errors. We need this in userspace for xfs_repair,
and it could replace the custom code in xfs_db that does this, so the
kernel code needs to have it put in place first.

IOWs, there's a bunch of verifier improvements that are in the
works that should help this situation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS: Internal error XFS_WANT_CORRUPTED_RETURN
  2013-12-12  0:19     ` Chris Murphy
@ 2013-12-13  9:46       ` Stan Hoeppner
  0 siblings, 0 replies; 14+ messages in thread
From: Stan Hoeppner @ 2013-12-13  9:46 UTC (permalink / raw)
  To: xfs

On 12/11/2013 6:19 PM, Chris Murphy wrote:
...
> I suspect we've only just begun to see the myriad ways in which SSDs
> could fail. I ran across this article earlier today: 
> http://techreport.com/review/25681/the-ssd-endurance-experiment-testing-data-retention-at-300tb
>
>  What I thought was eye opening was a hashed file failing multiple
> times in a row with *different* hash values, being allowed to rest
> unpowered for five days and then passing. Eeek. Talk about a great
> setup for a lot of weird transient problems with that kind of
> reversal. What I can't tell is if there were read errors report to
> the SATA driver, or if (different) bad data from a particular page
> was sent to the driver.

The drive that exhibited this problem, the Samsung 840, is (one of) the
first on the market to use triple-level-cell (TLC) NAND.  The drive is
marketed at consumers only.  The anomaly occurred after 100 TB of
writes, well beyond what is expected for a consumer drive.  After the
anomaly occurred, the drive ran flawlessly up to 300 TB.

The rest of the drives, including the Samsung 840 Pro, use two-bit-per-cell MLC
NAND, and none of them have shown problems in their testing.  They've
been flawless.  So I disagree with your statement "we've only just begun
to see the myriad ways in which SSDs could fail".

What we have here is what we've always had.  A manufacturer using a
bleeding-edge technology didn't have all the bugs identified and fixed
in the first rev of the product.  This isn't a problem with SSDs in
general, but with one manufacturer, one new drive model, using a brand-new
NAND type.

-- 
Stan


* Re: XFS internal error XFS_WANT_CORRUPTED_RETURN
  2007-08-24  2:02 ` Timothy Shimmin
@ 2007-08-24 19:23   ` Markus Schoder
  0 siblings, 0 replies; 14+ messages in thread
From: Markus Schoder @ 2007-08-24 19:23 UTC (permalink / raw)
  To: Timothy Shimmin; +Cc: xfs

On Friday 24 August 2007, Timothy Shimmin wrote:
> It looks a lot like a reported bug:
>    Suse#198124 and sgi-pv#956334
>
> Where we had corruption in freespace btrees.
>
> Can you also run xfs_check before running xfs_repair.

I actually tried to run xfs_check before xfs_repair but it seg-faulted.

--
Markus


* Re: XFS internal error XFS_WANT_CORRUPTED_RETURN
  2007-08-24  1:43 ` David Chinner
@ 2007-08-24 19:19   ` Markus Schoder
  0 siblings, 0 replies; 14+ messages in thread
From: Markus Schoder @ 2007-08-24 19:19 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs

On Friday 24 August 2007, David Chinner wrote:
> On Thu, Aug 23, 2007 at 09:09:50PM +0100, Markus Schoder wrote:
> > Got a bunch of the below errors in the log. It occurred during a
> > Debian upgrade with aptitude. xfs_repair found a lost inode.
>
> Can you post the output of xfs_repair if you still have it? It
> would be handy to correlate the shutdown to an actual error on
> disk....

Unfortunately I don't have the output anymore. From memory I recall 
there were two messages indicating problems:

1. A message about rebuilding a directory at inode 128.
2. A lost inode that was moved to lost+found.

Other than that there were only the usual progress messages.

> > Kernel is stock 2.6.22.4 plus CFS patch. This is a 64bit kernel on
> > an amd64 processor.
>
> Hmmmm - so not exactly a stock kernel then. Which of the 19 versions
> of the CFS patch was applied? Is it reproducible or only a one-off?
> If it's not a one-off problem, have you seen the problem without the
> CFS patch?

It is version 19.1, and the problem was a one-off; I've never seen it before
or since.

-- 
Markus


* Re: XFS internal error XFS_WANT_CORRUPTED_RETURN
  2007-08-23 20:09 XFS internal " Markus Schoder
  2007-08-24  1:43 ` David Chinner
@ 2007-08-24  2:02 ` Timothy Shimmin
  2007-08-24 19:23   ` Markus Schoder
  1 sibling, 1 reply; 14+ messages in thread
From: Timothy Shimmin @ 2007-08-24  2:02 UTC (permalink / raw)
  To: Markus Schoder; +Cc: xfs

Markus Schoder wrote:
> Got a bunch of the below errors in the log. It occurred during a Debian
> upgrade with aptitude. xfs_repair found a lost inode.
> 
> Kernel is stock 2.6.22.4 plus CFS patch. This is a 64bit kernel on an
> amd64 processor.
> 
> I am glad to provide more information if required.
> 
> Would be nice to keep me CC'ed if this is not against policy.
> 
> --
> Markus
> 
> Aug 23 00:15:09 gondolin kernel: [17152.830267] XFS internal error XFS_WANT_CORRUPTED_RETURN at line 281 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffff802eb061
> Aug 23 00:15:09 gondolin kernel: [17152.830274]
> Aug 23 00:15:09 gondolin kernel: [17152.830275] Call Trace:
> Aug 23 00:15:09 gondolin kernel: [17152.830295]  [<ffffffff802e9bf7>] xfs_alloc_fixup_trees+0x307/0x3a0
> Aug 23 00:15:09 gondolin kernel: [17152.830301]  [<ffffffff80303ddd>] xfs_btree_setbuf+0x2d/0xb0
> Aug 23 00:15:09 gondolin kernel: [17152.830306]  [<ffffffff802eb061>] xfs_alloc_ag_vextent_near+0x581/0x9b0
> Aug 23 00:15:09 gondolin kernel: [17152.830313]  [<ffffffff802ebc95>] xfs_alloc_ag_vextent+0xd5/0x130
> Aug 23 00:15:09 gondolin kernel: [17152.830316]  [<ffffffff802ec4f5>] xfs_alloc_vextent+0x275/0x460
> Aug 23 00:15:09 gondolin kernel: [17152.830322]  [<ffffffff802f9630>] xfs_bmap_btalloc+0x410/0x7c0
> Aug 23 00:15:09 gondolin kernel: [17152.830327]  [<ffffffff8032d6fa>] xfs_mod_incore_sb_batch+0xea/0x130
> Aug 23 00:15:09 gondolin kernel: [17152.830335]  [<ffffffff802f6d1e>] xfs_bmap_isaeof+0x7e/0xd0
> Aug 23 00:15:09 gondolin kernel: [17152.830344]  [<ffffffff802fd8c9>] xfs_bmapi+0xc89/0x1340
> Aug 23 00:15:09 gondolin kernel: [17152.830359]  [<ffffffff80330298>] xfs_trans_reserve+0xa8/0x210
> Aug 23 00:15:09 gondolin kernel: [17152.830364]  [<ffffffff80320e58>] xfs_iomap_write_direct+0x2e8/0x500
> Aug 23 00:15:09 gondolin kernel: [17152.830374]  [<ffffffff8032028c>] xfs_iomap+0x31c/0x390
> Aug 23 00:15:09 gondolin kernel: [17152.830383]  [<ffffffff8033dafa>] xfs_map_blocks+0x3a/0x80
> Aug 23 00:15:09 gondolin kernel: [17152.830387]  [<ffffffff8033ec3e>] xfs_page_state_convert+0x2be/0x630
> Aug 23 00:15:09 gondolin kernel: [17152.830394]  [<ffffffff802a23cc>] alloc_buffer_head+0x4c/0x80
> Aug 23 00:15:09 gondolin kernel: [17152.830398]  [<ffffffff802a2be0>] alloc_page_buffers+0x60/0xe0
> Aug 23 00:15:09 gondolin kernel: [17152.830403]  [<ffffffff8033f10f>] xfs_vm_writepage+0x6f/0x120
> Aug 23 00:15:09 gondolin kernel: [17152.830408]  [<ffffffff8025dfba>] __writepage+0xa/0x30
> Aug 23 00:15:09 gondolin kernel: [17152.830411]  [<ffffffff8025e5fe>] write_cache_pages+0x23e/0x330
> Aug 23 00:15:09 gondolin kernel: [17152.830415]  [<ffffffff8025dfb0>] __writepage+0x0/0x30
> Aug 23 00:15:09 gondolin kernel: [17152.830424]  [<ffffffff8025e740>] do_writepages+0x20/0x40
> Aug 23 00:15:09 gondolin kernel: [17152.830427]  [<ffffffff80259715>] __filemap_fdatawrite_range+0x75/0xb0
> Aug 23 00:15:09 gondolin kernel: [17152.830434]  [<ffffffff802a1975>] do_fsync+0x45/0xe0
> Aug 23 00:15:09 gondolin kernel: [17152.830438]  [<ffffffff8026e2b6>] sys_msync+0x166/0x1d0
> Aug 23 00:15:09 gondolin kernel: [17152.830444]  [<ffffffff8021e302>] ia32_sysret+0x0/0xa
> 

It looks a lot like a reported bug:
   Suse#198124 and sgi-pv#956334

Where we had corruption in freespace btrees.

Can you also run xfs_check before running xfs_repair?
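
i.e. with the filesystem unmounted, something like (device name is a placeholder):

    xfs_check /dev/sdXN
    xfs_repair -n /dev/sdXN   # no-modify mode: report what it would fix
    xfs_repair /dev/sdXN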

--Tim


* Re: XFS internal error XFS_WANT_CORRUPTED_RETURN
  2007-08-23 20:09 XFS internal " Markus Schoder
@ 2007-08-24  1:43 ` David Chinner
  2007-08-24 19:19   ` Markus Schoder
  2007-08-24  2:02 ` Timothy Shimmin
  1 sibling, 1 reply; 14+ messages in thread
From: David Chinner @ 2007-08-24  1:43 UTC (permalink / raw)
  To: Markus Schoder; +Cc: xfs

On Thu, Aug 23, 2007 at 09:09:50PM +0100, Markus Schoder wrote:
> Got a bunch of the below errors in the log. It occurred during a Debian
> upgrade with aptitude. xfs_repair found a lost inode.

Can you post the output of xfs_repair if you still have it? It
would be handy to correlate the shutdown to an actual error on
disk....

> Kernel is stock 2.6.22.4 plus CFS patch. This is a 64bit kernel on an
> amd64 processor.

Hmmmm - so not exactly a stock kernel then. Which of the 19 versions
of the CFS patch was applied? Is it reproducible or only a one-off?
If it's not a one-off problem, have you seen the problem without the
CFS patch?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* XFS internal error XFS_WANT_CORRUPTED_RETURN
@ 2007-08-23 20:09 Markus Schoder
  2007-08-24  1:43 ` David Chinner
  2007-08-24  2:02 ` Timothy Shimmin
  0 siblings, 2 replies; 14+ messages in thread
From: Markus Schoder @ 2007-08-23 20:09 UTC (permalink / raw)
  To: xfs

Got a bunch of the below errors in the log. It occurred during a Debian
upgrade with aptitude. xfs_repair found a lost inode.

Kernel is stock 2.6.22.4 plus CFS patch. This is a 64bit kernel on an
amd64 processor.

I am glad to provide more information if required.

Would be nice to keep me CC'ed if this is not against policy.

--
Markus

Aug 23 00:15:09 gondolin kernel: [17152.830267] XFS internal error XFS_WANT_CORRUPTED_RETURN at line 281 of file fs/xfs/xfs_alloc.c.  Caller 0xffffffff802eb061
Aug 23 00:15:09 gondolin kernel: [17152.830274]
Aug 23 00:15:09 gondolin kernel: [17152.830275] Call Trace:
Aug 23 00:15:09 gondolin kernel: [17152.830295]  [<ffffffff802e9bf7>] xfs_alloc_fixup_trees+0x307/0x3a0
Aug 23 00:15:09 gondolin kernel: [17152.830301]  [<ffffffff80303ddd>] xfs_btree_setbuf+0x2d/0xb0
Aug 23 00:15:09 gondolin kernel: [17152.830306]  [<ffffffff802eb061>] xfs_alloc_ag_vextent_near+0x581/0x9b0
Aug 23 00:15:09 gondolin kernel: [17152.830313]  [<ffffffff802ebc95>] xfs_alloc_ag_vextent+0xd5/0x130
Aug 23 00:15:09 gondolin kernel: [17152.830316]  [<ffffffff802ec4f5>] xfs_alloc_vextent+0x275/0x460
Aug 23 00:15:09 gondolin kernel: [17152.830322]  [<ffffffff802f9630>] xfs_bmap_btalloc+0x410/0x7c0
Aug 23 00:15:09 gondolin kernel: [17152.830327]  [<ffffffff8032d6fa>] xfs_mod_incore_sb_batch+0xea/0x130
Aug 23 00:15:09 gondolin kernel: [17152.830335]  [<ffffffff802f6d1e>] xfs_bmap_isaeof+0x7e/0xd0
Aug 23 00:15:09 gondolin kernel: [17152.830344]  [<ffffffff802fd8c9>] xfs_bmapi+0xc89/0x1340
Aug 23 00:15:09 gondolin kernel: [17152.830359]  [<ffffffff80330298>] xfs_trans_reserve+0xa8/0x210
Aug 23 00:15:09 gondolin kernel: [17152.830364]  [<ffffffff80320e58>] xfs_iomap_write_direct+0x2e8/0x500
Aug 23 00:15:09 gondolin kernel: [17152.830374]  [<ffffffff8032028c>] xfs_iomap+0x31c/0x390
Aug 23 00:15:09 gondolin kernel: [17152.830383]  [<ffffffff8033dafa>] xfs_map_blocks+0x3a/0x80
Aug 23 00:15:09 gondolin kernel: [17152.830387]  [<ffffffff8033ec3e>] xfs_page_state_convert+0x2be/0x630
Aug 23 00:15:09 gondolin kernel: [17152.830394]  [<ffffffff802a23cc>] alloc_buffer_head+0x4c/0x80
Aug 23 00:15:09 gondolin kernel: [17152.830398]  [<ffffffff802a2be0>] alloc_page_buffers+0x60/0xe0
Aug 23 00:15:09 gondolin kernel: [17152.830403]  [<ffffffff8033f10f>] xfs_vm_writepage+0x6f/0x120
Aug 23 00:15:09 gondolin kernel: [17152.830408]  [<ffffffff8025dfba>] __writepage+0xa/0x30
Aug 23 00:15:09 gondolin kernel: [17152.830411]  [<ffffffff8025e5fe>] write_cache_pages+0x23e/0x330
Aug 23 00:15:09 gondolin kernel: [17152.830415]  [<ffffffff8025dfb0>] __writepage+0x0/0x30
Aug 23 00:15:09 gondolin kernel: [17152.830424]  [<ffffffff8025e740>] do_writepages+0x20/0x40
Aug 23 00:15:09 gondolin kernel: [17152.830427]  [<ffffffff80259715>] __filemap_fdatawrite_range+0x75/0xb0
Aug 23 00:15:09 gondolin kernel: [17152.830434]  [<ffffffff802a1975>] do_fsync+0x45/0xe0
Aug 23 00:15:09 gondolin kernel: [17152.830438]  [<ffffffff8026e2b6>] sys_msync+0x166/0x1d0
Aug 23 00:15:09 gondolin kernel: [17152.830444]  [<ffffffff8021e302>] ia32_sysret+0x0/0xa



Thread overview: 14+ messages
2013-12-11 17:27 XFS: Internal error XFS_WANT_CORRUPTED_RETURN Dave Jones
2013-12-11 18:52 ` Chris Murphy
2013-12-11 18:57   ` Dave Jones
2013-12-12  0:19     ` Chris Murphy
2013-12-13  9:46       ` Stan Hoeppner
2013-12-11 23:01 ` Dave Chinner
2013-12-12 16:14   ` Eric Sandeen
2013-12-12 16:20     ` Dave Jones
2013-12-12 21:27     ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2007-08-23 20:09 XFS internal " Markus Schoder
2007-08-24  1:43 ` David Chinner
2007-08-24 19:19   ` Markus Schoder
2007-08-24  2:02 ` Timothy Shimmin
2007-08-24 19:23   ` Markus Schoder
