* XFS umount issue
@ 2011-05-23 21:39 Nuno Subtil
  2011-05-24  0:02 ` Dave Chinner
  2011-05-24 13:33 ` Paul Anderson
  0 siblings, 2 replies; 11+ messages in thread
From: Nuno Subtil @ 2011-05-23 21:39 UTC (permalink / raw)
  To: xfs-oss

I have an MD RAID-1 array with two SATA drives, formatted as XFS.
Occasionally, doing an umount followed by a mount causes the mount to
fail with errors that strongly suggest some sort of filesystem
corruption (usually 'bad clientid' with a seemingly arbitrary ID, but
occasionally invalid log errors as well).

The one thing in common among all these failures is that they require
xfs_repair -L to recover from. This has already caused a few
lost+found entries (and data loss on recently written files). I
originally noticed this bug because of mount failures at boot, but
I've managed to repro it reliably with this script:

while true; do
	mount /store
	(cd /store && tar xf test.tar)
	umount /store
	mount /store
	rm -rf /store/test-data
	umount /store
done

test.tar contains around 100 files with various sizes inside
test-data/, ranging from a few hundred KB to around 5-6MB. The failure
triggers within minutes of starting this loop.
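
For reference, something along these lines should generate a roughly
equivalent data set (the file names, the 64k block size, and the exact
size distribution are just guesses to match the description above, and
it assumes bash for $RANDOM):

mkdir test-data
for i in $(seq 1 100); do
	# each file ends up between roughly 256KB and 5.5MB
	dd if=/dev/urandom of=test-data/file-$i bs=64k count=$((RANDOM % 83 + 4)) 2>/dev/null
done
tar cf test.tar test-data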

I'm not entirely sure that this is XFS-specific, but the same script
does run successfully overnight on the same MD array with ext3 on it.
This is on an ARM system running kernel 2.6.39.

Has something like this been seen before?

Thanks,
Nuno


* Re: XFS umount issue
  2011-05-23 21:39 XFS umount issue Nuno Subtil
@ 2011-05-24  0:02 ` Dave Chinner
  2011-05-24  6:29   ` Nuno Subtil
  2011-05-24 13:33 ` Paul Anderson
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2011-05-24  0:02 UTC (permalink / raw)
  To: Nuno Subtil; +Cc: xfs-oss

On Mon, May 23, 2011 at 02:39:39PM -0700, Nuno Subtil wrote:
> I have an MD RAID-1 array with two SATA drives, formatted as XFS.

Hi Nuno. It is probably best to say this at the start, too:

> This is on an ARM system running kernel 2.6.39.

So we know what platform this is occurring on.

> Occasionally, doing an umount followed by a mount causes the mount to
> fail with errors that strongly suggest some sort of filesystem
> corruption (usually 'bad clientid' with a seemingly arbitrary ID, but
> occasionally invalid log errors as well).

So reading back the journal is getting bad data?

> 
> The one thing in common among all these failures is that they require
> xfs_repair -L to recover from. This has already caused a few
> lost+found entries (and data loss on recently written files). I
> originally noticed this bug because of mount failures at boot, but
> I've managed to repro it reliably with this script:

Yup, that's normal with recovery errors.

> while true; do
> 	mount /store
> 	(cd /store && tar xf test.tar)
> 	umount /store
> 	mount /store
> 	rm -rf /store/test-data
> 	umount /store
> done

Ok, so there's nothing here that actually says it's an unmount
error. More likely it is a vmap problem in log recovery resulting in
aliasing or some other stale data appearing in the buffer pages.

Can you add a 'xfs_logprint -t <device>' after the umount? You
should always see something like this telling you the log is clean:

$ xfs_logprint -t /dev/vdb
xfs_logprint:
    data device: 0xfd10
    log device: 0xfd10 daddr: 11534368 length: 20480

    log tail: 51 head: 51 state: <CLEAN>

If the log is not clean on an unmount, then you may have an unmount
problem. If it is clean when the recovery error occurs, then it's
almost certainly a problem with your platform not implementing vmap
cache flushing correctly, not an XFS problem.
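
Something like this in your loop would do it (just a sketch; replace
/dev/md5 with whatever device actually backs /store):

while true; do
	mount /store
	(cd /store && tar xf test.tar)
	umount /store
	xfs_logprint -t /dev/md5 | grep state:
	mount /store
	rm -rf /store/test-data
	umount /store
	xfs_logprint -t /dev/md5 | grep state:
done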

> I'm not entirely sure that this is XFS-specific, but the same script
> does run successfully overnight on the same MD array with ext3 on it.

ext3 doesn't use vmapped buffers at all, so it won't show such a
problem.

> Has something like this been seen before?

Every so often on ARM, MIPS, etc platforms that have virtually
indexed caches.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS umount issue
  2011-05-24  0:02 ` Dave Chinner
@ 2011-05-24  6:29   ` Nuno Subtil
  2011-05-24  7:54     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Nuno Subtil @ 2011-05-24  6:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs-oss

Thanks for chiming in. Replies inline below:

On Mon, May 23, 2011 at 17:02, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, May 23, 2011 at 02:39:39PM -0700, Nuno Subtil wrote:
>> I have an MD RAID-1 array with two SATA drives, formatted as XFS.
>
> Hi Nuno. It is probably best to say this at the start, too:
>
>> This is on an ARM system running kernel 2.6.39.
>
> So we know what platform this is occurring on.

Will keep that in mind. Thanks.

>
>> Occasionally, doing an umount followed by a mount causes the mount to
>> fail with errors that strongly suggest some sort of filesystem
>> corruption (usually 'bad clientid' with a seemingly arbitrary ID, but
>> occasionally invalid log errors as well).
>
> So reading back the journal is getting bad data?

I'm not sure. XFS claims it found a bad clientid. I'm not versed
enough in filesystems to be able to tell for myself :)

>>
>> The one thing in common among all these failures is that they require
>> xfs_repair -L to recover from. This has already caused a few
>> lost+found entries (and data loss on recently written files). I
>> originally noticed this bug because of mount failures at boot, but
>> I've managed to repro it reliably with this script:
>
> Yup, that's normal with recovery errors.
>
>> while true; do
>>       mount /store
>>       (cd /store && tar xf test.tar)
>>       umount /store
>>       mount /store
>>       rm -rf /store/test-data
>>       umount /store
>> done
>
> Ok, so there's nothing here that actually says it's an unmount
> error. More likely it is a vmap problem in log recovery resulting in
> aliasing or some other stale data appearing in the buffer pages.
>
> Can you add a 'xfs_logprint -t <device>' after the umount? You
> should always see something like this telling you the log is clean:

Well, I just ran into this again even without using the script:

root@howl:/# umount /dev/md5
root@howl:/# xfs_logprint -t /dev/md5
xfs_logprint:
    data device: 0x905
    log device: 0x905 daddr: 488382880 length: 476936

    log tail: 731 head: 859 state: <DIRTY>


LOG REC AT LSN cycle 1 block 731 (0x1, 0x2db)

LOG REC AT LSN cycle 1 block 795 (0x1, 0x31b)

I see nothing in dmesg at umount time. Attempting to mount the device
at this point, I got:

[  764.516319] XFS (md5): Mounting Filesystem
[  764.601082] XFS (md5): Starting recovery (logdev: internal)
[  764.626294] XFS (md5): xlog_recover_process_data: bad clientid 0x0
[  764.632559] XFS (md5): log mount/recovery failed: error 5
[  764.638151] XFS (md5): log mount failed

Based on your description, this would be an unmount problem rather
than a vmap problem?

I've tried adding a sync before each umount, as well as testing on a
plain old disk partition (i.e., without going through MD), but the
problem persists either way.

Thanks,
Nuno

>
> $ xfs_logprint -t /dev/vdb
> xfs_logprint:
>    data device: 0xfd10
>    log device: 0xfd10 daddr: 11534368 length: 20480
>
>    log tail: 51 head: 51 state: <CLEAN>
>
> If the log is not clean on an unmount, then you may have an unmount
> problem. If it is clean when the recovery error occurs, then it's
> almost certainly a problem with your platform not implementing vmap
> cache flushing correctly, not an XFS problem.
>
>> I'm not entirely sure that this is XFS-specific, but the same script
>> does run successfully overnight on the same MD array with ext3 on it.
>
> ext3 doesn't use vmapped buffers at all, so it won't show such a
> problem.
>
>> Has something like this been seen before?
>
> Every so often on ARM, MIPS, etc platforms that have virtually
> indexed caches.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


* Re: XFS umount issue
  2011-05-24  6:29   ` Nuno Subtil
@ 2011-05-24  7:54     ` Dave Chinner
  2011-05-24 10:18       ` Nuno Subtil
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2011-05-24  7:54 UTC (permalink / raw)
  To: Nuno Subtil; +Cc: xfs-oss

On Mon, May 23, 2011 at 11:29:19PM -0700, Nuno Subtil wrote:
> Thanks for chiming in. Replies inline below:
> 
> On Mon, May 23, 2011 at 17:02, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, May 23, 2011 at 02:39:39PM -0700, Nuno Subtil wrote:
> >> I have an MD RAID-1 array with two SATA drives, formatted as XFS.
> >
> > Hi Nuno. It is probably best to say this at the start, too:
> >
> >> This is on an ARM system running kernel 2.6.39.
> >
> > So we know what platform this is occurring on.
> 
> Will keep that in mind. Thanks.
> 
> >
> >> Occasionally, doing an umount followed by a mount causes the mount to
> >> fail with errors that strongly suggest some sort of filesystem
> >> corruption (usually 'bad clientid' with a seemingly arbitrary ID, but
> >> occasionally invalid log errors as well).
> >
> > So reading back the journal is getting bad data?
> 
> I'm not sure. XFS claims it found a bad clientid. I'm not versed
> enough in filesystems to be able to tell for myself :)
> 
> >>
> >> The one thing in common among all these failures is that they require
> >> xfs_repair -L to recover from. This has already caused a few
> >> lost+found entries (and data loss on recently written files). I
> >> originally noticed this bug because of mount failures at boot, but
> >> I've managed to repro it reliably with this script:
> >
> > Yup, that's normal with recovery errors.
> >
> >> while true; do
> >>       mount /store
> >>       (cd /store && tar xf test.tar)
> >>       umount /store
> >>       mount /store
> >>       rm -rf /store/test-data
> >>       umount /store
> >> done
> >
> > Ok, so there's nothing here that actually says it's an unmount
> > error. More likely it is a vmap problem in log recovery resulting in
> > aliasing or some other stale data appearing in the buffer pages.
> >
> > Can you add a 'xfs_logprint -t <device>' after the umount? You
> > should always see something like this telling you the log is clean:
> 
> Well, I just ran into this again even without using the script:
> 
> root@howl:/# umount /dev/md5
> root@howl:/# xfs_logprint -t /dev/md5
> xfs_logprint:
>     data device: 0x905
>     log device: 0x905 daddr: 488382880 length: 476936
> 
>     log tail: 731 head: 859 state: <DIRTY>
> 
> 
> LOG REC AT LSN cycle 1 block 731 (0x1, 0x2db)
> 
> LOG REC AT LSN cycle 1 block 795 (0x1, 0x31b)

Was there any other output? If there were valid transactions between
the head and tail of the log xfs_logprint should have decoded them.

> I see nothing in dmesg at umount time. Attempting to mount the device
> at this point, I got:
> 
> [  764.516319] XFS (md5): Mounting Filesystem
> [  764.601082] XFS (md5): Starting recovery (logdev: internal)
> [  764.626294] XFS (md5): xlog_recover_process_data: bad clientid 0x0

Yup, that's got bad information in a transaction header.

> [  764.632559] XFS (md5): log mount/recovery failed: error 5
> [  764.638151] XFS (md5): log mount failed
> 
> Based on your description, this would be an unmount problem rather
> than a vmap problem?

Not clear yet. I forgot to mention that you need to do

# echo 3 > /proc/sys/vm/drop_caches

before you run xfs_logprint, otherwise it will see stale cached
pages and give erroneous results.

You might want to find out if your platform needs to (and does)
implement these functions:

flush_kernel_dcache_page()
flush_kernel_vmap_range()
invalidate_kernel_vmap_range()

as these are what XFS relies on platforms to implement correctly to
avoid cache aliasing issues on CPUs with virtually indexed caches.
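
A quick way to check is to grep the arch cache headers in your kernel
tree, e.g. (assuming the usual layout where the ARM versions live in
arch/arm/include/asm/cacheflush.h):

# grep -nE "flush_kernel_dcache_page|kernel_vmap_range" \
	arch/arm/include/asm/cacheflush.h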

> I've tried adding a sync before each umount, as well as testing on a
> plain old disk partition (i.e., without going through MD), but the
> problem persists either way.

The fact that sync before unmount doesn't help implies it is not an
unmount problem, and ruling out MD is also a good thing to know.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS umount issue
  2011-05-24  7:54     ` Dave Chinner
@ 2011-05-24 10:18       ` Nuno Subtil
  2011-05-24 23:39         ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Nuno Subtil @ 2011-05-24 10:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs-oss

On Tue, May 24, 2011 at 00:54, Dave Chinner <david@fromorbit.com> wrote:

...

>> > Ok, so there's nothing here that actually says it's an unmount
>> > error. More likely it is a vmap problem in log recovery resulting in
>> > aliasing or some other stale data appearing in the buffer pages.
>> >
>> > Can you add a 'xfs_logprint -t <device>' after the umount? You
>> > should always see something like this telling you the log is clean:
>>
>> Well, I just ran into this again even without using the script:
>>
>> root@howl:/# umount /dev/md5
>> root@howl:/# xfs_logprint -t /dev/md5
>> xfs_logprint:
>>     data device: 0x905
>>     log device: 0x905 daddr: 488382880 length: 476936
>>
>>     log tail: 731 head: 859 state: <DIRTY>
>>
>>
>> LOG REC AT LSN cycle 1 block 731 (0x1, 0x2db)
>>
>> LOG REC AT LSN cycle 1 block 795 (0x1, 0x31b)
>
> Was there any other output? If there were valid transactions between
> the head and tail of the log xfs_logprint should have decoded them.

There was no more output here.

>
>> I see nothing in dmesg at umount time. Attempting to mount the device
>> at this point, I got:
>>
>> [  764.516319] XFS (md5): Mounting Filesystem
>> [  764.601082] XFS (md5): Starting recovery (logdev: internal)
>> [  764.626294] XFS (md5): xlog_recover_process_data: bad clientid 0x0
>
> Yup, that's got bad information in a transaction header.
>
>> [  764.632559] XFS (md5): log mount/recovery failed: error 5
>> [  764.638151] XFS (md5): log mount failed
>>
>> Based on your description, this would be an unmount problem rather
>> than a vmap problem?
>
> Not clear yet. I forgot to mention that you need to do
>
> # echo 3 > /proc/sys/vm/drop_caches
>
> before you run xfs_logprint, otherwise it will see stale cached
> pages and give erroneous results.

I added that before each xfs_logprint and ran the script again. Still
the same results:

...
+ mount /store
+ cd /store
+ tar xf test.tar
+ sync
+ umount /store
+ echo 3
+ xfs_logprint -t /dev/sda1
xfs_logprint:
    data device: 0x801
    log device: 0x801 daddr: 488384032 length: 476936

    log tail: 2048 head: 2176 state: <DIRTY>


LOG REC AT LSN cycle 1 block 2048 (0x1, 0x800)

LOG REC AT LSN cycle 1 block 2112 (0x1, 0x840)
+ mount /store
mount: /dev/sda1: can't read superblock

Same messages in dmesg at this point.

> You might want to find out if your platform needs to (and does)
> implement these functions:
>
> flush_kernel_dcache_page()
> flush_kernel_vmap_range()
> invalidate_kernel_vmap_range()
>
> as these are what XFS relies on platforms to implement correctly to
> avoid cache aliasing issues on CPUs with virtually indexed caches.

Is this what /proc/sys/vm/drop_caches relies on as well?

flush_kernel_dcache_page is empty, the others are not but are
conditionalized on the type of cache that is present. I wonder if that
is somehow not being detected properly. Wouldn't that cause other
areas of the system to misbehave as well?
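
I guess one way to check is to see how the kernel classified the cache
at boot; if I remember right the ARM setup code prints a line about the
data/instruction cache type, so something like this should show it
(exact wording is a guess on my part):

# dmesg | grep -iE 'vivt|vipt|cache'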

Nuno

>
>> I've tried adding a sync before each umount, as well as testing on a
>> plain old disk partition (i.e., without going through MD), but the
>> problem persists either way.
>
> The fact that sync before unmount doesn't help implies it is not an
> unmount problem, and ruling out MD is also a good thing to know.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


* Re: XFS umount issue
  2011-05-23 21:39 XFS umount issue Nuno Subtil
  2011-05-24  0:02 ` Dave Chinner
@ 2011-05-24 13:33 ` Paul Anderson
  2011-05-24 19:10   ` Nuno Subtil
  1 sibling, 1 reply; 11+ messages in thread
From: Paul Anderson @ 2011-05-24 13:33 UTC (permalink / raw)
  To: Nuno Subtil; +Cc: xfs-oss

Hi Nuno - can you elaborate on the ARM hardware?  I noticed that my
XFS on ARM was mildly unstable, but felt it wasn't the XFS code but
rather the ARM port of Linux.  My test case is a Seagate Dockstar
hacked to run Linux.

I'll see if I can update to the latest kernel and test this use case
as well - it would be interesting to see how well it works (I'd like
to run my Dockstar as a mythtv server - stability was good enough for
proof of concept, but not longer term use).

Thanks,

Paul

On Mon, May 23, 2011 at 5:39 PM, Nuno Subtil <subtil@gmail.com> wrote:
> I have an MD RAID-1 array with two SATA drives, formatted as XFS.
> Occasionally, doing an umount followed by a mount causes the mount to
> fail with errors that strongly suggest some sort of filesystem
> corruption (usually 'bad clientid' with a seemingly arbitrary ID, but
> occasionally invalid log errors as well).
>
> The one thing in common among all these failures is that they require
> xfs_repair -L to recover from. This has already caused a few
> lost+found entries (and data loss on recently written files). I
> originally noticed this bug because of mount failures at boot, but
> I've managed to repro it reliably with this script:
>
> while true; do
>        mount /store
>        (cd /store && tar xf test.tar)
>        umount /store
>        mount /store
>        rm -rf /store/test-data
>        umount /store
> done
>
> test.tar contains around 100 files with various sizes inside
> test-data/, ranging from a few hundred KB to around 5-6MB. The failure
> triggers within minutes of starting this loop.
>
> I'm not entirely sure that this is XFS-specific, but the same script
> does run successfully overnight on the same MD array with ext3 on it.
> This is on an ARM system running kernel 2.6.39.
>
> Has something like this been seen before?
>
> Thanks,
> Nuno
>

* Re: XFS umount issue
  2011-05-24 13:33 ` Paul Anderson
@ 2011-05-24 19:10   ` Nuno Subtil
  2011-05-25  0:29     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Nuno Subtil @ 2011-05-24 19:10 UTC (permalink / raw)
  To: Paul Anderson; +Cc: xfs-oss

On Tue, May 24, 2011 at 06:33, Paul Anderson <pha@umich.edu> wrote:
> Hi Nuno - can you elaborate on the ARM hardware?  I noticed that my
> XFS on ARM was mildly unstable, but felt it wasn't the XFS code but
> rather the ARM port of Linux.  My test case is a Seagate Dockstar
> hacked to run Linux.

Mine is a Netgear Stora. The interesting bit is that the stock
firmware runs kernel 2.6.22.18 and uses XFS as well, but I don't know
how stable it was to begin with.

>
> I'll see if I can update to the latest kernel and test this use case
> as well - it would be interesting to see how well it works (I'd like
> to run my Dockstar as a mythtv server - stability was good enough for
> proof of concept, but not longer term use).
>
> Thanks,
>
> Paul
>
> On Mon, May 23, 2011 at 5:39 PM, Nuno Subtil <subtil@gmail.com> wrote:
>> I have an MD RAID-1 array with two SATA drives, formatted as XFS.
>> Occasionally, doing an umount followed by a mount causes the mount to
>> fail with errors that strongly suggest some sort of filesystem
>> corruption (usually 'bad clientid' with a seemingly arbitrary ID, but
>> occasionally invalid log errors as well).
>>
>> The one thing in common among all these failures is that they require
>> xfs_repair -L to recover from. This has already caused a few
>> lost+found entries (and data loss on recently written files). I
>> originally noticed this bug because of mount failures at boot, but
>> I've managed to repro it reliably with this script:
>>
>> while true; do
>>        mount /store
>>        (cd /store && tar xf test.tar)
>>        umount /store
>>        mount /store
>>        rm -rf /store/test-data
>>        umount /store
>> done
>>
>> test.tar contains around 100 files with various sizes inside
>> test-data/, ranging from a few hundred KB to around 5-6MB. The failure
>> triggers within minutes of starting this loop.
>>
>> I'm not entirely sure that this is XFS-specific, but the same script
>> does run successfully overnight on the same MD array with ext3 on it.
>> This is on an ARM system running kernel 2.6.39.
>>
>> Has something like this been seen before?
>>
>> Thanks,
>> Nuno
>>

* Re: XFS umount issue
  2011-05-24 10:18       ` Nuno Subtil
@ 2011-05-24 23:39         ` Dave Chinner
  2011-05-25  8:14           ` Nuno Subtil
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2011-05-24 23:39 UTC (permalink / raw)
  To: Nuno Subtil; +Cc: xfs-oss

On Tue, May 24, 2011 at 03:18:11AM -0700, Nuno Subtil wrote:
> On Tue, May 24, 2011 at 00:54, Dave Chinner <david@fromorbit.com> wrote:
> 
> ...
> 
> >> > Ok, so there's nothing here that actually says it's an unmount
> >> > error. More likely it is a vmap problem in log recovery resulting in
> >> > aliasing or some other stale data appearing in the buffer pages.
> >> >
> >> > Can you add a 'xfs_logprint -t <device>' after the umount? You
> >> > should always see something like this telling you the log is clean:
> >>
> >> Well, I just ran into this again even without using the script:
> >>
> >> root@howl:/# umount /dev/md5
> >> root@howl:/# xfs_logprint -t /dev/md5
> >> xfs_logprint:
> >>     data device: 0x905
> >>     log device: 0x905 daddr: 488382880 length: 476936
> >>
> >>     log tail: 731 head: 859 state: <DIRTY>
> >>
> >>
> >> LOG REC AT LSN cycle 1 block 731 (0x1, 0x2db)
> >>
> >> LOG REC AT LSN cycle 1 block 795 (0x1, 0x31b)
> >
> > Was there any other output? If there were valid transactions between
> > the head and tail of the log xfs_logprint should have decoded them.
> 
> There was no more output here.

That doesn't seem quite right. Does it always look like this, even
if you do a sync before unmount?

> >> I see nothing in dmesg at umount time. Attempting to mount the device
> >> at this point, I got:
> >>
> >> [  764.516319] XFS (md5): Mounting Filesystem
> >> [  764.601082] XFS (md5): Starting recovery (logdev: internal)
> >> [  764.626294] XFS (md5): xlog_recover_process_data: bad clientid 0x0
> >
> > Yup, that's got bad information in a transaction header.
> >
> >> [  764.632559] XFS (md5): log mount/recovery failed: error 5
> >> [  764.638151] XFS (md5): log mount failed
> >>
> >> Based on your description, this would be an unmount problem rather
> >> than a vmap problem?
> >
> > Not clear yet. I forgot to mention that you need to do
> >
> > # echo 3 > /proc/sys/vm/drop_caches
> >
> > before you run xfs_logprint, otherwise it will see stale cached
> > pages and give erroneous results.
> 
> I added that before each xfs_logprint and ran the script again. Still
> the same results:
> 
> ...
> + mount /store
> + cd /store
> + tar xf test.tar
> + sync
> + umount /store
> + echo 3
> + xfs_logprint -t /dev/sda1
> xfs_logprint:
>     data device: 0x801
>     log device: 0x801 daddr: 488384032 length: 476936
> 
>     log tail: 2048 head: 2176 state: <DIRTY>
> 
> 
> LOG REC AT LSN cycle 1 block 2048 (0x1, 0x800)
> 
> LOG REC AT LSN cycle 1 block 2112 (0x1, 0x840)
> + mount /store
> mount: /dev/sda1: can't read superblock
> 
> Same messages in dmesg at this point.
> 
> > You might want to find out if your platform needs to (and does)
> > implement these functions:
> >
> > flush_kernel_dcache_page()
> > flush_kernel_vmap_range()
> > invalidate_kernel_vmap_range()
> >
> > as these are what XFS relies on platforms to implement correctly to
> > avoid cache aliasing issues on CPUs with virtually indexed caches.
> 
> Is this what /proc/sys/vm/drop_caches relies on as well?

No, drop_caches frees the page cache and slab caches so future reads
need to be looked up from disk.

> flush_kernel_dcache_page is empty, the others are not but are
> conditionalized on the type of cache that is present. I wonder if that
> is somehow not being detected properly. Wouldn't that cause other
> areas of the system to misbehave as well?

vmap is not widely used throughout the kernel, and as a result
people porting linux to a new arch/CPU type often don't realise
there's anything to implement there because their system seems to be
working. That is, of course, until someone tries to use XFS.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS umount issue
  2011-05-24 19:10   ` Nuno Subtil
@ 2011-05-25  0:29     ` Dave Chinner
  2011-05-25  8:08       ` Nuno Subtil
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2011-05-25  0:29 UTC (permalink / raw)
  To: Nuno Subtil; +Cc: Paul Anderson, xfs-oss

On Tue, May 24, 2011 at 12:10:44PM -0700, Nuno Subtil wrote:
> On Tue, May 24, 2011 at 06:33, Paul Anderson <pha@umich.edu> wrote:
> > Hi Nuno - can you elaborate on the ARM hardware?  I noticed that my
> > XFS on ARM was mildly unstable, but felt it wasn't XFS code, but
> > rather the ARM port of Linux.  My test case is a Seagate Dockstar
> > hacked to run Linux.
> 
> Mine is a Netgear Stora. The interesting bit is that the stock
> firmware runs kernel 2.6.22.18 and uses XFS as well, but I don't know
> how stable it was to begin with.

Have you checked to see whether there are extra patches added to
that kernel by Netgear? It's not uncommon for these embedded systems
to run a kernel that has been patched to fix problems that you are
seeing.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS umount issue
  2011-05-25  0:29     ` Dave Chinner
@ 2011-05-25  8:08       ` Nuno Subtil
  0 siblings, 0 replies; 11+ messages in thread
From: Nuno Subtil @ 2011-05-25  8:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Paul Anderson, xfs-oss

On Tue, May 24, 2011 at 17:29, Dave Chinner <david@fromorbit.com> wrote:
...
>> Mine is a Netgear Stora. The interesting bit is that the stock
>> firmware runs kernel 2.6.22.18 and uses XFS as well, but I don't know
>> how stable it was to begin with.
>
> Have you checked to see whether there are extra patches added to
> that kernel by Netgear? It's not uncommon for these embedded systems
> to run a kernel that has been patched to fix problems that you are
> seeing.

I looked through it and didn't see anything that stood out, although I
could have easily missed it (the diff is quite noisy, plus it sounds
like the vmap cache invalidation functions have changed names in the
meantime?).

One interesting bit I did find is that the Netgear kernel comments
out the test that disables write barriers at mount time, which seems
quite odd, but it sounds unrelated to this issue.
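
(For reference, the bit I mean is the mount-time barrier check; I don't
remember exactly which file it lives in, so a quick way to find it in
the vendor tree is just something like:

# grep -rni barrier fs/xfs/
)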

Nuno


* Re: XFS umount issue
  2011-05-24 23:39         ` Dave Chinner
@ 2011-05-25  8:14           ` Nuno Subtil
  0 siblings, 0 replies; 11+ messages in thread
From: Nuno Subtil @ 2011-05-25  8:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs-oss

On Tue, May 24, 2011 at 16:39, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, May 24, 2011 at 03:18:11AM -0700, Nuno Subtil wrote:
>> On Tue, May 24, 2011 at 00:54, Dave Chinner <david@fromorbit.com> wrote:
>>
>> ...
>>
>> >> > Ok, so there's nothing here that actually says it's an unmount
>> >> > error. More likely it is a vmap problem in log recovery resulting in
>> >> > aliasing or some other stale data appearing in the buffer pages.
>> >> >
>> >> > Can you add a 'xfs_logprint -t <device>' after the umount? You
>> >> > should always see something like this telling you the log is clean:
>> >>
>> >> Well, I just ran into this again even without using the script:
>> >>
>> >> root@howl:/# umount /dev/md5
>> >> root@howl:/# xfs_logprint -t /dev/md5
>> >> xfs_logprint:
>> >>     data device: 0x905
>> >>     log device: 0x905 daddr: 488382880 length: 476936
>> >>
>> >>     log tail: 731 head: 859 state: <DIRTY>
>> >>
>> >>
>> >> LOG REC AT LSN cycle 1 block 731 (0x1, 0x2db)
>> >>
>> >> LOG REC AT LSN cycle 1 block 795 (0x1, 0x31b)
>> >
>> > Was there any other output? If there were valid transactions between
>> > the head and tail of the log xfs_logprint should have decoded them.
>>
>> There was no more output here.
>
> That doesn't seem quite right. Does it always look like this, even
> if you do a sync before unmount?

Not always, but almost. Sometimes there are a number of transactions in
the log as well, but this is by far the most common output I got. I'll
try to capture the output for that case as well.

>> >> I see nothing in dmesg at umount time. Attempting to mount the device
>> >> at this point, I got:
>> >>
>> >> [  764.516319] XFS (md5): Mounting Filesystem
>> >> [  764.601082] XFS (md5): Starting recovery (logdev: internal)
>> >> [  764.626294] XFS (md5): xlog_recover_process_data: bad clientid 0x0
>> >
>> > Yup, that's got bad information in a transaction header.
>> >
>> >> [  764.632559] XFS (md5): log mount/recovery failed: error 5
>> >> [  764.638151] XFS (md5): log mount failed
>> >>
>> >> Based on your description, this would be an unmount problem rather
>> >> than a vmap problem?
>> >
>> > Not clear yet. I forgot to mention that you need to do
>> >
>> > # echo 3 > /proc/sys/vm/drop_caches
>> >
>> > before you run xfs_logprint, otherwise it will see stale cached
>> > pages and give erroneous results.
>>
>> I added that before each xfs_logprint and ran the script again. Still
>> the same results:
>>
>> ...
>> + mount /store
>> + cd /store
>> + tar xf test.tar
>> + sync
>> + umount /store
>> + echo 3
>> + xfs_logprint -t /dev/sda1
>> xfs_logprint:
>>     data device: 0x801
>>     log device: 0x801 daddr: 488384032 length: 476936
>>
>>     log tail: 2048 head: 2176 state: <DIRTY>
>>
>>
>> LOG REC AT LSN cycle 1 block 2048 (0x1, 0x800)
>>
>> LOG REC AT LSN cycle 1 block 2112 (0x1, 0x840)
>> + mount /store
>> mount: /dev/sda1: can't read superblock
>>
>> Same messages in dmesg at this point.
>>
>> > You might want to find out if your platform needs to (and does)
>> > implement these functions:
>> >
>> > flush_kernel_dcache_page()
>> > flush_kernel_vmap_range()
>> > invalidate_kernel_vmap_range()
>> >
>> > as these are what XFS relies on platforms to implement correctly to
>> > avoid cache aliasing issues on CPUs with virtually indexed caches.
>>
>> Is this what /proc/sys/vm/drop_caches relies on as well?
>
> No, drop_caches frees the page cache and slab caches so future reads
> need to be looked up from disk.
>
>> flush_kernel_dcache_page is empty, the others are not but are
>> conditionalized on the type of cache that is present. I wonder if that
>> is somehow not being detected properly. Wouldn't that cause other
>> areas of the system to misbehave as well?
>
> vmap is not widely used throughout the kernel, and as a result
> people porting linux to a new arch/CPU type often don't realise
> there's anything to implement there because their system seems to be
> working. That is, of course, until someone tries to use XFS.....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

