* XFS regression?
@ 2007-10-10 14:27 Andrew Clayton
  2007-10-11  1:01 ` David Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Clayton @ 2007-10-10 14:27 UTC (permalink / raw)
  To: linux-fsdevel

Hi,

(Seeing as I haven't been able to subscribe or post to the XFS mailing
list, I'll try here)

I'll try not to flood with information on the first post.

In trying to track down this issue here:
http://www.spinics.net/lists/raid/msg17195.html

I think I'm seeing a regression in XFS.


If I run the following program (this was on an Athlon XP with a Seagate IDE
disk, running 2.6.23-rc9-git2):

/* fslattest.c */

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>

int main(int argc, char *argv[])
{
        char file[255];

        if (argc < 2) {
                printf("Usage: fslattest file\n");
                exit(1);
        }

        strncpy(file, argv[1], 254);
        file[254] = '\0';       /* strncpy() does not guarantee termination */
        printf("Opening %s\n", file);

        /* Create, close and unlink the same file once a second, forever. */
        while (1) {
                int testfd = open(file,
                                  O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
                                  0600);

                close(testfd);
                unlink(file);
                sleep(1);
        }

        exit(0);
}
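
For reference, it builds with a plain:

$ gcc -Wall -o fslattest fslattest.c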


e.g. $ strace -T -e open fslattest test

And then after a few seconds run

$ dd if=/dev/zero of=bigfile bs=1M count=500

I see the following

Before dd kicks in

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<0.005043> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
0600) = 3 <0.000212> open("test",
O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 <0.016844>

while dd is running

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<2.000348> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
0600) = 3 <1.594441> open("test",
O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 <2.224636>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<1.074615>

dd stopped

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<0.013224> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
0600) = 3 <0.007109> open("test",
O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 <0.007108>


Doing the same thing with ext3 shows no such stalls. e.g before, during
and after the above dd

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<0.015423> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
0600) = 3 <0.000092> open("test",
O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 <0.000093>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<0.000088> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
0600) = 3 <0.000103> open("test",
O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 <0.000096>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<0.000094> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
0600) = 3 <0.000114> open("test",
O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 <0.000091>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3
<0.000274> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE,
0600) = 3 <0.000107>


I have found a machine in the office to do more testing on and that can
run older kernels.

It seems this problem was introduced between 2.6.18 and 2.6.19. 

The other thing I've found is that if I do the dd to an ext3 fs (on
the same disk at least) while running the test in the XFS fs then I
also see the latencies.

I can replicate this behaviour on at least 4 different machines so far.
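
If it helps, the whole reproduction can be scripted; the sketch below is
roughly what I run (paths, sleep times and the dd size are just the ones
used above):

#!/bin/sh
# Start the create/unlink loop, attach strace to log open() latencies,
# then generate writeback load part-way through with dd.
./fslattest test &
TESTPID=$!
strace -T -e trace=open -o open-latency.log -p $TESTPID &
sleep 5
dd if=/dev/zero of=bigfile bs=1M count=500
sleep 5
kill $TESTPID
grep '<' open-latency.log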

Just ask if you'd like more details.

Cheers,

Andrew


* Re: XFS regression?
  2007-10-10 14:27 XFS regression? Andrew Clayton
@ 2007-10-11  1:01 ` David Chinner
  2007-10-11  9:05   ` Andrew Clayton
  2007-10-11 14:15   ` Andrew Clayton
  0 siblings, 2 replies; 13+ messages in thread
From: David Chinner @ 2007-10-11  1:01 UTC (permalink / raw)
  To: Andrew Clayton; +Cc: linux-fsdevel

On Wed, Oct 10, 2007 at 03:27:42PM +0100, Andrew Clayton wrote:
> Hi,
> 
> (Seeing as I haven't been able to subscribe or post to the XFS mailing
> list, I'll try here)
> 
> I'll try not to flood with information on the first post.
> 
> In trying to track down this issue here:
> http://www.spinics.net/lists/raid/msg17195.html
> 
> I think I'm seeing a regression in XFS

Perhaps not XFS. Setup here: 2.6.23-rc4, CFQ, ia64, 10k rpm SCSI
disk, write cache (WCE) and CTQ enabled, nobarrier mount option.

fslattest output at defaults:

# cat /sys/block/sdb/queue/nr_requests
128

idle state:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000160>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000159>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000158>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000159>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000155>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000159>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000155>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000159>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000156>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000158>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000158>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000153>

run dd:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000456>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000372>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000412>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000365>

(writeback starts)

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.401471>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.544249>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.727238>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.569971>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.715226>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.561839>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.714825>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000389>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000480>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.441398>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.488799>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.701921>

So we've got latencies of 500-700ms there while writeback is occurring, i.e. the
disk is busy.

# echo 4 > /sys/block/sdb/queue/nr_requests
# cat /sys/block/sdb/queue/nr_requests
4

idle:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000223>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000219>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000219>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000269>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000249>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000220>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000215>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000209>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000214>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000220>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000213>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000239>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000281>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000209>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000326>

run dd:

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.001026>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.001077>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000882>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000197>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000862>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000198>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000199>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000218>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000216>

(writeback starts):

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.080791>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.077202>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.003443>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.083383>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.071555>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.068379>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.077401>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.087573>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.080739>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.069364>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.071922>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.000216>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.076608>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 4 <0.069775>

Latencies are an order of magnitude lower at 60-70ms because the disk's queue
is much shallower. This is expected - deep queues and multiple outstanding
I/Os are the enemy of single I/O latency....

If I remount with barriers enabled, the latency at nr_requests=128 goes
up to a consistent 2.2s. Not surprising - we're flushing the drive
cache very regularly now, and it points to the create or truncate
transaction having to push log buffers to disk. The latency remains
at 70-80ms at nr_requests=4.

> It seems this problem was introduced between 2.6.18 and 2.6.19. 

When the new SATA driver infrastructure was introduced. Do you have NCQ
enabled on more recent kernels and not on 2.6.18? If so, try disabling it
and see if the problem goes away....
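
If the drive and driver do support NCQ, the queue depth is usually visible
and tunable through sysfs at runtime - something like the lines below, where
sdX is a placeholder for the disk in question and a depth of 1 effectively
disables NCQ (exact paths can vary by driver):

# cat /sys/block/sdX/device/queue_depth
# echo 1 > /sys/block/sdX/device/queue_depth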

> The other thing I've found is that if I do the dd to an ext3 fs (on
> the same disk at least) while running the test in the XFS fs then I
> also see the latencies.

So it's almost certainly pointing at an elevator or driver change, not
an XFS change.

Cheers,

dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: XFS regression?
  2007-10-11  1:01 ` David Chinner
@ 2007-10-11  9:05   ` Andrew Clayton
  2007-10-11 14:15   ` Andrew Clayton
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Clayton @ 2007-10-11  9:05 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel

On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

 
> Latencies are an order of magnitude lower at 60-70ms because the disks
> have less deep queues. This is expected - deep queues and multiple
> outstanding I/Os are the enemy of single I/O latency....
> 
> If I remount with barriers enabled, the latency at nr_requests=128
> goes up to a consistent 2.2s. Not surprising, we're flushing the drive
> cache very regularly now and it points to the create or truncate
> transaction having to pushing log buffers to disk. The latency remains
> at 70-80ms at nr_requests=4.

Thanks for the info. I did try fiddling nr_requests but I made it
bigger. I'll try with it lower.
 
> > It seems this problem was introduced between 2.6.18 and 2.6.19. 
> 
> When the new SATA driver infrastructure was introduced. Do you have
> NCQ enabled on more recent kernels and not on 2.6.18? If so, try
> disabling it and see if the problem goes away....

Unfortunately the drives in the file server don't support NCQ. Not sure
if it's supported in the machine I was testing on (it's certainly a few
years old).
 
> > The other thing I've found is that if I do the dd to an ext3 fs (on
> > the same disk at least) while running the test in the XFS fs then I
> > also see the latencies.
> 
> So it's almost certainly pointing at an elevator or driver change, not
> an XFS change.

OK, though it doesn't seem to affect ext3. I'm going to run a git
bisect to see what it comes up with.
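
The plan is the usual bisect dance between the two releases, roughly:

$ git bisect start
$ git bisect bad v2.6.19
$ git bisect good v2.6.18
  (build and boot the suggested kernel, run the fslattest + dd test, then
   mark it "git bisect good" or "git bisect bad" and repeat until git names
   the first bad commit; "git bisect reset" when done)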

> Cheers,
> 
> dave.

Cheers,

Andrew


* Re: XFS regression?
  2007-10-11  1:01 ` David Chinner
  2007-10-11  9:05   ` Andrew Clayton
@ 2007-10-11 14:15   ` Andrew Clayton
  2007-10-11 21:53     ` David Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Clayton @ 2007-10-11 14:15 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel

On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:

> So it's almost certainly pointing at an elevator or driver change, not
> an XFS change.

heh, git bisect begs to differ :)

4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit
commit 4c60658e0f4e253cf275f12b7c76bf128515a774
Author: David Chinner <dgc@sgi.com>
Date:   Sat Nov 11 18:05:00 2006 +1100

    [XFS] Prevent a deadlock when xfslogd unpins inodes.
    
    The previous fixes for the use after free in xfs_iunpin left a nasty log
    deadlock when xfslogd unpinned the inode and dropped the last reference to
    the inode. the ->clear_inode() method can issue transactions, and if the
    log was full, the transaction could push on the log and get stuck trying
    to push the inode it was currently unpinning.
    
    To fix this, we provide xfs_iunpin a guarantee that it will always have a
    valid xfs_inode <-> linux inode link or a particular flag will be set on
    the inode. We then use log forces during lookup to ensure transactions are
    completed before we recycle the inode. This ensures that xfs_iunpin will
    never use the linux inode after it is being freed, and any lookup on an
    inode on the reclaim list will wait until it is safe to attach a new linux
    inode to the xfs inode.
    
    SGI-PV: 956832
    SGI-Modid: xfs-linux-melb:xfs-kern:27359a
    
    Signed-off-by: David Chinner <dgc@sgi.com>
    Signed-off-by: Shailendra Tripathi <stripathi@agami.com>
    Signed-off-by: Takenori Nagano <t-nagano@ah.jp.nec.com>
    Signed-off-by: Tim Shimmin <tes@sgi.com>

:040000 040000 1c47ba44bc404456c87c5a493d543a8d30696b88 92319b34585d03c64e53890a80e550b579a0363d M      fs


If you'd like more info, don't hesitate to ask.

> Cheers,
> 
> dave.

Cheers,

Andrew


* Re: XFS regression?
  2007-10-11 14:15   ` Andrew Clayton
@ 2007-10-11 21:53     ` David Chinner
  2007-10-12  0:26       ` David Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: David Chinner @ 2007-10-11 21:53 UTC (permalink / raw)
  To: Andrew Clayton; +Cc: David Chinner, linux-fsdevel, xfs

On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
> On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
> 
> > So it's almost certainly pointing at an elevator or driver change, not an
> > XFS change.
> 
> heh, git bisect begs to differ :)
> 
> 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit
> 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <dgc@sgi.com>
> Date:   Sat Nov 11 18:05:00 2006 +1100
> 
>     [XFS] Prevent a deadlock when xfslogd unpins inodes.

Oh, of course - I failed to notice the significance of
this loop in your test:

	while true; do
		touch fred
		rm fred
	done

The inode allocator keeps reusing the same inode.  If the
transaction that did the unlink has not hit the disk before we
allocate the inode again, we have to force the log to get the unlink
transaction to disk to get the xfs inode unpinned (i.e. able to be
modified in memory again).

It's the log force I/O that's introducing the latency.
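
You can watch the log forces happening while the test runs via the XFS stats
- something like the line below; the "log" line in /proc/fs/xfs/stat carries
the log force counters, though the exact field layout varies between kernel
versions:

# while true; do grep ^log /proc/fs/xfs/stat; sleep 1; done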

If we don't force the log, then we have a possible use-after-free
of the linux inode because of a fundamental mismatch between
the XFS inode life cycle and the linux inode life cycle. The
use-after-free only occurs on large machines under heavy, heavy
metadata load to many disks and filesystems (requires enough
traffic to overload an xfslogd) and is very difficult to
reproduce (large machine, lots of disks and 20-30 hours MTTF).

I'll have a look at other ways to solve this problem, but it
took 6 months to find a solution to the race in the first place
so don't hold your breath.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: XFS regression?
  2007-10-11 21:53     ` David Chinner
@ 2007-10-12  0:26       ` David Chinner
  2007-10-12 11:36         ` Andrew Clayton
  0 siblings, 1 reply; 13+ messages in thread
From: David Chinner @ 2007-10-12  0:26 UTC (permalink / raw)
  To: David Chinner; +Cc: Andrew Clayton, linux-fsdevel, xfs

On Fri, Oct 12, 2007 at 07:53:53AM +1000, David Chinner wrote:
> On Thu, Oct 11, 2007 at 03:15:12PM +0100, Andrew Clayton wrote:
> > On Thu, 11 Oct 2007 11:01:39 +1000, David Chinner wrote:
> > 
> > > So it's almost certainly pointing at an elevator or driver change, not an
> > > XFS change.
> > 
> > heh, git bisect begs to differ :)
> > 
> > 4c60658e0f4e253cf275f12b7c76bf128515a774 is first bad commit commit
> > 4c60658e0f4e253cf275f12b7c76bf128515a774 Author: David Chinner <dgc@sgi.com>
> > Date:   Sat Nov 11 18:05:00 2006 +1100
> > 
> >     [XFS] Prevent a deadlock when xfslogd unpins inodes.
> 
> Oh, of course - I failed to notice the significance of
> this loop in your test:
> 
> 	while [foo]; do
> 		touch fred
> 		rm fred
> 	done
> 
> The inode allocator keeps reusing the same inode.  If the
> transaction that did the unlink has not hit the disk before we
> allocate the inode again, we have to force the log to get the unlink
> transaction to disk to get the xfs inode unpinned (i.e. able to be
> modified in memory again).
> 
> It's the log force I/O that's introducing the latency.
> 
> If we don't force the log, then we have a possible use-after free
> of the linux inode because of a fundamental mismatch between
> the XFS inode life cycle and the linux inode life cycle. The
> use-after free only occurs on large machines under heavy, heavy
> metadata load to many disks and filesystems (requires enough
> traffic to overload an xfslogd) and is very difficult to
> reproduce (large machine, lots of disks and 20-30 hours MTTF).
> 
> I'll have a look at other ways to solve this problem, but it
> took 6 months to find a solution to the race in the first place
> so don't hold your breath.

You can breathe again. Here's a test patch (warning - may harm
kittens - not fully tested or verified) that solves both
the use-after-free issue (by avoiding it altogether) as well as the
unlink/create latency, because the log force is no longer there.

(yay! xfsqa test 016 passes again ;)
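
(For anyone wanting to check that themselves: test 016 is run from an
xfstests checkout in the usual way, roughly as below - the devices and
mount points are placeholders, and the scratch device gets mkfs'd by the
harness.)

# export TEST_DEV=/dev/sdb1 TEST_DIR=/mnt/test
# export SCRATCH_DEV=/dev/sdb2 SCRATCH_MNT=/mnt/scratch
# ./check 016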

It does have other possible side effects - it triggers extra
log forces elsewhere on inode writeback and affects sync behaviour -
so it's only a proof of concept at this point.

The latency output looks like:

<idle>

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000281>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000184>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000188>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000182>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000225>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000182>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000185>

<start dd>

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000405>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000224>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000199>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000163>

<writeback starts>

open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000145>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.817704>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000148>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000143>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000154>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000147>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000158>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000144>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000379>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000151>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000190>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000191>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000150>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.797099>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000139>
open("/mnt/scratch/testfile", O_WRONLY|O_CREAT|O_TRUNC|O_EXCL, 0600) = 3 <0.000163>

<writeback ends>

So we still see some operations show high latency - that will most likely
be due to the allocation transaction filling a log buffer and pushing it
to disk, then having to wait because I/O is in progress on all other log
buffers.
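
As an aside (nothing to do with the patch above), the number and size of the
in-core log buffers is tunable at mount time, which usually helps with that
kind of log buffer stall - e.g. something like the line below, where the
device name is just a placeholder and logbsize above 32k generally wants a
version 2 log:

# mount -o logbufs=8,logbsize=262144 /dev/sdb1 /mnt/scratch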

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


---
 fs/xfs/linux-2.6/xfs_iops.c |   16 ++++++++++++++++
 fs/xfs/xfs_iget.c           |   18 ------------------
 fs/xfs/xfs_inode.c          |   34 +---------------------------------
 fs/xfs/xfs_inode.h          |    1 +
 fs/xfs/xfs_inode_item.c     |    5 +++++
 5 files changed, 23 insertions(+), 51 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2007-10-02 16:01:47.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2007-10-12 09:54:12.466194814 +1000
@@ -71,6 +71,22 @@ xfs_synchronize_atime(
 }
 
 /*
+ * If the linux inode exists, mark it dirty.
+ * Used when commiting a dirty inode into a transaction so that
+ * the inode will get written back by the linux code
+ */
+void
+xfs_mark_inode_dirty_sync(
+	xfs_inode_t	*ip)
+{
+	bhv_vnode_t	*vp;
+
+	vp = XFS_ITOV_NULL(ip);
+	if (vp)
+		mark_inode_dirty_sync(vn_to_inode(vp));
+}
+
+/*
  * Change the requested timestamp in the given inode.
  * We don't lock across timestamp updates, and we don't log them but
  * we do record the fact that there is dirty information in core.
Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-10-02 16:01:48.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-10-12 09:58:38.263910771 +1000
@@ -140,27 +140,9 @@ again:
 				return ENOENT;
 			}
 
-			/*
-			 * There may be transactions sitting in the
-			 * incore log buffers or being flushed to disk
-			 * at this time.  We can't clear the
-			 * XFS_IRECLAIMABLE flag until these
-			 * transactions have hit the disk, otherwise we
-			 * will void the guarantee the flag provides
-			 * xfs_iunpin()
-			 */
-			if (xfs_ipincount(ip)) {
-				read_unlock(&pag->pag_ici_lock);
-				xfs_log_force(mp, 0,
-					XFS_LOG_FORCE|XFS_LOG_SYNC);
-				XFS_STATS_INC(xs_ig_frecycle);
-				goto again;
-			}
-
 			xfs_itrace_exit_tag(ip, "xfs_iget.alloc");
 
 			XFS_STATS_INC(xs_ig_found);
-
 			xfs_iflags_clear(ip, XFS_IRECLAIMABLE);
 			read_unlock(&pag->pag_ici_lock);
 
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c	2007-10-11 09:53:53.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c	2007-10-12 09:56:49.721912637 +1000
@@ -2807,40 +2807,8 @@ xfs_iunpin(
 {
 	ASSERT(atomic_read(&ip->i_pincount) > 0);
 
-	if (atomic_dec_and_lock(&ip->i_pincount, &ip->i_flags_lock)) {
-
-		/*
-		 * If the inode is currently being reclaimed, the link between
-		 * the bhv_vnode and the xfs_inode will be broken after the
-		 * XFS_IRECLAIM* flag is set. Hence, if these flags are not
-		 * set, then we can move forward and mark the linux inode dirty
-		 * knowing that it is still valid as it won't freed until after
-		 * the bhv_vnode<->xfs_inode link is broken in xfs_reclaim. The
-		 * i_flags_lock is used to synchronise the setting of the
-		 * XFS_IRECLAIM* flags and the breaking of the link, and so we
-		 * can execute atomically w.r.t to reclaim by holding this lock
-		 * here.
-		 *
-		 * However, we still need to issue the unpin wakeup call as the
-		 * inode reclaim may be blocked waiting for the inode to become
-		 * unpinned.
-		 */
-
-		if (!__xfs_iflags_test(ip, XFS_IRECLAIM|XFS_IRECLAIMABLE)) {
-			bhv_vnode_t	*vp = XFS_ITOV_NULL(ip);
-			struct inode *inode = NULL;
-
-			BUG_ON(vp == NULL);
-			inode = vn_to_inode(vp);
-			BUG_ON(inode->i_state & I_CLEAR);
-
-			/* make sync come back and flush this inode */
-			if (!(inode->i_state & (I_NEW|I_FREEING)))
-				mark_inode_dirty_sync(inode);
-		}
-		spin_unlock(&ip->i_flags_lock);
+	if (atomic_dec_and_test(&ip->i_pincount))
 		wake_up(&ip->i_ipin_wait);
-	}
 }
 
 /*
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.h	2007-10-02 16:01:48.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.h	2007-10-12 10:00:10.979948827 +1000
@@ -532,6 +532,7 @@ xfs_fsize_t	xfs_file_last_byte(xfs_inode
 void		xfs_lock_inodes(xfs_inode_t **, int, int, uint);
 
 void		xfs_synchronize_atime(xfs_inode_t *);
+void		xfs_mark_inode_dirty_sync(xfs_inode_t *);
 
 xfs_bmbt_rec_host_t *xfs_iext_get_ext(xfs_ifork_t *, xfs_extnum_t);
 void		xfs_iext_insert(xfs_ifork_t *, xfs_extnum_t, xfs_extnum_t,
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode_item.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode_item.c	2007-10-02 16:01:48.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode_item.c	2007-10-12 09:54:18.053474268 +1000
@@ -274,6 +274,11 @@ xfs_inode_item_format(
 	 */
 	xfs_synchronize_atime(ip);
 
+	/*
+	 * make sure the linux inode is dirty
+	 */
+	xfs_mark_inode_dirty_sync(ip);
+
 	vecp->i_addr = (xfs_caddr_t)&ip->i_d;
 	vecp->i_len  = sizeof(xfs_dinode_core_t);
 	XLOG_VEC_SET_TYPE(vecp, XLOG_REG_TYPE_ICORE);


* Re: XFS regression?
  2007-10-12  0:26       ` David Chinner
@ 2007-10-12 11:36         ` Andrew Clayton
  2007-10-12 13:28           ` Andrew Clayton
                             ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andrew Clayton @ 2007-10-12 11:36 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel, xfs

On Fri, 12 Oct 2007 10:26:13 +1000, David Chinner wrote:
 
> You can breath again. Here's a test patch (warning - may harm

heh

> kittens - not fully tested or verified) that solves both
> the use-after-free issue (by avoiding it altogether) as well the
> unlink/create latency because the log force is no longer there.
> 
> (yay! xfsqa test 016 passes again ;)
> 
> It does have other possible side effects triggering extra
> log forces elsewhere on inode writeback and affects sync behaviour
> so it's only a proof of concept at this point.

What kernel is that against? I got rejects with 2.6.23.

However, I tried 2.6.18 on the file server and ran my test; it didn't
show the problem. I then built 2.6.23 with the commit from my git
bisect reverted.

Doing the test with that kernel while writing a 1GB file, I saw only
one > 1 second latency (1.2s) and only a few ~0.5 second latencies.

However over the longer term I'm still seeing latencies > 1 second.
Just leaving my strace test running (no dd) on the raid filesystem I see
the latencies come when the raid5 stripe cache fills up. So I think I'm
perhaps seeing another problem here.

Running the strace (again no dd) on the system disk (not raided) I'm not
seeing any latencies.

In fact the latencies on the raid array seem to be generally greater
than the system disk (all identical disks, all XFS).

raid array

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.122943>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.021620>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.014963>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.023264>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.011368>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.002561>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.012623>

system disk

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000190>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000039>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000191>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000268>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000188>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000233>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000279>


Maybe that's to be expected?

> Cheers,
> 
> Dave.

Thanks for looking at this.

Andrew



* Re: XFS regression?
  2007-10-12 11:36         ` Andrew Clayton
@ 2007-10-12 13:28           ` Andrew Clayton
  2007-10-13 13:35           ` Bhagi rathi
  2007-10-14 23:19           ` David Chinner
  2 siblings, 0 replies; 13+ messages in thread
From: Andrew Clayton @ 2007-10-12 13:28 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel, xfs

On Fri, 12 Oct 2007 12:36:01 +0100, Andrew Clayton wrote:


Ignore the numbers below. Under normal conditions the raid array and
single drive show basically the same numbers; the raid array numbers
were taken over NFS.

> raid array
> 
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.122943>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.021620>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.014963>
> 
> system disk
> 
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000190>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000039>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000191>

While we're talking numbers...

raid array, strace over NFS

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.018330>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.026398>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <8.927449>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <1.284661>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.030078>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.021407>
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.018660>

raid array, strace locally

open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000069>
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000044>
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <8.995286>
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <1.258810>
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <1.225763>
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000056>
open("test2", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000092>


Those were running concurrently. I think that rules out networking.
I'm getting more convinced it's a raid5 problem...
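
The raid5 stripe cache is tunable at runtime through sysfs if anyone wants to
see whether a bigger cache moves the stalls - something like the lines below
(md0 and the size are just examples; the value is in pages per device,
default 256):

# cat /sys/block/md0/md/stripe_cache_size
# echo 4096 > /sys/block/md0/md/stripe_cache_size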


Andrew


* Re: XFS regression?
  2007-10-12 11:36         ` Andrew Clayton
  2007-10-12 13:28           ` Andrew Clayton
@ 2007-10-13 13:35           ` Bhagi rathi
  2007-10-14 23:09             ` David Chinner
  2007-10-14 23:19           ` David Chinner
  2 siblings, 1 reply; 13+ messages in thread
From: Bhagi rathi @ 2007-10-13 13:35 UTC (permalink / raw)
  To: Andrew Clayton; +Cc: David Chinner, linux-fsdevel, xfs

David, can you explain the use-after-free problem to me? I want to understand
how the life cycles of the Linux inode and the XFS inode are related to log
flushes. Any pointer would also be of great help.

 Thanks,
- Bhagi

On 10/12/07, Andrew Clayton <andrew@digital-domain.net> wrote:
>
> On Fri, 12 Oct 2007 10:26:13 +1000, David Chinner wrote:
>
> > You can breath again. Here's a test patch (warning - may harm
>
> heh
>
> > kittens - not fully tested or verified) that solves both
> > the use-after-free issue (by avoiding it altogether) as well the
> > unlink/create latency because the log force is no longer there.
> >
> > (yay! xfsqa test 016 passes again ;)
> >
> > It does have other possible side effects triggering extra
> > log forces elsewhere on inode writeback and affects sync behaviour
> > so it's only a proof of concept at this point.
>
> What kernel is that against?. I got rejects with 2.6.23
>
> However I tried a 2.6.18 on the file server and ran my test, it didn't
> show the problem. I then made a 2.6.23 but with the patch from my git
> bisect reverted.
>
> Doing the test with that kernel, while writing a 1GB file I saw only
> one > 1 second latency (1.2) and only a few ~ 0.5 second latencies.
>
> However over the longer term I'm still seeing latencies > 1 second.
> Just leaving my strace test running (no dd) on the raid filesystem I see
> the
> latencies come when the raid5 stripe cache fills up. So I think I'm
> perhaps seeing another problem here.
>
> Running the strace (again no dd) on the system disk (not raided) I'm not
> seeing any latencies.
>
> In fact the latencies on the raid array seem to be generally greater
> than the system disk (all identical disks, all XFS).
>
> raid array
>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.122943>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.021620>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.014963>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.023264>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.011368>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.002561>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.012623>
>
> system disk
>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000190>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000039>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000191>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000268>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000188>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000233>
> open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 <0.000279>
>
>
> Maybe that's to be expected?
>
> > Cheers,
> >
> > Dave.
>
> Thanks for looking at this.
>
> Andrew
>
>
>




* Re: XFS regression?
  2007-10-13 13:35           ` Bhagi rathi
@ 2007-10-14 23:09             ` David Chinner
  2007-10-15  9:58               ` Bhagi rathi
  0 siblings, 1 reply; 13+ messages in thread
From: David Chinner @ 2007-10-14 23:09 UTC (permalink / raw)
  To: Bhagi rathi; +Cc: Andrew Clayton, David Chinner, linux-fsdevel, xfs

On Sat, Oct 13, 2007 at 07:05:17PM +0530, Bhagi rathi wrote:
> David, Can you let me know the use after free problem? I want to understand
> how the life cycle of linux inode
> and xfs inode are related to log flush.

Log I/O completion:

  -> xfs_trans_commited
       -> xfs_iunpin(xfs inode)
         get linux inode from xfs inode
          -> mark_inode_dirty_sync(linux inode)

Freeing the linux inode:

clear_inode(linux_inode)
 -> xfs_inactive()
    -> xfs_trans_commit() (e.g. freeing data associated with unlinked inode)
       -> xfs_ipin()
    (link between xfs and linux inode broken)
  linux inode freed 
 
So, in log I/O completion, we can be completing a previous
transaction at the same time clear_inode() is running, and
hence in xfs_iunpin() we can race with the freeing of the
linux inode as xfs_iunpin does not hold any locks.

> Any pointer is also of great help.

/me points at the code.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: XFS regression?
  2007-10-12 11:36         ` Andrew Clayton
  2007-10-12 13:28           ` Andrew Clayton
  2007-10-13 13:35           ` Bhagi rathi
@ 2007-10-14 23:19           ` David Chinner
  2 siblings, 0 replies; 13+ messages in thread
From: David Chinner @ 2007-10-14 23:19 UTC (permalink / raw)
  To: Andrew Clayton; +Cc: David Chinner, linux-fsdevel, xfs

On Fri, Oct 12, 2007 at 12:36:01PM +0100, Andrew Clayton wrote:
> On Fri, 12 Oct 2007 10:26:13 +1000, David Chinner wrote:
>  
> > You can breath again. Here's a test patch (warning - may harm
> 
> heh
> 
> > kittens - not fully tested or verified) that solves both
> > the use-after-free issue (by avoiding it altogether) as well the
> > unlink/create latency because the log force is no longer there.
> > 
> > (yay! xfsqa test 016 passes again ;)
> > 
> > It does have other possible side effects triggering extra
> > log forces elsewhere on inode writeback and affects sync behaviour
> > so it's only a proof of concept at this point.
> 
> What kernel is that against?. I got rejects with 2.6.23 

The xfs-dev tree - i.e. the XFS that will be in 2.6.25 ;)

> However I tried a 2.6.18 on the file server and ran my test, it didn't
> show the problem. I then made a 2.6.23 but with the patch from my git
> bisect reverted.
> 
> Doing the test with that kernel, while writing a 1GB file I saw only
> one > 1 second latency (1.2) and only a few ~ 0.5 second latencies.
> 
> However over the longer term I'm still seeing latencies > 1 second.

Sure - you've got a busy disk. If the truncate has to flush the log
and wait for space, then it's going to take some time for I/Os
to complete. Full queue + busy disk == unpredictable latency for all
operations.

> Just leaving my strace test running (no dd) on the raid filesystem I see
> the
> latencies come when the raid5 stripe cache fills up. So I think I'm
> perhaps seeing another problem here.

Software raid isn't good for latency, either ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: XFS regression?
  2007-10-14 23:09             ` David Chinner
@ 2007-10-15  9:58               ` Bhagi rathi
  2007-10-15 11:57                 ` David Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Bhagi rathi @ 2007-10-15  9:58 UTC (permalink / raw)
  To: David Chinner; +Cc: Andrew Clayton, linux-fsdevel, xfs

Thanks, Dave, for the response. Thinking further, why does xfs_iunpin have
to mark the inode dirty? Transactions generally touch the timestamps at one
point or another, and xfs_ichgtime takes care of marking the inode dirty
there. I am wondering why we need to mark the inode dirty at all, either in
the context of unpin or in the context of formatting the inode.

-Bhagi.

On 10/15/07, David Chinner <dgc@sgi.com> wrote:
>
> On Sat, Oct 13, 2007 at 07:05:17PM +0530, Bhagi rathi wrote:
> > David, Can you let me know the use after free problem? I want to
> understand
> > how the life cycle of linux inode
> > and xfs inode are related to log flush.
>
> Log I/O completion:
>
>   -> xfs_trans_commited
>        -> xfs_iunpin(xfs inode)
>          get linux inode from xfs inode
>           -> mark_inode_dirty_sync(linux inode)
>
> Freeing the linux inode:
>
> clear_inode(linux_inode)
> -> xfs_inactive()
>     -> xfs_trans_commit() (e.g. freeing data associated with unlinked
> inode)
>        -> xfs_ipin()
>     (link between xfs and linux inode broken)
>   linux inode freed
>
> So, in log I/O completion, we can be completing a previous
> transaction at the same time clear_inode() is running, and
> hence in xfs_iunpin() we can race with the freeing of the
> linux inode as xfs_iunpin does not hold any locks.
>
> > Any pointer is also of great help.
>
> /me points at the code.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
>




* Re: XFS regression?
  2007-10-15  9:58               ` Bhagi rathi
@ 2007-10-15 11:57                 ` David Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: David Chinner @ 2007-10-15 11:57 UTC (permalink / raw)
  To: Bhagi rathi; +Cc: David Chinner, Andrew Clayton, linux-fsdevel, xfs

On Mon, Oct 15, 2007 at 03:28:34PM +0530, Bhagi rathi wrote:
> Thanks Dave for the response.   Thinking futher, why is that xfs_iunpin has
> to mark the inode dirty?

Because the inode has been modified, and instead of sprinkling
mark_inode_dirty_sync() all over the code, we can do it in a single
spot that catches all inode modifications. We don't have to think about
it by doing this - inodes in transactions get marked dirty for free....

> All transactions generally modify one time or other, xfs_ichgtime takes care
> of marking inode as
> dirty.

Sure, but there are plenty of other transactions that don't have such
a convenient hook.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

