All of lore.kernel.org
 help / color / mirror / Atom feed
* deleting 2TB lots of files with delaylog: sync helps?
@ 2010-08-31 23:30 Michael Monnerie
  2010-09-01  0:06 ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Monnerie @ 2010-08-31 23:30 UTC (permalink / raw)
  To: xfs



I'm just trying the delaylog mount option on a filesystem (LVM over 
2x 2TB 4K sector drives), and I see this while running 8 processes 
of "rm -r * & 2>/dev/null":

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               2,80    33,40  125,00   64,60   720,00   939,30    17,50     0,55    2,91   1,71  32,40
sdd               0,00    25,60  122,80   63,40   662,40   874,40    16,51     0,52    2,77   1,96  36,54
dm-0              0,00     0,00  250,60  123,00  1382,40  1941,70    17,79     1,64    4,39   1,74  65,08

Then I issue "sync", and utilisation increases:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0,00     0,20   15,80  175,40    84,00  2093,30    22,78     0,62    3,26   2,93  55,94
sdd               0,00     1,00   13,40  177,60    79,20  2114,10    22,97     0,69    3,63   3,34  63,80
dm-0              0,00     0,00   29,20  101,20   163,20  4207,40    67,03     1,11    8,51   7,56  98,60

This is reproducible. It may be that the sync simply causes more writes and stalls reads,
so overall it's slower, but I'm wondering why none of the devices shows "100% util", which
should be the case during deletes. Or is this again the known quirk of the utilisation
calculation, where writes don't really show up there?

I know I should have benchmarked and tested first; I just wanted to draw attention to
this, as there might be something to optimize.
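For reference, here is a miniature version of the test that runs anywhere (paths and file counts are made up and scaled way down from the 2TB case; the original ran 8 "rm -r" processes over an LVM volume):

```shell
# Scaled-down sketch of the parallel-delete workload described above.
SCRATCH=$(mktemp -d)
for d in 0 1 2 3 4 5 6 7; do
    mkdir "$SCRATCH/$d"
    for f in a b c d; do : > "$SCRATCH/$d/$f"; done
done
# delete the eight subtrees concurrently, as in the original test
for d in 0 1 2 3 4 5 6 7; do
    rm -r "$SCRATCH/$d" 2>/dev/null &
done
wait
ls -A "$SCRATCH"        # should print nothing once all deletes finish
rmdir "$SCRATCH"
```

On a real filesystem you would watch `iostat -xk 5` in another terminal while this runs.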

Another strange thing: after the 8 "rm -r" processes finished, some subdirectories that
hadn't been deleted were left over - running a single "rm -r" afterwards cleaned them
out. Could that be a problem with "delaylog"? Or can that happen when several "rm"
processes compete in the same directories?

This is kernel 2.6.35.4

-- 
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

****** Current radio interview! ******
http://www.it-podcast.at/aktuelle-sendung.html

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-08-31 23:30 deleting 2TB lots of files with delaylog: sync helps? Michael Monnerie
@ 2010-09-01  0:06 ` Dave Chinner
  2010-09-01  0:22   ` Michael Monnerie
  2010-09-01  3:01   ` Stan Hoeppner
  0 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2010-09-01  0:06 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

On Wed, Sep 01, 2010 at 01:30:41AM +0200, Michael Monnerie wrote:
> I'm just trying the delaylog mount option on a filesystem (LVM over 
> 2x 2TB 4K sector drives), and I see this while running 8 processes 
> of "rm -r * & 2>/dev/null":
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sdc               2,80    33,40  125,00   64,60   720,00   939,30    17,50     0,55    2,91   1,71  32,40
> sdd               0,00    25,60  122,80   63,40   662,40   874,40    16,51     0,52    2,77   1,96  36,54
> dm-0              0,00     0,00  250,60  123,00  1382,40  1941,70    17,79     1,64    4,39   1,74  65,08
> 
> Then I issue "sync", and utilisation increases:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sdc               0,00     0,20   15,80  175,40    84,00  2093,30    22,78     0,62    3,26   2,93  55,94
> sdd               0,00     1,00   13,40  177,60    79,20  2114,10    22,97     0,69    3,63   3,34  63,80
> dm-0              0,00     0,00   29,20  101,20   163,20  4207,40    67,03     1,11    8,51   7,56  98,60
> 
> This is reproducible.

You're probably getting RMW cycles on inode writeback. I've been
noticing this lately with my benchmarking - the VM is being _very
aggressive_ reclaiming page cache pages vs inode caches and as a
result the inode buffers used for IO are being reclaimed between the
time it takes to create the inodes and when they are written back.
Hence you get lots of reads occurring during inode writeback.

By issuing a sync, you clear out all the inode writeback and all the
RMW cycles go away. As a result, there is more disk throughput
available for the unlink processes.  There is a good chance this is
the case as the number of reads after the sync drop by an order of
magnitude...
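Based on that explanation, one workaround worth trying (a sketch, not a tested recommendation: the 1-second interval is a guess, and this is scaled down to a scratch directory) is to run a periodic sync alongside the long delete, so queued inode writeback is flushed before the VM reclaims the buffers:

```shell
# Periodically sync while a background delete is in flight.
SCRATCH=$(mktemp -d)
i=0
while [ "$i" -lt 50 ]; do : > "$SCRATCH/f$i"; i=$((i + 1)); done
rm -r "$SCRATCH" &
RM_PID=$!
# flush writeback every second until the delete finishes
while kill -0 "$RM_PID" 2>/dev/null; do
    sync
    sleep 1
done
wait "$RM_PID" 2>/dev/null
```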

> Now it can be that the sync just causes more writes and stalls reads
> so overall it's slower, but I'm wondering why none of the devices says "100% util", which
> should be the case on deletes? Or is this again the "mistake" of the utilization calculation
> that writes do not really show up there?

You're probably CPU bound, not IO bound.

> I know I should have benchmarked and tested, I just wanted to raise eyes on this as it 
> could be possible there's something to optimize.
> 
> Another strange thing: After the 8 "rm -r" finished, there were some subdirs left over 
> that hadn't been deleted - running one "rm -r" cleaned them out then. Could that be
> a problem with "delaylog"?

Unlikely - files not being deleted is not a function of the way
transactions are written to disk. It's a function of whether the
operation was performed or not.

> Or can that happen when several "rm" compete in the same dirs?

Most likely.
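The race is easy to provoke (an illustration with a made-up layout): two "rm -r" runs over the same tree each see entries vanish under them and may report errors, though here, with both allowed to finish, nothing is left behind - the leftovers in the original report most likely came from an rm aborting its walk after a competitor removed entries:

```shell
# Two concurrent "rm -r" invocations racing over the same small tree.
SCRATCH=$(mktemp -d)
mkdir -p "$SCRATCH/a/b" "$SCRATCH/c/d"
: > "$SCRATCH/a/b/f1"
: > "$SCRATCH/c/d/f2"
rm -r "$SCRATCH"/* 2>/dev/null &
rm -r "$SCRATCH"/* 2>/dev/null &
wait
ls -A "$SCRATCH"        # should print nothing once both finish
```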

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  0:06 ` Dave Chinner
@ 2010-09-01  0:22   ` Michael Monnerie
  2010-09-01  3:19     ` Dave Chinner
  2010-09-01  3:01   ` Stan Hoeppner
  1 sibling, 1 reply; 17+ messages in thread
From: Michael Monnerie @ 2010-09-01  0:22 UTC (permalink / raw)
  To: xfs



On Wednesday, 1 September 2010, Dave Chinner wrote:
> You're probably getting RMW cycles on inode writeback. I've been
> noticing this lately with my benchmarking - the VM is being _very
> aggressive_ reclaiming page cache pages vs inode caches and as a
> result the inode buffers used for IO are being reclaimed between the
> time it takes to create the inodes and when they are written back.
> Hence you get lots of reads occurring during inode writeback.
> 
> By issuing a sync, you clear out all the inode writeback and all the
> RMW cycles go away. As a result, there is more disk throughput
> available for the unlink processes.  There is a good chance this is
> the case as the number of reads after the sync drop by an order of
> magnitude...

Nice explanation.
 
> > Now it can be that the sync just causes more writes and stalls
> > reads so overall it's slower, but I'm wondering why none of the
> > devices says "100% util", which should be the case on deletes? Or
> > is this again the "mistake" of the utilization calculation that
> > writes do not really show up there?
> 
> You're probably CPU bound, not IO bound.

This is a hexa-core AMD Phenom(tm) II X6 1090T processor with up to 
3.2GHz per core, so that shouldn't be the bottleneck - or is only one 
core used? I think I read somewhere that each AG should get a core or so...
 
Thanks for your explanation.




* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  0:06 ` Dave Chinner
  2010-09-01  0:22   ` Michael Monnerie
@ 2010-09-01  3:01   ` Stan Hoeppner
  2010-09-01  3:41     ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2010-09-01  3:01 UTC (permalink / raw)
  To: xfs

Dave Chinner put forth on 8/31/2010 7:06 PM:

> You're probably CPU bound, not IO bound.

7200 rpm is the highest spindle speed for 2TB drives--5400 is most
common.  None of them are going to do much over 200 random seeks/second,
if that.  That's 400 tops for two drives.

Using any modern Intel/AMD ~2 GHz CPU, you think he's CPU bound?
Apparently this "rm -rf" type operation is much more complex than I
previously believed.

-- 
Stan



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  0:22   ` Michael Monnerie
@ 2010-09-01  3:19     ` Dave Chinner
  2010-09-01  4:42       ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2010-09-01  3:19 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

On Wed, Sep 01, 2010 at 02:22:31AM +0200, Michael Monnerie wrote:
> On Wednesday, 1 September 2010, Dave Chinner wrote:
> > You're probably getting RMW cycles on inode writeback. I've been
> > noticing this lately with my benchmarking - the VM is being _very
> > aggressive_ reclaiming page cache pages vs inode caches and as a
> > result the inode buffers used for IO are being reclaimed between the
> > time it takes to create the inodes and when they are written back.
> > Hence you get lots of reads occurring during inode writeback.
> > 
> > By issuing a sync, you clear out all the inode writeback and all the
> > RMW cycles go away. As a result, there is more disk throughput
> > available for the unlink processes.  There is a good chance this is
> > the case as the number of reads after the sync drop by an order of
> > magnitude...
> 
> Nice explanation.
>  
> > > Now it can be that the sync just causes more writes and stalls
> > > reads so overall it's slower, but I'm wondering why none of the
> > > devices says "100% util", which should be the case on deletes? Or
> > > is this again the "mistake" of the utilization calculation that
> > > writes do not really show up there?
> > 
> > You're probably CPU bound, not IO bound.
> 
> This is a hexa-core AMD Phenom(tm) II X6 1090T Processor with up to 
> 3.2GHz per core, so that shouldn't be

I'm seeing an 8-core/16-thread server become CPU bound with multithreaded
unlink workloads using delaylog, so it's entirely possible that all
CPU cores are fully utilised on your machine.

> - or is there only one core used? 
> I think I read somewhere that each AG should get a core or so...

If all the files are in one AG, then it will serialise on the AGI
header and won't use much more than one CPU.
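If that is what's happening, one mitigation (assuming the usual XFS behaviour of rotoring new top-level directories across AGs; sketched here with plain mkdir, so on a non-XFS scratch directory this only shows the layout, not the AG spread) is to split the files over several directories before deleting them in parallel:

```shell
# Spread files over several top-level directories so parallel unlinks
# can work in separate AGs instead of serialising on one AGI header.
SCRATCH=$(mktemp -d)
NDIRS=4     # ideally >= number of rm processes and <= agcount
for d in $(seq 0 $((NDIRS - 1))); do
    mkdir "$SCRATCH/dir$d"
    : > "$SCRATCH/dir$d/file"
done
ls "$SCRATCH"
```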

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  3:01   ` Stan Hoeppner
@ 2010-09-01  3:41     ` Dave Chinner
  2010-09-01  7:45       ` Michael Monnerie
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2010-09-01  3:41 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Tue, Aug 31, 2010 at 10:01:47PM -0500, Stan Hoeppner wrote:
> Dave Chinner put forth on 8/31/2010 7:06 PM:
> 
> > You're probably CPU bound, not IO bound.
> 
> 7200 rpm is the highest spindle speed for 2TB drives--5400 is most
> common.  None of them are going to do much over 200 random seeks/second,
> if that.  That's 400 tops for two drives.
> 
> Using any modern Intel/AMD ~2 GHz CPU, you think he's CPU bound?

Absolutely.

> Apparently this "rm -rf" type operation is much more complex than I
> previously believed.

Nothing in XFS is simple. ;)

Unlinks that free the inode clusters result in no inode writeback
load, so the majority of the IO is log traffic. Hence they are
either log IO bound or read latency bound.  A pair of 2TB SATA
drives will be good for at least 150MB/s of log throughput, but
the numbers here are nowhere near that.

Without delayed logging, 150MB/s is enough for a single threaded
unlink to consume an entire CPU core on any modern CPU, and there
may be enough bandwidth for two threads to max out 2 CPUs. With
delaylog, log throughput is reduced by an order of magnitude, so it
should be good for at least 10x that number of CPU cores running
flat out, unless they are latency bound reading the directories
and inodes into memory.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  3:19     ` Dave Chinner
@ 2010-09-01  4:42       ` Stan Hoeppner
  2010-09-01  6:44         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2010-09-01  4:42 UTC (permalink / raw)
  To: xfs

Dave Chinner put forth on 8/31/2010 10:19 PM:
> On Wed, Sep 01, 2010 at 02:22:31AM +0200, Michael Monnerie wrote:
>>
>> This is a hexa-core AMD Phenom(tm) II X6 1090T Processor with up to 
>> 3.2GHz per core, so that shouldn't be
> 
> I'm getting a 8core/16thread server being CPU bound with multithreaded
> unlink workloads using delaylog, so it's entirely possible that all
> CPU cores are fully utilised on your machine.

What's your disk configuration on this 8 core machine?

Are you implying/stating that the performance of the disk subsystem is
irrelevant WRT multithreaded unlink workloads with delaylog enabled?

If so, this CPU hit you describe is specific to this workload scenario
only, not necessarily all your XFS test workloads, correct?

-- 
Stan



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  4:42       ` Stan Hoeppner
@ 2010-09-01  6:44         ` Dave Chinner
  2010-09-02  5:37           ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2010-09-01  6:44 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Tue, Aug 31, 2010 at 11:42:07PM -0500, Stan Hoeppner wrote:
> Dave Chinner put forth on 8/31/2010 10:19 PM:
> > On Wed, Sep 01, 2010 at 02:22:31AM +0200, Michael Monnerie wrote:
> >>
> >> This is a hexa-core AMD Phenom(tm) II X6 1090T Processor with up to 
> >> 3.2GHz per core, so that shouldn't be
> > 
> > I'm getting a 8core/16thread server being CPU bound with multithreaded
> > unlink workloads using delaylog, so it's entirely possible that all
> > CPU cores are fully utilised on your machine.
> 
> What's your disk configuration on this 8 core machine?

Depends on where I place the disk image for the VMs I run on it ;)

For example, running fs_mark with 4 threads to create then delete
200k files in a directory per thread in a 4p VM w/ 2GB RAM with the
disk image on a hw-RAID1 device made up of 2x500GB SATA drives (create
and remove 800k files):

$ sudo mkfs.xfs -f -l size=128m -d agcount=16 /dev/vdb
meta-data=/dev/vdb               isize=256    agcount=16, agsize=163840 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ ./fs_mark -S0 -k -n 200000 -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2

#  ./fs_mark  -S0  -k  -n  200000  -s  0  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/3  -d  /mnt/scratch/2
#       Version 3.3, 4 thread(s) starting at Wed Sep  1 16:08:20 2010
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  no subdirectories used
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     2       800000            0      54517.1          6465501
$

The same test run on an 8p VM w/ 16GB RAM, with the disk image hosted
on a 12x2TB SAS dm RAID-0 array:

FSUse%        Count         Size    Files/sec     App Overhead
     2       800000            0      51409.5          6186336

It was a bit slower despite having a disk subsystem with 10x the
bandwidth and 20-30x the iops capability...

> Are you implying/stating that the performance of the disk subsystem is
> irrelevant WRT multithreaded unlink workloads with delaylog enabled?

Not entirely irrelevant, just mostly. ;) For workloads that have all
the data cached in memory, anyway (i.e. not read latency bound).

> If so, this CPU hit you describe is specific to this workload scenario
> only, not necessarily all your XFS test workloads, correct?

It's not a CPU hit - the CPU is gainfully employed doing more work.
e.g. The same test as above without delayed logging on the 4p VM:

FSUse%        Count         Size    Files/sec     App Overhead
     2       800000            0      15118.3          7524424

delayed logging is 3.6x faster on the same filesystem. It went from
15k files/s at ~120% CPU utilisation, to 54k files/s at 400% CPU
utilisation. IOWs, it is _clearly_ CPU bound with delayed logging as
there is no idle CPU left in the VM at all.

When trying to improve filesystem performance, there are two goals
we are trying to achieve, depending on the limiting factor:

	1. If the workload is IO bound, we want to improve the IO
	patterns enough that performance becomes CPU bound.

	2. If the workload is CPU bound, we want to reduce the
	per-operation CPU overhead to the point where the workload
	becomes IO bound.

Delayed logging has achieved #1 for metadata operations. To get
further improvements, we now need to start optimising based on
#2....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  3:41     ` Dave Chinner
@ 2010-09-01  7:45       ` Michael Monnerie
  2010-09-02  1:17         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Monnerie @ 2010-09-01  7:45 UTC (permalink / raw)
  To: xfs



On Wednesday, 1 September 2010, Dave Chinner wrote:
> Without delayed logging, 150MB/s is enough for a single threaded
> unlink to consume an entire CPU core on any modern CPU
 
Just like Stan, I'm puzzled by this. Why is it such hard work for the 
CPU - what does it actually do? Is it really calculating something, or 
does it have to do with lock contention, cold caches, cache line 
bouncing and other "horrible" things that keep the CPU from reaching 
its maximum power? I'm really curious to understand that.

Maybe there should be an extra SSE4 assembler instruction "rm on XFS" so 
we can delete files faster? ;-)




* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  7:45       ` Michael Monnerie
@ 2010-09-02  1:17         ` Dave Chinner
  2010-09-02  2:15           ` Michael Monnerie
  2010-09-02  7:51           ` Stan Hoeppner
  0 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2010-09-02  1:17 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: xfs

On Wed, Sep 01, 2010 at 09:45:58AM +0200, Michael Monnerie wrote:
> On Wednesday, 1 September 2010, Dave Chinner wrote:
> > Without delayed logging, 150MB/s is enough for a single threaded
> > unlink to consume an entire CPU core on any modern CPU
>  
> Just as Stan I'm puzzled by this. Why is it such a hard work for the 
> CPU, what does it do? Is it really about calculating something, or has 
> it to do with lock contention, cold caches, cache line bouncing and 
> other "horrible" things so the CPU can't get it's maximum power? I'm 
> really curious to understand that.

Ok, it seems that people don't have any real idea of the complexity
of directory operations in XFS, so I'll give you a quick overview.

The XFS directory structure is exceedingly complex, and the algorithm
is designed to trade more CPU time for fewer disk IOs per operation,
providing deterministic, predictable scalability. The result is that
it consumes more CPU per operation than ext3/4, but the algorithms
scale far better than ext3/4.

Here's what an unlink must do on XFS:

	-> determine the directory format:
		-> short form (a handful of entries, not interesting)
		-> leaf form (up to a few thousand entries)
		-> node/leaf form (up to a few tens of thousand entries)
		-> btree form
	-> hash the name
	-> walk the directory hash btree index to find the
	   leaf the entry exists in
		-> the btree code has lots of interesting readahead
		   heuristics to minimise the impact of seek latency on
		   tree walks and modifications
	-> all blocks are read on demand from disk/cache
	-> remove entry from the leaf
	-> update the freespace entry in the leaf
	-> update the directory hash index
	-> update the directory freespace index
	-> update the by-offset directory index
	-> track all modified regions of all blocks
	   in transaction structure

	If any of these result in an index block merge or
	a leaf being freed,then for every block being freed:

	-> punch hole in directory extent map
	-> free block
		-> add free extent back to AG freespace trees
		  (both the by-size indexed tree, and the by-offset
		  indexed tree). For each tree:
			-> adjust freelist to have enough free
			   blocks for any allocation required
			-> lookup tree to position cursor for insert
			-> determine if record merge is required
			-> insert/modify record
				-> may require a split (allocation)
			-> update index
				-> may require multiple splits as we
				   walk back up the tree
			-> track all modified regions of all blocks
		-> mark block busy

	-> commit transaction.

At < 100k entries in a directory, XFS consumes roughly 2-3x more CPU
per operation than ext3/4. However, at somewhere between 100k-200k
entries, ext4's directory hashing results in directory fragmentation
bad enough that the IO patterns become completely random and
performance becomes seek bound. I can get ext4 performance to
continue to scale to 1M entries only because my disk backend can do
10k IOPS!

In contrast, XFS CPU consumption increases per-operation in a
predictable fashion - O(log n), where n is the number of directory
entries. E.g. for 4k directory blocks it increases by ~2x from 100k
entries to 1M entries, and another 2x from 1M entries to 10M entries,
and so on, but the result is that the IO patterns are rarely bad
enough to cause operations to become seek bound.
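That growth can be eyeballed with a tiny (and very rough) micro-benchmark; entry counts are kept small here so it runs anywhere, and wall-clock times on a scratch directory only hint at the trend - scale N up on a real XFS filesystem to see it properly:

```shell
# Create and remove directories of increasing entry counts, timing the
# delete (millisecond resolution via GNU date's %N; a guess that it is
# available on the test box).
SCRATCH=$(mktemp -d)
for N in 100 1000; do
    D="$SCRATCH/n$N"
    mkdir "$D"
    i=0
    while [ "$i" -lt "$N" ]; do : > "$D/f$i"; i=$((i + 1)); done
    start=$(date +%s%N)
    rm -r "$D"
    end=$(date +%s%N)
    echo "$N entries: $(( (end - start) / 1000000 )) ms"
done
```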

> Maybe there should be an extra SSE4 assembler instruction "rm on XFS" so 
> we can delete files faster? ;-)

You'd need an entire ASIC, not an instruction ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-02  1:17         ` Dave Chinner
@ 2010-09-02  2:15           ` Michael Monnerie
  2010-09-02  7:51           ` Stan Hoeppner
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Monnerie @ 2010-09-02  2:15 UTC (permalink / raw)
  To: xfs



On Thursday, 2 September 2010, Dave Chinner wrote:
> 	-> free block

Is this where the "trim" that SSDs need belongs?

> In contrast, XFS CPU consumption increases per-operation in a
> predictable fashion - O(log n) where n is the number of directory
> entries. e.g for 4k directory blocks it increases by ~2x from 100k
> entries to 1m entries, and another 2x from 1M entries to 10M entries
> and so on, but the result is that the IO patterns are rarely enough
> to cause operations to become seek bound.

Now I understand, thanks again for that great explanation.

> > Maybe there should be an extra SSE4 assembler instruction "rm on
> > XFS" so we can delete files faster? ;-)
> 
> You'd need an entire ASIC, not an instruction ;)

Time to invent the "XFS rm co-processor". It should be multi-core so 
it scales better. Maybe someone will write a graphics card plugin for 
XFS? Then we'd see an increase in servers with fast graphics cards 
"because we need to delete files quickly". And when no users are 
deleting files, the admins can play Doom on the servers. ;-)




* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-01  6:44         ` Dave Chinner
@ 2010-09-02  5:37           ` Stan Hoeppner
  2010-09-02  7:01             ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2010-09-02  5:37 UTC (permalink / raw)
  To: xfs

Dave Chinner put forth on 9/1/2010 1:44 AM:

> 4p VM w/ 2GB RAM with the
> disk image on a hw-RAID1 device made up of 2x500GB SATA drives (create
> and remove 800k files):

> FSUse%        Count         Size    Files/sec     App Overhead
>      2       800000            0      54517.1          6465501
> $
> 
> The same test run on a 8p VM w/ 16Gb RAM, with the disk image hosted
> on a 12x2TB SAS dm RAID-0 array:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      2       800000            0      51409.5          6186336

Is this a single socket quad core Intel machine with hyperthreading
enabled?  That would fully explain the results above.  Looks like you
ran out of memory bandwidth in the 4 "processor" case.  Adding phantom
CPUs merely made them churn without additional results.

> It was a bit slower despite having a disk subsystem with 10x the
> bandwidth and 20-30x the iops capability...
> 
>> Are you implying/stating that the performance of the disk subsystem is
>> irrelevant WRT multithreaded unlink workloads with delaylog enabled?
> 
> Not entirely irrelevant, just mostly. ;) For workloads that have all
> the data cached in memory, anyway (i.e. not read latency bound).
> 
>> If so, this CPU hit you describe is specific to this workload scenario
>> only, not necessarily all your XFS test workloads, correct?
> 
> It's not a CPU hit - the CPU is gainfully employed doing more work.
> e.g. The same test as above without delayed logging on the 4p VM:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      2       800000            0      15118.3          7524424
> 
> delayed logging is 3.6x faster on the same filesystem. It went from
> 15k files/s at ~120% CPU utilisation, to 54k files/s at 400% CPU
> utilisation. IOWs, it is _clearly_ CPU bound with delayed logging as
> there is no idle CPU left in the VM at all.

Without seeing all of what you have available, going on strictly the
data above, I disagree.  I'd say your bottleneck is your memory/IPC
bandwidth.

> When trying to improve filesystem performance, there are two goals
> we are trying to achieve, depending on the limiting factor:
> 
> 	1. If the workload is IO bound, we want to improve the IO
> 	patterns enough that performance becomes CPU bound.
> 
> 	2. If the workload is CPU bound, we want to reduce the
> 	per-operation CPU overhead to the point where the workload
> 	becomes IO bound.
> 
> Delayed logging has achieved #1 for metadata operations. To get
> further improvements, we now need to start optimising based on
> #2....

If my guess about your platform is correct, try testing on a dual socket
quad core Opteron with quad memory channels.  Test with 2, 4, 6, and 8
fs_mark threads.  I'm guessing at some point between 4 and 8 threads
you'll run out of memory bandwidth, and from then on you won't see the
additional CPU burn that you are with Intel hyperthreading.

Also, I've not looked at the code, but is there possibly a delayed
logging global data structure stored in a shared memory location that
each thread accesses frequently?  If so, that might appear as memory B/W
starvation, and make each processor appear busy because they're all
waiting on access to that shared object.  Just a guess from non-dev end
user with a lot of hardware knowledge and not enough coding skillz. ;)
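The thread-count sweep suggested above could look like this (the fs_mark invocation mirrors the one posted earlier in the thread; the mount point is an assumption, and the commands are only echoed here rather than run):

```shell
# Sweep fs_mark over 2, 4, 6 and 8 threads to look for the point where
# memory bandwidth, rather than CPU count, limits Files/sec.
MNT=/mnt/scratch            # assumed XFS mount with delaylog enabled
for T in 2 4 6 8; do
    DIRS=""
    for i in $(seq 1 "$T"); do DIRS="$DIRS -d $MNT/$i"; done
    # print the command; drop the echo to actually run the benchmark
    echo ./fs_mark -S0 -k -n 200000 -s 0 $DIRS
done
```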

-- 
Stan




* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-02  5:37           ` Stan Hoeppner
@ 2010-09-02  7:01             ` Dave Chinner
  2010-09-02  8:41               ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2010-09-02  7:01 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Thu, Sep 02, 2010 at 12:37:39AM -0500, Stan Hoeppner wrote:
> Dave Chinner put forth on 9/1/2010 1:44 AM:
> 
> > 4p VM w/ 2GB RAM with the
> > disk image on a hw-RAID1 device make up of 2x500Gb SATA drives (create
> > and remove 800k files):
> 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      2       800000            0      54517.1          6465501
> > $
> > 
> > The same test run on a 8p VM w/ 16Gb RAM, with the disk image hosted
> > on a 12x2TB SAS dm RAID-0 array:
> > 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      2       800000            0      51409.5          6186336
> 
> Is this a single socket quad core Intel machine with hyperthreading
> enabled? 

No, it's a dual socket (8c/16t) server.

> That would fully explain the results above.  Looks like you
> ran out of memory bandwidth in the 4 "processor" case.  Adding phantom
> CPUs merely made them churn without additional results.

No, that's definitely not the case. A different kernel in the 
same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options "logbsize=262144"

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0      39554.2          7590355

4 threads with mount options "logbsize=262144,delaylog"

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0      67269.7          5697246

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.36-rc3-4-thread-delaylog-comparison.png

The top chart is CPU usage, the second chart is disk iops (purple is
write), the third chart is disk bandwidth (purple is write), and the
bottom chart is create rate (yellow) and unlink rate (green).

From left to right, the first IO peak (~1000 iops, 250MB/s) is the
mkfs.xfs. The next sustained load is the first fs_mark workload
without delayed logging - 2500 iops and 500MB/s - and the second is
the same workload again with delayed logging enabled (zero IO,
roughly 400% CPU utilisation and significantly higher create/unlink
rates).

I'll let you decide for yourself which of the two IO patterns is
sustainable on a single SATA disk. ;)

> > FSUse%        Count         Size    Files/sec     App Overhead
> >      2       800000            0      15118.3          7524424
> > 
> > delayed logging is 3.6x faster on the same filesystem. It went from
> > 15k files/s at ~120% CPU utilisation, to 54k files/s at 400% CPU
> > utilisation. IOWs, it is _clearly_ CPU bound with delayed logging as
> > there is no idle CPU left in the VM at all.
> 
> Without seeing all of what you have available, going on strictly the
> data above, I disagree. I'd say your bottleneck is your memory/IPC
> bandwidth.

You are free to choose to believe I don't know what I'm doing - if you
can get XFS to perform better, then I'll happily take the patches ;)

> If my guess about your platform is correct, try testing on a dual socket
> quad core Opteron with quad memory channels.  Test with 2, 4, 6, and 8
> fs_mark threads.

Did that a long time ago - it's in the archives a few months back.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-02  1:17         ` Dave Chinner
  2010-09-02  2:15           ` Michael Monnerie
@ 2010-09-02  7:51           ` Stan Hoeppner
  1 sibling, 0 replies; 17+ messages in thread
From: Stan Hoeppner @ 2010-09-02  7:51 UTC (permalink / raw)
  To: xfs

Dave Chinner put forth on 9/1/2010 8:17 PM:

> You'd need an entire ASIC, not an instruction ;)

More like an FPGA.  As we see on list, daily, the XFS code changes far
too rapidly for implementation in an ASIC.  ;)

Hey, there's a sales opportunity for SGI:  XFS on a Virtex 7 FPGA on a
PCIe accelerator board. :)  Oh, wait, what about RASC?  Just put it on
there. :P

-- 
Stan



* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-02  7:01             ` Dave Chinner
@ 2010-09-02  8:41               ` Stan Hoeppner
  2010-09-02 11:29                 ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2010-09-02  8:41 UTC (permalink / raw)
  To: xfs

Dave Chinner put forth on 9/2/2010 2:01 AM:

> No, that's definitely not the case. A different kernel in the 
> same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options "logbsize=262144"
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0       800000            0      39554.2          7590355
> 
> 4 threads with mount options "logbsize=262144,delaylog"
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0       800000            0      67269.7          5697246

What happens when you bump each of these to 8 threads, 1 per core?  If
the test consumes all cpus/cores, what instrumentation are you viewing
that tells you the cpu utilization _isn't_ due to memory b/w starvation?

A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction issue
rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per core.  An
8 core machine would therefore have an instruction issue rate of 8 * 16
GB/s = 128 GB/s.  A modern dual socket system is going to top out at
24-48 GB/s, well short of the instruction issue rate.  Now, this doesn't
even take the b/w of data load/store operations into account, but I'm
guessing the data size per directory operation is smaller than the total
instruction sequence, which operates on the same variable(s).
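As a quick sanity check, that back-of-envelope arithmetic works out like this (only the figures already quoted above are used - the 2 GHz clock, 8 bytes/cycle issue width, 8 cores, and the 24-48 GB/s dual-socket memory range; nothing here is measured):

```python
# Back-of-envelope check of the instruction-issue vs memory-bandwidth
# argument above, using the figures quoted in the text.
GHZ = 2.0           # core clock in cycles/s (x1e9)
ISSUE_BYTES = 8     # L1 instruction issue width, bytes/cycle
CORES = 8

per_core_gbs = ISSUE_BYTES * GHZ         # instruction issue rate per core, GB/s
total_issue_gbs = per_core_gbs * CORES   # aggregate issue rate, GB/s
mem_bw_gbs = (24, 48)                    # typical dual-socket range quoted above

print(per_core_gbs)      # 16.0 GB/s per core
print(total_issue_gbs)   # 128.0 GB/s aggregate
# Even the high end of the memory range feeds only 48/128 = 37.5% of the
# peak issue rate, hence the conclusion that hot loops must run from cache.
print(mem_bw_gbs[1] / total_issue_gbs)
```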

So, if the CPUs are pegging, and we're not running out of memory b/w,
then this would lead me to believe that the hot kernel code, core
fs_mark code and the filesystem data are fully, or near fully, contained
in level 2 and 3 CPU caches.  Is this correct, more or less?

> You are free to choose to believe I don't know I'm doing - if you
> can get XFS to perform better, then I'll happily take the patches ;)

Not at all.  I have near total faith in you, Dave.  I just like to play
Monday morning quarterback now and then.  It lets me show that my
knuckles drag the ground, and gives you an opportunity to educate me and
others, so we can one day walk upright when discussing XFS. ;)

> Did that a long time ago - it's in the archives a few months back.

I'll have to dig around.  I've never even looked for the archives for
this list.  It's hopefully mirrored in the usual places.

Out of curiosity, have you ever run into memory b/w starvation before
peaking all CPUs while running this test?  I could see that maybe
occurring with dual 1GHz+ P3 class systems with their smallish caches
and lowly single channel PC100, back before the switch to DDR memory,
but those machines were probably gone before XFS was open sourced, IIRC,
so you may not have had the pleasure (if you could call it that).

-- 
Stan


* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-02  8:41               ` Stan Hoeppner
@ 2010-09-02 11:29                 ` Dave Chinner
  2010-09-02 14:57                   ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2010-09-02 11:29 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

On Thu, Sep 02, 2010 at 03:41:59AM -0500, Stan Hoeppner wrote:
> Dave Chinner put forth on 9/2/2010 2:01 AM:
> 
> > No, that's definitely not the case. A different kernel in the 
> > same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options "logbsize=262144"
> > 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0       800000            0      39554.2          7590355
> > 
> > 4 threads with mount options "logbsize=262144,delaylog"
> > 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0       800000            0      67269.7          5697246
> 
> What happens when you bump each of these to 8 threads, 1 per core?  If

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     127979.3         13156823

So, 1 thread does 19k files/s, 2 threads do 37k files/s, 4 get
67k, and 8 get 128k. I'd say that's almost linear scaling, and CPU
bound at each load point ;)
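The near-linear claim is easy to check from the numbers quoted in this thread (the 1- and 2-thread figures are the rounded ones mentioned above; the 4- and 8-thread figures are the exact fs_mark results):

```python
# Scaling efficiency of the fs_mark results quoted in this thread:
# files/s at 1, 2, 4 and 8 threads with delayed logging enabled.
rates = {1: 19000, 2: 37000, 4: 67269.7, 8: 127979.3}

base = rates[1]
for threads, rate in sorted(rates.items()):
    efficiency = rate / (threads * base)   # 1.0 == perfectly linear
    print(f"{threads} threads: {rate:9.1f} files/s, efficiency {efficiency:.2f}")
# Efficiency stays above 0.8 all the way to 8 threads, i.e. close to
# linear - consistent with a CPU-bound rather than bandwidth-bound load.
```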

> the test consumes all cpus/cores, what instrumentation are you viewing
> that tells you the cpu utilization _isn't_ due to memory b/w starvation?

1) profiling tools like 'perf top' or oprofile, using hardware counters
to profile on CPU cycles, L1/L2 cache misses, etc.

2) the delayed logging code uses significantly more memory bandwidth
than the original code because it copies changed information twice
(instead of once) before it is written to disk. Given that single
threaded performance of delayed logging is identical to the original
code and scalability from 1 to 8 cores is almost linear, it cannot
be memory bandwidth bound....

The code might be memory _latency_ bound (i.e. on cache misses), but
it is certainly not stressing pure memory bandwidth.

> A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction issue
> rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per core.  An
> 8 core machine would therefore have an instruction issue rate of 8 * 16
> GB/s = 128 GB/s.  A modern dual socket system is going to top out at
> 24-48 GB/s, well short of the instruction issue rate.  Now, this doesn't
> even take the b/w of data load/store operations into account, but I'm
> guessing the data size per directory operation is smaller than the total
> instruction sequence, which operates on the same variable(s).
> 
> So, if the CPUs are pegging, and we're not running out of memory b/w,
> then this would lead me to believe that the hot kernel code, core
> fs_mark code and the filesystem data are fully, or near fully, contained
> in level 2 and 3 CPU caches.  Is this correct, more or less?

Probably.

However (and it is a big however!), I generally don't care to
analyse performance at this level because it's getting into
micro-optimisation territory. Sure, it will get you a few percent
here and there, but then you lose focus on improving the algorithms.
An algorithmic change can provide an order of magnitude improvement,
not a few percent. The delayed logging code is a clear example of
that.

Another example - perf top shows this on the above 8p load on
a plain 2.6.36-rc3 kernel (and it gets about 40k files/s):

           426043.00 27.4% _xfs_buf_find
            87491.00  5.6% __ticket_spin_lock
            67204.00  4.3% xfs_dir2_node_addname
            60434.00  3.9% dso__find_symbol
            48407.00  3.1% kmem_cache_alloc
            37006.00  2.4% __d_lookup
            31625.00  2.0% xfs_trans_buf_item_match
            20036.00  1.3% xfs_log_commit_cil
            18728.00  1.2% _raw_spin_unlock_irqrestore
            18428.00  1.2% __memset
            18001.00  1.2% __memcpy
            17781.00  1.1% xfs_da_do_buf
            17732.00  1.1% xfs_iflush_cluster
            16831.00  1.1% kmem_cache_free
            14836.00  1.0% __kmalloc

It is clear that buffer lookup is consuming the most CPU of any
operation. Why? Because the buffer hash table is too small. I've
already posted patches for a short term solution (increasing the size
of the hash table), and the above 127k files/s result is using that
patch. Hence it is clear that the micro-optimisation works, but at
the cost of a 16x increase in memory usage for the hash table. And
that still isn't really large enough, because now the load is
already pushing the limits of the enlarged hash table.

As Christoph has already suggested, the correct way to fix the
problem is to change the caching algorithm to something that is
self-scaling (e.g. an rb-tree or a btree). That will keep memory
usage low on small filesystems, yet scale efficiently to large
numbers of buffers, something a hash cannot easily do.

IOWs, an algorithmic change will solve the problem far better for
more situations than the micro-optimisation of tweaking the hash
sizes. Reduced to the simplest argument, scalability is all
about choosing the right algorithm so you don't have to care about
minute details to obtain the performance you require.
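The hash-vs-tree argument can be illustrated with a toy model - this is not XFS code, and the 1024-bucket table size is an arbitrary stand-in; it only shows the scaling behaviour: a fixed-size hash degrades linearly with item count while a balanced tree degrades logarithmically:

```python
import math

# Toy model of per-lookup cost in a buffer cache, illustrating why a
# fixed-size hash table stops scaling while a balanced tree does not.
def hash_chain_length(items, buckets):
    # Average chain walked per lookup in a fixed-size hash table.
    return items / buckets

def tree_depth(items):
    # Red-black tree height bound: at most 2*log2(n+1) comparisons.
    return 2 * math.log2(items + 1)

BUCKETS = 1024   # hypothetical fixed table size
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10} items: hash chain ~{hash_chain_length(n, BUCKETS):8.1f}, "
          f"rb-tree depth <= {tree_depth(n):5.1f}")
# At 10M cached buffers the hash walks ~9766 entries per lookup while the
# tree needs at most ~47 comparisons - the "self-scaling" property above.
```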

> I'll have to dig around.  I've never even looked for the archives for
> this list.  It's hopefully mirrored in the usual places.
> 
> Out of curiosity, have you ever run into memory b/w starvation before
> peaking all CPUs while running this test? 

No. The last time I ran out of bandwidth on an IO workload was doing
6-7GB/s of buffered writes to disk on a 24p ia64 Altix. The disk
subsystem could handle 11GB/s, we got 10GB/s with direct IO, but
buffered IO was limited by the cross-sectional memory bandwidth of
the machine (25GB/s) because of the extra copy buffered IO
requires....

> I could see that maybe
> occurring with dual 1GHz+ P3 class systems with their smallish caches
> and lowly single channel PC100, back before the switch to DDR memory,
> but those machines were probably gone before XFS was open sourced, IIRC,
> so you may not have had the pleasure (if you could call it that).

The ratio between CPU cycles and memory bandwidth really hasn't
changed much since then. The CPUs weren't powerful enough then,
either, to run enough metadata ops to get near memory bandwidth
limits...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: deleting 2TB lots of files with delaylog: sync helps?
  2010-09-02 11:29                 ` Dave Chinner
@ 2010-09-02 14:57                   ` Stan Hoeppner
  0 siblings, 0 replies; 17+ messages in thread
From: Stan Hoeppner @ 2010-09-02 14:57 UTC (permalink / raw)
  To: xfs

Thanks Dave.

I don't normally top post, but I just wanted to quickly say I _really_
enjoyed reading your reply.  It was seriously educational, and I
especially enjoyed your note about the 24p Altix system.  I've been a
fan of SGI NUMA machines since the Origin 2k, due to the uniqueness of
their scalable interconnect, though I've never been an SGI user. :(

Keep up the great work, and keep us educated, as you've done so very
well here. :)

-- 
Stan


Thread overview: 17+ messages
2010-08-31 23:30 deleting 2TB lots of files with delaylog: sync helps? Michael Monnerie
2010-09-01  0:06 ` Dave Chinner
2010-09-01  0:22   ` Michael Monnerie
2010-09-01  3:19     ` Dave Chinner
2010-09-01  4:42       ` Stan Hoeppner
2010-09-01  6:44         ` Dave Chinner
2010-09-02  5:37           ` Stan Hoeppner
2010-09-02  7:01             ` Dave Chinner
2010-09-02  8:41               ` Stan Hoeppner
2010-09-02 11:29                 ` Dave Chinner
2010-09-02 14:57                   ` Stan Hoeppner
2010-09-01  3:01   ` Stan Hoeppner
2010-09-01  3:41     ` Dave Chinner
2010-09-01  7:45       ` Michael Monnerie
2010-09-02  1:17         ` Dave Chinner
2010-09-02  2:15           ` Michael Monnerie
2010-09-02  7:51           ` Stan Hoeppner
