* Performance problem - reads slower than writes
@ 2012-01-30 22:00 Brian Candler
  2012-01-31  2:05 ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Brian Candler @ 2012-01-30 22:00 UTC (permalink / raw)
  To: xfs

I am doing some performance testing of XFS. I am using Ubuntu 11.10 amd64
(server), on an i3-2130 (3.4GHz) with 8GB RAM.

This will eventually run with a bunch of Hitachi 3TB Deskstar drives, but
the performance issue can be shown with just one.

Writing and reading large files using dd is fine. Performance is close to
what I get if I dd to the drive itself (which is 125MB/sec near the start of
the disk, down to 60MB/sec near the end of the disk, both reading and
writing).

However I'm getting something strange when I try using bonnie++ to write and
read a bunch of individual files - in this case 100,000 files with sizes
between 500k and 800k, spread over 1000 directories.

# time bonnie++ -d /data/sdb -s 16384k -n 98:800k:500k:1000 -u root
...
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
storage1        16G  1900  93 97299   3 49909   4  4899  96 139565   5 270.7   4
Latency              5251us     222ms     394ms   10705us   94111us     347ms
Version  1.96       ------Sequential Create------ --------Random Create--------
storage1            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
98:819200:512000/1000   112   3    37   2 12659  32   106   3    39   2  8148  31
Latency              6791ms     134ms   56383us    5161ms     459ms    3664ms
1.96,1.96,storage1,1,1327926367,16G,,1900,93,97299,3,49909,4,4899,96,139565,5,270.7,4,98,819200,512000,,1000,112,3,37,2,12659,32,106,3,39,2,8148,31,5251us,222ms,394ms,10705us,94111us,347ms,6791ms,134ms,56383us,5161ms,459ms,3664ms

real	129m3.450s
user	0m6.684s
sys	3m22.421s

Writing is fine: it writes about 110 files per second, and iostat shows
about 75MB/sec of write data throughput during that phase.

However when bonnie++ gets to the reading stage it reads only ~38 files per
second, and iostat shows only about 22MB/sec of data being read from the
disk.  There are about 270 disk operations per second seen at the time, so
the drive is clearly saturated with seeks.  It seems to be doing about 7
seeks for each stat+read.

root@storage1:~# iostat 5 | grep sdb
...
sdb             270.40     22119.20         0.00     110596          0
sdb             269.60     21948.80         0.00     109744          0
sdb             257.80     20969.60         0.00     104848          0

Now of course this might be symptomatic of something that bonnie++ is doing,
but strace'ing the program just shows it doing stat, open, repeated reads,
and close:

# strace -p 14318 2>&1 | grep -v '= 8192' | head -20
Process 14318 attached - interrupt to quit
read(3, "00999\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0"..., 2214) = 2214
close(3)                                = 0
stat("00248/Jm00000061ec", {st_mode=S_IFREG|0600, st_size=706711, ...}) = 0
open("00248/Jm00000061ec", O_RDONLY)    = 3
read(3, "00999\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0"..., 2199) = 2199
close(3)                                = 0
stat("00566/000000df91", {st_mode=S_IFREG|0600, st_size=637764, ...}) = 0
open("00566/000000df91", O_RDONLY)      = 3
read(3, "00999\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0"..., 6980) = 6980
close(3)                                = 0
stat("00868/7ndVASazO4I00000156bb", {st_mode=S_IFREG|0600, st_size=813560, ...}) = 0
open("00868/7ndVASazO4I00000156bb", O_RDONLY) = 3
read(3, "00999\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0"..., 2552) = 2552
close(3)                                = 0
stat("00759/vGTNHcCtfQ0000012bb6", {st_mode=S_IFREG|0600, st_size=786576, ...}) = 0
open("00759/vGTNHcCtfQ0000012bb6", O_RDONLY) = 3
read(3, "00999\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0 \266Z\1\0\0\0\0"..., 144) = 144
close(3)                                = 0
stat("00736/0RYuGyy00000122a4", {st_mode=S_IFREG|0600, st_size=758003, ...}) = 0

I can't see anything unusual there, not even O_DIRECT on the open().

The filesystem was created like this:

# mkfs.xfs -i attr=2,maxpct=1 /dev/sdb

Does anyone have any suggestions as to why there is so much seek activity
going on, or anything I can do to trace this further?

Thanks,

Brian Candler.

P.S. When dd'ing large files onto XFS I found that bs=8k gave a lower
performance than bs=16k or larger.  So I wanted to rerun bonnie++ with
larger chunk sizes.  Unfortunately that causes it to crash (and fairly
consistently) - see below.

Is the 8k block size likely to be the performance culprit here?

# time bonnie++ -d /data/sdb -s 16384k:32k -n 98:800k:500k:1000:32k -u root
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...
done
start 'em...done...done...done...done...done...
*** glibc detected *** bonnie++: double free or corruption (out): 0x00000000024430a0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x78a96)[0x7f42a0317a96]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7f42a031bd7c]
bonnie++[0x404dd7]
bonnie++[0x402e90]
bonnie++[0x403bb6]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f42a02c030d]
bonnie++[0x402219]
======= Memory map: ========
00400000-0040c000 r-xp 00000000 08:01 3683058                            /usr/sbin/bonnie++
0060b000-0060c000 r--p 0000b000 08:01 3683058                            /usr/sbin/bonnie++
0060c000-0060d000 rw-p 0000c000 08:01 3683058                            /usr/sbin/bonnie++
02438000-02484000 rw-p 00000000 00:00 0                                  [heap]
7f4298000000-7f4298021000 rw-p 00000000 00:00 0 
7f4298021000-7f429c000000 ---p 00000000 00:00 0 
7f429d25e000-7f429d25f000 ---p 00000000 00:00 0 
7f429d25f000-7f429da5f000 rw-p 00000000 00:00 0 
7f429da5f000-7f429da60000 ---p 00000000 00:00 0 
7f429da60000-7f429e260000 rw-p 00000000 00:00 0 
7f429e260000-7f429e261000 ---p 00000000 00:00 0 
7f429e261000-7f429ea61000 rw-p 00000000 00:00 0 
7f429ea61000-7f429ea62000 ---p 00000000 00:00 0 
7f429ea62000-7f429f262000 rw-p 00000000 00:00 0 
7f429f262000-7f429f263000 ---p 00000000 00:00 0 
7f429f263000-7f429fa63000 rw-p 00000000 00:00 0 
7f429fa63000-7f429fa6f000 r-xp 00000000 08:01 1179679                    /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f429fa6f000-7f429fc6e000 ---p 0000c000 08:01 1179679                    /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f429fc6e000-7f429fc6f000 r--p 0000b000 08:01 1179679                    /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f429fc6f000-7f429fc70000 rw-p 0000c000 08:01 1179679                    /lib/x86_64-linux-gnu/libnss_files-2.13.so
7f429fc70000-7f429fc7a000 r-xp 00000000 08:01 1179685                    /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f429fc7a000-7f429fe7a000 ---p 0000a000 08:01 1179685                    /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f429fe7a000-7f429fe7b000 r--p 0000a000 08:01 1179685                    /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f429fe7b000-7f429fe7c000 rw-p 0000b000 08:01 1179685                    /lib/x86_64-linux-gnu/libnss_nis-2.13.so
7f429fe7c000-7f429fe93000 r-xp 00000000 08:01 1179674                    /lib/x86_64-linux-gnu/libnsl-2.13.so
7f429fe93000-7f42a0092000 ---p 00017000 08:01 1179674                    /lib/x86_64-linux-gnu/libnsl-2.13.so
7f42a0092000-7f42a0093000 r--p 00016000 08:01 1179674                    /lib/x86_64-linux-gnu/libnsl-2.13.so
7f42a0093000-7f42a0094000 rw-p 00017000 08:01 1179674                    /lib/x86_64-linux-gnu/libnsl-2.13.so
7f42a0094000-7f42a0096000 rw-p 00000000 00:00 0 
7f42a0096000-7f42a009e000 r-xp 00000000 08:01 1179667                    /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f42a009e000-7f42a029d000 ---p 00008000 08:01 1179667                    /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f42a029d000-7f42a029e000 r--p 00007000 08:01 1179667                    /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f42a029e000-7f42a029f000 rw-p 00008000 08:01 1179667                    /lib/x86_64-linux-gnu/libnss_compat-2.13.so
7f42a029f000-7f42a0434000 r-xp 00000000 08:01 1179676                    /lib/x86_64-linux-gnu/libc-2.13.so
7f42a0434000-7f42a0633000 ---p 00195000 08:01 1179676                    /lib/x86_64-linux-gnu/libc-2.13.so
7f42a0633000-7f42a0637000 r--p 00194000 08:01 1179676                    /lib/x86_64-linux-gnu/libc-2.13.so
7f42a0637000-7f42a0638000 rw-p 00198000 08:01 1179676                    /lib/x86_64-linux-gnu/libc-2.13.so
7f42a0638000-7f42a063e000 rw-p 00000000 00:00 0 
7f42a063e000-7f42a0653000 r-xp 00000000 08:01 1179692                    /lib/x86_64-linux-gnu/libgcc_s.so.1
7f42a0653000-7f42a0852000 ---p 00015000 08:01 1179692                    /lib/x86_64-linux-gnu/libgcc_s.so.1
7f42a0852000-7f42a0853000 r--p 00014000 08:01 1179692                    /lib/x86_64-linux-gnu/libgcc_s.so.1
7f42a0853000-7f42a0854000 rw-p 00015000 08:01 1179692                    /lib/x86_64-linux-gnu/libgcc_s.so.1
7f42a0854000-7f42a08d7000 r-xp 00000000 08:01 1179686                    /lib/x86_64-linux-gnu/libm-2.13.so
7f42a08d7000-7f42a0ad6000 ---p 00083000 08:01 1179686                    /lib/x86_64-linux-gnu/libm-2.13.so
7f42a0ad6000-7f42a0ad7000 r--p 00082000 08:01 1179686                    /lib/x86_64-linux-gnu/libm-2.13.so
7f42a0ad7000-7f42a0ad8000 rw-p 00083000 08:01 1179686                    /lib/x86_64-linux-gnu/libm-2.13.so
7f42a0ad8000-7f42a0bc0000 r-xp 00000000 08:01 3674840                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7f42a0bc0000-7f42a0dc0000 ---p 000e8000 08:01 3674840                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7f42a0dc0000-7f42a0dc8000 r--p 000e8000 08:01 3674840                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7f42a0dc8000-7f42a0dca000 rw-p 000f0000 08:01 3674840                    /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16
7f42a0dca000-7f42a0ddf000 rw-p 00000000 00:00 0 
7f42a0ddf000-7f42a0df7000 r-xp 00000000 08:01 1179684                    /lib/x86_64-linux-gnu/libpthread-2.13.so
7f42a0df7000-7f42a0ff6000 ---p 00018000 08:01 1179684                    /lib/x86_64-linux-gnu/libpthread-2.13.so
7f42a0ff6000-7f42a0ff7000 r--p 00017000 08:01 1179684                    /lib/x86_64-linux-gnu/libpthread-2.13.so
7f42a0ff7000-7f42a0ff8000 rw-p 00018000 08:01 1179684                    /lib/x86_64-linux-gnu/libpthread-2.13.so
7f42a0ff8000-7f42a0ffc000 rw-p 00000000 00:00 0 
7f42a0ffc000-7f42a101d000 r-xp 00000000 08:01 1179683                    /lib/x86_64-linux-gnu/ld-2.13.so
7f42a1210000-7f42a1215000 rw-p 00000000 00:00 0 
7f42a121a000-7f42a121c000 rw-p 00000000 00:00 0 
7f42a121c000-7f42a121d000 r--p 00020000 08:01 1179683                    /lib/x86_64-linux-gnu/ld-2.13.so
7f42a121d000-7f42a121f000 rw-p 00021000 08:01 1179683                    /lib/x86_64-linux-gnu/ld-2.13.so
7ffff2c84000-7ffff2ca5000 rw-p 00000000 00:00 0                          [stack]
7ffff2cc1000-7ffff2cc2000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted

real    14m38.760s
user    0m0.832s
sys     0m32.670s


* Re: Performance problem - reads slower than writes
  2012-01-30 22:00 Performance problem - reads slower than writes Brian Candler
@ 2012-01-31  2:05 ` Dave Chinner
  2012-01-31 10:31   ` Brian Candler
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2012-01-31  2:05 UTC (permalink / raw)
  To: Brian Candler; +Cc: xfs

On Mon, Jan 30, 2012 at 10:00:19PM +0000, Brian Candler wrote:
> I am doing some performance testing of XFS. I am using Ubuntu 11.10 amd64
> (server), on an i3-2130 (3.4GHz) with 8GB RAM.
> 
> This will eventually run with a bunch of Hitachi 3TB Deskstar drives, but
> the performance issue can be shown with just one.
> 
> Writing and reading large files using dd is fine. Performance is close to
> what I get if I dd to the drive itself (which is 125MB/sec near the start of
> the disk, down to 60MB/sec near the end of the disk, both reading and
> writing).
> 
> However I'm getting something strange when I try using bonnie++ to write and
> read a bunch of individual files - in this case 100,000 files with sizes
> between 500k and 800k, spread over 1000 directories.

Write order is different to read order, and read performance is
sensitive to cache hit rates and IO latency. When your working set is
larger than memory (which is definitely true here), read performance
will almost always be determined by read IO latency.

> # time bonnie++ -d /data/sdb -s 16384k -n 98:800k:500k:1000 -u root
> ...
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> storage1        16G  1900  93 97299   3 49909   4  4899  96 139565   5 270.7   4
> Latency              5251us     222ms     394ms   10705us   94111us     347ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> storage1            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> 98:819200:512000/1000   112   3    37   2 12659  32   106   3    39   2  8148  31
> Latency              6791ms     134ms   56383us    5161ms     459ms    3664ms
> 1.96,1.96,storage1,1,1327926367,16G,,1900,93,97299,3,49909,4,4899,96,139565,5,270.7,4,98,819200,512000,,1000,112,3,37,2,12659,32,106,3,39,2,8148,31,5251us,222ms,394ms,10705us,94111us,347ms,6791ms,134ms,56383us,5161ms,459ms,3664ms
> 
> real	129m3.450s
> user	0m6.684s
> sys	3m22.421s
> 
> Writing is fine: it writes about 110 files per second, and iostat shows
> about 75MB/sec of write data throughput during that phase.
>
> However when bonnie++ gets to the reading stage it reads only ~38 files per
> second, and iostat shows only about 22MB/sec of data being read from the
> disk.  There are about 270 disk operations per second seen at the time, so
> the drive is clearly saturated with seeks.  It seems to be doing about 7
> seeks for each stat+read. 

It's actually reading bits of the files, too, as your strace shows,
which is where most of the IO comes from.

So, here's my desktop, which is similar to yours except for the storage:
it has a pair of $150 SSDs in RAID-0.

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
disappointment  16G  1904  98 511102  26 182616  21  4623  99 348043  22  8367 201
Latency             10601us     283ms     250ms    5491us     156ms    8502us
Version  1.96       ------Sequential Create------ --------Random Create--------
disappointment      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
98:819200:512000/1000   698  29   378  21 17257  89   699  29   370  21 13532  86
Latency              1421ms     278ms     145ms    1520ms     242ms     159ms
1.96,1.96,disappointment,1,1327963543,16G,,1904,98,511102,26,182616,21,4623,99,348043,22,8367,201,98,819200,512000,,1000,698,29,378,21,17257,89,699,29,370,21,13532,86,10601us,283ms,250ms,5491us,156ms,8502us,1421ms,278ms,145ms,1520ms,242ms,159ms

real    16m55.664s
user    0m6.708s
sys     4m8.468s

So, sequential write is 500MB/s, read is 350MB/s, single threaded
seeks are ~10k/s and CPU bound. Creates are 700/s, read is 380/s and
deletes are 17,000/s. Random create/read/delete is roughly the same.

So as you can see, the read performance of your storage makes a big,
big difference to the results. The write performance is 5x faster
than your SATA drive, the read is only 3x faster, but the seeks are
40x faster. The result is that the seek intensive workload runs 10x
faster, and the overall benchmark run completes in only 10% of your
current runtime.

The big question is whether this bonnie++ workload reflects your
real workload? If not, then find a benchmark that is more closely
related to your application. If so, and the read performance is what
you really need maximised then you need to optimise your storage
architecture for minimising read latency, not write speed. That
means either lots of spindles, or high RPM drives or SSDs or some
combination of all three. There's nothing the filesystem can really
do to make it any faster than it already is...

> The filesystem was created like this:
> 
> # mkfs.xfs -i attr=2,maxpct=1 /dev/sdb

attr=2 is the default, and maxpct is a soft limit so the only reason
you would have to change it is if you need more inodes in the
filesystem than it can support by default. Indeed, that's somewhere
around 200 million inodes per TB of disk space...

> P.S. When dd'ing large files onto XFS I found that bs=8k gave a lower
> performance than bs=16k or larger.  So I wanted to rerun bonnie++ with
> larger chunk sizes.  Unfortunately that causes it to crash (and fairly
> consistently) - see below.

No surprise - twice as many syscalls, twice the overhead.

> Is the 8k block size likely to be the performance culprit here?
> 
> # time bonnie++ -d /data/sdb -s 16384k:32k -n 98:800k:500k:1000:32k -u root
> Using uid:0, gid:0.
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...done
> Reading a byte at a time...done
> Reading intelligently...
> done
> start 'em...done...done...done...done...done...
> *** glibc detected *** bonnie++: double free or corruption (out): 0x00000000024430a0 ***
> ======= Backtrace: =========
> /lib/x86_64-linux-gnu/libc.so.6(+0x78a96)[0x7f42a0317a96]
> /lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7f42a031bd7c]

That's a bug in bonnie. I'd take that up with the benchmark maintainer.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Performance problem - reads slower than writes
  2012-01-31  2:05 ` Dave Chinner
@ 2012-01-31 10:31   ` Brian Candler
  2012-01-31 14:16     ` Brian Candler
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Brian Candler @ 2012-01-31 10:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Jan 31, 2012 at 01:05:08PM +1100, Dave Chinner wrote:
> When your working set is
> larger than memory (which is definitely true here), read performance
> will almost always be determined by read IO latency.

Absolutely.

> > There are about 270 disk operations per second seen at the time, so
> > the drive is clearly saturated with seeks.  It seems to be doing about 7
> > seeks for each stat+read. 
> 
> It's actually reading bits of the files, too, as your strace shows,
> which is where most of the IO comes from.

It's reading the entire files - I had grepped out the read(...) = 8192
lines so that the stat/open/read/close pattern could be seen.

> The big question is whether this bonnie++ workload reflects your
> real workload?

Yes it does. The particular application I'm tuning for includes a library of
some 20M files in the 500-800K size range.  The library is semi-static, i.e. 
occasionally appended to.  Some clients will be reading individual files at
random, but from time to time we will need to scan across the whole library
and process all the files or a large subset of it.

> you need to optimise your storage
> architecture for minimising read latency, not write speed. That
> means either lots of spindles, or high RPM drives or SSDs or some
> combination of all three. There's nothing the filesystem can really
> do to make it any faster than it already is...

I will end up distributing the library across multiple spindles using
something like Gluster, but first I want to tune the performance on a single
filesystem.

It seems to me that reading a file should consist roughly of:

- seek to inode (if the inode block isn't already in cache)
- seek to extents table (if all extents don't fit in the inode)
- seek(s) to the file contents, depending on how they're fragmented.

I am currently seeing somewhere between 7 and 8 seeks per file read, and
this just doesn't seem right to me.

One thing I can test directly is whether the files are fragmented, using
xfs_bmap, and this shows they clearly are not:

root@storage1:~# xfs_bmap /data/sdc/Bonnie.16388/00449/*
/data/sdc/Bonnie.16388/00449/000000b125QaaLg:
        0: [0..1167]: 2952872392..2952873559
/data/sdc/Bonnie.16388/00449/000000b126:
        0: [0..1087]: 4415131112..4415132199
/data/sdc/Bonnie.16388/00449/000000b1272Mfp:
        0: [0..1255]: 1484828464..1484829719
/data/sdc/Bonnie.16388/00449/000000b128sEYN5:
        0: [0..1319]: 2952873560..2952874879
/data/sdc/Bonnie.16388/00449/000000b129Zs:
        0: [0..1591]: 4415132200..4415133791
/data/sdc/Bonnie.16388/00449/000000b12aIaa3UV:
        0: [0..1527]: 1484829720..1484831247
/data/sdc/Bonnie.16388/00449/000000b12b:
        0: [0..1287]: 2952874880..2952876167
/data/sdc/Bonnie.16388/00449/000000b12c3ze1zN5FfX1i:
        0: [0..1463]: 4415133792..4415135255
... snip rest

So the next thing I'd have to do is to try to get a trace of the I/O
operations being performed, and I don't know how to do that.

> > The filesystem was created like this:
> > 
> > # mkfs.xfs -i attr=2,maxpct=1 /dev/sdb
> 
> attr=2 is the default, and maxpct is a soft limit so the only reason
> > you would have to change it is if you need more inodes in the
> filesystem than it can support by default. Indeed, that's somewhere
> around 200 million inodes per TB of disk space...

OK. I saw "df -i" reporting a stupid number of available inodes, over 500
million, so I decided to reduce it to 100 million.  But df -k didn't show
any corresponding increase in disk space, so I'm guessing in xfs these are
allocated on-demand, and the inode limit doesn't really matter?

> > P.S. When dd'ing large files onto XFS I found that bs=8k gave a lower
> > performance than bs=16k or larger.  So I wanted to rerun bonnie++ with
> > larger chunk sizes.  Unfortunately that causes it to crash (and fairly
> > consistently) - see below.
> 
> No surprise - twice as many syscalls, twice the overhead.

I'm not sure that simple explanation works here. I see almost exactly the
same performance with bs=512m down to bs=32k, slightly worse at bs=16k, and a
sudden degradation at bs=8k.  However the CPU is still massively
underutilised at that point.

root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 7.91832 s, 136 MB/s

real	0m7.950s
user	0m0.000s
sys	0m0.100s

root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=32k count=32768
32768+0 records in
32768+0 records out
1073741824 bytes (1.1 GB) copied, 7.92206 s, 136 MB/s

real	0m7.963s
user	0m0.004s
sys	0m0.420s

root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=16k count=65536
65536+0 records in
65536+0 records out
1073741824 bytes (1.1 GB) copied, 8.48255 s, 127 MB/s

real	0m8.496s
user	0m0.096s
sys	0m0.644s

root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=8k count=131072
131072+0 records in
131072+0 records out
1073741824 bytes (1.1 GB) copied, 13.8283 s, 77.6 MB/s

real	0m13.829s
user	0m0.084s
sys	0m1.328s

Also: I can run the same dd on twelve separate drives concurrently, and get
the same results. This is a two-core (+hyperthreading) processor, but if
syscall overhead really were the limiting factor I would expect doing it
twelve times in parallel would amplify the effect.

My suspicion is that some other factor is coming into play - read-ahead on
the drives perhaps - but I haven't nailed it down yet.

Regards,

Brian.


* Re: Performance problem - reads slower than writes
  2012-01-31 10:31   ` Brian Candler
@ 2012-01-31 14:16     ` Brian Candler
  2012-01-31 20:25       ` Dave Chinner
  2012-01-31 14:52     ` Christoph Hellwig
  2012-01-31 20:06     ` Dave Chinner
  2 siblings, 1 reply; 30+ messages in thread
From: Brian Candler @ 2012-01-31 14:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Updates:

(1) The bug in bonnie++ is to do with memory allocation, and you can work
around it by putting '-n' before '-s' on the command line and using the same
custom chunk size for both (or by using '-n' with '-s 0').

# time bonnie++ -d /data/sdc -n 98:800k:500k:1000:32k -s 16384k:32k -u root

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
storage1    16G:32k  2061  91 101801   3 49405   4  5054  97 126748   6 130.9   3
Latency             15446us     222ms     412ms   23149us   83913us     452ms
Version  1.96       ------Sequential Create------ --------Random Create--------
storage1            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
98:819200:512000/1000   128   3    37   1 10550  25   108   3    38   1  8290  33
Latency              6874ms   99117us   45394us    4462ms   12582ms    4027ms
1.96,1.96,storage1,1,1328002525,16G,32k,2061,91,101801,3,49405,4,5054,97,126748,6,130.9,3,98,819200,512000,,1000,128,3,37,1,10550,25,108,3,38,1,8290,33,15446us,222ms,412ms,23149us,83913us,452ms,6874ms,99117us,45394us,4462ms,12582ms,4027ms

This shows that using 32k transfers instead of 8k doesn't really help; I'm
still only seeing 37-38 reads per second, either sequential or random.


(2) In case extents aren't being kept in the inode, I decided to build a
filesystem with '-i size=1024'
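
(The exact mkfs invocation isn't quoted here; an illustrative guess, not the
command actually used, would be something like:

# mkfs.xfs -f -i size=1024 /dev/sdb

where -f is only needed because the disk already holds a filesystem.)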

# time bonnie++ -d /data/sdb -n 98:800k:500k:1000:32k -s0 -u root

Version  1.96       ------Sequential Create------ --------Random Create--------
storage1            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
98:819200:512000/1000   110   3   131   5  3410  10   110   3    33   1   387   1
Latency              6038ms   92092us   87730us    5202ms     117ms    7653ms
1.96,1.96,storage1,1,1328003901,,,,,,,,,,,,,,98,819200,512000,,1000,110,3,131,5,3410,10,110,3,33,1,387,1,,,,,,,6038ms,92092us,87730us,5202ms,117ms,7653ms

Wow! The sequential read just blows away the previous results. What's even
more amazing is the number of transactions per second reported by iostat
while bonnie++ was sequentially stat()ing and read()ing the files:

# iostat 5
...
sdb             820.80     86558.40         0.00     432792          0
                  !!

820 tps on a bog-standard hard-drive is unbelievable, although the total
throughput of 86MB/sec is.  It could be that either NCQ or drive read-ahead
is scoring big-time here.

However during random stat()+read() the performance drops:

# iostat 5
...
sdb             225.40     21632.00         0.00     108160          0

Here we appear to be limited by real seeks. 225 seeks/sec is still very good
for a hard drive, but it means the filesystem is generating about 7 seeks
for every file (stat+open+read+close).  Indeed the random read performance
appears to be a bit worse than the default (-i size=256) filesystem, where
I was getting 25MB/sec on iostat, and 38 files per second instead of 33.

There are only 1000 directories in this test, and I would expect those to
become cached quickly.

According to Wikipedia, XFS has variable-length extents. I think that as
long as the file data is contiguous, each file should only be taking a
single extent, and this is what xfs_bmap seems to be telling me:

# xfs_bmap -n1 -l -v /data/sdc/Bonnie.25448/00449/* | head
/data/sdc/Bonnie.25448/00449/000000b125mpBap4gg7U:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL
   0: [0..1559]:       4446598752..4446600311  3 (51198864..51200423)  1560
/data/sdc/Bonnie.25448/00449/000000b1262hBudG6gV:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL
   0: [0..1551]:       1484870256..1484871807  1 (19736960..19738511)  1552
/data/sdc/Bonnie.25448/00449/000000b127fM:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL
   0: [0..1111]:       2954889944..2954891055  2 (24623352..24624463)  1112
/data/sdc/Bonnie.25448/00449/000000b128:

It looks like I need to get familiar with xfs_db and
http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf
to find out what's going on.
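
(As a starting point, a read-only xfs_db session can dump the superblock and
an individual inode; the session below is only an illustrative sketch, with
the inode number coming from e.g. 'ls -i' on one of the test files:

# ls -i /data/sdc/Bonnie.25448/00449/000000b125mpBap4gg7U
# xfs_db -r /dev/sdc
xfs_db> sb 0
xfs_db> print
xfs_db> inode <inode number from ls -i>
xfs_db> print
xfs_db> quit
)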

(These filesystems are mounted with noatime,nodiratime incidentally)

Regards,

Brian.


* Re: Performance problem - reads slower than writes
  2012-01-31 10:31   ` Brian Candler
  2012-01-31 14:16     ` Brian Candler
@ 2012-01-31 14:52     ` Christoph Hellwig
  2012-01-31 21:52       ` Brian Candler
  2012-02-03 11:54       ` Brian Candler
  2012-01-31 20:06     ` Dave Chinner
  2 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2012-01-31 14:52 UTC (permalink / raw)
  To: Brian Candler; +Cc: xfs

On Tue, Jan 31, 2012 at 10:31:26AM +0000, Brian Candler wrote:
> - seek to inode (if the inode block isn't already in cache)
> - seek to extents table (if all extents don't fit in the inode)
> - seek(s) to the file contents, depending on how they're fragmented.
> 
> I am currently seeing somewhere between 7 and 8 seeks per file read, and
> this just doesn't seem right to me.

You don't just read a single file at a time but multiple ones, don't
you?

Try playing with the following tweaks to get larger I/O to the disk:

 a) make sure you use the noop or deadline elevators
 b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
 c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb
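
For example (the device name and the 8:16 bdi node are only illustrative,
corresponding to /dev/sdb; pick values to suit):

# echo deadline > /sys/block/sdb/queue/scheduler
# echo 1024 > /sys/block/sdb/queue/max_sectors_kb
# echo 1024 > /sys/devices/virtual/bdi/8:16/read_ahead_kb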

> OK. I saw "df -i" reporting a stupid number of available inodes, over 500
> million, so I decided to reduce it to 100 million.  But df -k didn't show
> any corresponding increase in disk space, so I'm guessing in xfs these are
> allocated on-demand, and the inode limit doesn't really matter?

Exactly, the number displayed is the upper bound.


* Re: Performance problem - reads slower than writes
  2012-01-31 10:31   ` Brian Candler
  2012-01-31 14:16     ` Brian Candler
  2012-01-31 14:52     ` Christoph Hellwig
@ 2012-01-31 20:06     ` Dave Chinner
  2012-01-31 21:35       ` Brian Candler
  2 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2012-01-31 20:06 UTC (permalink / raw)
  To: Brian Candler; +Cc: xfs

On Tue, Jan 31, 2012 at 10:31:26AM +0000, Brian Candler wrote:
> On Tue, Jan 31, 2012 at 01:05:08PM +1100, Dave Chinner wrote:
> > When your working set is
> > larger than memory (which is definitely true here), read performance
> > will almost always be determined by read IO latency.
> 
> Absolutely.
> 
> > > There are about 270 disk operations per second seen at the time, so
> > > the drive is clearly saturated with seeks.  It seems to be doing about 7
> > > seeks for each stat+read. 
> > 
> > It's actually reading bits of the files, too, as your strace shows,
> > which is where most of the IO comes from.
> 
> It's reading the entire files - I had grepped out the read(...) = 8192
> lines so that the stat/open/read/close pattern could be seen.
> 
> > The big question is whether this bonnie++ workload reflects your
> > real workload?
> 
> Yes it does. The particular application I'm tuning for includes a library of
> some 20M files in the 500-800K size range.  The library is semi-static, i.e. 
> occasionally appended to.  Some clients will be reading individual files at
> random, but from time to time we will need to scan across the whole library
> and process all the files or a large subset of it.
> 
> > you need to optimise your storage
> > architecture for minimising read latency, not write speed. That
> > means either lots of spindles, or high RPM drives or SSDs or some
> > combination of all three. There's nothing the filesystem can really
> > do to make it any faster than it already is...
> 
> I will end up distributing the library across multiple spindles using
> something like Gluster, but first I want to tune the performance on a single
> filesystem.
> 
> It seems to me that reading a file should consist roughly of:
> 
> - seek to inode (if the inode block isn't already in cache)
> - seek to extents table (if all extents don't fit in the inode)
> - seek(s) to the file contents, depending on how they're fragmented.

You forgot the directory IO. If you've got enough entries in the
directory to push it out to leaf/node format, then it could
certainly take 3-4 IOs just to find the directory entry you are
looking for.

> I am currently seeing somewhere between 7 and 8 seeks per file read, and
> this just doesn't seem right to me.

The number of IOs does not equal the number of seeks. Two adjacent,
sequential IOs issued serially will show up as two IOs, even though
there was no seek in between. Especially if the files are large
enough that readahead tops out (500-800k is large enough for this as
readahead maximum is 128k by default).  So it might be taking 3-4
IOs just to read the file data.

> So the next thing I'd have to do is to try to get a trace of the I/O
> operations being performed, and I don't know how to do that.

blktrace/blkparse or seekwatcher.
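
A minimal blktrace run, assuming the blktrace tools are installed and sdb is
the device under test, looks something like:

# blktrace -d /dev/sdb -o - | blkparse -i -      # live trace to stdout

or, to capture while the benchmark runs and analyse afterwards:

# blktrace -d /dev/sdb -o sdbtrace -w 60         # trace for 60 seconds
# blkparse -i sdbtrace | less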

> > > The filesystem was created like this:
> > > 
> > > # mkfs.xfs -i attr=2,maxpct=1 /dev/sdb
> > 
> > attr=2 is the default, and maxpct is a soft limit so the only reason
> > > you would have to change it is if you need more inodes in the
> > filesystem than it can support by default. Indeed, that's somewhere
> > around 200 million inodes per TB of disk space...
> 
> OK. I saw "df -i" reporting a stupid number of available inodes, over 500
> million, so I decided to reduce it to 100 million.  But df -k didn't show
> any corresponding increase in disk space, so I'm guessing in xfs these are
> allocated on-demand, and the inode limit doesn't really matter?

Right. The "available inodes" number is calculated based on the
current amount of free space, IIRC. It's dynamic, and mostly
meaningless.

> > > P.S. When dd'ing large files onto XFS I found that bs=8k gave a lower
> > > performance than bs=16k or larger.  So I wanted to rerun bonnie++ with
> > > larger chunk sizes.  Unfortunately that causes it to crash (and fairly
> > > consistently) - see below.
> > 
> > No surprise - twice as many syscalls, twice the overhead.
> 
> I'm not sure that simple explanation works here. I see almost exactly the
> same performance with bs=512m down to bs=32k, slightly worse at bs=16k, and a
> sudden degradation at bs=8k.  However the CPU is still massively
> underutilised at that point.
>
> root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=1024k count=1024
                           ^^^^^^^^^^^^

Direct IO is different to buffered IO, which is what bonnie++ does.
For direct IO, the IO size that hits the disk is exactly the bs
value, and you can only have one IO per thread outstanding. All you
are showing is that your disk cache readahead is not magic.


Indeed, look at the system time:

> sys	0m0.100s
> 
> root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=32k count=32768
.....
> sys	0m0.420s
> 
> root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=16k count=65536
.....
> sys	0m0.644s
> 
> root@storage1:~# time dd iflag=direct if=/dev/sdg of=/dev/null bs=8k count=131072
....
> sys	0m1.328s

It scales roughly linearly with the number of IOs that are done.
This means there is more CPU time spent to retrieve a given amount
of data, and that time is not being spent doing IO. Put simply, this
is slower:

    Fixed CPU time to issue 8K IO
    IO time
    Fixed CPU time to issue 8K IO
    IO time
    Fixed CPU time to issue 8K IO
    IO time
    Fixed CPU time to issue 8K IO
    IO time

than:

    Fixed CPU time to issue 32K IO
    IO time

because of the CPU time spent between IOs, and the difference in IO
time between an 8k read and a 32k read is only about 5%.
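
To put rough numbers on it from the dd timings quoted above: the bs=8k run
issues 131072 read() calls and uses ~1.33s of system time, while the bs=32k
run issues 32768 read() calls and uses ~0.42s, i.e. on the order of 10us of
CPU per read() either way, so the system time scales almost linearly with
the number of syscalls.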

> Also: I can run the same dd on twelve separate drives concurrently, and get
> the same results. This is a two-core (+hyperthreading) processor, but if
> syscall overhead really were the limiting factor I would expect doing it
> twelve times in parallel would amplify the effect.

It's single thread latency that is your limiting factor. All you've
done is demonstrate that threads don't interfere with each other.

> My suspicion is that some other factor is coming into play - read-ahead on
> the drives perhaps - but I haven't nailed it down yet.

It's simply that the amount of CPU spent in syscalls doing IO is
the performance limiting factor for a single thread.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Performance problem - reads slower than writes
  2012-01-31 14:16     ` Brian Candler
@ 2012-01-31 20:25       ` Dave Chinner
  2012-02-01  7:29         ` Stan Hoeppner
  2012-02-03 18:47         ` Brian Candler
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2012-01-31 20:25 UTC (permalink / raw)
  To: Brian Candler; +Cc: xfs

On Tue, Jan 31, 2012 at 02:16:04PM +0000, Brian Candler wrote:
> Updates:
> 
> (1) The bug in bonnie++ is to do with memory allocation, and you can work
> around it by putting '-n' before '-s' on the command line and using the same
> custom chunk size for both (or by using '-n' with '-s 0').
> 
> # time bonnie++ -d /data/sdc -n 98:800k:500k:1000:32k -s 16384k:32k -u root
> 
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> storage1    16G:32k  2061  91 101801   3 49405   4  5054  97 126748   6 130.9   3
> Latency             15446us     222ms     412ms   23149us   83913us     452ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> storage1            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> 98:819200:512000/1000   128   3    37   1 10550  25   108   3    38   1  8290  33
> Latency              6874ms   99117us   45394us    4462ms   12582ms    4027ms
> 1.96,1.96,storage1,1,1328002525,16G,32k,2061,91,101801,3,49405,4,5054,97,126748,6,130.9,3,98,819200,512000,,1000,128,3,37,1,10550,25,108,3,38,1,8290,33,15446us,222ms,412ms,23149us,83913us,452ms,6874ms,99117us,45394us,4462ms,12582ms,4027ms
> 
> This shows that using 32k transfers instead of 8k doesn't really help; I'm
> still only seeing 37-38 reads per second, either sequential or random.

Right, because it is doing buffered IO, and reading and writing the
page cache at small IO sizes is much faster than waiting for
physical IO. Hence there is much less of a penalty for small
buffered IOs by comparison.

> (2) In case extents aren't being kept in the inode, I decided to build a
> filesystem with '-i size=1024'
> 
> # time bonnie++ -d /data/sdb -n 98:800k:500k:1000:32k -s0 -u root
> 
> Version  1.96       ------Sequential Create------ --------Random Create--------
> storage1            -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> 98:819200:512000/1000   110   3   131   5  3410  10   110   3    33   1   387   1
> Latency              6038ms   92092us   87730us    5202ms     117ms    7653ms
> 1.96,1.96,storage1,1,1328003901,,,,,,,,,,,,,,98,819200,512000,,1000,110,3,131,5,3410,10,110,3,33,1,387,1,,,,,,,6038ms,92092us,87730us,5202ms,117ms,7653ms
> 
> Wow! The sequential read just blows away the previous results. What's even
> more amazing is the number of transactions per second reported by iostat
> while bonnie++ was sequentially stat()ing and read()ing the files:

The only thing changing the inode size will have affected is the
directory structure - maybe your directories are small enough to fit
in line, or the inode is large enough to keep it in extent format
rather than a full btree. In either case, though, the directory
lookup will require less IO.

> 
> # iostat 5
> ...
> sdb             820.80     86558.40         0.00     432792          0
>                   !!
> 
> 820 tps on a bog-standard hard-drive is unbelievable, although the total
> throughput of 86MB/sec is.  It could be that either NCQ or drive read-ahead
> is scoring big-time here.

See my previous explanation of adjacent IOs not needing seeks. All
you've done is increase the amount of IO needed to read and write
inodes because the inode cluster size is a fixed 8k. That means you
now need to do 8 adjacent IOs to read a 64 inode chunk instead of 2
adjacent IOs when you have 256 byte inodes. And because they are
adjacent IOs, they will hit the drive cache and so not require
physical IO to be done. Hence you can get much "higher" IO
throughput without actually doing any more physical IO....
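
(The arithmetic: inodes are allocated in chunks of 64, so a chunk of 256 byte
inodes is 64 x 256 = 16k = two 8k cluster reads, while a chunk of 1024 byte
inodes is 64 x 1024 = 64k = eight 8k cluster reads.)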

> However during random stat()+read() the performance drops:
> 
> # iostat 5
> ...
> sdb             225.40     21632.00         0.00     108160          0

Because it is now reading random inodes, so it is not reading adjacent
8k inode clusters all the time.

> 
> Here we appear to be limited by real seeks. 225 seeks/sec is still very good

That number indicates 225 IOs/s, not 225 seeks/s.

> for a hard drive, but it means the filesystem is generating about 7 seeks
> for every file (stat+open+read+close).  Indeed the random read performance

7 IOs for every file.

> appears to be a bit worse than the default (-i size=256) filesystem, where
> I was getting 25MB/sec on iostat, and 38 files per second instead of 33.

Right, because it is taking more seeks to read the inodes because they
are physically further apart.

> There are only 1000 directories in this test, and I would expect those to
> become cached quickly.

Doubtful. There's plenty of page cache pressure (500-800k) per inode
read (maybe 16k of cached metadata all up) so there's enough memory
pressure to prevent the directory structure from staying memory
resident.

> It looks like I need to get familiar with xfs_db and
> http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf
> to find out what's going on.

It's pretty obvious to me what is happening. :/ I think that you first
need to understand exactly what the tools you are already using are
actually telling you, then go from there...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Performance problem - reads slower than writes
  2012-01-31 20:06     ` Dave Chinner
@ 2012-01-31 21:35       ` Brian Candler
  0 siblings, 0 replies; 30+ messages in thread
From: Brian Candler @ 2012-01-31 21:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 01, 2012 at 07:06:35AM +1100, Dave Chinner wrote:
> The number of IOs does not equal the number of seeks. Two adjacent,
> sequential IOs issued serially will show up as two IOs, even though
> there was no seek in between. Especially if the files are large
> enough that readahead tops out (500-800k is large enough for this as
> readahead maximum is 128k by default).  So it might be taking 3-4
> IOs just to read the file data.

Ah. And if the IOs are not stacked up, then the platter has to rotate nearly
a whole turn to perform the next one.

> > So the next thing I'd have to do is to try to get a trace of the I/O
> > operations being performed, and I don't know how to do that.
> 
> blktrace/blkparse or seekwatcher.

Excellent, just what I wanted. I've made a start with this and will report
back.

Many thanks for the help and pointers you have provided.

Regards,

Brian.


* Re: Performance problem - reads slower than writes
  2012-01-31 14:52     ` Christoph Hellwig
@ 2012-01-31 21:52       ` Brian Candler
  2012-02-01  0:50         ` Raghavendra D Prabhu
  2012-02-01  3:59         ` Dave Chinner
  2012-02-03 11:54       ` Brian Candler
  1 sibling, 2 replies; 30+ messages in thread
From: Brian Candler @ 2012-01-31 21:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Jan 31, 2012 at 09:52:05AM -0500, Christoph Hellwig wrote:
> You don't just read a single file at a time but multiple ones, don't
> you?

It's sequential at the moment, although I'll do further tests with the -c
(concurrency) option to bonnie++

> Try playing with the following tweaks to get larger I/O to the disk:
> 
>  a) make sure you use the noop or deadline elevators
>  b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
>  c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb

Thank you very much: I will do further tests with these.

Is the read_ahead_kb knob aware of file boundaries? That is, is there any
risk that if I set it too large it would read useless blocks past the end of
the file?

Regards,

Brian.


* Re: Performance problem - reads slower than writes
  2012-01-31 21:52       ` Brian Candler
@ 2012-02-01  0:50         ` Raghavendra D Prabhu
  2012-02-01  3:59         ` Dave Chinner
  1 sibling, 0 replies; 30+ messages in thread
From: Raghavendra D Prabhu @ 2012-02-01  0:50 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs



Hi,


* On Tue, Jan 31, 2012 at 09:52:10PM +0000, Brian Candler <brian@soundmouse.com> wrote:
>On Tue, Jan 31, 2012 at 09:52:05AM -0500, Christoph Hellwig wrote:
>> You don't just read a single file at a time but multiple ones, don't
>> you?
>
>It's sequential at the moment, although I'll do further tests with the -c
>(concurrency) option to bonnie++
>
>> Try playing with the following tweaks to get larger I/O to the disk:
>>
>>  a) make sure you use the noop or deadline elevators
>>  b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
>>  c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb
>
>Thank you very much: I will do further tests with these.
>
>Is the read_ahead_kb knob aware of file boundaries? That is, is there any
>risk that if I set it too large it would read useless blocks past the end of
>the file?

     The read_ahead_kb knob is used by the memory subsystem
     readahead code to set the initial readahead to scale from (it
     uses a dynamic scaling window). It is set by default based on
     the device readahead value (probably obtained in a way similar
     to hdparm -I).

     Setting it higher will be beneficial for sequential workloads,
     and the risk you mentioned is not there since it is
     file-boundary aware -- check
     http://lxr.linux.no/linux+*/mm/readahead.c#L151 for more
     details.
>
>Regards,
>
>Brian.
>


Regards,
-- 
Raghavendra Prabhu
GPG Id : 0xD72BE977
Fingerprint: B93F EBCB 8E05 7039 CD3C A4B8 A616 DCA1 D72B E977
www: wnohang.net


* Re: Performance problem - reads slower than writes
  2012-01-31 21:52       ` Brian Candler
  2012-02-01  0:50         ` Raghavendra D Prabhu
@ 2012-02-01  3:59         ` Dave Chinner
  1 sibling, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2012-02-01  3:59 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs

On Tue, Jan 31, 2012 at 09:52:10PM +0000, Brian Candler wrote:
> On Tue, Jan 31, 2012 at 09:52:05AM -0500, Christoph Hellwig wrote:
> > You don't just read a single file at a time but multiple ones, don't
> > you?
> 
> It's sequential at the moment, although I'll do further tests with the -c
> (concurrency) option to bonnie++
> 
> > Try playing with the following tweaks to get larger I/O to the disk:
> > 
> >  a) make sure you use the noop or deadline elevators
> >  b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
> >  c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb
> 
> Thank you very much: I will do further tests with these.
> 
> Is the read_ahead_kb knob aware of file boundaries? That is, is there any
> risk that if I set it too large it would read useless blocks past the end of
> the file?

Yes, readahead only occurs within the file, and won't readahead past
EOF.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Performance problem - reads slower than writes
  2012-01-31 20:25       ` Dave Chinner
@ 2012-02-01  7:29         ` Stan Hoeppner
  2012-02-03 18:47         ` Brian Candler
  1 sibling, 0 replies; 30+ messages in thread
From: Stan Hoeppner @ 2012-02-01  7:29 UTC (permalink / raw)
  To: xfs

On 1/31/2012 2:25 PM, Dave Chinner wrote:
> On Tue, Jan 31, 2012 at 02:16:04PM +0000, Brian Candler wrote:

>> Here we appear to be limited by real seeks. 225 seeks/sec is still very good
> 
> That number indicates 225 IOs/s, not 225 seeks/s.

Yeah, the voice coil actuator and spindle rotation limit the peak
random seek rate of good 7.2k drive/controller combos to about 150/s.
15k drives do about 250-300 seeks/s max.  Simple tool to test max random
seeks/sec for a device:

32bit binary:  http://www.hardwarefreak.com/seekerb
source:        http://www.hardwarefreak.com/seeker_baryluk.c

I'm not the author.  The original seeker program is single threaded.
Baryluk did the thread hacking.  Background info:
http://www.linuxinsight.com/how_fast_is_your_disk.html

Usage:   ./seekerb device [threads]

Results for a single WD 7.2K drive, no NCQ, deadline elevator:

  1 threads Results: 64 seeks/second, 15.416 ms random access time
 16 threads Results: 97 seeks/second, 10.285 ms random access time
128 threads Results: 121 seeks/second, 8.208 ms random access time

Actual output:
$ seekerb /dev/sda 128
Seeker v3.0, 2009-06-17,
http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [976773168 blocks, 500107862016 bytes, 465 GB,
476940 MB, 500 GiB, 500107 MiB]
[512 logical sector size, 512 physical sector size]
[128 threads]
Wait 30 seconds.............................
Results: 121 seeks/second, 8.208 ms random access time (52614775 <
offsets < 499769984475)

Targeting array devices (mdraid or hardware, or FC SAN LUN) with lots of
spindles, and/or SSDs should yield some interesting results.

-- 
Stan


* Re: Performance problem - reads slower than writes
  2012-01-31 14:52     ` Christoph Hellwig
  2012-01-31 21:52       ` Brian Candler
@ 2012-02-03 11:54       ` Brian Candler
  2012-02-03 19:42         ` Stan Hoeppner
  1 sibling, 1 reply; 30+ messages in thread
From: Brian Candler @ 2012-02-03 11:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Jan 31, 2012 at 09:52:05AM -0500, Christoph Hellwig wrote:
> Try playing with the following tweaks to get larger I/O to the disk:
> 
>  a) make sure you use the noop or deadline elevators
>  b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
>  c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb

The default settings on this system are:

# cat /sys/block/sdb/queue/scheduler 
noop [deadline] cfq 

(I think the one in brackets is the active one)

# cat /sys/block/sdb/queue/max_sectors_kb 
512
# cat /sys/devices/virtual/bdi/8:16/read_ahead_kb
128

I did a series of tests where I increased either or both of those to 1024,
but it didn't make any difference to read throughput, which sat stubbornly
at 25MB/sec.  However I could see the difference in btrace and in tps
figures, showing a smaller number of larger transfers taking place.  It was
clearly doing the right thing: seek and read a large block of data, seek and
read the next large block of data, and so on.

Writing the files should (to my mind) require the same amount of seeking and
disk passing under the head, but it runs at 75MB/sec.

So I realised: when you are writing lots of files with write-behind caching,
the drive has a lot of opportunity for reordering those writes to minimise
the seek and rotational latency.  But when reading in a single thread, you
are doing sequential seek one - read one - seek two - read two - ...

It turns out the -c (concurrency) option to bonnie++ is ignored for the file
creation and reading test. So my next steps were:

* run 4 instances of bonnie++ concurrently on the same filesystem.

This did make some improvement. For the sequential file reading part, it got
me to 30MB/sec with bonnie++ reading 32k chunks, and 40MB/sec with it
reading 1024k chunks.  It fell back to 33MB/sec in the random file reading
part.

* write a script to do just a random read test, with varying levels of
concurrency.  The script is given below: it forks a varying number of
processes, each of which runs "dd" sequentially on a subset of the files.

First running with default params (max_sectors_kb=512, read_ahead_kb=128)

 #p  files/sec  dd_args
  1      39.87  bs=1024k iflag=direct		=> 25.9 MB/sec
  1      42.51  bs=1024k
  2      43.88  bs=1024k iflag=direct
  2      29.53  bs=1024k
  5      57.40  bs=1024k iflag=direct
  5      43.48  bs=1024k
 10      68.68  bs=1024k iflag=direct
 10      48.02  bs=1024k
 20      75.51  bs=1024k iflag=direct
 20      53.08  bs=1024k
 50      79.37  bs=1024k iflag=direct		=> 51.6 MB/sec
 50      51.30  bs=1024k

The files have an average size of 0.65MB, so I've converted some files/sec
into MB/sec.  What I found surprising was that the performance is lower with
iflag=direct for a single reader, but much higher with iflag=direct for
concurrent readers.

So I tried again with max_sectors_kb=1024, read_ahead_kb=1024

 #p  files/sec  dd_args
  1      39.95  bs=1024k iflag=direct
  1      42.21  bs=1024k
  2      43.14  bs=1024k iflag=direct
  2      47.93  bs=1024k
  5      56.68  bs=1024k iflag=direct
  5      61.95  bs=1024k
 10      68.35  bs=1024k iflag=direct
 10      75.50  bs=1024k                        => 49.1 MB/sec
 20      75.74  bs=1024k iflag=direct
 20      83.36  bs=1024k			=> 54.2 MB/sec
 50      79.45  bs=1024k iflag=direct
 50      86.58  bs=1024k			=> 56.3 MB/sec

Now it works better without iflag=direct.  With 20+ readers the throughput
is approaching decent, although it is still well short of the 75MB/sec I
achieve when writing.

Next with max_sectors_kb=128, read_ahead_kb=1024 (just to see if smaller SATA
transfers work better than large ones)

 #p  files/sec  dd_args
  1      39.74  bs=1024k iflag=direct
  1      42.49  bs=1024k
  2      43.92  bs=1024k iflag=direct
  2      48.22  bs=1024k
  5      56.39  bs=1024k iflag=direct
  5      62.61  bs=1024k
 10      61.50  bs=1024k iflag=direct
 10      68.67  bs=1024k
 20      68.21  bs=1024k iflag=direct
 20      75.28  bs=1024k			=> 48.9 MB/s
 50      68.36  bs=1024k iflag=direct
 50      75.32  bs=1024k			=> 49.0 MB/s

Maybe a tiny improvement at low concurrency, but it is worse at high
concurrency (presumably the larger number of queued I/Os is hitting a queue
depth limit).

Finally with max_sectors_kb=1024, read_ahead_kb=1024, and the noop
scheduler:

root@storage1:~# echo noop >/sys/block/sdc/queue/scheduler
root@storage1:~# cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq 

 #p  files/sec  dd_args
  1      40.19  bs=1024k iflag=direct
  1      41.98  bs=1024k
  2      43.63  bs=1024k iflag=direct
  2      48.24  bs=1024k
  5      56.97  bs=1024k iflag=direct
  5      62.86  bs=1024k
 10      68.68  bs=1024k iflag=direct
 10      76.81  bs=1024k
 20      76.03  bs=1024k iflag=direct
 20      85.17  bs=1024k			=> 55.4 MB/s
 50      76.58  bs=1024k iflag=direct
 50      83.66  bs=1024k

This may be slightly better than the deadline scheduler until we get to 50
concurrent readers.

FWIW, this is the controller:

[   12.855639] mpt2sas1: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (8183856 kB)
[   12.855801] mpt2sas1: PCI-MSI-X enabled: IRQ 68
[   12.855804] mpt2sas1: iomem(0x00000000fb8c0000), mapped(0xffffc90011788000), size(16384)
[   12.855806] mpt2sas1: ioport(0x0000000000008000), size(256)
[   13.142189] mpt2sas1: sending message unit reset !!
[   13.150164] mpt2sas1: message unit reset: SUCCESS
[   13.323195] mpt2sas1: Allocated physical memory: size(16611 kB)
[   13.323200] mpt2sas1: Current Controller Queue Depth(7386), Max Controller Queue Depth(7647)
[   13.323203] mpt2sas1: Scatter Gather Elements per IO(128)
[   13.553727] mpt2sas1: LSISAS2116: FWVersion(05.00.13.00), ChipRevision(0x02), BiosVersion(07.11.00.00)
[   13.553737] mpt2sas1: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[   13.553814] mpt2sas1: sending port enable !!
[   13.555001] mpt2sas1: port enable: SUCCESS
[   13.559519] mpt2sas1: host_add: handle(0x0001), sas_addr(0x500062b2000b4c00), phys(16)

and the drives are Hitachi Deskstar 5K3000 HDS5C3030ALA630:

[   13.567932] scsi 5:0:1:0: Direct-Access     ATA      Hitachi HDS5C303 A580 PQ: 0 ANSI: 5
[   13.567946] scsi 5:0:1:0: SATA: handle(0x0012), sas_addr(0x4433221101000000), phy(1), device_name(0xcca2500032cd28c0)
[   13.567953] scsi 5:0:1:0: SATA: enclosure_logical_id(0x500062b2000b4c00), slot(1)
[   13.568041] scsi 5:0:1:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[   13.568049] scsi 5:0:1:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(6), cmd_que(1)
[   13.568185] sd 5:0:1:0: Attached scsi generic sg2 type 0
[   13.569753] sd 5:0:1:0: [sdc] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
...
[   13.797275] sd 5:0:1:0: [sdc] Write Protect is off
[   13.797284] sd 5:0:1:0: [sdc] Mode Sense: 73 00 00 08
[   13.800400] sd 5:0:1:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

They are "CoolSpin" drives of unspecified RPM, i.e. definitely not high
performance in absolute terms.  But if I can get the filesystem read
throughput to approach the write throughput, I'll be extremely happy.

Regards,

Brian.

------ 8< ---------------------------------------------------------------
#!/usr/bin/ruby -w
#
# Concurrent read test: fork n_procs children, each of which dd's a disjoint
# subset of the corpus to /dev/null, then report the aggregate files/sec.

CORPUS = "/data/sdc/Bonnie.26384/*/*"

module Perftest
  def self.run(n_files, n_procs=1, dd_args="")
    # pick n_files at random from the corpus and split them into one
    # roughly equal chunk per child process
    files = Dir[CORPUS].sort_by { rand }[0, n_files]
    chunks = files.each_slice(n_files/n_procs).to_a[0, n_procs]
    n_files = chunks.map { |chunk| chunk.size }.inject(:+)
    t1 = Time.now
    @pids = chunks.map { |chunk| fork { run_single(chunk, dd_args); exit! } }
    @pids.delete_if { |pid| Process.waitpid(pid) }  # wait for all children
    t2 = Time.now
    printf "%3d %10.2f  %s\n", n_procs, n_files/(t2-t1), dd_args
  end

  def self.run_single(files, dd_args)
    files.each do |f|
      system("dd if='#{f}' of=/dev/null #{dd_args} 2>/dev/null")
    end
  end

  def self.kill_all(sig="TERM")
    (@pids || []).each { |pid| Process.kill(sig, pid) rescue nil }
  end
end

at_exit { Perftest.kill_all }

puts " #p  files/sec  dd_args"
[1,2,5,10,20,50].each do |nprocs|
  Perftest.run(4000, nprocs, "bs=1024k iflag=direct")
  Perftest.run(4000, nprocs, "bs=1024k")
end

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-01-31 20:25       ` Dave Chinner
  2012-02-01  7:29         ` Stan Hoeppner
@ 2012-02-03 18:47         ` Brian Candler
  2012-02-03 19:03           ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Brian Candler @ 2012-02-03 18:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 01, 2012 at 07:25:26AM +1100, Dave Chinner wrote:
> The only thing changing the inode size will have affected is the
> directory structure - maybe your directories are small enough to fit
> in line, or the inode is large enough to keep it in extent format
> rather than a full btree. In either case, though, the directory
> lookup will require less IO.

I've done a whole bunch of testing, which I won't describe in detail unless
you're interested, but I've finally found out what's causing the sudden
change in performance.

With defaults, the files in one directory are spread all over the
filesystem.  But with -i size=1024, the files in a directory are stored
adjacent to each other. Hence reading all the files in one directory
requires far less seeking across the disk, and runs about 3 times faster.

Here is the filesystem on a disk formatted with defaults:

root@storage1:~# find /data/sdc | head -20 | xargs xfs_bmap 
/data/sdc: no extents
/data/sdc/Bonnie.26384:
	0: [0..31]: 567088..567119
/data/sdc/Bonnie.26384/00000:
	0: [0..7]: 567120..567127
/data/sdc/Bonnie.26384/00000/0icoeTRPHKX0000000000:
	0: [0..1015]: 4411196808..4411197823
/data/sdc/Bonnie.26384/00000/Q0000000001:
	0: [0..1543]: 1466262056..1466263599
/data/sdc/Bonnie.26384/00000/JFXQyeq6diG0000000002:
	0: [0..1295]: 2936342144..2936343439
/data/sdc/Bonnie.26384/00000/TK7ciXkkj0000000003:
	0: [0..1519]: 4411197824..4411199343
/data/sdc/Bonnie.26384/00000/0000000004:
	0: [0..1207]: 1466263600..1466264807
/data/sdc/Bonnie.26384/00000/acJKZWAwEnu0000000005:
	0: [0..1223]: 2936343440..2936344663
/data/sdc/Bonnie.26384/00000/9wIgxPKeI4B0000000006:
	0: [0..1319]: 4411199344..4411200663
/data/sdc/Bonnie.26384/00000/C6QLFdND0000000007:
	0: [0..1111]: 1466264808..1466265919
/data/sdc/Bonnie.26384/00000/6xc1Wydh0000000008:
	0: [0..1223]: 2936344664..2936345887
/data/sdc/Bonnie.26384/00000/0000000009:
	0: [0..1167]: 4411200664..4411201831
/data/sdc/Bonnie.26384/00000/HdlN0000000000a:
	0: [0..1535]: 1466265920..1466267455
/data/sdc/Bonnie.26384/00000/52IabyC5pvis000000000b:
	0: [0..1287]: 2936345888..2936347175
/data/sdc/Bonnie.26384/00000/LvDhxcdLf000000000c:
	0: [0..1583]: 4411201832..4411203415
/data/sdc/Bonnie.26384/00000/08P3JAR000000000d:
	0: [0..1255]: 1466267456..1466268711
/data/sdc/Bonnie.26384/00000/000000000e:
	0: [0..1095]: 2936347176..2936348271
/data/sdc/Bonnie.26384/00000/s0gtPGPecXu000000000f:
	0: [0..1319]: 4411203416..4411204735
/data/sdc/Bonnie.26384/00000/HFLOcN0000000010:
	0: [0..1503]: 1466268712..1466270215

And here is the filesystem created with -i size=1024:

root@storage1:~# find /data/sdb | head -20 | xargs xfs_bmap 
/data/sdb: no extents
/data/sdb/Bonnie.26384:
	0: [0..7]: 243752..243759
	1: [8..15]: 5526920..5526927
	2: [16..23]: 7053272..7053279
	3: [24..31]: 24223832..24223839
/data/sdb/Bonnie.26384/00000:
	0: [0..7]: 1465133488..1465133495
/data/sdb/Bonnie.26384/00000/0icoeTRPHKX0000000000:
	0: [0..1015]: 1465134032..1465135047
/data/sdb/Bonnie.26384/00000/Q0000000001:
	0: [0..1543]: 1465135048..1465136591
/data/sdb/Bonnie.26384/00000/JFXQyeq6diG0000000002:
	0: [0..1295]: 1465136592..1465137887
/data/sdb/Bonnie.26384/00000/TK7ciXkkj0000000003:
	0: [0..1519]: 1465137888..1465139407
/data/sdb/Bonnie.26384/00000/0000000004:
	0: [0..1207]: 1465139408..1465140615
/data/sdb/Bonnie.26384/00000/acJKZWAwEnu0000000005:
	0: [0..1223]: 1465140616..1465141839
/data/sdb/Bonnie.26384/00000/9wIgxPKeI4B0000000006:
	0: [0..1319]: 1465141840..1465143159
/data/sdb/Bonnie.26384/00000/C6QLFdND0000000007:
	0: [0..1111]: 1465143160..1465144271
/data/sdb/Bonnie.26384/00000/6xc1Wydh0000000008:
	0: [0..1223]: 1465144272..1465145495
/data/sdb/Bonnie.26384/00000/0000000009:
	0: [0..1167]: 1465145496..1465146663
/data/sdb/Bonnie.26384/00000/HdlN0000000000a:
	0: [0..1535]: 1465146664..1465148199
/data/sdb/Bonnie.26384/00000/52IabyC5pvis000000000b:
	0: [0..1287]: 1465148200..1465149487
/data/sdb/Bonnie.26384/00000/LvDhxcdLf000000000c:
	0: [0..1583]: 1465149488..1465151071
/data/sdb/Bonnie.26384/00000/08P3JAR000000000d:
	0: [0..1255]: 1465151072..1465152327
/data/sdb/Bonnie.26384/00000/000000000e:
	0: [0..1095]: 1465152464..1465153559
/data/sdb/Bonnie.26384/00000/s0gtPGPecXu000000000f:
	0: [0..1319]: 1465153560..1465154879
/data/sdb/Bonnie.26384/00000/HFLOcN0000000010:
	0: [0..1503]: 1465154880..1465156383

All the files in one directory are close to that directory; when you get to
another directory the block offset jumps.

This is a highly desirable property when you want to copy all the files: for
example, using this filesystem I can tar it up and untar it onto another
filesystem at 73MB/s, as compared to about 25MB/sec on a default filesystem.

So now my questions are:

(1) Is this a fluke? What is it about -i size=1024 which causes this to
happen?

(2) What is the intended behaviour for XFS: that files should be close to
their parent directory or spread across allocation groups?

I did some additional tests:

* -i size=512
Files spread around

* -n size=16384
Files spread around

* -i size=1024 -n size=16384
Files local to directory

* -i size=2048
Files local to directory
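
(For reference, these test filesystems were created along these lines - a
sketch; /dev/sdX stands in for whichever disk was being tested, and each
was mounted with the usual noatime,nodiratime options:)

# mkfs.xfs -f -i size=1024 /dev/sdX
# mkfs.xfs -f -n size=16384 /dev/sdX
# mkfs.xfs -f -i size=1024 -n size=16384 /dev/sdX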

Any clues gratefully received. This usage pattern (dumping in a large
library of files, and then processing all those files sequentially) is an
important one for the system I'm working on.

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 18:47         ` Brian Candler
@ 2012-02-03 19:03           ` Christoph Hellwig
  2012-02-03 21:01             ` Brian Candler
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2012-02-03 19:03 UTC (permalink / raw)
  To: Brian Candler; +Cc: xfs

On Fri, Feb 03, 2012 at 06:47:23PM +0000, Brian Candler wrote:
> On Wed, Feb 01, 2012 at 07:25:26AM +1100, Dave Chinner wrote:
> > The only thing changing the inode size will have affected is the
> > directory structure - maybe your directories are small enough to fit
> > in line, or the inode is large enough to keep it in extent format
> > rather than a full btree. In either case, though, the directory
> > lookup will require less IO.
> 
> I've done a whole bunch of testing, which I won't describe in detail unless
> you're interested, but I've finally found out what's causing the sudden
> change in performance.
> 
> With defaults, the files in one directory are spread all over the
> filesystem.  But with -i size=1024, the files in a directory are stored
> adjacent to each other. Hence reading all the files in one directory
> requires far less seeking across the disk, and runs about 3 times faster.

Not sure if you mentioned it somewhere before, but:

 a) how large is the filesystem?
 b) do you use the inode64 mount option?
 c) can you see the same good behaviour when using inode64 and small
    inodes (note that inode64 can NOT be set using remount)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 11:54       ` Brian Candler
@ 2012-02-03 19:42         ` Stan Hoeppner
  2012-02-03 22:10           ` Brian Candler
  0 siblings, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2012-02-03 19:42 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs

On 2/3/2012 5:54 AM, Brian Candler wrote:

> and the drives are Hitachi Deskstar 5K3000 HDS5C3030ALA630:

3TB, 32MB cache, 5940 RPM

> The files have an average size of 0.65MB

You stated you're writing 100,000 of these files across 1,000
directories, with Bonnie, then reading them back with dd in your custom
script.  You state this is similar to your production workload.

You've hit the peak read rate of these Hitachi drives.  As others
pointed out, if you need more read performance than the dozen of these
you plan to RAID stripe, then you'll need to swap them for units with a
faster spindle:

7.2k 	 1.21x
 10k	 1.68x
 15k	 2.53x

or with SSDs, which will yield an order of magnitude increase.  Your
stated need is 20M 500-800KB files, or 20GB if my math is correct.  Four
of these enterprise Intel SLC SSDs in a layered mdRAID0 over mdRAID1
will give you ~375 file reads/sec at 800KB per file, again if my math is
correct:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820167062

for $480 USD total investment.  You might get by with a mirrored pair
depending on your actual space needs, but performance would be half.

You're probably wondering why I didn't recommend an mdRAID10 instead, or
a 3 SSD RAID5.  All of the mdRAID striped RAID codes serialize on a
single master thread, except for RAID0 and the linear concatenation
(--linear).  With storage devices capable of 35K IOPS each, that single
thread, even running on a 3+GHz core, has trouble keeping up.

The LSI 9201-16 you have is based on the SAS2116 chip, which isn't fast
enough in RAID10 mode to keep up with the SSDs.  In straight HBA mode it
is.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 19:03           ` Christoph Hellwig
@ 2012-02-03 21:01             ` Brian Candler
  2012-02-03 21:17               ` Brian Candler
  2012-02-05 22:43               ` Dave Chinner
  0 siblings, 2 replies; 30+ messages in thread
From: Brian Candler @ 2012-02-03 21:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Feb 03, 2012 at 02:03:04PM -0500, Christoph Hellwig wrote:
> > With defaults, the files in one directory are spread all over the
> > filesystem.  But with -i size=1024, the files in a directory are stored
> > adjacent to each other. Hence reading all the files in one directory
> > requires far less seeking across the disk, and runs about 3 times faster.
> 
> Not sure if you mentioned it somewhere before, but:
> 
>  a) how large is the filesystem?

3TB.

>  b) do you use the inode64 mount option?

No: the only mount options I've given are noatime,nodiratime.

>  c) can you see the same good behaviour when using inode64 and small
>     inodes (note that inode64 can NOT be set using remount)

I created a fresh filesystem (/dev/sdh), default parameters, but mounted it
with inode64.  Then I tar'd across my corpus of 100K files.  Result: files
are located close to the directories they belong to, and read performance
zooms.
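
(Roughly what I did - a sketch; the copy pipeline is just one way of
tar'ing the corpus across:)

# mkfs.xfs -f /dev/sdh
# mkdir -p /data/sdh
# mount -o inode64,noatime,nodiratime /dev/sdh /data/sdh
# (cd /data/sdc && tar cf - Bonnie.26384) | (cd /data/sdh && tar xf -)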

So I conclude that XFS *does* try to keep file extents close to the
enclosing directory, but was being thwarted by the limitations of 32-bit
inodes.

There is a comment "performance sucks" at:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F

However, there it talks about files [extents?] being located close to their
inodes, rather than file extents being located close to their parent
directory.

Regards,

Brian.


root@storage1:~# find /data/sdh | head -50 | xargs xfs_bmap 
/data/sdh: no extents
/data/sdh/Bonnie.26384: no extents
/data/sdh/Bonnie.26384/00000:
	0: [0..7]: 1465133488..1465133495
/data/sdh/Bonnie.26384/00000/0icoeTRPHKX0000000000:
	0: [0..1015]: 1465134032..1465135047
/data/sdh/Bonnie.26384/00000/Q0000000001:
	0: [0..1543]: 1465135048..1465136591
/data/sdh/Bonnie.26384/00000/JFXQyeq6diG0000000002:
	0: [0..1295]: 1465136592..1465137887
/data/sdh/Bonnie.26384/00000/TK7ciXkkj0000000003:
	0: [0..1519]: 1465137888..1465139407
/data/sdh/Bonnie.26384/00000/0000000004:
	0: [0..1207]: 1465139408..1465140615
/data/sdh/Bonnie.26384/00000/acJKZWAwEnu0000000005:
	0: [0..1223]: 1465140616..1465141839
/data/sdh/Bonnie.26384/00000/9wIgxPKeI4B0000000006:
	0: [0..1319]: 1465141840..1465143159
/data/sdh/Bonnie.26384/00000/C6QLFdND0000000007:
	0: [0..1111]: 1465143160..1465144271
/data/sdh/Bonnie.26384/00000/6xc1Wydh0000000008:
	0: [0..1223]: 1465144272..1465145495
/data/sdh/Bonnie.26384/00000/0000000009:
	0: [0..1167]: 1465145496..1465146663
/data/sdh/Bonnie.26384/00000/HdlN0000000000a:
	0: [0..1535]: 1465146664..1465148199
/data/sdh/Bonnie.26384/00000/52IabyC5pvis000000000b:
	0: [0..1287]: 1465148200..1465149487
/data/sdh/Bonnie.26384/00000/LvDhxcdLf000000000c:
	0: [0..1583]: 1465149488..1465151071
/data/sdh/Bonnie.26384/00000/08P3JAR000000000d:
	0: [0..1255]: 1465151072..1465152327
/data/sdh/Bonnie.26384/00000/000000000e:
	0: [0..1095]: 1465152328..1465153423
/data/sdh/Bonnie.26384/00000/s0gtPGPecXu000000000f:
	0: [0..1319]: 1465153424..1465154743
/data/sdh/Bonnie.26384/00000/HFLOcN0000000010:
	0: [0..1503]: 1465154744..1465156247
/data/sdh/Bonnie.26384/00000/LQZly0000000011:
	0: [0..1591]: 1465156248..1465157839
/data/sdh/Bonnie.26384/00000/Cgx2O3Km9db0000000012:
	0: [0..1463]: 1465157840..1465159303
/data/sdh/Bonnie.26384/00000/QdqMvy30000000013:
	0: [0..1063]: 1465159304..1465160367
/data/sdh/Bonnie.26384/00000/kraVgKMdTiS60000000014:
	0: [0..1263]: 1465160368..1465161631
/data/sdh/Bonnie.26384/00000/qYaHGnrJm30000000015:
	0: [0..1575]: 1465161760..1465163335
/data/sdh/Bonnie.26384/00000/oJu9fLAncA0000000016:
	0: [0..1023]: 1465163336..1465164359
/data/sdh/Bonnie.26384/00000/gsTjmbcIoq0000000017:
	0: [0..1535]: 1465164360..1465165895
/data/sdh/Bonnie.26384/00000/0000000018:
	0: [0..1271]: 1465165896..1465167167
/data/sdh/Bonnie.26384/00000/Xu0000000019:
	0: [0..1199]: 1465167168..1465168367
/data/sdh/Bonnie.26384/00000/mbAF9Ow000000001a:
	0: [0..1479]: 1465168368..1465169847
/data/sdh/Bonnie.26384/00000/x2CVDC4MIM000000001b:
	0: [0..1319]: 1465169848..1465171167
/data/sdh/Bonnie.26384/00000/SYFSGTgs000000001c:
	0: [0..1239]: 1465171168..1465172407
/data/sdh/Bonnie.26384/00000/dA3oCdRjRmbm000000001d:
	0: [0..1551]: 1465172408..1465173959
/data/sdh/Bonnie.26384/00000/B000000001e:
	0: [0..1319]: 1465173960..1465175279
/data/sdh/Bonnie.26384/00000/p000000001f:
	0: [0..1559]: 1465175280..1465176839
/data/sdh/Bonnie.26384/00000/CaUyF0000000020:
	0: [0..1199]: 1465176840..1465178039
/data/sdh/Bonnie.26384/00000/xsCb0000000021:
	0: [0..1319]: 1465178040..1465179359
/data/sdh/Bonnie.26384/00000/IupKUGW4JNE80000000022:
	0: [0..1471]: 1465179360..1465180831
/data/sdh/Bonnie.26384/00000/DKBmSRy2Rt0000000023:
	0: [0..1399]: 1465180832..1465182231
/data/sdh/Bonnie.26384/00000/4dmLGnWw50000000024:
	0: [0..1247]: 1465182232..1465183479
/data/sdh/Bonnie.26384/00000/0000000025:
	0: [0..1495]: 1465183480..1465184975
/data/sdh/Bonnie.26384/00000/yPcS6O0000000026:
	0: [0..1223]: 1465184976..1465186199
/data/sdh/Bonnie.26384/00000/eNhPxu0000000027:
	0: [0..1471]: 1465186200..1465187671
/data/sdh/Bonnie.26384/00000/oGidZ0000000028:
	0: [0..1063]: 1465187672..1465188735
/data/sdh/Bonnie.26384/00000/5blq0000000029:
	0: [0..1151]: 1465188736..1465189887
/data/sdh/Bonnie.26384/00000/wlbSsioikgEY000000002a:
	0: [0..1159]: 1465189888..1465191047
/data/sdh/Bonnie.26384/00000/HKG6hYj000000002b:
	0: [0..1039]: 1465191048..1465192087
/data/sdh/Bonnie.26384/00000/FruCoPDzes000000002c:
	0: [0..1407]: 1465192088..1465193495
/data/sdh/Bonnie.26384/00000/puA70OD8U000000002d:
	0: [0..1247]: 1465193496..1465194743
/data/sdh/Bonnie.26384/00000/53Vpi1ueADH000000002e:
	0: [0..1063]: 1465194744..1465195807

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 21:01             ` Brian Candler
@ 2012-02-03 21:17               ` Brian Candler
  2012-02-05 22:50                 ` Dave Chinner
  2012-02-05 22:43               ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Brian Candler @ 2012-02-03 21:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Feb 03, 2012 at 09:01:14PM +0000, Brian Candler wrote:
> I created a fresh filesystem (/dev/sdh), default parameters, but mounted it
> with inode64.  Then I tar'd across my corpus of 100K files.  Result: files
> are located close to the directories they belong to, and read performance
> zooms.

Although perversely, keeping all the inodes at one end of the disk does
increase throughput with random reads, and also under high concurrency loads
(for this corpus of ~65GB anyway, maybe not true for a full disk)

-- original results: defaults without inode64 --

 #p  files/sec  dd_args
  1      43.57  bs=1024k
  1      43.29  bs=1024k [random]
  2      51.27  bs=1024k 
  2      48.17  bs=1024k [random]
  5      69.06  bs=1024k 
  5      63.41  bs=1024k [random]
 10      83.77  bs=1024k 
 10      77.28  bs=1024k [random]

-- defaults with inode64 --

 #p  files/sec  dd_args
  1     138.20  bs=1024k 
  1      30.32  bs=1024k [random]
  2      70.48  bs=1024k 
  2      27.25  bs=1024k [random]
  5      61.21  bs=1024k 
  5      35.42  bs=1024k [random]
 10      80.39  bs=1024k 
 10      45.17  bs=1024k [random]

Additionally, I see a noticeable boost in random read performance when using
-i size=1024 in conjunction with inode64, which I'd also like to understand:

-- inode64 *and* -i size=1024 --

 #p  files/sec  dd_args
  1     141.52  bs=1024k 
  1      38.95  bs=1024k [random]
  2      67.28  bs=1024k 
  2      42.15  bs=1024k [random]
  5      79.83  bs=1024k 
  5      57.76  bs=1024k [random]
 10      86.85  bs=1024k
 10      72.45  bs=1024k [random]

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 19:42         ` Stan Hoeppner
@ 2012-02-03 22:10           ` Brian Candler
  2012-02-04  9:59             ` Stan Hoeppner
  0 siblings, 1 reply; 30+ messages in thread
From: Brian Candler @ 2012-02-03 22:10 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, xfs

On Fri, Feb 03, 2012 at 01:42:54PM -0600, Stan Hoeppner wrote:
> You've hit the peak read rate of these Hitachi drives.  As others
> pointed out, if you need more read performance than the dozen of these
> you plan to RAID stripe, then you'll need to swap them for units with a
> faster spindle:
> 
> 7.2k 	 1.21x
>  10k	 1.68x
>  15k	 2.53x
> 
> or with SSDs, which will yield an order of magnitude increase.  Your
> stated need is 20M 500-800KB files, or 20GB if my math is correct.

Thanks for your suggestion, but unfortunately your maths isn't correct: 20M
x 0.65MB = 13TB.  And that's just one of many possible datasets like this.

I'm aware that I'm working with low-performance drives. This is intentional:
we need low power consumption so we can get lots in a rack, and large
capacity at low cost.

Fortunately our workload will also parallelise easily, and throwing it
across 24 spindles will be fine.  But obviously I want to squeeze the most
performance out of each spindle we have first.  I'm very happy to have found
the bottleneck that was troubling me :-)

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 22:10           ` Brian Candler
@ 2012-02-04  9:59             ` Stan Hoeppner
  2012-02-04 11:24               ` Brian Candler
  0 siblings, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2012-02-04  9:59 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs

On 2/3/2012 4:10 PM, Brian Candler wrote:
> On Fri, Feb 03, 2012 at 01:42:54PM -0600, Stan Hoeppner wrote:
>> You've hit the peak read rate of these Hitachi drives.  As others
>> pointed out, if you need more read performance than the dozen of these
>> you plan to RAID stripe, then you'll need to swap them for units with a
>> faster spindle:
>>
>> 7.2k 	 1.21x
>>  10k	 1.68x
>>  15k	 2.53x
>>
>> or with SSDs, which will yield an order of magnitude increase.  Your
>> stated need is 20M 500-800KB files, or 20GB if my math is correct.
> 
> Thanks for your suggestion, but unfortunately your maths isn't correct: 20M
> x 0.65MB = 13TB.  And that's just one of many possible datasets like this.

Wow, you're right.  How did I miss so many zeros?  Got in a hurry, I guess.

> I'm aware that I'm working with low-performance drives. This is intentional:
> we need low power consumption so we can get lots in a rack, and large
> capacity at low cost.

SSDs would fulfill criteria 1/2 but obviously not 3/4.

> Fortunately our workload will also parallelise easily, and throwing it
> across 24 spindles will be fine.  But obviously I want to squeeze the most
> performance out of each spindle we have first.  I'm very happy to have found
> the bottleneck that was troubling me :-)

Will you be using mdraid or hardware RAID across those 24 spindles?

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-04  9:59             ` Stan Hoeppner
@ 2012-02-04 11:24               ` Brian Candler
  2012-02-04 12:49                 ` Stan Hoeppner
  0 siblings, 1 reply; 30+ messages in thread
From: Brian Candler @ 2012-02-04 11:24 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, xfs

On Sat, Feb 04, 2012 at 03:59:08AM -0600, Stan Hoeppner wrote:
> Will you be using mdraid or hardware RAID across those 24 spindles?

Gluster is the front-runner at the moment. Each file sits on a single
spindle, and there is a separate filesystem per spindle, so I think the
parallel processing will work much better this way. This does mean double
the disks to get data replication though.

I did some testing of RAID6 mdraid (12 disks with 1MB stripe size) and
it sucked.  However I need to re-test it now that I know about inode64.
We do have a requirement for archival storage and that might use RAID6.

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-04 11:24               ` Brian Candler
@ 2012-02-04 12:49                 ` Stan Hoeppner
  2012-02-04 20:04                   ` Brian Candler
  0 siblings, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2012-02-04 12:49 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs

On 2/4/2012 5:24 AM, Brian Candler wrote:
> On Sat, Feb 04, 2012 at 03:59:08AM -0600, Stan Hoeppner wrote:
>> Will you be using mdraid or hardware RAID across those 24 spindles?
> 
> Gluster is the front-runner at the moment. Each file sits on a single
> spindle, and there is a separate filesystem per spindle, so I think the
> parallel processing will work much better this way. This does mean double
> the disks to get data replication though.

Apparently you've read of a different GlusterFS.  The one I know of is
for aggregating multiple storage hosts into a cloud storage resource.
It is not designed to replace striping or concatenation of disks within
a single host.

Even if what you describe can be done with Gluster, the performance will
likely be significantly less than a properly setup mdraid or hardware
raid.  Again, if it can be done, I'd test it head-to-head against RAID.

> I did some testing of RAID6 mdraid (12 disks with with 1MB stripe size) and
> it sucked.  However I need to re-test it now that I know about inode64.
> We do have a requirement for archival storage and that might use RAID6.

I've never been a fan of parity RAID, let alone double parity RAID.
SATA drives are so cheap (or were until the flooding in Thailand) that
it's really hard to justify RAID6 over RAID10 or a layered stripe over
mirror, given the many advantages of RAID10 and negligible
disadvantages.  The RAID6 dead drive rebuild time, and performance
degradation during the rebuild, on a production system with real users,
is enough justification to go RAID10, where that drive rebuild will take
many many hours less, if not days less, and degrade performance only mildly.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-04 12:49                 ` Stan Hoeppner
@ 2012-02-04 20:04                   ` Brian Candler
  2012-02-04 20:44                     ` Joe Landman
  2012-02-05  5:16                     ` Stan Hoeppner
  0 siblings, 2 replies; 30+ messages in thread
From: Brian Candler @ 2012-02-04 20:04 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, xfs

On Sat, Feb 04, 2012 at 06:49:23AM -0600, Stan Hoeppner wrote:
> Apparently you've read of a different GlusterFS.  The one I know of is
> for aggregating multiple storage hosts into a cloud storage resource.
> It is not designed to replace striping or concatenation of disks within
> a single host.

Sure it can. A gluster volume consists of "bricks". Each brick is served by
a glusterd process listening on a different TCP port. Those bricks can be on
the same server or on different servers.

> Even if what you describe can be done with Gluster, the performance will
> likely be significantly less than a properly setup mdraid or hardware
> raid.  Again, if it can be done, I'd test it head-to-head against RAID.

I'd expect similar throughput but higher latency. Given that I'm using low
RPM drives which already have high latency, I'm hoping the additional
latency will be insignificant.  Anyway, I'll know more once I've done the
measurements.

> I've never been a fan of parity RAID, let alone double parity RAID.

I'm with you on that one.

The attractions of gluster are:
- being able to scale a volume across many nodes, transparently to
  the clients
- being able to take a whole node out of service, while clients
  automatically flip over to the other

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-04 20:04                   ` Brian Candler
@ 2012-02-04 20:44                     ` Joe Landman
  2012-02-06 10:40                       ` Brian Candler
  2012-02-07 17:30                       ` Brian Candler
  2012-02-05  5:16                     ` Stan Hoeppner
  1 sibling, 2 replies; 30+ messages in thread
From: Joe Landman @ 2012-02-04 20:44 UTC (permalink / raw)
  To: xfs

On 02/04/2012 03:04 PM, Brian Candler wrote:
> On Sat, Feb 04, 2012 at 06:49:23AM -0600, Stan Hoeppner wrote:

[...]

> Sure it can. A gluster volume consists of "bricks". Each brick is served by
> a glusterd process listening on a different TCP port. Those bricks can be on
> the same server or on different servers.

I seem to remember that the Gluster folks abandoned this model (using 
their code versus MD raid) on single servers due to performance issues. 
  We did play with this a few times, and the performance wasn't that 
good.  Basically limited by single disk seek/write speed.

>
>> Even if what you describe can be done with Gluster, the performance will
>> likely be significantly less than a properly setup mdraid or hardware
>> raid.  Again, if it can be done, I'd test it head-to-head against RAID.
>
> I'd expect similar throughput but higher latency. Given that I'm using low

My recollection is that this wasn't the case.  Performance was 
suboptimal in all cases we tried.

> RPM drives which already have high latency, I'm hoping the additional
> latency will be insignificant.  Anyway, I'll know more once I've done the
> measurements.

I did this with the 3.0.x and the 2.x series of Gluster.  Usually atop 
xfs of some flavor.

>
>> I've never been a fan of parity RAID, let alone double parity RAID.
>
> I'm with you on that one.

RAID's entire purpose in life is to give an administrator time to run in 
and change a disk.  RAID isn't a backup, or even a guarantee of data 
retention.  Many do treat it this way though, to their (and their 
data's) peril.

> The attractions of gluster are:
> - being able to scale a volume across many nodes, transparently to
>    the clients

This does work, though rebalance is as much a function of the seek and 
bandwidth of the slowest link as other things.  So if you have 20 
drives, and you do a rebalance to add 5 more, it's gonna be slow for a 
while.

> - being able to take a whole node out of service, while clients
>    automatically flip over to the other
>

I hate to put it like this, but this is true for various definitions of 
the word "automatically".  You need to make sure that your definitions 
line up with the reality of "automatic".

If a brick goes away, and you have a file on this brick you want to 
overwrite, it doesn't (unless you have a mirror) flip over to another 
unit "automatically" or otherwise.

RAID in this case can protect you from some of these issues (single disk 
failure issues, being replaced by RAID issues), but unless you are 
building mirror pairs of bricks on separate units, this magical 
"automatic" isn't quite so.

Moreover, last I checked, Gluster made no guarantees as to the ordering 
of the layout for mirrors.  So if you have more than one brick per node, 
and build mirror pairs with the "replicate" option, you have to check 
the actual hashing to make sure it did what you expect.  Or build up the 
mirror pairs more carefully.

At this point, it sounds like there is a gluster side of this discussion 
that I'd recommend you take to the gluster list.  There is an xfs 
portion as well which is fine here.

Disclosure:  we build/sell/support gluster (and other) based systems 
atop xfs based RAID units (both hardware and software RAID; 
1,10,6,60,...) so we have inherent biases.  Your mileage may vary.  See 
your doctor if your re-balance exceeds 4 hours.

> Regards,
>
> Brian.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-04 20:04                   ` Brian Candler
  2012-02-04 20:44                     ` Joe Landman
@ 2012-02-05  5:16                     ` Stan Hoeppner
  2012-02-05  9:05                       ` Brian Candler
  1 sibling, 1 reply; 30+ messages in thread
From: Stan Hoeppner @ 2012-02-05  5:16 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs

On 2/4/2012 2:04 PM, Brian Candler wrote:
> On Sat, Feb 04, 2012 at 06:49:23AM -0600, Stan Hoeppner wrote:
>> Apparently you've read of a different GlusterFS.  The one I know of is
>> for aggregating multiple storage hosts into a cloud storage resource.
>> It is not designed to replace striping or concatenation of disks within
>> a single host.
> 
> Sure it can. A gluster volume consists of "bricks". Each brick is served by
> a glusterd process listening on a different TCP port. Those bricks can be on
> the same server or on different servers.

That's some interesting flexibility.  I'd never heard of the "intranode"
Gluster setup.  All the example configs I'd seen showed md/hardware RAID
with EXT4 atop, then EXT4 exported through Gluster.

>> Even if what you describe can be done with Gluster, the performance will
>> likely be significantly less than a properly setup mdraid or hardware
>> raid.  Again, if it can be done, I'd test it head-to-head against RAID.
> 
> I'd expect similar throughput but higher latency. Given that I'm using low
> RPM drives which already have high latency, I'm hoping the additional
> latency will be insignificant.  Anyway, I'll know more once I've done the
> measurements.

As they say, there's more than one way to skin a cat.

>> I've never been a fan of parity RAID, let alone double parity RAID.
> 
> I'm with you on that one.

When you lose a disk in this setup, how do you rebuild the replacement
drive?  Do you simply format it and then move 3TB of data across GbE
from other Gluster nodes?  Even if the disk is only 1/3rd full, such a
restore seems like an expensive and time consuming operation.  I'm
thinking RAID has a significant advantage here.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-05  5:16                     ` Stan Hoeppner
@ 2012-02-05  9:05                       ` Brian Candler
  0 siblings, 0 replies; 30+ messages in thread
From: Brian Candler @ 2012-02-05  9:05 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, xfs

On Sat, Feb 04, 2012 at 11:16:49PM -0600, Stan Hoeppner wrote:
> When you lose a disk in this setup, how do you rebuild the replacement
> drive?  Do you simply format it and then move 3TB of data across GbE
> from other Gluster nodes?

Basically, yes. When you read a file, it causes the mirror to synchronise
that particular file. To force the whole brick to come back into sync you
run find+stat across the whole filesystem.
http://download.gluster.com/pub/gluster/glusterfs/3.2/Documentation/AG/html/sect-Administration_Guide-Managing_Volumes-Self_heal.html
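
(i.e. something like the following, run against the gluster mount point
rather than the raw brick - a sketch; /mnt/glustervol is illustrative:)

# find /mnt/glustervol -print0 | xargs --null stat >/dev/null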

> Even if the disk is only 1/3rd full, such a
> restore seems like an expensive and time consuming operation.  I'm
> thinking RAID has a significant advantage here.

Well, if you lose a 3TB disk in a RAID-1 type setup, then the whole disk has
to be copied block by block (whether it contains data or not). So the
consideration here is network bandwidth.

I am building with 10GE, but even 1G would be just about sufficient to carry
the peak bandwidth of a single one of these disks.  (dd on the raw disk
gives 120MB/s at the start and 60MB/s at the end)

The whole manageability aspect certainly needs to be considered very
seriously though.  With RAID1 or RAID10, dealing with a failed disk is
pretty much pull and plug; with Gluster we'd be looking at having to mkfs
the new filesystem, mount it at the right place, and then run the self-heal. 
This will have to be weighed against the availability advantages of being
able to take an entire storage node out of service.

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 21:01             ` Brian Candler
  2012-02-03 21:17               ` Brian Candler
@ 2012-02-05 22:43               ` Dave Chinner
  1 sibling, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2012-02-05 22:43 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs

On Fri, Feb 03, 2012 at 09:01:14PM +0000, Brian Candler wrote:
> On Fri, Feb 03, 2012 at 02:03:04PM -0500, Christoph Hellwig wrote:
> > > With defaults, the files in one directory are spread all over the
> > > filesystem.  But with -i size=1024, the files in a directory are stored
> > > adjacent to each other. Hence reading all the files in one directory
> > > requires far less seeking across the disk, and runs about 3 times faster.
> > 
> > Not sure if you mentioned it somewhere before, but:
> > 
> >  a) how large is the filesystem?
> 
> 3TB.
> 
> >  b) do use the inode64 mount option
> 
> No: the only mount options I've given are noatime,nodiratime.
> 
> >  c) can you see the same good behaviour when using inode64 and small
> >     inodes (not that inode64 can NOT be set using remount)
> 
> I created a fresh filesystem (/dev/sdh), default parameters, but mounted it
> with inode64.  Then I tar'd across my corpus of 100K files.  Result: files
> are located close to the directories they belong to, and read performance
> zooms.
> 
> So I conclude that XFS *does* try to keep file extents close to the
> enclosing directory, but was being thwarted by the limitations of 32-bit
> inodes.
> 
> There is a comment "performance sucks" at:
> http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_inode64_mount_option_for.3F
> 
> However, there it talks about files [extents?] being located close to their
> inodes, rather than file extents being located close to their parent
> directory.

With inode64, inodes are located close to their parent directories'
inode, and file extent allocation is close to the owner's inode.
Hence file extent allocation is close to the parent directory inode,
too.

Directory inodes are where the locality changes - each new subdir is
placed in a different AG; combined with the above behaviour, that gives
you per-directory locality with inode64.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-03 21:17               ` Brian Candler
@ 2012-02-05 22:50                 ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2012-02-05 22:50 UTC (permalink / raw)
  To: Brian Candler; +Cc: Christoph Hellwig, xfs

On Fri, Feb 03, 2012 at 09:17:41PM +0000, Brian Candler wrote:
> On Fri, Feb 03, 2012 at 09:01:14PM +0000, Brian Candler wrote:
> > I created a fresh filesystem (/dev/sdh), default parameters, but mounted it
> > with inode64.  Then I tar'd across my corpus of 100K files.  Result: files
> > are located close to the directories they belong to, and read performance
> > zooms.
> 
> Although perversely, keeping all the inodes at one end of the disk does
> increase throughput with random reads, and also under high concurrency loads
> (for this corpus of ~65GB anyway, maybe not true for a full disk)
> 
> -- original results: defaults without inode64 --
> 
>  #p  files/sec  dd_args
>   1      43.57  bs=1024k
>   1      43.29  bs=1024k [random]
>   2      51.27  bs=1024k 
>   2      48.17  bs=1024k [random]
>   5      69.06  bs=1024k 
>   5      63.41  bs=1024k [random]
>  10      83.77  bs=1024k 
>  10      77.28  bs=1024k [random]
> 
> -- defaults with inode64 --
> 
>  #p  files/sec  dd_args
>   1     138.20  bs=1024k 
>   1      30.32  bs=1024k [random]
>   2      70.48  bs=1024k 
>   2      27.25  bs=1024k [random]
>   5      61.21  bs=1024k 
>   5      35.42  bs=1024k [random]
>  10      80.39  bs=1024k 
>  10      45.17  bs=1024k [random]
> 
> Additionally, I see a noticeable boost in random read performance when using
> -i size=1024 in conjunction with inode64, which I'd also like to understand:
> 
> -- inode64 *and* -i size=1024 --
> 
>  #p  files/sec  dd_args
>   1     141.52  bs=1024k 
>   1      38.95  bs=1024k [random]
>   2      67.28  bs=1024k 
>   2      42.15  bs=1024k [random]
>   5      79.83  bs=1024k 
>   5      57.76  bs=1024k [random]
>  10      86.85  bs=1024k
>  10      72.45  bs=1024k [random]

Directories probably take less IO to read because they remain in
short/extent form rather than moving to leaf/node (btree) format -
you can fit more extent records inline in the inode.
That means probably 1 IO less per random read. However, it has other
downsides, like requiring 4x as much IO to read and write the same
number of inodes when under memory pressure (e.g. when your app is
using 98% of RAM).

Basically, you are discovering how to tune your system for optimal
performance with a given set of bonnie++ parameters. Keep in mind
that's exactly what we suggest you -don't- do when tuning a
filesystem:

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-04 20:44                     ` Joe Landman
@ 2012-02-06 10:40                       ` Brian Candler
  2012-02-07 17:30                       ` Brian Candler
  1 sibling, 0 replies; 30+ messages in thread
From: Brian Candler @ 2012-02-06 10:40 UTC (permalink / raw)
  To: Joe Landman; +Cc: xfs

On Sat, Feb 04, 2012 at 03:44:25PM -0500, Joe Landman wrote:
> >Sure it can. A gluster volume consists of "bricks". Each brick is served by
> >a glusterd process listening on a different TCP port. Those bricks can be on
> >the same server or on different servers.
> 
> I seem to remember that the Gluster folks abandoned this model
> (using their code versus MD raid) on single servers due to
> performance issues.  We did play with this a few times, and the
> performance wasn't that good.  Basically limited by single disk
> seek/write speed.

I did raise the same question on the gluster-users list recently and there
seemed to be no clear-cut answer; some people were using Gluster to
aggregate RAID nodes, and some were using it to mirror individual disks
between nodes.

I do like the idea of having individual filesystems per disk, making data
recovery much more straightforward and allowing for efficient
parallelisation.

However I also like the idea of low-level RAID which lets you pop out and
replace a disk invisibly to the higher levels, and is perhaps better
battle-tested than gluster file-level replication.

> RAID in this case can protect you from some of these issues (single
> disk failure issues, being replaced by RAID issues), but unless you
> are building mirror pairs of bricks on separate units, this magical
> "automatic" isn't quite so.

That was the idea: having mirror bricks on different nodes.

server1:/brick1 <-> server2:/brick1
server1:/brick2 <-> server2:/brick2 etc

> Moreover, last I checked, Gluster made no guarantees as to the
> ordering of the layout for mirrors.  So if you have more than one
> brick per node, and build mirror pairs with the "replicate" option,
> you have to check the actual hashing to make sure it did what you
> expect.  Or build up the mirror pairs more carefully.

AFAICS it does guarantee the ordering:
http://download.gluster.com/pub/gluster/glusterfs/3.2/Documentation/AG/html/sect-Administration_Guide--Setting_Volumes-Distributed_Replicated.html

"Note: The number of bricks should be a multiple of the replica count for a
distributed replicated volume. Also, the order in which bricks are specified
has a great effect on data protection. Each replica_count consecutive bricks
in the list you give will form a replica set, with all replica sets combined
into a volume-wide distribute set. To make sure that replica-set members are
not placed on the same node, list the first brick on every server, then the
second brick on every server in the same order, and so on."
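
(So for the pairing I described above, presumably something along these
lines - a sketch; "myvol" is illustrative, and so on for the remaining
brick pairs:)

# gluster volume create myvol replica 2 transport tcp \
      server1:/brick1 server2:/brick1 \
      server1:/brick2 server2:/brick2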

> At this point, it sounds like there is a gluster side of this
> discussion that I'd recommend you take to the gluster list.  There
> is an xfs portion as well which is fine here.

Understood. Whatever the final solution looks like, I'm totally sold on XFS.

> Disclosure:  we build/sell/support gluster (and other) based systems
> atop xfs based RAID units (both hardware and software RAID;
> 1,10,6,60,...) so we have inherent biases.

You also have inherent experience, and that is extremely valuable as I try
to pick the best storage model which will work for us going forward.

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: Performance problem - reads slower than writes
  2012-02-04 20:44                     ` Joe Landman
  2012-02-06 10:40                       ` Brian Candler
@ 2012-02-07 17:30                       ` Brian Candler
  1 sibling, 0 replies; 30+ messages in thread
From: Brian Candler @ 2012-02-07 17:30 UTC (permalink / raw)
  To: Joe Landman; +Cc: xfs

On Sat, Feb 04, 2012 at 03:44:25PM -0500, Joe Landman wrote:
> >Sure it can. A gluster volume consists of "bricks". Each brick is served by
> >a glusterd process listening on a different TCP port. Those bricks can be on
> >the same server or on different servers.
> 
> I seem to remember that the Gluster folks abandoned this model
> (using their code versus MD raid) on single servers due to
> performance issues.  We did play with this a few times, and the
> performance wasn't that good.  Basically limited by single disk
> seek/write speed.

It does appear to scale up, although not as linearly as I'd like.

Here are some performance stats [1][2].
#p = number of concurrent client processes; files read first in sequence
and then randomly.

With a 12-brick distributed replicated volume (6 bricks each on 2 servers),
the servers connected by 10GE and the gluster volume mounted locally on one
of the servers:

 #p  files/sec  dd_args
  1      95.77  bs=1024k 
  1      24.42  bs=1024k [random]
  2     126.03  bs=1024k 
  2      43.53  bs=1024k [random]
  5     284.35  bs=1024k 
  5      82.23  bs=1024k [random]
 10     280.75  bs=1024k 
 10     146.47  bs=1024k [random]
 20     316.31  bs=1024k 
 20     209.67  bs=1024k [random]
 30     381.11  bs=1024k 
 30     241.55  bs=1024k [random]

With a 12-drive md raid10 "far" array, exported as a single brick and
accessed using glusterfs over 10GE:

 #p  files/sec  dd_args
  1     114.60  bs=1024k 
  1      38.58  bs=1024k [random]
  2     169.88  bs=1024k 
  2      70.68  bs=1024k [random]
  5     181.94  bs=1024k 
  5     141.74  bs=1024k [random]
 10     250.96  bs=1024k 
 10     209.76  bs=1024k [random]
 20     315.51  bs=1024k 
 20     277.99  bs=1024k [random]
 30     343.84  bs=1024k 
 30     316.24  bs=1024k [random]

This is a rather unfair comparison because the RAID10 "far" configuration
allows it to find all data on the first half of each drive, reducing the
seek times and giving faster read throughput.  Unsurprisingly, it wins on
all the random reads.
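
(The md array was created along these lines - a sketch; device names are
illustrative, and the chunk size is the 1MB mentioned below:)

# mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=1024 \
        --raid-devices=12 /dev/sd[b-m]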

For sequential reads with 5+ concurrent clients, the gluster distribution
wins (because of the locality of files to their directory)

In the limiting case, because the filesystems are independent you can read
off them separately and concurrently:

# for i in /brick{1..6}; do find $i | time cpio -o >/dev/null & done

This completed in 127 seconds for the entire corpus of 100,352 files (65GB
of data), i.e.  790 files/sec or 513MB/sec.  If your main use case was to be
able to copy or process all the files at once, this would win hands-down.

In fact, since the data is duplicated, we can read half the directories
from each disk in the pair.

root@storage1:~# for i in /brick{1..6}; do find $i | egrep '/[0-9]{4}[02468]/' | time cpio -o >/dev/null & done
root@storage2:~# for i in /brick{1..6}; do find $i | egrep '/[0-9]{4}[13579]/' | time cpio -o >/dev/null & done

This read the whole corpus in 69 seconds: i.e. 1454 files/sec or 945MB/s. 
Clearly you have to jump through some hoops to get this, but actually
reading through all the files (in any order) is an important use case for
us.

Maybe the RAID10 array could score better if I used a really big stripe size
- I'm using 1MB at the moment.

Regards,

Brian.

[1] Test script shown at
http://gluster.org/pipermail/gluster-users/2012-February/009585.html

[2] Tuned by:
gluster volume set <volname> performance.io-thread-count 32
and with the patch at
http://gluster.org/pipermail/gluster-users/2012-February/009590.html

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs



Thread overview: 30+ messages
2012-01-30 22:00 Performance problem - reads slower than writes Brian Candler
2012-01-31  2:05 ` Dave Chinner
2012-01-31 10:31   ` Brian Candler
2012-01-31 14:16     ` Brian Candler
2012-01-31 20:25       ` Dave Chinner
2012-02-01  7:29         ` Stan Hoeppner
2012-02-03 18:47         ` Brian Candler
2012-02-03 19:03           ` Christoph Hellwig
2012-02-03 21:01             ` Brian Candler
2012-02-03 21:17               ` Brian Candler
2012-02-05 22:50                 ` Dave Chinner
2012-02-05 22:43               ` Dave Chinner
2012-01-31 14:52     ` Christoph Hellwig
2012-01-31 21:52       ` Brian Candler
2012-02-01  0:50         ` Raghavendra D Prabhu
2012-02-01  3:59         ` Dave Chinner
2012-02-03 11:54       ` Brian Candler
2012-02-03 19:42         ` Stan Hoeppner
2012-02-03 22:10           ` Brian Candler
2012-02-04  9:59             ` Stan Hoeppner
2012-02-04 11:24               ` Brian Candler
2012-02-04 12:49                 ` Stan Hoeppner
2012-02-04 20:04                   ` Brian Candler
2012-02-04 20:44                     ` Joe Landman
2012-02-06 10:40                       ` Brian Candler
2012-02-07 17:30                       ` Brian Candler
2012-02-05  5:16                     ` Stan Hoeppner
2012-02-05  9:05                       ` Brian Candler
2012-01-31 20:06     ` Dave Chinner
2012-01-31 21:35       ` Brian Candler
