* Where is the performance bottleneck?
@ 2005-08-29 18:20 Holger Kiehl
2005-08-29 19:54 ` Mark Hahn
` (3 more replies)
0 siblings, 4 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-29 18:20 UTC (permalink / raw)
To: linux-raid; +Cc: linux-kernel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 8112 bytes --]
Hello
I have a system with the following setup:
Board is Tyan S4882 with AMD 8131 Chipset
4 Opterons 848 (2.2GHz)
8 GB DDR400 Ram (2GB for each CPU)
1 onboard Symbios Logic 53c1030 dual channel U320 controller
2 SATA disks put together as a SW Raid1 for system, swap and spares
8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
are on one channel and the other four (sdg, sdh, sdi, sdj) on
the other channel.
The U320 SCSI controller has a 64-bit PCI-X bus to itself; there is no other
device on that bus. Unfortunately I was unable to determine at what speed
it is running. Here is the output from lspci -vv:
02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
Interrupt: pin A routed to IRQ 217
Region 0: I/O ports at 3000 [size=256]
Region 1: Memory at fe010000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at fe000000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
Address: 0000000000000000 Data: 0000
Capabilities: [68] PCI-X non-bridge device.
Command: DPERE- ERO- RBC=2 OST=0
Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
02:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
Interrupt: pin B routed to IRQ 225
Region 0: I/O ports at 3400 [size=256]
Region 1: Memory at fe030000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at fe020000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
Address: 0000000000000000 Data: 0000
Capabilities: [68] PCI-X non-bridge device.
Command: DPERE- ERO- RBC=2 OST=0
Status: Bus=2 Dev=4 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple,
How does one determine the PCI-X bus speed?
Anyway, I thought with this system I would get theoretically 640 MB/s using
both channels. I tested several software raid setups to get the best possible
write speeds for this system. But testing shows that the absolute maximum I
can reach with software raid is only approx. 270 MB/s for writing, which is
very disappointing.
The tests were done with the 2.6.12.5 kernel from kernel.org, the scheduler is
deadline and the distribution is Fedora Core 4 x86_64 with all updates. The
chunk size is always the default from mdadm (64k). The filesystem was always
created with the command mke2fs -j -b4096 -O dir_index /dev/mdx.
I have also tried 2.6.13-rc7, but there the speed was much lower; the
maximum was approx. 140 MB/s for writing.
Here are some tests I did and the results with bonnie++:
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
Raid0 (8 disk)15744M 54406 96 247419 90 100752 25 60266 98 226651 29 830.2 1
Raid0s(4 disk)15744M 54915 97 253642 89 73976 18 59445 97 198372 24 659.8 1
Raid0s(4 disk)15744M 54866 97 268361 95 72852 17 59165 97 187183 22 666.3 1
Raid0p(4 disk)15744M 54017 96 149897 57 60202 15 59048 96 156887 20 381.8 1
Raid0p(4 disk)15744M 54771 98 156129 59 54130 14 58941 97 157543 20 520.3 1
Raid1+0 15744M 52496 94 202497 77 55928 14 60150 98 270509 34 930.2 1
Raid0+1 15744M 53927 95 194492 66 53430 15 49590 83 174313 30 884.7 1
Raid5 (8 disk)15744M 55881 98 153735 51 61680 24 56229 95 207348 44 741.2 1
Raid5s(4 disk)15744M 55238 98 81023 28 36859 14 56358 95 193030 38 605.7 1
Raid5s(4 disk)15744M 54920 97 83680 29 36551 14 56917 95 185345 35 599.8 1
Raid5p(4 disk)15744M 53681 95 54517 20 44932 17 54808 93 172216 33 371.1 1
Raid5p(4 disk)15744M 53856 96 55901 21 34737 13 55810 94 181825 36 607.7 1
/dev/sdc 15744M 53861 95 102270 35 25718 6 37273 60 76275 8 377.0 0
/dev/sdd 15744M 53575 95 96846 36 26209 6 37248 60 76197 9 378.4 0
/dev/sde 15744M 54398 94 87937 28 25540 6 36476 59 76520 8 380.4 0
/dev/sdf 15744M 53982 95 109192 38 26136 6 38516 63 76277 9 383.0 0
/dev/sdg 15744M 53880 95 102625 36 26458 6 37926 61 76538 9 399.1 0
/dev/sdh 15744M 53326 95 106447 39 26570 6 38129 62 76427 9 384.3 0
/dev/sdi 15744M 53103 94 96976 33 25632 6 36748 59 76658 8 386.4 0
/dev/sdj 15744M 53840 95 105521 39 26251 6 37146 60 76097 9 384.8 0
Raid1+0 - Four raid1's where the two disks of each raid1 are on different
channels. The setup was done as follows:
Raid1 /dev/md3 (sdc + sdg)
Raid1 /dev/md4 (sdd + sdh)
Raid1 /dev/md5 (sde + sdi)
Raid1 /dev/md6 (sdf + sdj)
Raid0 /dev/md7 (md3 + md4 + md5 + md6)
Raid0+1 - Raid1 over two raid0's, each having four disks:
Raid0 /dev/md3 (sdc + sdd + sde + sdf)
Raid0 /dev/md4 (sdg + sdh + sdi + sdj)
Raid1 /dev/md5 (md3 + md4)
Raid0s(4 disk) - Consists of Raid0 /dev/md3 (sdc + sdd + sde + sdf) or
Raid0 /dev/md4 (sdg + sdh + sdi + sdj); the tests were
done separately, once for md3 and then for md4.
Raid0p(4 disk) - Same as Raid0s(4 disk), only the tests for md3 and md4 were
done at the same time (in parallel).
Raid5s(4 disk) - Same as Raid0s(4 disk) only with Raid5.
Raid5p(4 disk) - Same as Raid0p(4 disk) only with Raid5.
Additional tests were done with a little C program (attached to this mail)
that I wrote a long time ago. It measures the time it takes to write a file
of the given size; the first result is without fsync() and the second result
includes fsync(). It is called with two parameters: the first is the file size
in kilobytes and the second is the blocksize in bytes. The program was always
started as follows:
fw 16121856 4096
I chose 4096 as the blocksize since this is the value suggested by stat()'s
st_blksize. With larger values the transfer rate increases.
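As a side note, here is a minimal sketch (illustrative only, the file argument
is just an example) of how that st_blksize hint can be queried:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
   struct stat st;

   if (argc != 2)
   {
      (void)fprintf(stderr, "Usage: %s <file>\n", argv[0]);
      return 1;
   }
   if (stat(argv[1], &st) == -1)
   {
      perror("stat");
      return 1;
   }
   /* st_blksize is the blocksize the kernel suggests for I/O on this file. */
   (void)printf("st_blksize = %ld\n", (long)st.st_blksize);
   return 0;
}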
Here are the results in MB/s:
Raid0 (8 disk) 203.017 191.649
Raid0s(4 disk) 200.331 166.129
Raid0s(4 disk) 198.013 165.465
Raid0p(4 disk) 143.781 118.832
Raid0p(4 disk) 146.592 117.703
Raid0+1 206.046 118.670
Raid5 (8 disk) 181.382 115.037
/dev/sdc 94.439 56.928
/dev/sdd 89.838 55.711
/dev/sde 84.391 51.545
/dev/sdf 87.549 57.368
/dev/sdg 92.847 57.799
/dev/sdh 94.615 58.678
/dev/sdi 89.030 54.945
/dev/sdj 91.344 56.899
Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the
bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
three times those numbers if you take the numbers from the individual disks.
What limit am I hitting here?
Thanks,
Holger
--
[-- Attachment #2: Type: TEXT/PLAIN, Size: 3993 bytes --]
/*****************************************************************************/
/* File Write Performance */
/* ====================== */
/*****************************************************************************/
#include <stdio.h> /* printf() */
#include <string.h> /* strcmp() */
#include <stdlib.h> /* exit(), atoi(), calloc(), free() */
#include <unistd.h> /* write(), sysconf(), close(), fsync() */
#include <sys/times.h> /* times(), struct tms */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdarg.h>
#define MAXLINE 4096
#define BUFSIZE 512
#define DEFAULT_FILE_SIZE 31457280
#define TEST_FILE "test.file"
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
static void err_doit(int, char *, va_list),
            err_quit(char *, ...),
            err_sys(char *, ...);
/*############################### main() ####################################*/
int
main(int argc, char *argv[])
{
   register int n,
                loops,
                rest;
   int          fd,
                oflag,
                blocksize = BUFSIZE;
   off_t        filesize = DEFAULT_FILE_SIZE;
   clock_t      start,
                end,
                syncend;
   long         clktck;
   char         *buf;
   struct tms   tmsdummy;

   if ((argc > 1) && (argc < 5))
   {
      filesize = (off_t)atoi(argv[1]) * 1024;
      if (argc == 3)
         blocksize = atoi(argv[2]);
      else if (argc == 4)
         err_quit("Usage: %s [filesize] [blocksize]", argv[0]);
   }
   else if (argc != 1)
      err_quit("Usage: %s [filesize] [blocksize]", argv[0]);

   if ((clktck = sysconf(_SC_CLK_TCK)) < 0)
      err_sys("sysconf error");

   /* If clktck=0 it doesn't make sense to run the test. */
   if (clktck == 0)
   {
      (void)printf("0\n");
      exit(0);
   }

   if ((buf = calloc(blocksize, sizeof(char))) == NULL)
      err_sys("calloc error");
   for (n = 0; n < blocksize; n++)
      buf[n] = 'T';

   loops = filesize / blocksize;
   rest = filesize % blocksize;
   oflag = O_WRONLY | O_CREAT;
   if ((fd = open(TEST_FILE, oflag, FILE_MODE)) < 0)
      err_quit("Could not open %s", TEST_FILE);

   if ((start = times(&tmsdummy)) == -1)
      err_sys("Could not get start time");
   for (n = 0; n < loops; n++)
      if (write(fd, buf, blocksize) != blocksize)
         err_sys("write error");
   if (rest > 0)
      if (write(fd, buf, rest) != rest)
         err_sys("write error");
   if ((end = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");
   (void)fsync(fd);
   if ((syncend = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");
   (void)close(fd);
   free(buf);

   /* First value: throughput without fsync(), second value: including fsync(). */
   (void)printf("%f %f\n", (double)filesize / ((double)(end - start) / (double)clktck),
                (double)filesize / ((double)(syncend - start) / (double)clktck));
   exit(0);
}
static void
err_sys(char *fmt, ...)
{
   va_list ap;

   va_start(ap, fmt);
   err_doit(1, fmt, ap);
   va_end(ap);
   exit(1);
}

static void
err_quit(char *fmt, ...)
{
   va_list ap;

   va_start(ap, fmt);
   err_doit(0, fmt, ap);
   va_end(ap);
   exit(1);
}

static void
err_doit(int errnoflag, char *fmt, va_list ap)
{
   int  errno_save;
   char buf[MAXLINE];

   errno_save = errno;
   (void)vsprintf(buf, fmt, ap);
   if (errnoflag)
      (void)sprintf(buf + strlen(buf), ": %s", strerror(errno_save));
   (void)strcat(buf, "\n");
   fflush(stdout);
   (void)fputs(buf, stderr);
   fflush(NULL); /* Flushes all stdio output streams */
   return;
}
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
@ 2005-08-29 19:54 ` Mark Hahn
2005-08-30 19:08 ` Holger Kiehl
2005-08-29 20:10 ` Al Boldi
` (2 subsequent siblings)
3 siblings, 1 reply; 42+ messages in thread
From: Mark Hahn @ 2005-08-29 19:54 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-raid, linux-kernel
> 8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
figure each is worth, say, 60 MB/s, so you'll peak (theoretically) at
240 MB/s per channel.
> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other
> device on that bus. Unfortunately I was unable to determine at what speed
> it is running; here is the output from lspci -vv:
...
> Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
the "133MHz+" is a good sign. OTOH the latency (72) seems rather low - my
understanding is that that would noticeably limit the size of burst transfers.
> Anyway, I thought with this system I would get theoretically 640 MB/s using
> both channels.
"theoretically" in the same sense as "according to quantum theory,
Bush and BinLadin may swap bodies tomorrow morning at 4:59."
> write speeds for this system. But testing shows that the absolute maximum I
> can reach with software raid is only approx. 270 MB/s for writing, which is
> very disappointing.
it's a bit low, but "very" is unrealistic...
> deadline and distribution is fedora core 4 x86_64 with all updates. Chunksize
> is always the default from mdadm (64k). Filesystem was always created with the
> command mke2fs -j -b4096 -O dir_index /dev/mdx.
bear in mind that a 64k chunksize means that an 8 disk raid5 will really
only work well for writes that are multiples of 7*64 = 448K...
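(As an aside, a minimal sketch of what such full-stripe writes could look
like; the md device, the O_DIRECT flag and the loop count are assumptions for
illustration, not something taken from the measurements above:)

#define _GNU_SOURCE                       /* for O_DIRECT */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define CHUNK      (64 * 1024)            /* md chunk size (mdadm default)   */
#define DATA_DISKS 7                      /* 8-disk raid5 -> 7 data disks    */
#define STRIPE     (DATA_DISKS * CHUNK)   /* full stripe: 7 * 64k = 448k     */

int main(void)
{
   char *buf;
   int  fd, i;

   /* Example device; O_DIRECT needs a suitably aligned buffer. */
   if ((fd = open("/dev/md3", O_WRONLY | O_DIRECT)) == -1)
      return 1;
   if (posix_memalign((void **)&buf, 4096, STRIPE) != 0)
      return 1;
   memset(buf, 'T', STRIPE);
   for (i = 0; i < 1024; i++)             /* 1024 full stripes = 448 MB      */
      if (write(fd, buf, STRIPE) != STRIPE)
         break;
   close(fd);
   free(buf);
   return 0;
}

Writes sized and aligned like this cover whole stripes, so the raid5 code can
compute parity without first reading the old data back in.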
> I also have tried with 2.6.13-rc7, but here the speed was much lower, the
> maximum there was approx. 140 MB/s for writting.
hmm, there should not have been any such dramatic slowdown.
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> Raid0 (8 disk)15744M 54406 96 247419 90 100752 25 60266 98 226651 29 830.2 1
> Raid0s(4 disk)15744M 54915 97 253642 89 73976 18 59445 97 198372 24 659.8 1
> Raid0s(4 disk)15744M 54866 97 268361 95 72852 17 59165 97 187183 22 666.3 1
you're obviously saturating something already with 2 disks. did you play
with "blockdev --setra" setings?
> Raid5 (8 disk)15744M 55881 98 153735 51 61680 24 56229 95 207348 44 741.2 1
> Raid5s(4 disk)15744M 55238 98 81023 28 36859 14 56358 95 193030 38 605.7 1
> Raid5s(4 disk)15744M 54920 97 83680 29 36551 14 56917 95 185345 35 599.8 1
the block-read shows that even with 3 disks, you're hitting ~190 MB/s,
which is pretty close to your actual disk speed. the low value for block-out
is probably just due to non-stripe writes needing R/M/W cycles.
> /dev/sdc 15744M 53861 95 102270 35 25718 6 37273 60 76275 8 377.0 0
the block-out is clearly distorted by buffer-cache (too high), but the
input rate is good and consistent. obviously, it'll fall off somewhat
towards inner tracks, but will probably still be above 50.
> Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the
> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
> three times those numbers if you take the numbers from the individual disks.
expecting 3x is unreasonable; 2x (480 or so) would be good.
I suspect that some (sw kernel) components are badly tuned for fast IO.
obviously, most machines are in the 50-100 MB/s range, so this is not
surprising. readahead is certainly one, but there are also magic numbers
in MD as well, not to mention PCI latency, scsi driver tuning, probably
even /proc/sys/vm settings.
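(For illustration, a minimal sketch of what "blockdev --setra/--getra" do
underneath; the device path and the 2048-sector value are just examples, and
setting it needs root:)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* BLKRAGET, BLKRASET */

int main(void)
{
   long ra;
   int  fd = open("/dev/md3", O_RDONLY);   /* example device */

   if (fd == -1)
      return 1;
   if (ioctl(fd, BLKRAGET, &ra) == 0)
      (void)printf("current readahead: %ld sectors\n", ra);
   /* 2048 sectors of 512 bytes = 1 MB readahead, like blockdev --setra 2048. */
   if (ioctl(fd, BLKRASET, 2048UL) == -1)
      perror("BLKRASET");
   close(fd);
   return 0;
}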
I've got some 4x2.6G opteron servers (same board, 32G PC3200), but alas,
end-users have found out about them. not to mention that they only have
3x160G SATA disks...
regards, mark hahn.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
2005-08-29 19:54 ` Mark Hahn
@ 2005-08-29 20:10 ` Al Boldi
2005-08-30 19:18 ` Holger Kiehl
2005-08-29 20:25 ` Vojtech Pavlik
2005-08-29 23:09 ` Peter Chubb
3 siblings, 1 reply; 42+ messages in thread
From: Al Boldi @ 2005-08-29 20:10 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-kernel, linux-raid
Holger Kiehl wrote:
> Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the
> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
> three times those numbers if you take the numbers from the individual
> disks.
>
> What limit am I hitting here?
You may be hitting a 2.6 kernel bug which has something to do with
readahead; ask Jens Axboe about it (see the "[git patches] IDE update" thread).
Sadly, 2.6.13 did not fix it either.
Did you try 2.4.31?
--
Al
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
2005-08-29 19:54 ` Mark Hahn
2005-08-29 20:10 ` Al Boldi
@ 2005-08-29 20:25 ` Vojtech Pavlik
2005-08-30 20:06 ` Holger Kiehl
2005-08-29 23:09 ` Peter Chubb
3 siblings, 1 reply; 42+ messages in thread
From: Vojtech Pavlik @ 2005-08-29 20:25 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-kernel
On Mon, Aug 29, 2005 at 06:20:56PM +0000, Holger Kiehl wrote:
> Hello
>
> I have a system with the following setup:
>
> Board is Tyan S4882 with AMD 8131 Chipset
> 4 Opterons 848 (2.2GHz)
> 8 GB DDR400 Ram (2GB for each CPU)
> 1 onboard Symbios Logic 53c1030 dual channel U320 controller
> 2 SATA disks put together as a SW Raid1 for system, swap and spares
> 8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
> are on one channel and the other four (sdg, sdh, sdi, sdj) on
> the other channel.
>
> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is
> no other device on that bus. Unfortunately I was unable to determine at
> what speed it is running; here is the output from lspci -vv:
> How does one determine the PCI-X bus speed?
Usually only the card (in your case the Symbios SCSI controller) can
tell. If it does, it'll be most likely in 'dmesg'.
> Anyway, I thought with this system I would get theoretically 640 MB/s using
> both channels.
You can never use the full theoretical bandwidth of the channel for
data. A lot of overhead remains for other signalling. Similarly for PCI.
> I tested several software raid setups to get the best possible write
> speeds for this system. But testing shows that the absolute maximum I
> can reach with software raid is only approx. 270 MB/s for writing.
> Which is very disappointing.
I'd expect somewhat better (in the 300-400 MB/s range), but this is not
too bad.
To find where the bottleneck is, I'd suggest trying without the
filesystem at all, and just filling a large part of the block device
using the 'dd' command.
Also, trying without the RAID, and just running 4 (and 8) concurrent
dd's to the separate drives could show whether it's the RAID that's
slowing things down.
> The tests were done with 2.6.12.5 kernel from kernel.org, scheduler
> is the deadline and distribution is fedora core 4 x86_64 with all
> updates. Chunksize is always the default from mdadm (64k). Filesystem
> was always created with the command mke2fs -j -b4096 -O dir_index
> /dev/mdx.
>
> I also have tried with 2.6.13-rc7, but here the speed was much lower,
> the maximum there was approx. 140 MB/s for writing.
Now that's very low.
--
Vojtech Pavlik
SuSE Labs, SuSE CR
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
` (2 preceding siblings ...)
2005-08-29 20:25 ` Vojtech Pavlik
@ 2005-08-29 23:09 ` Peter Chubb
3 siblings, 0 replies; 42+ messages in thread
From: Peter Chubb @ 2005-08-29 23:09 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-raid, linux-kernel
>>>>> "Holger" == Holger Kiehl <Holger.Kiehl@dwd.de> writes:
Holger> Hello I have a system with the following setup:
(4-way CPUs, 8 spindles on two controllers)
Try using XFS.
See http://scalability.gelato.org/DiskScalability_2fResults --- ext3
is single-threaded and tends not to get the full benefit of either the
multiple spindles or the multiple processors.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-29 19:54 ` Mark Hahn
@ 2005-08-30 19:08 ` Holger Kiehl
2005-08-30 23:05 ` Guy
0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 19:08 UTC (permalink / raw)
To: Mark Hahn; +Cc: linux-raid, linux-kernel
On Mon, 29 Aug 2005, Mark Hahn wrote:
>> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other
>> device on that bus. Unfortunately I was unable to determine at what speed
>> it is running; here is the output from lspci -vv:
> ...
>> Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
>
> the "133MHz+" is a good sign. OTOH the latency (72) seems rather low - my
> understanding is that that would noticeably limit the size of burst transfers.
>
I have tried with 128 and 144, but the transfer rate is only a little
bit higher, barely measurable. Or what values should I try?
>
>> Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
>> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
>> Raid0 (8 disk)15744M 54406 96 247419 90 100752 25 60266 98 226651 29 830.2 1
>> Raid0s(4 disk)15744M 54915 97 253642 89 73976 18 59445 97 198372 24 659.8 1
>> Raid0s(4 disk)15744M 54866 97 268361 95 72852 17 59165 97 187183 22 666.3 1
>
> you're obviously saturating something already with 2 disks. did you play
> with "blockdev --setra" setings?
>
Yes, I did play a little bit with it, but this only changed read performance;
it made no measurable difference when writing.
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-29 20:10 ` Al Boldi
@ 2005-08-30 19:18 ` Holger Kiehl
2005-08-31 10:30 ` Al Boldi
0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 19:18 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-kernel, linux-raid
On Mon, 29 Aug 2005, Al Boldi wrote:
> Holger Kiehl wrote:
>> Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the
>> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
>> three times those numbers if you take the numbers from the individual
>> disks.
>>
>> What limit am I hitting here?
>
> You may be hitting a 2.6 kernel bug, which has something to do with
> readahead, ask Jens Axboe about it! (see "[git patches] IDE update" thread)
> Sadly, 2.6.13 did not fix it either.
>
I did read that thread, but due to my limited understanding of kernel
code, I don't see the relation to my problem.
But I am willing to try any patches to solve the problem.
> Did you try 2.4.31?
>
No. Will give this a try if the problem is not found.
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-29 20:25 ` Vojtech Pavlik
@ 2005-08-30 20:06 ` Holger Kiehl
2005-08-31 7:11 ` Vojtech Pavlik
0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 20:06 UTC (permalink / raw)
To: Vojtech Pavlik; +Cc: linux-raid, linux-kernel
On Mon, 29 Aug 2005, Vojtech Pavlik wrote:
> On Mon, Aug 29, 2005 at 06:20:56PM +0000, Holger Kiehl wrote:
>> Hello
>>
>> I have a system with the following setup:
>>
>> Board is Tyan S4882 with AMD 8131 Chipset
>> 4 Opterons 848 (2.2GHz)
>> 8 GB DDR400 Ram (2GB for each CPU)
>> 1 onboard Symbios Logic 53c1030 dual channel U320 controller
>> 2 SATA disks put together as a SW Raid1 for system, swap and spares
>> 8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
>> are on one channel and the other four (sdg, sdh, sdi, sdj) on
>> the other channel.
>>
>> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is
>> no other device on that bus. Unfortunately I was unable to determine at
>> what speed it is running; here is the output from lspci -vv:
>
>> How does one determine the PCI-X bus speed?
>
> Usually only the card (in your case the Symbios SCSI controller) can
> tell. If it does, it'll be most likely in 'dmesg'.
>
There is nothing in dmesg:
Fusion MPT base driver 3.01.20
Copyright (c) 1999-2004 LSI Logic Corporation
ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator,Target}
ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator,Target}
Fusion MPT SCSI Host driver 3.01.20
>> Anyway, I thought with this system I would get theoretically 640 MB/s using
>> both channels.
>
> You can never use the full theoretical bandwidth of the channel for
> data. A lot of overhead remains for other signalling. Similarly for PCI.
>
>> I tested several software raid setups to get the best possible write
>> speeds for this system. But testing shows that the absolute maximum I
>> can reach with software raid is only approx. 270 MB/s for writing.
>> Which is very disappointing.
>
> I'd expect somewhat better (in the 300-400 MB/s range), but this is not
> too bad.
>
> To find where the bottleneck is, I'd suggest trying without the
> filesystem at all, and just filling a large part of the block device
> using the 'dd' command.
>
> Also, trying without the RAID, and just running 4 (and 8) concurrent
> dd's to the separate drives could show whether it's the RAID that's
> slowing things down.
>
Ok, I did run the following dd command in different combinations:
dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
Here are the results:
Each disk alone
/dev/sdc1 59.094636 MB/s
/dev/sdd1 58.686592 MB/s
/dev/sde1 55.282807 MB/s
/dev/sdf1 62.271240 MB/s
/dev/sdg1 60.872891 MB/s
/dev/sdh1 62.252781 MB/s
/dev/sdi1 59.145637 MB/s
/dev/sdj1 60.921119 MB/s
sdc + sdd in parallel (2 disks on same channel)
/dev/sdc1 42.512287 MB/s
/dev/sdd1 43.118483 MB/s
sdc + sdg in parallel (2 disks on different channels)
/dev/sdc1 42.938186 MB/s
/dev/sdg1 43.934779 MB/s
sdc + sdd + sde in parallel (3 disks on same channel)
/dev/sdc1 35.043501 MB/s
/dev/sdd1 35.686878 MB/s
/dev/sde1 34.580457 MB/s
Similar results for three disks (sdg + sdh + sdi) on the other channel
/dev/sdg1 36.381137 MB/s
/dev/sdh1 37.541758 MB/s
/dev/sdi1 35.834920 MB/s
sdc + sdd + sde + sdf in parallel (4 disks on same channel)
/dev/sdc1 31.432914 MB/s
/dev/sdd1 32.058752 MB/s
/dev/sde1 31.393455 MB/s
/dev/sdf1 33.208165 MB/s
And here for the four disks on the other channel
/dev/sdg1 31.873028 MB/s
/dev/sdh1 33.277193 MB/s
/dev/sdi1 31.910000 MB/s
/dev/sdj1 32.626744 MB/s
All 8 disks in parallel
/dev/sdc1 24.120545 MB/s
/dev/sdd1 24.419801 MB/s
/dev/sde1 24.296588 MB/s
/dev/sdf1 25.609548 MB/s
/dev/sdg1 24.572617 MB/s
/dev/sdh1 25.552590 MB/s
/dev/sdi1 24.575616 MB/s
/dev/sdj1 25.124165 MB/s
So from these results, I may assume that md is not the cause of the problem.
What comes as a big surprise is that I lose 25% performance with only
two disks, each on its own channel!
Is this normal? I wonder whether other people see the same problem with
other controllers or with this same one.
What can I do next to find out if this is a kernel, driver or hardware
problem?
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: Where is the performance bottleneck?
2005-08-30 19:08 ` Holger Kiehl
@ 2005-08-30 23:05 ` Guy
2005-09-28 20:04 ` Bill Davidsen
0 siblings, 1 reply; 42+ messages in thread
From: Guy @ 2005-08-30 23:05 UTC (permalink / raw)
To: 'Holger Kiehl', 'Mark Hahn'
Cc: 'linux-raid', 'linux-kernel'
In most of your results, your CPU usage is very high. Once you get to about
90% usage, you really can't do much else, unless you can improve the CPU
usage.
Guy
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Holger Kiehl
> Sent: Tuesday, August 30, 2005 3:09 PM
> To: Mark Hahn
> Cc: linux-raid; linux-kernel
> Subject: Re: Where is the performance bottleneck?
>
> On Mon, 29 Aug 2005, Mark Hahn wrote:
>
> >> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other
> >> device on that bus. Unfortunately I was unable to determine at what speed
> >> it is running; here is the output from lspci -vv:
> > ...
> >> Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
> >
> > the "133MHz+" is a good sign. OTOH the latency (72) seems rather low - my
> > understanding is that that would noticeably limit the size of burst transfers.
> >
> I have tried with 128 and 144, but the transfer rate is only a little
> bit higher, barely measurable. Or what values should I try?
>
> >
> >> Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
> >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> >> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> >> Raid0 (8 disk)15744M 54406 96 247419 90 100752 25 60266 98 226651 29 830.2 1
> >> Raid0s(4 disk)15744M 54915 97 253642 89 73976 18 59445 97 198372 24 659.8 1
> >> Raid0s(4 disk)15744M 54866 97 268361 95 72852 17 59165 97 187183 22 666.3 1
> >
> > you're obviously saturating something already with 2 disks. did you play
> > with "blockdev --setra" settings?
> >
> Yes, I did play a little bit with it, but this only changed read performance;
> it made no measurable difference when writing.
>
> Thanks,
> Holger
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-30 20:06 ` Holger Kiehl
@ 2005-08-31 7:11 ` Vojtech Pavlik
2005-08-31 7:26 ` Jens Axboe
2005-08-31 13:38 ` Holger Kiehl
0 siblings, 2 replies; 42+ messages in thread
From: Vojtech Pavlik @ 2005-08-31 7:11 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-raid, linux-kernel
On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
> >>How does one determine the PCI-X bus speed?
> >
> >Usually only the card (in your case the Symbios SCSI controller) can
> >tell. If it does, it'll be most likely in 'dmesg'.
> >
> There is nothing in dmesg:
>
> Fusion MPT base driver 3.01.20
> Copyright (c) 1999-2004 LSI Logic Corporation
> ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
> mptbase: Initiating ioc0 bringup
> ioc0: 53C1030: Capabilities={Initiator,Target}
> ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
> mptbase: Initiating ioc1 bringup
> ioc1: 53C1030: Capabilities={Initiator,Target}
> Fusion MPT SCSI Host driver 3.01.20
>
> >To find where the bottleneck is, I'd suggest trying without the
> >filesystem at all, and just filling a large part of the block device
> >using the 'dd' command.
> >
> >Also, trying without the RAID, and just running 4 (and 8) concurrent
> >dd's to the separate drives could show whether it's the RAID that's
> >slowing things down.
> >
> Ok, I did run the following dd command in different combinations:
>
> dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
I think a bs of 4k is way too small and will cause huge CPU overhead.
Can you try with something like 4M? Also, you can use /dev/full to avoid
the pre-zeroing.
> Here the results:
>
> Each disk alone
> /dev/sdc1 59.094636 MB/s
> /dev/sdd1 58.686592 MB/s
> /dev/sde1 55.282807 MB/s
> /dev/sdf1 62.271240 MB/s
> /dev/sdg1 60.872891 MB/s
> /dev/sdh1 62.252781 MB/s
> /dev/sdi1 59.145637 MB/s
> /dev/sdj1 60.921119 MB/s
> All 8 disks in parallel
> /dev/sdc1 24.120545 MB/s
> /dev/sdd1 24.419801 MB/s
> /dev/sde1 24.296588 MB/s
> /dev/sdf1 25.609548 MB/s
> /dev/sdg1 24.572617 MB/s
> /dev/sdh1 25.552590 MB/s
> /dev/sdi1 24.575616 MB/s
> /dev/sdj1 25.124165 MB/s
You're saturating some bus. It almost looks like it's the PCI-X,
although that should (if running at the full speed of the AMD8132) be able
to deliver up to 1GB/sec, so it SHOULD not be an issue.
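As a rough cross-check (back-of-the-envelope only, assuming the bus really
runs 64 bit wide at 133 MHz):

   64 bit x 133 MHz = 8 bytes x 133,000,000/s, roughly 1064 MB/s raw
   8 disks x ~60 MB/s = roughly 480 MB/s actually needed

so the PCI-X segment itself should indeed have plenty of headroom.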
> So from these results, I may assume that md is not the cause of the problem.
>
> What comes as a big surprise is that I lose 25% performance with only
> two disks, each on its own channel!
>
> Is this normal? I wonder if other people have the same problem with
> other controllers or the same.
No, I don't think this is OK.
> What can I do next to find out if this is a kernel, driver or hardware
> problem?
You need to find where the bottleneck is, by removing one possible
bottleneck at a time in your test.
--
Vojtech Pavlik
SuSE Labs, SuSE CR
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 7:11 ` Vojtech Pavlik
@ 2005-08-31 7:26 ` Jens Axboe
2005-08-31 11:54 ` Holger Kiehl
2005-08-31 13:38 ` Holger Kiehl
1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 7:26 UTC (permalink / raw)
To: Vojtech Pavlik; +Cc: Holger Kiehl, linux-raid, linux-kernel
On Wed, Aug 31 2005, Vojtech Pavlik wrote:
> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
> > >>How does one determine the PCI-X bus speed?
> > >
> > >Usually only the card (in your case the Symbios SCSI controller) can
> > >tell. If it does, it'll be most likely in 'dmesg'.
> > >
> > There is nothing in dmesg:
> >
> > Fusion MPT base driver 3.01.20
> > Copyright (c) 1999-2004 LSI Logic Corporation
> > ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
> > mptbase: Initiating ioc0 bringup
> > ioc0: 53C1030: Capabilities={Initiator,Target}
> > ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
> > mptbase: Initiating ioc1 bringup
> > ioc1: 53C1030: Capabilities={Initiator,Target}
> > Fusion MPT SCSI Host driver 3.01.20
> >
> > >To find where the bottleneck is, I'd suggest trying without the
> > >filesystem at all, and just filling a large part of the block device
> > >using the 'dd' command.
> > >
> > >Also, trying without the RAID, and just running 4 (and 8) concurrent
> > >dd's to the separate drives could show whether it's the RAID that's
> > >slowing things down.
> > >
> > Ok, I did run the following dd command in different combinations:
> >
> > dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
>
> I think a bs of 4k is way too small and will cause huge CPU overhead.
> Can you try with something like 4M? Also, you can use /dev/full to avoid
> the pre-zeroing.
That was my initial thought as well, but since he's writing the io side
should look correct. I doubt 8 dd's writing 4k chunks will gobble that
much CPU as to make this much difference.
Holger, we need vmstat 1 info while the dd's are running. A simple
profile would be nice as well, boot with profile=2 and do a readprofile
-r; run tests; readprofile > foo and send the first 50 lines of foo to
this list.
--
Jens Axboe
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-30 19:18 ` Holger Kiehl
@ 2005-08-31 10:30 ` Al Boldi
0 siblings, 0 replies; 42+ messages in thread
From: Al Boldi @ 2005-08-31 10:30 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-kernel, linux-raid
Holger Kiehl wrote:
> On Mon, 29 Aug 2005, Al Boldi wrote:
> > You may be hitting a 2.6 kernel bug, which has something to do with
> > readahead, ask Jens Axboe about it! (see "[git patches] IDE update"
> > thread) Sadly, 2.6.13 did not fix it either.
>
> I did read that thread, but due to my limited understanding of kernel
> code, I don't see the relation to my problem.
Basically the kernel is losing CPU cycles while accessing block devices.
The problem shows most when the CPU/disk ratio is low.
Throwing more CPU cycles at the problem may seemingly remove this bottleneck.
> But I am willing to try any patches to solve the problem.
No patches yet.
> > Did you try 2.4.31?
>
> No. Will give this a try if the problem is not found.
Keep us posted!
--
Al
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 7:26 ` Jens Axboe
@ 2005-08-31 11:54 ` Holger Kiehl
2005-08-31 12:07 ` Jens Axboe
2005-08-31 12:24 ` Nick Piggin
0 siblings, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 11:54 UTC (permalink / raw)
To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel
On Wed, 31 Aug 2005, Jens Axboe wrote:
> On Wed, Aug 31 2005, Vojtech Pavlik wrote:
>> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
>>>>> How does one determine the PCI-X bus speed?
>>>>
>>>> Usually only the card (in your case the Symbios SCSI controller) can
>>>> tell. If it does, it'll be most likely in 'dmesg'.
>>>>
>>> There is nothing in dmesg:
>>>
>>> Fusion MPT base driver 3.01.20
>>> Copyright (c) 1999-2004 LSI Logic Corporation
>>> ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>>> mptbase: Initiating ioc0 bringup
>>> ioc0: 53C1030: Capabilities={Initiator,Target}
>>> ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>>> mptbase: Initiating ioc1 bringup
>>> ioc1: 53C1030: Capabilities={Initiator,Target}
>>> Fusion MPT SCSI Host driver 3.01.20
>>>
>>>> To find where the bottleneck is, I'd suggest trying without the
>>>> filesystem at all, and just filling a large part of the block device
>>>> using the 'dd' command.
>>>>
>>>> Also, trying without the RAID, and just running 4 (and 8) concurrent
>>>> dd's to the separate drives could show whether it's the RAID that's
>>>> slowing things down.
>>>>
>>> Ok, I did run the following dd command in different combinations:
>>>
>>> dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
>>
>> I think a bs of 4k is way too small and will cause huge CPU overhead.
>> Can you try with something like 4M? Also, you can use /dev/full to avoid
>> the pre-zeroing.
>
> That was my initial thought as well, but since he's writing the io side
> should look correct. I doubt 8 dd's writing 4k chunks will gobble that
> much CPU as to make this much difference.
>
> Holger, we need vmstat 1 info while the dd's are running. A simple
> profile would be nice as well, boot with profile=2 and do a readprofile
> -r; run tests; readprofile > foo and send the first 50 lines of foo to
> this list.
>
Here is the vmstat output for 8 dd's, still with 4k blocksize:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
9 2 5244 38272 7738248 10400 0 0 3 11444 390 24 0 5 75 20
5 10 5244 30824 7747680 8684 0 0 0 265672 2582 1917 1 95 0 4
2 12 5244 30948 7747248 8708 0 0 0 222620 2858 292 0 33 0 67
4 10 5244 31072 7747516 8644 0 0 0 236400 3132 326 0 43 0 57
2 12 5244 31320 7747792 8512 0 0 0 250204 3225 285 0 37 0 63
1 13 5244 30948 7747412 8552 0 0 24 227600 3261 312 0 41 0 59
2 12 5244 32684 7746124 8616 0 0 0 235392 3219 274 0 32 0 68
1 13 5244 30948 7747940 8568 0 0 0 228020 3394 296 0 37 0 63
0 14 5244 31196 7747680 8624 0 0 0 232932 3389 300 0 32 0 68
3 12 5244 31072 7747904 8536 0 0 0 233096 3545 312 0 33 0 67
1 13 5244 31072 7747852 8520 0 0 0 226992 3381 290 0 31 0 69
1 13 5244 31196 7747704 8396 0 0 0 230112 3372 265 0 28 0 72
0 14 5244 31072 7747928 8512 0 0 0 240652 3491 295 0 33 0 67
3 13 5244 31072 7748104 8608 0 0 0 222944 3433 269 0 27 0 73
1 13 5244 31072 7748000 8508 0 0 0 207944 3470 294 0 28 0 72
0 14 5244 31072 7747980 8528 0 0 0 234608 3496 272 0 31 0 69
2 12 5244 31196 7748148 8496 0 0 0 228760 3480 280 0 28 0 72
0 14 5244 30948 7748568 8620 0 0 0 214372 3551 302 0 29 0 71
1 13 5244 31072 7748392 8524 0 0 0 226732 3494 284 0 29 0 71
0 14 5244 31072 7748004 8640 0 0 0 229628 3604 273 0 26 0 74
1 13 5244 30948 7748392 8660 0 0 0 212868 3563 266 0 28 0 72
1 13 5244 30948 7748600 8520 0 0 0 228244 3568 294 0 30 0 70
1 13 5244 31196 7748228 8416 0 0 0 221692 3543 258 0 27 0 73
1 13 5244 31072 7748192 8520 0 0 0 241040 3983 330 0 25 0 74
1 13 5244 31196 7748288 8560 0 0 0 217108 3676 276 0 28 0 72
.
.
.
This goes on up to the end.
.
.
.
0 3 5244 825096 6949252 8596 0 0 0 241244 2683 223 0 7 71 22
0 2 5244 825108 6949252 8596 0 0 0 229764 2683 214 0 7 73 20
0 3 5244 826348 6949252 8596 0 0 0 116840 2046 450 0 4 71 26
0 3 5244 826976 6949252 8596 0 0 0 141992 1887 97 0 4 73 23
0 3 5244 827100 6949252 8596 0 0 0 137716 1871 93 0 4 70 26
0 3 5244 827100 6949252 8596 0 0 0 137032 1894 96 0 4 75 21
0 3 5244 827224 6949252 8596 0 0 0 131332 1860 288 0 4 73 23
0 1 5244 1943732 5833756 8620 0 0 0 72404 1560 481 0 24 61 16
0 2 5244 1943732 5833756 8620 0 0 0 71680 1450 60 0 2 61 38
0 2 5244 1943736 5833756 8620 0 0 0 71680 1464 70 0 2 52 46
0 2 5244 1943736 5833756 8620 0 0 0 66560 1436 66 0 2 50 48
0 2 5244 1943984 5833756 8620 0 0 0 71680 1454 72 0 2 50 48
0 2 5244 1943984 5833756 8620 0 0 0 71680 1450 70 0 2 50 48
1 0 5244 2906484 4872176 8612 0 0 0 12760 1240 321 0 13 68 19
0 0 5244 3306732 4472300 8580 0 0 0 0 1109 31 0 9 91 0
0 0 5244 3306732 4472300 8580 0 0 0 0 1008 22 0 0 100 0
And here is the profile output (I assume you meant sorted):
3236497 total 1.4547
2507913 default_idle 52248.1875
158752 shrink_zone 43.3275
121584 copy_user_generic_c 3199.5789
34271 __wake_up_bit 713.9792
31131 __make_request 23.1629
22096 scsi_request_fn 18.4133
21915 rotate_reclaimable_page 80.5699
20641 end_buffer_async_write 86.0042
18701 __clear_user 292.2031
13562 __block_write_full_page 18.4266
12981 test_set_page_writeback 47.7243
10772 kmem_cache_free 96.1786
10216 unlock_page 159.6250
9492 free_hot_cold_page 32.9583
9478 add_to_page_cache 45.5673
9117 page_waitqueue 81.4018
8671 drop_buffers 38.7098
8584 __set_page_dirty_nobuffers 31.5588
8444 release_pages 23.9886
8204 scsi_dispatch_cmd 14.2431
8191 buffered_rmqueue 11.6349
7966 page_referenced 22.6307
7093 generic_file_buffered_write 4.1431
6953 __pagevec_lru_add 28.9708
6740 __alloc_pages 5.6926
6369 __end_that_request_first 11.7077
5940 dnotify_parent 30.9375
5880 kmem_cache_alloc 91.8750
5797 submit_bh 19.0691
4720 find_lock_page 21.0714
4612 __generic_file_aio_write_nolock 4.8042
4559 __do_softirq 20.3527
4337 end_page_writeback 54.2125
4090 create_empty_buffers 25.5625
3985 bio_alloc_bioset 9.2245
3787 mempool_alloc 12.4572
3708 set_page_refs 231.7500
3545 __block_commit_write 17.0433
3037 system_call 23.1832
2968 zone_watermark_ok 15.4583
2966 cond_resched 26.4821
2828 generic_make_request 4.7770
2766 __mod_page_state 86.4375
2759 fget_light 15.6761
2692 test_clear_page_dirty 11.2167
2523 vfs_write 8.2993
2406 generic_file_aio_write_nolock 15.0375
2335 bio_put 36.4844
2287 bad_range 23.8229
Under ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/ I put the full vmstat
and profile output (also with -v). There is also dmesg and my kernel.config
from this system.
I will also do some tests with 4M instead of 4k and, as Al Boldi hinted,
a test together with some CPU load.
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 11:54 ` Holger Kiehl
@ 2005-08-31 12:07 ` Jens Axboe
2005-08-31 13:55 ` Holger Kiehl
2005-08-31 12:24 ` Nick Piggin
1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 12:07 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel
On Wed, Aug 31 2005, Holger Kiehl wrote:
> >>>Ok, I did run the following dd command in different combinations:
> >>>
> >>> dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
> >>
> >>I think a bs of 4k is way too small and will cause huge CPU overhead.
> >>Can you try with something like 4M? Also, you can use /dev/full to avoid
> >>the pre-zeroing.
> >
> >That was my initial thought as well, but since he's writing the io side
> >should look correct. I doubt 8 dd's writing 4k chunks will gobble that
> >much CPU as to make this much difference.
> >
> >Holger, we need vmstat 1 info while the dd's are running. A simple
> >profile would be nice as well, boot with profile=2 and do a readprofile
> >-r; run tests; readprofile > foo and send the first 50 lines of foo to
> >this list.
> >
> Here vmstat for 8 dd's still with 4k blocksize:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 9 2 5244 38272 7738248 10400 0 0 3 11444 390 24 0 5 75 20
> 5 10 5244 30824 7747680 8684 0 0 0 265672 2582 1917 1 95 0 4
> 2 12 5244 30948 7747248 8708 0 0 0 222620 2858 292 0 33 0 67
> 4 10 5244 31072 7747516 8644 0 0 0 236400 3132 326 0 43 0 57
> 2 12 5244 31320 7747792 8512 0 0 0 250204 3225 285 0 37 0 63
> 1 13 5244 30948 7747412 8552 0 0 24 227600 3261 312 0 41 0 59
> 2 12 5244 32684 7746124 8616 0 0 0 235392 3219 274 0 32 0 68
[snip]
Looks as expected, nothing too excessive showing up. About 30-40% sys
time, but it should not bog the machine down that much.
> And here the profile output (I assume you meant sorted):
I did, thanks :)
> 3236497 total 1.4547
> 2507913 default_idle 52248.1875
> 158752 shrink_zone 43.3275
> 121584 copy_user_generic_c 3199.5789
> 34271 __wake_up_bit 713.9792
> 31131 __make_request 23.1629
> 22096 scsi_request_fn 18.4133
> 21915 rotate_reclaimable_page 80.5699
> 20641 end_buffer_async_write 86.0042
> 18701 __clear_user 292.2031
Nothing sticks out here either. There's plenty of idle time. It smells
like a driver issue. Can you try the same dd test, but read from the
drives instead? Use a bigger blocksize here, 128 or 256k.
You might want to try the same with direct io, just to eliminate the
costly user copy. I don't expect it to make much of a difference though,
feels like the problem is elsewhere (driver, most likely).
If we still can't get closer to this, it would be interesting to try my
block tracing stuff so we can see what is going on at the queue level.
But lets gather some more info first, since it requires testing -mm.
--
Jens Axboe
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 11:54 ` Holger Kiehl
2005-08-31 12:07 ` Jens Axboe
@ 2005-08-31 12:24 ` Nick Piggin
2005-08-31 16:25 ` Holger Kiehl
1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-08-31 12:24 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
Holger Kiehl wrote:
> 3236497 total 1.4547
> 2507913 default_idle 52248.1875
> 158752 shrink_zone 43.3275
> 121584 copy_user_generic_c 3199.5789
> 34271 __wake_up_bit 713.9792
> 31131 __make_request 23.1629
> 22096 scsi_request_fn 18.4133
> 21915 rotate_reclaimable_page 80.5699
^^^^^^^^^
I don't think this function should be here. This indicates that
lots of writeout is happening due to pages falling off the end
of the LRU.
There was a bug recently causing memory estimates to be wrong
on Opterons that could cause this I think.
Can you send in 2 dumps of /proc/vmstat taken 10 seconds apart
while you're writing at full speed (with 2.6.13 or the latest
-git tree).
A dump of /proc/zoneinfo and /proc/meminfo while the write is
going on would be helpful too.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 7:11 ` Vojtech Pavlik
2005-08-31 7:26 ` Jens Axboe
@ 2005-08-31 13:38 ` Holger Kiehl
1 sibling, 0 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 13:38 UTC (permalink / raw)
To: Vojtech Pavlik; +Cc: linux-raid, linux-kernel
On Wed, 31 Aug 2005, Vojtech Pavlik wrote:
> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
>>>> How does one determine the PCI-X bus speed?
>>>
>>> Usually only the card (in your case the Symbios SCSI controller) can
>>> tell. If it does, it'll be most likely in 'dmesg'.
>>>
>> There is nothing in dmesg:
>>
>> Fusion MPT base driver 3.01.20
>> Copyright (c) 1999-2004 LSI Logic Corporation
>> ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>> mptbase: Initiating ioc0 bringup
>> ioc0: 53C1030: Capabilities={Initiator,Target}
>> ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>> mptbase: Initiating ioc1 bringup
>> ioc1: 53C1030: Capabilities={Initiator,Target}
>> Fusion MPT SCSI Host driver 3.01.20
>>
>>> To find where the bottleneck is, I'd suggest trying without the
>>> filesystem at all, and just filling a large part of the block device
>>> using the 'dd' command.
>>>
>>> Also, trying without the RAID, and just running 4 (and 8) concurrent
>>> dd's to the separate drives could show whether it's the RAID that's
>>> slowing things down.
>>>
>> Ok, I did run the following dd command in different combinations:
>>
>> dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
>
> I think a bs of 4k is way too small and will cause huge CPU overhead.
> Can you try with something like 4M? Also, you can use /dev/full to avoid
> the pre-zeroing.
>
Ok, I now use the following command:
dd if=/dev/full of=/dev/sd?1 bs=4M count=4883
Here are the results for all 8 disks in parallel:
/dev/sdc1 24.957257 MB/s
/dev/sdd1 25.290177 MB/s
/dev/sde1 25.046711 MB/s
/dev/sdf1 26.369777 MB/s
/dev/sdg1 24.080695 MB/s
/dev/sdh1 25.008803 MB/s
/dev/sdi1 24.202202 MB/s
/dev/sdj1 24.712840 MB/s
A little bit faster but not much.
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 12:07 ` Jens Axboe
@ 2005-08-31 13:55 ` Holger Kiehl
2005-08-31 14:24 ` Dr. David Alan Gilbert
2005-08-31 16:20 ` Jens Axboe
0 siblings, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 13:55 UTC (permalink / raw)
To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel
On Wed, 31 Aug 2005, Jens Axboe wrote:
> Nothing sticks out here either. There's plenty of idle time. It smells
> like a driver issue. Can you try the same dd test, but read from the
> drives instead? Use a bigger blocksize here, 128 or 256k.
>
I used the following command reading from all 8 disks in parallel:
dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
Here is the vmstat output (I just cut something out in the middle):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
3 7 4348 42640 7799984 9612 0 0 322816 0 3532 4987 0 22 0 78
1 7 4348 42136 7800624 9584 0 0 322176 0 3526 4987 0 23 4 74
0 8 4348 39912 7802648 9668 0 0 322176 0 3525 4955 0 22 12 66
1 7 4348 38912 7803700 9636 0 0 322432 0 3526 5078 0 23 7 70
2 6 4348 37552 7805120 9644 0 0 322432 0 3527 4908 0 23 12 64
0 8 4348 41152 7801552 9608 0 0 322176 0 3524 5018 0 24 6 70
1 7 4348 41644 7801044 9572 0 0 322560 0 3530 5175 0 23 0 76
1 7 4348 37184 7805396 9640 0 0 322176 0 3525 4914 0 24 18 59
3 7 4348 41704 7800376 9832 0 0 322176 20 3531 5080 0 23 4 73
1 7 4348 40652 7801700 9732 0 0 323072 0 3533 5115 0 24 13 64
1 7 4348 40284 7802224 9616 0 0 322560 0 3527 4967 0 23 1 76
0 8 4348 40156 7802356 9688 0 0 322560 0 3528 5080 0 23 2 75
6 8 4348 41896 7799984 9816 0 0 322176 0 3530 4945 0 24 20 57
0 8 4348 39540 7803124 9600 0 0 322560 0 3529 4811 0 24 21 55
1 7 4348 41520 7801084 9600 0 0 322560 0 3532 4843 0 23 22 55
0 8 4348 40408 7802116 9588 0 0 322560 0 3527 5010 0 23 4 72
0 8 4348 38172 7804300 9580 0 0 322176 0 3526 4992 0 24 7 69
4 7 4348 42264 7799784 9812 0 0 322688 0 3529 5003 0 24 8 68
1 7 4348 39908 7802520 9660 0 0 322700 0 3529 4963 0 24 14 62
0 8 4348 37428 7805076 9620 0 0 322420 0 3528 4967 0 23 15 62
0 8 4348 37056 7805348 9688 0 0 322048 0 3525 4982 0 24 26 50
1 7 4348 37804 7804456 9696 0 0 322560 0 3528 5072 0 24 16 60
0 8 4348 38416 7804084 9660 0 0 323200 0 3533 5081 0 24 23 53
0 8 4348 40160 7802300 9676 0 0 323200 28 3543 5095 0 24 17 59
1 7 4348 37928 7804612 9608 0 0 323072 0 3532 5175 0 24 7 68
2 6 4348 38680 7803724 9612 0 0 322944 0 3531 4906 0 25 24 51
1 7 4348 40408 7802192 9648 0 0 322048 0 3524 4947 0 24 19 57
Full vmstat session can be found under:
ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/vmstat-256k-read
And here is the profile data:
2106577 total 0.9469
1638177 default_idle 34128.6875
179615 copy_user_generic_c 4726.7105
27670 end_buffer_async_read 108.0859
26055 shrink_zone 7.1111
23199 __make_request 17.2612
17221 kmem_cache_free 153.7589
11796 drop_buffers 52.6607
11016 add_to_page_cache 52.9615
9470 __wake_up_bit 197.2917
8760 buffered_rmqueue 12.4432
8646 find_get_page 90.0625
8319 __do_page_cache_readahead 11.0625
7976 kmem_cache_alloc 124.6250
7463 scsi_request_fn 6.2192
7208 try_to_free_buffers 40.9545
6716 create_empty_buffers 41.9750
6432 __end_that_request_first 11.8235
6044 test_clear_page_dirty 25.1833
5643 scsi_dispatch_cmd 9.7969
5588 free_hot_cold_page 19.4028
5479 submit_bh 18.0230
3903 __alloc_pages 3.2965
3671 file_read_actor 9.9755
3425 thread_return 14.2708
3333 generic_make_request 5.6301
3294 bio_alloc_bioset 7.6250
2868 bio_put 44.8125
2851 mpt_interrupt 2.8284
2697 mempool_alloc 8.8717
2642 block_read_full_page 3.9315
2512 do_generic_mapping_read 2.1216
2394 set_page_refs 149.6250
2235 alloc_page_buffers 9.9777
1992 __pagevec_lru_add 8.3000
1859 __memset 9.6823
1791 page_waitqueue 15.9911
1783 scsi_end_request 6.9648
1348 dma_unmap_sg 6.4808
1324 bio_endio 11.8214
1306 unlock_page 20.4062
1211 mptscsih_freeChainBuffers 7.5687
1141 alloc_pages_current 7.9236
1136 __mod_page_state 35.5000
1116 radix_tree_preload 8.7188
1061 __pagevec_release_nonlru 6.6312
1043 set_bh_page 9.3125
1024 release_pages 2.9091
1023 mempool_free 6.3937
832 alloc_buffer_head 13.0000
Full profile data can be found under:
ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/dd-256k-8disk-read.profile
> You might want to try the same with direct io, just to eliminate the
> costly user copy. I don't expect it to make much of a difference though,
> feels like the problem is elsewhere (driver, most likely).
>
Sorry, I don't know how to do this. Do you mean using a C program
that sets some flag to do direct io, or how can I do that?
> If we still can't get closer to this, it would be interesting to try my
> block tracing stuff so we can see what is going on at the queue level.
> But lets gather some more info first, since it requires testing -mm.
>
Ok, please then just tell me what I must do.
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 13:55 ` Holger Kiehl
@ 2005-08-31 14:24 ` Dr. David Alan Gilbert
2005-08-31 20:56 ` Holger Kiehl
2005-08-31 16:20 ` Jens Axboe
1 sibling, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2005-08-31 14:24 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-raid, linux-kernel
* Holger Kiehl (Holger.Kiehl@dwd.de) wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
>
> Full vmstat session can be found under:
Have you got iostat? iostat -x 10 might be interesting to see
for a period while it is going.
Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 16:20 ` Jens Axboe
@ 2005-08-31 15:16 ` jmerkey
2005-08-31 16:58 ` Tom Callahan
2005-08-31 17:11 ` Jens Axboe
2005-08-31 16:51 ` Holger Kiehl
1 sibling, 2 replies; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:16 UTC (permalink / raw)
To: Jens Axboe; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel
I have seen an 80 MB/s limitation in the kernel unless these values are
changed in the SCSI I/O layer
for 3Ware and other controllers during testing of 2.6.X series kernels.
Change these values in include/linux/blkdev.h and performance goes from
80 MB/s to over 670 MB/s on the 3Ware controller.
//#define BLKDEV_MIN_RQ 4
//#define BLKDEV_MAX_RQ 128 /* Default maximum */
#define BLKDEV_MIN_RQ 4096
#define BLKDEV_MAX_RQ 8192 /* Default maximum */
Jeff
Jens Axboe wrote:
>On Wed, Aug 31 2005, Holger Kiehl wrote:
>
>
>>On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>>
>>
>>>Nothing sticks out here either. There's plenty of idle time. It smells
>>>like a driver issue. Can you try the same dd test, but read from the
>>>drives instead? Use a bigger blocksize here, 128 or 256k.
>>>
>>>
>>>
>>I used the following command reading from all 8 disks in parallel:
>>
>> dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>
>>Here vmstat output (I just cut something out in the middle):
>>
>>procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>> r b swpd free buff cache si so bi bo in cs us sy id wa
>> 3 7 4348 42640 7799984 9612 0 0 322816 0 3532 4987 0 22 0 78
>> 1 7 4348 42136 7800624 9584 0 0 322176 0 3526 4987 0 23 4 74
>> 0 8 4348 39912 7802648 9668 0 0 322176 0 3525 4955 0 22 12 66
>> 1 7 4348 38912 7803700 9636 0 0 322432 0 3526 5078 0 23 7 70
>>
>>
>
>Ok, so that's somewhat better than the writes but still off from what
>the individual drives can do in total.
>
>
>
>>>You might want to try the same with direct io, just to eliminate the
>>>costly user copy. I don't expect it to make much of a difference though,
>>>feels like the problem is elsewhere (driver, most likely).
>>>
>>>
>>>
>>Sorry, I don't know how to do this. Do you mean using a C program
>>that sets some flag to do direct io, or how can I do that?
>>
>>
>
>I've attached a little sample for you, just run ala
>
># ./oread /dev/sdX
>
>and it will read 128k chunks direct from that device. Run on the same
>drives as above, reply with the vmstat info again.
>
>
>
>------------------------------------------------------------------------
>
>#include <stdio.h>
>#include <stdlib.h>
>#define __USE_GNU
>#include <fcntl.h>
>#include <stdlib.h>
>#include <unistd.h>
>
>#define BS (131072)
>#define ALIGN(buf) (char *) (((unsigned long) (buf) + 4095) & ~(4095))
>#define BLOCKS (8192)
>
>int main(int argc, char *argv[])
>{
> char *p;
> int fd, i;
>
> if (argc < 2) {
> printf("%s: <dev>\n", argv[0]);
> return 1;
> }
>
> fd = open(argv[1], O_RDONLY | O_DIRECT);
> if (fd == -1) {
> perror("open");
> return 1;
> }
>
> p = ALIGN(malloc(BS + 4095));
> for (i = 0; i < BLOCKS; i++) {
> int r = read(fd, p, BS);
>
> if (r == BS)
> continue;
> else {
> if (r == -1)
> perror("read");
>
> break;
> }
> }
>
> return 0;
>}
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 16:58 ` Tom Callahan
@ 2005-08-31 15:47 ` jmerkey
0 siblings, 0 replies; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:47 UTC (permalink / raw)
To: Tom Callahan
Cc: Jens Axboe, Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel
I'll try this approach as well. On 2.4.x kernels I had to change
nr_requests to get good performance, but I noticed it didn't seem to
work as well on 2.6.x. I'll retry the nr_requests change on 2.6.x.
Thanks
Jeff
Tom Callahan wrote:
>From linux-kernel mailing list.....
>
>Don't do this. BLKDEV_MIN_RQ sets the size of the mempool of reserved
>requests, which will only get used in low-memory conditions, so
>most of that memory will probably be wasted.....
>
>Change /sys/block/xxx/queue/nr_requests
>
>Tom Callahan
>TESSCO Technologies
>(443)-506-6216
>callahant@tessco.com
>
>
>
>jmerkey wrote:
>
>
>
>>I have seen an 80MB/s limitation in the kernel unless this value is
>>changed in the SCSI I/O layer
>>for 3Ware and other controllers during testing of 2.6.X series kernels.
>>
>>Change these values in include/linux/blkdev.h and performance goes from
>>80MB/S to over 670MB/S on the 3Ware controller.
>>
>>
>>//#define BLKDEV_MIN_RQ 4
>>//#define BLKDEV_MAX_RQ 128 /* Default maximum */
>>#define BLKDEV_MIN_RQ 4096
>>#define BLKDEV_MAX_RQ 8192 /* Default maximum */
>>
>>Jeff
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 17:11 ` Jens Axboe
@ 2005-08-31 15:59 ` jmerkey
2005-08-31 17:32 ` Jens Axboe
0 siblings, 1 reply; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:59 UTC (permalink / raw)
To: Jens Axboe; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel
512 is not enough. It has to be larger. I just tried 512 and it still
limits the data rates.
Jeff
Jens Axboe wrote:
>On Wed, Aug 31 2005, jmerkey wrote:
>
>
>>I have seen an 80MB/s limitation in the kernel unless this value is
>>changed in the SCSI I/O layer
>>for 3Ware and other controllers during testing of 2.6.X series kernels.
>>
>>Change these values in include/linux/blkdev.h and performance goes from
>>80MB/S to over 670MB/S on the 3Ware controller.
>>
>>
>>//#define BLKDEV_MIN_RQ 4
>>//#define BLKDEV_MAX_RQ 128 /* Default maximum */
>>#define BLKDEV_MIN_RQ 4096
>>#define BLKDEV_MAX_RQ 8192 /* Default maximum */
>>
>>
>
>That's insane, you just wasted 1MiB of preallocated requests on each
>queue in the system!
>
>Please just do
>
># echo 512 > /sys/block/dev/queue/nr_requests
>
>after boot for each device whose queue size you want to increase. 512
>should be enough with the 3ware.
>
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 13:55 ` Holger Kiehl
2005-08-31 14:24 ` Dr. David Alan Gilbert
@ 2005-08-31 16:20 ` Jens Axboe
2005-08-31 15:16 ` jmerkey
2005-08-31 16:51 ` Holger Kiehl
1 sibling, 2 replies; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 16:20 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]
On Wed, Aug 31 2005, Holger Kiehl wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
>
> >Nothing sticks out here either. There's plenty of idle time. It smells
> >like a driver issue. Can you try the same dd test, but read from the
> >drives instead? Use a bigger blocksize here, 128 or 256k.
> >
> I used the following command reading from all 8 disks in parallel:
>
> dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>
> Here vmstat output (I just cut something out in the middle):
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 3 7 4348 42640 7799984 9612 0 0 322816 0 3532 4987 0 22 0 78
> 1 7 4348 42136 7800624 9584 0 0 322176 0 3526 4987 0 23 4 74
> 0 8 4348 39912 7802648 9668 0 0 322176 0 3525 4955 0 22 12 66
> 1 7 4348 38912 7803700 9636 0 0 322432 0 3526 5078 0 23
Ok, so that's somewhat better than the writes but still off from what
the individual drives can do in total.
> >You might want to try the same with direct io, just to eliminate the
> >costly user copy. I don't expect it to make much of a difference though,
> >feels like the problem is elsewhere (driver, most likely).
> >
> Sorry, I don't know how to do this. Do you mean using a C program
> that sets some flag to do direct io, or how can I do that?
I've attached a little sample for you, just run ala
# ./oread /dev/sdX
and it will read 128k chunks direct from that device. Run on the same
drives as above, reply with the vmstat info again.
--
Jens Axboe
[-- Attachment #2: oread.c --]
[-- Type: text/plain, Size: 647 bytes --]
#include <stdio.h>
#include <stdlib.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BS     (131072)
#define ALIGN(buf)     (char *) (((unsigned long) (buf) + 4095) & ~(4095))
#define BLOCKS (8192)

int main(int argc, char *argv[])
{
    char *p;
    int fd, i;

    if (argc < 2) {
        printf("%s: <dev>\n", argv[0]);
        return 1;
    }

    /* O_DIRECT bypasses the page cache, so the costly user copy is avoided */
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* O_DIRECT needs an aligned buffer; round up to a 4096-byte boundary */
    p = ALIGN(malloc(BS + 4095));
    for (i = 0; i < BLOCKS; i++) {
        int r = read(fd, p, BS);

        if (r == BS)
            continue;
        else {
            if (r == -1)
                perror("read");
            break;
        }
    }

    return 0;
}
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 12:24 ` Nick Piggin
@ 2005-08-31 16:25 ` Holger Kiehl
2005-08-31 17:25 ` Nick Piggin
0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 16:25 UTC (permalink / raw)
To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
On Wed, 31 Aug 2005, Nick Piggin wrote:
> Holger Kiehl wrote:
>
>> 3236497 total 1.4547
>> 2507913 default_idle 52248.1875
>> 158752 shrink_zone 43.3275
>> 121584 copy_user_generic_c 3199.5789
>> 34271 __wake_up_bit 713.9792
>> 31131 __make_request 23.1629
>> 22096 scsi_request_fn 18.4133
>> 21915 rotate_reclaimable_page 80.5699
> ^^^^^^^^^
>
> I don't think this function should be here. This indicates that
> lots of writeout is happening due to pages falling off the end
> of the LRU.
>
> There was a bug recently causing memory estimates to be wrong
> on Opterons that could cause this I think.
>
> Can you send in 2 dumps of /proc/vmstat taken 10 seconds apart
> while you're writing at full speed (with 2.6.13 or the latest
> -git tree).
>
I took 2.6.13; there were no git snapshots at www.kernel.org when
I looked. With 2.6.13 I must load the Fusion MPT driver as a module.
Compiled in, it does not detect the drive correctly; as a module
there is no problem.
Here is what I did:
#!/bin/bash
time dd if=/dev/full of=/dev/sdc1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdd1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sde1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdf1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdg1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdh1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdi1 bs=4M count=4883 &
time dd if=/dev/full of=/dev/sdj1 bs=4M count=4883 &
sleep 20
cat /proc/vmstat > /root/vmstat-1.dump
sleep 10
cat /proc/vmstat > /root/vmstat-2.dump
cat /proc/zoneinfo > /root/zoneinfo.dump
cat /proc/meminfo > /root/meminfo.dump
exit 0
vmstat-1.dump:
nr_dirty 787282
nr_writeback 44317
nr_unstable 0
nr_page_table_pages 633
nr_mapped 6373
nr_slab 53030
pgpgin 263362
pgpgout 5260352
pswpin 0
pswpout 0
pgalloc_high 0
pgalloc_normal 2448628
pgalloc_dma 1041
pgfree 2457343
pgactivate 5775
pgdeactivate 2113
pgfault 465679
pgmajfault 321
pgrefill_high 0
pgrefill_normal 5940
pgrefill_dma 33
pgsteal_high 0
pgsteal_normal 148759
pgsteal_dma 0
pgscan_kswapd_high 0
pgscan_kswapd_normal 153813
pgscan_kswapd_dma 1089
pgscan_direct_high 0
pgscan_direct_normal 0
pgscan_direct_dma 0
pginodesteal 0
slabs_scanned 0
kswapd_steal 148759
kswapd_inodesteal 0
pageoutrun 5304
allocstall 0
pgrotated 0
nr_bounce 0
vmstat-2.dump:
nr_dirty 786397
nr_writeback 44233
nr_unstable 0
nr_page_table_pages 640
nr_mapped 6406
nr_slab 53027
pgpgin 263382
pgpgout 7835732
pswpin 0
pswpout 0
pgalloc_high 0
pgalloc_normal 3091687
pgalloc_dma 2420
pgfree 3101327
pgactivate 5817
pgdeactivate 2918
pgfault 466269
pgmajfault 322
pgrefill_high 0
pgrefill_normal 28265
pgrefill_dma 150
pgsteal_high 0
pgsteal_normal 789909
pgsteal_dma 1388
pgscan_kswapd_high 0
pgscan_kswapd_normal 904101
pgscan_kswapd_dma 4950
pgscan_direct_high 0
pgscan_direct_normal 0
pgscan_direct_dma 0
pginodesteal 0
slabs_scanned 1152
kswapd_steal 791297
kswapd_inodesteal 0
pageoutrun 28299
allocstall 0
pgrotated 562
nr_bounce 0
zoneinfo.dump:
Node 3, zone Normal
pages free 899
min 726
low 907
high 1089
active 3996
inactive 490989
scanned 0 (a: 16 i: 0)
spanned 524287
present 524287
protection: (0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 2
low: 62
high: 186
batch: 31
cpu: 0 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 10186
numa_miss: 3313
numa_foreign: 0
interleave_hit: 10136
local_node: 0
other_node: 13499
cpu: 1 pcp: 0
count: 13
low: 62
high: 186
batch: 31
cpu: 1 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 6559
numa_miss: 1668
numa_foreign: 0
interleave_hit: 6559
local_node: 0
other_node: 8227
cpu: 2 pcp: 0
count: 84
low: 62
high: 186
batch: 31
cpu: 2 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 5579
numa_miss: 12806
numa_foreign: 0
interleave_hit: 5579
local_node: 0
other_node: 18385
cpu: 3 pcp: 0
count: 93
low: 62
high: 186
batch: 31
cpu: 3 pcp: 1
count: 55
low: 0
high: 62
batch: 31
numa_hit: 834769
numa_miss: 1
numa_foreign: 940192
interleave_hit: 5563
local_node: 834770
other_node: 0
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 1572864
Node 2, zone Normal
pages free 1036
min 726
low 907
high 1089
active 360
inactive 501700
scanned 0 (a: 26 i: 0)
spanned 524287
present 524287
protection: (0, 0, 0)
pagesets
cpu: 1 pcp: 0
count: 91
low: 62
high: 186
batch: 31
cpu: 1 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 6002
numa_miss: 15490
numa_foreign: 0
interleave_hit: 6002
local_node: 0
other_node: 21492
cpu: 2 pcp: 0
count: 75
low: 62
high: 186
batch: 31
cpu: 2 pcp: 1
count: 56
low: 0
high: 62
batch: 31
numa_hit: 410692
numa_miss: 0
numa_foreign: 76064
interleave_hit: 5223
local_node: 410692
other_node: 0
cpu: 3 pcp: 0
count: 73
low: 62
high: 186
batch: 31
cpu: 3 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 5163
numa_miss: 288909
numa_foreign: 1
interleave_hit: 5152
local_node: 0
other_node: 294072
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 1048576
Node 1, zone Normal
pages free 859
min 703
low 878
high 1054
active 1224
inactive 485043
scanned 0 (a: 14 i: 0)
spanned 507903
present 507760
protection: (0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 1
low: 62
high: 186
batch: 31
cpu: 0 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 9443
numa_miss: 15475
numa_foreign: 18446604437880297808
interleave_hit: 18446604437880307200
local_node: 1
other_node: 24917
cpu: 1 pcp: 0
count: 181
low: 62
high: 186
batch: 31
cpu: 1 pcp: 1
count: 38
low: 0
high: 62
batch: 31
numa_hit: 368191
numa_miss: 0
numa_foreign: 39046
interleave_hit: 5967
local_node: 368191
other_node: 0
cpu: 2 pcp: 0
count: 85
low: 62
high: 186
batch: 31
cpu: 2 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 5139
numa_miss: 18963
numa_foreign: 0
interleave_hit: 5139
local_node: 0
other_node: 24102
cpu: 3 pcp: 0
count: 92
low: 62
high: 186
batch: 31
cpu: 3 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 5124
numa_miss: 363472
numa_foreign: 0
interleave_hit: 5115
local_node: 0
other_node: 368596
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 524288
Node 0, zone DMA
pages free 2045
min 5
low 6
high 7
active 0
inactive 992
scanned 0 (a: 2 i: 2)
spanned 4096
present 3994
protection: (0, 2031, 2031)
pagesets
cpu: 0 pcp: 0
count: 1
low: 2
high: 6
batch: 1
cpu: 0 pcp: 1
count: 1
low: 0
high: 2
batch: 1
numa_hit: 18446604437880298786
numa_miss: 18446604442220017848
numa_foreign: 0
interleave_hit: 0
local_node: 7567460
other_node: 0
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 0
Node 0, zone Normal
pages free 1052
min 721
low 901
high 1081
active 845
inactive 480162
scanned 0 (a: 2 i: 0)
spanned 520191
present 520191
protection: (0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 96
low: 62
high: 186
batch: 31
cpu: 0 pcp: 1
count: 50
low: 0
high: 62
batch: 31
numa_hit: 18446604437880708763
numa_miss: 18446604439958819000
numa_foreign: 29590
interleave_hit: 9679
local_node: 7977309
other_node: 0
cpu: 1 pcp: 0
count: 88
low: 62
high: 186
batch: 31
cpu: 1 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 6206
numa_miss: 21831
numa_foreign: 0
interleave_hit: 6206
local_node: 0
other_node: 28037
cpu: 2 pcp: 0
count: 65
low: 62
high: 186
batch: 31
cpu: 2 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 5367
numa_miss: 44135
numa_foreign: 0
interleave_hit: 5365
local_node: 0
other_node: 49502
cpu: 3 pcp: 0
count: 92
low: 62
high: 186
batch: 31
cpu: 3 pcp: 1
count: 0
low: 0
high: 62
batch: 31
numa_hit: 5544
numa_miss: 286378
numa_foreign: 0
interleave_hit: 5507
local_node: 0
other_node: 291922
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 4096
meminfo.dump:
MemTotal: 8124172 kB
MemFree: 23564 kB
Buffers: 7825944 kB
Cached: 19216 kB
SwapCached: 0 kB
Active: 25708 kB
Inactive: 7835548 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 8124172 kB
LowFree: 23564 kB
SwapTotal: 15631160 kB
SwapFree: 15631160 kB
Dirty: 3145604 kB
Writeback: 176452 kB
Mapped: 25624 kB
Slab: 212116 kB
CommitLimit: 19693244 kB
Committed_AS: 85112 kB
PageTables: 2560 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 16288 kB
VmallocChunk: 34359721635 kB
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 16:20 ` Jens Axboe
2005-08-31 15:16 ` jmerkey
@ 2005-08-31 16:51 ` Holger Kiehl
2005-08-31 17:35 ` Jens Axboe
2005-08-31 18:06 ` Michael Tokarev
1 sibling, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 16:51 UTC (permalink / raw)
To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel
On Wed, 31 Aug 2005, Jens Axboe wrote:
> On Wed, Aug 31 2005, Holger Kiehl wrote:
>> On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>>> Nothing sticks out here either. There's plenty of idle time. It smells
>>> like a driver issue. Can you try the same dd test, but read from the
>>> drives instead? Use a bigger blocksize here, 128 or 256k.
>>>
>> I used the following command reading from all 8 disks in parallel:
>>
>> dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>
>> Here vmstat output (I just cut something out in the middle):
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>> r b swpd free buff cache si so bi bo in cs us sy id wa
>> 3 7 4348 42640 7799984 9612 0 0 322816 0 3532 4987 0 22 0 78
>> 1 7 4348 42136 7800624 9584 0 0 322176 0 3526 4987 0 23 4 74
>> 0 8 4348 39912 7802648 9668 0 0 322176 0 3525 4955 0 22 12 66
>> 1 7 4348 38912 7803700 9636 0 0 322432 0 3526 5078 0 23
>
> Ok, so that's somewhat better than the writes but still off from what
> the individual drives can do in total.
>
>>> You might want to try the same with direct io, just to eliminate the
>>> costly user copy. I don't expect it to make much of a difference though,
>>> feels like the problem is elsewhere (driver, most likely).
>>>
>> Sorry, I don't know how to do this. Do you mean using a C program
>> that sets some flag to do direct io, or how can I do that?
>
> I've attached a little sample for you, just run ala
>
> # ./oread /dev/sdX
>
> and it will read 128k chunks direct from that device. Run on the same
> drives as above, reply with the vmstat info again.
>
Using kernel 2.6.12.5 again, here are the results:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 8009648 4764 40592 0 0 0 0 1011 32 0 0 100 0
0 0 0 8009648 4764 40592 0 0 0 0 1011 34 0 0 100 0
0 0 0 8009648 4764 40592 0 0 0 0 1008 61 0 0 100 0
0 0 0 8009648 4764 40592 0 0 0 0 1006 26 0 0 100 0
0 8 0 8006372 4764 40592 0 0 120192 0 1944 1929 0 1 89 10
2 8 0 8006372 4764 40592 0 0 319488 0 3502 4999 0 2 75 24
0 8 0 8006372 4764 40592 0 0 319488 0 3506 4995 0 2 75 24
0 8 0 8006372 4764 40592 0 0 319744 0 3504 4999 0 1 75 24
0 8 0 8006372 4764 40592 0 0 319488 0 3507 5009 0 2 75 23
0 8 0 8006372 4764 40592 0 0 319616 0 3506 5011 0 2 75 24
0 8 0 8005124 4800 41100 0 0 319976 0 3536 4995 0 2 73 25
0 8 0 8005124 4800 41100 0 0 323584 0 3534 5000 0 2 75 23
0 8 0 8005124 4800 41100 0 0 323968 0 3540 5035 0 1 75 24
0 8 0 8005124 4800 41100 0 0 319232 0 3506 4811 0 1 75 24
0 8 0 8005504 4800 41100 0 0 317952 0 3498 4747 0 1 75 24
0 8 0 8005504 4800 41100 0 0 318720 0 3495 4672 0 2 75 23
1 8 0 8005504 4800 41100 0 0 318720 0 3509 4707 0 1 75 24
0 8 0 8005504 4800 41100 0 0 318720 0 3499 4667 0 2 75 23
0 8 0 8005504 4808 41092 0 0 318848 40 3509 4674 0 1 75 24
0 8 0 8005380 4808 41092 0 0 318848 0 3497 4693 0 2 72 26
0 8 0 8005380 4808 41092 0 0 318592 0 3500 4646 0 2 75 23
0 8 0 8005380 4808 41092 0 0 318592 0 3495 4828 0 2 61 37
0 8 0 8005380 4808 41092 0 0 318848 0 3499 4827 0 1 62 37
1 8 0 8005380 4808 41092 0 0 318464 0 3495 4642 0 2 75 23
0 8 0 8005380 4816 41084 0 0 318848 32 3511 4672 0 1 75 24
0 8 0 8005380 4816 41084 0 0 320640 0 3512 4877 0 2 75 23
0 8 0 8005380 4816 41084 0 0 322944 0 3533 5047 0 2 75 24
0 8 0 8005380 4816 41084 0 0 322816 0 3531 5053 0 1 75 24
0 8 0 8005380 4816 41084 0 0 322944 0 3531 5048 0 2 75 23
0 8 0 8005380 4816 41084 0 0 322944 0 3529 5043 0 1 75 24
0 0 0 8008360 4816 41084 0 0 266880 0 3112 4224 0 2 78 20
0 0 0 8008360 4816 41084 0 0 0 0 1012 28 0 0 100 0
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 15:16 ` jmerkey
@ 2005-08-31 16:58 ` Tom Callahan
2005-08-31 15:47 ` jmerkey
2005-08-31 17:11 ` Jens Axboe
1 sibling, 1 reply; 42+ messages in thread
From: Tom Callahan @ 2005-08-31 16:58 UTC (permalink / raw)
To: jmerkey
Cc: Jens Axboe, Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel
From linux-kernel mailing list.....
Don't do this. BLKDEV_MIN_RQ sets the size of the mempool of reserved
requests, which will only get used in low-memory conditions, so
most of that memory will probably be wasted.....
Change /sys/block/xxx/queue/nr_requests
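For example, a rough sketch of doing that for the eight data disks discussed in this thread (sdc through sdj; 512 is just an illustrative value to experiment with):

#!/bin/bash
# Sketch: raise the per-queue request limit via sysfs instead of recompiling.
for d in c d e f g h i j; do
    echo 512 > /sys/block/sd${d}/queue/nr_requests
done
grep . /sys/block/sd[c-j]/queue/nr_requests    # show the new values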
Tom Callahan
TESSCO Technologies
(443)-506-6216
callahant@tessco.com
jmerkey wrote:
>I have seen an 80MB/s limitation in the kernel unless this value is
>changed in the SCSI I/O layer
>for 3Ware and other controllers during testing of 2.6.X series kernels.
>
>Change these values in include/linux/blkdev.h and performance goes from
>80MB/S to over 670MB/S on the 3Ware controller.
>
>
>//#define BLKDEV_MIN_RQ 4
>//#define BLKDEV_MAX_RQ 128 /* Default maximum */
>#define BLKDEV_MIN_RQ 4096
>#define BLKDEV_MAX_RQ 8192 /* Default maximum */
>
>Jeff
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 15:16 ` jmerkey
2005-08-31 16:58 ` Tom Callahan
@ 2005-08-31 17:11 ` Jens Axboe
2005-08-31 15:59 ` jmerkey
1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:11 UTC (permalink / raw)
To: jmerkey; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel
On Wed, Aug 31 2005, jmerkey wrote:
>
>
> I have seen an 80MB/s limitation in the kernel unless this value is
> changed in the SCSI I/O layer
> for 3Ware and other controllers during testing of 2.6.X series kernels.
>
> Change these values in include/linux/blkdev.h and performance goes from
> 80MB/S to over 670MB/S on the 3Ware controller.
>
>
> //#define BLKDEV_MIN_RQ 4
> //#define BLKDEV_MAX_RQ 128 /* Default maximum */
> #define BLKDEV_MIN_RQ 4096
> #define BLKDEV_MAX_RQ 8192 /* Default maximum */
That's insane, you just wasted 1MiB of preallocated requests on each
queue in the system!
Please just do
# echo 512 > /sys/block/dev/queue/nr_requests
after boot for each device whose queue size you want to increase. 512
should be enough with the 3ware.
--
Jens Axboe
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 16:25 ` Holger Kiehl
@ 2005-08-31 17:25 ` Nick Piggin
2005-08-31 21:57 ` Holger Kiehl
0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-08-31 17:25 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
Holger Kiehl wrote:
> meminfo.dump:
>
> MemTotal: 8124172 kB
> MemFree: 23564 kB
> Buffers: 7825944 kB
> Cached: 19216 kB
> SwapCached: 0 kB
> Active: 25708 kB
> Inactive: 7835548 kB
> HighTotal: 0 kB
> HighFree: 0 kB
> LowTotal: 8124172 kB
> LowFree: 23564 kB
> SwapTotal: 15631160 kB
> SwapFree: 15631160 kB
> Dirty: 3145604 kB
Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
so maybe I've just led you on a goose chase.
You could
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
To further reduce dirty memory in the system, however this is
a long shot, so please continue your interaction with the
other people in the thread first.
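If it helps, the same knobs can also be poked through sysctl; a small sketch, with the values being just the ones suggested above:

#!/bin/bash
# Sketch: apply the suggested writeback thresholds via sysctl, then show them.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
sysctl vm.dirty_background_ratio vm.dirty_ratio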
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 15:59 ` jmerkey
@ 2005-08-31 17:32 ` Jens Axboe
0 siblings, 0 replies; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:32 UTC (permalink / raw)
To: jmerkey; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel
On Wed, Aug 31 2005, jmerkey wrote:
>
> 512 is not enough. It has to be larger. I just tried 512 and it still
> limits the data rates.
Please don't top post.
512 wasn't the point, setting it properly is the point. If you need more
than 512, go ahead. This isn't Holger's problem, though, the reading
would be much faster if it was. If the fusion is using a large queue
depth, increasing nr_requests would likely help the writes (but not to
the extent where it would suddenly be as fast as it should).
--
Jens Axboe
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 16:51 ` Holger Kiehl
@ 2005-08-31 17:35 ` Jens Axboe
2005-08-31 19:00 ` Holger Kiehl
2005-08-31 18:06 ` Michael Tokarev
1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:35 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel
On Wed, Aug 31 2005, Holger Kiehl wrote:
> ># ./oread /dev/sdX
> >
> >and it will read 128k chunks direct from that device. Run on the same
> >drives as above, reply with the vmstat info again.
> >
> Using kernel 2.6.12.5 again, here the results:
[snip]
Ok, reads as expected, like the buffered io but using less system time.
And you are still 1/3 off the target data rate, hmmm...
With the reads, how does the aggregate bandwidth look when you add
'clients'? Same as with writes, gradually decreasing per-device
throughput?
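A rough way to script that, as a sketch (the drive names are the eight data disks used above; each run's timing summary ends up in its own log):

#!/bin/bash
# Sketch: add one parallel reader at a time and record each dd's timing,
# to see whether per-device throughput drops as readers are added.
drives="sdc1 sdd1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1"
for n in 1 2 3 4 5 6 7 8; do
    echo "=== $n parallel reader(s) ==="
    pids=""
    for d in $(echo $drives | cut -d" " -f1-$n); do
        ( time dd if=/dev/$d of=/dev/null bs=256k count=78125 ) \
            2> /root/dd-read-$n-$d.log &
        pids="$pids $!"
    done
    wait $pids
done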
--
Jens Axboe
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 16:51 ` Holger Kiehl
2005-08-31 17:35 ` Jens Axboe
@ 2005-08-31 18:06 ` Michael Tokarev
2005-08-31 18:52 ` Ming Zhang
1 sibling, 1 reply; 42+ messages in thread
From: Michael Tokarev @ 2005-08-31 18:06 UTC (permalink / raw)
To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
Holger Kiehl wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
>
>> On Wed, Aug 31 2005, Holger Kiehl wrote:
>>
[]
>>> I used the following command reading from all 8 disks in parallel:
>>>
>>> dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>>
>>> Here vmstat output (I just cut something out in the middle):
>>>
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>> r b swpd free buff cache si so bi bo in cs us sy id wa
>>> 3 7 4348 42640 7799984 9612 0 0 322816 0 3532 4987 0 22 0 78
>>> 1 7 4348 42136 7800624 9584 0 0 322176 0 3526 4987 0 23 4 74
>>> 0 8 4348 39912 7802648 9668 0 0 322176 0 3525 4955 0 22 12 66
>>
>> Ok, so that's somewhat better than the writes but still off from what
>> the individual drives can do in total.
>>
>>>> You might want to try the same with direct io, just to eliminate the
>>>> costly user copy. I don't expect it to make much of a difference though,
>>>> feels like the problem is elsewhere (driver, most likely).
>>>>
>>> Sorry, I don't know how to do this. Do you mean using a C program
>>> that sets some flag to do direct io, or how can I do that?
>>
>> I've attached a little sample for you, just run ala
>>
>> # ./oread /dev/sdX
>>
>> and it will read 128k chunks direct from that device. Run on the same
>> drives as above, reply with the vmstat info again.
>>
> Using kernel 2.6.12.5 again, here the results:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 8 0 8005380 4816 41084 0 0 318848 32 3511 4672 0 1 75 24
> 0 8 0 8005380 4816 41084 0 0 320640 0 3512 4877 0 2 75 23
> 0 8 0 8005380 4816 41084 0 0 322944 0 3533 5047 0 2 75 24
> 0 8 0 8005380 4816 41084 0 0 322816 0 3531 5053 0 1 75 24
> 0 8 0 8005380 4816 41084 0 0 322944 0 3531 5048 0 2 75 23
> 0 8 0 8005380 4816 41084 0 0 322944 0 3529 5043 0 1 75 24
> 0 0 0 8008360 4816 41084 0 0 266880 0 3112 4224 0 2 78 20
I went on and did similar tests on our box, which is:
dual Xeon 2.44GHz with HT (so it's like 4 logical CPUs)
dual-channel AIC-7902 U320 controller
8 SEAGATE ST336607LW drives attached to the 2 channels of the
controller, sd[abcd] to channel 0 and sd[efgh] to channel 1
Each drive is capable of about 60 megabytes/sec.
The kernel is 2.6.13 from kernel.org.
With direct-reading:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 8 12 87 471 1839 0 0 455296 84 1936 3739 0 3 47 50
1 7 12 87 471 1839 0 0 456704 80 1941 3744 0 4 48 48
1 7 12 87 471 1839 0 0 446464 82 1914 3648 0 2 48 50
0 8 12 87 471 1839 0 0 454016 94 1944 3765 0 2 47 50
0 8 12 87 471 1839 0 0 458752 60 1944 3746 0 2 48 50
Without direct:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
8 6 12 80 470 1839 0 0 359966 124 1726 2270 1 89 0 10
2 7 12 80 470 1839 0 0 352813 113 1741 2124 1 88 1 10
8 4 12 80 471 1839 0 0 358990 34 1669 1934 1 94 0 5
7 5 12 79 471 1839 0 0 354065 157 1761 2128 1 90 1 8
6 5 12 80 471 1839 0 0 358062 44 1686 1911 1 93 0 6
So the difference direct vs "indirect" is quite.. significant. And with
direct-reading, all 8 drives are up to their real speed. Note the CPU usage
in case of "indirect" reading too - it's about 90%...
And here's an idle system as well:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 12 89 471 1839 0 0 0 58 151 358 0 0 100 0
0 0 12 89 471 1839 0 0 0 66 157 167 0 0 99 0
Too bad I can't perform write tests on this system...
/mjt
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 18:06 ` Michael Tokarev
@ 2005-08-31 18:52 ` Ming Zhang
2005-08-31 18:57 ` Ming Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Ming Zhang @ 2005-08-31 18:52 UTC (permalink / raw)
To: Michael Tokarev
Cc: Holger Kiehl, Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2478 bytes --]
join the party. ;)
8 400GB SATA disks on the same Marvell 8-port PCI-X 133 card. P4 CPU.
Supermicro SCT board.
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6]
[raid10] [faulty]
md0 : active raid0 sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda
[0]
3125690368 blocks 64k chunks
8 DISK RAID0 from same slot and card. Stripe size is 512KB.
run oread
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 533216 330424 11004 0 0 7128 1610 1069 77 0 2 95 3
1 0 0 298464 560828 11004 0 0 230404 0 2595 1389 1 23 0 76
0 1 0 64736 792248 11004 0 0 231420 0 2648 1342 0 26 0 74
1 0 0 8948 848416 9696 0 0 229376 0 2638 1337 0 29 0 71
0 0 0 868896 768 9696 0 0 29696 48 1224 162 0 19 73 8
# time ./oread /dev/md0
real 0m6.595s
user 0m0.004s
sys 0m0.151s
run dd
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 2 0 854008 2932 17108 0 0 7355 1606 1071 80 0 2 95 3
0 2 0 848888 3112 21388 0 0 164332 0 2985 3564 2 7 0 91
0 2 0 844024 3260 25664 0 0 164040 0 2990 3665 1 7 0 92
0 2 0 840328 3380 28920 0 0 164272 0 2932 3791 1 9 0 90
0 2 0 836360 3500 32232 0 0 163688 100 3001 5045 2 7 0 91
0 2 0 831432 3644 36612 0 0 164120 568 2977 3843 0 9 0 91
0 1 0 826056 3752 41688 0 0 7872 0 1267 1474 1 3 0 96
# time dd if=/dev/md0 of=/dev/null bs=131072 count=8192
8192+0 records in
8192+0 records out
real 0m4.771s
user 0m0.005s
sys 0m0.973s
So the expected part here is that, because of O_DIRECT, the sys time
dropped a lot.
But the elapsed time is longer! The reason I found is...
I attached a new oread.c which allows setting the block size of each read and
the total read count, so I can read a full stripe at a time:
# time ./oread /dev/md0 524288 2048
real 0m4.950s
user 0m0.000s
sys 0m0.131s
compared to
# time ./oread /dev/md0 131072 8192
real 0m6.633s
user 0m0.002s
sys 0m0.191s
But still, I get linear speed up to 4 disks, then no speed gain when
adding more disks to the RAID.
Ming
[-- Attachment #2: oread.c --]
[-- Type: text/x-csrc, Size: 673 bytes --]
#include <stdio.h>
#include <stdlib.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN(buf) (char *) (((unsigned long) (buf) + 4095) & ~(4095))

int main(int argc, char *argv[])
{
    char *p;
    int fd, i;
    int BS, BLOCKS;

    if (argc < 4) {
        printf("%s: <dev> bs cnt\n", argv[0]);
        return 1;
    }

    BS = atoi(argv[2]);
    BLOCKS = atoi(argv[3]);

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    p = ALIGN(malloc(BS + 4095));
    for (i = 0; i < BLOCKS; i++) {
        int r = read(fd, p, BS);

        if (r == BS)
            continue;
        else {
            if (r == -1)
                perror("read");
            break;
        }
    }

    return 0;
}
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 18:52 ` Ming Zhang
@ 2005-08-31 18:57 ` Ming Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Ming Zhang @ 2005-08-31 18:57 UTC (permalink / raw)
To: Michael Tokarev
Cc: Holger Kiehl, Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
Forgot to attach the lspci output.
It is a 133MHz PCI-X card but it only runs at 66MHz.
Quick question: where can I check whether it is running at 64 bit?
66MHz * 32bit / 8 * 80% bus utilization ~= 211MB/s, which matches the upper
speed I am seeing now...
Ming
02:01.0 SCSI storage controller: Marvell MV88SX5081 8-port SATA I PCI-X Controller (rev 03)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 128, Cache Line Size 08
Interrupt: pin A routed to IRQ 24
Region 0: Memory at fa000000 (64-bit, non-prefetchable)
Capabilities: [40] Power Management version 2
Flags: PMEClk+ DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [60] PCI-X non-bridge device.
Command: DPERE- ERO- RBC=0 OST=3
Status: Bus=2 Dev=1 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=0, DMOST=3, DMCRS=0, RSCEM-
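On the 64-bit question: the PCI-X capability's Status line in the dump above already carries a 64bit+/64bit- flag (it shows 64bit+ here), so a sketch like this pulls out just the relevant lines (the -s slot address is taken from the output above):

#!/bin/bash
# Sketch: show the PCI-X capability status for the Marvell controller;
# the Status line reports 64bit+/- and the 133MHz capability bit.
lspci -vv -s 02:01.0 | grep -E "PCI-X|Status:"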
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 17:35 ` Jens Axboe
@ 2005-08-31 19:00 ` Holger Kiehl
0 siblings, 0 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 19:00 UTC (permalink / raw)
To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel
On Wed, 31 Aug 2005, Jens Axboe wrote:
> On Wed, Aug 31 2005, Holger Kiehl wrote:
>>> # ./oread /dev/sdX
>>>
>>> and it will read 128k chunks direct from that device. Run on the same
>>> drives as above, reply with the vmstat info again.
>>>
>> Using kernel 2.6.12.5 again, here the results:
>
> [snip]
>
> Ok, reads as expected, like the buffered io but using less system time.
> And you are still 1/3 off the target data rate, hmmm...
>
> With the reads, how does the aggregate bandwidth look when you add
> 'clients'? Same as with writes, gradually decreasing per-device
> throughput?
>
I performed the following tests with this command:
dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
Single disk tests:
/dev/sdc1 74.954715 MB/s
/dev/sdg1 74.973417 MB/s
Following disks in parallel:
2 disks on same channel
/dev/sdc1 75.034191 MB/s
/dev/sdd1 74.984643 MB/s
3 disks on same channel
/dev/sdc1 75.027850 MB/s
/dev/sdd1 74.976583 MB/s
/dev/sde1 75.278276 MB/s
4 disks on same channel
/dev/sdc1 58.343166 MB/s
/dev/sdd1 62.993059 MB/s
/dev/sde1 66.940569 MB/s
/dev/sdf1 70.986072 MB/s
2 disks on different channels
/dev/sdc1 74.954715 MB/s
/dev/sdg1 74.973417 MB/s
4 disks on different channels
/dev/sdc1 74.959030 MB/s
/dev/sdd1 74.877703 MB/s
/dev/sdg1 75.009697 MB/s
/dev/sdh1 75.028138 MB/s
6 disks on different channels
/dev/sdc1 49.640743 MB/s
/dev/sdd1 55.935419 MB/s
/dev/sde1 58.795241 MB/s
/dev/sdg1 50.280864 MB/s
/dev/sdh1 54.210705 MB/s
/dev/sdi1 59.413176 MB/s
So this looks different from writing; only as of four disks does the
performance begin to drop.
I just noticed, did you want me to do these tests with the oread program?
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 14:24 ` Dr. David Alan Gilbert
@ 2005-08-31 20:56 ` Holger Kiehl
2005-08-31 21:16 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 20:56 UTC (permalink / raw)
To: Dr. David Alan Gilbert; +Cc: linux-raid, linux-kernel
On Wed, 31 Aug 2005, Dr. David Alan Gilbert wrote:
> * Holger Kiehl (Holger.Kiehl@dwd.de) wrote:
>> On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>> Full vmstat session can be found under:
>
> Have you got iostat? iostat -x 10 might be interesting to see
> for a period while it is going.
>
The following is the result from all 8 disks at the same time with the command
dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
There is, however, one difference: here I had set
/sys/block/sd?/queue/nr_requests to 4096.
avg-cpu: %user %nice %sys %iowait %idle
0.10 0.00 21.85 58.55 19.50
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.30 0.00 2.40 0.00 1.20 8.00 0.00 1.00 1.00 0.03
sdb 0.70 0.00 0.10 0.30 6.40 2.40 3.20 1.20 22.00 0.00 4.25 4.25 0.17
sdc 8276.90 0.00 267.10 0.00 68352.00 0.00 34176.00 0.00 255.90 1.95 7.29 3.74 100.02
sdd 9098.50 0.00 293.50 0.00 75136.00 0.00 37568.00 0.00 256.00 1.93 6.59 3.41 100.03
sde 10428.40 0.00 336.40 0.00 86118.40 0.00 43059.20 0.00 256.00 1.92 5.71 2.97 100.02
sdf 11314.90 0.00 365.10 0.00 93440.00 0.00 46720.00 0.00 255.93 1.92 5.26 2.74 99.98
sdg 7973.20 0.00 257.20 0.00 65843.20 0.00 32921.60 0.00 256.00 1.94 7.53 3.89 100.01
sdh 9436.30 0.00 304.70 0.00 77928.00 0.00 38964.00 0.00 255.75 1.93 6.35 3.28 100.01
sdi 10604.80 0.00 342.40 0.00 87577.60 0.00 43788.80 0.00 255.78 1.92 5.62 2.92 100.02
sdj 10914.30 0.00 352.20 0.00 90132.80 0.00 45066.40 0.00 255.91 1.91 5.43 2.84 100.00
md0 0.00 0.00 0.00 0.10 0.00 0.80 0.00 0.40 8.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.80 0.00 6.40 0.00 3.20 0.00 8.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %sys %iowait %idle
0.07 0.00 24.49 66.81 8.62
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.40 0.00 1.00 0.00 11.20 0.00 5.60 11.20 0.00 1.30 0.50 0.05
sdb 0.00 0.40 0.00 1.00 0.00 11.20 0.00 5.60 11.20 0.00 1.50 0.70 0.07
sdc 8161.90 0.00 263.70 0.00 67404.80 0.00 33702.40 0.00 255.61 1.95 7.38 3.79 100.02
sdd 9157.30 0.00 295.50 0.00 75622.40 0.00 37811.20 0.00 255.91 1.93 6.53 3.38 100.00
sde 10505.60 0.00 339.20 0.00 86758.40 0.00 43379.20 0.00 255.77 1.93 5.68 2.95 99.99
sdf 11212.50 0.00 361.90 0.00 92595.20 0.00 46297.60 0.00 255.86 1.91 5.28 2.76 100.00
sdg 7988.40 0.00 258.00 0.00 65971.20 0.00 32985.60 0.00 255.70 1.93 7.49 3.88 99.98
sdh 9436.20 0.00 304.40 0.00 77924.80 0.00 38962.40 0.00 255.99 1.92 6.32 3.28 99.99
sdi 10406.10 0.00 336.30 0.00 85939.20 0.00 42969.60 0.00 255.54 1.92 5.70 2.97 100.00
sdj 11027.00 0.00 356.00 0.00 91064.00 0.00 45532.00 0.00 255.80 1.92 5.40 2.81 99.96
md0 0.00 0.00 0.00 1.00 0.00 8.00 0.00 4.00 8.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %sys %iowait %idle
0.08 0.00 22.23 60.44 17.25
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.30 0.00 2.40 0.00 1.20 8.00 0.00 1.00 1.00 0.03
sdb 0.00 0.00 0.00 0.30 0.00 2.40 0.00 1.20 8.00 0.00 0.67 0.67 0.02
sdc 8204.50 0.00 264.76 0.00 67754.15 0.00 33877.08 0.00 255.90 1.95 7.38 3.78 100.12
sdd 9166.47 0.00 295.90 0.00 75698.10 0.00 37849.05 0.00 255.83 1.94 6.55 3.38 100.12
sde 10534.93 0.00 339.94 0.00 86999.00 0.00 43499.50 0.00 255.92 1.93 5.67 2.95 100.12
sdf 11282.68 0.00 364.16 0.00 93174.77 0.00 46587.39 0.00 255.86 1.92 5.28 2.75 100.10
sdg 8114.61 0.00 261.76 0.00 67011.01 0.00 33505.51 0.00 256.00 1.95 7.44 3.82 100.11
sdh 9380.68 0.00 302.60 0.00 77466.27 0.00 38733.13 0.00 256.00 1.93 6.38 3.31 100.10
sdi 10507.01 0.00 339.04 0.00 86768.37 0.00 43384.18 0.00 255.92 1.93 5.69 2.95 100.12
sdj 10969.27 0.00 354.15 0.00 90586.59 0.00 45293.29 0.00 255.78 1.92 5.42 2.83 100.11
md0 0.00 0.00 0.00 0.10 0.00 0.80 0.00 0.40 8.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The full output can be found at:
ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/iostat-read-256k
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 20:56 ` Holger Kiehl
@ 2005-08-31 21:16 ` Dr. David Alan Gilbert
0 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2005-08-31 21:16 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-raid, linux-kernel
* Holger Kiehl (Holger.Kiehl@dwd.de) wrote:
> There is however one difference, here I had set
> /sys/block/sd?/queue/nr_requests to 4096.
Well from that it looks like none of the queues get above 255
(hmm that's a round number....)
> avg-cpu: %user %nice %sys %iowait %idle
> 0.10 0.00 21.85 58.55 19.50
Fair amount of system time.
> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sdf 11314.90 0.00 365.10 0.00 93440.00 0.00 46720.00 0.00 255.93 1.92 5.26 2.74 99.98
> sdg 7973.20 0.00 257.20 0.00 65843.20 0.00 32921.60 0.00 256.00 1.94 7.53 3.89 100.01
There seems to be quite a spread of read performance across the drives
(pretty consistent across the run); what makes sdg so much slower than
sdf (which seem to be the slowest and fastest drives respectively)?
I guess if every drive was running at sdf's speed you would be pretty happy.
If you physically swap f and g does the performance follow the drive
or the letter?
Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 17:25 ` Nick Piggin
@ 2005-08-31 21:57 ` Holger Kiehl
2005-09-01 9:12 ` Holger Kiehl
0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 21:57 UTC (permalink / raw)
To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
On Thu, 1 Sep 2005, Nick Piggin wrote:
> Holger Kiehl wrote:
>
>> meminfo.dump:
>>
>> MemTotal: 8124172 kB
>> MemFree: 23564 kB
>> Buffers: 7825944 kB
>> Cached: 19216 kB
>> SwapCached: 0 kB
>> Active: 25708 kB
>> Inactive: 7835548 kB
>> HighTotal: 0 kB
>> HighFree: 0 kB
>> LowTotal: 8124172 kB
>> LowFree: 23564 kB
>> SwapTotal: 15631160 kB
>> SwapFree: 15631160 kB
>> Dirty: 3145604 kB
>
> Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
> so maybe I've just led you on a goose chase.
>
> You could
> echo 5 > /proc/sys/vm/dirty_background_ratio
> echo 10 > /proc/sys/vm/dirty_ratio
>
> To further reduce dirty memory in the system, however this is
> a long shot, so please continue your interaction with the
> other people in the thread first.
>
Yes, this does make a difference; here are the results of running
dd if=/dev/full of=/dev/sd?1 bs=4M count=4883
on 8 disks at the same time:
34.273340
33.938829
33.598469
32.970575
32.841351
32.723988
31.559880
29.778112
That's 32.710568 MB/s on average per disk with your change; without
it, it was 24.958557 MB/s on average per disk.
I will do more tests tomorrow.
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-31 21:57 ` Holger Kiehl
@ 2005-09-01 9:12 ` Holger Kiehl
2005-09-02 14:28 ` Al Boldi
0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-09-01 9:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel
On Wed, 31 Aug 2005, Holger Kiehl wrote:
> On Thu, 1 Sep 2005, Nick Piggin wrote:
>
>> Holger Kiehl wrote:
>>
>>> meminfo.dump:
>>>
>>> MemTotal: 8124172 kB
>>> MemFree: 23564 kB
>>> Buffers: 7825944 kB
>>> Cached: 19216 kB
>>> SwapCached: 0 kB
>>> Active: 25708 kB
>>> Inactive: 7835548 kB
>>> HighTotal: 0 kB
>>> HighFree: 0 kB
>>> LowTotal: 8124172 kB
>>> LowFree: 23564 kB
>>> SwapTotal: 15631160 kB
>>> SwapFree: 15631160 kB
>>> Dirty: 3145604 kB
>>
>> Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
>> so maybe I've just led you on a goose chase.
>>
>> You could
>> echo 5 > /proc/sys/vm/dirty_background_ratio
>> echo 10 > /proc/sys/vm/dirty_ratio
>>
>> To further reduce dirty memory in the system, however this is
>> a long shot, so please continue your interaction with the
>> other people in the thread first.
>>
> Yes, this does make a difference, here the results of running
>
> dd if=/dev/full of=/dev/sd?1 bs=4M count=4883
>
> on 8 disks at the same time:
>
> 34.273340
> 33.938829
> 33.598469
> 32.970575
> 32.841351
> 32.723988
> 31.559880
> 29.778112
>
> That's 32.710568 MB/s on average per disk with your change and without
> it it was 24.958557 MB/s on average per disk.
>
> I will do more tests tomorrow.
>
Just rechecked those numbers. Did a fresh boot and ran the test several
times. With defaults (dirty_background_ratio=10, dirty_ratio=40) I get
for the dd write tests an average of 24.559491 MB/s (8 disks in parallel)
per disk. With the suggested values (dirty_background_ratio=5, dirty_ratio=10)
32.390659 MB/s per disk.
I then did a SW raid0 over all disks with the following command:
mdadm -C /dev/md3 -l0 -n8 /dev/sd[cdefghij]1
(dirty_background_ratio=10, dirty_ratio=40) 223.955995 MB/s
(dirty_background_ratio=5, dirty_ratio=10) 234.318936 MB/s
So the difference is not so big anymore.
Something else I noticed while doing the dd over 8 disks is the following
(top just before they are finished):
top - 08:39:11 up 2:03, 2 users, load average: 23.01, 21.48, 15.64
Tasks: 102 total, 2 running, 100 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% us, 17.7% sy, 0.0% ni, 0.0% id, 78.9% wa, 0.2% hi, 3.1% si
Mem: 8124184k total, 8093068k used, 31116k free, 7831348k buffers
Swap: 15631160k total, 13352k used, 15617808k free, 5524k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3423 root 18 0 55204 460 392 R 12.0 0.0 1:15.55 dd
3421 root 18 0 55204 464 392 D 11.3 0.0 1:17.36 dd
3418 root 18 0 55204 464 392 D 10.3 0.0 1:10.92 dd
3416 root 18 0 55200 464 392 D 10.0 0.0 1:09.20 dd
3420 root 18 0 55204 464 392 D 10.0 0.0 1:10.49 dd
3422 root 18 0 55200 460 392 D 9.3 0.0 1:13.58 dd
3417 root 18 0 55204 460 392 D 7.6 0.0 1:13.11 dd
158 root 15 0 0 0 0 D 1.3 0.0 1:12.61 kswapd3
159 root 15 0 0 0 0 D 1.3 0.0 1:08.75 kswapd2
160 root 15 0 0 0 0 D 1.0 0.0 1:07.11 kswapd1
3419 root 18 0 51096 552 476 D 1.0 0.0 1:17.15 dd
161 root 15 0 0 0 0 D 0.7 0.0 0:54.46 kswapd0
1 root 16 0 4876 372 332 S 0.0 0.0 0:01.15 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
5 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
7 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
9 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd working
so hard? Is that correct?
Please just tell me if there is anything else I can test or dumps that
could be useful.
Thanks,
Holger
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-09-01 9:12 ` Holger Kiehl
@ 2005-09-02 14:28 ` Al Boldi
0 siblings, 0 replies; 42+ messages in thread
From: Al Boldi @ 2005-09-02 14:28 UTC (permalink / raw)
To: Holger Kiehl
Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel, Nick Piggin
Holger Kiehl wrote:
> top - 08:39:11 up 2:03, 2 users, load average: 23.01, 21.48, 15.64
> Tasks: 102 total, 2 running, 100 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0% us, 17.7% sy, 0.0% ni, 0.0% id, 78.9% wa, 0.2% hi, 3.1%
> si Mem: 8124184k total, 8093068k used, 31116k free, 7831348k
> buffers Swap: 15631160k total, 13352k used, 15617808k free, 5524k
> cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 3423 root 18 0 55204 460 392 R 12.0 0.0 1:15.55 dd
> 3421 root 18 0 55204 464 392 D 11.3 0.0 1:17.36 dd
> 3418 root 18 0 55204 464 392 D 10.3 0.0 1:10.92 dd
> 3416 root 18 0 55200 464 392 D 10.0 0.0 1:09.20 dd
> 3420 root 18 0 55204 464 392 D 10.0 0.0 1:10.49 dd
> 3422 root 18 0 55200 460 392 D 9.3 0.0 1:13.58 dd
> 3417 root 18 0 55204 460 392 D 7.6 0.0 1:13.11 dd
> 158 root 15 0 0 0 0 D 1.3 0.0 1:12.61 kswapd3
> 159 root 15 0 0 0 0 D 1.3 0.0 1:08.75 kswapd2
> 160 root 15 0 0 0 0 D 1.0 0.0 1:07.11 kswapd1
> 3419 root 18 0 51096 552 476 D 1.0 0.0 1:17.15 dd
> 161 root 15 0 0 0 0 D 0.7 0.0 0:54.46 kswapd0
>
> A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd
> working so hard? Is that correct?
Actually, kswapd is another problem. (see "Kswapd Flaw" thread)
It has little impact on your problem, but basically kswapd tries very hard,
maybe even too hard, to fulfill a request for memory: when the buffer/cache
pages are full kswapd tries to find some more unused memory. When it finds
none it starts recycling the buffer/cache pages. Which is OK, but it only
does this after searching for swappable memory, which wastes CPU cycles.
This can be tuned a little, but not much, by adjusting /sys(proc)/.../vm/...
Or renicing kswapd to the lowest priority, which may cause other problems.
Things get really bad when procs start asking for more memory than is
available, causing kswapd to take the liberty of paging out running procs in
the hope that these procs won't come back later. When they do come back,
something like a wild goose chase begins. This is also known as OverCommit.
It is closely related to the dreaded OOM-killer, which kicks in when the
system cannot satisfy a memory request for a returning proc, causing the VM
to start killing processes in an unpredictable manner.
Turning OverCommit off should solve this problem, but it doesn't.
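("Off" here means strict overcommit accounting; a minimal sketch, with
purely illustrative values:)

  # 2 = strict accounting: commits limited to swap + overcommit_ratio% of RAM
  echo 2  > /proc/sys/vm/overcommit_memory
  echo 80 > /proc/sys/vm/overcommit_ratio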
This is why it is recommended to always run the system with swap enabled,
even if you have tons of memory; that really only pushes the problem out of
the way until you hit the dead end and the wild goose chase begins again.
Sadly, 2.6.13 did not fix this either.
This description only vaguely sketches the problem from an end-user point
of view; the actual semantics may be quite different.
--
Al
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-08-30 23:05 ` Guy
@ 2005-09-28 20:04 ` Bill Davidsen
2005-09-30 4:52 ` Guy
0 siblings, 1 reply; 42+ messages in thread
From: Bill Davidsen @ 2005-09-28 20:04 UTC (permalink / raw)
To: Guy
Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
'linux-kernel'
Guy wrote:
>In most of your results, your CPU usage is very high. Once you get to about
>90% usage, you really can't do much else, unless you can improve the CPU
>usage.
>
That seems to be one of the problems with software RAID: the calculations
are done in the CPU and not in dedicated hardware. As you move to top-end
drive hardware, the CPU gets to be a limit. I don't remember off the top
of my head how threaded this code is, or whether more CPUs will help.
I see you are using RAID-1 for your system stuff; did one of the tests
use RAID-0 over all the drives? Mirroring or XOR redundancy helps
stability but hurts performance. Was the 270MB/s with RAID-0 or something
else?
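(Just as a sketch -- the md device name, chunk size and partition names are
only assumptions based on the original post:)

  # stripe all eight data disks, no redundancy, default 64k chunks
  mdadm --create /dev/md2 --level=0 --chunk=64 --raid-devices=8 /dev/sd[c-j]1
  mke2fs -j -b4096 -O dir_index /dev/md2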
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: Where is the performance bottleneck?
2005-09-28 20:04 ` Bill Davidsen
@ 2005-09-30 4:52 ` Guy
2005-09-30 5:19 ` dean gaudet
2005-10-06 21:15 ` Bill Davidsen
0 siblings, 2 replies; 42+ messages in thread
From: Guy @ 2005-09-30 4:52 UTC (permalink / raw)
To: 'Bill Davidsen'
Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
'linux-kernel'
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Bill Davidsen
> Sent: Wednesday, September 28, 2005 4:05 PM
> To: Guy
> Cc: 'Holger Kiehl'; 'Mark Hahn'; 'linux-raid'; 'linux-kernel'
> Subject: Re: Where is the performance bottleneck?
>
> Guy wrote:
>
> >In most of your results, your CPU usage is very high. Once you get to
> about
> >90% usage, you really can't do much else, unless you can improve the CPU
> >usage.
> >
> That seems one of the problems with software RAID, the calculations are
> done in the CPU and not dedicated hardware. As you move to the top end
> drive hardware the CPU gets to be a limit. I don't remember off the top
> of my head how threaded this code is, and if more CPUs will help.
My old 500MHz P3 can xor at 1GB/sec. I don't think the RAID5 logic is the
issue! Also, I have not seen hardware that fast! Or even half as fast.
But I must admit, I have not seen a hardware RAID5 in a few years. :(
8regs : 918.000 MB/sec
32regs : 469.600 MB/sec
pIII_sse : 994.800 MB/sec
pII_mmx : 1102.400 MB/sec
p5_mmx : 1152.800 MB/sec
raid5: using function: pIII_sse (994.800 MB/sec)
Humm.. It did not select the fastest?
Guy
>
> I see you are using RAID-1 for your system stuff, did one of the tests
> use RAID-0 over all the drives? Mirroring or XOR redundancy help
> stability but hurt performance. Was the 270MB/s with RAID-0 or ???
>
> --
> bill davidsen <davidsen@tmr.com>
> CTO TMR Associates, Inc
> Doing interesting things with small computers since 1979
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: Where is the performance bottleneck?
2005-09-30 4:52 ` Guy
@ 2005-09-30 5:19 ` dean gaudet
2005-10-06 21:15 ` Bill Davidsen
1 sibling, 0 replies; 42+ messages in thread
From: dean gaudet @ 2005-09-30 5:19 UTC (permalink / raw)
To: Guy
Cc: 'Bill Davidsen', 'Holger Kiehl',
'Mark Hahn', 'linux-raid', 'linux-kernel'
On Fri, 30 Sep 2005, Guy wrote:
> My old 500MHz P3 can xor at 1GB/sec. I don't think the RAID5 logic is the
> issue! Also, I have not seen hardware that fast! Or even half as fast.
> But I must admit, I have not seen a hardware RAID5 in a few years. :(
>
> 8regs : 918.000 MB/sec
> 32regs : 469.600 MB/sec
> pIII_sse : 994.800 MB/sec
> pII_mmx : 1102.400 MB/sec
> p5_mmx : 1152.800 MB/sec
> raid5: using function: pIII_sse (994.800 MB/sec)
those are cache based timings... an old 500mhz p3 probably has pc100
memory and main memory can't even go that fast. in fact i've got one of
those here and it's lucky to get 600MB/s out of memory.
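for a very crude gauge of what main memory can actually do (this mostly
reflects memory write bandwidth, and the exact number is obviously
machine-dependent):

  # push ~2GB through the page-clearing path and time it;
  # bytes / elapsed seconds gives a rough MB/s figure
  time dd if=/dev/zero of=/dev/null bs=1M count=2048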
in fact, to compare sw raid to a hw raid you should count every byte of
i/o somewhere between 2 and 3 times. this is because every line you read
into cache might knock out a dirty line, but it's definitely going to
replace something which would still be there on a hw raid. (i.e. it
decreases the cache effectiveness and you end up paying later after the sw
raid xor to read data back in which wouldn't leave the cache on a hw
raid.)
then add in the read/write traffic required on the parity block (which as
a fraction of i/o is worse with fewer drives) ... and it's pretty crazy to
believe that sw raid is "free" just because the kernel prints those
fantastic numbers at boot :)
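as a rough back-of-envelope (assuming full-stripe writes only and ignoring
read-modify-write, which is worse):

  # parity writes as a fraction of data written on an n-drive raid5
  for n in 3 4 8; do echo "$n drives: 1/$((n-1)) extra parity traffic"; done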
> Humm.. It did not select the fastest?
this is related to what i'm describing -- iirc the pIII_sse code uses a
non-temporal store and/or prefetchnta to reduce memory traffic.
-dean
p.s. i use sw raid regardless, i just don't like seeing these misleading
discussions pointing at the kernel raid timings and saying "hw offload is
pointless!"
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Where is the performance bottleneck?
2005-09-30 4:52 ` Guy
2005-09-30 5:19 ` dean gaudet
@ 2005-10-06 21:15 ` Bill Davidsen
1 sibling, 0 replies; 42+ messages in thread
From: Bill Davidsen @ 2005-10-06 21:15 UTC (permalink / raw)
To: Guy
Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
'linux-kernel'
Guy wrote:
>
>
>>-----Original Message-----
>>From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>>owner@vger.kernel.org] On Behalf Of Bill Davidsen
>>Sent: Wednesday, September 28, 2005 4:05 PM
>>To: Guy
>>Cc: 'Holger Kiehl'; 'Mark Hahn'; 'linux-raid'; 'linux-kernel'
>>Subject: Re: Where is the performance bottleneck?
>>
>>Guy wrote:
>>
>>
>>
>>>In most of your results, your CPU usage is very high. Once you get to
>>>
>>>
>>about
>>
>>
>>>90% usage, you really can't do much else, unless you can improve the CPU
>>>usage.
>>>
>>>
>>>
>>That seems one of the problems with software RAID, the calculations are
>>done in the CPU and not dedicated hardware. As you move to the top end
>>drive hardware the CPU gets to be a limit. I don't remember off the top
>>of my head how threaded this code is, and if more CPUs will help.
>>
>>
>
>My old 500MHz P3 can xor at 1GB/sec. I don't think the RAID5 logic is the
>issue! Also, I have not seen hardware that fast! Or even half as fast.
>But I must admit, I have not seen a hardware RAID5 in a few years. :(
>
> 8regs : 918.000 MB/sec
> 32regs : 469.600 MB/sec
> pIII_sse : 994.800 MB/sec
> pII_mmx : 1102.400 MB/sec
> p5_mmx : 1152.800 MB/sec
>raid5: using function: pIII_sse (994.800 MB/sec)
>
>Humm.. It did not select the fastest?
>
Maybe. There was discussion on this previously, but the decision was
made to use sse when available because it is nicer to the cache, or uses
fewer registers, or similar. In any case, it has fewer undesirable side
effects.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2005-10-06 21:12 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
2005-08-29 19:54 ` Mark Hahn
2005-08-30 19:08 ` Holger Kiehl
2005-08-30 23:05 ` Guy
2005-09-28 20:04 ` Bill Davidsen
2005-09-30 4:52 ` Guy
2005-09-30 5:19 ` dean gaudet
2005-10-06 21:15 ` Bill Davidsen
2005-08-29 20:10 ` Al Boldi
2005-08-30 19:18 ` Holger Kiehl
2005-08-31 10:30 ` Al Boldi
2005-08-29 20:25 ` Vojtech Pavlik
2005-08-30 20:06 ` Holger Kiehl
2005-08-31 7:11 ` Vojtech Pavlik
2005-08-31 7:26 ` Jens Axboe
2005-08-31 11:54 ` Holger Kiehl
2005-08-31 12:07 ` Jens Axboe
2005-08-31 13:55 ` Holger Kiehl
2005-08-31 14:24 ` Dr. David Alan Gilbert
2005-08-31 20:56 ` Holger Kiehl
2005-08-31 21:16 ` Dr. David Alan Gilbert
2005-08-31 16:20 ` Jens Axboe
2005-08-31 15:16 ` jmerkey
2005-08-31 16:58 ` Tom Callahan
2005-08-31 15:47 ` jmerkey
2005-08-31 17:11 ` Jens Axboe
2005-08-31 15:59 ` jmerkey
2005-08-31 17:32 ` Jens Axboe
2005-08-31 16:51 ` Holger Kiehl
2005-08-31 17:35 ` Jens Axboe
2005-08-31 19:00 ` Holger Kiehl
2005-08-31 18:06 ` Michael Tokarev
2005-08-31 18:52 ` Ming Zhang
2005-08-31 18:57 ` Ming Zhang
2005-08-31 12:24 ` Nick Piggin
2005-08-31 16:25 ` Holger Kiehl
2005-08-31 17:25 ` Nick Piggin
2005-08-31 21:57 ` Holger Kiehl
2005-09-01 9:12 ` Holger Kiehl
2005-09-02 14:28 ` Al Boldi
2005-08-31 13:38 ` Holger Kiehl
2005-08-29 23:09 ` Peter Chubb