* Where is the performance bottleneck?
@ 2005-08-29 18:20 Holger Kiehl
  2005-08-29 19:54 ` Mark Hahn
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-29 18:20 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 8112 bytes --]

Hello

I have a system with the following setup:

     Board is Tyan S4882 with AMD 8131 Chipset
     4 Opterons 848 (2.2GHz)
     8 GB DDR400 Ram (2GB for each CPU)
     1 onboard Symbios Logic 53c1030 dual channel U320 controller
     2 SATA disks put together as a SW Raid1 for system, swap and spares
     8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
       are on one channel and the other four (sdg, sdh, sdi, sdj) on
       the other channel.

The U320 SCSI controller has a 64 bit PCI-X bus to itself; there is no other
device on that bus. Unfortunately I was unable to determine at what speed
it is running; here is the output from lspci -vv:

02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
         Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
         Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
         Interrupt: pin A routed to IRQ 217
         Region 0: I/O ports at 3000 [size=256]
         Region 1: Memory at fe010000 (64-bit, non-prefetchable) [size=64K]
         Region 3: Memory at fe000000 (64-bit, non-prefetchable) [size=64K]
         Capabilities: [50] Power Management version 2
                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
                 Address: 0000000000000000  Data: 0000
         Capabilities: [68] PCI-X non-bridge device.
                 Command: DPERE- ERO- RBC=2 OST=0
                 Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,

02:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
         Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
         Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
         Interrupt: pin B routed to IRQ 225
         Region 0: I/O ports at 3400 [size=256]
         Region 1: Memory at fe030000 (64-bit, non-prefetchable) [size=64K]
         Region 3: Memory at fe020000 (64-bit, non-prefetchable) [size=64K]
         Capabilities: [50] Power Management version 2
                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
                 Address: 0000000000000000  Data: 0000
         Capabilities: [68] PCI-X non-bridge device.
                 Command: DPERE- ERO- RBC=2 OST=0
                 Status: Bus=2 Dev=4 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple,

How does one determine the PCI-X bus speed?

Anyway, I thought that with this system I would theoretically get 640 MB/s using
both channels. I tested several software raid setups to get the best possible
write speeds for this system. But testing shows that the absolute maximum I
can reach with software raid is only approx. 270 MB/s for writing, which is
very disappointing.

The tests were done with the 2.6.12.5 kernel from kernel.org, the scheduler is
deadline and the distribution is Fedora Core 4 x86_64 with all updates. The
chunksize is always the default from mdadm (64k). The filesystem was always
created with the command mke2fs -j -b4096 -O dir_index /dev/mdx.

I have also tried 2.6.13-rc7, but there the speed was much lower; the
maximum was approx. 140 MB/s for writing.

Here are some tests I did and the results from bonnie++:

Version  1.03        ------Sequential Output------ --Sequential Input- --Random-
                      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29 830.2   1
Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24 659.8   1
Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22 666.3   1
Raid0p(4 disk)15744M 54017  96 149897 57 60202  15 59048  96 156887 20 381.8   1
Raid0p(4 disk)15744M 54771  98 156129 59 54130  14 58941  97 157543 20 520.3   1
Raid1+0       15744M 52496  94 202497 77 55928  14 60150  98 270509 34 930.2   1
Raid0+1       15744M 53927  95 194492 66 53430  15 49590  83 174313 30 884.7   1
Raid5 (8 disk)15744M 55881  98 153735 51 61680  24 56229  95 207348 44 741.2   1
Raid5s(4 disk)15744M 55238  98 81023  28 36859  14 56358  95 193030 38 605.7   1
Raid5s(4 disk)15744M 54920  97 83680  29 36551  14 56917  95 185345 35 599.8   1
Raid5p(4 disk)15744M 53681  95 54517  20 44932  17 54808  93 172216 33 371.1   1
Raid5p(4 disk)15744M 53856  96 55901  21 34737  13 55810  94 181825 36 607.7   1
/dev/sdc      15744M 53861  95 102270 35 25718   6 37273  60 76275   8 377.0   0
/dev/sdd      15744M 53575  95 96846  36 26209   6 37248  60 76197   9 378.4   0
/dev/sde      15744M 54398  94 87937  28 25540   6 36476  59 76520   8 380.4   0
/dev/sdf      15744M 53982  95 109192 38 26136   6 38516  63 76277   9 383.0   0
/dev/sdg      15744M 53880  95 102625 36 26458   6 37926  61 76538   9 399.1   0
/dev/sdh      15744M 53326  95 106447 39 26570   6 38129  62 76427   9 384.3   0
/dev/sdi      15744M 53103  94 96976  33 25632   6 36748  59 76658   8 386.4   0
/dev/sdj      15744M 53840  95 105521 39 26251   6 37146  60 76097   9 384.8   0

Raid1+0        - Four raid1's, where the two disks of each raid1 hang on
                  different channels. The setup was done as follows:
                              Raid1 /dev/md3 (sdc + sdg)
                              Raid1 /dev/md4 (sdd + sdh)
                              Raid1 /dev/md5 (sde + sdi)
                              Raid1 /dev/md6 (sdf + sdj)
                              Raid0 /dev/md7 (md3 + md4 + md5 + md6)
Raid0+1        - Raid1 over two raid0 each having four disks:
                              Raid0 /dev/md3 (sdc + sdd + sde + sdf)
                              Raid0 /dev/md4 (sdg + sdh + sdi + sdj)
                              Raid1 /dev/md5 (md3 + md4)
Raid0s(4 disk) - Consists of Raid0 /dev/md3 sdc + sdd + sde + sdf or
                  Raid0 /dev/md4 sdg + sdh + sdi + sdj and the tests were
                  done separately, once for md3 and then for md4.
Raid0p(4 disk) - Same as Raid0s(4 disk) only the tests for md3 and md4 were
                  done at the same time (in parallel).
Raid5s(4 disk) - Same as Raid0s(4 disk) only with Raid5.
Raid5p(4 disk) - Same as Raid0p(4 disk) only with Raid5.

Additional tests were done with a little C program (attached to this mail)
that I wrote a long time ago. It measures the time it takes to write a file
of the given size; the first result is without fsync() and the second with
fsync(). It takes two parameters: the first is the file size in kilobytes
and the second the blocksize in bytes. The program was always started as
follows:

          fw 16121856 4096

I chose 4096 as the blocksize since this is the value suggested by stat()'s
st_blksize. With larger values the transfer rate increases.
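
For reference, a minimal sketch (not part of the original tests) of how that
hint can be queried; the file name argument is just a placeholder:

#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char *argv[])
{
   struct stat sb;

   /* Print the preferred I/O blocksize stat() reports for the given file. */
   if ((argc != 2) || (stat(argv[1], &sb) == -1))
   {
      perror("stat");
      return 1;
   }
   (void)printf("st_blksize = %ld bytes\n", (long)sb.st_blksize);

   return 0;
}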

Here are the results in MB/s:
Raid0 (8 disk) 203.017 191.649
Raid0s(4 disk) 200.331 166.129
Raid0s(4 disk) 198.013 165.465
Raid0p(4 disk) 143.781 118.832
Raid0p(4 disk) 146.592 117.703
Raid0+1        206.046 118.670
Raid5 (8 disk) 181.382 115.037
/dev/sdc        94.439  56.928
/dev/sdd        89.838  55.711
/dev/sde        84.391  51.545
/dev/sdf        87.549  57.368
/dev/sdg        92.847  57.799
/dev/sdh        94.615  58.678
/dev/sdi        89.030  54.945
/dev/sdj        91.344  56.899

Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the
bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
three times those numbers if you take the numbers from the individual disks.

What limit am I hitting here?

Thanks,
Holger
-- 

[-- Attachment #2: Type: TEXT/PLAIN, Size: 3993 bytes --]

/*****************************************************************************/
/*                            File Write Performance                         */
/*                            ======================                         */
/*****************************************************************************/

#include <stdio.h>      /* printf()                                          */
#include <string.h>     /* strcmp()                                          */
#include <stdlib.h>     /* exit(), atoi(), calloc(), free()                  */
#include <unistd.h>     /* write(), sysconf(), close(), fsync()              */
#include <sys/times.h>  /* times(), struct tms                               */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdarg.h>

#define MAXLINE             4096
#define BUFSIZE             512
#define DEFAULT_FILE_SIZE   31457280
#define TEST_FILE           "test.file"
#define FILE_MODE           (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)


static void err_doit(int, char *, va_list),
            err_quit(char *, ...),
            err_sys(char *, ...);


/*############################### main() ####################################*/
int
main(int argc, char *argv[])
{
   register int n,
               loops,
               rest;
   int         fd,
               oflag,
               blocksize = BUFSIZE;
   off_t       filesize = DEFAULT_FILE_SIZE;
   clock_t     start,
               end,
               syncend;
   long        clktck;
   char        *buf;
   struct tms  tmsdummy;

   if ((argc > 1) && (argc < 5))
   {
      filesize = (off_t)atoi(argv[1]) * 1024;
      if (argc == 3)
         blocksize = atoi(argv[2]);
      else  if (argc == 4)
               err_quit("Usage: %s [filesize] [blocksize]");
   }
   else  if (argc != 1)
            err_quit("Usage: %s [filesize] [blocksize]", argv[0]);

   if ((clktck = sysconf(_SC_CLK_TCK)) < 0)
      err_sys("sysconf error");

   /* If clktck=0 it doesn't make sense to run the test */
   if (clktck == 0)
   {
      (void)printf("0\n");
      exit(0);
   }

   if ((buf = calloc(blocksize, sizeof(char))) == NULL)
      err_sys("calloc error");

   for (n = 0; n < blocksize; n++)
      buf[n] = 'T';

   loops = filesize / blocksize;
   rest = filesize % blocksize;

   oflag = O_WRONLY | O_CREAT;

   if ((fd = open(TEST_FILE, oflag, FILE_MODE)) < 0)
      err_quit("Could not open %s", TEST_FILE);

   if ((start = times(&tmsdummy)) == -1)
      err_sys("Could not get start time");

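   /* Timed phase: write the whole file in blocksize chunks. */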
   for (n = 0; n < loops; n++)
      if (write(fd, buf, blocksize) != blocksize)
            err_sys("write error");
   if (rest > 0)
      if (write(fd, buf, rest) != rest)
            err_sys("write error");

   if ((end = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");

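   /* The second reported rate includes the fsync(), i.e. the time until
      the data has actually been flushed to disk. */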
   (void)fsync(fd);

   if ((syncend = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");

   (void)close(fd);
   free(buf);

   (void)printf("%f %f\n", (double)filesize / ((double)(end - start) / (double)clktck),
                           (double)filesize / ((double)(syncend - start) / (double)clktck));

   exit(0);
}


static void
err_sys(char *fmt, ...)
{
   va_list  ap;

   va_start(ap, fmt);
   err_doit(1, fmt, ap);
   va_end(ap);
   exit(1);
}


static void
err_quit(char *fmt, ...)
{
   va_list  ap;

   va_start(ap, fmt);
   err_doit(0, fmt, ap);
   va_end(ap);
   exit(1);
}


static void
err_doit(int errnoflag, char *fmt, va_list ap)
{
   int   errno_save;
   char  buf[MAXLINE];

   errno_save = errno;
   (void)vsprintf(buf, fmt, ap);
   if (errnoflag)
      (void)sprintf(buf+strlen(buf), ": %s", strerror(errno_save));
   (void)strcat(buf, "\n");
   fflush(stdout);
   (void)fputs(buf, stderr);
   fflush(NULL);     /* Flushes all stdio output streams */
   return;
}


* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
@ 2005-08-29 19:54 ` Mark Hahn
  2005-08-30 19:08   ` Holger Kiehl
  2005-08-29 20:10 ` Al Boldi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Mark Hahn @ 2005-08-29 19:54 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

>      8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)

figure each is worth, say, 60 MB/s, so you'll peak (theoretically) at 
240 MB/s per channel.

> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other
> device on that bus. Unfortunatly I was unable to determine at what speed
> it is running, here the output from lspci -vv:
...
>                  Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,

the "133MHz+" is a good sign.  OTOH the latency (72) seems rather low - my
understanding is that that would noticeably limit the size of burst transfers.

> Anyway, I thought with this system I would get theoretically 640 MB/s using
> both channels.

"theoretically" in the same sense as "according to quantum theory,
Bush and BinLadin may swap bodies tomorrow morning at 4:59."

> write speeds for this system. But testing shows that the absolute maximum I
> can reach with software raid is only approx. 270 MB/s for writting. Which is
> very disappointing.

it's a bit low, but "very" is unrealistic...

> deadline and distribution is fedora core 4 x86_64 with all updates. Chunksize
> is always the default from mdadm (64k). Filesystem was always created with the
> command mke2fs -j -b4096 -O dir_index /dev/mdx.

bear in mind that a 64k chunksize means that an 8 disk raid5 will really
only work well for writes that are multiples of 7*64=448K...

> I also have tried with 2.6.13-rc7, but here the speed was much lower, the
> maximum there was approx. 140 MB/s for writting.

hmm, there should not have been any such dramatic slowdown.

> Version  1.03        ------Sequential Output------ --Sequential Input- --Random-
>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29 830.2   1
> Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24 659.8   1
> Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22 666.3   1

you're obviously saturating something already with 2 disks.  did you play
with "blockdev --setra" setings?

> Raid5 (8 disk)15744M 55881  98 153735 51 61680  24 56229  95 207348 44 741.2   1
> Raid5s(4 disk)15744M 55238  98 81023  28 36859  14 56358  95 193030 38 605.7   1
> Raid5s(4 disk)15744M 54920  97 83680  29 36551  14 56917  95 185345 35 599.8   1

the block-read shows that even with 3 disks, you're hitting ~190 MB/s,
which is pretty close to your actual disk speed.  the low value for block-out
is probably just due to non-stripe writes needing R/M/W cycles.
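
(For illustration only, not from the original mail: a partial-stripe write
has to read the old chunk and the old parity, recompute, and write both back,
so one chunk write costs two reads plus two writes. A minimal sketch of the
parity update:)

#include <stddef.h>

/* new parity = old parity XOR old data XOR new data */
void
raid5_rmw_parity(unsigned char *parity, const unsigned char *old_data,
                 const unsigned char *new_data, size_t len)
{
   size_t i;

   for (i = 0; i < len; i++)
      parity[i] ^= old_data[i] ^ new_data[i];
}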

> /dev/sdc      15744M 53861  95 102270 35 25718   6 37273  60 76275   8 377.0   0

the block-out is clearly distorted by buffer-cache (too high), but the 
input rate is good and consistent.  obviously, it'll fall off somewhat 
towards inner tracks, but will probably still be above 50.

> Why do I only get 247 MB/s for writting and 227 MB/s for reading (from the
> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
> three times those numbers if you take the numbers from the individual disks.

expecting 3x is unreasonable; 2x (480 or so) would be good.

I suspect that some (sw kernel) components are badly tuned for fast IO.
obviously, most machines are in the 50-100 MB/s range, so this is not
surprising.  readahead is certainly one, but there are also magic numbers
in MD as well, not to mention PCI latency, scsi driver tuning, probably
even /proc/sys/vm settings.
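
(For reference, a minimal sketch of the per-device readahead knob that
"blockdev --setra" adjusts, via the BLKRAGET/BLKRASET ioctls from
<linux/fs.h>; values are in 512-byte sectors, setting requires root, and
/dev/md3 is just an example device, not taken from this thread.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int
main(void)
{
   int  fd;
   long ra;

   if ((fd = open("/dev/md3", O_RDONLY)) < 0)
      return 1;
   if (ioctl(fd, BLKRAGET, &ra) == 0)            /* current readahead */
      (void)printf("readahead: %ld sectors\n", ra);
   if (ioctl(fd, BLKRASET, 2048UL) != 0)         /* 2048 sectors = 1 MB */
      perror("BLKRASET");
   (void)close(fd);

   return 0;
}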

I've got some 4x2.6G opteron servers (same board, 32G PC3200), but alas,
end-users have found out about them.  not to mention that they only have 
3x160G SATA disks...

regards, mark hahn.



* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
  2005-08-29 19:54 ` Mark Hahn
@ 2005-08-29 20:10 ` Al Boldi
  2005-08-30 19:18   ` Holger Kiehl
  2005-08-29 20:25 ` Vojtech Pavlik
  2005-08-29 23:09 ` Peter Chubb
  3 siblings, 1 reply; 42+ messages in thread
From: Al Boldi @ 2005-08-29 20:10 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-kernel, linux-raid

Holger Kiehl wrote:
> Why do I only get 247 MB/s for writting and 227 MB/s for reading (from the
> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
> three times those numbers if you take the numbers from the individual
> disks.
>
> What limit am I hitting here?

You may be hitting a 2.6 kernel bug, which has something to do with 
readahead, ask Jens Axboe about it! (see "[git patches] IDE update" thread)
Sadly, 2.6.13 did not fix it either.

Did you try 2.4.31?

--
Al


* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
  2005-08-29 19:54 ` Mark Hahn
  2005-08-29 20:10 ` Al Boldi
@ 2005-08-29 20:25 ` Vojtech Pavlik
  2005-08-30 20:06   ` Holger Kiehl
  2005-08-29 23:09 ` Peter Chubb
  3 siblings, 1 reply; 42+ messages in thread
From: Vojtech Pavlik @ 2005-08-29 20:25 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-kernel

On Mon, Aug 29, 2005 at 06:20:56PM +0000, Holger Kiehl wrote:
> Hello
> 
> I have a system with the following setup:
> 
>     Board is Tyan S4882 with AMD 8131 Chipset
>     4 Opterons 848 (2.2GHz)
>     8 GB DDR400 Ram (2GB for each CPU)
>     1 onboard Symbios Logic 53c1030 dual channel U320 controller
>     2 SATA disks put together as a SW Raid1 for system, swap and spares
>     8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
>       are on one channel and the other four (sdg, sdh, sdi, sdj) on
>       the other channel.
> 
> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is
> no other device on that bus. Unfortunatly I was unable to determine at
> what speed it is running, here the output from lspci -vv:

> How does one determine the PCI-X bus speed?

Usually only the card (in your case the Symbios SCSI controller) can
tell. If it does, it'll be most likely in 'dmesg'.

> Anyway, I thought with this system I would get theoretically 640 MB/s using
> both channels.

You can never use the full theoretical bandwidth of the channel for
data. A lot of overhead remains for other signalling. Similarly for PCI.

> I tested several software raid setups to get the best possible write
> speeds for this system. But testing shows that the absolute maximum I
> can reach with software raid is only approx. 270 MB/s for writting.
> Which is very disappointing.

I'd expect somewhat better (in the 300-400 MB/s range), but this is not
too bad.

To find where the bottleneck is, I'd suggest trying without the
filesystem at all, and just filling a large part of the block device
using the 'dd' command.

Also, trying without the RAID, and just running 4 (and 8) concurrent
dd's to the separate drives could show whether it's the RAID that's
slowing things down. 

> The tests where done with 2.6.12.5 kernel from kernel.org, scheduler
> is the deadline and distribution is fedora core 4 x86_64 with all
> updates.  Chunksize is always the default from mdadm (64k). Filesystem
> was always created with the command mke2fs -j -b4096 -O dir_index
> /dev/mdx.
> 
> I also have tried with 2.6.13-rc7, but here the speed was much lower,
> the maximum there was approx. 140 MB/s for writting.

Now that's very low.

-- 
Vojtech Pavlik
SuSE Labs, SuSE CR


* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
                   ` (2 preceding siblings ...)
  2005-08-29 20:25 ` Vojtech Pavlik
@ 2005-08-29 23:09 ` Peter Chubb
  3 siblings, 0 replies; 42+ messages in thread
From: Peter Chubb @ 2005-08-29 23:09 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

>>>>> "Holger" == Holger Kiehl <Holger.Kiehl@dwd.de> writes:

Holger> Hello I have a system with the following setup:

	(4-way CPUs, 8 spindles on two controllers)

Try using XFS.

See http://scalability.gelato.org/DiskScalability_2fResults --- ext3
is single threaded and tends not to get the full benefit of either the
multiple spindles or the multiple processors.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


* Re: Where is the performance bottleneck?
  2005-08-29 19:54 ` Mark Hahn
@ 2005-08-30 19:08   ` Holger Kiehl
  2005-08-30 23:05     ` Guy
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 19:08 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid, linux-kernel

On Mon, 29 Aug 2005, Mark Hahn wrote:

>> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other
>> device on that bus. Unfortunatly I was unable to determine at what speed
>> it is running, here the output from lspci -vv:
> ...
>>                  Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
>
> the "133MHz+" is a good sign.  OTOH the latency (72) seems rather low - my
> understanding is that that would noticably limit the size of burst transfers.
>
I have tried with 128 and 144, but the transfer rate is only a little
bit higher, barely measurable. Or what values should I try?

>
>> Version  1.03        ------Sequential Output------ --Sequential Input- --Random-
>>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>> Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29 830.2   1
>> Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24 659.8   1
>> Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22 666.3   1
>
> you're obviously saturating something already with 2 disks.  did you play
> with "blockdev --setra" setings?
>
Yes, I did play a little bit with it, but this only changed read performance;
it made no measurable difference when writing.

Thanks,
Holger



* Re: Where is the performance bottleneck?
  2005-08-29 20:10 ` Al Boldi
@ 2005-08-30 19:18   ` Holger Kiehl
  2005-08-31 10:30     ` Al Boldi
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 19:18 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel, linux-raid

On Mon, 29 Aug 2005, Al Boldi wrote:

> Holger Kiehl wrote:
>> Why do I only get 247 MB/s for writting and 227 MB/s for reading (from the
>> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
>> three times those numbers if you take the numbers from the individual
>> disks.
>>
>> What limit am I hitting here?
>
> You may be hitting a 2.6 kernel bug, which has something to do with
> readahead, ask Jens Axboe about it! (see "[git patches] IDE update" thread)
> Sadly, 2.6.13 did not fix it either.
>
I did read that thread, but due to my limited understanding of kernel
code, I don't see the relation to my problem.

But I am willing to try any patches to solve the problem.

> Did you try 2.4.31?
>
No. Will give this a try if the problem is not found.

Thanks,
Holger



* Re: Where is the performance bottleneck?
  2005-08-29 20:25 ` Vojtech Pavlik
@ 2005-08-30 20:06   ` Holger Kiehl
  2005-08-31  7:11     ` Vojtech Pavlik
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 20:06 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: linux-raid, linux-kernel

On Mon, 29 Aug 2005, Vojtech Pavlik wrote:

> On Mon, Aug 29, 2005 at 06:20:56PM +0000, Holger Kiehl wrote:
>> Hello
>>
>> I have a system with the following setup:
>>
>>     Board is Tyan S4882 with AMD 8131 Chipset
>>     4 Opterons 848 (2.2GHz)
>>     8 GB DDR400 Ram (2GB for each CPU)
>>     1 onboard Symbios Logic 53c1030 dual channel U320 controller
>>     2 SATA disks put together as a SW Raid1 for system, swap and spares
>>     8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
>>       are on one channel and the other four (sdg, sdh, sdi, sdj) on
>>       the other channel.
>>
>> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is
>> no other device on that bus. Unfortunatly I was unable to determine at
>> what speed it is running, here the output from lspci -vv:
>
>> How does one determine the PCI-X bus speed?
>
> Usually only the card (in your case the Symbios SCSI controller) can
> tell. If it does, it'll be most likely in 'dmesg'.
>
There is nothing in dmesg:

    Fusion MPT base driver 3.01.20
    Copyright (c) 1999-2004 LSI Logic Corporation
    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
    mptbase: Initiating ioc0 bringup
    ioc0: 53C1030: Capabilities={Initiator,Target}
    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
    mptbase: Initiating ioc1 bringup
    ioc1: 53C1030: Capabilities={Initiator,Target}
    Fusion MPT SCSI Host driver 3.01.20

>> Anyway, I thought with this system I would get theoretically 640 MB/s using
>> both channels.
>
> You can never use the full theoretical bandwidth of the channel for
> data. A lot of overhead remains for other signalling. Similarly for PCI.
>
>> I tested several software raid setups to get the best possible write
>> speeds for this system. But testing shows that the absolute maximum I
>> can reach with software raid is only approx. 270 MB/s for writting.
>> Which is very disappointing.
>
> I'd expect somewhat better (in the 300-400 MB/s range), but this is not
> too bad.
>
> To find where the bottleneck is, I'd suggest trying without the
> filesystem at all, and just filling a large part of the block device
> using the 'dd' command.
>
> Also, trying without the RAID, and just running 4 (and 8) concurrent
> dd's to the separate drives could show whether it's the RAID that's
> slowing things down.
>
Ok, I did run the following dd command in different combinations:

    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000

Here are the results:

    Each disk alone
    /dev/sdc1 59.094636 MB/s
    /dev/sdd1 58.686592 MB/s
    /dev/sde1 55.282807 MB/s
    /dev/sdf1 62.271240 MB/s
    /dev/sdg1 60.872891 MB/s
    /dev/sdh1 62.252781 MB/s
    /dev/sdi1 59.145637 MB/s
    /dev/sdj1 60.921119 MB/s

    sdc + sdd in parallel (2 disks on same channel)
    /dev/sdc1 42.512287 MB/s
    /dev/sdd1 43.118483 MB/s

    sdc + sdg in parallel (2 disks on different channels)
    /dev/sdc1 42.938186 MB/s
    /dev/sdg1 43.934779 MB/s

    sdc + sdd + sde in parallel (3 disks on same channel)
    /dev/sdc1 35.043501 MB/s
    /dev/sdd1 35.686878 MB/s
    /dev/sde1 34.580457 MB/s

    Similar results for three disks (sdg + sdh + sdi) on the other channel
    /dev/sdg1 36.381137 MB/s
    /dev/sdh1 37.541758 MB/s
    /dev/sdi1 35.834920 MB/s

    sdc + sdd + sde + sdf in parallel (4 disks on same channel)
    /dev/sdc1 31.432914 MB/s
    /dev/sdd1 32.058752 MB/s
    /dev/sde1 31.393455 MB/s
    /dev/sdf1 33.208165 MB/s

    And here for the four disks on the other channel
    /dev/sdg1 31.873028 MB/s
    /dev/sdh1 33.277193 MB/s
    /dev/sdi1 31.910000 MB/s
    /dev/sdj1 32.626744 MB/s

    All 8 disks in parallel
    /dev/sdc1 24.120545 MB/s
    /dev/sdd1 24.419801 MB/s
    /dev/sde1 24.296588 MB/s
    /dev/sdf1 25.609548 MB/s
    /dev/sdg1 24.572617 MB/s
    /dev/sdh1 25.552590 MB/s
    /dev/sdi1 24.575616 MB/s
    /dev/sdj1 25.124165 MB/s

So from these results, I may assume that md is not the cause of the problem.

What comes as a big surprise is that I lose 25% performance with only
two disks, each hanging on its own channel!

Is this normal? I wonder if other people have the same problem with
other controllers, or with this same one.

What can I do next to find out if this is a kernel, driver or hardware
problem?

Thanks,
Holger



* RE: Where is the performance bottleneck?
  2005-08-30 19:08   ` Holger Kiehl
@ 2005-08-30 23:05     ` Guy
  2005-09-28 20:04       ` Bill Davidsen
  0 siblings, 1 reply; 42+ messages in thread
From: Guy @ 2005-08-30 23:05 UTC (permalink / raw)
  To: 'Holger Kiehl', 'Mark Hahn'
  Cc: 'linux-raid', 'linux-kernel'

In most of your results, your CPU usage is very high.  Once you get to about
90% usage, you really can't do much else, unless you can improve the CPU
usage.

Guy

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Holger Kiehl
> Sent: Tuesday, August 30, 2005 3:09 PM
> To: Mark Hahn
> Cc: linux-raid; linux-kernel
> Subject: Re: Where is the performance bottleneck?
> 
> On Mon, 29 Aug 2005, Mark Hahn wrote:
> 
> >> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no
> other
> >> device on that bus. Unfortunatly I was unable to determine at what
> speed
> >> it is running, here the output from lspci -vv:
> > ...
> >>                  Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-,
> DC=simple,
> >
> > the "133MHz+" is a good sign.  OTOH the latency (72) seems rather low -
> my
> > understanding is that that would noticably limit the size of burst
> transfers.
> >
> I have tried with 128 and 144, but the transfer rate is only a little
> bit higher barely measurable. Or what values should I try?
> 
> >
> >> Version  1.03        ------Sequential Output------ --Sequential Input-
> --Random-
> >>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> >> Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> >> Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29
> 830.2   1
> >> Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24
> 659.8   1
> >> Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22
> 666.3   1
> >
> > you're obviously saturating something already with 2 disks.  did you
> play
> > with "blockdev --setra" setings?
> >
> Yes, I did play a little bit with it but this only changed read
> performance,
> it made no measurable difference when writting.
> 
> Thanks,
> Holger
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: Where is the performance bottleneck?
  2005-08-30 20:06   ` Holger Kiehl
@ 2005-08-31  7:11     ` Vojtech Pavlik
  2005-08-31  7:26       ` Jens Axboe
  2005-08-31 13:38       ` Holger Kiehl
  0 siblings, 2 replies; 42+ messages in thread
From: Vojtech Pavlik @ 2005-08-31  7:11 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
> >>How does one determine the PCI-X bus speed?
> >
> >Usually only the card (in your case the Symbios SCSI controller) can
> >tell. If it does, it'll be most likely in 'dmesg'.
> >
> There is nothing in dmesg:
> 
>    Fusion MPT base driver 3.01.20
>    Copyright (c) 1999-2004 LSI Logic Corporation
>    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>    mptbase: Initiating ioc0 bringup
>    ioc0: 53C1030: Capabilities={Initiator,Target}
>    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>    mptbase: Initiating ioc1 bringup
>    ioc1: 53C1030: Capabilities={Initiator,Target}
>    Fusion MPT SCSI Host driver 3.01.20
> 
> >To find where the bottleneck is, I'd suggest trying without the
> >filesystem at all, and just filling a large part of the block device
> >using the 'dd' command.
> >
> >Also, trying without the RAID, and just running 4 (and 8) concurrent
> >dd's to the separate drives could show whether it's the RAID that's
> >slowing things down.
> >
> Ok, I did run the following dd command in different combinations:
> 
>    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000

I think a bs of 4k is way too small and will cause huge CPU overhead.
Can you try with something like 4M? Also, you can use /dev/full to avoid
the pre-zeroing.

> Here the results:
> 
>    Each disk alone
>    /dev/sdc1 59.094636 MB/s
>    /dev/sdd1 58.686592 MB/s
>    /dev/sde1 55.282807 MB/s
>    /dev/sdf1 62.271240 MB/s
>    /dev/sdg1 60.872891 MB/s
>    /dev/sdh1 62.252781 MB/s
>    /dev/sdi1 59.145637 MB/s
>    /dev/sdj1 60.921119 MB/s

>    All 8 disks in parallel
>    /dev/sdc1 24.120545 MB/s
>    /dev/sdd1 24.419801 MB/s
>    /dev/sde1 24.296588 MB/s
>    /dev/sdf1 25.609548 MB/s
>    /dev/sdg1 24.572617 MB/s
>    /dev/sdh1 25.552590 MB/s
>    /dev/sdi1 24.575616 MB/s
>    /dev/sdj1 25.124165 MB/s

You're saturating some bus. It almost looks like it's the PCI-X,
although that should be able to deliver (if running at the full speed
of the AMD8132) up to 1GB/sec, so it SHOULD not be an issue.
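
(For reference: a 64-bit PCI-X bus at 133 MHz moves 8 bytes x 133 MHz,
roughly 1064 MB/s, and each U320 channel tops out at 320 MB/s, while the
eight parallel dd's above sum to only about 8 x 25 = 200 MB/s, well below
either figure.)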

> So from these results, I may assume that md is not the cause of the problem.
> 
> What comes as a big surprise is that I loose 25% performance with only
> two disks and each hanging on its own channel!
> 
> Is this normal? I wonder if other people have the same problem with
> other controllers or the same.

No, I don't think this is OK.

> What can I do next to find out if this is a kernel, driver or hardware
> problem?
 
You need to find where the bottleneck is, by removing one possible
bottleneck at a time in your test.

-- 
Vojtech Pavlik
SuSE Labs, SuSE CR


* Re: Where is the performance bottleneck?
  2005-08-31  7:11     ` Vojtech Pavlik
@ 2005-08-31  7:26       ` Jens Axboe
  2005-08-31 11:54         ` Holger Kiehl
  2005-08-31 13:38       ` Holger Kiehl
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31  7:26 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: Holger Kiehl, linux-raid, linux-kernel

On Wed, Aug 31 2005, Vojtech Pavlik wrote:
> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
> > >>How does one determine the PCI-X bus speed?
> > >
> > >Usually only the card (in your case the Symbios SCSI controller) can
> > >tell. If it does, it'll be most likely in 'dmesg'.
> > >
> > There is nothing in dmesg:
> > 
> >    Fusion MPT base driver 3.01.20
> >    Copyright (c) 1999-2004 LSI Logic Corporation
> >    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
> >    mptbase: Initiating ioc0 bringup
> >    ioc0: 53C1030: Capabilities={Initiator,Target}
> >    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
> >    mptbase: Initiating ioc1 bringup
> >    ioc1: 53C1030: Capabilities={Initiator,Target}
> >    Fusion MPT SCSI Host driver 3.01.20
> > 
> > >To find where the bottleneck is, I'd suggest trying without the
> > >filesystem at all, and just filling a large part of the block device
> > >using the 'dd' command.
> > >
> > >Also, trying without the RAID, and just running 4 (and 8) concurrent
> > >dd's to the separate drives could show whether it's the RAID that's
> > >slowing things down.
> > >
> > Ok, I did run the following dd command in different combinations:
> > 
> >    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
> 
> I think a bs of 4k is way too small and will cause huge CPU overhead.
> Can you try with something like 4M? Also, you can use /dev/full to avoid
> the pre-zeroing.

That was my initial thought as well, but since he's writing the io side
should look correct. I doubt 8 dd's writing 4k chunks will gobble that
much CPU as to make this much difference.

Holger, we need vmstat 1 info while the dd's are running. A simple
profile would be nice as well, boot with profile=2 and do a readprofile
-r; run tests; readprofile > foo and send the first 50 lines of foo to
this list.

-- 
Jens Axboe



* Re: Where is the performance bottleneck?
  2005-08-30 19:18   ` Holger Kiehl
@ 2005-08-31 10:30     ` Al Boldi
  0 siblings, 0 replies; 42+ messages in thread
From: Al Boldi @ 2005-08-31 10:30 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-kernel, linux-raid

Holger Kiehl wrote:
> On Mon, 29 Aug 2005, Al Boldi wrote:
> > You may be hitting a 2.6 kernel bug, which has something to do with
> > readahead, ask Jens Axboe about it! (see "[git patches] IDE update"
> > thread) Sadly, 2.6.13 did not fix it either.
>
> I did read that threat, but due to my limited understanding about kernel
> code, don't see the relation to my problem.

Basically the kernel is losing CPU cycles while accessing block devices.
The problem shows up most when the CPU/disk ratio is low.
Throwing more CPU cycles at the problem may seemingly remove this bottleneck.

> But I am willing to try any patches to solve the problem.

No patches yet.

> > Did you try 2.4.31?
>
> No. Will give this a try if the problem is not found.

Keep us posted!

--
Al


* Re: Where is the performance bottleneck?
  2005-08-31  7:26       ` Jens Axboe
@ 2005-08-31 11:54         ` Holger Kiehl
  2005-08-31 12:07           ` Jens Axboe
  2005-08-31 12:24           ` Nick Piggin
  0 siblings, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 11:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> On Wed, Aug 31 2005, Vojtech Pavlik wrote:
>> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
>>>>> How does one determine the PCI-X bus speed?
>>>>
>>>> Usually only the card (in your case the Symbios SCSI controller) can
>>>> tell. If it does, it'll be most likely in 'dmesg'.
>>>>
>>> There is nothing in dmesg:
>>>
>>>    Fusion MPT base driver 3.01.20
>>>    Copyright (c) 1999-2004 LSI Logic Corporation
>>>    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>>>    mptbase: Initiating ioc0 bringup
>>>    ioc0: 53C1030: Capabilities={Initiator,Target}
>>>    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>>>    mptbase: Initiating ioc1 bringup
>>>    ioc1: 53C1030: Capabilities={Initiator,Target}
>>>    Fusion MPT SCSI Host driver 3.01.20
>>>
>>>> To find where the bottleneck is, I'd suggest trying without the
>>>> filesystem at all, and just filling a large part of the block device
>>>> using the 'dd' command.
>>>>
>>>> Also, trying without the RAID, and just running 4 (and 8) concurrent
>>>> dd's to the separate drives could show whether it's the RAID that's
>>>> slowing things down.
>>>>
>>> Ok, I did run the following dd command in different combinations:
>>>
>>>    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
>>
>> I think a bs of 4k is way too small and will cause huge CPU overhead.
>> Can you try with something like 4M? Also, you can use /dev/full to avoid
>> the pre-zeroing.
>
> That was my initial thought as well, but since he's writing the io side
> should look correct. I doubt 8 dd's writing 4k chunks will gobble that
> much CPU as to make this much difference.
>
> Holger, we need vmstat 1 info while the dd's are running. A simple
> profile would be nice as well, boot with profile=2 and do a readprofile
> -r; run tests; readprofile > foo and send the first 50 lines of foo to
> this list.
>
Here is vmstat for 8 dd's, still with 4k blocksize:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  9  2   5244  38272 7738248  10400    0    0     3 11444  390    24  0  5 75 20
  5 10   5244  30824 7747680   8684    0    0     0 265672 2582  1917  1 95  0  4
  2 12   5244  30948 7747248   8708    0    0     0 222620 2858   292  0 33  0 67
  4 10   5244  31072 7747516   8644    0    0     0 236400 3132   326  0 43  0 57
  2 12   5244  31320 7747792   8512    0    0     0 250204 3225   285  0 37  0 63
  1 13   5244  30948 7747412   8552    0    0    24 227600 3261   312  0 41  0 59
  2 12   5244  32684 7746124   8616    0    0     0 235392 3219   274  0 32  0 68
  1 13   5244  30948 7747940   8568    0    0     0 228020 3394   296  0 37  0 63
  0 14   5244  31196 7747680   8624    0    0     0 232932 3389   300  0 32  0 68
  3 12   5244  31072 7747904   8536    0    0     0 233096 3545   312  0 33  0 67
  1 13   5244  31072 7747852   8520    0    0     0 226992 3381   290  0 31  0 69
  1 13   5244  31196 7747704   8396    0    0     0 230112 3372   265  0 28  0 72
  0 14   5244  31072 7747928   8512    0    0     0 240652 3491   295  0 33  0 67
  3 13   5244  31072 7748104   8608    0    0     0 222944 3433   269  0 27  0 73
  1 13   5244  31072 7748000   8508    0    0     0 207944 3470   294  0 28  0 72
  0 14   5244  31072 7747980   8528    0    0     0 234608 3496   272  0 31  0 69
  2 12   5244  31196 7748148   8496    0    0     0 228760 3480   280  0 28  0 72
  0 14   5244  30948 7748568   8620    0    0     0 214372 3551   302  0 29  0 71
  1 13   5244  31072 7748392   8524    0    0     0 226732 3494   284  0 29  0 71
  0 14   5244  31072 7748004   8640    0    0     0 229628 3604   273  0 26  0 74
  1 13   5244  30948 7748392   8660    0    0     0 212868 3563   266  0 28  0 72
  1 13   5244  30948 7748600   8520    0    0     0 228244 3568   294  0 30  0 70
  1 13   5244  31196 7748228   8416    0    0     0 221692 3543   258  0 27  0 73
  1 13   5244  31072 7748192   8520    0    0     0 241040 3983   330  0 25  0 74
  1 13   5244  31196 7748288   8560    0    0     0 217108 3676   276  0 28  0 72
                              .
                              .
                              .
                This goes on up to the end.
                              .
                              .
                              .
  0  3   5244 825096 6949252   8596    0    0     0 241244 2683   223  0  7 71 22
  0  2   5244 825108 6949252   8596    0    0     0 229764 2683   214  0  7 73 20
  0  3   5244 826348 6949252   8596    0    0     0 116840 2046   450  0  4 71 26
  0  3   5244 826976 6949252   8596    0    0     0 141992 1887    97  0  4 73 23
  0  3   5244 827100 6949252   8596    0    0     0 137716 1871    93  0  4 70 26
  0  3   5244 827100 6949252   8596    0    0     0 137032 1894    96  0  4 75 21
  0  3   5244 827224 6949252   8596    0    0     0 131332 1860   288  0  4 73 23
  0  1   5244 1943732 5833756   8620    0    0     0 72404 1560   481  0 24 61 16
  0  2   5244 1943732 5833756   8620    0    0     0 71680 1450    60  0  2 61 38
  0  2   5244 1943736 5833756   8620    0    0     0 71680 1464    70  0  2 52 46
  0  2   5244 1943736 5833756   8620    0    0     0 66560 1436    66  0  2 50 48
  0  2   5244 1943984 5833756   8620    0    0     0 71680 1454    72  0  2 50 48
  0  2   5244 1943984 5833756   8620    0    0     0 71680 1450    70  0  2 50 48
  1  0   5244 2906484 4872176   8612    0    0     0 12760 1240   321  0 13 68 19
  0  0   5244 3306732 4472300   8580    0    0     0     0 1109    31  0  9 91  0
  0  0   5244 3306732 4472300   8580    0    0     0     0 1008    22  0  0 100  0

And here is the profile output (I assume you meant sorted):

3236497 total                                      1.4547
2507913 default_idle                             52248.1875
158752 shrink_zone                               43.3275
121584 copy_user_generic_c                      3199.5789
  34271 __wake_up_bit                            713.9792
  31131 __make_request                            23.1629
  22096 scsi_request_fn                           18.4133
  21915 rotate_reclaimable_page                   80.5699
  20641 end_buffer_async_write                    86.0042
  18701 __clear_user                             292.2031
  13562 __block_write_full_page                   18.4266
  12981 test_set_page_writeback                   47.7243
  10772 kmem_cache_free                           96.1786
  10216 unlock_page                              159.6250
   9492 free_hot_cold_page                        32.9583
   9478 add_to_page_cache                         45.5673
   9117 page_waitqueue                            81.4018
   8671 drop_buffers                              38.7098
   8584 __set_page_dirty_nobuffers                31.5588
   8444 release_pages                             23.9886
   8204 scsi_dispatch_cmd                         14.2431
   8191 buffered_rmqueue                          11.6349
   7966 page_referenced                           22.6307
   7093 generic_file_buffered_write                4.1431
   6953 __pagevec_lru_add                         28.9708
   6740 __alloc_pages                              5.6926
   6369 __end_that_request_first                  11.7077
   5940 dnotify_parent                            30.9375
   5880 kmem_cache_alloc                          91.8750
   5797 submit_bh                                 19.0691
   4720 find_lock_page                            21.0714
   4612 __generic_file_aio_write_nolock            4.8042
   4559 __do_softirq                              20.3527
   4337 end_page_writeback                        54.2125
   4090 create_empty_buffers                      25.5625
   3985 bio_alloc_bioset                           9.2245
   3787 mempool_alloc                             12.4572
   3708 set_page_refs                            231.7500
   3545 __block_commit_write                      17.0433
   3037 system_call                               23.1832
   2968 zone_watermark_ok                         15.4583
   2966 cond_resched                              26.4821
   2828 generic_make_request                       4.7770
   2766 __mod_page_state                          86.4375
   2759 fget_light                                15.6761
   2692 test_clear_page_dirty                     11.2167
   2523 vfs_write                                  8.2993
   2406 generic_file_aio_write_nolock             15.0375
   2335 bio_put                                   36.4844
   2287 bad_range                                 23.8229

Under ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/ I put the full vmstat
and profile output (also with -v). There is also dmesg and my kernel.config
from this system.

I will also do some tests with 4M instead of 4k and, as Al Boldi hinted,
do a test together with some CPU load.

Thanks,
Holger



* Re: Where is the performance bottleneck?
  2005-08-31 11:54         ` Holger Kiehl
@ 2005-08-31 12:07           ` Jens Axboe
  2005-08-31 13:55             ` Holger Kiehl
  2005-08-31 12:24           ` Nick Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 12:07 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, Holger Kiehl wrote:
> >>>Ok, I did run the following dd command in different combinations:
> >>>
> >>>   dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
> >>
> >>I think a bs of 4k is way too small and will cause huge CPU overhead.
> >>Can you try with something like 4M? Also, you can use /dev/full to avoid
> >>the pre-zeroing.
> >
> >That was my initial thought as well, but since he's writing the io side
> >should look correct. I doubt 8 dd's writing 4k chunks will gobble that
> >much CPU as to make this much difference.
> >
> >Holger, we need vmstat 1 info while the dd's are running. A simple
> >profile would be nice as well, boot with profile=2 and do a readprofile
> >-r; run tests; readprofile > foo and send the first 50 lines of foo to
> >this list.
> >
> Here vmstat for 8 dd's still with 4k blocksize:
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- 
> ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id 
>  wa
>  9  2   5244  38272 7738248  10400    0    0     3 11444  390    24  0  5 
>  75 20
>  5 10   5244  30824 7747680   8684    0    0     0 265672 2582  1917  1 95  
>  0  4
>  2 12   5244  30948 7747248   8708    0    0     0 222620 2858   292  0 33  
>  0 67
>  4 10   5244  31072 7747516   8644    0    0     0 236400 3132   326  0 43  
>  0 57
>  2 12   5244  31320 7747792   8512    0    0     0 250204 3225   285  0 37  
>  0 63
>  1 13   5244  30948 7747412   8552    0    0    24 227600 3261   312  0 41  
>  0 59
>  2 12   5244  32684 7746124   8616    0    0     0 235392 3219   274  0 32  
>  0 68

[snip]

Looks as expected, nothing too excessive showing up. About 30-40% sys
time, but it should not bog the machine down that much.

> And here the profile output (I assume you meant sorted):

I did, thanks :)

> 3236497 total                                      1.4547
> 2507913 default_idle                             52248.1875
> 158752 shrink_zone                               43.3275
> 121584 copy_user_generic_c                      3199.5789
>  34271 __wake_up_bit                            713.9792
>  31131 __make_request                            23.1629
>  22096 scsi_request_fn                           18.4133
>  21915 rotate_reclaimable_page                   80.5699
>  20641 end_buffer_async_write                    86.0042
>  18701 __clear_user                             292.2031

Nothing sticks out here either. There's plenty of idle time. It smells
like a driver issue. Can you try the same dd test, but read from the
drives instead? Use a bigger blocksize here, 128 or 256k.

You might want to try the same with direct io, just to eliminate the
costly user copy. I don't expect it to make much of a difference though,
feels like the problem is elsewhere (driver, most likely).
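
A minimal sketch of such a direct-I/O read loop, assuming O_DIRECT and
posix_memalign are available; the device name, block size and read count are
placeholders, not a prescription:

#define _GNU_SOURCE                 /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
   int     fd;
   long    i;
   void    *buf;
   ssize_t n = 0;
   size_t  bs = 256 * 1024;         /* 256k reads, as suggested above */

   if ((fd = open("/dev/sdc1", O_RDONLY | O_DIRECT)) < 0)
      return 1;
   if (posix_memalign(&buf, 4096, bs) != 0)   /* O_DIRECT wants aligned buffers */
      return 1;
   for (i = 0; i < 20000; i++)                /* ~5 GB worth of reads */
      if ((n = read(fd, buf, bs)) <= 0)       /* data bypasses the page cache,  */
         break;                               /* so no copy_to_user is involved */
   if (n < 0)
      perror("read");
   free(buf);
   (void)close(fd);

   return 0;
}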

If we still can't get closer to this, it would be interesting to try my
block tracing stuff so we can see what is going on at the queue level.
But lets gather some more info first, since it requires testing -mm.

-- 
Jens Axboe



* Re: Where is the performance bottleneck?
  2005-08-31 11:54         ` Holger Kiehl
  2005-08-31 12:07           ` Jens Axboe
@ 2005-08-31 12:24           ` Nick Piggin
  2005-08-31 16:25             ` Holger Kiehl
  1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-08-31 12:24 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

Holger Kiehl wrote:

> 3236497 total                                      1.4547
> 2507913 default_idle                             52248.1875
> 158752 shrink_zone                               43.3275
> 121584 copy_user_generic_c                      3199.5789
>  34271 __wake_up_bit                            713.9792
>  31131 __make_request                            23.1629
>  22096 scsi_request_fn                           18.4133
>  21915 rotate_reclaimable_page                   80.5699
            ^^^^^^^^^

I don't think this function should be here. This indicates that
lots of writeout is happening due to pages falling off the end
of the LRU.

There was a bug recently causing memory estimates to be wrong
on Opterons that could cause this I think.

Can you send in 2 dumps of /proc/vmstat taken 10 seconds apart
while you're writing at full speed (with 2.6.13 or the latest
-git tree).

A dump of /proc/zoneinfo and /proc/meminfo while the write is
going on would be helpful too.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: Where is the performance bottleneck?
  2005-08-31  7:11     ` Vojtech Pavlik
  2005-08-31  7:26       ` Jens Axboe
@ 2005-08-31 13:38       ` Holger Kiehl
  1 sibling, 0 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 13:38 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: linux-raid, linux-kernel

On Wed, 31 Aug 2005, Vojtech Pavlik wrote:

> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
>>>> How does one determine the PCI-X bus speed?
>>>
>>> Usually only the card (in your case the Symbios SCSI controller) can
>>> tell. If it does, it'll be most likely in 'dmesg'.
>>>
>> There is nothing in dmesg:
>>
>>    Fusion MPT base driver 3.01.20
>>    Copyright (c) 1999-2004 LSI Logic Corporation
>>    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>>    mptbase: Initiating ioc0 bringup
>>    ioc0: 53C1030: Capabilities={Initiator,Target}
>>    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>>    mptbase: Initiating ioc1 bringup
>>    ioc1: 53C1030: Capabilities={Initiator,Target}
>>    Fusion MPT SCSI Host driver 3.01.20
>>
>>> To find where the bottleneck is, I'd suggest trying without the
>>> filesystem at all, and just filling a large part of the block device
>>> using the 'dd' command.
>>>
>>> Also, trying without the RAID, and just running 4 (and 8) concurrent
>>> dd's to the separate drives could show whether it's the RAID that's
>>> slowing things down.
>>>
>> Ok, I did run the following dd command in different combinations:
>>
>>    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
>
> I think a bs of 4k is way too small and will cause huge CPU overhead.
> Can you try with something like 4M? Also, you can use /dev/full to avoid
> the pre-zeroing.
>
Ok, I now use the following command:

       dd if=/dev/full of=/dev/sd?1 bs=4M count=4883

Here are the results for all 8 disks in parallel:

       /dev/sdc1 24.957257 MB/s
       /dev/sdd1 25.290177 MB/s
       /dev/sde1 25.046711 MB/s
       /dev/sdf1 26.369777 MB/s
       /dev/sdg1 24.080695 MB/s
       /dev/sdh1 25.008803 MB/s
       /dev/sdi1 24.202202 MB/s
       /dev/sdj1 24.712840 MB/s

A little bit faster but not much.

Holger



* Re: Where is the performance bottleneck?
  2005-08-31 12:07           ` Jens Axboe
@ 2005-08-31 13:55             ` Holger Kiehl
  2005-08-31 14:24               ` Dr. David Alan Gilbert
  2005-08-31 16:20               ` Jens Axboe
  0 siblings, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 13:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> Nothing sticks out here either. There's plenty of idle time. It smells
> like a driver issue. Can you try the same dd test, but read from the
> drives instead? Use a bigger blocksize here, 128 or 256k.
>
I used the following command reading from all 8 disks in parallel:

    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125

Here is the vmstat output (I just cut something out in the middle):

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
  1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70
  2  6   4348  37552 7805120   9644    0    0 322432     0 3527  4908  0 23 12 64
  0  8   4348  41152 7801552   9608    0    0 322176     0 3524  5018  0 24  6 70
  1  7   4348  41644 7801044   9572    0    0 322560     0 3530  5175  0 23  0 76
  1  7   4348  37184 7805396   9640    0    0 322176     0 3525  4914  0 24 18 59
  3  7   4348  41704 7800376   9832    0    0 322176    20 3531  5080  0 23  4 73
  1  7   4348  40652 7801700   9732    0    0 323072     0 3533  5115  0 24 13 64
  1  7   4348  40284 7802224   9616    0    0 322560     0 3527  4967  0 23  1 76
  0  8   4348  40156 7802356   9688    0    0 322560     0 3528  5080  0 23  2 75
  6  8   4348  41896 7799984   9816    0    0 322176     0 3530  4945  0 24 20 57
  0  8   4348  39540 7803124   9600    0    0 322560     0 3529  4811  0 24 21 55
  1  7   4348  41520 7801084   9600    0    0 322560     0 3532  4843  0 23 22 55
  0  8   4348  40408 7802116   9588    0    0 322560     0 3527  5010  0 23  4 72
  0  8   4348  38172 7804300   9580    0    0 322176     0 3526  4992  0 24  7 69
  4  7   4348  42264 7799784   9812    0    0 322688     0 3529  5003  0 24  8 68
  1  7   4348  39908 7802520   9660    0    0 322700     0 3529  4963  0 24 14 62
  0  8   4348  37428 7805076   9620    0    0 322420     0 3528  4967  0 23 15 62
  0  8   4348  37056 7805348   9688    0    0 322048     0 3525  4982  0 24 26 50
  1  7   4348  37804 7804456   9696    0    0 322560     0 3528  5072  0 24 16 60
  0  8   4348  38416 7804084   9660    0    0 323200     0 3533  5081  0 24 23 53
  0  8   4348  40160 7802300   9676    0    0 323200    28 3543  5095  0 24 17 59
  1  7   4348  37928 7804612   9608    0    0 323072     0 3532  5175  0 24  7 68
  2  6   4348  38680 7803724   9612    0    0 322944     0 3531  4906  0 25 24 51
  1  7   4348  40408 7802192   9648    0    0 322048     0 3524  4947  0 24 19 57

Full vmstat session can be found under:

   ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/vmstat-256k-read

And here is the profile data:

2106577 total                                      0.9469
1638177 default_idle                             34128.6875
179615 copy_user_generic_c                      4726.7105
  27670 end_buffer_async_read                    108.0859
  26055 shrink_zone                                7.1111
  23199 __make_request                            17.2612
  17221 kmem_cache_free                          153.7589
  11796 drop_buffers                              52.6607
  11016 add_to_page_cache                         52.9615
   9470 __wake_up_bit                            197.2917
   8760 buffered_rmqueue                          12.4432
   8646 find_get_page                             90.0625
   8319 __do_page_cache_readahead                 11.0625
   7976 kmem_cache_alloc                         124.6250
   7463 scsi_request_fn                            6.2192
   7208 try_to_free_buffers                       40.9545
   6716 create_empty_buffers                      41.9750
   6432 __end_that_request_first                  11.8235
   6044 test_clear_page_dirty                     25.1833
   5643 scsi_dispatch_cmd                          9.7969
   5588 free_hot_cold_page                        19.4028
   5479 submit_bh                                 18.0230
   3903 __alloc_pages                              3.2965
   3671 file_read_actor                            9.9755
   3425 thread_return                             14.2708
   3333 generic_make_request                       5.6301
   3294 bio_alloc_bioset                           7.6250
   2868 bio_put                                   44.8125
   2851 mpt_interrupt                              2.8284
   2697 mempool_alloc                              8.8717
   2642 block_read_full_page                       3.9315
   2512 do_generic_mapping_read                    2.1216
   2394 set_page_refs                            149.6250
   2235 alloc_page_buffers                         9.9777
   1992 __pagevec_lru_add                          8.3000
   1859 __memset                                   9.6823
   1791 page_waitqueue                            15.9911
   1783 scsi_end_request                           6.9648
   1348 dma_unmap_sg                               6.4808
   1324 bio_endio                                 11.8214
   1306 unlock_page                               20.4062
   1211 mptscsih_freeChainBuffers                  7.5687
   1141 alloc_pages_current                        7.9236
   1136 __mod_page_state                          35.5000
   1116 radix_tree_preload                         8.7188
   1061 __pagevec_release_nonlru                   6.6312
   1043 set_bh_page                                9.3125
   1024 release_pages                              2.9091
   1023 mempool_free                               6.3937
    832 alloc_buffer_head                         13.0000

Full profile data can be found under:

    ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/dd-256k-8disk-read.profile

> You might want to try the same with direct io, just to eliminate the
> costly user copy. I don't expect it to make much of a difference though,
> feels like the problem is elsewhere (driver, most likely).
>
Sorry, I don't know how to do this. Do you mean using a C program
that sets some flag to do direct io, or how can I do that?

> If we still can't get closer to this, it would be interesting to try my
> block tracing stuff so we can see what is going on at the queue level.
> But lets gather some more info first, since it requires testing -mm.
>
Ok, please then just tell me what I must do.

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 13:55             ` Holger Kiehl
@ 2005-08-31 14:24               ` Dr. David Alan Gilbert
  2005-08-31 20:56                 ` Holger Kiehl
  2005-08-31 16:20               ` Jens Axboe
  1 sibling, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2005-08-31 14:24 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

* Holger Kiehl (Holger.Kiehl@dwd.de) wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
> 
> Full vmstat session can be found under:

Have you got iostat?  iostat -x 10 might be interesting to see
for a period while it is going.
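
For example (assuming the sysstat iostat is installed), in a second shell
while the dd's are running:

    iostat -x 10 > iostat-write.log &

(the log file name is arbitrary) - the await and %util columns for the sd
devices are the interesting bits.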

Dave
--
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:20               ` Jens Axboe
@ 2005-08-31 15:16                 ` jmerkey
  2005-08-31 16:58                   ` Tom Callahan
  2005-08-31 17:11                   ` Jens Axboe
  2005-08-31 16:51                 ` Holger Kiehl
  1 sibling, 2 replies; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel



I have seen an 80MB/s limitation in the kernel unless this value is 
changed in the SCSI I/O layer
for 3Ware and other controllers during testing of 2.6.X series kernels.

Change these values in include/linux/blkdev.h and performance goes from 
80MB/S to over 670MB/S on the 3Ware controller.


//#define BLKDEV_MIN_RQ    4
//#define BLKDEV_MAX_RQ    128    /* Default maximum */
#define BLKDEV_MIN_RQ    4096
#define BLKDEV_MAX_RQ    8192    /* Default maximum */

Jeff



Jens Axboe wrote:

>On Wed, Aug 31 2005, Holger Kiehl wrote:
>
>>On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>>>Nothing sticks out here either. There's plenty of idle time. It smells
>>>like a driver issue. Can you try the same dd test, but read from the
>>>drives instead? Use a bigger blocksize here, 128 or 256k.
>>>
>>I used the following command reading from all 8 disks in parallel:
>>
>>   dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>
>>Here vmstat output (I just cut something out in the middle):
>>
>>procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>> r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>> 3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
>> 1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
>> 0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
>> 1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70
>
>Ok, so that's somewhat better than the writes but still off from what
>the individual drives can do in total.
>
>>>You might want to try the same with direct io, just to eliminate the
>>>costly user copy. I don't expect it to make much of a difference though,
>>>feels like the problem is elsewhere (driver, most likely).
>>>
>>Sorry, I don't know how to do this. Do you mean using a C program
>>that sets some flag to do direct io, or how can I do that?
>
>I've attached a little sample for you, just run ala
>
># ./oread /dev/sdX
>
>and it will read 128k chunks direct from that device. Run on the same
>drives as above, reply with the vmstat info again.
>
>------------------------------------------------------------------------
>
>#include <stdio.h>
>#include <stdlib.h>
>#define __USE_GNU
>#include <fcntl.h>
>#include <stdlib.h>
>#include <unistd.h>
>
>#define BS		(131072)
>#define ALIGN(buf)	(char *) (((unsigned long) (buf) + 4095) & ~(4095))
>#define BLOCKS		(8192)
>
>int main(int argc, char *argv[])
>{
>	char *p;
>	int fd, i;
>
>	if (argc < 2) {
>		printf("%s: <dev>\n", argv[0]);
>		return 1;
>	}
>
>	fd = open(argv[1], O_RDONLY | O_DIRECT);
>	if (fd == -1) {
>		perror("open");
>		return 1;
>	}
>
>	p = ALIGN(malloc(BS + 4095));
>	for (i = 0; i < BLOCKS; i++) {
>		int r = read(fd, p, BS);
>
>		if (r == BS)
>			continue;
>		else {
>			if (r == -1)
>				perror("read");
>
>			break;
>		}
>	}
>
>	return 0;
>}
>  
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:58                   ` Tom Callahan
@ 2005-08-31 15:47                     ` jmerkey
  0 siblings, 0 replies; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:47 UTC (permalink / raw)
  To: Tom Callahan
  Cc: Jens Axboe, Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel


I'll try this approach as well.  On 2.4.X kernels, I had to change
nr_requests to achieve performance, but I noticed it didn't seem to work
as well on 2.6.X.  I'll retry the change with nr_requests on 2.6.X.

Thanks

Jeff

Tom Callahan wrote:

>From linux-kernel mailing list.....
>
>Don't do this. BLKDEV_MIN_RQ sets the size of the mempool reserved
>requests and will only get slightly used in low memory conditions, so
>most memory will probably be wasted.....
>
>Change /sys/block/xxx/queue/nr_requests
>
>Tom Callahan
>TESSCO Technologies
>(443)-506-6216
>callahant@tessco.com
>
>
>
>jmerkey wrote:
>
[snip]


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 17:11                   ` Jens Axboe
@ 2005-08-31 15:59                     ` jmerkey
  2005-08-31 17:32                       ` Jens Axboe
  0 siblings, 1 reply; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel


512 is not enough. It has to be larger. I just tried 512 and it still 
limits the data rates.

Jeff


Jens Axboe wrote:

>On Wed, Aug 31 2005, jmerkey wrote:
>  
>
>>I have seen an 80MB/s limitation in the kernel unless this value is 
>>changed in the SCSI I/O layer
>>for 3Ware and other controllers during testing of 2.6.X series kernels.
>>
>>Change these values in include/linux/blkdev.h and performance goes from 
>>80MB/S to over 670MB/S on the 3Ware controller.
>>
>>
>>//#define BLKDEV_MIN_RQ    4
>>//#define BLKDEV_MAX_RQ    128    /* Default maximum */
>>#define BLKDEV_MIN_RQ    4096
>>#define BLKDEV_MAX_RQ    8192    /* Default maximum */
>>    
>>
>
>That's insane, you just wasted 1MiB of preallocated requests on each
>queue in the system!
>
>Please just do
>
># echo 512 > /sys/block/dev/queue/nr_requests
>
>after boot for each device you want to increase the queue size too. 512
>should be enough with the 3ware.
>
>  
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 13:55             ` Holger Kiehl
  2005-08-31 14:24               ` Dr. David Alan Gilbert
@ 2005-08-31 16:20               ` Jens Axboe
  2005-08-31 15:16                 ` jmerkey
  2005-08-31 16:51                 ` Holger Kiehl
  1 sibling, 2 replies; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 16:20 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]

On Wed, Aug 31 2005, Holger Kiehl wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
> 
> >Nothing sticks out here either. There's plenty of idle time. It smells
> >like a driver issue. Can you try the same dd test, but read from the
> >drives instead? Use a bigger blocksize here, 128 or 256k.
> >
> I used the following command reading from all 8 disks in parallel:
> 
>    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
> 
> Here vmstat output (I just cut something out in the middle):
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
>  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
>  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
>  1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70

Ok, so that's somewhat better than the writes but still off from what
the individual drives can do in total.

> >You might want to try the same with direct io, just to eliminate the
> >costly user copy. I don't expect it to make much of a difference though,
> >feels like the problem is elsewhere (driver, most likely).
> >
> Sorry, I don't know how to do this. Do you mean using a C program
> that sets some flag to do direct io, or how can I do that?

I've attached a little sample for you, just run ala

# ./oread /dev/sdX

and it will read 128k chunks direct from that device. Run on the same
drives as above, reply with the vmstat info again.
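
To hit all 8 drives in parallel, something along these lines should do
(sketch; compile with gcc -O2 -o oread oread.c first):

  for d in sdc1 sdd1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1; do
          ./oread /dev/$d &
  done
  vmstat 1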

-- 
Jens Axboe


[-- Attachment #2: oread.c --]
[-- Type: text/plain, Size: 647 bytes --]

#include <stdio.h>
#include <stdlib.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BS		(131072)
#define ALIGN(buf)	(char *) (((unsigned long) (buf) + 4095) & ~(4095))
#define BLOCKS		(8192)

int main(int argc, char *argv[])
{
	char *p;
	int fd, i;

	if (argc < 2) {
		printf("%s: <dev>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	p = ALIGN(malloc(BS + 4095));
	for (i = 0; i < BLOCKS; i++) {
		int r = read(fd, p, BS);

		if (r == BS)
			continue;
		else {
			if (r == -1)
				perror("read");

			break;
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 12:24           ` Nick Piggin
@ 2005-08-31 16:25             ` Holger Kiehl
  2005-08-31 17:25               ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 16:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Nick Piggin wrote:

> Holger Kiehl wrote:
>
>> 3236497 total                                      1.4547
>> 2507913 default_idle                             52248.1875
>> 158752 shrink_zone                               43.3275
>> 121584 copy_user_generic_c                      3199.5789
>>  34271 __wake_up_bit                            713.9792
>>  31131 __make_request                            23.1629
>>  22096 scsi_request_fn                           18.4133
>>  21915 rotate_reclaimable_page                   80.5699
>           ^^^^^^^^^
>
> I don't think this function should be here. This indicates that
> lots of writeout is happening due to pages falling off the end
> of the LRU.
>
> There was a bug recently causing memory estimates to be wrong
> on Opterons that could cause this I think.
>
> Can you send in 2 dumps of /proc/vmstat taken 10 seconds apart
> while you're writing at full speed (with 2.6.13 or the latest
> -git tree).
>
I took 2.6.13; there were no git snapshots at www.kernel.org when
I looked. With 2.6.13 I must load the Fusion MPT driver as a module.
When it is compiled in it does not detect the drives correctly; as a
module there is no problem.

Here is what I did:

    #!/bin/bash

    time dd if=/dev/full of=/dev/sdc1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdd1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sde1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdf1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdg1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdh1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdi1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdj1 bs=4M count=4883 &

    sleep 20

    cat /proc/vmstat > /root/vmstat-1.dump

    sleep 10

    cat /proc/vmstat > /root/vmstat-2.dump
    cat /proc/zoneinfo > /root/zoneinfo.dump
    cat /proc/meminfo > /root/meminfo.dump

    exit 0

vmstat-1.dump:

    nr_dirty 787282
    nr_writeback 44317
    nr_unstable 0
    nr_page_table_pages 633
    nr_mapped 6373
    nr_slab 53030
    pgpgin 263362
    pgpgout 5260352
    pswpin 0
    pswpout 0
    pgalloc_high 0
    pgalloc_normal 2448628
    pgalloc_dma 1041
    pgfree 2457343
    pgactivate 5775
    pgdeactivate 2113
    pgfault 465679
    pgmajfault 321
    pgrefill_high 0
    pgrefill_normal 5940
    pgrefill_dma 33
    pgsteal_high 0
    pgsteal_normal 148759
    pgsteal_dma 0
    pgscan_kswapd_high 0
    pgscan_kswapd_normal 153813
    pgscan_kswapd_dma 1089
    pgscan_direct_high 0
    pgscan_direct_normal 0
    pgscan_direct_dma 0
    pginodesteal 0
    slabs_scanned 0
    kswapd_steal 148759
    kswapd_inodesteal 0
    pageoutrun 5304
    allocstall 0
    pgrotated 0
    nr_bounce 0

vmstat-2.dump:

    nr_dirty 786397
    nr_writeback 44233
    nr_unstable 0
    nr_page_table_pages 640
    nr_mapped 6406
    nr_slab 53027
    pgpgin 263382
    pgpgout 7835732
    pswpin 0
    pswpout 0
    pgalloc_high 0
    pgalloc_normal 3091687
    pgalloc_dma 2420
    pgfree 3101327
    pgactivate 5817
    pgdeactivate 2918
    pgfault 466269
    pgmajfault 322
    pgrefill_high 0
    pgrefill_normal 28265
    pgrefill_dma 150
    pgsteal_high 0
    pgsteal_normal 789909
    pgsteal_dma 1388
    pgscan_kswapd_high 0
    pgscan_kswapd_normal 904101
    pgscan_kswapd_dma 4950
    pgscan_direct_high 0
    pgscan_direct_normal 0
    pgscan_direct_dma 0
    pginodesteal 0
    slabs_scanned 1152
    kswapd_steal 791297
    kswapd_inodesteal 0
    pageoutrun 28299
    allocstall 0
    pgrotated 562
    nr_bounce 0

zoneinfo.dump:

    Node 3, zone   Normal
      pages free     899
            min      726
            low      907
            high     1089
            active   3996
            inactive 490989
            scanned  0 (a: 16 i: 0)
            spanned  524287
            present  524287
            protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 2
                  low:   62
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       10186
                numa_miss:      3313
                numa_foreign:   0
                interleave_hit: 10136
                local_node:     0
                other_node:     13499
        cpu: 1 pcp: 0
                  count: 13
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       6559
                numa_miss:      1668
                numa_foreign:   0
                interleave_hit: 6559
                local_node:     0
                other_node:     8227
        cpu: 2 pcp: 0
                  count: 84
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5579
                numa_miss:      12806
                numa_foreign:   0
                interleave_hit: 5579
                local_node:     0
                other_node:     18385
        cpu: 3 pcp: 0
                  count: 93
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 55
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       834769
                numa_miss:      1
                numa_foreign:   940192
                interleave_hit: 5563
                local_node:     834770
                other_node:     0
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         1572864
    Node 2, zone   Normal
      pages free     1036
            min      726
            low      907
            high     1089
            active   360
            inactive 501700
            scanned  0 (a: 26 i: 0)
            spanned  524287
            present  524287
            protection: (0, 0, 0)
      pagesets
        cpu: 1 pcp: 0
                  count: 91
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       6002
                numa_miss:      15490
                numa_foreign:   0
                interleave_hit: 6002
                local_node:     0
                other_node:     21492
        cpu: 2 pcp: 0
                  count: 75
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 56
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       410692
                numa_miss:      0
                numa_foreign:   76064
                interleave_hit: 5223
                local_node:     410692
                other_node:     0
        cpu: 3 pcp: 0
                  count: 73
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5163
                numa_miss:      288909
                numa_foreign:   1
                interleave_hit: 5152
                local_node:     0
                other_node:     294072
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         1048576
    Node 1, zone   Normal
      pages free     859
            min      703
            low      878
            high     1054
            active   1224
            inactive 485043
            scanned  0 (a: 14 i: 0)
            spanned  507903
            present  507760
            protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 1
                  low:   62
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       9443
                numa_miss:      15475
                numa_foreign:   18446604437880297808
                interleave_hit: 18446604437880307200
                local_node:     1
                other_node:     24917
        cpu: 1 pcp: 0
                  count: 181
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 38
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       368191
                numa_miss:      0
                numa_foreign:   39046
                interleave_hit: 5967
                local_node:     368191
                other_node:     0
        cpu: 2 pcp: 0
                  count: 85
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5139
                numa_miss:      18963
                numa_foreign:   0
                interleave_hit: 5139
                local_node:     0
                other_node:     24102
        cpu: 3 pcp: 0
                  count: 92
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5124
                numa_miss:      363472
                numa_foreign:   0
                interleave_hit: 5115
                local_node:     0
                other_node:     368596
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         524288
    Node 0, zone      DMA
      pages free     2045
            min      5
            low      6
            high     7
            active   0
            inactive 992
            scanned  0 (a: 2 i: 2)
            spanned  4096
            present  3994
            protection: (0, 2031, 2031)
      pagesets
        cpu: 0 pcp: 0
                  count: 1
                  low:   2
                  high:  6
                  batch: 1
        cpu: 0 pcp: 1
                  count: 1
                  low:   0
                  high:  2
                  batch: 1
                numa_hit:       18446604437880298786
                numa_miss:      18446604442220017848
                numa_foreign:   0
                interleave_hit: 0
                local_node:     7567460
                other_node:     0
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         0
    Node 0, zone   Normal
      pages free     1052
            min      721
            low      901
            high     1081
            active   845
            inactive 480162
            scanned  0 (a: 2 i: 0)
            spanned  520191
            present  520191
            protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 96
                  low:   62
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 50
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       18446604437880708763
                numa_miss:      18446604439958819000
                numa_foreign:   29590
                interleave_hit: 9679
                local_node:     7977309
                other_node:     0
        cpu: 1 pcp: 0
                  count: 88
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       6206
                numa_miss:      21831
                numa_foreign:   0
                interleave_hit: 6206
                local_node:     0
                other_node:     28037
        cpu: 2 pcp: 0
                  count: 65
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5367
                numa_miss:      44135
                numa_foreign:   0
                interleave_hit: 5365
                local_node:     0
                other_node:     49502
        cpu: 3 pcp: 0
                  count: 92
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5544
                numa_miss:      286378
                numa_foreign:   0
                interleave_hit: 5507
                local_node:     0
                other_node:     291922
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         4096

meminfo.dump:

    MemTotal:      8124172 kB
    MemFree:         23564 kB
    Buffers:       7825944 kB
    Cached:          19216 kB
    SwapCached:          0 kB
    Active:          25708 kB
    Inactive:      7835548 kB
    HighTotal:           0 kB
    HighFree:            0 kB
    LowTotal:      8124172 kB
    LowFree:         23564 kB
    SwapTotal:    15631160 kB
    SwapFree:     15631160 kB
    Dirty:         3145604 kB
    Writeback:      176452 kB
    Mapped:          25624 kB
    Slab:           212116 kB
    CommitLimit:  19693244 kB
    Committed_AS:    85112 kB
    PageTables:       2560 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed:     16288 kB
    VmallocChunk: 34359721635 kB

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:20               ` Jens Axboe
  2005-08-31 15:16                 ` jmerkey
@ 2005-08-31 16:51                 ` Holger Kiehl
  2005-08-31 17:35                   ` Jens Axboe
  2005-08-31 18:06                   ` Michael Tokarev
  1 sibling, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 16:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> On Wed, Aug 31 2005, Holger Kiehl wrote:
>> On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>>> Nothing sticks out here either. There's plenty of idle time. It smells
>>> like a driver issue. Can you try the same dd test, but read from the
>>> drives instead? Use a bigger blocksize here, 128 or 256k.
>>>
>> I used the following command reading from all 8 disks in parallel:
>>
>>    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>
>> Here vmstat output (I just cut something out in the middle):
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>>  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
>>  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
>>  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
>>  1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70
>
> Ok, so that's somewhat better than the writes but still off from what
> the individual drives can do in total.
>
>>> You might want to try the same with direct io, just to eliminate the
>>> costly user copy. I don't expect it to make much of a difference though,
>>> feels like the problem is elsewhere (driver, most likely).
>>>
>> Sorry, I don't know how to do this. Do you mean using a C program
>> that sets some flag to do direct io, or how can I do that?
>
> I've attached a little sample for you, just run ala
>
> # ./oread /dev/sdX
>
> and it will read 128k chunks direct from that device. Run on the same
> drives as above, reply with the vmstat info again.
>
Using kernel 2.6.12.5 again, here are the results:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  0  0      0 8009648   4764  40592    0    0     0     0 1011    32  0  0 100  0
  0  0      0 8009648   4764  40592    0    0     0     0 1011    34  0  0 100  0
  0  0      0 8009648   4764  40592    0    0     0     0 1008    61  0  0 100  0
  0  0      0 8009648   4764  40592    0    0     0     0 1006    26  0  0 100  0
  0  8      0 8006372   4764  40592    0    0 120192     0 1944  1929  0  1 89 10
  2  8      0 8006372   4764  40592    0    0 319488     0 3502  4999  0  2 75 24
  0  8      0 8006372   4764  40592    0    0 319488     0 3506  4995  0  2 75 24
  0  8      0 8006372   4764  40592    0    0 319744     0 3504  4999  0  1 75 24
  0  8      0 8006372   4764  40592    0    0 319488     0 3507  5009  0  2 75 23
  0  8      0 8006372   4764  40592    0    0 319616     0 3506  5011  0  2 75 24
  0  8      0 8005124   4800  41100    0    0 319976     0 3536  4995  0  2 73 25
  0  8      0 8005124   4800  41100    0    0 323584     0 3534  5000  0  2 75 23
  0  8      0 8005124   4800  41100    0    0 323968     0 3540  5035  0  1 75 24
  0  8      0 8005124   4800  41100    0    0 319232     0 3506  4811  0  1 75 24
  0  8      0 8005504   4800  41100    0    0 317952     0 3498  4747  0  1 75 24
  0  8      0 8005504   4800  41100    0    0 318720     0 3495  4672  0  2 75 23
  1  8      0 8005504   4800  41100    0    0 318720     0 3509  4707  0  1 75 24
  0  8      0 8005504   4800  41100    0    0 318720     0 3499  4667  0  2 75 23
  0  8      0 8005504   4808  41092    0    0 318848    40 3509  4674  0  1 75 24
  0  8      0 8005380   4808  41092    0    0 318848     0 3497  4693  0  2 72 26
  0  8      0 8005380   4808  41092    0    0 318592     0 3500  4646  0  2 75 23
  0  8      0 8005380   4808  41092    0    0 318592     0 3495  4828  0  2 61 37
  0  8      0 8005380   4808  41092    0    0 318848     0 3499  4827  0  1 62 37
  1  8      0 8005380   4808  41092    0    0 318464     0 3495  4642  0  2 75 23
  0  8      0 8005380   4816  41084    0    0 318848    32 3511  4672  0  1 75 24
  0  8      0 8005380   4816  41084    0    0 320640     0 3512  4877  0  2 75 23
  0  8      0 8005380   4816  41084    0    0 322944     0 3533  5047  0  2 75 24
  0  8      0 8005380   4816  41084    0    0 322816     0 3531  5053  0  1 75 24
  0  8      0 8005380   4816  41084    0    0 322944     0 3531  5048  0  2 75 23
  0  8      0 8005380   4816  41084    0    0 322944     0 3529  5043  0  1 75 24
  0  0      0 8008360   4816  41084    0    0 266880     0 3112  4224  0  2 78 20
  0  0      0 8008360   4816  41084    0    0     0     0 1012    28  0  0 100  0

Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 15:16                 ` jmerkey
@ 2005-08-31 16:58                   ` Tom Callahan
  2005-08-31 15:47                     ` jmerkey
  2005-08-31 17:11                   ` Jens Axboe
  1 sibling, 1 reply; 42+ messages in thread
From: Tom Callahan @ 2005-08-31 16:58 UTC (permalink / raw)
  To: jmerkey
  Cc: Jens Axboe, Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel

From linux-kernel mailing list.....

Don't do this. BLKDEV_MIN_RQ sets the size of the mempool reserved
requests and will only get slightly used in low memory conditions, so
most memory will probably be wasted.....

Change /sys/block/xxx/queue/nr_requests

Tom Callahan
TESSCO Technologies
(443)-506-6216
callahant@tessco.com



jmerkey wrote:

>I have seen an 80MB/s limitation in the kernel unless this value is 
>changed in the SCSI I/O layer
>for 3Ware and other controllers during testing of 2.6.X series kernels.
>
>Change these values in include/linux/blkdev.h and performance goes from 
>80MB/S to over 670MB/S on the 3Ware controller.
>
>
>//#define BLKDEV_MIN_RQ    4
>//#define BLKDEV_MAX_RQ    128    /* Default maximum */
>#define BLKDEV_MIN_RQ    4096
>#define BLKDEV_MAX_RQ    8192    /* Default maximum */
>
>Jeff
>
>
>
>Jens Axboe wrote:
>
[snip]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 15:16                 ` jmerkey
  2005-08-31 16:58                   ` Tom Callahan
@ 2005-08-31 17:11                   ` Jens Axboe
  2005-08-31 15:59                     ` jmerkey
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:11 UTC (permalink / raw)
  To: jmerkey; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, jmerkey wrote:
> 
> 
> I have seen an 80MB/s limitation in the kernel unless this value is 
> changed in the SCSI I/O layer
> for 3Ware and other controllers during testing of 2.6.X series kernels.
> 
> Change these values in include/linux/blkdev.h and performance goes from 
> 80MB/S to over 670MB/S on the 3Ware controller.
> 
> 
> //#define BLKDEV_MIN_RQ    4
> //#define BLKDEV_MAX_RQ    128    /* Default maximum */
> #define BLKDEV_MIN_RQ    4096
> #define BLKDEV_MAX_RQ    8192    /* Default maximum */

That's insane, you just wasted 1MiB of preallocated requests on each
queue in the system!

Please just do

# echo 512 > /sys/block/dev/queue/nr_requests

after boot for each device you want to increase the queue size too. 512
should be enough with the 3ware.
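
For several devices at once, e.g. (a sketch with a shell glob):

  for q in /sys/block/sd*/queue/nr_requests; do
          echo 512 > $q
  done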

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:25             ` Holger Kiehl
@ 2005-08-31 17:25               ` Nick Piggin
  2005-08-31 21:57                 ` Holger Kiehl
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-08-31 17:25 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

Holger Kiehl wrote:

> meminfo.dump:
> 
>    MemTotal:      8124172 kB
>    MemFree:         23564 kB
>    Buffers:       7825944 kB
>    Cached:          19216 kB
>    SwapCached:          0 kB
>    Active:          25708 kB
>    Inactive:      7835548 kB
>    HighTotal:           0 kB
>    HighFree:            0 kB
>    LowTotal:      8124172 kB
>    LowFree:         23564 kB
>    SwapTotal:    15631160 kB
>    SwapFree:     15631160 kB
>    Dirty:         3145604 kB

Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
so maybe I've just led you on a goose chase.

You could
     echo 5 > /proc/sys/vm/dirty_background_ratio
     echo 10 > /proc/sys/vm/dirty_ratio

To further reduce dirty memory in the system, however this is
a long shot, so please continue your interaction with the
other people in the thread first.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 15:59                     ` jmerkey
@ 2005-08-31 17:32                       ` Jens Axboe
  0 siblings, 0 replies; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:32 UTC (permalink / raw)
  To: jmerkey; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, jmerkey wrote:
> 
> 512 is not enough. It has to be larger. I just tried 512 and it still 
> limits the data rates.

Please don't top post.

512 wasn't the point, setting it properly is the point. If you need more
than 512, go ahead. This isn't Holger's problem, though, the reading
would be much faster if it was. If the fusion is using a large queue
depth, increasing nr_requests would likely help the writes (but not to
the extent of where it would suddenly be as fast as it should).
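
(A quick way to see what queue depth the driver negotiated, assuming the
sysfs attribute is exported on this kernel:

  grep . /sys/block/sd*/device/queue_depth

one line per disk.)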

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:51                 ` Holger Kiehl
@ 2005-08-31 17:35                   ` Jens Axboe
  2005-08-31 19:00                     ` Holger Kiehl
  2005-08-31 18:06                   ` Michael Tokarev
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:35 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, Holger Kiehl wrote:
> ># ./oread /dev/sdX
> >
> >and it will read 128k chunks direct from that device. Run on the same
> >drives as above, reply with the vmstat info again.
> >
> Using kernel 2.6.12.5 again, here are the results:

[snip]

Ok, reads as expected, like the buffered io but using less system time.
And you are still 1/3 off the target data rate, hmmm...

With the reads, how does the aggregate bandwidth look when you add
'clients'? Same as with writes, gradually decreasing per-device
throughput?
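
One way to see the scaling (bash sketch, reusing the oread binary from
before):

  for set in "sdc1" "sdc1 sdd1" "sdc1 sdd1 sde1 sdf1" \
             "sdc1 sdd1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1"; do
          for d in $set; do ./oread /dev/$d & done
          time wait
  done

Each oread reads 1GiB per drive, so dividing by the elapsed time of each
pass gives the aggregate rate as clients are added.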

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:51                 ` Holger Kiehl
  2005-08-31 17:35                   ` Jens Axboe
@ 2005-08-31 18:06                   ` Michael Tokarev
  2005-08-31 18:52                     ` Ming Zhang
  1 sibling, 1 reply; 42+ messages in thread
From: Michael Tokarev @ 2005-08-31 18:06 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

Holger Kiehl wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
> 
>> On Wed, Aug 31 2005, Holger Kiehl wrote:
>>
[]
>>> I used the following command reading from all 8 disks in parallel:
>>>
>>>    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>>
>>> Here vmstat output (I just cut something out in the middle):
>>>
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>>>  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987 0 22 0 78
>>>  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987 0 23 4 74
>>>  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955 0 22 12 66
>>
>> Ok, so that's somewhat better than the writes but still off from what
>> the individual drives can do in total.
>>
>>>> You might want to try the same with direct io, just to eliminate the
>>>> costly user copy. I don't expect it to make much of a difference though,
>>>> feels like the problem is elsewhere (driver, most likely).
>>>>
>>> Sorry, I don't know how to do this. Do you mean using a C program
>>> that sets some flag to do direct io, or how can I do that?
>>
>> I've attached a little sample for you, just run ala
>>
>> # ./oread /dev/sdX
>>
>> and it will read 128k chunks direct from that device. Run on the same
>> drives as above, reply with the vmstat info again.
>>
> Using kernel 2.6.12.5 again, here are the results:
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  0  8      0 8005380   4816  41084    0    0 318848    32 3511  4672  0 1 75 24
>  0  8      0 8005380   4816  41084    0    0 320640     0 3512  4877  0 2 75 23
>  0  8      0 8005380   4816  41084    0    0 322944     0 3533  5047  0 2 75 24
>  0  8      0 8005380   4816  41084    0    0 322816     0 3531  5053  0 1 75 24
>  0  8      0 8005380   4816  41084    0    0 322944     0 3531  5048  0 2 75 23
>  0  8      0 8005380   4816  41084    0    0 322944     0 3529  5043  0 1 75 24
>  0  0      0 8008360   4816  41084    0    0 266880     0 3112  4224  0 2 78 20

I went on and did similar tests on our box, which is:

 dual Xeon 2.44GHz with HT (so it's like 4 logical CPUs)
 dual-channel AIC-7902 U320 controller
 8 SEAGATE ST336607LW drives attached to the 2 channels of the
  controller, sd[abcd] to channel 0 and sd[efgh] to channel 1

Each drive is capable of about 60 megabytes/sec.
The kernel is 2.6.13 from kernel.org.

With direct-reading:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  8     12     87    471   1839    0    0 455296    84 1936  3739  0  3 47 50
 1  7     12     87    471   1839    0    0 456704    80 1941  3744  0  4 48 48
 1  7     12     87    471   1839    0    0 446464    82 1914  3648  0  2 48 50
 0  8     12     87    471   1839    0    0 454016    94 1944  3765  0  2 47 50
 0  8     12     87    471   1839    0    0 458752    60 1944  3746  0  2 48 50

Without direct:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 8  6     12     80    470   1839    0    0 359966   124 1726  2270  1 89  0 10
 2  7     12     80    470   1839    0    0 352813   113 1741  2124  1 88  1 10
 8  4     12     80    471   1839    0    0 358990    34 1669  1934  1 94  0  5
 7  5     12     79    471   1839    0    0 354065   157 1761  2128  1 90  1  8
 6  5     12     80    471   1839    0    0 358062    44 1686  1911  1 93  0  6

So the difference between direct and "indirect" reading is quite
significant.  And with direct reading, all 8 drives are up to their real
speed.  Note the CPU usage in the "indirect" case too - it's about 90%...

And here's an idle system as well:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0     12     89    471   1839    0    0     0    58  151   358  0  0 100  0
 0  0     12     89    471   1839    0    0     0    66  157   167  0  0 99  0

Too bad I can't perform write tests on this system...

/mjt

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 18:06                   ` Michael Tokarev
@ 2005-08-31 18:52                     ` Ming Zhang
  2005-08-31 18:57                       ` Ming Zhang
  0 siblings, 1 reply; 42+ messages in thread
From: Ming Zhang @ 2005-08-31 18:52 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Holger Kiehl, Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2478 bytes --]

join the party. ;)

8 400GB SATA disks on the same Marvell 8 port PCIX-133 card. P4 CPU.
Supermicro SCT board.

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6]
[raid10] [faulty]
md0 : active raid0 sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      3125690368 blocks 64k chunks

8 DISK RAID0 from same slot and card. Stripe size is 512KB.

run oread

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1      0 533216 330424  11004    0    0  7128  1610 1069    77  0  2 95  3
 1  0      0 298464 560828  11004    0    0 230404     0 2595  1389  1 23  0 76
 0  1      0  64736 792248  11004    0    0 231420     0 2648  1342  0 26  0 74
 1  0      0   8948 848416   9696    0    0 229376     0 2638  1337  0 29  0 71
 0  0      0 868896    768   9696    0    0 29696    48 1224   162  0 19 73  8

# time ./oread /dev/md0

real    0m6.595s
user    0m0.004s
sys     0m0.151s

run dd

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  2      0 854008   2932  17108    0    0  7355  1606 1071    80  0  2 95  3
 0  2      0 848888   3112  21388    0    0 164332     0 2985  3564  2  7  0 91
 0  2      0 844024   3260  25664    0    0 164040     0 2990  3665  1  7  0 92
 0  2      0 840328   3380  28920    0    0 164272     0 2932  3791  1  9  0 90
 0  2      0 836360   3500  32232    0    0 163688   100 3001  5045  2  7  0 91
 0  2      0 831432   3644  36612    0    0 164120   568 2977  3843  0  9  0 91
 0  1      0 826056   3752  41688    0    0  7872     0 1267  1474  1  3  0 96

# time dd if=/dev/md0 of=/dev/null bs=131072 count=8192
8192+0 records in
8192+0 records out

real    0m4.771s
user    0m0.005s
sys     0m0.973s

So the reasonable part here is that, because of O_DIRECT, the sys time
is reduced a lot.

But the total time is longer! The reason I found is...

I attached a new oread.c which allows setting the block size of each read
and the total read count, so I can read a full stripe at a time:

# time ./oread /dev/md0 524288 2048

real    0m4.950s
user    0m0.000s
sys     0m0.131s

compared to 

# time ./oread /dev/md0 131072 8192

real    0m6.633s
user    0m0.002s
sys     0m0.191s
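
(Both runs read 2048 * 512KiB = 8192 * 128KiB = 1GiB in total, so that is
roughly 1024MiB / 4.95s ~= 207MiB/s with full-stripe reads versus
1024MiB / 6.633s ~= 154MiB/s with 128KiB reads.)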


But still, I get linear speed up to 4 disks, then no speed gain when
adding more disks to the RAID.

Ming


[-- Attachment #2: oread.c --]
[-- Type: text/x-csrc, Size: 673 bytes --]

#include <stdio.h>
#include <stdlib.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN(buf)	(char *) (((unsigned long) (buf) + 4095) & ~(4095))

int main(int argc, char *argv[])
{
	char *p;
	int fd, i;
	int BS, BLOCKS;

	if (argc < 4) {
		printf("%s: <dev> bs cnt\n", argv[0]);
		return 1;
	}

	BS = atoi(argv[2]);
	BLOCKS = atoi(argv[3]);
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	p = ALIGN(malloc(BS + 4095));
	for (i = 0; i < BLOCKS; i++) {
		int r = read(fd, p, BS);

		if (r == BS)
			continue;
		else {
			if (r == -1)
				perror("read");

			break;
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 18:52                     ` Ming Zhang
@ 2005-08-31 18:57                       ` Ming Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Ming Zhang @ 2005-08-31 18:57 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Holger Kiehl, Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

forgot to attach the lspci output.

it is a 133MHz PCI-X card but it only runs at 66MHz.

quick question: where can I check whether it is running at 64 bit?

66MHz * 32 bit / 8 * 80% bus utilization ~= 211MB/s, which matches the upper
speed I am seeing now...
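
For what it's worth, here is a tiny stand-alone sketch of that arithmetic for
the other width/clock combinations, using the same 80% utilization factor
(which is only a rough guess, not a number from the PCI-X spec):

#include <stdio.h>

/* Theoretical bus throughput = clock * bus width in bytes * utilization.
 * The 0.80 factor is the rough utilization guess used above. */
int main(void)
{
	static const int clocks_mhz[] = { 66, 100, 133 };
	static const int width_bits[] = { 32, 64 };
	int c, w;

	for (w = 0; w < 2; w++)
		for (c = 0; c < 3; c++)
			printf("%3d MHz x %2d bit ~= %4.0f MB/s\n",
			       clocks_mhz[c], width_bits[w],
			       clocks_mhz[c] * (width_bits[w] / 8) * 0.80);
	return 0;
}

So if the card really were running 64 bit at 66MHz, the same estimate would be
around 422MB/s rather than 211MB/s.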

Ming


02:01.0 SCSI storage controller: Marvell MV88SX5081 8-port SATA I PCI-X
Controller (rev 03)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 128, Cache Line Size 08
        Interrupt: pin A routed to IRQ 24
        Region 0: Memory at fa000000 (64-bit, non-prefetchable)
        Capabilities: [40] Power Management version 2
                Flags: PMEClk+ DSI- D1- D2- AuxCurrent=0mA PME
(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: 64bit+
Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [60] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=3
                Status: Bus=2 Dev=1 Func=0 64bit+ 133MHz+ SCD- USC-,
DC=simple, DMMRBC=0, DMOST=3, DMCRS=0, RSCEM-


On Wed, 2005-08-31 at 14:52 -0400, Ming Zhang wrote:
> join the party. ;)
> 
> 8 400GB SATA disks on the same Marvell 8-port PCI-X 133 card. P4 CPU.
> Supermicro SCT board.
> 
> # cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6]
> [raid10] [faulty]
> md0 : active raid0 sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda
> [0]
>       3125690368 blocks 64k chunks
> 
> 8-disk RAID0 from the same slot and card. The full stripe size is 512KB.
> 
> run oread
> 
> # vmstat 1
> procs -----------memory---------- ---swap-- -----io---- --system-- ----
> cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy
> id wa
>  1  1      0 533216 330424  11004    0    0  7128  1610 1069    77  0  2
> 95  3
>  1  0      0 298464 560828  11004    0    0 230404     0 2595  1389  1
> 23  0 76
>  0  1      0  64736 792248  11004    0    0 231420     0 2648  1342  0
> 26  0 74
>  1  0      0   8948 848416   9696    0    0 229376     0 2638  1337  0
> 29  0 71
>  0  0      0 868896    768   9696    0    0 29696    48 1224   162  0 19
> 73  8
> 
> # time ./oread /dev/md0
> 
> real    0m6.595s
> user    0m0.004s
> sys     0m0.151s
> 
> run dd
> 
> # vmstat 1
> procs -----------memory---------- ---swap-- -----io---- --system-- ----
> cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy
> id wa
>  2  2      0 854008   2932  17108    0    0  7355  1606 1071    80  0  2
> 95  3
>  0  2      0 848888   3112  21388    0    0 164332     0 2985  3564  2
> 7  0 91
>  0  2      0 844024   3260  25664    0    0 164040     0 2990  3665  1
> 7  0 92
>  0  2      0 840328   3380  28920    0    0 164272     0 2932  3791  1
> 9  0 90
>  0  2      0 836360   3500  32232    0    0 163688   100 3001  5045  2
> 7  0 91
>  0  2      0 831432   3644  36612    0    0 164120   568 2977  3843  0
> 9  0 91
>  0  1      0 826056   3752  41688    0    0  7872     0 1267  1474  1  3
> 0 96
> 
> # time dd if=/dev/md0 of=/dev/null bs=131072 count=8192
> 8192+0 records in
> 8192+0 records out
> 
> real    0m4.771s
> user    0m0.005s
> sys     0m0.973s
> 
> so the part that makes sense here is that, because of O_DIRECT, the sys
> time is reduced a lot.
> 
> but the elapsed time is longer! the reason i found is...
> 
> i attached a new oread.c which allows setting the block size of each read
> and the total read count, so i can read a full stripe at a time:
> 
> # time ./oread /dev/md0 524288 2048
> 
> real    0m4.950s
> user    0m0.000s
> sys     0m0.131s
> 
> compared to 
> 
> # time ./oread /dev/md0 131072 8192
> 
> real    0m6.633s
> user    0m0.002s
> sys     0m0.191s
> 
> 
> but still, I only get linear scaling up to 4 disks, then no speed gain when
> adding more disks to the RAID.
> 
> Ming
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 17:35                   ` Jens Axboe
@ 2005-08-31 19:00                     ` Holger Kiehl
  0 siblings, 0 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 19:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> On Wed, Aug 31 2005, Holger Kiehl wrote:
>>> # ./oread /dev/sdX
>>>
>>> and it will read 128k chunks direct from that device. Run on the same
>>> drives as above, reply with the vmstat info again.
>>>
>> Using kernel 2.6.12.5 again, here the results:
>
> [snip]
>
> Ok, reads as expected, like the buffered io but using less system time.
> And you are still 1/3 off the target data rate, hmmm...
>
> With the reads, how does the aggregate bandwidth look when you add
> 'clients'? Same as with writes, gradually decreasing per-device
> throughput?
>
I performed the following tests with this command:

    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
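
(78125 x 256KiB is 20,480,000,000 bytes, roughly 19GiB per disk, well over
the 8GB of RAM, so the page cache should not inflate these numbers.)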

Single disk tests:

    /dev/sdc1 74.954715 MB/s
    /dev/sdg1 74.973417 MB/s

Following disks in parallel:

    2 disks on same channel
    /dev/sdc1 75.034191 MB/s
    /dev/sdd1 74.984643 MB/s

    3 disks on same channel
    /dev/sdc1 75.027850 MB/s
    /dev/sdd1 74.976583 MB/s
    /dev/sde1 75.278276 MB/s

    4 disks on same channel
    /dev/sdc1 58.343166 MB/s
    /dev/sdd1 62.993059 MB/s
    /dev/sde1 66.940569 MB/s
    /dev/sdf1 70.986072 MB/s

    2 disks on different channels
    /dev/sdc1 74.954715 MB/s
    /dev/sdg1 74.973417 MB/s

    4 disks on different channels
    /dev/sdc1 74.959030 MB/s
    /dev/sdd1 74.877703 MB/s
    /dev/sdg1 75.009697 MB/s
    /dev/sdh1 75.028138 MB/s

    6 disks on different channels
    /dev/sdc1 49.640743 MB/s
    /dev/sdd1 55.935419 MB/s
    /dev/sde1 58.795241 MB/s
    /dev/sdg1 50.280864 MB/s
    /dev/sdh1 54.210705 MB/s
    /dev/sdi1 59.413176 MB/s

So this looks different from writing; only as of four disks does the
performance begin to drop.

I just noticed: did you want me to do these tests with the oread program?

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 14:24               ` Dr. David Alan Gilbert
@ 2005-08-31 20:56                 ` Holger Kiehl
  2005-08-31 21:16                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 20:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: linux-raid, linux-kernel

On Wed, 31 Aug 2005, Dr. David Alan Gilbert wrote:

> * Holger Kiehl (Holger.Kiehl@dwd.de) wrote:
>> On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>> Full vmstat session can be found under:
>
> Have you got iostat?  iostat -x 10 might be interesting to see
> for a period while it is going.
>
The following is the result from all 8 disks at the same time with the command
dd if=/dev/sd?1 of=/dev/null bs=256k count=78125

There is however one difference, here I had set
/sys/block/sd?/queue/nr_requests to 4096.

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.10    0.00   21.85   58.55   19.50

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.00  0.00  0.30    0.00    2.40     0.00     1.20     8.00     0.00    1.00   1.00   0.03
sdb          0.70   0.00  0.10  0.30    6.40    2.40     3.20     1.20    22.00     0.00    4.25   4.25   0.17
sdc        8276.90   0.00 267.10  0.00 68352.00    0.00 34176.00     0.00   255.90     1.95    7.29   3.74 100.02
sdd        9098.50   0.00 293.50  0.00 75136.00    0.00 37568.00     0.00   256.00     1.93    6.59   3.41 100.03
sde        10428.40   0.00 336.40  0.00 86118.40    0.00 43059.20     0.00   256.00     1.92    5.71   2.97 100.02
sdf        11314.90   0.00 365.10  0.00 93440.00    0.00 46720.00     0.00   255.93     1.92    5.26   2.74  99.98
sdg        7973.20   0.00 257.20  0.00 65843.20    0.00 32921.60     0.00   256.00     1.94    7.53   3.89 100.01
sdh        9436.30   0.00 304.70  0.00 77928.00    0.00 38964.00     0.00   255.75     1.93    6.35   3.28 100.01
sdi        10604.80   0.00 342.40  0.00 87577.60    0.00 43788.80     0.00   255.78     1.92    5.62   2.92 100.02
sdj        10914.30   0.00 352.20  0.00 90132.80    0.00 45066.40     0.00   255.91     1.91    5.43   2.84 100.00
md0          0.00   0.00  0.00  0.10    0.00    0.80     0.00     0.40     8.00     0.00    0.00   0.00   0.00
md2          0.00   0.00  0.80  0.00    6.40    0.00     3.20     0.00     8.00     0.00    0.00   0.00   0.00
md1          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.07    0.00   24.49   66.81    8.62

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.40  0.00  1.00    0.00   11.20     0.00     5.60    11.20     0.00    1.30   0.50   0.05
sdb          0.00   0.40  0.00  1.00    0.00   11.20     0.00     5.60    11.20     0.00    1.50   0.70   0.07
sdc        8161.90   0.00 263.70  0.00 67404.80    0.00 33702.40     0.00   255.61     1.95    7.38   3.79 100.02
sdd        9157.30   0.00 295.50  0.00 75622.40    0.00 37811.20     0.00   255.91     1.93    6.53   3.38 100.00
sde        10505.60   0.00 339.20  0.00 86758.40    0.00 43379.20     0.00   255.77     1.93    5.68   2.95  99.99
sdf        11212.50   0.00 361.90  0.00 92595.20    0.00 46297.60     0.00   255.86     1.91    5.28   2.76 100.00
sdg        7988.40   0.00 258.00  0.00 65971.20    0.00 32985.60     0.00   255.70     1.93    7.49   3.88  99.98
sdh        9436.20   0.00 304.40  0.00 77924.80    0.00 38962.40     0.00   255.99     1.92    6.32   3.28  99.99
sdi        10406.10   0.00 336.30  0.00 85939.20    0.00 42969.60     0.00   255.54     1.92    5.70   2.97 100.00
sdj        11027.00   0.00 356.00  0.00 91064.00    0.00 45532.00     0.00   255.80     1.92    5.40   2.81  99.96
md0          0.00   0.00  0.00  1.00    0.00    8.00     0.00     4.00     8.00     0.00    0.00   0.00   0.00
md2          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.08    0.00   22.23   60.44   17.25

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.00  0.00  0.30    0.00    2.40     0.00     1.20     8.00     0.00    1.00   1.00   0.03
sdb          0.00   0.00  0.00  0.30    0.00    2.40     0.00     1.20     8.00     0.00    0.67   0.67   0.02
sdc        8204.50   0.00 264.76  0.00 67754.15    0.00 33877.08     0.00   255.90     1.95    7.38   3.78 100.12
sdd        9166.47   0.00 295.90  0.00 75698.10    0.00 37849.05     0.00   255.83     1.94    6.55   3.38 100.12
sde        10534.93   0.00 339.94  0.00 86999.00    0.00 43499.50     0.00   255.92     1.93    5.67   2.95 100.12
sdf        11282.68   0.00 364.16  0.00 93174.77    0.00 46587.39     0.00   255.86     1.92    5.28   2.75 100.10
sdg        8114.61   0.00 261.76  0.00 67011.01    0.00 33505.51     0.00   256.00     1.95    7.44   3.82 100.11
sdh        9380.68   0.00 302.60  0.00 77466.27    0.00 38733.13     0.00   256.00     1.93    6.38   3.31 100.10
sdi        10507.01   0.00 339.04  0.00 86768.37    0.00 43384.18     0.00   255.92     1.93    5.69   2.95 100.12
sdj        10969.27   0.00 354.15  0.00 90586.59    0.00 45293.29     0.00   255.78     1.92    5.42   2.83 100.11
md0          0.00   0.00  0.00  0.10    0.00    0.80     0.00     0.40     8.00     0.00    0.00   0.00   0.00
md2          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

The full output can be found at:

    ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/iostat-read-256k

Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 20:56                 ` Holger Kiehl
@ 2005-08-31 21:16                   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2005-08-31 21:16 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

* Holger Kiehl (Holger.Kiehl@dwd.de) wrote:

> There is however one difference, here I had set
> /sys/block/sd?/queue/nr_requests to 4096.

Well, from that it looks like none of the requests get above 255 sectors
(hmm, that's a round number: 256 sectors would be exactly 128k....)

> avg-cpu:  %user   %nice    %sys %iowait   %idle
>            0.10    0.00   21.85   58.55   19.50

Fair amount of system time.

> Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s 
> avgrq-sz avgqu-sz   await  svctm  %util

> sdf        11314.90   0.00 365.10  0.00 93440.00    0.00 46720.00     0.00  
> 255.93     1.92    5.26   2.74  99.98
> sdg        7973.20   0.00 257.20  0.00 65843.20    0.00 32921.60     0.00   
> 256.00     1.94    7.53   3.89 100.01

There seems to be quite a spread of read performance across the drives
(pretty consistent across the run); what makes sdg so much slower than
sdf (they seem to be the slowest and fastest drives respectively)?
I guess if every drive ran at sdf's speed you would be pretty happy.

If you physically swap f and g does the performance follow the drive
or the letter?

Dave
--
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 17:25               ` Nick Piggin
@ 2005-08-31 21:57                 ` Holger Kiehl
  2005-09-01  9:12                   ` Holger Kiehl
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 21:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

On Thu, 1 Sep 2005, Nick Piggin wrote:

> Holger Kiehl wrote:
>
>> meminfo.dump:
>> 
>>    MemTotal:      8124172 kB
>>    MemFree:         23564 kB
>>    Buffers:       7825944 kB
>>    Cached:          19216 kB
>>    SwapCached:          0 kB
>>    Active:          25708 kB
>>    Inactive:      7835548 kB
>>    HighTotal:           0 kB
>>    HighFree:            0 kB
>>    LowTotal:      8124172 kB
>>    LowFree:         23564 kB
>>    SwapTotal:    15631160 kB
>>    SwapFree:     15631160 kB
>>    Dirty:         3145604 kB
>
> Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
> so maybe I've just led you on a goose chase.
>
> You could
>    echo 5 > /proc/sys/vm/dirty_background_ratio
>    echo 10 > /proc/sys/vm/dirty_ratio
>
> To further reduce dirty memory in the system, however this is
> a long shot, so please continue your interaction with the
> other people in the thread first.
>
Yes, this does make a difference. Here are the results of running

   dd if=/dev/full of=/dev/sd?1 bs=4M count=4883

on 8 disks at the same time:

   34.273340
   33.938829
   33.598469
   32.970575
   32.841351
   32.723988
   31.559880
   29.778112

That's 32.710568 MB/s on average per disk with your change; without it,
it was 24.958557 MB/s on average per disk.
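
(For reference, dirty_ratio=40 on 8124172kB of memory allows roughly
3250000kB of dirty pages, and the meminfo dump above showed Dirty at
3145604kB, so the writers really were being throttled right around the
dirty_ratio limit before the change.)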

I will do more tests tomorrow.

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 21:57                 ` Holger Kiehl
@ 2005-09-01  9:12                   ` Holger Kiehl
  2005-09-02 14:28                     ` Al Boldi
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-09-01  9:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Holger Kiehl wrote:

> On Thu, 1 Sep 2005, Nick Piggin wrote:
>
>> Holger Kiehl wrote:
>> 
>>> meminfo.dump:
>>> 
>>>    MemTotal:      8124172 kB
>>>    MemFree:         23564 kB
>>>    Buffers:       7825944 kB
>>>    Cached:          19216 kB
>>>    SwapCached:          0 kB
>>>    Active:          25708 kB
>>>    Inactive:      7835548 kB
>>>    HighTotal:           0 kB
>>>    HighFree:            0 kB
>>>    LowTotal:      8124172 kB
>>>    LowFree:         23564 kB
>>>    SwapTotal:    15631160 kB
>>>    SwapFree:     15631160 kB
>>>    Dirty:         3145604 kB
>> 
>> Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
>> so maybe I've just led you on a goose chase.
>> 
>> You could
>>    echo 5 > /proc/sys/vm/dirty_background_ratio
>>    echo 10 > /proc/sys/vm/dirty_ratio
>> 
>> To further reduce dirty memory in the system, however this is
>> a long shot, so please continue your interaction with the
>> other people in the thread first.
>> 
> Yes, this does make a difference, here the results of running
>
>  dd if=/dev/full of=/dev/sd?1 bs=4M count=4883
>
> on 8 disks at the same time:
>
>  34.273340
>  33.938829
>  33.598469
>  32.970575
>  32.841351
>  32.723988
>  31.559880
>  29.778112
>
> That's 32.710568 MB/s on average per disk with your change and without
> it it was 24.958557 MB/s on average per disk.
>
> I will do more tests tomorrow.
>
Just rechecked those numbers. Did a fresh boot and ran the test several
times. With the defaults (dirty_background_ratio=10, dirty_ratio=40) I get
an average of 24.559491 MB/s per disk for the dd write tests (8 disks in
parallel). With the suggested values (dirty_background_ratio=5,
dirty_ratio=10) I get 32.390659 MB/s per disk.

I then did a SW raid0 over all disks with the following command:

   mdadm -C /dev/md3 -l0 -n8 /dev/sd[cdefghij]1

   (dirty_background_ratio=10, dirty_ratio=40) 223.955995 MB/s
   (dirty_background_ratio=5, dirty_ratio=10)  234.318936 MB/s

So the difference is not so big anymore.

Something else I noticed while doing the dd over 8 disks is the following
(top output just before the dd's finish):

top - 08:39:11 up  2:03,  2 users,  load average: 23.01, 21.48, 15.64
Tasks: 102 total,   2 running, 100 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us, 17.7% sy,  0.0% ni,  0.0% id, 78.9% wa,  0.2% hi,  3.1% si
Mem:   8124184k total,  8093068k used,    31116k free,  7831348k buffers
Swap: 15631160k total,    13352k used, 15617808k free,     5524k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  3423 root      18   0 55204  460  392 R 12.0  0.0   1:15.55 dd
  3421 root      18   0 55204  464  392 D 11.3  0.0   1:17.36 dd
  3418 root      18   0 55204  464  392 D 10.3  0.0   1:10.92 dd
  3416 root      18   0 55200  464  392 D 10.0  0.0   1:09.20 dd
  3420 root      18   0 55204  464  392 D 10.0  0.0   1:10.49 dd
  3422 root      18   0 55200  460  392 D  9.3  0.0   1:13.58 dd
  3417 root      18   0 55204  460  392 D  7.6  0.0   1:13.11 dd
   158 root      15   0     0    0    0 D  1.3  0.0   1:12.61 kswapd3
   159 root      15   0     0    0    0 D  1.3  0.0   1:08.75 kswapd2
   160 root      15   0     0    0    0 D  1.0  0.0   1:07.11 kswapd1
  3419 root      18   0 51096  552  476 D  1.0  0.0   1:17.15 dd
   161 root      15   0     0    0    0 D  0.7  0.0   0:54.46 kswapd0
     1 root      16   0  4876  372  332 S  0.0  0.0   0:01.15 init
     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
     3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
     5 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
     6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
     7 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
     8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
     9 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3

A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd working
so hard? Is that correct?

Please just tell me if there is anything else I can test or dumps that
could be useful.

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-09-01  9:12                   ` Holger Kiehl
@ 2005-09-02 14:28                     ` Al Boldi
  0 siblings, 0 replies; 42+ messages in thread
From: Al Boldi @ 2005-09-02 14:28 UTC (permalink / raw)
  To: Holger Kiehl
  Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel, Nick Piggin

Holger Kiehl wrote:
> top - 08:39:11 up  2:03,  2 users,  load average: 23.01, 21.48, 15.64
> Tasks: 102 total,   2 running, 100 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0% us, 17.7% sy,  0.0% ni,  0.0% id, 78.9% wa,  0.2% hi,  3.1%
> si Mem:   8124184k total,  8093068k used,    31116k free,  7831348k
> buffers Swap: 15631160k total,    13352k used, 15617808k free,     5524k
> cached
>
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   3423 root      18   0 55204  460  392 R 12.0  0.0   1:15.55 dd
>   3421 root      18   0 55204  464  392 D 11.3  0.0   1:17.36 dd
>   3418 root      18   0 55204  464  392 D 10.3  0.0   1:10.92 dd
>   3416 root      18   0 55200  464  392 D 10.0  0.0   1:09.20 dd
>   3420 root      18   0 55204  464  392 D 10.0  0.0   1:10.49 dd
>   3422 root      18   0 55200  460  392 D  9.3  0.0   1:13.58 dd
>   3417 root      18   0 55204  460  392 D  7.6  0.0   1:13.11 dd
>    158 root      15   0     0    0    0 D  1.3  0.0   1:12.61 kswapd3
>    159 root      15   0     0    0    0 D  1.3  0.0   1:08.75 kswapd2
>    160 root      15   0     0    0    0 D  1.0  0.0   1:07.11 kswapd1
>   3419 root      18   0 51096  552  476 D  1.0  0.0   1:17.15 dd
>    161 root      15   0     0    0    0 D  0.7  0.0   0:54.46 kswapd0
>
> A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd
> working so hard? Is that correct?

Actually, kswapd is another problem (see the "Kswapd Flaw" thread), which has
little impact on your problem. Basically, kswapd tries very hard, maybe even
too hard, to fulfil a request for memory: when the buffer/cache pages are full,
kswapd tries to find some more unused memory, and when it finds none it starts
recycling the buffer/cache pages. Which is OK, but it only does this after
searching for swappable memory, which wastes CPU cycles.

This can be tuned a little, but not much, by adjusting /sys(proc)/.../vm/...,
or by renicing kswapd to the lowest priority, which may cause other problems.

Things get really bad when procs start asking for more memory than is
available, causing kswapd to take the liberty of paging out running procs in
the hope that these procs won't come back later. So when they do come back,
something like a wild goose chase begins. This is also known as OverCommit.

This is closely related to the dreaded OOM-killer, which occurs when the
system cannot satisfy a memory request for a returning proc, causing the VM
to start killing in an unpredictable manner.

Turning OverCommit off should solve this problem, but it doesn't.

This is why it is recommended to always run the system with swap enabled, even
if you have tons of memory, which really only pushes the problem out of the
way until you hit the dead end and the wild goose chase begins again.

Sadly, 2.6.13 did not fix this either.

Although this description only vaguely defines the problem from an end-user
pov, the actual semantics may be quite different.

--
Al


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-30 23:05     ` Guy
@ 2005-09-28 20:04       ` Bill Davidsen
  2005-09-30  4:52         ` Guy
  0 siblings, 1 reply; 42+ messages in thread
From: Bill Davidsen @ 2005-09-28 20:04 UTC (permalink / raw)
  To: Guy
  Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
	'linux-kernel'

Guy wrote:

>In most of your results, your CPU usage is very high.  Once you get to about
>90% usage, you really can't do much else, unless you can improve the CPU
>usage.
>
That seems to be one of the problems with software RAID: the calculations are 
done in the CPU and not in dedicated hardware. As you move to top-end drive 
hardware, the CPU gets to be a limit. I don't remember off the top of my head 
how threaded this code is, and whether more CPUs will help.

I see you are using RAID-1 for your system stuff; did one of the tests 
use RAID-0 over all the drives? Mirroring or XOR redundancy helps 
stability but hurts performance. Was the 270MB/s with RAID-0 or ???

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: Where is the performance bottleneck?
  2005-09-28 20:04       ` Bill Davidsen
@ 2005-09-30  4:52         ` Guy
  2005-09-30  5:19           ` dean gaudet
  2005-10-06 21:15           ` Bill Davidsen
  0 siblings, 2 replies; 42+ messages in thread
From: Guy @ 2005-09-30  4:52 UTC (permalink / raw)
  To: 'Bill Davidsen'
  Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
	'linux-kernel'



> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Bill Davidsen
> Sent: Wednesday, September 28, 2005 4:05 PM
> To: Guy
> Cc: 'Holger Kiehl'; 'Mark Hahn'; 'linux-raid'; 'linux-kernel'
> Subject: Re: Where is the performance bottleneck?
> 
> Guy wrote:
> 
> >In most of your results, your CPU usage is very high.  Once you get to
> about
> >90% usage, you really can't do much else, unless you can improve the CPU
> >usage.
> >
> That seems one of the problems with software RAID, the calculations are
> done in the CPU and not dedicated hardware. As you move to the top end
> drive hardware the CPU gets to be a limit. I don't remember off the top
> of my head how threaded this code is, and if more CPUs will help.

My old 500MHz P3 can xor at 1GB/sec.  I don't think the RAID5 logic is the
issue!  Also, I have not seen hardware that fast!  Or even half as fast.
But I must admit, I have not seen a hardware RAID5 in a few years.  :(

   8regs     :   918.000 MB/sec
   32regs    :   469.600 MB/sec
   pIII_sse  :   994.800 MB/sec
   pII_mmx   :  1102.400 MB/sec
   p5_mmx    :  1152.800 MB/sec
raid5: using function: pIII_sse (994.800 MB/sec)

Humm..  It did not select the fastest?

Guy
> 
> I see you are using RAID-1 for your system stuff, did one of the tests
> use RAID-0 over all the drives? Mirroring or XOR redundancy help
> stability but hurt performance. Was the 270MB/s with RAID-0 or ???
> 
> --
> bill davidsen <davidsen@tmr.com>
>   CTO TMR Associates, Inc
>   Doing interesting things with small computers since 1979
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: Where is the performance bottleneck?
  2005-09-30  4:52         ` Guy
@ 2005-09-30  5:19           ` dean gaudet
  2005-10-06 21:15           ` Bill Davidsen
  1 sibling, 0 replies; 42+ messages in thread
From: dean gaudet @ 2005-09-30  5:19 UTC (permalink / raw)
  To: Guy
  Cc: 'Bill Davidsen', 'Holger Kiehl',
	'Mark Hahn', 'linux-raid', 'linux-kernel'

On Fri, 30 Sep 2005, Guy wrote:

> My old 500MHz P3 can xor at 1GB/sec.  I don't think the RAID5 logic is the
> issue!  Also, I have not seen hardware that fast!  Or even half as fast.
> But I must admit, I have not seen a hardware RAID5 in a few years.  :(
> 
>    8regs     :   918.000 MB/sec
>    32regs    :   469.600 MB/sec
>    pIII_sse  :   994.800 MB/sec
>    pII_mmx   :  1102.400 MB/sec
>    p5_mmx    :  1152.800 MB/sec
> raid5: using function: pIII_sse (994.800 MB/sec)

those are cache based timings... an old 500mhz p3 probably has pc100 
memory and main memory can't even go that fast.  in fact i've got one of 
those here and it's lucky to get 600MB/s out of memory.
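(for reference, PC100 SDRAM is a 64-bit bus at 100MHz, i.e. 800MB/s
theoretical peak, so ~600MB/s of real streaming bandwidth is about what
you'd expect.)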

in fact, to compare sw raid to a hw raid you should count every byte of 
i/o somewhere between 2 and 3 times.  this is because every line you read 
into cache might knock out a dirty line, but it's definitely going to 
replace something which would still be there on a hw raid.  (i.e. it 
decreases the cache effectiveness and you end up paying later after the sw 
raid xor to read data back in which wouldn't leave the cache on a hw 
raid.)

then add in the read/write traffic required on the parity block (which as 
a fraction of i/o is worse with fewer drives) ... and it's pretty crazy to 
believe that sw raid is "free" just because the kernel prints those 
fantastic numbers at boot :)


> Humm..  It did not select the fastest?

this is related to what i'm describing -- iirc the pIII_sse code uses a 
non-temporal store and/or prefetchnta to reduce memory traffic.
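
for illustration, here is a minimal user-space sketch of that idea using SSE2
intrinsics (compile with gcc -msse2) -- a toy under stated assumptions
(16-byte-aligned buffers, length a multiple of 16), not the kernel's actual
pIII_sse assembly:

#include <stdio.h>
#include <stddef.h>
#include <emmintrin.h>	/* SSE2 intrinsics */

/* xor src into dst, 16 bytes at a time, writing the result with a
 * non-temporal (streaming) store so the output does not displace other
 * cache lines -- the same kind of trick used to cut down cache traffic. */
static void xor_stream(unsigned char *dst, const unsigned char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i += 16) {
		__m128i a = _mm_load_si128((const __m128i *)(dst + i));
		__m128i b = _mm_load_si128((const __m128i *)(src + i));

		_mm_stream_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
	}
	_mm_sfence();	/* make the streaming stores globally visible */
}

int main(void)
{
	static unsigned char a[64] __attribute__((aligned(16)));
	static unsigned char b[64] __attribute__((aligned(16)));
	size_t i;

	for (i = 0; i < 64; i++) {
		a[i] = (unsigned char) i;
		b[i] = 0xff;
	}
	xor_stream(a, b, sizeof(a));
	printf("a[0] after xor = 0x%02x\n", a[0]);	/* 0x00 ^ 0xff = 0xff */
	return 0;
}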

-dean

p.s. i use sw raid regardless, i just don't like seeing these misleading 
discussions pointing at the kernel raid timings and saying "hw offload is 
pointless!"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-09-30  4:52         ` Guy
  2005-09-30  5:19           ` dean gaudet
@ 2005-10-06 21:15           ` Bill Davidsen
  1 sibling, 0 replies; 42+ messages in thread
From: Bill Davidsen @ 2005-10-06 21:15 UTC (permalink / raw)
  To: Guy
  Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
	'linux-kernel'

Guy wrote:

>  
>
>>-----Original Message-----
>>From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>>owner@vger.kernel.org] On Behalf Of Bill Davidsen
>>Sent: Wednesday, September 28, 2005 4:05 PM
>>To: Guy
>>Cc: 'Holger Kiehl'; 'Mark Hahn'; 'linux-raid'; 'linux-kernel'
>>Subject: Re: Where is the performance bottleneck?
>>
>>Guy wrote:
>>
>>    
>>
>>>In most of your results, your CPU usage is very high.  Once you get to
>>>      
>>>
>>about
>>    
>>
>>>90% usage, you really can't do much else, unless you can improve the CPU
>>>usage.
>>>
>>>      
>>>
>>That seems one of the problems with software RAID, the calculations are
>>done in the CPU and not dedicated hardware. As you move to the top end
>>drive hardware the CPU gets to be a limit. I don't remember off the top
>>of my head how threaded this code is, and if more CPUs will help.
>>    
>>
>
>My old 500MHz P3 can xor at 1GB/sec.  I don't think the RAID5 logic is the
>issue!  Also, I have not seen hardware that fast!  Or even half as fast.
>But I must admit, I have not seen a hardware RAID5 in a few years.  :(
>
>   8regs     :   918.000 MB/sec
>   32regs    :   469.600 MB/sec
>   pIII_sse  :   994.800 MB/sec
>   pII_mmx   :  1102.400 MB/sec
>   p5_mmx    :  1152.800 MB/sec
>raid5: using function: pIII_sse (994.800 MB/sec)
>
>Humm..  It did not select the fastest?
>
Maybe. There was discussion on this previously, but the decision was 
made to use sse when available because it is nicer to the cache, or uses 
fewer registers, or similar. In any case, fewer undesirable side effects.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2005-10-06 21:12 UTC | newest]

Thread overview: 42+ messages
2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
2005-08-29 19:54 ` Mark Hahn
2005-08-30 19:08   ` Holger Kiehl
2005-08-30 23:05     ` Guy
2005-09-28 20:04       ` Bill Davidsen
2005-09-30  4:52         ` Guy
2005-09-30  5:19           ` dean gaudet
2005-10-06 21:15           ` Bill Davidsen
2005-08-29 20:10 ` Al Boldi
2005-08-30 19:18   ` Holger Kiehl
2005-08-31 10:30     ` Al Boldi
2005-08-29 20:25 ` Vojtech Pavlik
2005-08-30 20:06   ` Holger Kiehl
2005-08-31  7:11     ` Vojtech Pavlik
2005-08-31  7:26       ` Jens Axboe
2005-08-31 11:54         ` Holger Kiehl
2005-08-31 12:07           ` Jens Axboe
2005-08-31 13:55             ` Holger Kiehl
2005-08-31 14:24               ` Dr. David Alan Gilbert
2005-08-31 20:56                 ` Holger Kiehl
2005-08-31 21:16                   ` Dr. David Alan Gilbert
2005-08-31 16:20               ` Jens Axboe
2005-08-31 15:16                 ` jmerkey
2005-08-31 16:58                   ` Tom Callahan
2005-08-31 15:47                     ` jmerkey
2005-08-31 17:11                   ` Jens Axboe
2005-08-31 15:59                     ` jmerkey
2005-08-31 17:32                       ` Jens Axboe
2005-08-31 16:51                 ` Holger Kiehl
2005-08-31 17:35                   ` Jens Axboe
2005-08-31 19:00                     ` Holger Kiehl
2005-08-31 18:06                   ` Michael Tokarev
2005-08-31 18:52                     ` Ming Zhang
2005-08-31 18:57                       ` Ming Zhang
2005-08-31 12:24           ` Nick Piggin
2005-08-31 16:25             ` Holger Kiehl
2005-08-31 17:25               ` Nick Piggin
2005-08-31 21:57                 ` Holger Kiehl
2005-09-01  9:12                   ` Holger Kiehl
2005-09-02 14:28                     ` Al Boldi
2005-08-31 13:38       ` Holger Kiehl
2005-08-29 23:09 ` Peter Chubb
