* Where is the performance bottleneck?
@ 2005-08-29 18:20 Holger Kiehl
  2005-08-29 19:54 ` Mark Hahn
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-29 18:20 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 8112 bytes --]

Hello

I have a system with the following setup:

     Board is Tyan S4882 with AMD 8131 Chipset
     4 Opterons 848 (2.2GHz)
     8 GB DDR400 Ram (2GB for each CPU)
     1 onboard Symbios Logic 53c1030 dual channel U320 controller
     2 SATA disks put together as a SW Raid1 for system, swap and spares
     8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
       are on one channel and the other four (sdg, sdh, sdi, sdj) on
       the other channel.

The U320 SCSI controller has a 64 bit PCI-X bus to itself; there is no other
device on that bus. Unfortunately I was unable to determine at what speed
it is running; here is the output from lspci -vv:

02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
         Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
         Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
         Interrupt: pin A routed to IRQ 217
         Region 0: I/O ports at 3000 [size=256]
         Region 1: Memory at fe010000 (64-bit, non-prefetchable) [size=64K]
         Region 3: Memory at fe000000 (64-bit, non-prefetchable) [size=64K]
         Capabilities: [50] Power Management version 2
                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
                 Address: 0000000000000000  Data: 0000
         Capabilities: [68] PCI-X non-bridge device.
                 Command: DPERE- ERO- RBC=2 OST=0
                 Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,

02:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
         Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
         Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
         Interrupt: pin B routed to IRQ 225
         Region 0: I/O ports at 3400 [size=256]
         Region 1: Memory at fe030000 (64-bit, non-prefetchable) [size=64K]
         Region 3: Memory at fe020000 (64-bit, non-prefetchable) [size=64K]
         Capabilities: [50] Power Management version 2
                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
                 Address: 0000000000000000  Data: 0000
         Capabilities: [68] PCI-X non-bridge device.
                 Command: DPERE- ERO- RBC=2 OST=0
                 Status: Bus=2 Dev=4 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple,

How does one determine the PCI-X bus speed?

Anyway, I thought that with this system I would theoretically get 640 MB/s using
both channels. I tested several software raid setups to get the best possible
write speeds for this system. But testing shows that the absolute maximum I
can reach with software raid is only approx. 270 MB/s for writing, which is
very disappointing.

The tests were done with the 2.6.12.5 kernel from kernel.org, the scheduler is
deadline and the distribution is Fedora Core 4 x86_64 with all updates. The
chunksize is always the default from mdadm (64k). The filesystem was always
created with the command mke2fs -j -b4096 -O dir_index /dev/mdx.

I have also tried 2.6.13-rc7, but there the speed was much lower; the
maximum was approx. 140 MB/s for writing.

Here are some tests I did and the results from bonnie++:

Version  1.03        ------Sequential Output------ --Sequential Input- --Random-
                      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29 830.2   1
Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24 659.8   1
Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22 666.3   1
Raid0p(4 disk)15744M 54017  96 149897 57 60202  15 59048  96 156887 20 381.8   1
Raid0p(4 disk)15744M 54771  98 156129 59 54130  14 58941  97 157543 20 520.3   1
Raid1+0       15744M 52496  94 202497 77 55928  14 60150  98 270509 34 930.2   1
Raid0+1       15744M 53927  95 194492 66 53430  15 49590  83 174313 30 884.7   1
Raid5 (8 disk)15744M 55881  98 153735 51 61680  24 56229  95 207348 44 741.2   1
Raid5s(4 disk)15744M 55238  98 81023  28 36859  14 56358  95 193030 38 605.7   1
Raid5s(4 disk)15744M 54920  97 83680  29 36551  14 56917  95 185345 35 599.8   1
Raid5p(4 disk)15744M 53681  95 54517  20 44932  17 54808  93 172216 33 371.1   1
Raid5p(4 disk)15744M 53856  96 55901  21 34737  13 55810  94 181825 36 607.7   1
/dev/sdc      15744M 53861  95 102270 35 25718   6 37273  60 76275   8 377.0   0
/dev/sdd      15744M 53575  95 96846  36 26209   6 37248  60 76197   9 378.4   0
/dev/sde      15744M 54398  94 87937  28 25540   6 36476  59 76520   8 380.4   0
/dev/sdf      15744M 53982  95 109192 38 26136   6 38516  63 76277   9 383.0   0
/dev/sdg      15744M 53880  95 102625 36 26458   6 37926  61 76538   9 399.1   0
/dev/sdh      15744M 53326  95 106447 39 26570   6 38129  62 76427   9 384.3   0
/dev/sdi      15744M 53103  94 96976  33 25632   6 36748  59 76658   8 386.4   0
/dev/sdj      15744M 53840  95 105521 39 26251   6 37146  60 76097   9 384.8   0

Raid1+0        - Four raid1's, where the two disks of each raid1 hang on
                  different channels. The setup was done as follows:
                              Raid1 /dev/md3 (sdc + sdg)
                              Raid1 /dev/md4 (sdd + sdh)
                              Raid1 /dev/md5 (sde + sdi)
                              Raid1 /dev/md6 (sdf + sdj)
                              Raid0 /dev/md7 (md3 + md4 + md5 + md6)
Raid0+1        - Raid1 over two raid0 each having four disks:
                              Raid0 /dev/md3 (sdc + sdd + sde + sdf)
                              Raid0 /dev/md4 (sdg + sdh + sdi + sdj)
                              Raid1 /dev/md5 (md3 + md4)
Raid0s(4 disk) - Consists of Raid0 /dev/md3 sdc + sdd + sde + sdf or
                  Raid0 /dev/md4 sdg + sdh + sdi + sdj and the tests were
                  done separately, once for md3 and then for md4.
Raid0p(4 disk) - Same as Raid0s(4 disk) only the tests for md3 and md4 were
                  done at the same time (in parallel).
Raid5s(4 disk) - Same as Raid0s(4 disk) only with Raid5.
Raid5p(4 disk) - Same as Raid0p(4 disk) only with Raid5.

Additional tests were done with a little C program (attached to this mail)
that I wrote a long time ago. It measures the time it takes to write a file
of the given size; the first result is without fsync() and the second with
fsync(). It takes two parameters: the first is the file size in kilobytes
and the second the blocksize in bytes. The program was always started as
follows:

          fw 16121856 4096

I chose 4096 as the blocksize since this is the value suggested by stat()'s
st_blksize. With larger values the transfer rate increases.
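
For reference, a minimal sketch (not part of the original tests) of how that
hint can be queried; the file name argument is just a placeholder:

#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char *argv[])
{
   struct stat sb;

   /* Print the preferred I/O blocksize stat() reports for the given file. */
   if ((argc != 2) || (stat(argv[1], &sb) == -1))
   {
      perror("stat");
      return 1;
   }
   (void)printf("st_blksize = %ld bytes\n", (long)sb.st_blksize);

   return 0;
}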

Here are the results in MB/s:
Raid0 (8 disk) 203.017 191.649
Raid0s(4 disk) 200.331 166.129
Raid0s(4 disk) 198.013 165.465
Raid0p(4 disk) 143.781 118.832
Raid0p(4 disk) 146.592 117.703
Raid0+1        206.046 118.670
Raid5 (8 disk) 181.382 115.037
/dev/sdc        94.439  56.928
/dev/sdd        89.838  55.711
/dev/sde        84.391  51.545
/dev/sdf        87.549  57.368
/dev/sdg        92.847  57.799
/dev/sdh        94.615  58.678
/dev/sdi        89.030  54.945
/dev/sdj        91.344  56.899

Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the
bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
three times those numbers if you take the numbers from the individual disks.

What limit am I hitting here?

Thanks,
Holger
-- 

[-- Attachment #2: Type: TEXT/PLAIN, Size: 3993 bytes --]

/*****************************************************************************/
/*                            File Write Performance                         */
/*                            ======================                         */
/*****************************************************************************/

#include <stdio.h>      /* printf()                                          */
#include <string.h>     /* strcmp()                                          */
#include <stdlib.h>     /* exit(), atoi(), calloc(), free()                  */
#include <unistd.h>     /* write(), sysconf(), close(), fsync()              */
#include <sys/times.h>  /* times(), struct tms                               */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdarg.h>

#define MAXLINE             4096
#define BUFSIZE             512
#define DEFAULT_FILE_SIZE   31457280
#define TEST_FILE           "test.file"
#define FILE_MODE           (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)


static void err_doit(int, char *, va_list),
            err_quit(char *, ...),
            err_sys(char *, ...);


/*############################### main() ####################################*/
int
main(int argc, char *argv[])
{
   register int n,
               loops,
               rest;
   int         fd,
               oflag,
               blocksize = BUFSIZE;
   off_t       filesize = DEFAULT_FILE_SIZE;
   clock_t     start,
               end,
               syncend;
   long        clktck;
   char        *buf;
   struct tms  tmsdummy;

   if ((argc > 1) && (argc < 5))
   {
      filesize = (off_t)atoi(argv[1]) * 1024;
      if (argc == 3)
         blocksize = atoi(argv[2]);
      else  if (argc == 4)
               err_quit("Usage: %s [filesize] [blocksize]");
   }
   else  if (argc != 1)
            err_quit("Usage: %s [filesize] [blocksize]", argv[0]);

   if ((clktck = sysconf(_SC_CLK_TCK)) < 0)
      err_sys("sysconf error");

   /* If clktck=0 it doesn't make sense to run the test */
   if (clktck == 0)
   {
      (void)printf("0\n");
      exit(0);
   }

   if ((buf = calloc(blocksize, sizeof(char))) == NULL)
      err_sys("calloc error");

   for (n = 0; n < blocksize; n++)
      buf[n] = 'T';

   loops = filesize / blocksize;
   rest = filesize % blocksize;

   oflag = O_WRONLY | O_CREAT;

   if ((fd = open(TEST_FILE, oflag, FILE_MODE)) < 0)
      err_quit("Could not open %s", TEST_FILE);

   if ((start = times(&tmsdummy)) == -1)
      err_sys("Could not get start time");

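   /* Timed phase: write the whole file in blocksize chunks. */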
   for (n = 0; n < loops; n++)
      if (write(fd, buf, blocksize) != blocksize)
            err_sys("write error");
   if (rest > 0)
      if (write(fd, buf, rest) != rest)
            err_sys("write error");

   if ((end = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");

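   /* The second reported rate includes the fsync(), i.e. the time until
      the data has actually been flushed to disk. */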
   (void)fsync(fd);

   if ((syncend = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");

   (void)close(fd);
   free(buf);

   (void)printf("%f %f\n", (double)filesize / ((double)(end - start) / (double)clktck),
                           (double)filesize / ((double)(syncend - start) / (double)clktck));

   exit(0);
}


static void
err_sys(char *fmt, ...)
{
   va_list  ap;

   va_start(ap, fmt);
   err_doit(1, fmt, ap);
   va_end(ap);
   exit(1);
}


static void
err_quit(char *fmt, ...)
{
   va_list  ap;

   va_start(ap, fmt);
   err_doit(0, fmt, ap);
   va_end(ap);
   exit(1);
}


static void
err_doit(int errnoflag, char *fmt, va_list ap)
{
   int   errno_save;
   char  buf[MAXLINE];

   errno_save = errno;
   (void)vsprintf(buf, fmt, ap);
   if (errnoflag)
      (void)sprintf(buf+strlen(buf), ": %s", strerror(errno_save));
   (void)strcat(buf, "\n");
   fflush(stdout);
   (void)fputs(buf, stderr);
   fflush(NULL);     /* Flushes all stdio output streams */
   return;
}


* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
@ 2005-08-29 19:54 ` Mark Hahn
  2005-08-30 19:08   ` Holger Kiehl
  2005-08-29 20:10 ` Al Boldi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Mark Hahn @ 2005-08-29 19:54 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

>      8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)

figure each is worth, say, 60 MB/s, so you'll peak (theoretically) at 
240 MB/s per channel.

> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other
> device on that bus. Unfortunatly I was unable to determine at what speed
> it is running, here the output from lspci -vv:
...
>                  Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,

the "133MHz+" is a good sign.  OTOH the latency (72) seems rather low - my
understanding is that that would noticeably limit the size of burst transfers.

> Anyway, I thought with this system I would get theoretically 640 MB/s using
> both channels.

"theoretically" in the same sense as "according to quantum theory,
Bush and BinLadin may swap bodies tomorrow morning at 4:59."

> write speeds for this system. But testing shows that the absolute maximum I
> can reach with software raid is only approx. 270 MB/s for writting. Which is
> very disappointing.

it's a bit low, but "very" is unrealistic...

> deadline and distribution is fedora core 4 x86_64 with all updates. Chunksize
> is always the default from mdadm (64k). Filesystem was always created with the
> command mke2fs -j -b4096 -O dir_index /dev/mdx.

bear in mind that a 64k chunksize means that an 8 disk raid5 will really
only work well for writes that are multiples of 7*64=448K...

> I also have tried with 2.6.13-rc7, but here the speed was much lower, the
> maximum there was approx. 140 MB/s for writting.

hmm, there should not have been any such dramatic slowdown.

> Version  1.03        ------Sequential Output------ --Sequential Input- --Random-
>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29 830.2   1
> Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24 659.8   1
> Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22 666.3   1

you're obviously saturating something already with 2 disks.  did you play
with "blockdev --setra" setings?

> Raid5 (8 disk)15744M 55881  98 153735 51 61680  24 56229  95 207348 44 741.2   1
> Raid5s(4 disk)15744M 55238  98 81023  28 36859  14 56358  95 193030 38 605.7   1
> Raid5s(4 disk)15744M 54920  97 83680  29 36551  14 56917  95 185345 35 599.8   1

the block-read shows that even with 3 disks, you're hitting ~190 MB/s,
which is pretty close to your actual disk speed.  the low value for block-out
is probably just due to non-stripe writes needing R/M/W cycles.
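
(For illustration only, not from the original mail: a partial-stripe write
has to read the old chunk and the old parity, recompute, and write both back,
so one chunk write costs two reads plus two writes. A minimal sketch of the
parity update:)

#include <stddef.h>

/* new parity = old parity XOR old data XOR new data */
void
raid5_rmw_parity(unsigned char *parity, const unsigned char *old_data,
                 const unsigned char *new_data, size_t len)
{
   size_t i;

   for (i = 0; i < len; i++)
      parity[i] ^= old_data[i] ^ new_data[i];
}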

> /dev/sdc      15744M 53861  95 102270 35 25718   6 37273  60 76275   8 377.0   0

the block-out is clearly distorted by buffer-cache (too high), but the 
input rate is good and consistent.  obviously, it'll fall off somewhat 
towards inner tracks, but will probably still be above 50.

> Why do I only get 247 MB/s for writting and 227 MB/s for reading (from the
> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
> three times those numbers if you take the numbers from the individual disks.

expecting 3x is unreasonable; 2x (480 or so) would be good.

I suspect that some (sw kernel) components are badly tuned for fast IO.
obviously, most machines are in the 50-100 MB/s range, so this is not
surprising.  readahead is certainly one, but there are also magic numbers
in MD as well, not to mention PCI latency, scsi driver tuning, probably
even /proc/sys/vm settings.
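
(For reference, a minimal sketch of the per-device readahead knob that
"blockdev --setra" adjusts, via the BLKRAGET/BLKRASET ioctls from
<linux/fs.h>; values are in 512-byte sectors, setting requires root, and
/dev/md3 is just an example device, not taken from this thread.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int
main(void)
{
   int  fd;
   long ra;

   if ((fd = open("/dev/md3", O_RDONLY)) < 0)
      return 1;
   if (ioctl(fd, BLKRAGET, &ra) == 0)            /* current readahead */
      (void)printf("readahead: %ld sectors\n", ra);
   if (ioctl(fd, BLKRASET, 2048UL) != 0)         /* 2048 sectors = 1 MB */
      perror("BLKRASET");
   (void)close(fd);

   return 0;
}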

I've got some 4x2.6G opteron servers (same board, 32G PC3200), but alas,
end-users have found out about them.  not to mention that they only have 
3x160G SATA disks...

regards, mark hahn.



* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
  2005-08-29 19:54 ` Mark Hahn
@ 2005-08-29 20:10 ` Al Boldi
  2005-08-30 19:18   ` Holger Kiehl
  2005-08-29 20:25 ` Vojtech Pavlik
  2005-08-29 23:09 ` Peter Chubb
  3 siblings, 1 reply; 42+ messages in thread
From: Al Boldi @ 2005-08-29 20:10 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-kernel, linux-raid

Holger Kiehl wrote:
> Why do I only get 247 MB/s for writting and 227 MB/s for reading (from the
> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
> three times those numbers if you take the numbers from the individual
> disks.
>
> What limit am I hitting here?

You may be hitting a 2.6 kernel bug, which has something to do with 
readahead, ask Jens Axboe about it! (see "[git patches] IDE update" thread)
Sadly, 2.6.13 did not fix it either.

Did you try 2.4.31?

--
Al


* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
  2005-08-29 19:54 ` Mark Hahn
  2005-08-29 20:10 ` Al Boldi
@ 2005-08-29 20:25 ` Vojtech Pavlik
  2005-08-30 20:06   ` Holger Kiehl
  2005-08-29 23:09 ` Peter Chubb
  3 siblings, 1 reply; 42+ messages in thread
From: Vojtech Pavlik @ 2005-08-29 20:25 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-kernel

On Mon, Aug 29, 2005 at 06:20:56PM +0000, Holger Kiehl wrote:
> Hello
> 
> I have a system with the following setup:
> 
>     Board is Tyan S4882 with AMD 8131 Chipset
>     4 Opterons 848 (2.2GHz)
>     8 GB DDR400 Ram (2GB for each CPU)
>     1 onboard Symbios Logic 53c1030 dual channel U320 controller
>     2 SATA disks put together as a SW Raid1 for system, swap and spares
>     8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
>       are on one channel and the other four (sdg, sdh, sdi, sdj) on
>       the other channel.
> 
> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is
> no other device on that bus. Unfortunatly I was unable to determine at
> what speed it is running, here the output from lspci -vv:

> How does one determine the PCI-X bus speed?

Usually only the card (in your case the Symbios SCSI controller) can
tell. If it does, it'll be most likely in 'dmesg'.

> Anyway, I thought with this system I would get theoretically 640 MB/s using
> both channels.

You can never use the full theoretical bandwidth of the channel for
data. A lot of overhead remains for other signalling. Similarly for PCI.

> I tested several software raid setups to get the best possible write
> speeds for this system. But testing shows that the absolute maximum I
> can reach with software raid is only approx. 270 MB/s for writting.
> Which is very disappointing.

I'd expect somewhat better (in the 300-400 MB/s range), but this is not
too bad.

To find where the bottleneck is, I'd suggest trying without the
filesystem at all, and just filling a large part of the block device
using the 'dd' command.

Also, trying without the RAID, and just running 4 (and 8) concurrent
dd's to the separate drives could show whether it's the RAID that's
slowing things down. 

> The tests where done with 2.6.12.5 kernel from kernel.org, scheduler
> is the deadline and distribution is fedora core 4 x86_64 with all
> updates.  Chunksize is always the default from mdadm (64k). Filesystem
> was always created with the command mke2fs -j -b4096 -O dir_index
> /dev/mdx.
> 
> I also have tried with 2.6.13-rc7, but here the speed was much lower,
> the maximum there was approx. 140 MB/s for writting.

Now that's very low.

-- 
Vojtech Pavlik
SuSE Labs, SuSE CR


* Re: Where is the performance bottleneck?
  2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
                   ` (2 preceding siblings ...)
  2005-08-29 20:25 ` Vojtech Pavlik
@ 2005-08-29 23:09 ` Peter Chubb
  3 siblings, 0 replies; 42+ messages in thread
From: Peter Chubb @ 2005-08-29 23:09 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

>>>>> "Holger" == Holger Kiehl <Holger.Kiehl@dwd.de> writes:

Holger> Hello I have a system with the following setup:

	(4-way CPUs, 8 spindles on two controllers)

Try using XFS.

See http://scalability.gelato.org/DiskScalability_2fResults --- ext3
is single threaded and tends not to get the full benefit of either the
multiple spindles or the multiple processors.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


* Re: Where is the performance bottleneck?
  2005-08-29 19:54 ` Mark Hahn
@ 2005-08-30 19:08   ` Holger Kiehl
  2005-08-30 23:05     ` Guy
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 19:08 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid, linux-kernel

On Mon, 29 Aug 2005, Mark Hahn wrote:

>> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no other
>> device on that bus. Unfortunatly I was unable to determine at what speed
>> it is running, here the output from lspci -vv:
> ...
>>                  Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
>
> the "133MHz+" is a good sign.  OTOH the latency (72) seems rather low - my
> understanding is that that would noticably limit the size of burst transfers.
>
I have tried with 128 and 144, but the transfer rate is only a little
bit higher, barely measurable. Or what values should I try?

>
>> Version  1.03        ------Sequential Output------ --Sequential Input- --Random-
>>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>> Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
>> Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29 830.2   1
>> Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24 659.8   1
>> Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22 666.3   1
>
> you're obviously saturating something already with 2 disks.  did you play
> with "blockdev --setra" setings?
>
Yes, I did play a little bit with it, but this only changed read performance;
it made no measurable difference when writing.

Thanks,
Holger



* Re: Where is the performance bottleneck?
  2005-08-29 20:10 ` Al Boldi
@ 2005-08-30 19:18   ` Holger Kiehl
  2005-08-31 10:30     ` Al Boldi
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 19:18 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel, linux-raid

On Mon, 29 Aug 2005, Al Boldi wrote:

> Holger Kiehl wrote:
>> Why do I only get 247 MB/s for writting and 227 MB/s for reading (from the
>> bonnie++ results) for a Raid0 over 8 disks? I was expecting to get nearly
>> three times those numbers if you take the numbers from the individual
>> disks.
>>
>> What limit am I hitting here?
>
> You may be hitting a 2.6 kernel bug, which has something to do with
> readahead, ask Jens Axboe about it! (see "[git patches] IDE update" thread)
> Sadly, 2.6.13 did not fix it either.
>
I did read that thread, but due to my limited understanding of kernel
code, I don't see the relation to my problem.

But I am willing to try any patches to solve the problem.

> Did you try 2.4.31?
>
No. Will give this a try if the problem is not found.

Thanks,
Holger



* Re: Where is the performance bottleneck?
  2005-08-29 20:25 ` Vojtech Pavlik
@ 2005-08-30 20:06   ` Holger Kiehl
  2005-08-31  7:11     ` Vojtech Pavlik
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-30 20:06 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: linux-raid, linux-kernel

On Mon, 29 Aug 2005, Vojtech Pavlik wrote:

> On Mon, Aug 29, 2005 at 06:20:56PM +0000, Holger Kiehl wrote:
>> Hello
>>
>> I have a system with the following setup:
>>
>>     Board is Tyan S4882 with AMD 8131 Chipset
>>     4 Opterons 848 (2.2GHz)
>>     8 GB DDR400 Ram (2GB for each CPU)
>>     1 onboard Symbios Logic 53c1030 dual channel U320 controller
>>     2 SATA disks put together as a SW Raid1 for system, swap and spares
>>     8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
>>       are on one channel and the other four (sdg, sdh, sdi, sdj) on
>>       the other channel.
>>
>> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is
>> no other device on that bus. Unfortunatly I was unable to determine at
>> what speed it is running, here the output from lspci -vv:
>
>> How does one determine the PCI-X bus speed?
>
> Usually only the card (in your case the Symbios SCSI controller) can
> tell. If it does, it'll be most likely in 'dmesg'.
>
There is nothing in dmesg:

    Fusion MPT base driver 3.01.20
    Copyright (c) 1999-2004 LSI Logic Corporation
    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
    mptbase: Initiating ioc0 bringup
    ioc0: 53C1030: Capabilities={Initiator,Target}
    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
    mptbase: Initiating ioc1 bringup
    ioc1: 53C1030: Capabilities={Initiator,Target}
    Fusion MPT SCSI Host driver 3.01.20

>> Anyway, I thought with this system I would get theoretically 640 MB/s using
>> both channels.
>
> You can never use the full theoretical bandwidth of the channel for
> data. A lot of overhead remains for other signalling. Similarly for PCI.
>
>> I tested several software raid setups to get the best possible write
>> speeds for this system. But testing shows that the absolute maximum I
>> can reach with software raid is only approx. 270 MB/s for writting.
>> Which is very disappointing.
>
> I'd expect somewhat better (in the 300-400 MB/s range), but this is not
> too bad.
>
> To find where the bottleneck is, I'd suggest trying without the
> filesystem at all, and just filling a large part of the block device
> using the 'dd' command.
>
> Also, trying without the RAID, and just running 4 (and 8) concurrent
> dd's to the separate drives could show whether it's the RAID that's
> slowing things down.
>
Ok, I did run the following dd command in different combinations:

    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000

Here are the results:

    Each disk alone
    /dev/sdc1 59.094636 MB/s
    /dev/sdd1 58.686592 MB/s
    /dev/sde1 55.282807 MB/s
    /dev/sdf1 62.271240 MB/s
    /dev/sdg1 60.872891 MB/s
    /dev/sdh1 62.252781 MB/s
    /dev/sdi1 59.145637 MB/s
    /dev/sdj1 60.921119 MB/s

    sdc + sdd in parallel (2 disks on same channel)
    /dev/sdc1 42.512287 MB/s
    /dev/sdd1 43.118483 MB/s

    sdc + sdg in parallel (2 disks on different channels)
    /dev/sdc1 42.938186 MB/s
    /dev/sdg1 43.934779 MB/s

    sdc + sdd + sde in parallel (3 disks on same channel)
    /dev/sdc1 35.043501 MB/s
    /dev/sdd1 35.686878 MB/s
    /dev/sde1 34.580457 MB/s

    Similar results for three disks (sdg + sdh + sdi) on the other channel
    /dev/sdg1 36.381137 MB/s
    /dev/sdh1 37.541758 MB/s
    /dev/sdi1 35.834920 MB/s

    sdc + sdd + sde + sdf in parallel (4 disks on same channel)
    /dev/sdc1 31.432914 MB/s
    /dev/sdd1 32.058752 MB/s
    /dev/sde1 31.393455 MB/s
    /dev/sdf1 33.208165 MB/s

    And here for the four disks on the other channel
    /dev/sdg1 31.873028 MB/s
    /dev/sdh1 33.277193 MB/s
    /dev/sdi1 31.910000 MB/s
    /dev/sdj1 32.626744 MB/s

    All 8 disks in parallel
    /dev/sdc1 24.120545 MB/s
    /dev/sdd1 24.419801 MB/s
    /dev/sde1 24.296588 MB/s
    /dev/sdf1 25.609548 MB/s
    /dev/sdg1 24.572617 MB/s
    /dev/sdh1 25.552590 MB/s
    /dev/sdi1 24.575616 MB/s
    /dev/sdj1 25.124165 MB/s

So from these results, I may assume that md is not the cause of the problem.

What comes as a big surprise is that I lose 25% performance with only
two disks, each hanging on its own channel!

Is this normal? I wonder if other people have the same problem with
other controllers, or with this same one.

What can I do next to find out if this is a kernel, driver or hardware
problem?

Thanks,
Holger



* RE: Where is the performance bottleneck?
  2005-08-30 19:08   ` Holger Kiehl
@ 2005-08-30 23:05     ` Guy
  2005-09-28 20:04       ` Bill Davidsen
  0 siblings, 1 reply; 42+ messages in thread
From: Guy @ 2005-08-30 23:05 UTC (permalink / raw)
  To: 'Holger Kiehl', 'Mark Hahn'
  Cc: 'linux-raid', 'linux-kernel'

In most of your results, your CPU usage is very high.  Once you get to about
90% usage, you really can't do much else, unless you can improve the CPU
usage.

Guy

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Holger Kiehl
> Sent: Tuesday, August 30, 2005 3:09 PM
> To: Mark Hahn
> Cc: linux-raid; linux-kernel
> Subject: Re: Where is the performance bottleneck?
> 
> On Mon, 29 Aug 2005, Mark Hahn wrote:
> 
> >> The U320 SCSI controller has a 64 bit PCI-X bus for itself, there is no
> other
> >> device on that bus. Unfortunatly I was unable to determine at what
> speed
> >> it is running, here the output from lspci -vv:
> > ...
> >>                  Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-,
> DC=simple,
> >
> > the "133MHz+" is a good sign.  OTOH the latency (72) seems rather low -
> my
> > understanding is that that would noticably limit the size of burst
> transfers.
> >
> I have tried with 128 and 144, but the transfer rate is only a little
> bit higher barely measurable. Or what values should I try?
> 
> >
> >> Version  1.03        ------Sequential Output------ --Sequential Input-
> --Random-
> >>                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
> --Seeks--
> >> Machine         Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
> /sec %CP
> >> Raid0 (8 disk)15744M 54406  96 247419 90 100752 25 60266  98 226651 29
> 830.2   1
> >> Raid0s(4 disk)15744M 54915  97 253642 89 73976  18 59445  97 198372 24
> 659.8   1
> >> Raid0s(4 disk)15744M 54866  97 268361 95 72852  17 59165  97 187183 22
> 666.3   1
> >
> > you're obviously saturating something already with 2 disks.  did you
> play
> > with "blockdev --setra" setings?
> >
> Yes, I did play a little bit with it but this only changed read
> performance,
> it made no measurable difference when writting.
> 
> Thanks,
> Holger
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: Where is the performance bottleneck?
  2005-08-30 20:06   ` Holger Kiehl
@ 2005-08-31  7:11     ` Vojtech Pavlik
  2005-08-31  7:26       ` Jens Axboe
  2005-08-31 13:38       ` Holger Kiehl
  0 siblings, 2 replies; 42+ messages in thread
From: Vojtech Pavlik @ 2005-08-31  7:11 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
> >>How does one determine the PCI-X bus speed?
> >
> >Usually only the card (in your case the Symbios SCSI controller) can
> >tell. If it does, it'll be most likely in 'dmesg'.
> >
> There is nothing in dmesg:
> 
>    Fusion MPT base driver 3.01.20
>    Copyright (c) 1999-2004 LSI Logic Corporation
>    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>    mptbase: Initiating ioc0 bringup
>    ioc0: 53C1030: Capabilities={Initiator,Target}
>    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>    mptbase: Initiating ioc1 bringup
>    ioc1: 53C1030: Capabilities={Initiator,Target}
>    Fusion MPT SCSI Host driver 3.01.20
> 
> >To find where the bottleneck is, I'd suggest trying without the
> >filesystem at all, and just filling a large part of the block device
> >using the 'dd' command.
> >
> >Also, trying without the RAID, and just running 4 (and 8) concurrent
> >dd's to the separate drives could show whether it's the RAID that's
> >slowing things down.
> >
> Ok, I did run the following dd command in different combinations:
> 
>    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000

I think a bs of 4k is way too small and will cause huge CPU overhead.
Can you try with something like 4M? Also, you can use /dev/full to avoid
the pre-zeroing.

> Here the results:
> 
>    Each disk alone
>    /dev/sdc1 59.094636 MB/s
>    /dev/sdd1 58.686592 MB/s
>    /dev/sde1 55.282807 MB/s
>    /dev/sdf1 62.271240 MB/s
>    /dev/sdg1 60.872891 MB/s
>    /dev/sdh1 62.252781 MB/s
>    /dev/sdi1 59.145637 MB/s
>    /dev/sdj1 60.921119 MB/s

>    All 8 disks in parallel
>    /dev/sdc1 24.120545 MB/s
>    /dev/sdd1 24.419801 MB/s
>    /dev/sde1 24.296588 MB/s
>    /dev/sdf1 25.609548 MB/s
>    /dev/sdg1 24.572617 MB/s
>    /dev/sdh1 25.552590 MB/s
>    /dev/sdi1 24.575616 MB/s
>    /dev/sdj1 25.124165 MB/s

You're saturating some bus. It almost looks like it's the PCI-X,
although that should be able to deliver (if running at the full speed
of the AMD8132) up to 1GB/sec, so it SHOULD not be an issue.
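
(For reference: a 64-bit PCI-X bus at 133 MHz moves 8 bytes x 133 MHz,
roughly 1064 MB/s, and each U320 channel tops out at 320 MB/s, while the
eight parallel dd's above sum to only about 8 x 25 = 200 MB/s, well below
either figure.)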

> So from these results, I may assume that md is not the cause of the problem.
> 
> What comes as a big surprise is that I loose 25% performance with only
> two disks and each hanging on its own channel!
> 
> Is this normal? I wonder if other people have the same problem with
> other controllers or the same.

No, I don't think this is OK.

> What can I do next to find out if this is a kernel, driver or hardware
> problem?
 
You need to find where the bottleneck is, by removing one possible
bottleneck at a time in your test.

-- 
Vojtech Pavlik
SuSE Labs, SuSE CR


* Re: Where is the performance bottleneck?
  2005-08-31  7:11     ` Vojtech Pavlik
@ 2005-08-31  7:26       ` Jens Axboe
  2005-08-31 11:54         ` Holger Kiehl
  2005-08-31 13:38       ` Holger Kiehl
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31  7:26 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: Holger Kiehl, linux-raid, linux-kernel

On Wed, Aug 31 2005, Vojtech Pavlik wrote:
> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
> > >>How does one determine the PCI-X bus speed?
> > >
> > >Usually only the card (in your case the Symbios SCSI controller) can
> > >tell. If it does, it'll be most likely in 'dmesg'.
> > >
> > There is nothing in dmesg:
> > 
> >    Fusion MPT base driver 3.01.20
> >    Copyright (c) 1999-2004 LSI Logic Corporation
> >    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
> >    mptbase: Initiating ioc0 bringup
> >    ioc0: 53C1030: Capabilities={Initiator,Target}
> >    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
> >    mptbase: Initiating ioc1 bringup
> >    ioc1: 53C1030: Capabilities={Initiator,Target}
> >    Fusion MPT SCSI Host driver 3.01.20
> > 
> > >To find where the bottleneck is, I'd suggest trying without the
> > >filesystem at all, and just filling a large part of the block device
> > >using the 'dd' command.
> > >
> > >Also, trying without the RAID, and just running 4 (and 8) concurrent
> > >dd's to the separate drives could show whether it's the RAID that's
> > >slowing things down.
> > >
> > Ok, I did run the following dd command in different combinations:
> > 
> >    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
> 
> I think a bs of 4k is way too small and will cause huge CPU overhead.
> Can you try with something like 4M? Also, you can use /dev/full to avoid
> the pre-zeroing.

That was my initial thought as well, but since he's writing the io side
should look correct. I doubt 8 dd's writing 4k chunks will gobble that
much CPU as to make this much difference.

Holger, we need vmstat 1 info while the dd's are running. A simple
profile would be nice as well, boot with profile=2 and do a readprofile
-r; run tests; readprofile > foo and send the first 50 lines of foo to
this list.

-- 
Jens Axboe



* Re: Where is the performance bottleneck?
  2005-08-30 19:18   ` Holger Kiehl
@ 2005-08-31 10:30     ` Al Boldi
  0 siblings, 0 replies; 42+ messages in thread
From: Al Boldi @ 2005-08-31 10:30 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-kernel, linux-raid

Holger Kiehl wrote:
> On Mon, 29 Aug 2005, Al Boldi wrote:
> > You may be hitting a 2.6 kernel bug, which has something to do with
> > readahead, ask Jens Axboe about it! (see "[git patches] IDE update"
> > thread) Sadly, 2.6.13 did not fix it either.
>
> I did read that threat, but due to my limited understanding about kernel
> code, don't see the relation to my problem.

Basically the kernel is losing CPU cycles while accessing block devices.
The problem shows up most when the CPU/disk ratio is low.
Throwing more CPU cycles at the problem may seemingly remove this bottleneck.

> But I am willing to try any patches to solve the problem.

No patches yet.

> > Did you try 2.4.31?
>
> No. Will give this a try if the problem is not found.

Keep us posted!

--
Al


* Re: Where is the performance bottleneck?
  2005-08-31  7:26       ` Jens Axboe
@ 2005-08-31 11:54         ` Holger Kiehl
  2005-08-31 12:07           ` Jens Axboe
  2005-08-31 12:24           ` Nick Piggin
  0 siblings, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 11:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> On Wed, Aug 31 2005, Vojtech Pavlik wrote:
>> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
>>>>> How does one determine the PCI-X bus speed?
>>>>
>>>> Usually only the card (in your case the Symbios SCSI controller) can
>>>> tell. If it does, it'll be most likely in 'dmesg'.
>>>>
>>> There is nothing in dmesg:
>>>
>>>    Fusion MPT base driver 3.01.20
>>>    Copyright (c) 1999-2004 LSI Logic Corporation
>>>    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>>>    mptbase: Initiating ioc0 bringup
>>>    ioc0: 53C1030: Capabilities={Initiator,Target}
>>>    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>>>    mptbase: Initiating ioc1 bringup
>>>    ioc1: 53C1030: Capabilities={Initiator,Target}
>>>    Fusion MPT SCSI Host driver 3.01.20
>>>
>>>> To find where the bottleneck is, I'd suggest trying without the
>>>> filesystem at all, and just filling a large part of the block device
>>>> using the 'dd' command.
>>>>
>>>> Also, trying without the RAID, and just running 4 (and 8) concurrent
>>>> dd's to the separate drives could show whether it's the RAID that's
>>>> slowing things down.
>>>>
>>> Ok, I did run the following dd command in different combinations:
>>>
>>>    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
>>
>> I think a bs of 4k is way too small and will cause huge CPU overhead.
>> Can you try with something like 4M? Also, you can use /dev/full to avoid
>> the pre-zeroing.
>
> That was my initial thought as well, but since he's writing the io side
> should look correct. I doubt 8 dd's writing 4k chunks will gobble that
> much CPU as to make this much difference.
>
> Holger, we need vmstat 1 info while the dd's are running. A simple
> profile would be nice as well, boot with profile=2 and do a readprofile
> -r; run tests; readprofile > foo and send the first 50 lines of foo to
> this list.
>
Here is vmstat for 8 dd's, still with 4k blocksize:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  9  2   5244  38272 7738248  10400    0    0     3 11444  390    24  0  5 75 20
  5 10   5244  30824 7747680   8684    0    0     0 265672 2582  1917  1 95  0  4
  2 12   5244  30948 7747248   8708    0    0     0 222620 2858   292  0 33  0 67
  4 10   5244  31072 7747516   8644    0    0     0 236400 3132   326  0 43  0 57
  2 12   5244  31320 7747792   8512    0    0     0 250204 3225   285  0 37  0 63
  1 13   5244  30948 7747412   8552    0    0    24 227600 3261   312  0 41  0 59
  2 12   5244  32684 7746124   8616    0    0     0 235392 3219   274  0 32  0 68
  1 13   5244  30948 7747940   8568    0    0     0 228020 3394   296  0 37  0 63
  0 14   5244  31196 7747680   8624    0    0     0 232932 3389   300  0 32  0 68
  3 12   5244  31072 7747904   8536    0    0     0 233096 3545   312  0 33  0 67
  1 13   5244  31072 7747852   8520    0    0     0 226992 3381   290  0 31  0 69
  1 13   5244  31196 7747704   8396    0    0     0 230112 3372   265  0 28  0 72
  0 14   5244  31072 7747928   8512    0    0     0 240652 3491   295  0 33  0 67
  3 13   5244  31072 7748104   8608    0    0     0 222944 3433   269  0 27  0 73
  1 13   5244  31072 7748000   8508    0    0     0 207944 3470   294  0 28  0 72
  0 14   5244  31072 7747980   8528    0    0     0 234608 3496   272  0 31  0 69
  2 12   5244  31196 7748148   8496    0    0     0 228760 3480   280  0 28  0 72
  0 14   5244  30948 7748568   8620    0    0     0 214372 3551   302  0 29  0 71
  1 13   5244  31072 7748392   8524    0    0     0 226732 3494   284  0 29  0 71
  0 14   5244  31072 7748004   8640    0    0     0 229628 3604   273  0 26  0 74
  1 13   5244  30948 7748392   8660    0    0     0 212868 3563   266  0 28  0 72
  1 13   5244  30948 7748600   8520    0    0     0 228244 3568   294  0 30  0 70
  1 13   5244  31196 7748228   8416    0    0     0 221692 3543   258  0 27  0 73
  1 13   5244  31072 7748192   8520    0    0     0 241040 3983   330  0 25  0 74
  1 13   5244  31196 7748288   8560    0    0     0 217108 3676   276  0 28  0 72
                              .
                              .
                              .
                This goes on up to the end.
                              .
                              .
                              .
  0  3   5244 825096 6949252   8596    0    0     0 241244 2683   223  0  7 71 22
  0  2   5244 825108 6949252   8596    0    0     0 229764 2683   214  0  7 73 20
  0  3   5244 826348 6949252   8596    0    0     0 116840 2046   450  0  4 71 26
  0  3   5244 826976 6949252   8596    0    0     0 141992 1887    97  0  4 73 23
  0  3   5244 827100 6949252   8596    0    0     0 137716 1871    93  0  4 70 26
  0  3   5244 827100 6949252   8596    0    0     0 137032 1894    96  0  4 75 21
  0  3   5244 827224 6949252   8596    0    0     0 131332 1860   288  0  4 73 23
  0  1   5244 1943732 5833756   8620    0    0     0 72404 1560   481  0 24 61 16
  0  2   5244 1943732 5833756   8620    0    0     0 71680 1450    60  0  2 61 38
  0  2   5244 1943736 5833756   8620    0    0     0 71680 1464    70  0  2 52 46
  0  2   5244 1943736 5833756   8620    0    0     0 66560 1436    66  0  2 50 48
  0  2   5244 1943984 5833756   8620    0    0     0 71680 1454    72  0  2 50 48
  0  2   5244 1943984 5833756   8620    0    0     0 71680 1450    70  0  2 50 48
  1  0   5244 2906484 4872176   8612    0    0     0 12760 1240   321  0 13 68 19
  0  0   5244 3306732 4472300   8580    0    0     0     0 1109    31  0  9 91  0
  0  0   5244 3306732 4472300   8580    0    0     0     0 1008    22  0  0 100  0

And here is the profile output (I assume you meant sorted):

3236497 total                                      1.4547
2507913 default_idle                             52248.1875
158752 shrink_zone                               43.3275
121584 copy_user_generic_c                      3199.5789
  34271 __wake_up_bit                            713.9792
  31131 __make_request                            23.1629
  22096 scsi_request_fn                           18.4133
  21915 rotate_reclaimable_page                   80.5699
  20641 end_buffer_async_write                    86.0042
  18701 __clear_user                             292.2031
  13562 __block_write_full_page                   18.4266
  12981 test_set_page_writeback                   47.7243
  10772 kmem_cache_free                           96.1786
  10216 unlock_page                              159.6250
   9492 free_hot_cold_page                        32.9583
   9478 add_to_page_cache                         45.5673
   9117 page_waitqueue                            81.4018
   8671 drop_buffers                              38.7098
   8584 __set_page_dirty_nobuffers                31.5588
   8444 release_pages                             23.9886
   8204 scsi_dispatch_cmd                         14.2431
   8191 buffered_rmqueue                          11.6349
   7966 page_referenced                           22.6307
   7093 generic_file_buffered_write                4.1431
   6953 __pagevec_lru_add                         28.9708
   6740 __alloc_pages                              5.6926
   6369 __end_that_request_first                  11.7077
   5940 dnotify_parent                            30.9375
   5880 kmem_cache_alloc                          91.8750
   5797 submit_bh                                 19.0691
   4720 find_lock_page                            21.0714
   4612 __generic_file_aio_write_nolock            4.8042
   4559 __do_softirq                              20.3527
   4337 end_page_writeback                        54.2125
   4090 create_empty_buffers                      25.5625
   3985 bio_alloc_bioset                           9.2245
   3787 mempool_alloc                             12.4572
   3708 set_page_refs                            231.7500
   3545 __block_commit_write                      17.0433
   3037 system_call                               23.1832
   2968 zone_watermark_ok                         15.4583
   2966 cond_resched                              26.4821
   2828 generic_make_request                       4.7770
   2766 __mod_page_state                          86.4375
   2759 fget_light                                15.6761
   2692 test_clear_page_dirty                     11.2167
   2523 vfs_write                                  8.2993
   2406 generic_file_aio_write_nolock             15.0375
   2335 bio_put                                   36.4844
   2287 bad_range                                 23.8229

Under ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/ I put the full vmstat
and profile output (also with -v). There is also dmesg and my kernel.config
from this system.

I will also do some tests with 4M instead of 4k and, as Al Boldi hinted,
do a test together with some CPU load.

Thanks,
Holger



* Re: Where is the performance bottleneck?
  2005-08-31 11:54         ` Holger Kiehl
@ 2005-08-31 12:07           ` Jens Axboe
  2005-08-31 13:55             ` Holger Kiehl
  2005-08-31 12:24           ` Nick Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 12:07 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, Holger Kiehl wrote:
> >>>Ok, I did run the following dd command in different combinations:
> >>>
> >>>   dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
> >>
> >>I think a bs of 4k is way too small and will cause huge CPU overhead.
> >>Can you try with something like 4M? Also, you can use /dev/full to avoid
> >>the pre-zeroing.
> >
> >That was my initial thought as well, but since he's writing the io side
> >should look correct. I doubt 8 dd's writing 4k chunks will gobble that
> >much CPU as to make this much difference.
> >
> >Holger, we need vmstat 1 info while the dd's are running. A simple
> >profile would be nice as well, boot with profile=2 and do a readprofile
> >-r; run tests; readprofile > foo and send the first 50 lines of foo to
> >this list.
> >
> Here vmstat for 8 dd's still with 4k blocksize:
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- 
> ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id 
>  wa
>  9  2   5244  38272 7738248  10400    0    0     3 11444  390    24  0  5 
>  75 20
>  5 10   5244  30824 7747680   8684    0    0     0 265672 2582  1917  1 95  
>  0  4
>  2 12   5244  30948 7747248   8708    0    0     0 222620 2858   292  0 33  
>  0 67
>  4 10   5244  31072 7747516   8644    0    0     0 236400 3132   326  0 43  
>  0 57
>  2 12   5244  31320 7747792   8512    0    0     0 250204 3225   285  0 37  
>  0 63
>  1 13   5244  30948 7747412   8552    0    0    24 227600 3261   312  0 41  
>  0 59
>  2 12   5244  32684 7746124   8616    0    0     0 235392 3219   274  0 32  
>  0 68

[snip]

Looks as expected, nothing too excessive showing up. About 30-40% sys
time, but it should not bog the machine down that much.

> And here the profile output (I assume you meant sorted):

I did, thanks :)

> 3236497 total                                      1.4547
> 2507913 default_idle                             52248.1875
> 158752 shrink_zone                               43.3275
> 121584 copy_user_generic_c                      3199.5789
>  34271 __wake_up_bit                            713.9792
>  31131 __make_request                            23.1629
>  22096 scsi_request_fn                           18.4133
>  21915 rotate_reclaimable_page                   80.5699
>  20641 end_buffer_async_write                    86.0042
>  18701 __clear_user                             292.2031

Nothing sticks out here either. There's plenty of idle time. It smells
like a driver issue. Can you try the same dd test, but read from the
drives instead? Use a bigger blocksize here, 128 or 256k.

You might want to try the same with direct io, just to eliminate the
costly user copy. I don't expect it to make much of a difference though,
feels like the problem is elsewhere (driver, most likely).
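
A minimal sketch of such a direct-I/O read loop, assuming O_DIRECT and
posix_memalign are available; the device name, block size and read count are
placeholders, not a prescription:

#define _GNU_SOURCE                 /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
   int     fd;
   long    i;
   void    *buf;
   ssize_t n = 0;
   size_t  bs = 256 * 1024;         /* 256k reads, as suggested above */

   if ((fd = open("/dev/sdc1", O_RDONLY | O_DIRECT)) < 0)
      return 1;
   if (posix_memalign(&buf, 4096, bs) != 0)   /* O_DIRECT wants aligned buffers */
      return 1;
   for (i = 0; i < 20000; i++)                /* ~5 GB worth of reads */
      if ((n = read(fd, buf, bs)) <= 0)       /* data bypasses the page cache,  */
         break;                               /* so no copy_to_user is involved */
   if (n < 0)
      perror("read");
   free(buf);
   (void)close(fd);

   return 0;
}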

If we still can't get closer to this, it would be interesting to try my
block tracing stuff so we can see what is going on at the queue level.
But lets gather some more info first, since it requires testing -mm.

-- 
Jens Axboe



* Re: Where is the performance bottleneck?
  2005-08-31 11:54         ` Holger Kiehl
  2005-08-31 12:07           ` Jens Axboe
@ 2005-08-31 12:24           ` Nick Piggin
  2005-08-31 16:25             ` Holger Kiehl
  1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-08-31 12:24 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

Holger Kiehl wrote:

> 3236497 total                                      1.4547
> 2507913 default_idle                             52248.1875
> 158752 shrink_zone                               43.3275
> 121584 copy_user_generic_c                      3199.5789
>  34271 __wake_up_bit                            713.9792
>  31131 __make_request                            23.1629
>  22096 scsi_request_fn                           18.4133
>  21915 rotate_reclaimable_page                   80.5699
            ^^^^^^^^^

I don't think this function should be here. This indicates that
lots of writeout is happening due to pages falling off the end
of the LRU.

There was a bug recently causing memory estimates to be wrong
on Opterons that could cause this I think.

Can you send in 2 dumps of /proc/vmstat taken 10 seconds apart
while you're writing at full speed (with 2.6.13 or the latest
-git tree).

A dump of /proc/zoneinfo and /proc/meminfo while the write is
going on would be helpful too.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: Where is the performance bottleneck?
  2005-08-31  7:11     ` Vojtech Pavlik
  2005-08-31  7:26       ` Jens Axboe
@ 2005-08-31 13:38       ` Holger Kiehl
  1 sibling, 0 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 13:38 UTC (permalink / raw)
  To: Vojtech Pavlik; +Cc: linux-raid, linux-kernel

On Wed, 31 Aug 2005, Vojtech Pavlik wrote:

> On Tue, Aug 30, 2005 at 08:06:21PM +0000, Holger Kiehl wrote:
>>>> How does one determine the PCI-X bus speed?
>>>
>>> Usually only the card (in your case the Symbios SCSI controller) can
>>> tell. If it does, it'll be most likely in 'dmesg'.
>>>
>> There is nothing in dmesg:
>>
>>    Fusion MPT base driver 3.01.20
>>    Copyright (c) 1999-2004 LSI Logic Corporation
>>    ACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 24 (level, low) -> IRQ 217
>>    mptbase: Initiating ioc0 bringup
>>    ioc0: 53C1030: Capabilities={Initiator,Target}
>>    ACPI: PCI Interrupt 0000:02:04.1[B] -> GSI 25 (level, low) -> IRQ 225
>>    mptbase: Initiating ioc1 bringup
>>    ioc1: 53C1030: Capabilities={Initiator,Target}
>>    Fusion MPT SCSI Host driver 3.01.20
>>
>>> To find where the bottleneck is, I'd suggest trying without the
>>> filesystem at all, and just filling a large part of the block device
>>> using the 'dd' command.
>>>
>>> Also, trying without the RAID, and just running 4 (and 8) concurrent
>>> dd's to the separate drives could show whether it's the RAID that's
>>> slowing things down.
>>>
>> Ok, I did run the following dd command in different combinations:
>>
>>    dd if=/dev/zero of=/dev/sd?1 bs=4k count=5000000
>
> I think a bs of 4k is way too small and will cause huge CPU overhead.
> Can you try with something like 4M? Also, you can use /dev/full to avoid
> the pre-zeroing.
>
Ok, I now use the following command:

       dd if=/dev/full of=/dev/sd?1 bs=4M count=4883

Here are the results for all 8 disks in parallel:

       /dev/sdc1 24.957257 MB/s
       /dev/sdd1 25.290177 MB/s
       /dev/sde1 25.046711 MB/s
       /dev/sdf1 26.369777 MB/s
       /dev/sdg1 24.080695 MB/s
       /dev/sdh1 25.008803 MB/s
       /dev/sdi1 24.202202 MB/s
       /dev/sdj1 24.712840 MB/s

A little bit faster but not much.

Holger



* Re: Where is the performance bottleneck?
  2005-08-31 12:07           ` Jens Axboe
@ 2005-08-31 13:55             ` Holger Kiehl
  2005-08-31 14:24               ` Dr. David Alan Gilbert
  2005-08-31 16:20               ` Jens Axboe
  0 siblings, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 13:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> Nothing sticks out here either. There's plenty of idle time. It smells
> like a driver issue. Can you try the same dd test, but read from the
> drives instead? Use a bigger blocksize here, 128 or 256k.
>
I used the following command reading from all 8 disks in parallel:

    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125

Here is the vmstat output (I just cut something out in the middle):

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
  1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70
  2  6   4348  37552 7805120   9644    0    0 322432     0 3527  4908  0 23 12 64
  0  8   4348  41152 7801552   9608    0    0 322176     0 3524  5018  0 24  6 70
  1  7   4348  41644 7801044   9572    0    0 322560     0 3530  5175  0 23  0 76
  1  7   4348  37184 7805396   9640    0    0 322176     0 3525  4914  0 24 18 59
  3  7   4348  41704 7800376   9832    0    0 322176    20 3531  5080  0 23  4 73
  1  7   4348  40652 7801700   9732    0    0 323072     0 3533  5115  0 24 13 64
  1  7   4348  40284 7802224   9616    0    0 322560     0 3527  4967  0 23  1 76
  0  8   4348  40156 7802356   9688    0    0 322560     0 3528  5080  0 23  2 75
  6  8   4348  41896 7799984   9816    0    0 322176     0 3530  4945  0 24 20 57
  0  8   4348  39540 7803124   9600    0    0 322560     0 3529  4811  0 24 21 55
  1  7   4348  41520 7801084   9600    0    0 322560     0 3532  4843  0 23 22 55
  0  8   4348  40408 7802116   9588    0    0 322560     0 3527  5010  0 23  4 72
  0  8   4348  38172 7804300   9580    0    0 322176     0 3526  4992  0 24  7 69
  4  7   4348  42264 7799784   9812    0    0 322688     0 3529  5003  0 24  8 68
  1  7   4348  39908 7802520   9660    0    0 322700     0 3529  4963  0 24 14 62
  0  8   4348  37428 7805076   9620    0    0 322420     0 3528  4967  0 23 15 62
  0  8   4348  37056 7805348   9688    0    0 322048     0 3525  4982  0 24 26 50
  1  7   4348  37804 7804456   9696    0    0 322560     0 3528  5072  0 24 16 60
  0  8   4348  38416 7804084   9660    0    0 323200     0 3533  5081  0 24 23 53
  0  8   4348  40160 7802300   9676    0    0 323200    28 3543  5095  0 24 17 59
  1  7   4348  37928 7804612   9608    0    0 323072     0 3532  5175  0 24  7 68
  2  6   4348  38680 7803724   9612    0    0 322944     0 3531  4906  0 25 24 51
  1  7   4348  40408 7802192   9648    0    0 322048     0 3524  4947  0 24 19 57

Full vmstat session can be found under:

   ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/vmstat-256k-read

And here is the profile data:

2106577 total                                      0.9469
1638177 default_idle                             34128.6875
179615 copy_user_generic_c                      4726.7105
  27670 end_buffer_async_read                    108.0859
  26055 shrink_zone                                7.1111
  23199 __make_request                            17.2612
  17221 kmem_cache_free                          153.7589
  11796 drop_buffers                              52.6607
  11016 add_to_page_cache                         52.9615
   9470 __wake_up_bit                            197.2917
   8760 buffered_rmqueue                          12.4432
   8646 find_get_page                             90.0625
   8319 __do_page_cache_readahead                 11.0625
   7976 kmem_cache_alloc                         124.6250
   7463 scsi_request_fn                            6.2192
   7208 try_to_free_buffers                       40.9545
   6716 create_empty_buffers                      41.9750
   6432 __end_that_request_first                  11.8235
   6044 test_clear_page_dirty                     25.1833
   5643 scsi_dispatch_cmd                          9.7969
   5588 free_hot_cold_page                        19.4028
   5479 submit_bh                                 18.0230
   3903 __alloc_pages                              3.2965
   3671 file_read_actor                            9.9755
   3425 thread_return                             14.2708
   3333 generic_make_request                       5.6301
   3294 bio_alloc_bioset                           7.6250
   2868 bio_put                                   44.8125
   2851 mpt_interrupt                              2.8284
   2697 mempool_alloc                              8.8717
   2642 block_read_full_page                       3.9315
   2512 do_generic_mapping_read                    2.1216
   2394 set_page_refs                            149.6250
   2235 alloc_page_buffers                         9.9777
   1992 __pagevec_lru_add                          8.3000
   1859 __memset                                   9.6823
   1791 page_waitqueue                            15.9911
   1783 scsi_end_request                           6.9648
   1348 dma_unmap_sg                               6.4808
   1324 bio_endio                                 11.8214
   1306 unlock_page                               20.4062
   1211 mptscsih_freeChainBuffers                  7.5687
   1141 alloc_pages_current                        7.9236
   1136 __mod_page_state                          35.5000
   1116 radix_tree_preload                         8.7188
   1061 __pagevec_release_nonlru                   6.6312
   1043 set_bh_page                                9.3125
   1024 release_pages                              2.9091
   1023 mempool_free                               6.3937
    832 alloc_buffer_head                         13.0000

Full profile data can be found under:

    ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/dd-256k-8disk-read.profile

> You might want to try the same with direct io, just to eliminate the
> costly user copy. I don't expect it to make much of a difference though,
> feels like the problem is elsewhere (driver, most likely).
>
Sorry, I don't know how to do this. Do you mean using a C program
that sets some flag to do direct io, or how can I do that?

> If we still can't get closer to this, it would be interesting to try my
> block tracing stuff so we can see what is going on at the queue level.
> But lets gather some more info first, since it requires testing -mm.
>
Ok, please then just tell me what I must do.

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 13:55             ` Holger Kiehl
@ 2005-08-31 14:24               ` Dr. David Alan Gilbert
  2005-08-31 20:56                 ` Holger Kiehl
  2005-08-31 16:20               ` Jens Axboe
  1 sibling, 1 reply; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2005-08-31 14:24 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

* Holger Kiehl (Holger.Kiehl@dwd.de) wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
> 
> Full vmstat session can be found under:

Have you got iostat?  iostat -x 10 might be interesting to see
for a period while it is going.
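
For example (assuming the sysstat iostat is installed), in a second shell
while the dd's are running:

    iostat -x 10 > iostat-write.log &

(the log file name is arbitrary) - the await and %util columns for the sd
devices are the interesting bits.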

Dave
--
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:20               ` Jens Axboe
@ 2005-08-31 15:16                 ` jmerkey
  2005-08-31 16:58                   ` Tom Callahan
  2005-08-31 17:11                   ` Jens Axboe
  2005-08-31 16:51                 ` Holger Kiehl
  1 sibling, 2 replies; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel



I have seen an 80MB/s limitation in the kernel unless this value is 
changed in the SCSI I/O layer
for 3Ware and other controllers during testing of 2.6.X series kernels.

Change these values in include/linux/blkdev.h and performance goes from 
80MB/S to over 670MB/S on the 3Ware controller.


//#define BLKDEV_MIN_RQ    4
//#define BLKDEV_MAX_RQ    128    /* Default maximum */
#define BLKDEV_MIN_RQ    4096
#define BLKDEV_MAX_RQ    8192    /* Default maximum */

Jeff



Jens Axboe wrote:

>On Wed, Aug 31 2005, Holger Kiehl wrote:
>
>>On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>>>Nothing sticks out here either. There's plenty of idle time. It smells
>>>like a driver issue. Can you try the same dd test, but read from the
>>>drives instead? Use a bigger blocksize here, 128 or 256k.
>>>
>>I used the following command reading from all 8 disks in parallel:
>>
>>   dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>
>>Here vmstat output (I just cut something out in the middle):
>>
>>procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>> r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>> 3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
>> 1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
>> 0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
>> 1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70
>
>Ok, so that's somewhat better than the writes but still off from what
>the individual drives can do in total.
>
>>>You might want to try the same with direct io, just to eliminate the
>>>costly user copy. I don't expect it to make much of a difference though,
>>>feels like the problem is elsewhere (driver, most likely).
>>>
>>Sorry, I don't know how to do this. Do you mean using a C program
>>that sets some flag to do direct io, or how can I do that?
>
>I've attached a little sample for you, just run ala
>
># ./oread /dev/sdX
>
>and it will read 128k chunks direct from that device. Run on the same
>drives as above, reply with the vmstat info again.
>
>------------------------------------------------------------------------
>
>#include <stdio.h>
>#include <stdlib.h>
>#define __USE_GNU
>#include <fcntl.h>
>#include <stdlib.h>
>#include <unistd.h>
>
>#define BS		(131072)
>#define ALIGN(buf)	(char *) (((unsigned long) (buf) + 4095) & ~(4095))
>#define BLOCKS		(8192)
>
>int main(int argc, char *argv[])
>{
>	char *p;
>	int fd, i;
>
>	if (argc < 2) {
>		printf("%s: <dev>\n", argv[0]);
>		return 1;
>	}
>
>	fd = open(argv[1], O_RDONLY | O_DIRECT);
>	if (fd == -1) {
>		perror("open");
>		return 1;
>	}
>
>	p = ALIGN(malloc(BS + 4095));
>	for (i = 0; i < BLOCKS; i++) {
>		int r = read(fd, p, BS);
>
>		if (r == BS)
>			continue;
>		else {
>			if (r == -1)
>				perror("read");
>
>			break;
>		}
>	}
>
>	return 0;
>}
>  
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:58                   ` Tom Callahan
@ 2005-08-31 15:47                     ` jmerkey
  0 siblings, 0 replies; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:47 UTC (permalink / raw)
  To: Tom Callahan
  Cc: Jens Axboe, Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel


I'll try this approach as well.  On 2.4.X kernels, I had to change
nr_requests to achieve performance, but I noticed it didn't seem to work
as well on 2.6.X.  I'll retry the change with nr_requests on 2.6.X.

Thanks

Jeff

Tom Callahan wrote:

>From linux-kernel mailing list.....
>
>Don't do this. BLKDEV_MIN_RQ sets the size of the mempool reserved
>requests and will only get slightly used in low memory conditions, so
>most memory will probably be wasted.....
>
>Change /sys/block/xxx/queue/nr_requests
>
>Tom Callahan
>TESSCO Technologies
>(443)-506-6216
>callahant@tessco.com
>
>
>
>jmerkey wrote:
>
[snip]


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 17:11                   ` Jens Axboe
@ 2005-08-31 15:59                     ` jmerkey
  2005-08-31 17:32                       ` Jens Axboe
  0 siblings, 1 reply; 42+ messages in thread
From: jmerkey @ 2005-08-31 15:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel


512 is not enough. It has to be larger. I just tried 512 and it still 
limits the data rates.

Jeff


Jens Axboe wrote:

>On Wed, Aug 31 2005, jmerkey wrote:
>  
>
>>I have seen an 80MB/s limitation in the kernel unless this value is 
>>changed in the SCSI I/O layer
>>for 3Ware and other controllers during testing of 2.6.X series kernels.
>>
>>Change these values in include/linux/blkdev.h and performance goes from 
>>80MB/S to over 670MB/S on the 3Ware controller.
>>
>>
>>//#define BLKDEV_MIN_RQ    4
>>//#define BLKDEV_MAX_RQ    128    /* Default maximum */
>>#define BLKDEV_MIN_RQ    4096
>>#define BLKDEV_MAX_RQ    8192    /* Default maximum */
>>    
>>
>
>That's insane, you just wasted 1MiB of preallocated requests on each
>queue in the system!
>
>Please just do
>
># echo 512 > /sys/block/dev/queue/nr_requests
>
>after boot for each device you want to increase the queue size too. 512
>should be enough with the 3ware.
>
>  
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 13:55             ` Holger Kiehl
  2005-08-31 14:24               ` Dr. David Alan Gilbert
@ 2005-08-31 16:20               ` Jens Axboe
  2005-08-31 15:16                 ` jmerkey
  2005-08-31 16:51                 ` Holger Kiehl
  1 sibling, 2 replies; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 16:20 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]

On Wed, Aug 31 2005, Holger Kiehl wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
> 
> >Nothing sticks out here either. There's plenty of idle time. It smells
> >like a driver issue. Can you try the same dd test, but read from the
> >drives instead? Use a bigger blocksize here, 128 or 256k.
> >
> I used the following command reading from all 8 disks in parallel:
> 
>    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
> 
> Here vmstat output (I just cut something out in the middle):
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
>  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
>  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
>  1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70

Ok, so that's somewhat better than the writes but still off from what
the individual drives can do in total.

> >You might want to try the same with direct io, just to eliminate the
> >costly user copy. I don't expect it to make much of a difference though,
> >feels like the problem is elsewhere (driver, most likely).
> >
> Sorry, I don't know how to do this. Do you mean using a C program
> that sets some flag to do direct io, or how can I do that?

I've attached a little sample for you, just run ala

# ./oread /dev/sdX

and it will read 128k chunks direct from that device. Run on the same
drives as above, reply with the vmstat info again.
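
To hit all 8 drives in parallel, something along these lines should do
(sketch; compile with gcc -O2 -o oread oread.c first):

  for d in sdc1 sdd1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1; do
          ./oread /dev/$d &
  done
  vmstat 1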

-- 
Jens Axboe


[-- Attachment #2: oread.c --]
[-- Type: text/plain, Size: 647 bytes --]

#include <stdio.h>
#include <stdlib.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BS		(131072)
#define ALIGN(buf)	(char *) (((unsigned long) (buf) + 4095) & ~(4095))
#define BLOCKS		(8192)

int main(int argc, char *argv[])
{
	char *p;
	int fd, i;

	if (argc < 2) {
		printf("%s: <dev>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	p = ALIGN(malloc(BS + 4095));
	for (i = 0; i < BLOCKS; i++) {
		int r = read(fd, p, BS);

		if (r == BS)
			continue;
		else {
			if (r == -1)
				perror("read");

			break;
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 12:24           ` Nick Piggin
@ 2005-08-31 16:25             ` Holger Kiehl
  2005-08-31 17:25               ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 16:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Nick Piggin wrote:

> Holger Kiehl wrote:
>
>> 3236497 total                                      1.4547
>> 2507913 default_idle                             52248.1875
>> 158752 shrink_zone                               43.3275
>> 121584 copy_user_generic_c                      3199.5789
>>  34271 __wake_up_bit                            713.9792
>>  31131 __make_request                            23.1629
>>  22096 scsi_request_fn                           18.4133
>>  21915 rotate_reclaimable_page                   80.5699
>           ^^^^^^^^^
>
> I don't think this function should be here. This indicates that
> lots of writeout is happening due to pages falling off the end
> of the LRU.
>
> There was a bug recently causing memory estimates to be wrong
> on Opterons that could cause this I think.
>
> Can you send in 2 dumps of /proc/vmstat taken 10 seconds apart
> while you're writing at full speed (with 2.6.13 or the latest
> -git tree).
>
I took 2.6.13; there were no git snapshots at www.kernel.org when
I looked. With 2.6.13 I must load the Fusion MPT driver as a module.
When it is compiled in it does not detect the drives correctly; as a
module there is no problem.

Here is what I did:

    #!/bin/bash

    time dd if=/dev/full of=/dev/sdc1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdd1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sde1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdf1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdg1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdh1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdi1 bs=4M count=4883 &
    time dd if=/dev/full of=/dev/sdj1 bs=4M count=4883 &

    sleep 20

    cat /proc/vmstat > /root/vmstat-1.dump

    sleep 10

    cat /proc/vmstat > /root/vmstat-2.dump
    cat /proc/zoneinfo > /root/zoneinfo.dump
    cat /proc/meminfo > /root/meminfo.dump

    exit 0

vmstat-1.dump:

    nr_dirty 787282
    nr_writeback 44317
    nr_unstable 0
    nr_page_table_pages 633
    nr_mapped 6373
    nr_slab 53030
    pgpgin 263362
    pgpgout 5260352
    pswpin 0
    pswpout 0
    pgalloc_high 0
    pgalloc_normal 2448628
    pgalloc_dma 1041
    pgfree 2457343
    pgactivate 5775
    pgdeactivate 2113
    pgfault 465679
    pgmajfault 321
    pgrefill_high 0
    pgrefill_normal 5940
    pgrefill_dma 33
    pgsteal_high 0
    pgsteal_normal 148759
    pgsteal_dma 0
    pgscan_kswapd_high 0
    pgscan_kswapd_normal 153813
    pgscan_kswapd_dma 1089
    pgscan_direct_high 0
    pgscan_direct_normal 0
    pgscan_direct_dma 0
    pginodesteal 0
    slabs_scanned 0
    kswapd_steal 148759
    kswapd_inodesteal 0
    pageoutrun 5304
    allocstall 0
    pgrotated 0
    nr_bounce 0

vmstat-2.dump:

    nr_dirty 786397
    nr_writeback 44233
    nr_unstable 0
    nr_page_table_pages 640
    nr_mapped 6406
    nr_slab 53027
    pgpgin 263382
    pgpgout 7835732
    pswpin 0
    pswpout 0
    pgalloc_high 0
    pgalloc_normal 3091687
    pgalloc_dma 2420
    pgfree 3101327
    pgactivate 5817
    pgdeactivate 2918
    pgfault 466269
    pgmajfault 322
    pgrefill_high 0
    pgrefill_normal 28265
    pgrefill_dma 150
    pgsteal_high 0
    pgsteal_normal 789909
    pgsteal_dma 1388
    pgscan_kswapd_high 0
    pgscan_kswapd_normal 904101
    pgscan_kswapd_dma 4950
    pgscan_direct_high 0
    pgscan_direct_normal 0
    pgscan_direct_dma 0
    pginodesteal 0
    slabs_scanned 1152
    kswapd_steal 791297
    kswapd_inodesteal 0
    pageoutrun 28299
    allocstall 0
    pgrotated 562
    nr_bounce 0

zoneinfo.dump:

    Node 3, zone   Normal
      pages free     899
            min      726
            low      907
            high     1089
            active   3996
            inactive 490989
            scanned  0 (a: 16 i: 0)
            spanned  524287
            present  524287
            protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 2
                  low:   62
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       10186
                numa_miss:      3313
                numa_foreign:   0
                interleave_hit: 10136
                local_node:     0
                other_node:     13499
        cpu: 1 pcp: 0
                  count: 13
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       6559
                numa_miss:      1668
                numa_foreign:   0
                interleave_hit: 6559
                local_node:     0
                other_node:     8227
        cpu: 2 pcp: 0
                  count: 84
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5579
                numa_miss:      12806
                numa_foreign:   0
                interleave_hit: 5579
                local_node:     0
                other_node:     18385
        cpu: 3 pcp: 0
                  count: 93
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 55
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       834769
                numa_miss:      1
                numa_foreign:   940192
                interleave_hit: 5563
                local_node:     834770
                other_node:     0
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         1572864
    Node 2, zone   Normal
      pages free     1036
            min      726
            low      907
            high     1089
            active   360
            inactive 501700
            scanned  0 (a: 26 i: 0)
            spanned  524287
            present  524287
            protection: (0, 0, 0)
      pagesets
        cpu: 1 pcp: 0
                  count: 91
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       6002
                numa_miss:      15490
                numa_foreign:   0
                interleave_hit: 6002
                local_node:     0
                other_node:     21492
        cpu: 2 pcp: 0
                  count: 75
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 56
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       410692
                numa_miss:      0
                numa_foreign:   76064
                interleave_hit: 5223
                local_node:     410692
                other_node:     0
        cpu: 3 pcp: 0
                  count: 73
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5163
                numa_miss:      288909
                numa_foreign:   1
                interleave_hit: 5152
                local_node:     0
                other_node:     294072
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         1048576
    Node 1, zone   Normal
      pages free     859
            min      703
            low      878
            high     1054
            active   1224
            inactive 485043
            scanned  0 (a: 14 i: 0)
            spanned  507903
            present  507760
            protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 1
                  low:   62
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       9443
                numa_miss:      15475
                numa_foreign:   18446604437880297808
                interleave_hit: 18446604437880307200
                local_node:     1
                other_node:     24917
        cpu: 1 pcp: 0
                  count: 181
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 38
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       368191
                numa_miss:      0
                numa_foreign:   39046
                interleave_hit: 5967
                local_node:     368191
                other_node:     0
        cpu: 2 pcp: 0
                  count: 85
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5139
                numa_miss:      18963
                numa_foreign:   0
                interleave_hit: 5139
                local_node:     0
                other_node:     24102
        cpu: 3 pcp: 0
                  count: 92
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5124
                numa_miss:      363472
                numa_foreign:   0
                interleave_hit: 5115
                local_node:     0
                other_node:     368596
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         524288
    Node 0, zone      DMA
      pages free     2045
            min      5
            low      6
            high     7
            active   0
            inactive 992
            scanned  0 (a: 2 i: 2)
            spanned  4096
            present  3994
            protection: (0, 2031, 2031)
      pagesets
        cpu: 0 pcp: 0
                  count: 1
                  low:   2
                  high:  6
                  batch: 1
        cpu: 0 pcp: 1
                  count: 1
                  low:   0
                  high:  2
                  batch: 1
                numa_hit:       18446604437880298786
                numa_miss:      18446604442220017848
                numa_foreign:   0
                interleave_hit: 0
                local_node:     7567460
                other_node:     0
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         0
    Node 0, zone   Normal
      pages free     1052
            min      721
            low      901
            high     1081
            active   845
            inactive 480162
            scanned  0 (a: 2 i: 0)
            spanned  520191
            present  520191
            protection: (0, 0, 0)
      pagesets
        cpu: 0 pcp: 0
                  count: 96
                  low:   62
                  high:  186
                  batch: 31
        cpu: 0 pcp: 1
                  count: 50
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       18446604437880708763
                numa_miss:      18446604439958819000
                numa_foreign:   29590
                interleave_hit: 9679
                local_node:     7977309
                other_node:     0
        cpu: 1 pcp: 0
                  count: 88
                  low:   62
                  high:  186
                  batch: 31
        cpu: 1 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       6206
                numa_miss:      21831
                numa_foreign:   0
                interleave_hit: 6206
                local_node:     0
                other_node:     28037
        cpu: 2 pcp: 0
                  count: 65
                  low:   62
                  high:  186
                  batch: 31
        cpu: 2 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5367
                numa_miss:      44135
                numa_foreign:   0
                interleave_hit: 5365
                local_node:     0
                other_node:     49502
        cpu: 3 pcp: 0
                  count: 92
                  low:   62
                  high:  186
                  batch: 31
        cpu: 3 pcp: 1
                  count: 0
                  low:   0
                  high:  62
                  batch: 31
                numa_hit:       5544
                numa_miss:      286378
                numa_foreign:   0
                interleave_hit: 5507
                local_node:     0
                other_node:     291922
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         4096

meminfo.dump:

    MemTotal:      8124172 kB
    MemFree:         23564 kB
    Buffers:       7825944 kB
    Cached:          19216 kB
    SwapCached:          0 kB
    Active:          25708 kB
    Inactive:      7835548 kB
    HighTotal:           0 kB
    HighFree:            0 kB
    LowTotal:      8124172 kB
    LowFree:         23564 kB
    SwapTotal:    15631160 kB
    SwapFree:     15631160 kB
    Dirty:         3145604 kB
    Writeback:      176452 kB
    Mapped:          25624 kB
    Slab:           212116 kB
    CommitLimit:  19693244 kB
    Committed_AS:    85112 kB
    PageTables:       2560 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed:     16288 kB
    VmallocChunk: 34359721635 kB

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:20               ` Jens Axboe
  2005-08-31 15:16                 ` jmerkey
@ 2005-08-31 16:51                 ` Holger Kiehl
  2005-08-31 17:35                   ` Jens Axboe
  2005-08-31 18:06                   ` Michael Tokarev
  1 sibling, 2 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 16:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> On Wed, Aug 31 2005, Holger Kiehl wrote:
>> On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>>> Nothing sticks out here either. There's plenty of idle time. It smells
>>> like a driver issue. Can you try the same dd test, but read from the
>>> drives instead? Use a bigger blocksize here, 128 or 256k.
>>>
>> I used the following command reading from all 8 disks in parallel:
>>
>>    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>
>> Here vmstat output (I just cut something out in the middle):
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>>  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987  0 22  0 78
>>  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987  0 23  4 74
>>  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955  0 22 12 66
>>  1  7   4348  38912 7803700   9636    0    0 322432     0 3526  5078  0 23  7 70
>
> Ok, so that's somewhat better than the writes but still off from what
> the individual drives can do in total.
>
>>> You might want to try the same with direct io, just to eliminate the
>>> costly user copy. I don't expect it to make much of a difference though,
>>> feels like the problem is elsewhere (driver, most likely).
>>>
>> Sorry, I don't know how to do this. Do you mean using a C program
>> that sets some flag to do direct io, or how can I do that?
>
> I've attached a little sample for you, just run ala
>
> # ./oread /dev/sdX
>
> and it will read 128k chunks direct from that device. Run on the same
> drives as above, reply with the vmstat info again.
>
Using kernel 2.6.12.5 again, here are the results:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  0  0      0 8009648   4764  40592    0    0     0     0 1011    32  0  0 100  0
  0  0      0 8009648   4764  40592    0    0     0     0 1011    34  0  0 100  0
  0  0      0 8009648   4764  40592    0    0     0     0 1008    61  0  0 100  0
  0  0      0 8009648   4764  40592    0    0     0     0 1006    26  0  0 100  0
  0  8      0 8006372   4764  40592    0    0 120192     0 1944  1929  0  1 89 10
  2  8      0 8006372   4764  40592    0    0 319488     0 3502  4999  0  2 75 24
  0  8      0 8006372   4764  40592    0    0 319488     0 3506  4995  0  2 75 24
  0  8      0 8006372   4764  40592    0    0 319744     0 3504  4999  0  1 75 24
  0  8      0 8006372   4764  40592    0    0 319488     0 3507  5009  0  2 75 23
  0  8      0 8006372   4764  40592    0    0 319616     0 3506  5011  0  2 75 24
  0  8      0 8005124   4800  41100    0    0 319976     0 3536  4995  0  2 73 25
  0  8      0 8005124   4800  41100    0    0 323584     0 3534  5000  0  2 75 23
  0  8      0 8005124   4800  41100    0    0 323968     0 3540  5035  0  1 75 24
  0  8      0 8005124   4800  41100    0    0 319232     0 3506  4811  0  1 75 24
  0  8      0 8005504   4800  41100    0    0 317952     0 3498  4747  0  1 75 24
  0  8      0 8005504   4800  41100    0    0 318720     0 3495  4672  0  2 75 23
  1  8      0 8005504   4800  41100    0    0 318720     0 3509  4707  0  1 75 24
  0  8      0 8005504   4800  41100    0    0 318720     0 3499  4667  0  2 75 23
  0  8      0 8005504   4808  41092    0    0 318848    40 3509  4674  0  1 75 24
  0  8      0 8005380   4808  41092    0    0 318848     0 3497  4693  0  2 72 26
  0  8      0 8005380   4808  41092    0    0 318592     0 3500  4646  0  2 75 23
  0  8      0 8005380   4808  41092    0    0 318592     0 3495  4828  0  2 61 37
  0  8      0 8005380   4808  41092    0    0 318848     0 3499  4827  0  1 62 37
  1  8      0 8005380   4808  41092    0    0 318464     0 3495  4642  0  2 75 23
  0  8      0 8005380   4816  41084    0    0 318848    32 3511  4672  0  1 75 24
  0  8      0 8005380   4816  41084    0    0 320640     0 3512  4877  0  2 75 23
  0  8      0 8005380   4816  41084    0    0 322944     0 3533  5047  0  2 75 24
  0  8      0 8005380   4816  41084    0    0 322816     0 3531  5053  0  1 75 24
  0  8      0 8005380   4816  41084    0    0 322944     0 3531  5048  0  2 75 23
  0  8      0 8005380   4816  41084    0    0 322944     0 3529  5043  0  1 75 24
  0  0      0 8008360   4816  41084    0    0 266880     0 3112  4224  0  2 78 20
  0  0      0 8008360   4816  41084    0    0     0     0 1012    28  0  0 100  0

Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 15:16                 ` jmerkey
@ 2005-08-31 16:58                   ` Tom Callahan
  2005-08-31 15:47                     ` jmerkey
  2005-08-31 17:11                   ` Jens Axboe
  1 sibling, 1 reply; 42+ messages in thread
From: Tom Callahan @ 2005-08-31 16:58 UTC (permalink / raw)
  To: jmerkey
  Cc: Jens Axboe, Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel

From linux-kernel mailing list.....

Don't do this. BLKDEV_MIN_RQ sets the size of the mempool reserved
requests and will only get slightly used in low memory conditions, so
most memory will probably be wasted.....

Change /sys/block/xxx/queue/nr_requests

Tom Callahan
TESSCO Technologies
(443)-506-6216
callahant@tessco.com



jmerkey wrote:

>I have seen an 80MB/s limitation in the kernel unless this value is 
>changed in the SCSI I/O layer
>for 3Ware and other controllers during testing of 2.6.X series kernels.
>
>Change these values in include/linux/blkdev.h and performance goes from 
>80MB/S to over 670MB/S on the 3Ware controller.
>
>
>//#define BLKDEV_MIN_RQ    4
>//#define BLKDEV_MAX_RQ    128    /* Default maximum */
>#define BLKDEV_MIN_RQ    4096
>#define BLKDEV_MAX_RQ    8192    /* Default maximum */
>
>Jeff
>
>
>
>Jens Axboe wrote:
>
[snip]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 15:16                 ` jmerkey
  2005-08-31 16:58                   ` Tom Callahan
@ 2005-08-31 17:11                   ` Jens Axboe
  2005-08-31 15:59                     ` jmerkey
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:11 UTC (permalink / raw)
  To: jmerkey; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, jmerkey wrote:
> 
> 
> I have seen an 80MB/s limitation in the kernel unless this value is 
> changed in the SCSI I/O layer
> for 3Ware and other controllers during testing of 2.6.X series kernels.
> 
> Change these values in include/linux/blkdev.h and performance goes from 
> 80MB/S to over 670MB/S on the 3Ware controller.
> 
> 
> //#define BLKDEV_MIN_RQ    4
> //#define BLKDEV_MAX_RQ    128    /* Default maximum */
> #define BLKDEV_MIN_RQ    4096
> #define BLKDEV_MAX_RQ    8192    /* Default maximum */

That's insane, you just wasted 1MiB of preallocated requests on each
queue in the system!

Please just do

# echo 512 > /sys/block/dev/queue/nr_requests

after boot for each device you want to increase the queue size too. 512
should be enough with the 3ware.
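
For several devices at once, e.g. (a sketch with a shell glob):

  for q in /sys/block/sd*/queue/nr_requests; do
          echo 512 > $q
  done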

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:25             ` Holger Kiehl
@ 2005-08-31 17:25               ` Nick Piggin
  2005-08-31 21:57                 ` Holger Kiehl
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2005-08-31 17:25 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

Holger Kiehl wrote:

> meminfo.dump:
> 
>    MemTotal:      8124172 kB
>    MemFree:         23564 kB
>    Buffers:       7825944 kB
>    Cached:          19216 kB
>    SwapCached:          0 kB
>    Active:          25708 kB
>    Inactive:      7835548 kB
>    HighTotal:           0 kB
>    HighFree:            0 kB
>    LowTotal:      8124172 kB
>    LowFree:         23564 kB
>    SwapTotal:    15631160 kB
>    SwapFree:     15631160 kB
>    Dirty:         3145604 kB

Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
so maybe I've just led you on a goose chase.

You could
     echo 5 > /proc/sys/vm/dirty_background_ratio
     echo 10 > /proc/sys/vm/dirty_ratio

To further reduce dirty memory in the system, however this is
a long shot, so please continue your interaction with the
other people in the thread first.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 15:59                     ` jmerkey
@ 2005-08-31 17:32                       ` Jens Axboe
  0 siblings, 0 replies; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:32 UTC (permalink / raw)
  To: jmerkey; +Cc: Holger Kiehl, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, jmerkey wrote:
> 
> 512 is not enough. It has to be larger. I just tried 512 and it still 
> limits the data rates.

Please don't top post.

512 wasn't the point, setting it properly is the point. If you need more
than 512, go ahead. This isn't Holger's problem, though, the reading
would be much faster if it was. If the fusion is using a large queue
depth, increasing nr_requests would likely help the writes (but not to
the extent of where it would suddenly be as fast as it should).
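
(A quick way to see what queue depth the driver negotiated, assuming the
sysfs attribute is exported on this kernel:

  grep . /sys/block/sd*/device/queue_depth

one line per disk.)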

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:51                 ` Holger Kiehl
@ 2005-08-31 17:35                   ` Jens Axboe
  2005-08-31 19:00                     ` Holger Kiehl
  2005-08-31 18:06                   ` Michael Tokarev
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2005-08-31 17:35 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, Aug 31 2005, Holger Kiehl wrote:
> ># ./oread /dev/sdX
> >
> >and it will read 128k chunks direct from that device. Run on the same
> >drives as above, reply with the vmstat info again.
> >
> Using kernel 2.6.12.5 again, here are the results:

[snip]

Ok, reads as expected, like the buffered io but using less system time.
And you are still 1/3 off the target data rate, hmmm...

With the reads, how does the aggregate bandwidth look when you add
'clients'? Same as with writes, gradually decreasing per-device
throughput?
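
One way to see the scaling (bash sketch, reusing the oread binary from
before):

  for set in "sdc1" "sdc1 sdd1" "sdc1 sdd1 sde1 sdf1" \
             "sdc1 sdd1 sde1 sdf1 sdg1 sdh1 sdi1 sdj1"; do
          for d in $set; do ./oread /dev/$d & done
          time wait
  done

Each oread reads 1GiB per drive, so dividing by the elapsed time of each
pass gives the aggregate rate as clients are added.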

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 16:51                 ` Holger Kiehl
  2005-08-31 17:35                   ` Jens Axboe
@ 2005-08-31 18:06                   ` Michael Tokarev
  2005-08-31 18:52                     ` Ming Zhang
  1 sibling, 1 reply; 42+ messages in thread
From: Michael Tokarev @ 2005-08-31 18:06 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

Holger Kiehl wrote:
> On Wed, 31 Aug 2005, Jens Axboe wrote:
> 
>> On Wed, Aug 31 2005, Holger Kiehl wrote:
>>
[]
>>> I used the following command reading from all 8 disks in parallel:
>>>
>>>    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
>>>
>>> Here vmstat output (I just cut something out in the middle):
>>>
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>>>  3  7   4348  42640 7799984   9612    0    0 322816     0 3532  4987 0 22 0 78
>>>  1  7   4348  42136 7800624   9584    0    0 322176     0 3526  4987 0 23 4 74
>>>  0  8   4348  39912 7802648   9668    0    0 322176     0 3525  4955 0 22 12 66
>>
>> Ok, so that's somewhat better than the writes but still off from what
>> the individual drives can do in total.
>>
>>>> You might want to try the same with direct io, just to eliminate the
>>>> costly user copy. I don't expect it to make much of a difference though,
>>>> feels like the problem is elsewhere (driver, most likely).
>>>>
>>> Sorry, I don't know how to do this. Do you mean using a C program
>>> that sets some flag to do direct io, or how can I do that?
>>
>> I've attached a little sample for you, just run ala
>>
>> # ./oread /dev/sdX
>>
>> and it will read 128k chunks direct from that device. Run on the same
>> drives as above, reply with the vmstat info again.
>>
> Using kernel 2.6.12.5 again, here are the results:
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  0  8      0 8005380   4816  41084    0    0 318848    32 3511  4672  0 1 75 24
>  0  8      0 8005380   4816  41084    0    0 320640     0 3512  4877  0 2 75 23
>  0  8      0 8005380   4816  41084    0    0 322944     0 3533  5047  0 2 75 24
>  0  8      0 8005380   4816  41084    0    0 322816     0 3531  5053  0 1 75 24
>  0  8      0 8005380   4816  41084    0    0 322944     0 3531  5048  0 2 75 23
>  0  8      0 8005380   4816  41084    0    0 322944     0 3529  5043  0 1 75 24
>  0  0      0 8008360   4816  41084    0    0 266880     0 3112  4224  0 2 78 20

I went on and did similar tests on our box, which is:

 dual Xeon 2.44GHz with HT (so it's like 4 logical CPUs)
 dual-channel AIC-7902 U320 controller
 8 SEAGATE ST336607LW drives attached to the 2 channels of the
  controller, sd[abcd] to channel 0 and sd[efgh] to channel 1

Each drive is capable of about 60 megabytes/sec.
The kernel is 2.6.13 from kernel.org.

With direct-reading:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  8     12     87    471   1839    0    0 455296    84 1936  3739  0  3 47 50
 1  7     12     87    471   1839    0    0 456704    80 1941  3744  0  4 48 48
 1  7     12     87    471   1839    0    0 446464    82 1914  3648  0  2 48 50
 0  8     12     87    471   1839    0    0 454016    94 1944  3765  0  2 47 50
 0  8     12     87    471   1839    0    0 458752    60 1944  3746  0  2 48 50

Without direct:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 8  6     12     80    470   1839    0    0 359966   124 1726  2270  1 89  0 10
 2  7     12     80    470   1839    0    0 352813   113 1741  2124  1 88  1 10
 8  4     12     80    471   1839    0    0 358990    34 1669  1934  1 94  0  5
 7  5     12     79    471   1839    0    0 354065   157 1761  2128  1 90  1  8
 6  5     12     80    471   1839    0    0 358062    44 1686  1911  1 93  0  6

So the difference between direct and "indirect" reading is quite
significant.  And with direct reading, all 8 drives are up to their real
speed.  Note the CPU usage in the "indirect" case too - it's about 90%...

And here's an idle system as well:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0     12     89    471   1839    0    0     0    58  151   358  0  0 100  0
 0  0     12     89    471   1839    0    0     0    66  157   167  0  0 99  0

Too bad I can't perform write tests on this system...

/mjt

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 18:06                   ` Michael Tokarev
@ 2005-08-31 18:52                     ` Ming Zhang
  2005-08-31 18:57                       ` Ming Zhang
  0 siblings, 1 reply; 42+ messages in thread
From: Ming Zhang @ 2005-08-31 18:52 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Holger Kiehl, Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2478 bytes --]

join the party. ;)

8 400GB SATA disks on the same Marvell 8 port PCIX-133 card. P4 CPU.
Supermicro SCT board.

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6]
[raid10] [faulty]
md0 : active raid0 sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
      3125690368 blocks 64k chunks

8 DISK RAID0 from same slot and card. Stripe size is 512KB.

run oread

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1      0 533216 330424  11004    0    0  7128  1610 1069    77  0  2 95  3
 1  0      0 298464 560828  11004    0    0 230404     0 2595  1389  1 23  0 76
 0  1      0  64736 792248  11004    0    0 231420     0 2648  1342  0 26  0 74
 1  0      0   8948 848416   9696    0    0 229376     0 2638  1337  0 29  0 71
 0  0      0 868896    768   9696    0    0 29696    48 1224   162  0 19 73  8

# time ./oread /dev/md0

real    0m6.595s
user    0m0.004s
sys     0m0.151s

run dd

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  2      0 854008   2932  17108    0    0  7355  1606 1071    80  0  2 95  3
 0  2      0 848888   3112  21388    0    0 164332     0 2985  3564  2  7  0 91
 0  2      0 844024   3260  25664    0    0 164040     0 2990  3665  1  7  0 92
 0  2      0 840328   3380  28920    0    0 164272     0 2932  3791  1  9  0 90
 0  2      0 836360   3500  32232    0    0 163688   100 3001  5045  2  7  0 91
 0  2      0 831432   3644  36612    0    0 164120   568 2977  3843  0  9  0 91
 0  1      0 826056   3752  41688    0    0  7872     0 1267  1474  1  3  0 96

# time dd if=/dev/md0 of=/dev/null bs=131072 count=8192
8192+0 records in
8192+0 records out

real    0m4.771s
user    0m0.005s
sys     0m0.973s

So the reasonable part here is that, because of O_DIRECT, the sys time
is reduced a lot.

But the total time is longer! The reason I found is...

I attached a new oread.c which allows setting the block size of each read
and the total read count, so I can read a full stripe at a time:

# time ./oread /dev/md0 524288 2048

real    0m4.950s
user    0m0.000s
sys     0m0.131s

compared to 

# time ./oread /dev/md0 131072 8192

real    0m6.633s
user    0m0.002s
sys     0m0.191s
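
(Both runs read 2048 * 512KiB = 8192 * 128KiB = 1GiB in total, so that is
roughly 1024MiB / 4.95s ~= 207MiB/s with full-stripe reads versus
1024MiB / 6.633s ~= 154MiB/s with 128KiB reads.)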


But still, I get linear speed up to 4 disks, then no speed gain when
adding more disks to the RAID.

Ming


[-- Attachment #2: oread.c --]
[-- Type: text/x-csrc, Size: 673 bytes --]

#include <stdio.h>
#include <stdlib.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN(buf)	(char *) (((unsigned long) (buf) + 4095) & ~(4095))

int main(int argc, char *argv[])
{
	char *p;
	int fd, i;
	int BS, BLOCKS;

	if (argc < 4) {
		printf("%s: <dev> bs cnt\n", argv[0]);
		return 1;
	}

	BS = atoi(argv[2]);
	BLOCKS = atoi(argv[3]);
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd == -1) {
		perror("open");
		return 1;
	}

	p = ALIGN(malloc(BS + 4095));
	for (i = 0; i < BLOCKS; i++) {
		int r = read(fd, p, BS);

		if (r == BS)
			continue;
		else {
			if (r == -1)
				perror("read");

			break;
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 18:52                     ` Ming Zhang
@ 2005-08-31 18:57                       ` Ming Zhang
  0 siblings, 0 replies; 42+ messages in thread
From: Ming Zhang @ 2005-08-31 18:57 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Holger Kiehl, Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

forgot to attach the lspci output.

it is a 133MHz PCI-X card but it only runs at 66MHz.

quick question: where can I check whether it is running at 64 bit?

66MHz * 32 bit / 8 * 80% bus utilization ~= 211MB/s, which matches the upper
speed I am seeing now...
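
For what it's worth, here is a tiny stand-alone sketch of that arithmetic for
the other width/clock combinations, using the same 80% utilization factor
(which is only a rough guess, not a number from the PCI-X spec):

#include <stdio.h>

/* Theoretical bus throughput = clock * bus width in bytes * utilization.
 * The 0.80 factor is the rough utilization guess used above. */
int main(void)
{
	static const int clocks_mhz[] = { 66, 100, 133 };
	static const int width_bits[] = { 32, 64 };
	int c, w;

	for (w = 0; w < 2; w++)
		for (c = 0; c < 3; c++)
			printf("%3d MHz x %2d bit ~= %4.0f MB/s\n",
			       clocks_mhz[c], width_bits[w],
			       clocks_mhz[c] * (width_bits[w] / 8) * 0.80);
	return 0;
}

So if the card really were running 64 bit at 66MHz, the same estimate would be
around 422MB/s rather than 211MB/s.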

Ming


02:01.0 SCSI storage controller: Marvell MV88SX5081 8-port SATA I PCI-X
Controller (rev 03)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 128, Cache Line Size 08
        Interrupt: pin A routed to IRQ 24
        Region 0: Memory at fa000000 (64-bit, non-prefetchable)
        Capabilities: [40] Power Management version 2
                Flags: PMEClk+ DSI- D1- D2- AuxCurrent=0mA PME
(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: 64bit+
Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [60] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=3
                Status: Bus=2 Dev=1 Func=0 64bit+ 133MHz+ SCD- USC-,
DC=simple, DMMRBC=0, DMOST=3, DMCRS=0, RSCEM-


On Wed, 2005-08-31 at 14:52 -0400, Ming Zhang wrote:
> join the party. ;)
> 
> 8 400GB SATA disks on the same Marvell 8-port PCI-X 133 card. P4 CPU.
> Supermicro SCT board.
> 
> # cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid5] [multipath] [raid6]
> [raid10] [faulty]
> md0 : active raid0 sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda
> [0]
>       3125690368 blocks 64k chunks
> 
> 8-disk RAID0 from the same slot and card. The full stripe size is 512KB.
> 
> run oread
> 
> # vmstat 1
> procs -----------memory---------- ---swap-- -----io---- --system-- ----
> cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy
> id wa
>  1  1      0 533216 330424  11004    0    0  7128  1610 1069    77  0  2
> 95  3
>  1  0      0 298464 560828  11004    0    0 230404     0 2595  1389  1
> 23  0 76
>  0  1      0  64736 792248  11004    0    0 231420     0 2648  1342  0
> 26  0 74
>  1  0      0   8948 848416   9696    0    0 229376     0 2638  1337  0
> 29  0 71
>  0  0      0 868896    768   9696    0    0 29696    48 1224   162  0 19
> 73  8
> 
> # time ./oread /dev/md0
> 
> real    0m6.595s
> user    0m0.004s
> sys     0m0.151s
> 
> run dd
> 
> # vmstat 1
> procs -----------memory---------- ---swap-- -----io---- --system-- ----
> cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy
> id wa
>  2  2      0 854008   2932  17108    0    0  7355  1606 1071    80  0  2
> 95  3
>  0  2      0 848888   3112  21388    0    0 164332     0 2985  3564  2
> 7  0 91
>  0  2      0 844024   3260  25664    0    0 164040     0 2990  3665  1
> 7  0 92
>  0  2      0 840328   3380  28920    0    0 164272     0 2932  3791  1
> 9  0 90
>  0  2      0 836360   3500  32232    0    0 163688   100 3001  5045  2
> 7  0 91
>  0  2      0 831432   3644  36612    0    0 164120   568 2977  3843  0
> 9  0 91
>  0  1      0 826056   3752  41688    0    0  7872     0 1267  1474  1  3
> 0 96
> 
> # time dd if=/dev/md0 of=/dev/null bs=131072 count=8192
> 8192+0 records in
> 8192+0 records out
> 
> real    0m4.771s
> user    0m0.005s
> sys     0m0.973s
> 
> so the part that makes sense here is that, because of O_DIRECT, the sys
> time is reduced a lot.
> 
> but the elapsed time is longer! the reason i found is...
> 
> i attached a new oread.c which allows setting the block size of each read
> and the total read count, so i can read a full stripe at a time:
> 
> # time ./oread /dev/md0 524288 2048
> 
> real    0m4.950s
> user    0m0.000s
> sys     0m0.131s
> 
> compared to 
> 
> # time ./oread /dev/md0 131072 8192
> 
> real    0m6.633s
> user    0m0.002s
> sys     0m0.191s
> 
> 
> but still, I only get linear scaling up to 4 disks, then no speed gain when
> adding more disks to the RAID.
> 
> Ming
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 17:35                   ` Jens Axboe
@ 2005-08-31 19:00                     ` Holger Kiehl
  0 siblings, 0 replies; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 19:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Jens Axboe wrote:

> On Wed, Aug 31 2005, Holger Kiehl wrote:
>>> # ./oread /dev/sdX
>>>
>>> and it will read 128k chunks direct from that device. Run on the same
>>> drives as above, reply with the vmstat info again.
>>>
>> Using kernel 2.6.12.5 again, here the results:
>
> [snip]
>
> Ok, reads as expected, like the buffered io but using less system time.
> And you are still 1/3 off the target data rate, hmmm...
>
> With the reads, how does the aggregate bandwidth look when you add
> 'clients'? Same as with writes, gradually decreasing per-device
> throughput?
>
I performed the following tests with this command:

    dd if=/dev/sd?1 of=/dev/null bs=256k count=78125
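
(78125 x 256KiB is 20,480,000,000 bytes, roughly 19GiB per disk, well over
the 8GB of RAM, so the page cache should not inflate these numbers.)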

Single disk tests:

    /dev/sdc1 74.954715 MB/s
    /dev/sdg1 74.973417 MB/s

Following disks in parallel:

    2 disks on same channel
    /dev/sdc1 75.034191 MB/s
    /dev/sdd1 74.984643 MB/s

    3 disks on same channel
    /dev/sdc1 75.027850 MB/s
    /dev/sdd1 74.976583 MB/s
    /dev/sde1 75.278276 MB/s

    4 disks on same channel
    /dev/sdc1 58.343166 MB/s
    /dev/sdd1 62.993059 MB/s
    /dev/sde1 66.940569 MB/s
    /dev/sdf1 70.986072 MB/s

    2 disks on different channels
    /dev/sdc1 74.954715 MB/s
    /dev/sdg1 74.973417 MB/s

    4 disks on different channels
    /dev/sdc1 74.959030 MB/s
    /dev/sdd1 74.877703 MB/s
    /dev/sdg1 75.009697 MB/s
    /dev/sdh1 75.028138 MB/s

    6 disks on different channels
    /dev/sdc1 49.640743 MB/s
    /dev/sdd1 55.935419 MB/s
    /dev/sde1 58.795241 MB/s
    /dev/sdg1 50.280864 MB/s
    /dev/sdh1 54.210705 MB/s
    /dev/sdi1 59.413176 MB/s

So this looks different from writing; only as of four disks does the
performance begin to drop.

I just noticed: did you want me to do these tests with the oread program?

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 14:24               ` Dr. David Alan Gilbert
@ 2005-08-31 20:56                 ` Holger Kiehl
  2005-08-31 21:16                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 20:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: linux-raid, linux-kernel

On Wed, 31 Aug 2005, Dr. David Alan Gilbert wrote:

> * Holger Kiehl (Holger.Kiehl@dwd.de) wrote:
>> On Wed, 31 Aug 2005, Jens Axboe wrote:
>>
>> Full vmstat session can be found under:
>
> Have you got iostat?  iostat -x 10 might be interesting to see
> for a period while it is going.
>
The following is the result from all 8 disks at the same time with the command
dd if=/dev/sd?1 of=/dev/null bs=256k count=78125

There is however one difference, here I had set
/sys/block/sd?/queue/nr_requests to 4096.

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.10    0.00   21.85   58.55   19.50

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.00  0.00  0.30    0.00    2.40     0.00     1.20     8.00     0.00    1.00   1.00   0.03
sdb          0.70   0.00  0.10  0.30    6.40    2.40     3.20     1.20    22.00     0.00    4.25   4.25   0.17
sdc        8276.90   0.00 267.10  0.00 68352.00    0.00 34176.00     0.00   255.90     1.95    7.29   3.74 100.02
sdd        9098.50   0.00 293.50  0.00 75136.00    0.00 37568.00     0.00   256.00     1.93    6.59   3.41 100.03
sde        10428.40   0.00 336.40  0.00 86118.40    0.00 43059.20     0.00   256.00     1.92    5.71   2.97 100.02
sdf        11314.90   0.00 365.10  0.00 93440.00    0.00 46720.00     0.00   255.93     1.92    5.26   2.74  99.98
sdg        7973.20   0.00 257.20  0.00 65843.20    0.00 32921.60     0.00   256.00     1.94    7.53   3.89 100.01
sdh        9436.30   0.00 304.70  0.00 77928.00    0.00 38964.00     0.00   255.75     1.93    6.35   3.28 100.01
sdi        10604.80   0.00 342.40  0.00 87577.60    0.00 43788.80     0.00   255.78     1.92    5.62   2.92 100.02
sdj        10914.30   0.00 352.20  0.00 90132.80    0.00 45066.40     0.00   255.91     1.91    5.43   2.84 100.00
md0          0.00   0.00  0.00  0.10    0.00    0.80     0.00     0.40     8.00     0.00    0.00   0.00   0.00
md2          0.00   0.00  0.80  0.00    6.40    0.00     3.20     0.00     8.00     0.00    0.00   0.00   0.00
md1          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.07    0.00   24.49   66.81    8.62

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.40  0.00  1.00    0.00   11.20     0.00     5.60    11.20     0.00    1.30   0.50   0.05
sdb          0.00   0.40  0.00  1.00    0.00   11.20     0.00     5.60    11.20     0.00    1.50   0.70   0.07
sdc        8161.90   0.00 263.70  0.00 67404.80    0.00 33702.40     0.00   255.61     1.95    7.38   3.79 100.02
sdd        9157.30   0.00 295.50  0.00 75622.40    0.00 37811.20     0.00   255.91     1.93    6.53   3.38 100.00
sde        10505.60   0.00 339.20  0.00 86758.40    0.00 43379.20     0.00   255.77     1.93    5.68   2.95  99.99
sdf        11212.50   0.00 361.90  0.00 92595.20    0.00 46297.60     0.00   255.86     1.91    5.28   2.76 100.00
sdg        7988.40   0.00 258.00  0.00 65971.20    0.00 32985.60     0.00   255.70     1.93    7.49   3.88  99.98
sdh        9436.20   0.00 304.40  0.00 77924.80    0.00 38962.40     0.00   255.99     1.92    6.32   3.28  99.99
sdi        10406.10   0.00 336.30  0.00 85939.20    0.00 42969.60     0.00   255.54     1.92    5.70   2.97 100.00
sdj        11027.00   0.00 356.00  0.00 91064.00    0.00 45532.00     0.00   255.80     1.92    5.40   2.81  99.96
md0          0.00   0.00  0.00  1.00    0.00    8.00     0.00     4.00     8.00     0.00    0.00   0.00   0.00
md2          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice    %sys %iowait   %idle
            0.08    0.00   22.23   60.44   17.25

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          0.00   0.00  0.00  0.30    0.00    2.40     0.00     1.20     8.00     0.00    1.00   1.00   0.03
sdb          0.00   0.00  0.00  0.30    0.00    2.40     0.00     1.20     8.00     0.00    0.67   0.67   0.02
sdc        8204.50   0.00 264.76  0.00 67754.15    0.00 33877.08     0.00   255.90     1.95    7.38   3.78 100.12
sdd        9166.47   0.00 295.90  0.00 75698.10    0.00 37849.05     0.00   255.83     1.94    6.55   3.38 100.12
sde        10534.93   0.00 339.94  0.00 86999.00    0.00 43499.50     0.00   255.92     1.93    5.67   2.95 100.12
sdf        11282.68   0.00 364.16  0.00 93174.77    0.00 46587.39     0.00   255.86     1.92    5.28   2.75 100.10
sdg        8114.61   0.00 261.76  0.00 67011.01    0.00 33505.51     0.00   256.00     1.95    7.44   3.82 100.11
sdh        9380.68   0.00 302.60  0.00 77466.27    0.00 38733.13     0.00   256.00     1.93    6.38   3.31 100.10
sdi        10507.01   0.00 339.04  0.00 86768.37    0.00 43384.18     0.00   255.92     1.93    5.69   2.95 100.12
sdj        10969.27   0.00 354.15  0.00 90586.59    0.00 45293.29     0.00   255.78     1.92    5.42   2.83 100.11
md0          0.00   0.00  0.00  0.10    0.00    0.80     0.00     0.40     8.00     0.00    0.00   0.00   0.00
md2          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

The full output can be found at:

    ftp://ftp.dwd.de/pub/afd/linux_kernel_debug/iostat-read-256k

Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 20:56                 ` Holger Kiehl
@ 2005-08-31 21:16                   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2005-08-31 21:16 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid, linux-kernel

* Holger Kiehl (Holger.Kiehl@dwd.de) wrote:

> There is however one difference, here I had set
> /sys/block/sd?/queue/nr_requests to 4096.

Well, from that it looks like none of the requests get above 255 sectors
(hmm, that's a round number: 256 sectors would be exactly 128k....)

> avg-cpu:  %user   %nice    %sys %iowait   %idle
>            0.10    0.00   21.85   58.55   19.50

Fair amount of system time.

> Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s 
> avgrq-sz avgqu-sz   await  svctm  %util

> sdf        11314.90   0.00 365.10  0.00 93440.00    0.00 46720.00     0.00  
> 255.93     1.92    5.26   2.74  99.98
> sdg        7973.20   0.00 257.20  0.00 65843.20    0.00 32921.60     0.00   
> 256.00     1.94    7.53   3.89 100.01

There seems to be quite a spread of read performance across the drives
(pretty consistent across the run); what makes sdg so much slower than
sdf (they seem to be the slowest and fastest drives respectively)?
I guess if every drive ran at sdf's speed you would be pretty happy.

If you physically swap f and g does the performance follow the drive
or the letter?

Dave
--
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 17:25               ` Nick Piggin
@ 2005-08-31 21:57                 ` Holger Kiehl
  2005-09-01  9:12                   ` Holger Kiehl
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-08-31 21:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

On Thu, 1 Sep 2005, Nick Piggin wrote:

> Holger Kiehl wrote:
>
>> meminfo.dump:
>> 
>>    MemTotal:      8124172 kB
>>    MemFree:         23564 kB
>>    Buffers:       7825944 kB
>>    Cached:          19216 kB
>>    SwapCached:          0 kB
>>    Active:          25708 kB
>>    Inactive:      7835548 kB
>>    HighTotal:           0 kB
>>    HighFree:            0 kB
>>    LowTotal:      8124172 kB
>>    LowFree:         23564 kB
>>    SwapTotal:    15631160 kB
>>    SwapFree:     15631160 kB
>>    Dirty:         3145604 kB
>
> Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
> so maybe I've just led you on a goose chase.
>
> You could
>    echo 5 > /proc/sys/vm/dirty_background_ratio
>    echo 10 > /proc/sys/vm/dirty_ratio
>
> To further reduce dirty memory in the system, however this is
> a long shot, so please continue your interaction with the
> other people in the thread first.
>
Yes, this does make a difference. Here are the results of running

   dd if=/dev/full of=/dev/sd?1 bs=4M count=4883

on 8 disks at the same time:

   34.273340
   33.938829
   33.598469
   32.970575
   32.841351
   32.723988
   31.559880
   29.778112

That's 32.710568 MB/s on average per disk with your change; without it,
it was 24.958557 MB/s on average per disk.
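
(For reference, dirty_ratio=40 on 8124172kB of memory allows roughly
3250000kB of dirty pages, and the meminfo dump above showed Dirty at
3145604kB, so the writers really were being throttled right around the
dirty_ratio limit before the change.)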

I will do more tests tomorrow.

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-31 21:57                 ` Holger Kiehl
@ 2005-09-01  9:12                   ` Holger Kiehl
  2005-09-02 14:28                     ` Al Boldi
  0 siblings, 1 reply; 42+ messages in thread
From: Holger Kiehl @ 2005-09-01  9:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel

On Wed, 31 Aug 2005, Holger Kiehl wrote:

> On Thu, 1 Sep 2005, Nick Piggin wrote:
>
>> Holger Kiehl wrote:
>> 
>>> meminfo.dump:
>>> 
>>>    MemTotal:      8124172 kB
>>>    MemFree:         23564 kB
>>>    Buffers:       7825944 kB
>>>    Cached:          19216 kB
>>>    SwapCached:          0 kB
>>>    Active:          25708 kB
>>>    Inactive:      7835548 kB
>>>    HighTotal:           0 kB
>>>    HighFree:            0 kB
>>>    LowTotal:      8124172 kB
>>>    LowFree:         23564 kB
>>>    SwapTotal:    15631160 kB
>>>    SwapFree:     15631160 kB
>>>    Dirty:         3145604 kB
>> 
>> Hmm OK, dirty memory is pinned pretty much exactly on dirty_ratio
>> so maybe I've just led you on a goose chase.
>> 
>> You could
>>    echo 5 > /proc/sys/vm/dirty_background_ratio
>>    echo 10 > /proc/sys/vm/dirty_ratio
>> 
>> To further reduce dirty memory in the system, however this is
>> a long shot, so please continue your interaction with the
>> other people in the thread first.
>> 
> Yes, this does make a difference, here the results of running
>
>  dd if=/dev/full of=/dev/sd?1 bs=4M count=4883
>
> on 8 disks at the same time:
>
>  34.273340
>  33.938829
>  33.598469
>  32.970575
>  32.841351
>  32.723988
>  31.559880
>  29.778112
>
> That's 32.710568 MB/s on average per disk with your change and without
> it it was 24.958557 MB/s on average per disk.
>
> I will do more tests tomorrow.
>
Just rechecked those numbers. Did a fresh boot and ran the test several
times. With the defaults (dirty_background_ratio=10, dirty_ratio=40) I get
an average of 24.559491 MB/s per disk for the dd write tests (8 disks in
parallel). With the suggested values (dirty_background_ratio=5,
dirty_ratio=10) I get 32.390659 MB/s per disk.

I then did a SW raid0 over all disks with the following command:

   mdadm -C /dev/md3 -l0 -n8 /dev/sd[cdefghij]1

   (dirty_background_ratio=10, dirty_ratio=40) 223.955995 MB/s
   (dirty_background_ratio=5, dirty_ratio=10)  234.318936 MB/s

So the difference is not so big anymore.

Something else I noticed while doing the dd over 8 disks is the following
(top output just before the dd's finish):

top - 08:39:11 up  2:03,  2 users,  load average: 23.01, 21.48, 15.64
Tasks: 102 total,   2 running, 100 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us, 17.7% sy,  0.0% ni,  0.0% id, 78.9% wa,  0.2% hi,  3.1% si
Mem:   8124184k total,  8093068k used,    31116k free,  7831348k buffers
Swap: 15631160k total,    13352k used, 15617808k free,     5524k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  3423 root      18   0 55204  460  392 R 12.0  0.0   1:15.55 dd
  3421 root      18   0 55204  464  392 D 11.3  0.0   1:17.36 dd
  3418 root      18   0 55204  464  392 D 10.3  0.0   1:10.92 dd
  3416 root      18   0 55200  464  392 D 10.0  0.0   1:09.20 dd
  3420 root      18   0 55204  464  392 D 10.0  0.0   1:10.49 dd
  3422 root      18   0 55200  460  392 D  9.3  0.0   1:13.58 dd
  3417 root      18   0 55204  460  392 D  7.6  0.0   1:13.11 dd
   158 root      15   0     0    0    0 D  1.3  0.0   1:12.61 kswapd3
   159 root      15   0     0    0    0 D  1.3  0.0   1:08.75 kswapd2
   160 root      15   0     0    0    0 D  1.0  0.0   1:07.11 kswapd1
  3419 root      18   0 51096  552  476 D  1.0  0.0   1:17.15 dd
   161 root      15   0     0    0    0 D  0.7  0.0   0:54.46 kswapd0
     1 root      16   0  4876  372  332 S  0.0  0.0   0:01.15 init
     2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
     3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
     4 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
     5 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
     6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
     7 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
     8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
     9 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3

A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd working
so hard? Is that correct?

Please just tell me if there is anything else I can test or dumps that
could be useful.

Thanks,
Holger


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-09-01  9:12                   ` Holger Kiehl
@ 2005-09-02 14:28                     ` Al Boldi
  0 siblings, 0 replies; 42+ messages in thread
From: Al Boldi @ 2005-09-02 14:28 UTC (permalink / raw)
  To: Holger Kiehl
  Cc: Jens Axboe, Vojtech Pavlik, linux-raid, linux-kernel, Nick Piggin

Holger Kiehl wrote:
> top - 08:39:11 up  2:03,  2 users,  load average: 23.01, 21.48, 15.64
> Tasks: 102 total,   2 running, 100 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0% us, 17.7% sy,  0.0% ni,  0.0% id, 78.9% wa,  0.2% hi,  3.1%
> si Mem:   8124184k total,  8093068k used,    31116k free,  7831348k
> buffers Swap: 15631160k total,    13352k used, 15617808k free,     5524k
> cached
>
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   3423 root      18   0 55204  460  392 R 12.0  0.0   1:15.55 dd
>   3421 root      18   0 55204  464  392 D 11.3  0.0   1:17.36 dd
>   3418 root      18   0 55204  464  392 D 10.3  0.0   1:10.92 dd
>   3416 root      18   0 55200  464  392 D 10.0  0.0   1:09.20 dd
>   3420 root      18   0 55204  464  392 D 10.0  0.0   1:10.49 dd
>   3422 root      18   0 55200  460  392 D  9.3  0.0   1:13.58 dd
>   3417 root      18   0 55204  460  392 D  7.6  0.0   1:13.11 dd
>    158 root      15   0     0    0    0 D  1.3  0.0   1:12.61 kswapd3
>    159 root      15   0     0    0    0 D  1.3  0.0   1:08.75 kswapd2
>    160 root      15   0     0    0    0 D  1.0  0.0   1:07.11 kswapd1
>   3419 root      18   0 51096  552  476 D  1.0  0.0   1:17.15 dd
>    161 root      15   0     0    0    0 D  0.7  0.0   0:54.46 kswapd0
>
> A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd
> working so hard? Is that correct?

Actually, kswapd is another problem (see the "Kswapd Flaw" thread), which has
little impact on your problem. Basically, kswapd tries very hard, maybe even
too hard, to fulfil a request for memory: when the buffer/cache pages are full,
kswapd tries to find some more unused memory, and when it finds none it starts
recycling the buffer/cache pages. Which is OK, but it only does this after
searching for swappable memory, which wastes CPU cycles.

This can be tuned a little, but not much, by adjusting /sys(proc)/.../vm/...,
or by renicing kswapd to the lowest priority, which may cause other problems.

Things get really bad when procs start asking for more memory than is
available, causing kswapd to take the liberty of paging out running procs in
the hope that these procs won't come back later. So when they do come back,
something like a wild goose chase begins. This is also known as OverCommit.

This is closely related to the dreaded OOM-killer, which occurs when the
system cannot satisfy a memory request for a returning proc, causing the VM
to start killing in an unpredictable manner.

Turning OverCommit off should solve this problem, but it doesn't.

This is why it is recommended to always run the system with swap enabled, even
if you have tons of memory, which really only pushes the problem out of the
way until you hit the dead end and the wild goose chase begins again.

Sadly, 2.6.13 did not fix this either.

Although this description only vaguely defines the problem from an end-user
pov, the actual semantics may be quite different.

--
Al


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-08-30 23:05     ` Guy
@ 2005-09-28 20:04       ` Bill Davidsen
  2005-09-30  4:52         ` Guy
  0 siblings, 1 reply; 42+ messages in thread
From: Bill Davidsen @ 2005-09-28 20:04 UTC (permalink / raw)
  To: Guy
  Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
	'linux-kernel'

Guy wrote:

>In most of your results, your CPU usage is very high.  Once you get to about
>90% usage, you really can't do much else, unless you can improve the CPU
>usage.
>
That seems to be one of the problems with software RAID: the calculations are 
done in the CPU and not in dedicated hardware. As you move to top-end drive 
hardware, the CPU gets to be a limit. I don't remember off the top of my head 
how threaded this code is, and whether more CPUs will help.

I see you are using RAID-1 for your system stuff; did one of the tests 
use RAID-0 over all the drives? Mirroring or XOR redundancy helps 
stability but hurts performance. Was the 270MB/s with RAID-0 or ???

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: Where is the performance bottleneck?
  2005-09-28 20:04       ` Bill Davidsen
@ 2005-09-30  4:52         ` Guy
  2005-09-30  5:19           ` dean gaudet
  2005-10-06 21:15           ` Bill Davidsen
  0 siblings, 2 replies; 42+ messages in thread
From: Guy @ 2005-09-30  4:52 UTC (permalink / raw)
  To: 'Bill Davidsen'
  Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
	'linux-kernel'



> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Bill Davidsen
> Sent: Wednesday, September 28, 2005 4:05 PM
> To: Guy
> Cc: 'Holger Kiehl'; 'Mark Hahn'; 'linux-raid'; 'linux-kernel'
> Subject: Re: Where is the performance bottleneck?
> 
> Guy wrote:
> 
> >In most of your results, your CPU usage is very high.  Once you get to
> about
> >90% usage, you really can't do much else, unless you can improve the CPU
> >usage.
> >
> That seems one of the problems with software RAID, the calculations are
> done in the CPU and not dedicated hardware. As you move to the top end
> drive hardware the CPU gets to be a limit. I don't remember off the top
> of my head how threaded this code is, and if more CPUs will help.

My old 500MHz P3 can xor at 1GB/sec.  I don't think the RAID5 logic is the
issue!  Also, I have not seen hardware that fast!  Or even half as fast.
But I must admit, I have not seen a hardware RAID5 in a few years.  :(

   8regs     :   918.000 MB/sec
   32regs    :   469.600 MB/sec
   pIII_sse  :   994.800 MB/sec
   pII_mmx   :  1102.400 MB/sec
   p5_mmx    :  1152.800 MB/sec
raid5: using function: pIII_sse (994.800 MB/sec)

Humm..  It did not select the fastest?

Guy
> 
> I see you are using RAID-1 for your system stuff, did one of the tests
> use RAID-0 over all the drives? Mirroring or XOR redundancy help
> stability but hurt performance. Was the 270MB/s with RAID-0 or ???
> 
> --
> bill davidsen <davidsen@tmr.com>
>   CTO TMR Associates, Inc
>   Doing interesting things with small computers since 1979
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: Where is the performance bottleneck?
  2005-09-30  4:52         ` Guy
@ 2005-09-30  5:19           ` dean gaudet
  2005-10-06 21:15           ` Bill Davidsen
  1 sibling, 0 replies; 42+ messages in thread
From: dean gaudet @ 2005-09-30  5:19 UTC (permalink / raw)
  To: Guy
  Cc: 'Bill Davidsen', 'Holger Kiehl',
	'Mark Hahn', 'linux-raid', 'linux-kernel'

On Fri, 30 Sep 2005, Guy wrote:

> My old 500MHz P3 can xor at 1GB/sec.  I don't think the RAID5 logic is the
> issue!  Also, I have not seen hardware that fast!  Or even half as fast.
> But I must admit, I have not seen a hardware RAID5 in a few years.  :(
> 
>    8regs     :   918.000 MB/sec
>    32regs    :   469.600 MB/sec
>    pIII_sse  :   994.800 MB/sec
>    pII_mmx   :  1102.400 MB/sec
>    p5_mmx    :  1152.800 MB/sec
> raid5: using function: pIII_sse (994.800 MB/sec)

those are cache based timings... an old 500mhz p3 probably has pc100 
memory and main memory can't even go that fast.  in fact i've got one of 
those here and it's lucky to get 600MB/s out of memory.
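(for reference, PC100 SDRAM is a 64-bit bus at 100MHz, i.e. 800MB/s
theoretical peak, so ~600MB/s of real streaming bandwidth is about what
you'd expect.)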

in fact, to compare sw raid to a hw raid you should count every byte of 
i/o somewhere between 2 and 3 times.  this is because every line you read 
into cache might knock out a dirty line, but it's definitely going to 
replace something which would still be there on a hw raid.  (i.e. it 
decreases the cache effectiveness and you end up paying later after the sw 
raid xor to read data back in which wouldn't leave the cache on a hw 
raid.)

then add in the read/write traffic required on the parity block (which as 
a fraction of i/o is worse with fewer drives) ... and it's pretty crazy to 
believe that sw raid is "free" just because the kernel prints those 
fantastic numbers at boot :)


> Humm..  It did not select the fastest?

this is related to what i'm describing -- iirc the pIII_sse code uses a 
non-temporal store and/or prefetchnta to reduce memory traffic.
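
for illustration, here is a minimal user-space sketch of that idea using SSE2
intrinsics (compile with gcc -msse2) -- a toy under stated assumptions
(16-byte-aligned buffers, length a multiple of 16), not the kernel's actual
pIII_sse assembly:

#include <stdio.h>
#include <stddef.h>
#include <emmintrin.h>	/* SSE2 intrinsics */

/* xor src into dst, 16 bytes at a time, writing the result with a
 * non-temporal (streaming) store so the output does not displace other
 * cache lines -- the same kind of trick used to cut down cache traffic. */
static void xor_stream(unsigned char *dst, const unsigned char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i += 16) {
		__m128i a = _mm_load_si128((const __m128i *)(dst + i));
		__m128i b = _mm_load_si128((const __m128i *)(src + i));

		_mm_stream_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
	}
	_mm_sfence();	/* make the streaming stores globally visible */
}

int main(void)
{
	static unsigned char a[64] __attribute__((aligned(16)));
	static unsigned char b[64] __attribute__((aligned(16)));
	size_t i;

	for (i = 0; i < 64; i++) {
		a[i] = (unsigned char) i;
		b[i] = 0xff;
	}
	xor_stream(a, b, sizeof(a));
	printf("a[0] after xor = 0x%02x\n", a[0]);	/* 0x00 ^ 0xff = 0xff */
	return 0;
}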

-dean

p.s. i use sw raid regardless, i just don't like seeing these misleading 
discussions pointing at the kernel raid timings and saying "hw offload is 
pointless!"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Where is the performance bottleneck?
  2005-09-30  4:52         ` Guy
  2005-09-30  5:19           ` dean gaudet
@ 2005-10-06 21:15           ` Bill Davidsen
  1 sibling, 0 replies; 42+ messages in thread
From: Bill Davidsen @ 2005-10-06 21:15 UTC (permalink / raw)
  To: Guy
  Cc: 'Holger Kiehl', 'Mark Hahn', 'linux-raid',
	'linux-kernel'

Guy wrote:

>  
>
>>-----Original Message-----
>>From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>>owner@vger.kernel.org] On Behalf Of Bill Davidsen
>>Sent: Wednesday, September 28, 2005 4:05 PM
>>To: Guy
>>Cc: 'Holger Kiehl'; 'Mark Hahn'; 'linux-raid'; 'linux-kernel'
>>Subject: Re: Where is the performance bottleneck?
>>
>>Guy wrote:
>>
>>    
>>
>>>In most of your results, your CPU usage is very high.  Once you get to
>>>      
>>>
>>about
>>    
>>
>>>90% usage, you really can't do much else, unless you can improve the CPU
>>>usage.
>>>
>>>      
>>>
>>That seems one of the problems with software RAID, the calculations are
>>done in the CPU and not dedicated hardware. As you move to the top end
>>drive hardware the CPU gets to be a limit. I don't remember off the top
>>of my head how threaded this code is, and if more CPUs will help.
>>    
>>
>
>My old 500MHz P3 can xor at 1GB/sec.  I don't think the RAID5 logic is the
>issue!  Also, I have not seen hardware that fast!  Or even half as fast.
>But I must admit, I have not seen a hardware RAID5 in a few years.  :(
>
>   8regs     :   918.000 MB/sec
>   32regs    :   469.600 MB/sec
>   pIII_sse  :   994.800 MB/sec
>   pII_mmx   :  1102.400 MB/sec
>   p5_mmx    :  1152.800 MB/sec
>raid5: using function: pIII_sse (994.800 MB/sec)
>
>Humm..  It did not select the fastest?
>
Maybe. There was discussion on this previously, but the decision was 
made to use sse when available because it is nicer to the cache, or uses 
fewer registers, or similar. In any case, fewer undesirable side effects.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2005-10-06 21:12 UTC | newest]

Thread overview: 42+ messages
2005-08-29 18:20 Where is the performance bottleneck? Holger Kiehl
2005-08-29 19:54 ` Mark Hahn
2005-08-30 19:08   ` Holger Kiehl
2005-08-30 23:05     ` Guy
2005-09-28 20:04       ` Bill Davidsen
2005-09-30  4:52         ` Guy
2005-09-30  5:19           ` dean gaudet
2005-10-06 21:15           ` Bill Davidsen
2005-08-29 20:10 ` Al Boldi
2005-08-30 19:18   ` Holger Kiehl
2005-08-31 10:30     ` Al Boldi
2005-08-29 20:25 ` Vojtech Pavlik
2005-08-30 20:06   ` Holger Kiehl
2005-08-31  7:11     ` Vojtech Pavlik
2005-08-31  7:26       ` Jens Axboe
2005-08-31 11:54         ` Holger Kiehl
2005-08-31 12:07           ` Jens Axboe
2005-08-31 13:55             ` Holger Kiehl
2005-08-31 14:24               ` Dr. David Alan Gilbert
2005-08-31 20:56                 ` Holger Kiehl
2005-08-31 21:16                   ` Dr. David Alan Gilbert
2005-08-31 16:20               ` Jens Axboe
2005-08-31 15:16                 ` jmerkey
2005-08-31 16:58                   ` Tom Callahan
2005-08-31 15:47                     ` jmerkey
2005-08-31 17:11                   ` Jens Axboe
2005-08-31 15:59                     ` jmerkey
2005-08-31 17:32                       ` Jens Axboe
2005-08-31 16:51                 ` Holger Kiehl
2005-08-31 17:35                   ` Jens Axboe
2005-08-31 19:00                     ` Holger Kiehl
2005-08-31 18:06                   ` Michael Tokarev
2005-08-31 18:52                     ` Ming Zhang
2005-08-31 18:57                       ` Ming Zhang
2005-08-31 12:24           ` Nick Piggin
2005-08-31 16:25             ` Holger Kiehl
2005-08-31 17:25               ` Nick Piggin
2005-08-31 21:57                 ` Holger Kiehl
2005-09-01  9:12                   ` Holger Kiehl
2005-09-02 14:28                     ` Al Boldi
2005-08-31 13:38       ` Holger Kiehl
2005-08-29 23:09 ` Peter Chubb
