* Buffered write slowness
@ 2004-10-26  1:14 Jesse Barnes
  2004-10-29 17:46 ` Buffered I/O slowness Jesse Barnes
  0 siblings, 1 reply; 7+ messages in thread
From: Jesse Barnes @ 2004-10-26  1:14 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2140 bytes --]

I've been doing some simple disk I/O benchmarking with an eye towards 
improving large, striped volume bandwidth.  I ran some tests on individual 
disks and filesystems to establish a baseline and found that things generally 
scale quite well:

o one thread/disk using O_DIRECT on the block device
  read avg: 2784.81 MB/s
  write avg: 2585.60 MB/s

o one thread/disk using O_DIRECT + filesystem
  read avg: 2635.98 MB/s
  write avg: 2573.39 MB/s

o one thread/disk using buffered I/O + filesystem
  read w/default (128) block/*/queue/read_ahead_kb avg: 2626.25 MB/s
  read w/max (4096) block/*/queue/read_ahead_kb avg: 2652.62 MB/s
  write avg: 1394.99 MB/s

Configuration:
  o 8p sn2 ia64 box
  o 8GB memory
  o 58 disks across 16 controllers
    (4 disks for 10 of them and 3 for the other 6)
  o aggregate I/O bw available is about 2.8GB/s

Test:
  o one I/O thread per disk, round-robined across the 8 CPUs
  o each thread did ~450MB of I/O depending on the test (ran for 10s)
    Note: the total was > 8GB so in the buffered read case not everything
    could be cached
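
For the curious, each worker thread was doing roughly the following (a
minimal sketch of one O_DIRECT per-disk reader, not the actual harness;
the request size and alignment below are placeholders):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define REQ_SIZE    (1 << 20)   /* 1MB per request; placeholder */
#define RUN_SECONDS 10

int main(int argc, char **argv)
{
	long long total = 0;
	time_t start, elapsed;
	void *buf;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}

	/* O_DIRECT requires a sector-aligned buffer */
	if (posix_memalign(&buf, 4096, REQ_SIZE))
		return 1;

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	start = time(NULL);
	while (time(NULL) - start < RUN_SECONDS) {
		ssize_t n = read(fd, buf, REQ_SIZE);
		if (n <= 0)
			break;
		total += n;
	}
	elapsed = time(NULL) - start;
	if (elapsed < 1)
		elapsed = 1;

	printf("read avg: %.2f MB/s\n",
	       total / (double)elapsed / (1 << 20));
	close(fd);
	return 0;
}

One of these ran per disk; the write variant just uses O_WRONLY|O_DIRECT
and write() instead, and the buffered runs drop the O_DIRECT flag.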

As you can see, for a test that does one thread/disk, things look really
good (very close to the available bandwidth in the system), with the
exception of buffered writes.  I've attached the vmstat and profile output
from that run in case anyone's interested.  There seems to have been some
spinlock contention in that run that wasn't present in other runs.

Preliminary runs on a large volume showed that a single thread reading from a 
striped volume w/O_DIRECT performed poorly, while a single thread writing to 
a volume the same way was able to get slightly over 1GB/s.  Using multiple 
read threads against the volume increased the bandwidth to near 1GB/s, but 
multiple threads writing slightly slowed performance.  My tests and the 
system configuration have changed slightly though, so don't put much stock in 
these numbers until I rerun them (and collect profiles and such).

Thanks,
Jesse

P.S. The 'dev-fs' in the filenames doesn't mean I was using devfs (I wasn't, 
not that it should matter), just that I was running per-dev tests with a 
filesystem. :)

[-- Attachment #2: profile-buffered-write-dev-fs.txt --]
[-- Type: text/plain, Size: 1711 bytes --]

598157 total                                      0.1052
132002 _spin_unlock_irq                         2062.5312
 87219 ia64_pal_call_static                     454.2656
 72515 default_idle                             161.8638
 60019 __copy_user                               25.3459
 37898 ia64_spinlock_contention                 394.7708
 32167 _spin_unlock_irqrestore                  335.0729
 15351 kmem_cache_free                           39.9766
 11585 smp_call_function                         10.3438
 10047 kmem_cache_alloc                          39.2461
 10007 ia64_save_scratch_fpregs                 156.3594
  9994 ia64_load_scratch_fpregs                 156.1562
  7125 bio_put                                   24.7396
  6540 __end_that_request_first                   5.5236
  6064 shrink_list                                1.3345
  5829 buffered_rmqueue                           3.1406
  5137 mempool_alloc                              4.8646
  4986 set_bh_page                               17.3125
  4261 bio_alloc                                  3.0967
  4069 end_bio_bh_io_sync                         9.0826
  3906 submit_bh                                  4.3594
  3653 wake_up_page                              28.5391
  3607 drop_buffers                               6.6305
  3381 __might_sleep                              5.5609
  3335 free_hot_cold_page                         3.2568
  3157 writeback_inodes                           3.5234
  3105 __alloc_pages                              1.2937
  2533 submit_bio                                 3.0445
  2486 __block_prepare_write                      0.9033
  2335 mark_buffer_async_write                   18.2422

[-- Attachment #3: vmstat-buffered-write-dev-fs.txt --]
[-- Type: text/plain, Size: 6879 bytes --]

[root@junkbond ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0   2544 230240   1040 6981472    0    0  4342 10272    7    77  0  4 93  3
 0  0   2544 231776   1040 6981472    0    0    24 16515 8529   143  0  0 100  0
 0  0   2544 230880   1040 6981472    0    0    64 276273 10068    96  0  4 96  0
63  0   2544  14176   1040 7196128    0    0    24 303435 13281   216  0 12 88  0
62  1   2544   8736   1104 7187808    0    0    72 755830 51021  5716  0 100  0  0
63  2   2544   5728    656 7182064    0    0     4 1330749 79954 10109  0 100  0  0
61 17   2544   9056    672 7107744    0    0   104 2905708 68129  8461  0 100  0  0
58 23   2544   6368    672 7052016    0    0    64 2432964 66482  6400  0 100  0  0
60 25   2544  14816    848 7025008    0    0   428 1755180 68765  8282  0 100  0  0
57 20   2544   5856    448 7021280    0    0    72 1216548 63333  7905  0 100  0  0
58 17   2544  10272    448 7006832    0    0    32 1097956 61771  8779  0 100  0  0
57 10   2544   8224    624 7010784    0    0   464 1083460 60803  6123  0 100  0  0
62 14   2544   5792    464 7015072    0    0   100 1005260 63960  5773  0 100  0  0
 5  1   2544  14912    752 7000336    0    0   620 1060772 60624  5679  0 98  1  0
 0  1   2544  14784    976 6998048    0    0   856 63972 16941  3954  1 16 70 13
 0  0   2544  14720    976 6998048    0    0    16  4920 14865  1285  0  9 82  9
 0  1   2544  14784    976 6998048    0    0     8 23728 13974    40  0  7 93  0
 0  1   2544  15040    976 6998048    0    0     0 18432 14097   847  0  8 81 11
63  1   2544   8640   1040 7008304    0    0    64 140656 39666  2799  0 40 52  8
59  1   2544  11328   1008 7004208    0    0    80 915952 64727  8821  0 100  0  0
62  7   2544   9952    448 7015088    0    0    64 2327688 76873  9117  0 100  0  0
60 12   2544   6688    448 7013024    0    0    32 2457992 66550  8099  0 100  0  0
61 15   2544   8032    448 7010960    0    0    80 1851480 66455  7185  0 100  0  0
60 18   2544   6144    448 6994448    0    0    80 1488552 70833  8758  0 100  0  0
60 13   2544   7456    448 6984128    0    0    32 1207904 65113  7116  0 100  0  0
58 10   2544   5664    448 6982064    0    0    32 985696 64674  7964  0 100  0  0
60 21   2544   7840    448 6984128    0    0    64 976540 61481  7015  0 100  0  0
59 15   2544   8736    448 6982064    0    0    88 869564 60383  6613  0 100  0  0
61 11   2544   7616    736 6981776    0    0   592 1183640 63801  7145  0 100  0  0
 4  1   2544  85376    896 6971296    0    0   924 193617 22851  4451  1 27 62 11
 0  1   2544  11904    896 6971296    0    0    12     0 14137  1325  0 10 79 11
 0  1   2544  11840    896 6971296    0    0     0     6 13729   446  0  7 82 11
 0  1   2544  11968    896 6971296    0    0     0     2 13719   952  0  7 82 11
 0  1   2544  12160    960 6971232    0    0     8    69 12409   858  0  6 82 12
57  1   2544  13312    960 6971232    0    0    96 686610 58996  5417  0 97  3  0
58  1   2544  11008    448 6980000    0    0   104 1973020 81572 12784  0 100  0  0
60 19   2544   9632    448 6977936    0    0    88 2302016 74034  9208  0 100  0  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
62 20   2544  11872    448 6984128    0    0    48 1914352 71746  8483  0 100  0  0
56 22   2544   5472    448 6984128    0    0    56 1413780 68497  7051  0 100  0  0
60 12   2544  10912    448 6980000    0    0   112 1076532 67341  7422  0 100  0  0
59 19   2544   5984    448 6984128    0    0    56 1129136 69899  7353  0 100  0  0
55 19   2544   9696    448 6980000    0    0    48 1113548 72296  6531  0 100  0  0
65 15   2544   7584    448 6984128    0    0    32 1028156 66183  8963  0 100  0  0
62 26   2544   9728    784 6977600    0    0   604 997204 69320 10154  0 100  0  0
 0  4   2544  29376    864 6963072    0    0   836 49804 17486  5250  1 28 31 40
 0  1   2544  29376    864 6963072    0    0     0 24856 15374   952  0 10 76 14
 0  1   2544  29376    864 6963072    0    0     0     0 14715  1532  0  9 80 11
 0  1   2544  29312    864 6963072    0    0     8 18856 13580   735  0  8 85  7
64  2   2544   8768    928 6981584    0    0   104 422060 54309  2831  0 68 28  4
56  1   2544   6592    576 6986064    0    0    48 1364520 75387  8135  0 100  0  0
63 24   2544   7776    448 6990320    0    0    56 2284556 66785  9020  0 100  0  0
57 33   2544   5536    448 6988256    0    0    80 1774088 62152  9677  0 100  0  0
58 19   2544   8992    448 6988256    0    0    40 1671936 70438  7714  0 100  0  0
58 15   2544   7520    448 6984128    0    0    48 1333652 69927  6837  0 100  0  0
56 10   2544   9824    448 6988256    0    0    16 1095912 73715  7797  0 100  0  0
58  7   2544   7008    448 6982064    0    0    56 1167724 67655  6401  0 100  0  0
57 10   2544   8480    448 6986192    0    0     8 915948 69804  6384  0 100  0  0
56 11   2544   7392    448 6986192    0    0    32 998204 69488  6417  0 100  0  0
59 13   2544   7296    528 6988176    0    0   196 1001173 68362  6354  0 100  0  0
 6  2   2544  14528    784 6977600    0    0   440 277287 25760  5077  0 35 44 21
 1  1   2544  13184    944 6965056    0    0   844  6335 15959  3621  1 16 70 13
 0  1   2544  13120    944 6965056    0    0     0     2 13599   833  0  7 81 11
 0  1   2544  13120    944 6965056    0    0     0    11 13051  1347  0  7 82 11
 0  1   2544  13056   1008 6964992    0    0     8    80 11969   858  0  5 83 12
59  1   2544   9984    992 6971200    0    0   128 874215 69268  7155  0 83 14  2
61  2   2544   7712    480 6975840    0    0    40 1739593 73008 12486  0 100  0  0
61 18   2544   5536    448 6977936    0    0   128 2176972 67952  9666  0 100  0  0
59 19   2544   5792    448 6973808    0    0    64 2008752 68552  7674  0 100  0  0
57 24   2544   8288    448 6973808    0    0    80 1561368 71155  8361  0 100  0  0
63 14   2544  11744    448 6971744    0    0    96 1349620 71476  7671  0 100  0  0
60 17   2544   5856    448 6967616    0    0    64 958588 66934  5707  0 100  0  0
56 15   2544   6560    448 6967616    0    0   112 1061172 71689  7726  0 100  0  0
57  8   2544   6176    448 6969680    0    0    48 948800 70414  7758  0 100  0  0
62 10   2544   8704    704 6969424    0    0   596 959484 66082  8013  0 100  0  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  3   2544  23872    656 6957088    0    0   144 613012 47183  8732  0 72 17 11
 0  1   2544 112000    896 6956848    0    0  1220 48532 14407  3578  2 14 58 26
 0  1   2544 112000    896 6956848    0    0     0     0 13582   995  0  8 81 11

* Re: Buffered I/O slowness
  2004-10-26  1:14 Buffered write slowness Jesse Barnes
@ 2004-10-29 17:46 ` Jesse Barnes
  2004-10-29 23:08   ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Jesse Barnes @ 2004-10-29 17:46 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

[-- Attachment #1: Type: text/plain, Size: 3172 bytes --]

On Monday, October 25, 2004 6:14 pm, Jesse Barnes wrote:
> I've been doing some simple disk I/O benchmarking with an eye towards
> improving large, striped volume bandwidth.  I ran some tests on individual
> disks and filesystems to establish a baseline and found that things
> generally scale quite well:
>
> o one thread/disk using O_DIRECT on the block device
>   read avg: 2784.81 MB/s
>   write avg: 2585.60 MB/s
>
> o one thread/disk using O_DIRECT + filesystem
>   read avg: 2635.98 MB/s
>   write avg: 2573.39 MB/s
>
> o one thread/disk using buffered I/O + filesystem
>   read w/default (128) block/*/queue/read_ahead_kb avg: 2626.25 MB/s
>   read w/max (4096) block/*/queue/read_ahead_kb avg: 2652.62 MB/s
>   write avg: 1394.99 MB/s
>
> Configuration:
>   o 8p sn2 ia64 box
>   o 8GB memory
>   o 58 disks across 16 controllers
>     (4 disks for 10 of them and 3 for the other 6)
>   o aggregate I/O bw available is about 2.8GB/s
>
> Test:
>   o one I/O thread per disk, round-robined across the 8 CPUs
>   o each thread did ~450MB of I/O depending on the test (ran for 10s)
>     Note: the total was > 8GB so in the buffered read case not everything
>     could be cached

More results here.  I've run some tests on a large dm striped volume formatted 
with XFS.  It had 64 disks with a 64k stripe unit (XFS was made aware of this 
at format time), and I explicitly set the readahead using blockdev to 524288 
blocks.  The results aren't as bad as my previous runs, but I think they're
still much slower than they ought to be, given the direct I/O results above.
This is after a fresh mount, so the pagecache was empty when I started the 
tests.
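
(For reference, blockdev --setra is just a wrapper around the BLKRASET
ioctl, so the same setting can be applied programmatically.  A rough
sketch, taking the device as an argument; the units are 512-byte
sectors, so 524288 works out to 256MB:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKRAGET, BLKRASET */

int main(int argc, char **argv)
{
	long ra;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKRAGET, &ra) == 0)
		printf("old readahead: %ld sectors\n", ra);
	/* 524288 512-byte sectors = 256MB of readahead */
	if (ioctl(fd, BLKRASET, 524288UL) < 0)
		perror("BLKRASET");
	close(fd);
	return 0;
}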

o one thread on one large volume using buffered I/O + filesystem
  read (1 thread, one volume, 131072 blocks/request) avg: ~931 MB/s
  write (1 thread, one volume, 131072 blocks/request) avg: ~908 MB/s

I'm intentionally issuing very large reads and writes here to take advantage 
of the striping, but it looks like both the readahead and regular buffered 
I/O code will split the I/O into page-sized chunks?  The call chain is pretty 
long, but it looks to me like do_generic_mapping_read() will split the reads 
up by page and issue them independently to the lower levels.  In the direct 
I/O case, up to 64 pages are issued at a time, which seems like it would help 
throughput quite a bit.  The profile seems to confirm this.  Unfortunately I 
didn't save the vmstat output for this run (and now the fc switch is 
misbehaving so I have to fix that before I run again), but iirc the system 
time was pretty high given that only one thread was issuing I/O.

So maybe a few things need to be done:
  o set readahead to larger values by default for dm volumes at setup time
    (the default was very small)
  o maybe bypass readahead for very large requests?
    if the process is doing a huge request, chances are that readahead won't
    benefit it as much as a process doing small requests
  o not sure about writes yet, I haven't looked at that call chain much yet

Does any of this sound reasonable at all?  What else could be done to make the 
buffered I/O layer friendlier to large requests?

Thanks,
Jesse

[-- Attachment #2: vol-buffered-read-profile.txt --]
[-- Type: text/plain, Size: 1710 bytes --]

115383 total                                      0.0203
 49642 ia64_pal_call_static                     258.5521
 42065 default_idle                              93.8951
  7348 __copy_user                                3.1030
  5865 ia64_save_scratch_fpregs                  91.6406
  5766 ia64_load_scratch_fpregs                  90.0938
  1944 _spin_unlock_irq                          30.3750
   352 _spin_unlock_irqrestore                    3.6667
   231 buffered_rmqueue                           0.1245
   225 kmem_cache_free                            0.5859
   151 mpage_end_io_read                          0.2776
   147 __end_that_request_first                   0.1242
   133 bio_alloc                                  0.0967
   122 smp_call_function                          0.1089
   102 shrink_list                                0.0224
    99 unlock_page                                0.4420
    86 free_hot_cold_page                         0.0840
    82 kmem_cache_alloc                           0.3203
    65 __alloc_pages                              0.0271
    53 do_mpage_readpage                          0.0224
    53 bio_clone                                  0.1380
    49 __might_sleep                              0.0806
    44 mpage_readpages                            0.0598
    43 generic_make_request                       0.0345
    42 sn_pci_unmap_sg                            0.1010
    42 sn_dma_flush                               0.0597
    41 clear_page                                 0.2562
    40 file_read_actor                            0.0431
    34 mark_page_accessed                         0.0966
    32 __bio_add_page                             0.0278

* Re: Buffered I/O slowness
  2004-10-29 17:46 ` Buffered I/O slowness Jesse Barnes
@ 2004-10-29 23:08   ` Andrew Morton
  2004-10-30  0:16     ` Jesse Barnes
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-10-29 23:08 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-kernel

Jesse Barnes <jbarnes@engr.sgi.com> wrote:
>
> ...
> o one thread on one large volume using buffered I/O + filesystem
>   read (1 thread, one volume, 131072 blocks/request) avg: ~931 MB/s
>   write (1 thread, one volume, 131072 blocks/request) avg: ~908 MB/s
> 
> I'm intentionally issuing very large reads and writes here to take advantage 
> of the striping, but it looks like both the readahead and regular buffered 
> I/O code will split the I/O into page-sized chunks?

No, the readahead code will assemble single BIOs up to the size of the
readahead window.  So the single-page-reads in do_generic_mapping_read()
should never happen, because the pages are in cache from the readahead.

>  The call chain is pretty 
> long, but it looks to me like do_generic_mapping_read() will split the reads 
> up by page and issue them independently to the lower levels.  In the direct 
> I/O case, up to 64 pages are issued at a time, which seems like it would help 
> throughput quite a bit.  The profile seems to confirm this.  Unfortunately I 
> didn't save the vmstat output for this run (and now the fc switch is 
> misbehaving so I have to fix that before I run again), but iirc the system 
> time was pretty high given that only one thread was issuing I/O.
> 
> So maybe a few things need to be done:
>   o set readahead to larger values by default for dm volumes at setup time
>     (the default was very small)

Well possibly.  dm has control of queue->backing_dev_info and is free to
tune the queue's default readahead.
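
(i.e. something along these lines when dm sets up the queue.  This is a
hand-wavy sketch, not actual dm code: ra_pages is the per-queue readahead
window in pages, and "stripes" and "chunk_pages" are made-up names for
the volume geometry.)

	/* sketch: scale the default readahead with the stripe geometry
	 * instead of leaving the single-disk default in place */
	q->backing_dev_info.ra_pages =
		max(q->backing_dev_info.ra_pages,
		    (unsigned long)stripes * chunk_pages);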

>   o maybe bypass readahead for very large requests?
>     if the process is doing a huge request, chances are that readahead won't
>     benefit it as much as a process doing small requests

Maybe - but bear in mind that this is all pinned memory when the I/O is in
flight, so some upper bound has to remain.

>   o not sure about writes yet, I haven't looked at that call chain much yet
> 
> Does any of this sound reasonable at all?  What else could be done to make the 
> buffered I/O layer friendlier to large requests?

I'm not sure that we know what's going on yet.  I certainly don't.  The
above numbers look good, so what's the problem???

Suggest you get geared up to monitor the BIOs going into submit_bio(). 
Look at their bi_sector and bi_size.  Make sure that buffered I/O is doing
the right thing.
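
(E.g. a ratelimited printk at the top of submit_bio() would do.  A quick
debugging sketch against the 2.6-era fields named above, not a polished
patch, and assuming printk_ratelimit() is in this tree; a simple counter
works just as well otherwise:)

	/* sketch: drop into submit_bio(int rw, struct bio *bio) in
	 * drivers/block/ll_rw_blk.c to see what the upper layers send down */
	if (printk_ratelimit())
		printk(KERN_DEBUG "submit_bio: %s sector %llu size %u\n",
		       (rw & WRITE) ? "W" : "R",
		       (unsigned long long)bio->bi_sector, bio->bi_size);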

* Re: Buffered I/O slowness
  2004-10-29 23:08   ` Andrew Morton
@ 2004-10-30  0:16     ` Jesse Barnes
  2004-10-30  0:30       ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Jesse Barnes @ 2004-10-30  0:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, jeremy

On Friday, October 29, 2004 4:08 pm, Andrew Morton wrote:
> > I'm intentionally issuing very large reads and writes here to take
> > advantage of the striping, but it looks like both the readahead and
> > regular buffered I/O code will split the I/O into page-sized chunks?
>
> No, the readahead code will assemble single BIOs up to the size of the
> readahead window.  So the single-page-reads in do_generic_mapping_read()
> should never happen, because the pages are in cache from the readahead.

Yeah, I realized that after I sent the message.  The readahead looks like it 
might be ok.

> > So maybe a few things need to be done:
> >   o set readahead to larger values by default for dm volumes at setup
> > time (the default was very small)
>
> Well possibly.  dm has control of queue->backing_dev_info and is free to
> tune the queue's default readahead.

Yep, I'll give that a try and see if I can come up with a reasonable default 
(something more than the stripe unit seems like a start).

> >   o maybe bypass readahead for very large requests?
> >     if the process is doing a huge request, chances are that readahead
> > won't benefit it as much as a process doing small requests
>
> Maybe - but bear in mind that this is all pinned memory when the I/O is in
> flight, so some upper bound has to remain.

Right, for the direct I/O case, it looks like things are limited to 64 pages 
at a time.

>
> >   o not sure about writes yet, I haven't looked at that call chain much
> > yet
> >
> > Does any of this sound reasonable at all?  What else could be done to
> > make the buffered I/O layer friendlier to large requests?
>
> I'm not sure that we know what's going on yet.  I certainly don't.  The
> above numbers look good, so what's the problem???

The numbers are ~1/3 of what the machine is capable of with direct I/O.  That 
seems much lower to me than it should be.  Cache-cold reads into 
the page cache seem like they should be nearly as fast as direct reads (at 
least on a CPU where the extra data copying overhead isn't getting in the 
way).

> Suggest you get geared up to monitor the BIOs going into submit_bio().
> Look at their bi_sector and bi_size.  Make sure that buffered I/O is doing
> the right thing.

Ok, I'll give that a try.

Thanks,
Jesse

* Re: Buffered I/O slowness
  2004-10-30  0:16     ` Jesse Barnes
@ 2004-10-30  0:30       ` Andrew Morton
  2004-11-01 18:26         ` Jesse Barnes
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-10-30  0:30 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: linux-kernel, jeremy

Jesse Barnes <jbarnes@engr.sgi.com> wrote:
>
> > I'm not sure that we know what's going on yet.  I certainly don't.  The
> > above numbers look good, so what's the problem???
> 
> The numbers are ~1/3 of what the machine is capable of with direct I/O.

Are there CPU cycles to spare?  If you have just one CPU copying 1GB/sec
out of pagecache, maybe it is pegged?
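
(One quick way to check that ceiling independently of the I/O path is a
crude single-CPU copy benchmark; the buffer size and pass count below are
arbitrary, and it needs -lrt for clock_gettime():)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SZ     (64 << 20)       /* 64MB working set, arbitrary */
#define PASSES 16

int main(void)
{
	char *src = malloc(SZ), *dst = malloc(SZ);
	struct timespec t0, t1;
	double secs;
	int i;

	if (!src || !dst)
		return 1;
	memset(src, 1, SZ);     /* fault the pages in first */
	memset(dst, 0, SZ);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < PASSES; i++)
		memcpy(dst, src, SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("copy bandwidth: %.0f MB/s\n",
	       (double)PASSES * SZ / (1 << 20) / secs);
	free(src);
	free(dst);
	return 0;
}

If a single CPU can't memcpy much more than ~1GB/sec here, buffered reads
will be copy-bound no matter what the block layer does.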

* Re: Buffered I/O slowness
  2004-10-30  0:30       ` Andrew Morton
@ 2004-11-01 18:26         ` Jesse Barnes
  2004-11-01 18:34           ` Jesse Barnes
  0 siblings, 1 reply; 7+ messages in thread
From: Jesse Barnes @ 2004-11-01 18:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, jeremy

On Friday, October 29, 2004 5:30 pm, Andrew Morton wrote:
> Jesse Barnes <jbarnes@engr.sgi.com> wrote:
> > > I'm not sure that we know what's going on yet.  I certainly don't.  The
> > > above numbers look good, so what's the problem???
> >
> > The numbers are ~1/3 of what the machine is capable of with direct I/O.
>
> Are there CPU cycles to spare?  If you have just one CPU copying 1GB/sec
> out of pagecache, maybe it is pegged?

Hm, I thought I had more CPU to spare, but when I set the readahead to a large 
value, I'm taking ~100% of the CPU time on the CPU doing the read.  ~98% of 
that is system time.  When I run 8 copies (this is an 8 CPU system), I get 
~4GB/s and all the CPUs are near fully busy.  I guess things aren't as bad as 
I initially thought.

Thanks,
Jesse

* Re: Buffered I/O slowness
  2004-11-01 18:26         ` Jesse Barnes
@ 2004-11-01 18:34           ` Jesse Barnes
  0 siblings, 0 replies; 7+ messages in thread
From: Jesse Barnes @ 2004-11-01 18:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, jeremy

On Monday, November 1, 2004 10:26 am, Jesse Barnes wrote:
> On Friday, October 29, 2004 5:30 pm, Andrew Morton wrote:
> > Jesse Barnes <jbarnes@engr.sgi.com> wrote:
> > > > I'm not sure that we know what's going on yet.  I certainly don't. 
> > > > The above numbers look good, so what's the problem???
> > >
> > > The numbers are ~1/3 of what the machine is capable of with direct I/O.
> >
> > Are there CPU cycles to spare?  If you have just one CPU copying 1GB/sec
> > out of pagecache, maybe it is pegged?
>
> Hm, I thought I had more CPU to spare, but when I set the readahead to a
> large value, I'm taking ~100% of the CPU time on the CPU doing the read. 
> ~98% of that is system time.  When I run 8 copies (this is an 8 CPU
> system), I get ~4GB/s and all the CPUs are near fully busy.  I guess things
> aren't as bad as I initially thought.

OTOH, if I run 8 copies against 8 separate files (the test above was 8 I/O 
threads on the same file), I'm seeing ~16% CPU for each CPU in the machine 
and only about 700 MB/s of I/O throughput, so this case *does* look like a 
problem.  Here's the profile (this is 2.6.10-rc1-mm2).

Jesse

mgr Aggregate throughput: 6241.204239 MB in 10.183594s; 612.868541 MB/s
116885 total                                      0.0162
 50577 ia64_pal_call_static                     263.4219
 42784 default_idle                              95.5000
  6148 ia64_save_scratch_fpregs                  96.0625
  5908 ia64_load_scratch_fpregs                  92.3125
  4738 __copy_user                                2.0008
  2079 _spin_unlock_irq                          12.9938
   926 _spin_unlock_irqrestore                    4.8229
   374 sn_dma_flush                               0.2997
   192 generic_make_request                       0.1250
   177 clone_endio                                0.2634
   149 _read_unlock_irq                           0.9313
   135 dm_table_unplug_all                        0.4688
   128 buffered_rmqueue                           0.0597
   122 mptscsih_io_done                           0.0428
   117 clear_page                                 0.7312
    96 __end_that_request_first                   0.0811
    94 _spin_lock_irqsave                         0.2670
    92 mempool_alloc                              0.0927
    88 handle_IRQ_event                           0.3056
    80 _write_unlock_irq                          0.3571
    80 mpage_end_io_read                          0.1471
    61 kmem_cache_alloc                           0.2383
    59 xfs_iomap                                  0.0181
    59 xfs_bmapi                                  0.0038
    59 do_mpage_readpage                          0.0249
    55 dm_table_any_congested                     0.1719
    53 pcibr_dma_unmap                            0.3312
    51 scsi_io_completion                         0.0228
    47 kmem_cache_free                            0.1224
