linux-kernel.vger.kernel.org archive mirror
* xmm2 - monitor Linux MM active/inactive lists graphically
@ 2001-10-24 10:42 Zlatko Calusic
  2001-10-24 14:26 ` Marcelo Tosatti
  0 siblings, 1 reply; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-24 10:42 UTC (permalink / raw)
  To: linux-mm, linux-kernel

A new version is out and can be found at the same URL:

<URL:http://linux.inet.hr/>

Since Linus' MM dropped the inactive dirty/clean lists in favour of a
single inactive list, the application needed to be modified to support that.

You can still use the older version for kernels <= 2.4.9
and/or Alan's (-ac) kernels, which continue to use Rik's older VM
system.

Enjoy and, as usual, all comments welcome!
-- 
Zlatko

P.S. BTW, 2.4.13 still has very suboptimal writeout performance and
     andrea@suse.de is redirected to /dev/null. <g>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-24 10:42 xmm2 - monitor Linux MM active/inactive lists graphically Zlatko Calusic
@ 2001-10-24 14:26 ` Marcelo Tosatti
  2001-10-25  0:25   ` Zlatko Calusic
  0 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2001-10-24 14:26 UTC (permalink / raw)
  To: Zlatko Calusic, Linus Torvalds; +Cc: linux-mm, lkml



On 24 Oct 2001, Zlatko Calusic wrote:

> P.S. BTW, 2.4.13 still has very unoptimal writeout performance and
>      andrea@suse.de is redirected to /dev/null. <g>

Zlatko,

Could you please show us your case of bad writeout performance ? 

Thanks


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-24 14:26 ` Marcelo Tosatti
@ 2001-10-25  0:25   ` Zlatko Calusic
  2001-10-25  1:50     ` Simon Kirby
  2001-10-25  4:19     ` Linus Torvalds
  0 siblings, 2 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-25  0:25 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, linux-mm, lkml

Marcelo Tosatti <marcelo@conectiva.com.br> writes:

> On 24 Oct 2001, Zlatko Calusic wrote:
> 
> > P.S. BTW, 2.4.13 still has very unoptimal writeout performance and
> >      andrea@suse.de is redirected to /dev/null. <g>
> 
> Zlatko,
> 
> Could you please show us your case of bad writeout performance ? 
> 
> Thanks
> 

Sure. Output of 'vmstat 1' follows:


   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0 254552   5120 183476   0   0    12    24  178   438 2  37  60
 0  1  0      0 137296   5232 297760   0   0     4  5284  195   440 3  43  54
 1  0  0      0 126520   5244 308260   0   0     0 10588  215   230 0   3  96
 0  2  0      0 117488   5252 317064   0   0     0  8796  176   139 1   3  96
 0  2  0      0 107556   5264 326744   0   0     0  9704  174    78 0   3  97
 0  2  0      0  99552   5268 334548   0   0     0  7880  174    67 0   3  97
 0  2  0      0  89448   5280 344392   0   0     0  9804  175    76 0   4  96
 0  1  0      0  79352   5288 354236   0   0     0  9852  176    87 0   5  95
 0  1  0      0  71220   5300 362156   0   0     4  7884  170   120 0   4  96
 0  1  0      0  63088   5308 370084   0   0     0  7936  174    76 0   3  97
 0  2  0      0  52988   5320 379924   0   0     0  9920  175    77 0   4  96
 0  2  0      0  43148   5328 389516   0   0     0  9548  174    97 0   4  95
 0  2  0      0  35144   5336 397316   0   0     0  7820  176    73 0   3  97
 0  2  0      0  25172   5344 407036   0   0     0  9724  188   183 0   4  96
 0  2  1      0  17300   5352 414708   0   0     0  7744  174    78 0   4  96
 0  1  0      0   7068   5360 424684   0   0     0  9920  175    93 0   3  97
 0  1  0      0   3128   4132 430132   0   0     0  9920  174    81 0   4  96

Notice how there's plenty of RAM. I'm writing sequentially to a file
on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!
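
(For the record, the write test itself is nothing fancy - something
equivalent to the little program below, although this is only an
illustrative stand-in, not the exact tool I used:)

/* seqwrite.c - illustrative stand-in for the sequential write test
 * (not the exact tool used): write N megabytes and report MB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define CHUNK (1024 * 1024)		/* write in 1MB chunks */

int main(int argc, char **argv)
{
	int mb = (argc > 2) ? atoi(argv[2]) : 600;
	char *buf = malloc(CHUNK);
	struct timeval t0, t1;
	double secs;
	int fd, i;

	if (argc < 2 || !buf) {
		fprintf(stderr, "usage: %s <file> [megabytes]\n", argv[0]);
		return 1;
	}
	memset(buf, 0, CHUNK);

	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < mb; i++) {
		if (write(fd, buf, CHUNK) != CHUNK) {
			perror("write");
			return 1;
		}
	}
	fsync(fd);			/* make sure the data really hit the disk */
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("Wrote %.2f MB in %.0f seconds -> %.2f MB/s\n",
	       (double)mb, secs, mb / secs);
	close(fd);
	return 0;
}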
-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25  0:25   ` Zlatko Calusic
@ 2001-10-25  1:50     ` Simon Kirby
  2001-10-25  4:19     ` Linus Torvalds
  1 sibling, 0 replies; 42+ messages in thread
From: Simon Kirby @ 2001-10-25  1:50 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Marcelo Tosatti, Linus Torvalds, linux-mm, lkml

On Thu, Oct 25, 2001 at 02:25:45AM +0200, Zlatko Calusic wrote:

> Sure. Output of 'vmstat 1' follows:
>...
>  0  2  0      0  43148   5328 389516   0   0     0  9548  174    97 0   4  95
>  0  2  0      0  35144   5336 397316   0   0     0  7820  176    73 0   3  97
>  0  2  0      0  25172   5344 407036   0   0     0  9724  188   183 0   4  96
>  0  2  1      0  17300   5352 414708   0   0     0  7744  174    78 0   4  96
>...
> Notice how there's planty of RAM. I'm writing sequentially to a file
> on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!

Same here.  But hey, at least it doesn't swap now! :)

Also, dd if=/dev/zero of=blah bs=1024k seems to totally kill everything
else on my box until I ^C it.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[       sim@stormix.com       ][       sim@netnation.com        ]
[ Opinions expressed are not necessarily those of my employers. ]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25  0:25   ` Zlatko Calusic
  2001-10-25  1:50     ` Simon Kirby
@ 2001-10-25  4:19     ` Linus Torvalds
  2001-10-25  4:57       ` Linus Torvalds
  2001-10-25  9:07       ` Zlatko Calusic
  1 sibling, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2001-10-25  4:19 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Marcelo Tosatti, linux-mm, lkml


On 25 Oct 2001, Zlatko Calusic wrote:
>
> Sure. Output of 'vmstat 1' follows:
>
>  1  0  0      0 254552   5120 183476   0   0    12    24  178   438 2  37  60
>  0  1  0      0 137296   5232 297760   0   0     4  5284  195   440 3  43  54
>  1  0  0      0 126520   5244 308260   0   0     0 10588  215   230 0   3  96
>  0  2  0      0 117488   5252 317064   0   0     0  8796  176   139 1   3  96
>  0  2  0      0 107556   5264 326744   0   0     0  9704  174    78 0   3  97

This does not look like a VM issue at all - at this point you're already
getting only 10MB/s, yet the VM isn't even involved (there's definitely no
VM pressure here).

> Notice how there's planty of RAM. I'm writing sequentially to a file
> on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!

Are you sure you haven't lost some DMA setting or something?

		Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25  4:19     ` Linus Torvalds
@ 2001-10-25  4:57       ` Linus Torvalds
  2001-10-25 12:48         ` Zlatko Calusic
  2001-10-25  9:07       ` Zlatko Calusic
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2001-10-25  4:57 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Marcelo Tosatti, linux-mm, lkml


On Wed, 24 Oct 2001, Linus Torvalds wrote:
>
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Sure. Output of 'vmstat 1' follows:
> >
> >  1  0  0      0 254552   5120 183476   0   0    12    24  178   438 2  37  60
> >  0  1  0      0 137296   5232 297760   0   0     4  5284  195   440 3  43  54
> >  1  0  0      0 126520   5244 308260   0   0     0 10588  215   230 0   3  96
> >  0  2  0      0 117488   5252 317064   0   0     0  8796  176   139 1   3  96
> >  0  2  0      0 107556   5264 326744   0   0     0  9704  174    78 0   3  97
>
> This does not look like a VM issue at all - at this point you're already
> getting only 10MB/s, yet the VM isn't even involved (there's definitely no
> VM pressure here).

I wonder if you're getting screwed by bdflush().. You do have a lot of
context switching going on, and you do have a clear pattern: once the
write-out gets going, you're filling new cached pages at about the same
pace that you're writing them out, which definitely means that the dirty
buffer balancing is nice and active.

So the problem is that you're obviously not actually getting the
throughput you should - it's not the VM, as the page cache grows nicely at
the same rate you're writing.

Try something for me: in fs/buffer.c make "balance_dirty_state()" never
return > 0, ie make the "return 1" be a "return 0" instead.

That will cause us to not wake up bdflush at all. Right now, if you're just
on the "border" of 40% dirty buffer usage, you'll have bdflush working in
lock-step with you, alternately writing out buffers and waiting for them.
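
(Roughly, the logic in question looks like the sketch below - this is
paraphrased from memory rather than copied from the 2.4.13 sources, so the
exact names and thresholds may differ; nr_dirty_buffer_pages() in
particular is just a stand-in helper:)

/*
 * Sketch of the dirty-buffer balancing in fs/buffer.c (paraphrased, not
 * the verbatim 2.4.13 code - names and limits are approximate):
 *
 *   return -1: below the soft (~40%) dirty limit, nothing to do
 *   return  0: above the soft limit, caller starts async write-out
 *   return  1: above the hard limit, caller also wakes bdflush and waits
 */
static int balance_dirty_state(void)
{
	/* nr_dirty_buffer_pages() stands in for however the dirty count
	 * is actually obtained */
	unsigned long dirty = nr_dirty_buffer_pages() * 100;
	unsigned long tot = nr_free_buffer_pages();
	unsigned long soft_limit = tot * bdf_prm.b_un.nfract;	/* ~40% */
	unsigned long hard_limit = soft_limit * 2;

	if (dirty > soft_limit) {
		if (dirty > hard_limit)
			return 0;	/* the experiment: this was "return 1" */
		return 0;
	}
	return -1;
}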

Quite frankly, just the act of doing the "write_some_buffers()" in
balance_dirty() should cause us to block much better than the synchronous
waiting anyway, because then we will block when the request queue fills
up, not at random points.

Even so, considering that you have such a steady 9-10MB/s, please double-
check that it's not something even simpler and embarrassing, like just
having forgotten to enable auto-DMA in the kernel config ;)

		Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25  4:19     ` Linus Torvalds
  2001-10-25  4:57       ` Linus Torvalds
@ 2001-10-25  9:07       ` Zlatko Calusic
  1 sibling, 0 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-25  9:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml

Linus Torvalds <torvalds@transmeta.com> writes:

> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Sure. Output of 'vmstat 1' follows:
> >
> >  1  0  0      0 254552   5120 183476   0   0    12    24  178   438 2  37  60
> >  0  1  0      0 137296   5232 297760   0   0     4  5284  195   440 3  43  54
> >  1  0  0      0 126520   5244 308260   0   0     0 10588  215   230 0   3  96
> >  0  2  0      0 117488   5252 317064   0   0     0  8796  176   139 1   3  96
> >  0  2  0      0 107556   5264 326744   0   0     0  9704  174    78 0   3  97
> 
> This does not look like a VM issue at all - at this point you're already
> getting only 10MB/s, yet the VM isn't even involved (there's definitely no
> VM pressure here).

That's true, I'll admit. Anyway, -ac kernels don't have the problem,
and I was misled by the fact that only the VM implementation differs
between those two branches (at least I think so).

> 
> > Notice how there's planty of RAM. I'm writing sequentially to a file
> > on the ext2 filesystem. The disk I'm writing on is a 7200rpm IDE,
> > capable of ~ 22 MB/s and I'm still getting only ~ 9 MB/s. Weird!
> 
> Are you sure you haven't lost some DMA setting or something?
> 

No. Setup is fine. I wouldn't make such a mistake. :)
If the disk were in some PIO mode, CPU usage would be much higher, but
it isn't.

This all definitely looks like a problem either in the bdflush daemon or
the request queue/elevator, but unfortunately I don't have enough
knowledge of those areas to pinpoint it more precisely.
-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25  4:57       ` Linus Torvalds
@ 2001-10-25 12:48         ` Zlatko Calusic
  2001-10-25 16:31           ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-25 12:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml

Linus Torvalds <torvalds@transmeta.com> writes:

> I wonder if you're getting screwed by bdflush().. You do have a lot of
> context switching going on, and you do have a clear pattern: once the
> write-out gets going, you're filling new cached pages at about the same
> pace that you're writing them out, which definitely means that the dirty
> buffer balancing is nice and active.
>

Yes, but things look similar when I finally fill up the whole memory and
kswapd kicks in. Everything behaves the same way, so it is definitely
not the VM, as you pointed out.

> So the problem is that you're obviously not actually getting the
> throughput you should - it's not the VM, as the page cache grows nicely at
> the same rate you're writing.
>

Yes.

> Try something for me: in fs/buffer.c make "balance_dirty_state()" never
> return > 0, ie make the "return 1" be a "return 0" instead.
>

Sure. I recompiled a fresh 2.4.13 at work and reran the tests. This time
on a different setup, so the numbers are even smaller (the tests were
performed on the last partition of the disk, where the disk is capable
of ~ 13MB/s).


   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0   6308    600 441592   0   0     0  7788  159   132   0   7  93
 0  1  0      0   3692    580 444272   0   0     0  5748  169   197   1   4  95
 0  1  0      0   3180    556 444804   0   0     0  5632  228   408   1   5  94
 0  1  0      0   3720    556 444284   0   0     0  7672  226   418   3   4  93
 0  1  0      0   3836    556 444148   0   0     0  5928  249   509   0   8  92
 0  1  0      0   3204    388 444952   0   0     0  7828  156   139   0   6  94
 1  1  0      0   3456    392 444692   0   0     0  5952  157   139   0   5  95
 0  1  0      0   3728    400 444428   0   0     0  7840  312   750   0   7  93
 0  1  0      0   3968    404 444168   0   0     0  5952  216   364   0   5  95


> That will cause us to not wake up bdflush at all, and if you're just on
> the "border" of 40% dirty buffer usage you'll have bdflush work in
> lock-step with you, alternately writing out buffers and waiting for them.
> 
> Quite frankly, just the act of doing the "write_some_buffers()" in
> balance_dirty() should cause us to block much better than the synchronous
> waiting anyway, because then we will block when the request queue fills
> up, not at random points.
> 
> Even so, considering that you have such a steady 9-10MB/s, please double-
> check that it's not something even simpler and embarrassing, like just
> having forgotten to enable auto-DMA in the kernel config ;)
> 

Yes, I definitely have DMA turned ON. All parameters are OK. :)

# hdparm /dev/hda

/dev/hda:
 multcount    = 16 (on)
 I/O support  =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 1650/255/63, sectors = 26520480, start = 0

-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25 12:48         ` Zlatko Calusic
@ 2001-10-25 16:31           ` Linus Torvalds
  2001-10-25 17:33             ` Jens Axboe
                               ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Linus Torvalds @ 2001-10-25 16:31 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Marcelo Tosatti, linux-mm, lkml



On 25 Oct 2001, Zlatko Calusic wrote:
>
> Yes, I definitely have DMA turned ON. All parameters are OK. :)

I suspect it may just be that "queue_nr_requests"/"batch_count" is
different in -ac: what happens if you tweak them to the same values?

(See drivers/block/ll_rw_blk.c)

I think -ac made the queues a bit deeper: the regular kernel does 128
requests and a batch-count of 16, while I _think_ -ac does something like "2
requests per megabyte" and batch_count=32, so if you have 512MB you should
try with

	queue_nr_requests = 1024
	batch_count = 32
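
(Both knobs are set up in blk_dev_init() in drivers/block/ll_rw_blk.c,
right before the "block: %d slots per queue, batch=%d" printk; a
hypothetical hard-coded override for testing would be something like the
following - note the variable is spelled batch_requests in the source even
though I keep calling it batch_count:)

	/* hypothetical test override in blk_dev_init(): */
	queue_nr_requests = 1024;	/* -ac-like: ~2 requests per MB on a 512MB box */
	batch_requests = 32;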

Does that help?

		Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25 16:31           ` Linus Torvalds
@ 2001-10-25 17:33             ` Jens Axboe
  2001-10-26  9:45             ` Zlatko Calusic
  2001-10-26 10:08             ` Zlatko Calusic
  2 siblings, 0 replies; 42+ messages in thread
From: Jens Axboe @ 2001-10-25 17:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Zlatko Calusic, Marcelo Tosatti, linux-mm, lkml

On Thu, Oct 25 2001, Linus Torvalds wrote:
> 
> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> 
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
> 
> (See drivers/block/ll_rw_block.c)
> 
> I think -ac made the queues a bit deeper the regular kernel does 128
> requests and a batch-count of 16, I _think_ -ac does something like "2
> requests per megabyte" and batch_count=32, so if you have 512MB you should
> try with
> 
> 	queue_nr_requests = 1024
> 	batch_count = 32

Right, -ac keeps the elevator flow control and proper queue sizes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25 16:31           ` Linus Torvalds
  2001-10-25 17:33             ` Jens Axboe
@ 2001-10-26  9:45             ` Zlatko Calusic
  2001-10-26 10:08             ` Zlatko Calusic
  2 siblings, 0 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-26  9:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml

Linus Torvalds <torvalds@transmeta.com> writes:

> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> 
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
> 
> (See drivers/block/ll_rw_block.c)
> 
> I think -ac made the queues a bit deeper the regular kernel does 128
> requests and a batch-count of 16, I _think_ -ac does something like "2
> requests per megabyte" and batch_count=32, so if you have 512MB you should
> try with
> 
> 	queue_nr_requests = 1024
> 	batch_count = 32
> 
> Does that help?
> 

Unfortunately not. It makes the machine quite unresponsive while it's
writing to disk, and vmstat 1 reveals strange "spiky" behaviour. Average
throughput is ~ 8MB/s (the disk is capable of ~ 13MB/s).

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  0  0      0   3840    528 441900   0   0     0 34816  188   594   2  34  64
 0  1  0      0   3332    536 442384   0   0     4 10624  187   519   2   8  90
 0  1  0      0   3324    536 442384   0   0     0     0  182   499   0   0 100
 2  1  0      0   3300    536 442384   0   0     0     0  198   486   0   1  99
 1  1  0      0   3304    536 442384   0   0     0     0  186   513   0   0 100
 0  1  1      0   3304    536 442384   0   0     0     0  193   473   0   1  99
 0  1  1      0   3304    536 442384   0   0     0     0  191   508   1   1  98
 0  1  0      0   3884    536 441840   0   0     4 44672  189   590   4  40  56
 0  1  0      0   3860    536 441840   0   0     0     0  186   526   0   1  99
 0  1  0      0   3852    536 441840   0   0     0     0  191   500   0   0 100
 0  1  0      0   3844    536 441840   0   0     0     0  193   482   1   0  99
 0  1  0      0   3844    536 441840   0   0     0     0  187   511   0   1  99
 0  2  1      0   3832    540 441844   0   0     4     0  305  1004   3   2  95
 0  3  1      0   3824    544 441844   0   0     4     0  410  1340   2   2  96
 0  3  0      0   3764    552 441916   0   0    12 47360  346   915   6  41  53
 0  3  0      0   3764    552 441916   0   0     0     0  373   887   0   0 100
 0  3  0      0   3764    552 441916   0   0     0     0  278   692   1   2  97
 1  3  0      0   3764    552 441916   0   0     0     0  221   579   0   3  97
 0  3  0      0   3764    552 441916   0   0     0     0  286   704   0   2  98

I'll now test "batch_count = queue_nr_requests / 3", which I found in
2.4.14-pre2, but with queue_nr_requests still left at 1024, and report
the results after that.
-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-25 16:31           ` Linus Torvalds
  2001-10-25 17:33             ` Jens Axboe
  2001-10-26  9:45             ` Zlatko Calusic
@ 2001-10-26 10:08             ` Zlatko Calusic
  2001-10-26 14:39               ` Jens Axboe
  2001-10-27 13:14               ` xmm2 - monitor Linux MM active/inactive lists graphically Giuliano Pochini
  2 siblings, 2 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-26 10:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marcelo Tosatti, linux-mm, lkml

Linus Torvalds <torvalds@transmeta.com> writes:

> On 25 Oct 2001, Zlatko Calusic wrote:
> >
> > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> 
> I suspect it may just be that "queue_nr_requests"/"batch_count" is
> different in -ac: what happens if you tweak them to the same values?
> 

Next test:

block: 1024 slots per queue, batch=341

Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)

Still very spiky, and during the write the disk is incapable of doing any
reads. IOW, no serious application can be started before writing has
finished. Shouldn't we favour reads over writes? Or is it just that
the elevator is not doing its job right, so reads suffer?


   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  1      0   3600    424 453416   0   0     0     0  190   510   2   1  97
 0  1  1      0   3596    424 453416   0   0     0 40468  189   508   2   2  96
 0  1  1      0   3592    424 453416   0   0     0     0  189   541   1   0  99
 0  1  1      0   3592    424 453416   0   0     0     0  190   513   1   0  99
 1  1  1      0   3592    424 453416   0   0     0     0  192   511   0   1  99
 0  1  1      0   3596    424 453416   0   0     0     0  188   528   0   0 100
 0  1  1      0   3592    424 453416   0   0     0     0  188   510   1   0  99
 0  1  1      0   3592    424 453416   0   0     0 41444  195   507   0   2  98
 0  1  1      0   3592    424 453416   0   0     0     0  190   514   1   1  98
 1  1  1      0   3588    424 453416   0   0     0     0  192   554   0   2  98
 0  1  1      0   3584    424 453416   0   0     0     0  191   506   0   1  99
 0  1  1      0   3584    424 453416   0   0     0     0  186   514   0   0 100
 0  1  1      0   3584    424 453416   0   0     0     0  186   515   0   0 100
 1  1  1      0   3576    424 453416   0   0     0     0  434  1493   3   2  95
 1  1  1      0   3564    424 453416   0   0     0 40560  301   936   3   1  96
 0  1  1      0   3564    424 453416   0   0     0     0  338  1050   1   2  97
 0  1  1      0   3560    424 453416   0   0     0     0  286   893   1   2  97

-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 10:08             ` Zlatko Calusic
@ 2001-10-26 14:39               ` Jens Axboe
  2001-10-26 14:57                 ` Zlatko Calusic
  2001-10-27 13:14               ` xmm2 - monitor Linux MM active/inactive lists graphically Giuliano Pochini
  1 sibling, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2001-10-26 14:39 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml

On Fri, Oct 26 2001, Zlatko Calusic wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
> 
> > On 25 Oct 2001, Zlatko Calusic wrote:
> > >
> > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> > 
> > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > different in -ac: what happens if you tweak them to the same values?
> > 
> 
> Next test:
> 
> block: 1024 slots per queue, batch=341

That's way too much, batch should just stay around 32, that is fine.

> Still very spiky, and during the write disk is uncapable of doing any
> reads. IOW, no serious application can be started before writing has
> finished. Shouldn't we favour reads over writes? Or is it just that
> the elevator is not doing its job right, so reads suffer?

You are probably just seeing starvation due to the very long queues.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 14:39               ` Jens Axboe
@ 2001-10-26 14:57                 ` Zlatko Calusic
  2001-10-26 15:01                   ` Jens Axboe
                                     ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-26 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml

Jens Axboe <axboe@suse.de> writes:

> On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > Linus Torvalds <torvalds@transmeta.com> writes:
> > 
> > > On 25 Oct 2001, Zlatko Calusic wrote:
> > > >
> > > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> > > 
> > > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > > different in -ac: what happens if you tweak them to the same values?
> > > 
> > 
> > Next test:
> > 
> > block: 1024 slots per queue, batch=341
> 
> That's way too much, batch should just stay around 32, that is fine.

OK. Anyway, neither configuration works well, so the problem might be
somewhere else.

While at it, could you give a short explanation of those two parameters?

> 
> > Still very spiky, and during the write disk is uncapable of doing any
> > reads. IOW, no serious application can be started before writing has
> > finished. Shouldn't we favour reads over writes? Or is it just that
> > the elevator is not doing its job right, so reads suffer?
> 
> You are probably just seeing starvation due to the very long queues.
> 

Is there anything we could do about that? I remember Linux once favoured
reads, but I'm not sure whether we still do that these days.

When I find some time, I'll dig around that code. It is a very
interesting part of the kernel, I'm sure; I just haven't had enough
time so far to spend hacking on that part.
-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 14:57                 ` Zlatko Calusic
@ 2001-10-26 15:01                   ` Jens Axboe
  2001-10-26 16:04                   ` Linus Torvalds
  2001-10-26 16:57                   ` Linus Torvalds
  2 siblings, 0 replies; 42+ messages in thread
From: Jens Axboe @ 2001-10-26 15:01 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml

On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > On Fri, Oct 26 2001, Zlatko Calusic wrote:
> > > Linus Torvalds <torvalds@transmeta.com> writes:
> > > 
> > > > On 25 Oct 2001, Zlatko Calusic wrote:
> > > > >
> > > > > Yes, I definitely have DMA turned ON. All parameters are OK. :)
> > > > 
> > > > I suspect it may just be that "queue_nr_requests"/"batch_count" is
> > > > different in -ac: what happens if you tweak them to the same values?
> > > > 
> > > 
> > > Next test:
> > > 
> > > block: 1024 slots per queue, batch=341
> > 
> > That's way too much, batch should just stay around 32, that is fine.
> 
> OK. Anyway, neither configuration works well, so the problem might be
> somewhere else.

Most likely, yes.

> While at it, could you give short explanation of those two parameters?

Sure. queue_nr_requests is the total number of free request slots per
queue. There are queue_nr_requests / 2 free slots each for READ and WRITE.
Each request can hold anywhere from the fs block size up to 127kB of data
by default. batch only matters once the request free list has been
depleted: in order to give the elevator some input to work with, we free
request slots in batches of 'batch' to get decent merging etc. That's
why numbers much bigger than ~32 would not be such a good idea and would
only add to latency.
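
(For reference, the release side of that batching in drivers/block/ll_rw_blk.c
looks roughly like this - it's essentially the code that the patch Linus
posts later in this thread ends up replacing; blkdev_release_request(),
slightly trimmed:)

	if (q) {
		/*
		 * If nobody is waiting for requests, don't bother
		 * batching up.
		 */
		if (!list_empty(&q->request_freelist[rw])) {
			list_add(&req->queue, &q->request_freelist[rw]);
			return;
		}

		/*
		 * Add to pending free list and batch wakeups: waiters are
		 * only woken once a whole batch of slots has piled up.
		 */
		list_add(&req->queue, &q->pending_freelist[rw]);

		if (++q->pending_free[rw] >= batch_requests) {
			int wake_up = q->pending_free[rw];
			blk_refill_freelist(q, rw);
			wake_up_nr(&q->wait_for_request, wake_up);
		}
	}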

> > > Still very spiky, and during the write disk is uncapable of doing any
> > > reads. IOW, no serious application can be started before writing has
> > > finished. Shouldn't we favour reads over writes? Or is it just that
> > > the elevator is not doing its job right, so reads suffer?
> > 
> > You are probably just seeing starvation due to the very long queues.
> > 
> 
> Is there anything we could do about that? I remember Linux once had a
> favoured reads, but I'm not sure if we do that likewise these days.

It still favors reads; take a look at the initial sequence numbers given
to reads and writes. We used to favor reads in the request slots too --
you could try changing the blk_init_freelist split so that you get a
1/3 - 2/3 ratio between WRITEs and READs and see if that makes the
system smoother.
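
(Concretely, an untested sketch of that tweak in the freelist init loop -
READ is 0 and WRITE is 1, so the default "i & 1" split is 50/50; note that
batch_requests would then need to stay below the size of the smaller
(write) pool, or waiters would never be woken:)

	/* untested sketch: give READ two request slots for every WRITE slot,
	 * instead of the default 50/50 "i & 1" split (READ == 0, WRITE == 1).
	 * Allocation-failure handling omitted here for brevity. */
	for (i = 0; i < queue_nr_requests; i++) {
		rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
		memset(rq, 0, sizeof(struct request));
		rq->rq_status = RQ_INACTIVE;
		list_add(&rq->queue, &q->request_freelist[(i % 3) ? READ : WRITE]);
	}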

> When I find some time, I'll dig around that code. It is very
> interesting part of the kernel, I'm sure, I just didn't have enough
> time so far, to spend hacking on that part.

Indeed it is.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 14:57                 ` Zlatko Calusic
  2001-10-26 15:01                   ` Jens Axboe
@ 2001-10-26 16:04                   ` Linus Torvalds
  2001-10-26 16:57                   ` Linus Torvalds
  2 siblings, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2001-10-26 16:04 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml


On 26 Oct 2001, Zlatko Calusic wrote:
>
> OK. Anyway, neither configuration works well, so the problem might be
> somewhere else.
>
> While at it, could you give short explanation of those two parameters?

Did you try the ones 2.4.14-pre2 does?

Basically, the "queue_nr_requests" means how many requests there can be
for this queue. Half of them are allocated to reads, half of them are
allocated to writes.

The "batch_requests" thing is something that kicks in when the queue has
emptied - we don't want to "trickle" requests to users, because if we do,
a new large write will not be able to merge its new requests sanely,
because it basically has to do them one at a time. So when
we run out of requests (ie "queue_nr_requests" isn't enough), we start
putting the freed-up requests on a "pending" list, and we release them
only when the pending list is bigger than "batch_requests".

Now, one thing to remember is that "queue_nr_requests" is for the whole
queue (half of them for reads, half for writes), and "batch_requests" is a
per-type thing (ie we batch reads and writes separately). So
"batch_requests" must be less than half of "queue_nr_requests", or we will
never release anything at all.

Now, in Alan's tree, there is a separate tuning thing, which is the "max
nr of _sectors_ in flight", which in my opinion is pretty bogus. It's
really a memory-management thing, but it also does something else: it has
low-and-high water-marks, and those might well be a good idea. It is
possible that we should just ditch the "batch_requests" thing, and use the
watermarks instead.

Side note: all of this is relevant really only for writes - reads pretty
much only care about the maximum queue-size, and it's very hard to get a
_huge_ queue-size with reads unless you do tons of read-ahead.

Now, the "batching" is technically equivalent to water-marking if there
is _one_ writer. But if there are multiple writers, water-marking may
actually have some advantages: it might allow the other writer to make some
progress when the first one has stopped, while the batching will stop
everybody until the batch is released. Who knows.

Anyway, the reason I think Alan's "max nr of sectors" is bogus is because:

 - it's a global count, and if you have 10 controllers and want to write
   to all 10, you _should_ be able to - you can write 10 times as many
   requests in the same latency, so there is nothing "global" with it.

   (It turns out that one advantage of the globalism is that it ends up
   limiting MM write-outs, but I personally think that is a _MM_ thing, ie
   we might want to have a "we have half of all our pages in flight, we
   have to throttle now" thing in "writepage()", not in the queue)

 - "nr of sectors" has very little to do with request latency on most
   hardware. You can do 255 sectors (ie one request) almost as fast as you
   can do just one, if you do them in one request. While just _two_
   sectors might be much slower than the 255, if they are in separate
   requests and cause seeking.

   So from a latency standpoint, the "request" is a much better number.

So Alan almost never throttles on requests (on big machines, the -ac tree
allows thousands of requests in flight per queue), while he _does_ have
this water-marking for sectors.

So I have two suspicions:

 - 128 requests (ie 64 for writes) like the default kernel should be
   _plenty_ enough to keep the disks busy, especially for streaming
   writes. It's small enough that you don't get the absolutely _huge_
   spikes you get with thousands of requests, while being large enough for
   fast writers that even if they _do_ block for 32 of the 64 requests,
   they'll have time to refill the next 32 long before the 32 pending one
   have finished.

   Also: limiting the write queue to 128 requests means that you can
   pretty much guarantee that you can get at least a few read requests
   per second, even if the write queue is constantly full, and even if
   your reader is serialized.

BUT:

 - the hard "batch" count is too harsh. It works as a watermark in the
   degenerate case, but doesn't allow a second writer to use up _some_ of
   the requests while the first writer is blocked due to watermarking.

   So with batching, when the queue is full and another process wants
   memory, that _OTHER_ process will also always block until the queue has
   emptied.

   With watermarks, when the writer has filled up the queue and starts
   waiting, other processes can still do some writing as long as they
   don't fill up the queue again. So if you have MM pressure but the
   writer is blocked (and some requests _have_ completed, but the writer
   waits for the low-water-mark), you can still push out requests.

   That's also likely to be a lot more fair - batching tends to give the
   whole batch to the big writer, while watermarking automatically allows
   others to get a look at the queue.

I'll whip up a patch for testing (2.4.14-pre2 made the batching slightly
saner, but the same "hard" behaviour is pretty much unavoidable with
batching)

			Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 14:57                 ` Zlatko Calusic
  2001-10-26 15:01                   ` Jens Axboe
  2001-10-26 16:04                   ` Linus Torvalds
@ 2001-10-26 16:57                   ` Linus Torvalds
  2001-10-26 17:19                     ` Linus Torvalds
  2 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2001-10-26 16:57 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml

[-- Attachment #1: Type: TEXT/PLAIN, Size: 666 bytes --]


On 26 Oct 2001, Zlatko Calusic wrote:
>
> When I find some time, I'll dig around that code. It is very
> interesting part of the kernel, I'm sure, I just didn't have enough
> time so far, to spend hacking on that part.

Attached is a very untested patch (but hey, it compiles, so it must work,
right?) against 2.4.14-pre2, that makes the batching be a high/low
watermark thing instead. It actually simplified the code, but that is, of
course, assuming that it works at all ;)

(If I got the comparisons wrong, or if I update the counts wrong, your IO
queue will probably stop cold. So be careful. The code is obvious
enough, but typos and thinkos happen).

		Linus

[-- Attachment #2: Type: TEXT/PLAIN, Size: 5499 bytes --]

diff -u --recursive pre2/linux/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- pre2/linux/drivers/block/ll_rw_blk.c	Fri Oct 26 09:48:25 2001
+++ linux/drivers/block/ll_rw_blk.c	Fri Oct 26 09:53:54 2001
@@ -140,21 +140,23 @@
 		return &blk_dev[MAJOR(dev)].request_queue;
 }
 
-static int __blk_cleanup_queue(struct list_head *head)
+static int __blk_cleanup_queue(struct request_list *list)
 {
+	struct list_head *head = &list->free;
 	struct request *rq;
 	int i = 0;
 
-	if (list_empty(head))
-		return 0;
-
-	do {
+	while (!list_empty(head)) {
 		rq = list_entry(head->next, struct request, queue);
 		list_del(&rq->queue);
 		kmem_cache_free(request_cachep, rq);
 		i++;
-	} while (!list_empty(head));
+	};
 
+	if (i != list->count)
+		printk("request list leak!\n");
+
+	list->count = 0;
 	return i;
 }
 
@@ -176,10 +178,8 @@
 {
 	int count = queue_nr_requests;
 
-	count -= __blk_cleanup_queue(&q->request_freelist[READ]);
-	count -= __blk_cleanup_queue(&q->request_freelist[WRITE]);
-	count -= __blk_cleanup_queue(&q->pending_freelist[READ]);
-	count -= __blk_cleanup_queue(&q->pending_freelist[WRITE]);
+	count -= __blk_cleanup_queue(&q->rq[READ]);
+	count -= __blk_cleanup_queue(&q->rq[WRITE]);
 
 	if (count)
 		printk("blk_cleanup_queue: leaked requests (%d)\n", count);
@@ -331,11 +331,10 @@
 	struct request *rq;
 	int i;
 
-	INIT_LIST_HEAD(&q->request_freelist[READ]);
-	INIT_LIST_HEAD(&q->request_freelist[WRITE]);
-	INIT_LIST_HEAD(&q->pending_freelist[READ]);
-	INIT_LIST_HEAD(&q->pending_freelist[WRITE]);
-	q->pending_free[READ] = q->pending_free[WRITE] = 0;
+	INIT_LIST_HEAD(&q->rq[READ].free);
+	INIT_LIST_HEAD(&q->rq[WRITE].free);
+	q->rq[READ].count = 0;
+	q->rq[WRITE].count = 0;
 
 	/*
 	 * Divide requests in half between read and write
@@ -349,7 +348,8 @@
 		}
 		memset(rq, 0, sizeof(struct request));
 		rq->rq_status = RQ_INACTIVE;
-		list_add(&rq->queue, &q->request_freelist[i & 1]);
+		list_add(&rq->queue, &q->rq[i&1].free);
+		q->rq[i&1].count++;
 	}
 
 	init_waitqueue_head(&q->wait_for_request);
@@ -423,10 +423,12 @@
 static inline struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
+	struct request_list *rl = q->rq + rw;
 
-	if (!list_empty(&q->request_freelist[rw])) {
-		rq = blkdev_free_rq(&q->request_freelist[rw]);
+	if (!list_empty(&rl->free)) {
+		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
+		rl->count--;
 		rq->rq_status = RQ_ACTIVE;
 		rq->special = NULL;
 		rq->q = q;
@@ -443,17 +445,13 @@
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
+	generic_unplug_device(q);
 	add_wait_queue_exclusive(&q->wait_for_request, &wait);
-	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		spin_lock_irq(&io_request_lock);
-		rq = get_request(q, rw);
-		spin_unlock_irq(&io_request_lock);
-		if (rq)
-			break;
-		generic_unplug_device(q);
-		schedule();
-	}
+	do {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (q->rq[rw].count < batch_requests)
+			schedule();
+	} while ((rq = get_request(q,rw)) == NULL);
 	remove_wait_queue(&q->wait_for_request, &wait);
 	current->state = TASK_RUNNING;
 	return rq;
@@ -542,15 +540,6 @@
 	list_add(&req->queue, insert_here);
 }
 
-inline void blk_refill_freelist(request_queue_t *q, int rw)
-{
-	if (q->pending_free[rw]) {
-		list_splice(&q->pending_freelist[rw], &q->request_freelist[rw]);
-		INIT_LIST_HEAD(&q->pending_freelist[rw]);
-		q->pending_free[rw] = 0;
-	}
-}
-
 /*
  * Must be called with io_request_lock held and interrupts disabled
  */
@@ -564,28 +553,12 @@
 
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
-	 * asumme it has free buffers and check waiters
+	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
-		/*
-		 * If nobody is waiting for requests, don't bother
-		 * batching up.
-		 */
-		if (!list_empty(&q->request_freelist[rw])) {
-			list_add(&req->queue, &q->request_freelist[rw]);
-			return;
-		}
-
-		/*
-		 * Add to pending free list and batch wakeups
-		 */
-		list_add(&req->queue, &q->pending_freelist[rw]);
-
-		if (++q->pending_free[rw] >= batch_requests) {
-			int wake_up = q->pending_free[rw];
-			blk_refill_freelist(q, rw);
-			wake_up_nr(&q->wait_for_request, wake_up);
-		}
+		list_add(&req->queue, &q->rq[rw].free);
+		if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
+			wake_up(&q->wait_for_request);
 	}
 }
 
@@ -1144,7 +1117,7 @@
 	/*
 	 * Batch frees according to queue length
 	 */
-	batch_requests = queue_nr_requests/3;
+	batch_requests = queue_nr_requests/4;
 	printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
 
 #ifdef CONFIG_AMIGA_Z2RAM
diff -u --recursive pre2/linux/include/linux/blkdev.h linux/include/linux/blkdev.h
--- pre2/linux/include/linux/blkdev.h	Tue Oct 23 22:01:01 2001
+++ linux/include/linux/blkdev.h	Fri Oct 26 09:36:41 2001
@@ -66,14 +66,17 @@
  */
 #define QUEUE_NR_REQUESTS	8192
 
+struct request_list {
+	unsigned int count;
+	struct list_head free;
+};
+
 struct request_queue
 {
 	/*
 	 * the queue request freelist, one for reads and one for writes
 	 */
-	struct list_head	request_freelist[2];
-	struct list_head	pending_freelist[2];
-	int			pending_free[2];
+	struct request_list	rq[2];
 
 	/*
 	 * Together with queue_head for cacheline sharing

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 16:57                   ` Linus Torvalds
@ 2001-10-26 17:19                     ` Linus Torvalds
  2001-10-28 17:30                       ` Zlatko Calusic
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2001-10-26 17:19 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml


On Fri, 26 Oct 2001, Linus Torvalds wrote:
>
> Attached is a very untested patch (but hey, it compiles, so it must work,
> right?)

And it actually does seem to.

Zlatko, does this make a difference for your disk?

		Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 10:08             ` Zlatko Calusic
  2001-10-26 14:39               ` Jens Axboe
@ 2001-10-27 13:14               ` Giuliano Pochini
  2001-10-28  5:05                 ` Mike Fedyk
  1 sibling, 1 reply; 42+ messages in thread
From: Giuliano Pochini @ 2001-10-27 13:14 UTC (permalink / raw)
  To: zlatko.calusic; +Cc: Linus Torvalds, Marcelo Tosatti, linux-mm, lkml


> block: 1024 slots per queue, batch=341
> 
> Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)
> 
> Still very spiky, and during the write disk is uncapable of doing any
> reads. IOW, no serious application can be started before writing has
> finished. Shouldn't we favour reads over writes? Or is it just that
> the elevator is not doing its job right, so reads suffer?
>
>    procs                      memory    swap          io     system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
>  0  1  1      0   3596    424 453416   0   0     0 40468  189   508   2   2  96

341*127K = ~40M.

Batch is too high. It doesn't explain why reads get delayed so much, anyway.

Bye.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-27 13:14               ` xmm2 - monitor Linux MM active/inactive lists graphically Giuliano Pochini
@ 2001-10-28  5:05                 ` Mike Fedyk
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Fedyk @ 2001-10-28  5:05 UTC (permalink / raw)
  To: Giuliano Pochini
  Cc: zlatko.calusic, Linus Torvalds, Marcelo Tosatti, linux-mm, lkml

On Sat, Oct 27, 2001 at 03:14:44PM +0200, Giuliano Pochini wrote:
> 
> > block: 1024 slots per queue, batch=341
> > 
> > Wrote 600.00 MB in 71 seconds -> 8.39 MB/s (7.5 %CPU)
> > 
> > Still very spiky, and during the write disk is uncapable of doing any
> > reads. IOW, no serious application can be started before writing has
> > finished. Shouldn't we favour reads over writes? Or is it just that
> > the elevator is not doing its job right, so reads suffer?
> >
> >    procs                      memory    swap          io     system         cpu
> >  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
> >  0  1  1      0   3596    424 453416   0   0     0 40468  189   508   2   2  96
> 
> 341*127K = ~40M.
> 
> Batch is too high. It doesn't explain why reads get delayed so much, anyway.
> 

Try modifying the elevator queue length with elvtune.

BTW, 2.2.19 has the queue lengths in the hundreds, and 2.4.xx has them in the
thousands.  I've set 2.4 kernels back to the 2.2 defaults, and interactive
performance has gone up considerably.  These are subjective tests, though.

Mike

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-26 17:19                     ` Linus Torvalds
@ 2001-10-28 17:30                       ` Zlatko Calusic
  2001-10-28 17:34                         ` Linus Torvalds
                                           ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-28 17:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml

Linus Torvalds <torvalds@transmeta.com> writes:

> On Fri, 26 Oct 2001, Linus Torvalds wrote:
> >
> > Attached is a very untested patch (but hey, it compiles, so it must work,
> > right?)
> 
> And it actually does seem to.
> 
> Zlatko, does this make a difference for your disk?
> 

First, sorry for such a delay in answering, I was busy.

I compiled 2.4.14-pre3 as it seems to be identical to your p2p3 patch,
with regard to queue processing.

Unfortunately, things didn't change on my first disk (IBM 7200rpm
@home). I'm still getting low numbers, check the vmstat output at the
end of the email.

But now I found something interesting: the other two disks, which are on
the standard IDE controller, work correctly (writing is at 17-22
MB/sec). The disk which doesn't work well is on the HPT366 interface,
so that may be our culprit. Now I've got the idea to go back through the
patches to see where it started behaving poorly.

Also, one more thing, I'm pretty sure that under strange circumstances
(specific alignment of stars) it behaves well (with appropriate
writing speed). I just haven't yet pinpointed what needs to be done to
get to that point.

I know I haven't supplied you with a lot of information, but I'll keep
investigating until I have some more solid data on the problem.

BTW, thank you and Jens for the nice explanation of the numbers, very good
reading.

 0  2  0  13208   2924    516 450716   0   0     0 11808  179   113   0   6  93
 0  1  0  13208   2656    524 450964   0   0     0  8432  174    86   1   6  93
 0  1  0  13208   3676    532 449924   0   0     0  8432  174    91   1   4  95
 0  1  0  13208   3400    540 450172   0   0     0  8432  231   343   1   4  94
 0  2  0  13208   3520    548 450036   0   0     0  8440  180   179   2   5  93
 0  1  0  20216   3544    728 456976  32   0    32  8432  175    94   0   4  95
 0  2  0  20212   3280    728 457232   0   0     0  8440  174    88   0   5  95
 0  2  0  20208   3032    728 457480   0   0     0  8364  174    84   1   4  95
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  2  0  20208   3412    732 457092   0   0     0  6964  175   111   0   4  96
 0  2  0  20208   3272    728 457224   0   0     0  1216  207    89   0   1  99
 0  2  0  20208   3164    728 457352   0   0     0  1300  256    77   1   2  97
 0  2  1  20208   2928    732 457604   0   0     0  1444  283    77   1   0  99
 0  2  1  20208   2764    732 457732   0   0     0  1316  278    73   1   1  98
 0  2  1  20208   3420    728 457096   0   0     0  1652  273   117   0   1  99
 0  2  1  20208   3180    732 457348   0   0     0  1404  240    90   0   0  99
 0  2  1  20208   3696    728 456840   0   0     0  1784  247    80   0   1  98
 0  2  1  20204   3432    728 457096   0   0     0  1404  237    77   1   0  99
 0  2  1  20204   2896    732 457604   0   0     0  1672  255    77   1   1  98
 0  1  0  20204   3284    728 457224   0   0     0  1976  257   112   0   2  98
 0  1  0  20204   2772    728 457736   0   0     0  7628  260   100   0   4  96
 0  1  0  20204   3540    728 456968   0   0     0  8492  178    83   1   4  95
 0  2  0  20204   3584    736 456916   0   0     4  4848  175    88   0   2  97

Regards,
-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 17:30                       ` Zlatko Calusic
@ 2001-10-28 17:34                         ` Linus Torvalds
  2001-10-28 17:48                           ` Alan Cox
  2001-10-28 19:13                         ` Barry K. Nathan
  2001-11-02  5:52                         ` Zlatko's I/O slowdown status Andrea Arcangeli
  2 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2001-10-28 17:34 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: Jens Axboe, Marcelo Tosatti, linux-mm, lkml


On 28 Oct 2001, Zlatko Calusic wrote:
>
> But, now I found something interesting, other two disk which are on
> the standard IDE controller work correctly (writing is at 17-22
> MB/sec). The disk which doesn't work well is on the HPT366 interface,
> so that may be our culprit. Now I got the idea to check patches
> retrogradely to see where it started behaving poorely.

Ok. That _is_ indeed a big clue.

Do the -ac patches have any hpt366-specific stuff? Although I suspect
you're right, and that it's just the driver (or the controller itself) being
very, very sensitive to some random alignment of the stars, rather than any
real code difference.

>  0  2  0  13208   2924    516 450716   0   0     0 11808  179   113   0   6  93
>  0  1  0  13208   2656    524 450964   0   0     0  8432  174    86   1   6  93
>  0  1  0  13208   3676    532 449924   0   0     0  8432  174    91   1   4  95
>  0  1  0  13208   3400    540 450172   0   0     0  8432  231   343   1   4  94
>  0  2  0  13208   3520    548 450036   0   0     0  8440  180   179   2   5  93
>  0  1  0  20216   3544    728 456976  32   0    32  8432  175    94   0   4  95
>  0  2  0  20212   3280    728 457232   0   0     0  8440  174    88   0   5  95
>  0  2  0  20208   3032    728 457480   0   0     0  8364  174    84   1   4  95
>    procs                      memory    swap          io     system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
>  0  2  0  20208   3412    732 457092   0   0     0  6964  175   111   0   4  96
>  0  2  0  20208   3272    728 457224   0   0     0  1216  207    89   0   1  99
>  0  2  0  20208   3164    728 457352   0   0     0  1300  256    77   1   2  97
>  0  2  1  20208   2928    732 457604   0   0     0  1444  283    77   1   0  99
>  0  2  1  20208   2764    732 457732   0   0     0  1316  278    73   1   1  98

So it actually slows down to just 1.5MB/s at times? That's just
disgusting. I wonder what the driver is doing..

		Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 17:34                         ` Linus Torvalds
@ 2001-10-28 17:48                           ` Alan Cox
  2001-10-28 17:59                             ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Alan Cox @ 2001-10-28 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

> Does the -ac patches have any hpt366-specific stuff? Although I suspect
> you're right, and that it's just the driver (or controller itself) being

The IDE code matches between the two. It isn't a driver change.


Alan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 17:48                           ` Alan Cox
@ 2001-10-28 17:59                             ` Linus Torvalds
  2001-10-28 18:22                               ` Alan Cox
                                                 ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Linus Torvalds @ 2001-10-28 17:59 UTC (permalink / raw)
  To: Alan Cox; +Cc: Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml


On Sun, 28 Oct 2001, Alan Cox wrote:
>
> > Does the -ac patches have any hpt366-specific stuff? Although I suspect
> > you're right, and that it's just the driver (or controller itself) being
>
> The IDE code matches between the two. It isnt a driver change

It might, of course, just be timing, but that sounds like a bit _too_ easy
an explanation. Even if it could easily be true.

The fact that -ac gets higher speeds, and -ac has a very different
request watermark strategy makes me suspect that that might be the cause.

In particular, the standard kernel _requires_ that in order to get good
performance you can merge many bh's onto one request. That's a very
reasonable assumption: it basically says that any high-performance driver
has to accept merging, because that in turn is required for the elevator
overhead to not grow without bounds. And if the driver doesn't accept big
requests, that driver cannot perform well because it won't have many
requests pending.

In contrast, the -ac logic says roughly "Who the hell cares if the driver
can merge requests or not, we can just give it thousands of small requests
instead, and cap the total number of _sectors_ instead of capping the
total number of requests earlier".

In my opinion, the -ac logic is really bad, but one thing it does allow is
for stupid drivers that look like high-performance drivers. Which may be
why it got implemented.

And it may be that the hpt366 IDE driver has always had this braindamage,
which the -ac code hides. Or something like this.

Does anybody know the hpt driver? Does it, for example, limit the maximum
number of sectors per merge somehow for some reason?

Jens?

		Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 17:59                             ` Linus Torvalds
@ 2001-10-28 18:22                               ` Alan Cox
  2001-10-28 18:46                                 ` Linus Torvalds
  2001-10-28 18:56                               ` Andrew Morton
  2001-10-30  8:56                               ` Jens Axboe
  2 siblings, 1 reply; 42+ messages in thread
From: Alan Cox @ 2001-10-28 18:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

> In contrast, the -ac logic says roughly "Who the hell cares if the driver
> can merge requests or not, we can just give it thousands of small requests
> instead, and cap the total number of _sectors_ instead of capping the
> total number of requests earlier"

If you think about it, the major resource constraint is sectors - or, another
way to think of it, the "number of pinned pages the VM cannot rescue until the
I/O is done". We also have many devices where latency is horribly
important - IDE is one because it lacks sensible overlapping I/O. I'm less
sure what the latency trade-offs are. Fewer commands means fewer turnarounds,
so there is a counterbalance.

In the case of IDE the -ac tree will do basically the same merging - the
limitations on IDE DMA are pretty reasonable. DMA IDE has scatter gather
tables and is actually smarter than many older scsi controllers. The IDE
layer supports up to 128 chunks of up to just under 64Kb (should be 64K
but some chipsets get 64K = 0 wrong and it's not pretty)

> In my opinion, the -ac logic is really bad, but one thing it does allow is
> for stupid drivers that look like high-performance drivers. Which may be
> why it got implemented.

Well I'm all for making dumb hardware go as fast as smart stuff but that
wasn't the original goal - the original goal was to fix the bad behaviour
with the base kernel and large I/O queues to slow devices like M/O disks.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 18:22                               ` Alan Cox
@ 2001-10-28 18:46                                 ` Linus Torvalds
  2001-10-28 19:29                                   ` Alan Cox
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2001-10-28 18:46 UTC (permalink / raw)
  To: Alan Cox; +Cc: Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml


On Sun, 28 Oct 2001, Alan Cox wrote:
>
> > In contrast, the -ac logic says roughly "Who the hell cares if the driver
> > can merge requests or not, we can just give it thousands of small requests
> > instead, and cap the total number of _sectors_ instead of capping the
> > total number of requests earlier"
>
> If you think about it the major resource constraint is sectors - or another
> way to think of it "number of pinned pages the VM cannot rescue until the
> I/O is done".

Yes. But that's a VM decision, and that's a decision the VM _can_ and does
make. At least in newer VMs.

So counting sectors is only hiding problems at a higher level, and it's
hiding problems that the higher level can know about.

In contrast, one thing that the higher level _cannot_ know about is the
latency of the request queue, because that latency depends on the layout
of the requests. Contiguous requests are fast, seeks are slow. So the
number of requests (as long as they aren't infinitely sized) fairly well
approximates the latency.

Note that you are certainly right that the Linux VM system did not use to
be very good at throttling, and you could make it try to write out all of
memory on small machines. But that's really a VM issue.

(And have we had VMs that tried to push all of memory onto the disk, and
then returned Out-of-Memory when all pages were locked? Sure we have. But
I know mine doesn't; I don't know about yours.)

>		 We also have many devices where the latency is horribly
> important - IDE is one because it lacks sensible overlapping I/O. I'm less
> sure what the latency trade-offs are. Fewer commands mean fewer turnarounds,
> so there is a counterbalance.

Note that from a latency standpoint, you only need to have enough requests
to fill the queue - and right now we have a total of 128 requests, of
which half are for reads, and half are for the watermarking, so you end up
having 32 requests "in flight" while you refill the queue.

Which is _plenty_. Because each request can be 255 sectors (or 128,
depending on where the limit is today ;), which means that if you actually
have something throughput-limited, you can certainly keep the disk busy.
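
As a rough sanity check (assuming 255-sector requests and 512-byte sectors):

    32 requests x 255 sectors x 512 bytes ~ 4MB in flight

which at ~20MB/s of streaming speed is on the order of 200ms of queued work -
plenty to keep the disk busy while the queue is refilled.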

(And if the requests aren't localized enough to coalesce well, you cannot
keep the disk at platter-speed _anyway_, plus the requests will take
longer to process, so you'll have even more time to fill the queue).

The important part for real throughput is not to have thousands of
requests in flight, but to have _big_enough_ requests in flight. You can
keep even a fast disk busy with just a few requests, if you just keep
refilling them quickly enough and if they are _big_ enough.

> In the case of IDE the -ac tree will do basically the same merging - the
> limitations on IDE DMA are pretty reasonable. DMA IDE has scatter-gather
> tables and is actually smarter than many older SCSI controllers. The IDE
> layer supports up to 128 chunks of up to just under 64KB (should be 64K
> but some chipsets get 64K = 0 wrong and it's not pretty)

Yes. My question is more: does the hpt366 thing limit the queueing in some
way?

> Well I'm all for making dumb hardware go as fast as smart stuff but that
> wasn't the original goal - the original goal was to fix the bad behaviour
> with the base kernel and large I/O queues to slow devices like M/O disks.

Now, that's a _latency_ issue, and should be fixed by having the max
number of requests (and the max _size_ of a request too) be a per-queue
thing.

But notice how that actually doesn't have anything to do with memory size,
and makes your "scale by max memory" thing illogical.

		Linus


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 17:59                             ` Linus Torvalds
  2001-10-28 18:22                               ` Alan Cox
@ 2001-10-28 18:56                               ` Andrew Morton
  2001-10-30  8:56                               ` Jens Axboe
  2 siblings, 0 replies; 42+ messages in thread
From: Andrew Morton @ 2001-10-28 18:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

Linus Torvalds wrote:
> 
> And it may be that the hpt366 IDE driver has always had this braindamage,
> which the -ac code hides. Or something like this.
> 

My hpt366, running stock 2.4.14-pre3, performs OK.
	time ( dd if=/dev/zero of=foo bs=10240k count=100 ; sync )
takes 35 seconds (30 megs/sec).  The same on current -ac kernels.

Maybe Zlatko's drive stopped doing DMA?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 17:30                       ` Zlatko Calusic
  2001-10-28 17:34                         ` Linus Torvalds
@ 2001-10-28 19:13                         ` Barry K. Nathan
  2001-10-28 21:42                           ` Jonathan Morton
  2001-11-02  5:52                         ` Zlatko's I/O slowdown status Andrea Arcangeli
  2 siblings, 1 reply; 42+ messages in thread
From: Barry K. Nathan @ 2001-10-28 19:13 UTC (permalink / raw)
  To: zlatko.calusic
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

> Unfortunately, things didn't change on my first disk (IBM 7200rpm
> @home). I'm still getting low numbers, check the vmstat output at the
> end of the email.
> 
> But, now I found something interesting: the other two disks, which are on
> the standard IDE controller, work correctly (writing is at 17-22
> MB/sec). The disk which doesn't work well is on the HPT366 interface,
> so that may be our culprit. Now I got the idea to check patches
> retrogradely to see where it started behaving poorly.
> 
> Also, one more thing, I'm pretty sure that under strange circumstances
> (specific alignment of stars) it behaves well (with appropriate
> writing speed). I just haven't yet pinpointed what needs to be done to
> get to that point.

I didn't read the entire thread, so this is a bit of a stab in the dark,
but:

This really reminds me of a problem I once had with a hard drive of
mine. It would usually go at 15-20MB/sec, but sometimes (under both
Linux and Windows) would slow down to maybe 350KB/sec. The slowdown, or
lack thereof, did seem to depend on the alignment of the stars. I lived
with it for a number of months, then started getting intermittent I/O
errors as well, as if the drive had bad sectors on disk.

The problem turned out to be insufficient ventilation for the controller
board on the bottom of the drive -- it was in the lowest 3.5" drive bay
in my case, so the bottom of the drive was snuggled next to a piece of
metal with ventilation holes. The holes were rather large (maybe 0.5"
diameter) -- and so were the areas without holes. Guess where one of the
drive's controller chips happened to be positioned, relative to the
holes? :( Moving the drive up a bit in the case, so as to allow 0.5"-1"
of space for air beneath the drive, fixed the problem (both the slowdown
and the I/O errors).

I don't know if this is your problem, but I'm mentioning it just in
case it is...

-Barry K. Nathan <barryn@pobox.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 18:46                                 ` Linus Torvalds
@ 2001-10-28 19:29                                   ` Alan Cox
  2001-10-28 22:04                                     ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Alan Cox @ 2001-10-28 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Zlatko Calusic, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

> Yes. My question is more: does the hpt366 thing limit the queueing in some
> way?

Nope. The HPT366 is a bog-standard DMA IDE controller. At least, unless Andre
can point out something I've forgotten, any behaviour seen on it should be
the same as seen on any other IDE controller with DMA support.

In practical terms, that should mean you can observe the same HPT366 problem
he does on whatever random IDE controller is on your desktop box.

> But notice how that actually doesn't have anything to do with memory size,
> and makes your "scale by max memory" thing illogical.

When you are dealing with the VM limit which the limiter was originally
added for, it makes a lot of sense. When you want to use it solely for
other purposes, it doesn't.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 19:13                         ` Barry K. Nathan
@ 2001-10-28 21:42                           ` Jonathan Morton
  0 siblings, 0 replies; 42+ messages in thread
From: Jonathan Morton @ 2001-10-28 21:42 UTC (permalink / raw)
  To: barryn, zlatko.calusic
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

>  > Unfortunately, things didn't change on my first disk (IBM 7200rpm
>>  @home). I'm still getting low numbers, check the vmstat output at the
>>  end of the email.
>>
>>  But, now I found something interesting: the other two disks, which are on
>>  the standard IDE controller, work correctly (writing is at 17-22
>>  MB/sec). The disk which doesn't work well is on the HPT366 interface,
>>  so that may be our culprit. Now I got the idea to check patches
>  > retrogradely to see where it started behaving poorly.

>This really reminds me of a problem I once had with a hard drive of
>mine. It would usually go at 15-20MB/sec, but sometimes (under both
>Linux and Windows) would slow down to maybe 350KB/sec. The slowdown, or
>lack thereof, did seem to depend on the alignment of the stars. I lived
>with it for a number of months, then started getting intermittent I/O
>errors as well, as if the drive had bad sectors on disk.
>
>The problem turned out to be insufficient ventilation for the controller
>board on the bottom of the drive

As an extra datapoint, my IBM Deskstar 60GXP (40GB version) runs 
slightly slower with writing than with reading.  This is on a VIA 
686a controller, UDMA/66 active.  The drive also has plenty of air 
around it, being in a 5.25" bracket with fans in front.

Writing 1GB from /dev/zero takes 34.27s = 29.88MB/sec, 19% CPU
Reading 1GB from test file takes 29.64s = 34.58MB/sec, 18% CPU

Hmm, that's almost as fast as the 10000rpm Ultrastar sited just above 
it, but with higher CPU usage.  Ultrastar gets 36MB/sec on reading 
with hdparm, haven't tested write performance due to probable 
fragmentation.

Both tests were conducted using 'dd bs=1k' on my 1GHz Athlon with 256MB 
RAM.  The test file is on a freshly-created ext2 filesystem starting at 
10GB into the 40GB drive (knowing IBM's recent trend, this'll still 
be fairly close to the outer rim).  The write test includes a sync at the 
end.  Kernel is Linus' 2.4.9, no relevant patches.
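
For the record, the write and read tests were along these lines (the path
and exact count here are only illustrative):

    dd if=/dev/zero of=/mnt/test/bigfile bs=1k count=1048576 ; sync
    dd if=/mnt/test/bigfile of=/dev/null bs=1k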

-- 
--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@cyberspace.org  (not for attachments)
website:  http://www.chromatix.uklinux.net/vnc/
geekcode: GCS$/E dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$
           V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*)
tagline:  The key to knowledge is not to rely on people to teach you it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 19:29                                   ` Alan Cox
@ 2001-10-28 22:04                                     ` Linus Torvalds
  0 siblings, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2001-10-28 22:04 UTC (permalink / raw)
  To: linux-kernel

In article <E15xvcd-0000FM-00@the-village.bc.nu>,
Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
>> Yes. My question is more: does the hpt366 thing limit the queueing in some
>> way?
>
>Nope. The HPT366 is a bog-standard DMA IDE controller. At least, unless Andre
>can point out something I've forgotten, any behaviour seen on it should be
>the same as seen on any other IDE controller with DMA support.
>
>In practical terms, that should mean you can observe the same HPT366 problem
>he does on whatever random IDE controller is on your desktop box

Well, the thing is, I obviously _don't_ observe that problem. Neither
does anybody else I have heard about. I get a nice 20MB/s on my IDE
disks both at home and at work, whether reading or writing. 

Which was why I was suspecting the hpt366 code. But considering that
others report good performance with the same controller, it might be
something even more localized, either in just Zlatko's setup (i.e. disk or
controller breakage), or some subtle timing issue that is general but
you have to have just the right timing to hit it.

		Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-28 17:59                             ` Linus Torvalds
  2001-10-28 18:22                               ` Alan Cox
  2001-10-28 18:56                               ` Andrew Morton
@ 2001-10-30  8:56                               ` Jens Axboe
  2001-10-30  9:26                                 ` Zlatko Calusic
  2 siblings, 1 reply; 42+ messages in thread
From: Jens Axboe @ 2001-10-30  8:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan Cox, Zlatko Calusic, Marcelo Tosatti, linux-mm, lkml

On Sun, Oct 28 2001, Linus Torvalds wrote:
> 
> On Sun, 28 Oct 2001, Alan Cox wrote:
> >
> > > Does the -ac patches have any hpt366-specific stuff? Although I suspect
> > > you're right, and that it's just the driver (or controller itself) being
> >
> > The IDE code matches between the two. It isnt a driver change
> 
> It might, of course, just be timing, but that sounds like a bit _too_ easy
> an explanation. Even if it could easily be true.
> 
> The fact that -ac gets higher speeds, and -ac has a very different
> request watermark strategy makes me suspect that that might be the cause.
> 
> In particular, the standard kernel _requires_ that in order to get good
> performance you can merge many bh's onto one request. That's a very
> reasonable assumption: it basically says that any high-performance driver
> has to accept merging, because that in turn is required for the elevator
> overhead to not grow without bounds. And if the driver doesn't accept big
> requests, that driver cannot perform well because it won't have many
> requests pending.

Nod

> In contrast, the -ac logic says roughly "Who the hell cares if the driver
> can merge requests or not, we can just give it thousands of small requests
> instead, and cap the total number of _sectors_ instead of capping the
> total number of requests earlier".

Not true, that was not the intended goal. We always want the driver to
get merged requests, even if we can have ridiculously large queue
lengths. The large queues were a benchmark win (blush), since they allowed
the elevator to reorder seeks across a big bench run efficiently. I've
later done more real-life testing and I don't think it matters too much
here; in fact, it only seems to incur greater latency and starvation.

> In my opinion, the -ac logic is really bad, but one thing it does allow is
> for stupid drivers that look like high-performance drivers. Which may be
> why it got implemented.

Don't mix up the larger queues with a lack of will to merge; that is not
the case.

> And it may be that the hpt366 IDE driver has always had this braindamage,
> which the -ac code hides. Or something like this.
> 
> Does anybody know the hpt driver? Does it, for example, limit the maximum
> number of sectors per merge somehow for some reason?

hpt366 has no special workarounds or stuff it disables; it can't be
anything like that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-30  8:56                               ` Jens Axboe
@ 2001-10-30  9:26                                 ` Zlatko Calusic
  2001-10-30 19:07                                   ` Josh McKinney
  0 siblings, 1 reply; 42+ messages in thread
From: Zlatko Calusic @ 2001-10-30  9:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, Alan Cox, Marcelo Tosatti, linux-mm, lkml

[-- Attachment #1: Type: text/plain, Size: 1366 bytes --]

Jens Axboe <axboe@suse.de> writes:

> hpt366 has no special work arounds or stuff it disables, it can't be
> anything like that.
> 

Followup on the problem. Yesterday I was upgrading my Debian Linux. To
do that I had to remount /usr read-write. After the update finished,
I tested the disk writing speed once again. And there it was, a full
22MB/sec (on the same partition). And once I get to that point, the disk
stays fast. Then I thought (poor man's logic) that the poor
performance might have something to do with my /usr being mounted read-only
(BTW, it's on the same disk I'm having problems with).

Quick test: reboot (/usr is ro), check speed -> only 8MB/sec, remount
/usr rw - but unfortunately that didn't help, the writing speed remains low.

So it was just an idea. I still don't know what can be done to return the
speed to normal. I don't know if I have mentioned it, but reading from
the same disk always goes at full speed.

Also, I'm pretty sure that I have the same problem on a completely
different machine (at work), which doesn't use the HPT366 but a standard
controller (BX chipset).

So, something might be wrong with my setup, but I'm still unable to
find out what.

I'm compiling with gcc 2.95.4 20011006 (Debian prerelease) from the Debian
unstable distribution. The kernel is completely monolithic (no modules).

Attached is the _relevant_ part of the IDE configuration.


[-- Attachment #2: A --]
[-- Type: text/plain, Size: 2385 bytes --]

#
# ATA/IDE/MFM/RLL support
#
CONFIG_IDE=y

#
# IDE, ATA and ATAPI Block devices
#
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_HD_IDE is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
# CONFIG_BLK_DEV_IDEDISK_VENDOR is not set
# CONFIG_BLK_DEV_IDEDISK_FUJITSU is not set
# CONFIG_BLK_DEV_IDEDISK_IBM is not set
# CONFIG_BLK_DEV_IDEDISK_MAXTOR is not set
# CONFIG_BLK_DEV_IDEDISK_QUANTUM is not set
# CONFIG_BLK_DEV_IDEDISK_SEAGATE is not set
# CONFIG_BLK_DEV_IDEDISK_WD is not set
# CONFIG_BLK_DEV_COMMERIAL is not set
# CONFIG_BLK_DEV_TIVO is not set
# CONFIG_BLK_DEV_IDECS is not set
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
# CONFIG_BLK_DEV_IDESCSI is not set

#
# IDE chipset support/bugfixes
#
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_CMD640_ENHANCED is not set
# CONFIG_BLK_DEV_ISAPNP is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_BLK_DEV_ADMA=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_IDEDMA_PCI_AUTO=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_IDEDMA_PCI_WIP=y
# CONFIG_IDEDMA_NEW_DRIVE_LISTINGS is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_AEC62XX_TUNING is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_WDC_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_AMD74XX_OVERRIDE is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_HPT34X_AUTODMA is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_PIIX is not set
# CONFIG_PIIX_TUNING is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_PDC202XX is not set
# CONFIG_PDC202XX_BURST is not set
# CONFIG_PDC202XX_FORCE is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_CHIPSETS is not set
CONFIG_IDEDMA_AUTO=y
# CONFIG_IDEDMA_IVB is not set
# CONFIG_DMA_NONPCI is not set
# CONFIG_BLK_DEV_IDE_MODES is not set
# CONFIG_BLK_DEV_ATARAID is not set
# CONFIG_BLK_DEV_ATARAID_PDC is not set
# CONFIG_BLK_DEV_ATARAID_HPT is not set

[-- Attachment #3: Type: text/plain, Size: 12 bytes --]


-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: xmm2 - monitor Linux MM active/inactive lists graphically
  2001-10-30  9:26                                 ` Zlatko Calusic
@ 2001-10-30 19:07                                   ` Josh McKinney
  0 siblings, 0 replies; 42+ messages in thread
From: Josh McKinney @ 2001-10-30 19:07 UTC (permalink / raw)
  To: lkml

On approximately Tue, Oct 30, 2001 at 10:26:32AM +0100, Zlatko Calusic wrote:
> 
> Followup on the problem. Yesterday I was upgrading my Debian Linux. To
> do that I had to remount /usr read-write. After the update finished,
> I tested the disk writing speed once again. And there it was, a full
> 22MB/sec (on the same partition). And once I get to that point, the disk
> stays fast. Then I thought (poor man's logic) that the poor
> performance might have something to do with my /usr being mounted read-only
> (BTW, it's on the same disk I'm having problems with).
> 
> Quick test: reboot (/usr is ro), check speed -> only 8MB/sec, remount
> /usr rw - but unfortunately that didn't help, the writing speed remains low.
> 
> So it was just an idea. I still don't know what can be done to return the
> speed to normal. I don't know if I have mentioned it, but reading from
> the same disk always goes at full speed.
> 
> So, something might be wrong with my setup, but I'm still unable to
> find out what.
> 
> I'm compiling with gcc 2.95.4 20011006 (Debian prerelease) from the Debian
> unstable distribution. The kernel is completely monolithic (no modules).
>

I am also seeing some not_so_great performance from my IDE drive.  It
is an IBM 30GB 7200rpm drive on a Promise ATA/100 controller.  I am
also using Debian unstable, but I hope that isn't really the problem.
The vmstat output seems very erratic.  It has large bursts, then really
slow spots.

I am running 2.4.13-ac4, with rik's swapoff patch.

This is the command I used.

time nice -n -20 dd if=/dev/zero of=/mp3/foobara bs=1024k count=1024
1024+0 records in
1024+0 records out
0.02s user 23.95s system 20% cpu 1:57.24 total
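
Averaged out, that is roughly:

    1024MB / 117.24s ~ 8.7MB/sec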

And here is the vmstat output...

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0 112636  18728  78352   0   0    58  2139  145    90   5  10  85
 0  0  3      0 112636  18728  78352   0   0     0    48  115    73   0   5  95
 1  0  0      0  91052  18728  99856   0   0     0    65  135    81   0  40  60
 1  0  1      0  41900  18728 149008   0   0     0 31474  198    32   1  99   0
 0  1  1      0   6104  18756 184748   0   0     0 10608  193    35   1  78  21
 0  1  1      0   3064  18768 187756   0   0     0 10296  189    52   0  35  65
 0  1  5      0   3060  18796 187704   0   0     0 10808  205    62   2  21  77
 0  1  2      0   3064  18948 187412   0   0     0 130132 2654   812   0  23  76
 0  1  2      0   3060  18956 187404   0   0     0 10692  188    54   0  26  74
 1  0  3      0   3060  18964 187392   0   0     0  9818  199    53   1  26  73
 0  1  3      0   3060  18976 187368   0   0     0  7308  186    52   1  23  76
 0  3  2      0   3068  19016 187296   0   0     0 73788 1459   417   0  23  77
 0  3  3      0   3076  19444 186436   0   0     0  6392  188    31   1   9  90
 0  3  3      0   3076  19444 186436   0   0     0 10832  188    15   0   2  98
 0  3  3      0   3076  19444 186436   0   0     0 10536  188    17   0   4  96
 0  3  2      0   3076  19444 186436   0   0     0  6556  191    31   0   1  99
 2  0  0      0   3064  19444 186448   0   0     0 17424  724   120   0   9  91
 1  0  1      0   3064  19444 186896   0   0     0 22724  201    33   2  98   0
 0  1  1      0   3064  19444 186540   0   0     0  9872  187    47   1  46  53
 0  1  1      0   3060  19444 186464   0   0     0  2840  193    57   0  15  85
 0  1  1      0   3060  19444 186464   0   0     0 11088  201    54   0  19  81
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  2      0   3060  19500 186360   0   0     0 118288 2232   604   0  25  75
 0  3  2      0   3060  19500 186360   0   0     0  7748  185    57   1   1  98
 0  3  2      0   3060  19500 186360   0   0     0 11776  184    54   0   2  98
 0  3  2      0   3060  19500 186360   0   0     0  7924  181    52   0   1  99
 0  3  2      0   3060  19500 186360   0   0     0  4776  182    64   0   1  99
 0  3  1      0   3060  19500 186360   0   0     0   952  183    50   0   2  98
 0  3  1      0   3060  19500 186360   0   0     0     0  180    48   0   1  99
 1  0  2      0   3064  19500 186356   0   0     0   296  242    80   0  12  88
 0  1  0      0   3060  19500 186360   0   0     0 19668  730   282   1   7  93
 0  1  0      0   3060  19500 186360   0   0     0  6764  200   133   1   2  97
 1  0  0      0   3064  19500 186356   0   0     0 16680  171   123   1  64  35
 1  0  1      0   3060  19500 186360   0   0     0 26903  193    33   1  99   0
 0  1  1      0   3064  19500 186356   0   0     0  7756  184    40   1  54  45
 0  1  5      0   3060  19500 186360   0   0     0 10915  182    66   0  23  77
 0  3  3      0   3060  19668 186024   0   0     0 105102 2059   629   0  18  82
 0  3  3      0   3060  19668 186024   0   0     0  7772  181    59   0   1  99
 0  4  1      0   3060  19668 185948   0   0     1  7440  187    68   0   2  98
 0  4  1      0   3060  19668 185948   0   0     0   260  187    57   0   2  98
 0  4  1      0   3064  19668 185948   0   0     0     0  181    69   0   1  99
 0  5  2      0   3276  19668 185944   0   0     5   522  180    69   3   5  92
 1  1  2      0   3064  19668 185740   0   0    25 26593  190    96   0  78  22
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  4  4      0   3064  19788 184768   0   0    13 99062 2190   647   0  20  79
 0  4  1      0   3064  19788 184768   0   0     0  2440  218    57   0   3  97
 0  4  1      0   3064  19788 184768   0   0     1     0  217    51   0   0 100
 1  1  5      0   3060  19788 185148   0   0    24 14692  214   108   2  46  52
 0  4  2      0   3064  19824 185068   0   0     8 19965  940   240   0  24  75
 0  4  2      0   3064  19824 185068   0   0     0  3224  293    64   0   0 100
 0  4  2      0   3064  19824 185068   0   0     0  4624  310    64   0   3  97
 0  4  2      0   3140  19824 184864   0   0     4  3316  245    86   1   4  95
 0  4  2      0   3140  19824 184864   0   0     0  2608  252    58   0   1  99
 0  4  2      0   3140  19824 184864   0   0     0  6208  262    58   0   0 100
 1  3  2      0   3144  19824 184864   0   0     0 11864  228    58   1   1  98
 0  4  1      0   3140  19824 184864   0   0     4  1200  241    86   1   5  94
 0  4  3      0   3140  19824 184864   0   0     0    27  182    51   0   1  99
 1  1  3      0   3064  19824 184936   0   0    16 24423  183   104   1  61  38
 0  3  6      0   3064  19940 183904   0   0    69 44339  995   314   1  25  74
 0  3  5      0   3064  19940 183904   0   0     0   100  180    51   0   0 100
 0  3  5      0   3064  19940 183904   0   0     0     0  187    50   0   1  99



-- 
Linux, the choice                | "Shelter," what a nice name for a place
of a GNU generation       -o)    | where you polish your cat. 
Kernel 2.4.13-ac4          /\    | 
on a i586                 _\_v   | 
                                 | 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Zlatko's I/O slowdown status
  2001-10-28 17:30                       ` Zlatko Calusic
  2001-10-28 17:34                         ` Linus Torvalds
  2001-10-28 19:13                         ` Barry K. Nathan
@ 2001-11-02  5:52                         ` Andrea Arcangeli
  2001-11-02 20:14                           ` Zlatko Calusic
  2 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2001-11-02  5:52 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

Hello Zlatko,

I'm not sure how the email thread ended, but I noticed different
unplugging of the I/O queues in mainline (mainline was a little more
overkill than -ac) and also wrong bdflush hysteresis (the pre-wakeup of
bdflush to avoid blocking if the write flood could be sustained by the
bandwidth of the HD was missing, for example).

So you may want to give pre6aa1 a spin and see if it makes any
difference; if it does, I'll know what your problem is
(see the buffer.c part of the vm-10 patch in pre6aa1 for more details).

thanks,

Andrea

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Zlatko's I/O slowdown status
  2001-11-02  5:52                         ` Zlatko's I/O slowdown status Andrea Arcangeli
@ 2001-11-02 20:14                           ` Zlatko Calusic
  2001-11-02 20:16                             ` Jeffrey W. Baker
                                               ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-11-02 20:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

Andrea Arcangeli <andrea@suse.de> writes:

> Hello Zlatko,
> 
> I'm not sure how the email thread ended, but I noticed different
> unplugging of the I/O queues in mainline (mainline was a little more
> overkill than -ac) and also wrong bdflush hysteresis (the pre-wakeup of
> bdflush to avoid blocking if the write flood could be sustained by the
> bandwidth of the HD was missing, for example).

Thank God, today it is finally solved. Just two days ago, I was pretty
sure that the disk had started dying on me, and I didn't know of any
solution for that. Today, while I was about to try your patch, I got
another idea and finally pinpointed the problem.

It was write caching. Somehow the disk was running with its write cache
turned off, and I was getting abysmal write performance. Then I found
'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
shutdown, but I don't understand how it survived through reboots and
restarts! And why only two of the four disks I'm dealing with got confused
by the command. And finally, I don't understand how I could still get full
speed occasionally. Weird!

I would advise users of Debian unstable to comment out that part; I'm sure
it's useless on most if not all setups. You might be pleasantly
surprised with the performance gains (write speed doubles).
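
If one of your disks has been left in that state, something like this should
bring it back (hdparm -i reports the write cache state on most drives, as far
as I can tell):

    hdparm -W1 /dev/hda                        # re-enable the write cache
    hdparm -i /dev/hda | grep -i writecache    # check the current setting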

> 
> So you may want to give pre6aa1 a spin and see if it makes any
> difference; if it does, I'll know what your problem is
> (see the buffer.c part of the vm-10 patch in pre6aa1 for more details).
> 

Thanks for your concern. Eventually I compiled aa1 and it is running
correctly (the whole day at work, and the last hour at home - SMP), although
now I don't see any performance improvements.

I would like to thank all the others who spent time helping me,
especially Linus, Jens and Marcelo - sorry, guys, for taking up your time.
-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Zlatko's I/O slowdown status
  2001-11-02 20:14                           ` Zlatko Calusic
@ 2001-11-02 20:16                             ` Jeffrey W. Baker
  2001-11-02 20:36                               ` John Alvord
  2001-11-02 21:16                               ` Zlatko Calusic
  2001-11-02 20:57                             ` Andrea Arcangeli
  2001-11-02 23:23                             ` Simon Kirby
  2 siblings, 2 replies; 42+ messages in thread
From: Jeffrey W. Baker @ 2001-11-02 20:16 UTC (permalink / raw)
  To: Zlatko Calusic; +Cc: lkml



On 2 Nov 2001, Zlatko Calusic wrote:

> Andrea Arcangeli <andrea@suse.de> writes:
>
> > Hello Zlatko,
> >
> > I'm not sure how the email thread ended, but I noticed different
> > unplugging of the I/O queues in mainline (mainline was a little more
> > overkill than -ac) and also wrong bdflush hysteresis (the pre-wakeup of
> > bdflush to avoid blocking if the write flood could be sustained by the
> > bandwidth of the HD was missing, for example).
>
> Thank God, today it is finally solved. Just two days ago, I was pretty
> sure that the disk had started dying on me, and I didn't know of any
> solution for that. Today, while I was about to try your patch, I got
> another idea and finally pinpointed the problem.
>
> It was write caching. Somehow the disk was running with its write cache
> turned off, and I was getting abysmal write performance. Then I found
> 'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
> shutdown, but I don't understand how it survived through reboots and
> restarts! And why only two of the four disks I'm dealing with got confused
> by the command. And finally, I don't understand how I could still get full
> speed occasionally. Weird!
>
> I would advise users of Debian unstable to comment out that part; I'm sure
> it's useless on most if not all setups. You might be pleasantly
> surprised with the performance gains (write speed doubles).

That's great if you don't mind losing all of your data in a power outage!
What do you think happens if the software thinks data is committed to
permanent storage when in fact it is only in DRAM on the drive?

-jwb


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Zlatko's I/O slowdown status
  2001-11-02 20:16                             ` Jeffrey W. Baker
@ 2001-11-02 20:36                               ` John Alvord
  2001-11-02 21:16                               ` Zlatko Calusic
  1 sibling, 0 replies; 42+ messages in thread
From: John Alvord @ 2001-11-02 20:36 UTC (permalink / raw)
  To: Jeffrey W. Baker; +Cc: Zlatko Calusic, lkml

On Fri, 2 Nov 2001 12:16:40 -0800 (PST), "Jeffrey W. Baker"
<jwbaker@acm.org> wrote:

>
>
>On 2 Nov 2001, Zlatko Calusic wrote:
>
>> Andrea Arcangeli <andrea@suse.de> writes:
>>
>> > Hello Zlatko,
>> >
>> > I'm not sure how the email thread ended, but I noticed different
>> > unplugging of the I/O queues in mainline (mainline was a little more
>> > overkill than -ac) and also wrong bdflush hysteresis (the pre-wakeup of
>> > bdflush to avoid blocking if the write flood could be sustained by the
>> > bandwidth of the HD was missing, for example).
>>
>> Thank God, today it is finally solved. Just two days ago, I was pretty
>> sure that the disk had started dying on me, and I didn't know of any
>> solution for that. Today, while I was about to try your patch, I got
>> another idea and finally pinpointed the problem.
>>
>> It was write caching. Somehow the disk was running with its write cache
>> turned off, and I was getting abysmal write performance. Then I found
>> 'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
>> shutdown, but I don't understand how it survived through reboots and
>> restarts! And why only two of the four disks I'm dealing with got confused
>> by the command. And finally, I don't understand how I could still get full
>> speed occasionally. Weird!
>>
>> I would advise users of Debian unstable to comment out that part; I'm sure
>> it's useless on most if not all setups. You might be pleasantly
>> surprised with the performance gains (write speed doubles).
>
>That's great if you don't mind losing all of your data in a power outage!
>What do you think happens if the software thinks data is committed to
>permanent storage when in fact it is only in DRAM on the drive?

Sounds like switching write-caching off at shutdown is a valid way to
get the data out of cache. But shouldn't it be switched back on again
later?

john

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Zlatko's I/O slowdown status
  2001-11-02 20:14                           ` Zlatko Calusic
  2001-11-02 20:16                             ` Jeffrey W. Baker
@ 2001-11-02 20:57                             ` Andrea Arcangeli
  2001-11-02 23:23                             ` Simon Kirby
  2 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2001-11-02 20:57 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Linus Torvalds, Jens Axboe, Marcelo Tosatti, linux-mm, lkml

On Fri, Nov 02, 2001 at 09:14:14PM +0100, Zlatko Calusic wrote:
> It was write caching. Somehow disk was running with write cache turned

Ah, I was going to ask you to try with:

	/sbin/hdparm -d1 -u1 -W1 -c1 /dev/hda

(my settings; of course not safe for a journaling fs - it's safe to use only
with ext2, and I set -W0 back during /etc/init.d/halt) but I assumed you were
using the same hdparm settings in -ac and mainline. Never mind, good
that it's solved now :).

Andrea

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Zlatko's I/O slowdown status
  2001-11-02 20:16                             ` Jeffrey W. Baker
  2001-11-02 20:36                               ` John Alvord
@ 2001-11-02 21:16                               ` Zlatko Calusic
  1 sibling, 0 replies; 42+ messages in thread
From: Zlatko Calusic @ 2001-11-02 21:16 UTC (permalink / raw)
  To: Jeffrey W. Baker; +Cc: lkml

"Jeffrey W. Baker" <jwbaker@acm.org> writes:

> On 2 Nov 2001, Zlatko Calusic wrote:
> 
> > Andrea Arcangeli <andrea@suse.de> writes:
> >
> > > Hello Zlatko,
> > >
> > > I'm not sure how the email thread ended, but I noticed different
> > > unplugging of the I/O queues in mainline (mainline was a little more
> > > overkill than -ac) and also wrong bdflush hysteresis (the pre-wakeup of
> > > bdflush to avoid blocking if the write flood could be sustained by the
> > > bandwidth of the HD was missing, for example).
> >
> > Thank God, today it is finally solved. Just two days ago, I was pretty
> > sure that the disk had started dying on me, and I didn't know of any
> > solution for that. Today, while I was about to try your patch, I got
> > another idea and finally pinpointed the problem.
> >
> > It was write caching. Somehow the disk was running with its write cache
> > turned off, and I was getting abysmal write performance. Then I found
> > 'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
> > shutdown, but I don't understand how it survived through reboots and
> > restarts! And why only two of the four disks I'm dealing with got confused
> > by the command. And finally, I don't understand how I could still get full
> > speed occasionally. Weird!
> >
> > I would advise users of Debian unstable to comment out that part; I'm sure
> > it's useless on most if not all setups. You might be pleasantly
> > surprised with the performance gains (write speed doubles).
> 
> That's great if you don't mind losing all of your data in a power outage!

That has nothing to do with a power outage; it is only run during
halt/poweroff.

> What do you think happens if the software thinks data is committed to
> permanent storage when in fact it is only in DRAM on the drive?
> 

Bad things, of course. But -W0 won't save you from file corruption when
you have megabytes of data in the page cache, still not synced to disk,
and you suddenly lose power.

Of course, journalling filesystems will change things a bit...
-- 
Zlatko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Zlatko's I/O slowdown status
  2001-11-02 20:14                           ` Zlatko Calusic
  2001-11-02 20:16                             ` Jeffrey W. Baker
  2001-11-02 20:57                             ` Andrea Arcangeli
@ 2001-11-02 23:23                             ` Simon Kirby
  2001-11-02 23:37                               ` Miquel van Smoorenburg
  2 siblings, 1 reply; 42+ messages in thread
From: Simon Kirby @ 2001-11-02 23:23 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Andrea Arcangeli, Linus Torvalds, Jens Axboe, Marcelo Tosatti,
	linux-mm, lkml

On Fri, Nov 02, 2001 at 09:14:14PM +0100, Zlatko Calusic wrote:

> Thank God, today it is finally solved. Just two days ago, I was pretty
> sure that the disk had started dying on me, and I didn't know of any
> solution for that. Today, while I was about to try your patch, I got
> another idea and finally pinpointed the problem.
> 
> It was write caching. Somehow the disk was running with its write cache
> turned off, and I was getting abysmal write performance. Then I found
> 'hdparm -W0 /proc/ide/hd*' in /etc/init.d/umountfs, which is run during
> shutdown, but I don't understand how it survived through reboots and
> restarts! And why only two of the four disks I'm dealing with got confused
> by the command. And finally, I don't understand how I could still get full
> speed occasionally. Weird!
> 
> I would advise users of Debian unstable to comment out that part; I'm sure
> it's useless on most if not all setups. You might be pleasantly
> surprised with the performance gains (write speed doubles).

Aha!  That would explain why I was seeing it as well... and why I was
seeing errors from hdparm for /dev/hdc and /dev/hdd, which are CDROMs.

Argh. :)

If they have hdparm -W 0 at shutdown, there should be a -W 1 during
startup.
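
Something along these lines in a boot script would do it (an untested
sketch; the device detection here is only a guess):

    for d in /proc/ide/hd?; do
        [ -e "$d" ] || continue
        hdparm -W1 "/dev/${d##*/}" 2>/dev/null
    done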

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[       sim@stormix.com       ][       sim@netnation.com        ]
[ Opinions expressed are not necessarily those of my employers. ]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Zlatko's I/O slowdown status
  2001-11-02 23:23                             ` Simon Kirby
@ 2001-11-02 23:37                               ` Miquel van Smoorenburg
  0 siblings, 0 replies; 42+ messages in thread
From: Miquel van Smoorenburg @ 2001-11-02 23:37 UTC (permalink / raw)
  To: linux-kernel

In article <20011102152349.B17362@netnation.com>,
Simon Kirby  <sim@netnation.com> wrote:
>If they have hdparm -W 0 at shutdown, there should be a -W 1 during
>startup.

Well no. It should be set back to the 'power-on default' on startup.
But there is no way to do that.

Mike.
-- 
"Only two things are infinite, the universe and human stupidity,
 and I'm not sure about the former" -- Albert Einstein.


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2001-11-02 23:37 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-10-24 10:42 xmm2 - monitor Linux MM active/inactive lists graphically Zlatko Calusic
2001-10-24 14:26 ` Marcelo Tosatti
2001-10-25  0:25   ` Zlatko Calusic
2001-10-25  1:50     ` Simon Kirby
2001-10-25  4:19     ` Linus Torvalds
2001-10-25  4:57       ` Linus Torvalds
2001-10-25 12:48         ` Zlatko Calusic
2001-10-25 16:31           ` Linus Torvalds
2001-10-25 17:33             ` Jens Axboe
2001-10-26  9:45             ` Zlatko Calusic
2001-10-26 10:08             ` Zlatko Calusic
2001-10-26 14:39               ` Jens Axboe
2001-10-26 14:57                 ` Zlatko Calusic
2001-10-26 15:01                   ` Jens Axboe
2001-10-26 16:04                   ` Linus Torvalds
2001-10-26 16:57                   ` Linus Torvalds
2001-10-26 17:19                     ` Linus Torvalds
2001-10-28 17:30                       ` Zlatko Calusic
2001-10-28 17:34                         ` Linus Torvalds
2001-10-28 17:48                           ` Alan Cox
2001-10-28 17:59                             ` Linus Torvalds
2001-10-28 18:22                               ` Alan Cox
2001-10-28 18:46                                 ` Linus Torvalds
2001-10-28 19:29                                   ` Alan Cox
2001-10-28 22:04                                     ` Linus Torvalds
2001-10-28 18:56                               ` Andrew Morton
2001-10-30  8:56                               ` Jens Axboe
2001-10-30  9:26                                 ` Zlatko Calusic
2001-10-30 19:07                                   ` Josh McKinney
2001-10-28 19:13                         ` Barry K. Nathan
2001-10-28 21:42                           ` Jonathan Morton
2001-11-02  5:52                         ` Zlatko's I/O slowdown status Andrea Arcangeli
2001-11-02 20:14                           ` Zlatko Calusic
2001-11-02 20:16                             ` Jeffrey W. Baker
2001-11-02 20:36                               ` John Alvord
2001-11-02 21:16                               ` Zlatko Calusic
2001-11-02 20:57                             ` Andrea Arcangeli
2001-11-02 23:23                             ` Simon Kirby
2001-11-02 23:37                               ` Miquel van Smoorenburg
2001-10-27 13:14               ` xmm2 - monitor Linux MM active/inactive lists graphically Giuliano Pochini
2001-10-28  5:05                 ` Mike Fedyk
2001-10-25  9:07       ` Zlatko Calusic

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).