linux-kernel.vger.kernel.org archive mirror
* [RFC][DATA] re "ongoing vm suckage"
@ 2001-08-03 23:44 Ben LaHaise
  2001-08-04  1:29 ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Ben LaHaise @ 2001-08-03 23:44 UTC (permalink / raw)
  To: torvalds, linux-kernel, linux-mm

Hey folks,

I've been doing a bunch of analysis this week as to what exactly is going
wrong with the "vm system" that causes interactive performance to suffer,
especially on larger systems.  IMO, far too much tinkering of code is
going on currently without hard data (other than "it looks good"), and
this is exacerbating the problems.  Towards this end, here is some hard
data, some analysis, a small patch and a BUG report (I'm about to leave
the office today, so I wanted to send out a status-thus-far).

The data below [see bottom] comes from some Ugly Hacks (tm) to measure how
long the kernel sleeps in a schedule() call on a particular event.  All
times are in jiffies and follow the HERE symbol, which comes from inserting
the line into System.map and doing a grep -1 HERE.
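
For illustration, a minimal sketch of the kind of hack involved (the real
thing is uglier; the helper and counter names here are made up, not the
actual instrumentation):

/* Hypothetical sketch: accumulate, per sleep site, the jiffies spent
 * inside schedule().  The HERE symbol lands in System.map next to its
 * caller, so "grep -1 HERE System.map" pairs each counter with the
 * function that slept. */
static unsigned long HERE_total;	/* jiffies slept at this site */

static inline void instrumented_schedule(void)
{
	unsigned long before = jiffies;

	schedule();
	HERE_total += jiffies - before;
}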

The workload consists of using dd if=/dev/zero of=/tmp/foo bs=1M and then
attempting to perform some interactive work on the machine while dd is
running.  I'm working on a 2.4.7 kernel, but this should apply equally
well to other kernels nearby.  The machine being used is a 4way x86 with
4GB of memory.


On to some analysis:  the first thing that should stick out below is that
an awfully large amount of time is spent waiting in bread and
read_cache_page.  The time looks to roughly correspond to that spent
waiting for the various interactive tasks to respond to input during a
stall.  Most of the interactive tasks consisted of poking around the
filesystem and loading programs like grep and datafiles that weren't in
the cache since the system was freshly booted.

(Side note here: vmstat/sshd didn't get kicked out of memory until a
minute or two *after* all io throughput broke down chaotically.  Sorry, I
don't have the vmstat output, but run it for yourself: it's reproducible.)

In any case, the fact that the io system is saturated with write requests
means that reads are going to take longer.  The current code in
ll_rw_blk.c sets the maximum amount of memory in the io queue to 2/3 of
ram; on a 4GB box that is roughly 2.7GB of queued writes, more io than
the machine can do in 30s even at the ~80MB/s the disks are capable of.
With that line of reasoning, I applied the following patch:

--- v2.4.7/drivers/block/ll_rw_blk.c	Sun Jul 22 19:17:15 2001
+++ vm-2.4.7/drivers/block/ll_rw_blk.c	Fri Aug  3 17:52:39 2001
@@ -1176,9 +1176,11 @@
 	 * use half of RAM
 	 */
 	high_queued_sectors = (total_ram * 2) / 3;
+	if (high_queued_sectors > MB(4))
+		high_queued_sectors = MB(4);
 	low_queued_sectors = high_queued_sectors / 3;
-	if (high_queued_sectors - low_queued_sectors > MB(128))
-		low_queued_sectors = high_queued_sectors - MB(128);
+	if (high_queued_sectors - low_queued_sectors > MB(1))
+		low_queued_sectors = high_queued_sectors - MB(1);


 	/*


Seems like it would help, eh?  Indeed, it did help a little bit.  That is,
until dd ground to a complete halt.  That said, the rest of the system did
not.  In fact, running programs that triggered disk io resulted in writes
being flushed to disk.  [at this point I'm referring to the second data set
below]  Interesting: the bulk of the waits are once more in bread,
with block_prepare_write, do_generic_file_read and ll_rw_block coming in at
1/3-1/5 the time spent waiting.  Hypothesis: one of the block writeout
paths is missing a run_task_queue(&tq_disk);.  [off to read code and try a
patch]  bread spends most of its time in wait_on_buffer, which runs
tq_disk.  Uninteresting.  __block_prepare_write calls wait_on_buffer.
Uninteresting.  do_generic_file_read calls lock_page and wait_on_page.
They both play by the rules and run tq_disk.  Let's look at ll_rw_block():
ah, we have a wait_event call...

Okay, let's try the following patch:

--- vm-2.4.7/drivers/block/ll_rw_blk.c.2	Fri Aug  3 19:06:46 2001
+++ vm-2.4.7/drivers/block/ll_rw_blk.c	Fri Aug  3 19:32:46 2001
@@ -1037,9 +1037,17 @@
 		 * water mark. instead start I/O on the queued stuff.
 		 */
 		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
-			run_task_queue(&tq_disk);
-			wait_event(blk_buffers_wait,
-			 atomic_read(&queued_sectors) < low_queued_sectors);
+			DECLARE_WAITQUEUE(wait, current);
+
+			add_wait_queue(&blk_buffers_wait, &wait);
+			do {
+				run_task_queue(&tq_disk);
+				set_current_state(TASK_UNINTERRUPTIBLE);
+				if (atomic_read(&queued_sectors) >= low_queued_sectors)
+					schedule();
+			} while (atomic_read(&queued_sectors) >= low_queued_sectors);
+			set_current_state(TASK_RUNNING);
+			remove_wait_queue(&blk_buffers_wait, &wait);
 		}

 		/* Only one thread can actually submit the I/O. */


Compiles fine, let's see what effect it has on io...

bah.  Doesn't fix it.  Still waiting indefinitely in ll_rw_blk().  At
least this is a bit of a narrower scope for the problem.  More later,
unless someone else wants to debug it before tomorrow ;-)

		-ben



==================first data set==================
c01a8590 t wait_til_done
c01a8612 HERE			2
c01a86d0 t generic_done

c0122820 t context_thread
c0122927 HERE			80213
c0122a20 T flush_scheduled_tasks

c01384d0 T bread
c0138515 HERE			143815
c0138540 t get_unused_buffer_head

c0229190 t write_disk_sb
c0229343 HERE			2
c0229370 t set_this_disk

c0129a50 T read_cache_page
c0129b2b HERE			44123
c0129c10 T grab_cache_page

c0128510 T filemap_nopage
c012887b HERE			3157
c0128a90 T filemap_sync

c0190030 T tty_wait_until_sent
c0190092 HERE			0
c0190100 t unset_locked_termios

c01273e0 T ___wait_on_page
c0127418 HERE			2075
c0127490 t __lock_page

c01194e0 T sys_wait4
c011954d HERE			1394
c01198c0 T sys_waitpid

c0140c90 t pipe_poll
c0140cba HERE			81513
c0140d00 t pipe_release

c01406c0 T pipe_wait
c0140720 HERE			156
c0140770 t pipe_read

c013d830 T block_read
c013dd36 HERE			232
c013ddf0 t block_llseek

c0105b30 T __down
c0105b6b HERE			0
c0105c00 T __down_interruptible

c0265d90 t unix_poll
c0265db6 HERE			73989
c0265e30 t unix_read_proc

c0136ee0 t write_locked_buffers
c0136f27 HERE			1056
c0136f40 t write_unlocked_buffers

c0137090 t wait_for_locked_buffers
c0137119 HERE			5
c0137140 t sync_buffers

c0127a20 T do_generic_file_read
c0127c98 HERE			802
c0127f20 T file_read_actor

c0138ba0 t __block_prepare_write
c0138df7 HERE			132392
c0138e20 t __block_commit_write

c025f2e0 t inet_wait_for_connect
c025f33e HERE			1
c025f4a0 T inet_stream_connect

c0233720 T datagram_poll
c0233747 HERE			154105
c0233810 t scm_fp_copy

c0122a20 T flush_scheduled_tasks
c0122a7e HERE			0
c0122ab0 T start_context_thread

c0243a30 T tcp_poll
c0243a5e HERE			76841
c0243b80 T tcp_write_space

c026d4e0 t rpciod
c026d66c HERE			77064
c026d760 t rpciod_killall

c026c720 t __rpc_execute
c026c90a HERE			1
c026ca50 T rpc_execute

c018fce0 t write_chan
c018fda2 HERE			2857
c018fef0 t normal_poll

c0246800 t tcp_data_wait
c024685b HERE			60957
c0246990 t tcp_prequeue_process

c015ad30 t ext2_update_inode
c015b0f4 HERE			1
c015b150 T ext2_write_inode

c013a210 t sync_page_buffers
c013a251 HERE			6372
c013a280 T try_to_free_buffers


========================second data set==========================
c01a8590 t wait_til_done
c01a8612 HERE			2
c01a86d0 t generic_done

c0122820 t context_thread
c0122927 HERE			3439
c0122a20 T flush_scheduled_tasks

c01384d0 T bread
c0138515 HERE			1401380
c0138540 t get_unused_buffer_head

c0229190 t write_disk_sb
c0229343 HERE			1
c0229370 t set_this_disk

c0129a50 T read_cache_page
c0129b2b HERE			327
c0129c10 T grab_cache_page

c0128510 T filemap_nopage
c012887b HERE			23650
c0128a90 T filemap_sync

c0190030 T tty_wait_until_sent
c0190092 HERE			0
c0190100 t unset_locked_termios

c01273e0 T ___wait_on_page
c0127418 HERE			291
c0127490 t __lock_page

c01194e0 T sys_wait4
c011954d HERE			1335
c01198c0 T sys_waitpid

c0140c90 t pipe_poll
c0140cba HERE			244933
c0140d00 t pipe_release

c01406c0 T pipe_wait
c0140720 HERE			166
c0140770 t pipe_read

c013d830 T block_read
c013dd36 HERE			236
c013ddf0 t block_llseek

c0105b30 T __down
c0105b6b HERE			0
c0105c00 T __down_interruptible

c0265d90 t unix_poll
c0265db6 HERE			244143
c0265e30 t unix_read_proc

c0136ee0 t write_locked_buffers
c0136f27 HERE			38
c0136f40 t write_unlocked_buffers

c0137090 t wait_for_locked_buffers
c0137119 HERE			5
c0137140 t sync_buffers

c0127a20 T do_generic_file_read
c0127c98 HERE			244722
c0127f20 T file_read_actor

c0138ba0 t __block_prepare_write
c0138df7 HERE			485918
c0138e20 t __block_commit_write

c025f2e0 t inet_wait_for_connect
c025f33e HERE			1
c025f4a0 T inet_stream_connect

c0233720 T datagram_poll
c0233747 HERE			276902
c0233810 t scm_fp_copy

c0122a20 T flush_scheduled_tasks
c0122a7e HERE			0
c0122ab0 T start_context_thread

c0243a30 T tcp_poll
c0243a5e HERE			3735
c0243b80 T tcp_write_space

c0246800 t tcp_data_wait
c024685b HERE			240523
c0246990 t tcp_prequeue_process

c026d4e0 t rpciod
c026d66c HERE			2143
c026d760 t rpciod_killall

c026c720 t __rpc_execute
c026c90a HERE			19
c026ca50 T rpc_execute

c018fce0 t write_chan
c018fda2 HERE			580
c018fef0 t normal_poll

c01a47f0 T ll_rw_block
c01a492a HERE			233343
c01a4a40 T end_that_request_first




* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-03 23:44 [RFC][DATA] re "ongoing vm suckage" Ben LaHaise
@ 2001-08-04  1:29 ` Rik van Riel
  2001-08-04  3:06   ` Daniel Phillips
  0 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 2001-08-04  1:29 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: torvalds, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Ben LaHaise wrote:

> --- vm-2.4.7/drivers/block/ll_rw_blk.c.2	Fri Aug  3 19:06:46 2001
> +++ vm-2.4.7/drivers/block/ll_rw_blk.c	Fri Aug  3 19:32:46 2001
> @@ -1037,9 +1037,16 @@
>  		 * water mark. instead start I/O on the queued stuff.
>  		 */
>  		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
> -			run_task_queue(&tq_disk);
> -			wait_event(blk_buffers_wait,
> -			 atomic_read(&queued_sectors) < low_queued_sectors);

... OUCH ...

> bah.  Doesn't fix it.  Still waiting indefinitely in ll_rw_blk().

And it's obvious why.

The code above, as well as your replacement, has a
VERY serious "fairness issue".

	task 1			task 2

 queued_sectors > high
   ==> waits for
   queued_sectors < low

                             write stuff, submits IO
                             queued_sectors < high  (but > low)
                             ....
                             queued sectors still < high, > low
                             happily submits more IO
                             ...
                             etc..

It is quite obvious that the second task can easily starve
the first task as long as it keeps submitting IO at a rate
where queued_sectors will stay above low_queued_sectors,
but under high_queued_sectors.

There are two possible solutions to the starvation scenario:

1) have one threshold
2) if one task is sleeping, let ALL tasks sleep
   until we reach the lower threshold

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  1:29 ` Rik van Riel
@ 2001-08-04  3:06   ` Daniel Phillips
  2001-08-04  3:13     ` Linus Torvalds
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Phillips @ 2001-08-04  3:06 UTC (permalink / raw)
  To: Rik van Riel, Ben LaHaise; +Cc: torvalds, linux-kernel, linux-mm

On Saturday 04 August 2001 03:29, Rik van Riel wrote:
> On Fri, 3 Aug 2001, Ben LaHaise wrote:
> > --- vm-2.4.7/drivers/block/ll_rw_blk.c.2	Fri Aug  3 19:06:46 2001
> > +++ vm-2.4.7/drivers/block/ll_rw_blk.c	Fri Aug  3 19:32:46 2001
> > @@ -1037,9 +1037,16 @@
> >  		 * water mark. instead start I/O on the queued stuff.
> >  		 */
> >  		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
> > -			run_task_queue(&tq_disk);
> > -			wait_event(blk_buffers_wait,
> > -			 atomic_read(&queued_sectors) < low_queued_sectors);
>
> ... OUCH ...
>
> > bah.  Doesn't fix it.  Still waiting indefinitely in ll_rw_blk().
>
> And it's obvious why.
>
> The code above, as well as your replacement, has a
> VERY serious "fairness issue".
>
> 	task 1			task 2
>
>  queued_sectors > high
>    ==> waits for
>    queued_sectors < low
>
>                              write stuff, submits IO
>                              queued_sectors < high  (but > low)
>                              ....
>                              queued sectors still < high, > low
>                              happily submits more IO
>                              ...
>                              etc..
>
> It is quite obvious that the second task can easily starve
> the first task as long as it keeps submitting IO at a rate
> where queued_sectors will stay above low_queued_sectors,
> but under high_queued_sectors.

Nice shooting, this could explain the effect I noticed where
writing a linker file takes 8 times longer when competing with
a simultaneous grep.

> There are two possible solutions to the starvation scenario:
>
> 1) have one threshold
> 2) if one task is sleeping, let ALL tasks sleep
>    until we reach the lower threshold

Umm.... Hmm, there are lots more solutions than that, but those two
are nice and simple.  A quick test for (1) I hope Ben will try is
just to set high_queued_sectors = low_queued_sectors.

Currently, IO scheduling relies on the "random" algorithm for fairness
where the randomness is supplied by the processes.  This breaks down
sometimes, spectacularly, for some distinctly non-random access
patterns as you demonstrated.

Algorithm (2) above would have some potentially strange interactions
with the scheduler; it looks scary.  (E.g., change the scheduler, and IO
on some people's machines suddenly goes to hell.)

Come to think of it (1) will also suffer in some cases from nonrandom
scheduling.

Now let me see, why do we even have the high+low thresholds?  I
suppose it is to avoid taking two context switches on every submitted
block, so it seems like a good idea.

For IO fairness I think we need something a little more deterministic.
I'm thinking about an IO quantum right now - when a task has used up
its quantum it yields to the next task, if any, waiting on the IO
queue.  How to preserve the effect of the high+low thresholds... it
needs more thinking, though I've already thought of several ways of
doing it badly :-)
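
To make the idea concrete, a very rough sketch -- all the names below
(IO_QUANTUM, io_turn_wait, charge_io_quantum, the task field) are
hypothetical, and locking is ignored:

#define IO_QUANTUM	256	/* sectors per turn -- 128kB, say */

static DECLARE_WAIT_QUEUE_HEAD(io_turn_wait);

/* Hypothetical: charge each submission against the task's quantum;
 * once the quantum is used up, refresh it and hand the IO queue to
 * the next task (if any) waiting for its turn. */
static void charge_io_quantum(struct task_struct *tsk, int nsects)
{
	tsk->io_quantum -= nsects;	/* hypothetical task field */
	if (tsk->io_quantum <= 0) {
		tsk->io_quantum = IO_QUANTUM;
		if (waitqueue_active(&io_turn_wait))
			wake_up(&io_turn_wait);	/* next task's turn */
	}
}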

--
Daniel


* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:06   ` Daniel Phillips
@ 2001-08-04  3:13     ` Linus Torvalds
  2001-08-04  3:23       ` Rik van Riel
  2001-08-04  3:26       ` Ben LaHaise
  0 siblings, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  3:13 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Ben LaHaise, linux-kernel, linux-mm


On Sat, 4 Aug 2001, Daniel Phillips wrote:
>
> Nice shooting, this could explain the effect I noticed where
> writing a linker file takes 8 times longer when competing with
> a simultaneous grep.

Just remove that whole logic - it's silly and broken. That's _not_ where
the logic should be anyway.

The whole "we don't want to have too many queued requests" logic in that
place is just stupid. Let's go through this:

 - we have read requests, and we have write requests.
 - we _NEVER_ want to have a read request trigger this logic. When we
   start a read, we'll eventually wait on it, so readers will always
   throttle themselves. If readers do huge amounts of read-ahead, that's
   still ok. We're much better off just blocking in the request allocation
   layer.
 - writers are different. Writers write in big chunks, and they should
   wait for themselves, not on others. See write_locked_buffers() in
   recent kernels: that makes "sync()" a very nice player. It just waits
   every NRSYNC blocks (for "sync", NRSYNC is a low 32 buffers, which is
   just 128kB at a time. That's fine, because "sync" is not performance
   critical. Other writeouts might want to have slightly bigger blocking
   factors).

Agreed? Let's just remove the broken code in ll_rw_block() - it's not as
if most people even _use_ ll_rw_block() for writing at all any more.

(Yeah, fsync_inode_buffers() does, and would probably speed up by using
the same approach "sync" does - it not only gives nicer behaviour under
load, it also reduces spinlock contention and CPU usage by a LOT).

Oh, and "flush_dirty_buffers()" is _really_ broken. I wanted to clean that
up to use the sync code too, but I was too lazy.
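
For reference, the batching pattern being referred to looks roughly like
this (simplified from the 2.4 write_locked_buffers()/sync path -- an
approximation, not the exact kernel code):

#define NRSYNC 32	/* 32 buffers -- 128kB at a time with 4kB blocks */

static void write_buffer_batch(struct buffer_head **array, int count)
{
	int i;

	/* start IO on the whole batch... */
	for (i = 0; i < count; i++)
		submit_bh(WRITE, array[i]);
	run_task_queue(&tq_disk);
	/* ...and only then wait, so the writer throttles itself every
	 * NRSYNC blocks instead of flooding the queue */
	for (i = 0; i < count; i++)
		wait_on_buffer(array[i]);
}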

> Umm.... Hmm, there are lots more solutions than that, but those two
> are nice and simple.  A quick test for (1) I hope Ben will try is
> just to set high_queued_sectors = low_queued_sectors.

Please just remove the code instead. I don't think it buys you anything.

		Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:13     ` Linus Torvalds
@ 2001-08-04  3:23       ` Rik van Riel
  2001-08-04  3:35         ` Linus Torvalds
  2001-08-04  3:26       ` Ben LaHaise
  1 sibling, 1 reply; 40+ messages in thread
From: Rik van Riel @ 2001-08-04  3:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Ben LaHaise, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Linus Torvalds wrote:

> Please just remove the code instead. I don't think it buys you anything.

IIRC you applied the patch introducing that logic because it
gave a 25% performance increase under some write intensive
loads (or something like that).

Or are you telling us now that there wasn't a reason at all
you applied that code to your tree in the first place? ;))

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:13     ` Linus Torvalds
  2001-08-04  3:23       ` Rik van Riel
@ 2001-08-04  3:26       ` Ben LaHaise
  2001-08-04  3:34         ` Rik van Riel
                           ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Ben LaHaise @ 2001-08-04  3:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Linus Torvalds wrote:

> Please just remove the code instead. I don't think it buys you anything.

No.  Here's the bug in the block layer that was causing the throttling not
to work.  Leave the logic in, it has good reason -- think of batching of
io, where you don't want to add just one page at a time.  Bah.  Get some
diagnostics to back up the assertions first; that's the whole point I'm
arguing for.

See, after applying this patch, it no longer deadlocks on io.  The jerky
interactive performance still exists, but it's now sync_page_buffers
that's waiting too long.  That can be fixed by waiting for writes to
complete, which blk_buffers_wait is quite useful for.

		-ben

diff -ur v2.4.7/drivers/block/ll_rw_blk.c vm-2.4.7/drivers/block/ll_rw_blk.c
--- v2.4.7/drivers/block/ll_rw_blk.c	Sun Jul 22 19:17:15 2001
+++ vm-2.4.7/drivers/block/ll_rw_blk.c	Fri Aug  3 20:03:39 2001
@@ -122,14 +122,14 @@
  * queued sectors for all devices, used to make sure we don't fill all
  * of memory with locked buffers
  */
+DECLARE_WAIT_QUEUE_HEAD(blk_buffers_wait);
 atomic_t queued_sectors;

 /*
  * high and low watermark for above
  */
-static int high_queued_sectors, low_queued_sectors;
+int high_queued_sectors, low_queued_sectors;
 static int batch_requests, queue_nr_requests;
-static DECLARE_WAIT_QUEUE_HEAD(blk_buffers_wait);

 static inline int get_max_sectors(kdev_t dev)
 {
diff -ur v2.4.7/include/linux/blkdev.h vm-2.4.7/include/linux/blkdev.h
--- v2.4.7/include/linux/blkdev.h	Fri Aug  3 16:07:23 2001
+++ vm-2.4.7/include/linux/blkdev.h	Fri Aug  3 20:04:07 2001
@@ -176,7 +176,9 @@

 extern int * max_segments[MAX_BLKDEV];

+extern wait_queue_head_t blk_buffers_wait;
 extern atomic_t queued_sectors;
+extern int low_queued_sectors;

 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
@@ -205,12 +207,15 @@
 		return 512;
 }

-#define blk_finished_io(nsects)				\
+#define blk_finished_io(nsects) do {			\
 	atomic_sub(nsects, &queued_sectors);		\
 	if (atomic_read(&queued_sectors) < 0) {		\
 		printk("block: queued_sectors < 0\n");	\
 		atomic_set(&queued_sectors, 0);		\
-	}
+	}						\
+	if (atomic_read(&queued_sectors) < low_queued_sectors) \
+		wake_up(&blk_buffers_wait);		\
+} while (0)

 #define blk_started_io(nsects)				\
 	atomic_add(nsects, &queued_sectors);



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:26       ` Ben LaHaise
@ 2001-08-04  3:34         ` Rik van Riel
  2001-08-04  3:38         ` Linus Torvalds
  2001-08-04  3:48         ` Linus Torvalds
  2 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2001-08-04  3:34 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Linus Torvalds, Daniel Phillips, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Ben LaHaise wrote:

> See, after applying this patch, it no longer deadlocks on io.  The
> jerky interactive performance still exists,

Would something like this help ?

(yes, there's a small SMP race, but since the system survives
the starvation bug today that isn't critical)


--- ./ll_rw_blk.c.batch	Sat Aug  4 00:30:55 2001
+++ ./ll_rw_blk.c	Sat Aug  4 00:33:48 2001
@@ -1031,15 +1031,19 @@

 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
+		static int queued_sector_waiters;

 		/*
 		 * don't lock any more buffers if we are above the high
 		 * water mark. instead start I/O on the queued stuff.
 		 */
-		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
+		if (atomic_read(&queued_sectors) >= high_queued_sectors
+				|| queued_sector_waiters) {
 			run_task_queue(&tq_disk);
+			queued_sector_waiters = 1;
 			wait_event(blk_buffers_wait,
 			 atomic_read(&queued_sectors) < low_queued_sectors);
+			queued_sector_waiters = 0;
 		}

 		/* Only one thread can actually submit the I/O. */


Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:23       ` Rik van Riel
@ 2001-08-04  3:35         ` Linus Torvalds
  0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  3:35 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Ben LaHaise, linux-kernel, linux-mm


On Sat, 4 Aug 2001, Rik van Riel wrote:
> On Fri, 3 Aug 2001, Linus Torvalds wrote:
>
> > Please just remove the code instead. I don't think it buys you anything.
>
> IIRC you applied the patch introducing that logic because it
> gave a 25% performance increase under some write intensive
> loads (or something like that).

That's the batching code, which is somewhat intertwined with the same
code.

The batching code is a separate issue: when we free the requests, we don't
actually make them available as they get free'd (because then the waiters
will trickle out new requests one at a time and cannot do any merging
etc).
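
Roughly, the batching works like this -- pending_freelist and
batch_requests are the real names from the patches in this thread, but
the counter and wakeup details in this sketch are approximate:

/* A completed request parks on a pending list; only when a whole batch
 * has piled up does everything move back to the real freelist and the
 * waiters get woken, so a waiter sees a burst of free slots it can
 * merge into rather than a trickle of one slot at a time. */
static void free_request_batched(request_queue_t *q, struct request *req, int rw)
{
	list_add(&req->table, &q->pending_freelist[rw]);
	if (++q->pending_free[rw] < batch_requests)
		return;				/* keep accumulating */
	list_splice(&q->pending_freelist[rw], &q->request_freelist[rw]);
	INIT_LIST_HEAD(&q->pending_freelist[rw]);
	q->pending_free[rw] = 0;
	wake_up(&q->wait_for_request);
}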

Also, the throttling code probably _did_ make behaviour nicer back when
"sync()" used to use ll_rw_block().  Of course, now most of the IO layer
actually uses "submit_bh()" and bypasses this code completely, so only the
ones that still use it get hit by the unfairness. What a double whammy ;)

		Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:26       ` Ben LaHaise
  2001-08-04  3:34         ` Rik van Riel
@ 2001-08-04  3:38         ` Linus Torvalds
  2001-08-04  3:48         ` Linus Torvalds
  2 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  3:38 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm


On Fri, 3 Aug 2001, Ben LaHaise wrote:
>
> No.  Here's the bug in the block layer that was causing the throttling not
> to work.  Leave the logic in, it has good reason -- think of batching of
> io, where you don't want to add just one page at a time.

I absolutely agree on the batching, but this has nothing to do with
batching. The batching code uses "batch_requests", and the fact that we
free the finished requests to another area.

The ll_rw_block() code really _is_ broken. As proven by the fact that it
doesn't even get invoked most of the time.. And the times it _does_ get
invoked are exactly when it shouldn't (guess what the biggest user of
"ll_rw_block()" tends to be? "bread()")

		Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:26       ` Ben LaHaise
  2001-08-04  3:34         ` Rik van Riel
  2001-08-04  3:38         ` Linus Torvalds
@ 2001-08-04  3:48         ` Linus Torvalds
  2001-08-04  4:14           ` Ben LaHaise
  2 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  3:48 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 411 bytes --]


You might try this approach instead, which just removes the thing that
might deadlock and always is unfair..

(Ugh, I hate attachments, but the system I'm sending this from has this
broken version of 'pine' that will mess up white-space).

For nicer interactive behaviour while flushing things out, the
inode_fsync() thing should really use "write_locked_buffers()". That's a
separate patch, though.

		Linus

[-- Attachment #2: Type: TEXT/PLAIN, Size: 3916 bytes --]

diff -u --recursive --new-file pre3/linux/drivers/block/ll_rw_blk.c linux/drivers/block/ll_rw_blk.c
--- pre3/linux/drivers/block/ll_rw_blk.c	Thu Jul 19 20:51:23 2001
+++ linux/drivers/block/ll_rw_blk.c	Fri Aug  3 20:28:43 2001
@@ -119,17 +119,10 @@
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * queued sectors for all devices, used to make sure we don't fill all
- * of memory with locked buffers
+ * How many requests do we allocate per queue,
+ * and how many do we "batch" on freeing them?
  */
-atomic_t queued_sectors;
-
-/*
- * high and low watermark for above
- */
-static int high_queued_sectors, low_queued_sectors;
-static int batch_requests, queue_nr_requests;
-static DECLARE_WAIT_QUEUE_HEAD(blk_buffers_wait);
+static int queue_nr_requests, batch_requests;
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -592,13 +585,6 @@
 	 */
 	if (q) {
 		/*
-		 * we've released enough buffers to start I/O again
-		 */
-		if (waitqueue_active(&blk_buffers_wait)
-		    && atomic_read(&queued_sectors) < low_queued_sectors)
-			wake_up(&blk_buffers_wait);
-
-		/*
 		 * Add to pending free list and batch wakeups
 		 */
 		list_add(&req->table, &q->pending_freelist[rw]);
@@ -1032,16 +1018,6 @@
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
 
-		/*
-		 * don't lock any more buffers if we are above the high
-		 * water mark. instead start I/O on the queued stuff.
-		 */
-		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
-			run_task_queue(&tq_disk);
-			wait_event(blk_buffers_wait,
-			 atomic_read(&queued_sectors) < low_queued_sectors);
-		}
-
 		/* Only one thread can actually submit the I/O. */
 		if (test_and_set_bit(BH_Lock, &bh->b_state))
 			continue;
@@ -1168,26 +1144,9 @@
 	memset(max_readahead, 0, sizeof(max_readahead));
 	memset(max_sectors, 0, sizeof(max_sectors));
 
-	atomic_set(&queued_sectors, 0);
 	total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
 
 	/*
-	 * Try to keep 128MB max hysteris. If not possible,
-	 * use half of RAM
-	 */
-	high_queued_sectors = (total_ram * 2) / 3;
-	low_queued_sectors = high_queued_sectors / 3;
-	if (high_queued_sectors - low_queued_sectors > MB(128))
-		low_queued_sectors = high_queued_sectors - MB(128);
-
-
-	/*
-	 * make it sectors (512b)
-	 */
-	high_queued_sectors <<= 1;
-	low_queued_sectors <<= 1;
-
-	/*
 	 * Scale free request slots per queue too
 	 */
 	total_ram = (total_ram + MB(32) - 1) & ~(MB(32) - 1);
@@ -1200,10 +1159,7 @@
 	if ((batch_requests = queue_nr_requests >> 3) > 32)
 		batch_requests = 32;
 
-	printk("block: queued sectors max/low %dkB/%dkB, %d slots per queue\n",
-						high_queued_sectors / 2,
-						low_queued_sectors / 2,
-						queue_nr_requests);
+	printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
 
 #ifdef CONFIG_AMIGA_Z2RAM
 	z2_init();
@@ -1324,4 +1280,3 @@
 EXPORT_SYMBOL(generic_make_request);
 EXPORT_SYMBOL(blkdev_release_request);
 EXPORT_SYMBOL(generic_unplug_device);
-EXPORT_SYMBOL(queued_sectors);
diff -u --recursive --new-file pre3/linux/include/linux/blkdev.h linux/include/linux/blkdev.h
--- pre3/linux/include/linux/blkdev.h	Mon Jul 30 10:45:59 2001
+++ linux/include/linux/blkdev.h	Fri Aug  3 20:30:01 2001
@@ -174,8 +174,6 @@
 
 extern int * max_segments[MAX_BLKDEV];
 
-extern atomic_t queued_sectors;
-
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
 
@@ -203,14 +201,7 @@
 		return 512;
 }
 
-#define blk_finished_io(nsects)				\
-	atomic_sub(nsects, &queued_sectors);		\
-	if (atomic_read(&queued_sectors) < 0) {		\
-		printk("block: queued_sectors < 0\n");	\
-		atomic_set(&queued_sectors, 0);		\
-	}
-
-#define blk_started_io(nsects)				\
-	atomic_add(nsects, &queued_sectors);
+#define blk_finished_io(nsects)	do { } while (0)
+#define blk_started_io(nsects)	do { } while (0)
 
 #endif


* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  3:48         ` Linus Torvalds
@ 2001-08-04  4:14           ` Ben LaHaise
  2001-08-04  4:20             ` Linus Torvalds
  0 siblings, 1 reply; 40+ messages in thread
From: Ben LaHaise @ 2001-08-04  4:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Linus Torvalds wrote:

> For nicer interactive behaviour while flushing things out, the
> inode_fsync() thing should really use "write_locked_buffers()". That's a
> separate patch, though.

Mildly better interactive performance, but absolutely horrid io
throughput.  The system degrades to the point where blocks are getting
flushed to disk at ~2MB/s vs the 80MB/s it's capable of.  Not instrumented
since I'm trying to actually relax.

		-ben



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  4:14           ` Ben LaHaise
@ 2001-08-04  4:20             ` Linus Torvalds
  2001-08-04  4:39               ` Ben LaHaise
  0 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  4:20 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm


On Sat, 4 Aug 2001, Ben LaHaise wrote:
> On Fri, 3 Aug 2001, Linus Torvalds wrote:
>
> > For nicer interactive behaviour while flushing things out, the
> > inode_fsync() thing should really use "write_locked_buffers()". That's a
> > separate patch, though.
>
> Mildly better interactive performance, but absolutely horrid io
> throughput.  The system degrades to the point where blocks are getting
> flushed to disk at ~2MB/s vs the 80MB/s its capable of.  Not instrumented
> since I'm trying to actually relax.

That implies that we are trying to flush way too few buffers at a time and
do not get any overlapping IO.

Btw, how did you test this? Do you have a patch for inode_fsync() already?

		Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  4:20             ` Linus Torvalds
@ 2001-08-04  4:39               ` Ben LaHaise
  2001-08-04  4:47                 ` Linus Torvalds
  0 siblings, 1 reply; 40+ messages in thread
From: Ben LaHaise @ 2001-08-04  4:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Linus Torvalds wrote:

> That implies that we are trying to flush way too few buffers at a time and
> do not get any overlapping IO.

It started out at full throughput, but eventually became jerky and
chaotic.  After I sent the message I noticed it was at 0% idle with
kswapd, kreclaimd, kflushd and kupdated all eating cycles in addition to
dd.

> Btw, how did you test this? Do you have a patch for inode_fsync() already?

Nah, didn't do that patch yet.  The test was the same old dd </dev/zero
of=/tmp/foo bs=1024k with vmstat running and executing commands on another
shell (gotta automate that).  Note that the other approach of leaving the
throttling in and limited to 2MB queues resulted in fairly consistent
60MB/s throughput with no chaotic breakdown.

Using the number of queued sectors in the io queues is, imo, the right way
to throttle io.  The high/low water marks give us decent batching as well
as the delays that we need for throttling writers.  If we remove that,
we'll need another way to wait for io to complete.  Waiting on pages
simply does not work as the page chosen may not be in the "right place" in
the queue.  So, what's the plan?

		-ben



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  4:39               ` Ben LaHaise
@ 2001-08-04  4:47                 ` Linus Torvalds
  2001-08-04  5:13                   ` Ben LaHaise
  0 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  4:47 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm


On Sat, 4 Aug 2001, Ben LaHaise wrote:
>
> Using the number of queued sectors in the io queues is, imo, the right way
> to throttle io.  The high/low water marks give us decent batching as well
> as the delays that we need for throttling writers.  If we remove that,
> we'll need another way to wait for io to complete.

Well, we actually _do_ have that other way already - that should be, after
all, the whole point of the request allocation.

It's when we allocate the request that we know whether we already have too
many requests pending.. And we have the batching there too. Maybe the
current maximum number of requests is just way too big?

[ Quick grep later ]

On my 1GB machine, we apparently allocate 1792 requests for _each_ queue.
Considering that a single request can have hundreds of buffers allocated
to it, that is just _ridiculous_.

How about capping the number of requests to something sane, like 128? Then
the natural request allocation (together with the batching that we already
have) should work just dandy.

Ben, willing to do some quick benchmarks?

			Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  4:47                 ` Linus Torvalds
@ 2001-08-04  5:13                   ` Ben LaHaise
  2001-08-04  5:28                     ` Linus Torvalds
  2001-08-04  6:37                     ` Linus Torvalds
  0 siblings, 2 replies; 40+ messages in thread
From: Ben LaHaise @ 2001-08-04  5:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Linus Torvalds wrote:

> [ Quick grep later ]
>
> On my 1GB machine, we apparently allocate 1792 requests for _each_ queue.
> Considering that a single request can have hundreds of buffers allocated
> to it, that is just _ridiculous_.

> How about capping the number of requests to something sane, like 128? Then
> the natural request allocation (together with the batching that we already
> have) should work just dandy.

This has other drawbacks that are quite serious: namely, the order in
which io is submitted to the block layer is not anywhere close to optimal
for getting useful amounts of work done.  This situation only gets worse
as more and more tasks find that they need to clean buffers in order to
allocate memory, and start throwing more and more buffers from different
tasks into the io queue (think what happens when two tasks are walking
the dirty buffer lists locking buffers and then attempting to allocate a
request which then delays one of the tasks).

> Ben, willing to do some quick benchmarks?

Within reason.  I'm actually heading to bed now, so it'll have to wait
until tomorrow, but it is fairly trivial to reproduce by dd'ing to an 8GB
non-sparse file.  Also, duplicating a huge file will show similar
breakdown under load.  Cheers,

		-ben



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  5:13                   ` Ben LaHaise
@ 2001-08-04  5:28                     ` Linus Torvalds
  2001-08-04  6:37                     ` Linus Torvalds
  1 sibling, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  5:28 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm


On Sat, 4 Aug 2001, Ben LaHaise wrote:
>
> > How about capping the number of requests to something sane, like 128? Then
> > the natural request allocation (together with the batching that we already
> > have) should work just dandy.
>
> This has other drawbacks that are quite serious: namely, the order in
> which io is submitted to the block layer is not anywhere close to optimal
> for getting useful amounts of work done.

Now this is true _whatever_ we do.

We all agree that we have to cap the thing somewhere, no?

Which means that we may be cutting off at a point where, if we didn't cut
off, we could have merged better etc. So that problem we have regardless
of whether we count bh's submitted to ll_rw_block() or we count requests
submitted to the actual IO layer.

The advantage of cutting off on a per-request basis is:

 - doing contiguous IO is "almost free" on most hardware today. So it's ok
   to allow a lot more IO if it's contiguous - because the cost of doing
   one request (even if large) is usually much lower than the cost of
   doing two (smaller) requests.

 - What we really want to do is to have a sliding window of active
   requests - enough to get reasonable elevator behaviour, and small
   enough to get reasonable latency. Again, for both of these, the
   "request" is the right entity - latency comes mostly from seeks (ie
   between request boundaries), and similarly the elevator obviously works
   on request boundaries too, not on "bh" boundaries.

Also, I doubt it makes all that much sense to change the number of queue
entries based on memory size. It probably makes more sense to scale the
number of requests by disk speed, for example.

[ Although there's almost certainly some amount of correlation - if you
  have 2GB of RAM, you probably have fast disks too. But not the linear
  function that we currently have. ]

>			  This situation only gets worse
> as more and more tasks find that they need to clean buffers in order to
> allocate memory, and start throwing more and more buffers from different
> tasks into the io queue (think what happens when two tasks are walking
> the dirty buffer lists locking buffers and then attempting to allocate a
> request which then delays one of the tasks).

Note that this really is a situation we've had forever.

There are good reasons to believe that we should do a better job of
sorting the IO requests at a higher level in _addition_ to the low-level
elevator. Filesystems should strive to allocate blocks contiguously etc,
and we should strive to keep (and write out) the dirty lists etc in a
somewhat chronological order to take advantage of usually contiguous writes
(and maybe actively sort the dirty queue on writes that are _not_ going to
have good locality, like swapping).

		Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  6:37                     ` Linus Torvalds
@ 2001-08-04  5:38                       ` Marcelo Tosatti
  2001-08-04  7:13                         ` Rik van Riel
  2001-08-04 14:22                       ` Mike Black
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Marcelo Tosatti @ 2001-08-04  5:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Daniel Phillips, Rik van Riel, linux-kernel, linux-mm



On Fri, 3 Aug 2001, Linus Torvalds wrote:

> 
> On Sat, 4 Aug 2001, Ben LaHaise wrote:
> >
> > Within reason.  I'm actually heading to bed now, so it'll have to wait
> > until tomorrow, but it is fairly trivial to reproduce by dd'ing to an 8GB
> > non-sparse file.  Also, duplicating a huge file will show similar
> > breakdown under load.
> 
> Well, I've made a 2.4.8-pre4.
> 
> This one has marcelo's zone fixes, and my request suggestions. I'm writing
> email right now with the 8GB write in the background, and unpacked and
> patched a kernel. It's certainly not _fast_, but it's not too painful to
> use either.  The 8GB file took 7:25 to write (including the sync), which
> averages out to 18+MB/s. Which is, as far as I can tell, about the best I
> can get on this 5400RPM 80GB drive with the current IDE driver (the
> experimental IDE driver is supposed to do better, but that's not for
> 2.4.x)
> 
> An added advantage of doing the waiting in the request handling was that
> this way it automatically balances reads against writes - writes cannot
> cause reads to fail because they have separate request queue allocations.
> 
> Does it work reasonably under your loads?

Well, the freepages_high change needs more work.

Normal allocations are not going to easily "fall down" to lower zones
because the high zones will be kept at freepages.high most of the time.




* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  5:13                   ` Ben LaHaise
  2001-08-04  5:28                     ` Linus Torvalds
@ 2001-08-04  6:37                     ` Linus Torvalds
  2001-08-04  5:38                       ` Marcelo Tosatti
                                         ` (3 more replies)
  1 sibling, 4 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04  6:37 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm


On Sat, 4 Aug 2001, Ben LaHaise wrote:
>
> Within reason.  I'm actually heading to bed now, so it'll have to wait
> until tomorrow, but it is fairly trivial to reproduce by dd'ing to an 8GB
> non-sparse file.  Also, duplicating a huge file will show similar
> breakdown under load.

Well, I've made a 2.4.8-pre4.

This one has marcelo's zone fixes, and my request suggestions. I'm writing
email right now with the 8GB write in the background, and unpacked and
patched a kernel. It's certainly not _fast_, but it's not too painful to
use either.  The 8GB file took 7:25 to write (including the sync), which
averages out to 18+MB/s. Which is, as far as I can tell, about the best I
can get on this 5400RPM 80GB drive with the current IDE driver (the
experimental IDE driver is supposed to do better, but that's not for
2.4.x)

An added advantage of doing the waiting in the request handling was that
this way it automatically balances reads against writes - writes cannot
cause reads to fail because they have separate request queue allocations.

Does it work reasonably under your loads?

		Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  5:38                       ` Marcelo Tosatti
@ 2001-08-04  7:13                         ` Rik van Riel
  0 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2001-08-04  7:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Linus Torvalds, Ben LaHaise, Daniel Phillips, linux-kernel, linux-mm

On Sat, 4 Aug 2001, Marcelo Tosatti wrote:

> Well, the freepages_high change needs more work.
>
> Normal allocations are not going to easily "fall down" to lower zones
> because the high zones will be kept at freepages.high most of the time.

Actually, the first allocation loop in __alloc_pages()
is testing against zone->pages_high and allocating only
from zones which have MORE than this.

So I guess this should only result in a somewhat slower
and/or softer fallback, and is definitely worth a try.

Oh, and we definitely need to un-lazy the queue movement
from the inactive_clean list. Having all of the pages you
counted on as being reclaimable referenced is a very bad
surprise ...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  6:37                     ` Linus Torvalds
  2001-08-04  5:38                       ` Marcelo Tosatti
@ 2001-08-04 14:22                       ` Mike Black
  2001-08-04 17:08                         ` Linus Torvalds
  2001-08-04 16:21                       ` Mark Hemment
  2001-08-07 15:45                       ` Ben LaHaise
  3 siblings, 1 reply; 40+ messages in thread
From: Mike Black @ 2001-08-04 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Ben LaHaise
  Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm, Andrew Morton

I'm testing 2.4.8-pre4 -- MUCH better interactivity behavior now.
I've been testing ext3/raid5 for several weeks now and this is usable now.
My system is Dual 1Ghz/2GRam/4GSwap fibrechannel.
But...the single thread i/o performance is down.
Previously I was getting about 60MB/sec on one thread for Seq Read -- now
it's 40MB/sec.
Here's the run:
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

         File   Block  Num  Seq Read    Rand Read   Seq Write  Rand Write
  Dir    Size   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
   .     4000   4096    1  40.62 73.8% 0.675 1.38% 27.08 43.6% 1.207 2.00%
   .     4000   4096    2  17.02 30.2% 0.761 1.63% 16.84 29.9% 1.270 1.78%
   .     4000   4096    4  14.96 26.8% 0.885 2.13% 13.75 31.2% 1.278 1.69%
   .     4000   4096    8  13.39 21.5% 0.952 2.48% 12.46 33.2% 1.188 1.48%

During the 4-thread run there was one long pause (instead of being totally
unusable before with even 2 threads).
Didn't notice any pauses during 8 threads.

I'm seeing a lot more CPU Usage for the 1st thread than previous tests --
perhaps we've shortened the queue too much and it's throttling the read?
Why would CPU usage go up and I/O go down?
Here's a previous test (only 1 thread as 2 threads became unusable).
         File   Block  Num  Seq Read    Rand Read   Seq Write  Rand Write
  Dir    Size   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
   .     4000   4096    1  66.69 53.6% 0.829 1.43% 27.64 41.6% 1.287 0.74%

----- Original Message -----
From: "Linus Torvalds" <torvalds@transmeta.com>
To: "Ben LaHaise" <bcrl@redhat.com>
Cc: "Daniel Phillips" <phillips@bonn-fries.net>; "Rik van Riel"
<riel@conectiva.com.br>; <linux-kernel@vger.kernel.org>;
<linux-mm@kvack.org>
Sent: Saturday, August 04, 2001 2:37 AM
Subject: Re: [RFC][DATA] re "ongoing vm suckage"


>
> Well, I've made a 2.4.8-pre4.
>
> This one has marcelo's zone fixes, and my request suggestions. I'm writing
> email right now with the 8GB write in the background, and unpacked and
> patched a kernel. It's certainly not _fast_, but it's not too painful to
> use either.  The 8GB file took 7:25 to write (including the sync), which
> averages out to 18+MB/s. Which is, as far as I can tell, about the best I
> can get on this 5400RPM 80GB drive with the current IDE driver (the
> experimental IDE driver is supposed to do better, but that's not for
> 2.4.x)
>




* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  6:37                     ` Linus Torvalds
  2001-08-04  5:38                       ` Marcelo Tosatti
  2001-08-04 14:22                       ` Mike Black
@ 2001-08-04 16:21                       ` Mark Hemment
  2001-08-07 15:45                       ` Ben LaHaise
  3 siblings, 0 replies; 40+ messages in thread
From: Mark Hemment @ 2001-08-04 16:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-mm, Hugh Dickins

On Fri, 3 Aug 2001, Linus Torvalds wrote:
> Well, I've made a 2.4.8-pre4.

  A colleague has reminded me that we have this small patch against
flush_dirty_buffers() - kick the disk queues before sleeping.

Mark


--- linux-2.4.8-pre4/fs/buffer.c	Sat Aug  4 11:49:52 2001
+++ linux/fs/buffer.c	Sat Aug  4 11:56:25 2001
@@ -2568,8 +2568,11 @@
 		ll_rw_block(WRITE, 1, &bh);
 		put_bh(bh);

-		if (current->need_resched)
+		if (current->need_resched) {
+			/* kick what we've already pushed down */
+			run_task_queue(&tq_disk);
 			schedule();
+		}
 		goto restart;
 	}
  out_unlock:



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04 14:22                       ` Mike Black
@ 2001-08-04 17:08                         ` Linus Torvalds
  2001-08-05  4:19                           ` Michael Rothwell
  2001-08-05 15:24                           ` Mike Black
  0 siblings, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-04 17:08 UTC (permalink / raw)
  To: Mike Black
  Cc: Ben LaHaise, Daniel Phillips, Rik van Riel, linux-kernel,
	linux-mm, Andrew Morton


On Sat, 4 Aug 2001, Mike Black wrote:
>
> I'm testing 2.4.8-pre4 -- MUCH better interactivity behavior now.

Good.. However..

> I've been testing ext3/raid5 for several weeks now and this is usable now.
> My system is Dual 1Ghz/2GRam/4GSwap fibrechannel.
> But...the single thread i/o performance is down.

Bad. And before we get too happy about the interactive thing, let's
remember that sometimes interactivity comes at the expense of throughput,
and maybe if we fix the throughput we'll be back where we started.

Now, you basically have a rather fast disk subsystem, and it's entirely
possible that with that kind of oomph you really want a longer queue. So
in blk_dev_init() in drivers/block/ll_rw_blk.c, try changing

	/*
         * Free request slots per queue.
         * (Half for reads, half for writes)
         */
        queue_nr_requests = 64;
        if (total_ram > MB(32))
                queue_nr_requests = 128;

to something more like

	/*
         * Free request slots per queue.
         * (Half for reads, half for writes)
         */
        queue_nr_requests = 64;
        if (total_ram > MB(32)) {
                queue_nr_requests = 128;
		if (total_ram > MB(128))
			queue_nr_requests = 256;
	}

and tell me if interactivity is still fine, and whether performance goes
up?

And please feel free to play with different values - but remember that
big values do tend to mean bad latency.

Rule of thumb: even on fast disks, the average seek time (and between
requests you almost always have to seek) is on the order of a few
milliseconds. With a large write-queue (256 total requests means 128 write
requests) you can basically get single-request latencies of up to a
second - 128 queued writes at ~8ms of seek each adds up to about a second
of waiting. Which is really bad.

One partial solution may be to just make the read queue deeper than the
write queue. That's a bit more complicated than just changing a single
value, though - you'd need to make the batching threshold be dependent on
read-write too etc. But it would probably not be a bad idea to change the
"split requests evenly" logic to instead split requests 2:1 read:write.

All the logic is in drivers/block/ll_rw_blk.c, and it's fairly easy to
just search for queue_nr_requests/batch_requests to see what it's doing.
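
Concretely, the init-time split could become something like this
(nr_read/nr_write are illustrative names, not existing variables):

	/* illustrative only: 2:1 read:write instead of the even split;
	 * queue_nr_requests is the real ll_rw_blk.c variable */
	int nr_read  = (queue_nr_requests * 2) / 3;	/* 2/3 for READ */
	int nr_write = queue_nr_requests - nr_read;	/* 1/3 for WRITE */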

> I'm seeing a lot more CPU Usage for the 1st thread than previous tests --
> perhaps we've shortened the queue too much and it's throttling the read?
> Why would CPU usage go up and I/O go down?

I'd guess it's calling the scheduler more. With fast disks and a queue
that runs out, you'd probably go into a series of extremely short
stop-start cycles. Or something similar.

		Linus



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04 17:08                         ` Linus Torvalds
@ 2001-08-05  4:19                           ` Michael Rothwell
  2001-08-05 18:40                             ` Marcelo Tosatti
  2001-08-05 20:20                             ` Linus Torvalds
  2001-08-05 15:24                           ` Mike Black
  1 sibling, 2 replies; 40+ messages in thread
From: Michael Rothwell @ 2001-08-05  4:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Black, Ben LaHaise, Daniel Phillips, Rik van Riel,
	linux-kernel, linux-mm, Andrew Morton

On 04 Aug 2001 10:08:56 -0700, Linus Torvalds wrote:
> 
> On Sat, 4 Aug 2001, Mike Black wrote:
> >
> > I'm testing 2.4.8-pre4 -- MUCH better interactivity behavior now.
> 
> Good.. However.. [...]  before we get too happy about the interactive thing, let's
> remember that sometimes interactivity comes at the expense of throughput,
> and maybe if we fix the throughput we'll be back where we started.

Could there be both interactive and throughput optimizations, and a way
to choose one or the other at run-time? Or even just at compile time? 



* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04 17:08                         ` Linus Torvalds
  2001-08-05  4:19                           ` Michael Rothwell
@ 2001-08-05 15:24                           ` Mike Black
  2001-08-05 20:04                             ` Linus Torvalds
  1 sibling, 1 reply; 40+ messages in thread
From: Mike Black @ 2001-08-05 15:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ben LaHaise, Daniel Phillips, Rik van Riel, linux-kernel,
	linux-mm, Andrew Morton

I bumped up the queue_nr_requests to 512, then 1024 -- 1024 finally made a
performance difference for me and the machine was still usable.
As an ext2 mount there were no interactive delays at all (this is as it
always has been prior to any of these patches):
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

         File   Block  Num  Seq Read    Rand Read   Seq Write  Rand Write
  Dir    Size   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
   .     4000   4096    1  50.52 59.9% 0.869 2.28% 29.28 25.0% 1.329 0.85%
   .     4000   4096    2  45.52 57.9% 1.104 2.58% 26.66 25.8% 1.346 1.11%
   .     4000   4096    4  33.69 44.2% 1.316 3.08% 17.02 17.2% 1.342 1.26%
   .     4000   4096    8  29.74 39.5% 1.500 3.43% 14.45 15.4% 1.342 1.26%

As an ext3 mount (here's where I've been seeing BIG delays before) there
were:
1 thread - no delays
2 threads - 2 delays for 2 seconds each  << previously even 2 threads caused
minute+ delays.
4 threads - 5 delays - 1 for 3 seconds, 4 for 2 seconds
8 threads - 21 delays  - 9 for 2 sec, 4 for 3 sec,  4 for 4 sec, 2 for 5
sec, 1 for 6 sec, and 1 for 10 sec
NOTE: all these delays were during the write tests -- none during read.
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

         File   Block  Num  Seq Read    Rand Read   Seq Write  Rand Write
  Dir    Size   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
   .     4000   4096    1  46.32 66.3% 0.859 1.98% 26.09 31.3% 1.280 0.73%
   .     4000   4096    2  18.65 29.8% 0.997 2.10% 16.04 28.2% 1.300 1.12%
   .     4000   4096    4  15.90 26.7% 1.154 2.48% 14.68 31.0% 1.263 1.15%
   .     4000   4096    8  14.93 24.1% 1.307 2.82% 11.68 41.5% 1.251 1.18%

To compute the delays I'm just using a simple little program:
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        time_t last, now;

        last = time(NULL);
        while (1) {
                sleep(1);                       /* should wake ~1s later */
                now = time(NULL);
                if (now - last > 2)             /* woke late: report the stall */
                        printf("Delay %ld\n", (long)(now - last - 1));
                last = now;
        }
}



----- Original Message -----
From: "Linus Torvalds" <torvalds@transmeta.com>
To: "Mike Black" <mblack@csihq.com>
Cc: "Ben LaHaise" <bcrl@redhat.com>; "Daniel Phillips"
<phillips@bonn-fries.net>; "Rik van Riel" <riel@conectiva.com.br>;
<linux-kernel@vger.kernel.org>; <linux-mm@kvack.org>; "Andrew Morton"
<andrewm@uow.edu.au>
Sent: Saturday, August 04, 2001 1:08 PM
Subject: Re: [RFC][DATA] re "ongoing vm suckage"


>
> On Sat, 4 Aug 2001, Mike Black wrote:
> >
> > I'm testing 2.4.8-pre4 -- MUCH better interactivity behavior now.
>
> Good.. However..
>
> > I've been testing ext3/raid5 for several weeks now and this is usable
> > now.
> > My system is Dual 1Ghz/2GRam/4GSwap fibrechannel.
> > But...the single thread i/o performance is down.
>
> Bad. And before we get too happy about the interactive thing, let's
> remember that sometimes interactivity comes at the expense of throughput,
> and maybe if we fix the throughput we'll be back where we started.
>
> Now, you basically have a rather fast disk subsystem, and it's entirely
> possible that with that kind of oomph you really want a longer queue. So
> in blk_dev_init() in drivers/block/ll_rw_blk.c, try changing
>
>         /*
>          * Free request slots per queue.
>          * (Half for reads, half for writes)
>          */
>         queue_nr_requests = 64;
>         if (total_ram > MB(32))
>                 queue_nr_requests = 128;
>
> to something more like
>
>         /*
>          * Free request slots per queue.
>          * (Half for reads, half for writes)
>          */
>         queue_nr_requests = 64;
>         if (total_ram > MB(32)) {
>                 queue_nr_requests = 128;
>                 if (total_ram > MB(128))
>                         queue_nr_requests = 256;
>         }
>
> and tell me if interactivity is still fine, and whether performance goes
> up?
>
> And please feel free to play with different values - but remember that
> big values do tend to mean bad latency.
>
> Rule of thumb: even on fast disks, the average seek time (and between
> requests you almost always have to seek) is on the order of a few
> milliseconds. With a large write-queue (256 total requests means 128 write
> requests) you can basically get single-request latencies of up to a
> second. Which is really bad.
>
> One partial solution may be to just make the read queue deeper than the
> write queue. That's a bit more complicated than just changing a single
> value, though - you'd need to make the batching threshold be dependent on
> read-write too etc. But it would probably not be a bad idea to change
> "split requests evenly" to an even "split requests 2:1 read:write".
>
> All the logic is in drivers/block/ll_rw_blk.c, and it's fairly easy to
> just search for queue_nr_requests/batch_requests to see what it's doing.
>
> > I"m seeing a lot more CPU Usage for the 1st thread than previous
tests --
> > perhaps we've shortened the queue too much and it's throttling the read?
> > Why would CPU usage go up and I/O go down?
>
> I'd guess it's calling the scheduler more. With fast disks and a queue
> that runs out, you'd probably go into a series of extremely short
> stop-start cycles. Or something similar.
>
> Linus
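
A minimal sketch of the 2:1 read:write split suggested above, in the
style of the 2.4 blk_dev_init() code -- every name here apart from
queue_nr_requests is hypothetical:

	/* split the free request slots 2:1 in favour of reads
	 * instead of half-and-half */
	int nr_read_requests  = (queue_nr_requests * 2) / 3;
	int nr_write_requests = queue_nr_requests - nr_read_requests;

	/* the batching thresholds would have to follow the same split */
	int read_batch_requests  = nr_read_requests / 4;
	int write_batch_requests = nr_write_requests / 4;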


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-05  4:19                           ` Michael Rothwell
@ 2001-08-05 18:40                             ` Marcelo Tosatti
  2001-08-05 20:20                             ` Linus Torvalds
  1 sibling, 0 replies; 40+ messages in thread
From: Marcelo Tosatti @ 2001-08-05 18:40 UTC (permalink / raw)
  To: Michael Rothwell
  Cc: Linus Torvalds, Mike Black, Ben LaHaise, Daniel Phillips,
	Rik van Riel, linux-kernel, linux-mm, Andrew Morton



On 5 Aug 2001, Michael Rothwell wrote:

> On 04 Aug 2001 10:08:56 -0700, Linus Torvalds wrote:
> > 
> > On Sat, 4 Aug 2001, Mike Black wrote:
> > >
> > > I'm testing 2.4.8-pre4 -- MUCH better interactivity behavior now.
> > 
> > Good.. However.. [...]  before we get too happy about the interactive thing, let's
> > remember that sometimes interactivity comes at the expense of throughput,
> > and maybe if we fix the throughput we'll be back where we started.
> 
> Could there be both interactive and throughput optimizations, and a way
> to choose one or the other at run-time? Or even just at compile time? 

You can increase the queue size (somewhere in drivers/block/ll_rw_blk.c)
to get higher throughput.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-05 15:24                           ` Mike Black
@ 2001-08-05 20:04                             ` Linus Torvalds
  2001-08-05 20:23                               ` Alan Cox
  0 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2001-08-05 20:04 UTC (permalink / raw)
  To: Mike Black
  Cc: Ben LaHaise, Daniel Phillips, Rik van Riel, linux-kernel,
	linux-mm, Andrew Morton


On Sun, 5 Aug 2001, Mike Black wrote:
>
> I bumped up the queue_nr_requests to 512, then 1024 -- 1024 finally made a
> performance difference for me and the machine was still usable.

This is truly strange.

The reason it is _so_ strange is that with a single thread doing read IO,
the IO should actually be limited by the read-ahead size, which is _much_
smaller than 1024 entries. Right now the max read-ahead for a regular
filesystem is 127 pages - which, even assuming the absolute worst case
(1kB filesystem, totally non-contiguous etc) is no more than 500
requests.

And quite frankly, if your disk can push 50MB/s through a 1kB
non-contiguous filesystem, then my name is Bugs Bunny.

You're more likely to have a nice contiguous file, probably on a 4kB
filesystem, and it should be able to do read-ahead of 127 pages in just a
few requests.

The fact that it makes a difference for you at the 1024 mark (which means
512 entries for the read queue) is rather strange.

> As an ext3 mount (here's where I've been seeing BIG delays before) there
> were:
> 1 thread - no delays
> 2 threads - 2 delays for 2 seconds each  << previously even 2 threads caused
> minute+ delays.
> 4 threads - 5 delays - 1 for 3 seconds, 4 for 2 seconds
> 8 threads - 21 delays  - 9 for 2 sec, 4 for 3 sec,  4 for 4 sec, 2 for 5
> sec, 1 for 6 sec, and 1 for 10 sec

Now, this is the good news. It tells me that the old ll_rw_blk() code
really was totally buggered, and that getting rid of it was 100% the right
thing to do.

But I'd _really_ like to understand why you see differences in read
performance, though. That really makes no sense, considering that you
should never even get close to the request limits anyway.

What driver do you use (maybe it has merging problems - some drivers want
to merge only blocks that are also physically adjacent in memory), and can
you please verify that you have a 4kB filesystem, not an old 1kB
filesystem..

Oh, and can you play around with the "MAX_READAHEAD" define, too? It is,
in fact, not unlikely that performance will _improve_ by making the
read-ahead lower, if the read-ahead is so large as to cause request
stalling.
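
Experimenting with that is a one-line change; in 2.4 trees of this era
the define lives in include/linux/blkdev.h and is counted in pages (31
below is just one value to try, not a recommendation):

	#define MAX_READAHEAD	31	/* down from 127 pages */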

(It would be good to have a true "if you don't have enough requests,
stop read-ahead" interface; I'll take a look at possibly reviving the
READA code.)

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-05  4:19                           ` Michael Rothwell
  2001-08-05 18:40                             ` Marcelo Tosatti
@ 2001-08-05 20:20                             ` Linus Torvalds
  2001-08-05 20:45                               ` arjan
  2001-08-06 20:32                               ` Rob Landley
  1 sibling, 2 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-05 20:20 UTC (permalink / raw)
  To: Michael Rothwell
  Cc: Mike Black, Ben LaHaise, Daniel Phillips, Rik van Riel,
	linux-kernel, linux-mm, Andrew Morton


On 5 Aug 2001, Michael Rothwell wrote:
>
> Could there be both interactive and throughput optimizations, and a
> way to choose one or the other at run-time? Or even just at compile
> time?

Quite frankly, that's in my opinion the absolute worst approach.

Yes, it's an approach many systems take - put the tuning load on the user,
and blame the user if something doesn't work well. That way you don't have
to bother with trying to get the code right, or make it make sense.

In general, I think we can get latency to acceptable values, and latency
is the _hard_ thing. We seem to have become a lot better already, by just
removing the artificial ll_rw_blk code.

Getting throughput up to where it should be should "just" be a matter of
making sure we get nicely overlapping IO going. We probably just have some
silly bug that makes us hiccup every once in a while and not keep the
queues full enough. My current suspect is the read-ahead code itself being
a bit too inflexible, but..

			Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-05 20:04                             ` Linus Torvalds
@ 2001-08-05 20:23                               ` Alan Cox
  2001-08-05 20:33                                 ` Linus Torvalds
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Cox @ 2001-08-05 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Black, Ben LaHaise, Daniel Phillips, Rik van Riel,
	linux-kernel, linux-mm, Andrew Morton

> On Sun, 5 Aug 2001, Mike Black wrote:
> And quite frankly, if your disk can push 50MB/s through a 1kB
> non-contiguous filesystem, then my name is Bugs Bunny.

Hi Bugs 8), previously Frodo Rabbit, .. I think you watch too much kids tv
8)

[To be fair I can do this through a raid controller with write back caches
and the like ..]

> You're more likely to have a nice contiguous file, probably on a 4kB
> filesystem, and it should be able to do read-ahead of 127 pages in just a
> few requests.

One problem I saw with scsi was that non power of two readaheads were
causing lots of small I/O requests to actually hit the disk controller (which
hurt big time on hardware raid as it meant reading/rewriting chunks). I
ended up seeing 128/127/1 128/127/1 128/127/1 with a 255 block queue.

It might be worth logging the number of blocks in each request that hits
the disk layer and dumping them out in /proc. I'll see if I still have the
hack for that around.
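
A rough sketch of that kind of instrumentation against the 2.4 APIs --
the hook point, the names, and the /proc entry are all hypothetical:

	#include <linux/kernel.h>
	#include <linux/blkdev.h>
	#include <linux/proc_fs.h>

	#define HIST_SIZE 256
	static unsigned long req_size_hist[HIST_SIZE];

	/* call this wherever a request is handed to the low-level driver */
	static inline void log_request_size(struct request *req)
	{
		unsigned long n = req->nr_sectors;
		if (n >= HIST_SIZE)
			n = HIST_SIZE - 1;
		req_size_hist[n]++;
	}

	/* assumes the non-zero buckets fit in the single page we're given */
	static int req_hist_read(char *page, char **start, off_t off,
				 int count, int *eof, void *data)
	{
		int i, len = 0;

		for (i = 0; i < HIST_SIZE; i++)
			if (req_size_hist[i])
				len += sprintf(page + len, "%d: %lu\n",
					       i, req_size_hist[i]);
		*eof = 1;
		return len;
	}

	/* at init time:
	 * create_proc_read_entry("req_sizes", 0, NULL, req_hist_read, NULL);
	 */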

Alan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-05 20:23                               ` Alan Cox
@ 2001-08-05 20:33                                 ` Linus Torvalds
  0 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-05 20:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Mike Black, Ben LaHaise, Daniel Phillips, Rik van Riel,
	linux-kernel, linux-mm, Andrew Morton


On Sun, 5 Aug 2001, Alan Cox wrote:

> > On Sun, 5 Aug 2001, Mike Black wrote:
> > And quite frankly, if your disk can push 50MB/s through a 1kB
> > non-contiguous filesystem, then my name is Bugs Bunny.
>
> Hi Bugs 8), previously Frodo Rabbit, .. I think you watch too much kids tv
> 8)

Three kids will do that to you. Some day, you too will be there.

> [To be fair I can do this through a raid controller with write back caches
> and the like ..]

Note that this was _read_ performance.

I agree that writing is easier, and contiguous buffers do not mean much if
you have a big write cache.

> One problem I saw with scsi was that non power of two readaheads were
> causing lots of small I/O requests to actual hit the disk controller (which
> hurt big time on hardware raid as it meant reading/rewriting chunks). I
> ended up seeing 128/127/1 128/127/1 128/127/1 with a 255 block queue.

Uhhuh. I think the read-ahead is actually a power-of-two, because it ends
up being "127 pages plus the current one", but hey, I could easily be
off-by-one.

I would actually love to see the read-ahead code just pass down the
knowledge that it is a read-ahead to the IO layer, and let the IO layer do
whatever it wants. In the case of block devices, for example, the READA
code is still there and looks very simple and functional - so it should be
trivial for generic_block_read_page() to pass down the READA information
and just return "no point in doing read-ahead, the queue is full"..

That way we'd never have the situation that we end up breaking up large
reads into badly sized entities.

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-05 20:20                             ` Linus Torvalds
@ 2001-08-05 20:45                               ` arjan
  2001-08-06 20:32                               ` Rob Landley
  1 sibling, 0 replies; 40+ messages in thread
From: arjan @ 2001-08-05 20:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

In article <Pine.LNX.4.33.0108051315540.7988-100000@penguin.transmeta.com> you wrote:

> In general, I think we can get latency to acceptable values, and latency
> is the _hard_ thing. We seem to have become a lot better already, by just
> removing the artificial ll_rw_blk code.

Ok how about a scheme (in 2.5) where every request has a "priority" assigned
to it. The way I see this is:

* priority is a signed value
* negative priority means "no need to do IO yet" to allow for gathering
  and grouping more requests in the request queues. It would be possible
  to get most of the inactive-dirty list in this state, eg io scheduled but
  not yet running
* on merging requests, the highest priority obviously becomes the overall 
  priority of the request
* "interactive" requests get a higher priority; this can be helped by adding
  an ll_rw_block_sync function, as 99% of the ll_rw_block users end up
  waiting for io anyway
* priority needs to be "aged up" in time to take care of latency and such

If a device is truly idle (eg no io for X jiffies), it could steal negative
requests from the queue to do preemptive writes in order to prevent the
current situation of 5 seconds of no IO, and then suddenly a problem and
long-latency IO.

Also, intelligent devices such as aacraid, where the hardware controller has
the notion of priority, can be used more effectively this way. Such hardware
raid controllers also like to have deep IO queues for non-priority requests
to keep all disks in the raid array busy...
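
A rough sketch of what the priority field and merge rule could look
like -- nothing like this exists in 2.4, and every name below is made
up:

	#include <linux/blkdev.h>	/* struct request */
	#include <linux/sched.h>	/* jiffies */
	#include <linux/timer.h>	/* time_after() */

	#define PRIO_AGE_INTERVAL (HZ / 2)	/* made-up tunable */

	struct prio_request {
		struct request req;
		int prio;		/* signed: < 0 means "no need to do IO yet" */
		unsigned long stamp;	/* jiffies, for aging the priority up */
	};

	/* on merging, the merged request takes the higher of the two
	 * priorities, as described above */
	static inline void prio_merge(struct prio_request *a,
				      struct prio_request *b)
	{
		if (b->prio > a->prio)
			a->prio = b->prio;
	}

	/* periodic aging so negative-priority IO can't be deferred forever */
	static inline void prio_age(struct prio_request *r)
	{
		if (time_after(jiffies, r->stamp + PRIO_AGE_INTERVAL)) {
			r->prio++;
			r->stamp = jiffies;
		}
	}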

Comments ?

Greetings,
   Arjan van de Ven

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-05 20:20                             ` Linus Torvalds
  2001-08-05 20:45                               ` arjan
@ 2001-08-06 20:32                               ` Rob Landley
  1 sibling, 0 replies; 40+ messages in thread
From: Rob Landley @ 2001-08-06 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Black, Ben LaHaise, Daniel Phillips, Rik van Riel,
	linux-kernel, linux-mm, Andrew Morton

On Sunday 05 August 2001 16:20, Linus Torvalds wrote:
> On 5 Aug 2001, Michael Rothwell wrote:
> > Could there be both interactive and throughput optimizations, and a
> > way to choose one or the other at run-time? Or even just at compile
> > time?
>
> Quite frankly, that's in my opinion the absolute worst approach.
>
> Yes, it's an approach many systems take - put the tuning load on the user,
> and blame the user if something doesn't work well. That way you don't have
> to bother with trying to get the code right, or make it make sense.

Good defaults make sense, of course, but a /proc entry for this might not be 
a bad idea either.  Specifically, I'm thinking of write-intensive systems.

Some loads are nonstandard.  I worked on a system once trying to capture a 
raw (uncompressed) HDTV signal to a 20 disk software raid hanging off of two 
qlogic fibre channel scsi cards.  (Recorder function for a commercial video 
capture/editing system.)  Throughput was all we cared about, and it had to be 
within about 10% of the hardware's theoretical maximum to avoid dropping 
frames.

Penalizing the write queue in favor of the read queue wouldn't have 
done us any good there.  If anything, we'd have wanted to go the other way.  
(Yeah, we were nonstandard.  Yeah, we patched our kernel.  Yeah, we were 
apparently the first people on the planet to stick 2 qlogic cards in the same 
system and try to use them both, and run into the hardwired scsi request 
queue length limit that managed to panic the kernel by spewing printks and 
making something time out.  But darn it, somebody had to. :)

It's easier to read proc.txt than to search the kernel archive for 
discussions that might possibly relate to the problem you're seeing...

> In general, I think we can get latency to acceptable values, and latency
> is the _hard_ thing. We seem to have become a lot better already, by just
> removing the artificial ll_rw_blk code.

I'm trying to think of an optimization that DOESN'T boil down to balancing 
latency vs throughput once you've got the easy part done.  Nothing comes to 
mind, probably a lack of caffeine on my part...

> Getting throughput up to where it should be should "just" be a matter of
> making sure we get nicely overlapping IO going. We probably just have some
> silly bug that makes us hiccup every once in a while and not keep the
> queues full enough. My current suspect is the read-ahead code itself being
> a bit too inflexible, but..

Are we going to remember to update these queue sizes when Moore's Law gives 
us drives four times as fast?  Or do you think this won't be a problem?

>
> 			Linus
>

Rob

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-04  6:37                     ` Linus Torvalds
                                         ` (2 preceding siblings ...)
  2001-08-04 16:21                       ` Mark Hemment
@ 2001-08-07 15:45                       ` Ben LaHaise
  2001-08-07 16:22                         ` Linus Torvalds
  3 siblings, 1 reply; 40+ messages in thread
From: Ben LaHaise @ 2001-08-07 15:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

On Fri, 3 Aug 2001, Linus Torvalds wrote:

> Does it work reasonably under your loads?

I didn't try pre4.  pre5 is absolutely horrible. vmstat stops responding
for long periods of time and it looks like the vm isn't being throttled at
all, so the system goes nuts trying to swap data out while it should just
wait a bit for io to complete.

		-ben

 1  2  1      0 993520   4376 2981648   0   0     4 56700  616    22   0  42  58
 1  1  1      0 863768   4492 3102032   0   0     0 69376  627    29   0  45  55
 1  0  1      0 739320   4620 3230408   0   0     0 56184  617    20   0  41  59
 1  0  1      0 611832   4740 3354952   0   0     0 63080  622    19   0  45  55
 1  0  1      0 481664   4860 3477320   0   0     0 67812  614    17   0  44  55
 1  0  1      0 354024   4984 3604156   0   0     0 60868  622    26   0  43  57
 0  2  1      0 266856   5072 3692556   0   0     4 59636  617    23   0  37  63
 0  2  1      0 195384   5136 3759408   0   0     0 66288  624    26   0  35  64
 1  1  1      0 147016   5192 3818968   0   0     4 51008  637    74   0  29  71
 0  3  1      0  95912   5248 3875684   0   0     4 56256  625    38   0  30  70
 1  1  1      0  31256   5312 3939796   0   0     8 62420  627    32   0  33  66
 1  1  2      0  14860   5352 3968976   0   0     0 49664  624  2600   0  56  44
 1  0  2      0   5476   5380 3983880   0   0    76 29988  434  2156   0  81  19
 1  0  2      0   3608   5396 3984212   0   0     0  1516  104   458   0  79  21
 1  0  2      0   3608   5396 3984212   0   0     0     0  103    92   0  75  25
 1  0  2      0   6908   5432 3987980   0   0     8 44280 1315  1394   0  95   5
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  2      0   3692   5444 3988452   0   0     0  2736  104  2363   0  79  21
 1  0  2      0   3692   5444 3988452   0   0     0     0  107    17   0  78  22
 5  1  3      0   6524   5444 3988516   0   0     8  7628  580   326   0  97   3
 2  1  3      0   5268   5472 3995208   0   0     8 11376  742   622   0 100   0
 1  0  2      0   5272   5472 3995208   0   0     0     0  105    13   0  75  25
 1  0  3      0   4476   5484 3995324   0   0     0 11780  757  6174   0  91   9
 2  0  3      0   4620   6268 3988588   0   0    76 803788 38730 117827   0  99   1
 3  1  4      0   5212   6268 3988056   0   0     4  3036  714   415   0 100   0
 1  1  3      0   4424   6268 3988036   0   0    40 33508  544  1362   0 100   0
 3  0  2      0   5948   6268 3988016   0   0     0 11532  299   792   0  99   1
 1  1  2      0   6356   6268 3987860   0   0     0  8568  360   258   0  94   6
 1  1  3      0   4452   6268 3987528   0   0     4 63604 1008 16081   0  91   9
 0  1  3      0   3652   6268 3987396   0   0     0 44796  624  3539   0  98   2



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 15:45                       ` Ben LaHaise
@ 2001-08-07 16:22                         ` Linus Torvalds
  2001-08-07 16:51                           ` Ben LaHaise
  0 siblings, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2001-08-07 16:22 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm



On Tue, 7 Aug 2001, Ben LaHaise wrote:
>
> I didn't try pre4.  pre5 is absolutely horrible.

Sorry, I should have warned people: pre5 is a test-release that was
intended solely for Leonard Zubkoff who has been helping with trying to
debug a FS livelock condition.

Try pre4.

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 16:22                         ` Linus Torvalds
@ 2001-08-07 16:51                           ` Ben LaHaise
  2001-08-07 17:08                             ` Linus Torvalds
                                               ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Ben LaHaise @ 2001-08-07 16:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

On Tue, 7 Aug 2001, Linus Torvalds wrote:

> Try pre4.

It's similarly awful (what did you expect -- there are no meaningful
changes between the two!).  io throughput to a 12 disk array is humming
along at a whopping 40MB/s (it can do 80), and it's very spotty and jerky,
mostly being driven by syncs.  vmscan gets delayed occasionally, and small
interactive program loading varies from not too long (3s) to way too long
(> 30s).

		-ben


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 16:51                           ` Ben LaHaise
@ 2001-08-07 17:08                             ` Linus Torvalds
  2001-08-07 18:17                             ` Andrew Morton
  2001-08-07 21:33                             ` Linus Torvalds
  2 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-07 17:08 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm


On Tue, 7 Aug 2001, Ben LaHaise wrote:
>
> On Tue, 7 Aug 2001, Linus Torvalds wrote:
>
> > Try pre4.
>
> It's similarly awful (what did you expect -- there are no meaningful
> changes between the two!).

The buffer.c changes could easily cause pre5 to be more aggressive in
pushing larger dirty blocks out..

Some people report _much_ better interactive behaviour with pre4.

So it obviously depends on load.

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 16:51                           ` Ben LaHaise
  2001-08-07 17:08                             ` Linus Torvalds
@ 2001-08-07 18:17                             ` Andrew Morton
  2001-08-07 18:40                               ` Ben LaHaise
  2001-08-07 21:33                             ` Linus Torvalds
  2 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2001-08-07 18:17 UTC (permalink / raw)
  To: Ben LaHaise
  Cc: Linus Torvalds, Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

Ben LaHaise wrote:
> 
> On Tue, 7 Aug 2001, Linus Torvalds wrote:
> 
> > Try pre4.
> 
> It's similarly awful (what did you expect -- there are no meaningful
> changes between the two!).  io throughput to a 12 disk array is humming
> along at a whopping 40MB/s (it can do 80), and it's very spotty and jerky,
> mostly being driven by syncs.  vmscan gets delayed occasionally, and small
> interactive program loading varies from not too long (3s) to way too long
> (> 30s).

Ben, are you using software RAID?

The throughput problems which Mike Black has been seeing with
ext3 seem to be specific to an interaction with software RAID5
and possibly highmem.  I've never been able to reproduce them.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 18:17                             ` Andrew Morton
@ 2001-08-07 18:40                               ` Ben LaHaise
  2001-08-07 21:33                                 ` Daniel Phillips
  2001-08-07 22:03                                 ` Linus Torvalds
  0 siblings, 2 replies; 40+ messages in thread
From: Ben LaHaise @ 2001-08-07 18:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Daniel Phillips, Rik van Riel, linux-kernel, linux-mm

On Tue, 7 Aug 2001, Andrew Morton wrote:

> Ben, are you using software RAID?
>
> The throughput problems which Mike Black has been seeing with
> ext3 seem to be specific to an interaction with software RAID5
> and possibly highmem.  I've never been able to reproduce them.

Yes, but I'm using raid 0.  The ratio of highmem to normal memory is
~3.25:1, and it would seem that this is breaking write throttling somehow.
The interaction between vm and io throttling is not at all predictable.
Certainly, pulling highmem out of the equation results in writes
proceeding at the speed of the disk, which makes me wonder if the bounce
buffer allocation is triggering the vm code to attempt to free more
memory.... Ah, and that would explain why shorter io queues makes things
smoother: less memory pressure is occuring on the normal memory zone from
bounce buffers.  The original state of things was allowing several hundred
MB of ram to be allocated for bounce buffers, which lead to a continuous
shortage, causing kswapd et al to spin in a loop making no progress.

Hmmm, how to make kswapd/bdflush/kreclaimd all back off until progress is
made in cleaning the io queue?

		-ben



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 18:40                               ` Ben LaHaise
@ 2001-08-07 21:33                                 ` Daniel Phillips
  2001-08-07 22:03                                 ` Linus Torvalds
  1 sibling, 0 replies; 40+ messages in thread
From: Daniel Phillips @ 2001-08-07 21:33 UTC (permalink / raw)
  To: Ben LaHaise, Andrew Morton
  Cc: Linus Torvalds, Rik van Riel, linux-kernel, linux-mm

On Tuesday 07 August 2001 20:40, Ben LaHaise wrote:
> On Tue, 7 Aug 2001, Andrew Morton wrote:
> > Ben, are you using software RAID?
> >
> > The throughput problems which Mike Black has been seeing with
> > ext3 seem to be specific to an interaction with software RAID5
> > and possibly highmem.  I've never been able to reproduce them.
>
> Yes, but I'm using raid 0.  The ratio of highmem to normal memory is
> ~3.25:1, and it would seem that this is breaking write throttling
> somehow. The interaction between vm and io throttling is not at all
> predictable. Certainly, pulling highmem out of the equation results in
> writes proceeding at the speed of the disk, which makes me wonder if
> the bounce buffer allocation is triggering the vm code to attempt to
> free more memory.... Ah, and that would explain why shorter io queues
> make things smoother: less memory pressure is occurring on the normal
> memory zone from bounce buffers.  The original state of things was
> allowing several hundred MB of ram to be allocated for bounce buffers,
> which led to a continuous shortage, causing kswapd et al to spin in a
> loop making no progress.

I thought Marcelo and Linus fixed that in pre1.

But even with the inactive_plenty/skip_page strategy there's a problem.
Suppose 2 gig of memory is active; at 4kB pages that's about half a
million pages.  Suppose you touch it all once, now everything is
active, age=2.  It takes about a million scan steps to age that down to
zero so we can start deactivating the kind of pages we want.  In the
meantime the page cache user is stalled, the pages just aren't there.
After two times around the active list, inactive pages come flooding
out, some time later they make it through the inactive queue, the user
snaps them up and creates another flood of activations.

Please, tell me this scenario can't happen.

> Hmmm, how to make kswapd/bdflush/kreclaimd all back off until progress
> is made in cleaning the io queue?

I'd suggest putting memory users on a wait queue instead of letting them 
transform themselves into vm scanners...
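
A minimal sketch of that idea against the 2.4 wait-queue primitives --
the queue name and the enough_free_pages() predicate are made up:

	#include <linux/sched.h>	/* wait queues, wake_up */

	static DECLARE_WAIT_QUEUE_HEAD(memory_throttle);

	extern int enough_free_pages(void);	/* made-up predicate */

	/* allocator side: instead of becoming yet another vm scanner,
	 * sleep until the real scanner has made progress */
	static void throttle_memory_user(void)
	{
		wait_event(memory_throttle, enough_free_pages());
	}

	/* kswapd side: called after a batch of pages has been freed */
	static void memory_progress_made(void)
	{
		wake_up(&memory_throttle);
	}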

--
Daniel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 16:51                           ` Ben LaHaise
  2001-08-07 17:08                             ` Linus Torvalds
  2001-08-07 18:17                             ` Andrew Morton
@ 2001-08-07 21:33                             ` Linus Torvalds
  2 siblings, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-07 21:33 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Daniel Phillips, Rik van Riel, linux-kernel, linux-mm


On Tue, 7 Aug 2001, Ben LaHaise wrote:
> > Try pre4.
>
> It's similarly awful (what did you expect -- there are no meaningful
> changes between the two!).  io throughput to a 12 disk array is humming
> along at a whopping 40MB/s (it can do 80), and it's very spotty and jerky,
> mostly being driven by syncs.

How about some sane approach to "balance_dirty()", like in -pre6.

The sane approach to balance_dirty() is to
 - when we're over the threshold of dirty, but not over the hard limit, we
   start IO. We don't wait for it (except in the sense that if we overflow
   the request queue we will _always_ wait for it, of course. No way to
   avoid that).
 - if we're over the hard limit, we wait for the oldest buffer on the
   locked list.
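
In other words, something like this -- a sketch of the shape only, not
the actual fs/buffer.c code; the limits and helper names below are
illustrative stand-ins:

	extern int soft_limit, hard_limit;
	extern int nr_dirty_buffers(void);
	extern void wakeup_bdflush_sketch(void);
	extern void wait_on_oldest_locked_buffer(void);

	void balance_dirty_sketch(void)
	{
		int dirty = nr_dirty_buffers();

		if (dirty < soft_limit)
			return;			/* plenty of headroom */

		wakeup_bdflush_sketch();	/* over the soft limit: start IO... */
		if (dirty < hard_limit)
			return;			/* ...but don't block on it */

		/* over the hard limit: throttle the writer on the
		 * oldest buffer on the locked list */
		wait_on_oldest_locked_buffer();
	}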

The only question is "when should we wake up bdflush?" I currently wake it
up any time we're over the soft limit, but I have this feeling that we
really should wait until we're over the hard limit - otherwise we might end
up dribbling again. I haven't tried it, but I will. Others please do too -
it's trivial: just move the wakeup around in fs/buffer.c:balance_dirty().

At least here it gives quite good results, and was rather usable even
under X when writing an 8GB file. I haven't seen this disk push 20MB/s
sustained before, and it did now (except when I was doing other things at
the time).

Will it keep the IO queues full as hell and make interactive programs
suffer? Yes, of course it will. No way to avoid the fact that reads are
going to be slower if there's a lot of writes going on. But I didn't see
vmstat hickups or anything like that.

Of course, this will depend on machine and on disk controller etc. Which
is why it would be good to test..

		Linus


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][DATA] re "ongoing vm suckage"
  2001-08-07 18:40                               ` Ben LaHaise
  2001-08-07 21:33                                 ` Daniel Phillips
@ 2001-08-07 22:03                                 ` Linus Torvalds
  1 sibling, 0 replies; 40+ messages in thread
From: Linus Torvalds @ 2001-08-07 22:03 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.33.0108071426380.30280-100000@touchme.toronto.redhat.com>,
Ben LaHaise  <bcrl@redhat.com> wrote:
>
>Yes, but I'm using raid 0.  The ratio of highmem to normal memory is
>~3.25:1, and it would seem that this is breaking write throttling somehow.

Ahh - I see it. 

Check "nr_free_buffer_pages()" - and notice how the function is meant to
return the number of pages that can be used for buffers.

But the function doesn't understand about the limitations of buffer
allocations inherent in GFP_NOFS, namely that it won't ever allocate a
high-mem buffer. So it just stupidly adds up the number of free pages,
coming to the conclusion that we have a _lot_ of memory that buffers
could use..

This obviously makes the whole balance_dirty() algorithm not work at
all.

This should be fairly easy to do. Instead of counting all zones,
nr_free_buffer_pages() should count only the zones that are listed in
the GFP_NOFS zonelist. So instead of using

	unsigned int sum;

	sum = nr_free_pages();
	sum += nr_inactive_clean_pages();
	sum += nr_inactive_dirty_pages;

it should do something like this instead (but please hide the "zonelist"
lookup behind some nice macro, I almost lost my lunch when I wrote that
;)

	unsigned int sum = 0;
	zonelist_t *zonelist = contig_page_data.node_zonelists + (gfp_mask & GFP_ZONEMASK);
	zone_t **zonep = zonelist->zones;

	for (;;) {
		zone_t *zone = *zonep++;	/* advance, or we never terminate */
		if (!zone)
			return sum;
		sum += zone->free_pages + zone->inactive_clean_pages + zone->inactive_dirty_pages;
	}

which is more accurate, and actually faster to boot (look at what
"nr_free_pages()" and friends do - they already walk all the zones)

I can't easily test this - mind giving it a whirl?

		Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2001-08-07 22:05 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-03 23:44 [RFC][DATA] re "ongoing vm suckage" Ben LaHaise
2001-08-04  1:29 ` Rik van Riel
2001-08-04  3:06   ` Daniel Phillips
2001-08-04  3:13     ` Linus Torvalds
2001-08-04  3:23       ` Rik van Riel
2001-08-04  3:35         ` Linus Torvalds
2001-08-04  3:26       ` Ben LaHaise
2001-08-04  3:34         ` Rik van Riel
2001-08-04  3:38         ` Linus Torvalds
2001-08-04  3:48         ` Linus Torvalds
2001-08-04  4:14           ` Ben LaHaise
2001-08-04  4:20             ` Linus Torvalds
2001-08-04  4:39               ` Ben LaHaise
2001-08-04  4:47                 ` Linus Torvalds
2001-08-04  5:13                   ` Ben LaHaise
2001-08-04  5:28                     ` Linus Torvalds
2001-08-04  6:37                     ` Linus Torvalds
2001-08-04  5:38                       ` Marcelo Tosatti
2001-08-04  7:13                         ` Rik van Riel
2001-08-04 14:22                       ` Mike Black
2001-08-04 17:08                         ` Linus Torvalds
2001-08-05  4:19                           ` Michael Rothwell
2001-08-05 18:40                             ` Marcelo Tosatti
2001-08-05 20:20                             ` Linus Torvalds
2001-08-05 20:45                               ` arjan
2001-08-06 20:32                               ` Rob Landley
2001-08-05 15:24                           ` Mike Black
2001-08-05 20:04                             ` Linus Torvalds
2001-08-05 20:23                               ` Alan Cox
2001-08-05 20:33                                 ` Linus Torvalds
2001-08-04 16:21                       ` Mark Hemment
2001-08-07 15:45                       ` Ben LaHaise
2001-08-07 16:22                         ` Linus Torvalds
2001-08-07 16:51                           ` Ben LaHaise
2001-08-07 17:08                             ` Linus Torvalds
2001-08-07 18:17                             ` Andrew Morton
2001-08-07 18:40                               ` Ben LaHaise
2001-08-07 21:33                                 ` Daniel Phillips
2001-08-07 22:03                                 ` Linus Torvalds
2001-08-07 21:33                             ` Linus Torvalds
