* Scheduling problem with 2.4?
@ 2003-05-17 11:22 David Kastrup
  2003-05-17 17:41 ` Andrea Arcangeli
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2003-05-17 11:22 UTC (permalink / raw)
  To: linux-kernel


I have a problem with Emacs which crawls when used as a shell.  If I
call M-x shell RET and then do something like
hexdump -v /dev/null|dd count=100k bs=1
then characters appear rather slowly at first, most of them in
single-character packets, with an occasional 1024-byte packet in
between; more of those larger packets arrive towards the end.  The
rate of packets does not exceed 100/sec.

Other operating systems do not exhibit this extreme case of
throttling.  Emacs is using select calls AFAICS for dealing with
packets arriving, and it would appear that the delivering process
causes a context switch immediately when writing its single byte to
the pipe or is taken from the run queue while the select call is
pending.  Whatever.  In the ensuing time slice, Emacs processes the
single arriving byte and then enters the select call again.

That the pipe is in most cases not allowed to fill up before a
context switch occurs (in some cases, increasingly towards the end,
the pipe alternates between getting filled completely and receiving
single bytes) is already unfortunate with regard to processing
efficiency.  What is much more of a nuisance is the apparent
throttling to a single byte per clock tick: the CPU is idling most of
the time.  It would be bad enough if it were going full blast
processing 1-byte packets, but most of the time it does nothing but
wait for the next tick.
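
For reference, the same pattern can be reproduced without Emacs.
Here is a minimal standalone sketch (an illustration of mine, not the
code Emacs actually runs): a child writes single bytes into a pipe
while the parent sits in select() and records the size of every
read().

/* Illustration only: child writes 1 byte at a time, parent select()s
   on the pipe and records how many bytes each read() returns. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    long reads = 0, bytes = 0, singles = 0;
    char buf[4096];
    int i;

    if (pipe(fd) < 0) { perror("pipe"); exit(1); }

    if (fork() == 0) {                  /* writer: 100k 1-byte writes */
        close(fd[0]);
        for (i = 0; i < 100 * 1024; i++)
            write(fd[1], "x", 1);
        _exit(0);
    }

    close(fd[1]);                       /* reader */
    for (;;) {
        fd_set set;
        ssize_t n;
        FD_ZERO(&set);
        FD_SET(fd[0], &set);
        if (select(fd[0] + 1, &set, NULL, NULL, NULL) < 0)
            break;
        n = read(fd[0], buf, sizeof buf);
        if (n <= 0)
            break;
        reads++; bytes += n;
        if (n == 1)
            singles++;
    }
    wait(NULL);
    printf("%ld bytes in %ld reads, %ld of them single-byte\n",
           bytes, reads, singles);
    return 0;
}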

Now I would want to

a) get sufficient information about what might be going wrong inside
Linux as compared to other operating systems, and

b) find a good workaround that avoids triggering this kind of
behavior.

I don't have sufficient kernel experience to debug this one myself,
and it severely impacts the speed of operations that cooperate
closely with the operating system.  So any pointers would be
welcome.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Scheduling problem with 2.4?
  2003-05-17 11:22 Scheduling problem with 2.4? David Kastrup
@ 2003-05-17 17:41 ` Andrea Arcangeli
  2003-05-17 19:36   ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2003-05-17 17:41 UTC (permalink / raw)
  To: dak; +Cc: linux-kernel

On Sat, May 17, 2003 at 01:22:23PM +0200, David Kastrup wrote:
> 
> I have a problem with Emacs which crawls when used as a shell.  If I
> call M-x shell RET and then do something like
> hexdump -v /dev/null|dd count=100k bs=1

this certainly can't be it

andrea@dualathlon:~> hexdump -v /dev/null|dd count=100k bs=1
0+0 records in
0+0 records out
andrea@dualathlon:~> 

How can that run slowly?  There's nothing to hexdump in /dev/null and
nothing for dd to read.

One related hint: since you're playing with >4k buffers with dd, make
sure to run a -aa kernel with the lowlatency fixes (latest is
2.4.21rc2aa1), or those large writes can loop in the kernel forbidding
preemption (in this case maybe not, if they block on a socket/pipe
before all 100k is written).

Overall, people often notice an improvement with preempt because they
compare it with buggy kernels (I know this has nothing to do with your
report, but it's related to the above).

Andrea


* Re: Scheduling problem with 2.4?
  2003-05-17 17:41 ` Andrea Arcangeli
@ 2003-05-17 19:36   ` David Kastrup
  2003-05-17 20:30     ` Andrea Arcangeli
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2003-05-17 19:36 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Sat, May 17, 2003 at 01:22:23PM +0200, David Kastrup wrote:
> > 
> > I have a problem with Emacs which crawls when used as a shell.  If I
> > call M-x shell RET and then do something like
> > hexdump -v /dev/null|dd count=100k bs=1
> 
> this certainly can't be it
> 
> andrea@dualathlon:~> hexdump -v /dev/null|dd count=100k bs=1
> 0+0 records in
> 0+0 records out

Argl.  Substitute /dev/zero in the obvious place.  My stupidity.

Here is a test file for generating further stats from within Emacs.
You can call it interactively from within Emacs with
M-x load-library RET testio.el RET
M-x make-test RET

and you can use it from the command line (if you really must) with
emacs -batch -q -no-site-file -l testio.el -eval "(progn(make-test t
  'kill-emacs)(sit-for 100000))"


[-- Attachment #2: testio.el --]
[-- Type: application/emacs-lisp, Size: 1557 bytes --]

It will then output the data read from the process, followed by
statistics of the block sizes received during each second it ran.

The interactive session tends to be somewhat worse, but the command
line version is bad enough for a demonstration.

It is obvious that most of the time the pipe only receives single
characters.

Again, I am pretty sure that Emacs is at fault too, but I don't
understand the implications of what scheduling behavior causes the
pipes to remain mostly empty, with an occasional full filling.  It
would be better if Emacs would not be context-switched into the
moment something appears in the pipe, but if the writer to the pipe
would keep the opportunity to fill'er'up until it is either preempted
or needs to wait.  If there was some possibility to force this
behavior from within Emacs, this would be good to know.

Thanks,

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Scheduling problem with 2.4?
  2003-05-17 19:36   ` David Kastrup
@ 2003-05-17 20:30     ` Andrea Arcangeli
  2003-05-17 20:44       ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2003-05-17 20:30 UTC (permalink / raw)
  To: dak; +Cc: linux-kernel

On Sat, May 17, 2003 at 09:36:29PM +0200, David Kastrup wrote:
> Andrea Arcangeli <andrea@suse.de> writes:
> 
> > On Sat, May 17, 2003 at 01:22:23PM +0200, David Kastrup wrote:
> > > 
> > > I have a problem with Emacs which crawls when used as a shell.  If I
> > > call M-x shell RET and then do something like
> > > hexdump -v /dev/null|dd count=100k bs=1
> > 
> > this certainly can't be it
> > 
> > andrea@dualathlon:~> hexdump -v /dev/null|dd count=100k bs=1
> > 0+0 records in
> > 0+0 records out
> 
> Argl.  Substitute /dev/zero in the obvious place.  My stupidity.
> 
> Here is a test file for generating further stats from within Emacs.
> You can call it interactively from within Emacs with
> M-x load-library RET testio.el RET
> M-x make-test RET
> 
> and you can use it from the command line (if you really must) with
> emacs -batch -q -no-site-file -l testio.el -eval "(progn(make-test t
>   'kill-emacs)(sit-for 100000))"
> 


> 
> 
> It will then output the read stuff from the process followed by
> statistics of block sizes over each second it ran.
> 
> The interactive session tends to be somewhat worse, but the command
> line version is bad enough for a demonstration.

this is within emacs:

   1 blocks with size   95
   1 blocks with size  146
   1 blocks with size  175
   1 blocks with size  189
   1 blocks with size  199
   2 blocks with size  231
   2 blocks with size  232
   1 blocks with size  233
   1 blocks with size  238
   1 blocks with size  241
   1 blocks with size  242
   1 blocks with size  243
   1 blocks with size  244
   1 blocks with size  245
   1 blocks with size  247
   1 blocks with size  252
   2 blocks with size  257
   1 blocks with size  260
   2 blocks with size  261
   2 blocks with size  262
   1 blocks with size  263
   1 blocks with size  266
   1 blocks with size  271
   1 blocks with size  276
   1 blocks with size  282
   1 blocks with size  283
   1 blocks with size  289
   1 blocks with size  296
   1 blocks with size  297
   1 blocks with size  299
   1 blocks with size  300
   2 blocks with size  302
   1 blocks with size  304
   1 blocks with size  306
   2 blocks with size  307
   1 blocks with size  308
   1 blocks with size  310
   1 blocks with size  311
   3 blocks with size  312
   1 blocks with size  313
   2 blocks with size  315
   1 blocks with size  316
   1 blocks with size  317
   1 blocks with size  322
   1 blocks with size  324
   1 blocks with size  330
   1 blocks with size  345
   1 blocks with size  347
   2 blocks with size  355
   2 blocks with size  359
   2 blocks with size  363
   1 blocks with size  368
   1 blocks with size  369
   2 blocks with size  373
   1 blocks with size  374
   2 blocks with size  375
   1 blocks with size  383
   1 blocks with size  384
   2 blocks with size  385
   1 blocks with size  386
   2 blocks with size  387
   1 blocks with size  389
   1 blocks with size  391
   1 blocks with size  392
   2 blocks with size  393
   3 blocks with size  394
   1 blocks with size  397
   2 blocks with size  398
   4 blocks with size  400
   1 blocks with size  402
   1 blocks with size  405
   1 blocks with size  409
   1 blocks with size  414
   1 blocks with size  418
   1 blocks with size  420
   4 blocks with size  423
   1 blocks with size  424
   1 blocks with size  428
   1 blocks with size  429
   1 blocks with size  430
   1 blocks with size  432
   1 blocks with size  433
   1 blocks with size  436
   1 blocks with size  437
   1 blocks with size  443
   2 blocks with size  445
   1 blocks with size  448
   2 blocks with size  451
   1 blocks with size  453
   1 blocks with size  455
   1 blocks with size  472
   2 blocks with size  485
   1 blocks with size  490
   1 blocks with size  498
   1 blocks with size  504
   1 blocks with size  505
   1 blocks with size  510
   1 blocks with size  512
   2 blocks with size  533
   2 blocks with size  537
   1 blocks with size  540
   1 blocks with size  545
   1 blocks with size  551
   1 blocks with size  557
   2 blocks with size  562
   1 blocks with size  571
   1 blocks with size  572
   1 blocks with size  573
   1 blocks with size  578
   1 blocks with size  579
   1 blocks with size  582
   1 blocks with size  588
   2 blocks with size  596
   1 blocks with size  598
   1 blocks with size  599
   1 blocks with size  600
   1 blocks with size  602
   2 blocks with size  610
   1 blocks with size  616
   1 blocks with size  620
   2 blocks with size  623
   1 blocks with size  628
   1 blocks with size  629
   1 blocks with size  630
   2 blocks with size  638
   1 blocks with size  639
   2 blocks with size  640
   1 blocks with size  641
   1 blocks with size  642
   1 blocks with size  644
   1 blocks with size  648
   1 blocks with size  650
   1 blocks with size  651
   1 blocks with size  652
   1 blocks with size  654
   1 blocks with size  656
   1 blocks with size  657
   1 blocks with size  659
   1 blocks with size  662
   1 blocks with size  665
   1 blocks with size  667
   2 blocks with size  670
   1 blocks with size  672
   1 blocks with size  679
   1 blocks with size  687
   1 blocks with size  720
   1 blocks with size  735
   1 blocks with size  746
   1 blocks with size  754
   1 blocks with size  775
   1 blocks with size  834
   1 blocks with size  844
   1 blocks with size  894
   1 blocks with size  896
   1 blocks with size  906
   1 blocks with size  928
   1 blocks with size  980
   2 blocks with size  987
   1 blocks with size  991
   1 blocks with size  998
   1 blocks with size 1023
   8 blocks with size 1024

from the shell prompt:

   6 blocks with size   26
   5 blocks with size   27
  14 blocks with size   28
  10 blocks with size   29
   4 blocks with size   30
   2 blocks with size   31
   2 blocks with size   32
   2 blocks with size   33
   4 blocks with size   34
   1 blocks with size   35
   1 blocks with size   36
   2 blocks with size   38
   2 blocks with size   39
   1 blocks with size   41
   1 blocks with size   45
   1 blocks with size   46
   3 blocks with size   48
   1 blocks with size   49
   1 blocks with size   52
   3 blocks with size   55
   1 blocks with size   57
   1 blocks with size   58
   2 blocks with size   60
   1 blocks with size   61
   1 blocks with size   62
   1 blocks with size   63
   1 blocks with size   65
   1 blocks with size   71
   2 blocks with size   75
   1 blocks with size   79
   1 blocks with size   86
   1 blocks with size  120
   1 blocks with size  128
   2 blocks with size  147
   1 blocks with size  190
   1 blocks with size  233
   1 blocks with size  241
   1 blocks with size  250
   1 blocks with size  269
   1 blocks with size  273
   1 blocks with size  407
   1 blocks with size  564
   1 blocks with size  614
   2 blocks with size  626
   1 blocks with size  667
   1 blocks with size  680
   1 blocks with size  709
   1 blocks with size  711
   1 blocks with size  712
   1 blocks with size  719
   1 blocks with size  722
   1 blocks with size  729
   1 blocks with size  732
   2 blocks with size  737
   1 blocks with size  759
   1 blocks with size  772
   1 blocks with size 1014
   5 blocks with size 1023
  31 blocks with size 1024


my emacs is:

GNU Emacs 21.2.1
Copyright (C) 2001 Free Software Foundation, Inc.
GNU Emacs comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of Emacs
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

> 
> It is obvious that during most of the time, the pipe only receives
> single characters.
> 
> Again, I am pretty sure that Emacs is at fault too, but I don't
> understand the implications of what scheduling behavior causes the
> pipes to remain mostly empty, with an occasional full filling.  It
> would be better if Emacs would not be context-switched into the
> moment something appears in the pipe, but if the writer to the pipe
> would keep the opportunity to fill'er'up until it is either preempted
> or needs to wait.  If there was some possibility to force this
> behavior from within Emacs, this would be good to know.

I don't see any slowdown here.

As said, the 100k bs will lead to just one syscall for a lot of I/O;
this made me think the lowlatency fixes could be related.  So I would
suggest that you try to reproduce it with my 2.4 tree or with 2.5 with
-preempt enabled.

Just for the record, this is the patch I'm talking about (you can apply
it easily to other 2.4 kernels; this should really be merged into
mainline too):

	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/kernels/v2.4/2.4.21rc2aa1/9998_lowlatency-fixes-12

Andrea


* Re: Scheduling problem with 2.4?
  2003-05-17 20:30     ` Andrea Arcangeli
@ 2003-05-17 20:44       ` David Kastrup
  2003-05-17 21:53         ` Andrea Arcangeli
  2003-05-17 21:54         ` Barry K. Nathan
  0 siblings, 2 replies; 20+ messages in thread
From: David Kastrup @ 2003-05-17 20:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> > The interactive session tends to be somewhat worse, but the command
> > line version is bad enough for a demonstration.
> 
> this is within emacs:
> 
>    1 blocks with size   95
>    1 blocks with size  146
>    1 blocks with size  175

[...]

>    1 blocks with size  980
>    2 blocks with size  987
>    1 blocks with size  991
>    1 blocks with size  998
>    1 blocks with size 1023
>    8 blocks with size 1024
> 
> from the shell prompt:
> 

[...]

Well, those are excellent results, and that is how it should be.

> my emacs is:
> 
> GNU Emacs 21.2.1

Same here.

I am talking about the kernel coming with RedHat 9, uname -a gives
Linux lola.goethe.zz 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux

Unfortunately, this kernel is here to stay for quite a while, and I
would want to find a way to let Emacs cooperate with it better without
telling people to recompile the kernel or wait a year.

> > Again, I am pretty sure that Emacs is at fault too, but I don't
> > understand the implications of what scheduling behavior causes the
> > pipes to remain mostly empty, with an occasional full filling.  It
> > would be better if Emacs would not be context-switched into the
> > moment something appears in the pipe, but if the writer to the
> > pipe would keep the opportunity to fill'er'up until it is either
> > preempted or needs to wait.  If there was some possibility to
> > force this behavior from within Emacs, this would be good to know.
> 
> I don't see any slowdown here.

Which kernel?  Oh, and of course: on an SMP machine the problem would
not be visible, since there the context switch to Emacs does not stop
the other process from stuffing data into the pipe.

> 	http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/kernels/v2.4/2.4.21rc2aa1/9998_lowlatency-fixes-12

Taking notice, but it will need some time before I can apply it.  And I
really want to find a way to avoid triggering this problem in the first
place.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Scheduling problem with 2.4?
  2003-05-17 20:44       ` David Kastrup
@ 2003-05-17 21:53         ` Andrea Arcangeli
  2003-05-17 22:37           ` David Kastrup
  2003-05-17 21:54         ` Barry K. Nathan
  1 sibling, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2003-05-17 21:53 UTC (permalink / raw)
  To: dak; +Cc: linux-kernel

On Sat, May 17, 2003 at 10:44:16PM +0200, David Kastrup wrote:
> Which kernel?  Oh, and of course: on a SMP machine the problem would

I ran it on SMP, but I have understood what's going on.

--- testio.el.orig	2003-05-17 22:22:24.000000000 +0200
+++ testio.el	2003-05-17 23:39:49.000000000 +0200
@@ -19,7 +19,7 @@
   (setq test-pattern nil test-start (float-time))
   (let ((process (start-process
 		  "test" (and (bufferp printer) printer) "sh"
-		  "-c" "od -v /dev/zero|dd bs=1 count=100k")))
+		  "-c" "od -v /dev/zero| dd bs=100k count=1")))
     (set-process-filter process `(lambda (process string)
 				   (test-filter process string
 						',printer)))


this is what I thought you were really doing with this script, not the
other way around ;).

The reason you get the 1-char reads in your version is that dd will
only write 1 char at a time to emacs.  This has nothing to do with
emacs; this is about dd writing 1 char at a time.  If you don't write
1 char at a time, emacs won't read 1 char at a time.

The only thing you could do to decrease the context-switch flood is to
put a buffer with a timeout in between, but having a buffer with a
timeout in the kernel by default would be overkill overhead, and
depending on the timeout it could be visible to normal shells.  So the
kernel behaviour is right and it isn't going to change.

If one important app of yours is dumb and writes a flood of data 1 char
at a time, then just write a buffer in userspace yourself; in 99% of
cases a dd bs=4k will just work perfectly (you'll need a version with a
timeout-based stdout flush only if the pipe stays open, i.e. if the app
doesn't quit).
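
Something along these lines, for instance (an untested sketch; the 4k
buffer and the 100ms flush timeout are arbitrary numbers): collect
stdin into a buffer and flush it to stdout when the buffer fills up or
when the timeout expires with data pending.

/* Untested sketch of a "timeout-stdout-flush" buffer: stdin -> stdout,
   flushing on a full buffer or after a timeout with data pending. */
#include <unistd.h>
#include <sys/select.h>

#define BUFSZ    4096            /* arbitrary */
#define FLUSH_MS  100            /* arbitrary */

int main(void)
{
    static char buf[BUFSZ];
    size_t fill = 0;

    for (;;) {
        fd_set set;
        struct timeval tv = { 0, FLUSH_MS * 1000 };
        ssize_t n;

        FD_ZERO(&set);
        FD_SET(0, &set);
        /* wait for input, but only as long as the flush timeout
           when there is already buffered data */
        n = select(1, &set, NULL, NULL, fill ? &tv : NULL);
        if (n > 0) {
            n = read(0, buf + fill, sizeof buf - fill);
            if (n <= 0)
                break;                    /* EOF or error */
            fill += n;
        }
        if (fill == sizeof buf || (n == 0 && fill > 0)) {
            write(1, buf, fill);          /* flush */
            fill = 0;
        }
    }
    if (fill)
        write(1, buf, fill);
    return 0;
}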

Andrea


* Re: Scheduling problem with 2.4?
  2003-05-17 20:44       ` David Kastrup
  2003-05-17 21:53         ` Andrea Arcangeli
@ 2003-05-17 21:54         ` Barry K. Nathan
  1 sibling, 0 replies; 20+ messages in thread
From: Barry K. Nathan @ 2003-05-17 21:54 UTC (permalink / raw)
  To: dak; +Cc: Andrea Arcangeli, linux-kernel

On Sat, May 17, 2003 at 10:44:16PM +0200, David Kastrup wrote:
> I am talking about the kernel coming with RedHat 9, uname -a gives
> Linux lola.goethe.zz 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux
> 
> Unfortunately, this kernel is here to stay for quite a while, and I
> would want to find a way to let Emacs cooperate with it better without
> telling people to recompile the kernel or wait a year.

They should be upgrading to the kernel here, at least, for security
reasons:
https://rhn.redhat.com/errata/RHSA-2003-172.html

I have no idea what effect, if any, it would have on your performance
problems though.

-Barry K. Nathan <barryn@pobox.com>


* Re: Scheduling problem with 2.4?
  2003-05-17 21:53         ` Andrea Arcangeli
@ 2003-05-17 22:37           ` David Kastrup
  2003-05-17 23:50             ` Andrea Arcangeli
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2003-05-17 22:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Sat, May 17, 2003 at 10:44:16PM +0200, David Kastrup wrote:
> > Which kernel?  Oh, and of course: on a SMP machine the problem would
> 
> I run it on smp.

Then the results are worthless.  The problem is that the write to the
pipe causes an immediate context switch, keeping the pipe from
gathering useful amounts of information.

> but I understood what's going on.
> 
> --- testio.el.orig	2003-05-17 22:22:24.000000000 +0200
> +++ testio.el	2003-05-17 23:39:49.000000000 +0200
> @@ -19,7 +19,7 @@
>    (setq test-pattern nil test-start (float-time))
>    (let ((process (start-process
>  		  "test" (and (bufferp printer) printer) "sh"
> -		  "-c" "od -v /dev/zero|dd bs=1 count=100k")))
> +		  "-c" "od -v /dev/zero| dd bs=100k count=1")))
>      (set-process-filter process `(lambda (process string)
>  				   (test-filter process string
>  						',printer)))

That utterly and entirely defeats the purpose of my demonstration.
Emacs crawls as a terminal/shell/whatever exactly because programs
writing to a pty do not write in chunks of 100k.

> The reason you get the 1 char reads in your version is because dd will
> only write 1 char at time to emacs.

Of course it does.  I told it to do so.  But there is no necessity to
do an immediate context switch: it would be completely sufficient if
Emacs (which is waiting on select) were put in the run queue and
scheduled when the time slice of dd was up.  Performance gets better
at some time only because Emacs spends some time with other business
(like garbage collection) and gets preempted outside of the select
system call, and only _then_ can dd fill the pipe in a whiff.

> This has nothing to do with emacs, this is about dd writing 1 char
> at time. If you don't write 1 char at time, emacs won't read 1 char
> at time.

But if I am doing process communication with other processes, the I/O
_will_ arrive in small portions, and when the generating processes are
running on the same CPU instead of being I/O bound, I don't stand a
chance of working efficiently, namely utilizing the pipes, if I do a
quasi-synchronous context switch by force even if the time slice has
been hardly touched.

Pipes that never fill up cause overhead.  I find it amusing that a
pipeline on a single-processor system will yield _vastly_ increased
performance the moment the receiving end works so _slowly_ that it gets
preempted.  Only then will the sender be able to fill the pipe, causing
a large performance gain.

> The only thing you could do to decrease the cxt switch flood is to put a
> buffer with timeout in between, but by default having a buffer with
> timeout in the kernel would be an overkill overhead and depending on the
> timeout it could be visible for normal shells. So the kernel behaviour
> is right and it isn't going to change.

Visible for normal shells?  Hardly.  We are talking about letting the
output-generating process live out its time slice (10ms) or be
rescheduled because it is waiting for something else.  Do you think you
would notice, negatively, that the first letter of the prompt appears
10ms later together with the complete rest of the prompt, instead of
the letters appearing one by one, with only the first one immediate?

> If one important app of yours is dumb and writes flood of data 1
> char at time

All interactive apps, which include almost anything run on a pty,
write data in small chunks.  If you context-switch them away from the
pipe the moment they have placed a small chunk into it, performance
will suffer.

Wasn't the idea behind the anticipatory I/O scheduler something like
that?  I remember it vaguely.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Scheduling problem with 2.4?
  2003-05-17 22:37           ` David Kastrup
@ 2003-05-17 23:50             ` Andrea Arcangeli
  2003-05-18  0:16               ` David Schwartz
  2003-05-18  0:24               ` David Kastrup
  0 siblings, 2 replies; 20+ messages in thread
From: Andrea Arcangeli @ 2003-05-17 23:50 UTC (permalink / raw)
  To: dak; +Cc: linux-kernel

On Sun, May 18, 2003 at 12:37:01AM +0200, David Kastrup wrote:
> Of course it does.  I told it to do so.  But there is no necessity to
> do an immediate context switch: it would be completely sufficient if
> Emacs (which is waiting on select) were put in the run queue and
> scheduled when the time slice of dd was up.  Performance gets better

The switch happens because emacs has a higher dynamic priority, as it
was sleeping for the longest time.  Without these special cases for
interactive tasks we couldn't use such long timeslices without making
the system unresponsive.
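
In rough terms, a toy model of the decision (not the actual 2.4
wakeup code, which lives around reschedule_idle()/goodness() and is
more involved):

/* Toy model only: why the writer gets preempted on the wakeup. */
struct task {
    int dyn_prio;           /* higher value = higher dynamic priority
                               in this model; it grows while sleeping */
    int need_resched;
};

static void wake_up_sketch(struct task *woken, struct task *curr)
{
    /* the long-sleeping reader (emacs) has accumulated a higher
       dynamic priority than the writer (dd), so the writer is marked
       for preemption the moment it wakes the reader up */
    if (woken->dyn_prio > curr->dyn_prio)
        curr->need_resched = 1;
}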

> (like garbage collection) and gets preempted outside of the select
> system call, and only _then_ can dd fill the pipe in a whiff.
> 
> > This has nothing to do with emacs, this is about dd writing 1 char
> > at time. If you don't write 1 char at time, emacs won't read 1 char
> > at time.
> 
> But if I am doing process communication with other processes, the I/O
> _will_ arrive in small portions, and when the generating processes are
> running on the same CPU instead of being I/O bound, I don't stand a
> chance of working efficiently, namely utilizing the pipes, if I do a

writing 1 byte per syscall isn't very efficient in the first place
(whether or not the context switch happens).

I see what you mean, but I still don't think it is a problem. If
bandwidth matters you will have to use large writes and reads anyways,
if bandwidth doesn't matter the number of ctx switches doesn't matter
either and latency usually is way more important with small messages.

You're applying small messages to a "throughput" test; this is why you
have a problem.  If you were really doing interprocess communication
(and not a throughput benchmark) you would probably want the smallest
possible delay in the delivery of the signal/message.

Andrea


* RE: Scheduling problem with 2.4?
  2003-05-17 23:50             ` Andrea Arcangeli
@ 2003-05-18  0:16               ` David Schwartz
  2003-05-18  1:06                 ` Andrea Arcangeli
  2003-05-18  8:55                 ` Mike Galbraith
  2003-05-18  0:24               ` David Kastrup
  1 sibling, 2 replies; 20+ messages in thread
From: David Schwartz @ 2003-05-18  0:16 UTC (permalink / raw)
  To: Andrea Arcangeli, dak; +Cc: linux-kernel


> I see what you mean, but I still don't think it is a problem. If
> bandwidth matters you will have to use large writes and reads anyways,
> if bandwidth doesn't matter the number of ctx switches doesn't matter
> either and latency usually is way more important with small messages.

> Andrea

	This is the danger of pre-emption based upon dynamic priorities. You
can get cases where two processes each are permitted to make a very small
amount of progress in alternation. This can happen just as well with large
writes as with small ones; the amount of data is irrelevant, it's the amount
of CPU time that's important, or to put it another way, it's how far a
process can get without suffering a context switch.

	I suggest that a process be permitted to use up at least some portion of
its timeslice exempt from any pre-emption based solely on dynamic
priorities.

	DS




* Re: Scheduling problem with 2.4?
  2003-05-17 23:50             ` Andrea Arcangeli
  2003-05-18  0:16               ` David Schwartz
@ 2003-05-18  0:24               ` David Kastrup
  1 sibling, 0 replies; 20+ messages in thread
From: David Kastrup @ 2003-05-18  0:24 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Sun, May 18, 2003 at 12:37:01AM +0200, David Kastrup wrote:
> > Of course it does.  I told it to do so.  But there is no necessity to
> > do an immediate context switch: it would be completely sufficient if
> > Emacs (which is waiting on select) were put in the run queue and
> > scheduled when the time slice of dd was up.  Performance gets better
> 
> the switch happens because emacs has higher dynamic priority, as it
> was sleeping for the longest time. without these special cases for
> the interactive tasks we couldn't use these long timeslices without
> making the system not responsive.

Then we will need a smaller timeslice for making this decision.  If I
have a sending process able to write 1000000 characters per second,
it is wasteful if such a process is scheduled away after writing a
single line (most processes running inside of a shell will work
line-buffered, won't they?).

> > But if I am doing process communication with other processes, the I/O
> > _will_ arrive in small portions, and when the generating processes are
> > running on the same CPU instead of being I/O bound, I don't stand a
> > chance of working efficiently, namely utilizing the pipes, if I do a
> 
> writing 1 byte per syscall isn't very efficient in the first place
> (no matter if the cxt switch happens or not).

Sure, but we are more typically talking about writing a line at a
time, which is easily by a factor 50 smaller than the pipe capacity
for typical output.

> I see what you mean, but I still don't think it is a problem. If
> bandwidth matters you will have to use large writes and reads
> anyways, if bandwidth doesn't matter the number of ctx switches
> doesn't matter either and latency usually is way more important with
> small messages.
> 
> you're applying small messages to a "throughput" test, this is why
> you have a problem. If you really did interprocess communication
> (and not a throughput benchmark) you would probably want the
> smallest delay in the delivery of the signal/message.

Not when I could have my pipe filled within a fraction of a
millisecond _without_ idling (starting up the reading process the
moment that the writing process has taken more than its due of time
or is in itself waiting is, of course, perfectly sensible).

Xterm:
time dd if=/dev/zero bs=16k count=16|od -v
real    0m8.656s
user    0m0.240s
sys     0m0.090s

time dd if=/dev/zero bs=16k count=16|od -v|dd obs=16k
real    0m3.794s
user    0m0.240s
sys     0m0.060s

Do you really think that I would have appreciated the "smaller
latency" of few milliseconds at best in the first case?  Note that
this is exclusively a uni-processor problem: the fast context switch
will starve the writing process from CPU time and make sure that even
if it could crank out hundreds of writes in millisecond, it will only
get a single one of them placed into the pipe, ever, unless the
receiving end gets preempted before reaching select, at which time the
writer can finally stuff the pipe completely.

This all-or-nothing utilization of a pipe or pty is not a sane
tradeoff.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Scheduling problem with 2.4?
  2003-05-18  0:16               ` David Schwartz
@ 2003-05-18  1:06                 ` Andrea Arcangeli
  2003-05-18  9:41                   ` David Kastrup
  2003-05-18  8:55                 ` Mike Galbraith
  1 sibling, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2003-05-18  1:06 UTC (permalink / raw)
  To: David Schwartz; +Cc: dak, linux-kernel

On Sat, May 17, 2003 at 05:16:33PM -0700, David Schwartz wrote:
> 
> > I see what you mean, but I still don't think it is a problem. If
> > bandwidth matters you will have to use large writes and reads anyways,
> > if bandwidth doesn't matter the number of ctx switches doesn't matter
> > either and latency usually is way more important with small messages.
> 
> > Andrea
> 
> 	This is the danger of pre-emption based upon dynamic priorities. You can
> get cases where two processes each are permitted to make a very small amount
> of progress in alternation. This can happen just as well with large writes
> as small ones, the amount of data is irrelevent, it's the amount of CPU time
> that's important, or to put it another way, it's how far a process can get
> without suffering a context switch.
> 
> 	I suggest that a process be permitted to use up at least some portion of
> its timeslice exempt from any pre-emption based solely on dynamic
> priorities.

That's the issue, yes.  But then when a multithreaded app sends a signal
to another process, it can take up to this "x" timeslice portion before
the signal runs (I mean on UP).  The same goes for mouse clicks etc.
1msec for mouse clicks should not be noticeable though.  And overall I
don't see a really big issue in the current code.

To try it, probably the simplest way is to add a need_resched_timeout
along with need_resched, and to honour the need_resched only when the
timeout triggers.  An immediate need_resched will set the timeout =
jiffies so it'll trigger immediately; the timer interrupt will check it.
The reschedule_idle on a non-idle cpu will be the only one to set a
later timeout.  Effectively I'm curious to see what will happen.  Not
all archs would need to check against it (the entry.S code is the main
reader of need_resched); it'll be a hint only, and idle() for sure must
not check it at all.  This would guarantee a minimal reserved timeslice
of up to 1/HZ by setting the timeout to jiffies + 2 (jiffies + 1 would
give a mean of 1/HZ/2, but the worst case would be ~0; in practice even
+ 1 would be enough).  Does anybody think this could have a value?  If
yes, I can make a quick hack to see what happens.
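
In pseudo-C, the idea above might look roughly like this (a toy sketch
of the description, not a patch against any tree; need_resched_timeout
is just the placeholder name used above):

/* Toy model only of the need_resched_timeout idea. */
typedef unsigned long jiffies_t;

struct task {
    int need_resched;
    jiffies_t need_resched_timeout;
};

/* wakeup on a busy (non-idle) cpu: request a reschedule, but do not
   honour it before roughly one more tick has passed */
static void resched_deferred(struct task *curr, jiffies_t jiffies)
{
    curr->need_resched = 1;
    curr->need_resched_timeout = jiffies + 2;   /* worst case ~1/HZ */
}

/* every other reschedule request keeps today's behaviour */
static void resched_now(struct task *curr, jiffies_t jiffies)
{
    curr->need_resched = 1;
    curr->need_resched_timeout = jiffies;       /* triggers at once */
}

/* checked from the timer interrupt; idle() would ignore the timeout */
static int resched_due(const struct task *curr, jiffies_t jiffies)
{
    return curr->need_resched &&
           (long)(jiffies - curr->need_resched_timeout) >= 0;
}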

Andrea


* RE: Scheduling problem with 2.4?
  2003-05-18  0:16               ` David Schwartz
  2003-05-18  1:06                 ` Andrea Arcangeli
@ 2003-05-18  8:55                 ` Mike Galbraith
  2003-05-18 17:46                   ` David Schwartz
                                     ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Mike Galbraith @ 2003-05-18  8:55 UTC (permalink / raw)
  To: David Schwartz; +Cc: Andrea Arcangeli, dak, linux-kernel

At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:

> > I see what you mean, but I still don't think it is a problem. If
> > bandwidth matters you will have to use large writes and reads anyways,
> > if bandwidth doesn't matter the number of ctx switches doesn't matter
> > either and latency usually is way more important with small messages.
>
> > Andrea
>
>         This is the danger of pre-emption based upon dynamic priorities. 
> You can
>get cases where two processes each are permitted to make a very small amount
>of progress in alternation. This can happen just as well with large writes
>as small ones, the amount of data is irrelevent, it's the amount of CPU time
>that's important, or to put it another way, it's how far a process can get
>without suffering a context switch.
>
>         I suggest that a process be permitted to use up at least some 
> portion of
>its timeslice exempt from any pre-emption based solely on dynamic
>priorities.

Cool.

Thank you for the spiffy idea.  I implemented this in my (hack/chop) mm5 
tree in about 30 seconds, and it works just fine.  Very simple 
time_after(jiffies, p->last_run + MIN_TIMESLICE) checks in wake_up_cpu() 
plus the obvious dinky change to schedule(), and it worked 
instantly.  Hmm.. I wonder if this might help some of the other scheduler 
(seemingly) bad behavior as well.
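
In outline, the check amounts to something like this (a toy model of
what is described above, not the actual mm5 diff; MIN_TIMESLICE is a
placeholder value here):

/* Toy model only: gate dynamic-priority preemption on the current
   task having held the cpu for at least MIN_TIMESLICE. */
#define MIN_TIMESLICE 1          /* jiffies; placeholder value */

struct task {
    int prio;                    /* higher = higher dynamic priority
                                    in this toy model */
    int need_resched;
    unsigned long last_run;      /* when it last got the cpu */
};

static void wake_up_cpu_sketch(struct task *woken, struct task *curr,
                               unsigned long jiffies)
{
    if (woken->prio > curr->prio &&
        jiffies > curr->last_run + MIN_TIMESLICE)
        curr->need_resched = 1;
}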

Is there any down-side to not preempting quite as often?  It seems like 
there should be a bandwidth gain.

         -Mike 



* Re: Scheduling problem with 2.4?
  2003-05-18  1:06                 ` Andrea Arcangeli
@ 2003-05-18  9:41                   ` David Kastrup
  0 siblings, 0 replies; 20+ messages in thread
From: David Kastrup @ 2003-05-18  9:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David Schwartz, linux-kernel, monnier+gnu/emacs

Andrea Arcangeli <andrea@suse.de> writes:

> On Sat, May 17, 2003 at 05:16:33PM -0700, David Schwartz wrote:
> > 
> > 	This is the danger of pre-emption based upon dynamic
> > priorities. You can get cases where two processes each are
> > permitted to make a very small amount of progress in
> > alternation. This can happen just as well with large writes as
> > small ones, the amount of data is irrelevent, it's the amount of
> > CPU time that's important, or to put it another way, it's how far
> > a process can get without suffering a context switch.
> > 
> > 	I suggest that a process be permitted to use up at least some
> > portion of its timeslice exempt from any pre-emption based solely
> > on dynamic priorities.
> 
> that's the issue yes. but then when a multithreaded app sends a
> signal to another process it can take up to this "x" timeslice
> portion before the signal will run (I mean in UP). Same goes for
> mouse clicks etc..  1msec for mouse clicks should not be noticeable
> though. And over all I don't see a real big issue in the current
> code.

Well, the problem that we have is excessive synchronous context
switching, so the solution might be to simply throttle on that.  It's
no problem the first few times round, but after some time the
scheduler should recognize the pattern and clamp down.  So I'd propose
that we give a process that yields synchronously while it could be
ready to run an accumulating priority boost that gets wasted pretty
fast when the process does a full wait (aka the pipe is full) or is
preempted.

That is, have the scheduler penalize/throttle application-visible
context switches between runnable processes, in particular if the
processes are dependent on one another, as in the case of the pipe.
We can pride ourselves on optimizing the kernel to make a context
switch in 1us, but if each context switch incurs 2ms of avoidable
overhead in our application, the bottom line for the user will remain
ugly.

> To try it probably the simpler way to add a need_resched_timeout
> along to need_resched, and to honour the need_resched only when the
> timeout triggers, immediate need_resched will set the timeout = jiffies
> so it'll trigger immediatly, the timer interrupt will check it. The
> reschedule_idle on a non idle cpu will be the only one to set the
> timeout. Effectively I'm curious to see what will happen. Not all archs
> would need to check against it (the entry.S code is the main reader of
> need_resched), it'll be an hint only and idle() for sure must not check
> it at all. this would guarantee minimal timeslice reserved up to 1/HZ by
> setting the timeout to jiffies + 2 (jiffies + 1 would return a mean of
> 1/HZ/2 but the worst case would be ~0, in practice even + 1 would be
> enough) Does anybody think this could have a value? If yes I can make a
> quick hack to see what happens.

We probably don't need immediate action.  A good test case is an xterm,
I guess, and I noticed that od has an option for limiting its length.

So I propose that comparing the output of

time od -vN 100k /dev/zero

to that of

time od -vN 100k /dev/zero|dd obs=16k

in an xterm should provide some idea of the _typical_ overhead incurred
by an _application_ due to excessive context switching.

If you want to get really nasty, you can compare to
time od -vN 100k /dev/zero|dd obs=1

Note that in all of these cases it is by far the xterm that is
spending the lion's share of the processing time (so you need to take
a look at the _real_ time expended, since xterm will not be measured
by the time command).  The kernel people and the generating processes
will wash their hands of any guilt: "_we_ do our work efficiently and
give xterm all the time in the world to empty the pipe, and let it
take all the time it needs without pushing it".  Except that xterm
could work much more efficiently if one pushed it by piling things
into the pipe.

Another silly exercise is the following:

time od -v /dev/zero|dd obs=1

in an xterm.  Keep it running.  Now in a different term, enter

while true;do :;done

The above xterm will freeze while the interactive shell still has its
bonus, and then will start working again, at a much higher speed than
when it has an idle CPU at its disposal.

Kill off the busy loop, and the xterm gets slow again.

Come to think of it, for years I have witnessed the phenomenon of
"slow-start xterms" that get ragingly more efficient after several
screenfuls, once they _can't_ keep up on a busy system, on a large
variety of Unices with a single CPU.  It is one of those quirks one
does not think about too much.

So I repeat: I should think a fast-accumulating scheduling malus for a
synchronous context switch to a process woken by an event from a
still-runnable process would be appropriate.  If anybody can think of a
scheme that magically converges to a good tradeoff between the average
fill level of a pipe and the estimated processing overhead of the
application at the receiving end, that would of course be appreciated.

The aim would be that running an additional

  while true;do :;done

should not usually be able to speed up the _real_ time performance of
an unrelated task.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* RE: Scheduling problem with 2.4?
  2003-05-18  8:55                 ` Mike Galbraith
@ 2003-05-18 17:46                   ` David Schwartz
  2003-05-18 23:18                     ` Andrea Arcangeli
  2003-05-18 23:11                   ` Andrea Arcangeli
  2003-05-19  4:02                   ` Mike Galbraith
  2 siblings, 1 reply; 20+ messages in thread
From: David Schwartz @ 2003-05-18 17:46 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andrea Arcangeli, linux-kernel


> Is there any down-side to not preempting quite as often?  It seems like
> there should be a bandwidth gain.
>
>          -Mike

	The theoretical down-side is that interactivity might suffer a bit because
a process isn't scheduled quite as quickly. Yes, the less-often you preempt
a process, the faster the system will go in the sense of work done per unit
time. But those pesky users want their characters to echo quickly and the
mouse pointer to track their physical motions.

	Obviously, we must preempt when a process with a higher static priority
becomes ready to run. However, preempting based on dynamic priorities has
permitted time slices to be even longer, permitting a reduction in context
switches without sacrificing interactivity.

	I still believe, however, that a process should be 'guaranteed' some slice
of time every time it's scheduled unless circumstances make it impossible to
allow the process to continue running. IMO, the pendulum has swung too far
in favor of interactivity. Obviously, if the process faults, blocks, or a
process with higher static priority becomes ready to run, then we must
terminate the process' time slice early.

	DS




* Re: Scheduling problem with 2.4?
  2003-05-18  8:55                 ` Mike Galbraith
  2003-05-18 17:46                   ` David Schwartz
@ 2003-05-18 23:11                   ` Andrea Arcangeli
  2003-05-19  7:16                     ` Mike Galbraith
  2003-05-19  4:02                   ` Mike Galbraith
  2 siblings, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2003-05-18 23:11 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: David Schwartz, dak, linux-kernel

On Sun, May 18, 2003 at 10:55:17AM +0200, Mike Galbraith wrote:
> At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:
> 
> >> I see what you mean, but I still don't think it is a problem. If
> >> bandwidth matters you will have to use large writes and reads anyways,
> >> if bandwidth doesn't matter the number of ctx switches doesn't matter
> >> either and latency usually is way more important with small messages.
> >
> >> Andrea
> >
> >        This is the danger of pre-emption based upon dynamic priorities. 
> >You can
> >get cases where two processes each are permitted to make a very small 
> >amount
> >of progress in alternation. This can happen just as well with large writes
> >as small ones, the amount of data is irrelevent, it's the amount of CPU 
> >time
> >that's important, or to put it another way, it's how far a process can get
> >without suffering a context switch.
> >
> >        I suggest that a process be permitted to use up at least some 
> >portion of
> >its timeslice exempt from any pre-emption based solely on dynamic
> >priorities.
> 
> Cool.
> 
> Thank you for the spiffy idea.  I implemented this in my (hack/chop) mm5 
> tree in about 30 seconds, and it works just fine.  Very simple 
> time_after(jiffies, p->last_run + MIN_TIMESLICE) checks in wake_up_cpu() 

If I understand it correctly, what you did is different (in
functionality) from what I described (and what I described in the last
email certainly takes more than 30 seconds no matter how smart you
implement it ;).  I mean, you lose the whole "wakeup" information.
Yes, that will fix it too, like deleting the need_resched = 1 after the
check on curr->prio entirely, but while it's that simple, you don't
only guarantee the minimum timeslice, you let the current task keep
running even after it has expired the minimum timeslice.  That will
most certainly hurt interactivity way too much; I don't think it's an
option, unless you also want to trim the timeslice length
significantly.  The only reason we can take these long timeslices is
the interactivity hints; we have always had those in Linux, in all
versions.  If you start to ignore them, things should not go too well;
even throughput can decrease in a multithreaded environment due to the
slower delivery of events.

Delaying an interprocess message for 1msec (or even 10msec) [i.e. 1/HZ]
may not be noticeable, but delaying it for a whole timeslice will
certainly be noticeable; that's an order of magnitude bigger delay.

Actually, besides the way I described yesterday (which would require
arch asm changes) to experiment with the "minimum timeslice guarantee",
it could also be implemented by moving timeslice - min_timeslice into a
rest_timeslice field if jiffies - last_run < MIN_TIMESLICE and
rest_timeslice is zero, and trimming the timeslice down to
min_timeslice.  The next schedule would then put rest_timeslice back
into timeslice.  This is along the lines of what you implemented, but
it also guarantees that the higher dyn-prio task will be scheduled
after this min_timeslice (to still provide ~the same interactive
behaviour, which is a must IMHO ;).  This will be a bit more difficult
than the need_resched secondary field, but it's probably conceptually
cleaner, since it restricts the algorithm to the scheduler, keeping the
asm fast paths small and simple.
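
As a toy model of that variant (not a patch; the field names are the
ones used above):

/* Toy model only of the rest_timeslice idea. */
#define MIN_TIMESLICE 1          /* jiffies; placeholder value */

struct task {
    unsigned long last_run;
    unsigned int timeslice;
    unsigned int rest_timeslice;
};

/* where a higher dyn-prio wakeup would otherwise preempt curr: let it
   run on for MIN_TIMESLICE and park the rest of its timeslice */
static void defer_preempt(struct task *curr, unsigned long jiffies)
{
    if (jiffies - curr->last_run < MIN_TIMESLICE &&
        curr->rest_timeslice == 0 &&
        curr->timeslice > MIN_TIMESLICE) {
        curr->rest_timeslice = curr->timeslice - MIN_TIMESLICE;
        curr->timeslice = MIN_TIMESLICE;
    }
}

/* at the next schedule() of this task, give the parked part back */
static void restore_timeslice(struct task *curr)
{
    curr->timeslice += curr->rest_timeslice;
    curr->rest_timeslice = 0;
}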

> plus the obvious dinky change to schedule(), and it worked 
> instantly.  Hmm.. I wonder if this might help some of the other scheduler 
> (seemingly) bad behavior as well.
> 
> Is there any down-side to not preempting quite as often?  It seems like 
> there should be a bandwidth gain.

It's certainly on the right track for this issue, in the sense that
there will be the wanted bandwidth gain in the 1-byte-write-to-a-pty
case ;) (like dropping the need_resched = 1, which even goes down to 5
seconds and a one-liner patch).  However, I don't think the 30-second
modification alone is a generic solution; delaying the reschedule until
min_timeslice may be acceptable instead.

Andrea


* Re: Scheduling problem with 2.4?
  2003-05-18 17:46                   ` David Schwartz
@ 2003-05-18 23:18                     ` Andrea Arcangeli
  2003-05-19  2:59                       ` David Schwartz
  0 siblings, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2003-05-18 23:18 UTC (permalink / raw)
  To: David Schwartz; +Cc: Mike Galbraith, linux-kernel

On Sun, May 18, 2003 at 10:46:24AM -0700, David Schwartz wrote:
> 
> > Is there any down-side to not preempting quite as often?  It seems like
> > there should be a bandwidth gain.
> >
> >          -Mike
> 
> 	The theoretical down-side is that interactivity might suffer a bit because
> a process isn't scheduled quite as quickly. Yes, the less-often you preempt
> a process, the faster the system will go in the sense of work done per unit
> time. But those pesky users want their characters to echo quickly and the
> mouse pointer to track their physical motions.
> 
> 	Obviously, we must preempt when a process with a higher static priority

the static priority is the same for all tasks in the system unless you
use nice. I believe linux should do well without forcing the user to set
the GUI at high prio etc..

> becomes ready to run. However, preempting based on dynamic priorities
> has
> permitted time slices to be even longer, permitting a reduction in context

the dyn prio doesn't change the timeslice size, it only chooses whether
to wake up a task immediately or not; timeslices are calculated as a
function of the static priority, not the dyn prio.  Those interactive
tasks use nearly zero of their timeslice anyway; the size of the
timeslice has little meaning for interactive tasks.  What matters is
"when" they run, and that's controlled by the dyn prio.

> switches without sacrificing interactivity.
> 
> 	I still believe, however, that a process should be 'guaranteed' some slice
> of time every time it's scheduled unless circumstances make it impossible to
> allow the process to continue running. IMO, the pendulum has swung too far
> in favor of interactivity. Obviously, if the process faults, blocks, or a
> process with higher static priority becomes ready to run, then we must

If the static priority is much higher, the immediate switch already
happens.  But if we choose to guarantee a min_timeslice to tasks to
avoid the context-switch flood in your testcase, then the higher static
prio tasks will also have to wait for this min_timeslice.  The issue is
no different with a higher static prio, or you will complain next time
that you get a context-switch flood from a dd bs=1 writing into a pty
connected to an xterm running at -20 prio.

> terminate the process' time slice early.
> 
> 	DS
> 
> 


Andrea


* RE: Scheduling problem with 2.4?
  2003-05-18 23:18                     ` Andrea Arcangeli
@ 2003-05-19  2:59                       ` David Schwartz
  0 siblings, 0 replies; 20+ messages in thread
From: David Schwartz @ 2003-05-19  2:59 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Mike Galbraith, linux-kernel


> > 	Obviously, we must preempt when a process with a higher
> > static priority

> the static priority is the same for all tasks in the system unless you
> use nice. I believe linux should do well without forcing the user to set
> the GUI at high prio etc..

	Exactly. Pre-emption based on dynamic priorities is an "if it helps"
feature and not one required by the semantics. On the other hand,
pre-emption based on static priorities is pretty much required and not doing
that would certainly violate the principle of least astonishment.

> > becomes ready to run. However, preempting based on dynamic priorities
> > has
> > permitted time slices to be even longer, permitting a reduction
> > in context

> the dyn prio doesn't change the timeslice size, it only chooses if to
> wakeup a task immediatly or not, timeslices are calculated in function
> of the static priority, not the dyn prio. Those interactive tasks uses
> nearly zero of their timeslice anyways, the size of the timeslice has
> little meaning for interactive tasks, what matters is "when" they run
> and that's controlled by the dyn prio.

	You say "the dyn prio doesn't change the timeslice size", but that's not
true. If a process with a low dynamic priority writes to a FIFO, its
timeslice will be cut short, whereas one with a higher dynamic priority
would enjoy a longer timeslice. Yes, priority doesn't affect the *intended*
timeslice, but it can affect the effective timeslice.

> > 	I still believe, however, that a process should be
> > 'guaranteed' some slice
> > of time every time it's scheduled unless circumstances make it
> > impossible to
> > allow the process to continue running. IMO, the pendulum has
> > swung too far
> > in favor of interactivity. Obviously, if the process faults,
> > blocks, or a
> > process with higher static priority becomes ready to run, then we must

> if the static priority is much higher the immediate switch already
> happens.

	As it certainly should. The maximum amount of time we could justify
delaying a task with a higher static priority in the name of better 'bulk
data' performance is so small that it's probably not feasible to do it at
all.

> But if we choose to guarantee a min_timeslice to tasks to avoid
> the ctx switch flood for your testcase, then also the higher static prio
> tasks will have to wait for this min_timeslice.

	No. I don't think that's appropriate. Higher static priority is an
explicit decision made by users for the specific reason of getting a faster
response. Under no circumstances should an NTP process that received a UDP
packet be prevented from getting that packet as quickly as possible.

> The issue is no
> different with higher static prio, or you will complain next time that
> you get a ctx switch flood from a dd bs=1 writing into a pty connected
> to an xterm running at -20 prio.

	With static priorities, you can always tell people "Don't do that then, you
got what you asked for. The system doesn't know what you 'really want'."
With dynamic priorities, however, this is not the case.

	I'm not sure what the best solution is. But I think it's along the lines
of deferring a deschedule based purely on dynamic priorities if the task has
used only a small fraction of its timeslice. Perhaps, for example, a normal
timeslice should be four jiffies. If a reschedule would normally occur during
the first jiffy, and it's based purely on dynamic priorities, we set a flag
which we check when the first jiffy ends.

	Now maybe in some cases where the static priorities only differ slightly,
we should defer even based on static priorities. Perhaps immediate
pre-emption during the beginning of a timeslice should only occur when the
winning process has an elevated priority (and not just because the losing
process has a reduced one).

	I don't really know. But I think what we're doing now is fundamentally
wrong because processes should be allowed to use up their timeslices when
possible.

	DS




* RE: Scheduling problem with 2.4?
  2003-05-18  8:55                 ` Mike Galbraith
  2003-05-18 17:46                   ` David Schwartz
  2003-05-18 23:11                   ` Andrea Arcangeli
@ 2003-05-19  4:02                   ` Mike Galbraith
  2 siblings, 0 replies; 20+ messages in thread
From: Mike Galbraith @ 2003-05-19  4:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Schwartz, Andrea Arcangeli, dak

At 10:55 AM 5/18/2003 +0200, Mike Galbraith wrote:
>At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:
>
>>         I suggest that a process be permitted to use up at least some 
>> portion of
>>its timeslice exempt from any pre-emption based solely on dynamic
>>priorities.
>
>
>Is there any down-side to not preempting quite as often?  It seems like 
>there should be a bandwidth gain.

(my reply was supposed to be off-list, but..)

The answer appears to be yes, there are down-sides.  Instead of the
expected throughput gain, I got a loss :-/ which makes sense from the
cache side, I suppose.  While it was instant gratification for the pipe
testcase, as a general solution it leaves something to be desired... at
least when implemented in its simplest form.  Oh well.

(hmm.. maybe I should go stare at the cache decay stuff some more.  later)

         -Mike 



* Re: Scheduling problem with 2.4?
  2003-05-18 23:11                   ` Andrea Arcangeli
@ 2003-05-19  7:16                     ` Mike Galbraith
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Galbraith @ 2003-05-19  7:16 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David Schwartz, dak, linux-kernel

At 01:11 AM 5/19/2003 +0200, Andrea Arcangeli wrote:
>On Sun, May 18, 2003 at 10:55:17AM +0200, Mike Galbraith wrote:
> > At 05:16 PM 5/17/2003 -0700, David Schwartz wrote:
> >
> > >> I see what you mean, but I still don't think it is a problem. If
> > >> bandwidth matters you will have to use large writes and reads anyways,
> > >> if bandwidth doesn't matter the number of ctx switches doesn't matter
> > >> either and latency usually is way more important with small messages.
> > >
> > >> Andrea
> > >
> > >        This is the danger of pre-emption based upon dynamic priorities.
> > >You can
> > >get cases where two processes each are permitted to make a very small
> > >amount
> > >of progress in alternation. This can happen just as well with large writes
> > >as small ones, the amount of data is irrelevent, it's the amount of CPU
> > >time
> > >that's important, or to put it another way, it's how far a process can get
> > >without suffering a context switch.
> > >
> > >        I suggest that a process be permitted to use up at least some
> > >portion of
> > >its timeslice exempt from any pre-emption based solely on dynamic
> > >priorities.
> >
> > Cool.
> >
> > Thank you for the spiffy idea.  I implemented this in my (hack/chop) mm5
> > tree in about 30 seconds, and it works just fine.  Very simple
> > time_after(jiffies, p->last_run + MIN_TIMESLICE) checks in wake_up_cpu()
>
>If I understand well, what you did is different (in functionalty) from
>what I described (and what I described in the last email certainly takes
>more than 30 seconds no matter how smart you implement it ;). I mean,

The p->last_run is a typo; it's effectively current->last_run, with
last_run also being updated upon task switch so you can see how long it
has had the cpu.  That can be done much quicker than 30 seconds by
someone who can type with fingers instead of thumbs ;-)

>you lose the whole "wakeup" information, yeah that will fix
>it too like deleting the need_resched =1 after the check on the
>curr->prio enterely, but while it's so simple you you don't only
>guarantee the miniumum timeslice, but you let the current task running
>even after it expired the minimum timeslice.  That will most certainly
>hurt interactivity way too much, I don't think it's an option, unless
>you want to trim significantly the timeslice length too. The only reason
>we can take these long timeslices are the interactivity hints, we always
>had those in linux, all versions. If you start to ignore it, things
>should not do too well, even throughput can decrease in a multithread
>environment due the slower delivery of events.

Throughput did decrease.  I was thinking that there was likely to be
another event before my ignoring the preempt could matter much.  As
mentioned privately, I guaranteed that no task could hold the cpu for
more than 50ms max, so on the next schedule it would get the cpu (if
its prio was high enough), but the test results weren't encouraging.

>Delaying a interprocess message for 1msec (or even 10msec) [i.e. 1/HZ]
>may not be noticeable, but delaying it for a whole timeslice will
>certainly be noticeable, that's an order of magnitude bigger delay.
>
>Actually besides the way I described yesterday (that would require arch
>asm changes) to experiment the "miniumum timeslice guarantee", it could
>also be implemented by moving the timeslice-min_timeslice in a
>rest_timeslice field if jiffies - last_run  < MIN_TIMESLICE and if
>rest_timeslice is zero, and trimming the timeslice down to
>min_timeslice. Then the next schedule would put rest_timeslice back in
>timeslice. This is on the lines of what you implemented but it also
>guaranteees that the higher dyn-prio task will be scheduled after this
>min_timeslice (to still provide the ~same interactive behaviour, which
>is a must IMHO ;) This will be a bit more difficult of the need_resched
>secondary field, but it's probably conceptually cleaner, since it
>restricts the algorithm in the scheduler keeping the asm fast paths
>small and simple.

I'd really like to try a deferred preempt.  I don't know enough (yet 
anyway;) to not create spectacular explosions with that idea though.  I'll 
go ponder your idea instead.

         -Mike 

