* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 2:44 ` Andrea Arcangeli
@ 2002-11-10 3:56 ` Matt Reppert
2002-11-10 9:58 ` Con Kolivas
2002-11-10 19:32 ` Rik van Riel
2 siblings, 0 replies; 47+ messages in thread
From: Matt Reppert @ 2002-11-10 3:56 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: conman, linux-kernel, marcelo
Purely for information's sake ...
On Sun, 10 Nov 2002 03:44:51 +0100
Andrea Arcangeli <andrea@suse.de> wrote:
> On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
> > xtar_load:
> > Kernel [runs] Time CPU% Loads LCPU% Ratio
> > 2.4.18 [3] 150.8 49 2 8 2.11
> > 2.4.19 [1] 132.4 55 2 9 1.85
> > 2.4.19-ck9 [2] 138.6 58 2 11 1.94
> > 2.4.20-rc1 [3] 180.7 40 3 8 2.53
> > 2.4.20-rc1aa1 [3] 166.6 44 2 7 2.33
>
> these numbers don't make sense. Can you describe what xtar_load is
> doing?
Repeatedly extracting tars while compiling kernels.
Andrea, I think you mixed up what the descriptions go to. They come *under*
the numbers, not above, commenting on only the test directly above them.
(e.g., "First noticeable difference" is about "xtar_load")
Yes, these are kind of meaningless without descriptions. You can find those
at the webpage, http://contest.kolivas.net/ ... This will make more sense with
that; of course, how meaningful it is is always up for debate :)
All of these benchmark the kernel compile while doing something else in the
background.
> In short if somebody runs fast in something like this:
>
> cp /dev/zero . & time cp bigfile /dev/null
>
> he will win your whole contest too.
That's practically one of the loads, actually.
"IO Load - copies /dev/zero continually to a file the size of
the physical memory."
Which dd's blocks the size of MemTotal (from /proc/meminfo) to a file
in /tmp, in a shell-script loop, for as long as the kernel compile is running.
> please show the diff between
> 2.4.19-ck9/drivers/block/{ll_rw_blk,elevator}.c and
> 2.4.19/drivers/block/...
elevator.c is untouched, ll_rw_blk.c follows. The full patch is here:
http://members.optusnet.com.au/con.man/ck9_2.4.19.patch.bz2
diff -bBdaurN linux-2.4.19/drivers/block/ll_rw_blk.c linux-2.4.19-ck9/drivers/block/ll_rw_blk.c
--- linux-2.4.19/drivers/block/ll_rw_blk.c	2002-08-03 13:14:45.000000000 +1000
+++ linux-2.4.19-ck9/drivers/block/ll_rw_blk.c	2002-10-14 17:21:18.000000000 +1000
@@ -1112,6 +1112,9 @@
if (!test_bit(BH_Lock, &bh->b_state))
BUG();
+ if (buffer_delay(bh) || !buffer_mapped(bh))
+ BUG();
+
set_bit(BH_Req, &bh->b_state);
set_bit(BH_Launder, &bh->b_state);
@@ -1132,6 +1135,7 @@
kstat.pgpgin += count;
break;
}
+ conditional_schedule();
}
/**
@@ -1270,7 +1274,8 @@
req->errors = 0;
if (!uptodate)
- printk("end_request: I/O error, dev %s (%s), sector %lu\n",
+ printk(KERN_INFO "end_request: I/O error, dev %s (%s),"
+ " sector %lu\n",
kdevname(req->rq_dev), name, req->sector);
if ((bh = req->bh) != NULL) {
.
> Either that or change the name of your project,
It's called "contest" because it's a reasonably arbitrary test of what
the kernel does under some circumstances that's put out by Con Kolivas.
Con's test. Contest. It's not supposed to actually mean anything.
Matt
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 2:44 ` Andrea Arcangeli
2002-11-10 3:56 ` Matt Reppert
@ 2002-11-10 9:58 ` Con Kolivas
2002-11-10 10:06 ` Jens Axboe
2002-11-10 16:20 ` Andrea Arcangeli
2002-11-10 19:32 ` Rik van Riel
2 siblings, 2 replies; 47+ messages in thread
From: Con Kolivas @ 2002-11-10 9:58 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux kernel mailing list, marcelo
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
First some explanation.
Contest (http://contest.kolivas.net) is obviously not a throughput-style
benchmark. The benchmark simply uses userland loads known to slow down the
machine (like writing large files) and sees how much longer kernel
compilation takes (make -j4 bzImage on a uniprocessor). Thus it never claims to
be any sort of comprehensive system benchmark; it only serves to give an idea
of the system's ability to respond in the presence of different loads, in
terms end users can understand.
>On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
>> xtar_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 150.8 49 2 8 2.11
>> 2.4.19 [1] 132.4 55 2 9 1.85
>> 2.4.19-ck9 [2] 138.6 58 2 11 1.94
>> 2.4.20-rc1 [3] 180.7 40 3 8 2.53
>> 2.4.20-rc1aa1 [3] 166.6 44 2 7 2.33
>
>these numbers don't make sense. Can you describe what xtar_load is
>doing?
OK, xtar_load starts extracting a large tar (a kernel tree) in the background,
then tries to compile a kernel. Time is how long kernel compilation takes,
and CPU% is how much cpu the make -j4 bzImage uses. Loads is how many times it
successfully extracts the tar, and LCPU% is the cpu% returned by the "tar x
linux.tar" command. Ratio is the ratio of this kernel compilation time to the
reference (2.4.18 with no load).
>> First noticeable difference. With repeated extracting of tars while
>> compiling kernels 2.4.20-rc1 seems to be slower and aa1 curbs it just a
>> little.
This explanation said simply that kernel compilation with the same tar
extracting load takes longer in 2.4.20-rc1 compared with 2.4.19, but that the
aa addons sped it up a bit.
>> io_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 474.1 15 36 10 6.64
>> 2.4.19 [3] 492.6 14 38 10 6.90
>> 2.4.19-ck9 [2] 140.6 49 5 5 1.97
>> 2.4.20-rc1 [2] 1142.2 6 90 10 16.00
>> 2.4.20-rc1aa1 [1] 1132.5 6 90 10 15.86
>
>What are you benchmarking, tar or the kernel compile? I think the
>latter. That's the elevator and the size of the I/O queue here. Nothing
>else. hacks like read-latency aren't very nice in particular with
>async-io aware apps. If this improvement in ck9 was achieved decreasing
>the queue size it'll be interesting to see how much the sequential I/O
>is slowed down, it's very possible we've too big queues for some device.
>
>> Well this is interesting. 2.4.20-rc1 seems to have improved its ability
>> to do IO work. Unfortunately it is now busy starving the scheduler in the
>> mean time, much like the 2.5 kernels did before the deadline scheduler was
>> put in.
>>
>> read_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 102.3 70 6 3 1.43
>> 2.4.19 [2] 134.1 54 14 5 1.88
>> 2.4.19-ck9 [2] 77.4 85 11 9 1.08
>> 2.4.20-rc1 [3] 173.2 43 20 5 2.43
>> 2.4.20-rc1aa1 [3] 150.6 51 16 5 2.11
>
>What is busy starving the scheduler? This sounds like it's again just an
>elevator benchmark. I don't buy your scheduler claims; give more
>explanations or I'll take it as vapourware wording. I very much doubt
>you can find any single problem in the scheduler rc1aa2 or that the
>scheduler in rc1aa1 has a chance to run slower than the one of 2.4.19 in
>a I/O benchmark, ok it still misses the numa algorithm, but that's not a
>bug, just a missing feature and it'll soon be fixed too and it doesn't
>matter for normal smp non-numa machines out there.
OK, I fully retract the statement. I should not pass judgement on what part of
the kernel has changed the benchmark results; I'll just describe what the
results say. Note however this comment was centred on the results of io_load
above. Put simply: if I am writing a large file and then try to compile the
kernel (make -j4 bzImage), it is 16 times slower.
>> mem_load:
>> Kernel [runs] Time CPU% Loads LCPU% Ratio
>> 2.4.18 [3] 103.3 70 32 3 1.45
>> 2.4.19 [3] 100.0 72 33 3 1.40
>> 2.4.19-ck9 [2] 78.3 88 31 8 1.10
>> 2.4.20-rc1 [3] 105.9 69 32 2 1.48
>> 2.4.20-rc1aa1 [1] 106.3 69 33 3 1.49
>
>again ck9 is faster because of elevator hacks ala read-latency.
>
>in short your whole benchmark seems all about interactivity of reads
>during write flood. That's the read-latency thing or whatever else you
>could do to ll_rw_block.c.
>
>In short if somebody runs fast in something like this:
>
> cp /dev/zero . & time cp bigfile /dev/null
>
>he will win your whole contest too.
>
>please show the difff between
>2.4.19-ck9/drivers/block/{ll_rw_blk,elevator}.c and
>2.4.19/drivers/block/...
I think Matt addressed this issue.
>All the difference is there and it will hurt you badly if you do
>async-io benchmarks, and possibly dbench too. So you should always
>accompany your benchmark with async-io simultaneous read/write bandwidth
>and dbench, or I could always win your contest by shipping a very bad
>kernel. Either that or change the name of your project; if somebody wins
>this contest that's probably a bad I/O scheduler in many other aspects,
>some of the reasons I didn't merge read-latency from Andrew.
The name is meaningless and based on my name. Had my name been John it would
be johntest.
I regret ever including the -ck (http://kernel.kolivas.net) results. The
purpose of publishing these results was to compare 2.4.20-rc1/aa1 with
previous kernels. As some people are interested in the results of the ck
patchset I threw them in as well. -ck is a patchset with desktop users in
mind and is simply a merged patch of O(1), preempt, low latency and compressed
caching. If it sacrifices throughput in certain areas to maintain system
responsiveness then so be it. I'll look into adding other loads to contest as
you suggested, but I'm not going to add basic throughput benchmarks. There
are plenty of tools for this already.
I've done some ordinary dbench-quick benchmarks of ck9 and 2.4.20-rc1aa1 at
the OSDL http://www.osdl.org/stp
ck10_cc is the sum of the patches that make up ck9, so it is the same thing.
ck10_cc: http://khack.osdl.org/stp/7005/
2.4.20-rc1-aa1: http://khack.osdl.org/stp/7006/
Summary (dbench clients vs. throughput in MB/s):
2420rc1aa1:
1 117.5
4 114.002
7 114.643
10 114.818
13 109.478
16 109.817
19 103.692
22 103.678
25 105.478
28 93.1296
31 87.0544
34 84.2668
37 81.0731
40 75.4605
43 77.2198
46 69.0448
49 66.7997
52 61.5987
55 60.2009
58 60.1531
61 58.3121
64 55.7127
67 56.2714
70 53.6214
73 52.2704
76 52.3631
79 49.7146
82 48.2406
85 48.1078
88 42.8405
91 42.4929
94 42.3958
97 43.5729
100 45.8318
ck10_cc:
1 116.239
4 115.075
7 114.414
10 114.166
13 109.129
16 109.403
19 106.601
22 97.7714
25 93.7279
28 95.0076
31 92.5594
34 88.5938
37 89.7026
40 86.9904
43 85.1783
46 82.7975
49 79.7348
52 80.2497
55 79.2346
58 76.6632
61 75.9002
64 75.8677
67 75.7318
70 73.2223
73 73.7652
76 72.9277
79 72.5244
82 71.6753
85 71.3161
88 70.9735
91 69.5539
94 69.602
97 67.2016
100 67.158
Regards,
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
iD8DBQE9zi3UF6dfvkL3i1gRAkWmAJ4zX7gyUjzKH7eCNneyNRWLPGtCeACff9A7
Bn8LHqZw46CrGauuWTldDnQ=
=0WMB
-----END PGP SIGNATURE-----
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 9:58 ` Con Kolivas
@ 2002-11-10 10:06 ` Jens Axboe
2002-11-10 16:21 ` Andrea Arcangeli
2002-11-10 16:20 ` Andrea Arcangeli
1 sibling, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2002-11-10 10:06 UTC (permalink / raw)
To: Con Kolivas; +Cc: Andrea Arcangeli, linux kernel mailing list, marcelo
On Sun, Nov 10 2002, Con Kolivas wrote:
> >> Well this is interesting. 2.4.20-rc1 seems to have improved its ability
> >> to do IO work. Unfortunately it is now busy starving the scheduler in the
> >> mean time, much like the 2.5 kernels did before the deadline scheduler was
> >> put in.
> >>
> >> read_load:
> >> Kernel [runs] Time CPU% Loads LCPU% Ratio
> >> 2.4.18 [3] 102.3 70 6 3 1.43
> >> 2.4.19 [2] 134.1 54 14 5 1.88
> >> 2.4.19-ck9 [2] 77.4 85 11 9 1.08
> >> 2.4.20-rc1 [3] 173.2 43 20 5 2.43
> >> 2.4.20-rc1aa1 [3] 150.6 51 16 5 2.11
> >
> >What is busy starving the scheduler? This sounds like it's again just an
> >elevator benchmark. I don't buy your scheduler claims; give more
> >explanations or I'll take it as vapourware wording. I very much doubt
> >you can find any single problem in the scheduler rc1aa2 or that the
> >scheduler in rc1aa1 has a chance to run slower than the one of 2.4.19 in
> >a I/O benchmark, ok it still misses the numa algorithm, but that's not a
> >bug, just a missing feature and it'll soon be fixed too and it doesn't
> >matter for normal smp non-numa machines out there.
>
> Ok I fully retract the statement. I should not pass judgement on what part of
> the kernel has changed the benchmark results, I'll just describe what the
> results say. Note however this comment was centred on the results of io_load
> above. Put simply : if I am writing a large file and then try to compile the
> kernel (make -j4 bzImage) it is 16 times slower.
In Con's defence, I think he meant io scheduler starvation and not
process scheduler starvation. Otherwise the following wouldn't make a
lot of sense:
"Unfortunately it is now busy starving the scheduler in the mean time,
much like the 2.5 kernels did before the deadline scheduler was put in."
And indeed, 2.5 kernels had the exact same io scheduler algorithm
as 2.4.20-rc has, so this makes perfect sense from the io scheduler
starvation POV.
There are inherent problems in the 2.4 io scheduler for these types of
workloads; the ugly and nausea-inducing read-latency hack that akpm did
attempts to work around that.
Andrea is obviously talking about the process scheduler, note the numa
reference among other things.
--
Jens Axboe
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 10:06 ` Jens Axboe
@ 2002-11-10 16:21 ` Andrea Arcangeli
0 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-10 16:21 UTC (permalink / raw)
To: Jens Axboe; +Cc: Con Kolivas, linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 11:06:56AM +0100, Jens Axboe wrote:
> Andrea is obviously talking about process scheduler, note the numa
exactly, sorry.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 9:58 ` Con Kolivas
2002-11-10 10:06 ` Jens Axboe
@ 2002-11-10 16:20 ` Andrea Arcangeli
1 sibling, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-10 16:20 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 08:58:43PM +1100, Con Kolivas wrote:
> >> to do IO work. Unfortunately it is now busy starving the scheduler in the
> >> mean time, much like the 2.5 kernels did before the deadline scheduler was
> >> put in.
> Ok I fully retract the statement. I should not pass judgement on what part of
> the kernel has changed the benchmark results, I'll just describe what the
actually Wil pointed out to me privately that you meant the I/O scheduler; you
just never mentioned the name "I/O" so I mistook it for the process
scheduler, sorry (I should have understood from the "deadline" adjective).
What you said makes sense once parsed as I/O scheduler, of course.
Next week I will check the changes in your tree and I'll try to
reproduce the dbench numbers on my 4-way with very high I/O and disk
bandwidth, and I'll let you know the numbers I get here. It may simply be
the different elevator default values and fixes in 2.4.20rc, but I
recall that you still win compared to -r0 somewhere (according to your
numbers). It's pointless on my part to discuss this further now until
I have the whole picture of the changes you did, the whole picture of the
contest source code, and until I can reproduce every single result you
posted here. Hope to be able to comment further ASAP.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 2:44 ` Andrea Arcangeli
2002-11-10 3:56 ` Matt Reppert
2002-11-10 9:58 ` Con Kolivas
@ 2002-11-10 19:32 ` Rik van Riel
2002-11-10 20:10 ` Andrea Arcangeli
2 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2002-11-10 19:32 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Con Kolivas, linux kernel mailing list, marcelo
On Sun, 10 Nov 2002, Andrea Arcangeli wrote:
> On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
> > 2.4.19-ck9 [2] 78.3 88 31 8 1.10
> > 2.4.20-rc1 [3] 105.9 69 32 2 1.48
> > 2.4.20-rc1aa1 [1] 106.3 69 33 3 1.49
>
> again ck9 is faster because of elevator hacks ala read-latency.
>
> in short your whole benchmark seems all about interactivity of reads
> during write flood.
Which is a very important thing. You have to keep in mind that
reads and writes are fundamentally different operations since
the majority of the writes happen asynchronously while the program
continues running, while the majority of reads are synchronous and
your program will block while the read is going on.
Because of this it is also much easier to do writes in large chunks
than it is to do reads in large chunks, because with writes you
know exactly what data you're going to write while you can't know
which data you'll need to read next.
> All the difference is there and it will hurt you badly if you do
> async-io benchmarks,
Why would read-latency hurt the async-io benchmark ?
Whether the IO is synchronous or asynchronous shouldn't matter much,
if you do a read you still need to wait for the data to be read in
before you can process it while the data you write is still in memory
and can be used over and over again.
What is the big difference with asynchronous IO that removes the big
asymmetry between reads and writes?
> kernel. Either that or change the name of your project; if somebody wins
> this contest that's probably a bad I/O scheduler in many other aspects,
> some of the reasons I didn't merge read-latency from Andrew.
Any reasons in particular or just a gut feeling ?
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"october@surriel.com">october@surriel.com</a>
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 19:32 ` Rik van Riel
@ 2002-11-10 20:10 ` Andrea Arcangeli
2002-11-10 20:52 ` Andrew Morton
2002-11-10 20:56 ` Andrew Morton
0 siblings, 2 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-10 20:10 UTC (permalink / raw)
To: Rik van Riel; +Cc: Con Kolivas, linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 05:32:44PM -0200, Rik van Riel wrote:
> On Sun, 10 Nov 2002, Andrea Arcangeli wrote:
> > On Sat, Nov 09, 2002 at 01:00:19PM +1100, Con Kolivas wrote:
>
> > > 2.4.19-ck9 [2] 78.3 88 31 8 1.10
> > > 2.4.20-rc1 [3] 105.9 69 32 2 1.48
> > > 2.4.20-rc1aa1 [1] 106.3 69 33 3 1.49
> >
> > again ck9 is faster because of elevator hacks ala read-latency.
> >
> > in short your whole benchmark seems all about interactivity of reads
> > during write flood.
>
> Which is a very important thing. You have to keep in mind that
sure, this is why I fixed the potential ~infinite starvation in the 2.3
elevator.
> reads and writes are fundamentally different operations since
> the majority of the writes happen asynchronously while the program
> continues running, while the majority of reads are synchronous and
> your program will block while the read is going on.
>
> Because of this it is also much easier to do writes in large chunks
> than it is to do reads in large chunks, because with writes you
> know exactly what data you're going to write while you can't know
> which data you'll need to read next.
>
> > All the difference is there and it will hurt you badly if you do
> > async-io benchmarks,
>
> Why would read-latency hurt the async-io benchmark ?
because only with async-io is it possible to keep the I/O pipeline
filled by reads. readahead only allows doing read-I/O in large chunks;
it has no way to fill the pipeline.
In fact the size of the request queue is the fundamental factor that
controls read latency during heavy writes without special heuristics ala
read-latency.
In short, without async-io there is no way at all that a read application
can read at a decent speed during a write flood, unless you have special
hacks in the elevator ala read-latency that allow reads to enter at the
front of the queue, which reduces the chance to reorder reads and
potentially decreases performance on an async-io benchmark even in the
presence of seeks.
> Whether the IO is synchronous or asynchronous shouldn't matter much,
the fact the I/O is sync or async makes the whole difference. with sync
reads the vmstat line in the read column will be always very small
compared to the write column under a write flood. This can be fixed either:
1) with hacks in the elevator ala read-latency that are not generic and
could decrease performance of other workloads
2) reducing the size of the I/O queue, which may decrease performance
also with seeks since it decreases the probability of reordering in
the elevator
3) having the app use async-io for reads, allowing it to keep the
I/O pipeline full of reads
readahead, at least in its current form, only makes sure that a 512k
command will be submitted instead of a 4k command; that's not remotely
comparable to writeback, which floods the I/O queue constantly with
several dozen or hundred mbytes of data. Increasing readahead is also
risky; 512k is kind of obviously safe in all circumstances since it's a
single dma command anyway (and 128k for ide).
I'm starting to benchmark 2.4.20rc1aa against 2.4.19-ck9 under dbench
right now (then I'll run the contest). I can't imagine how it can be
that much faster under dbench; -aa is almost as fast as 2.5 in dbench
and much faster than 2.4 mainline, so if 19-ck9 is really that much
faster than -aa then it is likely much faster than 2.5 too. I definitely
need to examine in full detail what's going on with 2.4.19-ck9. Once I
understand it I will let you know. For instance I know Randy's
numbers are fully reliable and I trust them:
http://home.earthlink.net/~rwhron/kernel/bigbox.html
I find Randy's numbers extremely useful. Of course it's great to see also
the responsiveness side of a kernel, but dbench isn't normally a
benchmark that needs responsiveness, quite the opposite: the more unfair
the behaviour of the vm and elevator, the faster dbench usually runs,
because with unfairness dbench tends to run almost single threaded, which
maximizes the writeback effect, etc. So if 2.4.19-ck9 is so
much faster under dbench and so much more responsive with the contest
(which seems to benchmark basically only the read latency under writeback
flushing flood), then it is definitely worthwhile to produce a patch
against mainline that generates this boost. If it has the preemption
patch, that could hardly explain it either; the improvement from 45 MB/sec
to 65 MB/sec is quite a huge difference, and we have all the
schedule points in submit_bh too, so it's quite unlikely that
preempt could explain that difference. It might against mainline, but
not against my tree.
Anyway this is all guessing; once I've checked the code and
reproduced the numbers, things should be much more clear.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 20:10 ` Andrea Arcangeli
@ 2002-11-10 20:52 ` Andrew Morton
2002-11-10 21:05 ` Rik van Riel
2002-11-10 20:56 ` Andrew Morton
1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2002-11-10 20:52 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
Andrea Arcangeli wrote:
>
> > Whether the IO is synchronous or asynchronous shouldn't matter much,
>
> the fact the I/O is sync or async makes the whole difference. with sync
> reads the vmstat line in the read column will be always very small
> compared to the write column under a write flood. This can be fixed either:
>
> 1) with hacks in the elevator ala read-latency that are not generic and
> could decrease performance of other workloads
read-latency will only do the front-insertion if it was unable to find a
merge or insert on the tail-to-head search.
And the problem it desperately addresses is severe.
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 20:52 ` Andrew Morton
@ 2002-11-10 21:05 ` Rik van Riel
2002-11-11 1:54 ` Andrea Arcangeli
0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2002-11-10 21:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Con Kolivas, linux kernel mailing list, marcelo
On Sun, 10 Nov 2002, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > > Whether the IO is synchronous or asynchronous shouldn't matter much,
> >
> > the fact the I/O is sync or async makes the whole difference. with sync
> > reads the vmstat line in the read column will be always very small
> > compared to the write column under a write flood. This can be fixed either:
> >
> > 1) with hacks in the elevator ala read-latency that are not generic and
> > could decrease performance of other workloads
It'd be nice if you specified which kind of workloads. Generic
handwaving is easy, but if you think about this problem a bit
more you'll see that most workloads which look like they might
suffer at first view should be just fine in reality...
> read-latency will only do the front-insertion if it was unable to find a
> merge or insert on the tail-to-head search.
>
> And the problem it desperately addresses is severe.
Note that async-IO shouldn't make a big difference here, except
maybe in synthetic benchmarks.
This is because the stream of data in a server will be approximately
the same regardless of whether the application is coded to use async
IO, threads or processes and because clients still need to wait for
the data on read while most writes are asynchronous.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"october@surriel.com">october@surriel.com</a>
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 21:05 ` Rik van Riel
@ 2002-11-11 1:54 ` Andrea Arcangeli
2002-11-11 4:03 ` Andrew Morton
2002-11-11 13:45 ` Rik van Riel
0 siblings, 2 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-11 1:54 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, Con Kolivas, linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 07:05:01PM -0200, Rik van Riel wrote:
> On Sun, 10 Nov 2002, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> > >
> > > > Whether the IO is synchronous or asynchronous shouldn't matter much,
> > >
> > > the fact the I/O is sync or async makes the whole difference. with sync
> > > reads the vmstat line in the read column will be always very small
> > > compared to the write column under a write flood. This can be fixed either:
> > >
> > > 1) with hacks in the elevator ala read-latency that are not generic and
> > > could decrease performance of other workloads
>
> It'd be nice if you specified which kind of workloads. Generic
the slowdown happens in this case:
queue 5 6 7 8 9
insert read 3
queue 3 5 6 7 8 9
request 3 is handled by the device
queue 5 6 7 8 9
insert read 1
queue 1 5 6 7 8 9
request 1 is handled by the device
queue 5 6 7 8 9
insert read 2
queue 2 5 6 7 8 9
request 2 is handled by the device
so what happened is:
read 3
read 1
read 2
while without read-latency what would most probably have happened is the
below, because requests 5 6 7 8 9 would give the other reads the time to
be inserted and reordered, and in turn optimized:
read 1
read 2
read 3
let's ignore async-io to keep it simple; there is definitely the
possibility of slowing down with lots of tasks reading at the same time,
even with only sync reads (that could be lots of major
faults at the same time during swapping under some write load, or whatever
else generates lots of tasks reading at nearly the same time during some
background writing).
Anybody claiming there isn't the potential of a global I/O throughput
slowdown would be clueless.
I know in some cases the additional seeking may allow the cpu to do
more work, and that may actually increase the throughput, but this isn't
always the case; it can definitely slow down.
all you can argue is that the decrease of latency for lots of common
interactive workloads could be worth the potential of a global throughput
slowdown. On that I may agree. I wasn't very excited about merging that
because I was scared of slowing down workloads with async-io, and with
lots of tasks reading small things at the same time during writes, which
as I demonstrated above can definitely happen in practice and is
realistic. I run a number of workloads like that myself. The current
algorithm is optimal for throughput.
However I think even read-latency is more a workaround for a problem in
the I/O queue dimensions. I think the I/O queue should be dynamically
limited by the amount of data queued (in bytes, not in number of requests).
We need plenty of requests only because every request may hold only 4k
when no merging can happen, and in such a case we definitely need the
elevator to do a huge amount of work to be efficient; seeking heavily on 4k
requests (or smaller) hurts a lot, while seeking on 512k requests is much
less severe.
But when each request is 512k large it is pointless to allow the same
number of requests that we allow when the requests are 4k. I think
starting with such a simple fix would provide a similar benefit to
read-latency and no corner cases at all. So I would much prefer to start
with a fix like that, accounting for the available request size to
drivers in bytes of data in the queue instead of in number of requests
in the queue. read-latency kind of works around the way too huge I/O
queue when each request is 512k in size. And it works around it only for
reads; O_SYNC/-osync would get stuck big time against writeback load
from other files, just like reads do now. The fix I propose is generic;
basically it has no downside, it is more dynamic, and so I prefer it even
if it may not be as direct and hard as read-latency, but that is in fact
what makes it better and potentially faster in throughput than read-latency.
Going one step further we could limit the amount of bytes that each
single task can submit, so for example kupdate/bdflush couldn't fill the
queue completely anymore, and still the elevator could do a huge amount of
work when thousands of different tasks are submitting at the same time,
which is the interesting case for the elevator; or the amount of data each
task may submit to the queue could depend on the number of tasks
actively doing I/O in the last few seconds.
These are the fixes (I consider the limiting of bytes in the I/O queue a
fix) that I would prefer.
In fact I today think the max_bomb_segment I researched some years back
was so beneficial in terms of read latency just because it effectively
had the effect of reducing the max amount of pending "writeback" bytes
in the queue, not really because it split the request into multiple dmas
(which in turn decreased performance a lot, because the dma chunks were way
too small to have a hope of reaching the peak performance of the hardware,
and the fact performance was hurt so badly forced us to back it out
completely, rightly). So I'm optimistic that reducing the size of the
queue and making it tunable from elvtune would be the first thing to do,
rather than playing with the read-latency hack that just works around the
way too huge queue size when the merging is at its maximum and that can
hurt performance in some cases.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 1:54 ` Andrea Arcangeli
@ 2002-11-11 4:03 ` Andrew Morton
2002-11-11 4:06 ` Andrea Arcangeli
2002-11-11 13:45 ` Rik van Riel
1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2002-11-11 4:03 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
Andrea Arcangeli wrote:
>
> the slowdown happens in this case:
>
> queue 5 6 7 8 9
>
> insert read 3
>
> queue 3 5 6 7 8 9
read-latency will not do that.
> However I think even read-latency is more a workarond to a problem in
> the I/O queue dimensions.
The problem is the 2.4 algorithm. If a read is not mergeable or
insertable it is placed at the tail of the queue. Which is the
worst possible place it can be put because applications wait on
reads, not on writes.
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 4:03 ` Andrew Morton
@ 2002-11-11 4:06 ` Andrea Arcangeli
2002-11-11 4:22 ` Andrew Morton
0 siblings, 1 reply; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-11 4:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 08:03:01PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > the slowdown happens in this case:
> >
> > queue 5 6 7 8 9
> >
> > insert read 3
> >
> > queue 3 5 6 7 8 9
>
> read-latency will not do that.
So what will it do? It must do something very much like what I
described, or it is a no-op, period. Please elaborate.
>
> > However I think even read-latency is more a workarond to a problem in
> > the I/O queue dimensions.
>
> The problem is the 2.4 algorithm. If a read is not mergeable or
> insertable it is placed at the tail of the queue. Which is the
> worst possible place it can be put because applications wait on
> reads, not on writes.
O_SYNC/-osync waits on writes too, so are you saying writes must go to
the head because of that? Reads should not be too bad at the end
either, if only the queue weren't so oversized when merging is at its
maximum. Fix the oversizing of the queue, then read-latency will
matter much less.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 4:06 ` Andrea Arcangeli
@ 2002-11-11 4:22 ` Andrew Morton
2002-11-11 4:39 ` Andrea Arcangeli
0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2002-11-11 4:22 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
Andrea Arcangeli wrote:
>
> On Sun, Nov 10, 2002 at 08:03:01PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> > >
> > > the slowdown happens in this case:
> > >
> > > queue 5 6 7 8 9
> > >
> > > insert read 3
> > >
> > > queue 3 5 6 7 8 9
> >
> > read-latency will not do that.
>
> So what will it do? Must do something very much like what I described or
> it is a noop period. Please elaborate.
If a read was not merged with another read on the tail->head walk
the read will be inserted near the head. The head->tail walk bypasses
all reads, six (default) writes and then inserts the new read.
It has the shortcoming that earlier reads may be walked past in the
tail->head phase. It's a three-liner to prevent that but I was never
able to demonstrate any difference.
> >
> > > However I think even read-latency is more a workarond to a problem in
> > > the I/O queue dimensions.
> >
> > The problem is the 2.4 algorithm. If a read is not mergeable or
> > insertable it is placed at the tail of the queue. Which is the
> > worst possible place it can be put because applications wait on
> > reads, not on writes.
>
> O_SYNC/-osync waits on writes too, so are you saying writes must go to
> the head because of that?
It has been discussed: boost a request to head-of-queue when a thread
starts to wait on a buffer/page which is inside that request.
But we don't care about synchronous writes. As long as we don't
starve them out completely, optimise the (vastly more) common case.
> reads should be not too bad at the end too if
> only the queue wasn't that oversized when the merging is at its maximum.
> Fix the oversizing of the queue, then read-latency will matter much
> less.
Think about two threads. One is generating a stream of writes and
the other is trying to read a file. The reader needs to read the
directory, the inode, the first data blocks, the first indirect and
then some more data blocks. That's at least three synchronous reads.
Even if those reads are placed just three requests from head-of-queue,
the reader will make one tenth of the progress of the writer.
And the current code places those reads 64 requests from head-of-queue.
When the various things which were congesting write queueing were fixed
in the 2.5 VM a streaming write was slowing such read operations down by
a factor of 4000.
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 4:22 ` Andrew Morton
@ 2002-11-11 4:39 ` Andrea Arcangeli
2002-11-11 5:10 ` Andrew Morton
0 siblings, 1 reply; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-11 4:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 08:22:38PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Sun, Nov 10, 2002 at 08:03:01PM -0800, Andrew Morton wrote:
> > > Andrea Arcangeli wrote:
> > > >
> > > > the slowdown happens in this case:
> > > >
> > > > queue 5 6 7 8 9
> > > >
> > > > insert read 3
> > > >
> > > > queue 3 5 6 7 8 9
> > >
> > > read-latency will not do that.
> >
> > So what will it do? Must do something very much like what I described or
> > it is a noop period. Please elaborate.
>
> If a read was not merged with another read on the tail->head walk
> the read will be inserted near the head. The head->tail walk bypasses
> all reads, six (default) writes and then inserts the new read.
>
> It has the shortcoming that earlier reads may be walked past in the
> tail->head phase. It's a three-liner to prevent that but I was never
> able to demonstrate any difference.
From your description it seems what will happen is:
queue 3 5 6 7 8 9
I don't see why you say it won't do that. The whole point of the patch
is to put reads at or near the head, and you say 3 won't be put at the
head if only 5 writes are pending. Or maybe your "bypasses 6 writes"
means the other way around: that you put the read as the seventh entry
in the queue if there are 6 writes pending. Is that the case?
> > > > However I think even read-latency is more a workarond to a
> > > > problem in
> > > > the I/O queue dimensions.
> > >
> > > The problem is the 2.4 algorithm. If a read is not mergeable or
> > > insertable it is placed at the tail of the queue. Which is the
> > > worst possible place it can be put because applications wait on
> > > reads, not on writes.
> >
> > O_SYNC/-osync waits on writes too, so are you saying writes must go to
> > the head because of that?
>
> It has been discussed: boost a request to head-of-queue when a thread
> starts to wait on a buffer/page which is inside that request.
>
> But we don't care about synchronous writes. As long as we don't
> starve them out completely, optimise the (vastly more) common case.
Yes, it could be worthwhile to trade a small decrease in global
throughput for a significant improvement in read latency; I'm not
against that. But before I care about that, I would prefer to get a
limit on the size of the queue in bytes, not in requests. That is a
generic issue for writes and async-I/O reads too: it's a matter of
task-against-task fairness/latency, not specific to reads, but it
should help read latency visibly as well. In any case the two things
are orthogonal; if the queue is smaller, read-latency will do even
better.
> > reads should be not too bad at the end too if
> > only the queue wasn't that oversized when the merging is at its maximum.
> > Fix the oversizing of the queue, then read-latency will matter much
> > less.
>
> Think about two threads. One is generating a stream of writes and
> the other is trying to read a file. The reader needs to read the
> directory, the inode, the first data blocks, the first indirect and
> then some more data blocks. That's at least three synchronous reads.
Sure, I know the problem with sync reads.
> Even if those reads are placed just three requests from head-of-queue,
> the reader will make one tenth of the progress of the writer.
Actually it's probably much worse than a 10-times ratio, since the
writer is going to use big requests, while the reader is probably
seeking with <=4k requests.
> And the current code places those reads 64 requests from head-of-queue.
>
> When the various things which were congesting write queueing were fixed
> in the 2.5 VM a streaming write was slowing such read operations down by
> a factor of 4000.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 4:39 ` Andrea Arcangeli
@ 2002-11-11 5:10 ` Andrew Morton
2002-11-11 5:23 ` Andrea Arcangeli
` (2 more replies)
0 siblings, 3 replies; 47+ messages in thread
From: Andrew Morton @ 2002-11-11 5:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
Andrea Arcangeli wrote:
>
> from your description it seems what will happen is:
>
> queue 3 5 6 7 8 9
>
> I don't see why you say it won't do that. the whole point of the patch
> to put reads at or near the head, and you say 3 won't be put at the
> head if only 5 writes are pending. Or maybe your bypasses "6 writes"
> means the other way around, that you put the read as the seventh entry
> in the queue if there are 6 writes pending, is it the case?
Actually I thought your "queue" was "head of queue" and that 5,6,7,8 and 9
were reads....
If the queue contains, say:
(head) R1 R2 R3 W1 W2 W3 W4 W5 W6 W7
Then a new R4 will be inserted between W6 and W7. So if R5 is mergeable
with R4 there is still plenty of time for that.
> > > > > However I think even read-latency is more a workarond to a
> > > > > problem in
> > > > > the I/O queue dimensions.
> > > >
> > > > The problem is the 2.4 algorithm. If a read is not mergeable or
> > > > insertable it is placed at the tail of the queue. Which is the
> > > > worst possible place it can be put because applications wait on
> > > > reads, not on writes.
> > >
> > > O_SYNC/-osync waits on writes too, so are you saying writes must go to
> > > the head because of that?
> >
> > It has been discussed: boost a request to head-of-queue when a thread
> > starts to wait on a buffer/page which is inside that request.
> >
> > But we don't care about synchronous writes. As long as we don't
> > starve them out completely, optimise the (vastly more) common case.
>
> yes, it should be worthwhile to potentially decrease a little the global
> throughput to increase significantly the read latency, I'm not against
> that, but before I would care about that I prefer to get a limit on the
> size of the queue in bytes, not in requests,
Really, it should be in terms of "time". If you assume 6 msec seek and
30 mbyte/sec bandwidth, the crossover is a 120 kbyte I/O. Not that I'm
sure this means anything interesting ;) But the lesson is that the
size of a request isn't very important.
> actually it's probably much worse tha a 10 times ratio since the writer
> is going to use big requests, while the reader is probably seeking with
> <=4k requests.
>
Yup. This is one case where improving latency improves throughput,
if there's computational work to be done.
2.5 (and read-latency) sort-of solve these problems by creating a
massive seekstorm when there are competing reads and writes. It's
a pretty sad solution really.
Better would be to perform those reads and writes in nice big batches.
That's easy for the writes, but for reads we need to wait for the
application to submit another one. That means actually deliberately
leaving the disk head idle for a few milliseconds in the anticipation
that the application will submit another nearby read. This is called
"anticipatory scheduling" and has been shown to provide 20%-70%
performance boost in web serving workloads. It just makes heaps of
sense to me and I'd love to see it in Linux...
See http://www.cs.ucsd.edu/sosp01/papers/iyer.pdf
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 5:10 ` Andrew Morton
@ 2002-11-11 5:23 ` Andrea Arcangeli
2002-11-11 7:58 ` William Lee Irwin III
2002-11-11 13:56 ` Rik van Riel
2 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-11 5:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 09:10:41PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > from your description it seems what will happen is:
> >
> > queue 3 5 6 7 8 9
> >
> > I don't see why you say it won't do that. the whole point of the patch
> > to put reads at or near the head, and you say 3 won't be put at the
> > head if only 5 writes are pending. Or maybe your bypasses "6 writes"
> > means the other way around, that you put the read as the seventh entry
> > in the queue if there are 6 writes pending, is it the case?
>
> Actually I thought your "queue" was "head of queue" and that 5,6,7,8 and 9
> were reads....
>
> If the queue contains, say:
>
> (head) R1 R2 R3 W1 W2 W3 W4 W5 W6 W7
>
> Then a new R4 will be inserted between W6 and W7. So if R5 is mergeable
> with R4 there is still plenty of time for that.
Yes, the fact that it's "near" the head and not exactly at the head,
as I originally thought, makes it less likely to slow things down.
Even if it theoretically still could for some workloads, overall it
seems a worthwhile heuristic.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 5:10 ` Andrew Morton
2002-11-11 5:23 ` Andrea Arcangeli
@ 2002-11-11 7:58 ` William Lee Irwin III
2002-11-11 13:56 ` Rik van Riel
2 siblings, 0 replies; 47+ messages in thread
From: William Lee Irwin III @ 2002-11-11 7:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Rik van Riel, Con Kolivas,
linux kernel mailing list, marcelo
Andrea Arcangeli wrote:
> 2.5 (and read-latency) sort-of solve these problems by creating a
> massive seekstorm when there are competing reads and writes. It's
> a pretty sad solution really.
On Sun, Nov 10, 2002 at 09:10:41PM -0800, Andrew Morton wrote:
> Better would be to perform those reads and writes in nice big batches.
> That's easy for the writes, but for reads we need to wait for the
> application to submit another one. That means actually deliberately
> leaving the disk head idle for a few milliseconds in the anticipation
> that the application will submit another nearby read. This is called
> "anticipatory scheduling" and has been shown to provide 20%-70%
> performance boost in web serving workloads. It just makes heaps of
> sense to me and I'd love to see it in Linux...
> See http://www.cs.ucsd.edu/sosp01/papers/iyer.pdf
This smacks of "deceptive idleness". OTOH I prefer to keep out of those
issues and focus on pure fault handling, TLB, and space consumption
issues. I/O scheduling is far afield for me, and I prefer to keep it so.
Bill
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 5:10 ` Andrew Morton
2002-11-11 5:23 ` Andrea Arcangeli
2002-11-11 7:58 ` William Lee Irwin III
@ 2002-11-11 13:56 ` Rik van Riel
2 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2002-11-11 13:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Con Kolivas, linux kernel mailing list, marcelo
On Sun, 10 Nov 2002, Andrew Morton wrote:
> Really, it should be in terms of "time". If you assume 6 msec seek and
> 30 mbyte/sec bandwidth, the crossover is a 120 kbyte I/O.
Now figure in the rotational latency and the crossover point has
moved to 200 kB. ;)
> Not that I'm sure this means anything interesting ;) But the lesson is
> that the size of a request isn't very important.
Besides, larger requests are much more efficient so penalising
those is the very last thing we want to do.
> Better would be to perform those reads and writes in nice big batches.
> That's easy for the writes, but for reads we need to wait for the
> application to submit another one. That means actually deliberately
> leaving the disk head idle for a few milliseconds in the anticipation
> that the application will submit another nearby read. This is called
> "anticipatory scheduling" and has been shown to provide 20%-70%
> performance boost in web serving workloads. It just makes heaps of
> sense to me and I'd love to see it in Linux...
It only makes sense under heavy multiprocessing workloads where
we have multiple processes submitting IO, but if it's just one
process all this deliberate delay will achieve is a slowdown of
the process.
> See http://www.cs.ucsd.edu/sosp01/papers/iyer.pdf
Looking at it now.
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"october@surriel.com">october@surriel.com</a>
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 1:54 ` Andrea Arcangeli
2002-11-11 4:03 ` Andrew Morton
@ 2002-11-11 13:45 ` Rik van Riel
2002-11-11 14:09 ` Jens Axboe
2002-11-11 15:43 ` Andrea Arcangeli
1 sibling, 2 replies; 47+ messages in thread
From: Rik van Riel @ 2002-11-11 13:45 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Con Kolivas, linux kernel mailing list, marcelo
On Mon, 11 Nov 2002, Andrea Arcangeli wrote:
> [snip bad example by somebody who hasn't read Andrew's patch]
> Anybody claiming there isn't the potential of a global I/O throughput
> slowdown would be clueless.
IO throughput isn't the point. Due to the fundamental asymmetry
between reads and writes IO throughput does NOT correspond to
program throughput under many kinds of IO patterns.
Sure, the best IO throughput is good for writeout, but it'll slow
down any program doing reads, including async IO programs because
those too need to get their data before they can process it.
> all you can argue is that the decrease of latency for lots of common
> interactive workloads could worth the potential of a global throghput
> slowdown. On that I may agree.
On the contrary, the decrease of latency will probably bring a
global throughput increase. Just program throughput, not raw
IO throughput.
> However I think even read-latency is more a workarond to a problem in
> the I/O queue dimensions. I think the I/O queue should be dunamically
> limited to amount of data queued (in bytes not in number of requests).
The number of bytes makes surprisingly little sense when you take into
account that one seek on a modern disk costs as much time as it takes
to read about half a megabyte worth of data.
> But when each request is large 512k it is pointless to allow the same
> number of requests that we allow when the requests are 4k.
A request of 512 kB will take about twice the time to service as a 4 kB
request would take, assuming the disk does around 50 MB/s throughput.
If you take one of those really modern disks Andre Hedrick has in his
lab the difference gets even smaller.
> Infact I today think the max_bomb_segment I researched some year back
> was so beneficial in terms of read-latency just because it effectively
That must be why it was backed out ;)
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"october@surriel.com">october@surriel.com</a>
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 13:45 ` Rik van Riel
@ 2002-11-11 14:09 ` Jens Axboe
2002-11-11 15:48 ` Andrea Arcangeli
2002-11-11 15:43 ` Andrea Arcangeli
1 sibling, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2002-11-11 14:09 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Andrew Morton, Con Kolivas,
linux kernel mailing list, marcelo
On Mon, Nov 11 2002, Rik van Riel wrote:
> > Infact I today think the max_bomb_segment I researched some year back
> > was so beneficial in terms of read-latency just because it effectively
>
> That must be why it was backed out ;)
Warning, incredibly bad quote snip above.
Rik, you basically deleted the interesting part there. The
max_bomb_segment logic was pretty uninteresting if you looked at it
from the POV that says that we must limit the size of a request to
prevent starvation. This is what the name implies, and this is flawed.
However, Andrea goes on to say that it sort-of worked anyway, just not
for the reason he originally thought it would. It worked because it
limited the total size of pending writes in the queue. And this is
indeed the key factor for read latency in the 2.4 elevator: reads tend
to get pushed to the back all the time, because the queue looks like
R1-W1-W2-W3-....W127
service R1, queue is now
W1-W2-W3....-W127
application got R1 serviced, issue a new read. Queue is now:
W1-W2-W3....-W127-R2
So even with a read passover value of 0, an application typically has
to wait for the total sum of writes in the queue. And this is what
causes the starvation. max_bomb_segments wasn't too good anyway,
because in order to get good latency you have to limit the sum of
W1-W127 way too much, and then it starts to hurt write throughput
really badly.
This is why the 2.4 io scheduler is fundamentally flawed from the read
latency view point. This is also why the 2.5 deadline io scheduler is
far superior in this area.
>> But when each request is large 512k it is pointless to allow the same
>> number of requests that we allow when the requests are 4k.
> A request of 512 kB will take about twice the time to service as a 4 kB
> request would take, assuming the disk does around 50 MB/s throughput.
> If you take one of those really modern disks Andre Hedrick has in his
> lab the difference gets even smaller.
I'll mention that for 2.5 the number of bytes that equals a full seek
in service time is called a stream_unit and is tweakable. Typically
you are looking at a plain 40 MiB/s and 8 ms seek, so ~256-300 KiB is
more in the normal range than 512 KiB.
--
Jens Axboe
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 14:09 ` Jens Axboe
@ 2002-11-11 15:48 ` Andrea Arcangeli
0 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-11 15:48 UTC (permalink / raw)
To: Jens Axboe
Cc: Rik van Riel, Andrew Morton, Con Kolivas,
linux kernel mailing list, marcelo
On Mon, Nov 11, 2002 at 03:09:20PM +0100, Jens Axboe wrote:
> latency view point. This is also why the 2.5 deadline io scheduler is
> far superior in this area.
Going by time is even better of course, but just assuming bytes to be
a linear function of time would be a good start; it depends on whether
you want to backport the deadline I/O scheduler to 2.4 or not. I think
going by bytes would be simpler for 2.4. We're going to use 2.4 for at
least one more year in some production environments, so I think it
could make sense to address this, at least as a function of bytes if
not of time.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-11 13:45 ` Rik van Riel
2002-11-11 14:09 ` Jens Axboe
@ 2002-11-11 15:43 ` Andrea Arcangeli
1 sibling, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-11 15:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, Con Kolivas, linux kernel mailing list, marcelo
On Mon, Nov 11, 2002 at 11:45:06AM -0200, Rik van Riel wrote:
> On Mon, 11 Nov 2002, Andrea Arcangeli wrote:
>
> > [snip bad example by somebody who hasn't read Andrew's patch]
>
> > Anybody claiming there isn't the potential of a global I/O throughput
> > slowdown would be clueless.
>
> IO throughput isn't the point. Due to the fundamental asymmetry
I/O throughput is the whole point of the elevator, and if you change
it that way you can decrease it. Even if you insert at the seventh
request instead of the first, you're making the assumption that the
reads cannot keep the I/O pipeline full; this is a realistic
assumption for some workloads, but not all. My example still very much
applies, just not at the head but at the seventh request. I definitely
know the design idea behind read-latency, unlike what you think; I
just didn't remember the low-level implementation details, which don't
matter for the potential slowdown in theoretical terms.
> On the contrary, the decrease of latency will probably bring a
> global throughput increase. Just program throughput, not raw
I know this perfectly well, but you're making assumptions about
certain workloads. I can agree they are realistic workloads on a
desktop machine, but not all workloads are like that.
> That must be why it was backed out ;)
It was backed out because the request size must be big, and it
couldn't be big with such a ""feature"" enabled, as I just said in my
previous email. I just gave you the reason it was backed out; I'm not
sure what you are wondering about.
The fact is that read-latency is a hack to make a special case faster,
and in theory that can definitely hurt some workloads. There is a
reason read-latency isn't the default: read-latency definitely *can*
increase the seeks, and not admitting this, claiming it can only
improve performance, is clueless on your part. The implementation
detail that it inserts as the seventh request instead of as the first
decreases the probability of a slowdown, but it still has the
potential to slow something down; this is all about math local to the
elevator.
And IMHO read-latency kind of hides the real problem, which is that we
should limit the queue in bytes, or we could delay after I/O
completion as mentioned by Andrew, since certain workloads will still
be very much slower than writes even with read-latency. I'll fix the
real problem in my tree soon; I just need to run a number of
benchmarks on SCSI and IDE to measure a good size in bytes for peak
contiguous I/O performance before I can implement that.
Andrea
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 20:10 ` Andrea Arcangeli
2002-11-10 20:52 ` Andrew Morton
@ 2002-11-10 20:56 ` Andrew Morton
2002-11-11 1:08 ` Andrea Arcangeli
1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2002-11-10 20:56 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
Andrea Arcangeli wrote:
>
> So if 2.4.19-ck9 is so
> much faster under dbench and so much more responsive with the contest
> that seems to benchmark basically only the read latency under writeback
> flushing flood, then it is definitely worthwhile to produce a patch
> against mainline that generates this boost. If it has the preemption
> patch that could hardly explain it too, the improvement from 45 MB/sec
> to 65 MB/sec there's quite an huge difference and we have all the
> schedule points in the submit_bh too, so it's quite unlikely that
> preempt could explain that difference, it might against a mainline, but
> not against my tree.
>
> Anyways this is all guessing, once I'll check the code after I
> reproduced the numbers things should be much more clear.
Well if I understand it correctly, compressed caching, umm, compresses
the cache ;)
And dbench writes 01 01 01 01 01 everywhere. Enormously compressible.
So it's basically fitting vastly more pagecache into the machine.
That would be my guessing, anyway. Changing dbench to write random
stuff might change the picture.
* Re: [BENCHMARK] 2.4.{18,19{-ck9},20rc1{-aa1}} with contest
2002-11-10 20:56 ` Andrew Morton
@ 2002-11-11 1:08 ` Andrea Arcangeli
0 siblings, 0 replies; 47+ messages in thread
From: Andrea Arcangeli @ 2002-11-11 1:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Con Kolivas, linux kernel mailing list, marcelo
On Sun, Nov 10, 2002 at 12:56:33PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > So if 2.4.19-ck9 is so
> > much faster under dbench and so much more responsive with the contest
> > that seems to benchmark basically only the read latency under writeback
> > flushing flood, then it is definitely worthwhile to produce a patch
> > against mainline that generates this boost. If it has the preemption
> > patch that could hardly explain it too, the improvement from 45 MB/sec
> > to 65 MB/sec there's quite an huge difference and we have all the
> > schedule points in the submit_bh too, so it's quite unlikely that
> > preempt could explain that difference, it might against a mainline, but
> > not against my tree.
> >
> > Anyways this is all guessing, once I'll check the code after I
> > reproduced the numbers things should be much more clear.
>
> Well if I understand it correctly, compressed caching, umm, compresses
> the cache ;)
>
> And dbench writes 01 01 01 01 01 everywhere. Enormously compressible.
>
> So it's basically fitting vastly more pagecache into the machine.
>
> That would be my guessing, anyway. Changing dbench to write random
> stuff might change the picture.
Yes, it may be the pagecache compression that makes the difference
here. My hardware has lots of disk and RAM bandwidth, so it should
benefit less from compression. The results on my tree are finished;
I'm starting a new run on ck10.
Andrea