* BFQ: simple elevator
@ 2013-03-20 15:54 Raymond Jennings
  2013-03-20 19:24 ` Mulyadi Santosa
  0 siblings, 1 reply; 24+ messages in thread
From: Raymond Jennings @ 2013-03-20 15:54 UTC (permalink / raw)
  To: kernelnewbies

I've been pondering writing a very simple I/O scheduler.

One step above noop: it just keeps everything in a big heap sorted by
sector position, with a single cursor bouncing from head to tail,
shaving off requests in a loop of ascending and descending sweeps.

Any gotchas I need to be aware of, or can I simply fork off of deadline
and simplify it to omit batching?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-20 15:54 BFQ: simple elevator Raymond Jennings
@ 2013-03-20 19:24 ` Mulyadi Santosa
  2013-03-20 21:03   ` Valdis.Kletnieks at vt.edu
  0 siblings, 1 reply; 24+ messages in thread
From: Mulyadi Santosa @ 2013-03-20 19:24 UTC (permalink / raw)
  To: kernelnewbies

On 3/20/13, Raymond Jennings <shentino@gmail.com> wrote:
> I've been pondering making a very simple IO scheduler
>
> one step above noop, just keeps everything in a big heap sorted by
> position and a single cursor bouncing from head to tail shaving off
> requests in a loop of ascending and descending sweeps.

Pardon me for any possible silliness, but what happens if there are
incoming I/O operations at very nearby sectors (or perhaps at the same
sector)?  I suppose the elevator will prioritize them over the rest?
(i.e. starvation will happen...)

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-20 19:24 ` Mulyadi Santosa
@ 2013-03-20 21:03   ` Valdis.Kletnieks at vt.edu
  2013-03-20 21:41     ` Raymond Jennings
  2013-03-20 23:05     ` Linux elevators (Re: BFQ: simple elevator) Arlie Stephens
  0 siblings, 2 replies; 24+ messages in thread
From: Valdis.Kletnieks at vt.edu @ 2013-03-20 21:03 UTC (permalink / raw)
  To: kernelnewbies

On Thu, 21 Mar 2013 02:24:23 +0700, Mulyadi Santosa said:

> pardon me for any possible sillyness, but what happen if there are
> incoming I/O operation at very nearby sectors (or perhaps at the same
> sector?)? I suppose, the elevator will prioritize them first over the
> rest? (i.e starving will happen...)

And this, my friends, is why elevators aren't as easy to do as the average
undergrad might hope - it's a lot harder to balance fairness and throughput
across all the corner cases than you might think.  It gets really fun
when you have (for example) a 'find' command moving the heads all over
the disk while another process is trying to do large amounts of streaming
I/O.  And then you'll get some idiot process that insists on doing the
occasional fsync() or syncfs() call.  Yes, it's almost always *all*
corner cases; it's very rare (unless you're an embedded system like a TiVo)
that all your I/O is one flavor that is easily handled by a simple elevator.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-20 21:03   ` Valdis.Kletnieks at vt.edu
@ 2013-03-20 21:41     ` Raymond Jennings
  2013-03-20 23:10       ` Valdis.Kletnieks at vt.edu
  2013-03-23 14:38       ` Matthias Brugger
  2013-03-20 23:05     ` Linux elevators (Re: BFQ: simple elevator) Arlie Stephens
  1 sibling, 2 replies; 24+ messages in thread
From: Raymond Jennings @ 2013-03-20 21:41 UTC (permalink / raw)
  To: kernelnewbies

On Wed, Mar 20, 2013 at 2:03 PM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Thu, 21 Mar 2013 02:24:23 +0700, Mulyadi Santosa said:
>
>> pardon me for any possible sillyness, but what happen if there are
>> incoming I/O operation at very nearby sectors (or perhaps at the same
>> sector?)? I suppose, the elevator will prioritize them first over the
>> rest? (i.e starving will happen...)

This is actually why I proposed to enforce forward progress by only
looking for further requests in one direction at a time.

Suppose you have requests at sectors 1, 4, 5, and 6

You dispatch sectors 1, 4, and 5, leaving the head parked at 5 and the
direction as ascending.

But suddenly, just before you get a chance to dispatch for sector 6,
sector 4 gets busy again.

I'm not proposing going back to sector 4.  It's behind us and (as you
indicated) we could starve sector 6 indefinitely.

So instead, because sector 4 is on the wrong side of our present head
position, it is ignored and we keep marching forward, and then we hit
sector 6 and dispatch it.

Once we hit sector 6 and dispatch it, we do a u-turn and start
descending.  That's when we pick up sector 4 again.

When we're going up, we ignore what's below us, and when we're going
down we ignore what is above us.

We only switch directions when there's nothing left in front of us in
the direction we were going.  In theory, since disk capacity is itself
finite, so too is the amount of time any request has to wait before the
elevator reaches it.
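
In rough C, the selection rule I have in mind looks something like this.
It's only a sketch: the struct and function names are made up, though
the rbtree and block helpers (rb_first/rb_last/rb_next/rb_prev,
rb_entry, blk_rq_pos) are the ones the existing schedulers already use,
as far as I know:

#include <linux/blkdev.h>
#include <linux/rbtree.h>

struct sweep_data {
        struct rb_root sorted;  /* pending requests, sorted by start sector */
        sector_t head_pos;      /* where the last dispatch left the head */
        bool ascending;         /* current sweep direction */
};

static struct request *sweep_next(struct sweep_data *sd)
{
        struct rb_node *node;
        struct request *rq;

        /* Scan in the current direction and take the first request at or
         * beyond the head; anything behind us waits for the u-turn. */
        for (node = sd->ascending ? rb_first(&sd->sorted) : rb_last(&sd->sorted);
             node;
             node = sd->ascending ? rb_next(node) : rb_prev(node)) {
                rq = rb_entry(node, struct request, rb_node);
                if (sd->ascending ? blk_rq_pos(rq) >= sd->head_pos
                                  : blk_rq_pos(rq) <= sd->head_pos)
                        return rq;
        }

        /* Nothing ahead of us: u-turn, and the nearest request is then
         * simply the highest (or lowest) one remaining. */
        sd->ascending = !sd->ascending;
        node = sd->ascending ? rb_first(&sd->sorted) : rb_last(&sd->sorted);
        return node ? rb_entry(node, struct request, rb_node) : NULL;
}

Insertion on the other side would, I believe, just be an elv_rb_add()
into that sorted tree.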

Anyway, does this clarification answer your concerns about starvation?

> And this, my friends, is why elevators aren't as easy to do as the average
> undergrad might hope - it's a lot harder to balance fairness and throughput
> across all the corner cases than you might think.  It gets really fun
> when you have (for example) a 'find' command moving the heads all over
> the disk while another process is trying to do large amounts of streaming
> I/O.  And then you'll get some idiot process that insists on doing the
> occasional fsync() or syncfs() call.  Yes, it's almost always *all*
> corner cases, it's very rare (unless you're an embedded system like a Tivo)
> that all your I/O is one flavor that is easily handled by a simple elevator.

In my case I'm just concerned with raw total system throughput.

I called it "BFQ" for a reason.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Linux elevators (Re: BFQ: simple elevator)
  2013-03-20 21:03   ` Valdis.Kletnieks at vt.edu
  2013-03-20 21:41     ` Raymond Jennings
@ 2013-03-20 23:05     ` Arlie Stephens
  2013-03-20 23:16       ` Valdis.Kletnieks at vt.edu
  1 sibling, 1 reply; 24+ messages in thread
From: Arlie Stephens @ 2013-03-20 23:05 UTC (permalink / raw)
  To: kernelnewbies

The ongoing thread reminds me of a simple question I've had since I
first read about Linux's multiple I/O schedulers.  Why is the choice of
I/O scheduler global to the whole kernel, rather than per-device or
similar?

Consider a system with both traditional rotating disks and SSDs - not
at all far-fetched.  An appropriate I/O scheduling algorithm for
rotating disks is likely to do a lot of work that's useless for
SSDs.  Why require that the same algorithms be used for both?

--
Arlie

(Arlie Stephens					arlie at worldash.org)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-20 21:41     ` Raymond Jennings
@ 2013-03-20 23:10       ` Valdis.Kletnieks at vt.edu
  2013-03-20 23:37         ` Raymond Jennings
  2013-03-23 14:38       ` Matthias Brugger
  1 sibling, 1 reply; 24+ messages in thread
From: Valdis.Kletnieks at vt.edu @ 2013-03-20 23:10 UTC (permalink / raw)
  To: kernelnewbies

On Wed, 20 Mar 2013 14:41:31 -0700, Raymond Jennings said:

> Suppose you have requests at sectors 1, 4, 5, and 6
>
> You dispatch sectors 1, 4, and 5, leaving the head parked at 5 and the
> direction as ascending.
>
> But suddenly, just before you get a chance to dispatch for sector 6,
> sector 4 gets busy again.
>
> I'm not proposing going back to sector 4.  It's behind us and (as you
> indicated) we could starve sector 6 indefinitely.
>
> So instead, because sector 4 is on the wrong side of our present head
> position, it is ignored and we keep marching forward, and then we hit
> sector 6 and dispatch it.
>
> Once we hit sector 6 and dispatch it, we do a u-turn and start
> descending.  That's when we pick up sector 4 again.

The problem is that not all seeks are created equal.

Consider requests at 1, 4, 5, and 199343245.  If, as we're servicing
5, another request for 4 comes in, we may well be *much* better off doing a
short seek to 4 and then one long seek to the boonies, rather than 2 long
seeks.

My laptop has a 160G Western Digital drive in it (WD1600BJKT).  The minimum
track-to-track seek time is 2ms, the average time is 12ms, and the maximum is
probably on the order of 36ms.  So by replacing 2 max-length seeks with a
track-to-track seek and 1 max-length seek, you can almost halve the delay
waiting for seeks (38ms versus 72ms).  (And it's even better if the target
block is logically before the current one but still on the same track, so
you only take a rotational latency hit and no seek hit.)

(The maximum is not given in the spec sheets, but it is almost always 3
times the average - for a discussion of the math behind that, and a lot of
other issues, see the link below.)

http://pages.cs.wisc.edu/~remzi/OSFEP/file-disks.pdf
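
If you want the arithmetic spelled out, here's a throwaway userspace toy
(the 2ms/36ms figures are the rough ones above, not measured numbers, and
the "nearby" cutoff is invented):

#include <stdio.h>
#include <stdlib.h>

/* Seek cost modeled crudely: ~2ms for a tiny hop, ~36ms otherwise. */
static int seek_ms(long from, long to)
{
        return labs(to - from) < 16 ? 2 : 36;
}

static int total_ms(long start, const long *order, int n)
{
        int ms = 0;
        long pos = start;

        for (int i = 0; i < n; i++) {
                ms += seek_ms(pos, order[i]);
                pos = order[i];
        }
        return ms;
}

int main(void)
{
        long sweep[]  = { 199343245, 4 };       /* keep sweeping, come back later */
        long nearby[] = { 4, 199343245 };       /* grab the nearby request first */

        printf("sweep: %d ms, nearby-first: %d ms\n",
               total_ms(5, sweep, 2), total_ms(5, nearby, 2));
        return 0;
}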

And of course, this interacts in very mysterious ways with the firmware
on the drive, which can do its own re-ordering of I/O requests and/or
manage the use of the disk's onboard read/write cache - this is why
command queueing is useful for throughput, because if the disk has the
option of re-ordering 32 requests, it can do more than if it only has 1 or
2 requests in the queue.  Of course, very deep command queues have their
own issues - most notably that at some point you need to use barriers or
something to ensure that the metadata writes aren't being re-ordered into
a pattern that could cause corruption if the disk lost its mind before
completing all the writes...

> In my case I'm just concerned with raw total system throughput.

See the above discussion.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Linux elevators (Re: BFQ: simple elevator)
  2013-03-20 23:05     ` Linux elevators (Re: BFQ: simple elevator) Arlie Stephens
@ 2013-03-20 23:16       ` Valdis.Kletnieks at vt.edu
  2013-03-20 23:45         ` Arlie Stephens
  0 siblings, 1 reply; 24+ messages in thread
From: Valdis.Kletnieks at vt.edu @ 2013-03-20 23:16 UTC (permalink / raw)
  To: kernelnewbies

On Wed, 20 Mar 2013 16:05:09 -0700, Arlie Stephens said:
> The ongoing thread reminds me of a simple question I've had since I
> first read about linux' mutiple I/O schedulers. Why is the choice of
> I/O scheduler global to the whole kernel, rather than per-device or
> similar?

They aren't global to the kernel.

On my laptop:

# find /sys/devices/pci* -name 'scheduler' | xargs grep .
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
/sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:noop deadline [cfq]
# echo noop >| /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler
# find /sys/devices/pci* -name 'scheduler' | xargs grep .
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
/sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:[noop] deadline cfq

I just changed the scheduler for the CD-ROM.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-20 23:10       ` Valdis.Kletnieks at vt.edu
@ 2013-03-20 23:37         ` Raymond Jennings
  2013-03-21  9:13           ` Valdis.Kletnieks at vt.edu
  0 siblings, 1 reply; 24+ messages in thread
From: Raymond Jennings @ 2013-03-20 23:37 UTC (permalink / raw)
  To: kernelnewbies

On Wed, Mar 20, 2013 at 4:10 PM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Wed, 20 Mar 2013 14:41:31 -0700, Raymond Jennings said:
>
>> Suppose you have requests at sectors 1, 4, 5, and 6
>>
>> You dispatch sectors 1, 4, and 5, leaving the head parked at 5 and the
>> direction as ascending.
>>
>> But suddenly, just before you get a chance to dispatch for sector 6,
>> sector 4 gets busy again.
>>
>> I'm not proposing going back to sector 4.  It's behind us and (as you
>> indicated) we could starve sector 6 indefinitely.
>>
>> So instead, because sector 4 is on the wrong side of our present head
>> position, it is ignored and we keep marching forward, and then we hit
>> sector 6 and dispatch it.
>>
>> Once we hit sector 6 and dispatch it, we do a u-turn and start
>> descending.  That's when we pick up sector 4 again.
>
> The problem is that not all seeks are created equal.
>
> Consider the requests are at 1, 4, 5, and 199343245.  If as we're servicing
> 5, another request for 4 comes in, we may well be *much* better off doing a
> short seek to 4 and then one long seek to the boonies, rather than 2 long
> seeks.
>
> My laptop has a 160G Western Digital drive in it (WD1600BJKT).  The minimum
> track-to-track seek time is 2ms, the average time is 12ms, and the maximum is
> probably on the order of 36ms. So by replacing 2 max-length seeks with a
> track-to-track seek and 1 max-length, you can almost half the delay waiting
> for seeks (38ms versus 72ms). (And even better if the target block is logically before the current one, but
> still on the same track, so you only take a rotational latency hit and no seek
> hit.
>
> (The maximum is not given in the spec sheets, but is almost always 3 times the
> average - for a discussion of the math behind that, and a lot of other issues,
> see:
>
> http://pages.cs.wisc.edu/~remzi/OSFEP/file-disks.pdf
>
> And of course, this interacts in very mysterious ways with the firmware
> on the drive, which can do its own re-ordering of I/O requests and/or
> manage the use of the disk's onboard read/write cache - this is why
> command queueing is useful for throughput, because if the disk has the
> option of re-ordering 32 requests, it can do more than if it only has 1 or
> 2 requests in the queue.  Of course, very deep command queues have their
> own issues - most notably that at some point you need to use barriers or
> something to ensure that the metadata writes aren't being re-ordered into
> a pattern that could cause corruption if the disk lost its mind before
> completing all the writes...
>
>> In my case I'm just concerned with raw total system throughput.
>
> See the above discussion.

Hmm... maybe a hybrid approach that allows a finite number of reverse
seeks, or (as I suspect deadline does) a finite delay before abandoning
the close stuff to march off to the boonies.

Btw, does deadline currently do ping-pong seeking, or does it always go
for the nearest sector in either direction?

At any rate, I'll probably need to do performance tuning to get a feel
for what will work.

What I'd like to start with is correct usage of the APIs in question
to actually do the processing.  Is request dispatch well documented?

My first hunch was to copy-paste from deadline and then tweak it.

Mostly I want to make something simple and learn how to write an I/O
scheduler in the process.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Linux elevators (Re: BFQ: simple elevator)
  2013-03-20 23:16       ` Valdis.Kletnieks at vt.edu
@ 2013-03-20 23:45         ` Arlie Stephens
  2013-03-21  1:00           ` Raymond Jennings
  0 siblings, 1 reply; 24+ messages in thread
From: Arlie Stephens @ 2013-03-20 23:45 UTC (permalink / raw)
  To: kernelnewbies

On Mar 20 2013, Valdis.Kletnieks at vt.edu wrote:
> On Wed, 20 Mar 2013 16:05:09 -0700, Arlie Stephens said:
> > The ongoing thread reminds me of a simple question I've had since I
> > first read about linux' mutiple I/O schedulers. Why is the choice of
> > I/O scheduler global to the whole kernel, rather than per-device or
> > similar?
> 
> They aren't global to the kernel.

Thanks for the correction. It appears I got wrong (outdated?)
information from some book on kernel development, or perhaps simply
misunderstood what I read. 

When I tried the example you gave, I saw the same thing, even on
the older kernels I'm working with (2.6.32 in particular). 


> 
> On my laptop:
> 
> # find /sys/devices/pci* -name 'scheduler' | xargs grep .
> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
> /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:noop deadline [cfq]
> # echo noop >| /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/schedule
> # find /sys/devices/pci* -name 'scheduler' | xargs grep .
> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
> /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:[noop] deadline cfq
> 
> I just changed the scheduler for the CD-ROM.

--
Arlie

(Arlie Stephens					 arlie at worldash.org)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Linux elevators (Re: BFQ: simple elevator)
  2013-03-20 23:45         ` Arlie Stephens
@ 2013-03-21  1:00           ` Raymond Jennings
  0 siblings, 0 replies; 24+ messages in thread
From: Raymond Jennings @ 2013-03-21  1:00 UTC (permalink / raw)
  To: kernelnewbies

On Wed, Mar 20, 2013 at 4:45 PM, Arlie Stephens <arlie@worldash.org> wrote:
> On Mar 20 2013, Valdis.Kletnieks at vt.edu wrote:
>> On Wed, 20 Mar 2013 16:05:09 -0700, Arlie Stephens said:
>> > The ongoing thread reminds me of a simple question I've had since I
>> > first read about linux' mutiple I/O schedulers. Why is the choice of
>> > I/O scheduler global to the whole kernel, rather than per-device or
>> > similar?
>>
>> They aren't global to the kernel.
>
> Thanks for the correction. It appears I got wrong (outdated?)
> information from some book on kernel development, or perhaps simply
> misunderstood what I read.

Yeah, the "global" thing is just the system default.

It's what newly created or discovered block devices will be assigned initially.

> When I tried the example you gave, I saw the same thing, even on
> the older kernels I'm working with (2.6.32 in particular).
>
>
>>
>> On my laptop:
>>
>> # find /sys/devices/pci* -name 'scheduler' | xargs grep .
>> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
>> /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:noop deadline [cfq]
>> # echo noop >| /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/schedule
>> # find /sys/devices/pci* -name 'scheduler' | xargs grep .
>> /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler:noop deadline [cfq]
>> /sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/scheduler:[noop] deadline cfq
>>
>> I just changed the scheduler for the CD-ROM.
>
> --
> Arlie
>
> (Arlie Stephens                                  arlie at worldash.org)
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-20 23:37         ` Raymond Jennings
@ 2013-03-21  9:13           ` Valdis.Kletnieks at vt.edu
  2013-03-21  9:37             ` Raymond Jennings
  0 siblings, 1 reply; 24+ messages in thread
From: Valdis.Kletnieks at vt.edu @ 2013-03-21  9:13 UTC (permalink / raw)
  To: kernelnewbies

On Wed, 20 Mar 2013 16:37:41 -0700, Raymond Jennings said:

> Hmm...Maybe a hybrid approach that allows a finite number of reverse
> seeks, or as I suspect deadline does a finite delay before abandoning
> the close stuff to march to the boonies.

Maybe. Maybe not.  It's going to depend on the workload - look how many times
we've had to tweak something as obvious as cache writeback to get it to behave
for corner cases.  You'll think you got the algorithm right, and then the next
guy to test-drive it will do something only 5% different and end up cratering
the disk. :)

Now of course, the flip side of "a disk's average seek time is between 5ms and
12ms depending how much you paid for it" is that there's no spinning disk on
the planet that can do much more than 200 seeks per second (oh, and before you
knee-jerk and say "SSD to the rescue", that's got its own issues). Right now,
you should be thinking "so *that* is why xfs and ext4 do extents - so we can
keep file I/O as sequential as possible with as few seeks as possible". Other
things you start doing if you want *real* throughput: you start looking at
striped and parallel filesystems, self-defragmenting filesystems,
multipath-capable disk controllers, and other stuff like that to spread the I/O
across lots of disks fronted by lots of servers. Lots as in hundreds.   As in
"imagine 2 racks, each with 10 4U shelves with 60 drives per shelf, with some
beefy DDN or NetApp E-series heads in front, talking to a dozen or so servers
in front of it with multiple 10GE and Infiniband links to client machines".

In other words, if you're *serious* about throughput, you're gonna need a
lot more than just a better elevator.

(For the record, a big chunk of my day job is maintaining several
petabytes of storage for HPC users, where moving data at 3 gigabytes/second
is considered sluggish...)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-21  9:13           ` Valdis.Kletnieks at vt.edu
@ 2013-03-21  9:37             ` Raymond Jennings
  2013-03-22  3:52               ` Mulyadi Santosa
  0 siblings, 1 reply; 24+ messages in thread
From: Raymond Jennings @ 2013-03-21  9:37 UTC (permalink / raw)
  To: kernelnewbies

On Thu, Mar 21, 2013 at 2:13 AM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Wed, 20 Mar 2013 16:37:41 -0700, Raymond Jennings said:
>
>> Hmm...Maybe a hybrid approach that allows a finite number of reverse
>> seeks, or as I suspect deadline does a finite delay before abandoning
>> the close stuff to march to the boonies.
>
> Maybe. Maybe not.  It's going to depend on the workload - look how many times
> we've had to tweak something as obvious as cache writeback to get it to behave
> for corner cases.  You'll think you got the algorithm right, and then the next
> guy to test-drive it will do something only 5% different and ends up cratering
> the disk. :)

Which, I suspect, is one reason why you have competing schedulers
co-existing in the same kernel, as well as tunables for each: there
isn't really a one-size-fits-all solution.

> Now of course, the flip side of "a disk's average seek time is between 5ms and
> 12ms depending how much you paid for it" is that there's no spinning disk on
> the planet that can do much more than 200 seeks per second (oh, and before you
> knee-jerk and say "SSD to the rescue", that's got its own issues). Right now,
> you should be thinking "so *that* is why xfs and ext4 do extents - so we can
> keep file I/O as sequential as possible with as few seeks as possible". Other
> things you start doing if you want *real* throughput: you start looking at
> striped and parallel filesystems, self-defragmenting filesystems,
> multipath-capable disk controllers, and other stuff like that to spread the I/O
> across lots of disks fronted by lots of servers. Lots as in hundreds.   As in
> "imagine 2 racks, each with 10 4U shelves with 60 drives per shelf, with some
> beefy DDN or NetApp E-series heads in front, talking to a dozen or so servers
> in front of it with multiple 10GE and Infiniband links to client machines".
>
> In other words, if you're *serious* about throughput, you're gonna need a
> lot more than just a better elevator.

I suspect that good old fashioned "learning from experience" will get
my feet wet.

> (For the record, a big chunk of my day job is maintaining several several
> petabytes of storage for HPC users, where moving data at 3 gigabytes/second
> is considered sluggish...)

My day job involves receiving disability from the feds for having
autism, so I have almost nothing but time on my hands.

At any rate, I suppose the best way to get started on this is to get a
grip on the APIs involved in receiving requests from above and
dispatching them below.  Is studying the deadline scheduler a good
start, or would I be better off looking at documentation?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-21  9:37             ` Raymond Jennings
@ 2013-03-22  3:52               ` Mulyadi Santosa
  2013-03-22 20:50                 ` Raymond Jennings
  0 siblings, 1 reply; 24+ messages in thread
From: Mulyadi Santosa @ 2013-03-22  3:52 UTC (permalink / raw)
  To: kernelnewbies

Hi....

On Thu, Mar 21, 2013 at 4:37 PM, Raymond Jennings <shentino@gmail.com> wrote:
> At any rate I suppose the best way to get started on this is to get a
> grip on the api's involved in receiving requests from above and
> dispatching them below.  Is studying the deadline scheduler a good
> start or would I be better off looking at documentation?

Valdis already gave you great feedback, so I'll just add that what you
need is lots of experimentation.

Reading your detailed proposal, it sounds like what you're doing is quite
similar to Con Kolivas's old Staircase scheduler (a process scheduler).
The difference is that you're implementing it for disks, which have far
bigger latencies to consider. But again, IMHO there is nothing to stop
you from experimenting.

So I suggest maybe adding something the anticipatory scheduler did: wait
a few moments to anticipate I/O arriving at nearby sectors. You need to
define the limit of "nearby". If nothing comes, just keep moving. If
something does, service it while changing the "service path".
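
In code it might look roughly like this - purely illustrative, the
helper name and the 128-sector cutoff are made up:

#include <linux/blkdev.h>

/* Made-up helper: before reversing direction, check whether a new
 * request is close enough to the current head position to be worth
 * servicing even though it is slightly "behind" us. */
#define NEARBY_SECTORS 128      /* tunable definition of "nearby" */

static bool worth_waiting_for(sector_t head, struct request *rq)
{
        sector_t pos = blk_rq_pos(rq);
        sector_t dist = pos > head ? pos - head : head - pos;

        return dist <= NEARBY_SECTORS;
}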

Just my 2 cents idea......


-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-22  3:52               ` Mulyadi Santosa
@ 2013-03-22 20:50                 ` Raymond Jennings
  2013-03-22 20:53                   ` Raymond Jennings
  0 siblings, 1 reply; 24+ messages in thread
From: Raymond Jennings @ 2013-03-22 20:50 UTC (permalink / raw)
  To: kernelnewbies

On Thu, Mar 21, 2013 at 8:52 PM, Mulyadi Santosa
<mulyadi.santosa@gmail.com> wrote:
> Hi....
>
> On Thu, Mar 21, 2013 at 4:37 PM, Raymond Jennings <shentino@gmail.com> wrote:
>> At any rate I suppose the best way to get started on this is to get a
>> grip on the api's involved in receiving requests from above and
>> dispatching them below.  Is studying the deadline scheduler a good
>> start or would I be better off looking at documentation?
>
> Valdis already gave you great feedback, so I'll just add that what you
> need is lots of experiments.

Oh, I already agree with Valdis's feedback, especially the point that
this is not as simple as one would hope.  Before I get anywhere close
to submitting my work upstream I'll be playing with it a lot.

Mostly I want to get a grip on the brass tacks of interfacing properly
with the code layers above and below my scheduler.  Hence my question
on the best way to learn the ropes of actually handling the requests.

My goal in writing this was more to get practice writing kernel code
and properly interfacing with other subsystems than to actually write
a decent scheduler ^^.

> Reading your detailed proposal, sounds like what you're doing is much
> similar to Con Kolivas old- Staircase scheduler (a process scheduler).
> The difference is, you're implementing it on disk which has way bigger
> latency to consider. But again, IMHO there is nothing to stop you from
> experimenting.
>
> So I suggest, maybe you need to add something that anticipatory does,
> wait for few moments to anticipate I/O coming in nearby sectors. You
> need to define the limit of "nearby". If nothing coming, just keep
> moving. If there is one, service them while changing the "service
> path".
>
> Just my 2 cents idea......

Quite, quite.

Wonderful feedback everyone.

At this point it's time to get my feet wet, so all I'm after right now
is getting the basic plumbing right.

>
> --
> regards,
>
> Mulyadi Santosa
> Freelance Linux trainer and consultant
>
> blog: the-hydra.blogspot.com
> training: mulyaditraining.blogspot.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-22 20:50                 ` Raymond Jennings
@ 2013-03-22 20:53                   ` Raymond Jennings
  2013-03-22 21:20                     ` Valdis.Kletnieks at vt.edu
  0 siblings, 1 reply; 24+ messages in thread
From: Raymond Jennings @ 2013-03-22 20:53 UTC (permalink / raw)
  To: kernelnewbies

I think I might split incoming requests into two heaps.

The first heap would be synchronous requests such as reads and syncs
that someone in userspace is blocking on.

The second is background I/O like writeback and readahead.

Essentially the same distinction that CFQ makes.

Anyway, I plan to give the first heap strict priority over the second.
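
Something like this, roughly - just a sketch, the struct and its fields
are invented, though rq_is_sync(), elv_rb_add() and RB_EMPTY_ROOT() are
existing helpers as far as I can tell:

#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/rbtree.h>

struct bfq_data {
        struct rb_root sync_tree;       /* reads/syncs someone is blocked on */
        struct rb_root async_tree;      /* writeback, readahead */
};

static void bfq_add_request(struct request_queue *q, struct request *rq)
{
        struct bfq_data *bd = q->elevator->elevator_data;
        struct rb_root *root;

        root = rq_is_sync(rq) ? &bd->sync_tree : &bd->async_tree;
        elv_rb_add(root, rq);           /* keeps the tree sorted by sector */
}

static struct rb_root *bfq_tree_to_service(struct bfq_data *bd)
{
        /* Strict priority: only touch background I/O when nothing
         * synchronous is waiting. */
        if (!RB_EMPTY_ROOT(&bd->sync_tree))
                return &bd->sync_tree;
        return &bd->async_tree;
}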

But anyway yeah, I'm into this more for learning how to interface with
the APIs in question and write kernel code in general.  Good
performance is a secondary objective.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-22 20:53                   ` Raymond Jennings
@ 2013-03-22 21:20                     ` Valdis.Kletnieks at vt.edu
  2013-03-23  0:05                       ` Raymond Jennings
  2013-03-23  1:42                       ` Raymond Jennings
  0 siblings, 2 replies; 24+ messages in thread
From: Valdis.Kletnieks at vt.edu @ 2013-03-22 21:20 UTC (permalink / raw)
  To: kernelnewbies

On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:

> The first heap would be synchronous requests such as reads and syncs
> that someone in userspace is blocking on.
>
> The second is background I/O like writeback and readahead.
>
> The same distinction that CFQ completely makes.

Again, this may or may not be a win, depending on the exact workload.

If you are about to block on a userspace read, it may make sense to go ahead
and tack a readahead on the request "for free" - at 100MB/sec transfer and 10ms
seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
seeks later, you're winning.  Of course, the *real* problem here is that how
much readahead to actually do needs help from the VFS and filesystem levels -
if there's only 600K more data before the end of the current file extent, doing
more than 600K of read-ahead is a loss.

Meanwhile, over on the write side of the fence, unless a program is
specifically using O_DIRECT, userspace writes will get dropped into the cache
and become writeback requests later on.  So the vast majority of writes will
usually be writebacks rather than synchronous writes.

So in many cases, it's unclear how much performance CFQ gets from making
the distinction (and I'm positive that given a sufficient supply of pizza and
caffeine, I could cook up a realistic scenario where CFQ's behavior makes
things worse)...

Did I mention this stuff is tricky? :)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-22 21:20                     ` Valdis.Kletnieks at vt.edu
@ 2013-03-23  0:05                       ` Raymond Jennings
  2013-03-23 16:42                         ` Matthias Brugger
  2013-03-23  1:42                       ` Raymond Jennings
  1 sibling, 1 reply; 24+ messages in thread
From: Raymond Jennings @ 2013-03-23  0:05 UTC (permalink / raw)
  To: kernelnewbies

On Fri, Mar 22, 2013 at 2:20 PM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:
>
>> The first heap would be synchronous requests such as reads and syncs
>> that someone in userspace is blocking on.
>>
>> The second is background I/O like writeback and readahead.
>>
>> The same distinction that CFQ completely makes.
>
> Again, this may or may not be a win, depending on the exact workload.
>
> If you are about to block on a userspace read, it may make sense to go ahead
> and tack a readahead on the request "for free" - at 100MB/sec transfer and 10ms
> seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
> seeks later, you're willing.  Of course, the *real* problem here is that how
> much readahead to actually do needs help from the VFS and filesystem levels -
> if there's only 600K more data before the end of the current file extent, doing
> more than 600K of read-ahead is a loss.
>
> Meanwhile, over on the write side of the fence, unless a program is
> specifically using O_DIRECT, userspace writes will get dropped into the cache
> and become writeback requests later on.  So the vast majority of writes will
> usually be writebacks rather than syncronous writes.
>
> So in many cases, it's unclear how much performance CFQ gets from making
> the distinction (and I'm positive that given a sufficient supply of pizza and
> caffeine, I could cook up a realistic scenario where CFQ's behavior makes
> things worse)...
>
> Did I mention this stuff is tricky? :)
>

Oh, I'm well aware that it's tricky.  But as I said, I'm more interested
in learning the API than tuning performance.

Having a super efficient toaster won't do much good if I can't plug
the darn thing in.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-22 21:20                     ` Valdis.Kletnieks at vt.edu
  2013-03-23  0:05                       ` Raymond Jennings
@ 2013-03-23  1:42                       ` Raymond Jennings
  1 sibling, 0 replies; 24+ messages in thread
From: Raymond Jennings @ 2013-03-23  1:42 UTC (permalink / raw)
  To: kernelnewbies

On Fri, Mar 22, 2013 at 2:20 PM,  <Valdis.Kletnieks@vt.edu> wrote:
> On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:
>
>> The first heap would be synchronous requests such as reads and syncs
>> that someone in userspace is blocking on.
>>
>> The second is background I/O like writeback and readahead.
>>
>> The same distinction that CFQ completely makes.
>
> Again, this may or may not be a win, depending on the exact workload.
>
> If you are about to block on a userspace read, it may make sense to go ahead
> and tack a readahead on the request "for free" - at 100MB/sec transfer and 10ms
> seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
> seeks later, you're willing.  Of course, the *real* problem here is that how
> much readahead to actually do needs help from the VFS and filesystem levels -
> if there's only 600K more data before the end of the current file extent, doing
> more than 600K of read-ahead is a loss.

That sounds more like an argument for pushing readahead logic up into
the fs and vfs levels.  If it were up to me, linear/physical readahead
would be disabled entirely.

Speaking of which, I hear that tux3 is in the works to take more
charge of bossing the block layer around.

> Meanwhile, over on the write side of the fence, unless a program is
> specifically using O_DIRECT, userspace writes will get dropped into the cache
> and become writeback requests later on.  So the vast majority of writes will
> usually be writebacks rather than syncronous writes.
>
> So in many cases, it's unclear how much performance CFQ gets from making
> the distinction (and I'm positive that given a sufficient supply of pizza and
> caffeine, I could cook up a realistic scenario where CFQ's behavior makes
> things worse)...
>
> Did I mention this stuff is tricky? :)
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-20 21:41     ` Raymond Jennings
  2013-03-20 23:10       ` Valdis.Kletnieks at vt.edu
@ 2013-03-23 14:38       ` Matthias Brugger
  2013-03-25 18:19         ` Raymond Jennings
  1 sibling, 1 reply; 24+ messages in thread
From: Matthias Brugger @ 2013-03-23 14:38 UTC (permalink / raw)
  To: kernelnewbies

On 03/20/2013 10:41 PM, Raymond Jennings wrote:
> On Wed, Mar 20, 2013 at 2:03 PM,  <Valdis.Kletnieks@vt.edu> wrote:
>> On Thu, 21 Mar 2013 02:24:23 +0700, Mulyadi Santosa said:
>>
>>> pardon me for any possible sillyness, but what happen if there are
>>> incoming I/O operation at very nearby sectors (or perhaps at the same
>>> sector?)? I suppose, the elevator will prioritize them first over the
>>> rest? (i.e starving will happen...)
> This is actually why I proposed to enforce forward progress by only
> looking for further requests in one direction at a time.
>
> Suppose you have requests at sectors 1, 4, 5, and 6
>
> You dispatch sectors 1, 4, and 5, leaving the head parked at 5 and the
> direction as ascending.
>
> But suddenly, just before you get a chance to dispatch for sector 6,
> sector 4 gets busy again.
>
> I'm not proposing going back to sector 4.  It's behind us and (as you
> indicated) we could starve sector 6 indefinitely.
>
> So instead, because sector 4 is on the wrong side of our present head
> position, it is ignored and we keep marching forward, and then we hit
> sector 6 and dispatch it.
>
> Once we hit sector 6 and dispatch it, we do a u-turn and start
> descending.  That's when we pick up sector 4 again.
>
> When we're going up, we ignore what's below us, and when we're going
> down we ignore what is above us.
>
> We only switch directions when there's nothing in front of us the way
> we were going.  In theory, given that disk capacity is itself finite,
> so too is the amount of time one has to wait before getting reached by
> the elevator.
>
> Anyway, does this clarification answer your concerns about starvation?
>
>> And this, my friends, is why elevators aren't as easy to do as the average
>> undergrad might hope - it's a lot harder to balance fairness and throughput
>> across all the corner cases than you might think.  It gets really fun
>> when you have (for example) a 'find' command moving the heads all over
>> the disk while another process is trying to do large amounts of streaming
>> I/O.  And then you'll get some idiot process that insists on doing the
>> occasional fsync() or syncfs() call.  Yes, it's almost always *all*
>> corner cases, it's very rare (unless you're an embedded system like a Tivo)
>> that all your I/O is one flavor that is easily handled by a simple elevator.
> In my case I'm just concerned with raw total system throughput.
>
> I called it "BFQ" for a reason.

It sounds to me like the LOOK disk scheduling algorithm from 1970 - or
am I missing something?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-23  0:05                       ` Raymond Jennings
@ 2013-03-23 16:42                         ` Matthias Brugger
  2013-03-25 19:15                           ` Raymond Jennings
  0 siblings, 1 reply; 24+ messages in thread
From: Matthias Brugger @ 2013-03-23 16:42 UTC (permalink / raw)
  To: kernelnewbies

On 03/23/2013 01:05 AM, Raymond Jennings wrote:
> On Fri, Mar 22, 2013 at 2:20 PM,  <Valdis.Kletnieks@vt.edu> wrote:
>> On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:
>>
>>> The first heap would be synchronous requests such as reads and syncs
>>> that someone in userspace is blocking on.
>>>
>>> The second is background I/O like writeback and readahead.
>>>
>>> The same distinction that CFQ completely makes.
>>
>> Again, this may or may not be a win, depending on the exact workload.
>>
>> If you are about to block on a userspace read, it may make sense to go ahead
>> and tack a readahead on the request "for free" - at 100MB/sec transfer and 10ms
>> seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
>> seeks later, you're willing.  Of course, the *real* problem here is that how
>> much readahead to actually do needs help from the VFS and filesystem levels -
>> if there's only 600K more data before the end of the current file extent, doing
>> more than 600K of read-ahead is a loss.
>>
>> Meanwhile, over on the write side of the fence, unless a program is
>> specifically using O_DIRECT, userspace writes will get dropped into the cache
>> and become writeback requests later on.  So the vast majority of writes will
>> usually be writebacks rather than syncronous writes.
>>
>> So in many cases, it's unclear how much performance CFQ gets from making
>> the distinction (and I'm positive that given a sufficient supply of pizza and
>> caffeine, I could cook up a realistic scenario where CFQ's behavior makes
>> things worse)...
>>
>> Did I mention this stuff is tricky? :)
>
> Oh I'm well aware that it's tricky.  but as I said i'm more interested
> in learning the api than tuning performance.
>
> Having a super efficient toaster won't do much good if I can't plug
> the darn thing in.

If you want to understand the interface, I would recommend starting by
having a look at the noop scheduler. It's by far the simplest
implementation of a scheduler.

For me, a good starting point was this slide deck:
http://www.cs.ccu.edu.tw/~lhr89/linux-kernel/Linux%20IO%20Schedulers.pdf

Hope that helps you bring the theory into practice :)
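
To give you an idea of the shape, here is a stripped-down skeleton in
the style of noop-iosched.c, written from memory against the 3.x
single-queue elevator API - double-check the exact field and function
names against the tree you build on:

#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/module.h>

struct simple_data {
        struct list_head queue;         /* FIFO of pending requests */
};

static void simple_add_request(struct request_queue *q, struct request *rq)
{
        struct simple_data *sd = q->elevator->elevator_data;

        list_add_tail(&rq->queuelist, &sd->queue);
}

static int simple_dispatch(struct request_queue *q, int force)
{
        struct simple_data *sd = q->elevator->elevator_data;
        struct request *rq;

        if (list_empty(&sd->queue))
                return 0;

        rq = list_entry(sd->queue.next, struct request, queuelist);
        list_del_init(&rq->queuelist);
        elv_dispatch_sort(q, rq);       /* hand it to the dispatch queue */
        return 1;
}

static struct elevator_type elevator_simple = {
        .ops = {
                .elevator_add_req_fn    = simple_add_request,
                .elevator_dispatch_fn   = simple_dispatch,
                /* init/exit and merge hooks omitted for brevity */
        },
        .elevator_name  = "simple",
        .elevator_owner = THIS_MODULE,
};

Roughly speaking, the add_req hook is called when the block layer hands
your elevator a request, and the dispatch hook is where you decide which
pending request goes to the driver next; elv_dispatch_sort() places it
on the device's dispatch queue in sector order.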


^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-23 14:38       ` Matthias Brugger
@ 2013-03-25 18:19         ` Raymond Jennings
  0 siblings, 0 replies; 24+ messages in thread
From: Raymond Jennings @ 2013-03-25 18:19 UTC (permalink / raw)
  To: kernelnewbies

On Sat, Mar 23, 2013 at 7:38 AM, Matthias Brugger
<matthias.bgg@gmail.com> wrote:
> On 03/20/2013 10:41 PM, Raymond Jennings wrote:
>>
>> On Wed, Mar 20, 2013 at 2:03 PM,  <Valdis.Kletnieks@vt.edu> wrote:
>>>
>>> On Thu, 21 Mar 2013 02:24:23 +0700, Mulyadi Santosa said:
>>>
>>>> pardon me for any possible sillyness, but what happen if there are
>>>> incoming I/O operation at very nearby sectors (or perhaps at the same
>>>> sector?)? I suppose, the elevator will prioritize them first over the
>>>> rest? (i.e starving will happen...)
>>
>> This is actually why I proposed to enforce forward progress by only
>> looking for further requests in one direction at a time.
>>
>> Suppose you have requests at sectors 1, 4, 5, and 6
>>
>> You dispatch sectors 1, 4, and 5, leaving the head parked at 5 and the
>> direction as ascending.
>>
>> But suddenly, just before you get a chance to dispatch for sector 6,
>> sector 4 gets busy again.
>>
>> I'm not proposing going back to sector 4.  It's behind us and (as you
>> indicated) we could starve sector 6 indefinitely.
>>
>> So instead, because sector 4 is on the wrong side of our present head
>> position, it is ignored and we keep marching forward, and then we hit
>> sector 6 and dispatch it.
>>
>> Once we hit sector 6 and dispatch it, we do a u-turn and start
>> descending.  That's when we pick up sector 4 again.
>>
>> When we're going up, we ignore what's below us, and when we're going
>> down we ignore what is above us.
>>
>> We only switch directions when there's nothing in front of us the way
>> we were going.  In theory, given that disk capacity is itself finite,
>> so too is the amount of time one has to wait before getting reached by
>> the elevator.
>>
>> Anyway, does this clarification answer your concerns about starvation?
>>
>>> And this, my friends, is why elevators aren't as easy to do as the
>>> average
>>> undergrad might hope - it's a lot harder to balance fairness and
>>> throughput
>>> across all the corner cases than you might think.  It gets really fun
>>> when you have (for example) a 'find' command moving the heads all over
>>> the disk while another process is trying to do large amounts of streaming
>>> I/O.  And then you'll get some idiot process that insists on doing the
>>> occasional fsync() or syncfs() call.  Yes, it's almost always *all*
>>> corner cases, it's very rare (unless you're an embedded system like a
>>> Tivo)
>>> that all your I/O is one flavor that is easily handled by a simple
>>> elevator.
>>
>> In my case I'm just concerned with raw total system throughput.
>>
>> I called it "BFQ" for a reason.
>
>
> It sounds to me like the LOOK disk scheduling algorithm from 1970? Or do I
> miss something?

Your guess is as good as mine.  I've never even heard of it.

My description was very much made up on the fly.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-23 16:42                         ` Matthias Brugger
@ 2013-03-25 19:15                           ` Raymond Jennings
  2013-03-25 22:27                             ` Matthias Brugger
  0 siblings, 1 reply; 24+ messages in thread
From: Raymond Jennings @ 2013-03-25 19:15 UTC (permalink / raw)
  To: kernelnewbies

On Sat, Mar 23, 2013 at 9:42 AM, Matthias Brugger
<matthias.bgg@gmail.com> wrote:
> On 03/23/2013 01:05 AM, Raymond Jennings wrote:
>
> On Fri, Mar 22, 2013 at 2:20 PM,  <Valdis.Kletnieks@vt.edu> wrote:
>
> On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:
>
> The first heap would be synchronous requests such as reads and syncs
> that someone in userspace is blocking on.
>
> The second is background I/O like writeback and readahead.
>
> The same distinction that CFQ completely makes.
>
> Again, this may or may not be a win, depending on the exact workload.
>
> If you are about to block on a userspace read, it may make sense to go ahead
> and tack a readahead on the request "for free" - at 100MB/sec transfer and
> 10ms
> seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
> seeks later, you're willing.  Of course, the *real* problem here is that how
> much readahead to actually do needs help from the VFS and filesystem levels
> -
> if there's only 600K more data before the end of the current file extent,
> doing
> more than 600K of read-ahead is a loss.
>
> Meanwhile, over on the write side of the fence, unless a program is
> specifically using O_DIRECT, userspace writes will get dropped into the
> cache
> and become writeback requests later on.  So the vast majority of writes will
> usually be writebacks rather than syncronous writes.
>
> So in many cases, it's unclear how much performance CFQ gets from making
> the distinction (and I'm positive that given a sufficient supply of pizza
> and
> caffeine, I could cook up a realistic scenario where CFQ's behavior makes
> things worse)...
>
> Did I mention this stuff is tricky? :)
>
> Oh I'm well aware that it's tricky.  but as I said i'm more interested
> in learning the api than tuning performance.
>
> Having a super efficient toaster won't do much good if I can't plug
> the darn thing in.
>
>
> If you want to understand the interface, I would recommend to start having a
> look to the noop scheduler. It's by far the simplest implementation of a
> scheduler.
>
> For me a good starting point were this slides:
> http://www.cs.ccu.edu.tw/~lhr89/linux-kernel/Linux%20IO%20Schedulers.pdf
>
> Hope that helps you to bring the theory into practice :)
>

Just what I was looking for.

Now, how do I enable/disable my scheduler during kernel config?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-25 19:15                           ` Raymond Jennings
@ 2013-03-25 22:27                             ` Matthias Brugger
  2013-03-25 22:29                               ` Raymond Jennings
  0 siblings, 1 reply; 24+ messages in thread
From: Matthias Brugger @ 2013-03-25 22:27 UTC (permalink / raw)
  To: kernelnewbies

2013/3/25 Raymond Jennings <shentino@gmail.com>:
> On Sat, Mar 23, 2013 at 9:42 AM, Matthias Brugger
> <matthias.bgg@gmail.com> wrote:
>> On 03/23/2013 01:05 AM, Raymond Jennings wrote:
>>
>> On Fri, Mar 22, 2013 at 2:20 PM,  <Valdis.Kletnieks@vt.edu> wrote:
>>
>> On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:
>>
>> The first heap would be synchronous requests such as reads and syncs
>> that someone in userspace is blocking on.
>>
>> The second is background I/O like writeback and readahead.
>>
>> The same distinction that CFQ completely makes.
>>
>> Again, this may or may not be a win, depending on the exact workload.
>>
>> If you are about to block on a userspace read, it may make sense to go ahead
>> and tack a readahead on the request "for free" - at 100MB/sec transfer and
>> 10ms
>> seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
>> seeks later, you're willing.  Of course, the *real* problem here is that how
>> much readahead to actually do needs help from the VFS and filesystem levels
>> -
>> if there's only 600K more data before the end of the current file extent,
>> doing
>> more than 600K of read-ahead is a loss.
>>
>> Meanwhile, over on the write side of the fence, unless a program is
>> specifically using O_DIRECT, userspace writes will get dropped into the
>> cache
>> and become writeback requests later on.  So the vast majority of writes will
>> usually be writebacks rather than syncronous writes.
>>
>> So in many cases, it's unclear how much performance CFQ gets from making
>> the distinction (and I'm positive that given a sufficient supply of pizza
>> and
>> caffeine, I could cook up a realistic scenario where CFQ's behavior makes
>> things worse)...
>>
>> Did I mention this stuff is tricky? :)
>>
>> Oh I'm well aware that it's tricky.  but as I said i'm more interested
>> in learning the api than tuning performance.
>>
>> Having a super efficient toaster won't do much good if I can't plug
>> the darn thing in.
>>
>>
>> If you want to understand the interface, I would recommend to start having a
>> look to the noop scheduler. It's by far the simplest implementation of a
>> scheduler.
>>
>> For me a good starting point were this slides:
>> http://www.cs.ccu.edu.tw/~lhr89/linux-kernel/Linux%20IO%20Schedulers.pdf
>>
>> Hope that helps you to bring the theory into practice :)
>>
>
> Just what I was looking for.
>
> Now, how do I enable/disable my scheduler during kernel config?

1. Add your disk scheduler to the kernel sources (Kconfig, Makefile
and block/bfq-iosched.c)
2. Add the bfq scheduler in the kernel config (building it as a module
might make sense)
3. Recompile and install your new kernel
4. You can load/unload the module dynamically (see the sketch below).
Via sysfs you can associate the bfq scheduler with one disk.
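
The module plumbing for steps 1 and 4 looks roughly like this - again a
sketch from memory of the 3.x API, with "elevator_simple" standing in
for whatever struct elevator_type your scheduler defines:

#include <linux/elevator.h>
#include <linux/init.h>
#include <linux/module.h>

/* elevator_simple is the struct elevator_type from the earlier skeleton.
 * Register it on module load and unregister on unload; once registered
 * it shows up in /sys/block/<dev>/queue/scheduler. */
static int __init simple_init(void)
{
        return elv_register(&elevator_simple);
}

static void __exit simple_exit(void)
{
        elv_unregister(&elevator_simple);
}

module_init(simple_init);
module_exit(simple_exit);

MODULE_LICENSE("GPL");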

Happy hacking :)

-- 
---
motzblog.wordpress.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* BFQ: simple elevator
  2013-03-25 22:27                             ` Matthias Brugger
@ 2013-03-25 22:29                               ` Raymond Jennings
  0 siblings, 0 replies; 24+ messages in thread
From: Raymond Jennings @ 2013-03-25 22:29 UTC (permalink / raw)
  To: kernelnewbies

It would indeed appear that my primitive "out of my hat" design is identical to LOOK.

On Mon, Mar 25, 2013 at 3:27 PM, Matthias Brugger
<matthias.bgg@gmail.com> wrote:
> 2013/3/25 Raymond Jennings <shentino@gmail.com>:
>> On Sat, Mar 23, 2013 at 9:42 AM, Matthias Brugger
>> <matthias.bgg@gmail.com> wrote:
>>> On 03/23/2013 01:05 AM, Raymond Jennings wrote:
>>>
>>> On Fri, Mar 22, 2013 at 2:20 PM,  <Valdis.Kletnieks@vt.edu> wrote:
>>>
>>> On Fri, 22 Mar 2013 13:53:45 -0700, Raymond Jennings said:
>>>
>>> The first heap would be synchronous requests such as reads and syncs
>>> that someone in userspace is blocking on.
>>>
>>> The second is background I/O like writeback and readahead.
>>>
>>> The same distinction that CFQ completely makes.
>>>
>>> Again, this may or may not be a win, depending on the exact workload.
>>>
>>> If you are about to block on a userspace read, it may make sense to go ahead
>>> and tack a readahead on the request "for free" - at 100MB/sec transfer and
>>> 10ms
>>> seeks, reading 1M costs the same as a seek.  If you read 2M ahead and save 3
>>> seeks later, you're willing.  Of course, the *real* problem here is that how
>>> much readahead to actually do needs help from the VFS and filesystem levels
>>> -
>>> if there's only 600K more data before the end of the current file extent,
>>> doing
>>> more than 600K of read-ahead is a loss.
>>>
>>> Meanwhile, over on the write side of the fence, unless a program is
>>> specifically using O_DIRECT, userspace writes will get dropped into the
>>> cache
>>> and become writeback requests later on.  So the vast majority of writes will
>>> usually be writebacks rather than syncronous writes.
>>>
>>> So in many cases, it's unclear how much performance CFQ gets from making
>>> the distinction (and I'm positive that given a sufficient supply of pizza
>>> and
>>> caffeine, I could cook up a realistic scenario where CFQ's behavior makes
>>> things worse)...
>>>
>>> Did I mention this stuff is tricky? :)
>>>
>>> Oh I'm well aware that it's tricky.  but as I said i'm more interested
>>> in learning the api than tuning performance.
>>>
>>> Having a super efficient toaster won't do much good if I can't plug
>>> the darn thing in.
>>>
>>>
>>> If you want to understand the interface, I would recommend to start having a
>>> look to the noop scheduler. It's by far the simplest implementation of a
>>> scheduler.
>>>
>>> For me a good starting point were this slides:
>>> http://www.cs.ccu.edu.tw/~lhr89/linux-kernel/Linux%20IO%20Schedulers.pdf
>>>
>>> Hope that helps you to bring the theory into practice :)
>>>
>>
>> Just what I was looking for.
>>
>> Now, how do I enable/disable my scheduler during kernel config?
>
> 1. Add your disk scheduler to the kernel sources (Kconfig, Makefile
> and in block/bfq-iosched.c)
> 2. Add the bfq scheduler in the kernel config (as a moudle might make sense)
> 3. Recompile and install your new kernel
> 4. You can load/unload the module dynamically. Via sysfs you can
> associate the bfq scheduler with one disk.
>
> Happy hacking :)
>
> --
> ---
> motzblog.wordpress.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-03-25 22:29 UTC | newest]

Thread overview: 24+ messages
2013-03-20 15:54 BFQ: simple elevator Raymond Jennings
2013-03-20 19:24 ` Mulyadi Santosa
2013-03-20 21:03   ` Valdis.Kletnieks at vt.edu
2013-03-20 21:41     ` Raymond Jennings
2013-03-20 23:10       ` Valdis.Kletnieks at vt.edu
2013-03-20 23:37         ` Raymond Jennings
2013-03-21  9:13           ` Valdis.Kletnieks at vt.edu
2013-03-21  9:37             ` Raymond Jennings
2013-03-22  3:52               ` Mulyadi Santosa
2013-03-22 20:50                 ` Raymond Jennings
2013-03-22 20:53                   ` Raymond Jennings
2013-03-22 21:20                     ` Valdis.Kletnieks at vt.edu
2013-03-23  0:05                       ` Raymond Jennings
2013-03-23 16:42                         ` Matthias Brugger
2013-03-25 19:15                           ` Raymond Jennings
2013-03-25 22:27                             ` Matthias Brugger
2013-03-25 22:29                               ` Raymond Jennings
2013-03-23  1:42                       ` Raymond Jennings
2013-03-23 14:38       ` Matthias Brugger
2013-03-25 18:19         ` Raymond Jennings
2013-03-20 23:05     ` Linux elevators (Re: BFQ: simple elevator) Arlie Stephens
2013-03-20 23:16       ` Valdis.Kletnieks at vt.edu
2013-03-20 23:45         ` Arlie Stephens
2013-03-21  1:00           ` Raymond Jennings
