From: Daniel Phillips <phillips@bonn-fries.net>
To: Marcelo Tosatti <marcelo@conectiva.com.br>
Cc: Mike Galbraith <mikeg@wen-online.de>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: Linux 2.4.5-ac15
Date: Fri, 22 Jun 2001 02:32:00 +0200
Message-ID: <01062202320001.00455@starship>
In-Reply-To: <Pine.LNX.4.21.0106211649260.788-100000@freak.distro.conectiva>

On Thursday 21 June 2001 21:50, Marcelo Tosatti wrote:
> On Thu, 21 Jun 2001, Daniel Phillips wrote:
> > On Thursday 21 June 2001 07:44, Marcelo Tosatti wrote:
> > > On Thu, 21 Jun 2001, Mike Galbraith wrote:
> > > > On Thu, 21 Jun 2001, Marcelo Tosatti wrote:
> > > > > Ok, I suspect that GFP_BUFFER allocations are fucking up here (they
> > > > > can't block on IO, so they loop insanely).
> > > >
> > > > Why doesn't the VM hang the syncing of queued IO on these guys via
> > > > wait_event or such instead of trying to just let the allocation fail?
> > >
> > > Actually the VM should limit the amount of data being queued for _all_
> > > kind of allocations.
> > >
> > > The problem is the lack of a mechanism which allows us to account the
> > > approximated amount of queued IO by the VM. (except for swap pages)
> >
> > Coincidence - that's what I started working on two days ago, and I'm
> > moving into the second generation design today.  Look at
> > 'queued_sectors'.  I found pretty quickly it's not enough, today I'm
> > adding 'submitted_sectors' to the soup.  This will allow me to
> > distinguish between traffic generated by my own thread and other traffic.
>
> Could you expand on this, please ?

OK, I am doing opportunistic flushing, so I want to know that nobody else is 
using the disk, and so long as that's true, I'll keep flushing out buffers.  
Conversely, if anybody else queues a request I'll bail out of the flush loop 
as soon as I've flushed the absolute minimum number of buffers, i.e., the 
ones that were dirtied more than bdflush_params->age_buffer ago.  But how do 
I know if somebody else is submitting requests?  The surest way to know is to 
have a submitted_sectors counter that just counts every submission, and 
compare that to the number of sectors I know I've submitted.  (This counter 
wraps, so I actually track the difference from its value on entering the 
flush loop.)

The first thing I found (duh) is that nobody else ever submits anything while 
I'm in the flush loop, because I'm on UP and I (almost) never yield the CPU.  
On SMP I will get other threads submitting, but only rarely will the 
submission happen while I'm in the flush loop.  No good - I'm not detecting 
the other disk activity reliably - back to the drawing board.

My original plan was to compute a running average of submission rates and use 
that to control my opportunistic flushing.  I departed from that because I 
seemed to get good results with a much simpler strategy, the patch I already 
posted.  It's fundamentally flawed though - it works fine for constant light 
load and constant full load, but not for sporadic loads.  What I need is 
something a lot smoother, more analog, so I'll return to my original plan.

What I want to notice is that the IO submission rate has fallen below a 
certain level; then, when the IO backlog has also fallen below a few ms' 
worth of transfers, I can do the opportunistic flushing.  In the flush loop 
I want
to submit enough buffers to make sure I'm using the full bandwidth, but not 
so many that I create a big backlog that gets in the way of a surge in demand 
from some other source.  I'm still working out the details of that, so I 
will not post an updated patch today after all ;-)

By the way, there's a really important throughput benefit for doing this 
early flushing that I didn't put in the list when I first wrote about it.  
It's this: whenever we have a bunch of buffers dirtied, if the disk bandwidth 
is available we want to load up the disk right away, not 5 seconds from now.  
If we wait 5 seconds, we just wasted 5 seconds of disk bandwidth.  Again, 
duh.  So my goal in doing this was initially to have it cost as little in 
throughput as possible - I see now that it's actually a win for throughput.  
End of discussion about whether to put in the effort or not.

--
Daniel

