From: David Chinner <dgc@sgi.com>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: dgc@sgi.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [patch 3/8] per backing_dev dirty and writeback page accounting
Date: Tue, 13 Mar 2007 08:44:05 +1100	[thread overview]
Message-ID: <20070312214405.GQ6095633@melbourne.sgi.com> (raw)
In-Reply-To: <E1HQitT-0002wP-00@dorka.pomaz.szeredi.hu>

On Mon, Mar 12, 2007 at 12:40:47PM +0100, Miklos Szeredi wrote:
> > > I have no idea how serious the scalability problems with this are.  If
> > > they are serious, different solutions can probably be found for the
> > > above, but this is certainly the simplest.
> > 
> > Atomic operations to a single per-backing device from all CPUs at once?
> > That's a pretty serious scalability issue and it will cause a major
> > performance regression for XFS.
> 
> OK.  How about just accounting writeback pages?  That should be much
> less of a problem, since normally writeback is started from
> pdflush/kupdate in large batches without any concurrency.

Except when you are throttling - then you bounce the cacheline
between CPUs as each one triggers foreground writeback.....

> Or is it possible to export the state of the device queue to mm?
> E.g. could balance_dirty_pages() query the backing dev if there are
> any outstanding write requests?

Not directly - writeback_in_progress(bdi) is a coarse measure
indicating pdflush is active on this bdi (which implies outstanding
write requests).

> > I'd call this a showstopper right now - maybe you need to look at
> > something like the ZVC code that Christoph Lameter wrote, perhaps?
> 
> That's rather a heavyweight approach for this I think.

But if you want to use per-page accounting, you are going to
need a per-cpu or per-zone set of counters on each bdi to do
this without introducing regressions.

> The only info balance_dirty_pages() really needs is whether there are
> any dirty+writeback bound for the backing dev or not.

writeback bound (i.e. writing as fast as we can) is probably
indicated fairly reliably by bdi_congested(bdi).

Now all you need is the number of dirty pages....

> It knows about the dirty pages, since it calls writeback_inodes() which
> scans the dirty pages for this backing dev looking for ones to write
> out.

It scans the dirty inode list for dirty inodes which indirectly finds
the dirty pages. It does not know about the number of dirty pages
directly...

> If after returning from writeback_inodes() wbc->nr_to_write
> didn't decrease and wbc->pages_skipped is zero then we know that there
> are no more dirty pages for the device.  Or at least there are no
> dirty pages which aren't already under writeback.

Sure, you can tell if there are _no_ dirty pages on the bdi, but
if there are dirty pages, you can't tell how many there are. Your
followup patches need to know how many dirty+writeback pages there
are on the bdi, so I don't really see any way you can solve the
deadlock in this manner without scalable bdi->nr_dirty accounting.

----

IIUC, your problem is that there's another bdi that holds all the
dirty pages, and this throttle loop never flushes pages from that
other bdi and we sleep instead. It seems to me that the fundamental
problem is that to clean the pages we need to flush both bdi's, not
just the bdi we are directly dirtying.

How about a "dependent bdi" link? i.e. if you have a loopback
filesystem, it has a direct bdi (the loopback device) and a
dependent bdi - the bdi that belongs to the underlying filesystem.

When we enter the throttle loop we flush from the direct bdi
and if we fail to flush all the pages we require, we flush
the dependent bdi (maybe even just kick pdflush for that bdi)
before we call congestion_wait() and go to sleep. This way
we are always making progress cleaning pages on the machine,
not just transferring dirty pages from one bdi to another.

Wouldn't that solve the deadlock without needing painful
accounting?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
