From: Ingo Oeser <ioe-lkml@rameria.de>
To: Greg Stark <gsstark@mit.edu>
Cc: Greg Stark <gsstark@mit.edu>,
	Helge Hafting <helgehaf@aitel.hist.no>,
	Joel Becker <Joel.Becker@oracle.com>,
	Jamie Lokier <jamie@shareable.org>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Ulrich Drepper <drepper@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: Tue, 21 Oct 2003 13:47:07 +0200
Message-ID: <200310211347.07143.ioe-lkml@rameria.de>
In-Reply-To: <87smlt9t70.fsf@stark.dyndns.tv>

Hi Greg,

On Thursday 16 October 2003 16:02, Greg Stark wrote:
> Ingo Oeser <ioe-lkml@rameria.de> writes:
> > Hi there,
> >
> > first: I think the problem is solvable with mixing blocking and
> > non-blocking IO or simply AIO, which will be supported nicely by 2.6.0,
> > is a POSIX standard and is meant for doing your own IO scheduling.
>
> I think aio could be very useful for databases, but not in this area. 
[AIO for write barriers]
> But I don't see how it's useful for the problem I'm describing.

It can be, because this way you build something like a "user space
request queue" and can control its activity and saturation at as fine a
granularity as the syncing. You simply note whether a request is in
flight or not, and can estimate the current bandwidth that way.
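
A minimal sketch of such bookkeeping, not taken from any real code: the
struct io_queue, the ioq_* names and the fixed saturation limit are all
assumptions made for illustration. The application calls ioq_submitted()
when it hands a request to AIO (or a simulation of it) and
ioq_completed() when the request finishes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a user-space request queue. */
struct io_queue {
    size_t in_flight;     /* requests submitted but not yet completed */
    size_t max_in_flight; /* saturation limit chosen by the application */
    double bytes_done;    /* bytes completed in the current window */
    double window_start;  /* start of the window, in seconds */
};

/* Returns 1 if another request may be submitted, 0 if saturated. */
int ioq_can_submit(const struct io_queue *q)
{
    return q->in_flight < q->max_in_flight;
}

/* Record that one request has been handed to the kernel. */
void ioq_submitted(struct io_queue *q)
{
    assert(q->in_flight < q->max_in_flight);
    q->in_flight++;
}

/* Record that one request of the given size has completed. */
void ioq_completed(struct io_queue *q, size_t bytes)
{
    assert(q->in_flight > 0);
    q->in_flight--;
    q->bytes_done += (double)bytes;
}

/* Estimated bandwidth in bytes per second since window_start. */
double ioq_bandwidth(const struct io_queue *q, double now)
{
    double elapsed = now - q->window_start;
    return elapsed > 0.0 ? q->bytes_done / elapsed : 0.0;
}
```

The estimate only covers requests that went through this queue, so it
says nothing about IO issued by other processes.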

> Indeed we're discussing methods for doing that now. But this seems like an
> awkward way to accomplish what the kernel could do very precisely. I don't
> see why non-dedicated servers would make priorities any less useful, in
> fact I think that's exactly where they would shine.

The problem on the kernel side is that an IO operation is not associated
with any process, just with a physical page and a backing store. This is
especially true for reads. So in many cases userspace doesn't know
whether the kernel needs to do any IO at all to satisfy a request.
Direct IO helps here by making you ALWAYS do the IO, but it isn't that
nice for the kernel.

So if you say "This fd has an IO priority of 1 and that fd has one of 2"
for the same file, then what should the kernel do?

Or another scenario: you have chunk A and chunk B, both of 128k. Now
vacuum wants to read chunk B at low priority, and a transaction wants
to read the second page of chunk A and chunk B at high priority
(readv()).

Readahead for the second page of chunk A brings in the first page of
chunk B, which vacuum has been waiting for. Vacuum is woken and vacuums
until chunk C is needed, which causes IO again.

Now the transaction continues and can immediately read from the page
cache the page that vacuum left there.

This gets even more fun if vacuum works so fast per timeslice that it
pushes the cached pages out of memory ;-)

See how controlling submission from vacuum might be better than actions
taken by the kernel?

If you just prioritize work, then the low priority work accumulates and
takes up kernel memory. So it is better to stop submission.

> > > So if vacuum slept a bit, say every 64k of data vacuumed. It could end
> > > up sleeping when the disks are actually idle. Or it could be not
> > > sleeping enough and still be interfering with transactions.
> >
> > The vacuum io is submitted (via AIO or simulation of it) normally in a
> > unit U and waiting ALWAYS for U to complete, before submitting a new one.
> > Between submitting units, the vacuums checks for outstanding transactions
> > and stops, when we have one.
> >
> > Now a transaction is submitted and the submitting from vacuum is stopped
> > by it existing. The transaction waits for completion (e.g. 
> > aio_suspend()) and signals vacuum to continue.
>
> User-space has no idea if disk i/o is occurring. The data the transaction
> needs could be cached, or it could be on a different disk.

So how should it prioritize then, if it doesn't know which will preempt
which?

> Besides, I think this is far more coarse-grained than what's needed.
> Transactions sometimes run for seconds, minutes, or hours, some of that
> time is spent doing disk i/o and some of it doing cpu calculations. It
> can't stop and signal another process every time it finishes reading a
> block and needs to do a bit of calculation. Then context switch again a
> millisecond later so it can read the next block...

I don't want it to signal vacuum, I just want vacuum to check for the
existence of more important things to do. Like a "disk idle process".
This can be as simple as running vacuum at extremely low process
priority and having it read some atomically set variable that says
whether it can submit more now or not.
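
Roughly like the following sketch, which assumes C11 atomics and a
counter shared between the transaction code and vacuum (across
processes it would have to live in shared memory). All names here are
hypothetical, and count_unit() is only a stand-in for really submitting
one unit of IO and waiting for it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared counter: transactions that want the disk right
 * now. Static storage, so it starts out as zero. */
static atomic_int pending_transactions;

/* Called by a transaction before it starts doing IO. */
void transaction_begin_io(void)
{
    atomic_fetch_add(&pending_transactions, 1);
}

/* Called by a transaction once its IO has completed. */
void transaction_end_io(void)
{
    int prev = atomic_fetch_sub(&pending_transactions, 1);
    assert(prev > 0);
    (void)prev; /* keeps NDEBUG builds free of unused warnings */
}

/* Vacuum may submit only while no transaction wants the disk. */
int vacuum_may_submit(void)
{
    return atomic_load(&pending_transactions) == 0;
}

/* Stand-in for "submit one unit and wait for it": just counts. */
void count_unit(void *arg)
{
    ++*(size_t *)arg;
}

/* Submit up to nunits units, checking the variable before each one.
 * Returns how many units were submitted before backing off. */
size_t vacuum_run(size_t nunits, void (*submit_unit)(void *), void *arg)
{
    size_t i;

    for (i = 0; i < nunits; i++) {
        if (!vacuum_may_submit())
            break;
        submit_unit(arg);
    }
    return i;
}
```

There is no signalling and no extra context switch here: the
transaction only touches the counter, and vacuum notices it the next
time it is about to submit.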

I think you need to do something in your user space task like what the
kernel does for page writing (stepping by watermarks from none, through
async, to sync).
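
As a sketch of that stepping, with two thresholds on the amount of
outstanding work: the enum, the struct and the way the thresholds are
compared are assumptions for illustration, not how the kernel actually
implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical throttling levels, stepping from none to sync. */
enum throttle {
    THROTTLE_NONE,  /* below the low watermark: do nothing */
    THROTTLE_ASYNC, /* between the marks: work it off in the background */
    THROTTLE_SYNC   /* at or above the high mark: submitter must wait */
};

struct watermarks {
    size_t low;
    size_t high;
};

/* Map the amount of outstanding work to a throttling level. */
enum throttle throttle_for(const struct watermarks *wm, size_t outstanding)
{
    assert(wm->low <= wm->high);
    if (outstanding < wm->low)
        return THROTTLE_NONE;
    if (outstanding < wm->high)
        return THROTTLE_ASYNC;
    return THROTTLE_SYNC;
}
```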

PS: Sorry for the late answer, but I needed to think about this a bit
more.

If you could point me to the source files that actually trigger and do
the vacuum, I might gain more enlightenment ;-)

Regards

Ingo Oeser
