From: Greg Stark <gsstark@mit.edu>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Greg Stark <gsstark@mit.edu>,
	Joel Becker <Joel.Becker@oracle.com>,
	Jamie Lokier <jamie@shareable.org>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Ulrich Drepper <drepper@redhat.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: statfs() / statvfs() syscall ballsup...
Date: 12 Oct 2003 18:09:16 -0400
Message-ID: <878ynq3y7n.fsf@stark.dyndns.tv>
In-Reply-To: <Pine.LNX.4.44.0310120909050.12190-100000@home.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> > worse, there's no way to indicate that the i/o it's doing is lower priority,
> > so i/o bound servers get hit dramatically. 
> 
> IO priorities are pretty much worthless. It doesn't _matter_ if other 
> processes get preferred treatment - what is costly is the latency cost of 
> seeking. What you want is not priorities, but batching.

What you want depends very much on the circumstances. I'm sure in a lot of
cases batching helps, but in this case it's not the issue.

The vacuum job that runs periodically in fact is batched very well. In fact
that's the main reason it exists rather than having the cleanup handled in the
critical path in the transaction itself. 

I'm not aware of all the details but my understanding is that it reads every
block in the table sequentially, keeping note of all the records that are no
longer visible to any transaction. When it's finished reading it writes out a
"free space map" that subsequent transactions read and use to find available
space in the table.
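
To make the description above concrete, here is a minimal sketch of such a
pass. It is not PostgreSQL's actual code: the in-memory table model and the
names vacuum_scan and find_block_with_space are assumptions made purely for
illustration.

```python
from typing import Dict, List

# Hypothetical model: a table is a list of blocks, and each block is a list
# of records flagged True (still visible) or False (dead, reclaimable).
Table = List[List[bool]]


def vacuum_scan(table: Table) -> Dict[int, int]:
    """Read every block in order; return {block_number: reclaimable_slots}."""
    free_space_map: Dict[int, int] = {}
    for block_no, block in enumerate(table):  # strictly sequential read
        dead = sum(1 for visible in block if not visible)
        if dead:
            free_space_map[block_no] = dead
    return free_space_map


def find_block_with_space(free_space_map: Dict[int, int]) -> int:
    """Return the lowest-numbered block with reusable space, or -1 if none."""
    return min(free_space_map) if free_space_map else -1
```

A later transaction would consult the map instead of scanning the table
itself, which is what keeps the cleanup out of the critical path.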

The vacuum job makes very efficient use of disk i/o. In fact too efficient.
Frequently people have their disks running at 50-90% capacity simply handling
the random seeks to read data. Those seeks are already batched to the OS's
best ability. 

But then vacuum comes along and tries to read the entire table sequentially.
In the best case the sequential read will take up a lot of the available disk
bandwidth and delay transactions. In the worst case the OS will actually
prefer the sequential read because the elevator algorithm always sees that it
can get more bandwidth by handling it ahead of the random access.

In reality there is no time pressure on the vacuum at all. As long as it
completes faster than dead records can pile up it's fast enough. The
transactions on the other hand must complete as fast as possible.
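
As a rough illustration of "fast enough", the arithmetic can be sketched as
below. The function names, and the notion of a fixed deadline by which the
scan must finish, are assumptions for the example rather than anything the
database actually exposes.

```python
def min_scan_rate(table_blocks: int, deadline_seconds: float) -> float:
    """Slowest acceptable scan rate, in blocks per second."""
    if deadline_seconds <= 0:
        raise ValueError("deadline_seconds must be positive")
    return table_blocks / deadline_seconds


def pause_per_block(table_blocks: int, deadline_seconds: float,
                    read_seconds_per_block: float) -> float:
    """Idle time the scan could insert after each block and still finish."""
    if table_blocks <= 0:
        raise ValueError("table_blocks must be positive")
    budget = deadline_seconds / table_blocks  # time allowed per block
    return max(0.0, budget - read_seconds_per_block)
```

Any pause above zero is disk time the vacuum could give back to the
transactions without falling behind.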

Certainly batching is useful and in many cases is more important than
prioritizing, but in this case it's not the whole answer.
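
The kind of prioritizing meant here can be sketched as a toy dispatch queue.
This is only an illustration of the idea, not an existing kernel interface:
the class name and the two priority levels are hypothetical.

```python
import heapq
from itertools import count
from typing import List, Tuple

FOREGROUND = 0  # latency-sensitive transaction i/o
BACKGROUND = 1  # vacuum-style i/o with no time pressure


class PriorityDispatchQueue:
    """Serve lower priority numbers first; FIFO within one priority level."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._order = count()  # tie-breaker that preserves submission order

    def submit(self, priority: int, request: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def dispatch(self) -> str:
        """Pop and return the next request to issue."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

Strict ordering like this can starve the background work entirely, so a real
scheduler would need some guarantee of eventual progress for it.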

I'll mention this thread on the postgresql-hackers list, perhaps some of the
more knowledgeable programmers there will have thought about these issues and
will be able to post their wishlist ideas for kernel APIs.

I can see why back in the day Oracle preferred to simply tell all the OS
vendors, "just give us direct control over disk accesses, we'll figure it out"
rather than have to really hash out all the details of their low level needs
with every OS vendor. But between being able to prioritize I/O resources and
cache resources, and being able to sync IDE disks properly and cleanly (that
other thread) Linux may be able to drastically improve the kernel interface for
databases.

-- 
greg