From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262979AbTJPOCc (ORCPT ); Thu, 16 Oct 2003 10:02:32 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262987AbTJPOCc (ORCPT ); Thu, 16 Oct 2003 10:02:32 -0400 Received: from dci.doncaster.on.ca ([66.11.168.194]:62420 "EHLO smtp.istop.com") by vger.kernel.org with ESMTP id S262979AbTJPOC3 (ORCPT ); Thu, 16 Oct 2003 10:02:29 -0400 To: Ingo Oeser Cc: Greg Stark , Helge Hafting , Joel Becker , Jamie Lokier , Trond Myklebust , Ulrich Drepper , Linux Kernel Subject: Re: statfs() / statvfs() syscall ballsup... References: <200310151525.40470.ioe-lkml@rameria.de> <87llrmbl1g.fsf@stark.dyndns.tv> <200310161229.44861.ioe-lkml@rameria.de> In-Reply-To: <200310161229.44861.ioe-lkml@rameria.de> From: Greg Stark Organization: The Emacs Conspiracy; member since 1992 Date: 16 Oct 2003 10:02:27 -0400 Message-ID: <87smlt9t70.fsf@stark.dyndns.tv> User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Ingo Oeser writes: > Hi there, > > first: I think the problem is solvable with mixing blocking and > non-blocking IO or simply AIO, which will be supported nicely by 2.6.0, > is a POSIX standard and is meant for doing your own IO scheduling. I think aio could be very useful for databases, but not in this area. I think it's useful as a more fine-grained tool than sync/fsync. Currently the database has to fsync a file to commit a transaction, which means flushing _all_writes to the file even ones from other transactions. If aio inserted write barriers to the disk controller then it would provide a way to ensure the current transaction is synced without having to flush all other transactions writes at the same time. But I don't see how it's useful for the problem I'm describing. > On Wednesday 15 October 2003 17:03, Greg Stark wrote: > > Ingo Oeser writes: > > > On Monday 13 October 2003 10:45, Helge Hafting wrote: > > > > This is easier than trying to tell the kernel that the job is > > > > less important, that goes wrong wether the job runs too much > > > > or too little. Let that job sleep a little when its services > > > > aren't needed, or when you need the disk bandwith elsewhere. > > > > Actually I think that's exactly backwards. The problem is that if the > > user-space tries to throttle the process it doesn't know how much or when. > > The kernel knows exactly when there are other higher priority writes, it > > can schedule just enough writes from vacuum to not interfere. > > On dedicated servers this might be true. But on these you could also > solve it in user space by measuring disk bandwidth and issueing just > enough IO to keep up roughly with it. Indeed we're discussing methods for doing that now. But this seems like a awkward way to accomplish what the kernel could do very precisely. I don't see why non-dedicated servers would be make priorities any less useful, in fact I think that's exactly where they would shine. > > So if vacuum slept a bit, say every 64k of data vacuumed. It could end up > > sleeping when the disks are actually idle. Or it could be not sleeping > > enough and still be interfering with transactions. > > The vacuum io is submitted (via AIO or simulation of it) normally in a > unit U and waiting ALWAYS for U to complete, before submitting a new one. > Between submitting units, the vacuums checks for outstanding transactions > and stops, when we have one. > > Now a transaction is submitted and the submitting from vacuum is stopped > by it existing. The transaction waits for completion (e.g. aio_suspend()) > and signals vacuum to continue. User-space has no idea if disk i/o is occurring. The data the transaction needs could be cached, or it could be on a different disk. Besides, I think this is far too coarse-grained than what's needed. Transactions sometimes run for seconds, minutes, or hours,, some of that time is spent doing disk i/o and some of it doing cpu calculations. It can't stop and signal another process every time it finishes reading a block and needs to do a bit of calculation. Then context switch again a millisecond later so it can read the next block... And besides, this is would only useful on dedicated servers. -- greg