References: <20090324093245.GA22483@elte.hu> <20090325185824.GO32307@mit.edu>
    <20090325194851.GA1617@infradead.org> <20090325215016.GP32307@mit.edu>
    <20090326021034.GA26559@srcf.ucam.org> <49CAEDA7.1080902@garzik.org>
Date: Thu, 26 Mar 2009 00:58:28 -0400
Subject: Re: Linux 2.6.29
From: Kyle Moffett
To: Linus Torvalds
Cc: Jeff Garzik, Matthew Garrett, Theodore Tso, Christoph Hellwig,
    Jan Kara, Andrew Morton, Ingo Molnar, Alan Cox, Arjan van de Ven,
    Peter Zijlstra, Nick Piggin, Jens Axboe, David Rees, Jesper Krogh,
    Linux Kernel Mailing List
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 25, 2009 at 11:40 PM, Linus Torvalds wrote:
> On Wed, 25 Mar 2009, Kyle Moffett wrote:
>> To be honest I think we could provide much better data consistency
>> guarantees and remove a lot of fsync() calls with just a basic
>> per-filesystem barrier() call.
>
> The problem is not that we have a lot of fsync() calls. Quite the reverse.
> fsync() is really really rare. So is being careful in general. The number
> of applications that do even the _minimal_ safety-net of "create new file,
> rename it atomically over an old one" is basically zero.
> Almost everybody ends up rewriting files with something like
>
>        open(name, O_CREAT | O_TRUNC, 0666)
>        write();
>        close();
>
> where there isn't an fsync in sight, nor any "create temp file", nor
> likely even any real error checking on the write(), much less the
> close().

Really, I think virtually all of the database programs would be
perfectly happy with an "fsbarrier(fd, flags)" syscall, where if "fd"
points to a regular file or directory then it instructs the underlying
filesystem to do whatever internal barrier it supports, and if not just
fail with -ENOTSUPP (so you can fall back to fdatasync(), etc). Perhaps
"flags" would allow a "data" or "metadata" barrier, but if not it's not
a big issue.

I've ended up having to write a fair amount of high-performance
filesystem library code which almost never ends up using fsync(), quite
simply because the performance on it sucks so badly. This is one of the
big reasons why so many critical database programs use O_DIRECT and
reinvent the wheel^H^H^H^H^H pagecache. The only way you can actually
use fsync() in high-bandwidth transaction applications is by doing your
own IO-thread and buffering system. You have to have your own buffer
ordering dependencies and call fdatasync() or fsync() from individual
threads in-between specific ordered IOs. The threading helps you keep
other IO in flight while waiting for the flush to finish.

For big databases on spinning media (SSDs don't work precisely because
they are small and your databases are big) the overhead of a full flush
may still be too large. Even with SSDs, with multiple processes vying
for IO bandwidth you still want some kind of application-level barrier
to avoid introducing bubbles in your IO pipeline.

It all comes down to a trivial calculation: if you can't get
(bandwidth * latency-to-stable-storage) bytes of data queued *behind* a
flush then your disk is going to sit idle waiting for more data after
completing it.
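
As a rough illustration of that calculation, the sketch below computes
how many bytes would need to be queued behind a flush to keep the device
busy while the flush completes. The function name and its units are
assumptions made for this example, not anything from the kernel or from
this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (name invented for this sketch): the number of
 * bytes that must already be queued behind a flush so the device still
 * has work to do for as long as the flush takes.
 *
 * bandwidth_bytes_per_sec: sustained device bandwidth
 * flush_latency_usec:      latency to stable storage, in microseconds
 */
uint64_t min_bytes_behind_flush(uint64_t bandwidth_bytes_per_sec,
                                uint64_t flush_latency_usec)
{
    /* Guard against overflow in the 64-bit product below. */
    assert(bandwidth_bytes_per_sec == 0 ||
           flush_latency_usec <= UINT64_MAX / bandwidth_bytes_per_sec);

    /* bandwidth * latency, with the latency converted from usec to sec. */
    return bandwidth_bytes_per_sec * flush_latency_usec / 1000000u;
}
```

With assumed figures of 100,000,000 bytes/s and a 10,000 usec (10 ms)
flush, min_bytes_behind_flush() returns 1,000,000: roughly 1 MB has to
be waiting in the queue or the disk goes idle. Those figures are purely
illustrative, not measurements.
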
If a user-level tool needs to enforce ordering between IOs, the only
tool right now is a full flush; when database-oriented tools can use a
barrier()-ish call instead, they can issue the op and immediately
resume keeping the IO queues full.

Cheers,
Kyle Moffett
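
The "try a barrier, fall back to fdatasync()" pattern proposed in this
message might look roughly like the sketch below. Note that fsbarrier()
is only a proposal here and is not an existing Linux syscall:
hypothetical_fsbarrier() is a stand-in stub that always fails, and
order_writes() and FSBARRIER_DATA are names invented for this example.
The stub reports the userspace errno ENOTSUP, which is an assumption
about how the proposed -ENOTSUPP failure would surface to applications.

```c
#define _POSIX_C_SOURCE 200809L  /* for fdatasync() under strict modes */
#include <assert.h>
#include <errno.h>
#include <unistd.h>

#define FSBARRIER_DATA 0x1  /* invented flag: order file data only */

/*
 * Stand-in for the proposed fsbarrier(fd, flags). No such syscall
 * exists, so this stub always reports "not supported".
 */
int hypothetical_fsbarrier(int fd, int flags)
{
    (void)fd;
    (void)flags;
    errno = ENOTSUP;
    return -1;
}

/*
 * Enforce ordering between writes issued before and after this call.
 * Prefer the cheap barrier; if it is unsupported (always the case with
 * the stub above), fall back to a full fdatasync().
 * Returns 0 on success, -1 with errno set on failure.
 */
int order_writes(int fd)
{
    assert(fd >= 0);

    if (hypothetical_fsbarrier(fd, FSBARRIER_DATA) == 0)
        return 0;            /* barrier queued, no need to wait */
    if (errno != ENOTSUP)
        return -1;           /* a real error: report it to the caller */
    return fdatasync(fd);    /* fallback: the expensive full flush */
}
```

A caller would issue its first batch of write() calls, call
order_writes(fd), and then issue the dependent writes. With a real
barrier the call could return immediately and keep the queue full; with
the fallback it blocks for the duration of the flush, which is the cost
the message is arguing against.
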