Date: Mon, 15 Jul 2002 21:02:11 -0400
From: Lawrence Greenfield
To: "Patrick J. LoPresti"
Cc: linux-kernel@vger.kernel.org
Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks

   From: "Patrick J. LoPresti"
   Date: 15 Jul 2002 17:31:07 -0400
   [...]
   I really wish MTA authors would just support Linux's "fsync the directory" approach.  It is simple, reliable, and fast.  Yes, it does require Linux-specific support in the application, but that's what application authors should expect when there is a gap in the standards.

Actually, it's not all that simple (you have to find the enclosing directories of any files you're modifying, which might require string manipulation) or necessarily all that fast (you're doubling the number of system calls and now the application is imposing an ordering on the filesystem that didn't exist before).

It's only necessary for ext2.  Modern Linux filesystems (such as ext3 or reiserfs) don't require it.

Finally: ext2 isn't safe even if you do call fsync() on the directory!

Let's consider: some filesystem operation modifies two different blocks.  This operation is safe if block A is written before block B.

. FFS guarantees this by performing the writes synchronously: block A is written when it is changed, followed by block B when it is changed.

. Journalling filesystems (ext3, reiserfs) guarantee this by journalling the operation and forcing that journal entry to disk before either A or B can be modified.

. What does ext2 do (in the default mode)?  It modifies A, it modifies B, and then leaves it up to the buffer cache to write them back, and the buffer cache might decide to write B before A.

We're finally getting to some decent shared semantics on filesystems.  Reiserfs, ext3, FFS w/ softupdates, vxfs, etc., all work with just fsync()ing the file (though an fsync() is required after a link() or rename() operation).

Let's encourage all filesystems to provide these semantics and make it slightly easier on us stupid application programmers.

Larry
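
For concreteness, here is a minimal sketch of the two recipes discussed above, in Python for brevity.  It assumes a POSIX system where a directory can be opened read-only and passed to fsync() (true on Linux).  The function names and the ".tmp" suffix are made up for illustration and are not taken from any MTA.

```python
import os


def fsync_parent_dir(path):
    """Linux-specific step: fsync() the directory that contains `path`."""
    # Finding the enclosing directory is the string manipulation
    # mentioned in the message.
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # extra open/fsync/close for every file touched
    finally:
        os.close(dir_fd)


def durable_replace(path, data, sync_dir=False):
    """Write `data` (bytes) to `path` via a temporary file and rename()."""
    tmp = path + ".tmp"  # hypothetical naming scheme, for illustration only
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:  # os.write() may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)          # force the file contents to disk
        os.rename(tmp, path)  # give the file its final name
        os.fsync(fd)          # the fsync() "required after ... rename()"
    finally:
        os.close(fd)
    if sync_dir:
        fsync_parent_dir(path)  # the "fsync the directory" approach
```

With sync_dir=False this is the "just fsync() the file" recipe that the message says works on Reiserfs, ext3, FFS w/ softupdates, and vxfs.  With sync_dir=True it adds the Linux-specific directory fsync(), which costs an extra open/fsync/close per file and, per the argument above, still does not make default-mode ext2 safe.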