From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752631AbXBJBuX (ORCPT ); Fri, 9 Feb 2007 20:50:23 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752635AbXBJBuX (ORCPT ); Fri, 9 Feb 2007 20:50:23 -0500 Received: from smtp.osdl.org ([65.172.181.24]:59756 "EHLO smtp.osdl.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752623AbXBJBuW (ORCPT ); Fri, 9 Feb 2007 20:50:22 -0500 Date: Fri, 9 Feb 2007 17:50:17 -0800 From: Andrew Morton To: Nick Piggin Cc: Linux Filesystems , Linux Kernel Subject: Re: [patch 3/3] ext2: use perform_write aop Message-Id: <20070209175017.c8a8c500.akpm@linux-foundation.org> In-Reply-To: <20070210013407.GB14349@wotan.suse.de> References: <20070208105437.26443.35653.sendpatchset@linux.site> <20070208105508.26443.7806.sendpatchset@linux.site> <20070209111455.67a69783.akpm@linux-foundation.org> <20070209114539.8bd15ee0.akpm@linux-foundation.org> <20070210013407.GB14349@wotan.suse.de> X-Mailer: Sylpheed version 2.2.7 (GTK+ 2.8.6; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 10 Feb 2007 02:34:07 +0100 Nick Piggin wrote: > On Fri, Feb 09, 2007 at 11:45:39AM -0800, Andrew Morton wrote: > > On Fri, 9 Feb 2007 11:14:55 -0800 Andrew Morton wrote: > > > > > If so, that might be preventable by leaving the buffer nonuptodate. > > > > oh, OK, it was buffer_new(), so zeroes are the right thing for a reader to > > see. > > > > But if it wasn't buffer_new() then the appropriate thing for the reader to > > see is what's on the disk. But __block_prepare_write() won't read a buffer > > which is fully-inside the write area from disk. > > > > And that's seemingly OK, because if a reader gets in there after the short > > copy, that reader will see the non-uptodate buffer and will populate it > > from disk. > > > > But doing that will overwrite the data which the write() caller managed to > > copy into the page before it took a fault. And that's not OK because > > block_perform_write() does iovec_iterator_advance(i, copied) in this case > > and hence will not rerun the copy after acquiring the page lock? > > Hmm, yeah. This can be handled by not advancing partially into a !uptodate > buffer. Think so, yeah. Overall, the implementation you have there seems reasonable to me. Basically it's passing the responsibility for preventing the deadlock and the exposure-of-zeroes problem down into the filesystem itself, where we have visibility of the state of the various subsections of the page and can take appropriate actions in response to that. It's got conceptually harder to follow as a result, which is a shame. But still no magic bullet is on offer. I pity the poor schmuck who has to write ext3_journalled_perform_write(), ext3_ordered_perform_write(), ext3_writeback_perform_write(), ext3_writeback_nobh_perform_write() and all that other stuff. But I think we need to do that pretty soon to validate the whole approach. Also xfs and reiser3. NTFS will be interesting from the can-this-be-made-to-work POV. Is NFS vulnerable to the deadlock? It looks to be. Shudder. We'd need to find a way of communicating all this to the poor old fs maintainers.