From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754126AbYAWPsi (ORCPT );
	Wed, 23 Jan 2008 10:48:38 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1753601AbYAWPs0 (ORCPT );
	Wed, 23 Jan 2008 10:48:26 -0500
Received: from styx.suse.cz ([82.119.242.94]:37939 "EHLO duck.suse.cz"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1753436AbYAWPsU (ORCPT );
	Wed, 23 Jan 2008 10:48:20 -0500
Date: Wed, 23 Jan 2008 16:48:18 +0100
From: Jan Kara
To: Nick Piggin
Cc: linux-kernel@vger.kernel.org, Andrew Morton
Subject: Re: [PATCH RESEND] Minimal fix for private_list handling races
Message-ID: <20080123154818.GD10144@duck.suse.cz>
References: <20080122171025.GB4485@duck.suse.cz>
	<200801231200.03016.nickpiggin@yahoo.com.au>
	<20080123133045.GC10144@duck.suse.cz>
	<200801240205.17005.nickpiggin@yahoo.com.au>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200801240205.17005.nickpiggin@yahoo.com.au>
User-Agent: Mutt/1.5.16 (2007-06-09)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 24-01-08 02:05:16, Nick Piggin wrote:
> On Thursday 24 January 2008 00:30, Jan Kara wrote:
> > On Wed 23-01-08 12:00:02, Nick Piggin wrote:
> > > On Wednesday 23 January 2008 04:10, Jan Kara wrote:
> > > >   Hi,
> > > >
> > > >   as I got no answer for a week, I'm resending this fix for races in
> > > > private_list handling. Andrew, do you like them more than the previous
> > > > version?
> > >
> > > FWIW, I reviewed this, and it looks OK although I think some comments
> > > would be in order.
> >   Thanks.
> >
> > > What would be really nice is to avoid the use of b_assoc_buffers
> > > completely in this function like I've attempted (untested). I don't
> > > know if you'd actually call that an improvement...?
> >   I thought about this solution as well. But main issue I had with this
> > solution is that currently, you nicely submit all the metadata buffers at
> > once, so that block layer can sort them and write them in nice order. With
> > the array you submit buffers by 16 (or any other fixed amount) and in
> > mostly random order... So IMHO fsync would become measurably slower.
>
> Oh, I don't know the filesystems very well... which ones would
> attach a large number of metadata buffers to the inode?
  This logic is actually used only by a few filesystems - ext2 and UDF are
probably the most common ones. For example for ext2, the indirect blocks
are on the list if the file is freshly written, so that is roughly around
1MB of metadata per 1GB of data (for 4KB blocks, with 1KB blocks it is 4MB
per 1GB). Because seeks are expensive, you could really end up with the
write being 16 times slower when you do it in 16 passes instead of one...
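  To put rough numbers behind the figures above, here is a small
back-of-the-envelope calculation, assuming classic ext2 indirect blocks
filled with 4-byte block pointers and ignoring the direct and
double/triple indirect blocks (illustration only, not code from the
patch):

#include <stdio.h>

/*
 * Each indirect block of size 'blocksize' holds blocksize/4 four-byte
 * block pointers, so it maps (blocksize/4) * blocksize bytes of data.
 * Returns roughly how many bytes of indirect blocks a freshly written
 * 1GB file needs.
 */
static unsigned long indirect_bytes_per_gb(unsigned long blocksize)
{
	unsigned long ptrs_per_block = blocksize / 4;
	unsigned long data_per_indirect = ptrs_per_block * blocksize;
	unsigned long gb = 1024UL * 1024 * 1024;

	return (gb / data_per_indirect) * blocksize;
}

int main(void)
{
	/* 4KB blocks: 256 indirect blocks of 4KB each -> ~1MB per 1GB */
	printf("4KB blocks: %lu KB of indirect blocks per 1GB\n",
	       indirect_bytes_per_gb(4096) >> 10);
	/* 1KB blocks: 4096 indirect blocks of 1KB each -> ~4MB per 1GB */
	printf("1KB blocks: %lu KB of indirect blocks per 1GB\n",
	       indirect_bytes_per_gb(1024) >> 10);
	return 0;
}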
> > > Couple of things I noticed while looking at this code.
> > >
> > > - What is osync_buffers_list supposed to do? I couldn't actually
> > >   work it out. Why do we care about waiting for these buffers on
> > >   here that were added while waiting for writeout of other buffers
> > >   to finish? Can we just remove it completely? I must be missing
> > >   something.
> >   The problem here is that mark_buffer_dirty_inode() can move the buffer
> > from 'tmp' list back to private_list while we are waiting for another
> > buffer...
>
> Hmm, no not while we're waiting for another buffer because b_assoc_buffers
> will not be empty. However it is possible between removing from the inode
> list and insertion onto the temp list I think, because
>
>                 if (list_empty(&bh->b_assoc_buffers)) {
>
> check in mark_buffer_dirty_inode is done without private_lock held. Nice.
  Oh yes, that was it.

> With that in mind, doesn't your first patch suffer from a race due to
> exactly this unlocked list_empty check when you are removing clean buffers
> from the queue?
>
>   if (!buffer_dirty(bh) && !buffer_locked(bh))
>                                    mark_buffer_dirty()
>                                    if (list_empty(&bh->b_assoc_buffers))
>                                            /* add */
>   __remove_assoc_queue(bh);
>
> Which results in the buffer being dirty but not on the ->private_list,
> doesn't it?
  Hmm, I'm not sure which patch you mean. The logic removing clean buffers
has been there since the first version (but there mark_buffer_dirty_inode()
was written differently). In the current version, we re-add the buffer to
private_list if it is found dirty in the second while loop of
fsync_buffers_list(), and that should be enough.

> But let's see... there must be a memory ordering problem here in existing
> code anyway, because I don't see any barriers. Between b_assoc_buffers and
> b_state (via buffer_dirty); fsync_buffers_list vs mark_buffer_dirty_inode,
> right?
  I'm not sure. What exactly do you mean? BTW: spin_lock is a memory
barrier, isn't it?

								Honza
--
Jan Kara
SUSE Labs, CR
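
For reference, below is a heavily simplified sketch of the
mark_buffer_dirty_inode() path the thread is discussing, written from a
2.6-era fs/buffer.c point of view; the assoc_mapping bookkeeping is
omitted and the comments are editorial, so treat it as an illustration of
the unlocked check rather than the actual kernel code:

/*
 * Sketch only: roughly what mark_buffer_dirty_inode() does, trimmed
 * down to the part relevant to the race discussed above.
 */
void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;
	struct address_space *buffer_mapping = bh->b_page->mapping;

	mark_buffer_dirty(bh);
	/*
	 * Unlocked test: if the buffer already appears to be on some
	 * association list, skip the (re)attach and avoid taking
	 * private_lock at all.
	 */
	if (list_empty(&bh->b_assoc_buffers)) {
		spin_lock(&buffer_mapping->private_lock);
		list_move_tail(&bh->b_assoc_buffers,
			       &mapping->private_list);
		spin_unlock(&buffer_mapping->private_lock);
	}
	/*
	 * Race window: fsync_buffers_list(), under private_lock, may
	 * concurrently be moving bh between private_list and its local
	 * 'tmp' list, or dropping a clean bh via __remove_assoc_queue().
	 * The unlocked list_empty() above can then see a non-empty list,
	 * skip the re-attach, and leave a dirty buffer off private_list.
	 */
}

The unlocked list_empty() exists so the common case - redirtying a buffer
that is already attached - never touches private_lock, which is exactly
why it can run against fsync_buffers_list() shuffling the same buffer, as
in the interleaving quoted above.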