Subject: Re: [PATCH 2.6.2-rc3-mm1] DIO read race fix
From: Daniel McNeil
To: Andrew Morton
Cc: janetmor@us.ibm.com, Badari Pulavarty, "linux-aio@kvack.org",
	Linux Kernel Mailing List, Suparna Bhattacharya
In-Reply-To: <20040204213336.354d8103.akpm@osdl.org>
References: <3FCD4B66.8090905@us.ibm.com>
	<1070674185.1929.9.camel@ibm-c.pdx.osdl.net>
	<1070907814.707.2.camel@ibm-c.pdx.osdl.net>
	<1071190292.1937.13.camel@ibm-c.pdx.osdl.net>
	<20031230045334.GA3484@in.ibm.com>
	<1072830557.712.49.camel@ibm-c.pdx.osdl.net>
	<20031231060956.GB3285@in.ibm.com>
	<1073606144.1831.9.camel@ibm-c.pdx.osdl.net>
	<20040109035510.GA3279@in.ibm.com>
	<1075945198.7182.46.camel@ibm-c.pdx.osdl.net>
	<20040204213336.354d8103.akpm@osdl.org>
Message-Id: <1076003576.7182.77.camel@ibm-c.pdx.osdl.net>
Date: 05 Feb 2004 09:52:56 -0800

Andrew,

I am still thinking about your patch.  I will run some tests today
using 2.6.2-mm1 to see if the problem is fixed.

My 8-proc machine ran overnight with 6 copies of the read_under test
running without problems with my original patch.  Previously on the
8-proc machine, it would hit uninitialized data within an hour.

The concern I have is that DIO needs filemap_write_and_wait() to make
sure all previously dirty pages have been written back to disk before
the DIO is issued.  If __block_write_full_page() can possibly clear
PageWriteback with buffer i/o still in flight (even for WB_SYNC_NONE),
then a subsequent filemap_write_and_wait() will miss that page.

For example, I previously tried:

	do {
		get_bh(bh);
+		if (wbc->sync_mode != WB_SYNC_NONE)
+			wait_on_buffer(bh);
		if (buffer_mapped(bh) && buffer_dirty(bh)) {
			if (wbc->sync_mode != WB_SYNC_NONE) {
				lock_buffer(bh);

and this still saw uninitialized data.

Also, if __block_write_full_page() can redirty a page, wouldn't this
allow filemap_write_and_wait() to return with a page still dirty that
DIO needs written back?
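The ordering DIO relies on here is roughly the sketch below.  This is
an illustration only: flush_before_dio() is a made-up name, not a real
kernel function, but the filemap_fdatawrite()/filemap_fdatawait() pair
is essentially what filemap_write_and_wait() amounts to before the
direct read is issued.

#include <linux/fs.h>	/* struct address_space, filemap_*() */

/* Illustrative sketch of the flush-before-DIO step, not kernel code. */
static int flush_before_dio(struct address_space *mapping)
{
	int err = 0;

	if (mapping->nrpages) {
		/* push dirty page-cache pages to disk ... */
		err = filemap_fdatawrite(mapping);
		/* ... and wait for pages marked PageWriteback */
		if (err == 0)
			err = filemap_fdatawait(mapping);
	}
	/*
	 * If __block_write_full_page() has already called
	 * end_page_writeback() while ll_rw_block()'s buffer i/o is still
	 * in flight, filemap_fdatawait() has nothing to wait on, and the
	 * direct read issued after this can be reordered ahead of that
	 * write and see stale blocks.
	 */
	return err;
}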
I'll work on updating the other patches.

Daniel

On Wed, 2004-02-04 at 21:33, Andrew Morton wrote:
> Daniel McNeil wrote:
> >
> > I have found (finally) the problem causing DIO reads racing with
> > buffered writes to see uninitialized data on ext3 file systems
> > (which is what I have been testing on).
> >
> > The problem is caused by the changes to __block_write_full_page()
> > and a race with journaling:
> >
> > journal_commit_transaction() -> ll_rw_block() -> submit_bh()
> >
> > ll_rw_block() locks the buffer, clears buffer dirty and calls
> > submit_bh()
> >
> > A racing __block_write_full_page() (from ext3_ordered_writepage())
> > would see that buffer_dirty() is not set because the i/o
> > is still in flight, so it would not do a submit_bh()
> >
> > It would SetPageWriteback() and unlock_page() and then
> > see that no i/o was submitted and call end_page_writeback()
> > (with the i/o still in flight).
> >
> > This would allow the DIO code to issue the DIO read while buffer writes
> > are still in flight.  The i/o can be reordered by i/o scheduling and
> > the DIO can complete BEFORE the writebacks complete.  Thus the DIO
> > sees the old uninitialized data.
>
> I suppose we should go for a general fix to the problem.  I'm not 100%
> happy with it.  It's similar to yours, except we only wait if
> wbc->sync_mode says it's a write-for-sync.  Also we hold the buffer lock
> across all the tests.
>
>
> Fix a race which was identified by Daniel McNeil
>
> If a buffer_head is under I/O due to JBD's ordered data writeout (which uses
> ll_rw_block()) then either filemap_fdatawrite() or filemap_fdatawait() needs
> to wait on the buffer's existing I/O.
>
> Presently neither will do so, because __block_write_full_page() will not
> actually submit any I/O and will hence not mark the page as being under
> writeback.
>
> The best-performing fix would be to somehow mark the page as being under
> writeback and defer waiting for the ll_rw_block-initiated I/O until
> filemap_fdatawait()-time.  But this is hard, because in
> __block_write_full_page() we do not have control of the buffer_head's end_io
> handler.  Possibly we could make JBD call into end_buffer_async_write(), but
> that gets nasty.
>
> This patch makes __block_write_full_page() wait for any buffer_head I/O to
> complete before inspecting the buffer_head state.  It only does this in the
> case where __block_write_full_page() was called for a "data-integrity" write:
> (wbc->sync_mode != WB_SYNC_NONE).
>
> Probably it doesn't matter, because kjournald is currently submitting (or has
> already submitted) all dirty buffers anyway.
>
>
> ---
>
>  fs/buffer.c |   29 +++++++++++++++--------------
>  1 files changed, 15 insertions(+), 14 deletions(-)
>
> diff -puN fs/buffer.c~O_DIRECT-ll_rw_block-vs-block_write_full_page-fix fs/buffer.c
> --- 25/fs/buffer.c~O_DIRECT-ll_rw_block-vs-block_write_full_page-fix	2004-02-04 20:38:30.000000000 -0800
> +++ 25-akpm/fs/buffer.c	2004-02-04 20:40:19.000000000 -0800
> @@ -1810,23 +1810,24 @@ static int __block_write_full_page(struc
>
>  	do {
>  		get_bh(bh);
> -		if (buffer_mapped(bh) && buffer_dirty(bh)) {
> -			if (wbc->sync_mode != WB_SYNC_NONE) {
> -				lock_buffer(bh);
> -			} else {
> -				if (test_set_buffer_locked(bh)) {
> +		if (!buffer_mapped(bh))
> +			continue;
> +		if (wbc->sync_mode != WB_SYNC_NONE) {
> +			lock_buffer(bh);
> +		} else {
> +			if (test_set_buffer_locked(bh)) {
> +				if (buffer_dirty(bh))
>  					__set_page_dirty_nobuffers(page);
> -					continue;
> -				}
> -			}
> -			if (test_clear_buffer_dirty(bh)) {
> -				if (!buffer_uptodate(bh))
> -					buffer_error();
> -				mark_buffer_async_write(bh);
> -			} else {
> -				unlock_buffer(bh);
> +				continue;
>  			}
>  		}
> +		if (test_clear_buffer_dirty(bh)) {
> +			if (!buffer_uptodate(bh))
> +				buffer_error();
> +			mark_buffer_async_write(bh);
> +		} else {
> +			unlock_buffer(bh);
> +		}
>  	} while ((bh = bh->b_this_page) != head);
>
>  	BUG_ON(PageWriteback(page));
>
> _
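For readability, the buffer-walk loop with the hunk above applied comes
out roughly as below.  This is reconstructed from the diff rather than
copied from the 2.6.2-rc3-mm1 tree, so treat it as a sketch:

	do {
		get_bh(bh);
		if (!buffer_mapped(bh))
			continue;
		if (wbc->sync_mode != WB_SYNC_NONE) {
			/*
			 * Data-integrity write: lock_buffer() sleeps until
			 * any I/O already submitted against this buffer
			 * (e.g. JBD's ll_rw_block() ordered-data writeout)
			 * has completed.
			 */
			lock_buffer(bh);
		} else {
			if (test_set_buffer_locked(bh)) {
				/* buffer busy: leave the page dirty for later */
				if (buffer_dirty(bh))
					__set_page_dirty_nobuffers(page);
				continue;
			}
		}
		if (test_clear_buffer_dirty(bh)) {
			if (!buffer_uptodate(bh))
				buffer_error();
			mark_buffer_async_write(bh);
		} else {
			unlock_buffer(bh);
		}
	} while ((bh = bh->b_this_page) != head);

The effect is that for a data-integrity writeback (wbc->sync_mode !=
WB_SYNC_NONE), __block_write_full_page() now sleeps in lock_buffer()
until any I/O already started by ll_rw_block() has finished, so a
write-and-wait done before a direct I/O read can no longer return while
that buffer I/O is still in flight.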