From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S937398AbZDDAiz (ORCPT ); Fri, 3 Apr 2009 20:38:55 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S937263AbZDDAii (ORCPT ); Fri, 3 Apr 2009 20:38:38 -0400 Received: from wf-out-1314.google.com ([209.85.200.173]:27361 "EHLO wf-out-1314.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1762382AbZDDAig (ORCPT ); Fri, 3 Apr 2009 20:38:36 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=LOpPPuUWdmt+2LO3Xn5BhkAR0UHbjdpDd92alqG0HEhGRsxB4Qa1FOqfHAjznvjgz0 8l8zG3h4f0zX2Nl/JbPrqLrDGtvI68HdsYwfn8QEX74fYtcrUVJsvOpCMWHqBo3vWxEx wdBO+EqjEpBOCh8otBtV18yum0GsdzKGgpOnk= MIME-Version: 1.0 In-Reply-To: <1238715399-22172-1-git-send-email-jack@suse.cz> References: <1238715399-22172-1-git-send-email-jack@suse.cz> Date: Fri, 3 Apr 2009 17:38:33 -0700 Message-ID: Subject: Re: [PATCH] ext2: Fix data corruption for racing writes From: Ying Han To: Jan Kara Cc: linux-ext4@vger.kernel.org, Andrew Morton , LKML , Ying Han , Nick Piggin Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 2, 2009 at 4:36 PM, Jan Kara wrote: > > If two writers allocating blocks to file race with each other (e.g. because > writepages races with ordinary write or two writepages race with each other), > ext2_getblock() can be called on the same inode in parallel. Before we are > going to allocate new blocks, we have to recheck the block chain we have > obtained so far without holding truncate_mutex. Otherwise we could overwrite > the indirect block pointer set by the other writer leading to data loss. > > The below test program by Ying is able to reproduce the data loss with ext2 > on in BRD in a few minutes if the machine is under memory pressure: > > long kMemSize = 50 << 20; > int kPageSize = 4096; > > int main(int argc, char **argv) { > int status; > int count = 0; > int i; > char *fname = "/mnt/test.mmap"; > char *mem; > unlink(fname); > int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600); > status = ftruncate(fd, kMemSize); > mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > // Fill the memory with 1s. > memset(mem, 1, kMemSize); > sleep(2); > for (i = 0; i < kMemSize; i++) { > int byte_good = mem[i] != 0; > if (!byte_good && ((i % kPageSize) == 0)) { > //printf("%d ", i / kPageSize); > count++; > } > } > munmap(mem, kMemSize); > close(fd); > unlink(fname); > > if (count > 0) { > printf("Running %d bad page\n", count); > return 1; > } > return 0; > } > > CC: Ying Han > CC: Nick Piggin > Signed-off-by: Jan Kara > --- > fs/ext2/inode.c | 44 +++++++++++++++++++++++++++++++++----------- > 1 files changed, 33 insertions(+), 11 deletions(-) > > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index b43b955..acf6788 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -590,9 +590,8 @@ static int ext2_get_blocks(struct inode *inode, > > if (depth == 0) > return (err); > -reread: > - partial = ext2_get_branch(inode, depth, offsets, chain, &err); > > + partial = ext2_get_branch(inode, depth, offsets, chain, &err); > /* Simplest case - block found, no allocation needed */ > if (!partial) { > first_block = le32_to_cpu(chain[depth - 1].key); > @@ -602,15 +601,16 @@ reread: > while (count < maxblocks && count <= blocks_to_boundary) { > ext2_fsblk_t blk; > > - if (!verify_chain(chain, partial)) { > + if (!verify_chain(chain, chain + depth - 1)) { > /* > * Indirect block might be removed by > * truncate while we were reading it. > * Handling of that case: forget what we've > * got now, go to reread. > */ > + err = -EAGAIN; > count = 0; > - goto changed; > + break; > } > blk = le32_to_cpu(*(chain[depth-1].p + count)); > if (blk == first_block + count) > @@ -618,7 +618,8 @@ reread: > else > break; > } > - goto got_it; > + if (err != -EAGAIN) > + goto got_it; > } > > /* Next simple case - plain lookup or failed read of indirect block */ > @@ -626,6 +627,33 @@ reread: > goto cleanup; > > mutex_lock(&ei->truncate_mutex); > + /* > + * If the indirect block is missing while we are reading > + * the chain(ext3_get_branch() returns -EAGAIN err), or > + * if the chain has been changed after we grab the semaphore, > + * (either because another process truncated this branch, or > + * another get_block allocated this branch) re-grab the chain to see if > + * the request block has been allocated or not. > + * > + * Since we already block the truncate/other get_block > + * at this point, we will have the current copy of the chain when we > + * splice the branch into the tree. > + */ > + if (err == -EAGAIN || !verify_chain(chain, partial)) { > + while (partial > chain) { > + brelse(partial->bh); > + partial--; > + } > + partial = ext2_get_branch(inode, depth, offsets, chain, &err); > + if (!partial) { > + count++; > + mutex_unlock(&ei->truncate_mutex); > + if (err) > + goto cleanup; > + clear_buffer_new(bh_result); > + goto got_it; > + } > + } > > /* > * Okay, we need to do block allocation. Lazily initialize the block > @@ -683,12 +711,6 @@ cleanup: > partial--; > } > return err; > -changed: > - while (partial > chain) { > - brelse(partial->bh); > - partial--; > - } > - goto reread; > } > > int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) I tried this patch and seems i got deadlock on the truncate_mutex. Here is the message after enabling lockdep. I pasted the same message on the origianal thread. kernel: 1 lock held by kswapd1/264: kernel: #0: (&ei->truncate_mutex){--..}, at: [] ext2_get_block+0x109/0x960 kernel: INFO: task ftruncate_mmap:2950 blocked for more than 120 seconds. kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: ftruncate_mma D ffff81047e733a80 0 2950 2858 kernel: ffff8101798516f8 0000000000000092 0000000000000000 0000000000000046 kernel: ffff81047e0a1260 ffff81047f070000 ffff81047e0a15c0 0000000100130c66 kernel: 00000000ffffffff ffffffff8025740d 0000000000000000 0000000000000000 kernel: Call Trace: kernel: [] mark_held_locks+0x3d/0x80 kernel: [] mutex_lock_nested+0x14d/0x280 kernel: [] mutex_lock_nested+0xe5/0x280 kernel: [] ext2_get_block+0x109/0x960 kernel: [] create_empty_buffers+0x43/0xb0 kernel: [] create_empty_buffers+0x43/0xb0 kernel: [] alloc_page_buffers+0x97/0x120 kernel: [] __block_write_full_page+0x206/0x320 kernel: [] __block_write_full_page+0xc0/0x320 kernel: [] ext2_get_block+0x0/0x960 kernel: [] shrink_page_list+0x4fe/0x650 kernel: [] __lock_acquire+0x3b8/0x1080 kernel: [] isolate_lru_pages+0x88/0x230 kernel: [] shrink_inactive_list+0x14a/0x3f0 kernel: [] shrink_zone+0xb3/0x130 kernel: [] autoremove_wake_function+0x0/0x30 kernel: [] try_to_free_pages+0x268/0x3e0 kernel: [] isolate_pages_global+0x0/0x40 kernel: [] __alloc_pages_internal+0x1d7/0x4a0 kernel: [] __do_page_cache_readahead+0x124/0x270 kernel: [] filemap_fault+0x18f/0x400 kernel: [] __do_fault+0x65/0x450 kernel: [] __lock_acquire+0x3b8/0x1080 kernel: [] __down_read_trylock+0x1d/0x60 kernel: [] handle_mm_fault+0x18a/0x7a0 kernel: [] do_page_fault+0x29c/0x930 kernel: [] trace_hardirqs_on_thunk+0x35/0x3a kernel: [] error_exit+0x0/0xa9 kernel: kernel: 2 locks held by ftruncate_mmap/2950: kernel: #0: (&mm->mmap_sem){----}, at: [] do_page_fault+0x22f/0x930 kernel: #1: (&ei->truncate_mutex){--..}, at: [] ext2_get_block+0x109/0x960 Besides than that, does this patch fix the problem while moving the mutex up to the beginning of ext_get_blocks() does too? Like the following one diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 384fc0d..bef3ef7 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -586,11 +586,13 @@ static int ext2_get_blocks(struct inode *inode, int count = 0; ext2_fsblk_t first_block = 0; + mutex_lock(&ei->truncate_mutex); depth = ext2_block_to_path(inode,iblock,offsets,&blocks_to_boundary); - if (depth == 0) + if (depth == 0) { + mutex_unlock(&ei->truncate_mutex); return (err); -reread: + } partial = ext2_get_branch(inode, depth, offsets, chain, &err); /* Simplest case - block found, no allocation needed */ @@ -602,16 +604,6 @@ reread: while (count < maxblocks && count <= blocks_to_boundary) { ext2_fsblk_t blk; - if (!verify_chain(chain, partial)) { - /* - * Indirect block might be removed by - * truncate while we were reading it. - * Handling of that case: forget what we've - * got now, go to reread. - */ - count = 0; - goto changed; - } blk = le32_to_cpu(*(chain[depth-1].p + count)); if (blk == first_block + count) count++; @@ -625,7 +617,6 @@ reread: if (!create || err == -EIO) goto cleanup; - mutex_lock(&ei->truncate_mutex); /* * Okay, we need to do block allocation. Lazily initialize the block @@ -651,7 +642,6 @@ reread: offsets + (partial - chain), partial); if (err) { - mutex_unlock(&ei->truncate_mutex); goto cleanup; } @@ -662,13 +652,11 @@ reread: err = ext2_clear_xip_target (inode, le32_to_cpu(chain[depth-1].key)); if (err) { - mutex_unlock(&ei->truncate_mutex); goto cleanup; } } ext2_splice_branch(inode, iblock, partial, indirect_blks, count); - mutex_unlock(&ei->truncate_mutex); set_buffer_new(bh_result); got_it: map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); @@ -678,17 +666,12 @@ got_it: /* Clean up and exit */ partial = chain + depth - 1; /* the whole chain */ cleanup: + mutex_unlock(&ei->truncate_mutex); while (partial > chain) { brelse(partial->bh); partial--; } return err; -changed: - while (partial > chain) { - brelse(partial->bh); - partial--; - } - goto reread; } int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) --Ying > > -- > 1.6.0.2 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/