All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: ext4: Can we talk about bforget() and metadata blocks
       [not found]   ` <6601abe90909091707s1df9e71bvb4551772dc4917cb@mail.gmail.com>
@ 2009-09-10  1:35     ` Theodore Tso
  2009-09-10  6:54       ` Aneesh Kumar K.V
  2009-09-10 15:55       ` Curt Wohlgemuth
  0 siblings, 2 replies; 12+ messages in thread
From: Theodore Tso @ 2009-09-10  1:35 UTC (permalink / raw)
  To: Curt Wohlgemuth; +Cc: linux-ext4

On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
> 
> First, ext4_journal_forget() is called from ext4_forget() only when
> we're journalling; without a journal, ext4_journal_forget() is only
> called for various non-extent paths.  ext4_forget() could be changed,
> of course...

Ext4_forget() calls either ext4_journal_forget() or
ext4_journal_revoke().  So we need to fix up both functions.

			      	      	  - Ted

commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Wed Sep 9 21:32:41 2009 -0400

    ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
    
    When ext4 is using a journal, a metadata block which is deallocated
    must be passed into the journal layer so it can be dropped from the
    current transaction and/or revoked.  This is done by calling the
    functions ext4_journal_forget() and ext4_journal_revoke(), which call
    jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
    
    Since the jbd2_journal_forget() and jbd2_journal_revoke() call
    bforget(), if ext4 is not using a journal, ext4_journal_forget() and
    ext4_journal_revoke() must call bforget() to avoid a dirty metadata
    block overwriting a block after it has been reallocated and reused for
    another inode's data block.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index eb27fd0..ecb9ca4 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -44,7 +44,7 @@ int __ext4_journal_forget(const char *where, handle_t *handle,
 						  handle, err);
 	}
 	else
-		brelse(bh);
+		bforget(bh);
 	return err;
 }
 
@@ -60,7 +60,7 @@ int __ext4_journal_revoke(const char *where, handle_t *handle,
 						  handle, err);
 	}
 	else
-		brelse(bh);
+		bforget(bh);
 	return err;
 }
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-10  1:35     ` ext4: Can we talk about bforget() and metadata blocks Theodore Tso
@ 2009-09-10  6:54       ` Aneesh Kumar K.V
  2009-09-10 15:46         ` Curt Wohlgemuth
  2009-09-10 15:55       ` Curt Wohlgemuth
  1 sibling, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2009-09-10  6:54 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Curt Wohlgemuth, linux-ext4

On Wed, Sep 09, 2009 at 09:35:40PM -0400, Theodore Tso wrote:
> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
> > 
> > First, ext4_journal_forget() is called from ext4_forget() only when
> > we're journalling; without a journal, ext4_journal_forget() is only
> > called for various non-extent paths.  ext4_forget() could be changed,
> > of course...
> 
> Ext4_forget() calls either ext4_journal_forget() or
> ext4_journal_revoke().  So we need to fix up both functions.
> 
> 			      	      	  - Ted
> 
> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Wed Sep 9 21:32:41 2009 -0400
> 
>     ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
>     
>     When ext4 is using a journal, a metadata block which is deallocated
>     must be passed into the journal layer so it can be dropped from the
>     current transaction and/or revoked.  This is done by calling the
>     functions ext4_journal_forget() and ext4_journal_revoke(), which call
>     jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
>     
>     Since the jbd2_journal_forget() and jbd2_journal_revoke() call
>     bforget(), if ext4 is not using a journal, ext4_journal_forget() and
>     ext4_journal_revoke() must call bforget() to avoid a dirty metadata
>     block overwriting a block after it has been reallocated and reused for
>     another inode's data block.
>     

I am sure i am missing something. But where are we adding the buffer_head
to the mapping->private_list ?. For ext2 when we allocate meta data blocks
we do mark_buffer_dirty_inode which add the buffer_head to the inodes
private_list. Shouldn't we do something similar with Ext4 without journal ?

-aneesh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-10  6:54       ` Aneesh Kumar K.V
@ 2009-09-10 15:46         ` Curt Wohlgemuth
  2009-09-10 16:24           ` Aneesh Kumar K.V
  0 siblings, 1 reply; 12+ messages in thread
From: Curt Wohlgemuth @ 2009-09-10 15:46 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Theodore Tso, linux-ext4

On Wed, Sep 9, 2009 at 11:54 PM, Aneesh Kumar
K.V<aneesh.kumar@linux.vnet.ibm.com> wrote:
> On Wed, Sep 09, 2009 at 09:35:40PM -0400, Theodore Tso wrote:
>> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
>> >
>> > First, ext4_journal_forget() is called from ext4_forget() only when
>> > we're journalling; without a journal, ext4_journal_forget() is only
>> > called for various non-extent paths.  ext4_forget() could be changed,
>> > of course...
>>
>> Ext4_forget() calls either ext4_journal_forget() or
>> ext4_journal_revoke().  So we need to fix up both functions.
>>
>>                                         - Ted
>>
>> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
>> Author: Theodore Ts'o <tytso@mit.edu>
>> Date:   Wed Sep 9 21:32:41 2009 -0400
>>
>>     ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
>>
>>     When ext4 is using a journal, a metadata block which is deallocated
>>     must be passed into the journal layer so it can be dropped from the
>>     current transaction and/or revoked.  This is done by calling the
>>     functions ext4_journal_forget() and ext4_journal_revoke(), which call
>>     jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
>>
>>     Since the jbd2_journal_forget() and jbd2_journal_revoke() call
>>     bforget(), if ext4 is not using a journal, ext4_journal_forget() and
>>     ext4_journal_revoke() must call bforget() to avoid a dirty metadata
>>     block overwriting a block after it has been reallocated and reused for
>>     another inode's data block.
>>
>
> I am sure i am missing something. But where are we adding the buffer_head
> to the mapping->private_list ?. For ext2 when we allocate meta data blocks
> we do mark_buffer_dirty_inode which add the buffer_head to the inodes
> private_list. Shouldn't we do something similar with Ext4 without journal ?

As Ted explained to me, all buffer heads pointing to metadata blocks
are attached to the block device inode.  So pdflush writes of these
pages go through the block device address space ops.  Explicit
sync_dirty_buffer() calls for the metadata buffer heads still work, of
course.

Curt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-10  1:35     ` ext4: Can we talk about bforget() and metadata blocks Theodore Tso
  2009-09-10  6:54       ` Aneesh Kumar K.V
@ 2009-09-10 15:55       ` Curt Wohlgemuth
  1 sibling, 0 replies; 12+ messages in thread
From: Curt Wohlgemuth @ 2009-09-10 15:55 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

On Wed, Sep 9, 2009 at 6:35 PM, Theodore Tso<tytso@mit.edu> wrote:
> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
>>
>> First, ext4_journal_forget() is called from ext4_forget() only when
>> we're journalling; without a journal, ext4_journal_forget() is only
>> called for various non-extent paths.  ext4_forget() could be changed,
>> of course...
>
> Ext4_forget() calls either ext4_journal_forget() or
> ext4_journal_revoke().  So we need to fix up both functions.
>
>                                          - Ted
>
> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Wed Sep 9 21:32:41 2009 -0400
>
>    ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
>
>    When ext4 is using a journal, a metadata block which is deallocated
>    must be passed into the journal layer so it can be dropped from the
>    current transaction and/or revoked.  This is done by calling the
>    functions ext4_journal_forget() and ext4_journal_revoke(), which call
>    jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
>
>    Since the jbd2_journal_forget() and jbd2_journal_revoke() call
>    bforget(), if ext4 is not using a journal, ext4_journal_forget() and
>    ext4_journal_revoke() must call bforget() to avoid a dirty metadata
>    block overwriting a block after it has been reallocated and reused for
>    another inode's data block.
>
>    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index eb27fd0..ecb9ca4 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -44,7 +44,7 @@ int __ext4_journal_forget(const char *where, handle_t *handle,
>                                                  handle, err);
>        }
>        else
> -               brelse(bh);
> +               bforget(bh);
>        return err;
>  }
>
> @@ -60,7 +60,7 @@ int __ext4_journal_revoke(const char *where, handle_t *handle,
>                                                  handle, err);
>        }
>        else
> -               brelse(bh);
> +               bforget(bh);
>        return err;
>  }

This looks quite reasonable to me.  And of course, your patch fixes
the brelse() calls that I added back in July.

Of course, validating this patch on my end will be tough; I need at
least a week of no corruptions to convince myself that the problem is
fixed, to the same degree that my call to sync_dirty_buffer() works.

Thanks,
Curt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-10 15:46         ` Curt Wohlgemuth
@ 2009-09-10 16:24           ` Aneesh Kumar K.V
  2009-09-10 18:58             ` Theodore Tso
  0 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2009-09-10 16:24 UTC (permalink / raw)
  To: Curt Wohlgemuth; +Cc: Theodore Tso, linux-ext4

On Thu, Sep 10, 2009 at 08:46:41AM -0700, Curt Wohlgemuth wrote:
> On Wed, Sep 9, 2009 at 11:54 PM, Aneesh Kumar
> K.V<aneesh.kumar@linux.vnet.ibm.com> wrote:
> > On Wed, Sep 09, 2009 at 09:35:40PM -0400, Theodore Tso wrote:
> >> On Wed, Sep 09, 2009 at 05:07:28PM -0700, Curt Wohlgemuth wrote:
> >> >
> >> > First, ext4_journal_forget() is called from ext4_forget() only when
> >> > we're journalling; without a journal, ext4_journal_forget() is only
> >> > called for various non-extent paths.  ext4_forget() could be changed,
> >> > of course...
> >>
> >> Ext4_forget() calls either ext4_journal_forget() or
> >> ext4_journal_revoke().  So we need to fix up both functions.
> >>
> >>                                         - Ted
> >>
> >> commit 4afdf0958f6f7b878e6d85cb4e0c0c12a0bd74e2
> >> Author: Theodore Ts'o <tytso@mit.edu>
> >> Date:   Wed Sep 9 21:32:41 2009 -0400
> >>
> >>     ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
> >>
> >>     When ext4 is using a journal, a metadata block which is deallocated
> >>     must be passed into the journal layer so it can be dropped from the
> >>     current transaction and/or revoked.  This is done by calling the
> >>     functions ext4_journal_forget() and ext4_journal_revoke(), which call
> >>     jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
> >>
> >>     Since the jbd2_journal_forget() and jbd2_journal_revoke() call
> >>     bforget(), if ext4 is not using a journal, ext4_journal_forget() and
> >>     ext4_journal_revoke() must call bforget() to avoid a dirty metadata
> >>     block overwriting a block after it has been reallocated and reused for
> >>     another inode's data block.
> >>
> >
> > I am sure i am missing something. But where are we adding the buffer_head
> > to the mapping->private_list ?. For ext2 when we allocate meta data blocks
> > we do mark_buffer_dirty_inode which add the buffer_head to the inodes
> > private_list. Shouldn't we do something similar with Ext4 without journal ?
> 
> As Ted explained to me, all buffer heads pointing to metadata blocks
> are attached to the block device inode.  So pdflush writes of these
> pages go through the block device address space ops.  Explicit
> sync_dirty_buffer() calls for the metadata buffer heads still work, of
> course.

But how would it work for fsync ? I mean 

I would expect for no journal mode ext4_sync_file  should be doing
simple_fsync(). That should be forcing the metadata buffer_heads
via sync_mapping_buffers. And if we reuse these meta buffers we
drop them the inode->mapping->private_list using bforget.

But I don't see any of the above in code

-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-10 16:24           ` Aneesh Kumar K.V
@ 2009-09-10 18:58             ` Theodore Tso
  2009-09-11 17:21               ` Aneesh Kumar K.V
  2009-09-12 15:00               ` Aneesh Kumar K.V
  0 siblings, 2 replies; 12+ messages in thread
From: Theodore Tso @ 2009-09-10 18:58 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Curt Wohlgemuth, linux-ext4

On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
> 
> But how would it work for fsync ? I mean 
> 
> I would expect for no journal mode ext4_sync_file  should be doing
> simple_fsync(). That should be forcing the metadata buffer_heads
> via sync_mapping_buffers. And if we reuse these meta buffers we
> drop them the inode->mapping->private_list using bforget.
> 
> But I don't see any of the above in code

Aneesh, you're addressing a different problem than the one that Curt
were trying to deal with this patch.  The problem we are worry about
is one where an inode's extent tree or indirect blocks are modified
right before the inode is deleted, and then one or more of those
metadata blocks get reallocated and written right away (most likely
this will happen via an O_DIRECT write), and then, because we didn't
use bforget(), the dirty metadata block in the buffer cache would get
written out, overwriting the O_DIRECT block.

What you're worrying about, is a different issue.  You're concerned
about the fact that since we are not associating an inode's extent
tree or indirect blocks with the inode, those blocks won't get forced
out to disk on an fsync() in ext4 no-journal mode.  This may not be a
big deal for applications which expect to recover from an unclean
using mke2fs (and thus probably don't use fsync in any case), but
here's a patch to deal with the problem you've raised.

       	       	       	    		       - Ted

commit 417cf58253fbf3e36df7b3aca11c120e8367f5e6
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Sep 10 14:58:02 2009 -0400

    ext4: Assure that metadata blocks are written during fsync in no journal mode
    
    When there is no journal present, we must attach buffer heads
    associated with extent tree and indirect blocks to the inode's
    mapping->private_list so that fsync() will write out the inode's
    metadata blocks.  This is done via mark_buffer_dirty_inode().
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index ecb9ca4..6a94099 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -89,7 +89,10 @@ int __ext4_handle_dirty_metadata(const char *where, handle_t *handle,
 			ext4_journal_abort_handle(where, __func__, bh,
 						  handle, err);
 	} else {
-		mark_buffer_dirty(bh);
+		if (inode && bh)
+			mark_buffer_dirty_inode(bh, inode);
+		else
+			mark_buffer_dirty(bh);
 		if (inode && inode_needs_sync(inode)) {
 			sync_dirty_buffer(bh);
 			if (buffer_req(bh) && !buffer_uptodate(bh)) {


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-10 18:58             ` Theodore Tso
@ 2009-09-11 17:21               ` Aneesh Kumar K.V
  2009-09-11 17:36                 ` Curt Wohlgemuth
  2009-09-11 18:08                 ` Theodore Tso
  2009-09-12 15:00               ` Aneesh Kumar K.V
  1 sibling, 2 replies; 12+ messages in thread
From: Aneesh Kumar K.V @ 2009-09-11 17:21 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Curt Wohlgemuth, linux-ext4

On Thu, Sep 10, 2009 at 02:58:26PM -0400, Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
> > 
> > But how would it work for fsync ? I mean 
> > 
> > I would expect for no journal mode ext4_sync_file  should be doing
> > simple_fsync(). That should be forcing the metadata buffer_heads
> > via sync_mapping_buffers. And if we reuse these meta buffers we
> > drop them the inode->mapping->private_list using bforget.
> > 
> > But I don't see any of the above in code
> 
> Aneesh, you're addressing a different problem than the one that Curt
> were trying to deal with this patch.  The problem we are worry about
> is one where an inode's extent tree or indirect blocks are modified
> right before the inode is deleted, and then one or more of those
> metadata blocks get reallocated and written right away (most likely
> this will happen via an O_DIRECT write), and then, because we didn't
> use bforget(), the dirty metadata block in the buffer cache would get
> written out, overwriting the O_DIRECT block.
> 
> What you're worrying about, is a different issue.  You're concerned
> about the fact that since we are not associating an inode's extent
> tree or indirect blocks with the inode, those blocks won't get forced
> out to disk on an fsync() in ext4 no-journal mode.  This may not be a
> big deal for applications which expect to recover from an unclean
> using mke2fs (and thus probably don't use fsync in any case), but
> here's a patch to deal with the problem you've raised.
> 
>        	       	       	    		       - Ted

But the patch you posted is using bforget which is removing the
buffer_head from the inode->mapping->private_list. What i am
trying to figure out is where does the buffer_head getting added
to the private_list. ?

-aneesh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-11 17:21               ` Aneesh Kumar K.V
@ 2009-09-11 17:36                 ` Curt Wohlgemuth
  2009-09-11 18:08                 ` Theodore Tso
  1 sibling, 0 replies; 12+ messages in thread
From: Curt Wohlgemuth @ 2009-09-11 17:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Theodore Tso, linux-ext4

On Fri, Sep 11, 2009 at 10:21 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> On Thu, Sep 10, 2009 at 02:58:26PM -0400, Theodore Tso wrote:
>> On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
>> >
>> > But how would it work for fsync ? I mean
>> >
>> > I would expect for no journal mode ext4_sync_file  should be doing
>> > simple_fsync(). That should be forcing the metadata buffer_heads
>> > via sync_mapping_buffers. And if we reuse these meta buffers we
>> > drop them the inode->mapping->private_list using bforget.
>> >
>> > But I don't see any of the above in code
>>
>> Aneesh, you're addressing a different problem than the one that Curt
>> were trying to deal with this patch.  The problem we are worry about
>> is one where an inode's extent tree or indirect blocks are modified
>> right before the inode is deleted, and then one or more of those
>> metadata blocks get reallocated and written right away (most likely
>> this will happen via an O_DIRECT write), and then, because we didn't
>> use bforget(), the dirty metadata block in the buffer cache would get
>> written out, overwriting the O_DIRECT block.
>>
>> What you're worrying about, is a different issue.  You're concerned
>> about the fact that since we are not associating an inode's extent
>> tree or indirect blocks with the inode, those blocks won't get forced
>> out to disk on an fsync() in ext4 no-journal mode.  This may not be a
>> big deal for applications which expect to recover from an unclean
>> using mke2fs (and thus probably don't use fsync in any case), but
>> here's a patch to deal with the problem you've raised.
>>
>>                                                      - Ted
>
> But the patch you posted is using bforget which is removing the
> buffer_head from the inode->mapping->private_list. What i am
> trying to figure out is where does the buffer_head getting added
> to the private_list. ?

I don't think the buffer_head's b_assoc_map is set, because as you
say, mark_buffer_dirty_inode() isn't called from ext4.

All the bforget() call does in this case is clear the BH dirty bit,
which prevents it from being written out during writeback.

Unless I'm missing something too...

Curt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-11 17:21               ` Aneesh Kumar K.V
  2009-09-11 17:36                 ` Curt Wohlgemuth
@ 2009-09-11 18:08                 ` Theodore Tso
  2009-09-11 18:15                   ` Theodore Tso
  1 sibling, 1 reply; 12+ messages in thread
From: Theodore Tso @ 2009-09-11 18:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Curt Wohlgemuth, linux-ext4

On Fri, Sep 11, 2009 at 10:51:25PM +0530, Aneesh Kumar K.V wrote:
> 
> But the patch you posted is using bforget which is removing the
> buffer_head from the inode->mapping->private_list. What i am
> trying to figure out is where does the buffer_head getting added
> to the private_list. ?
> 

bforget() does three things; 1) it clears the buffer dirty flag, 2) it
drops the buffer from inode->mapping->private_list (if it is on such a
list), and 3) it drops the refcount (when it calls __brelse()).  (n.b.
brelse() doesn't actually "release" or "free" a buffer; the name is a
holdover from the days when buffer cache was a stand-alone entity,
instead of being implemented in the page cache.)

The bug that Curt and I was chasing was related to (1) not happening,
and the fix that I proposed to fix it is "ext4: Use bforget() in no
journal mode for ext4_journal_{forget,revoke}()".  This bug was caused
by the fact that brelse() never releases the bh which is part of the
problem.  Even if bh->b_count is zero, the page in the buffer cache is
still present, and the buffer head will still be marked dirty.  So if
you don't want a dirty buffer to be written out to disk (because its
inode has been deleted and the block might be reallocated for some
other purpose), it's not enough to brelse() it; you have to explicitly
call clear_buffer_dirty() or bforget().

Your question about where does the buffer_head is getting added to
private_list is reltaed to (2), and the answer (at least before my
proposed patch, "ext4: Assure that metadata blocks are written during
fsync in no journal mode" is 'nowhere'.  So that was an entirely
separate problem, albeit one that I'm not sure Curt cares about for
his workloads.  (Why take the performance hit of fsync() in your
applications if your recovery strategy in case of a crash is mke2fs
and copy from one of the other two redundant servers?)

Does that help clarify matters?  Basically, there are three separate
bugs related to no journal mode that are being addressed by patches in
the ext4 patch queue:

ext4: Make non-journal fsync work properly
      (Found and fixed by Frank; we need to explicitly write out the 
      inode structure to disk during an fsync since we can't depend 
      on the journal doing this for us in no-journal mode.  So this is 
      an issue of the inode itself not getting written out by
      ext4_write_inode, which is called by pdflush and fsync.  
      Since the inode table buffer is marked dirty, the inode will 
      *eventually* be written out, but on a much greater time scale.  
      This caused the increased fragility of ext4 in no-journal mode
      after a power failure.)

ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
      (Pointed out by Aneesh, fixed by Ted; this fix makes sure that
      fsync will write out an inode's extent tree and/or indirect blocks,
      which is kinda important.  :-)

ext4: Assure that metadata blocks are written during fsync in no journal mode
      (Found by Curt, with an initial fix that worked by forcing 
      the dirty buffer to be written to disk, and fixed in a better way 
      by Ted by using bforget.  The problem here relates to an inode that
      is being deleted, which is why there's no reason to write the 
      dirty block to disk; when we were about to deallocate the block ---
      the better fix is to drop the dirty bit by using bforget.)

Hopefully this quick Cliff Notes(tm) summary of the ext4 no-journal
patches in the ext4 patch queue is helpful.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-11 18:08                 ` Theodore Tso
@ 2009-09-11 18:15                   ` Theodore Tso
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Tso @ 2009-09-11 18:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Curt Wohlgemuth, linux-ext4

On Fri, Sep 11, 2009 at 02:08:27PM -0400, Theodore Tso wrote:
> Does that help clarify matters?  Basically, there are three separate
> bugs related to no journal mode that are being addressed by patches in
> the ext4 patch queue:

Whoops, I mixed up the patch names and description for two of them.
Let me try again:

ext4: Make non-journal fsync work properly
      (Found and fixed by Frank; we need to explicitly write out the 
      inode structure to disk during an fsync since we can't depend 
      on the journal doing this for us in no-journal mode.  So this is 
      an issue of the inode itself not getting written out by
      ext4_write_inode, which is called by pdflush and fsync.  
      Since the inode table buffer is marked dirty, the inode will 
      *eventually* be written out, but on a much greater time scale.  
      This caused the increased fragility of ext4 in no-journal mode
      after a power failure.)

ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
      (Found by Curt, with an initial fix that worked by forcing 
      the dirty buffer to be written to disk, and fixed in a better way 
      by Ted by using bforget.  The problem here relates to an inode that
      is being deleted, which is why there's no reason to write the 
      dirty block to disk; when we were about to deallocate the block ---
      the better fix is to drop the dirty bit by using bforget.)

ext4: Assure that metadata blocks are written during fsync in no journal mode
      (Pointed out by Aneesh, fixed by Ted; this fix makes sure that
      fsync will write out an inode's extent tree and/or indirect blocks,
      which is kinda important.  :-)

(I swapped the patch names/description for the last two.)

> Hopefully this quick Cliff Notes(tm) summary of the ext4 no-journal
> patches in the ext4 patch queue is helpful.

  	     	      	    	     - Ted

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-10 18:58             ` Theodore Tso
  2009-09-11 17:21               ` Aneesh Kumar K.V
@ 2009-09-12 15:00               ` Aneesh Kumar K.V
  2009-09-12 17:59                 ` Theodore Tso
  1 sibling, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2009-09-12 15:00 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Curt Wohlgemuth, linux-ext4

On Thu, Sep 10, 2009 at 02:58:26PM -0400, Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 09:54:35PM +0530, Aneesh Kumar K.V wrote:
> > 
> > But how would it work for fsync ? I mean 
> > 
> > I would expect for no journal mode ext4_sync_file  should be doing
> > simple_fsync(). That should be forcing the metadata buffer_heads
> > via sync_mapping_buffers. And if we reuse these meta buffers we
> > drop them the inode->mapping->private_list using bforget.
> > 
> > But I don't see any of the above in code
> 
> Aneesh, you're addressing a different problem than the one that Curt
> were trying to deal with this patch.  The problem we are worry about
> is one where an inode's extent tree or indirect blocks are modified
> right before the inode is deleted, and then one or more of those
> metadata blocks get reallocated and written right away (most likely
> this will happen via an O_DIRECT write), and then, because we didn't
> use bforget(), the dirty metadata block in the buffer cache would get
> written out, overwriting the O_DIRECT block.
> 
> What you're worrying about, is a different issue.  You're concerned
> about the fact that since we are not associating an inode's extent
> tree or indirect blocks with the inode, those blocks won't get forced
> out to disk on an fsync() in ext4 no-journal mode.  This may not be a
> big deal for applications which expect to recover from an unclean
> using mke2fs (and thus probably don't use fsync in any case), but
> here's a patch to deal with the problem you've raised.
> 
>        	       	       	    		       - Ted
> 
> commit 417cf58253fbf3e36df7b3aca11c120e8367f5e6
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Sep 10 14:58:02 2009 -0400
> 
>     ext4: Assure that metadata blocks are written during fsync in no journal mode
>     
>     When there is no journal present, we must attach buffer heads
>     associated with extent tree and indirect blocks to the inode's
>     mapping->private_list so that fsync() will write out the inode's
>     metadata blocks.  This is done via mark_buffer_dirty_inode().
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index ecb9ca4..6a94099 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -89,7 +89,10 @@ int __ext4_handle_dirty_metadata(const char *where, handle_t *handle,
>  			ext4_journal_abort_handle(where, __func__, bh,
>  						  handle, err);
>  	} else {
> -		mark_buffer_dirty(bh);
> +		if (inode && bh)
> +			mark_buffer_dirty_inode(bh, inode);
> +		else
> +			mark_buffer_dirty(bh);
>  		if (inode && inode_needs_sync(inode)) {
>  			sync_dirty_buffer(bh);
>  			if (buffer_req(bh) && !buffer_uptodate(bh)) {
> 


This does add the meta data buffer_head to the inode->mapping->private_list.
But ext4_sync_file is not writing them. I guess we need to call sync_mapping_buffers
for no-journal mode in ext4_sync_file

-aneesh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: ext4: Can we talk about bforget() and metadata blocks
  2009-09-12 15:00               ` Aneesh Kumar K.V
@ 2009-09-12 17:59                 ` Theodore Tso
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Tso @ 2009-09-12 17:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Curt Wohlgemuth, linux-ext4

On Sat, Sep 12, 2009 at 08:30:36PM +0530, Aneesh Kumar K.V wrote:
> 
> This does add the meta data buffer_head to the
> inode->mapping->private_list.  But ext4_sync_file is not writing
> them. I guess we need to call sync_mapping_buffers for no-journal
> mode in ext4_sync_file.

Good point, thanks for catching this.  Here's a revised patch which
adds the call to ext4_sync_file().

						- Ted

commit 6c6e80dc88568f4e49004967f8fd56b69c86e715
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Sep 12 13:41:55 2009 -0400

    ext4: Assure that metadata blocks are written during fsync in no journal mode
    
    When there is no journal present, we must attach buffer heads
    associated with extent tree and indirect blocks to the inode's
    mapping->private_list via mark_buffer_dirty_inode() so that
    ext4_sync_file() --- which is called to service fsync() and
    fdatasync() system calls --- can write out the inode's metadata blocks
    by calling sync_mapping_buffers().
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index ecb9ca4..6a94099 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -89,7 +89,10 @@ int __ext4_handle_dirty_metadata(const char *where, handle_t *handle,
 			ext4_journal_abort_handle(where, __func__, bh,
 						  handle, err);
 	} else {
-		mark_buffer_dirty(bh);
+		if (inode && bh)
+			mark_buffer_dirty_inode(bh, inode);
+		else
+			mark_buffer_dirty(bh);
 		if (inode && inode_needs_sync(inode)) {
 			sync_dirty_buffer(bh);
 			if (buffer_req(bh) && !buffer_uptodate(bh)) {
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index ab418c0..0747574 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -50,7 +50,7 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 {
 	struct inode *inode = dentry->d_inode;
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
-	int ret = 0;
+	int err, ret = 0;
 
 	J_ASSERT(ext4_journal_current_handle() == NULL);
 
@@ -79,6 +79,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 		goto out;
 	}
 
+	if (!journal)
+		ret = sync_mapping_buffers(inode->i_mapping);
+
 	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
 		goto out;
 
@@ -91,7 +94,9 @@ int ext4_sync_file(struct file *file, struct dentry *dentry, int datasync)
 			.sync_mode = WB_SYNC_ALL,
 			.nr_to_write = 0, /* sys_fsync did this */
 		};
-		ret = sync_inode(inode, &wbc);
+		err = sync_inode(inode, &wbc);
+		if (ret == 0)
+			ret = err;
 	}
 out:
 	if (journal && (journal->j_flags & JBD2_BARRIER))

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2009-09-12 17:59 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <6601abe90909091029s74465ebave932987e5fdf93ba@mail.gmail.com>
     [not found] ` <20090909225429.GB24951@mit.edu>
     [not found]   ` <6601abe90909091707s1df9e71bvb4551772dc4917cb@mail.gmail.com>
2009-09-10  1:35     ` ext4: Can we talk about bforget() and metadata blocks Theodore Tso
2009-09-10  6:54       ` Aneesh Kumar K.V
2009-09-10 15:46         ` Curt Wohlgemuth
2009-09-10 16:24           ` Aneesh Kumar K.V
2009-09-10 18:58             ` Theodore Tso
2009-09-11 17:21               ` Aneesh Kumar K.V
2009-09-11 17:36                 ` Curt Wohlgemuth
2009-09-11 18:08                 ` Theodore Tso
2009-09-11 18:15                   ` Theodore Tso
2009-09-12 15:00               ` Aneesh Kumar K.V
2009-09-12 17:59                 ` Theodore Tso
2009-09-10 15:55       ` Curt Wohlgemuth

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.