linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased)
@ 2008-05-14  4:43 Hidehiro Kawai
  2008-05-14  4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-14  4:43 UTC (permalink / raw)
  To: Andrew Morton, sct, adilger
  Cc: linux-kernel, linux-ext4, Jan Kara, Josef Bacik, Mingming Cao,
	Hidehiro Kawai, Satoshi OSHIMA, sugita

Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased)

This is the rebased version against 2.6.26-rc2, so there is no
essential difference from the previous post.
The previous post can be found at: http://lkml.org/lkml/2008/4/18/154
(The previous post may have been filtered out as spam due to a
 problem with the mail submission.)


This patch set fixes several error handling problems.  As a
result, we can save the filesystem from file data and structural
corruption, especially corruption caused by transient I/O errors.
Do transient I/O errors occur that often?  They are at least
not uncommon with iSCSI storage.
These fixes cover only the ext3/JBD parts.  The ext4/JBD2
version has not been prepared yet, but merging this patch set
is still worthwhile because it eliminates possible filesystem
corruption.

[PATCH 1/4] jbd: strictly check for write errors on data buffers
  Without this patch, some file data buffers in ordered mode are
  never checked for errors.  This means user processes can continue
  to update the filesystem without noticing the write failure.
  Furthermore, a page whose write-out failed becomes reclaimable.
  If such a page is reclaimed and its data is later read back from
  the disk, data corruption occurs because the on-disk data is
  stale.
  Jan's ordered mode rewrite patch also fixes this problem, but
  this patch will be needed at least for the current kernel.
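
  The failure mode above can be illustrated with a small user-space
  toy model (purely illustrative C, not kernel code; all names are
  invented):

```c
#include <assert.h>
#include <string.h>

/* Toy model of the stale-read scenario: a "page cache" entry backed
 * by a "disk" block.  A write-back failure that nobody checks lets
 * the page be reclaimed, and the next read silently resurrects the
 * stale on-disk data. */

static char disk_block[16] = "old-data";  /* last data that really hit disk */
static char page_cache[16];               /* in-memory copy */
static int  page_resident = 0;

static void app_write(const char *s)
{
	strcpy(page_cache, s);
	page_resident = 1;
}

static int writeback(void)
{
	return -5;  /* -EIO: the disk block keeps its old contents */
}

static void reclaim_page(void)
{
	page_resident = 0;  /* nobody saw the error, so the page is dropped */
}

static const char *app_read(void)
{
	if (!page_resident) {  /* fault the data back in from "disk" */
		strcpy(page_cache, disk_block);
		page_resident = 1;
	}
	return page_cache;
}
```

  With these pieces, app_write("new-data") followed by an unchecked
  writeback() failure and reclaim_page() makes app_read() return
  "old-data" again, which is exactly the silent corruption this patch
  guards against by propagating -EIO.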

[PATCH 2/4] jbd: ordered data integrity fix
  This patch fixes the ordered mode violation caused by a write
  error.  Jan's ordered mode rewrite patch will also fix this
  problem.

[PATCH 3/4] jbd: abort when failed to log metadata buffers
  Without this patch, the filesystem can become corrupted via the
  following scenario:

  1. we fail to write a metadata buffer to block B in the journal
  2. we succeed in writing the commit record
  3. the system crashes, reboots, and mounts the filesystem
  4. in the recovery phase, we succeed in reading data from block B
  5. we write the read data back to the filesystem, but it is
     stale metadata
  6. some files and directories are lost!

  This problem would not happen if JBD had JBD2's journal
  checksumming feature and it were always turned on.
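
  The six steps above can be sketched as a toy user-space model
  (illustrative C only, not JBD code; the names are invented):

```c
#include <assert.h>
#include <string.h>

/* Toy model: a failed journal write leaves stale bytes in journal
 * block B, but the commit record still marks the transaction valid,
 * so recovery replays the stale bytes over good filesystem metadata. */

enum { BLK = 16 };

static char fs_metadata[BLK]   = "good-meta";      /* on-disk fs metadata */
static char journal_blk_b[BLK] = "stale-garbage";  /* old journal contents */
static int  commit_record_valid = 0;

static int journal_write(const char *new_meta)
{
	(void)new_meta;
	return -5;  /* step 1: -EIO, block B keeps its old contents */
}

static void write_commit_record(void)
{
	commit_record_valid = 1;  /* step 2: commit record hits the disk */
}

static void recover(void)
{
	/* steps 4-5: replay committed transactions from the journal */
	if (commit_record_valid)
		memcpy(fs_metadata, journal_blk_b, BLK);
}
```

  Aborting the journal before write_commit_record() (what this patch
  does) breaks the chain at step 2, so recovery never replays the
  stale block B.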

[PATCH 4/4] ext3/jbd: fix error handling for checkpoint io
  Without this patch, the filesystem can lose some metadata
  updates even though the transactions have been committed.

Regards,

-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center




* [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased)
  2008-05-14  4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai
@ 2008-05-14  4:47 ` Hidehiro Kawai
  2008-05-14 12:56   ` Jan Kara
  2008-05-14  4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-14  4:47 UTC (permalink / raw)
  To: Andrew Morton, sct, adilger
  Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik,
	Mingming Cao, Satoshi OSHIMA, sugita

Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers

In ordered mode, we should abort journaling when an I/O error has
occurred on a file data buffer in the committing transaction.  But
some data buffers are never checked for errors:

  (a) the buffer which has already been written out by pdflush
  (b) the buffer which has been unlocked before scanned in the
      t_locked_list loop

This patch adds missing error checks and aborts journaling
appropriately.
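
The fix boils down to checking buffer_uptodate() for every data
buffer after its I/O has finished, not only for the buffers we
ourselves found locked.  A minimal sketch of that pattern
(illustrative C; the struct and names are invented, not JBD types):

```c
#include <assert.h>

/* Each buffer records whether its write-back succeeded.  A buffer
 * that pdflush already completed (with an error) is not locked when
 * we scan it, so a loop that only checks after waiting on locked
 * buffers would miss the failure. */

struct toy_buffer {
	int locked;    /* I/O still in flight? */
	int uptodate;  /* cleared when write-back failed */
};

static int check_data_buffers(struct toy_buffer *bufs, int n)
{
	int err = 0;

	for (int i = 0; i < n; i++) {
		if (bufs[i].locked)
			bufs[i].locked = 0;  /* stand-in for wait_on_buffer() */
		/* check unconditionally, even if we never saw it locked */
		if (!bufs[i].uptodate)
			err = -5;  /* -EIO, remembered until the end */
	}
	return err;
}
```

The hunk below moves the !buffer_uptodate() check out of the
buffer_locked() branch for the same reason.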

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/commit.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc2/fs/jbd/commit.c
@@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
 /*
  *  Submit all the data buffers to disk
  */
-static void journal_submit_data_buffers(journal_t *journal,
+static int journal_submit_data_buffers(journal_t *journal,
 				transaction_t *commit_transaction)
 {
 	struct journal_head *jh;
@@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
 	int locked;
 	int bufs = 0;
 	struct buffer_head **wbuf = journal->j_wbuf;
+	int err = 0;
 
 	/*
 	 * Whenever we unlock the journal and sleep, things can get added
@@ -253,6 +254,8 @@ write_out_data:
 			put_bh(bh);
 		} else {
 			BUFFER_TRACE(bh, "writeout complete: unfile");
+			if (unlikely(!buffer_uptodate(bh)))
+				err = -EIO;
 			__journal_unfile_buffer(jh);
 			jbd_unlock_bh_state(bh);
 			if (locked)
@@ -271,6 +274,8 @@ write_out_data:
 	}
 	spin_unlock(&journal->j_list_lock);
 	journal_do_submit_data(wbuf, bufs);
+
+	return err;
 }
 
 /*
@@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 */
-	err = 0;
-	journal_submit_data_buffers(journal, commit_transaction);
+	err = journal_submit_data_buffers(journal, commit_transaction);
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -426,10 +430,10 @@ void journal_commit_transaction(journal_
 		if (buffer_locked(bh)) {
 			spin_unlock(&journal->j_list_lock);
 			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
 			spin_lock(&journal->j_list_lock);
 		}
+		if (unlikely(!buffer_uptodate(bh)))
+			err = -EIO;
 		if (!inverted_lock(journal, bh)) {
 			put_bh(bh);
 			spin_lock(&journal->j_list_lock);




* [PATCH 2/4] jbd: ordered data integrity fix (rebased)
  2008-05-14  4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai
  2008-05-14  4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai
@ 2008-05-14  4:48 ` Hidehiro Kawai
  2008-05-14 13:10   ` Jan Kara
  2008-05-14  4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai
  2008-05-14  4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai
  3 siblings, 1 reply; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-14  4:48 UTC (permalink / raw)
  To: Andrew Morton, sct, adilger
  Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik,
	Mingming Cao, Satoshi OSHIMA, sugita

Subject: [PATCH 2/4] jbd: ordered data integrity fix

In ordered mode, if a buffer being dirtied exists in the committing
transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, and then dirty it.
But we must not remove the buffer from the committing transaction
when the buffer could not be written out; otherwise, it breaks the
ordered mode rule.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/transaction.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/transaction.c
+++ linux-2.6.26-rc2/fs/jbd/transaction.c
@@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
 	journal_t *journal = handle->h_transaction->t_journal;
 	int need_brelse = 0;
 	struct journal_head *jh;
+	int ret = 0;
 
 	if (is_handle_aborted(handle))
-		return 0;
+		return ret;
 
 	jh = journal_add_journal_head(bh);
 	JBUFFER_TRACE(jh, "entry");
@@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
 				   time if it is redirtied */
 			}
 
-			/* journal_clean_data_list() may have got there first */
+			/*
+			 * We shouldn't remove the buffer from the committing
+			 * transaction if it has failed to be written.
+			 * Otherwise, it breaks the ordered mode rule.
+			 */
+			if (unlikely(!buffer_uptodate(bh))) {
+				ret = -EIO;
+				goto no_journal;
+			}
+
 			if (jh->b_transaction != NULL) {
 				JBUFFER_TRACE(jh, "unfile from commit");
 				__journal_temp_unlink_buffer(jh);
@@ -1108,7 +1118,7 @@ no_journal:
 	}
 	JBUFFER_TRACE(jh, "exit");
 	journal_put_journal_head(jh);
-	return 0;
+	return ret;
 }
 
 /**




* [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)
  2008-05-14  4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai
  2008-05-14  4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai
  2008-05-14  4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai
@ 2008-05-14  4:49 ` Hidehiro Kawai
  2008-05-14 13:15   ` Jan Kara
  2008-05-14  4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai
  3 siblings, 1 reply; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-14  4:49 UTC (permalink / raw)
  To: Andrew Morton, sct, adilger
  Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik,
	Mingming Cao, Satoshi OSHIMA, sugita

Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers

If we fail to write metadata buffers to the journal space but
succeed in writing the commit record, stale data can be written
back to the filesystem as metadata in the recovery phase.

To avoid this, when we fail to write out metadata buffers, abort
the journal before writing the commit record.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/commit.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.26-rc2/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc2/fs/jbd/commit.c
@@ -703,6 +703,9 @@ wait_for_iobuf:
 		__brelse(bh);
 	}
 
+	if (err)
+		journal_abort(journal, err);
+
 	J_ASSERT (commit_transaction->t_shadow_list == NULL);
 
 	jbd_debug(3, "JBD: commit phase 5\n");




* [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14  4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai
                   ` (2 preceding siblings ...)
  2008-05-14  4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai
@ 2008-05-14  4:50 ` Hidehiro Kawai
  2008-05-14 13:16   ` Josef Bacik
  2008-05-14 14:32   ` Jan Kara
  3 siblings, 2 replies; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-14  4:50 UTC (permalink / raw)
  To: Andrew Morton, sct, adilger
  Cc: Hidehiro Kawai, linux-kernel, linux-ext4, Jan Kara, Josef Bacik,
	Mingming Cao, Satoshi OSHIMA, sugita

Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io

When a checkpointing I/O fails, the current JBD code does not check
the error and continues journaling.  This means the latest metadata
can be lost from both the journal and the filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to:

1. not remove the failed buffer from the checkpoint list in the
   case of __try_to_free_cp_buf(), because it may be released or
   overwritten by a later transaction
2. remove the failed buffer from the checkpoint list and abort
   journaling in log_do_checkpoint(), which is the last chance
3. not update the journal super block when checkpointing fails, to
   prevent the journalled contents from being cleaned
4. not clear the needs_recovery flag when checkpointing fails;
   otherwise, the journalled contents are ignored and cleaned in
   the recovery phase
5. keep the needs_recovery flag if the recovery fails
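
Several hunks below (in ext3_ioctl() and journal_recover()) apply
the same small error-propagation idiom: run both operations, but
report the first error seen instead of silently dropping one.  A
sketch (illustrative C, not a kernel API):

```c
#include <assert.h>

/* Combine two status codes: the first failure wins; a later failure
 * is reported only when the earlier operation succeeded. */
static int first_error(int err, int err2)
{
	return err ? err : err2;
}
```

This is the shape of `return (err) ? err : err2;` in the ioctl hunks
and of the `if (!err) err = err2;` pattern in journal_recover().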

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/ext3/ioctl.c     |   12 ++++----
 fs/ext3/super.c     |   13 +++++++--
 fs/jbd/checkpoint.c |   59 ++++++++++++++++++++++++++++++++++--------
 fs/jbd/journal.c    |   21 +++++++++-----
 fs/jbd/recovery.c   |    7 +++-
 include/linux/jbd.h |    7 ++++
 6 files changed, 91 insertions(+), 28 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
 	int ret = 0;
 	struct buffer_head *bh = jh2bh(jh);
 
-	if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+	if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+	    !buffer_dirty(bh) && buffer_uptodate(bh)) {
 		JBUFFER_TRACE(jh, "remove from checkpoint list");
 		ret = __journal_remove_checkpoint(jh) + 1;
 		jbd_unlock_bh_state(bh);
@@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
 }
 
 /*
+ * We couldn't write back some metadata buffers to the filesystem.
+ * To avoid unwritten-back metadata buffers from losing, set
+ * JFS_CP_ABORT flag and make sure that the journal space isn't
+ * cleaned.  This function also aborts journaling.
+ */
+static void __journal_abort_checkpoint(journal_t *journal, int errno)
+{
+	if (is_checkpoint_aborted(journal))
+		return;
+
+	spin_lock(&journal->j_state_lock);
+	journal->j_flags |= JFS_CP_ABORT;
+	spin_unlock(&journal->j_state_lock);
+	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
+	       "are still old.\n");
+	journal_abort(journal, errno);
+}
+
+/*
  * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
  * The caller must restart a list walk.  Wait for someone else to run
  * jbd_unlock_bh_state().
@@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
  * buffers. Note that we take the buffers in the opposite ordering
  * from the one in which they were submitted for IO.
  *
+ * Return 0 on success, and return <0 if some buffers have failed
+ * to be written out.
+ *
  * Called with j_list_lock held.
  */
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
 {
 	struct journal_head *jh;
 	struct buffer_head *bh;
 	tid_t this_tid;
 	int released = 0;
+	int ret = 0;
 
 	this_tid = transaction->t_tid;
 restart:
 	/* Did somebody clean up the transaction in the meanwhile? */
 	if (journal->j_checkpoint_transactions != transaction ||
 			transaction->t_tid != this_tid)
-		return;
+		return ret;
 	while (!released && transaction->t_checkpoint_io_list) {
 		jh = transaction->t_checkpoint_io_list;
 		bh = jh2bh(jh);
@@ -194,6 +218,9 @@ restart:
 			spin_lock(&journal->j_list_lock);
 			goto restart;
 		}
+		if (unlikely(!buffer_uptodate(bh)))
+			ret = -EIO;
+
 		/*
 		 * Now in whatever state the buffer currently is, we know that
 		 * it has been written out and so we can drop it from the list
@@ -203,6 +230,8 @@ restart:
 		journal_remove_journal_head(bh);
 		__brelse(bh);
 	}
+
+	return ret;
 }
 
 #define NR_BATCH	64
@@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
  * Try to flush one buffer from the checkpoint list to disk.
  *
  * Return 1 if something happened which requires us to abort the current
- * scan of the checkpoint list.
+ * scan of the checkpoint list.  Return <0 if the buffer has failed to
+ * be written out.
  *
  * Called with j_list_lock held and drops it if 1 is returned
  * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
@@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
 		log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
+		ret = 1;
+		if (unlikely(!buffer_uptodate(bh)))
+			ret = -EIO;
 		J_ASSERT_JH(jh, !buffer_jbddirty(bh));
 		BUFFER_TRACE(bh, "remove from checkpoint");
 		__journal_remove_checkpoint(jh);
@@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
 		jbd_unlock_bh_state(bh);
 		journal_remove_journal_head(bh);
 		__brelse(bh);
-		ret = 1;
 	} else {
 		/*
 		 * Important: we are about to write the buffer, and
@@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
 	 * OK, we need to start writing disk blocks.  Take one transaction
 	 * and write it.
 	 */
+	result = 0;
 	spin_lock(&journal->j_list_lock);
 	if (!journal->j_checkpoint_transactions)
 		goto out;
@@ -334,7 +367,7 @@ restart:
 		int batch_count = 0;
 		struct buffer_head *bhs[NR_BATCH];
 		struct journal_head *jh;
-		int retry = 0;
+		int retry = 0, err;
 
 		while (!retry && transaction->t_checkpoint_list) {
 			struct buffer_head *bh;
@@ -347,6 +380,8 @@ restart:
 				break;
 			}
 			retry = __process_buffer(journal, jh, bhs,&batch_count);
+			if (retry < 0)
+				result = retry;
 			if (!retry && (need_resched() ||
 				spin_needbreak(&journal->j_list_lock))) {
 				spin_unlock(&journal->j_list_lock);
@@ -371,14 +406,18 @@ restart:
 		 * Now we have cleaned up the first transaction's checkpoint
 		 * list. Let's clean up the second one
 		 */
-		__wait_cp_io(journal, transaction);
+		err = __wait_cp_io(journal, transaction);
+		if (!result)
+			result = err;
 	}
 out:
 	spin_unlock(&journal->j_list_lock);
-	result = cleanup_journal_tail(journal);
 	if (result < 0)
-		return result;
-	return 0;
+		__journal_abort_checkpoint(journal, result);
+	else
+		result = cleanup_journal_tail(journal);
+
+	return (result < 0) ? result : 0;
 }
 
 /*
Index: linux-2.6.26-rc2/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc2.orig/include/linux/jbd.h
+++ linux-2.6.26-rc2/include/linux/jbd.h
@@ -816,6 +816,8 @@ struct journal_s
 #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
 #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
 #define JFS_BARRIER	0x020	/* Use IDE barriers */
+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
+				 * clean the journal space.  */
 
 /*
  * Function declarations for the journaling transaction and buffer
@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
 	return journal->j_flags & JFS_ABORT;
 }
 
+static inline int is_checkpoint_aborted(journal_t *journal)
+{
+	return journal->j_flags & JFS_CP_ABORT;
+}
+
 static inline int is_handle_aborted(handle_t *handle)
 {
 	if (handle->h_aborted)
Index: linux-2.6.26-rc2/fs/ext3/ioctl.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c
+++ linux-2.6.26-rc2/fs/ext3/ioctl.c
@@ -239,7 +239,7 @@ setrsvsz_out:
 	case EXT3_IOC_GROUP_EXTEND: {
 		ext3_fsblk_t n_blocks_count;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -254,16 +254,16 @@ setrsvsz_out:
 		}
 		err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
 group_extend_out:
 		mnt_drop_write(filp->f_path.mnt);
-		return err;
+		return (err) ? err : err2;
 	}
 	case EXT3_IOC_GROUP_ADD: {
 		struct ext3_new_group_data input;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -280,11 +280,11 @@ group_extend_out:
 
 		err = ext3_group_add(sb, &input);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
 group_add_out:
 		mnt_drop_write(filp->f_path.mnt);
-		return err;
+		return (err) ? err : err2;
 	}
 
 
Index: linux-2.6.26-rc2/fs/ext3/super.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/ext3/super.c
+++ linux-2.6.26-rc2/fs/ext3/super.c
@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
 	ext3_xattr_put_super(sb);
 	journal_destroy(sbi->s_journal);
 	if (!(sb->s_flags & MS_RDONLY)) {
-		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
+		if (!is_checkpoint_aborted(sbi->s_journal)) {
+			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
+				EXT3_FEATURE_INCOMPAT_RECOVER);
+		}
 		es->s_state = cpu_to_le16(sbi->s_mount_state);
 		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
 		mark_buffer_dirty(sbi->s_sbh);
@@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru
 
 		/* Now we set up the journal barrier. */
 		journal_lock_updates(journal);
-		journal_flush(journal);
+
+		/*
+		 * We don't want to clear needs_recovery flag when we failed
+		 * to flush the journal.
+		 */
+		if (journal_flush(journal) < 0)
+			return;
 
 		/* Journal blocked and flushed, clear needs_recovery flag. */
 		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
Index: linux-2.6.26-rc2/fs/jbd/journal.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/journal.c
+++ linux-2.6.26-rc2/fs/jbd/journal.c
@@ -674,7 +674,7 @@ static journal_t * journal_init_common (
 	journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
 
 	/* The journal is marked for error until we succeed with recovery! */
-	journal->j_flags = JFS_ABORT;
+	journal->j_flags = JFS_ABORT | JFS_CP_ABORT;
 
 	/* Set up a default-sized revoke table for the new mount. */
 	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
@@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
 	if (journal_reset(journal))
 		goto recovery_error;
 
-	journal->j_flags &= ~JFS_ABORT;
+	journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
 	journal->j_flags |= JFS_LOADED;
 	return 0;
 
@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
 	journal->j_tail = 0;
 	journal->j_tail_sequence = ++journal->j_transaction_sequence;
 	if (journal->j_sb_buffer) {
-		journal_update_superblock(journal, 1);
+		if (!is_checkpoint_aborted(journal))
+			journal_update_superblock(journal, 1);
 		brelse(journal->j_sb_buffer);
 	}
 
@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
 
 int journal_flush(journal_t *journal)
 {
-	int err = 0;
 	transaction_t *transaction = NULL;
 	unsigned long old_tail;
 
@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
 		spin_unlock(&journal->j_state_lock);
 	}
 
-	/* ...and flush everything in the log out to disk. */
+	/* ...and flush everything in the log out to disk.
+	 * Even if an error occurs, we don't stop this loop.
+	 * We do checkpoint as much as possible.  */
 	spin_lock(&journal->j_list_lock);
-	while (!err && journal->j_checkpoint_transactions != NULL) {
+	while (journal->j_checkpoint_transactions != NULL) {
 		spin_unlock(&journal->j_list_lock);
-		err = log_do_checkpoint(journal);
+		log_do_checkpoint(journal);
 		spin_lock(&journal->j_list_lock);
 	}
 	spin_unlock(&journal->j_list_lock);
+	if (is_checkpoint_aborted(journal))
+		return -EIO;
+
 	cleanup_journal_tail(journal);
 
 	/* Finally, mark the journal as really needing no recovery.
@@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
 	J_ASSERT(journal->j_head == journal->j_tail);
 	J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
 	spin_unlock(&journal->j_state_lock);
-	return err;
+	return 0;
 }
 
 /**
Index: linux-2.6.26-rc2/fs/jbd/recovery.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/recovery.c
+++ linux-2.6.26-rc2/fs/jbd/recovery.c
@@ -223,7 +223,7 @@ do {									\
  */
 int journal_recover(journal_t *journal)
 {
-	int			err;
+	int			err, err2;
 	journal_superblock_t *	sb;
 
 	struct recovery_info	info;
@@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
 	journal->j_transaction_sequence = ++info.end_transaction;
 
 	journal_clear_revoke(journal);
-	sync_blockdev(journal->j_fs_dev);
+	err2 = sync_blockdev(journal->j_fs_dev);
+	if (!err)
+		err = err2;
+
 	return err;
 }
 




* Re: [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased)
  2008-05-14  4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai
@ 2008-05-14 12:56   ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2008-05-14 12:56 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

  Hi,

On Wed 14-05-08 13:47:31, Hidehiro Kawai wrote:
> Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers
> 
> In ordered mode, we should abort journaling when an I/O error has
> occurred on a file data buffer in the committing transaction.  But
> some data buffers are never checked for errors:
> 
>   (a) the buffer which has already been written out by pdflush
>   (b) the buffer which has been unlocked before scanned in the
>       t_locked_list loop
> 
> This patch adds missing error checks and aborts journaling
> appropriately.
> 
> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
  Acked-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/jbd/commit.c |   14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.26-rc2/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/commit.c
> +++ linux-2.6.26-rc2/fs/jbd/commit.c
> @@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
>  /*
>   *  Submit all the data buffers to disk
>   */
> -static void journal_submit_data_buffers(journal_t *journal,
> +static int journal_submit_data_buffers(journal_t *journal,
>  				transaction_t *commit_transaction)
>  {
>  	struct journal_head *jh;
> @@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
>  	int locked;
>  	int bufs = 0;
>  	struct buffer_head **wbuf = journal->j_wbuf;
> +	int err = 0;
>  
>  	/*
>  	 * Whenever we unlock the journal and sleep, things can get added
> @@ -253,6 +254,8 @@ write_out_data:
>  			put_bh(bh);
>  		} else {
>  			BUFFER_TRACE(bh, "writeout complete: unfile");
> +			if (unlikely(!buffer_uptodate(bh)))
> +				err = -EIO;
>  			__journal_unfile_buffer(jh);
>  			jbd_unlock_bh_state(bh);
>  			if (locked)
> @@ -271,6 +274,8 @@ write_out_data:
>  	}
>  	spin_unlock(&journal->j_list_lock);
>  	journal_do_submit_data(wbuf, bufs);
> +
> +	return err;
>  }
>  
>  /*
> @@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
>  	 * Now start flushing things to disk, in the order they appear
>  	 * on the transaction lists.  Data blocks go first.
>  	 */
> -	err = 0;
> -	journal_submit_data_buffers(journal, commit_transaction);
> +	err = journal_submit_data_buffers(journal, commit_transaction);
>  
>  	/*
>  	 * Wait for all previously submitted IO to complete.
> @@ -426,10 +430,10 @@ void journal_commit_transaction(journal_
>  		if (buffer_locked(bh)) {
>  			spin_unlock(&journal->j_list_lock);
>  			wait_on_buffer(bh);
> -			if (unlikely(!buffer_uptodate(bh)))
> -				err = -EIO;
>  			spin_lock(&journal->j_list_lock);
>  		}
> +		if (unlikely(!buffer_uptodate(bh)))
> +			err = -EIO;
>  		if (!inverted_lock(journal, bh)) {
>  			put_bh(bh);
>  			spin_lock(&journal->j_list_lock);
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased)
  2008-05-14  4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai
@ 2008-05-14 13:10   ` Jan Kara
  2008-05-16 10:25     ` Hidehiro Kawai
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-14 13:10 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote:
> Subject: [PATCH 2/4] jbd: ordered data integrity fix
> 
> In ordered mode, if a buffer being dirtied exists in the committing
> transaction, we write the buffer to the disk, move it from the
> committing transaction to the running transaction, and then dirty it.
> But we must not remove the buffer from the committing transaction
> when the buffer could not be written out; otherwise, it breaks the
> ordered mode rule.
  Hmm, could you elaborate a bit more on what exactly is broken and how
this helps to fix it?  Because even if we find that EIO happened on a data
buffer, we currently don't do anything more than remove the buffer from
the transaction and abort the journal.  And even if we later managed to
write the data buffer from another process before the journal is aborted,
the ordered mode guarantees are satisfied - we only guarantee that too-old
data cannot be seen; newer data can be seen easily...  Thanks.

								Honza

> 
> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
> ---
>  fs/jbd/transaction.c |   16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6.26-rc2/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/transaction.c
> +++ linux-2.6.26-rc2/fs/jbd/transaction.c
> @@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
>  	journal_t *journal = handle->h_transaction->t_journal;
>  	int need_brelse = 0;
>  	struct journal_head *jh;
> +	int ret = 0;
>  
>  	if (is_handle_aborted(handle))
> -		return 0;
> +		return ret;
>  
>  	jh = journal_add_journal_head(bh);
>  	JBUFFER_TRACE(jh, "entry");
> @@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
>  				   time if it is redirtied */
>  			}
>  
> -			/* journal_clean_data_list() may have got there first */
> +			/*
> +			 * We shouldn't remove the buffer from the committing
> +			 * transaction if it has failed to be written.
> +			 * Otherwise, it breaks the ordered mode rule.
> +			 */
> +			if (unlikely(!buffer_uptodate(bh))) {
> +				ret = -EIO;
> +				goto no_journal;
> +			}
> +
>  			if (jh->b_transaction != NULL) {
>  				JBUFFER_TRACE(jh, "unfile from commit");
>  				__journal_temp_unlink_buffer(jh);
> @@ -1108,7 +1118,7 @@ no_journal:
>  	}
>  	JBUFFER_TRACE(jh, "exit");
>  	journal_put_journal_head(jh);
> -	return 0;
> +	return ret;
>  }
>  
>  /**
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)
  2008-05-14  4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai
@ 2008-05-14 13:15   ` Jan Kara
  2008-05-16 10:26     ` Hidehiro Kawai
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-14 13:15 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
> Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
> 
> If we fail to write metadata buffers to the journal space but
> succeed in writing the commit record, stale data can be written
> back to the filesystem as metadata in the recovery phase.
> 
> To avoid this, when we fail to write out metadata buffers, abort
> the journal before writing the commit record.
> 
> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
> ---
>  fs/jbd/commit.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-2.6.26-rc2/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/commit.c
> +++ linux-2.6.26-rc2/fs/jbd/commit.c
> @@ -703,6 +703,9 @@ wait_for_iobuf:
>  		__brelse(bh);
>  	}
>  
> +	if (err)
> +		journal_abort(journal, err);
> +
>  	J_ASSERT (commit_transaction->t_shadow_list == NULL);
  Shouldn't this rather go further down, just before
journal_write_commit_record()? We should also abort if writing the revoke
records etc. failed, shouldn't we?
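For illustration, the suggested placement would look roughly like this. This
is an untested sketch against the same tree; the context lines and the exact
phase are approximate, and the comment text is my own:

```diff
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ journal_commit_transaction @@
 	jbd_debug(3, "JBD: commit phase 6\n");

+	/* Abort before writing the commit record if any descriptor,
+	 * metadata, or revoke-record write failed, so a bad journal
+	 * is never marked as committed. */
+	if (err)
+		journal_abort(journal, err);
+
 	if (journal_write_commit_record(journal, commit_transaction))
 		err = -EIO;
```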

>  	jbd_debug(3, "JBD: commit phase 5\n");
> 

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14  4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai
@ 2008-05-14 13:16   ` Josef Bacik
  2008-05-14 14:44     ` Jan Kara
  2008-05-14 14:32   ` Jan Kara
  1 sibling, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2008-05-14 13:16 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

On Wed, May 14, 2008 at 01:50:43PM +0900, Hidehiro Kawai wrote:
> Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io
> 
> When a checkpointing IO fails, current JBD code doesn't check the
> error and continue journaling.  This means latest metadata can be
> lost from both the journal and filesystem.
> 
> This patch leaves the failed metadata blocks in the journal space
> and aborts journaling in the case of log_do_checkpoint().
> To achieve this, we need to do:
> 
> 1. don't remove the failed buffer from the checkpoint list where in
>    the case of __try_to_free_cp_buf() because it may be released or
>    overwritten by a later transaction
> 2. log_do_checkpoint() is the last chance, remove the failed buffer
>    from the checkpoint list and abort journaling
> 3. when checkpointing fails, don't update the journal super block to
>    prevent the journalled contents from being cleaned
> 4. when checkpointing fails, don't clear the needs_recovery flag,
>    otherwise the journalled contents are ignored and cleaned in the
>    recovery phase
> 5. if the recovery fails, keep the needs_recovery flag
> 
> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
> ---
>  fs/ext3/ioctl.c     |   12 ++++----
>  fs/ext3/super.c     |   13 +++++++--
>  fs/jbd/checkpoint.c |   59 ++++++++++++++++++++++++++++++++++--------
>  fs/jbd/journal.c    |   21 +++++++++-----
>  fs/jbd/recovery.c   |    7 +++-
>  include/linux/jbd.h |    7 ++++
>  6 files changed, 91 insertions(+), 28 deletions(-)
> 
> Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
> +++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
> @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
>  	int ret = 0;
>  	struct buffer_head *bh = jh2bh(jh);
>  
> -	if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
> +	if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
> +	    !buffer_dirty(bh) && buffer_uptodate(bh)) {
>  		JBUFFER_TRACE(jh, "remove from checkpoint list");
>  		ret = __journal_remove_checkpoint(jh) + 1;
>  		jbd_unlock_bh_state(bh);
> @@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
>  }
>  
>  /*
> + * We couldn't write back some metadata buffers to the filesystem.
> + * To avoid unwritten-back metadata buffers from losing, set
> + * JFS_CP_ABORT flag and make sure that the journal space isn't
> + * cleaned.  This function also aborts journaling.
> + */
> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
> +{
> +	if (is_checkpoint_aborted(journal))
> +		return;
> +
> +	spin_lock(&journal->j_state_lock);
> +	journal->j_flags |= JFS_CP_ABORT;
> +	spin_unlock(&journal->j_state_lock);
> +	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
> +	       "are still old.\n");
> +	journal_abort(journal, errno);
> +}
> +
> +/*
>   * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
>   * The caller must restart a list walk.  Wait for someone else to run
>   * jbd_unlock_bh_state().
> @@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
>   * buffers. Note that we take the buffers in the opposite ordering
>   * from the one in which they were submitted for IO.
>   *
> + * Return 0 on success, and return <0 if some buffers have failed
> + * to be written out.
> + *
>   * Called with j_list_lock held.
>   */
> -static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
> +static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
>  {
>  	struct journal_head *jh;
>  	struct buffer_head *bh;
>  	tid_t this_tid;
>  	int released = 0;
> +	int ret = 0;
>  
>  	this_tid = transaction->t_tid;
>  restart:
>  	/* Did somebody clean up the transaction in the meanwhile? */
>  	if (journal->j_checkpoint_transactions != transaction ||
>  			transaction->t_tid != this_tid)
> -		return;
> +		return ret;
>  	while (!released && transaction->t_checkpoint_io_list) {
>  		jh = transaction->t_checkpoint_io_list;
>  		bh = jh2bh(jh);
> @@ -194,6 +218,9 @@ restart:
>  			spin_lock(&journal->j_list_lock);
>  			goto restart;
>  		}
> +		if (unlikely(!buffer_uptodate(bh)))
> +			ret = -EIO;
> +
>  		/*
>  		 * Now in whatever state the buffer currently is, we know that
>  		 * it has been written out and so we can drop it from the list
> @@ -203,6 +230,8 @@ restart:
>  		journal_remove_journal_head(bh);
>  		__brelse(bh);
>  	}
> +
> +	return ret;
>  }
>  
>  #define NR_BATCH	64
> @@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
>   * Try to flush one buffer from the checkpoint list to disk.
>   *
>   * Return 1 if something happened which requires us to abort the current
> - * scan of the checkpoint list.
> + * scan of the checkpoint list.  Return <0 if the buffer has failed to
> + * be written out.
>   *
>   * Called with j_list_lock held and drops it if 1 is returned
>   * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
> @@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
>  		log_wait_commit(journal, tid);
>  		ret = 1;
>  	} else if (!buffer_dirty(bh)) {
> +		ret = 1;
> +		if (unlikely(!buffer_uptodate(bh)))
> +			ret = -EIO;
>  		J_ASSERT_JH(jh, !buffer_jbddirty(bh));
>  		BUFFER_TRACE(bh, "remove from checkpoint");
>  		__journal_remove_checkpoint(jh);
> @@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
>  		jbd_unlock_bh_state(bh);
>  		journal_remove_journal_head(bh);
>  		__brelse(bh);
> -		ret = 1;
>  	} else {
>  		/*
>  		 * Important: we are about to write the buffer, and
> @@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
>  	 * OK, we need to start writing disk blocks.  Take one transaction
>  	 * and write it.
>  	 */
> +	result = 0;
>  	spin_lock(&journal->j_list_lock);
>  	if (!journal->j_checkpoint_transactions)
>  		goto out;
> @@ -334,7 +367,7 @@ restart:
>  		int batch_count = 0;
>  		struct buffer_head *bhs[NR_BATCH];
>  		struct journal_head *jh;
> -		int retry = 0;
> +		int retry = 0, err;
>  
>  		while (!retry && transaction->t_checkpoint_list) {
>  			struct buffer_head *bh;
> @@ -347,6 +380,8 @@ restart:
>  				break;
>  			}
>  			retry = __process_buffer(journal, jh, bhs,&batch_count);
> +			if (retry < 0)
> +				result = retry;
>  			if (!retry && (need_resched() ||
>  				spin_needbreak(&journal->j_list_lock))) {
>  				spin_unlock(&journal->j_list_lock);
> @@ -371,14 +406,18 @@ restart:
>  		 * Now we have cleaned up the first transaction's checkpoint
>  		 * list. Let's clean up the second one
>  		 */
> -		__wait_cp_io(journal, transaction);
> +		err = __wait_cp_io(journal, transaction);
> +		if (!result)
> +			result = err;
>  	}
>  out:
>  	spin_unlock(&journal->j_list_lock);
> -	result = cleanup_journal_tail(journal);
>  	if (result < 0)
> -		return result;
> -	return 0;
> +		__journal_abort_checkpoint(journal, result);
> +	else
> +		result = cleanup_journal_tail(journal);
> +
> +	return (result < 0) ? result : 0;
>  }
>  
>  /*
> Index: linux-2.6.26-rc2/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.26-rc2.orig/include/linux/jbd.h
> +++ linux-2.6.26-rc2/include/linux/jbd.h
> @@ -816,6 +816,8 @@ struct journal_s
>  #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
>  #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
>  #define JFS_BARRIER	0x020	/* Use IDE barriers */
> +#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
> +				 * clean the journal space.  */
>  
>  /*
>   * Function declarations for the journaling transaction and buffer
> @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>  	return journal->j_flags & JFS_ABORT;
>  }
>  
> +static inline int is_checkpoint_aborted(journal_t *journal)
> +{
> +	return journal->j_flags & JFS_CP_ABORT;
> +}
> +
>  static inline int is_handle_aborted(handle_t *handle)
>  {
>  	if (handle->h_aborted)
> Index: linux-2.6.26-rc2/fs/ext3/ioctl.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c
> +++ linux-2.6.26-rc2/fs/ext3/ioctl.c
> @@ -239,7 +239,7 @@ setrsvsz_out:
>  	case EXT3_IOC_GROUP_EXTEND: {
>  		ext3_fsblk_t n_blocks_count;
>  		struct super_block *sb = inode->i_sb;
> -		int err;
> +		int err, err2;
>  
>  		if (!capable(CAP_SYS_RESOURCE))
>  			return -EPERM;
> @@ -254,16 +254,16 @@ setrsvsz_out:
>  		}
>  		err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
>  		journal_lock_updates(EXT3_SB(sb)->s_journal);
> -		journal_flush(EXT3_SB(sb)->s_journal);
> +		err2 = journal_flush(EXT3_SB(sb)->s_journal);
>  		journal_unlock_updates(EXT3_SB(sb)->s_journal);
>  group_extend_out:
>  		mnt_drop_write(filp->f_path.mnt);
> -		return err;
> +		return (err) ? err : err2;
>  	}
>  	case EXT3_IOC_GROUP_ADD: {
>  		struct ext3_new_group_data input;
>  		struct super_block *sb = inode->i_sb;
> -		int err;
> +		int err, err2;
>  
>  		if (!capable(CAP_SYS_RESOURCE))
>  			return -EPERM;
> @@ -280,11 +280,11 @@ group_extend_out:
>  
>  		err = ext3_group_add(sb, &input);
>  		journal_lock_updates(EXT3_SB(sb)->s_journal);
> -		journal_flush(EXT3_SB(sb)->s_journal);
> +		err2 = journal_flush(EXT3_SB(sb)->s_journal);
>  		journal_unlock_updates(EXT3_SB(sb)->s_journal);
>  group_add_out:
>  		mnt_drop_write(filp->f_path.mnt);
> -		return err;
> +		return (err) ? err : err2;
>  	}
>  
>  
> Index: linux-2.6.26-rc2/fs/ext3/super.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> +++ linux-2.6.26-rc2/fs/ext3/super.c
> @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
>  	ext3_xattr_put_super(sb);
>  	journal_destroy(sbi->s_journal);
>  	if (!(sb->s_flags & MS_RDONLY)) {
> -		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> +		if (!is_checkpoint_aborted(sbi->s_journal)) {
> +			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> +				EXT3_FEATURE_INCOMPAT_RECOVER);
> +		}
>  		es->s_state = cpu_to_le16(sbi->s_mount_state);
>  		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
>  		mark_buffer_dirty(sbi->s_sbh);

Is this bit really needed?  If we abort the journal, the fs will be mounted
read-only and we should never get in here.  Is there a case where we could
abort the journal and not be flipped to read-only?  If there is such a case, I
would think we should fix it by making the fs read-only, rather than adding
this check.

> @@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru
>  
>  		/* Now we set up the journal barrier. */
>  		journal_lock_updates(journal);
> -		journal_flush(journal);
> +
> +		/*
> +		 * We don't want to clear needs_recovery flag when we failed
> +		 * to flush the journal.
> +		 */
> +		if (journal_flush(journal) < 0)
> +			return;
>  
>  		/* Journal blocked and flushed, clear needs_recovery flag. */
>  		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> Index: linux-2.6.26-rc2/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/journal.c
> +++ linux-2.6.26-rc2/fs/jbd/journal.c
> @@ -674,7 +674,7 @@ static journal_t * journal_init_common (
>  	journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
>  
>  	/* The journal is marked for error until we succeed with recovery! */
> -	journal->j_flags = JFS_ABORT;
> +	journal->j_flags = JFS_ABORT | JFS_CP_ABORT;
>  
>  	/* Set up a default-sized revoke table for the new mount. */
>  	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> @@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
>  	if (journal_reset(journal))
>  		goto recovery_error;
>  
> -	journal->j_flags &= ~JFS_ABORT;
> +	journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
>  	journal->j_flags |= JFS_LOADED;
>  	return 0;
>  
> @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
>  	journal->j_tail = 0;
>  	journal->j_tail_sequence = ++journal->j_transaction_sequence;
>  	if (journal->j_sb_buffer) {
> -		journal_update_superblock(journal, 1);
> +		if (!is_checkpoint_aborted(journal))
> +			journal_update_superblock(journal, 1);
>  		brelse(journal->j_sb_buffer);
>  	}
>  
> @@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
>  
>  int journal_flush(journal_t *journal)
>  {
> -	int err = 0;
>  	transaction_t *transaction = NULL;
>  	unsigned long old_tail;
>  
> @@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
>  		spin_unlock(&journal->j_state_lock);
>  	}
>  
> -	/* ...and flush everything in the log out to disk. */
> +	/* ...and flush everything in the log out to disk.
> +	 * Even if an error occurs, we don't stop this loop.
> +	 * We do checkpoint as much as possible.  */
>  	spin_lock(&journal->j_list_lock);
> -	while (!err && journal->j_checkpoint_transactions != NULL) {
> +	while (journal->j_checkpoint_transactions != NULL) {
>  		spin_unlock(&journal->j_list_lock);
> -		err = log_do_checkpoint(journal);
> +		log_do_checkpoint(journal);
>  		spin_lock(&journal->j_list_lock);
>  	}
>  	spin_unlock(&journal->j_list_lock);
> +	if (is_checkpoint_aborted(journal))
> +		return -EIO;
> +
>  	cleanup_journal_tail(journal);
>  
>  	/* Finally, mark the journal as really needing no recovery.
> @@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
>  	J_ASSERT(journal->j_head == journal->j_tail);
>  	J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
>  	spin_unlock(&journal->j_state_lock);
> -	return err;
> +	return 0;
>  }
>  
>  /**
> Index: linux-2.6.26-rc2/fs/jbd/recovery.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/recovery.c
> +++ linux-2.6.26-rc2/fs/jbd/recovery.c
> @@ -223,7 +223,7 @@ do {									\
>   */
>  int journal_recover(journal_t *journal)
>  {
> -	int			err;
> +	int			err, err2;
>  	journal_superblock_t *	sb;
>  
>  	struct recovery_info	info;
> @@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
>  	journal->j_transaction_sequence = ++info.end_transaction;
>  
>  	journal_clear_revoke(journal);
> -	sync_blockdev(journal->j_fs_dev);
> +	err2 = sync_blockdev(journal->j_fs_dev);
> +	if (!err)
> +		err = err2;
> +
>  	return err;
>  }
>  
> 
>

Other than that one question it looks good to me.  Thanks much,

Josef 


* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14  4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai
  2008-05-14 13:16   ` Josef Bacik
@ 2008-05-14 14:32   ` Jan Kara
  2008-05-16 10:29     ` Hidehiro Kawai
  1 sibling, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-14 14:32 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4, Jan Kara,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

  Hello,

On Wed 14-05-08 13:50:43, Hidehiro Kawai wrote:
> Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io
> 
> When a checkpointing IO fails, current JBD code doesn't check the
> error and continue journaling.  This means latest metadata can be
> lost from both the journal and filesystem.
> 
> This patch leaves the failed metadata blocks in the journal space
> and aborts journaling in the case of log_do_checkpoint().
> To achieve this, we need to do:
> 
> 1. don't remove the failed buffer from the checkpoint list where in
>    the case of __try_to_free_cp_buf() because it may be released or
>    overwritten by a later transaction
> 2. log_do_checkpoint() is the last chance, remove the failed buffer
>    from the checkpoint list and abort journaling
> 3. when checkpointing fails, don't update the journal super block to
>    prevent the journalled contents from being cleaned
> 4. when checkpointing fails, don't clear the needs_recovery flag,
>    otherwise the journalled contents are ignored and cleaned in the
>    recovery phase
> 5. if the recovery fails, keep the needs_recovery flag
  I have a few comments below..

> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
> ---
>  fs/ext3/ioctl.c     |   12 ++++----
>  fs/ext3/super.c     |   13 +++++++--
>  fs/jbd/checkpoint.c |   59 ++++++++++++++++++++++++++++++++++--------
>  fs/jbd/journal.c    |   21 +++++++++-----
>  fs/jbd/recovery.c   |    7 +++-
>  include/linux/jbd.h |    7 ++++
>  6 files changed, 91 insertions(+), 28 deletions(-)
> 
> Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
> +++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
> @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
>  	int ret = 0;
>  	struct buffer_head *bh = jh2bh(jh);
>  
> -	if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
> +	if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
> +	    !buffer_dirty(bh) && buffer_uptodate(bh)) {
>  		JBUFFER_TRACE(jh, "remove from checkpoint list");
>  		ret = __journal_remove_checkpoint(jh) + 1;
>  		jbd_unlock_bh_state(bh);
> @@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
>  }
>  
>  /*
> + * We couldn't write back some metadata buffers to the filesystem.
> + * To avoid unwritten-back metadata buffers from losing, set
> + * JFS_CP_ABORT flag and make sure that the journal space isn't
> + * cleaned.  This function also aborts journaling.
> + */
> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
> +{
> +	if (is_checkpoint_aborted(journal))
> +		return;
> +
> +	spin_lock(&journal->j_state_lock);
> +	journal->j_flags |= JFS_CP_ABORT;
> +	spin_unlock(&journal->j_state_lock);
> +	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
> +	       "are still old.\n");
> +	journal_abort(journal, errno);
> +}
> +
> +/*
>   * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
>   * The caller must restart a list walk.  Wait for someone else to run
>   * jbd_unlock_bh_state().
> @@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
>   * buffers. Note that we take the buffers in the opposite ordering
>   * from the one in which they were submitted for IO.
>   *
> + * Return 0 on success, and return <0 if some buffers have failed
> + * to be written out.
> + *
>   * Called with j_list_lock held.
>   */
> -static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
> +static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
>  {
>  	struct journal_head *jh;
>  	struct buffer_head *bh;
>  	tid_t this_tid;
>  	int released = 0;
> +	int ret = 0;
>  
>  	this_tid = transaction->t_tid;
>  restart:
>  	/* Did somebody clean up the transaction in the meanwhile? */
>  	if (journal->j_checkpoint_transactions != transaction ||
>  			transaction->t_tid != this_tid)
> -		return;
> +		return ret;
>  	while (!released && transaction->t_checkpoint_io_list) {
>  		jh = transaction->t_checkpoint_io_list;
>  		bh = jh2bh(jh);
> @@ -194,6 +218,9 @@ restart:
>  			spin_lock(&journal->j_list_lock);
>  			goto restart;
>  		}
> +		if (unlikely(!buffer_uptodate(bh)))
> +			ret = -EIO;
> +
>  		/*
>  		 * Now in whatever state the buffer currently is, we know that
>  		 * it has been written out and so we can drop it from the list
> @@ -203,6 +230,8 @@ restart:
>  		journal_remove_journal_head(bh);
>  		__brelse(bh);
>  	}
> +
> +	return ret;
>  }
>  
>  #define NR_BATCH	64
> @@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
>   * Try to flush one buffer from the checkpoint list to disk.
>   *
>   * Return 1 if something happened which requires us to abort the current
> - * scan of the checkpoint list.
> + * scan of the checkpoint list.  Return <0 if the buffer has failed to
> + * be written out.
>   *
>   * Called with j_list_lock held and drops it if 1 is returned
>   * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
> @@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
>  		log_wait_commit(journal, tid);
>  		ret = 1;
>  	} else if (!buffer_dirty(bh)) {
> +		ret = 1;
> +		if (unlikely(!buffer_uptodate(bh)))
> +			ret = -EIO;
>  		J_ASSERT_JH(jh, !buffer_jbddirty(bh));
>  		BUFFER_TRACE(bh, "remove from checkpoint");
>  		__journal_remove_checkpoint(jh);
> @@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
>  		jbd_unlock_bh_state(bh);
>  		journal_remove_journal_head(bh);
>  		__brelse(bh);
> -		ret = 1;
>  	} else {
>  		/*
>  		 * Important: we are about to write the buffer, and
> @@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
>  	 * OK, we need to start writing disk blocks.  Take one transaction
>  	 * and write it.
>  	 */
> +	result = 0;
>  	spin_lock(&journal->j_list_lock);
>  	if (!journal->j_checkpoint_transactions)
>  		goto out;
> @@ -334,7 +367,7 @@ restart:
>  		int batch_count = 0;
>  		struct buffer_head *bhs[NR_BATCH];
>  		struct journal_head *jh;
> -		int retry = 0;
> +		int retry = 0, err;
>  
>  		while (!retry && transaction->t_checkpoint_list) {
>  			struct buffer_head *bh;
> @@ -347,6 +380,8 @@ restart:
>  				break;
>  			}
>  			retry = __process_buffer(journal, jh, bhs,&batch_count);
> +			if (retry < 0)
> +				result = retry;
>  			if (!retry && (need_resched() ||
>  				spin_needbreak(&journal->j_list_lock))) {
>  				spin_unlock(&journal->j_list_lock);
> @@ -371,14 +406,18 @@ restart:
>  		 * Now we have cleaned up the first transaction's checkpoint
>  		 * list. Let's clean up the second one
>  		 */
> -		__wait_cp_io(journal, transaction);
> +		err = __wait_cp_io(journal, transaction);
> +		if (!result)
> +			result = err;
>  	}
>  out:
>  	spin_unlock(&journal->j_list_lock);
> -	result = cleanup_journal_tail(journal);
>  	if (result < 0)
> -		return result;
> -	return 0;
> +		__journal_abort_checkpoint(journal, result);
> +	else
> +		result = cleanup_journal_tail(journal);
> +
> +	return (result < 0) ? result : 0;
>  }
>  
>  /*
> Index: linux-2.6.26-rc2/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.26-rc2.orig/include/linux/jbd.h
> +++ linux-2.6.26-rc2/include/linux/jbd.h
> @@ -816,6 +816,8 @@ struct journal_s
>  #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
>  #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
>  #define JFS_BARRIER	0x020	/* Use IDE barriers */
> +#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
> +				 * clean the journal space.  */
>  
>  /*
>   * Function declarations for the journaling transaction and buffer
> @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>  	return journal->j_flags & JFS_ABORT;
>  }
>  
> +static inline int is_checkpoint_aborted(journal_t *journal)
> +{
> +	return journal->j_flags & JFS_CP_ABORT;
> +}
> +
  Actually, I don't think this new flag is needed (not that it would really
harm anything). At least at the places where you check it, you could just as
well check whether the journal is aborted (except perhaps in
journal_destroy(), but see my comment there).

>  static inline int is_handle_aborted(handle_t *handle)
>  {
>  	if (handle->h_aborted)
> Index: linux-2.6.26-rc2/fs/ext3/ioctl.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c
> +++ linux-2.6.26-rc2/fs/ext3/ioctl.c
> @@ -239,7 +239,7 @@ setrsvsz_out:
>  	case EXT3_IOC_GROUP_EXTEND: {
>  		ext3_fsblk_t n_blocks_count;
>  		struct super_block *sb = inode->i_sb;
> -		int err;
> +		int err, err2;
>  
>  		if (!capable(CAP_SYS_RESOURCE))
>  			return -EPERM;
> @@ -254,16 +254,16 @@ setrsvsz_out:
>  		}
>  		err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
>  		journal_lock_updates(EXT3_SB(sb)->s_journal);
> -		journal_flush(EXT3_SB(sb)->s_journal);
> +		err2 = journal_flush(EXT3_SB(sb)->s_journal);
>  		journal_unlock_updates(EXT3_SB(sb)->s_journal);
>  group_extend_out:
>  		mnt_drop_write(filp->f_path.mnt);
> -		return err;
> +		return (err) ? err : err2;
>  	}
>  	case EXT3_IOC_GROUP_ADD: {
>  		struct ext3_new_group_data input;
>  		struct super_block *sb = inode->i_sb;
> -		int err;
> +		int err, err2;
>  
>  		if (!capable(CAP_SYS_RESOURCE))
>  			return -EPERM;
> @@ -280,11 +280,11 @@ group_extend_out:
>  
>  		err = ext3_group_add(sb, &input);
>  		journal_lock_updates(EXT3_SB(sb)->s_journal);
> -		journal_flush(EXT3_SB(sb)->s_journal);
> +		err2 = journal_flush(EXT3_SB(sb)->s_journal);
>  		journal_unlock_updates(EXT3_SB(sb)->s_journal);
>  group_add_out:
>  		mnt_drop_write(filp->f_path.mnt);
> -		return err;
> +		return (err) ? err : err2;
>  	}
  Please split the ext3 changes into a separate patch. There's no reason for
them to be together with the JBD fixes (i.e., the JBD fixes don't depend on
ext3 being fixed). Thanks.

> Index: linux-2.6.26-rc2/fs/ext3/super.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> +++ linux-2.6.26-rc2/fs/ext3/super.c
> @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
>  	ext3_xattr_put_super(sb);
>  	journal_destroy(sbi->s_journal);
>  	if (!(sb->s_flags & MS_RDONLY)) {
> -		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> +		if (!is_checkpoint_aborted(sbi->s_journal)) {
> +			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> +				EXT3_FEATURE_INCOMPAT_RECOVER);
> +		}
  I think you should test here whether the journal is aborted (and
EXT3_MOUNT_ABORT isn't set) and, if so, call ext3_abort() and avoid updating
the super block entirely...
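In ext3_put_super() that might look something like the sketch below. This is
hypothetical and untested: the abort state has to be sampled before
journal_destroy() frees the journal, and the message passed to ext3_abort()
is made up:

```diff
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ ext3_put_super @@
 	ext3_xattr_put_super(sb);
+	/* Sample the abort state now: journal_destroy() frees the journal. */
+	if (is_journal_aborted(sbi->s_journal) && !test_opt(sb, ABORT))
+		ext3_abort(sb, __func__, "Checkpointing failed");
 	journal_destroy(sbi->s_journal);
 	if (!(sb->s_flags & MS_RDONLY)) {
 		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
```

Since ext3_abort() flips the fs read-only, the existing
!(sb->s_flags & MS_RDONLY) test below would then skip clearing the recovery
flag and updating the super block without any extra check.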

>  		es->s_state = cpu_to_le16(sbi->s_mount_state);
>  		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
>  		mark_buffer_dirty(sbi->s_sbh);
> @@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru
>  
>  		/* Now we set up the journal barrier. */
>  		journal_lock_updates(journal);
> -		journal_flush(journal);
> +
> +		/*
> +		 * We don't want to clear needs_recovery flag when we failed
> +		 * to flush the journal.
> +		 */
> +		if (journal_flush(journal) < 0)
> +			return;
>  
>  		/* Journal blocked and flushed, clear needs_recovery flag. */
>  		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> Index: linux-2.6.26-rc2/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/journal.c
> +++ linux-2.6.26-rc2/fs/jbd/journal.c
> @@ -674,7 +674,7 @@ static journal_t * journal_init_common (
>  	journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
>  
>  	/* The journal is marked for error until we succeed with recovery! */
> -	journal->j_flags = JFS_ABORT;
> +	journal->j_flags = JFS_ABORT | JFS_CP_ABORT;
>  
>  	/* Set up a default-sized revoke table for the new mount. */
>  	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> @@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
>  	if (journal_reset(journal))
>  		goto recovery_error;
>  
> -	journal->j_flags &= ~JFS_ABORT;
> +	journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
>  	journal->j_flags |= JFS_LOADED;
>  	return 0;
>  
> @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
>  	journal->j_tail = 0;
>  	journal->j_tail_sequence = ++journal->j_transaction_sequence;
>  	if (journal->j_sb_buffer) {
> -		journal_update_superblock(journal, 1);
> +		if (!is_checkpoint_aborted(journal))
> +			journal_update_superblock(journal, 1);
>  		brelse(journal->j_sb_buffer);
>  	}
  I don't like this much. I'd rather avoid updating j_tail and
j_tail_sequence above entirely when the journal is aborted, but still write
the journal superblock so that the information about the abort gets
written...
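Roughly like this, perhaps. A hypothetical, untested sketch, assuming
journal_update_superblock() propagates j_errno to the on-disk superblock:

```diff
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ journal_destroy @@
-	journal->j_tail = 0;
-	journal->j_tail_sequence = ++journal->j_transaction_sequence;
+	if (!is_journal_aborted(journal)) {
+		journal->j_tail = 0;
+		journal->j_tail_sequence =
+			++journal->j_transaction_sequence;
+	}
 	if (journal->j_sb_buffer) {
+		/* Write the superblock even on abort, so the error and
+		 * the un-advanced tail are recorded on disk. */
 		journal_update_superblock(journal, 1);
 		brelse(journal->j_sb_buffer);
 	}
```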

> @@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
>  
>  int journal_flush(journal_t *journal)
>  {
> -	int err = 0;
>  	transaction_t *transaction = NULL;
>  	unsigned long old_tail;
>  
> @@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
>  		spin_unlock(&journal->j_state_lock);
>  	}
>  
> -	/* ...and flush everything in the log out to disk. */
> +	/* ...and flush everything in the log out to disk.
> +	 * Even if an error occurs, we don't stop this loop.
> +	 * We do checkpoint as much as possible.  */
>  	spin_lock(&journal->j_list_lock);
> -	while (!err && journal->j_checkpoint_transactions != NULL) {
> +	while (journal->j_checkpoint_transactions != NULL) {
>  		spin_unlock(&journal->j_list_lock);
> -		err = log_do_checkpoint(journal);
> +		log_do_checkpoint(journal);
>  		spin_lock(&journal->j_list_lock);
>  	}
>  	spin_unlock(&journal->j_list_lock);
> +	if (is_checkpoint_aborted(journal))
> +		return -EIO;
> +
  Why do you change the loop? If we fail to checkpoint some buffer, we stop
journaling anyway, and so the journal will be replayed when fsck is run...

>  	cleanup_journal_tail(journal);
>  
>  	/* Finally, mark the journal as really needing no recovery.
> @@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
>  	J_ASSERT(journal->j_head == journal->j_tail);
>  	J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
>  	spin_unlock(&journal->j_state_lock);
> -	return err;
> +	return 0;
>  }
>  
>  /**
> Index: linux-2.6.26-rc2/fs/jbd/recovery.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/recovery.c
> +++ linux-2.6.26-rc2/fs/jbd/recovery.c
> @@ -223,7 +223,7 @@ do {									\
>   */
>  int journal_recover(journal_t *journal)
>  {
> -	int			err;
> +	int			err, err2;
>  	journal_superblock_t *	sb;
>  
>  	struct recovery_info	info;
> @@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
>  	journal->j_transaction_sequence = ++info.end_transaction;
>  
>  	journal_clear_revoke(journal);
> -	sync_blockdev(journal->j_fs_dev);
> +	err2 = sync_blockdev(journal->j_fs_dev);
> +	if (!err)
> +		err = err2;
> +
>  	return err;
>  }
>  

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14 14:44     ` Jan Kara
@ 2008-05-14 14:37       ` Josef Bacik
  2008-05-16 10:28         ` Hidehiro Kawai
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Bacik @ 2008-05-14 14:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, Hidehiro Kawai, Andrew Morton, sct, adilger,
	linux-kernel, linux-ext4, Mingming Cao, Satoshi OSHIMA, sugita

On Wed, May 14, 2008 at 04:44:10PM +0200, Jan Kara wrote:
> > >  
> > > Index: linux-2.6.26-rc2/fs/ext3/super.c
> > > ===================================================================
> > > --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> > > +++ linux-2.6.26-rc2/fs/ext3/super.c
> > > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> > >  	ext3_xattr_put_super(sb);
> > >  	journal_destroy(sbi->s_journal);
> > >  	if (!(sb->s_flags & MS_RDONLY)) {
> > > -		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> > > +		if (!is_checkpoint_aborted(sbi->s_journal)) {
> > > +			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> > > +				EXT3_FEATURE_INCOMPAT_RECOVER);
> > > +		}
> > >  		es->s_state = cpu_to_le16(sbi->s_mount_state);
> > >  		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
> > >  		mark_buffer_dirty(sbi->s_sbh);
> > 
> > Is this bit here really needed?  If we abort the journal the fs will be mounted
> > read only and we should never get in here.  Is there a case where we could abort
> > the journal and not be flipped to being read-only?  If there is such a case I
> > would think that we should fix that by making the fs read-only, and not have
> > this check.
>   Actually, journal_abort() (which could be called from journal_destroy())
> does nothing to the filesystem as such. So at this moment, ext3 can still
> happily think everything is OK. We only detect aborted journal in
> ext3_journal_start_sb() and call ext3_abort() in that case, which does all
> that is needed...
>

Hmm, you're right, I was thinking we did some other stuff before put_super which
would have caught a journal abort, but it looks like that's not the case.  Still,
we shouldn't call is_checkpoint_aborted(sbi->s_journal) here, since
journal_destroy() kfree's the journal.  We should probably move the
is_journal_aborted() check above that call, or something similar.  Thanks,

Josef 


* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14 13:16   ` Josef Bacik
@ 2008-05-14 14:44     ` Jan Kara
  2008-05-14 14:37       ` Josef Bacik
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-14 14:44 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hidehiro Kawai, Andrew Morton, sct, adilger, linux-kernel,
	linux-ext4, Jan Kara, Mingming Cao, Satoshi OSHIMA, sugita

> >  
> > Index: linux-2.6.26-rc2/fs/ext3/super.c
> > ===================================================================
> > --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> > +++ linux-2.6.26-rc2/fs/ext3/super.c
> > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> >  	ext3_xattr_put_super(sb);
> >  	journal_destroy(sbi->s_journal);
> >  	if (!(sb->s_flags & MS_RDONLY)) {
> > -		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> > +		if (!is_checkpoint_aborted(sbi->s_journal)) {
> > +			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> > +				EXT3_FEATURE_INCOMPAT_RECOVER);
> > +		}
> >  		es->s_state = cpu_to_le16(sbi->s_mount_state);
> >  		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
> >  		mark_buffer_dirty(sbi->s_sbh);
> 
> Is this bit here really needed?  If we abort the journal the fs will be mounted
> read only and we should never get in here.  Is there a case where we could abort
> the journal and not be flipped to being read-only?  If there is such a case I
> would think that we should fix that by making the fs read-only, and not have
> this check.
  Actually, journal_abort() (which could be called from journal_destroy())
does nothing to the filesystem as such. So at this moment, ext3 can still
happily think everything is OK. We only detect aborted journal in
ext3_journal_start_sb() and call ext3_abort() in that case, which does all
that is needed...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased)
  2008-05-14 13:10   ` Jan Kara
@ 2008-05-16 10:25     ` Hidehiro Kawai
  2008-05-19  3:11       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-16 10:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

Hi,

Thanks for your comment.

Jan Kara wrote:

> On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote:
> 
>>Subject: [PATCH 2/4] jbd: ordered data integrity fix
>>
>>In ordered mode, if a buffer being dirtied exists in the committing
>>transaction, we write the buffer to the disk, move it from the
>>committing transaction to the running transaction, then dirty it.
>>But we don't have to remove the buffer from the committing
>>transaction when the buffer couldn't be written out, otherwise it
>>breaks the ordered mode rule.
> 
>   Hmm, could you elaborate a bit more what exactly is broken and how does
> this help to fix it? Because even if we find EIO happened on data buffer,
> we currently don't do anything else than just remove the buffer from the
> transaction and abort the journal. And even if we later managed to write
> the data buffer from other process before the journal is aborted, ordered
> mode guarantees are satisfied - we only guarantee that too old data cannot
> be seen, newer can be seen easily... Thanks.

In the case I stated above, error checking is postponed to the next
(currently running) transaction because the buffer is removed from the
committing transaction before it is checked for an error.  This can
happen repeatedly, so the error may not be detected for a long time.
However, the error is eventually detected by, for example,
journal_commit_transaction(), and then we can abort the journal.  So this
problem is less serious than the ones addressed by the other patches I sent.

Thanks,

-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)
  2008-05-14 13:15   ` Jan Kara
@ 2008-05-16 10:26     ` Hidehiro Kawai
  2008-05-19  3:14       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-16 10:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

Hi,

Thank you for the review.

Jan Kara wrote:

> On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
> 
>>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
>>
>>If we failed to write metadata buffers to the journal space and
>>succeeded to write the commit record, stale data can be written
>>back to the filesystem as metadata in the recovery phase.
>>
>>To avoid this, when we failed to write out metadata buffers,
>>abort the journal before writing the commit record.
>>
>>Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
>>---
>> fs/jbd/commit.c |    3 +++
>> 1 file changed, 3 insertions(+)
>>
>>Index: linux-2.6.26-rc2/fs/jbd/commit.c
>>===================================================================
>>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
>>+++ linux-2.6.26-rc2/fs/jbd/commit.c
>>@@ -703,6 +703,9 @@ wait_for_iobuf:
>> 		__brelse(bh);
>> 	}
>> 
>>+	if (err)
>>+		journal_abort(journal, err);
>>+
>> 	J_ASSERT (commit_transaction->t_shadow_list == NULL);
> 
>   Shouldn't this rather be further just before
> journal_write_commit_record()? We should abort also if writing revoke
> records etc. failed, shouldn't we?

Unlike metadata blocks, each revoke block has a descriptor with the
sequence number of the committing transaction.  If we failed to write
a revoke block, there will be an old control block, metadata block,
or zero-filled block where we tried to write the revoke block.
In the recovery process, this old invalid block is detected by
checking its magic number and sequence number, and the transaction
is then ignored even if we succeeded in writing the commit record.
So I think we don't need to check for errors just after writing
revoke records.

Thanks,
 
>> 	jbd_debug(3, "JBD: commit phase 5\n");
>>

-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14 14:37       ` Josef Bacik
@ 2008-05-16 10:28         ` Hidehiro Kawai
  0 siblings, 0 replies; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-16 10:28 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Mingming Cao, Satoshi OSHIMA, sugita

Hi,

Thank you for the review.

Josef Bacik wrote:

> On Wed, May 14, 2008 at 04:44:10PM +0200, Jan Kara wrote:
> 
>>>> 
>>>>Index: linux-2.6.26-rc2/fs/ext3/super.c
>>>>===================================================================
>>>>--- linux-2.6.26-rc2.orig/fs/ext3/super.c
>>>>+++ linux-2.6.26-rc2/fs/ext3/super.c
>>>>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
>>>> 	ext3_xattr_put_super(sb);
>>>> 	journal_destroy(sbi->s_journal);
>>>> 	if (!(sb->s_flags & MS_RDONLY)) {
>>>>-		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
>>>>+		if (!is_checkpoint_aborted(sbi->s_journal)) {
>>>>+			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
>>>>+				EXT3_FEATURE_INCOMPAT_RECOVER);
>>>>+		}
>>>> 		es->s_state = cpu_to_le16(sbi->s_mount_state);
>>>> 		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
>>>> 		mark_buffer_dirty(sbi->s_sbh);
>>>
>>>Is this bit here really needed?  If we abort the journal the fs will be mounted
>>>read only and we should never get in here.  Is there a case where we could abort
>>>the journal and not be flipped to being read-only?  If there is such a case I
>>>would think that we should fix that by making the fs read-only, and not have
>>>this check.
>>
>>  Actually, journal_abort() (which could be called from journal_destroy())
>>does nothing to the filesystem as such. So at this moment, ext3 can still
>>happily think everything is OK. We only detect aborted journal in
>>ext3_journal_start_sb() and call ext3_abort() in that case, which does all
>>that is needed...

Yes, that is why I added this check.
 
 
> Hmm you're right, I was thinking we did some other stuff before put_super which
> would have caught a journal abort but it looks like thats not the case.  Still
> shouldn't do is_checkpoint_aborted(sbi->s_journal) since journal_destroy()
> kfree's the journal.  Should probably move the is_journal_aborted() check above
> that or something.  Thanks,

Good catch, I will fix it.
Thanks!

-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-14 14:32   ` Jan Kara
@ 2008-05-16 10:29     ` Hidehiro Kawai
  2008-05-19  3:38       ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-16 10:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

Hi,
Thank you for the review.

Jan Kara wrote:

>> /*
>> + * We couldn't write back some metadata buffers to the filesystem.
>> + * To avoid unwritten-back metadata buffers from losing, set
>> + * JFS_CP_ABORT flag and make sure that the journal space isn't
>> + * cleaned.  This function also aborts journaling.
>> + */
>> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
>> +{
>> +	if (is_checkpoint_aborted(journal))
>> +		return;
>> +
>> +	spin_lock(&journal->j_state_lock);
>> +	journal->j_flags |= JFS_CP_ABORT;
>> +	spin_unlock(&journal->j_state_lock);
>> +	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
>> +	       "are still old.\n");
>> +	journal_abort(journal, errno);
>> +}

[snip]

>>Index: linux-2.6.26-rc2/include/linux/jbd.h
>>===================================================================
>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
>>+++ linux-2.6.26-rc2/include/linux/jbd.h
>>@@ -816,6 +816,8 @@ struct journal_s
>> #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
>> #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
>> #define JFS_BARRIER	0x020	/* Use IDE barriers */
>>+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
>>+				 * clean the journal space.  */
>> 
>> /*
>>  * Function declarations for the journaling transaction and buffer
>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>> 	return journal->j_flags & JFS_ABORT;
>> }
>> 
>>+static inline int is_checkpoint_aborted(journal_t *journal)
>>+{
>>+	return journal->j_flags & JFS_CP_ABORT;
>>+}
>>+
> 
>   Actually, I don't think this new flag is needed (not that it would really
> harm anything). At least at the places where you check it you can as well
> check whether the journal is aborted (maybe except for journal_destroy()
> but see my comment there).

As you say, JFS_CP_ABORT isn't strictly needed in most cases,
but it is needed for __journal_abort_checkpoint(), which reports the
checkpoint-related error via printk only once.  If we used JFS_ABORT
to check whether this is the second call, the error message might
never be printed, because the journal may already have been aborted
for another reason.  And if we don't check for repeated calls at all,
the error message may be printed several times.

Instead of using __journal_abort_checkpoint(), we could do a similar
thing by adding a printk() directly in journal_destroy(),
journal_flush(), and __log_wait_for_space() (although the message
could then appear more than once).

I agree that we should avoid adding another flag where possible.
So I'll try to revise the patch to remove the flag if you agree to
adding the 3 printk()s.

>   Please split ext3 changes into a separate patch. There's no reason
> why they should be together with JBD fixes (i.e. JBD fixes don't depend on
> ext3 being fixed). Thanks.

OK, I'll do so.


>>Index: linux-2.6.26-rc2/fs/ext3/super.c
>>===================================================================
>>--- linux-2.6.26-rc2.orig/fs/ext3/super.c
>>+++ linux-2.6.26-rc2/fs/ext3/super.c
>>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
>> 	ext3_xattr_put_super(sb);
>> 	journal_destroy(sbi->s_journal);
>> 	if (!(sb->s_flags & MS_RDONLY)) {
>>-		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
>>+		if (!is_checkpoint_aborted(sbi->s_journal)) {
>>+			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
>>+				EXT3_FEATURE_INCOMPAT_RECOVER);
>>+		}
> 
>   I think you should test here whether the journal is aborted (and
> EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just
> completely avoid updating super block...

Your idea looks good.  But as Josef pointed out, journal_destroy()
frees sbi->s_journal, so we need to make journal_destroy() return the
error status.


>>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
>> 	journal->j_tail = 0;
>> 	journal->j_tail_sequence = ++journal->j_transaction_sequence;
>> 	if (journal->j_sb_buffer) {
>>-		journal_update_superblock(journal, 1);
>>+		if (!is_checkpoint_aborted(journal))
>>+			journal_update_superblock(journal, 1);
>> 		brelse(journal->j_sb_buffer);
>> 	}
> 
>   I don't like this much. I'd rather completely avoid updating j_tail and
> j_tail_sequence above in case the journal is aborted but I'd write the
> journal superblock so that information about abortion gets written...

log_do_checkpoint() calls cleanup_journal_tail(), which advances
j_tail and j_tail_sequence.  So if we adopt the policy that we don't
modify j_tail and j_tail_sequence when checkpointing has failed, we
should also fix cleanup_journal_tail().

I adopted the policy that we don't update the journal superblock
when checkpointing has failed.  When checkpointing fails,
journal_abort() is called before cleanup_journal_tail(), so the
information we want written out is still in the journal superblock.


>>@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
>> 
>> int journal_flush(journal_t *journal)
>> {
>>-	int err = 0;
>> 	transaction_t *transaction = NULL;
>> 	unsigned long old_tail;
>> 
>>@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
>> 		spin_unlock(&journal->j_state_lock);
>> 	}
>> 
>>-	/* ...and flush everything in the log out to disk. */
>>+	/* ...and flush everything in the log out to disk.
>>+	 * Even if an error occurs, we don't stop this loop.
>>+	 * We do checkpoint as much as possible.  */
>> 	spin_lock(&journal->j_list_lock);
>>-	while (!err && journal->j_checkpoint_transactions != NULL) {
>>+	while (journal->j_checkpoint_transactions != NULL) {
>> 		spin_unlock(&journal->j_list_lock);
>>-		err = log_do_checkpoint(journal);
>>+		log_do_checkpoint(journal);
>> 		spin_lock(&journal->j_list_lock);
>> 	}
>> 	spin_unlock(&journal->j_list_lock);
>>+	if (is_checkpoint_aborted(journal))
>>+		return -EIO;
>>+
> 
>   Why do you change the loop? If we fail to checkpoint some buffer, we stop
> journaling anyway and so journal will be replayed when fsck is run...

Certainly, fsck or journal_recover() replays the unwritten-back metadata
blocks.  journal_destroy() also calls log_do_checkpoint() against all
un-checkpointed transactions.  I just thought `flush' meant writing out
all the data we are supposed to write to disk.  There is no problem if I
leave the loop unchanged.

Thanks,

-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



* Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased)
  2008-05-16 10:25     ` Hidehiro Kawai
@ 2008-05-19  3:11       ` Jan Kara
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Kara @ 2008-05-19  3:11 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

  Hello,

On Fri 16-05-08 19:25:44, Hidehiro Kawai wrote:
> Jan Kara wrote:
> 
> > On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote:
> > 
> >>Subject: [PATCH 2/4] jbd: ordered data integrity fix
> >>
> >>In ordered mode, if a buffer being dirtied exists in the committing
> >>transaction, we write the buffer to the disk, move it from the
> >>committing transaction to the running transaction, then dirty it.
> >>But we don't have to remove the buffer from the committing
> >>transaction when the buffer couldn't be written out, otherwise it
> >>breaks the ordered mode rule.
> > 
> >   Hmm, could you elaborate a bit more what exactly is broken and how does
> > this help to fix it? Because even if we find EIO happened on data buffer,
> > we currently don't do anything else than just remove the buffer from the
> > transaction and abort the journal. And even if we later managed to write
> > the data buffer from other process before the journal is aborted, ordered
> > mode guarantees are satisfied - we only guarantee that too old data cannot
> > be seen, newer can be seen easily... Thanks.
> 
> In the case I stated above, error checking is postponed to the next
> (currently running) transaction because the buffer is removed from the
> committing transaction before it is checked for an error.  This can
> happen repeatedly, so the error may not be detected for a long time.
> However, the error is eventually detected by, for example,
> journal_commit_transaction(), and then we can abort the journal.  So this
> problem is less serious than the ones addressed by the other patches I sent.
  OK, I see. So I agree with the change, but please add this explanation
(like: we cannot remove a buffer with an IO error from the committing
transaction, because otherwise we would miss the error and the commit
would not abort) to the comment in journal_dirty_data(). Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)
  2008-05-16 10:26     ` Hidehiro Kawai
@ 2008-05-19  3:14       ` Jan Kara
  2008-05-21  1:33         ` Hidehiro Kawai
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-19  3:14 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

  Hello,

On Fri 16-05-08 19:26:57, Hidehiro Kawai wrote:
> Jan Kara wrote:
> 
> > On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
> > 
> >>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
> >>
> >>If we failed to write metadata buffers to the journal space and
> >>succeeded to write the commit record, stale data can be written
> >>back to the filesystem as metadata in the recovery phase.
> >>
> >>To avoid this, when we failed to write out metadata buffers,
> >>abort the journal before writing the commit record.
> >>
> >>Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
> >>---
> >> fs/jbd/commit.c |    3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >>Index: linux-2.6.26-rc2/fs/jbd/commit.c
> >>===================================================================
> >>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
> >>+++ linux-2.6.26-rc2/fs/jbd/commit.c
> >>@@ -703,6 +703,9 @@ wait_for_iobuf:
> >> 		__brelse(bh);
> >> 	}
> >> 
> >>+	if (err)
> >>+		journal_abort(journal, err);
> >>+
> >> 	J_ASSERT (commit_transaction->t_shadow_list == NULL);
> > 
> >   Shouldn't this rather be further just before
> > journal_write_commit_record()? We should abort also if writing revoke
> > records etc. failed, shouldn't we?
> 
> Unlike metadata blocks, each revoke block has a descriptor with the
> sequence number of the committing transaction.  If we failed to write
> a revoke block, there will be an old control block, metadata block,
> or zero-filled block where we tried to write the revoke block.
> In the recovery process, this old invalid block is detected by
> checking its magic number and sequence number, and the transaction
> is then ignored even if we succeeded in writing the commit record.
> So I think we don't need to check for errors just after writing
> revoke records.
  Yes, I agree that not doing such a check will not cause data corruption,
but I still think that if we fail to properly commit a transaction, we
should detect the error and abort the journal...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-16 10:29     ` Hidehiro Kawai
@ 2008-05-19  3:38       ` Jan Kara
  2008-05-21  1:34         ` Hidehiro Kawai
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-19  3:38 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Jan Kara, Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

  Hello,

On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
> Jan Kara wrote:
> >> /*
> >> + * We couldn't write back some metadata buffers to the filesystem.
> >> + * To avoid unwritten-back metadata buffers from losing, set
> >> + * JFS_CP_ABORT flag and make sure that the journal space isn't
> >> + * cleaned.  This function also aborts journaling.
> >> + */
> >> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
> >> +{
> >> +	if (is_checkpoint_aborted(journal))
> >> +		return;
> >> +
> >> +	spin_lock(&journal->j_state_lock);
> >> +	journal->j_flags |= JFS_CP_ABORT;
> >> +	spin_unlock(&journal->j_state_lock);
> >> +	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
> >> +	       "are still old.\n");
> >> +	journal_abort(journal, errno);
> >> +}
> 
> [snip]
> 
> >>Index: linux-2.6.26-rc2/include/linux/jbd.h
> >>===================================================================
> >>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
> >>+++ linux-2.6.26-rc2/include/linux/jbd.h
> >>@@ -816,6 +816,8 @@ struct journal_s
> >> #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
> >> #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
> >> #define JFS_BARRIER	0x020	/* Use IDE barriers */
> >>+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
> >>+				 * clean the journal space.  */
> >> 
> >> /*
> >>  * Function declarations for the journaling transaction and buffer
> >>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
> >> 	return journal->j_flags & JFS_ABORT;
> >> }
> >> 
> >>+static inline int is_checkpoint_aborted(journal_t *journal)
> >>+{
> >>+	return journal->j_flags & JFS_CP_ABORT;
> >>+}
> >>+
> > 
> >   Actually, I don't think this new flag is needed (not that it would really
> > harm anything). At least at the places where you check it you can as well
> > check whether the journal is aborted (maybe except for journal_destroy()
> > but see my comment there).
> 
> As you say, JFS_CP_ABORT isn't strictly needed in most cases,
> but it is needed for __journal_abort_checkpoint(), which reports the
> checkpoint-related error via printk only once.  If we used JFS_ABORT
> to check whether this is the second call, the error message might
> never be printed, because the journal may already have been aborted
> for another reason.  And if we don't check for repeated calls at all,
> the error message may be printed several times.
  Yes, I think that checking for JFS_ABORT is the right thing to do.
Once the journal has been aborted for some reason, it is enough that we
print some error message (and that is the responsibility of the first
caller of journal_abort()).  Additionally printing that checkpointing has
not succeeded is IMO not needed.

> Instead of using __journal_abort_checkpoint(), we could do a similar
> thing by adding a printk() directly in journal_destroy(),
> journal_flush(), and __log_wait_for_space() (although the message
> could then appear more than once).
> 
> I agree that we should avoid adding another flag where possible.
> So I'll try to revise the patch to remove the flag if you agree to
> adding the 3 printk()s.
  Well, as I said, one flag in the journal is not a big deal, but I found
it a bit confusing why there's a special flag for checkpoint abort when
the standard abort would do fine as well.

> >>Index: linux-2.6.26-rc2/fs/ext3/super.c
> >>===================================================================
> >>--- linux-2.6.26-rc2.orig/fs/ext3/super.c
> >>+++ linux-2.6.26-rc2/fs/ext3/super.c
> >>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> >> 	ext3_xattr_put_super(sb);
> >> 	journal_destroy(sbi->s_journal);
> >> 	if (!(sb->s_flags & MS_RDONLY)) {
> >>-		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> >>+		if (!is_checkpoint_aborted(sbi->s_journal)) {
> >>+			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> >>+				EXT3_FEATURE_INCOMPAT_RECOVER);
> >>+		}
> > 
> >   I think you should test here whether the journal is aborted (and
> > EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just
> > completely avoid updating super block...
> 
> Your idea looks good.  But as Josef pointed out, journal_destroy()
> frees sbi->s_journal, so we need to make journal_destroy() return the
> error status.
  Yes.

> >>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
> >> 	journal->j_tail = 0;
> >> 	journal->j_tail_sequence = ++journal->j_transaction_sequence;
> >> 	if (journal->j_sb_buffer) {
> >>-		journal_update_superblock(journal, 1);
> >>+		if (!is_checkpoint_aborted(journal))
> >>+			journal_update_superblock(journal, 1);
> >> 		brelse(journal->j_sb_buffer);
> >> 	}
> > 
> >   I don't like this much. I'd rather completely avoid updating j_tail and
> > j_tail_sequence above in case the journal is aborted but I'd write the
> > journal superblock so that information about abortion gets written...
> 
> log_do_checkpoint() calls cleanup_journal_tail(), which advances
> j_tail and j_tail_sequence.  So if we adopt the policy that we don't
> modify j_tail and j_tail_sequence when checkpointing has failed, we
> should also fix cleanup_journal_tail().
> 
> I adopted the policy that we don't update the journal superblock
> when checkpointing has failed.  When checkpointing fails,
> journal_abort() is called before cleanup_journal_tail(), so the
> information we want written out is still in the journal superblock.
  I see. The thing I'm afraid of with this policy is that if, sometime
later, we add a journal_update_superblock() call somewhere and forget to
check whether the journal is aborted, we will magically get filesystem
corruption when an IO error happens, and that would be really hard to
debug.  So I'd rather refrain from updating j_tail and j_tail_sequence.

> >>@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
> >> 
> >> int journal_flush(journal_t *journal)
> >> {
> >>-	int err = 0;
> >> 	transaction_t *transaction = NULL;
> >> 	unsigned long old_tail;
> >> 
> >>@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
> >> 		spin_unlock(&journal->j_state_lock);
> >> 	}
> >> 
> >>-	/* ...and flush everything in the log out to disk. */
> >>+	/* ...and flush everything in the log out to disk.
> >>+	 * Even if an error occurs, we don't stop this loop.
> >>+	 * We do checkpoint as much as possible.  */
> >> 	spin_lock(&journal->j_list_lock);
> >>-	while (!err && journal->j_checkpoint_transactions != NULL) {
> >>+	while (journal->j_checkpoint_transactions != NULL) {
> >> 		spin_unlock(&journal->j_list_lock);
> >>-		err = log_do_checkpoint(journal);
> >>+		log_do_checkpoint(journal);
> >> 		spin_lock(&journal->j_list_lock);
> >> 	}
> >> 	spin_unlock(&journal->j_list_lock);
> >>+	if (is_checkpoint_aborted(journal))
> >>+		return -EIO;
> >>+
> > 
> >   Why do you change the loop? If we fail to checkpoint some buffer, we stop
> > journaling anyway and so journal will be replayed when fsck is run...
> 
> Certainly, fsck or journal_recover() replays the unwritten-back metadata
> blocks.  journal_destroy() also calls log_do_checkpoint() against all
> un-checkpointed transactions.  I just thought `flush' meant writing out
> all the data we are supposed to write to disk.  There is no problem if I
> leave the loop unchanged.
  "flush" means "make journal empty". If we are not able to do it all, it
does not really matter how much do we manage to write out. So I wouldn't
change the loop.
								Honza
 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)
  2008-05-19  3:14       ` Jan Kara
@ 2008-05-21  1:33         ` Hidehiro Kawai
  0 siblings, 0 replies; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-21  1:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

Hi,

Jan Kara wrote:
> 
> On Fri 16-05-08 19:26:57, Hidehiro Kawai wrote:
> 
>>Jan Kara wrote:
>>
>>
>>>On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
>>>
>>>
>>>>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
>>>>
>>>>If we failed to write metadata buffers to the journal space and
>>>>succeeded to write the commit record, stale data can be written
>>>>back to the filesystem as metadata in the recovery phase.
>>>>
>>>>To avoid this, when we failed to write out metadata buffers,
>>>>abort the journal before writing the commit record.
>>>>
>>>>Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
>>>>---
>>>>fs/jbd/commit.c |    3 +++
>>>>1 file changed, 3 insertions(+)
>>>>
>>>>Index: linux-2.6.26-rc2/fs/jbd/commit.c
>>>>===================================================================
>>>>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
>>>>+++ linux-2.6.26-rc2/fs/jbd/commit.c
>>>>@@ -703,6 +703,9 @@ wait_for_iobuf:
>>>>		__brelse(bh);
>>>>	}
>>>>
>>>>+	if (err)
>>>>+		journal_abort(journal, err);
>>>>+
>>>>	J_ASSERT (commit_transaction->t_shadow_list == NULL);
>>>
>>>  Shouldn't this rather be further just before
>>>journal_write_commit_record()? We should abort also if writing revoke
>>>records etc. failed, shouldn't we?
>>
>>Unlike metadata blocks, each revoke block has a descriptor with the
>>sequence number of the committing transaction.  If we failed to write
>>a revoke block, there will be an old control block, metadata block,
>>or zero-filled block where we tried to write the revoke block.
>>In the recovery process, this old invalid block is detected by
>>checking its magic number and sequence number, and the transaction
>>is then ignored even if we succeeded in writing the commit record.
>>So I think we don't need to check for errors just after writing
>>revoke records.
> 
>   Yes, I agree that omitting such a check will not cause data corruption,
> but I still think that if we fail to properly commit a transaction, we
> should detect the error and abort the journal...

I see.  I'll move the aborting point to just before
journal_write_commit_record() in the next version.
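The recovery-phase check discussed above can be sketched as follows.  This
is a hypothetical, simplified userspace model, not the real JBD on-disk
format or code (the struct layout and function name are illustrative; only
the magic value matches JBD's JFS_MAGIC_NUMBER): replay trusts a journal
block only if it carries both the journal magic number and the sequence
number of the transaction being replayed, so a block left behind by a
failed revoke write fails the check and the whole transaction is skipped.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of a JBD journal block header: every control block
 * (descriptor, commit, revoke) starts with a magic number and the
 * sequence number of the transaction it belongs to. */
struct journal_block_header {
    uint32_t magic;     /* JFS_MAGIC_NUMBER for valid journal blocks */
    uint32_t blocktype; /* descriptor / commit / revoke / ...        */
    uint32_t sequence;  /* committing transaction's sequence number  */
};

#define JFS_MAGIC_NUMBER 0xc03b3998u

/* Recovery trusts a block only if both checks pass.  A failed revoke
 * write leaves an old or zero-filled block in place, so the check
 * fails and the transaction is ignored during replay, even though
 * its commit record was written successfully. */
static int block_belongs_to_transaction(const struct journal_block_header *h,
                                        uint32_t expected_seq)
{
    if (h->magic != JFS_MAGIC_NUMBER)
        return 0;   /* stale metadata or a zero-filled block */
    if (h->sequence != expected_seq)
        return 0;   /* leftover block from an older transaction */
    return 1;
}
```

This is why the patch needs no explicit error check right after writing
revoke records: a transaction whose revoke block never reached the disk is
rejected as a whole by this check.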

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-19  3:38       ` Jan Kara
@ 2008-05-21  1:34         ` Hidehiro Kawai
  2008-05-23 22:28           ` Jan Kara
  0 siblings, 1 reply; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-21  1:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

Hi,

Jan Kara wrote:
> 
> On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
> 
>>Jan Kara wrote:
>>
>>>>/*
>>>>+ * We couldn't write back some metadata buffers to the filesystem.
>>>>+ * To avoid unwritten-back metadata buffers from losing, set
>>>>+ * JFS_CP_ABORT flag and make sure that the journal space isn't
>>>>+ * cleaned.  This function also aborts journaling.
>>>>+ */
>>>>+static void __journal_abort_checkpoint(journal_t *journal, int errno)
>>>>+{
>>>>+	if (is_checkpoint_aborted(journal))
>>>>+		return;
>>>>+
>>>>+	spin_lock(&journal->j_state_lock);
>>>>+	journal->j_flags |= JFS_CP_ABORT;
>>>>+	spin_unlock(&journal->j_state_lock);
>>>>+	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
>>>>+	       "are still old.\n");
>>>>+	journal_abort(journal, errno);
>>>>+}
>>
>>[snip]
>>
>>
>>>>Index: linux-2.6.26-rc2/include/linux/jbd.h
>>>>===================================================================
>>>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
>>>>+++ linux-2.6.26-rc2/include/linux/jbd.h
>>>>@@ -816,6 +816,8 @@ struct journal_s
>>>>#define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
>>>>#define JFS_LOADED	0x010	/* The journal superblock has been loaded */
>>>>#define JFS_BARRIER	0x020	/* Use IDE barriers */
>>>>+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
>>>>+				 * clean the journal space.  */
>>>>
>>>>/*
>>>> * Function declarations for the journaling transaction and buffer
>>>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>>>>	return journal->j_flags & JFS_ABORT;
>>>>}
>>>>
>>>>+static inline int is_checkpoint_aborted(journal_t *journal)
>>>>+{
>>>>+	return journal->j_flags & JFS_CP_ABORT;
>>>>+}
>>>>+
>>>
>>>  Actually, I don't think this new flag is needed (not that it would really
>>>harm anything). At least at the places where you check it you can as well
>>>check whether the journal is aborted (maybe except for journal_destroy()
>>>but see my comment there).
>>
>>As you say, JFS_CP_ABORT isn't strictly needed in most cases, but
>>it is needed by __journal_abort_checkpoint(), which reports the
>>checkpoint-related error by printk only once.  If we used JFS_ABORT
>>to check whether this is the second call, the error message might
>>never be printed because the journal could already have been aborted
>>for another reason.  If we don't check for a second call, the error
>>message will be printed several times.
> 
>   Yes, I think that checking for JFS_ABORT is the right thing to do.
> Once the journal has been aborted for some reason, it is enough that we
> print some error message (and that is the responsibility of the first
> caller of journal_abort()).  Printing that checkpointing has not
> succeeded as well is IMO not needed.

A checkpointing failure is a bit special.  If the journal is aborted
by journal_commit_transaction(), the integrity of the filesystem is
preserved, although the latest changes will be lost.  However, if the
journal is aborted by log_do_checkpoint() and the replay also fails,
the filesystem may be corrupted because some of the metadata blocks
still hold stale data.  In that case the user will have to recover
the filesystem manually and carefully, so I think printing the
special message is needed to inform users of this situation.
 

>>>>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
>>>>	journal->j_tail = 0;
>>>>	journal->j_tail_sequence = ++journal->j_transaction_sequence;
>>>>	if (journal->j_sb_buffer) {
>>>>-		journal_update_superblock(journal, 1);
>>>>+		if (!is_checkpoint_aborted(journal))
>>>>+			journal_update_superblock(journal, 1);
>>>>		brelse(journal->j_sb_buffer);
>>>>	}
>>>
>>>  I don't like this much. I'd rather completely avoid updating j_tail and
>>>j_tail_sequence above in case the journal is aborted but I'd write the
>>>journal superblock so that information about abortion gets written...
>>
>>log_do_checkpoint() calls cleanup_journal_tail(), which advances
>>j_tail and j_tail_sequence.  So if we adopt the policy that we don't
>>modify j_tail and j_tail_sequence when checkpointing fails, we should
>>also fix cleanup_journal_tail().
>>
>>I adopted the policy that we don't update the journal superblock
>>when checkpointing fails.  When checkpointing fails, journal_abort()
>>is called before cleanup_journal_tail(), so the on-disk journal
>>superblock should already contain what we want.
> 
>   I see.  What I'm afraid of with this policy is that if, sometime
> later, we add a journal_update_superblock() call somewhere and forget
> to check whether the journal is aborted, we will magically get
> filesystem corruption when an IO error happens, and that would be
> really hard to debug.  So I'd rather refrain from updating j_tail
> and j_tail_sequence.

I can understand your concern, since the current JBD updates the
journal superblock even if a checkpoint has failed.  I think this
can be addressed by adding a comment to journal_update_superblock(),
for example: don't invoke this function if checkpointing has failed,
otherwise unwritten-back journaled data can be lost.

Or we may stop both modifying j_tail and updating the superblock
when the journal has been aborted (especially when it was aborted
because of a checkpoint failure).  Of course, __journal_abort_soft()
would still update the superblock once.
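The journal_destroy() guard from the patch hunk quoted above can be
sketched in isolation like this.  This is a minimal userspace model, not
kernel code: the flag value matches the patch's JFS_CP_ABORT, but the
`toy_` struct and functions are stand-ins for journal_t and the real
superblock I/O.

```c
#include <assert.h>

#define JFS_CP_ABORT 0x040  /* checkpointing failed; keep journal space */

/* Toy stand-in for journal_t: just the flags word and a marker telling
 * whether the superblock was written out. */
struct toy_journal {
    unsigned int j_flags;
    int j_sb_written;
};

/* Stand-in for journal_update_superblock().  Per the discussion above,
 * callers must not invoke it after a checkpoint failure, otherwise the
 * on-disk tail would advance past unwritten-back journaled data. */
static void toy_update_superblock(struct toy_journal *j)
{
    j->j_sb_written = 1;
}

/* The guarded path from journal_destroy(): skip the final superblock
 * update when checkpointing has failed, so replay on the next mount
 * still covers the transactions that never reached the filesystem. */
static void toy_journal_destroy(struct toy_journal *j)
{
    if (!(j->j_flags & JFS_CP_ABORT))
        toy_update_superblock(j);
}
```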

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-21  1:34         ` Hidehiro Kawai
@ 2008-05-23 22:28           ` Jan Kara
  2008-05-26  4:57             ` Hidehiro Kawai
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Kara @ 2008-05-23 22:28 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

  Hello,

> Jan Kara wrote:
> > 
> > On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
> > 
> >>Jan Kara wrote:
> >>
> >>>>/*
> >>>>+ * We couldn't write back some metadata buffers to the filesystem.
> >>>>+ * To avoid unwritten-back metadata buffers from losing, set
> >>>>+ * JFS_CP_ABORT flag and make sure that the journal space isn't
> >>>>+ * cleaned.  This function also aborts journaling.
> >>>>+ */
> >>>>+static void __journal_abort_checkpoint(journal_t *journal, int errno)
> >>>>+{
> >>>>+	if (is_checkpoint_aborted(journal))
> >>>>+		return;
> >>>>+
> >>>>+	spin_lock(&journal->j_state_lock);
> >>>>+	journal->j_flags |= JFS_CP_ABORT;
> >>>>+	spin_unlock(&journal->j_state_lock);
> >>>>+	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
> >>>>+	       "are still old.\n");
> >>>>+	journal_abort(journal, errno);
> >>>>+}
> >>
> >>[snip]
> >>
> >>
> >>>>Index: linux-2.6.26-rc2/include/linux/jbd.h
> >>>>===================================================================
> >>>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
> >>>>+++ linux-2.6.26-rc2/include/linux/jbd.h
> >>>>@@ -816,6 +816,8 @@ struct journal_s
> >>>>#define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
> >>>>#define JFS_LOADED	0x010	/* The journal superblock has been loaded */
> >>>>#define JFS_BARRIER	0x020	/* Use IDE barriers */
> >>>>+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
> >>>>+				 * clean the journal space.  */
> >>>>
> >>>>/*
> >>>> * Function declarations for the journaling transaction and buffer
> >>>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
> >>>>	return journal->j_flags & JFS_ABORT;
> >>>>}
> >>>>
> >>>>+static inline int is_checkpoint_aborted(journal_t *journal)
> >>>>+{
> >>>>+	return journal->j_flags & JFS_CP_ABORT;
> >>>>+}
> >>>>+
> >>>
> >>>  Actually, I don't think this new flag is needed (not that it would really
> >>>harm anything). At least at the places where you check it you can as well
> >>>check whether the journal is aborted (maybe except for journal_destroy()
> >>>but see my comment there).
> >>
> >>As you say, JFS_CP_ABORT isn't strictly needed in most cases, but
> >>it is needed by __journal_abort_checkpoint(), which reports the
> >>checkpoint-related error by printk only once.  If we used JFS_ABORT
> >>to check whether this is the second call, the error message might
> >>never be printed because the journal could already have been aborted
> >>for another reason.  If we don't check for a second call, the error
> >>message will be printed several times.
> > 
> >   Yes, I think that checking for JFS_ABORT is the right thing to do.
> > Once the journal has been aborted for some reason, it is enough that we
> > print some error message (and that is the responsibility of the first
> > caller of journal_abort()).  Printing that checkpointing has not
> > succeeded as well is IMO not needed.
> 
> A checkpointing failure is a bit special.  If the journal is aborted
> by journal_commit_transaction(), the integrity of the filesystem is
> preserved, although the latest changes will be lost.  However, if the
> journal is aborted by log_do_checkpoint() and the replay also fails,
> the filesystem may be corrupted because some of the metadata blocks
> still hold stale data.  In that case the user will have to recover
> the filesystem manually and carefully, so I think printing the
> special message is needed to inform users of this situation.
  Is this really different?  If we spot an error during checkpointing, we
abort the journal (so the data still remains in the journal, because
checkpointing has failed).  On the next mount, either the replay succeeds
and everything is fine, or we spot an error during replay and we should
just refuse to mount the filesystem.  So we are correct again - the user
then has to run fsck to solve the problem, and that will try its best to
make most of the data readable.  So I don't see why a checkpointing
failure would need special attention.

> >>>>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
> >>>>	journal->j_tail = 0;
> >>>>	journal->j_tail_sequence = ++journal->j_transaction_sequence;
> >>>>	if (journal->j_sb_buffer) {
> >>>>-		journal_update_superblock(journal, 1);
> >>>>+		if (!is_checkpoint_aborted(journal))
> >>>>+			journal_update_superblock(journal, 1);
> >>>>		brelse(journal->j_sb_buffer);
> >>>>	}
> >>>
> >>>  I don't like this much. I'd rather completely avoid updating j_tail and
> >>>j_tail_sequence above in case the journal is aborted but I'd write the
> >>>journal superblock so that information about abortion gets written...
> >>
> >>log_do_checkpoint() calls cleanup_journal_tail(), which advances
> >>j_tail and j_tail_sequence.  So if we adopt the policy that we don't
> >>modify j_tail and j_tail_sequence when checkpointing fails, we should
> >>also fix cleanup_journal_tail().
> >>
> >>I adopted the policy that we don't update the journal superblock
> >>when checkpointing fails.  When checkpointing fails, journal_abort()
> >>is called before cleanup_journal_tail(), so the on-disk journal
> >>superblock should already contain what we want.
> > 
> >   I see.  What I'm afraid of with this policy is that if, sometime
> > later, we add a journal_update_superblock() call somewhere and forget
> > to check whether the journal is aborted, we will magically get
> > filesystem corruption when an IO error happens, and that would be
> > really hard to debug.  So I'd rather refrain from updating j_tail
> > and j_tail_sequence.
> 
> I can understand your concern, since the current JBD updates the
> journal superblock even if a checkpoint has failed.  I think this
> can be addressed by adding a comment to journal_update_superblock(),
> for example: don't invoke this function if checkpointing has failed,
> otherwise unwritten-back journaled data can be lost.
> 
> Or we may stop both modifying j_tail and updating the superblock
> when the journal has been aborted (especially when it was aborted
> because of a checkpoint failure).  Of course, __journal_abort_soft()
> would still update the superblock once.
  OK, the last option is also fine with me. Thanks for your work.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)
  2008-05-23 22:28           ` Jan Kara
@ 2008-05-26  4:57             ` Hidehiro Kawai
  0 siblings, 0 replies; 23+ messages in thread
From: Hidehiro Kawai @ 2008-05-26  4:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, sct, adilger, linux-kernel, linux-ext4,
	Josef Bacik, Mingming Cao, Satoshi OSHIMA, sugita

Hello,

>>Jan Kara wrote:
>>
>>>On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
>>>
>>>
>>>>Jan Kara wrote:
>>>>
>>>>
>>>>>>/*
>>>>>>+ * We couldn't write back some metadata buffers to the filesystem.
>>>>>>+ * To avoid unwritten-back metadata buffers from losing, set
>>>>>>+ * JFS_CP_ABORT flag and make sure that the journal space isn't
>>>>>>+ * cleaned.  This function also aborts journaling.
>>>>>>+ */
>>>>>>+static void __journal_abort_checkpoint(journal_t *journal, int errno)
>>>>>>+{
>>>>>>+	if (is_checkpoint_aborted(journal))
>>>>>>+		return;
>>>>>>+
>>>>>>+	spin_lock(&journal->j_state_lock);
>>>>>>+	journal->j_flags |= JFS_CP_ABORT;
>>>>>>+	spin_unlock(&journal->j_state_lock);
>>>>>>+	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
>>>>>>+	       "are still old.\n");
>>>>>>+	journal_abort(journal, errno);
>>>>>>+}
>>>>
>>>>[snip]
>>>>
>>>>
>>>>
>>>>>>Index: linux-2.6.26-rc2/include/linux/jbd.h
>>>>>>===================================================================
>>>>>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
>>>>>>+++ linux-2.6.26-rc2/include/linux/jbd.h
>>>>>>@@ -816,6 +816,8 @@ struct journal_s
>>>>>>#define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
>>>>>>#define JFS_LOADED	0x010	/* The journal superblock has been loaded */
>>>>>>#define JFS_BARRIER	0x020	/* Use IDE barriers */
>>>>>>+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
>>>>>>+				 * clean the journal space.  */
>>>>>>
>>>>>>/*
>>>>>>* Function declarations for the journaling transaction and buffer
>>>>>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>>>>>>	return journal->j_flags & JFS_ABORT;
>>>>>>}
>>>>>>
>>>>>>+static inline int is_checkpoint_aborted(journal_t *journal)
>>>>>>+{
>>>>>>+	return journal->j_flags & JFS_CP_ABORT;
>>>>>>+}
>>>>>>+
>>>>>
>>>>> Actually, I don't think this new flag is needed (not that it would really
>>>>>harm anything). At least at the places where you check it you can as well
>>>>>check whether the journal is aborted (maybe except for journal_destroy()
>>>>>but see my comment there).
>>>>
>>>>As you say, JFS_CP_ABORT isn't strictly needed in most cases, but
>>>>it is needed by __journal_abort_checkpoint(), which reports the
>>>>checkpoint-related error by printk only once.  If we used JFS_ABORT
>>>>to check whether this is the second call, the error message might
>>>>never be printed because the journal could already have been aborted
>>>>for another reason.  If we don't check for a second call, the error
>>>>message will be printed several times.
>>>
>>>  Yes, I think that checking for JFS_ABORT is the right thing to do.
>>>Once the journal has been aborted for some reason, it is enough that we
>>>print some error message (and that is the responsibility of the first
>>>caller of journal_abort()).  Printing that checkpointing has not
>>>succeeded as well is IMO not needed.
>>
>>A checkpointing failure is a bit special.  If the journal is aborted
>>by journal_commit_transaction(), the integrity of the filesystem is
>>preserved, although the latest changes will be lost.  However, if the
>>journal is aborted by log_do_checkpoint() and the replay also fails,
>>the filesystem may be corrupted because some of the metadata blocks
>>still hold stale data.  In that case the user will have to recover
>>the filesystem manually and carefully, so I think printing the
>>special message is needed to inform users of this situation.
> 
>   Is this really different?  If we spot an error during checkpointing,
> we abort the journal (so the data still remains in the journal, because
> checkpointing has failed).  On the next mount, either the replay
> succeeds and everything is fine, or we spot an error during replay and
> we should just refuse to mount the filesystem.  So we are correct
> again - the user then has to run fsck to solve the problem, and that
> will try its best to make most of the data readable.  So I don't see
> why a checkpointing failure would need special attention.

I understand.

I had been considering the following scenario: when both checkpointing
and replay fail, copy all readable blocks from the bad disk to a good
disk, and then replay all the journaled data on the good disk.  But
that would be overkill.  Normally we run fsck when the mount fails,
and that is enough for us.

I'm going to just use JFS_ABORT instead of introducing JFS_CP_ABORT
in the next revision.
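The direction agreed on here - keying the once-only warning off JFS_ABORT
itself instead of introducing JFS_CP_ABORT - can be sketched as follows.
This is a hypothetical, simplified userspace model: the flag value matches
jbd.h, but the `toy_` struct and functions, and the counter standing in
for printk(), are illustrations rather than the real kernel code.

```c
#include <assert.h>

#define JFS_ABORT 0x002  /* journaling has been aborted */

struct toy_journal {
    unsigned int j_flags;
    int warnings;   /* counts the checkpoint-failure warnings printed */
};

/* Stand-in for journal_abort(): a no-op on an already-aborted journal. */
static void toy_journal_abort(struct toy_journal *j)
{
    if (j->j_flags & JFS_ABORT)
        return;
    j->j_flags |= JFS_ABORT;
}

/* Checkpoint-failure path: because it bails out once JFS_ABORT is set,
 * the warning is printed at most once per journal, with no separate
 * JFS_CP_ABORT flag needed for the "second call" check.  (As discussed
 * above, if the journal was already aborted for another reason, the
 * first aborter's message suffices and this one is skipped.) */
static void toy_abort_checkpoint(struct toy_journal *j)
{
    if (j->j_flags & JFS_ABORT)
        return;
    j->warnings++;  /* stands in for the KERN_WARNING printk() */
    toy_journal_abort(j);
}
```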

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2008-05-26  4:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-14  4:43 [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased) Hidehiro Kawai
2008-05-14  4:47 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased) Hidehiro Kawai
2008-05-14 12:56   ` Jan Kara
2008-05-14  4:48 ` [PATCH 2/4] jbd: ordered data integrity fix (rebased) Hidehiro Kawai
2008-05-14 13:10   ` Jan Kara
2008-05-16 10:25     ` Hidehiro Kawai
2008-05-19  3:11       ` Jan Kara
2008-05-14  4:49 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased) Hidehiro Kawai
2008-05-14 13:15   ` Jan Kara
2008-05-16 10:26     ` Hidehiro Kawai
2008-05-19  3:14       ` Jan Kara
2008-05-21  1:33         ` Hidehiro Kawai
2008-05-14  4:50 ` [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased) Hidehiro Kawai
2008-05-14 13:16   ` Josef Bacik
2008-05-14 14:44     ` Jan Kara
2008-05-14 14:37       ` Josef Bacik
2008-05-16 10:28         ` Hidehiro Kawai
2008-05-14 14:32   ` Jan Kara
2008-05-16 10:29     ` Hidehiro Kawai
2008-05-19  3:38       ` Jan Kara
2008-05-21  1:34         ` Hidehiro Kawai
2008-05-23 22:28           ` Jan Kara
2008-05-26  4:57             ` Hidehiro Kawai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).