linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] jbd: possible filesystem corruption fixes
@ 2008-04-18 13:00 Hidehiro Kawai
  2008-04-18 13:36 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers Hidehiro Kawai
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-18 13:00 UTC (permalink / raw)
  To: akpm, sct, adilger; +Cc: linux-kernel, linux-ext4, jack, sugita, Satoshi OSHIMA

Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes

The current JBD is not sufficient for I/O error handling.  It can
cause filesystem corruption.   An example scenario:

1. fail to write a metadata buffer to block B in the journal
2. succeed to write the commit record
3. the system crashes, reboots and mount the filesystem
4. in the recovery phase, succeed to read data from block B
5. write back the read data to the filesystem, but it is a stale
   metadata
6. lose some files and directories!

This scenario is a rare case, but it (temporal I/O error)
can occur.  If we abort the journal between 1. and 2., this
tragedy can be avoided.

This patch set fixes several error handling problems to protect
from filesystem corruption caused by I/O errors.  It has been
done only for JBD and ext3 parts.

This patch is against 2.6.25

[PATCH 1/4] jbd: strictly check for write errors on data buffers
[PATCH 2/4] jbd: ordered data integrity fix
[PATCH 3/4] jbd: abort when failed to log metadata buffers
[PATCH 4/4] jbd: fix error handling for checkpoint io

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center




^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/4] jbd: strictly check for write errors on data buffers
  2008-04-18 13:00 [PATCH 0/4] jbd: possible filesystem corruption fixes Hidehiro Kawai
@ 2008-04-18 13:36 ` Hidehiro Kawai
  2008-04-18 13:37 ` [PATCH 2/4] jbd: ordered data integrity fix Hidehiro Kawai
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-18 13:36 UTC (permalink / raw)
  To: akpm, sct; +Cc: linux-kernel, linux-ext4, jack, sugita, Satoshi OSHIMA

Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers

In ordered mode, we should abort journaling when an I/O error has
occurred on a file data buffer in the committing transaction.  But
there can be data buffers which are not checked for error:

  (a) the buffer which has already been written out by pdflush
  (b) the buffer which has been unlocked before scanned in the
      t_locked_list loop

This patch adds missing error checks and aborts journaling
appropriately.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitacih.com>
---
 fs/jbd/commit.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6.25/fs/jbd/commit.c
===================================================================
--- linux-2.6.25.orig/fs/jbd/commit.c
+++ linux-2.6.25/fs/jbd/commit.c
@@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
 /*
  *  Submit all the data buffers to disk
  */
-static void journal_submit_data_buffers(journal_t *journal,
+static int journal_submit_data_buffers(journal_t *journal,
 				transaction_t *commit_transaction)
 {
 	struct journal_head *jh;
@@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
 	int locked;
 	int bufs = 0;
 	struct buffer_head **wbuf = journal->j_wbuf;
+	int err = 0;
 
 	/*
 	 * Whenever we unlock the journal and sleep, things can get added
@@ -253,6 +254,8 @@ write_out_data:
 			put_bh(bh);
 		} else {
 			BUFFER_TRACE(bh, "writeout complete: unfile");
+			if (unlikely(!buffer_uptodate(bh)))
+				err = -EIO;
 			__journal_unfile_buffer(jh);
 			jbd_unlock_bh_state(bh);
 			if (locked)
@@ -271,6 +274,8 @@ write_out_data:
 	}
 	spin_unlock(&journal->j_list_lock);
 	journal_do_submit_data(wbuf, bufs);
+
+	return err;
 }
 
 /*
@@ -426,8 +431,7 @@ void journal_commit_transaction(journal_
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 */
-	err = 0;
-	journal_submit_data_buffers(journal, commit_transaction);
+	err = journal_submit_data_buffers(journal, commit_transaction);
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -442,10 +446,10 @@ void journal_commit_transaction(journal_
 		if (buffer_locked(bh)) {
 			spin_unlock(&journal->j_list_lock);
 			wait_on_buffer(bh);
-			if (unlikely(!buffer_uptodate(bh)))
-				err = -EIO;
 			spin_lock(&journal->j_list_lock);
 		}
+		if (unlikely(!buffer_uptodate(bh)))
+			err = -EIO;
 		if (!inverted_lock(journal, bh)) {
 			put_bh(bh);
 			spin_lock(&journal->j_list_lock);



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/4] jbd: ordered data integrity fix
  2008-04-18 13:00 [PATCH 0/4] jbd: possible filesystem corruption fixes Hidehiro Kawai
  2008-04-18 13:36 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers Hidehiro Kawai
@ 2008-04-18 13:37 ` Hidehiro Kawai
  2008-04-18 13:38 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers Hidehiro Kawai
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-18 13:37 UTC (permalink / raw)
  To: akpm, sct; +Cc: linux-kernel, linux-ext4, jack, sugita, Satoshi OSHIMA

Subject: [PATCH 2/4] jbd: ordered data integrity fix

In ordered mode, if a buffer being dirtied exists in the committing
transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it.
But we don't have to remove the buffer from the committing
transaction when the buffer couldn't be written out, otherwise it
breaks the ordered mode rule.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/transaction.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6.25/fs/jbd/transaction.c
===================================================================
--- linux-2.6.25.orig/fs/jbd/transaction.c
+++ linux-2.6.25/fs/jbd/transaction.c
@@ -941,9 +941,10 @@ int journal_dirty_data(handle_t *handle,
 	journal_t *journal = handle->h_transaction->t_journal;
 	int need_brelse = 0;
 	struct journal_head *jh;
+	int ret = 0;
 
 	if (is_handle_aborted(handle))
-		return 0;
+		return ret;
 
 	jh = journal_add_journal_head(bh);
 	JBUFFER_TRACE(jh, "entry");
@@ -1054,7 +1055,16 @@ int journal_dirty_data(handle_t *handle,
 				   time if it is redirtied */
 			}
 
-			/* journal_clean_data_list() may have got there first */
+			/*
+			 * We shouldn't remove the buffer from the committing
+			 * transaction if it has failed to be written.
+			 * Otherwise, it breaks the ordered mode rule.
+			 */
+			if (unlikely(!buffer_uptodate(bh))) {
+				ret = -EIO;
+				goto no_journal;
+			}
+
 			if (jh->b_transaction != NULL) {
 				JBUFFER_TRACE(jh, "unfile from commit");
 				__journal_temp_unlink_buffer(jh);
@@ -1095,7 +1105,7 @@ no_journal:
 	}
 	JBUFFER_TRACE(jh, "exit");
 	journal_put_journal_head(jh);
-	return 0;
+	return ret;
 }
 
 /**



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/4] jbd: abort when failed to log metadata buffers
  2008-04-18 13:00 [PATCH 0/4] jbd: possible filesystem corruption fixes Hidehiro Kawai
  2008-04-18 13:36 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers Hidehiro Kawai
  2008-04-18 13:37 ` [PATCH 2/4] jbd: ordered data integrity fix Hidehiro Kawai
@ 2008-04-18 13:38 ` Hidehiro Kawai
  2008-04-18 13:39 ` [PATCH 4/4] jbd/ext3: fix error handling for checkpoint io Hidehiro Kawai
  2008-04-18 14:09 ` [PATCH 0/4] jbd: possible filesystem corruption fixes Josef Bacik
  4 siblings, 0 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-18 13:38 UTC (permalink / raw)
  To: akpm, sct; +Cc: linux-kernel, linux-ext4, jack, sugita, Satoshi OSHIMA

Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers

If we failed to write metadata buffers to the journal space and
succeeded to write the commit record, stale data can be written
back to the filesystem as metadata in the recovery phase.

To avoid this, when we failed to write out metadata buffers,
abort the journal before writing the commit record.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/commit.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.25/fs/jbd/commit.c
===================================================================
--- linux-2.6.25.orig/fs/jbd/commit.c
+++ linux-2.6.25/fs/jbd/commit.c
@@ -716,6 +716,9 @@ wait_for_iobuf:
 		__brelse(bh);
 	}
 
+	if (err)
+		journal_abort(journal, err);
+
 	J_ASSERT (commit_transaction->t_shadow_list == NULL);
 
 	jbd_debug(3, "JBD: commit phase 5\n");



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 4/4] jbd/ext3: fix error handling for checkpoint io
  2008-04-18 13:00 [PATCH 0/4] jbd: possible filesystem corruption fixes Hidehiro Kawai
                   ` (2 preceding siblings ...)
  2008-04-18 13:38 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers Hidehiro Kawai
@ 2008-04-18 13:39 ` Hidehiro Kawai
  2008-04-18 14:09 ` [PATCH 0/4] jbd: possible filesystem corruption fixes Josef Bacik
  4 siblings, 0 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-18 13:39 UTC (permalink / raw)
  To: akpm, sct, adilger; +Cc: linux-kernel, linux-ext4, jack, sugita, Satoshi OSHIMA

Subject: [PATCH 4/4] jbd/ext3: fix error handling for checkpoint io

When a checkpointing IO fails, current JBD code doesn't check the
error and continue journaling.  This means latest metadata can be
lost from both the journal and filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to do:

1. don't remove the failed buffer from the checkpoint list where in
   the case of __try_to_free_cp_buf() because it may be released or
   overwritten by a later transaction
2. log_do_checkpoint() is the last chance, remove the failed buffer
   from the checkpoint list and abort journaling
3. when checkpointing fails, don't update the journal super block to
   prevent the journalled contents from being cleaned
4. when checkpointing fails, don't clear the needs_recovery flag,
   otherwise the journalled contents are ignored and cleaned in the
   recovery phase
5. if the recovery fails, keep the needs_recovery flag

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/ext3/ioctl.c     |   12 ++++----
 fs/ext3/super.c     |   13 +++++++--
 fs/jbd/checkpoint.c |   59 ++++++++++++++++++++++++++++++++++--------
 fs/jbd/journal.c    |   21 +++++++++-----
 fs/jbd/recovery.c   |    7 +++-
 include/linux/jbd.h |    7 ++++
 6 files changed, 91 insertions(+), 28 deletions(-)

Index: linux-2.6.25/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.25.orig/fs/jbd/checkpoint.c
+++ linux-2.6.25/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
 	int ret = 0;
 	struct buffer_head *bh = jh2bh(jh);
 
-	if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+	if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+	    !buffer_dirty(bh) && buffer_uptodate(bh)) {
 		JBUFFER_TRACE(jh, "remove from checkpoint list");
 		ret = __journal_remove_checkpoint(jh) + 1;
 		jbd_unlock_bh_state(bh);
@@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
 }
 
 /*
+ * We couldn't write back some metadata buffers to the filesystem.
+ * To avoid unwritten-back metadata buffers from losing, set
+ * JFS_CP_ABORT flag and make sure that the journal space isn't
+ * cleaned.  This function also aborts journaling.
+ */
+static void __journal_abort_checkpoint(journal_t *journal, int errno)
+{
+	if (is_checkpoint_aborted(journal))
+		return;
+
+	spin_lock(&journal->j_state_lock);
+	journal->j_flags |= JFS_CP_ABORT;
+	spin_unlock(&journal->j_state_lock);
+	printk(KERN_WARNING "JBD: Checkpointing failed.  Some metadata blocks "
+	       "are still old.\n");
+	journal_abort(journal, errno);
+}
+
+/*
  * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
  * The caller must restart a list walk.  Wait for someone else to run
  * jbd_unlock_bh_state().
@@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
  * buffers. Note that we take the buffers in the opposite ordering
  * from the one in which they were submitted for IO.
  *
+ * Return 0 on success, and return <0 if some buffers have failed
+ * to be written out.
+ *
  * Called with j_list_lock held.
  */
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
 {
 	struct journal_head *jh;
 	struct buffer_head *bh;
 	tid_t this_tid;
 	int released = 0;
+	int ret = 0;
 
 	this_tid = transaction->t_tid;
 restart:
 	/* Did somebody clean up the transaction in the meanwhile? */
 	if (journal->j_checkpoint_transactions != transaction ||
 			transaction->t_tid != this_tid)
-		return;
+		return ret;
 	while (!released && transaction->t_checkpoint_io_list) {
 		jh = transaction->t_checkpoint_io_list;
 		bh = jh2bh(jh);
@@ -194,6 +218,9 @@ restart:
 			spin_lock(&journal->j_list_lock);
 			goto restart;
 		}
+		if (unlikely(!buffer_uptodate(bh)))
+			ret = -EIO;
+
 		/*
 		 * Now in whatever state the buffer currently is, we know that
 		 * it has been written out and so we can drop it from the list
@@ -203,6 +230,8 @@ restart:
 		journal_remove_journal_head(bh);
 		__brelse(bh);
 	}
+
+	return ret;
 }
 
 #define NR_BATCH	64
@@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
  * Try to flush one buffer from the checkpoint list to disk.
  *
  * Return 1 if something happened which requires us to abort the current
- * scan of the checkpoint list.
+ * scan of the checkpoint list.  Return <0 if the buffer has failed to
+ * be written out.
  *
  * Called with j_list_lock held and drops it if 1 is returned
  * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
@@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
 		log_wait_commit(journal, tid);
 		ret = 1;
 	} else if (!buffer_dirty(bh)) {
+		ret = 1;
+		if (unlikely(!buffer_uptodate(bh)))
+			ret = -EIO;
 		J_ASSERT_JH(jh, !buffer_jbddirty(bh));
 		BUFFER_TRACE(bh, "remove from checkpoint");
 		__journal_remove_checkpoint(jh);
@@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
 		jbd_unlock_bh_state(bh);
 		journal_remove_journal_head(bh);
 		__brelse(bh);
-		ret = 1;
 	} else {
 		/*
 		 * Important: we are about to write the buffer, and
@@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
 	 * OK, we need to start writing disk blocks.  Take one transaction
 	 * and write it.
 	 */
+	result = 0;
 	spin_lock(&journal->j_list_lock);
 	if (!journal->j_checkpoint_transactions)
 		goto out;
@@ -334,7 +367,7 @@ restart:
 		int batch_count = 0;
 		struct buffer_head *bhs[NR_BATCH];
 		struct journal_head *jh;
-		int retry = 0;
+		int retry = 0, err;
 
 		while (!retry && transaction->t_checkpoint_list) {
 			struct buffer_head *bh;
@@ -347,6 +380,8 @@ restart:
 				break;
 			}
 			retry = __process_buffer(journal, jh, bhs,&batch_count);
+			if (retry < 0)
+				result = retry;
 			if (!retry && (need_resched() ||
 				spin_needbreak(&journal->j_list_lock))) {
 				spin_unlock(&journal->j_list_lock);
@@ -371,14 +406,18 @@ restart:
 		 * Now we have cleaned up the first transaction's checkpoint
 		 * list. Let's clean up the second one
 		 */
-		__wait_cp_io(journal, transaction);
+		err = __wait_cp_io(journal, transaction);
+		if (!result)
+			result = err;
 	}
 out:
 	spin_unlock(&journal->j_list_lock);
-	result = cleanup_journal_tail(journal);
 	if (result < 0)
-		return result;
-	return 0;
+		__journal_abort_checkpoint(journal, result);
+	else
+		result = cleanup_journal_tail(journal);
+
+	return (result < 0) ? result : 0;
 }
 
 /*
Index: linux-2.6.25/include/linux/jbd.h
===================================================================
--- linux-2.6.25.orig/include/linux/jbd.h
+++ linux-2.6.25/include/linux/jbd.h
@@ -818,6 +818,8 @@ struct journal_s
 #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
 #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
 #define JFS_BARRIER	0x020	/* Use IDE barriers */
+#define JFS_CP_ABORT	0x040	/* Checkpointing has failed.  We don't have to
+				 * clean the journal space.  */
 
 /*
  * Function declarations for the journaling transaction and buffer
@@ -1006,6 +1008,11 @@ static inline int is_journal_aborted(jou
 	return journal->j_flags & JFS_ABORT;
 }
 
+static inline int is_checkpoint_aborted(journal_t *journal)
+{
+	return journal->j_flags & JFS_CP_ABORT;
+}
+
 static inline int is_handle_aborted(handle_t *handle)
 {
 	if (handle->h_aborted)
Index: linux-2.6.25/fs/ext3/ioctl.c
===================================================================
--- linux-2.6.25.orig/fs/ext3/ioctl.c
+++ linux-2.6.25/fs/ext3/ioctl.c
@@ -213,7 +213,7 @@ flags_err:
 	case EXT3_IOC_GROUP_EXTEND: {
 		ext3_fsblk_t n_blocks_count;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -226,15 +226,15 @@ flags_err:
 
 		err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
 
-		return err;
+		return (err) ? err : err2;
 	}
 	case EXT3_IOC_GROUP_ADD: {
 		struct ext3_new_group_data input;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -248,10 +248,10 @@ flags_err:
 
 		err = ext3_group_add(sb, &input);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
 
-		return err;
+		return (err) ? err : err2;
 	}
 
 
Index: linux-2.6.25/fs/ext3/super.c
===================================================================
--- linux-2.6.25.orig/fs/ext3/super.c
+++ linux-2.6.25/fs/ext3/super.c
@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
 	ext3_xattr_put_super(sb);
 	journal_destroy(sbi->s_journal);
 	if (!(sb->s_flags & MS_RDONLY)) {
-		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
+		if (!is_checkpoint_aborted(sbi->s_journal)) {
+			EXT3_CLEAR_INCOMPAT_FEATURE(sb,
+				EXT3_FEATURE_INCOMPAT_RECOVER);
+		}
 		es->s_state = cpu_to_le16(sbi->s_mount_state);
 		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
 		mark_buffer_dirty(sbi->s_sbh);
@@ -2370,7 +2373,13 @@ static void ext3_write_super_lockfs(stru
 
 		/* Now we set up the journal barrier. */
 		journal_lock_updates(journal);
-		journal_flush(journal);
+
+		/*
+		 * We don't want to clear needs_recovery flag when we failed
+		 * to flush the journal.
+		 */
+		if (journal_flush(journal) < 0)
+			return;
 
 		/* Journal blocked and flushed, clear needs_recovery flag. */
 		EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
Index: linux-2.6.25/fs/jbd/journal.c
===================================================================
--- linux-2.6.25.orig/fs/jbd/journal.c
+++ linux-2.6.25/fs/jbd/journal.c
@@ -674,7 +674,7 @@ static journal_t * journal_init_common (
 	journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
 
 	/* The journal is marked for error until we succeed with recovery! */
-	journal->j_flags = JFS_ABORT;
+	journal->j_flags = JFS_ABORT | JFS_CP_ABORT;
 
 	/* Set up a default-sized revoke table for the new mount. */
 	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
@@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
 	if (journal_reset(journal))
 		goto recovery_error;
 
-	journal->j_flags &= ~JFS_ABORT;
+	journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
 	journal->j_flags |= JFS_LOADED;
 	return 0;
 
@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
 	journal->j_tail = 0;
 	journal->j_tail_sequence = ++journal->j_transaction_sequence;
 	if (journal->j_sb_buffer) {
-		journal_update_superblock(journal, 1);
+		if (!is_checkpoint_aborted(journal))
+			journal_update_superblock(journal, 1);
 		brelse(journal->j_sb_buffer);
 	}
 
@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
 
 int journal_flush(journal_t *journal)
 {
-	int err = 0;
 	transaction_t *transaction = NULL;
 	unsigned long old_tail;
 
@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
 		spin_unlock(&journal->j_state_lock);
 	}
 
-	/* ...and flush everything in the log out to disk. */
+	/* ...and flush everything in the log out to disk.
+	 * Even if an error occurs, we don't stop this loop.
+	 * We do checkpoint as much as possible.  */
 	spin_lock(&journal->j_list_lock);
-	while (!err && journal->j_checkpoint_transactions != NULL) {
+	while (journal->j_checkpoint_transactions != NULL) {
 		spin_unlock(&journal->j_list_lock);
-		err = log_do_checkpoint(journal);
+		log_do_checkpoint(journal);
 		spin_lock(&journal->j_list_lock);
 	}
 	spin_unlock(&journal->j_list_lock);
+	if (is_checkpoint_aborted(journal))
+		return -EIO;
+
 	cleanup_journal_tail(journal);
 
 	/* Finally, mark the journal as really needing no recovery.
@@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
 	J_ASSERT(journal->j_head == journal->j_tail);
 	J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
 	spin_unlock(&journal->j_state_lock);
-	return err;
+	return 0;
 }
 
 /**
Index: linux-2.6.25/fs/jbd/recovery.c
===================================================================
--- linux-2.6.25.orig/fs/jbd/recovery.c
+++ linux-2.6.25/fs/jbd/recovery.c
@@ -223,7 +223,7 @@ do {									\
  */
 int journal_recover(journal_t *journal)
 {
-	int			err;
+	int			err, err2;
 	journal_superblock_t *	sb;
 
 	struct recovery_info	info;
@@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
 	journal->j_transaction_sequence = ++info.end_transaction;
 
 	journal_clear_revoke(journal);
-	sync_blockdev(journal->j_fs_dev);
+	err2 = sync_blockdev(journal->j_fs_dev);
+	if (!err)
+		err = err2;
+
 	return err;
 }
 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] jbd: possible filesystem corruption fixes
  2008-04-18 13:00 [PATCH 0/4] jbd: possible filesystem corruption fixes Hidehiro Kawai
                   ` (3 preceding siblings ...)
  2008-04-18 13:39 ` [PATCH 4/4] jbd/ext3: fix error handling for checkpoint io Hidehiro Kawai
@ 2008-04-18 14:09 ` Josef Bacik
  2008-04-18 19:26   ` Mingming Cao
  2008-04-23 10:59   ` Hidehiro Kawai
  4 siblings, 2 replies; 11+ messages in thread
From: Josef Bacik @ 2008-04-18 14:09 UTC (permalink / raw)
  To: Hidehiro Kawai
  Cc: akpm, sct, adilger, linux-kernel, linux-ext4, jack, sugita,
	Satoshi OSHIMA

On Fri, Apr 18, 2008 at 10:00:54PM +0900, Hidehiro Kawai wrote:
> Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes
> 
> The current JBD is not sufficient for I/O error handling.  It can
> cause filesystem corruption.   An example scenario:
> 
> 1. fail to write a metadata buffer to block B in the journal
> 2. succeed to write the commit record
> 3. the system crashes, reboots and mount the filesystem
> 4. in the recovery phase, succeed to read data from block B
> 5. write back the read data to the filesystem, but it is a stale
>    metadata
> 6. lose some files and directories!
> 
> This scenario is a rare case, but it (temporal I/O error)
> can occur.  If we abort the journal between 1. and 2., this
> tragedy can be avoided.
> 
> This patch set fixes several error handling problems to protect
> from filesystem corruption caused by I/O errors.  It has been
> done only for JBD and ext3 parts.
>

There doesn't seem like much point in taking these patches as Jan is rewriting
the ordered mode path and most of these functions will be going away soon.
Those patches seem like they will be coming soon and will obsolete these.

Josef 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] jbd: possible filesystem corruption fixes
  2008-04-18 14:09 ` [PATCH 0/4] jbd: possible filesystem corruption fixes Josef Bacik
@ 2008-04-18 19:26   ` Mingming Cao
  2008-04-21 21:08     ` Andreas Dilger
  2008-04-23 11:01     ` Hidehiro Kawai
  2008-04-23 10:59   ` Hidehiro Kawai
  1 sibling, 2 replies; 11+ messages in thread
From: Mingming Cao @ 2008-04-18 19:26 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Hidehiro Kawai, akpm, sct, adilger, linux-kernel, linux-ext4,
	jack, sugita, Satoshi OSHIMA

On Fri, 2008-04-18 at 10:09 -0400, Josef Bacik wrote:
> On Fri, Apr 18, 2008 at 10:00:54PM +0900, Hidehiro Kawai wrote:
> > Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes
> > 
> > The current JBD is not sufficient for I/O error handling.  It can
> > cause filesystem corruption.   An example scenario:
> > 
> > 1. fail to write a metadata buffer to block B in the journal
> > 2. succeed to write the commit record
> > 3. the system crashes, reboots and mount the filesystem
> > 4. in the recovery phase, succeed to read data from block B
> > 5. write back the read data to the filesystem, but it is a stale
> >    metadata
> > 6. lose some files and directories!
> > 
> > This scenario is a rare case, but it (temporal I/O error)
> > can occur.  If we abort the journal between 1. and 2., this
> > tragedy can be avoided.
> > 
> > This patch set fixes several error handling problems to protect
> > from filesystem corruption caused by I/O errors.  It has been
> > done only for JBD and ext3 parts.
> >
> 

Could you sent Ext4/JBD2 version patches? Thanks!

> There doesn't seem like much point in taking these patches as Jan is rewriting
> the ordered mode path and most of these functions will be going away soon.
> Those patches seem like they will be coming soon and will obsolete these.
> 

I hope we have a better ordered mode very soon too. Just thought it's
still valid to fix the current ordered mode for people who uses
linux-2.6.25 kernel today. 

Mingming
> Josef 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] jbd: possible filesystem corruption fixes
  2008-04-18 19:26   ` Mingming Cao
@ 2008-04-21 21:08     ` Andreas Dilger
  2008-04-23 12:45       ` Hidehiro Kawai
  2008-04-23 11:01     ` Hidehiro Kawai
  1 sibling, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2008-04-21 21:08 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Josef Bacik, Hidehiro Kawai, akpm, sct, adilger, linux-kernel,
	linux-ext4, jack, sugita, Satoshi OSHIMA

On Apr 18, 2008  12:26 -0700, Mingming Cao wrote:
> On Fri, 2008-04-18 at 10:09 -0400, Josef Bacik wrote:
> > On Fri, Apr 18, 2008 at 10:00:54PM +0900, Hidehiro Kawai wrote:
> > > Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes
> > > 
> > > The current JBD is not sufficient for I/O error handling.  It can
> > > cause filesystem corruption.   An example scenario:
> > > 
> > > 1. fail to write a metadata buffer to block B in the journal
> > > 2. succeed to write the commit record
> > > 3. the system crashes, reboots and mount the filesystem
> > > 4. in the recovery phase, succeed to read data from block B
> > > 5. write back the read data to the filesystem, but it is a stale
> > >    metadata
> > > 6. lose some files and directories!
> > > 
> > > This scenario is a rare case, but it (temporal I/O error)
> > > can occur.  If we abort the journal between 1. and 2., this
> > > tragedy can be avoided.
> > > 
> > > This patch set fixes several error handling problems to protect
> > > from filesystem corruption caused by I/O errors.  It has been
> > > done only for JBD and ext3 parts.
> 
> Could you sent Ext4/JBD2 version patches? Thanks!

Actually, the journal checksum in ext4/jbd2 detects this kind of error,
as well as errors that are NOT reported to the caller (e.g. media errors
not reported to the kernel).

One question is whether we want to _introduce_ a point of failure to the
filesystem that may never actually cause a problem for the system,
since the journal is only needed in the case of a crash.  By aborting
the journal at this point instead of letting the checkpoint write the
data to the filesystem then we are guaranteed a filesystem failure
instead of "likely no problem at all".

The journal checksum would detect the bad data in the transaction in the
cases where it is important, and during operation it makes more sense
to report the error via printk() so the administrator has some chance to
do something about it.  There is no reason why the jbd2 change couldn't be
merged back to jbd so ext3 could use the journal checksumming.  It is a
"COMPAT" journal feature.

> > There doesn't seem like much point in taking these patches as Jan is rewriting
> > the ordered mode path and most of these functions will be going away soon.
> > Those patches seem like they will be coming soon and will obsolete these.
> 
> I hope we have a better ordered mode very soon too. Just thought it's
> still valid to fix the current ordered mode for people who uses
> linux-2.6.25 kernel today. 

I agree that we should at least report the errors to the syslog (if this
isn't happening already) so the admin knows there is a problem, and I also
agree that waiting for some future patch isn't a good reason to stop making
fixes to the current code.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] jbd: possible filesystem corruption fixes
  2008-04-18 14:09 ` [PATCH 0/4] jbd: possible filesystem corruption fixes Josef Bacik
  2008-04-18 19:26   ` Mingming Cao
@ 2008-04-23 10:59   ` Hidehiro Kawai
  1 sibling, 0 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-23 10:59 UTC (permalink / raw)
  To: Josef Bacik
  Cc: akpm, sct, adilger, linux-kernel, linux-ext4, jack, sugita,
	Satoshi OSHIMA

Josef Bacik wrote:

> On Fri, Apr 18, 2008 at 10:00:54PM +0900, Hidehiro Kawai wrote:
> 
>>Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes
>>
>>This patch set fixes several error handling problems to protect
>>from filesystem corruption caused by I/O errors.  It has been
>>done only for JBD and ext3 parts.
>>
> There doesn't seem like much point in taking these patches as Jan is rewriting
> the ordered mode path and most of these functions will be going away soon.
> Those patches seem like they will be coming soon and will obsolete these.

Yes, PATCH 1/4 and PATCH 2/4 are specific to the ordered mode, and Jan's
patches seem to fix the same problems.  But the remain patches target
generic journaling problems, so I think those patches are still needed.

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] jbd: possible filesystem corruption fixes
  2008-04-18 19:26   ` Mingming Cao
  2008-04-21 21:08     ` Andreas Dilger
@ 2008-04-23 11:01     ` Hidehiro Kawai
  1 sibling, 0 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-23 11:01 UTC (permalink / raw)
  To: cmm
  Cc: Josef Bacik, akpm, sct, adilger, linux-kernel, linux-ext4, jack,
	sugita, Satoshi OSHIMA

Mingming Cao wrote:

>>>This patch set fixes several error handling problems to protect
>>>from filesystem corruption caused by I/O errors.  It has been
>>>done only for JBD and ext3 parts.
> 
> Could you sent Ext4/JBD2 version patches? Thanks!

I will try it, but I don't know I can send the Ext4/JBD2 version
within a reasonable time because I haven't read Ext4/JBD2 codes
so much yet.
 
>>There doesn't seem like much point in taking these patches as Jan is rewriting
>>the ordered mode path and most of these functions will be going away soon.
>>Those patches seem like they will be coming soon and will obsolete these.
> 
> I hope we have a better ordered mode very soon too. Just thought it's
> still valid to fix the current ordered mode for people who uses
> linux-2.6.25 kernel today. 

And older kernel users will be happy if someone or I make a backport
of this patch set.

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/4] jbd: possible filesystem corruption fixes
  2008-04-21 21:08     ` Andreas Dilger
@ 2008-04-23 12:45       ` Hidehiro Kawai
  0 siblings, 0 replies; 11+ messages in thread
From: Hidehiro Kawai @ 2008-04-23 12:45 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Mingming Cao, Josef Bacik, akpm, sct, adilger, linux-kernel,
	linux-ext4, jack, sugita, Satoshi OSHIMA

Andreas Dilger wrote:

> On Apr 18, 2008  12:26 -0700, Mingming Cao wrote:
> 
>>On Fri, 2008-04-18 at 10:09 -0400, Josef Bacik wrote:
>>
>>>On Fri, Apr 18, 2008 at 10:00:54PM +0900, Hidehiro Kawai wrote:
>>>
>>>>Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes
>>>>
>>>>The current JBD is not sufficient for I/O error handling.  It can
>>>>cause filesystem corruption.   An example scenario:
>>>>
>>>>1. fail to write a metadata buffer to block B in the journal
>>>>2. succeed to write the commit record
>>>>3. the system crashes, reboots and mount the filesystem
>>>>4. in the recovery phase, succeed to read data from block B
>>>>5. write back the read data to the filesystem, but it is a stale
>>>>   metadata
>>>>6. lose some files and directories!
>>>>
>>>>This scenario is a rare case, but it (temporal I/O error)
>>>>can occur.  If we abort the journal between 1. and 2., this
>>>>tragedy can be avoided.
>>>>
>>>>This patch set fixes several error handling problems to protect
>>>>from filesystem corruption caused by I/O errors.  It has been
>>>>done only for JBD and ext3 parts.
>>
>>Could you sent Ext4/JBD2 version patches? Thanks!
> 
> 
> Actually, the journal checksum in ext4/jbd2 detects this kind of error,
> as well as errors that are NOT reported to the caller (e.g. media errors
> not reported to the kernel).

It's interesting feature.  I read the journal checksum patch,
it seems to fix the problem addressed by PATCH 3/4.
However, journal checksum feature is optional, so PATCH 3/4
will be needed as long as checksuming feature isn't turned
on always.

> One question is whether we want to _introduce_ a point of failure to the
> filesystem that may never actually cause a problem for the system,
> since the journal is only needed in the case of a crash.  By aborting
> the journal at this point instead of letting the checkpoint write the
> data to the filesystem then we are guaranteed a filesystem failure
> instead of "likely no problem at all".

I think it depends on the system and administrator.
When we failed to write metadata to the journal, we...

  (a) abort journaling
      - the filesystem can keep a consistent state if the system
        crashed
      - the system will stop because the filesystem becomes read-only
        state (default)
  (b) only do printk()
      - the system can continue to work
      - bad journalled data may break the file system if the system
        crashed

A user who demands high data integrity will choose (a), and
a user who demands high availability will choose (b).
We might want to enable the user to specify the behavior
on error such as the "errors" mount option.

 
> The journal checksum would detect the bad data in the transaction in the
> cases where it is important, and during operation it makes more sense
> to report the error via printk() so the administrator has some chance to
> do something about it.  There is no reason why the jbd2 change couldn't be
> merged back to jbd so ext3 could use the journal checksumming.  It is a
> "COMPAT" journal feature.

It's interesting.  For example, when a fsync operation is issued,
commit the current transaction, then read the journalled data of
that transaction to check the checksum.  If the bad data is detected,
flush the whole journal.  Aborting the journal will also make sense
because the journal space is errorneous.

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-04-23 12:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-18 13:00 [PATCH 0/4] jbd: possible filesystem corruption fixes Hidehiro Kawai
2008-04-18 13:36 ` [PATCH 1/4] jbd: strictly check for write errors on data buffers Hidehiro Kawai
2008-04-18 13:37 ` [PATCH 2/4] jbd: ordered data integrity fix Hidehiro Kawai
2008-04-18 13:38 ` [PATCH 3/4] jbd: abort when failed to log metadata buffers Hidehiro Kawai
2008-04-18 13:39 ` [PATCH 4/4] jbd/ext3: fix error handling for checkpoint io Hidehiro Kawai
2008-04-18 14:09 ` [PATCH 0/4] jbd: possible filesystem corruption fixes Josef Bacik
2008-04-18 19:26   ` Mingming Cao
2008-04-21 21:08     ` Andreas Dilger
2008-04-23 12:45       ` Hidehiro Kawai
2008-04-23 11:01     ` Hidehiro Kawai
2008-04-23 10:59   ` Hidehiro Kawai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).