linux-ext4.vger.kernel.org archive mirror
* [PATCH] jbd2: recheck chechpointing non-dirty buffer
@ 2023-04-26 13:10 Zhang Yi
  2023-05-03 15:50 ` Jan Kara
  0 siblings, 1 reply; 3+ messages in thread
From: Zhang Yi @ 2023-04-26 13:10 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, yi.zhang, yukuai3, chengzhihao1

From: Zhang Yi <yi.zhang@huawei.com>

There is a long-standing metadata corruption issue that happens from
time to time, but it is very difficult to reproduce and analyse. With
the help of the JBD2_CYCLE_RECORD option, we found that the
checkpointing process fails to write out some buffers that race with a
concurrent do_get_write_access(). See below for details.

jbd2_log_do_checkpoint() //transaction X
 //buffer A is dirty and does not belong to any transaction
 __buffer_relink_io() //move it to the IO list
 __flush_batch()
  write_dirty_buffer()
                             do_get_write_access()
                             clear_buffer_dirty
                             __jbd2_journal_file_buffer()
                             //add buffer A to a new transaction Y
   lock_buffer(bh)
   //buffer A is clean now, so it is not written out
 __jbd2_journal_remove_checkpoint()
 //checkpoint finishes without buffer A having been written
 //filesystem is corrupted if the new transaction Y isn't fully written out.

The fix is subtle because we can't trust the checkpointing buffers and
transactions once we release the j_list_lock: they could have been
written back and checkpointed by someone else, or they could have been
added to a new transaction. So we have to re-add them to the checkpoint
list and recheck their status if they are clean and don't need to be
written out.
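
To make the interleaving above concrete, here is a minimal userspace
sketch in C. It is not kernel code: struct toy_buffer and every toy_*
function are hypothetical stand-ins invented for illustration, and the
model assumes the racing do_get_write_access() always wins before the
flush path looks at the dirty bit. It only shows why "write if dirty,
then drop from the checkpoint" can lose a buffer, and why keeping a
clean buffer on the checkpoint list for a later recheck avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a buffer_head; NOT the kernel structure. */
struct toy_buffer {
	bool dirty;          /* models the buffer dirty bit */
	bool on_cp_list;     /* still tracked by transaction X's checkpoint */
	bool in_running_tx;  /* filed to a newer transaction Y */
	int  writes;         /* how many times it reached "disk" */
};

/* Models do_get_write_access(): transaction Y takes the buffer over. */
static void toy_get_write_access(struct toy_buffer *b)
{
	b->dirty = false;
	b->in_running_tx = true;
}

/* Old behaviour: write only if dirty, then drop it from the checkpoint. */
static void toy_flush_old(struct toy_buffer *b)
{
	if (b->dirty) {
		b->dirty = false;
		b->writes++;
	}
	b->on_cp_list = false;	/* checkpoint considers the buffer done */
}

/* Patched idea: a clean buffer stays on the checkpoint list for a recheck. */
static void toy_flush_rechecked(struct toy_buffer *b)
{
	if (!b->dirty) {
		b->on_cp_list = true;	/* relink, examine again later */
		return;
	}
	b->dirty = false;
	b->writes++;
	b->on_cp_list = false;
}

/*
 * Replays the interleaving from the diagram. Returns true if the buffer
 * was lost: never written and no longer tracked by the checkpoint.
 */
static bool toy_race_loses_buffer(void (*flush)(struct toy_buffer *))
{
	struct toy_buffer a = { .dirty = true, .on_cp_list = true };

	toy_get_write_access(&a);	/* races in before the flush */
	flush(&a);
	return a.writes == 0 && !a.on_cp_list;
}
```

Calling toy_race_loses_buffer(toy_flush_old) reports the loss, while
toy_race_loses_buffer(toy_flush_rechecked) does not. The real patch
additionally checks jh->b_cp_transaction under j_list_lock before
relinking, which this toy model leaves out.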

Cc: stable@vger.kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
 fs/jbd2/checkpoint.c | 52 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 49 insertions(+), 3 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 51bd38da21cd..1aca860eb0f6 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -77,8 +77,31 @@ static inline void __buffer_relink_io(struct journal_head *jh)
 		jh->b_cpnext->b_cpprev = jh;
 	}
 	transaction->t_checkpoint_io_list = jh;
+	transaction->t_chp_stats.cs_written++;
 }
 
+/*
+ * Move a buffer from the checkpoint io list back to the checkpoint list
+ *
+ * Called with j_list_lock held
+ */
+static inline void __buffer_relink_cp(struct journal_head *jh)
+{
+	transaction_t *transaction = jh->b_cp_transaction;
+
+	__buffer_unlink(jh);
+
+	if (!transaction->t_checkpoint_list) {
+		jh->b_cpnext = jh->b_cpprev = jh;
+	} else {
+		jh->b_cpnext = transaction->t_checkpoint_list;
+		jh->b_cpprev = transaction->t_checkpoint_list->b_cpprev;
+		jh->b_cpprev->b_cpnext = jh;
+		jh->b_cpnext->b_cpprev = jh;
+	}
+	transaction->t_checkpoint_list = jh;
+	transaction->t_chp_stats.cs_written--;
+}
 /*
  * Check a checkpoint buffer could be release or not.
  *
@@ -175,8 +198,31 @@ __flush_batch(journal_t *journal, int *batch_count)
 	struct blk_plug plug;
 
 	blk_start_plug(&plug);
-	for (i = 0; i < *batch_count; i++)
-		write_dirty_buffer(journal->j_chkpt_bhs[i], REQ_SYNC);
+	for (i = 0; i < *batch_count; i++) {
+		struct buffer_head *bh = journal->j_chkpt_bhs[i];
+		struct journal_head *jh = bh2jh(bh);
+
+		lock_buffer(bh);
+		/*
+		 * This buffer isn't dirty. A new transaction may have
+		 * gotten write access to it again, so re-add it to the
+		 * checkpoint list if it still needs to be checkpointed,
+		 * and wait until that transaction has been written out.
+		 */
+		if (!test_clear_buffer_dirty(bh)) {
+			unlock_buffer(bh);
+			spin_lock(&journal->j_list_lock);
+			if (jh->b_cp_transaction)
+				__buffer_relink_cp(jh);
+			spin_unlock(&journal->j_list_lock);
+			jbd2_journal_put_journal_head(jh);
+			continue;
+		}
+		jbd2_journal_put_journal_head(jh);
+		bh->b_end_io = end_buffer_write_sync;
+		get_bh(bh);
+		submit_bh(REQ_OP_WRITE | REQ_SYNC, bh);
+	}
 	blk_finish_plug(&plug);
 
 	for (i = 0; i < *batch_count; i++) {
@@ -303,9 +349,9 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 		BUFFER_TRACE(bh, "queue");
 		get_bh(bh);
 		J_ASSERT_BH(bh, !buffer_jwrite(bh));
+		jbd2_journal_grab_journal_head(bh);
 		journal->j_chkpt_bhs[batch_count++] = bh;
 		__buffer_relink_io(jh);
-		transaction->t_chp_stats.cs_written++;
 		if ((batch_count == JBD2_NR_BATCH) ||
 		    need_resched() ||
 		    spin_needbreak(&journal->j_list_lock))
-- 
2.31.1
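
The list manipulation in __buffer_relink_cp() above can be tried outside
the kernel. The following is a minimal userspace sketch in C, assuming
circular doubly linked lists addressed through a head pointer, as the
checkpoint lists are. struct toy_jh and the toy_* helpers are
hypothetical names made up for this illustration; they are not jbd2
functions and they take no locks.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a journal_head's checkpoint links. */
struct toy_jh {
	struct toy_jh *next, *prev;
	int id;
};

/* Insert jh in front of *head in a circular list; jh becomes the head. */
static void toy_link_head(struct toy_jh **head, struct toy_jh *jh)
{
	if (!*head) {
		jh->next = jh->prev = jh;
	} else {
		jh->next = *head;
		jh->prev = (*head)->prev;
		jh->prev->next = jh;
		jh->next->prev = jh;
	}
	*head = jh;
}

/* Unlink jh from a circular list; *head becomes NULL when it empties. */
static void toy_unlink(struct toy_jh **head, struct toy_jh *jh)
{
	if (jh->next == jh) {
		*head = NULL;
	} else {
		jh->prev->next = jh->next;
		jh->next->prev = jh->prev;
		if (*head == jh)
			*head = jh->next;
	}
	jh->next = jh->prev = NULL;
}

/* Move jh from the "io" list back to the head of the "checkpoint" list. */
static void toy_relink_cp(struct toy_jh **io_list, struct toy_jh **cp_list,
			  struct toy_jh *jh)
{
	toy_unlink(io_list, jh);
	toy_link_head(cp_list, jh);
}
```

For example, with buffers a and b on the io list and c on the checkpoint
list, toy_relink_cp(&io, &cp, &a) leaves b alone on the io list and
makes a the head of the checkpoint list, directly in front of c.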



* Re: [PATCH] jbd2: recheck chechpointing non-dirty buffer
  2023-04-26 13:10 [PATCH] jbd2: recheck chechpointing non-dirty buffer Zhang Yi
@ 2023-05-03 15:50 ` Jan Kara
  2023-05-04 11:35   ` Zhang Yi
  0 siblings, 1 reply; 3+ messages in thread
From: Jan Kara @ 2023-05-03 15:50 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, yukuai3, chengzhihao1

On Wed 26-04-23 21:10:41, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> There is a long-standing metadata corruption issue that happens from
> time to time, but it is very difficult to reproduce and analyse. With
> the help of the JBD2_CYCLE_RECORD option, we found that the
> checkpointing process fails to write out some buffers that race with a
> concurrent do_get_write_access(). See below for details.
> 
> jbd2_log_do_checkpoint() //transaction X
>  //buffer A is dirty and does not belong to any transaction
>  __buffer_relink_io() //move it to the IO list
>  __flush_batch()
>   write_dirty_buffer()
>                              do_get_write_access()
>                              clear_buffer_dirty
>                              __jbd2_journal_file_buffer()
>                              //add buffer A to a new transaction Y
>    lock_buffer(bh)
>    //buffer A is clean now, so it is not written out
>  __jbd2_journal_remove_checkpoint()
>  //checkpoint finishes without buffer A having been written
>  //filesystem is corrupted if the new transaction Y isn't fully written out.
> 
> The fix is subtle because we can't trust the checkpointing buffers and
> transactions once we release the j_list_lock: they could have been
> written back and checkpointed by someone else, or they could have been
> added to a new transaction. So we have to re-add them to the checkpoint
> list and recheck their status if they are clean and don't need to be
> written out.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> Tested-by: Zhihao Cheng <chengzhihao1@huawei.com>

Thanks for the analysis. This indeed looks like a nasty issue to debug.  I
think we can actually solve the problem by simplifying the checkpointing
code in jbd2_log_do_checkpoint(), not by making it more complex. What I
think we can do is that we can completely remove the t_checkpoint_io_list
and only keep buffers on t_checkpoint_list. When processing
t_checkpoint_list in jbd2_log_do_checkpoint(), we just need to make sure to
move t_checkpoint_list pointer to the next buffer when adding buffer to
j_chkpt_bhs array. That way buffers to submit / already submitted buffers
will be accumulating at the tail of the list. The logic in the loop already
handles waiting for buffers under IO / removing cleaned buffers so this
makes sure the list will eventually get empty. Buffers cannot get redirtied
without being removed from the checkpoint list and moved to a newer
transaction's checkpoint list so forward progress is guaranteed. The only
other tweak we need to add is to check for the situation when all the
buffers are in the j_chkpt_bhs array. So the end of the loop should look
like:

		transaction->t_checkpoint_list = jh->b_cpnext;
		if (batch_count == JBD2_NR_BATCH || need_resched() ||
		    spin_needbreak(&journal->j_list_lock) ||
		    jh2bh(transaction->t_checkpoint_list) == journal->j_chkpt_bhs[0])
			flush and restart

and that should be it. What do you think?
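
The rotation idea can be checked in a small userspace model. What
follows is only a sketch in C under simplifying assumptions: a single
thread, no locking, and a "write" that just clears a dirty flag. The
names struct toy_jh, struct toy_tx, TOY_NR_BATCH and the toy_* functions
are hypothetical and exist only in this example. It keeps every buffer
on one circular list, advances the list head past each batched buffer,
and flushes when the batch is full or the head has wrapped around to the
first batched buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_BATCH 4	/* stand-in for JBD2_NR_BATCH, value is arbitrary */

struct toy_jh {
	struct toy_jh *next, *prev;
	bool dirty;
	int writes;
};

struct toy_tx {
	struct toy_jh *cp_list;	/* head of the circular checkpoint list */
};

/* Append jh at the tail of the circular list. */
static void toy_add(struct toy_tx *tx, struct toy_jh *jh)
{
	if (!tx->cp_list) {
		jh->next = jh->prev = jh;
		tx->cp_list = jh;
		return;
	}
	jh->next = tx->cp_list;
	jh->prev = tx->cp_list->prev;
	jh->prev->next = jh;
	jh->next->prev = jh;
}

/* Remove jh; the head moves on, or becomes NULL when the list empties. */
static void toy_remove(struct toy_tx *tx, struct toy_jh *jh)
{
	if (jh->next == jh) {
		tx->cp_list = NULL;
	} else {
		jh->prev->next = jh->next;
		jh->next->prev = jh->prev;
		if (tx->cp_list == jh)
			tx->cp_list = jh->next;
	}
	jh->next = jh->prev = NULL;
}

/* "Write" each batched buffer that is still dirty, then empty the batch. */
static void toy_flush(struct toy_jh **batch, int *count)
{
	for (int i = 0; i < *count; i++) {
		if (batch[i]->dirty) {
			batch[i]->dirty = false;
			batch[i]->writes++;
		}
	}
	*count = 0;
}

/* Checkpoint until the list is empty; returns the number of flushes. */
static int toy_checkpoint(struct toy_tx *tx)
{
	struct toy_jh *batch[TOY_NR_BATCH];
	int count = 0, flushes = 0;

	while (tx->cp_list) {
		struct toy_jh *jh = tx->cp_list;

		if (!jh->dirty) {
			toy_remove(tx, jh);	/* clean: nothing left to do */
		} else {
			batch[count++] = jh;
			tx->cp_list = jh->next;	/* batched ones drift to the tail */
		}
		/* Flush when full, or when only batched buffers remain ahead. */
		if (count == TOY_NR_BATCH ||
		    (count > 0 && tx->cp_list == batch[0])) {
			toy_flush(batch, &count);
			flushes++;
		}
	}
	return flushes;
}
```

Because a batched buffer never leaves the list, the list cannot become
empty while the batch still holds unwritten buffers, and each flush
cleans at least one buffer, so the loop terminates. The model checks the
wrap-around condition after removals as well as after batching; that
placement is a choice made for this sketch, not something stated in the
proposal.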

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] jbd2: recheck chechpointing non-dirty buffer
  2023-05-03 15:50 ` Jan Kara
@ 2023-05-04 11:35   ` Zhang Yi
  0 siblings, 0 replies; 3+ messages in thread
From: Zhang Yi @ 2023-05-04 11:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, tytso, adilger.kernel, yi.zhang, yukuai3, chengzhihao1

On 2023/5/3 23:50, Jan Kara wrote:
> On Wed 26-04-23 21:10:41, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> There is a long-standing metadata corruption issue that happens from
>> time to time, but it is very difficult to reproduce and analyse. With
>> the help of the JBD2_CYCLE_RECORD option, we found that the
>> checkpointing process fails to write out some buffers that race with a
>> concurrent do_get_write_access(). See below for details.
>>
>> jbd2_log_do_checkpoint() //transaction X
>>  //buffer A is dirty and does not belong to any transaction
>>  __buffer_relink_io() //move it to the IO list
>>  __flush_batch()
>>   write_dirty_buffer()
>>                              do_get_write_access()
>>                              clear_buffer_dirty
>>                              __jbd2_journal_file_buffer()
>>                              //add buffer A to a new transaction Y
>>    lock_buffer(bh)
>>    //buffer A is clean now, so it is not written out
>>  __jbd2_journal_remove_checkpoint()
>>  //checkpoint finishes without buffer A having been written
>>  //filesystem is corrupted if the new transaction Y isn't fully written out.
>>
>> The fix is subtle because we can't trust the checkpointing buffers and
>> transactions once we release the j_list_lock: they could have been
>> written back and checkpointed by someone else, or they could have been
>> added to a new transaction. So we have to re-add them to the checkpoint
>> list and recheck their status if they are clean and don't need to be
>> written out.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> Tested-by: Zhihao Cheng <chengzhihao1@huawei.com>
> 
> Thanks for the analysis. This indeed looks like a nasty issue to debug.  I
> think we can actually solve the problem by simplifying the checkpointing
> code in jbd2_log_do_checkpoint(), not by making it more complex. What I
> think we can do is that we can completely remove the t_checkpoint_io_list
> and only keep buffers on t_checkpoint_list. When processing
> t_checkpoint_list in jbd2_log_do_checkpoint(), we just need to make sure to
> move t_checkpoint_list pointer to the next buffer when adding buffer to
> j_chkpt_bhs array. That way buffers to submit / already submitted buffers
> will be accumulating at the tail of the list. The logic in the loop already
> handles waiting for buffers under IO / removing cleaned buffers so this
> makes sure the list will eventually get empty. Buffers cannot get redirtied
> without being removed from the checkpoint list and moved to a newer
> transaction's checkpoint list so forward progress is guaranteed. The only
> other tweak we need to add is to check for the situation when all the
> buffers are in the j_chkpt_bhs array. So the end of the loop should look
> like:
> 
> 		transaction->t_checkpoint_list = jh->b_cpnext;
> 		if (batch_count == JBD2_NR_BATCH || need_resched() ||
> 		    spin_needbreak(&journal->j_list_lock) ||
> 		    jh2bh(transaction->t_checkpoint_list) == journal->j_chkpt_bhs[0])
> 			flush and restart
> 
> and that should be it. What do you think?
> 

This solution sounds great. Let me do it.

Thanks,
Yi.

