All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback
@ 2013-03-22 16:18 Dmitry Monakhov
  2013-03-22 16:18 ` [PATCH 2/3] ext4: fix journal callback list traversal Dmitry Monakhov
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Dmitry Monakhov @ 2013-03-22 16:18 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov

Following race is possible
[kjournald2]                              other_task
jbd2_journal_commit_transaction()
  j_state = T_FINISHED;
  spin_unlock(&journal->j_list_lock);
                                         ->jbd2_journal_remove_checkpoint()
					   ->jbd2_journal_free_transaction();
					     ->kmem_cache_free(transaction)
  ->j_commit_callback(journal, transaction);
    -> USE_AFTER_FREE

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

In order to demonstrace this issue one should mount ext4 with -odiscard option
on SSD disk. This makes callback longer and race window becomes wider.

In order to fix this we should mark transaction as finished only after
one of contidions are true:
1) Checkpoint lists are empty so we are shure what jbd2_journal_remove_checkpoint()
   can not be called under our feets.
2) All callbacks have completed

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/jbd2/commit.c     |   28 +++++++++++++++++++++++-----
 include/linux/jbd2.h |    1 +
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 750c701..31541f7 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -1134,7 +1134,7 @@ restart_loop:
 	journal->j_stats.run.rs_blocks_logged += stats.run.rs_blocks_logged;
 	spin_unlock(&journal->j_history_lock);
 
-	commit_transaction->t_state = T_FINISHED;
+	commit_transaction->t_state = T_COMMIT_CALLBACK;
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
@@ -1149,12 +1149,20 @@ restart_loop:
 				journal->j_average_commit_time*3) / 4;
 	else
 		journal->j_average_commit_time = commit_time;
-	write_unlock(&journal->j_state_lock);
 
+	/*
+	 * If checkpoint lists are empty it is safe to mark transaction as
+	 * finished because no one will destroy it in under our feets.
+	 */
 	if (commit_transaction->t_checkpoint_list == NULL &&
 	    commit_transaction->t_checkpoint_io_list == NULL) {
-		__jbd2_journal_drop_transaction(journal, commit_transaction);
 		to_free = 1;
+		commit_transaction->t_state = T_FINISHED;
+	}
+	write_unlock(&journal->j_state_lock);
+
+	if (to_free) {
+		__jbd2_journal_drop_transaction(journal, commit_transaction);
 	} else {
 		if (journal->j_checkpoint_transactions == NULL) {
 			journal->j_checkpoint_transactions = commit_transaction;
@@ -1179,8 +1187,18 @@ restart_loop:
 	trace_jbd2_end_commit(journal, commit_transaction);
 	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
-	if (to_free)
-		jbd2_journal_free_transaction(commit_transaction);
 
+	write_lock(&journal->j_state_lock);
+	spin_lock(&journal->j_list_lock);
+	commit_transaction->t_state = T_FINISHED;
+	/* Recheck checkpoint lists after j_list_lock was dropped */
+	if (commit_transaction->t_checkpoint_list == NULL &&
+	    commit_transaction->t_checkpoint_io_list == NULL) {
+		if (!to_free)
+			__jbd2_journal_drop_transaction(journal, commit_transaction);
+		jbd2_journal_free_transaction(commit_transaction);
+	}
+	spin_unlock(&journal->j_list_lock);
+	write_unlock(&journal->j_state_lock);
 	wake_up(&journal->j_wait_done_commit);
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 50e5a5e..608ea13 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -480,6 +480,7 @@ struct transaction_s
 		T_COMMIT,
 		T_COMMIT_DFLUSH,
 		T_COMMIT_JFLUSH,
+		T_COMMIT_CALLBACK,
 		T_FINISHED
 	}			t_state;
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/3] ext4: fix journal callback list traversal
  2013-03-22 16:18 [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Dmitry Monakhov
@ 2013-03-22 16:18 ` Dmitry Monakhov
  2013-03-26 21:48   ` Jan Kara
  2013-03-22 16:18 ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2013-03-22 16:18 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov

It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
  ->ext4_mb_free_metadata()
    ->ext4_journal_callback_del()

This result in following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fix the issue like follows:
- ext4_journal_commit_callback() make list truly traversial safe
  simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
  ext4_journal_callback_try_del()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/ext4_jbd2.h |    6 +++++-
 fs/ext4/mballoc.c   |    8 ++++----
 fs/ext4/super.c     |    4 +++-
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 4c216b1..aeed0ba 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -194,16 +194,20 @@ static inline void ext4_journal_callback_add(handle_t *handle,
  * ext4_journal_callback_del: delete a registered callback
  * @handle: active journal transaction handle on which callback was registered
  * @jce: registered journal callback entry to unregister
+ * Return true if object was sucessfully removed
  */
-static inline void ext4_journal_callback_del(handle_t *handle,
+static inline bool ext4_journal_callback_try_del(handle_t *handle,
 					     struct ext4_journal_cb_entry *jce)
 {
+	bool deleted;
 	struct ext4_sb_info *sbi =
 			EXT4_SB(handle->h_transaction->t_journal->j_private);
 
 	spin_lock(&sbi->s_md_lock);
+	deleted = !list_empty(&jce->jce_list);
 	list_del_init(&jce->jce_list);
 	spin_unlock(&sbi->s_md_lock);
+	return deleted;
 }
 
 int
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ee6614b..2e5196a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4420,11 +4420,11 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	node = rb_prev(new_node);
 	if (node) {
 		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		if (can_merge(entry, new_entry)) {
+		if (can_merge(entry, new_entry) &&
+			ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
 			new_entry->efd_start_cluster = entry->efd_start_cluster;
 			new_entry->efd_count += entry->efd_count;
 			rb_erase(node, &(db->bb_free_root));
-			ext4_journal_callback_del(handle, &entry->efd_jce);
 			kmem_cache_free(ext4_free_data_cachep, entry);
 		}
 	}
@@ -4432,10 +4432,10 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	node = rb_next(new_node);
 	if (node) {
 		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		if (can_merge(new_entry, entry)) {
+		if (can_merge(new_entry, entry) &&
+			ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
 			new_entry->efd_count += entry->efd_count;
 			rb_erase(node, &(db->bb_free_root));
-			ext4_journal_callback_del(handle, &entry->efd_jce);
 			kmem_cache_free(ext4_free_data_cachep, entry);
 		}
 	}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d1ee6a8..c7e1509 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -352,7 +352,9 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
 	struct ext4_journal_cb_entry	*jce, *tmp;
 
 	spin_lock(&sbi->s_md_lock);
-	list_for_each_entry_safe(jce, tmp, &txn->t_private_list, jce_list) {
+	while (!list_empty(&txn->t_private_list)) {
+		jce = list_entry(txn->t_private_list.next,
+				 struct ext4_journal_cb_entry, jce_list);
 		list_del_init(&jce->jce_list);
 		spin_unlock(&sbi->s_md_lock);
 		jce->jce_func(sb, jce, error);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/3] ext4: undegister es_shrinker if mount failed
  2013-03-22 16:18 [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Dmitry Monakhov
  2013-03-22 16:18 ` [PATCH 2/3] ext4: fix journal callback list traversal Dmitry Monakhov
@ 2013-03-22 16:18 ` Dmitry Monakhov
  2013-03-23  2:31   ` Zheng Liu
  2013-03-26 21:34 ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Jan Kara
  2013-03-27  2:56 ` Theodore Ts'o
  3 siblings, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2013-03-22 16:18 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov

Otherwise destroyed ext_sb_info will be part of global shinker list and result in
following OOPS:

JBD2: corrupted journal superblock
JBD2: recovery failed
EXT4-fs (dm-2): error loading journal
general protection fault: 0000 [#1] SMP
Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
mod
CPU 1
Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
Stack:
ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
Call Trace:
[<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
[<ffffffff81249381>] mount_bdev+0x331/0x340
[<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
[<ffffffff81362035>] ext4_mount+0x15/0x20
[<ffffffff8124869a>] mount_fs+0x9a/0x2e0
[<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
[<ffffffff81279c02>] do_new_mount+0x172/0x2e0
[<ffffffff8127aa56>] do_mount+0x376/0x380
[<ffffffff8127ab98>] sys_mount+0x138/0x150
[<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b


Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c7e1509..8196ca2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4008,6 +4008,7 @@ failed_mount_wq:
 		sbi->s_journal = NULL;
 	}
 failed_mount3:
+	ext4_es_unregister_shrinker(sb);
 	del_timer(&sbi->s_err_report);
 	if (sbi->s_flex_groups)
 		ext4_kvfree(sbi->s_flex_groups);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] ext4: undegister es_shrinker if mount failed
  2013-03-22 16:18 ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
@ 2013-03-23  2:31   ` Zheng Liu
  2013-04-01 14:49     ` Theodore Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Zheng Liu @ 2013-03-23  2:31 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz

On 03/23/2013 12:18 AM, Dmitry Monakhov wrote:
> Otherwise destroyed ext_sb_info will be part of global shinker list and result in
> following OOPS:
> 
> JBD2: corrupted journal superblock
> JBD2: recovery failed
> EXT4-fs (dm-2): error loading journal
> general protection fault: 0000 [#1] SMP
> Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
> mod
> CPU 1
> Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
> RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
> RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
> RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
> RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
> R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
> R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
> FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
> Stack:
> ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
> 00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
> ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
> Call Trace:
> [<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
> [<ffffffff81249381>] mount_bdev+0x331/0x340
> [<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
> [<ffffffff81362035>] ext4_mount+0x15/0x20
> [<ffffffff8124869a>] mount_fs+0x9a/0x2e0
> [<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
> [<ffffffff81279c02>] do_new_mount+0x172/0x2e0
> [<ffffffff8127aa56>] do_mount+0x376/0x380
> [<ffffffff8127ab98>] sys_mount+0x138/0x150
> [<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
> 
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/ext4/super.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c7e1509..8196ca2 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4008,6 +4008,7 @@ failed_mount_wq:
>  		sbi->s_journal = NULL;
>  	}
>  failed_mount3:
> +	ext4_es_unregister_shrinker(sb);
>  	del_timer(&sbi->s_err_report);
>  	if (sbi->s_flex_groups)
>  		ext4_kvfree(sbi->s_flex_groups);
> 

Thanks for fixing it.  One thing I concerned is that we could go to
failed_mount3 but ext4_es_register_shrinker() isn't called yet.  So I
think maybe we need to register es_shrinker before initializing some
per-cpu counters.

Thanks,
						- Zheng

>From 596eee1f5caf23084d9b43c47b327206bd8c7ee5 Mon Sep 17 00:00:00 2001
From: Zheng Liu <wenqing.lz@taobao.com>
Date: Sat, 23 Mar 2013 10:26:55 +0800
Subject: [PATCH] register es shrinker before initializing per-cpu counters

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/super.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d1ee6a8..80d96bf 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3694,6 +3694,9 @@ static int ext4_fill_super(struct super_block *sb,
void *data, int silent)
 	sbi->s_err_report.function = print_daily_error_info;
 	sbi->s_err_report.data = (unsigned long) sb;

+	/* Register extent status tree shrinker */
+	ext4_es_register_shrinker(sb);
+
 	err = percpu_counter_init(&sbi->s_freeclusters_counter,
 			ext4_count_free_clusters(sb));
 	if (!err) {
@@ -3719,9 +3722,6 @@ static int ext4_fill_super(struct super_block *sb,
void *data, int silent)
 	sbi->s_max_writeback_mb_bump = 128;
 	sbi->s_extent_max_zeroout_kb = 32;

-	/* Register extent status tree shrinker */
-	ext4_es_register_shrinker(sb);

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback
  2013-03-22 16:18 [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Dmitry Monakhov
  2013-03-22 16:18 ` [PATCH 2/3] ext4: fix journal callback list traversal Dmitry Monakhov
  2013-03-22 16:18 ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
@ 2013-03-26 21:34 ` Jan Kara
  2013-03-27  2:56 ` Theodore Ts'o
  3 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2013-03-26 21:34 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz

On Fri 22-03-13 20:18:41, Dmitry Monakhov wrote:
> Following race is possible
> [kjournald2]                              other_task
> jbd2_journal_commit_transaction()
>   j_state = T_FINISHED;
>   spin_unlock(&journal->j_list_lock);
>                                          ->jbd2_journal_remove_checkpoint()
> 					   ->jbd2_journal_free_transaction();
> 					     ->kmem_cache_free(transaction)
>   ->j_commit_callback(journal, transaction);
>     -> USE_AFTER_FREE
> 
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
> Hardware name:
> list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
> Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
> Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
> Call Trace:
>  [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
>  [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
>  [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
>  [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
>  [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
>  [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
>  [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
>  [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
>  [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
>  [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
>  [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
>  [<ffffffff810ac6be>] kthread+0x10e/0x120
>  [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
>  [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
>  [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
> 
> In order to demonstrace this issue one should mount ext4 with -odiscard option
> on SSD disk. This makes callback longer and race window becomes wider.
> 
> In order to fix this we should mark transaction as finished only after
> one of contidions are true:
> 1) Checkpoint lists are empty so we are shure what jbd2_journal_remove_checkpoint()
>    can not be called under our feets.
> 2) All callbacks have completed
  Nice catch! One minor comment below:

> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/jbd2/commit.c     |   28 +++++++++++++++++++++++-----
>  include/linux/jbd2.h |    1 +
>  2 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 750c701..31541f7 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -1134,7 +1134,7 @@ restart_loop:
>  	journal->j_stats.run.rs_blocks_logged += stats.run.rs_blocks_logged;
>  	spin_unlock(&journal->j_history_lock);
>  
> -	commit_transaction->t_state = T_FINISHED;
> +	commit_transaction->t_state = T_COMMIT_CALLBACK;
>  	J_ASSERT(commit_transaction == journal->j_committing_transaction);
>  	journal->j_commit_sequence = commit_transaction->t_tid;
>  	journal->j_committing_transaction = NULL;
> @@ -1149,12 +1149,20 @@ restart_loop:
>  				journal->j_average_commit_time*3) / 4;
>  	else
>  		journal->j_average_commit_time = commit_time;
> -	write_unlock(&journal->j_state_lock);
>  
> +	/*
> +	 * If checkpoint lists are empty it is safe to mark transaction as
> +	 * finished because no one will destroy it in under our feets.
> +	 */
>  	if (commit_transaction->t_checkpoint_list == NULL &&
>  	    commit_transaction->t_checkpoint_io_list == NULL) {
> -		__jbd2_journal_drop_transaction(journal, commit_transaction);
>  		to_free = 1;
> +		commit_transaction->t_state = T_FINISHED;
> +	}
> +	write_unlock(&journal->j_state_lock);
> +
> +	if (to_free) {
> +		__jbd2_journal_drop_transaction(journal, commit_transaction);
>  	} else {
>  		if (journal->j_checkpoint_transactions == NULL) {
>  			journal->j_checkpoint_transactions = commit_transaction;
> @@ -1179,8 +1187,18 @@ restart_loop:
>  	trace_jbd2_end_commit(journal, commit_transaction);
>  	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);
> -	if (to_free)
> -		jbd2_journal_free_transaction(commit_transaction);
>  
> +	write_lock(&journal->j_state_lock);
> +	spin_lock(&journal->j_list_lock);
> +	commit_transaction->t_state = T_FINISHED;
> +	/* Recheck checkpoint lists after j_list_lock was dropped */
> +	if (commit_transaction->t_checkpoint_list == NULL &&
> +	    commit_transaction->t_checkpoint_io_list == NULL) {
> +		if (!to_free)
> +			__jbd2_journal_drop_transaction(journal, commit_transaction);
> +		jbd2_journal_free_transaction(commit_transaction);
> +	}
> +	spin_unlock(&journal->j_list_lock);
> +	write_unlock(&journal->j_state_lock);
>  	wake_up(&journal->j_wait_done_commit);
>  }
  Hm, I'd just link the transaction into checkpoint lists when changing
state to T_COMMIT_CALLBACK and then handle all the dropping / freeing in
one place after setting state to T_FINISHED. It should be slightly easier
to read. Also please add a comment before the call of callback that we have
to call it there because once transaction is in T_FINISHED state,
checkpointing code can free it.

								Honza

> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 50e5a5e..608ea13 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -480,6 +480,7 @@ struct transaction_s
>  		T_COMMIT,
>  		T_COMMIT_DFLUSH,
>  		T_COMMIT_JFLUSH,
> +		T_COMMIT_CALLBACK,
>  		T_FINISHED
>  	}			t_state;
>  
> -- 
> 1.7.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] ext4: fix journal callback list traversal
  2013-03-22 16:18 ` [PATCH 2/3] ext4: fix journal callback list traversal Dmitry Monakhov
@ 2013-03-26 21:48   ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2013-03-26 21:48 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz

On Fri 22-03-13 20:18:42, Dmitry Monakhov wrote:
> It is incorrect to use list_for_each_entry_safe() for journal callback
> traversial because ->next may be removed by other task:
> ->ext4_mb_free_metadata()
>   ->ext4_mb_free_metadata()
>     ->ext4_journal_callback_del()
> 
> This result in following issue:
> 
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
> Hardware name:
> list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
> Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
> Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
> Call Trace:
>  [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
>  [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
>  [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
>  [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
>  [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
>  [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
>  [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
>  [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
>  [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
>  [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
>  [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
>  [<ffffffff810ac6be>] kthread+0x10e/0x120
>  [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
>  [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
>  [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
> 
> This patch fix the issue like follows:
> - ext4_journal_commit_callback() make list truly traversial safe
>   simply by always starting from list_head
> - fix race between two ext4_journal_callback_del() and
>   ext4_journal_callback_try_del()
  Nasty! The fix is correct just one style nit below. But feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>

> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/ext4/ext4_jbd2.h |    6 +++++-
>  fs/ext4/mballoc.c   |    8 ++++----
>  fs/ext4/super.c     |    4 +++-
>  3 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 4c216b1..aeed0ba 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -194,16 +194,20 @@ static inline void ext4_journal_callback_add(handle_t *handle,
>   * ext4_journal_callback_del: delete a registered callback
>   * @handle: active journal transaction handle on which callback was registered
>   * @jce: registered journal callback entry to unregister
> + * Return true if object was sucessfully removed
>   */
> -static inline void ext4_journal_callback_del(handle_t *handle,
> +static inline bool ext4_journal_callback_try_del(handle_t *handle,
>  					     struct ext4_journal_cb_entry *jce)
>  {
> +	bool deleted;
>  	struct ext4_sb_info *sbi =
>  			EXT4_SB(handle->h_transaction->t_journal->j_private);
>  
>  	spin_lock(&sbi->s_md_lock);
> +	deleted = !list_empty(&jce->jce_list);
>  	list_del_init(&jce->jce_list);
>  	spin_unlock(&sbi->s_md_lock);
> +	return deleted;
>  }
>  
>  int
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index ee6614b..2e5196a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4420,11 +4420,11 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
>  	node = rb_prev(new_node);
>  	if (node) {
>  		entry = rb_entry(node, struct ext4_free_data, efd_node);
> -		if (can_merge(entry, new_entry)) {
> +		if (can_merge(entry, new_entry) &&
> +			ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
>  			new_entry->efd_start_cluster = entry->efd_start_cluster;
>  			new_entry->efd_count += entry->efd_count;
>  			rb_erase(node, &(db->bb_free_root));
> -			ext4_journal_callback_del(handle, &entry->efd_jce);
>  			kmem_cache_free(ext4_free_data_cachep, entry);
  Ah, the indentation is evil here. It made me think you don't use the
return value of ext4_journal_callback_try_del() for a while. Please indent
it like:
		if (can_merge(entry, new_entry) &&
		    ext4_journal_callback_try_del(handle, > &entry->efd_jce)) {
			fooo

								Honza
>  		}
>  	}
> @@ -4432,10 +4432,10 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
>  	node = rb_next(new_node);
>  	if (node) {
>  		entry = rb_entry(node, struct ext4_free_data, efd_node);
> -		if (can_merge(new_entry, entry)) {
> +		if (can_merge(new_entry, entry) &&
> +			ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
>  			new_entry->efd_count += entry->efd_count;
>  			rb_erase(node, &(db->bb_free_root));
> -			ext4_journal_callback_del(handle, &entry->efd_jce);
>  			kmem_cache_free(ext4_free_data_cachep, entry);
>  		}
>  	}
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index d1ee6a8..c7e1509 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -352,7 +352,9 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
>  	struct ext4_journal_cb_entry	*jce, *tmp;
>  
>  	spin_lock(&sbi->s_md_lock);
> -	list_for_each_entry_safe(jce, tmp, &txn->t_private_list, jce_list) {
> +	while (!list_empty(&txn->t_private_list)) {
> +		jce = list_entry(txn->t_private_list.next,
> +				 struct ext4_journal_cb_entry, jce_list);
>  		list_del_init(&jce->jce_list);
>  		spin_unlock(&sbi->s_md_lock);
>  		jce->jce_func(sb, jce, error);
> -- 
> 1.7.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback
  2013-03-22 16:18 [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Dmitry Monakhov
                   ` (2 preceding siblings ...)
  2013-03-26 21:34 ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Jan Kara
@ 2013-03-27  2:56 ` Theodore Ts'o
  2013-03-27  9:22   ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Dmitry Monakhov
  3 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2013-03-27  2:56 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, jack, wenqing.lz

On Fri, Mar 22, 2013 at 08:18:41PM +0400, Dmitry Monakhov wrote:
> 1) Checkpoint lists are empty so we are shure what jbd2_journal_remove_checkpoint()
>    can not be called under our feets.

One spelling nit: s/shure/sure/

Otherwise, looks good!  I take it you'll be resending an updated set
of patch to deal with the other comments?

Thanks for finding this,

   	    	      	  	- Ted


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2
  2013-03-27  2:56 ` Theodore Ts'o
@ 2013-03-27  9:22   ` Dmitry Monakhov
  2013-03-27  9:22     ` [PATCH 2/3] ext4: fix journal callback list traversal V2 Dmitry Monakhov
                       ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Dmitry Monakhov @ 2013-03-27  9:22 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov

Following race is possible
[kjournald2]                              other_task
jbd2_journal_commit_transaction()
  j_state = T_FINISHED;
  spin_unlock(&journal->j_list_lock);
                                         ->jbd2_journal_remove_checkpoint()
					   ->jbd2_journal_free_transaction();
					     ->kmem_cache_free(transaction)
  ->j_commit_callback(journal, transaction);
    -> USE_AFTER_FREE

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

In order to demonstrace this issue one should mount ext4 with -odiscard option
on SSD disk. This makes callback longer and race window becomes wider.

In order to fix this we should mark transaction as finished only after
callbacks have completed

Changes since V1:
 - Simplify code-flow and add comments according to Jan's request

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/jbd2/commit.c     |   50 ++++++++++++++++++++++++++++----------------------
 include/linux/jbd2.h |    1 +
 2 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 750c701..1f1fd9c 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -382,7 +382,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int space_left = 0;
 	int first_tag = 0;
 	int tag_flag;
-	int i, to_free = 0;
+	int i;
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
@@ -1134,7 +1134,7 @@ restart_loop:
 	journal->j_stats.run.rs_blocks_logged += stats.run.rs_blocks_logged;
 	spin_unlock(&journal->j_history_lock);
 
-	commit_transaction->t_state = T_FINISHED;
+	commit_transaction->t_state = T_COMMIT_CALLBACK;
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
 	journal->j_committing_transaction = NULL;
@@ -1149,38 +1149,44 @@ restart_loop:
 				journal->j_average_commit_time*3) / 4;
 	else
 		journal->j_average_commit_time = commit_time;
+
 	write_unlock(&journal->j_state_lock);
 
-	if (commit_transaction->t_checkpoint_list == NULL &&
-	    commit_transaction->t_checkpoint_io_list == NULL) {
-		__jbd2_journal_drop_transaction(journal, commit_transaction);
-		to_free = 1;
+	if (journal->j_checkpoint_transactions == NULL) {
+		journal->j_checkpoint_transactions = commit_transaction;
+		commit_transaction->t_cpnext = commit_transaction;
+		commit_transaction->t_cpprev = commit_transaction;
 	} else {
-		if (journal->j_checkpoint_transactions == NULL) {
-			journal->j_checkpoint_transactions = commit_transaction;
-			commit_transaction->t_cpnext = commit_transaction;
-			commit_transaction->t_cpprev = commit_transaction;
-		} else {
-			commit_transaction->t_cpnext =
-				journal->j_checkpoint_transactions;
-			commit_transaction->t_cpprev =
-				commit_transaction->t_cpnext->t_cpprev;
-			commit_transaction->t_cpnext->t_cpprev =
-				commit_transaction;
-			commit_transaction->t_cpprev->t_cpnext =
+		commit_transaction->t_cpnext =
+			journal->j_checkpoint_transactions;
+		commit_transaction->t_cpprev =
+			commit_transaction->t_cpnext->t_cpprev;
+		commit_transaction->t_cpnext->t_cpprev =
+			commit_transaction;
+		commit_transaction->t_cpprev->t_cpnext =
 				commit_transaction;
-		}
 	}
 	spin_unlock(&journal->j_list_lock);
-
+	/* Drop all spin_locks because commit_callback may be block.
+	 * __journal_remove_checkpoint() can not destroy transaction
+	 * under us because it is marked as T_FINISHED yet */
 	if (journal->j_commit_callback)
 		journal->j_commit_callback(journal, commit_transaction);
 
 	trace_jbd2_end_commit(journal, commit_transaction);
 	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
-	if (to_free)
-		jbd2_journal_free_transaction(commit_transaction);
 
+	write_lock(&journal->j_state_lock);
+	spin_lock(&journal->j_list_lock);
+	commit_transaction->t_state = T_FINISHED;
+	/* Recheck checkpoint lists after j_list_lock was dropped */
+	if (commit_transaction->t_checkpoint_list == NULL &&
+	    commit_transaction->t_checkpoint_io_list == NULL) {
+		__jbd2_journal_drop_transaction(journal, commit_transaction);
+		jbd2_journal_free_transaction(commit_transaction);
+	}
+	spin_unlock(&journal->j_list_lock);
+	write_unlock(&journal->j_state_lock);
 	wake_up(&journal->j_wait_done_commit);
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 50e5a5e..608ea13 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -480,6 +480,7 @@ struct transaction_s
 		T_COMMIT,
 		T_COMMIT_DFLUSH,
 		T_COMMIT_JFLUSH,
+		T_COMMIT_CALLBACK,
 		T_FINISHED
 	}			t_state;
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/3] ext4: fix journal callback list traversal V2
  2013-03-27  9:22   ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Dmitry Monakhov
@ 2013-03-27  9:22     ` Dmitry Monakhov
  2013-03-27 14:16       ` Theodore Ts'o
  2013-03-27  9:22     ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2013-03-27  9:22 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov

It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
  ->ext4_mb_free_metadata()
    ->ext4_journal_callback_del()

This result in following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fix the issue like follows:
- ext4_journal_commit_callback() make list truly traversial safe
  simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
  ext4_journal_callback_try_del()

Changes since V1:
 Indention fixes accordion to Jan's request

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/ext4_jbd2.h |    6 +++++-
 fs/ext4/mballoc.c   |    8 ++++----
 fs/ext4/super.c     |    7 +++++--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 4c216b1..aeed0ba 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -194,16 +194,20 @@ static inline void ext4_journal_callback_add(handle_t *handle,
  * ext4_journal_callback_del: delete a registered callback
  * @handle: active journal transaction handle on which callback was registered
  * @jce: registered journal callback entry to unregister
+ * Return true if object was sucessfully removed
  */
-static inline void ext4_journal_callback_del(handle_t *handle,
+static inline bool ext4_journal_callback_try_del(handle_t *handle,
 					     struct ext4_journal_cb_entry *jce)
 {
+	bool deleted;
 	struct ext4_sb_info *sbi =
 			EXT4_SB(handle->h_transaction->t_journal->j_private);
 
 	spin_lock(&sbi->s_md_lock);
+	deleted = !list_empty(&jce->jce_list);
 	list_del_init(&jce->jce_list);
 	spin_unlock(&sbi->s_md_lock);
+	return deleted;
 }
 
 int
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ee6614b..cf3025c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4420,11 +4420,11 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	node = rb_prev(new_node);
 	if (node) {
 		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		if (can_merge(entry, new_entry)) {
+		if (can_merge(entry, new_entry) &&
+		    ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
 			new_entry->efd_start_cluster = entry->efd_start_cluster;
 			new_entry->efd_count += entry->efd_count;
 			rb_erase(node, &(db->bb_free_root));
-			ext4_journal_callback_del(handle, &entry->efd_jce);
 			kmem_cache_free(ext4_free_data_cachep, entry);
 		}
 	}
@@ -4432,10 +4432,10 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	node = rb_next(new_node);
 	if (node) {
 		entry = rb_entry(node, struct ext4_free_data, efd_node);
-		if (can_merge(new_entry, entry)) {
+		if (can_merge(new_entry, entry) &&
+		    ext4_journal_callback_try_del(handle, &entry->efd_jce)) {
 			new_entry->efd_count += entry->efd_count;
 			rb_erase(node, &(db->bb_free_root));
-			ext4_journal_callback_del(handle, &entry->efd_jce);
 			kmem_cache_free(ext4_free_data_cachep, entry);
 		}
 	}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d1ee6a8..305aa18 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -349,10 +349,13 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
 	struct super_block		*sb = journal->j_private;
 	struct ext4_sb_info		*sbi = EXT4_SB(sb);
 	int				error = is_journal_aborted(journal);
-	struct ext4_journal_cb_entry	*jce, *tmp;
+	struct ext4_journal_cb_entry	*jce;
 
+	BUG_ON(txn->t_state == T_FINISHED);
 	spin_lock(&sbi->s_md_lock);
-	list_for_each_entry_safe(jce, tmp, &txn->t_private_list, jce_list) {
+	while (!list_empty(&txn->t_private_list)) {
+		jce = list_entry(txn->t_private_list.next,
+				 struct ext4_journal_cb_entry, jce_list);
 		list_del_init(&jce->jce_list);
 		spin_unlock(&sbi->s_md_lock);
 		jce->jce_func(sb, jce, error);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/3] ext4: undegister es_shrinker if mount failed
  2013-03-27  9:22   ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Dmitry Monakhov
  2013-03-27  9:22     ` [PATCH 2/3] ext4: fix journal callback list traversal V2 Dmitry Monakhov
@ 2013-03-27  9:22     ` Dmitry Monakhov
  2013-03-27 14:17       ` Theodore Ts'o
  2013-03-27 14:16     ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Theodore Ts'o
  2013-03-27 14:32     ` Jan Kara
  3 siblings, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2013-03-27  9:22 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jack, wenqing.lz, Dmitry Monakhov

Otherwise destroyed ext_sb_info will be part of global shinker list and result in
following OOPS:

JBD2: corrupted journal superblock
JBD2: recovery failed
EXT4-fs (dm-2): error loading journal
general protection fault: 0000 [#1] SMP
Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
mod
CPU 1
Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
Stack:
ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
Call Trace:
[<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
[<ffffffff81249381>] mount_bdev+0x331/0x340
[<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
[<ffffffff81362035>] ext4_mount+0x15/0x20
[<ffffffff8124869a>] mount_fs+0x9a/0x2e0
[<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
[<ffffffff81279c02>] do_new_mount+0x172/0x2e0
[<ffffffff8127aa56>] do_mount+0x376/0x380
[<ffffffff8127ab98>] sys_mount+0x138/0x150
[<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
8
RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP <ffff88011d5cbcd8>

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/super.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 305aa18..b66afdc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4009,6 +4009,7 @@ failed_mount_wq:
 		sbi->s_journal = NULL;
 	}
 failed_mount3:
+	ext4_es_unregister_shrinker(sb);
 	del_timer(&sbi->s_err_report);
 	if (sbi->s_flex_groups)
 		ext4_kvfree(sbi->s_flex_groups);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2
  2013-03-27  9:22   ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Dmitry Monakhov
  2013-03-27  9:22     ` [PATCH 2/3] ext4: fix journal callback list traversal V2 Dmitry Monakhov
  2013-03-27  9:22     ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
@ 2013-03-27 14:16     ` Theodore Ts'o
  2013-03-27 14:32     ` Jan Kara
  3 siblings, 0 replies; 16+ messages in thread
From: Theodore Ts'o @ 2013-03-27 14:16 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, jack, wenqing.lz

Thanks, applied on the dev branch for testing.

	       	      	  	 - Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] ext4: fix journal callback list traversal V2
  2013-03-27  9:22     ` [PATCH 2/3] ext4: fix journal callback list traversal V2 Dmitry Monakhov
@ 2013-03-27 14:16       ` Theodore Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Theodore Ts'o @ 2013-03-27 14:16 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, jack, wenqing.lz

Thanks, applied on the dev branch for testing.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] ext4: undegister es_shrinker if mount failed
  2013-03-27  9:22     ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
@ 2013-03-27 14:17       ` Theodore Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Theodore Ts'o @ 2013-03-27 14:17 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, jack, wenqing.lz

Thanks, applied on the dev branch for testing.

		       	   	  - Ted


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2
  2013-03-27  9:22   ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Dmitry Monakhov
                       ` (2 preceding siblings ...)
  2013-03-27 14:16     ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Theodore Ts'o
@ 2013-03-27 14:32     ` Jan Kara
  2013-03-27 14:49       ` Theodore Ts'o
  3 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2013-03-27 14:32 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-ext4, tytso, jack, wenqing.lz

On Wed 27-03-13 13:22:42, Dmitry Monakhov wrote:
> Following race is possible
> [kjournald2]                              other_task
> jbd2_journal_commit_transaction()
>   j_state = T_FINISHED;
>   spin_unlock(&journal->j_list_lock);
>                                          ->jbd2_journal_remove_checkpoint()
> 					   ->jbd2_journal_free_transaction();
> 					     ->kmem_cache_free(transaction)
>   ->j_commit_callback(journal, transaction);
>     -> USE_AFTER_FREE
> 
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
> Hardware name:
> list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
> Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
> Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
> Call Trace:
>  [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
>  [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
>  [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
>  [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
>  [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
>  [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
>  [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
>  [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
>  [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
>  [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
>  [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
>  [<ffffffff810ac6be>] kthread+0x10e/0x120
>  [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
>  [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
>  [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
> 
> In order to demonstrace this issue one should mount ext4 with -odiscard option
> on SSD disk. This makes callback longer and race window becomes wider.
> 
> In order to fix this we should mark transaction as finished only after
> callbacks have completed
> 
> Changes since V1:
>  - Simplify code-flow and add comments according to Jan's request
  Looks good. Just one text correction below - Ted can you apply it please?
...
> -
> +	/* Drop all spin_locks because commit_callback may be block.
> +	 * __journal_remove_checkpoint() can not destroy transaction
> +	 * under us because it is marked as T_FINISHED yet */
                               ^^^ is *not*

>  	if (journal->j_commit_callback)
>  		journal->j_commit_callback(journal, commit_transaction);
>  
>  	trace_jbd2_end_commit(journal, commit_transaction);
>  	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2
  2013-03-27 14:32     ` Jan Kara
@ 2013-03-27 14:49       ` Theodore Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Theodore Ts'o @ 2013-03-27 14:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dmitry Monakhov, linux-ext4, wenqing.lz

On Wed, Mar 27, 2013 at 03:32:37PM +0100, Jan Kara wrote:
> > Changes since V1:
> >  - Simplify code-flow and add comments according to Jan's request
>   Looks good. Just one text correction below - Ted can you apply it please?
> ...
> > -
> > +	/* Drop all spin_locks because commit_callback may be block.
> > +	 * __journal_remove_checkpoint() can not destroy transaction
> > +	 * under us because it is marked as T_FINISHED yet */
>                                ^^^ is *not*

Nice catch.  Thanks, I've fixed up the comment.

					- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] ext4: undegister es_shrinker if mount failed
  2013-03-23  2:31   ` Zheng Liu
@ 2013-04-01 14:49     ` Theodore Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Theodore Ts'o @ 2013-04-01 14:49 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Dmitry Monakhov, linux-ext4, jack, wenqing.lz

> From 596eee1f5caf23084d9b43c47b327206bd8c7ee5 Mon Sep 17 00:00:00 2001
> From: Zheng Liu <wenqing.lz@taobao.com>
> Date: Sat, 23 Mar 2013 10:26:55 +0800
> Subject: [PATCH] register es shrinker before initializing per-cpu counters
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>

I've merged this change into the version of Dmitry's patch which is in
the ext4 tree.

						- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2013-04-01 14:50 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-22 16:18 [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Dmitry Monakhov
2013-03-22 16:18 ` [PATCH 2/3] ext4: fix journal callback list traversal Dmitry Monakhov
2013-03-26 21:48   ` Jan Kara
2013-03-22 16:18 ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
2013-03-23  2:31   ` Zheng Liu
2013-04-01 14:49     ` Theodore Ts'o
2013-03-26 21:34 ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback Jan Kara
2013-03-27  2:56 ` Theodore Ts'o
2013-03-27  9:22   ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Dmitry Monakhov
2013-03-27  9:22     ` [PATCH 2/3] ext4: fix journal callback list traversal V2 Dmitry Monakhov
2013-03-27 14:16       ` Theodore Ts'o
2013-03-27  9:22     ` [PATCH 3/3] ext4: undegister es_shrinker if mount failed Dmitry Monakhov
2013-03-27 14:17       ` Theodore Ts'o
2013-03-27 14:16     ` [PATCH 1/3] jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback V2 Theodore Ts'o
2013-03-27 14:32     ` Jan Kara
2013-03-27 14:49       ` Theodore Ts'o

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.