* [Cluster-devel] [PATCH 0/2] GFS2 counting fixes
@ 2019-01-31 10:55 Ross Lagerwall
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative Ross Lagerwall
  0 siblings, 2 replies; 21+ messages in thread
From: Ross Lagerwall @ 2019-01-31 10:55 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Here are a couple of fixes for GFS2 (ref-)counting going wrong.

Ross Lagerwall (2):
  gfs2: Fix occasional glock use-after-free
  gfs2: Fix lru_count going negative

 fs/gfs2/aops.c    |  3 +--
 fs/gfs2/glock.c   | 16 +++++++++-------
 fs/gfs2/lops.c    |  2 +-
 fs/gfs2/meta_io.c |  2 +-
 fs/gfs2/trans.c   |  9 ++++++++-
 fs/gfs2/trans.h   |  2 ++
 6 files changed, 22 insertions(+), 12 deletions(-)

-- 
2.17.2




* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-01-31 10:55 [Cluster-devel] [PATCH 0/2] GFS2 counting fixes Ross Lagerwall
@ 2019-01-31 10:55 ` Ross Lagerwall
  2019-01-31 11:23   ` Steven Whitehouse
                     ` (2 more replies)
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative Ross Lagerwall
  1 sibling, 3 replies; 21+ messages in thread
From: Ross Lagerwall @ 2019-01-31 10:55 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Each gfs2_bufdata stores a reference to a glock but the reference count
isn't incremented. This causes an occasional use-after-free of the
glock. Fix by taking a reference on the glock during allocation and
dropping it when freeing.

Found by KASAN:

BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371

CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
Workqueue: glock_workqueue glock_work_func [gfs2]
Call Trace:
 dump_stack+0x71/0xab
 print_address_description+0x6a/0x270
 kasan_report+0x258/0x380
 ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
 revoke_lo_after_commit+0x8e/0xe0 [gfs2]
 gfs2_log_flush+0x511/0xa70 [gfs2]
 ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
 ? __brelse+0x48/0x50
 ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
 ? gfs2_trans_end+0x18d/0x340 [gfs2]
 gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
 ? inode_go_dump+0xe0/0xe0 [gfs2]
 ? inode_go_sync+0xe4/0x220 [gfs2]
 inode_go_sync+0xe4/0x220 [gfs2]
 do_xmote+0x12b/0x290 [gfs2]
 glock_work_func+0x6f/0x160 [gfs2]
 process_one_work+0x461/0x790
 worker_thread+0x69/0x6b0
 ? process_one_work+0x790/0x790
 kthread+0x1ae/0x1d0
 ? kthread_create_worker_on_cpu+0xc0/0xc0
 ret_from_fork+0x22/0x40

Allocated by task 20805:
 kasan_kmalloc+0xa0/0xd0
 kmem_cache_alloc+0xb5/0x1b0
 gfs2_glock_get+0x14b/0x620 [gfs2]
 gfs2_inode_lookup+0x20c/0x640 [gfs2]
 gfs2_dir_search+0x150/0x180 [gfs2]
 gfs2_lookupi+0x272/0x360 [gfs2]
 __gfs2_lookup+0x8b/0x1d0 [gfs2]
 gfs2_atomic_open+0x77/0x100 [gfs2]
 path_openat+0x1454/0x1c10
 do_filp_open+0x124/0x1d0
 do_sys_open+0x213/0x2c0
 do_syscall_64+0x69/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 0:
 __kasan_slab_free+0x130/0x180
 kmem_cache_free+0x78/0x1e0
 rcu_process_callbacks+0x2ad/0x6c0
 __do_softirq+0x111/0x38c

The buggy address belongs to the object at ffff88801aff6040
 which belongs to the cache gfs2_glock(aspace) of size 560
The buggy address is located 244 bytes inside of
 560-byte region [ffff88801aff6040, ffff88801aff6270)
...

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 fs/gfs2/aops.c    | 3 +--
 fs/gfs2/lops.c    | 2 +-
 fs/gfs2/meta_io.c | 2 +-
 fs/gfs2/trans.c   | 9 ++++++++-
 fs/gfs2/trans.h   | 2 ++
 5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..8c2b572a7fb1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -868,9 +868,8 @@ int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
 			gfs2_assert_warn(sdp, bd->bd_bh == bh);
 			if (!list_empty(&bd->bd_list))
 				list_del_init(&bd->bd_list);
-			bd->bd_bh = NULL;
 			bh->b_private = NULL;
-			kmem_cache_free(gfs2_bufdata_cachep, bd);
+			gfs2_free_bufdata(bd);
 		}
 
 		bh = bh->b_this_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 94dcab655bc0..f40be71677d1 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -847,7 +847,7 @@ static void revoke_lo_after_commit(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 		gl = bd->bd_gl;
 		atomic_dec(&gl->gl_revokes);
 		clear_bit(GLF_LFLUSH, &gl->gl_flags);
-		kmem_cache_free(gfs2_bufdata_cachep, bd);
+		gfs2_free_bufdata(bd);
 	}
 }
 
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index be9c0bf697fe..868caa0eb104 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -355,7 +355,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int meta)
 			gfs2_trans_add_revoke(sdp, bd);
 		} else if (was_pinned) {
 			bh->b_private = NULL;
-			kmem_cache_free(gfs2_bufdata_cachep, bd);
+			gfs2_free_bufdata(bd);
 		}
 		spin_unlock(&sdp->sd_ail_lock);
 	}
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index cd9a94a6b5bb..423cbee8fa08 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -133,9 +133,16 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
 	bd->bd_gl = gl;
 	INIT_LIST_HEAD(&bd->bd_list);
 	bh->b_private = bd;
+	gfs2_glock_hold(gl);
 	return bd;
 }
 
+void gfs2_free_bufdata(struct gfs2_bufdata *bd)
+{
+	gfs2_glock_put(bd->bd_gl);
+	kmem_cache_free(gfs2_bufdata_cachep, bd);
+}
+
 /**
  * gfs2_trans_add_data - Add a databuf to the transaction.
  * @gl: The inode glock associated with the buffer
@@ -265,7 +272,7 @@ void gfs2_trans_add_unrevoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len)
 			list_del_init(&bd->bd_list);
 			gfs2_assert_withdraw(sdp, sdp->sd_log_num_revoke);
 			sdp->sd_log_num_revoke--;
-			kmem_cache_free(gfs2_bufdata_cachep, bd);
+			gfs2_free_bufdata(bd);
 			tr->tr_num_revoke_rm++;
 			if (--n == 0)
 				break;
diff --git a/fs/gfs2/trans.h b/fs/gfs2/trans.h
index ad70087d0597..276ddca7bbe9 100644
--- a/fs/gfs2/trans.h
+++ b/fs/gfs2/trans.h
@@ -46,4 +46,6 @@ extern void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh);
 extern void gfs2_trans_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd);
 extern void gfs2_trans_add_unrevoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len);
 
+void gfs2_free_bufdata(struct gfs2_bufdata *bd);
+
 #endif /* __TRANS_DOT_H__ */
-- 
2.17.2




* [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative
  2019-01-31 10:55 [Cluster-devel] [PATCH 0/2] GFS2 counting fixes Ross Lagerwall
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
@ 2019-01-31 10:55 ` Ross Lagerwall
  2019-01-31 11:21   ` Steven Whitehouse
                     ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Ross Lagerwall @ 2019-01-31 10:55 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Under certain conditions, lru_count may drop below zero resulting in
a large amount of log spam like this:

vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
    negative objects to delete nr=-1

This happens as follows:
1) A glock is moved from lru_list to the dispose list and lru_count is
   decremented.
2) The dispose function calls cond_resched() and drops the lru lock.
3) Another thread takes the lru lock and tries to add the same glock to
   lru_list, checking if the glock is on an lru list.
4) It is on a list (actually the dispose list) and so it avoids
   incrementing lru_count.
5) The glock is moved to lru_list.
6) The original thread doesn't dispose it because it has been re-added
   to the lru list, but lru_count has still decreased by one.

Fix by checking whether the LRU flag is set on the glock rather than
whether the glock is on some list, and by rearranging the code so that
the LRU flag is set/cleared precisely when the glock is added to/removed
from lru_list.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 fs/gfs2/glock.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b92740edc416..53e6c7e0c1b3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
 {
 	spin_lock(&lru_lock);
 
-	if (!list_empty(&gl->gl_lru))
-		list_del_init(&gl->gl_lru);
-	else
+	list_del(&gl->gl_lru);
+	list_add_tail(&gl->gl_lru, &lru_list);
+
+	if (!test_bit(GLF_LRU, &gl->gl_flags)) {
+		set_bit(GLF_LRU, &gl->gl_flags);
 		atomic_inc(&lru_count);
+	}
 
-	list_add_tail(&gl->gl_lru, &lru_list);
-	set_bit(GLF_LRU, &gl->gl_flags);
 	spin_unlock(&lru_lock);
 }
 
@@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
 		return;
 
 	spin_lock(&lru_lock);
-	if (!list_empty(&gl->gl_lru)) {
+	if (test_bit(GLF_LRU, &gl->gl_flags)) {
 		list_del_init(&gl->gl_lru);
 		atomic_dec(&lru_count);
 		clear_bit(GLF_LRU, &gl->gl_flags);
@@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
 		if (!spin_trylock(&gl->gl_lockref.lock)) {
 add_back_to_lru:
 			list_add(&gl->gl_lru, &lru_list);
+			set_bit(GLF_LRU, &gl->gl_flags);
 			atomic_inc(&lru_count);
 			continue;
 		}
@@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
 			spin_unlock(&gl->gl_lockref.lock);
 			goto add_back_to_lru;
 		}
-		clear_bit(GLF_LRU, &gl->gl_flags);
 		gl->gl_lockref.count++;
 		if (demote_ok(gl))
 			handle_callback(gl, LM_ST_UNLOCKED, 0, false);
@@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
 		if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
 			list_move(&gl->gl_lru, &dispose);
 			atomic_dec(&lru_count);
+			clear_bit(GLF_LRU, &gl->gl_flags);
 			freed++;
 			continue;
 		}
-- 
2.17.2




* [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative Ross Lagerwall
@ 2019-01-31 11:21   ` Steven Whitehouse
  2019-01-31 14:36   ` Bob Peterson
  2019-01-31 18:32   ` Andreas Gruenbacher
  2 siblings, 0 replies; 21+ messages in thread
From: Steven Whitehouse @ 2019-01-31 11:21 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 31/01/2019 10:55, Ross Lagerwall wrote:
> Under certain conditions, lru_count may drop below zero resulting in
> a large amount of log spam like this:
>
> vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
>      negative objects to delete nr=-1
>
> This happens as follows:
> 1) A glock is moved from lru_list to the dispose list and lru_count is
>     decremented.
> 2) The dispose function calls cond_resched() and drops the lru lock.
> 3) Another thread takes the lru lock and tries to add the same glock to
>     lru_list, checking if the glock is on an lru list.
> 4) It is on a list (actually the dispose list) and so it avoids
>     incrementing lru_count.
> 5) The glock is moved to lru_list.
> 6) The original thread doesn't dispose it because it has been re-added
>     to the lru list, but lru_count has still decreased by one.
>
> Fix by checking whether the LRU flag is set on the glock rather than
> whether the glock is on some list, and by rearranging the code so that
> the LRU flag is set/cleared precisely when the glock is added to/removed
> from lru_list.
>
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>

I'm glad we've got to the bottom of that one. Excellent work debugging 
that! Many thanks for the fix,

Steve.


> ---
>   fs/gfs2/glock.c | 16 +++++++++-------
>   1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index b92740edc416..53e6c7e0c1b3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
>   {
>   	spin_lock(&lru_lock);
>   
> -	if (!list_empty(&gl->gl_lru))
> -		list_del_init(&gl->gl_lru);
> -	else
> +	list_del(&gl->gl_lru);
> +	list_add_tail(&gl->gl_lru, &lru_list);
> +
> +	if (!test_bit(GLF_LRU, &gl->gl_flags)) {
> +		set_bit(GLF_LRU, &gl->gl_flags);
>   		atomic_inc(&lru_count);
> +	}
>   
> -	list_add_tail(&gl->gl_lru, &lru_list);
> -	set_bit(GLF_LRU, &gl->gl_flags);
>   	spin_unlock(&lru_lock);
>   }
>   
> @@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
>   		return;
>   
>   	spin_lock(&lru_lock);
> -	if (!list_empty(&gl->gl_lru)) {
> +	if (test_bit(GLF_LRU, &gl->gl_flags)) {
>   		list_del_init(&gl->gl_lru);
>   		atomic_dec(&lru_count);
>   		clear_bit(GLF_LRU, &gl->gl_flags);
> @@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
>   		if (!spin_trylock(&gl->gl_lockref.lock)) {
>   add_back_to_lru:
>   			list_add(&gl->gl_lru, &lru_list);
> +			set_bit(GLF_LRU, &gl->gl_flags);
>   			atomic_inc(&lru_count);
>   			continue;
>   		}
> @@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
>   			spin_unlock(&gl->gl_lockref.lock);
>   			goto add_back_to_lru;
>   		}
> -		clear_bit(GLF_LRU, &gl->gl_flags);
>   		gl->gl_lockref.count++;
>   		if (demote_ok(gl))
>   			handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> @@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
>   		if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
>   			list_move(&gl->gl_lru, &dispose);
>   			atomic_dec(&lru_count);
> +			clear_bit(GLF_LRU, &gl->gl_flags);
>   			freed++;
>   			continue;
>   		}




* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
@ 2019-01-31 11:23   ` Steven Whitehouse
  2019-01-31 14:40   ` Bob Peterson
  2019-01-31 17:18   ` Andreas Gruenbacher
  2 siblings, 0 replies; 21+ messages in thread
From: Steven Whitehouse @ 2019-01-31 11:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 31/01/2019 10:55, Ross Lagerwall wrote:
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.

Another good bit of debugging. It would be nice if we could (longer term)
avoid using the ref count though, since that will add some overhead,
but for the time being, correctness is the important thing,

Steve.

>
> Found by KASAN:
>
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371
>
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>   dump_stack+0x71/0xab
>   print_address_description+0x6a/0x270
>   kasan_report+0x258/0x380
>   ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>   revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>   gfs2_log_flush+0x511/0xa70 [gfs2]
>   ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>   ? __brelse+0x48/0x50
>   ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>   ? gfs2_trans_end+0x18d/0x340 [gfs2]
>   gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>   ? inode_go_dump+0xe0/0xe0 [gfs2]
>   ? inode_go_sync+0xe4/0x220 [gfs2]
>   inode_go_sync+0xe4/0x220 [gfs2]
>   do_xmote+0x12b/0x290 [gfs2]
>   glock_work_func+0x6f/0x160 [gfs2]
>   process_one_work+0x461/0x790
>   worker_thread+0x69/0x6b0
>   ? process_one_work+0x790/0x790
>   kthread+0x1ae/0x1d0
>   ? kthread_create_worker_on_cpu+0xc0/0xc0
>   ret_from_fork+0x22/0x40
>
> Allocated by task 20805:
>   kasan_kmalloc+0xa0/0xd0
>   kmem_cache_alloc+0xb5/0x1b0
>   gfs2_glock_get+0x14b/0x620 [gfs2]
>   gfs2_inode_lookup+0x20c/0x640 [gfs2]
>   gfs2_dir_search+0x150/0x180 [gfs2]
>   gfs2_lookupi+0x272/0x360 [gfs2]
>   __gfs2_lookup+0x8b/0x1d0 [gfs2]
>   gfs2_atomic_open+0x77/0x100 [gfs2]
>   path_openat+0x1454/0x1c10
>   do_filp_open+0x124/0x1d0
>   do_sys_open+0x213/0x2c0
>   do_syscall_64+0x69/0x160
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Freed by task 0:
>   __kasan_slab_free+0x130/0x180
>   kmem_cache_free+0x78/0x1e0
>   rcu_process_callbacks+0x2ad/0x6c0
>   __do_softirq+0x111/0x38c
>
> The buggy address belongs to the object at ffff88801aff6040
>   which belongs to the cache gfs2_glock(aspace) of size 560
> The buggy address is located 244 bytes inside of
>   560-byte region [ffff88801aff6040, ffff88801aff6270)
> ...
>
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>   fs/gfs2/aops.c    | 3 +--
>   fs/gfs2/lops.c    | 2 +-
>   fs/gfs2/meta_io.c | 2 +-
>   fs/gfs2/trans.c   | 9 ++++++++-
>   fs/gfs2/trans.h   | 2 ++
>   5 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 05dd78f4b2b3..8c2b572a7fb1 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -868,9 +868,8 @@ int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
>   			gfs2_assert_warn(sdp, bd->bd_bh == bh);
>   			if (!list_empty(&bd->bd_list))
>   				list_del_init(&bd->bd_list);
> -			bd->bd_bh = NULL;
>   			bh->b_private = NULL;
> -			kmem_cache_free(gfs2_bufdata_cachep, bd);
> +			gfs2_free_bufdata(bd);
>   		}
>   
>   		bh = bh->b_this_page;
> diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
> index 94dcab655bc0..f40be71677d1 100644
> --- a/fs/gfs2/lops.c
> +++ b/fs/gfs2/lops.c
> @@ -847,7 +847,7 @@ static void revoke_lo_after_commit(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
>   		gl = bd->bd_gl;
>   		atomic_dec(&gl->gl_revokes);
>   		clear_bit(GLF_LFLUSH, &gl->gl_flags);
> -		kmem_cache_free(gfs2_bufdata_cachep, bd);
> +		gfs2_free_bufdata(bd);
>   	}
>   }
>   
> diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
> index be9c0bf697fe..868caa0eb104 100644
> --- a/fs/gfs2/meta_io.c
> +++ b/fs/gfs2/meta_io.c
> @@ -355,7 +355,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int meta)
>   			gfs2_trans_add_revoke(sdp, bd);
>   		} else if (was_pinned) {
>   			bh->b_private = NULL;
> -			kmem_cache_free(gfs2_bufdata_cachep, bd);
> +			gfs2_free_bufdata(bd);
>   		}
>   		spin_unlock(&sdp->sd_ail_lock);
>   	}
> diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
> index cd9a94a6b5bb..423cbee8fa08 100644
> --- a/fs/gfs2/trans.c
> +++ b/fs/gfs2/trans.c
> @@ -133,9 +133,16 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
>   	bd->bd_gl = gl;
>   	INIT_LIST_HEAD(&bd->bd_list);
>   	bh->b_private = bd;
> +	gfs2_glock_hold(gl);
>   	return bd;
>   }
>   
> +void gfs2_free_bufdata(struct gfs2_bufdata *bd)
> +{
> +	gfs2_glock_put(bd->bd_gl);
> +	kmem_cache_free(gfs2_bufdata_cachep, bd);
> +}
> +
>   /**
>    * gfs2_trans_add_data - Add a databuf to the transaction.
>    * @gl: The inode glock associated with the buffer
> @@ -265,7 +272,7 @@ void gfs2_trans_add_unrevoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len)
>   			list_del_init(&bd->bd_list);
>   			gfs2_assert_withdraw(sdp, sdp->sd_log_num_revoke);
>   			sdp->sd_log_num_revoke--;
> -			kmem_cache_free(gfs2_bufdata_cachep, bd);
> +			gfs2_free_bufdata(bd);
>   			tr->tr_num_revoke_rm++;
>   			if (--n == 0)
>   				break;
> diff --git a/fs/gfs2/trans.h b/fs/gfs2/trans.h
> index ad70087d0597..276ddca7bbe9 100644
> --- a/fs/gfs2/trans.h
> +++ b/fs/gfs2/trans.h
> @@ -46,4 +46,6 @@ extern void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh);
>   extern void gfs2_trans_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd);
>   extern void gfs2_trans_add_unrevoke(struct gfs2_sbd *sdp, u64 blkno, unsigned int len);
>   
> +void gfs2_free_bufdata(struct gfs2_bufdata *bd);
> +
>   #endif /* __TRANS_DOT_H__ */




* [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative Ross Lagerwall
  2019-01-31 11:21   ` Steven Whitehouse
@ 2019-01-31 14:36   ` Bob Peterson
  2019-01-31 15:04     ` Bob Peterson
  2019-01-31 15:23     ` Ross Lagerwall
  2019-01-31 18:32   ` Andreas Gruenbacher
  2 siblings, 2 replies; 21+ messages in thread
From: Bob Peterson @ 2019-01-31 14:36 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

Comments below. Sorry if this is a bit incoherent; it's early and I'm
not properly caffeinated yet.

----- Original Message -----
> Under certain conditions, lru_count may drop below zero resulting in
> a large amount of log spam like this:
> 
> vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
>     negative objects to delete nr=-1
> 
> This happens as follows:
> 1) A glock is moved from lru_list to the dispose list and lru_count is
>    decremented.
> 2) The dispose function calls cond_resched() and drops the lru lock.
> 3) Another thread takes the lru lock and tries to add the same glock to
>    lru_list, checking if the glock is on an lru list.
> 4) It is on a list (actually the dispose list) and so it avoids
>    incrementing lru_count.
> 5) The glock is moved to lru_list.
> 6) The original thread doesn't dispose it because it has been re-added
>    to the lru list, but lru_count has still decreased by one.
> 
> Fix by checking whether the LRU flag is set on the glock rather than
> whether the glock is on some list, and by rearranging the code so that
> the LRU flag is set/cleared precisely when the glock is added to/removed
> from lru_list.
> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  fs/gfs2/glock.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index b92740edc416..53e6c7e0c1b3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
>  {
>  	spin_lock(&lru_lock);
>  
> -	if (!list_empty(&gl->gl_lru))
> -		list_del_init(&gl->gl_lru);
> -	else
> +	list_del(&gl->gl_lru);
> +	list_add_tail(&gl->gl_lru, &lru_list);

This looks like a bug, and I like your idea of using the GLF_LRU bit
to determine whether or not to do the manipulation, but I have some
concerns. First, does it work with kernel list debugging turned on?

To me it looks like the list_del (as opposed to list_del_init) above
will set entry->next and prev to LIST_POISON values, then the
list_add_tail() calls __list_add() which checks:
	if (!__list_add_valid(new, prev, next))
		return;
Without list debugging, __list_add_valid() always returns true, but with
list debugging it checks for circular values of list->prev and list->next
which, since they're LIST_POISON, ought to fail.
So it seems like the original list_del_init is correct.
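
For reference, the two helpers differ roughly like this (paraphrasing
include/linux/list.h; the details vary a bit between kernel versions):

	static inline void list_del(struct list_head *entry)
	{
		__list_del_entry(entry);
		/* poisoned so that a stale re-use gets caught */
		entry->next = LIST_POISON1;
		entry->prev = LIST_POISON2;
	}

	static inline void list_del_init(struct list_head *entry)
	{
		__list_del_entry(entry);
		/* re-initialized so the entry can safely be re-added */
		INIT_LIST_HEAD(entry);
	}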

The intent was: if the glock is already on the lru, take it off
before re-adding it, and the count ought to be okay, because if it's
on the LRU list, it's already been incremented. So taking it off and
adding it back on is a net 0 on the count. But that's only
true if the GLF_LRU bit is set. If it's on a different list (the
dispose list), as you noted, it still needs to be incremented.

If the glock is on the dispose_list, rather than the lru list, we
want to take it off the dispose list and move it to the lru_list,
but in that case, we need to increment the lru count, and not
poison the list_head.

So to me it seems like we should keep the list_del_init, and only
do it if the list isn't empty, but trigger off the GLF_LRU flag
for managing the count. The lru_lock ought to prevent races.
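
In other words, something like this (an untested sketch of what I mean):

	spin_lock(&lru_lock);
	if (!list_empty(&gl->gl_lru))
		list_del_init(&gl->gl_lru);
	if (!test_bit(GLF_LRU, &gl->gl_flags)) {
		set_bit(GLF_LRU, &gl->gl_flags);
		atomic_inc(&lru_count);
	}
	list_add_tail(&gl->gl_lru, &lru_list);
	spin_unlock(&lru_lock);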

> +
> +	if (!test_bit(GLF_LRU, &gl->gl_flags)) {
> +		set_bit(GLF_LRU, &gl->gl_flags);
>  		atomic_inc(&lru_count);
> +	}

The above may be simplified to something like:
+	if (!test_and_set_bit(GLF_LRU, &gl->gl_flags))
 		atomic_inc(&lru_count);

>  
> -	list_add_tail(&gl->gl_lru, &lru_list);
> -	set_bit(GLF_LRU, &gl->gl_flags);
>  	spin_unlock(&lru_lock);
>  }
>  
> @@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock
> *gl)
>  		return;
>  
>  	spin_lock(&lru_lock);
> -	if (!list_empty(&gl->gl_lru)) {
> +	if (test_bit(GLF_LRU, &gl->gl_flags)) {
>  		list_del_init(&gl->gl_lru);
>  		atomic_dec(&lru_count);
>  		clear_bit(GLF_LRU, &gl->gl_flags);

Here again, we could simplify with test_and_clear_bit above.
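
That is, roughly (untested):

	if (test_and_clear_bit(GLF_LRU, &gl->gl_flags)) {
		list_del_init(&gl->gl_lru);
		atomic_dec(&lru_count);
	}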

> @@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
>  		if (!spin_trylock(&gl->gl_lockref.lock)) {
>  add_back_to_lru:
>  			list_add(&gl->gl_lru, &lru_list);
> +			set_bit(GLF_LRU, &gl->gl_flags);
>  			atomic_inc(&lru_count);
>  			continue;
>  		}
> @@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
>  			spin_unlock(&gl->gl_lockref.lock);
>  			goto add_back_to_lru;
>  		}
> -		clear_bit(GLF_LRU, &gl->gl_flags);
>  		gl->gl_lockref.count++;
>  		if (demote_ok(gl))
>  			handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> @@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
>  		if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
>  			list_move(&gl->gl_lru, &dispose);
>  			atomic_dec(&lru_count);
> +			clear_bit(GLF_LRU, &gl->gl_flags);
>  			freed++;
>  			continue;
>  		}
> --
> 2.17.2
> 
> 




* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
  2019-01-31 11:23   ` Steven Whitehouse
@ 2019-01-31 14:40   ` Bob Peterson
  2019-01-31 17:18   ` Andreas Gruenbacher
  2 siblings, 0 replies; 21+ messages in thread
From: Bob Peterson @ 2019-01-31 14:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.
> 
> Found by KASAN:
> 
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371
> 
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x258/0x380
>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  gfs2_log_flush+0x511/0xa70 [gfs2]
>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>  ? __brelse+0x48/0x50
>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>  ? inode_go_sync+0xe4/0x220 [gfs2]
>  inode_go_sync+0xe4/0x220 [gfs2]
>  do_xmote+0x12b/0x290 [gfs2]
>  glock_work_func+0x6f/0x160 [gfs2]
>  process_one_work+0x461/0x790
>  worker_thread+0x69/0x6b0
>  ? process_one_work+0x790/0x790
>  kthread+0x1ae/0x1d0
>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>  ret_from_fork+0x22/0x40
> 
> Allocated by task 20805:
>  kasan_kmalloc+0xa0/0xd0
>  kmem_cache_alloc+0xb5/0x1b0
>  gfs2_glock_get+0x14b/0x620 [gfs2]
>  gfs2_inode_lookup+0x20c/0x640 [gfs2]
>  gfs2_dir_search+0x150/0x180 [gfs2]
>  gfs2_lookupi+0x272/0x360 [gfs2]
>  __gfs2_lookup+0x8b/0x1d0 [gfs2]
>  gfs2_atomic_open+0x77/0x100 [gfs2]
>  path_openat+0x1454/0x1c10
>  do_filp_open+0x124/0x1d0
>  do_sys_open+0x213/0x2c0
>  do_syscall_64+0x69/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Freed by task 0:
>  __kasan_slab_free+0x130/0x180
>  kmem_cache_free+0x78/0x1e0
>  rcu_process_callbacks+0x2ad/0x6c0
>  __do_softirq+0x111/0x38c
> 
> The buggy address belongs to the object at ffff88801aff6040
>  which belongs to the cache gfs2_glock(aspace) of size 560
> The buggy address is located 244 bytes inside of
>  560-byte region [ffff88801aff6040, ffff88801aff6270)
> ...
> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---

This makes sense to me.

Regards,

Bob Peterson




* [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative
  2019-01-31 14:36   ` Bob Peterson
@ 2019-01-31 15:04     ` Bob Peterson
  2019-01-31 15:23     ` Ross Lagerwall
  1 sibling, 0 replies; 21+ messages in thread
From: Bob Peterson @ 2019-01-31 15:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> > +
> > +	if (!test_bit(GLF_LRU, &gl->gl_flags)) {
> > +		set_bit(GLF_LRU, &gl->gl_flags);
> >  		atomic_inc(&lru_count);
> > +	}
> 
> The above may be simplified to something like:
> +	if (!test_and_set_bit(GLF_LRU, &gl->gl_flags))
>  		atomic_inc(&lru_count);

Scratch that. Andreas says test_and_set_bit() and similar are much more
expensive cpu-wise, and we're already protected from races by the
lru_lock, so I guess the original is better after all.

Bob Peterson




* [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative
  2019-01-31 14:36   ` Bob Peterson
  2019-01-31 15:04     ` Bob Peterson
@ 2019-01-31 15:23     ` Ross Lagerwall
  1 sibling, 0 replies; 21+ messages in thread
From: Ross Lagerwall @ 2019-01-31 15:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 1/31/19 2:36 PM, Bob Peterson wrote:
> Hi Ross,
> 
> Comments below. Sorry if this is a bit incoherent; it's early and I'm
> not properly caffeinated yet.
> 
> ----- Original Message -----
>> Under certain conditions, lru_count may drop below zero resulting in
>> a large amount of log spam like this:
>>
>> vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
>>      negative objects to delete nr=-1
>>
>> This happens as follows:
>> 1) A glock is moved from lru_list to the dispose list and lru_count is
>>     decremented.
>> 2) The dispose function calls cond_resched() and drops the lru lock.
>> 3) Another thread takes the lru lock and tries to add the same glock to
>>     lru_list, checking if the glock is on an lru list.
>> 4) It is on a list (actually the dispose list) and so it avoids
>>     incrementing lru_count.
>> 5) The glock is moved to lru_list.
>> 6) The original thread doesn't dispose it because it has been re-added
>>     to the lru list, but lru_count has still decreased by one.
>>
>> Fix by checking whether the LRU flag is set on the glock rather than
>> whether the glock is on some list, and by rearranging the code so that
>> the LRU flag is set/cleared precisely when the glock is added to/removed
>> from lru_list.
>>
>> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
>> ---
>>   fs/gfs2/glock.c | 16 +++++++++-------
>>   1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
>> index b92740edc416..53e6c7e0c1b3 100644
>> --- a/fs/gfs2/glock.c
>> +++ b/fs/gfs2/glock.c
>> @@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
>>   {
>>   	spin_lock(&lru_lock);
>>   
>> -	if (!list_empty(&gl->gl_lru))
>> -		list_del_init(&gl->gl_lru);
>> -	else
>> +	list_del(&gl->gl_lru);
>> +	list_add_tail(&gl->gl_lru, &lru_list);
> 
> This looks like a bug, and I like your idea of using the GLF_LRU bit
> to determine whether or not to do the manipulation, but I have some
> concerns. First, does it work with kernel list debugging turned on?
> 
> To me it looks like the list_del (as opposed to list_del_init) above
> will set entry->next and prev to LIST_POISON values, then the
> list_add_tail() calls __list_add() which checks:
> 	if (!__list_add_valid(new, prev, next))
> 		return;
> Without list debugging, the value is always returned true, but with
> list debugging it checks for circular values of list->prev and list->next
> which, since they're LIST_POISON, ought to fail.
> So it seems like the original list_del_init is correct.

No, __list_add_valid() checks if prev & next correctly link to each 
other and checks that new is not the same as prev or next. It doesn't 
check anything to do with new's pointers. I've tested with DEBUG_LIST 
and it appears to work.
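
Trimmed down, the check in lib/list_debug.c is essentially this (the
message strings are omitted):

bool __list_add_valid(struct list_head *new, struct list_head *prev,
		      struct list_head *next)
{
	/* note that nothing here inspects new->prev or new->next */
	if (CHECK_DATA_CORRUPTION(next->prev != prev, ...) ||
	    CHECK_DATA_CORRUPTION(prev->next != next, ...) ||
	    CHECK_DATA_CORRUPTION(new == prev || new == next, ...))
		return false;
	return true;
}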

> 
> The intent was: if the glock is already on the lru, take it off
> before re-adding it, and the count ought to be okay, because if it's
> on the LRU list, it's already been incremented. So taking it off and
> adding it back on is a net 0 on the count. But that's only
> true if the GLF_LRU bit is set. If it's on a different list (the
> dispose list), as you noted, it still needs to be incremented.
> 
> If the glock is on the dispose_list, rather than the lru list, we
> want to take it off the dispose list and move it to the lru_list,
> but in that case, we need to increment the lru count, and not
> poison the list_head.
> 
> So to me it seems like we should keep the list_del_init, and only
> do it if the list isn't empty, but trigger off the GLF_LRU flag
> for managing the count. The lru_lock ought to prevent races.

I think I understand the intent, but IMO this patch makes the logic 
clearer than relying on both the LRU bit and the state of the list to 
determine what to do. Thoughts?

Thanks,
-- 
Ross Lagerwall




* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
  2019-01-31 11:23   ` Steven Whitehouse
  2019-01-31 14:40   ` Bob Peterson
@ 2019-01-31 17:18   ` Andreas Gruenbacher
  2019-02-01  9:23     ` Ross Lagerwall
  2019-03-26 18:49     ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
  2 siblings, 2 replies; 21+ messages in thread
From: Andreas Gruenbacher @ 2019-01-31 17:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall <ross.lagerwall@citrix.com> wrote:
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.
>
> Found by KASAN:
>
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371
>
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x258/0x380
>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  gfs2_log_flush+0x511/0xa70 [gfs2]
>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>  ? __brelse+0x48/0x50
>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>  ? inode_go_sync+0xe4/0x220 [gfs2]
>  inode_go_sync+0xe4/0x220 [gfs2]
>  do_xmote+0x12b/0x290 [gfs2]
>  glock_work_func+0x6f/0x160 [gfs2]
>  process_one_work+0x461/0x790
>  worker_thread+0x69/0x6b0
>  ? process_one_work+0x790/0x790
>  kthread+0x1ae/0x1d0
>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>  ret_from_fork+0x22/0x40

thanks for tracking this down, very interesting.

The consistency model here is that every buffer head that a struct
gfs2_bufdata object is attached to is protected by a glock. Before a
glock can be released, all the buffers under that glock have to be
flushed out and released; this is what allows another node to access
the same on-disk location without causing inconsistencies. When there
is a bufdata object that points to a glock that has already been
freed, this consistency model is broken. Taking an additional refcount
as this patch does may make the use-after-free go away, but it doesn't
fix the underlying problem. So I think we'll need a different fix
here.

Did you observe this problem in a real-world scenario, or with KASAN
only? It might be that we're looking at a small race that is unlikely
to trigger in the field. In any case, I think we need to understand
better what's actually going on.

Thanks,
Andreas




* [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative
  2019-01-31 10:55 ` [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative Ross Lagerwall
  2019-01-31 11:21   ` Steven Whitehouse
  2019-01-31 14:36   ` Bob Peterson
@ 2019-01-31 18:32   ` Andreas Gruenbacher
  2 siblings, 0 replies; 21+ messages in thread
From: Andreas Gruenbacher @ 2019-01-31 18:32 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall <ross.lagerwall@citrix.com> wrote:
>
> Under certain conditions, lru_count may drop below zero resulting in
> a large amount of log spam like this:
>
> vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
>     negative objects to delete nr=-1
>
> This happens as follows:
> 1) A glock is moved from lru_list to the dispose list and lru_count is
>    decremented.
> 2) The dispose function calls cond_resched() and drops the lru lock.
> 3) Another thread takes the lru lock and tries to add the same glock to
>    lru_list, checking if the glock is on an lru list.
> 4) It is on a list (actually the dispose list) and so it avoids
>    incrementing lru_count.
> 5) The glock is moved to lru_list.
> 6) The original thread doesn't dispose it because it has been re-added
>    to the lru list, but lru_count has still decreased by one.
>
> Fix by checking whether the LRU flag is set on the glock rather than
> whether the glock is on some list, and by rearranging the code so that
> the LRU flag is set/cleared precisely when the glock is added to/removed
> from lru_list.

this analysis and the patch both seem to be spot on, thanks a lot.

I've got one minor nit, but that's not related to your changes (see below).

> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  fs/gfs2/glock.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index b92740edc416..53e6c7e0c1b3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
>  {

For symmetry and consistency, I think it would make sense to move the
(glops->go_flags & GLOF_LRU) check from gfs2_glock_dq here as well.
That function is also called for inode glocks from gfs2_evict_inode,
but those always have GLOF_LRU set.
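
Something like this at the top of gfs2_glock_add_to_lru, I mean (just a
sketch):

	if (!(gl->gl_ops->go_flags & GLOF_LRU))
		return;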

>         spin_lock(&lru_lock);
>
> -       if (!list_empty(&gl->gl_lru))
> -               list_del_init(&gl->gl_lru);
> -       else
> +       list_del(&gl->gl_lru);
> +       list_add_tail(&gl->gl_lru, &lru_list);
> +
> +       if (!test_bit(GLF_LRU, &gl->gl_flags)) {
> +               set_bit(GLF_LRU, &gl->gl_flags);
>                 atomic_inc(&lru_count);
> +       }
>
> -       list_add_tail(&gl->gl_lru, &lru_list);
> -       set_bit(GLF_LRU, &gl->gl_flags);
>         spin_unlock(&lru_lock);
>  }
>
> @@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)
>                 return;
>
>         spin_lock(&lru_lock);
> -       if (!list_empty(&gl->gl_lru)) {
> +       if (test_bit(GLF_LRU, &gl->gl_flags)) {
>                 list_del_init(&gl->gl_lru);
>                 atomic_dec(&lru_count);
>                 clear_bit(GLF_LRU, &gl->gl_flags);
> @@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
>                 if (!spin_trylock(&gl->gl_lockref.lock)) {
>  add_back_to_lru:
>                         list_add(&gl->gl_lru, &lru_list);
> +                       set_bit(GLF_LRU, &gl->gl_flags);
>                         atomic_inc(&lru_count);
>                         continue;
>                 }
> @@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
>                         spin_unlock(&gl->gl_lockref.lock);
>                         goto add_back_to_lru;
>                 }
> -               clear_bit(GLF_LRU, &gl->gl_flags);
>                 gl->gl_lockref.count++;
>                 if (demote_ok(gl))
>                         handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> @@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
>                 if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
>                         list_move(&gl->gl_lru, &dispose);
>                         atomic_dec(&lru_count);
> +                       clear_bit(GLF_LRU, &gl->gl_flags);
>                         freed++;
>                         continue;
>                 }
> --
> 2.17.2
>

Thanks,
Andreas




* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-01-31 17:18   ` Andreas Gruenbacher
@ 2019-02-01  9:23     ` Ross Lagerwall
  2019-02-01 14:34       ` Bob Peterson
                         ` (2 more replies)
  2019-03-26 18:49     ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
  1 sibling, 3 replies; 21+ messages in thread
From: Ross Lagerwall @ 2019-02-01  9:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 1/31/19 5:18 PM, Andreas Gruenbacher wrote:
> Hi Ross,
> 
> On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall <ross.lagerwall@citrix.com> wrote:
>> Each gfs2_bufdata stores a reference to a glock but the reference count
>> isn't incremented. This causes an occasional use-after-free of the
>> glock. Fix by taking a reference on the glock during allocation and
>> dropping it when freeing.
>>
>> Found by KASAN:
>>
>> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>> Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371
>>
>> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
>> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
>> Workqueue: glock_workqueue glock_work_func [gfs2]
>> Call Trace:
>>   dump_stack+0x71/0xab
>>   print_address_description+0x6a/0x270
>>   kasan_report+0x258/0x380
>>   ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>   revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>   gfs2_log_flush+0x511/0xa70 [gfs2]
>>   ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>>   ? __brelse+0x48/0x50
>>   ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>>   ? gfs2_trans_end+0x18d/0x340 [gfs2]
>>   gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>>   ? inode_go_dump+0xe0/0xe0 [gfs2]
>>   ? inode_go_sync+0xe4/0x220 [gfs2]
>>   inode_go_sync+0xe4/0x220 [gfs2]
>>   do_xmote+0x12b/0x290 [gfs2]
>>   glock_work_func+0x6f/0x160 [gfs2]
>>   process_one_work+0x461/0x790
>>   worker_thread+0x69/0x6b0
>>   ? process_one_work+0x790/0x790
>>   kthread+0x1ae/0x1d0
>>   ? kthread_create_worker_on_cpu+0xc0/0xc0
>>   ret_from_fork+0x22/0x40
> 
> thanks for tracking this down, very interesting.
> 
> The consistency model here is that every buffer head that a struct
> gfs2_bufdata object is attached to is protected by a glock. Before a
> glock can be released, all the buffers under that glock have to be
> flushed out and released; this is what allows another node to access
> the same on-disk location without causing inconsistencies. When there
> is a bufdata object that points to a glock that has already been
> freed, this consistency model is broken. Taking an additional refcount
> as this patch does may make the use-after-free go away, but it doesn't
> fix the underlying problem. So I think we'll need a different fix
> here.

Yes, I kind of suspected that this is papering over the problem rather 
than fixing the root cause.

> 
> Did you observe this problem in a real-world scenario, or with KASAN
> only? It might be that we're looking at a small race that is unlikely
> to trigger in the field. In any case, I think we need to understand
> better what's actually going on.

We haven't observed any problems that can be directly attributed to this 
without KASAN, although it is hard to tell what a stray write may do. We 
have hit sporadic asserts and filesystem corruption during testing.

When I added tracing, the time between freeing a glock and writing to it 
varied but could be up to hundreds of milliseconds so I would guess that 
this could easily happen without KASAN. It is relatively easy to 
reproduce in our test environment.

Do you have any suggestions for tracking down the root cause?

Thanks,
-- 
Ross Lagerwall




* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-02-01  9:23     ` Ross Lagerwall
@ 2019-02-01 14:34       ` Bob Peterson
  2019-02-01 14:51       ` Bob Peterson
  2019-02-01 15:03       ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free (Another debug patch) Bob Peterson
  2 siblings, 0 replies; 21+ messages in thread
From: Bob Peterson @ 2019-02-01 14:34 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

----- Original Message -----
(snip)
> We haven't observed any problems that can be directly attributed to this
> without KASAN, although it is hard to tell what a stray write may do. We
> have hit sporadic asserts and filesystem corruption during testing.
> 
> When I added tracing, the time between freeing a glock and writing to it
> varied but could be up to hundreds of milliseconds so I would guess that
> this could easily happen without KASAN. It is relatively easy to
> reproduce in our test environment.
> 
> Do you have any suggestions for tracking down the root cause?

In the past, I've debugged problems with glock reference counting by
using kernel tracing and instrumentation. Unfortunately, the "glock_put"
trace point only shows you when the glock ref count goes to 0; it
doesn't show when or how the glock is first created, so it can't tell
you whether a glock is created and destroyed multiple times. That's
often important for figuring these problems out; otherwise it's just a
lot of chaos.

In the past, I've added my own temporary kernel trace point for when new
glocks are created, and called it "glock_new." You probably also want to
modify the glock put functions, such as gfs2_glock_put and
gfs2_glock_queue_put, to call a trace point so you can tell that too, and
have it save off the gl_lockref reference count in the trace.
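
A temporary trace point along these lines should do the trick (a sketch
modeled on the existing events in fs/gfs2/trace_gfs2.h; untested, so the
field accesses may need adjusting for your kernel):

TRACE_EVENT(gfs2_glock_new,
	TP_PROTO(const struct gfs2_glock *gl),
	TP_ARGS(gl),
	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(u64, glnum)
		__field(u32, gltype)
		__field(int, count)
	),
	TP_fast_assign(
		__entry->dev = gl->gl_name.ln_sbd->sd_vfs->s_dev;
		__entry->glnum = gl->gl_name.ln_number;
		__entry->gltype = gl->gl_name.ln_type;
		__entry->count = gl->gl_lockref.count;
	),
	TP_printk("%u,%u glock %d:%llu new (ref %d)",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  __entry->gltype, (unsigned long long)__entry->glnum,
		  __entry->count)
);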

Then recreate the problem with the trace running. I attached a script I
often use for these purposes. The script contains several bogus trace
point references for various sets of temporary trace points I've added
and deleted over the years, like a generic "debug" trace point where I
can add generic messages of what's happening. So don't be surprised if
you get errors about trying to cat values into non-existent debugfs files.
Just ignore them. The script DOES contain a trigger for a "glock_new"
trace point for just this purpose. I can try to dig out whether I still
have that trace point (glock_new) and the generic debug trace point
lying around somewhere in my many git repositories, but it might take
longer than just writing them again from scratch. I know it pre-dates
the concept of a "queued_put" so things will need to be tweaked anyway.

The script has a bunch of declarations at the top that control which
trace points to monitor and collect. I modified it for glock_new and
glock_put, but you can play with it.

To run the script and collect the trace, just do this:
./gfs2trace.sh &
(recreate the problem)
rm /var/run/gfs2-tracepoints.pid

Removing that file triggers the trace script to stop tracing and save
the results to a file in /tmp/ named after the machine's name
(so we can keep them straight in clustered situations).
Then, of course, someone needs to analyze the resulting trace file and
figure out where the count is getting off. I hope this helps.

Regards,

Bob Peterson
Red Hat File Systems
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfs2trace.sh
Type: application/x-shellscript
Size: 9477 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20190201/3c6a6bdf/attachment.bin>


* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-02-01  9:23     ` Ross Lagerwall
  2019-02-01 14:34       ` Bob Peterson
@ 2019-02-01 14:51       ` Bob Peterson
  2019-02-01 15:03       ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free (Another debug patch) Bob Peterson
  2 siblings, 0 replies; 21+ messages in thread
From: Bob Peterson @ 2019-02-01 14:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

----- Original Message -----
> Do you have any suggestions for tracking down the root cause?

One time I had a similar problem in RHEL7 and couldn't use kernel
tracing because there were millions of glocks involved: the trace was
too huge and quickly swamped the biggest possible kernel trace buffer.
So I ended up writing the ugly, hacky patch that's attached. Perhaps
you can use it as a starting point.

The idea is: every time there's a get or a put on a glock, the patch
saves off a 1-byte identifier of which function did the get/put. It
stores these in a new 64-byte field kept for each glock, which of
course makes the slab much bigger, but it was never meant to be
shipped, right?

Then, when the problem occurred, it would dump out the problematic
glock, including the 64-byte get/put history value.
Then I would go through it and identify the history of what went
wrong.
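
The shape of the instrumentation was something like this (names made up
from memory; the real patch is attached):

	/* added to struct gfs2_glock: */
	u8 gl_hist[64];		/* one 1-byte tag per get/put */
	u8 gl_hist_pos;

	/* called from every hold/put site with a per-caller tag: */
	static inline void glock_hist_add(struct gfs2_glock *gl, u8 tag)
	{
		gl->gl_hist[gl->gl_hist_pos++ % 64] = tag;
	}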

Since this is a fairly old (2015) patch that targets an old rhel7,
it will obviously need a lot of updating to get it to work, but
it might work better than the kernel tracing, depending on how
many glocks are involved in your test.

Regards,

Bob Peterson
Red Hat File Systems
-------------- next part --------------
A non-text attachment was scrubbed...
Name: get_put.patch
Type: text/x-patch
Size: 25569 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20190201/4bb1e1c2/attachment.bin>


* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free (Another debug patch)
  2019-02-01  9:23     ` Ross Lagerwall
  2019-02-01 14:34       ` Bob Peterson
  2019-02-01 14:51       ` Bob Peterson
@ 2019-02-01 15:03       ` Bob Peterson
  2 siblings, 0 replies; 21+ messages in thread
From: Bob Peterson @ 2019-02-01 15:03 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

----- Original Message -----
> Do you have any suggestions for tracking down the root cause?

Attached is the starting point for a generic "debug" kernel trace point,
complete with examples of how it's used. It's always associated with a
particular glock. You might find it helpful, but again, it's old, so
it'll probably need to be tweaked for an upstream kernel.

Regards,

Bob Peterson
Red Hat File Systems
-------------- next part --------------
A non-text attachment was scrubbed...
Name: debug_trace.patch
Type: text/x-patch
Size: 3915 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20190201/82e4a2a6/attachment.bin>


* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-01-31 17:18   ` Andreas Gruenbacher
  2019-02-01  9:23     ` Ross Lagerwall
@ 2019-03-26 18:49     ` Ross Lagerwall
  2019-03-26 19:14       ` Bob Peterson
  1 sibling, 1 reply; 21+ messages in thread
From: Ross Lagerwall @ 2019-03-26 18:49 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 1/31/19 5:18 PM, Andreas Gruenbacher wrote:
> Hi Ross,
> 
> On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall <ross.lagerwall@citrix.com> wrote:
>> Each gfs2_bufdata stores a reference to a glock but the reference count
>> isn't incremented. This causes an occasional use-after-free of the
>> glock. Fix by taking a reference on the glock during allocation and
>> dropping it when freeing.
>>
>> Found by KASAN:
>>
>> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>> Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371
>>
>> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
>> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
>> Workqueue: glock_workqueue glock_work_func [gfs2]
>> Call Trace:
>>   dump_stack+0x71/0xab
>>   print_address_description+0x6a/0x270
>>   kasan_report+0x258/0x380
>>   ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>   revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>>   gfs2_log_flush+0x511/0xa70 [gfs2]
>>   ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>>   ? __brelse+0x48/0x50
>>   ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>>   ? gfs2_trans_end+0x18d/0x340 [gfs2]
>>   gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>>   ? inode_go_dump+0xe0/0xe0 [gfs2]
>>   ? inode_go_sync+0xe4/0x220 [gfs2]
>>   inode_go_sync+0xe4/0x220 [gfs2]
>>   do_xmote+0x12b/0x290 [gfs2]
>>   glock_work_func+0x6f/0x160 [gfs2]
>>   process_one_work+0x461/0x790
>>   worker_thread+0x69/0x6b0
>>   ? process_one_work+0x790/0x790
>>   kthread+0x1ae/0x1d0
>>   ? kthread_create_worker_on_cpu+0xc0/0xc0
>>   ret_from_fork+0x22/0x40
> 
> thanks for tracking this down, very interesting.
> 
> The consistency model here is that every buffer head that a struct
> gfs2_bufdata object is attached to is protected by a glock. Before a
> glock can be released, all the buffers under that glock have to be
> flushed out and released; this is what allows another node to access
> the same on-disk location without causing inconsistencies. When there
> is a bufdata object that points to a glock that has already been
> freed, this consistency model is broken. Taking an additional refcount
> as this patch does may make the use-after-free go away, but it doesn't
> fix the underlying problem. So I think we'll need a different fix
> here.
> 
> Did you observe this problem in a real-world scenario, or with KASAN
> only? It might be that we're looking at a small race that is unlikely
> to trigger in the field. In any case, I think we need to understand
> better what's actually going on.
> 

I finally got time to look into this further. The following is what
happens as far as I can tell, though my understanding is a bit limited
at this point:

1. A file is opened, truncated and closed, which results in a metadata
write, so a gfs2_bufdata object is created and associated with the inode
glock.

2. The gfs2_bufdata is written out and placed on an AIL list.

3. The VFS layer calls gfs2_evict_inode() and the inode is evicted, which
drops a reference on the glock. It takes case 3 (i_nlink > 0), so no log
flush or ail_flush happens.

4. The gfs2_bufdata object is, however, still on one of the AIL lists, and
the next gfs2_log_flush() begins to write a revoke entry.

5. At about the same time, the glock is freed. At the point of freeing,
gl_revokes is 1 (there should probably be a GLOCK_BUG_ON for this).

6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and 
uses the freed glock (stack trace above).
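
As an aside, the GLOCK_BUG_ON mentioned in step 5 could be as simple as
this, assuming gfs2_glock_free() is where the glock memory is finally
returned (untested):

	/* no revoke should still reference this glock */
	GLOCK_BUG_ON(gl, atomic_read(&gl->gl_revokes) != 0);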

Should evict_inode call gfs2_ail_flush() or something so that the revoke 
is written before it drops its reference to the glock?

Or is there something else that is meant to keep the glock around if the 
inode is evicted while there is a linked gfs2_bufdata sitting on some 
AIL list?

Let me know if any more specific information is needed since I now have 
a test setup that can usually reproduce this within 10 minutes.

Thanks,
-- 
Ross Lagerwall




* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-03-26 18:49     ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
@ 2019-03-26 19:14       ` Bob Peterson
  2019-04-01 22:59         ` Andreas Gruenbacher
  0 siblings, 1 reply; 21+ messages in thread
From: Bob Peterson @ 2019-03-26 19:14 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

----- Original Message -----
> 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> uses the freed glock (stack trace above).
> 
> Should evict_inode call gfs2_ail_flush() or something so that the revoke
> is written before it drops its reference to the glock?
> 
> Or is there something else that is meant to keep the glock around if the
> inode is evicted while there is a linked gfs2_bufdata sitting on some
> AIL list?
> 
> Let me know if any more specific information is needed since I now have
> a test setup that can usually reproduce this within 10 minutes.

Very interesting.

It's not unusual for glocks to outlive their inodes, but I'm not sure
what the right answer is in this case. Either the revoke should
take an additional reference to the glock, and not let go until the
revoke is written by some log flush, or else the evict needs to do the
log flush to make sure the revoke is committed. But we've had issues with
evict in the past, so we need to be careful about how we fix it.
Andreas and I will look into the best way to fix it. Thanks again for your help.
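
The first of those options would look roughly like this (an untested
sketch; a plain gfs2_glock_put() may not be safe in log flush context,
so an asynchronous put might be needed):

	/* where the revoke is queued (gfs2_add_revoke): pin the glock */
	atomic_inc(&gl->gl_revokes);
	set_bit(GLF_LFLUSH, &gl->gl_flags);
	gfs2_glock_hold(gl);

	/* once the revoke has hit the log (revoke_lo_after_commit): unpin */
	atomic_dec(&gl->gl_revokes);
	clear_bit(GLF_LFLUSH, &gl->gl_flags);
	gfs2_glock_put(gl);
	kmem_cache_free(gfs2_bufdata_cachep, bd);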

Regards,

Bob Peterson
Red Hat File Systems



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-03-26 19:14       ` Bob Peterson
@ 2019-04-01 22:59         ` Andreas Gruenbacher
  2019-04-05 17:50           ` Andreas Gruenbacher
  0 siblings, 1 reply; 21+ messages in thread
From: Andreas Gruenbacher @ 2019-04-01 22:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

thanks for the great analysis.

On Tue, 26 Mar 2019 at 20:14, Bob Peterson <rpeterso@redhat.com> wrote:
> ----- Original Message -----
> > 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> > uses the freed glock (stack trace above).
> >
> > Should evict_inode call gfs2_ail_flush() or something so that the revoke
> > is written before it drops its reference to the glock?
> >
> > Or is there something else that is meant to keep the glock around if the
> > inode is evicted while there is a linked gfs2_bufdata sitting on some
> > AIL list?
> >
> > Let me know if any more specific information is needed since I now have
> > a test setup that can usually reproduce this within 10 minutes.
>
> Very interesting.
>
> It's not unusual for glocks to outlive their inodes, but I'm not sure
> what the right answer is in this case. Either the revoke should
> take an additional reference to the glock, and not let go until the
> revoke is written by some log flush, or else the evict needs to do the
> log flush to make sure the revoke is committed. But we've had issues with
> evict in the past, so we need to be careful about how we fix it.
> Andreas and I will look into the best way to fix it. Thanks again for your help.

Usually, glocks are attached to inodes or other objects. When the
state of the glock changes, the go_lock, go_sync, and go_inval
operations determine what happens to the attached object. In
gfs2_evict_inode, we disassociate the inode from the glock, do the
necessary cleanup work locally, and put the glock asynchronously when
needed, though. We do that in PF_MEMALLOC context to avoid
deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put
glocks asynchronously"). Even if we didn't do that, there could still
be other glock holders like the glock state machine. There is no
guarantee that the glock will go away anytime soon, or actually at
all: it could get reused by another instance of the same inode. So
waiting for the glock to go away first in gfs2_evict_inode definitely
is not an option.

This basically answers your above question: gfs2_evict_inode should
definitely clean up before putting the glock. I'm just not sure how to
best achieve that, yet.
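
To illustrate what "attached" means here: each glock type has a
gfs2_glock_operations table, and for inodes it looks roughly like this
(simplified from fs/gfs2/glops.c, so take it as a sketch rather than
the exact current code):

  const struct gfs2_glock_operations gfs2_inode_glops = {
          .go_sync = inode_go_sync,   /* write back the attached inode */
          .go_inval = inode_go_inval, /* invalidate cached data */
          .go_lock = inode_go_lock,   /* set up the inode when locked */
          .go_dump = inode_go_dump,
          .go_type = LM_TYPE_INODE,
  };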

Thanks,
Andreas



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-04-01 22:59         ` Andreas Gruenbacher
@ 2019-04-05 17:50           ` Andreas Gruenbacher
  2019-04-09 15:36             ` Ross Lagerwall
  0 siblings, 1 reply; 21+ messages in thread
From: Andreas Gruenbacher @ 2019-04-05 17:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ross,

On Tue, 2 Apr 2019 at 00:59, Andreas Gruenbacher <agruenba@redhat.com> wrote:
> thanks for the great analysis.
>
> On Tue, 26 Mar 2019 at 20:14, Bob Peterson <rpeterso@redhat.com> wrote:
> > ----- Original Message -----
> > > 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> > > uses the freed glock (stack trace above).
> > >
> > > Should evict_inode call gfs2_ail_flush() or something so that the revoke
> > > is written before it drops its reference to the glock?
> > >
> > > Or is there something else that is meant to keep the glock around if the
> > > inode is evicted while there is a linked gfs2_bufdata sitting on some
> > > AIL list?
> > >
> > > Let me know if any more specific information is needed since I now have
> > > a test setup that can usually reproduce this within 10 minutes.
> >
> > Very interesting.
> >
> > It's not unusual for glocks to outlive their inodes, but I'm not sure
> > what the right answer is in this case. Either the revoke should
> > take an additional reference to the glock, and not let go until the
> > revoke is written by some log flush, or else the evict needs to do the
> > log flush to make sure the revoke is committed. But we've had issues with
> > evict in the past, so we need to be careful about how we fix it.
> > Andreas and I will look into the best way to fix it. Thanks again for your help.
>
> Usually, glocks are attached to inodes or other objects. When the
> state of the glock changes, the go_lock, go_sync, and go_inval
> operations determine what happens to the attached object. In
> gfs2_evict_inode, we disassociate the inode from the glock, do the
> necessary cleanup work locally, and put the glock asynchronously when
> needed, though. We do that in PF_MEMALLOC context to avoid
> deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put
> glocks asynchronously"). Even if we didn't do that, there could still
> be other glock holders like the glock state machine. There is no
> guarantee that the glock will go away anytime soon, or actually at
> all: it could get reused by another instance of the same inode. So
> waiting for the glock to go away first in gfs2_evict_inode definitely
> is not an option.
>
> This basically answers your above question: gfs2_evict_inode should
> definitely clean up before putting the glock. I'm just not sure how to
> best achieve that, yet.

after more discussions, Bob convinced me that it makes perfect sense
to not write out outstanding revokes in gfs2_evict_inode. We really
want to delay writing revokes as long as possible so that as many of
the revokes as possible will go away (either because the blocks are
re-added to the journal and the revokes are "unrevoked", or because
the journal wraps around). So the right fix here is indeed taking
additional glock references.

I've come up with a patch that takes one additional reference for each
glock instead of one reference for each bufdata object; that should
scale better. I'll post that and a follow-up patch separately.
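
The core of the fix looks roughly like this (a sketch of the idea
rather than the final patch; helper names and exact placement may
differ in what I post):

  /* In gfs2_add_revoke(): the first pending revoke for a glock takes
   * an extra reference so the glock outlives its inode until the
   * revoke has been written. */
  if (atomic_inc_return(&gl->gl_revokes) == 1)
          gfs2_glock_hold(gl);

  /* Once the last revoke for the glock has been written (e.g. in
   * revoke_lo_after_commit()), drop that reference again; queueing
   * the put avoids doing glock work from the log flush context. */
  if (atomic_dec_return(&gl->gl_revokes) == 0)
          gfs2_glock_queue_put(gl);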

Could you please verify?

Thanks,
Andreas



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-04-05 17:50           ` Andreas Gruenbacher
@ 2019-04-09 15:36             ` Ross Lagerwall
  2019-04-09 15:41               ` Andreas Gruenbacher
  0 siblings, 1 reply; 21+ messages in thread
From: Ross Lagerwall @ 2019-04-09 15:36 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 4/5/19 6:50 PM, Andreas Gruenbacher wrote:
> Hi Ross,
> 
> On Tue, 2 Apr 2019 at 00:59, Andreas Gruenbacher <agruenba@redhat.com> wrote:
>> thanks for the great analysis.
>>
>> On Tue, 26 Mar 2019 at 20:14, Bob Peterson <rpeterso@redhat.com> wrote:
>>> ----- Original Message -----
>>>> 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
>>>> uses the freed glock (stack trace above).
>>>>
>>>> Should evict_inode call gfs2_ail_flush() or something so that the revoke
>>>> is written before it drops its reference to the glock?
>>>>
>>>> Or is there something else that is meant to keep the glock around if the
>>>> inode is evicted while there is a linked gfs2_bufdata sitting on some
>>>> AIL list?
>>>>
>>>> Let me know if any more specific information is needed since I now have
>>>> a test setup that can usually reproduce this within 10 minutes.
>>>
>>> Very interesting.
>>>
>>> It's not unusual for glocks to outlive their inodes, but I'm not sure
>>> what the right answer is in this case. Either the revoke should
>>> take an additional reference to the glock, and not let go until the
>>> revoke is written by some log flush, or else the evict needs to do the
>>> log flush to make sure the revoke is committed. But we've had issues with
>>> evict in the past, so we need to be careful about how we fix it.
>>> Andreas and I will look into the best way to fix it. Thanks again for your help.
>>
>> Usually, glocks are attached to inodes or other objects. When the
>> state of the glock changes, the go_lock, go_sync, and go_inval
>> operations determine what happens to the attached object. In
>> gfs2_evict_inode, we disassociate the inode from the glock, do the
>> necessary cleanup work locally, and put the glock asynchronously when
>> needed, though. We do that in PF_MEMALLOC context to avoid
>> deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put
>> glocks asynchronously"). Even if we didn't do that, there could still
>> be other glock holders like the glock state machine. There is no
>> guarantee that the glock will go away anytime soon, or actually at
>> all: it could get reused by another instance of the same inode. So
>> waiting for the glock to go away first in gfs2_evict_inode definitely
>> is not an option.
>>
>> This basically answers your above question: gfs2_evict_inode should
>> definitely clean up before putting the glock. I'm just not sure how to
>> best achieve that, yet.
> 
> after more discussions, Bob convinced me that it makes perfect sense
> to not write out outstanding revokes in gfs2_evict_inode. We really
> want to delay writing revokes as long as possible so that as many of
> the revokes as possible will go away (either because the blocks are
> re-added to the journal and the revokes are "unrevoked", or because
> the journal wraps around). So the right fix here is indeed taking
> additional glock references.
> 
> I've come up with a patch that takes one additional reference for each
> glock instead of one reference for each bufdata object; that should
> scale better. I'll post that and a follow-up patch separately.
> 
> Could you please verify?
> 

I've tested the patches and no longer see the use-after-free.

Thanks for working on this!

-- 
Ross Lagerwall



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free
  2019-04-09 15:36             ` Ross Lagerwall
@ 2019-04-09 15:41               ` Andreas Gruenbacher
  0 siblings, 0 replies; 21+ messages in thread
From: Andreas Gruenbacher @ 2019-04-09 15:41 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, 9 Apr 2019 at 17:37, Ross Lagerwall <ross.lagerwall@citrix.com> wrote:
> On 4/5/19 6:50 PM, Andreas Gruenbacher wrote:
> > Hi Ross,
> >
> > On Tue, 2 Apr 2019 at 00:59, Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >> thanks for the great analysis.
> >>
> >> On Tue, 26 Mar 2019 at 20:14, Bob Peterson <rpeterso@redhat.com> wrote:
> >>> ----- Original Message -----
> >>>> 6. gfs2_log_flush() continues and calls revoke_lo_after_commit() and
> >>>> uses the freed glock (stack trace above).
> >>>>
> >>>> Should evict_inode call gfs2_ail_flush() or something so that the revoke
> >>>> is written before it drops its reference to the glock?
> >>>>
> >>>> Or is there something else that is meant to keep the glock around if the
> >>>> inode is evicted while there is a linked gfs2_bufdata sitting on some
> >>>> AIL list?
> >>>>
> >>>> Let me know if any more specific information is needed since I now have
> >>>> a test setup that can usually reproduce this within 10 minutes.
> >>>
> >>> Very interesting.
> >>>
> >>> It's not unusual for glocks to outlive their inodes, but I'm not sure
> >>> what the right answer is in this case. Either the revoke should
> >>> take an additional reference to the glock, and not let go until the
> >>> revoke is written by some log flush, or else the evict needs to do the
> >>> log flush to make sure the revoke is committed. But we've had issues with
> >>> evict in the past, so we need to be careful about how we fix it.
> >>> Andreas and I will look into the best way to fix it. Thanks again for your help.
> >>
> >> Usually, glocks are attached to inodes or other objects. When the
> >> state of the glock changes, the go_lock, go_sync, and go_inval
> >> operations determine what happens to the attached object. In
> >> gfs2_evict_inode, we disassociate the inode from the glock, do the
> >> necessary cleanup work locally, and put the glock asynchronously when
> >> needed, though. We do that in PF_MEMALLOC context to avoid
> >> deadlocking; see commit 71c1b2136835 ("gfs2: gfs2_evict_inode: Put
> >> glocks asynchronously"). Even if we didn't do that, there could still
> >> be other glock holders like the glock state machine. There is no
> >> guarantee that the glock will go away anytime soon, or actually at
> >> all: it could get reused by another instance of the same inode. So
> >> waiting for the glock to go away first in gfs2_evict_inode definitely
> >> is not an option.
> >>
> >> This basically answers your above question: gfs2_evict_inode should
> >> definitely clean up before putting the glock. I'm just not sure how to
> >> best achieve that, yet.
> >
> > after more discussions, Bob convinced me that it makes perfect sense
> > to not write out outstanding revokes in gfs2_evict_inode. We really
> > want to delay writing revokes as long as possible so that as many of
> > the revokes as possible will go away (either because the blocks are
> > re-added to the journal and the revokes are "unrevoked", or because
> > the journal wraps around). So the right fix here is indeed taking
> > additional glock references.
> >
> > I've come up with a patch that takes one additional reference for each
> > glock instead of one reference for each bufdata object; that should
> > scale better. I'll post that and a follow-up patch separately.
> >
> > Could you please verify?
> >
>
> I've tested the patches and no longer see the use-after-free.

Ok great, we'll line the fix up for the next merge window.

Thanks,
Andreas



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread (newest message: 2019-04-09 15:41 UTC)

Thread overview: 21+ messages
2019-01-31 10:55 [Cluster-devel] [PATCH 0/2] GFS2 counting fixes Ross Lagerwall
2019-01-31 10:55 ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
2019-01-31 11:23   ` Steven Whitehouse
2019-01-31 14:40   ` Bob Peterson
2019-01-31 17:18   ` Andreas Gruenbacher
2019-02-01  9:23     ` Ross Lagerwall
2019-02-01 14:34       ` Bob Peterson
2019-02-01 14:51       ` Bob Peterson
2019-02-01 15:03       ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free (Another debug patch) Bob Peterson
2019-03-26 18:49     ` [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free Ross Lagerwall
2019-03-26 19:14       ` Bob Peterson
2019-04-01 22:59         ` Andreas Gruenbacher
2019-04-05 17:50           ` Andreas Gruenbacher
2019-04-09 15:36             ` Ross Lagerwall
2019-04-09 15:41               ` Andreas Gruenbacher
2019-01-31 10:55 ` [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative Ross Lagerwall
2019-01-31 11:21   ` Steven Whitehouse
2019-01-31 14:36   ` Bob Peterson
2019-01-31 15:04     ` Bob Peterson
2019-01-31 15:23     ` Ross Lagerwall
2019-01-31 18:32   ` Andreas Gruenbacher
