* [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Avoid inode shrinker-related deadlocks
@ 2015-11-19 18:42 Bob Peterson
  2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put Bob Peterson
  2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Revert 35e478f Flush pending glock work when evicting an inode Bob Peterson
  0 siblings, 2 replies; 18+ messages in thread
From: Bob Peterson @ 2015-11-19 18:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

This set of two patches is conceptually simple, even though the diff is
somewhat big.

The problem is that GFS2 can deadlock when the inode shrinker runs.

The failing scenario goes like this:
The inode shrinker tells GFS2 to evict inodes, but in order to evict them,
the evict code needs to unlock their glocks, and it calls DLM to do so.
In this particular scenario, DLM can't unlock the glocks because it's
blocked waiting for a pending fence operation to complete. The fence
operation can't complete because it's blocked waiting to allocate memory.
The memory allocation can't complete because it's waiting for the
shrinker, which, of course, is waiting for GFS2 to evict inodes.

The problem is that it's unsafe for GFS2's evict code to make a call into
DLM. The solution is to queue the final glock put to the glock state
machine. Since the glocks outlive the inode anyway, it's safe for the
evict code to continue and return successfully back to the shrinker,
which can then continue running and satisfy the memory needed for the
fence operation, and so forth.
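The queue-the-final-put pattern just described can be sketched in plain
userspace C. This is a simplified model with hypothetical stand-ins for the
glock, queue_delayed_work() and glock_work_func() — not the real kernel API:

```c
#include <assert.h>   /* for sanity checks by callers of this model */
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a glock. */
struct glock {
	int refcount;      /* models gl->gl_lockref.count */
	bool work_queued;  /* models pending glock delayed work */
};

/* Models queue_delayed_work(): returns 0 when work is already queued,
 * otherwise queues it.  By convention, on success the caller's
 * reference is handed over to the queued work. */
static int queue_glock_work(struct glock *gl)
{
	if (gl->work_queued)
		return 0;
	gl->work_queued = true;
	return 1;
}

/* Models glock_work_func() running later, in workqueue context: the
 * reference drop that may call into DLM happens here, not in evict. */
static void glock_work_run(struct glock *gl)
{
	gl->work_queued = false;
	gl->refcount--;
}

/* The evict-time put from the patch: queue the final put.  Only if
 * queuing fails (work already pending, which holds its own reference)
 * is the reference dropped directly, on the theory that it then
 * cannot be the final one. */
static void evict_put(struct glock *gl)
{
	if (queue_glock_work(gl) == 0)
		gl->refcount--;
}
```

Whether the direct put in the queuing-failed case is really safe is
questioned later in this thread.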

The second patch reverts commit 35e478f, which made the evict code wait
for outstanding work on the glock state machine. That wait is unsafe for
the same reason: it can block on the state machine, which can block on
DLM. With the first patch applied, the revert should be safe, for the
reasons stated in that patch.

Bob Peterson (2):
  GFS2: Make gfs2_clear_inode() queue the final put
  GFS2: Revert 35e478f Flush pending glock work when evicting an inode

 fs/gfs2/glock.c | 34 +++++++++++++++++-----------------
 fs/gfs2/glock.h |  1 +
 fs/gfs2/super.c |  6 ++++--
 3 files changed, 22 insertions(+), 19 deletions(-)

-- 
2.5.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-11-19 18:42 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Avoid inode shrinker-related deadlocks Bob Peterson
@ 2015-11-19 18:42 ` Bob Peterson
  2015-11-20 13:33   ` Steven Whitehouse
  2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Revert 35e478f Flush pending glock work when evicting an inode Bob Peterson
  1 sibling, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2015-11-19 18:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put() directly most of the time, it queues
the glock to the delayed work queue. That avoids a possible
deadlock where it calls dlm during a fence operation:
dlm waits for a fence operation, the fence operation waits for
memory, the shrinker waits for gfs2 to free an inode from memory,
but gfs2 waits for dlm.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c | 34 +++++++++++++++++-----------------
 fs/gfs2/glock.h |  1 +
 fs/gfs2/super.c |  5 ++++-
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 68484ef..53fedbb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -62,7 +62,7 @@ typedef void (*glock_examiner) (struct gfs2_glock * gl);
 static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);
 
 static struct dentry *gfs2_root;
-static struct workqueue_struct *glock_workqueue;
+struct workqueue_struct *gfs2_glock_workqueue;
 struct workqueue_struct *gfs2_delete_workqueue;
 static LIST_HEAD(lru_list);
 static atomic_t lru_count = ATOMIC_INIT(0);
@@ -481,7 +481,7 @@ __acquires(&gl->gl_lockref.lock)
 		}
 	} else { /* lock_nolock */
 		finish_xmote(gl, target);
-		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
+		if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0)
 			gfs2_glock_put(gl);
 	}
 
@@ -554,7 +554,7 @@ out_sched:
 	clear_bit(GLF_LOCK, &gl->gl_flags);
 	smp_mb__after_atomic();
 	gl->gl_lockref.count++;
-	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
+	if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0)
 		gl->gl_lockref.count--;
 	return;
 
@@ -618,7 +618,7 @@ static void glock_work_func(struct work_struct *work)
 	else {
 		if (gl->gl_name.ln_type != LM_TYPE_INODE)
 			delay = 0;
-		if (queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0)
+		if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, delay) == 0)
 			gfs2_glock_put(gl);
 	}
 	if (drop_ref)
@@ -973,7 +973,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 		     test_and_clear_bit(GLF_FROZEN, &gl->gl_flags))) {
 		set_bit(GLF_REPLY_PENDING, &gl->gl_flags);
 		gl->gl_lockref.count++;
-		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
+		if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0)
 			gl->gl_lockref.count--;
 	}
 	run_queue(gl, 1);
@@ -1042,7 +1042,7 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 	    !test_bit(GLF_DEMOTE, &gl->gl_flags) &&
 	    gl->gl_name.ln_type == LM_TYPE_INODE)
 		delay = gl->gl_hold_time;
-	if (queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0)
+	if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, delay) == 0)
 		gfs2_glock_put(gl);
 }
 
@@ -1220,7 +1220,7 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
 	spin_lock(&gl->gl_lockref.lock);
 	handle_callback(gl, state, delay, true);
 	spin_unlock(&gl->gl_lockref.lock);
-	if (queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0)
+	if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, delay) == 0)
 		gfs2_glock_put(gl);
 }
 
@@ -1282,7 +1282,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
 	set_bit(GLF_REPLY_PENDING, &gl->gl_flags);
 	spin_unlock(&gl->gl_lockref.lock);
 
-	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
+	if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0)
 		gfs2_glock_put(gl);
 }
 
@@ -1341,7 +1341,7 @@ add_back_to_lru:
 		if (demote_ok(gl))
 			handle_callback(gl, LM_ST_UNLOCKED, 0, false);
 		WARN_ON(!test_and_clear_bit(GLF_LOCK, &gl->gl_flags));
-		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
+		if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0)
 			gl->gl_lockref.count--;
 		spin_unlock(&gl->gl_lockref.lock);
 		cond_resched_lock(&lru_lock);
@@ -1445,7 +1445,7 @@ static void thaw_glock(struct gfs2_glock *gl)
 	if (!test_and_clear_bit(GLF_FROZEN, &gl->gl_flags))
 		goto out;
 	set_bit(GLF_REPLY_PENDING, &gl->gl_flags);
-	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0) {
+	if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0) {
 out:
 		gfs2_glock_put(gl);
 	}
@@ -1465,7 +1465,7 @@ static void clear_glock(struct gfs2_glock *gl)
 	if (gl->gl_state != LM_ST_UNLOCKED)
 		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
 	spin_unlock(&gl->gl_lockref.lock);
-	if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
+	if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0)
 		gfs2_glock_put(gl);
 }
 
@@ -1503,9 +1503,9 @@ static void dump_glock_func(struct gfs2_glock *gl)
 void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
 {
 	set_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags);
-	flush_workqueue(glock_workqueue);
+	flush_workqueue(gfs2_glock_workqueue);
 	glock_hash_walk(clear_glock, sdp);
-	flush_workqueue(glock_workqueue);
+	flush_workqueue(gfs2_glock_workqueue);
 	wait_event(sdp->sd_glock_wait, atomic_read(&sdp->sd_glock_disposal) == 0);
 	glock_hash_walk(dump_glock_func, sdp);
 }
@@ -1756,9 +1756,9 @@ int __init gfs2_glock_init(void)
 	if (ret < 0)
 		return ret;
 
-	glock_workqueue = alloc_workqueue("glock_workqueue", WQ_MEM_RECLAIM |
+	gfs2_glock_workqueue = alloc_workqueue("glock_workqueue", WQ_MEM_RECLAIM |
 					  WQ_HIGHPRI | WQ_FREEZABLE, 0);
-	if (!glock_workqueue) {
+	if (!gfs2_glock_workqueue) {
 		rhashtable_destroy(&gl_hash_table);
 		return -ENOMEM;
 	}
@@ -1766,7 +1766,7 @@ int __init gfs2_glock_init(void)
 						WQ_MEM_RECLAIM | WQ_FREEZABLE,
 						0);
 	if (!gfs2_delete_workqueue) {
-		destroy_workqueue(glock_workqueue);
+		destroy_workqueue(gfs2_glock_workqueue);
 		rhashtable_destroy(&gl_hash_table);
 		return -ENOMEM;
 	}
@@ -1780,7 +1780,7 @@ void gfs2_glock_exit(void)
 {
 	unregister_shrinker(&glock_shrinker);
 	rhashtable_destroy(&gl_hash_table);
-	destroy_workqueue(glock_workqueue);
+	destroy_workqueue(gfs2_glock_workqueue);
 	destroy_workqueue(gfs2_delete_workqueue);
 }
 
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 46ab67f..e2f80ca 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -134,6 +134,7 @@ struct lm_lockops {
 	const match_table_t *lm_tokens;
 };
 
+extern struct workqueue_struct *gfs2_glock_workqueue;
 extern struct workqueue_struct *gfs2_delete_workqueue;
 static inline struct gfs2_holder *gfs2_glock_is_locked_by_me(struct gfs2_glock *gl)
 {
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 9d5c3f7..46e5004 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -24,6 +24,7 @@
 #include <linux/crc32.h>
 #include <linux/time.h>
 #include <linux/wait.h>
+#include <linux/workqueue.h>
 #include <linux/writeback.h>
 #include <linux/backing-dev.h>
 #include <linux/kernel.h>
@@ -1614,7 +1615,9 @@ out:
 	ip->i_gl->gl_object = NULL;
 	flush_delayed_work(&ip->i_gl->gl_work);
 	gfs2_glock_add_to_lru(ip->i_gl);
-	gfs2_glock_put(ip->i_gl);
+	if (queue_delayed_work(gfs2_glock_workqueue,
+			       &ip->i_gl->gl_work, 0) == 0)
+		gfs2_glock_put(ip->i_gl);
 	ip->i_gl = NULL;
 	if (ip->i_iopen_gh.gh_gl) {
 		ip->i_iopen_gh.gh_gl->gl_object = NULL;
-- 
2.5.0




* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Revert 35e478f Flush pending glock work when evicting an inode
  2015-11-19 18:42 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Avoid inode shrinker-related deadlocks Bob Peterson
  2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put Bob Peterson
@ 2015-11-19 18:42 ` Bob Peterson
  2015-11-20 13:47   ` Steven Whitehouse
  1 sibling, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2015-11-19 18:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Commit 35e478f was added because there were cases where
the inode's glock still had pending delayed work.
The theory is that the delayed work referenced the inode after it
had been deleted from memory by gfs2_evict_inode during its call to
filemap_fdatawrite. I'm guessing that's wrong. I'm guessing that
the delayed work was referencing the glock after it had been
freed due to the call to gfs2_glock_put() in gfs2_evict_inode().
The previous patch made the last put be done from the delayed
work queue, so that should make 35e478f obsolete. The other thing
to consider is that vfs evict() calls invalidate_inode_buffers
before calling the file system-specific gfs2_evict_inode(), and
therefore we should have no dirty buffers for the inode itself.
(We might have dirty buffers for the glock address space, though).

The reason why the call to flush_delayed_work can't be done here
is that it can block, waiting like so:

kernel: Call Trace:
kernel: [<ffffffff815395f5>] schedule_timeout+0x215/0x2e0
kernel: [<ffffffff81141f1a>] ? shrink_inactive_list+0x53a/0x830
kernel: [<ffffffff81539273>] wait_for_common+0x123/0x180
kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
kernel: [<ffffffff8153938d>] wait_for_completion+0x1d/0x20
kernel: [<ffffffff8109b3e7>] flush_work+0x77/0xc0
kernel: [<ffffffff8109ac50>] ? wq_barrier_func+0x0/0x20
kernel: [<ffffffff8109b604>] flush_delayed_work+0x54/0x70
kernel: [<ffffffffa05e3552>] gfs2_clear_inode+0x62/0xf0 [gfs2]

The delayed work can depend on actions from DLM, and DLM can
hang indefinitely on a fencing action.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 46e5004..4a9b090 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1613,7 +1613,6 @@ out:
 	clear_inode(inode);
 	gfs2_dir_hash_inval(ip);
 	ip->i_gl->gl_object = NULL;
-	flush_delayed_work(&ip->i_gl->gl_work);
 	gfs2_glock_add_to_lru(ip->i_gl);
 	if (queue_delayed_work(gfs2_glock_workqueue,
 			       &ip->i_gl->gl_work, 0) == 0)
-- 
2.5.0




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put Bob Peterson
@ 2015-11-20 13:33   ` Steven Whitehouse
  2015-11-25 14:22     ` Bob Peterson
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2015-11-20 13:33 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 19/11/15 18:42, Bob Peterson wrote:
> This patch changes function gfs2_clear_inode() so that instead
> of calling gfs2_glock_put() directly most of the time, it queues
> the glock to the delayed work queue. That avoids a possible
> deadlock where it calls dlm during a fence operation:
> dlm waits for a fence operation, the fence operation waits for
> memory, the shrinker waits for gfs2 to free an inode from memory,
> but gfs2 waits for dlm.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/glock.c | 34 +++++++++++++++++-----------------
>   fs/gfs2/glock.h |  1 +
>   fs/gfs2/super.c |  5 ++++-
>   3 files changed, 22 insertions(+), 18 deletions(-)
[snip]
Most of the patch seems to just rename the workqueue, which makes it
tricky to spot the other changes. However, the code below seems to be
the new bit...

> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index 9d5c3f7..46e5004 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -24,6 +24,7 @@
>   #include <linux/crc32.h>
>   #include <linux/time.h>
>   #include <linux/wait.h>
> +#include <linux/workqueue.h>
>   #include <linux/writeback.h>
>   #include <linux/backing-dev.h>
>   #include <linux/kernel.h>
> @@ -1614,7 +1615,9 @@ out:
>   	ip->i_gl->gl_object = NULL;
>   	flush_delayed_work(&ip->i_gl->gl_work);
>   	gfs2_glock_add_to_lru(ip->i_gl);
> -	gfs2_glock_put(ip->i_gl);
> +	if (queue_delayed_work(gfs2_glock_workqueue,
> +			       &ip->i_gl->gl_work, 0) == 0)
> +		gfs2_glock_put(ip->i_gl);
>   	ip->i_gl = NULL;
>   	if (ip->i_iopen_gh.gh_gl) {
>   		ip->i_iopen_gh.gh_gl->gl_object = NULL;

which replaces a put with a queue & put if the queue fails (due to it
being already on the queue), which doesn't look quite right to me, since
if calling gfs2_glock_put() was not safe before, then calling it
conditionally like this is still no safer, I think?

Steve.




* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Revert 35e478f Flush pending glock work when evicting an inode
  2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Revert 35e478f Flush pending glock work when evicting an inode Bob Peterson
@ 2015-11-20 13:47   ` Steven Whitehouse
  2015-11-25 14:36     ` Bob Peterson
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2015-11-20 13:47 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 19/11/15 18:42, Bob Peterson wrote:
> Commit 35e478f was added because there were cases where
> the inode's glock still had pending delayed work.
> The theory is that the delayed work referenced the inode after it
> had been deleted from memory by gfs2_evict_inode during its call to
> filemap_fdatawrite. I'm guessing that's wrong. I'm guessing that
> the delayed work was referencing the glock after it had been
> freed due to the call to gfs2_glock_put() in gfs2_evict_inode().
> The previous patch made the last put be done from the delayed
> work queue, so that should make 35e478f obsolete. The other thing
> to consider is that vfs evict() calls invalidate_inode_buffers
> before calling the file system-specific gfs2_evict_inode(), and
> therefore we should have no dirty buffers for the inode itself.
> (We might have dirty buffers for the glock address space, though).
Not only are there likely to be dirty buffers still in the metadata
address space; that is in fact a requirement for good performance when
unlinking lots of inodes.

> The reason why the call to flush_delayed_work can't be done here
> is that it can block, waiting like so:
>
> kernel: Call Trace:
> kernel: [<ffffffff815395f5>] schedule_timeout+0x215/0x2e0
> kernel: [<ffffffff81141f1a>] ? shrink_inactive_list+0x53a/0x830
> kernel: [<ffffffff81539273>] wait_for_common+0x123/0x180
> kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
> kernel: [<ffffffff8153938d>] wait_for_completion+0x1d/0x20
> kernel: [<ffffffff8109b3e7>] flush_work+0x77/0xc0
> kernel: [<ffffffff8109ac50>] ? wq_barrier_func+0x0/0x20
> kernel: [<ffffffff8109b604>] flush_delayed_work+0x54/0x70
> kernel: [<ffffffffa05e3552>] gfs2_clear_inode+0x62/0xf0 [gfs2]
>
> The delayed work can depend on actions from DLM, and DLM can
> hang indefinitely on a fencing action.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/super.c | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index 46e5004..4a9b090 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -1613,7 +1613,6 @@ out:
>   	clear_inode(inode);
>   	gfs2_dir_hash_inval(ip);
>   	ip->i_gl->gl_object = NULL;
> -	flush_delayed_work(&ip->i_gl->gl_work);
>   	gfs2_glock_add_to_lru(ip->i_gl);
>   	if (queue_delayed_work(gfs2_glock_workqueue,
>   			       &ip->i_gl->gl_work, 0) == 0)

I'm not convinced that it is safe to do this without finding some other
way to flush the workqueue. All glock work items on the queue hold a
reference to the glock, so it should not be possible to have a case
where the glock is referenced by work queue activity after it has been
freed - unless I've misunderstood the explanation above,

Steve.




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-11-20 13:33   ` Steven Whitehouse
@ 2015-11-25 14:22     ` Bob Peterson
  2015-11-25 14:26       ` Steven Whitehouse
  0 siblings, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2015-11-25 14:22 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Hi,
> 
> On 19/11/15 18:42, Bob Peterson wrote:
> > This patch changes function gfs2_clear_inode() so that instead
> > of calling gfs2_glock_put() directly most of the time, it queues
> > the glock to the delayed work queue. That avoids a possible
> > deadlock where it calls dlm during a fence operation:
> > dlm waits for a fence operation, the fence operation waits for
> > memory, the shrinker waits for gfs2 to free an inode from memory,
> > but gfs2 waits for dlm.
> >
> > Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> > ---
> >   fs/gfs2/glock.c | 34 +++++++++++++++++-----------------
> >   fs/gfs2/glock.h |  1 +
> >   fs/gfs2/super.c |  5 ++++-
> >   3 files changed, 22 insertions(+), 18 deletions(-)
> [snip]
> Most of the patch seems to just rename the workqueue which makes it
> tricky to spot the other changes. However, the below code seems to be
> the new bit..
> 
> > diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> > index 9d5c3f7..46e5004 100644
> > --- a/fs/gfs2/super.c
> > +++ b/fs/gfs2/super.c
> > @@ -24,6 +24,7 @@
> >   #include <linux/crc32.h>
> >   #include <linux/time.h>
> >   #include <linux/wait.h>
> > +#include <linux/workqueue.h>
> >   #include <linux/writeback.h>
> >   #include <linux/backing-dev.h>
> >   #include <linux/kernel.h>
> > @@ -1614,7 +1615,9 @@ out:
> >   	ip->i_gl->gl_object = NULL;
> >   	flush_delayed_work(&ip->i_gl->gl_work);
> >   	gfs2_glock_add_to_lru(ip->i_gl);
> > -	gfs2_glock_put(ip->i_gl);
> > +	if (queue_delayed_work(gfs2_glock_workqueue,
> > +			       &ip->i_gl->gl_work, 0) == 0)
> > +		gfs2_glock_put(ip->i_gl);
> >   	ip->i_gl = NULL;
> >   	if (ip->i_iopen_gh.gh_gl) {
> >   		ip->i_iopen_gh.gh_gl->gl_object = NULL;
> 
> which replaces a put with a queue & put if the queue fails (due to it
> being already on the queue) which doesn't look quite right to be since
> if calling gfs2_glock_put() was not safe before, then calling it
> conditionally like this is still no safer I think?
> 
> Steve.

Hi,

The call to gfs2_glock_put() in this case should be safe.

If queuing the delayed work fails, it means the glock reference count is
greater than 1, to be decremented when the glock state machine runs.
Which means this can't be the final glock_put().
Which means we can't possibly call into DLM, which means we can't block.
Which means it's safe.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-11-25 14:22     ` Bob Peterson
@ 2015-11-25 14:26       ` Steven Whitehouse
  2015-12-01 15:42         ` Bob Peterson
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2015-11-25 14:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 25/11/15 14:22, Bob Peterson wrote:
> ----- Original Message -----
>> Hi,
>>
>> On 19/11/15 18:42, Bob Peterson wrote:
>>> This patch changes function gfs2_clear_inode() so that instead
>>> of calling gfs2_glock_put() directly most of the time, it queues
>>> the glock to the delayed work queue. That avoids a possible
>>> deadlock where it calls dlm during a fence operation:
>>> dlm waits for a fence operation, the fence operation waits for
>>> memory, the shrinker waits for gfs2 to free an inode from memory,
>>> but gfs2 waits for dlm.
>>>
>>> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
>>> ---
>>>    fs/gfs2/glock.c | 34 +++++++++++++++++-----------------
>>>    fs/gfs2/glock.h |  1 +
>>>    fs/gfs2/super.c |  5 ++++-
>>>    3 files changed, 22 insertions(+), 18 deletions(-)
>> [snip]
>> Most of the patch seems to just rename the workqueue which makes it
>> tricky to spot the other changes. However, the below code seems to be
>> the new bit..
>>
>>> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
>>> index 9d5c3f7..46e5004 100644
>>> --- a/fs/gfs2/super.c
>>> +++ b/fs/gfs2/super.c
>>> @@ -24,6 +24,7 @@
>>>    #include <linux/crc32.h>
>>>    #include <linux/time.h>
>>>    #include <linux/wait.h>
>>> +#include <linux/workqueue.h>
>>>    #include <linux/writeback.h>
>>>    #include <linux/backing-dev.h>
>>>    #include <linux/kernel.h>
>>> @@ -1614,7 +1615,9 @@ out:
>>>    	ip->i_gl->gl_object = NULL;
>>>    	flush_delayed_work(&ip->i_gl->gl_work);
>>>    	gfs2_glock_add_to_lru(ip->i_gl);
>>> -	gfs2_glock_put(ip->i_gl);
>>> +	if (queue_delayed_work(gfs2_glock_workqueue,
>>> +			       &ip->i_gl->gl_work, 0) == 0)
>>> +		gfs2_glock_put(ip->i_gl);
>>>    	ip->i_gl = NULL;
>>>    	if (ip->i_iopen_gh.gh_gl) {
>>>    		ip->i_iopen_gh.gh_gl->gl_object = NULL;
>> which replaces a put with a queue & put if the queue fails (due to it
>> being already on the queue) which doesn't look quite right to me, since
>> if calling gfs2_glock_put() was not safe before, then calling it
>> conditionally like this is still no safer I think?
>>
>> Steve.
> Hi,
>
> The call to gfs2_glock_put() in this case should be safe.
>
> If queuing the delayed work fails, it means the glock reference count is
> greater than 1, to be decremented when the glock state machine runs.
> Which means this can't be the final glock_put().
> Which means we can't possibly call into DLM, which means we can't block.
> Which means it's safe.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

There is no reason that this cannot be the final glock put, since there 
is no synchronization with the work that has been queued, so it might 
well have run and decremented the ref count before we return from the 
queuing function. It is unlikely that will be the case, but it is still 
possible,

Steve.
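The race Steve describes can be made concrete with a sequential sketch in
plain C (a simplified model with hypothetical helper names, not the kernel
API): nothing synchronizes the failed queue attempt with the already-queued
work, so the work can run and drop its reference in the window before the
direct put, which then turns out to be the final one after all.

```c
#include <assert.h>   /* for sanity checks by callers of this model */
#include <stdbool.h>

/* Simplified sequential model of a glock; not the real kernel API. */
struct glock {
	int refcount;
	bool work_queued;
};

/* Models queue_delayed_work(): returns 0 when work is already queued
 * (the case under discussion). */
static int queue_glock_work(struct glock *gl)
{
	if (gl->work_queued)
		return 0;
	gl->work_queued = true;
	return 1;
}

/* The already-queued work runs on another CPU at any time, dropping
 * the reference it holds. */
static void glock_work_run(struct glock *gl)
{
	gl->work_queued = false;
	gl->refcount--;
}

/* Returns true when this put dropped the last reference -- the final
 * put that may call into DLM and block. */
static bool glock_put(struct glock *gl)
{
	return --gl->refcount == 0;
}
```

The problematic interleaving: evict sees the queue attempt fail and
concludes another reference is pending, but the pending work runs right
after that check, leaving evict's own put as the last one.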




* [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Revert 35e478f Flush pending glock work when evicting an inode
  2015-11-20 13:47   ` Steven Whitehouse
@ 2015-11-25 14:36     ` Bob Peterson
  0 siblings, 0 replies; 18+ messages in thread
From: Bob Peterson @ 2015-11-25 14:36 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

----- Original Message -----
> Hi,
> 
> On 19/11/15 18:42, Bob Peterson wrote:
> > Commit 35e478f was added because there were cases where
> > the inode's glock still had pending delayed work.
> > The theory is that the delayed work referenced the inode after it
> > had been deleted from memory by gfs2_evict_inode during its call to
> > filemap_fdatawrite. I'm guessing that's wrong. I'm guessing that
> > the delayed work was referencing the glock after it had been
> > freed due to the call to gfs2_glock_put() in gfs2_evict_inode().
> > The previous patch made the last put be done from the delayed
> > work queue, so that should make 35e478f obsolete. The other thing
> > to consider is that vfs evict() calls invalidate_inode_buffers
> > before calling the file system-specific gfs2_evict_inode(), and
> > therefore we should have no dirty buffers for the inode itself.
> > (We might have dirty buffers for the glock address space, though).
> Not only are there likely to be dirty buffers still in the metadata
> address space; that is in fact a requirement for good performance when
> unlinking lots of inodes.
> 
> > The reason why the call to flush_delayed_work can't be done here
> > is that it can block, waiting like so:
> >
> > kernel: Call Trace:
> > kernel: [<ffffffff815395f5>] schedule_timeout+0x215/0x2e0
> > kernel: [<ffffffff81141f1a>] ? shrink_inactive_list+0x53a/0x830
> > kernel: [<ffffffff81539273>] wait_for_common+0x123/0x180
> > kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
> > kernel: [<ffffffff8153938d>] wait_for_completion+0x1d/0x20
> > kernel: [<ffffffff8109b3e7>] flush_work+0x77/0xc0
> > kernel: [<ffffffff8109ac50>] ? wq_barrier_func+0x0/0x20
> > kernel: [<ffffffff8109b604>] flush_delayed_work+0x54/0x70
> > kernel: [<ffffffffa05e3552>] gfs2_clear_inode+0x62/0xf0 [gfs2]
> >
> > The delayed work can depend on actions from DLM, and DLM can
> > hang indefinitely on a fencing action.
> >
> > Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> > ---
> >   fs/gfs2/super.c | 1 -
> >   1 file changed, 1 deletion(-)
> >
> > diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> > index 46e5004..4a9b090 100644
> > --- a/fs/gfs2/super.c
> > +++ b/fs/gfs2/super.c
> > @@ -1613,7 +1613,6 @@ out:
> >   	clear_inode(inode);
> >   	gfs2_dir_hash_inval(ip);
> >   	ip->i_gl->gl_object = NULL;
> > -	flush_delayed_work(&ip->i_gl->gl_work);
> >   	gfs2_glock_add_to_lru(ip->i_gl);
> >   	if (queue_delayed_work(gfs2_glock_workqueue,
> >   			       &ip->i_gl->gl_work, 0) == 0)
> 
> I'm not convinced that it is safe to do this, without finding some other
> way to flush the workqueue. All glock work items on the queue hold a ref
> count to the glock, so it should not be possible to have a case where
> the glock is referenced by work queue activity after it has been freed
> - unless I've misunderstood the explanation above,
> 
> Steve.

I'm not convinced it's 100% safe either; however, there are two things to consider:

1. With the previous patch, we're queueing the final glock put, so when the
   final put is done, the glock delayed work should be done, right?

   By the time the final delayed work is run, gl_object is set to NULL
   (immediately above) so the state machine should not reference the inode
   being deleted.
2. I've asked Barry to re-test the failing scenario with specsfs 2008, and
   he's done so several times and not recreated the problem.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-11-25 14:26       ` Steven Whitehouse
@ 2015-12-01 15:42         ` Bob Peterson
  2015-12-02 10:23           ` Steven Whitehouse
  0 siblings, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2015-12-01 15:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Hi,
> 
> On 25/11/15 14:22, Bob Peterson wrote:
> > ----- Original Message -----
> >> Hi,
> >>
> >> On 19/11/15 18:42, Bob Peterson wrote:
> >>> This patch changes function gfs2_clear_inode() so that instead
> >>> of calling gfs2_glock_put() directly most of the time, it queues
> >>> the glock to the delayed work queue. That avoids a possible
> >>> deadlock where it calls dlm during a fence operation:
> >>> dlm waits for a fence operation, the fence operation waits for
> >>> memory, the shrinker waits for gfs2 to free an inode from memory,
> >>> but gfs2 waits for dlm.
> >>>
> >>> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> >>> ---
> >>>    fs/gfs2/glock.c | 34 +++++++++++++++++-----------------
> >>>    fs/gfs2/glock.h |  1 +
> >>>    fs/gfs2/super.c |  5 ++++-
> >>>    3 files changed, 22 insertions(+), 18 deletions(-)
> >> [snip]
> >> Most of the patch seems to just rename the workqueue which makes it
> >> tricky to spot the other changes. However, the below code seems to be
> >> the new bit..
> >>
> >>> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> >>> index 9d5c3f7..46e5004 100644
> >>> --- a/fs/gfs2/super.c
> >>> +++ b/fs/gfs2/super.c
> >>> @@ -24,6 +24,7 @@
> >>>    #include <linux/crc32.h>
> >>>    #include <linux/time.h>
> >>>    #include <linux/wait.h>
> >>> +#include <linux/workqueue.h>
> >>>    #include <linux/writeback.h>
> >>>    #include <linux/backing-dev.h>
> >>>    #include <linux/kernel.h>
> >>> @@ -1614,7 +1615,9 @@ out:
> >>>    	ip->i_gl->gl_object = NULL;
> >>>    	flush_delayed_work(&ip->i_gl->gl_work);
> >>>    	gfs2_glock_add_to_lru(ip->i_gl);
> >>> -	gfs2_glock_put(ip->i_gl);
> >>> +	if (queue_delayed_work(gfs2_glock_workqueue,
> >>> +			       &ip->i_gl->gl_work, 0) == 0)
> >>> +		gfs2_glock_put(ip->i_gl);
> >>>    	ip->i_gl = NULL;
> >>>    	if (ip->i_iopen_gh.gh_gl) {
> >>>    		ip->i_iopen_gh.gh_gl->gl_object = NULL;
> >> which replaces a put with a queue & put if the queue fails (due to it
> >> being already on the queue) which doesn't look quite right to me, since
> >> if calling gfs2_glock_put() was not safe before, then calling it
> >> conditionally like this is still no safer I think?
> >>
> >> Steve.
> > Hi,
> >
> > The call to gfs2_glock_put() in this case should be safe.
> >
> > If queuing the delayed work fails, it means the glock reference count is
> > greater than 1, to be decremented when the glock state machine runs.
> > Which means this can't be the final glock_put().
> > Which means we can't possibly call into DLM, which means we can't block.
> > Which means it's safe.
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> 
> There is no reason that this cannot be the final glock put, since there
> is no synchronization with the work that has been queued, so it might
> well have run and decremented the ref count before we return from the
> queuing function. It is unlikely that will be the case, but it is still
> possible,
> 
> Steve.
> 

Hi Steve,

It's kind of an ugly hack, but can we do something like the patch below instead?

Regards,

Bob Peterson
Red Hat File Systems
---
commit 1949050b4b13c1b32ea45987fbf2936ae779609e
Author: Bob Peterson <rpeterso@redhat.com>
Date:   Thu Nov 19 12:06:31 2015 -0600

GFS2: Make gfs2_clear_inode() not block on final glock put

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
function that avoids a possible deadlock that would occur should
it call dlm during a fence operation: dlm waits for a fence
operation, the fence operation waits for memory, the shrinker
waits for gfs2 to free an inode from memory, but gfs2 waits for
dlm. The new non-blocking glock_put does this:

1. It acquires the lockref to ensure no one else is messing with it.
2. If the lockref is put (not locked) it can safely return because
   it is not the last reference to the glock.
3. If this is the last reference, it tries to queue delayed work for
   the glock.
4. If it was able to queue the delayed work, it's safe to return
   because the glock_work_func will run in another process, so
   this one cannot block.
5. If it was unable to queue the delayed work, it needs to schedule
   and start the whole process again.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index a4ff7b5..22870c6 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -178,6 +178,27 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 }
 
 /**
+ * gfs2_glock_put_noblock() - Decrement reference count on glock
+ * @gl: The glock to put
+ *
+ * This is the same as gfs2_glock_put() but it's not allowed to block
+ */
+
+void gfs2_glock_put_noblock(struct gfs2_glock *gl)
+{
+	while (1) {
+		if (lockref_put_or_lock(&gl->gl_lockref))
+			break;
+
+		spin_unlock(&gl->gl_lockref.lock);
+		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) != 0)
+			break;
+
+		cond_resched();
+	}
+}
+
+/**
  * may_grant - check if its ok to grant a new lock
  * @gl: The glock
  * @gh: The lock request which we wish to grant
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 46ab67f..d786446 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -182,6 +182,7 @@ extern int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
 			  const struct gfs2_glock_operations *glops,
 			  int create, struct gfs2_glock **glp);
 extern void gfs2_glock_put(struct gfs2_glock *gl);
+extern void gfs2_glock_put_noblock(struct gfs2_glock *gl);
 extern void gfs2_holder_init(struct gfs2_glock *gl, unsigned int state,
 			     u16 flags, struct gfs2_holder *gh);
 extern void gfs2_holder_reinit(unsigned int state, u16 flags,
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 03fa155..188f2a5 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1613,7 +1613,7 @@ out:
 	ip->i_gl->gl_object = NULL;
 	flush_delayed_work(&ip->i_gl->gl_work);
 	gfs2_glock_add_to_lru(ip->i_gl);
-	gfs2_glock_put(ip->i_gl);
+	gfs2_glock_put_noblock(ip->i_gl);
 	ip->i_gl = NULL;
 	if (ip->i_iopen_gh.gh_gl) {
 		ip->i_iopen_gh.gh_gl->gl_object = NULL;



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-01 15:42         ` Bob Peterson
@ 2015-12-02 10:23           ` Steven Whitehouse
  2015-12-02 16:42             ` Bob Peterson
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2015-12-02 10:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 01/12/15 15:42, Bob Peterson wrote:
> ----- Original Message -----
>> Hi,
>>
>> On 25/11/15 14:22, Bob Peterson wrote:
>>> ----- Original Message -----
>>>> Hi,
>>>>
>>>> On 19/11/15 18:42, Bob Peterson wrote:
>>>>> This patch changes function gfs2_clear_inode() so that instead
>>>>> of calling gfs2_glock_put() directly most of the time, it queues
>>>>> the glock to the delayed work queue. That avoids a possible
>>>>> deadlock where it calls dlm during a fence operation:
>>>>> dlm waits for a fence operation, the fence operation waits for
>>>>> memory, the shrinker waits for gfs2 to free an inode from memory,
>>>>> but gfs2 waits for dlm.
>>>>>
>>>>> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
>>>>> ---
>>>>>     fs/gfs2/glock.c | 34 +++++++++++++++++-----------------
>>>>>     fs/gfs2/glock.h |  1 +
>>>>>     fs/gfs2/super.c |  5 ++++-
>>>>>     3 files changed, 22 insertions(+), 18 deletions(-)
>>>> [snip]
>>>> Most of the patch seems to just rename the workqueue which makes it
>>>> tricky to spot the other changes. However, the below code seems to be
>>>> the new bit..
>>>>
>>>>> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
>>>>> index 9d5c3f7..46e5004 100644
>>>>> --- a/fs/gfs2/super.c
>>>>> +++ b/fs/gfs2/super.c
>>>>> @@ -24,6 +24,7 @@
>>>>>     #include <linux/crc32.h>
>>>>>     #include <linux/time.h>
>>>>>     #include <linux/wait.h>
>>>>> +#include <linux/workqueue.h>
>>>>>     #include <linux/writeback.h>
>>>>>     #include <linux/backing-dev.h>
>>>>>     #include <linux/kernel.h>
>>>>> @@ -1614,7 +1615,9 @@ out:
>>>>>     	ip->i_gl->gl_object = NULL;
>>>>>     	flush_delayed_work(&ip->i_gl->gl_work);
>>>>>     	gfs2_glock_add_to_lru(ip->i_gl);
>>>>> -	gfs2_glock_put(ip->i_gl);
>>>>> +	if (queue_delayed_work(gfs2_glock_workqueue,
>>>>> +			       &ip->i_gl->gl_work, 0) == 0)
>>>>> +		gfs2_glock_put(ip->i_gl);
>>>>>     	ip->i_gl = NULL;
>>>>>     	if (ip->i_iopen_gh.gh_gl) {
>>>>>     		ip->i_iopen_gh.gh_gl->gl_object = NULL;
>>>> which replaces a put with a queue & put if the queue fails (due to it
> >>>> being already on the queue) which doesn't look quite right to me since
>>>> if calling gfs2_glock_put() was not safe before, then calling it
>>>> conditionally like this is still no safer I think?
>>>>
>>>> Steve.
>>> Hi,
>>>
>>> The call to gfs2_glock_put() in this case should be safe.
>>>
>>> If queuing the delayed work fails, it means the glock reference count is
>>> greater than 1, to be decremented when the glock state machine runs.
>>> Which means this can't be the final glock_put().
>>> Which means we can't possibly call into DLM, which means we can't block.
>>> Which means it's safe.
>>>
>>> Regards,
>>>
>>> Bob Peterson
>>> Red Hat File Systems
>> There is no reason that this cannot be the final glock put, since there
>> is no synchronization with the work that has been queued, so it might
>> well have run and decremented the ref count before we return from the
>> queuing function. It is unlikely that will be the case, but it is still
>> possible,
>>
>> Steve.
>>
> Hi Steve,
>
> It's kind of an ugly hack, but can we do something like the patch below instead?
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
> ---
> commit 1949050b4b13c1b32ea45987fbf2936ae779609e
> Author: Bob Peterson <rpeterso@redhat.com>
> Date:   Thu Nov 19 12:06:31 2015 -0600
>
> GFS2: Make gfs2_clear_inode() not block on final glock put
>
> This patch changes function gfs2_clear_inode() so that instead
> of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
> function that avoids a possible deadlock that would occur should
> it call dlm during a fence operation: dlm waits for a fence
> operation, the fence operation waits for memory, the shrinker
> waits for gfs2 to free an inode from memory, but gfs2 waits for
> dlm. The new non-blocking glock_put does this:
>
> 1. It acquires the lockref to ensure no one else is messing with it.
> 2. If the lockref is put (not locked) it can safely return because
>     it is not the last reference to the glock.
> 3. If this is the last reference, it tries to queue delayed work for
>     the glock.
> 4. If it was able to queue the delayed work, it's safe to return
>     because the glock_work_func will run in another process, so
>     this one cannot block.
> 5. If it was unable to queue the delayed work, it needs to schedule
>     and start the whole process again.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index a4ff7b5..22870c6 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -178,6 +178,27 @@ void gfs2_glock_put(struct gfs2_glock *gl)
>   }
>   
>   /**
> + * gfs2_glock_put_noblock() - Decrement reference count on glock
> + * @gl: The glock to put
> + *
> + * This is the same as gfs2_glock_put() but it's not allowed to block
> + */
> +
> +void gfs2_glock_put_noblock(struct gfs2_glock *gl)
> +{
> +	while (1) {
> +		if (lockref_put_or_lock(&gl->gl_lockref))
> +			break;
> +
> +		spin_unlock(&gl->gl_lockref.lock);
That just drops the ref count without doing anything.

> +		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) != 0)
> +			break;
You can't call queue_delayed_work on a glock for which you don't have a 
ref count - it might not exist any more. Please take a look at this 
again and figure out what the problematic cycle of events is, and then 
work out how to avoid that happening in the first place. There is no 
point in replacing one problem with another one, particularly one which 
would likely be very tricky to debug,

Steve.

> +
> +		cond_resched();
> +	}
> +}
> +
> +/**
>    * may_grant - check if its ok to grant a new lock
>    * @gl: The glock
>    * @gh: The lock request which we wish to grant
> diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> index 46ab67f..d786446 100644
> --- a/fs/gfs2/glock.h
> +++ b/fs/gfs2/glock.h
> @@ -182,6 +182,7 @@ extern int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
>   			  const struct gfs2_glock_operations *glops,
>   			  int create, struct gfs2_glock **glp);
>   extern void gfs2_glock_put(struct gfs2_glock *gl);
> +extern void gfs2_glock_put_noblock(struct gfs2_glock *gl);
>   extern void gfs2_holder_init(struct gfs2_glock *gl, unsigned int state,
>   			     u16 flags, struct gfs2_holder *gh);
>   extern void gfs2_holder_reinit(unsigned int state, u16 flags,
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index 03fa155..188f2a5 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -1613,7 +1613,7 @@ out:
>   	ip->i_gl->gl_object = NULL;
>   	flush_delayed_work(&ip->i_gl->gl_work);
>   	gfs2_glock_add_to_lru(ip->i_gl);
> -	gfs2_glock_put(ip->i_gl);
> +	gfs2_glock_put_noblock(ip->i_gl);
>   	ip->i_gl = NULL;
>   	if (ip->i_iopen_gh.gh_gl) {
>   		ip->i_iopen_gh.gh_gl->gl_object = NULL;
>




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-02 10:23           ` Steven Whitehouse
@ 2015-12-02 16:42             ` Bob Peterson
  2015-12-02 17:41               ` Bob Peterson
  2015-12-08  7:57               ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Bob Peterson @ 2015-12-02 16:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
(snip)
> Please take a look at this
> again and figure out what the problematic cycle of events is, and then
> work out how to avoid that happening in the first place. There is no
> point in replacing one problem with another one, particularly one which
> would likely be very tricky to debug,
> 
> Steve.

The problematic cycle of events is well known:
gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
but if it's the very last put, it calls into dlm, which can block,
and that's where we get into trouble.

The livelock goes like this:

1. A fence operation needs memory, so it blocks on memory allocation.
2. Memory allocation blocks on slab shrinker.
3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode.
5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's glock.
6. gfs2_glock_put calls dlm unlock to unlock the glock.
7. dlm blocks on a pending fence operation. Goto 1.

So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody
needs to do the final put and tell dlm to unlock the inode's glock, which is why
I've been trying to queue it to the delayed work queue.

If I can't do that, we're left with few alternatives: Perhaps a new
function of the quota daemon: to run the lru list and call dlm to unlock
any that have a special bit set, but that just seems ugly.

I've thought of some other alternatives, but they seem a lot uglier and
harder to manage. I'll give it some more thought.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-02 16:42             ` Bob Peterson
@ 2015-12-02 17:41               ` Bob Peterson
  2015-12-03 11:18                 ` Steven Whitehouse
  2015-12-08  7:57               ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2015-12-02 17:41 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> ----- Original Message -----
> (snip)
> > Please take a look at this
> > again and figure out what the problematic cycle of events is, and then
> > work out how to avoid that happening in the first place. There is no
> > point in replacing one problem with another one, particularly one which
> > would likely be very tricky to debug,
> > 
> > Steve.
> 
> The problematic cycle of events is well known:
> gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
> but if it's the very last put, it calls into dlm, which can block,
> and that's where we get into trouble.
> 
> The livelock goes like this:
> 
> 1. A fence operation needs memory, so it blocks on memory allocation.
> 2. Memory allocation blocks on slab shrinker.
> 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
> 4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode.
> 5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's
> glock.
> 6. gfs2_glock_put calls dlm unlock to unlock the glock.
> 7. dlm blocks on a pending fence operation. Goto 1.
> 
> So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody
> needs to do the final put and tell dlm to unlock the inode's glock, which is
> why
> I've been trying to queue it to the delayed work queue.
> 
> If I can't do that, we're left with few alternatives: Perhaps a new
> function of the quota daemon: to run the lru list and call dlm to unlock
> any that have a special bit set, but that just seems ugly.
> 
> I've thought of some other alternatives, but they seem a lot uglier and
> harder to manage. I'll give it some more thought.
> 
> Regards,
> 
> Bob Peterson
> Red Hat File Systems
> 

Hi Steve,

Another possibility is to add a new inode work_queue which does the work of
gfs2_clear_inode(). Its only function would be to clear the inode, and
therefore, there should not be any competing work, as there is in the case of
a glock. The gfs2_clear_inode() function would then queue the work and return
to its caller, and therefore it wouldn't block. The work that's been queued
might block on DLM, but it shouldn't matter, in theory, since the shrinker
will complete, which will free up the fence, which will free up dlm, which will
free up everything. What do you think?

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-02 17:41               ` Bob Peterson
@ 2015-12-03 11:18                 ` Steven Whitehouse
  2015-12-04 14:51                   ` Bob Peterson
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2015-12-03 11:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 02/12/15 17:41, Bob Peterson wrote:
> ----- Original Message -----
>> ----- Original Message -----
>> (snip)
>>> Please take a look at this
>>> again and figure out what the problematic cycle of events is, and then
>>> work out how to avoid that happening in the first place. There is no
>>> point in replacing one problem with another one, particularly one which
>>> would likely be very tricky to debug,
>>>
>>> Steve.
>> The problematic cycle of events is well known:
>> gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
>> but if it's the very last put, it calls into dlm, which can block,
>> and that's where we get into trouble.
>>
>> The livelock goes like this:
>>
>> 1. A fence operation needs memory, so it blocks on memory allocation.
>> 2. Memory allocation blocks on slab shrinker.
>> 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
>> 4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode.
>> 5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's
>> glock.
>> 6. gfs2_glock_put calls dlm unlock to unlock the glock.
>> 7. dlm blocks on a pending fence operation. Goto 1.
>>
>> So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody
>> needs to do the final put and tell dlm to unlock the inode's glock, which is
>> why
>> I've been trying to queue it to the delayed work queue.
>>
>> If I can't do that, we're left with few alternatives: Perhaps a new
>> function of the quota daemon: to run the lru list and call dlm to unlock
>> any that have a special bit set, but that just seems ugly.
>>
>> I've thought of some other alternatives, but they seem a lot uglier and
>> harder to manage. I'll give it some more thought.
>>
>> Regards,
>>
>> Bob Peterson
>> Red Hat File Systems
>>
> Hi Steve,
>
> Another possibility is to add a new inode work_queue which does the work of
> gfs2_clear_inode(). Its only function would be to clear the inode, and
> therefore, there should not be any competing work, as there is in the case of
> a glock. The gfs2_clear_inode() function would then queue the work and return
> to its caller, and therefore it wouldn't block. The work that's been queued
> might block on DLM, but it shouldn't matter, in theory, since the shrinker
> will complete, which will free up the fence, which will free up dlm, which will
> free up everything. What do you think?
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems

Well that is a possibility. There is another dependency cycle there 
however, which is the one relating to the inode lifetime, and we need to 
be careful that we don't break that in attempting to fix the memory 
allocation at fence time issue. I know it isn't easy, but we need to be 
very careful to avoid introducing any new races, as they are likely to 
be very tricky to debug in this area.

One thing we could do, is to be more careful about when we take this 
delayed path. We could, for example, add a check to see if we are 
running in the context of the shrinker, and only delay the eviction in 
that specific case. I was wondering whether we could move the decision 
further up the chain too, and avoid evicting inodes with zero link count 
under memory pressure in the first place. The issue of needing to 
allocate memory to evict inodes is likely to be something that might 
affect other filesystems too, so it might be useful as a generic feature.

I'd prefer to avoid adding a new inode workqueue, however there is no 
reason why we could not add a new feature to an existing workqueue by 
adding a new inode or glock flag as appropriate (which will not expand 
the size of the inode) to deal with this case,

Steve.




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-03 11:18                 ` Steven Whitehouse
@ 2015-12-04 14:51                   ` Bob Peterson
  2015-12-04 15:51                     ` David Teigland
  0 siblings, 1 reply; 18+ messages in thread
From: Bob Peterson @ 2015-12-04 14:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

----- Original Message -----
> > Hi Steve,
> >
> > Another possibility is to add a new inode work_queue which does the work of
> > gfs2_clear_inode(). Its only function would be to clear the inode, and
> > therefore, there should not be any competing work, as there is in the case
> > of
> > a glock. The gfs2_clear_inode() function would then queue the work and
> > return
> > to its caller, and therefore it wouldn't block. The work that's been queued
> > might block on DLM, but it shouldn't matter, in theory, since the shrinker
> > will complete, which will free up the fence, which will free up dlm, which
> > will
> > free up everything. What do you think?
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> 
> Well that is a possibility. There is another dependency cycle there
> however, which is the one relating to the inode lifetime, and we need to
> be careful that we don't break that in attempting to fix the memory
> allocation at fence time issue. I know it isn't easy, but we need to be
> very careful to avoid introducing any new races, as they are likely to
> be very tricky to debug in this area.
> 
> One thing we could do, is to be more careful about when we take this
> delayed path. We could, for example, add a check to see if we are
> running in the context of the shrinker, and only delay the eviction in
> that specific case. I was wondering whether we could move the decision
> further up the chain too, and avoid evicting inodes with zero link count
> under memory pressure in the first place. The issue of needing to
> allocate memory to evict inodes is likely to be something that might
> affect other filesystems too, so it might be useful as a generic feature.
> 
> I'd prefer to avoid adding a new inode workqueue, however there is no
> reason why we could not add a new feature to an existing workqueue by
> adding a new inode or glock flag as appropriate (which will not expand
> the size of the inode) to deal with this case,
> 
> Steve.
> 
> 

I tried adding a work queue for the inodes, but it was a disaster and led
to other hangs.

Then I decided it made more sense to just add a new delayed work function
for the final put of the inode glock. That also led to problems because
glocks may be reused for a different inode with the same block address,
due to the fact that the glocks often outlive the inodes. What happened
was that, given enough pressure, the delayed work became backlogged and
the second put couldn't be queued due to a previous queue (with delay==0).

In the end, I decided that you're right: The safest approach is to
isolate this to the specific problem I'm trying to solve: to do our
business the same way, but when it comes to the final close, check if
it's from the fenced process, and if so, queue the final put. That should
mitigate the problem.

Unfortunately, it's not good enough to only do these things when the link
count is zero: Closing any file at any time could recreate the problem,
provided there was a pending fence action.

I'd also prefer if there was a way we could tell whether we were called
from the slab shrinker, but unfortunately, there's no good way to know
without wasting a lot of time traversing the call stack and such.
I did a fair amount of investigating this: there are a couple of bits
in current->flags that I was hoping to leverage: PF_FSTRANS is used by
NFS to detect situations where it mustn't do memory allocations, but
I don't think we can use it.
There's also PF_MEMALLOC, but that's set for the global shrinker, not the
slab shrinker, so I don't think we can use that either.

I investigated how ocfs2 handles this situation: turns out, it doesn't,
so they probably have the same vulnerability.

In doing all this investigation, I learned of another similar possible
problem discovered in ocfs2: If the shrinker tried to ditch an inode from
memory, then the file system calls into DLM, but DLM needs to do some
COMM stuff, and the comm stuff needs to allocate memory for comm buffers.
In that case, the use of superblock flags (as above) was discouraged and
NACKed in favor of using a different set of memory flags, such as GFP_NOFS,
etc. I never did determine exactly how this ended up.

In the meantime, I prototyped the following patch, and it works, but
it's a hack, and I'm not too happy with it. So I'm going to see if I can
find a solution that would work for both GFS2 and ocfs2.

Regards,

Bob Peterson
Red Hat File Systems
---
commit 3142a311fa848d44b15e9f86fc3599b7d9b8b3f5
Author: Bob Peterson <rpeterso@redhat.com>
Date:   Mon Nov 30 12:04:09 2015 -0600

GFS2: Make gfs2_clear_inode() not block on final glock put

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
function that avoids a possible deadlock that would occur should
it call dlm during a fence operation: dlm waits for a fence
operation, the fence operation waits for memory, the shrinker
waits for gfs2 to free an inode from memory, but gfs2 waits for
dlm.

This solution is a bit of a hack: If clear_inode was called from
the fenced process, instead of calling gfs2_glock_put, we queue
the glock state machine which calls the gfs2_glock_put from a
different context.

I've tried many other solutions to no avail. Consider:

1. If we always queue the final glock_put, there's no guarantee
   that the queue_delayed_work is successful, because there could
   already be work queued for that glock.
2. If the queue_delayed_work fails, reacting with a call to
   gfs2_glock_put (however unlikely) might still be the final
   put, which means we could still hang.
3. If we loop around and try again, the queued delayed work that
   prevented us from queueing might run, causing the glock to
   be freed, which would result in a use-after-free problem.
4. If we just forget it, our glock reference count would get off
   and we wouldn't be able to free the glocks from slab when the
   module is unloaded.
5. We can't really use current->flags & PF_FSTRANS like NFS,
   as far as I can tell.
6. We can't really use current->flags & PF_MEMALLOC either,
   like the global shrinker because it's not set for the slab
   shrinker.
7. We can't easily tell if the inode shrinker is in our call
   stack, but it's easy to tell if it's called from fenced, which
   is the most likely to block dlm.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

rhbz#1255872

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b5e9d2e..36bea46 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -225,6 +225,29 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 }
 
 /**
+ * gfs2_glock_put_noblock() - Decrement reference count on glock
+ * @gl: The glock to put
+ *
+ * This is the same as gfs2_glock_put() but it's not allowed to block.
+ * We're simply trying to avoid a live-lock case in which the fenced process
+ * calls the inode shrinker, which calls the gfs2_clear_inode, which calls
+ * gfs2_glock_put, which calls dlm, which blocks on a fence operation.
+ */
+
+void gfs2_glock_put_noblock(struct gfs2_glock *gl)
+{
+	if (strcmp(current->comm, "fenced") == 0) {
+		if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) != 0)
+			return;
+
+		printk(KERN_WARNING "GFS2: Unable to queue final glock "
+		       "put for inode.\n");
+		__dump_glock(NULL, gl);
+	}
+	gfs2_glock_put(gl);
+}
+
+/**
  * search_bucket() - Find struct gfs2_glock by lock number
  * @bucket: the bucket to search
  * @name: The lock name
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 7e22f20..d9f8073 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -188,6 +188,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp,
 void gfs2_glock_hold(struct gfs2_glock *gl);
 void gfs2_glock_put_nolock(struct gfs2_glock *gl);
 void gfs2_glock_put(struct gfs2_glock *gl);
+extern void gfs2_glock_put_noblock(struct gfs2_glock *gl);
 void gfs2_holder_init(struct gfs2_glock *gl, unsigned int state, u16 flags,
 		      struct gfs2_holder *gh);
 void gfs2_holder_reinit(unsigned int state, u16 flags, struct gfs2_holder *gh);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 6c8637a..0d5e721 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1278,7 +1278,7 @@ static void gfs2_clear_inode(struct inode *inode)
 	ip->i_gl->gl_object = NULL;
 	flush_delayed_work(&ip->i_gl->gl_work);
 	gfs2_glock_add_to_lru(ip->i_gl);
-	gfs2_glock_put(ip->i_gl);
+	gfs2_glock_put_noblock(ip->i_gl);
 	ip->i_gl = NULL;
 	if (ip->i_iopen_gh.gh_gl) {
 		ip->i_iopen_gh.gh_gl->gl_object = NULL;




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-04 14:51                   ` Bob Peterson
@ 2015-12-04 15:51                     ` David Teigland
  2015-12-04 17:38                       ` Bob Peterson
  0 siblings, 1 reply; 18+ messages in thread
From: David Teigland @ 2015-12-04 15:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> it's from the fenced process, and if so, queue the final put. That should
> mitigate the problem.

Bob, I'm perplexed by the focus on fencing; this issue is broader than
fencing as I mentioned in bz 1255872.  Over the years that I've reported
these issues, rarely if ever have they involved fencing.  Any userland
process, not just the fencing process, can allocate memory, fall into the
general shrinking path, get into gfs2 and dlm, and end up blocked for some
undefined time.  That can cause problems in any number of ways.

The specific problem you're focused on may be one of the easier ways of
demonstrating the problem -- making the original userland process one of
the cluster-related processes that gfs2/dlm depend on, combined with
recovery when those processes do an especially large amount of work that
gfs2/dlm require.  But problems could occur if any process is forced to
unwittingly do this dlm work, not just a cluster-related process, and it
would not need to involve recovery (or fencing which is one small part of
it).

I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
for shrinking its cache and doing the dlm work, and would not do anything
from the generic shrinking paths because of this issue.  I don't think
it's reasonable to expect random, unsuspecting processes on the system to
perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
unpredictable.  I think gfs2 needs to do that kind of heavy lifting from
its own dedicated contexts, or from processes that are explicitly choosing
to use gfs2.




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-04 15:51                     ` David Teigland
@ 2015-12-04 17:38                       ` Bob Peterson
  0 siblings, 0 replies; 18+ messages in thread
From: Bob Peterson @ 2015-12-04 17:38 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> > it's from the fenced process, and if so, queue the final put. That should
> > mitigate the problem.
> 
> Bob, I'm perplexed by the focus on fencing; this issue is broader than
> fencing as I mentioned in bz 1255872.  Over the years that I've reported
> these issues, rarely if ever have they involved fencing.  Any userland
> process, not just the fencing process, can allocate memory, fall into the
> general shrinking path, get into gfs2 and dlm, and end up blocked for some
> undefined time.  That can cause problems in any number of ways.
> 
> The specific problem you're focused on may be one of the easier ways of
> demonstrating the problem -- making the original userland process one of
> the cluster-related processes that gfs2/dlm depend on, combined with
> recovery when those processes do an especially large amount of work that
> gfs2/dlm require.  But problems could occur if any process is forced to
> unwittingly do this dlm work, not just a cluster-related process, and it
> would not need to involve recovery (or fencing which is one small part of
> it).
> 
> I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
> for shrinking its cache and doing the dlm work, and would not do anything
> from the generic shrinking paths because of this issue.  I don't think
> it's reasonable to expect random, unsuspecting processes on the system to
> perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
> unpredictable.  I think gfs2 needs to do that kind of heavy lifting from
> its own dedicated contexts, or from processes that are explicitly choosing
> to use gfs2.
> 
> 
Hi Dave,

Thanks for your input.
You're right, of course. The problem can be triggered by any process that causes
the shrinker to run, not just fenced.

Yes, the original GFS2 did not have this problem. When gfs2_clear_inode
was called, it called gfs2_glock_schedule_for_reclaim, which queued the inode
glock to a special queue which was handled by the gfs2_glockd daemon. The
reference count was bumped to prevent the gfs2_clear_inode() from being the
last guy out.

The problem with that approach is that it's a centralized list, which means
there's no parallelism, and a backlog of glocks waiting to be reclaimed can
build up. If a queued glock needs to be reacquired because its block was
reused for a different inode, we could end up reaping glocks that are still
being (re)used.
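The old gfs2_glockd scheme described above can be modeled roughly as follows. This is a hypothetical userspace Python sketch, not GFS2 code: the `Glock`, `clear_inode`, and `glockd` names here only mimic the kernel ones, and the "DLM call" is reduced to setting a flag. The evict path bumps the reference count and defers the final put to a single daemon thread, which is exactly why there is no parallelism:

```python
import queue
import threading

class Glock:
    def __init__(self, name):
        self.name = name
        self.refcount = 1        # the inode's reference
        self.released = False
    def get(self):
        self.refcount += 1
    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.released = True  # the real code would call into DLM here

reclaim_list = queue.Queue()      # the centralized reclaim list

def clear_inode(glock):
    glock.get()                   # bump the ref so we are not the last put
    glock.put()                   # drop the inode's ref without touching DLM
    reclaim_list.put(glock)       # defer the final put to the daemon

def glockd():
    # single daemon thread draining the list: no parallelism, so a
    # backlog can build up behind one slow (DLM-blocked) final put
    while True:
        gl = reclaim_list.get()
        if gl is None:
            break
        gl.put()                  # the final put (and DLM call) happens here

daemon = threading.Thread(target=glockd)
daemon.start()
locks = [Glock(f"gl{i}") for i in range(3)]
for gl in locks:
    clear_inode(gl)
reclaim_list.put(None)            # shut the daemon down for the demo
daemon.join()
print(all(gl.released for gl in locks))  # True
```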

We could still do that, but either we reintroduce the gfs2_glockd daemon,
or give our existing daemon, gfs2_quotad, more responsibilities, which would
make it a bit uglier and more complex than it is today.

My previous attempts to solve this involved using a work queue and
queuing the final gfs2_glock_put() and that fixed the problem for all
cases, unless there was already work queued.

When you think about it, using delayed work accomplishes the same thing, but
with parallelism, when it works. Perhaps I just need to focus on a way to
allow work to be queued multiple times (in the ideal case) or, alternatively,
an atomic counter that corresponds to the number of times the work should be
executed. Or something similar.
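The counter idea sketched above can be modeled like this. Again a hypothetical userspace Python sketch, not kernel code: a mutex stands in for a kernel atomic, and the `CountedWork` name is invented for illustration. Each queue attempt bumps a counter, and the worker keeps running until it has performed one execution per bump, so a "final put" is not lost just because work was already queued:

```python
import threading

class CountedWork:
    def __init__(self, fn):
        self.fn = fn
        self.pending = 0              # executions still owed
        self.lock = threading.Lock()  # stands in for an atomic counter
        self.worker = None
    def queue(self):
        with self.lock:
            self.pending += 1
            if self.worker is None:   # not already queued: start the worker
                self.worker = threading.Thread(target=self._run)
                self.worker.start()
    def _run(self):
        while True:
            with self.lock:
                if self.pending == 0:
                    self.worker = None
                    return
                self.pending -= 1
            self.fn()                 # one execution per queue() call

executed = []
w = CountedWork(lambda: executed.append(1))
for _ in range(5):
    w.queue()                         # queueing while already queued is safe
t = w.worker
if t is not None:
    t.join()
print(len(executed))  # 5
```

The point of the counter is the loop condition: a plain "is work pending?" flag would collapse five queue attempts into one execution, whereas counting them preserves every owed final put.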

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-02 16:42             ` Bob Peterson
  2015-12-02 17:41               ` Bob Peterson
@ 2015-12-08  7:57               ` Dave Chinner
  2015-12-08  9:03                 ` Steven Whitehouse
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2015-12-08  7:57 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Dec 02, 2015 at 11:42:13AM -0500, Bob Peterson wrote:
> ----- Original Message -----
> (snip)
> > Please take a look at this
> > again and figure out what the problematic cycle of events is, and then
> > work out how to avoid that happening in the first place. There is no
> > point in replacing one problem with another one, particularly one which
> > would likely be very tricky to debug,
> > 
> > Steve.
> 
> The problematic cycle of events is well known:
> gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
> but if it's the very last put, it calls into dlm, which can block,
> and that's where we get into trouble.
> 
> The livelock goes like this:
> 
> 1. A fence operation needs memory, so it blocks on memory allocation.
> 2. Memory allocation blocks on slab shrinker.
> 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
....
> 7. dlm blocks on a pending fence operation. Goto 1.

Therefore, the fence operation should be doing GFP_NOFS allocations
to prevent re-entry into the DLM via the filesystem via the shrinker....

Cheers,

Dave.
-- 
Dave Chinner
dchinner at redhat.com




* [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
  2015-12-08  7:57               ` Dave Chinner
@ 2015-12-08  9:03                 ` Steven Whitehouse
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Whitehouse @ 2015-12-08  9:03 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 08/12/15 07:57, Dave Chinner wrote:
> On Wed, Dec 02, 2015 at 11:42:13AM -0500, Bob Peterson wrote:
>> ----- Original Message -----
>> (snip)
>>> Please take a look at this
>>> again and figure out what the problematic cycle of events is, and then
>>> work out how to avoid that happening in the first place. There is no
>>> point in replacing one problem with another one, particularly one which
>>> would likely be very tricky to debug,
>>>
>>> Steve.
>> The problematic cycle of events is well known:
>> gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
>> but if it's the very last put, it calls into dlm, which can block,
>> and that's where we get into trouble.
>>
>> The livelock goes like this:
>>
>> 1. A fence operation needs memory, so it blocks on memory allocation.
>> 2. Memory allocation blocks on slab shrinker.
>> 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
> ....
>> 7. dlm blocks on a pending fence operation. Goto 1.
> Therefore, the fence operation should be doing GFP_NOFS allocations
> to prevent re-entry into the DLM via the filesystem via the shrinker....
>
> Cheers,
>
> Dave.

Which would be ideal, but how do you do that from user space?

Steve.




end of thread, other threads:[~2015-12-08  9:03 UTC | newest]

Thread overview: 18+ messages
-- links below jump to the message on this page --
2015-11-19 18:42 [Cluster-devel] [GFS2 PATCH 0/2] GFS2: Avoid inode shrinker-related deadlocks Bob Peterson
2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put Bob Peterson
2015-11-20 13:33   ` Steven Whitehouse
2015-11-25 14:22     ` Bob Peterson
2015-11-25 14:26       ` Steven Whitehouse
2015-12-01 15:42         ` Bob Peterson
2015-12-02 10:23           ` Steven Whitehouse
2015-12-02 16:42             ` Bob Peterson
2015-12-02 17:41               ` Bob Peterson
2015-12-03 11:18                 ` Steven Whitehouse
2015-12-04 14:51                   ` Bob Peterson
2015-12-04 15:51                     ` David Teigland
2015-12-04 17:38                       ` Bob Peterson
2015-12-08  7:57               ` Dave Chinner
2015-12-08  9:03                 ` Steven Whitehouse
2015-11-19 18:42 ` [Cluster-devel] [GFS2 PATCH 2/2] GFS2: Revert 35e478f Flush pending glock work when evicting an inode Bob Peterson
2015-11-20 13:47   ` Steven Whitehouse
2015-11-25 14:36     ` Bob Peterson
