All of lore.kernel.org
 help / color / mirror / Atom feed
* [Cluster-devel] [PATCH 0/4] patches related to file unlink->delete->new
@ 2015-12-18 19:02 Bob Peterson
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 1/4] GFS2: Prevent delete work from occurring on glocks used for create Bob Peterson
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Bob Peterson @ 2015-12-18 19:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Back on October 22, I posted a set of fourteen patches related to the
transition of inodes from unlinked to free. Since then, I've found and
fixed several problems with the patches. I've also treated several of
those patches as separate problems. What remains is this set of four
patches.

1. The first patch prevents collisions between locks being used to delete
   inodes and the same glock being used to create a new inode. This
   prevents all kinds of bad problems, most of which have to do with
   glock reference count problems.
2. The second patch makes it so that we no longer filter out I_FREEING
   inodes anymore. As I stated in an email from 10 December, if we tell
   the delete code the inode does not exist, it will quit trying to
   delete it, and the dinode on disk will be stuck in "unlinked". With
   this patch, inodes that are in I_FREEING will blocked until vfs (and
   GFS2) are done with them. Then a new inode is created and used for
   the transition from unlinked to free.
3. The third patch simply removes a parameter from gfs2_inode_lookup
   which is no longer needed, since it is no longer referenced.
4. The final patch changes the delete code so that it uses the normal
   inode lookup code to identify inodes that need to be transitioned
   from unlinked to free. This patch adds a new parameter "for_del"
   to function gfs2_inode_lookup() which instructs the code to verify
   that the block is unlinked before proceeding. This makes sure the
   glock work is all safely done with the correct lock ordering.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>

Bob Peterson (4):
  GFS2: Prevent delete work from occurring on glocks used for create
  GFS2: Don't filter out I_FREEING inodes anymore
  GFS2: Eliminate parameter non_block on gfs2_inode_lookup
  GFS2: Make deletes use regular inode_lookup with special code path

 fs/gfs2/export.c |  2 +-
 fs/gfs2/glock.c  | 14 ++++----
 fs/gfs2/incore.h |  1 +
 fs/gfs2/inode.c  | 99 ++++++++++++++++++++++----------------------------------
 fs/gfs2/inode.h  |  4 +--
 5 files changed, 49 insertions(+), 71 deletions(-)

-- 
2.5.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 1/4] GFS2: Prevent delete work from occurring on glocks used for create
  2015-12-18 19:02 [Cluster-devel] [PATCH 0/4] patches related to file unlink->delete->new Bob Peterson
@ 2015-12-18 19:02 ` Bob Peterson
  2016-02-02 11:40   ` Steven Whitehouse
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 2/4] GFS2: Don't filter out I_FREEING inodes anymore Bob Peterson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Bob Peterson @ 2015-12-18 19:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch tries to prevent delete work (queued via iopen callback)
from executing if the glock is currently being used to create
a new inode.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c  | 7 +++++++
 fs/gfs2/incore.h | 1 +
 fs/gfs2/inode.c  | 7 ++++++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index a4ff7b5..cb5ed3a 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -572,6 +572,12 @@ static void delete_work_func(struct work_struct *work)
 	struct inode *inode;
 	u64 no_addr = gl->gl_name.ln_number;
 
+	/* If someone's using this glock to create a new dinode, the block must
+	   have been freed by another node, then re-used, in which case our
+	   iopen callback is too late after the fact. Ignore it. */
+	if (test_bit(GLF_INODE_CREATING, &gl->gl_flags))
+		goto out;
+
 	ip = gl->gl_object;
 	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
 
@@ -583,6 +589,7 @@ static void delete_work_func(struct work_struct *work)
 		d_prune_aliases(inode);
 		iput(inode);
 	}
+out:
 	gfs2_glock_put(gl);
 }
 
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 845fb09..a6a3389 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -328,6 +328,7 @@ enum {
 	GLF_LRU				= 13,
 	GLF_OBJECT			= 14, /* Used only for tracing */
 	GLF_BLOCKING			= 15,
+	GLF_INODE_CREATING		= 16, /* Inode creation occurring */
 };
 
 struct gfs2_glock {
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 009b551..6cdafed 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -592,7 +592,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	struct inode *inode = NULL;
 	struct gfs2_inode *dip = GFS2_I(dir), *ip;
 	struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode);
-	struct gfs2_glock *io_gl;
+	struct gfs2_glock *io_gl = NULL;
 	int error, free_vfs_inode = 1;
 	u32 aflags = 0;
 	unsigned blocks = 1;
@@ -729,6 +729,8 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	if (error)
 		goto fail_gunlock2;
 
+	BUG_ON(test_and_set_bit(GLF_INODE_CREATING, &io_gl->gl_flags));
+
 	error = gfs2_glock_nq_init(io_gl, LM_ST_SHARED, GL_EXACT, &ip->i_iopen_gh);
 	if (error)
 		goto fail_gunlock2;
@@ -771,12 +773,15 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	}
 	gfs2_glock_dq_uninit(ghs);
 	gfs2_glock_dq_uninit(ghs + 1);
+	clear_bit(GLF_INODE_CREATING, &io_gl->gl_flags);
 	return error;
 
 fail_gunlock3:
 	gfs2_glock_dq_uninit(&ip->i_iopen_gh);
 	gfs2_glock_put(io_gl);
 fail_gunlock2:
+	if (io_gl)
+		clear_bit(GLF_INODE_CREATING, &io_gl->gl_flags);
 	gfs2_glock_dq_uninit(ghs + 1);
 fail_free_inode:
 	if (ip->i_gl)
-- 
2.5.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 2/4] GFS2: Don't filter out I_FREEING inodes anymore
  2015-12-18 19:02 [Cluster-devel] [PATCH 0/4] patches related to file unlink->delete->new Bob Peterson
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 1/4] GFS2: Prevent delete work from occurring on glocks used for create Bob Peterson
@ 2015-12-18 19:02 ` Bob Peterson
  2016-02-23 10:15   ` Steven Whitehouse
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 3/4] GFS2: Eliminate parameter non_block on gfs2_inode_lookup Bob Peterson
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path Bob Peterson
  3 siblings, 1 reply; 19+ messages in thread
From: Bob Peterson @ 2015-12-18 19:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch basically reverts a very old patch from 2008,
7a9f53b3c1875bef22ad4588e818bc046ef183da, with the title
"Alternate gfs2_iget to avoid looking up inodes being freed".
The original patch was designed to avoid a deadlock caused by lock
ordering with try_rgrp_unlink. The patch forced the function to not
find inodes that were being removed by VFS. The problem is, that
made it impossible for nodes to delete their own unlinked dinodes
after a certain point in time, because the inode needed was not found
by this filtering process. There is no longer a need for the patch,
since function try_rgrp_unlink no longer locks the inode: All it does
is queue the glock onto the delete work_queue, so there should be no
more deadlock.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/export.c |  2 +-
 fs/gfs2/glock.c  |  2 +-
 fs/gfs2/inode.c  | 59 ++++----------------------------------------------------
 fs/gfs2/inode.h  |  2 +-
 4 files changed, 7 insertions(+), 58 deletions(-)

diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index 5d15e94..d5bda85 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -137,7 +137,7 @@ static struct dentry *gfs2_get_dentry(struct super_block *sb,
 	struct gfs2_sbd *sdp = sb->s_fs_info;
 	struct inode *inode;
 
-	inode = gfs2_ilookup(sb, inum->no_addr, 0);
+	inode = gfs2_ilookup(sb, inum->no_addr);
 	if (inode) {
 		if (GFS2_I(inode)->i_no_formal_ino != inum->no_formal_ino) {
 			iput(inode);
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index cb5ed3a..3e16d82 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -582,7 +582,7 @@ static void delete_work_func(struct work_struct *work)
 	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
 
 	if (ip)
-		inode = gfs2_ilookup(sdp->sd_vfs, no_addr, 1);
+		inode = gfs2_ilookup(sdp->sd_vfs, no_addr);
 	else
 		inode = gfs2_lookup_by_inum(sdp, no_addr, NULL, GFS2_BLKST_UNLINKED);
 	if (inode && !IS_ERR(inode)) {
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 6cdafed..2adc1ee 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -37,61 +37,9 @@
 #include "super.h"
 #include "glops.h"
 
-struct gfs2_skip_data {
-	u64 no_addr;
-	int skipped;
-	int non_block;
-};
-
-static int iget_test(struct inode *inode, void *opaque)
-{
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_skip_data *data = opaque;
-
-	if (ip->i_no_addr == data->no_addr) {
-		if (data->non_block &&
-		    inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
-			data->skipped = 1;
-			return 0;
-		}
-		return 1;
-	}
-	return 0;
-}
-
-static int iget_set(struct inode *inode, void *opaque)
+struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr)
 {
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_skip_data *data = opaque;
-
-	if (data->skipped)
-		return -ENOENT;
-	inode->i_ino = (unsigned long)(data->no_addr);
-	ip->i_no_addr = data->no_addr;
-	return 0;
-}
-
-struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr, int non_block)
-{
-	unsigned long hash = (unsigned long)no_addr;
-	struct gfs2_skip_data data;
-
-	data.no_addr = no_addr;
-	data.skipped = 0;
-	data.non_block = non_block;
-	return ilookup5(sb, hash, iget_test, &data);
-}
-
-static struct inode *gfs2_iget(struct super_block *sb, u64 no_addr,
-			       int non_block)
-{
-	struct gfs2_skip_data data;
-	unsigned long hash = (unsigned long)no_addr;
-
-	data.no_addr = no_addr;
-	data.skipped = 0;
-	data.non_block = non_block;
-	return iget5_locked(sb, hash, iget_test, iget_set, &data);
+	return ilookup(sb, (unsigned long)no_addr);
 }
 
 /**
@@ -145,8 +93,9 @@ struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
 	struct gfs2_glock *io_gl = NULL;
 	int error;
 
-	inode = gfs2_iget(sb, no_addr, non_block);
+	inode = iget_locked(sb, (unsigned long)no_addr);
 	ip = GFS2_I(inode);
+	ip->i_no_addr = no_addr;
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
index ba4d949..22c27a8 100644
--- a/fs/gfs2/inode.h
+++ b/fs/gfs2/inode.h
@@ -99,7 +99,7 @@ extern struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned type,
 extern struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
 					 u64 *no_formal_ino,
 					 unsigned int blktype);
-extern struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr, int nonblock);
+extern struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr);
 
 extern int gfs2_inode_refresh(struct gfs2_inode *ip);
 
-- 
2.5.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 3/4] GFS2: Eliminate parameter non_block on gfs2_inode_lookup
  2015-12-18 19:02 [Cluster-devel] [PATCH 0/4] patches related to file unlink->delete->new Bob Peterson
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 1/4] GFS2: Prevent delete work from occurring on glocks used for create Bob Peterson
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 2/4] GFS2: Don't filter out I_FREEING inodes anymore Bob Peterson
@ 2015-12-18 19:02 ` Bob Peterson
  2016-02-23 10:17   ` Steven Whitehouse
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path Bob Peterson
  3 siblings, 1 reply; 19+ messages in thread
From: Bob Peterson @ 2015-12-18 19:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Now that we're not filtering out I_FREEING inodes from our lookups
anymore, we can eliminate the non_block parameter from the lookup
function.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/dir.c        | 2 +-
 fs/gfs2/inode.c      | 5 ++---
 fs/gfs2/inode.h      | 3 +--
 fs/gfs2/ops_fstype.c | 2 +-
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 6a92592..2bef023 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -1660,7 +1660,7 @@ struct inode *gfs2_dir_search(struct inode *dir, const struct qstr *name,
 		brelse(bh);
 		if (fail_on_exist)
 			return ERR_PTR(-EEXIST);
-		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino, 0);
+		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino);
 		if (!IS_ERR(inode))
 			GFS2_I(inode)->i_rahead = rahead;
 		return inode;
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 2adc1ee..01b6d1a 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -80,13 +80,12 @@ static void gfs2_set_iop(struct inode *inode)
  * @sb: The super block
  * @no_addr: The inode number
  * @type: The type of the inode
- * non_block: Can we block on inodes that are being freed?
  *
  * Returns: A VFS inode, or an error
  */
 
 struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
-				u64 no_addr, u64 no_formal_ino, int non_block)
+				u64 no_addr, u64 no_formal_ino)
 {
 	struct inode *inode;
 	struct gfs2_inode *ip;
@@ -170,7 +169,7 @@ struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
 	if (error)
 		goto fail;
 
-	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0, 1);
+	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0);
 	if (IS_ERR(inode))
 		goto fail;
 
diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
index 22c27a8..e1af0d4 100644
--- a/fs/gfs2/inode.h
+++ b/fs/gfs2/inode.h
@@ -94,8 +94,7 @@ err:
 }
 
 extern struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned type, 
-				       u64 no_addr, u64 no_formal_ino,
-				       int non_block);
+				       u64 no_addr, u64 no_formal_ino);
 extern struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
 					 u64 *no_formal_ino,
 					 unsigned int blktype);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 7aacdf2..9ed522c 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -454,7 +454,7 @@ static int gfs2_lookup_root(struct super_block *sb, struct dentry **dptr,
 	struct dentry *dentry;
 	struct inode *inode;
 
-	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0, 0);
+	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0);
 	if (IS_ERR(inode)) {
 		fs_err(sdp, "can't read in %s inode: %ld\n", name, PTR_ERR(inode));
 		return PTR_ERR(inode);
-- 
2.5.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2015-12-18 19:02 [Cluster-devel] [PATCH 0/4] patches related to file unlink->delete->new Bob Peterson
                   ` (2 preceding siblings ...)
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 3/4] GFS2: Eliminate parameter non_block on gfs2_inode_lookup Bob Peterson
@ 2015-12-18 19:02 ` Bob Peterson
  2016-02-23 10:25   ` Steven Whitehouse
  2016-03-21 16:49   ` Benjamin Marzinski
  3 siblings, 2 replies; 19+ messages in thread
From: Bob Peterson @ 2015-12-18 19:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This patch changes the inode reclaiming code, delete_work_func, use
the normal gfs2_inode_lookup, but with a special parameter, for_del
which causes additional checks to be made.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/dir.c        |  2 +-
 fs/gfs2/glock.c      |  9 +--------
 fs/gfs2/inode.c      | 32 ++++++++++++++++++++++++++++----
 fs/gfs2/inode.h      |  3 ++-
 fs/gfs2/ops_fstype.c |  2 +-
 5 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 2bef023..6a92592 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -1660,7 +1660,7 @@ struct inode *gfs2_dir_search(struct inode *dir, const struct qstr *name,
 		brelse(bh);
 		if (fail_on_exist)
 			return ERR_PTR(-EEXIST);
-		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino);
+		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino, 0);
 		if (!IS_ERR(inode))
 			GFS2_I(inode)->i_rahead = rahead;
 		return inode;
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 3e16d82..795c2f3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -568,7 +568,6 @@ static void delete_work_func(struct work_struct *work)
 {
 	struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_delete);
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
-	struct gfs2_inode *ip;
 	struct inode *inode;
 	u64 no_addr = gl->gl_name.ln_number;
 
@@ -578,13 +577,7 @@ static void delete_work_func(struct work_struct *work)
 	if (test_bit(GLF_INODE_CREATING, &gl->gl_flags))
 		goto out;
 
-	ip = gl->gl_object;
-	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
-
-	if (ip)
-		inode = gfs2_ilookup(sdp->sd_vfs, no_addr);
-	else
-		inode = gfs2_lookup_by_inum(sdp, no_addr, NULL, GFS2_BLKST_UNLINKED);
+	inode = gfs2_inode_lookup(sdp->sd_vfs, DT_UNKNOWN, no_addr, 0, 1);
 	if (inode && !IS_ERR(inode)) {
 		d_prune_aliases(inode);
 		iput(inode);
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 01b6d1a..0357862 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -75,17 +75,37 @@ static void gfs2_set_iop(struct inode *inode)
 	}
 }
 
+static int verify_unlinked(struct gfs2_inode *ip)
+{
+	struct gfs2_holder i_gh;
+	struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
+	int error;
+
+	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_SKIP, &i_gh);
+	if (error)
+		return error;
+
+	error = gfs2_check_blk_type(sdp, ip->i_no_addr, GFS2_BLKST_UNLINKED);
+	if (error == 0)
+		error = gfs2_inode_refresh(ip);
+
+	gfs2_glock_dq_uninit(&i_gh);
+	return error;
+}
+
 /**
  * gfs2_inode_lookup - Lookup an inode
  * @sb: The super block
- * @no_addr: The inode number
  * @type: The type of the inode
+ * @no_addr: The inode number
+ * @no_formal_ino: The formal inode number
+ * @for_del: Is this lookup for deleting purposes?
  *
  * Returns: A VFS inode, or an error
  */
 
 struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
-				u64 no_addr, u64 no_formal_ino)
+				u64 no_addr, u64 no_formal_ino, int for_del)
 {
 	struct inode *inode;
 	struct gfs2_inode *ip;
@@ -121,7 +141,11 @@ struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
 		gfs2_glock_put(io_gl);
 		io_gl = NULL;
 
-		if (type == DT_UNKNOWN) {
+		if (for_del) {
+			error = verify_unlinked(ip);
+			if (error)
+				goto fail_refresh;
+		} else if (type == DT_UNKNOWN) {
 			/* Inode glock must be locked already */
 			error = gfs2_inode_refresh(GFS2_I(inode));
 			if (error)
@@ -169,7 +193,7 @@ struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
 	if (error)
 		goto fail;
 
-	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0);
+	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0, 0);
 	if (IS_ERR(inode))
 		goto fail;
 
diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
index e1af0d4..d6233e9 100644
--- a/fs/gfs2/inode.h
+++ b/fs/gfs2/inode.h
@@ -94,7 +94,8 @@ err:
 }
 
 extern struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned type, 
-				       u64 no_addr, u64 no_formal_ino);
+				       u64 no_addr, u64 no_formal_ino,
+				       int for_del);
 extern struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
 					 u64 *no_formal_ino,
 					 unsigned int blktype);
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 9ed522c..7aacdf2 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -454,7 +454,7 @@ static int gfs2_lookup_root(struct super_block *sb, struct dentry **dptr,
 	struct dentry *dentry;
 	struct inode *inode;
 
-	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0);
+	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0, 0);
 	if (IS_ERR(inode)) {
 		fs_err(sdp, "can't read in %s inode: %ld\n", name, PTR_ERR(inode));
 		return PTR_ERR(inode);
-- 
2.5.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 1/4] GFS2: Prevent delete work from occurring on glocks used for create
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 1/4] GFS2: Prevent delete work from occurring on glocks used for create Bob Peterson
@ 2016-02-02 11:40   ` Steven Whitehouse
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Whitehouse @ 2016-02-02 11:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

Acked-by: Steven Whitehouse <swhiteho@redhat.com>

Steve.


On 18/12/15 19:02, Bob Peterson wrote:
> This patch tries to prevent delete work (queued via iopen callback)
> from executing if the glock is currently being used to create
> a new inode.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/glock.c  | 7 +++++++
>   fs/gfs2/incore.h | 1 +
>   fs/gfs2/inode.c  | 7 ++++++-
>   3 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index a4ff7b5..cb5ed3a 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -572,6 +572,12 @@ static void delete_work_func(struct work_struct *work)
>   	struct inode *inode;
>   	u64 no_addr = gl->gl_name.ln_number;
>   
> +	/* If someone's using this glock to create a new dinode, the block must
> +	   have been freed by another node, then re-used, in which case our
> +	   iopen callback is too late after the fact. Ignore it. */
> +	if (test_bit(GLF_INODE_CREATING, &gl->gl_flags))
> +		goto out;
> +
>   	ip = gl->gl_object;
>   	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
>   
> @@ -583,6 +589,7 @@ static void delete_work_func(struct work_struct *work)
>   		d_prune_aliases(inode);
>   		iput(inode);
>   	}
> +out:
>   	gfs2_glock_put(gl);
>   }
>   
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 845fb09..a6a3389 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -328,6 +328,7 @@ enum {
>   	GLF_LRU				= 13,
>   	GLF_OBJECT			= 14, /* Used only for tracing */
>   	GLF_BLOCKING			= 15,
> +	GLF_INODE_CREATING		= 16, /* Inode creation occurring */
>   };
>   
>   struct gfs2_glock {
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 009b551..6cdafed 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -592,7 +592,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
>   	struct inode *inode = NULL;
>   	struct gfs2_inode *dip = GFS2_I(dir), *ip;
>   	struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode);
> -	struct gfs2_glock *io_gl;
> +	struct gfs2_glock *io_gl = NULL;
>   	int error, free_vfs_inode = 1;
>   	u32 aflags = 0;
>   	unsigned blocks = 1;
> @@ -729,6 +729,8 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
>   	if (error)
>   		goto fail_gunlock2;
>   
> +	BUG_ON(test_and_set_bit(GLF_INODE_CREATING, &io_gl->gl_flags));
> +
>   	error = gfs2_glock_nq_init(io_gl, LM_ST_SHARED, GL_EXACT, &ip->i_iopen_gh);
>   	if (error)
>   		goto fail_gunlock2;
> @@ -771,12 +773,15 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
>   	}
>   	gfs2_glock_dq_uninit(ghs);
>   	gfs2_glock_dq_uninit(ghs + 1);
> +	clear_bit(GLF_INODE_CREATING, &io_gl->gl_flags);
>   	return error;
>   
>   fail_gunlock3:
>   	gfs2_glock_dq_uninit(&ip->i_iopen_gh);
>   	gfs2_glock_put(io_gl);
>   fail_gunlock2:
> +	if (io_gl)
> +		clear_bit(GLF_INODE_CREATING, &io_gl->gl_flags);
>   	gfs2_glock_dq_uninit(ghs + 1);
>   fail_free_inode:
>   	if (ip->i_gl)



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 2/4] GFS2: Don't filter out I_FREEING inodes anymore
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 2/4] GFS2: Don't filter out I_FREEING inodes anymore Bob Peterson
@ 2016-02-23 10:15   ` Steven Whitehouse
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Whitehouse @ 2016-02-23 10:15 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,


On 18/12/15 19:02, Bob Peterson wrote:
> This patch basically reverts a very old patch from 2008,
> 7a9f53b3c1875bef22ad4588e818bc046ef183da, with the title
> "Alternate gfs2_iget to avoid looking up inodes being freed".
> The original patch was designed to avoid a deadlock caused by lock
> ordering with try_rgrp_unlink. The patch forced the function to not
> find inodes that were being removed by VFS. The problem is, that
> made it impossible for nodes to delete their own unlinked dinodes
> after a certain point in time, because the inode needed was not found
> by this filtering process. There is no longer a need for the patch,
> since function try_rgrp_unlink no longer locks the inode: All it does
> is queue the glock onto the delete work_queue, so there should be no
> more deadlock.
This looks good. For older kernels though I think the issue may still 
exist due to the way workqueues used to work, but for upstream and 
modern kernels this is a nice clean up,

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

Steve.
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/export.c |  2 +-
>   fs/gfs2/glock.c  |  2 +-
>   fs/gfs2/inode.c  | 59 ++++----------------------------------------------------
>   fs/gfs2/inode.h  |  2 +-
>   4 files changed, 7 insertions(+), 58 deletions(-)
>
> diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
> index 5d15e94..d5bda85 100644
> --- a/fs/gfs2/export.c
> +++ b/fs/gfs2/export.c
> @@ -137,7 +137,7 @@ static struct dentry *gfs2_get_dentry(struct super_block *sb,
>   	struct gfs2_sbd *sdp = sb->s_fs_info;
>   	struct inode *inode;
>   
> -	inode = gfs2_ilookup(sb, inum->no_addr, 0);
> +	inode = gfs2_ilookup(sb, inum->no_addr);
>   	if (inode) {
>   		if (GFS2_I(inode)->i_no_formal_ino != inum->no_formal_ino) {
>   			iput(inode);
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index cb5ed3a..3e16d82 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -582,7 +582,7 @@ static void delete_work_func(struct work_struct *work)
>   	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
>   
>   	if (ip)
> -		inode = gfs2_ilookup(sdp->sd_vfs, no_addr, 1);
> +		inode = gfs2_ilookup(sdp->sd_vfs, no_addr);
>   	else
>   		inode = gfs2_lookup_by_inum(sdp, no_addr, NULL, GFS2_BLKST_UNLINKED);
>   	if (inode && !IS_ERR(inode)) {
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 6cdafed..2adc1ee 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -37,61 +37,9 @@
>   #include "super.h"
>   #include "glops.h"
>   
> -struct gfs2_skip_data {
> -	u64 no_addr;
> -	int skipped;
> -	int non_block;
> -};
> -
> -static int iget_test(struct inode *inode, void *opaque)
> -{
> -	struct gfs2_inode *ip = GFS2_I(inode);
> -	struct gfs2_skip_data *data = opaque;
> -
> -	if (ip->i_no_addr == data->no_addr) {
> -		if (data->non_block &&
> -		    inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) {
> -			data->skipped = 1;
> -			return 0;
> -		}
> -		return 1;
> -	}
> -	return 0;
> -}
> -
> -static int iget_set(struct inode *inode, void *opaque)
> +struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr)
>   {
> -	struct gfs2_inode *ip = GFS2_I(inode);
> -	struct gfs2_skip_data *data = opaque;
> -
> -	if (data->skipped)
> -		return -ENOENT;
> -	inode->i_ino = (unsigned long)(data->no_addr);
> -	ip->i_no_addr = data->no_addr;
> -	return 0;
> -}
> -
> -struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr, int non_block)
> -{
> -	unsigned long hash = (unsigned long)no_addr;
> -	struct gfs2_skip_data data;
> -
> -	data.no_addr = no_addr;
> -	data.skipped = 0;
> -	data.non_block = non_block;
> -	return ilookup5(sb, hash, iget_test, &data);
> -}
> -
> -static struct inode *gfs2_iget(struct super_block *sb, u64 no_addr,
> -			       int non_block)
> -{
> -	struct gfs2_skip_data data;
> -	unsigned long hash = (unsigned long)no_addr;
> -
> -	data.no_addr = no_addr;
> -	data.skipped = 0;
> -	data.non_block = non_block;
> -	return iget5_locked(sb, hash, iget_test, iget_set, &data);
> +	return ilookup(sb, (unsigned long)no_addr);
>   }
>   
>   /**
> @@ -145,8 +93,9 @@ struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
>   	struct gfs2_glock *io_gl = NULL;
>   	int error;
>   
> -	inode = gfs2_iget(sb, no_addr, non_block);
> +	inode = iget_locked(sb, (unsigned long)no_addr);
>   	ip = GFS2_I(inode);
> +	ip->i_no_addr = no_addr;
>   
>   	if (!inode)
>   		return ERR_PTR(-ENOMEM);
> diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
> index ba4d949..22c27a8 100644
> --- a/fs/gfs2/inode.h
> +++ b/fs/gfs2/inode.h
> @@ -99,7 +99,7 @@ extern struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned type,
>   extern struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
>   					 u64 *no_formal_ino,
>   					 unsigned int blktype);
> -extern struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr, int nonblock);
> +extern struct inode *gfs2_ilookup(struct super_block *sb, u64 no_addr);
>   
>   extern int gfs2_inode_refresh(struct gfs2_inode *ip);
>   



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 3/4] GFS2: Eliminate parameter non_block on gfs2_inode_lookup
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 3/4] GFS2: Eliminate parameter non_block on gfs2_inode_lookup Bob Peterson
@ 2016-02-23 10:17   ` Steven Whitehouse
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Whitehouse @ 2016-02-23 10:17 UTC (permalink / raw)
  To: cluster-devel.redhat.com


Acked-by: Steven Whitehouse <swhiteho@redhat.com>

On 18/12/15 19:02, Bob Peterson wrote:
> Now that we're not filtering out I_FREEING inodes from our lookups
> anymore, we can eliminate the non_block parameter from the lookup
> function.
>
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/dir.c        | 2 +-
>   fs/gfs2/inode.c      | 5 ++---
>   fs/gfs2/inode.h      | 3 +--
>   fs/gfs2/ops_fstype.c | 2 +-
>   4 files changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> index 6a92592..2bef023 100644
> --- a/fs/gfs2/dir.c
> +++ b/fs/gfs2/dir.c
> @@ -1660,7 +1660,7 @@ struct inode *gfs2_dir_search(struct inode *dir, const struct qstr *name,
>   		brelse(bh);
>   		if (fail_on_exist)
>   			return ERR_PTR(-EEXIST);
> -		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino, 0);
> +		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino);
>   		if (!IS_ERR(inode))
>   			GFS2_I(inode)->i_rahead = rahead;
>   		return inode;
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 2adc1ee..01b6d1a 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -80,13 +80,12 @@ static void gfs2_set_iop(struct inode *inode)
>    * @sb: The super block
>    * @no_addr: The inode number
>    * @type: The type of the inode
> - * non_block: Can we block on inodes that are being freed?
>    *
>    * Returns: A VFS inode, or an error
>    */
>   
>   struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
> -				u64 no_addr, u64 no_formal_ino, int non_block)
> +				u64 no_addr, u64 no_formal_ino)
>   {
>   	struct inode *inode;
>   	struct gfs2_inode *ip;
> @@ -170,7 +169,7 @@ struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
>   	if (error)
>   		goto fail;
>   
> -	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0, 1);
> +	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0);
>   	if (IS_ERR(inode))
>   		goto fail;
>   
> diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
> index 22c27a8..e1af0d4 100644
> --- a/fs/gfs2/inode.h
> +++ b/fs/gfs2/inode.h
> @@ -94,8 +94,7 @@ err:
>   }
>   
>   extern struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned type,
> -				       u64 no_addr, u64 no_formal_ino,
> -				       int non_block);
> +				       u64 no_addr, u64 no_formal_ino);
>   extern struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
>   					 u64 *no_formal_ino,
>   					 unsigned int blktype);
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 7aacdf2..9ed522c 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -454,7 +454,7 @@ static int gfs2_lookup_root(struct super_block *sb, struct dentry **dptr,
>   	struct dentry *dentry;
>   	struct inode *inode;
>   
> -	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0, 0);
> +	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0);
>   	if (IS_ERR(inode)) {
>   		fs_err(sdp, "can't read in %s inode: %ld\n", name, PTR_ERR(inode));
>   		return PTR_ERR(inode);



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path Bob Peterson
@ 2016-02-23 10:25   ` Steven Whitehouse
  2016-02-24 20:42     ` Bob Peterson
  2016-03-21 16:49   ` Benjamin Marzinski
  1 sibling, 1 reply; 19+ messages in thread
From: Steven Whitehouse @ 2016-02-23 10:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 18/12/15 19:02, Bob Peterson wrote:
> This patch changes the inode reclaiming code, delete_work_func, use
> the normal gfs2_inode_lookup, but with a special parameter, for_del
> which causes additional checks to be made.
I assume that this was just intended to be a clean up rather than 
something that fixes a bug? At least I can't see any obvious changes to 
the operations, just the way in which they are carried out. I think it 
would be neater to turn the non_block parameter of gfs2_inode_lookup() 
into a flags argument, rather than adding an additional argument which 
is only a bool.

I'm not sure that factoring out some of the checks from 
gfs2_lookup_by_inum() buys us that much though, since that function 
still has to remain in its current form for NFS anyway,

Steve.

> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>   fs/gfs2/dir.c        |  2 +-
>   fs/gfs2/glock.c      |  9 +--------
>   fs/gfs2/inode.c      | 32 ++++++++++++++++++++++++++++----
>   fs/gfs2/inode.h      |  3 ++-
>   fs/gfs2/ops_fstype.c |  2 +-
>   5 files changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> index 2bef023..6a92592 100644
> --- a/fs/gfs2/dir.c
> +++ b/fs/gfs2/dir.c
> @@ -1660,7 +1660,7 @@ struct inode *gfs2_dir_search(struct inode *dir, const struct qstr *name,
>   		brelse(bh);
>   		if (fail_on_exist)
>   			return ERR_PTR(-EEXIST);
> -		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino);
> +		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino, 0);
>   		if (!IS_ERR(inode))
>   			GFS2_I(inode)->i_rahead = rahead;
>   		return inode;
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 3e16d82..795c2f3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -568,7 +568,6 @@ static void delete_work_func(struct work_struct *work)
>   {
>   	struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_delete);
>   	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> -	struct gfs2_inode *ip;
>   	struct inode *inode;
>   	u64 no_addr = gl->gl_name.ln_number;
>   
> @@ -578,13 +577,7 @@ static void delete_work_func(struct work_struct *work)
>   	if (test_bit(GLF_INODE_CREATING, &gl->gl_flags))
>   		goto out;
>   
> -	ip = gl->gl_object;
> -	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
> -
> -	if (ip)
> -		inode = gfs2_ilookup(sdp->sd_vfs, no_addr);
> -	else
> -		inode = gfs2_lookup_by_inum(sdp, no_addr, NULL, GFS2_BLKST_UNLINKED);
> +	inode = gfs2_inode_lookup(sdp->sd_vfs, DT_UNKNOWN, no_addr, 0, 1);
>   	if (inode && !IS_ERR(inode)) {
>   		d_prune_aliases(inode);
>   		iput(inode);
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 01b6d1a..0357862 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -75,17 +75,37 @@ static void gfs2_set_iop(struct inode *inode)
>   	}
>   }
>   
> +static int verify_unlinked(struct gfs2_inode *ip)
> +{
> +	struct gfs2_holder i_gh;
> +	struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
> +	int error;
> +
> +	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_SKIP, &i_gh);
> +	if (error)
> +		return error;
> +
> +	error = gfs2_check_blk_type(sdp, ip->i_no_addr, GFS2_BLKST_UNLINKED);
> +	if (error == 0)
> +		error = gfs2_inode_refresh(ip);
> +
> +	gfs2_glock_dq_uninit(&i_gh);
> +	return error;
> +}
> +
>   /**
>    * gfs2_inode_lookup - Lookup an inode
>    * @sb: The super block
> - * @no_addr: The inode number
>    * @type: The type of the inode
> + * @no_addr: The inode number
> + * @no_formal_ino: The formal inode number
> + * @for_del: Is this lookup for deleting purposes?
>    *
>    * Returns: A VFS inode, or an error
>    */
>   
>   struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
> -				u64 no_addr, u64 no_formal_ino)
> +				u64 no_addr, u64 no_formal_ino, int for_del)
>   {
>   	struct inode *inode;
>   	struct gfs2_inode *ip;
> @@ -121,7 +141,11 @@ struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
>   		gfs2_glock_put(io_gl);
>   		io_gl = NULL;
>   
> -		if (type == DT_UNKNOWN) {
> +		if (for_del) {
> +			error = verify_unlinked(ip);
> +			if (error)
> +				goto fail_refresh;
> +		} else if (type == DT_UNKNOWN) {
>   			/* Inode glock must be locked already */
>   			error = gfs2_inode_refresh(GFS2_I(inode));
>   			if (error)
> @@ -169,7 +193,7 @@ struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
>   	if (error)
>   		goto fail;
>   
> -	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0);
> +	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0, 0);
>   	if (IS_ERR(inode))
>   		goto fail;
>   
> diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
> index e1af0d4..d6233e9 100644
> --- a/fs/gfs2/inode.h
> +++ b/fs/gfs2/inode.h
> @@ -94,7 +94,8 @@ err:
>   }
>   
>   extern struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned type,
> -				       u64 no_addr, u64 no_formal_ino);
> +				       u64 no_addr, u64 no_formal_ino,
> +				       int for_del);
>   extern struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
>   					 u64 *no_formal_ino,
>   					 unsigned int blktype);
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 9ed522c..7aacdf2 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -454,7 +454,7 @@ static int gfs2_lookup_root(struct super_block *sb, struct dentry **dptr,
>   	struct dentry *dentry;
>   	struct inode *inode;
>   
> -	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0);
> +	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0, 0);
>   	if (IS_ERR(inode)) {
>   		fs_err(sdp, "can't read in %s inode: %ld\n", name, PTR_ERR(inode));
>   		return PTR_ERR(inode);



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-02-23 10:25   ` Steven Whitehouse
@ 2016-02-24 20:42     ` Bob Peterson
  2016-02-25 10:07       ` Steven Whitehouse
  0 siblings, 1 reply; 19+ messages in thread
From: Bob Peterson @ 2016-02-24 20:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> Hi,
> 
> On 18/12/15 19:02, Bob Peterson wrote:
> > This patch changes the inode reclaiming code, delete_work_func, use
> > the normal gfs2_inode_lookup, but with a special parameter, for_del
> > which causes additional checks to be made.
> I assume that this was just intended to be a clean up rather than
> something that fixes a bug? At least I can't see any obvious changes to
> the operations, just the way in which they are carried out. I think it
> would be neater to turn the non_block parameter of gfs2_inode_lookup()
> into a flags argument, rather than adding an additional argument which
> is only a bool.
> 
> I'm not sure that factoring out some of the checks from
> gfs2_lookup_by_inum() buys us that much though, since that function
> still has to remain in its current form for NFS anyway,
> 
> Steve.

Hi Steve,

Actually, no, it's not a cleanup at all. Without this patch, I can
recreate one of the problems I'm trying to solve here, which is:

[nate_bob1] [fsck] Unlinked inode found at block 483992 (0x76298).
[nate_bob1] [fsck] Block 483993 (0x76299) bitmap says 1 (Data) but FSCK saw 0 (Free)
[nate_bob1] [fsck] Bitmap at block 483993 (0x76299) left inconsistent
[nate_bob1] [fsck] Block 483994 (0x7629a) bitmap says 1 (Data) but FSCK saw 0 (Free)
[nate_bob1] [fsck] Bitmap at block 483994 (0x7629a) left inconsistent
[nate_bob1] [fsck] Block 483995 (0x7629b) bitmap says 1 (Data) but FSCK saw 0 (Free)
[nate_bob1] [fsck] Bitmap at block 483995 (0x7629b) left inconsistent
[nate_bob1] [fsck] Block 483996 (0x7629c) bitmap says 1 (Data) but FSCK saw 0 (Free)
[nate_bob1] [fsck] Bitmap at block 483996 (0x7629c) left inconsistent
[nate_bob1] [fsck] Block 483997 (0x7629d) bitmap says 1 (Data) but FSCK saw 0 (Free)
[nate_bob1] [fsck] Bitmap at block 483997 (0x7629d) left inconsistent
etc.

Turns out if you use the regular inode lookup path, the problem goes away.

Initially (several months ago) I had written a new inode lookup function
just for the transition from unlinked to free. I debugged it and got it
working fairly well, only to discover that my code did the exact same things
in the exact same order as the original lookup. So I deleted my new function
in favor of using the original, hence, this patch.

My concern, of course, is that NFS may also be vulnerable to this problem,
but I didn't want to mess with the NFS path any more than I had to, so I
left the original inum lookup there for it to use. At some point we should
try to determine if NFS has an issue related to this. I suspect it does.

Regards,

Bob Peterson
Red Hat File Systems



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-02-24 20:42     ` Bob Peterson
@ 2016-02-25 10:07       ` Steven Whitehouse
  2016-03-15 13:50         ` Bob Peterson
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Whitehouse @ 2016-02-25 10:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com



On 24/02/16 20:42, Bob Peterson wrote:
> ----- Original Message -----
>> Hi,
>>
>> On 18/12/15 19:02, Bob Peterson wrote:
>>> This patch changes the inode reclaiming code, delete_work_func, use
>>> the normal gfs2_inode_lookup, but with a special parameter, for_del
>>> which causes additional checks to be made.
>> I assume that this was just intended to be a clean up rather than
>> something that fixes a bug? At least I can't see any obvious changes to
>> the operations, just the way in which they are carried out. I think it
>> would be neater to turn the non_block parameter of gfs2_inode_lookup()
>> into a flags argument, rather than adding an additional argument which
>> is only a bool.
>>
>> I'm not sure that factoring out some of the checks from
>> gfs2_lookup_by_inum() buys us that much though, since that function
>> still has to remain in its current form for NFS anyway,
>>
>> Steve.
> Hi Steve,
>
> Actually, no, it's not a cleanup at all. Without this patch, I can
> recreate one of the problems I'm trying to solve here, which is:
>
> [nate_bob1] [fsck] Unlinked inode found at block 483992 (0x76298).
> [nate_bob1] [fsck] Block 483993 (0x76299) bitmap says 1 (Data) but FSCK saw 0 (Free)
> [nate_bob1] [fsck] Bitmap at block 483993 (0x76299) left inconsistent
> [nate_bob1] [fsck] Block 483994 (0x7629a) bitmap says 1 (Data) but FSCK saw 0 (Free)
> [nate_bob1] [fsck] Bitmap at block 483994 (0x7629a) left inconsistent
> [nate_bob1] [fsck] Block 483995 (0x7629b) bitmap says 1 (Data) but FSCK saw 0 (Free)
> [nate_bob1] [fsck] Bitmap at block 483995 (0x7629b) left inconsistent
> [nate_bob1] [fsck] Block 483996 (0x7629c) bitmap says 1 (Data) but FSCK saw 0 (Free)
> [nate_bob1] [fsck] Bitmap at block 483996 (0x7629c) left inconsistent
> [nate_bob1] [fsck] Block 483997 (0x7629d) bitmap says 1 (Data) but FSCK saw 0 (Free)
> [nate_bob1] [fsck] Bitmap at block 483997 (0x7629d) left inconsistent
> etc.
>
> Turns out if you use the regular inode lookup path, the problem goes away.
>
> Initially (several months ago) I had written a new inode lookup function
> just for the transition from unlinked to free. I debugged it and got it
> working fairly well, only to discover that my code did the exact same things
> in the exact same order as the original lookup. So I deleted my new function
> in favor of using the original, hence, this patch.
>
> My concern, of course, is that NFS may also be vulnerable to this problem,
> but I didn't want to mess with the NFS path any more than I had to, so I
> left the original inum lookup there for it to use. At some point we should
> try to determine if NFS has an issue related to this. I suspect it does.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
I think we need to know the answer to that before we make this change. 
Also, what is the problem? I'm still not sure that I understand the 
actual issue here. It is down to the ordering of operations, or 
different locking, or what exactly?

The fsck output seems to indicate that there is an unlinked inode in 
which some of the blocks are marked as free. If that occurs without 
there being an un-recovered journal then that is certainly a problem, 
but it would be with the transaction code, so I'm still not quite sure 
how this patch relates to it. It looks like it needs more investigation,

Steve.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-02-25 10:07       ` Steven Whitehouse
@ 2016-03-15 13:50         ` Bob Peterson
  2016-03-16 11:09           ` Steven Whitehouse
  0 siblings, 1 reply; 19+ messages in thread
From: Bob Peterson @ 2016-03-15 13:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> On 24/02/16 20:42, Bob Peterson wrote:
> > ----- Original Message -----
> >> Hi,
> >>
> >> On 18/12/15 19:02, Bob Peterson wrote:
> >>> This patch changes the inode reclaiming code, delete_work_func, use
> >>> the normal gfs2_inode_lookup, but with a special parameter, for_del
> >>> which causes additional checks to be made.
> >> I assume that this was just intended to be a clean up rather than
> >> something that fixes a bug? At least I can't see any obvious changes to
> >> the operations, just the way in which they are carried out. I think it
> >> would be neater to turn the non_block parameter of gfs2_inode_lookup()
> >> into a flags argument, rather than adding an additional argument which
> >> is only a bool.
> >>
> >> I'm not sure that factoring out some of the checks from
> >> gfs2_lookup_by_inum() buys us that much though, since that function
> >> still has to remain in its current form for NFS anyway,
> >>
> >> Steve.
> > Hi Steve,
> >
> > Actually, no, it's not a cleanup at all. Without this patch, I can
> > recreate one of the problems I'm trying to solve here, which is:
> >
> > [nate_bob1] [fsck] Unlinked inode found at block 483992 (0x76298).
> > [nate_bob1] [fsck] Block 483993 (0x76299) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483993 (0x76299) left inconsistent
> > [nate_bob1] [fsck] Block 483994 (0x7629a) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483994 (0x7629a) left inconsistent
> > [nate_bob1] [fsck] Block 483995 (0x7629b) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483995 (0x7629b) left inconsistent
> > [nate_bob1] [fsck] Block 483996 (0x7629c) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483996 (0x7629c) left inconsistent
> > [nate_bob1] [fsck] Block 483997 (0x7629d) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483997 (0x7629d) left inconsistent
> > etc.
> >
> > Turns out if you use the regular inode lookup path, the problem goes away.
> >
> > Initially (several months ago) I had written a new inode lookup function
> > just for the transition from unlinked to free. I debugged it and got it
> > working fairly well, only to discover that my code did the exact same
> > things
> > in the exact same order as the original lookup. So I deleted my new
> > function
> > in favor of using the original, hence, this patch.
> >
> > My concern, of course, is that NFS may also be vulnerable to this problem,
> > but I didn't want to mess with the NFS path any more than I had to, so I
> > left the original inum lookup there for it to use. At some point we should
> > try to determine if NFS has an issue related to this. I suspect it does.
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> I think we need to know the answer to that before we make this change.
> Also, what is the problem? I'm still not sure that I understand the
> actual issue here. It is down to the ordering of operations, or
> different locking, or what exactly?
> 
> The fsck output seems to indicate that there is an unlinked inode in
> which some of the blocks are marked as free. If that occurs without
> there being an un-recovered journal then that is certainly a problem,
> but it would be with the transaction code, so I'm still not quite sure
> how this patch relates to it. It looks like it needs more investigation,
> 
> Steve.

Hi Steve (and all),

I've done a lot of analysis of this issue, and run into multiple complications,
chief among them that these problems often manifest in unmount after the
debugfs files have been destroyed. But I finally have some answers.

In almost all cases, what's happening is that delete_work_func sees
gl->gl_object and calls gfs2_ilookup, but since the inode is gone at that point,
it returns an error. So lookup_by_inum never gets called, and therefore function
delete_work_func just exits without doing anything. So the inode is left in an
unlinked state.

I tried making it call lookup_by_inum as a fallback if gfs2_ilookup doesn't find
the inode, but that causes a hang in unmount. This can still happen in today's
code, but it's less likely than the first scenario.

The hang is caused because gfs2_lookup_by_inum() grabs the inode glock first, then
the vfs inode lock. The normal lookup path, gfs2_inode_lookup, gets the vfs inode
lock first, then the iopen glock, before everything else. It doesn't really lock
the inode glock, but the caller usually does that after it returns from the lookup.

So in the unmount hang, the delete func is holding the inode glock and waiting for
the vfs inode which is stuck in I_FREEING state. That may go back to that patch I
reverted which skips over I_FREEING inodes. But if that patch is in place, the
I_FREEING inode is skipped, and once again, the inode is left in an UNLINKED state.

The reason the regular lookup path has no problems is because it grabs the vfs inode
lock first, which waits for the I_FREEING inode before going on to lock any glocks.

What this means is that NFS very likely has a similar hang, since it also uses
lookup_by_inum. So we have to figure out what we're going to do about that.
Perhaps you could explain why we wrote lookup_by_inum to begin with?

I'm also greatly concerned because of something else I notice in doing all this
analysis: the delete_work_func goes by the inode number in the glock. There's that
whole concept that we can't dereference ip because we don't have proper locks. But
if the glock has since been reused for another mount point, which also happens to
have an inode at that location, it might end up deleting a file from the wrong file
system. The delete_work_func gets its superblock pointer from the glock. The saving
grace, of course, is that the delete func verifies the block is in an UNLINKEd state,
which means it can't be too far off the rails. Still, it's distressing to me that we
can have two delete_work_func processes, and they can both be operating on the same
file system, and even contending for the same glock due to dlm sending callbacks.
And I'm not sure what to do about it. And yes, I have proof that this is actually
happening. Check out this excerpt from the glocks debugfs file:

G:  s:UN n:2/5a299 f:lqb t:EX d:EX/0 a:0 v:0 r:5 m:200
 H: s:EX f:W e:0 p:2750 [delete_workqueu] gfs2_glock_nq_num+0x5b/0xa0 [gfs2]
 H: s:EX f:W e:0 p:2749 [delete_workqueu] gfs2_glock_nq_num+0x5b/0xa0 [gfs2]

There are only two mount points, which means there are only two delete_work queues,
which means both these processes should not be operating on the same glock.
But somehow they are.

Ordinarily I would hope that when the glock was enqueued to the delete work queue,
it would have a reference that would prevent the glock from being reclaimed and
reused until after dlm sends its callback. But somehow it is happening, and I
need to understand exactly why.

Regards,

Bob Peterson
Red Hat File Systems



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-03-15 13:50         ` Bob Peterson
@ 2016-03-16 11:09           ` Steven Whitehouse
  2016-03-17 16:01             ` Bob Peterson
  0 siblings, 1 reply; 19+ messages in thread
From: Steven Whitehouse @ 2016-03-16 11:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 15/03/16 13:50, Bob Peterson wrote:
> ----- Original Message -----
>> On 24/02/16 20:42, Bob Peterson wrote:
>>> ----- Original Message -----
>>>> Hi,
>>>>
>>>> On 18/12/15 19:02, Bob Peterson wrote:
>>>>> This patch changes the inode reclaiming code, delete_work_func, use
>>>>> the normal gfs2_inode_lookup, but with a special parameter, for_del
>>>>> which causes additional checks to be made.
>>>> I assume that this was just intended to be a clean up rather than
>>>> something that fixes a bug? At least I can't see any obvious changes to
>>>> the operations, just the way in which they are carried out. I think it
>>>> would be neater to turn the non_block parameter of gfs2_inode_lookup()
>>>> into a flags argument, rather than adding an additional argument which
>>>> is only a bool.
>>>>
>>>> I'm not sure that factoring out some of the checks from
>>>> gfs2_lookup_by_inum() buys us that much though, since that function
>>>> still has to remain in its current form for NFS anyway,
>>>>
>>>> Steve.
>>> Hi Steve,
>>>
>>> Actually, no, it's not a cleanup at all. Without this patch, I can
>>> recreate one of the problems I'm trying to solve here, which is:
>>>
>>> [nate_bob1] [fsck] Unlinked inode found at block 483992 (0x76298).
>>> [nate_bob1] [fsck] Block 483993 (0x76299) bitmap says 1 (Data) but FSCK saw
>>> 0 (Free)
>>> [nate_bob1] [fsck] Bitmap at block 483993 (0x76299) left inconsistent
>>> [nate_bob1] [fsck] Block 483994 (0x7629a) bitmap says 1 (Data) but FSCK saw
>>> 0 (Free)
>>> [nate_bob1] [fsck] Bitmap at block 483994 (0x7629a) left inconsistent
>>> [nate_bob1] [fsck] Block 483995 (0x7629b) bitmap says 1 (Data) but FSCK saw
>>> 0 (Free)
>>> [nate_bob1] [fsck] Bitmap at block 483995 (0x7629b) left inconsistent
>>> [nate_bob1] [fsck] Block 483996 (0x7629c) bitmap says 1 (Data) but FSCK saw
>>> 0 (Free)
>>> [nate_bob1] [fsck] Bitmap at block 483996 (0x7629c) left inconsistent
>>> [nate_bob1] [fsck] Block 483997 (0x7629d) bitmap says 1 (Data) but FSCK saw
>>> 0 (Free)
>>> [nate_bob1] [fsck] Bitmap at block 483997 (0x7629d) left inconsistent
>>> etc.
>>>
>>> Turns out if you use the regular inode lookup path, the problem goes away.
>>>
>>> Initially (several months ago) I had written a new inode lookup function
>>> just for the transition from unlinked to free. I debugged it and got it
>>> working fairly well, only to discover that my code did the exact same
>>> things
>>> in the exact same order as the original lookup. So I deleted my new
>>> function
>>> in favor of using the original, hence, this patch.
>>>
>>> My concern, of course, is that NFS may also be vulnerable to this problem,
>>> but I didn't want to mess with the NFS path any more than I had to, so I
>>> left the original inum lookup there for it to use. At some point we should
>>> try to determine if NFS has an issue related to this. I suspect it does.
>>>
>>> Regards,
>>>
>>> Bob Peterson
>>> Red Hat File Systems
>> I think we need to know the answer to that before we make this change.
>> Also, what is the problem? I'm still not sure that I understand the
>> actual issue here. It is down to the ordering of operations, or
>> different locking, or what exactly?
>>
>> The fsck output seems to indicate that there is an unlinked inode in
>> which some of the blocks are marked as free. If that occurs without
>> there being an un-recovered journal then that is certainly a problem,
>> but it would be with the transaction code, so I'm still not quite sure
>> how this patch relates to it. It looks like it needs more investigation,
>>
>> Steve.
> Hi Steve (and all),
>
> I've done a lot of analysis of this issue, and run into multiple complications,
> chief among them that these problems often manifest in unmount after the
> debugfs files have been destroyed. But I finally have some answers.
>
> In almost all cases, what's happening is that delete_work_func sees
> gl->gl_object and calls gfs2_ilookup, but since the inode is gone at that point,
> it returns an error. So lookup_by_inum never gets called, and therefore function
> delete_work_func just exits without doing anything. So the inode is left in an
> unlinked state.
>
> I tried making it call lookup_by_inum as a fallback if gfs2_ilookup doesn't find
> the inode, but that causes a hang in unmount. This can still happen in today's
> code, but it's less likely than the first scenario.
So that bit sounds like there is a race between clearing gl_object and 
the test that the delete_work_func uses. So that is perhaps one thing to 
look at.

> The hang is caused because gfs2_lookup_by_inum() grabs the inode glock first, then
> the vfs inode lock. The normal lookup path, gfs2_inode_lookup, gets the vfs inode
> lock first, then the iopen glock, before everything else. It doesn't really lock
> the inode glock, but the caller usually does that after it returns from the lookup.
I'm not sure I follow... which "vfs inode lock" is the problem here? 
Where is it called in/from gfs2_lookup_by_inum ?

> So in the unmount hang, the delete func is holding the inode glock and waiting for
> the vfs inode which is stuck in I_FREEING state. That may go back to that patch I
> reverted which skips over I_FREEING inodes. But if that patch is in place, the
> I_FREEING inode is skipped, and once again, the inode is left in an UNLINKED state.
The delete function should not be waiting for I_FREEING inodes, since 
those are already being freed, so there is no need to wait for them - we 
should just ignore them I think, since they will go away anyway.

> The reason the regular lookup path has no problems is because it grabs the vfs inode
> lock first, which waits for the I_FREEING inode before going on to lock any glocks.
>
> What this means is that NFS very likely has a similar hang, since it also uses
> lookup_by_inum. So we have to figure out what we're going to do about that.
> Perhaps you could explain why we wrote lookup_by_inum to begin with?
The reason is that we have to be very careful when looking up inodes by 
number, because we do not hold the parent directory lock as we would 
during normal lookup, so that the inode can vanish from under us at any 
time. So we have to grab the glock first, and then make several checks 
before we can read in the inode's details.

> I'm also greatly concerned because of something else I notice in doing all this
> analysis: the delete_work_func goes by the inode number in the glock. There's that
> whole concept that we can't dereference ip because we don't have proper locks. But
> if the glock has since been reused for another mount point, which also happens to
> have an inode at that location, it might end up deleting a file from the wrong file
> system. The delete_work_func gets its superblock pointer from the glock. The saving
> grace, of course, is that the delete func verifies the block is in an UNLINKEd state,
> which means it can't be too far off the rails. Still, it's distressing to me that we
> can have two delete_work_func processes, and they can both be operating on the same
> file system, and even contending for the same glock due to dlm sending callbacks.
> And I'm not sure what to do about it. And yes, I have proof that this is actually
> happening. Check out this excerpt from the glocks debugfs file:
>
> G:  s:UN n:2/5a299 f:lqb t:EX d:EX/0 a:0 v:0 r:5 m:200
>   H: s:EX f:W e:0 p:2750 [delete_workqueu] gfs2_glock_nq_num+0x5b/0xa0 [gfs2]
>   H: s:EX f:W e:0 p:2749 [delete_workqueu] gfs2_glock_nq_num+0x5b/0xa0 [gfs2]
I don't think that proves that there is a problem. It is probably the 
same workqueue that has queued up the same request twice. That is 
something that we want to prevent from happening, but I'm not sure how 
the sb might get changed on a glock.... that would be a pretty big issue 
and something that I'm sure we'd have noticed before.

> There are only two mount points, which means there are only two delete_work queues,
> which means both these processes should not be operating on the same glock.
> But somehow they are.
There might only be two workqueues, but that doesn't prevent there being 
multiple worker processes per workqueue.

>
> Ordinarily I would hope that when the glock was enqueued to the delete work queue,
> it would have a reference that would prevent the glock from being reclaimed and
> reused until after dlm sends its callback. But somehow it is happening, and I
> need to understand exactly why.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-03-16 11:09           ` Steven Whitehouse
@ 2016-03-17 16:01             ` Bob Peterson
  2016-03-17 16:27               ` Steven Whitehouse
  0 siblings, 1 reply; 19+ messages in thread
From: Bob Peterson @ 2016-03-17 16:01 UTC (permalink / raw)
  To: cluster-devel.redhat.com

----- Original Message -----
> > Hi Steve (and all),
> >
> > I've done a lot of analysis of this issue, and run into multiple
> > complications,
> > chief among them that these problems often manifest in unmount after the
> > debugfs files have been destroyed. But I finally have some answers.
> >
> > In almost all cases, what's happening is that delete_work_func sees
> > gl->gl_object and calls gfs2_ilookup, but since the inode is gone at that
> > point,
> > it returns an error. So lookup_by_inum never gets called, and therefore
> > function
> > delete_work_func just exits without doing anything. So the inode is left in
> > an
> > unlinked state.
> >
> > I tried making it call lookup_by_inum as a fallback if gfs2_ilookup doesn't
> > find
> > the inode, but that causes a hang in unmount. This can still happen in
> > today's
> > code, but it's less likely than the first scenario.
> So that bit sounds like there is a race between clearing gl_object and
> the test that the delete_work_func uses. So that is perhaps one thing to
> look at.
> 
> > The hang is caused because gfs2_lookup_by_inum() grabs the inode glock
> > first, then
> > the vfs inode lock. The normal lookup path, gfs2_inode_lookup, gets the vfs
> > inode
> > lock first, then the iopen glock, before everything else. It doesn't really
> > lock
> > the inode glock, but the caller usually does that after it returns from the
> > lookup.
> I'm not sure I follow... which "vfs inode lock" is the problem here?
> Where is it called in/from gfs2_lookup_by_inum ?
> 
> > So in the unmount hang, the delete func is holding the inode glock and
> > waiting for
> > the vfs inode which is stuck in I_FREEING state. That may go back to that
> > patch I
> > reverted which skips over I_FREEING inodes. But if that patch is in place,
> > the
> > I_FREEING inode is skipped, and once again, the inode is left in an
> > UNLINKED state.
> The delete function should not be waiting for I_FREEING inodes, since
> those are already being freed, so there is no need to wait for them - we
> should just ignore them I think, since they will go away anyway.

VFS needs to wait for the I_FREEING inode before it can create another
inode in its place. We need the replacement in order to do our iput
which will cause the gfs2_delete_inode which changes it from unlinked to free.
If we don't, the result is a bunch of inodes that stay in unlinked, which is
the cause of the du vs. df issue.

> > The reason the regular lookup path has no problems is because it grabs the
> > vfs inode
> > lock first, which waits for the I_FREEING inode before going on to lock any
> > glocks.
> >
> > What this means is that NFS very likely has a similar hang, since it also
> > uses
> > lookup_by_inum. So we have to figure out what we're going to do about that.
> > Perhaps you could explain why we wrote lookup_by_inum to begin with?
> The reason is that we have to be very careful when looking up inodes by
> number, because we do not hold the parent directory lock as we would
> during normal lookup, so that the inode can vanish from under us at any
> time. So we have to grab the glock first, and then make several checks
> before we can read in the inode's details.

That makes good sense.
 
Perhaps the crux of the problem is best explained by comparing today's
lookup path by inum (which hangs) versus the regular lookup path, which I
used in this patch, which seemed to fix the problem:

A = glock lock
B = VFS-level inode lock

The original delete lookup path              Normal lookup path:
-----------------------------------------    -------------------------------------
delete_work_func:                            delete_work_func:
{                                            {
   if (gl_object)
      inode = gfs2_ilookup(sdp, no_addr); <-------This one often fails, meaning
                                                  that we don't hang, but we also
                                                  don't transition from unlinked to free.
   else {
      inode = gfs2_lookup_by_inum         <-------This is the problematic code path
A        1. glock_nq inode glock
         2. check block type == unlinked
         3. inode_lookup                      inode_lookup
B              iget_locked (inode_lock)      B   iget_locked (inode_lock)
                  __wait_on_freeing_inode           __wait_on_freeing_inode
               if I_NEW                          if I_NEW
                                                    verify unlinked <------- New section I added
	   				     A         glock nq inode glock
		   				       check block type
					     A         glock dq inode glock
                                                    <----------------------- End of new section
                   unlock_new_inode/i_state         unlock_new_inode/i_state
A        4. glock dq inode glock
   }
B  iput / atomic_dec_and_lock (inode_lock)   B  iput / atomic_dec_and_lock (inode_lock)
      generic_drop_inode                           generic_drop_inode
         generic_delete_inode                         generic_delete_inode
            gfs2_delete_inode                            gfs2_delete_inode
A               gfs2_glock_nq_init(ip->i_gl) A              gfs2_glock_nq_init(ip->i_gl)
A               gfs2_glock_dq_uninit         A              gfs2_glock_dq_uninit
}                                            }

With the lookup_by_inum path, the code gets into an ABAB deadlock where
one process holds the vfs-level lock and requires the GFS2 glock, and the
other does the opposite.

Regards,

Bob Peterson
Red Hat File Systems



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-03-17 16:01             ` Bob Peterson
@ 2016-03-17 16:27               ` Steven Whitehouse
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Whitehouse @ 2016-03-17 16:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On 17/03/16 16:01, Bob Peterson wrote:
> ----- Original Message -----
>>> Hi Steve (and all),
>>>
>>> I've done a lot of analysis of this issue, and run into multiple
>>> complications,
>>> chief among them that these problems often manifest in unmount after the
>>> debugfs files have been destroyed. But I finally have some answers.
>>>
>>> In almost all cases, what's happening is that delete_work_func sees
>>> gl->gl_object and calls gfs2_ilookup, but since the inode is gone at that
>>> point,
>>> it returns an error. So lookup_by_inum never gets called, and therefore
>>> function
>>> delete_work_func just exits without doing anything. So the inode is left in
>>> an
>>> unlinked state.
>>>
>>> I tried making it call lookup_by_inum as a fallback if gfs2_ilookup doesn't
>>> find
>>> the inode, but that causes a hang in unmount. This can still happen in
>>> today's
>>> code, but it's less likely than the first scenario.
>> So that bit sounds like there is a race between clearing gl_object and
>> the test that the delete_work_func uses. So that is perhaps one thing to
>> look at.
>>
>>> The hang is caused because gfs2_lookup_by_inum() grabs the inode glock
>>> first, then
>>> the vfs inode lock. The normal lookup path, gfs2_inode_lookup, gets the vfs
>>> inode
>>> lock first, then the iopen glock, before everything else. It doesn't really
>>> lock
>>> the inode glock, but the caller usually does that after it returns from the
>>> lookup.
>> I'm not sure I follow... which "vfs inode lock" is the problem here?
>> Where is it called in/from gfs2_lookup_by_inum ?
>>
>>> So in the unmount hang, the delete func is holding the inode glock and
>>> waiting for
>>> the vfs inode which is stuck in I_FREEING state. That may go back to that
>>> patch I
>>> reverted which skips over I_FREEING inodes. But if that patch is in place,
>>> the
>>> I_FREEING inode is skipped, and once again, the inode is left in an
>>> UNLINKED state.
>> The delete function should not be waiting for I_FREEING inodes, since
>> those are already being freed, so there is no need to wait for them - we
>> should just ignore them I think, since they will go away anyway.
> VFS needs to wait for the I_FREEING inode before it can create another
> inode in its place. We need the replacement in order to do our iput
> which will cause the gfs2_delete_inode which changes it from unlinked to free.
> If we don't, the result is a bunch of inodes that stay in unlinked, which is
> the cause of the du vs. df issue.
>
So it seems that there is a combination of a race, or a deadlock 
depending on whether we wait for I_FREEING or not. At least that now 
gives a clear statement of the problem. How to resolve it is another 
issue of course, but that should help to move things along

>>> The reason the regular lookup path has no problems is because it grabs the
>>> vfs inode
>>> lock first, which waits for the I_FREEING inode before going on to lock any
>>> glocks.
>>>
>>> What this means is that NFS very likely has a similar hang, since it also
>>> uses
>>> lookup_by_inum. So we have to figure out what we're going to do about that.
>>> Perhaps you could explain why we wrote lookup_by_inum to begin with?
>> The reason is that we have to be very careful when looking up inodes by
>> number, because we do not hold the parent directory lock as we would
>> during normal lookup, so that the inode can vanish from under us at any
>> time. So we have to grab the glock first, and then make several checks
>> before we can read in the inode's details.
> That makes good sense.
>   
> Perhaps the crux of the problem is best explained by comparing today's
> lookup path by inum (which hangs) versus the regular lookup path, which I
> used in this patch, which seemed to fix the problem:
>
> A = glock lock
> B = VFS-level inode lock
>
> The original delete lookup path              Normal lookup path:
> -----------------------------------------    -------------------------------------
> delete_work_func:                            delete_work_func:
> {                                            {
>     if (gl_object)
>        inode = gfs2_ilookup(sdp, no_addr); <-------This one often fails, meaning
>                                                    that we don't hang, but we also
>                                                    don't transition from unlinked to free.
That one was designed to be a quick test, in case the inode was 
unlinked, but still in use, and thus still cached. The other code path 
take much longer, so if we can avoid calling it in certain cases, that 
seemed like a good thing to do.
>     else {
>        inode = gfs2_lookup_by_inum         <-------This is the problematic code path
> A        1. glock_nq inode glock
>           2. check block type == unlinked
>           3. inode_lookup                      inode_lookup
> B              iget_locked (inode_lock)      B   iget_locked (inode_lock)
>                    __wait_on_freeing_inode           __wait_on_freeing_inode
>                 if I_NEW                          if I_NEW
>                                                      verify unlinked <------- New section I added
> 	   				     A         glock nq inode glock
> 		   				       check block type
> 					     A         glock dq inode glock
>                                                      <----------------------- End of new section
>                     unlock_new_inode/i_state         unlock_new_inode/i_state
> A        4. glock dq inode glock
>     }
> B  iput / atomic_dec_and_lock (inode_lock)   B  iput / atomic_dec_and_lock (inode_lock)
>        generic_drop_inode                           generic_drop_inode
>           generic_delete_inode                         generic_delete_inode
>              gfs2_delete_inode                            gfs2_delete_inode
> A               gfs2_glock_nq_init(ip->i_gl) A              gfs2_glock_nq_init(ip->i_gl)
> A               gfs2_glock_dq_uninit         A              gfs2_glock_dq_uninit
> }                                            }
>
> With the lookup_by_inum path, the code gets into an ABAB deadlock where
> one process holds the vfs-level lock and requires the GFS2 glock, and the
> other does the opposite.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
Yes, that makes sense to me, and it is a good write up of the issue,

Steve.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2015-12-18 19:02 ` [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path Bob Peterson
  2016-02-23 10:25   ` Steven Whitehouse
@ 2016-03-21 16:49   ` Benjamin Marzinski
  2016-03-21 18:12     ` Bob Peterson
  1 sibling, 1 reply; 19+ messages in thread
From: Benjamin Marzinski @ 2016-03-21 16:49 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Dec 18, 2015 at 01:02:30PM -0600, Bob Peterson wrote:
> This patch changes the inode reclaiming code, delete_work_func, use
> the normal gfs2_inode_lookup, but with a special parameter, for_del
> which causes additional checks to be made.

I get what you're doing here, and I can't see anything really wrong with
it, but I haven't gone through all the delete code as carefully as I
could.  I do have two questions.

1. I don't see any reason why, even if it's just for paranoias sake, we
don't avoid dereferencing the gfs2_inode in verify_unlinked before
locking and checking it. It seems like a simple fix that avoids any
chance that there is a problem due to not holding the directory glock
when we're getting the inode.

2. gfs2_inode_lookup calls iget_locked, which creates an inode if one
doesn't already exist.  If there's any chance of racing and calling
delete_work_func for an inode that has already been removed, won't we
just end up creating a new one here?

-Ben

> 
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>  fs/gfs2/dir.c        |  2 +-
>  fs/gfs2/glock.c      |  9 +--------
>  fs/gfs2/inode.c      | 32 ++++++++++++++++++++++++++++----
>  fs/gfs2/inode.h      |  3 ++-
>  fs/gfs2/ops_fstype.c |  2 +-
>  5 files changed, 33 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> index 2bef023..6a92592 100644
> --- a/fs/gfs2/dir.c
> +++ b/fs/gfs2/dir.c
> @@ -1660,7 +1660,7 @@ struct inode *gfs2_dir_search(struct inode *dir, const struct qstr *name,
>  		brelse(bh);
>  		if (fail_on_exist)
>  			return ERR_PTR(-EEXIST);
> -		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino);
> +		inode = gfs2_inode_lookup(dir->i_sb, dtype, addr, formal_ino, 0);
>  		if (!IS_ERR(inode))
>  			GFS2_I(inode)->i_rahead = rahead;
>  		return inode;
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 3e16d82..795c2f3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -568,7 +568,6 @@ static void delete_work_func(struct work_struct *work)
>  {
>  	struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_delete);
>  	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> -	struct gfs2_inode *ip;
>  	struct inode *inode;
>  	u64 no_addr = gl->gl_name.ln_number;
>  
> @@ -578,13 +577,7 @@ static void delete_work_func(struct work_struct *work)
>  	if (test_bit(GLF_INODE_CREATING, &gl->gl_flags))
>  		goto out;
>  
> -	ip = gl->gl_object;
> -	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
> -
> -	if (ip)
> -		inode = gfs2_ilookup(sdp->sd_vfs, no_addr);
> -	else
> -		inode = gfs2_lookup_by_inum(sdp, no_addr, NULL, GFS2_BLKST_UNLINKED);
> +	inode = gfs2_inode_lookup(sdp->sd_vfs, DT_UNKNOWN, no_addr, 0, 1);
>  	if (inode && !IS_ERR(inode)) {
>  		d_prune_aliases(inode);
>  		iput(inode);
> diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
> index 01b6d1a..0357862 100644
> --- a/fs/gfs2/inode.c
> +++ b/fs/gfs2/inode.c
> @@ -75,17 +75,37 @@ static void gfs2_set_iop(struct inode *inode)
>  	}
>  }
>  
> +static int verify_unlinked(struct gfs2_inode *ip)
> +{
> +	struct gfs2_holder i_gh;
> +	struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
> +	int error;
> +
> +	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_SKIP, &i_gh);
> +	if (error)
> +		return error;
> +
> +	error = gfs2_check_blk_type(sdp, ip->i_no_addr, GFS2_BLKST_UNLINKED);
> +	if (error == 0)
> +		error = gfs2_inode_refresh(ip);
> +
> +	gfs2_glock_dq_uninit(&i_gh);
> +	return error;
> +}
> +
>  /**
>   * gfs2_inode_lookup - Lookup an inode
>   * @sb: The super block
> - * @no_addr: The inode number
>   * @type: The type of the inode
> + * @no_addr: The inode number
> + * @no_formal_ino: The formal inode number
> + * @for_del: Is this lookup for deleting purposes?
>   *
>   * Returns: A VFS inode, or an error
>   */
>  
>  struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
> -				u64 no_addr, u64 no_formal_ino)
> +				u64 no_addr, u64 no_formal_ino, int for_del)
>  {
>  	struct inode *inode;
>  	struct gfs2_inode *ip;
> @@ -121,7 +141,11 @@ struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned int type,
>  		gfs2_glock_put(io_gl);
>  		io_gl = NULL;
>  
> -		if (type == DT_UNKNOWN) {
> +		if (for_del) {
> +			error = verify_unlinked(ip);
> +			if (error)
> +				goto fail_refresh;
> +		} else if (type == DT_UNKNOWN) {
>  			/* Inode glock must be locked already */
>  			error = gfs2_inode_refresh(GFS2_I(inode));
>  			if (error)
> @@ -169,7 +193,7 @@ struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
>  	if (error)
>  		goto fail;
>  
> -	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0);
> +	inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0, 0);
>  	if (IS_ERR(inode))
>  		goto fail;
>  
> diff --git a/fs/gfs2/inode.h b/fs/gfs2/inode.h
> index e1af0d4..d6233e9 100644
> --- a/fs/gfs2/inode.h
> +++ b/fs/gfs2/inode.h
> @@ -94,7 +94,8 @@ err:
>  }
>  
>  extern struct inode *gfs2_inode_lookup(struct super_block *sb, unsigned type, 
> -				       u64 no_addr, u64 no_formal_ino);
> +				       u64 no_addr, u64 no_formal_ino,
> +				       int for_del);
>  extern struct inode *gfs2_lookup_by_inum(struct gfs2_sbd *sdp, u64 no_addr,
>  					 u64 *no_formal_ino,
>  					 unsigned int blktype);
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 9ed522c..7aacdf2 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -454,7 +454,7 @@ static int gfs2_lookup_root(struct super_block *sb, struct dentry **dptr,
>  	struct dentry *dentry;
>  	struct inode *inode;
>  
> -	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0);
> +	inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0, 0);
>  	if (IS_ERR(inode)) {
>  		fs_err(sdp, "can't read in %s inode: %ld\n", name, PTR_ERR(inode));
>  		return PTR_ERR(inode);
> -- 
> 2.5.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-03-21 16:49   ` Benjamin Marzinski
@ 2016-03-21 18:12     ` Bob Peterson
  2016-03-21 18:56       ` Benjamin Marzinski
  0 siblings, 1 reply; 19+ messages in thread
From: Bob Peterson @ 2016-03-21 18:12 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ben,

----- Original Message -----
> On Fri, Dec 18, 2015 at 01:02:30PM -0600, Bob Peterson wrote:
> > This patch changes the inode reclaiming code, delete_work_func, use
> > the normal gfs2_inode_lookup, but with a special parameter, for_del
> > which causes additional checks to be made.
> 
> I get what you're doing here, and I can't see anything really wrong with
> it, but I haven't gone through all the delete code as carefully as I
> could.  I do have two questions.
> 
> 1. I don't see any reason why, even if it's just for paranoias sake, we
> don't avoid dereferencing the gfs2_inode in verify_unlinked before
> locking and checking it. It seems like a simple fix that avoids any
> chance that there is a problem due to not holding the directory glock
> when we're getting the inode.

I'm not sure I understand the question. Today, function verify_unlinked
dereferences the gfs2_inode, ip, but so does its only caller,
gfs2_inode_lookup. We already know ip is valid at that point, so more
references shouldn't matter. Unless you mean the inode's glock?
 
> 2. gfs2_inode_lookup calls iget_locked, which creates an inode if one
> doesn't already exist.  If there's any chance of racing and calling
> delete_work_func for an inode that has already been removed, won't we
> just end up creating a new one here?
> 
> -Ben

Yes, and that's exactly what we want. I found cases in which the
delete_work_func races with the evict() code. In those cases, we need
the inode to be recreated in order to go through another evict() cycle
which will end up deleting it, whereas the first evict didn't.
For example, the shrinker can arbitrarily call evict() and get rid of
the inode while its nlink is still positive, IOW, it's not deleted,
but after that point in time, the system gets a callback from another
node instructing it to delete the inode because it was unlinked
remotely from that other node. In that case, we still need to
transition it from unlinked to free, but we need a valid inode
in order to do the proper delete. So we recreate it and then it gets
evicted again, in the name of deleting it.

Incidentally, my latest version has changed just a bit. I decided
to keep the call to gfs2_ilookup in delete_work_func. So the latest
version of the delete_work_func() is pasted below.

Regards,

Bob Peterson
Red Hat File Systems
---
static void delete_work_func(struct work_struct *work)
{
	struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_delete);
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct gfs2_inode *ip;
	struct inode *inode = NULL;
	u64 no_addr = gl->gl_name.ln_number;

	/* If someone's using this glock to create a new dinode, the block must
	   have been freed by another node, then re-used, in which case our
	   iopen callback is too late after the fact. Ignore it. */
	if (test_bit(GLF_INODE_CREATING, &gl->gl_flags))
		goto out;

	ip = gl->gl_object;
	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */

	if (ip)
		inode = gfs2_ilookup(sdp->sd_vfs, no_addr);

	if (inode == NULL || IS_ERR(inode))
		inode = gfs2_inode_lookup(sdp->sd_vfs, DT_UNKNOWN, no_addr,
					  0, 1);
	if (inode && !IS_ERR(inode)) {
		d_prune_aliases(inode);
		iput(inode);
	}
out:
	gfs2_glock_put(gl);
}



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-03-21 18:12     ` Bob Peterson
@ 2016-03-21 18:56       ` Benjamin Marzinski
  2016-03-23 16:13         ` Bob Peterson
  0 siblings, 1 reply; 19+ messages in thread
From: Benjamin Marzinski @ 2016-03-21 18:56 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Mar 21, 2016 at 02:12:43PM -0400, Bob Peterson wrote:
> Hi Ben,
> 
> ----- Original Message -----
> > On Fri, Dec 18, 2015 at 01:02:30PM -0600, Bob Peterson wrote:
> > > This patch changes the inode reclaiming code, delete_work_func, use
> > > the normal gfs2_inode_lookup, but with a special parameter, for_del
> > > which causes additional checks to be made.
> > 
> > I get what you're doing here, and I can't see anything really wrong with
> > it, but I haven't gone through all the delete code as carefully as I
> > could.  I do have two questions.
> > 
> > 1. I don't see any reason why, even if it's just for paranoias sake, we
> > don't avoid dereferencing the gfs2_inode in verify_unlinked before
> > locking and checking it. It seems like a simple fix that avoids any
> > chance that there is a problem due to not holding the directory glock
> > when we're getting the inode.
> 
> I'm not sure I understand the question. Today, function verify_unlinked
> dereferences the gfs2_inode, ip, but so does its only caller,
> gfs2_inode_lookup. We already know ip is valid at that point, so more
> references shouldn't matter. Unless you mean the inode's glock?

So before, delete_work_func called gfs2_lookup_by_inum, which locked the
glock without dereferencing the gfs2_inode pointer, and then called
gfs2_inode_lookup. Other code paths call gfs2_inode_lookup without the
glock on the inode itself locked, but with the containing directory's
inode glock held. But now, we are deferencing the gfs2_inode pointer
without either holding its glock or the glock of the directory that the
file is in.

Assuming that everything has to hold the inode lock first now, once
we've locked the that, and locked the glock and verified the file in
verify inode, it should be safe to unlock the glock in verify_inode, and
continue to dereference it the gfs2_inode until we drop the inode lock.

>  
> > 2. gfs2_inode_lookup calls iget_locked, which creates an inode if one
> > doesn't already exist.  If there's any chance of racing and calling
> > delete_work_func for an inode that has already been removed, won't we
> > just end up creating a new one here?
> > 
> > -Ben
> 
> Yes, and that's exactly what we want. I found cases in which the
> delete_work_func races with the evict() code. In those cases, we need
> the inode to be recreated in order to go through another evict() cycle
> which will end up deleting it, whereas the first evict didn't.
> For example, the shrinker can arbitrarily call evict() and get rid of
> the inode while its nlink is still positive, IOW, it's not deleted,
> but after that point in time, the system gets a callback from another
> node instructing it to delete the inode because it was unlinked
> remotely from that other node. In that case, we still need to
> transition it from unlinked to free, but we need a valid inode
> in order to do the proper delete. So we recreate it and then it gets
> evicted again, in the name of deleting it.

O.k. clearly I didn't read through all the delete code as carefully as I
could have.

> Incidentally, my latest version has changed just a bit. I decided
> to keep the call to gfs2_ilookup in delete_work_func. So the latest
> version of the delete_work_func() is pasted below.
> 
> Regards,
> 
> Bob Peterson
> Red Hat File Systems
> ---
> static void delete_work_func(struct work_struct *work)
> {
> 	struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_delete);
> 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> 	struct gfs2_inode *ip;
> 	struct inode *inode = NULL;
> 	u64 no_addr = gl->gl_name.ln_number;
> 
> 	/* If someone's using this glock to create a new dinode, the block must
> 	   have been freed by another node, then re-used, in which case our
> 	   iopen callback is too late after the fact. Ignore it. */
> 	if (test_bit(GLF_INODE_CREATING, &gl->gl_flags))
> 		goto out;
> 
> 	ip = gl->gl_object;
> 	/* Note: Unsafe to dereference ip as we don't hold right refs/locks */
> 
> 	if (ip)
> 		inode = gfs2_ilookup(sdp->sd_vfs, no_addr);
> 
> 	if (inode == NULL || IS_ERR(inode))
> 		inode = gfs2_inode_lookup(sdp->sd_vfs, DT_UNKNOWN, no_addr,
> 					  0, 1);
> 	if (inode && !IS_ERR(inode)) {
> 		d_prune_aliases(inode);
> 		iput(inode);
> 	}
> out:
> 	gfs2_glock_put(gl);
> }



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path
  2016-03-21 18:56       ` Benjamin Marzinski
@ 2016-03-23 16:13         ` Bob Peterson
  0 siblings, 0 replies; 19+ messages in thread
From: Bob Peterson @ 2016-03-23 16:13 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Ben,

----- Original Message -----
> On Mon, Mar 21, 2016 at 02:12:43PM -0400, Bob Peterson wrote:
> > Hi Ben,
> > 
> > ----- Original Message -----
> > > On Fri, Dec 18, 2015 at 01:02:30PM -0600, Bob Peterson wrote:
> > > > This patch changes the inode reclaiming code, delete_work_func, use
> > > > the normal gfs2_inode_lookup, but with a special parameter, for_del
> > > > which causes additional checks to be made.
> > > 
> > > I get what you're doing here, and I can't see anything really wrong with
> > > it, but I haven't gone through all the delete code as carefully as I
> > > could.  I do have two questions.
> > > 
> > > 1. I don't see any reason why, even if it's just for paranoias sake, we
> > > don't avoid dereferencing the gfs2_inode in verify_unlinked before
> > > locking and checking it. It seems like a simple fix that avoids any
> > > chance that there is a problem due to not holding the directory glock
> > > when we're getting the inode.
> > 
> > I'm not sure I understand the question. Today, function verify_unlinked
> > dereferences the gfs2_inode, ip, but so does its only caller,
> > gfs2_inode_lookup. We already know ip is valid at that point, so more
> > references shouldn't matter. Unless you mean the inode's glock?
> 
> So before, delete_work_func called gfs2_lookup_by_inum, which locked the
> glock without dereferencing the gfs2_inode pointer, and then called
> gfs2_inode_lookup. Other code paths call gfs2_inode_lookup without the
> glock on the inode itself locked, but with the containing directory's
> inode glock held. But now, we are deferencing the gfs2_inode pointer
> without either holding its glock or the glock of the directory that the
> file is in.

Yes, and the old lock order was the majority of the problem.
It seems to me that a lot of GFS2 touches the inode without holding
the inode's glock. For example, take function gfs2_create_inode().
It locks the inode's directory first, but then it does:

   inode = gfs2_dir_search
      gfs2_inode_lookup
         iget_locked

So it's referencing the inode (of which ip is part of) without its glock.
If the inode isn't found by gfs2_dir_search, it continues on to do:

   inode = new_inode(sdp->sd_vfs);
      	inode = new_inode_pseudo(sb);

	   struct inode *inode = alloc_inode(sb);
	   if (inode) {
		   spin_lock(&inode->i_lock);
   ...
   error = gfs2_glock_get(sdp, ip->i_no_addr, &gfs2_inode_glops, CREATE, &ip->i_gl);
   ...
   error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_SKIP, ghs + 1);

Which means it also accesses the inode before the inode's glock.
So AFAICT, the correct order is: grab a reference to the inode, then the inode's glock.

But gfs2_lookup_by_inum was doing this in the opposite order: It was holding
the inode glock (found by its number) then it grabbed the inode itself.

> Assuming that everything has to hold the inode lock first now, once

Now, as it has always been.

> we've locked the that, and locked the glock and verified the file in
> verify inode, it should be safe to unlock the glock in verify_inode, and
> continue to dereference it the gfs2_inode until we drop the inode lock.
> 
> >  
> > > 2. gfs2_inode_lookup calls iget_locked, which creates an inode if one
> > > doesn't already exist.  If there's any chance of racing and calling
> > > delete_work_func for an inode that has already been removed, won't we
> > > just end up creating a new one here?
> > > 
> > > -Ben
> > 
> > Yes, and that's exactly what we want. I found cases in which the
> > delete_work_func races with the evict() code. In those cases, we need
> > the inode to be recreated in order to go through another evict() cycle
> > which will end up deleting it, whereas the first evict didn't.
> > For example, the shrinker can arbitrarily call evict() and get rid of
> > the inode while its nlink is still positive, IOW, it's not deleted,
> > but after that point in time, the system gets a callback from another
> > node instructing it to delete the inode because it was unlinked
> > remotely from that other node. In that case, we still need to
> > transition it from unlinked to free, but we need a valid inode
> > in order to do the proper delete. So we recreate it and then it gets
> > evicted again, in the name of deleting it.
> 
> O.k. clearly I didn't read through all the delete code as carefully as I
> could have.

Well, the delete code is a maze, to be sure, but the proof is in the
pudding. Instrumentation proved to me that this was happening.

One of the many pitfalls I fell into with the delete code is: not only
did I need to recreate the inode after was freed from memory in these
circumstances, but I also needed to do an inode refresh, which seemed
like a real waste of time, given that the inode was being deleted.
Ideally I'd like to eliminate that need and thereby make it more efficient.
These were both handled by the standard inode lookup path, so that's
why I ditched my original attempt to create a delete-specific function to
do the inode lookup, in favor of using the normal inode lookup path.

And even now, I'm not sure what to do about the NFS lookup path, because
that lookup_by_inum still has the wrong lock order AFAICT.

Regards,

Bob Peterson
Red Hat File Systems



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-03-23 16:13 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-18 19:02 [Cluster-devel] [PATCH 0/4] patches related to file unlink->delete->new Bob Peterson
2015-12-18 19:02 ` [Cluster-devel] [PATCH 1/4] GFS2: Prevent delete work from occurring on glocks used for create Bob Peterson
2016-02-02 11:40   ` Steven Whitehouse
2015-12-18 19:02 ` [Cluster-devel] [PATCH 2/4] GFS2: Don't filter out I_FREEING inodes anymore Bob Peterson
2016-02-23 10:15   ` Steven Whitehouse
2015-12-18 19:02 ` [Cluster-devel] [PATCH 3/4] GFS2: Eliminate parameter non_block on gfs2_inode_lookup Bob Peterson
2016-02-23 10:17   ` Steven Whitehouse
2015-12-18 19:02 ` [Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path Bob Peterson
2016-02-23 10:25   ` Steven Whitehouse
2016-02-24 20:42     ` Bob Peterson
2016-02-25 10:07       ` Steven Whitehouse
2016-03-15 13:50         ` Bob Peterson
2016-03-16 11:09           ` Steven Whitehouse
2016-03-17 16:01             ` Bob Peterson
2016-03-17 16:27               ` Steven Whitehouse
2016-03-21 16:49   ` Benjamin Marzinski
2016-03-21 18:12     ` Bob Peterson
2016-03-21 18:56       ` Benjamin Marzinski
2016-03-23 16:13         ` Bob Peterson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.