All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
@ 2013-06-04 22:17 Zach Brown
  2013-06-04 22:17 ` [PATCH 1/6] btrfs: set readdir f_pos only after filldir Zach Brown
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Zach Brown @ 2013-06-04 22:17 UTC (permalink / raw)
  To: linux-btrfs

Hi gang,

I finally sat down to fix that readdir hang that has been in the back
of my mind for a while.  I *hope* that the fix is pretty simple: just
don't manufacture a fake f_pos, I *think* we can abuse f_version as an
indicator that we shouldn't return entries.  Does this look reasonable?

We still have the problem that we can generate valid large f_pos values
that can confuse 32bit userspace, but that's a different problem.  I
think we'll want filldir generation of EOVERFLOW like what exists for
large inodes. 

The rest of the patches are cleanups that I saw when absorbing the
code.  It's all lightly tested with xfstests but it wouldn't surprise
me if I missed something so review is appreciated.

Thanks!

- z

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/6] btrfs: set readdir f_pos only after filldir
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
@ 2013-06-04 22:17 ` Zach Brown
  2013-06-05  1:19   ` Miao Xie
  2013-06-04 22:17 ` [PATCH 2/6] btrfs: fix readdir hang with offsets past INT_MAX Zach Brown
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Zach Brown @ 2013-06-04 22:17 UTC (permalink / raw)
  To: linux-btrfs

The only time we need to advance f_pos is after we've successfully given
a result to userspace via filldir.  This simplification gets rid of the
is_curr variable used to update f_pos for the delayed item readdir
entries.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c |  5 +++--
 fs/btrfs/inode.c         | 17 +++++++----------
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5615eac..4d846a2 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1696,8 +1696,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent,
 			continue;
 		}
 
-		filp->f_pos = curr->key.offset;
-
 		di = (struct btrfs_dir_item *)curr->data;
 		name = (char *)(di + 1);
 		name_len = le16_to_cpu(di->name_len);
@@ -1708,6 +1706,9 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent,
 		over = filldir(dirent, name, name_len, curr->key.offset,
 			       location.objectid, d_type);
 
+		if (!over)
+			filp->f_pos = curr->key.offset + 1;
+
 		if (atomic_dec_and_test(&curr->refs))
 			kfree(curr);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b11a95e..6a5784b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5007,7 +5007,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 	char tmp_name[32];
 	char *name_ptr;
 	int name_len;
-	int is_curr = 0;	/* filp->f_pos points to the current index? */
 
 	/* FIXME, use a real flag for deciding about the key type */
 	if (root->fs_info->tree_root == root)
@@ -5076,9 +5075,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 						  found_key.offset))
 			goto next;
 
-		filp->f_pos = found_key.offset;
-		is_curr = 1;
-
 		di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
 		di_cur = 0;
 		di_total = btrfs_item_size(leaf, item);
@@ -5130,6 +5126,9 @@ skip:
 
 			if (over)
 				goto nopos;
+
+			filp->f_pos = found_key.offset + 1;
+
 			di_len = btrfs_dir_name_len(leaf, di) +
 				 btrfs_dir_data_len(leaf, di) + sizeof(*di);
 			di_cur += di_len;
@@ -5140,23 +5139,21 @@ next:
 	}
 
 	if (key_type == BTRFS_DIR_INDEX_KEY) {
-		if (is_curr)
-			filp->f_pos++;
 		ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir,
 						      &ins_list);
 		if (ret)
 			goto nopos;
 	}
 
-	/* Reached end of directory/root. Bump pos past the last item. */
-	if (key_type == BTRFS_DIR_INDEX_KEY)
+	/* Reached end of directory/root */
+	if (key_type == BTRFS_DIR_INDEX_KEY) {
 		/*
 		 * 32-bit glibc will use getdents64, but then strtol -
 		 * so the last number we can serve is this.
 		 */
 		filp->f_pos = 0x7fffffff;
-	else
-		filp->f_pos++;
+	}
+
 nopos:
 	ret = 0;
 err:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/6] btrfs: fix readdir hang with offsets past INT_MAX
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
  2013-06-04 22:17 ` [PATCH 1/6] btrfs: set readdir f_pos only after filldir Zach Brown
@ 2013-06-04 22:17 ` Zach Brown
  2013-06-04 22:17 ` [PATCH 3/6] btrfs: trivial delayed item readdir list cleanups Zach Brown
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Zach Brown @ 2013-06-04 22:17 UTC (permalink / raw)
  To: linux-btrfs

To work around bugs in userspace btrfs_real_readdir() sets f_pos to an
offset that will prevent any future entries from being returned once the
last entry is hit.  Over time this supposedly impossible offset was
decreased from the initial U64_MAX to INT_MAX to appease 32bit
userspace.

  https://oss.oracle.com/pipermail/btrfs-devel/2008-January/000437.html
  commit c2a8b6e11009398ca9363d8ba8d4e7e40fb897fd
  commit 89f135d8b53bcccafd91a075366d2704ba257cf3
  commit 406266ab9ac8ed8b085c58aacd9e3161480dc5d5

The remaining problem is that resetting f_pos to some impossible offset
causes userspace to spin when it's, well, possible for an entry to have
that offset.  It takes a single thread on a modern cpu about nine hours
of constant file creation and removal to hit an offset past INT_MAX on a
single spindle.

Instead of trying to find an impossible f_pos that doesn't break various
layers of the stack, let's use f_version to indicate that readdir should
stop returning entries until seek changes f_pos and clears f_version.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/inode.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6a5784b..e6e2b86 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4983,6 +4983,16 @@ unsigned char btrfs_filetype_table[] = {
 	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
 };
 
+/*
+ * There have been buggy applications that can't handle one readdir pass
+ * returning the same name for different inodes that are unlinked and
+ * re-created during the readdir pass.  This was partially worked around
+ * by trying to set f_pos to magic values that broke either 32bit userspace
+ * or entries with huge offsets.  Now we set f_version to a magic value
+ * which prevents readdir results until seek resets f_pos and f_version.
+ */
+#define BTRFS_READDIR_EOF ~0ULL
+
 static int btrfs_real_readdir(struct file *filp, void *dirent,
 			      filldir_t filldir)
 {
@@ -5008,6 +5018,9 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 	char *name_ptr;
 	int name_len;
 
+	if (filp->f_version == BTRFS_READDIR_EOF)
+		return 0;
+
 	/* FIXME, use a real flag for deciding about the key type */
 	if (root->fs_info->tree_root == root)
 		key_type = BTRFS_DIR_ITEM_KEY;
@@ -5145,14 +5158,9 @@ next:
 			goto nopos;
 	}
 
-	/* Reached end of directory/root */
-	if (key_type == BTRFS_DIR_INDEX_KEY) {
-		/*
-		 * 32-bit glibc will use getdents64, but then strtol -
-		 * so the last number we can serve is this.
-		 */
-		filp->f_pos = 0x7fffffff;
-	}
+	/* prevent further readdir results without seeking once we hit EOF */
+	if (key_type == BTRFS_DIR_INDEX_KEY)
+		filp->f_version = BTRFS_READDIR_EOF;
 
 nopos:
 	ret = 0;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/6] btrfs: trivial delayed item readdir list cleanups
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
  2013-06-04 22:17 ` [PATCH 1/6] btrfs: set readdir f_pos only after filldir Zach Brown
  2013-06-04 22:17 ` [PATCH 2/6] btrfs: fix readdir hang with offsets past INT_MAX Zach Brown
@ 2013-06-04 22:17 ` Zach Brown
  2013-06-04 22:17 ` [PATCH 4/6] btrfs: simplify finding next/prev delayed items Zach Brown
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Zach Brown @ 2013-06-04 22:17 UTC (permalink / raw)
  To: linux-btrfs

Just call btrfs_put_delayed_items() for each list rather than having two
list arguments and duplicated code.

list_for_each_entry_safe() can handle an empty list.

We don't have to conditionally use and tear down the lists if we always
initialize them to be empty.  They're only populated when needed and the
rest of the uses just find empty lists.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 17 ++---------------
 fs/btrfs/delayed-inode.h |  3 +--
 fs/btrfs/inode.c         | 17 ++++++-----------
 3 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 4d846a2..bc753fc 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1618,18 +1618,11 @@ void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
 	atomic_dec(&delayed_node->refs);
 }
 
-void btrfs_put_delayed_items(struct list_head *ins_list,
-			     struct list_head *del_list)
+void btrfs_put_delayed_items(struct list_head *list)
 {
 	struct btrfs_delayed_item *curr, *next;
 
-	list_for_each_entry_safe(curr, next, ins_list, readdir_list) {
-		list_del(&curr->readdir_list);
-		if (atomic_dec_and_test(&curr->refs))
-			kfree(curr);
-	}
-
-	list_for_each_entry_safe(curr, next, del_list, readdir_list) {
+	list_for_each_entry_safe(curr, next, list, readdir_list) {
 		list_del(&curr->readdir_list);
 		if (atomic_dec_and_test(&curr->refs))
 			kfree(curr);
@@ -1642,9 +1635,6 @@ int btrfs_should_delete_dir_index(struct list_head *del_list,
 	struct btrfs_delayed_item *curr, *next;
 	int ret;
 
-	if (list_empty(del_list))
-		return 0;
-
 	list_for_each_entry_safe(curr, next, del_list, readdir_list) {
 		if (curr->key.offset > index)
 			break;
@@ -1679,9 +1669,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent,
 	int over = 0;
 	unsigned char d_type;
 
-	if (list_empty(ins_list))
-		return 0;
-
 	/*
 	 * Changing the data of the delayed item is impossible. So
 	 * we needn't lock them. And we have held i_mutex of the
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 1d5c5f7..573506b 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -135,8 +135,7 @@ void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
 /* Used for readdir() */
 void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
 			     struct list_head *del_list);
-void btrfs_put_delayed_items(struct list_head *ins_list,
-			     struct list_head *del_list);
+void btrfs_put_delayed_items(struct list_head *list);
 int btrfs_should_delete_dir_index(struct list_head *del_list,
 				  u64 index);
 int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6e2b86..53a8696 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5003,8 +5003,8 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 	struct btrfs_key key;
 	struct btrfs_key found_key;
 	struct btrfs_path *path;
-	struct list_head ins_list;
-	struct list_head del_list;
+	LIST_HEAD(ins_list);
+	LIST_HEAD(del_list);
 	int ret;
 	struct extent_buffer *leaf;
 	int slot;
@@ -5048,11 +5048,8 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 
 	path->reada = 1;
 
-	if (key_type == BTRFS_DIR_INDEX_KEY) {
-		INIT_LIST_HEAD(&ins_list);
-		INIT_LIST_HEAD(&del_list);
+	if (key_type == BTRFS_DIR_INDEX_KEY)
 		btrfs_get_delayed_items(inode, &ins_list, &del_list);
-	}
 
 	btrfs_set_key_type(&key, key_type);
 	key.offset = filp->f_pos;
@@ -5083,9 +5080,7 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 			break;
 		if (found_key.offset < filp->f_pos)
 			goto next;
-		if (key_type == BTRFS_DIR_INDEX_KEY &&
-		    btrfs_should_delete_dir_index(&del_list,
-						  found_key.offset))
+		if (btrfs_should_delete_dir_index(&del_list, found_key.offset))
 			goto next;
 
 		di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
@@ -5165,8 +5160,8 @@ next:
 nopos:
 	ret = 0;
 err:
-	if (key_type == BTRFS_DIR_INDEX_KEY)
-		btrfs_put_delayed_items(&ins_list, &del_list);
+	btrfs_put_delayed_items(&ins_list);
+	btrfs_put_delayed_items(&del_list);
 	btrfs_free_path(path);
 	return ret;
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/6] btrfs: simplify finding next/prev delayed items
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
                   ` (2 preceding siblings ...)
  2013-06-04 22:17 ` [PATCH 3/6] btrfs: trivial delayed item readdir list cleanups Zach Brown
@ 2013-06-04 22:17 ` Zach Brown
  2013-06-04 22:17 ` [PATCH 5/6] btrfs: add helper to get delayed item root Zach Brown
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Zach Brown @ 2013-06-04 22:17 UTC (permalink / raw)
  To: linux-btrfs

I'd like to use the currently unused next/prev arguments to
__btrfs_lookup_delayed_item() in a future patch.  I noticed that the
code could be simplified.

We don't need to use rb_next() or rb_prev() to walk back up the tree
once we've failed to find the key at a leaf.  We can record the most
recent next and prev items as we descend and return those when the
search fails.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 54 +++++++++++++++++++-----------------------------
 1 file changed, 21 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index bc753fc..67e0f9f 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -326,11 +326,11 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len)
  * __btrfs_lookup_delayed_item - look up the delayed item by key
  * @delayed_node: pointer to the delayed node
  * @key:	  the key to look up
- * @prev:	  used to store the prev item if the right item isn't found
- * @next:	  used to store the next item if the right item isn't found
+ * @prev:	  set to the item before @key when it isn't found
+ * @next:	  set to the item after @key when it isn't found
  *
- * Note: if we don't find the right item, we will return the prev item and
- * the next item.
+ * @prev and @next are only correct if the search terminates at a leaf so
+ * they're only set if an item with @key is not found.
  */
 static struct btrfs_delayed_item *__btrfs_lookup_delayed_item(
 				struct rb_root *root,
@@ -338,48 +338,36 @@ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item(
 				struct btrfs_delayed_item **prev,
 				struct btrfs_delayed_item **next)
 {
-	struct rb_node *node, *prev_node = NULL;
+	struct rb_node *node = root->rb_node;
+	struct btrfs_delayed_item *prev_item = NULL;
+	struct btrfs_delayed_item *next_item = NULL;
 	struct btrfs_delayed_item *delayed_item = NULL;
-	int ret = 0;
+	int ret;
 
-	node = root->rb_node;
+	if (prev)
+		*prev = NULL;
+	if (next)
+		*next = NULL;
 
 	while (node) {
 		delayed_item = rb_entry(node, struct btrfs_delayed_item,
 					rb_node);
-		prev_node = node;
 		ret = btrfs_comp_cpu_keys(&delayed_item->key, key);
-		if (ret < 0)
+		if (ret < 0) {
+			prev_item = delayed_item;
 			node = node->rb_right;
-		else if (ret > 0)
+		} else if (ret > 0) {
+			next_item = delayed_item;
 			node = node->rb_left;
-		else
+		} else
 			return delayed_item;
 	}
 
-	if (prev) {
-		if (!prev_node)
-			*prev = NULL;
-		else if (ret < 0)
-			*prev = delayed_item;
-		else if ((node = rb_prev(prev_node)) != NULL) {
-			*prev = rb_entry(node, struct btrfs_delayed_item,
-					 rb_node);
-		} else
-			*prev = NULL;
-	}
+	if (prev)
+		*prev = prev_item;
+	if (next)
+		*next = next_item;
 
-	if (next) {
-		if (!prev_node)
-			*next = NULL;
-		else if (ret > 0)
-			*next = delayed_item;
-		else if ((node = rb_next(prev_node)) != NULL) {
-			*next = rb_entry(node, struct btrfs_delayed_item,
-					 rb_node);
-		} else
-			*next = NULL;
-	}
 	return NULL;
 }
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/6] btrfs: add helper to get delayed item root
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
                   ` (3 preceding siblings ...)
  2013-06-04 22:17 ` [PATCH 4/6] btrfs: simplify finding next/prev delayed items Zach Brown
@ 2013-06-04 22:17 ` Zach Brown
  2013-06-04 22:18 ` [PATCH 6/6] btrfs: get fewer delayed item refs during readdir Zach Brown
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Zach Brown @ 2013-06-04 22:17 UTC (permalink / raw)
  To: linux-btrfs

This just moves some duplicated code into a helper.  I couldn't bring
myself to add another copy in an upcoming patch.

The delayed_root BUG() in __btrfs_remove_delayed_item() wasn't needed.
The pointer deref will oops later if its null.

And now the remaining BUG() is in one place! :)

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 67e0f9f..fcce951 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -382,6 +382,16 @@ static struct btrfs_delayed_item *__btrfs_lookup_delayed_insertion_item(
 	return item;
 }
 
+static struct rb_root *get_ins_del_root(struct btrfs_delayed_node *delayed_node,
+					int ins_del)
+{
+	if (ins_del == BTRFS_DELAYED_INSERTION_ITEM)
+		return &delayed_node->ins_root;
+	if (ins_del == BTRFS_DELAYED_DELETION_ITEM)
+		return &delayed_node->del_root;
+	BUG();
+}
+
 static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node,
 				    struct btrfs_delayed_item *ins,
 				    int action)
@@ -392,12 +402,7 @@ static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node,
 	struct btrfs_delayed_item *item;
 	int cmp;
 
-	if (action == BTRFS_DELAYED_INSERTION_ITEM)
-		root = &delayed_node->ins_root;
-	else if (action == BTRFS_DELAYED_DELETION_ITEM)
-		root = &delayed_node->del_root;
-	else
-		BUG();
+	root = get_ins_del_root(delayed_node, action);
 	p = &root->rb_node;
 	node = &ins->rb_node;
 
@@ -460,15 +465,8 @@ static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *delayed_item)
 
 	delayed_root = delayed_item->delayed_node->root->fs_info->delayed_root;
 
-	BUG_ON(!delayed_root);
-	BUG_ON(delayed_item->ins_or_del != BTRFS_DELAYED_DELETION_ITEM &&
-	       delayed_item->ins_or_del != BTRFS_DELAYED_INSERTION_ITEM);
-
-	if (delayed_item->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM)
-		root = &delayed_item->delayed_node->ins_root;
-	else
-		root = &delayed_item->delayed_node->del_root;
-
+	root = get_ins_del_root(delayed_item->delayed_node,
+				delayed_item->ins_or_del);
 	rb_erase(&delayed_item->rb_node, root);
 	delayed_item->delayed_node->count--;
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/6] btrfs: get fewer delayed item refs during readdir
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
                   ` (4 preceding siblings ...)
  2013-06-04 22:17 ` [PATCH 5/6] btrfs: add helper to get delayed item root Zach Brown
@ 2013-06-04 22:18 ` Zach Brown
  2013-06-04 23:16 ` [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Chris Mason
  2013-07-01 12:54 ` Josef Bacik
  7 siblings, 0 replies; 22+ messages in thread
From: Zach Brown @ 2013-06-04 22:18 UTC (permalink / raw)
  To: linux-btrfs

On every readdir call all the delayed items for the dir are put on a
private list with a held reference.  If they're outside the f_pos values
that this readdir call ends up using they're just dropped and removed
from the list.  We can make some tiny changes to cut down on this
overhead.

First, let's use the delayed item's key-sorted rbtree to skip items that
are before f_pos and will never be used.

Second, let's only acquire the new delayed items after we've exausted
the existing in-tree items and still have room in the readdir buffer for
more entries.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/delayed-inode.c | 21 ++++++++++-----------
 fs/btrfs/delayed-inode.h |  4 ++--
 fs/btrfs/inode.c         | 14 +++++++++++---
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index fcce951..2c3ec89 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1567,28 +1567,27 @@ int btrfs_inode_delayed_dir_index_count(struct inode *inode)
 	return 0;
 }
 
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-			     struct list_head *del_list)
+void btrfs_get_delayed_items(struct inode *inode, struct list_head *list,
+			     struct btrfs_key *key, int action)
 {
 	struct btrfs_delayed_node *delayed_node;
 	struct btrfs_delayed_item *item;
+	struct btrfs_delayed_item *next;
+	struct rb_root *root;
 
 	delayed_node = btrfs_get_delayed_node(inode);
 	if (!delayed_node)
 		return;
 
-	mutex_lock(&delayed_node->mutex);
-	item = __btrfs_first_delayed_insertion_item(delayed_node);
-	while (item) {
-		atomic_inc(&item->refs);
-		list_add_tail(&item->readdir_list, ins_list);
-		item = __btrfs_next_delayed_item(item);
-	}
+	root = get_ins_del_root(delayed_node, action);
 
-	item = __btrfs_first_delayed_deletion_item(delayed_node);
+	mutex_lock(&delayed_node->mutex);
+	item = __btrfs_lookup_delayed_item(root, key, NULL, &next);
+	if (item == NULL)
+		item = next;
 	while (item) {
 		atomic_inc(&item->refs);
-		list_add_tail(&item->readdir_list, del_list);
+		list_add_tail(&item->readdir_list, list);
 		item = __btrfs_next_delayed_item(item);
 	}
 	mutex_unlock(&delayed_node->mutex);
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 573506b..7c401e1 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -133,8 +133,8 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root);
 void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
 
 /* Used for readdir() */
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-			     struct list_head *del_list);
+void btrfs_get_delayed_items(struct inode *inode, struct list_head *list,
+			     struct btrfs_key *key, int action);
 void btrfs_put_delayed_items(struct list_head *list);
 int btrfs_should_delete_dir_index(struct list_head *del_list,
 				  u64 index);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 53a8696..ad42724 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5048,13 +5048,14 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
 
 	path->reada = 1;
 
-	if (key_type == BTRFS_DIR_INDEX_KEY)
-		btrfs_get_delayed_items(inode, &ins_list, &del_list);
-
 	btrfs_set_key_type(&key, key_type);
 	key.offset = filp->f_pos;
 	key.objectid = btrfs_ino(inode);
 
+	if (key_type == BTRFS_DIR_INDEX_KEY)
+		btrfs_get_delayed_items(inode, &del_list, &key,
+					BTRFS_DELAYED_DELETION_ITEM);
+
 	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 	if (ret < 0)
 		goto err;
@@ -5146,7 +5147,14 @@ next:
 		path->slots[0]++;
 	}
 
+	/* don't acquire delayed item mutex while holding locked path */
+	btrfs_free_path(path);
+	path = NULL;
+
 	if (key_type == BTRFS_DIR_INDEX_KEY) {
+		key.offset = filp->f_pos;
+		btrfs_get_delayed_items(inode, &ins_list, &key,
+					BTRFS_DELAYED_INSERTION_ITEM);
 		ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir,
 						      &ins_list);
 		if (ret)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
                   ` (5 preceding siblings ...)
  2013-06-04 22:18 ` [PATCH 6/6] btrfs: get fewer delayed item refs during readdir Zach Brown
@ 2013-06-04 23:16 ` Chris Mason
  2013-06-04 23:26   ` Zach Brown
  2013-07-01 12:54 ` Josef Bacik
  7 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2013-06-04 23:16 UTC (permalink / raw)
  To: Zach Brown, linux-btrfs

Quoting Zach Brown (2013-06-04 18:17:54)
> Hi gang,
> 
> I finally sat down to fix that readdir hang that has been in the back
> of my mind for a while.  I *hope* that the fix is pretty simple: just
> don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> indicator that we shouldn't return entries.  Does this look reasonable?

I like it, and it doesn't look too far away from how others are abusing
f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
loves to surprise me.

-chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-04 23:16 ` [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Chris Mason
@ 2013-06-04 23:26   ` Zach Brown
  2013-06-05  2:34     ` Miao Xie
  2013-06-10 22:39     ` Zach Brown
  0 siblings, 2 replies; 22+ messages in thread
From: Zach Brown @ 2013-06-04 23:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> Quoting Zach Brown (2013-06-04 18:17:54)
> > Hi gang,
> > 
> > I finally sat down to fix that readdir hang that has been in the back
> > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> > indicator that we shouldn't return entries.  Does this look reasonable?
> 
> I like it, and it doesn't look too far away from how others are abusing
> f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
> loves to surprise me.

Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)

- z

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] btrfs: set readdir f_pos only after filldir
  2013-06-04 22:17 ` [PATCH 1/6] btrfs: set readdir f_pos only after filldir Zach Brown
@ 2013-06-05  1:19   ` Miao Xie
  0 siblings, 0 replies; 22+ messages in thread
From: Miao Xie @ 2013-06-05  1:19 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-btrfs

On 	tue, 4 Jun 2013 15:17:55 -0700, Zach Brown wrote:
> The only time we need to advance f_pos is after we've successfully given
> a result to userspace via filldir.  This simplification gets rid of the
> is_curr variable used to update f_pos for the delayed item readdir
> entries.
> 
> Signed-off-by: Zach Brown <zab@redhat.com>

Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>

> ---
>  fs/btrfs/delayed-inode.c |  5 +++--
>  fs/btrfs/inode.c         | 17 +++++++----------
>  2 files changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 5615eac..4d846a2 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1696,8 +1696,6 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent,
>  			continue;
>  		}
>  
> -		filp->f_pos = curr->key.offset;
> -
>  		di = (struct btrfs_dir_item *)curr->data;
>  		name = (char *)(di + 1);
>  		name_len = le16_to_cpu(di->name_len);
> @@ -1708,6 +1706,9 @@ int btrfs_readdir_delayed_dir_index(struct file *filp, void *dirent,
>  		over = filldir(dirent, name, name_len, curr->key.offset,
>  			       location.objectid, d_type);
>  
> +		if (!over)
> +			filp->f_pos = curr->key.offset + 1;
> +
>  		if (atomic_dec_and_test(&curr->refs))
>  			kfree(curr);
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b11a95e..6a5784b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5007,7 +5007,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
>  	char tmp_name[32];
>  	char *name_ptr;
>  	int name_len;
> -	int is_curr = 0;	/* filp->f_pos points to the current index? */
>  
>  	/* FIXME, use a real flag for deciding about the key type */
>  	if (root->fs_info->tree_root == root)
> @@ -5076,9 +5075,6 @@ static int btrfs_real_readdir(struct file *filp, void *dirent,
>  						  found_key.offset))
>  			goto next;
>  
> -		filp->f_pos = found_key.offset;
> -		is_curr = 1;
> -
>  		di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
>  		di_cur = 0;
>  		di_total = btrfs_item_size(leaf, item);
> @@ -5130,6 +5126,9 @@ skip:
>  
>  			if (over)
>  				goto nopos;
> +
> +			filp->f_pos = found_key.offset + 1;
> +
>  			di_len = btrfs_dir_name_len(leaf, di) +
>  				 btrfs_dir_data_len(leaf, di) + sizeof(*di);
>  			di_cur += di_len;
> @@ -5140,23 +5139,21 @@ next:
>  	}
>  
>  	if (key_type == BTRFS_DIR_INDEX_KEY) {
> -		if (is_curr)
> -			filp->f_pos++;
>  		ret = btrfs_readdir_delayed_dir_index(filp, dirent, filldir,
>  						      &ins_list);
>  		if (ret)
>  			goto nopos;
>  	}
>  
> -	/* Reached end of directory/root. Bump pos past the last item. */
> -	if (key_type == BTRFS_DIR_INDEX_KEY)
> +	/* Reached end of directory/root */
> +	if (key_type == BTRFS_DIR_INDEX_KEY) {
>  		/*
>  		 * 32-bit glibc will use getdents64, but then strtol -
>  		 * so the last number we can serve is this.
>  		 */
>  		filp->f_pos = 0x7fffffff;
> -	else
> -		filp->f_pos++;
> +	}
> +
>  nopos:
>  	ret = 0;
>  err:
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-04 23:26   ` Zach Brown
@ 2013-06-05  2:34     ` Miao Xie
  2013-06-05 13:36       ` David Sterba
  2013-06-10 22:39     ` Zach Brown
  1 sibling, 1 reply; 22+ messages in thread
From: Miao Xie @ 2013-06-05  2:34 UTC (permalink / raw)
  To: Zach Brown; +Cc: Chris Mason, linux-btrfs

On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
>> Quoting Zach Brown (2013-06-04 18:17:54)
>>> Hi gang,
>>>
>>> I finally sat down to fix that readdir hang that has been in the back
>>> of my mind for a while.  I *hope* that the fix is pretty simple: just
>>> don't manufacture a fake f_pos, I *think* we can abuse f_version as an
>>> indicator that we shouldn't return entries.  Does this look reasonable?
>>
>> I like it, and it doesn't look too far away from how others are abusing
>> f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
>> loves to surprise me.
> 
> Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)

If we can not use f_version, we can use private_data. I think this variant is
safe.

Miao

> 
> - z
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-05  2:34     ` Miao Xie
@ 2013-06-05 13:36       ` David Sterba
  2013-06-06  1:35         ` Miao Xie
  0 siblings, 1 reply; 22+ messages in thread
From: David Sterba @ 2013-06-05 13:36 UTC (permalink / raw)
  To: Miao Xie; +Cc: Zach Brown, Chris Mason, linux-btrfs

On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:
> On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
> > On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> >> Quoting Zach Brown (2013-06-04 18:17:54)
> >>> Hi gang,
> >>>
> >>> I finally sat down to fix that readdir hang that has been in the back
> >>> of my mind for a while.  I *hope* that the fix is pretty simple: just
> >>> don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> >>> indicator that we shouldn't return entries.  Does this look reasonable?
> >>
> >> I like it, and it doesn't look too far away from how others are abusing
> >> f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
> >> loves to surprise me.
> > 
> > Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)
> 
> If we can not use f_version, we can use private_data. I think this variant is
> safe.

private_data is used within the ioctl user transactions, so a
readdir(mountpoint) with a user transaction running can break it.

david

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-05 13:36       ` David Sterba
@ 2013-06-06  1:35         ` Miao Xie
  2013-06-06 13:55           ` David Sterba
  0 siblings, 1 reply; 22+ messages in thread
From: Miao Xie @ 2013-06-06  1:35 UTC (permalink / raw)
  To: dsterba, Zach Brown, Chris Mason, linux-btrfs

On 	wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote:
> On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:
>> On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
>>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
>>>> Quoting Zach Brown (2013-06-04 18:17:54)
>>>>> Hi gang,
>>>>>
>>>>> I finally sat down to fix that readdir hang that has been in the back
>>>>> of my mind for a while.  I *hope* that the fix is pretty simple: just
>>>>> don't manufacture a fake f_pos, I *think* we can abuse f_version as an
>>>>> indicator that we shouldn't return entries.  Does this look reasonable?
>>>>
>>>> I like it, and it doesn't look too far away from how others are abusing
>>>> f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
>>>> loves to surprise me.
>>>
>>> Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)
>>
>> If we can not use f_version, we can use private_data. I think this variant is
>> safe.
> 
> private_data is used within the ioctl user transactions, so a
> readdir(mountpoint) with a user transaction running can break it.

don't worry, we can allocate a structure to keep both transaction handle and the information
of readdir, just like ext3/ext4. It is a flexible way and we can extend the structure to keep
more information if need in the future.

Beside the above method, we also can abuse the low bits of private_data to indicator that
we shouldn't return entries.

Thanks
Miao

> 
> david
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-06  1:35         ` Miao Xie
@ 2013-06-06 13:55           ` David Sterba
  2013-06-06 14:32             ` Chris Mason
  0 siblings, 1 reply; 22+ messages in thread
From: David Sterba @ 2013-06-06 13:55 UTC (permalink / raw)
  To: Miao Xie; +Cc: dsterba, Zach Brown, Chris Mason, linux-btrfs

On Thu, Jun 06, 2013 at 09:35:07AM +0800, Miao Xie wrote:
> On 	wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote:
> > On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:
> >> On 	tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
> >>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> >>>> Quoting Zach Brown (2013-06-04 18:17:54)
> >>>>> Hi gang,
> >>>>>
> >>>>> I finally sat down to fix that readdir hang that has been in the back
> >>>>> of my mind for a while.  I *hope* that the fix is pretty simple: just
> >>>>> don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> >>>>> indicator that we shouldn't return entries.  Does this look reasonable?
> >>>>
> >>>> I like it, and it doesn't look too far away from how others are abusing
> >>>> f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
> >>>> loves to surprise me.
> >>>
> >>> Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)
> >>
> >> If we can not use f_version, we can use private_data. I think this variant is
> >> safe.
> > 
> > private_data is used within the ioctl user transactions, so a
> > readdir(mountpoint) with a user transaction running can break it.
> 
> don't worry, we can allocate a structure to keep both transaction handle and the information
> of readdir, just like ext3/ext4. It is a flexible way and we can extend the structure to keep
> more information if need in the future.
> 
> Beside the above method, we also can abuse the low bits of private_data to indicator that
> we shouldn't return entries.

Allocating a full structure for private_data sounds better than directly
modifying the pointer value itself.


david

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-06 13:55           ` David Sterba
@ 2013-06-06 14:32             ` Chris Mason
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Mason @ 2013-06-06 14:32 UTC (permalink / raw)
  To: dsterba, David Sterba, Miao Xie; +Cc: dsterba, Zach Brown, linux-btrfs

Quoting David Sterba (2013-06-06 09:55:50)
> On Thu, Jun 06, 2013 at 09:35:07AM +0800, Miao Xie wrote:
> > On    wed, 5 Jun 2013 15:36:36 +0200, David Sterba wrote:
> > > On Wed, Jun 05, 2013 at 10:34:08AM +0800, Miao Xie wrote:
> > >> On         tue, 4 Jun 2013 16:26:57 -0700, Zach Brown wrote:
> > >>> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> > >>>> Quoting Zach Brown (2013-06-04 18:17:54)
> > >>>>> Hi gang,
> > >>>>>
> > >>>>> I finally sat down to fix that readdir hang that has been in the back
> > >>>>> of my mind for a while.  I *hope* that the fix is pretty simple: just
> > >>>>> don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> > >>>>> indicator that we shouldn't return entries.  Does this look reasonable?
> > >>>>
> > >>>> I like it, and it doesn't look too far away from how others are abusing
> > >>>> f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
> > >>>> loves to surprise me.
> > >>>
> > >>> Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)
> > >>
> > >> If we can not use f_version, we can use private_data. I think this variant is
> > >> safe.
> > > 
> > > private_data is used within the ioctl user transactions, so a
> > > readdir(mountpoint) with a user transaction running can break it.
> > 
> > don't worry, we can allocate a structure to keep both transaction handle and the information
> > of readdir, just like ext3/ext4. It is a flexible way and we can extend the structure to keep
> > more information if need in the future.
> > 
> > Beside the above method, we also can abuse the low bits of private_data to indicator that
> > we shouldn't return entries.
> 
> Allocating a full structure for private_data sounds better than directly
> modifying the pointer value itself.

I'd actually rather tag the pointers than go through kmalloc, we just
need one bit (maybe that really just shows how badly we've corrupted
poor Miao).  But, we're not there yet, I think Zach's initial patch will
work fine.

-chris


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-04 23:26   ` Zach Brown
  2013-06-05  2:34     ` Miao Xie
@ 2013-06-10 22:39     ` Zach Brown
  2013-06-12 12:59       ` Chris Mason
  1 sibling, 1 reply; 22+ messages in thread
From: Zach Brown @ 2013-06-10 22:39 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs, Josef Bacik

On Tue, Jun 04, 2013 at 04:26:57PM -0700, Zach Brown wrote:
> On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> > Quoting Zach Brown (2013-06-04 18:17:54)
> > > Hi gang,
> > > 
> > > I finally sat down to fix that readdir hang that has been in the back
> > > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > > don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> > > indicator that we shouldn't return entries.  Does this look reasonable?
> > 
> > I like it, and it doesn't look too far away from how others are abusing
> > f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
> > loves to surprise me.
> 
> Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)

Or a week later.  Pretty close!

I couldn't get NFS to break.  Clients see new entries created directly
in the exported btrfs and on either of noac and actime=1 client mounts.
For whatever that's worth.

But I did find that I'd broken the case of trying to re-enable readdir
results by seeking past the last entry (which happens to be the current
f_pos now that we're using f_version).

Here's the incremental fix against what Josef has in -next.  I'm cool
with either squashing or just committing it.

- z

Subject: [PATCH] btrfs: reset f_version when seeking to pos

Commit 63e3dfe ("btrfs: fix readdir hang with offsets past INT_MAX")
switched to using f_version to stop readdir results instead of setting a
large f_pos.

It inadvertantly changed behaviour in the case where an app specifically
seeks to one past the last valid dent->d_off it has seen.  Previously
f_pos would have changed from the fake f_pos to this new f_pos which
would let readdir return new entries.

But now that it's using f_version it might not have seen new entries.
generic_file_llseek() won't clear f_version if the desirned pos happens
to be the current f_pos.

So we add a little wrapper to notice this case and clear f_version so
that entries can be seen in this case.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/inode.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1059c90..590c274 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4997,6 +4997,23 @@ unsigned char btrfs_filetype_table[] = {
  * which prevents readdir results until seek resets f_pos and f_version.
  */
 #define BTRFS_READDIR_EOF ~0ULL
+static loff_t btrfs_dir_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct inode *inode = file->f_mapping->host;
+	loff_t ret;
+
+	/*
+	 * f_version isn't reset if a seek is attempted to the current pos.  A
+	 * caller can be trying to see more entries by seeking past the last
+	 * entry to the current pos after creating a new entry.
+	 */
+	mutex_lock(&inode->i_mutex);
+	ret = generic_file_llseek(file, offset, whence);
+	if (ret == offset && file->f_version == BTRFS_READDIR_EOF)
+		file->f_version = 0;
+	mutex_unlock(&inode->i_mutex);
+	return ret;
+}
 
 static int btrfs_real_readdir(struct file *filp, void *dirent,
 			      filldir_t filldir)
@@ -8642,7 +8659,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = {
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
-	.llseek		= generic_file_llseek,
+	.llseek		= btrfs_dir_llseek,
 	.read		= generic_read_dir,
 	.readdir	= btrfs_real_readdir,
 	.unlocked_ioctl	= btrfs_ioctl,
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-10 22:39     ` Zach Brown
@ 2013-06-12 12:59       ` Chris Mason
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Mason @ 2013-06-12 12:59 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-btrfs, Josef Bacik

Quoting Zach Brown (2013-06-10 18:39:58)
> On Tue, Jun 04, 2013 at 04:26:57PM -0700, Zach Brown wrote:
> > On Tue, Jun 04, 2013 at 07:16:53PM -0400, Chris Mason wrote:
> > > Quoting Zach Brown (2013-06-04 18:17:54)
> > > > Hi gang,
> > > > 
> > > > I finally sat down to fix that readdir hang that has been in the back
> > > > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > > > don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> > > > indicator that we shouldn't return entries.  Does this look reasonable?
> > > 
> > > I like it, and it doesn't look too far away from how others are abusing
> > > f_version.  Have you tried with NFS?  I don't think it'll hurt, but NFS
> > > loves to surprise me.
> > 
> > Mm, no, I hadn't.  I'll give it a go tomorrow.  What could go wrong? :)
> 
> Or a week later.  Pretty close!
> 
> I couldn't get NFS to break.  Clients see new entries created directly
> in the exported btrfs and on either of noac and actime=1 client mounts.
> For whatever that's worth.

Great.

> 
> But I did find that I'd broken the case of trying to re-enable readdir
> results by seeking past the last entry (which happens to be the current
> f_pos now that we're using f_version).
> 
> Here's the incremental fix against what Josef has in -next.  I'm cool
> with either squashing or just committing it.

Lets squash it in, Josef loves to rebase.

-chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
                   ` (6 preceding siblings ...)
  2013-06-04 23:16 ` [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Chris Mason
@ 2013-07-01 12:54 ` Josef Bacik
  2013-07-01 13:18   ` Chris Mason
                     ` (2 more replies)
  7 siblings, 3 replies; 22+ messages in thread
From: Josef Bacik @ 2013-07-01 12:54 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-btrfs

On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote:
> Hi gang,
> 
> I finally sat down to fix that readdir hang that has been in the back
> of my mind for a while.  I *hope* that the fix is pretty simple: just
> don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> indicator that we shouldn't return entries.  Does this look reasonable?
> 
> We still have the problem that we can generate valid large f_pos values
> that can confuse 32bit userspace, but that's a different problem.  I
> think we'll want filldir generation of EOVERFLOW like what exists for
> large inodes. 
> 
> The rest of the patches are cleanups that I saw when absorbing the
> code.  It's all lightly tested with xfstests but it wouldn't surprise
> me if I missed something so review is appreciated.
> 
> Thanks!
> 

One of these patches is making new entries not show up in readdir.  This was
discovered while running stress.sh overnight, it complained about files not
matching but when they were checked the files matched.  Dropping the entire
series made stress.sh run fine.  So I'm dropping these for the next merge window
but I'll dig into it and try and figure out what was causing the problem.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-07-01 12:54 ` Josef Bacik
@ 2013-07-01 13:18   ` Chris Mason
  2013-07-01 16:10   ` Zach Brown
  2013-07-11 23:19   ` Zach Brown
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Mason @ 2013-07-01 13:18 UTC (permalink / raw)
  To: Josef Bacik, Zach Brown; +Cc: linux-btrfs

Quoting Josef Bacik (2013-07-01 08:54:35)
> On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote:
> > Hi gang,
> > 
> > I finally sat down to fix that readdir hang that has been in the back
> > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> > indicator that we shouldn't return entries.  Does this look reasonable?
> > 
> > We still have the problem that we can generate valid large f_pos values
> > that can confuse 32bit userspace, but that's a different problem.  I
> > think we'll want filldir generation of EOVERFLOW like what exists for
> > large inodes. 
> > 
> > The rest of the patches are cleanups that I saw when absorbing the
> > code.  It's all lightly tested with xfstests but it wouldn't surprise
> > me if I missed something so review is appreciated.
> > 
> > Thanks!
> > 
> 
> One of these patches is making new entries not show up in readdir.  This was
> discovered while running stress.sh overnight, it complained about files not
> matching but when they were checked the files matched.  Dropping the entire
> series made stress.sh run fine.  So I'm dropping these for the next merge window
> but I'll dig into it and try and figure out what was causing the problem.

Unfortunately I've only triggered this on flash, and the run takes about
two hours to trigger.  Trying now with some extra printks to see if I
can nail it down

-chris


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-07-01 12:54 ` Josef Bacik
  2013-07-01 13:18   ` Chris Mason
@ 2013-07-01 16:10   ` Zach Brown
  2013-07-01 17:18     ` Chris Mason
  2013-07-11 23:19   ` Zach Brown
  2 siblings, 1 reply; 22+ messages in thread
From: Zach Brown @ 2013-07-01 16:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

> > code.  It's all lightly tested with xfstests but it wouldn't surprise
> > me if I missed something so review is appreciated.

*mmm, hmmm*

> One of these patches is making new entries not show up in readdir.  This was
> discovered while running stress.sh overnight, it complained about files not
> matching but when they were checked the files matched.  Dropping the entire
> series made stress.sh run fine.  So I'm dropping these for the next merge window
> but I'll dig into it and try and figure out what was causing the problem.

Nerts.  It's got to be the delayed inode stuff.

Maybe it's some unlink/recreate pattern?  Is this a thing that stress.sh
does?  (Where's stress.sh live?)

- z 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-07-01 16:10   ` Zach Brown
@ 2013-07-01 17:18     ` Chris Mason
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Mason @ 2013-07-01 17:18 UTC (permalink / raw)
  To: Zach Brown, Josef Bacik; +Cc: linux-btrfs

Quoting Zach Brown (2013-07-01 12:10:02)
> > > code.  It's all lightly tested with xfstests but it wouldn't surprise
> > > me if I missed something so review is appreciated.
> 
> *mmm, hmmm*
> 
> > One of these patches is making new entries not show up in readdir.  This was
> > discovered while running stress.sh overnight, it complained about files not
> > matching but when they were checked the files matched.  Dropping the entire
> > series made stress.sh run fine.  So I'm dropping these for the next merge window
> > but I'll dig into it and try and figure out what was causing the problem.
> 
> Nerts.  It's got to be the delayed inode stuff.
> 
> Maybe it's some unlink/recreate pattern?  Is this a thing that stress.sh
> does?  (Where's stress.sh live?)

It's an old namesys tool, I've copied it here:

http://masoncoding.com/mason/tools/stress.sh

My command line:

stress.sh -n 50 -s -c /build/linux /mnt

Basically its:

1) Make a list of all files in /build/linux and their md5sums
2) Start 50 procs
3) Each proc copies /build/linux into /mnt/stress/proc_num/
4) Each proc compares the md5sums of its private copy with the original
master
5) Each proc deletes the private copy
6) Repeat steps 2-5 forever.

The most likely cause of the bug I'm seeing is readdir not finding new
files.

-chris


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups
  2013-07-01 12:54 ` Josef Bacik
  2013-07-01 13:18   ` Chris Mason
  2013-07-01 16:10   ` Zach Brown
@ 2013-07-11 23:19   ` Zach Brown
  2 siblings, 0 replies; 22+ messages in thread
From: Zach Brown @ 2013-07-11 23:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Chris Mason

On Mon, Jul 01, 2013 at 08:54:35AM -0400, Josef Bacik wrote:
> On Tue, Jun 04, 2013 at 06:17:54PM -0400, Zach Brown wrote:
> > 
> > I finally sat down to fix that readdir hang that has been in the back
> > of my mind for a while.  I *hope* that the fix is pretty simple: just
> > don't manufacture a fake f_pos, I *think* we can abuse f_version as an
> > indicator that we shouldn't return entries.  Does this look reasonable?
>
> One of these patches is making new entries not show up in readdir.  This was
> discovered while running stress.sh overnight, it complained about files not
> matching but when they were checked the files matched.  Dropping the entire
> series made stress.sh run fine.  So I'm dropping these for the next merge window
> but I'll dig into it and try and figure out what was causing the problem.

OK, how about this.

First, just drop the series.  Most of it was opportunistic cleanups that
I saw as I was reading the code.  But it certainly isn't a comprehensive
cleanup.  We'd still want to go back later and fix how inefficient the
delayed item data structures are for readdir.  So from some perspective
it's just risky churn with little upside.

And I think I found a much simpler way to stop the current readdir from
looping instead of mucking around with f_version.  Just use LLONG_MAX if
entries have already overflowed INT_MAX.

We'd still want real freed offset reuse some day, but that's a bunch of
work that'll have to be done very carefully.  This will at least stop
64bit apps from failing with offsets past 64bits and is very low risk.

So just add this patch and forget about the rest of the series?

It'll still technically conflict with the s/filp->f_pos/ctx->pos/ in the
readdir interface change in -next but that fixup is trivial.

- z

>From e295deb0f56f846738ee3dec63fe0350f3952503 Mon Sep 17 00:00:00 2001
From: Zach Brown <zab@redhat.com>
Date: Wed, 10 Jul 2013 19:48:51 -0400
Subject: [PATCH] btrfs: don't loop on large offsets in readdir

When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.

But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX.  It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.

So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t.   Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.

Signed-off-by: Zach Brown <zab@redhat.com>
---
 fs/btrfs/inode.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c157482..8583e8d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5186,14 +5186,31 @@ next:
 	}
 
 	/* Reached end of directory/root. Bump pos past the last item. */
-	if (key_type == BTRFS_DIR_INDEX_KEY)
-		/*
-		 * 32-bit glibc will use getdents64, but then strtol -
-		 * so the last number we can serve is this.
-		 */
-		filp->f_pos = 0x7fffffff;
-	else
-		filp->f_pos++;
+	filp->f_pos++;
+
+	/* 
+	 * Stop new entries from being returned after we return the last
+	 * entry.
+	 *
+	 * New directory entries are assigned a strictly increasing
+	 * offset.  This means that new entries created during readdir
+	 * are *guaranteed* to be seen in the future by that readdir.
+	 * This has broken buggy programs which operate on names as
+	 * they're returned by readdir.  Until we re-use freed offsets
+	 * we have this hack to stop new entries from being returned
+	 * under the assumption that they'll never reach this huge
+	 * offset.
+	 *
+	 * This is being careful not to overflow 32bit loff_t unless the
+	 * last entry requires it because doing so has broken 32bit apps
+	 * in the past.
+	 */
+	if (key_type == BTRFS_DIR_INDEX_KEY) {
+		if (filp->f_pos >= INT_MAX)
+			filp->f_pos = LLONG_MAX;
+		else
+			filp->f_pos = INT_MAX;
+	}
 nopos:
 	ret = 0;
 err:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2013-07-11 23:19 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-04 22:17 [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Zach Brown
2013-06-04 22:17 ` [PATCH 1/6] btrfs: set readdir f_pos only after filldir Zach Brown
2013-06-05  1:19   ` Miao Xie
2013-06-04 22:17 ` [PATCH 2/6] btrfs: fix readdir hang with offsets past INT_MAX Zach Brown
2013-06-04 22:17 ` [PATCH 3/6] btrfs: trivial delayed item readdir list cleanups Zach Brown
2013-06-04 22:17 ` [PATCH 4/6] btrfs: simplify finding next/prev delayed items Zach Brown
2013-06-04 22:17 ` [PATCH 5/6] btrfs: add helper to get delayed item root Zach Brown
2013-06-04 22:18 ` [PATCH 6/6] btrfs: get fewer delayed item refs during readdir Zach Brown
2013-06-04 23:16 ` [PATCH 0/6] fix INT_MAX readdir hang, plus cleanups Chris Mason
2013-06-04 23:26   ` Zach Brown
2013-06-05  2:34     ` Miao Xie
2013-06-05 13:36       ` David Sterba
2013-06-06  1:35         ` Miao Xie
2013-06-06 13:55           ` David Sterba
2013-06-06 14:32             ` Chris Mason
2013-06-10 22:39     ` Zach Brown
2013-06-12 12:59       ` Chris Mason
2013-07-01 12:54 ` Josef Bacik
2013-07-01 13:18   ` Chris Mason
2013-07-01 16:10   ` Zach Brown
2013-07-01 17:18     ` Chris Mason
2013-07-11 23:19   ` Zach Brown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.