All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] xattr: use rbtree for simple_xattrs
@ 2022-11-08 11:41 Christian Brauner
  2022-11-08 18:59 ` Paul E. McKenney
  2022-11-08 19:08 ` Roman Gushchin
  0 siblings, 2 replies; 8+ messages in thread
From: Christian Brauner @ 2022-11-08 11:41 UTC (permalink / raw)
  To: Tejun Heo, linux-fsdevel
  Cc: Christian Brauner, Vasily Averin, Hugh Dickins, Seth Forshee,
	Stéphane Graber, Roman Gushchin, Paul E. McKenney, Al Viro

A while ago Vasily reported that it is possible to set a large number of
xattrs on inodes of filesystems that make use of the simple xattr
infrastructure. This includes all kernfs-based filesystems that support
xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
mounted by unprivileged users in unprivileged containers and root in an
unprivileged container can set an unrestricted number of security.*
xattrs and privileged users can also set unlimited trusted.* xattrs. As
there are apparently users that have a fairly large number of xattrs we
should scale a bit better. Other xattrs such as user.* are restricted
for kernfs-based instances to a fairly limited number.

Using a simple linked list protected by a spinlock used for set, get,
and list operations doesn't scale well if users use a lot of xattrs even
if it's not a crazy number. And There's no need to bring in the big guns
like rhashtables or rw semaphors for this. An rbtree with a seqlock and
limited rcu semantics is enough.

It scales within the constraints we are working in. By far the most
common operations is getting an xattr. The get operation is optimized to
be lock free as long as there are no writers. The list operation takes
the read lock and protects against concurrent writers while allowing
lockless get operations. Locking out other listxattr callers isn't a
huge deal since listing xattrs is mostly relevant when copying a file or
copying all xattrs between files.

Additionally, listxattr() doesn't list the values of xattrs it can only
be used to list the names of all xattrs set on a file. And the number of
xattr names that can be listed with listxattr() is limited to
XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
found it will return -EFBIG. In short, the maximum amount of memory that
can be retrieved via listxattr() is limited.

Of course, the API is broken as documented on xattr(7) already. In the
future we might want to address this but for now this is the world we
live in and have lived for a long time. But it does indeed mean that
once an application goes over XATTR_LIST_MAX limit of xattrs set on an
inode it isn't possible to copy the file and include its xattrs in the
copy unless the caller knows all xattrs or limits the copy of the xattrs
to important ones it knows by name (At least for tmpfs, and kernfs-based
filesystems. Other filesystems might provide ways of achieving this.).

Also add proper kernel documentation to all the functions.
A big thanks to Paul for his comments.

Cc: Vasily Averin <vvs@openvz.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
---

Notes:
    In addition to this patch I would like to propose that we restrict the number
    of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
    other words, we restrict the number of xattrs for simple xattr filesystems to
    the number of xattrs names that can be retrieved via listxattr(). That should
    be about 2000 to 3000 xattrs per inode which is more than enough. We should
    risk this and see if we get any regression reports from userswith this
    approach.
    
    This should be as simple as adding a max_list member to struct simple_xattrs
    and initialize it with XATTR_MAX_LIST. Set operations would then check against
    this field whether the new xattr they are trying to set will fit and return
    -EFBIG otherwise. I think that might be a good approach to get rid of the in
    principle unbounded number of xattrs that can be set via the simple xattr
    infrastructure. I think this is a regression risk worth taking.
    
    /* v2 */
    Christian Brauner <brauner@kernel.org>:
    - Fix kernel doc.
    - Remove accidental leftover union from previous iteration.

 fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
 include/linux/xattr.h |  40 ++---
 mm/shmem.c            |   2 +-
 3 files changed, 270 insertions(+), 102 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 61107b6bbed2..f18454161d54 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
 }
 EXPORT_SYMBOL(xattr_full_name);
 
-/*
- * Allocate new xattr and copy in the value; but leave the name to callers.
+/**
+ * free_simple_xattr - free an xattr object
+ * @xattr: the xattr object
+ *
+ * Free the xattr object. Can handle @xattr being NULL.
+ */
+static inline void free_simple_xattr(struct simple_xattr *xattr)
+{
+	if (xattr)
+		kfree(xattr->name);
+	kvfree(xattr);
+}
+
+/**
+ * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
+ * @cb: the rcu callback head
+ */
+static void free_simple_xattr_rcu_cb(struct callback_head *cb)
+{
+	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
+}
+
+/**
+ * free_simple_xattr_rcu - free an xattr object with rcu semantics
+ * @xattr: the xattr object
+ */
+static void free_simple_xattr_rcu(struct simple_xattr *xattr)
+{
+	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
+}
+
+/**
+ * put_simple_xattr_rcu - decrement refcount for xattr object
+ * @xattr: the xattr object
+ *
+ * Decrement the reference count of an xattr object and free it using rcu
+ * semantics if we're the holder of the last reference. Can handle @xattr being
+ * NULL.
+ */
+static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
+{
+	if (xattr && refcount_dec_and_test(&xattr->ref))
+		free_simple_xattr_rcu(xattr);
+}
+
+/**
+ * simple_xattr_alloc - allocate new xattr object
+ * @value: value of the xattr object
+ * @size: size of @value
+ *
+ * Allocate a new xattr object and initialize respective members. The caller is
+ * responsible for handling the name of the xattr.
+ *
+ * The initial reference count belongs to the rbtree.
+ *
+ * Return: On success a new xattr object is returned. On failure NULL is
+ * returned.
  */
 struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 {
@@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 
 	new_xattr->size = size;
 	memcpy(new_xattr->value, value, size);
+	refcount_set(&new_xattr->ref, 1);
 	return new_xattr;
 }
 
-/*
- * xattr GET operation for in-memory/pseudo filesystems
+/**
+ * simple_xattr_get - get an xattr object
+ * @xattrs: the header of the xattr object
+ * @name: the name of the xattr to retrieve
+ * @buffer: the buffer to store the value into
+ * @size: the size of @buffer
+ *
+ * Try to find and retrieve the xattr object associated with @name. If the
+ * object is found and still in the rbtree bump the reference count.
+ *
+ * If @buffer is provided store the value of @xattr in @buffer.
+ *
+ * Return: On success zero and on error a negative error code is returned.
  */
 int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
 		     void *buffer, size_t size)
 {
-	struct simple_xattr *xattr;
-	int ret = -ENODATA;
-
-	spin_lock(&xattrs->lock);
-	list_for_each_entry(xattr, &xattrs->head, list) {
-		if (strcmp(name, xattr->name))
-			continue;
-
-		ret = xattr->size;
-		if (buffer) {
-			if (size < xattr->size)
-				ret = -ERANGE;
-			else
-				memcpy(buffer, xattr->value, xattr->size);
+	struct simple_xattr *xattr = NULL;
+	struct rb_node *rbp;
+	int ret, seq = 0;
+
+	rcu_read_lock();
+	do {
+		read_seqbegin_or_lock(&xattrs->lock, &seq);
+		rbp = rcu_dereference(xattrs->rb_root.rb_node);
+		while (rbp) {
+			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
+			if (strcmp(xattr->name, name) < 0) {
+				rbp = rcu_dereference(rbp->rb_left);
+			} else if (strcmp(xattr->name, name) > 0) {
+				rbp = rcu_dereference(rbp->rb_right);
+			} else {
+				if (!likely(refcount_inc_not_zero(&xattr->ref)))
+					xattr = NULL;
+				break;
+			}
+			xattr = NULL;
 		}
-		break;
+	} while (need_seqretry(&xattrs->lock, seq));
+	done_seqretry(&xattrs->lock, seq);
+	rcu_read_unlock();
+
+	if (!xattr)
+		return -ENODATA;
+
+	ret = xattr->size;
+	if (buffer) {
+		if (size < xattr->size)
+			ret = -ERANGE;
+		else
+			memcpy(buffer, xattr->value, xattr->size);
 	}
-	spin_unlock(&xattrs->lock);
+
+	put_simple_xattr_rcu(xattr);
 	return ret;
 }
 
 /**
- * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
- * @xattrs: target simple_xattr list
- * @name: name of the extended attribute
- * @value: value of the xattr. If %NULL, will remove the attribute.
- * @size: size of the new xattr
- * @flags: %XATTR_{CREATE|REPLACE}
- * @removed_size: returns size of the removed xattr, -1 if none removed
+ * simple_xattr_set - set an xattr object
+ * @xattrs: the header of the xattr object
+ * @name: the name of the xattr to retrieve
+ * @value: the value to store along the xattr
+ * @size: the size of @value
+ * @flags: the flags determining how to set the xattr
+ * @removed_size: the size of the removed xattr
+ *
+ * Set a new xattr object.
+ * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
+ * is specified in @flags a matching xattr object for @name must already exist.
+ * If it does it will be replace with the new xattr object. If it doesn't we
+ * fail. If XATTR_CREATE is specified and a matching xattr does already exist
+ * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
+ * insert the new xattr replacing any existing one.
  *
- * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
- * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
- * otherwise, fails with -ENODATA.
+ * If @value is empty and a matching xattr object is found we delete it if
+ * XATTR_REPLACE is specified in @flags or @flags is zero.
  *
- * Returns 0 on success, -errno on failure.
+ * If @value is empty and no matching xattr object for @name is found we do
+ * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
+ * XATTR_REPLACE we fail as mentioned above.
+ *
+ * Return: On success zero and on error a negative error code is returned.
  */
 int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
 		     const void *value, size_t size, int flags,
 		     ssize_t *removed_size)
 {
-	struct simple_xattr *xattr;
-	struct simple_xattr *new_xattr = NULL;
+	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
+	struct rb_node *parent = NULL, **rbp;
 	int err = 0;
 
 	if (removed_size)
@@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
 		}
 	}
 
-	spin_lock(&xattrs->lock);
-	list_for_each_entry(xattr, &xattrs->head, list) {
-		if (!strcmp(name, xattr->name)) {
-			if (flags & XATTR_CREATE) {
-				xattr = new_xattr;
-				err = -EEXIST;
-			} else if (new_xattr) {
-				list_replace(&xattr->list, &new_xattr->list);
-				if (removed_size)
-					*removed_size = xattr->size;
-			} else {
-				list_del(&xattr->list);
-				if (removed_size)
-					*removed_size = xattr->size;
-			}
-			goto out;
-		}
-	}
-	if (flags & XATTR_REPLACE) {
-		xattr = new_xattr;
-		err = -ENODATA;
-	} else {
-		list_add(&new_xattr->list, &xattrs->head);
+	write_seqlock(&xattrs->lock);
+	rbp = &xattrs->rb_root.rb_node;
+	while (*rbp) {
+		parent = *rbp;
+		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
+		if (strcmp(xattr->name, name) < 0)
+			rbp = &(*rbp)->rb_left;
+		else if (strcmp(xattr->name, name) > 0)
+			rbp = &(*rbp)->rb_right;
+		else
+			break;
 		xattr = NULL;
 	}
-out:
-	spin_unlock(&xattrs->lock);
+
 	if (xattr) {
-		kfree(xattr->name);
-		kvfree(xattr);
+		/* Fail if XATTR_CREATE is requested and the xattr exists. */
+		if (flags & XATTR_CREATE) {
+			err = -EEXIST;
+			goto out_unlock;
+		}
+
+		if (new_xattr)
+			rb_replace_node_rcu(&xattr->rb_node,
+					    &new_xattr->rb_node,
+					    &xattrs->rb_root);
+		else
+			rb_erase(&xattr->rb_node, &xattrs->rb_root);
+		if (!err && removed_size)
+			*removed_size = xattr->size;
+	} else {
+		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
+		if (flags & XATTR_REPLACE) {
+			err = -ENODATA;
+			goto out_unlock;
+		}
+
+		/*
+		 * If XATTR_CREATE or no flags are specified together with a
+		 * new value simply insert it.
+		 */
+		if (new_xattr) {
+			rb_link_node_rcu(&new_xattr->rb_node, parent, rbp);
+			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
+		}
+
+		/*
+		 * If XATTR_CREATE or no flags are specified and neither an old
+		 * or new xattr were found/exist then we don't need to do
+		 * anything.
+		 */
 	}
+
+out_unlock:
+	write_sequnlock(&xattrs->lock);
+	if (err)
+		free_simple_xattr(new_xattr);
+	else
+		put_simple_xattr_rcu(xattr);
 	return err;
 
 }
@@ -1134,14 +1258,31 @@ static int xattr_list_one(char **buffer, ssize_t *remaining_size,
 	return 0;
 }
 
-/*
- * xattr LIST operation for in-memory/pseudo filesystems
+/**
+ * simple_xattr_list - list all xattr objects
+ * @inode: inode from which to get the xattrs
+ * @xattrs: the header of the xattr object
+ * @buffer: the buffer to store all xattrs into
+ * @size: the size of @buffer
+ *
+ * List all xattrs associated with @inode. If @buffer is NULL we returned the
+ * required size of the buffer. If @buffer is provided we store the xattrs
+ * value into it provided it is big enough.
+ *
+ * Note, the number of xattr names that can be listed with listxattr(2) is
+ * limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
+ * vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are found
+ * it will return -E2BIG.
+ *
+ * Return: On success the required size or the size of the copied xattrs is
+ * returned. On error a negative error code is returned.
  */
 ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
 			  char *buffer, size_t size)
 {
 	bool trusted = capable(CAP_SYS_ADMIN);
 	struct simple_xattr *xattr;
+	struct rb_node *rbp;
 	ssize_t remaining_size = size;
 	int err = 0;
 
@@ -1162,8 +1303,10 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
 	}
 #endif
 
-	spin_lock(&xattrs->lock);
-	list_for_each_entry(xattr, &xattrs->head, list) {
+	read_seqlock_excl(&xattrs->lock);
+	for (rbp = rb_first(&xattrs->rb_root); rbp; rbp = rb_next(rbp)) {
+		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
+
 		/* skip "trusted." attributes for unprivileged callers */
 		if (!trusted && xattr_is_trusted(xattr->name))
 			continue;
@@ -1172,18 +1315,61 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
 		if (err)
 			break;
 	}
-	spin_unlock(&xattrs->lock);
+	read_sequnlock_excl(&xattrs->lock);
 
 	return err ? err : size - remaining_size;
 }
 
-/*
- * Adds an extended attribute to the list
+/**
+ * simple_xattr_add - add xattr objects
+ * @xattrs: the header of the xattr object
+ * @new_xattr: the xattr object to add
+ *
+ * Add an xattr object to @xattrs. This assumes no replacement or removal of
+ * matching xattrs is wanted.
  */
-void simple_xattr_list_add(struct simple_xattrs *xattrs,
-			   struct simple_xattr *new_xattr)
+void simple_xattr_add(struct simple_xattrs *xattrs,
+		      struct simple_xattr *new_xattr)
 {
-	spin_lock(&xattrs->lock);
-	list_add(&new_xattr->list, &xattrs->head);
-	spin_unlock(&xattrs->lock);
+	write_seqlock(&xattrs->lock);
+	rb_link_node_rcu(&new_xattr->rb_node, xattrs->rb_root.rb_node,
+			 &xattrs->rb_root.rb_node);
+	rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
+	write_sequnlock(&xattrs->lock);
+}
+
+/**
+ * simple_xattr_init - initialize new xattr header
+ * @xattrs: header to initialize
+ *
+ * Initialize relevant fields of a an xattr header.
+ */
+void simple_xattrs_init(struct simple_xattrs *xattrs)
+{
+	seqlock_init(&xattrs->lock);
+	xattrs->rb_root = RB_ROOT;
+}
+
+/**
+ * simple_xattrs_free - free xattrs
+ * @xattrs: xattr header whose xattrs to destroy
+ *
+ * Destroy all xattrs in @xattr. When this is called no one can hold a
+ * reference to any of the xattrs anymore.
+ */
+void simple_xattrs_free(struct simple_xattrs *xattrs)
+{
+	struct rb_node *rbp;
+
+	rbp = rb_first(&xattrs->rb_root);
+	while (rbp) {
+		struct simple_xattr *xattr;
+		struct rb_node *rbp_next;
+
+		rbp_next = rb_next(rbp);
+		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
+		rb_erase(&xattr->rb_node, &xattrs->rb_root);
+		free_simple_xattr(xattr);
+		rbp = rbp_next;
+	}
 }
diff --git a/include/linux/xattr.h b/include/linux/xattr.h
index 4c379d23ec6e..8c01e622ad92 100644
--- a/include/linux/xattr.h
+++ b/include/linux/xattr.h
@@ -80,48 +80,30 @@ static inline const char *xattr_prefix(const struct xattr_handler *handler)
 }
 
 struct simple_xattrs {
-	struct list_head head;
-	spinlock_t lock;
+	struct rb_root rb_root;
+	seqlock_t lock;
 };
 
 struct simple_xattr {
-	struct list_head list;
+	struct rb_node rb_node;
+	struct rcu_head rcu;
+	refcount_t ref;
 	char *name;
 	size_t size;
 	char value[];
 };
 
-/*
- * initialize the simple_xattrs structure
- */
-static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
-{
-	INIT_LIST_HEAD(&xattrs->head);
-	spin_lock_init(&xattrs->lock);
-}
-
-/*
- * free all the xattrs
- */
-static inline void simple_xattrs_free(struct simple_xattrs *xattrs)
-{
-	struct simple_xattr *xattr, *node;
-
-	list_for_each_entry_safe(xattr, node, &xattrs->head, list) {
-		kfree(xattr->name);
-		kvfree(xattr);
-	}
-}
-
+void simple_xattrs_init(struct simple_xattrs *xattrs);
+void simple_xattrs_free(struct simple_xattrs *xattrs);
 struct simple_xattr *simple_xattr_alloc(const void *value, size_t size);
 int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
 		     void *buffer, size_t size);
 int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
 		     const void *value, size_t size, int flags,
 		     ssize_t *removed_size);
-ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs, char *buffer,
-			  size_t size);
-void simple_xattr_list_add(struct simple_xattrs *xattrs,
-			   struct simple_xattr *new_xattr);
+ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
+			  char *buffer, size_t size);
+void simple_xattr_add(struct simple_xattrs *xattrs,
+		      struct simple_xattr *new_xattr);
 
 #endif	/* _LINUX_XATTR_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 8280a5cb48df..2872e6607b2c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3255,7 +3255,7 @@ static int shmem_initxattrs(struct inode *inode,
 		memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
 		       xattr->name, len);
 
-		simple_xattr_list_add(&info->xattrs, new_xattr);
+		simple_xattr_add(&info->xattrs, new_xattr);
 	}
 
 	return 0;

base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] xattr: use rbtree for simple_xattrs
  2022-11-08 11:41 [PATCH v2] xattr: use rbtree for simple_xattrs Christian Brauner
@ 2022-11-08 18:59 ` Paul E. McKenney
  2022-11-09  9:51   ` Christian Brauner
  2022-11-08 19:08 ` Roman Gushchin
  1 sibling, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2022-11-08 18:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, linux-fsdevel, Vasily Averin, Hugh Dickins,
	Seth Forshee, Stéphane Graber, Roman Gushchin, Al Viro

On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> A while ago Vasily reported that it is possible to set a large number of
> xattrs on inodes of filesystems that make use of the simple xattr
> infrastructure. This includes all kernfs-based filesystems that support
> xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> mounted by unprivileged users in unprivileged containers and root in an
> unprivileged container can set an unrestricted number of security.*
> xattrs and privileged users can also set unlimited trusted.* xattrs. As
> there are apparently users that have a fairly large number of xattrs we
> should scale a bit better. Other xattrs such as user.* are restricted
> for kernfs-based instances to a fairly limited number.
> 
> Using a simple linked list protected by a spinlock used for set, get,
> and list operations doesn't scale well if users use a lot of xattrs even
> if it's not a crazy number. And There's no need to bring in the big guns
> like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> limited rcu semantics is enough.
> 
> It scales within the constraints we are working in. By far the most
> common operations is getting an xattr. The get operation is optimized to
> be lock free as long as there are no writers. The list operation takes
> the read lock and protects against concurrent writers while allowing
> lockless get operations. Locking out other listxattr callers isn't a
> huge deal since listing xattrs is mostly relevant when copying a file or
> copying all xattrs between files.
> 
> Additionally, listxattr() doesn't list the values of xattrs it can only
> be used to list the names of all xattrs set on a file. And the number of
> xattr names that can be listed with listxattr() is limited to
> XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> found it will return -EFBIG. In short, the maximum amount of memory that
> can be retrieved via listxattr() is limited.
> 
> Of course, the API is broken as documented on xattr(7) already. In the
> future we might want to address this but for now this is the world we
> live in and have lived for a long time. But it does indeed mean that
> once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> inode it isn't possible to copy the file and include its xattrs in the
> copy unless the caller knows all xattrs or limits the copy of the xattrs
> to important ones it knows by name (At least for tmpfs, and kernfs-based
> filesystems. Other filesystems might provide ways of achieving this.).
> 
> Also add proper kernel documentation to all the functions.
> A big thanks to Paul for his comments.
> 
> Cc: Vasily Averin <vvs@openvz.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Looks mostly plausible from an RCU viewpoint, but there are a few
questions/comments inline below.

							Thanx, Paul

> ---
> 
> Notes:
>     In addition to this patch I would like to propose that we restrict the number
>     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
>     other words, we restrict the number of xattrs for simple xattr filesystems to
>     the number of xattrs names that can be retrieved via listxattr(). That should
>     be about 2000 to 3000 xattrs per inode which is more than enough. We should
>     risk this and see if we get any regression reports from userswith this
>     approach.
>     
>     This should be as simple as adding a max_list member to struct simple_xattrs
>     and initialize it with XATTR_MAX_LIST. Set operations would then check against
>     this field whether the new xattr they are trying to set will fit and return
>     -EFBIG otherwise. I think that might be a good approach to get rid of the in
>     principle unbounded number of xattrs that can be set via the simple xattr
>     infrastructure. I think this is a regression risk worth taking.
>     
>     /* v2 */
>     Christian Brauner <brauner@kernel.org>:
>     - Fix kernel doc.
>     - Remove accidental leftover union from previous iteration.
> 
>  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
>  include/linux/xattr.h |  40 ++---
>  mm/shmem.c            |   2 +-
>  3 files changed, 270 insertions(+), 102 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 61107b6bbed2..f18454161d54 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
>  }
>  EXPORT_SYMBOL(xattr_full_name);
>  
> -/*
> - * Allocate new xattr and copy in the value; but leave the name to callers.
> +/**
> + * free_simple_xattr - free an xattr object
> + * @xattr: the xattr object
> + *
> + * Free the xattr object. Can handle @xattr being NULL.
> + */
> +static inline void free_simple_xattr(struct simple_xattr *xattr)
> +{
> +	if (xattr)
> +		kfree(xattr->name);
> +	kvfree(xattr);
> +}
> +
> +/**
> + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> + * @cb: the rcu callback head
> + */
> +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> +{
> +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> +}
> +
> +/**
> + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> + * @xattr: the xattr object
> + */
> +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> +{
> +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> +}
> +
> +/**
> + * put_simple_xattr_rcu - decrement refcount for xattr object
> + * @xattr: the xattr object
> + *
> + * Decrement the reference count of an xattr object and free it using rcu
> + * semantics if we're the holder of the last reference. Can handle @xattr being
> + * NULL.
> + */
> +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> +{
> +	if (xattr && refcount_dec_and_test(&xattr->ref))
> +		free_simple_xattr_rcu(xattr);
> +}

Looks like the standard combined reference counter and RCU combination,
goog!

> +
> +/**
> + * simple_xattr_alloc - allocate new xattr object
> + * @value: value of the xattr object
> + * @size: size of @value
> + *
> + * Allocate a new xattr object and initialize respective members. The caller is
> + * responsible for handling the name of the xattr.
> + *
> + * The initial reference count belongs to the rbtree.
> + *
> + * Return: On success a new xattr object is returned. On failure NULL is
> + * returned.
>   */
>  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  {
> @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  
>  	new_xattr->size = size;
>  	memcpy(new_xattr->value, value, size);
> +	refcount_set(&new_xattr->ref, 1);

Yes, one is usually needed for the link in the tree.

>  	return new_xattr;
>  }
>  
> -/*
> - * xattr GET operation for in-memory/pseudo filesystems
> +/**
> + * simple_xattr_get - get an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @buffer: the buffer to store the value into
> + * @size: the size of @buffer
> + *
> + * Try to find and retrieve the xattr object associated with @name. If the
> + * object is found and still in the rbtree bump the reference count.
> + *
> + * If @buffer is provided store the value of @xattr in @buffer.
> + *
> + * Return: On success zero and on error a negative error code is returned.
>   */
>  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
>  		     void *buffer, size_t size)
>  {
> -	struct simple_xattr *xattr;
> -	int ret = -ENODATA;
> -
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (strcmp(name, xattr->name))
> -			continue;
> -
> -		ret = xattr->size;
> -		if (buffer) {
> -			if (size < xattr->size)
> -				ret = -ERANGE;
> -			else
> -				memcpy(buffer, xattr->value, xattr->size);
> +	struct simple_xattr *xattr = NULL;
> +	struct rb_node *rbp;
> +	int ret, seq = 0;
> +
> +	rcu_read_lock();
> +	do {
> +		read_seqbegin_or_lock(&xattrs->lock, &seq);

It might be necessary to try a few times before grabbing the lock, but
perhaps we should actually hit the problem before increasing complexity.

> +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> +		while (rbp) {
> +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> +			if (strcmp(xattr->name, name) < 0) {
> +				rbp = rcu_dereference(rbp->rb_left);
> +			} else if (strcmp(xattr->name, name) > 0) {
> +				rbp = rcu_dereference(rbp->rb_right);
> +			} else {
> +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> +					xattr = NULL;
> +				break;
> +			}
> +			xattr = NULL;
>  		}

Maybe this is too specialized, but should this be in the rbtree code,
perhaps rb_find_first_rcu(), but with an appropriate cmp() function?
The refcount_inc_not_zero() clearly needs to be in the caller.

If this is the only instance of this sort of code, it is likely not
really worthwhile.  But if we have several of these open coded, it would
be good to consolidate that code.

> -		break;
> +	} while (need_seqretry(&xattrs->lock, seq));
> +	done_seqretry(&xattrs->lock, seq);
> +	rcu_read_unlock();
> +
> +	if (!xattr)
> +		return -ENODATA;
> +
> +	ret = xattr->size;
> +	if (buffer) {
> +		if (size < xattr->size)
> +			ret = -ERANGE;
> +		else
> +			memcpy(buffer, xattr->value, xattr->size);

If all we are doing is copying to an in-kernel buffer, why not dispense
with xattr->ref and do the memcpy() under rcu_read_lock() protection?
This would avoid some overhead from reference-count cache misses.

Of course, if that memcpy() can page fault or some such, then what
you have is necessary.  But this is all in-kernel, right?  And if not,
shouldn't the pointers be decorated with __user or some such?

>  	}
> -	spin_unlock(&xattrs->lock);
> +
> +	put_simple_xattr_rcu(xattr);
>  	return ret;
>  }
>  
>  /**
> - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> - * @xattrs: target simple_xattr list
> - * @name: name of the extended attribute
> - * @value: value of the xattr. If %NULL, will remove the attribute.
> - * @size: size of the new xattr
> - * @flags: %XATTR_{CREATE|REPLACE}
> - * @removed_size: returns size of the removed xattr, -1 if none removed
> + * simple_xattr_set - set an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @value: the value to store along the xattr
> + * @size: the size of @value
> + * @flags: the flags determining how to set the xattr
> + * @removed_size: the size of the removed xattr
> + *
> + * Set a new xattr object.
> + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> + * is specified in @flags a matching xattr object for @name must already exist.
> + * If it does it will be replace with the new xattr object. If it doesn't we
> + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> + * insert the new xattr replacing any existing one.
>   *
> - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> - * otherwise, fails with -ENODATA.
> + * If @value is empty and a matching xattr object is found we delete it if
> + * XATTR_REPLACE is specified in @flags or @flags is zero.
>   *
> - * Returns 0 on success, -errno on failure.
> + * If @value is empty and no matching xattr object for @name is found we do
> + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> + * XATTR_REPLACE we fail as mentioned above.
> + *
> + * Return: On success zero and on error a negative error code is returned.
>   */
>  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		     const void *value, size_t size, int flags,
>  		     ssize_t *removed_size)
>  {
> -	struct simple_xattr *xattr;
> -	struct simple_xattr *new_xattr = NULL;
> +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> +	struct rb_node *parent = NULL, **rbp;
>  	int err = 0;
>  
>  	if (removed_size)
> @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		}
>  	}
>  
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (!strcmp(name, xattr->name)) {
> -			if (flags & XATTR_CREATE) {
> -				xattr = new_xattr;
> -				err = -EEXIST;
> -			} else if (new_xattr) {
> -				list_replace(&xattr->list, &new_xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			} else {
> -				list_del(&xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			}
> -			goto out;
> -		}
> -	}
> -	if (flags & XATTR_REPLACE) {
> -		xattr = new_xattr;
> -		err = -ENODATA;
> -	} else {
> -		list_add(&new_xattr->list, &xattrs->head);
> +	write_seqlock(&xattrs->lock);
> +	rbp = &xattrs->rb_root.rb_node;
> +	while (*rbp) {
> +		parent = *rbp;
> +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> +		if (strcmp(xattr->name, name) < 0)
> +			rbp = &(*rbp)->rb_left;
> +		else if (strcmp(xattr->name, name) > 0)
> +			rbp = &(*rbp)->rb_right;
> +		else
> +			break;
>  		xattr = NULL;
>  	}
> -out:
> -	spin_unlock(&xattrs->lock);
> +
>  	if (xattr) {
> -		kfree(xattr->name);
> -		kvfree(xattr);
> +		/* Fail if XATTR_CREATE is requested and the xattr exists. */
> +		if (flags & XATTR_CREATE) {
> +			err = -EEXIST;
> +			goto out_unlock;
> +		}
> +
> +		if (new_xattr)
> +			rb_replace_node_rcu(&xattr->rb_node,
> +					    &new_xattr->rb_node,
> +					    &xattrs->rb_root);
> +		else
> +			rb_erase(&xattr->rb_node, &xattrs->rb_root);

Is rb_erase() RCU-reader-safe?  It is not immediately obvious to me that it
is.  I would expect an rcu_assign_pointer() or three in there somewhere.

> +		if (!err && removed_size)
> +			*removed_size = xattr->size;
> +	} else {
> +		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
> +		if (flags & XATTR_REPLACE) {
> +			err = -ENODATA;
> +			goto out_unlock;
> +		}
> +
> +		/*
> +		 * If XATTR_CREATE or no flags are specified together with a
> +		 * new value simply insert it.
> +		 */
> +		if (new_xattr) {
> +			rb_link_node_rcu(&new_xattr->rb_node, parent, rbp);
> +			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> +		}
> +
> +		/*
> +		 * If XATTR_CREATE or no flags are specified and neither an old
> +		 * or new xattr were found/exist then we don't need to do
> +		 * anything.
> +		 */

As before, some of this looks like it should be in the rbtree implementation.
On the other hand, and add-or-replace function like this might be rare.  And
there are only two other occurrences, both of which look quite specialize.

So probably no consolidation just yet, anyway.

>  	}
> +
> +out_unlock:
> +	write_sequnlock(&xattrs->lock);
> +	if (err)
> +		free_simple_xattr(new_xattr);
> +	else
> +		put_simple_xattr_rcu(xattr);
>  	return err;
>  
>  }
> @@ -1134,14 +1258,31 @@ static int xattr_list_one(char **buffer, ssize_t *remaining_size,
>  	return 0;
>  }
>  
> -/*
> - * xattr LIST operation for in-memory/pseudo filesystems
> +/**
> + * simple_xattr_list - list all xattr objects
> + * @inode: inode from which to get the xattrs
> + * @xattrs: the header of the xattr object
> + * @buffer: the buffer to store all xattrs into
> + * @size: the size of @buffer
> + *
> + * List all xattrs associated with @inode. If @buffer is NULL we returned the
> + * required size of the buffer. If @buffer is provided we store the xattrs
> + * value into it provided it is big enough.
> + *
> + * Note, the number of xattr names that can be listed with listxattr(2) is
> + * limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> + * vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are found
> + * it will return -E2BIG.
> + *
> + * Return: On success the required size or the size of the copied xattrs is
> + * returned. On error a negative error code is returned.
>   */
>  ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
>  			  char *buffer, size_t size)
>  {
>  	bool trusted = capable(CAP_SYS_ADMIN);
>  	struct simple_xattr *xattr;
> +	struct rb_node *rbp;
>  	ssize_t remaining_size = size;
>  	int err = 0;
>  
> @@ -1162,8 +1303,10 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
>  	}
>  #endif
>  
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> +	read_seqlock_excl(&xattrs->lock);

This excludes writers, which allows the non-RCU-safe code to work correctly.

So this should be OK.

> +	for (rbp = rb_first(&xattrs->rb_root); rbp; rbp = rb_next(rbp)) {
> +		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> +
>  		/* skip "trusted." attributes for unprivileged callers */
>  		if (!trusted && xattr_is_trusted(xattr->name))
>  			continue;
> @@ -1172,18 +1315,61 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
>  		if (err)
>  			break;
>  	}
> -	spin_unlock(&xattrs->lock);
> +	read_sequnlock_excl(&xattrs->lock);
>  
>  	return err ? err : size - remaining_size;
>  }
>  
> -/*
> - * Adds an extended attribute to the list
> +/**
> + * simple_xattr_add - add xattr objects
> + * @xattrs: the header of the xattr object
> + * @new_xattr: the xattr object to add
> + *
> + * Add an xattr object to @xattrs. This assumes no replacement or removal of
> + * matching xattrs is wanted.
>   */
> -void simple_xattr_list_add(struct simple_xattrs *xattrs,
> -			   struct simple_xattr *new_xattr)
> +void simple_xattr_add(struct simple_xattrs *xattrs,
> +		      struct simple_xattr *new_xattr)
>  {
> -	spin_lock(&xattrs->lock);
> -	list_add(&new_xattr->list, &xattrs->head);
> -	spin_unlock(&xattrs->lock);
> +	write_seqlock(&xattrs->lock);
> +	rb_link_node_rcu(&new_xattr->rb_node, xattrs->rb_root.rb_node,
> +			 &xattrs->rb_root.rb_node);
> +	rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> +	write_sequnlock(&xattrs->lock);
> +}

I freely confess that I am not immediately seeing how this one fits in.
Presumably its caller has already found the right place in the tree?
Except that the caller isn't holding xattrs->lock.

> +
> +/**
> + * simple_xattr_init - initialize new xattr header
> + * @xattrs: header to initialize
> + *
> + * Initialize relevant fields of a an xattr header.
> + */
> +void simple_xattrs_init(struct simple_xattrs *xattrs)
> +{
> +	seqlock_init(&xattrs->lock);
> +	xattrs->rb_root = RB_ROOT;
> +}
> +
> +/**
> + * simple_xattrs_free - free xattrs
> + * @xattrs: xattr header whose xattrs to destroy
> + *
> + * Destroy all xattrs in @xattr. When this is called no one can hold a
> + * reference to any of the xattrs anymore.

As in anyone who might hold a reference is long gone, correct?

							Thanx, Paul

> + */
> +void simple_xattrs_free(struct simple_xattrs *xattrs)
> +{
> +	struct rb_node *rbp;
> +
> +	rbp = rb_first(&xattrs->rb_root);
> +	while (rbp) {
> +		struct simple_xattr *xattr;
> +		struct rb_node *rbp_next;
> +
> +		rbp_next = rb_next(rbp);
> +		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> +		rb_erase(&xattr->rb_node, &xattrs->rb_root);
> +		free_simple_xattr(xattr);
> +		rbp = rbp_next;
> +	}
>  }
> diff --git a/include/linux/xattr.h b/include/linux/xattr.h
> index 4c379d23ec6e..8c01e622ad92 100644
> --- a/include/linux/xattr.h
> +++ b/include/linux/xattr.h
> @@ -80,48 +80,30 @@ static inline const char *xattr_prefix(const struct xattr_handler *handler)
>  }
>  
>  struct simple_xattrs {
> -	struct list_head head;
> -	spinlock_t lock;
> +	struct rb_root rb_root;
> +	seqlock_t lock;
>  };
>  
>  struct simple_xattr {
> -	struct list_head list;
> +	struct rb_node rb_node;
> +	struct rcu_head rcu;
> +	refcount_t ref;
>  	char *name;
>  	size_t size;
>  	char value[];
>  };
>  
> -/*
> - * initialize the simple_xattrs structure
> - */
> -static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
> -{
> -	INIT_LIST_HEAD(&xattrs->head);
> -	spin_lock_init(&xattrs->lock);
> -}
> -
> -/*
> - * free all the xattrs
> - */
> -static inline void simple_xattrs_free(struct simple_xattrs *xattrs)
> -{
> -	struct simple_xattr *xattr, *node;
> -
> -	list_for_each_entry_safe(xattr, node, &xattrs->head, list) {
> -		kfree(xattr->name);
> -		kvfree(xattr);
> -	}
> -}
> -
> +void simple_xattrs_init(struct simple_xattrs *xattrs);
> +void simple_xattrs_free(struct simple_xattrs *xattrs);
>  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size);
>  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
>  		     void *buffer, size_t size);
>  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		     const void *value, size_t size, int flags,
>  		     ssize_t *removed_size);
> -ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs, char *buffer,
> -			  size_t size);
> -void simple_xattr_list_add(struct simple_xattrs *xattrs,
> -			   struct simple_xattr *new_xattr);
> +ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> +			  char *buffer, size_t size);
> +void simple_xattr_add(struct simple_xattrs *xattrs,
> +		      struct simple_xattr *new_xattr);
>  
>  #endif	/* _LINUX_XATTR_H */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8280a5cb48df..2872e6607b2c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3255,7 +3255,7 @@ static int shmem_initxattrs(struct inode *inode,
>  		memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
>  		       xattr->name, len);
>  
> -		simple_xattr_list_add(&info->xattrs, new_xattr);
> +		simple_xattr_add(&info->xattrs, new_xattr);
>  	}
>  
>  	return 0;
> 
> base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] xattr: use rbtree for simple_xattrs
  2022-11-08 11:41 [PATCH v2] xattr: use rbtree for simple_xattrs Christian Brauner
  2022-11-08 18:59 ` Paul E. McKenney
@ 2022-11-08 19:08 ` Roman Gushchin
  2022-11-09 11:16   ` Christian Brauner
  1 sibling, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2022-11-08 19:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, linux-fsdevel, Vasily Averin, Hugh Dickins,
	Seth Forshee, Stéphane Graber, Paul E. McKenney, Al Viro

On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> A while ago Vasily reported that it is possible to set a large number of
> xattrs on inodes of filesystems that make use of the simple xattr
> infrastructure. This includes all kernfs-based filesystems that support
> xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> mounted by unprivileged users in unprivileged containers and root in an
> unprivileged container can set an unrestricted number of security.*
> xattrs and privileged users can also set unlimited trusted.* xattrs. As
> there are apparently users that have a fairly large number of xattrs we
> should scale a bit better. Other xattrs such as user.* are restricted
> for kernfs-based instances to a fairly limited number.
> 
> Using a simple linked list protected by a spinlock used for set, get,
> and list operations doesn't scale well if users use a lot of xattrs even
> if it's not a crazy number. And There's no need to bring in the big guns
> like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> limited rcu semantics is enough.
> 
> It scales within the constraints we are working in. By far the most
> common operations is getting an xattr. The get operation is optimized to
> be lock free as long as there are no writers. The list operation takes
> the read lock and protects against concurrent writers while allowing
> lockless get operations. Locking out other listxattr callers isn't a
> huge deal since listing xattrs is mostly relevant when copying a file or
> copying all xattrs between files.
> 
> Additionally, listxattr() doesn't list the values of xattrs it can only
> be used to list the names of all xattrs set on a file. And the number of
> xattr names that can be listed with listxattr() is limited to
> XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> found it will return -EFBIG. In short, the maximum amount of memory that
> can be retrieved via listxattr() is limited.
> 
> Of course, the API is broken as documented on xattr(7) already. In the
> future we might want to address this but for now this is the world we
> live in and have lived for a long time. But it does indeed mean that
> once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> inode it isn't possible to copy the file and include its xattrs in the
> copy unless the caller knows all xattrs or limits the copy of the xattrs
> to important ones it knows by name (At least for tmpfs, and kernfs-based
> filesystems. Other filesystems might provide ways of achieving this.).
> 
> Also add proper kernel documentation to all the functions.
> A big thanks to Paul for his comments.
> 
> Cc: Vasily Averin <vvs@openvz.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Hi Christian!

This looks fancy, maybe even too fancy :)
I learned from reading your code.

Is there any specific usecase when we care that much about xattr
performance?

I've nothing against the lockless approach, but an rb-tree protected
by a rwlock/semaphore would be probably much simpler and still a great
improvement over the status quo.

Some small nits below.

> ---
> 
> Notes:
>     In addition to this patch I would like to propose that we restrict the number
>     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
>     other words, we restrict the number of xattrs for simple xattr filesystems to
>     the number of xattrs names that can be retrieved via listxattr(). That should
>     be about 2000 to 3000 xattrs per inode which is more than enough. We should
>     risk this and see if we get any regression reports from userswith this
>     approach.
>     
>     This should be as simple as adding a max_list member to struct simple_xattrs
>     and initialize it with XATTR_MAX_LIST. Set operations would then check against
>     this field whether the new xattr they are trying to set will fit and return
>     -EFBIG otherwise. I think that might be a good approach to get rid of the in
>     principle unbounded number of xattrs that can be set via the simple xattr
>     infrastructure. I think this is a regression risk worth taking.
>     
>     /* v2 */
>     Christian Brauner <brauner@kernel.org>:
>     - Fix kernel doc.
>     - Remove accidental leftover union from previous iteration.
> 
>  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
>  include/linux/xattr.h |  40 ++---
>  mm/shmem.c            |   2 +-
>  3 files changed, 270 insertions(+), 102 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 61107b6bbed2..f18454161d54 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
>  }
>  EXPORT_SYMBOL(xattr_full_name);
>  
> -/*
> - * Allocate new xattr and copy in the value; but leave the name to callers.
> +/**
> + * free_simple_xattr - free an xattr object
> + * @xattr: the xattr object
> + *
> + * Free the xattr object. Can handle @xattr being NULL.
> + */
> +static inline void free_simple_xattr(struct simple_xattr *xattr)
> +{
> +	if (xattr)
> +		kfree(xattr->name);
> +	kvfree(xattr);
> +}
> +
> +/**
> + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> + * @cb: the rcu callback head
> + */
> +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> +{
> +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> +}
> +
> +/**
> + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> + * @xattr: the xattr object
> + */
> +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> +{
> +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> +}
> +
> +/**
> + * put_simple_xattr_rcu - decrement refcount for xattr object
> + * @xattr: the xattr object
> + *
> + * Decrement the reference count of an xattr object and free it using rcu
> + * semantics if we're the holder of the last reference. Can handle @xattr being
> + * NULL.
> + */
> +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> +{
> +	if (xattr && refcount_dec_and_test(&xattr->ref))
> +		free_simple_xattr_rcu(xattr);
> +}
> +
> +/**
> + * simple_xattr_alloc - allocate new xattr object
> + * @value: value of the xattr object
> + * @size: size of @value
> + *
> + * Allocate a new xattr object and initialize respective members. The caller is
> + * responsible for handling the name of the xattr.
> + *
> + * The initial reference count belongs to the rbtree.
> + *
> + * Return: On success a new xattr object is returned. On failure NULL is
> + * returned.
>   */
>  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  {
> @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  
>  	new_xattr->size = size;
>  	memcpy(new_xattr->value, value, size);
> +	refcount_set(&new_xattr->ref, 1);

Better to use REFCOUNT_INIT() here to save an atomic operation.

>  	return new_xattr;
>  }
>  
> -/*
> - * xattr GET operation for in-memory/pseudo filesystems
> +/**
> + * simple_xattr_get - get an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @buffer: the buffer to store the value into
> + * @size: the size of @buffer
> + *
> + * Try to find and retrieve the xattr object associated with @name. If the
> + * object is found and still in the rbtree bump the reference count.
> + *
> + * If @buffer is provided store the value of @xattr in @buffer.
> + *
> + * Return: On success zero and on error a negative error code is returned.
>   */
>  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
>  		     void *buffer, size_t size)
>  {
> -	struct simple_xattr *xattr;
> -	int ret = -ENODATA;
> -
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (strcmp(name, xattr->name))
> -			continue;
> -
> -		ret = xattr->size;
> -		if (buffer) {
> -			if (size < xattr->size)
> -				ret = -ERANGE;
> -			else
> -				memcpy(buffer, xattr->value, xattr->size);
> +	struct simple_xattr *xattr = NULL;
> +	struct rb_node *rbp;
> +	int ret, seq = 0;
> +
> +	rcu_read_lock();
> +	do {
> +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> +		while (rbp) {
> +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> +			if (strcmp(xattr->name, name) < 0) {
> +				rbp = rcu_dereference(rbp->rb_left);
> +			} else if (strcmp(xattr->name, name) > 0) {
> +				rbp = rcu_dereference(rbp->rb_right);

strcmp() is potentially called twice with same arguments.

> +			} else {
> +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> +					xattr = NULL;
> +				break;
> +			}
> +			xattr = NULL;
>  		}
> -		break;

If there are many xattrs attached and there is a writer who constantly
changes something, won't readers loop here for a potentially very long time?

> +	} while (need_seqretry(&xattrs->lock, seq));
> +	done_seqretry(&xattrs->lock, seq);
> +	rcu_read_unlock();
> +
> +	if (!xattr)
> +		return -ENODATA;
> +
> +	ret = xattr->size;
> +	if (buffer) {
> +		if (size < xattr->size)
> +			ret = -ERANGE;
> +		else
> +			memcpy(buffer, xattr->value, xattr->size);
>  	}
> -	spin_unlock(&xattrs->lock);
> +
> +	put_simple_xattr_rcu(xattr);
>  	return ret;
>  }
>  
>  /**
> - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> - * @xattrs: target simple_xattr list
> - * @name: name of the extended attribute
> - * @value: value of the xattr. If %NULL, will remove the attribute.
> - * @size: size of the new xattr
> - * @flags: %XATTR_{CREATE|REPLACE}
> - * @removed_size: returns size of the removed xattr, -1 if none removed
> + * simple_xattr_set - set an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @value: the value to store along the xattr
> + * @size: the size of @value
> + * @flags: the flags determining how to set the xattr
> + * @removed_size: the size of the removed xattr
> + *
> + * Set a new xattr object.
> + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> + * is specified in @flags a matching xattr object for @name must already exist.
> + * If it does it will be replace with the new xattr object. If it doesn't we
> + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> + * insert the new xattr replacing any existing one.
>   *
> - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> - * otherwise, fails with -ENODATA.
> + * If @value is empty and a matching xattr object is found we delete it if
> + * XATTR_REPLACE is specified in @flags or @flags is zero.
>   *
> - * Returns 0 on success, -errno on failure.
> + * If @value is empty and no matching xattr object for @name is found we do
> + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> + * XATTR_REPLACE we fail as mentioned above.
> + *
> + * Return: On success zero and on error a negative error code is returned.
>   */
>  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		     const void *value, size_t size, int flags,
>  		     ssize_t *removed_size)
>  {
> -	struct simple_xattr *xattr;
> -	struct simple_xattr *new_xattr = NULL;
> +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> +	struct rb_node *parent = NULL, **rbp;
>  	int err = 0;
>  
>  	if (removed_size)
> @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		}
>  	}
>  
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (!strcmp(name, xattr->name)) {
> -			if (flags & XATTR_CREATE) {
> -				xattr = new_xattr;
> -				err = -EEXIST;
> -			} else if (new_xattr) {
> -				list_replace(&xattr->list, &new_xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			} else {
> -				list_del(&xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			}
> -			goto out;
> -		}
> -	}
> -	if (flags & XATTR_REPLACE) {
> -		xattr = new_xattr;
> -		err = -ENODATA;
> -	} else {
> -		list_add(&new_xattr->list, &xattrs->head);
> +	write_seqlock(&xattrs->lock);
> +	rbp = &xattrs->rb_root.rb_node;
> +	while (*rbp) {
> +		parent = *rbp;
> +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> +		if (strcmp(xattr->name, name) < 0)
> +			rbp = &(*rbp)->rb_left;
> +		else if (strcmp(xattr->name, name) > 0)

No reason to call strcmp() twice.

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] xattr: use rbtree for simple_xattrs
  2022-11-08 18:59 ` Paul E. McKenney
@ 2022-11-09  9:51   ` Christian Brauner
  2022-11-11  8:13     ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Brauner @ 2022-11-09  9:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, linux-fsdevel, Vasily Averin, Hugh Dickins,
	Seth Forshee, Stéphane Graber, Roman Gushchin, Al Viro

On Tue, Nov 08, 2022 at 10:59:04AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> > A while ago Vasily reported that it is possible to set a large number of
> > xattrs on inodes of filesystems that make use of the simple xattr
> > infrastructure. This includes all kernfs-based filesystems that support
> > xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> > mounted by unprivileged users in unprivileged containers and root in an
> > unprivileged container can set an unrestricted number of security.*
> > xattrs and privileged users can also set unlimited trusted.* xattrs. As
> > there are apparently users that have a fairly large number of xattrs we
> > should scale a bit better. Other xattrs such as user.* are restricted
> > for kernfs-based instances to a fairly limited number.
> > 
> > Using a simple linked list protected by a spinlock used for set, get,
> > and list operations doesn't scale well if users use a lot of xattrs even
> > if it's not a crazy number. And There's no need to bring in the big guns
> > like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> > limited rcu semantics is enough.
> > 
> > It scales within the constraints we are working in. By far the most
> > common operations is getting an xattr. The get operation is optimized to
> > be lock free as long as there are no writers. The list operation takes
> > the read lock and protects against concurrent writers while allowing
> > lockless get operations. Locking out other listxattr callers isn't a
> > huge deal since listing xattrs is mostly relevant when copying a file or
> > copying all xattrs between files.
> > 
> > Additionally, listxattr() doesn't list the values of xattrs it can only
> > be used to list the names of all xattrs set on a file. And the number of
> > xattr names that can be listed with listxattr() is limited to
> > XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> > found it will return -EFBIG. In short, the maximum amount of memory that
> > can be retrieved via listxattr() is limited.
> > 
> > Of course, the API is broken as documented on xattr(7) already. In the
> > future we might want to address this but for now this is the world we
> > live in and have lived for a long time. But it does indeed mean that
> > once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> > inode it isn't possible to copy the file and include its xattrs in the
> > copy unless the caller knows all xattrs or limits the copy of the xattrs
> > to important ones it knows by name (At least for tmpfs, and kernfs-based
> > filesystems. Other filesystems might provide ways of achieving this.).
> > 
> > Also add proper kernel documentation to all the functions.
> > A big thanks to Paul for his comments.
> > 
> > Cc: Vasily Averin <vvs@openvz.org>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> 
> Looks mostly plausible from an RCU viewpoint, but there are a few
> questions/comments inline below.
> 
> 							Thanx, Paul
> 
> > ---
> > 
> > Notes:
> >     In addition to this patch I would like to propose that we restrict the number
> >     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
> >     other words, we restrict the number of xattrs for simple xattr filesystems to
> >     the number of xattrs names that can be retrieved via listxattr(). That should
> >     be about 2000 to 3000 xattrs per inode which is more than enough. We should
> >     risk this and see if we get any regression reports from userswith this
> >     approach.
> >     
> >     This should be as simple as adding a max_list member to struct simple_xattrs
> >     and initialize it with XATTR_MAX_LIST. Set operations would then check against
> >     this field whether the new xattr they are trying to set will fit and return
> >     -EFBIG otherwise. I think that might be a good approach to get rid of the in
> >     principle unbounded number of xattrs that can be set via the simple xattr
> >     infrastructure. I think this is a regression risk worth taking.
> >     
> >     /* v2 */
> >     Christian Brauner <brauner@kernel.org>:
> >     - Fix kernel doc.
> >     - Remove accidental leftover union from previous iteration.
> > 
> >  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
> >  include/linux/xattr.h |  40 ++---
> >  mm/shmem.c            |   2 +-
> >  3 files changed, 270 insertions(+), 102 deletions(-)
> > 
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 61107b6bbed2..f18454161d54 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
> >  }
> >  EXPORT_SYMBOL(xattr_full_name);
> >  
> > -/*
> > - * Allocate new xattr and copy in the value; but leave the name to callers.
> > +/**
> > + * free_simple_xattr - free an xattr object
> > + * @xattr: the xattr object
> > + *
> > + * Free the xattr object. Can handle @xattr being NULL.
> > + */
> > +static inline void free_simple_xattr(struct simple_xattr *xattr)
> > +{
> > +	if (xattr)
> > +		kfree(xattr->name);
> > +	kvfree(xattr);
> > +}
> > +
> > +/**
> > + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> > + * @cb: the rcu callback head
> > + */
> > +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> > +{
> > +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> > +}
> > +
> > +/**
> > + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> > + * @xattr: the xattr object
> > + */
> > +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> > +{
> > +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> > +}
> > +
> > +/**
> > + * put_simple_xattr_rcu - decrement refcount for xattr object
> > + * @xattr: the xattr object
> > + *
> > + * Decrement the reference count of an xattr object and free it using rcu
> > + * semantics if we're the holder of the last reference. Can handle @xattr being
> > + * NULL.
> > + */
> > +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> > +{
> > +	if (xattr && refcount_dec_and_test(&xattr->ref))
> > +		free_simple_xattr_rcu(xattr);
> > +}
> 
> Looks like the standard combined reference counter and RCU combination,
> goog!
> 
> > +
> > +/**
> > + * simple_xattr_alloc - allocate new xattr object
> > + * @value: value of the xattr object
> > + * @size: size of @value
> > + *
> > + * Allocate a new xattr object and initialize respective members. The caller is
> > + * responsible for handling the name of the xattr.
> > + *
> > + * The initial reference count belongs to the rbtree.
> > + *
> > + * Return: On success a new xattr object is returned. On failure NULL is
> > + * returned.
> >   */
> >  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> >  {
> > @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> >  
> >  	new_xattr->size = size;
> >  	memcpy(new_xattr->value, value, size);
> > +	refcount_set(&new_xattr->ref, 1);
> 
> Yes, one is usually needed for the link in the tree.
> 
> >  	return new_xattr;
> >  }
> >  
> > -/*
> > - * xattr GET operation for in-memory/pseudo filesystems
> > +/**
> > + * simple_xattr_get - get an xattr object
> > + * @xattrs: the header of the xattr object
> > + * @name: the name of the xattr to retrieve
> > + * @buffer: the buffer to store the value into
> > + * @size: the size of @buffer
> > + *
> > + * Try to find and retrieve the xattr object associated with @name. If the
> > + * object is found and still in the rbtree bump the reference count.
> > + *
> > + * If @buffer is provided store the value of @xattr in @buffer.
> > + *
> > + * Return: On success zero and on error a negative error code is returned.
> >   */
> >  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> >  		     void *buffer, size_t size)
> >  {
> > -	struct simple_xattr *xattr;
> > -	int ret = -ENODATA;
> > -
> > -	spin_lock(&xattrs->lock);
> > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > -		if (strcmp(name, xattr->name))
> > -			continue;
> > -
> > -		ret = xattr->size;
> > -		if (buffer) {
> > -			if (size < xattr->size)
> > -				ret = -ERANGE;
> > -			else
> > -				memcpy(buffer, xattr->value, xattr->size);
> > +	struct simple_xattr *xattr = NULL;
> > +	struct rb_node *rbp;
> > +	int ret, seq = 0;
> > +
> > +	rcu_read_lock();
> > +	do {
> > +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> 
> It might be necessary to try a few times before grabbing the lock, but
> perhaps we should actually hit the problem before increasing complexity.
> 
> > +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> > +		while (rbp) {
> > +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > +			if (strcmp(xattr->name, name) < 0) {
> > +				rbp = rcu_dereference(rbp->rb_left);
> > +			} else if (strcmp(xattr->name, name) > 0) {
> > +				rbp = rcu_dereference(rbp->rb_right);
> > +			} else {
> > +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> > +					xattr = NULL;
> > +				break;
> > +			}
> > +			xattr = NULL;
> >  		}
> 
> Maybe this is too specialized, but should this be in the rbtree code,
> perhaps rb_find_first_rcu(), but with an appropriate cmp() function?
> The refcount_inc_not_zero() clearly needs to be in the caller.
> 
> If this is the only instance of this sort of code, it is likely not
> really worthwhile.  But if we have several of these open coded, it would
> be good to consolidate that code.

There's more than one instance of this pattern for sure.

> 
> > -		break;
> > +	} while (need_seqretry(&xattrs->lock, seq));
> > +	done_seqretry(&xattrs->lock, seq);
> > +	rcu_read_unlock();
> > +
> > +	if (!xattr)
> > +		return -ENODATA;
> > +
> > +	ret = xattr->size;
> > +	if (buffer) {
> > +		if (size < xattr->size)
> > +			ret = -ERANGE;
> > +		else
> > +			memcpy(buffer, xattr->value, xattr->size);
> 
> If all we are doing is copying to an in-kernel buffer, why not dispense
> with xattr->ref and do the memcpy() under rcu_read_lock() protection?
> This would avoid some overhead from reference-count cache misses.
> 
> Of course, if that memcpy() can page fault or some such, then what
> you have is necessary.  But this is all in-kernel, right?  And if not,
> shouldn't the pointers be decorated with __user or some such?

This is just a regular in-kernel memcpy(). Good idea dispensing with the
refcount completely.

> 
> >  	}
> > -	spin_unlock(&xattrs->lock);
> > +
> > +	put_simple_xattr_rcu(xattr);
> >  	return ret;
> >  }
> >  
> >  /**
> > - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> > - * @xattrs: target simple_xattr list
> > - * @name: name of the extended attribute
> > - * @value: value of the xattr. If %NULL, will remove the attribute.
> > - * @size: size of the new xattr
> > - * @flags: %XATTR_{CREATE|REPLACE}
> > - * @removed_size: returns size of the removed xattr, -1 if none removed
> > + * simple_xattr_set - set an xattr object
> > + * @xattrs: the header of the xattr object
> > + * @name: the name of the xattr to retrieve
> > + * @value: the value to store along the xattr
> > + * @size: the size of @value
> > + * @flags: the flags determining how to set the xattr
> > + * @removed_size: the size of the removed xattr
> > + *
> > + * Set a new xattr object.
> > + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> > + * is specified in @flags a matching xattr object for @name must already exist.
> > + * If it does it will be replace with the new xattr object. If it doesn't we
> > + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> > + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> > + * insert the new xattr replacing any existing one.
> >   *
> > - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> > - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> > - * otherwise, fails with -ENODATA.
> > + * If @value is empty and a matching xattr object is found we delete it if
> > + * XATTR_REPLACE is specified in @flags or @flags is zero.
> >   *
> > - * Returns 0 on success, -errno on failure.
> > + * If @value is empty and no matching xattr object for @name is found we do
> > + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> > + * XATTR_REPLACE we fail as mentioned above.
> > + *
> > + * Return: On success zero and on error a negative error code is returned.
> >   */
> >  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> >  		     const void *value, size_t size, int flags,
> >  		     ssize_t *removed_size)
> >  {
> > -	struct simple_xattr *xattr;
> > -	struct simple_xattr *new_xattr = NULL;
> > +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> > +	struct rb_node *parent = NULL, **rbp;
> >  	int err = 0;
> >  
> >  	if (removed_size)
> > @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> >  		}
> >  	}
> >  
> > -	spin_lock(&xattrs->lock);
> > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > -		if (!strcmp(name, xattr->name)) {
> > -			if (flags & XATTR_CREATE) {
> > -				xattr = new_xattr;
> > -				err = -EEXIST;
> > -			} else if (new_xattr) {
> > -				list_replace(&xattr->list, &new_xattr->list);
> > -				if (removed_size)
> > -					*removed_size = xattr->size;
> > -			} else {
> > -				list_del(&xattr->list);
> > -				if (removed_size)
> > -					*removed_size = xattr->size;
> > -			}
> > -			goto out;
> > -		}
> > -	}
> > -	if (flags & XATTR_REPLACE) {
> > -		xattr = new_xattr;
> > -		err = -ENODATA;
> > -	} else {
> > -		list_add(&new_xattr->list, &xattrs->head);
> > +	write_seqlock(&xattrs->lock);
> > +	rbp = &xattrs->rb_root.rb_node;
> > +	while (*rbp) {
> > +		parent = *rbp;
> > +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> > +		if (strcmp(xattr->name, name) < 0)
> > +			rbp = &(*rbp)->rb_left;
> > +		else if (strcmp(xattr->name, name) > 0)
> > +			rbp = &(*rbp)->rb_right;
> > +		else
> > +			break;
> >  		xattr = NULL;
> >  	}
> > -out:
> > -	spin_unlock(&xattrs->lock);
> > +
> >  	if (xattr) {
> > -		kfree(xattr->name);
> > -		kvfree(xattr);
> > +		/* Fail if XATTR_CREATE is requested and the xattr exists. */
> > +		if (flags & XATTR_CREATE) {
> > +			err = -EEXIST;
> > +			goto out_unlock;
> > +		}
> > +
> > +		if (new_xattr)
> > +			rb_replace_node_rcu(&xattr->rb_node,
> > +					    &new_xattr->rb_node,
> > +					    &xattrs->rb_root);
> > +		else
> > +			rb_erase(&xattr->rb_node, &xattrs->rb_root);
> 
> Is rb_erase() RCU-reader-safe?  It is not immediately obvious to me that it
> is.  I would expect an rcu_assign_pointer() or three in there somewhere.

This question send me down an interesting rabbit hole which is
(unironically) excellent. Afaiu, yes rb_erase() isn't rcu safe. And our
rbtree implementation in general isn't rcu safe in the face of
rebalances (I think you linked to a paper that illustrates how one would
need to go about doing this though.). So all users deal with that using
the seq+rcu-rbtree pattern. I trust you'll correct me if I'm wrong: The
assumption of current users seems to be that searching the rbtree under
the rcu lock - i.e., without having taken the read seqlock - is safe
against oops or iow against corrupt pointers if a writer changes the
rbtree during the walk. The seqlock is supposed to prevent garbage/stale
results. At least all users of this pattern seem to rely on the fact
that a rcu only walk cannot oops. Is that assumption safe?

Another point that popped up discussing this was that there's an
unlikely attack scenario that David (Howells) made me aware of. While
walking the rbtree under rcu only without the read seqlock it is
theoretically possible for the rbtree to keep being expanded and cause
the rcu walk to "never finish". Though I'm unsure whether that's really
a practically possible attack.

Very interested to hear your thoughts here!

> 
> > +		if (!err && removed_size)
> > +			*removed_size = xattr->size;
> > +	} else {
> > +		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
> > +		if (flags & XATTR_REPLACE) {
> > +			err = -ENODATA;
> > +			goto out_unlock;
> > +		}
> > +
> > +		/*
> > +		 * If XATTR_CREATE or no flags are specified together with a
> > +		 * new value simply insert it.
> > +		 */
> > +		if (new_xattr) {
> > +			rb_link_node_rcu(&new_xattr->rb_node, parent, rbp);
> > +			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> > +		}
> > +
> > +		/*
> > +		 * If XATTR_CREATE or no flags are specified and neither an old
> > +		 * or new xattr were found/exist then we don't need to do
> > +		 * anything.
> > +		 */
> 
> As before, some of this looks like it should be in the rbtree implementation.
> On the other hand, and add-or-replace function like this might be rare.  And
> there are only two other occurrences, both of which look quite specialize.

If you think the current rcu search pattern is sufficiently safe even
against rb_erase() then we might want to give the two callers/users at
least rb_find_first_rcu() directly in rbtree.h which you suggested
further above.

> 
> So probably no consolidation just yet, anyway.
> 
> >  	}
> > +
> > +out_unlock:
> > +	write_sequnlock(&xattrs->lock);
> > +	if (err)
> > +		free_simple_xattr(new_xattr);
> > +	else
> > +		put_simple_xattr_rcu(xattr);
> >  	return err;
> >  
> >  }
> > @@ -1134,14 +1258,31 @@ static int xattr_list_one(char **buffer, ssize_t *remaining_size,
> >  	return 0;
> >  }
> >  
> > -/*
> > - * xattr LIST operation for in-memory/pseudo filesystems
> > +/**
> > + * simple_xattr_list - list all xattr objects
> > + * @inode: inode from which to get the xattrs
> > + * @xattrs: the header of the xattr object
> > + * @buffer: the buffer to store all xattrs into
> > + * @size: the size of @buffer
> > + *
> > + * List all xattrs associated with @inode. If @buffer is NULL we returned the
> > + * required size of the buffer. If @buffer is provided we store the xattrs
> > + * value into it provided it is big enough.
> > + *
> > + * Note, the number of xattr names that can be listed with listxattr(2) is
> > + * limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > + * vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are found
> > + * it will return -E2BIG.
> > + *
> > + * Return: On success the required size or the size of the copied xattrs is
> > + * returned. On error a negative error code is returned.
> >   */
> >  ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> >  			  char *buffer, size_t size)
> >  {
> >  	bool trusted = capable(CAP_SYS_ADMIN);
> >  	struct simple_xattr *xattr;
> > +	struct rb_node *rbp;
> >  	ssize_t remaining_size = size;
> >  	int err = 0;
> >  
> > @@ -1162,8 +1303,10 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> >  	}
> >  #endif
> >  
> > -	spin_lock(&xattrs->lock);
> > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > +	read_seqlock_excl(&xattrs->lock);
> 
> This excludes writers, which allows the non-RCU-safe code to work correctly.
> 
> So this should be OK.
> 
> > +	for (rbp = rb_first(&xattrs->rb_root); rbp; rbp = rb_next(rbp)) {
> > +		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > +
> >  		/* skip "trusted." attributes for unprivileged callers */
> >  		if (!trusted && xattr_is_trusted(xattr->name))
> >  			continue;
> > @@ -1172,18 +1315,61 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> >  		if (err)
> >  			break;
> >  	}
> > -	spin_unlock(&xattrs->lock);
> > +	read_sequnlock_excl(&xattrs->lock);
> >  
> >  	return err ? err : size - remaining_size;
> >  }
> >  
> > -/*
> > - * Adds an extended attribute to the list
> > +/**
> > + * simple_xattr_add - add xattr objects
> > + * @xattrs: the header of the xattr object
> > + * @new_xattr: the xattr object to add
> > + *
> > + * Add an xattr object to @xattrs. This assumes no replacement or removal of
> > + * matching xattrs is wanted.
> >   */
> > -void simple_xattr_list_add(struct simple_xattrs *xattrs,
> > -			   struct simple_xattr *new_xattr)
> > +void simple_xattr_add(struct simple_xattrs *xattrs,
> > +		      struct simple_xattr *new_xattr)
> >  {
> > -	spin_lock(&xattrs->lock);
> > -	list_add(&new_xattr->list, &xattrs->head);
> > -	spin_unlock(&xattrs->lock);
> > +	write_seqlock(&xattrs->lock);
> > +	rb_link_node_rcu(&new_xattr->rb_node, xattrs->rb_root.rb_node,
> > +			 &xattrs->rb_root.rb_node);
> > +	rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> > +	write_sequnlock(&xattrs->lock);
> > +}
> 
> I freely confess that I am not immediately seeing how this one fits in.
> Presumably its caller has already found the right place in the tree?
> Except that the caller isn't holding xattrs->lock.

This is a bug. Thanks for spotting this. 

> 
> > +
> > +/**
> > + * simple_xattr_init - initialize new xattr header
> > + * @xattrs: header to initialize
> > + *
> > + * Initialize relevant fields of a an xattr header.
> > + */
> > +void simple_xattrs_init(struct simple_xattrs *xattrs)
> > +{
> > +	seqlock_init(&xattrs->lock);
> > +	xattrs->rb_root = RB_ROOT;
> > +}
> > +
> > +/**
> > + * simple_xattrs_free - free xattrs
> > + * @xattrs: xattr header whose xattrs to destroy
> > + *
> > + * Destroy all xattrs in @xattr. When this is called no one can hold a
> > + * reference to any of the xattrs anymore.
> 
> As in anyone who might hold a reference is long gone, correct?

Yes, when this is called we're freeing the inode. No one can retrieve or
set or list xattrs for that inode anymore at that point. IOW, the last
fd must've been closed and the inode been removed (via unlink()/rmdir()
or sm).

Thank you for commenting and providing you insights!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] xattr: use rbtree for simple_xattrs
  2022-11-08 19:08 ` Roman Gushchin
@ 2022-11-09 11:16   ` Christian Brauner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2022-11-09 11:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, linux-fsdevel, Vasily Averin, Hugh Dickins,
	Seth Forshee, Stéphane Graber, Paul E. McKenney, Al Viro

On Tue, Nov 08, 2022 at 11:08:18AM -0800, Roman Gushchin wrote:
> On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> > A while ago Vasily reported that it is possible to set a large number of
> > xattrs on inodes of filesystems that make use of the simple xattr
> > infrastructure. This includes all kernfs-based filesystems that support
> > xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> > mounted by unprivileged users in unprivileged containers and root in an
> > unprivileged container can set an unrestricted number of security.*
> > xattrs and privileged users can also set unlimited trusted.* xattrs. As
> > there are apparently users that have a fairly large number of xattrs we
> > should scale a bit better. Other xattrs such as user.* are restricted
> > for kernfs-based instances to a fairly limited number.
> > 
> > Using a simple linked list protected by a spinlock used for set, get,
> > and list operations doesn't scale well if users use a lot of xattrs even
> > if it's not a crazy number. And There's no need to bring in the big guns
> > like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> > limited rcu semantics is enough.
> > 
> > It scales within the constraints we are working in. By far the most
> > common operations is getting an xattr. The get operation is optimized to
> > be lock free as long as there are no writers. The list operation takes
> > the read lock and protects against concurrent writers while allowing
> > lockless get operations. Locking out other listxattr callers isn't a
> > huge deal since listing xattrs is mostly relevant when copying a file or
> > copying all xattrs between files.
> > 
> > Additionally, listxattr() doesn't list the values of xattrs it can only
> > be used to list the names of all xattrs set on a file. And the number of
> > xattr names that can be listed with listxattr() is limited to
> > XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> > found it will return -EFBIG. In short, the maximum amount of memory that
> > can be retrieved via listxattr() is limited.
> > 
> > Of course, the API is broken as documented on xattr(7) already. In the
> > future we might want to address this but for now this is the world we
> > live in and have lived for a long time. But it does indeed mean that
> > once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> > inode it isn't possible to copy the file and include its xattrs in the
> > copy unless the caller knows all xattrs or limits the copy of the xattrs
> > to important ones it knows by name (At least for tmpfs, and kernfs-based
> > filesystems. Other filesystems might provide ways of achieving this.).
> > 
> > Also add proper kernel documentation to all the functions.
> > A big thanks to Paul for his comments.
> > 
> > Cc: Vasily Averin <vvs@openvz.org>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> 
> Hi Christian!
> 
> This looks fancy, maybe even too fancy :)
> I learned from reading your code.
> 
> Is there any specific usecase when we care that much about xattr
> performance?

Yeah, there are users with loads of xattrs (such as containers) and the
single spinlock searching linearly through a possibly long linked list
is a bit of a problem.

> 
> I've nothing against the lockless approach, but an rb-tree protected
> by a rwlock/semaphore would be probably much simpler and still a great
> improvement over the status quo.

I have nothing against this approach and actually wondered about this
the whole time (Let's just say I went a bit overboard and may or may not
also have workable rhashtable implementation and a draft for an
rw_semaphore this patch just to get a feel what it would look like and
what the tradeoff would be).

In any case, my main goal was to make the get operation fast as it's the
one that will be hit most often. listxattr() is of secondary concern as
it's used infrequently and also bounded by XATTR_LIST_MAX.

> 
> Some small nits below.
> 
> > ---
> > 
> > Notes:
> >     In addition to this patch I would like to propose that we restrict the number
> >     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
> >     other words, we restrict the number of xattrs for simple xattr filesystems to
> >     the number of xattrs names that can be retrieved via listxattr(). That should
> >     be about 2000 to 3000 xattrs per inode which is more than enough. We should
> >     risk this and see if we get any regression reports from userswith this
> >     approach.
> >     
> >     This should be as simple as adding a max_list member to struct simple_xattrs
> >     and initialize it with XATTR_MAX_LIST. Set operations would then check against
> >     this field whether the new xattr they are trying to set will fit and return
> >     -EFBIG otherwise. I think that might be a good approach to get rid of the in
> >     principle unbounded number of xattrs that can be set via the simple xattr
> >     infrastructure. I think this is a regression risk worth taking.
> >     
> >     /* v2 */
> >     Christian Brauner <brauner@kernel.org>:
> >     - Fix kernel doc.
> >     - Remove accidental leftover union from previous iteration.
> > 
> >  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
> >  include/linux/xattr.h |  40 ++---
> >  mm/shmem.c            |   2 +-
> >  3 files changed, 270 insertions(+), 102 deletions(-)
> > 
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 61107b6bbed2..f18454161d54 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
> >  }
> >  EXPORT_SYMBOL(xattr_full_name);
> >  
> > -/*
> > - * Allocate new xattr and copy in the value; but leave the name to callers.
> > +/**
> > + * free_simple_xattr - free an xattr object
> > + * @xattr: the xattr object
> > + *
> > + * Free the xattr object. Can handle @xattr being NULL.
> > + */
> > +static inline void free_simple_xattr(struct simple_xattr *xattr)
> > +{
> > +	if (xattr)
> > +		kfree(xattr->name);
> > +	kvfree(xattr);
> > +}
> > +
> > +/**
> > + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> > + * @cb: the rcu callback head
> > + */
> > +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> > +{
> > +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> > +}
> > +
> > +/**
> > + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> > + * @xattr: the xattr object
> > + */
> > +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> > +{
> > +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> > +}
> > +
> > +/**
> > + * put_simple_xattr_rcu - decrement refcount for xattr object
> > + * @xattr: the xattr object
> > + *
> > + * Decrement the reference count of an xattr object and free it using rcu
> > + * semantics if we're the holder of the last reference. Can handle @xattr being
> > + * NULL.
> > + */
> > +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> > +{
> > +	if (xattr && refcount_dec_and_test(&xattr->ref))
> > +		free_simple_xattr_rcu(xattr);
> > +}
> > +
> > +/**
> > + * simple_xattr_alloc - allocate new xattr object
> > + * @value: value of the xattr object
> > + * @size: size of @value
> > + *
> > + * Allocate a new xattr object and initialize respective members. The caller is
> > + * responsible for handling the name of the xattr.
> > + *
> > + * The initial reference count belongs to the rbtree.
> > + *
> > + * Return: On success a new xattr object is returned. On failure NULL is
> > + * returned.
> >   */
> >  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> >  {
> > @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> >  
> >  	new_xattr->size = size;
> >  	memcpy(new_xattr->value, value, size);
> > +	refcount_set(&new_xattr->ref, 1);
> 
> Better to use REFCOUNT_INIT() here to save an atomic operation.
> 
> >  	return new_xattr;
> >  }
> >  
> > -/*
> > - * xattr GET operation for in-memory/pseudo filesystems
> > +/**
> > + * simple_xattr_get - get an xattr object
> > + * @xattrs: the header of the xattr object
> > + * @name: the name of the xattr to retrieve
> > + * @buffer: the buffer to store the value into
> > + * @size: the size of @buffer
> > + *
> > + * Try to find and retrieve the xattr object associated with @name. If the
> > + * object is found and still in the rbtree bump the reference count.
> > + *
> > + * If @buffer is provided store the value of @xattr in @buffer.
> > + *
> > + * Return: On success zero and on error a negative error code is returned.
> >   */
> >  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> >  		     void *buffer, size_t size)
> >  {
> > -	struct simple_xattr *xattr;
> > -	int ret = -ENODATA;
> > -
> > -	spin_lock(&xattrs->lock);
> > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > -		if (strcmp(name, xattr->name))
> > -			continue;
> > -
> > -		ret = xattr->size;
> > -		if (buffer) {
> > -			if (size < xattr->size)
> > -				ret = -ERANGE;
> > -			else
> > -				memcpy(buffer, xattr->value, xattr->size);
> > +	struct simple_xattr *xattr = NULL;
> > +	struct rb_node *rbp;
> > +	int ret, seq = 0;
> > +
> > +	rcu_read_lock();
> > +	do {
> > +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> > +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> > +		while (rbp) {
> > +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > +			if (strcmp(xattr->name, name) < 0) {
> > +				rbp = rcu_dereference(rbp->rb_left);
> > +			} else if (strcmp(xattr->name, name) > 0) {
> > +				rbp = rcu_dereference(rbp->rb_right);
> 
> strcmp() is potentially called twice with same arguments.

Yeah, good point.

> 
> > +			} else {
> > +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> > +					xattr = NULL;
> > +				break;
> > +			}
> > +			xattr = NULL;
> >  		}
> > -		break;
> 
> If there are many xattrs attached and there is a writer who constantly
> changes something, won't readers loop here for a potentially very long time?

See my response to Paul where I pondered that possible attack though in
practice I find this rather difficult to imagine. The attack also gets
increasingly harder as the tree grows. So in practice I don't think it
matters. But interesting observation!

> 
> > +	} while (need_seqretry(&xattrs->lock, seq));
> > +	done_seqretry(&xattrs->lock, seq);
> > +	rcu_read_unlock();
> > +
> > +	if (!xattr)
> > +		return -ENODATA;
> > +
> > +	ret = xattr->size;
> > +	if (buffer) {
> > +		if (size < xattr->size)
> > +			ret = -ERANGE;
> > +		else
> > +			memcpy(buffer, xattr->value, xattr->size);
> >  	}
> > -	spin_unlock(&xattrs->lock);
> > +
> > +	put_simple_xattr_rcu(xattr);
> >  	return ret;
> >  }
> >  
> >  /**
> > - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> > - * @xattrs: target simple_xattr list
> > - * @name: name of the extended attribute
> > - * @value: value of the xattr. If %NULL, will remove the attribute.
> > - * @size: size of the new xattr
> > - * @flags: %XATTR_{CREATE|REPLACE}
> > - * @removed_size: returns size of the removed xattr, -1 if none removed
> > + * simple_xattr_set - set an xattr object
> > + * @xattrs: the header of the xattr object
> > + * @name: the name of the xattr to retrieve
> > + * @value: the value to store along the xattr
> > + * @size: the size of @value
> > + * @flags: the flags determining how to set the xattr
> > + * @removed_size: the size of the removed xattr
> > + *
> > + * Set a new xattr object.
> > + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> > + * is specified in @flags a matching xattr object for @name must already exist.
> > + * If it does it will be replace with the new xattr object. If it doesn't we
> > + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> > + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> > + * insert the new xattr replacing any existing one.
> >   *
> > - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> > - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> > - * otherwise, fails with -ENODATA.
> > + * If @value is empty and a matching xattr object is found we delete it if
> > + * XATTR_REPLACE is specified in @flags or @flags is zero.
> >   *
> > - * Returns 0 on success, -errno on failure.
> > + * If @value is empty and no matching xattr object for @name is found we do
> > + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> > + * XATTR_REPLACE we fail as mentioned above.
> > + *
> > + * Return: On success zero and on error a negative error code is returned.
> >   */
> >  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> >  		     const void *value, size_t size, int flags,
> >  		     ssize_t *removed_size)
> >  {
> > -	struct simple_xattr *xattr;
> > -	struct simple_xattr *new_xattr = NULL;
> > +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> > +	struct rb_node *parent = NULL, **rbp;
> >  	int err = 0;
> >  
> >  	if (removed_size)
> > @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> >  		}
> >  	}
> >  
> > -	spin_lock(&xattrs->lock);
> > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > -		if (!strcmp(name, xattr->name)) {
> > -			if (flags & XATTR_CREATE) {
> > -				xattr = new_xattr;
> > -				err = -EEXIST;
> > -			} else if (new_xattr) {
> > -				list_replace(&xattr->list, &new_xattr->list);
> > -				if (removed_size)
> > -					*removed_size = xattr->size;
> > -			} else {
> > -				list_del(&xattr->list);
> > -				if (removed_size)
> > -					*removed_size = xattr->size;
> > -			}
> > -			goto out;
> > -		}
> > -	}
> > -	if (flags & XATTR_REPLACE) {
> > -		xattr = new_xattr;
> > -		err = -ENODATA;
> > -	} else {
> > -		list_add(&new_xattr->list, &xattrs->head);
> > +	write_seqlock(&xattrs->lock);
> > +	rbp = &xattrs->rb_root.rb_node;
> > +	while (*rbp) {
> > +		parent = *rbp;
> > +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> > +		if (strcmp(xattr->name, name) < 0)
> > +			rbp = &(*rbp)->rb_left;
> > +		else if (strcmp(xattr->name, name) > 0)
> 
> No reason to call strcmp() twice.

Good point.

Thank your for your comments!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] xattr: use rbtree for simple_xattrs
  2022-11-09  9:51   ` Christian Brauner
@ 2022-11-11  8:13     ` Paul E. McKenney
  2022-11-11 10:51       ` Christian Brauner
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2022-11-11  8:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, linux-fsdevel, Vasily Averin, Hugh Dickins,
	Seth Forshee, Stéphane Graber, Roman Gushchin, Al Viro

On Wed, Nov 09, 2022 at 10:51:52AM +0100, Christian Brauner wrote:
> On Tue, Nov 08, 2022 at 10:59:04AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> > > A while ago Vasily reported that it is possible to set a large number of
> > > xattrs on inodes of filesystems that make use of the simple xattr
> > > infrastructure. This includes all kernfs-based filesystems that support
> > > xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> > > mounted by unprivileged users in unprivileged containers and root in an
> > > unprivileged container can set an unrestricted number of security.*
> > > xattrs and privileged users can also set unlimited trusted.* xattrs. As
> > > there are apparently users that have a fairly large number of xattrs we
> > > should scale a bit better. Other xattrs such as user.* are restricted
> > > for kernfs-based instances to a fairly limited number.
> > > 
> > > Using a simple linked list protected by a spinlock used for set, get,
> > > and list operations doesn't scale well if users use a lot of xattrs even
> > > if it's not a crazy number. And There's no need to bring in the big guns
> > > like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> > > limited rcu semantics is enough.
> > > 
> > > It scales within the constraints we are working in. By far the most
> > > common operations is getting an xattr. The get operation is optimized to
> > > be lock free as long as there are no writers. The list operation takes
> > > the read lock and protects against concurrent writers while allowing
> > > lockless get operations. Locking out other listxattr callers isn't a
> > > huge deal since listing xattrs is mostly relevant when copying a file or
> > > copying all xattrs between files.
> > > 
> > > Additionally, listxattr() doesn't list the values of xattrs it can only
> > > be used to list the names of all xattrs set on a file. And the number of
> > > xattr names that can be listed with listxattr() is limited to
> > > XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > > vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> > > found it will return -EFBIG. In short, the maximum amount of memory that
> > > can be retrieved via listxattr() is limited.
> > > 
> > > Of course, the API is broken as documented on xattr(7) already. In the
> > > future we might want to address this but for now this is the world we
> > > live in and have lived for a long time. But it does indeed mean that
> > > once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> > > inode it isn't possible to copy the file and include its xattrs in the
> > > copy unless the caller knows all xattrs or limits the copy of the xattrs
> > > to important ones it knows by name (At least for tmpfs, and kernfs-based
> > > filesystems. Other filesystems might provide ways of achieving this.).
> > > 
> > > Also add proper kernel documentation to all the functions.
> > > A big thanks to Paul for his comments.
> > > 
> > > Cc: Vasily Averin <vvs@openvz.org>
> > > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > > Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> > 
> > Looks mostly plausible from an RCU viewpoint, but there are a few
> > questions/comments inline below.
> > 
> > 							Thanx, Paul
> > 
> > > ---
> > > 
> > > Notes:
> > >     In addition to this patch I would like to propose that we restrict the number
> > >     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
> > >     other words, we restrict the number of xattrs for simple xattr filesystems to
> > >     the number of xattrs names that can be retrieved via listxattr(). That should
> > >     be about 2000 to 3000 xattrs per inode which is more than enough. We should
> > >     risk this and see if we get any regression reports from userswith this
> > >     approach.
> > >     
> > >     This should be as simple as adding a max_list member to struct simple_xattrs
> > >     and initialize it with XATTR_MAX_LIST. Set operations would then check against
> > >     this field whether the new xattr they are trying to set will fit and return
> > >     -EFBIG otherwise. I think that might be a good approach to get rid of the in
> > >     principle unbounded number of xattrs that can be set via the simple xattr
> > >     infrastructure. I think this is a regression risk worth taking.
> > >     
> > >     /* v2 */
> > >     Christian Brauner <brauner@kernel.org>:
> > >     - Fix kernel doc.
> > >     - Remove accidental leftover union from previous iteration.
> > > 
> > >  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
> > >  include/linux/xattr.h |  40 ++---
> > >  mm/shmem.c            |   2 +-
> > >  3 files changed, 270 insertions(+), 102 deletions(-)
> > > 
> > > diff --git a/fs/xattr.c b/fs/xattr.c
> > > index 61107b6bbed2..f18454161d54 100644
> > > --- a/fs/xattr.c
> > > +++ b/fs/xattr.c
> > > @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
> > >  }
> > >  EXPORT_SYMBOL(xattr_full_name);
> > >  
> > > -/*
> > > - * Allocate new xattr and copy in the value; but leave the name to callers.
> > > +/**
> > > + * free_simple_xattr - free an xattr object
> > > + * @xattr: the xattr object
> > > + *
> > > + * Free the xattr object. Can handle @xattr being NULL.
> > > + */
> > > +static inline void free_simple_xattr(struct simple_xattr *xattr)
> > > +{
> > > +	if (xattr)
> > > +		kfree(xattr->name);
> > > +	kvfree(xattr);
> > > +}
> > > +
> > > +/**
> > > + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> > > + * @cb: the rcu callback head
> > > + */
> > > +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> > > +{
> > > +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> > > +}
> > > +
> > > +/**
> > > + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> > > + * @xattr: the xattr object
> > > + */
> > > +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> > > +{
> > > +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> > > +}
> > > +
> > > +/**
> > > + * put_simple_xattr_rcu - decrement refcount for xattr object
> > > + * @xattr: the xattr object
> > > + *
> > > + * Decrement the reference count of an xattr object and free it using rcu
> > > + * semantics if we're the holder of the last reference. Can handle @xattr being
> > > + * NULL.
> > > + */
> > > +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> > > +{
> > > +	if (xattr && refcount_dec_and_test(&xattr->ref))
> > > +		free_simple_xattr_rcu(xattr);
> > > +}
> > 
> > Looks like the standard combined reference counter and RCU combination,
> > goog!
> > 
> > > +
> > > +/**
> > > + * simple_xattr_alloc - allocate new xattr object
> > > + * @value: value of the xattr object
> > > + * @size: size of @value
> > > + *
> > > + * Allocate a new xattr object and initialize respective members. The caller is
> > > + * responsible for handling the name of the xattr.
> > > + *
> > > + * The initial reference count belongs to the rbtree.
> > > + *
> > > + * Return: On success a new xattr object is returned. On failure NULL is
> > > + * returned.
> > >   */
> > >  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > >  {
> > > @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > >  
> > >  	new_xattr->size = size;
> > >  	memcpy(new_xattr->value, value, size);
> > > +	refcount_set(&new_xattr->ref, 1);
> > 
> > Yes, one is usually needed for the link in the tree.
> > 
> > >  	return new_xattr;
> > >  }
> > >  
> > > -/*
> > > - * xattr GET operation for in-memory/pseudo filesystems
> > > +/**
> > > + * simple_xattr_get - get an xattr object
> > > + * @xattrs: the header of the xattr object
> > > + * @name: the name of the xattr to retrieve
> > > + * @buffer: the buffer to store the value into
> > > + * @size: the size of @buffer
> > > + *
> > > + * Try to find and retrieve the xattr object associated with @name. If the
> > > + * object is found and still in the rbtree bump the reference count.
> > > + *
> > > + * If @buffer is provided store the value of @xattr in @buffer.
> > > + *
> > > + * Return: On success zero and on error a negative error code is returned.
> > >   */
> > >  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> > >  		     void *buffer, size_t size)
> > >  {
> > > -	struct simple_xattr *xattr;
> > > -	int ret = -ENODATA;
> > > -
> > > -	spin_lock(&xattrs->lock);
> > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > -		if (strcmp(name, xattr->name))
> > > -			continue;
> > > -
> > > -		ret = xattr->size;
> > > -		if (buffer) {
> > > -			if (size < xattr->size)
> > > -				ret = -ERANGE;
> > > -			else
> > > -				memcpy(buffer, xattr->value, xattr->size);
> > > +	struct simple_xattr *xattr = NULL;
> > > +	struct rb_node *rbp;
> > > +	int ret, seq = 0;
> > > +
> > > +	rcu_read_lock();
> > > +	do {
> > > +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> > 
> > It might be necessary to try a few times before grabbing the lock, but
> > perhaps we should actually hit the problem before increasing complexity.
> > 
> > > +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> > > +		while (rbp) {
> > > +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > > +			if (strcmp(xattr->name, name) < 0) {
> > > +				rbp = rcu_dereference(rbp->rb_left);
> > > +			} else if (strcmp(xattr->name, name) > 0) {
> > > +				rbp = rcu_dereference(rbp->rb_right);
> > > +			} else {
> > > +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> > > +					xattr = NULL;
> > > +				break;
> > > +			}
> > > +			xattr = NULL;
> > >  		}
> > 
> > Maybe this is too specialized, but should this be in the rbtree code,
> > perhaps rb_find_first_rcu(), but with an appropriate cmp() function?
> > The refcount_inc_not_zero() clearly needs to be in the caller.
> > 
> > If this is the only instance of this sort of code, it is likely not
> > really worthwhile.  But if we have several of these open coded, it would
> > be good to consolidate that code.
> 
> There's more than one instance of this pattern for sure.

OK, might be time to consolidate, then.  ;-)

> > > -		break;
> > > +	} while (need_seqretry(&xattrs->lock, seq));
> > > +	done_seqretry(&xattrs->lock, seq);
> > > +	rcu_read_unlock();
> > > +
> > > +	if (!xattr)
> > > +		return -ENODATA;
> > > +
> > > +	ret = xattr->size;
> > > +	if (buffer) {
> > > +		if (size < xattr->size)
> > > +			ret = -ERANGE;
> > > +		else
> > > +			memcpy(buffer, xattr->value, xattr->size);
> > 
> > If all we are doing is copying to an in-kernel buffer, why not dispense
> > with xattr->ref and do the memcpy() under rcu_read_lock() protection?
> > This would avoid some overhead from reference-count cache misses.
> > 
> > Of course, if that memcpy() can page fault or some such, then what
> > you have is necessary.  But this is all in-kernel, right?  And if not,
> > shouldn't the pointers be decorated with __user or some such?
> 
> This is just a regular in-kernel memcpy(). Good idea dispensing with the
> refcount completely.

Sounds good!

> > >  	}
> > > -	spin_unlock(&xattrs->lock);
> > > +
> > > +	put_simple_xattr_rcu(xattr);
> > >  	return ret;
> > >  }
> > >  
> > >  /**
> > > - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> > > - * @xattrs: target simple_xattr list
> > > - * @name: name of the extended attribute
> > > - * @value: value of the xattr. If %NULL, will remove the attribute.
> > > - * @size: size of the new xattr
> > > - * @flags: %XATTR_{CREATE|REPLACE}
> > > - * @removed_size: returns size of the removed xattr, -1 if none removed
> > > + * simple_xattr_set - set an xattr object
> > > + * @xattrs: the header of the xattr object
> > > + * @name: the name of the xattr to retrieve
> > > + * @value: the value to store along the xattr
> > > + * @size: the size of @value
> > > + * @flags: the flags determining how to set the xattr
> > > + * @removed_size: the size of the removed xattr
> > > + *
> > > + * Set a new xattr object.
> > > + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> > > + * is specified in @flags a matching xattr object for @name must already exist.
> > > + * If it does it will be replace with the new xattr object. If it doesn't we
> > > + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> > > + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> > > + * insert the new xattr replacing any existing one.
> > >   *
> > > - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> > > - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> > > - * otherwise, fails with -ENODATA.
> > > + * If @value is empty and a matching xattr object is found we delete it if
> > > + * XATTR_REPLACE is specified in @flags or @flags is zero.
> > >   *
> > > - * Returns 0 on success, -errno on failure.
> > > + * If @value is empty and no matching xattr object for @name is found we do
> > > + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> > > + * XATTR_REPLACE we fail as mentioned above.
> > > + *
> > > + * Return: On success zero and on error a negative error code is returned.
> > >   */
> > >  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > >  		     const void *value, size_t size, int flags,
> > >  		     ssize_t *removed_size)
> > >  {
> > > -	struct simple_xattr *xattr;
> > > -	struct simple_xattr *new_xattr = NULL;
> > > +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> > > +	struct rb_node *parent = NULL, **rbp;
> > >  	int err = 0;
> > >  
> > >  	if (removed_size)
> > > @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > >  		}
> > >  	}
> > >  
> > > -	spin_lock(&xattrs->lock);
> > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > -		if (!strcmp(name, xattr->name)) {
> > > -			if (flags & XATTR_CREATE) {
> > > -				xattr = new_xattr;
> > > -				err = -EEXIST;
> > > -			} else if (new_xattr) {
> > > -				list_replace(&xattr->list, &new_xattr->list);
> > > -				if (removed_size)
> > > -					*removed_size = xattr->size;
> > > -			} else {
> > > -				list_del(&xattr->list);
> > > -				if (removed_size)
> > > -					*removed_size = xattr->size;
> > > -			}
> > > -			goto out;
> > > -		}
> > > -	}
> > > -	if (flags & XATTR_REPLACE) {
> > > -		xattr = new_xattr;
> > > -		err = -ENODATA;
> > > -	} else {
> > > -		list_add(&new_xattr->list, &xattrs->head);
> > > +	write_seqlock(&xattrs->lock);
> > > +	rbp = &xattrs->rb_root.rb_node;
> > > +	while (*rbp) {
> > > +		parent = *rbp;
> > > +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> > > +		if (strcmp(xattr->name, name) < 0)
> > > +			rbp = &(*rbp)->rb_left;
> > > +		else if (strcmp(xattr->name, name) > 0)
> > > +			rbp = &(*rbp)->rb_right;
> > > +		else
> > > +			break;
> > >  		xattr = NULL;
> > >  	}
> > > -out:
> > > -	spin_unlock(&xattrs->lock);
> > > +
> > >  	if (xattr) {
> > > -		kfree(xattr->name);
> > > -		kvfree(xattr);
> > > +		/* Fail if XATTR_CREATE is requested and the xattr exists. */
> > > +		if (flags & XATTR_CREATE) {
> > > +			err = -EEXIST;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if (new_xattr)
> > > +			rb_replace_node_rcu(&xattr->rb_node,
> > > +					    &new_xattr->rb_node,
> > > +					    &xattrs->rb_root);
> > > +		else
> > > +			rb_erase(&xattr->rb_node, &xattrs->rb_root);
> > 
> > Is rb_erase() RCU-reader-safe?  It is not immediately obvious to me that it
> > is.  I would expect an rcu_assign_pointer() or three in there somewhere.
> 
> This question send me down an interesting rabbit hole which is
> (unironically) excellent. Afaiu, yes rb_erase() isn't rcu safe. And our
> rbtree implementation in general isn't rcu safe in the face of
> rebalances (I think you linked to a paper that illustrates how one would
> need to go about doing this though.). So all users deal with that using
> the seq+rcu-rbtree pattern. I trust you'll correct me if I'm wrong: The
> assumption of current users seems to be that searching the rbtree under
> the rcu lock - i.e., without having taken the read seqlock - is safe
> against oops or iow against corrupt pointers if a writer changes the
> rbtree during the walk. The seqlock is supposed to prevent garbage/stale
> results. At least all users of this pattern seem to rely on the fact
> that a rcu only walk cannot oops. Is that assumption safe?

Kind of, maybe?

To be fully safe, you need to check the seqlock during the actual tree
traversal.  Otherwise, an unfortunate series of rebalances or rotations
or whatever could potentially run a given reader forever in circles.
Though the odds of this actually continuing for any length of time are
of course extremely low.

And unmarked accesses to shared variables having conflicting concurrent
accesses is playing with fire, though one hopes that stores of pointers
would end up wsing single machine instructions.  In theory, there are
a number of ways that compiler optimizations could cause trouble.

> Another point that popped up discussing this was that there's an
> unlikely attack scenario that David (Howells) made me aware of. While
> walking the rbtree under rcu only without the read seqlock it is
> theoretically possible for the rbtree to keep being expanded and cause
> the rcu walk to "never finish". Though I'm unsure whether that's really
> a practically possible attack.

Perhaps David's scenario is similar to my eternal rebalance?

> Very interested to hear your thoughts here!

I would of course feel better if there were fewer vulnerabilities.

> > > +		if (!err && removed_size)
> > > +			*removed_size = xattr->size;
> > > +	} else {
> > > +		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
> > > +		if (flags & XATTR_REPLACE) {
> > > +			err = -ENODATA;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		/*
> > > +		 * If XATTR_CREATE or no flags are specified together with a
> > > +		 * new value simply insert it.
> > > +		 */
> > > +		if (new_xattr) {
> > > +			rb_link_node_rcu(&new_xattr->rb_node, parent, rbp);
> > > +			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> > > +		}
> > > +
> > > +		/*
> > > +		 * If XATTR_CREATE or no flags are specified and neither an old
> > > +		 * or new xattr were found/exist then we don't need to do
> > > +		 * anything.
> > > +		 */
> > 
> > As before, some of this looks like it should be in the rbtree implementation.
> > On the other hand, and add-or-replace function like this might be rare.  And
> > there are only two other occurrences, both of which look quite specialize.
> 
> If you think the current rcu search pattern is sufficiently safe even
> against rb_erase() then we might want to give the two callers/users at
> least rb_find_first_rcu() directly in rbtree.h which you suggested
> further above.

If we are going to have RCU-protected rbtrees, it might be good for the
rbtree code to know that it is supposed to handle RCU traversals.  ;-)

> > So probably no consolidation just yet, anyway.
> > 
> > >  	}
> > > +
> > > +out_unlock:
> > > +	write_sequnlock(&xattrs->lock);
> > > +	if (err)
> > > +		free_simple_xattr(new_xattr);
> > > +	else
> > > +		put_simple_xattr_rcu(xattr);
> > >  	return err;
> > >  
> > >  }
> > > @@ -1134,14 +1258,31 @@ static int xattr_list_one(char **buffer, ssize_t *remaining_size,
> > >  	return 0;
> > >  }
> > >  
> > > -/*
> > > - * xattr LIST operation for in-memory/pseudo filesystems
> > > +/**
> > > + * simple_xattr_list - list all xattr objects
> > > + * @inode: inode from which to get the xattrs
> > > + * @xattrs: the header of the xattr object
> > > + * @buffer: the buffer to store all xattrs into
> > > + * @size: the size of @buffer
> > > + *
> > > + * List all xattrs associated with @inode. If @buffer is NULL we returned the
> > > + * required size of the buffer. If @buffer is provided we store the xattrs
> > > + * value into it provided it is big enough.
> > > + *
> > > + * Note, the number of xattr names that can be listed with listxattr(2) is
> > > + * limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > > + * vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are found
> > > + * it will return -E2BIG.
> > > + *
> > > + * Return: On success the required size or the size of the copied xattrs is
> > > + * returned. On error a negative error code is returned.
> > >   */
> > >  ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> > >  			  char *buffer, size_t size)
> > >  {
> > >  	bool trusted = capable(CAP_SYS_ADMIN);
> > >  	struct simple_xattr *xattr;
> > > +	struct rb_node *rbp;
> > >  	ssize_t remaining_size = size;
> > >  	int err = 0;
> > >  
> > > @@ -1162,8 +1303,10 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> > >  	}
> > >  #endif
> > >  
> > > -	spin_lock(&xattrs->lock);
> > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > +	read_seqlock_excl(&xattrs->lock);
> > 
> > This excludes writers, which allows the non-RCU-safe code to work correctly.
> > 
> > So this should be OK.
> > 
> > > +	for (rbp = rb_first(&xattrs->rb_root); rbp; rbp = rb_next(rbp)) {
> > > +		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > > +
> > >  		/* skip "trusted." attributes for unprivileged callers */
> > >  		if (!trusted && xattr_is_trusted(xattr->name))
> > >  			continue;
> > > @@ -1172,18 +1315,61 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> > >  		if (err)
> > >  			break;
> > >  	}
> > > -	spin_unlock(&xattrs->lock);
> > > +	read_sequnlock_excl(&xattrs->lock);
> > >  
> > >  	return err ? err : size - remaining_size;
> > >  }
> > >  
> > > -/*
> > > - * Adds an extended attribute to the list
> > > +/**
> > > + * simple_xattr_add - add xattr objects
> > > + * @xattrs: the header of the xattr object
> > > + * @new_xattr: the xattr object to add
> > > + *
> > > + * Add an xattr object to @xattrs. This assumes no replacement or removal of
> > > + * matching xattrs is wanted.
> > >   */
> > > -void simple_xattr_list_add(struct simple_xattrs *xattrs,
> > > -			   struct simple_xattr *new_xattr)
> > > +void simple_xattr_add(struct simple_xattrs *xattrs,
> > > +		      struct simple_xattr *new_xattr)
> > >  {
> > > -	spin_lock(&xattrs->lock);
> > > -	list_add(&new_xattr->list, &xattrs->head);
> > > -	spin_unlock(&xattrs->lock);
> > > +	write_seqlock(&xattrs->lock);
> > > +	rb_link_node_rcu(&new_xattr->rb_node, xattrs->rb_root.rb_node,
> > > +			 &xattrs->rb_root.rb_node);
> > > +	rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> > > +	write_sequnlock(&xattrs->lock);
> > > +}
> > 
> > I freely confess that I am not immediately seeing how this one fits in.
> > Presumably its caller has already found the right place in the tree?
> > Except that the caller isn't holding xattrs->lock.
> 
> This is a bug. Thanks for spotting this. 

Whew!  ;-)

> > > +
> > > +/**
> > > + * simple_xattr_init - initialize new xattr header
> > > + * @xattrs: header to initialize
> > > + *
> > > + * Initialize relevant fields of a an xattr header.
> > > + */
> > > +void simple_xattrs_init(struct simple_xattrs *xattrs)
> > > +{
> > > +	seqlock_init(&xattrs->lock);
> > > +	xattrs->rb_root = RB_ROOT;
> > > +}
> > > +
> > > +/**
> > > + * simple_xattrs_free - free xattrs
> > > + * @xattrs: xattr header whose xattrs to destroy
> > > + *
> > > + * Destroy all xattrs in @xattr. When this is called no one can hold a
> > > + * reference to any of the xattrs anymore.
> > 
> > As in anyone who might hold a reference is long gone, correct?
> 
> Yes, when this is called we're freeing the inode. No one can retrieve or
> set or list xattrs for that inode anymore at that point. IOW, the last
> fd must've been closed and the inode been removed (via unlink()/rmdir()
> or sm).

OK, I feel better now.  ;-)

> Thank you for commenting and providing you insights!

Please let me know how it goes!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] xattr: use rbtree for simple_xattrs
  2022-11-11  8:13     ` Paul E. McKenney
@ 2022-11-11 10:51       ` Christian Brauner
  2022-11-11 18:45         ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Christian Brauner @ 2022-11-11 10:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, linux-fsdevel, Vasily Averin, Hugh Dickins,
	Seth Forshee, Stéphane Graber, Roman Gushchin, Al Viro

On Fri, Nov 11, 2022 at 12:13:35AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 09, 2022 at 10:51:52AM +0100, Christian Brauner wrote:
> > On Tue, Nov 08, 2022 at 10:59:04AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> > > > A while ago Vasily reported that it is possible to set a large number of
> > > > xattrs on inodes of filesystems that make use of the simple xattr
> > > > infrastructure. This includes all kernfs-based filesystems that support
> > > > xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> > > > mounted by unprivileged users in unprivileged containers and root in an
> > > > unprivileged container can set an unrestricted number of security.*
> > > > xattrs and privileged users can also set unlimited trusted.* xattrs. As
> > > > there are apparently users that have a fairly large number of xattrs we
> > > > should scale a bit better. Other xattrs such as user.* are restricted
> > > > for kernfs-based instances to a fairly limited number.
> > > > 
> > > > Using a simple linked list protected by a spinlock used for set, get,
> > > > and list operations doesn't scale well if users use a lot of xattrs even
> > > > if it's not a crazy number. And There's no need to bring in the big guns
> > > > like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> > > > limited rcu semantics is enough.
> > > > 
> > > > It scales within the constraints we are working in. By far the most
> > > > common operations is getting an xattr. The get operation is optimized to
> > > > be lock free as long as there are no writers. The list operation takes
> > > > the read lock and protects against concurrent writers while allowing
> > > > lockless get operations. Locking out other listxattr callers isn't a
> > > > huge deal since listing xattrs is mostly relevant when copying a file or
> > > > copying all xattrs between files.
> > > > 
> > > > Additionally, listxattr() doesn't list the values of xattrs it can only
> > > > be used to list the names of all xattrs set on a file. And the number of
> > > > xattr names that can be listed with listxattr() is limited to
> > > > XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > > > vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> > > > found it will return -EFBIG. In short, the maximum amount of memory that
> > > > can be retrieved via listxattr() is limited.
> > > > 
> > > > Of course, the API is broken as documented on xattr(7) already. In the
> > > > future we might want to address this but for now this is the world we
> > > > live in and have lived for a long time. But it does indeed mean that
> > > > once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> > > > inode it isn't possible to copy the file and include its xattrs in the
> > > > copy unless the caller knows all xattrs or limits the copy of the xattrs
> > > > to important ones it knows by name (At least for tmpfs, and kernfs-based
> > > > filesystems. Other filesystems might provide ways of achieving this.).
> > > > 
> > > > Also add proper kernel documentation to all the functions.
> > > > A big thanks to Paul for his comments.
> > > > 
> > > > Cc: Vasily Averin <vvs@openvz.org>
> > > > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > > > Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> > > 
> > > Looks mostly plausible from an RCU viewpoint, but there are a few
> > > questions/comments inline below.
> > > 
> > > 							Thanx, Paul
> > > 
> > > > ---
> > > > 
> > > > Notes:
> > > >     In addition to this patch I would like to propose that we restrict the number
> > > >     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
> > > >     other words, we restrict the number of xattrs for simple xattr filesystems to
> > > >     the number of xattrs names that can be retrieved via listxattr(). That should
> > > >     be about 2000 to 3000 xattrs per inode which is more than enough. We should
> > > >     risk this and see if we get any regression reports from userswith this
> > > >     approach.
> > > >     
> > > >     This should be as simple as adding a max_list member to struct simple_xattrs
> > > >     and initialize it with XATTR_MAX_LIST. Set operations would then check against
> > > >     this field whether the new xattr they are trying to set will fit and return
> > > >     -EFBIG otherwise. I think that might be a good approach to get rid of the in
> > > >     principle unbounded number of xattrs that can be set via the simple xattr
> > > >     infrastructure. I think this is a regression risk worth taking.
> > > >     
> > > >     /* v2 */
> > > >     Christian Brauner <brauner@kernel.org>:
> > > >     - Fix kernel doc.
> > > >     - Remove accidental leftover union from previous iteration.
> > > > 
> > > >  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
> > > >  include/linux/xattr.h |  40 ++---
> > > >  mm/shmem.c            |   2 +-
> > > >  3 files changed, 270 insertions(+), 102 deletions(-)
> > > > 
> > > > diff --git a/fs/xattr.c b/fs/xattr.c
> > > > index 61107b6bbed2..f18454161d54 100644
> > > > --- a/fs/xattr.c
> > > > +++ b/fs/xattr.c
> > > > @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
> > > >  }
> > > >  EXPORT_SYMBOL(xattr_full_name);
> > > >  
> > > > -/*
> > > > - * Allocate new xattr and copy in the value; but leave the name to callers.
> > > > +/**
> > > > + * free_simple_xattr - free an xattr object
> > > > + * @xattr: the xattr object
> > > > + *
> > > > + * Free the xattr object. Can handle @xattr being NULL.
> > > > + */
> > > > +static inline void free_simple_xattr(struct simple_xattr *xattr)
> > > > +{
> > > > +	if (xattr)
> > > > +		kfree(xattr->name);
> > > > +	kvfree(xattr);
> > > > +}
> > > > +
> > > > +/**
> > > > + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> > > > + * @cb: the rcu callback head
> > > > + */
> > > > +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> > > > +{
> > > > +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> > > > +}
> > > > +
> > > > +/**
> > > > + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> > > > + * @xattr: the xattr object
> > > > + */
> > > > +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> > > > +{
> > > > +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> > > > +}
> > > > +
> > > > +/**
> > > > + * put_simple_xattr_rcu - decrement refcount for xattr object
> > > > + * @xattr: the xattr object
> > > > + *
> > > > + * Decrement the reference count of an xattr object and free it using rcu
> > > > + * semantics if we're the holder of the last reference. Can handle @xattr being
> > > > + * NULL.
> > > > + */
> > > > +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> > > > +{
> > > > +	if (xattr && refcount_dec_and_test(&xattr->ref))
> > > > +		free_simple_xattr_rcu(xattr);
> > > > +}
> > > 
> > > Looks like the standard combined reference counter and RCU combination,
> > > goog!
> > > 
> > > > +
> > > > +/**
> > > > + * simple_xattr_alloc - allocate new xattr object
> > > > + * @value: value of the xattr object
> > > > + * @size: size of @value
> > > > + *
> > > > + * Allocate a new xattr object and initialize respective members. The caller is
> > > > + * responsible for handling the name of the xattr.
> > > > + *
> > > > + * The initial reference count belongs to the rbtree.
> > > > + *
> > > > + * Return: On success a new xattr object is returned. On failure NULL is
> > > > + * returned.
> > > >   */
> > > >  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > > >  {
> > > > @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > > >  
> > > >  	new_xattr->size = size;
> > > >  	memcpy(new_xattr->value, value, size);
> > > > +	refcount_set(&new_xattr->ref, 1);
> > > 
> > > Yes, one is usually needed for the link in the tree.
> > > 
> > > >  	return new_xattr;
> > > >  }
> > > >  
> > > > -/*
> > > > - * xattr GET operation for in-memory/pseudo filesystems
> > > > +/**
> > > > + * simple_xattr_get - get an xattr object
> > > > + * @xattrs: the header of the xattr object
> > > > + * @name: the name of the xattr to retrieve
> > > > + * @buffer: the buffer to store the value into
> > > > + * @size: the size of @buffer
> > > > + *
> > > > + * Try to find and retrieve the xattr object associated with @name. If the
> > > > + * object is found and still in the rbtree bump the reference count.
> > > > + *
> > > > + * If @buffer is provided store the value of @xattr in @buffer.
> > > > + *
> > > > + * Return: On success zero and on error a negative error code is returned.
> > > >   */
> > > >  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> > > >  		     void *buffer, size_t size)
> > > >  {
> > > > -	struct simple_xattr *xattr;
> > > > -	int ret = -ENODATA;
> > > > -
> > > > -	spin_lock(&xattrs->lock);
> > > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > > -		if (strcmp(name, xattr->name))
> > > > -			continue;
> > > > -
> > > > -		ret = xattr->size;
> > > > -		if (buffer) {
> > > > -			if (size < xattr->size)
> > > > -				ret = -ERANGE;
> > > > -			else
> > > > -				memcpy(buffer, xattr->value, xattr->size);
> > > > +	struct simple_xattr *xattr = NULL;
> > > > +	struct rb_node *rbp;
> > > > +	int ret, seq = 0;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	do {
> > > > +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> > > 
> > > It might be necessary to try a few times before grabbing the lock, but
> > > perhaps we should actually hit the problem before increasing complexity.
> > > 
> > > > +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> > > > +		while (rbp) {
> > > > +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > > > +			if (strcmp(xattr->name, name) < 0) {
> > > > +				rbp = rcu_dereference(rbp->rb_left);
> > > > +			} else if (strcmp(xattr->name, name) > 0) {
> > > > +				rbp = rcu_dereference(rbp->rb_right);
> > > > +			} else {
> > > > +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> > > > +					xattr = NULL;
> > > > +				break;
> > > > +			}
> > > > +			xattr = NULL;
> > > >  		}
> > > 
> > > Maybe this is too specialized, but should this be in the rbtree code,
> > > perhaps rb_find_first_rcu(), but with an appropriate cmp() function?
> > > The refcount_inc_not_zero() clearly needs to be in the caller.
> > > 
> > > If this is the only instance of this sort of code, it is likely not
> > > really worthwhile.  But if we have several of these open coded, it would
> > > be good to consolidate that code.
> > 
> > There's more than one instance of this pattern for sure.
> 
> OK, might be time to consolidate, then.  ;-)
> 
> > > > -		break;
> > > > +	} while (need_seqretry(&xattrs->lock, seq));
> > > > +	done_seqretry(&xattrs->lock, seq);
> > > > +	rcu_read_unlock();
> > > > +
> > > > +	if (!xattr)
> > > > +		return -ENODATA;
> > > > +
> > > > +	ret = xattr->size;
> > > > +	if (buffer) {
> > > > +		if (size < xattr->size)
> > > > +			ret = -ERANGE;
> > > > +		else
> > > > +			memcpy(buffer, xattr->value, xattr->size);
> > > 
> > > If all we are doing is copying to an in-kernel buffer, why not dispense
> > > with xattr->ref and do the memcpy() under rcu_read_lock() protection?
> > > This would avoid some overhead from reference-count cache misses.
> > > 
> > > Of course, if that memcpy() can page fault or some such, then what
> > > you have is necessary.  But this is all in-kernel, right?  And if not,
> > > shouldn't the pointers be decorated with __user or some such?
> > 
> > This is just a regular in-kernel memcpy(). Good idea dispensing with the
> > refcount completely.
> 
> Sounds good!
> 
> > > >  	}
> > > > -	spin_unlock(&xattrs->lock);
> > > > +
> > > > +	put_simple_xattr_rcu(xattr);
> > > >  	return ret;
> > > >  }
> > > >  
> > > >  /**
> > > > - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> > > > - * @xattrs: target simple_xattr list
> > > > - * @name: name of the extended attribute
> > > > - * @value: value of the xattr. If %NULL, will remove the attribute.
> > > > - * @size: size of the new xattr
> > > > - * @flags: %XATTR_{CREATE|REPLACE}
> > > > - * @removed_size: returns size of the removed xattr, -1 if none removed
> > > > + * simple_xattr_set - set an xattr object
> > > > + * @xattrs: the header of the xattr object
> > > > + * @name: the name of the xattr to retrieve
> > > > + * @value: the value to store along the xattr
> > > > + * @size: the size of @value
> > > > + * @flags: the flags determining how to set the xattr
> > > > + * @removed_size: the size of the removed xattr
> > > > + *
> > > > + * Set a new xattr object.
> > > > + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> > > > + * is specified in @flags a matching xattr object for @name must already exist.
> > > > + * If it does it will be replace with the new xattr object. If it doesn't we
> > > > + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> > > > + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> > > > + * insert the new xattr replacing any existing one.
> > > >   *
> > > > - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> > > > - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> > > > - * otherwise, fails with -ENODATA.
> > > > + * If @value is empty and a matching xattr object is found we delete it if
> > > > + * XATTR_REPLACE is specified in @flags or @flags is zero.
> > > >   *
> > > > - * Returns 0 on success, -errno on failure.
> > > > + * If @value is empty and no matching xattr object for @name is found we do
> > > > + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> > > > + * XATTR_REPLACE we fail as mentioned above.
> > > > + *
> > > > + * Return: On success zero and on error a negative error code is returned.
> > > >   */
> > > >  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > > >  		     const void *value, size_t size, int flags,
> > > >  		     ssize_t *removed_size)
> > > >  {
> > > > -	struct simple_xattr *xattr;
> > > > -	struct simple_xattr *new_xattr = NULL;
> > > > +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> > > > +	struct rb_node *parent = NULL, **rbp;
> > > >  	int err = 0;
> > > >  
> > > >  	if (removed_size)
> > > > @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > > >  		}
> > > >  	}
> > > >  
> > > > -	spin_lock(&xattrs->lock);
> > > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > > -		if (!strcmp(name, xattr->name)) {
> > > > -			if (flags & XATTR_CREATE) {
> > > > -				xattr = new_xattr;
> > > > -				err = -EEXIST;
> > > > -			} else if (new_xattr) {
> > > > -				list_replace(&xattr->list, &new_xattr->list);
> > > > -				if (removed_size)
> > > > -					*removed_size = xattr->size;
> > > > -			} else {
> > > > -				list_del(&xattr->list);
> > > > -				if (removed_size)
> > > > -					*removed_size = xattr->size;
> > > > -			}
> > > > -			goto out;
> > > > -		}
> > > > -	}
> > > > -	if (flags & XATTR_REPLACE) {
> > > > -		xattr = new_xattr;
> > > > -		err = -ENODATA;
> > > > -	} else {
> > > > -		list_add(&new_xattr->list, &xattrs->head);
> > > > +	write_seqlock(&xattrs->lock);
> > > > +	rbp = &xattrs->rb_root.rb_node;
> > > > +	while (*rbp) {
> > > > +		parent = *rbp;
> > > > +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> > > > +		if (strcmp(xattr->name, name) < 0)
> > > > +			rbp = &(*rbp)->rb_left;
> > > > +		else if (strcmp(xattr->name, name) > 0)
> > > > +			rbp = &(*rbp)->rb_right;
> > > > +		else
> > > > +			break;
> > > >  		xattr = NULL;
> > > >  	}
> > > > -out:
> > > > -	spin_unlock(&xattrs->lock);
> > > > +
> > > >  	if (xattr) {
> > > > -		kfree(xattr->name);
> > > > -		kvfree(xattr);
> > > > +		/* Fail if XATTR_CREATE is requested and the xattr exists. */
> > > > +		if (flags & XATTR_CREATE) {
> > > > +			err = -EEXIST;
> > > > +			goto out_unlock;
> > > > +		}
> > > > +
> > > > +		if (new_xattr)
> > > > +			rb_replace_node_rcu(&xattr->rb_node,
> > > > +					    &new_xattr->rb_node,
> > > > +					    &xattrs->rb_root);
> > > > +		else
> > > > +			rb_erase(&xattr->rb_node, &xattrs->rb_root);
> > > 
> > > Is rb_erase() RCU-reader-safe?  It is not immediately obvious to me that it
> > > is.  I would expect an rcu_assign_pointer() or three in there somewhere.
> > 
> > This question send me down an interesting rabbit hole which is
> > (unironically) excellent. Afaiu, yes rb_erase() isn't rcu safe. And our
> > rbtree implementation in general isn't rcu safe in the face of
> > rebalances (I think you linked to a paper that illustrates how one would
> > need to go about doing this though.). So all users deal with that using
> > the seq+rcu-rbtree pattern. I trust you'll correct me if I'm wrong: The
> > assumption of current users seems to be that searching the rbtree under
> > the rcu lock - i.e., without having taken the read seqlock - is safe
> > against oops or iow against corrupt pointers if a writer changes the
> > rbtree during the walk. The seqlock is supposed to prevent garbage/stale
> > results. At least all users of this pattern seem to rely on the fact
> > that a rcu only walk cannot oops. Is that assumption safe?
> 
> Kind of, maybe?
> 
> To be fully safe, you need to check the seqlock during the actual tree
> traversal.  Otherwise, an unfortunate series of rebalances or rotations
> or whatever could potentially run a given reader forever in circles.
> Though the odds of this actually continuing for any length of time are
> of course extremely low.
> 
> And unmarked accesses to shared variables having conflicting concurrent
> accesses is playing with fire, though one hopes that stores of pointers
> would end up wsing single machine instructions.  In theory, there are
> a number of ways that compiler optimizations could cause trouble.

Since this is code reachable from unprivileged userspace I'd rather not
play with fire here at all. Even if all of this might turn out to be
just a theoretical possibility.

I think rcu+seq+rbtree would've made for an interesting, sufficiently
scalable solution with still limited complexity. It's not really too
hard to understand and isn't bloated. But given that we have some valid
concerns we should de-fancy-fy and use an rwlock_t. Roman alrady
mentioned this and I had this is an implementation before this one
(together with rhashtable as I've mentioned). I'll send out the patches
shortly. I'm just finishing tests.

> 
> > Another point that popped up discussing this was that there's an
> > unlikely attack scenario that David (Howells) made me aware of. While
> > walking the rbtree under rcu only without the read seqlock it is
> > theoretically possible for the rbtree to keep being expanded and cause
> > the rcu walk to "never finish". Though I'm unsure whether that's really
> > a practically possible attack.
> 
> Perhaps David's scenario is similar to my eternal rebalance?

Yeah, it certainly was.

> 
> > Very interested to hear your thoughts here!
> 
> I would of course feel better if there were fewer vulnerabilities.

I agree so we should change the implementation as mentioned above.

> 
> > > > +		if (!err && removed_size)
> > > > +			*removed_size = xattr->size;
> > > > +	} else {
> > > > +		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
> > > > +		if (flags & XATTR_REPLACE) {
> > > > +			err = -ENODATA;
> > > > +			goto out_unlock;
> > > > +		}
> > > > +
> > > > +		/*
> > > > +		 * If XATTR_CREATE or no flags are specified together with a
> > > > +		 * new value simply insert it.
> > > > +		 */
> > > > +		if (new_xattr) {
> > > > +			rb_link_node_rcu(&new_xattr->rb_node, parent, rbp);
> > > > +			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> > > > +		}
> > > > +
> > > > +		/*
> > > > +		 * If XATTR_CREATE or no flags are specified and neither an old
> > > > +		 * or new xattr were found/exist then we don't need to do
> > > > +		 * anything.
> > > > +		 */
> > > 
> > > As before, some of this looks like it should be in the rbtree implementation.
> > > On the other hand, and add-or-replace function like this might be rare.  And
> > > there are only two other occurrences, both of which look quite specialize.
> > 
> > If you think the current rcu search pattern is sufficiently safe even
> > against rb_erase() then we might want to give the two callers/users at
> > least rb_find_first_rcu() directly in rbtree.h which you suggested
> > further above.
> 
> If we are going to have RCU-protected rbtrees, it might be good for the
> rbtree code to know that it is supposed to handle RCU traversals.  ;-)

I honestly think this would make for a pretty nifty bachelor thesis
project. :)

Thanks!
Christian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] xattr: use rbtree for simple_xattrs
  2022-11-11 10:51       ` Christian Brauner
@ 2022-11-11 18:45         ` Paul E. McKenney
  0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2022-11-11 18:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tejun Heo, linux-fsdevel, Vasily Averin, Hugh Dickins,
	Seth Forshee, Stéphane Graber, Roman Gushchin, Al Viro

On Fri, Nov 11, 2022 at 11:51:23AM +0100, Christian Brauner wrote:
> On Fri, Nov 11, 2022 at 12:13:35AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 09, 2022 at 10:51:52AM +0100, Christian Brauner wrote:
> > > On Tue, Nov 08, 2022 at 10:59:04AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> > > > > A while ago Vasily reported that it is possible to set a large number of
> > > > > xattrs on inodes of filesystems that make use of the simple xattr
> > > > > infrastructure. This includes all kernfs-based filesystems that support
> > > > > xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> > > > > mounted by unprivileged users in unprivileged containers and root in an
> > > > > unprivileged container can set an unrestricted number of security.*
> > > > > xattrs and privileged users can also set unlimited trusted.* xattrs. As
> > > > > there are apparently users that have a fairly large number of xattrs we
> > > > > should scale a bit better. Other xattrs such as user.* are restricted
> > > > > for kernfs-based instances to a fairly limited number.
> > > > > 
> > > > > Using a simple linked list protected by a spinlock used for set, get,
> > > > > and list operations doesn't scale well if users use a lot of xattrs even
> > > > > if it's not a crazy number. And There's no need to bring in the big guns
> > > > > like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> > > > > limited rcu semantics is enough.
> > > > > 
> > > > > It scales within the constraints we are working in. By far the most
> > > > > common operations is getting an xattr. The get operation is optimized to
> > > > > be lock free as long as there are no writers. The list operation takes
> > > > > the read lock and protects against concurrent writers while allowing
> > > > > lockless get operations. Locking out other listxattr callers isn't a
> > > > > huge deal since listing xattrs is mostly relevant when copying a file or
> > > > > copying all xattrs between files.
> > > > > 
> > > > > Additionally, listxattr() doesn't list the values of xattrs it can only
> > > > > be used to list the names of all xattrs set on a file. And the number of
> > > > > xattr names that can be listed with listxattr() is limited to
> > > > > XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > > > > vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> > > > > found it will return -EFBIG. In short, the maximum amount of memory that
> > > > > can be retrieved via listxattr() is limited.
> > > > > 
> > > > > Of course, the API is broken as documented on xattr(7) already. In the
> > > > > future we might want to address this but for now this is the world we
> > > > > live in and have lived for a long time. But it does indeed mean that
> > > > > once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> > > > > inode it isn't possible to copy the file and include its xattrs in the
> > > > > copy unless the caller knows all xattrs or limits the copy of the xattrs
> > > > > to important ones it knows by name (At least for tmpfs, and kernfs-based
> > > > > filesystems. Other filesystems might provide ways of achieving this.).
> > > > > 
> > > > > Also add proper kernel documentation to all the functions.
> > > > > A big thanks to Paul for his comments.
> > > > > 
> > > > > Cc: Vasily Averin <vvs@openvz.org>
> > > > > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > > > > Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> > > > 
> > > > Looks mostly plausible from an RCU viewpoint, but there are a few
> > > > questions/comments inline below.
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > ---
> > > > > 
> > > > > Notes:
> > > > >     In addition to this patch I would like to propose that we restrict the number
> > > > >     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
> > > > >     other words, we restrict the number of xattrs for simple xattr filesystems to
> > > > >     the number of xattrs names that can be retrieved via listxattr(). That should
> > > > >     be about 2000 to 3000 xattrs per inode which is more than enough. We should
> > > > >     risk this and see if we get any regression reports from userswith this
> > > > >     approach.
> > > > >     
> > > > >     This should be as simple as adding a max_list member to struct simple_xattrs
> > > > >     and initialize it with XATTR_MAX_LIST. Set operations would then check against
> > > > >     this field whether the new xattr they are trying to set will fit and return
> > > > >     -EFBIG otherwise. I think that might be a good approach to get rid of the in
> > > > >     principle unbounded number of xattrs that can be set via the simple xattr
> > > > >     infrastructure. I think this is a regression risk worth taking.
> > > > >     
> > > > >     /* v2 */
> > > > >     Christian Brauner <brauner@kernel.org>:
> > > > >     - Fix kernel doc.
> > > > >     - Remove accidental leftover union from previous iteration.
> > > > > 
> > > > >  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
> > > > >  include/linux/xattr.h |  40 ++---
> > > > >  mm/shmem.c            |   2 +-
> > > > >  3 files changed, 270 insertions(+), 102 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xattr.c b/fs/xattr.c
> > > > > index 61107b6bbed2..f18454161d54 100644
> > > > > --- a/fs/xattr.c
> > > > > +++ b/fs/xattr.c
> > > > > @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
> > > > >  }
> > > > >  EXPORT_SYMBOL(xattr_full_name);
> > > > >  
> > > > > -/*
> > > > > - * Allocate new xattr and copy in the value; but leave the name to callers.
> > > > > +/**
> > > > > + * free_simple_xattr - free an xattr object
> > > > > + * @xattr: the xattr object
> > > > > + *
> > > > > + * Free the xattr object. Can handle @xattr being NULL.
> > > > > + */
> > > > > +static inline void free_simple_xattr(struct simple_xattr *xattr)
> > > > > +{
> > > > > +	if (xattr)
> > > > > +		kfree(xattr->name);
> > > > > +	kvfree(xattr);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> > > > > + * @cb: the rcu callback head
> > > > > + */
> > > > > +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> > > > > +{
> > > > > +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> > > > > + * @xattr: the xattr object
> > > > > + */
> > > > > +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> > > > > +{
> > > > > +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * put_simple_xattr_rcu - decrement refcount for xattr object
> > > > > + * @xattr: the xattr object
> > > > > + *
> > > > > + * Decrement the reference count of an xattr object and free it using rcu
> > > > > + * semantics if we're the holder of the last reference. Can handle @xattr being
> > > > > + * NULL.
> > > > > + */
> > > > > +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> > > > > +{
> > > > > +	if (xattr && refcount_dec_and_test(&xattr->ref))
> > > > > +		free_simple_xattr_rcu(xattr);
> > > > > +}
> > > > 
> > > > Looks like the standard combined reference counter and RCU combination,
> > > > goog!
> > > > 
> > > > > +
> > > > > +/**
> > > > > + * simple_xattr_alloc - allocate new xattr object
> > > > > + * @value: value of the xattr object
> > > > > + * @size: size of @value
> > > > > + *
> > > > > + * Allocate a new xattr object and initialize respective members. The caller is
> > > > > + * responsible for handling the name of the xattr.
> > > > > + *
> > > > > + * The initial reference count belongs to the rbtree.
> > > > > + *
> > > > > + * Return: On success a new xattr object is returned. On failure NULL is
> > > > > + * returned.
> > > > >   */
> > > > >  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > > > >  {
> > > > > @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > > > >  
> > > > >  	new_xattr->size = size;
> > > > >  	memcpy(new_xattr->value, value, size);
> > > > > +	refcount_set(&new_xattr->ref, 1);
> > > > 
> > > > Yes, one is usually needed for the link in the tree.
> > > > 
> > > > >  	return new_xattr;
> > > > >  }
> > > > >  
> > > > > -/*
> > > > > - * xattr GET operation for in-memory/pseudo filesystems
> > > > > +/**
> > > > > + * simple_xattr_get - get an xattr object
> > > > > + * @xattrs: the header of the xattr object
> > > > > + * @name: the name of the xattr to retrieve
> > > > > + * @buffer: the buffer to store the value into
> > > > > + * @size: the size of @buffer
> > > > > + *
> > > > > + * Try to find and retrieve the xattr object associated with @name. If the
> > > > > + * object is found and still in the rbtree bump the reference count.
> > > > > + *
> > > > > + * If @buffer is provided store the value of @xattr in @buffer.
> > > > > + *
> > > > > + * Return: On success zero and on error a negative error code is returned.
> > > > >   */
> > > > >  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> > > > >  		     void *buffer, size_t size)
> > > > >  {
> > > > > -	struct simple_xattr *xattr;
> > > > > -	int ret = -ENODATA;
> > > > > -
> > > > > -	spin_lock(&xattrs->lock);
> > > > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > > > -		if (strcmp(name, xattr->name))
> > > > > -			continue;
> > > > > -
> > > > > -		ret = xattr->size;
> > > > > -		if (buffer) {
> > > > > -			if (size < xattr->size)
> > > > > -				ret = -ERANGE;
> > > > > -			else
> > > > > -				memcpy(buffer, xattr->value, xattr->size);
> > > > > +	struct simple_xattr *xattr = NULL;
> > > > > +	struct rb_node *rbp;
> > > > > +	int ret, seq = 0;
> > > > > +
> > > > > +	rcu_read_lock();
> > > > > +	do {
> > > > > +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> > > > 
> > > > It might be necessary to try a few times before grabbing the lock, but
> > > > perhaps we should actually hit the problem before increasing complexity.
> > > > 
> > > > > +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> > > > > +		while (rbp) {
> > > > > +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > > > > +			if (strcmp(xattr->name, name) < 0) {
> > > > > +				rbp = rcu_dereference(rbp->rb_left);
> > > > > +			} else if (strcmp(xattr->name, name) > 0) {
> > > > > +				rbp = rcu_dereference(rbp->rb_right);
> > > > > +			} else {
> > > > > +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> > > > > +					xattr = NULL;
> > > > > +				break;
> > > > > +			}
> > > > > +			xattr = NULL;
> > > > >  		}
> > > > 
> > > > Maybe this is too specialized, but should this be in the rbtree code,
> > > > perhaps rb_find_first_rcu(), but with an appropriate cmp() function?
> > > > The refcount_inc_not_zero() clearly needs to be in the caller.
> > > > 
> > > > If this is the only instance of this sort of code, it is likely not
> > > > really worthwhile.  But if we have several of these open coded, it would
> > > > be good to consolidate that code.
> > > 
> > > There's more than one instance of this pattern for sure.
> > 
> > OK, might be time to consolidate, then.  ;-)
> > 
> > > > > -		break;
> > > > > +	} while (need_seqretry(&xattrs->lock, seq));
> > > > > +	done_seqretry(&xattrs->lock, seq);
> > > > > +	rcu_read_unlock();
> > > > > +
> > > > > +	if (!xattr)
> > > > > +		return -ENODATA;
> > > > > +
> > > > > +	ret = xattr->size;
> > > > > +	if (buffer) {
> > > > > +		if (size < xattr->size)
> > > > > +			ret = -ERANGE;
> > > > > +		else
> > > > > +			memcpy(buffer, xattr->value, xattr->size);
> > > > 
> > > > If all we are doing is copying to an in-kernel buffer, why not dispense
> > > > with xattr->ref and do the memcpy() under rcu_read_lock() protection?
> > > > This would avoid some overhead from reference-count cache misses.
> > > > 
> > > > Of course, if that memcpy() can page fault or some such, then what
> > > > you have is necessary.  But this is all in-kernel, right?  And if not,
> > > > shouldn't the pointers be decorated with __user or some such?
> > > 
> > > This is just a regular in-kernel memcpy(). Good idea dispensing with the
> > > refcount completely.
> > 
> > Sounds good!
> > 
> > > > >  	}
> > > > > -	spin_unlock(&xattrs->lock);
> > > > > +
> > > > > +	put_simple_xattr_rcu(xattr);
> > > > >  	return ret;
> > > > >  }
> > > > >  
> > > > >  /**
> > > > > - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> > > > > - * @xattrs: target simple_xattr list
> > > > > - * @name: name of the extended attribute
> > > > > - * @value: value of the xattr. If %NULL, will remove the attribute.
> > > > > - * @size: size of the new xattr
> > > > > - * @flags: %XATTR_{CREATE|REPLACE}
> > > > > - * @removed_size: returns size of the removed xattr, -1 if none removed
> > > > > + * simple_xattr_set - set an xattr object
> > > > > + * @xattrs: the header of the xattr object
> > > > > + * @name: the name of the xattr to retrieve
> > > > > + * @value: the value to store along the xattr
> > > > > + * @size: the size of @value
> > > > > + * @flags: the flags determining how to set the xattr
> > > > > + * @removed_size: the size of the removed xattr
> > > > > + *
> > > > > + * Set a new xattr object.
> > > > > + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> > > > > + * is specified in @flags a matching xattr object for @name must already exist.
> > > > > + * If it does it will be replace with the new xattr object. If it doesn't we
> > > > > + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> > > > > + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> > > > > + * insert the new xattr replacing any existing one.
> > > > >   *
> > > > > - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> > > > > - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> > > > > - * otherwise, fails with -ENODATA.
> > > > > + * If @value is empty and a matching xattr object is found we delete it if
> > > > > + * XATTR_REPLACE is specified in @flags or @flags is zero.
> > > > >   *
> > > > > - * Returns 0 on success, -errno on failure.
> > > > > + * If @value is empty and no matching xattr object for @name is found we do
> > > > > + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> > > > > + * XATTR_REPLACE we fail as mentioned above.
> > > > > + *
> > > > > + * Return: On success zero and on error a negative error code is returned.
> > > > >   */
> > > > >  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > > > >  		     const void *value, size_t size, int flags,
> > > > >  		     ssize_t *removed_size)
> > > > >  {
> > > > > -	struct simple_xattr *xattr;
> > > > > -	struct simple_xattr *new_xattr = NULL;
> > > > > +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> > > > > +	struct rb_node *parent = NULL, **rbp;
> > > > >  	int err = 0;
> > > > >  
> > > > >  	if (removed_size)
> > > > > @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > > > >  		}
> > > > >  	}
> > > > >  
> > > > > -	spin_lock(&xattrs->lock);
> > > > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > > > -		if (!strcmp(name, xattr->name)) {
> > > > > -			if (flags & XATTR_CREATE) {
> > > > > -				xattr = new_xattr;
> > > > > -				err = -EEXIST;
> > > > > -			} else if (new_xattr) {
> > > > > -				list_replace(&xattr->list, &new_xattr->list);
> > > > > -				if (removed_size)
> > > > > -					*removed_size = xattr->size;
> > > > > -			} else {
> > > > > -				list_del(&xattr->list);
> > > > > -				if (removed_size)
> > > > > -					*removed_size = xattr->size;
> > > > > -			}
> > > > > -			goto out;
> > > > > -		}
> > > > > -	}
> > > > > -	if (flags & XATTR_REPLACE) {
> > > > > -		xattr = new_xattr;
> > > > > -		err = -ENODATA;
> > > > > -	} else {
> > > > > -		list_add(&new_xattr->list, &xattrs->head);
> > > > > +	write_seqlock(&xattrs->lock);
> > > > > +	rbp = &xattrs->rb_root.rb_node;
> > > > > +	while (*rbp) {
> > > > > +		parent = *rbp;
> > > > > +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> > > > > +		if (strcmp(xattr->name, name) < 0)
> > > > > +			rbp = &(*rbp)->rb_left;
> > > > > +		else if (strcmp(xattr->name, name) > 0)
> > > > > +			rbp = &(*rbp)->rb_right;
> > > > > +		else
> > > > > +			break;
> > > > >  		xattr = NULL;
> > > > >  	}
> > > > > -out:
> > > > > -	spin_unlock(&xattrs->lock);
> > > > > +
> > > > >  	if (xattr) {
> > > > > -		kfree(xattr->name);
> > > > > -		kvfree(xattr);
> > > > > +		/* Fail if XATTR_CREATE is requested and the xattr exists. */
> > > > > +		if (flags & XATTR_CREATE) {
> > > > > +			err = -EEXIST;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +
> > > > > +		if (new_xattr)
> > > > > +			rb_replace_node_rcu(&xattr->rb_node,
> > > > > +					    &new_xattr->rb_node,
> > > > > +					    &xattrs->rb_root);
> > > > > +		else
> > > > > +			rb_erase(&xattr->rb_node, &xattrs->rb_root);
> > > > 
> > > > Is rb_erase() RCU-reader-safe?  It is not immediately obvious to me that it
> > > > is.  I would expect an rcu_assign_pointer() or three in there somewhere.
> > > 
> > > This question send me down an interesting rabbit hole which is
> > > (unironically) excellent. Afaiu, yes rb_erase() isn't rcu safe. And our
> > > rbtree implementation in general isn't rcu safe in the face of
> > > rebalances (I think you linked to a paper that illustrates how one would
> > > need to go about doing this though.). So all users deal with that using
> > > the seq+rcu-rbtree pattern. I trust you'll correct me if I'm wrong: The
> > > assumption of current users seems to be that searching the rbtree under
> > > the rcu lock - i.e., without having taken the read seqlock - is safe
> > > against oops or iow against corrupt pointers if a writer changes the
> > > rbtree during the walk. The seqlock is supposed to prevent garbage/stale
> > > results. At least all users of this pattern seem to rely on the fact
> > > that a rcu only walk cannot oops. Is that assumption safe?
> > 
> > Kind of, maybe?
> > 
> > To be fully safe, you need to check the seqlock during the actual tree
> > traversal.  Otherwise, an unfortunate series of rebalances or rotations
> > or whatever could potentially run a given reader forever in circles.
> > Though the odds of this actually continuing for any length of time are
> > of course extremely low.
> > 
> > And unmarked accesses to shared variables having conflicting concurrent
> > accesses is playing with fire, though one hopes that stores of pointers
> > would end up wsing single machine instructions.  In theory, there are
> > a number of ways that compiler optimizations could cause trouble.
> 
> Since this is code reachable from unprivileged userspace I'd rather not
> play with fire here at all. Even if all of this might turn out to be
> just a theoretical possibility.
> 
> I think rcu+seq+rbtree would've made for an interesting, sufficiently
> scalable solution with still limited complexity. It's not really too
> hard to understand and isn't bloated. But given that we have some valid
> concerns we should de-fancy-fy and use an rwlock_t. Roman alrady
> mentioned this and I had this is an implementation before this one
> (together with rhashtable as I've mentioned). I'll send out the patches
> shortly. I'm just finishing tests.

Fair enough!

> > > Another point that popped up discussing this was that there's an
> > > unlikely attack scenario that David (Howells) made me aware of. While
> > > walking the rbtree under rcu only without the read seqlock it is
> > > theoretically possible for the rbtree to keep being expanded and cause
> > > the rcu walk to "never finish". Though I'm unsure whether that's really
> > > a practically possible attack.
> > 
> > Perhaps David's scenario is similar to my eternal rebalance?
> 
> Yeah, it certainly was.
> 
> > 
> > > Very interested to hear your thoughts here!
> > 
> > I would of course feel better if there were fewer vulnerabilities.
> 
> I agree so we should change the implementation as mentioned above.
> 
> > 
> > > > > +		if (!err && removed_size)
> > > > > +			*removed_size = xattr->size;
> > > > > +	} else {
> > > > > +		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
> > > > > +		if (flags & XATTR_REPLACE) {
> > > > > +			err = -ENODATA;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +
> > > > > +		/*
> > > > > +		 * If XATTR_CREATE or no flags are specified together with a
> > > > > +		 * new value simply insert it.
> > > > > +		 */
> > > > > +		if (new_xattr) {
> > > > > +			rb_link_node_rcu(&new_xattr->rb_node, parent, rbp);
> > > > > +			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> > > > > +		}
> > > > > +
> > > > > +		/*
> > > > > +		 * If XATTR_CREATE or no flags are specified and neither an old
> > > > > +		 * or new xattr were found/exist then we don't need to do
> > > > > +		 * anything.
> > > > > +		 */
> > > > 
> > > > As before, some of this looks like it should be in the rbtree implementation.
> > > > On the other hand, and add-or-replace function like this might be rare.  And
> > > > there are only two other occurrences, both of which look quite specialize.
> > > 
> > > If you think the current rcu search pattern is sufficiently safe even
> > > against rb_erase() then we might want to give the two callers/users at
> > > least rb_find_first_rcu() directly in rbtree.h which you suggested
> > > further above.
> > 
> > If we are going to have RCU-protected rbtrees, it might be good for the
> > rbtree code to know that it is supposed to handle RCU traversals.  ;-)
> 
> I honestly think this would make for a pretty nifty bachelor thesis
> project. :)

Makes sense to me!  You have a student in mind?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-11-11 18:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-08 11:41 [PATCH v2] xattr: use rbtree for simple_xattrs Christian Brauner
2022-11-08 18:59 ` Paul E. McKenney
2022-11-09  9:51   ` Christian Brauner
2022-11-11  8:13     ` Paul E. McKenney
2022-11-11 10:51       ` Christian Brauner
2022-11-11 18:45         ` Paul E. McKenney
2022-11-08 19:08 ` Roman Gushchin
2022-11-09 11:16   ` Christian Brauner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.