linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily
@ 2024-01-30 19:03 Linus Torvalds
  2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
                   ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel, Linus Torvalds

The eventfs_find_events() code tries to walk up the tree to find the
event directory that a dentry belongs to, in order to then find the
eventfs inode that is associated with that event directory.

However, it uses an odd combination of walking the dentry parent,
looking up the eventfs inode associated with that, and then looking up
the dentry from there.  Repeat.

But the code shouldn't have back-pointers to dentries in the first
place, and it should just walk the dentry parenthood chain directly.

Similarly, 'set_top_events_ownership()' looks up the dentry from the
eventfs inode, but the only reason it wants a dentry is to look up the
superblock in order to look up the root dentry.

But it already has the real filesystem inode, which has that same
superblock pointer.  So just pass in the superblock pointer using the
information that's already there, instead of looking up extraneous data
that is irrelevant.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1c3dd0ad4660..2d128bedd654 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry)
+static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *root;
 
 	/* Only update if the "events" was on the top level */
 	if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
 	/* Get the tracefs root inode. */
-	inode = d_inode(dentry->d_sb->s_root);
-	ei->attr.uid = inode->i_uid;
-	ei->attr.gid = inode->i_gid;
+	root = d_inode(sb->s_root);
+	ei->attr.uid = root->i_uid;
+	ei->attr.gid = root->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
 {
 	struct tracefs_inode *ti = get_tracefs(inode);
 	struct eventfs_inode *ei = ti->private;
-	struct dentry *dentry;
 
 	/* The top events directory doesn't get automatically updated */
 	if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
-	dentry = ei->dentry;
-
-	update_top_events_attr(ei, dentry);
+	update_top_events_attr(ei, inode->i_sb);
 
 	if (!(ei->attr.mode & EVENTFS_SAVE_UID))
 		inode->i_uid = ei->attr.uid;
@@ -235,8 +232,10 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 
 	mutex_lock(&eventfs_mutex);
 	do {
-		/* The parent always has an ei, except for events itself */
-		ei = dentry->d_parent->d_fsdata;
+		// The parent is stable because we do not do renames
+		dentry = dentry->d_parent;
+		// ... and directories always have d_fsdata
+		ei = dentry->d_fsdata;
 
 		/*
 		 * If the ei is being freed, the ownership of the children
@@ -246,12 +245,11 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 			ei = NULL;
 			break;
 		}
-
-		dentry = ei->dentry;
+		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
 	mutex_unlock(&eventfs_mutex);
 
-	update_top_events_attr(ei, dentry);
+	update_top_events_attr(ei, dentry->d_sb);
 
 	return ei;
 }
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
  2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
@ 2024-01-30 19:03 ` Linus Torvalds
  2024-01-30 19:48   ` Steven Rostedt
  2024-01-30 19:49   ` Steven Rostedt
  2024-01-30 19:03 ` [PATCH 3/6] tracefs: dentry lookup crapectomy Linus Torvalds
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel, Linus Torvalds

The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.

And the ->flags file was initialized incorrectly with a '|=', when the
old value was stale.  It should have just been a straight assignment.

Move the field initializations up to before the d_instantiate, and fix
the use of uninitialized data.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2d128bedd654..c0d977e6c0f2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_ino = EVENTFS_FILE_INODE_INO;
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = NULL;			// Directories have 'ei', files not
+
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = ei;
 
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
@@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 static void eventfs_post_create_dir(struct eventfs_inode *ei)
 {
 	struct eventfs_inode *ei_child;
-	struct tracefs_inode *ti;
 
 	lockdep_assert_held(&eventfs_mutex);
 
@@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei)
 				 srcu_read_lock_held(&eventfs_srcu)) {
 		ei_child->d_parent = ei->dentry;
 	}
-
-	ti = get_tracefs(ei->dentry->d_inode);
-	ti->private = ei;
 }
 
 /**
@@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	INIT_LIST_HEAD(&ei->list);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
 	ti->private = ei;
 
 	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/6] tracefs: dentry lookup crapectomy
  2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
  2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
@ 2024-01-30 19:03 ` Linus Torvalds
  2024-01-30 23:26   ` Al Viro
  2024-01-31  0:23   ` Al Viro
  2024-01-30 19:03 ` [PATCH 4/6] eventfs: remove unused 'd_parent' pointer field Linus Torvalds
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel, Linus Torvalds

The dentry lookup for eventfs files was very broken, and had lots of
signs of the old situation where the filesystem names were all created
statically in the dentry tree, rather than being looked up dynamically
based on the eventfs data structures.

You could see it in the naming - how it claimed to "create" dentries
rather than just look up the dentries that were given it.

You could see it in various nonsensical and very incorrect operations,
like using "simple_lookup()" on the dentries that were passed in, which
only results in those dentries becoming negative dentries.  Which meant
that any other lookup would possibly return ENOENT if it saw that
negative dentry before the data rwas then later filled in.

You could see it in the immesnse amount of nonsensical code that didn't
actually just do lookups.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 275 ++++++++-------------------------------
 1 file changed, 52 insertions(+), 223 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index c0d977e6c0f2..ad11063bdd53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 
-	mutex_lock(&eventfs_mutex);
 	do {
 		// The parent is stable because we do not do renames
 		dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 		}
 		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
-	mutex_unlock(&eventfs_mutex);
 
 	update_top_events_attr(ei, dentry->d_sb);
 
@@ -280,11 +278,10 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
 }
 
 /**
- * create_file - create a file in the tracefs filesystem
- * @name: the name of the file to create.
+ * lookup_file - look up a file in the tracefs filesystem
+ * @dentry: the dentry to look up
  * @mode: the permission that the file should have.
  * @attr: saved attributes changed by user
- * @parent: parent dentry for this file.
  * @data: something that the caller will want to get to later on.
  * @fop: struct file_operations that should be used for this file.
  *
@@ -292,13 +289,13 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *create_file(const char *name, umode_t mode,
+static struct dentry *lookup_file(struct dentry *dentry,
+				  umode_t mode,
 				  struct eventfs_attr *attr,
-				  struct dentry *parent, void *data,
+				  void *data,
 				  const struct file_operations *fop)
 {
 	struct tracefs_inode *ti;
-	struct dentry *dentry;
 	struct inode *inode;
 
 	if (!(mode & S_IFMT))
@@ -307,12 +304,6 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	if (WARN_ON_ONCE(!S_ISREG(mode)))
 		return NULL;
 
-	WARN_ON_ONCE(!parent);
-	dentry = eventfs_start_creating(name, parent);
-
-	if (IS_ERR(dentry))
-		return dentry;
-
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
@@ -331,29 +322,25 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
 };
 
 /**
- * create_dir - create a dir in the tracefs filesystem
+ * lookup_dir_entry - look up a dir in the tracefs filesystem
+ * @dentry: the directory to look up
  * @ei: the eventfs_inode that represents the directory to create
- * @parent: parent dentry for this file.
  *
- * This function will create a dentry for a directory represented by
+ * This function will look up a dentry for a directory represented by
  * a eventfs_inode.
  */
-static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent)
+static struct dentry *lookup_dir_entry(struct dentry *dentry,
+	struct eventfs_inode *pei, struct eventfs_inode *ei)
 {
 	struct tracefs_inode *ti;
-	struct dentry *dentry;
 	struct inode *inode;
 
-	dentry = eventfs_start_creating(ei->name, parent);
-	if (IS_ERR(dentry))
-		return dentry;
-
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
@@ -372,8 +359,11 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = ei;
 
+	dentry->d_fsdata = ei;
+        ei->dentry = dentry;	// Remove me!
+
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -426,7 +416,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 }
 
 /**
- * create_file_dentry - create a dentry for a file of an eventfs_inode
+ * lookup_file_dentry - create a dentry for a file of an eventfs_inode
  * @ei: the eventfs_inode that the file will be created under
  * @idx: the index into the d_children[] of the @ei
  * @parent: The parent dentry of the created file.
@@ -439,157 +429,21 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
  * address located at @e_dentry.
  */
 static struct dentry *
-create_file_dentry(struct eventfs_inode *ei, int idx,
-		   struct dentry *parent, const char *name, umode_t mode, void *data,
+lookup_file_dentry(struct dentry *dentry,
+		   struct eventfs_inode *ei, int idx,
+		   umode_t mode, void *data,
 		   const struct file_operations *fops)
 {
 	struct eventfs_attr *attr = NULL;
 	struct dentry **e_dentry = &ei->d_children[idx];
-	struct dentry *dentry;
 
-	WARN_ON_ONCE(!inode_is_locked(parent->d_inode));
-
-	mutex_lock(&eventfs_mutex);
-	if (ei->is_freed) {
-		mutex_unlock(&eventfs_mutex);
-		return NULL;
-	}
-	/* If the e_dentry already has a dentry, use it */
-	if (*e_dentry) {
-		dget(*e_dentry);
-		mutex_unlock(&eventfs_mutex);
-		return *e_dentry;
-	}
-
-	/* ei->entry_attrs are protected by SRCU */
 	if (ei->entry_attrs)
 		attr = &ei->entry_attrs[idx];
 
-	mutex_unlock(&eventfs_mutex);
+	dentry->d_fsdata = ei;		// NOTE: ei of _parent_
+	lookup_file(dentry, mode, attr, data, fops);
 
-	dentry = create_file(name, mode, attr, parent, data, fops);
-
-	mutex_lock(&eventfs_mutex);
-
-	if (IS_ERR_OR_NULL(dentry)) {
-		/*
-		 * When the mutex was released, something else could have
-		 * created the dentry for this e_dentry. In which case
-		 * use that one.
-		 *
-		 * If ei->is_freed is set, the e_dentry is currently on its
-		 * way to being freed, don't return it. If e_dentry is NULL
-		 * it means it was already freed.
-		 */
-		if (ei->is_freed) {
-			dentry = NULL;
-		} else {
-			dentry = *e_dentry;
-			dget(dentry);
-		}
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-
-	if (!*e_dentry && !ei->is_freed) {
-		*e_dentry = dentry;
-		dentry->d_fsdata = ei;
-	} else {
-		/*
-		 * Should never happen unless we get here due to being freed.
-		 * Otherwise it means two dentries exist with the same name.
-		 */
-		WARN_ON_ONCE(!ei->is_freed);
-		dentry = NULL;
-	}
-	mutex_unlock(&eventfs_mutex);
-
-	return dentry;
-}
-
-/**
- * eventfs_post_create_dir - post create dir routine
- * @ei: eventfs_inode of recently created dir
- *
- * Map the meta-data of files within an eventfs dir to their parent dentry
- */
-static void eventfs_post_create_dir(struct eventfs_inode *ei)
-{
-	struct eventfs_inode *ei_child;
-
-	lockdep_assert_held(&eventfs_mutex);
-
-	/* srcu lock already held */
-	/* fill parent-child relation */
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
-		ei_child->d_parent = ei->dentry;
-	}
-}
-
-/**
- * create_dir_dentry - Create a directory dentry for the eventfs_inode
- * @pei: The eventfs_inode parent of ei.
- * @ei: The eventfs_inode to create the directory for
- * @parent: The dentry of the parent of this directory
- *
- * This creates and attaches a directory dentry to the eventfs_inode @ei.
- */
-static struct dentry *
-create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
-		  struct dentry *parent)
-{
-	struct dentry *dentry = NULL;
-
-	WARN_ON_ONCE(!inode_is_locked(parent->d_inode));
-
-	mutex_lock(&eventfs_mutex);
-	if (pei->is_freed || ei->is_freed) {
-		mutex_unlock(&eventfs_mutex);
-		return NULL;
-	}
-	if (ei->dentry) {
-		/* If the eventfs_inode already has a dentry, use it */
-		dentry = ei->dentry;
-		dget(dentry);
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-	mutex_unlock(&eventfs_mutex);
-
-	dentry = create_dir(ei, parent);
-
-	mutex_lock(&eventfs_mutex);
-
-	if (IS_ERR_OR_NULL(dentry) && !ei->is_freed) {
-		/*
-		 * When the mutex was released, something else could have
-		 * created the dentry for this e_dentry. In which case
-		 * use that one.
-		 *
-		 * If ei->is_freed is set, the e_dentry is currently on its
-		 * way to being freed.
-		 */
-		dentry = ei->dentry;
-		if (dentry)
-			dget(dentry);
-		mutex_unlock(&eventfs_mutex);
-		return dentry;
-	}
-
-	if (!ei->dentry && !ei->is_freed) {
-		ei->dentry = dentry;
-		eventfs_post_create_dir(ei);
-		dentry->d_fsdata = ei;
-	} else {
-		/*
-		 * Should never happen unless we get here due to being freed.
-		 * Otherwise it means two dentries exist with the same name.
-		 */
-		WARN_ON_ONCE(!ei->is_freed);
-		dentry = NULL;
-	}
-	mutex_unlock(&eventfs_mutex);
+	*e_dentry = dentry;	// Remove me
 
 	return dentry;
 }
@@ -608,79 +462,54 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags)
 {
-	const struct file_operations *fops;
-	const struct eventfs_entry *entry;
 	struct eventfs_inode *ei_child;
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
-	struct dentry *ei_dentry = NULL;
-	struct dentry *ret = NULL;
-	struct dentry *d;
 	const char *name = dentry->d_name.name;
-	umode_t mode;
-	void *data;
-	int idx;
-	int i;
-	int r;
 
 	ti = get_tracefs(dir);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
-		return NULL;
+		return ERR_PTR(-EIO);
 
-	/* Grab srcu to prevent the ei from going away */
-	idx = srcu_read_lock(&eventfs_srcu);
-
-	/*
-	 * Grab the eventfs_mutex to consistent value from ti->private.
-	 * This s
-	 */
 	mutex_lock(&eventfs_mutex);
-	ei = READ_ONCE(ti->private);
-	if (ei && !ei->is_freed)
-		ei_dentry = READ_ONCE(ei->dentry);
-	mutex_unlock(&eventfs_mutex);
 
-	if (!ei || !ei_dentry)
-		goto out;
+	ei = ti->private;
+	if (!ei || ei->is_freed)
+		goto enoent;
 
-	data = ei->data;
-
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 srcu_read_lock_held(&eventfs_srcu)) {
+	list_for_each_entry(ei_child, &ei->children, list) {
 		if (strcmp(ei_child->name, name) != 0)
 			continue;
-		ret = simple_lookup(dir, dentry, flags);
-		if (IS_ERR(ret))
-			goto out;
-		d = create_dir_dentry(ei, ei_child, ei_dentry);
-		dput(d);
+		if (ei_child->is_freed)
+			goto enoent;
+		lookup_dir_entry(dentry, ei, ei_child);
 		goto out;
 	}
 
-	for (i = 0; i < ei->nr_entries; i++) {
-		entry = &ei->entries[i];
-		if (strcmp(name, entry->name) == 0) {
-			void *cdata = data;
-			mutex_lock(&eventfs_mutex);
-			/* If ei->is_freed, then the event itself may be too */
-			if (!ei->is_freed)
-				r = entry->callback(name, &mode, &cdata, &fops);
-			else
-				r = -1;
-			mutex_unlock(&eventfs_mutex);
-			if (r <= 0)
-				continue;
-			ret = simple_lookup(dir, dentry, flags);
-			if (IS_ERR(ret))
-				goto out;
-			d = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
-			dput(d);
-			break;
-		}
+	for (int i = 0; i < ei->nr_entries; i++) {
+		void *data;
+		umode_t mode;
+		const struct file_operations *fops;
+		const struct eventfs_entry *entry = &ei->entries[i];
+
+		if (strcmp(name, entry->name) != 0)
+			continue;
+
+		data = ei->data;
+		if (entry->callback(name, &mode, &data, &fops) <= 0)
+			goto enoent;
+
+		lookup_file_dentry(dentry, ei, i, mode, data, fops);
+		goto out;
 	}
+
+ enoent:
+	/* Nothing found? */
+	d_add(dentry, NULL);
+
  out:
-	srcu_read_unlock(&eventfs_srcu, idx);
-	return ret;
+	mutex_unlock(&eventfs_mutex);
+	return NULL;
 }
 
 /*
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/6] eventfs: remove unused 'd_parent' pointer field
  2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
  2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
  2024-01-30 19:03 ` [PATCH 3/6] tracefs: dentry lookup crapectomy Linus Torvalds
@ 2024-01-30 19:03 ` Linus Torvalds
  2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
  2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
  4 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel, Linus Torvalds

It's never used

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 4 +---
 fs/tracefs/internal.h    | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ad11063bdd53..1d0102bfd7da 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -685,10 +685,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 	INIT_LIST_HEAD(&ei->list);
 
 	mutex_lock(&eventfs_mutex);
-	if (!parent->is_freed) {
+	if (!parent->is_freed)
 		list_add_tail(&ei->list, &parent->children);
-		ei->d_parent = parent->dentry;
-	}
 	mutex_unlock(&eventfs_mutex);
 
 	/* Was the parent freed? */
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 91c2bf0b91d9..8f38740bfb5b 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -35,7 +35,6 @@ struct eventfs_attr {
  * @name:	the name of the directory to create
  * @children:	link list into the child eventfs_inode
  * @dentry:     the dentry of the directory
- * @d_parent:   pointer to the parent's dentry
  * @d_children: The array of dentries to represent the files when created
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:	Saved mode and ownership of eventfs_inode itself
@@ -50,7 +49,6 @@ struct eventfs_inode {
 	const char			*name;
 	struct list_head		children;
 	struct dentry			*dentry; /* Check is_freed to access */
-	struct dentry			*d_parent;
 	struct dentry			**d_children;
 	struct eventfs_attr		*entry_attrs;
 	struct eventfs_attr		attr;
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
                   ` (2 preceding siblings ...)
  2024-01-30 19:03 ` [PATCH 4/6] eventfs: remove unused 'd_parent' pointer field Linus Torvalds
@ 2024-01-30 19:03 ` Linus Torvalds
  2024-01-30 20:55   ` Steven Rostedt
                     ` (2 more replies)
  2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
  4 siblings, 3 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel, Linus Torvalds

The eventfs inode had pointers to dentries (and child dentries) without
actually holding a refcount on said pointer.  That is fundamentally
broken, and while eventfs tried to then maintain coherence with dentries
going away by hooking into the '.d_iput' callback, that doesn't actually
work since it's not ordered wrt lookups.

There were two reasonms why eventfs tried to keep a pointer to a dentry:

 - the creation of a 'events' directory would actually have a stable
   dentry pointer that it created with tracefs_start_creating().

   And it needed that dentry when tearing it all down again in
   eventfs_remove_events_dir().

   This use is actually ok, because the special top-level events
   directory dentries are actually stable, not just a temporary cache of
   the eventfs data structures.

 - the 'eventfs_inode' (aka ei) needs to stay around as long as there
   are dentries that refer to it.

   It then used these dentry pointers as a replacement for doing
   reference counting: it would try to make sure that there was only
   ever one dentry associated with an event_inode, and keep a child
   dentry array around to see which dentries might still refer to the
   parent ei.

This gets rid of the invalid dentry pointer use, and renames the one
valid case to a different name to make it clear that it's not just any
random dentry.

The magic child dentry array that is kind of a "reverse reference list"
is simply replaced by having child dentries take a ref to the ei.  As
does the directory dentries.  That makes the broken use case go away.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 245 ++++++++++++---------------------------
 fs/tracefs/internal.h    |   9 +-
 2 files changed, 80 insertions(+), 174 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1d0102bfd7da..a37db0dac302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -62,6 +62,34 @@ enum {
 
 #define EVENTFS_MODE_MASK	(EVENTFS_SAVE_MODE - 1)
 
+/*
+ * eventfs_inode reference count management.
+ *
+ * NOTE! We count only references from dentries, in the
+ * form 'dentry->d_fsdata'. There are also references from
+ * directory inodes ('ti->private'), but the dentry reference
+ * count is always a superset of the inode reference count.
+ */
+static void release_ei(struct kref *ref)
+{
+	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	kfree(ei->entry_attrs);
+	kfree(ei);
+}
+
+static inline void put_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_put(&ei->kref, release_ei);
+}
+
+static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_get(&ei->kref);
+	return ei;
+}
+
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags);
@@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *lookup_file(struct dentry *dentry,
+static struct dentry *lookup_file(struct eventfs_inode *parent_ei,
+				  struct dentry *dentry,
 				  umode_t mode,
 				  struct eventfs_attr *attr,
 				  void *data,
@@ -302,11 +331,11 @@ static struct dentry *lookup_file(struct dentry *dentry,
 		mode |= S_IFREG;
 
 	if (WARN_ON_ONCE(!S_ISREG(mode)))
-		return NULL;
+		return ERR_PTR(-EIO);
 
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
-		return eventfs_failed_creating(dentry);
+		return ERR_PTR(-ENOMEM);
 
 	/* If the user updated the directory's attributes, use them */
 	update_inode_attr(dentry, inode, attr, mode);
@@ -322,9 +351,12 @@ static struct dentry *lookup_file(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
+	// Files have their parent's ei as their fsdata
+	dentry->d_fsdata = get_ei(parent_ei);
+
 	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
-	return eventfs_end_creating(dentry);
+	return NULL;
 };
 
 /**
@@ -343,7 +375,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
-		return eventfs_failed_creating(dentry);
+		return ERR_PTR(-ENOMEM);
 
 	/* If the user updated the directory's attributes, use them */
 	update_inode_attr(dentry, inode, &ei->attr,
@@ -359,24 +391,28 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = ei;
 
-	dentry->d_fsdata = ei;
-        ei->dentry = dentry;	// Remove me!
+	dentry->d_fsdata = get_ei(ei);
 
 	inc_nlink(inode);
 	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
-	return eventfs_end_creating(dentry);
+	return NULL;
 }
 
-static void free_ei(struct eventfs_inode *ei)
+static inline struct eventfs_inode *alloc_ei(const char *name)
 {
-	kfree_const(ei->name);
-	kfree(ei->d_children);
-	kfree(ei->entry_attrs);
-	kfree(ei);
+	int namesize = strlen(name) + 1;
+	struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL);
+
+	if (ei) {
+		memcpy((char *)ei->name, name, namesize);
+		kref_init(&ei->kref);
+	}
+	return ei;
 }
 
+
 /**
  * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
  * @ti: the tracefs_inode of the dentry
@@ -387,38 +423,20 @@ static void free_ei(struct eventfs_inode *ei)
 void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
-	int i;
 
 	mutex_lock(&eventfs_mutex);
-
 	ei = dentry->d_fsdata;
-	if (!ei)
-		goto out;
-
-	/* This could belong to one of the files of the ei */
-	if (ei->dentry != dentry) {
-		for (i = 0; i < ei->nr_entries; i++) {
-			if (ei->d_children[i] == dentry)
-				break;
-		}
-		if (WARN_ON_ONCE(i == ei->nr_entries))
-			goto out;
-		ei->d_children[i] = NULL;
-	} else if (ei->is_freed) {
-		free_ei(ei);
-	} else {
-		ei->dentry = NULL;
+	if (ei) {
+		dentry->d_fsdata = NULL;
+		put_ei(ei);
 	}
-
-	dentry->d_fsdata = NULL;
- out:
 	mutex_unlock(&eventfs_mutex);
 }
 
 /**
  * lookup_file_dentry - create a dentry for a file of an eventfs_inode
  * @ei: the eventfs_inode that the file will be created under
- * @idx: the index into the d_children[] of the @ei
+ * @idx: the index into the entry_attrs[] of the @ei
  * @parent: The parent dentry of the created file.
  * @name: The name of the file to create
  * @mode: The mode of the file.
@@ -435,17 +453,11 @@ lookup_file_dentry(struct dentry *dentry,
 		   const struct file_operations *fops)
 {
 	struct eventfs_attr *attr = NULL;
-	struct dentry **e_dentry = &ei->d_children[idx];
 
 	if (ei->entry_attrs)
 		attr = &ei->entry_attrs[idx];
 
-	dentry->d_fsdata = ei;		// NOTE: ei of _parent_
-	lookup_file(dentry, mode, attr, data, fops);
-
-	*e_dentry = dentry;	// Remove me
-
-	return dentry;
+	return lookup_file(ei, dentry, mode, attr, data, fops);
 }
 
 /**
@@ -466,6 +478,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
 	const char *name = dentry->d_name.name;
+	struct dentry *result;
 
 	ti = get_tracefs(dir);
 	if (!(ti->flags & TRACEFS_EVENT_INODE))
@@ -482,7 +495,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 			continue;
 		if (ei_child->is_freed)
 			goto enoent;
-		lookup_dir_entry(dentry, ei, ei_child);
+		result = lookup_dir_entry(dentry, ei, ei_child);
 		goto out;
 	}
 
@@ -499,17 +512,18 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 		if (entry->callback(name, &mode, &data, &fops) <= 0)
 			goto enoent;
 
-		lookup_file_dentry(dentry, ei, i, mode, data, fops);
+		result = lookup_file_dentry(dentry, ei, i, mode, data, fops);
 		goto out;
 	}
 
  enoent:
 	/* Nothing found? */
 	d_add(dentry, NULL);
+	result = NULL;
 
  out:
 	mutex_unlock(&eventfs_mutex);
-	return NULL;
+	return result;
 }
 
 /*
@@ -659,25 +673,10 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 	if (!parent)
 		return ERR_PTR(-EINVAL);
 
-	ei = kzalloc(sizeof(*ei), GFP_KERNEL);
+	ei = alloc_ei(name);
 	if (!ei)
 		return ERR_PTR(-ENOMEM);
 
-	ei->name = kstrdup_const(name, GFP_KERNEL);
-	if (!ei->name) {
-		kfree(ei);
-		return ERR_PTR(-ENOMEM);
-	}
-
-	if (size) {
-		ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
-		if (!ei->d_children) {
-			kfree_const(ei->name);
-			kfree(ei);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
-
 	ei->entries = entries;
 	ei->nr_entries = size;
 	ei->data = data;
@@ -691,7 +690,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 
 	/* Was the parent freed? */
 	if (list_empty(&ei->list)) {
-		free_ei(ei);
+		put_ei(ei);
 		ei = NULL;
 	}
 	return ei;
@@ -726,28 +725,20 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	if (IS_ERR(dentry))
 		return ERR_CAST(dentry);
 
-	ei = kzalloc(sizeof(*ei), GFP_KERNEL);
+	ei = alloc_ei(name);
 	if (!ei)
-		goto fail_ei;
+		goto fail;
 
 	inode = tracefs_get_inode(dentry->d_sb);
 	if (unlikely(!inode))
 		goto fail;
 
-	if (size) {
-		ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
-		if (!ei->d_children)
-			goto fail;
-	}
-
-	ei->dentry = dentry;
+	// Note: we have a ref to the dentry from tracefs_start_creating()
+	ei->events_dir = dentry;
 	ei->entries = entries;
 	ei->nr_entries = size;
 	ei->is_events = 1;
 	ei->data = data;
-	ei->name = kstrdup_const(name, GFP_KERNEL);
-	if (!ei->name)
-		goto fail;
 
 	/* Save the ownership of this directory */
 	uid = d_inode(dentry->d_parent)->i_uid;
@@ -778,7 +769,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	inode->i_op = &eventfs_root_dir_inode_operations;
 	inode->i_fop = &eventfs_file_operations;
 
-	dentry->d_fsdata = ei;
+	dentry->d_fsdata = get_ei(ei);
 
 	/* directory inodes start off with i_nlink == 2 (for "." entry) */
 	inc_nlink(inode);
@@ -790,72 +781,11 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	return ei;
 
  fail:
-	kfree(ei->d_children);
-	kfree(ei);
- fail_ei:
+	put_ei(ei);
 	tracefs_failed_creating(dentry);
 	return ERR_PTR(-ENOMEM);
 }
 
-static LLIST_HEAD(free_list);
-
-static void eventfs_workfn(struct work_struct *work)
-{
-        struct eventfs_inode *ei, *tmp;
-        struct llist_node *llnode;
-
-	llnode = llist_del_all(&free_list);
-        llist_for_each_entry_safe(ei, tmp, llnode, llist) {
-		/* This dput() matches the dget() from unhook_dentry() */
-		for (int i = 0; i < ei->nr_entries; i++) {
-			if (ei->d_children[i])
-				dput(ei->d_children[i]);
-		}
-		/* This should only get here if it had a dentry */
-		if (!WARN_ON_ONCE(!ei->dentry))
-			dput(ei->dentry);
-        }
-}
-
-static DECLARE_WORK(eventfs_work, eventfs_workfn);
-
-static void free_rcu_ei(struct rcu_head *head)
-{
-	struct eventfs_inode *ei = container_of(head, struct eventfs_inode, rcu);
-
-	if (ei->dentry) {
-		/* Do not free the ei until all references of dentry are gone */
-		if (llist_add(&ei->llist, &free_list))
-			queue_work(system_unbound_wq, &eventfs_work);
-		return;
-	}
-
-	/* If the ei doesn't have a dentry, neither should its children */
-	for (int i = 0; i < ei->nr_entries; i++) {
-		WARN_ON_ONCE(ei->d_children[i]);
-	}
-
-	free_ei(ei);
-}
-
-static void unhook_dentry(struct dentry *dentry)
-{
-	if (!dentry)
-		return;
-	/*
-	 * Need to add a reference to the dentry that is expected by
-	 * simple_recursive_removal(), which will include a dput().
-	 */
-	dget(dentry);
-
-	/*
-	 * Also add a reference for the dput() in eventfs_workfn().
-	 * That is required as that dput() will free the ei after
-	 * the SRCU grace period is over.
-	 */
-	dget(dentry);
-}
-
 /**
  * eventfs_remove_rec - remove eventfs dir or file from list
  * @ei: eventfs_inode to be removed.
@@ -868,8 +798,6 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
 {
 	struct eventfs_inode *ei_child;
 
-	if (!ei)
-		return;
 	/*
 	 * Check recursion depth. It should never be greater than 3:
 	 * 0 - events/
@@ -881,28 +809,12 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
 		return;
 
 	/* search for nested folders or files */
-	list_for_each_entry_srcu(ei_child, &ei->children, list,
-				 lockdep_is_held(&eventfs_mutex)) {
-		/* Children only have dentry if parent does */
-		WARN_ON_ONCE(ei_child->dentry && !ei->dentry);
+	list_for_each_entry(ei_child, &ei->children, list)
 		eventfs_remove_rec(ei_child, level + 1);
-	}
-
 
 	ei->is_freed = 1;
-
-	for (int i = 0; i < ei->nr_entries; i++) {
-		if (ei->d_children[i]) {
-			/* Children only have dentry if parent does */
-			WARN_ON_ONCE(!ei->dentry);
-			unhook_dentry(ei->d_children[i]);
-		}
-	}
-
-	unhook_dentry(ei->dentry);
-
-	list_del_rcu(&ei->list);
-	call_srcu(&eventfs_srcu, &ei->rcu, free_rcu_ei);
+	list_del(&ei->list);
+	put_ei(ei);
 }
 
 /**
@@ -913,22 +825,12 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
  */
 void eventfs_remove_dir(struct eventfs_inode *ei)
 {
-	struct dentry *dentry;
-
 	if (!ei)
 		return;
 
 	mutex_lock(&eventfs_mutex);
-	dentry = ei->dentry;
 	eventfs_remove_rec(ei, 0);
 	mutex_unlock(&eventfs_mutex);
-
-	/*
-	 * If any of the ei children has a dentry, then the ei itself
-	 * must have a dentry.
-	 */
-	if (dentry)
-		simple_recursive_removal(dentry, NULL);
 }
 
 /**
@@ -941,7 +843,11 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 {
 	struct dentry *dentry;
 
-	dentry = ei->dentry;
+	dentry = ei->events_dir;
+	if (!dentry)
+		return;
+
+	ei->events_dir = NULL;
 	eventfs_remove_dir(ei);
 
 	/*
@@ -951,5 +857,6 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
+	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 8f38740bfb5b..72db3bdc4dfb 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -34,8 +34,7 @@ struct eventfs_attr {
  * @entries:	the array of entries representing the files in the directory
  * @name:	the name of the directory to create
  * @children:	link list into the child eventfs_inode
- * @dentry:     the dentry of the directory
- * @d_children: The array of dentries to represent the files when created
+ * @events_dir: the dentry of the events directory
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:	Saved mode and ownership of eventfs_inode itself
  * @data:	The private data to pass to the callbacks
@@ -44,12 +43,11 @@ struct eventfs_attr {
  * @nr_entries: The number of items in @entries
  */
 struct eventfs_inode {
+	struct kref			kref;
 	struct list_head		list;
 	const struct eventfs_entry	*entries;
-	const char			*name;
 	struct list_head		children;
-	struct dentry			*dentry; /* Check is_freed to access */
-	struct dentry			**d_children;
+	struct dentry			*events_dir;
 	struct eventfs_attr		*entry_attrs;
 	struct eventfs_attr		attr;
 	void				*data;
@@ -66,6 +64,7 @@ struct eventfs_inode {
 		struct llist_node	llist;
 		struct rcu_head		rcu;
 	};
+	const char name[];
 };
 
 static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
                   ` (3 preceding siblings ...)
  2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
@ 2024-01-30 19:03 ` Linus Torvalds
  2024-01-30 21:25   ` Steven Rostedt
                     ` (2 more replies)
  4 siblings, 3 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel, Linus Torvalds

In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function.  We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files.  But as it
is, only eventfs dentries use d_fsdata, so we don't really need to split
these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching.  We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get freed
at memory pressure rather than more proactively.  But that's a separate
issue.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/tracefs/event_inode.c | 21 +++++----------------
 fs/tracefs/inode.c       | 27 ++++++++++++++++++---------
 fs/tracefs/internal.h    |  3 ++-
 3 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a37db0dac302..acdc797bd9c0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name)
 
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
-	struct eventfs_inode *ei;
-
-	mutex_lock(&eventfs_mutex);
-	ei = dentry->d_fsdata;
-	if (ei) {
-		dentry->d_fsdata = NULL;
-		put_ei(ei);
-	}
-	mutex_unlock(&eventfs_mutex);
+	put_ei(dentry->d_fsdata);
 }
 
 /**
@@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	}
 
  enoent:
-	/* Nothing found? */
-	d_add(dentry, NULL);
-	result = NULL;
+	/* Don't cache negative lookups, just return an error */
+	result = ERR_PTR(-ENOENT);
 
  out:
 	mutex_unlock(&eventfs_mutex);
@@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
-	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..64122787e5d0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-	struct tracefs_inode *ti;
+	if (dentry->d_fsdata)
+		eventfs_d_release(dentry);
+}
 
-	if (!dentry || !inode)
-		return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
 
-	ti = get_tracefs(inode);
-	if (ti && ti->flags & TRACEFS_EVENT_INODE)
-		eventfs_set_ei_status_free(ti, dentry);
-	iput(inode);
+	return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_d_revalidate,
+	.d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 72db3bdc4dfb..d4194466b643 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+
+void eventfs_d_release(struct dentry *dentry);
 
 #endif /* _TRACEFS_INTERNAL_H */
-- 
2.43.0.5.g38fb137bdb


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
  2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
@ 2024-01-30 19:48   ` Steven Rostedt
  2024-01-30 19:56     ` Steven Rostedt
  2024-01-30 19:49   ` Steven Rostedt
  1 sibling, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 19:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 11:03:51 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The tracefs-specific fields in the inode were not initialized before the
> inode was exposed to others through the dentry with 'd_instantiate()'.
> 
> And the ->flags file was initialized incorrectly with a '|=', when the
> old value was stale.  It should have just been a straight assignment.

The ti is allocated from fs/tracefs/inode.c that has:

static struct inode *tracefs_alloc_inode(struct super_block *sb)
{
	struct tracefs_inode *ti;

	ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL);
	if (!ti)
		return NULL;

	ti->flags = 0;

	return &ti->vfs_inode;
}

Shouldn't that make it valid?

Granted, the eventfs inodes don't have any of the tracefs inode flags set.
But I purposely made he ti->flags initialized to zero.

Is this update really necessary?

Or do I need to make sure that the iput() clears it?

The flags are used by tracefs, so I want to know if there's not a bug there.

-- Steve


> 
> Move the field initializations up to before the d_instantiate, and fix
> the use of uninitialized data.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  fs/tracefs/event_inode.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 2d128bedd654..c0d977e6c0f2 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
>  	inode->i_ino = EVENTFS_FILE_INODE_INO;
>  
>  	ti = get_tracefs(inode);
> -	ti->flags |= TRACEFS_EVENT_INODE;
> +	ti->flags = TRACEFS_EVENT_INODE;
> +	ti->private = NULL;			// Directories have 'ei', files not
> +
>  	d_instantiate(dentry, inode);
>  	fsnotify_create(dentry->d_parent->d_inode, dentry);
>  	return eventfs_end_creating(dentry);
> @@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
>  	inode->i_ino = eventfs_dir_ino(ei);
>  
>  	ti = get_tracefs(inode);
> -	ti->flags |= TRACEFS_EVENT_INODE;
> +	ti->flags = TRACEFS_EVENT_INODE;
> +	ti->private = ei;
>  
>  	inc_nlink(inode);
>  	d_instantiate(dentry, inode);
> @@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
>  static void eventfs_post_create_dir(struct eventfs_inode *ei)
>  {
>  	struct eventfs_inode *ei_child;
> -	struct tracefs_inode *ti;
>  
>  	lockdep_assert_held(&eventfs_mutex);
>  
> @@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei)
>  				 srcu_read_lock_held(&eventfs_srcu)) {
>  		ei_child->d_parent = ei->dentry;
>  	}
> -
> -	ti = get_tracefs(ei->dentry->d_inode);
> -	ti->private = ei;
>  }
>  
>  /**
> @@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
>  	INIT_LIST_HEAD(&ei->list);
>  
>  	ti = get_tracefs(inode);
> -	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
> +	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
>  	ti->private = ei;
>  
>  	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
  2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
  2024-01-30 19:48   ` Steven Rostedt
@ 2024-01-30 19:49   ` Steven Rostedt
  1 sibling, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 19:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 11:03:51 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
>  	inode->i_ino = EVENTFS_FILE_INODE_INO;
>  
>  	ti = get_tracefs(inode);
> -	ti->flags |= TRACEFS_EVENT_INODE;
> +	ti->flags = TRACEFS_EVENT_INODE;
> +	ti->private = NULL;			// Directories have 'ei', files not

Although ti->private does need to be initialized here.

-- Steve

> +
>  	d_instantiate(dentry, inode);
>  	fsnotify_create(dentry->d_parent->d_inode, dentry);
>  	return eventfs_end_creating(dentry);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
  2024-01-30 19:48   ` Steven Rostedt
@ 2024-01-30 19:56     ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 19:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 14:48:02 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
 
> The ti is allocated from fs/tracefs/inode.c that has:
> 
> static struct inode *tracefs_alloc_inode(struct super_block *sb)
> {
> 	struct tracefs_inode *ti;
> 
> 	ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL);

I could also just add __GFP_ZERO so that all of it is initialized to zero,
and then we don't need to assign NULL to any part of it.

-- Steve


> 	if (!ti)
> 		return NULL;
> 
> 	ti->flags = 0;
> 
> 	return &ti->vfs_inode;
> }
> 
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
@ 2024-01-30 20:55   ` Steven Rostedt
  2024-01-30 21:52     ` Linus Torvalds
  2024-01-31  0:48   ` Al Viro
  2024-01-31  5:09   ` Steven Rostedt
  2 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 20:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 11:03:54 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> -static void free_ei(struct eventfs_inode *ei)
> +static inline struct eventfs_inode *alloc_ei(const char *name)
>  {
> -	kfree_const(ei->name);
> -	kfree(ei->d_children);
> -	kfree(ei->entry_attrs);
> -	kfree(ei);
> +	int namesize = strlen(name) + 1;
> +	struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL);
> +
> +	if (ei) {
> +		memcpy((char *)ei->name, name, namesize);
> +		kref_init(&ei->kref);
> +	}

I'm going to be putting back the ei->name pointer as the above actually
adds more memory usage. The reason being is that all static trace events
(not kprobes or other dynamic events) are created with the TRACE_EVENT()
macro that points to a constant name. Passing in a const to kstrdup_const()
checks if the name passed in is a constant or not. If it is, it doesn't
allocate the name and just passes back the pointer.

Removing the pointer to "name" removed 8 bytes on 64 bit machines from the
eventfs_inode structure, but added strlen(name)+1 bytes to it, where the
majority of trace event names are greater than 8 bytes, and are constant
strings.

Another reason for converting back to the way it was is I have moving the
ei allocation to a slab on my TODO list.

-- Steve



> +	return ei;
>  }
>  


> diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
> index 8f38740bfb5b..72db3bdc4dfb 100644
> --- a/fs/tracefs/internal.h
> +++ b/fs/tracefs/internal.h
> @@ -34,8 +34,7 @@ struct eventfs_attr {
>   * @entries:	the array of entries representing the files in the directory
>   * @name:	the name of the directory to create
>   * @children:	link list into the child eventfs_inode
> - * @dentry:     the dentry of the directory
> - * @d_children: The array of dentries to represent the files when created
> + * @events_dir: the dentry of the events directory
>   * @entry_attrs: Saved mode and ownership of the @d_children
>   * @attr:	Saved mode and ownership of eventfs_inode itself
>   * @data:	The private data to pass to the callbacks
> @@ -44,12 +43,11 @@ struct eventfs_attr {
>   * @nr_entries: The number of items in @entries
>   */
>  struct eventfs_inode {
> +	struct kref			kref;
>  	struct list_head		list;
>  	const struct eventfs_entry	*entries;
> -	const char			*name;
>  	struct list_head		children;
> -	struct dentry			*dentry; /* Check is_freed to access */
> -	struct dentry			**d_children;
> +	struct dentry			*events_dir;
>  	struct eventfs_attr		*entry_attrs;
>  	struct eventfs_attr		attr;
>  	void				*data;
> @@ -66,6 +64,7 @@ struct eventfs_inode {
>  		struct llist_node	llist;
>  		struct rcu_head		rcu;
>  	};
> +	const char name[];
>  };
>  
>  static inline struct tracefs_inode *get_tracefs(const struct inode *inode)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
@ 2024-01-30 21:25   ` Steven Rostedt
  2024-01-30 21:36     ` Linus Torvalds
  2024-01-31  1:12   ` Al Viro
  2024-01-31 18:00   ` Steven Rostedt
  2 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 21:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 11:03:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Another thing that might be worth doing is to make all eventfs lookups
> mark their dentries as not worth caching.  We could do that with
> d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.
> 
> As it is, the dentries are all freeable, but they only tend to get freed
> at memory pressure rather than more proactively.  But that's a separate
> issue.

I actually find the dentry's sticking around when their ref counts go to
zero a feature. Tracing applications will open and close the eventfs files
many times. If the dentry were to be deleted on every close, it would need
to be create over again in short spurts.

There's some operations that will traverse the tree many times. One
operation is on reset, as there's some dependencies in the order of
disabling triggers. If there's a dependency, and a trigger is to be
disabled out of order, the disabling will fail with -EBUSY. Thus the tools
will iterate the trigger files several times until it there's no more
changes. It's not a critical path, but I rather not add more overhead to it.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-30 21:25   ` Steven Rostedt
@ 2024-01-30 21:36     ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 21:36 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 13:25, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I actually find the dentry's sticking around when their ref counts go to
> zero a feature. Tracing applications will open and close the eventfs files
> many times. If the dentry were to be deleted on every close, it would need
> to be create over again in short spurts.

Sure. I think this is literally a tuning thing, and we'll see if
anybody ever says "the dentry cache grows too much with tracefs".

               Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 20:55   ` Steven Rostedt
@ 2024-01-30 21:52     ` Linus Torvalds
  2024-01-30 22:56       ` Steven Rostedt
  2024-01-30 22:56       ` Linus Torvalds
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 21:52 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 12:55, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I'm going to be putting back the ei->name pointer as the above actually
> adds more memory usage.

I did it mainly because I hate having multiple different allocation
sites that then have to do that kref_init() etc individually, and once
there was a single site the "name" thing really looked lik ean obvious
simplification.

That said, I think you're confused about the memory usage.

Sure, 'kstrdup_const()' optimizes away the allocation for static
constant strings, but what it does *not* do is to optimize away the
pointer.

In contrast, allocating them together gets rid of the pointer itself,
because now the name is just an offset in the structure.

And the pointer is 8 bytes right there.

So allocating the string _with_ the ei will always save at least 8 bytes.

So whenever the string is less than that in length it's *always* a win.

And even when it's not an absolute win, it will take advantage of the
kmalloc() quantized sizes too, and generally not be a loss even with
longer names.

So I'm pretty sure you are simply entirely wrong on the memory usage.
Not counting the size of the pointer is overlooking a big piece of the
puzzle.

Btw, you can look at name lengths in tracefs with something stupid like this:

    find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c

and you will see that (at least in my case) names of lengths 1..7 are
dominating it all:

      1 .
   2189 ..
     34 ...
   2229 ....
    207 .....
   6833 ......
   2211 .......

with the rest being insignificant in comparison.

The reason? There's a *lot* of those 'filter' and 'enable' and 'id'
files. All of which are better off without a 'const char *name' taking
8 bytes.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 21:52     ` Linus Torvalds
@ 2024-01-30 22:56       ` Steven Rostedt
  2024-01-30 22:58         ` Linus Torvalds
  2024-01-30 22:56       ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 22:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 13:52:05 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 12:55, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I'm going to be putting back the ei->name pointer as the above actually
> > adds more memory usage.  
> 
> I did it mainly because I hate having multiple different allocation
> sites that then have to do that kref_init() etc individually, and once
> there was a single site the "name" thing really looked lik ean obvious
> simplification.
> 
> That said, I think you're confused about the memory usage.
> 
> Sure, 'kstrdup_const()' optimizes away the allocation for static
> constant strings, but what it does *not* do is to optimize away the
> pointer.
> 
> In contrast, allocating them together gets rid of the pointer itself,
> because now the name is just an offset in the structure.
> 
> And the pointer is 8 bytes right there.
> 
> So allocating the string _with_ the ei will always save at least 8 bytes.

Not if the string is not allocated, and ei->name is just pointing to an
existing string. Which is what kstrdup_const() does if the string is a
constant.

If the length of the string was one byte, so you allocate sizeof(*ei) + 2
(for the '\0' as well), is kzalloc() going to allocate anything less than 8
byte alignment? According to /proc/slabinfo, any allocation will be 8
bytes. Thus it is the same as having the pointer if the string is constant
but less than 8 bytes in length. But a waste if it is more.

> 
> So whenever the string is less than that in length it's *always* a win.
> 
> And even when it's not an absolute win, it will take advantage of the
> kmalloc() quantized sizes too, and generally not be a loss even with
> longer names.
> 
> So I'm pretty sure you are simply entirely wrong on the memory usage.
> Not counting the size of the pointer is overlooking a big piece of the
> puzzle.
> 
> Btw, you can look at name lengths in tracefs with something stupid like this:
> 
>     find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c
> 
> and you will see that (at least in my case) names of lengths 1..7 are
> dominating it all:
> 
>       1 .
>    2189 ..
>      34 ...
>    2229 ....
>     207 .....
>    6833 ......
>    2211 .......
> 
> with the rest being insignificant in comparison.
> 
> The reason? There's a *lot* of those 'filter' and 'enable' and 'id'
> files. All of which are better off without a 'const char *name' taking
> 8 bytes.

Remember, the files don't have an ei allocated. Only directories.

 # find . -type d | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c
      1 .
      2 ..
     30 ...
     14 ....
     21 .....
     15 ......
     18 .......

Above is 8 bytes (length of 7 + '\0')

     27 ........
     36 .........
     35 ..........
     49 ...........
     63 ............
     85 .............
    140 ..............
    180 ...............
    180 ................
    155 .................
    171 ..................
    158 ...................
    134 ....................
    116 .....................
    106 ......................
     81 .......................
     61 ........................
     58 .........................
     43 ..........................
     41 ...........................
     23 ............................
     14 .............................
     13 ..............................
      9 ...............................
      7 ................................
      6 .................................
      5 ..................................
      1 ...................................
      2 ....................................
      1 .....................................
      1 ........................................

The rest is all greater than the pointer size.

Also, it does prevent me from switching to a slab.

I added the below patch (after changing this back to kstrdup_const), that
determines if the string is allocated by checking if ei->name is the same
as the pointer passed in. This showed me:

 # cat /sys/kernel/tracing/dyn_ftrace_total_info
53698 pages:220 groups: 8 const cnt:2099 len:22446 dyn cnt: 3 len:56

That is, 2099 ei->name's point to a constant string without allocation.
Granted, I built my system without many modules. If you have more code
built as modules, that number may be less.

I then counted the length of the string - 7 bytes (which is the same as the
length of the string plus the '\0' character - 8 bytes) to accommodate the
pointer size, and it's a savings of 22KB per instance. And in actuality, I
should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to
return 35 bytes back but 64 bytes will be allocated.

I think that's a win.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index e9819d719d2a..ed9ffaac9a32 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -785,6 +785,25 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 	goto out;
 }
 
+int sdr_ei_const;
+int sdr_ei_len;
+int sdr_ei_dyn;
+int sdr_ei_dyn_len;
+
+static void sdr_count(struct eventfs_inode *ei, const char *name)
+{
+	int len = strlen(name);
+
+	if (ei->name == name) {
+		sdr_ei_const++;
+		if (len > 7)
+			sdr_ei_len += len - 7;
+	} else {
+		sdr_ei_dyn++;
+		sdr_ei_dyn_len += len;
+	}
+}
+
 /**
  * eventfs_create_dir - Create the eventfs_inode for this directory
  * @name: The name of the directory to create.
@@ -838,6 +857,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 		kfree(ei);
 		return ERR_PTR(-ENOMEM);
 	}
+	sdr_count(ei, name);
 
 	if (size) {
 		ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
@@ -920,6 +940,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	ei->name = kstrdup_const(name, GFP_KERNEL);
 	if (!ei->name)
 		goto fail;
+	sdr_count(ei, name);
 
 	/* Save the ownership of this directory */
 	uid = d_inode(dentry->d_parent)->i_uid;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..3b254ec4d831 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8701,6 +8701,11 @@ static const struct file_operations tracing_stats_fops = {
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 
+extern int sdr_ei_const;
+extern int sdr_ei_len;
+extern int sdr_ei_dyn;
+extern int sdr_ei_dyn_len;
+
 static ssize_t
 tracing_read_dyn_info(struct file *filp, char __user *ubuf,
 		  size_t cnt, loff_t *ppos)
@@ -8714,10 +8719,11 @@ tracing_read_dyn_info(struct file *filp, char __user *ubuf,
 	if (!buf)
 		return -ENOMEM;
 
-	r = scnprintf(buf, 256, "%ld pages:%ld groups: %ld\n",
+	r = scnprintf(buf, 256, "%ld pages:%ld groups: %ld const cnt:%d len:%d dyn cnt: %d len:%d\n",
 		      ftrace_update_tot_cnt,
 		      ftrace_number_of_pages,
-		      ftrace_number_of_groups);
+		      ftrace_number_of_groups,
+		      sdr_ei_const, sdr_ei_len, sdr_ei_dyn, sdr_ei_dyn_len);
 
 	ret = simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 	kfree(buf);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 21:52     ` Linus Torvalds
  2024-01-30 22:56       ` Steven Rostedt
@ 2024-01-30 22:56       ` Linus Torvalds
  2024-01-30 23:06         ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 22:56 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 13:52, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Btw, you can look at name lengths in tracefs with something stupid like this:
>
>     find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c

Actually, since only directories have these 'ei' fields, that find
should have "-type d", and then the stats change.

Most directory filenames there are indeed longer than 8 bytes (16
bytes seems to be the most common length).

That means that whether it wins or loses in allocation size tends to
be mostly about the kmalloc sizes and the size of that structure.

And it turns out that the real memory savings there would be this patch

  --- a/fs/tracefs/internal.h
  +++ b/fs/tracefs/internal.h
  @@ -55,15 +55,6 @@ struct eventfs_inode {
        unsigned int                    is_events:1;
        unsigned int                    nr_entries:30;
        unsigned int                    ino;
  -     /*
  -      * Union - used for deletion
  -      * @llist:      for calling dput() if needed after RCU
  -      * @rcu:        eventfs_inode to delete in RCU
  -      */
  -     union {
  -             struct llist_node       llist;
  -             struct rcu_head         rcu;
  -     };
        const char name[];
   };

since with all the other cleanups, neither of those fields are actually used.

With that, the base size of 'struct eventfs_inode' actually becomes 96
bytes for me.

And at least on x86-64, the kmalloc sizes are 96 and then 128 bytes.

But that means that

 - with the added name pointer, the eventfs_inode would grow to 104
bytes, and grow to that next allocation size (128)

 - with the embedded name, the size is 96+strlen+1, and will also need
at least a 128 byte allocation, but any name that is shorter than 32
bytes is free

so it ends up being a wash. You're going to allocate 128 bytes
regardless. But the deallocation is simpler.

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 22:56       ` Steven Rostedt
@ 2024-01-30 22:58         ` Linus Torvalds
  2024-01-30 23:04           ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 22:58 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 14:55, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Remember, the files don't have an ei allocated. Only directories.

Crossed emails.

> I then counted the length of the string - 7 bytes (which is the same as the
> length of the string plus the '\0' character - 8 bytes) to accommodate the
> pointer size, and it's a savings of 22KB per instance. And in actuality, I
> should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to
> return 35 bytes back but 64 bytes will be allocated.

No. See my other email. The kmalloc sizes actually means that it comes
out exactly even.

             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 22:58         ` Linus Torvalds
@ 2024-01-30 23:04           ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 23:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 14:58:26 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 14:55, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Remember, the files don't have an ei allocated. Only directories.  
> 
> Crossed emails.
> 
> > I then counted the length of the string - 7 bytes (which is the same as the
> > length of the string plus the '\0' character - 8 bytes) to accommodate the
> > pointer size, and it's a savings of 22KB per instance. And in actuality, I
> > should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to
> > return 35 bytes back but 64 bytes will be allocated.  
> 
> No. See my other email. The kmalloc sizes actually means that it comes
> out exactly even.
>

But that wouldn't be the case if I switched over to allocating as its own
slab, as it would make it better condensed.

But, I can keep it your way until I do that, as the biggest savings I
needed was getting rid of all the file meta-data as that was what took up
10s of megabytes.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 22:56       ` Linus Torvalds
@ 2024-01-30 23:06         ` Linus Torvalds
  2024-01-30 23:10           ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 23:06 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 14:56, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> With that, the base size of 'struct eventfs_inode' actually becomes 96
> bytes for me.

It can be shrunk some more.

The field ordering is suboptimal. Pointers are 8 bytes and 8-byte
aligned, but 'struct kref' is just 4 bytes, and 'struct eventfs_attr'
is 12 bytes and 4-byte aligned.

So if you pack all the 8-byte-aligned fields at the beginning, and the
4-byte-aligned ones at the end, you get 88 bytes.

At which point a name pointer would *just* fit in 96 bytes.

...  and then some debug option is enabled, and it all goes to hell again.

              Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 23:06         ` Linus Torvalds
@ 2024-01-30 23:10           ` Steven Rostedt
  2024-01-30 23:25             ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 23:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 15:06:13 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 14:56, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > With that, the base size of 'struct eventfs_inode' actually becomes 96
> > bytes for me.  
> 
> It can be shrunk some more.
> 
> The field ordering is suboptimal. Pointers are 8 bytes and 8-byte
> aligned, but 'struct kref' is just 4 bytes, and 'struct eventfs_attr'
> is 12 bytes and 4-byte aligned.
> 
> So if you pack all the 8-byte-aligned fields at the beginning, and the
> 4-byte-aligned ones at the end, you get 88 bytes.
> 
> At which point a name pointer would *just* fit in 96 bytes.

Does that mean I should keep the kstrdup_const()?

> 
> ...  and then some debug option is enabled, and it all goes to hell again.

Heh, I'm not too concerned about debug options. As anyone concerned about
memory size should have already have audited their .config to turn off
anything they don't need.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 23:10           ` Steven Rostedt
@ 2024-01-30 23:25             ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-30 23:25 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 15:10, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > At which point a name pointer would *just* fit in 96 bytes.
>
> Does that mean I should keep the kstrdup_const()?

You should check that my math matches reality and relevant
configurations, but yes, at 96 bytes that should fit exactly in one
slab entry on both x86-64 and arm64 (now that arm finally does sane
kmalloc sizes - for the longest time they were all 128-byte aligned
due to historical horrible DMA coherency issues).

But you might also add a comment to the eventfs_inode definition,
because if that ever changes, the math changes again. For example,
adding just one more attribute data would make it all fall apart.

If you *really* want to optimize that data structure, I think you can
make the child list be a 'hlist'. That makes the list head ('children'
becomes a 'hlist_head') smaller, even if the list entry ('list'
becomes a 'hlist_node') stays the same size.

             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] tracefs: dentry lookup crapectomy
  2024-01-30 19:03 ` [PATCH 3/6] tracefs: dentry lookup crapectomy Linus Torvalds
@ 2024-01-30 23:26   ` Al Viro
  2024-01-30 23:55     ` Steven Rostedt
  2024-01-31  0:23   ` Al Viro
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-01-30 23:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, Jan 30, 2024 at 11:03:52AM -0800, Linus Torvalds wrote:
> The dentry lookup for eventfs files was very broken, and had lots of
> signs of the old situation where the filesystem names were all created
> statically in the dentry tree, rather than being looked up dynamically
> based on the eventfs data structures.
> 
> You could see it in the naming - how it claimed to "create" dentries
> rather than just look up the dentries that were given it.
> 
> You could see it in various nonsensical and very incorrect operations,
> like using "simple_lookup()" on the dentries that were passed in, which
> only results in those dentries becoming negative dentries.  Which meant
> that any other lookup would possibly return ENOENT if it saw that
> negative dentry before the data rwas then later filled in.
> 
> You could see it in the immesnse amount of nonsensical code that didn't
> actually just do lookups.

> -static struct dentry *create_file(const char *name, umode_t mode,
> +static struct dentry *lookup_file(struct dentry *dentry,
> +				  umode_t mode,
>  				  struct eventfs_attr *attr,
> -				  struct dentry *parent, void *data,
> +				  void *data,
>  				  const struct file_operations *fop)
>  {
>  	struct tracefs_inode *ti;
> -	struct dentry *dentry;
>  	struct inode *inode;
>  
>  	if (!(mode & S_IFMT))
> @@ -307,12 +304,6 @@ static struct dentry *create_file(const char *name, umode_t mode,
>  	if (WARN_ON_ONCE(!S_ISREG(mode)))
>  		return NULL;
>  
> -	WARN_ON_ONCE(!parent);
> -	dentry = eventfs_start_creating(name, parent);

Used to lock the inode of parent.

>  	if (unlikely(!inode))
>  		return eventfs_failed_creating(dentry);

... and that still unlocks it.

> @@ -331,29 +322,25 @@ static struct dentry *create_file(const char *name, umode_t mode,
>  	ti->flags = TRACEFS_EVENT_INODE;
>  	ti->private = NULL;			// Directories have 'ei', files not
>  
> -	d_instantiate(dentry, inode);
> +	d_add(dentry, inode);
>  	fsnotify_create(dentry->d_parent->d_inode, dentry);
>  	return eventfs_end_creating(dentry);

... and so does this.

>  };

Where has that inode_lock() gone and how could that possibly work?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] tracefs: dentry lookup crapectomy
  2024-01-30 23:26   ` Al Viro
@ 2024-01-30 23:55     ` Steven Rostedt
  2024-01-31  0:07       ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-30 23:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 23:26:21 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Tue, Jan 30, 2024 at 11:03:52AM -0800, Linus Torvalds wrote:
> > The dentry lookup for eventfs files was very broken, and had lots of
> > signs of the old situation where the filesystem names were all created
> > statically in the dentry tree, rather than being looked up dynamically
> > based on the eventfs data structures.
> > 
> > You could see it in the naming - how it claimed to "create" dentries
> > rather than just look up the dentries that were given it.
> > 
> > You could see it in various nonsensical and very incorrect operations,
> > like using "simple_lookup()" on the dentries that were passed in, which
> > only results in those dentries becoming negative dentries.  Which meant
> > that any other lookup would possibly return ENOENT if it saw that
> > negative dentry before the data rwas then later filled in.
> > 
> > You could see it in the immesnse amount of nonsensical code that didn't
> > actually just do lookups.  
> 
> > -static struct dentry *create_file(const char *name, umode_t mode,
> > +static struct dentry *lookup_file(struct dentry *dentry,
> > +				  umode_t mode,
> >  				  struct eventfs_attr *attr,
> > -				  struct dentry *parent, void *data,
> > +				  void *data,
> >  				  const struct file_operations *fop)
> >  {
> >  	struct tracefs_inode *ti;
> > -	struct dentry *dentry;
> >  	struct inode *inode;
> >  
> >  	if (!(mode & S_IFMT))
> > @@ -307,12 +304,6 @@ static struct dentry *create_file(const char *name, umode_t mode,
> >  	if (WARN_ON_ONCE(!S_ISREG(mode)))
> >  		return NULL;
> >  
> > -	WARN_ON_ONCE(!parent);
> > -	dentry = eventfs_start_creating(name, parent);  
> 
> Used to lock the inode of parent.

Actually it's the tracefs_start_creating() locks the inode, the
eventfs_start_creating() doesn't.

-- Steve


> 
> >  	if (unlikely(!inode))
> >  		return eventfs_failed_creating(dentry);  
> 
> ... and that still unlocks it.
> 
> > @@ -331,29 +322,25 @@ static struct dentry *create_file(const char *name, umode_t mode,
> >  	ti->flags = TRACEFS_EVENT_INODE;
> >  	ti->private = NULL;			// Directories have 'ei', files not
> >  
> > -	d_instantiate(dentry, inode);
> > +	d_add(dentry, inode);
> >  	fsnotify_create(dentry->d_parent->d_inode, dentry);
> >  	return eventfs_end_creating(dentry);  
> 
> ... and so does this.
> 
> >  };  
> 
> Where has that inode_lock() gone and how could that possibly work?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] tracefs: dentry lookup crapectomy
  2024-01-30 23:55     ` Steven Rostedt
@ 2024-01-31  0:07       ` Al Viro
  0 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2024-01-31  0:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, Jan 30, 2024 at 06:55:36PM -0500, Steven Rostedt wrote:

> Actually it's the tracefs_start_creating() locks the inode, the
> eventfs_start_creating() doesn't.

Right.

> > 
> > >  	if (unlikely(!inode))
> > >  		return eventfs_failed_creating(dentry);  
> > 
> > ... and that still unlocks it.

This is still bogus, though - both the stray dput() and
dropping a reference to internal mount.

struct dentry *eventfs_failed_creating(struct dentry *dentry)
{
        dput(dentry);
	simple_release_fs(&tracefs_mount, &tracefs_mount_count);
	return NULL;
}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] tracefs: dentry lookup crapectomy
  2024-01-30 19:03 ` [PATCH 3/6] tracefs: dentry lookup crapectomy Linus Torvalds
  2024-01-30 23:26   ` Al Viro
@ 2024-01-31  0:23   ` Al Viro
  2024-01-31  0:39     ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-01-31  0:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, Jan 30, 2024 at 11:03:52AM -0800, Linus Torvalds wrote:
>  {
>  	struct eventfs_inode *ei;
>  
> -	mutex_lock(&eventfs_mutex);
>  	do {
>  		// The parent is stable because we do not do renames
>  		dentry = dentry->d_parent;
> @@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
>  		}
>  		// Walk upwards until you find the events inode
>  	} while (!ei->is_events);
> -	mutex_unlock(&eventfs_mutex);

Unless I'm missing something, you've just lost exclusion with
removals (not that the original hadn't been suspicious in that
respect - what's to protect ei past that mutex_unlock?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] tracefs: dentry lookup crapectomy
  2024-01-31  0:23   ` Al Viro
@ 2024-01-31  0:39     ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-01-31  0:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 16:23, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Jan 30, 2024 at 11:03:52AM -0800, Linus Torvalds wrote:
> >  {
> >       struct eventfs_inode *ei;
> >
> > -     mutex_lock(&eventfs_mutex);
> >       do {
> >               // The parent is stable because we do not do renames
> >               dentry = dentry->d_parent;
> > @@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
> >               }
> >               // Walk upwards until you find the events inode
> >       } while (!ei->is_events);
> > -     mutex_unlock(&eventfs_mutex);
>
> Unless I'm missing something, you've just lost exclusion with
> removals (not that the original hadn't been suspicious in that
> respect - what's to protect ei past that mutex_unlock?

No, the mutex is actually taken up the call chain, and removing it
here is fixing a deadlock.

Ie we have the mutex taken in eventfs_root_lookup(), and then that
goes either through lookup_dir_entry() or lookup_file_dentry(), before
it gets to update_inode_attr().

That series is not very clean, I'm afraid.

          Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
  2024-01-30 20:55   ` Steven Rostedt
@ 2024-01-31  0:48   ` Al Viro
  2024-01-31  5:09   ` Steven Rostedt
  2 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2024-01-31  0:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, Jan 30, 2024 at 11:03:54AM -0800, Linus Torvalds wrote:

>  	inode = tracefs_get_inode(dentry->d_sb);
>  	if (unlikely(!inode))
> -		return eventfs_failed_creating(dentry);
> +		return ERR_PTR(-ENOMEM);

That belongs in the lookup crapectomy patch - bisect hazard from stray dput().
Same for similar chunks below.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
  2024-01-30 21:25   ` Steven Rostedt
@ 2024-01-31  1:12   ` Al Viro
  2024-01-31  2:37     ` Linus Torvalds
  2024-01-31 18:00   ` Steven Rostedt
  2 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-01-31  1:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, Jan 30, 2024 at 11:03:55AM -0800, Linus Torvalds wrote:

> +void eventfs_d_release(struct dentry *dentry)
>  {
> -	struct eventfs_inode *ei;
> -
> -	mutex_lock(&eventfs_mutex);
> -	ei = dentry->d_fsdata;
> -	if (ei) {
> -		dentry->d_fsdata = NULL;
> -		put_ei(ei);
> -	}
> -	mutex_unlock(&eventfs_mutex);
> +	put_ei(dentry->d_fsdata);
>  }

I'd rather pass ->d_fsdata to that sucker (or exposed put_ei(),
for that matter).

> @@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
>  	 * sticks around while the other ei->dentry are created
>  	 * and destroyed dynamically.
>  	 */
> -	simple_recursive_removal(dentry, NULL);

That also needs to move earlier in the series - bisect hazard.

> + *
> + * Note that d_revalidate is called potentially under RCU,
> + * so it can't take the eventfs mutex etc. It's fine - if
> + * we open a file just as it's marked dead, things will
> + * still work just fine, and just see the old stale case.

Looks like use after free, unless freeing ei is RCU-delayed...

> +	return !(ei && ei->is_freed);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-31  1:12   ` Al Viro
@ 2024-01-31  2:37     ` Linus Torvalds
  2024-01-31  2:46       ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-31  2:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 17:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > + *
> > + * Note that d_revalidate is called potentially under RCU,
> > + * so it can't take the eventfs mutex etc. It's fine - if
> > + * we open a file just as it's marked dead, things will
> > + * still work just fine, and just see the old stale case.
>
> Looks like use after free, unless freeing ei is RCU-delayed...

We hold the ref to the ei in the very dentry that is doing d_revalidate().

So it should be fine. The race is with eventfs marking the ei
'is_freed' (under the mutex that we don't hold here), but when that
happens and we end up still using the dentry, the ei is still there,
all the operations are just going to fail.

             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-31  2:37     ` Linus Torvalds
@ 2024-01-31  2:46       ` Al Viro
  2024-01-31  3:39         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Al Viro @ 2024-01-31  2:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, Jan 30, 2024 at 06:37:49PM -0800, Linus Torvalds wrote:
> On Tue, 30 Jan 2024 at 17:12, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > > + *
> > > + * Note that d_revalidate is called potentially under RCU,
> > > + * so it can't take the eventfs mutex etc. It's fine - if
> > > + * we open a file just as it's marked dead, things will
> > > + * still work just fine, and just see the old stale case.
> >
> > Looks like use after free, unless freeing ei is RCU-delayed...
> 
> We hold the ref to the ei in the very dentry that is doing d_revalidate().

What's to stop ->d_revalidate() from being called in parallel with
__dentry_kill()?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-31  2:46       ` Al Viro
@ 2024-01-31  3:39         ` Linus Torvalds
  2024-01-31  4:28           ` Al Viro
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-31  3:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 18:46, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What's to stop ->d_revalidate() from being called in parallel with
> __dentry_kill()?

Oh, you're right.

For some reason I thought we did the d_release() _after_ the RCU grace
period, but we don't.

Why don't we, btw? It would be so much better if we did the
d_release() from __d_free(). We have all that smarts in fs/dcache.c to
decide if we need to RCU-delay it or not, and then we don't let
filesystems use it.

I assume the reason is that some 'd_delete' cases might want to sleep
etc. Still, for things like this that just want to release memory, it
would be *much* better to have d_release called when the dentry is
actually released.

Hmm. Not very many d_delete cases, but I did see a couple that
definitely want process context (dma_buf_release goes to things that
do vfree() etc).

So I guess the "make low-level filesystems do their own kfree_rcu() is
what we're doing.

In this case it's as simple as doing that

-       kfree(ei);
+       kfree_rcu(ei, rcu);

and we'd just make the rcu entry a union with something that isn't
that 'is_freed' field so that it doesn't take more space.

             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-31  3:39         ` Linus Torvalds
@ 2024-01-31  4:28           ` Al Viro
  0 siblings, 0 replies; 43+ messages in thread
From: Al Viro @ 2024-01-31  4:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, Jan 30, 2024 at 07:39:55PM -0800, Linus Torvalds wrote:

> Why don't we, btw? It would be so much better if we did the
> d_release() from __d_free(). We have all that smarts in fs/dcache.c to
> decide if we need to RCU-delay it or not, and then we don't let
> filesystems use it.

Because we want to give filesystems a chance to do something useful
in it?  Something that might need access to the parent.  Or superblock,
for that matter...  By the time we are past the RCU delay there's
very little left; if you want to look at some per-superblock data,
you would have to do something like putting it into a refcounted
structure, with each dentry holding a counting reference, with obvious
effects...

We could add a separate method just for freeing stuff, but... we already
have 4 of them related to that path (->d_delete(), ->d_prune(), ->d_iput(),
->d_release()) and the things are pretty confusing as it is, without
throwing another one into the mix.

I'll look through the RCU pathwalk fixes (will repost the rebased set in
a couple of days, need to finish the re-audit of that stuff) and see how
much would such a method simplify, but I don't think it would buy us
a lot.

> So I guess the "make low-level filesystems do their own kfree_rcu() is
> what we're doing.
> 
> In this case it's as simple as doing that
> 
> -       kfree(ei);
> +       kfree_rcu(ei, rcu);
> 
> and we'd just make the rcu entry a union with something that isn't
> that 'is_freed' field so that it doesn't take more space.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
  2024-01-30 20:55   ` Steven Rostedt
  2024-01-31  0:48   ` Al Viro
@ 2024-01-31  5:09   ` Steven Rostedt
  2024-01-31  5:25     ` Linus Torvalds
  2 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-31  5:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 11:03:54 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> +/*
> + * eventfs_inode reference count management.
> + *
> + * NOTE! We count only references from dentries, in the
> + * form 'dentry->d_fsdata'. There are also references from
> + * directory inodes ('ti->private'), but the dentry reference
> + * count is always a superset of the inode reference count.
> + */
> +static void release_ei(struct kref *ref)
> +{
> +	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
> +	kfree(ei->entry_attrs);

I did modify this to add back the kfree_const(ei->name);

I would also like to add a:

	WARN_ON_ONCE(!ei->is_freed);

The kref_put() logic should add the protection that this is guaranteed to
be true as the logic in the removal does:

	ei->is_freed = 1;
  [..]
	kref_put(ei);

I would think that the last "put" would always have the "is_freed" set. The
WARN_ON is to catch things doing too many put_ei().


> +	kfree(ei);
> +}
> +


> @@ -951,5 +857,6 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
>  	 * sticks around while the other ei->dentry are created
>  	 * and destroyed dynamically.
>  	 */
> +	simple_recursive_removal(dentry, NULL);

Actually, this doesn't work. It expects all the dentries in the list to
have positive ref counts, as it does dput() on them. If there's any cached,
it will bug with a dput() on a dentry of zero.

The only dentries that should have non zero ref counts are ones that are
currently being accessed and wont go away until finished. I think all we
need to do is invalidate the top dentry. Is that true?

Does this work:

	d_invalidate(dentry);

to make the directory no longer accessible. And all dentries within it
should be zero, and hopefully go away on memory reclaim. The last patch
removes it, but if you want to test the deletion, you can do this:

 # cd /sys/kernel/tracing
 # mkdir instances/foo
 # ls instances/foo/events
 # rmdir instances/foo


But also note that this patch fails with the "ghost" files with the kprobe
test again. When I apply patch 6, that goes away. I'm guessing that's
because this needs the d_revalidate() call. To not break bisection, I think
we need to merge this and patch 6 together.

Also patch 6 removes the simple_recursive_removal() which without at least
the d_invalidate() causes a dput splat. That's because the rmdir calls
tracefs_remove() which calls simple_recursive_removal() that will walk to
the events directory and I'm assuming if it hasn't been invalidated, it
walks into it causing the same issue as a simple_recursive_removal() would
have here.

Try the above mkdir and rmdir to see.

-- Steve


>  	dput(dentry);
>  }

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31  5:09   ` Steven Rostedt
@ 2024-01-31  5:25     ` Linus Torvalds
  2024-01-31  5:33       ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-31  5:25 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 21:09, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I would think that the last "put" would always have the "is_freed" set. The
> WARN_ON is to catch things doing too many put_ei().

Yes, that looks sane to me.

> > +     simple_recursive_removal(dentry, NULL);
>
> Actually, this doesn't work.

Yes, note how the next patch just removed it entirely.


> Does this work:
>
>         d_invalidate(dentry);

It does, but it's basically irrelevant with the d_revalidate approach.

Basically, once you have d_revalidate(), the unhashing happens there,
and it's just extra work and pointless to do it elsewhere.

So if you look at the "clean up dentry ops and add revalidate
function" patch, you'll see that it just does

-       simple_recursive_removal(dentry, NULL);

and the thing is just history.

So really, that final patch is the one that fixes the whole eventfs
mess for good (knock wood). But you can't do it first, because it
basically depends on all the refcount fixes.

It might be possible to re-organize the patches so that the refcount
changes go first, then the d_revalidate(), and then the rest. But I
suspect they all really end up depending on each other some way,
because the basic issue was that the whole "keep unrefcounted dentry
pointers around" was just wrong.  So it doesn't end up right until
it's _all_ fixed, because every step of the way exposes some problem.

At least that was my experience. Fix one thing, and it exposes the
hack that another thing depended on.

This is actually something that Al is a master at. You sometimes see
him send one big complicated patch where he talks about all the
problems in some area and it's one huge "fix up everything patch" that
looks very scary.

And then a week later he sends a series of 19 patches that all make
sense and all look "obvious" and all make small progress.

And magically they end up matching that big cleanup patch in the end.
And you just *know* that it didn't start out as that beautiful logical
series, because you saw the big messy patch first...

            Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31  5:25     ` Linus Torvalds
@ 2024-01-31  5:33       ` Steven Rostedt
  2024-01-31  5:57         ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-31  5:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 21:25:30 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Does this work:
> >
> >         d_invalidate(dentry);  
> 
> It does, but it's basically irrelevant with the d_revalidate approach.
> 
> Basically, once you have d_revalidate(), the unhashing happens there,
> and it's just extra work and pointless to do it elsewhere.
> 
> So if you look at the "clean up dentry ops and add revalidate
> function" patch, you'll see that it just does
> 
> -       simple_recursive_removal(dentry, NULL);
> 
> and the thing is just history.

With even the last patch included, without the d_invalidate() I get errors
with simply doing:

 # cd /sys/kernel/tracing
 # mkdir instances/foo
 # ls instances/foo/events
 # rmdir instances/foo

As the rmdir calls tracefs_remove() that calls simple_recursive_removal()
that then walks into the "events" directory. Without that d_invalidate, it
walks beyond just the top directory and then splats on the dentries that
are cached.


> 
> So really, that final patch is the one that fixes the whole eventfs
> mess for good (knock wood). But you can't do it first, because it
> basically depends on all the refcount fixes.

I'm running my full suite with the final patch included, plus some of the
updates I mentioned in replies to other patches, as well as including this
"d_invalidate()" as it doesn't pass without it.

> 
> It might be possible to re-organize the patches so that the refcount
> changes go first, then the d_revalidate(), and then the rest. But I
> suspect they all really end up depending on each other some way,
> because the basic issue was that the whole "keep unrefcounted dentry
> pointers around" was just wrong.  So it doesn't end up right until
> it's _all_ fixed, because every step of the way exposes some problem.
> 
> At least that was my experience. Fix one thing, and it exposes the
> hack that another thing depended on.
> 
> This is actually something that Al is a master at. You sometimes see
> him send one big complicated patch where he talks about all the
> problems in some area and it's one huge "fix up everything patch" that
> looks very scary.
> 
> And then a week later he sends a series of 19 patches that all make
> sense and all look "obvious" and all make small progress.
> 
> And magically they end up matching that big cleanup patch in the end.
> And you just *know* that it didn't start out as that beautiful logical
> series, because you saw the big messy patch first...

I'll take a look at breaking the patches up further, as I now have a much
better understanding of dentries then I did before this discussion.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31  5:33       ` Steven Rostedt
@ 2024-01-31  5:57         ` Linus Torvalds
  2024-01-31  6:20           ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-31  5:57 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 21:33, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> With even the last patch included, without the d_invalidate() I get errors
> with simply doing:
>
>  # cd /sys/kernel/tracing
>  # mkdir instances/foo
>  # ls instances/foo/events
>  # rmdir instances/foo
>
> As the rmdir calls tracefs_remove() that calls simple_recursive_removal()
> that then walks into the "events" directory. Without that d_invalidate, it
> walks beyond just the top directory and then splats on the dentries that
> are cached.

Ugh.

This is only an issue for that "events" dir, right? The one that has a
proper refcount on the dentry in 'ei->events_dir'?

Because yes, then doing d_invalidate() looks like the right thing.

           Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31  5:57         ` Linus Torvalds
@ 2024-01-31  6:20           ` Linus Torvalds
  2024-01-31 12:57             ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-31  6:20 UTC (permalink / raw)
  To: Steven Rostedt, Al Viro
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 at 21:57, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ugh.

Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how
it does that "unlock and re-lock inode" thing.

It's a disease, and no, it's not an acceptable response to "lockdep
shows there's a locking problem".

The comment says "It is up to the individual mkdir routine to handle
races" but that's *completely* bogus and shows a fundamental
misunderstanding of locking.

Dropping the lock in the middle of a locked section does NOT affect
just the section that you dropped the lock around.

It affects the *caller*, who took the lock in the first place, and who
may well assume that the lock protects things for the whole duration
of the lock.

And so dropping the lock in the middle of an operation is a bad BAD
*BAD* thing to do.

Honestly, I had only been looking at the eventfs_inode.c lookup code.
But now that I started looking at mkdir/rmdir, I'm seeing the same
signs of horrible mistakes there too.

And yes, it might be a real problem. For example, for the rmdir case,
the actual lock was taken by 'vfs_rmdir()', and it does *not* protect
only the ->rmdir() call itself.

It also, for example, is supposed to make the ->rmdir be atomic wrt things like

        dentry->d_inode->i_flags |= S_DEAD;

and

        dont_mount(dentry);
        detach_mounts(dentry);

but now because the inode lock was dropped in the middle, for all we
know a "mkdir()" could come in on the same name, and make a mockery of
the whole inode locking.

The inode lock on that directory that is getting removed is also what
is supposed to make it impossible to add new files to the directory
while the rmdir is going on.

Again, dropping the lock violates those kinds of guarantees.

"git blame" actually fingers Al for that "unlock and relock", but
that's because Al did a search-and-replace on
"mutex_[un]lock(&inode->i_mutex)" and replaced it with
"inode_[un]lock(inode)" back in 2016.

The real culprit is you, and that sh*t-show goes back to commit
277ba04461c2 ("tracing: Add interface to allow multiple trace
buffers").

Christ. Now I guess I need to start looking if there is anything else
horrid lurking in that mkdir/rmdir path.

I doubt the inode locking problem ends up mattering in this situation.
Mostly since this is only tracefs, and normal users shouldn't be able
to access these things anyway. And I think (hope?) that you only
accept mkdir/rmdir in specific places.

But this really is another case of "This code smells *fundamentally* wrong".

Adding Al, in the hopes that he will take a look at this too.

            Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31  6:20           ` Linus Torvalds
@ 2024-01-31 12:57             ` Steven Rostedt
  2024-01-31 13:14               ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-31 12:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 22:20:24 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 21:57, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ugh.  
> 
> Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how
> it does that "unlock and re-lock inode" thing.

I'd figured you'd like that one.

> 
> It's a disease, and no, it's not an acceptable response to "lockdep
> shows there's a locking problem".
> 
> The comment says "It is up to the individual mkdir routine to handle
> races" but that's *completely* bogus and shows a fundamental
> misunderstanding of locking.
> 
> Dropping the lock in the middle of a locked section does NOT affect
> just the section that you dropped the lock around.
> 
> It affects the *caller*, who took the lock in the first place, and who
> may well assume that the lock protects things for the whole duration
> of the lock.
> 
> And so dropping the lock in the middle of an operation is a bad BAD
> *BAD* thing to do.
> 
> Honestly, I had only been looking at the eventfs_inode.c lookup code.
> But now that I started looking at mkdir/rmdir, I'm seeing the same
> signs of horrible mistakes there too.
> 
> And yes, it might be a real problem. For example, for the rmdir case,
> the actual lock was taken by 'vfs_rmdir()', and it does *not* protect
> only the ->rmdir() call itself.
> 
> It also, for example, is supposed to make the ->rmdir be atomic wrt things like
> 
>         dentry->d_inode->i_flags |= S_DEAD;
> 
> and
> 
>         dont_mount(dentry);
>         detach_mounts(dentry);
> 
> but now because the inode lock was dropped in the middle, for all we
> know a "mkdir()" could come in on the same name, and make a mockery of
> the whole inode locking.
> 
> The inode lock on that directory that is getting removed is also what
> is supposed to make it impossible to add new files to the directory
> while the rmdir is going on.
> 
> Again, dropping the lock violates those kinds of guarantees.
> 
> "git blame" actually fingers Al for that "unlock and relock", but
> that's because Al did a search-and-replace on
> "mutex_[un]lock(&inode->i_mutex)" and replaced it with
> "inode_[un]lock(inode)" back in 2016.
> 
> The real culprit is you, and that sh*t-show goes back to commit
> 277ba04461c2 ("tracing: Add interface to allow multiple trace
> buffers").
> 
> Christ. Now I guess I need to start looking if there is anything else
> horrid lurking in that mkdir/rmdir path.
> 
> I doubt the inode locking problem ends up mattering in this situation.
> Mostly since this is only tracefs, and normal users shouldn't be able
> to access these things anyway. And I think (hope?) that you only
> accept mkdir/rmdir in specific places.

Yes, this was very deliberate. It was for a very "special" feature, and
thus very limited.

> 
> But this really is another case of "This code smells *fundamentally* wrong".
> 
> Adding Al, in the hopes that he will take a look at this too.

This is something I asked Al about when I wrote it. This isn't a normal
rmdir. I remember telling Al what this was doing. Basically, it's just
a way to tell the tracing infrastructure to create a new instance. It
doesn't actually create a normal directory. It's similar to the
kprobe_events interface, where writing to a file would create
directories and files. Ideally I wanted a mkdir interface as it felt
more realistic, and I was ready to have another "echo foo > make_instance"
if this didn't work. But after talking with Al, I felt that it could
work. The main issue is that mkdir doesn't just make a directory, it
creates the entire tree (like what is done in /sys/fs/cgroup). If this
was more like kernfs instead of debugfs, I may not have had this
problem. That was another time I tried to understand how krenfs worked.

This is the reason all the opens in the tracing code has:

        struct trace_array *tr = inode->i_private;
        int ret;

        ret = tracing_check_open_get_tr(tr);
	if (ret)
		return ret;


In kernel/trace/trace.c:

LIST_HEAD(ftrace_trace_arrays);

int trace_array_get(struct trace_array *this_tr)
{
        struct trace_array *tr;
        int ret = -ENODEV;
        
        mutex_lock(&trace_types_lock);
        list_for_each_entry(tr, &ftrace_trace_arrays, list) {
                if (tr == this_tr) {
                        tr->ref++;
                        ret = 0;
                        break;
                }
        }
        mutex_unlock(&trace_types_lock);
                
        return ret;
}       

static void __trace_array_put(struct trace_array *this_tr)
{
        WARN_ON(!this_tr->ref);
        this_tr->ref--;       
}

void trace_array_put(struct trace_array *this_tr)
{
        if (!this_tr)
                return;

        mutex_lock(&trace_types_lock);
        __trace_array_put(this_tr);
        mutex_unlock(&trace_types_lock);
}

int tracing_check_open_get_tr(struct trace_array *tr)
{
        int ret;

        ret = security_locked_down(LOCKDOWN_TRACEFS);
        if (ret)
                return ret;

        if (tracing_disabled)
                return -ENODEV;

        if (tr && trace_array_get(tr) < 0)
                return -ENODEV;

        return 0;
}

The rmdir code that tracefs calls is also in kernel/trace/trace.c:

static int instance_rmdir(const char *name)
{
        struct trace_array *tr;
        int ret;

        mutex_lock(&event_mutex);
        mutex_lock(&trace_types_lock);

        ret = -ENODEV;
        tr = trace_array_find(name);
        if (tr)
                ret = __remove_instance(tr);

        mutex_unlock(&trace_types_lock);
        mutex_unlock(&event_mutex);

        return ret;
}

The mkdir creates a new trace_array (tr) and rmdir destroys it. If
there's any reference on a trace array the rmdir fails. On every open,
a check is made to see if the trace array that was added to the inode
still exists, and if it does, it ups its ref count to prevent a rmdir
from happening.

It's a very limited mkdir and rmdir. I remember Al asking me questions
about what it can and can't do, and me telling him to make sure the
lock release wasn't going to cause problems.

I wrote several fuzzers on this code that perform mkdir and rmdir on the
instances while other tasks are trying to access them (reading and
writing). In the beginning, it found a few bugs. But I was finally able
to get my fuzzers to work non-stop for over a month and that's when I
felt that this is fine.

I was always weary of this code because of the locking situation. The
kernel selftests has a stress test on this code too.

  tools/testing/selftests/ftrace/test.d/instances/instance-event.tc
  tools/testing/selftests/ftrace/test.d/instances/instance.tc

Which is run as part of the selftest suite, which is run by most people
testing the kernel, including KernelCI.

I haven't had a problem with this code in years, and unless we find a
real bug that needs to be fixed, I'm afraid to touch it.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31 12:57             ` Steven Rostedt
@ 2024-01-31 13:14               ` Steven Rostedt
  2024-01-31 18:08                 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-31 13:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Wed, 31 Jan 2024 07:57:40 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> static int instance_rmdir(const char *name)
> {
>         struct trace_array *tr;
>         int ret;
> 
>         mutex_lock(&event_mutex);

Note, event_mutex prevents dynamic events from being created. No kprobe
can be added while the event_mutex is held (not to be confused with
eventfs_mutex).

>         mutex_lock(&trace_types_lock);
> 
>         ret = -ENODEV;
>         tr = trace_array_find(name);
>         if (tr)
>                 ret = __remove_instance(tr);
> 
>         mutex_unlock(&trace_types_lock);
>         mutex_unlock(&event_mutex);
> 
>         return ret;
> }

And

static int instance_mkdir(const char *name)
{
        struct trace_array *tr;
        int ret;

        mutex_lock(&event_mutex);
        mutex_lock(&trace_types_lock);

        ret = -EEXIST;
        if (trace_array_find(name))
                goto out_unlock;

        tr = trace_array_create(name);

        ret = PTR_ERR_OR_ZERO(tr);

out_unlock:
        mutex_unlock(&trace_types_lock);
        mutex_unlock(&event_mutex);
        return ret;
}


The above functions would have been basically what would have been
called if I had created:

  echo foo >> make_instance
  echo '!foo' >> remove_instance

In fact, IIRC, I did do the above first, and then moved that code into
mkdir/rmdir. Such that the tracefs mkdir and rmdir were just helpers
and not real "mkdir" and "rmdir" routines. This isn't in git history
because it was done only on my local repository.

If you also notice, tracefs only allows mkdir/rmdir to be assigned to
one directory. Once it is assigned, no other directories can have mkdir
rmdir functionality.

 /sys/kernel/tracing/instances

is the *only* directory that can have mkdir rmdir on it.

__init struct dentry *tracefs_create_instance_dir(const char *name,
                                          struct dentry *parent,
                                          int (*mkdir)(const char *name),
                                          int (*rmdir)(const char *name))
{
        struct dentry *dentry;

        /* Only allow one instance of the instances directory. */
        if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
                return NULL;

And IIRC, Al told me that if I released the locks, it's up to the above
functions to prevent the races from occurring. That is, no file can be
created or remove when the mkdir and rmdir are running. Which the
event_mutex prevents.

Basically, I freeze external file creation and deletion within
tracefs/eventfs when the above mkdir/rmdir is being executed, and
prevent rmdir if any file within it is open.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
  2024-01-30 21:25   ` Steven Rostedt
  2024-01-31  1:12   ` Al Viro
@ 2024-01-31 18:00   ` Steven Rostedt
  2024-01-31 23:07     ` Masami Hiramatsu
  2 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2024-01-31 18:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Tue, 30 Jan 2024 11:03:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It would probably be cleaner to make eventfs its own filesystem, or at
> least set its own dentry ops when looking up eventfs files.  But as it
> is, only eventfs dentries use d_fsdata, so we don't really need to split
> these things up by use.

BTW, I have been thinking about making eventfs a completely separate file
system that could be mounted outside of tracefs. One that is readonly that
only contains the "id" and "format" files and nothing more.

Why? Because perf and powertop both use those files to know how to parse
the raw event formats. I don't think there's anything in there that
requires root privileges to read. They should not be exposing any internal
kernel information besides the event format layouts, and it would be nice
to have a /sys/kernel/events directory that only had that.

Making eventfs a separate file system where, when added to tracefs, has the
control files for the specific trace_array, but for the /sys/kernel
directory, only cares about the trace format files.

Then tracefs could be nicely converted over to kernfs, and eventfs would be
its own entity.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31 13:14               ` Steven Rostedt
@ 2024-01-31 18:08                 ` Linus Torvalds
  2024-01-31 18:39                   ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-01-31 18:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Al Viro, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Wed, 31 Jan 2024 at 05:14, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> If you also notice, tracefs only allows mkdir/rmdir to be assigned to
> one directory. Once it is assigned, no other directories can have mkdir
> rmdir functionality.

I think that does limit the damage, but it's not clear that it is actually safe.

Because you don't hold the inode lock, somebody could come in and do a
mkdir inside the other one that is being removed, ie something like

 - thread 1 does took the inode lock, called ->rmdir

 - it then drops the inode lock (both parent and the directory that is
getting removed) and gets the event lock

 - but thread 2 can come in between that inode lock drop and event lock

Notice: thread 2 comes in where there is *no* locking. Nada. Zilch.

This is what worries me.

But it does help that it's all only in *one* directory.  At least
another mkdir cannot happen *inside* the one that is going away while
the locks are not held. So the good news is that it does mean that
there's only one point that is being protected.

But I do worry about things like this (in vfs_rmdir()):

        inode_lock(dentry->d_inode);

        error = -EBUSY;
        if (is_local_mountpoint(dentry) ||
            (dentry->d_inode->i_flags & S_KERNEL_FILE))
                goto out;

        error = security_inode_rmdir(dir, dentry);
        if (error)
                goto out;

        error = dir->i_op->rmdir(dir, dentry);
        if (error)
                goto out;

notice how it does that "is this a mount point" test atomically wrt
the rmdir before it is allowed to proceed.

And I do think that the inode lock is what also protects it from
concurrent mounts. So now what happens when that "thread 2" above
comes in while there is *no* locking, and mounts something there?

Now, I'm not saying this is a huge problem. But it's very much an
example of a thing that *could* be a problem. Dropping locks in the
middle is just very scary.

               Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
  2024-01-31 18:08                 ` Linus Torvalds
@ 2024-01-31 18:39                   ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2024-01-31 18:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Wed, 31 Jan 2024 10:08:37 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 31 Jan 2024 at 05:14, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > If you also notice, tracefs only allows mkdir/rmdir to be assigned to
> > one directory. Once it is assigned, no other directories can have mkdir
> > rmdir functionality.  
> 
> I think that does limit the damage, but it's not clear that it is actually safe.
> 
> Because you don't hold the inode lock, somebody could come in and do a
> mkdir inside the other one that is being removed, ie something like
> 
>  - thread 1 does took the inode lock, called ->rmdir
> 
>  - it then drops the inode lock (both parent and the directory that is
> getting removed) and gets the event lock
> 
>  - but thread 2 can come in between that inode lock drop and event lock
> 
> Notice: thread 2 comes in where there is *no* locking. Nada. Zilch.
> 
> This is what worries me.

Yep, and that was my biggest concern too, which is why I have stress tests
that try to hit that above scenario. Well, rmdir and other accesses
including other mkdir's of the same name.

As my knowledge on the inode life time is still limited, my main concern
was just corruption of the dentry/inodes themselves. But the first one to
get the event_mutex determines the state of the file system.

If thread 1 is doing rmdir, what would thread 2 do that can harm it?

The rmdir calls trace_remove() which is basically retrying to remove the
directory again, and hopefully has the proper locking just like removing
the kprobe trace event that deletes directories. It can have references on
it.

Now if something were to get a reference, and a valid dentry, as soon as
the open function is called, the tracing logic will see that the
trace_array no longer exists and returns an error.

All the open functions for files that are created in an instance (and that
includes eventfs files) have a check to see if the inode->i_private data is
still valid. The trace_array structure represents the directory, and
there's a link list of all the trace_array structures that is protected by
the trace_types_lock. It grabs that lock, iterates the array to see if the
passed in trace_array is on it, if it is, it ups the ref count (preventing
a rmdir from succeeding) and returns it. If it is not, it returns NULL and
the open call fails as if it opened a nonexistent file.

> 
> But it does help that it's all only in *one* directory.  At least
> another mkdir cannot happen *inside* the one that is going away while
> the locks are not held. So the good news is that it does mean that
> there's only one point that is being protected.
> 
> But I do worry about things like this (in vfs_rmdir()):
> 
>         inode_lock(dentry->d_inode);
> 
>         error = -EBUSY;
>         if (is_local_mountpoint(dentry) ||
>             (dentry->d_inode->i_flags & S_KERNEL_FILE))
>                 goto out;
> 
>         error = security_inode_rmdir(dir, dentry);
>         if (error)
>                 goto out;
> 
>         error = dir->i_op->rmdir(dir, dentry);
>         if (error)
>                 goto out;
> 
> notice how it does that "is this a mount point" test atomically wrt
> the rmdir before it is allowed to proceed.

You mean if someone did:

 # mkdir instances/foo
 # rmdir instances/foo

and at the same time, someone else did

 # mount -t ext4 /dev/sda instances/foo

?

OK, I never thought of that use case. Although, I think if someone is
trying to mount anything in tracefs, they can keep the pieces. ;-)

> 
> And I do think that the inode lock is what also protects it from
> concurrent mounts. So now what happens when that "thread 2" above
> comes in while there is *no* locking, and mounts something there?
> 
> Now, I'm not saying this is a huge problem. But it's very much an
> example of a thing that *could* be a problem. Dropping locks in the
> middle is just very scary.

No arguments from me. I really didn't like the dropping of the locks, and
tried hard to avoid it. If switching over to kernfs can solve that, I'd let
that conversion happen.

I'm all for someone switching tracefs over to kernfs if it solves all these
unknown bugs, as long as it doesn't hurt the memory savings of eventfs. But
again, I'm also willing to make eventfs its own file system (although I
don't have the time yet to do that) where tracefs isn't burdened by it.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-31 18:00   ` Steven Rostedt
@ 2024-01-31 23:07     ` Masami Hiramatsu
  2024-01-31 23:23       ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Masami Hiramatsu @ 2024-01-31 23:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Masami Hiramatsu, linux-kernel, linux-trace-kernel

On Wed, 31 Jan 2024 13:00:39 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 30 Jan 2024 11:03:55 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > It would probably be cleaner to make eventfs its own filesystem, or at
> > least set its own dentry ops when looking up eventfs files.  But as it
> > is, only eventfs dentries use d_fsdata, so we don't really need to split
> > these things up by use.
> 
> BTW, I have been thinking about making eventfs a completely separate file
> system that could be mounted outside of tracefs. One that is readonly that
> only contains the "id" and "format" files and nothing more.
> 
> Why? Because perf and powertop both use those files to know how to parse
> the raw event formats. I don't think there's anything in there that
> requires root privileges to read. They should not be exposing any internal
> kernel information besides the event format layouts, and it would be nice
> to have a /sys/kernel/events directory that only had that.

That's a good idea! So maybe we can allow perf to read it without root user.

> 
> Making eventfs a separate file system where, when added to tracefs, has the
> control files for the specific trace_array, but for the /sys/kernel
> directory, only cares about the trace format files.
> 
> Then tracefs could be nicely converted over to kernfs, and eventfs would be
> its own entity.

If so, maybe we can just make symlinks to the 'id' and 'format' from events
under tracefs? :)

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
  2024-01-31 23:07     ` Masami Hiramatsu
@ 2024-01-31 23:23       ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2024-01-31 23:23 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Linus Torvalds, linux-kernel, linux-trace-kernel

On Thu, 1 Feb 2024 08:07:15 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Then tracefs could be nicely converted over to kernfs, and eventfs would be
> > its own entity.  
> 
> If so, maybe we can just make symlinks to the 'id' and 'format' from events
> under tracefs? :)

I don't think that will save anything. The files currently do not allocate
any memory. If we make symlinks, we need to allocate a path, to them. I
think that would be rather difficult to do. Not to mention, that could
cause a lot of breakage. What do you do if the other filesystem isn't
mounted?

I could possibly make a light way handle to pass back to the callbacks.

struct trace_event_light {
	unsigned long			flags
	struct trace_event_call		*event_call;
};

struct trace_event_file {
	struct trace_event_light	call;
	[..]
	// Remove he flags and event_call and have it above
};

if the callback data has:

 callback(..., void **data)
 {
	struct trace_event_light *call = *data;
	struct trace_event_file *file;

	If (strcmp(name, "id") == 0 || strcmp(name, "format") == 0) {
		*data = call->event_call;
		return 1;
	}

	/* Return if this is just a light data entry */
	if (!(data->flags & TRACE_EVENT_FULL))
		return 0;

	file = container_of(data, struct trace_event_file, call);

	// continue processing the full data
}

This way the lonely eventfs could still share a lot of the code.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2024-01-31 23:23 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-30 19:03 [PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily Linus Torvalds
2024-01-30 19:03 ` [PATCH 2/6] eventfsfs: initialize the tracefs inode properly Linus Torvalds
2024-01-30 19:48   ` Steven Rostedt
2024-01-30 19:56     ` Steven Rostedt
2024-01-30 19:49   ` Steven Rostedt
2024-01-30 19:03 ` [PATCH 3/6] tracefs: dentry lookup crapectomy Linus Torvalds
2024-01-30 23:26   ` Al Viro
2024-01-30 23:55     ` Steven Rostedt
2024-01-31  0:07       ` Al Viro
2024-01-31  0:23   ` Al Viro
2024-01-31  0:39     ` Linus Torvalds
2024-01-30 19:03 ` [PATCH 4/6] eventfs: remove unused 'd_parent' pointer field Linus Torvalds
2024-01-30 19:03 ` [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts Linus Torvalds
2024-01-30 20:55   ` Steven Rostedt
2024-01-30 21:52     ` Linus Torvalds
2024-01-30 22:56       ` Steven Rostedt
2024-01-30 22:58         ` Linus Torvalds
2024-01-30 23:04           ` Steven Rostedt
2024-01-30 22:56       ` Linus Torvalds
2024-01-30 23:06         ` Linus Torvalds
2024-01-30 23:10           ` Steven Rostedt
2024-01-30 23:25             ` Linus Torvalds
2024-01-31  0:48   ` Al Viro
2024-01-31  5:09   ` Steven Rostedt
2024-01-31  5:25     ` Linus Torvalds
2024-01-31  5:33       ` Steven Rostedt
2024-01-31  5:57         ` Linus Torvalds
2024-01-31  6:20           ` Linus Torvalds
2024-01-31 12:57             ` Steven Rostedt
2024-01-31 13:14               ` Steven Rostedt
2024-01-31 18:08                 ` Linus Torvalds
2024-01-31 18:39                   ` Steven Rostedt
2024-01-30 19:03 ` [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function Linus Torvalds
2024-01-30 21:25   ` Steven Rostedt
2024-01-30 21:36     ` Linus Torvalds
2024-01-31  1:12   ` Al Viro
2024-01-31  2:37     ` Linus Torvalds
2024-01-31  2:46       ` Al Viro
2024-01-31  3:39         ` Linus Torvalds
2024-01-31  4:28           ` Al Viro
2024-01-31 18:00   ` Steven Rostedt
2024-01-31 23:07     ` Masami Hiramatsu
2024-01-31 23:23       ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).