* [for-linus][PATCH 0/3] eventfs: A few more fixes for 6.8 @ 2024-01-17 14:35 Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same Steven Rostedt ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-17 14:35 UTC (permalink / raw) To: linux-kernel Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton More eventfs fixes for 6.8: - Hard-code the inodes for eventfs to the same number for files, and the same number for directories. - Have getdent() not create dentries/inodes in iterate_shared() as now it has hard-coded inode numbers - Use kcalloc() instead of kzalloc() on a list of elements git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git trace/urgent Head SHA1: 1057066009c4325bb1d8430c9274894d0860e7c3 Erick Archer (1): eventfs: Use kcalloc() instead of kzalloc() Steven Rostedt (Google) (2): eventfs: Have the inodes all for files and directories all be the same eventfs: Do not create dentries nor inodes in iterate_shared ---- fs/tracefs/event_inode.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) ^ permalink raw reply [flat|nested] 30+ messages in thread
* [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-17 14:35 [for-linus][PATCH 0/3] eventfs: A few more fixes for 6.8 Steven Rostedt @ 2024-01-17 14:35 ` Steven Rostedt 2024-01-22 10:38 ` Geert Uytterhoeven 2024-01-17 14:35 ` [for-linus][PATCH 2/3] eventfs: Do not create dentries nor inodes in iterate_shared Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 3/3] eventfs: Use kcalloc() instead of kzalloc() Steven Rostedt 2 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2024-01-17 14:35 UTC (permalink / raw) To: linux-kernel Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher, Linus Torvalds From: "Steven Rostedt (Google)" <rostedt@goodmis.org> The dentries and inodes are created in the readdir for the sole purpose of getting a consistent inode number. Linus stated that is unnecessary, and that all inodes can have the same inode number. For a virtual file system they are pretty meaningless. Instead use a single unique inode number for all files and one for all directories. Link: https://lore.kernel.org/all/20240116133753.2808d45e@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180363@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> --- fs/tracefs/event_inode.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..5edf0b96758b 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -32,6 +32,10 @@ */ static DEFINE_MUTEX(eventfs_mutex); +/* Choose something "unique" ;-) */ +#define EVENTFS_FILE_INODE_INO 0x12c4e37 +#define EVENTFS_DIR_INODE_INO 0x134b2f5 + /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from * its parent's list and will have is_freed set (under eventfs_mutex). @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_fop = fop; inode->i_private = data; + /* All files will have the same inode number */ + inode->i_ino = EVENTFS_FILE_INODE_INO; + ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; d_instantiate(dentry, inode); @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_op = &eventfs_root_dir_inode_operations; inode->i_fop = &eventfs_file_operations; + /* All directories will have the same inode number */ + inode->i_ino = EVENTFS_DIR_INODE_INO; + ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; -- 2.43.0 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-17 14:35 ` [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same Steven Rostedt @ 2024-01-22 10:38 ` Geert Uytterhoeven 2024-01-22 15:06 ` Steven Rostedt 0 siblings, 1 reply; 30+ messages in thread From: Geert Uytterhoeven @ 2024-01-22 10:38 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher, Linus Torvalds Hi Stephen, On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > The dentries and inodes are created in the readdir for the sole purpose of > getting a consistent inode number. Linus stated that is unnecessary, and > that all inodes can have the same inode number. For a virtual file system > they are pretty meaningless. > > Instead use a single unique inode number for all files and one for all > directories. > > Link: https://lore.kernel.org/all/20240116133753.2808d45e@gandalf.local.home/ > Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180363@goodmis.org > > Cc: Masami Hiramatsu <mhiramat@kernel.org> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Christian Brauner <brauner@kernel.org> > Cc: Al Viro <viro@ZenIV.linux.org.uk> > Cc: Ajay Kaher <ajay.kaher@broadcom.com> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Thanks for your patch, which is now commit 53c41052ba312176 ("eventfs: Have the inodes all for files and directories all be the same") in v6.8-rc1, to which I have bisected the issue below. > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -32,6 +32,10 @@ > */ > static DEFINE_MUTEX(eventfs_mutex); > > +/* Choose something "unique" ;-) */ > +#define EVENTFS_FILE_INODE_INO 0x12c4e37 > +#define EVENTFS_DIR_INODE_INO 0x134b2f5 > + > /* > * The eventfs_inode (ei) itself is protected by SRCU. It is released from > * its parent's list and will have is_freed set (under eventfs_mutex). > @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode, > inode->i_fop = fop; > inode->i_private = data; > > + /* All files will have the same inode number */ > + inode->i_ino = EVENTFS_FILE_INODE_INO; > + > ti = get_tracefs(inode); > ti->flags |= TRACEFS_EVENT_INODE; > d_instantiate(dentry, inode); > @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent > inode->i_op = &eventfs_root_dir_inode_operations; > inode->i_fop = &eventfs_file_operations; > > + /* All directories will have the same inode number */ > + inode->i_ino = EVENTFS_DIR_INODE_INO; > + > ti = get_tracefs(inode); > ti->flags |= TRACEFS_EVENT_INODE; This confuses "find". Running "find /sys/" now prints lots of error messages to stderr: find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’. find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’. find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’. [...] Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 10:38 ` Geert Uytterhoeven @ 2024-01-22 15:06 ` Steven Rostedt 2024-01-22 16:23 ` Geert Uytterhoeven 2024-01-22 17:14 ` Mathieu Desnoyers 0 siblings, 2 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-22 15:06 UTC (permalink / raw) To: Geert Uytterhoeven, Kees Cook Cc: linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher, Linus Torvalds On Mon, 22 Jan 2024 11:38:52 +0100 Geert Uytterhoeven <geert@linux-m68k.org> wrote: > Hi Stephen, I don't know who "Stephen" is, but I'll reply to this message. > > On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > > > The dentries and inodes are created in the readdir for the sole purpose of > > getting a consistent inode number. Linus stated that is unnecessary, and > > that all inodes can have the same inode number. For a virtual file system > > they are pretty meaningless. > > > > Instead use a single unique inode number for all files and one for all > > directories. > > > > Link: https://lore.kernel.org/all/20240116133753.2808d45e@gandalf.local.home/ Yeah, Linus wanted me to try this first and see if there's any regressions. Well, I guess you just answered that. The above link has me saying to Linus: It was me being paranoid that using the same inode number would break user space. If that is not a concern, then I'm happy to just make it either the same, or maybe just hash the ei and name that it is associated with. > > Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180363@goodmis.org > > > > Cc: Masami Hiramatsu <mhiramat@kernel.org> > > Cc: Mark Rutland <mark.rutland@arm.com> > > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > > Cc: Christian Brauner <brauner@kernel.org> > > Cc: Al Viro <viro@ZenIV.linux.org.uk> > > Cc: Ajay Kaher <ajay.kaher@broadcom.com> > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > > Thanks for your patch, which is now commit 53c41052ba312176 ("eventfs: > Have the inodes all for files and directories all be the same") in > v6.8-rc1, to which I have bisected the issue below. > > > --- a/fs/tracefs/event_inode.c > > +++ b/fs/tracefs/event_inode.c > > @@ -32,6 +32,10 @@ > > */ > > static DEFINE_MUTEX(eventfs_mutex); > > > > +/* Choose something "unique" ;-) */ > > +#define EVENTFS_FILE_INODE_INO 0x12c4e37 > > +#define EVENTFS_DIR_INODE_INO 0x134b2f5 > > + > > /* > > * The eventfs_inode (ei) itself is protected by SRCU. It is released from > > * its parent's list and will have is_freed set (under eventfs_mutex). > > @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode, > > inode->i_fop = fop; > > inode->i_private = data; > > > > + /* All files will have the same inode number */ > > + inode->i_ino = EVENTFS_FILE_INODE_INO; > > + > > ti = get_tracefs(inode); > > ti->flags |= TRACEFS_EVENT_INODE; > > d_instantiate(dentry, inode); > > @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent > > inode->i_op = &eventfs_root_dir_inode_operations; > > inode->i_fop = &eventfs_file_operations; > > > > + /* All directories will have the same inode number */ > > + inode->i_ino = EVENTFS_DIR_INODE_INO; > > + > > ti = get_tracefs(inode); > > ti->flags |= TRACEFS_EVENT_INODE; > > This confuses "find". > Running "find /sys/" now prints lots of error messages to stderr: > > find: File system loop detected; > ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of > the same file system loop as > ‘/sys/kernel/debug/tracing/events/initcall’. So at a minimum, the directories need to have unique inode numbers. > find: File system loop detected; > ‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of > the same file system loop as > ‘/sys/kernel/debug/tracing/events/initcall’. > find: File system loop detected; > ‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of > the same file system loop as > ‘/sys/kernel/debug/tracing/events/initcall’. > [...] Does this fix it for you? It hashes the eventfs_inode data structure after adding some salt to it. Kees, I'm using the eventfs_inode pointer to create a unique value for the inode. But it's being salted, hashed and then truncated. As it is very easy to read inodes (although by default, only root has access to read these inodes), the inode numbers themselves shouldn't be able to leak kernel addresses via the results of these inode numbers, would it? -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 6795fda2af19..d54897b84596 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -19,6 +19,7 @@ #include <linux/namei.h> #include <linux/workqueue.h> #include <linux/security.h> +#include <linux/siphash.h> #include <linux/tracefs.h> #include <linux/kref.h> #include <linux/delay.h> @@ -36,6 +37,31 @@ static DEFINE_MUTEX(eventfs_mutex); #define EVENTFS_FILE_INODE_INO 0x12c4e37 #define EVENTFS_DIR_INODE_INO 0x134b2f5 +/* Used for making inode numbers */ +static siphash_key_t inode_key; + +/* Copied from scripts/kconfig/symbol.c */ +static unsigned strhash(const char *s) +{ + /* fnv32 hash */ + unsigned hash = 2166136261U; + for (; *s; s++) + hash = (hash ^ *s) * 0x01000193; + return hash; +} + +/* Just try to make something consistent and unique */ +static int eventfs_dir_ino(struct event_inode *ei, const char *name) +{ + unsigned long sip = (unsigned long)ei; + + sip += strhash(name) + EVENTFS_DIR_INODE_INO; + sip = siphash_1u32((int)sip, &inode_key); + + /* keep it positive */ + return sip & ((1U << 31) - 1); +} + /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from * its parent's list and will have is_freed set (under eventfs_mutex). @@ -396,7 +422,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_fop = &eventfs_file_operations; /* All directories will have the same inode number */ - inode->i_ino = EVENTFS_DIR_INODE_INO; + inode->i_ino = eventfs_dir_ino(ei, ei->name); ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -802,7 +828,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = ei_child->name; - ino = EVENTFS_DIR_INODE_INO; + ino = eventfs_dir_ino(ei_child, name); if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; @@ -932,6 +958,9 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry if (IS_ERR(dentry)) return ERR_CAST(dentry); + if (siphash_key_is_zero(&inode_key)) + get_random_bytes(&inode_key, sizeof(inode_key)); + ei = kzalloc(sizeof(*ei), GFP_KERNEL); if (!ei) goto fail_ei; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 15:06 ` Steven Rostedt @ 2024-01-22 16:23 ` Geert Uytterhoeven 2024-01-22 16:47 ` Steven Rostedt 2024-01-22 17:14 ` Mathieu Desnoyers 1 sibling, 1 reply; 30+ messages in thread From: Geert Uytterhoeven @ 2024-01-22 16:23 UTC (permalink / raw) To: Steven Rostedt Cc: Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher, Linus Torvalds Hi Steven, On Mon, Jan 22, 2024 at 4:05 PM Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 22 Jan 2024 11:38:52 +0100 > Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > > Hi Stephen, > > I don't know who "Stephen" is, but I'll reply to this message. My apologies (too many different spellings in too many languages)... > > On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > From: "Steven Rostedt (Google)" <rostedt@goodmis.org> > > > > > > The dentries and inodes are created in the readdir for the sole purpose of > > > getting a consistent inode number. Linus stated that is unnecessary, and > > > that all inodes can have the same inode number. For a virtual file system > > > they are pretty meaningless. > > > > > > Instead use a single unique inode number for all files and one for all > > > directories. > > > > > > Link: https://lore.kernel.org/all/20240116133753.2808d45e@gandalf.local.home/ > > Yeah, Linus wanted me to try this first and see if there's any regressions. > Well, I guess you just answered that. > > The above link has me saying to Linus: > > It was me being paranoid that using the same inode number would break user > space. If that is not a concern, then I'm happy to just make it either the > same, or maybe just hash the ei and name that it is associated with. > > > > Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180363@goodmis.org > > > > > > Cc: Masami Hiramatsu <mhiramat@kernel.org> > > > Cc: Mark Rutland <mark.rutland@arm.com> > > > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > > > Cc: Christian Brauner <brauner@kernel.org> > > > Cc: Al Viro <viro@ZenIV.linux.org.uk> > > > Cc: Ajay Kaher <ajay.kaher@broadcom.com> > > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > > > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> > > > > Thanks for your patch, which is now commit 53c41052ba312176 ("eventfs: > > Have the inodes all for files and directories all be the same") in > > v6.8-rc1, to which I have bisected the issue below. > > This confuses "find". > > Running "find /sys/" now prints lots of error messages to stderr: > > > > find: File system loop detected; > > ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of > > the same file system loop as > > ‘/sys/kernel/debug/tracing/events/initcall’. > > So at a minimum, the directories need to have unique inode numbers. > > > > find: File system loop detected; > > ‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of > > the same file system loop as > > ‘/sys/kernel/debug/tracing/events/initcall’. > > find: File system loop detected; > > ‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of > > the same file system loop as > > ‘/sys/kernel/debug/tracing/events/initcall’. > > [...] > > Does this fix it for you? It hashes the eventfs_inode data structure after > adding some salt to it. > > Kees, > > I'm using the eventfs_inode pointer to create a unique value for the inode. > But it's being salted, hashed and then truncated. As it is very easy to > read inodes (although by default, only root has access to read these > inodes), the inode numbers themselves shouldn't be able to leak kernel > addresses via the results of these inode numbers, would it? > > -- Steve Gotya ;-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 6795fda2af19..d54897b84596 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -36,6 +37,31 @@ static DEFINE_MUTEX(eventfs_mutex); > #define EVENTFS_FILE_INODE_INO 0x12c4e37 > #define EVENTFS_DIR_INODE_INO 0x134b2f5 > > +/* Used for making inode numbers */ > +static siphash_key_t inode_key; > + > +/* Copied from scripts/kconfig/symbol.c */ > +static unsigned strhash(const char *s) > +{ > + /* fnv32 hash */ > + unsigned hash = 2166136261U; > + for (; *s; s++) > + hash = (hash ^ *s) * 0x01000193; > + return hash; > +} > + > +/* Just try to make something consistent and unique */ > +static int eventfs_dir_ino(struct event_inode *ei, const char *name) eventfs_inode > +{ > + unsigned long sip = (unsigned long)ei; > + > + sip += strhash(name) + EVENTFS_DIR_INODE_INO; > + sip = siphash_1u32((int)sip, &inode_key); > + > + /* keep it positive */ > + return sip & ((1U << 31) - 1); > +} > + > /* > * The eventfs_inode (ei) itself is protected by SRCU. It is released from > * its parent's list and will have is_freed set (under eventfs_mutex). Thanks, find is happy now the directories have different inode numbers. The files still have identical inodes numbers, I hope that doesn't cause any issues... Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 16:23 ` Geert Uytterhoeven @ 2024-01-22 16:47 ` Steven Rostedt 2024-01-22 17:37 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2024-01-22 16:47 UTC (permalink / raw) To: Geert Uytterhoeven, Linus Torvalds Cc: Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 17:23:29 +0100 Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > +/* Just try to make something consistent and unique */ > > +static int eventfs_dir_ino(struct event_inode *ei, const char *name) > > eventfs_inode Heh, I fixed that, but must have created the patch before catching it. :-/ > > > +{ > > + unsigned long sip = (unsigned long)ei; > > + > > + sip += strhash(name) + EVENTFS_DIR_INODE_INO; > > + sip = siphash_1u32((int)sip, &inode_key); > > + > > + /* keep it positive */ > > + return sip & ((1U << 31) - 1); > > +} > > + > > /* > > * The eventfs_inode (ei) itself is protected by SRCU. It is released from > > * its parent's list and will have is_freed set (under eventfs_mutex). > > Thanks, find is happy now the directories have different inode numbers. > The files still have identical inodes numbers, I hope that doesn't cause > any issues... Well, I guess I should ask Linus. Linus, I can add this patch to make sure directory inodes are unique, as it causes a regression in find, but keep the file inodes the same. I can see how the issue is with directories, as find (and probably other applications) check for invalid directory loops. But with files, there should be no issue. It could just think it's another hard link. The question I have is, should this just make dir inodes unique and keep files the same, as this patch does? Or make all inodes unique? This is assuming that my algorithm is good enough to not leak kernel addresses. I could also chop it down to 28 bits, as that's probably still "good enough" to keep things unique. return sip & ((1U << 28) - 1); That would make it even harder to unhash to get an address. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 16:47 ` Steven Rostedt @ 2024-01-22 17:37 ` Linus Torvalds 2024-01-22 17:39 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2024-01-22 17:37 UTC (permalink / raw) To: Steven Rostedt Cc: Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 at 08:46, Steven Rostedt <rostedt@goodmis.org> wrote: > > I can add this patch to make sure directory inodes are unique, as it causes > a regression in find, but keep the file inodes the same. Yeah, limiting it to directories will at least somewhat help the address leaking. However, I also note that you never did the "set i_nlink to one" trick, which is the traditional thing to do to tell 'find' that it cannot do its directory optimization thing. I'm not sure that the nlink trick disables this part of the find sanity checks, but the *first* thing to check would be something like this --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -182,6 +182,7 @@ static int tracefs_getattr(struct mnt_idmap *idmap, set_tracefs_inode_owner(inode); generic_fillattr(idmap, request_mask, inode, stat); + stat->nlink = 1; return 0; } because it might just fix the issue. Having nlink == 1 is how non-unix filesystems (like FAT etc) indicate that you can't try to count directory entries to optimize traversal. And it is possible that that is where the whole find thing comes from, but who knows, it could be a generic loop detector that runs independently of the usual link detection. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 17:37 ` Linus Torvalds @ 2024-01-22 17:39 ` Linus Torvalds 2024-01-22 18:19 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2024-01-22 17:39 UTC (permalink / raw) To: Steven Rostedt Cc: Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 at 09:37, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Yeah, limiting it to directories will at least somewhat help the > address leaking. Actually, why not juist add an inode number to your data structures, at least for directories? And just do a static increment on it as they get registered? That avoids the whole issue with possibly leaking kernel address data. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 17:39 ` Linus Torvalds @ 2024-01-22 18:19 ` Linus Torvalds 2024-01-22 18:27 ` Mathieu Desnoyers ` (2 more replies) 0 siblings, 3 replies; 30+ messages in thread From: Linus Torvalds @ 2024-01-22 18:19 UTC (permalink / raw) To: Steven Rostedt Cc: Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher [-- Attachment #1: Type: text/plain, Size: 884 bytes --] On Mon, 22 Jan 2024 at 09:39, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Actually, why not juist add an inode number to your data structures, > at least for directories? And just do a static increment on it as they > get registered? > > That avoids the whole issue with possibly leaking kernel address data. The 'nlink = 1' thing doesn't seem to make 'find' any happier for this case, sadly. But the inode number in the 'struct eventfs_inode' looks trivial. And doesn't even grow that structure on 64-bit architectures at least, because the struct is already 64-bit aligned, and had only one 32-bit entry at the end. On 32-bit architectures the structure size grows, but I'm not sure the allocation size grows. Our kmalloc() is quantized at odd numbers. IOW, this trivial patch seems to be much safer than worrying about some pointer exposure. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1593 bytes --] fs/tracefs/event_inode.c | 6 ++++-- fs/tracefs/internal.h | 1 + 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 6795fda2af19..0b52ec111cf3 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -395,8 +395,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_op = &eventfs_root_dir_inode_operations; inode->i_fop = &eventfs_file_operations; - /* All directories will have the same inode number */ - inode->i_ino = EVENTFS_DIR_INODE_INO; + inode->i_ino = ei->ino; ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -859,6 +858,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode int size, void *data) { struct eventfs_inode *ei; + static int ino_counter = EVENTFS_DIR_INODE_INO; if (!parent) return ERR_PTR(-EINVAL); @@ -889,6 +889,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode INIT_LIST_HEAD(&ei->list); mutex_lock(&eventfs_mutex); + ei->ino = ++ino_counter; + if (!parent->is_freed) { list_add_tail(&ei->list, &parent->children); ei->d_parent = parent->dentry; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 12b7d0150ae9..1a574d306ea9 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -64,6 +64,7 @@ struct eventfs_inode { struct llist_node llist; struct rcu_head rcu; }; + unsigned int ino; unsigned int is_freed:1; unsigned int is_events:1; unsigned int nr_entries:30; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 18:19 ` Linus Torvalds @ 2024-01-22 18:27 ` Mathieu Desnoyers 2024-01-22 19:37 ` Steven Rostedt 2024-01-22 18:50 ` Kees Cook 2024-01-22 19:44 ` Steven Rostedt 2 siblings, 1 reply; 30+ messages in thread From: Mathieu Desnoyers @ 2024-01-22 18:27 UTC (permalink / raw) To: Linus Torvalds, Steven Rostedt Cc: Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On 2024-01-22 13:19, Linus Torvalds wrote: > On Mon, 22 Jan 2024 at 09:39, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Actually, why not juist add an inode number to your data structures, >> at least for directories? And just do a static increment on it as they >> get registered? >> >> That avoids the whole issue with possibly leaking kernel address data. > > The 'nlink = 1' thing doesn't seem to make 'find' any happier for this > case, sadly. > > But the inode number in the 'struct eventfs_inode' looks trivial. And > doesn't even grow that structure on 64-bit architectures at least, > because the struct is already 64-bit aligned, and had only one 32-bit > entry at the end. > > On 32-bit architectures the structure size grows, but I'm not sure the > allocation size grows. Our kmalloc() is quantized at odd numbers. > > IOW, this trivial patch seems to be much safer than worrying about > some pointer exposure. My only concern about the simple ino_counter static increment is what happens in the unlikely scenario of a 32-bit overflow. This is why I suggested using a bitmap to track inode allocation. It's compact, and we don't care that much about the linear bitmap scan overhead because it's far from being a fast path. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 18:27 ` Mathieu Desnoyers @ 2024-01-22 19:37 ` Steven Rostedt 0 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-22 19:37 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 13:27:25 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > IOW, this trivial patch seems to be much safer than worrying about > > some pointer exposure. > > My only concern about the simple ino_counter static increment is what > happens in the unlikely scenario of a 32-bit overflow. This is why > I suggested using a bitmap to track inode allocation. It's compact, and > we don't care that much about the linear bitmap scan overhead because > it's far from being a fast path. The original code use to get its inode via "get_next_ino()" I don't think there's any reason not to just go back and do that again. It can still overflow, but it's not anything new that couldn't have happen to debugfs and tracefs over the last decade. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 18:19 ` Linus Torvalds 2024-01-22 18:27 ` Mathieu Desnoyers @ 2024-01-22 18:50 ` Kees Cook 2024-01-22 19:44 ` Steven Rostedt 2 siblings, 0 replies; 30+ messages in thread From: Kees Cook @ 2024-01-22 18:50 UTC (permalink / raw) To: Linus Torvalds, Steven Rostedt Cc: Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On January 22, 2024 10:19:12 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote: >On Mon, 22 Jan 2024 at 09:39, Linus Torvalds ><torvalds@linux-foundation.org> wrote: >> >> Actually, why not juist add an inode number to your data structures, >> at least for directories? And just do a static increment on it as they >> get registered? Yeah, this is what I'd suggest too. It avoids all the hash complexity. Is wrap-around realistic for it? -Kees -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 18:19 ` Linus Torvalds 2024-01-22 18:27 ` Mathieu Desnoyers 2024-01-22 18:50 ` Kees Cook @ 2024-01-22 19:44 ` Steven Rostedt 2024-01-22 19:48 ` Steven Rostedt ` (2 more replies) 2 siblings, 3 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-22 19:44 UTC (permalink / raw) To: Linus Torvalds Cc: Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 10:19:12 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 22 Jan 2024 at 09:39, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Actually, why not juist add an inode number to your data structures, > > at least for directories? And just do a static increment on it as they > > get registered? > > > > That avoids the whole issue with possibly leaking kernel address data. > > The 'nlink = 1' thing doesn't seem to make 'find' any happier for this > case, sadly. > > But the inode number in the 'struct eventfs_inode' looks trivial. And > doesn't even grow that structure on 64-bit architectures at least, > because the struct is already 64-bit aligned, and had only one 32-bit > entry at the end. > > On 32-bit architectures the structure size grows, but I'm not sure the > allocation size grows. Our kmalloc() is quantized at odd numbers. > > IOW, this trivial patch seems to be much safer than worrying about > some pointer exposure. I originally wanted to avoid the addition of the 4 bytes, but your comment about it not making a difference on 64bit due to alignment makes sense. Slightly different version below. -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 6795fda2af19..6b211522a13e 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -34,7 +34,15 @@ static DEFINE_MUTEX(eventfs_mutex); /* Choose something "unique" ;-) */ #define EVENTFS_FILE_INODE_INO 0x12c4e37 -#define EVENTFS_DIR_INODE_INO 0x134b2f5 + +/* Just try to make something consistent and unique */ +static int eventfs_dir_ino(struct eventfs_inode *ei) +{ + if (!ei->ino) + ei->ino = get_next_ino(); + + return ei->ino; +} /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from @@ -396,7 +404,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_fop = &eventfs_file_operations; /* All directories will have the same inode number */ - inode->i_ino = EVENTFS_DIR_INODE_INO; + inode->i_ino = eventfs_dir_ino(ei); ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -802,7 +810,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = ei_child->name; - ino = EVENTFS_DIR_INODE_INO; + ino = eventfs_dir_ino(ei_child); if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 12b7d0150ae9..1a574d306ea9 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -64,6 +64,7 @@ struct eventfs_inode { struct llist_node llist; struct rcu_head rcu; }; + unsigned int ino; unsigned int is_freed:1; unsigned int is_events:1; unsigned int nr_entries:30; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 19:44 ` Steven Rostedt @ 2024-01-22 19:48 ` Steven Rostedt 2024-01-22 21:33 ` Kees Cook 2024-01-25 17:40 ` Christian Brauner 2 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-22 19:48 UTC (permalink / raw) To: Linus Torvalds Cc: Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 14:44:43 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Slightly different version below. The main difference between this and your patch is that the inode numbers are only generated when needed, and files that are never referenced, will not add to the possibility of overflow or collision. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 19:44 ` Steven Rostedt 2024-01-22 19:48 ` Steven Rostedt @ 2024-01-22 21:33 ` Kees Cook 2024-01-25 17:40 ` Christian Brauner 2 siblings, 0 replies; 30+ messages in thread From: Kees Cook @ 2024-01-22 21:33 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Geert Uytterhoeven, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, Jan 22, 2024 at 02:44:43PM -0500, Steven Rostedt wrote: > On Mon, 22 Jan 2024 10:19:12 -0800 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Mon, 22 Jan 2024 at 09:39, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > Actually, why not juist add an inode number to your data structures, > > > at least for directories? And just do a static increment on it as they > > > get registered? > > > > > > That avoids the whole issue with possibly leaking kernel address data. > > > > The 'nlink = 1' thing doesn't seem to make 'find' any happier for this > > case, sadly. > > > > But the inode number in the 'struct eventfs_inode' looks trivial. And > > doesn't even grow that structure on 64-bit architectures at least, > > because the struct is already 64-bit aligned, and had only one 32-bit > > entry at the end. > > > > On 32-bit architectures the structure size grows, but I'm not sure the > > allocation size grows. Our kmalloc() is quantized at odd numbers. > > > > IOW, this trivial patch seems to be much safer than worrying about > > some pointer exposure. > > I originally wanted to avoid the addition of the 4 bytes, but your comment > about it not making a difference on 64bit due to alignment makes sense. > > Slightly different version below. > > -- Steve > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 6795fda2af19..6b211522a13e 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -34,7 +34,15 @@ static DEFINE_MUTEX(eventfs_mutex); > > /* Choose something "unique" ;-) */ > #define EVENTFS_FILE_INODE_INO 0x12c4e37 > -#define EVENTFS_DIR_INODE_INO 0x134b2f5 > + > +/* Just try to make something consistent and unique */ > +static int eventfs_dir_ino(struct eventfs_inode *ei) > +{ > + if (!ei->ino) > + ei->ino = get_next_ino(); > + > + return ei->ino; > +} > > /* > * The eventfs_inode (ei) itself is protected by SRCU. It is released from > @@ -396,7 +404,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent > inode->i_fop = &eventfs_file_operations; > > /* All directories will have the same inode number */ > - inode->i_ino = EVENTFS_DIR_INODE_INO; > + inode->i_ino = eventfs_dir_ino(ei); > > ti = get_tracefs(inode); > ti->flags |= TRACEFS_EVENT_INODE; > @@ -802,7 +810,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) > > name = ei_child->name; > > - ino = EVENTFS_DIR_INODE_INO; > + ino = eventfs_dir_ino(ei_child); > > if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) > goto out_dec; > diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h > index 12b7d0150ae9..1a574d306ea9 100644 > --- a/fs/tracefs/internal.h > +++ b/fs/tracefs/internal.h > @@ -64,6 +64,7 @@ struct eventfs_inode { > struct llist_node llist; > struct rcu_head rcu; > }; > + unsigned int ino; > unsigned int is_freed:1; > unsigned int is_events:1; > unsigned int nr_entries:30; I like it! :) Reviewed-by: Kees Cook <keescook@chromium.org> -- Kees Cook ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 19:44 ` Steven Rostedt 2024-01-22 19:48 ` Steven Rostedt 2024-01-22 21:33 ` Kees Cook @ 2024-01-25 17:40 ` Christian Brauner 2024-01-25 18:07 ` Steven Rostedt 2 siblings, 1 reply; 30+ messages in thread From: Christian Brauner @ 2024-01-25 17:40 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Mon, Jan 22, 2024 at 02:44:43PM -0500, Steven Rostedt wrote: > On Mon, 22 Jan 2024 10:19:12 -0800 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Mon, 22 Jan 2024 at 09:39, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > Actually, why not juist add an inode number to your data structures, > > > at least for directories? And just do a static increment on it as they > > > get registered? > > > > > > That avoids the whole issue with possibly leaking kernel address data. > > > > The 'nlink = 1' thing doesn't seem to make 'find' any happier for this > > case, sadly. > > > > But the inode number in the 'struct eventfs_inode' looks trivial. And > > doesn't even grow that structure on 64-bit architectures at least, > > because the struct is already 64-bit aligned, and had only one 32-bit > > entry at the end. > > > > On 32-bit architectures the structure size grows, but I'm not sure the > > allocation size grows. Our kmalloc() is quantized at odd numbers. > > > > IOW, this trivial patch seems to be much safer than worrying about > > some pointer exposure. > > I originally wanted to avoid the addition of the 4 bytes, but your comment > about it not making a difference on 64bit due to alignment makes sense. Non-unique inode numbers aren't exactly great for userspace and there are a surprisingly small yet large enough number of tools that trip over this in various ways. So if that can be avoided then great. But luckily no one is probably going to tar up tracefs. ;) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-25 17:40 ` Christian Brauner @ 2024-01-25 18:07 ` Steven Rostedt 2024-01-25 18:08 ` Steven Rostedt 0 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2024-01-25 18:07 UTC (permalink / raw) To: Christian Brauner Cc: Linus Torvalds, Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Thu, 25 Jan 2024 18:40:31 +0100 Christian Brauner <brauner@kernel.org> wrote: > But luckily no one is probably going to tar up tracefs. ;) Actually, inodes isn't the biggest issue of tar, as tar *is* a common operation on tracefs. If you want to create a synthetic event using the sqlhist tool for an embedded board, we recommend copying the tracefs directory over to your workstation from the embedded device. Unfortunately tar never works. That's because all tracefs (and debugfs) files have zero size in stat(). # cd /tmp # (cd /sys/kernel && tar cvf - tracing) | tar xvf - # ls -s tracing total 28 0 available_events 0 max_graph_depth 0 stack_trace 0 available_filter_functions 4 options 0 stack_trace_filter 0 available_filter_functions_addrs 4 osnoise 0 synthetic_events 0 available_tracers 4 per_cpu 0 timestamp_mode 0 buffer_percent 0 printk_formats 0 touched_functions 0 buffer_size_kb 0 README 0 trace 0 buffer_subbuf_size_kb 0 recursed_functions 0 trace_clock 0 buffer_total_size_kb 0 saved_cmdlines 0 trace_marker 0 current_tracer 0 saved_cmdlines_size 0 trace_marker_raw 0 dynamic_events 0 saved_tgids 0 trace_options 0 dyn_ftrace_total_info 0 set_event 0 trace_pipe 0 enabled_functions 0 set_event_notrace_pid 4 trace_stat 0 error_log 0 set_event_pid 0 tracing_cpumask 0 eval_map 0 set_ftrace_filter 0 tracing_max_latency 4 events 0 set_ftrace_notrace 0 tracing_on 0 free_buffer 0 set_ftrace_notrace_pid 0 tracing_thresh 0 function_profile_enabled 0 set_ftrace_pid 0 uprobe_events 4 hwlat_detector 0 set_graph_function 0 uprobe_profile 4 instances 0 set_graph_notrace 0 user_events_data 0 kprobe_events 0 snapshot 0 user_events_status 0 kprobe_profile 0 stack_max_size So instead we have been recommending cp -r Note, for the sqlhist command, only the events are needed, so the instructions is only to copy the events directory, because copying all of tracefs will try to copy the "trace_pipe" file which will block which hangs 'cp'. And I don't know an option to tell the cp command not to block. # cd /tmp # mkdir tracing # cp -r /sys/kernel/tracing/events tracing/events # ls -s tracing/events/sched/sched_switch/ total 20 4 enable 4 filter 4 format 0 hist 0 hist_debug 4 id 0 inject 4 trigger Where the tar would have had: # ls -s tracing/events/sched/sched_switch/ total 0 0 enable 0 filter 0 format 0 hist 0 hist_debug 0 id 0 inject 0 trigger -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-25 18:07 ` Steven Rostedt @ 2024-01-25 18:08 ` Steven Rostedt 2024-01-26 8:07 ` Geert Uytterhoeven 0 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2024-01-25 18:08 UTC (permalink / raw) To: Christian Brauner Cc: Linus Torvalds, Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Thu, 25 Jan 2024 13:07:31 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Actually, inodes isn't the biggest issue of tar, as tar *is* a common > operation on tracefs. Correction. tar would be a common operation if it worked ;-) -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-25 18:08 ` Steven Rostedt @ 2024-01-26 8:07 ` Geert Uytterhoeven 2024-01-26 10:11 ` Christian Brauner 2024-01-26 13:16 ` Steven Rostedt 0 siblings, 2 replies; 30+ messages in thread From: Geert Uytterhoeven @ 2024-01-26 8:07 UTC (permalink / raw) To: Steven Rostedt Cc: Christian Brauner, Linus Torvalds, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher Hi Steven. On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 25 Jan 2024 13:07:31 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > Actually, inodes isn't the biggest issue of tar, as tar *is* a common > > operation on tracefs. > > Correction. tar would be a common operation if it worked ;-) What would be needed to fix that? I regularly use tar on other virtual file systems (e.g. /sys/firmware/devicetree/), which works fine. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-26 8:07 ` Geert Uytterhoeven @ 2024-01-26 10:11 ` Christian Brauner 2024-01-26 16:25 ` Steven Rostedt 2024-01-26 13:16 ` Steven Rostedt 1 sibling, 1 reply; 30+ messages in thread From: Christian Brauner @ 2024-01-26 10:11 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Steven Rostedt, Linus Torvalds, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Fri, Jan 26, 2024 at 09:07:06AM +0100, Geert Uytterhoeven wrote: > Hi Steven. > > On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Thu, 25 Jan 2024 13:07:31 -0500 > > Steven Rostedt <rostedt@goodmis.org> wrote: > > > Actually, inodes isn't the biggest issue of tar, as tar *is* a common > > > operation on tracefs. > > > > Correction. tar would be a common operation if it worked ;-) > > What would be needed to fix that? I regularly use tar on other virtual > file systems (e.g. /sys/firmware/devicetree/), which works fine. The size would be one thing. The other is that tar requires unique inode numbers for all files iirc (That's why we have this whole btrfs problem - let's not get into this here - where inode numbers aren't unique and are duplicated per subvolume.). ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-26 10:11 ` Christian Brauner @ 2024-01-26 16:25 ` Steven Rostedt 2024-01-26 19:09 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2024-01-26 16:25 UTC (permalink / raw) To: Christian Brauner Cc: Geert Uytterhoeven, Linus Torvalds, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Fri, 26 Jan 2024 11:11:39 +0100 Christian Brauner <brauner@kernel.org> wrote: > The size would be one thing. The other is that tar requires unique inode > numbers for all files iirc (That's why we have this whole btrfs problem > - let's not get into this here - where inode numbers aren't unique and > are duplicated per subvolume.). Well, I guess that answers Linus's question about wondering if there's any user space program that actually cares what the inodes are for files. The answer is "yes" and the program is "tar". And because tar cares, I think I should fix it for tracefs even if it doesn't work because of size. But the size issue is a trivial fix if I just default it to 1 page. I currently use get_next_ino(), but I can use my own version of that, and because each directory has a fixed number of files, I could have it be: /* Copied from get_next_ino but adds allocation for multiple inodes */ #define LAST_INO_BATCH 1024 #define LAST_INO_MASK (~(LAST_INO_BATCH - 1)) static DEFINE_PER_CPU(unsigned int, last_ino); unsigned int tracefs_get_next_ino(int files) { unsigned int *p = &get_cpu_var(last_ino); unsigned int res = *p; #ifdef CONFIG_SMP /* Check if adding files+1 overflows */ if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) { static atomic_t shared_last_ino; int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino); res = next - LAST_INO_BATCH; } #endif res++; /* get_next_ino should not provide a 0 inode number */ if (unlikely(!res)) res++; *p = res + files; put_cpu_var(last_ino); return res; } This way the eventfs inode would tell tracefs_get_next_ino() how many inode numbers it needs for its files and then when it creates the file inode, it can use: inode->i_ino = ei->ino + idx; Where idx is the index into the d_children array of the directory's eventfs_inode. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-26 16:25 ` Steven Rostedt @ 2024-01-26 19:09 ` Linus Torvalds 0 siblings, 0 replies; 30+ messages in thread From: Linus Torvalds @ 2024-01-26 19:09 UTC (permalink / raw) To: Steven Rostedt Cc: Christian Brauner, Geert Uytterhoeven, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Fri, 26 Jan 2024 at 08:26, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 26 Jan 2024 11:11:39 +0100 > Christian Brauner <brauner@kernel.org> wrote: > > > The size would be one thing. The other is that tar requires unique inode > > numbers for all files iirc (That's why we have this whole btrfs problem > > - let's not get into this here - where inode numbers aren't unique and > > are duplicated per subvolume.). > > Well, I guess that answers Linus's question about wondering if there's any > user space program that actually cares what the inodes are for files. The > answer is "yes" and the program is "tar". Well, the fact that it hits snapshots, shows that the real problem is just "tar does stupid things that it shouldn't do". Yes, inode numbers used to be special, and there's history behind it. But we should basically try very hard to walk away from that broken history. An inode number just isn't a unique descriptor any more. We're not living in the 1970s, and filesystems have changed. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-26 8:07 ` Geert Uytterhoeven 2024-01-26 10:11 ` Christian Brauner @ 2024-01-26 13:16 ` Steven Rostedt 2024-01-26 14:06 ` Steven Rostedt 1 sibling, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2024-01-26 13:16 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Christian Brauner, Linus Torvalds, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Fri, 26 Jan 2024 09:07:06 +0100 Geert Uytterhoeven <geert@linux-m68k.org> wrote: > Hi Steven. > > On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Thu, 25 Jan 2024 13:07:31 -0500 > > Steven Rostedt <rostedt@goodmis.org> wrote: > > > Actually, inodes isn't the biggest issue of tar, as tar *is* a common > > > operation on tracefs. > > > > Correction. tar would be a common operation if it worked ;-) > > What would be needed to fix that? I regularly use tar on other virtual > file systems (e.g. /sys/firmware/devicetree/), which works fine. Looks like all /sys files have one page in size. I could change the default file size to one page and it might work (if the inodes had different numbers), as I don't see any format file greater than 4k. This would fix the events for tar, but it couldn't do the same for tracing. As some files don't actually have a size. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-26 13:16 ` Steven Rostedt @ 2024-01-26 14:06 ` Steven Rostedt 0 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-26 14:06 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Christian Brauner, Linus Torvalds, Kees Cook, linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Al Viro, Ajay Kaher On Fri, 26 Jan 2024 08:16:16 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Fri, 26 Jan 2024 09:07:06 +0100 > Geert Uytterhoeven <geert@linux-m68k.org> wrote: > > > Hi Steven. > > > > On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > On Thu, 25 Jan 2024 13:07:31 -0500 > > > Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Actually, inodes isn't the biggest issue of tar, as tar *is* a common > > > > operation on tracefs. > > > > > > Correction. tar would be a common operation if it worked ;-) > > > > What would be needed to fix that? I regularly use tar on other virtual > > file systems (e.g. /sys/firmware/devicetree/), which works fine. > > Looks like all /sys files have one page in size. I could change the default > file size to one page and it might work (if the inodes had different > numbers), as I don't see any format file greater than 4k. > > This would fix the events for tar, but it couldn't do the same for tracing. > As some files don't actually have a size. And this patch makes a unique inode number for files too. I also found a slight bug where the inode number is generated twice. Once to create the inode, and then again overwriting it with the eventfs inode logic. That just makes it more likely for the inode numbers to overflow. This fixes that too. diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 6b211522a13e..7be7a694b106 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -32,14 +32,11 @@ */ static DEFINE_MUTEX(eventfs_mutex); -/* Choose something "unique" ;-) */ -#define EVENTFS_FILE_INODE_INO 0x12c4e37 - /* Just try to make something consistent and unique */ -static int eventfs_dir_ino(struct eventfs_inode *ei) +static int eventfs_dir_ino(struct eventfs_inode *ei, int nr_files) { if (!ei->ino) - ei->ino = get_next_ino(); + ei->ino = tracefs_get_next_ino(nr_files); return ei->ino; } @@ -327,6 +324,7 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid) * @parent: parent dentry for this file. * @data: something that the caller will want to get to later on. * @fop: struct file_operations that should be used for this file. + * @ino: inode number for this file * * This function creates a dentry that represents a file in the eventsfs_inode * directory. The inode.i_private pointer will point to @data in the open() @@ -335,7 +333,8 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid) static struct dentry *create_file(const char *name, umode_t mode, struct eventfs_attr *attr, struct dentry *parent, void *data, - const struct file_operations *fop) + const struct file_operations *fop, + unsigned int ino) { struct tracefs_inode *ti; struct dentry *dentry; @@ -363,9 +362,7 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_op = &eventfs_file_inode_operations; inode->i_fop = fop; inode->i_private = data; - - /* All files will have the same inode number */ - inode->i_ino = EVENTFS_FILE_INODE_INO; + inode->i_ino = ino; ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -377,12 +374,14 @@ static struct dentry *create_file(const char *name, umode_t mode, /** * create_dir - create a dir in the tracefs filesystem * @ei: the eventfs_inode that represents the directory to create - * @parent: parent dentry for this file. + * @parent: parent dentry for this directory. + * @nr_files: The number of files (not directories) this directory has * * This function will create a dentry for a directory represented by * a eventfs_inode. */ -static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent) +static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent, + int nr_files) { struct tracefs_inode *ti; struct dentry *dentry; @@ -404,7 +403,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_fop = &eventfs_file_operations; /* All directories will have the same inode number */ - inode->i_ino = eventfs_dir_ino(ei); + inode->i_ino = eventfs_dir_ino(ei, nr_files); ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -504,7 +503,7 @@ create_file_dentry(struct eventfs_inode *ei, int idx, mutex_unlock(&eventfs_mutex); - dentry = create_file(name, mode, attr, parent, data, fops); + dentry = create_file(name, mode, attr, parent, data, fops, ei->ino + idx + 1); mutex_lock(&eventfs_mutex); @@ -598,7 +597,7 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, } mutex_unlock(&eventfs_mutex); - dentry = create_dir(ei, parent); + dentry = create_dir(ei, parent, ei->nr_entries); mutex_lock(&eventfs_mutex); @@ -786,7 +785,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) if (r <= 0) continue; - ino = EVENTFS_FILE_INODE_INO; + ino = ei->ino + i + 1; if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) goto out; @@ -810,7 +809,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = ei_child->name; - ino = eventfs_dir_ino(ei_child); + ino = eventfs_dir_ino(ei_child, ei_child->nr_entries); if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..2187be6d7b23 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = { .setattr = tracefs_setattr, }; +/* Copied from get_next_ino() but adds allocation for multiple inodes */ +#define LAST_INO_BATCH 1024 +#define LAST_INO_MASK (~(LAST_INO_BATCH - 1)) +static DEFINE_PER_CPU(unsigned int, last_ino); + +unsigned int tracefs_get_next_ino(int files) +{ + unsigned int *p = &get_cpu_var(last_ino); + unsigned int res = *p; + +#ifdef CONFIG_SMP + /* Check if adding files+1 overflows */ + if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) { + static atomic_t shared_last_ino; + int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino); + + res = next - LAST_INO_BATCH; + } +#endif + + res++; + /* get_next_ino should not provide a 0 inode number */ + if (unlikely(!res)) + res++; + *p = res + files; + put_cpu_var(last_ino); + return res; +} + struct inode *tracefs_get_inode(struct super_block *sb) { struct inode *inode = new_inode(sb); - if (inode) { - inode->i_ino = get_next_ino(); + if (inode) simple_inode_init_ts(inode); - } + return inode; } @@ -644,6 +672,8 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, inode->i_private = data; inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + inode->i_ino = tracefs_get_next_ino(0); + d_instantiate(dentry, inode); fsnotify_create(d_inode(dentry->d_parent), dentry); return tracefs_end_creating(dentry); @@ -669,6 +699,7 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, inode->i_fop = &simple_dir_operations; inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + inode->i_ino = tracefs_get_next_ino(0); ti = get_tracefs(inode); ti->private = instance_inode(parent, inode); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 45397df9bb65..7dd6678229d0 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -75,6 +75,7 @@ static inline struct tracefs_inode *get_tracefs(const struct inode *inode) return container_of(inode, struct tracefs_inode, vfs_inode); } +unsigned int tracefs_get_next_ino(int files); struct dentry *tracefs_start_creating(const char *name, struct dentry *parent); struct dentry *tracefs_end_creating(struct dentry *dentry); struct dentry *tracefs_failed_creating(struct dentry *dentry); ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 15:06 ` Steven Rostedt 2024-01-22 16:23 ` Geert Uytterhoeven @ 2024-01-22 17:14 ` Mathieu Desnoyers 2024-01-22 17:50 ` Steven Rostedt 1 sibling, 1 reply; 30+ messages in thread From: Mathieu Desnoyers @ 2024-01-22 17:14 UTC (permalink / raw) To: Steven Rostedt, Geert Uytterhoeven, Kees Cook, Linus Torvalds Cc: linux-kernel, Masami Hiramatsu, Mark Rutland, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On 2024-01-22 10:06, Steven Rostedt wrote: > On Mon, 22 Jan 2024 11:38:52 +0100 > Geert Uytterhoeven <geert@linux-m68k.org> wrote: > [...] >> >> On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: >>> From: "Steven Rostedt (Google)" <rostedt@goodmis.org> >>> >>> The dentries and inodes are created in the readdir for the sole purpose of >>> getting a consistent inode number. Linus stated that is unnecessary, and >>> that all inodes can have the same inode number. For a virtual file system >>> they are pretty meaningless. >>> >>> Instead use a single unique inode number for all files and one for all >>> directories. >>> >>> Link: https://lore.kernel.org/all/20240116133753.2808d45e@gandalf.local.home/ > > Yeah, Linus wanted me to try this first and see if there's any regressions. > Well, I guess you just answered that. > >> This confuses "find". >> Running "find /sys/" now prints lots of error messages to stderr: >> >> find: File system loop detected; >> ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of >> the same file system loop as >> ‘/sys/kernel/debug/tracing/events/initcall’. > > So at a minimum, the directories need to have unique inode numbers. [...] > I'm using the eventfs_inode pointer to create a unique value for the inode. > But it's being salted, hashed and then truncated. As it is very easy to > read inodes (although by default, only root has access to read these > inodes), the inode numbers themselves shouldn't be able to leak kernel > addresses via the results of these inode numbers, would it? Why use an improvised hashing function (re-purposed from scripts/kconfig/symbol.c to a use-case which is exposed through a userspace ABI prone to kernel address leaks) rather than simply reserving values by setting bits in a bitmap ? How many inodes do we realistically expect to have there ? On my 6.1.0 kernel: find /sys/kernel/tracing | wc -l 15598 (mainly due to TRACE_EVENT ABI files) Hashing risks: - Exposing kernel addresses if the hashing algorithm is broken, - Collisions if users are unlucky (which could trigger those 'find' errors). Those 15598 inode values fit within a single page (bitmap of 1922 bytes). So I would recommend simply adding a bitmap per tracefs filesystem instance to keep track of inode number allocation. Creation/removal of files/directories in tracefs should not be a fast-path anyway, so who cares about the speed of a find first bit within a single page ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 17:14 ` Mathieu Desnoyers @ 2024-01-22 17:50 ` Steven Rostedt 2024-01-22 18:35 ` Mathieu Desnoyers 0 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2024-01-22 17:50 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Geert Uytterhoeven, Kees Cook, Linus Torvalds, linux-kernel, Masami Hiramatsu, Mark Rutland, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 12:14:36 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > Why use an improvised hashing function (re-purposed from > scripts/kconfig/symbol.c to a use-case which is exposed through a That hash is just salt to the real hash function, which is the siphash_1u32(). I added the name hash so that each file will get a little different salt to the hash. The siphash_1u32() is what the rest of the kernel uses for hashing kernel address space. > userspace ABI prone to kernel address leaks) rather than simply > reserving values by setting bits in a bitmap ? > > How many inodes do we realistically expect to have there ? If I only do directories, it is actually significantly less. > > On my 6.1.0 kernel: > > find /sys/kernel/tracing | wc -l > 15598 > > (mainly due to TRACE_EVENT ABI files) > > Hashing risks: > > - Exposing kernel addresses if the hashing algorithm is broken, Well this was my biggest concern, but if I truncate at least a nibble, with the unique salt to the algorithm for each file, how easily does that expose kernel addresses. The ei itself, is created from kmalloc() so you would at best get a heap address. But with the missing nibble (if I mask it with ((1 << 28) - 1), and much more taken away for 64 bit systems), and the added unique salt, is it possible for this to expose anything that could be used in an attack? > - Collisions if users are unlucky (which could trigger those > 'find' errors). > > Those 15598 inode values fit within a single page (bitmap of > 1922 bytes). > > So I would recommend simply adding a bitmap per tracefs filesystem > instance to keep track of inode number allocation. And how do I recover this bit after the inode is freed, but then referenced again? > > Creation/removal of files/directories in tracefs should not be > a fast-path anyway, so who cares about the speed of a find first > bit within a single page ? > When an inode is no longer referenced, it is freed. When it is referenced again, I want it to be recreated with the same inode number it had previously. How would having a bitmask help with that? I need a way to map an ei structure with a unique number without adding another 4 bytes to the structure itself. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 17:50 ` Steven Rostedt @ 2024-01-22 18:35 ` Mathieu Desnoyers 2024-01-22 19:59 ` Steven Rostedt 0 siblings, 1 reply; 30+ messages in thread From: Mathieu Desnoyers @ 2024-01-22 18:35 UTC (permalink / raw) To: Steven Rostedt Cc: Geert Uytterhoeven, Kees Cook, Linus Torvalds, linux-kernel, Masami Hiramatsu, Mark Rutland, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On 2024-01-22 12:50, Steven Rostedt wrote: > On Mon, 22 Jan 2024 12:14:36 -0500 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: [...] >> On my 6.1.0 kernel: >> >> find /sys/kernel/tracing | wc -l >> 15598 >> >> (mainly due to TRACE_EVENT ABI files) >> >> Hashing risks: >> >> - Exposing kernel addresses if the hashing algorithm is broken, > > Well this was my biggest concern, but if I truncate at least a nibble, with > the unique salt to the algorithm for each file, how easily does that expose > kernel addresses. > > The ei itself, is created from kmalloc() so you would at best get a heap > address. But with the missing nibble (if I mask it with ((1 << 28) - 1), > and much more taken away for 64 bit systems), and the added unique salt, is > it possible for this to expose anything that could be used in an attack? I don't know, which is why I am concerned about it. But I don't think we should spend time trying to understand all possible attack scenarios associated with hashing of kernel addresses when there are much simpler options available. > >> - Collisions if users are unlucky (which could trigger those >> 'find' errors). >> >> Those 15598 inode values fit within a single page (bitmap of >> 1922 bytes). >> >> So I would recommend simply adding a bitmap per tracefs filesystem >> instance to keep track of inode number allocation. > > And how do I recover this bit after the inode is freed, but then referenced > again? You would keep the allocated inode number value within your data structure associated with the inode. If you never free inodes, then you can just use a static increment as Linus suggested. But AFAIU there are cases where you free inodes, hence my suggestion of bitmap. When the inode is freed, you know which inode number is associated from the field in your data structure, so you can clear this bit in the bitmap. On the next inode allocation, you find-first-zero-bit in the bitmap, and set it to one to reserve it. > >> >> Creation/removal of files/directories in tracefs should not be >> a fast-path anyway, so who cares about the speed of a find first >> bit within a single page ? >> > > When an inode is no longer referenced, it is freed. When it is referenced > again, I want it to be recreated with the same inode number it had > previously. How would having a bitmask help with that? I need a way to map > an ei structure with a unique number without adding another 4 bytes to the > structure itself. As discussed in a separate exchange with Linus, why do you care so much about not adding a 4 bytes field to the structure ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same 2024-01-22 18:35 ` Mathieu Desnoyers @ 2024-01-22 19:59 ` Steven Rostedt 0 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-22 19:59 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Geert Uytterhoeven, Kees Cook, Linus Torvalds, linux-kernel, Masami Hiramatsu, Mark Rutland, Andrew Morton, Christian Brauner, Al Viro, Ajay Kaher On Mon, 22 Jan 2024 13:35:43 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > When an inode is no longer referenced, it is freed. When it is referenced > > again, I want it to be recreated with the same inode number it had > > previously. How would having a bitmask help with that? I need a way to map > > an ei structure with a unique number without adding another 4 bytes to the > > structure itself. > > As discussed in a separate exchange with Linus, why do you care so much about > not adding a 4 bytes field to the structure ? I'm trying to keep the memory overhead of tracing down as much as possible. 4 bytes for 2000 events (and growing) just adds to the memory footprint of tracing. And an eventfs_inode is the link between control of an event per instance. Thus, it may only be 8k for those 2000 events, but that's another 8k for each instance you use. You make 10 instances, that is now 80k. This is used in embedded systems as well, so every byte counts. But as Linus pointed out, the issue is moot, as the structure ends with a single 32bit int. And on 64 bit machines, there's likely a 4 byte hole that we can use there. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* [for-linus][PATCH 2/3] eventfs: Do not create dentries nor inodes in iterate_shared 2024-01-17 14:35 [for-linus][PATCH 0/3] eventfs: A few more fixes for 6.8 Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same Steven Rostedt @ 2024-01-17 14:35 ` Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 3/3] eventfs: Use kcalloc() instead of kzalloc() Steven Rostedt 2 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-17 14:35 UTC (permalink / raw) To: linux-kernel Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Linus Torvalds, Christian Brauner, Al Viro, Ajay Kaher, kernel test robot From: "Steven Rostedt (Google)" <rostedt@goodmis.org> The original eventfs code added a wrapper around the dcache_readdir open callback and created all the dentries and inodes at open, and increment their ref count. A wrapper was added around the dcache_readdir release function to decrement all the ref counts of those created inodes and dentries. But this proved to be buggy[1] for when a kprobe was created during a dir read, it would create a dentry between the open and the release, and because the release would decrement all ref counts of all files and directories, that would include the kprobe directory that was not there to have its ref count incremented in open. This would cause the ref count to go to negative and later crash the kernel. To solve this, the dentries and inodes that were created and had their ref count upped in open needed to be saved. That list needed to be passed from the open to the release, so that the release would only decrement the ref counts of the entries that were incremented in the open. Unfortunately, the dcache_readdir logic was already using the file->private_data, which is the only field that can be used to pass information from the open to the release. What was done was the eventfs created another descriptor that had a void pointer to save the dcache_readdir pointer, and it wrapped all the callbacks, so that it could save the list of entries that had their ref counts incremented in the open, and pass it to the release. The wrapped callbacks would just put back the dcache_readdir pointer and call the functions it used so it could still use its data[2]. But Linus had an issue with the "hijacking" of the file->private_data (unfortunately this discussion was on a security list, so no public link). Which we finally agreed on doing everything within the iterate_shared callback and leave the dcache_readdir out of it[3]. All the information needed for the getents() could be created then. But this ended up being buggy too[4]. The iterate_shared callback was not the right place to create the dentries and inodes. Even Christian Brauner had issues with that[5]. An attempt was to go back to creating the inodes and dentries at the open, create an array to store the information in the file->private_data, and pass that information to the other callbacks.[6] The difference between that and the original method, is that it does not use dcache_readdir. It also does not up the ref counts of the dentries and pass them. Instead, it creates an array of a structure that saves the dentry's name and inode number. That information is used in the iterate_shared callback, and the array is freed in the dir release. The dentries and inodes created in the open are not used for the iterate_share or release callbacks. Just their names and inode numbers. Linus did not like that either[7] and just wanted to remove the dentries being created in iterate_shared and use the hard coded inode numbers. [ All this while Linus enjoyed an unexpected vacation during the merge window due to lack of power. ] [1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/ [2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home/ [3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218477@goodmis.org/ [4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.sang@intel.com/ [5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/ [6] https://lore.kernel.org/all/20240116114711.7e8637be@gandalf.local.home/ [7] https://lore.kernel.org/all/20240116170154.5bf0a250@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.573784051@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.sang@intel.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> --- fs/tracefs/event_inode.c | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 5edf0b96758b..10580d6b5012 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -727,8 +727,6 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) struct eventfs_inode *ei_child; struct tracefs_inode *ti; struct eventfs_inode *ei; - struct dentry *ei_dentry = NULL; - struct dentry *dentry; const char *name; umode_t mode; int idx; @@ -749,11 +747,11 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) mutex_lock(&eventfs_mutex); ei = READ_ONCE(ti->private); - if (ei && !ei->is_freed) - ei_dentry = READ_ONCE(ei->dentry); + if (ei && ei->is_freed) + ei = NULL; mutex_unlock(&eventfs_mutex); - if (!ei || !ei_dentry) + if (!ei) goto out; /* @@ -780,11 +778,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) if (r <= 0) continue; - dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); - if (!dentry) - goto out; - ino = dentry->d_inode->i_ino; - dput(dentry); + ino = EVENTFS_FILE_INODE_INO; if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) goto out; @@ -808,11 +802,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = ei_child->name; - dentry = create_dir_dentry(ei, ei_child, ei_dentry); - if (!dentry) - goto out_dec; - ino = dentry->d_inode->i_ino; - dput(dentry); + ino = EVENTFS_DIR_INODE_INO; if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; -- 2.43.0 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [for-linus][PATCH 3/3] eventfs: Use kcalloc() instead of kzalloc() 2024-01-17 14:35 [for-linus][PATCH 0/3] eventfs: A few more fixes for 6.8 Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 2/3] eventfs: Do not create dentries nor inodes in iterate_shared Steven Rostedt @ 2024-01-17 14:35 ` Steven Rostedt 2 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2024-01-17 14:35 UTC (permalink / raw) To: linux-kernel Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Erick Archer, Gustavo A. R. Silva From: Erick Archer <erick.archer@gmx.com> As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of them overflowing. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. So, use the purpose specific kcalloc() function instead of the argument size * count in the kzalloc() function. [1] https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments Link: https://lore.kernel.org/linux-trace-kernel/20240115181658.4562-1-erick.archer@gmx.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://github.com/KSPP/linux/issues/162 Signed-off-by: Erick Archer <erick.archer@gmx.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> --- fs/tracefs/event_inode.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 10580d6b5012..6795fda2af19 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -97,7 +97,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, /* Preallocate the children mode array if necessary */ if (!(dentry->d_inode->i_mode & S_IFDIR)) { if (!ei->entry_attrs) { - ei->entry_attrs = kzalloc(sizeof(*ei->entry_attrs) * ei->nr_entries, + ei->entry_attrs = kcalloc(ei->nr_entries, sizeof(*ei->entry_attrs), GFP_NOFS); if (!ei->entry_attrs) { ret = -ENOMEM; @@ -874,7 +874,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode } if (size) { - ei->d_children = kzalloc(sizeof(*ei->d_children) * size, GFP_KERNEL); + ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL); if (!ei->d_children) { kfree_const(ei->name); kfree(ei); @@ -941,7 +941,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry goto fail; if (size) { - ei->d_children = kzalloc(sizeof(*ei->d_children) * size, GFP_KERNEL); + ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL); if (!ei->d_children) goto fail; } -- 2.43.0 ^ permalink raw reply related [flat|nested] 30+ messages in thread
end of thread, other threads:[~2024-01-26 19:09 UTC | newest] Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-01-17 14:35 [for-linus][PATCH 0/3] eventfs: A few more fixes for 6.8 Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same Steven Rostedt 2024-01-22 10:38 ` Geert Uytterhoeven 2024-01-22 15:06 ` Steven Rostedt 2024-01-22 16:23 ` Geert Uytterhoeven 2024-01-22 16:47 ` Steven Rostedt 2024-01-22 17:37 ` Linus Torvalds 2024-01-22 17:39 ` Linus Torvalds 2024-01-22 18:19 ` Linus Torvalds 2024-01-22 18:27 ` Mathieu Desnoyers 2024-01-22 19:37 ` Steven Rostedt 2024-01-22 18:50 ` Kees Cook 2024-01-22 19:44 ` Steven Rostedt 2024-01-22 19:48 ` Steven Rostedt 2024-01-22 21:33 ` Kees Cook 2024-01-25 17:40 ` Christian Brauner 2024-01-25 18:07 ` Steven Rostedt 2024-01-25 18:08 ` Steven Rostedt 2024-01-26 8:07 ` Geert Uytterhoeven 2024-01-26 10:11 ` Christian Brauner 2024-01-26 16:25 ` Steven Rostedt 2024-01-26 19:09 ` Linus Torvalds 2024-01-26 13:16 ` Steven Rostedt 2024-01-26 14:06 ` Steven Rostedt 2024-01-22 17:14 ` Mathieu Desnoyers 2024-01-22 17:50 ` Steven Rostedt 2024-01-22 18:35 ` Mathieu Desnoyers 2024-01-22 19:59 ` Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 2/3] eventfs: Do not create dentries nor inodes in iterate_shared Steven Rostedt 2024-01-17 14:35 ` [for-linus][PATCH 3/3] eventfs: Use kcalloc() instead of kzalloc() Steven Rostedt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).