* sysfs regression: wrong link counts @ 2012-01-30 21:56 Jiri Slaby 2012-01-30 22:06 ` Greg KH ` (3 more replies) 0 siblings, 4 replies; 75+ messages in thread From: Jiri Slaby @ 2012-01-30 21:56 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Greg KH, LKML Hi, I cannot boot properly with this commit: commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 Author: Eric W. Biederman <ebiederm@xmission.com> Date: Sun Dec 18 20:09:31 2011 -0800 sysfs: Kill nlink counting. 1) network systemd rule doesn't start network 2) sensors complain: sensors_init: Kernel interface error ad 2) look at what it does: /* returns !0 if sysfs filesystem was found, 0 otherwise */ int sensors_init_sysfs(void) { struct stat statbuf; snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys"); if (stat(sensors_sysfs_mount, &statbuf) < 0 || statbuf.st_nlink <= 2) /* Empty directory */ return 0; return 1; } So this looks like it became a part of ABI we cannot break... A revert of this commit on the top of today's -next fixes the problem. thanks, -- js suse labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby @ 2012-01-30 22:06 ` Greg KH 2012-01-30 22:10 ` Alan Cox 2012-01-30 22:52 ` Kay Sievers ` (2 subsequent siblings) 3 siblings, 1 reply; 75+ messages in thread From: Greg KH @ 2012-01-30 22:06 UTC (permalink / raw) To: Jiri Slaby; +Cc: Eric W. Biederman, LKML, systemd-devel On Mon, Jan 30, 2012 at 10:56:26PM +0100, Jiri Slaby wrote: > Hi, > > I cannot boot properly with this commit: > commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 > Author: Eric W. Biederman <ebiederm@xmission.com> > Date: Sun Dec 18 20:09:31 2011 -0800 > > sysfs: Kill nlink counting. > > > 1) network systemd rule doesn't start network > 2) sensors complain: > sensors_init: Kernel interface error Odd. What in systemd is causing a reliance on this? > ad 2) look at what it does: > /* returns !0 if sysfs filesystem was found, 0 otherwise */ > int sensors_init_sysfs(void) > { > struct stat statbuf; > > snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys"); > if (stat(sensors_sysfs_mount, &statbuf) < 0 > || statbuf.st_nlink <= 2) /* Empty directory */ > return 0; > > return 1; > } Ah, a hack to see if the directory is empty. Isn't there some other "proper" way of doing this in userspace, or is this really the correct way? > So this looks like it became a part of ABI we cannot break... > > A revert of this commit on the top of today's -next fixes the problem. Ick. Eric, care to fix this up, or do you want me to revert it? thanks, greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 22:06 ` Greg KH @ 2012-01-30 22:10 ` Alan Cox 2012-01-30 22:27 ` Greg KH 2012-01-31 3:45 ` sysfs regression: wrong link counts Eric W. Biederman 0 siblings, 2 replies; 75+ messages in thread From: Alan Cox @ 2012-01-30 22:10 UTC (permalink / raw) To: Greg KH; +Cc: Jiri Slaby, Eric W. Biederman, LKML, systemd-devel > Isn't there some other "proper" way of doing this in userspace, or is > this really the correct way? You can look at the S_IFMT bits and stuff however link count indicating number of subdirectories is a standard Unix thing and used by many quite mundane tools as an optimisation. Alan ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 22:10 ` Alan Cox @ 2012-01-30 22:27 ` Greg KH 2012-01-30 22:43 ` Al Viro 2012-01-31 1:27 ` Eric W. Biederman 2012-01-31 3:45 ` sysfs regression: wrong link counts Eric W. Biederman 1 sibling, 2 replies; 75+ messages in thread From: Greg KH @ 2012-01-30 22:27 UTC (permalink / raw) To: Alan Cox, Eric W. Biederman; +Cc: Jiri Slaby, LKML, systemd-devel On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote: > > Isn't there some other "proper" way of doing this in userspace, or is > > this really the correct way? > > You can look at the S_IFMT bits and stuff however link count indicating > number of subdirectories is a standard Unix thing and used by many quite > mundane tools as an optimisation. Ah, yeah, that is easier. Eric, care to fix this or want me to revert it? thanks, greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 22:27 ` Greg KH @ 2012-01-30 22:43 ` Al Viro 2012-01-30 22:56 ` Al Viro 2012-01-31 1:27 ` Eric W. Biederman 1 sibling, 1 reply; 75+ messages in thread From: Al Viro @ 2012-01-30 22:43 UTC (permalink / raw) To: Greg KH; +Cc: Alan Cox, Eric W. Biederman, Jiri Slaby, LKML, systemd-devel On Mon, Jan 30, 2012 at 02:27:17PM -0800, Greg KH wrote: > On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote: > > > Isn't there some other "proper" way of doing this in userspace, or is > > > this really the correct way? > > > > You can look at the S_IFMT bits and stuff however link count indicating > > number of subdirectories is a standard Unix thing and used by many quite > > mundane tools as an optimisation. > > Ah, yeah, that is easier. > > Eric, care to fix this or want me to revert it? Fix _what_? Userland shite quoted upthread? Because that's where the bug is - the mundane tools mentioned by Alan treat 1 in st_nlink as "no information about the number of subdirectories". And shite might be too mild a term for the little gem in question, really... ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 22:43 ` Al Viro @ 2012-01-30 22:56 ` Al Viro 0 siblings, 0 replies; 75+ messages in thread From: Al Viro @ 2012-01-30 22:56 UTC (permalink / raw) To: Greg KH; +Cc: Alan Cox, Eric W. Biederman, Jiri Slaby, LKML, systemd-devel On Mon, Jan 30, 2012 at 10:43:50PM +0000, Al Viro wrote: > On Mon, Jan 30, 2012 at 02:27:17PM -0800, Greg KH wrote: > > On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote: > > > > Isn't there some other "proper" way of doing this in userspace, or is > > > > this really the correct way? > > > > > > You can look at the S_IFMT bits and stuff however link count indicating > > > number of subdirectories is a standard Unix thing and used by many quite > > > mundane tools as an optimisation. > > > > Ah, yeah, that is easier. > > > > Eric, care to fix this or want me to revert it? > > Fix _what_? Userland shite quoted upthread? Because that's where the bug > is - the mundane tools mentioned by Alan treat 1 in st_nlink as "no > information about the number of subdirectories". And shite might be too > mild a term for the little gem in question, really... To repeat this piece of bogosity for those who might've missed it: /* returns !0 if sysfs filesystem was found, 0 otherwise */ int sensors_init_sysfs(void) { struct stat statbuf; snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys"); if (stat(sensors_sysfs_mount, &statbuf) < 0 || statbuf.st_nlink <= 2) /* Empty directory */ return 0; return 1; } which is completely bogus - contrary to what it says in comments, it does *not* check anything about sysfs (or directories being empty). Checking that sysfs is mounted on /sys could be done by statfs(2) and checking ->f_type, or, considering what the code in lib/sysfs.c is doing, just checking that /sys/class is readable and failing otherwise. Sigh... ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 22:27 ` Greg KH 2012-01-30 22:43 ` Al Viro @ 2012-01-31 1:27 ` Eric W. Biederman 2012-01-31 10:48 ` Jiri Slaby 1 sibling, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-01-31 1:27 UTC (permalink / raw) To: Greg KH; +Cc: Alan Cox, Jiri Slaby, LKML, systemd-devel Greg KH <greg@kroah.com> writes: > On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote: >> > Isn't there some other "proper" way of doing this in userspace, or is >> > this really the correct way? >> >> You can look at the S_IFMT bits and stuff however link count indicating >> number of subdirectories is a standard Unix thing and used by many quite >> mundane tools as an optimisation. > > Ah, yeah, that is easier. > > Eric, care to fix this or want me to revert it? With respect to sensors the conversation has already been had, and I fix is already queued and should come out in the sensors release due out in a week or so. So sensors should be fixed before this code is merged into Linus's kernel. Now why something minor like sensors that nothing seems to depend strongly on should stop a system from booting is a huge question. So I am going to have the conversation to triage what is up with systemd. That seems totally unexpected. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-31 1:27 ` Eric W. Biederman @ 2012-01-31 10:48 ` Jiri Slaby 2012-01-31 12:44 ` Eric W. Biederman 0 siblings, 1 reply; 75+ messages in thread From: Jiri Slaby @ 2012-01-31 10:48 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg KH, Alan Cox, LKML, systemd-devel, Linus Torvalds, Al Viro On 01/31/2012 02:27 AM, Eric W. Biederman wrote: > Greg KH <greg@kroah.com> writes: > >> On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote: >>>> Isn't there some other "proper" way of doing this in userspace, or is >>>> this really the correct way? >>> >>> You can look at the S_IFMT bits and stuff however link count indicating >>> number of subdirectories is a standard Unix thing and used by many quite >>> mundane tools as an optimisation. >> >> Ah, yeah, that is easier. >> >> Eric, care to fix this or want me to revert it? > > With respect to sensors the conversation has already been had, and I fix > is already queued and should come out in the sensors release due out in > a week or so. So sensors should be fixed before this code is merged > into Linus's kernel. Oh, we are not going to break userspace with 3.3, are we? I understand that what sensors do is nothing but sh*t. But this change should wait until everybody has a chance to have fixed sensors package in their distribution. To be clear, what I'm suggesting is to postpone the change and schedule it for something like 3.7. > Now why something minor like sensors that nothing seems to depend > strongly on should stop a system from booting is a huge question. sensors do not prevent boot at all, of course. The other cannot-start-network bug does. > So I am going to have the conversation to triage what is up with > systemd. That seems totally unexpected. Looks like netdev renaming problem, see my other mail. thanks, -- js suse labs . ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-31 10:48 ` Jiri Slaby @ 2012-01-31 12:44 ` Eric W. Biederman 2012-01-31 16:45 ` Linus Torvalds 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-01-31 12:44 UTC (permalink / raw) To: Jiri Slaby Cc: Greg KH, Alan Cox, LKML, systemd-devel, Linus Torvalds, Al Viro Jiri Slaby <jslaby@suse.cz> writes: > Oh, we are not going to break userspace with 3.3, are we? No. This code is currently scheduled for 3.4. > I understand > that what sensors do is nothing but sh*t. But this change should wait > until everybody has a chance to have fixed sensors package in their > distribution. > > To be clear, what I'm suggesting is to postpone the change and schedule > it for something like 3.7. The sensors update with the fix is scheduled for about a week out, well before 3.3 ships. The code in next for 3.4 is perhaps 4-6 months out from a kernel release. That seems like plenty of time for distros to get an update package for sensors together. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-31 12:44 ` Eric W. Biederman @ 2012-01-31 16:45 ` Linus Torvalds 2012-01-31 19:18 ` Al Viro 2012-02-01 5:06 ` Eric W. Biederman 0 siblings, 2 replies; 75+ messages in thread From: Linus Torvalds @ 2012-01-31 16:45 UTC (permalink / raw) To: Eric W. Biederman Cc: Jiri Slaby, Greg KH, Alan Cox, LKML, systemd-devel, Al Viro On Tue, Jan 31, 2012 at 4:44 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > The sensors update with the fix is scheduled for about a week out, well > before 3.3 ships. That's almost certainly *not* going to help. Guys, people don't update their user space. Some people run modern kernels on the enterprise distros - and those can be *years* out of date. If people report problems, we revert things. It's that simple. It really doesn't matter if you call it a user space bug or not - if user space depends on what we used to do, then we continue to do it. We can *try* the kernel change, but I can already tell you that anybody who thinks that "sensors will be fixed by all users by the time the kernel comes out" is likely totally full of shit. Not to mention that it apparently *already* causes problems for people who want to test -next, and this breaks those peoples setup, and thus means that -next gets less testing. Really. "It's a bug in user space" is *NOT* an excuse for breaking things. Never was. Never is. There are (other) excuses for breaking things, but "user space did something I didn't expect it to do, and I consider it a bug" is absolutely not one of them. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-31 16:45 ` Linus Torvalds @ 2012-01-31 19:18 ` Al Viro 2012-02-01 5:06 ` Eric W. Biederman 1 sibling, 0 replies; 75+ messages in thread From: Al Viro @ 2012-01-31 19:18 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Jiri Slaby, Greg KH, Alan Cox, LKML, systemd-devel On Tue, Jan 31, 2012 at 08:45:58AM -0800, Linus Torvalds wrote: > We can *try* the kernel change, but I can already tell you that > anybody who thinks that "sensors will be fixed by all users by the > time the kernel comes out" is likely totally full of shit. Not to > mention that it apparently *already* causes problems for people who > want to test -next, and this breaks those peoples setup, and thus > means that -next gets less testing. > > Really. "It's a bug in user space" is *NOT* an excuse for breaking > things. Never was. Never is. There are (other) excuses for breaking > things, but "user space did something I didn't expect it to do, and I > consider it a bug" is absolutely not one of them. It's actually "user space did something utterly weird that happened not to break with the current behaviour by accident", and yes, it's better to revert that change for the reasons you've described. But it's definitely a userland bug - as in "code doesn't do what it intends to do" and no matter what we do kernel-side it ought to be fixed in lm-sensors. Several years later it hopefully will become a non-issue (and we'll need to document the conflict with lm-sensors earlier than $FIXED_VERSION in Documentation/Changes when that sysfs patch finally goes in). ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-31 16:45 ` Linus Torvalds 2012-01-31 19:18 ` Al Viro @ 2012-02-01 5:06 ` Eric W. Biederman 2012-02-01 22:21 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman 1 sibling, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-02-01 5:06 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, Greg KH, Alan Cox, LKML, systemd-devel, Al Viro Linus Torvalds <torvalds@linux-foundation.org> writes: > On Tue, Jan 31, 2012 at 4:44 AM, Eric W. Biederman > <ebiederm@xmission.com> wrote: >> >> The sensors update with the fix is scheduled for about a week out, well >> before 3.3 ships. > > That's almost certainly *not* going to help. > > Guys, people don't update their user space. Some people run modern > kernels on the enterprise distros - and those can be *years* out of > date. > > If people report problems, we revert things. It's that simple. It > really doesn't matter if you call it a user space bug or not - if user > space depends on what we used to do, then we continue to do it. > We can *try* the kernel change, but I can already tell you that > anybody who thinks that "sensors will be fixed by all users by the > time the kernel comes out" is likely totally full of shit. Not to > mention that it apparently *already* causes problems for people who > want to test -next, and this breaks those peoples setup, and thus > means that -next gets less testing. By and large I am in agreement with you, and this patch was designed to be very focused in case we ran into userspace that had problems to make it easy to revert. The first question I asked when this was reported is how badly do things break. So I could tell if things needed to be fixed. My current understanding is that it is exactly sensors that breaks. One program that complains loudly and doesn't function. To the best of my knowledge nothing else fails, or depends on sensors. I would like to have the code in -next a little longer to see if anything else breaks. > Really. "It's a bug in user space" is *NOT* an excuse for breaking > things. Never was. Never is. There are (other) excuses for breaking > things, but "user space did something I didn't expect it to do, and I > consider it a bug" is absolutely not one of them. Part of the reason for the patch is it is a partial regression fix for sysfs using more memory in 3.3 and 3.2 than in previous kernels. struct sysfs_dirent is nowhere near as critical as struct page but the same kind of small in structure size add up. Thinking about it I should be able to whip up a tiny patch that adds a Kconfig option. Support for programs that are don't interpret nlinks correctly. That should allow people who want the size reduction to have a smaller sysfs and it should allow people who are conservative to guarantee that the code works, and it should let me find out if anything else out there actually cares. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 5:06 ` Eric W. Biederman @ 2012-02-01 22:21 ` Eric W. Biederman 2012-02-01 22:24 ` Greg Kroah-Hartman ` (3 more replies) 0 siblings, 4 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-02-01 22:21 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki lm_sensors and possibly other applications get confused if all sysfs directories return nlink == 1. The lm_sensors code that got confused was just wrong and a fixed version of lm_sensors should be released shortly. There may be other applications that have problems with sysfs return nlink == 1 for directories. To allow people to continue to use old versions of userspace with new kernels add to sysfs a compile time option to maintain mostly precise directory counts for those people who don't mind the cost. I have moved where we keep nlink in sysfs_dirent as compared to previous versions of subdirectory counting to a location that packs better. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- fs/sysfs/Kconfig | 15 +++++++++++++++ fs/sysfs/dir.c | 8 ++++++++ fs/sysfs/inode.c | 2 ++ fs/sysfs/sysfs.h | 38 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 63 insertions(+), 0 deletions(-) diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig index 8c41fea..9b403e9 100644 --- a/fs/sysfs/Kconfig +++ b/fs/sysfs/Kconfig @@ -21,3 +21,18 @@ config SYSFS example, "root=03:01" for /dev/hda1. Designers of embedded systems may wish to say N here to conserve space. + +config SYSFS_COUNT_LINKS + bool "sysfs count subdirectoires to support buggy applications" + default n + help + + Increase the memory and cpu untilization of sysfs by maintaining + by maintaining a count of how hard links from subdirectories a + directory has. This allows sysfs to report the directory nlink + as the number of subdirectories plus two. With out this enabled + sysfs will report the nlink of all directories as 1, which is + the standard way of indicating that the number of subdirectoires + is not tracked. + + Unless you know you need this say N here. diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index dd3779c..f30df7c 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -91,6 +91,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd) struct rb_node **node = &sd->s_parent->s_dir.children.rb_node; struct rb_node *parent = NULL; + if (sysfs_type(sd) == SYSFS_DIR) + sysfs_inc_nlink(sd->s_parent); + while (*node) { struct sysfs_dirent *pos; int result; @@ -123,6 +126,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd) */ static void sysfs_unlink_sibling(struct sysfs_dirent *sd) { + if (sysfs_type(sd) == SYSFS_DIR) + sysfs_dec_nlink(sd->s_parent); + rb_erase(&sd->s_rb, &sd->s_parent->s_dir.children); } @@ -367,6 +373,8 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type) sd->s_mode = mode; sd->s_flags = type; + sysfs_init_nlink(sd); + return sd; err_out2: diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c index 4291fd1..782c66a 100644 --- a/fs/sysfs/inode.c +++ b/fs/sysfs/inode.c @@ -216,6 +216,8 @@ static void sysfs_refresh_inode(struct sysfs_dirent *sd, struct inode *inode) iattrs->ia_secdata, iattrs->ia_secdata_len); } + + set_nlink(inode, sysfs_read_nlink(sd)); } int sysfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h index 6289a00..4603506 100644 --- a/fs/sysfs/sysfs.h +++ b/fs/sysfs/sysfs.h @@ -69,6 +69,9 @@ struct sysfs_dirent { const void *s_ns; /* namespace tag */ unsigned int s_hash; /* ns + name hash */ +#ifdef CONFIG_SYSFS_COUNT_LINKS + unsigned int s_nlink; +#endif union { struct sysfs_elem_dir s_dir; struct sysfs_elem_symlink s_symlink; @@ -127,6 +130,41 @@ do { \ #define sysfs_dirent_init_lockdep(sd) do {} while(0) #endif +#ifdef CONFIG_SYSFS_COUNT_LINKS +static inline void sysfs_init_nlink(struct sysfs_dirent *sd) +{ + if (sysfs_type(sd) == SYSFS_DIR) + sd->s_nlink = 2; + else + sd->s_nlink = 1; +} +static inline void sysfs_inc_nlink(struct sysfs_dirent *sd) +{ + sd->s_nlink++; +} +static inline void sysfs_dec_nlink(struct sysfs_dirent *sd) +{ + sd->s_nlink++; +} +static inline unsigned int sysfs_read_nlink(struct sysfs_dirent *sd) +{ + return sd->s_nlink; +} +#else +static inline void sysfs_init_nlink(struct sysfs_dirent *sd) +{ +} +static inline void sysfs_inc_nlink(struct sysfs_dirent *sd) +{ +} +static inline void sysfs_dec_nlink(struct sysfs_dirent *sd) +{ +} +static inline unsigned int sysfs_read_nlink(struct sysfs_dirent *sd) +{ + return 1; +} +#endif /* * Context structure to be used while adding/removing nodes. */ -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 22:21 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman @ 2012-02-01 22:24 ` Greg Kroah-Hartman 2012-02-01 22:44 ` Eric W. Biederman 2012-02-01 22:31 ` Dave Jones ` (2 subsequent siblings) 3 siblings, 1 reply; 75+ messages in thread From: Greg Kroah-Hartman @ 2012-02-01 22:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Jiri Slaby, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote: > > lm_sensors and possibly other applications get confused if all sysfs > directories return nlink == 1. The lm_sensors code that got confused > was just wrong and a fixed version of lm_sensors should be released > shortly. > > There may be other applications that have problems with sysfs return > nlink == 1 for directories. To allow people to continue to use old > versions of userspace with new kernels add to sysfs a compile time > option to maintain mostly precise directory counts for those people who > don't mind the cost. > > I have moved where we keep nlink in sysfs_dirent as compared to previous > versions of subdirectory counting to a location that packs better. > > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > --- > fs/sysfs/Kconfig | 15 +++++++++++++++ > fs/sysfs/dir.c | 8 ++++++++ > fs/sysfs/inode.c | 2 ++ > fs/sysfs/sysfs.h | 38 ++++++++++++++++++++++++++++++++++++++ > 4 files changed, 63 insertions(+), 0 deletions(-) > > diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig > index 8c41fea..9b403e9 100644 > --- a/fs/sysfs/Kconfig > +++ b/fs/sysfs/Kconfig > @@ -21,3 +21,18 @@ config SYSFS > example, "root=03:01" for /dev/hda1. > > Designers of embedded systems may wish to say N here to conserve space. > + > +config SYSFS_COUNT_LINKS > + bool "sysfs count subdirectoires to support buggy applications" > + default n As we don't want to break things, this should be default y, right? Also, should we list this in the feature_removal list as well so that we can get rid of it in a year or so? thanks, greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 22:24 ` Greg Kroah-Hartman @ 2012-02-01 22:44 ` Eric W. Biederman 2012-02-01 22:49 ` Greg Kroah-Hartman 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-02-01 22:44 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote: >> >> lm_sensors and possibly other applications get confused if all sysfs >> directories return nlink == 1. The lm_sensors code that got confused >> was just wrong and a fixed version of lm_sensors should be released >> shortly. >> >> There may be other applications that have problems with sysfs return >> nlink == 1 for directories. To allow people to continue to use old >> versions of userspace with new kernels add to sysfs a compile time >> option to maintain mostly precise directory counts for those people who >> don't mind the cost. >> >> I have moved where we keep nlink in sysfs_dirent as compared to previous >> versions of subdirectory counting to a location that packs better. >> >> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> >> --- >> fs/sysfs/Kconfig | 15 +++++++++++++++ >> fs/sysfs/dir.c | 8 ++++++++ >> fs/sysfs/inode.c | 2 ++ >> fs/sysfs/sysfs.h | 38 ++++++++++++++++++++++++++++++++++++++ >> 4 files changed, 63 insertions(+), 0 deletions(-) >> >> diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig >> index 8c41fea..9b403e9 100644 >> --- a/fs/sysfs/Kconfig >> +++ b/fs/sysfs/Kconfig >> @@ -21,3 +21,18 @@ config SYSFS >> example, "root=03:01" for /dev/hda1. >> >> Designers of embedded systems may wish to say N here to conserve space. >> + >> +config SYSFS_COUNT_LINKS >> + bool "sysfs count subdirectoires to support buggy applications" >> + default n > > As we don't want to break things, this should be default y, right? The new behavior is backwards compatible. What the new behavior is not is bug compatible. So nothing *should* break. Furthermore the breaking we have seen so far is limited to just lm_sensors. That is exactly one program that is not a server failing to start. That seems pretty minor in the worst case. So I really don't expect anyone who ships 3.4 to enable this option. I have written the option solely so that in case my assessment turns out to be wrong there is already a tested solution. I have been through the pain of not being able to upgrade/test a new kernel because of a backwards incompatible change and it can be very unpleasant. > Also, should we list this in the feature_removal list as well so that we > can get rid of it in a year or so? Good idea. I don't know if anyone actually reads feature removal but it is good to serve notice. I will cook up a patch for that. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 22:44 ` Eric W. Biederman @ 2012-02-01 22:49 ` Greg Kroah-Hartman 0 siblings, 0 replies; 75+ messages in thread From: Greg Kroah-Hartman @ 2012-02-01 22:49 UTC (permalink / raw) To: Eric W. Biederman Cc: Jiri Slaby, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki On Wed, Feb 01, 2012 at 02:44:32PM -0800, Eric W. Biederman wrote: > Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > > > On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote: > >> > >> lm_sensors and possibly other applications get confused if all sysfs > >> directories return nlink == 1. The lm_sensors code that got confused > >> was just wrong and a fixed version of lm_sensors should be released > >> shortly. > >> > >> There may be other applications that have problems with sysfs return > >> nlink == 1 for directories. To allow people to continue to use old > >> versions of userspace with new kernels add to sysfs a compile time > >> option to maintain mostly precise directory counts for those people who > >> don't mind the cost. > >> > >> I have moved where we keep nlink in sysfs_dirent as compared to previous > >> versions of subdirectory counting to a location that packs better. > >> > >> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > >> --- > >> fs/sysfs/Kconfig | 15 +++++++++++++++ > >> fs/sysfs/dir.c | 8 ++++++++ > >> fs/sysfs/inode.c | 2 ++ > >> fs/sysfs/sysfs.h | 38 ++++++++++++++++++++++++++++++++++++++ > >> 4 files changed, 63 insertions(+), 0 deletions(-) > >> > >> diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig > >> index 8c41fea..9b403e9 100644 > >> --- a/fs/sysfs/Kconfig > >> +++ b/fs/sysfs/Kconfig > >> @@ -21,3 +21,18 @@ config SYSFS > >> example, "root=03:01" for /dev/hda1. > >> > >> Designers of embedded systems may wish to say N here to conserve space. > >> + > >> +config SYSFS_COUNT_LINKS > >> + bool "sysfs count subdirectoires to support buggy applications" > >> + default n > > > > As we don't want to break things, this should be default y, right? > > The new behavior is backwards compatible. What the new behavior is not > is bug compatible. So nothing *should* break. "should", but you really don't know, as all we have is one report so far. > Furthermore the breaking we have seen so far is limited to just > lm_sensors. That is exactly one program that is not a server failing to > start. That seems pretty minor in the worst case. > > So I really don't expect anyone who ships 3.4 to enable this option. What about users who use a new kernel on old userspace, which happens all the time (i.e. all the kernel developers themselves)? > I have written the option solely so that in case my assessment turns out > to be wrong there is already a tested solution. I have been through the > pain of not being able to upgrade/test a new kernel because of a > backwards incompatible change and it can be very unpleasant. I'd really prefer this to be default 'y', and if a distro knows it can turn it off to save time/space, it can. thanks, greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 22:21 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman 2012-02-01 22:24 ` Greg Kroah-Hartman @ 2012-02-01 22:31 ` Dave Jones 2012-02-01 22:35 ` Jiri Slaby 2012-02-01 23:15 ` Linus Torvalds 3 siblings, 0 replies; 75+ messages in thread From: Dave Jones @ 2012-02-01 22:31 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote: > +config SYSFS_COUNT_LINKS > + bool "sysfs count subdirectoires to support buggy applications" subdirectories > + Increase the memory and cpu untilization of sysfs by maintaining utilization > + as the number of subdirectories plus two. With out this enabled Without > + the standard way of indicating that the number of subdirectoires subdirectories Dave ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 22:21 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman 2012-02-01 22:24 ` Greg Kroah-Hartman 2012-02-01 22:31 ` Dave Jones @ 2012-02-01 22:35 ` Jiri Slaby 2012-02-01 23:15 ` Linus Torvalds 3 siblings, 0 replies; 75+ messages in thread From: Jiri Slaby @ 2012-02-01 22:35 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Greg KH, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki On 02/01/2012 11:21 PM, Eric W. Biederman wrote: > --- a/fs/sysfs/Kconfig > +++ b/fs/sysfs/Kconfig > @@ -21,3 +21,18 @@ config SYSFS > example, "root=03:01" for /dev/hda1. > > Designers of embedded systems may wish to say N here to conserve space. > + > +config SYSFS_COUNT_LINKS > + bool "sysfs count subdirectoires to support buggy applications" > + default n > + help > + > + Increase the memory and cpu untilization of sysfs by maintaining > + by maintaining a count of how hard links from subdirectories a by maintaining only once. of how "many"? > + directory has. This allows sysfs to report the directory nlink > + as the number of subdirectories plus two. With out this enabled > + sysfs will report the nlink of all directories as 1, which is > + the standard way of indicating that the number of subdirectoires > + is not tracked. > + > + Unless you know you need this say N here. You should put the opposite here. "If you are sure you don't need it, say N." > --- a/fs/sysfs/sysfs.h > +++ b/fs/sysfs/sysfs.h ... > @@ -127,6 +130,41 @@ do { \ > #define sysfs_dirent_init_lockdep(sd) do {} while(0) > #endif > > +#ifdef CONFIG_SYSFS_COUNT_LINKS > +static inline void sysfs_init_nlink(struct sysfs_dirent *sd) > +{ > + if (sysfs_type(sd) == SYSFS_DIR) > + sd->s_nlink = 2; > + else > + sd->s_nlink = 1; > +} > +static inline void sysfs_inc_nlink(struct sysfs_dirent *sd) > +{ > + sd->s_nlink++; > +} > +static inline void sysfs_dec_nlink(struct sysfs_dirent *sd) > +{ > + sd->s_nlink++; > +} Minus minus. thanks, -- js suse labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 22:21 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman ` (2 preceding siblings ...) 2012-02-01 22:35 ` Jiri Slaby @ 2012-02-01 23:15 ` Linus Torvalds 2012-02-01 23:18 ` Linus Torvalds 3 siblings, 1 reply; 75+ messages in thread From: Linus Torvalds @ 2012-02-01 23:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro, Maciej Rutecki On Wed, Feb 1, 2012 at 2:21 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > lm_sensors and possibly other applications get confused if all sysfs > directories return nlink == 1. The lm_sensors code that got confused > was just wrong and a fixed version of lm_sensors should be released > shortly. I think this patch is horrible, and broken. And making the feature a config option is just stupid. The simple approach is to - implement a inode->i_op->getattr function for sysfs - make it call generic_fillattr() - after filling in the generic fields, count the number of sysfs children it has and update stat->nlink appropriately. No extra "keep track of inode counts by hand" crap, and no idiotic config options that just make it easy to (conditionally) get things wrong. Just do it right, and do it *unconditionally* right. The Linux VFS layer is smart, and allows filesystems to do these kinds of fixups. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 23:15 ` Linus Torvalds @ 2012-02-01 23:18 ` Linus Torvalds 2012-02-02 1:22 ` Al Viro 2012-03-05 13:30 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby 0 siblings, 2 replies; 75+ messages in thread From: Linus Torvalds @ 2012-02-01 23:18 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro, Maciej Rutecki On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > No extra "keep track of inode counts by hand" crap, and no idiotic > config options that just make it easy to (conditionally) get things > wrong. Just do it right, and do it *unconditionally* right. And btw, "nlink shows number of subdirectories" for a directory entry really *is* right. It's how Unix filesystems work, like it or not. It's mainly lazy/bad filesystems that set nlink to 1. So the whole "nlink==1" case is meant for crap like FAT etc, not for a filesystem that we control and that could easily just do it right. Which is why I detest that config option. It's as if you were asking the user "Do you want to make the sysfs filesystem act like crap filesystems?" and kernel config time. What kind of inane question is that? Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 23:18 ` Linus Torvalds @ 2012-02-02 1:22 ` Al Viro 2012-02-02 21:24 ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro 2012-03-05 13:30 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby 1 sibling, 1 reply; 75+ messages in thread From: Al Viro @ 2012-02-02 1:22 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Maciej Rutecki On Wed, Feb 01, 2012 at 03:18:05PM -0800, Linus Torvalds wrote: > On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > No extra "keep track of inode counts by hand" crap, and no idiotic > > config options that just make it easy to (conditionally) get things > > wrong. Just do it right, and do it *unconditionally* right. > > And btw, "nlink shows number of subdirectories" for a directory entry > really *is* right. It's how Unix filesystems work, like it or not. > > It's mainly lazy/bad filesystems that set nlink to 1. So the whole > "nlink==1" case is meant for crap like FAT etc, not for a filesystem > that we control and that could easily just do it right. It's a bit more complicated than that and it had always been a bit of a minefield. v6: link(2) began with ip = namei(&uchar, 0); if(ip == NULL) return; if(ip->i_nlink >= 127) { u.u_error = EMLINK; goto out; } (and mkdir(), of course, was implemented via mknod+link). Up to 127 links to any object. Fine; v7: EMLINK defined, but _never_ returned. Moreover, mkdir(1) didn't bother to check link counts either. Result: 65535 calls of link(2) and you've got yourself an fs corruption on i_nlink overflow (it was 16bit in on-disk inode). Linux implementation of link(2) had exactly the same bug as that of v7 until 0.98.3 (for link(2)) and 0.99.1 (for mkdir(2), rename(2) - didn't exist as syscalls on v7, but had the same problem as soon as they made it in kernel at some point in BSD history). Note that limit for minixfs remained ridiculously low until '97 or so; for ext2 it was 32000 from the very beginning but note that e.g. ext had _not_ acquired fixes for link overflows on mkdir() at all - it had that hole all the way until removal in 2.1.26. Of course, there was a (completely useless) POSIX-mandated LINK_MAX, but since the actual limit depends on fs type, well... the checks remain in ->link()/->mkdir()/->rename() instances and IIRC I've caught a bunch of overflows in those circa 2.1.very_late. As the result, old stat(2) had 16bit st_nlink - same as v7. Nobody needed more, after all. Alpha port got it 32bit (inherited the struct stat layout from OSF, presumably?), but other early ports kept 16bit there. At some point folks started whining about wanting more subdirectories and that's when the things began to get really convoluted and ugly. By that time we had a variant of stat(2) that used 32bit st_nlink, but on-disk layouts remained a problem. Some filesystems went for more or less reasonable "the limit is circa 2^32, EMLINK if we get more than that and -EOVERFLOW on old_stat(2) if it doesn't fit into 16 bits". However, it was _not_ universally true - e.g. reiserfs does _NOT_ do that for mkdir(). There we get "directory link count grows up to 16bit limit and gets stuck at 1 if it ever grows past that", on the theory that st_nlink == 1 is distinguishable from anything you'd normally get for a directory. I have no idea where that invention has come from, but it's been around for more than a decade. Of course, Hans being Hans, it had been advertised as major reiserfs feature - you could get an unlimited amount of subdirectories, which no other fs on Linux would allow, nevermind that nobody sane would actually make use of that... At about the same time somebody had done the same trick on ext2 - Daniel, probably, since it had been floating in directory index patchset. It never got merged into mainline; ext3 port _did_, but without those bits actually used (EXT3_DIR_LINK_MAX is defined, but never used). ext4 actually has it hooked in. I don't see anything else other than ext4 and reiserfs using that convention, but then just grepping for inc_nlink/inode_inc_use_count shows a bloody mess - e.g. jffs2 does not check overflows (32bit there) at all, neither on link() nor on mkdir()/rename(). It _might_ get away with that (with non-standard errno, if so), but I don't remember that code well enough to tell at the quick look. HFS+ doesn't seem to check for overflows either and while it doesn't track link count on directories, it does support link(2) and it does bump i_nlink there. A _lot_ of filesystems (starting with ramfs) assumes that we'll get an OOM before we get to 2^32 links to an inode on purely in-core filesystem; reasonable back in 2001 or so, but not these days... ^ permalink raw reply [flat|nested] 75+ messages in thread
* [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-02 1:22 ` Al Viro @ 2012-02-02 21:24 ` Al Viro 2012-02-02 23:46 ` Linus Torvalds 2012-02-03 8:25 ` Andreas Dilger 0 siblings, 2 replies; 75+ messages in thread From: Al Viro @ 2012-02-02 21:24 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-kernel, Linus Torvalds > Linux implementation of link(2) had exactly the same bug as that of v7 until > 0.98.3 (for link(2)) and 0.99.1 (for mkdir(2), rename(2) - didn't exist as > syscalls on v7, but had the same problem as soon as they made it in kernel > at some point in BSD history). Note that limit for minixfs remained > ridiculously low until '97 or so; for ext2 it was 32000 from the very > beginning but note that e.g. ext had _not_ acquired fixes for link overflows > on mkdir() at all - it had that hole all the way until removal in 2.1.26. > Of course, there was a (completely useless) POSIX-mandated LINK_MAX, but > since the actual limit depends on fs type, well... the checks remain > in ->link()/->mkdir()/->rename() instances and IIRC I've caught a bunch of > overflows in those circa 2.1.very_late. FWIW, there's something we really should've done a long time ago: putting that limit into sb->s_max_links. With 0 meaning "leave all checks to ->link/->mkdir/->rename". Something like the following would make a reasonable start - just the conversion of obvious cases. As the next step I'd probably initialize it as ~0U instead of 0 and let the filesystems that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set it to 0 in their foo_fill_super(). That would take care of a bunch of cases where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and it's probably a saner default anyway. Comments? Boilerplate removal follows (22 files changed, 45 insertions(+), 120 deletions(-)), but it's *not* for immediate merge; it's really completely untested. diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c index 9dbf0c3..fc7161d 100644 --- a/fs/exofs/namei.c +++ b/fs/exofs/namei.c @@ -143,9 +143,6 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir, { struct inode *inode = old_dentry->d_inode; - if (inode->i_nlink >= EXOFS_LINK_MAX) - return -EMLINK; - inode->i_ctime = CURRENT_TIME; inode_inc_link_count(inode); ihold(inode); @@ -156,10 +153,7 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir, static int exofs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) { struct inode *inode; - int err = -EMLINK; - - if (dir->i_nlink >= EXOFS_LINK_MAX) - goto out; + int err; inode_inc_link_count(dir); @@ -275,11 +269,6 @@ static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry, if (err) goto out_dir; } else { - if (dir_de) { - err = -EMLINK; - if (new_dir->i_nlink >= EXOFS_LINK_MAX) - goto out_dir; - } err = exofs_add_link(new_dentry, old_inode); if (err) goto out_dir; diff --git a/fs/exofs/super.c b/fs/exofs/super.c index d22cd16..6cafcad 100644 --- a/fs/exofs/super.c +++ b/fs/exofs/super.c @@ -754,6 +754,7 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent) sb->s_blocksize = EXOFS_BLKSIZE; sb->s_blocksize_bits = EXOFS_BLKSHIFT; sb->s_maxbytes = MAX_LFS_FILESIZE; + sb->s_max_links = EXOFS_LINK_MAX; atomic_set(&sbi->s_curr_pending, 0); sb->s_bdev = NULL; sb->s_dev = 0; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 0804198..dffb865 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -195,9 +195,6 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir, struct inode *inode = old_dentry->d_inode; int err; - if (inode->i_nlink >= EXT2_LINK_MAX) - return -EMLINK; - dquot_initialize(dir); inode->i_ctime = CURRENT_TIME_SEC; @@ -217,10 +214,7 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir, static int ext2_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode) { struct inode * inode; - int err = -EMLINK; - - if (dir->i_nlink >= EXT2_LINK_MAX) - goto out; + int err; dquot_initialize(dir); @@ -346,11 +340,6 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry, drop_nlink(new_inode); inode_dec_link_count(new_inode); } else { - if (dir_de) { - err = -EMLINK; - if (new_dir->i_nlink >= EXT2_LINK_MAX) - goto out_dir; - } err = ext2_add_link(new_dentry, old_inode); if (err) goto out_dir; diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 0090595..9f6766a 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -919,6 +919,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) } sb->s_maxbytes = ext2_max_size(sb->s_blocksize_bits); + sb->s_max_links = EXT2_LINK_MAX; if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV) { sbi->s_inode_size = EXT2_GOOD_OLD_INODE_SIZE; diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index 5f7c160..07c91ca 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -220,12 +220,6 @@ static int jfs_mkdir(struct inode *dip, struct dentry *dentry, umode_t mode) dquot_initialize(dip); - /* link count overflow on parent directory ? */ - if (dip->i_nlink == JFS_LINK_MAX) { - rc = -EMLINK; - goto out1; - } - /* * search parent directory for entry/freespace * (dtSearch() returns parent directory page pinned) @@ -806,9 +800,6 @@ static int jfs_link(struct dentry *old_dentry, jfs_info("jfs_link: %s %s", old_dentry->d_name.name, dentry->d_name.name); - if (ip->i_nlink == JFS_LINK_MAX) - return -EMLINK; - dquot_initialize(dir); tid = txBegin(ip->i_sb, 0); @@ -1138,10 +1129,6 @@ static int jfs_rename(struct inode *old_dir, struct dentry *old_dentry, rc = -ENOTEMPTY; goto out3; } - } else if ((new_dir != old_dir) && - (new_dir->i_nlink == JFS_LINK_MAX)) { - rc = -EMLINK; - goto out3; } } else if (new_ip) { IWRITE_LOCK(new_ip, RDWRLOCK_NORMAL); diff --git a/fs/jfs/super.c b/fs/jfs/super.c index 682bca6..4661ad7 100644 --- a/fs/jfs/super.c +++ b/fs/jfs/super.c @@ -441,6 +441,7 @@ static int jfs_fill_super(struct super_block *sb, void *data, int silent) return -ENOMEM; sb->s_fs_info = sbi; + sb->s_max_links = JFS_LINK_MAX; sbi->sb = sb; sbi->uid = sbi->gid = sbi->umask = -1; diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c index 3de7a32..4aea231 100644 --- a/fs/logfs/dir.c +++ b/fs/logfs/dir.c @@ -558,9 +558,6 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir, { struct inode *inode = old_dentry->d_inode; - if (inode->i_nlink >= LOGFS_LINK_MAX) - return -EMLINK; - inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME; ihold(inode); inc_nlink(inode); diff --git a/fs/logfs/super.c b/fs/logfs/super.c index c9ee7f5..b1a491a 100644 --- a/fs/logfs/super.c +++ b/fs/logfs/super.c @@ -542,6 +542,7 @@ static struct dentry *logfs_get_sb_device(struct logfs_super *super, * the filesystem incompatible with 32bit systems. */ sb->s_maxbytes = (1ull << 43) - 1; + sb->s_max_links = LOGFS_LINK_MAX; sb->s_op = &logfs_super_operations; sb->s_flags = flags | MS_NOATIME; diff --git a/fs/minix/inode.c b/fs/minix/inode.c index fa8b612..62c697c 100644 --- a/fs/minix/inode.c +++ b/fs/minix/inode.c @@ -190,24 +190,24 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) sbi->s_version = MINIX_V1; sbi->s_dirsize = 16; sbi->s_namelen = 14; - sbi->s_link_max = MINIX_LINK_MAX; + s->s_max_links = MINIX_LINK_MAX; } else if (s->s_magic == MINIX_SUPER_MAGIC2) { sbi->s_version = MINIX_V1; sbi->s_dirsize = 32; sbi->s_namelen = 30; - sbi->s_link_max = MINIX_LINK_MAX; + s->s_max_links = MINIX_LINK_MAX; } else if (s->s_magic == MINIX2_SUPER_MAGIC) { sbi->s_version = MINIX_V2; sbi->s_nzones = ms->s_zones; sbi->s_dirsize = 16; sbi->s_namelen = 14; - sbi->s_link_max = MINIX2_LINK_MAX; + s->s_max_links = MINIX2_LINK_MAX; } else if (s->s_magic == MINIX2_SUPER_MAGIC2) { sbi->s_version = MINIX_V2; sbi->s_nzones = ms->s_zones; sbi->s_dirsize = 32; sbi->s_namelen = 30; - sbi->s_link_max = MINIX2_LINK_MAX; + s->s_max_links = MINIX2_LINK_MAX; } else if ( *(__u16 *)(bh->b_data + 24) == MINIX3_SUPER_MAGIC) { m3s = (struct minix3_super_block *) bh->b_data; s->s_magic = m3s->s_magic; @@ -221,9 +221,9 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) sbi->s_dirsize = 64; sbi->s_namelen = 60; sbi->s_version = MINIX_V3; - sbi->s_link_max = MINIX2_LINK_MAX; sbi->s_mount_state = MINIX_VALID_FS; sb_set_blocksize(s, m3s->s_blocksize); + s->s_max_links = MINIX2_LINK_MAX; } else goto out_no_fs; diff --git a/fs/minix/minix.h b/fs/minix/minix.h index c889ef0..1ebd118 100644 --- a/fs/minix/minix.h +++ b/fs/minix/minix.h @@ -34,7 +34,6 @@ struct minix_sb_info { unsigned long s_max_size; int s_dirsize; int s_namelen; - int s_link_max; struct buffer_head ** s_imap; struct buffer_head ** s_zmap; struct buffer_head * s_sbh; diff --git a/fs/minix/namei.c b/fs/minix/namei.c index 2f76e38..2d0ee17 100644 --- a/fs/minix/namei.c +++ b/fs/minix/namei.c @@ -94,9 +94,6 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir, { struct inode *inode = old_dentry->d_inode; - if (inode->i_nlink >= minix_sb(inode->i_sb)->s_link_max) - return -EMLINK; - inode->i_ctime = CURRENT_TIME_SEC; inode_inc_link_count(inode); ihold(inode); @@ -106,10 +103,7 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir, static int minix_mkdir(struct inode * dir, struct dentry *dentry, umode_t mode) { struct inode * inode; - int err = -EMLINK; - - if (dir->i_nlink >= minix_sb(dir->i_sb)->s_link_max) - goto out; + int err; inode_inc_link_count(dir); @@ -181,7 +175,6 @@ static int minix_rmdir(struct inode * dir, struct dentry *dentry) static int minix_rename(struct inode * old_dir, struct dentry *old_dentry, struct inode * new_dir, struct dentry *new_dentry) { - struct minix_sb_info * info = minix_sb(old_dir->i_sb); struct inode * old_inode = old_dentry->d_inode; struct inode * new_inode = new_dentry->d_inode; struct page * dir_page = NULL; @@ -219,11 +212,6 @@ static int minix_rename(struct inode * old_dir, struct dentry *old_dentry, drop_nlink(new_inode); inode_dec_link_count(new_inode); } else { - if (dir_de) { - err = -EMLINK; - if (new_dir->i_nlink >= info->s_link_max) - goto out_dir; - } err = minix_add_link(new_dentry, old_inode); if (err) goto out_dir; diff --git a/fs/namei.c b/fs/namei.c index 208c6aa..45f953c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2545,6 +2545,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, d int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) { int error = may_create(dir, dentry); + unsigned max_links = dir->i_sb->s_max_links; if (error) return error; @@ -2557,6 +2558,9 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) if (error) return error; + if (max_links && dir->i_nlink >= max_links) + return -EMLINK; + error = dir->i_op->mkdir(dir, dentry, mode); if (!error) fsnotify_mkdir(dir, dentry); @@ -2887,6 +2891,7 @@ SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newn int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry) { struct inode *inode = old_dentry->d_inode; + unsigned max_links = dir->i_sb->s_max_links; int error; if (!inode) @@ -2917,6 +2922,8 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de /* Make sure we don't allow creating hardlink to an unlinked file */ if (inode->i_nlink == 0) error = -ENOENT; + else if (max_links && inode->i_nlink >= max_links) + error = -EMLINK; else error = dir->i_op->link(old_dentry, dir, new_dentry); mutex_unlock(&inode->i_mutex); @@ -3026,6 +3033,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry, { int error = 0; struct inode *target = new_dentry->d_inode; + unsigned max_links = new_dir->i_sb->s_max_links; /* * If we are going to change the parent - check write permissions, @@ -3049,6 +3057,11 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry, if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry)) goto out; + error = -EMLINK; + if (max_links && !target && new_dir != old_dir && + new_dir->i_nlink >= max_links) + goto out; + if (target) shrink_dcache_parent(new_dentry); error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry); diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 1cd3f62..fce2bbe 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -193,9 +193,6 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir, struct nilfs_transaction_info ti; int err; - if (inode->i_nlink >= NILFS_LINK_MAX) - return -EMLINK; - err = nilfs_transaction_begin(dir->i_sb, &ti, 1); if (err) return err; @@ -219,9 +216,6 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) struct nilfs_transaction_info ti; int err; - if (dir->i_nlink >= NILFS_LINK_MAX) - return -EMLINK; - err = nilfs_transaction_begin(dir->i_sb, &ti, 1); if (err) return err; @@ -400,11 +394,6 @@ static int nilfs_rename(struct inode *old_dir, struct dentry *old_dentry, drop_nlink(new_inode); nilfs_mark_inode_dirty(new_inode); } else { - if (dir_de) { - err = -EMLINK; - if (new_dir->i_nlink >= NILFS_LINK_MAX) - goto out_dir; - } err = nilfs_add_link(new_dentry, old_inode); if (err) goto out_dir; diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c index 08e3d4f..1fc9ad3 100644 --- a/fs/nilfs2/super.c +++ b/fs/nilfs2/super.c @@ -1059,6 +1059,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent) sb->s_export_op = &nilfs_export_ops; sb->s_root = NULL; sb->s_time_gran = 1; + sb->s_max_links = NILFS_LINK_MAX; bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info; sb->s_bdi = bdi ? : &default_backing_dev_info; diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c index b217797..d7466e2 100644 --- a/fs/sysv/namei.c +++ b/fs/sysv/namei.c @@ -121,9 +121,6 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir, { struct inode *inode = old_dentry->d_inode; - if (inode->i_nlink >= SYSV_SB(inode->i_sb)->s_link_max) - return -EMLINK; - inode->i_ctime = CURRENT_TIME_SEC; inode_inc_link_count(inode); ihold(inode); @@ -134,10 +131,8 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir, static int sysv_mkdir(struct inode * dir, struct dentry *dentry, umode_t mode) { struct inode * inode; - int err = -EMLINK; + int err; - if (dir->i_nlink >= SYSV_SB(dir->i_sb)->s_link_max) - goto out; inode_inc_link_count(dir); inode = sysv_new_inode(dir, S_IFDIR|mode); @@ -251,11 +246,6 @@ static int sysv_rename(struct inode * old_dir, struct dentry * old_dentry, drop_nlink(new_inode); inode_dec_link_count(new_inode); } else { - if (dir_de) { - err = -EMLINK; - if (new_dir->i_nlink >= SYSV_SB(new_dir->i_sb)->s_link_max) - goto out_dir; - } err = sysv_add_link(new_dentry, old_inode); if (err) goto out_dir; diff --git a/fs/sysv/super.c b/fs/sysv/super.c index f60c196..f467740 100644 --- a/fs/sysv/super.c +++ b/fs/sysv/super.c @@ -44,7 +44,7 @@ enum { JAN_1_1980 = (10*365 + 2) * 24 * 60 * 60 }; -static void detected_xenix(struct sysv_sb_info *sbi) +static void detected_xenix(struct sysv_sb_info *sbi, unsigned *max_links) { struct buffer_head *bh1 = sbi->s_bh1; struct buffer_head *bh2 = sbi->s_bh2; @@ -59,7 +59,7 @@ static void detected_xenix(struct sysv_sb_info *sbi) sbd2 = (struct xenix_super_block *) (bh2->b_data - 512); } - sbi->s_link_max = XENIX_LINK_MAX; + *max_links = XENIX_LINK_MAX; sbi->s_fic_size = XENIX_NICINOD; sbi->s_flc_size = XENIX_NICFREE; sbi->s_sbd1 = (char *)sbd1; @@ -75,7 +75,7 @@ static void detected_xenix(struct sysv_sb_info *sbi) sbi->s_nzones = fs32_to_cpu(sbi, sbd1->s_fsize); } -static void detected_sysv4(struct sysv_sb_info *sbi) +static void detected_sysv4(struct sysv_sb_info *sbi, unsigned *max_links) { struct sysv4_super_block * sbd; struct buffer_head *bh1 = sbi->s_bh1; @@ -86,7 +86,7 @@ static void detected_sysv4(struct sysv_sb_info *sbi) else sbd = (struct sysv4_super_block *) bh2->b_data; - sbi->s_link_max = SYSV_LINK_MAX; + *max_links = SYSV_LINK_MAX; sbi->s_fic_size = SYSV_NICINOD; sbi->s_flc_size = SYSV_NICFREE; sbi->s_sbd1 = (char *)sbd; @@ -103,7 +103,7 @@ static void detected_sysv4(struct sysv_sb_info *sbi) sbi->s_nzones = fs32_to_cpu(sbi, sbd->s_fsize); } -static void detected_sysv2(struct sysv_sb_info *sbi) +static void detected_sysv2(struct sysv_sb_info *sbi, unsigned *max_links) { struct sysv2_super_block *sbd; struct buffer_head *bh1 = sbi->s_bh1; @@ -114,7 +114,7 @@ static void detected_sysv2(struct sysv_sb_info *sbi) else sbd = (struct sysv2_super_block *) bh2->b_data; - sbi->s_link_max = SYSV_LINK_MAX; + *max_links = SYSV_LINK_MAX; sbi->s_fic_size = SYSV_NICINOD; sbi->s_flc_size = SYSV_NICFREE; sbi->s_sbd1 = (char *)sbd; @@ -131,14 +131,14 @@ static void detected_sysv2(struct sysv_sb_info *sbi) sbi->s_nzones = fs32_to_cpu(sbi, sbd->s_fsize); } -static void detected_coherent(struct sysv_sb_info *sbi) +static void detected_coherent(struct sysv_sb_info *sbi, unsigned *max_links) { struct coh_super_block * sbd; struct buffer_head *bh1 = sbi->s_bh1; sbd = (struct coh_super_block *) bh1->b_data; - sbi->s_link_max = COH_LINK_MAX; + *max_links = COH_LINK_MAX; sbi->s_fic_size = COH_NICINOD; sbi->s_flc_size = COH_NICFREE; sbi->s_sbd1 = (char *)sbd; @@ -154,12 +154,12 @@ static void detected_coherent(struct sysv_sb_info *sbi) sbi->s_nzones = fs32_to_cpu(sbi, sbd->s_fsize); } -static void detected_v7(struct sysv_sb_info *sbi) +static void detected_v7(struct sysv_sb_info *sbi, unsigned *max_links) { struct buffer_head *bh2 = sbi->s_bh2; struct v7_super_block *sbd = (struct v7_super_block *)bh2->b_data; - sbi->s_link_max = V7_LINK_MAX; + *max_links = V7_LINK_MAX; sbi->s_fic_size = V7_NICINOD; sbi->s_flc_size = V7_NICFREE; sbi->s_sbd1 = (char *)sbd; @@ -290,7 +290,7 @@ static char *flavour_names[] = { [FSTYPE_AFS] = "AFS", }; -static void (*flavour_setup[])(struct sysv_sb_info *) = { +static void (*flavour_setup[])(struct sysv_sb_info *, unsigned *) = { [FSTYPE_XENIX] = detected_xenix, [FSTYPE_SYSV4] = detected_sysv4, [FSTYPE_SYSV2] = detected_sysv2, @@ -310,7 +310,7 @@ static int complete_read_super(struct super_block *sb, int silent, int size) sbi->s_firstinodezone = 2; - flavour_setup[sbi->s_type](sbi); + flavour_setup[sbi->s_type](sbi, &sb->s_max_links); sbi->s_truncate = 1; sbi->s_ndatazones = sbi->s_nzones - sbi->s_firstdatazone; diff --git a/fs/sysv/sysv.h b/fs/sysv/sysv.h index 0e4b821..11b0767 100644 --- a/fs/sysv/sysv.h +++ b/fs/sysv/sysv.h @@ -24,7 +24,6 @@ struct sysv_sb_info { char s_bytesex; /* bytesex (le/be/pdp) */ char s_truncate; /* if 1: names > SYSV_NAMELEN chars are truncated */ /* if 0: they are disallowed (ENAMETOOLONG) */ - nlink_t s_link_max; /* max number of hard links to a file */ unsigned int s_inodes_per_block; /* number of inodes per block */ unsigned int s_inodes_per_block_1; /* inodes_per_block - 1 */ unsigned int s_inodes_per_block_bits; /* log2(inodes_per_block) */ diff --git a/fs/udf/namei.c b/fs/udf/namei.c index 08bf46e..38de8f2 100644 --- a/fs/udf/namei.c +++ b/fs/udf/namei.c @@ -32,8 +32,6 @@ #include <linux/crc-itu-t.h> #include <linux/exportfs.h> -enum { UDF_MAX_LINKS = 0xffff }; - static inline int udf_match(int len1, const unsigned char *name1, int len2, const unsigned char *name2) { @@ -649,10 +647,6 @@ static int udf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) struct udf_inode_info *dinfo = UDF_I(dir); struct udf_inode_info *iinfo; - err = -EMLINK; - if (dir->i_nlink >= UDF_MAX_LINKS) - goto out; - err = -EIO; inode = udf_new_inode(dir, S_IFDIR | mode, &err); if (!inode) @@ -1032,9 +1026,6 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir, struct fileIdentDesc cfi, *fi; int err; - if (inode->i_nlink >= UDF_MAX_LINKS) - return -EMLINK; - fi = udf_add_entry(dir, dentry, &fibh, &cfi, &err); if (!fi) { return err; @@ -1126,10 +1117,6 @@ static int udf_rename(struct inode *old_dir, struct dentry *old_dentry, if (udf_get_lb_pblock(old_inode->i_sb, &tloc, 0) != old_dir->i_ino) goto end_rename; - - retval = -EMLINK; - if (!new_inode && new_dir->i_nlink >= UDF_MAX_LINKS) - goto end_rename; } if (!nfi) { nfi = udf_add_entry(new_dir, new_dentry, &nfibh, &ncfi, diff --git a/fs/udf/super.c b/fs/udf/super.c index c09a84d..8d8b253 100644 --- a/fs/udf/super.c +++ b/fs/udf/super.c @@ -75,6 +75,8 @@ #define UDF_DEFAULT_BLOCKSIZE 2048 +enum { UDF_MAX_LINKS = 0xffff }; + /* These are the "meat" - everything else is stuffing */ static int udf_fill_super(struct super_block *, void *, int); static void udf_put_super(struct super_block *); @@ -2042,6 +2044,7 @@ static int udf_fill_super(struct super_block *sb, void *options, int silent) goto error_out; } sb->s_maxbytes = MAX_LFS_FILESIZE; + sb->s_max_links = UDF_MAX_LINKS; return 0; error_out: diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c index 38cac19..a2281ca 100644 --- a/fs/ufs/namei.c +++ b/fs/ufs/namei.c @@ -166,10 +166,6 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir, int error; lock_ufs(dir->i_sb); - if (inode->i_nlink >= UFS_LINK_MAX) { - unlock_ufs(dir->i_sb); - return -EMLINK; - } inode->i_ctime = CURRENT_TIME_SEC; inode_inc_link_count(inode); @@ -183,10 +179,7 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir, static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode) { struct inode * inode; - int err = -EMLINK; - - if (dir->i_nlink >= UFS_LINK_MAX) - goto out; + int err; lock_ufs(dir->i_sb); inode_inc_link_count(dir); @@ -305,11 +298,6 @@ static int ufs_rename(struct inode *old_dir, struct dentry *old_dentry, drop_nlink(new_inode); inode_dec_link_count(new_inode); } else { - if (dir_de) { - err = -EMLINK; - if (new_dir->i_nlink >= UFS_LINK_MAX) - goto out_dir; - } err = ufs_add_link(new_dentry, old_inode); if (err) goto out_dir; diff --git a/fs/ufs/super.c b/fs/ufs/super.c index 5246ee3..ec25d09 100644 --- a/fs/ufs/super.c +++ b/fs/ufs/super.c @@ -1157,6 +1157,7 @@ magic_found: "fast symlink size (%u)\n", uspi->s_maxsymlinklen); uspi->s_maxsymlinklen = maxsymlen; } + sb->s_max_links = UFS_LINK_MAX; inode = ufs_iget(sb, UFS_ROOTINO); if (IS_ERR(inode)) { diff --git a/include/linux/fs.h b/include/linux/fs.h index 386da09..1e49554 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1459,6 +1459,7 @@ struct super_block { u8 s_uuid[16]; /* UUID */ void *s_fs_info; /* Filesystem private info */ + unsigned int s_max_links; fmode_t s_mode; /* Granularity of c/m/atime in ns. ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-02 21:24 ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro @ 2012-02-02 23:46 ` Linus Torvalds 2012-02-03 1:16 ` Al Viro 2012-02-03 8:25 ` Andreas Dilger 1 sibling, 1 reply; 75+ messages in thread From: Linus Torvalds @ 2012-02-02 23:46 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, linux-kernel On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Comments? Boilerplate removal follows (22 files changed, 45 insertions(+), > 120 deletions(-)), but it's *not* for immediate merge; it's really completely > untested. Looks ok to me. Historically, the more things we can check at the VFS layer, the better. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-02 23:46 ` Linus Torvalds @ 2012-02-03 1:16 ` Al Viro 2012-02-03 1:45 ` Al Viro ` (3 more replies) 0 siblings, 4 replies; 75+ messages in thread From: Al Viro @ 2012-02-03 1:16 UTC (permalink / raw) To: Linus Torvalds Cc: linux-fsdevel, linux-kernel, Joel Becker, Chris Mason, David Miller On Thu, Feb 02, 2012 at 03:46:06PM -0800, Linus Torvalds wrote: > On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Comments? ?Boilerplate removal follows (22 files changed, 45 insertions(+), > > 120 deletions(-)), but it's *not* for immediate merge; it's really completely > > untested. > > Looks ok to me. Historically, the more things we can check at the VFS > layer, the better. After looking a bit more: nlink_t is a f*cking mess. Almost any code using that type kernel-side is broken. Crap galore: * sometimes it's 32 bits, sometimes 16, sometimes 64. Essentially at random. * almost all have it unsigned, except for sparc32, where it's signed short [inherited from v7 via SunOS? BTW, in v6 it used to be even funnier - char, which is where ridiculous LINK_MAX == 127 comes from] IOW, nlink_t is an attractive nuisance - it's nearly impossible to use in a portable way and we are lucky that almost nobody tries to. Exceptions: ocfs2_rename() does nlink_t old_dir_nlink = old_dir->i_nlink; ... followed later by comparison with old_dir->i_nlink. And no, it's not to handle truncation - it's "what if i_nlink changed while ocfs2_rename() had been grabbing the cluster lock" kind of thing. OCFS2 can have up to 2^32 links to file, so truncation is really possible... AFAICS, that one is a genuine bug - this nlink_t should be u32... Another one is proc_dir_entry ->nlink and it would cause Bad Things(tm) on architecture with 16bit nlink_t if we could end up with 65534 subdirectories in some procfs dir. Might be possible, might be not - doing that under /proc/sys is definitely possible, but that won't be enough; needs to be proc_dir_entry-backed directory. Again, solution is to use explicit u32 anyway. * compat_nlink_t is even funnier - it's signed in *two* cases; sparc and ppc. No, nlink_t on ppc32 is unsigned. Not that anyone cared, really, since the _only_ use of that type is in struct compat_stat. For exactly one field. Only used as left-hand side of assignment, which is actually broken since unlike cp_new_stat(), cp_compat_stat() does *not* check if the value fits into st_nlink. Bug, needs to be fixed. Incidentally, just what should we do on sparc32 if we run into a file with 4G-10 links? -EOVERFLOW or silently put 65536-10 in st_nlink and be done with that? Note that filesystems allowing that many links *do* exist... * when does jfs dtInsert() return -EMLINK? Can it ever get triggered? * WTF is XFS doing with these checks? Note that we have them done _twice_ on all paths - explictly from xfs_create(), xfs_link(), xfs_rename() and then from xfs_bumplink() called by exactly the same set of functions. * what's up with btrfs_insert_inode_ref()? I've tried to trace the codepaths around there, but... Incidentally, when could fixup_low_keys() return non-zero? I don't see any candidates for that in there... Chris? * ubifs, hfsplus, jffs2 - definitely broken if you create enough links. i_nlink wraparound to zero, confused inode eviction logics. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 1:16 ` Al Viro @ 2012-02-03 1:45 ` Al Viro 2012-02-03 2:00 ` Linus Torvalds 2012-02-03 14:57 ` Chris Mason ` (2 subsequent siblings) 3 siblings, 1 reply; 75+ messages in thread From: Al Viro @ 2012-02-03 1:45 UTC (permalink / raw) To: Linus Torvalds Cc: linux-fsdevel, linux-kernel, Joel Becker, Chris Mason, David Miller On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote: > After looking a bit more: nlink_t is a f*cking mess. Almost any code > using that type kernel-side is broken. Crap galore: > * sometimes it's 32 bits, sometimes 16, sometimes 64. Essentially > at random. > * almost all have it unsigned, except for sparc32, where it's > signed short [inherited from v7 via SunOS? BTW, in v6 it used to be even > funnier - char, which is where ridiculous LINK_MAX == 127 comes from] > > IOW, nlink_t is an attractive nuisance - it's nearly impossible to use in > a portable way and we are lucky that almost nobody tries to. Incidentally, why the hell do we have typedef __kernel_nlink_t nlink_t; anyway? It's *not* exposed to userland and it's different from the userland nlink_t (which is unsigned int on 32bit and unsigned long on 64bit). Why not use __kernel_nlink_t (or explicitly-sized __uNN) in arch/*/include/asm/stat.h and declare nlink_t kernel-side as __u32? Why do we have daddr_t, while we are at it? There is exactly one user - fs/freevxfs and there we definitely want a fixed-sized type. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 1:45 ` Al Viro @ 2012-02-03 2:00 ` Linus Torvalds 0 siblings, 0 replies; 75+ messages in thread From: Linus Torvalds @ 2012-02-03 2:00 UTC (permalink / raw) To: Al Viro Cc: linux-fsdevel, linux-kernel, Joel Becker, Chris Mason, David Miller On Thu, Feb 2, 2012 at 5:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Incidentally, why the hell do we have > typedef __kernel_nlink_t nlink_t; > anyway? It's *not* exposed to userland and it's different from the > userland nlink_t (which is unsigned int on 32bit and unsigned long on 64bit). > Why not use __kernel_nlink_t (or explicitly-sized __uNN) in > arch/*/include/asm/stat.h and declare nlink_t kernel-side as __u32? Probably hysterical raisins, and just converted to the whole __kernel_nlink_t form together with other, more relevant things. Feel free to remove it. > Why do we have daddr_t, while we are at it? There is exactly one user - > fs/freevxfs and there we definitely want a fixed-sized type. I think it's something that probably came from freevxfs and BSD roots or similar. It's a BSD'ism, afaik. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 1:16 ` Al Viro 2012-02-03 1:45 ` Al Viro @ 2012-02-03 14:57 ` Chris Mason 2012-02-03 17:08 ` Al Viro 2012-02-06 22:49 ` Dave Chinner 3 siblings, 0 replies; 75+ messages in thread From: Chris Mason @ 2012-02-03 14:57 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Joel Becker, David Miller On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote: > On Thu, Feb 02, 2012 at 03:46:06PM -0800, Linus Torvalds wrote: > > On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > Comments? ?Boilerplate removal follows (22 files changed, 45 insertions(+), > > > 120 deletions(-)), but it's *not* for immediate merge; it's really completely > > > untested. > > > > Looks ok to me. Historically, the more things we can check at the VFS > > layer, the better. > > > * what's up with btrfs_insert_inode_ref()? I've tried to trace > the codepaths around there, but... Btrfs stores backrefs (the filename, directory inode number) from the inode to the directory. In the current format that's a pretty low limit on how many of these we can store for hard links to the same file in the same directory, but basically no limit on how many backrefs we can store to the same file from different directories. Mark Fasheh was working on a patch to change the backrefs to make the links-from-the-same-dir case consistent with the links-from-different-dir case. With today's code, we'll go -EMLINK at different times depending on the length of the file name and what links you've already made. > Incidentally, when could fixup_low_keys() > return non-zero? I don't see any candidates for that in there... Chris? A long time ago this one used to cow blocks and so it needed an error return. I think Jeff Mahoney has a patch queued up to make it (among many others) void. -chris ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 1:16 ` Al Viro 2012-02-03 1:45 ` Al Viro 2012-02-03 14:57 ` Chris Mason @ 2012-02-03 17:08 ` Al Viro 2012-02-03 19:34 ` Artem Bityutskiy 2012-02-06 8:50 ` Artem Bityutskiy 2012-02-06 22:49 ` Dave Chinner 3 siblings, 2 replies; 75+ messages in thread From: Al Viro @ 2012-02-03 17:08 UTC (permalink / raw) To: Artem Bityutskiy; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote: > * ubifs, hfsplus, jffs2 - definitely broken if you create enough > links. i_nlink wraparound to zero, confused inode eviction logics. BTW, ubifs plays funny games with i_nlink - decrements it early in unlink/rmdir/rename and then increments it back on failure. *If* we really want it that way, we need to use set_nlink() there. Frankly, I'd rather deal with drop_nlink() after the last possible failure exit... Unless there are serious reasons why that wouldn't work, that is. Artem? ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 17:08 ` Al Viro @ 2012-02-03 19:34 ` Artem Bityutskiy 2012-02-06 8:50 ` Artem Bityutskiy 1 sibling, 0 replies; 75+ messages in thread From: Artem Bityutskiy @ 2012-02-03 19:34 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 1012 bytes --] On Fri, 2012-02-03 at 17:08 +0000, Al Viro wrote: > On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote: > > > * ubifs, hfsplus, jffs2 - definitely broken if you create enough > > links. i_nlink wraparound to zero, confused inode eviction logics. > > BTW, ubifs plays funny games with i_nlink - decrements it early in > unlink/rmdir/rename and then increments it back on failure. *If* > we really want it that way, we need to use set_nlink() there. Frankly, > I'd rather deal with drop_nlink() after the last possible failure > exit... Unless there are serious reasons why that wouldn't work, that is. There was a good reason for those games. I cannot provide a good explanation right now because I need to refresh my memory (but I am sure there must be a big comment in the code about this) and my daughter is getting upset already because I am typing something instead of playing with her. Will try to answer on Monday. Have a nice weekend! -- Best Regards, Artem Bityutskiy [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 17:08 ` Al Viro 2012-02-03 19:34 ` Artem Bityutskiy @ 2012-02-06 8:50 ` Artem Bityutskiy 2012-02-06 13:56 ` Al Viro 1 sibling, 1 reply; 75+ messages in thread From: Artem Bityutskiy @ 2012-02-06 8:50 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 1546 bytes --] On Fri, 2012-02-03 at 17:08 +0000, Al Viro wrote: > On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote: > > > * ubifs, hfsplus, jffs2 - definitely broken if you create enough > > links. i_nlink wraparound to zero, confused inode eviction logics. > > BTW, ubifs plays funny games with i_nlink - decrements it early in > unlink/rmdir/rename and then increments it back on failure. *If* > we really want it that way, we need to use set_nlink() there. Frankly, > I'd rather deal with drop_nlink() after the last possible failure > exit... Unless there are serious reasons why that wouldn't work, that is. > Artem? The way code is organized is that the UBIFS journal subsystem functions accept 'struct inode' and struct direntry' objects and write them out to the media as they are in RAM. So before calling 'ubifs_jnl_update()' we should have all the fields in 'struct inode' to be set correctly. Thus, we have to drop 'i_nlink' before calling 'ubifs_jnl_update()'. WRT dealing with 'drop_nlink()' after the last possible failure - I can just make own partial copy of 'struct inode' and pass it to 'ubifs_jnl_update()', if there is a need. I would not like to make the journal code more complex than it already is by adding 'i_nlink' usage smartness. WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()' of course. But I do not really see why is this better. E.g., 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink' wrapping. -- Best Regards, Artem Bityutskiy [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-06 8:50 ` Artem Bityutskiy @ 2012-02-06 13:56 ` Al Viro 2012-02-06 17:05 ` Artem Bityutskiy 0 siblings, 1 reply; 75+ messages in thread From: Al Viro @ 2012-02-06 13:56 UTC (permalink / raw) To: Artem Bityutskiy; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote: > WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()' > of course. But I do not really see why is this better. E.g., > 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink' > wrapping. So does inc_nlink() when you are asking to get from nlink=0 to nlink=1. I.e. on failure exit in your unlink()... ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-06 13:56 ` Al Viro @ 2012-02-06 17:05 ` Artem Bityutskiy 2012-02-06 17:11 ` Al Viro 0 siblings, 1 reply; 75+ messages in thread From: Artem Bityutskiy @ 2012-02-06 17:05 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 638 bytes --] On Mon, 2012-02-06 at 13:56 +0000, Al Viro wrote: > On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote: > > > WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()' > > of course. But I do not really see why is this better. E.g., > > 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink' > > wrapping. > > So does inc_nlink() when you are asking to get from nlink=0 to nlink=1. > I.e. on failure exit in your unlink()... Indeed! I'll switch to 'clear_nlink(inode)' and 'set_nlink(inode, 1)' for those inodes which are being unliked. -- Best Regards, Artem Bityutskiy [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-06 17:05 ` Artem Bityutskiy @ 2012-02-06 17:11 ` Al Viro 2012-02-07 7:21 ` Artem Bityutskiy 0 siblings, 1 reply; 75+ messages in thread From: Al Viro @ 2012-02-06 17:11 UTC (permalink / raw) To: Artem Bityutskiy; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds On Mon, Feb 06, 2012 at 07:05:55PM +0200, Artem Bityutskiy wrote: > On Mon, 2012-02-06 at 13:56 +0000, Al Viro wrote: > > On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote: > > > > > WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()' > > > of course. But I do not really see why is this better. E.g., > > > 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink' > > > wrapping. > > > > So does inc_nlink() when you are asking to get from nlink=0 to nlink=1. > > I.e. on failure exit in your unlink()... > > Indeed! I'll switch to 'clear_nlink(inode)' and 'set_nlink(inode, 1)' > for those inodes which are being unliked. Huh? I thought ubifs allowed link(2)... IOW, i_nlink could've been greater than 1 when you called ubifs_unlink(). ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-06 17:11 ` Al Viro @ 2012-02-07 7:21 ` Artem Bityutskiy 0 siblings, 0 replies; 75+ messages in thread From: Artem Bityutskiy @ 2012-02-07 7:21 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 1080 bytes --] On Mon, 2012-02-06 at 17:11 +0000, Al Viro wrote: > On Mon, Feb 06, 2012 at 07:05:55PM +0200, Artem Bityutskiy wrote: > > On Mon, 2012-02-06 at 13:56 +0000, Al Viro wrote: > > > On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote: > > > > > > > WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()' > > > > of course. But I do not really see why is this better. E.g., > > > > 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink' > > > > wrapping. > > > > > > So does inc_nlink() when you are asking to get from nlink=0 to nlink=1. > > > I.e. on failure exit in your unlink()... > > > > Indeed! I'll switch to 'clear_nlink(inode)' and 'set_nlink(inode, 1)' > > for those inodes which are being unliked. > > Huh? I thought ubifs allowed link(2)... IOW, i_nlink could've been greater > than 1 when you called ubifs_unlink(). You are right, I'll take this correctly when I change the code rather than writing an e-mail. Thanks! I think I'll submit a patch this week. -- Best Regards, Artem Bityutskiy [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 1:16 ` Al Viro ` (2 preceding siblings ...) 2012-02-03 17:08 ` Al Viro @ 2012-02-06 22:49 ` Dave Chinner 3 siblings, 0 replies; 75+ messages in thread From: Dave Chinner @ 2012-02-06 22:49 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Joel Becker, Chris Mason, David Miller On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote: > On Thu, Feb 02, 2012 at 03:46:06PM -0800, Linus Torvalds wrote: > > On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > * WTF is XFS doing with these checks? It is validating nlink against the maximum supported by the XFS on-disk format. It was originally limited by what could be reported to pathconf() on Irix - a signed int. We have that same problem on Linux, too, because on 32 bit systems the maximum number of links that can be reported via pathconf is 2^31.... > Note that we have them > done _twice_ on all paths - explictly from xfs_create(), xfs_link(), > xfs_rename() and then from xfs_bumplink() called by exactly the same > set of functions. Well, that's a bit stupid, isn't it? Trivial to fix, though... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-02 21:24 ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro 2012-02-02 23:46 ` Linus Torvalds @ 2012-02-03 8:25 ` Andreas Dilger 2012-02-03 17:03 ` Al Viro 1 sibling, 1 reply; 75+ messages in thread From: Andreas Dilger @ 2012-02-03 8:25 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds On 2012-02-02, at 2:24 PM, Al Viro wrote: > FWIW, there's something we really should've done a long time ago: putting > that limit into sb->s_max_links. With 0 meaning "leave all checks to > ->link/->mkdir/->rename". Something like the following would make a > reasonable start - just the conversion of obvious cases. As the next > step I'd probably initialize it as ~0U instead of 0 and let the filesystems > that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set > it to 0 in their foo_fill_super(). That would take care of a bunch of cases > where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and > it's probably a saner default anyway. This would also give userspace some hope of pathconf(path, _PC_LINK_MAX) returning the actual value from the filesystem, instead of hard-coding this into glibc itself based on the statfs-returned f_type magic value. Cheers, Andreas ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 8:25 ` Andreas Dilger @ 2012-02-03 17:03 ` Al Viro 2012-02-04 7:42 ` Andreas Dilger 0 siblings, 1 reply; 75+ messages in thread From: Al Viro @ 2012-02-03 17:03 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds On Fri, Feb 03, 2012 at 01:25:26AM -0700, Andreas Dilger wrote: > On 2012-02-02, at 2:24 PM, Al Viro wrote: > > FWIW, there's something we really should've done a long time ago: putting > > that limit into sb->s_max_links. With 0 meaning "leave all checks to > > ->link/->mkdir/->rename". Something like the following would make a > > reasonable start - just the conversion of obvious cases. As the next > > step I'd probably initialize it as ~0U instead of 0 and let the filesystems > > that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set > > it to 0 in their foo_fill_super(). That would take care of a bunch of cases > > where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and > > it's probably a saner default anyway. > > This would also give userspace some hope of pathconf(path, _PC_LINK_MAX) > returning the actual value from the filesystem, instead of hard-coding > this into glibc itself based on the statfs-returned f_type magic value. *snort* Even skipping the standard flame about pathconf() as an API, this will not work. * we have filesystems that do not allow link creation at all and do keep track of subdirectories count in i_nlink of directories. What would you have them store? As it is, ~0U works just fine, but pathconf() users won't be happy with it. * we have filesystems that allow unlimited subdirectories, while limiting the number of links to non-directories; ->s_max_links == 0 will work just fine, but won't make pathconf() happy. * we have filesystems that have more complex rules re links to non-directory (see mail from Chris in this thread). What would you have pathconf() do? ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename 2012-02-03 17:03 ` Al Viro @ 2012-02-04 7:42 ` Andreas Dilger 0 siblings, 0 replies; 75+ messages in thread From: Andreas Dilger @ 2012-02-04 7:42 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds On 2012-02-03, at 10:03 AM, Al Viro wrote: > On Fri, Feb 03, 2012 at 01:25:26AM -0700, Andreas Dilger wrote: >> On 2012-02-02, at 2:24 PM, Al Viro wrote: >>> FWIW, there's something we really should've done a long time ago: putting >>> that limit into sb->s_max_links. With 0 meaning "leave all checks to >>> ->link/->mkdir/->rename". Something like the following would make a >>> reasonable start - just the conversion of obvious cases. As the next >>> step I'd probably initialize it as ~0U instead of 0 and let the filesystems >>> that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set >>> it to 0 in their foo_fill_super(). That would take care of a bunch of cases >>> where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and >>> it's probably a saner default anyway. >> >> This would also give userspace some hope of pathconf(path, _PC_LINK_MAX) >> returning the actual value from the filesystem, instead of hard-coding >> this into glibc itself based on the statfs-returned f_type magic value. > > *snort* > > Even skipping the standard flame about pathconf() as an API, this will > not work. > * we have filesystems that do not allow link creation at all and > do keep track of subdirectories count in i_nlink of directories. What > would you have them store? As it is, ~0U works just fine, but pathconf() > users won't be happy with it. > * we have filesystems that allow unlimited subdirectories, while > limiting the number of links to non-directories; ->s_max_links == 0 will > work just fine, but won't make pathconf() happy. > * we have filesystems that have more complex rules re links to > non-directory (see mail from Chris in this thread). What would you have > pathconf() do? No comment on how good an API pathconf() is, but getting a per-filesystem value from the kernel has to be better than a hard-coded value coded in a library in userspace. Cheers, Andreas ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-02-01 23:18 ` Linus Torvalds 2012-02-02 1:22 ` Al Viro @ 2012-03-05 13:30 ` Jiri Slaby 2012-03-05 16:09 ` Greg Kroah-Hartman 1 sibling, 1 reply; 75+ messages in thread From: Jiri Slaby @ 2012-03-05 13:30 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro, Maciej Rutecki On 02/02/2012 12:18 AM, Linus Torvalds wrote: > On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> No extra "keep track of inode counts by hand" crap, and no idiotic >> config options that just make it easy to (conditionally) get things >> wrong. Just do it right, and do it *unconditionally* right. > > And btw, "nlink shows number of subdirectories" for a directory entry > really *is* right. It's how Unix filesystems work, like it or not. > > It's mainly lazy/bad filesystems that set nlink to 1. So the whole > "nlink==1" case is meant for crap like FAT etc, not for a filesystem > that we control and that could easily just do it right. > > Which is why I detest that config option. It's as if you were asking the user > > "Do you want to make the sysfs filesystem act like crap filesystems?" > > and kernel config time. What kind of inane question is that? <thread resumed...> What's going on here? I still have to revert "sysfs: Kill nlink counting." with today's -next to have working sensors. thanks, -- js suse labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-03-05 13:30 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby @ 2012-03-05 16:09 ` Greg Kroah-Hartman 2012-03-05 16:47 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 75+ messages in thread From: Greg Kroah-Hartman @ 2012-03-05 16:09 UTC (permalink / raw) To: Jiri Slaby Cc: Linus Torvalds, Eric W. Biederman, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki On Mon, Mar 05, 2012 at 02:30:20PM +0100, Jiri Slaby wrote: > On 02/02/2012 12:18 AM, Linus Torvalds wrote: > > On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > >> > >> No extra "keep track of inode counts by hand" crap, and no idiotic > >> config options that just make it easy to (conditionally) get things > >> wrong. Just do it right, and do it *unconditionally* right. > > > > And btw, "nlink shows number of subdirectories" for a directory entry > > really *is* right. It's how Unix filesystems work, like it or not. > > > > It's mainly lazy/bad filesystems that set nlink to 1. So the whole > > "nlink==1" case is meant for crap like FAT etc, not for a filesystem > > that we control and that could easily just do it right. > > > > Which is why I detest that config option. It's as if you were asking the user > > > > "Do you want to make the sysfs filesystem act like crap filesystems?" > > > > and kernel config time. What kind of inane question is that? > > <thread resumed...> > > What's going on here? I still have to revert "sysfs: Kill nlink > counting." with today's -next to have working sensors. I don't remember. I thought there was a proposed patch for this issue from Eric, but I don't see it in my queue anywhere. Eric, what was the resolution here? greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-03-05 16:09 ` Greg Kroah-Hartman @ 2012-03-05 16:47 ` Linus Torvalds 2012-03-08 21:05 ` Greg Kroah-Hartman 2012-03-08 22:18 ` Eric W. Biederman 2012-03-08 21:28 ` Eric W. Biederman 2012-03-08 21:34 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman 2 siblings, 2 replies; 75+ messages in thread From: Linus Torvalds @ 2012-03-05 16:47 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Eric W. Biederman, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki On Mon, Mar 5, 2012 at 8:09 AM, Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > I don't remember. I thought there was a proposed patch for this issue > from Eric, but I don't see it in my queue anywhere. That patch was an abortion. Adding a config option for behavior like this is totally bogus, and the only reason for that config option was that sysfs did silly things. It's only in -next, though, I was assuming that the whole "Kill nlink counting" commit never makes it to me. Because I won't take it. I outlined how the counting could easily be done without actually having to maintain an explicit count in the sysfs. Or we should just keep doing the counting. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-03-05 16:47 ` Linus Torvalds @ 2012-03-08 21:05 ` Greg Kroah-Hartman 2012-03-08 22:18 ` Eric W. Biederman 1 sibling, 0 replies; 75+ messages in thread From: Greg Kroah-Hartman @ 2012-03-08 21:05 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Slaby, Eric W. Biederman, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki On Mon, Mar 05, 2012 at 08:47:27AM -0800, Linus Torvalds wrote: > On Mon, Mar 5, 2012 at 8:09 AM, Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > I don't remember. I thought there was a proposed patch for this issue > > from Eric, but I don't see it in my queue anywhere. > > That patch was an abortion. Adding a config option for behavior like > this is totally bogus, and the only reason for that config option was > that sysfs did silly things. > > It's only in -next, though, I was assuming that the whole "Kill nlink > counting" commit never makes it to me. Because I won't take it. > > I outlined how the counting could easily be done without actually > having to maintain an explicit count in the sysfs. > > Or we should just keep doing the counting. Ok, I've now reverted this patch in my tree. greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-03-05 16:47 ` Linus Torvalds 2012-03-08 21:05 ` Greg Kroah-Hartman @ 2012-03-08 22:18 ` Eric W. Biederman 2012-03-08 23:40 ` Linus Torvalds 1 sibling, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-03-08 22:18 UTC (permalink / raw) To: Linus Torvalds Cc: Greg Kroah-Hartman, Jiri Slaby, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki Linus Torvalds <torvalds@linux-foundation.org> writes: > On Mon, Mar 5, 2012 at 8:09 AM, Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: >> >> I don't remember. I thought there was a proposed patch for this issue >> from Eric, but I don't see it in my queue anywhere. > > That patch was an abortion. Adding a config option for behavior like > this is totally bogus, and the only reason for that config option was > that sysfs did silly things. The biggest reason it is bogus is that it doesn't get properly tested or reviewed. Sigh. My first patch to fix things had a bad typo that everyone missed. > It's only in -next, though, I was assuming that the whole "Kill nlink > counting" commit never makes it to me. Because I won't take it. > > I outlined how the counting could easily be done without actually > having to maintain an explicit count in the sysfs. And if you had bothered to look you would have seen how we used to have that code and it was removed because it was a performance bottleneck. > Or we should just keep doing the counting. The current counting that we do gives the wrong numbers, in the edge cases. To my knowledge a deleted sysfs directory has never returned nlink == 0. Keeping compatibility is easy enough that it looks like it is worth doing, but maintaining 30+ years of backwards compatibility is what nlink >1 in unix filesystem directories is. I don't see any practical sense in keeping . and .. directories on disk or upping the unix nlink directory count because of them. To me it looks like just one of those things you do. Like hash directory entries so you can have a big directory and still be able to have a 32bit offset you can pass to lseek that is stable across renames and deletes. >From the point of view of maintaining sysfs a 32bit nlink_t in sysfs is too small. It is wrong for sysfs to refuse to represent devices that exist and I have heard of machines that have enough memory it possible to create more than 2^32 network devices. So sysfs must handle overflow and sysfs must use the nlink == 1 in some cases. I was just thinking we would get better userspace test coverage if we don't bother to handle the other cases. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-03-08 22:18 ` Eric W. Biederman @ 2012-03-08 23:40 ` Linus Torvalds 0 siblings, 0 replies; 75+ messages in thread From: Linus Torvalds @ 2012-03-08 23:40 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Jiri Slaby, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki On Thu, Mar 8, 2012 at 2:18 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > Keeping compatibility is easy enough that it looks like it is worth > doing, but maintaining 30+ years of backwards compatibility Stop right there. This is *not* about some arbitrary "30-year backwards compatibility". This is about your patch BREAKING EXISTING BINARIES. So stop the f*&^ing around already. The patch was shown to be broken, stop making excuses, and stop blathering. End of story. Binary compatibility is more important than *any* of your patches. If you continue to argue anything else or making excuses, I'm going to ask people to just ignore your patches entirely. Seriously. Binary compatibility is *so* important that I do not want to have anything to do with kernel developers who don't understand that importance. If you continue to pooh-pooh the issue, you only show yourself to be unreliable. Don't do it. Dammit, I'm continually surprised by the *idiots* out there that don't understand that binary compatibility is one of the absolute top priorities. The *only* reason for an OS kernel existing in the first place is to serve user-space. The kernel has no relevance on its own. Breaking existing binaries - and then not acknowledging how horribly bad that was - is just about the *worst* offense any kernel developer can do. Because that shows that they don't understand what the whole *point* of the kernel was after all. We're not masturbating around with some research project. We never were. Even when Linux was young, the whole and only point was to make a *usable* system. It's why it's not some crazy drug-induced microkernel or other random crazy thing. Really. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications 2012-03-05 16:09 ` Greg Kroah-Hartman 2012-03-05 16:47 ` Linus Torvalds @ 2012-03-08 21:28 ` Eric W. Biederman 2012-03-08 21:34 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman 2 siblings, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-03-08 21:28 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > On Mon, Mar 05, 2012 at 02:30:20PM +0100, Jiri Slaby wrote: >> On 02/02/2012 12:18 AM, Linus Torvalds wrote: >> > On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds >> > <torvalds@linux-foundation.org> wrote: >> >> >> >> No extra "keep track of inode counts by hand" crap, and no idiotic >> >> config options that just make it easy to (conditionally) get things >> >> wrong. Just do it right, and do it *unconditionally* right. >> > >> > And btw, "nlink shows number of subdirectories" for a directory entry >> > really *is* right. It's how Unix filesystems work, like it or not. >> > >> > It's mainly lazy/bad filesystems that set nlink to 1. So the whole >> > "nlink==1" case is meant for crap like FAT etc, not for a filesystem >> > that we control and that could easily just do it right. >> > >> > Which is why I detest that config option. It's as if you were asking the user >> > >> > "Do you want to make the sysfs filesystem act like crap filesystems?" >> > >> > and kernel config time. What kind of inane question is that? >> >> <thread resumed...> >> >> What's going on here? I still have to revert "sysfs: Kill nlink >> counting." with today's -next to have working sensors. > > I don't remember. I thought there was a proposed patch for this issue > from Eric, but I don't see it in my queue anywhere. > > Eric, what was the resolution here? Apologies. Cold/Allergies and distractions have kept it away. sysfs patches in a follow up. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte. 2012-03-05 16:09 ` Greg Kroah-Hartman 2012-03-05 16:47 ` Linus Torvalds 2012-03-08 21:28 ` Eric W. Biederman @ 2012-03-08 21:34 ` Eric W. Biederman 2012-03-08 21:36 ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman 2012-03-08 22:28 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman 2 siblings, 2 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-03-08 21:34 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki In an effort to keep sysfs_dirent small used the smallest basic type I can for sysfs s_flags. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- fs/sysfs/sysfs.h | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h index 6289a00..c76c932 100644 --- a/fs/sysfs/sysfs.h +++ b/fs/sysfs/sysfs.h @@ -76,7 +76,7 @@ struct sysfs_dirent { struct sysfs_elem_bin_attr s_bin_attr; }; - unsigned short s_flags; + unsigned char s_flags; umode_t s_mode; unsigned int s_ino; struct sysfs_inode_attrs *s_iattr; @@ -84,7 +84,7 @@ struct sysfs_dirent { #define SD_DEACTIVATED_BIAS INT_MIN -#define SYSFS_TYPE_MASK 0x00ff +#define SYSFS_TYPE_MASK 0x000f #define SYSFS_DIR 0x0001 #define SYSFS_KOBJ_ATTR 0x0002 #define SYSFS_KOBJ_BIN_ATTR 0x0004 @@ -93,11 +93,11 @@ struct sysfs_dirent { #define SYSFS_ACTIVE_REF (SYSFS_KOBJ_ATTR | SYSFS_KOBJ_BIN_ATTR) /* identify any namespace tag on sysfs_dirents */ -#define SYSFS_NS_TYPE_MASK 0xf00 -#define SYSFS_NS_TYPE_SHIFT 8 +#define SYSFS_NS_TYPE_MASK 0x70 +#define SYSFS_NS_TYPE_SHIFT 4 #define SYSFS_FLAG_MASK ~(SYSFS_NS_TYPE_MASK|SYSFS_TYPE_MASK) -#define SYSFS_FLAG_REMOVED 0x02000 +#define SYSFS_FLAG_REMOVED 0x080 static inline unsigned int sysfs_type(struct sysfs_dirent *sd) { -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH 2/3] sysfs: Maintain usable nlink directory counts. 2012-03-08 21:34 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman @ 2012-03-08 21:36 ` Eric W. Biederman 2012-03-08 21:37 ` [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead Eric W. Biederman 2012-03-08 22:28 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman 1 sibling, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-03-08 21:36 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki When it is easy keep fully accurate nlink counts for sysfs directories, when it is not easy set nlink to 1. This maintains as much compatibility with unix programs that expect directories to have a usable nlink count as possible, without trying to do the impossible. Directory nlink count overflows in sysfs are inevitable by design so only bother to use a byte to store the directory nlink count. A directory with 254 entries is larger than any sysfs directory in a normal configruation but small enough we should see tools that care about large sysfs directories should experience nlink == 1 during their testing. This fixes libsensors and possibly other applications that get confused if all sysfs directories return nlink == 1. The lm_sensors code that got confused was just wrong is fixed in the lm_sensors trunk and a fixed version should be released sometime soon. The nlink of all deleted directories is set to 0. Returning for the first time a correct nlink count for deleted sysfs directories. Once a directory nlink count drops below 2 refuse to increment or decrement it as there is not enough information available. It is important that this works for nlink == 0 as well as nlink == 1 because currently sysfs supports deleting non-empty directories (the PCI layer requires this behavior). For tagged directories set the nlink count == 1 because we currently have one sysfs_dirent and multiple logical sysfs directories making pre computing nlink impossible. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- fs/sysfs/dir.c | 14 ++++++++++++++ fs/sysfs/inode.c | 1 + fs/sysfs/sysfs.h | 16 ++++++++++++++++ 3 files changed, 31 insertions(+), 0 deletions(-) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index dd3779c..1526567 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -91,6 +91,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd) struct rb_node **node = &sd->s_parent->s_dir.children.rb_node; struct rb_node *parent = NULL; + if (sysfs_type(sd) == SYSFS_DIR) + sysfs_inc_nlink(sd->s_parent); + while (*node) { struct sysfs_dirent *pos; int result; @@ -123,6 +126,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd) */ static void sysfs_unlink_sibling(struct sysfs_dirent *sd) { + if (sysfs_type(sd) == SYSFS_DIR) + sysfs_dec_nlink(sd->s_parent); + rb_erase(&sd->s_rb, &sd->s_parent->s_dir.children); } @@ -366,6 +372,9 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type) sd->s_name = name; sd->s_mode = mode; sd->s_flags = type; + sd->s_nlink = 1; + if (sysfs_type(sd) == SYSFS_DIR) + sd->s_nlink = 2; return sd; @@ -536,6 +545,7 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd) ps_iattrs->ia_ctime = ps_iattrs->ia_mtime = CURRENT_TIME; } + sd->s_nlink = 0; sd->s_flags |= SYSFS_FLAG_REMOVED; sd->u.removed_list = acxt->removed; acxt->removed = sd; @@ -660,6 +670,10 @@ static int create_dir(struct kobject *kobj, struct sysfs_dirent *parent_sd, sd->s_ns = ns; sd->s_dir.kobj = kobj; + /* Accurate nlink count impossible (one field mutiple dirs) */ + if (sysfs_ns_type(sd)) + sd->s_nlink = 1; + /* link in */ sysfs_addrm_start(&acxt, parent_sd); rc = sysfs_add_one(&acxt, sd); diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c index 4291fd1..f6ebda8 100644 --- a/fs/sysfs/inode.c +++ b/fs/sysfs/inode.c @@ -216,6 +216,7 @@ static void sysfs_refresh_inode(struct sysfs_dirent *sd, struct inode *inode) iattrs->ia_secdata, iattrs->ia_secdata_len); } + set_nlink(inode, sd->s_nlink); } int sysfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h index c76c932..71f9bf7 100644 --- a/fs/sysfs/sysfs.h +++ b/fs/sysfs/sysfs.h @@ -76,6 +76,7 @@ struct sysfs_dirent { struct sysfs_elem_bin_attr s_bin_attr; }; + unsigned char s_nlink; unsigned char s_flags; umode_t s_mode; unsigned int s_ino; @@ -127,6 +128,21 @@ do { \ #define sysfs_dirent_init_lockdep(sd) do {} while(0) #endif +static inline void sysfs_inc_nlink(struct sysfs_dirent *sd) +{ + if (sd->s_nlink <= 1) + return; + sd->s_nlink++; + if (sd->s_nlink == 0) + sd->s_nlink = 1; +} +static inline void sysfs_dec_nlink(struct sysfs_dirent *sd) +{ + if (sd->s_nlink <= 1) + return; + sd->s_nlink--; +} + /* * Context structure to be used while adding/removing nodes. */ -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead. 2012-03-08 21:36 ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman @ 2012-03-08 21:37 ` Eric W. Biederman 2012-03-09 3:40 ` Linus Torvalds 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-03-08 21:37 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki Now that we have a nlink field in sysfs_dirent report deleted files and directories the traditional way with nlink == 0. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- fs/sysfs/dir.c | 11 +++++------ fs/sysfs/sysfs.h | 1 - 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index 1526567..b5471d1 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -202,7 +202,7 @@ static void sysfs_deactivate(struct sysfs_dirent *sd) DECLARE_COMPLETION_ONSTACK(wait); int v; - BUG_ON(!(sd->s_flags & SYSFS_FLAG_REMOVED)); + BUG_ON(sd->s_nlink != 0); if (!(sysfs_type(sd) & SYSFS_ACTIVE_REF)) return; @@ -279,7 +279,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd) static int sysfs_dentry_delete(const struct dentry *dentry) { struct sysfs_dirent *sd = dentry->d_fsdata; - return !!(sd->s_flags & SYSFS_FLAG_REMOVED); + return sd->s_nlink == 0; } static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd) @@ -294,7 +294,7 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd) mutex_lock(&sysfs_mutex); /* The sysfs dirent has been deleted */ - if (sd->s_flags & SYSFS_FLAG_REMOVED) + if (sd->s_nlink == 0) goto out_bad; /* The sysfs dirent has been moved? */ @@ -534,7 +534,7 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd) { struct sysfs_inode_attrs *ps_iattr; - BUG_ON(sd->s_flags & SYSFS_FLAG_REMOVED); + BUG_ON(sd->s_nlink == 0); sysfs_unlink_sibling(sd); @@ -546,7 +546,6 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd) } sd->s_nlink = 0; - sd->s_flags |= SYSFS_FLAG_REMOVED; sd->u.removed_list = acxt->removed; acxt->removed = sd; } @@ -946,7 +945,7 @@ static struct sysfs_dirent *sysfs_dir_pos(const void *ns, struct sysfs_dirent *parent_sd, loff_t hash, struct sysfs_dirent *pos) { if (pos) { - int valid = !(pos->s_flags & SYSFS_FLAG_REMOVED) && + int valid = (pos->s_nlink > 0) && pos->s_parent == parent_sd && hash == pos->s_hash; sysfs_put(pos); diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h index 71f9bf7..db2c5c5 100644 --- a/fs/sysfs/sysfs.h +++ b/fs/sysfs/sysfs.h @@ -98,7 +98,6 @@ struct sysfs_dirent { #define SYSFS_NS_TYPE_SHIFT 4 #define SYSFS_FLAG_MASK ~(SYSFS_NS_TYPE_MASK|SYSFS_TYPE_MASK) -#define SYSFS_FLAG_REMOVED 0x080 static inline unsigned int sysfs_type(struct sysfs_dirent *sd) { -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead. 2012-03-08 21:37 ` [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead Eric W. Biederman @ 2012-03-09 3:40 ` Linus Torvalds 0 siblings, 0 replies; 75+ messages in thread From: Linus Torvalds @ 2012-03-09 3:40 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Jiri Slaby, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki Is this safe? What's the serialization with sysfs_link_sibling(), which you just made do sd->s_nlink++; if (!sd->s_nlink) sd->s_nlink = 1; in your previous patch? Yes, sysfs_link_sibling() runs under sysfs_mutex, but sysfs_dentry_delete() does not. So what protects sysfs_dentry_delete() from never seeing that bogus 0 due to the overflow? Linus On Thu, Mar 8, 2012 at 1:37 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > Now that we have a nlink field in sysfs_dirent report deleted > files and directories the traditional way with nlink == 0. > > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > --- > fs/sysfs/dir.c | 11 +++++------ > fs/sysfs/sysfs.h | 1 - > 2 files changed, 5 insertions(+), 7 deletions(-) > > diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c > index 1526567..b5471d1 100644 > --- a/fs/sysfs/dir.c > +++ b/fs/sysfs/dir.c > @@ -202,7 +202,7 @@ static void sysfs_deactivate(struct sysfs_dirent *sd) > DECLARE_COMPLETION_ONSTACK(wait); > int v; > > - BUG_ON(!(sd->s_flags & SYSFS_FLAG_REMOVED)); > + BUG_ON(sd->s_nlink != 0); > > if (!(sysfs_type(sd) & SYSFS_ACTIVE_REF)) > return; > @@ -279,7 +279,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd) > static int sysfs_dentry_delete(const struct dentry *dentry) > { > struct sysfs_dirent *sd = dentry->d_fsdata; > - return !!(sd->s_flags & SYSFS_FLAG_REMOVED); > + return sd->s_nlink == 0; > } > > static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd) > @@ -294,7 +294,7 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd) > mutex_lock(&sysfs_mutex); > > /* The sysfs dirent has been deleted */ > - if (sd->s_flags & SYSFS_FLAG_REMOVED) > + if (sd->s_nlink == 0) > goto out_bad; > > /* The sysfs dirent has been moved? */ > @@ -534,7 +534,7 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd) > { > struct sysfs_inode_attrs *ps_iattr; > > - BUG_ON(sd->s_flags & SYSFS_FLAG_REMOVED); > + BUG_ON(sd->s_nlink == 0); > > sysfs_unlink_sibling(sd); > > @@ -546,7 +546,6 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd) > } > > sd->s_nlink = 0; > - sd->s_flags |= SYSFS_FLAG_REMOVED; > sd->u.removed_list = acxt->removed; > acxt->removed = sd; > } > @@ -946,7 +945,7 @@ static struct sysfs_dirent *sysfs_dir_pos(const void *ns, > struct sysfs_dirent *parent_sd, loff_t hash, struct sysfs_dirent *pos) > { > if (pos) { > - int valid = !(pos->s_flags & SYSFS_FLAG_REMOVED) && > + int valid = (pos->s_nlink > 0) && > pos->s_parent == parent_sd && > hash == pos->s_hash; > sysfs_put(pos); > diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h > index 71f9bf7..db2c5c5 100644 > --- a/fs/sysfs/sysfs.h > +++ b/fs/sysfs/sysfs.h > @@ -98,7 +98,6 @@ struct sysfs_dirent { > #define SYSFS_NS_TYPE_SHIFT 4 > > #define SYSFS_FLAG_MASK ~(SYSFS_NS_TYPE_MASK|SYSFS_TYPE_MASK) > -#define SYSFS_FLAG_REMOVED 0x080 > > static inline unsigned int sysfs_type(struct sysfs_dirent *sd) > { > -- > 1.7.2.5 > ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte. 2012-03-08 21:34 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman 2012-03-08 21:36 ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman @ 2012-03-08 22:28 ` Greg Kroah-Hartman 2012-03-09 2:49 ` Eric W. Biederman 1 sibling, 1 reply; 75+ messages in thread From: Greg Kroah-Hartman @ 2012-03-08 22:28 UTC (permalink / raw) To: Eric W. Biederman Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki On Thu, Mar 08, 2012 at 01:34:22PM -0800, Eric W. Biederman wrote: > > In an effort to keep sysfs_dirent small used the smallest > basic type I can for sysfs s_flags. > > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > --- > fs/sysfs/sysfs.h | 10 +++++----- > 1 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h > index 6289a00..c76c932 100644 > --- a/fs/sysfs/sysfs.h > +++ b/fs/sysfs/sysfs.h > @@ -76,7 +76,7 @@ struct sysfs_dirent { > struct sysfs_elem_bin_attr s_bin_attr; > }; > > - unsigned short s_flags; > + unsigned char s_flags; > umode_t s_mode; > unsigned int s_ino; > struct sysfs_inode_attrs *s_iattr; Are you sure this saved you any real space here? Have you use pahole (I think that's the tool name) to determine if there are holes in the structure? Given that you moved this to a smaller variable, yet there are pointers later on, odds are nothing changed at all overall. Verification would be good to have, to see if this work was really worth it, right? thanks, greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte. 2012-03-08 22:28 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman @ 2012-03-09 2:49 ` Eric W. Biederman 0 siblings, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-03-09 2:49 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro, Maciej Rutecki Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > On Thu, Mar 08, 2012 at 01:34:22PM -0800, Eric W. Biederman wrote: >> >> In an effort to keep sysfs_dirent small used the smallest >> basic type I can for sysfs s_flags. >> >> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> >> --- >> fs/sysfs/sysfs.h | 10 +++++----- >> 1 files changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h >> index 6289a00..c76c932 100644 >> --- a/fs/sysfs/sysfs.h >> +++ b/fs/sysfs/sysfs.h >> @@ -76,7 +76,7 @@ struct sysfs_dirent { >> struct sysfs_elem_bin_attr s_bin_attr; >> }; >> >> - unsigned short s_flags; >> + unsigned char s_flags; >> umode_t s_mode; >> unsigned int s_ino; >> struct sysfs_inode_attrs *s_iattr; > > Are you sure this saved you any real space here? Have you use pahole (I > think that's the tool name) to determine if there are holes in the > structure? Given that you moved this to a smaller variable, yet there > are pointers later on, odds are nothing changed at all overall. > > Verification would be good to have, to see if this work was really worth > it, right? The next patch stuffs an 8bit nlink in the hole. I was making room so we don't have to grow sysfs_dirent. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 22:10 ` Alan Cox 2012-01-30 22:27 ` Greg KH @ 2012-01-31 3:45 ` Eric W. Biederman 2012-01-31 11:54 ` Alan Cox 1 sibling, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-01-31 3:45 UTC (permalink / raw) To: Alan Cox; +Cc: Greg KH, Jiri Slaby, LKML, systemd-devel Alan Cox <alan@lxorguk.ukuu.org.uk> writes: >> Isn't there some other "proper" way of doing this in userspace, or is >> this really the correct way? > > You can look at the S_IFMT bits and stuff however link count indicating > number of subdirectories is a standard Unix thing and used by many quite > mundane tools as an optimisation. Those tools for a decade or better all know to treat nlink == 1 as the case where the optimization does not apply. On a traditional unix filesytem the most common place you will see nlink == 1 is when the link count overflows. extN with > 65536 subdirectories as I recall. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-31 3:45 ` sysfs regression: wrong link counts Eric W. Biederman @ 2012-01-31 11:54 ` Alan Cox 0 siblings, 0 replies; 75+ messages in thread From: Alan Cox @ 2012-01-31 11:54 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Greg KH, Jiri Slaby, LKML, systemd-devel On Mon, 30 Jan 2012 19:45:10 -0800 ebiederm@xmission.com (Eric W. Biederman) wrote: > Alan Cox <alan@lxorguk.ukuu.org.uk> writes: > > >> Isn't there some other "proper" way of doing this in userspace, or is > >> this really the correct way? > > > > You can look at the S_IFMT bits and stuff however link count indicating > > number of subdirectories is a standard Unix thing and used by many quite > > mundane tools as an optimisation. > > Those tools for a decade or better all know to treat nlink == 1 as the > case where the optimization does not apply. Most of the do. but as has been demonstrated here - not all ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby 2012-01-30 22:06 ` Greg KH @ 2012-01-30 22:52 ` Kay Sievers 2012-01-31 10:41 ` network regression: cannot rename netdev twice Jiri Slaby 2012-01-31 1:32 ` sysfs regression: wrong link counts Eric W. Biederman 2012-02-01 18:29 ` Maciej Rutecki 3 siblings, 1 reply; 75+ messages in thread From: Kay Sievers @ 2012-01-30 22:52 UTC (permalink / raw) To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML 2012/1/30 Jiri Slaby <jslaby@suse.cz>: > I cannot boot properly with this commit: > commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 > Author: Eric W. Biederman <ebiederm@xmission.com> > Date: Sun Dec 18 20:09:31 2011 -0800 > > sysfs: Kill nlink counting. > > > 1) network systemd rule doesn't start network What does that mean? What's a network systemd rule? Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* network regression: cannot rename netdev twice 2012-01-30 22:52 ` Kay Sievers @ 2012-01-31 10:41 ` Jiri Slaby 2012-01-31 10:52 ` Kay Sievers 0 siblings, 1 reply; 75+ messages in thread From: Jiri Slaby @ 2012-01-31 10:41 UTC (permalink / raw) To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev On 01/30/2012 11:52 PM, Kay Sievers wrote: > 2012/1/30 Jiri Slaby <jslaby@suse.cz>: >> I cannot boot properly with this commit: >> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 >> Author: Eric W. Biederman <ebiederm@xmission.com> >> Date: Sun Dec 18 20:09:31 2011 -0800 >> >> sysfs: Kill nlink counting. >> >> >> 1) network systemd rule doesn't start network > > What does that mean? What's a network systemd rule? Oh, perhaps you call it a service file, not rule file? Anyway this is a different bug. Revert of the patch above does not help. The bug lays in the network layer. udev is unable to perform persistent eth naming: # ip link set eth0 name eth1 -- this one is OK # ip link set eth1 name eth0 RTNETLINK answers: No such file or directory thanks, -- js suse labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-01-31 10:41 ` network regression: cannot rename netdev twice Jiri Slaby @ 2012-01-31 10:52 ` Kay Sievers 2012-01-31 11:00 ` Jiri Slaby 2012-02-04 2:14 ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh 0 siblings, 2 replies; 75+ messages in thread From: Kay Sievers @ 2012-01-31 10:52 UTC (permalink / raw) To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev On Tue, Jan 31, 2012 at 11:41, Jiri Slaby <jslaby@suse.cz> wrote: > On 01/30/2012 11:52 PM, Kay Sievers wrote: >> 2012/1/30 Jiri Slaby <jslaby@suse.cz>: >>> I cannot boot properly with this commit: >>> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 >>> Author: Eric W. Biederman <ebiederm@xmission.com> >>> Date: Sun Dec 18 20:09:31 2011 -0800 >>> >>> sysfs: Kill nlink counting. >>> >>> 1) network systemd rule doesn't start network >> >> What does that mean? What's a network systemd rule? > > Oh, perhaps you call it a service file, not rule file? > > Anyway this is a different bug. Revert of the patch above does not help. Ok, fine. I checked too, and systemd does not play any silly games with link counts. > The bug lays in the network layer. udev is unable to perform persistent > eth naming: > # ip link set eth0 name eth1 -- this one is OK > # ip link set eth1 name eth0 > RTNETLINK answers: No such file or directory Please make sure nothing tries to swap netif names in userspace. We have given up that approach, because it is far too fragile to temporary rename devices to be able to swap the names, and race against the loading of new kernel network drivers at the same time. This might be a new kernel problem here, but in general that approach is just broken, we have have given up fiddling around here. Udev does not do that anymore, and also the code that currently *can* be used to do this, will be removed from udev in the future. Network devices can only be renamed to a namespace that isn't ethX, and which does not race against kernel names. Does is work, if you rename the devices to something else than ethX? Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-01-31 10:52 ` Kay Sievers @ 2012-01-31 11:00 ` Jiri Slaby 2012-01-31 11:13 ` Kay Sievers 2012-02-04 2:14 ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh 1 sibling, 1 reply; 75+ messages in thread From: Jiri Slaby @ 2012-01-31 11:00 UTC (permalink / raw) To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev On 01/31/2012 11:52 AM, Kay Sievers wrote: > On Tue, Jan 31, 2012 at 11:41, Jiri Slaby <jslaby@suse.cz> wrote: >> On 01/30/2012 11:52 PM, Kay Sievers wrote: >>> 2012/1/30 Jiri Slaby <jslaby@suse.cz>: >>>> I cannot boot properly with this commit: >>>> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 >>>> Author: Eric W. Biederman <ebiederm@xmission.com> >>>> Date: Sun Dec 18 20:09:31 2011 -0800 >>>> >>>> sysfs: Kill nlink counting. >>>> >>>> 1) network systemd rule doesn't start network >>> >>> What does that mean? What's a network systemd rule? >> >> Oh, perhaps you call it a service file, not rule file? >> >> Anyway this is a different bug. Revert of the patch above does not help. > > Ok, fine. I checked too, and systemd does not play any silly games > with link counts. > >> The bug lays in the network layer. udev is unable to perform persistent >> eth naming: >> # ip link set eth0 name eth1 -- this one is OK >> # ip link set eth1 name eth0 >> RTNETLINK answers: No such file or directory > > Please make sure nothing tries to swap netif names in userspace. We > have given up that approach, because it is far too fragile to > temporary rename devices to be able to swap the names, and race > against the loading of new kernel network drivers at the same time. > > This might be a new kernel problem here, but in general that approach > is just broken, we have have given up fiddling around here. Udev does > not do that anymore, and also the code that currently *can* be used to > do this, will be removed from udev in the future. > > Network devices can only be renamed to a namespace that isn't ethX, > and which does not race against kernel names. I have two eth interfaces. The one on the motherboard is named eth0, an added PCI card is eth1. But kernel enumerates them in the opposite order. So udev does this sequence: eth1 -> rename3 eth0 -> eth1 rename3 -> eth0 How it can do it differently? (This is openSUSE factory.) > Does is work, if you rename the devices to something else than ethX? Negative: # ip link set eth0 name krtek # ip link set krtek name jezek RTNETLINK answers: No such file or directory thanks, -- js suse labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-01-31 11:00 ` Jiri Slaby @ 2012-01-31 11:13 ` Kay Sievers 2012-01-31 11:17 ` Jiri Slaby 0 siblings, 1 reply; 75+ messages in thread From: Kay Sievers @ 2012-01-31 11:13 UTC (permalink / raw) To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev On Tue, Jan 31, 2012 at 12:00, Jiri Slaby <jslaby@suse.cz> wrote: > On 01/31/2012 11:52 AM, Kay Sievers wrote: >> On Tue, Jan 31, 2012 at 11:41, Jiri Slaby <jslaby@suse.cz> wrote: >>> On 01/30/2012 11:52 PM, Kay Sievers wrote: >>>> 2012/1/30 Jiri Slaby <jslaby@suse.cz>: >>>>> I cannot boot properly with this commit: >>>>> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 >>>>> Author: Eric W. Biederman <ebiederm@xmission.com> >>>>> Date: Sun Dec 18 20:09:31 2011 -0800 >>>>> >>>>> sysfs: Kill nlink counting. >>>>> >>>>> 1) network systemd rule doesn't start network >>>> >>>> What does that mean? What's a network systemd rule? >>> >>> Oh, perhaps you call it a service file, not rule file? >>> >>> Anyway this is a different bug. Revert of the patch above does not help. >> >> Ok, fine. I checked too, and systemd does not play any silly games >> with link counts. >> >>> The bug lays in the network layer. udev is unable to perform persistent >>> eth naming: >>> # ip link set eth0 name eth1 -- this one is OK >>> # ip link set eth1 name eth0 >>> RTNETLINK answers: No such file or directory >> >> Please make sure nothing tries to swap netif names in userspace. We >> have given up that approach, because it is far too fragile to >> temporary rename devices to be able to swap the names, and race >> against the loading of new kernel network drivers at the same time. >> >> This might be a new kernel problem here, but in general that approach >> is just broken, we have have given up fiddling around here. Udev does >> not do that anymore, and also the code that currently *can* be used to >> do this, will be removed from udev in the future. >> >> Network devices can only be renamed to a namespace that isn't ethX, >> and which does not race against kernel names. > > I have two eth interfaces. The one on the motherboard is named eth0, an > added PCI card is eth1. But kernel enumerates them in the opposite order. > > So udev does this sequence: > eth1 -> rename3 > eth0 -> eth1 > rename3 -> eth0 > > How it can do it differently? (This is openSUSE factory.) A future udev will not help you doing that. We have given up supporting this approach. Renaming is done during booting, at the same time we load new kernel drivers, and all breaks in non-interesting ways. Apart from all the other unsolvable problems with this model. Pretending we are able to rename netif names in the same namespace the kernel is allocating new names is just plain wrong. There are races you can't control. The entire approach creates far more problems than it solves. We just have to admit it was wrong to do that. Custom/to-rename netif names can just not be ethX. >> Does is work, if you rename the devices to something else than ethX? > > Negative: > # ip link set eth0 name krtek > # ip link set krtek name jezek > RTNETLINK answers: No such file or directory This is a command sequence you type manually? You are sure that userspace is not working in the background, triggered by uevents, and comes into your way here? Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-01-31 11:13 ` Kay Sievers @ 2012-01-31 11:17 ` Jiri Slaby 2012-01-31 11:58 ` Kay Sievers 0 siblings, 1 reply; 75+ messages in thread From: Jiri Slaby @ 2012-01-31 11:17 UTC (permalink / raw) To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev On 01/31/2012 12:13 PM, Kay Sievers wrote: >>> Does is work, if you rename the devices to something else than ethX? >> >> Negative: >> # ip link set eth0 name krtek >> # ip link set krtek name jezek >> RTNETLINK answers: No such file or directory > > This is a command sequence you type manually? Yea, and it is working with 3.3.0-rc1-next-20120124_64+. Not with 3.3.0-rc1-next-20120131_64+. > You are sure that userspace is not working in the background, > triggered by uevents, and comes into your way here? Note that krtek exists after the first command. But cannot be renamed further. thanks, -- js suse labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-01-31 11:17 ` Jiri Slaby @ 2012-01-31 11:58 ` Kay Sievers 2012-01-31 14:18 ` Eric W. Biederman 0 siblings, 1 reply; 75+ messages in thread From: Kay Sievers @ 2012-01-31 11:58 UTC (permalink / raw) To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev On Tue, Jan 31, 2012 at 12:17, Jiri Slaby <jslaby@suse.cz> wrote: > On 01/31/2012 12:13 PM, Kay Sievers wrote: >> This is a command sequence you type manually? > > Yea, and it is working with 3.3.0-rc1-next-20120124_64+. Not with > 3.3.0-rc1-next-20120131_64+. > >> You are sure that userspace is not working in the background, >> triggered by uevents, and comes into your way here? > > Note that krtek exists after the first command. But cannot be renamed > further. Yeah, I can confirm the problem here. I works fine with earlier kernels and fails with the latest -next: # uname -r 3.3.0-rc1-next-20120131+ # modprobe dummy # ip link set dummy0 name foo0 # ip link set foo0 name bar0 RTNETLINK answers: No such file or directory Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-01-31 11:58 ` Kay Sievers @ 2012-01-31 14:18 ` Eric W. Biederman 2012-01-31 14:40 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-01-31 14:18 UTC (permalink / raw) To: Kay Sievers; +Cc: Jiri Slaby, Greg KH, LKML, ML netdev Kay Sievers <kay.sievers@vrfy.org> writes: > On Tue, Jan 31, 2012 at 12:17, Jiri Slaby <jslaby@suse.cz> wrote: >> On 01/31/2012 12:13 PM, Kay Sievers wrote: > >>> This is a command sequence you type manually? >> >> Yea, and it is working with 3.3.0-rc1-next-20120124_64+. Not with >> 3.3.0-rc1-next-20120131_64+. >> >>> You are sure that userspace is not working in the background, >>> triggered by uevents, and comes into your way here? >> >> Note that krtek exists after the first command. But cannot be renamed >> further. > > Yeah, I can confirm the problem here. I works fine with earlier > kernels and fails with the latest -next: > > # uname -r > 3.3.0-rc1-next-20120131+ > # modprobe dummy > # ip link set dummy0 name foo0 > # ip link set foo0 name bar0 > RTNETLINK answers: No such file or directory There is something weird going on when sysfs directories and symlinks are renamed. My guess is that I fat fingered something with one of my last sysfs patches. I will look more deeply once I have slept some more. The second network device renames fails because the first rename did not work properly. ls -l /sys/class/net/ /sys/virtual/net/ will let you see what I mean. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH] sysfs: Update the name hash when renaming sysfs entries 2012-01-31 14:18 ` Eric W. Biederman @ 2012-01-31 14:40 ` Eric W. Biederman 2012-01-31 14:41 ` Jiri Slaby 2012-01-31 14:55 ` Greg KH 0 siblings, 2 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-01-31 14:40 UTC (permalink / raw) To: Greg KH, Greg Kroah-Hartman; +Cc: Jiri Slaby, LKML, ML netdev, Kay Sievers This fixes a bug introduced with sysfs name hashes where renaming a network device appears to succeed but silently makes the sysfs files for that network device inaccessible. In at least one configuration this bug has stopped networking from coming up during boot. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- fs/sysfs/dir.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c index ea64d01..dd3779c 100644 --- a/fs/sysfs/dir.c +++ b/fs/sysfs/dir.c @@ -872,6 +872,7 @@ int sysfs_rename(struct sysfs_dirent *sd, dup_name = sd->s_name; sd->s_name = new_name; + sd->s_hash = sysfs_name_hash(sd->s_ns, sd->s_name); } /* Move to the appropriate place in the appropriate directories rbtree. */ -- 1.7.2.5 ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Update the name hash when renaming sysfs entries 2012-01-31 14:40 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman @ 2012-01-31 14:41 ` Jiri Slaby 2012-01-31 14:55 ` Greg KH 1 sibling, 0 replies; 75+ messages in thread From: Jiri Slaby @ 2012-01-31 14:41 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg KH, Greg Kroah-Hartman, LKML, ML netdev, Kay Sievers On 01/31/2012 03:40 PM, Eric W. Biederman wrote: > > This fixes a bug introduced with sysfs name hashes where renaming a > network device appears to succeed but silently makes the sysfs files for > that network device inaccessible. > > In at least one configuration this bug has stopped networking from > coming up during boot. > > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> It works for me. Thanks. Tested-by: Jiri Slaby <jslaby@suse.cz> > --- > fs/sysfs/dir.c | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c > index ea64d01..dd3779c 100644 > --- a/fs/sysfs/dir.c > +++ b/fs/sysfs/dir.c > @@ -872,6 +872,7 @@ int sysfs_rename(struct sysfs_dirent *sd, > > dup_name = sd->s_name; > sd->s_name = new_name; > + sd->s_hash = sysfs_name_hash(sd->s_ns, sd->s_name); > } > > /* Move to the appropriate place in the appropriate directories rbtree. */ -- js suse labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] sysfs: Update the name hash when renaming sysfs entries 2012-01-31 14:40 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman 2012-01-31 14:41 ` Jiri Slaby @ 2012-01-31 14:55 ` Greg KH 1 sibling, 0 replies; 75+ messages in thread From: Greg KH @ 2012-01-31 14:55 UTC (permalink / raw) To: Eric W. Biederman Cc: Greg Kroah-Hartman, Jiri Slaby, LKML, ML netdev, Kay Sievers On Tue, Jan 31, 2012 at 06:40:26AM -0800, Eric W. Biederman wrote: > > This fixes a bug introduced with sysfs name hashes where renaming a > network device appears to succeed but silently makes the sysfs files for > that network device inaccessible. > > In at least one configuration this bug has stopped networking from > coming up during boot. > > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > --- > fs/sysfs/dir.c | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) Thanks for this, I'll queue it up later today. greg k-h ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-01-31 10:52 ` Kay Sievers 2012-01-31 11:00 ` Jiri Slaby @ 2012-02-04 2:14 ` Henrique de Moraes Holschuh 2012-02-06 20:03 ` Kay Sievers 1 sibling, 1 reply; 75+ messages in thread From: Henrique de Moraes Holschuh @ 2012-02-04 2:14 UTC (permalink / raw) To: Kay Sievers; +Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev On Tue, 31 Jan 2012, Kay Sievers wrote: > Please make sure nothing tries to swap netif names in userspace. We > have given up that approach, because it is far too fragile to > temporary rename devices to be able to swap the names, and race > against the loading of new kernel network drivers at the same time. That's a damn fair reason, but the loss of that functionality could cause trouble. In fact, at first glance, to me it looks like this has a large potential for unleashing untold pain and suffering in the sysadmin ranks unless early userspace can emulate it somehow. Is it possible to configure the kernel to use something other than "eth#" as its initial namespace for netif names? Or is there some other way to get eth1 to be what you need eth1 to be during userland boot? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-04 2:14 ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh @ 2012-02-06 20:03 ` Kay Sievers 2012-02-08 2:00 ` Henrique de Moraes Holschuh 0 siblings, 1 reply; 75+ messages in thread From: Kay Sievers @ 2012-02-06 20:03 UTC (permalink / raw) To: Henrique de Moraes Holschuh Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev On Sat, Feb 4, 2012 at 03:14, Henrique de Moraes Holschuh <hmh@hmh.eng.br> wrote: > On Tue, 31 Jan 2012, Kay Sievers wrote: >> Please make sure nothing tries to swap netif names in userspace. We >> have given up that approach, because it is far too fragile to >> temporary rename devices to be able to swap the names, and race >> against the loading of new kernel network drivers at the same time. > > That's a damn fair reason, but the loss of that functionality could cause > trouble. In fact, at first glance, to me it looks like this has a large > potential for unleashing untold pain and suffering in the sysadmin ranks > unless early userspace can emulate it somehow. > > Is it possible to configure the kernel to use something other than "eth#" as > its initial namespace for netif names? Or is there some other way to get > eth1 to be what you need eth1 to be during userland boot? I don't think there is a sane way to do that. Someone could add a kernel command line parameter to switch ethX in the kernel to something else, and create custom udev rules which match on device properties and apply configured names which are ethX again. But for all that, there will be no generally available support in common base system tools, and we absolutely do not recommend anybody doing that. Udev will not provide any help for that any more, not for automatic device name reservation from a hotplug path, not for device name swaps in the kernel namespace. It will only be allowed to rename devices to a namespace that does not clash with the kernel's one. People should use biosdevname's pci-slot names, or the on-board labels names like DELL does for configuration-less stable names, or use manually configured names 'internal', 'external' ,'dmz', 'vpn' and so on. I think we should stop pretending we can solve problems, resulting from simple enumeration depending on device-discovery order. These numbers can never be stable, can never reliably work in the reality we are working with. It's time to leave these false promises behind us and move on and that means, no stable ethX names anymore. Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-06 20:03 ` Kay Sievers @ 2012-02-08 2:00 ` Henrique de Moraes Holschuh 2012-02-08 3:50 ` Kay Sievers 0 siblings, 1 reply; 75+ messages in thread From: Henrique de Moraes Holschuh @ 2012-02-08 2:00 UTC (permalink / raw) To: Kay Sievers; +Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev On Mon, 06 Feb 2012, Kay Sievers wrote: > On Sat, Feb 4, 2012 at 03:14, Henrique de Moraes Holschuh > <hmh@hmh.eng.br> wrote: > > Is it possible to configure the kernel to use something other than "eth#" as > > its initial namespace for netif names? Or is there some other way to get > > eth1 to be what you need eth1 to be during userland boot? > > I don't think there is a sane way to do that. Someone could add a > kernel command line parameter to switch ethX in the kernel to > something else, and create custom udev rules which match on device > properties and apply configured names which are ethX again. But for > all that, there will be no generally available support in common base > system tools, and we absolutely do not recommend anybody doing that. What sort of impact analysis on userspace was done about this change? Nobody in his right mind would go back to the dark ages of uncontrolled ifnames. You're effectively forcing everybody with a clue away from the eth# namespace. Just to be very clear: the impact of this is the need to change the interface names on potentially millions of lines of firewall rules and scripts out there, as well as tracking down stuff (mostly scripts) that special-cases the eth prefix. Is there a really good reason why we cannot have a way to move the kernel away from the eth# namespace at boot (through a kernel parameter, maybe with the default namespace set at compile time), AND keep the "common base system tools" support to assign ifname based on MAC addresses that we have right now? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-08 2:00 ` Henrique de Moraes Holschuh @ 2012-02-08 3:50 ` Kay Sievers 2012-02-08 6:42 ` Valdis.Kletnieks 0 siblings, 1 reply; 75+ messages in thread From: Kay Sievers @ 2012-02-08 3:50 UTC (permalink / raw) To: Henrique de Moraes Holschuh Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev On Wed, Feb 8, 2012 at 03:00, Henrique de Moraes Holschuh <hmh@hmh.eng.br> wrote: > On Mon, 06 Feb 2012, Kay Sievers wrote: >> On Sat, Feb 4, 2012 at 03:14, Henrique de Moraes Holschuh >> <hmh@hmh.eng.br> wrote: >> > Is it possible to configure the kernel to use something other than "eth#" as >> > its initial namespace for netif names? Or is there some other way to get >> > eth1 to be what you need eth1 to be during userland boot? >> >> I don't think there is a sane way to do that. Someone could add a >> kernel command line parameter to switch ethX in the kernel to >> something else, and create custom udev rules which match on device >> properties and apply configured names which are ethX again. But for >> all that, there will be no generally available support in common base >> system tools, and we absolutely do not recommend anybody doing that. > > What sort of impact analysis on userspace was done about this change? None. It will just not be supported for new setups. Existing ones will do what they always did. > Nobody in his right mind would go back to the dark ages of uncontrolled > ifnames. You're effectively forcing everybody with a clue away from the > eth# namespace. Yes. It's a game we have lost and we will not win in the future. I gave up, and I warn everybody who think it's simple to manage. > Just to be very clear: the impact of this is the need to change the > interface names on potentially millions of lines of firewall rules and > scripts out there, as well as tracking down stuff (mostly scripts) that > special-cases the eth prefix. Yeah, and for good, ethX is a pretty much random kernel name, and I personally will no longer work on conceptually broken infrastructure that can never deliver what it seems to promise. In the longer run, tools need to be fixed to automatically handle changing names, or not care about the names at all, or names need to be explicitly set up outside the ethX namespace to be predictable. After years of working in that area I will stop to work on these hacks to promise stable ethX names. It was just wrong, like enumerations always are in hotplug setups. > Is there a really good reason why we cannot have a way to move the > kernel away from the eth# namespace at boot (through a kernel parameter, > maybe with the default namespace set at compile time), Could work, but I don't think it is worth. Simple enumeration, and automatic persistent on-disk device name reservation in a flat number-range is just a very flawed concept. I'm not interested in working on that, but that surely should not stop anybody from trying and providing tools that can do that. > AND keep the > "common base system tools" support to assign ifname based on MAC > addresses that we have right now? Not provided by udev's default setup, which did persistent name reservation in the device hotplug path. It is already disabled and will be entirely removed from the source tree some day. Other tools can still try to provide that. But I declare that model as officially failed and udev will not even try anything like that anymore. People who need predictable interface names should just manually configure custom/descriptive names, or names which are reliably derived from the hardware, like firmware-provided names or the pci slot number. Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-08 3:50 ` Kay Sievers @ 2012-02-08 6:42 ` Valdis.Kletnieks 2012-02-08 10:57 ` Kay Sievers 0 siblings, 1 reply; 75+ messages in thread From: Valdis.Kletnieks @ 2012-02-08 6:42 UTC (permalink / raw) To: Kay Sievers Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev [-- Attachment #1: Type: text/plain, Size: 1375 bytes --] On Wed, 08 Feb 2012 04:50:15 +0100, Kay Sievers said: > After years of working in that area I will stop to work on these hacks > to promise stable ethX names. It was just wrong, like enumerations > always are in hotplug setups. So (real world case) I've got a server that's got a 1G ethernet connected to the public net, a 1G ethernet that's a cluster management network, and a 10G ethernet that connects to our HPC clusters. And I want to add iptables rules that distinguish based on interface. Currently I can nail the management net to eth0, the public net to eth1, and the 10G to eth2, and then just add "-i eth1" or whatever in the iptables ruleset. I really don't care if the 0/1/2 move around - but if we're not having nailed-down interface names, what will take the place of '-i ethN' in iptables? > People who need predictable interface names should just manually > configure custom/descriptive names, or names which are reliably > derived from the hardware, like firmware-provided names or the pci > slot number. Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0" what you are trying to move to, and my systems are already onboard and I should just move along, nothing to see here? ;) [-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --] ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-08 6:42 ` Valdis.Kletnieks @ 2012-02-08 10:57 ` Kay Sievers 2012-02-08 20:06 ` Valdis.Kletnieks 0 siblings, 1 reply; 75+ messages in thread From: Kay Sievers @ 2012-02-08 10:57 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev On Wed, Feb 8, 2012 at 07:42, <Valdis.Kletnieks@vt.edu> wrote: > On Wed, 08 Feb 2012 04:50:15 +0100, Kay Sievers said: > >> After years of working in that area I will stop to work on these hacks >> to promise stable ethX names. It was just wrong, like enumerations >> always are in hotplug setups. > > So (real world case) I've got a server that's got a 1G ethernet connected to > the public net, a 1G ethernet that's a cluster management network, and > a 10G ethernet that connects to our HPC clusters. > > And I want to add iptables rules that distinguish based on interface. Currently > I can nail the management net to eth0, the public net to eth1, and the 10G to > eth2, and then just add "-i eth1" or whatever in the iptables ruleset. > > I really don't care if the 0/1/2 move around - but if we're not having nailed-down > interface names, what will take the place of '-i ethN' in iptables? > >> People who need predictable interface names should just manually >> configure custom/descriptive names, or names which are reliably >> derived from the hardware, like firmware-provided names or the pci >> slot number. > > Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules > SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0" > what you are trying to move to, and my systems are already onboard and > I should just move along, nothing to see here? ;) Yeah, that's what we did in the past. It works fine if you never have to swap names like eth0 and eth1, with need to free one of the the names with a temporary rename. If another device is added by a different kernel module, or just a USB network device is already plugged-in at bootup time, the parallel loading of drivers might cause the kernel to create a new eth0 or eth1 just in the moment we have the temporary rename active and we want to swap the names. That model is just entirely flawed and will never work reliably without creating an even bigger mess we already have, by requiring complex retry loops across multiple devices, or having global locks including the kernel's device name allocation logic. Let's just move on and stop pretending we want or we can solve these problems. Simple device enumerations in hotplug setups can by their very definition not work in a predictable way, we should never have tried to mess around here, and just moved on to something that has at least the potential to work. Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-08 10:57 ` Kay Sievers @ 2012-02-08 20:06 ` Valdis.Kletnieks 2012-02-08 20:27 ` Stephen Hemminger 2012-02-08 23:48 ` Kay Sievers 0 siblings, 2 replies; 75+ messages in thread From: Valdis.Kletnieks @ 2012-02-08 20:06 UTC (permalink / raw) To: Kay Sievers Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev [-- Attachment #1: Type: text/plain, Size: 1214 bytes --] On Wed, 08 Feb 2012 11:57:18 +0100, Kay Sievers said: > On Wed, Feb 8, 2012 at 07:42, <Valdis.Kletnieks@vt.edu> wrote: > > Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules > > SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0" > > what you are trying to move to, and my systems are already onboard and > > I should just move along, nothing to see here? ;) > > Yeah, that's what we did in the past. It works fine if you never have > to swap names like eth0 and eth1, with need to free one of the the > names with a temporary rename. Well, if I had my druthers, I'd stick name="net-mgt", "net-pub", and "net-10g" in the udev rules, and not care about 1/2/3 and race conditions, because meaningful names are easier to not screw up (just last week found a system that had eth1 and eth2 reversed in some iptables rules, wouldn't have happened if they were -mgt and -pub). Only thing stopping me is getting iptables to accept '-i net-10g', and the distro /etc/sysconfig/network scripts like ifup and ifdown playing nice.... So it sounds like what I want as a sysadmin is the same thing you want as a maintainer... [-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --] ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-08 20:06 ` Valdis.Kletnieks @ 2012-02-08 20:27 ` Stephen Hemminger 2012-02-08 23:48 ` Kay Sievers 1 sibling, 0 replies; 75+ messages in thread From: Stephen Hemminger @ 2012-02-08 20:27 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Kay Sievers, Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev [-- Attachment #1: Type: text/plain, Size: 111 bytes --] Our customers would prefer network device names of the form "Ethernet0/0" or "ge-0/0/0" (but I said no...) [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: network regression: cannot rename netdev twice 2012-02-08 20:06 ` Valdis.Kletnieks 2012-02-08 20:27 ` Stephen Hemminger @ 2012-02-08 23:48 ` Kay Sievers 1 sibling, 0 replies; 75+ messages in thread From: Kay Sievers @ 2012-02-08 23:48 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev On Wed, Feb 8, 2012 at 21:06, <Valdis.Kletnieks@vt.edu> wrote: > On Wed, 08 Feb 2012 11:57:18 +0100, Kay Sievers said: >> On Wed, Feb 8, 2012 at 07:42, <Valdis.Kletnieks@vt.edu> wrote: > >> > Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules >> > SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0" >> > what you are trying to move to, and my systems are already onboard and >> > I should just move along, nothing to see here? ;) >> >> Yeah, that's what we did in the past. It works fine if you never have >> to swap names like eth0 and eth1, with need to free one of the the >> names with a temporary rename. > > Well, if I had my druthers, I'd stick name="net-mgt", "net-pub", and "net-10g" > in the udev rules, and not care about 1/2/3 and race conditions, because > meaningful names are easier to not screw up (just last week found a system that > had eth1 and eth2 reversed in some iptables rules, wouldn't have happened if > they were -mgt and -pub). > > Only thing stopping me is getting iptables to accept '-i net-10g', and the > distro /etc/sysconfig/network scripts like ifup and ifdown playing nice.... > > So it sounds like what I want as a sysadmin is the same thing you want > as a maintainer... Yeah, that sounds very much like it is. I want to push some responsibility to the admin, do less automagic, and personally want to be less responsible for all the unintended screw-up the automagic is causing everywhere. Sure, the intention to keep the names like they always have been was good, but a good intention and a broken model to deliver it, and continue to pretend we can solve it, is the worst things we can do. :) Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby 2012-01-30 22:06 ` Greg KH 2012-01-30 22:52 ` Kay Sievers @ 2012-01-31 1:32 ` Eric W. Biederman 2012-02-01 18:29 ` Maciej Rutecki 3 siblings, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-01-31 1:32 UTC (permalink / raw) To: Jiri Slaby; +Cc: Greg KH, LKML Jiri Slaby <jslaby@suse.cz> writes: > 1) network systemd rule doesn't start network What silly thing is the network systemd rule doing? Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: sysfs regression: wrong link counts 2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby ` (2 preceding siblings ...) 2012-01-31 1:32 ` sysfs regression: wrong link counts Eric W. Biederman @ 2012-02-01 18:29 ` Maciej Rutecki 3 siblings, 0 replies; 75+ messages in thread From: Maciej Rutecki @ 2012-02-01 18:29 UTC (permalink / raw) To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML On poniedziałek, 30 stycznia 2012 o 22:56:26 Jiri Slaby wrote: > Hi, > > I cannot boot properly with this commit: > commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676 > Author: Eric W. Biederman <ebiederm@xmission.com> > Date: Sun Dec 18 20:09:31 2011 -0800 > > sysfs: Kill nlink counting. > > > 1) network systemd rule doesn't start network > 2) sensors complain: > sensors_init: Kernel interface error > > ad 2) look at what it does: > /* returns !0 if sysfs filesystem was found, 0 otherwise */ > int sensors_init_sysfs(void) > { > struct stat statbuf; > > snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys"); > if (stat(sensors_sysfs_mount, &statbuf) < 0 > > || statbuf.st_nlink <= 2) /* Empty directory */ > > return 0; > > return 1; > } > > > So this looks like it became a part of ABI we cannot break... > > A revert of this commit on the top of today's -next fixes the problem. > > thanks, I created a Bugzilla entry at https://bugzilla.kernel.org/show_bug.cgi?id=42712 for your bug report, please add your address to the CC list in there, thanks! -- Maciej Rutecki http://www.mrutecki.pl ^ permalink raw reply [flat|nested] 75+ messages in thread
end of thread, other threads:[~2012-03-09 3:41 UTC | newest] Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby 2012-01-30 22:06 ` Greg KH 2012-01-30 22:10 ` Alan Cox 2012-01-30 22:27 ` Greg KH 2012-01-30 22:43 ` Al Viro 2012-01-30 22:56 ` Al Viro 2012-01-31 1:27 ` Eric W. Biederman 2012-01-31 10:48 ` Jiri Slaby 2012-01-31 12:44 ` Eric W. Biederman 2012-01-31 16:45 ` Linus Torvalds 2012-01-31 19:18 ` Al Viro 2012-02-01 5:06 ` Eric W. Biederman 2012-02-01 22:21 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman 2012-02-01 22:24 ` Greg Kroah-Hartman 2012-02-01 22:44 ` Eric W. Biederman 2012-02-01 22:49 ` Greg Kroah-Hartman 2012-02-01 22:31 ` Dave Jones 2012-02-01 22:35 ` Jiri Slaby 2012-02-01 23:15 ` Linus Torvalds 2012-02-01 23:18 ` Linus Torvalds 2012-02-02 1:22 ` Al Viro 2012-02-02 21:24 ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro 2012-02-02 23:46 ` Linus Torvalds 2012-02-03 1:16 ` Al Viro 2012-02-03 1:45 ` Al Viro 2012-02-03 2:00 ` Linus Torvalds 2012-02-03 14:57 ` Chris Mason 2012-02-03 17:08 ` Al Viro 2012-02-03 19:34 ` Artem Bityutskiy 2012-02-06 8:50 ` Artem Bityutskiy 2012-02-06 13:56 ` Al Viro 2012-02-06 17:05 ` Artem Bityutskiy 2012-02-06 17:11 ` Al Viro 2012-02-07 7:21 ` Artem Bityutskiy 2012-02-06 22:49 ` Dave Chinner 2012-02-03 8:25 ` Andreas Dilger 2012-02-03 17:03 ` Al Viro 2012-02-04 7:42 ` Andreas Dilger 2012-03-05 13:30 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby 2012-03-05 16:09 ` Greg Kroah-Hartman 2012-03-05 16:47 ` Linus Torvalds 2012-03-08 21:05 ` Greg Kroah-Hartman 2012-03-08 22:18 ` Eric W. Biederman 2012-03-08 23:40 ` Linus Torvalds 2012-03-08 21:28 ` Eric W. Biederman 2012-03-08 21:34 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman 2012-03-08 21:36 ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman 2012-03-08 21:37 ` [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead Eric W. Biederman 2012-03-09 3:40 ` Linus Torvalds 2012-03-08 22:28 ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman 2012-03-09 2:49 ` Eric W. Biederman 2012-01-31 3:45 ` sysfs regression: wrong link counts Eric W. Biederman 2012-01-31 11:54 ` Alan Cox 2012-01-30 22:52 ` Kay Sievers 2012-01-31 10:41 ` network regression: cannot rename netdev twice Jiri Slaby 2012-01-31 10:52 ` Kay Sievers 2012-01-31 11:00 ` Jiri Slaby 2012-01-31 11:13 ` Kay Sievers 2012-01-31 11:17 ` Jiri Slaby 2012-01-31 11:58 ` Kay Sievers 2012-01-31 14:18 ` Eric W. Biederman 2012-01-31 14:40 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman 2012-01-31 14:41 ` Jiri Slaby 2012-01-31 14:55 ` Greg KH 2012-02-04 2:14 ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh 2012-02-06 20:03 ` Kay Sievers 2012-02-08 2:00 ` Henrique de Moraes Holschuh 2012-02-08 3:50 ` Kay Sievers 2012-02-08 6:42 ` Valdis.Kletnieks 2012-02-08 10:57 ` Kay Sievers 2012-02-08 20:06 ` Valdis.Kletnieks 2012-02-08 20:27 ` Stephen Hemminger 2012-02-08 23:48 ` Kay Sievers 2012-01-31 1:32 ` sysfs regression: wrong link counts Eric W. Biederman 2012-02-01 18:29 ` Maciej Rutecki
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.