linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* sysfs regression: wrong link counts
@ 2012-01-30 21:56 Jiri Slaby
  2012-01-30 22:06 ` Greg KH
                   ` (3 more replies)
  0 siblings, 4 replies; 75+ messages in thread
From: Jiri Slaby @ 2012-01-30 21:56 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Greg KH, LKML

Hi,

I cannot boot properly with this commit:
commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Sun Dec 18 20:09:31 2011 -0800

    sysfs: Kill nlink counting.


1) network systemd rule doesn't start network
2) sensors complain:
   sensors_init: Kernel interface error

ad 2) look at what it does:
/* returns !0 if sysfs filesystem was found, 0 otherwise */
int sensors_init_sysfs(void)
{
        struct stat statbuf;

        snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys");
        if (stat(sensors_sysfs_mount, &statbuf) < 0
         || statbuf.st_nlink <= 2)      /* Empty directory */
                return 0;

        return 1;
}


So this looks like it became a part of ABI we cannot break...

A revert of this commit on the top of today's -next fixes the problem.

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby
@ 2012-01-30 22:06 ` Greg KH
  2012-01-30 22:10   ` Alan Cox
  2012-01-30 22:52 ` Kay Sievers
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 75+ messages in thread
From: Greg KH @ 2012-01-30 22:06 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Eric W. Biederman, LKML, systemd-devel

On Mon, Jan 30, 2012 at 10:56:26PM +0100, Jiri Slaby wrote:
> Hi,
> 
> I cannot boot properly with this commit:
> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Sun Dec 18 20:09:31 2011 -0800
> 
>     sysfs: Kill nlink counting.
> 
> 
> 1) network systemd rule doesn't start network
> 2) sensors complain:
>    sensors_init: Kernel interface error

Odd.

What in systemd is causing a reliance on this?

> ad 2) look at what it does:
> /* returns !0 if sysfs filesystem was found, 0 otherwise */
> int sensors_init_sysfs(void)
> {
>         struct stat statbuf;
> 
>         snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys");
>         if (stat(sensors_sysfs_mount, &statbuf) < 0
>          || statbuf.st_nlink <= 2)      /* Empty directory */
>                 return 0;
> 
>         return 1;
> }

Ah, a hack to see if the directory is empty.

Isn't there some other "proper" way of doing this in userspace, or is
this really the correct way?

> So this looks like it became a part of ABI we cannot break...
> 
> A revert of this commit on the top of today's -next fixes the problem.

Ick.

Eric, care to fix this up, or do you want me to revert it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 22:06 ` Greg KH
@ 2012-01-30 22:10   ` Alan Cox
  2012-01-30 22:27     ` Greg KH
  2012-01-31  3:45     ` sysfs regression: wrong link counts Eric W. Biederman
  0 siblings, 2 replies; 75+ messages in thread
From: Alan Cox @ 2012-01-30 22:10 UTC (permalink / raw)
  To: Greg KH; +Cc: Jiri Slaby, Eric W. Biederman, LKML, systemd-devel

> Isn't there some other "proper" way of doing this in userspace, or is
> this really the correct way?

You can look at the S_IFMT bits and stuff however link count indicating
number of subdirectories is a standard Unix thing and used by many quite
mundane tools as an optimisation.

Alan

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 22:10   ` Alan Cox
@ 2012-01-30 22:27     ` Greg KH
  2012-01-30 22:43       ` Al Viro
  2012-01-31  1:27       ` Eric W. Biederman
  2012-01-31  3:45     ` sysfs regression: wrong link counts Eric W. Biederman
  1 sibling, 2 replies; 75+ messages in thread
From: Greg KH @ 2012-01-30 22:27 UTC (permalink / raw)
  To: Alan Cox, Eric W. Biederman; +Cc: Jiri Slaby, LKML, systemd-devel

On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote:
> > Isn't there some other "proper" way of doing this in userspace, or is
> > this really the correct way?
> 
> You can look at the S_IFMT bits and stuff however link count indicating
> number of subdirectories is a standard Unix thing and used by many quite
> mundane tools as an optimisation.

Ah, yeah, that is easier.

Eric, care to fix this or want me to revert it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 22:27     ` Greg KH
@ 2012-01-30 22:43       ` Al Viro
  2012-01-30 22:56         ` Al Viro
  2012-01-31  1:27       ` Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Al Viro @ 2012-01-30 22:43 UTC (permalink / raw)
  To: Greg KH; +Cc: Alan Cox, Eric W. Biederman, Jiri Slaby, LKML, systemd-devel

On Mon, Jan 30, 2012 at 02:27:17PM -0800, Greg KH wrote:
> On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote:
> > > Isn't there some other "proper" way of doing this in userspace, or is
> > > this really the correct way?
> > 
> > You can look at the S_IFMT bits and stuff however link count indicating
> > number of subdirectories is a standard Unix thing and used by many quite
> > mundane tools as an optimisation.
> 
> Ah, yeah, that is easier.
> 
> Eric, care to fix this or want me to revert it?

Fix _what_?  Userland shite quoted upthread?  Because that's where the bug
is - the mundane tools mentioned by Alan treat 1 in st_nlink as "no
information about the number of subdirectories".  And shite might be too
mild a term for the little gem in question, really...

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby
  2012-01-30 22:06 ` Greg KH
@ 2012-01-30 22:52 ` Kay Sievers
  2012-01-31 10:41   ` network regression: cannot rename netdev twice Jiri Slaby
  2012-01-31  1:32 ` sysfs regression: wrong link counts Eric W. Biederman
  2012-02-01 18:29 ` Maciej Rutecki
  3 siblings, 1 reply; 75+ messages in thread
From: Kay Sievers @ 2012-01-30 22:52 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML

2012/1/30 Jiri Slaby <jslaby@suse.cz>:
> I cannot boot properly with this commit:
> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Sun Dec 18 20:09:31 2011 -0800
>
>    sysfs: Kill nlink counting.
>
>
> 1) network systemd rule doesn't start network

What does that mean? What's a network systemd rule?

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 22:43       ` Al Viro
@ 2012-01-30 22:56         ` Al Viro
  0 siblings, 0 replies; 75+ messages in thread
From: Al Viro @ 2012-01-30 22:56 UTC (permalink / raw)
  To: Greg KH; +Cc: Alan Cox, Eric W. Biederman, Jiri Slaby, LKML, systemd-devel

On Mon, Jan 30, 2012 at 10:43:50PM +0000, Al Viro wrote:
> On Mon, Jan 30, 2012 at 02:27:17PM -0800, Greg KH wrote:
> > On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote:
> > > > Isn't there some other "proper" way of doing this in userspace, or is
> > > > this really the correct way?
> > > 
> > > You can look at the S_IFMT bits and stuff however link count indicating
> > > number of subdirectories is a standard Unix thing and used by many quite
> > > mundane tools as an optimisation.
> > 
> > Ah, yeah, that is easier.
> > 
> > Eric, care to fix this or want me to revert it?
> 
> Fix _what_?  Userland shite quoted upthread?  Because that's where the bug
> is - the mundane tools mentioned by Alan treat 1 in st_nlink as "no
> information about the number of subdirectories".  And shite might be too
> mild a term for the little gem in question, really...

To repeat this piece of bogosity for those who might've missed it:

/* returns !0 if sysfs filesystem was found, 0 otherwise */
int sensors_init_sysfs(void)
{
        struct stat statbuf;

        snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys");
        if (stat(sensors_sysfs_mount, &statbuf) < 0
         || statbuf.st_nlink <= 2)      /* Empty directory */
                return 0;

        return 1;
}

which is completely bogus - contrary to what it says in comments, it does
*not* check anything about sysfs (or directories being empty).  Checking
that sysfs is mounted on /sys could be done by statfs(2) and checking
->f_type, or, considering what the code in lib/sysfs.c is doing, just
checking that /sys/class is readable and failing otherwise.

Sigh...

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 22:27     ` Greg KH
  2012-01-30 22:43       ` Al Viro
@ 2012-01-31  1:27       ` Eric W. Biederman
  2012-01-31 10:48         ` Jiri Slaby
  1 sibling, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-01-31  1:27 UTC (permalink / raw)
  To: Greg KH; +Cc: Alan Cox, Jiri Slaby, LKML, systemd-devel

Greg KH <greg@kroah.com> writes:

> On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote:
>> > Isn't there some other "proper" way of doing this in userspace, or is
>> > this really the correct way?
>> 
>> You can look at the S_IFMT bits and stuff however link count indicating
>> number of subdirectories is a standard Unix thing and used by many quite
>> mundane tools as an optimisation.
>
> Ah, yeah, that is easier.
>
> Eric, care to fix this or want me to revert it?

With respect to sensors the conversation has already been had, and I fix
is already queued and should come out in the sensors release due out in
a week or so.  So sensors should be fixed before this code is merged
into Linus's kernel.

Now why something minor like sensors that nothing seems to depend
strongly on should stop a system from booting is a huge question.

So I am going to have the conversation to triage what is up with
systemd.  That seems totally unexpected.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby
  2012-01-30 22:06 ` Greg KH
  2012-01-30 22:52 ` Kay Sievers
@ 2012-01-31  1:32 ` Eric W. Biederman
  2012-02-01 18:29 ` Maciej Rutecki
  3 siblings, 0 replies; 75+ messages in thread
From: Eric W. Biederman @ 2012-01-31  1:32 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Greg KH, LKML

Jiri Slaby <jslaby@suse.cz> writes:

> 1) network systemd rule doesn't start network

What silly thing is the network systemd rule doing?

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 22:10   ` Alan Cox
  2012-01-30 22:27     ` Greg KH
@ 2012-01-31  3:45     ` Eric W. Biederman
  2012-01-31 11:54       ` Alan Cox
  1 sibling, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-01-31  3:45 UTC (permalink / raw)
  To: Alan Cox; +Cc: Greg KH, Jiri Slaby, LKML, systemd-devel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

>> Isn't there some other "proper" way of doing this in userspace, or is
>> this really the correct way?
>
> You can look at the S_IFMT bits and stuff however link count indicating
> number of subdirectories is a standard Unix thing and used by many quite
> mundane tools as an optimisation.

Those tools for a decade or better all know to treat nlink == 1 as the
case where the optimization does not apply.

On a traditional unix filesytem the most common place you will see nlink
== 1 is when the link count overflows.  extN with > 65536 subdirectories
as I recall.


Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* network regression: cannot rename netdev twice
  2012-01-30 22:52 ` Kay Sievers
@ 2012-01-31 10:41   ` Jiri Slaby
  2012-01-31 10:52     ` Kay Sievers
  0 siblings, 1 reply; 75+ messages in thread
From: Jiri Slaby @ 2012-01-31 10:41 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev

On 01/30/2012 11:52 PM, Kay Sievers wrote:
> 2012/1/30 Jiri Slaby <jslaby@suse.cz>:
>> I cannot boot properly with this commit:
>> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
>> Author: Eric W. Biederman <ebiederm@xmission.com>
>> Date:   Sun Dec 18 20:09:31 2011 -0800
>>
>>    sysfs: Kill nlink counting.
>>
>>
>> 1) network systemd rule doesn't start network
> 
> What does that mean? What's a network systemd rule?

Oh, perhaps you call it a service file, not rule file?

Anyway this is a different bug. Revert of the patch above does not help.

The bug lays in the network layer. udev is unable to perform persistent
eth naming:
# ip link set eth0 name eth1    -- this one is OK
# ip link set eth1 name eth0
RTNETLINK answers: No such file or directory

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-31  1:27       ` Eric W. Biederman
@ 2012-01-31 10:48         ` Jiri Slaby
  2012-01-31 12:44           ` Eric W. Biederman
  0 siblings, 1 reply; 75+ messages in thread
From: Jiri Slaby @ 2012-01-31 10:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg KH, Alan Cox, LKML, systemd-devel, Linus Torvalds, Al Viro

On 01/31/2012 02:27 AM, Eric W. Biederman wrote:
> Greg KH <greg@kroah.com> writes:
> 
>> On Mon, Jan 30, 2012 at 10:10:59PM +0000, Alan Cox wrote:
>>>> Isn't there some other "proper" way of doing this in userspace, or is
>>>> this really the correct way?
>>>
>>> You can look at the S_IFMT bits and stuff however link count indicating
>>> number of subdirectories is a standard Unix thing and used by many quite
>>> mundane tools as an optimisation.
>>
>> Ah, yeah, that is easier.
>>
>> Eric, care to fix this or want me to revert it?
> 
> With respect to sensors the conversation has already been had, and I fix
> is already queued and should come out in the sensors release due out in
> a week or so.  So sensors should be fixed before this code is merged
> into Linus's kernel.

Oh, we are not going to break userspace with 3.3, are we? I understand
that what sensors do is nothing but sh*t. But this change should wait
until everybody has a chance to have fixed sensors package in their
distribution.

To be clear, what I'm suggesting is to postpone the change and schedule
it for something like 3.7.

> Now why something minor like sensors that nothing seems to depend
> strongly on should stop a system from booting is a huge question.

sensors do not prevent boot at all, of course. The other
cannot-start-network bug does.

> So I am going to have the conversation to triage what is up with
> systemd.  That seems totally unexpected.

Looks like netdev renaming problem, see my other mail.

thanks,
-- 
js
suse labs
.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-01-31 10:41   ` network regression: cannot rename netdev twice Jiri Slaby
@ 2012-01-31 10:52     ` Kay Sievers
  2012-01-31 11:00       ` Jiri Slaby
  2012-02-04  2:14       ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh
  0 siblings, 2 replies; 75+ messages in thread
From: Kay Sievers @ 2012-01-31 10:52 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev

On Tue, Jan 31, 2012 at 11:41, Jiri Slaby <jslaby@suse.cz> wrote:
> On 01/30/2012 11:52 PM, Kay Sievers wrote:
>> 2012/1/30 Jiri Slaby <jslaby@suse.cz>:
>>> I cannot boot properly with this commit:
>>> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>> Date:   Sun Dec 18 20:09:31 2011 -0800
>>>
>>>    sysfs: Kill nlink counting.
>>>
>>> 1) network systemd rule doesn't start network
>>
>> What does that mean? What's a network systemd rule?
>
> Oh, perhaps you call it a service file, not rule file?
>
> Anyway this is a different bug. Revert of the patch above does not help.

Ok, fine. I checked too, and systemd does not play any silly games
with link counts.

> The bug lays in the network layer. udev is unable to perform persistent
> eth naming:
> # ip link set eth0 name eth1    -- this one is OK
> # ip link set eth1 name eth0
> RTNETLINK answers: No such file or directory

Please make sure nothing tries to swap netif names in userspace. We
have given up that approach, because it is far too fragile to
temporary rename devices to be able to swap the names, and race
against the loading of new kernel network drivers at the same time.

This might be a new kernel problem here, but in general that approach
is just broken, we have have given up fiddling around here. Udev does
not do that anymore, and also the code that currently *can* be used to
do this, will be removed from udev in the future.

Network devices can only be renamed to a namespace that isn't ethX,
and which does not race against kernel names.

Does is work, if you rename the devices to something else than ethX?

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-01-31 10:52     ` Kay Sievers
@ 2012-01-31 11:00       ` Jiri Slaby
  2012-01-31 11:13         ` Kay Sievers
  2012-02-04  2:14       ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh
  1 sibling, 1 reply; 75+ messages in thread
From: Jiri Slaby @ 2012-01-31 11:00 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev

On 01/31/2012 11:52 AM, Kay Sievers wrote:
> On Tue, Jan 31, 2012 at 11:41, Jiri Slaby <jslaby@suse.cz> wrote:
>> On 01/30/2012 11:52 PM, Kay Sievers wrote:
>>> 2012/1/30 Jiri Slaby <jslaby@suse.cz>:
>>>> I cannot boot properly with this commit:
>>>> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date:   Sun Dec 18 20:09:31 2011 -0800
>>>>
>>>>    sysfs: Kill nlink counting.
>>>>
>>>> 1) network systemd rule doesn't start network
>>>
>>> What does that mean? What's a network systemd rule?
>>
>> Oh, perhaps you call it a service file, not rule file?
>>
>> Anyway this is a different bug. Revert of the patch above does not help.
> 
> Ok, fine. I checked too, and systemd does not play any silly games
> with link counts.
> 
>> The bug lays in the network layer. udev is unable to perform persistent
>> eth naming:
>> # ip link set eth0 name eth1    -- this one is OK
>> # ip link set eth1 name eth0
>> RTNETLINK answers: No such file or directory
> 
> Please make sure nothing tries to swap netif names in userspace. We
> have given up that approach, because it is far too fragile to
> temporary rename devices to be able to swap the names, and race
> against the loading of new kernel network drivers at the same time.
> 
> This might be a new kernel problem here, but in general that approach
> is just broken, we have have given up fiddling around here. Udev does
> not do that anymore, and also the code that currently *can* be used to
> do this, will be removed from udev in the future.
> 
> Network devices can only be renamed to a namespace that isn't ethX,
> and which does not race against kernel names.

I have two eth interfaces. The one on the motherboard is named eth0, an
added PCI card is eth1. But kernel enumerates them in the opposite order.

So udev does this sequence:
eth1 -> rename3
eth0 -> eth1
rename3 -> eth0

How it can do it differently? (This is openSUSE factory.)

> Does is work, if you rename the devices to something else than ethX?

Negative:
# ip link set eth0 name krtek
# ip link set krtek name jezek
RTNETLINK answers: No such file or directory

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-01-31 11:00       ` Jiri Slaby
@ 2012-01-31 11:13         ` Kay Sievers
  2012-01-31 11:17           ` Jiri Slaby
  0 siblings, 1 reply; 75+ messages in thread
From: Kay Sievers @ 2012-01-31 11:13 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev

On Tue, Jan 31, 2012 at 12:00, Jiri Slaby <jslaby@suse.cz> wrote:
> On 01/31/2012 11:52 AM, Kay Sievers wrote:
>> On Tue, Jan 31, 2012 at 11:41, Jiri Slaby <jslaby@suse.cz> wrote:
>>> On 01/30/2012 11:52 PM, Kay Sievers wrote:
>>>> 2012/1/30 Jiri Slaby <jslaby@suse.cz>:
>>>>> I cannot boot properly with this commit:
>>>>> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
>>>>> Author: Eric W. Biederman <ebiederm@xmission.com>
>>>>> Date:   Sun Dec 18 20:09:31 2011 -0800
>>>>>
>>>>>    sysfs: Kill nlink counting.
>>>>>
>>>>> 1) network systemd rule doesn't start network
>>>>
>>>> What does that mean? What's a network systemd rule?
>>>
>>> Oh, perhaps you call it a service file, not rule file?
>>>
>>> Anyway this is a different bug. Revert of the patch above does not help.
>>
>> Ok, fine. I checked too, and systemd does not play any silly games
>> with link counts.
>>
>>> The bug lays in the network layer. udev is unable to perform persistent
>>> eth naming:
>>> # ip link set eth0 name eth1    -- this one is OK
>>> # ip link set eth1 name eth0
>>> RTNETLINK answers: No such file or directory
>>
>> Please make sure nothing tries to swap netif names in userspace. We
>> have given up that approach, because it is far too fragile to
>> temporary rename devices to be able to swap the names, and race
>> against the loading of new kernel network drivers at the same time.
>>
>> This might be a new kernel problem here, but in general that approach
>> is just broken, we have have given up fiddling around here. Udev does
>> not do that anymore, and also the code that currently *can* be used to
>> do this, will be removed from udev in the future.
>>
>> Network devices can only be renamed to a namespace that isn't ethX,
>> and which does not race against kernel names.
>
> I have two eth interfaces. The one on the motherboard is named eth0, an
> added PCI card is eth1. But kernel enumerates them in the opposite order.
>
> So udev does this sequence:
> eth1 -> rename3
> eth0 -> eth1
> rename3 -> eth0
>
> How it can do it differently? (This is openSUSE factory.)

A future udev will not help you doing that. We have given up
supporting this approach. Renaming is done during booting, at the same
time we load new kernel drivers, and all breaks in non-interesting
ways. Apart from all the other unsolvable problems with this model.

Pretending we are able to rename netif names in the same namespace the
kernel is allocating new names is just plain wrong. There are races
you can't control. The entire approach creates far more problems than
it solves. We just have to admit it was wrong to do that.
Custom/to-rename netif names can just not be ethX.

>> Does is work, if you rename the devices to something else than ethX?
>
> Negative:
> # ip link set eth0 name krtek
> # ip link set krtek name jezek
> RTNETLINK answers: No such file or directory

This is a command sequence you type manually?

You are sure that userspace is not working in the background,
triggered by uevents, and comes into your way here?

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-01-31 11:13         ` Kay Sievers
@ 2012-01-31 11:17           ` Jiri Slaby
  2012-01-31 11:58             ` Kay Sievers
  0 siblings, 1 reply; 75+ messages in thread
From: Jiri Slaby @ 2012-01-31 11:17 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev

On 01/31/2012 12:13 PM, Kay Sievers wrote:
>>> Does is work, if you rename the devices to something else than ethX?
>>
>> Negative:
>> # ip link set eth0 name krtek
>> # ip link set krtek name jezek
>> RTNETLINK answers: No such file or directory
> 
> This is a command sequence you type manually?

Yea, and it is working with 3.3.0-rc1-next-20120124_64+. Not with
3.3.0-rc1-next-20120131_64+.

> You are sure that userspace is not working in the background,
> triggered by uevents, and comes into your way here?

Note that krtek exists after the first command. But cannot be renamed
further.

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-31  3:45     ` sysfs regression: wrong link counts Eric W. Biederman
@ 2012-01-31 11:54       ` Alan Cox
  0 siblings, 0 replies; 75+ messages in thread
From: Alan Cox @ 2012-01-31 11:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Greg KH, Jiri Slaby, LKML, systemd-devel

On Mon, 30 Jan 2012 19:45:10 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> 
> >> Isn't there some other "proper" way of doing this in userspace, or is
> >> this really the correct way?
> >
> > You can look at the S_IFMT bits and stuff however link count indicating
> > number of subdirectories is a standard Unix thing and used by many quite
> > mundane tools as an optimisation.
> 
> Those tools for a decade or better all know to treat nlink == 1 as the
> case where the optimization does not apply.

Most of the do. but as has been demonstrated here - not all

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-01-31 11:17           ` Jiri Slaby
@ 2012-01-31 11:58             ` Kay Sievers
  2012-01-31 14:18               ` Eric W. Biederman
  0 siblings, 1 reply; 75+ messages in thread
From: Kay Sievers @ 2012-01-31 11:58 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML, ML netdev

On Tue, Jan 31, 2012 at 12:17, Jiri Slaby <jslaby@suse.cz> wrote:
> On 01/31/2012 12:13 PM, Kay Sievers wrote:

>> This is a command sequence you type manually?
>
> Yea, and it is working with 3.3.0-rc1-next-20120124_64+. Not with
> 3.3.0-rc1-next-20120131_64+.
>
>> You are sure that userspace is not working in the background,
>> triggered by uevents, and comes into your way here?
>
> Note that krtek exists after the first command. But cannot be renamed
> further.

Yeah, I can confirm the problem here. I works fine with earlier
kernels and fails with the latest -next:

# uname -r
3.3.0-rc1-next-20120131+
# modprobe dummy
# ip link set dummy0 name foo0
# ip link set foo0 name bar0
RTNETLINK answers: No such file or directory

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-31 10:48         ` Jiri Slaby
@ 2012-01-31 12:44           ` Eric W. Biederman
  2012-01-31 16:45             ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-01-31 12:44 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Greg KH, Alan Cox, LKML, systemd-devel, Linus Torvalds, Al Viro

Jiri Slaby <jslaby@suse.cz> writes:

> Oh, we are not going to break userspace with 3.3, are we? 

No.  This code is currently scheduled for 3.4.

> I understand
> that what sensors do is nothing but sh*t. But this change should wait
> until everybody has a chance to have fixed sensors package in their
> distribution.
>
> To be clear, what I'm suggesting is to postpone the change and schedule
> it for something like 3.7.

The sensors update with the fix is scheduled for about a week out, well
before 3.3 ships.

The code in next for 3.4 is perhaps 4-6 months out from a kernel
release.  That seems like plenty of time for distros to get an update
package for sensors together.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-01-31 11:58             ` Kay Sievers
@ 2012-01-31 14:18               ` Eric W. Biederman
  2012-01-31 14:40                 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-01-31 14:18 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Jiri Slaby, Greg KH, LKML, ML netdev

Kay Sievers <kay.sievers@vrfy.org> writes:

> On Tue, Jan 31, 2012 at 12:17, Jiri Slaby <jslaby@suse.cz> wrote:
>> On 01/31/2012 12:13 PM, Kay Sievers wrote:
>
>>> This is a command sequence you type manually?
>>
>> Yea, and it is working with 3.3.0-rc1-next-20120124_64+. Not with
>> 3.3.0-rc1-next-20120131_64+.
>>
>>> You are sure that userspace is not working in the background,
>>> triggered by uevents, and comes into your way here?
>>
>> Note that krtek exists after the first command. But cannot be renamed
>> further.
>
> Yeah, I can confirm the problem here. I works fine with earlier
> kernels and fails with the latest -next:
>
> # uname -r
> 3.3.0-rc1-next-20120131+
> # modprobe dummy
> # ip link set dummy0 name foo0
> # ip link set foo0 name bar0
> RTNETLINK answers: No such file or directory

There is something weird going on when sysfs directories and symlinks
are renamed.  My guess is that I fat fingered something with one of my
last sysfs patches.  I will look more deeply once I have slept some
more.

The second network device renames fails because the first rename did not
work properly.  ls -l /sys/class/net/ /sys/virtual/net/ will let you see
what I mean.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH] sysfs:  Update the name hash when renaming sysfs entries
  2012-01-31 14:18               ` Eric W. Biederman
@ 2012-01-31 14:40                 ` Eric W. Biederman
  2012-01-31 14:41                   ` Jiri Slaby
  2012-01-31 14:55                   ` Greg KH
  0 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2012-01-31 14:40 UTC (permalink / raw)
  To: Greg KH, Greg Kroah-Hartman; +Cc: Jiri Slaby, LKML, ML netdev, Kay Sievers


This fixes a bug introduced with sysfs name hashes where renaming a
network device appears to succeed but silently makes the sysfs files for
that network device inaccessible.

In at least one configuration this bug has stopped networking from
coming up during boot.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/sysfs/dir.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index ea64d01..dd3779c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -872,6 +872,7 @@ int sysfs_rename(struct sysfs_dirent *sd,
 
 		dup_name = sd->s_name;
 		sd->s_name = new_name;
+		sd->s_hash = sysfs_name_hash(sd->s_ns, sd->s_name);
 	}
 
 	/* Move to the appropriate place in the appropriate directories rbtree. */
-- 
1.7.2.5



^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs:  Update the name hash when renaming sysfs entries
  2012-01-31 14:40                 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman
@ 2012-01-31 14:41                   ` Jiri Slaby
  2012-01-31 14:55                   ` Greg KH
  1 sibling, 0 replies; 75+ messages in thread
From: Jiri Slaby @ 2012-01-31 14:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg KH, Greg Kroah-Hartman, LKML, ML netdev, Kay Sievers

On 01/31/2012 03:40 PM, Eric W. Biederman wrote:
> 
> This fixes a bug introduced with sysfs name hashes where renaming a
> network device appears to succeed but silently makes the sysfs files for
> that network device inaccessible.
> 
> In at least one configuration this bug has stopped networking from
> coming up during boot.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

It works for me. Thanks.

Tested-by: Jiri Slaby <jslaby@suse.cz>

> ---
>  fs/sysfs/dir.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index ea64d01..dd3779c 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -872,6 +872,7 @@ int sysfs_rename(struct sysfs_dirent *sd,
>  
>  		dup_name = sd->s_name;
>  		sd->s_name = new_name;
> +		sd->s_hash = sysfs_name_hash(sd->s_ns, sd->s_name);
>  	}
>  
>  	/* Move to the appropriate place in the appropriate directories rbtree. */


-- 
js
suse labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs:  Update the name hash when renaming sysfs entries
  2012-01-31 14:40                 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman
  2012-01-31 14:41                   ` Jiri Slaby
@ 2012-01-31 14:55                   ` Greg KH
  1 sibling, 0 replies; 75+ messages in thread
From: Greg KH @ 2012-01-31 14:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Jiri Slaby, LKML, ML netdev, Kay Sievers

On Tue, Jan 31, 2012 at 06:40:26AM -0800, Eric W. Biederman wrote:
> 
> This fixes a bug introduced with sysfs name hashes where renaming a
> network device appears to succeed but silently makes the sysfs files for
> that network device inaccessible.
> 
> In at least one configuration this bug has stopped networking from
> coming up during boot.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  fs/sysfs/dir.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)

Thanks for this, I'll queue it up later today.

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-31 12:44           ` Eric W. Biederman
@ 2012-01-31 16:45             ` Linus Torvalds
  2012-01-31 19:18               ` Al Viro
  2012-02-01  5:06               ` Eric W. Biederman
  0 siblings, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2012-01-31 16:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jiri Slaby, Greg KH, Alan Cox, LKML, systemd-devel, Al Viro

On Tue, Jan 31, 2012 at 4:44 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> The sensors update with the fix is scheduled for about a week out, well
> before 3.3 ships.

That's almost certainly *not* going to help.

Guys, people don't update their user space. Some people run modern
kernels on the enterprise distros - and those can be *years* out of
date.

If people report problems, we revert things. It's that simple. It
really doesn't matter if you call it a user space bug or not - if user
space depends on what we used to do, then we continue to do it.

We can *try* the kernel change, but I can already tell you that
anybody who thinks that "sensors will be fixed by all users by the
time the kernel comes out" is likely totally full of shit. Not to
mention that it apparently *already* causes problems for people who
want to test -next, and this breaks those peoples setup, and thus
means that -next gets less testing.

Really. "It's a bug in user space" is *NOT* an excuse for breaking
things. Never was. Never is. There are (other) excuses for breaking
things, but "user space did something I didn't expect it to do, and I
consider it a bug" is absolutely not one of them.

                       Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-31 16:45             ` Linus Torvalds
@ 2012-01-31 19:18               ` Al Viro
  2012-02-01  5:06               ` Eric W. Biederman
  1 sibling, 0 replies; 75+ messages in thread
From: Al Viro @ 2012-01-31 19:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Jiri Slaby, Greg KH, Alan Cox, LKML, systemd-devel

On Tue, Jan 31, 2012 at 08:45:58AM -0800, Linus Torvalds wrote:
> We can *try* the kernel change, but I can already tell you that
> anybody who thinks that "sensors will be fixed by all users by the
> time the kernel comes out" is likely totally full of shit. Not to
> mention that it apparently *already* causes problems for people who
> want to test -next, and this breaks those peoples setup, and thus
> means that -next gets less testing.
> 
> Really. "It's a bug in user space" is *NOT* an excuse for breaking
> things. Never was. Never is. There are (other) excuses for breaking
> things, but "user space did something I didn't expect it to do, and I
> consider it a bug" is absolutely not one of them.

It's actually "user space did something utterly weird that happened not
to break with the current behaviour by accident", and yes, it's better
to revert that change for the reasons you've described.  But it's
definitely a userland bug - as in "code doesn't do what it intends to
do" and no matter what we do kernel-side it ought to be fixed in
lm-sensors.  Several years later it hopefully will become a non-issue
(and we'll need to document the conflict with lm-sensors earlier than
$FIXED_VERSION in Documentation/Changes when that sysfs patch finally
goes in).

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-31 16:45             ` Linus Torvalds
  2012-01-31 19:18               ` Al Viro
@ 2012-02-01  5:06               ` Eric W. Biederman
  2012-02-01 22:21                 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-02-01  5:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jiri Slaby, Greg KH, Alan Cox, LKML, systemd-devel, Al Viro

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jan 31, 2012 at 4:44 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> The sensors update with the fix is scheduled for about a week out, well
>> before 3.3 ships.
>
> That's almost certainly *not* going to help.
>
> Guys, people don't update their user space. Some people run modern
> kernels on the enterprise distros - and those can be *years* out of
> date.
>
> If people report problems, we revert things. It's that simple. It
> really doesn't matter if you call it a user space bug or not - if user
> space depends on what we used to do, then we continue to do it.

> We can *try* the kernel change, but I can already tell you that
> anybody who thinks that "sensors will be fixed by all users by the
> time the kernel comes out" is likely totally full of shit. Not to
> mention that it apparently *already* causes problems for people who
> want to test -next, and this breaks those peoples setup, and thus
> means that -next gets less testing.

By and large I am in agreement with you, and this patch was designed to
be very focused in case we ran into userspace that had problems to make
it easy to revert.  The first question I asked when this was reported is
how badly do things break.  So I could tell if things needed to be
fixed.

My current understanding is that it is exactly sensors that breaks.
One program that complains loudly and doesn't function.

To the best of my knowledge nothing else fails, or depends on sensors.

I would like to have the code in -next a little longer to see if
anything else breaks.

> Really. "It's a bug in user space" is *NOT* an excuse for breaking
> things. Never was. Never is. There are (other) excuses for breaking
> things, but "user space did something I didn't expect it to do, and I
> consider it a bug" is absolutely not one of them.

Part of the reason for the patch is it is a partial regression fix for
sysfs using more memory in 3.3 and 3.2 than in previous kernels.
struct sysfs_dirent  is nowhere near as critical as struct page but the
same kind of small in structure size add up.

Thinking about it I should be able to whip up a tiny patch that adds
a Kconfig option.  Support for programs that are don't interpret nlinks
correctly.

That should allow people who want the size reduction to have a smaller
sysfs and it should allow people who are conservative to guarantee that
the code works, and it should let me find out if anything else out there
actually cares.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: sysfs regression: wrong link counts
  2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby
                   ` (2 preceding siblings ...)
  2012-01-31  1:32 ` sysfs regression: wrong link counts Eric W. Biederman
@ 2012-02-01 18:29 ` Maciej Rutecki
  3 siblings, 0 replies; 75+ messages in thread
From: Maciej Rutecki @ 2012-02-01 18:29 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Eric W. Biederman, Greg KH, LKML

On poniedziałek, 30 stycznia 2012 o 22:56:26 Jiri Slaby wrote:
> Hi,
> 
> I cannot boot properly with this commit:
> commit 524b6c5b39b931311dfe5a2f5abae2f5c9731676
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Sun Dec 18 20:09:31 2011 -0800
> 
>     sysfs: Kill nlink counting.
> 
> 
> 1) network systemd rule doesn't start network
> 2) sensors complain:
>    sensors_init: Kernel interface error
> 
> ad 2) look at what it does:
> /* returns !0 if sysfs filesystem was found, 0 otherwise */
> int sensors_init_sysfs(void)
> {
>         struct stat statbuf;
> 
>         snprintf(sensors_sysfs_mount, NAME_MAX, "%s", "/sys");
>         if (stat(sensors_sysfs_mount, &statbuf) < 0
> 
>          || statbuf.st_nlink <= 2)      /* Empty directory */
> 
>                 return 0;
> 
>         return 1;
> }
> 
> 
> So this looks like it became a part of ABI we cannot break...
> 
> A revert of this commit on the top of today's -next fixes the problem.
> 
> thanks,

I created a Bugzilla entry at 
https://bugzilla.kernel.org/show_bug.cgi?id=42712
for your bug report, please add your address to the CC list in there, thanks!
-- 
Maciej Rutecki
http://www.mrutecki.pl

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH] sysfs:  Optionally count subdirectories to support buggy applications
  2012-02-01  5:06               ` Eric W. Biederman
@ 2012-02-01 22:21                 ` Eric W. Biederman
  2012-02-01 22:24                   ` Greg Kroah-Hartman
                                     ` (3 more replies)
  0 siblings, 4 replies; 75+ messages in thread
From: Eric W. Biederman @ 2012-02-01 22:21 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro, Linus Torvalds,
	Maciej Rutecki


lm_sensors and possibly other applications get confused if all sysfs
directories return nlink == 1.  The lm_sensors code that got confused
was just wrong and a fixed version of lm_sensors should be released
shortly.

There may be other applications that have problems with sysfs return
nlink == 1 for directories.  To allow people to continue to use old
versions of userspace with new kernels add to sysfs a compile time
option to maintain mostly precise directory counts for those people who
don't mind the cost.

I have moved where we keep nlink in sysfs_dirent as compared to previous
versions of subdirectory counting to a location that packs better.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/sysfs/Kconfig |   15 +++++++++++++++
 fs/sysfs/dir.c   |    8 ++++++++
 fs/sysfs/inode.c |    2 ++
 fs/sysfs/sysfs.h |   38 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig
index 8c41fea..9b403e9 100644
--- a/fs/sysfs/Kconfig
+++ b/fs/sysfs/Kconfig
@@ -21,3 +21,18 @@ config SYSFS
 	example, "root=03:01" for /dev/hda1.
 
 	Designers of embedded systems may wish to say N here to conserve space.
+
+config SYSFS_COUNT_LINKS
+	bool "sysfs count subdirectoires to support buggy applications"
+	default n
+	help
+
+	Increase the memory and cpu untilization of sysfs by maintaining
+	by maintaining a count of how hard links from subdirectories a
+	directory has.  This allows sysfs to report the directory nlink
+	as the number of subdirectories plus two.  With out this enabled
+	sysfs will report the nlink of all directories as 1, which is
+	the standard way of indicating that the number of subdirectoires
+	is not tracked.
+
+	Unless you know you need this say N here.
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index dd3779c..f30df7c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -91,6 +91,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd)
 	struct rb_node **node = &sd->s_parent->s_dir.children.rb_node;
 	struct rb_node *parent = NULL;
 
+	if (sysfs_type(sd) == SYSFS_DIR)
+		sysfs_inc_nlink(sd->s_parent);
+
 	while (*node) {
 		struct sysfs_dirent *pos;
 		int result;
@@ -123,6 +126,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd)
  */
 static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
 {
+	if (sysfs_type(sd) == SYSFS_DIR)
+		sysfs_dec_nlink(sd->s_parent);
+
 	rb_erase(&sd->s_rb, &sd->s_parent->s_dir.children);
 }
 
@@ -367,6 +373,8 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 	sd->s_mode = mode;
 	sd->s_flags = type;
 
+	sysfs_init_nlink(sd);
+
 	return sd;
 
  err_out2:
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 4291fd1..782c66a 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -216,6 +216,8 @@ static void sysfs_refresh_inode(struct sysfs_dirent *sd, struct inode *inode)
 					    iattrs->ia_secdata,
 					    iattrs->ia_secdata_len);
 	}
+
+	set_nlink(inode, sysfs_read_nlink(sd));
 }
 
 int sysfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 6289a00..4603506 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -69,6 +69,9 @@ struct sysfs_dirent {
 
 	const void		*s_ns; /* namespace tag */
 	unsigned int		s_hash; /* ns + name hash */
+#ifdef CONFIG_SYSFS_COUNT_LINKS
+	unsigned int		s_nlink;
+#endif
 	union {
 		struct sysfs_elem_dir		s_dir;
 		struct sysfs_elem_symlink	s_symlink;
@@ -127,6 +130,41 @@ do {								\
 #define sysfs_dirent_init_lockdep(sd) do {} while(0)
 #endif
 
+#ifdef CONFIG_SYSFS_COUNT_LINKS
+static inline void sysfs_init_nlink(struct sysfs_dirent *sd)
+{
+	if (sysfs_type(sd) == SYSFS_DIR)
+		sd->s_nlink = 2;
+	else
+		sd->s_nlink = 1;
+}
+static inline void sysfs_inc_nlink(struct sysfs_dirent *sd)
+{
+	sd->s_nlink++;
+}
+static inline void sysfs_dec_nlink(struct sysfs_dirent *sd)
+{
+	sd->s_nlink++;
+}
+static inline unsigned int sysfs_read_nlink(struct sysfs_dirent *sd)
+{
+	return sd->s_nlink;
+}
+#else
+static inline void sysfs_init_nlink(struct sysfs_dirent *sd)
+{
+}
+static inline void sysfs_inc_nlink(struct sysfs_dirent *sd)
+{
+}
+static inline void sysfs_dec_nlink(struct sysfs_dirent *sd)
+{
+}
+static inline unsigned int sysfs_read_nlink(struct sysfs_dirent *sd)
+{
+	return 1;
+}
+#endif
 /*
  * Context structure to be used while adding/removing nodes.
  */
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs:  Optionally count subdirectories to support buggy applications
  2012-02-01 22:21                 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman
@ 2012-02-01 22:24                   ` Greg Kroah-Hartman
  2012-02-01 22:44                     ` Eric W. Biederman
  2012-02-01 22:31                   ` Dave Jones
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 75+ messages in thread
From: Greg Kroah-Hartman @ 2012-02-01 22:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jiri Slaby, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki

On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote:
> 
> lm_sensors and possibly other applications get confused if all sysfs
> directories return nlink == 1.  The lm_sensors code that got confused
> was just wrong and a fixed version of lm_sensors should be released
> shortly.
> 
> There may be other applications that have problems with sysfs return
> nlink == 1 for directories.  To allow people to continue to use old
> versions of userspace with new kernels add to sysfs a compile time
> option to maintain mostly precise directory counts for those people who
> don't mind the cost.
> 
> I have moved where we keep nlink in sysfs_dirent as compared to previous
> versions of subdirectory counting to a location that packs better.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  fs/sysfs/Kconfig |   15 +++++++++++++++
>  fs/sysfs/dir.c   |    8 ++++++++
>  fs/sysfs/inode.c |    2 ++
>  fs/sysfs/sysfs.h |   38 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 63 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig
> index 8c41fea..9b403e9 100644
> --- a/fs/sysfs/Kconfig
> +++ b/fs/sysfs/Kconfig
> @@ -21,3 +21,18 @@ config SYSFS
>  	example, "root=03:01" for /dev/hda1.
>  
>  	Designers of embedded systems may wish to say N here to conserve space.
> +
> +config SYSFS_COUNT_LINKS
> +	bool "sysfs count subdirectoires to support buggy applications"
> +	default n

As we don't want to break things, this should be default y, right?

Also, should we list this in the feature_removal list as well so that we
can get rid of it in a year or so?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs:  Optionally count subdirectories to support buggy applications
  2012-02-01 22:21                 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman
  2012-02-01 22:24                   ` Greg Kroah-Hartman
@ 2012-02-01 22:31                   ` Dave Jones
  2012-02-01 22:35                   ` Jiri Slaby
  2012-02-01 23:15                   ` Linus Torvalds
  3 siblings, 0 replies; 75+ messages in thread
From: Dave Jones @ 2012-02-01 22:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro,
	Linus Torvalds, Maciej Rutecki

On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote:
 
 > +config SYSFS_COUNT_LINKS
 > +	bool "sysfs count subdirectoires to support buggy applications"

subdirectories

 > +	Increase the memory and cpu untilization of sysfs by maintaining

utilization

 > +	as the number of subdirectories plus two.  With out this enabled

Without

 > +	the standard way of indicating that the number of subdirectoires

subdirectories


	Dave

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs:  Optionally count subdirectories to support buggy applications
  2012-02-01 22:21                 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman
  2012-02-01 22:24                   ` Greg Kroah-Hartman
  2012-02-01 22:31                   ` Dave Jones
@ 2012-02-01 22:35                   ` Jiri Slaby
  2012-02-01 23:15                   ` Linus Torvalds
  3 siblings, 0 replies; 75+ messages in thread
From: Jiri Slaby @ 2012-02-01 22:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Greg KH, Alan Cox, LKML, Al Viro,
	Linus Torvalds, Maciej Rutecki

On 02/01/2012 11:21 PM, Eric W. Biederman wrote:
> --- a/fs/sysfs/Kconfig
> +++ b/fs/sysfs/Kconfig
> @@ -21,3 +21,18 @@ config SYSFS
>  	example, "root=03:01" for /dev/hda1.
>  
>  	Designers of embedded systems may wish to say N here to conserve space.
> +
> +config SYSFS_COUNT_LINKS
> +	bool "sysfs count subdirectoires to support buggy applications"
> +	default n
> +	help
> +
> +	Increase the memory and cpu untilization of sysfs by maintaining
> +	by maintaining a count of how hard links from subdirectories a

by maintaining only once.

of how "many"?

> +	directory has.  This allows sysfs to report the directory nlink
> +	as the number of subdirectories plus two.  With out this enabled
> +	sysfs will report the nlink of all directories as 1, which is
> +	the standard way of indicating that the number of subdirectoires
> +	is not tracked.
> +
> +	Unless you know you need this say N here.

You should put the opposite here. "If you are sure you don't need it,
say N."

> --- a/fs/sysfs/sysfs.h
> +++ b/fs/sysfs/sysfs.h
...
> @@ -127,6 +130,41 @@ do {								\
>  #define sysfs_dirent_init_lockdep(sd) do {} while(0)
>  #endif
>  
> +#ifdef CONFIG_SYSFS_COUNT_LINKS
> +static inline void sysfs_init_nlink(struct sysfs_dirent *sd)
> +{
> +	if (sysfs_type(sd) == SYSFS_DIR)
> +		sd->s_nlink = 2;
> +	else
> +		sd->s_nlink = 1;
> +}
> +static inline void sysfs_inc_nlink(struct sysfs_dirent *sd)
> +{
> +	sd->s_nlink++;
> +}
> +static inline void sysfs_dec_nlink(struct sysfs_dirent *sd)
> +{
> +	sd->s_nlink++;
> +}

Minus minus.

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs:  Optionally count subdirectories to support buggy applications
  2012-02-01 22:24                   ` Greg Kroah-Hartman
@ 2012-02-01 22:44                     ` Eric W. Biederman
  2012-02-01 22:49                       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-02-01 22:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki

Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote:
>> 
>> lm_sensors and possibly other applications get confused if all sysfs
>> directories return nlink == 1.  The lm_sensors code that got confused
>> was just wrong and a fixed version of lm_sensors should be released
>> shortly.
>> 
>> There may be other applications that have problems with sysfs return
>> nlink == 1 for directories.  To allow people to continue to use old
>> versions of userspace with new kernels add to sysfs a compile time
>> option to maintain mostly precise directory counts for those people who
>> don't mind the cost.
>> 
>> I have moved where we keep nlink in sysfs_dirent as compared to previous
>> versions of subdirectory counting to a location that packs better.
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>  fs/sysfs/Kconfig |   15 +++++++++++++++
>>  fs/sysfs/dir.c   |    8 ++++++++
>>  fs/sysfs/inode.c |    2 ++
>>  fs/sysfs/sysfs.h |   38 ++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 63 insertions(+), 0 deletions(-)
>> 
>> diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig
>> index 8c41fea..9b403e9 100644
>> --- a/fs/sysfs/Kconfig
>> +++ b/fs/sysfs/Kconfig
>> @@ -21,3 +21,18 @@ config SYSFS
>>  	example, "root=03:01" for /dev/hda1.
>>  
>>  	Designers of embedded systems may wish to say N here to conserve space.
>> +
>> +config SYSFS_COUNT_LINKS
>> +	bool "sysfs count subdirectoires to support buggy applications"
>> +	default n
>
> As we don't want to break things, this should be default y, right?

The new behavior is backwards compatible.  What the new behavior is not
is bug compatible.  So nothing *should* break.

Furthermore the breaking we have seen so far is limited to just
lm_sensors.  That is exactly one program that is not a server failing to
start.  That seems pretty minor in the worst case.

So I really don't expect anyone who ships 3.4 to enable this option.

I have written the option solely so that in case my assessment turns out
to be wrong there is already a tested solution.  I have been through the
pain of not being able to upgrade/test a new kernel because of a
backwards incompatible change and it can be very unpleasant.

> Also, should we list this in the feature_removal list as well so that we
> can get rid of it in a year or so?

Good idea.  I don't know if anyone actually reads feature removal but it
is good to serve notice.  I will cook up a patch for that.

Eric


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs:  Optionally count subdirectories to support buggy applications
  2012-02-01 22:44                     ` Eric W. Biederman
@ 2012-02-01 22:49                       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 75+ messages in thread
From: Greg Kroah-Hartman @ 2012-02-01 22:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jiri Slaby, Alan Cox, LKML, Al Viro, Linus Torvalds, Maciej Rutecki

On Wed, Feb 01, 2012 at 02:44:32PM -0800, Eric W. Biederman wrote:
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
> 
> > On Wed, Feb 01, 2012 at 02:21:59PM -0800, Eric W. Biederman wrote:
> >> 
> >> lm_sensors and possibly other applications get confused if all sysfs
> >> directories return nlink == 1.  The lm_sensors code that got confused
> >> was just wrong and a fixed version of lm_sensors should be released
> >> shortly.
> >> 
> >> There may be other applications that have problems with sysfs return
> >> nlink == 1 for directories.  To allow people to continue to use old
> >> versions of userspace with new kernels add to sysfs a compile time
> >> option to maintain mostly precise directory counts for those people who
> >> don't mind the cost.
> >> 
> >> I have moved where we keep nlink in sysfs_dirent as compared to previous
> >> versions of subdirectory counting to a location that packs better.
> >> 
> >> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> >> ---
> >>  fs/sysfs/Kconfig |   15 +++++++++++++++
> >>  fs/sysfs/dir.c   |    8 ++++++++
> >>  fs/sysfs/inode.c |    2 ++
> >>  fs/sysfs/sysfs.h |   38 ++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 63 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/fs/sysfs/Kconfig b/fs/sysfs/Kconfig
> >> index 8c41fea..9b403e9 100644
> >> --- a/fs/sysfs/Kconfig
> >> +++ b/fs/sysfs/Kconfig
> >> @@ -21,3 +21,18 @@ config SYSFS
> >>  	example, "root=03:01" for /dev/hda1.
> >>  
> >>  	Designers of embedded systems may wish to say N here to conserve space.
> >> +
> >> +config SYSFS_COUNT_LINKS
> >> +	bool "sysfs count subdirectoires to support buggy applications"
> >> +	default n
> >
> > As we don't want to break things, this should be default y, right?
> 
> The new behavior is backwards compatible.  What the new behavior is not
> is bug compatible.  So nothing *should* break.

"should", but you really don't know, as all we have is one report so
far.

> Furthermore the breaking we have seen so far is limited to just
> lm_sensors.  That is exactly one program that is not a server failing to
> start.  That seems pretty minor in the worst case.
> 
> So I really don't expect anyone who ships 3.4 to enable this option.

What about users who use a new kernel on old userspace, which happens
all the time (i.e. all the kernel developers themselves)?

> I have written the option solely so that in case my assessment turns out
> to be wrong there is already a tested solution.  I have been through the
> pain of not being able to upgrade/test a new kernel because of a
> backwards incompatible change and it can be very unpleasant.

I'd really prefer this to be default 'y', and if a distro knows it can
turn it off to save time/space, it can.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-02-01 22:21                 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman
                                     ` (2 preceding siblings ...)
  2012-02-01 22:35                   ` Jiri Slaby
@ 2012-02-01 23:15                   ` Linus Torvalds
  2012-02-01 23:18                     ` Linus Torvalds
  3 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2012-02-01 23:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro,
	Maciej Rutecki

On Wed, Feb 1, 2012 at 2:21 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> lm_sensors and possibly other applications get confused if all sysfs
> directories return nlink == 1.  The lm_sensors code that got confused
> was just wrong and a fixed version of lm_sensors should be released
> shortly.

I think this patch is horrible, and broken.

And making the feature a config option is just stupid.

The simple approach is to

 - implement a inode->i_op->getattr function for sysfs

 - make it call generic_fillattr()

 - after filling in the generic fields, count the number of sysfs
children it has and update stat->nlink appropriately.

No extra "keep track of inode counts by hand" crap, and no idiotic
config options that just make it easy to (conditionally) get things
wrong. Just do it right, and do it *unconditionally* right.

The Linux VFS layer is smart, and allows filesystems to do these kinds
of fixups.

                  Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-02-01 23:15                   ` Linus Torvalds
@ 2012-02-01 23:18                     ` Linus Torvalds
  2012-02-02  1:22                       ` Al Viro
  2012-03-05 13:30                       ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby
  0 siblings, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2012-02-01 23:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Jiri Slaby, Greg KH, Alan Cox, LKML, Al Viro,
	Maciej Rutecki

On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> No extra "keep track of inode counts by hand" crap, and no idiotic
> config options that just make it easy to (conditionally) get things
> wrong. Just do it right, and do it *unconditionally* right.

And btw, "nlink shows number of subdirectories" for a directory entry
really *is* right. It's how Unix filesystems work, like it or not.

It's mainly lazy/bad filesystems that set nlink to 1. So the whole
"nlink==1" case is meant for crap like FAT etc, not for a filesystem
that we control and that could easily just do it right.

Which is why I detest that config option. It's as if you were asking the user

 "Do you want to make the sysfs filesystem act like crap filesystems?"

and kernel config time. What kind of inane question is that?

                    Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-02-01 23:18                     ` Linus Torvalds
@ 2012-02-02  1:22                       ` Al Viro
  2012-02-02 21:24                         ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro
  2012-03-05 13:30                       ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby
  1 sibling, 1 reply; 75+ messages in thread
From: Al Viro @ 2012-02-02  1:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Jiri Slaby, Greg KH,
	Alan Cox, LKML, Maciej Rutecki

On Wed, Feb 01, 2012 at 03:18:05PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > No extra "keep track of inode counts by hand" crap, and no idiotic
> > config options that just make it easy to (conditionally) get things
> > wrong. Just do it right, and do it *unconditionally* right.
> 
> And btw, "nlink shows number of subdirectories" for a directory entry
> really *is* right. It's how Unix filesystems work, like it or not.
> 
> It's mainly lazy/bad filesystems that set nlink to 1. So the whole
> "nlink==1" case is meant for crap like FAT etc, not for a filesystem
> that we control and that could easily just do it right.

It's a bit more complicated than that and it had always been a bit of
a minefield.  v6: link(2) began with
        ip = namei(&uchar, 0);
        if(ip == NULL)
                return;
        if(ip->i_nlink >= 127) {
                u.u_error = EMLINK;
                goto out;
        }
(and mkdir(), of course, was implemented via mknod+link).  Up to 127 links
to any object.  Fine; v7: EMLINK defined, but _never_ returned.  Moreover,
mkdir(1) didn't bother to check link counts either.  Result: 65535 calls
of link(2) and you've got yourself an fs corruption on i_nlink overflow
(it was 16bit in on-disk inode).

Linux implementation of link(2) had exactly the same bug as that of v7 until
0.98.3 (for link(2)) and 0.99.1 (for mkdir(2), rename(2) - didn't exist as
syscalls on v7, but had the same problem as soon as they made it in kernel
at some point in BSD history).  Note that limit for minixfs remained
ridiculously low until '97 or so; for ext2 it was 32000 from the very
beginning but note that e.g. ext had _not_ acquired fixes for link overflows
on mkdir() at all - it had that hole all the way until removal in 2.1.26.
Of course, there was a (completely useless) POSIX-mandated LINK_MAX, but
since the actual limit depends on fs type, well...  the checks remain
in ->link()/->mkdir()/->rename() instances and IIRC I've caught a bunch of
overflows in those circa 2.1.very_late.  

As the result, old stat(2) had 16bit st_nlink - same as v7.  Nobody needed
more, after all.  Alpha port got it 32bit (inherited the struct stat layout
from OSF, presumably?), but other early ports kept 16bit there.

At some point folks started whining about wanting more subdirectories and
that's when the things began to get really convoluted and ugly.  By that
time we had a variant of stat(2) that used 32bit st_nlink, but on-disk
layouts remained a problem.  Some filesystems went for more or less
reasonable "the limit is circa 2^32, EMLINK if we get more than that and
-EOVERFLOW on old_stat(2) if it doesn't fit into 16 bits".  However, it
was _not_ universally true - e.g. reiserfs does _NOT_ do that for mkdir().
There we get "directory link count grows up to 16bit limit and gets stuck
at 1 if it ever grows past that", on the theory that st_nlink == 1
is distinguishable from anything you'd normally get for a directory.  I
have no idea where that invention has come from, but it's been around for
more than a decade.  Of course, Hans being Hans, it had been advertised as
major reiserfs feature - you could get an unlimited amount of subdirectories,
which no other fs on Linux would allow, nevermind that nobody sane would
actually make use of that...

At about the same time somebody had done the same trick on ext2 - Daniel,
probably, since it had been floating in directory index patchset.  It never
got merged into mainline; ext3 port _did_, but without those bits actually
used (EXT3_DIR_LINK_MAX is defined, but never used).  ext4 actually has it
hooked in.

I don't see anything else other than ext4 and reiserfs using that convention,
but then just grepping for inc_nlink/inode_inc_use_count shows a bloody
mess - e.g. jffs2 does not check overflows (32bit there) at all, neither on
link() nor on mkdir()/rename().  It _might_ get away with that (with
non-standard errno, if so), but I don't remember that code well enough
to tell at the quick look.  HFS+ doesn't seem to check for overflows either
and while it doesn't track link count on directories, it does support link(2)
and it does bump i_nlink there.  A _lot_ of filesystems (starting with
ramfs) assumes that we'll get an OOM before we get to 2^32 links to an
inode on purely in-core filesystem; reasonable back in 2001 or so, but
not these days...

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-02  1:22                       ` Al Viro
@ 2012-02-02 21:24                         ` Al Viro
  2012-02-02 23:46                           ` Linus Torvalds
  2012-02-03  8:25                           ` Andreas Dilger
  0 siblings, 2 replies; 75+ messages in thread
From: Al Viro @ 2012-02-02 21:24 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, Linus Torvalds


> Linux implementation of link(2) had exactly the same bug as that of v7 until
> 0.98.3 (for link(2)) and 0.99.1 (for mkdir(2), rename(2) - didn't exist as
> syscalls on v7, but had the same problem as soon as they made it in kernel
> at some point in BSD history).  Note that limit for minixfs remained
> ridiculously low until '97 or so; for ext2 it was 32000 from the very
> beginning but note that e.g. ext had _not_ acquired fixes for link overflows
> on mkdir() at all - it had that hole all the way until removal in 2.1.26.
> Of course, there was a (completely useless) POSIX-mandated LINK_MAX, but
> since the actual limit depends on fs type, well...  the checks remain
> in ->link()/->mkdir()/->rename() instances and IIRC I've caught a bunch of
> overflows in those circa 2.1.very_late.  

FWIW, there's something we really should've done a long time ago: putting
that limit into sb->s_max_links.  With 0 meaning "leave all checks to
->link/->mkdir/->rename".  Something like the following would make a
reasonable start - just the conversion of obvious cases.  As the next
step I'd probably initialize it as ~0U instead of 0 and let the filesystems
that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set
it to 0 in their foo_fill_super().  That would take care of a bunch of cases
where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and
it's probably a saner default anyway.

Comments?  Boilerplate removal follows (22 files changed, 45 insertions(+),
120 deletions(-)), but it's *not* for immediate merge; it's really completely
untested.

diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
index 9dbf0c3..fc7161d 100644
--- a/fs/exofs/namei.c
+++ b/fs/exofs/namei.c
@@ -143,9 +143,6 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
 {
 	struct inode *inode = old_dentry->d_inode;
 
-	if (inode->i_nlink >= EXOFS_LINK_MAX)
-		return -EMLINK;
-
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
 	ihold(inode);
@@ -156,10 +153,7 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
 static int exofs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
-	int err = -EMLINK;
-
-	if (dir->i_nlink >= EXOFS_LINK_MAX)
-		goto out;
+	int err;
 
 	inode_inc_link_count(dir);
 
@@ -275,11 +269,6 @@ static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
 		if (err)
 			goto out_dir;
 	} else {
-		if (dir_de) {
-			err = -EMLINK;
-			if (new_dir->i_nlink >= EXOFS_LINK_MAX)
-				goto out_dir;
-		}
 		err = exofs_add_link(new_dentry, old_inode);
 		if (err)
 			goto out_dir;
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index d22cd16..6cafcad 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -754,6 +754,7 @@ static int exofs_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_blocksize = EXOFS_BLKSIZE;
 	sb->s_blocksize_bits = EXOFS_BLKSHIFT;
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
+	sb->s_max_links = EXOFS_LINK_MAX;
 	atomic_set(&sbi->s_curr_pending, 0);
 	sb->s_bdev = NULL;
 	sb->s_dev = 0;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0804198..dffb865 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -195,9 +195,6 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
 	struct inode *inode = old_dentry->d_inode;
 	int err;
 
-	if (inode->i_nlink >= EXT2_LINK_MAX)
-		return -EMLINK;
-
 	dquot_initialize(dir);
 
 	inode->i_ctime = CURRENT_TIME_SEC;
@@ -217,10 +214,7 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
 static int ext2_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
 {
 	struct inode * inode;
-	int err = -EMLINK;
-
-	if (dir->i_nlink >= EXT2_LINK_MAX)
-		goto out;
+	int err;
 
 	dquot_initialize(dir);
 
@@ -346,11 +340,6 @@ static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 			drop_nlink(new_inode);
 		inode_dec_link_count(new_inode);
 	} else {
-		if (dir_de) {
-			err = -EMLINK;
-			if (new_dir->i_nlink >= EXT2_LINK_MAX)
-				goto out_dir;
-		}
 		err = ext2_add_link(new_dentry, old_inode);
 		if (err)
 			goto out_dir;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 0090595..9f6766a 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -919,6 +919,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	sb->s_maxbytes = ext2_max_size(sb->s_blocksize_bits);
+	sb->s_max_links = EXT2_LINK_MAX;
 
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV) {
 		sbi->s_inode_size = EXT2_GOOD_OLD_INODE_SIZE;
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 5f7c160..07c91ca 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -220,12 +220,6 @@ static int jfs_mkdir(struct inode *dip, struct dentry *dentry, umode_t mode)
 
 	dquot_initialize(dip);
 
-	/* link count overflow on parent directory ? */
-	if (dip->i_nlink == JFS_LINK_MAX) {
-		rc = -EMLINK;
-		goto out1;
-	}
-
 	/*
 	 * search parent directory for entry/freespace
 	 * (dtSearch() returns parent directory page pinned)
@@ -806,9 +800,6 @@ static int jfs_link(struct dentry *old_dentry,
 	jfs_info("jfs_link: %s %s", old_dentry->d_name.name,
 		 dentry->d_name.name);
 
-	if (ip->i_nlink == JFS_LINK_MAX)
-		return -EMLINK;
-
 	dquot_initialize(dir);
 
 	tid = txBegin(ip->i_sb, 0);
@@ -1138,10 +1129,6 @@ static int jfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 				rc = -ENOTEMPTY;
 				goto out3;
 			}
-		} else if ((new_dir != old_dir) &&
-			   (new_dir->i_nlink == JFS_LINK_MAX)) {
-			rc = -EMLINK;
-			goto out3;
 		}
 	} else if (new_ip) {
 		IWRITE_LOCK(new_ip, RDWRLOCK_NORMAL);
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 682bca6..4661ad7 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -441,6 +441,7 @@ static int jfs_fill_super(struct super_block *sb, void *data, int silent)
 		return -ENOMEM;
 
 	sb->s_fs_info = sbi;
+	sb->s_max_links = JFS_LINK_MAX;
 	sbi->sb = sb;
 	sbi->uid = sbi->gid = sbi->umask = -1;
 
diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 3de7a32..4aea231 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -558,9 +558,6 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir,
 {
 	struct inode *inode = old_dentry->d_inode;
 
-	if (inode->i_nlink >= LOGFS_LINK_MAX)
-		return -EMLINK;
-
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	ihold(inode);
 	inc_nlink(inode);
diff --git a/fs/logfs/super.c b/fs/logfs/super.c
index c9ee7f5..b1a491a 100644
--- a/fs/logfs/super.c
+++ b/fs/logfs/super.c
@@ -542,6 +542,7 @@ static struct dentry *logfs_get_sb_device(struct logfs_super *super,
 	 * the filesystem incompatible with 32bit systems.
 	 */
 	sb->s_maxbytes	= (1ull << 43) - 1;
+	sb->s_max_links = LOGFS_LINK_MAX;
 	sb->s_op	= &logfs_super_operations;
 	sb->s_flags	= flags | MS_NOATIME;
 
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index fa8b612..62c697c 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -190,24 +190,24 @@ static int minix_fill_super(struct super_block *s, void *data, int silent)
 		sbi->s_version = MINIX_V1;
 		sbi->s_dirsize = 16;
 		sbi->s_namelen = 14;
-		sbi->s_link_max = MINIX_LINK_MAX;
+		s->s_max_links = MINIX_LINK_MAX;
 	} else if (s->s_magic == MINIX_SUPER_MAGIC2) {
 		sbi->s_version = MINIX_V1;
 		sbi->s_dirsize = 32;
 		sbi->s_namelen = 30;
-		sbi->s_link_max = MINIX_LINK_MAX;
+		s->s_max_links = MINIX_LINK_MAX;
 	} else if (s->s_magic == MINIX2_SUPER_MAGIC) {
 		sbi->s_version = MINIX_V2;
 		sbi->s_nzones = ms->s_zones;
 		sbi->s_dirsize = 16;
 		sbi->s_namelen = 14;
-		sbi->s_link_max = MINIX2_LINK_MAX;
+		s->s_max_links = MINIX2_LINK_MAX;
 	} else if (s->s_magic == MINIX2_SUPER_MAGIC2) {
 		sbi->s_version = MINIX_V2;
 		sbi->s_nzones = ms->s_zones;
 		sbi->s_dirsize = 32;
 		sbi->s_namelen = 30;
-		sbi->s_link_max = MINIX2_LINK_MAX;
+		s->s_max_links = MINIX2_LINK_MAX;
 	} else if ( *(__u16 *)(bh->b_data + 24) == MINIX3_SUPER_MAGIC) {
 		m3s = (struct minix3_super_block *) bh->b_data;
 		s->s_magic = m3s->s_magic;
@@ -221,9 +221,9 @@ static int minix_fill_super(struct super_block *s, void *data, int silent)
 		sbi->s_dirsize = 64;
 		sbi->s_namelen = 60;
 		sbi->s_version = MINIX_V3;
-		sbi->s_link_max = MINIX2_LINK_MAX;
 		sbi->s_mount_state = MINIX_VALID_FS;
 		sb_set_blocksize(s, m3s->s_blocksize);
+		s->s_max_links = MINIX2_LINK_MAX;
 	} else
 		goto out_no_fs;
 
diff --git a/fs/minix/minix.h b/fs/minix/minix.h
index c889ef0..1ebd118 100644
--- a/fs/minix/minix.h
+++ b/fs/minix/minix.h
@@ -34,7 +34,6 @@ struct minix_sb_info {
 	unsigned long s_max_size;
 	int s_dirsize;
 	int s_namelen;
-	int s_link_max;
 	struct buffer_head ** s_imap;
 	struct buffer_head ** s_zmap;
 	struct buffer_head * s_sbh;
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index 2f76e38..2d0ee17 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -94,9 +94,6 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
 {
 	struct inode *inode = old_dentry->d_inode;
 
-	if (inode->i_nlink >= minix_sb(inode->i_sb)->s_link_max)
-		return -EMLINK;
-
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
 	ihold(inode);
@@ -106,10 +103,7 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
 static int minix_mkdir(struct inode * dir, struct dentry *dentry, umode_t mode)
 {
 	struct inode * inode;
-	int err = -EMLINK;
-
-	if (dir->i_nlink >= minix_sb(dir->i_sb)->s_link_max)
-		goto out;
+	int err;
 
 	inode_inc_link_count(dir);
 
@@ -181,7 +175,6 @@ static int minix_rmdir(struct inode * dir, struct dentry *dentry)
 static int minix_rename(struct inode * old_dir, struct dentry *old_dentry,
 			   struct inode * new_dir, struct dentry *new_dentry)
 {
-	struct minix_sb_info * info = minix_sb(old_dir->i_sb);
 	struct inode * old_inode = old_dentry->d_inode;
 	struct inode * new_inode = new_dentry->d_inode;
 	struct page * dir_page = NULL;
@@ -219,11 +212,6 @@ static int minix_rename(struct inode * old_dir, struct dentry *old_dentry,
 			drop_nlink(new_inode);
 		inode_dec_link_count(new_inode);
 	} else {
-		if (dir_de) {
-			err = -EMLINK;
-			if (new_dir->i_nlink >= info->s_link_max)
-				goto out_dir;
-		}
 		err = minix_add_link(new_dentry, old_inode);
 		if (err)
 			goto out_dir;
diff --git a/fs/namei.c b/fs/namei.c
index 208c6aa..45f953c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2545,6 +2545,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, d
 int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
 	int error = may_create(dir, dentry);
+	unsigned max_links = dir->i_sb->s_max_links;
 
 	if (error)
 		return error;
@@ -2557,6 +2558,9 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (error)
 		return error;
 
+	if (max_links && dir->i_nlink >= max_links)
+		return -EMLINK;
+
 	error = dir->i_op->mkdir(dir, dentry, mode);
 	if (!error)
 		fsnotify_mkdir(dir, dentry);
@@ -2887,6 +2891,7 @@ SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newn
 int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
 {
 	struct inode *inode = old_dentry->d_inode;
+	unsigned max_links = dir->i_sb->s_max_links;
 	int error;
 
 	if (!inode)
@@ -2917,6 +2922,8 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	/* Make sure we don't allow creating hardlink to an unlinked file */
 	if (inode->i_nlink == 0)
 		error =  -ENOENT;
+	else if (max_links && inode->i_nlink >= max_links)
+		error = -EMLINK;
 	else
 		error = dir->i_op->link(old_dentry, dir, new_dentry);
 	mutex_unlock(&inode->i_mutex);
@@ -3026,6 +3033,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
 {
 	int error = 0;
 	struct inode *target = new_dentry->d_inode;
+	unsigned max_links = new_dir->i_sb->s_max_links;
 
 	/*
 	 * If we are going to change the parent - check write permissions,
@@ -3049,6 +3057,11 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
 	if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
 		goto out;
 
+	error = -EMLINK;
+	if (max_links && !target && new_dir != old_dir &&
+	    new_dir->i_nlink >= max_links)
+		goto out;
+
 	if (target)
 		shrink_dcache_parent(new_dentry);
 	error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 1cd3f62..fce2bbe 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -193,9 +193,6 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
 	struct nilfs_transaction_info ti;
 	int err;
 
-	if (inode->i_nlink >= NILFS_LINK_MAX)
-		return -EMLINK;
-
 	err = nilfs_transaction_begin(dir->i_sb, &ti, 1);
 	if (err)
 		return err;
@@ -219,9 +216,6 @@ static int nilfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	struct nilfs_transaction_info ti;
 	int err;
 
-	if (dir->i_nlink >= NILFS_LINK_MAX)
-		return -EMLINK;
-
 	err = nilfs_transaction_begin(dir->i_sb, &ti, 1);
 	if (err)
 		return err;
@@ -400,11 +394,6 @@ static int nilfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 		drop_nlink(new_inode);
 		nilfs_mark_inode_dirty(new_inode);
 	} else {
-		if (dir_de) {
-			err = -EMLINK;
-			if (new_dir->i_nlink >= NILFS_LINK_MAX)
-				goto out_dir;
-		}
 		err = nilfs_add_link(new_dentry, old_inode);
 		if (err)
 			goto out_dir;
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 08e3d4f..1fc9ad3 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1059,6 +1059,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_export_op = &nilfs_export_ops;
 	sb->s_root = NULL;
 	sb->s_time_gran = 1;
+	sb->s_max_links = NILFS_LINK_MAX;
 
 	bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
 	sb->s_bdi = bdi ? : &default_backing_dev_info;
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index b217797..d7466e2 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -121,9 +121,6 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
 {
 	struct inode *inode = old_dentry->d_inode;
 
-	if (inode->i_nlink >= SYSV_SB(inode->i_sb)->s_link_max)
-		return -EMLINK;
-
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
 	ihold(inode);
@@ -134,10 +131,8 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
 static int sysv_mkdir(struct inode * dir, struct dentry *dentry, umode_t mode)
 {
 	struct inode * inode;
-	int err = -EMLINK;
+	int err;
 
-	if (dir->i_nlink >= SYSV_SB(dir->i_sb)->s_link_max) 
-		goto out;
 	inode_inc_link_count(dir);
 
 	inode = sysv_new_inode(dir, S_IFDIR|mode);
@@ -251,11 +246,6 @@ static int sysv_rename(struct inode * old_dir, struct dentry * old_dentry,
 			drop_nlink(new_inode);
 		inode_dec_link_count(new_inode);
 	} else {
-		if (dir_de) {
-			err = -EMLINK;
-			if (new_dir->i_nlink >= SYSV_SB(new_dir->i_sb)->s_link_max)
-				goto out_dir;
-		}
 		err = sysv_add_link(new_dentry, old_inode);
 		if (err)
 			goto out_dir;
diff --git a/fs/sysv/super.c b/fs/sysv/super.c
index f60c196..f467740 100644
--- a/fs/sysv/super.c
+++ b/fs/sysv/super.c
@@ -44,7 +44,7 @@ enum {
 	JAN_1_1980 = (10*365 + 2) * 24 * 60 * 60
 };
 
-static void detected_xenix(struct sysv_sb_info *sbi)
+static void detected_xenix(struct sysv_sb_info *sbi, unsigned *max_links)
 {
 	struct buffer_head *bh1 = sbi->s_bh1;
 	struct buffer_head *bh2 = sbi->s_bh2;
@@ -59,7 +59,7 @@ static void detected_xenix(struct sysv_sb_info *sbi)
 		sbd2 = (struct xenix_super_block *) (bh2->b_data - 512);
 	}
 
-	sbi->s_link_max = XENIX_LINK_MAX;
+	*max_links = XENIX_LINK_MAX;
 	sbi->s_fic_size = XENIX_NICINOD;
 	sbi->s_flc_size = XENIX_NICFREE;
 	sbi->s_sbd1 = (char *)sbd1;
@@ -75,7 +75,7 @@ static void detected_xenix(struct sysv_sb_info *sbi)
 	sbi->s_nzones = fs32_to_cpu(sbi, sbd1->s_fsize);
 }
 
-static void detected_sysv4(struct sysv_sb_info *sbi)
+static void detected_sysv4(struct sysv_sb_info *sbi, unsigned *max_links)
 {
 	struct sysv4_super_block * sbd;
 	struct buffer_head *bh1 = sbi->s_bh1;
@@ -86,7 +86,7 @@ static void detected_sysv4(struct sysv_sb_info *sbi)
 	else
 		sbd = (struct sysv4_super_block *) bh2->b_data;
 
-	sbi->s_link_max = SYSV_LINK_MAX;
+	*max_links = SYSV_LINK_MAX;
 	sbi->s_fic_size = SYSV_NICINOD;
 	sbi->s_flc_size = SYSV_NICFREE;
 	sbi->s_sbd1 = (char *)sbd;
@@ -103,7 +103,7 @@ static void detected_sysv4(struct sysv_sb_info *sbi)
 	sbi->s_nzones = fs32_to_cpu(sbi, sbd->s_fsize);
 }
 
-static void detected_sysv2(struct sysv_sb_info *sbi)
+static void detected_sysv2(struct sysv_sb_info *sbi, unsigned *max_links)
 {
 	struct sysv2_super_block *sbd;
 	struct buffer_head *bh1 = sbi->s_bh1;
@@ -114,7 +114,7 @@ static void detected_sysv2(struct sysv_sb_info *sbi)
 	else
 		sbd = (struct sysv2_super_block *) bh2->b_data;
 
-	sbi->s_link_max = SYSV_LINK_MAX;
+	*max_links = SYSV_LINK_MAX;
 	sbi->s_fic_size = SYSV_NICINOD;
 	sbi->s_flc_size = SYSV_NICFREE;
 	sbi->s_sbd1 = (char *)sbd;
@@ -131,14 +131,14 @@ static void detected_sysv2(struct sysv_sb_info *sbi)
 	sbi->s_nzones = fs32_to_cpu(sbi, sbd->s_fsize);
 }
 
-static void detected_coherent(struct sysv_sb_info *sbi)
+static void detected_coherent(struct sysv_sb_info *sbi, unsigned *max_links)
 {
 	struct coh_super_block * sbd;
 	struct buffer_head *bh1 = sbi->s_bh1;
 
 	sbd = (struct coh_super_block *) bh1->b_data;
 
-	sbi->s_link_max = COH_LINK_MAX;
+	*max_links = COH_LINK_MAX;
 	sbi->s_fic_size = COH_NICINOD;
 	sbi->s_flc_size = COH_NICFREE;
 	sbi->s_sbd1 = (char *)sbd;
@@ -154,12 +154,12 @@ static void detected_coherent(struct sysv_sb_info *sbi)
 	sbi->s_nzones = fs32_to_cpu(sbi, sbd->s_fsize);
 }
 
-static void detected_v7(struct sysv_sb_info *sbi)
+static void detected_v7(struct sysv_sb_info *sbi, unsigned *max_links)
 {
 	struct buffer_head *bh2 = sbi->s_bh2;
 	struct v7_super_block *sbd = (struct v7_super_block *)bh2->b_data;
 
-	sbi->s_link_max = V7_LINK_MAX;
+	*max_links = V7_LINK_MAX;
 	sbi->s_fic_size = V7_NICINOD;
 	sbi->s_flc_size = V7_NICFREE;
 	sbi->s_sbd1 = (char *)sbd;
@@ -290,7 +290,7 @@ static char *flavour_names[] = {
 	[FSTYPE_AFS]	= "AFS",
 };
 
-static void (*flavour_setup[])(struct sysv_sb_info *) = {
+static void (*flavour_setup[])(struct sysv_sb_info *, unsigned *) = {
 	[FSTYPE_XENIX]	= detected_xenix,
 	[FSTYPE_SYSV4]	= detected_sysv4,
 	[FSTYPE_SYSV2]	= detected_sysv2,
@@ -310,7 +310,7 @@ static int complete_read_super(struct super_block *sb, int silent, int size)
 
 	sbi->s_firstinodezone = 2;
 
-	flavour_setup[sbi->s_type](sbi);
+	flavour_setup[sbi->s_type](sbi, &sb->s_max_links);
 	
 	sbi->s_truncate = 1;
 	sbi->s_ndatazones = sbi->s_nzones - sbi->s_firstdatazone;
diff --git a/fs/sysv/sysv.h b/fs/sysv/sysv.h
index 0e4b821..11b0767 100644
--- a/fs/sysv/sysv.h
+++ b/fs/sysv/sysv.h
@@ -24,7 +24,6 @@ struct sysv_sb_info {
 	char	       s_bytesex;	/* bytesex (le/be/pdp) */
 	char	       s_truncate;	/* if 1: names > SYSV_NAMELEN chars are truncated */
 					/* if 0: they are disallowed (ENAMETOOLONG) */
-	nlink_t        s_link_max;	/* max number of hard links to a file */
 	unsigned int   s_inodes_per_block;	/* number of inodes per block */
 	unsigned int   s_inodes_per_block_1;	/* inodes_per_block - 1 */
 	unsigned int   s_inodes_per_block_bits;	/* log2(inodes_per_block) */
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index 08bf46e..38de8f2 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -32,8 +32,6 @@
 #include <linux/crc-itu-t.h>
 #include <linux/exportfs.h>
 
-enum { UDF_MAX_LINKS = 0xffff };
-
 static inline int udf_match(int len1, const unsigned char *name1, int len2,
 			    const unsigned char *name2)
 {
@@ -649,10 +647,6 @@ static int udf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	struct udf_inode_info *dinfo = UDF_I(dir);
 	struct udf_inode_info *iinfo;
 
-	err = -EMLINK;
-	if (dir->i_nlink >= UDF_MAX_LINKS)
-		goto out;
-
 	err = -EIO;
 	inode = udf_new_inode(dir, S_IFDIR | mode, &err);
 	if (!inode)
@@ -1032,9 +1026,6 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir,
 	struct fileIdentDesc cfi, *fi;
 	int err;
 
-	if (inode->i_nlink >= UDF_MAX_LINKS)
-		return -EMLINK;
-
 	fi = udf_add_entry(dir, dentry, &fibh, &cfi, &err);
 	if (!fi) {
 		return err;
@@ -1126,10 +1117,6 @@ static int udf_rename(struct inode *old_dir, struct dentry *old_dentry,
 		if (udf_get_lb_pblock(old_inode->i_sb, &tloc, 0) !=
 				old_dir->i_ino)
 			goto end_rename;
-
-		retval = -EMLINK;
-		if (!new_inode && new_dir->i_nlink >= UDF_MAX_LINKS)
-			goto end_rename;
 	}
 	if (!nfi) {
 		nfi = udf_add_entry(new_dir, new_dentry, &nfibh, &ncfi,
diff --git a/fs/udf/super.c b/fs/udf/super.c
index c09a84d..8d8b253 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -75,6 +75,8 @@
 
 #define UDF_DEFAULT_BLOCKSIZE 2048
 
+enum { UDF_MAX_LINKS = 0xffff };
+
 /* These are the "meat" - everything else is stuffing */
 static int udf_fill_super(struct super_block *, void *, int);
 static void udf_put_super(struct super_block *);
@@ -2042,6 +2044,7 @@ static int udf_fill_super(struct super_block *sb, void *options, int silent)
 		goto error_out;
 	}
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
+	sb->s_max_links = UDF_MAX_LINKS;
 	return 0;
 
 error_out:
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 38cac19..a2281ca 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -166,10 +166,6 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
 	int error;
 
 	lock_ufs(dir->i_sb);
-	if (inode->i_nlink >= UFS_LINK_MAX) {
-		unlock_ufs(dir->i_sb);
-		return -EMLINK;
-	}
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
@@ -183,10 +179,7 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
 static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
 {
 	struct inode * inode;
-	int err = -EMLINK;
-
-	if (dir->i_nlink >= UFS_LINK_MAX)
-		goto out;
+	int err;
 
 	lock_ufs(dir->i_sb);
 	inode_inc_link_count(dir);
@@ -305,11 +298,6 @@ static int ufs_rename(struct inode *old_dir, struct dentry *old_dentry,
 			drop_nlink(new_inode);
 		inode_dec_link_count(new_inode);
 	} else {
-		if (dir_de) {
-			err = -EMLINK;
-			if (new_dir->i_nlink >= UFS_LINK_MAX)
-				goto out_dir;
-		}
 		err = ufs_add_link(new_dentry, old_inode);
 		if (err)
 			goto out_dir;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 5246ee3..ec25d09 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1157,6 +1157,7 @@ magic_found:
 			    "fast symlink size (%u)\n", uspi->s_maxsymlinklen);
 		uspi->s_maxsymlinklen = maxsymlen;
 	}
+	sb->s_max_links = UFS_LINK_MAX;
 
 	inode = ufs_iget(sb, UFS_ROOTINO);
 	if (IS_ERR(inode)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 386da09..1e49554 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1459,6 +1459,7 @@ struct super_block {
 	u8 s_uuid[16];				/* UUID */
 
 	void 			*s_fs_info;	/* Filesystem private info */
+	unsigned int		s_max_links;
 	fmode_t			s_mode;
 
 	/* Granularity of c/m/atime in ns.

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-02 21:24                         ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro
@ 2012-02-02 23:46                           ` Linus Torvalds
  2012-02-03  1:16                             ` Al Viro
  2012-02-03  8:25                           ` Andreas Dilger
  1 sibling, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2012-02-02 23:46 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel

On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Comments?  Boilerplate removal follows (22 files changed, 45 insertions(+),
> 120 deletions(-)), but it's *not* for immediate merge; it's really completely
> untested.

Looks ok to me. Historically, the more things we can check at the VFS
layer, the better.

                  Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-02 23:46                           ` Linus Torvalds
@ 2012-02-03  1:16                             ` Al Viro
  2012-02-03  1:45                               ` Al Viro
                                                 ` (3 more replies)
  0 siblings, 4 replies; 75+ messages in thread
From: Al Viro @ 2012-02-03  1:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, linux-kernel, Joel Becker, Chris Mason, David Miller

On Thu, Feb 02, 2012 at 03:46:06PM -0800, Linus Torvalds wrote:
> On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Comments? ?Boilerplate removal follows (22 files changed, 45 insertions(+),
> > 120 deletions(-)), but it's *not* for immediate merge; it's really completely
> > untested.
> 
> Looks ok to me. Historically, the more things we can check at the VFS
> layer, the better.

After looking a bit more: nlink_t is a f*cking mess.  Almost any code
using that type kernel-side is broken.  Crap galore:
	* sometimes it's 32 bits, sometimes 16, sometimes 64.  Essentially
at random.
	* almost all have it unsigned, except for sparc32, where it's
signed short [inherited from v7 via SunOS?  BTW, in v6 it used to be even
funnier - char, which is where ridiculous LINK_MAX == 127 comes from]

IOW, nlink_t is an attractive nuisance - it's nearly impossible to use in
a portable way and we are lucky that almost nobody tries to.  Exceptions:
ocfs2_rename() does
        nlink_t old_dir_nlink = old_dir->i_nlink;
	...
followed later by comparison with old_dir->i_nlink.  And no, it's not to
handle truncation - it's "what if i_nlink changed while ocfs2_rename()
had been grabbing the cluster lock" kind of thing.  OCFS2 can have up
to 2^32 links to file, so truncation is really possible...  AFAICS,
that one is a genuine bug - this nlink_t should be u32...
Another one is proc_dir_entry ->nlink and it would cause Bad Things(tm)
on architecture with 16bit nlink_t if we could end up with 65534
subdirectories in some procfs dir.  Might be possible, might be not -
doing that under /proc/sys is definitely possible, but that won't be
enough; needs to be proc_dir_entry-backed directory.  Again, solution
is to use explicit u32 anyway.

	* compat_nlink_t is even funnier - it's signed in *two* cases; sparc
and ppc.  No, nlink_t on ppc32 is unsigned.  Not that anyone cared, really,
since the _only_ use of that type is in struct compat_stat.  For exactly
one field.  Only used as left-hand side of assignment, which is actually
broken since unlike cp_new_stat(), cp_compat_stat() does *not* check if the
value fits into st_nlink.  Bug, needs to be fixed.  Incidentally, just what
should we do on sparc32 if we run into a file with 4G-10 links?  -EOVERFLOW
or silently put 65536-10 in st_nlink and be done with that?  Note that
filesystems allowing that many links *do* exist...

	* when does jfs dtInsert() return -EMLINK?  Can it ever get triggered?
	* WTF is XFS doing with these checks?  Note that we have them
done _twice_ on all paths - explictly from xfs_create(), xfs_link(),
xfs_rename() and then from xfs_bumplink() called by exactly the same
set of functions.

	* what's up with btrfs_insert_inode_ref()?  I've tried to trace
the codepaths around there, but... Incidentally, when could fixup_low_keys()
return non-zero?  I don't see any candidates for that in there...  Chris?

	* ubifs, hfsplus, jffs2 - definitely broken if you create enough
links.  i_nlink wraparound to zero, confused inode eviction logics.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03  1:16                             ` Al Viro
@ 2012-02-03  1:45                               ` Al Viro
  2012-02-03  2:00                                 ` Linus Torvalds
  2012-02-03 14:57                               ` Chris Mason
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 75+ messages in thread
From: Al Viro @ 2012-02-03  1:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, linux-kernel, Joel Becker, Chris Mason, David Miller

On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote:

> After looking a bit more: nlink_t is a f*cking mess.  Almost any code
> using that type kernel-side is broken.  Crap galore:
> 	* sometimes it's 32 bits, sometimes 16, sometimes 64.  Essentially
> at random.
> 	* almost all have it unsigned, except for sparc32, where it's
> signed short [inherited from v7 via SunOS?  BTW, in v6 it used to be even
> funnier - char, which is where ridiculous LINK_MAX == 127 comes from]
> 
> IOW, nlink_t is an attractive nuisance - it's nearly impossible to use in
> a portable way and we are lucky that almost nobody tries to.

Incidentally, why the hell do we have
	typedef __kernel_nlink_t        nlink_t;
anyway?  It's *not* exposed to userland and it's different from the
userland nlink_t (which is unsigned int on 32bit and unsigned long on 64bit).
Why not use __kernel_nlink_t (or explicitly-sized __uNN) in
arch/*/include/asm/stat.h and declare nlink_t kernel-side as __u32?

Why do we have daddr_t, while we are at it?  There is exactly one user -
fs/freevxfs and there we definitely want a fixed-sized type.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03  1:45                               ` Al Viro
@ 2012-02-03  2:00                                 ` Linus Torvalds
  0 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2012-02-03  2:00 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel, Joel Becker, Chris Mason, David Miller

On Thu, Feb 2, 2012 at 5:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Incidentally, why the hell do we have
>        typedef __kernel_nlink_t        nlink_t;
> anyway?  It's *not* exposed to userland and it's different from the
> userland nlink_t (which is unsigned int on 32bit and unsigned long on 64bit).
> Why not use __kernel_nlink_t (or explicitly-sized __uNN) in
> arch/*/include/asm/stat.h and declare nlink_t kernel-side as __u32?

Probably hysterical raisins, and just converted to the whole
__kernel_nlink_t form together with other, more relevant things.

Feel free to remove it.

> Why do we have daddr_t, while we are at it?  There is exactly one user -
> fs/freevxfs and there we definitely want a fixed-sized type.

I think it's something that probably came from freevxfs and BSD roots
or similar. It's a BSD'ism, afaik.

                    Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-02 21:24                         ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro
  2012-02-02 23:46                           ` Linus Torvalds
@ 2012-02-03  8:25                           ` Andreas Dilger
  2012-02-03 17:03                             ` Al Viro
  1 sibling, 1 reply; 75+ messages in thread
From: Andreas Dilger @ 2012-02-03  8:25 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

On 2012-02-02, at 2:24 PM, Al Viro wrote:
> FWIW, there's something we really should've done a long time ago: putting
> that limit into sb->s_max_links.  With 0 meaning "leave all checks to
> ->link/->mkdir/->rename".  Something like the following would make a
> reasonable start - just the conversion of obvious cases.  As the next
> step I'd probably initialize it as ~0U instead of 0 and let the filesystems
> that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set
> it to 0 in their foo_fill_super().  That would take care of a bunch of cases
> where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and
> it's probably a saner default anyway.

This would also give userspace some hope of pathconf(path, _PC_LINK_MAX)
returning the actual value from the filesystem, instead of hard-coding
this into glibc itself based on the statfs-returned f_type magic value.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03  1:16                             ` Al Viro
  2012-02-03  1:45                               ` Al Viro
@ 2012-02-03 14:57                               ` Chris Mason
  2012-02-03 17:08                               ` Al Viro
  2012-02-06 22:49                               ` Dave Chinner
  3 siblings, 0 replies; 75+ messages in thread
From: Chris Mason @ 2012-02-03 14:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Joel Becker, David Miller

On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote:
> On Thu, Feb 02, 2012 at 03:46:06PM -0800, Linus Torvalds wrote:
> > On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Comments? ?Boilerplate removal follows (22 files changed, 45 insertions(+),
> > > 120 deletions(-)), but it's *not* for immediate merge; it's really completely
> > > untested.
> > 
> > Looks ok to me. Historically, the more things we can check at the VFS
> > layer, the better.
> 
> 
> 	* what's up with btrfs_insert_inode_ref()?  I've tried to trace
> the codepaths around there, but...

Btrfs stores backrefs (the filename, directory inode number) from the
inode to the directory.

In the current format that's a pretty low limit on how many of these we
can store for hard links to the same file in the same directory, but
basically no limit on how many backrefs we can store to the same file
from different directories.

Mark Fasheh was working on a patch to change the backrefs to make the
links-from-the-same-dir case consistent with the
links-from-different-dir case.   With today's code, we'll go -EMLINK at
different times depending on the length of the file name and what links
you've already made.


> Incidentally, when could fixup_low_keys()
> return non-zero?  I don't see any candidates for that in there...  Chris?

A long time ago this one used to cow blocks and so it needed an error
return.  I think Jeff Mahoney has a patch queued up to make it (among
many others) void.

-chris

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03  8:25                           ` Andreas Dilger
@ 2012-02-03 17:03                             ` Al Viro
  2012-02-04  7:42                               ` Andreas Dilger
  0 siblings, 1 reply; 75+ messages in thread
From: Al Viro @ 2012-02-03 17:03 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

On Fri, Feb 03, 2012 at 01:25:26AM -0700, Andreas Dilger wrote:
> On 2012-02-02, at 2:24 PM, Al Viro wrote:
> > FWIW, there's something we really should've done a long time ago: putting
> > that limit into sb->s_max_links.  With 0 meaning "leave all checks to
> > ->link/->mkdir/->rename".  Something like the following would make a
> > reasonable start - just the conversion of obvious cases.  As the next
> > step I'd probably initialize it as ~0U instead of 0 and let the filesystems
> > that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set
> > it to 0 in their foo_fill_super().  That would take care of a bunch of cases
> > where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and
> > it's probably a saner default anyway.
> 
> This would also give userspace some hope of pathconf(path, _PC_LINK_MAX)
> returning the actual value from the filesystem, instead of hard-coding
> this into glibc itself based on the statfs-returned f_type magic value.

*snort*

Even skipping the standard flame about pathconf() as an API, this will
not work.
	* we have filesystems that do not allow link creation at all and
do keep track of subdirectories count in i_nlink of directories.  What
would you have them store?  As it is, ~0U works just fine, but pathconf()
users won't be happy with it.
	* we have filesystems that allow unlimited subdirectories, while
limiting the number of links to non-directories; ->s_max_links == 0 will
work just fine, but won't make pathconf() happy.
	* we have filesystems that have more complex rules re links to
non-directory (see mail from Chris in this thread).  What would you have
pathconf() do?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03  1:16                             ` Al Viro
  2012-02-03  1:45                               ` Al Viro
  2012-02-03 14:57                               ` Chris Mason
@ 2012-02-03 17:08                               ` Al Viro
  2012-02-03 19:34                                 ` Artem Bityutskiy
  2012-02-06  8:50                                 ` Artem Bityutskiy
  2012-02-06 22:49                               ` Dave Chinner
  3 siblings, 2 replies; 75+ messages in thread
From: Al Viro @ 2012-02-03 17:08 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote:

> 	* ubifs, hfsplus, jffs2 - definitely broken if you create enough
> links.  i_nlink wraparound to zero, confused inode eviction logics.

BTW, ubifs plays funny games with i_nlink - decrements it early in
unlink/rmdir/rename and then increments it back on failure.  *If*
we really want it that way, we need to use set_nlink() there.  Frankly,
I'd rather deal with drop_nlink() after the last possible failure
exit...  Unless there are serious reasons why that wouldn't work, that is.
Artem?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03 17:08                               ` Al Viro
@ 2012-02-03 19:34                                 ` Artem Bityutskiy
  2012-02-06  8:50                                 ` Artem Bityutskiy
  1 sibling, 0 replies; 75+ messages in thread
From: Artem Bityutskiy @ 2012-02-03 19:34 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1012 bytes --]

On Fri, 2012-02-03 at 17:08 +0000, Al Viro wrote:
> On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote:
> 
> > 	* ubifs, hfsplus, jffs2 - definitely broken if you create enough
> > links.  i_nlink wraparound to zero, confused inode eviction logics.
> 
> BTW, ubifs plays funny games with i_nlink - decrements it early in
> unlink/rmdir/rename and then increments it back on failure.  *If*
> we really want it that way, we need to use set_nlink() there.  Frankly,
> I'd rather deal with drop_nlink() after the last possible failure
> exit...  Unless there are serious reasons why that wouldn't work, that is.

There was a good reason for those games. I cannot provide a good
explanation right now because I need to refresh my memory (but I am sure
there must be a big comment in the code about this) and my daughter is
getting upset already because I am typing something instead of playing
with her. Will try to answer on Monday. Have a nice weekend!

-- 
Best Regards,
Artem Bityutskiy

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-01-31 10:52     ` Kay Sievers
  2012-01-31 11:00       ` Jiri Slaby
@ 2012-02-04  2:14       ` Henrique de Moraes Holschuh
  2012-02-06 20:03         ` Kay Sievers
  1 sibling, 1 reply; 75+ messages in thread
From: Henrique de Moraes Holschuh @ 2012-02-04  2:14 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev

On Tue, 31 Jan 2012, Kay Sievers wrote:
> Please make sure nothing tries to swap netif names in userspace. We
> have given up that approach, because it is far too fragile to
> temporary rename devices to be able to swap the names, and race
> against the loading of new kernel network drivers at the same time.

That's a damn fair reason, but the loss of that functionality could cause
trouble.  In fact, at first glance, to me it looks like this has a large
potential for unleashing untold pain and suffering in the sysadmin ranks
unless early userspace can emulate it somehow.

Is it possible to configure the kernel to use something other than "eth#" as
its initial namespace for netif names?  Or is there some other way to get
eth1 to be what you need eth1 to be during userland boot?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03 17:03                             ` Al Viro
@ 2012-02-04  7:42                               ` Andreas Dilger
  0 siblings, 0 replies; 75+ messages in thread
From: Andreas Dilger @ 2012-02-04  7:42 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

On 2012-02-03, at 10:03 AM, Al Viro wrote:
> On Fri, Feb 03, 2012 at 01:25:26AM -0700, Andreas Dilger wrote:
>> On 2012-02-02, at 2:24 PM, Al Viro wrote:
>>> FWIW, there's something we really should've done a long time ago: putting
>>> that limit into sb->s_max_links.  With 0 meaning "leave all checks to
>>> ->link/->mkdir/->rename".  Something like the following would make a
>>> reasonable start - just the conversion of obvious cases.  As the next
>>> step I'd probably initialize it as ~0U instead of 0 and let the filesystems
>>> that want something trickier (reiserfs, ext4, gfs2, ocfs2) explicitly set
>>> it to 0 in their foo_fill_super().  That would take care of a bunch of cases
>>> where we forgot to do those checks (ubifs, hfsplus, jffs2, ramfs, etc.) and
>>> it's probably a saner default anyway.
>> 
>> This would also give userspace some hope of pathconf(path, _PC_LINK_MAX)
>> returning the actual value from the filesystem, instead of hard-coding
>> this into glibc itself based on the statfs-returned f_type magic value.
> 
> *snort*
> 
> Even skipping the standard flame about pathconf() as an API, this will
> not work.
> 	* we have filesystems that do not allow link creation at all and
> do keep track of subdirectories count in i_nlink of directories.  What
> would you have them store?  As it is, ~0U works just fine, but pathconf()
> users won't be happy with it.
> 	* we have filesystems that allow unlimited subdirectories, while
> limiting the number of links to non-directories; ->s_max_links == 0 will
> work just fine, but won't make pathconf() happy.
> 	* we have filesystems that have more complex rules re links to
> non-directory (see mail from Chris in this thread).  What would you have
> pathconf() do?

No comment on how good an API pathconf() is, but getting a per-filesystem
value from the kernel has to be better than a hard-coded value coded in a
library in userspace.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03 17:08                               ` Al Viro
  2012-02-03 19:34                                 ` Artem Bityutskiy
@ 2012-02-06  8:50                                 ` Artem Bityutskiy
  2012-02-06 13:56                                   ` Al Viro
  1 sibling, 1 reply; 75+ messages in thread
From: Artem Bityutskiy @ 2012-02-06  8:50 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1546 bytes --]

On Fri, 2012-02-03 at 17:08 +0000, Al Viro wrote:
> On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote:
> 
> > 	* ubifs, hfsplus, jffs2 - definitely broken if you create enough
> > links.  i_nlink wraparound to zero, confused inode eviction logics.
> 
> BTW, ubifs plays funny games with i_nlink - decrements it early in
> unlink/rmdir/rename and then increments it back on failure.  *If*
> we really want it that way, we need to use set_nlink() there.  Frankly,
> I'd rather deal with drop_nlink() after the last possible failure
> exit...  Unless there are serious reasons why that wouldn't work, that is.
> Artem?

The way code is organized is that the UBIFS journal subsystem functions
accept 'struct inode' and struct direntry' objects and write them out to
the media as they are in RAM. So before calling 'ubifs_jnl_update()' we
should have all the fields in 'struct inode' to be set correctly. Thus,
we have to drop 'i_nlink' before calling 'ubifs_jnl_update()'.

WRT dealing with 'drop_nlink()' after the last possible failure - I can
just make own partial copy of 'struct inode' and pass it to
'ubifs_jnl_update()', if there is a need. I would not like to make the
journal code more complex than it already is by adding 'i_nlink' usage
smartness.

WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()'
of course. But I do not really see why is this better. E.g.,
'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink'
wrapping.

-- 
Best Regards,
Artem Bityutskiy

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-06  8:50                                 ` Artem Bityutskiy
@ 2012-02-06 13:56                                   ` Al Viro
  2012-02-06 17:05                                     ` Artem Bityutskiy
  0 siblings, 1 reply; 75+ messages in thread
From: Al Viro @ 2012-02-06 13:56 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote:

> WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()'
> of course. But I do not really see why is this better. E.g.,
> 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink'
> wrapping.

So does inc_nlink() when you are asking to get from nlink=0 to nlink=1.
I.e. on failure exit in your unlink()...

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-06 13:56                                   ` Al Viro
@ 2012-02-06 17:05                                     ` Artem Bityutskiy
  2012-02-06 17:11                                       ` Al Viro
  0 siblings, 1 reply; 75+ messages in thread
From: Artem Bityutskiy @ 2012-02-06 17:05 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 638 bytes --]

On Mon, 2012-02-06 at 13:56 +0000, Al Viro wrote:
> On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote:
> 
> > WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()'
> > of course. But I do not really see why is this better. E.g.,
> > 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink'
> > wrapping.
> 
> So does inc_nlink() when you are asking to get from nlink=0 to nlink=1.
> I.e. on failure exit in your unlink()...

Indeed! I'll switch to 'clear_nlink(inode)' and 'set_nlink(inode, 1)'
for those inodes which are being unliked.

-- 
Best Regards,
Artem Bityutskiy

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-06 17:05                                     ` Artem Bityutskiy
@ 2012-02-06 17:11                                       ` Al Viro
  2012-02-07  7:21                                         ` Artem Bityutskiy
  0 siblings, 1 reply; 75+ messages in thread
From: Al Viro @ 2012-02-06 17:11 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

On Mon, Feb 06, 2012 at 07:05:55PM +0200, Artem Bityutskiy wrote:
> On Mon, 2012-02-06 at 13:56 +0000, Al Viro wrote:
> > On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote:
> > 
> > > WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()'
> > > of course. But I do not really see why is this better. E.g.,
> > > 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink'
> > > wrapping.
> > 
> > So does inc_nlink() when you are asking to get from nlink=0 to nlink=1.
> > I.e. on failure exit in your unlink()...
> 
> Indeed! I'll switch to 'clear_nlink(inode)' and 'set_nlink(inode, 1)'
> for those inodes which are being unliked.

Huh?  I thought ubifs allowed link(2)... IOW, i_nlink could've been greater
than 1 when you called ubifs_unlink().

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-04  2:14       ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh
@ 2012-02-06 20:03         ` Kay Sievers
  2012-02-08  2:00           ` Henrique de Moraes Holschuh
  0 siblings, 1 reply; 75+ messages in thread
From: Kay Sievers @ 2012-02-06 20:03 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev

On Sat, Feb 4, 2012 at 03:14, Henrique de Moraes Holschuh
<hmh@hmh.eng.br> wrote:
> On Tue, 31 Jan 2012, Kay Sievers wrote:
>> Please make sure nothing tries to swap netif names in userspace. We
>> have given up that approach, because it is far too fragile to
>> temporary rename devices to be able to swap the names, and race
>> against the loading of new kernel network drivers at the same time.
>
> That's a damn fair reason, but the loss of that functionality could cause
> trouble.  In fact, at first glance, to me it looks like this has a large
> potential for unleashing untold pain and suffering in the sysadmin ranks
> unless early userspace can emulate it somehow.
>
> Is it possible to configure the kernel to use something other than "eth#" as
> its initial namespace for netif names?  Or is there some other way to get
> eth1 to be what you need eth1 to be during userland boot?

I don't think there is a sane way to do that. Someone could add a
kernel command line parameter to switch ethX in the kernel to
something else, and create custom udev rules which match on device
properties and apply configured names which are ethX again. But for
all that, there will be no generally available support in common base
system tools, and we absolutely do not recommend anybody doing that.

Udev will not provide any help for that any more, not for automatic
device name reservation from a hotplug path, not for device name swaps
in the kernel namespace. It will only be allowed to rename devices to
a namespace that does not clash with the kernel's one.

People should use biosdevname's pci-slot names, or the on-board labels
names like DELL does for configuration-less stable names, or use
manually configured names 'internal', 'external' ,'dmz', 'vpn' and so
on.

I think we should stop pretending we can solve problems, resulting
from simple enumeration depending on device-discovery order. These
numbers can never be stable, can never reliably work in the reality we
are working with.

It's time to leave these false promises behind us and move on and that
means, no stable ethX names anymore.

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-03  1:16                             ` Al Viro
                                                 ` (2 preceding siblings ...)
  2012-02-03 17:08                               ` Al Viro
@ 2012-02-06 22:49                               ` Dave Chinner
  3 siblings, 0 replies; 75+ messages in thread
From: Dave Chinner @ 2012-02-06 22:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Joel Becker,
	Chris Mason, David Miller

On Fri, Feb 03, 2012 at 01:16:12AM +0000, Al Viro wrote:
> On Thu, Feb 02, 2012 at 03:46:06PM -0800, Linus Torvalds wrote:
> > On Thu, Feb 2, 2012 at 1:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 	* WTF is XFS doing with these checks?

It is validating nlink against the maximum supported by the XFS
on-disk format. It was originally limited by what could be reported
to pathconf() on Irix - a signed int.  We have that same problem on
Linux, too, because on 32 bit systems the maximum number of links
that can be reported via pathconf is 2^31....

> 	Note that we have them
> done _twice_ on all paths - explictly from xfs_create(), xfs_link(),
> xfs_rename() and then from xfs_bumplink() called by exactly the same
> set of functions.

Well, that's a bit stupid, isn't it? Trivial to fix, though...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC] killing boilerplate checks in ->link/->mkdir/->rename
  2012-02-06 17:11                                       ` Al Viro
@ 2012-02-07  7:21                                         ` Artem Bityutskiy
  0 siblings, 0 replies; 75+ messages in thread
From: Artem Bityutskiy @ 2012-02-07  7:21 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

On Mon, 2012-02-06 at 17:11 +0000, Al Viro wrote:
> On Mon, Feb 06, 2012 at 07:05:55PM +0200, Artem Bityutskiy wrote:
> > On Mon, 2012-02-06 at 13:56 +0000, Al Viro wrote:
> > > On Mon, Feb 06, 2012 at 10:50:44AM +0200, Artem Bityutskiy wrote:
> > > 
> > > > WRT 'sen_nlink()' - I can use it instead of 'drop_nlink()'/'inc_nlink()'
> > > > of course. But I do not really see why is this better. E.g.,
> > > > 'drop_nlink()' additionally gives me ' WARN_ON()' in case of 'i_nlink'
> > > > wrapping.
> > > 
> > > So does inc_nlink() when you are asking to get from nlink=0 to nlink=1.
> > > I.e. on failure exit in your unlink()...
> > 
> > Indeed! I'll switch to 'clear_nlink(inode)' and 'set_nlink(inode, 1)'
> > for those inodes which are being unliked.
> 
> Huh?  I thought ubifs allowed link(2)... IOW, i_nlink could've been greater
> than 1 when you called ubifs_unlink().

You are right, I'll take this correctly when I change the code rather
than writing an e-mail. Thanks! I think I'll submit a patch this week.

-- 
Best Regards,
Artem Bityutskiy

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-06 20:03         ` Kay Sievers
@ 2012-02-08  2:00           ` Henrique de Moraes Holschuh
  2012-02-08  3:50             ` Kay Sievers
  0 siblings, 1 reply; 75+ messages in thread
From: Henrique de Moraes Holschuh @ 2012-02-08  2:00 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev

On Mon, 06 Feb 2012, Kay Sievers wrote:
> On Sat, Feb 4, 2012 at 03:14, Henrique de Moraes Holschuh
> <hmh@hmh.eng.br> wrote:
> > Is it possible to configure the kernel to use something other than "eth#" as
> > its initial namespace for netif names?  Or is there some other way to get
> > eth1 to be what you need eth1 to be during userland boot?
> 
> I don't think there is a sane way to do that. Someone could add a
> kernel command line parameter to switch ethX in the kernel to
> something else, and create custom udev rules which match on device
> properties and apply configured names which are ethX again. But for
> all that, there will be no generally available support in common base
> system tools, and we absolutely do not recommend anybody doing that.

What sort of impact analysis on userspace was done about this change?

Nobody in his right mind would go back to the dark ages of uncontrolled
ifnames.  You're effectively forcing everybody with a clue away from the
eth# namespace.

Just to be very clear: the impact of this is the need to change the
interface names on potentially millions of lines of firewall rules and
scripts out there, as well as tracking down stuff (mostly scripts) that
special-cases the eth prefix.

Is there a really good reason why we cannot have a way to move the
kernel away from the eth# namespace at boot (through a kernel parameter,
maybe with the default namespace set at compile time), AND keep the
"common base system tools" support to assign ifname based on MAC
addresses that we have right now?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-08  2:00           ` Henrique de Moraes Holschuh
@ 2012-02-08  3:50             ` Kay Sievers
  2012-02-08  6:42               ` Valdis.Kletnieks
  0 siblings, 1 reply; 75+ messages in thread
From: Kay Sievers @ 2012-02-08  3:50 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: Jiri Slaby, Eric W. Biederman, Greg KH, LKML, ML netdev

On Wed, Feb 8, 2012 at 03:00, Henrique de Moraes Holschuh
<hmh@hmh.eng.br> wrote:
> On Mon, 06 Feb 2012, Kay Sievers wrote:
>> On Sat, Feb 4, 2012 at 03:14, Henrique de Moraes Holschuh
>> <hmh@hmh.eng.br> wrote:
>> > Is it possible to configure the kernel to use something other than "eth#" as
>> > its initial namespace for netif names?  Or is there some other way to get
>> > eth1 to be what you need eth1 to be during userland boot?
>>
>> I don't think there is a sane way to do that. Someone could add a
>> kernel command line parameter to switch ethX in the kernel to
>> something else, and create custom udev rules which match on device
>> properties and apply configured names which are ethX again. But for
>> all that, there will be no generally available support in common base
>> system tools, and we absolutely do not recommend anybody doing that.
>
> What sort of impact analysis on userspace was done about this change?

None. It will just not be supported for new setups. Existing ones will
do what they always did.

> Nobody in his right mind would go back to the dark ages of uncontrolled
> ifnames.  You're effectively forcing everybody with a clue away from the
> eth# namespace.

Yes. It's a game we have lost and we will not win in the future. I
gave up, and I warn everybody who think it's simple to manage.

> Just to be very clear: the impact of this is the need to change the
> interface names on potentially millions of lines of firewall rules and
> scripts out there, as well as tracking down stuff (mostly scripts) that
> special-cases the eth prefix.

Yeah, and for good, ethX is a pretty much random kernel name, and I
personally will no longer work on conceptually broken infrastructure
that can never deliver what it seems to promise. In the longer run,
tools need to be fixed to automatically handle changing names, or not
care about the names at all, or names need to be explicitly set up
outside the ethX namespace to be predictable.

After years of working in that area I will stop to work on these hacks
to promise stable ethX names. It was just wrong, like enumerations
always are in hotplug setups.

> Is there a really good reason why we cannot have a way to move the
> kernel away from the eth# namespace at boot (through a kernel parameter,
> maybe with the default namespace set at compile time),

Could work, but I don't think it is worth. Simple enumeration, and
automatic persistent on-disk device name reservation in a flat
number-range is just a very flawed concept. I'm not interested in
working on that, but that surely should not stop anybody from trying
and providing tools that can do that.

> AND keep the
> "common base system tools" support to assign ifname based on MAC
> addresses that we have right now?

Not provided by udev's default setup, which did persistent name
reservation in the device hotplug path. It is already disabled and
will be entirely removed from the source tree some day. Other tools
can still try to provide that. But I declare that model as officially
failed and udev will not even try anything like that anymore.

People who need predictable interface names should just manually
configure custom/descriptive names, or names which are reliably
derived from the hardware, like firmware-provided names or the pci
slot number.

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-08  3:50             ` Kay Sievers
@ 2012-02-08  6:42               ` Valdis.Kletnieks
  2012-02-08 10:57                 ` Kay Sievers
  0 siblings, 1 reply; 75+ messages in thread
From: Valdis.Kletnieks @ 2012-02-08  6:42 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman,
	Greg KH, LKML, ML netdev

[-- Attachment #1: Type: text/plain, Size: 1375 bytes --]

On Wed, 08 Feb 2012 04:50:15 +0100, Kay Sievers said:

> After years of working in that area I will stop to work on these hacks
> to promise stable ethX names. It was just wrong, like enumerations
> always are in hotplug setups.

So (real world case) I've got a server that's got a 1G ethernet connected to
the public net, a 1G ethernet that's a cluster management network, and
a 10G ethernet that connects to our HPC clusters.

And I want to add iptables rules that distinguish based on interface. Currently
I can nail the management net to eth0, the public net to eth1, and the 10G to
eth2, and then just add "-i eth1" or whatever in the iptables ruleset.

I really don't care if the 0/1/2 move around - but if we're not having nailed-down
interface names, what will take the place of '-i ethN' in iptables?

> People who need predictable interface names should just manually
> configure custom/descriptive names, or names which are reliably
> derived from the hardware, like firmware-provided names or the pci
> slot number.

Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
what you are trying to move to, and my systems are already onboard and
I should just move along, nothing to see here? ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --]

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-08  6:42               ` Valdis.Kletnieks
@ 2012-02-08 10:57                 ` Kay Sievers
  2012-02-08 20:06                   ` Valdis.Kletnieks
  0 siblings, 1 reply; 75+ messages in thread
From: Kay Sievers @ 2012-02-08 10:57 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman,
	Greg KH, LKML, ML netdev

On Wed, Feb 8, 2012 at 07:42,  <Valdis.Kletnieks@vt.edu> wrote:
> On Wed, 08 Feb 2012 04:50:15 +0100, Kay Sievers said:
>
>> After years of working in that area I will stop to work on these hacks
>> to promise stable ethX names. It was just wrong, like enumerations
>> always are in hotplug setups.
>
> So (real world case) I've got a server that's got a 1G ethernet connected to
> the public net, a 1G ethernet that's a cluster management network, and
> a 10G ethernet that connects to our HPC clusters.
>
> And I want to add iptables rules that distinguish based on interface. Currently
> I can nail the management net to eth0, the public net to eth1, and the 10G to
> eth2, and then just add "-i eth1" or whatever in the iptables ruleset.
>
> I really don't care if the 0/1/2 move around - but if we're not having nailed-down
> interface names, what will take the place of '-i ethN' in iptables?
>
>> People who need predictable interface names should just manually
>> configure custom/descriptive names, or names which are reliably
>> derived from the hardware, like firmware-provided names or the pci
>> slot number.
>
> Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules
> SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
> what you are trying to move to, and my systems are already onboard and
> I should just move along, nothing to see here? ;)

Yeah, that's what we did in the past. It works fine if you never have
to swap names like eth0 and eth1, with need to free one of the the
names with a temporary rename.

If another device is added by a different kernel module, or just a USB
network device is already plugged-in at bootup time, the parallel
loading of drivers might cause the kernel to create a new eth0 or eth1
just in the moment we have the temporary rename active and we want to
swap the names.

That model is just entirely flawed and will never work reliably
without creating an even bigger mess we already have, by requiring
complex retry loops across multiple devices, or having global locks
including the kernel's device name allocation logic.

Let's just move on and stop pretending we want or we can solve these
problems. Simple device enumerations in hotplug setups can by their
very definition not work in a predictable way, we should never have
tried to mess around here, and just moved on to something that has at
least the potential to work.

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-08 10:57                 ` Kay Sievers
@ 2012-02-08 20:06                   ` Valdis.Kletnieks
  2012-02-08 20:27                     ` Stephen Hemminger
  2012-02-08 23:48                     ` Kay Sievers
  0 siblings, 2 replies; 75+ messages in thread
From: Valdis.Kletnieks @ 2012-02-08 20:06 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman,
	Greg KH, LKML, ML netdev

[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]

On Wed, 08 Feb 2012 11:57:18 +0100, Kay Sievers said:
> On Wed, Feb 8, 2012 at 07:42,  <Valdis.Kletnieks@vt.edu> wrote:

> > Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules
> > SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
> > what you are trying to move to, and my systems are already onboard and
> > I should just move along, nothing to see here? ;)
>
> Yeah, that's what we did in the past. It works fine if you never have
> to swap names like eth0 and eth1, with need to free one of the the
> names with a temporary rename.

Well, if I had my druthers, I'd stick name="net-mgt", "net-pub", and "net-10g"
in the udev rules, and not care about 1/2/3 and race conditions, because
meaningful names are easier to not screw up (just last week found a system that
had eth1 and eth2 reversed in some iptables rules, wouldn't have happened if
they were -mgt and -pub).

Only thing stopping me is getting iptables to accept '-i net-10g', and the
distro /etc/sysconfig/network scripts like ifup and ifdown playing nice....

So it sounds like what I want as a sysadmin is the same thing you want
as a maintainer...



[-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --]

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-08 20:06                   ` Valdis.Kletnieks
@ 2012-02-08 20:27                     ` Stephen Hemminger
  2012-02-08 23:48                     ` Kay Sievers
  1 sibling, 0 replies; 75+ messages in thread
From: Stephen Hemminger @ 2012-02-08 20:27 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Kay Sievers, Henrique de Moraes Holschuh, Jiri Slaby,
	Eric W. Biederman, Greg KH, LKML, ML netdev

[-- Attachment #1: Type: text/plain, Size: 111 bytes --]

Our customers would prefer network device names of the form
"Ethernet0/0" or "ge-0/0/0" (but I said no...)


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: network regression: cannot rename netdev twice
  2012-02-08 20:06                   ` Valdis.Kletnieks
  2012-02-08 20:27                     ` Stephen Hemminger
@ 2012-02-08 23:48                     ` Kay Sievers
  1 sibling, 0 replies; 75+ messages in thread
From: Kay Sievers @ 2012-02-08 23:48 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Henrique de Moraes Holschuh, Jiri Slaby, Eric W. Biederman,
	Greg KH, LKML, ML netdev

On Wed, Feb 8, 2012 at 21:06,  <Valdis.Kletnieks@vt.edu> wrote:
> On Wed, 08 Feb 2012 11:57:18 +0100, Kay Sievers said:
>> On Wed, Feb 8, 2012 at 07:42,  <Valdis.Kletnieks@vt.edu> wrote:
>
>> > Or is this sort of thing in /etc/udev/rules.d/70-persistent-net.rules
>> > SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:90:0b:f2:80", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
>> > what you are trying to move to, and my systems are already onboard and
>> > I should just move along, nothing to see here? ;)
>>
>> Yeah, that's what we did in the past. It works fine if you never have
>> to swap names like eth0 and eth1, with need to free one of the the
>> names with a temporary rename.
>
> Well, if I had my druthers, I'd stick name="net-mgt", "net-pub", and "net-10g"
> in the udev rules, and not care about 1/2/3 and race conditions, because
> meaningful names are easier to not screw up (just last week found a system that
> had eth1 and eth2 reversed in some iptables rules, wouldn't have happened if
> they were -mgt and -pub).
>
> Only thing stopping me is getting iptables to accept '-i net-10g', and the
> distro /etc/sysconfig/network scripts like ifup and ifdown playing nice....
>
> So it sounds like what I want as a sysadmin is the same thing you want
> as a maintainer...

Yeah, that sounds very much like it is.

I want to push some responsibility to the admin, do less automagic,
and personally want to be less responsible for all the unintended
screw-up the automagic is causing everywhere.

Sure, the intention to keep the names like they always have been was
good, but a good intention and a broken model to deliver it, and
continue to pretend we can solve it, is the worst things we can do. :)

Kay

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-02-01 23:18                     ` Linus Torvalds
  2012-02-02  1:22                       ` Al Viro
@ 2012-03-05 13:30                       ` Jiri Slaby
  2012-03-05 16:09                         ` Greg Kroah-Hartman
  1 sibling, 1 reply; 75+ messages in thread
From: Jiri Slaby @ 2012-03-05 13:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Jiri Slaby, Greg KH,
	Alan Cox, LKML, Al Viro, Maciej Rutecki

On 02/02/2012 12:18 AM, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> No extra "keep track of inode counts by hand" crap, and no idiotic
>> config options that just make it easy to (conditionally) get things
>> wrong. Just do it right, and do it *unconditionally* right.
> 
> And btw, "nlink shows number of subdirectories" for a directory entry
> really *is* right. It's how Unix filesystems work, like it or not.
> 
> It's mainly lazy/bad filesystems that set nlink to 1. So the whole
> "nlink==1" case is meant for crap like FAT etc, not for a filesystem
> that we control and that could easily just do it right.
> 
> Which is why I detest that config option. It's as if you were asking the user
> 
>  "Do you want to make the sysfs filesystem act like crap filesystems?"
> 
> and kernel config time. What kind of inane question is that?

<thread resumed...>

What's going on here? I still have to revert "sysfs: Kill nlink
counting." with today's -next to have working sensors.

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-03-05 13:30                       ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby
@ 2012-03-05 16:09                         ` Greg Kroah-Hartman
  2012-03-05 16:47                           ` Linus Torvalds
                                             ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Greg Kroah-Hartman @ 2012-03-05 16:09 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Linus Torvalds, Eric W. Biederman, Jiri Slaby, Alan Cox, LKML,
	Al Viro, Maciej Rutecki

On Mon, Mar 05, 2012 at 02:30:20PM +0100, Jiri Slaby wrote:
> On 02/02/2012 12:18 AM, Linus Torvalds wrote:
> > On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> No extra "keep track of inode counts by hand" crap, and no idiotic
> >> config options that just make it easy to (conditionally) get things
> >> wrong. Just do it right, and do it *unconditionally* right.
> > 
> > And btw, "nlink shows number of subdirectories" for a directory entry
> > really *is* right. It's how Unix filesystems work, like it or not.
> > 
> > It's mainly lazy/bad filesystems that set nlink to 1. So the whole
> > "nlink==1" case is meant for crap like FAT etc, not for a filesystem
> > that we control and that could easily just do it right.
> > 
> > Which is why I detest that config option. It's as if you were asking the user
> > 
> >  "Do you want to make the sysfs filesystem act like crap filesystems?"
> > 
> > and kernel config time. What kind of inane question is that?
> 
> <thread resumed...>
> 
> What's going on here? I still have to revert "sysfs: Kill nlink
> counting." with today's -next to have working sensors.

I don't remember.  I thought there was a proposed patch for this issue
from Eric, but I don't see it in my queue anywhere.

Eric, what was the resolution here?

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-03-05 16:09                         ` Greg Kroah-Hartman
@ 2012-03-05 16:47                           ` Linus Torvalds
  2012-03-08 21:05                             ` Greg Kroah-Hartman
  2012-03-08 22:18                             ` Eric W. Biederman
  2012-03-08 21:28                           ` Eric W. Biederman
  2012-03-08 21:34                           ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman
  2 siblings, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2012-03-05 16:47 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Eric W. Biederman, Jiri Slaby, Alan Cox, LKML,
	Al Viro, Maciej Rutecki

On Mon, Mar 5, 2012 at 8:09 AM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> I don't remember.  I thought there was a proposed patch for this issue
> from Eric, but I don't see it in my queue anywhere.

That patch was an abortion. Adding a config option for behavior like
this is totally bogus, and the only reason for that config option was
that sysfs did silly things.

It's only in -next, though, I was assuming that the whole "Kill nlink
counting" commit never makes it to me. Because I won't take it.

I outlined how the counting could easily be done without actually
having to maintain an explicit count in the sysfs.

Or we should just keep doing the counting.

                    Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-03-05 16:47                           ` Linus Torvalds
@ 2012-03-08 21:05                             ` Greg Kroah-Hartman
  2012-03-08 22:18                             ` Eric W. Biederman
  1 sibling, 0 replies; 75+ messages in thread
From: Greg Kroah-Hartman @ 2012-03-08 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jiri Slaby, Eric W. Biederman, Jiri Slaby, Alan Cox, LKML,
	Al Viro, Maciej Rutecki

On Mon, Mar 05, 2012 at 08:47:27AM -0800, Linus Torvalds wrote:
> On Mon, Mar 5, 2012 at 8:09 AM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > I don't remember.  I thought there was a proposed patch for this issue
> > from Eric, but I don't see it in my queue anywhere.
> 
> That patch was an abortion. Adding a config option for behavior like
> this is totally bogus, and the only reason for that config option was
> that sysfs did silly things.
> 
> It's only in -next, though, I was assuming that the whole "Kill nlink
> counting" commit never makes it to me. Because I won't take it.
> 
> I outlined how the counting could easily be done without actually
> having to maintain an explicit count in the sysfs.
> 
> Or we should just keep doing the counting.

Ok, I've now reverted this patch in my tree.

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-03-05 16:09                         ` Greg Kroah-Hartman
  2012-03-05 16:47                           ` Linus Torvalds
@ 2012-03-08 21:28                           ` Eric W. Biederman
  2012-03-08 21:34                           ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman
  2 siblings, 0 replies; 75+ messages in thread
From: Eric W. Biederman @ 2012-03-08 21:28 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro,
	Maciej Rutecki

Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Mon, Mar 05, 2012 at 02:30:20PM +0100, Jiri Slaby wrote:
>> On 02/02/2012 12:18 AM, Linus Torvalds wrote:
>> > On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds
>> > <torvalds@linux-foundation.org> wrote:
>> >>
>> >> No extra "keep track of inode counts by hand" crap, and no idiotic
>> >> config options that just make it easy to (conditionally) get things
>> >> wrong. Just do it right, and do it *unconditionally* right.
>> > 
>> > And btw, "nlink shows number of subdirectories" for a directory entry
>> > really *is* right. It's how Unix filesystems work, like it or not.
>> > 
>> > It's mainly lazy/bad filesystems that set nlink to 1. So the whole
>> > "nlink==1" case is meant for crap like FAT etc, not for a filesystem
>> > that we control and that could easily just do it right.
>> > 
>> > Which is why I detest that config option. It's as if you were asking the user
>> > 
>> >  "Do you want to make the sysfs filesystem act like crap filesystems?"
>> > 
>> > and kernel config time. What kind of inane question is that?
>> 
>> <thread resumed...>
>> 
>> What's going on here? I still have to revert "sysfs: Kill nlink
>> counting." with today's -next to have working sensors.
>
> I don't remember.  I thought there was a proposed patch for this issue
> from Eric, but I don't see it in my queue anywhere.
>
> Eric, what was the resolution here?

Apologies. Cold/Allergies and distractions have kept it away.
sysfs patches in a follow up.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 1/3] sysfs:  Compact sysfs_dirent s_flags into a byte.
  2012-03-05 16:09                         ` Greg Kroah-Hartman
  2012-03-05 16:47                           ` Linus Torvalds
  2012-03-08 21:28                           ` Eric W. Biederman
@ 2012-03-08 21:34                           ` Eric W. Biederman
  2012-03-08 21:36                             ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman
  2012-03-08 22:28                             ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman
  2 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2012-03-08 21:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro,
	Maciej Rutecki


In an effort to keep sysfs_dirent small used the smallest
basic type I can for sysfs s_flags.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/sysfs/sysfs.h |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 6289a00..c76c932 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -76,7 +76,7 @@ struct sysfs_dirent {
 		struct sysfs_elem_bin_attr	s_bin_attr;
 	};
 
-	unsigned short		s_flags;
+	unsigned char		s_flags;
 	umode_t 		s_mode;
 	unsigned int		s_ino;
 	struct sysfs_inode_attrs *s_iattr;
@@ -84,7 +84,7 @@ struct sysfs_dirent {
 
 #define SD_DEACTIVATED_BIAS		INT_MIN
 
-#define SYSFS_TYPE_MASK			0x00ff
+#define SYSFS_TYPE_MASK			0x000f
 #define SYSFS_DIR			0x0001
 #define SYSFS_KOBJ_ATTR			0x0002
 #define SYSFS_KOBJ_BIN_ATTR		0x0004
@@ -93,11 +93,11 @@ struct sysfs_dirent {
 #define SYSFS_ACTIVE_REF		(SYSFS_KOBJ_ATTR | SYSFS_KOBJ_BIN_ATTR)
 
 /* identify any namespace tag on sysfs_dirents */
-#define SYSFS_NS_TYPE_MASK		0xf00
-#define SYSFS_NS_TYPE_SHIFT		8
+#define SYSFS_NS_TYPE_MASK		0x70
+#define SYSFS_NS_TYPE_SHIFT		4
 
 #define SYSFS_FLAG_MASK			~(SYSFS_NS_TYPE_MASK|SYSFS_TYPE_MASK)
-#define SYSFS_FLAG_REMOVED		0x02000
+#define SYSFS_FLAG_REMOVED		0x080
 
 static inline unsigned int sysfs_type(struct sysfs_dirent *sd)
 {
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 2/3] sysfs: Maintain usable nlink directory counts.
  2012-03-08 21:34                           ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman
@ 2012-03-08 21:36                             ` Eric W. Biederman
  2012-03-08 21:37                               ` [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead Eric W. Biederman
  2012-03-08 22:28                             ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman
  1 sibling, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-03-08 21:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro,
	Maciej Rutecki


When it is easy keep fully accurate nlink counts for sysfs directories,
when it is not easy set nlink to 1.  This maintains as much compatibility
with unix programs that expect directories to have a usable nlink count
as possible, without trying to do the impossible.

Directory nlink count overflows in sysfs are inevitable by design
so only bother to use a byte to store the directory nlink count.
A directory with 254 entries is larger than any sysfs directory
in a normal configruation but small enough we should see tools that
care about large sysfs directories should experience nlink == 1
during their testing.

This fixes libsensors and possibly other applications that get
confused if all sysfs directories return nlink == 1.  The lm_sensors
code that got confused was just wrong is fixed in the lm_sensors
trunk and a fixed version should be released sometime soon.

The nlink of all deleted directories is set to 0.  Returning for the
first time a correct nlink count for deleted sysfs directories.

Once a directory nlink count drops below 2 refuse to increment or decrement
it as there is not enough information available.  It is important that
this works for nlink == 0 as well as nlink == 1 because currently
sysfs supports deleting non-empty directories (the PCI layer requires this behavior).

For tagged directories set the nlink count == 1 because we currently
have one sysfs_dirent and multiple logical sysfs directories making
pre computing nlink impossible.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/sysfs/dir.c   |   14 ++++++++++++++
 fs/sysfs/inode.c |    1 +
 fs/sysfs/sysfs.h |   16 ++++++++++++++++
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index dd3779c..1526567 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -91,6 +91,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd)
 	struct rb_node **node = &sd->s_parent->s_dir.children.rb_node;
 	struct rb_node *parent = NULL;
 
+	if (sysfs_type(sd) == SYSFS_DIR)
+		sysfs_inc_nlink(sd->s_parent);
+
 	while (*node) {
 		struct sysfs_dirent *pos;
 		int result;
@@ -123,6 +126,9 @@ static int sysfs_link_sibling(struct sysfs_dirent *sd)
  */
 static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
 {
+	if (sysfs_type(sd) == SYSFS_DIR)
+		sysfs_dec_nlink(sd->s_parent);
+
 	rb_erase(&sd->s_rb, &sd->s_parent->s_dir.children);
 }
 
@@ -366,6 +372,9 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 	sd->s_name = name;
 	sd->s_mode = mode;
 	sd->s_flags = type;
+	sd->s_nlink = 1;
+	if (sysfs_type(sd) == SYSFS_DIR)
+		sd->s_nlink = 2;
 
 	return sd;
 
@@ -536,6 +545,7 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
 		ps_iattrs->ia_ctime = ps_iattrs->ia_mtime = CURRENT_TIME;
 	}
 
+	sd->s_nlink = 0;
 	sd->s_flags |= SYSFS_FLAG_REMOVED;
 	sd->u.removed_list = acxt->removed;
 	acxt->removed = sd;
@@ -660,6 +670,10 @@ static int create_dir(struct kobject *kobj, struct sysfs_dirent *parent_sd,
 	sd->s_ns = ns;
 	sd->s_dir.kobj = kobj;
 
+	/* Accurate nlink count impossible (one field mutiple dirs) */
+	if (sysfs_ns_type(sd))
+		sd->s_nlink = 1;
+
 	/* link in */
 	sysfs_addrm_start(&acxt, parent_sd);
 	rc = sysfs_add_one(&acxt, sd);
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 4291fd1..f6ebda8 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -216,6 +216,7 @@ static void sysfs_refresh_inode(struct sysfs_dirent *sd, struct inode *inode)
 					    iattrs->ia_secdata,
 					    iattrs->ia_secdata_len);
 	}
+	set_nlink(inode, sd->s_nlink);
 }
 
 int sysfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index c76c932..71f9bf7 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -76,6 +76,7 @@ struct sysfs_dirent {
 		struct sysfs_elem_bin_attr	s_bin_attr;
 	};
 
+	unsigned char		s_nlink;
 	unsigned char		s_flags;
 	umode_t 		s_mode;
 	unsigned int		s_ino;
@@ -127,6 +128,21 @@ do {								\
 #define sysfs_dirent_init_lockdep(sd) do {} while(0)
 #endif
 
+static inline void sysfs_inc_nlink(struct sysfs_dirent *sd)
+{
+	if (sd->s_nlink <= 1)
+		return;
+	sd->s_nlink++;
+	if (sd->s_nlink == 0)
+		sd->s_nlink = 1;
+}
+static inline void sysfs_dec_nlink(struct sysfs_dirent *sd)
+{
+	if (sd->s_nlink <= 1)
+		return;
+	sd->s_nlink--;
+}
+
 /*
  * Context structure to be used while adding/removing nodes.
  */
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead.
  2012-03-08 21:36                             ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman
@ 2012-03-08 21:37                               ` Eric W. Biederman
  2012-03-09  3:40                                 ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-03-08 21:37 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro,
	Maciej Rutecki


Now that we have a nlink field in sysfs_dirent report deleted
files and directories the traditional way with nlink == 0.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/sysfs/dir.c   |   11 +++++------
 fs/sysfs/sysfs.h |    1 -
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 1526567..b5471d1 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -202,7 +202,7 @@ static void sysfs_deactivate(struct sysfs_dirent *sd)
 	DECLARE_COMPLETION_ONSTACK(wait);
 	int v;
 
-	BUG_ON(!(sd->s_flags & SYSFS_FLAG_REMOVED));
+	BUG_ON(sd->s_nlink != 0);
 
 	if (!(sysfs_type(sd) & SYSFS_ACTIVE_REF))
 		return;
@@ -279,7 +279,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
 static int sysfs_dentry_delete(const struct dentry *dentry)
 {
 	struct sysfs_dirent *sd = dentry->d_fsdata;
-	return !!(sd->s_flags & SYSFS_FLAG_REMOVED);
+	return sd->s_nlink == 0;
 }
 
 static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd)
@@ -294,7 +294,7 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd)
 	mutex_lock(&sysfs_mutex);
 
 	/* The sysfs dirent has been deleted */
-	if (sd->s_flags & SYSFS_FLAG_REMOVED)
+	if (sd->s_nlink == 0)
 		goto out_bad;
 
 	/* The sysfs dirent has been moved? */
@@ -534,7 +534,7 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
 {
 	struct sysfs_inode_attrs *ps_iattr;
 
-	BUG_ON(sd->s_flags & SYSFS_FLAG_REMOVED);
+	BUG_ON(sd->s_nlink == 0);
 
 	sysfs_unlink_sibling(sd);
 
@@ -546,7 +546,6 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
 	}
 
 	sd->s_nlink = 0;
-	sd->s_flags |= SYSFS_FLAG_REMOVED;
 	sd->u.removed_list = acxt->removed;
 	acxt->removed = sd;
 }
@@ -946,7 +945,7 @@ static struct sysfs_dirent *sysfs_dir_pos(const void *ns,
 	struct sysfs_dirent *parent_sd,	loff_t hash, struct sysfs_dirent *pos)
 {
 	if (pos) {
-		int valid = !(pos->s_flags & SYSFS_FLAG_REMOVED) &&
+		int valid = (pos->s_nlink > 0) &&
 			pos->s_parent == parent_sd &&
 			hash == pos->s_hash;
 		sysfs_put(pos);
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 71f9bf7..db2c5c5 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -98,7 +98,6 @@ struct sysfs_dirent {
 #define SYSFS_NS_TYPE_SHIFT		4
 
 #define SYSFS_FLAG_MASK			~(SYSFS_NS_TYPE_MASK|SYSFS_TYPE_MASK)
-#define SYSFS_FLAG_REMOVED		0x080
 
 static inline unsigned int sysfs_type(struct sysfs_dirent *sd)
 {
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-03-05 16:47                           ` Linus Torvalds
  2012-03-08 21:05                             ` Greg Kroah-Hartman
@ 2012-03-08 22:18                             ` Eric W. Biederman
  2012-03-08 23:40                               ` Linus Torvalds
  1 sibling, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2012-03-08 22:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg Kroah-Hartman, Jiri Slaby, Jiri Slaby, Alan Cox, LKML,
	Al Viro, Maciej Rutecki

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Mar 5, 2012 at 8:09 AM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
>>
>> I don't remember.  I thought there was a proposed patch for this issue
>> from Eric, but I don't see it in my queue anywhere.
>
> That patch was an abortion. Adding a config option for behavior like
> this is totally bogus, and the only reason for that config option was
> that sysfs did silly things.

The biggest reason it is bogus is that it doesn't get properly tested
or reviewed.  Sigh.  My first patch to fix things had a bad typo
that everyone missed.

> It's only in -next, though, I was assuming that the whole "Kill nlink
> counting" commit never makes it to me. Because I won't take it.
>
> I outlined how the counting could easily be done without actually
> having to maintain an explicit count in the sysfs.

And if you had bothered to look you would have seen how we used to
have that code and it was removed because it was a performance
bottleneck.

> Or we should just keep doing the counting.

The current counting that we do gives the wrong numbers, in the
edge cases.  To my knowledge a deleted sysfs directory has never
returned nlink == 0.

Keeping compatibility is easy enough that it looks like it is worth
doing, but maintaining 30+ years of backwards compatibility is what
nlink >1 in unix filesystem directories is.  I don't see any practical
sense in keeping . and .. directories on disk or upping the unix
nlink directory count because of them.  To me it looks like just one
of those things you do.  Like hash directory entries so you can
have a big directory and still be able to have a 32bit offset you
can pass to lseek that is stable across renames and deletes.

>From the point of view of maintaining sysfs a 32bit nlink_t in sysfs is
too small.  It is wrong for sysfs to refuse to represent devices that
exist and I have heard of machines that have enough memory it possible
to create more than 2^32 network devices.  So sysfs must handle overflow
and sysfs must use the nlink == 1 in some cases.  I was just thinking
we would get better userspace test coverage if we don't bother to handle
the other cases.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] sysfs:  Compact sysfs_dirent s_flags into a byte.
  2012-03-08 21:34                           ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman
  2012-03-08 21:36                             ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman
@ 2012-03-08 22:28                             ` Greg Kroah-Hartman
  2012-03-09  2:49                               ` Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Greg Kroah-Hartman @ 2012-03-08 22:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro,
	Maciej Rutecki

On Thu, Mar 08, 2012 at 01:34:22PM -0800, Eric W. Biederman wrote:
> 
> In an effort to keep sysfs_dirent small used the smallest
> basic type I can for sysfs s_flags.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  fs/sysfs/sysfs.h |   10 +++++-----
>  1 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
> index 6289a00..c76c932 100644
> --- a/fs/sysfs/sysfs.h
> +++ b/fs/sysfs/sysfs.h
> @@ -76,7 +76,7 @@ struct sysfs_dirent {
>  		struct sysfs_elem_bin_attr	s_bin_attr;
>  	};
>  
> -	unsigned short		s_flags;
> +	unsigned char		s_flags;
>  	umode_t 		s_mode;
>  	unsigned int		s_ino;
>  	struct sysfs_inode_attrs *s_iattr;

Are you sure this saved you any real space here?  Have you use pahole (I
think that's the tool name) to determine if there are holes in the
structure?  Given that you moved this to a smaller variable, yet there
are pointers later on, odds are nothing changed at all overall.

Verification would be good to have, to see if this work was really worth
it, right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
  2012-03-08 22:18                             ` Eric W. Biederman
@ 2012-03-08 23:40                               ` Linus Torvalds
  0 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2012-03-08 23:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Jiri Slaby, Jiri Slaby, Alan Cox, LKML,
	Al Viro, Maciej Rutecki

On Thu, Mar 8, 2012 at 2:18 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Keeping compatibility is easy enough that it looks like it is worth
> doing, but maintaining 30+ years of backwards compatibility

Stop right there.

This is *not* about some arbitrary "30-year backwards compatibility".

This is about your patch BREAKING EXISTING BINARIES.

So stop the f*&^ing around already. The patch was shown to be broken,
stop making excuses, and stop blathering.

End of story. Binary compatibility is more important than *any* of
your patches. If you continue to argue anything else or making
excuses, I'm going to ask people to just ignore your patches entirely.

Seriously. Binary compatibility is *so* important that I do not want
to have anything to do with kernel developers who don't understand
that importance. If you continue to pooh-pooh the issue, you only show
yourself to be unreliable.  Don't do it.

Dammit, I'm continually surprised by the *idiots* out there that don't
understand that binary compatibility is one of the absolute top
priorities. The *only* reason for an OS kernel existing in the first
place is to serve user-space. The kernel has no relevance on its own.
Breaking existing binaries - and then not acknowledging how horribly
bad that was - is just about the *worst* offense any kernel developer
can do.

Because that shows that they don't understand what the whole *point*
of the kernel was after all. We're not masturbating around with some
research project.  We never were. Even when Linux was young, the whole
and only point was to make a *usable* system. It's why it's not some
crazy drug-induced microkernel or other random crazy thing.

Really.

                     Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] sysfs:  Compact sysfs_dirent s_flags into a byte.
  2012-03-08 22:28                             ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman
@ 2012-03-09  2:49                               ` Eric W. Biederman
  0 siblings, 0 replies; 75+ messages in thread
From: Eric W. Biederman @ 2012-03-09  2:49 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Slaby, Linus Torvalds, Jiri Slaby, Alan Cox, LKML, Al Viro,
	Maciej Rutecki

Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:

> On Thu, Mar 08, 2012 at 01:34:22PM -0800, Eric W. Biederman wrote:
>> 
>> In an effort to keep sysfs_dirent small used the smallest
>> basic type I can for sysfs s_flags.
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>  fs/sysfs/sysfs.h |   10 +++++-----
>>  1 files changed, 5 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
>> index 6289a00..c76c932 100644
>> --- a/fs/sysfs/sysfs.h
>> +++ b/fs/sysfs/sysfs.h
>> @@ -76,7 +76,7 @@ struct sysfs_dirent {
>>  		struct sysfs_elem_bin_attr	s_bin_attr;
>>  	};
>>  
>> -	unsigned short		s_flags;
>> +	unsigned char		s_flags;
>>  	umode_t 		s_mode;
>>  	unsigned int		s_ino;
>>  	struct sysfs_inode_attrs *s_iattr;
>
> Are you sure this saved you any real space here?  Have you use pahole (I
> think that's the tool name) to determine if there are holes in the
> structure?  Given that you moved this to a smaller variable, yet there
> are pointers later on, odds are nothing changed at all overall.
>
> Verification would be good to have, to see if this work was really worth
> it, right?

The next patch stuffs an 8bit nlink in the hole.  I was making room so
we don't have to grow sysfs_dirent.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead.
  2012-03-08 21:37                               ` [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead Eric W. Biederman
@ 2012-03-09  3:40                                 ` Linus Torvalds
  0 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2012-03-09  3:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Jiri Slaby, Jiri Slaby, Alan Cox, LKML,
	Al Viro, Maciej Rutecki

Is this safe?

What's the serialization with sysfs_link_sibling(), which you just made do

   sd->s_nlink++;
   if (!sd->s_nlink)
      sd->s_nlink = 1;

in your previous patch?

Yes, sysfs_link_sibling() runs under sysfs_mutex, but
sysfs_dentry_delete() does not.

So what protects sysfs_dentry_delete() from never seeing that bogus 0
due to the overflow?

                     Linus

On Thu, Mar 8, 2012 at 1:37 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Now that we have a nlink field in sysfs_dirent report deleted
> files and directories the traditional way with nlink == 0.
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  fs/sysfs/dir.c   |   11 +++++------
>  fs/sysfs/sysfs.h |    1 -
>  2 files changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 1526567..b5471d1 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -202,7 +202,7 @@ static void sysfs_deactivate(struct sysfs_dirent *sd)
>        DECLARE_COMPLETION_ONSTACK(wait);
>        int v;
>
> -       BUG_ON(!(sd->s_flags & SYSFS_FLAG_REMOVED));
> +       BUG_ON(sd->s_nlink != 0);
>
>        if (!(sysfs_type(sd) & SYSFS_ACTIVE_REF))
>                return;
> @@ -279,7 +279,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
>  static int sysfs_dentry_delete(const struct dentry *dentry)
>  {
>        struct sysfs_dirent *sd = dentry->d_fsdata;
> -       return !!(sd->s_flags & SYSFS_FLAG_REMOVED);
> +       return sd->s_nlink == 0;
>  }
>
>  static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd)
> @@ -294,7 +294,7 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, struct nameidata *nd)
>        mutex_lock(&sysfs_mutex);
>
>        /* The sysfs dirent has been deleted */
> -       if (sd->s_flags & SYSFS_FLAG_REMOVED)
> +       if (sd->s_nlink == 0)
>                goto out_bad;
>
>        /* The sysfs dirent has been moved? */
> @@ -534,7 +534,7 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
>  {
>        struct sysfs_inode_attrs *ps_iattr;
>
> -       BUG_ON(sd->s_flags & SYSFS_FLAG_REMOVED);
> +       BUG_ON(sd->s_nlink == 0);
>
>        sysfs_unlink_sibling(sd);
>
> @@ -546,7 +546,6 @@ void sysfs_remove_one(struct sysfs_addrm_cxt *acxt, struct sysfs_dirent *sd)
>        }
>
>        sd->s_nlink = 0;
> -       sd->s_flags |= SYSFS_FLAG_REMOVED;
>        sd->u.removed_list = acxt->removed;
>        acxt->removed = sd;
>  }
> @@ -946,7 +945,7 @@ static struct sysfs_dirent *sysfs_dir_pos(const void *ns,
>        struct sysfs_dirent *parent_sd, loff_t hash, struct sysfs_dirent *pos)
>  {
>        if (pos) {
> -               int valid = !(pos->s_flags & SYSFS_FLAG_REMOVED) &&
> +               int valid = (pos->s_nlink > 0) &&
>                        pos->s_parent == parent_sd &&
>                        hash == pos->s_hash;
>                sysfs_put(pos);
> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
> index 71f9bf7..db2c5c5 100644
> --- a/fs/sysfs/sysfs.h
> +++ b/fs/sysfs/sysfs.h
> @@ -98,7 +98,6 @@ struct sysfs_dirent {
>  #define SYSFS_NS_TYPE_SHIFT            4
>
>  #define SYSFS_FLAG_MASK                        ~(SYSFS_NS_TYPE_MASK|SYSFS_TYPE_MASK)
> -#define SYSFS_FLAG_REMOVED             0x080
>
>  static inline unsigned int sysfs_type(struct sysfs_dirent *sd)
>  {
> --
> 1.7.2.5
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2012-03-09  3:41 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby
2012-01-30 22:06 ` Greg KH
2012-01-30 22:10   ` Alan Cox
2012-01-30 22:27     ` Greg KH
2012-01-30 22:43       ` Al Viro
2012-01-30 22:56         ` Al Viro
2012-01-31  1:27       ` Eric W. Biederman
2012-01-31 10:48         ` Jiri Slaby
2012-01-31 12:44           ` Eric W. Biederman
2012-01-31 16:45             ` Linus Torvalds
2012-01-31 19:18               ` Al Viro
2012-02-01  5:06               ` Eric W. Biederman
2012-02-01 22:21                 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman
2012-02-01 22:24                   ` Greg Kroah-Hartman
2012-02-01 22:44                     ` Eric W. Biederman
2012-02-01 22:49                       ` Greg Kroah-Hartman
2012-02-01 22:31                   ` Dave Jones
2012-02-01 22:35                   ` Jiri Slaby
2012-02-01 23:15                   ` Linus Torvalds
2012-02-01 23:18                     ` Linus Torvalds
2012-02-02  1:22                       ` Al Viro
2012-02-02 21:24                         ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro
2012-02-02 23:46                           ` Linus Torvalds
2012-02-03  1:16                             ` Al Viro
2012-02-03  1:45                               ` Al Viro
2012-02-03  2:00                                 ` Linus Torvalds
2012-02-03 14:57                               ` Chris Mason
2012-02-03 17:08                               ` Al Viro
2012-02-03 19:34                                 ` Artem Bityutskiy
2012-02-06  8:50                                 ` Artem Bityutskiy
2012-02-06 13:56                                   ` Al Viro
2012-02-06 17:05                                     ` Artem Bityutskiy
2012-02-06 17:11                                       ` Al Viro
2012-02-07  7:21                                         ` Artem Bityutskiy
2012-02-06 22:49                               ` Dave Chinner
2012-02-03  8:25                           ` Andreas Dilger
2012-02-03 17:03                             ` Al Viro
2012-02-04  7:42                               ` Andreas Dilger
2012-03-05 13:30                       ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby
2012-03-05 16:09                         ` Greg Kroah-Hartman
2012-03-05 16:47                           ` Linus Torvalds
2012-03-08 21:05                             ` Greg Kroah-Hartman
2012-03-08 22:18                             ` Eric W. Biederman
2012-03-08 23:40                               ` Linus Torvalds
2012-03-08 21:28                           ` Eric W. Biederman
2012-03-08 21:34                           ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman
2012-03-08 21:36                             ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman
2012-03-08 21:37                               ` [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead Eric W. Biederman
2012-03-09  3:40                                 ` Linus Torvalds
2012-03-08 22:28                             ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman
2012-03-09  2:49                               ` Eric W. Biederman
2012-01-31  3:45     ` sysfs regression: wrong link counts Eric W. Biederman
2012-01-31 11:54       ` Alan Cox
2012-01-30 22:52 ` Kay Sievers
2012-01-31 10:41   ` network regression: cannot rename netdev twice Jiri Slaby
2012-01-31 10:52     ` Kay Sievers
2012-01-31 11:00       ` Jiri Slaby
2012-01-31 11:13         ` Kay Sievers
2012-01-31 11:17           ` Jiri Slaby
2012-01-31 11:58             ` Kay Sievers
2012-01-31 14:18               ` Eric W. Biederman
2012-01-31 14:40                 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman
2012-01-31 14:41                   ` Jiri Slaby
2012-01-31 14:55                   ` Greg KH
2012-02-04  2:14       ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh
2012-02-06 20:03         ` Kay Sievers
2012-02-08  2:00           ` Henrique de Moraes Holschuh
2012-02-08  3:50             ` Kay Sievers
2012-02-08  6:42               ` Valdis.Kletnieks
2012-02-08 10:57                 ` Kay Sievers
2012-02-08 20:06                   ` Valdis.Kletnieks
2012-02-08 20:27                     ` Stephen Hemminger
2012-02-08 23:48                     ` Kay Sievers
2012-01-31  1:32 ` sysfs regression: wrong link counts Eric W. Biederman
2012-02-01 18:29 ` Maciej Rutecki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).