linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jiri Slaby <jslaby@suse.cz>, Greg KH <greg@kroah.com>,
	Alan Cox <alan@lxorguk.ukuu.org.uk>,
	LKML <linux-kernel@vger.kernel.org>,
	Maciej Rutecki <maciej.rutecki@gmail.com>
Subject: Re: [PATCH] sysfs: Optionally count subdirectories to support buggy applications
Date: Thu, 2 Feb 2012 01:22:58 +0000	[thread overview]
Message-ID: <20120202012258.GQ23916@ZenIV.linux.org.uk> (raw)
In-Reply-To: <CA+55aFzZX544ZDN9vN3jWMWZ=_9ZtpZ9cR6gNEzUnx9RCqR5LQ@mail.gmail.com>

On Wed, Feb 01, 2012 at 03:18:05PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 3:15 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > No extra "keep track of inode counts by hand" crap, and no idiotic
> > config options that just make it easy to (conditionally) get things
> > wrong. Just do it right, and do it *unconditionally* right.
> 
> And btw, "nlink shows number of subdirectories" for a directory entry
> really *is* right. It's how Unix filesystems work, like it or not.
> 
> It's mainly lazy/bad filesystems that set nlink to 1. So the whole
> "nlink==1" case is meant for crap like FAT etc, not for a filesystem
> that we control and that could easily just do it right.

It's a bit more complicated than that and it had always been a bit of
a minefield.  v6: link(2) began with
        ip = namei(&uchar, 0);
        if(ip == NULL)
                return;
        if(ip->i_nlink >= 127) {
                u.u_error = EMLINK;
                goto out;
        }
(and mkdir(), of course, was implemented via mknod+link).  Up to 127 links
to any object.  Fine; v7: EMLINK defined, but _never_ returned.  Moreover,
mkdir(1) didn't bother to check link counts either.  Result: 65535 calls
of link(2) and you've got yourself an fs corruption on i_nlink overflow
(it was 16bit in on-disk inode).

Linux implementation of link(2) had exactly the same bug as that of v7 until
0.98.3 (for link(2)) and 0.99.1 (for mkdir(2), rename(2) - didn't exist as
syscalls on v7, but had the same problem as soon as they made it in kernel
at some point in BSD history).  Note that limit for minixfs remained
ridiculously low until '97 or so; for ext2 it was 32000 from the very
beginning but note that e.g. ext had _not_ acquired fixes for link overflows
on mkdir() at all - it had that hole all the way until removal in 2.1.26.
Of course, there was a (completely useless) POSIX-mandated LINK_MAX, but
since the actual limit depends on fs type, well...  the checks remain
in ->link()/->mkdir()/->rename() instances and IIRC I've caught a bunch of
overflows in those circa 2.1.very_late.  

As the result, old stat(2) had 16bit st_nlink - same as v7.  Nobody needed
more, after all.  Alpha port got it 32bit (inherited the struct stat layout
from OSF, presumably?), but other early ports kept 16bit there.

At some point folks started whining about wanting more subdirectories and
that's when the things began to get really convoluted and ugly.  By that
time we had a variant of stat(2) that used 32bit st_nlink, but on-disk
layouts remained a problem.  Some filesystems went for more or less
reasonable "the limit is circa 2^32, EMLINK if we get more than that and
-EOVERFLOW on old_stat(2) if it doesn't fit into 16 bits".  However, it
was _not_ universally true - e.g. reiserfs does _NOT_ do that for mkdir().
There we get "directory link count grows up to 16bit limit and gets stuck
at 1 if it ever grows past that", on the theory that st_nlink == 1
is distinguishable from anything you'd normally get for a directory.  I
have no idea where that invention has come from, but it's been around for
more than a decade.  Of course, Hans being Hans, it had been advertised as
major reiserfs feature - you could get an unlimited amount of subdirectories,
which no other fs on Linux would allow, nevermind that nobody sane would
actually make use of that...

At about the same time somebody had done the same trick on ext2 - Daniel,
probably, since it had been floating in directory index patchset.  It never
got merged into mainline; ext3 port _did_, but without those bits actually
used (EXT3_DIR_LINK_MAX is defined, but never used).  ext4 actually has it
hooked in.

I don't see anything else other than ext4 and reiserfs using that convention,
but then just grepping for inc_nlink/inode_inc_use_count shows a bloody
mess - e.g. jffs2 does not check overflows (32bit there) at all, neither on
link() nor on mkdir()/rename().  It _might_ get away with that (with
non-standard errno, if so), but I don't remember that code well enough
to tell at the quick look.  HFS+ doesn't seem to check for overflows either
and while it doesn't track link count on directories, it does support link(2)
and it does bump i_nlink there.  A _lot_ of filesystems (starting with
ramfs) assumes that we'll get an OOM before we get to 2^32 links to an
inode on purely in-core filesystem; reasonable back in 2001 or so, but
not these days...

  reply	other threads:[~2012-02-02  1:23 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-30 21:56 sysfs regression: wrong link counts Jiri Slaby
2012-01-30 22:06 ` Greg KH
2012-01-30 22:10   ` Alan Cox
2012-01-30 22:27     ` Greg KH
2012-01-30 22:43       ` Al Viro
2012-01-30 22:56         ` Al Viro
2012-01-31  1:27       ` Eric W. Biederman
2012-01-31 10:48         ` Jiri Slaby
2012-01-31 12:44           ` Eric W. Biederman
2012-01-31 16:45             ` Linus Torvalds
2012-01-31 19:18               ` Al Viro
2012-02-01  5:06               ` Eric W. Biederman
2012-02-01 22:21                 ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Eric W. Biederman
2012-02-01 22:24                   ` Greg Kroah-Hartman
2012-02-01 22:44                     ` Eric W. Biederman
2012-02-01 22:49                       ` Greg Kroah-Hartman
2012-02-01 22:31                   ` Dave Jones
2012-02-01 22:35                   ` Jiri Slaby
2012-02-01 23:15                   ` Linus Torvalds
2012-02-01 23:18                     ` Linus Torvalds
2012-02-02  1:22                       ` Al Viro [this message]
2012-02-02 21:24                         ` [RFC] killing boilerplate checks in ->link/->mkdir/->rename Al Viro
2012-02-02 23:46                           ` Linus Torvalds
2012-02-03  1:16                             ` Al Viro
2012-02-03  1:45                               ` Al Viro
2012-02-03  2:00                                 ` Linus Torvalds
2012-02-03 14:57                               ` Chris Mason
2012-02-03 17:08                               ` Al Viro
2012-02-03 19:34                                 ` Artem Bityutskiy
2012-02-06  8:50                                 ` Artem Bityutskiy
2012-02-06 13:56                                   ` Al Viro
2012-02-06 17:05                                     ` Artem Bityutskiy
2012-02-06 17:11                                       ` Al Viro
2012-02-07  7:21                                         ` Artem Bityutskiy
2012-02-06 22:49                               ` Dave Chinner
2012-02-03  8:25                           ` Andreas Dilger
2012-02-03 17:03                             ` Al Viro
2012-02-04  7:42                               ` Andreas Dilger
2012-03-05 13:30                       ` [PATCH] sysfs: Optionally count subdirectories to support buggy applications Jiri Slaby
2012-03-05 16:09                         ` Greg Kroah-Hartman
2012-03-05 16:47                           ` Linus Torvalds
2012-03-08 21:05                             ` Greg Kroah-Hartman
2012-03-08 22:18                             ` Eric W. Biederman
2012-03-08 23:40                               ` Linus Torvalds
2012-03-08 21:28                           ` Eric W. Biederman
2012-03-08 21:34                           ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Eric W. Biederman
2012-03-08 21:36                             ` [PATCH 2/3] sysfs: Maintain usable nlink directory counts Eric W. Biederman
2012-03-08 21:37                               ` [PATCH 3/3] sysfs: Remove SYSFS_FLAG_REMOVED and use sd->s_nlink == 0 instead Eric W. Biederman
2012-03-09  3:40                                 ` Linus Torvalds
2012-03-08 22:28                             ` [PATCH 1/3] sysfs: Compact sysfs_dirent s_flags into a byte Greg Kroah-Hartman
2012-03-09  2:49                               ` Eric W. Biederman
2012-01-31  3:45     ` sysfs regression: wrong link counts Eric W. Biederman
2012-01-31 11:54       ` Alan Cox
2012-01-30 22:52 ` Kay Sievers
2012-01-31 10:41   ` network regression: cannot rename netdev twice Jiri Slaby
2012-01-31 10:52     ` Kay Sievers
2012-01-31 11:00       ` Jiri Slaby
2012-01-31 11:13         ` Kay Sievers
2012-01-31 11:17           ` Jiri Slaby
2012-01-31 11:58             ` Kay Sievers
2012-01-31 14:18               ` Eric W. Biederman
2012-01-31 14:40                 ` [PATCH] sysfs: Update the name hash when renaming sysfs entries Eric W. Biederman
2012-01-31 14:41                   ` Jiri Slaby
2012-01-31 14:55                   ` Greg KH
2012-02-04  2:14       ` network regression: cannot rename netdev twice Henrique de Moraes Holschuh
2012-02-06 20:03         ` Kay Sievers
2012-02-08  2:00           ` Henrique de Moraes Holschuh
2012-02-08  3:50             ` Kay Sievers
2012-02-08  6:42               ` Valdis.Kletnieks
2012-02-08 10:57                 ` Kay Sievers
2012-02-08 20:06                   ` Valdis.Kletnieks
2012-02-08 20:27                     ` Stephen Hemminger
2012-02-08 23:48                     ` Kay Sievers
2012-01-31  1:32 ` sysfs regression: wrong link counts Eric W. Biederman
2012-02-01 18:29 ` Maciej Rutecki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120202012258.GQ23916@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=ebiederm@xmission.com \
    --cc=greg@kroah.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jslaby@suse.cz \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maciej.rutecki@gmail.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).