Ok, as some of you are aware, one of the things that got merged very
early in the 3.4 merge window was the "word-at-a-time" filename lookup
patches I had been working on. They only get enabled on x86, but when
they do, they do speed things up by quite a noticeable bit (mainly on
x86-64, which ends up doing things 8 bytes at a time - it's much less
noticeable on x86-32).

Now that the name lookup itself is quite lean, that ended up showing
some serious problems in selinux and the security wrappers, some of
which were largely solved a few days ago (the recent "Merge branch
'selinux'" thing got rid of the most annoying cases).

I have a few more selinux fixes to really get the worst stupidity out
of the way, but I suspect they are for next merge window.  But with
those, the path lookup is fast, and selinux isn't eating up *too* much
time either.

And the next thing that shows up is "strncpy_from_user()".

It's not all that surprising - the function really isn't particularly
smart, and using lodsb/stosb is just legacy cruft. But realistically,
while it hasn't been a piece of wonderful engineering, it also never
really showed up very high on profiles. Now with all the other
improvements I have a few loads where it really does show up pretty
well - the only real users of strncpy_from_user() are the pathname
copy code (sure, there's stuff elsewhere, but it's just not that
important).

Anyway, that function uses asms and is duplicated across x86-32 and
x86-64 for no real good reason except it was never a priority.

Here's a patch that actually removes more lines than it adds, because
it removes that duplication, and replaces the asm with some fairly
optimal C code. So we have

 6 files changed, 105 insertions(+), 145 deletions(-)

but what is perhaps more interesting (and the reason linux-arch is
cc'd) is that not only is it noticeably faster, it should also
actually get all the error cases right.

What do I mean by that?

I mean that in my tree (but not committed), I have actually removed
this nasty piece of code from fs/namei.c:

  diff --git a/fs/namei.c b/fs/namei.c
  index 0062dd17eb55..2b7d691523bb 100644
  --- a/fs/namei.c
  +++ b/fs/namei.c
  @@ -119,14 +119,7 @@
   static int do_getname(const char __user *filename, char *page)
   {
          int retval;
  -       unsigned long len = PATH_MAX;
  -
  -       if (!segment_eq(get_fs(), KERNEL_DS)) {
  -               if ((unsigned long) filename >= TASK_SIZE)
  -                       return -EFAULT;
  -               if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
  -                       len = TASK_SIZE - (unsigned long) filename;
  -       }
  +       const unsigned long len = PATH_MAX;

          retval = strncpy_from_user(page, filename, len);
          if (retval > 0) {

which worked around the fact that strncpy_from_user() didn't really
get the end conditions quite right, and it really just did an
access_ok() on the first byte. The new x86 implementation should get
all of that right without any need for any ugly hacks.

Now, I haven't committed that change to do_getname(), but it *will*
get committed at some point. Probably not until the next merge window,
but I did want to let architecture maintainers know: you need to look
at your strncpy_from_userspace(), and make sure they get the "end of
address space" case right, because the pathname code won't hack around
it for you any more if you don't.

The x86-"specific" strncpy_from_user() attached in the patch is
actually fairly generic. It does do the word-at-a-time tricks, but if
you don't want to bother with those, you can just remove that loop
entirely, and it should all work in general (the x86 code has a
fall-back to the byte-at-a-time case, so if you remove the "while (max
>= sizeof(unsigned long))" loop, it should all "just work".

And the word-at-a-time tricks it does do are actually simpler than the
ones the dcache does - just a subset of them - so you may decide that
you'll do that part too, even if the dcache ones are more complicated.

(For example, if you just implement the "has_zero()" function, you can
make it do

    if (has_zero(a))
        break;

to fall back to the byte-at-a-time model - that's what I did
originally in this patch too, but quite frankly, it does mean that the
word-at-a-time logic isn't as efficient).

Anyway, this works for me on x86 (you need current tip-of-git for the
<asm/word-at-a-time.h> thing) and is quite noticeable in profiles
(well, if you have loads that do a lot of pathname lookups: I use a
microbenchmarks that does ten million lookups of a filename with
multiple components, but I also do "make -j" profiles of a fully build
tree).

And considering that the old x86 strncpy_to_user() code was so ugly,
it's pretty much a no-brainer there. But on other architectures, I'd
still like to point out the problem with the end-of-address-space,
which you may or may not actually have on other architectures.

Comments?

                         Linus