Ok, as some of you are aware, one of the things that got merged very
early in the 3.4 merge window was the "word-at-a-time" filename lookup
patches I had been working on. They only get enabled on x86, but when
they do, they do speed things up by quite a noticeable bit (mainly on
x86-64, which ends up doing things 8 bytes at a time - it's much less
noticeable on x86-32).

Now that the name lookup itself is quite lean, that ended up showing
some serious problems in selinux and the security wrappers, some of
which were largely solved a few days ago (the recent "Merge branch
'selinux'" thing got rid of the most annoying cases).

I have a few more selinux fixes to really get the worst stupidity out
of the way, but I suspect they are for next merge window. But with
those, the path lookup is fast, and selinux isn't eating up *too* much
time either.

And the next thing that shows up is "strncpy_from_user()".

It's not all that surprising - the function really isn't particularly
smart, and using lodsb/stosb is just legacy cruft. But realistically,
while it hasn't been a piece of wonderful engineering, it also never
really showed up very high on profiles. Now with all the other
improvements I have a few loads where it really does show up pretty
well - the only real users of strncpy_from_user() are the pathname
copy code (sure, there's stuff elsewhere, but it's just not that
important).

Anyway, that function uses asms and is duplicated across x86-32 and
x86-64 for no real good reason except it was never a priority. Here's
a patch that actually removes more lines than it adds, because it
removes that duplication, and replaces the asm with some fairly
optimal C code. So we have

    6 files changed, 105 insertions(+), 145 deletions(-)

but what is perhaps more interesting (and the reason linux-arch is
cc'd) is that not only is it noticeably faster, it should also
actually get all the error cases right.

What do I mean by that?
I mean that in my tree (but not committed), I have actually removed
this nasty piece of code from fs/namei.c:

    diff --git a/fs/namei.c b/fs/namei.c
    index 0062dd17eb55..2b7d691523bb 100644
    --- a/fs/namei.c
    +++ b/fs/namei.c
    @@ -119,14 +119,7 @@
     static int do_getname(const char __user *filename, char *page)
     {
     	int retval;
    -	unsigned long len = PATH_MAX;
    -
    -	if (!segment_eq(get_fs(), KERNEL_DS)) {
    -		if ((unsigned long) filename >= TASK_SIZE)
    -			return -EFAULT;
    -		if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
    -			len = TASK_SIZE - (unsigned long) filename;
    -	}
    +	const unsigned long len = PATH_MAX;
     
     	retval = strncpy_from_user(page, filename, len);
     	if (retval > 0) {

which worked around the fact that strncpy_from_user() didn't really
get the end conditions quite right, and it really just did an
access_ok() on the first byte. The new x86 implementation should get
all of that right without any need for any ugly hacks.

Now, I haven't committed that change to do_getname(), but it *will*
get committed at some point. Probably not until the next merge window,
but I did want to let architecture maintainers know: you need to look
at your strncpy_from_user() implementations, and make sure they get
the "end of address space" case right, because the pathname code won't
hack around it for you any more if you don't.

The x86-"specific" strncpy_from_user() attached in the patch is
actually fairly generic. It does do the word-at-a-time tricks, but if
you don't want to bother with those, you can just remove that loop
entirely, and it should all work in general (the x86 code has a
fall-back to the byte-at-a-time case, so if you remove the "while (max
>= sizeof(unsigned long))" loop, it should all "just work").

And the word-at-a-time tricks it does do are actually simpler than the
ones the dcache does - just a subset of them - so you may decide that
you'll do that part too, even if the dcache ones are more complicated.
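The word loop with a byte-at-a-time fallback described above can be
sketched in plain user-space C roughly as follows. This is not the
code from the patch: copy_string_sketch() is a hypothetical name,
ordinary memory accesses stand in for the faulting user-space
accesses the kernel needs, and the caller is assumed to guarantee
that "max" bytes of the source are readable. Only the has_zero() bit
trick and the overall loop shape are taken from the text.

```c
#include <stddef.h>
#include <string.h>

/* 0x0101...01 and 0x8080...80 for whatever width unsigned long has. */
#define ONES  ((unsigned long)-1 / 0xff)
#define HIGHS (ONES * 0x80)

/* Nonzero if and only if at least one byte of 'a' is zero. */
static inline unsigned long has_zero(unsigned long a)
{
	return (a - ONES) & ~a & HIGHS;
}

/*
 * Hypothetical user-space stand-in for a strncpy_from_user()-style copy.
 * Copies up to 'max' bytes from src to dst, stopping after the NUL.
 * Returns the string length (excluding the NUL), or 'max' if no NUL was
 * found within 'max' bytes.
 * Assumption: src has at least 'max' readable bytes, dst has 'max'
 * writable bytes. The kernel version cannot assume this and has to
 * handle faults and the end of the address space instead.
 */
static long copy_string_sketch(char *dst, const char *src, size_t max)
{
	size_t res = 0;

	/* Word-at-a-time part: copy whole words that contain no NUL. */
	while (max >= sizeof(unsigned long)) {
		unsigned long a;

		memcpy(&a, src + res, sizeof(a));	/* unaligned-safe load */
		if (has_zero(a))
			break;				/* let the byte loop find it */
		memcpy(dst + res, &a, sizeof(a));
		res += sizeof(a);
		max -= sizeof(a);
	}

	/* Byte-at-a-time fallback: also works alone if the loop above is removed. */
	while (max) {
		char c = src[res];

		dst[res] = c;
		if (!c)
			return (long)res;
		res++;
		max--;
	}
	return (long)res;
}
```

Deleting the first loop leaves a correct (just slower) byte-wise copy,
which is the property that makes the structure easy to port.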
(For example, if you just implement the "has_zero()" function, you can
make it do

        if (has_zero(a))
                break;

to fall back to the byte-at-a-time model - that's what I did
originally in this patch too, but quite frankly, it does mean that the
word-at-a-time logic isn't as efficient).

Anyway, this works for me on x86 (you need current tip-of-git for the
thing) and is quite noticeable in profiles (well, if you have loads
that do a lot of pathname lookups: I use a microbenchmark that does
ten million lookups of a filename with multiple components, but I also
do "make -j" profiles of a fully built tree).

And considering that the old x86 strncpy_from_user() code was so ugly,
it's pretty much a no-brainer there. But on other architectures, I'd
still like to point out the problem with the end-of-address-space,
which you may or may not actually have on other architectures.

Comments?

                   Linus