Work (really slow directory access on ext4)

* Work (really slow directory access on ext4)
@ 2014-08-06 14:49 Theodore Ts'o
  2014-08-06 18:26 ` Arlie Stephens
  0 siblings, 1 reply; 12+ messages in thread
From: Theodore Ts'o @ 2014-08-06 14:49 UTC (permalink / raw)
  To: kernelnewbies

I don't subscribe to kernelnewbies, but I came across this thread in
the mail archive while researching an unrelated issue.

Valdis' observations are on the mark here.  It's almost certain that
you are getting overwhelmed with other disk traffic, because your
directory isn't *that* big.

That being said, there are certainly issues with really really big
directories, and solving this is certainly not going to be a newbie
project (if it was easy to solve, it would have been addressed a long
time ago).   See:

http://en.it-usenet.org/thread/11916/10367/

for the background.  It's a little bit dated, in that we do use a
64-bit hash on 64-bit systems, but the fundamental issues are still
there.

If you sort the readdir files by inode order, this can help
significantly.  Some userspace programs, such as mutt, do this.
Unfortunately "ls" does not.  (That might be a good newbie project,
since it's a userspace-only project.  However, I'm pretty sure the
shellutils maintainers will also react negatively if they are sent
patches which don't compile.  :-)

A proof of concept of how this can be a win can be found here:

http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c

LD_PRELOAD aren't guaranteed to work on all programs, so this is much
more of a hack than something I'd recommend for extended production
use.  But it shows that if you have a readdir+stat workload, sorting
by inode makes a huge difference.

As far as getting traces to better understand problems, I strongly
suggest that you try things like vmstat, iostat, and blktrace; system
call traces like strace aren't going to get you very far.  (See
http://brooker.co.za/blog/2013/07/14/io-performance.html for a nice
introduction to blktrace).  Use the scientific method; collect
baseline statistics using vmstat, iostat, sar, before you run your
test workload, so you know how much I/O is going on before you start
your test.  If you can run your test on a quiscient system, that's a
really good idea.  Then collect statistics as your run your workload,
and then only tweak one variable at a time, and record everything in a
systematic way.

Finally, if you have more problems of a technical nature with respect
to the ext4, there is the ext3-users at redhat.com list, or the
developer's list at linux-ext4 at vger.kernel.org.  It would be nice if
you tried the ext3-users or the kernel-newbies or tried googling to
see if anyone else has come across the problem and figured out the
solution already, but if you can't figure things out any other way, do
feel free to ask the linux-ext4 list.  We won't bite.  :-)

Cheers,

						- Ted

P.S.  If you have a large number of directories which are much larger
than you expect, and you don't want to do the "mkdir foo.new; mv foo/*
foo.new ; rmdir foo; mv foo.new foo" trick on a large number of
directories, you can also schedule downtime and while the file system
is unmounted, use "e2fsck -fD".  See the man page for more details.
It won't solve all of your problems, and it might not solve any of
your problem, but it will probably make the performance of large
directories somewhat better.

^ permalink raw reply	[flat|nested] 12+ messages in thread