On Jan 21, 2021, at 11:32, James Simmons wrote:

> One of the challenging issues for very large scale file systems is the
> performance crash when you cross a stripe count of about 670. This is
> due to the memory allocations going from kmalloc to vmalloc. Once you
> start to use vmalloc to allocate the ptlrpc message buffers, all the
> allocations start to serialize on a global spinlock. Looking for a
> solution, the best one I have found so far is the generic radix tree
> API. You have to allocate a page worth of data at a time, so it's
> clunky to use, but I think we could make it work. What do you think?
>
> https://www.kernel.org/doc/html/latest/core-api/generic-radix-tree.html

I think the first thing to figure out here is whether vmalloc() of the reply buffer is the problem for sure. 670 stripes is only a 16KB layout, which I'd think would be handled by kmalloc() almost all of the time, unless memory is very fragmented.

It would also be worthwhile to see whether the __GFP_ZERO of this allocation is a source of performance problems. While I wouldn't recommend disabling it to start, at least checking whether memset() shows up in the profile would be useful.

I definitely recall some strangeness in the MDT reply code for very large replies that may mean it is retrying the message if it is too large on the first send. There are probably some improvements to be made to commit v2_5_56_0-71-g006f258300 "LU-3338 llite: Limit reply buffer size" to better choose the reply buffer size if large layouts are commonly returned. I think it just bails and limits the layout buffer to a maximum of PAGE_SIZE or similar, instead of having a better algorithm.

In times gone by, we also had a patch to improve vmalloc() performance, but unfortunately it was rejected upstream because "we don't want developers using vmalloc(), and if the performance is bad they will avoid it", or similar. Now that kvmalloc() is a widely-used interface in the kernel, maybe improving vmalloc() performance is of interest again (i.e. removing the single global lock)? One simple optimization was to use the vmalloc address space linearly from start to end, instead of trying to have a "smart" usage of the address space. It is 32TiB in size, so it takes a while to exhaust (probably several days under normal usage), so the first pass through is "free".

It isn't clear to me what your goal with the radix tree is. Do you intend to replace the vmalloc() usage in Lustre with a custom memory allocator based on it, or is the goal to optimize the kernel's vmalloc() allocation using the radix tree code? I think the use of a custom memory allocator in Lustre would be far nastier than many of the things that are raised as objections to upstream inclusion, so it would be a step backward. Optimizing vmalloc() in upstream kernels (if the changes are accepted) would be a better use of time. For the few sites that have many OSTs, they can afford a kernel patch on the client (likely they have a custom kernel from their system vendor anyway), and the other 99% of users will not need it.

I think a more practical approach might be to have a pool of preallocated reply buffers (allocated with vmalloc()) that is kept on the client. That would avoid the overhead of vmalloc/vfree each time, and would not need intrusive code changes. In the likely case of a small layout for a file (even if _some_ layouts are very large), the saved RPC replay buffer can be kmalloc'd normally and copied over.
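Going back to the allocation cliff you describe, the fallback in question is essentially kvmalloc()-style logic, something like the following (a simplified sketch for illustration only, not the actual Lustre allocation code):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* simplified sketch of the kmalloc-to-vmalloc fallback */
    static void *alloc_msg_buffer(size_t size)
    {
            void *buf;

            /* small buffers: physically contiguous, scalable slab path */
            buf = kmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_NORETRY);
            if (buf)
                    return buf;

            /* large or fragmented case: vmalloc() maps individual pages,
             * but the allocations serialize on a global lock, which is
             * where the performance cliff shows up */
            return vmalloc(size);
    }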
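For the reply-buffer pool, I'm thinking of something along these lines (all of the names here are hypothetical, just to show the shape of the idea, not existing Lustre code):

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* hypothetical pool of reusable large reply buffers */
    struct reply_buf {
            struct list_head  rb_link;
            size_t            rb_size;
            void             *rb_data;    /* vmalloc()'d once, then reused */
    };

    static LIST_HEAD(reply_buf_pool);
    static DEFINE_SPINLOCK(reply_buf_lock);
    static unsigned int reply_buf_count;
    #define REPLY_BUF_POOL_MAX 32         /* arbitrary cap for the sketch */

    static struct reply_buf *reply_buf_get(size_t size)
    {
            struct reply_buf *rb;

            /* fast path: reuse a pooled buffer that is big enough */
            spin_lock(&reply_buf_lock);
            list_for_each_entry(rb, &reply_buf_pool, rb_link) {
                    if (rb->rb_size >= size) {
                            list_del(&rb->rb_link);
                            reply_buf_count--;
                            spin_unlock(&reply_buf_lock);
                            return rb;
                    }
            }
            spin_unlock(&reply_buf_lock);

            /* slow path: pay the vmalloc() cost only when nothing fits */
            rb = kmalloc(sizeof(*rb), GFP_NOFS);
            if (!rb)
                    return NULL;
            rb->rb_size = size;
            rb->rb_data = vmalloc(size);
            if (!rb->rb_data) {
                    kfree(rb);
                    return NULL;
            }
            return rb;
    }

    static void reply_buf_put(struct reply_buf *rb)
    {
            /* keep a bounded number of buffers instead of vfree()ing */
            spin_lock(&reply_buf_lock);
            if (reply_buf_count < REPLY_BUF_POOL_MAX) {
                    list_add(&rb->rb_link, &reply_buf_pool);
                    reply_buf_count++;
                    rb = NULL;
            }
            spin_unlock(&reply_buf_lock);

            if (rb) {
                    vfree(rb->rb_data);
                    kfree(rb);
            }
    }

A get would first try to reuse a pooled buffer that is large enough, and only fall back to vmalloc() when nothing suitable is cached, so the vmalloc/vfree cost is paid once per buffer rather than once per RPC.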
I don't think there will be real-world workloads where a client is keeping thousands of different files open with huge layouts, so it is likely that the number of large buffers in the reply pool will be relatively small.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud