On Jan 21, 2021, at 11:32, James Simmons wrote:

> One of the challenging issues for very large scale file systems is the
> performance crash when you cross a stripe count of about 670. This is
> due to the memory allocations going from kmalloc to vmalloc. Once you
> start to use vmalloc to allocate the ptlrpc message buffers, all the
> allocations start to serialize on a global spinlock. Looking for a
> solution, the best one I have found so far is the generic radix tree
> API. You have to allocate a page worth of data at a time, so it's
> clunky to use, but I think we could make it work. What do you think?
>
> https://www.kernel.org/doc/html/latest/core-api/generic-radix-tree.html

I think the first thing to figure out here is whether vmalloc() of the reply buffer is the problem for sure. 670 stripes is only a 16KB layout, which I'd think would be handled by kmalloc() almost all of the time, unless memory is very fragmented.

It would also be worthwhile to see whether the __GFP_ZERO of this allocation is a source of performance problems. While I wouldn't recommend disabling it to start, at least checking whether memset() shows up in the profile would be useful.

I definitely recall some strangeness in the MDT reply code for very large replies that may mean it is retrying the message if it is too large on the first send. There are probably some improvements to be made to commit v2_5_56_0-71-g006f258300 "LU-3338 llite: Limit reply buffer size" to better choose the reply buffer size if large layouts are commonly returned. I think it just bails and limits the layout buffer to a maximum of PAGE_SIZE or similar, instead of having a better algorithm.

In times gone by, we also had a patch to improve vmalloc() performance, but unfortunately it was rejected upstream because "we don't want developers using vmalloc(), and if the performance is bad they will avoid it", or similar. Now that kvmalloc() is a widely-used interface in the kernel, maybe improving vmalloc() performance is of interest again (i.e. removing the single global lock)? One simple optimization was to use the vmalloc address space linearly from start to end, instead of trying to have a "smart" usage of the address space. It is 32TiB in size, so it takes a while to exhaust (probably several days under normal usage), so the first pass through is "free".

It isn't clear to me what your goal with the radix tree is. Do you intend to replace the vmalloc() usage in Lustre with a custom memory allocator based on it, or is the goal to optimize the kernel's vmalloc() allocation using the radix tree code? I think the use of a custom memory allocator in Lustre would be far nastier than many of the things that are raised as objections to upstream inclusion, so it would be a step backward. Optimizing vmalloc() in upstream kernels (if the changes are accepted) would be a better use of time. For the few sites that have many OSTs, they can afford a kernel patch on the client (likely they have a custom kernel from their system vendor anyway), and the other 99% of users will not need it.

I think a more practical approach might be to have a pool of preallocated reply buffers (allocated with vmalloc()) that is kept on the client. That would avoid the overhead of vmalloc/vfree each time, and would not need intrusive code changes. In the likely case of a small layout for a file (even if _some_ layouts are very large), the saved RPC replay buffer can be kmalloc'd normally and copied over.
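Going back to the allocation cliff you describe, the fallback in question is essentially kvmalloc()-style logic, something like the following (a simplified sketch for illustration only, not the actual Lustre allocation code):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* simplified sketch of the kmalloc-to-vmalloc fallback */
    static void *alloc_msg_buffer(size_t size)
    {
            void *buf;

            /* small buffers: physically contiguous, scalable slab path */
            buf = kmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_NORETRY);
            if (buf)
                    return buf;

            /* large or fragmented case: vmalloc() maps individual pages,
             * but the allocations serialize on a global lock, which is
             * where the performance cliff shows up */
            return vmalloc(size);
    }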
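For the reply-buffer pool, I'm thinking of something along these lines (all of the names here are hypothetical, just to show the shape of the idea, not existing Lustre code):

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    /* hypothetical pool of reusable large reply buffers */
    struct reply_buf {
            struct list_head  rb_link;
            size_t            rb_size;
            void             *rb_data;    /* vmalloc()'d once, then reused */
    };

    static LIST_HEAD(reply_buf_pool);
    static DEFINE_SPINLOCK(reply_buf_lock);
    static unsigned int reply_buf_count;
    #define REPLY_BUF_POOL_MAX 32         /* arbitrary cap for the sketch */

    static struct reply_buf *reply_buf_get(size_t size)
    {
            struct reply_buf *rb;

            /* fast path: reuse a pooled buffer that is big enough */
            spin_lock(&reply_buf_lock);
            list_for_each_entry(rb, &reply_buf_pool, rb_link) {
                    if (rb->rb_size >= size) {
                            list_del(&rb->rb_link);
                            reply_buf_count--;
                            spin_unlock(&reply_buf_lock);
                            return rb;
                    }
            }
            spin_unlock(&reply_buf_lock);

            /* slow path: pay the vmalloc() cost only when nothing fits */
            rb = kmalloc(sizeof(*rb), GFP_NOFS);
            if (!rb)
                    return NULL;
            rb->rb_size = size;
            rb->rb_data = vmalloc(size);
            if (!rb->rb_data) {
                    kfree(rb);
                    return NULL;
            }
            return rb;
    }

    static void reply_buf_put(struct reply_buf *rb)
    {
            /* keep a bounded number of buffers instead of vfree()ing */
            spin_lock(&reply_buf_lock);
            if (reply_buf_count < REPLY_BUF_POOL_MAX) {
                    list_add(&rb->rb_link, &reply_buf_pool);
                    reply_buf_count++;
                    rb = NULL;
            }
            spin_unlock(&reply_buf_lock);

            if (rb) {
                    vfree(rb->rb_data);
                    kfree(rb);
            }
    }

A get would first try to reuse a pooled buffer that is large enough, and only fall back to vmalloc() when nothing suitable is cached, so the vmalloc/vfree cost is paid once per buffer rather than once per RPC.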
I don't think there will be real-world workloads where a client is keeping thousands of different files open with huge layouts, so it is likely that the number of large buffers in the reply pool will be relatively small.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud