From: David Hildenbrand <david@redhat.com>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>,
	linux-modules@vger.kernel.org, linux-kernel@vger.kernel.org,
	pmladek@suse.com, petr.pavlu@suse.com, prarit@redhat.com,
	christophe.leroy@csgroup.eu, song@kernel.org,
	torvalds@linux-foundation.org, dave@stgolabs.net,
	fan.ni@samsung.com, vincent.fu@samsung.com,
	a.manzanares@samsung.com, colin.i.king@gmail.com
Subject: Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
Date: Thu, 30 Mar 2023 06:42:14 +0200	[thread overview]
Message-ID: <4fa825ac-707d-ad38-0af7-937a2c1e75b7@redhat.com> (raw)
In-Reply-To: <ZCPNONX3O3bp2Om1@bombadil.infradead.org>

On 29.03.23 07:31, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
>>
>> So we're a bit faster (0.2 -- 0.7s) than the original version without the
>> rcu patch (~6s).
> 
> Groovy.

Just to clarify, it was up to 0.7s faster (I phrased it in a confusing way).

> 
>>> The commit log needs updating to reflect the results I just collected:
>>>
>>> With the alloc patch ("module: avoid allocation if module is already
>>> present and ready") I see 145 MiB in memory difference in comparison
>>> to its last patch, "module: extract patient module check into helper".
>>> So I think that's a clear keeper and should help large CPU count boots.
>>>
>>> The patch "module: add concurrency limiter", which puts the concurrency
>>> limiter on the kread, only saves about 2 MiB with 100 stress-ng ops,
>>> which seems to be what I needed to reproduce your original 400 CPU
>>> count issue.
>>>
>>> The program used to reproduce is stress-ng with the new module option:
>>>
>>> echo 0 > /proc/sys/vm/oom_dump_tasks
>>> ./stress-ng --module 100 --module-name xfs
>>
>> With nfs (but also ext4), the above command fills the kernel log for me with:
>>
>> ...
>> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
>> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
>> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
>> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
>> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
>> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
>> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
>> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
>> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
>> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
>> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
>> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
>> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
>> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
>> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
>> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
>> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
>> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
>> ...
>>
>>
>> I do *not* get these errors on manual modprobe/rmmod. A bug in the
>> concurrent handling, or just a side effect of the concurrent loading?
> 
> It is just because modprobe deals with module dependencies; stress-ng
> doesn't, it just calls finit_module() and expects dependencies to
> already be loaded.

Oh, right. That makes sense.
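
(As an aside, pre-loading the dependency chain once should silence that
noise, since rmmod removes only the named module and leaves its
dependencies loaded. A quick sketch, assuming nfs is the module under
test:

   modprobe nfs    # pulls in the dependencies (sunrpc, lockd, fscache, ...)
   rmmod nfs       # removes only nfs itself; its dependencies stay loaded
   ./stress-ng --module 100 --module-name nfs
)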

> 
>>> To see how much max memory I use, I just use:
>>>
>>> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
>>>
>>> Run the test in another window, CTRL-C the test when the above
>>> finishes after 40 seconds, and then:
>>>
>>> sort -n -r foo.log  | head -1
>>
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
>> 14254024
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
>> 12862528
>>
>> So 1391496 (KiB I assume, so ~1.3 GiB!?)
> 
> That is sadly correct.
> 
>> difference compared to before the
>> test (I first start capturing and then run stress-ng).
> 
> That's a good chunk :-D
> 
> On a recent thread I did a full analysis of average module sizes [0];
> you can use that to also get an average size estimate of your currently
> loaded modules. Do a bit of math and that could give you the average
> memory pressure from real vmap allocations. For a synthetic test as with
> stress-ng modules we'd be doing twice the memory for a successful load,
> and only a one-time allocation due to the kread for a failed load, since
> we always allocate memory even for bogus modules on the kread side.
> 
> My xfs.ko is 42M on my test guest system, with 100 concurrent threads
> doing two allocations each that puts the expected vmap allocation at
> around:
> 
> 42*2*100
> 8400
> 8400/1024
> 8.20312500000000000000
> 
> 8.2 GiB
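
(To plug in the actual on-disk module size rather than a round figure --
a small sketch, assuming modinfo is available; note that a compressed
.ko will understate the in-memory size:

   du -m $(modinfo -n xfs)
)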
> 
> So in the worst case that's what we'd see. Since I saw the patches
> helped overall by around 145 MiB, let's see what that means. Let me
> re-test:
> 
> Idle memory (note: free -k uses kibibytes, KiB):
> 
> root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r  | head -1
> 416452
> 
> So while idle, the system is using up 416452 KiB, so 406 MiB.
> 
> While running stress-ng with 100 module ops
> 
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs
> root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
> 4886492
> 
> 4886492/1024
> 4771.96484375000000000000
> 4886492/1024/1024
> 4.66012191772460937500
> 
> So during the stress test the highest memory usage we saw was about
> 4771.96 MiB, or ~4.66 GiB.
> 
> What's the difference between idle and after the stress test:
> 
> 4886492 - 416452
> 4470040
> 4470040/1024
> 4365.27343750000000000000
> 4470040/1024/1024
> 4.26296234130859375000
> 
> So about 4365.27 MiB or 4.26 GiB.
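
(For what it's worth, that delta can also be computed straight from the
captured log, assuming the samples file (foo.log above) holds the free -k
output and the capture started while the system was still idle -- a small
awk sketch:

   awk 'NR==1 {min=max=$1}
        {if ($1>max) max=$1; if ($1<min) min=$1}
        END {printf "%.2f GiB\n", (max-min)/1024/1024}' foo.log
)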
> 
> Now, the upper limit for the test should have been a delta of 8.2 GiB
> and we get about 4.26 GiB, so it means we're rejecting roughly half of
> the requests. Why only roughly half and not more? I only see 40
> successful loads of XFS during a 40-second window:
> 
>    dmesg -c > /dev/null
>    # run tests
>    dmesg -c | grep XFS | wc -l
> 
> With 100 ops all running finit_module() for 40 seconds, those successful
> loads should only have consumed about 40 * 2 * the size of XFS (42 MiB):
> 3360 MiB or 3.28 GiB. But only one module can be loaded at any point in
> time, so at the *worst* consumption point an XFS load in this test should
> only be consuming 2 * size_of_XFS (42 MiB), so 84 MiB.
> 
> 4365.27 - 84
> 4281.27
> 
> So about 4281.27 MiB (4.18 GiB) is consumed by the 98.07% of module
> loads that failed, due to the kread*() calls from finit_module().
> 
> And get this... *if* you use module compression, decompression also
> uses vmap() *after* the kernel_read*() call, which uses vmalloc(). *At
> least* in that case we immediately free the buffer for the compressed
> module, but *still* -- that is 3 possible vmap space allocations for
> every module!

:/
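
(Whether that third vmap allocation applies on a given system should be
visible from the kernel config -- a hedged check, assuming the distro
ships the config under /boot; CONFIG_MODULE_DECOMPRESS is the in-kernel
decompression path:

   grep -E 'CONFIG_MODULE_(COMPRESS|DECOMPRESS)' /boot/config-$(uname -r)
)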

> 
> It'd be hard, but not impossible, to collect stats for failed
> finit_module() calls. *That* could be indicative of areas in the kernel
> we need to brush up on to stop doing stupid things, like we learned
> about for the CPU frequency scaling modules. As I wrote this paragraph
> I realized that this *is* what we really have wanted all along to
> help us debug these stupid things, so we can slowly learn where to
> point fingers to help optimize things. Although I recently did some
> tool scraping to collect stats and *wished* for this from userspace [0],
> it wasn't hard to just add a debug option for this to help us debug
> these failures. So I added support for that and finally sent a new
> patch series.

Makes sense to me.
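
(Until that debug option is available everywhere, one way to approximate
such stats from userspace -- a rough sketch, assuming bpftrace and the
syscall tracepoints are available -- is to count failed finit_module()
returns by errno:

   bpftrace -e 'tracepoint:syscalls:sys_exit_finit_module
       /args->ret < 0/ { @fails[-args->ret] = count(); }'
)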


I haven't had time to look at the code yet (currently traveling); I hope
that the KASAN vmap issues are at least history.

I'll do another test with debug kernels using your latest version on 
that machine before I have to hand it back.

Thanks!

-- 
Thanks,

David / dhildenb


