From: David Hildenbrand <david@redhat.com>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>,
	linux-modules@vger.kernel.org, linux-kernel@vger.kernel.org,
	pmladek@suse.com, petr.pavlu@suse.com, prarit@redhat.com,
	christophe.leroy@csgroup.eu, song@kernel.org,
	torvalds@linux-foundation.org, dave@stgolabs.net,
	fan.ni@samsung.com, vincent.fu@samsung.com,
	a.manzanares@samsung.com, colin.i.king@gmail.com
Subject: Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations
Date: Thu, 30 Mar 2023 06:42:14 +0200	[thread overview]
Message-ID: <4fa825ac-707d-ad38-0af7-937a2c1e75b7@redhat.com> (raw)
In-Reply-To: <ZCPNONX3O3bp2Om1@bombadil.infradead.org>

On 29.03.23 07:31, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
>>
>> So we're a bit faster (0.2 -- 0.7s) than the original version without the
>> rcu patch (~6s).
> 
> Groovy.

Just to clarify, it was up to 0.7s faster (I phrased it in a confusing way).

> 
>>> The commit log needs updating to reflect the results I just collected:
>>>
>>> With the alloc patch ("module: avoid allocation if module is already
>>> present and ready") I see 145 MiB in memory difference in comparison
>>> to its last patch, "module: extract patient module check into helper".
>>> So I think that's a clear keeper and should help large CPU count boots.
>>>
>>> The patch "module: add concurrency limiter", which puts the concurrency
>>> limiter on the kread, only saves about 2 MiB with 100 stress-ng ops,
>>> which seems to be what I needed to reproduce your original 400 CPU
>>> count issue.
>>>
>>> The program used to reproduce is stress-ng with the new module option:
>>>
>>> echo 0 > /proc/sys/vm/oom_dump_tasks
>>> ./stress-ng --module 100 --module-name xfs
>>
>> With nfs (but also ext4), the above command fills the kernel log for me with:
>>
>> ...
>> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
>> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
>> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
>> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
>> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
>> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
>> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
>> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
>> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
>> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
>> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
>> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
>> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
>> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
>> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
>> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
>> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
>> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
>> ...
>>
>>
>> I do *not* get these errors on manual modprobe/rmmod. A bug in the
>> concurrent handling, or just a side effect of the concurrent loading?
> 
> It is just because modprobe deals with module dependencies; stress-ng
> doesn't, it just calls finit_module() and expects dependencies to
> already be loaded.

Oh, right. That makes sense.
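
(As an aside, pre-loading the dependency chain once should silence that
noise, since rmmod removes only the named module and leaves its
dependencies loaded. A quick sketch, assuming nfs is the module under
test:

   modprobe nfs    # pulls in the dependencies (sunrpc, lockd, fscache, ...)
   rmmod nfs       # removes only nfs itself; its dependencies stay loaded
   ./stress-ng --module 100 --module-name nfs
)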

> 
>>> To see how much max memory I use, I just use:
>>>
>>> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
>>>
>>> Run the test in another window, CTRL-C the test when the above
>>> finishes after 40 seconds, and then:
>>>
>>> sort -n -r foo.log  | head -1
>>
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
>> 14254024
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
>> 12862528
>>
>> So 1391496 (KiB I assume, so ~1.3 GiB!?)
> 
> That is sadly correct.
> 
>> difference compared to before the
>> test (I first start capturing and then run stress-ng).
> 
> That's a good chunk :-D
> 
> On a recent thread I did a full analysis of average module sizes [0];
> you can use that to also get an average size estimate of your currently
> loaded modules. Do a bit of math and that could give you the average
> memory pressure from real vmap allocations. For a synthetic test as with
> stress-ng modules we'd be doing twice the memory for a successful load,
> and only a one-time allocation due to the kread for a failed load, since
> we always allocate memory even for bogus modules on the kread side.
> 
> My xfs.ko is 42M on my test guest system, with 100 concurrent threads
> doing two allocations each that puts the expected vmap allocation at
> around:
> 
> 42*2*100
> 8400
> 8400/1024
> 8.20312500000000000000
> 
> 8.2 GiB
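
(To plug in the actual on-disk module size rather than a round figure --
a small sketch, assuming modinfo is available; note that a compressed
.ko will understate the in-memory size:

   du -m $(modinfo -n xfs)
)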
> 
> So in the worst case that's what we'd see. Since I saw the patches
> helped overall by around 145 MiB, let's see what that means. Let me
> re-test:
> 
> Idle memory (note: free -k uses kibibytes, KiB):
> 
> root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r  | head -1
> 416452
> 
> So while idle, the system is using up 416452 KiB, so 406 MiB.
> 
> While running stress-ng with 100 module ops
> 
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs
> root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
> 4886492
> 
> 4886492/1024
> 4771.96484375000000000000
> 4886492/1024/1024
> 4.66012191772460937500
> 
> So during the stress test the highest memory usage we saw was about
> 4771.96 MiB, or ~4.66 GiB.
> 
> What's the difference between idle and after the stress test:
> 
> 4886492 - 416452
> 4470040
> 4470040/1024
> 4365.27343750000000000000
> 4470040/1024/1024
> 4.26296234130859375000
> 
> So about 4365.27 MiB or 4.26 GiB.
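
(For what it's worth, that delta can also be computed straight from the
captured log, assuming the samples file (foo.log above) holds the free -k
output and the capture started while the system was still idle -- a small
awk sketch:

   awk 'NR==1 {min=max=$1}
        {if ($1>max) max=$1; if ($1<min) min=$1}
        END {printf "%.2f GiB\n", (max-min)/1024/1024}' foo.log
)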
> 
> Now, the upper limit for the test should have been a delta of 8.2 GiB
> and we get about 4.26 GiB, so it means we're rejecting roughly half of
> the requests. Why only roughly half and not more? I only see 40
> successful loads of XFS during a 40-second window:
> 
>    dmesg -c > /dev/null
>    # run tests
>    dmesg -c | grep XFS | wc -l
> 
> With 100 ops all running finit_module() for 40 seconds, those successful
> loads should only have consumed about 40 * 2 * the size of XFS (42 MiB):
> 3360 MiB or 3.28 GiB. But only one module can be loaded at any point in
> time, so at the *worst* consumption point an XFS load in this test should
> only be consuming 2 * size_of_XFS (42 MiB), so 84 MiB.
> 
> 4365.27 - 84
> 4281.27
> 
> So about 4281.27 MiB (4.18 GiB) is consumed by the 98.07% of module
> loads that failed, due to the kread*() calls from finit_module().
> 
> And get this... *if* you use module compression, decompression also
> uses vmap() *after* the kernel_read*() call, which uses vmalloc(). *At
> least* in that case we immediately free the buffer for the compressed
> module, but *still* -- that is 3 possible vmap space allocations for
> every module!

:/
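
(Whether that third vmap allocation applies on a given system should be
visible from the kernel config -- a hedged check, assuming the distro
ships the config under /boot; CONFIG_MODULE_DECOMPRESS is the in-kernel
decompression path:

   grep -E 'CONFIG_MODULE_(COMPRESS|DECOMPRESS)' /boot/config-$(uname -r)
)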

> 
> It'd be hard, but not impossible, to collect stats for failed
> finit_module() calls. *That* could be indicative of areas in the kernel
> we need to brush up on to stop doing stupid things, like we learned
> about for the CPU frequency scaling modules. As I wrote this paragraph
> I realized that this *is* what we really have wanted all along to
> help us debug these stupid things, so we can slowly learn where to
> point fingers to help optimize things. Although I recently did some
> tool scraping to collect stats and *wished* for this from userspace [0],
> it wasn't hard to just add a debug option for this to help us debug
> these failures. So I added support for that and finally sent a new
> patch series.

Makes sense to me.
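
(Until that debug option is available everywhere, one way to approximate
such stats from userspace -- a rough sketch, assuming bpftrace and the
syscall tracepoints are available -- is to count failed finit_module()
returns by errno:

   bpftrace -e 'tracepoint:syscalls:sys_exit_finit_module
       /args->ret < 0/ { @fails[-args->ret] = count(); }'
)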


I haven't had time to look at the code yet (currently traveling); I hope
that the KASAN vmap issues are at least history.

I'll do another test with debug kernels using your latest version on 
that machine before I have to hand it back.

Thanks!

-- 
Thanks,

David / dhildenb


