linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Johan Hovold <johan@kernel.org>,
	Lucas De Marchi <lucas.demarchi@intel.com>,
	Petr Pavlu <petr.pavlu@suse.com>,
	gregkh@linuxfoundation.org, rafael@kernel.org, song@kernel.org,
	lucas.de.marchi@gmail.com, christophe.leroy@csgroup.eu,
	peterz@infradead.org, rppt@kernel.org, dave@stgolabs.net,
	willy@infradead.org, vbabka@suse.cz, mhocko@suse.com,
	dave.hansen@linux.intel.com, colin.i.king@gmail.com,
	jim.cromie@gmail.com, catalin.marinas@arm.com, jbaron@akamai.com,
	rick.p.edgecombe@intel.com, yujie.liu@intel.com,
	tglx@linutronix.de, hch@lst.de, patches@lists.linux.dev,
	linux-modules@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, pmladek@suse.com,
	prarit@redhat.com, lennart@poettering.net
Subject: Re: [PATCH 2/2] module: add support to avoid duplicates early on load
Date: Mon, 5 Jun 2023 13:26:00 +0200	[thread overview]
Message-ID: <fa3f1a1f-edc6-f13b-cc84-f3264b03b0b1@redhat.com> (raw)
In-Reply-To: <ZHoTFDkPIgglW0sU@bombadil.infradead.org>

On 02.06.23 18:04, Luis Chamberlain wrote:
> On Fri, Jun 02, 2023 at 05:19:58PM +0200, David Hildenbrand wrote:
>> On 31.05.23 18:57, Luis Chamberlain wrote:
>>> On Wed, May 31, 2023 at 09:51:41AM +0200, David Hildenbrand wrote:
>>>> On 30.05.23 18:22, Luis Chamberlain wrote:
>>>>> On Mon, May 29, 2023 at 09:55:15PM -0400, Linus Torvalds wrote:
>>>>>> On Mon, May 29, 2023 at 11:18 AM Johan Hovold <johan@kernel.org> wrote:
>>>>>>>
>>>>>>> I took a closer look at some of the modules that failed to load and
>>>>>>> noticed a pattern in that they have dependencies that are needed by more
>>>>>>> than one device.
>>>>>>
>>>>>> Ok, this is a "maybe something like this" RFC series of two patches -
>>>>>> one trivial one to re-organize things a bit so that we can then do the
>>>>>> real one which uses a filter based on the inode pointer to return an
>>>>>> "idempotent return value" for module loads that share the same inode.
>>>>>>
>>>>>> It's entirely untested, and since I'm on the road I'm going to not
>>>>>> really be able to test it. It compiles for me, and the code looks
>>>>>> fairly straightforward, but it's probably buggy.
>>>>>>
>>>>>> It's very loosely based on Luis' attempt,  but it
>>>>>>     (a) is internal to module loading
>>>>>>     (b) uses a reliable cookie
>>>>>>     (c) doesn't leave the cookie around randomly for later
>>>>>>     (d) has seen absolutely no testing
>>>>>>
>>>>>> Put another way: if somebody wants to play with this, please treat it
>>>>>> as a starting point, not the final thing. You might need to debug
>>>>>> things, and fix silly mistakes.
>>>>>>
>>>>>> The idea is to just have a simple hash list of currently executing
>>>>>> module loads, protected by a trivial spinlock. Every module loader
>>>>>> adds itself to the right hash list, and if they were the *first* one
>>>>>> (ie no other pending module loads for that inode), will actually do
>>>>>> the module load.
>>>>>>
>>>>>> Everybody who *isn't* the first one will just wait for completion and
>>>>>> return the same error code that the first one returned.
>>>>>
>>>>> That's also a hell much more snazzier MODULE_DEBUG_AUTOLOAD_DUPS if we
>>>>> ever wanted to do something similar there if we wanted to also
>>>>> join request_module() calls, instead of it hiding under debug.
>>>>>
>>>>>> This is technically bogus. The first one might fail due to arguments.
>>>>>
>>>>> For boot it's fine, as I can't think of boot wanting to support trying
>>>>> to load a module with different arguments but who knows. But I can't
>>>>> see it sensible to issue concurrent multiple requests for modules
>>>>> with different arguments without waiting in userspace for the first
>>>>> to fail.
>>>>>
>>>>> Even post-boot, doing that sounds rather insane, but it would certainly
>>>>> be a compromise and should probably be clearly documented. I think just
>>>>> a comment acknolwedging that corner case seems sensible.
>>>>>
>>>>> Because we won't be able to get the arguments until we process the
>>>>> module, so it would be too late for this optimization on kread. So it is
>>>>> why I had also stuck to the original feature being in kread, as then it
>>>>> provides a uniq kread call and the caller is aware of it. But indeed I
>>>>> had not considered the effects of arguments.
>>>>>
>>>>> Lucas, any thoughts from modules kmod userspace perspective into
>>>>> supporting anyone likely issuing concurrent modules requests with
>>>>> differing arguments?
>>>>>
>>>>>> So the cookie shouldn't be just the inode, it should be the inode and
>>>>>> a hash of the arguments or something like that.
>>>>>
>>>>> Personally I think it's a fine optimization without the arguments.
>>>>>
>>>>>> But it is what it is,
>>>>>> and apart from possible show-stopper bugs this is no worse than the
>>>>>> failed "exclusive write deny" attempt. IOW - maybe worth trying?
>>>>>
>>>>> The only thing I can think of is allowing threads other than the
>>>>> first one to complete before the one that actually loaded the
>>>>> module. I thought about this race for module auto-loading, see
>>>>> the comment in kmod_dup_request_announce(), so that just
>>>>> further delays the completion to other thread with a stupid
>>>>> queue_work(). That seems more important for module auto-loading
>>>>> duplicates than for boot finit_module() duplicates. But not sure
>>>>> if odering matters in the end due to a preemtible kernel and maybe
>>>>> that concern is hysteria.
>>>>>
>>>>>> And if *that* didn't sell people on this patch series, I don't know
>>>>>> what will. I should be in marketing! Two drink minimums, here I come!
>>>>>
>>>>> Sold:
>>>>>
>>>>> on 255 vcpus 0 duplicates found with this setup:
>>>>>
>>>>> root@kmod ~ # cat /sys/kernel/debug/modules/stats
>>>>>             Mods ever loaded       66
>>>>>         Mods failed on kread       0
>>>>> Mods failed on decompress       0
>>>>>      Mods failed on becoming       0
>>>>>          Mods failed on load       0
>>>>>            Total module size       11268096
>>>>>          Total mod text size       4149248
>>>>>           Failed kread bytes       0
>>>>>      Failed decompress bytes       0
>>>>>        Failed becoming bytes       0
>>>>>            Failed kmod bytes       0
>>>>>     Virtual mem wasted bytes       0
>>>>>             Average mod size       170729
>>>>>        Average mod text size       62868
>>>>>
>>>>> So:
>>>>>
>>>>> Tested-by: Luis Chamberlain <mcgrof@kernel.org>
>>>>>
>>>>> In terms of bootup timing:
>>>>>
>>>>> Before:
>>>>> Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s
>>>>> graphical.target reached after 44.178s in userspace.
>>>>> After:
>>>>> Startup finished in 23.995s (kernel) + 40.350s (userspace) = 1min 4.345s
>>>>> graphical.target reached after 40.226s in userspace.
>>>>
>>>> I'll try grabbing the system where we saw the KASAN-related issues [1] and
>>>> give it a churn with and without the two patches. Might take a bit (~1 day),
>>>> unfortunately.
>>>>
>>>> [1] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
>>>
>>> Great, don't forget:
>>>
>>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>>> index 82b0dcc1fe77..222015093eeb 100644
>>> --- a/kernel/module/main.c
>>> +++ b/kernel/module/main.c
>>> @@ -3066,7 +3066,7 @@ struct idempotent {
>>>    #define IDEM_HASH_BITS 8
>>>    static struct hlist_head idem_hash[1 << IDEM_HASH_BITS];
>>> -static struct spinlock idem_lock;
>>> +static DEFINE_SPINLOCK(idem_lock);
>>>    static bool idempotent(struct idempotent *u, const void *cookie)
>>>    {
>>>
>>
>> Finally was able to run it on that ThinkSystem SR950 with 8 populated
>> sockets -> 224 cores and 448 logical CPUs.
>>
>> The KASAN vmap issues on that system were already no longer reproducible with your
>> (Luis) previous work that's already in master.
>>
>> I tested a !debug and debug config (both based on corresponding RHEL9 configs), comparing
>> 929ed21dfdb6 ("master") with 929ed21dfdb6 + Linus' patches ("patched").
>>
>>
>> Unfortunately, boot times vary a lot, and I did not figure out how to reduce
>> the noise. I captured the "systemd-analyze blame" output as well.
>>
>>
>> 1) !debug config (not enabling KASAN)
>>
>> a) master
>>
>> Startup finished in 32.225s (kernel) + 7.399s (initrd) + 20.378s (userspace) = 1min 3ms
>> multi-user.target reached after 20.352s in userspace.
>> Startup finished in 43.734s (kernel) + 7.288s (initrd) + 19.827s (userspace) = 1min 10.851s
>> multi-user.target reached after 19.800s in userspace.
>> Startup finished in 50.514s (kernel) + 7.171s (initrd) + 24.757s (userspace) = 1min 22.443s
>> multi-user.target reached after 24.734s in userspace.
>> Startup finished in 26.722s (kernel) + 7.249s (initrd) + 23.923s (userspace) = 57.895s
>> multi-user.target reached after 23.892s in userspace.
>>
>> b) patched
>>
>> Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
>> multi-user.target reached after 21.355s in userspace.
>> Startup finished in 36.318s (kernel) + 7.177s (initrd) + 21.383s (userspace) = 1min 4.879s
>> multi-user.target reached after 21.355s in userspace.
>> Startup finished in 1min 34.678s (kernel) + 7.239s (initrd) + 24.066s (userspace) = 2min 5.985s
>> multi-user.target reached after 24.040s in userspace.
>> Startup finished in 25.879s (kernel) + 7.144s (initrd) + 29.665s (userspace) = 1min 2.689s
>> multi-user.target reached after 29.637s in userspace.
>>
>>
>>
>> 2) debug config (enabling KASAN)
>>
>> a) master
>>
>> Startup finished in 2min 12.695s (kernel) + 25.058s (initrd) + 1min 13.012s (userspace) = 3min 50.765s
>> multi-user.target reached after 1min 12.903s in userspace.
>> Startup finished in 1min 45.400s (kernel) + 24.294s (initrd) + 1min 8.910s (userspace) = 3min 18.606s
>> multi-user.target reached after 1min 8.786s in userspace.
>> Startup finished in 2min 4.857s (kernel) + 24.715s (initrd) + 1min 5.088s (userspace) = 3min 34.660s
>> multi-user.target reached after 1min 4.967s in userspace.
>> Startup finished in 3min 20.400s (kernel) + 24.703s (initrd) + 1min 5.469s (userspace) = 4min 50.573s
>> multi-user.target reached after 1min 5.344s in userspace.
>>
>> b) patched
>>
>> Startup finished in 2min 5.250s (kernel) + 25.049s (initrd) + 1min 1.961s (userspace) = 3min 32.262s
>> multi-user.target reached after 1min 1.844s in userspace.
>> Startup finished in 1min 52.524s (kernel) + 24.897s (initrd) + 1min 5.062s (userspace) = 3min 22.484s
>> multi-user.target reached after 1min 4.916s in userspace.
>> Startup finished in 9min 36.817s (kernel) + 24.859s (initrd) + 1min 18.657s (userspace) = 11min 20.335s
>> multi-user.target reached after 1min 18.455s in userspace.
>> Startup finished in 30min 20.715s (kernel) + 24.722s (initrd) + 1min 7.039s (userspace) = 31min 52.476s
>> multi-user.target reached after 1min 6.907s in userspace.
>>
>>
>> What concerns me a bit, is that on the patched kernel we seem to hit more cases where
>> boot takes much longer (in both kernel configs).
>>
>> I'll do some more runs/investigation to see if this is reproducible or just some system oddity.
>>
>> Staring just at the udev settle time (systemd-analyze blame), it's very similar between both kernel
>> versions.
> 
> Thanks for these tests, having MODULE_STATS enabled +
> 
> cat /sys/kernel/debug/modules/stats
> 
> would be of huge value, both on master and + patched.

Sure! Sorry for the slight delay, had to recompile the kernels.

I only did a single run on each kernel, should be good enough for the purpose here.


1) !debug config (not enabling KASAN)

a) master

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       8052
       Mods failed on load       531
         Total module size       16343040
       Total mod text size       4767744
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       19185958488
         Failed kmod bytes       1172591936
  Virtual mem wasted bytes       20358550424
          Average mod size       217908
     Average mod text size       63570
   Avg fail becoming bytes       2382757
   Average fail load bytes       2208272
Duplicate failed modules:
               Module-name        How-many-times                    Reason
         intel_rapl_common                   447                  Becoming
    intel_uncore_frequency                   447                  Becoming
intel_uncore_frequency_common               445           Becoming & Load
          isst_if_mbox_msr                   447                  Becoming
            isst_if_common                   447           Becoming & Load
                  skx_edac                   447           Becoming & Load
                      nfit                   447                  Becoming
                 libnvdimm                   447                  Becoming
      x86_pkg_temp_thermal                   447                  Becoming
          intel_powerclamp                   447                  Becoming
                  coretemp                   447           Becoming & Load
                 kvm_intel                   447                  Becoming
                       kvm                   447           Becoming & Load
                     irdma                     3                      Load
                       ice                     3                  Becoming
                 irqbypass                   447           Becoming & Load
                      rapl                   393           Becoming & Load
                      gnss                     2           Becoming & Load
                   ioatdma                    63           Becoming & Load
                 ib_uverbs                     3           Becoming & Load
              intel_cstate                   447           Becoming & Load
                   ipmi_si                     1                      Load
              intel_uncore                   447           Becoming & Load
                   ib_core                     3           Becoming & Load
              ipmi_devintf                     1                      Load
           ipmi_msghandler                     1                  Becoming
                       dca                    63                  Becoming
                      ahci                     1                  Becoming
                      i40e                     3                  Becoming
                   libahci                     1                  Becoming
          crct10dif_pclmul                   195                  Becoming
              crc32_pclmul                   262                  Becoming
              crc32c_intel                   126           Becoming & Load
                    libata                     1                      Load
       ghash_clmulni_intel                   297           Becoming & Load
                      fjes                    11                  Becoming


b) patched

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       0
       Mods failed on load       0
         Total module size       16343040
       Total mod text size       4767744
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       0
         Failed kmod bytes       0
  Virtual mem wasted bytes       0
          Average mod size       217908
     Average mod text size       63570



2) debug config (enabling KASAN)

a) master

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       3173
       Mods failed on load       923
         Total module size       32288768
       Total mod text size       14819328
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       4279728952
         Failed kmod bytes       2174133088
  Virtual mem wasted bytes       6453862040
          Average mod size       430517
     Average mod text size       197592
   Avg fail becoming bytes       1348796
   Average fail load bytes       2355508
Duplicate failed modules:
               Module-name        How-many-times                    Reason
                   ioatdma                    63           Becoming & Load
         intel_rapl_common                   313                  Becoming
    intel_uncore_frequency                   136           Becoming & Load
intel_uncore_frequency_common                38                  Becoming
          isst_if_mbox_msr                   443           Becoming & Load
            isst_if_common                   210           Becoming & Load
                  skx_edac                   257           Becoming & Load
                      nfit                   180           Becoming & Load
                 libnvdimm                   294           Becoming & Load
      x86_pkg_temp_thermal                   321           Becoming & Load
          intel_powerclamp                   247           Becoming & Load
                  coretemp                    55           Becoming & Load
                 kvm_intel                     2                  Becoming
                       kvm                     3           Becoming & Load
                 irqbypass                     3                  Becoming
                      rapl                     1                  Becoming
              intel_cstate                   444                      Load
              intel_uncore                   447           Becoming & Load
                 ib_uverbs                     1                      Load
                   ib_core                     3           Becoming & Load
                   ipmi_si                     1                  Becoming
                       dca                    20           Becoming & Load
                        sg                     1                  Becoming
                      i40e                     3           Becoming & Load
                      ahci                     1                  Becoming
          crct10dif_pclmul                   160           Becoming & Load
              crc32_pclmul                   151           Becoming & Load
              crc32c_intel                   146           Becoming & Load
       ghash_clmulni_intel                   140                  Becoming
                    libata                     1                      Load
                      fjes                    11           Becoming & Load



b) patched

# cat /sys/kernel/debug/modules/stats
          Mods ever loaded       75
      Mods failed on kread       0
Mods failed on decompress       0
   Mods failed on becoming       19
       Mods failed on load       0
         Total module size       32288768
       Total mod text size       14819328
        Failed kread bytes       0
   Failed decompress bytes       0
     Failed becoming bytes       6441296
         Failed kmod bytes       0
  Virtual mem wasted bytes       6441296
          Average mod size       430517
     Average mod text size       197592
   Avg fail becoming bytes       339016
Duplicate failed modules:
               Module-name        How-many-times                    Reason
         intel_rapl_common                     2                  Becoming
intel_uncore_frequency_common                 5                  Becoming
            isst_if_common                     3                  Becoming
          intel_powerclamp                     2                  Becoming
                 irqbypass                     5                  Becoming
          crct10dif_pclmul                     1                  Becoming
              crc32c_intel                     1                  Becoming



So, with these (helpful) stats, the improvement is obvious (and explains the ~1s
improvement I saw staring at the startup times of the udev services).

There are still some failed module loads with the debug config (only in the
becoming state), I did not dive deeply into the actual code changes (-EBUSY),
just spelling it out so we can decide if this is to be expected or some corner
case that shouldn't be happening.


Thanks!

-- 
Cheers,

David / dhildenb


  reply	other threads:[~2023-06-05 11:26 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-24 21:36 [PATCH 0/2] module: avoid all memory pressure due to duplicates Luis Chamberlain
2023-05-24 21:36 ` [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection Luis Chamberlain
2023-05-24 21:52   ` Linus Torvalds
2023-05-24 21:56     ` Linus Torvalds
2023-05-24 22:07       ` Luis Chamberlain
2023-05-25  4:00     ` Linus Torvalds
2023-05-25 18:08       ` Luis Chamberlain
2023-05-25 18:35         ` Luis Chamberlain
2023-05-25 18:50         ` Linus Torvalds
2023-05-25 19:32           ` Luis Chamberlain
2023-05-25  7:01     ` Christian Brauner
2023-05-24 21:36 ` [PATCH 2/2] module: add support to avoid duplicates early on load Luis Chamberlain
2023-05-25 11:40   ` Petr Pavlu
2023-05-25 16:07     ` Linus Torvalds
2023-05-25 16:42       ` Greg KH
2023-05-25 18:22         ` Luis Chamberlain
2023-05-25 17:52       ` Linus Torvalds
2023-05-25 18:45       ` Lucas De Marchi
2023-05-25 21:12         ` Linus Torvalds
2023-05-25 22:02           ` Luis Chamberlain
2023-05-26  1:39             ` Linus Torvalds
2023-05-29  8:58               ` Johan Hovold
2023-05-29 11:00                 ` Linus Torvalds
2023-05-29 12:44                   ` Johan Hovold
2023-05-29 15:18                     ` Johan Hovold
2023-05-30  1:55                       ` Linus Torvalds
2023-05-30  9:40                         ` Johan Hovold
2023-06-05 12:25                           ` Johan Hovold
2023-05-30 16:22                         ` Luis Chamberlain
2023-05-30 17:16                           ` Lucas De Marchi
2023-05-30 19:41                             ` Luis Chamberlain
2023-05-30 22:17                               ` Linus Torvalds
2023-05-31  5:30                                 ` Lucas De Marchi
2023-05-31  0:31                           ` Luis Chamberlain
2023-05-31  7:51                           ` David Hildenbrand
2023-05-31 16:57                             ` Luis Chamberlain
2023-06-02 15:19                               ` David Hildenbrand
2023-06-02 16:04                                 ` Luis Chamberlain
2023-06-05 11:26                                   ` David Hildenbrand [this message]
2023-06-05 15:17                                     ` Luis Chamberlain
2023-06-05 15:28                                       ` Luis Chamberlain
2023-06-28 18:52                                         ` Luis Chamberlain
2023-06-28 20:14                                           ` Linus Torvalds
2023-06-28 22:07                                             ` Linus Torvalds
2023-06-28 23:17                                               ` Linus Torvalds
2023-06-29  0:18                                                 ` Luis Chamberlain
2023-06-02 16:06                                 ` Linus Torvalds
2023-06-02 16:37                                   ` David Hildenbrand
2023-05-30 22:45                         ` Dan Williams
2023-06-04 14:26                         ` Rudi Heitbaum
2023-05-29 17:47                     ` Linus Torvalds
2023-05-30 10:01                       ` Johan Hovold
2023-05-25 16:54     ` Lucas De Marchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fa3f1a1f-edc6-f13b-cc84-f3264b03b0b1@redhat.com \
    --to=david@redhat.com \
    --cc=catalin.marinas@arm.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=colin.i.king@gmail.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave@stgolabs.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@lst.de \
    --cc=jbaron@akamai.com \
    --cc=jim.cromie@gmail.com \
    --cc=johan@kernel.org \
    --cc=lennart@poettering.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-modules@vger.kernel.org \
    --cc=lucas.de.marchi@gmail.com \
    --cc=lucas.demarchi@intel.com \
    --cc=mcgrof@kernel.org \
    --cc=mhocko@suse.com \
    --cc=patches@lists.linux.dev \
    --cc=peterz@infradead.org \
    --cc=petr.pavlu@suse.com \
    --cc=pmladek@suse.com \
    --cc=prarit@redhat.com \
    --cc=rafael@kernel.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rppt@kernel.org \
    --cc=song@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=yujie.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).