All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Mateusz Jończyk" <mat.jonczyk@o2.pl>
To: Yazen Ghannam <yazen.ghannam@amd.com>, linux-edac@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, tony.luck@intel.com,
	x86@kernel.org, Smita.KoralahalliChannabasappa@amd.com,
	mpatocka@redhat.com, gregkh@linuxfoundation.org
Subject: Re: [PATCH] x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks
Date: Fri, 12 Aug 2022 23:14:44 +0200	[thread overview]
Message-ID: <7c8e34c8-f12d-e7e4-b6bc-4867824865ea@o2.pl> (raw)
In-Reply-To: <20220614174346.3648305-1-yazen.ghannam@amd.com>

W dniu 14.06.2022 o 19:43, Yazen Ghannam pisze:
> AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs.
> Therefore, the threshold_bank structure for bank 4, and its threshold_block
> structures, will be initialized once at boot time. And the kobject for the
> shared bank will be added to each of the CPUs that share it. Furthermore,
> the threshold_blocks for the shared bank will be added again to the bank's
> kobject. These additions will increase the refcount for the bank's kobject.
>
> For example, a shared bank with two blocks and shared across two CPUs will
> be set up like this:
>
[snip]

Hello,

The problem this patch solves (multiple warnings with stacktraces on suspend)
also affects me, on a tri-core old AMD Phenom II X3 720 processor.

I have done some further debugging by adding printks to the code and to me it seems that your diagnosis
was correct. In the dmesg I can see:

[   66.094076] ACPI: PM: Preparing to enter system sleep state S3
[   66.510378] ACPI: PM: Saving platform NVS memory
[   66.510718] Disabling non-boot CPUs ...
[   66.511043] mce: ENTER mce_threshold_remove_device(cpu = 1)
[   66.511046] mce: ENTER __threshold_remove_device(); numbanks=6
[   66.511048] mce: ENTER threshold_remove_bank(bank name = northbridge, bank->cpus = 3, bank->shared = 1)
[   66.511049] mce: ENTER __threshold_remove_blocks(bank name = northbridge, bank->cpus = 2, bank->shared = 1)
[   66.511060] mce: EXIT __threshold_remove_blocks()
[   66.511061] mce: threshold_remove_bank: EXIT after __threshold_remove_blocks()
[   66.511062] mce: EXIT __threshold_remove_device()
[   66.511063] mce: EXIT mce_threshold_remove_device(cpu = 1)
[   66.512247] smpboot: CPU 1 is now offline
[   66.513915] mce: ENTER mce_threshold_remove_device(cpu = 2)
[   66.513919] mce: ENTER __threshold_remove_device(); numbanks=6
[   66.513922] mce: ENTER threshold_remove_bank(bank name = northbridge, bank->cpus = 2, bank->shared = 1)
[   66.513925] mce: ENTER __threshold_remove_blocks(bank name = northbridge, bank->cpus = 1, bank->shared = 1)
[   66.513929] ------------[ cut here ]------------
[   66.513930] kernfs: can not remove 'threshold_limit', no directory
[... stacktrace]
[   66.514246] kernfs: can not remove 'error_count', no directory
[...]
[   66.514485] kernfs: can not remove 'interrupt_enable', no directory
[...]
[   66.514719] kernfs: can not remove 'threshold_limit', no directory
[...]
[   66.514951] kernfs: can not remove 'error_count', no directory
[...]
[   66.515183] kernfs: can not remove 'interrupt_enable', no directory
[...]
[   66.515412] ---[ end trace 0000000000000000 ]---
[   66.515414] mce: EXIT __threshold_remove_blocks()
[   66.515415] mce: threshold_remove_bank: EXIT after __threshold_remove_blocks()
[   66.515417] mce: EXIT __threshold_remove_device()
[   66.515418] mce: EXIT mce_threshold_remove_device(cpu = 2)
[   66.516669] smpboot: CPU 2 is now offline


> Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> Tested-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
> ---
>  arch/x86/kernel/cpu/mce/amd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 2b7ee4a6c6ba..680b75d23a03 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -1260,10 +1260,10 @@ static void __threshold_remove_blocks(struct threshold_bank *b)
>  	struct threshold_block *pos = NULL;
>  	struct threshold_block *tmp = NULL;
>  
> -	kobject_del(b->kobj);
> +	kobject_put(b->kobj);
>  
>  	list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
> -		kobject_del(&pos->kobj);
> +		kobject_put(b->kobj);

Shouldn't there be "kobject_put(&pos->kobj)" here instead?

Also, it seems to me that "kobject_put(b->kobj);" before the loop may be relocated after
the loop - so that the refcounts on the child objects are decreased first, then the refcount on the parent
object.

Additionally, shouldn't there be a call to "kobject_put(&b->blocks->kobj);" in __threshold_remove_blocks()?
From what I understand, b->blocks is a list head, so we need to decrease the refcount on it too.

After these changes, the __threshold_remove_blocks() function looks very similar to deallocate_threshold_blocks()
function just above it.

>  }
>  
>  static void threshold_remove_bank(struct threshold_bank *bank)

Thank you for your work.

Greetings,

Mateusz


---------------------

commit 39df75c672d9620a2967f993befedabb36b2616d
Author: Mateusz Jończyk <mat.jonczyk@o2.pl>
Date:   Fri Aug 12 20:21:46 2022 +0200

    x86/mce/amd: debugging messages

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 1c87501e0fa3..10c37c0599e1 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -1238,6 +1238,10 @@ static int threshold_create_bank(struct threshold_bank **bp, unsigned int cpu,
 
 static void threshold_block_release(struct kobject *kobj)
 {
+    struct threshold_block *block = to_block(kobj);
+
+    pr_warn("threshold_block_release(block = %u, bank = %u, cpu = %u)\n",
+        block->block, block->bank, block->cpu);
     kfree(to_block(kobj));
 }
 
@@ -1245,12 +1249,19 @@ static void deallocate_threshold_blocks(struct threshold_bank *bank)
 {
     struct threshold_block *pos, *tmp;
 
+    pr_warn("ENTER deallocate_threshold_blocks(bank name = %s, bank->cpus = %u, bank->shared = %d)\n",
+        kobject_name(bank->kobj),
+        refcount_read(&bank->cpus),
+        bank->shared);
+
     list_for_each_entry_safe(pos, tmp, &bank->blocks->miscj, miscj) {
         list_del(&pos->miscj);
         kobject_put(&pos->kobj);
     }
 
     kobject_put(&bank->blocks->kobj);
+
+    pr_warn("EXIT deallocate_threshold_blocks()\n");
 }
 
 static void __threshold_remove_blocks(struct threshold_bank *b)
@@ -1258,24 +1269,40 @@ static void __threshold_remove_blocks(struct threshold_bank *b)
     struct threshold_block *pos = NULL;
     struct threshold_block *tmp = NULL;
 
+    pr_warn("ENTER __threshold_remove_blocks(bank name = %s, bank->cpus = %u, bank->shared = %d)\n",
+        kobject_name(b->kobj),
+        refcount_read(&b->cpus),
+        b->shared);
+
     kobject_del(b->kobj);
 
     list_for_each_entry_safe(pos, tmp, &b->blocks->miscj, miscj)
         kobject_del(&pos->kobj);
+
+    pr_warn("EXIT __threshold_remove_blocks()\n");
 }
 
 static void threshold_remove_bank(struct threshold_bank *bank)
 {
     struct amd_northbridge *nb;
+    pr_warn("ENTER threshold_remove_bank(bank name = %s, bank->cpus = %u, bank->shared = %d)\n",
+        kobject_name(bank->kobj),
+        refcount_read(&bank->cpus),
+        bank->shared);
 
-    if (!bank->blocks)
+    if (!bank->blocks) {
+        pr_warn("threshold_remove_bank: bank->blocks == NULL\n");
         goto out_free;
+        }
 
-    if (!bank->shared)
+    if (!bank->shared) {
+        pr_warn("threshold_remove_bank: bank->shared == NULL\n");
         goto out_dealloc;
+        }
 
     if (!refcount_dec_and_test(&bank->cpus)) {
         __threshold_remove_blocks(bank);
+        pr_warn("threshold_remove_bank: EXIT after __threshold_remove_blocks()\n");
         return;
     } else {
         /*
@@ -1283,6 +1310,7 @@ static void threshold_remove_bank(struct threshold_bank *bank)
          * away, remove that bank now.
          */
         nb = node_to_amd_nb(topology_die_id(smp_processor_id()));
+        pr_warn("threshold_remove_bank: setting nb->bank4 = NULL\n");
         nb->bank4 = NULL;
     }
 
@@ -1292,12 +1320,16 @@ static void threshold_remove_bank(struct threshold_bank *bank)
 out_free:
     kobject_put(bank->kobj);
     kfree(bank);
+
+    pr_warn("EXIT threshold_remove_bank(); \n");
 }
 
 static void __threshold_remove_device(struct threshold_bank **bp)
 {
     unsigned int bank, numbanks = this_cpu_read(mce_num_banks);
 
+    pr_warn("ENTER __threshold_remove_device(); numbanks=%u\n", numbanks);
+
     for (bank = 0; bank < numbanks; bank++) {
         if (!bp[bank])
             continue;
@@ -1306,14 +1338,19 @@ static void __threshold_remove_device(struct threshold_bank **bp)
         bp[bank] = NULL;
     }
     kfree(bp);
+    pr_warn("EXIT __threshold_remove_device()\n");
 }
 
 int mce_threshold_remove_device(unsigned int cpu)
 {
     struct threshold_bank **bp = this_cpu_read(threshold_banks);
 
-    if (!bp)
+    pr_warn("ENTER mce_threshold_remove_device(cpu = %u)\n", cpu);
+
+    if (!bp) {
+        pr_warn("EXIT mce_threshold_remove_device(cpu = %u): bp == NULL\n", cpu);
         return 0;
+    }
 
     /*
      * Clear the pointer before cleaning up, so that the interrupt won't
@@ -1322,6 +1359,8 @@ int mce_threshold_remove_device(unsigned int cpu)
     this_cpu_write(threshold_banks, NULL);
 
     __threshold_remove_device(bp);
+
+    pr_warn("EXIT mce_threshold_remove_device(cpu = %u)\n", cpu);
     return 0;
 }


  parent reply	other threads:[~2022-08-12 21:15 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-14 17:43 [PATCH] x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks Yazen Ghannam
2022-06-15  6:33 ` Greg KH
2022-06-15 13:51   ` Yazen Ghannam
2022-10-26 10:16     ` Borislav Petkov
2022-10-26 12:04       ` Greg KH
2022-10-26 15:39         ` Yazen Ghannam
2022-10-26 18:29           ` Borislav Petkov
2022-10-26 19:44             ` Yazen Ghannam
2022-10-26 20:12               ` Borislav Petkov
2022-11-02  2:36                 ` Yazen Ghannam
2022-08-12 21:14 ` Mateusz Jończyk [this message]
2022-08-13 10:09   ` Borislav Petkov
2022-08-13 12:04     ` Mateusz Jończyk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7c8e34c8-f12d-e7e4-b6bc-4867824865ea@o2.pl \
    --to=mat.jonczyk@o2.pl \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=tony.luck@intel.com \
    --cc=x86@kernel.org \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.