From: Shuai Xue <xueshuai@linux.alibaba.com>
To: "Luck, Tony" <tony.luck@intel.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	James Morse <james.morse@arm.com>
Cc: "Len Brown" <lenb@kernel.org>, "Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Jarkko Sakkinen" <jarkko@kernel.org>,
	"HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>,
	"linmiaohe@huawei.com" <linmiaohe@huawei.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	Stable <stable@vger.kernel.org>,
	"ACPI Devel Maling List" <linux-acpi@vger.kernel.org>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	"cuibixuan@linux.alibaba.com" <cuibixuan@linux.alibaba.com>,
	"baolin.wang@linux.alibaba.com" <baolin.wang@linux.alibaba.com>,
	"zhuo.song@linux.alibaba.com" <zhuo.song@linux.alibaba.com>
Subject: Re: [PATCH v2] ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
Date: Thu, 29 Sep 2022 10:33:36 +0800
Message-ID: <f09e6aee-5d7f-62c2-8a6e-d721d8b22699@linux.alibaba.com>
In-Reply-To: <SJ1PR11MB60830CBCB42CFF552A2B6CF0FC559@SJ1PR11MB6083.namprd11.prod.outlook.com>



On 2022/9/28 1:47 AM, Luck, Tony wrote:
> I follow and agree with everything up until:
> 
>> In a conclusion, the error will be handled in a kworker with or without this fix.

> 
> It isn't handled during the interrupt (it can't be).

Yes, it is not handled during the interrupt, and it does not have to be.

>
> Who handles the error if the interrupt happens during the execution of a kthread?

As I mentioned, the GHES driver always queues work on a workqueue via
memory_failure_queue() to handle the memory failure of a page, so the **worker will be
scheduled and will handle the memory failure later**.
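
To illustrate the queue-then-drain pattern, here is a minimal user-space sketch. It is
not the kernel code: every model_* name, the FIFO_SIZE capacity, and the work_pending
flag (standing in for schedule_work_on()) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE 4 /* hypothetical capacity, chosen only for this model */

struct model_mf_entry {
    unsigned long pfn;
    int flags;
};

/* Models one per-CPU queue: a small ring buffer plus a "work scheduled" flag. */
struct model_mf_cpu {
    struct model_mf_entry fifo[FIFO_SIZE];
    size_t head;        /* next slot to read */
    size_t tail;        /* next slot to write */
    bool work_pending;  /* stand-in for schedule_work_on() having been called */
};

/* IRQ-context side: only enqueue and mark the work as scheduled. */
static bool model_memory_failure_queue(struct model_mf_cpu *c,
                                       unsigned long pfn, int flags)
{
    assert(c != NULL);
    if (c->tail - c->head == FIFO_SIZE)
        return false;   /* queue full, entry is not accepted */
    c->fifo[c->tail % FIFO_SIZE] = (struct model_mf_entry){ pfn, flags };
    c->tail++;
    c->work_pending = true;
    return true;
}

/* Worker side: drain the queue until it is empty; returns entries handled. */
static int model_memory_failure_work_func(struct model_mf_cpu *c)
{
    int handled = 0;

    assert(c != NULL);
    c->work_pending = false;
    while (c->head != c->tail) {
        struct model_mf_entry e = c->fifo[c->head % FIFO_SIZE];

        c->head++;
        (void)e;        /* the soft/hard offline of e.pfn would happen here */
        handled++;
    }
    return handled;
}
```

In this model the interrupt side never handles the page itself. It only records the
entry, and the later worker call does the handling, regardless of which task was
interrupted.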

> 
> Can't use the task_work_add() trick to handle it (because this thread never returns to user mode).

Yes, it cannot, and this is the key point of the fix.

> 
> So how is the error handled?
> 

The workflow for handling a hardware error is summarized below:

-----------------------------------------------------------------------------
[ghes_sdei_critical_callback: current swapper/3, CPU 3]
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry 		// peek and read estatus
        => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------------------------------------------------
[ghes_proc_in_irq: current swapper/3, CPU 3]
            => ghes_do_proc
                => ghes_handle_memory_failure
                    => ghes_do_memory_failure
                        => memory_failure_queue	 // put work task on current CPU
                            => if (kfifo_put(&mf_cpu->fifo, entry))
                                  schedule_work_on(smp_processor_id(), &mf_cpu->work);
            => task_work_add(current, &estatus_node->task_work, TWA_RESUME); // fix here, always added to current
[ghes_proc_in_irq: return]
-----------------------------------------------------------------------------
// kworker preempts swapper/3 on CPU 3 due to RESCHED flag
[memory_failure_work_func: current kworker, CPU 3]	
     => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry);	// until the fifo is empty
            => soft/hard offline
-----------------------------------------------------------------------------

STEP 0: The firmware notifies the kernel of the hardware error through SDEI
(ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).

STEP 1: In the SDEI callback (or any NMI-like handler), memory from ghes_estatus_pool is
used to save the estatus, which is added to ghes_estatus_llist. The swapper running on
CPU 3 is interrupted. irq_work_queue() causes ghes_proc_in_irq() to run in IRQ
context, where each estatus in ghes_estatus_llist is processed.

STEP 2: In IRQ context, ghes_proc_in_irq() queues the memory failure work on the current
CPU's workqueue and adds task work to sync with the workqueue.

STEP 3: The kworker preempts the currently running thread and gets CPU 3. The memory
failure is then processed in the kworker.

(STEP 4 for a user thread: ghes_kick_task_work() is called as task_work to ensure that any
queued work has completed before returning to user space. The estatus_node is then freed.)

If the task work is not added, estatus_node->task_work.func will be NULL, and estatus_node
is freed in STEP 2.
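
The free-or-defer decision can be sketched in user space as follows. This is only a
model under stated assumptions, not the patch itself: all model_* names are
hypothetical, the is_kernel_thread flag stands in for however the kernel detects a
kernel thread, and the wait for queued work in the callback is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_node;
typedef void (*model_task_work_fn)(struct model_node *);

/* Models estatus_node: a NULL callback means no task work was added. */
struct model_node {
    model_task_work_fn task_work_func;
    bool freed;
};

/* Models the interrupted task ("current"). */
struct model_task {
    bool is_kernel_thread;       /* assumption: stand-in for the real check */
    struct model_node *pending;  /* task work to run on return to user space */
};

static void model_free_node(struct model_node *n)
{
    assert(!n->freed);           /* guard against freeing the node twice */
    n->freed = true;
}

/* Models ghes_kick_task_work(): wait for queued work, then free the node. */
static void model_kick_task_work(struct model_node *n)
{
    /* the wait for the queued memory failure work is omitted in this model */
    model_free_node(n);
}

/* Models the tail of ghes_proc_in_irq() with the fix applied. */
static void model_proc_in_irq(struct model_task *cur, struct model_node *n,
                              bool work_queued)
{
    n->task_work_func = NULL;
    if (work_queued && !cur->is_kernel_thread) {
        /* user task: defer the free until it returns to user space */
        n->task_work_func = model_kick_task_work;
        cur->pending = n;
    }
    if (!n->task_work_func)
        model_free_node(n);      /* no task work added: free right away */
}

/* Models the return-to-user path; a kernel thread never reaches this. */
static void model_return_to_user(struct model_task *cur)
{
    struct model_node *n = cur->pending;

    if (n) {
        cur->pending = NULL;
        n->task_work_func(n);
    }
}
```

If model_proc_in_irq() added the task work unconditionally, a node handed to a kernel
thread would stay unfreed, because model_return_to_user() is never called for that
task. That corresponds to the leak this patch avoids.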

I hope this helps make the problem clearer. You can also check the stacks dumped from the
key functions in the flow above.

Best Regards,
Shuai


---------------------------------------------------------------------------------------
dump_stack() is added in:
- __ghes_sdei_callback()
- ghes_proc_in_irq()
- memory_failure_queue_kick()
- memory_failure_work_func()
- memory_failure()

[  485.457761] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E      6.0.0-rc5+ #33
[  485.457769] Hardware name: xxxx
[  485.457771] Call trace:
[  485.457772]  dump_backtrace+0xe8/0x12c
[  485.457779]  show_stack+0x20/0x50
[  485.457781]  dump_stack_lvl+0x68/0x84
[  485.457785]  dump_stack+0x18/0x34
[  485.457787]  __ghes_sdei_callback+0x24/0x64
[  485.457789]  ghes_sdei_critical_callback+0x5c/0x94
[  485.457792]  sdei_event_handler+0x28/0x90
[  485.457795]  do_sdei_event+0x74/0x160
[  485.457797]  __sdei_handler+0x60/0xf0
[  485.457799]  __sdei_asm_handler+0xbc/0x18c
[  485.457801]  cpu_do_idle+0x14/0x80
[  485.457802]  default_idle_call+0x50/0x114
[  485.457804]  cpuidle_idle_call+0x16c/0x1c0
[  485.457806]  do_idle+0xb8/0x110
[  485.457808]  cpu_startup_entry+0x2c/0x34
[  485.457809]  secondary_start_kernel+0xf0/0x144
[  485.457812]  __secondary_switched+0xb0/0xb4

[  485.459513] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 page:0x89c033 offset:0x400 grain:1 - APEI location: node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)
[  485.459523] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  485.470607] {2}[Hardware Error]: event severity: recoverable
[  485.476252] {2}[Hardware Error]:  precise tstamp: 2022-09-29 09:31:27
[  485.482678] {2}[Hardware Error]:  Error 0, type: recoverable
[  485.488322] {2}[Hardware Error]:   section_type: memory error
[  485.494052] {2}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[  485.503081] {2}[Hardware Error]:   physical_address: 0x000000089c033400
[  485.509680] {2}[Hardware Error]:   node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0
[  485.521487] {2}[Hardware Error]:   error_type: 5, multi-symbol chipkill ECC

[  485.528439] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E      6.0.0-rc5+ #33
[  485.528440] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528441] Call trace:
[  485.528441]  dump_backtrace+0xe8/0x12c
[  485.528443]  show_stack+0x20/0x50
[  485.528444]  dump_stack_lvl+0x68/0x84
[  485.528446]  dump_stack+0x18/0x34
[  485.528448]  ghes_proc_in_irq+0x220/0x250
[  485.528450]  irq_work_single+0x30/0x80
[  485.528453]  irq_work_run_list+0x4c/0x70
[  485.528455]  irq_work_run+0x28/0x44
[  485.528457]  do_handle_IPI+0x2b4/0x2f0
[  485.528459]  ipi_handler+0x24/0x34
[  485.528461]  handle_percpu_devid_irq+0x90/0x1c4
[  485.528463]  generic_handle_domain_irq+0x34/0x50
[  485.528465]  __gic_handle_irq_from_irqson.isra.0+0x130/0x230
[  485.528468]  gic_handle_irq+0x2c/0x60
[  485.528469]  call_on_irq_stack+0x2c/0x38
[  485.528471]  do_interrupt_handler+0x88/0x90
[  485.528472]  el1_interrupt+0x48/0xb0
[  485.528475]  el1h_64_irq_handler+0x18/0x24
[  485.528476]  el1h_64_irq+0x74/0x78
[  485.528477]  __do_softirq+0xa4/0x358
[  485.528478]  __irq_exit_rcu+0x110/0x13c
[  485.528479]  irq_exit_rcu+0x18/0x24
[  485.528480]  el1_interrupt+0x4c/0xb0
[  485.528482]  el1h_64_irq_handler+0x18/0x24
[  485.528483]  el1h_64_irq+0x74/0x78
[  485.528484]  arch_cpu_idle+0x18/0x40
[  485.528485]  default_idle_call+0x50/0x114
[  485.528487]  cpuidle_idle_call+0x16c/0x1c0
[  485.528488]  do_idle+0xb8/0x110
[  485.528489]  cpu_startup_entry+0x2c/0x34
[  485.528491]  secondary_start_kernel+0xf0/0x144
[  485.528493]  __secondary_switched+0xb0/0xb4

[  485.528511] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G            E      6.0.0-rc5+ #33
[  485.528513] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528514] Workqueue: events memory_failure_work_func
[  485.528518] Call trace:
[  485.528519]  dump_backtrace+0xe8/0x12c
[  485.528520]  show_stack+0x20/0x50
[  485.528521]  dump_stack_lvl+0x68/0x84
[  485.528523]  dump_stack+0x18/0x34
[  485.528525]  memory_failure_work_func+0xec/0x180
[  485.528527]  process_one_work+0x1f4/0x460
[  485.528528]  worker_thread+0x188/0x3e4
[  485.528530]  kthread+0xd0/0xd4
[  485.528532]  ret_from_fork+0x10/0x20

[  485.528533] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G            E      6.0.0-rc5+ #33
[  485.528534] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528535] Workqueue: events memory_failure_work_func
[  485.528537] Call trace:
[  485.528538]  dump_backtrace+0xe8/0x12c
[  485.528539]  show_stack+0x20/0x50
[  485.528540]  dump_stack_lvl+0x68/0x84
[  485.528541]  dump_stack+0x18/0x34
[  485.528543]  memory_failure+0x50/0x438
[  485.528544]  memory_failure_work_func+0x174/0x180
[  485.528546]  process_one_work+0x1f4/0x460
[  485.528547]  worker_thread+0x188/0x3e4
[  485.528548]  kthread+0xd0/0xd4
[  485.528550]  ret_from_fork+0x10/0x20
[  485.530622] Memory failure: 0x89c033: recovery action for dirty LRU page: Recovered





