linux-pci.vger.kernel.org archive mirror
From: James Morse <james.morse@arm.com>
To: Shiju Jose <shiju.jose@huawei.com>
Cc: "linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"helgaas@kernel.org" <helgaas@kernel.org>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"bp@alien8.de" <bp@alien8.de>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"zhangliguang@linux.alibaba.com" <zhangliguang@linux.alibaba.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	Linuxarm <linuxarm@huawei.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	tanxiaofei <tanxiaofei@huawei.com>,
	yangyicong <yangyicong@huawei.com>
Subject: Re: [PATCH v4 1/2] ACPI: APEI: Add support to notify the vendor specific HW errors
Date: Fri, 13 Mar 2020 15:17:22 +0000
Message-ID: <7c1a20de-746f-8fc2-29a1-6e5d607ebb48@arm.com>
In-Reply-To: <689f0c7cb0fe49d6a9df140cc1b56690@huawei.com>

Hi Shiju,

On 3/12/20 12:10 PM, Shiju Jose wrote:
>> On 07/02/2020 10:31, Shiju Jose wrote:
>>> Presently APEI does not support reporting the vendor specific HW
>>> errors, received in the vendor defined table entries, to the vendor
>>> drivers for any recovery.
>>>
>>> This patch adds the support to register and unregister the error
>>> handling function for the vendor specific HW errors and notify the
>>> registered kernel driver.
>>
>> Is it possible to use the kernel's existing atomic_notifier_chain_register() API for
>> this?
>>
>> The one thing that can't be done in the same way is the GUID filtering in ghes.c.
>> Each driver would need to check if the call matched a GUID they knew about,
>> and return NOTIFY_DONE if they "don't care".
>>
>> I think this patch would be a lot smaller if it was tweaked to be able to use the
>> existing API. If there is a reason not to use it, it would be good to know what it
>> is.

> I think when using atomic_notifier_chain_register() we have the following limitations:
> 1. All the registered error handlers would get called, even when an error is not related to those handlers.

The notifier chain provides NOTIFY_STOP_MASK, so that one of the
callbacks can say the work is done. We only expect a handful of these,
so I don't think there is going to be a scalability problem.
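
To make that concrete, here is a rough userspace sketch of the pattern. It
is not kernel code: mock_guid, mock_vendor_event, mock_chain_call() and the
two handlers are made up for illustration, and only the NOTIFY_* values are
taken from include/linux/notifier.h. Each handler checks the GUID itself,
returns NOTIFY_DONE when it does not care, and NOTIFY_STOP when it has
dealt with the event.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins; the values match include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
};

/* Hypothetical 16-byte GUID standing in for guid_t. */
struct mock_guid { unsigned char b[16]; };

/* Hypothetical event handed to every callback on the chain. */
struct mock_vendor_event {
	struct mock_guid sec_type;
	int handled_by;		/* 0 means nobody claimed it */
};

/* Simplified chain walk: stop as soon as a callback sets NOTIFY_STOP_MASK. */
static int mock_chain_call(struct notifier_block *head,
			   unsigned long action, void *data)
{
	struct notifier_block *nb;
	int ret = NOTIFY_DONE;

	for (nb = head; nb; nb = nb->next) {
		assert(nb->notifier_call);
		ret = nb->notifier_call(nb, action, data);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

static const struct mock_guid guid_a = { { 0xAA } };
static const struct mock_guid guid_b = { { 0xBB } };

static int handler_a(struct notifier_block *nb, unsigned long action, void *data)
{
	struct mock_vendor_event *ev = data;

	(void)nb; (void)action;
	if (memcmp(&ev->sec_type, &guid_a, sizeof(guid_a)))
		return NOTIFY_DONE;	/* not our GUID: "don't care" */
	ev->handled_by = 1;
	return NOTIFY_STOP;		/* ours: the work is done */
}

static int handler_b(struct notifier_block *nb, unsigned long action, void *data)
{
	struct mock_vendor_event *ev = data;

	(void)nb; (void)action;
	if (memcmp(&ev->sec_type, &guid_b, sizeof(guid_b)))
		return NOTIFY_DONE;
	ev->handled_by = 2;
	return NOTIFY_STOP;
}

/* Build the chain a -> b, deliver one event, report who claimed it. */
int mock_dispatch(const struct mock_guid *sec_type)
{
	struct notifier_block nb_b = { handler_b, NULL };
	struct notifier_block nb_a = { handler_a, &nb_b };
	struct mock_vendor_event ev = { *sec_type, 0 };

	mock_chain_call(&nb_a, 0, &ev);
	return ev.handled_by;
}
```

With only a handful of handlers registered, the cost is a few GUID
comparisons per event.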


>     Also this may lead to mishandling of the error information if a handler does not
>     implement GUID checking etc.

Which would be a bug we can fix.
There is no point worrying about bugs in out of tree code.


> 2. atomic_notifier_chain_register() (notifier_chain_register()) does not appear to support
>     passing the handler's private data at registration time, which is supposed to be
>     passed later to the handler in the callback, notifier_fn_t(..., void *data).

The callback is provided with the struct notifier_block. A bit of
container_of() magic will give you whatever structure you embedded it in!
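
Something along these lines, as a userspace sketch only: mock_err_dev and
its fields are invented for the example, and the container_of() here is a
simplified version of the kernel macro without the type checking. The
driver embeds the notifier_block in its own structure and recovers that
structure from the nb pointer the callback is given.

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_OK 0x0001

/* Simplified container_of(); the kernel macro also type-checks ptr. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
};

/* Hypothetical per-device driver state with the notifier embedded in it. */
struct mock_err_dev {
	int socket_id;
	unsigned int errors_seen;
	struct notifier_block nb;
};

static int mock_err_notify(struct notifier_block *nb,
			   unsigned long action, void *data)
{
	/* No private-data argument needed: walk back from nb to the device. */
	struct mock_err_dev *dev = container_of(nb, struct mock_err_dev, nb);

	(void)action; (void)data;
	assert(dev != NULL);
	dev->errors_seen++;
	return NOTIFY_OK;
}

/* Invoke the callback twice the way a chain would, via the nb pointer only. */
unsigned int mock_private_data_demo(void)
{
	struct mock_err_dev dev = {
		.socket_id = 3,
		.nb = { .notifier_call = mock_err_notify },
	};
	struct notifier_block *registered = &dev.nb;

	registered->notifier_call(registered, 0, NULL);
	registered->notifier_call(registered, 0, NULL);
	return dev.errors_seen;
}
```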


> 3. We also had difficulty passing the GHES error data (acpi_hest_generic_data) and the GUID
>     of the received error to the handler through the notifier chain callback interface.

Here you've lost me. Because you need to pass more than one thing? Can't
we have a struct for that?

But, isn't it all in struct acpi_hest_generic_data already? That is
where the guid and severity come from.
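
A sketch of what I mean, in userspace C and with made-up names:
mock_generic_data is a trimmed stand-in for struct acpi_hest_generic_data
(the real one has more fields), and mock_vendor_guid is an arbitrary
value. The callback's void *data points at the generic data entry, and the
handler pulls both the section type and the severity out of it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NOTIFY_DONE 0x0000
#define NOTIFY_OK   0x0001

/*
 * Trimmed stand-in for struct acpi_hest_generic_data. The real structure
 * carries further fields (revision, flags, fru_id, fru_text, ...).
 */
struct mock_generic_data {
	uint8_t  section_type[16];
	uint32_t error_severity;
	uint32_t error_data_length;
};

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
};

/* Arbitrary GUID for the example; remaining bytes are zero. */
static const uint8_t mock_vendor_guid[16] = { 0xde, 0xad, 0xbe, 0xef };

/* Records the severity of the last event this handler accepted. */
static uint32_t mock_last_severity = UINT32_MAX;

static int mock_vendor_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	struct mock_generic_data *gdata = data;

	(void)nb; (void)action;
	assert(gdata != NULL);

	/* GUID and severity both come from the one structure. */
	if (memcmp(gdata->section_type, mock_vendor_guid,
		   sizeof(mock_vendor_guid)))
		return NOTIFY_DONE;

	mock_last_severity = gdata->error_severity;
	return NOTIFY_OK;
}

/* Deliver one generic data entry to the handler, as a chain would. */
int mock_deliver(struct mock_generic_data *gdata)
{
	struct notifier_block nb = { mock_vendor_notify };

	return nb.notifier_call(&nb, 0, gdata);
}
```

If anything else were needed, the unused 'action' argument could carry it,
or the pointer could refer to a small wrapper struct instead.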


>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 103acbb..69e18d7 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -490,6 +490,109 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>>
>>> +/**
>>> + * ghes_unregister_event_handler - unregister the previously
>>> + * registered event handling function.
>>> + * @sec_type: sec_type of the corresponding CPER.
>>> + * @data: driver specific data to distinguish devices.
>>> + */
>>> +void ghes_unregister_event_handler(guid_t sec_type, void *data) {
>>> +	struct ghes_event_notify *event_notify;
>>> +	bool found = false;
>>> +
>>> +	mutex_lock(&ghes_event_notify_mutex);
>>> +	rcu_read_lock();
>>> +	list_for_each_entry_rcu(event_notify,
>>> +				&ghes_event_handler_list, list) {
>>> +		if (guid_equal(&event_notify->sec_type, &sec_type)) {
>>
>>> +			if (data != event_notify->data)
>>
>> It looks like you need multiple drivers to handle the same GUID because of
>> multiple root ports. Can't the handler look up the right device?

> This check is there because the GUID is shared among multiple devices handled by one
> driver, as seen in the B2889FC9 driver (pcie-hisi-error.c).

(we should stop calling it by its guid ... does it have a name?!)


This must be some kind of error collector for a bus, right?

I agree we may need to have multiple drivers register to handle vendor
events, but it looks like you are registering the same handler multiple
times, with different private structures.

Can't it find the affected device from the error description?
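
Purely as an illustration, and assuming the vendor record carries some
identifier for the affected device (the socket_id field below is a guess at
what that might be, not something taken from the patch): one handler is
registered once, and it looks the device up in a driver-private list.
All names here are hypothetical and this is plain userspace C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vendor error record: assume it identifies its source. */
struct mock_err_record {
	unsigned int socket_id;
	unsigned int err_type;
};

/* Hypothetical per-device state, kept on a driver-private list. */
struct mock_err_dev {
	unsigned int socket_id;
	unsigned int recoveries;
	struct mock_err_dev *next;
};

static struct mock_err_dev *mock_dev_list;

/* Called from probe(): add the device to the list, no extra registration. */
void mock_dev_add(struct mock_err_dev *dev)
{
	dev->next = mock_dev_list;
	mock_dev_list = dev;
}

static struct mock_err_dev *mock_dev_find(unsigned int socket_id)
{
	struct mock_err_dev *dev;

	for (dev = mock_dev_list; dev; dev = dev->next)
		if (dev->socket_id == socket_id)
			return dev;
	return NULL;
}

/*
 * The single handler: find the affected device from the record itself.
 * Returns 1 if a device was found and "recovered", 0 otherwise.
 */
int mock_handle_record(const struct mock_err_record *rec)
{
	struct mock_err_dev *dev;

	assert(rec != NULL);
	dev = mock_dev_find(rec->socket_id);
	if (!dev)
		return 0;
	dev->recoveries++;
	return 1;
}
```

A real driver would also need locking around the list, which is left out
here.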


>>> @@ -525,11 +628,14 @@ static void ghes_do_proc(struct ghes *ghes,
>>>
>>>  			log_arm_hw_error(err);
>>>  		} else {
>>> -			void *err = acpi_hest_get_payload(gdata);
>>> -
>>> -			log_non_standard_event(sec_type, fru_id, fru_text,
>>> -					       sec_sev, err,
>>> -					       gdata->error_data_length);
>>> +			if (!ghes_handle_non_standard_event(sec_type, gdata,
>>> +							    sev)) {
>>> +				void *err = acpi_hest_get_payload(gdata);
>>> +
>>> +				log_non_standard_event(sec_type, fru_id,
>>> +						       fru_text, sec_sev, err,
>>> +						       gdata->error_data_length);
>>> +			}
>>
>> So, a side effect of the kernel handling these is they no longer get logged out of
>> trace points?
>>
>> I guess the driver that claims this logs some more accurate information. Are
>> there expected to be any user-space programs doing something useful with
>> B2889FC9... today?

> The B2889FC9 driver does not expect any corresponding user space programs.
> The driver is mainly for error recovery and basic error decoding and logging.

> Previously we added the error logging for the B2889FC9 in the rasdaemon.

So this series would break the error logging in rasdaemon.

User-space would need to be upgraded to receive the trace information
from the specific driver instead. (how does it know?!)

Could we log_non_standard_event() unconditionally, maybe adding a field
to indicate that a driver claimed it, so there may be more data
somewhere else...
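
Roughly this shape, sketched in userspace C. Everything prefixed mock_ is
invented for the example, and the 'claimed' field is the hypothetical
addition; the existing log_non_standard_event() has no such parameter
today. The point is only the ordering: ask the handler first, then log
regardless, recording whether something claimed the event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

/* What the hypothetical extended trace event would carry. */
struct mock_trace_entry {
	unsigned int data_len;
	bool claimed;		/* hypothetical new field */
};

#define MOCK_TRACE_MAX 8
static struct mock_trace_entry mock_trace_log[MOCK_TRACE_MAX];
static size_t mock_trace_count;

/* Stand-in for the trace point: append to an in-memory log. */
static void mock_log_non_standard_event(unsigned int data_len, bool claimed)
{
	if (mock_trace_count < MOCK_TRACE_MAX) {
		mock_trace_log[mock_trace_count].data_len = data_len;
		mock_trace_log[mock_trace_count].claimed = claimed;
		mock_trace_count++;
	}
}

typedef int (*mock_vendor_fn)(const void *payload);

/* A driver that claims the event, and one that does not care. */
int mock_claiming_handler(const void *payload)
{
	(void)payload;
	return NOTIFY_STOP;
}

int mock_ignoring_handler(const void *payload)
{
	(void)payload;
	return NOTIFY_DONE;
}

/* Non-standard section path: notify first, then log unconditionally. */
void mock_handle_non_standard(mock_vendor_fn handler,
			      const void *payload, unsigned int len)
{
	int ret = handler ? handler(payload) : NOTIFY_DONE;

	mock_log_non_standard_event(len, (ret & NOTIFY_STOP_MASK) != 0);
}
```

Existing user-space consumers would keep receiving every event, and could
use the flag as a hint to look elsewhere for the driver's own output.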


Thanks,

James
