All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Naveen Naidu <naveennaidu479@gmail.com>
Cc: bhelgaas@google.com, ruscur@russell.cc, oohall@gmail.com,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org,
	linux-kernel-mentees@lists.linuxfoundation.org,
	Keith Busch <kbusch@kernel.org>, Sinan Kaya <okaya@kernel.org>,
	Oza Pawandeep <poza@codeaurora.org>
Subject: Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
Date: Wed, 20 Oct 2021 21:09:34 -0500	[thread overview]
Message-ID: <20211021020934.GA2658296@bhelgaas> (raw)
In-Reply-To: <0a443323ab64ba8c0fc6caa03ca56ecd4d038ea3.1633453452.git.naveennaidu479@gmail.com>

[+cc Keith, Sinan, Oza]

On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:
> In the EDR path, AER registers are cleared *after* DPC error event is
> processed. The process stack in EDR is:
> 
>   edr_handle_event()
>     dpc_process_error()
>     pci_aer_raw_clear_status()
>     pcie_do_recovery()
> 
> But in DPC path, AER status registers are cleared *while* processing
> the error. The process stack in DPC is:
> 
>   dpc_handler()
>     dpc_process_error()
>       pci_aer_clear_status()
>     pcie_do_recovery()

These are accurate but they both include dpc_process_error(), so we
need a hint to show why the one here is different from the one in the
EDR path, e.g.,

  dpc_handler
    dpc_process_error
      if (reason == 0)
        pci_aer_clear_status    # uncorrectable errors only
    pcie_do_recovery

> In EDR path, AER status registers are cleared irrespective of whether
> the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> AER status registers are cleared only when it's an unmasked uncorrectable
> error.
> 
> This leads to two different behaviours for the same task (handling of
> DPC errors) in FFS systems and when native OS has control.

FFS?

I'd really like to have a specific example of how a user would observe
this difference.  I know you probably don't have two systems to
compare like that, but maybe we can work it out manually.

I guess you're saying the problem is in the native DPC handling, and
we don't clear the AER status registers for ERR_NONFATAL,
ERR_NONFATAL, etc., right?

I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER
status in DPC event handling"), where Keith explicitly mentions those
cases.  The commit log here should connect back to that and explain
whether something has changed.

I cc'd Keith and the reviewers of that change in case any of them have
time to dig into this again.

> Bring the same semantics for clearing the AER status register in EDR
> path and DPC path.
> 
> Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
> ---
>  drivers/pci/pcie/dpc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index faf4a1e77fab..68899a3db126 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
>  		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
>  		 aer_get_device_error_info(pdev, &info)) {
>  		aer_print_error(pdev, &info);
> -		pci_aer_clear_status(pdev);
>  	}
>  }
>  
> @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
>  	struct pci_dev *pdev = context;
>  
>  	dpc_process_error(pdev);
> +	pci_aer_clear_status(pdev);
>  
>  	/* We configure DPC so it only triggers on ERR_FATAL */
>  	pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> -- 
> 2.25.1
> 
> _______________________________________________
> Linux-kernel-mentees mailing list
> Linux-kernel-mentees@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

WARNING: multiple messages have this Message-ID (diff)
From: Bjorn Helgaas <helgaas@kernel.org>
To: Naveen Naidu <naveennaidu479@gmail.com>
Cc: Oza Pawandeep <poza@codeaurora.org>,
	linux-pci@vger.kernel.org, Sinan Kaya <okaya@kernel.org>,
	linux-kernel@vger.kernel.org, Keith Busch <kbusch@kernel.org>,
	oohall@gmail.com, bhelgaas@google.com,
	linuxppc-dev@lists.ozlabs.org,
	linux-kernel-mentees@lists.linuxfoundation.org
Subject: Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
Date: Wed, 20 Oct 2021 21:09:34 -0500	[thread overview]
Message-ID: <20211021020934.GA2658296@bhelgaas> (raw)
In-Reply-To: <0a443323ab64ba8c0fc6caa03ca56ecd4d038ea3.1633453452.git.naveennaidu479@gmail.com>

[+cc Keith, Sinan, Oza]

On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:
> In the EDR path, AER registers are cleared *after* DPC error event is
> processed. The process stack in EDR is:
> 
>   edr_handle_event()
>     dpc_process_error()
>     pci_aer_raw_clear_status()
>     pcie_do_recovery()
> 
> But in DPC path, AER status registers are cleared *while* processing
> the error. The process stack in DPC is:
> 
>   dpc_handler()
>     dpc_process_error()
>       pci_aer_clear_status()
>     pcie_do_recovery()

These are accurate but they both include dpc_process_error(), so we
need a hint to show why the one here is different from the one in the
EDR path, e.g.,

  dpc_handler
    dpc_process_error
      if (reason == 0)
        pci_aer_clear_status    # uncorrectable errors only
    pcie_do_recovery

> In EDR path, AER status registers are cleared irrespective of whether
> the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> AER status registers are cleared only when it's an unmasked uncorrectable
> error.
> 
> This leads to two different behaviours for the same task (handling of
> DPC errors) in FFS systems and when native OS has control.

FFS?

I'd really like to have a specific example of how a user would observe
this difference.  I know you probably don't have two systems to
compare like that, but maybe we can work it out manually.

I guess you're saying the problem is in the native DPC handling, and
we don't clear the AER status registers for ERR_NONFATAL,
ERR_NONFATAL, etc., right?

I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER
status in DPC event handling"), where Keith explicitly mentions those
cases.  The commit log here should connect back to that and explain
whether something has changed.

I cc'd Keith and the reviewers of that change in case any of them have
time to dig into this again.

> Bring the same semantics for clearing the AER status register in EDR
> path and DPC path.
> 
> Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
> ---
>  drivers/pci/pcie/dpc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index faf4a1e77fab..68899a3db126 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
>  		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
>  		 aer_get_device_error_info(pdev, &info)) {
>  		aer_print_error(pdev, &info);
> -		pci_aer_clear_status(pdev);
>  	}
>  }
>  
> @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
>  	struct pci_dev *pdev = context;
>  
>  	dpc_process_error(pdev);
> +	pci_aer_clear_status(pdev);
>  
>  	/* We configure DPC so it only triggers on ERR_FATAL */
>  	pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> -- 
> 2.25.1
> 
> _______________________________________________
> Linux-kernel-mentees mailing list
> Linux-kernel-mentees@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

WARNING: multiple messages have this Message-ID (diff)
From: Bjorn Helgaas <helgaas@kernel.org>
To: Naveen Naidu <naveennaidu479@gmail.com>
Cc: Oza Pawandeep <poza@codeaurora.org>,
	linux-pci@vger.kernel.org, Sinan Kaya <okaya@kernel.org>,
	linux-kernel@vger.kernel.org, Keith Busch <kbusch@kernel.org>,
	oohall@gmail.com, ruscur@russell.cc, bhelgaas@google.com,
	linuxppc-dev@lists.ozlabs.org,
	linux-kernel-mentees@lists.linuxfoundation.org
Subject: Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers
Date: Wed, 20 Oct 2021 21:09:34 -0500	[thread overview]
Message-ID: <20211021020934.GA2658296@bhelgaas> (raw)
In-Reply-To: <0a443323ab64ba8c0fc6caa03ca56ecd4d038ea3.1633453452.git.naveennaidu479@gmail.com>

[+cc Keith, Sinan, Oza]

On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:
> In the EDR path, AER registers are cleared *after* DPC error event is
> processed. The process stack in EDR is:
> 
>   edr_handle_event()
>     dpc_process_error()
>     pci_aer_raw_clear_status()
>     pcie_do_recovery()
> 
> But in DPC path, AER status registers are cleared *while* processing
> the error. The process stack in DPC is:
> 
>   dpc_handler()
>     dpc_process_error()
>       pci_aer_clear_status()
>     pcie_do_recovery()

These are accurate but they both include dpc_process_error(), so we
need a hint to show why the one here is different from the one in the
EDR path, e.g.,

  dpc_handler
    dpc_process_error
      if (reason == 0)
        pci_aer_clear_status    # uncorrectable errors only
    pcie_do_recovery

> In EDR path, AER status registers are cleared irrespective of whether
> the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> AER status registers are cleared only when it's an unmasked uncorrectable
> error.
> 
> This leads to two different behaviours for the same task (handling of
> DPC errors) in FFS systems and when native OS has control.

FFS?

I'd really like to have a specific example of how a user would observe
this difference.  I know you probably don't have two systems to
compare like that, but maybe we can work it out manually.

I guess you're saying the problem is in the native DPC handling, and
we don't clear the AER status registers for ERR_NONFATAL,
ERR_NONFATAL, etc., right?

I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER
status in DPC event handling"), where Keith explicitly mentions those
cases.  The commit log here should connect back to that and explain
whether something has changed.

I cc'd Keith and the reviewers of that change in case any of them have
time to dig into this again.

> Bring the same semantics for clearing the AER status register in EDR
> path and DPC path.
> 
> Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
> ---
>  drivers/pci/pcie/dpc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index faf4a1e77fab..68899a3db126 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
>  		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
>  		 aer_get_device_error_info(pdev, &info)) {
>  		aer_print_error(pdev, &info);
> -		pci_aer_clear_status(pdev);
>  	}
>  }
>  
> @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
>  	struct pci_dev *pdev = context;
>  
>  	dpc_process_error(pdev);
> +	pci_aer_clear_status(pdev);
>  
>  	/* We configure DPC so it only triggers on ERR_FATAL */
>  	pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> -- 
> 2.25.1
> 
> _______________________________________________
> Linux-kernel-mentees mailing list
> Linux-kernel-mentees@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees
_______________________________________________
Linux-kernel-mentees mailing list
Linux-kernel-mentees@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees

  reply	other threads:[~2021-10-21  2:09 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-05 17:18 [PATCH v4 0/8] Fix long standing AER Error Handling Issues Naveen Naidu
2021-10-05 17:18 ` Naveen Naidu
2021-10-05 17:18 ` Naveen Naidu
2021-10-05 17:18 ` [PATCH v4 1/8] PCI/AER: Remove ID from aer_agent_string[] Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-21  1:28   ` Bjorn Helgaas
2021-10-21  1:28     ` Bjorn Helgaas
2021-10-21  1:28     ` Bjorn Helgaas
2021-10-21 16:30     ` Naveen Naidu
2021-10-21 16:30       ` Naveen Naidu
2021-10-21 16:30       ` Naveen Naidu
2021-10-21 17:03       ` Bjorn Helgaas
2021-10-21 17:03         ` Bjorn Helgaas
2021-10-21 17:03         ` Bjorn Helgaas
2021-10-05 17:18 ` [PATCH v4 2/8] PCI: Cleanup struct aer_err_info Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18 ` [PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error() Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-21  1:28   ` Bjorn Helgaas
2021-10-21  1:28     ` Bjorn Helgaas
2021-10-21  1:28     ` Bjorn Helgaas
2021-10-21 16:31     ` Naveen Naidu
2021-10-21 16:31       ` Naveen Naidu
2021-10-21 16:31       ` Naveen Naidu
2021-10-05 17:18 ` [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() " Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-21  1:40   ` Bjorn Helgaas
2021-10-21  1:40     ` Bjorn Helgaas
2021-10-21  1:40     ` Bjorn Helgaas
2021-10-21 16:36     ` Naveen Naidu
2021-10-21 16:36       ` Naveen Naidu
2021-10-21 16:36       ` Naveen Naidu
2021-10-21 17:05       ` Bjorn Helgaas
2021-10-21 17:05         ` Bjorn Helgaas
2021-10-21 17:05         ` Bjorn Helgaas
2021-10-05 17:18 ` [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-21  2:09   ` Bjorn Helgaas [this message]
2021-10-21  2:09     ` Bjorn Helgaas
2021-10-21  2:09     ` Bjorn Helgaas
2021-10-21 16:53     ` Naveen Naidu
2021-10-21 16:53       ` Naveen Naidu
2021-10-21 16:53       ` Naveen Naidu
2021-10-21 17:11       ` Bjorn Helgaas
2021-10-21 17:11         ` Bjorn Helgaas
2021-10-21 17:11         ` Bjorn Helgaas
2021-10-05 17:18 ` [PATCH v4 6/8] PCI/AER: Clear error device AER registers in aer_irq() Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18 ` [PATCH v4 7/8] PCI/ERR: Remove redundant clearing of AER register in pcie_do_recovery() Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18 ` [PATCH v4 8/8] PCI/AER: Include DEVCTL in aer_print_error() Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu
2021-10-05 17:18   ` Naveen Naidu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211021020934.GA2658296@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=kbusch@kernel.org \
    --cc=linux-kernel-mentees@lists.linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=naveennaidu479@gmail.com \
    --cc=okaya@kernel.org \
    --cc=oohall@gmail.com \
    --cc=poza@codeaurora.org \
    --cc=ruscur@russell.cc \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.