linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
@ 2020-11-05 13:36 George Cherian
  2020-11-05 17:07 ` Jakub Kicinski
  2020-11-05 19:15 ` Saeed Mahameed
  0 siblings, 2 replies; 12+ messages in thread
From: George Cherian @ 2020-11-05 13:36 UTC (permalink / raw)
  To: Saeed Mahameed, netdev, linux-kernel, Jiri Pirko
  Cc: kuba, davem, Sunil Kovvuri Goutham, Linu Cherian,
	Geethasowjanya Akula, masahiroy, willemdebruijn.kernel

Hi Saeed,

Thanks for the review.

> -----Original Message-----
> From: Saeed Mahameed <saeed@kernel.org>
> Sent: Thursday, November 5, 2020 10:39 AM
> To: George Cherian <gcherian@marvell.com>; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Jiri Pirko <jiri@nvidia.com>
> Cc: kuba@kernel.org; davem@davemloft.net; Sunil Kovvuri Goutham
> <sgoutham@marvell.com>; Linu Cherian <lcherian@marvell.com>;
> Geethasowjanya Akula <gakula@marvell.com>; masahiroy@kernel.org;
> willemdebruijn.kernel@gmail.com
> Subject: Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health
> reporters for NIX
> 
> On Wed, 2020-11-04 at 17:57 +0530, George Cherian wrote:
> > Add health reporters for RVU NPA block.
>                                ^^^ NIX ?
> 
Yes, it's NIX.

> Cc: Jiri
> 
> Anyway, could you please spare some words on what is NPA and what is
> NIX?
> 
> Regarding the reporter names: all drivers register well-known generic
> names such as fw, hw, rx, tx. I don't know if it is a good idea to use
> vendor-specific names; if you are reporting for hw/fw units then just
> use "hw" or "fw" as the reporter name and append the unit NPA/NIX to
> the counter/error names.
Okay. These are hw units, I will rename them as hw_npa/hw_nix.
> 
> > Only reporter dump is supported.
> >
> > Output:
> >  # ./devlink health
> >  pci/0002:01:00.0:
> >    reporter npa
> >      state healthy error 0 recover 0
> >    reporter nix
> >      state healthy error 0 recover 0
> >  # ./devlink  health dump show pci/0002:01:00.0 reporter nix
> >   NIX_AF_GENERAL:
> >          Memory Fault on NIX_AQ_INST_S read: 0
> >          Memory Fault on NIX_AQ_RES_S write: 0
> >          AQ Doorbell error: 0
> >          Rx on unmapped PF_FUNC: 0
> >          Rx multicast replication error: 0
> >          Memory fault on NIX_RX_MCE_S read: 0
> >          Memory fault on multicast WQE read: 0
> >          Memory fault on mirror WQE read: 0
> >          Memory fault on mirror pkt write: 0
> >          Memory fault on multicast pkt write: 0
> >    NIX_AF_RAS:
> >          Poisoned data on NIX_AQ_INST_S read: 0
> >          Poisoned data on NIX_AQ_RES_S write: 0
> >          Poisoned data on HW context read: 0
> >          Poisoned data on packet read from mirror buffer: 0
> >          Poisoned data on packet read from mcast buffer: 0
> >          Poisoned data on WQE read from mirror buffer: 0
> >          Poisoned data on WQE read from multicast buffer: 0
> >          Poisoned data on NIX_RX_MCE_S read: 0
> >    NIX_AF_RVU:
> >          Unmap Slot Error: 0
> >
> 
> Now i am a little bit skeptic here, devlink health reporter infrastructure was
> never meant to deal with dump op only, the main purpose is to
> diagnose/dump and recover.
> 
> especially in your use case where you only report counters, i don't believe
> devlink health dump is a proper interface for this.
These are not counters. These are error interrupts raised by HW blocks.
The count is provided to understand how frequently the errors are seen.
Error recovery for some of the blocks happens internally; that is the
reason only the dump op is currently added.
> Many of these counters if not most are data path packet based and maybe
> they should belong to ethtool.

Regards,
-George




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 13:36 [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX George Cherian
@ 2020-11-05 17:07 ` Jakub Kicinski
  2020-11-05 19:23   ` Saeed Mahameed
  2020-11-05 19:15 ` Saeed Mahameed
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2020-11-05 17:07 UTC (permalink / raw)
  To: George Cherian
  Cc: Saeed Mahameed, netdev, linux-kernel, Jiri Pirko, davem,
	Sunil Kovvuri Goutham, Linu Cherian, Geethasowjanya Akula,
	masahiroy, willemdebruijn.kernel

On Thu, 5 Nov 2020 13:36:56 +0000 George Cherian wrote:
> > Now i am a little bit skeptic here, devlink health reporter infrastructure was
> > never meant to deal with dump op only, the main purpose is to
> > diagnose/dump and recover.
> > 
> > especially in your use case where you only report counters, i don't believe
> > devlink health dump is a proper interface for this.  
> These are not counters. These are error interrupts raised by HW blocks.
> The count is provided to understand on how frequently the errors are seen.
> Error recovery for some of the blocks happen internally. That is the reason,
> Currently only dump op is added.

The previous incarnation was printing messages to logs, so I assume
these errors are expected to be relatively low rate.

The point of using devlink health was that you can generate a netlink
notification when the error happens. IOW you need some calls to
devlink_health_report() or such.

At least that's my thinking, others may disagree.
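The devlink_health_report() calls mentioned above could be wired in roughly like this. This is only a hedged sketch, not the driver's actual code: the reporter field, context struct and handler name (rvu_nix_event_ctx, rvu_nix_af_rvu_intr_handler) are hypothetical, modeled on how other drivers connect IRQ handlers to devlink health.

```c
/* Sketch (hypothetical names): report a NIX error interrupt through
 * devlink health so a netlink notification is generated, instead of
 * only incrementing a counter. */
static irqreturn_t rvu_nix_af_rvu_intr_handler(int irq, void *rvu_irq)
{
	struct rvu_devlink *rvu_dl = rvu_irq;
	struct rvu_nix_event_ctx ctx;

	ctx.intr_val = 0; /* read and clear NIX_AF_RVU_INT here */

	/* devlink_health_report() sends the event notification and runs
	 * the reporter's dump/recover ops as applicable. */
	devlink_health_report(rvu_dl->nix_reporter,
			      "NIX_AF_RVU error interrupt", &ctx);
	return IRQ_HANDLED;
}
```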


* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 13:36 [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX George Cherian
  2020-11-05 17:07 ` Jakub Kicinski
@ 2020-11-05 19:15 ` Saeed Mahameed
  2020-11-05 19:29   ` Sunil Kovvuri
  1 sibling, 1 reply; 12+ messages in thread
From: Saeed Mahameed @ 2020-11-05 19:15 UTC (permalink / raw)
  To: George Cherian, netdev, linux-kernel, Jiri Pirko
  Cc: kuba, davem, Sunil Kovvuri Goutham, Linu Cherian,
	Geethasowjanya Akula, masahiroy, willemdebruijn.kernel

On Thu, 2020-11-05 at 13:36 +0000, George Cherian wrote:
> Hi Saeed,
> 
> Thanks for the review.
> 
> > -----Original Message-----
> > From: Saeed Mahameed <saeed@kernel.org>
> > Sent: Thursday, November 5, 2020 10:39 AM
> > To: George Cherian <gcherian@marvell.com>; netdev@vger.kernel.org;
> > linux-kernel@vger.kernel.org; Jiri Pirko <jiri@nvidia.com>
> > Cc: kuba@kernel.org; davem@davemloft.net; Sunil Kovvuri Goutham
> > <sgoutham@marvell.com>; Linu Cherian <lcherian@marvell.com>;
> > Geethasowjanya Akula <gakula@marvell.com>; masahiroy@kernel.org;
> > willemdebruijn.kernel@gmail.com
> > Subject: Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink
> > health
> > reporters for NIX
> > 
> > On Wed, 2020-11-04 at 17:57 +0530, George Cherian wrote:
> > > Add health reporters for RVU NPA block.
> >                                ^^^ NIX ?
> > 
> Yes, it's NIX.
> 
> > Cc: Jiri
> > 
> > Anyway, could you please spare some words on what is NPA and what
> > is
> > NIX?
> > 
> > Regarding the reporters names, all drivers register well known
> > generic names
> > such as (fw,hw,rx,tx), I don't know if it is a good idea to use
> > vendor specific
> > names, if you are reporting for hw/fw units then just use "hw" or
> > "fw" as the
> > reporter name and append the unit NPA/NIX to the counter/error
> > names.
> Okay. These are hw units, I will rename them as hw_npa/hw_nix.

What I meant is to have one reporter named "hw", and inside it report
counters with the unit name appended to the counter name.

./devlink health
  pci/0002:01:00.0:
    reporter hw
      state healthy error 0 recover 0
      
./devlink  health dump show pci/0002:01:00.0 reporter hw
      NIX:
         nix_counter_a: 0
         ...
     NPA: 
         npa_counter_a: 0
         ...



> > > Only reporter dump is supported.
> > > 
> > > Output:
> > >  # ./devlink health
> > >  pci/0002:01:00.0:
> > >    reporter npa
> > >      state healthy error 0 recover 0
> > >    reporter nix
> > >      state healthy error 0 recover 0
> > >  # ./devlink  health dump show pci/0002:01:00.0 reporter nix
> > >   NIX_AF_GENERAL:
> > >          Memory Fault on NIX_AQ_INST_S read: 0
> > >          Memory Fault on NIX_AQ_RES_S write: 0
> > >          AQ Doorbell error: 0
> > >          Rx on unmapped PF_FUNC: 0
> > >          Rx multicast replication error: 0
> > >          Memory fault on NIX_RX_MCE_S read: 0
> > >          Memory fault on multicast WQE read: 0
> > >          Memory fault on mirror WQE read: 0
> > >          Memory fault on mirror pkt write: 0
> > >          Memory fault on multicast pkt write: 0
> > >    NIX_AF_RAS:
> > >          Poisoned data on NIX_AQ_INST_S read: 0
> > >          Poisoned data on NIX_AQ_RES_S write: 0
> > >          Poisoned data on HW context read: 0
> > >          Poisoned data on packet read from mirror buffer: 0
> > >          Poisoned data on packet read from mcast buffer: 0
> > >          Poisoned data on WQE read from mirror buffer: 0
> > >          Poisoned data on WQE read from multicast buffer: 0
> > >          Poisoned data on NIX_RX_MCE_S read: 0
> > >    NIX_AF_RVU:
> > >          Unmap Slot Error: 0
> > > 
> > 
> > Now i am a little bit skeptic here, devlink health reporter
> > infrastructure was
> > never meant to deal with dump op only, the main purpose is to
> > diagnose/dump and recover.
> > 
> > especially in your use case where you only report counters, i don't
> > believe
> > devlink health dump is a proper interface for this.
> These are not counters. These are error interrupts raised by HW
> blocks.
> The count is provided to understand on how frequently the errors are
> seen.
> Error recovery for some of the blocks happen internally. That is the
> reason,
> Currently only dump op is added.

So you are counting these events in the driver, which sounds like a
counter to me. I really think this shouldn't belong to devlink, unless
you actually utilize the devlink health ops for reporting and recovery.

What's wrong with just dumping these counters to ethtool?

> > Many of these counters if not most are data path packet based and
> > maybe
> > they should belong to ethtool.
> 
> Regards,
> -George
> 
> 
> 



* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 17:07 ` Jakub Kicinski
@ 2020-11-05 19:23   ` Saeed Mahameed
  2020-11-05 20:42     ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Saeed Mahameed @ 2020-11-05 19:23 UTC (permalink / raw)
  To: Jakub Kicinski, George Cherian
  Cc: netdev, linux-kernel, Jiri Pirko, davem, Sunil Kovvuri Goutham,
	Linu Cherian, Geethasowjanya Akula, masahiroy,
	willemdebruijn.kernel

On Thu, 2020-11-05 at 09:07 -0800, Jakub Kicinski wrote:
> On Thu, 5 Nov 2020 13:36:56 +0000 George Cherian wrote:
> > > Now i am a little bit skeptic here, devlink health reporter
> > > infrastructure was
> > > never meant to deal with dump op only, the main purpose is to
> > > diagnose/dump and recover.
> > > 
> > > especially in your use case where you only report counters, i
> > > don't believe
> > > devlink health dump is a proper interface for this.  
> > These are not counters. These are error interrupts raised by HW
> > blocks.
> > The count is provided to understand on how frequently the errors
> > are seen.
> > Error recovery for some of the blocks happen internally. That is
> > the reason,
> > Currently only dump op is added.
> 
> The previous incarnation was printing messages to logs, so I assume
> these errors are expected to be relatively low rate.
> 
> The point of using devlink health was that you can generate a netlink
> notification when the error happens. IOW you need some calls to
> devlink_health_report() or such.
> 
> At least that's my thinking, others may disagree.

If you report an error without recovering, devlink health will report a
bad device state:

$ ./devlink health
   pci/0002:01:00.0:
     reporter npa
       state error error 1 recover 0

So you will need to implement an empty recover op.
And if these events are informational only and they don't indicate
device health issues, why would you report them via devlink health?
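The empty recover op mentioned above could be as small as the sketch below. This is an assumption-laden illustration, not the driver's code: the function name is hypothetical, and the ops signature shown is the devlink health recover callback as I understand it around this kernel version.

```c
/* Sketch (hypothetical name): a no-op recover op for errors the
 * hardware already recovers from internally.  Returning 0 lets devlink
 * bump the recover counter and move the reporter back to healthy. */
static int rvu_hw_nix_recover(struct devlink_health_reporter *reporter,
			      void *priv_ctx,
			      struct netlink_ext_ack *extack)
{
	/* recovery already happened inside the NIX block */
	return 0;
}
```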



* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 19:15 ` Saeed Mahameed
@ 2020-11-05 19:29   ` Sunil Kovvuri
  2020-11-06 20:58     ` Saeed Mahameed
  0 siblings, 1 reply; 12+ messages in thread
From: Sunil Kovvuri @ 2020-11-05 19:29 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: George Cherian, netdev, linux-kernel, Jiri Pirko, kuba, davem,
	Sunil Kovvuri Goutham, Linu Cherian, Geethasowjanya Akula,
	masahiroy, willemdebruijn.kernel

> > > > Output:
> > > >  # ./devlink health
> > > >  pci/0002:01:00.0:
> > > >    reporter npa
> > > >      state healthy error 0 recover 0
> > > >    reporter nix
> > > >      state healthy error 0 recover 0
> > > >  # ./devlink  health dump show pci/0002:01:00.0 reporter nix
> > > >   NIX_AF_GENERAL:
> > > >          Memory Fault on NIX_AQ_INST_S read: 0
> > > >          Memory Fault on NIX_AQ_RES_S write: 0
> > > >          AQ Doorbell error: 0
> > > >          Rx on unmapped PF_FUNC: 0
> > > >          Rx multicast replication error: 0
> > > >          Memory fault on NIX_RX_MCE_S read: 0
> > > >          Memory fault on multicast WQE read: 0
> > > >          Memory fault on mirror WQE read: 0
> > > >          Memory fault on mirror pkt write: 0
> > > >          Memory fault on multicast pkt write: 0
> > > >    NIX_AF_RAS:
> > > >          Poisoned data on NIX_AQ_INST_S read: 0
> > > >          Poisoned data on NIX_AQ_RES_S write: 0
> > > >          Poisoned data on HW context read: 0
> > > >          Poisoned data on packet read from mirror buffer: 0
> > > >          Poisoned data on packet read from mcast buffer: 0
> > > >          Poisoned data on WQE read from mirror buffer: 0
> > > >          Poisoned data on WQE read from multicast buffer: 0
> > > >          Poisoned data on NIX_RX_MCE_S read: 0
> > > >    NIX_AF_RVU:
> > > >          Unmap Slot Error: 0
> > > >
> > >
> > > Now i am a little bit skeptic here, devlink health reporter
> > > infrastructure was
> > > never meant to deal with dump op only, the main purpose is to
> > > diagnose/dump and recover.
> > >
> > > especially in your use case where you only report counters, i don't
> > > believe
> > > devlink health dump is a proper interface for this.
> > These are not counters. These are error interrupts raised by HW
> > blocks.
> > The count is provided to understand on how frequently the errors are
> > seen.
> > Error recovery for some of the blocks happen internally. That is the
> > reason,
> > Currently only dump op is added.
>
> So you are counting these events in driver, sounds like a counter to
> me, i really think this shouldn't belong to devlink, unless you really
> utilize devlink health ops for actual reporting and recovery.
>
> what's wrong with just dumping these counters to ethtool ?

This driver is an administrative driver which handles all the resources
in the system and doesn't do any IO.
NIX and NPA are key co-processor blocks which this driver handles.
With NIX and NPA, there are pieces which get attached to a PCI device
to make it a networking device, and we have netdev drivers registered
to this networking device. Some more information about the drivers is
available at
https://www.kernel.org/doc/html/latest/networking/device_drivers/ethernet/marvell/octeontx2.html

So we don't have a netdev here to report these co-processor block
level errors over ethtool.

Thanks,
Sunil.


* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 19:23   ` Saeed Mahameed
@ 2020-11-05 20:42     ` Jakub Kicinski
  2020-11-05 23:52       ` Saeed Mahameed
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2020-11-05 20:42 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: George Cherian, netdev, linux-kernel, Jiri Pirko, davem,
	Sunil Kovvuri Goutham, Linu Cherian, Geethasowjanya Akula,
	masahiroy, willemdebruijn.kernel

On Thu, 05 Nov 2020 11:23:54 -0800 Saeed Mahameed wrote:
> If you report an error without recovering, devlink health will report a
> bad device state
> 
> $ ./devlink health
>    pci/0002:01:00.0:
>      reporter npa
>        state error error 1 recover 0

Actually, the counter in the driver is unnecessary, right? Devlink
counts errors.
 
> So you will need to implement an empty recover op.
> so if these events are informational only and they don't indicate
> device health issues, why would you report them via devlink health ?

I see devlink health reporters as a way of collecting error reports
which for the most part are just shared with the vendor. IOW firmware
(or hardware) bugs.

Obviously as you say without recover and additional context in the
report the value is quite diminished. But _if_ these are indeed "report
me to the vendor" kind of events then at least they should use our
current mechanics for such reports - which is dl-health.

Without knowing what these events are it's quite hard to tell if
devlink health is overkill or a counter is sufficient.

Either way - printing these to the logs is definitely the worst choice
:)


* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 20:42     ` Jakub Kicinski
@ 2020-11-05 23:52       ` Saeed Mahameed
  2020-11-06  0:23         ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Saeed Mahameed @ 2020-11-05 23:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: George Cherian, netdev, linux-kernel, Jiri Pirko, davem,
	Sunil Kovvuri Goutham, Linu Cherian, Geethasowjanya Akula,
	masahiroy, willemdebruijn.kernel

On Thu, 2020-11-05 at 12:42 -0800, Jakub Kicinski wrote:
> On Thu, 05 Nov 2020 11:23:54 -0800 Saeed Mahameed wrote:
> > If you report an error without recovering, devlink health will
> > report a
> > bad device state
> > 
> > $ ./devlink health
> >    pci/0002:01:00.0:
> >      reporter npa
> >        state error error 1 recover 0
> 
> Actually, the counter in the driver is unnecessary, right? Devlink
> counts errors.
>  

If you mean the error and recover counters, then yes, they are managed
by devlink health.

Every call to devlink_health_report() will do:

devlink_health_report(reporter, err_ctx, msg)
{
      reporter.error++;

      devlink_trigger_event(reporter, msg);

      reporter.dump(err_ctx, msg);
      reporter.diag(err_ctx);

      if (!reporter.recover(err_ctx))
             reporter.recover++;
}

So dl-health reports without a recover op will confuse the user if the
user sees error count > recover count.

The error count should only be greater than the recover count when the
recover procedure fails, which then indicates the device is not in a
healthy state.

Also, I want to clarify one small note about devlink dump.

devlink health dump semantics: on a devlink health dump, devlink health
will check if a previous dump exists and will just return it without
actually calling the driver; if not, it will call the driver to perform
a new dump and will cache it.

The user has to explicitly clear the devlink health dump of that
reporter in order to allow a newer dump to be generated.

It is done this way because we want the driver to store the dump of the
previously reported errors at the moment the errors are reported, so
that when a user issues a dump command the dump of the previous error
is reported to the user from memory, without the need to access the
driver/hw, which might be in a bad state.

So this is why using devlink dump for reporting counters doesn't really
work: it will only report the counters the first time they are accessed
via devlink health dump; after that it will report the same cached
values over and over until the user clears the dump.


> > So you will need to implement an empty recover op.
> > so if these events are informational only and they don't indicate
> > device health issues, why would you report them via devlink health
> > ?
> 
> I see devlink health reporters a way of collecting errors reports
> which
> for the most part are just shared with the vendor. IOW firmware (or
> hardware) bugs.
> 
> Obviously as you say without recover and additional context in the
> report the value is quite diminished. But _if_ these are indeed
> "report
> me to the vendor" kind of events then at least they should use our
> current mechanics for such reports - which is dl-health.
> 
> Without knowing what these events are it's quite hard to tell if
> devlink health is an overkill or counter is sufficient.
> 
> Either way - printing these to the logs is definitely the worst
> choice
> :)

Sure, I don't mind using devlink health for dump only; I don't really
have strong feelings against this, and they can always extend it in the
future.

It just doesn't make sense to me to have it mainly used for dumping
counters without using the devlink health utilities, like events,
reports and recover.

So maybe Sunil et al. could polish this patchset and provide more
devlink health support, like diagnose for these errors, and dump HW
information and contexts related to these errors so they could debug
root causes, etc.
Then the use of devlink health in this series can be truly justified.









* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 23:52       ` Saeed Mahameed
@ 2020-11-06  0:23         ` Jakub Kicinski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2020-11-06  0:23 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: George Cherian, netdev, linux-kernel, Jiri Pirko, davem,
	Sunil Kovvuri Goutham, Linu Cherian, Geethasowjanya Akula,
	masahiroy, willemdebruijn.kernel

On Thu, 05 Nov 2020 15:52:32 -0800 Saeed Mahameed wrote:
> On Thu, 2020-11-05 at 12:42 -0800, Jakub Kicinski wrote:
> > On Thu, 05 Nov 2020 11:23:54 -0800 Saeed Mahameed wrote:  
> > > If you report an error without recovering, devlink health will
> > > report a
> > > bad device state
> > > 
> > > $ ./devlink health
> > >    pci/0002:01:00.0:
> > >      reporter npa
> > >        state error error 1 recover 0  
> > 
> > Actually, the counter in the driver is unnecessary, right? Devlink
> > counts errors.
> 
> if you mean error and recover counters, then yes. they are managed by
> devlink health
> 
> every call to dl-health-report will do:
> 
> devlink_health_report(reporter, err_ctx, msg)
> {
>       reporter.error++;
> 
>       devlink_trigger_event(reporter, msg);
> 
>       reporter.dump(err_ctx, msg);
>       reporter.diag(err_ctx);
> 
>       if (!reporter.recover(err_ctx))
>              reporter.recover++;
> }
> 
> so dl-health reports without a recover op will confuse the user if user
> sees error count > recover count.
> 
> error count should only be greater than recover count when recover
> procedure fails which now will indicate the device is not in a healthy
> state.

Good point; as-is, the internal devlink counter mismatch looks pretty
strange.

> also i want to clarify one small note about devlink dump.
> 
> devlink health dump semantics:
> on devlink health dump, the devlink health will check if previous dump
> exists and will just return it without actually calling the driver, if
> not then it will call the driver to perform a new dump and will cache
> it.
> 
> user has to explicitly clear the devlink health dump of that reporter
> in order to allow for newer dump to get generated.
> 
> this is done this way because we want the driver to store the dump of
> the previously reported errors at the moment the errors are reported by
> driver, so when a user issue  a dump command the dump of the previous
> error will be reported to user from memory without the need to access
> driver/hw who might be in a bad state.
> 
> so this is why using devlink dump for reporting counters doesn't really
> work, it will only report the first time the counters are accessed via
> devlink health dump, after that it will report the same cached values
> over and over until the user clears it up.

Agreed, if only counters are reported the driver should rely on the
devlink counters. Dump contents are for the context of the event.

> > > So you will need to implement an empty recover op.
> > > so if these events are informational only and they don't indicate
> > > device health issues, why would you report them via devlink health
> > > ?  
> > 
> > I see devlink health reporters a way of collecting errors reports
> > which
> > for the most part are just shared with the vendor. IOW firmware (or
> > hardware) bugs.
> > 
> > Obviously as you say without recover and additional context in the
> > report the value is quite diminished. But _if_ these are indeed
> > "report
> > me to the vendor" kind of events then at least they should use our
> > current mechanics for such reports - which is dl-health.
> > 
> > Without knowing what these events are it's quite hard to tell if
> > devlink health is an overkill or counter is sufficient.
> > 
> > Either way - printing these to the logs is definitely the worst
> > choice
> > :)  
> 
> Sure, I don't mind using devlink health for dump only, I don't really
> have strong feelings against this, they can always extend it in the
> future.
> 
> it just doesn't make sense to me to have it mainly used for dumping
> counters and without using devlink health utilities, like events,
> reports and recover.
> 
> so maybe Sunil et al. could polish this patchset and provide more
> devlink health support, like diagnose for these errors, dump HW
> information and contexts related to these errors so they could debug
> root causes, etc .. 
> Then the use for dl health in this series can be truly justified.

That'd indeed be optimal.


* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-05 19:29   ` Sunil Kovvuri
@ 2020-11-06 20:58     ` Saeed Mahameed
  2020-11-07 15:51       ` Sunil Kovvuri
  0 siblings, 1 reply; 12+ messages in thread
From: Saeed Mahameed @ 2020-11-06 20:58 UTC (permalink / raw)
  To: Sunil Kovvuri
  Cc: George Cherian, netdev, linux-kernel, Jiri Pirko, kuba, davem,
	Sunil Kovvuri Goutham, Linu Cherian, Geethasowjanya Akula,
	masahiroy, willemdebruijn.kernel

On Fri, 2020-11-06 at 00:59 +0530, Sunil Kovvuri wrote:
> > > > > Output:
> > > > >  # ./devlink health
> > > > >  pci/0002:01:00.0:
> > > > >    reporter npa
> > > > >      state healthy error 0 recover 0
> > > > >    reporter nix
> > > > >      state healthy error 0 recover 0
> > > > >  # ./devlink  health dump show pci/0002:01:00.0 reporter nix
> > > > >   NIX_AF_GENERAL:
> > > > >          Memory Fault on NIX_AQ_INST_S read: 0
> > > > >          Memory Fault on NIX_AQ_RES_S write: 0
> > > > >          AQ Doorbell error: 0
> > > > >          Rx on unmapped PF_FUNC: 0
> > > > >          Rx multicast replication error: 0
> > > > >          Memory fault on NIX_RX_MCE_S read: 0
> > > > >          Memory fault on multicast WQE read: 0
> > > > >          Memory fault on mirror WQE read: 0
> > > > >          Memory fault on mirror pkt write: 0
> > > > >          Memory fault on multicast pkt write: 0
> > > > >    NIX_AF_RAS:
> > > > >          Poisoned data on NIX_AQ_INST_S read: 0
> > > > >          Poisoned data on NIX_AQ_RES_S write: 0
> > > > >          Poisoned data on HW context read: 0
> > > > >          Poisoned data on packet read from mirror buffer: 0
> > > > >          Poisoned data on packet read from mcast buffer: 0
> > > > >          Poisoned data on WQE read from mirror buffer: 0
> > > > >          Poisoned data on WQE read from multicast buffer: 0
> > > > >          Poisoned data on NIX_RX_MCE_S read: 0
> > > > >    NIX_AF_RVU:
> > > > >          Unmap Slot Error: 0
> > > > > 
> > > > 
> > > > Now i am a little bit skeptic here, devlink health reporter
> > > > infrastructure was
> > > > never meant to deal with dump op only, the main purpose is to
> > > > diagnose/dump and recover.
> > > > 
> > > > especially in your use case where you only report counters, i
> > > > don't
> > > > believe
> > > > devlink health dump is a proper interface for this.
> > > These are not counters. These are error interrupts raised by HW
> > > blocks.
> > > The count is provided to understand on how frequently the errors
> > > are
> > > seen.
> > > Error recovery for some of the blocks happen internally. That is
> > > the
> > > reason,
> > > Currently only dump op is added.
> > 
> > So you are counting these events in driver, sounds like a counter
> > to
> > me, i really think this shouldn't belong to devlink, unless you
> > really
> > utilize devlink health ops for actual reporting and recovery.
> > 
> > what's wrong with just dumping these counters to ethtool ?
> 
> This driver is a administrative driver which handles all the
> resources
> in the system and doesn't do any IO.
> NIX and NPA are key co-processor blocks which this driver handles.
> With NIX and NPA, there are pieces
> which gets attached to a PCI device to make it a networking device.
> We
> have netdev drivers registered to this
> networking device. Some more information about the drivers is
> available at
> https://www.kernel.org/doc/html/latest/networking/device_drivers/ethernet/marvell/octeontx2.html
> 
> So we don't have a netdev here to report these co-processor block
> level errors over ethtool.
> 

But the AF driver can't operate your hw standalone; it must have a
PF/VF with a netdev interface to do IO. So even if your model is
modular, a common user of this driver will always see a netdev.




* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-06 20:58     ` Saeed Mahameed
@ 2020-11-07 15:51       ` Sunil Kovvuri
  0 siblings, 0 replies; 12+ messages in thread
From: Sunil Kovvuri @ 2020-11-07 15:51 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: George Cherian, netdev, linux-kernel, Jiri Pirko, kuba, davem,
	Sunil Kovvuri Goutham, Linu Cherian, Geethasowjanya Akula,
	masahiroy, willemdebruijn.kernel

On Sat, Nov 7, 2020 at 2:28 AM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On Fri, 2020-11-06 at 00:59 +0530, Sunil Kovvuri wrote:
> > > > > > Output:
> > > > > >  # ./devlink health
> > > > > >  pci/0002:01:00.0:
> > > > > >    reporter npa
> > > > > >      state healthy error 0 recover 0
> > > > > >    reporter nix
> > > > > >      state healthy error 0 recover 0
> > > > > >  # ./devlink  health dump show pci/0002:01:00.0 reporter nix
> > > > > >   NIX_AF_GENERAL:
> > > > > >          Memory Fault on NIX_AQ_INST_S read: 0
> > > > > >          Memory Fault on NIX_AQ_RES_S write: 0
> > > > > >          AQ Doorbell error: 0
> > > > > >          Rx on unmapped PF_FUNC: 0
> > > > > >          Rx multicast replication error: 0
> > > > > >          Memory fault on NIX_RX_MCE_S read: 0
> > > > > >          Memory fault on multicast WQE read: 0
> > > > > >          Memory fault on mirror WQE read: 0
> > > > > >          Memory fault on mirror pkt write: 0
> > > > > >          Memory fault on multicast pkt write: 0
> > > > > >    NIX_AF_RAS:
> > > > > >          Poisoned data on NIX_AQ_INST_S read: 0
> > > > > >          Poisoned data on NIX_AQ_RES_S write: 0
> > > > > >          Poisoned data on HW context read: 0
> > > > > >          Poisoned data on packet read from mirror buffer: 0
> > > > > >          Poisoned data on packet read from mcast buffer: 0
> > > > > >          Poisoned data on WQE read from mirror buffer: 0
> > > > > >          Poisoned data on WQE read from multicast buffer: 0
> > > > > >          Poisoned data on NIX_RX_MCE_S read: 0
> > > > > >    NIX_AF_RVU:
> > > > > >          Unmap Slot Error: 0
> > > > > >
> > > > >
> > > > > Now I am a little bit skeptical here; the devlink health
> > > > > reporter infrastructure was never meant to deal with a dump
> > > > > op only. The main purpose is to diagnose/dump and recover.
> > > > >
> > > > > Especially in your use case, where you only report counters,
> > > > > I don't believe devlink health dump is a proper interface
> > > > > for this.
> > > > These are not counters. These are error interrupts raised by HW
> > > > blocks. The count is provided to understand how frequently the
> > > > errors are seen. Error recovery for some of the blocks happens
> > > > internally; that is the reason only the dump op is currently
> > > > added.
> > >
> > > So you are counting these events in the driver; that sounds like
> > > a counter to me. I really think this shouldn't belong to devlink,
> > > unless you really utilize the devlink health ops for actual
> > > reporting and recovery.
> > >
> > > What's wrong with just dumping these counters to ethtool?
> >
> > This driver is an administrative driver which handles all the
> > resources in the system and doesn't do any IO.
> > NIX and NPA are key co-processor blocks which this driver handles.
> > With NIX and NPA, there are pieces which get attached to a PCI
> > device to make it a networking device. We have netdev drivers
> > registered to this networking device. Some more information about
> > the drivers is available at
> > https://www.kernel.org/doc/html/latest/networking/device_drivers/ethernet/marvell/octeontx2.html
> >
> > So we don't have a netdev here to report these co-processor block
> > level errors over ethtool.
> >
>
> but the AF driver can't operate your hw standalone; it must have a
> PF/VF with a netdev interface to do IO. So even if your model is
> modular, a common user of this driver will always see a netdev.
>

That's right, the user will always see a netdev, but the co-processor
blocks work like this:
- Each co-processor has two parts: an AF unit and LF units (local
functions).
- Each co-processor can have multiple LFs; in the case of the NIX
co-processor, each LF provides RQs, SQs, CQs etc.
- The AF driver handles the co-processor's AF unit and, upon
receiving requests from a PF/VF, attaches LFs to it so that it
can do network IO.
- Within a co-processor, AF-unit-specific (global) errors are reported
to the AF driver, and LF-specific errors are reported to the netdev
driver.
- There can be tens of netdev driver instances in the system, so these
AF-unit global errors cannot be routed to and shown in one particular
netdev's ethtool.

Thanks,
Sunil.


* Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-04 12:27 ` [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX George Cherian
@ 2020-11-05  5:08   ` Saeed Mahameed
  0 siblings, 0 replies; 12+ messages in thread
From: Saeed Mahameed @ 2020-11-05  5:08 UTC (permalink / raw)
  To: George Cherian, netdev, linux-kernel, Jiri Pirko
  Cc: kuba, davem, sgoutham, lcherian, gakula, masahiroy,
	willemdebruijn.kernel

On Wed, 2020-11-04 at 17:57 +0530, George Cherian wrote:
> Add health reporters for RVU NPA block.
                               ^^^ NIX ?

Cc: Jiri 

Anyway, could you please spare some words on what NPA and NIX are?

Regarding the reporter names: all drivers register well-known generic
names such as (fw, hw, rx, tx). I don't know if it is a good idea to
use vendor-specific names; if you are reporting for hw/fw units then
just use "hw" or "fw" as the reporter name and append the unit
NPA/NIX to the counter/error names.

> Only reporter dump is supported.
> 
> Output:
>  # ./devlink health
>  pci/0002:01:00.0:
>    reporter npa
>      state healthy error 0 recover 0
>    reporter nix
>      state healthy error 0 recover 0
>  # ./devlink  health dump show pci/0002:01:00.0 reporter nix
>   NIX_AF_GENERAL:
>          Memory Fault on NIX_AQ_INST_S read: 0
>          Memory Fault on NIX_AQ_RES_S write: 0
>          AQ Doorbell error: 0
>          Rx on unmapped PF_FUNC: 0
>          Rx multicast replication error: 0
>          Memory fault on NIX_RX_MCE_S read: 0
>          Memory fault on multicast WQE read: 0
>          Memory fault on mirror WQE read: 0
>          Memory fault on mirror pkt write: 0
>          Memory fault on multicast pkt write: 0
>    NIX_AF_RAS:
>          Poisoned data on NIX_AQ_INST_S read: 0
>          Poisoned data on NIX_AQ_RES_S write: 0
>          Poisoned data on HW context read: 0
>          Poisoned data on packet read from mirror buffer: 0
>          Poisoned data on packet read from mcast buffer: 0
>          Poisoned data on WQE read from mirror buffer: 0
>          Poisoned data on WQE read from multicast buffer: 0
>          Poisoned data on NIX_RX_MCE_S read: 0
>    NIX_AF_RVU:
>          Unmap Slot Error: 0
> 

Now I am a little bit skeptical here; the devlink health reporter
infrastructure was never meant to deal with a dump op only. The main
purpose is to diagnose/dump and recover.

Especially in your use case, where you only report counters, I don't
believe devlink health dump is a proper interface for this. Many of
these counters, if not most, are data-path packet based, and maybe
they should belong to ethtool.



* [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX
  2020-11-04 12:27 [PATCH v2 net-next 0/3] Add devlink and devlink health reporters to George Cherian
@ 2020-11-04 12:27 ` George Cherian
  2020-11-05  5:08   ` Saeed Mahameed
  0 siblings, 1 reply; 12+ messages in thread
From: George Cherian @ 2020-11-04 12:27 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: kuba, davem, sgoutham, lcherian, gakula, masahiroy,
	george.cherian, willemdebruijn.kernel

Add health reporters for RVU NPA block.
Only reporter dump is supported.

Output:
 # ./devlink health
 pci/0002:01:00.0:
   reporter npa
     state healthy error 0 recover 0
   reporter nix
     state healthy error 0 recover 0
 # ./devlink  health dump show pci/0002:01:00.0 reporter nix
  NIX_AF_GENERAL:
         Memory Fault on NIX_AQ_INST_S read: 0
         Memory Fault on NIX_AQ_RES_S write: 0
         AQ Doorbell error: 0
         Rx on unmapped PF_FUNC: 0
         Rx multicast replication error: 0
         Memory fault on NIX_RX_MCE_S read: 0
         Memory fault on multicast WQE read: 0
         Memory fault on mirror WQE read: 0
         Memory fault on mirror pkt write: 0
         Memory fault on multicast pkt write: 0
   NIX_AF_RAS:
         Poisoned data on NIX_AQ_INST_S read: 0
         Poisoned data on NIX_AQ_RES_S write: 0
         Poisoned data on HW context read: 0
         Poisoned data on packet read from mirror buffer: 0
         Poisoned data on packet read from mcast buffer: 0
         Poisoned data on WQE read from mirror buffer: 0
         Poisoned data on WQE read from multicast buffer: 0
         Poisoned data on NIX_RX_MCE_S read: 0
   NIX_AF_RVU:
         Unmap Slot Error: 0

Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: George Cherian <george.cherian@marvell.com>
---
 .../marvell/octeontx2/af/rvu_devlink.c        | 360 +++++++++++++++++-
 .../marvell/octeontx2/af/rvu_devlink.h        |  24 ++
 .../marvell/octeontx2/af/rvu_struct.h         |  10 +
 3 files changed, 393 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
index bf9efe1f6aec..49e51d1bd7d5 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
@@ -35,6 +35,110 @@ static int rvu_report_pair_end(struct devlink_fmsg *fmsg)
 	return devlink_fmsg_pair_nest_end(fmsg);
 }
 
+static irqreturn_t rvu_nix_af_rvu_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	nix_event_count = rvu_dl->nix_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NIX_AF_RVU_INT);
+
+	if (intr & BIT_ULL(0))
+		nix_event_count->unmap_slot_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT, intr);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rvu_nix_af_err_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	nix_event_count = rvu_dl->nix_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NIX_AF_ERR_INT);
+
+	if (intr & BIT_ULL(14))
+		nix_event_count->aq_inst_count++;
+	if (intr & BIT_ULL(13))
+		nix_event_count->aq_res_count++;
+	if (intr & BIT_ULL(12))
+		nix_event_count->aq_db_count++;
+	if (intr & BIT_ULL(6))
+		nix_event_count->rx_on_unmap_pf_count++;
+	if (intr & BIT_ULL(5))
+		nix_event_count->rx_mcast_repl_count++;
+	if (intr & BIT_ULL(4))
+		nix_event_count->rx_mcast_memfault_count++;
+	if (intr & BIT_ULL(3))
+		nix_event_count->rx_mcast_wqe_memfault_count++;
+	if (intr & BIT_ULL(2))
+		nix_event_count->rx_mirror_wqe_memfault_count++;
+	if (intr & BIT_ULL(1))
+		nix_event_count->rx_mirror_pktw_memfault_count++;
+	if (intr & BIT_ULL(0))
+		nix_event_count->rx_mcast_pktw_memfault_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT, intr);
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t rvu_nix_af_ras_intr_handler(int irq, void *rvu_irq)
+{
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu_devlink *rvu_dl = rvu_irq;
+	struct rvu *rvu;
+	int blkaddr;
+	u64 intr;
+
+	rvu = rvu_dl->rvu;
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return IRQ_NONE;
+
+	nix_event_count = rvu_dl->nix_event_cnt;
+	intr = rvu_read64(rvu, blkaddr, NIX_AF_RAS);
+
+	if (intr & BIT_ULL(34))
+		nix_event_count->poison_aq_inst_count++;
+	if (intr & BIT_ULL(33))
+		nix_event_count->poison_aq_res_count++;
+	if (intr & BIT_ULL(32))
+		nix_event_count->poison_aq_cxt_count++;
+	if (intr & BIT_ULL(4))
+		nix_event_count->rx_mirror_data_poison_count++;
+	if (intr & BIT_ULL(3))
+		nix_event_count->rx_mcast_data_poison_count++;
+	if (intr & BIT_ULL(2))
+		nix_event_count->rx_mirror_wqe_poison_count++;
+	if (intr & BIT_ULL(1))
+		nix_event_count->rx_mcast_wqe_poison_count++;
+	if (intr & BIT_ULL(0))
+		nix_event_count->rx_mce_poison_count++;
+
+	/* Clear interrupts */
+	rvu_write64(rvu, blkaddr, NIX_AF_RAS, intr);
+	return IRQ_HANDLED;
+}
+
 static bool rvu_common_request_irq(struct rvu *rvu, int offset,
 				   const char *name, irq_handler_t fn)
 {
@@ -52,6 +156,254 @@ static bool rvu_common_request_irq(struct rvu *rvu, int offset,
 	return rvu->irq_allocated[offset];
 }
 
+static void rvu_nix_blk_unregister_interrupts(struct rvu *rvu,
+					      int blkaddr)
+{
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	int offs, i;
+
+	offs = rvu_read64(rvu, blkaddr, NIX_PRIV_AF_INT_CFG) & 0x3ff;
+	if (!offs)
+		return;
+
+	rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT_ENA_W1C, ~0ULL);
+	rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT_ENA_W1C, ~0ULL);
+	rvu_write64(rvu, blkaddr, NIX_AF_RAS_ENA_W1C, ~0ULL);
+
+	if (rvu->irq_allocated[offs + NIX_AF_INT_VEC_RVU]) {
+		free_irq(pci_irq_vector(rvu->pdev, offs + NIX_AF_INT_VEC_RVU),
+			 rvu_dl);
+		rvu->irq_allocated[offs + NIX_AF_INT_VEC_RVU] = false;
+	}
+
+	for (i = NIX_AF_INT_VEC_AF_ERR; i < NIX_AF_INT_VEC_CNT; i++)
+		if (rvu->irq_allocated[offs + i]) {
+			free_irq(pci_irq_vector(rvu->pdev, offs + i), rvu_dl);
+			rvu->irq_allocated[offs + i] = false;
+		}
+}
+
+static void rvu_nix_unregister_interrupts(struct rvu *rvu)
+{
+	int blkaddr = 0;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return;
+
+	rvu_nix_blk_unregister_interrupts(rvu, blkaddr);
+}
+
+static int rvu_nix_blk_register_interrupts(struct rvu *rvu,
+					   int blkaddr)
+{
+	int base;
+	bool rc;
+
+	/* Get NIX AF MSIX vectors offset. */
+	base = rvu_read64(rvu, blkaddr, NIX_PRIV_AF_INT_CFG) & 0x3ff;
+	if (!base) {
+		dev_warn(rvu->dev,
+			 "Failed to get NIX%d NIX_AF_INT vector offsets\n",
+			 blkaddr - BLKADDR_NIX0);
+		return 0;
+	}
+	/* Register and enable NIX_AF_RVU_INT interrupt */
+	rc = rvu_common_request_irq(rvu, base + NIX_AF_INT_VEC_RVU,
+				    "NIX_AF_RVU_INT",
+				    rvu_nix_af_rvu_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NIX_AF_RVU_INT_ENA_W1S, ~0ULL);
+
+	/* Register and enable NIX_AF_ERR_INT interrupt */
+	rc = rvu_common_request_irq(rvu, base + NIX_AF_INT_VEC_AF_ERR,
+				    "NIX_AF_ERR_INT",
+				    rvu_nix_af_err_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NIX_AF_ERR_INT_ENA_W1S, ~0ULL);
+
+	/* Register and enable NIX_AF_RAS interrupt */
+	rc = rvu_common_request_irq(rvu, base + NIX_AF_INT_VEC_POISON,
+				    "NIX_AF_RAS",
+				    rvu_nix_af_ras_intr_handler);
+	if (!rc)
+		goto err;
+	rvu_write64(rvu, blkaddr, NIX_AF_RAS_ENA_W1S, ~0ULL);
+
+	return 0;
+err:
+	rvu_nix_unregister_interrupts(rvu);
+	return -1;
+}
+
+static int rvu_nix_register_interrupts(struct rvu *rvu)
+{
+	int blkaddr = 0;
+
+	blkaddr = rvu_get_blkaddr(rvu, BLKTYPE_NIX, 0);
+	if (blkaddr < 0)
+		return blkaddr;
+
+	rvu_nix_blk_register_interrupts(rvu, blkaddr);
+
+	return 0;
+}
+
+static int rvu_nix_report_show(struct devlink_fmsg *fmsg, struct rvu *rvu)
+{
+	struct rvu_devlink *rvu_dl = rvu->rvu_dl;
+	struct rvu_nix_event_cnt *nix_event_count = rvu_dl->nix_event_cnt;
+	int err;
+
+	err = rvu_report_pair_start(fmsg, "NIX_AF_GENERAL");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tMemory Fault on NIX_AQ_INST_S read",
+					nix_event_count->aq_inst_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory Fault on NIX_AQ_RES_S write",
+					nix_event_count->aq_res_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tAQ Doorbell error",
+					nix_event_count->aq_db_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tRx on unmapped PF_FUNC",
+					nix_event_count->rx_on_unmap_pf_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tRx multicast replication error",
+					nix_event_count->rx_mcast_repl_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on NIX_RX_MCE_S read",
+					nix_event_count->rx_mcast_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on multicast WQE read",
+					nix_event_count->rx_mcast_wqe_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on mirror WQE read",
+					nix_event_count->rx_mirror_wqe_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on mirror pkt write",
+					nix_event_count->rx_mirror_pktw_memfault_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tMemory fault on multicast pkt write",
+					nix_event_count->rx_mcast_pktw_memfault_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	err = rvu_report_pair_start(fmsg, "NIX_AF_RAS");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tPoisoned data on NIX_AQ_INST_S read",
+					nix_event_count->poison_aq_inst_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on NIX_AQ_RES_S write",
+					nix_event_count->poison_aq_res_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on HW context read",
+					nix_event_count->poison_aq_cxt_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on packet read from mirror buffer",
+					nix_event_count->rx_mirror_data_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on packet read from mcast buffer",
+					nix_event_count->rx_mcast_data_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on WQE read from mirror buffer",
+					nix_event_count->rx_mirror_wqe_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on WQE read from multicast buffer",
+					nix_event_count->rx_mcast_wqe_poison_count);
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\n\tPoisoned data on NIX_RX_MCE_S read",
+					nix_event_count->rx_mce_poison_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	err = rvu_report_pair_start(fmsg, "NIX_AF_RVU");
+	if (err)
+		return err;
+	err = devlink_fmsg_u64_pair_put(fmsg, "\tUnmap Slot Error",
+					nix_event_count->unmap_slot_count);
+	if (err)
+		return err;
+	err = rvu_report_pair_end(fmsg);
+	if (err)
+		return err;
+	return 0;
+}
+
+static int rvu_nix_reporter_dump(struct devlink_health_reporter *reporter,
+				 struct devlink_fmsg *fmsg, void *ctx,
+				 struct netlink_ext_ack *netlink_extack)
+{
+	struct rvu *rvu = devlink_health_reporter_priv(reporter);
+
+	return rvu_nix_report_show(fmsg, rvu);
+}
+
+static const struct devlink_health_reporter_ops rvu_nix_fault_reporter_ops = {
+		.name = "nix",
+		.dump = rvu_nix_reporter_dump,
+};
+
+static int rvu_nix_health_reporters_create(struct rvu_devlink *rvu_dl)
+{
+	struct devlink_health_reporter *rvu_nix_health_reporter;
+	struct rvu_nix_event_cnt *nix_event_count;
+	struct rvu *rvu = rvu_dl->rvu;
+
+	nix_event_count = kzalloc(sizeof(*nix_event_count), GFP_KERNEL);
+	if (!nix_event_count)
+		return -ENOMEM;
+
+	rvu_dl->nix_event_cnt = nix_event_count;
+	rvu_nix_health_reporter = devlink_health_reporter_create(rvu_dl->dl,
+								 &rvu_nix_fault_reporter_ops,
+								 0, rvu);
+	if (IS_ERR(rvu_nix_health_reporter)) {
+		dev_warn(rvu->dev, "Failed to create nix reporter, err = %ld\n",
+			 PTR_ERR(rvu_nix_health_reporter));
+		return PTR_ERR(rvu_nix_health_reporter);
+	}
+
+	rvu_dl->rvu_nix_health_reporter = rvu_nix_health_reporter;
+	rvu_nix_register_interrupts(rvu);
+	return 0;
+}
+
+static void rvu_nix_health_reporters_destroy(struct rvu_devlink *rvu_dl)
+{
+	struct rvu *rvu = rvu_dl->rvu;
+
+	if (!rvu_dl->rvu_nix_health_reporter)
+		return;
+
+	devlink_health_reporter_destroy(rvu_dl->rvu_nix_health_reporter);
+	rvu_nix_unregister_interrupts(rvu);
+}
+
 static irqreturn_t rvu_npa_af_rvu_intr_handler(int irq, void *rvu_irq)
 {
 	struct rvu_npa_event_cnt *npa_event_count;
@@ -421,9 +773,14 @@ static void rvu_npa_health_reporters_destroy(struct rvu_devlink *rvu_dl)
 static int rvu_health_reporters_create(struct rvu *rvu)
 {
 	struct rvu_devlink *rvu_dl;
+	int err;
 
 	rvu_dl = rvu->rvu_dl;
-	return rvu_npa_health_reporters_create(rvu_dl);
+	err = rvu_npa_health_reporters_create(rvu_dl);
+	if (err)
+		return err;
+
+	return rvu_nix_health_reporters_create(rvu_dl);
 }
 
 static void rvu_health_reporters_destroy(struct rvu *rvu)
@@ -435,6 +792,7 @@ static void rvu_health_reporters_destroy(struct rvu *rvu)
 
 	rvu_dl = rvu->rvu_dl;
 	rvu_npa_health_reporters_destroy(rvu_dl);
+	rvu_nix_health_reporters_destroy(rvu_dl);
 }
 
 static int rvu_devlink_info_get(struct devlink *devlink, struct devlink_info_req *req,
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
index b3ce1a8fff57..15724ad2ed44 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.h
@@ -29,11 +29,35 @@ struct rvu_npa_event_cnt {
 	unsigned long poison_aq_cxt_count;
 };
 
+struct rvu_nix_event_cnt {
+	unsigned long unmap_slot_count;
+	unsigned long aq_inst_count;
+	unsigned long aq_res_count;
+	unsigned long aq_db_count;
+	unsigned long rx_on_unmap_pf_count;
+	unsigned long rx_mcast_repl_count;
+	unsigned long rx_mcast_memfault_count;
+	unsigned long rx_mcast_wqe_memfault_count;
+	unsigned long rx_mirror_wqe_memfault_count;
+	unsigned long rx_mirror_pktw_memfault_count;
+	unsigned long rx_mcast_pktw_memfault_count;
+	unsigned long poison_aq_inst_count;
+	unsigned long poison_aq_res_count;
+	unsigned long poison_aq_cxt_count;
+	unsigned long rx_mirror_data_poison_count;
+	unsigned long rx_mcast_data_poison_count;
+	unsigned long rx_mirror_wqe_poison_count;
+	unsigned long rx_mcast_wqe_poison_count;
+	unsigned long rx_mce_poison_count;
+};
+
 struct rvu_devlink {
 	struct devlink *dl;
 	struct rvu *rvu;
 	struct devlink_health_reporter *rvu_npa_health_reporter;
 	struct rvu_npa_event_cnt *npa_event_cnt;
+	struct devlink_health_reporter *rvu_nix_health_reporter;
+	struct rvu_nix_event_cnt *nix_event_cnt;
 };
 
 /* Devlink APIs */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
index 995add5d8bff..b5944199faf5 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_struct.h
@@ -74,6 +74,16 @@ enum npa_af_int_vec_e {
 	NPA_AF_INT_VEC_CNT	= 0x5,
 };
 
+/* NIX Admin function Interrupt Vector Enumeration */
+enum nix_af_int_vec_e {
+	NIX_AF_INT_VEC_RVU	= 0x0,
+	NIX_AF_INT_VEC_GEN	= 0x1,
+	NIX_AF_INT_VEC_AQ_DONE	= 0x2,
+	NIX_AF_INT_VEC_AF_ERR	= 0x3,
+	NIX_AF_INT_VEC_POISON	= 0x4,
+	NIX_AF_INT_VEC_CNT	= 0x5,
+};
+
 /**
  * RVU PF Interrupt Vector Enumeration
  */
-- 
2.25.4



