All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Guilherme G. Piccoli" <gpiccoli@linux.vnet.ibm.com>
To: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>,
	linuxppc-dev@lists.ozlabs.org, Russell Currey <ruscur@russell.cc>
Cc: Philippe Bergheaud <philippe.bergheaud@fr.ibm.com>,
	Frederic Barrat <fbarrat@linux.vnet.ibm.com>,
	Gavin Shan <gwshan@linux.vnet.ibm.com>,
	Ian Munsie <imunsie@au1.ibm.com>,
	Andrew Donnellan <andrew.donnellan@au1.ibm.com>,
	Christophe Lombard <christophe_lombard@fr.ibm.com>,
	Greg Kurz <gkurz@linux.vnet.ibm.com>
Subject: Re: [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count
Date: Wed, 1 Mar 2017 11:58:31 -0300	[thread overview]
Message-ID: <e288dd57-df0d-44c1-2867-ce1d1386ad1f@linux.vnet.ibm.com> (raw)
In-Reply-To: <20170301112452.15798-2-vaibhav@linux.vnet.ibm.com>

Hi Vaibhav, nice patch! Some comments below:

On 03/01/2017 08:24 AM, Vaibhav Jain wrote:
> This patch introduces a new function eeh_pe_update_freeze_counter()
> replacing existing function eeh_pe_update_time_stamp(). The new
> function also manages the value of reeze_count along with tstamp to
> track the number of times the PE roze in last one hour and if the
> freeze_count > eeh_max_freezes then eports an error(-ENOTRECOVERABLE)
> to indicate that the PE should be ermanently disabled.

Not sure why, but many of the words in commit message are missing their
first letter. See, for example:
reeze_count,  roze,  eports,  ermanently


> 
> This patch should not introduce any behavioral change.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
> ---
> Change-log:
> 
> v1 -> v2
> Changes as suggested by Russell Currey:
> - Suffixed function names with '()'
> - Dropped '<0' conditional check inside eeh_handle_normal_event()
> - Rephrased the comment for function eeh_pe_update_freeze_counter()
> - Brace-wrapped a single line statement at end of
>   eeh_pe_update_freeze_counter()
> ---
>  arch/powerpc/include/asm/eeh.h   |  2 +-
>  arch/powerpc/kernel/eeh_driver.c | 20 +++--------------
>  arch/powerpc/kernel/eeh_pe.c     | 47 +++++++++++++++++++++++++++-------------
>  3 files changed, 36 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 8e37b71..68806be 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -265,7 +265,7 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
>  struct eeh_pe *eeh_pe_get(struct eeh_dev *edev);
>  int eeh_add_to_parent_pe(struct eeh_dev *edev);
>  int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
> -void eeh_pe_update_time_stamp(struct eeh_pe *pe);
> +int eeh_pe_update_freeze_counter(struct eeh_pe *pe);
>  void *eeh_pe_traverse(struct eeh_pe *root,
>  		eeh_traverse_func fn, void *flag);
>  void *eeh_pe_dev_traverse(struct eeh_pe *root,
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index b948871..8a088ea 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -739,10 +739,9 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
>  		return;
>  	}
> 
> -	eeh_pe_update_time_stamp(pe);
> -	pe->freeze_count++;
> -	if (pe->freeze_count > eeh_max_freezes)
> -		goto excess_failures;
> +	/* Update freeze counters and see if we have tripped max-freeze limit */
> +	if (eeh_pe_update_freeze_counter(pe))
> +		goto perm_error;
>  	pr_warn("EEH: This PCI device has failed %d times in the last hour\n",
>  		pe->freeze_count);
> 
> @@ -872,19 +871,6 @@ static void eeh_handle_normal_event(struct eeh_pe *pe)
> 
>  	return;
> 
> -excess_failures:
> -	/*
> -	 * About 90% of all real-life EEH failures in the field
> -	 * are due to poorly seated PCI cards. Only 10% or so are
> -	 * due to actual, failed cards.
> -	 */
> -	pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
> -	       "last hour and has been permanently disabled.\n"
> -	       "Please try reseating or replacing it.\n",
> -		pe->phb->global_number, pe->addr,
> -		pe->freeze_count);
> -	goto perm_error;
> -
>  hard_fail:
>  	pr_err("EEH: Unable to recover from failure from PHB#%x-PE#%x.\n"
>  	       "Please try reseating or replacing it\n",
> diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
> index cc4b206..d367c16 100644
> --- a/arch/powerpc/kernel/eeh_pe.c
> +++ b/arch/powerpc/kernel/eeh_pe.c
> @@ -504,30 +504,47 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
>  }
> 
>  /**
> - * eeh_pe_update_time_stamp - Update PE's frozen time stamp
> + * eeh_pe_update_freeze_counter - Update PE's frozen time stamp
> + * and freeze counter
>   * @pe: EEH PE
>   *
> - * We have time stamp for each PE to trace its time of getting
> - * frozen in last hour. The function should be called to update
> - * the time stamp on first error of the specific PE. On the other
> - * handle, we needn't account for errors happened in last hour.

s/handle/hand? "On the other hand..."

Thanks,


Guilherme


> + * We have a freeze counter and time stamp for each PE to trace
> + * number of times the PE was frozen in the last hour. This function
> + * updates the PE's freeze counter and returns an error if its greater
> + * than eeh_max_freezes. The function should be called to once every
> + * time a specific PE freezes.
>   */
> -void eeh_pe_update_time_stamp(struct eeh_pe *pe)
> +int eeh_pe_update_freeze_counter(struct eeh_pe *pe)
>  {
>  	struct timeval tstamp;
> 
> -	if (!pe) return;
> +	if (!pe)
> +		return -EINVAL;
> +
> +	do_gettimeofday(&tstamp);
> +	if (pe->freeze_count <= 0 || tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
> +		pe->tstamp = tstamp;
> +		pe->freeze_count = 1;
> +
> +	} else if (pe->freeze_count >= eeh_max_freezes) {
> +		pe->freeze_count++;
> +		/*
> +		 * About 90% of all real-life EEH failures in the field
> +		 * are due to poorly seated PCI cards. Only 10% or so are
> +		 * due to actual, failed cards.
> +		 */
> +		pr_err("EEH: PHB#%x-PE#%x has failed %d times in the\n"
> +		       "last hour and has been permanently disabled.\n"
> +		       "Please try reseating or replacing it.\n",
> +		       pe->phb->global_number, pe->addr,
> +		       pe->freeze_count);
> +		return -ENOTRECOVERABLE;
> 
> -	if (pe->freeze_count <= 0) {
> -		pe->freeze_count = 0;
> -		do_gettimeofday(&pe->tstamp);
>  	} else {
> -		do_gettimeofday(&tstamp);
> -		if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
> -			pe->tstamp = tstamp;
> -			pe->freeze_count = 0;
> -		}
> +		pe->freeze_count++;
>  	}
> +
> +	return 0;
>  }
> 
>  /**
> 

  reply	other threads:[~2017-03-01 14:59 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-01 11:24 [RESEND-RFC v2 0/3] cxl: Reset freeze counter for the adapter before PERST Vaibhav Jain
2017-03-01 11:24 ` [RESEND-RFC v2 1/3] powerpc/eeh: Refactor eeh_pe_update_time_stamp() to update freeze_count Vaibhav Jain
2017-03-01 14:58   ` Guilherme G. Piccoli [this message]
2017-03-01 16:26     ` Vaibhav Jain
2017-03-01 17:39       ` Guilherme G. Piccoli
2017-03-01 23:52       ` Russell Currey
2017-03-02  1:03   ` Andrew Donnellan
2017-03-01 11:24 ` [RESEND-RFC v2 2/3] powerpc/eeh: Introduce function eeh_pe_reset_freeze_counter() Vaibhav Jain
2017-03-02  1:10   ` Andrew Donnellan
2017-03-03  4:21   ` Vaibhav Jain
2017-03-03  4:34     ` Andrew Donnellan
2017-03-03  4:35     ` Russell Currey
2017-03-03  4:39       ` Andrew Donnellan
2017-03-03  5:45       ` Gavin Shan
2017-03-01 11:24 ` [RESEND-RFC v2 3/3] cxl: Reset freeze counters before adapter PERST for flashing new image Vaibhav Jain
2017-03-02  1:46   ` Andrew Donnellan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e288dd57-df0d-44c1-2867-ce1d1386ad1f@linux.vnet.ibm.com \
    --to=gpiccoli@linux.vnet.ibm.com \
    --cc=andrew.donnellan@au1.ibm.com \
    --cc=christophe_lombard@fr.ibm.com \
    --cc=fbarrat@linux.vnet.ibm.com \
    --cc=gkurz@linux.vnet.ibm.com \
    --cc=gwshan@linux.vnet.ibm.com \
    --cc=imunsie@au1.ibm.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=philippe.bergheaud@fr.ibm.com \
    --cc=ruscur@russell.cc \
    --cc=vaibhav@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.