From: Alex Deucher <alexdeucher@gmail.com>
To: Christian Koenig <christian.koenig@amd.com>
Cc: "Chen, Guchun" <Guchun.Chen@amd.com>,
	"Zhou1, Tao" <Tao.Zhou1@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Yang, Stanley" <Stanley.Yang@amd.com>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Clements, John" <John.Clements@amd.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
Date: Thu, 23 Jul 2020 09:39:35 -0400
Message-ID: <CADnq5_NF5oGihSBdof8wO6MJVVG6Nh2TwfMt2Kkk_5Uoyby0yQ@mail.gmail.com>
In-Reply-To: <040f8c43-758d-e937-1d00-2ff4b118bde1@gmail.com>

Also note that module parameters are global.  If you change the
parameter, it changes it for all GPUs in the system.  That may not be
what the customer wants.
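
A minimal userspace sketch of that concern, not kernel code: it models a single driver-wide variable of the kind module_param_named() would back, consulted by every device. The struct gpu type, both helper functions, and ASSUMED_DEFAULT_THRESHOLD are hypothetical names for illustration only; the placeholder default is not the driver's real value.

```c
/* Hypothetical userspace model of a driver-wide module parameter. */

/* Placeholder for the "typical" value used when the parameter is -1.
 * This number is an assumption, not taken from the driver. */
#define ASSUMED_DEFAULT_THRESHOLD 100

/* One variable for the whole driver, shared by every GPU it binds to. */
int bad_page_threshold = -1;

/* Hypothetical stand-in for per-device state. */
struct gpu {
	const char *name;
	int bad_pages;	/* faulty pages recorded for this device */
};

/* Resolve -1 (auto) to the assumed default, otherwise use the value given. */
int effective_threshold(void)
{
	return bad_page_threshold < 0 ? ASSUMED_DEFAULT_THRESHOLD
				      : bad_page_threshold;
}

/* Every device reads the same global, so one write affects all of them. */
int gpu_exceeds_threshold(const struct gpu *g)
{
	return g->bad_pages > effective_threshold();
}
```

With two devices holding 5 and 20 bad pages, lowering bad_page_threshold from 10 to 3 flips the result for the first device as well, even if only the second one was the intended target. A per-device setting would need per-device storage instead of a single global.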

Alex

On Thu, Jul 23, 2020 at 9:10 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> I agree with Guchun as well.
>
> When you have a dynamic module parameter and change the bad page
> threshold the GPU might just stop working suddenly.
>
> That is not a good idea as far as I can see.
>
> Regards,
> Christian.
>
> On 23.07.20 at 05:47, Chen, Guchun wrote:
> > [AMD Public Use]
> >
> > Hi Dennis,
> >
> > To be honest, I considered your suggestion when I started the design. My thinking is that, in practice, the bad page threshold is a static configuration that should be set once at probe time.
> > So a module parameter is an ideal choice for this.
> >
> > Regards,
> > Guchun
> >
> > -----Original Message-----
> > From: Li, Dennis <Dennis.Li@amd.com>
> > Sent: Thursday, July 23, 2020 8:32 AM
> > To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> > Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > [AMD Official Use Only - Internal Distribution Only]
> >
> > Hi, Guchun,
> >        It would be better to let users change amdgpu_bad_page_threshold through sysfs, so that they do not need to reboot the system when they want to change their strategy.
> >
> > Best Regards
> > Dennis Li
> > -----Original Message-----
> > From: Chen, Guchun <Guchun.Chen@amd.com>
> > Sent: Wednesday, July 22, 2020 11:14 AM
> > To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; Yang, Stanley <Stanley.Yang@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Clements, John <John.Clements@amd.com>
> > Cc: Chen, Guchun <Guchun.Chen@amd.com>
> > Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > bad_page_threshold can be specified to detect and retire a bad GPU if the number of faulty pages exceeds it.
> >
> > When it's -1, ras will use the typical bad page failure value.
> >
> > Signed-off-by: Guchun Chen <guchun.chen@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
> >   2 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 06bfb8658dec..bb83ffb5e26a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;
> >   extern struct amdgpu_mgpu_info mgpu_info;
> >   extern int amdgpu_ras_enable;
> >   extern uint amdgpu_ras_mask;
> > +extern int amdgpu_bad_page_threshold;
> >   extern int amdgpu_async_gfx_ring;
> >   extern int amdgpu_mcbp;
> >   extern int amdgpu_discovery;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index d28b95f721c4..f99671101746 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >   };
> >   int amdgpu_ras_enable = -1;
> >   uint amdgpu_ras_mask = 0xffffffff;
> > +int amdgpu_bad_page_threshold = -1;
> >
> >   /**
> >    * DOC: vramlimit (int)
> > @@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);
> >   MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");
> >   module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> >
> > +/**
> > + * DOC: bad_page_threshold (int)
> > + * Bad page threshold configuration is driven by RMA(Return Merchandise
> > + * Authorization) policy, which is to specify the threshold value of faulty
> > + * pages detected by ECC, which may result in GPU's retirement if total
> > + * faulty pages by ECC exceed threshold value.
> > + */
> > +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default typical value))");
> > +module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> > +
> >   static const struct pci_device_id pciidlist[] = {
> >   #ifdef CONFIG_DRM_AMDGPU_SI
> >       {0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
> > --
> > 2.17.1
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Thread overview: 17+ messages
2020-07-22  3:14 [PATCH 0/5] BAD GPU retirement policy by total bad pages Guchun Chen
2020-07-22  3:14 ` [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter Guchun Chen
2020-07-23  0:31   ` Li, Dennis
2020-07-23  3:47     ` Chen, Guchun
2020-07-23 13:10       ` Christian König
2020-07-23 13:39         ` Alex Deucher [this message]
2020-07-22  3:14 ` [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras Guchun Chen
2020-07-22  7:51   ` Yang, Stanley
2020-07-23  3:40     ` Chen, Guchun
2020-07-22  3:14 ` [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset Guchun Chen
2020-07-23  2:51   ` Zhou1, Tao
2020-07-23  3:38     ` Chen, Guchun
2020-07-23  4:03       ` Zhou1, Tao
2020-07-22  3:14 ` [PATCH 4/5] drm/amdgpu: restore ras flags when user resets eeprom Guchun Chen
2020-07-22  3:14 ` [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size Guchun Chen
2020-07-22 14:26   ` Andrey Grodzovsky
2020-07-22 14:29     ` Chen, Guchun
