linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Borislav Petkov <bp@alien8.de>
To: Mukul Joshi <mukul.joshi@amd.com>
Cc: amd-gfx@lists.freedesktop.org, harish.kasiviswanathan@amd.com,
	x86-ml <x86@kernel.org>, lkml <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
Date: Wed, 12 May 2021 11:36:51 +0200	[thread overview]
Message-ID: <YJuhs1WsqtJ7ta1L@zn.tnic> (raw)
In-Reply-To: <20210512013058.6827-1-mukul.joshi@amd.com>

Hi,

so this is a drive-by review using the lore.kernel.org mail because I
wasn't CCed on this.

On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> +				    unsigned long val, void *data)
> +{
> +	struct mce *m = (struct mce *)data;
> +	struct amdgpu_device *adev = NULL;
> +	uint32_t gpu_id = 0;
> +	uint32_t umc_inst = 0;
> +	uint32_t chan_index = 0;
> +	struct ras_err_data err_data = {0, 0, 0, NULL};
> +	struct eeprom_table_record err_rec;
> +	uint64_t retired_page;
> +
> +	/*
> +	 * If the error was generated in UMC_V2, which belongs to GPU UMCs,

Why does it matter if the error is a v2 UMC generated one?

IOW, can you detect the type of errors in GPU memory - I assume this
is what this is supposed to handle - from the error signature itself
instead of doing is_smca_umc_v2?

> +	 * and error occurred in DramECC (Extended error code = 0) then only
> +	 * process the error, else bail out.
> +	 */
> +	if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> +		return NOTIFY_DONE;
> +
> +	gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> +
> +	/*
> +	 * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> +	 */
> +	gpu_id -= GPU_ID_OFFSET;
> +
> +	adev = find_adev(gpu_id);
> +	if (!adev) {
> +		dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> +				     __func__, gpu_id);
> +		return NOTIFY_DONE;
> +	}
> +
> +	/*
> +	 * If it is correctable error, then print a message and return.
> +	 */
> +	if (mce_is_correctable(m)) {
> +		dev_info(adev->dev, "%s: UMC Correctable error detected.",
> +				    __func__);

So put yourself in the shoes of a support engineer who starts getting
all those calls about people's hardware getting correctable errors and
should they replace it and should AMD RMA the GPUs and so on and so
on..? Do you really wanna be on the receiving side of that call?

IOW, whom does printing the fact that the GPU had a corrected error
which got corrected by the hardware, help and do you really want to
upset people needlessly?

> +		return NOTIFY_OK;
> +	}
> +
> +	/*
> +	 * If it is uncorrectable error, then find out UMC instance and
> +	 * channel index.
> +	 */
> +	find_umc_inst_chan_index(m, &umc_inst, &chan_index);

That's a void function but it could return a pointer to the instance and
channel bundled in a struct maybe...

> +
> +	dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d,"
> +			    " chan_idx: %d", umc_inst, chan_index);
> +
> +	memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> +
> +	/*
> +	 * Translate UMC channel address to Physical address
> +	 */
> +	retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> +			ADDR_OF_256B_BLOCK(chan_index) |
> +			OFFSET_IN_256B_BLOCK(m->addr);
> +
> +	err_rec.address = m->addr;
> +	err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> +	err_rec.ts = (uint64_t)ktime_get_real_seconds();
> +	err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> +	err_rec.cu = 0;
> +	err_rec.mem_channel = chan_index;
> +	err_rec.mcumc_id = umc_inst;
> +
> +	err_data.err_addr = &err_rec;
> +	err_data.err_addr_cnt = 1;
> +
> +	if (amdgpu_bad_page_threshold != 0) {
> +		amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> +						err_data.err_addr_cnt);
> +		amdgpu_ras_save_bad_pages(adev);
> +	}
> +
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block amdgpu_bad_page_nb = {
> +	.notifier_call  = amdgpu_bad_page_notifier,
> +	.priority       = MCE_PRIO_ACCEL,

Nothing ever explains why this needs to be a separate priority. So how
about it?

> +};
> +
> +static void amdgpu_register_bad_pages_mca_notifier(void)
> +{
> +	/*
> +	 * Register the x86 notifier with MCE subsystem.
> +	 * Please note a notifier can be registered only once
> +	 * with the MCE subsystem.
> +	 */

Why do you need this? Other users don't need that. Do you need to call
mce_unregister_decode_chain() when the driver gets removed or so?

> +	if (notifier_registered == false) {

This is just silly and should be fixed properly once the issue is
understood.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

       reply	other threads:[~2021-05-12  9:36 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20210512013058.6827-1-mukul.joshi@amd.com>
2021-05-12  9:36 ` Borislav Petkov [this message]
2021-05-12 19:00   ` [PATCH] drm/amdgpu: Register bad page handler for Aldebaran Joshi, Mukul
2021-05-12 21:05     ` Borislav Petkov
2021-05-13  3:20       ` Joshi, Mukul
2021-05-13  9:53         ` Borislav Petkov
2021-05-13 14:17           ` Alex Deucher
2021-05-13 14:30             ` Borislav Petkov
2021-05-13 14:32               ` Alex Deucher
2021-05-13 14:57                 ` Borislav Petkov
2021-05-13 15:02                   ` Alex Deucher
2021-05-13 23:14                   ` Joshi, Mukul
2021-05-14  7:03                     ` Borislav Petkov
2021-05-27 19:54                       ` Joshi, Mukul
2021-06-03 21:13                         ` Yazen Ghannam
2021-07-29 23:59                           ` Joshi, Mukul
2021-09-13  1:31                             ` Joshi, Mukul
2021-05-13 23:10           ` Joshi, Mukul
2021-05-14  7:05             ` Borislav Petkov
2021-05-14 13:06               ` Joshi, Mukul
2021-05-14 14:38                 ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YJuhs1WsqtJ7ta1L@zn.tnic \
    --to=bp@alien8.de \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=harish.kasiviswanathan@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mukul.joshi@amd.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).