From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.4 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 561BDC3A5A6 for ; Thu, 19 Sep 2019 08:36:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 23BA721929 for ; Thu, 19 Sep 2019 08:36:30 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="jaDPDGHJ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388229AbfISIg1 (ORCPT ); Thu, 19 Sep 2019 04:36:27 -0400 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:62326 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387617AbfISIg0 (ORCPT ); Thu, 19 Sep 2019 04:36:26 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1568882184; x=1600418184; h=subject:to:cc:references:from:message-id:date: mime-version:in-reply-to:content-transfer-encoding; bh=peYmu9C6wVgjhm29Lt+//jwlKJnQBBI6R1TRcKcWe7s=; b=jaDPDGHJJlh2HqOBo8rR38uPHwPpzx+7pVGRkJtIfhP8VDAdwapmk/ux daAiNm0qtXS7Wp/v2N+A66e0ZESJTBj3vJiEZmMpKXEikv42AwN8aaynS RuvUDvXP1S7iGzlkCp+fDMawPS8qNOXLjZBlDapDMv5fPVrY7WfB1mJCw A=; X-IronPort-AV: E=Sophos;i="5.64,522,1559520000"; d="scan'208";a="421981783" Received: from iad6-co-svc-p1-lb1-vlan3.amazon.com (HELO email-inbound-relay-2c-6f38efd9.us-west-2.amazon.com) ([10.124.125.6]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 19 Sep 2019 08:36:22 +0000 Received: from EX13MTAUEA001.ant.amazon.com (pdx4-ws-svc-p6-lb7-vlan3.pdx.amazon.com [10.170.41.166]) by email-inbound-relay-2c-6f38efd9.us-west-2.amazon.com (Postfix) with ESMTPS id D0453A1C03; Thu, 19 Sep 2019 08:36:20 +0000 (UTC) Received: from EX13D01EUB001.ant.amazon.com (10.43.166.194) by EX13MTAUEA001.ant.amazon.com (10.43.61.82) with Microsoft SMTP Server (TLS) id 15.0.1367.3; Thu, 19 Sep 2019 08:36:20 +0000 Received: from [10.125.238.52] (10.43.160.153) by EX13D01EUB001.ant.amazon.com (10.43.166.194) with Microsoft SMTP Server (TLS) id 15.0.1367.3; Thu, 19 Sep 2019 08:36:11 +0000 Subject: Re: [PATCH v2 2/2] EDAC: al-mc-edac: Introduce Amazon's Annapurna Labs Memory Controller EDAC To: James Morse CC: , , , , , , , , , , , , , , , , , References: <1568529835-15319-1-git-send-email-talel@amazon.com> <1568529835-15319-3-git-send-email-talel@amazon.com> <4f6cd17c-c56f-b9d2-d6e0-1711de415f47@arm.com> From: "Shenhar, Talel" Message-ID: Date: Thu, 19 Sep 2019 11:36:06 +0300 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Thunderbird/68.0 MIME-Version: 1.0 In-Reply-To: <4f6cd17c-c56f-b9d2-d6e0-1711de415f47@arm.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US X-Originating-IP: [10.43.160.153] X-ClientProxiedBy: EX13D21UWA002.ant.amazon.com (10.43.160.246) To EX13D01EUB001.ant.amazon.com (10.43.166.194) Sender: linux-edac-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org Thanks for the review. On 9/18/2019 8:47 PM, James Morse wrote: > Hi Talel, > > On 15/09/2019 07:43, Talel Shenhar wrote: >> The Amazon's Annapurna Labs Memory Controller EDAC supports ECC capability >> for error detection and correction (Single bit error correction, Double >> detection). This driver introduces EDAC driver for that capability. > Is there any documentation for this memory controller? Unfortunately, we don't have public documentation for it. > > >> diff --git a/drivers/edac/al_mc_edac.c b/drivers/edac/al_mc_edac.c >> new file mode 100644 >> index 0000000..f9763d4 >> --- /dev/null >> +++ b/drivers/edac/al_mc_edac.c >> @@ -0,0 +1,382 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. >> + */ >> +#include > #include for hweight_long() shall be part of v3. btw: do you use some tool to catch those missing includes? > >> +#include >> +#include > #include for platform_get_resource() shall be part of v3. > >> +#include "edac_module.h" >> +/* Registers Values */ >> +#define AL_MC_MSTR_DEV_CFG_X4 0 >> +#define AL_MC_MSTR_DEV_CFG_X8 1 >> +#define AL_MC_MSTR_DEV_CFG_X16 2 >> +#define AL_MC_MSTR_DEV_CFG_X32 3 >> +#define AL_MC_MSTR_RANKS_MAX 4 > Is this a fixed property of the memory controller, or is it a limit imposed from somewhere > else. (Does it need to come from the DT?) Yes. this is a fixed behavior hence not part of dt. > > >> +#define AL_MC_MSTR_DATA_BUS_WIDTH_X64 0 >> + >> +#define DRV_NAME "al_mc_edac" >> +#define AL_MC_EDAC_MSG_MAX 256 >> +#define AL_MC_EDAC_MSG(message, buffer_size, type, \ >> + rank, row, bg, bank, column, syn0, syn1, syn2) \ >> + snprintf(message, buffer_size, \ >> + "%s rank=0x%x row=0x%x bg=0x%x bank=0x%x col=0x%x " \ >> + "syn0: 0x%x syn1: 0x%x syn2: 0x%x", \ >> + type == HW_EVENT_ERR_UNCORRECTED ? "UE" : "CE", \ >> + rank, row, bg, bank, column, syn0, syn1, syn2) >> + >> +struct al_mc_edac { >> + void __iomem *mmio_base; >> + int irq_ce; >> + int irq_ue; >> +}; >> + >> +static int al_mc_edac_handle_ce(struct mem_ctl_info *mci) >> +{ >> + struct al_mc_edac *al_mc = mci->pvt_info; >> + u32 eccerrcnt; >> + u16 ce_count; >> + u32 ecccaddr0; >> + u32 ecccaddr1; >> + u32 ecccsyn0; >> + u32 ecccsyn1; >> + u32 ecccsyn2; >> + u8 rank; >> + u32 row; >> + u8 bg; >> + u8 bank; >> + u16 column; >> + char msg[AL_MC_EDAC_MSG_MAX]; > (Some of these could go on the same line, same with UE below) Shall be part of v3 > + > + edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci, > + ce_count, 0, 0, 0, 0, 0, -1, mci->ctl_name, msg); > You used active_ranks as the layer size in al_mc_edac_probe(). Can't you supply the rank here? > > (If its not useful, why is it setup like this in al_mc_edac_probe()?) Seems it can be removed from probe. Shall be part of v3. > > > + u8 bank; > + u16 column; > + char msg[AL_MC_EDAC_MSG_MAX]; > + > + eccerrcnt = readl_relaxed(al_mc->mmio_base + AL_MC_ECC_ERR_COUNT); > + ue_count = FIELD_GET(AL_MC_ECC_ERR_COUNT_UE, eccerrcnt); > + if (!ue_count) > + return 0; > + > + eccuaddr0 = readl_relaxed(al_mc->mmio_base + AL_MC_ECC_UE_ADDR0); > + eccuaddr1 = readl_relaxed(al_mc->mmio_base + AL_MC_ECC_UE_ADDR1); > + eccusyn0 = readl_relaxed(al_mc->mmio_base + AL_MC_ECC_UE_SYND0); > + eccusyn1 = readl_relaxed(al_mc->mmio_base + AL_MC_ECC_UE_SYND1); > + eccusyn2 = readl_relaxed(al_mc->mmio_base + AL_MC_ECC_UE_SYND2); > + > + writel(AL_MC_ECC_CLEAR_UE_COUNT | AL_MC_ECC_CLEAR_UE_ERR, > + al_mc->mmio_base + AL_MC_ECC_CLEAR); > + > + dev_dbg(mci->pdev, "eccuaddr0=0x%08x eccuaddr1=0x%08x\n", > + eccuaddr0, eccuaddr1); > + > + rank = FIELD_GET(AL_MC_ECC_UE_ADDR0_RANK, eccuaddr0); > + row = FIELD_GET(AL_MC_ECC_UE_ADDR0_ROW, eccuaddr0); > + > + bg = FIELD_GET(AL_MC_ECC_UE_ADDR1_BG, eccuaddr1); > + bank = FIELD_GET(AL_MC_ECC_UE_ADDR1_BANK, eccuaddr1); > + column = FIELD_GET(AL_MC_ECC_UE_ADDR1_COLUMN, eccuaddr1); > + > + AL_MC_EDAC_MSG(msg, sizeof(msg), HW_EVENT_ERR_UNCORRECTED, > + rank, row, bg, bank, column, > + eccusyn0, eccusyn1, eccusyn2); > + > + edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci, > + ue_count, 0, 0, 0, 0, 0, -1, mci->ctl_name, msg); > > What happens when this code runs at the same time as the corrected error handler calling > edac_mc_handler_error() with this same mci? > > This could happen on a second CPU, or on one cpu if the corrected handler is polled. > > edac_mc_handle_error() memset's the edac_raw_error_desc in mci, so it can't be called in > parallel, or twice on the same cpu. > > I think you need an irqsave spinlock around the calls to edac_mc_handle_error(). shall add locks in v3. > >> + >> +static irqreturn_t al_mc_edac_irq_handler_ue(int irq, void *info) >> +{ >> + struct platform_device *pdev = info; >> + struct mem_ctl_info *mci = platform_get_drvdata(pdev); >> + int ue_count; >> + >> + ue_count = al_mc_edac_handle_ue(mci); >> + if (ue_count) >> + return IRQ_HANDLED; >> + else >> + return IRQ_NONE; >> +} > As you don't use ue_count, wouldn't this be clearer: > > | if (al_mc_edac_handle_ue(mci)) > | return IRQ_HANDLED; > | return IRQ_NONE; > > ? ack, shall add to v3 > >> +static int al_mc_edac_probe(struct platform_device *pdev) >> +{ >> + struct resource *resource; >> + void __iomem *mmio_base; >> + unsigned int active_ranks; >> + struct edac_mc_layer layers[1]; >> + struct mem_ctl_info *mci; >> + struct al_mc_edac *al_mc; >> + int ret; >> + >> + resource = platform_get_resource(pdev, IORESOURCE_MEM, 0); > platform_get_resource() can fail, returning NULL. ack, shall add to v3 > > > + > + if (al_mc->irq_ue > 0) { > + ret = devm_request_irq(&pdev->dev, > + al_mc->irq_ue, > + al_mc_edac_irq_handler_ue, >> + 0, > As you know when your device has triggered the interrupt from the error counter, could > these be IRQF_SHARED? ack, shall add to v3 > >> +static int al_mc_edac_remove(struct platform_device *pdev) >> +{ >> + struct mem_ctl_info *mci = platform_get_drvdata(pdev); >> + >> + edac_mc_del_mc(&pdev->dev); >> + edac_mc_free(mci); > What stops your interrupt firing here? You've free'd the memory it uses. > > I think you need to devm_free_irq() the interrupts before you free the memory. ack, shall add to v3 > >> +MODULE_LICENSE("GPL v2"); >> +MODULE_AUTHOR("Talel Shenhar"); >> +MODULE_DESCRIPTION("Amazon's Annapurna Lab's Memory Controller EDAC Driver"); > (Kconfig says this is 'bool', so it can't be built as a module, having these is a bit odd) ack, shall add to v3 while at it, shall consider changing to trisate so it can really be build as a module as well. > > > > Thanks, > > James