From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4EBD4C433F5 for ; Wed, 24 Nov 2021 19:31:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351179AbhKXTeo (ORCPT ); Wed, 24 Nov 2021 14:34:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33500 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351230AbhKXTeQ (ORCPT ); Wed, 24 Nov 2021 14:34:16 -0500 Received: from mail-pg1-x52a.google.com (mail-pg1-x52a.google.com [IPv6:2607:f8b0:4864:20::52a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D0C82C06173E for ; Wed, 24 Nov 2021 11:31:06 -0800 (PST) Received: by mail-pg1-x52a.google.com with SMTP id g28so3064941pgg.3 for ; Wed, 24 Nov 2021 11:31:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20210112.gappssmtp.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=vI7SzUj5UovgtfMb+fdADiLIwlcQ1wWC4aPy4AKa9sw=; b=0bLuuQ/jE+gTjDQ7ciBE/07ddYj2uqOteQnZLi85uA8MFu8jSoIUYqqJCccGABtrTF A78/mVgvstS02fSGtc8JEl33phPEac/WZZL7AmRbJxiWgXhlevV+smf69rNbhc9RoY4L OdMXMQpGDvQ1a9+DKKGEyPvaBuy2/gNYEkqQ8m5CtKGOyxog3YkFzqDWpSz+I7i9ZJj9 8vLtEE+lnvfb/0/3ccZUw9Utcz2RzvdJSyirxLeiHsNBoblVFWtxy0EmtfEs5ixviCGf ntc2dts22JJo5BbzkzJhTxcHYc661w25eCwPiOJ5PK8YOuQNmEBR/7RQNLQ7ux9FijIg 9ZIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=vI7SzUj5UovgtfMb+fdADiLIwlcQ1wWC4aPy4AKa9sw=; b=R9PQ6vsB8VHhYZF8J5cHScrgM9JPDg6fzBPeDquZeCRsY+hv1Otb3veFKzoMI1ttcd IIMQrfW3uwRUKKlvTx+TrosU3J929V+Jn983vsgyKLmq1YQ3CGqgFkwonfYpUWe95ohL aSLcqY/qGyS11MW6rFkwcL+kLa1guDFxgXAzDfEcvl+/ErYd1aC/zhEih/3tnkQjX746 YiMKnFxs52RGM9XsYUxl3TuVyLGEgb1tfjGSfsx/5L/9ey0qf5/2zwI2H0a2uK8sV+2r DB956MJnGUoos2xw/9J+W+u7Dxfo2Yy3nYt54ZPjgcD1Yg6gHCD3WQwT8iwcJzA1MoTf 0B8g== X-Gm-Message-State: AOAM533SuS21tU0Hni7YnKqaIfTwrzvbbjGUqlWDE/Pzh5FoX+urfxZV ADnJ7Rdu5/QMOt5DLZsAn0O4tiCW/L2z+4Bhj77QTg== X-Google-Smtp-Source: ABdhPJwu4BzXwUzPTOx3eZyuEmi1g/2RqUwz46ZVZuCSxJm6PakajWT8pRAEysCQkiHfpnoiaOBdmQLapQccbnY/QTM= X-Received: by 2002:a05:6a00:140e:b0:444:b077:51ef with SMTP id l14-20020a056a00140e00b00444b07751efmr8535703pfu.61.1637782266311; Wed, 24 Nov 2021 11:31:06 -0800 (PST) MIME-Version: 1.0 References: <20211120000250.1663391-1-ben.widawsky@intel.com> <20211120000250.1663391-4-ben.widawsky@intel.com> In-Reply-To: <20211120000250.1663391-4-ben.widawsky@intel.com> From: Dan Williams Date: Wed, 24 Nov 2021 11:30:56 -0800 Message-ID: Subject: Re: [PATCH 03/23] cxl/pci: Extract device status check To: Ben Widawsky Cc: linux-cxl@vger.kernel.org, Linux PCI , Alison Schofield , Ira Weiny , Jonathan Cameron , Vishal Verma Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-cxl@vger.kernel.org On Fri, Nov 19, 2021 at 4:03 PM Ben Widawsky wrote: > > The Memory Device Status register is inspected in the same way for at > least two flows in the CXL Type 3 Memory Device Software Guide > (Revision: 1.0): 2.13.9 Device discovery and mailbox ready sequence, > and 2.13.10 Media ready sequence. Extract this common functionality for > use by both. Can you translate this into CXL specification terms? See below for the rationale... > > Signed-off-by: Ben Widawsky > --- > This patch did not exist in RFCv2 > --- > drivers/cxl/pci.c | 33 +++++++++++++++++++++++++-------- > 1 file changed, 25 insertions(+), 8 deletions(-) > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c > index a6ea9811a05b..6c8d09fb3a17 100644 > --- a/drivers/cxl/pci.c > +++ b/drivers/cxl/pci.c > @@ -182,6 +182,27 @@ static int __cxl_pci_mbox_send_cmd(struct cxl_dev_state *cxlds, > return 0; > } > > +/* > + * Implements roughly the bottom half of Figure 42 of the CXL Type 3 Memory > + * Device Software Guide I do appreciate that document for working through some of the concerns that system software might have for various CXL flows, but at the same time it's not authoritative. I.e. it is not a specification itself and it depends on the CXL specification as the "source of truth". So for Linux commentary I would translate the guide's recommendations back into the base truth from the CXL specification. There will be places where Linux goes a different direction than the software guide so I do not want to set any expectations that those excursions are a bug, or otherwise require someone to consult a specific hardware vendor's software guide. Especially in this case when the logic is simply "check a couple fatal status flags", the base specification is sufficient and the original code made no reference to the guide. > + */ > +static int check_device_status(struct cxl_dev_state *cxlds) > +{ > + const u64 md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET); > + > + if (md_status & CXLMDEV_DEV_FATAL) { > + dev_err(cxlds->dev, "Fatal: replace device\n"); The specification says "replace device", I disagree that the kernel should be recommending that the device by replaced. Just report what the driver does, and that's probably easier if the error messages are left to the caller. const u64 md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET); if (md_status & (CXLMDEV_DEV_FATAL | CXLMDEV_FW_HALT)) { dev_err(dev, "mbox: failed to acquire, device state:%s%s\n", md_status & CXLMDEV_DEV_FATAL ? " fatal" : "", md_status & CXLMDEV_FW_HALT ? " firmware-halt" : ""); return -EIO; } ...i.e. it's not clear to me the helper helps. > + return -EIO; > + } > + > + if (md_status & CXLMDEV_FW_HALT) { > + dev_err(cxlds->dev, "FWHalt: reset or replace device\n"); > + return -EBUSY; > + } > + > + return 0; > +} > + > /** > * cxl_pci_mbox_get() - Acquire exclusive access to the mailbox. > * @cxlds: The device state to gain access to. > @@ -231,17 +252,13 @@ static int cxl_pci_mbox_get(struct cxl_dev_state *cxlds) > * Hardware shouldn't allow a ready status but also have failure bits > * set. Spit out an error, this should be a bug report > */ > - rc = -EFAULT; > - if (md_status & CXLMDEV_DEV_FATAL) { > - dev_err(dev, "mbox: reported ready, but fatal\n"); > + rc = check_device_status(cxlds); > + if (rc) > goto out; > - } > - if (md_status & CXLMDEV_FW_HALT) { > - dev_err(dev, "mbox: reported ready, but halted\n"); > - goto out; > - } > + > if (CXLMDEV_RESET_NEEDED(md_status)) { I think this check needs to go. If the reset is needed because of one of the above failure statuses then the function will have already error exited. If the reset is needed because media is disabled that should not be fatal for mailbox operations. It could be useful to do some interrogation of *why* media is disabled. > dev_err(dev, "mbox: reported ready, but reset needed\n"); > + rc = -EFAULT; > goto out; > } > > -- > 2.34.0 >