From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-cxl-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4EBD4C433F5
	for <linux-cxl@archiver.kernel.org>; Wed, 24 Nov 2021 19:31:38 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1351179AbhKXTeo (ORCPT <rfc822;linux-cxl@archiver.kernel.org>);
        Wed, 24 Nov 2021 14:34:44 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33500 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1351230AbhKXTeQ (ORCPT
        <rfc822;linux-cxl@vger.kernel.org>); Wed, 24 Nov 2021 14:34:16 -0500
Received: from mail-pg1-x52a.google.com (mail-pg1-x52a.google.com [IPv6:2607:f8b0:4864:20::52a])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D0C82C06173E
        for <linux-cxl@vger.kernel.org>; Wed, 24 Nov 2021 11:31:06 -0800 (PST)
Received: by mail-pg1-x52a.google.com with SMTP id g28so3064941pgg.3
        for <linux-cxl@vger.kernel.org>; Wed, 24 Nov 2021 11:31:06 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=intel-com.20210112.gappssmtp.com; s=20210112;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=vI7SzUj5UovgtfMb+fdADiLIwlcQ1wWC4aPy4AKa9sw=;
        b=0bLuuQ/jE+gTjDQ7ciBE/07ddYj2uqOteQnZLi85uA8MFu8jSoIUYqqJCccGABtrTF
         A78/mVgvstS02fSGtc8JEl33phPEac/WZZL7AmRbJxiWgXhlevV+smf69rNbhc9RoY4L
         OdMXMQpGDvQ1a9+DKKGEyPvaBuy2/gNYEkqQ8m5CtKGOyxog3YkFzqDWpSz+I7i9ZJj9
         8vLtEE+lnvfb/0/3ccZUw9Utcz2RzvdJSyirxLeiHsNBoblVFWtxy0EmtfEs5ixviCGf
         ntc2dts22JJo5BbzkzJhTxcHYc661w25eCwPiOJ5PK8YOuQNmEBR/7RQNLQ7ux9FijIg
         9ZIw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=vI7SzUj5UovgtfMb+fdADiLIwlcQ1wWC4aPy4AKa9sw=;
        b=R9PQ6vsB8VHhYZF8J5cHScrgM9JPDg6fzBPeDquZeCRsY+hv1Otb3veFKzoMI1ttcd
         IIMQrfW3uwRUKKlvTx+TrosU3J929V+Jn983vsgyKLmq1YQ3CGqgFkwonfYpUWe95ohL
         aSLcqY/qGyS11MW6rFkwcL+kLa1guDFxgXAzDfEcvl+/ErYd1aC/zhEih/3tnkQjX746
         YiMKnFxs52RGM9XsYUxl3TuVyLGEgb1tfjGSfsx/5L/9ey0qf5/2zwI2H0a2uK8sV+2r
         DB956MJnGUoos2xw/9J+W+u7Dxfo2Yy3nYt54ZPjgcD1Yg6gHCD3WQwT8iwcJzA1MoTf
         0B8g==
X-Gm-Message-State: AOAM533SuS21tU0Hni7YnKqaIfTwrzvbbjGUqlWDE/Pzh5FoX+urfxZV
        ADnJ7Rdu5/QMOt5DLZsAn0O4tiCW/L2z+4Bhj77QTg==
X-Google-Smtp-Source: ABdhPJwu4BzXwUzPTOx3eZyuEmi1g/2RqUwz46ZVZuCSxJm6PakajWT8pRAEysCQkiHfpnoiaOBdmQLapQccbnY/QTM=
X-Received: by 2002:a05:6a00:140e:b0:444:b077:51ef with SMTP id
 l14-20020a056a00140e00b00444b07751efmr8535703pfu.61.1637782266311; Wed, 24
 Nov 2021 11:31:06 -0800 (PST)
MIME-Version: 1.0
References: <20211120000250.1663391-1-ben.widawsky@intel.com> <20211120000250.1663391-4-ben.widawsky@intel.com>
In-Reply-To: <20211120000250.1663391-4-ben.widawsky@intel.com>
From:   Dan Williams <dan.j.williams@intel.com>
Date:   Wed, 24 Nov 2021 11:30:56 -0800
Message-ID: <CAPcyv4ic827h-WPxU2e1aKu1mVo78aNdNo82skcLW6hqnOnSEQ@mail.gmail.com>
Subject: Re: [PATCH 03/23] cxl/pci: Extract device status check
To:     Ben Widawsky <ben.widawsky@intel.com>
Cc:     linux-cxl@vger.kernel.org, Linux PCI <linux-pci@vger.kernel.org>,
        Alison Schofield <alison.schofield@intel.com>,
        Ira Weiny <ira.weiny@intel.com>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        Vishal Verma <vishal.l.verma@intel.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
List-ID: <linux-cxl.vger.kernel.org>
X-Mailing-List: linux-cxl@vger.kernel.org

On Fri, Nov 19, 2021 at 4:03 PM Ben Widawsky <ben.widawsky@intel.com> wrote:
>
> The Memory Device Status register is inspected in the same way for at
> least two flows in the CXL Type 3 Memory Device Software Guide
> (Revision: 1.0): 2.13.9 Device discovery and mailbox ready sequence,
> and 2.13.10 Media ready sequence. Extract this common functionality for
> use by both.

Can you translate this into CXL specification terms? See below for the
rationale...

>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> ---
> This patch did not exist in RFCv2
> ---
>  drivers/cxl/pci.c | 33 +++++++++++++++++++++++++--------
>  1 file changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index a6ea9811a05b..6c8d09fb3a17 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -182,6 +182,27 @@ static int __cxl_pci_mbox_send_cmd(struct cxl_dev_state *cxlds,
>         return 0;
>  }
>
> +/*
> + * Implements roughly the bottom half of Figure 42 of the CXL Type 3 Memory
> + * Device Software Guide

I do appreciate that document for working through some of the concerns
that system software might have for various CXL flows, but at the same
time it's not authoritative. I.e. it is not a specification itself and
it depends on the CXL specification as the "source of truth". So for
Linux commentary I would translate the guide's recommendations back
into the base truth from the CXL specification.

There will be places where Linux goes a different direction than the
software guide so I do not want to set any expectations that those
excursions are a bug, or otherwise require someone to consult a
specific hardware vendor's software guide.

Especially in this case when the logic is simply "check a couple fatal
status flags", the base specification is sufficient and the original
code made no reference to the guide.

> + */
> +static int check_device_status(struct cxl_dev_state *cxlds)
> +{
> +       const u64 md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
> +
> +       if (md_status & CXLMDEV_DEV_FATAL) {
> +               dev_err(cxlds->dev, "Fatal: replace device\n");

The specification says "replace device", I disagree that the kernel
should be recommending that the device by replaced. Just report what
the driver does, and that's probably easier if the error messages are
left to the caller.

        const u64 md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);

        if (md_status & (CXLMDEV_DEV_FATAL | CXLMDEV_FW_HALT)) {
                dev_err(dev, "mbox: failed to acquire, device state:%s%s\n",
                        md_status & CXLMDEV_DEV_FATAL ? " fatal" : "",
                        md_status & CXLMDEV_FW_HALT ? " firmware-halt" : "");
                return -EIO;
        }

...i.e. it's not clear to me the helper helps.

> +               return -EIO;
> +       }
> +
> +       if (md_status & CXLMDEV_FW_HALT) {
> +               dev_err(cxlds->dev, "FWHalt: reset or replace device\n");
> +               return -EBUSY;
> +       }
> +
> +       return 0;
> +}
> +
>  /**
>   * cxl_pci_mbox_get() - Acquire exclusive access to the mailbox.
>   * @cxlds: The device state to gain access to.
> @@ -231,17 +252,13 @@ static int cxl_pci_mbox_get(struct cxl_dev_state *cxlds)
>          * Hardware shouldn't allow a ready status but also have failure bits
>          * set. Spit out an error, this should be a bug report
>          */
> -       rc = -EFAULT;
> -       if (md_status & CXLMDEV_DEV_FATAL) {
> -               dev_err(dev, "mbox: reported ready, but fatal\n");
> +       rc = check_device_status(cxlds);
> +       if (rc)
>                 goto out;
> -       }
> -       if (md_status & CXLMDEV_FW_HALT) {
> -               dev_err(dev, "mbox: reported ready, but halted\n");
> -               goto out;
> -       }
> +
>         if (CXLMDEV_RESET_NEEDED(md_status)) {

I think this check needs to go. If the reset is needed because of one
of the above failure statuses then the function will have already
error exited. If the reset is needed because media is disabled that
should not be fatal for mailbox operations. It could be useful to do
some interrogation of *why* media is disabled.

>                 dev_err(dev, "mbox: reported ready, but reset needed\n");
> +               rc = -EFAULT;
>                 goto out;
>         }
>
> --
> 2.34.0
>