From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?iso-8859-1?Q?Ga=EBtan?= Rivet <gaetan.rivet@6wind.com>
Subject: Re: [PATCH v2 1/4] bus/pci: handle device hot unplug
Date: Fri, 22 Jun 2018 14:59:08 +0200
Message-ID: <20180622125908.r5r7qvvndn4dp6ln@bidouze.vm.6wind.com>
References: <1498711073-42917-1-git-send-email-jia.guo@intel.com>
 <1529668268-7462-1-git-send-email-jia.guo@intel.com>
 <1529668268-7462-2-git-send-email-jia.guo@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Cc: stephen@networkplumber.org, bruce.richardson@intel.com,
 ferruh.yigit@intel.com, konstantin.ananyev@intel.com,
 jingjing.wu@intel.com, thomas@monjalon.net, motih@mellanox.com,
 matan@mellanox.com, harry.van.haaren@intel.com,
 qi.z.zhang@intel.com, shaopeng.he@intel.com, jblunck@infradead.org,
 shreyansh.jain@nxp.com, dev@dpdk.org, helin.zhang@intel.com
To: Jeff Guo <jia.guo@intel.com>
Return-path: <dev-bounces@dpdk.org>
Received: from mail-wm0-f65.google.com (mail-wm0-f65.google.com [74.125.82.65])
 by dpdk.org (Postfix) with ESMTP id 8D2891B8B5
 for <dev@dpdk.org>; Fri, 22 Jun 2018 14:59:26 +0200 (CEST)
Received: by mail-wm0-f65.google.com with SMTP id v16-v6so2161782wmh.5
 for <dev@dpdk.org>; Fri, 22 Jun 2018 05:59:26 -0700 (PDT)
Content-Disposition: inline
In-Reply-To: <1529668268-7462-2-git-send-email-jia.guo@intel.com>
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

Hi Jeff,

Sorry, I followed this development from afar,
I have a remark regarding this API, I think it can be made simpler.
Details below.

On Fri, Jun 22, 2018 at 07:51:05PM +0800, Jeff Guo wrote:
> When a hardware device is removed physically or the software disables
> it, the hot unplug occur. App need to call ether dev API to detach the
> device, to unplug the device at the bus level and make access to the device
> invalid. But the problem is that, the removal of the device from the
> software lists is not going to be instantaneous, at this time if the data
> path still read/write the device, it will cause MMIO error and result of
> the app crash out. So a hot unplug handle mechanism need to guaranty app
> will not crash out when hot unplug device.
> 
> To handle device hot unplug is bus-specific behavior, this patch introduces
> a bus ops so that each kind of bus can implement its own logic. Further,
> this patch implements the ops for PCI bus: remap a dummy memory to avoid
> bus read/write error.
> 
> Signed-off-by: Jeff Guo <jia.guo@intel.com>
> ---
> v2->v1(v21):
> refind commit log
> ---
>  drivers/bus/pci/pci_common.c            | 65 +++++++++++++++++++++++++++++++++
>  drivers/bus/pci/pci_common_uio.c        | 33 +++++++++++++++++
>  drivers/bus/pci/private.h               | 12 ++++++
>  lib/librte_eal/common/include/rte_bus.h | 16 ++++++++
>  4 files changed, 126 insertions(+)
> 
> diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
> index 7215aae..74d9aa8 100644
> --- a/drivers/bus/pci/pci_common.c
> +++ b/drivers/bus/pci/pci_common.c
> @@ -472,6 +472,70 @@ pci_find_device(const struct rte_device *start, rte_dev_cmp_t cmp,
>  	return NULL;
>  }
>  
> +static struct rte_pci_device *
> +pci_find_device_by_addr(void *failure_addr)
> +{
> +	struct rte_pci_device *pdev = NULL;
> +	int i;
> +
> +	FOREACH_DEVICE_ON_PCIBUS(pdev) {
> +		for (i = 0; i != RTE_DIM(pdev->mem_resource); i++) {
> +			if ((uint64_t)(uintptr_t)failure_addr >=
> +			    (uint64_t)(uintptr_t)pdev->mem_resource[i].addr &&
> +			    (uint64_t)(uintptr_t)failure_addr <
> +			    (uint64_t)(uintptr_t)pdev->mem_resource[i].addr +
> +			    pdev->mem_resource[i].len) {
> +				RTE_LOG(ERR, EAL, "Failure address "
> +					"%16.16"PRIx64" belongs to "
> +					"device %s!\n",
> +					(uint64_t)(uintptr_t)failure_addr,
> +					pdev->device.name);
> +				return pdev;
> +			}
> +		}
> +	}
> +	return NULL;
> +}

You define here a new bus ops that takes either an rte_device or an
arbitrary address as input.

In the uev handler that would call this ops afterward, you similarly try
to find either a bus using the device name, or then iterate over all
buses and try to find one able to handle the error.

This seems redundant and prone to ambiguity: should one check that the
device address is actually linked with the provided address? If not, is
it an improper call or a special case? This is unclear.

Note: I haven't followed the previous discussion, maybe the
      dual dev_addr + failure_addr is warranted in the API here,
      if so why not. Otherwise it just seems redundant:
      the dev addr will never be within a physical BAR mapping,
      and for all buses / drivers not using physical mappings,
      the addr is only meaningful as a cue to find an internal
      resource.

You can use the bus->find_device() to iterate over buses, and design
your bus ops such that when provided with an addr, would do whatever it
needs internally to find a relevant resource and either handle the
error, or return that the error was not handled.

Something like that:

/* new bus ops: */
/* If generic error codes are defined as part of the API,
   >0 should mean that the sigbus was not handled,
   <0 that an error occured but that one should stop trying,
   0 that everything is ok.
 */
int (*handle_sigbus)(void *addr);

/* new rte_bus API: */

static int
bus_handle_sigbus(const struct rte_bus *bus,
                  const void *addr)
{
        /* If additional error codes are defined as part of the API,
           negative values should stop the iteration.
           In this case, rte_errno would need to be set as well.
         */
        return !(bus->handle_sigbus && bus->handle_sigbus(addr) <= 0);
}

int
rte_bus_sigbus_handler(void *addr)
{
        struct rte_bus *bus;
        int old_errno = rte_errno;

        rte_errno = 0;
        bus = rte_bus_find(NULL, bus_handle_sigbus, addr);
        if (bus == NULL) {
                /* ERROR: no bus could handle the error. */
                RTE_LOG(ERR, EAL, "No bus was able to handle the error");
                return -1;
        } else if {rte_errno != 0) {
                /* ERROR: a generic sigbus handling error. */
                RTE_LOG(ERR, EAL, "Say what the error is");
                return -1;
        }
        rte_errno = old_errno;
        return 0;
}

Which would afterward be implemented, for example in PCI bus:

static rte_pci_device *
pci_find_device_by_addr(void *addr)
{
        struct rte_pci_device *pdev;

        FOREACH_DEVICE_ON_PCIBUS(pdev)
                if (&pdev->device == addr ||
                    /* addr within mappings of pdev */)
                        return pdev;
        return NULL;
}

static int
pci_handle_sigbus(void *addr)
{
        static rte_pci_device *pdev;

        pdev = pci_find_device_by_addr(addr);
        if (pdev == NULL)
                return -1;
        /* Here, handle uio remap if needed. */
}

-------------------------

- Leave the bus iteration and check within rte_bus. Centralize the call
  in a tighter rte_bus API, do not use directly your OPS from your other
  EAL facility.

- You have left several error messages for signaling success (!), or
  even simply that a bus could not handle a specific address. This is
  bad. An error should only appear on error. Otherwise, all of this can
  easily be traced using a debugger, so I don't think it's necessary
  to leave it at a DEBUG level.

  But in any case, remove all ERR-level messages for success.

<...>

> diff --git a/lib/librte_eal/common/include/rte_bus.h b/lib/librte_eal/common/include/rte_bus.h
> index eb9eded..6a5609f 100644
> --- a/lib/librte_eal/common/include/rte_bus.h
> +++ b/lib/librte_eal/common/include/rte_bus.h
> @@ -168,6 +168,20 @@ typedef int (*rte_bus_unplug_t)(struct rte_device *dev);
>  typedef int (*rte_bus_parse_t)(const char *name, void *addr);
>  
>  /**
> + * Implementation a specific hot unplug handler, which is responsible
> + * for handle the failure when hot unplug the device, guaranty the system
> + * would not hung in the case.
> + * @param dev
> + *	Pointer of the device structure.
> + *
> + * @return
> + *	0 on success.
> + *	!0 on error.
> + */
> +typedef int (*rte_bus_handle_hot_unplug_t)(struct rte_device *dev,
> +						void *dev_addr);
> +

I don't like the name of the OPS.
The documentation evokes only "the failure".
So is it a handle for any and all error possibly happening to a device?
If so, where is the input to describe the error?

If it is only meant to handle SIGBUS, because it is a very specific
error state only meant to happen on certain parts of the bus (the queue
mappings, if relevant), then it makes sense to only have an arbitrary
address as context for handling.

But then, it needs to be called as such. The expected failure to be
handled should be explicit in the name of the ops, and the documentation
should be more precise about what a bus developper should do with the
input.

> +/**
>   * Bus scan policies
>   */
>  enum rte_bus_scan_mode {
> @@ -209,6 +223,8 @@ struct rte_bus {
>  	rte_bus_plug_t plug;         /**< Probe single device for drivers */
>  	rte_bus_unplug_t unplug;     /**< Remove single device from driver */
>  	rte_bus_parse_t parse;       /**< Parse a device name */
> +	rte_bus_handle_hot_unplug_t handle_hot_unplug; /**< handle hot unplug
> +							device event */

The new ops should be added at the end of the structure.

Regards,
-- 
Gaëtan Rivet
6WIND