All of lore.kernel.org
 help / color / mirror / Atom feed
* RE: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
@ 2016-08-05 15:31 ` Allen Hubbe
  0 siblings, 0 replies; 12+ messages in thread
From: Allen Hubbe @ 2016-08-05 15:31 UTC (permalink / raw)
  To: 'Serge Semin', jdmason
  Cc: dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb, linux-kernel

From: Serge Semin
> Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> devices, so translated base address of memory windows can be direcly written
> to peer registers. But there are some IDT PCIe-switches which implement
> complex interfaces using Lookup Tables of translation addresses. Due to
> the way the table is accessed, it can not be done synchronously from different
> RCs, that's why the asynchronous interface should be developed.
> 
> For these purpose the Memory Window related interface is correspondingly split
> as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> is following: "It is a virtual memory region, which locally reflects a physical
> memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> memory windows.
> Here is the description of the Memory Window related NTB-bus callback
> functions:
>  - ntb_mw_count() - number of local memory windows.
>  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
>                          window to map.
>  - ntb_mw_set_trans() - set translation address of local memory window (this
>                         address should be somehow retrieved from a peer).
>  - ntb_mw_get_trans() - get translation address of local memory window.
>  - ntb_mw_get_align() - get alignment of translated base address and size of
>                         local memory window. Additionally one can get the
>                         upper size limit of the memory window.
>  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
>                          local number).
>  - ntb_peer_mw_set_trans() - set translation address of peer memory window
>  - ntb_peer_mw_get_trans() - get translation address of peer memory window
>  - ntb_peer_mw_get_align() - get alignment of translated base address and size
>                              of peer memory window.Additionally one can get the
>                              upper size limit of the memory window.
> 
> As one can see current AMD and Intel NTB drivers mostly implement the
> "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> since it doesn't have convenient access to the peer Lookup Table.
> 
> In order to pass information from one RC to another NTB functions of IDT
> PCIe-switch implement Messaging subsystem. They currently support four message
> registers to transfer DWORD sized data to a specified peer. So there are two
> new callback methods are introduced:
>  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
>                     and receive messages
>  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
>                     to a peer
> Additionally there is a new event function:
>  - ntb_msg_event() - it is invoked when either a new message was retrieved
>                      (NTB_MSG_NEW), or last message was successfully sent
>                      (NTB_MSG_SENT), or the last message failed to be sent
>                      (NTB_MSG_FAIL).
> 
> The last change concerns the IDs (practically names) of NTB-devices on the
> NTB-bus. It is not good to have the devices with same names in the system
> and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> devices supporting both interfaces.

Thanks for the work that went into writing this driver, and thanks for your patience with the review.  Please read my initial comments inline.  I would like to approach this from a top-down api perspective first, and settle on that first before requesting any specific changes in the hardware driver.  My major concern about these changes is that they introduce a distinct classification for sync and async hardware, supported by different sets of methods in the api, neither is a subset of the other.

You know the IDT hardware, so if any of my requests below are infeasible, I would like your constructive opinion (even if it means significant changes to existing drivers) on how to resolve the api so that new and existing hardware drivers can be unified under the same api, if possible.

> 
> Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> 
> ---
>  drivers/ntb/Kconfig                 |   4 +-
>  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
>  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
>  drivers/ntb/ntb.c                   |  86 +++++-
>  drivers/ntb/ntb_transport.c         |  19 +-
>  drivers/ntb/test/ntb_perf.c         |  16 +-
>  drivers/ntb/test/ntb_pingpong.c     |   5 +
>  drivers/ntb/test/ntb_tool.c         |  25 +-
>  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
>  9 files changed, 701 insertions(+), 162 deletions(-)
> 
> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> index 95944e5..67d80c4 100644
> --- a/drivers/ntb/Kconfig
> +++ b/drivers/ntb/Kconfig
> @@ -14,8 +14,6 @@ if NTB
> 
>  source "drivers/ntb/hw/Kconfig"
> 
> -source "drivers/ntb/test/Kconfig"
> -
>  config NTB_TRANSPORT
>  	tristate "NTB Transport Client"
>  	help
> @@ -25,4 +23,6 @@ config NTB_TRANSPORT
> 
>  	 If unsure, say N.
> 
> +source "drivers/ntb/test/Kconfig"
> +
>  endif # NTB
> diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c
> index 6ccba0d..ab6f353 100644
> --- a/drivers/ntb/hw/amd/ntb_hw_amd.c
> +++ b/drivers/ntb/hw/amd/ntb_hw_amd.c
> @@ -55,6 +55,7 @@
>  #include <linux/pci.h>
>  #include <linux/random.h>
>  #include <linux/slab.h>
> +#include <linux/sizes.h>
>  #include <linux/ntb.h>
> 
>  #include "ntb_hw_amd.h"
> @@ -84,11 +85,8 @@ static int amd_ntb_mw_count(struct ntb_dev *ntb)
>  	return ntb_ndev(ntb)->mw_count;
>  }
> 
> -static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> -				phys_addr_t *base,
> -				resource_size_t *size,
> -				resource_size_t *align,
> -				resource_size_t *align_size)
> +static int amd_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> +				 phys_addr_t *base, resource_size_t *size)
>  {
>  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
>  	int bar;
> @@ -103,17 +101,40 @@ static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
>  	if (size)
>  		*size = pci_resource_len(ndev->ntb.pdev, bar);
> 
> -	if (align)
> -		*align = SZ_4K;
> +	return 0;
> +}
> +
> +static int amd_ntb_peer_mw_count(struct ntb_dev *ntb)
> +{
> +	return ntb_ndev(ntb)->mw_count;
> +}
> +
> +static int amd_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> +				     resource_size_t *addr_align,
> +				     resource_size_t *size_align,
> +				     resource_size_t *size_max)
> +{
> +	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> +	int bar;
> +
> +	bar = ndev_mw_to_bar(ndev, idx);
> +	if (bar < 0)
> +		return bar;
> +
> +	if (addr_align)
> +		*addr_align = SZ_4K;
> +
> +	if (size_align)
> +		*size_align = 1;
> 
> -	if (align_size)
> -		*align_size = 1;
> +	if (size_max)
> +		*size_max = pci_resource_len(ndev->ntb.pdev, bar);
> 
>  	return 0;
>  }
> 
> -static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> -				dma_addr_t addr, resource_size_t size)
> +static int amd_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> +				     dma_addr_t addr, resource_size_t size)
>  {
>  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
>  	unsigned long xlat_reg, limit_reg = 0;
> @@ -432,8 +453,10 @@ static int amd_ntb_peer_spad_write(struct ntb_dev *ntb,
> 
>  static const struct ntb_dev_ops amd_ntb_ops = {
>  	.mw_count		= amd_ntb_mw_count,
> -	.mw_get_range		= amd_ntb_mw_get_range,
> -	.mw_set_trans		= amd_ntb_mw_set_trans,
> +	.mw_get_maprsc		= amd_ntb_mw_get_maprsc,
> +	.peer_mw_count		= amd_ntb_peer_mw_count,
> +	.peer_mw_get_align	= amd_ntb_peer_mw_get_align,
> +	.peer_mw_set_trans	= amd_ntb_peer_mw_set_trans,
>  	.link_is_up		= amd_ntb_link_is_up,
>  	.link_enable		= amd_ntb_link_enable,
>  	.link_disable		= amd_ntb_link_disable,
> diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c b/drivers/ntb/hw/intel/ntb_hw_intel.c
> index 40d04ef..fdb2838 100644
> --- a/drivers/ntb/hw/intel/ntb_hw_intel.c
> +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
> @@ -804,11 +804,8 @@ static int intel_ntb_mw_count(struct ntb_dev *ntb)
>  	return ntb_ndev(ntb)->mw_count;
>  }
> 
> -static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> -				  phys_addr_t *base,
> -				  resource_size_t *size,
> -				  resource_size_t *align,
> -				  resource_size_t *align_size)
> +static int intel_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> +				   phys_addr_t *base, resource_size_t *size)
>  {
>  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
>  	int bar;
> @@ -828,17 +825,51 @@ static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
>  		*size = pci_resource_len(ndev->ntb.pdev, bar) -
>  			(idx == ndev->b2b_idx ? ndev->b2b_off : 0);
> 
> -	if (align)
> -		*align = pci_resource_len(ndev->ntb.pdev, bar);
> +	return 0;
> +}
> +
> +static int intel_ntb_peer_mw_count(struct ntb_dev *ntb)
> +{
> +	return ntb_ndev(ntb)->mw_count;
> +}
> +
> +static int intel_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> +				       resource_size_t *addr_align,
> +				       resource_size_t *size_align,
> +				       resource_size_t *size_max)
> +{
> +	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> +	resource_size_t bar_size, mw_size;
> +	int bar;
> +
> +	if (idx >= ndev->b2b_idx && !ndev->b2b_off)
> +		idx += 1;
> +
> +	bar = ndev_mw_to_bar(ndev, idx);
> +	if (bar < 0)
> +		return bar;
> +
> +	bar_size = pci_resource_len(ndev->ntb.pdev, bar);
> +
> +	if (idx == ndev->b2b_idx)
> +		mw_size = bar_size - ndev->b2b_off;
> +	else
> +		mw_size = bar_size;
> +
> +	if (addr_align)
> +		*addr_align = bar_size;
> +
> +	if (size_align)
> +		*size_align = 1;
> 
> -	if (align_size)
> -		*align_size = 1;
> +	if (size_max)
> +		*size_max = mw_size;
> 
>  	return 0;
>  }
> 
> -static int intel_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> -				  dma_addr_t addr, resource_size_t size)
> +static int intel_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> +				       dma_addr_t addr, resource_size_t size)
>  {
>  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
>  	unsigned long base_reg, xlat_reg, limit_reg;
> @@ -2220,8 +2251,10 @@ static struct intel_b2b_addr xeon_b2b_dsd_addr = {
>  /* operations for primary side of local ntb */
>  static const struct ntb_dev_ops intel_ntb_ops = {
>  	.mw_count		= intel_ntb_mw_count,
> -	.mw_get_range		= intel_ntb_mw_get_range,
> -	.mw_set_trans		= intel_ntb_mw_set_trans,
> +	.mw_get_maprsc		= intel_ntb_mw_get_maprsc,
> +	.peer_mw_count		= intel_ntb_peer_mw_count,
> +	.peer_mw_get_align	= intel_ntb_peer_mw_get_align,
> +	.peer_mw_set_trans	= intel_ntb_peer_mw_set_trans,
>  	.link_is_up		= intel_ntb_link_is_up,
>  	.link_enable		= intel_ntb_link_enable,
>  	.link_disable		= intel_ntb_link_disable,
> diff --git a/drivers/ntb/ntb.c b/drivers/ntb/ntb.c
> index 2e25307..37c3b36 100644
> --- a/drivers/ntb/ntb.c
> +++ b/drivers/ntb/ntb.c
> @@ -54,6 +54,7 @@
>  #include <linux/device.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> +#include <linux/atomic.h>
> 
>  #include <linux/ntb.h>
>  #include <linux/pci.h>
> @@ -72,8 +73,62 @@ MODULE_AUTHOR(DRIVER_AUTHOR);
>  MODULE_DESCRIPTION(DRIVER_DESCRIPTION);
> 
>  static struct bus_type ntb_bus;
> +static struct ntb_bus_data ntb_data;
>  static void ntb_dev_release(struct device *dev);
> 
> +static int ntb_gen_devid(struct ntb_dev *ntb)
> +{
> +	const char *name;
> +	unsigned long *mask;
> +	int id;
> +
> +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> +		name = "ntbAS%d";
> +		mask = ntb_data.both_msk;
> +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> +		name = "ntbS%d";
> +		mask = ntb_data.sync_msk;
> +	} else if (ntb_valid_async_dev_ops(ntb)) {
> +		name = "ntbA%d";
> +		mask = ntb_data.async_msk;
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	for (id = 0; NTB_MAX_DEVID > id; id++) {
> +		if (0 == test_and_set_bit(id, mask)) {
> +			ntb->id = id;
> +			break;
> +		}
> +	}
> +
> +	if (NTB_MAX_DEVID > id) {
> +		dev_set_name(&ntb->dev, name, ntb->id);
> +	} else {
> +		return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +static void ntb_free_devid(struct ntb_dev *ntb)
> +{
> +	unsigned long *mask;
> +
> +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> +		mask = ntb_data.both_msk;
> +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> +		mask = ntb_data.sync_msk;
> +	} else if (ntb_valid_async_dev_ops(ntb)) {
> +		mask = ntb_data.async_msk;
> +	} else {
> +		/* It's impossible */
> +		BUG();
> +	}
> +
> +	clear_bit(ntb->id, mask);
> +}
> +
>  int __ntb_register_client(struct ntb_client *client, struct module *mod,
>  			  const char *mod_name)
>  {
> @@ -99,13 +154,15 @@ EXPORT_SYMBOL(ntb_unregister_client);
> 
>  int ntb_register_device(struct ntb_dev *ntb)
>  {
> +	int ret;
> +
>  	if (!ntb)
>  		return -EINVAL;
>  	if (!ntb->pdev)
>  		return -EINVAL;
>  	if (!ntb->ops)
>  		return -EINVAL;
> -	if (!ntb_dev_ops_is_valid(ntb->ops))
> +	if (!ntb_valid_sync_dev_ops(ntb) && !ntb_valid_async_dev_ops(ntb))
>  		return -EINVAL;
> 
>  	init_completion(&ntb->released);
> @@ -114,13 +171,21 @@ int ntb_register_device(struct ntb_dev *ntb)
>  	ntb->dev.bus = &ntb_bus;
>  	ntb->dev.parent = &ntb->pdev->dev;
>  	ntb->dev.release = ntb_dev_release;
> -	dev_set_name(&ntb->dev, "%s", pci_name(ntb->pdev));
> 
>  	ntb->ctx = NULL;
>  	ntb->ctx_ops = NULL;
>  	spin_lock_init(&ntb->ctx_lock);
> 
> -	return device_register(&ntb->dev);
> +	/* No need to wait for completion if failed */
> +	ret = ntb_gen_devid(ntb);
> +	if (ret)
> +		return ret;
> +
> +	ret = device_register(&ntb->dev);
> +	if (ret)
> +		ntb_free_devid(ntb);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL(ntb_register_device);
> 
> @@ -128,6 +193,7 @@ void ntb_unregister_device(struct ntb_dev *ntb)
>  {
>  	device_unregister(&ntb->dev);
>  	wait_for_completion(&ntb->released);
> +	ntb_free_devid(ntb);
>  }
>  EXPORT_SYMBOL(ntb_unregister_device);
> 
> @@ -191,6 +257,20 @@ void ntb_db_event(struct ntb_dev *ntb, int vector)
>  }
>  EXPORT_SYMBOL(ntb_db_event);
> 
> +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> +		   struct ntb_msg *msg)
> +{
> +	unsigned long irqflags;
> +
> +	spin_lock_irqsave(&ntb->ctx_lock, irqflags);
> +	{
> +		if (ntb->ctx_ops && ntb->ctx_ops->msg_event)
> +			ntb->ctx_ops->msg_event(ntb->ctx, ev, msg);
> +	}
> +	spin_unlock_irqrestore(&ntb->ctx_lock, irqflags);
> +}
> +EXPORT_SYMBOL(ntb_msg_event);
> +
>  static int ntb_probe(struct device *dev)
>  {
>  	struct ntb_dev *ntb;
> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> index d5c5894..2626ba0 100644
> --- a/drivers/ntb/ntb_transport.c
> +++ b/drivers/ntb/ntb_transport.c
> @@ -673,7 +673,7 @@ static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
>  	if (!mw->virt_addr)
>  		return;
> 
> -	ntb_mw_clear_trans(nt->ndev, num_mw);
> +	ntb_peer_mw_set_trans(nt->ndev, num_mw, 0, 0);
>  	dma_free_coherent(&pdev->dev, mw->buff_size,
>  			  mw->virt_addr, mw->dma_addr);
>  	mw->xlat_size = 0;
> @@ -730,7 +730,8 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
>  	}
> 
>  	/* Notify HW the memory location of the receive buffer */
> -	rc = ntb_mw_set_trans(nt->ndev, num_mw, mw->dma_addr, mw->xlat_size);
> +	rc = ntb_peer_mw_set_trans(nt->ndev, num_mw, mw->dma_addr,
> +				   mw->xlat_size);
>  	if (rc) {
>  		dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
>  		ntb_free_mw(nt, num_mw);
> @@ -1060,7 +1061,11 @@ static int ntb_transport_probe(struct ntb_client *self, struct
> ntb_dev *ndev)
>  	int node;
>  	int rc, i;
> 
> -	mw_count = ntb_mw_count(ndev);
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ndev))
> +		return -EINVAL;
> +
> +	mw_count = ntb_peer_mw_count(ndev);
>  	if (ntb_spad_count(ndev) < (NUM_MWS + 1 + mw_count * 2)) {
>  		dev_err(&ndev->dev, "Not enough scratch pad registers for %s",
>  			NTB_TRANSPORT_NAME);
> @@ -1094,8 +1099,12 @@ static int ntb_transport_probe(struct ntb_client *self, struct
> ntb_dev *ndev)
>  	for (i = 0; i < mw_count; i++) {
>  		mw = &nt->mw_vec[i];
> 
> -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> -				      &mw->xlat_align, &mw->xlat_align_size);
> +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> +		if (rc)
> +			goto err1;
> +
> +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> +					   &mw->xlat_align_size, NULL);

Looks like ntb_mw_get_range() was simpler before the change.

>  		if (rc)
>  			goto err1;
> 
> diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
> index 6a50f20..f2952f7 100644
> --- a/drivers/ntb/test/ntb_perf.c
> +++ b/drivers/ntb/test/ntb_perf.c
> @@ -452,7 +452,7 @@ static void perf_free_mw(struct perf_ctx *perf)
>  	if (!mw->virt_addr)
>  		return;
> 
> -	ntb_mw_clear_trans(perf->ntb, 0);
> +	ntb_peer_mw_set_trans(perf->ntb, 0, 0, 0);
>  	dma_free_coherent(&pdev->dev, mw->buf_size,
>  			  mw->virt_addr, mw->dma_addr);
>  	mw->xlat_size = 0;
> @@ -488,7 +488,7 @@ static int perf_set_mw(struct perf_ctx *perf, resource_size_t size)
>  		mw->buf_size = 0;
>  	}
> 
> -	rc = ntb_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
> +	rc = ntb_peer_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
>  	if (rc) {
>  		dev_err(&perf->ntb->dev, "Unable to set mw0 translation\n");
>  		perf_free_mw(perf);
> @@ -559,8 +559,12 @@ static int perf_setup_mw(struct ntb_dev *ntb, struct perf_ctx *perf)
> 
>  	mw = &perf->mw;
> 
> -	rc = ntb_mw_get_range(ntb, 0, &mw->phys_addr, &mw->phys_size,
> -			      &mw->xlat_align, &mw->xlat_align_size);
> +	rc = ntb_mw_get_maprsc(ntb, 0, &mw->phys_addr, &mw->phys_size);
> +	if (rc)
> +		return rc;
> +
> +	rc = ntb_peer_mw_get_align(ntb, 0, &mw->xlat_align,
> +				   &mw->xlat_align_size, NULL);

Looks like ntb_mw_get_range() was simpler.

>  	if (rc)
>  		return rc;
> 
> @@ -758,6 +762,10 @@ static int perf_probe(struct ntb_client *client, struct ntb_dev *ntb)
>  	int node;
>  	int rc = 0;
> 
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ntb))
> +		return -EINVAL;
> +
>  	if (ntb_spad_count(ntb) < MAX_SPAD) {
>  		dev_err(&ntb->dev, "Not enough scratch pad registers for %s",
>  			DRIVER_NAME);
> diff --git a/drivers/ntb/test/ntb_pingpong.c b/drivers/ntb/test/ntb_pingpong.c
> index 7d31179..e833649 100644
> --- a/drivers/ntb/test/ntb_pingpong.c
> +++ b/drivers/ntb/test/ntb_pingpong.c
> @@ -214,6 +214,11 @@ static int pp_probe(struct ntb_client *client,
>  	struct pp_ctx *pp;
>  	int rc;
> 
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ntb)) {
> +		return -EINVAL;
> +	}
> +
>  	if (ntb_db_is_unsafe(ntb)) {
>  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
>  		if (!unsafe) {
> diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
> index 61bf2ef..5dfe12f 100644
> --- a/drivers/ntb/test/ntb_tool.c
> +++ b/drivers/ntb/test/ntb_tool.c
> @@ -675,8 +675,11 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t
> req_size)
>  	if (mw->peer)
>  		return 0;
> 
> -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &size, &align,
> -			      &align_size);
> +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &size);
> +	if (rc)
> +		return rc;
> +
> +	rc = ntb_peer_mw_get_align(tc->ntb, idx, &align, &align_size, NULL);
>  	if (rc)
>  		return rc;

Looks like ntb_mw_get_range() was simpler.

> 
> @@ -689,7 +692,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t
> req_size)
>  	if (!mw->peer)
>  		return -ENOMEM;
> 
> -	rc = ntb_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
> +	rc = ntb_peer_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
>  	if (rc)
>  		goto err_free_dma;
> 
> @@ -716,7 +719,7 @@ static void tool_free_mw(struct tool_ctx *tc, int idx)
>  	struct tool_mw *mw = &tc->mws[idx];
> 
>  	if (mw->peer) {
> -		ntb_mw_clear_trans(tc->ntb, idx);
> +		ntb_peer_mw_set_trans(tc->ntb, idx, 0, 0);
>  		dma_free_coherent(&tc->ntb->pdev->dev, mw->size,
>  				  mw->peer,
>  				  mw->peer_dma);
> @@ -751,8 +754,8 @@ static ssize_t tool_peer_mw_trans_read(struct file *filep,
>  	if (!buf)
>  		return -ENOMEM;
> 
> -	ntb_mw_get_range(mw->tc->ntb, mw->idx,
> -			 &base, &mw_size, &align, &align_size);
> +	ntb_mw_get_maprsc(mw->tc->ntb, mw->idx, &base, &mw_size);
> +	ntb_peer_mw_get_align(mw->tc->ntb, mw->idx, &align, &align_size, NULL);
> 
>  	off += scnprintf(buf + off, buf_size - off,
>  			 "Peer MW %d Information:\n", mw->idx);
> @@ -827,8 +830,7 @@ static int tool_init_mw(struct tool_ctx *tc, int idx)
>  	phys_addr_t base;
>  	int rc;
> 
> -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &mw->win_size,
> -			      NULL, NULL);
> +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &mw->win_size);
>  	if (rc)
>  		return rc;
> 
> @@ -913,6 +915,11 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
>  	int rc;
>  	int i;
> 
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ntb)) {
> +		return -EINVAL;
> +	}
> +

It would be nice if both types could be supported by the same api.
 
>  	if (ntb_db_is_unsafe(ntb))
>  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
> 
> @@ -928,7 +935,7 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
>  	tc->ntb = ntb;
>  	init_waitqueue_head(&tc->link_wq);
> 
> -	tc->mw_count = min(ntb_mw_count(tc->ntb), MAX_MWS);
> +	tc->mw_count = min(ntb_peer_mw_count(tc->ntb), MAX_MWS);
>  	for (i = 0; i < tc->mw_count; i++) {
>  		rc = tool_init_mw(tc, i);
>  		if (rc)
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index 6f47562..d1937d3 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -159,13 +159,44 @@ static inline int ntb_client_ops_is_valid(const struct
> ntb_client_ops *ops)
>  }
> 
>  /**
> + * struct ntb_msg - ntb driver message structure
> + * @type:	Message type.
> + * @payload:	Payload data to send to a peer
> + * @data:	Array of u32 data to send (size might be hw dependent)
> + */
> +#define NTB_MAX_MSGSIZE 4
> +struct ntb_msg {
> +	union {
> +		struct {
> +			u32 type;
> +			u32 payload[NTB_MAX_MSGSIZE - 1];
> +		};
> +		u32 data[NTB_MAX_MSGSIZE];
> +	};
> +};
> +
> +/**
> + * enum NTB_MSG_EVENT - message event types
> + * @NTB_MSG_NEW:	New message just arrived and passed to the handler
> + * @NTB_MSG_SENT:	Posted message has just been successfully sent
> + * @NTB_MSG_FAIL:	Posted message failed to be sent
> + */
> +enum NTB_MSG_EVENT {
> +	NTB_MSG_NEW,
> +	NTB_MSG_SENT,
> +	NTB_MSG_FAIL
> +};
> +
> +/**
>   * struct ntb_ctx_ops - ntb driver context operations
>   * @link_event:		See ntb_link_event().
>   * @db_event:		See ntb_db_event().
> + * @msg_event:		See ntb_msg_event().
>   */
>  struct ntb_ctx_ops {
>  	void (*link_event)(void *ctx);
>  	void (*db_event)(void *ctx, int db_vector);
> +	void (*msg_event)(void *ctx, enum NTB_MSG_EVENT ev, struct ntb_msg *msg);
>  };
> 
>  static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> @@ -174,18 +205,24 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> *ops)
>  	return
>  		/* ops->link_event		&& */
>  		/* ops->db_event		&& */
> +		/* ops->msg_event		&& */
>  		1;
>  }
> 
>  /**
>   * struct ntb_ctx_ops - ntb device operations
> - * @mw_count:		See ntb_mw_count().
> - * @mw_get_range:	See ntb_mw_get_range().
> - * @mw_set_trans:	See ntb_mw_set_trans().
> - * @mw_clear_trans:	See ntb_mw_clear_trans().
>   * @link_is_up:		See ntb_link_is_up().
>   * @link_enable:	See ntb_link_enable().
>   * @link_disable:	See ntb_link_disable().
> + * @mw_count:		See ntb_mw_count().
> + * @mw_get_maprsc:	See ntb_mw_get_maprsc().
> + * @mw_set_trans:	See ntb_mw_set_trans().
> + * @mw_get_trans:	See ntb_mw_get_trans().
> + * @mw_get_align:	See ntb_mw_get_align().
> + * @peer_mw_count:	See ntb_peer_mw_count().
> + * @peer_mw_set_trans:	See ntb_peer_mw_set_trans().
> + * @peer_mw_get_trans:	See ntb_peer_mw_get_trans().
> + * @peer_mw_get_align:	See ntb_peer_mw_get_align().
>   * @db_is_unsafe:	See ntb_db_is_unsafe().
>   * @db_valid_mask:	See ntb_db_valid_mask().
>   * @db_vector_count:	See ntb_db_vector_count().
> @@ -210,22 +247,38 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> *ops)
>   * @peer_spad_addr:	See ntb_peer_spad_addr().
>   * @peer_spad_read:	See ntb_peer_spad_read().
>   * @peer_spad_write:	See ntb_peer_spad_write().
> + * @msg_post:		See ntb_msg_post().
> + * @msg_size:		See ntb_msg_size().
>   */
>  struct ntb_dev_ops {
> -	int (*mw_count)(struct ntb_dev *ntb);
> -	int (*mw_get_range)(struct ntb_dev *ntb, int idx,
> -			    phys_addr_t *base, resource_size_t *size,
> -			resource_size_t *align, resource_size_t *align_size);
> -	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> -			    dma_addr_t addr, resource_size_t size);
> -	int (*mw_clear_trans)(struct ntb_dev *ntb, int idx);
> -
>  	int (*link_is_up)(struct ntb_dev *ntb,
>  			  enum ntb_speed *speed, enum ntb_width *width);
>  	int (*link_enable)(struct ntb_dev *ntb,
>  			   enum ntb_speed max_speed, enum ntb_width max_width);
>  	int (*link_disable)(struct ntb_dev *ntb);
> 
> +	int (*mw_count)(struct ntb_dev *ntb);
> +	int (*mw_get_maprsc)(struct ntb_dev *ntb, int idx,
> +			     phys_addr_t *base, resource_size_t *size);
> +	int (*mw_get_align)(struct ntb_dev *ntb, int idx,
> +			    resource_size_t *addr_align,
> +			    resource_size_t *size_align,
> +			    resource_size_t *size_max);
> +	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> +			    dma_addr_t addr, resource_size_t size);
> +	int (*mw_get_trans)(struct ntb_dev *ntb, int idx,
> +			    dma_addr_t *addr, resource_size_t *size);
> +
> +	int (*peer_mw_count)(struct ntb_dev *ntb);
> +	int (*peer_mw_get_align)(struct ntb_dev *ntb, int idx,
> +				 resource_size_t *addr_align,
> +				 resource_size_t *size_align,
> +				 resource_size_t *size_max);
> +	int (*peer_mw_set_trans)(struct ntb_dev *ntb, int idx,
> +				 dma_addr_t addr, resource_size_t size);
> +	int (*peer_mw_get_trans)(struct ntb_dev *ntb, int idx,
> +				 dma_addr_t *addr, resource_size_t *size);
> +
>  	int (*db_is_unsafe)(struct ntb_dev *ntb);
>  	u64 (*db_valid_mask)(struct ntb_dev *ntb);
>  	int (*db_vector_count)(struct ntb_dev *ntb);
> @@ -259,47 +312,10 @@ struct ntb_dev_ops {
>  			      phys_addr_t *spad_addr);
>  	u32 (*peer_spad_read)(struct ntb_dev *ntb, int idx);
>  	int (*peer_spad_write)(struct ntb_dev *ntb, int idx, u32 val);
> -};
> -
> -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> -{
> -	/* commented callbacks are not required: */
> -	return
> -		ops->mw_count				&&
> -		ops->mw_get_range			&&
> -		ops->mw_set_trans			&&
> -		/* ops->mw_clear_trans			&& */
> -		ops->link_is_up				&&
> -		ops->link_enable			&&
> -		ops->link_disable			&&
> -		/* ops->db_is_unsafe			&& */
> -		ops->db_valid_mask			&&
> 
> -		/* both set, or both unset */
> -		(!ops->db_vector_count == !ops->db_vector_mask) &&
> -
> -		ops->db_read				&&
> -		/* ops->db_set				&& */
> -		ops->db_clear				&&
> -		/* ops->db_read_mask			&& */
> -		ops->db_set_mask			&&
> -		ops->db_clear_mask			&&
> -		/* ops->peer_db_addr			&& */
> -		/* ops->peer_db_read			&& */
> -		ops->peer_db_set			&&
> -		/* ops->peer_db_clear			&& */
> -		/* ops->peer_db_read_mask		&& */
> -		/* ops->peer_db_set_mask		&& */
> -		/* ops->peer_db_clear_mask		&& */
> -		/* ops->spad_is_unsafe			&& */
> -		ops->spad_count				&&
> -		ops->spad_read				&&
> -		ops->spad_write				&&
> -		/* ops->peer_spad_addr			&& */
> -		/* ops->peer_spad_read			&& */
> -		ops->peer_spad_write			&&
> -		1;
> -}
> +	int (*msg_post)(struct ntb_dev *ntb, struct ntb_msg *msg);
> +	int (*msg_size)(struct ntb_dev *ntb);
> +};
> 
>  /**
>   * struct ntb_client - client interested in ntb devices
> @@ -310,10 +326,22 @@ struct ntb_client {
>  	struct device_driver		drv;
>  	const struct ntb_client_ops	ops;
>  };
> -
>  #define drv_ntb_client(__drv) container_of((__drv), struct ntb_client, drv)
> 
>  /**
> + * struct ntb_bus_data - NTB bus data
> + * @sync_msk:	Synchroous devices mask
> + * @async_msk:	Asynchronous devices mask
> + * @both_msk:	Both sync and async devices mask
> + */
> +#define NTB_MAX_DEVID (8*BITS_PER_LONG)
> +struct ntb_bus_data {
> +	unsigned long sync_msk[8];
> +	unsigned long async_msk[8];
> +	unsigned long both_msk[8];
> +};
> +
> +/**
>   * struct ntb_device - ntb device
>   * @dev:		Linux device object.
>   * @pdev:		Pci device entry of the ntb.
> @@ -332,15 +360,151 @@ struct ntb_dev {
> 
>  	/* private: */
> 
> +	/* device id */
> +	int id;
>  	/* synchronize setting, clearing, and calling ctx_ops */
>  	spinlock_t			ctx_lock;
>  	/* block unregister until device is fully released */
>  	struct completion		released;
>  };
> -
>  #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
> 
>  /**
> + * ntb_valid_sync_dev_ops() - valid operations for synchronous hardware setup
> + * @ntb:	NTB device
> + *
> + * There might be two types of NTB hardware differed by the way of the settings
> + * configuration. The synchronous chips allows to set the memory windows by
> + * directly writing to the peer registers. Additionally there can be shared
> + * Scratchpad registers for synchronous information exchange. Client drivers
> + * should call this function to make sure the hardware supports the proper
> + * functionality.
> + */
> +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> +{
> +	const struct ntb_dev_ops *ops = ntb->ops;
> +
> +	/* Commented callbacks are not required, but might be developed */
> +	return	/* NTB link status ops */
> +		ops->link_is_up					&&
> +		ops->link_enable				&&
> +		ops->link_disable				&&
> +
> +		/* Synchronous memory windows ops */
> +		ops->mw_count					&&
> +		ops->mw_get_maprsc				&&
> +		/* ops->mw_get_align				&& */
> +		/* ops->mw_set_trans				&& */
> +		/* ops->mw_get_trans				&& */
> +		ops->peer_mw_count				&&
> +		ops->peer_mw_get_align				&&
> +		ops->peer_mw_set_trans				&&
> +		/* ops->peer_mw_get_trans			&& */
> +
> +		/* Doorbell ops */
> +		/* ops->db_is_unsafe				&& */
> +		ops->db_valid_mask				&&
> +		/* both set, or both unset */
> +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> +		ops->db_read					&&
> +		/* ops->db_set					&& */
> +		ops->db_clear					&&
> +		/* ops->db_read_mask				&& */
> +		ops->db_set_mask				&&
> +		ops->db_clear_mask				&&
> +		/* ops->peer_db_addr				&& */
> +		/* ops->peer_db_read				&& */
> +		ops->peer_db_set				&&
> +		/* ops->peer_db_clear				&& */
> +		/* ops->peer_db_read_mask			&& */
> +		/* ops->peer_db_set_mask			&& */
> +		/* ops->peer_db_clear_mask			&& */
> +
> +		/* Scratchpad ops */
> +		/* ops->spad_is_unsafe				&& */
> +		ops->spad_count					&&
> +		ops->spad_read					&&
> +		ops->spad_write					&&
> +		/* ops->peer_spad_addr				&& */
> +		/* ops->peer_spad_read				&& */
> +		ops->peer_spad_write				&&
> +
> +		/* Messages IO ops */
> +		/* ops->msg_post				&& */
> +		/* ops->msg_size				&& */
> +		1;
> +}
> +
> +/**
> + * ntb_valid_async_dev_ops() - valid operations for asynchronous hardware setup
> + * @ntb:	NTB device
> + *
> + * There might be two types of NTB hardware differed by the way of the settings
> + * configuration. The asynchronous chips does not allow to set the memory
> + * windows by directly writing to the peer registers. Instead it implements
> + * the additional method to communinicate between NTB nodes like messages.
> + * Scratchpad registers aren't likely supported by such hardware. Client
> + * drivers should call this function to make sure the hardware supports
> + * the proper functionality.
> + */
> +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> +{
> +	const struct ntb_dev_ops *ops = ntb->ops;
> +
> +	/* Commented callbacks are not required, but might be developed */
> +	return	/* NTB link status ops */
> +		ops->link_is_up					&&
> +		ops->link_enable				&&
> +		ops->link_disable				&&
> +
> +		/* Asynchronous memory windows ops */
> +		ops->mw_count					&&
> +		ops->mw_get_maprsc				&&
> +		ops->mw_get_align				&&
> +		ops->mw_set_trans				&&
> +		/* ops->mw_get_trans				&& */
> +		ops->peer_mw_count				&&
> +		ops->peer_mw_get_align				&&
> +		/* ops->peer_mw_set_trans			&& */
> +		/* ops->peer_mw_get_trans			&& */
> +
> +		/* Doorbell ops */
> +		/* ops->db_is_unsafe				&& */
> +		ops->db_valid_mask				&&
> +		/* both set, or both unset */
> +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> +		ops->db_read					&&
> +		/* ops->db_set					&& */
> +		ops->db_clear					&&
> +		/* ops->db_read_mask				&& */
> +		ops->db_set_mask				&&
> +		ops->db_clear_mask				&&
> +		/* ops->peer_db_addr				&& */
> +		/* ops->peer_db_read				&& */
> +		ops->peer_db_set				&&
> +		/* ops->peer_db_clear				&& */
> +		/* ops->peer_db_read_mask			&& */
> +		/* ops->peer_db_set_mask			&& */
> +		/* ops->peer_db_clear_mask			&& */
> +
> +		/* Scratchpad ops */
> +		/* ops->spad_is_unsafe				&& */
> +		/* ops->spad_count				&& */
> +		/* ops->spad_read				&& */
> +		/* ops->spad_write				&& */
> +		/* ops->peer_spad_addr				&& */
> +		/* ops->peer_spad_read				&& */
> +		/* ops->peer_spad_write				&& */
> +
> +		/* Messages IO ops */
> +		ops->msg_post					&&
> +		ops->msg_size					&&
> +		1;
> +}

I understand why IDT requires a different api for dealing with addressing multiple peers.  I would be interested in a solution that would allow, for example, the Intel driver fit under the api for dealing with multiple peers, even though it only supports one peer.  I would rather see that, than two separate apis under ntb.

Thoughts?

Can the sync api be described by some subset of the async api?  Are there less overloaded terms we can use instead of sync/async?

> +
> +
> +
> +/**
>   * ntb_register_client() - register a client for interest in ntb devices
>   * @client:	Client context.
>   *
> @@ -441,10 +605,84 @@ void ntb_link_event(struct ntb_dev *ntb);
>  void ntb_db_event(struct ntb_dev *ntb, int vector);
> 
>  /**
> - * ntb_mw_count() - get the number of memory windows
> + * ntb_msg_event() - notify driver context of event in messaging subsystem
>   * @ntb:	NTB device context.
> + * @ev:		Event type caused the handler invocation
> + * @msg:	Message related to the event
> + *
> + * Notify the driver context that there is some event happaned in the event
> + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> + * last argument is used to pass the event related message. It discarded right
> + * after the handler returns.
> + */
> +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> +		   struct ntb_msg *msg);

I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of the message handling to be done more appropriately at a higher layer of the application.  I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I think would be more appropriate for a ntb transport (or higher layer) driver.

> +
> +/**
> + * ntb_link_is_up() - get the current ntb link state
> + * @ntb:	NTB device context.
> + * @speed:	OUT - The link speed expressed as PCIe generation number.
> + * @width:	OUT - The link width expressed as the number of PCIe lanes.
> + *
> + * Get the current state of the ntb link.  It is recommended to query the link
> + * state once after every link event.  It is safe to query the link state in
> + * the context of the link event callback.
> + *
> + * Return: One if the link is up, zero if the link is down, otherwise a
> + *		negative value indicating the error number.
> + */
> +static inline int ntb_link_is_up(struct ntb_dev *ntb,
> +				 enum ntb_speed *speed, enum ntb_width *width)
> +{
> +	return ntb->ops->link_is_up(ntb, speed, width);
> +}
> +

It looks like there was some rearranging of code, so big hunks appear to be added or removed.  Can you split this into two (or more) patches so that rearranging the code is distinct from more interesting changes?

> +/**
> + * ntb_link_enable() - enable the link on the secondary side of the ntb
> + * @ntb:	NTB device context.
> + * @max_speed:	The maximum link speed expressed as PCIe generation number.
> + * @max_width:	The maximum link width expressed as the number of PCIe lanes.
>   *
> - * Hardware and topology may support a different number of memory windows.
> + * Enable the link on the secondary side of the ntb.  This can only be done
> + * from only one (primary or secondary) side of the ntb in primary or b2b
> + * topology.  The ntb device should train the link to its maximum speed and
> + * width, or the requested speed and width, whichever is smaller, if supported.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_link_enable(struct ntb_dev *ntb,
> +				  enum ntb_speed max_speed,
> +				  enum ntb_width max_width)
> +{
> +	return ntb->ops->link_enable(ntb, max_speed, max_width);
> +}
> +
> +/**
> + * ntb_link_disable() - disable the link on the secondary side of the ntb
> + * @ntb:	NTB device context.
> + *
> + * Disable the link on the secondary side of the ntb.  This can only be
> + * done from only one (primary or secondary) side of the ntb in primary or b2b
> + * topology.  The ntb device should disable the link.  Returning from this call
> + * must indicate that a barrier has passed, though with no more writes may pass
> + * in either direction across the link, except if this call returns an error
> + * number.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_link_disable(struct ntb_dev *ntb)
> +{
> +	return ntb->ops->link_disable(ntb);
> +}
> +
> +/**
> + * ntb_mw_count() - get the number of local memory windows
> + * @ntb:	NTB device context.
> + *
> + * Hardware and topology may support a different number of memory windows at
> + * local and remote devices
>   *
>   * Return: the number of memory windows.
>   */
> @@ -454,122 +692,186 @@ static inline int ntb_mw_count(struct ntb_dev *ntb)
>  }
> 
>  /**
> - * ntb_mw_get_range() - get the range of a memory window
> + * ntb_mw_get_maprsc() - get the range of a memory window to map

What was insufficient about ntb_mw_get_range() that it needed to be split into ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch, it seems ntb_mw_get_range() would have been more simple.

I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].  So, there is no example of how usage of new api would be used differently or more efficiently than ntb_mw_get_range() for async devices.

>   * @ntb:	NTB device context.
>   * @idx:	Memory window number.
>   * @base:	OUT - the base address for mapping the memory window
>   * @size:	OUT - the size for mapping the memory window
> - * @align:	OUT - the base alignment for translating the memory window
> - * @align_size:	OUT - the size alignment for translating the memory window
>   *
> - * Get the range of a memory window.  NULL may be given for any output
> - * parameter if the value is not needed.  The base and size may be used for
> - * mapping the memory window, to access the peer memory.  The alignment and
> - * size may be used for translating the memory window, for the peer to access
> - * memory on the local system.
> + * Get the map range of a memory window. The base and size may be used for
> + * mapping the memory window to access the peer memory.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> -				   phys_addr_t *base, resource_size_t *size,
> -		resource_size_t *align, resource_size_t *align_size)
> +static inline int ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> +				    phys_addr_t *base, resource_size_t *size)
>  {
> -	return ntb->ops->mw_get_range(ntb, idx, base, size,
> -			align, align_size);
> +	return ntb->ops->mw_get_maprsc(ntb, idx, base, size);
> +}
> +
> +/**
> + * ntb_mw_get_align() - get memory window alignment of the local node
> + * @ntb:	NTB device context.
> + * @idx:	Memory window number.
> + * @addr_align:	OUT - the translated base address alignment of the memory window
> + * @size_align:	OUT - the translated memory size alignment of the memory window
> + * @size_max:	OUT - the translated memory maximum size
> + *
> + * Get the alignment parameters to allocate the proper memory window. NULL may
> + * be given for any output parameter if the value is not needed.
> + *
> + * Drivers of synchronous hardware don't have to support it.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_mw_get_align(struct ntb_dev *ntb, int idx,
> +				   resource_size_t *addr_align,
> +				   resource_size_t *size_align,
> +				   resource_size_t *size_max)
> +{
> +	if (!ntb->ops->mw_get_align)
> +		return -EINVAL;
> +
> +	return ntb->ops->mw_get_align(ntb, idx, addr_align, size_align, size_max);
>  }
> 
>  /**
> - * ntb_mw_set_trans() - set the translation of a memory window
> + * ntb_mw_set_trans() - set the translated base address of a peer memory window
>   * @ntb:	NTB device context.
>   * @idx:	Memory window number.
> - * @addr:	The dma address local memory to expose to the peer.
> - * @size:	The size of the local memory to expose to the peer.
> + * @addr:	DMA memory address exposed by the peer.
> + * @size:	Size of the memory exposed by the peer.
> + *
> + * Set the translated base address of a memory window. The peer preliminary
> + * allocates a memory, then someway passes the address to the remote node, that
> + * finally sets up the memory window at the address, up to the size. The address
> + * and size must be aligned to the parameters specified by ntb_mw_get_align() of
> + * the local node and ntb_peer_mw_get_align() of the peer, which must return the
> + * same values. Zero size effectively disables the memory window.
>   *
> - * Set the translation of a memory window.  The peer may access local memory
> - * through the window starting at the address, up to the size.  The address
> - * must be aligned to the alignment specified by ntb_mw_get_range().  The size
> - * must be aligned to the size alignment specified by ntb_mw_get_range().
> + * Drivers of synchronous hardware don't have to support it.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
>  				   dma_addr_t addr, resource_size_t size)
>  {
> +	if (!ntb->ops->mw_set_trans)
> +		return -EINVAL;
> +
>  	return ntb->ops->mw_set_trans(ntb, idx, addr, size);
>  }
> 
>  /**
> - * ntb_mw_clear_trans() - clear the translation of a memory window
> + * ntb_mw_get_trans() - get the translated base address of a memory window
>   * @ntb:	NTB device context.
>   * @idx:	Memory window number.
> + * @addr:	The dma memory address exposed by the peer.
> + * @size:	The size of the memory exposed by the peer.
>   *
> - * Clear the translation of a memory window.  The peer may no longer access
> - * local memory through the window.
> + * Get the translated base address of a memory window spicified for the local
> + * hardware and allocated by the peer. If the addr and size are zero, the
> + * memory window is effectively disabled.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_mw_clear_trans(struct ntb_dev *ntb, int idx)
> +static inline int ntb_mw_get_trans(struct ntb_dev *ntb, int idx,
> +				   dma_addr_t *addr, resource_size_t *size)
>  {
> -	if (!ntb->ops->mw_clear_trans)
> -		return ntb->ops->mw_set_trans(ntb, idx, 0, 0);
> +	if (!ntb->ops->mw_get_trans)
> +		return -EINVAL;
> 
> -	return ntb->ops->mw_clear_trans(ntb, idx);
> +	return ntb->ops->mw_get_trans(ntb, idx, addr, size);
>  }
> 
>  /**
> - * ntb_link_is_up() - get the current ntb link state
> + * ntb_peer_mw_count() - get the number of peer memory windows
>   * @ntb:	NTB device context.
> - * @speed:	OUT - The link speed expressed as PCIe generation number.
> - * @width:	OUT - The link width expressed as the number of PCIe lanes.
>   *
> - * Get the current state of the ntb link.  It is recommended to query the link
> - * state once after every link event.  It is safe to query the link state in
> - * the context of the link event callback.
> + * Hardware and topology may support a different number of memory windows at
> + * local and remote nodes.
>   *
> - * Return: One if the link is up, zero if the link is down, otherwise a
> - *		negative value indicating the error number.
> + * Return: the number of memory windows.
>   */
> -static inline int ntb_link_is_up(struct ntb_dev *ntb,
> -				 enum ntb_speed *speed, enum ntb_width *width)
> +static inline int ntb_peer_mw_count(struct ntb_dev *ntb)
>  {
> -	return ntb->ops->link_is_up(ntb, speed, width);
> +	return ntb->ops->peer_mw_count(ntb);
>  }
> 
>  /**
> - * ntb_link_enable() - enable the link on the secondary side of the ntb
> + * ntb_peer_mw_get_align() - get memory window alignment of the peer
>   * @ntb:	NTB device context.
> - * @max_speed:	The maximum link speed expressed as PCIe generation number.
> - * @max_width:	The maximum link width expressed as the number of PCIe lanes.
> + * @idx:	Memory window number.
> + * @addr_align:	OUT - the translated base address alignment of the memory window
> + * @size_align:	OUT - the translated memory size alignment of the memory window
> + * @size_max:	OUT - the translated memory maximum size
>   *
> - * Enable the link on the secondary side of the ntb.  This can only be done
> - * from the primary side of the ntb in primary or b2b topology.  The ntb device
> - * should train the link to its maximum speed and width, or the requested speed
> - * and width, whichever is smaller, if supported.
> + * Get the alignment parameters to allocate the proper memory window for the
> + * peer. NULL may be given for any output parameter if the value is not needed.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_link_enable(struct ntb_dev *ntb,
> -				  enum ntb_speed max_speed,
> -				  enum ntb_width max_width)
> +static inline int ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> +					resource_size_t *addr_align,
> +					resource_size_t *size_align,
> +					resource_size_t *size_max)
>  {
> -	return ntb->ops->link_enable(ntb, max_speed, max_width);
> +	if (!ntb->ops->peer_mw_get_align)
> +		return -EINVAL;
> +
> +	return ntb->ops->peer_mw_get_align(ntb, idx, addr_align, size_align,
> +					   size_max);
>  }
> 
>  /**
> - * ntb_link_disable() - disable the link on the secondary side of the ntb
> + * ntb_peer_mw_set_trans() - set the translated base address of a peer
> + *			     memory window
>   * @ntb:	NTB device context.
> + * @idx:	Memory window number.
> + * @addr:	Local DMA memory address exposed to the peer.
> + * @size:	Size of the memory exposed to the peer.
>   *
> - * Disable the link on the secondary side of the ntb.  This can only be
> - * done from the primary side of the ntb in primary or b2b topology.  The ntb
> - * device should disable the link.  Returning from this call must indicate that
> - * a barrier has passed, though with no more writes may pass in either
> - * direction across the link, except if this call returns an error number.
> + * Set the translated base address of a memory window exposed to the peer.
> + * The local node preliminary allocates the window, then directly writes the

I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the following make sense, or have I completely misunderstood something?

ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are translated to the local memory destination.

ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory window (is this something that needs to be configured on the local ntb?) are translated to the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of ntb_mw_set_trans() will complete the translation to the peer memory destination.

> + * address and size to the peer control registers. The address and size must
> + * be aligned to the parameters specified by ntb_peer_mw_get_align() of
> + * the local node and ntb_mw_get_align() of the peer, which must return the
> + * same values. Zero size effectively disables the memory window.
> + *
> + * Drivers of synchronous hardware must support it.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_link_disable(struct ntb_dev *ntb)
> +static inline int ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> +					dma_addr_t addr, resource_size_t size)
>  {
> -	return ntb->ops->link_disable(ntb);
> +	if (!ntb->ops->peer_mw_set_trans)
> +		return -EINVAL;
> +
> +	return ntb->ops->peer_mw_set_trans(ntb, idx, addr, size);
> +}
> +
> +/**
> + * ntb_peer_mw_get_trans() - get the translated base address of a peer
> + *			     memory window
> + * @ntb:	NTB device context.
> + * @idx:	Memory window number.
> + * @addr:	Local dma memory address exposed to the peer.
> + * @size:	Size of the memory exposed to the peer.
> + *
> + * Get the translated base address of a memory window spicified for the peer
> + * hardware. If the addr and size are zero then the memory window is effectively
> + * disabled.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_peer_mw_get_trans(struct ntb_dev *ntb, int idx,
> +					dma_addr_t *addr, resource_size_t *size)
> +{
> +	if (!ntb->ops->peer_mw_get_trans)
> +		return -EINVAL;
> +
> +	return ntb->ops->peer_mw_get_trans(ntb, idx, addr, size);
>  }
> 
>  /**
> @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits)
>   * append one additional dma memory copy with the doorbell register as the
>   * destination, after the memory copy operations.
>   *
> + * This is unusual, and hardware may not be suitable to implement it.
> + *

Why is this unusual?  Do you mean async hardware may not support it?

>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_peer_db_addr(struct ntb_dev *ntb,
> @@ -901,10 +1205,15 @@ static inline int ntb_spad_is_unsafe(struct ntb_dev *ntb)
>   *
>   * Hardware and topology may support a different number of scratchpads.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: the number of scratchpads.
>   */
>  static inline int ntb_spad_count(struct ntb_dev *ntb)
>  {
> +	if (!ntb->ops->spad_count)
> +		return -EINVAL;
> +

Maybe we should return zero (i.e. there are no scratchpads).

>  	return ntb->ops->spad_count(ntb);
>  }
> 
> @@ -915,10 +1224,15 @@ static inline int ntb_spad_count(struct ntb_dev *ntb)
>   *
>   * Read the local scratchpad register, and return the value.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: The value of the local scratchpad register.
>   */
>  static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
>  {
> +	if (!ntb->ops->spad_read)
> +		return 0;
> +

Let's return ~0.  I think that's what a driver would read from the pci bus for a memory miss. 

>  	return ntb->ops->spad_read(ntb, idx);
>  }
> 
> @@ -930,10 +1244,15 @@ static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
>   *
>   * Write the value to the local scratchpad register.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32 val)
>  {
> +	if (!ntb->ops->spad_write)
> +		return -EINVAL;
> +
>  	return ntb->ops->spad_write(ntb, idx, val);
>  }
> 
> @@ -946,6 +1265,8 @@ static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32
> val)
>   * Return the address of the peer doorbell register.  This may be used, for
>   * example, by drivers that offload memory copy operations to a dma engine.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
> @@ -964,10 +1285,15 @@ static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
>   *
>   * Read the peer scratchpad register, and return the value.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: The value of the local scratchpad register.
>   */
>  static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
>  {
> +	if (!ntb->ops->peer_spad_read)
> +		return 0;

Also, ~0?

> +
>  	return ntb->ops->peer_spad_read(ntb, idx);
>  }
> 
> @@ -979,11 +1305,59 @@ static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
>   *
>   * Write the value to the peer scratchpad register.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_peer_spad_write(struct ntb_dev *ntb, int idx, u32 val)
>  {
> +	if (!ntb->ops->peer_spad_write)
> +		return -EINVAL;
> +
>  	return ntb->ops->peer_spad_write(ntb, idx, val);
>  }
> 
> +/**
> + * ntb_msg_post() - post the message to the peer
> + * @ntb:	NTB device context.
> + * @msg:	Message
> + *
> + * Post the message to a peer. It shall be delivered to the peer by the
> + * corresponding hardware method. The peer should be notified about the new
> + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> + * event. Otherwise the NTB_MSG_SENT is emitted.

Interesting.. local driver would be notified about completion (success or failure) of delivery.  Is there any order-of-completion guarantee for the completion notifications?  Is there some tolerance for faults, in case we never get a completion notification from the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link comes up again, can we still get a completion notification from the peer, and how would that be handled?

Does delivery mean the application has processed the message, or is it just delivery at the hardware layer, or just delivery at the ntb hardware driver layer?

> + *
> + * Synchronous hardware may not support it.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_msg_post(struct ntb_dev *ntb, struct ntb_msg *msg)
> +{
> +	if (!ntb->ops->msg_post)
> +		return -EINVAL;
> +
> +	return ntb->ops->msg_post(ntb, msg);
> +}
> +
> +/**
> + * ntb_msg_size() - size of the message data
> + * @ntb:	NTB device context.
> + *
> + * Different hardware may support different number of message registers. This
> + * callback shall return the number of those used for data sending and
> + * receiving including the type field.
> + *
> + * Synchronous hardware may not support it.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_msg_size(struct ntb_dev *ntb)
> +{
> +	if (!ntb->ops->msg_size)
> +		return 0;
> +
> +	return ntb->ops->msg_size(ntb);
> +}
> +
>  #endif
> --
> 2.6.6

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
@ 2016-08-05 15:31 ` Allen Hubbe
  0 siblings, 0 replies; 12+ messages in thread
From: Allen Hubbe @ 2016-08-05 15:31 UTC (permalink / raw)
  To: 'Serge Semin', jdmason
  Cc: dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb, linux-kernel

From: Serge Semin
> Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> devices, so translated base address of memory windows can be direcly written
> to peer registers. But there are some IDT PCIe-switches which implement
> complex interfaces using Lookup Tables of translation addresses. Due to
> the way the table is accessed, it can not be done synchronously from different
> RCs, that's why the asynchronous interface should be developed.
> 
> For these purpose the Memory Window related interface is correspondingly split
> as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> is following: "It is a virtual memory region, which locally reflects a physical
> memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> memory windows.
> Here is the description of the Memory Window related NTB-bus callback
> functions:
>  - ntb_mw_count() - number of local memory windows.
>  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
>                          window to map.
>  - ntb_mw_set_trans() - set translation address of local memory window (this
>                         address should be somehow retrieved from a peer).
>  - ntb_mw_get_trans() - get translation address of local memory window.
>  - ntb_mw_get_align() - get alignment of translated base address and size of
>                         local memory window. Additionally one can get the
>                         upper size limit of the memory window.
>  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
>                          local number).
>  - ntb_peer_mw_set_trans() - set translation address of peer memory window
>  - ntb_peer_mw_get_trans() - get translation address of peer memory window
>  - ntb_peer_mw_get_align() - get alignment of translated base address and size
>                              of peer memory window.Additionally one can get the
>                              upper size limit of the memory window.
> 
> As one can see current AMD and Intel NTB drivers mostly implement the
> "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> since it doesn't have convenient access to the peer Lookup Table.
> 
> In order to pass information from one RC to another NTB functions of IDT
> PCIe-switch implement Messaging subsystem. They currently support four message
> registers to transfer DWORD sized data to a specified peer. So there are two
> new callback methods are introduced:
>  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
>                     and receive messages
>  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
>                     to a peer
> Additionally there is a new event function:
>  - ntb_msg_event() - it is invoked when either a new message was retrieved
>                      (NTB_MSG_NEW), or last message was successfully sent
>                      (NTB_MSG_SENT), or the last message failed to be sent
>                      (NTB_MSG_FAIL).
> 
> The last change concerns the IDs (practically names) of NTB-devices on the
> NTB-bus. It is not good to have the devices with same names in the system
> and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> devices supporting both interfaces.

Thanks for the work that went into writing this driver, and thanks for your patience with the review.  Please read my initial comments inline.  I would like to approach this from a top-down api perspective first, and settle on that first before requesting any specific changes in the hardware driver.  My major concern about these changes is that they introduce a distinct classification for sync and async hardware, supported by different sets of methods in the api, neither is a subset of the other.

You know the IDT hardware, so if any of my requests below are infeasible, I would like your constructive opinion (even if it means significant changes to existing drivers) on how to resolve the api so that new and existing hardware drivers can be unified under the same api, if possible.

> 
> Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> 
> ---
>  drivers/ntb/Kconfig                 |   4 +-
>  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
>  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
>  drivers/ntb/ntb.c                   |  86 +++++-
>  drivers/ntb/ntb_transport.c         |  19 +-
>  drivers/ntb/test/ntb_perf.c         |  16 +-
>  drivers/ntb/test/ntb_pingpong.c     |   5 +
>  drivers/ntb/test/ntb_tool.c         |  25 +-
>  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
>  9 files changed, 701 insertions(+), 162 deletions(-)
> 
> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> index 95944e5..67d80c4 100644
> --- a/drivers/ntb/Kconfig
> +++ b/drivers/ntb/Kconfig
> @@ -14,8 +14,6 @@ if NTB
> 
>  source "drivers/ntb/hw/Kconfig"
> 
> -source "drivers/ntb/test/Kconfig"
> -
>  config NTB_TRANSPORT
>  	tristate "NTB Transport Client"
>  	help
> @@ -25,4 +23,6 @@ config NTB_TRANSPORT
> 
>  	 If unsure, say N.
> 
> +source "drivers/ntb/test/Kconfig"
> +
>  endif # NTB
> diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c
> index 6ccba0d..ab6f353 100644
> --- a/drivers/ntb/hw/amd/ntb_hw_amd.c
> +++ b/drivers/ntb/hw/amd/ntb_hw_amd.c
> @@ -55,6 +55,7 @@
>  #include <linux/pci.h>
>  #include <linux/random.h>
>  #include <linux/slab.h>
> +#include <linux/sizes.h>
>  #include <linux/ntb.h>
> 
>  #include "ntb_hw_amd.h"
> @@ -84,11 +85,8 @@ static int amd_ntb_mw_count(struct ntb_dev *ntb)
>  	return ntb_ndev(ntb)->mw_count;
>  }
> 
> -static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> -				phys_addr_t *base,
> -				resource_size_t *size,
> -				resource_size_t *align,
> -				resource_size_t *align_size)
> +static int amd_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> +				 phys_addr_t *base, resource_size_t *size)
>  {
>  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
>  	int bar;
> @@ -103,17 +101,40 @@ static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
>  	if (size)
>  		*size = pci_resource_len(ndev->ntb.pdev, bar);
> 
> -	if (align)
> -		*align = SZ_4K;
> +	return 0;
> +}
> +
> +static int amd_ntb_peer_mw_count(struct ntb_dev *ntb)
> +{
> +	return ntb_ndev(ntb)->mw_count;
> +}
> +
> +static int amd_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> +				     resource_size_t *addr_align,
> +				     resource_size_t *size_align,
> +				     resource_size_t *size_max)
> +{
> +	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> +	int bar;
> +
> +	bar = ndev_mw_to_bar(ndev, idx);
> +	if (bar < 0)
> +		return bar;
> +
> +	if (addr_align)
> +		*addr_align = SZ_4K;
> +
> +	if (size_align)
> +		*size_align = 1;
> 
> -	if (align_size)
> -		*align_size = 1;
> +	if (size_max)
> +		*size_max = pci_resource_len(ndev->ntb.pdev, bar);
> 
>  	return 0;
>  }
> 
> -static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> -				dma_addr_t addr, resource_size_t size)
> +static int amd_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> +				     dma_addr_t addr, resource_size_t size)
>  {
>  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
>  	unsigned long xlat_reg, limit_reg = 0;
> @@ -432,8 +453,10 @@ static int amd_ntb_peer_spad_write(struct ntb_dev *ntb,
> 
>  static const struct ntb_dev_ops amd_ntb_ops = {
>  	.mw_count		= amd_ntb_mw_count,
> -	.mw_get_range		= amd_ntb_mw_get_range,
> -	.mw_set_trans		= amd_ntb_mw_set_trans,
> +	.mw_get_maprsc		= amd_ntb_mw_get_maprsc,
> +	.peer_mw_count		= amd_ntb_peer_mw_count,
> +	.peer_mw_get_align	= amd_ntb_peer_mw_get_align,
> +	.peer_mw_set_trans	= amd_ntb_peer_mw_set_trans,
>  	.link_is_up		= amd_ntb_link_is_up,
>  	.link_enable		= amd_ntb_link_enable,
>  	.link_disable		= amd_ntb_link_disable,
> diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c b/drivers/ntb/hw/intel/ntb_hw_intel.c
> index 40d04ef..fdb2838 100644
> --- a/drivers/ntb/hw/intel/ntb_hw_intel.c
> +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
> @@ -804,11 +804,8 @@ static int intel_ntb_mw_count(struct ntb_dev *ntb)
>  	return ntb_ndev(ntb)->mw_count;
>  }
> 
> -static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> -				  phys_addr_t *base,
> -				  resource_size_t *size,
> -				  resource_size_t *align,
> -				  resource_size_t *align_size)
> +static int intel_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> +				   phys_addr_t *base, resource_size_t *size)
>  {
>  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
>  	int bar;
> @@ -828,17 +825,51 @@ static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
>  		*size = pci_resource_len(ndev->ntb.pdev, bar) -
>  			(idx == ndev->b2b_idx ? ndev->b2b_off : 0);
> 
> -	if (align)
> -		*align = pci_resource_len(ndev->ntb.pdev, bar);
> +	return 0;
> +}
> +
> +static int intel_ntb_peer_mw_count(struct ntb_dev *ntb)
> +{
> +	return ntb_ndev(ntb)->mw_count;
> +}
> +
> +static int intel_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> +				       resource_size_t *addr_align,
> +				       resource_size_t *size_align,
> +				       resource_size_t *size_max)
> +{
> +	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> +	resource_size_t bar_size, mw_size;
> +	int bar;
> +
> +	if (idx >= ndev->b2b_idx && !ndev->b2b_off)
> +		idx += 1;
> +
> +	bar = ndev_mw_to_bar(ndev, idx);
> +	if (bar < 0)
> +		return bar;
> +
> +	bar_size = pci_resource_len(ndev->ntb.pdev, bar);
> +
> +	if (idx == ndev->b2b_idx)
> +		mw_size = bar_size - ndev->b2b_off;
> +	else
> +		mw_size = bar_size;
> +
> +	if (addr_align)
> +		*addr_align = bar_size;
> +
> +	if (size_align)
> +		*size_align = 1;
> 
> -	if (align_size)
> -		*align_size = 1;
> +	if (size_max)
> +		*size_max = mw_size;
> 
>  	return 0;
>  }
> 
> -static int intel_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> -				  dma_addr_t addr, resource_size_t size)
> +static int intel_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> +				       dma_addr_t addr, resource_size_t size)
>  {
>  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
>  	unsigned long base_reg, xlat_reg, limit_reg;
> @@ -2220,8 +2251,10 @@ static struct intel_b2b_addr xeon_b2b_dsd_addr = {
>  /* operations for primary side of local ntb */
>  static const struct ntb_dev_ops intel_ntb_ops = {
>  	.mw_count		= intel_ntb_mw_count,
> -	.mw_get_range		= intel_ntb_mw_get_range,
> -	.mw_set_trans		= intel_ntb_mw_set_trans,
> +	.mw_get_maprsc		= intel_ntb_mw_get_maprsc,
> +	.peer_mw_count		= intel_ntb_peer_mw_count,
> +	.peer_mw_get_align	= intel_ntb_peer_mw_get_align,
> +	.peer_mw_set_trans	= intel_ntb_peer_mw_set_trans,
>  	.link_is_up		= intel_ntb_link_is_up,
>  	.link_enable		= intel_ntb_link_enable,
>  	.link_disable		= intel_ntb_link_disable,
> diff --git a/drivers/ntb/ntb.c b/drivers/ntb/ntb.c
> index 2e25307..37c3b36 100644
> --- a/drivers/ntb/ntb.c
> +++ b/drivers/ntb/ntb.c
> @@ -54,6 +54,7 @@
>  #include <linux/device.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> +#include <linux/atomic.h>
> 
>  #include <linux/ntb.h>
>  #include <linux/pci.h>
> @@ -72,8 +73,62 @@ MODULE_AUTHOR(DRIVER_AUTHOR);
>  MODULE_DESCRIPTION(DRIVER_DESCRIPTION);
> 
>  static struct bus_type ntb_bus;
> +static struct ntb_bus_data ntb_data;
>  static void ntb_dev_release(struct device *dev);
> 
> +static int ntb_gen_devid(struct ntb_dev *ntb)
> +{
> +	const char *name;
> +	unsigned long *mask;
> +	int id;
> +
> +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> +		name = "ntbAS%d";
> +		mask = ntb_data.both_msk;
> +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> +		name = "ntbS%d";
> +		mask = ntb_data.sync_msk;
> +	} else if (ntb_valid_async_dev_ops(ntb)) {
> +		name = "ntbA%d";
> +		mask = ntb_data.async_msk;
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	for (id = 0; NTB_MAX_DEVID > id; id++) {
> +		if (0 == test_and_set_bit(id, mask)) {
> +			ntb->id = id;
> +			break;
> +		}
> +	}
> +
> +	if (NTB_MAX_DEVID > id) {
> +		dev_set_name(&ntb->dev, name, ntb->id);
> +	} else {
> +		return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +static void ntb_free_devid(struct ntb_dev *ntb)
> +{
> +	unsigned long *mask;
> +
> +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> +		mask = ntb_data.both_msk;
> +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> +		mask = ntb_data.sync_msk;
> +	} else if (ntb_valid_async_dev_ops(ntb)) {
> +		mask = ntb_data.async_msk;
> +	} else {
> +		/* It's impossible */
> +		BUG();
> +	}
> +
> +	clear_bit(ntb->id, mask);
> +}
> +
>  int __ntb_register_client(struct ntb_client *client, struct module *mod,
>  			  const char *mod_name)
>  {
> @@ -99,13 +154,15 @@ EXPORT_SYMBOL(ntb_unregister_client);
> 
>  int ntb_register_device(struct ntb_dev *ntb)
>  {
> +	int ret;
> +
>  	if (!ntb)
>  		return -EINVAL;
>  	if (!ntb->pdev)
>  		return -EINVAL;
>  	if (!ntb->ops)
>  		return -EINVAL;
> -	if (!ntb_dev_ops_is_valid(ntb->ops))
> +	if (!ntb_valid_sync_dev_ops(ntb) && !ntb_valid_async_dev_ops(ntb))
>  		return -EINVAL;
> 
>  	init_completion(&ntb->released);
> @@ -114,13 +171,21 @@ int ntb_register_device(struct ntb_dev *ntb)
>  	ntb->dev.bus = &ntb_bus;
>  	ntb->dev.parent = &ntb->pdev->dev;
>  	ntb->dev.release = ntb_dev_release;
> -	dev_set_name(&ntb->dev, "%s", pci_name(ntb->pdev));
> 
>  	ntb->ctx = NULL;
>  	ntb->ctx_ops = NULL;
>  	spin_lock_init(&ntb->ctx_lock);
> 
> -	return device_register(&ntb->dev);
> +	/* No need to wait for completion if failed */
> +	ret = ntb_gen_devid(ntb);
> +	if (ret)
> +		return ret;
> +
> +	ret = device_register(&ntb->dev);
> +	if (ret)
> +		ntb_free_devid(ntb);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL(ntb_register_device);
> 
> @@ -128,6 +193,7 @@ void ntb_unregister_device(struct ntb_dev *ntb)
>  {
>  	device_unregister(&ntb->dev);
>  	wait_for_completion(&ntb->released);
> +	ntb_free_devid(ntb);
>  }
>  EXPORT_SYMBOL(ntb_unregister_device);
> 
> @@ -191,6 +257,20 @@ void ntb_db_event(struct ntb_dev *ntb, int vector)
>  }
>  EXPORT_SYMBOL(ntb_db_event);
> 
> +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> +		   struct ntb_msg *msg)
> +{
> +	unsigned long irqflags;
> +
> +	spin_lock_irqsave(&ntb->ctx_lock, irqflags);
> +	{
> +		if (ntb->ctx_ops && ntb->ctx_ops->msg_event)
> +			ntb->ctx_ops->msg_event(ntb->ctx, ev, msg);
> +	}
> +	spin_unlock_irqrestore(&ntb->ctx_lock, irqflags);
> +}
> +EXPORT_SYMBOL(ntb_msg_event);
> +
>  static int ntb_probe(struct device *dev)
>  {
>  	struct ntb_dev *ntb;
> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> index d5c5894..2626ba0 100644
> --- a/drivers/ntb/ntb_transport.c
> +++ b/drivers/ntb/ntb_transport.c
> @@ -673,7 +673,7 @@ static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
>  	if (!mw->virt_addr)
>  		return;
> 
> -	ntb_mw_clear_trans(nt->ndev, num_mw);
> +	ntb_peer_mw_set_trans(nt->ndev, num_mw, 0, 0);
>  	dma_free_coherent(&pdev->dev, mw->buff_size,
>  			  mw->virt_addr, mw->dma_addr);
>  	mw->xlat_size = 0;
> @@ -730,7 +730,8 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
>  	}
> 
>  	/* Notify HW the memory location of the receive buffer */
> -	rc = ntb_mw_set_trans(nt->ndev, num_mw, mw->dma_addr, mw->xlat_size);
> +	rc = ntb_peer_mw_set_trans(nt->ndev, num_mw, mw->dma_addr,
> +				   mw->xlat_size);
>  	if (rc) {
>  		dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
>  		ntb_free_mw(nt, num_mw);
> @@ -1060,7 +1061,11 @@ static int ntb_transport_probe(struct ntb_client *self, struct
> ntb_dev *ndev)
>  	int node;
>  	int rc, i;
> 
> -	mw_count = ntb_mw_count(ndev);
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ndev))
> +		return -EINVAL;
> +
> +	mw_count = ntb_peer_mw_count(ndev);
>  	if (ntb_spad_count(ndev) < (NUM_MWS + 1 + mw_count * 2)) {
>  		dev_err(&ndev->dev, "Not enough scratch pad registers for %s",
>  			NTB_TRANSPORT_NAME);
> @@ -1094,8 +1099,12 @@ static int ntb_transport_probe(struct ntb_client *self, struct
> ntb_dev *ndev)
>  	for (i = 0; i < mw_count; i++) {
>  		mw = &nt->mw_vec[i];
> 
> -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> -				      &mw->xlat_align, &mw->xlat_align_size);
> +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> +		if (rc)
> +			goto err1;
> +
> +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> +					   &mw->xlat_align_size, NULL);

Looks like ntb_mw_get_range() was simpler before the change.

>  		if (rc)
>  			goto err1;
> 
> diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
> index 6a50f20..f2952f7 100644
> --- a/drivers/ntb/test/ntb_perf.c
> +++ b/drivers/ntb/test/ntb_perf.c
> @@ -452,7 +452,7 @@ static void perf_free_mw(struct perf_ctx *perf)
>  	if (!mw->virt_addr)
>  		return;
> 
> -	ntb_mw_clear_trans(perf->ntb, 0);
> +	ntb_peer_mw_set_trans(perf->ntb, 0, 0, 0);
>  	dma_free_coherent(&pdev->dev, mw->buf_size,
>  			  mw->virt_addr, mw->dma_addr);
>  	mw->xlat_size = 0;
> @@ -488,7 +488,7 @@ static int perf_set_mw(struct perf_ctx *perf, resource_size_t size)
>  		mw->buf_size = 0;
>  	}
> 
> -	rc = ntb_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
> +	rc = ntb_peer_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
>  	if (rc) {
>  		dev_err(&perf->ntb->dev, "Unable to set mw0 translation\n");
>  		perf_free_mw(perf);
> @@ -559,8 +559,12 @@ static int perf_setup_mw(struct ntb_dev *ntb, struct perf_ctx *perf)
> 
>  	mw = &perf->mw;
> 
> -	rc = ntb_mw_get_range(ntb, 0, &mw->phys_addr, &mw->phys_size,
> -			      &mw->xlat_align, &mw->xlat_align_size);
> +	rc = ntb_mw_get_maprsc(ntb, 0, &mw->phys_addr, &mw->phys_size);
> +	if (rc)
> +		return rc;
> +
> +	rc = ntb_peer_mw_get_align(ntb, 0, &mw->xlat_align,
> +				   &mw->xlat_align_size, NULL);

Looks like ntb_mw_get_range() was simpler.

>  	if (rc)
>  		return rc;
> 
> @@ -758,6 +762,10 @@ static int perf_probe(struct ntb_client *client, struct ntb_dev *ntb)
>  	int node;
>  	int rc = 0;
> 
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ntb))
> +		return -EINVAL;
> +
>  	if (ntb_spad_count(ntb) < MAX_SPAD) {
>  		dev_err(&ntb->dev, "Not enough scratch pad registers for %s",
>  			DRIVER_NAME);
> diff --git a/drivers/ntb/test/ntb_pingpong.c b/drivers/ntb/test/ntb_pingpong.c
> index 7d31179..e833649 100644
> --- a/drivers/ntb/test/ntb_pingpong.c
> +++ b/drivers/ntb/test/ntb_pingpong.c
> @@ -214,6 +214,11 @@ static int pp_probe(struct ntb_client *client,
>  	struct pp_ctx *pp;
>  	int rc;
> 
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ntb)) {
> +		return -EINVAL;
> +	}
> +
>  	if (ntb_db_is_unsafe(ntb)) {
>  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
>  		if (!unsafe) {
> diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
> index 61bf2ef..5dfe12f 100644
> --- a/drivers/ntb/test/ntb_tool.c
> +++ b/drivers/ntb/test/ntb_tool.c
> @@ -675,8 +675,11 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t
> req_size)
>  	if (mw->peer)
>  		return 0;
> 
> -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &size, &align,
> -			      &align_size);
> +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &size);
> +	if (rc)
> +		return rc;
> +
> +	rc = ntb_peer_mw_get_align(tc->ntb, idx, &align, &align_size, NULL);
>  	if (rc)
>  		return rc;

Looks like ntb_mw_get_range() was simpler.

> 
> @@ -689,7 +692,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t
> req_size)
>  	if (!mw->peer)
>  		return -ENOMEM;
> 
> -	rc = ntb_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
> +	rc = ntb_peer_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
>  	if (rc)
>  		goto err_free_dma;
> 
> @@ -716,7 +719,7 @@ static void tool_free_mw(struct tool_ctx *tc, int idx)
>  	struct tool_mw *mw = &tc->mws[idx];
> 
>  	if (mw->peer) {
> -		ntb_mw_clear_trans(tc->ntb, idx);
> +		ntb_peer_mw_set_trans(tc->ntb, idx, 0, 0);
>  		dma_free_coherent(&tc->ntb->pdev->dev, mw->size,
>  				  mw->peer,
>  				  mw->peer_dma);
> @@ -751,8 +754,8 @@ static ssize_t tool_peer_mw_trans_read(struct file *filep,
>  	if (!buf)
>  		return -ENOMEM;
> 
> -	ntb_mw_get_range(mw->tc->ntb, mw->idx,
> -			 &base, &mw_size, &align, &align_size);
> +	ntb_mw_get_maprsc(mw->tc->ntb, mw->idx, &base, &mw_size);
> +	ntb_peer_mw_get_align(mw->tc->ntb, mw->idx, &align, &align_size, NULL);
> 
>  	off += scnprintf(buf + off, buf_size - off,
>  			 "Peer MW %d Information:\n", mw->idx);
> @@ -827,8 +830,7 @@ static int tool_init_mw(struct tool_ctx *tc, int idx)
>  	phys_addr_t base;
>  	int rc;
> 
> -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &mw->win_size,
> -			      NULL, NULL);
> +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &mw->win_size);
>  	if (rc)
>  		return rc;
> 
> @@ -913,6 +915,11 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
>  	int rc;
>  	int i;
> 
> +	/* Synchronous hardware is only supported */
> +	if (!ntb_valid_sync_dev_ops(ntb)) {
> +		return -EINVAL;
> +	}
> +

It would be nice if both types could be supported by the same api.
 
>  	if (ntb_db_is_unsafe(ntb))
>  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
> 
> @@ -928,7 +935,7 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
>  	tc->ntb = ntb;
>  	init_waitqueue_head(&tc->link_wq);
> 
> -	tc->mw_count = min(ntb_mw_count(tc->ntb), MAX_MWS);
> +	tc->mw_count = min(ntb_peer_mw_count(tc->ntb), MAX_MWS);
>  	for (i = 0; i < tc->mw_count; i++) {
>  		rc = tool_init_mw(tc, i);
>  		if (rc)
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index 6f47562..d1937d3 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -159,13 +159,44 @@ static inline int ntb_client_ops_is_valid(const struct
> ntb_client_ops *ops)
>  }
> 
>  /**
> + * struct ntb_msg - ntb driver message structure
> + * @type:	Message type.
> + * @payload:	Payload data to send to a peer
> + * @data:	Array of u32 data to send (size might be hw dependent)
> + */
> +#define NTB_MAX_MSGSIZE 4
> +struct ntb_msg {
> +	union {
> +		struct {
> +			u32 type;
> +			u32 payload[NTB_MAX_MSGSIZE - 1];
> +		};
> +		u32 data[NTB_MAX_MSGSIZE];
> +	};
> +};
> +
> +/**
> + * enum NTB_MSG_EVENT - message event types
> + * @NTB_MSG_NEW:	New message just arrived and passed to the handler
> + * @NTB_MSG_SENT:	Posted message has just been successfully sent
> + * @NTB_MSG_FAIL:	Posted message failed to be sent
> + */
> +enum NTB_MSG_EVENT {
> +	NTB_MSG_NEW,
> +	NTB_MSG_SENT,
> +	NTB_MSG_FAIL
> +};
> +
> +/**
>   * struct ntb_ctx_ops - ntb driver context operations
>   * @link_event:		See ntb_link_event().
>   * @db_event:		See ntb_db_event().
> + * @msg_event:		See ntb_msg_event().
>   */
>  struct ntb_ctx_ops {
>  	void (*link_event)(void *ctx);
>  	void (*db_event)(void *ctx, int db_vector);
> +	void (*msg_event)(void *ctx, enum NTB_MSG_EVENT ev, struct ntb_msg *msg);
>  };
> 
>  static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> @@ -174,18 +205,24 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> *ops)
>  	return
>  		/* ops->link_event		&& */
>  		/* ops->db_event		&& */
> +		/* ops->msg_event		&& */
>  		1;
>  }
> 
>  /**
>   * struct ntb_ctx_ops - ntb device operations
> - * @mw_count:		See ntb_mw_count().
> - * @mw_get_range:	See ntb_mw_get_range().
> - * @mw_set_trans:	See ntb_mw_set_trans().
> - * @mw_clear_trans:	See ntb_mw_clear_trans().
>   * @link_is_up:		See ntb_link_is_up().
>   * @link_enable:	See ntb_link_enable().
>   * @link_disable:	See ntb_link_disable().
> + * @mw_count:		See ntb_mw_count().
> + * @mw_get_maprsc:	See ntb_mw_get_maprsc().
> + * @mw_set_trans:	See ntb_mw_set_trans().
> + * @mw_get_trans:	See ntb_mw_get_trans().
> + * @mw_get_align:	See ntb_mw_get_align().
> + * @peer_mw_count:	See ntb_peer_mw_count().
> + * @peer_mw_set_trans:	See ntb_peer_mw_set_trans().
> + * @peer_mw_get_trans:	See ntb_peer_mw_get_trans().
> + * @peer_mw_get_align:	See ntb_peer_mw_get_align().
>   * @db_is_unsafe:	See ntb_db_is_unsafe().
>   * @db_valid_mask:	See ntb_db_valid_mask().
>   * @db_vector_count:	See ntb_db_vector_count().
> @@ -210,22 +247,38 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> *ops)
>   * @peer_spad_addr:	See ntb_peer_spad_addr().
>   * @peer_spad_read:	See ntb_peer_spad_read().
>   * @peer_spad_write:	See ntb_peer_spad_write().
> + * @msg_post:		See ntb_msg_post().
> + * @msg_size:		See ntb_msg_size().
>   */
>  struct ntb_dev_ops {
> -	int (*mw_count)(struct ntb_dev *ntb);
> -	int (*mw_get_range)(struct ntb_dev *ntb, int idx,
> -			    phys_addr_t *base, resource_size_t *size,
> -			resource_size_t *align, resource_size_t *align_size);
> -	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> -			    dma_addr_t addr, resource_size_t size);
> -	int (*mw_clear_trans)(struct ntb_dev *ntb, int idx);
> -
>  	int (*link_is_up)(struct ntb_dev *ntb,
>  			  enum ntb_speed *speed, enum ntb_width *width);
>  	int (*link_enable)(struct ntb_dev *ntb,
>  			   enum ntb_speed max_speed, enum ntb_width max_width);
>  	int (*link_disable)(struct ntb_dev *ntb);
> 
> +	int (*mw_count)(struct ntb_dev *ntb);
> +	int (*mw_get_maprsc)(struct ntb_dev *ntb, int idx,
> +			     phys_addr_t *base, resource_size_t *size);
> +	int (*mw_get_align)(struct ntb_dev *ntb, int idx,
> +			    resource_size_t *addr_align,
> +			    resource_size_t *size_align,
> +			    resource_size_t *size_max);
> +	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> +			    dma_addr_t addr, resource_size_t size);
> +	int (*mw_get_trans)(struct ntb_dev *ntb, int idx,
> +			    dma_addr_t *addr, resource_size_t *size);
> +
> +	int (*peer_mw_count)(struct ntb_dev *ntb);
> +	int (*peer_mw_get_align)(struct ntb_dev *ntb, int idx,
> +				 resource_size_t *addr_align,
> +				 resource_size_t *size_align,
> +				 resource_size_t *size_max);
> +	int (*peer_mw_set_trans)(struct ntb_dev *ntb, int idx,
> +				 dma_addr_t addr, resource_size_t size);
> +	int (*peer_mw_get_trans)(struct ntb_dev *ntb, int idx,
> +				 dma_addr_t *addr, resource_size_t *size);
> +
>  	int (*db_is_unsafe)(struct ntb_dev *ntb);
>  	u64 (*db_valid_mask)(struct ntb_dev *ntb);
>  	int (*db_vector_count)(struct ntb_dev *ntb);
> @@ -259,47 +312,10 @@ struct ntb_dev_ops {
>  			      phys_addr_t *spad_addr);
>  	u32 (*peer_spad_read)(struct ntb_dev *ntb, int idx);
>  	int (*peer_spad_write)(struct ntb_dev *ntb, int idx, u32 val);
> -};
> -
> -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> -{
> -	/* commented callbacks are not required: */
> -	return
> -		ops->mw_count				&&
> -		ops->mw_get_range			&&
> -		ops->mw_set_trans			&&
> -		/* ops->mw_clear_trans			&& */
> -		ops->link_is_up				&&
> -		ops->link_enable			&&
> -		ops->link_disable			&&
> -		/* ops->db_is_unsafe			&& */
> -		ops->db_valid_mask			&&
> 
> -		/* both set, or both unset */
> -		(!ops->db_vector_count == !ops->db_vector_mask) &&
> -
> -		ops->db_read				&&
> -		/* ops->db_set				&& */
> -		ops->db_clear				&&
> -		/* ops->db_read_mask			&& */
> -		ops->db_set_mask			&&
> -		ops->db_clear_mask			&&
> -		/* ops->peer_db_addr			&& */
> -		/* ops->peer_db_read			&& */
> -		ops->peer_db_set			&&
> -		/* ops->peer_db_clear			&& */
> -		/* ops->peer_db_read_mask		&& */
> -		/* ops->peer_db_set_mask		&& */
> -		/* ops->peer_db_clear_mask		&& */
> -		/* ops->spad_is_unsafe			&& */
> -		ops->spad_count				&&
> -		ops->spad_read				&&
> -		ops->spad_write				&&
> -		/* ops->peer_spad_addr			&& */
> -		/* ops->peer_spad_read			&& */
> -		ops->peer_spad_write			&&
> -		1;
> -}
> +	int (*msg_post)(struct ntb_dev *ntb, struct ntb_msg *msg);
> +	int (*msg_size)(struct ntb_dev *ntb);
> +};
> 
>  /**
>   * struct ntb_client - client interested in ntb devices
> @@ -310,10 +326,22 @@ struct ntb_client {
>  	struct device_driver		drv;
>  	const struct ntb_client_ops	ops;
>  };
> -
>  #define drv_ntb_client(__drv) container_of((__drv), struct ntb_client, drv)
> 
>  /**
> + * struct ntb_bus_data - NTB bus data
> + * @sync_msk:	Synchroous devices mask
> + * @async_msk:	Asynchronous devices mask
> + * @both_msk:	Both sync and async devices mask
> + */
> +#define NTB_MAX_DEVID (8*BITS_PER_LONG)
> +struct ntb_bus_data {
> +	unsigned long sync_msk[8];
> +	unsigned long async_msk[8];
> +	unsigned long both_msk[8];
> +};
> +
> +/**
>   * struct ntb_device - ntb device
>   * @dev:		Linux device object.
>   * @pdev:		Pci device entry of the ntb.
> @@ -332,15 +360,151 @@ struct ntb_dev {
> 
>  	/* private: */
> 
> +	/* device id */
> +	int id;
>  	/* synchronize setting, clearing, and calling ctx_ops */
>  	spinlock_t			ctx_lock;
>  	/* block unregister until device is fully released */
>  	struct completion		released;
>  };
> -
>  #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
> 
>  /**
> + * ntb_valid_sync_dev_ops() - valid operations for synchronous hardware setup
> + * @ntb:	NTB device
> + *
> + * There might be two types of NTB hardware differed by the way of the settings
> + * configuration. The synchronous chips allows to set the memory windows by
> + * directly writing to the peer registers. Additionally there can be shared
> + * Scratchpad registers for synchronous information exchange. Client drivers
> + * should call this function to make sure the hardware supports the proper
> + * functionality.
> + */
> +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> +{
> +	const struct ntb_dev_ops *ops = ntb->ops;
> +
> +	/* Commented callbacks are not required, but might be developed */
> +	return	/* NTB link status ops */
> +		ops->link_is_up					&&
> +		ops->link_enable				&&
> +		ops->link_disable				&&
> +
> +		/* Synchronous memory windows ops */
> +		ops->mw_count					&&
> +		ops->mw_get_maprsc				&&
> +		/* ops->mw_get_align				&& */
> +		/* ops->mw_set_trans				&& */
> +		/* ops->mw_get_trans				&& */
> +		ops->peer_mw_count				&&
> +		ops->peer_mw_get_align				&&
> +		ops->peer_mw_set_trans				&&
> +		/* ops->peer_mw_get_trans			&& */
> +
> +		/* Doorbell ops */
> +		/* ops->db_is_unsafe				&& */
> +		ops->db_valid_mask				&&
> +		/* both set, or both unset */
> +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> +		ops->db_read					&&
> +		/* ops->db_set					&& */
> +		ops->db_clear					&&
> +		/* ops->db_read_mask				&& */
> +		ops->db_set_mask				&&
> +		ops->db_clear_mask				&&
> +		/* ops->peer_db_addr				&& */
> +		/* ops->peer_db_read				&& */
> +		ops->peer_db_set				&&
> +		/* ops->peer_db_clear				&& */
> +		/* ops->peer_db_read_mask			&& */
> +		/* ops->peer_db_set_mask			&& */
> +		/* ops->peer_db_clear_mask			&& */
> +
> +		/* Scratchpad ops */
> +		/* ops->spad_is_unsafe				&& */
> +		ops->spad_count					&&
> +		ops->spad_read					&&
> +		ops->spad_write					&&
> +		/* ops->peer_spad_addr				&& */
> +		/* ops->peer_spad_read				&& */
> +		ops->peer_spad_write				&&
> +
> +		/* Messages IO ops */
> +		/* ops->msg_post				&& */
> +		/* ops->msg_size				&& */
> +		1;
> +}
> +
> +/**
> + * ntb_valid_async_dev_ops() - valid operations for asynchronous hardware setup
> + * @ntb:	NTB device
> + *
> + * There might be two types of NTB hardware differed by the way of the settings
> + * configuration. The asynchronous chips does not allow to set the memory
> + * windows by directly writing to the peer registers. Instead it implements
> + * the additional method to communinicate between NTB nodes like messages.
> + * Scratchpad registers aren't likely supported by such hardware. Client
> + * drivers should call this function to make sure the hardware supports
> + * the proper functionality.
> + */
> +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> +{
> +	const struct ntb_dev_ops *ops = ntb->ops;
> +
> +	/* Commented callbacks are not required, but might be developed */
> +	return	/* NTB link status ops */
> +		ops->link_is_up					&&
> +		ops->link_enable				&&
> +		ops->link_disable				&&
> +
> +		/* Asynchronous memory windows ops */
> +		ops->mw_count					&&
> +		ops->mw_get_maprsc				&&
> +		ops->mw_get_align				&&
> +		ops->mw_set_trans				&&
> +		/* ops->mw_get_trans				&& */
> +		ops->peer_mw_count				&&
> +		ops->peer_mw_get_align				&&
> +		/* ops->peer_mw_set_trans			&& */
> +		/* ops->peer_mw_get_trans			&& */
> +
> +		/* Doorbell ops */
> +		/* ops->db_is_unsafe				&& */
> +		ops->db_valid_mask				&&
> +		/* both set, or both unset */
> +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> +		ops->db_read					&&
> +		/* ops->db_set					&& */
> +		ops->db_clear					&&
> +		/* ops->db_read_mask				&& */
> +		ops->db_set_mask				&&
> +		ops->db_clear_mask				&&
> +		/* ops->peer_db_addr				&& */
> +		/* ops->peer_db_read				&& */
> +		ops->peer_db_set				&&
> +		/* ops->peer_db_clear				&& */
> +		/* ops->peer_db_read_mask			&& */
> +		/* ops->peer_db_set_mask			&& */
> +		/* ops->peer_db_clear_mask			&& */
> +
> +		/* Scratchpad ops */
> +		/* ops->spad_is_unsafe				&& */
> +		/* ops->spad_count				&& */
> +		/* ops->spad_read				&& */
> +		/* ops->spad_write				&& */
> +		/* ops->peer_spad_addr				&& */
> +		/* ops->peer_spad_read				&& */
> +		/* ops->peer_spad_write				&& */
> +
> +		/* Messages IO ops */
> +		ops->msg_post					&&
> +		ops->msg_size					&&
> +		1;
> +}

I understand why IDT requires a different api for dealing with addressing multiple peers.  I would be interested in a solution that would allow, for example, the Intel driver fit under the api for dealing with multiple peers, even though it only supports one peer.  I would rather see that, than two separate apis under ntb.

Thoughts?

Can the sync api be described by some subset of the async api?  Are there less overloaded terms we can use instead of sync/async?

> +
> +
> +
> +/**
>   * ntb_register_client() - register a client for interest in ntb devices
>   * @client:	Client context.
>   *
> @@ -441,10 +605,84 @@ void ntb_link_event(struct ntb_dev *ntb);
>  void ntb_db_event(struct ntb_dev *ntb, int vector);
> 
>  /**
> - * ntb_mw_count() - get the number of memory windows
> + * ntb_msg_event() - notify driver context of event in messaging subsystem
>   * @ntb:	NTB device context.
> + * @ev:		Event type caused the handler invocation
> + * @msg:	Message related to the event
> + *
> + * Notify the driver context that there is some event happaned in the event
> + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> + * last argument is used to pass the event related message. It discarded right
> + * after the handler returns.
> + */
> +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> +		   struct ntb_msg *msg);

I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of the message handling to be done more appropriately at a higher layer of the application.  I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I think would be more appropriate for a ntb transport (or higher layer) driver.

> +
> +/**
> + * ntb_link_is_up() - get the current ntb link state
> + * @ntb:	NTB device context.
> + * @speed:	OUT - The link speed expressed as PCIe generation number.
> + * @width:	OUT - The link width expressed as the number of PCIe lanes.
> + *
> + * Get the current state of the ntb link.  It is recommended to query the link
> + * state once after every link event.  It is safe to query the link state in
> + * the context of the link event callback.
> + *
> + * Return: One if the link is up, zero if the link is down, otherwise a
> + *		negative value indicating the error number.
> + */
> +static inline int ntb_link_is_up(struct ntb_dev *ntb,
> +				 enum ntb_speed *speed, enum ntb_width *width)
> +{
> +	return ntb->ops->link_is_up(ntb, speed, width);
> +}
> +

It looks like there was some rearranging of code, so big hunks appear to be added or removed.  Can you split this into two (or more) patches so that rearranging the code is distinct from more interesting changes?

> +/**
> + * ntb_link_enable() - enable the link on the secondary side of the ntb
> + * @ntb:	NTB device context.
> + * @max_speed:	The maximum link speed expressed as PCIe generation number.
> + * @max_width:	The maximum link width expressed as the number of PCIe lanes.
>   *
> - * Hardware and topology may support a different number of memory windows.
> + * Enable the link on the secondary side of the ntb.  This can only be done
> + * from only one (primary or secondary) side of the ntb in primary or b2b
> + * topology.  The ntb device should train the link to its maximum speed and
> + * width, or the requested speed and width, whichever is smaller, if supported.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_link_enable(struct ntb_dev *ntb,
> +				  enum ntb_speed max_speed,
> +				  enum ntb_width max_width)
> +{
> +	return ntb->ops->link_enable(ntb, max_speed, max_width);
> +}
> +
> +/**
> + * ntb_link_disable() - disable the link on the secondary side of the ntb
> + * @ntb:	NTB device context.
> + *
> + * Disable the link on the secondary side of the ntb.  This can only be
> + * done from only one (primary or secondary) side of the ntb in primary or b2b
> + * topology.  The ntb device should disable the link.  Returning from this call
> + * must indicate that a barrier has passed, though with no more writes may pass
> + * in either direction across the link, except if this call returns an error
> + * number.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_link_disable(struct ntb_dev *ntb)
> +{
> +	return ntb->ops->link_disable(ntb);
> +}
> +
> +/**
> + * ntb_mw_count() - get the number of local memory windows
> + * @ntb:	NTB device context.
> + *
> + * Hardware and topology may support a different number of memory windows at
> + * local and remote devices
>   *
>   * Return: the number of memory windows.
>   */
> @@ -454,122 +692,186 @@ static inline int ntb_mw_count(struct ntb_dev *ntb)
>  }
> 
>  /**
> - * ntb_mw_get_range() - get the range of a memory window
> + * ntb_mw_get_maprsc() - get the range of a memory window to map

What was insufficient about ntb_mw_get_range() that it needed to be split into ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch, it seems ntb_mw_get_range() would have been more simple.

I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].  So, there is no example of how usage of new api would be used differently or more efficiently than ntb_mw_get_range() for async devices.

>   * @ntb:	NTB device context.
>   * @idx:	Memory window number.
>   * @base:	OUT - the base address for mapping the memory window
>   * @size:	OUT - the size for mapping the memory window
> - * @align:	OUT - the base alignment for translating the memory window
> - * @align_size:	OUT - the size alignment for translating the memory window
>   *
> - * Get the range of a memory window.  NULL may be given for any output
> - * parameter if the value is not needed.  The base and size may be used for
> - * mapping the memory window, to access the peer memory.  The alignment and
> - * size may be used for translating the memory window, for the peer to access
> - * memory on the local system.
> + * Get the map range of a memory window. The base and size may be used for
> + * mapping the memory window to access the peer memory.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> -				   phys_addr_t *base, resource_size_t *size,
> -		resource_size_t *align, resource_size_t *align_size)
> +static inline int ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> +				    phys_addr_t *base, resource_size_t *size)
>  {
> -	return ntb->ops->mw_get_range(ntb, idx, base, size,
> -			align, align_size);
> +	return ntb->ops->mw_get_maprsc(ntb, idx, base, size);
> +}
> +
> +/**
> + * ntb_mw_get_align() - get memory window alignment of the local node
> + * @ntb:	NTB device context.
> + * @idx:	Memory window number.
> + * @addr_align:	OUT - the translated base address alignment of the memory window
> + * @size_align:	OUT - the translated memory size alignment of the memory window
> + * @size_max:	OUT - the translated memory maximum size
> + *
> + * Get the alignment parameters to allocate the proper memory window. NULL may
> + * be given for any output parameter if the value is not needed.
> + *
> + * Drivers of synchronous hardware don't have to support it.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_mw_get_align(struct ntb_dev *ntb, int idx,
> +				   resource_size_t *addr_align,
> +				   resource_size_t *size_align,
> +				   resource_size_t *size_max)
> +{
> +	if (!ntb->ops->mw_get_align)
> +		return -EINVAL;
> +
> +	return ntb->ops->mw_get_align(ntb, idx, addr_align, size_align, size_max);
>  }
> 
>  /**
> - * ntb_mw_set_trans() - set the translation of a memory window
> + * ntb_mw_set_trans() - set the translated base address of a peer memory window
>   * @ntb:	NTB device context.
>   * @idx:	Memory window number.
> - * @addr:	The dma address local memory to expose to the peer.
> - * @size:	The size of the local memory to expose to the peer.
> + * @addr:	DMA memory address exposed by the peer.
> + * @size:	Size of the memory exposed by the peer.
> + *
> + * Set the translated base address of a memory window. The peer preliminary
> + * allocates a memory, then someway passes the address to the remote node, that
> + * finally sets up the memory window at the address, up to the size. The address
> + * and size must be aligned to the parameters specified by ntb_mw_get_align() of
> + * the local node and ntb_peer_mw_get_align() of the peer, which must return the
> + * same values. Zero size effectively disables the memory window.
>   *
> - * Set the translation of a memory window.  The peer may access local memory
> - * through the window starting at the address, up to the size.  The address
> - * must be aligned to the alignment specified by ntb_mw_get_range().  The size
> - * must be aligned to the size alignment specified by ntb_mw_get_range().
> + * Drivers of synchronous hardware don't have to support it.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
>  				   dma_addr_t addr, resource_size_t size)
>  {
> +	if (!ntb->ops->mw_set_trans)
> +		return -EINVAL;
> +
>  	return ntb->ops->mw_set_trans(ntb, idx, addr, size);
>  }
> 
>  /**
> - * ntb_mw_clear_trans() - clear the translation of a memory window
> + * ntb_mw_get_trans() - get the translated base address of a memory window
>   * @ntb:	NTB device context.
>   * @idx:	Memory window number.
> + * @addr:	The dma memory address exposed by the peer.
> + * @size:	The size of the memory exposed by the peer.
>   *
> - * Clear the translation of a memory window.  The peer may no longer access
> - * local memory through the window.
> + * Get the translated base address of a memory window spicified for the local
> + * hardware and allocated by the peer. If the addr and size are zero, the
> + * memory window is effectively disabled.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_mw_clear_trans(struct ntb_dev *ntb, int idx)
> +static inline int ntb_mw_get_trans(struct ntb_dev *ntb, int idx,
> +				   dma_addr_t *addr, resource_size_t *size)
>  {
> -	if (!ntb->ops->mw_clear_trans)
> -		return ntb->ops->mw_set_trans(ntb, idx, 0, 0);
> +	if (!ntb->ops->mw_get_trans)
> +		return -EINVAL;
> 
> -	return ntb->ops->mw_clear_trans(ntb, idx);
> +	return ntb->ops->mw_get_trans(ntb, idx, addr, size);
>  }
> 
>  /**
> - * ntb_link_is_up() - get the current ntb link state
> + * ntb_peer_mw_count() - get the number of peer memory windows
>   * @ntb:	NTB device context.
> - * @speed:	OUT - The link speed expressed as PCIe generation number.
> - * @width:	OUT - The link width expressed as the number of PCIe lanes.
>   *
> - * Get the current state of the ntb link.  It is recommended to query the link
> - * state once after every link event.  It is safe to query the link state in
> - * the context of the link event callback.
> + * Hardware and topology may support a different number of memory windows at
> + * local and remote nodes.
>   *
> - * Return: One if the link is up, zero if the link is down, otherwise a
> - *		negative value indicating the error number.
> + * Return: the number of memory windows.
>   */
> -static inline int ntb_link_is_up(struct ntb_dev *ntb,
> -				 enum ntb_speed *speed, enum ntb_width *width)
> +static inline int ntb_peer_mw_count(struct ntb_dev *ntb)
>  {
> -	return ntb->ops->link_is_up(ntb, speed, width);
> +	return ntb->ops->peer_mw_count(ntb);
>  }
> 
>  /**
> - * ntb_link_enable() - enable the link on the secondary side of the ntb
> + * ntb_peer_mw_get_align() - get memory window alignment of the peer
>   * @ntb:	NTB device context.
> - * @max_speed:	The maximum link speed expressed as PCIe generation number.
> - * @max_width:	The maximum link width expressed as the number of PCIe lanes.
> + * @idx:	Memory window number.
> + * @addr_align:	OUT - the translated base address alignment of the memory window
> + * @size_align:	OUT - the translated memory size alignment of the memory window
> + * @size_max:	OUT - the translated memory maximum size
>   *
> - * Enable the link on the secondary side of the ntb.  This can only be done
> - * from the primary side of the ntb in primary or b2b topology.  The ntb device
> - * should train the link to its maximum speed and width, or the requested speed
> - * and width, whichever is smaller, if supported.
> + * Get the alignment parameters to allocate the proper memory window for the
> + * peer. NULL may be given for any output parameter if the value is not needed.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_link_enable(struct ntb_dev *ntb,
> -				  enum ntb_speed max_speed,
> -				  enum ntb_width max_width)
> +static inline int ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> +					resource_size_t *addr_align,
> +					resource_size_t *size_align,
> +					resource_size_t *size_max)
>  {
> -	return ntb->ops->link_enable(ntb, max_speed, max_width);
> +	if (!ntb->ops->peer_mw_get_align)
> +		return -EINVAL;
> +
> +	return ntb->ops->peer_mw_get_align(ntb, idx, addr_align, size_align,
> +					   size_max);
>  }
> 
>  /**
> - * ntb_link_disable() - disable the link on the secondary side of the ntb
> + * ntb_peer_mw_set_trans() - set the translated base address of a peer
> + *			     memory window
>   * @ntb:	NTB device context.
> + * @idx:	Memory window number.
> + * @addr:	Local DMA memory address exposed to the peer.
> + * @size:	Size of the memory exposed to the peer.
>   *
> - * Disable the link on the secondary side of the ntb.  This can only be
> - * done from the primary side of the ntb in primary or b2b topology.  The ntb
> - * device should disable the link.  Returning from this call must indicate that
> - * a barrier has passed, though with no more writes may pass in either
> - * direction across the link, except if this call returns an error number.
> + * Set the translated base address of a memory window exposed to the peer.
> + * The local node preliminary allocates the window, then directly writes the

I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the following make sense, or have I completely misunderstood something?

ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are translated to the local memory destination.

ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory window (is this something that needs to be configured on the local ntb?) are translated to the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of ntb_mw_set_trans() will complete the translation to the peer memory destination.

> + * address and size to the peer control registers. The address and size must
> + * be aligned to the parameters specified by ntb_peer_mw_get_align() of
> + * the local node and ntb_mw_get_align() of the peer, which must return the
> + * same values. Zero size effectively disables the memory window.
> + *
> + * Drivers of synchronous hardware must support it.
>   *
>   * Return: Zero on success, otherwise an error number.
>   */
> -static inline int ntb_link_disable(struct ntb_dev *ntb)
> +static inline int ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> +					dma_addr_t addr, resource_size_t size)
>  {
> -	return ntb->ops->link_disable(ntb);
> +	if (!ntb->ops->peer_mw_set_trans)
> +		return -EINVAL;
> +
> +	return ntb->ops->peer_mw_set_trans(ntb, idx, addr, size);
> +}
> +
> +/**
> + * ntb_peer_mw_get_trans() - get the translated base address of a peer
> + *			     memory window
> + * @ntb:	NTB device context.
> + * @idx:	Memory window number.
> + * @addr:	Local dma memory address exposed to the peer.
> + * @size:	Size of the memory exposed to the peer.
> + *
> + * Get the translated base address of a memory window spicified for the peer
> + * hardware. If the addr and size are zero then the memory window is effectively
> + * disabled.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_peer_mw_get_trans(struct ntb_dev *ntb, int idx,
> +					dma_addr_t *addr, resource_size_t *size)
> +{
> +	if (!ntb->ops->peer_mw_get_trans)
> +		return -EINVAL;
> +
> +	return ntb->ops->peer_mw_get_trans(ntb, idx, addr, size);
>  }
> 
>  /**
> @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits)
>   * append one additional dma memory copy with the doorbell register as the
>   * destination, after the memory copy operations.
>   *
> + * This is unusual, and hardware may not be suitable to implement it.
> + *

Why is this unusual?  Do you mean async hardware may not support it?

>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_peer_db_addr(struct ntb_dev *ntb,
> @@ -901,10 +1205,15 @@ static inline int ntb_spad_is_unsafe(struct ntb_dev *ntb)
>   *
>   * Hardware and topology may support a different number of scratchpads.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: the number of scratchpads.
>   */
>  static inline int ntb_spad_count(struct ntb_dev *ntb)
>  {
> +	if (!ntb->ops->spad_count)
> +		return -EINVAL;
> +

Maybe we should return zero (i.e. there are no scratchpads).

>  	return ntb->ops->spad_count(ntb);
>  }
> 
> @@ -915,10 +1224,15 @@ static inline int ntb_spad_count(struct ntb_dev *ntb)
>   *
>   * Read the local scratchpad register, and return the value.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: The value of the local scratchpad register.
>   */
>  static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
>  {
> +	if (!ntb->ops->spad_read)
> +		return 0;
> +

Let's return ~0.  I think that's what a driver would read from the pci bus for a memory miss. 

>  	return ntb->ops->spad_read(ntb, idx);
>  }
> 
> @@ -930,10 +1244,15 @@ static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
>   *
>   * Write the value to the local scratchpad register.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32 val)
>  {
> +	if (!ntb->ops->spad_write)
> +		return -EINVAL;
> +
>  	return ntb->ops->spad_write(ntb, idx, val);
>  }
> 
> @@ -946,6 +1265,8 @@ static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32
> val)
>   * Return the address of the peer doorbell register.  This may be used, for
>   * example, by drivers that offload memory copy operations to a dma engine.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
> @@ -964,10 +1285,15 @@ static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
>   *
>   * Read the peer scratchpad register, and return the value.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: The value of the local scratchpad register.
>   */
>  static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
>  {
> +	if (!ntb->ops->peer_spad_read)
> +		return 0;

Also, ~0?

> +
>  	return ntb->ops->peer_spad_read(ntb, idx);
>  }
> 
> @@ -979,11 +1305,59 @@ static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
>   *
>   * Write the value to the peer scratchpad register.
>   *
> + * Asynchronous hardware may not support it.
> + *
>   * Return: Zero on success, otherwise an error number.
>   */
>  static inline int ntb_peer_spad_write(struct ntb_dev *ntb, int idx, u32 val)
>  {
> +	if (!ntb->ops->peer_spad_write)
> +		return -EINVAL;
> +
>  	return ntb->ops->peer_spad_write(ntb, idx, val);
>  }
> 
> +/**
> + * ntb_msg_post() - post the message to the peer
> + * @ntb:	NTB device context.
> + * @msg:	Message
> + *
> + * Post the message to a peer. It shall be delivered to the peer by the
> + * corresponding hardware method. The peer should be notified about the new
> + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> + * event. Otherwise the NTB_MSG_SENT is emitted.

Interesting.. local driver would be notified about completion (success or failure) of delivery.  Is there any order-of-completion guarantee for the completion notifications?  Is there some tolerance for faults, in case we never get a completion notification from the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link comes up again, can we still get a completion notification from the peer, and how would that be handled?

Does delivery mean the application has processed the message, or is it just delivery at the hardware layer, or just delivery at the ntb hardware driver layer?

> + *
> + * Synchronous hardware may not support it.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_msg_post(struct ntb_dev *ntb, struct ntb_msg *msg)
> +{
> +	if (!ntb->ops->msg_post)
> +		return -EINVAL;
> +
> +	return ntb->ops->msg_post(ntb, msg);
> +}
> +
> +/**
> + * ntb_msg_size() - size of the message data
> + * @ntb:	NTB device context.
> + *
> + * Different hardware may support different number of message registers. This
> + * callback shall return the number of those used for data sending and
> + * receiving including the type field.
> + *
> + * Synchronous hardware may not support it.
> + *
> + * Return: Zero on success, otherwise an error number.
> + */
> +static inline int ntb_msg_size(struct ntb_dev *ntb)
> +{
> +	if (!ntb->ops->msg_size)
> +		return 0;
> +
> +	return ntb->ops->msg_size(ntb);
> +}
> +
>  #endif
> --
> 2.6.6


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
  2016-08-05 15:31 ` Allen Hubbe
  (?)
@ 2016-08-07 17:50 ` Serge Semin
  -1 siblings, 0 replies; 12+ messages in thread
From: Serge Semin @ 2016-08-07 17:50 UTC (permalink / raw)
  To: Allen Hubbe
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

Hello Allen.

Thanks for your careful review. Going through this mailing thread I hope we'll come up with solutions, which improve the driver code as well as extend the Linux kernel support of new devices like IDT PCIe-swtiches.

Before getting to the inline commentaries I need to give some introduction to the IDT NTB-related hardware so we could speak on the same language. Additionally I'll give a brief explanation how the setup of memory windows works in IDT PCIe-switches.

First of all, before getting into the IDT NTB driver development I had made a research of the currently developed NTB kernel API and AMD/Intel hardware drivers. Due to lack of the hardware manuals It might be not in deep details, but I understand how the AMD/Intel NTB-hardware drivers work. At least I understand the concept of memory windowing, which led to the current NTB bus kernel API.

So lets get to IDT PCIe-switches. There is a whole series of NTB-related switches IDT produces. All of them I split into two distinct groups:
1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2, 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
Just to note all of these switches are a part of IDT PRECISE(TM) family of PCI Express® switching solutions. Why do I split them up? Because of the next reasons:
1) Number of upstream ports, which have access to NTB functions (obviously, yeah? =)). So the switches of the first group can connect just two domains over NTB. Unlike the second group of switches, which expose a way to setup an interaction between several PCIe-switch ports, which have NT-function activated.
2) The groups are significantly distinct by the way of NT-functions configuration.

Before getting further, I should note, that the uploaded driver supports the second group of devices only. But still I'll give a comparative explanation, since the first group of switches is very similar to the AMD/Intel NTBs.

Lets dive into the configurations a bit deeper. Particularly NT-functions of the first group of switches can be configured the same way as AMD/Intel NTB-functions are. There is an PCIe end-point configuration space, which fully reflects the cross-coupled local and peer PCIe/NTB settings. So local Root complex can set any of the peer registers by direct writing to mapped memory. Here is the image, which perfectly explains the configuration registers mapping:
https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
Since the first group switches connect only two root complexes, the race condition of read/write operations to cross-coupled registers can be easily resolved just by roles distribution. So local root complex sets the translated base address directly to a peer configuration space registers, which correspond to BAR0-BAR3 locally mapped memory windows. Of course 2-4 memory windows is enough to connect just two domains. That's why you made the NTB bus kernel API the way it is.

The things get different when one wants to have an access from one domain to multiple coupling up to eight root complexes in the second group of switches. First of all the hardware doesn't support the configuration space cross-coupling anymore. Instead there are two Global Address Space Access registers provided to have an access to a peers configuration space. In fact it is not a big problem, since there are no much differences in accessing registers over a memory mapped space or a pair of fixed Address/Data registers. The problem arises when one wants to share a memory windows between eight domains. Five BARs are not enough for it even if they'd be configured to be of x32 address type. Instead IDT introduces Lookup table address translation. So BAR2/BAR4 can be configured to translate addresses using 12 or 24 entries lookup tables. Each entry can be initialized with translated base address of a peer and IDT switch port, which peer is connected to. So when local root complex locally maps BAR2/BAR4, one can have an access to a memory of a peer just by reading/writing with a shift corresponding to the lookup table entry. That's how more than five peers can be accessed. The root problem is the way the lookup table is accessed. Alas It is accessed only by a pair of "Entry index/Data" registers. So a root complex must write an entry index to one registers, then read/write data from another. As you might realise, that weak point leads to a race condition of multiple root complexes accessing the lookup table of one shared peer. Alas I could not come up with a simple and strong solution of the race.

That's why I've introduced the asynchronous hardware in the NTB bus kernel API. Since local root complex can't directly write a translated base address to a peer, it must wait until a peer asks him to allocate a memory and send the address back using some of a hardware mechanism. It can be anything: Scratchpad registers, Message registers or even "crazy" doorbells bingbanging. For instance, the IDT switches of the first group support:
1) Shared Memory windows. In particular local root complex can set a translated base address to BARs of local and peer NT-function using the cross-coupled PCIe/NTB configuration space, the same way as it can be done for AMD/Intel NTBs.
2) One Doorbell register.
3) Two Scratchpads.
4) Four message regietsrs.
As you can see the switches of the first group can be considered as both synchronous and asynchronous. All the NTB bus kernel API can be implemented for it including the changes introduced by this patch (I would do it if I had a corresponding hardware). AMD and Intel NTBs can be considered both synchronous and asynchronous as well, although they don't support messaging so Scratchpads can be used to send a data to a peer. Finally the switches of the second group lack of ability to initialize BARs translated base address of peers due to the race condition I described before.

To sum up I've spent a lot of time designing the IDT NTB driver. I've done my best to make the IDT driver as much compatible with current design as possible, nevertheless the NTB bus kernel API had to be slightly changed. You can find answers to the commentaries down below.

On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> From: Serge Semin
> > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > devices, so translated base address of memory windows can be direcly written
> > to peer registers. But there are some IDT PCIe-switches which implement
> > complex interfaces using Lookup Tables of translation addresses. Due to
> > the way the table is accessed, it can not be done synchronously from different
> > RCs, that's why the asynchronous interface should be developed.
> > 
> > For these purpose the Memory Window related interface is correspondingly split
> > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > is following: "It is a virtual memory region, which locally reflects a physical
> > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > memory windows.
> > Here is the description of the Memory Window related NTB-bus callback
> > functions:
> >  - ntb_mw_count() - number of local memory windows.
> >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> >                          window to map.
> >  - ntb_mw_set_trans() - set translation address of local memory window (this
> >                         address should be somehow retrieved from a peer).
> >  - ntb_mw_get_trans() - get translation address of local memory window.
> >  - ntb_mw_get_align() - get alignment of translated base address and size of
> >                         local memory window. Additionally one can get the
> >                         upper size limit of the memory window.
> >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> >                          local number).
> >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> >                              of peer memory window.Additionally one can get the
> >                              upper size limit of the memory window.
> > 
> > As one can see current AMD and Intel NTB drivers mostly implement the
> > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> > since it doesn't have convenient access to the peer Lookup Table.
> > 
> > In order to pass information from one RC to another NTB functions of IDT
> > PCIe-switch implement Messaging subsystem. They currently support four message
> > registers to transfer DWORD sized data to a specified peer. So there are two
> > new callback methods are introduced:
> >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> >                     and receive messages
> >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> >                     to a peer
> > Additionally there is a new event function:
> >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> >                      (NTB_MSG_NEW), or last message was successfully sent
> >                      (NTB_MSG_SENT), or the last message failed to be sent
> >                      (NTB_MSG_FAIL).
> > 
> > The last change concerns the IDs (practically names) of NTB-devices on the
> > NTB-bus. It is not good to have the devices with same names in the system
> > and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> > algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > devices supporting both interfaces.
> 
> Thanks for the work that went into writing this driver, and thanks for your patience with the review.  Please read my initial comments inline.  I would like to approach this from a top-down api perspective first, and settle on that first before requesting any specific changes in the hardware driver.  My major concern about these changes is that they introduce a distinct classification for sync and async hardware, supported by different sets of methods in the api, neither is a subset of the other.
> 
> You know the IDT hardware, so if any of my requests below are infeasible, I would like your constructive opinion (even if it means significant changes to existing drivers) on how to resolve the api so that new and existing hardware drivers can be unified under the same api, if possible.

I understand your concern. I have been thinking of this a lot. In my opinion the proposed in this patch alterations are the best of all variants I've been thinking about. Regarding the lack of APIs subset. In fact I would not agree with that. As I described in the introduction AMD and Intel drivers can be considered as both synchronous and asynchronous, since a translated base address can be directly set in a local and peer configuration space. Although AMD and Intel devices don't support messaging, they have Scratchpads, which can be used to exchange an information between root complexes. The thing we need to do is to implement ntb_mw_set_trans() and ntb_mw_get_align() for them. Which isn't much different from the "mw_peer"-prefixed ones. The first method just sets a translated base address to the corresponding local register. The second one does exactly the same as "mw_peer"-prefixed ones. I would do it, but I haven't got a hardware to test, that's why I left things the way it was with just slight changes of names.

> 
> > 
> > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > 
> > ---
> >  drivers/ntb/Kconfig                 |   4 +-
> >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> >  drivers/ntb/ntb.c                   |  86 +++++-
> >  drivers/ntb/ntb_transport.c         |  19 +-
> >  drivers/ntb/test/ntb_perf.c         |  16 +-
> >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> >  drivers/ntb/test/ntb_tool.c         |  25 +-
> >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> >  9 files changed, 701 insertions(+), 162 deletions(-)
> > 
> > diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> > index 95944e5..67d80c4 100644
> > --- a/drivers/ntb/Kconfig
> > +++ b/drivers/ntb/Kconfig
> > @@ -14,8 +14,6 @@ if NTB
> > 
> >  source "drivers/ntb/hw/Kconfig"
> > 
> > -source "drivers/ntb/test/Kconfig"
> > -
> >  config NTB_TRANSPORT
> >  	tristate "NTB Transport Client"
> >  	help
> > @@ -25,4 +23,6 @@ config NTB_TRANSPORT
> > 
> >  	 If unsure, say N.
> > 
> > +source "drivers/ntb/test/Kconfig"
> > +
> >  endif # NTB
> > diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c
> > index 6ccba0d..ab6f353 100644
> > --- a/drivers/ntb/hw/amd/ntb_hw_amd.c
> > +++ b/drivers/ntb/hw/amd/ntb_hw_amd.c
> > @@ -55,6 +55,7 @@
> >  #include <linux/pci.h>
> >  #include <linux/random.h>
> >  #include <linux/slab.h>
> > +#include <linux/sizes.h>
> >  #include <linux/ntb.h>
> > 
> >  #include "ntb_hw_amd.h"
> > @@ -84,11 +85,8 @@ static int amd_ntb_mw_count(struct ntb_dev *ntb)
> >  	return ntb_ndev(ntb)->mw_count;
> >  }
> > 
> > -static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> > -				phys_addr_t *base,
> > -				resource_size_t *size,
> > -				resource_size_t *align,
> > -				resource_size_t *align_size)
> > +static int amd_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> > +				 phys_addr_t *base, resource_size_t *size)
> >  {
> >  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> >  	int bar;
> > @@ -103,17 +101,40 @@ static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> >  	if (size)
> >  		*size = pci_resource_len(ndev->ntb.pdev, bar);
> > 
> > -	if (align)
> > -		*align = SZ_4K;
> > +	return 0;
> > +}
> > +
> > +static int amd_ntb_peer_mw_count(struct ntb_dev *ntb)
> > +{
> > +	return ntb_ndev(ntb)->mw_count;
> > +}
> > +
> > +static int amd_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> > +				     resource_size_t *addr_align,
> > +				     resource_size_t *size_align,
> > +				     resource_size_t *size_max)
> > +{
> > +	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> > +	int bar;
> > +
> > +	bar = ndev_mw_to_bar(ndev, idx);
> > +	if (bar < 0)
> > +		return bar;
> > +
> > +	if (addr_align)
> > +		*addr_align = SZ_4K;
> > +
> > +	if (size_align)
> > +		*size_align = 1;
> > 
> > -	if (align_size)
> > -		*align_size = 1;
> > +	if (size_max)
> > +		*size_max = pci_resource_len(ndev->ntb.pdev, bar);
> > 
> >  	return 0;
> >  }
> > 
> > -static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> > -				dma_addr_t addr, resource_size_t size)
> > +static int amd_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> > +				     dma_addr_t addr, resource_size_t size)
> >  {
> >  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> >  	unsigned long xlat_reg, limit_reg = 0;
> > @@ -432,8 +453,10 @@ static int amd_ntb_peer_spad_write(struct ntb_dev *ntb,
> > 
> >  static const struct ntb_dev_ops amd_ntb_ops = {
> >  	.mw_count		= amd_ntb_mw_count,
> > -	.mw_get_range		= amd_ntb_mw_get_range,
> > -	.mw_set_trans		= amd_ntb_mw_set_trans,
> > +	.mw_get_maprsc		= amd_ntb_mw_get_maprsc,
> > +	.peer_mw_count		= amd_ntb_peer_mw_count,
> > +	.peer_mw_get_align	= amd_ntb_peer_mw_get_align,
> > +	.peer_mw_set_trans	= amd_ntb_peer_mw_set_trans,
> >  	.link_is_up		= amd_ntb_link_is_up,
> >  	.link_enable		= amd_ntb_link_enable,
> >  	.link_disable		= amd_ntb_link_disable,
> > diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c b/drivers/ntb/hw/intel/ntb_hw_intel.c
> > index 40d04ef..fdb2838 100644
> > --- a/drivers/ntb/hw/intel/ntb_hw_intel.c
> > +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
> > @@ -804,11 +804,8 @@ static int intel_ntb_mw_count(struct ntb_dev *ntb)
> >  	return ntb_ndev(ntb)->mw_count;
> >  }
> > 
> > -static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> > -				  phys_addr_t *base,
> > -				  resource_size_t *size,
> > -				  resource_size_t *align,
> > -				  resource_size_t *align_size)
> > +static int intel_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> > +				   phys_addr_t *base, resource_size_t *size)
> >  {
> >  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> >  	int bar;
> > @@ -828,17 +825,51 @@ static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> >  		*size = pci_resource_len(ndev->ntb.pdev, bar) -
> >  			(idx == ndev->b2b_idx ? ndev->b2b_off : 0);
> > 
> > -	if (align)
> > -		*align = pci_resource_len(ndev->ntb.pdev, bar);
> > +	return 0;
> > +}
> > +
> > +static int intel_ntb_peer_mw_count(struct ntb_dev *ntb)
> > +{
> > +	return ntb_ndev(ntb)->mw_count;
> > +}
> > +
> > +static int intel_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> > +				       resource_size_t *addr_align,
> > +				       resource_size_t *size_align,
> > +				       resource_size_t *size_max)
> > +{
> > +	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> > +	resource_size_t bar_size, mw_size;
> > +	int bar;
> > +
> > +	if (idx >= ndev->b2b_idx && !ndev->b2b_off)
> > +		idx += 1;
> > +
> > +	bar = ndev_mw_to_bar(ndev, idx);
> > +	if (bar < 0)
> > +		return bar;
> > +
> > +	bar_size = pci_resource_len(ndev->ntb.pdev, bar);
> > +
> > +	if (idx == ndev->b2b_idx)
> > +		mw_size = bar_size - ndev->b2b_off;
> > +	else
> > +		mw_size = bar_size;
> > +
> > +	if (addr_align)
> > +		*addr_align = bar_size;
> > +
> > +	if (size_align)
> > +		*size_align = 1;
> > 
> > -	if (align_size)
> > -		*align_size = 1;
> > +	if (size_max)
> > +		*size_max = mw_size;
> > 
> >  	return 0;
> >  }
> > 
> > -static int intel_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> > -				  dma_addr_t addr, resource_size_t size)
> > +static int intel_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> > +				       dma_addr_t addr, resource_size_t size)
> >  {
> >  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> >  	unsigned long base_reg, xlat_reg, limit_reg;
> > @@ -2220,8 +2251,10 @@ static struct intel_b2b_addr xeon_b2b_dsd_addr = {
> >  /* operations for primary side of local ntb */
> >  static const struct ntb_dev_ops intel_ntb_ops = {
> >  	.mw_count		= intel_ntb_mw_count,
> > -	.mw_get_range		= intel_ntb_mw_get_range,
> > -	.mw_set_trans		= intel_ntb_mw_set_trans,
> > +	.mw_get_maprsc		= intel_ntb_mw_get_maprsc,
> > +	.peer_mw_count		= intel_ntb_peer_mw_count,
> > +	.peer_mw_get_align	= intel_ntb_peer_mw_get_align,
> > +	.peer_mw_set_trans	= intel_ntb_peer_mw_set_trans,
> >  	.link_is_up		= intel_ntb_link_is_up,
> >  	.link_enable		= intel_ntb_link_enable,
> >  	.link_disable		= intel_ntb_link_disable,
> > diff --git a/drivers/ntb/ntb.c b/drivers/ntb/ntb.c
> > index 2e25307..37c3b36 100644
> > --- a/drivers/ntb/ntb.c
> > +++ b/drivers/ntb/ntb.c
> > @@ -54,6 +54,7 @@
> >  #include <linux/device.h>
> >  #include <linux/kernel.h>
> >  #include <linux/module.h>
> > +#include <linux/atomic.h>
> > 
> >  #include <linux/ntb.h>
> >  #include <linux/pci.h>
> > @@ -72,8 +73,62 @@ MODULE_AUTHOR(DRIVER_AUTHOR);
> >  MODULE_DESCRIPTION(DRIVER_DESCRIPTION);
> > 
> >  static struct bus_type ntb_bus;
> > +static struct ntb_bus_data ntb_data;
> >  static void ntb_dev_release(struct device *dev);
> > 
> > +static int ntb_gen_devid(struct ntb_dev *ntb)
> > +{
> > +	const char *name;
> > +	unsigned long *mask;
> > +	int id;
> > +
> > +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> > +		name = "ntbAS%d";
> > +		mask = ntb_data.both_msk;
> > +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> > +		name = "ntbS%d";
> > +		mask = ntb_data.sync_msk;
> > +	} else if (ntb_valid_async_dev_ops(ntb)) {
> > +		name = "ntbA%d";
> > +		mask = ntb_data.async_msk;
> > +	} else {
> > +		return -EINVAL;
> > +	}
> > +
> > +	for (id = 0; NTB_MAX_DEVID > id; id++) {
> > +		if (0 == test_and_set_bit(id, mask)) {
> > +			ntb->id = id;
> > +			break;
> > +		}
> > +	}
> > +
> > +	if (NTB_MAX_DEVID > id) {
> > +		dev_set_name(&ntb->dev, name, ntb->id);
> > +	} else {
> > +		return -ENOMEM;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void ntb_free_devid(struct ntb_dev *ntb)
> > +{
> > +	unsigned long *mask;
> > +
> > +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> > +		mask = ntb_data.both_msk;
> > +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> > +		mask = ntb_data.sync_msk;
> > +	} else if (ntb_valid_async_dev_ops(ntb)) {
> > +		mask = ntb_data.async_msk;
> > +	} else {
> > +		/* It's impossible */
> > +		BUG();
> > +	}
> > +
> > +	clear_bit(ntb->id, mask);
> > +}
> > +
> >  int __ntb_register_client(struct ntb_client *client, struct module *mod,
> >  			  const char *mod_name)
> >  {
> > @@ -99,13 +154,15 @@ EXPORT_SYMBOL(ntb_unregister_client);
> > 
> >  int ntb_register_device(struct ntb_dev *ntb)
> >  {
> > +	int ret;
> > +
> >  	if (!ntb)
> >  		return -EINVAL;
> >  	if (!ntb->pdev)
> >  		return -EINVAL;
> >  	if (!ntb->ops)
> >  		return -EINVAL;
> > -	if (!ntb_dev_ops_is_valid(ntb->ops))
> > +	if (!ntb_valid_sync_dev_ops(ntb) && !ntb_valid_async_dev_ops(ntb))
> >  		return -EINVAL;
> > 
> >  	init_completion(&ntb->released);
> > @@ -114,13 +171,21 @@ int ntb_register_device(struct ntb_dev *ntb)
> >  	ntb->dev.bus = &ntb_bus;
> >  	ntb->dev.parent = &ntb->pdev->dev;
> >  	ntb->dev.release = ntb_dev_release;
> > -	dev_set_name(&ntb->dev, "%s", pci_name(ntb->pdev));
> > 
> >  	ntb->ctx = NULL;
> >  	ntb->ctx_ops = NULL;
> >  	spin_lock_init(&ntb->ctx_lock);
> > 
> > -	return device_register(&ntb->dev);
> > +	/* No need to wait for completion if failed */
> > +	ret = ntb_gen_devid(ntb);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = device_register(&ntb->dev);
> > +	if (ret)
> > +		ntb_free_devid(ntb);
> > +
> > +	return ret;
> >  }
> >  EXPORT_SYMBOL(ntb_register_device);
> > 
> > @@ -128,6 +193,7 @@ void ntb_unregister_device(struct ntb_dev *ntb)
> >  {
> >  	device_unregister(&ntb->dev);
> >  	wait_for_completion(&ntb->released);
> > +	ntb_free_devid(ntb);
> >  }
> >  EXPORT_SYMBOL(ntb_unregister_device);
> > 
> > @@ -191,6 +257,20 @@ void ntb_db_event(struct ntb_dev *ntb, int vector)
> >  }
> >  EXPORT_SYMBOL(ntb_db_event);
> > 
> > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > +		   struct ntb_msg *msg)
> > +{
> > +	unsigned long irqflags;
> > +
> > +	spin_lock_irqsave(&ntb->ctx_lock, irqflags);
> > +	{
> > +		if (ntb->ctx_ops && ntb->ctx_ops->msg_event)
> > +			ntb->ctx_ops->msg_event(ntb->ctx, ev, msg);
> > +	}
> > +	spin_unlock_irqrestore(&ntb->ctx_lock, irqflags);
> > +}
> > +EXPORT_SYMBOL(ntb_msg_event);
> > +
> >  static int ntb_probe(struct device *dev)
> >  {
> >  	struct ntb_dev *ntb;
> > diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> > index d5c5894..2626ba0 100644
> > --- a/drivers/ntb/ntb_transport.c
> > +++ b/drivers/ntb/ntb_transport.c
> > @@ -673,7 +673,7 @@ static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
> >  	if (!mw->virt_addr)
> >  		return;
> > 
> > -	ntb_mw_clear_trans(nt->ndev, num_mw);
> > +	ntb_peer_mw_set_trans(nt->ndev, num_mw, 0, 0);
> >  	dma_free_coherent(&pdev->dev, mw->buff_size,
> >  			  mw->virt_addr, mw->dma_addr);
> >  	mw->xlat_size = 0;
> > @@ -730,7 +730,8 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
> >  	}
> > 
> >  	/* Notify HW the memory location of the receive buffer */
> > -	rc = ntb_mw_set_trans(nt->ndev, num_mw, mw->dma_addr, mw->xlat_size);
> > +	rc = ntb_peer_mw_set_trans(nt->ndev, num_mw, mw->dma_addr,
> > +				   mw->xlat_size);
> >  	if (rc) {
> >  		dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
> >  		ntb_free_mw(nt, num_mw);
> > @@ -1060,7 +1061,11 @@ static int ntb_transport_probe(struct ntb_client *self, struct
> > ntb_dev *ndev)
> >  	int node;
> >  	int rc, i;
> > 
> > -	mw_count = ntb_mw_count(ndev);
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ndev))
> > +		return -EINVAL;
> > +
> > +	mw_count = ntb_peer_mw_count(ndev);
> >  	if (ntb_spad_count(ndev) < (NUM_MWS + 1 + mw_count * 2)) {
> >  		dev_err(&ndev->dev, "Not enough scratch pad registers for %s",
> >  			NTB_TRANSPORT_NAME);
> > @@ -1094,8 +1099,12 @@ static int ntb_transport_probe(struct ntb_client *self, struct
> > ntb_dev *ndev)
> >  	for (i = 0; i < mw_count; i++) {
> >  		mw = &nt->mw_vec[i];
> > 
> > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > -				      &mw->xlat_align, &mw->xlat_align_size);
> > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > +		if (rc)
> > +			goto err1;
> > +
> > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > +					   &mw->xlat_align_size, NULL);
> 
> Looks like ntb_mw_get_range() was simpler before the change.
> 

If I didn't change NTB bus kernel API, I would have split them up anyway. First of all functions with long argument list look more confusing, than ones with shorter list. It helps to stick to the "80 character per line" rule and improves readability. Secondly the function splitting improves the readability of the code in general. When I first saw the function name "ntb_mw_get_range()", it was not obvious what kind of ranges this function returned. The function lacked of "high code coherence" unofficial rule. It is better when one function does one coherent thing and return a well coherent data. Particularly function "ntb_mw_get_range()" returned a local memory windows mapping address and size, as well as alignment of memory allocated for a peer. So now "ntb_mw_get_maprsc()" method returns mapping resources. If local NTB client driver is not going to allocate any memory, so one just doesn't need to call "ntb_peer_mw_get_align()" method at all. I understand, that a client driver could pass NULL to a unused arguments of the "ntb_mw_get_range()", but still the new design is better readable.

Additionally I've split them up because of the difference in the way the asynchronous interface works. IDT driver can not safely perform ntb_peer_mw_set_trans(), that's why I had to add ntb_mw_set_trans(). Each of that method should logically have related "ntb_*mw_get_align()" method. Method ntb_mw_get_align() shall give to a local client driver a hint how the retrieved from the peer translated base address should be aligned, so ntb_mw_set_trans() method would successfully return. Method ntb_peer_mw_get_align() will give a hint how the local memory buffer should be allocated to fulfil a peer translated base address alignment. In this way it returns restrictions for parameters of "ntb_peer_mw_set_trans()".

Finally, IDT driver is designed so Primary and Secondary ports can support a different number of memory windows. In this way methods "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" have different range of acceptable values of the second argument, which is determined by the "ntb_mw_count()" method, comparing to methods "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", which memory windows index restriction is determined by the "ntb_peer_mw_count()" method.

So to speak the splitting was really necessary to make the API looking more logical.

> >  		if (rc)
> >  			goto err1;
> > 
> > diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
> > index 6a50f20..f2952f7 100644
> > --- a/drivers/ntb/test/ntb_perf.c
> > +++ b/drivers/ntb/test/ntb_perf.c
> > @@ -452,7 +452,7 @@ static void perf_free_mw(struct perf_ctx *perf)
> >  	if (!mw->virt_addr)
> >  		return;
> > 
> > -	ntb_mw_clear_trans(perf->ntb, 0);
> > +	ntb_peer_mw_set_trans(perf->ntb, 0, 0, 0);
> >  	dma_free_coherent(&pdev->dev, mw->buf_size,
> >  			  mw->virt_addr, mw->dma_addr);
> >  	mw->xlat_size = 0;
> > @@ -488,7 +488,7 @@ static int perf_set_mw(struct perf_ctx *perf, resource_size_t size)
> >  		mw->buf_size = 0;
> >  	}
> > 
> > -	rc = ntb_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
> > +	rc = ntb_peer_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
> >  	if (rc) {
> >  		dev_err(&perf->ntb->dev, "Unable to set mw0 translation\n");
> >  		perf_free_mw(perf);
> > @@ -559,8 +559,12 @@ static int perf_setup_mw(struct ntb_dev *ntb, struct perf_ctx *perf)
> > 
> >  	mw = &perf->mw;
> > 
> > -	rc = ntb_mw_get_range(ntb, 0, &mw->phys_addr, &mw->phys_size,
> > -			      &mw->xlat_align, &mw->xlat_align_size);
> > +	rc = ntb_mw_get_maprsc(ntb, 0, &mw->phys_addr, &mw->phys_size);
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = ntb_peer_mw_get_align(ntb, 0, &mw->xlat_align,
> > +				   &mw->xlat_align_size, NULL);
> 
> Looks like ntb_mw_get_range() was simpler.
> 

See the previous answer.

> >  	if (rc)
> >  		return rc;
> > 
> > @@ -758,6 +762,10 @@ static int perf_probe(struct ntb_client *client, struct ntb_dev *ntb)
> >  	int node;
> >  	int rc = 0;
> > 
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ntb))
> > +		return -EINVAL;
> > +
> >  	if (ntb_spad_count(ntb) < MAX_SPAD) {
> >  		dev_err(&ntb->dev, "Not enough scratch pad registers for %s",
> >  			DRIVER_NAME);
> > diff --git a/drivers/ntb/test/ntb_pingpong.c b/drivers/ntb/test/ntb_pingpong.c
> > index 7d31179..e833649 100644
> > --- a/drivers/ntb/test/ntb_pingpong.c
> > +++ b/drivers/ntb/test/ntb_pingpong.c
> > @@ -214,6 +214,11 @@ static int pp_probe(struct ntb_client *client,
> >  	struct pp_ctx *pp;
> >  	int rc;
> > 
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > +		return -EINVAL;
> > +	}
> > +
> >  	if (ntb_db_is_unsafe(ntb)) {
> >  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
> >  		if (!unsafe) {
> > diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
> > index 61bf2ef..5dfe12f 100644
> > --- a/drivers/ntb/test/ntb_tool.c
> > +++ b/drivers/ntb/test/ntb_tool.c
> > @@ -675,8 +675,11 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t
> > req_size)
> >  	if (mw->peer)
> >  		return 0;
> > 
> > -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &size, &align,
> > -			      &align_size);
> > +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &size);
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = ntb_peer_mw_get_align(tc->ntb, idx, &align, &align_size, NULL);
> >  	if (rc)
> >  		return rc;
> 
> Looks like ntb_mw_get_range() was simpler.
> 

See the previous answer.

> > 
> > @@ -689,7 +692,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t
> > req_size)
> >  	if (!mw->peer)
> >  		return -ENOMEM;
> > 
> > -	rc = ntb_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
> > +	rc = ntb_peer_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
> >  	if (rc)
> >  		goto err_free_dma;
> > 
> > @@ -716,7 +719,7 @@ static void tool_free_mw(struct tool_ctx *tc, int idx)
> >  	struct tool_mw *mw = &tc->mws[idx];
> > 
> >  	if (mw->peer) {
> > -		ntb_mw_clear_trans(tc->ntb, idx);
> > +		ntb_peer_mw_set_trans(tc->ntb, idx, 0, 0);
> >  		dma_free_coherent(&tc->ntb->pdev->dev, mw->size,
> >  				  mw->peer,
> >  				  mw->peer_dma);
> > @@ -751,8 +754,8 @@ static ssize_t tool_peer_mw_trans_read(struct file *filep,
> >  	if (!buf)
> >  		return -ENOMEM;
> > 
> > -	ntb_mw_get_range(mw->tc->ntb, mw->idx,
> > -			 &base, &mw_size, &align, &align_size);
> > +	ntb_mw_get_maprsc(mw->tc->ntb, mw->idx, &base, &mw_size);
> > +	ntb_peer_mw_get_align(mw->tc->ntb, mw->idx, &align, &align_size, NULL);
> > 
> >  	off += scnprintf(buf + off, buf_size - off,
> >  			 "Peer MW %d Information:\n", mw->idx);
> > @@ -827,8 +830,7 @@ static int tool_init_mw(struct tool_ctx *tc, int idx)
> >  	phys_addr_t base;
> >  	int rc;
> > 
> > -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &mw->win_size,
> > -			      NULL, NULL);
> > +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &mw->win_size);
> >  	if (rc)
> >  		return rc;
> > 
> > @@ -913,6 +915,11 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
> >  	int rc;
> >  	int i;
> > 
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > +		return -EINVAL;
> > +	}
> > +
> 
> It would be nice if both types could be supported by the same api.
> 

Yes, it would be. Alas it isn't possible in general. See the introduction to this letter. AMD and Intel devices support asynchronous interface, although they lack of messaging mechanism.

Getting back to the discussion, we still need to provide a way to determine which type of interface an NTB device supports: synchronous/asynchronous translated base address initialization, Scratchpads and memory windows. Currently it can be determined by the functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand, that it's not the best solution. We can implement the traditional Linux kernel bus device-driver matching, using table_ids and so on. For example, each hardware driver fills in a table with all the functionality it supports, like: synchronous/asynchronous memory windows, Doorbells, Scratchpads, Messaging. Then driver initialize a table of functionality it uses. NTB bus core implements a "match()" callback, which compares those two tables and calls "probe()" callback method of a driver when the tables successfully matches.

On the other hand, we might don't have to comprehend the NTB bus core. We can just introduce a table_id for NTB hardware device, which would just describe the device vendor itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. Client driver will declare a supported device by its table_id. It might look easier, since the client driver developer should have a basic understanding of the device one develops a driver for. Then NTB bus kernel API core will simply match NTB devices with drivers like any other buses (PCI, PCIe, i2c, spi, etc) do. 
 
> >  	if (ntb_db_is_unsafe(ntb))
> >  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
> > 
> > @@ -928,7 +935,7 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
> >  	tc->ntb = ntb;
> >  	init_waitqueue_head(&tc->link_wq);
> > 
> > -	tc->mw_count = min(ntb_mw_count(tc->ntb), MAX_MWS);
> > +	tc->mw_count = min(ntb_peer_mw_count(tc->ntb), MAX_MWS);
> >  	for (i = 0; i < tc->mw_count; i++) {
> >  		rc = tool_init_mw(tc, i);
> >  		if (rc)
> > diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> > index 6f47562..d1937d3 100644
> > --- a/include/linux/ntb.h
> > +++ b/include/linux/ntb.h
> > @@ -159,13 +159,44 @@ static inline int ntb_client_ops_is_valid(const struct
> > ntb_client_ops *ops)
> >  }
> > 
> >  /**
> > + * struct ntb_msg - ntb driver message structure
> > + * @type:	Message type.
> > + * @payload:	Payload data to send to a peer
> > + * @data:	Array of u32 data to send (size might be hw dependent)
> > + */
> > +#define NTB_MAX_MSGSIZE 4
> > +struct ntb_msg {
> > +	union {
> > +		struct {
> > +			u32 type;
> > +			u32 payload[NTB_MAX_MSGSIZE - 1];
> > +		};
> > +		u32 data[NTB_MAX_MSGSIZE];
> > +	};
> > +};
> > +
> > +/**
> > + * enum NTB_MSG_EVENT - message event types
> > + * @NTB_MSG_NEW:	New message just arrived and passed to the handler
> > + * @NTB_MSG_SENT:	Posted message has just been successfully sent
> > + * @NTB_MSG_FAIL:	Posted message failed to be sent
> > + */
> > +enum NTB_MSG_EVENT {
> > +	NTB_MSG_NEW,
> > +	NTB_MSG_SENT,
> > +	NTB_MSG_FAIL
> > +};
> > +
> > +/**
> >   * struct ntb_ctx_ops - ntb driver context operations
> >   * @link_event:		See ntb_link_event().
> >   * @db_event:		See ntb_db_event().
> > + * @msg_event:		See ntb_msg_event().
> >   */
> >  struct ntb_ctx_ops {
> >  	void (*link_event)(void *ctx);
> >  	void (*db_event)(void *ctx, int db_vector);
> > +	void (*msg_event)(void *ctx, enum NTB_MSG_EVENT ev, struct ntb_msg *msg);
> >  };
> > 
> >  static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> > @@ -174,18 +205,24 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> > *ops)
> >  	return
> >  		/* ops->link_event		&& */
> >  		/* ops->db_event		&& */
> > +		/* ops->msg_event		&& */
> >  		1;
> >  }
> > 
> >  /**
> >   * struct ntb_ctx_ops - ntb device operations
> > - * @mw_count:		See ntb_mw_count().
> > - * @mw_get_range:	See ntb_mw_get_range().
> > - * @mw_set_trans:	See ntb_mw_set_trans().
> > - * @mw_clear_trans:	See ntb_mw_clear_trans().
> >   * @link_is_up:		See ntb_link_is_up().
> >   * @link_enable:	See ntb_link_enable().
> >   * @link_disable:	See ntb_link_disable().
> > + * @mw_count:		See ntb_mw_count().
> > + * @mw_get_maprsc:	See ntb_mw_get_maprsc().
> > + * @mw_set_trans:	See ntb_mw_set_trans().
> > + * @mw_get_trans:	See ntb_mw_get_trans().
> > + * @mw_get_align:	See ntb_mw_get_align().
> > + * @peer_mw_count:	See ntb_peer_mw_count().
> > + * @peer_mw_set_trans:	See ntb_peer_mw_set_trans().
> > + * @peer_mw_get_trans:	See ntb_peer_mw_get_trans().
> > + * @peer_mw_get_align:	See ntb_peer_mw_get_align().
> >   * @db_is_unsafe:	See ntb_db_is_unsafe().
> >   * @db_valid_mask:	See ntb_db_valid_mask().
> >   * @db_vector_count:	See ntb_db_vector_count().
> > @@ -210,22 +247,38 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> > *ops)
> >   * @peer_spad_addr:	See ntb_peer_spad_addr().
> >   * @peer_spad_read:	See ntb_peer_spad_read().
> >   * @peer_spad_write:	See ntb_peer_spad_write().
> > + * @msg_post:		See ntb_msg_post().
> > + * @msg_size:		See ntb_msg_size().
> >   */
> >  struct ntb_dev_ops {
> > -	int (*mw_count)(struct ntb_dev *ntb);
> > -	int (*mw_get_range)(struct ntb_dev *ntb, int idx,
> > -			    phys_addr_t *base, resource_size_t *size,
> > -			resource_size_t *align, resource_size_t *align_size);
> > -	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> > -			    dma_addr_t addr, resource_size_t size);
> > -	int (*mw_clear_trans)(struct ntb_dev *ntb, int idx);
> > -
> >  	int (*link_is_up)(struct ntb_dev *ntb,
> >  			  enum ntb_speed *speed, enum ntb_width *width);
> >  	int (*link_enable)(struct ntb_dev *ntb,
> >  			   enum ntb_speed max_speed, enum ntb_width max_width);
> >  	int (*link_disable)(struct ntb_dev *ntb);
> > 
> > +	int (*mw_count)(struct ntb_dev *ntb);
> > +	int (*mw_get_maprsc)(struct ntb_dev *ntb, int idx,
> > +			     phys_addr_t *base, resource_size_t *size);
> > +	int (*mw_get_align)(struct ntb_dev *ntb, int idx,
> > +			    resource_size_t *addr_align,
> > +			    resource_size_t *size_align,
> > +			    resource_size_t *size_max);
> > +	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> > +			    dma_addr_t addr, resource_size_t size);
> > +	int (*mw_get_trans)(struct ntb_dev *ntb, int idx,
> > +			    dma_addr_t *addr, resource_size_t *size);
> > +
> > +	int (*peer_mw_count)(struct ntb_dev *ntb);
> > +	int (*peer_mw_get_align)(struct ntb_dev *ntb, int idx,
> > +				 resource_size_t *addr_align,
> > +				 resource_size_t *size_align,
> > +				 resource_size_t *size_max);
> > +	int (*peer_mw_set_trans)(struct ntb_dev *ntb, int idx,
> > +				 dma_addr_t addr, resource_size_t size);
> > +	int (*peer_mw_get_trans)(struct ntb_dev *ntb, int idx,
> > +				 dma_addr_t *addr, resource_size_t *size);
> > +
> >  	int (*db_is_unsafe)(struct ntb_dev *ntb);
> >  	u64 (*db_valid_mask)(struct ntb_dev *ntb);
> >  	int (*db_vector_count)(struct ntb_dev *ntb);
> > @@ -259,47 +312,10 @@ struct ntb_dev_ops {
> >  			      phys_addr_t *spad_addr);
> >  	u32 (*peer_spad_read)(struct ntb_dev *ntb, int idx);
> >  	int (*peer_spad_write)(struct ntb_dev *ntb, int idx, u32 val);
> > -};
> > -
> > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > -{
> > -	/* commented callbacks are not required: */
> > -	return
> > -		ops->mw_count				&&
> > -		ops->mw_get_range			&&
> > -		ops->mw_set_trans			&&
> > -		/* ops->mw_clear_trans			&& */
> > -		ops->link_is_up				&&
> > -		ops->link_enable			&&
> > -		ops->link_disable			&&
> > -		/* ops->db_is_unsafe			&& */
> > -		ops->db_valid_mask			&&
> > 
> > -		/* both set, or both unset */
> > -		(!ops->db_vector_count == !ops->db_vector_mask) &&
> > -
> > -		ops->db_read				&&
> > -		/* ops->db_set				&& */
> > -		ops->db_clear				&&
> > -		/* ops->db_read_mask			&& */
> > -		ops->db_set_mask			&&
> > -		ops->db_clear_mask			&&
> > -		/* ops->peer_db_addr			&& */
> > -		/* ops->peer_db_read			&& */
> > -		ops->peer_db_set			&&
> > -		/* ops->peer_db_clear			&& */
> > -		/* ops->peer_db_read_mask		&& */
> > -		/* ops->peer_db_set_mask		&& */
> > -		/* ops->peer_db_clear_mask		&& */
> > -		/* ops->spad_is_unsafe			&& */
> > -		ops->spad_count				&&
> > -		ops->spad_read				&&
> > -		ops->spad_write				&&
> > -		/* ops->peer_spad_addr			&& */
> > -		/* ops->peer_spad_read			&& */
> > -		ops->peer_spad_write			&&
> > -		1;
> > -}
> > +	int (*msg_post)(struct ntb_dev *ntb, struct ntb_msg *msg);
> > +	int (*msg_size)(struct ntb_dev *ntb);
> > +};
> > 
> >  /**
> >   * struct ntb_client - client interested in ntb devices
> > @@ -310,10 +326,22 @@ struct ntb_client {
> >  	struct device_driver		drv;
> >  	const struct ntb_client_ops	ops;
> >  };
> > -
> >  #define drv_ntb_client(__drv) container_of((__drv), struct ntb_client, drv)
> > 
> >  /**
> > + * struct ntb_bus_data - NTB bus data
> > + * @sync_msk:	Synchroous devices mask
> > + * @async_msk:	Asynchronous devices mask
> > + * @both_msk:	Both sync and async devices mask
> > + */
> > +#define NTB_MAX_DEVID (8*BITS_PER_LONG)
> > +struct ntb_bus_data {
> > +	unsigned long sync_msk[8];
> > +	unsigned long async_msk[8];
> > +	unsigned long both_msk[8];
> > +};
> > +
> > +/**
> >   * struct ntb_device - ntb device
> >   * @dev:		Linux device object.
> >   * @pdev:		Pci device entry of the ntb.
> > @@ -332,15 +360,151 @@ struct ntb_dev {
> > 
> >  	/* private: */
> > 
> > +	/* device id */
> > +	int id;
> >  	/* synchronize setting, clearing, and calling ctx_ops */
> >  	spinlock_t			ctx_lock;
> >  	/* block unregister until device is fully released */
> >  	struct completion		released;
> >  };
> > -
> >  #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
> > 
> >  /**
> > + * ntb_valid_sync_dev_ops() - valid operations for synchronous hardware setup
> > + * @ntb:	NTB device
> > + *
> > + * There might be two types of NTB hardware differed by the way of the settings
> > + * configuration. The synchronous chips allows to set the memory windows by
> > + * directly writing to the peer registers. Additionally there can be shared
> > + * Scratchpad registers for synchronous information exchange. Client drivers
> > + * should call this function to make sure the hardware supports the proper
> > + * functionality.
> > + */
> > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > +{
> > +	const struct ntb_dev_ops *ops = ntb->ops;
> > +
> > +	/* Commented callbacks are not required, but might be developed */
> > +	return	/* NTB link status ops */
> > +		ops->link_is_up					&&
> > +		ops->link_enable				&&
> > +		ops->link_disable				&&
> > +
> > +		/* Synchronous memory windows ops */
> > +		ops->mw_count					&&
> > +		ops->mw_get_maprsc				&&
> > +		/* ops->mw_get_align				&& */
> > +		/* ops->mw_set_trans				&& */
> > +		/* ops->mw_get_trans				&& */
> > +		ops->peer_mw_count				&&
> > +		ops->peer_mw_get_align				&&
> > +		ops->peer_mw_set_trans				&&
> > +		/* ops->peer_mw_get_trans			&& */
> > +
> > +		/* Doorbell ops */
> > +		/* ops->db_is_unsafe				&& */
> > +		ops->db_valid_mask				&&
> > +		/* both set, or both unset */
> > +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> > +		ops->db_read					&&
> > +		/* ops->db_set					&& */
> > +		ops->db_clear					&&
> > +		/* ops->db_read_mask				&& */
> > +		ops->db_set_mask				&&
> > +		ops->db_clear_mask				&&
> > +		/* ops->peer_db_addr				&& */
> > +		/* ops->peer_db_read				&& */
> > +		ops->peer_db_set				&&
> > +		/* ops->peer_db_clear				&& */
> > +		/* ops->peer_db_read_mask			&& */
> > +		/* ops->peer_db_set_mask			&& */
> > +		/* ops->peer_db_clear_mask			&& */
> > +
> > +		/* Scratchpad ops */
> > +		/* ops->spad_is_unsafe				&& */
> > +		ops->spad_count					&&
> > +		ops->spad_read					&&
> > +		ops->spad_write					&&
> > +		/* ops->peer_spad_addr				&& */
> > +		/* ops->peer_spad_read				&& */
> > +		ops->peer_spad_write				&&
> > +
> > +		/* Messages IO ops */
> > +		/* ops->msg_post				&& */
> > +		/* ops->msg_size				&& */
> > +		1;
> > +}
> > +
> > +/**
> > + * ntb_valid_async_dev_ops() - valid operations for asynchronous hardware setup
> > + * @ntb:	NTB device
> > + *
> > + * There might be two types of NTB hardware differed by the way of the settings
> > + * configuration. The asynchronous chips does not allow to set the memory
> > + * windows by directly writing to the peer registers. Instead it implements
> > + * the additional method to communinicate between NTB nodes like messages.
> > + * Scratchpad registers aren't likely supported by such hardware. Client
> > + * drivers should call this function to make sure the hardware supports
> > + * the proper functionality.
> > + */
> > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> > +{
> > +	const struct ntb_dev_ops *ops = ntb->ops;
> > +
> > +	/* Commented callbacks are not required, but might be developed */
> > +	return	/* NTB link status ops */
> > +		ops->link_is_up					&&
> > +		ops->link_enable				&&
> > +		ops->link_disable				&&
> > +
> > +		/* Asynchronous memory windows ops */
> > +		ops->mw_count					&&
> > +		ops->mw_get_maprsc				&&
> > +		ops->mw_get_align				&&
> > +		ops->mw_set_trans				&&
> > +		/* ops->mw_get_trans				&& */
> > +		ops->peer_mw_count				&&
> > +		ops->peer_mw_get_align				&&
> > +		/* ops->peer_mw_set_trans			&& */
> > +		/* ops->peer_mw_get_trans			&& */
> > +
> > +		/* Doorbell ops */
> > +		/* ops->db_is_unsafe				&& */
> > +		ops->db_valid_mask				&&
> > +		/* both set, or both unset */
> > +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> > +		ops->db_read					&&
> > +		/* ops->db_set					&& */
> > +		ops->db_clear					&&
> > +		/* ops->db_read_mask				&& */
> > +		ops->db_set_mask				&&
> > +		ops->db_clear_mask				&&
> > +		/* ops->peer_db_addr				&& */
> > +		/* ops->peer_db_read				&& */
> > +		ops->peer_db_set				&&
> > +		/* ops->peer_db_clear				&& */
> > +		/* ops->peer_db_read_mask			&& */
> > +		/* ops->peer_db_set_mask			&& */
> > +		/* ops->peer_db_clear_mask			&& */
> > +
> > +		/* Scratchpad ops */
> > +		/* ops->spad_is_unsafe				&& */
> > +		/* ops->spad_count				&& */
> > +		/* ops->spad_read				&& */
> > +		/* ops->spad_write				&& */
> > +		/* ops->peer_spad_addr				&& */
> > +		/* ops->peer_spad_read				&& */
> > +		/* ops->peer_spad_write				&& */
> > +
> > +		/* Messages IO ops */
> > +		ops->msg_post					&&
> > +		ops->msg_size					&&
> > +		1;
> > +}
> 
> I understand why IDT requires a different api for dealing with addressing multiple peers.  I would be interested in a solution that would allow, for example, the Intel driver fit under the api for dealing with multiple peers, even though it only supports one peer.  I would rather see that, than two separate apis under ntb.
> 
> Thoughts?
> 
> Can the sync api be described by some subset of the async api?  Are there less overloaded terms we can use instead of sync/async?
> 

Answer to this concern is mostly provided in the introduction as well. I'll repeat it here in details. As I said AMD and Intel hardware support asynchronous API except the messaging. Additionally I can even think of emulating messaging using Doorbells and Scratchpads, but not the other way around. Why not? Before answering, here is how the messaging works in IDT switches of both first and second groups (see introduction for describing the groups).

There are four outbound and inbound message registers for each NTB port in the device. Local root complex can connect its any outbound message to any inbound message register of the IDT switch. When one writes a data to an outbound message register it immediately gets to the connected inbound message registers. Then peer can read its inbound message registers and empty it by clearing a corresponding bit. Then and only then next data can be written to any outbound message registers connected to that inbound message register. So the possible race condition between multiple domains sending a message to same peer is resolved by the IDT switch itself.

One would ask: "Why don't you just wrap the message registers up back to the same port? It would look just like Scratchpads." Yes, It would. But still there are only four message registers. It's not enough to distribute them between all the possibly connected NTB ports. As I said earlier there can be up to eight domains connected, so there must be at least seven message register to fulfil the possible design.

Howbeit all the emulations would look ugly anyway. In my opinion It's better to slightly adapt design for a hardware, rather than hardware to a design. Following that rule would simplify a code and support respectively.

Regarding the APIs subset. As I said before async API is kind of subset of synchronous API. We can develop all the memory window related callback-method for AMD and Intel hardware driver, which is pretty much easy. We can even simulate message registers by using Doorbells and Scratchpads, which is not that easy, but possible. Alas the second group of IDT switches can't implement the synchronous API, as I already said in the introduction.

Regarding the overloaded naming. The "sync/async" names are the best I could think of. If you have any idea how one can be appropriately changed, be my guest. I would be really glad to substitute them with something better.

> > +
> > +
> > +
> > +/**
> >   * ntb_register_client() - register a client for interest in ntb devices
> >   * @client:	Client context.
> >   *
> > @@ -441,10 +605,84 @@ void ntb_link_event(struct ntb_dev *ntb);
> >  void ntb_db_event(struct ntb_dev *ntb, int vector);
> > 
> >  /**
> > - * ntb_mw_count() - get the number of memory windows
> > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> >   * @ntb:	NTB device context.
> > + * @ev:		Event type caused the handler invocation
> > + * @msg:	Message related to the event
> > + *
> > + * Notify the driver context that there is some event happaned in the event
> > + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> > + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > + * last argument is used to pass the event related message. It discarded right
> > + * after the handler returns.
> > + */
> > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > +		   struct ntb_msg *msg);
> 
> I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of the message handling to be done more appropriately at a higher layer of the application.  I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I think would be more appropriate for a ntb transport (or higher layer) driver.
> 

Hmmm, that's how it's done.) MSI interrupt is raised when a new message arrived into a first inbound message register (the rest of message registers are used as an additional data buffers). Then a corresponding tasklet is started to release a hardware interrupt context. That tasklet extracts a message from the inbound message registers, puts it into the driver inbound message queue and marks the registers as empty so the next message could be retrieved. Then tasklet starts a corresponding kernel work thread delivering all new messages to a client driver, which preliminary registered "ntb_msg_event()" callback method. When callback method "ntb_msg_event()" the passed message is discarded.

Description of how messages are sent to a peer is provided below in the corresponding commentary.

> > +
> > +/**
> > + * ntb_link_is_up() - get the current ntb link state
> > + * @ntb:	NTB device context.
> > + * @speed:	OUT - The link speed expressed as PCIe generation number.
> > + * @width:	OUT - The link width expressed as the number of PCIe lanes.
> > + *
> > + * Get the current state of the ntb link.  It is recommended to query the link
> > + * state once after every link event.  It is safe to query the link state in
> > + * the context of the link event callback.
> > + *
> > + * Return: One if the link is up, zero if the link is down, otherwise a
> > + *		negative value indicating the error number.
> > + */
> > +static inline int ntb_link_is_up(struct ntb_dev *ntb,
> > +				 enum ntb_speed *speed, enum ntb_width *width)
> > +{
> > +	return ntb->ops->link_is_up(ntb, speed, width);
> > +}
> > +
> 
> It looks like there was some rearranging of code, so big hunks appear to be added or removed.  Can you split this into two (or more) patches so that rearranging the code is distinct from more interesting changes?
> 

Lets say there was not much rearranging here. I've just put link-related method before everything else. The rearranging was done from the point of methods importance view. There can't be any memory sharing and doorbells operations done before the link is established. The new arrangements is reflected in ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops() methods.

> > +/**
> > + * ntb_link_enable() - enable the link on the secondary side of the ntb
> > + * @ntb:	NTB device context.
> > + * @max_speed:	The maximum link speed expressed as PCIe generation number.
> > + * @max_width:	The maximum link width expressed as the number of PCIe lanes.
> >   *
> > - * Hardware and topology may support a different number of memory windows.
> > + * Enable the link on the secondary side of the ntb.  This can only be done
> > + * from only one (primary or secondary) side of the ntb in primary or b2b
> > + * topology.  The ntb device should train the link to its maximum speed and
> > + * width, or the requested speed and width, whichever is smaller, if supported.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_link_enable(struct ntb_dev *ntb,
> > +				  enum ntb_speed max_speed,
> > +				  enum ntb_width max_width)
> > +{
> > +	return ntb->ops->link_enable(ntb, max_speed, max_width);
> > +}
> > +
> > +/**
> > + * ntb_link_disable() - disable the link on the secondary side of the ntb
> > + * @ntb:	NTB device context.
> > + *
> > + * Disable the link on the secondary side of the ntb.  This can only be
> > + * done from only one (primary or secondary) side of the ntb in primary or b2b
> > + * topology.  The ntb device should disable the link.  Returning from this call
> > + * must indicate that a barrier has passed, though with no more writes may pass
> > + * in either direction across the link, except if this call returns an error
> > + * number.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_link_disable(struct ntb_dev *ntb)
> > +{
> > +	return ntb->ops->link_disable(ntb);
> > +}
> > +
> > +/**
> > + * ntb_mw_count() - get the number of local memory windows
> > + * @ntb:	NTB device context.
> > + *
> > + * Hardware and topology may support a different number of memory windows at
> > + * local and remote devices
> >   *
> >   * Return: the number of memory windows.
> >   */
> > @@ -454,122 +692,186 @@ static inline int ntb_mw_count(struct ntb_dev *ntb)
> >  }
> > 
> >  /**
> > - * ntb_mw_get_range() - get the range of a memory window
> > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> 
> What was insufficient about ntb_mw_get_range() that it needed to be split into ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch, it seems ntb_mw_get_range() would have been more simple.
> 
> I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].  So, there is no example of how usage of new api would be used differently or more efficiently than ntb_mw_get_range() for async devices.
> 

This concern is answered a bit earlier, when you first commented the method "ntb_mw_get_range()" splitting.

You could not find the "ntb_mw_get_mapsrc()" method usage because you misspelled it. The real method signature is "ntb_mw_get_maprsc()" (look more carefully at the name ending), which is decrypted as "Mapping Resources", but no "Mapping Source". ntb/test/ntb_mw_test.c driver is developed to demonstrate how the new asynchronous API is utilized including the "ntb_mw_get_maprsc()" method usage.

> >   * @ntb:	NTB device context.
> >   * @idx:	Memory window number.
> >   * @base:	OUT - the base address for mapping the memory window
> >   * @size:	OUT - the size for mapping the memory window
> > - * @align:	OUT - the base alignment for translating the memory window
> > - * @align_size:	OUT - the size alignment for translating the memory window
> >   *
> > - * Get the range of a memory window.  NULL may be given for any output
> > - * parameter if the value is not needed.  The base and size may be used for
> > - * mapping the memory window, to access the peer memory.  The alignment and
> > - * size may be used for translating the memory window, for the peer to access
> > - * memory on the local system.
> > + * Get the map range of a memory window. The base and size may be used for
> > + * mapping the memory window to access the peer memory.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> > -				   phys_addr_t *base, resource_size_t *size,
> > -		resource_size_t *align, resource_size_t *align_size)
> > +static inline int ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> > +				    phys_addr_t *base, resource_size_t *size)
> >  {
> > -	return ntb->ops->mw_get_range(ntb, idx, base, size,
> > -			align, align_size);
> > +	return ntb->ops->mw_get_maprsc(ntb, idx, base, size);
> > +}
> > +
> > +/**
> > + * ntb_mw_get_align() - get memory window alignment of the local node
> > + * @ntb:	NTB device context.
> > + * @idx:	Memory window number.
> > + * @addr_align:	OUT - the translated base address alignment of the memory window
> > + * @size_align:	OUT - the translated memory size alignment of the memory window
> > + * @size_max:	OUT - the translated memory maximum size
> > + *
> > + * Get the alignment parameters to allocate the proper memory window. NULL may
> > + * be given for any output parameter if the value is not needed.
> > + *
> > + * Drivers of synchronous hardware don't have to support it.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_mw_get_align(struct ntb_dev *ntb, int idx,
> > +				   resource_size_t *addr_align,
> > +				   resource_size_t *size_align,
> > +				   resource_size_t *size_max)
> > +{
> > +	if (!ntb->ops->mw_get_align)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->mw_get_align(ntb, idx, addr_align, size_align, size_max);
> >  }
> > 
> >  /**
> > - * ntb_mw_set_trans() - set the translation of a memory window
> > + * ntb_mw_set_trans() - set the translated base address of a peer memory window
> >   * @ntb:	NTB device context.
> >   * @idx:	Memory window number.
> > - * @addr:	The dma address local memory to expose to the peer.
> > - * @size:	The size of the local memory to expose to the peer.
> > + * @addr:	DMA memory address exposed by the peer.
> > + * @size:	Size of the memory exposed by the peer.
> > + *
> > + * Set the translated base address of a memory window. The peer preliminary
> > + * allocates a memory, then someway passes the address to the remote node, that
> > + * finally sets up the memory window at the address, up to the size. The address
> > + * and size must be aligned to the parameters specified by ntb_mw_get_align() of
> > + * the local node and ntb_peer_mw_get_align() of the peer, which must return the
> > + * same values. Zero size effectively disables the memory window.
> >   *
> > - * Set the translation of a memory window.  The peer may access local memory
> > - * through the window starting at the address, up to the size.  The address
> > - * must be aligned to the alignment specified by ntb_mw_get_range().  The size
> > - * must be aligned to the size alignment specified by ntb_mw_get_range().
> > + * Drivers of synchronous hardware don't have to support it.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> >  				   dma_addr_t addr, resource_size_t size)
> >  {
> > +	if (!ntb->ops->mw_set_trans)
> > +		return -EINVAL;
> > +
> >  	return ntb->ops->mw_set_trans(ntb, idx, addr, size);
> >  }
> > 
> >  /**
> > - * ntb_mw_clear_trans() - clear the translation of a memory window
> > + * ntb_mw_get_trans() - get the translated base address of a memory window
> >   * @ntb:	NTB device context.
> >   * @idx:	Memory window number.
> > + * @addr:	The dma memory address exposed by the peer.
> > + * @size:	The size of the memory exposed by the peer.
> >   *
> > - * Clear the translation of a memory window.  The peer may no longer access
> > - * local memory through the window.
> > + * Get the translated base address of a memory window spicified for the local
> > + * hardware and allocated by the peer. If the addr and size are zero, the
> > + * memory window is effectively disabled.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_mw_clear_trans(struct ntb_dev *ntb, int idx)
> > +static inline int ntb_mw_get_trans(struct ntb_dev *ntb, int idx,
> > +				   dma_addr_t *addr, resource_size_t *size)
> >  {
> > -	if (!ntb->ops->mw_clear_trans)
> > -		return ntb->ops->mw_set_trans(ntb, idx, 0, 0);
> > +	if (!ntb->ops->mw_get_trans)
> > +		return -EINVAL;
> > 
> > -	return ntb->ops->mw_clear_trans(ntb, idx);
> > +	return ntb->ops->mw_get_trans(ntb, idx, addr, size);
> >  }
> > 
> >  /**
> > - * ntb_link_is_up() - get the current ntb link state
> > + * ntb_peer_mw_count() - get the number of peer memory windows
> >   * @ntb:	NTB device context.
> > - * @speed:	OUT - The link speed expressed as PCIe generation number.
> > - * @width:	OUT - The link width expressed as the number of PCIe lanes.
> >   *
> > - * Get the current state of the ntb link.  It is recommended to query the link
> > - * state once after every link event.  It is safe to query the link state in
> > - * the context of the link event callback.
> > + * Hardware and topology may support a different number of memory windows at
> > + * local and remote nodes.
> >   *
> > - * Return: One if the link is up, zero if the link is down, otherwise a
> > - *		negative value indicating the error number.
> > + * Return: the number of memory windows.
> >   */
> > -static inline int ntb_link_is_up(struct ntb_dev *ntb,
> > -				 enum ntb_speed *speed, enum ntb_width *width)
> > +static inline int ntb_peer_mw_count(struct ntb_dev *ntb)
> >  {
> > -	return ntb->ops->link_is_up(ntb, speed, width);
> > +	return ntb->ops->peer_mw_count(ntb);
> >  }
> > 
> >  /**
> > - * ntb_link_enable() - enable the link on the secondary side of the ntb
> > + * ntb_peer_mw_get_align() - get memory window alignment of the peer
> >   * @ntb:	NTB device context.
> > - * @max_speed:	The maximum link speed expressed as PCIe generation number.
> > - * @max_width:	The maximum link width expressed as the number of PCIe lanes.
> > + * @idx:	Memory window number.
> > + * @addr_align:	OUT - the translated base address alignment of the memory window
> > + * @size_align:	OUT - the translated memory size alignment of the memory window
> > + * @size_max:	OUT - the translated memory maximum size
> >   *
> > - * Enable the link on the secondary side of the ntb.  This can only be done
> > - * from the primary side of the ntb in primary or b2b topology.  The ntb device
> > - * should train the link to its maximum speed and width, or the requested speed
> > - * and width, whichever is smaller, if supported.
> > + * Get the alignment parameters to allocate the proper memory window for the
> > + * peer. NULL may be given for any output parameter if the value is not needed.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_link_enable(struct ntb_dev *ntb,
> > -				  enum ntb_speed max_speed,
> > -				  enum ntb_width max_width)
> > +static inline int ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> > +					resource_size_t *addr_align,
> > +					resource_size_t *size_align,
> > +					resource_size_t *size_max)
> >  {
> > -	return ntb->ops->link_enable(ntb, max_speed, max_width);
> > +	if (!ntb->ops->peer_mw_get_align)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->peer_mw_get_align(ntb, idx, addr_align, size_align,
> > +					   size_max);
> >  }
> > 
> >  /**
> > - * ntb_link_disable() - disable the link on the secondary side of the ntb
> > + * ntb_peer_mw_set_trans() - set the translated base address of a peer
> > + *			     memory window
> >   * @ntb:	NTB device context.
> > + * @idx:	Memory window number.
> > + * @addr:	Local DMA memory address exposed to the peer.
> > + * @size:	Size of the memory exposed to the peer.
> >   *
> > - * Disable the link on the secondary side of the ntb.  This can only be
> > - * done from the primary side of the ntb in primary or b2b topology.  The ntb
> > - * device should disable the link.  Returning from this call must indicate that
> > - * a barrier has passed, though with no more writes may pass in either
> > - * direction across the link, except if this call returns an error number.
> > + * Set the translated base address of a memory window exposed to the peer.
> > + * The local node preliminary allocates the window, then directly writes the
> 
> I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the following make sense, or have I completely misunderstood something?
> 
> ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are translated to the local memory destination.
> 
> ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory window (is this something that needs to be configured on the local ntb?) are translated to the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of ntb_mw_set_trans() will complete the translation to the peer memory destination.
> 

These functions actually do the opposite you described:

ntb_mw_set_trans() - method sets the translated base address retrieved from a peer, so outgoing writes to a memory window would be translated and reach the peer memory destination.

ntb_peer_mw_set_trans() - method sets translated base address to peer configuration space, so the local incoming writes would be correctly translated on the peer and reach the local memory destination.

Globally thinking, these methods do the same think, when they called from opposite domains. So to speak locally called "ntb_mw_set_trans()" method does the same thing as the method "ntb_peer_mw_set_trans()" called from a peer, and vise versa the locally called method "ntb_peer_mw_set_trans()" does the same procedure as the method "ntb_mw_set_trans()" called from a peer.

To make things simpler, think of memory windows in the framework of the next definition: "Memory Window is a virtual memory region, which locally reflects a physical memory of peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory window, so the locally mapped virtual addresses would be connected with the peer physical memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory region, so the peer could successfully perform a writes to our local physical memory.

Of course all the actual memory read/write operations should follow up ntb_mw_get_maprsc() and ioremap_nocache() method invocation doublet. You do the same thing in the client test drivers for AMD and Intel hadrware.

> > + * address and size to the peer control registers. The address and size must
> > + * be aligned to the parameters specified by ntb_peer_mw_get_align() of
> > + * the local node and ntb_mw_get_align() of the peer, which must return the
> > + * same values. Zero size effectively disables the memory window.
> > + *
> > + * Drivers of synchronous hardware must support it.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_link_disable(struct ntb_dev *ntb)
> > +static inline int ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> > +					dma_addr_t addr, resource_size_t size)
> >  {
> > -	return ntb->ops->link_disable(ntb);
> > +	if (!ntb->ops->peer_mw_set_trans)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->peer_mw_set_trans(ntb, idx, addr, size);
> > +}
> > +
> > +/**
> > + * ntb_peer_mw_get_trans() - get the translated base address of a peer
> > + *			     memory window
> > + * @ntb:	NTB device context.
> > + * @idx:	Memory window number.
> > + * @addr:	Local dma memory address exposed to the peer.
> > + * @size:	Size of the memory exposed to the peer.
> > + *
> > + * Get the translated base address of a memory window spicified for the peer
> > + * hardware. If the addr and size are zero then the memory window is effectively
> > + * disabled.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_peer_mw_get_trans(struct ntb_dev *ntb, int idx,
> > +					dma_addr_t *addr, resource_size_t *size)
> > +{
> > +	if (!ntb->ops->peer_mw_get_trans)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->peer_mw_get_trans(ntb, idx, addr, size);
> >  }
> > 
> >  /**
> > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits)
> >   * append one additional dma memory copy with the doorbell register as the
> >   * destination, after the memory copy operations.
> >   *
> > + * This is unusual, and hardware may not be suitable to implement it.
> > + *
> 
> Why is this unusual?  Do you mean async hardware may not support it?
> 

Of course I can always return an address of a Doorbell register, but it's not safe to do it working with IDT NTB hardware driver. To make thing explained simpler think a IDT hardware, which supports the Doorbell bits routing. Each local inbound Doorbell bits of each port can be configured to either reflect the global switch doorbell bits state or not to reflect. Global doorbell bits are set by using outbound doorbell register, which is exist for every NTB port. Primary port is the port which can have an access to multiple peers, so the Primary port inbound and outbound doorbell registers are shared between several NTB devices, sited on the linux kernel NTB bus. As you understand, these devices should not interfere each other, which can happen on uncontrollable usage of Doorbell registers addresses. That's why the method cou "ntb_peer_db_addr()" should not be developed for the IDT NTB hardware driver.

> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_peer_db_addr(struct ntb_dev *ntb,
> > @@ -901,10 +1205,15 @@ static inline int ntb_spad_is_unsafe(struct ntb_dev *ntb)
> >   *
> >   * Hardware and topology may support a different number of scratchpads.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: the number of scratchpads.
> >   */
> >  static inline int ntb_spad_count(struct ntb_dev *ntb)
> >  {
> > +	if (!ntb->ops->spad_count)
> > +		return -EINVAL;
> > +
> 
> Maybe we should return zero (i.e. there are no scratchpads).
> 

Agreed. I will fix it in the next patchset.

> >  	return ntb->ops->spad_count(ntb);
> >  }
> > 
> > @@ -915,10 +1224,15 @@ static inline int ntb_spad_count(struct ntb_dev *ntb)
> >   *
> >   * Read the local scratchpad register, and return the value.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: The value of the local scratchpad register.
> >   */
> >  static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
> >  {
> > +	if (!ntb->ops->spad_read)
> > +		return 0;
> > +
> 
> Let's return ~0.  I think that's what a driver would read from the pci bus for a memory miss. 
> 

Agreed. I will make it returning -EINVAL in the next patchset.

> >  	return ntb->ops->spad_read(ntb, idx);
> >  }
> > 
> > @@ -930,10 +1244,15 @@ static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
> >   *
> >   * Write the value to the local scratchpad register.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32 val)
> >  {
> > +	if (!ntb->ops->spad_write)
> > +		return -EINVAL;
> > +
> >  	return ntb->ops->spad_write(ntb, idx, val);
> >  }
> > 
> > @@ -946,6 +1265,8 @@ static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32
> > val)
> >   * Return the address of the peer doorbell register.  This may be used, for
> >   * example, by drivers that offload memory copy operations to a dma engine.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
> > @@ -964,10 +1285,15 @@ static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
> >   *
> >   * Read the peer scratchpad register, and return the value.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: The value of the local scratchpad register.
> >   */
> >  static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
> >  {
> > +	if (!ntb->ops->peer_spad_read)
> > +		return 0;
> 
> Also, ~0?
> 

Agreed. I will make it returning -EINVAL in the next patchset.

> > +
> >  	return ntb->ops->peer_spad_read(ntb, idx);
> >  }
> > 
> > @@ -979,11 +1305,59 @@ static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
> >   *
> >   * Write the value to the peer scratchpad register.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_peer_spad_write(struct ntb_dev *ntb, int idx, u32 val)
> >  {
> > +	if (!ntb->ops->peer_spad_write)
> > +		return -EINVAL;
> > +
> >  	return ntb->ops->peer_spad_write(ntb, idx, val);
> >  }
> > 
> > +/**
> > + * ntb_msg_post() - post the message to the peer
> > + * @ntb:	NTB device context.
> > + * @msg:	Message
> > + *
> > + * Post the message to a peer. It shall be delivered to the peer by the
> > + * corresponding hardware method. The peer should be notified about the new
> > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > + * event. Otherwise the NTB_MSG_SENT is emitted.
> 
> Interesting.. local driver would be notified about completion (success or failure) of delivery.  Is there any order-of-completion guarantee for the completion notifications?  Is there some tolerance for faults, in case we never get a completion notification from the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link comes up again, can we still get a completion notification from the peer, and how would that be handled?
> 
> Does delivery mean the application has processed the message, or is it just delivery at the hardware layer, or just delivery at the ntb hardware driver layer?
> 

Let me explain how the message delivery works. When a client driver calls the "ntb_msg_post()" method, the corresponding message is placed in an outbound messages queue. Such the message queue exists for every peer device. Then a dedicated kernel work thread is started to send all the messages from the queue. If kernel thread failed to send a message (for instance, if the peer IDT NTB hardware driver still has not freed its inbound message registers), it performs a new attempt after a small timeout. If after a preconfigured number of attempts the kernel thread still fails to delivery the message, it invokes ntb_msg_event() callback with NTB_MSG_FAIL event. If the message is successfully delivered, then the method ntb_msg_event() is called with NTB_MSG_SENT event.

To be clear the messsages are transfered directly to the peer memory, but instead they are placed in the IDT NTB switch registers, then peer is notified about a new message arrived at the corresponding message registers and the corresponding interrupt handler is called.

If we loose the PCI express or NTB link between the IDT switch and a peer, then the ntb_msg_event() method is called with NTB_MSG_FAIL event.

> > + *
> > + * Synchronous hardware may not support it.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_msg_post(struct ntb_dev *ntb, struct ntb_msg *msg)
> > +{
> > +	if (!ntb->ops->msg_post)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->msg_post(ntb, msg);
> > +}
> > +
> > +/**
> > + * ntb_msg_size() - size of the message data
> > + * @ntb:	NTB device context.
> > + *
> > + * Different hardware may support different number of message registers. This
> > + * callback shall return the number of those used for data sending and
> > + * receiving including the type field.
> > + *
> > + * Synchronous hardware may not support it.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_msg_size(struct ntb_dev *ntb)
> > +{
> > +	if (!ntb->ops->msg_size)
> > +		return 0;
> > +
> > +	return ntb->ops->msg_size(ntb);
> > +}
> > +
> >  #endif
> > --
> > 2.6.6
>

Finally, I've answered to all the questions. Hopefully the things look clearer now.

Regards,
-Sergey


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
  2016-08-19  9:10     ` Serge Semin
@ 2016-08-19 13:41       ` Allen Hubbe
  -1 siblings, 0 replies; 12+ messages in thread
From: Allen Hubbe @ 2016-08-19 13:41 UTC (permalink / raw)
  To: 'Serge Semin'
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

From: Serge Semin
> 3) IDT driver redevelopment will take a lot of time, since I don't have much free time to
> do it. It may be half of year or even more.
> 
> From my side, such an improvement will significantly complicate the NTB Kernel API. Since
> you are the subsystem maintainer it's your decision which design to choose, but I don't think
> I'll do the IDT driver suitable for this design anytime soon.

I'm sorry to have made you feel that way.

> > I hope we got it settled now. If not, We can have a Skype conversation, since writing
> such a long letters takes lot of time.

Come join irc.oftc.net #ntb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
@ 2016-08-19 13:41       ` Allen Hubbe
  0 siblings, 0 replies; 12+ messages in thread
From: Allen Hubbe @ 2016-08-19 13:41 UTC (permalink / raw)
  To: 'Serge Semin'
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

From: Serge Semin
> 3) IDT driver redevelopment will take a lot of time, since I don't have much free time to
> do it. It may be half of year or even more.
> 
> From my side, such an improvement will significantly complicate the NTB Kernel API. Since
> you are the subsystem maintainer it's your decision which design to choose, but I don't think
> I'll do the IDT driver suitable for this design anytime soon.

I'm sorry to have made you feel that way.

> > I hope we got it settled now. If not, We can have a Skype conversation, since writing
> such a long letters takes lot of time.

Come join irc.oftc.net #ntb


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
  2016-08-18 21:56   ` Serge Semin
@ 2016-08-19  9:10     ` Serge Semin
  -1 siblings, 0 replies; 12+ messages in thread
From: Serge Semin @ 2016-08-19  9:10 UTC (permalink / raw)
  To: Allen Hubbe
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

Allen,
There is no any comment below, just this one.

After a short meditation I realized what you are trying to achieve. Your primary intentions
was to unify the NTB interface so it would fit both Inte/AMD and IDT hardware without doing
any abstraction. You may understand why I so eager in refusal this. The reason of most of my
objection is that making such a unified interface will lead to IDT driver complete redevelopment.

IDT driver is developed to fit your previous NTB Kernel API. So of course I've made some
abstraction to keep it suitable for API and make it as simple as possible. That's why I
introduced coupled Messaging subsystem and kernel threads to deliver messages.

Here are my conclusions if you still want a new inified interface:
1) I'm still eager of renaming the ntb_mw_* and ntb_peer_mw_* prefixed methods (see the illustrated
comment in my previous email). It just a matter of names syntax unification, so it would not look
confusing.

2) We could make the following interface.
Before getting to a possible interface, IDT hardware doesn't continuously enumerate the ports. For
instance, NTB functions can be activated on the 0, 2, 4, 6, 8, 12, 16 and 20 ports. Activation is
usually done over an SMBus interface or using a EEPROM firmware.

I won't describe all the interface methods arguments, just new and important ones:

 - Link Up/down interface
ntb_link_is_up(ntb, port);
ntb_link_enable(ntb, port);
ntb_link_disable(ntb, port);

 - Memory windows interface
ntb_get_port_map(ntb); - return an array of ports with NTB function activated. There can be only
NTB function activated per port.

ntb_mw_count(ntb); - total number of local memory windows which can be initialized (up to 24 for IDT).
ntb_mw_get_maprsc(ntb, idx); - get the mapping resources of the memory window. Client
driver should know from internal logic which port is assigned to which memory window.
ntb_mw_get_align(ntb, idx); - return translation address alignment of the local memory window.
ntb_mw_set_trans(ntb, idx, port); - set a translation address of the corresponding local memory
window, so it would be connected with the RC memory of the corresponding port.
ntb_mw_get_trans(ntb, idx, port); - get a translation address of the corresponding local memory
window.

ntb_peer_mw_count(ntb); - total number of peer memory windows (up to 24 for IDT, but they can't
be reachable because of the race conditions I described in the first emails).
ntb_peer_mw_get_align(ntb, idx); - return translation address alignment of the peer memory window.
ntb_peer_mw_set_trans(ntb, idx, port); - set a translation address of the corresponding peer memory
window, so it would be connected with the RC memory of the corresponding port (it won't work
for IDT because of the race condition).
ntb_peer_mw_get_trans(ntb, idx, port); - get a translation address of the corresponding peer memory
window (it won't work for IDT).

 - Doorbell interface
Doorbells are kind of tricky in IDT. They aren't traditional doorbells like the AMD/Intel ones,
because of the multiple NTB-ports. First of all there is a global doorbell register, which is
32-bits wide. Each port has its own outbound and inbound doorbell registers (each one of 32-bits
wide). There is a global mask registers, which can mask ports outbound doorbell registers from
affecting the global doorbell register and can mask ports inbound doorbell registers from being
affected by the global doorbell register.
Those mask registers can not be safely accessed from a different ports, because of the damn race
condition. Instead we can leave them as is, so all the outbound doorbells affects all the bits of
global doorbell register and all the inbound doorbells are affected by the all the bits of the
global doorbell register.

So to speak we can leave the doorbell interface as is.

 - Scratchpad interface
Since the scratchpad registers are kind of just shared storages, we can leave the interface as
is. I don't think IDT will introduce Scratchpad registers in their any new multiport NTB-related
hardware.

 - Messaging interface
Partly we can stick to your design, but I would split the inbound and outbound message statuses, because
in this way client driver developer won't have to know which part of bit-field is related to which
inbound and outbound messages:
ntb_msg_event(ntb); - received a hardware interrupt for messages. (don't read message status, or anything else)
ntb_msg_read_sts_in(ntb); - read and return inbound MSGSTS bitmask.
ntb_msg_clear_sts_in(ntb) - clear bits of inbound MSGSTS bitmask.
ntb_msg_set_mask_in(ntb); - set bits in inbound part of MSGSTSMSK.
ntb_msg_clear_mask_in(ntb); - clear bits in inbound part of MSGSTSMSK.
ntb_msg_read_sts_out(ntb); - read and return outbound MSGSTS bitmask.
ntb_msg_clear_sts_out(ntb); - clear bits of outbound MSGSTS bitmask.
ntb_msg_set_mask_out(ntb); - set bits in outbound part of MSGSTSMSK.
ntb_msg_clear_mask_out(ntb); - clear bits in outbound part of MSGSTSMSK.
ntb_msg_count(ntb); - number of message registers
ntb_msg_recv(ntb, idx, msg, src_port); - read a message register of the corresponding
index and the source port of data it retrieved.
ntb_msg_send(ntb, idx, msg, target_port); - send a message to the corresponding port.

3) IDT driver redevelopment will take a lot of time, since I don't have much free time to do
it. It may be half of year or even more.

>From my side, such an improvement will significantly complicate the NTB Kernel API. Since you
are the subsystem maintainer it's your decision which design to choose, but I don't think I'll do
the IDT driver suitable for this design anytime soon.

Regards,
-Sergey


On Fri, Aug 19, 2016 at 12:56:04AM +0300, Serge Semin <fancer.lancer@gmail.com> wrote:
> Hello Allen,
> Sorry for the delay with response and thanks for thoughtful review.
> 
> On Mon, Aug 08, 2016 at 05:48:42PM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > From: Serge Semin
> > > Hello Allen.
> > > 
> > > Thanks for your careful review. Going through this mailing thread I hope we'll come up
> > > with solutions, which improve the driver code as well as extend the Linux kernel support
> > > of new devices like IDT PCIe-swtiches.
> > > 
> > > Before getting to the inline commentaries I need to give some introduction to the IDT NTB-
> > > related hardware so we could speak on the same language. Additionally I'll give a brief
> > > explanation how the setup of memory windows works in IDT PCIe-switches.
> > 
> > I found this to use as a reference for IDT:
> > https://www.idt.com/document/man/89hpes24nt24g2-device-user-manual
> 
> Yes, it's supported by the IDT driver, although I am using a device with lesser number of ports:
> https://www.idt.com/document/man/89hpes32nt8ag2-device-user-manual
> 
> > 
> > > First of all, before getting into the IDT NTB driver development I had made a research of
> > > the currently developed NTB kernel API and AMD/Intel hardware drivers. Due to lack of the
> > > hardware manuals It might be not in deep details, but I understand how the AMD/Intel NTB-
> > > hardware drivers work. At least I understand the concept of memory windowing, which led to
> > > the current NTB bus kernel API.
> > > 
> > > So lets get to IDT PCIe-switches. There is a whole series of NTB-related switches IDT
> > > produces. All of them I split into two distinct groups:
> > > 1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
> > > 2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2,
> > > 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
> > > Just to note all of these switches are a part of IDT PRECISE(TM) family of PCI Express®
> > > switching solutions. Why do I split them up? Because of the next reasons:
> > > 1) Number of upstream ports, which have access to NTB functions (obviously, yeah? =)). So
> > > the switches of the first group can connect just two domains over NTB. Unlike the second
> > > group of switches, which expose a way to setup an interaction between several PCIe-switch
> > > ports, which have NT-function activated.
> > > 2) The groups are significantly distinct by the way of NT-functions configuration.
> > > 
> > > Before getting further, I should note, that the uploaded driver supports the second group
> > > of devices only. But still I'll give a comparative explanation, since the first group of
> > > switches is very similar to the AMD/Intel NTBs.
> > > 
> > > Lets dive into the configurations a bit deeper. Particularly NT-functions of the first
> > > group of switches can be configured the same way as AMD/Intel NTB-functions are. There is
> > > an PCIe end-point configuration space, which fully reflects the cross-coupled local and
> > > peer PCIe/NTB settings. So local Root complex can set any of the peer registers by direct
> > > writing to mapped memory. Here is the image, which perfectly explains the configuration
> > > registers mapping:
> > > https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
> > > Since the first group switches connect only two root complexes, the race condition of
> > > read/write operations to cross-coupled registers can be easily resolved just by roles
> > > distribution. So local root complex sets the translated base address directly to a peer
> > > configuration space registers, which correspond to BAR0-BAR3 locally mapped memory
> > > windows. Of course 2-4 memory windows is enough to connect just two domains. That's why
> > > you made the NTB bus kernel API the way it is.
> > > 
> > > The things get different when one wants to have an access from one domain to multiple
> > > coupling up to eight root complexes in the second group of switches. First of all the
> > > hardware doesn't support the configuration space cross-coupling anymore. Instead there are
> > > two Global Address Space Access registers provided to have an access to a peers
> > > configuration space. In fact it is not a big problem, since there are no much differences
> > > in accessing registers over a memory mapped space or a pair of fixed Address/Data
> > > registers. The problem arises when one wants to share a memory windows between eight
> > > domains. Five BARs are not enough for it even if they'd be configured to be of x32 address
> > > type. Instead IDT introduces Lookup table address translation. So BAR2/BAR4 can be
> > > configured to translate addresses using 12 or 24 entries lookup tables. Each entry can be
> > > initialized with translated base address of a peer and IDT switch port, which peer is
> > > connected to. So when local root complex locally maps BAR2/BAR4, one can have an access to
> > > a memory of a peer just by reading/writing with a shift corresponding to the lookup table
> > > entry. That's how more than five peers can be accessed. The root problem is the way the
> > > lookup table is accessed. Alas It is accessed only by a pair of "Entry index/Data"
> > > registers. So a root complex must write an entry index to one registers, then read/write
> > > data from another. As you might realise, that weak point leads to a race condition of
> > > multiple root complexes accessing the lookup table of one shared peer. Alas I could not
> > > come up with a simple and strong solution of the race.
> > 
> > Right, multiple peers reaching across to some other peer's NTB configuration space is problematic.  I don't mean to suggest we should reach across to configure the lookup table (or anything else) on a remote NTB.
> 
> Good, we settled this down.
> 
> > 
> > > That's why I've introduced the asynchronous hardware in the NTB bus kernel API. Since
> > > local root complex can't directly write a translated base address to a peer, it must wait
> > > until a peer asks him to allocate a memory and send the address back using some of a
> > > hardware mechanism. It can be anything: Scratchpad registers, Message registers or even
> > > "crazy" doorbells bingbanging. For instance, the IDT switches of the first group support:
> > > 1) Shared Memory windows. In particular local root complex can set a translated base
> > > address to BARs of local and peer NT-function using the cross-coupled PCIe/NTB
> > > configuration space, the same way as it can be done for AMD/Intel NTBs.
> > > 2) One Doorbell register.
> > > 3) Two Scratchpads.
> > > 4) Four message regietsrs.
> > > As you can see the switches of the first group can be considered as both synchronous and
> > > asynchronous. All the NTB bus kernel API can be implemented for it including the changes
> > > introduced by this patch (I would do it if I had a corresponding hardware). AMD and Intel
> > > NTBs can be considered both synchronous and asynchronous as well, although they don't
> > > support messaging so Scratchpads can be used to send a data to a peer. Finally the
> > > switches of the second group lack of ability to initialize BARs translated base address of
> > > peers due to the race condition I described before.
> > > 
> > > To sum up I've spent a lot of time designing the IDT NTB driver. I've done my best to make
> > > the IDT driver as much compatible with current design as possible, nevertheless the NTB
> > > bus kernel API had to be slightly changed. You can find answers to the commentaries down
> > > below.
> > > 
> > > On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > > > From: Serge Semin
> > > > > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > > > > devices, so translated base address of memory windows can be direcly written
> > > > > to peer registers. But there are some IDT PCIe-switches which implement
> > > > > complex interfaces using Lookup Tables of translation addresses. Due to
> > > > > the way the table is accessed, it can not be done synchronously from different
> > > > > RCs, that's why the asynchronous interface should be developed.
> > > > >
> > > > > For these purpose the Memory Window related interface is correspondingly split
> > > > > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > > > > is following: "It is a virtual memory region, which locally reflects a physical
> > > > > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > > > > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > > > > memory windows.
> > > > > Here is the description of the Memory Window related NTB-bus callback
> > > > > functions:
> > > > >  - ntb_mw_count() - number of local memory windows.
> > > > >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> > > > >                          window to map.
> > > > >  - ntb_mw_set_trans() - set translation address of local memory window (this
> > > > >                         address should be somehow retrieved from a peer).
> > > > >  - ntb_mw_get_trans() - get translation address of local memory window.
> > > > >  - ntb_mw_get_align() - get alignment of translated base address and size of
> > > > >                         local memory window. Additionally one can get the
> > > > >                         upper size limit of the memory window.
> > > > >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> > > > >                          local number).
> > > > >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> > > > >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> > > > >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> > > > >                              of peer memory window.Additionally one can get the
> > > > >                              upper size limit of the memory window.
> > > > >
> > > > > As one can see current AMD and Intel NTB drivers mostly implement the
> > > > > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > > > > driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> > > > > since it doesn't have convenient access to the peer Lookup Table.
> > > > >
> > > > > In order to pass information from one RC to another NTB functions of IDT
> > > > > PCIe-switch implement Messaging subsystem. They currently support four message
> > > > > registers to transfer DWORD sized data to a specified peer. So there are two
> > > > > new callback methods are introduced:
> > > > >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> > > > >                     and receive messages
> > > > >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> > > > >                     to a peer
> > > > > Additionally there is a new event function:
> > > > >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> > > > >                      (NTB_MSG_NEW), or last message was successfully sent
> > > > >                      (NTB_MSG_SENT), or the last message failed to be sent
> > > > >                      (NTB_MSG_FAIL).
> > > > >
> > > > > The last change concerns the IDs (practically names) of NTB-devices on the
> > > > > NTB-bus. It is not good to have the devices with same names in the system
> > > > > and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> > > > > algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> > > > > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > > > > devices supporting both interfaces.
> > > >
> > > > Thanks for the work that went into writing this driver, and thanks for your patience
> > > with the review.  Please read my initial comments inline.  I would like to approach this
> > > from a top-down api perspective first, and settle on that first before requesting any
> > > specific changes in the hardware driver.  My major concern about these changes is that
> > > they introduce a distinct classification for sync and async hardware, supported by
> > > different sets of methods in the api, neither is a subset of the other.
> > > >
> > > > You know the IDT hardware, so if any of my requests below are infeasible, I would like
> > > your constructive opinion (even if it means significant changes to existing drivers) on
> > > how to resolve the api so that new and existing hardware drivers can be unified under the
> > > same api, if possible.
> > > 
> > > I understand your concern. I have been thinking of this a lot. In my opinion the proposed
> > > in this patch alterations are the best of all variants I've been thinking about. Regarding
> > > the lack of APIs subset. In fact I would not agree with that. As I described in the
> > > introduction AMD and Intel drivers can be considered as both synchronous and asynchronous,
> > > since a translated base address can be directly set in a local and peer configuration
> > > space. Although AMD and Intel devices don't support messaging, they have Scratchpads,
> > > which can be used to exchange an information between root complexes. The thing we need to
> > > do is to implement ntb_mw_set_trans() and ntb_mw_get_align() for them. Which isn't much
> > > different from the "mw_peer"-prefixed ones. The first method just sets a translated base
> > > address to the corresponding local register. The second one does exactly the same as
> > > "mw_peer"-prefixed ones. I would do it, but I haven't got a hardware to test, that's why I
> > > left things the way it was with just slight changes of names.
> > 
> > It sounds like the purpose of your ntb_mw_set_trans() [what I would call ntb_peer_mw_set_trans()] is similar to what is done at initialization time in the Intel NTB driver, so that outgoing writes are translated to the correct peer NTB BAR.  The difference is that IDT outgoing translation sets not only the peer NTB address but also the port number in the translation.
> > http://lxr.free-electrons.com/source/drivers/ntb/hw/intel/ntb_hw_intel.c?v=4.7#L1673
> > 
> > It would be interesting to allow ntb clients to change this translation, eg, configure an outgoing write from local BAR23 so it hits peer secondary BAR45.  I don't think e.g. Intel driver should be forced to implement that, but it would be interesting to think of unifying the api with that in mind.
> 
> I already said I'm not an expert of Intel and AMD hardware, moreover I don't even have any
> reference manual to study it. But from the first glance it's not. It doesn't concern any
> of peer BARs. As far as I can judge by the Intel driver code, the initialization code
> specifies some fixed translation address so to get access to some memory space of a remote
> bridge. According to my observation b2b configuration looks more like so called Punch-through
> configuration in the IDT definitions. It's when two bridges are connected to each other.
> But I may be wrong. Although it doesn't matter at the moment.
> 
> It's much easier to explain how it works using an illustrations, otherwise we'll be discussing
> this matter forever.
> 
> Lets start from definition what Memory Window mean. As I already said: "Memory Window is a
> virtual memory region, which locally reflects a physical memory of peer/remote device."
> 
> Next suppose we've got two 32-bits Root Complexes (RC0 and RC1) connected to each other over
> an NTB. It doesn't matter whether it's an IDT or Intel/AMD like NTBs. The NTB device has two
> ports: Pn and Pm, each port is connected to its own Root Complex. There are doorbells,
> scratchpads, and of course memory windows. Each Root Complex allocates a memory buffer:
> Buffer A and Buffer B. Additionally RC0 and RC1 maps memory windows at the corresponding
> addresses: MW A and MW B. Here is how it schematically looks:
> https://s3.postimg.org/so3zg0car/memory_windows_before.jpg
> 
> According to your NTB Kernel API naming (see the figure), methods are supposed to be
> syntactically split into two: with "ntb_peer_" prefix and without one. And they are correctly
> split for doorbells and scratchpads, but when it comes to memory windows, the method names
> syntax is kind of messed up.
> 
> Keeping in mind the definition of memory windows I introduced before, your ntb_mw_*_trans()
> methods set/get translation base address to "BARm XLAT", so the ones memory window would be
> correctly connected with Buffer A. But the function doesn't have "ntb_peer_mw" prefix, which
> does look confusing, since it works with peer configuration registers, particularly with the
> peer translation address of BARm - MW B.
> 
> Finally your ntb_mw_get_range() returns information about two opposite sides.
> "Alignment"-related arguments return align of translated base address of the peer, but "base"
> and "size" arguments are related with virtual address of the local memory window, which has
> nothing related with the peer memory window and its translated base address.
> 
> My idea was to fix this syntax incorrectness, so the memory windows NTB Kernel API would look
> the same way as doorbell and scratchpad ones. Here is the illustration, how it works now:
> https://s3.postimg.org/52mvtfpgz/memory_windows_after.jpg
> 
> As you can see, the "ntb_peer_mw_" prefixed methods are related with the peer configurations
> only, so ntb_peer_mw_*_trans() set/get translation base address of the peer memory windows
> and ntb_peer_mw_get_align() return alignment of that address. Methods with no "ntb_peer_mw_"
> prefix, do the same thing but with translation address of the local memory window.
> Additionally ntb_mw_get_maprsc() return a physical address of local memory window to 
> correspondingly map it.
> 
> In the same way the Memory Window API can be split into synchronous, asynchronous and both.
> If hardware (like Intel/AMD) allows to implement "ntb_peer_mw" prefixed methods (see methods
> marked with blue ink on the last figure), then it is considered synchronous, since it can
> directly specify translation base addresses to the peer memory windows. If hardware supports
> the "ntb_mw" prefixed methods only (purple ink on the figure), then it is considered as
> asynchronous, so a client driver must somehow retrieve a translation address for local
> memory window. Method ntb_mw_get_maprsc() must be supported by both hardware (it is marked
> with green ink). Of course there are hardware, which can support both synchronous and
> asynchronous API, like Intel/AMD. IDT PCIe-bridge doesn't safely support it since Lookup
> translation tables access method.
> 
> I hope we got it settled now. If not, We can have a Skype conversation, since writing such a
> long letters takes lot of time.
> 
> > 
> > > 
> > > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > > > >
> > > > > ---
> > > > >  drivers/ntb/Kconfig                 |   4 +-
> > > > >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> > > > >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> > > > >  drivers/ntb/ntb.c                   |  86 +++++-
> > > > >  drivers/ntb/ntb_transport.c         |  19 +-
> > > > >  drivers/ntb/test/ntb_perf.c         |  16 +-
> > > > >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> > > > >  drivers/ntb/test/ntb_tool.c         |  25 +-
> > > > >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> > > > >  9 files changed, 701 insertions(+), 162 deletions(-)
> > > > >
> > 
> > 
> > > > > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > > > > -				      &mw->xlat_align, &mw->xlat_align_size);
> > > > > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > > > > +		if (rc)
> > > > > +			goto err1;
> > > > > +
> > > > > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > > > > +					   &mw->xlat_align_size, NULL);
> > > >
> > > > Looks like ntb_mw_get_range() was simpler before the change.
> > > >
> > > 
> > > If I didn't change NTB bus kernel API, I would have split them up anyway. First of all
> > > functions with long argument list look more confusing, than ones with shorter list. It
> > > helps to stick to the "80 character per line" rule and improves readability. Secondly the
> > > function splitting improves the readability of the code in general. When I first saw the
> > > function name "ntb_mw_get_range()", it was not obvious what kind of ranges this function
> > > returned. The function lacked of "high code coherence" unofficial rule. It is better when
> > > one function does one coherent thing and return a well coherent data. Particularly
> > > function "ntb_mw_get_range()" returned a local memory windows mapping address and size, as
> > > well as alignment of memory allocated for a peer. So now "ntb_mw_get_maprsc()" method
> > > returns mapping resources. If local NTB client driver is not going to allocate any memory,
> > > so one just doesn't need to call "ntb_peer_mw_get_align()" method at all. I understand,
> > > that a client driver could pass NULL to a unused arguments of the "ntb_mw_get_range()",
> > > but still the new design is better readable.
> > > 
> > > Additionally I've split them up because of the difference in the way the asynchronous
> > > interface works. IDT driver can not safely perform ntb_peer_mw_set_trans(), that's why I
> > > had to add ntb_mw_set_trans(). Each of that method should logically have related
> > > "ntb_*mw_get_align()" method. Method ntb_mw_get_align() shall give to a local client
> > > driver a hint how the retrieved from the peer translated base address should be aligned,
> > > so ntb_mw_set_trans() method would successfully return. Method ntb_peer_mw_get_align()
> > > will give a hint how the local memory buffer should be allocated to fulfil a peer
> > > translated base address alignment. In this way it returns restrictions for parameters of
> > > "ntb_peer_mw_set_trans()".
> > > 
> > > Finally, IDT driver is designed so Primary and Secondary ports can support a different
> > > number of memory windows. In this way methods
> > > "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" have
> > > different range of acceptable values of the second argument, which is determined by the
> > > "ntb_mw_count()" method, comparing to methods
> > > "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", which memory
> > > windows index restriction is determined by the "ntb_peer_mw_count()" method.
> > > 
> > > So to speak the splitting was really necessary to make the API looking more logical.
> > 
> > If this change is not required by the new hardware, please submit the change as a separate patch.
> > 
> 
> It's required. See the previous comment.
> 
> > > > > +	/* Synchronous hardware is only supported */
> > > > > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > > > > +		return -EINVAL;
> > > > > +	}
> > > > > +
> > > >
> > > > It would be nice if both types could be supported by the same api.
> > > >
> > > 
> > > Yes, it would be. Alas it isn't possible in general. See the introduction to this letter.
> > > AMD and Intel devices support asynchronous interface, although they lack of messaging
> > > mechanism.
> > 
> > What is the prototypical application of the IDT message registers?
> > 
> > I'm thinking they will be the first thing available to drivers, and so one primary purpose will be to exchange information for configuring memory windows.  Can you describe how a cluster of eight nodes would discover each other and initialize?
> > 
> > Are they also intended to be useful beyond memory window initialization?  How should they be used efficiently, so that the application can minimize in particular read operations on the pci bus (reading ntb device registers)?  Or are message registers not intended to be used in low latency communications (for that, use doorbells and memory instead)?
> > 
> 
> The prototypical application of the message registers is to exchange a small portion of
> information, like translation base address for example. Imagine IDT hardware provides just
> four 32-bits wide message registers. So a driver software can transfer a message ID, memory
> window index and a translation address using such a small buffer. The message registers can't
> be efficiently used for exchanging of any bigger data. One should use doorbells and memory
> windows instead.
> 
> I'm not a mind reader, but still supposably IDT provided them as a synchronous exchange of
> Scratchpads. Message registers are designed so it's impossible to send a message to a peer
> before one read a previous message. Such a design is really helpful when we need to connect
> few different nodes and pass information between each other. Scratchpads would lead to too
> much complications.
> 
> > > 
> > > Getting back to the discussion, we still need to provide a way to determine which type of
> > > interface an NTB device supports: synchronous/asynchronous translated base address
> > > initialization, Scratchpads and memory windows. Currently it can be determined by the
> > > functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand, that it's not
> > > the best solution. We can implement the traditional Linux kernel bus device-driver
> > > matching, using table_ids and so on. For example, each hardware driver fills in a table
> > > with all the functionality it supports, like: synchronous/asynchronous memory windows,
> > > Doorbells, Scratchpads, Messaging. Then driver initialize a table of functionality it
> > > uses. NTB bus core implements a "match()" callback, which compares those two tables and
> > > calls "probe()" callback method of a driver when the tables successfully matches.
> > > 
> > > On the other hand, we might don't have to comprehend the NTB bus core. We can just
> > > introduce a table_id for NTB hardware device, which would just describe the device vendor
> > > itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. Client driver will declare a
> > > supported device by its table_id. It might look easier, since
> > 
> > emphasis added:
> > 
> > > the client driver developer
> > > should have a basic understanding of the device one develops a driver for.
> > 
> > This is what I'm hoping to avoid.  I would like to let the driver developer write for the api, not for the specific device.  I would rather the driver check "if feature x is supported" instead of "this is a sync or async device."
> > 
> 
> Ok. We can implement "features checking methods" like:
> ntb_valid_link_ops(),
> ntb_valid_peer_db_ops(),
> ntb_valid_db_ops(),
> ntb_valid_peer_spad_ops(),
> ntb_valid_spad_ops(),
> ntb_valid_msg_ops(),
> ntb_valid_peer_mw_ops(),
> ntb_valid_mw_ops().
> 
> But I am not fan of calling all of those methods in every client drivers. I would rather develop
> an "NTB Device - Client Driver" matching method in the framework of NTB bus. For example,
> developer creates a client driver using Doorbells (ntb_valid_peer_db_ops/ntb_valid_db_ops),
> Messages (ntb_valid_msg_ops) and Memory Windows (ntb_valid_peer_mw_ops/ntb_valid_mw_ops). Then one
> declares that the driver requires the corresponding features, somewhere in the struct ntb_client,
> like it's usually done in the "compatible" fields of matching id_tables of drivers (see SPI, PCI,
> i2c and others), but we would call it like "feature_table" with "compatible" fields. Of course
> every hardware driver would declare, which kind of features one supports. Then the NTB bus
> "match()" callback method checks whether the registered device supports all features the client
> driver claims. If it does, then and only then the "probe()" method of the client driver is called.
> 
> Of course, it's your decision which design to use, I am just giving a possible solutions. But the
> last one gives better unification with general "Bus - Device - Driver" design of the Linux Kernel.
> 
> > > Then NTB bus
> > > kernel API core will simply match NTB devices with drivers like any other buses (PCI,
> > > PCIe, i2c, spi, etc) do.
> > > 
> > 
> > > > > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > > > > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > > > > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> > > >
> > > > I understand why IDT requires a different api for dealing with addressing multiple
> > > peers.  I would be interested in a solution that would allow, for example, the Intel
> > > driver fit under the api for dealing with multiple peers, even though it only supports one
> > > peer.  I would rather see that, than two separate apis under ntb.
> > > >
> > > > Thoughts?
> > > >
> > > > Can the sync api be described by some subset of the async api?  Are there less
> > > overloaded terms we can use instead of sync/async?
> > > >
> > > 
> > > Answer to this concern is mostly provided in the introduction as well. I'll repeat it here
> > > in details. As I said AMD and Intel hardware support asynchronous API except the
> > > messaging. Additionally I can even think of emulating messaging using Doorbells and
> > > Scratchpads, but not the other way around. Why not? Before answering, here is how the
> > > messaging works in IDT switches of both first and second groups (see introduction for
> > > describing the groups).
> > > 
> > > There are four outbound and inbound message registers for each NTB port in the device.
> > > Local root complex can connect its any outbound message to any inbound message register of
> > > the IDT switch. When one writes a data to an outbound message register it immediately gets
> > > to the connected inbound message registers. Then peer can read its inbound message
> > > registers and empty it by clearing a corresponding bit. Then and only then next data can
> > > be written to any outbound message registers connected to that inbound message register.
> > > So the possible race condition between multiple domains sending a message to same peer is
> > > resolved by the IDT switch itself.
> > > 
> > > One would ask: "Why don't you just wrap the message registers up back to the same port? It
> > > would look just like Scratchpads." Yes, It would. But still there are only four message
> > > registers. It's not enough to distribute them between all the possibly connected NTB
> > > ports. As I said earlier there can be up to eight domains connected, so there must be at
> > > least seven message register to fulfil the possible design.
> > > 
> > > Howbeit all the emulations would look ugly anyway. In my opinion It's better to slightly
> > > adapt design for a hardware, rather than hardware to a design. Following that rule would
> > > simplify a code and support respectively.
> > > 
> > > Regarding the APIs subset. As I said before async API is kind of subset of synchronous
> > > API. We can develop all the memory window related callback-method for AMD and Intel
> > > hardware driver, which is pretty much easy. We can even simulate message registers by
> > > using Doorbells and Scratchpads, which is not that easy, but possible. Alas the second
> > > group of IDT switches can't implement the synchronous API, as I already said in the
> > > introduction.
> > 
> > Message registers operate fundamentally differently from scratchpads (and doorbells, for that matter).  I think we are in agreement.  It's a pain, but maybe the best we can do is require applications to check for support for scratchpads, message registers, and/or doorbells, before using any of those features.  We already have ntb_db_valid_mask() and ntb_spad_count().
> > 
> 
> Yes they do. And yes, the client drivers must somehow check whether a matching NTB device
> supports all the features they need. See the previous comment how I suppose it can be done.
> 
> > I would like to see ntb_msg_count() and more direct access to the message registers in this api.  I would prefer to see the more direct access to hardware message registers, instead of work_struct for message processing in the low level hardware driver.  A more direct interface to the hardware registers would be more like the existing ntb.h api: direct and low-overhead as possible, providing minimal abstraction of the hardware functionality.
> > 
> > I think there is still hope we can unify the memory window interface.  Even though IDT supports things like subdividing the memory windows with table lookup, and specification of destination ports for outgoing translations, I think we can support the same abstraction in the existing drivers with minimal overhead.
> > 
> > For existing Intel and AMD drivers, there may be only one translation per memory window (there is no table to subdivide the memory window), and there is only one destination port (the peer).  The Intel and AMD drivers can ignore the table index in setting up the translation (or validate that the requested table index is equal to zero).
> > 
> 
> In fact we don't need to introduce any of the table index, because the table index you are
> talking about is just one peer. Since it is just a peer, then it must refer to a particular
> device on the Linux NTB bus. For instance we got eight NTB ports on IDT PCIe-bridge, one
> of them is the Primary port. Then Root Complex connected to the Primary port will have seven
> devices on Linux NTB bus. Such design perfectly fits to your NTB Kernel API and additionally
> will cover all the client driver needs. In this case Primary NTB port would be like an SPI, i2c
> adapters or PCI root complex itself with respect to their subsidiary buses.
> 
> That's how the IDT hardware driver is designed. It gives transparent operations with NTB
> kernel API.
> 
> > > Regarding the overloaded naming. The "sync/async" names are the best I could think of. If
> > > you have any idea how one can be appropriately changed, be my guest. I would be really
> > > glad to substitute them with something better.
> > > 
> > 
> > Let's try to avoid a distinction, first, beyond just saying "not all hardware will support all these features."  If we absolutely have to make a distinction, let's think of better names then.
> > 
> 
> Ok. We can stick to "featured" hardware. It sounds better.
> 
> > > > > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> > > > >   * @ntb:	NTB device context.
> > > > > + * @ev:		Event type caused the handler invocation
> > > > > + * @msg:	Message related to the event
> > > > > + *
> > > > > + * Notify the driver context that there is some event happaned in the event
> > > > > + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> > > > > + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> > > > > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > > > > + * last argument is used to pass the event related message. It discarded right
> > > > > + * after the handler returns.
> > > > > + */
> > > > > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > > > > +		   struct ntb_msg *msg);
> > > >
> > > > I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of
> > > the message handling to be done more appropriately at a higher layer of the application.
> > > I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I
> > > think would be more appropriate for a ntb transport (or higher layer) driver.
> > > >
> > > 
> > > Hmmm, that's how it's done.) MSI interrupt is raised when a new message arrived into a
> > > first inbound message register (the rest of message registers are used as an additional
> > > data buffers). Then a corresponding tasklet is started to release a hardware interrupt
> > > context. That tasklet extracts a message from the inbound message registers, puts it into
> > > the driver inbound message queue and marks the registers as empty so the next message
> > > could be retrieved. Then tasklet starts a corresponding kernel work thread delivering all
> > > new messages to a client driver, which preliminary registered "ntb_msg_event()" callback
> > > method. When callback method "ntb_msg_event()" the passed message is discarded.
> > 
> > When an interrupt arrives, can you signal the upper layer that a message has arrived, without delivering the message?  I think the lower layer can do without the work structs, instead have the same body of the work struct run in the context of the upper layer polling to receive the message.
> > 
> 
> Of course we can. I could create a method like ntb_msg_read() instead of passing a message to the
> callback, but I didn't do so because if next message interrupt arrives while the previous message
> still has not been read, how a client driver would find out which message caused the last
> interrupt? Thats why I prefer to pass the new message to the callback, so if a client drivers wants
> to keep track of all the received methods, then it can create it's own queue.
> 
> Regarding the rest of the comment. The upper layer will have to implement the work struct anyway.
> Why do we need to copy that code everywhere if it can be common for all the drivers? Still keep in
> mind, that the incoming message registers is the register, that must be freed as fast as possible,
> since another peer device can be waiting for it to be freed. So it's better to read it in the
> hardware driver, than to let it being done by the unreliable client.
> 
> > > > It looks like there was some rearranging of code, so big hunks appear to be added or
> > > removed.  Can you split this into two (or more) patches so that rearranging the code is
> > > distinct from more interesting changes?
> > > >
> > > 
> > > Lets say there was not much rearranging here. I've just put link-related method before
> > > everything else. The rearranging was done from the point of methods importance view. There
> > > can't be any memory sharing and doorbells operations done before the link is established.
> > > The new arrangements is reflected in ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops()
> > > methods.
> > 
> > It's unfortunate how the diff captured the changes.  Can you split this up into smaller patches?
> > 
> 
> Lets settle the rest of the things down before doing this. If we don't then it would be just
> a waste of time.
> 
> > > > > - * ntb_mw_get_range() - get the range of a memory window
> > > > > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> > > >
> > > > What was insufficient about ntb_mw_get_range() that it needed to be split into
> > > ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch,
> > > it seems ntb_mw_get_range() would have been more simple.
> > > >
> > > > I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].
> > > So, there is no example of how usage of new api would be used differently or more
> > > efficiently than ntb_mw_get_range() for async devices.
> > > >
> > > 
> > > This concern is answered a bit earlier, when you first commented the method
> > > "ntb_mw_get_range()" splitting.
> > > 
> > > You could not find the "ntb_mw_get_mapsrc()" method usage because you misspelled it. The
> > > real method signature is "ntb_mw_get_maprsc()" (look more carefully at the name ending),
> > > which is decrypted as "Mapping Resources", but no "Mapping Source". ntb/test/ntb_mw_test.c
> > > driver is developed to demonstrate how the new asynchronous API is utilized including the
> > > "ntb_mw_get_maprsc()" method usage.
> > 
> > Right, I misspelled it.  It would be easier to catch a misspelling of ragne.
> > 
> > [PATCH v2 3/3]:
> > +		/* Retrieve the physical address of the memory to map */
> > +		ret = ntb_mw_get_maprsc(ntb, mwindx, &outmw->phys_addr,
> > +			&outmw->size);
> > +		if (SUCCESS != ret) {
> > +			dev_err_mw(ctx, "Failed to get map resources of "
> > +				"outbound window %d", mwindx);
> > +			mwindx--;
> > +			goto err_unmap_rsc;
> > +		}
> > +
> > +		/* Map the memory window resources */
> > +		outmw->virt_addr = ioremap_nocache(outmw->phys_addr, outmw->size);
> > +
> > +		/* Retrieve the memory windows maximum size and alignments */
> > +		ret = ntb_mw_get_align(ntb, mwindx, &outmw->addr_align,
> > +			&outmw->size_align, &outmw->size_max);
> > +		if (SUCCESS != ret) {
> > +			dev_err_mw(ctx, "Failed to get alignment options of "
> > +				"outbound window %d", mwindx);
> > +			goto err_unmap_rsc;
> > +		}
> > 
> > It looks to me like ntb_mw_get_range() would have been sufficient here.  If the change is required by the new driver, please show evidence of that.  If this change is not required by the new hardware, please submit the change as a separate patch.
> > 
> 
> Please, see the comments before.
> 
> > > > I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the
> > > following make sense, or have I completely misunderstood something?
> > > >
> > > > ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are
> > > translated to the local memory destination.
> > > >
> > > > ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory
> > > window (is this something that needs to be configured on the local ntb?) are translated to
> > > the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of
> > > ntb_mw_set_trans() will complete the translation to the peer memory destination.
> > > >
> > > 
> > > These functions actually do the opposite you described:
> > 
> > That's the point.  I noticed that they are opposite.
> > 
> > > ntb_mw_set_trans() - method sets the translated base address retrieved from a peer, so
> > > outgoing writes to a memory window would be translated and reach the peer memory
> > > destination.
> > 
> > In other words, this affects the translation of writes in the direction of the peer memory.  I think this should be named ntb_peer_mw_set_trans().
> > 
> 
> Please, see the big comment with illustrations provided before.
> 
> > > ntb_peer_mw_set_trans() - method sets translated base address to peer configuration space,
> > > so the local incoming writes would be correctly translated on the peer and reach the local
> > > memory destination.
> > 
> > In other words, this affects the translation for writes in the direction of local memory.  I think this should be named ntb_mw_set_trans().
> > 
> 
> Please, see the big comment with illustrations provided before.
> 
> > > Globally thinking, these methods do the same think, when they called from opposite
> > > domains. So to speak locally called "ntb_mw_set_trans()" method does the same thing as the
> > > method "ntb_peer_mw_set_trans()" called from a peer, and vise versa the locally called
> > > method "ntb_peer_mw_set_trans()" does the same procedure as the method
> > > "ntb_mw_set_trans()" called from a peer.
> > > 
> > > To make things simpler, think of memory windows in the framework of the next definition:
> > > "Memory Window is a virtual memory region, which locally reflects a physical memory of
> > > peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory
> > > window, so the locally mapped virtual addresses would be connected with the peer physical
> > > memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory
> > > region, so the peer could successfully perform a writes to our local physical memory.
> > > 
> > > Of course all the actual memory read/write operations should follow up ntb_mw_get_maprsc()
> > > and ioremap_nocache() method invocation doublet. You do the same thing in the client test
> > > drivers for AMD and Intel hadrware.
> > > 
> > 
> > > > >  /**
> > > > > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64
> > > db_bits)
> > > > >   * append one additional dma memory copy with the doorbell register as the
> > > > >   * destination, after the memory copy operations.
> > > > >   *
> > > > > + * This is unusual, and hardware may not be suitable to implement it.
> > > > > + *
> > > >
> > > > Why is this unusual?  Do you mean async hardware may not support it?
> > > >
> > > 
> > > Of course I can always return an address of a Doorbell register, but it's not safe to do
> > > it working with IDT NTB hardware driver. To make thing explained simpler think a IDT
> > > hardware, which supports the Doorbell bits routing. Each local inbound Doorbell bits of
> > > each port can be configured to either reflect the global switch doorbell bits state or not
> > > to reflect. Global doorbell bits are set by using outbound doorbell register, which is
> > > exist for every NTB port. Primary port is the port which can have an access to multiple
> > > peers, so the Primary port inbound and outbound doorbell registers are shared between
> > > several NTB devices, sited on the linux kernel NTB bus. As you understand, these devices
> > > should not interfere each other, which can happen on uncontrollable usage of Doorbell
> > > registers addresses. That's why the method cou "ntb_peer_db_addr()" should not be
> > > developed for the IDT NTB hardware driver.
> > 
> > I misread the diff as if this comment was added to the description of ntb_db_clear_mask().
> > 
> > > > > +	if (!ntb->ops->spad_count)
> > > > > +		return -EINVAL;
> > > > > +
> > > >
> > > > Maybe we should return zero (i.e. there are no scratchpads).
> > > >
> > > 
> > > Agreed. I will fix it in the next patchset.
> > 
> > Thanks.
> > 
> > > > > +	if (!ntb->ops->spad_read)
> > > > > +		return 0;
> > > > > +
> > > >
> > > > Let's return ~0.  I think that's what a driver would read from the pci bus for a memory
> > > miss.
> > > >
> > > 
> > > Agreed. I will make it returning -EINVAL in the next patchset.
> > 
> > I don't think we should try to interpret the returned value as an error number.  If the driver supports this method, and this is a valid scratchpad, the peer can put any value in i, including a value that could be interpreted as an error number.
> > 
> > A driver shouldn't be using this method if it isn't supported.  But if it does, I think ~0 is a better poison value than 0.  I just don't want to encourage drivers to try to interpret this value as an error number.
> > 
> 
> Understood. The method will return ~0 in the next patchset.
> 
> > > > > +	if (!ntb->ops->peer_spad_read)
> > > > > +		return 0;
> > > >
> > > > Also, ~0?
> > > >
> > > 
> > > Agreed. I will make it returning -EINVAL in the next patchset.
> > 
> > I don't think we should try to interpret the returned value as an error number.
> > 
> 
> Understood. The method will return ~0 in the next patchset.
> 
> > > > > + * ntb_msg_post() - post the message to the peer
> > > > > + * @ntb:	NTB device context.
> > > > > + * @msg:	Message
> > > > > + *
> > > > > + * Post the message to a peer. It shall be delivered to the peer by the
> > > > > + * corresponding hardware method. The peer should be notified about the new
> > > > > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > > > > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > > > > + * event. Otherwise the NTB_MSG_SENT is emitted.
> > > >
> > > > Interesting.. local driver would be notified about completion (success or failure) of
> > > delivery.  Is there any order-of-completion guarantee for the completion notifications?
> > > Is there some tolerance for faults, in case we never get a completion notification from
> > > the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link
> > > comes up again, can we still get a completion notification from the peer, and how would
> > > that be handled?
> > > >
> > > > Does delivery mean the application has processed the message, or is it just delivery at
> > > the hardware layer, or just delivery at the ntb hardware driver layer?
> > > >
> > > 
> > > Let me explain how the message delivery works. When a client driver calls the
> > > "ntb_msg_post()" method, the corresponding message is placed in an outbound messages
> > > queue. Such the message queue exists for every peer device. Then a dedicated kernel work
> > > thread is started to send all the messages from the queue.
> > 
> > Can we handle the outbound messages queue in an upper layer thread, too, instead of a kernel thread in this low level driver?  I think if we provide more direct access to the hardware semantics of the message registers, we will end up with something like the following, which will also simplify the hardware driver.  Leave it to the upper layer to schedule message processing after receiving an event.
> > 
> > ntb_msg_event(): we received a hardware interrupt for messages. (don't read message status, or anything else)
> > 
> > ntb_msg_status_read(): read and return MSGSTS bitmask (like ntb_db_read()).
> > ntb_msg_status_clear(): clear bits in MSGSTS bitmask (like ntb_db_clear()).
> > 
> > ntb_msg_mask_set(): set bits in MSGSTSMSK (like ntb_db_mask_set()).
> > ntb_msg_mask_clear(): clear bits in MSGSTSMSK (like ntb_db_mask_clear()).
> > 
> > ntb_msg_recv(): read and return INMSG and INMSGSRC of the indicated message index.
> > ntb_msg_send(): write the outgoing message register with the message.
> > 
> 
> I think such an API would make the interface too complicated. The messaging is intended to be
> simple for just sharing a small amount of information. Primarily for sending a translation
> address. I would prefer to leave API as it is, since it covers all the application needs.
> 
> > > If kernel thread failed to send
> > > a message (for instance, if the peer IDT NTB hardware driver still has not freed its
> > > inbound message registers), it performs a new attempt after a small timeout. If after a
> > > preconfigured number of attempts the kernel thread still fails to delivery the message, it
> > > invokes ntb_msg_event() callback with NTB_MSG_FAIL event. If the message is successfully
> > > delivered, then the method ntb_msg_event() is called with NTB_MSG_SENT event.
> > 
> > In other words, it was delivered to the peer NTB hardware, and the peer NTB hardware accepted the message into an available register.  It does not mean the peer application processed the message, or even that the peer driver received an interrupt for the message?
> > 
> 
> Of course it doesn't mean, that the application processed the message, but it does mean that the
> peer hardware raised the MSI interrupt, if the interrupt was enabled.
> 
> > > 
> > > To be clear the messsages are transfered directly to the peer memory, but instead they are
> > > placed in the IDT NTB switch registers, then peer is notified about a new message arrived
> > > at the corresponding message registers and the corresponding interrupt handler is called.
> > > 
> > > If we loose the PCI express or NTB link between the IDT switch and a peer, then the
> > > ntb_msg_event() method is called with NTB_MSG_FAIL event.
> > 
> > Byzantine fault is an unsolvable class of problem, so it is important to be clear exactly what is supposed to be guaranteed at each layer.  If we get a hardware ACK that the message was delivered, that means it was delivered to the NTB hardware register, but no further.  If we do not get a hardware NAK(?), that means it was not delivered.  If the link fails or we time out waiting for a completion, we can only guess that it wasn't delivered even though there is a small chance it was.  Applications need to be tolerant either way, and needs may be different depending on the application.  I would rather not add any fault tolerance (other than reporting faults) at this layer that is not already implemented in the hardware.
> > 
> > Reading the description of OUTMSGSTS register, it is clear that we can receive a hardware NAK if an outgoing message failed.  It's not clear to me that IDT will notify any kind of ACK that an outgoing message was accepted.  If an application wants to send two messages, it can send the first, check the bit and see there is no failure.  Does reading the status immediately after sending guarantee the message WAS delivered (i.e. IDT NTB hardware blocks reading the status register while there are messages in flight)?  If not, if the application sends the second message and then sees a failure, how can the application be sure the failure is not for the first message?  Does the application have to wait some time (how long?) before checking the message status?
> > 
> 
> Ok, I think I need to explain it carefully. When a local Root Complex sends a message to a peer,
> it writes a message to its outgoing message registers, which are connected with the peer incoming
> message registers. If that incoming message registers are still full, so peer has not emptied them
> by clearing a corresponding bit, then a local Root Complex gets a so called NACK, on the other
> words it failed to send a message. Then it tries to send the message again and again before a next
> attempt is either succeeded or a limited number of attempts is exceeded. Last one would lead to
> rising a NTB_MSG_FAIL event.
> 
> On the other hand before sending a message IDT driver checks whether the NTB link is up, if it isn't
> then it raises NTB_MSG_FAIL event.
> 
> After all the discussions I am starting to realize what is the problem. The problem is that we
> might have differently pictures of how NTB hardware are connected.) Traditional Intel/AMD NTB
> hardware are directly connected to each other so there is only one PCIe-link between two Root
> Complexes, but IDT bridge is kind of single intermediate device, which has at least two NTB ports
> connected to Root Complexes by different PCIe-links. So when one sends a message to another and
> it's accepted by hardware, then the message was put to the incoming message register of the IDT
> opposite port, and the peer Root Complex is just notified that a new message has arrived.
> 
> > > 
> > > Finally, I've answered to all the questions. Hopefully the things look clearer now.
> > > 
> > > Regards,
> > > -Sergey
> > 
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
@ 2016-08-19  9:10     ` Serge Semin
  0 siblings, 0 replies; 12+ messages in thread
From: Serge Semin @ 2016-08-19  9:10 UTC (permalink / raw)
  To: Allen Hubbe
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

Allen,
There is no any comment below, just this one.

After a short meditation I realized what you are trying to achieve. Your primary intentions
was to unify the NTB interface so it would fit both Inte/AMD and IDT hardware without doing
any abstraction. You may understand why I so eager in refusal this. The reason of most of my
objection is that making such a unified interface will lead to IDT driver complete redevelopment.

IDT driver is developed to fit your previous NTB Kernel API. So of course I've made some
abstraction to keep it suitable for API and make it as simple as possible. That's why I
introduced coupled Messaging subsystem and kernel threads to deliver messages.

Here are my conclusions if you still want a new inified interface:
1) I'm still eager of renaming the ntb_mw_* and ntb_peer_mw_* prefixed methods (see the illustrated
comment in my previous email). It just a matter of names syntax unification, so it would not look
confusing.

2) We could make the following interface.
Before getting to a possible interface, IDT hardware doesn't continuously enumerate the ports. For
instance, NTB functions can be activated on the 0, 2, 4, 6, 8, 12, 16 and 20 ports. Activation is
usually done over an SMBus interface or using a EEPROM firmware.

I won't describe all the interface methods arguments, just new and important ones:

 - Link Up/down interface
ntb_link_is_up(ntb, port);
ntb_link_enable(ntb, port);
ntb_link_disable(ntb, port);

 - Memory windows interface
ntb_get_port_map(ntb); - return an array of ports with NTB function activated. There can be only
NTB function activated per port.

ntb_mw_count(ntb); - total number of local memory windows which can be initialized (up to 24 for IDT).
ntb_mw_get_maprsc(ntb, idx); - get the mapping resources of the memory window. Client
driver should know from internal logic which port is assigned to which memory window.
ntb_mw_get_align(ntb, idx); - return translation address alignment of the local memory window.
ntb_mw_set_trans(ntb, idx, port); - set a translation address of the corresponding local memory
window, so it would be connected with the RC memory of the corresponding port.
ntb_mw_get_trans(ntb, idx, port); - get a translation address of the corresponding local memory
window.

ntb_peer_mw_count(ntb); - total number of peer memory windows (up to 24 for IDT, but they can't
be reachable because of the race conditions I described in the first emails).
ntb_peer_mw_get_align(ntb, idx); - return translation address alignment of the peer memory window.
ntb_peer_mw_set_trans(ntb, idx, port); - set a translation address of the corresponding peer memory
window, so it would be connected with the RC memory of the corresponding port (it won't work
for IDT because of the race condition).
ntb_peer_mw_get_trans(ntb, idx, port); - get a translation address of the corresponding peer memory
window (it won't work for IDT).

 - Doorbell interface
Doorbells are kind of tricky in IDT. They aren't traditional doorbells like the AMD/Intel ones,
because of the multiple NTB-ports. First of all there is a global doorbell register, which is
32-bits wide. Each port has its own outbound and inbound doorbell registers (each one of 32-bits
wide). There is a global mask registers, which can mask ports outbound doorbell registers from
affecting the global doorbell register and can mask ports inbound doorbell registers from being
affected by the global doorbell register.
Those mask registers can not be safely accessed from a different ports, because of the damn race
condition. Instead we can leave them as is, so all the outbound doorbells affects all the bits of
global doorbell register and all the inbound doorbells are affected by the all the bits of the
global doorbell register.

So to speak we can leave the doorbell interface as is.

 - Scratchpad interface
Since the scratchpad registers are kind of just shared storages, we can leave the interface as
is. I don't think IDT will introduce Scratchpad registers in their any new multiport NTB-related
hardware.

 - Messaging interface
Partly we can stick to your design, but I would split the inbound and outbound message statuses, because
in this way client driver developer won't have to know which part of bit-field is related to which
inbound and outbound messages:
ntb_msg_event(ntb); - received a hardware interrupt for messages. (don't read message status, or anything else)
ntb_msg_read_sts_in(ntb); - read and return inbound MSGSTS bitmask.
ntb_msg_clear_sts_in(ntb) - clear bits of inbound MSGSTS bitmask.
ntb_msg_set_mask_in(ntb); - set bits in inbound part of MSGSTSMSK.
ntb_msg_clear_mask_in(ntb); - clear bits in inbound part of MSGSTSMSK.
ntb_msg_read_sts_out(ntb); - read and return outbound MSGSTS bitmask.
ntb_msg_clear_sts_out(ntb); - clear bits of outbound MSGSTS bitmask.
ntb_msg_set_mask_out(ntb); - set bits in outbound part of MSGSTSMSK.
ntb_msg_clear_mask_out(ntb); - clear bits in outbound part of MSGSTSMSK.
ntb_msg_count(ntb); - number of message registers
ntb_msg_recv(ntb, idx, msg, src_port); - read a message register of the corresponding
index and the source port of data it retrieved.
ntb_msg_send(ntb, idx, msg, target_port); - send a message to the corresponding port.

3) IDT driver redevelopment will take a lot of time, since I don't have much free time to do
it. It may be half of year or even more.

From my side, such an improvement will significantly complicate the NTB Kernel API. Since you
are the subsystem maintainer it's your decision which design to choose, but I don't think I'll do
the IDT driver suitable for this design anytime soon.

Regards,
-Sergey


On Fri, Aug 19, 2016 at 12:56:04AM +0300, Serge Semin <fancer.lancer@gmail.com> wrote:
> Hello Allen,
> Sorry for the delay with response and thanks for thoughtful review.
> 
> On Mon, Aug 08, 2016 at 05:48:42PM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > From: Serge Semin
> > > Hello Allen.
> > > 
> > > Thanks for your careful review. Going through this mailing thread I hope we'll come up
> > > with solutions, which improve the driver code as well as extend the Linux kernel support
> > > of new devices like IDT PCIe-swtiches.
> > > 
> > > Before getting to the inline commentaries I need to give some introduction to the IDT NTB-
> > > related hardware so we could speak on the same language. Additionally I'll give a brief
> > > explanation how the setup of memory windows works in IDT PCIe-switches.
> > 
> > I found this to use as a reference for IDT:
> > https://www.idt.com/document/man/89hpes24nt24g2-device-user-manual
> 
> Yes, it's supported by the IDT driver, although I am using a device with lesser number of ports:
> https://www.idt.com/document/man/89hpes32nt8ag2-device-user-manual
> 
> > 
> > > First of all, before getting into the IDT NTB driver development I had made a research of
> > > the currently developed NTB kernel API and AMD/Intel hardware drivers. Due to lack of the
> > > hardware manuals It might be not in deep details, but I understand how the AMD/Intel NTB-
> > > hardware drivers work. At least I understand the concept of memory windowing, which led to
> > > the current NTB bus kernel API.
> > > 
> > > So lets get to IDT PCIe-switches. There is a whole series of NTB-related switches IDT
> > > produces. All of them I split into two distinct groups:
> > > 1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
> > > 2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2,
> > > 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
> > > Just to note all of these switches are a part of IDT PRECISE(TM) family of PCI Express�
> > > switching solutions. Why do I split them up? Because of the next reasons:
> > > 1) Number of upstream ports, which have access to NTB functions (obviously, yeah? =)). So
> > > the switches of the first group can connect just two domains over NTB. Unlike the second
> > > group of switches, which expose a way to setup an interaction between several PCIe-switch
> > > ports, which have NT-function activated.
> > > 2) The groups are significantly distinct by the way of NT-functions configuration.
> > > 
> > > Before getting further, I should note, that the uploaded driver supports the second group
> > > of devices only. But still I'll give a comparative explanation, since the first group of
> > > switches is very similar to the AMD/Intel NTBs.
> > > 
> > > Lets dive into the configurations a bit deeper. Particularly NT-functions of the first
> > > group of switches can be configured the same way as AMD/Intel NTB-functions are. There is
> > > an PCIe end-point configuration space, which fully reflects the cross-coupled local and
> > > peer PCIe/NTB settings. So local Root complex can set any of the peer registers by direct
> > > writing to mapped memory. Here is the image, which perfectly explains the configuration
> > > registers mapping:
> > > https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
> > > Since the first group switches connect only two root complexes, the race condition of
> > > read/write operations to cross-coupled registers can be easily resolved just by roles
> > > distribution. So local root complex sets the translated base address directly to a peer
> > > configuration space registers, which correspond to BAR0-BAR3 locally mapped memory
> > > windows. Of course 2-4 memory windows is enough to connect just two domains. That's why
> > > you made the NTB bus kernel API the way it is.
> > > 
> > > The things get different when one wants to have an access from one domain to multiple
> > > coupling up to eight root complexes in the second group of switches. First of all the
> > > hardware doesn't support the configuration space cross-coupling anymore. Instead there are
> > > two Global Address Space Access registers provided to have an access to a peers
> > > configuration space. In fact it is not a big problem, since there are no much differences
> > > in accessing registers over a memory mapped space or a pair of fixed Address/Data
> > > registers. The problem arises when one wants to share a memory windows between eight
> > > domains. Five BARs are not enough for it even if they'd be configured to be of x32 address
> > > type. Instead IDT introduces Lookup table address translation. So BAR2/BAR4 can be
> > > configured to translate addresses using 12 or 24 entries lookup tables. Each entry can be
> > > initialized with translated base address of a peer and IDT switch port, which peer is
> > > connected to. So when local root complex locally maps BAR2/BAR4, one can have an access to
> > > a memory of a peer just by reading/writing with a shift corresponding to the lookup table
> > > entry. That's how more than five peers can be accessed. The root problem is the way the
> > > lookup table is accessed. Alas It is accessed only by a pair of "Entry index/Data"
> > > registers. So a root complex must write an entry index to one registers, then read/write
> > > data from another. As you might realise, that weak point leads to a race condition of
> > > multiple root complexes accessing the lookup table of one shared peer. Alas I could not
> > > come up with a simple and strong solution of the race.
> > 
> > Right, multiple peers reaching across to some other peer's NTB configuration space is problematic.  I don't mean to suggest we should reach across to configure the lookup table (or anything else) on a remote NTB.
> 
> Good, we settled this down.
> 
> > 
> > > That's why I've introduced the asynchronous hardware in the NTB bus kernel API. Since
> > > local root complex can't directly write a translated base address to a peer, it must wait
> > > until a peer asks him to allocate a memory and send the address back using some of a
> > > hardware mechanism. It can be anything: Scratchpad registers, Message registers or even
> > > "crazy" doorbells bingbanging. For instance, the IDT switches of the first group support:
> > > 1) Shared Memory windows. In particular local root complex can set a translated base
> > > address to BARs of local and peer NT-function using the cross-coupled PCIe/NTB
> > > configuration space, the same way as it can be done for AMD/Intel NTBs.
> > > 2) One Doorbell register.
> > > 3) Two Scratchpads.
> > > 4) Four message regietsrs.
> > > As you can see the switches of the first group can be considered as both synchronous and
> > > asynchronous. All the NTB bus kernel API can be implemented for it including the changes
> > > introduced by this patch (I would do it if I had a corresponding hardware). AMD and Intel
> > > NTBs can be considered both synchronous and asynchronous as well, although they don't
> > > support messaging so Scratchpads can be used to send a data to a peer. Finally the
> > > switches of the second group lack of ability to initialize BARs translated base address of
> > > peers due to the race condition I described before.
> > > 
> > > To sum up I've spent a lot of time designing the IDT NTB driver. I've done my best to make
> > > the IDT driver as much compatible with current design as possible, nevertheless the NTB
> > > bus kernel API had to be slightly changed. You can find answers to the commentaries down
> > > below.
> > > 
> > > On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > > > From: Serge Semin
> > > > > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > > > > devices, so translated base address of memory windows can be direcly written
> > > > > to peer registers. But there are some IDT PCIe-switches which implement
> > > > > complex interfaces using Lookup Tables of translation addresses. Due to
> > > > > the way the table is accessed, it can not be done synchronously from different
> > > > > RCs, that's why the asynchronous interface should be developed.
> > > > >
> > > > > For these purpose the Memory Window related interface is correspondingly split
> > > > > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > > > > is following: "It is a virtual memory region, which locally reflects a physical
> > > > > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > > > > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > > > > memory windows.
> > > > > Here is the description of the Memory Window related NTB-bus callback
> > > > > functions:
> > > > >  - ntb_mw_count() - number of local memory windows.
> > > > >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> > > > >                          window to map.
> > > > >  - ntb_mw_set_trans() - set translation address of local memory window (this
> > > > >                         address should be somehow retrieved from a peer).
> > > > >  - ntb_mw_get_trans() - get translation address of local memory window.
> > > > >  - ntb_mw_get_align() - get alignment of translated base address and size of
> > > > >                         local memory window. Additionally one can get the
> > > > >                         upper size limit of the memory window.
> > > > >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> > > > >                          local number).
> > > > >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> > > > >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> > > > >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> > > > >                              of peer memory window.Additionally one can get the
> > > > >                              upper size limit of the memory window.
> > > > >
> > > > > As one can see current AMD and Intel NTB drivers mostly implement the
> > > > > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > > > > driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> > > > > since it doesn't have convenient access to the peer Lookup Table.
> > > > >
> > > > > In order to pass information from one RC to another NTB functions of IDT
> > > > > PCIe-switch implement Messaging subsystem. They currently support four message
> > > > > registers to transfer DWORD sized data to a specified peer. So there are two
> > > > > new callback methods are introduced:
> > > > >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> > > > >                     and receive messages
> > > > >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> > > > >                     to a peer
> > > > > Additionally there is a new event function:
> > > > >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> > > > >                      (NTB_MSG_NEW), or last message was successfully sent
> > > > >                      (NTB_MSG_SENT), or the last message failed to be sent
> > > > >                      (NTB_MSG_FAIL).
> > > > >
> > > > > The last change concerns the IDs (practically names) of NTB-devices on the
> > > > > NTB-bus. It is not good to have the devices with same names in the system
> > > > > and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> > > > > algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> > > > > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > > > > devices supporting both interfaces.
> > > >
> > > > Thanks for the work that went into writing this driver, and thanks for your patience
> > > with the review.  Please read my initial comments inline.  I would like to approach this
> > > from a top-down api perspective first, and settle on that first before requesting any
> > > specific changes in the hardware driver.  My major concern about these changes is that
> > > they introduce a distinct classification for sync and async hardware, supported by
> > > different sets of methods in the api, neither is a subset of the other.
> > > >
> > > > You know the IDT hardware, so if any of my requests below are infeasible, I would like
> > > your constructive opinion (even if it means significant changes to existing drivers) on
> > > how to resolve the api so that new and existing hardware drivers can be unified under the
> > > same api, if possible.
> > > 
> > > I understand your concern. I have been thinking of this a lot. In my opinion the proposed
> > > in this patch alterations are the best of all variants I've been thinking about. Regarding
> > > the lack of APIs subset. In fact I would not agree with that. As I described in the
> > > introduction AMD and Intel drivers can be considered as both synchronous and asynchronous,
> > > since a translated base address can be directly set in a local and peer configuration
> > > space. Although AMD and Intel devices don't support messaging, they have Scratchpads,
> > > which can be used to exchange an information between root complexes. The thing we need to
> > > do is to implement ntb_mw_set_trans() and ntb_mw_get_align() for them. Which isn't much
> > > different from the "mw_peer"-prefixed ones. The first method just sets a translated base
> > > address to the corresponding local register. The second one does exactly the same as
> > > "mw_peer"-prefixed ones. I would do it, but I haven't got a hardware to test, that's why I
> > > left things the way it was with just slight changes of names.
> > 
> > It sounds like the purpose of your ntb_mw_set_trans() [what I would call ntb_peer_mw_set_trans()] is similar to what is done at initialization time in the Intel NTB driver, so that outgoing writes are translated to the correct peer NTB BAR.  The difference is that IDT outgoing translation sets not only the peer NTB address but also the port number in the translation.
> > http://lxr.free-electrons.com/source/drivers/ntb/hw/intel/ntb_hw_intel.c?v=4.7#L1673
> > 
> > It would be interesting to allow ntb clients to change this translation, eg, configure an outgoing write from local BAR23 so it hits peer secondary BAR45.  I don't think e.g. Intel driver should be forced to implement that, but it would be interesting to think of unifying the api with that in mind.
> 
> I already said I'm not an expert of Intel and AMD hardware, moreover I don't even have any
> reference manual to study it. But from the first glance it's not. It doesn't concern any
> of peer BARs. As far as I can judge by the Intel driver code, the initialization code
> specifies some fixed translation address so to get access to some memory space of a remote
> bridge. According to my observation b2b configuration looks more like so called Punch-through
> configuration in the IDT definitions. It's when two bridges are connected to each other.
> But I may be wrong. Although it doesn't matter at the moment.
> 
> It's much easier to explain how it works using an illustrations, otherwise we'll be discussing
> this matter forever.
> 
> Lets start from definition what Memory Window mean. As I already said: "Memory Window is a
> virtual memory region, which locally reflects a physical memory of peer/remote device."
> 
> Next suppose we've got two 32-bits Root Complexes (RC0 and RC1) connected to each other over
> an NTB. It doesn't matter whether it's an IDT or Intel/AMD like NTBs. The NTB device has two
> ports: Pn and Pm, each port is connected to its own Root Complex. There are doorbells,
> scratchpads, and of course memory windows. Each Root Complex allocates a memory buffer:
> Buffer A and Buffer B. Additionally RC0 and RC1 maps memory windows at the corresponding
> addresses: MW A and MW B. Here is how it schematically looks:
> https://s3.postimg.org/so3zg0car/memory_windows_before.jpg
> 
> According to your NTB Kernel API naming (see the figure), methods are supposed to be
> syntactically split into two: with "ntb_peer_" prefix and without one. And they are correctly
> split for doorbells and scratchpads, but when it comes to memory windows, the method names
> syntax is kind of messed up.
> 
> Keeping in mind the definition of memory windows I introduced before, your ntb_mw_*_trans()
> methods set/get translation base address to "BARm XLAT", so the ones memory window would be
> correctly connected with Buffer A. But the function doesn't have "ntb_peer_mw" prefix, which
> does look confusing, since it works with peer configuration registers, particularly with the
> peer translation address of BARm - MW B.
> 
> Finally your ntb_mw_get_range() returns information about two opposite sides.
> "Alignment"-related arguments return align of translated base address of the peer, but "base"
> and "size" arguments are related with virtual address of the local memory window, which has
> nothing related with the peer memory window and its translated base address.
> 
> My idea was to fix this syntax incorrectness, so the memory windows NTB Kernel API would look
> the same way as doorbell and scratchpad ones. Here is the illustration, how it works now:
> https://s3.postimg.org/52mvtfpgz/memory_windows_after.jpg
> 
> As you can see, the "ntb_peer_mw_" prefixed methods are related with the peer configurations
> only, so ntb_peer_mw_*_trans() set/get translation base address of the peer memory windows
> and ntb_peer_mw_get_align() return alignment of that address. Methods with no "ntb_peer_mw_"
> prefix, do the same thing but with translation address of the local memory window.
> Additionally ntb_mw_get_maprsc() return a physical address of local memory window to 
> correspondingly map it.
> 
> In the same way the Memory Window API can be split into synchronous, asynchronous and both.
> If hardware (like Intel/AMD) allows to implement "ntb_peer_mw" prefixed methods (see methods
> marked with blue ink on the last figure), then it is considered synchronous, since it can
> directly specify translation base addresses to the peer memory windows. If hardware supports
> the "ntb_mw" prefixed methods only (purple ink on the figure), then it is considered as
> asynchronous, so a client driver must somehow retrieve a translation address for local
> memory window. Method ntb_mw_get_maprsc() must be supported by both hardware (it is marked
> with green ink). Of course there are hardware, which can support both synchronous and
> asynchronous API, like Intel/AMD. IDT PCIe-bridge doesn't safely support it since Lookup
> translation tables access method.
> 
> I hope we got it settled now. If not, We can have a Skype conversation, since writing such a
> long letters takes lot of time.
> 
> > 
> > > 
> > > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > > > >
> > > > > ---
> > > > >  drivers/ntb/Kconfig                 |   4 +-
> > > > >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> > > > >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> > > > >  drivers/ntb/ntb.c                   |  86 +++++-
> > > > >  drivers/ntb/ntb_transport.c         |  19 +-
> > > > >  drivers/ntb/test/ntb_perf.c         |  16 +-
> > > > >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> > > > >  drivers/ntb/test/ntb_tool.c         |  25 +-
> > > > >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> > > > >  9 files changed, 701 insertions(+), 162 deletions(-)
> > > > >
> > 
> > 
> > > > > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > > > > -				      &mw->xlat_align, &mw->xlat_align_size);
> > > > > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > > > > +		if (rc)
> > > > > +			goto err1;
> > > > > +
> > > > > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > > > > +					   &mw->xlat_align_size, NULL);
> > > >
> > > > Looks like ntb_mw_get_range() was simpler before the change.
> > > >
> > > 
> > > If I didn't change NTB bus kernel API, I would have split them up anyway. First of all
> > > functions with long argument list look more confusing, than ones with shorter list. It
> > > helps to stick to the "80 character per line" rule and improves readability. Secondly the
> > > function splitting improves the readability of the code in general. When I first saw the
> > > function name "ntb_mw_get_range()", it was not obvious what kind of ranges this function
> > > returned. The function lacked of "high code coherence" unofficial rule. It is better when
> > > one function does one coherent thing and return a well coherent data. Particularly
> > > function "ntb_mw_get_range()" returned a local memory windows mapping address and size, as
> > > well as alignment of memory allocated for a peer. So now "ntb_mw_get_maprsc()" method
> > > returns mapping resources. If local NTB client driver is not going to allocate any memory,
> > > so one just doesn't need to call "ntb_peer_mw_get_align()" method at all. I understand,
> > > that a client driver could pass NULL to a unused arguments of the "ntb_mw_get_range()",
> > > but still the new design is better readable.
> > > 
> > > Additionally I've split them up because of the difference in the way the asynchronous
> > > interface works. IDT driver can not safely perform ntb_peer_mw_set_trans(), that's why I
> > > had to add ntb_mw_set_trans(). Each of that method should logically have related
> > > "ntb_*mw_get_align()" method. Method ntb_mw_get_align() shall give to a local client
> > > driver a hint how the retrieved from the peer translated base address should be aligned,
> > > so ntb_mw_set_trans() method would successfully return. Method ntb_peer_mw_get_align()
> > > will give a hint how the local memory buffer should be allocated to fulfil a peer
> > > translated base address alignment. In this way it returns restrictions for parameters of
> > > "ntb_peer_mw_set_trans()".
> > > 
> > > Finally, IDT driver is designed so Primary and Secondary ports can support a different
> > > number of memory windows. In this way methods
> > > "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" have
> > > different range of acceptable values of the second argument, which is determined by the
> > > "ntb_mw_count()" method, comparing to methods
> > > "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", which memory
> > > windows index restriction is determined by the "ntb_peer_mw_count()" method.
> > > 
> > > So to speak the splitting was really necessary to make the API looking more logical.
> > 
> > If this change is not required by the new hardware, please submit the change as a separate patch.
> > 
> 
> It's required. See the previous comment.
> 
> > > > > +	/* Synchronous hardware is only supported */
> > > > > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > > > > +		return -EINVAL;
> > > > > +	}
> > > > > +
> > > >
> > > > It would be nice if both types could be supported by the same api.
> > > >
> > > 
> > > Yes, it would be. Alas it isn't possible in general. See the introduction to this letter.
> > > AMD and Intel devices support asynchronous interface, although they lack of messaging
> > > mechanism.
> > 
> > What is the prototypical application of the IDT message registers?
> > 
> > I'm thinking they will be the first thing available to drivers, and so one primary purpose will be to exchange information for configuring memory windows.  Can you describe how a cluster of eight nodes would discover each other and initialize?
> > 
> > Are they also intended to be useful beyond memory window initialization?  How should they be used efficiently, so that the application can minimize in particular read operations on the pci bus (reading ntb device registers)?  Or are message registers not intended to be used in low latency communications (for that, use doorbells and memory instead)?
> > 
> 
> The prototypical application of the message registers is to exchange a small portion of
> information, like translation base address for example. Imagine IDT hardware provides just
> four 32-bits wide message registers. So a driver software can transfer a message ID, memory
> window index and a translation address using such a small buffer. The message registers can't
> be efficiently used for exchanging of any bigger data. One should use doorbells and memory
> windows instead.
> 
> I'm not a mind reader, but still supposably IDT provided them as a synchronous exchange of
> Scratchpads. Message registers are designed so it's impossible to send a message to a peer
> before one read a previous message. Such a design is really helpful when we need to connect
> few different nodes and pass information between each other. Scratchpads would lead to too
> much complications.
> 
> > > 
> > > Getting back to the discussion, we still need to provide a way to determine which type of
> > > interface an NTB device supports: synchronous/asynchronous translated base address
> > > initialization, Scratchpads and memory windows. Currently it can be determined by the
> > > functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand, that it's not
> > > the best solution. We can implement the traditional Linux kernel bus device-driver
> > > matching, using table_ids and so on. For example, each hardware driver fills in a table
> > > with all the functionality it supports, like: synchronous/asynchronous memory windows,
> > > Doorbells, Scratchpads, Messaging. Then driver initialize a table of functionality it
> > > uses. NTB bus core implements a "match()" callback, which compares those two tables and
> > > calls "probe()" callback method of a driver when the tables successfully matches.
> > > 
> > > On the other hand, we might don't have to comprehend the NTB bus core. We can just
> > > introduce a table_id for NTB hardware device, which would just describe the device vendor
> > > itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. Client driver will declare a
> > > supported device by its table_id. It might look easier, since
> > 
> > emphasis added:
> > 
> > > the client driver developer
> > > should have a basic understanding of the device one develops a driver for.
> > 
> > This is what I'm hoping to avoid.  I would like to let the driver developer write for the api, not for the specific device.  I would rather the driver check "if feature x is supported" instead of "this is a sync or async device."
> > 
> 
> Ok. We can implement "features checking methods" like:
> ntb_valid_link_ops(),
> ntb_valid_peer_db_ops(),
> ntb_valid_db_ops(),
> ntb_valid_peer_spad_ops(),
> ntb_valid_spad_ops(),
> ntb_valid_msg_ops(),
> ntb_valid_peer_mw_ops(),
> ntb_valid_mw_ops().
> 
> But I am not fan of calling all of those methods in every client drivers. I would rather develop
> an "NTB Device - Client Driver" matching method in the framework of NTB bus. For example,
> developer creates a client driver using Doorbells (ntb_valid_peer_db_ops/ntb_valid_db_ops),
> Messages (ntb_valid_msg_ops) and Memory Windows (ntb_valid_peer_mw_ops/ntb_valid_mw_ops). Then one
> declares that the driver requires the corresponding features, somewhere in the struct ntb_client,
> like it's usually done in the "compatible" fields of matching id_tables of drivers (see SPI, PCI,
> i2c and others), but we would call it like "feature_table" with "compatible" fields. Of course
> every hardware driver would declare, which kind of features one supports. Then the NTB bus
> "match()" callback method checks whether the registered device supports all features the client
> driver claims. If it does, then and only then the "probe()" method of the client driver is called.
> 
> Of course, it's your decision which design to use, I am just giving a possible solutions. But the
> last one gives better unification with general "Bus - Device - Driver" design of the Linux Kernel.
> 
> > > Then NTB bus
> > > kernel API core will simply match NTB devices with drivers like any other buses (PCI,
> > > PCIe, i2c, spi, etc) do.
> > > 
> > 
> > > > > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > > > > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > > > > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> > > >
> > > > I understand why IDT requires a different api for dealing with addressing multiple
> > > peers.  I would be interested in a solution that would allow, for example, the Intel
> > > driver fit under the api for dealing with multiple peers, even though it only supports one
> > > peer.  I would rather see that, than two separate apis under ntb.
> > > >
> > > > Thoughts?
> > > >
> > > > Can the sync api be described by some subset of the async api?  Are there less
> > > overloaded terms we can use instead of sync/async?
> > > >
> > > 
> > > Answer to this concern is mostly provided in the introduction as well. I'll repeat it here
> > > in details. As I said AMD and Intel hardware support asynchronous API except the
> > > messaging. Additionally I can even think of emulating messaging using Doorbells and
> > > Scratchpads, but not the other way around. Why not? Before answering, here is how the
> > > messaging works in IDT switches of both first and second groups (see introduction for
> > > describing the groups).
> > > 
> > > There are four outbound and inbound message registers for each NTB port in the device.
> > > Local root complex can connect its any outbound message to any inbound message register of
> > > the IDT switch. When one writes a data to an outbound message register it immediately gets
> > > to the connected inbound message registers. Then peer can read its inbound message
> > > registers and empty it by clearing a corresponding bit. Then and only then next data can
> > > be written to any outbound message registers connected to that inbound message register.
> > > So the possible race condition between multiple domains sending a message to same peer is
> > > resolved by the IDT switch itself.
> > > 
> > > One would ask: "Why don't you just wrap the message registers up back to the same port? It
> > > would look just like Scratchpads." Yes, It would. But still there are only four message
> > > registers. It's not enough to distribute them between all the possibly connected NTB
> > > ports. As I said earlier there can be up to eight domains connected, so there must be at
> > > least seven message register to fulfil the possible design.
> > > 
> > > Howbeit all the emulations would look ugly anyway. In my opinion It's better to slightly
> > > adapt design for a hardware, rather than hardware to a design. Following that rule would
> > > simplify a code and support respectively.
> > > 
> > > Regarding the APIs subset. As I said before async API is kind of subset of synchronous
> > > API. We can develop all the memory window related callback-method for AMD and Intel
> > > hardware driver, which is pretty much easy. We can even simulate message registers by
> > > using Doorbells and Scratchpads, which is not that easy, but possible. Alas the second
> > > group of IDT switches can't implement the synchronous API, as I already said in the
> > > introduction.
> > 
> > Message registers operate fundamentally differently from scratchpads (and doorbells, for that matter).  I think we are in agreement.  It's a pain, but maybe the best we can do is require applications to check for support for scratchpads, message registers, and/or doorbells, before using any of those features.  We already have ntb_db_valid_mask() and ntb_spad_count().
> > 
> 
> Yes they do. And yes, the client drivers must somehow check whether a matching NTB device
> supports all the features they need. See the previous comment how I suppose it can be done.
> 
> > I would like to see ntb_msg_count() and more direct access to the message registers in this api.  I would prefer to see the more direct access to hardware message registers, instead of work_struct for message processing in the low level hardware driver.  A more direct interface to the hardware registers would be more like the existing ntb.h api: direct and low-overhead as possible, providing minimal abstraction of the hardware functionality.
> > 
> > I think there is still hope we can unify the memory window interface.  Even though IDT supports things like subdividing the memory windows with table lookup, and specification of destination ports for outgoing translations, I think we can support the same abstraction in the existing drivers with minimal overhead.
> > 
> > For existing Intel and AMD drivers, there may be only one translation per memory window (there is no table to subdivide the memory window), and there is only one destination port (the peer).  The Intel and AMD drivers can ignore the table index in setting up the translation (or validate that the requested table index is equal to zero).
> > 
> 
> In fact we don't need to introduce any of the table index, because the table index you are
> talking about is just one peer. Since it is just a peer, then it must refer to a particular
> device on the Linux NTB bus. For instance we got eight NTB ports on IDT PCIe-bridge, one
> of them is the Primary port. Then Root Complex connected to the Primary port will have seven
> devices on Linux NTB bus. Such design perfectly fits to your NTB Kernel API and additionally
> will cover all the client driver needs. In this case Primary NTB port would be like an SPI, i2c
> adapters or PCI root complex itself with respect to their subsidiary buses.
> 
> That's how the IDT hardware driver is designed. It gives transparent operations with NTB
> kernel API.
> 
> > > Regarding the overloaded naming. The "sync/async" names are the best I could think of. If
> > > you have any idea how one can be appropriately changed, be my guest. I would be really
> > > glad to substitute them with something better.
> > > 
> > 
> > Let's try to avoid a distinction, first, beyond just saying "not all hardware will support all these features."  If we absolutely have to make a distinction, let's think of better names then.
> > 
> 
> Ok. We can stick to "featured" hardware. It sounds better.
> 
> > > > > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> > > > >   * @ntb:	NTB device context.
> > > > > + * @ev:		Event type caused the handler invocation
> > > > > + * @msg:	Message related to the event
> > > > > + *
> > > > > + * Notify the driver context that there is some event happaned in the event
> > > > > + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> > > > > + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> > > > > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > > > > + * last argument is used to pass the event related message. It discarded right
> > > > > + * after the handler returns.
> > > > > + */
> > > > > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > > > > +		   struct ntb_msg *msg);
> > > >
> > > > I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of
> > > the message handling to be done more appropriately at a higher layer of the application.
> > > I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I
> > > think would be more appropriate for a ntb transport (or higher layer) driver.
> > > >
> > > 
> > > Hmmm, that's how it's done.) MSI interrupt is raised when a new message arrived into a
> > > first inbound message register (the rest of message registers are used as an additional
> > > data buffers). Then a corresponding tasklet is started to release a hardware interrupt
> > > context. That tasklet extracts a message from the inbound message registers, puts it into
> > > the driver inbound message queue and marks the registers as empty so the next message
> > > could be retrieved. Then tasklet starts a corresponding kernel work thread delivering all
> > > new messages to a client driver, which preliminary registered "ntb_msg_event()" callback
> > > method. When callback method "ntb_msg_event()" the passed message is discarded.
> > 
> > When an interrupt arrives, can you signal the upper layer that a message has arrived, without delivering the message?  I think the lower layer can do without the work structs, instead have the same body of the work struct run in the context of the upper layer polling to receive the message.
> > 
> 
> Of course we can. I could create a method like ntb_msg_read() instead of passing a message to the
> callback, but I didn't do so because if next message interrupt arrives while the previous message
> still has not been read, how a client driver would find out which message caused the last
> interrupt? Thats why I prefer to pass the new message to the callback, so if a client drivers wants
> to keep track of all the received methods, then it can create it's own queue.
> 
> Regarding the rest of the comment. The upper layer will have to implement the work struct anyway.
> Why do we need to copy that code everywhere if it can be common for all the drivers? Still keep in
> mind, that the incoming message registers is the register, that must be freed as fast as possible,
> since another peer device can be waiting for it to be freed. So it's better to read it in the
> hardware driver, than to let it being done by the unreliable client.
> 
> > > > It looks like there was some rearranging of code, so big hunks appear to be added or
> > > removed.  Can you split this into two (or more) patches so that rearranging the code is
> > > distinct from more interesting changes?
> > > >
> > > 
> > > Lets say there was not much rearranging here. I've just put link-related method before
> > > everything else. The rearranging was done from the point of methods importance view. There
> > > can't be any memory sharing and doorbells operations done before the link is established.
> > > The new arrangements is reflected in ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops()
> > > methods.
> > 
> > It's unfortunate how the diff captured the changes.  Can you split this up into smaller patches?
> > 
> 
> Lets settle the rest of the things down before doing this. If we don't then it would be just
> a waste of time.
> 
> > > > > - * ntb_mw_get_range() - get the range of a memory window
> > > > > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> > > >
> > > > What was insufficient about ntb_mw_get_range() that it needed to be split into
> > > ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch,
> > > it seems ntb_mw_get_range() would have been more simple.
> > > >
> > > > I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].
> > > So, there is no example of how usage of new api would be used differently or more
> > > efficiently than ntb_mw_get_range() for async devices.
> > > >
> > > 
> > > This concern is answered a bit earlier, when you first commented the method
> > > "ntb_mw_get_range()" splitting.
> > > 
> > > You could not find the "ntb_mw_get_mapsrc()" method usage because you misspelled it. The
> > > real method signature is "ntb_mw_get_maprsc()" (look more carefully at the name ending),
> > > which is decrypted as "Mapping Resources", but no "Mapping Source". ntb/test/ntb_mw_test.c
> > > driver is developed to demonstrate how the new asynchronous API is utilized including the
> > > "ntb_mw_get_maprsc()" method usage.
> > 
> > Right, I misspelled it.  It would be easier to catch a misspelling of ragne.
> > 
> > [PATCH v2 3/3]:
> > +		/* Retrieve the physical address of the memory to map */
> > +		ret = ntb_mw_get_maprsc(ntb, mwindx, &outmw->phys_addr,
> > +			&outmw->size);
> > +		if (SUCCESS != ret) {
> > +			dev_err_mw(ctx, "Failed to get map resources of "
> > +				"outbound window %d", mwindx);
> > +			mwindx--;
> > +			goto err_unmap_rsc;
> > +		}
> > +
> > +		/* Map the memory window resources */
> > +		outmw->virt_addr = ioremap_nocache(outmw->phys_addr, outmw->size);
> > +
> > +		/* Retrieve the memory windows maximum size and alignments */
> > +		ret = ntb_mw_get_align(ntb, mwindx, &outmw->addr_align,
> > +			&outmw->size_align, &outmw->size_max);
> > +		if (SUCCESS != ret) {
> > +			dev_err_mw(ctx, "Failed to get alignment options of "
> > +				"outbound window %d", mwindx);
> > +			goto err_unmap_rsc;
> > +		}
> > 
> > It looks to me like ntb_mw_get_range() would have been sufficient here.  If the change is required by the new driver, please show evidence of that.  If this change is not required by the new hardware, please submit the change as a separate patch.
> > 
> 
> Please, see the comments before.
> 
> > > > I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the
> > > following make sense, or have I completely misunderstood something?
> > > >
> > > > ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are
> > > translated to the local memory destination.
> > > >
> > > > ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory
> > > window (is this something that needs to be configured on the local ntb?) are translated to
> > > the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of
> > > ntb_mw_set_trans() will complete the translation to the peer memory destination.
> > > >
> > > 
> > > These functions actually do the opposite you described:
> > 
> > That's the point.  I noticed that they are opposite.
> > 
> > > ntb_mw_set_trans() - method sets the translated base address retrieved from a peer, so
> > > outgoing writes to a memory window would be translated and reach the peer memory
> > > destination.
> > 
> > In other words, this affects the translation of writes in the direction of the peer memory.  I think this should be named ntb_peer_mw_set_trans().
> > 
> 
> Please, see the big comment with illustrations provided before.
> 
> > > ntb_peer_mw_set_trans() - method sets translated base address to peer configuration space,
> > > so the local incoming writes would be correctly translated on the peer and reach the local
> > > memory destination.
> > 
> > In other words, this affects the translation for writes in the direction of local memory.  I think this should be named ntb_mw_set_trans().
> > 
> 
> Please, see the big comment with illustrations provided before.
> 
> > > Globally thinking, these methods do the same think, when they called from opposite
> > > domains. So to speak locally called "ntb_mw_set_trans()" method does the same thing as the
> > > method "ntb_peer_mw_set_trans()" called from a peer, and vise versa the locally called
> > > method "ntb_peer_mw_set_trans()" does the same procedure as the method
> > > "ntb_mw_set_trans()" called from a peer.
> > > 
> > > To make things simpler, think of memory windows in the framework of the next definition:
> > > "Memory Window is a virtual memory region, which locally reflects a physical memory of
> > > peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory
> > > window, so the locally mapped virtual addresses would be connected with the peer physical
> > > memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory
> > > region, so the peer could successfully perform a writes to our local physical memory.
> > > 
> > > Of course all the actual memory read/write operations should follow up ntb_mw_get_maprsc()
> > > and ioremap_nocache() method invocation doublet. You do the same thing in the client test
> > > drivers for AMD and Intel hadrware.
> > > 
> > 
> > > > >  /**
> > > > > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64
> > > db_bits)
> > > > >   * append one additional dma memory copy with the doorbell register as the
> > > > >   * destination, after the memory copy operations.
> > > > >   *
> > > > > + * This is unusual, and hardware may not be suitable to implement it.
> > > > > + *
> > > >
> > > > Why is this unusual?  Do you mean async hardware may not support it?
> > > >
> > > 
> > > Of course I can always return an address of a Doorbell register, but it's not safe to do
> > > it working with IDT NTB hardware driver. To make thing explained simpler think a IDT
> > > hardware, which supports the Doorbell bits routing. Each local inbound Doorbell bits of
> > > each port can be configured to either reflect the global switch doorbell bits state or not
> > > to reflect. Global doorbell bits are set by using outbound doorbell register, which is
> > > exist for every NTB port. Primary port is the port which can have an access to multiple
> > > peers, so the Primary port inbound and outbound doorbell registers are shared between
> > > several NTB devices, sited on the linux kernel NTB bus. As you understand, these devices
> > > should not interfere each other, which can happen on uncontrollable usage of Doorbell
> > > registers addresses. That's why the method cou "ntb_peer_db_addr()" should not be
> > > developed for the IDT NTB hardware driver.
> > 
> > I misread the diff as if this comment was added to the description of ntb_db_clear_mask().
> > 
> > > > > +	if (!ntb->ops->spad_count)
> > > > > +		return -EINVAL;
> > > > > +
> > > >
> > > > Maybe we should return zero (i.e. there are no scratchpads).
> > > >
> > > 
> > > Agreed. I will fix it in the next patchset.
> > 
> > Thanks.
> > 
> > > > > +	if (!ntb->ops->spad_read)
> > > > > +		return 0;
> > > > > +
> > > >
> > > > Let's return ~0.  I think that's what a driver would read from the pci bus for a memory
> > > miss.
> > > >
> > > 
> > > Agreed. I will make it returning -EINVAL in the next patchset.
> > 
> > I don't think we should try to interpret the returned value as an error number.  If the driver supports this method, and this is a valid scratchpad, the peer can put any value in i, including a value that could be interpreted as an error number.
> > 
> > A driver shouldn't be using this method if it isn't supported.  But if it does, I think ~0 is a better poison value than 0.  I just don't want to encourage drivers to try to interpret this value as an error number.
> > 
> 
> Understood. The method will return ~0 in the next patchset.
> 
> > > > > +	if (!ntb->ops->peer_spad_read)
> > > > > +		return 0;
> > > >
> > > > Also, ~0?
> > > >
> > > 
> > > Agreed. I will make it returning -EINVAL in the next patchset.
> > 
> > I don't think we should try to interpret the returned value as an error number.
> > 
> 
> Understood. The method will return ~0 in the next patchset.
> 
> > > > > + * ntb_msg_post() - post the message to the peer
> > > > > + * @ntb:	NTB device context.
> > > > > + * @msg:	Message
> > > > > + *
> > > > > + * Post the message to a peer. It shall be delivered to the peer by the
> > > > > + * corresponding hardware method. The peer should be notified about the new
> > > > > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > > > > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > > > > + * event. Otherwise the NTB_MSG_SENT is emitted.
> > > >
> > > > Interesting.. local driver would be notified about completion (success or failure) of
> > > delivery.  Is there any order-of-completion guarantee for the completion notifications?
> > > Is there some tolerance for faults, in case we never get a completion notification from
> > > the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link
> > > comes up again, can we still get a completion notification from the peer, and how would
> > > that be handled?
> > > >
> > > > Does delivery mean the application has processed the message, or is it just delivery at
> > > the hardware layer, or just delivery at the ntb hardware driver layer?
> > > >
> > > 
> > > Let me explain how the message delivery works. When a client driver calls the
> > > "ntb_msg_post()" method, the corresponding message is placed in an outbound messages
> > > queue. Such the message queue exists for every peer device. Then a dedicated kernel work
> > > thread is started to send all the messages from the queue.
> > 
> > Can we handle the outbound messages queue in an upper layer thread, too, instead of a kernel thread in this low level driver?  I think if we provide more direct access to the hardware semantics of the message registers, we will end up with something like the following, which will also simplify the hardware driver.  Leave it to the upper layer to schedule message processing after receiving an event.
> > 
> > ntb_msg_event(): we received a hardware interrupt for messages. (don't read message status, or anything else)
> > 
> > ntb_msg_status_read(): read and return MSGSTS bitmask (like ntb_db_read()).
> > ntb_msg_status_clear(): clear bits in MSGSTS bitmask (like ntb_db_clear()).
> > 
> > ntb_msg_mask_set(): set bits in MSGSTSMSK (like ntb_db_mask_set()).
> > ntb_msg_mask_clear(): clear bits in MSGSTSMSK (like ntb_db_mask_clear()).
> > 
> > ntb_msg_recv(): read and return INMSG and INMSGSRC of the indicated message index.
> > ntb_msg_send(): write the outgoing message register with the message.
> > 
> 
> I think such an API would make the interface too complicated. The messaging is intended to be
> simple for just sharing a small amount of information. Primarily for sending a translation
> address. I would prefer to leave API as it is, since it covers all the application needs.
> 
> > > If kernel thread failed to send
> > > a message (for instance, if the peer IDT NTB hardware driver still has not freed its
> > > inbound message registers), it performs a new attempt after a small timeout. If after a
> > > preconfigured number of attempts the kernel thread still fails to delivery the message, it
> > > invokes ntb_msg_event() callback with NTB_MSG_FAIL event. If the message is successfully
> > > delivered, then the method ntb_msg_event() is called with NTB_MSG_SENT event.
> > 
> > In other words, it was delivered to the peer NTB hardware, and the peer NTB hardware accepted the message into an available register.  It does not mean the peer application processed the message, or even that the peer driver received an interrupt for the message?
> > 
> 
> Of course it doesn't mean, that the application processed the message, but it does mean that the
> peer hardware raised the MSI interrupt, if the interrupt was enabled.
> 
> > > 
> > > To be clear the messsages are transfered directly to the peer memory, but instead they are
> > > placed in the IDT NTB switch registers, then peer is notified about a new message arrived
> > > at the corresponding message registers and the corresponding interrupt handler is called.
> > > 
> > > If we loose the PCI express or NTB link between the IDT switch and a peer, then the
> > > ntb_msg_event() method is called with NTB_MSG_FAIL event.
> > 
> > Byzantine fault is an unsolvable class of problem, so it is important to be clear exactly what is supposed to be guaranteed at each layer.  If we get a hardware ACK that the message was delivered, that means it was delivered to the NTB hardware register, but no further.  If we do not get a hardware NAK(?), that means it was not delivered.  If the link fails or we time out waiting for a completion, we can only guess that it wasn't delivered even though there is a small chance it was.  Applications need to be tolerant either way, and needs may be different depending on the application.  I would rather not add any fault tolerance (other than reporting faults) at this layer that is not already implemented in the hardware.
> > 
> > Reading the description of OUTMSGSTS register, it is clear that we can receive a hardware NAK if an outgoing message failed.  It's not clear to me that IDT will notify any kind of ACK that an outgoing message was accepted.  If an application wants to send two messages, it can send the first, check the bit and see there is no failure.  Does reading the status immediately after sending guarantee the message WAS delivered (i.e. IDT NTB hardware blocks reading the status register while there are messages in flight)?  If not, if the application sends the second message and then sees a failure, how can the application be sure the failure is not for the first message?  Does the application have to wait some time (how long?) before checking the message status?
> > 
> 
> Ok, I think I need to explain it carefully. When a local Root Complex sends a message to a peer,
> it writes a message to its outgoing message registers, which are connected with the peer incoming
> message registers. If that incoming message registers are still full, so peer has not emptied them
> by clearing a corresponding bit, then a local Root Complex gets a so called NACK, on the other
> words it failed to send a message. Then it tries to send the message again and again before a next
> attempt is either succeeded or a limited number of attempts is exceeded. Last one would lead to
> rising a NTB_MSG_FAIL event.
> 
> On the other hand before sending a message IDT driver checks whether the NTB link is up, if it isn't
> then it raises NTB_MSG_FAIL event.
> 
> After all the discussions I am starting to realize what is the problem. The problem is that we
> might have differently pictures of how NTB hardware are connected.) Traditional Intel/AMD NTB
> hardware are directly connected to each other so there is only one PCIe-link between two Root
> Complexes, but IDT bridge is kind of single intermediate device, which has at least two NTB ports
> connected to Root Complexes by different PCIe-links. So when one sends a message to another and
> it's accepted by hardware, then the message was put to the incoming message register of the IDT
> opposite port, and the peer Root Complex is just notified that a new message has arrived.
> 
> > > 
> > > Finally, I've answered to all the questions. Hopefully the things look clearer now.
> > > 
> > > Regards,
> > > -Sergey
> > 
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
  2016-08-08 21:48 ` Allen Hubbe
@ 2016-08-18 21:56   ` Serge Semin
  -1 siblings, 0 replies; 12+ messages in thread
From: Serge Semin @ 2016-08-18 21:56 UTC (permalink / raw)
  To: Allen Hubbe
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

Hello Allen,
Sorry for the delay with response and thanks for thoughtful review.

On Mon, Aug 08, 2016 at 05:48:42PM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> From: Serge Semin
> > Hello Allen.
> > 
> > Thanks for your careful review. Going through this mailing thread I hope we'll come up
> > with solutions, which improve the driver code as well as extend the Linux kernel support
> > of new devices like IDT PCIe-swtiches.
> > 
> > Before getting to the inline commentaries I need to give some introduction to the IDT NTB-
> > related hardware so we could speak on the same language. Additionally I'll give a brief
> > explanation how the setup of memory windows works in IDT PCIe-switches.
> 
> I found this to use as a reference for IDT:
> https://www.idt.com/document/man/89hpes24nt24g2-device-user-manual

Yes, it's supported by the IDT driver, although I am using a device with lesser number of ports:
https://www.idt.com/document/man/89hpes32nt8ag2-device-user-manual

> 
> > First of all, before getting into the IDT NTB driver development I had made a research of
> > the currently developed NTB kernel API and AMD/Intel hardware drivers. Due to lack of the
> > hardware manuals It might be not in deep details, but I understand how the AMD/Intel NTB-
> > hardware drivers work. At least I understand the concept of memory windowing, which led to
> > the current NTB bus kernel API.
> > 
> > So lets get to IDT PCIe-switches. There is a whole series of NTB-related switches IDT
> > produces. All of them I split into two distinct groups:
> > 1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
> > 2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2,
> > 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
> > Just to note all of these switches are a part of IDT PRECISE(TM) family of PCI Express®
> > switching solutions. Why do I split them up? Because of the next reasons:
> > 1) Number of upstream ports, which have access to NTB functions (obviously, yeah? =)). So
> > the switches of the first group can connect just two domains over NTB. Unlike the second
> > group of switches, which expose a way to setup an interaction between several PCIe-switch
> > ports, which have NT-function activated.
> > 2) The groups are significantly distinct by the way of NT-functions configuration.
> > 
> > Before getting further, I should note, that the uploaded driver supports the second group
> > of devices only. But still I'll give a comparative explanation, since the first group of
> > switches is very similar to the AMD/Intel NTBs.
> > 
> > Lets dive into the configurations a bit deeper. Particularly NT-functions of the first
> > group of switches can be configured the same way as AMD/Intel NTB-functions are. There is
> > an PCIe end-point configuration space, which fully reflects the cross-coupled local and
> > peer PCIe/NTB settings. So local Root complex can set any of the peer registers by direct
> > writing to mapped memory. Here is the image, which perfectly explains the configuration
> > registers mapping:
> > https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
> > Since the first group switches connect only two root complexes, the race condition of
> > read/write operations to cross-coupled registers can be easily resolved just by roles
> > distribution. So local root complex sets the translated base address directly to a peer
> > configuration space registers, which correspond to BAR0-BAR3 locally mapped memory
> > windows. Of course 2-4 memory windows is enough to connect just two domains. That's why
> > you made the NTB bus kernel API the way it is.
> > 
> > The things get different when one wants to have an access from one domain to multiple
> > coupling up to eight root complexes in the second group of switches. First of all the
> > hardware doesn't support the configuration space cross-coupling anymore. Instead there are
> > two Global Address Space Access registers provided to have an access to a peers
> > configuration space. In fact it is not a big problem, since there are no much differences
> > in accessing registers over a memory mapped space or a pair of fixed Address/Data
> > registers. The problem arises when one wants to share a memory windows between eight
> > domains. Five BARs are not enough for it even if they'd be configured to be of x32 address
> > type. Instead IDT introduces Lookup table address translation. So BAR2/BAR4 can be
> > configured to translate addresses using 12 or 24 entries lookup tables. Each entry can be
> > initialized with translated base address of a peer and IDT switch port, which peer is
> > connected to. So when local root complex locally maps BAR2/BAR4, one can have an access to
> > a memory of a peer just by reading/writing with a shift corresponding to the lookup table
> > entry. That's how more than five peers can be accessed. The root problem is the way the
> > lookup table is accessed. Alas It is accessed only by a pair of "Entry index/Data"
> > registers. So a root complex must write an entry index to one registers, then read/write
> > data from another. As you might realise, that weak point leads to a race condition of
> > multiple root complexes accessing the lookup table of one shared peer. Alas I could not
> > come up with a simple and strong solution of the race.
> 
> Right, multiple peers reaching across to some other peer's NTB configuration space is problematic.  I don't mean to suggest we should reach across to configure the lookup table (or anything else) on a remote NTB.

Good, we settled this down.

> 
> > That's why I've introduced the asynchronous hardware in the NTB bus kernel API. Since
> > local root complex can't directly write a translated base address to a peer, it must wait
> > until a peer asks him to allocate a memory and send the address back using some of a
> > hardware mechanism. It can be anything: Scratchpad registers, Message registers or even
> > "crazy" doorbells bingbanging. For instance, the IDT switches of the first group support:
> > 1) Shared Memory windows. In particular local root complex can set a translated base
> > address to BARs of local and peer NT-function using the cross-coupled PCIe/NTB
> > configuration space, the same way as it can be done for AMD/Intel NTBs.
> > 2) One Doorbell register.
> > 3) Two Scratchpads.
> > 4) Four message regietsrs.
> > As you can see the switches of the first group can be considered as both synchronous and
> > asynchronous. All the NTB bus kernel API can be implemented for it including the changes
> > introduced by this patch (I would do it if I had a corresponding hardware). AMD and Intel
> > NTBs can be considered both synchronous and asynchronous as well, although they don't
> > support messaging so Scratchpads can be used to send a data to a peer. Finally the
> > switches of the second group lack of ability to initialize BARs translated base address of
> > peers due to the race condition I described before.
> > 
> > To sum up I've spent a lot of time designing the IDT NTB driver. I've done my best to make
> > the IDT driver as much compatible with current design as possible, nevertheless the NTB
> > bus kernel API had to be slightly changed. You can find answers to the commentaries down
> > below.
> > 
> > On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > > From: Serge Semin
> > > > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > > > devices, so translated base address of memory windows can be direcly written
> > > > to peer registers. But there are some IDT PCIe-switches which implement
> > > > complex interfaces using Lookup Tables of translation addresses. Due to
> > > > the way the table is accessed, it can not be done synchronously from different
> > > > RCs, that's why the asynchronous interface should be developed.
> > > >
> > > > For these purpose the Memory Window related interface is correspondingly split
> > > > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > > > is following: "It is a virtual memory region, which locally reflects a physical
> > > > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > > > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > > > memory windows.
> > > > Here is the description of the Memory Window related NTB-bus callback
> > > > functions:
> > > >  - ntb_mw_count() - number of local memory windows.
> > > >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> > > >                          window to map.
> > > >  - ntb_mw_set_trans() - set translation address of local memory window (this
> > > >                         address should be somehow retrieved from a peer).
> > > >  - ntb_mw_get_trans() - get translation address of local memory window.
> > > >  - ntb_mw_get_align() - get alignment of translated base address and size of
> > > >                         local memory window. Additionally one can get the
> > > >                         upper size limit of the memory window.
> > > >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> > > >                          local number).
> > > >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> > > >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> > > >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> > > >                              of peer memory window.Additionally one can get the
> > > >                              upper size limit of the memory window.
> > > >
> > > > As one can see current AMD and Intel NTB drivers mostly implement the
> > > > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > > > driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> > > > since it doesn't have convenient access to the peer Lookup Table.
> > > >
> > > > In order to pass information from one RC to another NTB functions of IDT
> > > > PCIe-switch implement Messaging subsystem. They currently support four message
> > > > registers to transfer DWORD sized data to a specified peer. So there are two
> > > > new callback methods are introduced:
> > > >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> > > >                     and receive messages
> > > >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> > > >                     to a peer
> > > > Additionally there is a new event function:
> > > >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> > > >                      (NTB_MSG_NEW), or last message was successfully sent
> > > >                      (NTB_MSG_SENT), or the last message failed to be sent
> > > >                      (NTB_MSG_FAIL).
> > > >
> > > > The last change concerns the IDs (practically names) of NTB-devices on the
> > > > NTB-bus. It is not good to have the devices with same names in the system
> > > > and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> > > > algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> > > > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > > > devices supporting both interfaces.
> > >
> > > Thanks for the work that went into writing this driver, and thanks for your patience
> > with the review.  Please read my initial comments inline.  I would like to approach this
> > from a top-down api perspective first, and settle on that first before requesting any
> > specific changes in the hardware driver.  My major concern about these changes is that
> > they introduce a distinct classification for sync and async hardware, supported by
> > different sets of methods in the api, neither is a subset of the other.
> > >
> > > You know the IDT hardware, so if any of my requests below are infeasible, I would like
> > your constructive opinion (even if it means significant changes to existing drivers) on
> > how to resolve the api so that new and existing hardware drivers can be unified under the
> > same api, if possible.
> > 
> > I understand your concern. I have been thinking of this a lot. In my opinion the proposed
> > in this patch alterations are the best of all variants I've been thinking about. Regarding
> > the lack of APIs subset. In fact I would not agree with that. As I described in the
> > introduction AMD and Intel drivers can be considered as both synchronous and asynchronous,
> > since a translated base address can be directly set in a local and peer configuration
> > space. Although AMD and Intel devices don't support messaging, they have Scratchpads,
> > which can be used to exchange an information between root complexes. The thing we need to
> > do is to implement ntb_mw_set_trans() and ntb_mw_get_align() for them. Which isn't much
> > different from the "mw_peer"-prefixed ones. The first method just sets a translated base
> > address to the corresponding local register. The second one does exactly the same as
> > "mw_peer"-prefixed ones. I would do it, but I haven't got a hardware to test, that's why I
> > left things the way it was with just slight changes of names.
> 
> It sounds like the purpose of your ntb_mw_set_trans() [what I would call ntb_peer_mw_set_trans()] is similar to what is done at initialization time in the Intel NTB driver, so that outgoing writes are translated to the correct peer NTB BAR.  The difference is that IDT outgoing translation sets not only the peer NTB address but also the port number in the translation.
> http://lxr.free-electrons.com/source/drivers/ntb/hw/intel/ntb_hw_intel.c?v=4.7#L1673
> 
> It would be interesting to allow ntb clients to change this translation, eg, configure an outgoing write from local BAR23 so it hits peer secondary BAR45.  I don't think e.g. Intel driver should be forced to implement that, but it would be interesting to think of unifying the api with that in mind.

I already said I'm not an expert of Intel and AMD hardware, moreover I don't even have any
reference manual to study it. But from the first glance it's not. It doesn't concern any
of peer BARs. As far as I can judge by the Intel driver code, the initialization code
specifies some fixed translation address so to get access to some memory space of a remote
bridge. According to my observation b2b configuration looks more like so called Punch-through
configuration in the IDT definitions. It's when two bridges are connected to each other.
But I may be wrong. Although it doesn't matter at the moment.

It's much easier to explain how it works using an illustrations, otherwise we'll be discussing
this matter forever.

Lets start from definition what Memory Window mean. As I already said: "Memory Window is a
virtual memory region, which locally reflects a physical memory of peer/remote device."

Next suppose we've got two 32-bits Root Complexes (RC0 and RC1) connected to each other over
an NTB. It doesn't matter whether it's an IDT or Intel/AMD like NTBs. The NTB device has two
ports: Pn and Pm, each port is connected to its own Root Complex. There are doorbells,
scratchpads, and of course memory windows. Each Root Complex allocates a memory buffer:
Buffer A and Buffer B. Additionally RC0 and RC1 maps memory windows at the corresponding
addresses: MW A and MW B. Here is how it schematically looks:
https://s3.postimg.org/so3zg0car/memory_windows_before.jpg

According to your NTB Kernel API naming (see the figure), methods are supposed to be
syntactically split into two: with "ntb_peer_" prefix and without one. And they are correctly
split for doorbells and scratchpads, but when it comes to memory windows, the method names
syntax is kind of messed up.

Keeping in mind the definition of memory windows I introduced before, your ntb_mw_*_trans()
methods set/get translation base address to "BARm XLAT", so the ones memory window would be
correctly connected with Buffer A. But the function doesn't have "ntb_peer_mw" prefix, which
does look confusing, since it works with peer configuration registers, particularly with the
peer translation address of BARm - MW B.

Finally your ntb_mw_get_range() returns information about two opposite sides.
"Alignment"-related arguments return align of translated base address of the peer, but "base"
and "size" arguments are related with virtual address of the local memory window, which has
nothing related with the peer memory window and its translated base address.

My idea was to fix this syntax incorrectness, so the memory windows NTB Kernel API would look
the same way as doorbell and scratchpad ones. Here is the illustration, how it works now:
https://s3.postimg.org/52mvtfpgz/memory_windows_after.jpg

As you can see, the "ntb_peer_mw_" prefixed methods are related with the peer configurations
only, so ntb_peer_mw_*_trans() set/get translation base address of the peer memory windows
and ntb_peer_mw_get_align() return alignment of that address. Methods with no "ntb_peer_mw_"
prefix, do the same thing but with translation address of the local memory window.
Additionally ntb_mw_get_maprsc() return a physical address of local memory window to 
correspondingly map it.

In the same way the Memory Window API can be split into synchronous, asynchronous and both.
If hardware (like Intel/AMD) allows to implement "ntb_peer_mw" prefixed methods (see methods
marked with blue ink on the last figure), then it is considered synchronous, since it can
directly specify translation base addresses to the peer memory windows. If hardware supports
the "ntb_mw" prefixed methods only (purple ink on the figure), then it is considered as
asynchronous, so a client driver must somehow retrieve a translation address for local
memory window. Method ntb_mw_get_maprsc() must be supported by both hardware (it is marked
with green ink). Of course there are hardware, which can support both synchronous and
asynchronous API, like Intel/AMD. IDT PCIe-bridge doesn't safely support it since Lookup
translation tables access method.

I hope we got it settled now. If not, We can have a Skype conversation, since writing such a
long letters takes lot of time.

> 
> > 
> > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > > >
> > > > ---
> > > >  drivers/ntb/Kconfig                 |   4 +-
> > > >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> > > >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> > > >  drivers/ntb/ntb.c                   |  86 +++++-
> > > >  drivers/ntb/ntb_transport.c         |  19 +-
> > > >  drivers/ntb/test/ntb_perf.c         |  16 +-
> > > >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> > > >  drivers/ntb/test/ntb_tool.c         |  25 +-
> > > >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> > > >  9 files changed, 701 insertions(+), 162 deletions(-)
> > > >
> 
> 
> > > > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > > > -				      &mw->xlat_align, &mw->xlat_align_size);
> > > > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > > > +		if (rc)
> > > > +			goto err1;
> > > > +
> > > > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > > > +					   &mw->xlat_align_size, NULL);
> > >
> > > Looks like ntb_mw_get_range() was simpler before the change.
> > >
> > 
> > If I didn't change NTB bus kernel API, I would have split them up anyway. First of all
> > functions with long argument list look more confusing, than ones with shorter list. It
> > helps to stick to the "80 character per line" rule and improves readability. Secondly the
> > function splitting improves the readability of the code in general. When I first saw the
> > function name "ntb_mw_get_range()", it was not obvious what kind of ranges this function
> > returned. The function lacked of "high code coherence" unofficial rule. It is better when
> > one function does one coherent thing and return a well coherent data. Particularly
> > function "ntb_mw_get_range()" returned a local memory windows mapping address and size, as
> > well as alignment of memory allocated for a peer. So now "ntb_mw_get_maprsc()" method
> > returns mapping resources. If local NTB client driver is not going to allocate any memory,
> > so one just doesn't need to call "ntb_peer_mw_get_align()" method at all. I understand,
> > that a client driver could pass NULL to a unused arguments of the "ntb_mw_get_range()",
> > but still the new design is better readable.
> > 
> > Additionally I've split them up because of the difference in the way the asynchronous
> > interface works. IDT driver can not safely perform ntb_peer_mw_set_trans(), that's why I
> > had to add ntb_mw_set_trans(). Each of that method should logically have related
> > "ntb_*mw_get_align()" method. Method ntb_mw_get_align() shall give to a local client
> > driver a hint how the retrieved from the peer translated base address should be aligned,
> > so ntb_mw_set_trans() method would successfully return. Method ntb_peer_mw_get_align()
> > will give a hint how the local memory buffer should be allocated to fulfil a peer
> > translated base address alignment. In this way it returns restrictions for parameters of
> > "ntb_peer_mw_set_trans()".
> > 
> > Finally, IDT driver is designed so Primary and Secondary ports can support a different
> > number of memory windows. In this way methods
> > "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" have
> > different range of acceptable values of the second argument, which is determined by the
> > "ntb_mw_count()" method, comparing to methods
> > "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", which memory
> > windows index restriction is determined by the "ntb_peer_mw_count()" method.
> > 
> > So to speak the splitting was really necessary to make the API looking more logical.
> 
> If this change is not required by the new hardware, please submit the change as a separate patch.
> 

It's required. See the previous comment.

> > > > +	/* Synchronous hardware is only supported */
> > > > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > >
> > > It would be nice if both types could be supported by the same api.
> > >
> > 
> > Yes, it would be. Alas it isn't possible in general. See the introduction to this letter.
> > AMD and Intel devices support asynchronous interface, although they lack of messaging
> > mechanism.
> 
> What is the prototypical application of the IDT message registers?
> 
> I'm thinking they will be the first thing available to drivers, and so one primary purpose will be to exchange information for configuring memory windows.  Can you describe how a cluster of eight nodes would discover each other and initialize?
> 
> Are they also intended to be useful beyond memory window initialization?  How should they be used efficiently, so that the application can minimize in particular read operations on the pci bus (reading ntb device registers)?  Or are message registers not intended to be used in low latency communications (for that, use doorbells and memory instead)?
> 

The prototypical application of the message registers is to exchange a small portion of
information, like translation base address for example. Imagine IDT hardware provides just
four 32-bits wide message registers. So a driver software can transfer a message ID, memory
window index and a translation address using such a small buffer. The message registers can't
be efficiently used for exchanging of any bigger data. One should use doorbells and memory
windows instead.

I'm not a mind reader, but still supposably IDT provided them as a synchronous exchange of
Scratchpads. Message registers are designed so it's impossible to send a message to a peer
before one read a previous message. Such a design is really helpful when we need to connect
few different nodes and pass information between each other. Scratchpads would lead to too
much complications.

> > 
> > Getting back to the discussion, we still need to provide a way to determine which type of
> > interface an NTB device supports: synchronous/asynchronous translated base address
> > initialization, Scratchpads and memory windows. Currently it can be determined by the
> > functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand, that it's not
> > the best solution. We can implement the traditional Linux kernel bus device-driver
> > matching, using table_ids and so on. For example, each hardware driver fills in a table
> > with all the functionality it supports, like: synchronous/asynchronous memory windows,
> > Doorbells, Scratchpads, Messaging. Then driver initialize a table of functionality it
> > uses. NTB bus core implements a "match()" callback, which compares those two tables and
> > calls "probe()" callback method of a driver when the tables successfully matches.
> > 
> > On the other hand, we might don't have to comprehend the NTB bus core. We can just
> > introduce a table_id for NTB hardware device, which would just describe the device vendor
> > itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. Client driver will declare a
> > supported device by its table_id. It might look easier, since
> 
> emphasis added:
> 
> > the client driver developer
> > should have a basic understanding of the device one develops a driver for.
> 
> This is what I'm hoping to avoid.  I would like to let the driver developer write for the api, not for the specific device.  I would rather the driver check "if feature x is supported" instead of "this is a sync or async device."
> 

Ok. We can implement "features checking methods" like:
ntb_valid_link_ops(),
ntb_valid_peer_db_ops(),
ntb_valid_db_ops(),
ntb_valid_peer_spad_ops(),
ntb_valid_spad_ops(),
ntb_valid_msg_ops(),
ntb_valid_peer_mw_ops(),
ntb_valid_mw_ops().

But I am not fan of calling all of those methods in every client drivers. I would rather develop
an "NTB Device - Client Driver" matching method in the framework of NTB bus. For example,
developer creates a client driver using Doorbells (ntb_valid_peer_db_ops/ntb_valid_db_ops),
Messages (ntb_valid_msg_ops) and Memory Windows (ntb_valid_peer_mw_ops/ntb_valid_mw_ops). Then one
declares that the driver requires the corresponding features, somewhere in the struct ntb_client,
like it's usually done in the "compatible" fields of matching id_tables of drivers (see SPI, PCI,
i2c and others), but we would call it like "feature_table" with "compatible" fields. Of course
every hardware driver would declare, which kind of features one supports. Then the NTB bus
"match()" callback method checks whether the registered device supports all features the client
driver claims. If it does, then and only then the "probe()" method of the client driver is called.

Of course, it's your decision which design to use, I am just giving a possible solutions. But the
last one gives better unification with general "Bus - Device - Driver" design of the Linux Kernel.

> > Then NTB bus
> > kernel API core will simply match NTB devices with drivers like any other buses (PCI,
> > PCIe, i2c, spi, etc) do.
> > 
> 
> > > > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > > > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > > > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> > >
> > > I understand why IDT requires a different api for dealing with addressing multiple
> > peers.  I would be interested in a solution that would allow, for example, the Intel
> > driver fit under the api for dealing with multiple peers, even though it only supports one
> > peer.  I would rather see that, than two separate apis under ntb.
> > >
> > > Thoughts?
> > >
> > > Can the sync api be described by some subset of the async api?  Are there less
> > overloaded terms we can use instead of sync/async?
> > >
> > 
> > Answer to this concern is mostly provided in the introduction as well. I'll repeat it here
> > in details. As I said AMD and Intel hardware support asynchronous API except the
> > messaging. Additionally I can even think of emulating messaging using Doorbells and
> > Scratchpads, but not the other way around. Why not? Before answering, here is how the
> > messaging works in IDT switches of both first and second groups (see introduction for
> > describing the groups).
> > 
> > There are four outbound and inbound message registers for each NTB port in the device.
> > Local root complex can connect its any outbound message to any inbound message register of
> > the IDT switch. When one writes a data to an outbound message register it immediately gets
> > to the connected inbound message registers. Then peer can read its inbound message
> > registers and empty it by clearing a corresponding bit. Then and only then next data can
> > be written to any outbound message registers connected to that inbound message register.
> > So the possible race condition between multiple domains sending a message to same peer is
> > resolved by the IDT switch itself.
> > 
> > One would ask: "Why don't you just wrap the message registers up back to the same port? It
> > would look just like Scratchpads." Yes, It would. But still there are only four message
> > registers. It's not enough to distribute them between all the possibly connected NTB
> > ports. As I said earlier there can be up to eight domains connected, so there must be at
> > least seven message register to fulfil the possible design.
> > 
> > Howbeit all the emulations would look ugly anyway. In my opinion It's better to slightly
> > adapt design for a hardware, rather than hardware to a design. Following that rule would
> > simplify a code and support respectively.
> > 
> > Regarding the APIs subset. As I said before async API is kind of subset of synchronous
> > API. We can develop all the memory window related callback-method for AMD and Intel
> > hardware driver, which is pretty much easy. We can even simulate message registers by
> > using Doorbells and Scratchpads, which is not that easy, but possible. Alas the second
> > group of IDT switches can't implement the synchronous API, as I already said in the
> > introduction.
> 
> Message registers operate fundamentally differently from scratchpads (and doorbells, for that matter).  I think we are in agreement.  It's a pain, but maybe the best we can do is require applications to check for support for scratchpads, message registers, and/or doorbells, before using any of those features.  We already have ntb_db_valid_mask() and ntb_spad_count().
> 

Yes they do. And yes, the client drivers must somehow check whether a matching NTB device
supports all the features they need. See the previous comment how I suppose it can be done.

> I would like to see ntb_msg_count() and more direct access to the message registers in this api.  I would prefer to see the more direct access to hardware message registers, instead of work_struct for message processing in the low level hardware driver.  A more direct interface to the hardware registers would be more like the existing ntb.h api: direct and low-overhead as possible, providing minimal abstraction of the hardware functionality.
> 
> I think there is still hope we can unify the memory window interface.  Even though IDT supports things like subdividing the memory windows with table lookup, and specification of destination ports for outgoing translations, I think we can support the same abstraction in the existing drivers with minimal overhead.
> 
> For existing Intel and AMD drivers, there may be only one translation per memory window (there is no table to subdivide the memory window), and there is only one destination port (the peer).  The Intel and AMD drivers can ignore the table index in setting up the translation (or validate that the requested table index is equal to zero).
> 

In fact we don't need to introduce any of the table index, because the table index you are
talking about is just one peer. Since it is just a peer, then it must refer to a particular
device on the Linux NTB bus. For instance we got eight NTB ports on IDT PCIe-bridge, one
of them is the Primary port. Then Root Complex connected to the Primary port will have seven
devices on Linux NTB bus. Such design perfectly fits to your NTB Kernel API and additionally
will cover all the client driver needs. In this case Primary NTB port would be like an SPI, i2c
adapters or PCI root complex itself with respect to their subsidiary buses.

That's how the IDT hardware driver is designed. It gives transparent operations with NTB
kernel API.

> > Regarding the overloaded naming. The "sync/async" names are the best I could think of. If
> > you have any idea how one can be appropriately changed, be my guest. I would be really
> > glad to substitute them with something better.
> > 
> 
> Let's try to avoid a distinction, first, beyond just saying "not all hardware will support all these features."  If we absolutely have to make a distinction, let's think of better names then.
> 

Ok. We can stick to "featured" hardware. It sounds better.

> > > > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> > > >   * @ntb:	NTB device context.
> > > > + * @ev:		Event type caused the handler invocation
> > > > + * @msg:	Message related to the event
> > > > + *
> > > > + * Notify the driver context that there is some event happaned in the event
> > > > + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> > > > + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> > > > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > > > + * last argument is used to pass the event related message. It discarded right
> > > > + * after the handler returns.
> > > > + */
> > > > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > > > +		   struct ntb_msg *msg);
> > >
> > > I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of
> > the message handling to be done more appropriately at a higher layer of the application.
> > I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I
> > think would be more appropriate for a ntb transport (or higher layer) driver.
> > >
> > 
> > Hmmm, that's how it's done.) MSI interrupt is raised when a new message arrived into a
> > first inbound message register (the rest of message registers are used as an additional
> > data buffers). Then a corresponding tasklet is started to release a hardware interrupt
> > context. That tasklet extracts a message from the inbound message registers, puts it into
> > the driver inbound message queue and marks the registers as empty so the next message
> > could be retrieved. Then tasklet starts a corresponding kernel work thread delivering all
> > new messages to a client driver, which preliminary registered "ntb_msg_event()" callback
> > method. When callback method "ntb_msg_event()" the passed message is discarded.
> 
> When an interrupt arrives, can you signal the upper layer that a message has arrived, without delivering the message?  I think the lower layer can do without the work structs, instead have the same body of the work struct run in the context of the upper layer polling to receive the message.
> 

Of course we can. I could create a method like ntb_msg_read() instead of passing a message to the
callback, but I didn't do so because if next message interrupt arrives while the previous message
still has not been read, how a client driver would find out which message caused the last
interrupt? Thats why I prefer to pass the new message to the callback, so if a client drivers wants
to keep track of all the received methods, then it can create it's own queue.

Regarding the rest of the comment. The upper layer will have to implement the work struct anyway.
Why do we need to copy that code everywhere if it can be common for all the drivers? Still keep in
mind, that the incoming message registers is the register, that must be freed as fast as possible,
since another peer device can be waiting for it to be freed. So it's better to read it in the
hardware driver, than to let it being done by the unreliable client.

> > > It looks like there was some rearranging of code, so big hunks appear to be added or
> > removed.  Can you split this into two (or more) patches so that rearranging the code is
> > distinct from more interesting changes?
> > >
> > 
> > Lets say there was not much rearranging here. I've just put link-related method before
> > everything else. The rearranging was done from the point of methods importance view. There
> > can't be any memory sharing and doorbells operations done before the link is established.
> > The new arrangements is reflected in ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops()
> > methods.
> 
> It's unfortunate how the diff captured the changes.  Can you split this up into smaller patches?
> 

Lets settle the rest of the things down before doing this. If we don't then it would be just
a waste of time.

> > > > - * ntb_mw_get_range() - get the range of a memory window
> > > > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> > >
> > > What was insufficient about ntb_mw_get_range() that it needed to be split into
> > ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch,
> > it seems ntb_mw_get_range() would have been more simple.
> > >
> > > I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].
> > So, there is no example of how usage of new api would be used differently or more
> > efficiently than ntb_mw_get_range() for async devices.
> > >
> > 
> > This concern is answered a bit earlier, when you first commented the method
> > "ntb_mw_get_range()" splitting.
> > 
> > You could not find the "ntb_mw_get_mapsrc()" method usage because you misspelled it. The
> > real method signature is "ntb_mw_get_maprsc()" (look more carefully at the name ending),
> > which is decrypted as "Mapping Resources", but no "Mapping Source". ntb/test/ntb_mw_test.c
> > driver is developed to demonstrate how the new asynchronous API is utilized including the
> > "ntb_mw_get_maprsc()" method usage.
> 
> Right, I misspelled it.  It would be easier to catch a misspelling of ragne.
> 
> [PATCH v2 3/3]:
> +		/* Retrieve the physical address of the memory to map */
> +		ret = ntb_mw_get_maprsc(ntb, mwindx, &outmw->phys_addr,
> +			&outmw->size);
> +		if (SUCCESS != ret) {
> +			dev_err_mw(ctx, "Failed to get map resources of "
> +				"outbound window %d", mwindx);
> +			mwindx--;
> +			goto err_unmap_rsc;
> +		}
> +
> +		/* Map the memory window resources */
> +		outmw->virt_addr = ioremap_nocache(outmw->phys_addr, outmw->size);
> +
> +		/* Retrieve the memory windows maximum size and alignments */
> +		ret = ntb_mw_get_align(ntb, mwindx, &outmw->addr_align,
> +			&outmw->size_align, &outmw->size_max);
> +		if (SUCCESS != ret) {
> +			dev_err_mw(ctx, "Failed to get alignment options of "
> +				"outbound window %d", mwindx);
> +			goto err_unmap_rsc;
> +		}
> 
> It looks to me like ntb_mw_get_range() would have been sufficient here.  If the change is required by the new driver, please show evidence of that.  If this change is not required by the new hardware, please submit the change as a separate patch.
> 

Please, see the comments before.

> > > I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the
> > following make sense, or have I completely misunderstood something?
> > >
> > > ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are
> > translated to the local memory destination.
> > >
> > > ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory
> > window (is this something that needs to be configured on the local ntb?) are translated to
> > the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of
> > ntb_mw_set_trans() will complete the translation to the peer memory destination.
> > >
> > 
> > These functions actually do the opposite you described:
> 
> That's the point.  I noticed that they are opposite.
> 
> > ntb_mw_set_trans() - method sets the translated base address retrieved from a peer, so
> > outgoing writes to a memory window would be translated and reach the peer memory
> > destination.
> 
> In other words, this affects the translation of writes in the direction of the peer memory.  I think this should be named ntb_peer_mw_set_trans().
> 

Please, see the big comment with illustrations provided before.

> > ntb_peer_mw_set_trans() - method sets translated base address to peer configuration space,
> > so the local incoming writes would be correctly translated on the peer and reach the local
> > memory destination.
> 
> In other words, this affects the translation for writes in the direction of local memory.  I think this should be named ntb_mw_set_trans().
> 

Please, see the big comment with illustrations provided before.

> > Globally thinking, these methods do the same think, when they called from opposite
> > domains. So to speak locally called "ntb_mw_set_trans()" method does the same thing as the
> > method "ntb_peer_mw_set_trans()" called from a peer, and vise versa the locally called
> > method "ntb_peer_mw_set_trans()" does the same procedure as the method
> > "ntb_mw_set_trans()" called from a peer.
> > 
> > To make things simpler, think of memory windows in the framework of the next definition:
> > "Memory Window is a virtual memory region, which locally reflects a physical memory of
> > peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory
> > window, so the locally mapped virtual addresses would be connected with the peer physical
> > memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory
> > region, so the peer could successfully perform a writes to our local physical memory.
> > 
> > Of course all the actual memory read/write operations should follow up ntb_mw_get_maprsc()
> > and ioremap_nocache() method invocation doublet. You do the same thing in the client test
> > drivers for AMD and Intel hadrware.
> > 
> 
> > > >  /**
> > > > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64
> > db_bits)
> > > >   * append one additional dma memory copy with the doorbell register as the
> > > >   * destination, after the memory copy operations.
> > > >   *
> > > > + * This is unusual, and hardware may not be suitable to implement it.
> > > > + *
> > >
> > > Why is this unusual?  Do you mean async hardware may not support it?
> > >
> > 
> > Of course I can always return an address of a Doorbell register, but it's not safe to do
> > it working with IDT NTB hardware driver. To make thing explained simpler think a IDT
> > hardware, which supports the Doorbell bits routing. Each local inbound Doorbell bits of
> > each port can be configured to either reflect the global switch doorbell bits state or not
> > to reflect. Global doorbell bits are set by using outbound doorbell register, which is
> > exist for every NTB port. Primary port is the port which can have an access to multiple
> > peers, so the Primary port inbound and outbound doorbell registers are shared between
> > several NTB devices, sited on the linux kernel NTB bus. As you understand, these devices
> > should not interfere each other, which can happen on uncontrollable usage of Doorbell
> > registers addresses. That's why the method cou "ntb_peer_db_addr()" should not be
> > developed for the IDT NTB hardware driver.
> 
> I misread the diff as if this comment was added to the description of ntb_db_clear_mask().
> 
> > > > +	if (!ntb->ops->spad_count)
> > > > +		return -EINVAL;
> > > > +
> > >
> > > Maybe we should return zero (i.e. there are no scratchpads).
> > >
> > 
> > Agreed. I will fix it in the next patchset.
> 
> Thanks.
> 
> > > > +	if (!ntb->ops->spad_read)
> > > > +		return 0;
> > > > +
> > >
> > > Let's return ~0.  I think that's what a driver would read from the pci bus for a memory
> > miss.
> > >
> > 
> > Agreed. I will make it returning -EINVAL in the next patchset.
> 
> I don't think we should try to interpret the returned value as an error number.  If the driver supports this method, and this is a valid scratchpad, the peer can put any value in i, including a value that could be interpreted as an error number.
> 
> A driver shouldn't be using this method if it isn't supported.  But if it does, I think ~0 is a better poison value than 0.  I just don't want to encourage drivers to try to interpret this value as an error number.
> 

Understood. The method will return ~0 in the next patchset.

> > > > +	if (!ntb->ops->peer_spad_read)
> > > > +		return 0;
> > >
> > > Also, ~0?
> > >
> > 
> > Agreed. I will make it returning -EINVAL in the next patchset.
> 
> I don't think we should try to interpret the returned value as an error number.
> 

Understood. The method will return ~0 in the next patchset.

> > > > + * ntb_msg_post() - post the message to the peer
> > > > + * @ntb:	NTB device context.
> > > > + * @msg:	Message
> > > > + *
> > > > + * Post the message to a peer. It shall be delivered to the peer by the
> > > > + * corresponding hardware method. The peer should be notified about the new
> > > > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > > > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > > > + * event. Otherwise the NTB_MSG_SENT is emitted.
> > >
> > > Interesting.. local driver would be notified about completion (success or failure) of
> > delivery.  Is there any order-of-completion guarantee for the completion notifications?
> > Is there some tolerance for faults, in case we never get a completion notification from
> > the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link
> > comes up again, can we still get a completion notification from the peer, and how would
> > that be handled?
> > >
> > > Does delivery mean the application has processed the message, or is it just delivery at
> > the hardware layer, or just delivery at the ntb hardware driver layer?
> > >
> > 
> > Let me explain how the message delivery works. When a client driver calls the
> > "ntb_msg_post()" method, the corresponding message is placed in an outbound messages
> > queue. Such the message queue exists for every peer device. Then a dedicated kernel work
> > thread is started to send all the messages from the queue.
> 
> Can we handle the outbound messages queue in an upper layer thread, too, instead of a kernel thread in this low level driver?  I think if we provide more direct access to the hardware semantics of the message registers, we will end up with something like the following, which will also simplify the hardware driver.  Leave it to the upper layer to schedule message processing after receiving an event.
> 
> ntb_msg_event(): we received a hardware interrupt for messages. (don't read message status, or anything else)
> 
> ntb_msg_status_read(): read and return MSGSTS bitmask (like ntb_db_read()).
> ntb_msg_status_clear(): clear bits in MSGSTS bitmask (like ntb_db_clear()).
> 
> ntb_msg_mask_set(): set bits in MSGSTSMSK (like ntb_db_mask_set()).
> ntb_msg_mask_clear(): clear bits in MSGSTSMSK (like ntb_db_mask_clear()).
> 
> ntb_msg_recv(): read and return INMSG and INMSGSRC of the indicated message index.
> ntb_msg_send(): write the outgoing message register with the message.
> 

I think such an API would make the interface too complicated. The messaging is intended to be
simple for just sharing a small amount of information. Primarily for sending a translation
address. I would prefer to leave API as it is, since it covers all the application needs.

> > If kernel thread failed to send
> > a message (for instance, if the peer IDT NTB hardware driver still has not freed its
> > inbound message registers), it performs a new attempt after a small timeout. If after a
> > preconfigured number of attempts the kernel thread still fails to delivery the message, it
> > invokes ntb_msg_event() callback with NTB_MSG_FAIL event. If the message is successfully
> > delivered, then the method ntb_msg_event() is called with NTB_MSG_SENT event.
> 
> In other words, it was delivered to the peer NTB hardware, and the peer NTB hardware accepted the message into an available register.  It does not mean the peer application processed the message, or even that the peer driver received an interrupt for the message?
> 

Of course it doesn't mean, that the application processed the message, but it does mean that the
peer hardware raised the MSI interrupt, if the interrupt was enabled.

> > 
> > To be clear the messsages are transfered directly to the peer memory, but instead they are
> > placed in the IDT NTB switch registers, then peer is notified about a new message arrived
> > at the corresponding message registers and the corresponding interrupt handler is called.
> > 
> > If we loose the PCI express or NTB link between the IDT switch and a peer, then the
> > ntb_msg_event() method is called with NTB_MSG_FAIL event.
> 
> Byzantine fault is an unsolvable class of problem, so it is important to be clear exactly what is supposed to be guaranteed at each layer.  If we get a hardware ACK that the message was delivered, that means it was delivered to the NTB hardware register, but no further.  If we do not get a hardware NAK(?), that means it was not delivered.  If the link fails or we time out waiting for a completion, we can only guess that it wasn't delivered even though there is a small chance it was.  Applications need to be tolerant either way, and needs may be different depending on the application.  I would rather not add any fault tolerance (other than reporting faults) at this layer that is not already implemented in the hardware.
> 
> Reading the description of OUTMSGSTS register, it is clear that we can receive a hardware NAK if an outgoing message failed.  It's not clear to me that IDT will notify any kind of ACK that an outgoing message was accepted.  If an application wants to send two messages, it can send the first, check the bit and see there is no failure.  Does reading the status immediately after sending guarantee the message WAS delivered (i.e. IDT NTB hardware blocks reading the status register while there are messages in flight)?  If not, if the application sends the second message and then sees a failure, how can the application be sure the failure is not for the first message?  Does the application have to wait some time (how long?) before checking the message status?
> 

Ok, I think I need to explain it carefully. When a local Root Complex sends a message to a peer,
it writes a message to its outgoing message registers, which are connected with the peer incoming
message registers. If that incoming message registers are still full, so peer has not emptied them
by clearing a corresponding bit, then a local Root Complex gets a so called NACK, on the other
words it failed to send a message. Then it tries to send the message again and again before a next
attempt is either succeeded or a limited number of attempts is exceeded. Last one would lead to
rising a NTB_MSG_FAIL event.

On the other hand before sending a message IDT driver checks whether the NTB link is up, if it isn't
then it raises NTB_MSG_FAIL event.

After all the discussions I am starting to realize what is the problem. The problem is that we
might have differently pictures of how NTB hardware are connected.) Traditional Intel/AMD NTB
hardware are directly connected to each other so there is only one PCIe-link between two Root
Complexes, but IDT bridge is kind of single intermediate device, which has at least two NTB ports
connected to Root Complexes by different PCIe-links. So when one sends a message to another and
it's accepted by hardware, then the message was put to the incoming message register of the IDT
opposite port, and the peer Root Complex is just notified that a new message has arrived.

> > 
> > Finally, I've answered to all the questions. Hopefully the things look clearer now.
> > 
> > Regards,
> > -Sergey
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
@ 2016-08-18 21:56   ` Serge Semin
  0 siblings, 0 replies; 12+ messages in thread
From: Serge Semin @ 2016-08-18 21:56 UTC (permalink / raw)
  To: Allen Hubbe
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

Hello Allen,
Sorry for the delay with response and thanks for thoughtful review.

On Mon, Aug 08, 2016 at 05:48:42PM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> From: Serge Semin
> > Hello Allen.
> > 
> > Thanks for your careful review. Going through this mailing thread I hope we'll come up
> > with solutions, which improve the driver code as well as extend the Linux kernel support
> > of new devices like IDT PCIe-swtiches.
> > 
> > Before getting to the inline commentaries I need to give some introduction to the IDT NTB-
> > related hardware so we could speak on the same language. Additionally I'll give a brief
> > explanation how the setup of memory windows works in IDT PCIe-switches.
> 
> I found this to use as a reference for IDT:
> https://www.idt.com/document/man/89hpes24nt24g2-device-user-manual

Yes, it's supported by the IDT driver, although I am using a device with lesser number of ports:
https://www.idt.com/document/man/89hpes32nt8ag2-device-user-manual

> 
> > First of all, before getting into the IDT NTB driver development I had made a research of
> > the currently developed NTB kernel API and AMD/Intel hardware drivers. Due to lack of the
> > hardware manuals It might be not in deep details, but I understand how the AMD/Intel NTB-
> > hardware drivers work. At least I understand the concept of memory windowing, which led to
> > the current NTB bus kernel API.
> > 
> > So lets get to IDT PCIe-switches. There is a whole series of NTB-related switches IDT
> > produces. All of them I split into two distinct groups:
> > 1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
> > 2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2,
> > 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
> > Just to note all of these switches are a part of IDT PRECISE(TM) family of PCI Express�
> > switching solutions. Why do I split them up? Because of the next reasons:
> > 1) Number of upstream ports, which have access to NTB functions (obviously, yeah? =)). So
> > the switches of the first group can connect just two domains over NTB. Unlike the second
> > group of switches, which expose a way to setup an interaction between several PCIe-switch
> > ports, which have NT-function activated.
> > 2) The groups are significantly distinct by the way of NT-functions configuration.
> > 
> > Before getting further, I should note, that the uploaded driver supports the second group
> > of devices only. But still I'll give a comparative explanation, since the first group of
> > switches is very similar to the AMD/Intel NTBs.
> > 
> > Lets dive into the configurations a bit deeper. Particularly NT-functions of the first
> > group of switches can be configured the same way as AMD/Intel NTB-functions are. There is
> > an PCIe end-point configuration space, which fully reflects the cross-coupled local and
> > peer PCIe/NTB settings. So local Root complex can set any of the peer registers by direct
> > writing to mapped memory. Here is the image, which perfectly explains the configuration
> > registers mapping:
> > https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
> > Since the first group switches connect only two root complexes, the race condition of
> > read/write operations to cross-coupled registers can be easily resolved just by roles
> > distribution. So local root complex sets the translated base address directly to a peer
> > configuration space registers, which correspond to BAR0-BAR3 locally mapped memory
> > windows. Of course 2-4 memory windows is enough to connect just two domains. That's why
> > you made the NTB bus kernel API the way it is.
> > 
> > The things get different when one wants to have an access from one domain to multiple
> > coupling up to eight root complexes in the second group of switches. First of all the
> > hardware doesn't support the configuration space cross-coupling anymore. Instead there are
> > two Global Address Space Access registers provided to have an access to a peers
> > configuration space. In fact it is not a big problem, since there are no much differences
> > in accessing registers over a memory mapped space or a pair of fixed Address/Data
> > registers. The problem arises when one wants to share a memory windows between eight
> > domains. Five BARs are not enough for it even if they'd be configured to be of x32 address
> > type. Instead IDT introduces Lookup table address translation. So BAR2/BAR4 can be
> > configured to translate addresses using 12 or 24 entries lookup tables. Each entry can be
> > initialized with translated base address of a peer and IDT switch port, which peer is
> > connected to. So when local root complex locally maps BAR2/BAR4, one can have an access to
> > a memory of a peer just by reading/writing with a shift corresponding to the lookup table
> > entry. That's how more than five peers can be accessed. The root problem is the way the
> > lookup table is accessed. Alas It is accessed only by a pair of "Entry index/Data"
> > registers. So a root complex must write an entry index to one registers, then read/write
> > data from another. As you might realise, that weak point leads to a race condition of
> > multiple root complexes accessing the lookup table of one shared peer. Alas I could not
> > come up with a simple and strong solution of the race.
> 
> Right, multiple peers reaching across to some other peer's NTB configuration space is problematic.  I don't mean to suggest we should reach across to configure the lookup table (or anything else) on a remote NTB.

Good, we settled this down.

> 
> > That's why I've introduced the asynchronous hardware in the NTB bus kernel API. Since
> > local root complex can't directly write a translated base address to a peer, it must wait
> > until a peer asks him to allocate a memory and send the address back using some of a
> > hardware mechanism. It can be anything: Scratchpad registers, Message registers or even
> > "crazy" doorbells bingbanging. For instance, the IDT switches of the first group support:
> > 1) Shared Memory windows. In particular local root complex can set a translated base
> > address to BARs of local and peer NT-function using the cross-coupled PCIe/NTB
> > configuration space, the same way as it can be done for AMD/Intel NTBs.
> > 2) One Doorbell register.
> > 3) Two Scratchpads.
> > 4) Four message regietsrs.
> > As you can see the switches of the first group can be considered as both synchronous and
> > asynchronous. All the NTB bus kernel API can be implemented for it including the changes
> > introduced by this patch (I would do it if I had a corresponding hardware). AMD and Intel
> > NTBs can be considered both synchronous and asynchronous as well, although they don't
> > support messaging so Scratchpads can be used to send a data to a peer. Finally the
> > switches of the second group lack of ability to initialize BARs translated base address of
> > peers due to the race condition I described before.
> > 
> > To sum up I've spent a lot of time designing the IDT NTB driver. I've done my best to make
> > the IDT driver as much compatible with current design as possible, nevertheless the NTB
> > bus kernel API had to be slightly changed. You can find answers to the commentaries down
> > below.
> > 
> > On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > > From: Serge Semin
> > > > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > > > devices, so translated base address of memory windows can be direcly written
> > > > to peer registers. But there are some IDT PCIe-switches which implement
> > > > complex interfaces using Lookup Tables of translation addresses. Due to
> > > > the way the table is accessed, it can not be done synchronously from different
> > > > RCs, that's why the asynchronous interface should be developed.
> > > >
> > > > For these purpose the Memory Window related interface is correspondingly split
> > > > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > > > is following: "It is a virtual memory region, which locally reflects a physical
> > > > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > > > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > > > memory windows.
> > > > Here is the description of the Memory Window related NTB-bus callback
> > > > functions:
> > > >  - ntb_mw_count() - number of local memory windows.
> > > >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> > > >                          window to map.
> > > >  - ntb_mw_set_trans() - set translation address of local memory window (this
> > > >                         address should be somehow retrieved from a peer).
> > > >  - ntb_mw_get_trans() - get translation address of local memory window.
> > > >  - ntb_mw_get_align() - get alignment of translated base address and size of
> > > >                         local memory window. Additionally one can get the
> > > >                         upper size limit of the memory window.
> > > >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> > > >                          local number).
> > > >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> > > >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> > > >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> > > >                              of peer memory window.Additionally one can get the
> > > >                              upper size limit of the memory window.
> > > >
> > > > As one can see current AMD and Intel NTB drivers mostly implement the
> > > > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > > > driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> > > > since it doesn't have convenient access to the peer Lookup Table.
> > > >
> > > > In order to pass information from one RC to another NTB functions of IDT
> > > > PCIe-switch implement Messaging subsystem. They currently support four message
> > > > registers to transfer DWORD sized data to a specified peer. So there are two
> > > > new callback methods are introduced:
> > > >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> > > >                     and receive messages
> > > >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> > > >                     to a peer
> > > > Additionally there is a new event function:
> > > >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> > > >                      (NTB_MSG_NEW), or last message was successfully sent
> > > >                      (NTB_MSG_SENT), or the last message failed to be sent
> > > >                      (NTB_MSG_FAIL).
> > > >
> > > > The last change concerns the IDs (practically names) of NTB-devices on the
> > > > NTB-bus. It is not good to have the devices with same names in the system
> > > > and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> > > > algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> > > > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > > > devices supporting both interfaces.
> > >
> > > Thanks for the work that went into writing this driver, and thanks for your patience
> > with the review.  Please read my initial comments inline.  I would like to approach this
> > from a top-down api perspective first, and settle on that first before requesting any
> > specific changes in the hardware driver.  My major concern about these changes is that
> > they introduce a distinct classification for sync and async hardware, supported by
> > different sets of methods in the api, neither is a subset of the other.
> > >
> > > You know the IDT hardware, so if any of my requests below are infeasible, I would like
> > your constructive opinion (even if it means significant changes to existing drivers) on
> > how to resolve the api so that new and existing hardware drivers can be unified under the
> > same api, if possible.
> > 
> > I understand your concern. I have been thinking of this a lot. In my opinion the proposed
> > in this patch alterations are the best of all variants I've been thinking about. Regarding
> > the lack of APIs subset. In fact I would not agree with that. As I described in the
> > introduction AMD and Intel drivers can be considered as both synchronous and asynchronous,
> > since a translated base address can be directly set in a local and peer configuration
> > space. Although AMD and Intel devices don't support messaging, they have Scratchpads,
> > which can be used to exchange an information between root complexes. The thing we need to
> > do is to implement ntb_mw_set_trans() and ntb_mw_get_align() for them. Which isn't much
> > different from the "mw_peer"-prefixed ones. The first method just sets a translated base
> > address to the corresponding local register. The second one does exactly the same as
> > "mw_peer"-prefixed ones. I would do it, but I haven't got a hardware to test, that's why I
> > left things the way it was with just slight changes of names.
> 
> It sounds like the purpose of your ntb_mw_set_trans() [what I would call ntb_peer_mw_set_trans()] is similar to what is done at initialization time in the Intel NTB driver, so that outgoing writes are translated to the correct peer NTB BAR.  The difference is that IDT outgoing translation sets not only the peer NTB address but also the port number in the translation.
> http://lxr.free-electrons.com/source/drivers/ntb/hw/intel/ntb_hw_intel.c?v=4.7#L1673
> 
> It would be interesting to allow ntb clients to change this translation, eg, configure an outgoing write from local BAR23 so it hits peer secondary BAR45.  I don't think e.g. Intel driver should be forced to implement that, but it would be interesting to think of unifying the api with that in mind.

I already said I'm not an expert of Intel and AMD hardware, moreover I don't even have any
reference manual to study it. But from the first glance it's not. It doesn't concern any
of peer BARs. As far as I can judge by the Intel driver code, the initialization code
specifies some fixed translation address so to get access to some memory space of a remote
bridge. According to my observation b2b configuration looks more like so called Punch-through
configuration in the IDT definitions. It's when two bridges are connected to each other.
But I may be wrong. Although it doesn't matter at the moment.

It's much easier to explain how it works using an illustrations, otherwise we'll be discussing
this matter forever.

Lets start from definition what Memory Window mean. As I already said: "Memory Window is a
virtual memory region, which locally reflects a physical memory of peer/remote device."

Next suppose we've got two 32-bits Root Complexes (RC0 and RC1) connected to each other over
an NTB. It doesn't matter whether it's an IDT or Intel/AMD like NTBs. The NTB device has two
ports: Pn and Pm, each port is connected to its own Root Complex. There are doorbells,
scratchpads, and of course memory windows. Each Root Complex allocates a memory buffer:
Buffer A and Buffer B. Additionally RC0 and RC1 maps memory windows at the corresponding
addresses: MW A and MW B. Here is how it schematically looks:
https://s3.postimg.org/so3zg0car/memory_windows_before.jpg

According to your NTB Kernel API naming (see the figure), methods are supposed to be
syntactically split into two: with "ntb_peer_" prefix and without one. And they are correctly
split for doorbells and scratchpads, but when it comes to memory windows, the method names
syntax is kind of messed up.

Keeping in mind the definition of memory windows I introduced before, your ntb_mw_*_trans()
methods set/get translation base address to "BARm XLAT", so the ones memory window would be
correctly connected with Buffer A. But the function doesn't have "ntb_peer_mw" prefix, which
does look confusing, since it works with peer configuration registers, particularly with the
peer translation address of BARm - MW B.

Finally your ntb_mw_get_range() returns information about two opposite sides.
"Alignment"-related arguments return align of translated base address of the peer, but "base"
and "size" arguments are related with virtual address of the local memory window, which has
nothing related with the peer memory window and its translated base address.

My idea was to fix this syntax incorrectness, so the memory windows NTB Kernel API would look
the same way as doorbell and scratchpad ones. Here is the illustration, how it works now:
https://s3.postimg.org/52mvtfpgz/memory_windows_after.jpg

As you can see, the "ntb_peer_mw_" prefixed methods are related with the peer configurations
only, so ntb_peer_mw_*_trans() set/get translation base address of the peer memory windows
and ntb_peer_mw_get_align() return alignment of that address. Methods with no "ntb_peer_mw_"
prefix, do the same thing but with translation address of the local memory window.
Additionally ntb_mw_get_maprsc() return a physical address of local memory window to 
correspondingly map it.

In the same way the Memory Window API can be split into synchronous, asynchronous and both.
If hardware (like Intel/AMD) allows to implement "ntb_peer_mw" prefixed methods (see methods
marked with blue ink on the last figure), then it is considered synchronous, since it can
directly specify translation base addresses to the peer memory windows. If hardware supports
the "ntb_mw" prefixed methods only (purple ink on the figure), then it is considered as
asynchronous, so a client driver must somehow retrieve a translation address for local
memory window. Method ntb_mw_get_maprsc() must be supported by both hardware (it is marked
with green ink). Of course there are hardware, which can support both synchronous and
asynchronous API, like Intel/AMD. IDT PCIe-bridge doesn't safely support it since Lookup
translation tables access method.

I hope we got it settled now. If not, We can have a Skype conversation, since writing such a
long letters takes lot of time.

> 
> > 
> > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > > >
> > > > ---
> > > >  drivers/ntb/Kconfig                 |   4 +-
> > > >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> > > >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> > > >  drivers/ntb/ntb.c                   |  86 +++++-
> > > >  drivers/ntb/ntb_transport.c         |  19 +-
> > > >  drivers/ntb/test/ntb_perf.c         |  16 +-
> > > >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> > > >  drivers/ntb/test/ntb_tool.c         |  25 +-
> > > >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> > > >  9 files changed, 701 insertions(+), 162 deletions(-)
> > > >
> 
> 
> > > > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > > > -				      &mw->xlat_align, &mw->xlat_align_size);
> > > > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > > > +		if (rc)
> > > > +			goto err1;
> > > > +
> > > > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > > > +					   &mw->xlat_align_size, NULL);
> > >
> > > Looks like ntb_mw_get_range() was simpler before the change.
> > >
> > 
> > If I didn't change NTB bus kernel API, I would have split them up anyway. First of all
> > functions with long argument list look more confusing, than ones with shorter list. It
> > helps to stick to the "80 character per line" rule and improves readability. Secondly the
> > function splitting improves the readability of the code in general. When I first saw the
> > function name "ntb_mw_get_range()", it was not obvious what kind of ranges this function
> > returned. The function lacked of "high code coherence" unofficial rule. It is better when
> > one function does one coherent thing and return a well coherent data. Particularly
> > function "ntb_mw_get_range()" returned a local memory windows mapping address and size, as
> > well as alignment of memory allocated for a peer. So now "ntb_mw_get_maprsc()" method
> > returns mapping resources. If local NTB client driver is not going to allocate any memory,
> > so one just doesn't need to call "ntb_peer_mw_get_align()" method at all. I understand,
> > that a client driver could pass NULL to a unused arguments of the "ntb_mw_get_range()",
> > but still the new design is better readable.
> > 
> > Additionally I've split them up because of the difference in the way the asynchronous
> > interface works. IDT driver can not safely perform ntb_peer_mw_set_trans(), that's why I
> > had to add ntb_mw_set_trans(). Each of that method should logically have related
> > "ntb_*mw_get_align()" method. Method ntb_mw_get_align() shall give to a local client
> > driver a hint how the retrieved from the peer translated base address should be aligned,
> > so ntb_mw_set_trans() method would successfully return. Method ntb_peer_mw_get_align()
> > will give a hint how the local memory buffer should be allocated to fulfil a peer
> > translated base address alignment. In this way it returns restrictions for parameters of
> > "ntb_peer_mw_set_trans()".
> > 
> > Finally, IDT driver is designed so Primary and Secondary ports can support a different
> > number of memory windows. In this way methods
> > "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" have
> > different range of acceptable values of the second argument, which is determined by the
> > "ntb_mw_count()" method, comparing to methods
> > "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", which memory
> > windows index restriction is determined by the "ntb_peer_mw_count()" method.
> > 
> > So to speak the splitting was really necessary to make the API looking more logical.
> 
> If this change is not required by the new hardware, please submit the change as a separate patch.
> 

It's required. See the previous comment.

> > > > +	/* Synchronous hardware is only supported */
> > > > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > >
> > > It would be nice if both types could be supported by the same api.
> > >
> > 
> > Yes, it would be. Alas it isn't possible in general. See the introduction to this letter.
> > AMD and Intel devices support asynchronous interface, although they lack of messaging
> > mechanism.
> 
> What is the prototypical application of the IDT message registers?
> 
> I'm thinking they will be the first thing available to drivers, and so one primary purpose will be to exchange information for configuring memory windows.  Can you describe how a cluster of eight nodes would discover each other and initialize?
> 
> Are they also intended to be useful beyond memory window initialization?  How should they be used efficiently, so that the application can minimize in particular read operations on the pci bus (reading ntb device registers)?  Or are message registers not intended to be used in low latency communications (for that, use doorbells and memory instead)?
> 

The prototypical application of the message registers is to exchange a small portion of
information, like translation base address for example. Imagine IDT hardware provides just
four 32-bits wide message registers. So a driver software can transfer a message ID, memory
window index and a translation address using such a small buffer. The message registers can't
be efficiently used for exchanging of any bigger data. One should use doorbells and memory
windows instead.

I'm not a mind reader, but still supposably IDT provided them as a synchronous exchange of
Scratchpads. Message registers are designed so it's impossible to send a message to a peer
before one read a previous message. Such a design is really helpful when we need to connect
few different nodes and pass information between each other. Scratchpads would lead to too
much complications.

> > 
> > Getting back to the discussion, we still need to provide a way to determine which type of
> > interface an NTB device supports: synchronous/asynchronous translated base address
> > initialization, Scratchpads and memory windows. Currently it can be determined by the
> > functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand, that it's not
> > the best solution. We can implement the traditional Linux kernel bus device-driver
> > matching, using table_ids and so on. For example, each hardware driver fills in a table
> > with all the functionality it supports, like: synchronous/asynchronous memory windows,
> > Doorbells, Scratchpads, Messaging. Then driver initialize a table of functionality it
> > uses. NTB bus core implements a "match()" callback, which compares those two tables and
> > calls "probe()" callback method of a driver when the tables successfully matches.
> > 
> > On the other hand, we might don't have to comprehend the NTB bus core. We can just
> > introduce a table_id for NTB hardware device, which would just describe the device vendor
> > itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. Client driver will declare a
> > supported device by its table_id. It might look easier, since
> 
> emphasis added:
> 
> > the client driver developer
> > should have a basic understanding of the device one develops a driver for.
> 
> This is what I'm hoping to avoid.  I would like to let the driver developer write for the api, not for the specific device.  I would rather the driver check "if feature x is supported" instead of "this is a sync or async device."
> 

Ok. We can implement "features checking methods" like:
ntb_valid_link_ops(),
ntb_valid_peer_db_ops(),
ntb_valid_db_ops(),
ntb_valid_peer_spad_ops(),
ntb_valid_spad_ops(),
ntb_valid_msg_ops(),
ntb_valid_peer_mw_ops(),
ntb_valid_mw_ops().

But I am not fan of calling all of those methods in every client drivers. I would rather develop
an "NTB Device - Client Driver" matching method in the framework of NTB bus. For example,
developer creates a client driver using Doorbells (ntb_valid_peer_db_ops/ntb_valid_db_ops),
Messages (ntb_valid_msg_ops) and Memory Windows (ntb_valid_peer_mw_ops/ntb_valid_mw_ops). Then one
declares that the driver requires the corresponding features, somewhere in the struct ntb_client,
like it's usually done in the "compatible" fields of matching id_tables of drivers (see SPI, PCI,
i2c and others), but we would call it like "feature_table" with "compatible" fields. Of course
every hardware driver would declare, which kind of features one supports. Then the NTB bus
"match()" callback method checks whether the registered device supports all features the client
driver claims. If it does, then and only then the "probe()" method of the client driver is called.

Of course, it's your decision which design to use, I am just giving a possible solutions. But the
last one gives better unification with general "Bus - Device - Driver" design of the Linux Kernel.

> > Then NTB bus
> > kernel API core will simply match NTB devices with drivers like any other buses (PCI,
> > PCIe, i2c, spi, etc) do.
> > 
> 
> > > > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > > > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > > > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> > >
> > > I understand why IDT requires a different api for dealing with addressing multiple
> > peers.  I would be interested in a solution that would allow, for example, the Intel
> > driver fit under the api for dealing with multiple peers, even though it only supports one
> > peer.  I would rather see that, than two separate apis under ntb.
> > >
> > > Thoughts?
> > >
> > > Can the sync api be described by some subset of the async api?  Are there less
> > overloaded terms we can use instead of sync/async?
> > >
> > 
> > Answer to this concern is mostly provided in the introduction as well. I'll repeat it here
> > in details. As I said AMD and Intel hardware support asynchronous API except the
> > messaging. Additionally I can even think of emulating messaging using Doorbells and
> > Scratchpads, but not the other way around. Why not? Before answering, here is how the
> > messaging works in IDT switches of both first and second groups (see introduction for
> > describing the groups).
> > 
> > There are four outbound and inbound message registers for each NTB port in the device.
> > Local root complex can connect its any outbound message to any inbound message register of
> > the IDT switch. When one writes a data to an outbound message register it immediately gets
> > to the connected inbound message registers. Then peer can read its inbound message
> > registers and empty it by clearing a corresponding bit. Then and only then next data can
> > be written to any outbound message registers connected to that inbound message register.
> > So the possible race condition between multiple domains sending a message to same peer is
> > resolved by the IDT switch itself.
> > 
> > One would ask: "Why don't you just wrap the message registers up back to the same port? It
> > would look just like Scratchpads." Yes, It would. But still there are only four message
> > registers. It's not enough to distribute them between all the possibly connected NTB
> > ports. As I said earlier there can be up to eight domains connected, so there must be at
> > least seven message register to fulfil the possible design.
> > 
> > Howbeit all the emulations would look ugly anyway. In my opinion It's better to slightly
> > adapt design for a hardware, rather than hardware to a design. Following that rule would
> > simplify a code and support respectively.
> > 
> > Regarding the APIs subset. As I said before async API is kind of subset of synchronous
> > API. We can develop all the memory window related callback-method for AMD and Intel
> > hardware driver, which is pretty much easy. We can even simulate message registers by
> > using Doorbells and Scratchpads, which is not that easy, but possible. Alas the second
> > group of IDT switches can't implement the synchronous API, as I already said in the
> > introduction.
> 
> Message registers operate fundamentally differently from scratchpads (and doorbells, for that matter).  I think we are in agreement.  It's a pain, but maybe the best we can do is require applications to check for support for scratchpads, message registers, and/or doorbells, before using any of those features.  We already have ntb_db_valid_mask() and ntb_spad_count().
> 

Yes they do. And yes, the client drivers must somehow check whether a matching NTB device
supports all the features they need. See the previous comment how I suppose it can be done.

> I would like to see ntb_msg_count() and more direct access to the message registers in this api.  I would prefer to see the more direct access to hardware message registers, instead of work_struct for message processing in the low level hardware driver.  A more direct interface to the hardware registers would be more like the existing ntb.h api: direct and low-overhead as possible, providing minimal abstraction of the hardware functionality.
> 
> I think there is still hope we can unify the memory window interface.  Even though IDT supports things like subdividing the memory windows with table lookup, and specification of destination ports for outgoing translations, I think we can support the same abstraction in the existing drivers with minimal overhead.
> 
> For existing Intel and AMD drivers, there may be only one translation per memory window (there is no table to subdivide the memory window), and there is only one destination port (the peer).  The Intel and AMD drivers can ignore the table index in setting up the translation (or validate that the requested table index is equal to zero).
> 

In fact we don't need to introduce any of the table index, because the table index you are
talking about is just one peer. Since it is just a peer, then it must refer to a particular
device on the Linux NTB bus. For instance we got eight NTB ports on IDT PCIe-bridge, one
of them is the Primary port. Then Root Complex connected to the Primary port will have seven
devices on Linux NTB bus. Such design perfectly fits to your NTB Kernel API and additionally
will cover all the client driver needs. In this case Primary NTB port would be like an SPI, i2c
adapters or PCI root complex itself with respect to their subsidiary buses.

That's how the IDT hardware driver is designed. It gives transparent operations with NTB
kernel API.

> > Regarding the overloaded naming. The "sync/async" names are the best I could think of. If
> > you have any idea how one can be appropriately changed, be my guest. I would be really
> > glad to substitute them with something better.
> > 
> 
> Let's try to avoid a distinction, first, beyond just saying "not all hardware will support all these features."  If we absolutely have to make a distinction, let's think of better names then.
> 

Ok. We can stick to "featured" hardware. It sounds better.

> > > > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> > > >   * @ntb:	NTB device context.
> > > > + * @ev:		Event type caused the handler invocation
> > > > + * @msg:	Message related to the event
> > > > + *
> > > > + * Notify the driver context that there is some event happaned in the event
> > > > + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> > > > + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> > > > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > > > + * last argument is used to pass the event related message. It discarded right
> > > > + * after the handler returns.
> > > > + */
> > > > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > > > +		   struct ntb_msg *msg);
> > >
> > > I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of
> > the message handling to be done more appropriately at a higher layer of the application.
> > I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I
> > think would be more appropriate for a ntb transport (or higher layer) driver.
> > >
> > 
> > Hmmm, that's how it's done.) MSI interrupt is raised when a new message arrived into a
> > first inbound message register (the rest of message registers are used as an additional
> > data buffers). Then a corresponding tasklet is started to release a hardware interrupt
> > context. That tasklet extracts a message from the inbound message registers, puts it into
> > the driver inbound message queue and marks the registers as empty so the next message
> > could be retrieved. Then tasklet starts a corresponding kernel work thread delivering all
> > new messages to a client driver, which preliminary registered "ntb_msg_event()" callback
> > method. When callback method "ntb_msg_event()" the passed message is discarded.
> 
> When an interrupt arrives, can you signal the upper layer that a message has arrived, without delivering the message?  I think the lower layer can do without the work structs, instead have the same body of the work struct run in the context of the upper layer polling to receive the message.
> 

Of course we can. I could create a method like ntb_msg_read() instead of passing a message to the
callback, but I didn't do so because if next message interrupt arrives while the previous message
still has not been read, how a client driver would find out which message caused the last
interrupt? Thats why I prefer to pass the new message to the callback, so if a client drivers wants
to keep track of all the received methods, then it can create it's own queue.

Regarding the rest of the comment. The upper layer will have to implement the work struct anyway.
Why do we need to copy that code everywhere if it can be common for all the drivers? Still keep in
mind, that the incoming message registers is the register, that must be freed as fast as possible,
since another peer device can be waiting for it to be freed. So it's better to read it in the
hardware driver, than to let it being done by the unreliable client.

> > > It looks like there was some rearranging of code, so big hunks appear to be added or
> > removed.  Can you split this into two (or more) patches so that rearranging the code is
> > distinct from more interesting changes?
> > >
> > 
> > Lets say there was not much rearranging here. I've just put link-related method before
> > everything else. The rearranging was done from the point of methods importance view. There
> > can't be any memory sharing and doorbells operations done before the link is established.
> > The new arrangements is reflected in ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops()
> > methods.
> 
> It's unfortunate how the diff captured the changes.  Can you split this up into smaller patches?
> 

Lets settle the rest of the things down before doing this. If we don't then it would be just
a waste of time.

> > > > - * ntb_mw_get_range() - get the range of a memory window
> > > > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> > >
> > > What was insufficient about ntb_mw_get_range() that it needed to be split into
> > ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch,
> > it seems ntb_mw_get_range() would have been more simple.
> > >
> > > I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].
> > So, there is no example of how usage of new api would be used differently or more
> > efficiently than ntb_mw_get_range() for async devices.
> > >
> > 
> > This concern is answered a bit earlier, when you first commented the method
> > "ntb_mw_get_range()" splitting.
> > 
> > You could not find the "ntb_mw_get_mapsrc()" method usage because you misspelled it. The
> > real method signature is "ntb_mw_get_maprsc()" (look more carefully at the name ending),
> > which is decrypted as "Mapping Resources", but no "Mapping Source". ntb/test/ntb_mw_test.c
> > driver is developed to demonstrate how the new asynchronous API is utilized including the
> > "ntb_mw_get_maprsc()" method usage.
> 
> Right, I misspelled it.  It would be easier to catch a misspelling of ragne.
> 
> [PATCH v2 3/3]:
> +		/* Retrieve the physical address of the memory to map */
> +		ret = ntb_mw_get_maprsc(ntb, mwindx, &outmw->phys_addr,
> +			&outmw->size);
> +		if (SUCCESS != ret) {
> +			dev_err_mw(ctx, "Failed to get map resources of "
> +				"outbound window %d", mwindx);
> +			mwindx--;
> +			goto err_unmap_rsc;
> +		}
> +
> +		/* Map the memory window resources */
> +		outmw->virt_addr = ioremap_nocache(outmw->phys_addr, outmw->size);
> +
> +		/* Retrieve the memory windows maximum size and alignments */
> +		ret = ntb_mw_get_align(ntb, mwindx, &outmw->addr_align,
> +			&outmw->size_align, &outmw->size_max);
> +		if (SUCCESS != ret) {
> +			dev_err_mw(ctx, "Failed to get alignment options of "
> +				"outbound window %d", mwindx);
> +			goto err_unmap_rsc;
> +		}
> 
> It looks to me like ntb_mw_get_range() would have been sufficient here.  If the change is required by the new driver, please show evidence of that.  If this change is not required by the new hardware, please submit the change as a separate patch.
> 

Please, see the comments before.

> > > I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the
> > following make sense, or have I completely misunderstood something?
> > >
> > > ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are
> > translated to the local memory destination.
> > >
> > > ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory
> > window (is this something that needs to be configured on the local ntb?) are translated to
> > the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of
> > ntb_mw_set_trans() will complete the translation to the peer memory destination.
> > >
> > 
> > These functions actually do the opposite you described:
> 
> That's the point.  I noticed that they are opposite.
> 
> > ntb_mw_set_trans() - method sets the translated base address retrieved from a peer, so
> > outgoing writes to a memory window would be translated and reach the peer memory
> > destination.
> 
> In other words, this affects the translation of writes in the direction of the peer memory.  I think this should be named ntb_peer_mw_set_trans().
> 

Please, see the big comment with illustrations provided before.

> > ntb_peer_mw_set_trans() - method sets translated base address to peer configuration space,
> > so the local incoming writes would be correctly translated on the peer and reach the local
> > memory destination.
> 
> In other words, this affects the translation for writes in the direction of local memory.  I think this should be named ntb_mw_set_trans().
> 

Please, see the big comment with illustrations provided before.

> > Globally thinking, these methods do the same think, when they called from opposite
> > domains. So to speak locally called "ntb_mw_set_trans()" method does the same thing as the
> > method "ntb_peer_mw_set_trans()" called from a peer, and vise versa the locally called
> > method "ntb_peer_mw_set_trans()" does the same procedure as the method
> > "ntb_mw_set_trans()" called from a peer.
> > 
> > To make things simpler, think of memory windows in the framework of the next definition:
> > "Memory Window is a virtual memory region, which locally reflects a physical memory of
> > peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory
> > window, so the locally mapped virtual addresses would be connected with the peer physical
> > memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory
> > region, so the peer could successfully perform a writes to our local physical memory.
> > 
> > Of course all the actual memory read/write operations should follow up ntb_mw_get_maprsc()
> > and ioremap_nocache() method invocation doublet. You do the same thing in the client test
> > drivers for AMD and Intel hadrware.
> > 
> 
> > > >  /**
> > > > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64
> > db_bits)
> > > >   * append one additional dma memory copy with the doorbell register as the
> > > >   * destination, after the memory copy operations.
> > > >   *
> > > > + * This is unusual, and hardware may not be suitable to implement it.
> > > > + *
> > >
> > > Why is this unusual?  Do you mean async hardware may not support it?
> > >
> > 
> > Of course I can always return an address of a Doorbell register, but it's not safe to do
> > it working with IDT NTB hardware driver. To make thing explained simpler think a IDT
> > hardware, which supports the Doorbell bits routing. Each local inbound Doorbell bits of
> > each port can be configured to either reflect the global switch doorbell bits state or not
> > to reflect. Global doorbell bits are set by using outbound doorbell register, which is
> > exist for every NTB port. Primary port is the port which can have an access to multiple
> > peers, so the Primary port inbound and outbound doorbell registers are shared between
> > several NTB devices, sited on the linux kernel NTB bus. As you understand, these devices
> > should not interfere each other, which can happen on uncontrollable usage of Doorbell
> > registers addresses. That's why the method cou "ntb_peer_db_addr()" should not be
> > developed for the IDT NTB hardware driver.
> 
> I misread the diff as if this comment was added to the description of ntb_db_clear_mask().
> 
> > > > +	if (!ntb->ops->spad_count)
> > > > +		return -EINVAL;
> > > > +
> > >
> > > Maybe we should return zero (i.e. there are no scratchpads).
> > >
> > 
> > Agreed. I will fix it in the next patchset.
> 
> Thanks.
> 
> > > > +	if (!ntb->ops->spad_read)
> > > > +		return 0;
> > > > +
> > >
> > > Let's return ~0.  I think that's what a driver would read from the pci bus for a memory
> > miss.
> > >
> > 
> > Agreed. I will make it returning -EINVAL in the next patchset.
> 
> I don't think we should try to interpret the returned value as an error number.  If the driver supports this method, and this is a valid scratchpad, the peer can put any value in i, including a value that could be interpreted as an error number.
> 
> A driver shouldn't be using this method if it isn't supported.  But if it does, I think ~0 is a better poison value than 0.  I just don't want to encourage drivers to try to interpret this value as an error number.
> 

Understood. The method will return ~0 in the next patchset.

> > > > +	if (!ntb->ops->peer_spad_read)
> > > > +		return 0;
> > >
> > > Also, ~0?
> > >
> > 
> > Agreed. I will make it returning -EINVAL in the next patchset.
> 
> I don't think we should try to interpret the returned value as an error number.
> 

Understood. The method will return ~0 in the next patchset.

> > > > + * ntb_msg_post() - post the message to the peer
> > > > + * @ntb:	NTB device context.
> > > > + * @msg:	Message
> > > > + *
> > > > + * Post the message to a peer. It shall be delivered to the peer by the
> > > > + * corresponding hardware method. The peer should be notified about the new
> > > > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > > > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > > > + * event. Otherwise the NTB_MSG_SENT is emitted.
> > >
> > > Interesting.. local driver would be notified about completion (success or failure) of
> > delivery.  Is there any order-of-completion guarantee for the completion notifications?
> > Is there some tolerance for faults, in case we never get a completion notification from
> > the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link
> > comes up again, can we still get a completion notification from the peer, and how would
> > that be handled?
> > >
> > > Does delivery mean the application has processed the message, or is it just delivery at
> > the hardware layer, or just delivery at the ntb hardware driver layer?
> > >
> > 
> > Let me explain how the message delivery works. When a client driver calls the
> > "ntb_msg_post()" method, the corresponding message is placed in an outbound messages
> > queue. Such the message queue exists for every peer device. Then a dedicated kernel work
> > thread is started to send all the messages from the queue.
> 
> Can we handle the outbound messages queue in an upper layer thread, too, instead of a kernel thread in this low level driver?  I think if we provide more direct access to the hardware semantics of the message registers, we will end up with something like the following, which will also simplify the hardware driver.  Leave it to the upper layer to schedule message processing after receiving an event.
> 
> ntb_msg_event(): we received a hardware interrupt for messages. (don't read message status, or anything else)
> 
> ntb_msg_status_read(): read and return MSGSTS bitmask (like ntb_db_read()).
> ntb_msg_status_clear(): clear bits in MSGSTS bitmask (like ntb_db_clear()).
> 
> ntb_msg_mask_set(): set bits in MSGSTSMSK (like ntb_db_mask_set()).
> ntb_msg_mask_clear(): clear bits in MSGSTSMSK (like ntb_db_mask_clear()).
> 
> ntb_msg_recv(): read and return INMSG and INMSGSRC of the indicated message index.
> ntb_msg_send(): write the outgoing message register with the message.
> 

I think such an API would make the interface too complicated. The messaging is intended to be
simple for just sharing a small amount of information. Primarily for sending a translation
address. I would prefer to leave API as it is, since it covers all the application needs.

> > If kernel thread failed to send
> > a message (for instance, if the peer IDT NTB hardware driver still has not freed its
> > inbound message registers), it performs a new attempt after a small timeout. If after a
> > preconfigured number of attempts the kernel thread still fails to delivery the message, it
> > invokes ntb_msg_event() callback with NTB_MSG_FAIL event. If the message is successfully
> > delivered, then the method ntb_msg_event() is called with NTB_MSG_SENT event.
> 
> In other words, it was delivered to the peer NTB hardware, and the peer NTB hardware accepted the message into an available register.  It does not mean the peer application processed the message, or even that the peer driver received an interrupt for the message?
> 

Of course it doesn't mean, that the application processed the message, but it does mean that the
peer hardware raised the MSI interrupt, if the interrupt was enabled.

> > 
> > To be clear the messsages are transfered directly to the peer memory, but instead they are
> > placed in the IDT NTB switch registers, then peer is notified about a new message arrived
> > at the corresponding message registers and the corresponding interrupt handler is called.
> > 
> > If we loose the PCI express or NTB link between the IDT switch and a peer, then the
> > ntb_msg_event() method is called with NTB_MSG_FAIL event.
> 
> Byzantine fault is an unsolvable class of problem, so it is important to be clear exactly what is supposed to be guaranteed at each layer.  If we get a hardware ACK that the message was delivered, that means it was delivered to the NTB hardware register, but no further.  If we do not get a hardware NAK(?), that means it was not delivered.  If the link fails or we time out waiting for a completion, we can only guess that it wasn't delivered even though there is a small chance it was.  Applications need to be tolerant either way, and needs may be different depending on the application.  I would rather not add any fault tolerance (other than reporting faults) at this layer that is not already implemented in the hardware.
> 
> Reading the description of OUTMSGSTS register, it is clear that we can receive a hardware NAK if an outgoing message failed.  It's not clear to me that IDT will notify any kind of ACK that an outgoing message was accepted.  If an application wants to send two messages, it can send the first, check the bit and see there is no failure.  Does reading the status immediately after sending guarantee the message WAS delivered (i.e. IDT NTB hardware blocks reading the status register while there are messages in flight)?  If not, if the application sends the second message and then sees a failure, how can the application be sure the failure is not for the first message?  Does the application have to wait some time (how long?) before checking the message status?
> 

Ok, I think I need to explain it carefully. When a local Root Complex sends a message to a peer,
it writes a message to its outgoing message registers, which are connected with the peer incoming
message registers. If that incoming message registers are still full, so peer has not emptied them
by clearing a corresponding bit, then a local Root Complex gets a so called NACK, on the other
words it failed to send a message. Then it tries to send the message again and again before a next
attempt is either succeeded or a limited number of attempts is exceeded. Last one would lead to
rising a NTB_MSG_FAIL event.

On the other hand before sending a message IDT driver checks whether the NTB link is up, if it isn't
then it raises NTB_MSG_FAIL event.

After all the discussions I am starting to realize what is the problem. The problem is that we
might have differently pictures of how NTB hardware are connected.) Traditional Intel/AMD NTB
hardware are directly connected to each other so there is only one PCIe-link between two Root
Complexes, but IDT bridge is kind of single intermediate device, which has at least two NTB ports
connected to Root Complexes by different PCIe-links. So when one sends a message to another and
it's accepted by hardware, then the message was put to the incoming message register of the IDT
opposite port, and the peer Root Complex is just notified that a new message has arrived.

> > 
> > Finally, I've answered to all the questions. Hopefully the things look clearer now.
> > 
> > Regards,
> > -Sergey
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
@ 2016-08-08 21:48 ` Allen Hubbe
  0 siblings, 0 replies; 12+ messages in thread
From: Allen Hubbe @ 2016-08-08 21:48 UTC (permalink / raw)
  To: 'Serge Semin'
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

From: Serge Semin
> Hello Allen.
> 
> Thanks for your careful review. Going through this mailing thread I hope we'll come up
> with solutions, which improve the driver code as well as extend the Linux kernel support
> of new devices like IDT PCIe-swtiches.
> 
> Before getting to the inline commentaries I need to give some introduction to the IDT NTB-
> related hardware so we could speak on the same language. Additionally I'll give a brief
> explanation how the setup of memory windows works in IDT PCIe-switches.

I found this to use as a reference for IDT:
https://www.idt.com/document/man/89hpes24nt24g2-device-user-manual

> First of all, before getting into the IDT NTB driver development I had made a research of
> the currently developed NTB kernel API and AMD/Intel hardware drivers. Due to lack of the
> hardware manuals It might be not in deep details, but I understand how the AMD/Intel NTB-
> hardware drivers work. At least I understand the concept of memory windowing, which led to
> the current NTB bus kernel API.
> 
> So lets get to IDT PCIe-switches. There is a whole series of NTB-related switches IDT
> produces. All of them I split into two distinct groups:
> 1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
> 2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2,
> 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
> Just to note all of these switches are a part of IDT PRECISE(TM) family of PCI Express®
> switching solutions. Why do I split them up? Because of the next reasons:
> 1) Number of upstream ports, which have access to NTB functions (obviously, yeah? =)). So
> the switches of the first group can connect just two domains over NTB. Unlike the second
> group of switches, which expose a way to setup an interaction between several PCIe-switch
> ports, which have NT-function activated.
> 2) The groups are significantly distinct by the way of NT-functions configuration.
> 
> Before getting further, I should note, that the uploaded driver supports the second group
> of devices only. But still I'll give a comparative explanation, since the first group of
> switches is very similar to the AMD/Intel NTBs.
> 
> Lets dive into the configurations a bit deeper. Particularly NT-functions of the first
> group of switches can be configured the same way as AMD/Intel NTB-functions are. There is
> an PCIe end-point configuration space, which fully reflects the cross-coupled local and
> peer PCIe/NTB settings. So local Root complex can set any of the peer registers by direct
> writing to mapped memory. Here is the image, which perfectly explains the configuration
> registers mapping:
> https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
> Since the first group switches connect only two root complexes, the race condition of
> read/write operations to cross-coupled registers can be easily resolved just by roles
> distribution. So local root complex sets the translated base address directly to a peer
> configuration space registers, which correspond to BAR0-BAR3 locally mapped memory
> windows. Of course 2-4 memory windows is enough to connect just two domains. That's why
> you made the NTB bus kernel API the way it is.
> 
> The things get different when one wants to have an access from one domain to multiple
> coupling up to eight root complexes in the second group of switches. First of all the
> hardware doesn't support the configuration space cross-coupling anymore. Instead there are
> two Global Address Space Access registers provided to have an access to a peers
> configuration space. In fact it is not a big problem, since there are no much differences
> in accessing registers over a memory mapped space or a pair of fixed Address/Data
> registers. The problem arises when one wants to share a memory windows between eight
> domains. Five BARs are not enough for it even if they'd be configured to be of x32 address
> type. Instead IDT introduces Lookup table address translation. So BAR2/BAR4 can be
> configured to translate addresses using 12 or 24 entries lookup tables. Each entry can be
> initialized with translated base address of a peer and IDT switch port, which peer is
> connected to. So when local root complex locally maps BAR2/BAR4, one can have an access to
> a memory of a peer just by reading/writing with a shift corresponding to the lookup table
> entry. That's how more than five peers can be accessed. The root problem is the way the
> lookup table is accessed. Alas It is accessed only by a pair of "Entry index/Data"
> registers. So a root complex must write an entry index to one registers, then read/write
> data from another. As you might realise, that weak point leads to a race condition of
> multiple root complexes accessing the lookup table of one shared peer. Alas I could not
> come up with a simple and strong solution of the race.

Right, multiple peers reaching across to some other peer's NTB configuration space is problematic.  I don't mean to suggest we should reach across to configure the lookup table (or anything else) on a remote NTB.

> That's why I've introduced the asynchronous hardware in the NTB bus kernel API. Since
> local root complex can't directly write a translated base address to a peer, it must wait
> until a peer asks him to allocate a memory and send the address back using some of a
> hardware mechanism. It can be anything: Scratchpad registers, Message registers or even
> "crazy" doorbells bingbanging. For instance, the IDT switches of the first group support:
> 1) Shared Memory windows. In particular local root complex can set a translated base
> address to BARs of local and peer NT-function using the cross-coupled PCIe/NTB
> configuration space, the same way as it can be done for AMD/Intel NTBs.
> 2) One Doorbell register.
> 3) Two Scratchpads.
> 4) Four message regietsrs.
> As you can see the switches of the first group can be considered as both synchronous and
> asynchronous. All the NTB bus kernel API can be implemented for it including the changes
> introduced by this patch (I would do it if I had a corresponding hardware). AMD and Intel
> NTBs can be considered both synchronous and asynchronous as well, although they don't
> support messaging so Scratchpads can be used to send a data to a peer. Finally the
> switches of the second group lack of ability to initialize BARs translated base address of
> peers due to the race condition I described before.
> 
> To sum up I've spent a lot of time designing the IDT NTB driver. I've done my best to make
> the IDT driver as much compatible with current design as possible, nevertheless the NTB
> bus kernel API had to be slightly changed. You can find answers to the commentaries down
> below.
> 
> On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > From: Serge Semin
> > > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > > devices, so translated base address of memory windows can be direcly written
> > > to peer registers. But there are some IDT PCIe-switches which implement
> > > complex interfaces using Lookup Tables of translation addresses. Due to
> > > the way the table is accessed, it can not be done synchronously from different
> > > RCs, that's why the asynchronous interface should be developed.
> > >
> > > For these purpose the Memory Window related interface is correspondingly split
> > > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > > is following: "It is a virtual memory region, which locally reflects a physical
> > > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > > memory windows.
> > > Here is the description of the Memory Window related NTB-bus callback
> > > functions:
> > >  - ntb_mw_count() - number of local memory windows.
> > >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> > >                          window to map.
> > >  - ntb_mw_set_trans() - set translation address of local memory window (this
> > >                         address should be somehow retrieved from a peer).
> > >  - ntb_mw_get_trans() - get translation address of local memory window.
> > >  - ntb_mw_get_align() - get alignment of translated base address and size of
> > >                         local memory window. Additionally one can get the
> > >                         upper size limit of the memory window.
> > >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> > >                          local number).
> > >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> > >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> > >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> > >                              of peer memory window.Additionally one can get the
> > >                              upper size limit of the memory window.
> > >
> > > As one can see current AMD and Intel NTB drivers mostly implement the
> > > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > > driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> > > since it doesn't have convenient access to the peer Lookup Table.
> > >
> > > In order to pass information from one RC to another NTB functions of IDT
> > > PCIe-switch implement Messaging subsystem. They currently support four message
> > > registers to transfer DWORD sized data to a specified peer. So there are two
> > > new callback methods are introduced:
> > >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> > >                     and receive messages
> > >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> > >                     to a peer
> > > Additionally there is a new event function:
> > >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> > >                      (NTB_MSG_NEW), or last message was successfully sent
> > >                      (NTB_MSG_SENT), or the last message failed to be sent
> > >                      (NTB_MSG_FAIL).
> > >
> > > The last change concerns the IDs (practically names) of NTB-devices on the
> > > NTB-bus. It is not good to have the devices with same names in the system
> > > and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> > > algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> > > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > > devices supporting both interfaces.
> >
> > Thanks for the work that went into writing this driver, and thanks for your patience
> with the review.  Please read my initial comments inline.  I would like to approach this
> from a top-down api perspective first, and settle on that first before requesting any
> specific changes in the hardware driver.  My major concern about these changes is that
> they introduce a distinct classification for sync and async hardware, supported by
> different sets of methods in the api, neither is a subset of the other.
> >
> > You know the IDT hardware, so if any of my requests below are infeasible, I would like
> your constructive opinion (even if it means significant changes to existing drivers) on
> how to resolve the api so that new and existing hardware drivers can be unified under the
> same api, if possible.
> 
> I understand your concern. I have been thinking of this a lot. In my opinion the proposed
> in this patch alterations are the best of all variants I've been thinking about. Regarding
> the lack of APIs subset. In fact I would not agree with that. As I described in the
> introduction AMD and Intel drivers can be considered as both synchronous and asynchronous,
> since a translated base address can be directly set in a local and peer configuration
> space. Although AMD and Intel devices don't support messaging, they have Scratchpads,
> which can be used to exchange an information between root complexes. The thing we need to
> do is to implement ntb_mw_set_trans() and ntb_mw_get_align() for them. Which isn't much
> different from the "mw_peer"-prefixed ones. The first method just sets a translated base
> address to the corresponding local register. The second one does exactly the same as
> "mw_peer"-prefixed ones. I would do it, but I haven't got a hardware to test, that's why I
> left things the way it was with just slight changes of names.

It sounds like the purpose of your ntb_mw_set_trans() [what I would call ntb_peer_mw_set_trans()] is similar to what is done at initialization time in the Intel NTB driver, so that outgoing writes are translated to the correct peer NTB BAR.  The difference is that IDT outgoing translation sets not only the peer NTB address but also the port number in the translation.
http://lxr.free-electrons.com/source/drivers/ntb/hw/intel/ntb_hw_intel.c?v=4.7#L1673

It would be interesting to allow ntb clients to change this translation, eg, configure an outgoing write from local BAR23 so it hits peer secondary BAR45.  I don't think e.g. Intel driver should be forced to implement that, but it would be interesting to think of unifying the api with that in mind.

> 
> > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > >
> > > ---
> > >  drivers/ntb/Kconfig                 |   4 +-
> > >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> > >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> > >  drivers/ntb/ntb.c                   |  86 +++++-
> > >  drivers/ntb/ntb_transport.c         |  19 +-
> > >  drivers/ntb/test/ntb_perf.c         |  16 +-
> > >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> > >  drivers/ntb/test/ntb_tool.c         |  25 +-
> > >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> > >  9 files changed, 701 insertions(+), 162 deletions(-)
> > >


> > > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > > -				      &mw->xlat_align, &mw->xlat_align_size);
> > > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > > +		if (rc)
> > > +			goto err1;
> > > +
> > > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > > +					   &mw->xlat_align_size, NULL);
> >
> > Looks like ntb_mw_get_range() was simpler before the change.
> >
> 
> If I didn't change NTB bus kernel API, I would have split them up anyway. First of all
> functions with long argument list look more confusing, than ones with shorter list. It
> helps to stick to the "80 character per line" rule and improves readability. Secondly the
> function splitting improves the readability of the code in general. When I first saw the
> function name "ntb_mw_get_range()", it was not obvious what kind of ranges this function
> returned. The function lacked of "high code coherence" unofficial rule. It is better when
> one function does one coherent thing and return a well coherent data. Particularly
> function "ntb_mw_get_range()" returned a local memory windows mapping address and size, as
> well as alignment of memory allocated for a peer. So now "ntb_mw_get_maprsc()" method
> returns mapping resources. If local NTB client driver is not going to allocate any memory,
> so one just doesn't need to call "ntb_peer_mw_get_align()" method at all. I understand,
> that a client driver could pass NULL to a unused arguments of the "ntb_mw_get_range()",
> but still the new design is better readable.
> 
> Additionally I've split them up because of the difference in the way the asynchronous
> interface works. IDT driver can not safely perform ntb_peer_mw_set_trans(), that's why I
> had to add ntb_mw_set_trans(). Each of that method should logically have related
> "ntb_*mw_get_align()" method. Method ntb_mw_get_align() shall give to a local client
> driver a hint how the retrieved from the peer translated base address should be aligned,
> so ntb_mw_set_trans() method would successfully return. Method ntb_peer_mw_get_align()
> will give a hint how the local memory buffer should be allocated to fulfil a peer
> translated base address alignment. In this way it returns restrictions for parameters of
> "ntb_peer_mw_set_trans()".
> 
> Finally, IDT driver is designed so Primary and Secondary ports can support a different
> number of memory windows. In this way methods
> "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" have
> different range of acceptable values of the second argument, which is determined by the
> "ntb_mw_count()" method, comparing to methods
> "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", which memory
> windows index restriction is determined by the "ntb_peer_mw_count()" method.
> 
> So to speak the splitting was really necessary to make the API looking more logical.

If this change is not required by the new hardware, please submit the change as a separate patch.

> > > +	/* Synchronous hardware is only supported */
> > > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > > +		return -EINVAL;
> > > +	}
> > > +
> >
> > It would be nice if both types could be supported by the same api.
> >
> 
> Yes, it would be. Alas it isn't possible in general. See the introduction to this letter.
> AMD and Intel devices support asynchronous interface, although they lack of messaging
> mechanism.

What is the prototypical application of the IDT message registers?

I'm thinking they will be the first thing available to drivers, and so one primary purpose will be to exchange information for configuring memory windows.  Can you describe how a cluster of eight nodes would discover each other and initialize?

Are they also intended to be useful beyond memory window initialization?  How should they be used efficiently, so that the application can minimize in particular read operations on the pci bus (reading ntb device registers)?  Or are message registers not intended to be used in low latency communications (for that, use doorbells and memory instead)?

> 
> Getting back to the discussion, we still need to provide a way to determine which type of
> interface an NTB device supports: synchronous/asynchronous translated base address
> initialization, Scratchpads and memory windows. Currently it can be determined by the
> functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand, that it's not
> the best solution. We can implement the traditional Linux kernel bus device-driver
> matching, using table_ids and so on. For example, each hardware driver fills in a table
> with all the functionality it supports, like: synchronous/asynchronous memory windows,
> Doorbells, Scratchpads, Messaging. Then driver initialize a table of functionality it
> uses. NTB bus core implements a "match()" callback, which compares those two tables and
> calls "probe()" callback method of a driver when the tables successfully matches.
> 
> On the other hand, we might don't have to comprehend the NTB bus core. We can just
> introduce a table_id for NTB hardware device, which would just describe the device vendor
> itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. Client driver will declare a
> supported device by its table_id. It might look easier, since

emphasis added:

> the client driver developer
> should have a basic understanding of the device one develops a driver for.

This is what I'm hoping to avoid.  I would like to let the driver developer write for the api, not for the specific device.  I would rather the driver check "if feature x is supported" instead of "this is a sync or async device."

> Then NTB bus
> kernel API core will simply match NTB devices with drivers like any other buses (PCI,
> PCIe, i2c, spi, etc) do.
> 

> > > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> >
> > I understand why IDT requires a different api for dealing with addressing multiple
> peers.  I would be interested in a solution that would allow, for example, the Intel
> driver fit under the api for dealing with multiple peers, even though it only supports one
> peer.  I would rather see that, than two separate apis under ntb.
> >
> > Thoughts?
> >
> > Can the sync api be described by some subset of the async api?  Are there less
> overloaded terms we can use instead of sync/async?
> >
> 
> Answer to this concern is mostly provided in the introduction as well. I'll repeat it here
> in details. As I said AMD and Intel hardware support asynchronous API except the
> messaging. Additionally I can even think of emulating messaging using Doorbells and
> Scratchpads, but not the other way around. Why not? Before answering, here is how the
> messaging works in IDT switches of both first and second groups (see introduction for
> describing the groups).
> 
> There are four outbound and inbound message registers for each NTB port in the device.
> Local root complex can connect its any outbound message to any inbound message register of
> the IDT switch. When one writes a data to an outbound message register it immediately gets
> to the connected inbound message registers. Then peer can read its inbound message
> registers and empty it by clearing a corresponding bit. Then and only then next data can
> be written to any outbound message registers connected to that inbound message register.
> So the possible race condition between multiple domains sending a message to same peer is
> resolved by the IDT switch itself.
> 
> One would ask: "Why don't you just wrap the message registers up back to the same port? It
> would look just like Scratchpads." Yes, It would. But still there are only four message
> registers. It's not enough to distribute them between all the possibly connected NTB
> ports. As I said earlier there can be up to eight domains connected, so there must be at
> least seven message register to fulfil the possible design.
> 
> Howbeit all the emulations would look ugly anyway. In my opinion It's better to slightly
> adapt design for a hardware, rather than hardware to a design. Following that rule would
> simplify a code and support respectively.
> 
> Regarding the APIs subset. As I said before async API is kind of subset of synchronous
> API. We can develop all the memory window related callback-method for AMD and Intel
> hardware driver, which is pretty much easy. We can even simulate message registers by
> using Doorbells and Scratchpads, which is not that easy, but possible. Alas the second
> group of IDT switches can't implement the synchronous API, as I already said in the
> introduction.

Message registers operate fundamentally differently from scratchpads (and doorbells, for that matter).  I think we are in agreement.  It's a pain, but maybe the best we can do is require applications to check for support for scratchpads, message registers, and/or doorbells, before using any of those features.  We already have ntb_db_valid_mask() and ntb_spad_count().

I would like to see ntb_msg_count() and more direct access to the message registers in this api.  I would prefer to see the more direct access to hardware message registers, instead of work_struct for message processing in the low level hardware driver.  A more direct interface to the hardware registers would be more like the existing ntb.h api: direct and low-overhead as possible, providing minimal abstraction of the hardware functionality.

I think there is still hope we can unify the memory window interface.  Even though IDT supports things like subdividing the memory windows with table lookup, and specification of destination ports for outgoing translations, I think we can support the same abstraction in the existing drivers with minimal overhead.

For existing Intel and AMD drivers, there may be only one translation per memory window (there is no table to subdivide the memory window), and there is only one destination port (the peer).  The Intel and AMD drivers can ignore the table index in setting up the translation (or validate that the requested table index is equal to zero).

> Regarding the overloaded naming. The "sync/async" names are the best I could think of. If
> you have any idea how one can be appropriately changed, be my guest. I would be really
> glad to substitute them with something better.
> 

Let's try to avoid a distinction, first, beyond just saying "not all hardware will support all these features."  If we absolutely have to make a distinction, let's think of better names then.

> > > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> > >   * @ntb:	NTB device context.
> > > + * @ev:		Event type caused the handler invocation
> > > + * @msg:	Message related to the event
> > > + *
> > > + * Notify the driver context that there is some event happaned in the event
> > > + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> > > + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> > > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > > + * last argument is used to pass the event related message. It discarded right
> > > + * after the handler returns.
> > > + */
> > > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > > +		   struct ntb_msg *msg);
> >
> > I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of
> the message handling to be done more appropriately at a higher layer of the application.
> I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I
> think would be more appropriate for a ntb transport (or higher layer) driver.
> >
> 
> Hmmm, that's how it's done.) MSI interrupt is raised when a new message arrived into a
> first inbound message register (the rest of message registers are used as an additional
> data buffers). Then a corresponding tasklet is started to release a hardware interrupt
> context. That tasklet extracts a message from the inbound message registers, puts it into
> the driver inbound message queue and marks the registers as empty so the next message
> could be retrieved. Then tasklet starts a corresponding kernel work thread delivering all
> new messages to a client driver, which preliminary registered "ntb_msg_event()" callback
> method. When callback method "ntb_msg_event()" the passed message is discarded.

When an interrupt arrives, can you signal the upper layer that a message has arrived, without delivering the message?  I think the lower layer can do without the work structs, instead have the same body of the work struct run in the context of the upper layer polling to receive the message.

> > It looks like there was some rearranging of code, so big hunks appear to be added or
> removed.  Can you split this into two (or more) patches so that rearranging the code is
> distinct from more interesting changes?
> >
> 
> Lets say there was not much rearranging here. I've just put link-related method before
> everything else. The rearranging was done from the point of methods importance view. There
> can't be any memory sharing and doorbells operations done before the link is established.
> The new arrangements is reflected in ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops()
> methods.

It's unfortunate how the diff captured the changes.  Can you split this up into smaller patches?

> > > - * ntb_mw_get_range() - get the range of a memory window
> > > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> >
> > What was insufficient about ntb_mw_get_range() that it needed to be split into
> ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch,
> it seems ntb_mw_get_range() would have been more simple.
> >
> > I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].
> So, there is no example of how usage of new api would be used differently or more
> efficiently than ntb_mw_get_range() for async devices.
> >
> 
> This concern is answered a bit earlier, when you first commented the method
> "ntb_mw_get_range()" splitting.
> 
> You could not find the "ntb_mw_get_mapsrc()" method usage because you misspelled it. The
> real method signature is "ntb_mw_get_maprsc()" (look more carefully at the name ending),
> which is decrypted as "Mapping Resources", but no "Mapping Source". ntb/test/ntb_mw_test.c
> driver is developed to demonstrate how the new asynchronous API is utilized including the
> "ntb_mw_get_maprsc()" method usage.

Right, I misspelled it.  It would be easier to catch a misspelling of ragne.

[PATCH v2 3/3]:
+		/* Retrieve the physical address of the memory to map */
+		ret = ntb_mw_get_maprsc(ntb, mwindx, &outmw->phys_addr,
+			&outmw->size);
+		if (SUCCESS != ret) {
+			dev_err_mw(ctx, "Failed to get map resources of "
+				"outbound window %d", mwindx);
+			mwindx--;
+			goto err_unmap_rsc;
+		}
+
+		/* Map the memory window resources */
+		outmw->virt_addr = ioremap_nocache(outmw->phys_addr, outmw->size);
+
+		/* Retrieve the memory windows maximum size and alignments */
+		ret = ntb_mw_get_align(ntb, mwindx, &outmw->addr_align,
+			&outmw->size_align, &outmw->size_max);
+		if (SUCCESS != ret) {
+			dev_err_mw(ctx, "Failed to get alignment options of "
+				"outbound window %d", mwindx);
+			goto err_unmap_rsc;
+		}

It looks to me like ntb_mw_get_range() would have been sufficient here.  If the change is required by the new driver, please show evidence of that.  If this change is not required by the new hardware, please submit the change as a separate patch.

> > I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the
> following make sense, or have I completely misunderstood something?
> >
> > ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are
> translated to the local memory destination.
> >
> > ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory
> window (is this something that needs to be configured on the local ntb?) are translated to
> the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of
> ntb_mw_set_trans() will complete the translation to the peer memory destination.
> >
> 
> These functions actually do the opposite you described:

That's the point.  I noticed that they are opposite.

> ntb_mw_set_trans() - method sets the translated base address retrieved from a peer, so
> outgoing writes to a memory window would be translated and reach the peer memory
> destination.

In other words, this affects the translation of writes in the direction of the peer memory.  I think this should be named ntb_peer_mw_set_trans().

> ntb_peer_mw_set_trans() - method sets translated base address to peer configuration space,
> so the local incoming writes would be correctly translated on the peer and reach the local
> memory destination.

In other words, this affects the translation for writes in the direction of local memory.  I think this should be named ntb_mw_set_trans().

> Globally thinking, these methods do the same think, when they called from opposite
> domains. So to speak locally called "ntb_mw_set_trans()" method does the same thing as the
> method "ntb_peer_mw_set_trans()" called from a peer, and vise versa the locally called
> method "ntb_peer_mw_set_trans()" does the same procedure as the method
> "ntb_mw_set_trans()" called from a peer.
> 
> To make things simpler, think of memory windows in the framework of the next definition:
> "Memory Window is a virtual memory region, which locally reflects a physical memory of
> peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory
> window, so the locally mapped virtual addresses would be connected with the peer physical
> memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory
> region, so the peer could successfully perform a writes to our local physical memory.
> 
> Of course all the actual memory read/write operations should follow up ntb_mw_get_maprsc()
> and ioremap_nocache() method invocation doublet. You do the same thing in the client test
> drivers for AMD and Intel hadrware.
> 

> > >  /**
> > > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64
> db_bits)
> > >   * append one additional dma memory copy with the doorbell register as the
> > >   * destination, after the memory copy operations.
> > >   *
> > > + * This is unusual, and hardware may not be suitable to implement it.
> > > + *
> >
> > Why is this unusual?  Do you mean async hardware may not support it?
> >
> 
> Of course I can always return an address of a Doorbell register, but it's not safe to do
> it working with IDT NTB hardware driver. To make thing explained simpler think a IDT
> hardware, which supports the Doorbell bits routing. Each local inbound Doorbell bits of
> each port can be configured to either reflect the global switch doorbell bits state or not
> to reflect. Global doorbell bits are set by using outbound doorbell register, which is
> exist for every NTB port. Primary port is the port which can have an access to multiple
> peers, so the Primary port inbound and outbound doorbell registers are shared between
> several NTB devices, sited on the linux kernel NTB bus. As you understand, these devices
> should not interfere each other, which can happen on uncontrollable usage of Doorbell
> registers addresses. That's why the method cou "ntb_peer_db_addr()" should not be
> developed for the IDT NTB hardware driver.

I misread the diff as if this comment was added to the description of ntb_db_clear_mask().

> > > +	if (!ntb->ops->spad_count)
> > > +		return -EINVAL;
> > > +
> >
> > Maybe we should return zero (i.e. there are no scratchpads).
> >
> 
> Agreed. I will fix it in the next patchset.

Thanks.

> > > +	if (!ntb->ops->spad_read)
> > > +		return 0;
> > > +
> >
> > Let's return ~0.  I think that's what a driver would read from the pci bus for a memory
> miss.
> >
> 
> Agreed. I will make it returning -EINVAL in the next patchset.

I don't think we should try to interpret the returned value as an error number.  If the driver supports this method, and this is a valid scratchpad, the peer can put any value in i, including a value that could be interpreted as an error number.

A driver shouldn't be using this method if it isn't supported.  But if it does, I think ~0 is a better poison value than 0.  I just don't want to encourage drivers to try to interpret this value as an error number.

> > > +	if (!ntb->ops->peer_spad_read)
> > > +		return 0;
> >
> > Also, ~0?
> >
> 
> Agreed. I will make it returning -EINVAL in the next patchset.

I don't think we should try to interpret the returned value as an error number.

> > > + * ntb_msg_post() - post the message to the peer
> > > + * @ntb:	NTB device context.
> > > + * @msg:	Message
> > > + *
> > > + * Post the message to a peer. It shall be delivered to the peer by the
> > > + * corresponding hardware method. The peer should be notified about the new
> > > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > > + * event. Otherwise the NTB_MSG_SENT is emitted.
> >
> > Interesting.. local driver would be notified about completion (success or failure) of
> delivery.  Is there any order-of-completion guarantee for the completion notifications?
> Is there some tolerance for faults, in case we never get a completion notification from
> the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link
> comes up again, can we still get a completion notification from the peer, and how would
> that be handled?
> >
> > Does delivery mean the application has processed the message, or is it just delivery at
> the hardware layer, or just delivery at the ntb hardware driver layer?
> >
> 
> Let me explain how the message delivery works. When a client driver calls the
> "ntb_msg_post()" method, the corresponding message is placed in an outbound messages
> queue. Such the message queue exists for every peer device. Then a dedicated kernel work
> thread is started to send all the messages from the queue.

Can we handle the outbound messages queue in an upper layer thread, too, instead of a kernel thread in this low level driver?  I think if we provide more direct access to the hardware semantics of the message registers, we will end up with something like the following, which will also simplify the hardware driver.  Leave it to the upper layer to schedule message processing after receiving an event.

ntb_msg_event(): we received a hardware interrupt for messages. (don't read message status, or anything else)

ntb_msg_status_read(): read and return MSGSTS bitmask (like ntb_db_read()).
ntb_msg_status_clear(): clear bits in MSGSTS bitmask (like ntb_db_clear()).

ntb_msg_mask_set(): set bits in MSGSTSMSK (like ntb_db_mask_set()).
ntb_msg_mask_clear(): clear bits in MSGSTSMSK (like ntb_db_mask_clear()).

ntb_msg_recv(): read and return INMSG and INMSGSRC of the indicated message index.
ntb_msg_send(): write the outgoing message register with the message.

> If kernel thread failed to send
> a message (for instance, if the peer IDT NTB hardware driver still has not freed its
> inbound message registers), it performs a new attempt after a small timeout. If after a
> preconfigured number of attempts the kernel thread still fails to delivery the message, it
> invokes ntb_msg_event() callback with NTB_MSG_FAIL event. If the message is successfully
> delivered, then the method ntb_msg_event() is called with NTB_MSG_SENT event.

In other words, it was delivered to the peer NTB hardware, and the peer NTB hardware accepted the message into an available register.  It does not mean the peer application processed the message, or even that the peer driver received an interrupt for the message?

> 
> To be clear the messsages are transfered directly to the peer memory, but instead they are
> placed in the IDT NTB switch registers, then peer is notified about a new message arrived
> at the corresponding message registers and the corresponding interrupt handler is called.
> 
> If we loose the PCI express or NTB link between the IDT switch and a peer, then the
> ntb_msg_event() method is called with NTB_MSG_FAIL event.

Byzantine fault is an unsolvable class of problem, so it is important to be clear exactly what is supposed to be guaranteed at each layer.  If we get a hardware ACK that the message was delivered, that means it was delivered to the NTB hardware register, but no further.  If we do not get a hardware NAK(?), that means it was not delivered.  If the link fails or we time out waiting for a completion, we can only guess that it wasn't delivered even though there is a small chance it was.  Applications need to be tolerant either way, and needs may be different depending on the application.  I would rather not add any fault tolerance (other than reporting faults) at this layer that is not already implemented in the hardware.

Reading the description of OUTMSGSTS register, it is clear that we can receive a hardware NAK if an outgoing message failed.  It's not clear to me that IDT will notify any kind of ACK that an outgoing message was accepted.  If an application wants to send two messages, it can send the first, check the bit and see there is no failure.  Does reading the status immediately after sending guarantee the message WAS delivered (i.e. IDT NTB hardware blocks reading the status register while there are messages in flight)?  If not, if the application sends the second message and then sees a failure, how can the application be sure the failure is not for the first message?  Does the application have to wait some time (how long?) before checking the message status?

> 
> Finally, I've answered to all the questions. Hopefully the things look clearer now.
> 
> Regards,
> -Sergey

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
@ 2016-08-08 21:48 ` Allen Hubbe
  0 siblings, 0 replies; 12+ messages in thread
From: Allen Hubbe @ 2016-08-08 21:48 UTC (permalink / raw)
  To: 'Serge Semin'
  Cc: jdmason, dave.jiang, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel

From: Serge Semin
> Hello Allen.
> 
> Thanks for your careful review. Going through this mailing thread I hope we'll come up
> with solutions, which improve the driver code as well as extend the Linux kernel support
> of new devices like IDT PCIe-swtiches.
> 
> Before getting to the inline commentaries I need to give some introduction to the IDT NTB-
> related hardware so we could speak on the same language. Additionally I'll give a brief
> explanation how the setup of memory windows works in IDT PCIe-switches.

I found this to use as a reference for IDT:
https://www.idt.com/document/man/89hpes24nt24g2-device-user-manual

> First of all, before getting into the IDT NTB driver development I had made a research of
> the currently developed NTB kernel API and AMD/Intel hardware drivers. Due to lack of the
> hardware manuals It might be not in deep details, but I understand how the AMD/Intel NTB-
> hardware drivers work. At least I understand the concept of memory windowing, which led to
> the current NTB bus kernel API.
> 
> So lets get to IDT PCIe-switches. There is a whole series of NTB-related switches IDT
> produces. All of them I split into two distinct groups:
> 1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
> 2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2,
> 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
> Just to note all of these switches are a part of IDT PRECISE(TM) family of PCI Express®
> switching solutions. Why do I split them up? Because of the next reasons:
> 1) Number of upstream ports, which have access to NTB functions (obviously, yeah? =)). So
> the switches of the first group can connect just two domains over NTB. Unlike the second
> group of switches, which expose a way to setup an interaction between several PCIe-switch
> ports, which have NT-function activated.
> 2) The groups are significantly distinct by the way of NT-functions configuration.
> 
> Before getting further, I should note, that the uploaded driver supports the second group
> of devices only. But still I'll give a comparative explanation, since the first group of
> switches is very similar to the AMD/Intel NTBs.
> 
> Lets dive into the configurations a bit deeper. Particularly NT-functions of the first
> group of switches can be configured the same way as AMD/Intel NTB-functions are. There is
> an PCIe end-point configuration space, which fully reflects the cross-coupled local and
> peer PCIe/NTB settings. So local Root complex can set any of the peer registers by direct
> writing to mapped memory. Here is the image, which perfectly explains the configuration
> registers mapping:
> https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
> Since the first group switches connect only two root complexes, the race condition of
> read/write operations to cross-coupled registers can be easily resolved just by roles
> distribution. So local root complex sets the translated base address directly to a peer
> configuration space registers, which correspond to BAR0-BAR3 locally mapped memory
> windows. Of course 2-4 memory windows is enough to connect just two domains. That's why
> you made the NTB bus kernel API the way it is.
> 
> The things get different when one wants to have an access from one domain to multiple
> coupling up to eight root complexes in the second group of switches. First of all the
> hardware doesn't support the configuration space cross-coupling anymore. Instead there are
> two Global Address Space Access registers provided to have an access to a peers
> configuration space. In fact it is not a big problem, since there are no much differences
> in accessing registers over a memory mapped space or a pair of fixed Address/Data
> registers. The problem arises when one wants to share a memory windows between eight
> domains. Five BARs are not enough for it even if they'd be configured to be of x32 address
> type. Instead IDT introduces Lookup table address translation. So BAR2/BAR4 can be
> configured to translate addresses using 12 or 24 entries lookup tables. Each entry can be
> initialized with translated base address of a peer and IDT switch port, which peer is
> connected to. So when local root complex locally maps BAR2/BAR4, one can have an access to
> a memory of a peer just by reading/writing with a shift corresponding to the lookup table
> entry. That's how more than five peers can be accessed. The root problem is the way the
> lookup table is accessed. Alas It is accessed only by a pair of "Entry index/Data"
> registers. So a root complex must write an entry index to one registers, then read/write
> data from another. As you might realise, that weak point leads to a race condition of
> multiple root complexes accessing the lookup table of one shared peer. Alas I could not
> come up with a simple and strong solution of the race.

Right, multiple peers reaching across to some other peer's NTB configuration space is problematic.  I don't mean to suggest we should reach across to configure the lookup table (or anything else) on a remote NTB.

> That's why I've introduced the asynchronous hardware in the NTB bus kernel API. Since
> local root complex can't directly write a translated base address to a peer, it must wait
> until a peer asks him to allocate a memory and send the address back using some of a
> hardware mechanism. It can be anything: Scratchpad registers, Message registers or even
> "crazy" doorbells bingbanging. For instance, the IDT switches of the first group support:
> 1) Shared Memory windows. In particular local root complex can set a translated base
> address to BARs of local and peer NT-function using the cross-coupled PCIe/NTB
> configuration space, the same way as it can be done for AMD/Intel NTBs.
> 2) One Doorbell register.
> 3) Two Scratchpads.
> 4) Four message regietsrs.
> As you can see the switches of the first group can be considered as both synchronous and
> asynchronous. All the NTB bus kernel API can be implemented for it including the changes
> introduced by this patch (I would do it if I had a corresponding hardware). AMD and Intel
> NTBs can be considered both synchronous and asynchronous as well, although they don't
> support messaging so Scratchpads can be used to send a data to a peer. Finally the
> switches of the second group lack of ability to initialize BARs translated base address of
> peers due to the race condition I described before.
> 
> To sum up I've spent a lot of time designing the IDT NTB driver. I've done my best to make
> the IDT driver as much compatible with current design as possible, nevertheless the NTB
> bus kernel API had to be slightly changed. You can find answers to the commentaries down
> below.
> 
> On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@emc.com> wrote:
> > From: Serge Semin
> > > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > > devices, so translated base address of memory windows can be direcly written
> > > to peer registers. But there are some IDT PCIe-switches which implement
> > > complex interfaces using Lookup Tables of translation addresses. Due to
> > > the way the table is accessed, it can not be done synchronously from different
> > > RCs, that's why the asynchronous interface should be developed.
> > >
> > > For these purpose the Memory Window related interface is correspondingly split
> > > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > > is following: "It is a virtual memory region, which locally reflects a physical
> > > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > > memory windows.
> > > Here is the description of the Memory Window related NTB-bus callback
> > > functions:
> > >  - ntb_mw_count() - number of local memory windows.
> > >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> > >                          window to map.
> > >  - ntb_mw_set_trans() - set translation address of local memory window (this
> > >                         address should be somehow retrieved from a peer).
> > >  - ntb_mw_get_trans() - get translation address of local memory window.
> > >  - ntb_mw_get_align() - get alignment of translated base address and size of
> > >                         local memory window. Additionally one can get the
> > >                         upper size limit of the memory window.
> > >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> > >                          local number).
> > >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> > >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> > >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> > >                              of peer memory window.Additionally one can get the
> > >                              upper size limit of the memory window.
> > >
> > > As one can see current AMD and Intel NTB drivers mostly implement the
> > > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > > driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
> > > since it doesn't have convenient access to the peer Lookup Table.
> > >
> > > In order to pass information from one RC to another NTB functions of IDT
> > > PCIe-switch implement Messaging subsystem. They currently support four message
> > > registers to transfer DWORD sized data to a specified peer. So there are two
> > > new callback methods are introduced:
> > >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> > >                     and receive messages
> > >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> > >                     to a peer
> > > Additionally there is a new event function:
> > >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> > >                      (NTB_MSG_NEW), or last message was successfully sent
> > >                      (NTB_MSG_SENT), or the last message failed to be sent
> > >                      (NTB_MSG_FAIL).
> > >
> > > The last change concerns the IDs (practically names) of NTB-devices on the
> > > NTB-bus. It is not good to have the devices with same names in the system
> > > and it brakes my IDT NTB driver from being loaded =) So I developed a simple
> > > algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
> > > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > > devices supporting both interfaces.
> >
> > Thanks for the work that went into writing this driver, and thanks for your patience
> with the review.  Please read my initial comments inline.  I would like to approach this
> from a top-down api perspective first, and settle on that first before requesting any
> specific changes in the hardware driver.  My major concern about these changes is that
> they introduce a distinct classification for sync and async hardware, supported by
> different sets of methods in the api, neither is a subset of the other.
> >
> > You know the IDT hardware, so if any of my requests below are infeasible, I would like
> your constructive opinion (even if it means significant changes to existing drivers) on
> how to resolve the api so that new and existing hardware drivers can be unified under the
> same api, if possible.
> 
> I understand your concern. I have been thinking of this a lot. In my opinion the proposed
> in this patch alterations are the best of all variants I've been thinking about. Regarding
> the lack of APIs subset. In fact I would not agree with that. As I described in the
> introduction AMD and Intel drivers can be considered as both synchronous and asynchronous,
> since a translated base address can be directly set in a local and peer configuration
> space. Although AMD and Intel devices don't support messaging, they have Scratchpads,
> which can be used to exchange an information between root complexes. The thing we need to
> do is to implement ntb_mw_set_trans() and ntb_mw_get_align() for them. Which isn't much
> different from the "mw_peer"-prefixed ones. The first method just sets a translated base
> address to the corresponding local register. The second one does exactly the same as
> "mw_peer"-prefixed ones. I would do it, but I haven't got a hardware to test, that's why I
> left things the way it was with just slight changes of names.

It sounds like the purpose of your ntb_mw_set_trans() [what I would call ntb_peer_mw_set_trans()] is similar to what is done at initialization time in the Intel NTB driver, so that outgoing writes are translated to the correct peer NTB BAR.  The difference is that IDT outgoing translation sets not only the peer NTB address but also the port number in the translation.
http://lxr.free-electrons.com/source/drivers/ntb/hw/intel/ntb_hw_intel.c?v=4.7#L1673

It would be interesting to allow ntb clients to change this translation, eg, configure an outgoing write from local BAR23 so it hits peer secondary BAR45.  I don't think e.g. Intel driver should be forced to implement that, but it would be interesting to think of unifying the api with that in mind.

> 
> > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > >
> > > ---
> > >  drivers/ntb/Kconfig                 |   4 +-
> > >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> > >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> > >  drivers/ntb/ntb.c                   |  86 +++++-
> > >  drivers/ntb/ntb_transport.c         |  19 +-
> > >  drivers/ntb/test/ntb_perf.c         |  16 +-
> > >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> > >  drivers/ntb/test/ntb_tool.c         |  25 +-
> > >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> > >  9 files changed, 701 insertions(+), 162 deletions(-)
> > >


> > > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > > -				      &mw->xlat_align, &mw->xlat_align_size);
> > > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > > +		if (rc)
> > > +			goto err1;
> > > +
> > > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > > +					   &mw->xlat_align_size, NULL);
> >
> > Looks like ntb_mw_get_range() was simpler before the change.
> >
> 
> If I didn't change NTB bus kernel API, I would have split them up anyway. First of all
> functions with long argument list look more confusing, than ones with shorter list. It
> helps to stick to the "80 character per line" rule and improves readability. Secondly the
> function splitting improves the readability of the code in general. When I first saw the
> function name "ntb_mw_get_range()", it was not obvious what kind of ranges this function
> returned. The function lacked of "high code coherence" unofficial rule. It is better when
> one function does one coherent thing and return a well coherent data. Particularly
> function "ntb_mw_get_range()" returned a local memory windows mapping address and size, as
> well as alignment of memory allocated for a peer. So now "ntb_mw_get_maprsc()" method
> returns mapping resources. If local NTB client driver is not going to allocate any memory,
> so one just doesn't need to call "ntb_peer_mw_get_align()" method at all. I understand,
> that a client driver could pass NULL to a unused arguments of the "ntb_mw_get_range()",
> but still the new design is better readable.
> 
> Additionally I've split them up because of the difference in the way the asynchronous
> interface works. IDT driver can not safely perform ntb_peer_mw_set_trans(), that's why I
> had to add ntb_mw_set_trans(). Each of that method should logically have related
> "ntb_*mw_get_align()" method. Method ntb_mw_get_align() shall give to a local client
> driver a hint how the retrieved from the peer translated base address should be aligned,
> so ntb_mw_set_trans() method would successfully return. Method ntb_peer_mw_get_align()
> will give a hint how the local memory buffer should be allocated to fulfil a peer
> translated base address alignment. In this way it returns restrictions for parameters of
> "ntb_peer_mw_set_trans()".
> 
> Finally, IDT driver is designed so Primary and Secondary ports can support a different
> number of memory windows. In this way methods
> "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" have
> different range of acceptable values of the second argument, which is determined by the
> "ntb_mw_count()" method, comparing to methods
> "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", which memory
> windows index restriction is determined by the "ntb_peer_mw_count()" method.
> 
> So to speak the splitting was really necessary to make the API looking more logical.

If this change is not required by the new hardware, please submit the change as a separate patch.

> > > +	/* Synchronous hardware is only supported */
> > > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > > +		return -EINVAL;
> > > +	}
> > > +
> >
> > It would be nice if both types could be supported by the same api.
> >
> 
> Yes, it would be. Alas it isn't possible in general. See the introduction to this letter.
> AMD and Intel devices support asynchronous interface, although they lack of messaging
> mechanism.

What is the prototypical application of the IDT message registers?

I'm thinking they will be the first thing available to drivers, and so one primary purpose will be to exchange information for configuring memory windows.  Can you describe how a cluster of eight nodes would discover each other and initialize?

Are they also intended to be useful beyond memory window initialization?  How should they be used efficiently, so that the application can minimize in particular read operations on the pci bus (reading ntb device registers)?  Or are message registers not intended to be used in low latency communications (for that, use doorbells and memory instead)?

> 
> Getting back to the discussion, we still need to provide a way to determine which type of
> interface an NTB device supports: synchronous/asynchronous translated base address
> initialization, Scratchpads and memory windows. Currently it can be determined by the
> functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand, that it's not
> the best solution. We can implement the traditional Linux kernel bus device-driver
> matching, using table_ids and so on. For example, each hardware driver fills in a table
> with all the functionality it supports, like: synchronous/asynchronous memory windows,
> Doorbells, Scratchpads, Messaging. Then driver initialize a table of functionality it
> uses. NTB bus core implements a "match()" callback, which compares those two tables and
> calls "probe()" callback method of a driver when the tables successfully matches.
> 
> On the other hand, we might don't have to comprehend the NTB bus core. We can just
> introduce a table_id for NTB hardware device, which would just describe the device vendor
> itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. Client driver will declare a
> supported device by its table_id. It might look easier, since

emphasis added:

> the client driver developer
> should have a basic understanding of the device one develops a driver for.

This is what I'm hoping to avoid.  I would like to let the driver developer write for the api, not for the specific device.  I would rather the driver check "if feature x is supported" instead of "this is a sync or async device."

> Then NTB bus
> kernel API core will simply match NTB devices with drivers like any other buses (PCI,
> PCIe, i2c, spi, etc) do.
> 

> > > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> >
> > I understand why IDT requires a different api for dealing with addressing multiple
> peers.  I would be interested in a solution that would allow, for example, the Intel
> driver fit under the api for dealing with multiple peers, even though it only supports one
> peer.  I would rather see that, than two separate apis under ntb.
> >
> > Thoughts?
> >
> > Can the sync api be described by some subset of the async api?  Are there less
> overloaded terms we can use instead of sync/async?
> >
> 
> Answer to this concern is mostly provided in the introduction as well. I'll repeat it here
> in details. As I said AMD and Intel hardware support asynchronous API except the
> messaging. Additionally I can even think of emulating messaging using Doorbells and
> Scratchpads, but not the other way around. Why not? Before answering, here is how the
> messaging works in IDT switches of both first and second groups (see introduction for
> describing the groups).
> 
> There are four outbound and inbound message registers for each NTB port in the device.
> Local root complex can connect its any outbound message to any inbound message register of
> the IDT switch. When one writes a data to an outbound message register it immediately gets
> to the connected inbound message registers. Then peer can read its inbound message
> registers and empty it by clearing a corresponding bit. Then and only then next data can
> be written to any outbound message registers connected to that inbound message register.
> So the possible race condition between multiple domains sending a message to same peer is
> resolved by the IDT switch itself.
> 
> One would ask: "Why don't you just wrap the message registers up back to the same port? It
> would look just like Scratchpads." Yes, It would. But still there are only four message
> registers. It's not enough to distribute them between all the possibly connected NTB
> ports. As I said earlier there can be up to eight domains connected, so there must be at
> least seven message register to fulfil the possible design.
> 
> Howbeit all the emulations would look ugly anyway. In my opinion It's better to slightly
> adapt design for a hardware, rather than hardware to a design. Following that rule would
> simplify a code and support respectively.
> 
> Regarding the APIs subset. As I said before async API is kind of subset of synchronous
> API. We can develop all the memory window related callback-method for AMD and Intel
> hardware driver, which is pretty much easy. We can even simulate message registers by
> using Doorbells and Scratchpads, which is not that easy, but possible. Alas the second
> group of IDT switches can't implement the synchronous API, as I already said in the
> introduction.

Message registers operate fundamentally differently from scratchpads (and doorbells, for that matter).  I think we are in agreement.  It's a pain, but maybe the best we can do is require applications to check for support for scratchpads, message registers, and/or doorbells, before using any of those features.  We already have ntb_db_valid_mask() and ntb_spad_count().

I would like to see ntb_msg_count() and more direct access to the message registers in this api.  I would prefer to see the more direct access to hardware message registers, instead of work_struct for message processing in the low level hardware driver.  A more direct interface to the hardware registers would be more like the existing ntb.h api: direct and low-overhead as possible, providing minimal abstraction of the hardware functionality.

I think there is still hope we can unify the memory window interface.  Even though IDT supports things like subdividing the memory windows with table lookup, and specification of destination ports for outgoing translations, I think we can support the same abstraction in the existing drivers with minimal overhead.

For existing Intel and AMD drivers, there may be only one translation per memory window (there is no table to subdivide the memory window), and there is only one destination port (the peer).  The Intel and AMD drivers can ignore the table index in setting up the translation (or validate that the requested table index is equal to zero).

> Regarding the overloaded naming. The "sync/async" names are the best I could think of. If
> you have any idea how one can be appropriately changed, be my guest. I would be really
> glad to substitute them with something better.
> 

Let's try to avoid a distinction, first, beyond just saying "not all hardware will support all these features."  If we absolutely have to make a distinction, let's think of better names then.

> > > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> > >   * @ntb:	NTB device context.
> > > + * @ev:		Event type caused the handler invocation
> > > + * @msg:	Message related to the event
> > > + *
> > > + * Notify the driver context that there is some event happaned in the event
> > > + * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
> > > + * NTB_MSG_SENT is rised if some message has just been successfully sent to a
> > > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > > + * last argument is used to pass the event related message. It discarded right
> > > + * after the handler returns.
> > > + */
> > > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > > +		   struct ntb_msg *msg);
> >
> > I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of
> the message handling to be done more appropriately at a higher layer of the application.
> I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I
> think would be more appropriate for a ntb transport (or higher layer) driver.
> >
> 
> Hmmm, that's how it's done.) MSI interrupt is raised when a new message arrived into a
> first inbound message register (the rest of message registers are used as an additional
> data buffers). Then a corresponding tasklet is started to release a hardware interrupt
> context. That tasklet extracts a message from the inbound message registers, puts it into
> the driver inbound message queue and marks the registers as empty so the next message
> could be retrieved. Then tasklet starts a corresponding kernel work thread delivering all
> new messages to a client driver, which preliminary registered "ntb_msg_event()" callback
> method. When callback method "ntb_msg_event()" the passed message is discarded.

When an interrupt arrives, can you signal the upper layer that a message has arrived, without delivering the message?  I think the lower layer can do without the work structs, instead have the same body of the work struct run in the context of the upper layer polling to receive the message.

> > It looks like there was some rearranging of code, so big hunks appear to be added or
> removed.  Can you split this into two (or more) patches so that rearranging the code is
> distinct from more interesting changes?
> >
> 
> Lets say there was not much rearranging here. I've just put link-related method before
> everything else. The rearranging was done from the point of methods importance view. There
> can't be any memory sharing and doorbells operations done before the link is established.
> The new arrangements is reflected in ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops()
> methods.

It's unfortunate how the diff captured the changes.  Can you split this up into smaller patches?

> > > - * ntb_mw_get_range() - get the range of a memory window
> > > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> >
> > What was insufficient about ntb_mw_get_range() that it needed to be split into
> ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch,
> it seems ntb_mw_get_range() would have been more simple.
> >
> > I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].
> So, there is no example of how usage of new api would be used differently or more
> efficiently than ntb_mw_get_range() for async devices.
> >
> 
> This concern is answered a bit earlier, when you first commented the method
> "ntb_mw_get_range()" splitting.
> 
> You could not find the "ntb_mw_get_mapsrc()" method usage because you misspelled it. The
> real method signature is "ntb_mw_get_maprsc()" (look more carefully at the name ending),
> which is decrypted as "Mapping Resources", but no "Mapping Source". ntb/test/ntb_mw_test.c
> driver is developed to demonstrate how the new asynchronous API is utilized including the
> "ntb_mw_get_maprsc()" method usage.

Right, I misspelled it.  It would be easier to catch a misspelling of ragne.

[PATCH v2 3/3]:
+		/* Retrieve the physical address of the memory to map */
+		ret = ntb_mw_get_maprsc(ntb, mwindx, &outmw->phys_addr,
+			&outmw->size);
+		if (SUCCESS != ret) {
+			dev_err_mw(ctx, "Failed to get map resources of "
+				"outbound window %d", mwindx);
+			mwindx--;
+			goto err_unmap_rsc;
+		}
+
+		/* Map the memory window resources */
+		outmw->virt_addr = ioremap_nocache(outmw->phys_addr, outmw->size);
+
+		/* Retrieve the memory windows maximum size and alignments */
+		ret = ntb_mw_get_align(ntb, mwindx, &outmw->addr_align,
+			&outmw->size_align, &outmw->size_max);
+		if (SUCCESS != ret) {
+			dev_err_mw(ctx, "Failed to get alignment options of "
+				"outbound window %d", mwindx);
+			goto err_unmap_rsc;
+		}

It looks to me like ntb_mw_get_range() would have been sufficient here.  If the change is required by the new driver, please show evidence of that.  If this change is not required by the new hardware, please submit the change as a separate patch.

> > I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the
> following make sense, or have I completely misunderstood something?
> >
> > ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are
> translated to the local memory destination.
> >
> > ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory
> window (is this something that needs to be configured on the local ntb?) are translated to
> the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of
> ntb_mw_set_trans() will complete the translation to the peer memory destination.
> >
> 
> These functions actually do the opposite you described:

That's the point.  I noticed that they are opposite.

> ntb_mw_set_trans() - method sets the translated base address retrieved from a peer, so
> outgoing writes to a memory window would be translated and reach the peer memory
> destination.

In other words, this affects the translation of writes in the direction of the peer memory.  I think this should be named ntb_peer_mw_set_trans().

> ntb_peer_mw_set_trans() - method sets translated base address to peer configuration space,
> so the local incoming writes would be correctly translated on the peer and reach the local
> memory destination.

In other words, this affects the translation for writes in the direction of local memory.  I think this should be named ntb_mw_set_trans().

> Globally thinking, these methods do the same think, when they called from opposite
> domains. So to speak locally called "ntb_mw_set_trans()" method does the same thing as the
> method "ntb_peer_mw_set_trans()" called from a peer, and vise versa the locally called
> method "ntb_peer_mw_set_trans()" does the same procedure as the method
> "ntb_mw_set_trans()" called from a peer.
> 
> To make things simpler, think of memory windows in the framework of the next definition:
> "Memory Window is a virtual memory region, which locally reflects a physical memory of
> peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory
> window, so the locally mapped virtual addresses would be connected with the peer physical
> memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory
> region, so the peer could successfully perform a writes to our local physical memory.
> 
> Of course all the actual memory read/write operations should follow up ntb_mw_get_maprsc()
> and ioremap_nocache() method invocation doublet. You do the same thing in the client test
> drivers for AMD and Intel hadrware.
> 

> > >  /**
> > > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64
> db_bits)
> > >   * append one additional dma memory copy with the doorbell register as the
> > >   * destination, after the memory copy operations.
> > >   *
> > > + * This is unusual, and hardware may not be suitable to implement it.
> > > + *
> >
> > Why is this unusual?  Do you mean async hardware may not support it?
> >
> 
> Of course I can always return an address of a Doorbell register, but it's not safe to do
> it working with IDT NTB hardware driver. To make thing explained simpler think a IDT
> hardware, which supports the Doorbell bits routing. Each local inbound Doorbell bits of
> each port can be configured to either reflect the global switch doorbell bits state or not
> to reflect. Global doorbell bits are set by using outbound doorbell register, which is
> exist for every NTB port. Primary port is the port which can have an access to multiple
> peers, so the Primary port inbound and outbound doorbell registers are shared between
> several NTB devices, sited on the linux kernel NTB bus. As you understand, these devices
> should not interfere each other, which can happen on uncontrollable usage of Doorbell
> registers addresses. That's why the method cou "ntb_peer_db_addr()" should not be
> developed for the IDT NTB hardware driver.

I misread the diff as if this comment was added to the description of ntb_db_clear_mask().

> > > +	if (!ntb->ops->spad_count)
> > > +		return -EINVAL;
> > > +
> >
> > Maybe we should return zero (i.e. there are no scratchpads).
> >
> 
> Agreed. I will fix it in the next patchset.

Thanks.

> > > +	if (!ntb->ops->spad_read)
> > > +		return 0;
> > > +
> >
> > Let's return ~0.  I think that's what a driver would read from the pci bus for a memory
> miss.
> >
> 
> Agreed. I will make it returning -EINVAL in the next patchset.

I don't think we should try to interpret the returned value as an error number.  If the driver supports this method, and this is a valid scratchpad, the peer can put any value in i, including a value that could be interpreted as an error number.

A driver shouldn't be using this method if it isn't supported.  But if it does, I think ~0 is a better poison value than 0.  I just don't want to encourage drivers to try to interpret this value as an error number.

> > > +	if (!ntb->ops->peer_spad_read)
> > > +		return 0;
> >
> > Also, ~0?
> >
> 
> Agreed. I will make it returning -EINVAL in the next patchset.

I don't think we should try to interpret the returned value as an error number.

> > > + * ntb_msg_post() - post the message to the peer
> > > + * @ntb:	NTB device context.
> > > + * @msg:	Message
> > > + *
> > > + * Post the message to a peer. It shall be delivered to the peer by the
> > > + * corresponding hardware method. The peer should be notified about the new
> > > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > > + * event. Otherwise the NTB_MSG_SENT is emitted.
> >
> > Interesting.. local driver would be notified about completion (success or failure) of
> delivery.  Is there any order-of-completion guarantee for the completion notifications?
> Is there some tolerance for faults, in case we never get a completion notification from
> the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link
> comes up again, can we still get a completion notification from the peer, and how would
> that be handled?
> >
> > Does delivery mean the application has processed the message, or is it just delivery at
> the hardware layer, or just delivery at the ntb hardware driver layer?
> >
> 
> Let me explain how the message delivery works. When a client driver calls the
> "ntb_msg_post()" method, the corresponding message is placed in an outbound messages
> queue. Such the message queue exists for every peer device. Then a dedicated kernel work
> thread is started to send all the messages from the queue.

Can we handle the outbound messages queue in an upper layer thread, too, instead of a kernel thread in this low level driver?  I think if we provide more direct access to the hardware semantics of the message registers, we will end up with something like the following, which will also simplify the hardware driver.  Leave it to the upper layer to schedule message processing after receiving an event.

ntb_msg_event(): we received a hardware interrupt for messages. (don't read message status, or anything else)

ntb_msg_status_read(): read and return MSGSTS bitmask (like ntb_db_read()).
ntb_msg_status_clear(): clear bits in MSGSTS bitmask (like ntb_db_clear()).

ntb_msg_mask_set(): set bits in MSGSTSMSK (like ntb_db_mask_set()).
ntb_msg_mask_clear(): clear bits in MSGSTSMSK (like ntb_db_mask_clear()).

ntb_msg_recv(): read and return INMSG and INMSGSRC of the indicated message index.
ntb_msg_send(): write the outgoing message register with the message.

> If kernel thread failed to send
> a message (for instance, if the peer IDT NTB hardware driver still has not freed its
> inbound message registers), it performs a new attempt after a small timeout. If after a
> preconfigured number of attempts the kernel thread still fails to delivery the message, it
> invokes ntb_msg_event() callback with NTB_MSG_FAIL event. If the message is successfully
> delivered, then the method ntb_msg_event() is called with NTB_MSG_SENT event.

In other words, it was delivered to the peer NTB hardware, and the peer NTB hardware accepted the message into an available register.  It does not mean the peer application processed the message, or even that the peer driver received an interrupt for the message?

> 
> To be clear the messsages are transfered directly to the peer memory, but instead they are
> placed in the IDT NTB switch registers, then peer is notified about a new message arrived
> at the corresponding message registers and the corresponding interrupt handler is called.
> 
> If we loose the PCI express or NTB link between the IDT switch and a peer, then the
> ntb_msg_event() method is called with NTB_MSG_FAIL event.

Byzantine fault is an unsolvable class of problem, so it is important to be clear exactly what is supposed to be guaranteed at each layer.  If we get a hardware ACK that the message was delivered, that means it was delivered to the NTB hardware register, but no further.  If we do not get a hardware NAK(?), that means it was not delivered.  If the link fails or we time out waiting for a completion, we can only guess that it wasn't delivered even though there is a small chance it was.  Applications need to be tolerant either way, and needs may be different depending on the application.  I would rather not add any fault tolerance (other than reporting faults) at this layer that is not already implemented in the hardware.

Reading the description of OUTMSGSTS register, it is clear that we can receive a hardware NAK if an outgoing message failed.  It's not clear to me that IDT will notify any kind of ACK that an outgoing message was accepted.  If an application wants to send two messages, it can send the first, check the bit and see there is no failure.  Does reading the status immediately after sending guarantee the message WAS delivered (i.e. IDT NTB hardware blocks reading the status register while there are messages in flight)?  If not, if the application sends the second message and then sees a failure, how can the application be sure the failure is not for the first message?  Does the application have to wait some time (how long?) before checking the message status?

> 
> Finally, I've answered to all the questions. Hopefully the things look clearer now.
> 
> Regards,
> -Sergey



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface
  2016-07-28 10:01 ` [PATCH v2 " Serge Semin
@ 2016-07-28 10:01   ` Serge Semin
  0 siblings, 0 replies; 12+ messages in thread
From: Serge Semin @ 2016-07-28 10:01 UTC (permalink / raw)
  To: jdmason
  Cc: dave.jiang, Allen.Hubbe, Xiangliang.Yu, Sergey.Semin, linux-ntb,
	linux-kernel, Serge Semin

Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
devices, so translated base address of memory windows can be direcly written
to peer registers. But there are some IDT PCIe-switches which implement
complex interfaces using Lookup Tables of translation addresses. Due to
the way the table is accessed, it can not be done synchronously from different
RCs, that's why the asynchronous interface should be developed.

For these purpose the Memory Window related interface is correspondingly split
as it is for Doorbell and Scratchpad registers. The definition of Memory Window
is following: "It is a virtual memory region, which locally reflects a physical
memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
the peers memory windows, "ntb_mw_"-prefixed functions work with the local
memory windows.
Here is the description of the Memory Window related NTB-bus callback
functions:
 - ntb_mw_count() - number of local memory windows.
 - ntb_mw_get_maprsc() - get the physical address and size of the local memory
                         window to map.
 - ntb_mw_set_trans() - set translation address of local memory window (this
                        address should be somehow retrieved from a peer).
 - ntb_mw_get_trans() - get translation address of local memory window.
 - ntb_mw_get_align() - get alignment of translated base address and size of
                        local memory window. Additionally one can get the
                        upper size limit of the memory window.
 - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
                         local number).
 - ntb_peer_mw_set_trans() - set translation address of peer memory window
 - ntb_peer_mw_get_trans() - get translation address of peer memory window
 - ntb_peer_mw_get_align() - get alignment of translated base address and size
                             of peer memory window.Additionally one can get the
                             upper size limit of the memory window.

As one can see current AMD and Intel NTB drivers mostly implement the
"ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
driver functions. IDT NTB driver mostly expose "ntb_nw_"-prefixed methods,
since it doesn't have convenient access to the peer Lookup Table.

In order to pass information from one RC to another NTB functions of IDT
PCIe-switch implement Messaging subsystem. They currently support four message
registers to transfer DWORD sized data to a specified peer. So there are two
new callback methods are introduced:
 - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
                    and receive messages
 - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
                    to a peer
Additionally there is a new event function:
 - ntb_msg_event() - it is invoked when either a new message was retrieved
                     (NTB_MSG_NEW), or last message was successfully sent
                     (NTB_MSG_SENT), or the last message failed to be sent
                     (NTB_MSG_FAIL).

The last change concerns the IDs (practically names) of NTB-devices on the
NTB-bus. It is not good to have the devices with same names in the system
and it brakes my IDT NTB driver from being loaded =) So I developed a simple
algorithm of NTB devices naming. Particulary it generates names "ntbS{N}" for
synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
devices supporting both interfaces.

Signed-off-by: Serge Semin <fancer.lancer@gmail.com>

---
 drivers/ntb/Kconfig                 |   4 +-
 drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
 drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
 drivers/ntb/ntb.c                   |  86 +++++-
 drivers/ntb/ntb_transport.c         |  19 +-
 drivers/ntb/test/ntb_perf.c         |  16 +-
 drivers/ntb/test/ntb_pingpong.c     |   5 +
 drivers/ntb/test/ntb_tool.c         |  25 +-
 include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
 9 files changed, 701 insertions(+), 162 deletions(-)

diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
index 95944e5..67d80c4 100644
--- a/drivers/ntb/Kconfig
+++ b/drivers/ntb/Kconfig
@@ -14,8 +14,6 @@ if NTB
 
 source "drivers/ntb/hw/Kconfig"
 
-source "drivers/ntb/test/Kconfig"
-
 config NTB_TRANSPORT
 	tristate "NTB Transport Client"
 	help
@@ -25,4 +23,6 @@ config NTB_TRANSPORT
 
 	 If unsure, say N.
 
+source "drivers/ntb/test/Kconfig"
+
 endif # NTB
diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c
index 6ccba0d..ab6f353 100644
--- a/drivers/ntb/hw/amd/ntb_hw_amd.c
+++ b/drivers/ntb/hw/amd/ntb_hw_amd.c
@@ -55,6 +55,7 @@
 #include <linux/pci.h>
 #include <linux/random.h>
 #include <linux/slab.h>
+#include <linux/sizes.h>
 #include <linux/ntb.h>
 
 #include "ntb_hw_amd.h"
@@ -84,11 +85,8 @@ static int amd_ntb_mw_count(struct ntb_dev *ntb)
 	return ntb_ndev(ntb)->mw_count;
 }
 
-static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
-				phys_addr_t *base,
-				resource_size_t *size,
-				resource_size_t *align,
-				resource_size_t *align_size)
+static int amd_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
+				 phys_addr_t *base, resource_size_t *size)
 {
 	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
 	int bar;
@@ -103,17 +101,40 @@ static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
 	if (size)
 		*size = pci_resource_len(ndev->ntb.pdev, bar);
 
-	if (align)
-		*align = SZ_4K;
+	return 0;
+}
+
+static int amd_ntb_peer_mw_count(struct ntb_dev *ntb)
+{
+	return ntb_ndev(ntb)->mw_count;
+}
+
+static int amd_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
+				     resource_size_t *addr_align,
+				     resource_size_t *size_align,
+				     resource_size_t *size_max)
+{
+	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
+	int bar;
+
+	bar = ndev_mw_to_bar(ndev, idx);
+	if (bar < 0)
+		return bar;
+
+	if (addr_align)
+		*addr_align = SZ_4K;
+
+	if (size_align)
+		*size_align = 1;
 
-	if (align_size)
-		*align_size = 1;
+	if (size_max)
+		*size_max = pci_resource_len(ndev->ntb.pdev, bar);
 
 	return 0;
 }
 
-static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
-				dma_addr_t addr, resource_size_t size)
+static int amd_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
+				     dma_addr_t addr, resource_size_t size)
 {
 	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
 	unsigned long xlat_reg, limit_reg = 0;
@@ -432,8 +453,10 @@ static int amd_ntb_peer_spad_write(struct ntb_dev *ntb,
 
 static const struct ntb_dev_ops amd_ntb_ops = {
 	.mw_count		= amd_ntb_mw_count,
-	.mw_get_range		= amd_ntb_mw_get_range,
-	.mw_set_trans		= amd_ntb_mw_set_trans,
+	.mw_get_maprsc		= amd_ntb_mw_get_maprsc,
+	.peer_mw_count		= amd_ntb_peer_mw_count,
+	.peer_mw_get_align	= amd_ntb_peer_mw_get_align,
+	.peer_mw_set_trans	= amd_ntb_peer_mw_set_trans,
 	.link_is_up		= amd_ntb_link_is_up,
 	.link_enable		= amd_ntb_link_enable,
 	.link_disable		= amd_ntb_link_disable,
diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c b/drivers/ntb/hw/intel/ntb_hw_intel.c
index 40d04ef..fdb2838 100644
--- a/drivers/ntb/hw/intel/ntb_hw_intel.c
+++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
@@ -804,11 +804,8 @@ static int intel_ntb_mw_count(struct ntb_dev *ntb)
 	return ntb_ndev(ntb)->mw_count;
 }
 
-static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
-				  phys_addr_t *base,
-				  resource_size_t *size,
-				  resource_size_t *align,
-				  resource_size_t *align_size)
+static int intel_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
+				   phys_addr_t *base, resource_size_t *size)
 {
 	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
 	int bar;
@@ -828,17 +825,51 @@ static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
 		*size = pci_resource_len(ndev->ntb.pdev, bar) -
 			(idx == ndev->b2b_idx ? ndev->b2b_off : 0);
 
-	if (align)
-		*align = pci_resource_len(ndev->ntb.pdev, bar);
+	return 0;
+}
+
+static int intel_ntb_peer_mw_count(struct ntb_dev *ntb)
+{
+	return ntb_ndev(ntb)->mw_count;
+}
+
+static int intel_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
+				       resource_size_t *addr_align,
+				       resource_size_t *size_align,
+				       resource_size_t *size_max)
+{
+	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
+	resource_size_t bar_size, mw_size;
+	int bar;
+
+	if (idx >= ndev->b2b_idx && !ndev->b2b_off)
+		idx += 1;
+
+	bar = ndev_mw_to_bar(ndev, idx);
+	if (bar < 0)
+		return bar;
+
+	bar_size = pci_resource_len(ndev->ntb.pdev, bar);
+
+	if (idx == ndev->b2b_idx)
+		mw_size = bar_size - ndev->b2b_off;
+	else
+		mw_size = bar_size;
+
+	if (addr_align)
+		*addr_align = bar_size;
+
+	if (size_align)
+		*size_align = 1;
 
-	if (align_size)
-		*align_size = 1;
+	if (size_max)
+		*size_max = mw_size;
 
 	return 0;
 }
 
-static int intel_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
-				  dma_addr_t addr, resource_size_t size)
+static int intel_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
+				       dma_addr_t addr, resource_size_t size)
 {
 	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
 	unsigned long base_reg, xlat_reg, limit_reg;
@@ -2220,8 +2251,10 @@ static struct intel_b2b_addr xeon_b2b_dsd_addr = {
 /* operations for primary side of local ntb */
 static const struct ntb_dev_ops intel_ntb_ops = {
 	.mw_count		= intel_ntb_mw_count,
-	.mw_get_range		= intel_ntb_mw_get_range,
-	.mw_set_trans		= intel_ntb_mw_set_trans,
+	.mw_get_maprsc		= intel_ntb_mw_get_maprsc,
+	.peer_mw_count		= intel_ntb_peer_mw_count,
+	.peer_mw_get_align	= intel_ntb_peer_mw_get_align,
+	.peer_mw_set_trans	= intel_ntb_peer_mw_set_trans,
 	.link_is_up		= intel_ntb_link_is_up,
 	.link_enable		= intel_ntb_link_enable,
 	.link_disable		= intel_ntb_link_disable,
diff --git a/drivers/ntb/ntb.c b/drivers/ntb/ntb.c
index 2e25307..37c3b36 100644
--- a/drivers/ntb/ntb.c
+++ b/drivers/ntb/ntb.c
@@ -54,6 +54,7 @@
 #include <linux/device.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/atomic.h>
 
 #include <linux/ntb.h>
 #include <linux/pci.h>
@@ -72,8 +73,62 @@ MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(DRIVER_DESCRIPTION);
 
 static struct bus_type ntb_bus;
+static struct ntb_bus_data ntb_data;
 static void ntb_dev_release(struct device *dev);
 
+static int ntb_gen_devid(struct ntb_dev *ntb)
+{
+	const char *name;
+	unsigned long *mask;
+	int id;
+
+	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
+		name = "ntbAS%d";
+		mask = ntb_data.both_msk;
+	} else if (ntb_valid_sync_dev_ops(ntb)) {
+		name = "ntbS%d";
+		mask = ntb_data.sync_msk;
+	} else if (ntb_valid_async_dev_ops(ntb)) {
+		name = "ntbA%d";
+		mask = ntb_data.async_msk;
+	} else {
+		return -EINVAL;
+	}
+
+	for (id = 0; NTB_MAX_DEVID > id; id++) {
+		if (0 == test_and_set_bit(id, mask)) {
+			ntb->id = id;
+			break;
+		}
+	}
+
+	if (NTB_MAX_DEVID > id) {
+		dev_set_name(&ntb->dev, name, ntb->id);
+	} else {
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void ntb_free_devid(struct ntb_dev *ntb)
+{
+	unsigned long *mask;
+
+	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
+		mask = ntb_data.both_msk;
+	} else if (ntb_valid_sync_dev_ops(ntb)) {
+		mask = ntb_data.sync_msk;
+	} else if (ntb_valid_async_dev_ops(ntb)) {
+		mask = ntb_data.async_msk;
+	} else {
+		/* It's impossible */
+		BUG();
+	}
+
+	clear_bit(ntb->id, mask);
+}
+
 int __ntb_register_client(struct ntb_client *client, struct module *mod,
 			  const char *mod_name)
 {
@@ -99,13 +154,15 @@ EXPORT_SYMBOL(ntb_unregister_client);
 
 int ntb_register_device(struct ntb_dev *ntb)
 {
+	int ret;
+
 	if (!ntb)
 		return -EINVAL;
 	if (!ntb->pdev)
 		return -EINVAL;
 	if (!ntb->ops)
 		return -EINVAL;
-	if (!ntb_dev_ops_is_valid(ntb->ops))
+	if (!ntb_valid_sync_dev_ops(ntb) && !ntb_valid_async_dev_ops(ntb))
 		return -EINVAL;
 
 	init_completion(&ntb->released);
@@ -114,13 +171,21 @@ int ntb_register_device(struct ntb_dev *ntb)
 	ntb->dev.bus = &ntb_bus;
 	ntb->dev.parent = &ntb->pdev->dev;
 	ntb->dev.release = ntb_dev_release;
-	dev_set_name(&ntb->dev, "%s", pci_name(ntb->pdev));
 
 	ntb->ctx = NULL;
 	ntb->ctx_ops = NULL;
 	spin_lock_init(&ntb->ctx_lock);
 
-	return device_register(&ntb->dev);
+	/* No need to wait for completion if failed */
+	ret = ntb_gen_devid(ntb);
+	if (ret)
+		return ret;
+
+	ret = device_register(&ntb->dev);
+	if (ret)
+		ntb_free_devid(ntb);
+
+	return ret;
 }
 EXPORT_SYMBOL(ntb_register_device);
 
@@ -128,6 +193,7 @@ void ntb_unregister_device(struct ntb_dev *ntb)
 {
 	device_unregister(&ntb->dev);
 	wait_for_completion(&ntb->released);
+	ntb_free_devid(ntb);
 }
 EXPORT_SYMBOL(ntb_unregister_device);
 
@@ -191,6 +257,20 @@ void ntb_db_event(struct ntb_dev *ntb, int vector)
 }
 EXPORT_SYMBOL(ntb_db_event);
 
+void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
+		   struct ntb_msg *msg)
+{
+	unsigned long irqflags;
+
+	spin_lock_irqsave(&ntb->ctx_lock, irqflags);
+	{
+		if (ntb->ctx_ops && ntb->ctx_ops->msg_event)
+			ntb->ctx_ops->msg_event(ntb->ctx, ev, msg);
+	}
+	spin_unlock_irqrestore(&ntb->ctx_lock, irqflags);
+}
+EXPORT_SYMBOL(ntb_msg_event);
+
 static int ntb_probe(struct device *dev)
 {
 	struct ntb_dev *ntb;
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index d5c5894..2626ba0 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -673,7 +673,7 @@ static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
 	if (!mw->virt_addr)
 		return;
 
-	ntb_mw_clear_trans(nt->ndev, num_mw);
+	ntb_peer_mw_set_trans(nt->ndev, num_mw, 0, 0);
 	dma_free_coherent(&pdev->dev, mw->buff_size,
 			  mw->virt_addr, mw->dma_addr);
 	mw->xlat_size = 0;
@@ -730,7 +730,8 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
 	}
 
 	/* Notify HW the memory location of the receive buffer */
-	rc = ntb_mw_set_trans(nt->ndev, num_mw, mw->dma_addr, mw->xlat_size);
+	rc = ntb_peer_mw_set_trans(nt->ndev, num_mw, mw->dma_addr,
+				   mw->xlat_size);
 	if (rc) {
 		dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
 		ntb_free_mw(nt, num_mw);
@@ -1060,7 +1061,11 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
 	int node;
 	int rc, i;
 
-	mw_count = ntb_mw_count(ndev);
+	/* Synchronous hardware is only supported */
+	if (!ntb_valid_sync_dev_ops(ndev))
+		return -EINVAL;
+
+	mw_count = ntb_peer_mw_count(ndev);
 	if (ntb_spad_count(ndev) < (NUM_MWS + 1 + mw_count * 2)) {
 		dev_err(&ndev->dev, "Not enough scratch pad registers for %s",
 			NTB_TRANSPORT_NAME);
@@ -1094,8 +1099,12 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
 	for (i = 0; i < mw_count; i++) {
 		mw = &nt->mw_vec[i];
 
-		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
-				      &mw->xlat_align, &mw->xlat_align_size);
+		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
+		if (rc)
+			goto err1;
+
+		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
+					   &mw->xlat_align_size, NULL);
 		if (rc)
 			goto err1;
 
diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
index 6a50f20..f2952f7 100644
--- a/drivers/ntb/test/ntb_perf.c
+++ b/drivers/ntb/test/ntb_perf.c
@@ -452,7 +452,7 @@ static void perf_free_mw(struct perf_ctx *perf)
 	if (!mw->virt_addr)
 		return;
 
-	ntb_mw_clear_trans(perf->ntb, 0);
+	ntb_peer_mw_set_trans(perf->ntb, 0, 0, 0);
 	dma_free_coherent(&pdev->dev, mw->buf_size,
 			  mw->virt_addr, mw->dma_addr);
 	mw->xlat_size = 0;
@@ -488,7 +488,7 @@ static int perf_set_mw(struct perf_ctx *perf, resource_size_t size)
 		mw->buf_size = 0;
 	}
 
-	rc = ntb_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
+	rc = ntb_peer_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
 	if (rc) {
 		dev_err(&perf->ntb->dev, "Unable to set mw0 translation\n");
 		perf_free_mw(perf);
@@ -559,8 +559,12 @@ static int perf_setup_mw(struct ntb_dev *ntb, struct perf_ctx *perf)
 
 	mw = &perf->mw;
 
-	rc = ntb_mw_get_range(ntb, 0, &mw->phys_addr, &mw->phys_size,
-			      &mw->xlat_align, &mw->xlat_align_size);
+	rc = ntb_mw_get_maprsc(ntb, 0, &mw->phys_addr, &mw->phys_size);
+	if (rc)
+		return rc;
+
+	rc = ntb_peer_mw_get_align(ntb, 0, &mw->xlat_align,
+				   &mw->xlat_align_size, NULL);
 	if (rc)
 		return rc;
 
@@ -758,6 +762,10 @@ static int perf_probe(struct ntb_client *client, struct ntb_dev *ntb)
 	int node;
 	int rc = 0;
 
+	/* Synchronous hardware is only supported */
+	if (!ntb_valid_sync_dev_ops(ntb))
+		return -EINVAL;
+
 	if (ntb_spad_count(ntb) < MAX_SPAD) {
 		dev_err(&ntb->dev, "Not enough scratch pad registers for %s",
 			DRIVER_NAME);
diff --git a/drivers/ntb/test/ntb_pingpong.c b/drivers/ntb/test/ntb_pingpong.c
index 7d31179..e833649 100644
--- a/drivers/ntb/test/ntb_pingpong.c
+++ b/drivers/ntb/test/ntb_pingpong.c
@@ -214,6 +214,11 @@ static int pp_probe(struct ntb_client *client,
 	struct pp_ctx *pp;
 	int rc;
 
+	/* Synchronous hardware is only supported */
+	if (!ntb_valid_sync_dev_ops(ntb)) {
+		return -EINVAL;
+	}
+
 	if (ntb_db_is_unsafe(ntb)) {
 		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
 		if (!unsafe) {
diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
index 61bf2ef..5dfe12f 100644
--- a/drivers/ntb/test/ntb_tool.c
+++ b/drivers/ntb/test/ntb_tool.c
@@ -675,8 +675,11 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t req_size)
 	if (mw->peer)
 		return 0;
 
-	rc = ntb_mw_get_range(tc->ntb, idx, &base, &size, &align,
-			      &align_size);
+	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &size);
+	if (rc)
+		return rc;
+
+	rc = ntb_peer_mw_get_align(tc->ntb, idx, &align, &align_size, NULL);
 	if (rc)
 		return rc;
 
@@ -689,7 +692,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t req_size)
 	if (!mw->peer)
 		return -ENOMEM;
 
-	rc = ntb_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
+	rc = ntb_peer_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
 	if (rc)
 		goto err_free_dma;
 
@@ -716,7 +719,7 @@ static void tool_free_mw(struct tool_ctx *tc, int idx)
 	struct tool_mw *mw = &tc->mws[idx];
 
 	if (mw->peer) {
-		ntb_mw_clear_trans(tc->ntb, idx);
+		ntb_peer_mw_set_trans(tc->ntb, idx, 0, 0);
 		dma_free_coherent(&tc->ntb->pdev->dev, mw->size,
 				  mw->peer,
 				  mw->peer_dma);
@@ -751,8 +754,8 @@ static ssize_t tool_peer_mw_trans_read(struct file *filep,
 	if (!buf)
 		return -ENOMEM;
 
-	ntb_mw_get_range(mw->tc->ntb, mw->idx,
-			 &base, &mw_size, &align, &align_size);
+	ntb_mw_get_maprsc(mw->tc->ntb, mw->idx, &base, &mw_size);
+	ntb_peer_mw_get_align(mw->tc->ntb, mw->idx, &align, &align_size, NULL);
 
 	off += scnprintf(buf + off, buf_size - off,
 			 "Peer MW %d Information:\n", mw->idx);
@@ -827,8 +830,7 @@ static int tool_init_mw(struct tool_ctx *tc, int idx)
 	phys_addr_t base;
 	int rc;
 
-	rc = ntb_mw_get_range(tc->ntb, idx, &base, &mw->win_size,
-			      NULL, NULL);
+	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &mw->win_size);
 	if (rc)
 		return rc;
 
@@ -913,6 +915,11 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
 	int rc;
 	int i;
 
+	/* Synchronous hardware is only supported */
+	if (!ntb_valid_sync_dev_ops(ntb)) {
+		return -EINVAL;
+	}
+
 	if (ntb_db_is_unsafe(ntb))
 		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
 
@@ -928,7 +935,7 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
 	tc->ntb = ntb;
 	init_waitqueue_head(&tc->link_wq);
 
-	tc->mw_count = min(ntb_mw_count(tc->ntb), MAX_MWS);
+	tc->mw_count = min(ntb_peer_mw_count(tc->ntb), MAX_MWS);
 	for (i = 0; i < tc->mw_count; i++) {
 		rc = tool_init_mw(tc, i);
 		if (rc)
diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index 6f47562..d1937d3 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -159,13 +159,44 @@ static inline int ntb_client_ops_is_valid(const struct ntb_client_ops *ops)
 }
 
 /**
+ * struct ntb_msg - ntb driver message structure
+ * @type:	Message type.
+ * @payload:	Payload data to send to a peer
+ * @data:	Array of u32 data to send (size might be hw dependent)
+ */
+#define NTB_MAX_MSGSIZE 4
+struct ntb_msg {
+	union {
+		struct {
+			u32 type;
+			u32 payload[NTB_MAX_MSGSIZE - 1];
+		};
+		u32 data[NTB_MAX_MSGSIZE];
+	};
+};
+
+/**
+ * enum NTB_MSG_EVENT - message event types
+ * @NTB_MSG_NEW:	New message just arrived and passed to the handler
+ * @NTB_MSG_SENT:	Posted message has just been successfully sent
+ * @NTB_MSG_FAIL:	Posted message failed to be sent
+ */
+enum NTB_MSG_EVENT {
+	NTB_MSG_NEW,
+	NTB_MSG_SENT,
+	NTB_MSG_FAIL
+};
+
+/**
  * struct ntb_ctx_ops - ntb driver context operations
  * @link_event:		See ntb_link_event().
  * @db_event:		See ntb_db_event().
+ * @msg_event:		See ntb_msg_event().
  */
 struct ntb_ctx_ops {
 	void (*link_event)(void *ctx);
 	void (*db_event)(void *ctx, int db_vector);
+	void (*msg_event)(void *ctx, enum NTB_MSG_EVENT ev, struct ntb_msg *msg);
 };
 
 static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
@@ -174,18 +205,24 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
 	return
 		/* ops->link_event		&& */
 		/* ops->db_event		&& */
+		/* ops->msg_event		&& */
 		1;
 }
 
 /**
  * struct ntb_ctx_ops - ntb device operations
- * @mw_count:		See ntb_mw_count().
- * @mw_get_range:	See ntb_mw_get_range().
- * @mw_set_trans:	See ntb_mw_set_trans().
- * @mw_clear_trans:	See ntb_mw_clear_trans().
  * @link_is_up:		See ntb_link_is_up().
  * @link_enable:	See ntb_link_enable().
  * @link_disable:	See ntb_link_disable().
+ * @mw_count:		See ntb_mw_count().
+ * @mw_get_maprsc:	See ntb_mw_get_maprsc().
+ * @mw_set_trans:	See ntb_mw_set_trans().
+ * @mw_get_trans:	See ntb_mw_get_trans().
+ * @mw_get_align:	See ntb_mw_get_align().
+ * @peer_mw_count:	See ntb_peer_mw_count().
+ * @peer_mw_set_trans:	See ntb_peer_mw_set_trans().
+ * @peer_mw_get_trans:	See ntb_peer_mw_get_trans().
+ * @peer_mw_get_align:	See ntb_peer_mw_get_align().
  * @db_is_unsafe:	See ntb_db_is_unsafe().
  * @db_valid_mask:	See ntb_db_valid_mask().
  * @db_vector_count:	See ntb_db_vector_count().
@@ -210,22 +247,38 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
  * @peer_spad_addr:	See ntb_peer_spad_addr().
  * @peer_spad_read:	See ntb_peer_spad_read().
  * @peer_spad_write:	See ntb_peer_spad_write().
+ * @msg_post:		See ntb_msg_post().
+ * @msg_size:		See ntb_msg_size().
  */
 struct ntb_dev_ops {
-	int (*mw_count)(struct ntb_dev *ntb);
-	int (*mw_get_range)(struct ntb_dev *ntb, int idx,
-			    phys_addr_t *base, resource_size_t *size,
-			resource_size_t *align, resource_size_t *align_size);
-	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
-			    dma_addr_t addr, resource_size_t size);
-	int (*mw_clear_trans)(struct ntb_dev *ntb, int idx);
-
 	int (*link_is_up)(struct ntb_dev *ntb,
 			  enum ntb_speed *speed, enum ntb_width *width);
 	int (*link_enable)(struct ntb_dev *ntb,
 			   enum ntb_speed max_speed, enum ntb_width max_width);
 	int (*link_disable)(struct ntb_dev *ntb);
 
+	int (*mw_count)(struct ntb_dev *ntb);
+	int (*mw_get_maprsc)(struct ntb_dev *ntb, int idx,
+			     phys_addr_t *base, resource_size_t *size);
+	int (*mw_get_align)(struct ntb_dev *ntb, int idx,
+			    resource_size_t *addr_align,
+			    resource_size_t *size_align,
+			    resource_size_t *size_max);
+	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
+			    dma_addr_t addr, resource_size_t size);
+	int (*mw_get_trans)(struct ntb_dev *ntb, int idx,
+			    dma_addr_t *addr, resource_size_t *size);
+
+	int (*peer_mw_count)(struct ntb_dev *ntb);
+	int (*peer_mw_get_align)(struct ntb_dev *ntb, int idx,
+				 resource_size_t *addr_align,
+				 resource_size_t *size_align,
+				 resource_size_t *size_max);
+	int (*peer_mw_set_trans)(struct ntb_dev *ntb, int idx,
+				 dma_addr_t addr, resource_size_t size);
+	int (*peer_mw_get_trans)(struct ntb_dev *ntb, int idx,
+				 dma_addr_t *addr, resource_size_t *size);
+
 	int (*db_is_unsafe)(struct ntb_dev *ntb);
 	u64 (*db_valid_mask)(struct ntb_dev *ntb);
 	int (*db_vector_count)(struct ntb_dev *ntb);
@@ -259,47 +312,10 @@ struct ntb_dev_ops {
 			      phys_addr_t *spad_addr);
 	u32 (*peer_spad_read)(struct ntb_dev *ntb, int idx);
 	int (*peer_spad_write)(struct ntb_dev *ntb, int idx, u32 val);
-};
-
-static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
-{
-	/* commented callbacks are not required: */
-	return
-		ops->mw_count				&&
-		ops->mw_get_range			&&
-		ops->mw_set_trans			&&
-		/* ops->mw_clear_trans			&& */
-		ops->link_is_up				&&
-		ops->link_enable			&&
-		ops->link_disable			&&
-		/* ops->db_is_unsafe			&& */
-		ops->db_valid_mask			&&
 
-		/* both set, or both unset */
-		(!ops->db_vector_count == !ops->db_vector_mask) &&
-
-		ops->db_read				&&
-		/* ops->db_set				&& */
-		ops->db_clear				&&
-		/* ops->db_read_mask			&& */
-		ops->db_set_mask			&&
-		ops->db_clear_mask			&&
-		/* ops->peer_db_addr			&& */
-		/* ops->peer_db_read			&& */
-		ops->peer_db_set			&&
-		/* ops->peer_db_clear			&& */
-		/* ops->peer_db_read_mask		&& */
-		/* ops->peer_db_set_mask		&& */
-		/* ops->peer_db_clear_mask		&& */
-		/* ops->spad_is_unsafe			&& */
-		ops->spad_count				&&
-		ops->spad_read				&&
-		ops->spad_write				&&
-		/* ops->peer_spad_addr			&& */
-		/* ops->peer_spad_read			&& */
-		ops->peer_spad_write			&&
-		1;
-}
+	int (*msg_post)(struct ntb_dev *ntb, struct ntb_msg *msg);
+	int (*msg_size)(struct ntb_dev *ntb);
+};
 
 /**
  * struct ntb_client - client interested in ntb devices
@@ -310,10 +326,22 @@ struct ntb_client {
 	struct device_driver		drv;
 	const struct ntb_client_ops	ops;
 };
-
 #define drv_ntb_client(__drv) container_of((__drv), struct ntb_client, drv)
 
 /**
+ * struct ntb_bus_data - NTB bus data
+ * @sync_msk:	Synchroous devices mask
+ * @async_msk:	Asynchronous devices mask
+ * @both_msk:	Both sync and async devices mask
+ */
+#define NTB_MAX_DEVID (8*BITS_PER_LONG)
+struct ntb_bus_data {
+	unsigned long sync_msk[8];
+	unsigned long async_msk[8];
+	unsigned long both_msk[8];
+};
+
+/**
  * struct ntb_device - ntb device
  * @dev:		Linux device object.
  * @pdev:		Pci device entry of the ntb.
@@ -332,15 +360,151 @@ struct ntb_dev {
 
 	/* private: */
 
+	/* device id */
+	int id;
 	/* synchronize setting, clearing, and calling ctx_ops */
 	spinlock_t			ctx_lock;
 	/* block unregister until device is fully released */
 	struct completion		released;
 };
-
 #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
 
 /**
+ * ntb_valid_sync_dev_ops() - valid operations for synchronous hardware setup
+ * @ntb:	NTB device
+ *
+ * There might be two types of NTB hardware differed by the way of the settings
+ * configuration. The synchronous chips allows to set the memory windows by
+ * directly writing to the peer registers. Additionally there can be shared
+ * Scratchpad registers for synchronous information exchange. Client drivers
+ * should call this function to make sure the hardware supports the proper
+ * functionality.
+ */
+static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
+{
+	const struct ntb_dev_ops *ops = ntb->ops;
+
+	/* Commented callbacks are not required, but might be developed */
+	return	/* NTB link status ops */
+		ops->link_is_up					&&
+		ops->link_enable				&&
+		ops->link_disable				&&
+
+		/* Synchronous memory windows ops */
+		ops->mw_count					&&
+		ops->mw_get_maprsc				&&
+		/* ops->mw_get_align				&& */
+		/* ops->mw_set_trans				&& */
+		/* ops->mw_get_trans				&& */
+		ops->peer_mw_count				&&
+		ops->peer_mw_get_align				&&
+		ops->peer_mw_set_trans				&&
+		/* ops->peer_mw_get_trans			&& */
+
+		/* Doorbell ops */
+		/* ops->db_is_unsafe				&& */
+		ops->db_valid_mask				&&
+		/* both set, or both unset */
+		(!ops->db_vector_count == !ops->db_vector_mask)	&&
+		ops->db_read					&&
+		/* ops->db_set					&& */
+		ops->db_clear					&&
+		/* ops->db_read_mask				&& */
+		ops->db_set_mask				&&
+		ops->db_clear_mask				&&
+		/* ops->peer_db_addr				&& */
+		/* ops->peer_db_read				&& */
+		ops->peer_db_set				&&
+		/* ops->peer_db_clear				&& */
+		/* ops->peer_db_read_mask			&& */
+		/* ops->peer_db_set_mask			&& */
+		/* ops->peer_db_clear_mask			&& */
+
+		/* Scratchpad ops */
+		/* ops->spad_is_unsafe				&& */
+		ops->spad_count					&&
+		ops->spad_read					&&
+		ops->spad_write					&&
+		/* ops->peer_spad_addr				&& */
+		/* ops->peer_spad_read				&& */
+		ops->peer_spad_write				&&
+
+		/* Messages IO ops */
+		/* ops->msg_post				&& */
+		/* ops->msg_size				&& */
+		1;
+}
+
+/**
+ * ntb_valid_async_dev_ops() - valid operations for asynchronous hardware setup
+ * @ntb:	NTB device
+ *
+ * There might be two types of NTB hardware differed by the way of the settings
+ * configuration. The asynchronous chips does not allow to set the memory
+ * windows by directly writing to the peer registers. Instead it implements
+ * the additional method to communinicate between NTB nodes like messages.
+ * Scratchpad registers aren't likely supported by such hardware. Client
+ * drivers should call this function to make sure the hardware supports
+ * the proper functionality.
+ */
+static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
+{
+	const struct ntb_dev_ops *ops = ntb->ops;
+
+	/* Commented callbacks are not required, but might be developed */
+	return	/* NTB link status ops */
+		ops->link_is_up					&&
+		ops->link_enable				&&
+		ops->link_disable				&&
+
+		/* Asynchronous memory windows ops */
+		ops->mw_count					&&
+		ops->mw_get_maprsc				&&
+		ops->mw_get_align				&&
+		ops->mw_set_trans				&&
+		/* ops->mw_get_trans				&& */
+		ops->peer_mw_count				&&
+		ops->peer_mw_get_align				&&
+		/* ops->peer_mw_set_trans			&& */
+		/* ops->peer_mw_get_trans			&& */
+
+		/* Doorbell ops */
+		/* ops->db_is_unsafe				&& */
+		ops->db_valid_mask				&&
+		/* both set, or both unset */
+		(!ops->db_vector_count == !ops->db_vector_mask)	&&
+		ops->db_read					&&
+		/* ops->db_set					&& */
+		ops->db_clear					&&
+		/* ops->db_read_mask				&& */
+		ops->db_set_mask				&&
+		ops->db_clear_mask				&&
+		/* ops->peer_db_addr				&& */
+		/* ops->peer_db_read				&& */
+		ops->peer_db_set				&&
+		/* ops->peer_db_clear				&& */
+		/* ops->peer_db_read_mask			&& */
+		/* ops->peer_db_set_mask			&& */
+		/* ops->peer_db_clear_mask			&& */
+
+		/* Scratchpad ops */
+		/* ops->spad_is_unsafe				&& */
+		/* ops->spad_count				&& */
+		/* ops->spad_read				&& */
+		/* ops->spad_write				&& */
+		/* ops->peer_spad_addr				&& */
+		/* ops->peer_spad_read				&& */
+		/* ops->peer_spad_write				&& */
+
+		/* Messages IO ops */
+		ops->msg_post					&&
+		ops->msg_size					&&
+		1;
+}
+
+
+
+/**
  * ntb_register_client() - register a client for interest in ntb devices
  * @client:	Client context.
  *
@@ -441,10 +605,84 @@ void ntb_link_event(struct ntb_dev *ntb);
 void ntb_db_event(struct ntb_dev *ntb, int vector);
 
 /**
- * ntb_mw_count() - get the number of memory windows
+ * ntb_msg_event() - notify driver context of event in messaging subsystem
  * @ntb:	NTB device context.
+ * @ev:		Event type caused the handler invocation
+ * @msg:	Message related to the event
+ *
+ * Notify the driver context that there is some event happaned in the event
+ * subsystem. If NTB_MSG_NEW is emitted then the new message has just arrived.
+ * NTB_MSG_SENT is rised if some message has just been successfully sent to a
+ * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
+ * last argument is used to pass the event related message. It discarded right
+ * after the handler returns.
+ */
+void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
+		   struct ntb_msg *msg);
+
+/**
+ * ntb_link_is_up() - get the current ntb link state
+ * @ntb:	NTB device context.
+ * @speed:	OUT - The link speed expressed as PCIe generation number.
+ * @width:	OUT - The link width expressed as the number of PCIe lanes.
+ *
+ * Get the current state of the ntb link.  It is recommended to query the link
+ * state once after every link event.  It is safe to query the link state in
+ * the context of the link event callback.
+ *
+ * Return: One if the link is up, zero if the link is down, otherwise a
+ *		negative value indicating the error number.
+ */
+static inline int ntb_link_is_up(struct ntb_dev *ntb,
+				 enum ntb_speed *speed, enum ntb_width *width)
+{
+	return ntb->ops->link_is_up(ntb, speed, width);
+}
+
+/**
+ * ntb_link_enable() - enable the link on the secondary side of the ntb
+ * @ntb:	NTB device context.
+ * @max_speed:	The maximum link speed expressed as PCIe generation number.
+ * @max_width:	The maximum link width expressed as the number of PCIe lanes.
  *
- * Hardware and topology may support a different number of memory windows.
+ * Enable the link on the secondary side of the ntb.  This can only be done
+ * from only one (primary or secondary) side of the ntb in primary or b2b
+ * topology.  The ntb device should train the link to its maximum speed and
+ * width, or the requested speed and width, whichever is smaller, if supported.
+ *
+ * Return: Zero on success, otherwise an error number.
+ */
+static inline int ntb_link_enable(struct ntb_dev *ntb,
+				  enum ntb_speed max_speed,
+				  enum ntb_width max_width)
+{
+	return ntb->ops->link_enable(ntb, max_speed, max_width);
+}
+
+/**
+ * ntb_link_disable() - disable the link on the secondary side of the ntb
+ * @ntb:	NTB device context.
+ *
+ * Disable the link on the secondary side of the ntb.  This can only be
+ * done from only one (primary or secondary) side of the ntb in primary or b2b
+ * topology.  The ntb device should disable the link.  Returning from this call
+ * must indicate that a barrier has passed, though with no more writes may pass
+ * in either direction across the link, except if this call returns an error
+ * number.
+ *
+ * Return: Zero on success, otherwise an error number.
+ */
+static inline int ntb_link_disable(struct ntb_dev *ntb)
+{
+	return ntb->ops->link_disable(ntb);
+}
+
+/**
+ * ntb_mw_count() - get the number of local memory windows
+ * @ntb:	NTB device context.
+ *
+ * Hardware and topology may support a different number of memory windows at
+ * local and remote devices
  *
  * Return: the number of memory windows.
  */
@@ -454,122 +692,186 @@ static inline int ntb_mw_count(struct ntb_dev *ntb)
 }
 
 /**
- * ntb_mw_get_range() - get the range of a memory window
+ * ntb_mw_get_maprsc() - get the range of a memory window to map
  * @ntb:	NTB device context.
  * @idx:	Memory window number.
  * @base:	OUT - the base address for mapping the memory window
  * @size:	OUT - the size for mapping the memory window
- * @align:	OUT - the base alignment for translating the memory window
- * @align_size:	OUT - the size alignment for translating the memory window
  *
- * Get the range of a memory window.  NULL may be given for any output
- * parameter if the value is not needed.  The base and size may be used for
- * mapping the memory window, to access the peer memory.  The alignment and
- * size may be used for translating the memory window, for the peer to access
- * memory on the local system.
+ * Get the map range of a memory window. The base and size may be used for
+ * mapping the memory window to access the peer memory.
  *
  * Return: Zero on success, otherwise an error number.
  */
-static inline int ntb_mw_get_range(struct ntb_dev *ntb, int idx,
-				   phys_addr_t *base, resource_size_t *size,
-		resource_size_t *align, resource_size_t *align_size)
+static inline int ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
+				    phys_addr_t *base, resource_size_t *size)
 {
-	return ntb->ops->mw_get_range(ntb, idx, base, size,
-			align, align_size);
+	return ntb->ops->mw_get_maprsc(ntb, idx, base, size);
+}
+
+/**
+ * ntb_mw_get_align() - get memory window alignment of the local node
+ * @ntb:	NTB device context.
+ * @idx:	Memory window number.
+ * @addr_align:	OUT - the translated base address alignment of the memory window
+ * @size_align:	OUT - the translated memory size alignment of the memory window
+ * @size_max:	OUT - the translated memory maximum size
+ *
+ * Get the alignment parameters to allocate the proper memory window. NULL may
+ * be given for any output parameter if the value is not needed.
+ *
+ * Drivers of synchronous hardware don't have to support it.
+ *
+ * Return: Zero on success, otherwise an error number.
+ */
+static inline int ntb_mw_get_align(struct ntb_dev *ntb, int idx,
+				   resource_size_t *addr_align,
+				   resource_size_t *size_align,
+				   resource_size_t *size_max)
+{
+	if (!ntb->ops->mw_get_align)
+		return -EINVAL;
+
+	return ntb->ops->mw_get_align(ntb, idx, addr_align, size_align, size_max);
 }
 
 /**
- * ntb_mw_set_trans() - set the translation of a memory window
+ * ntb_mw_set_trans() - set the translated base address of a peer memory window
  * @ntb:	NTB device context.
  * @idx:	Memory window number.
- * @addr:	The dma address local memory to expose to the peer.
- * @size:	The size of the local memory to expose to the peer.
+ * @addr:	DMA memory address exposed by the peer.
+ * @size:	Size of the memory exposed by the peer.
+ *
+ * Set the translated base address of a memory window. The peer preliminary
+ * allocates a memory, then someway passes the address to the remote node, that
+ * finally sets up the memory window at the address, up to the size. The address
+ * and size must be aligned to the parameters specified by ntb_mw_get_align() of
+ * the local node and ntb_peer_mw_get_align() of the peer, which must return the
+ * same values. Zero size effectively disables the memory window.
  *
- * Set the translation of a memory window.  The peer may access local memory
- * through the window starting at the address, up to the size.  The address
- * must be aligned to the alignment specified by ntb_mw_get_range().  The size
- * must be aligned to the size alignment specified by ntb_mw_get_range().
+ * Drivers of synchronous hardware don't have to support it.
  *
  * Return: Zero on success, otherwise an error number.
  */
 static inline int ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
 				   dma_addr_t addr, resource_size_t size)
 {
+	if (!ntb->ops->mw_set_trans)
+		return -EINVAL;
+
 	return ntb->ops->mw_set_trans(ntb, idx, addr, size);
 }
 
 /**
- * ntb_mw_clear_trans() - clear the translation of a memory window
+ * ntb_mw_get_trans() - get the translated base address of a memory window
  * @ntb:	NTB device context.
  * @idx:	Memory window number.
+ * @addr:	The dma memory address exposed by the peer.
+ * @size:	The size of the memory exposed by the peer.
  *
- * Clear the translation of a memory window.  The peer may no longer access
- * local memory through the window.
+ * Get the translated base address of a memory window spicified for the local
+ * hardware and allocated by the peer. If the addr and size are zero, the
+ * memory window is effectively disabled.
  *
  * Return: Zero on success, otherwise an error number.
  */
-static inline int ntb_mw_clear_trans(struct ntb_dev *ntb, int idx)
+static inline int ntb_mw_get_trans(struct ntb_dev *ntb, int idx,
+				   dma_addr_t *addr, resource_size_t *size)
 {
-	if (!ntb->ops->mw_clear_trans)
-		return ntb->ops->mw_set_trans(ntb, idx, 0, 0);
+	if (!ntb->ops->mw_get_trans)
+		return -EINVAL;
 
-	return ntb->ops->mw_clear_trans(ntb, idx);
+	return ntb->ops->mw_get_trans(ntb, idx, addr, size);
 }
 
 /**
- * ntb_link_is_up() - get the current ntb link state
+ * ntb_peer_mw_count() - get the number of peer memory windows
  * @ntb:	NTB device context.
- * @speed:	OUT - The link speed expressed as PCIe generation number.
- * @width:	OUT - The link width expressed as the number of PCIe lanes.
  *
- * Get the current state of the ntb link.  It is recommended to query the link
- * state once after every link event.  It is safe to query the link state in
- * the context of the link event callback.
+ * Hardware and topology may support a different number of memory windows at
+ * local and remote nodes.
  *
- * Return: One if the link is up, zero if the link is down, otherwise a
- *		negative value indicating the error number.
+ * Return: the number of memory windows.
  */
-static inline int ntb_link_is_up(struct ntb_dev *ntb,
-				 enum ntb_speed *speed, enum ntb_width *width)
+static inline int ntb_peer_mw_count(struct ntb_dev *ntb)
 {
-	return ntb->ops->link_is_up(ntb, speed, width);
+	return ntb->ops->peer_mw_count(ntb);
 }
 
 /**
- * ntb_link_enable() - enable the link on the secondary side of the ntb
+ * ntb_peer_mw_get_align() - get memory window alignment of the peer
  * @ntb:	NTB device context.
- * @max_speed:	The maximum link speed expressed as PCIe generation number.
- * @max_width:	The maximum link width expressed as the number of PCIe lanes.
+ * @idx:	Memory window number.
+ * @addr_align:	OUT - the translated base address alignment of the memory window
+ * @size_align:	OUT - the translated memory size alignment of the memory window
+ * @size_max:	OUT - the translated memory maximum size
  *
- * Enable the link on the secondary side of the ntb.  This can only be done
- * from the primary side of the ntb in primary or b2b topology.  The ntb device
- * should train the link to its maximum speed and width, or the requested speed
- * and width, whichever is smaller, if supported.
+ * Get the alignment parameters to allocate the proper memory window for the
+ * peer. NULL may be given for any output parameter if the value is not needed.
  *
  * Return: Zero on success, otherwise an error number.
  */
-static inline int ntb_link_enable(struct ntb_dev *ntb,
-				  enum ntb_speed max_speed,
-				  enum ntb_width max_width)
+static inline int ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
+					resource_size_t *addr_align,
+					resource_size_t *size_align,
+					resource_size_t *size_max)
 {
-	return ntb->ops->link_enable(ntb, max_speed, max_width);
+	if (!ntb->ops->peer_mw_get_align)
+		return -EINVAL;
+
+	return ntb->ops->peer_mw_get_align(ntb, idx, addr_align, size_align,
+					   size_max);
 }
 
 /**
- * ntb_link_disable() - disable the link on the secondary side of the ntb
+ * ntb_peer_mw_set_trans() - set the translated base address of a peer
+ *			     memory window
  * @ntb:	NTB device context.
+ * @idx:	Memory window number.
+ * @addr:	Local DMA memory address exposed to the peer.
+ * @size:	Size of the memory exposed to the peer.
  *
- * Disable the link on the secondary side of the ntb.  This can only be
- * done from the primary side of the ntb in primary or b2b topology.  The ntb
- * device should disable the link.  Returning from this call must indicate that
- * a barrier has passed, though with no more writes may pass in either
- * direction across the link, except if this call returns an error number.
+ * Set the translated base address of a memory window exposed to the peer.
+ * The local node preliminary allocates the window, then directly writes the
+ * address and size to the peer control registers. The address and size must
+ * be aligned to the parameters specified by ntb_peer_mw_get_align() of
+ * the local node and ntb_mw_get_align() of the peer, which must return the
+ * same values. Zero size effectively disables the memory window.
+ *
+ * Drivers of synchronous hardware must support it.
  *
  * Return: Zero on success, otherwise an error number.
  */
-static inline int ntb_link_disable(struct ntb_dev *ntb)
+static inline int ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
+					dma_addr_t addr, resource_size_t size)
 {
-	return ntb->ops->link_disable(ntb);
+	if (!ntb->ops->peer_mw_set_trans)
+		return -EINVAL;
+
+	return ntb->ops->peer_mw_set_trans(ntb, idx, addr, size);
+}
+
+/**
+ * ntb_peer_mw_get_trans() - get the translated base address of a peer
+ *			     memory window
+ * @ntb:	NTB device context.
+ * @idx:	Memory window number.
+ * @addr:	Local dma memory address exposed to the peer.
+ * @size:	Size of the memory exposed to the peer.
+ *
+ * Get the translated base address of a memory window spicified for the peer
+ * hardware. If the addr and size are zero then the memory window is effectively
+ * disabled.
+ *
+ * Return: Zero on success, otherwise an error number.
+ */
+static inline int ntb_peer_mw_get_trans(struct ntb_dev *ntb, int idx,
+					dma_addr_t *addr, resource_size_t *size)
+{
+	if (!ntb->ops->peer_mw_get_trans)
+		return -EINVAL;
+
+	return ntb->ops->peer_mw_get_trans(ntb, idx, addr, size);
 }
 
 /**
@@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits)
  * append one additional dma memory copy with the doorbell register as the
  * destination, after the memory copy operations.
  *
+ * This is unusual, and hardware may not be suitable to implement it.
+ *
  * Return: Zero on success, otherwise an error number.
  */
 static inline int ntb_peer_db_addr(struct ntb_dev *ntb,
@@ -901,10 +1205,15 @@ static inline int ntb_spad_is_unsafe(struct ntb_dev *ntb)
  *
  * Hardware and topology may support a different number of scratchpads.
  *
+ * Asynchronous hardware may not support it.
+ *
  * Return: the number of scratchpads.
  */
 static inline int ntb_spad_count(struct ntb_dev *ntb)
 {
+	if (!ntb->ops->spad_count)
+		return -EINVAL;
+
 	return ntb->ops->spad_count(ntb);
 }
 
@@ -915,10 +1224,15 @@ static inline int ntb_spad_count(struct ntb_dev *ntb)
  *
  * Read the local scratchpad register, and return the value.
  *
+ * Asynchronous hardware may not support it.
+ *
  * Return: The value of the local scratchpad register.
  */
 static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
 {
+	if (!ntb->ops->spad_read)
+		return 0;
+
 	return ntb->ops->spad_read(ntb, idx);
 }
 
@@ -930,10 +1244,15 @@ static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
  *
  * Write the value to the local scratchpad register.
  *
+ * Asynchronous hardware may not support it.
+ *
  * Return: Zero on success, otherwise an error number.
  */
 static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32 val)
 {
+	if (!ntb->ops->spad_write)
+		return -EINVAL;
+
 	return ntb->ops->spad_write(ntb, idx, val);
 }
 
@@ -946,6 +1265,8 @@ static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32 val)
  * Return the address of the peer doorbell register.  This may be used, for
  * example, by drivers that offload memory copy operations to a dma engine.
  *
+ * Asynchronous hardware may not support it.
+ *
  * Return: Zero on success, otherwise an error number.
  */
 static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
@@ -964,10 +1285,15 @@ static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
  *
  * Read the peer scratchpad register, and return the value.
  *
+ * Asynchronous hardware may not support it.
+ *
  * Return: The value of the local scratchpad register.
  */
 static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
 {
+	if (!ntb->ops->peer_spad_read)
+		return 0;
+
 	return ntb->ops->peer_spad_read(ntb, idx);
 }
 
@@ -979,11 +1305,59 @@ static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
  *
  * Write the value to the peer scratchpad register.
  *
+ * Asynchronous hardware may not support it.
+ *
  * Return: Zero on success, otherwise an error number.
  */
 static inline int ntb_peer_spad_write(struct ntb_dev *ntb, int idx, u32 val)
 {
+	if (!ntb->ops->peer_spad_write)
+		return -EINVAL;
+
 	return ntb->ops->peer_spad_write(ntb, idx, val);
 }
 
+/**
+ * ntb_msg_post() - post the message to the peer
+ * @ntb:	NTB device context.
+ * @msg:	Message
+ *
+ * Post the message to a peer. It shall be delivered to the peer by the
+ * corresponding hardware method. The peer should be notified about the new
+ * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
+ * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
+ * event. Otherwise the NTB_MSG_SENT is emitted.
+ *
+ * Synchronous hardware may not support it.
+ *
+ * Return: Zero on success, otherwise an error number.
+ */
+static inline int ntb_msg_post(struct ntb_dev *ntb, struct ntb_msg *msg)
+{
+	if (!ntb->ops->msg_post)
+		return -EINVAL;
+
+	return ntb->ops->msg_post(ntb, msg);
+}
+
+/**
+ * ntb_msg_size() - size of the message data
+ * @ntb:	NTB device context.
+ *
+ * Different hardware may support different number of message registers. This
+ * callback shall return the number of those used for data sending and
+ * receiving including the type field.
+ *
+ * Synchronous hardware may not support it.
+ *
+ * Return: Zero on success, otherwise an error number.
+ */
+static inline int ntb_msg_size(struct ntb_dev *ntb)
+{
+	if (!ntb->ops->msg_size)
+		return 0;
+
+	return ntb->ops->msg_size(ntb);
+}
+
 #endif
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-08-19 13:42 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-05 15:31 [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface Allen Hubbe
2016-08-05 15:31 ` Allen Hubbe
2016-08-07 17:50 ` Serge Semin
  -- strict thread matches above, loose matches on Subject: below --
2016-08-08 21:48 Allen Hubbe
2016-08-08 21:48 ` Allen Hubbe
2016-08-18 21:56 ` Serge Semin
2016-08-18 21:56   ` Serge Semin
2016-08-19  9:10   ` Serge Semin
2016-08-19  9:10     ` Serge Semin
2016-08-19 13:41     ` Allen Hubbe
2016-08-19 13:41       ` Allen Hubbe
2016-07-26 19:50 [PATCH 0/3] ntb: Asynchronous NTB devices support Serge Semin
2016-07-28 10:01 ` [PATCH v2 " Serge Semin
2016-07-28 10:01   ` [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus interface Serge Semin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.