* [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management
@ 2021-06-01 12:00 Anatoly Burakov
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion Anatoly Burakov
                   ` (7 more replies)
  0 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev; +Cc: ciara.loftus, david.hunt

This patchset introduces several changes related to PMD power management:

- Add inverted checks to monitor intrinsics, based on a previous patchset [1]
  but incorporating feedback [2]; this will hopefully make it possible to add
  support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
  set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
  accompanying infrastructure and example apps changes

[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274

Anatoly Burakov (7):
  power_intrinsics: allow monitor checks inversion
  net/af_xdp: add power monitor support
  eal: add power monitor for multiple events
  power: remove thread safety from PMD power API's
  power: support callbacks for multiple Rx queues
  power: support monitoring multiple Rx queues
  l3fwd-power: support multiqueue in PMD pmgmt modes

 drivers/net/af_xdp/rte_eth_af_xdp.c           |  25 +
 examples/l3fwd-power/main.c                   |  39 +-
 lib/eal/arm/rte_power_intrinsics.c            |  11 +
 lib/eal/include/generic/rte_cpuflags.h        |   2 +
 .../include/generic/rte_power_intrinsics.h    |  39 ++
 lib/eal/ppc/rte_power_intrinsics.c            |  11 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_cpuflags.c                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  74 ++-
 lib/power/meson.build                         |   3 +
 lib/power/rte_power_pmd_mgmt.c                | 500 +++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h                |  40 ++
 lib/power/version.map                         |   3 +
 13 files changed, 596 insertions(+), 156 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-06-01 12:00 ` Anatoly Burakov
  2021-06-21 12:56   ` Ananyev, Konstantin
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 2/7] net/af_xdp: add power monitor support Anatoly Burakov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev; +Cc: ciara.loftus, david.hunt

Previously, the semantics of power monitor were to check the current value
against an expected value, and if they matched, abort the sleep. This is
somewhat inflexible, because it only allows checking for one specific value.

This commit adds an option to invert the check, so that the monitor sleep is
aborted if the value in memory *doesn't* match the expected value. This still
covers all currently implemented driver code, while also supporting use cases
that don't map easily onto the previous semantics (such as waiting on writes
to an AF_XDP counter value).

Since the old behavior remains the default, existing implementations need no
adjustment.
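To illustrate the new semantics (this is not part of the patch), the check can
be modelled in plain C; `monitor_should_abort` is a hypothetical helper
mirroring the comparison logic in the diff below:

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the rte_power_monitor() check: returns 1 if the
 * sleep should be aborted, given the current value at the monitored address,
 * the mask, the expected value, and the new invert flag.
 */
static int
monitor_should_abort(uint64_t cur_value, uint64_t mask, uint64_t val,
		uint8_t invert)
{
	const uint64_t masked = cur_value & mask;

	/* default semantics: abort when the masked value matches */
	if (!invert && masked == val)
		return 1;
	/* inverted semantics: abort when the masked value does not match */
	if (invert && masked != val)
		return 1;
	return 0;
}
```

With `invert` set, a driver can hand out the *current* value of a counter and
have the sleep end on any change to it, which is the AF_XDP use case in
patch 2.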

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
 lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..1006c2edfc 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
 	                  *   4, or 8. Supplying any other value will result in
 	                  *   an error.
 	                  */
+	uint8_t invert;  /**< Invert check for expected value (e.g. instead of
+	                  *   checking if `val` matches something, check if
+	                  *   `val` *doesn't* match a particular value)
+	                  */
 };
 
 /**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..5d944e9aa4 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
+		if (!pmc->invert && masked == pmc->val)
+			goto end;
+		/* same, but for inverse check */
+		if (pmc->invert && masked != pmc->val)
 			goto end;
 	}
 
-- 
2.25.1



* [dpdk-dev] [PATCH v1 2/7] net/af_xdp: add power monitor support
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion Anatoly Burakov
@ 2021-06-01 12:00 ` Anatoly Burakov
  2021-06-02 12:59   ` Loftus, Ciara
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 3/7] eal: add power monitor for multiple events Anatoly Burakov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt

Implement support for .get_monitor_addr in AF_XDP driver.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..dfbf74ea53 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
 
 #include "compat.h"
 
@@ -788,6 +789,29 @@ eth_dev_configure(struct rte_eth_dev *dev)
 	return 0;
 }
 
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct pkt_rx_queue *rxq = rx_queue;
+	unsigned int *prod = rxq->rx.producer;
+	const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+	/* watch for changes in producer ring */
+	pmc->addr = (void *)prod;
+
+	/* store current value */
+	pmc->val = cur_val;
+	pmc->mask = (uint32_t)~0; /* mask entire uint32_t value */
+
+	/* AF_XDP producer ring index is 32-bit */
+	pmc->size = sizeof(uint32_t);
+
+	/* this requires an inverted check */
+	pmc->invert = 1;
+
+	return 0;
+}
+
 static int
 eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
@@ -1448,6 +1472,7 @@ static const struct eth_dev_ops ops = {
 	.link_update = eth_link_update,
 	.stats_get = eth_stats_get,
 	.stats_reset = eth_stats_reset,
+	.get_monitor_addr = eth_get_monitor_addr
 };
 
 /** parse busy_budget argument */
-- 
2.25.1



* [dpdk-dev] [PATCH v1 3/7] eal: add power monitor for multiple events
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion Anatoly Burakov
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-06-01 12:00 ` Anatoly Burakov
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
  Cc: ciara.loftus, david.hunt

Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/arm/rte_power_intrinsics.c            | 11 +++
 lib/eal/include/generic/rte_cpuflags.h        |  2 +
 .../include/generic/rte_power_intrinsics.h    | 35 ++++++++++
 lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
 lib/eal/version.map                           |  3 +
 lib/eal/x86/rte_cpuflags.c                    |  2 +
 lib/eal/x86/rte_power_intrinsics.c            | 69 +++++++++++++++++++
 7 files changed, 133 insertions(+)

diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
 	/**< indicates support for rte_power_monitor function */
 	uint32_t power_pause : 1;
 	/**< indicates support for rte_power_pause function */
+	uint32_t power_monitor_multi : 1;
+	/**< indicates support for rte_power_monitor_multi function */
 };
 
 /**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index 1006c2edfc..acb0d759ce 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -113,4 +113,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 __rte_experimental
 int rte_power_pause(const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each condition carries an expected 64-bit value and a 64-bit
+ * mask. If the mask is non-zero, the current value at the monitored address
+ * will be checked against the expected value, and if they do not match,
+ * entering the optimized power state may be aborted.
+ *
+ * @warning It is the responsibility of the user to check whether this
+ *   function is supported at runtime, using the
+ *   `rte_cpu_get_intrinsics_support()` API call.
+ *
+ * @param pmc
+ *   An array of monitoring condition structures.
+ * @param num
+ *   Length of the `pmc` array.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp);
+
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
 	rte_version_release; # WINDOWS_NO_EXPORT
 	rte_version_suffix; # WINDOWS_NO_EXPORT
 	rte_version_year; # WINDOWS_NO_EXPORT
+
+	# added in 21.08
+	rte_power_monitor_multi; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
 		intrinsics->power_monitor = 1;
 		intrinsics->power_pause = 1;
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+			intrinsics->power_monitor_multi = 1;
 	}
 }
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 5d944e9aa4..4f972673ce 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_rtm.h>
 #include <rte_spinlock.h>
 
 #include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
 }
 
 static bool wait_supported;
+static bool wait_multi_supported;
 
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -170,6 +172,8 @@ RTE_INIT(rte_power_intrinsics_init) {
 
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
+	if (i.power_monitor_multi)
+		wait_multi_supported = 1;
 }
 
 int
@@ -208,6 +212,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	 * In this case, since we've already woken up, the "wakeup" was
 	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
 	 * wakeup address is still valid so it's perfectly safe to write it.
+	 *
+	 * For the multi-monitor case, the act of locking will itself trigger
+	 * the wakeup, so no additional writes are necessary.
 	 */
 	rte_spinlock_lock(&s->lock);
 	if (s->monitor_addr != NULL)
@@ -216,3 +223,65 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return 0;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s = &wait_status[lcore_id];
+	uint32_t i, rc;
+
+	/* check if supported */
+	if (!wait_multi_supported)
+		return -ENOTSUP;
+
+	if (pmc == NULL || num == 0)
+		return -EINVAL;
+
+	/* we are already inside transaction region, return */
+	if (rte_xtest() != 0)
+		return 0;
+
+	/* start new transaction region */
+	rc = rte_xbegin();
+
+	/* transaction abort, possible write to one of wait addresses */
+	if (rc != RTE_XBEGIN_STARTED)
+		return 0;
+
+	/*
+	 * the mere act of reading the lock status here adds the lock to
+	 * the read set. This means that when we trigger a wakeup from another
+	 * thread, even if we don't have a defined wakeup address and thus don't
+	 * actually cause any writes, the act of locking our lock will itself
+	 * trigger the wakeup and abort the transaction.
+	 */
+	rte_spinlock_is_locked(&s->lock);
+
+	/*
+	 * add all addresses to wait on into transaction read-set and check if
+	 * any of wakeup conditions are already met.
+	 */
+	for (i = 0; i < num; i++) {
+		const struct rte_power_monitor_cond *c = &pmc[i];
+		const uint64_t val = __get_umwait_val(c->addr, c->size);
+		const uint64_t masked = val & c->mask;
+
+		/* if the masked value is already matching, abort */
+		if (!c->invert && masked == c->val)
+			break;
+		/* same, but for inverse check */
+		if (c->invert && masked != c->val)
+			break;
+	}
+
+	/* none of the conditions were met, sleep until timeout */
+	if (i == num)
+		rte_power_pause(tsc_timestamp);
+
+	/* end transaction region */
+	rte_xend();
+
+	return 0;
+}
-- 
2.25.1



* [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
                   ` (2 preceding siblings ...)
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-06-01 12:00 ` Anatoly Burakov
  2021-06-22  9:13   ` Ananyev, Konstantin
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Currently, we expect that only one callback can be active at any given
moment for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.

We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll take the easy way out and just
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/power/meson.build          |   3 +
 lib/power/rte_power_pmd_mgmt.c | 106 ++++++++-------------------------
 lib/power/rte_power_pmd_mgmt.h |   6 ++
 3 files changed, 35 insertions(+), 80 deletions(-)

diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
         'rte_power_pmd_mgmt.h',
         'rte_power_guest_channel.h',
 )
+if cc.has_argument('-Wno-cast-qual')
+    cflags += '-Wno-cast-qual'
+endif
 deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..0707c60a4f 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
 	/**< Callback mode for this queue */
 	const struct rte_eth_rxtx_callback *cur_cb;
 	/**< Callback instance */
-	volatile bool umwait_in_progress;
-	/**< are we currently sleeping? */
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 			struct rte_power_monitor_cond pmc;
 			uint16_t ret;
 
-			/*
-			 * we might get a cancellation request while being
-			 * inside the callback, in which case the wakeup
-			 * wouldn't work because it would've arrived too early.
-			 *
-			 * to get around this, we notify the other thread that
-			 * we're sleeping, so that it can spin until we're done.
-			 * unsolicited wakeups are perfectly safe.
-			 */
-			q_conf->umwait_in_progress = true;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-			/* check if we need to cancel sleep */
-			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
-				/* use monitoring condition to sleep */
-				ret = rte_eth_get_monitor_addr(port_id, qidx,
-						&pmc);
-				if (ret == 0)
-					rte_power_monitor(&pmc, UINT64_MAX);
-			}
-			q_conf->umwait_in_progress = false;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+			/* use monitoring condition to sleep */
+			ret = rte_eth_get_monitor_addr(port_id, qidx,
+					&pmc);
+			if (ret == 0)
+				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
 		q_conf->empty_poll_stats = 0;
@@ -183,6 +162,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 {
 	struct pmd_queue_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
+	rte_rx_callback_fn clb;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -232,17 +212,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->umwait_in_progress = false;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* ensure we update our state before callback starts */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_umwait, NULL);
+		clb = clb_umwait;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +239,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
-				queue_id, clb_scale_freq, NULL);
+		clb = clb_scale_freq;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +247,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (global_data.tsc_per_us == 0)
 			calc_tsc();
 
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_pause, NULL);
+		clb = clb_pause;
 		break;
+	default:
+		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+		ret = -EINVAL;
+		goto end;
 	}
+
+	/* initialize data before enabling the callback */
+	queue_cfg->empty_poll_stats = 0;
+	queue_cfg->cb_mode = mode;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, NULL);
+
 	ret = 0;
 end:
 	return ret;
@@ -323,27 +287,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	/* stop any callbacks from progressing */
 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
-	/* ensure we update our state before continuing */
-	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
 	switch (queue_cfg->cb_mode) {
-	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		bool exit = false;
-		do {
-			/*
-			 * we may request cancellation while the other thread
-			 * has just entered the callback but hasn't started
-			 * sleeping yet, so keep waking it up until we know it's
-			 * done sleeping.
-			 */
-			if (queue_cfg->umwait_in_progress)
-				rte_power_monitor_wakeup(lcore_id);
-			else
-				exit = true;
-		} while (!exit);
-	}
-	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		rte_eth_remove_rx_callback(port_id, queue_id,
 				queue_cfg->cur_cb);
@@ -356,10 +301,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		break;
 	}
 	/*
-	 * we don't free the RX callback here because it is unsafe to do so
-	 * unless we know for a fact that all data plane threads have stopped.
+	 * the API doc mandates that the user stops all processing on affected
+	 * ports before calling any of these API's, so we can assume that the
+	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	queue_cfg->cur_cb = NULL;
+	rte_free((void *)queue_cfg->cur_cb);
 
 	return 0;
 }
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..7557f5d7e1 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet ports are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue will be polled from.
  * @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet ports are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue is polled from.
  * @param port_id
-- 
2.25.1



* [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
                   ` (3 preceding siblings ...)
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-01 12:00 ` Anatoly Burakov
  2021-06-22  9:41   ` Ananyev, Konstantin
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 6/7] power: support monitoring " Anatoly Burakov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev, David Hunt, Ray Kinsella, Neil Horman; +Cc: ciara.loftus

Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.

The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:

- Replace per-queue structures with per-lcore ones, so that any device
  polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
  added to the list of cores to poll, so that the callback is aware of
  other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
  shared between all queues polled on a particular lcore, and are only
  activated when a specially designated "power saving" queue is polled. To
  put it another way, we have no idea which queues the user will poll in
  what order, so we rely on them telling us that queue X is the last one
  in the polling loop, and that is where any power management should happen.
- A new API is added to mark a specific Rx queue as "power saving".
  Failing to call this API will result in no power management; however,
  when there is only one queue per core, it is obvious which queue is the
  "power saving" one, so things will still work without the new API for
  use cases that previously worked without it.
- The limitation on UMWAIT-based polling is not removed, because UMWAIT
  is incapable of monitoring more than one address.
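The per-lcore bookkeeping this introduces can be exercised in a stand-alone
sketch (simplified types; `MAX_QUEUES` and the helper names are illustrative,
the real structures are in the diff below):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define MAX_QUEUES 8

struct queue { uint16_t portid, qid; };

struct core_cfg {
	struct queue queues[MAX_QUEUES];	/* queues polled by this lcore */
	struct queue power_save_queue;		/* where power management runs */
	bool power_save_queue_set;
	size_t n_queues;
};

static bool
queue_equal(const struct queue *l, const struct queue *r)
{
	return l->portid == r->portid && l->qid == r->qid;
}

/* a queue triggers power saving if it's the only one, or explicitly marked */
static bool
queue_is_power_save(const struct core_cfg *cfg, const struct queue *q)
{
	if (cfg->n_queues == 1)
		return true;
	return cfg->power_save_queue_set &&
			queue_equal(q, &cfg->power_save_queue);
}

static int
queue_list_add(struct core_cfg *cfg, struct queue q)
{
	size_t i;

	if (cfg->n_queues >= MAX_QUEUES)
		return -1;
	for (i = 0; i < cfg->n_queues; i++)
		if (queue_equal(&cfg->queues[i], &q))
			return -1;	/* already in the list */
	cfg->queues[cfg->n_queues++] = q;
	return 0;
}

static int
queue_set_power_save(struct core_cfg *cfg, struct queue q)
{
	size_t i;

	for (i = 0; i < cfg->n_queues; i++) {
		if (queue_equal(&cfg->queues[i], &q)) {
			cfg->power_save_queue = q;
			cfg->power_save_queue_set = true;
			return 0;
		}
	}
	return -1;	/* queue must be in the list before being marked */
}
```

Because Rx callbacks only ever see their own (port, queue) pair, this shared
per-lcore list is what lets the designated queue's callback know it is the
last one polled and therefore the right place to sleep or scale down.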

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/power/rte_power_pmd_mgmt.c | 335 ++++++++++++++++++++++++++-------
 lib/power/rte_power_pmd_mgmt.h |  34 ++++
 lib/power/version.map          |   3 +
 3 files changed, 306 insertions(+), 66 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 0707c60a4f..60dd21a19c 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,7 +33,19 @@ enum pmd_mgmt_state {
 	PMD_MGMT_ENABLED
 };
 
-struct pmd_queue_cfg {
+struct queue {
+	uint16_t portid;
+	uint16_t qid;
+};
+struct pmd_core_cfg {
+	struct queue queues[RTE_MAX_ETHPORTS];
+	/**< Which port-queue pairs are associated with this lcore? */
+	struct queue power_save_queue;
+	/**< When polling multiple queues, all but this one will be ignored */
+	bool power_save_queue_set;
+	/**< When polling multiple queues, power save queue must be set */
+	size_t n_queues;
+	/**< How many queues are in the list? */
 	volatile enum pmd_mgmt_state pwr_mgmt_state;
 	/**< State of power management for this queue */
 	enum rte_power_pmd_mgmt_type cb_mode;
@@ -43,8 +55,97 @@ struct pmd_queue_cfg {
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
 
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const struct queue *l, const struct queue *r)
+{
+	return l->portid == r->portid && l->qid == r->qid;
+}
+
+static inline void
+queue_copy(struct queue *dst, const struct queue *src)
+{
+	dst->portid = src->portid;
+	dst->qid = src->qid;
+}
+
+static inline bool
+queue_is_power_save(const struct pmd_core_cfg *cfg, const struct queue *q) {
+	const struct queue *pwrsave = &cfg->power_save_queue;
+
+	/* if there's only single queue, no need to check anything */
+	if (cfg->n_queues == 1)
+		return true;
+	return cfg->power_save_queue_set && queue_equal(q, pwrsave);
+}
+
+static int
+queue_list_find(const struct pmd_core_cfg *cfg, const struct queue *q,
+		size_t *idx) {
+	size_t i;
+	for (i = 0; i < cfg->n_queues; i++) {
+		const struct queue *cur = &cfg->queues[i];
+		if (queue_equal(cur, q)) {
+			if (idx != NULL)
+				*idx = i;
+			return 0;
+		}
+	}
+	return -1;
+}
+
+static int
+queue_set_power_save(struct pmd_core_cfg *cfg, const struct queue *q) {
+	if (queue_list_find(cfg, q, NULL) < 0)
+		return -ENOENT;
+	queue_copy(&cfg->power_save_queue, q);
+	cfg->power_save_queue_set = true;
+	return 0;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const struct queue *q)
+{
+	size_t idx = cfg->n_queues;
+	if (idx >= RTE_DIM(cfg->queues))
+		return -ENOSPC;
+	/* is it already in the list? */
+	if (queue_list_find(cfg, q, NULL) == 0)
+		return -EEXIST;
+	queue_copy(&cfg->queues[idx], q);
+	cfg->n_queues++;
+
+	return 0;
+}
+
+static int
+queue_list_remove(struct pmd_core_cfg *cfg, const struct queue *q)
+{
+	struct queue *found, *pwrsave;
+	size_t idx, last_idx = cfg->n_queues - 1;
+
+	if (queue_list_find(cfg, q, &idx) != 0)
+		return -ENOENT;
+
+	/* erase the queue pair being deleted */
+	found = &cfg->queues[idx];
+	memset(found, 0, sizeof(*found));
+
+	/* move the rest of the list */
+	for (; idx < last_idx; idx++)
+		queue_copy(&cfg->queues[idx], &cfg->queues[idx + 1]);
+	cfg->n_queues--;
+
+	/* if this was a power save queue, unset it */
+	pwrsave = &cfg->power_save_queue;
+	if (cfg->power_save_queue_set && queue_is_power_save(cfg, q)) {
+		cfg->power_save_queue_set = false;
+		memset(pwrsave, 0, sizeof(*pwrsave));
+	}
+
+	return 0;
+}
 
 static void
 calc_tsc(void)
@@ -79,10 +180,10 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
+	const unsigned int lcore = rte_lcore_id();
+	struct pmd_core_cfg *q_conf;
 
-	struct pmd_queue_cfg *q_conf;
-
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
 	if (unlikely(nb_rx == 0)) {
 		q_conf->empty_poll_stats++;
@@ -107,11 +208,26 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const struct queue q = {port_id, qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		/* sleep for 1 microsecond */
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
@@ -127,8 +243,7 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 					rte_pause();
 			}
 		}
-	} else
-		q_conf->empty_poll_stats = 0;
+	}
 
 	return nb_rx;
 }
@@ -138,29 +253,97 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const struct queue q = {port_id, qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+
+		/* scale up freq immediately */
+		rte_power_freq_max(rte_lcore_id());
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
 			/* scale down freq */
 			rte_power_freq_min(rte_lcore_id());
-	} else {
-		q_conf->empty_poll_stats = 0;
-		/* scale up freq */
-		rte_power_freq_max(rte_lcore_id());
 	}
 
 	return nb_rx;
 }
 
+static int
+check_scale(unsigned int lcore)
+{
+	enum power_management_env env;
+
+	/* only PSTATE and ACPI modes are supported */
+	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+		return -ENOTSUP;
+	}
+	/* ensure we could initialize the power library */
+	if (rte_power_init(lcore))
+		return -EINVAL;
+
+	/* ensure we initialized the correct env */
+	env = rte_power_get_env();
+	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const struct queue *qdata)
+{
+	struct rte_power_monitor_cond dummy;
+
+	/* check if rte_power_monitor is supported */
+	if (!global_data.intrinsics_support.power_monitor) {
+		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+		return -ENOTSUP;
+	}
+
+	if (cfg->n_queues > 0) {
+		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+		return -ENOTSUP;
+	}
+
+	/* check if the device supports the necessary PMD API */
+	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+			&dummy) == -ENOTSUP) {
+		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const struct queue qdata = {port_id, queue_id};
+	struct pmd_core_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
 	rte_rx_callback_fn clb;
 	int ret;
@@ -183,9 +366,11 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+	/* if callback was already enabled, check current callback type */
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+			queue_cfg->cb_mode != mode) {
 		ret = -EINVAL;
 		goto end;
 	}
@@ -195,53 +380,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 
 	switch (mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		struct rte_power_monitor_cond dummy;
-
-		/* check if rte_power_monitor is supported */
-		if (!global_data.intrinsics_support.power_monitor) {
-			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_monitor(queue_cfg, &qdata);
+		if (ret < 0)
 			goto end;
-		}
 
-		/* check if the device supports the necessary PMD API */
-		if (rte_eth_get_monitor_addr(port_id, queue_id,
-				&dummy) == -ENOTSUP) {
-			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_umwait;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
-	{
-		enum power_management_env env;
-		/* only PSTATE and ACPI modes are supported */
-		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
-				!rte_power_check_env_supported(
-					PM_ENV_PSTATE_CPUFREQ)) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_scale(lcore_id);
+		if (ret < 0)
 			goto end;
-		}
-		/* ensure we could initialize the power library */
-		if (rte_power_init(lcore_id)) {
-			ret = -EINVAL;
-			goto end;
-		}
-		/* ensure we initialized the correct env */
-		env = rte_power_get_env();
-		if (env != PM_ENV_ACPI_CPUFREQ &&
-				env != PM_ENV_PSTATE_CPUFREQ) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_scale_freq;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		/* figure out various time-to-tsc conversions */
 		if (global_data.tsc_per_us == 0)
@@ -254,11 +406,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		ret = -EINVAL;
 		goto end;
 	}
+	/* add this queue to the list */
+	ret = queue_list_add(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+				strerror(-ret));
+		goto end;
+	}
 
 	/* initialize data before enabling the callback */
-	queue_cfg->empty_poll_stats = 0;
-	queue_cfg->cb_mode = mode;
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	if (queue_cfg->n_queues == 1) {
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	}
 	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
 			clb, NULL);
 
@@ -271,7 +432,9 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const struct queue qdata = {port_id, queue_id};
+	struct pmd_core_cfg *queue_cfg;
+	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
 
@@ -279,13 +442,24 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		return -EINVAL;
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
 		return -EINVAL;
 
-	/* stop any callbacks from progressing */
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	/*
+	 * There is no good/easy way to do this without race conditions, so we
+	 * are just going to throw our hands in the air and hope that the user
+	 * has read the documentation and has ensured that ports are stopped at
+	 * the time we enter the API functions.
+	 */
+	ret = queue_list_remove(queue_cfg, &qdata);
+	if (ret < 0)
+		return -ret;
+
+	/* if we've removed all queues from the lists, set state to disabled */
+	if (queue_cfg->n_queues == 0)
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
 	switch (queue_cfg->cb_mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
@@ -309,3 +483,32 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 
 	return 0;
 }
+
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	const struct queue qdata = {port_id, queue_id};
+	struct pmd_core_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &lcore_cfg[lcore_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	ret = queue_set_power_save(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to set power save queue: %s\n",
+			strerror(-ret));
+		return -ret;
+	}
+
+	return 0;
+}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7557f5d7e1..edf8d8714f 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -90,6 +90,40 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Set a specific Ethernet device Rx queue to be the "power save" queue for a
+ * particular lcore. When multiple queues are assigned to a single lcore using
+ * the `rte_power_ethdev_pmgmt_queue_enable` API, only one of them will trigger
+ * the power management. In a typical scenario, the last queue to be polled on
+ * a particular lcore should be designated as power save queue.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @note When using multiple queues per lcore, calling this function is
+ *   mandatory. If not called, no power management routines would be triggered
+ *   when the traffic starts.
+ *
+ * @warning This function must be called when all affected Ethernet ports are
+ *   stopped and no Rx/Tx is in progress!
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/power/version.map b/lib/power/version.map
index b004e3e4a9..105d1d94c2 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -38,4 +38,7 @@ EXPERIMENTAL {
 	# added in 21.02
 	rte_power_ethdev_pmgmt_queue_disable;
 	rte_power_ethdev_pmgmt_queue_enable;
+
+	# added in 21.08
+	rte_power_ethdev_pmgmt_queue_set_power_save;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v1 6/7] power: support monitoring multiple Rx queues
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
                   ` (4 preceding siblings ...)
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-01 12:00 ` Anatoly Burakov
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/power/rte_power_pmd_mgmt.c | 75 +++++++++++++++++++++++++++++++++-
 1 file changed, 73 insertions(+), 2 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 60dd21a19c..9e0b8bdfaf 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -147,6 +147,23 @@ queue_list_remove(struct pmd_core_cfg *cfg, const struct queue *q)
 	return 0;
 }
 
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+		struct rte_power_monitor_cond *pmc)
+{
+	size_t i;
+	int ret;
+
+	for (i = 0; i < cfg->n_queues; i++) {
+		struct rte_power_monitor_cond *cur = &pmc[i];
+		struct queue *q = &cfg->queues[i];
+		ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+		if (ret < 0)
+			return ret;
+	}
+	return 0;
+}
+
 static void
 calc_tsc(void)
 {
@@ -175,6 +192,48 @@ calc_tsc(void)
 	}
 }
 
+static uint16_t
+clb_multiwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	const unsigned int lcore = rte_lcore_id();
+	const struct queue q = {port_id, qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
+
+	q_conf = &lcore_cfg[lcore];
+
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
+			uint16_t ret;
+
+			/* gather all monitoring conditions */
+			ret = get_monitor_addresses(q_conf, pmc);
+
+			if (ret == 0)
+				rte_power_monitor_multi(pmc,
+					q_conf->n_queues, UINT64_MAX);
+		}
+	}
+
+	return nb_rx;
+}
+
 static uint16_t
 clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
@@ -315,14 +374,19 @@ static int
 check_monitor(struct pmd_core_cfg *cfg, const struct queue *qdata)
 {
 	struct rte_power_monitor_cond dummy;
+	bool multimonitor_supported;
 
 	/* check if rte_power_monitor is supported */
 	if (!global_data.intrinsics_support.power_monitor) {
 		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
 		return -ENOTSUP;
 	}
+	/* check if multi-monitor is supported */
+	multimonitor_supported =
+			global_data.intrinsics_support.power_monitor_multi;
 
-	if (cfg->n_queues > 0) {
+	/* if we're adding a new queue, do we support multiple queues? */
+	if (cfg->n_queues > 0 && !multimonitor_supported) {
 		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
 		return -ENOTSUP;
 	}
@@ -338,6 +402,13 @@ check_monitor(struct pmd_core_cfg *cfg, const struct queue *qdata)
 	return 0;
 }
 
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+	return global_data.intrinsics_support.power_monitor_multi ?
+		clb_multiwait : clb_umwait;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -385,7 +456,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (ret < 0)
 			goto end;
 
-		clb = clb_umwait;
+		clb = get_monitor_callback();
 		break;
 	case RTE_POWER_MGMT_TYPE_SCALE:
 		/* check if we can add a new queue */
-- 
2.25.1




* [dpdk-dev] [PATCH v1 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
                   ` (5 preceding siblings ...)
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-01 12:00 ` Anatoly Burakov
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-01 12:00 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation, and always
mark the last queue in qconf as the power save queue.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 examples/l3fwd-power/main.c | 39 +++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..3057c06936 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2498,6 +2498,27 @@ mode_to_str(enum appmode mode)
 	}
 }
 
+static void
+pmd_pmgmt_set_up(unsigned int lcore, uint16_t portid, uint16_t qid, bool last)
+{
+	int ret;
+
+	ret = rte_power_ethdev_pmgmt_queue_enable(lcore, portid,
+			qid, pmgmt_type);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+			ret, portid);
+
+	if (!last)
+		return;
+	ret = rte_power_ethdev_pmgmt_queue_set_power_save(lcore, portid, qid);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_set_power_save: err=%d, port=%d\n",
+			ret, portid);
+}
+
 int
 main(int argc, char **argv)
 {
@@ -2723,12 +2744,6 @@ main(int argc, char **argv)
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
 
-		/* PMD power management mode can only do 1 queue per core */
-		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
-			rte_exit(EXIT_FAILURE,
-				"In PMD power management mode, only one queue per lcore is allowed\n");
-		}
-
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2767,15 +2782,9 @@ main(int argc, char **argv)
 						 "Fail to add ptype cb\n");
 			}
 
-			if (app_mode == APP_MODE_PMD_MGMT) {
-				ret = rte_power_ethdev_pmgmt_queue_enable(
-						lcore_id, portid, queueid,
-						pmgmt_type);
-				if (ret < 0)
-					rte_exit(EXIT_FAILURE,
-						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
-							ret, portid);
-			}
+			if (app_mode == APP_MODE_PMD_MGMT)
+				pmd_pmgmt_set_up(lcore_id, portid, queueid,
+					queue == (qconf->n_rx_queue - 1));
 		}
 	}
 
-- 
2.25.1



* Re: [dpdk-dev] [PATCH v1 2/7] net/af_xdp: add power monitor support
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-06-02 12:59   ` Loftus, Ciara
  0 siblings, 0 replies; 165+ messages in thread
From: Loftus, Ciara @ 2021-06-02 12:59 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Zhang, Qi Z; +Cc: Hunt, David

> Subject: [PATCH v1 2/7] net/af_xdp: add power monitor support
> 
> Implement support for .get_monitor_addr in AF_XDP driver.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

Thanks Anatoly. LGTM.

Acked-by: Ciara Loftus <ciara.loftus@intel.com>

> ---
>  drivers/net/af_xdp/rte_eth_af_xdp.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c
> b/drivers/net/af_xdp/rte_eth_af_xdp.c
> index eb5660a3dc..dfbf74ea53 100644
> --- a/drivers/net/af_xdp/rte_eth_af_xdp.c
> +++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
> @@ -37,6 +37,7 @@
>  #include <rte_malloc.h>
>  #include <rte_ring.h>
>  #include <rte_spinlock.h>
> +#include <rte_power_intrinsics.h>
> 
>  #include "compat.h"
> 
> @@ -788,6 +789,29 @@ eth_dev_configure(struct rte_eth_dev *dev)
>  	return 0;
>  }
> 
> +static int
> +eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond
> *pmc)
> +{
> +	struct pkt_rx_queue *rxq = rx_queue;
> +	unsigned int *prod = rxq->rx.producer;
> +	const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
> +
> +	/* watch for changes in producer ring */
> +	pmc->addr = (void*)prod;
> +
> +	/* store current value */
> +	pmc->val = cur_val;
> +	pmc->mask = (uint32_t)~0; /* mask entire uint32_t value */
> +
> +	/* AF_XDP producer ring index is 32-bit */
> +	pmc->size = sizeof(uint32_t);
> +
> +	/* this requires an inverted check */
> +	pmc->invert = 1;
> +
> +	return 0;
> +}
> +
>  static int
>  eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>  {
> @@ -1448,6 +1472,7 @@ static const struct eth_dev_ops ops = {
>  	.link_update = eth_link_update,
>  	.stats_get = eth_stats_get,
>  	.stats_reset = eth_stats_reset,
> +	.get_monitor_addr = eth_get_monitor_addr
>  };
> 
>  /** parse busy_budget argument */
> --
> 2.25.1



* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion Anatoly Burakov
@ 2021-06-21 12:56   ` Ananyev, Konstantin
  2021-06-23  9:43     ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-21 12:56 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David


Hi Anatoly,

> Previously, the semantics of power monitor were such that we were
> checking current value against the expected value, and if they matched,
> then the sleep was aborted. This is somewhat inflexible, because it only
> allowed us to check for a specific value.
> 
> This commit adds an option to reverse the check, so that we can have
> monitor sleep aborted if the expected value *doesn't* match what's in
> memory. This allows us to both implement all currently implemented
> driver code, as well as support more use cases which don't easily map to
> previous semantics (such as waiting on writes to AF_XDP counter value).
> 
> Since the old behavior is the default, no need to adjust existing
> implementations.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>  lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> index dddca3d41c..1006c2edfc 100644
> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>  	                  *   4, or 8. Supplying any other value will result in
>  	                  *   an error.
>  	                  */
> +	uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> +	                  *   checking if `val` matches something, check if
> +	                  *   `val` *doesn't* match a particular value)
> +	                  */
>  };
> 
>  /**
> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> index 39ea9fdecd..5d944e9aa4 100644
> --- a/lib/eal/x86/rte_power_intrinsics.c
> +++ b/lib/eal/x86/rte_power_intrinsics.c
> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>  		const uint64_t masked = cur_value & pmc->mask;
> 
>  		/* if the masked value is already matching, abort */
> -		if (masked == pmc->val)
> +		if (!pmc->invert && masked == pmc->val)
> +			goto end;
> +		/* same, but for inverse check */
> +		if (pmc->invert && masked != pmc->val)
>  			goto end;
>  	}
> 

Hmm..., such an approach looks too 'patchy'...
Can we at least replace 'invert' with something like:
enum rte_power_monitor_cond_op {
	_EQ, _NEQ, ...
};
Then at least new comparison ops can be added in the future.
Even better, I think, would be to just let the PMD provide a comparison callback.
That will make things really simple and generic:
struct rte_power_monitor_cond {
     volatile void *addr;
     int (*cmp)(uint64_t val);
     uint8_t size;
};
And then in rte_power_monitor(...):
....
const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
if (pmc->cmp(cur_value) != 0)
	goto end;
....
  





^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-22  9:13   ` Ananyev, Konstantin
  2021-06-23  9:46     ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-22  9:13 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara


> Currently, we expect that only one callback can be active at any given
> moment, for a particular queue configuration, which is relatively easy
> to implement in a thread-safe way. However, we're about to add support
> for multiple queues per lcore, which will greatly increase the
> possibility of various race conditions.
> 
> We could have used something like an RCU for this use case, but absent
> of a pressing need for thread safety we'll go the easy way and just
> mandate that the API's are to be called when all affected ports are
> stopped, and document this limitation. This greatly simplifies the
> `rte_power_monitor`-related code.

I think you need to update the release notes too with that.
Another thing - do you really need the whole port stopped?
From what I understand, you work on queues, so it should be enough for you
that the related Rx queue is stopped.
So, to make things a bit more robust, in pmgmt_queue_enable/disable
you can call rte_eth_rx_queue_info_get() and check the queue state.
 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/power/meson.build          |   3 +
>  lib/power/rte_power_pmd_mgmt.c | 106 ++++++++-------------------------
>  lib/power/rte_power_pmd_mgmt.h |   6 ++
>  3 files changed, 35 insertions(+), 80 deletions(-)
> 
> diff --git a/lib/power/meson.build b/lib/power/meson.build
> index c1097d32f1..4f6a242364 100644
> --- a/lib/power/meson.build
> +++ b/lib/power/meson.build
> @@ -21,4 +21,7 @@ headers = files(
>          'rte_power_pmd_mgmt.h',
>          'rte_power_guest_channel.h',
>  )
> +if cc.has_argument('-Wno-cast-qual')
> +    cflags += '-Wno-cast-qual'
> +endif
>  deps += ['timer', 'ethdev']
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index db03cbf420..0707c60a4f 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -40,8 +40,6 @@ struct pmd_queue_cfg {
>  	/**< Callback mode for this queue */
>  	const struct rte_eth_rxtx_callback *cur_cb;
>  	/**< Callback instance */
> -	volatile bool umwait_in_progress;
> -	/**< are we currently sleeping? */
>  	uint64_t empty_poll_stats;
>  	/**< Number of empty polls */
>  } __rte_cache_aligned;
> @@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>  			struct rte_power_monitor_cond pmc;
>  			uint16_t ret;
> 
> -			/*
> -			 * we might get a cancellation request while being
> -			 * inside the callback, in which case the wakeup
> -			 * wouldn't work because it would've arrived too early.
> -			 *
> -			 * to get around this, we notify the other thread that
> -			 * we're sleeping, so that it can spin until we're done.
> -			 * unsolicited wakeups are perfectly safe.
> -			 */
> -			q_conf->umwait_in_progress = true;
> -
> -			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> -
> -			/* check if we need to cancel sleep */
> -			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> -				/* use monitoring condition to sleep */
> -				ret = rte_eth_get_monitor_addr(port_id, qidx,
> -						&pmc);
> -				if (ret == 0)
> -					rte_power_monitor(&pmc, UINT64_MAX);
> -			}
> -			q_conf->umwait_in_progress = false;
> -
> -			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> +			/* use monitoring condition to sleep */
> +			ret = rte_eth_get_monitor_addr(port_id, qidx,
> +					&pmc);
> +			if (ret == 0)
> +				rte_power_monitor(&pmc, UINT64_MAX);
>  		}
>  	} else
>  		q_conf->empty_poll_stats = 0;
> @@ -183,6 +162,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  {
>  	struct pmd_queue_cfg *queue_cfg;
>  	struct rte_eth_dev_info info;
> +	rte_rx_callback_fn clb;
>  	int ret;
> 
>  	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> @@ -232,17 +212,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  			ret = -ENOTSUP;
>  			goto end;
>  		}
> -		/* initialize data before enabling the callback */
> -		queue_cfg->empty_poll_stats = 0;
> -		queue_cfg->cb_mode = mode;
> -		queue_cfg->umwait_in_progress = false;
> -		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> -
> -		/* ensure we update our state before callback starts */
> -		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> -
> -		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> -				clb_umwait, NULL);
> +		clb = clb_umwait;
>  		break;
>  	}
>  	case RTE_POWER_MGMT_TYPE_SCALE:
> @@ -269,16 +239,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  			ret = -ENOTSUP;
>  			goto end;
>  		}
> -		/* initialize data before enabling the callback */
> -		queue_cfg->empty_poll_stats = 0;
> -		queue_cfg->cb_mode = mode;
> -		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> -
> -		/* this is not necessary here, but do it anyway */
> -		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> -
> -		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
> -				queue_id, clb_scale_freq, NULL);
> +		clb = clb_scale_freq;
>  		break;
>  	}
>  	case RTE_POWER_MGMT_TYPE_PAUSE:
> @@ -286,18 +247,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		if (global_data.tsc_per_us == 0)
>  			calc_tsc();
> 
> -		/* initialize data before enabling the callback */
> -		queue_cfg->empty_poll_stats = 0;
> -		queue_cfg->cb_mode = mode;
> -		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> -
> -		/* this is not necessary here, but do it anyway */
> -		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> -
> -		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> -				clb_pause, NULL);
> +		clb = clb_pause;
>  		break;
> +	default:
> +		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
> +		ret = -EINVAL;
> +		goto end;
>  	}
> +
> +	/* initialize data before enabling the callback */
> +	queue_cfg->empty_poll_stats = 0;
> +	queue_cfg->cb_mode = mode;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +			clb, NULL);
> +
>  	ret = 0;
>  end:
>  	return ret;
> @@ -323,27 +287,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  	/* stop any callbacks from progressing */
>  	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> 
> -	/* ensure we update our state before continuing */
> -	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> -
>  	switch (queue_cfg->cb_mode) {
> -	case RTE_POWER_MGMT_TYPE_MONITOR:
> -	{
> -		bool exit = false;
> -		do {
> -			/*
> -			 * we may request cancellation while the other thread
> -			 * has just entered the callback but hasn't started
> -			 * sleeping yet, so keep waking it up until we know it's
> -			 * done sleeping.
> -			 */
> -			if (queue_cfg->umwait_in_progress)
> -				rte_power_monitor_wakeup(lcore_id);
> -			else
> -				exit = true;
> -		} while (!exit);
> -	}
> -	/* fall-through */
> +	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
>  	case RTE_POWER_MGMT_TYPE_PAUSE:
>  		rte_eth_remove_rx_callback(port_id, queue_id,
>  				queue_cfg->cur_cb);
> @@ -356,10 +301,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  		break;
>  	}
>  	/*
> -	 * we don't free the RX callback here because it is unsafe to do so
> -	 * unless we know for a fact that all data plane threads have stopped.
> +	 * the API doc mandates that the user stops all processing on affected
> +	 * ports before calling any of these API's, so we can assume that the
> +	 * callbacks can be freed. we're intentionally casting away const-ness.
>  	 */
> -	queue_cfg->cur_cb = NULL;
> +	rte_free((void *)queue_cfg->cur_cb);
> 
>  	return 0;
>  }
> diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
> index 7a0ac24625..7557f5d7e1 100644
> --- a/lib/power/rte_power_pmd_mgmt.h
> +++ b/lib/power/rte_power_pmd_mgmt.h
> @@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
>   *
>   * @note This function is not thread-safe.
>   *
> + * @warning This function must be called when all affected Ethernet ports are
> + *   stopped and no Rx/Tx is in progress!
> + *
>   * @param lcore_id
>   *   The lcore the Rx queue will be polled from.
>   * @param port_id
> @@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
>   *
>   * @note This function is not thread-safe.
>   *
> + * @warning This function must be called when all affected Ethernet ports are
> + *   stopped and no Rx/Tx is in progress!
> + *
>   * @param lcore_id
>   *   The lcore the Rx queue is polled from.
>   * @param port_id
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-22  9:41   ` Ananyev, Konstantin
  2021-06-23  9:36     ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-22  9:41 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David, Ray Kinsella, Neil Horman
  Cc: Loftus, Ciara


> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
> 
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing.  This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
> 
> - Replace per-queue structures with per-lcore ones, so that any device
>   polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
>   added to the list of cores to poll, so that the callback is aware of
>   other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism is
>   shared between all queues polled on a particular lcore, and is only
>   activated when a special designated "power saving" queue is polled. To
>   put it another way, we have no idea which queue the user will poll in
>   what order, so we rely on them telling us that queue X is the last one
>   in the polling loop, so any power management should happen there.
> - A new API is added to mark a specific Rx queue as "power saving".

Honestly, I don't understand the logic behind that new function.
I understand that depending on HW we can monitor either one or multiple queues.
That's ok, but why do we now need to mark one queue as a 'very special' one?
Why can't rte_power_ethdev_pmgmt_queue_enable() just:
check if the number of monitored queues exceeds HW/SW capabilities,
and if so then just return a failure.
Otherwise add the queue to the list and treat them all equally, i.e.:
go to power save mode when the number of sequential empty polls on
all monitored queues exceeds the EMPTYPOLL_MAX threshold?

>   Failing to call this API will result in no power management, however
>   when having only one queue per core it is obvious which queue is the
>   "power saving" one, so things will still work without this new API for
>   use cases that were previously working without it.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>   is incapable of monitoring more than one address.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/power/rte_power_pmd_mgmt.c | 335 ++++++++++++++++++++++++++-------
>  lib/power/rte_power_pmd_mgmt.h |  34 ++++
>  lib/power/version.map          |   3 +
>  3 files changed, 306 insertions(+), 66 deletions(-)
> 
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index 0707c60a4f..60dd21a19c 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -33,7 +33,19 @@ enum pmd_mgmt_state {
>  	PMD_MGMT_ENABLED
>  };
> 
> -struct pmd_queue_cfg {
> +struct queue {
> +	uint16_t portid;
> +	uint16_t qid;
> +};

Just a thought: if that would help somehow, it can be changed to:
union queue {
	uint32_t raw;
	struct {
		uint16_t portid, qid;
	};
};

That way in the queue find/cmp functions below you can operate on single raw 32-bit values.
Probably not that important, as all these functions are on the slow path, but it might look nicer.

> +struct pmd_core_cfg {
> +	struct queue queues[RTE_MAX_ETHPORTS];

If we'll have the ability to monitor multiple queues per lcore, would it always be enough?
On the other hand, it is updated on the control path only.
Wouldn't a normal list with malloc()/rte_malloc() be more suitable here?

> +	/**< Which port-queue pairs are associated with this lcore? */
> +	struct queue power_save_queue;
> +	/**< When polling multiple queues, all but this one will be ignored */
> +	bool power_save_queue_set;
> +	/**< When polling multiple queues, power save queue must be set */
> +	size_t n_queues;
> +	/**< How many queues are in the list? */
>  	volatile enum pmd_mgmt_state pwr_mgmt_state;
>  	/**< State of power management for this queue */
>  	enum rte_power_pmd_mgmt_type cb_mode;
> @@ -43,8 +55,97 @@ struct pmd_queue_cfg {
>  	uint64_t empty_poll_stats;
>  	/**< Number of empty polls */
>  } __rte_cache_aligned;
> +static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
> 
> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> +static inline bool
> +queue_equal(const struct queue *l, const struct queue *r)
> +{
> +	return l->portid == r->portid && l->qid == r->qid;
> +}
> +
> +static inline void
> +queue_copy(struct queue *dst, const struct queue *src)
> +{
> +	dst->portid = src->portid;
> +	dst->qid = src->qid;
> +}
> +
> +static inline bool
> +queue_is_power_save(const struct pmd_core_cfg *cfg, const struct queue *q) {

Here and in other places - any reason why standard DPDK coding style is not used?

> +	const struct queue *pwrsave = &cfg->power_save_queue;
> +
> +	/* if there's only single queue, no need to check anything */
> +	if (cfg->n_queues == 1)
> +		return true;
> +	return cfg->power_save_queue_set && queue_equal(q, pwrsave);
> +}
> +

* Re: [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues
  2021-06-22  9:41   ` Ananyev, Konstantin
@ 2021-06-23  9:36     ` Burakov, Anatoly
  2021-06-23  9:49       ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-23  9:36 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David, Ray Kinsella, Neil Horman
  Cc: Loftus, Ciara

On 22-Jun-21 10:41 AM, Ananyev, Konstantin wrote:
> 
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing.  This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>>    polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>>    added to the list of cores to poll, so that the callback is aware of
>>    other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism is
>>    shared between all queues polled on a particular lcore, and is only
>>    activated when a special designated "power saving" queue is polled. To
>>    put it another way, we have no idea which queue the user will poll in
>>    what order, so we rely on them telling us that queue X is the last one
>>    in the polling loop, so any power management should happen there.
>> - A new API is added to mark a specific Rx queue as "power saving".
> 
> Honestly, I don't understand the logic behind that new function.
> I understand that depending on HW we ca monitor either one or multiple queues.
> That's ok, but why we now need to mark one queue as a 'very special' one?

Because we don't know which of the queues we are supposed to sleep on.

Imagine a situation where you have 3 queues. What usually happens is you 
poll them in a loop, so q0, q1, q2, q0, q1, q2... etc. We only want to 
enter power-optimized state on polling q2, because otherwise we're 
risking going into power optimized state while q1 or q2 have traffic.

Worst case scenario, we enter sleep after polling q0, then traffic 
arrives at q2, we wake up, and then attempt to go to sleep on q1 instead 
of skipping it. Essentially, we will be attempting to sleep at every 
queue, instead of once in a loop. This *might* be OK for multi-monitor 
because we'll be aborting sleep due to sleep condition check failure, 
but for modes like rte_pause()/rte_power_pause()-based sleep, we will be 
entering sleep unconditionally, and will be risking to sleep at q1 while 
there's traffic at q2.

So, we need this mechanism to be activated once every *loop*, not per queue.

> Why can't rte_power_ethdev_pmgmt_queue_enable() just:
> Check is number of monitored queues exceed HW/SW capabilities,
> and if so then just return a failure.
> Otherwise add queue to the list and treat them all equally, i.e:
> go to power save mode when number of sequential empty polls on
> all monitored queues will exceed EMPTYPOLL_MAX threshold?
> 
>>    Failing to call this API will result in no power management, however
>>    when having only one queue per core it is obvious which queue is the
>>    "power saving" one, so things will still work without this new API for
>>    use cases that were previously working without it.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>    is incapable of monitoring more than one address.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/power/rte_power_pmd_mgmt.c | 335 ++++++++++++++++++++++++++-------
>>   lib/power/rte_power_pmd_mgmt.h |  34 ++++
>>   lib/power/version.map          |   3 +
>>   3 files changed, 306 insertions(+), 66 deletions(-)
>>
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index 0707c60a4f..60dd21a19c 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -33,7 +33,19 @@ enum pmd_mgmt_state {
>>        PMD_MGMT_ENABLED
>>   };
>>
>> -struct pmd_queue_cfg {
>> +struct queue {
>> +     uint16_t portid;
>> +     uint16_t qid;
>> +};
> 
> Just a thought: if that would help somehow, it can be changed to:
> union queue {
>          uint32_t raw;
>          struct { uint16_t portid, qid;
>          };
> };
> 
> That way in queue find/cmp functions below you can operate with single raw 32-bt values.
> Probably not that important, as all these functions are on slow path, but might look nicer.

Sure, that can work. We actually do comparisons with power save queue on 
fast path, so maybe that'll help.

> 
>> +struct pmd_core_cfg {
>> +     struct queue queues[RTE_MAX_ETHPORTS];
> 
> If we'll have ability to monitor multiple queues per lcore, would it be always enough?
>  From other side, it is updated on control path only.
> Wouldn't normal list with malloc(/rte_malloc) would be more suitable here?

You're right, it should be dynamically allocated.

> 
>> +     /**< Which port-queue pairs are associated with this lcore? */
>> +     struct queue power_save_queue;
>> +     /**< When polling multiple queues, all but this one will be ignored */
>> +     bool power_save_queue_set;
>> +     /**< When polling multiple queues, power save queue must be set */
>> +     size_t n_queues;
>> +     /**< How many queues are in the list? */
>>        volatile enum pmd_mgmt_state pwr_mgmt_state;
>>        /**< State of power management for this queue */
>>        enum rte_power_pmd_mgmt_type cb_mode;
>> @@ -43,8 +55,97 @@ struct pmd_queue_cfg {
>>        uint64_t empty_poll_stats;
>>        /**< Number of empty polls */
>>   } __rte_cache_aligned;
>> +static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
>>
>> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
>> +static inline bool
>> +queue_equal(const struct queue *l, const struct queue *r)
>> +{
>> +     return l->portid == r->portid && l->qid == r->qid;
>> +}
>> +
>> +static inline void
>> +queue_copy(struct queue *dst, const struct queue *src)
>> +{
>> +     dst->portid = src->portid;
>> +     dst->qid = src->qid;
>> +}
>> +
>> +static inline bool
>> +queue_is_power_save(const struct pmd_core_cfg *cfg, const struct queue *q) {
> 
> Here and in other places - any reason why standard DPDK coding style is not used?

Just accidental :)

> 
>> +     const struct queue *pwrsave = &cfg->power_save_queue;
>> +
>> +     /* if there's only single queue, no need to check anything */
>> +     if (cfg->n_queues == 1)
>> +             return true;
>> +     return cfg->power_save_queue_set && queue_equal(q, pwrsave);
>> +}
>> +


-- 
Thanks,
Anatoly

* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-21 12:56   ` Ananyev, Konstantin
@ 2021-06-23  9:43     ` Burakov, Anatoly
  2021-06-23  9:55       ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-23  9:43 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David

On 21-Jun-21 1:56 PM, Ananyev, Konstantin wrote:
> 
> Hi Anatoly,
> 
>> Previously, the semantics of power monitor were such that we were
>> checking current value against the expected value, and if they matched,
>> then the sleep was aborted. This is somewhat inflexible, because it only
>> allowed us to check for a specific value.
>>
>> This commit adds an option to reverse the check, so that we can have
>> monitor sleep aborted if the expected value *doesn't* match what's in
>> memory. This allows us to both implement all currently implemented
>> driver code, as well as support more use cases which don't easily map to
>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>
>> Since the old behavior is the default, no need to adjust existing
>> implementations.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>   lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>   2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>> index dddca3d41c..1006c2edfc 100644
>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>                          *   4, or 8. Supplying any other value will result in
>>                          *   an error.
>>                          */
>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>> +                       *   checking if `val` matches something, check if
>> +                       *   `val` *doesn't* match a particular value)
>> +                       */
>>   };
>>
>>   /**
>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>> index 39ea9fdecd..5d944e9aa4 100644
>> --- a/lib/eal/x86/rte_power_intrinsics.c
>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>                const uint64_t masked = cur_value & pmc->mask;
>>
>>                /* if the masked value is already matching, abort */
>> -             if (masked == pmc->val)
>> +             if (!pmc->invert && masked == pmc->val)
>> +                     goto end;
>> +             /* same, but for inverse check */
>> +             if (pmc->invert && masked != pmc->val)
>>                        goto end;
>>        }
>>
> 
> Hmm... such an approach looks too 'patchy'...
> Can we at least replace 'invert' with something like:
> enum rte_power_monitor_cond_op {
>          EQ, NEQ, ...
> };
> Then at least new comparison ops can be added in the future.
> Even better, I think, would be to just leave it to the PMD to provide a comparison callback.
> Will make things really simple and generic:
> struct rte_power_monitor_cond {
>       volatile void *addr;
>       int (*cmp)(uint64_t val);
>       uint8_t size;
> };
> And then in rte_power_monitor(...):
> ....
> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> if (pmc->cmp(cur_value) != 0)
>          goto end;
> ....
> 

I like the idea of a callback, but these are supposed to be 
intrinsic-like functions, so putting too much into them is contrary to 
their goal, and it's going to make the API hard to use in simpler cases 
(e.g. when we're explicitly calling rte_power_monitor as opposed to 
letting the RX callback do it for us). For example, event/dlb code calls 
rte_power_monitor explicitly.

It's going to be especially "fun" to do these indirect function calls
from inside a transactional region on a call to multi-monitor. I'm not
opposed to having a callback here, but maybe others have more thoughts
on this?

-- 
Thanks,
Anatoly

* Re: [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's
  2021-06-22  9:13   ` Ananyev, Konstantin
@ 2021-06-23  9:46     ` Burakov, Anatoly
  2021-06-23  9:52       ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-23  9:46 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 22-Jun-21 10:13 AM, Ananyev, Konstantin wrote:
> 
>> Currently, we expect that only one callback can be active at any given
>> moment, for a particular queue configuration, which is relatively easy
>> to implement in a thread-safe way. However, we're about to add support
>> for multiple queues per lcore, which will greatly increase the
>> possibility of various race conditions.
>>
>> We could have used something like an RCU for this use case, but absent
>> of a pressing need for thread safety we'll go the easy way and just
>> mandate that the API's are to be called when all affected ports are
>> stopped, and document this limitation. This greatly simplifies the
>> `rte_power_monitor`-related code.
> 
> I think you need to update RN too with that.

Yep, will fix.

> Another thing - do you really need the whole port stopped?
> From what I understand, you work on queues, so it is enough for you
> that the related Rx queue is stopped.
> So, to make things a bit more robust, in pmgmt_queue_enable/disable
> you can call rte_eth_rx_queue_info_get() and check the queue state.

We work on queues, but the data is per-lcore not per-queue, and it is 
potentially used by multiple queues, so checking one specific queue is 
not going to be enough. We could check all queues that were registered 
so far with the power library, maybe that'll work better?

> 
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/power/meson.build          |   3 +
>>   lib/power/rte_power_pmd_mgmt.c | 106 ++++++++-------------------------
>>   lib/power/rte_power_pmd_mgmt.h |   6 ++
>>   3 files changed, 35 insertions(+), 80 deletions(-)
>>
>> diff --git a/lib/power/meson.build b/lib/power/meson.build
>> index c1097d32f1..4f6a242364 100644
>> --- a/lib/power/meson.build
>> +++ b/lib/power/meson.build
>> @@ -21,4 +21,7 @@ headers = files(
>>           'rte_power_pmd_mgmt.h',
>>           'rte_power_guest_channel.h',
>>   )
>> +if cc.has_argument('-Wno-cast-qual')
>> +    cflags += '-Wno-cast-qual'
>> +endif
>>   deps += ['timer', 'ethdev']
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index db03cbf420..0707c60a4f 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -40,8 +40,6 @@ struct pmd_queue_cfg {
>>        /**< Callback mode for this queue */
>>        const struct rte_eth_rxtx_callback *cur_cb;
>>        /**< Callback instance */
>> -     volatile bool umwait_in_progress;
>> -     /**< are we currently sleeping? */
>>        uint64_t empty_poll_stats;
>>        /**< Number of empty polls */
>>   } __rte_cache_aligned;
>> @@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>>                        struct rte_power_monitor_cond pmc;
>>                        uint16_t ret;
>>
>> -                     /*
>> -                      * we might get a cancellation request while being
>> -                      * inside the callback, in which case the wakeup
>> -                      * wouldn't work because it would've arrived too early.
>> -                      *
>> -                      * to get around this, we notify the other thread that
>> -                      * we're sleeping, so that it can spin until we're done.
>> -                      * unsolicited wakeups are perfectly safe.
>> -                      */
>> -                     q_conf->umwait_in_progress = true;
>> -
>> -                     rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
>> -
>> -                     /* check if we need to cancel sleep */
>> -                     if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
>> -                             /* use monitoring condition to sleep */
>> -                             ret = rte_eth_get_monitor_addr(port_id, qidx,
>> -                                             &pmc);
>> -                             if (ret == 0)
>> -                                     rte_power_monitor(&pmc, UINT64_MAX);
>> -                     }
>> -                     q_conf->umwait_in_progress = false;
>> -
>> -                     rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
>> +                     /* use monitoring condition to sleep */
>> +                     ret = rte_eth_get_monitor_addr(port_id, qidx,
>> +                                     &pmc);
>> +                     if (ret == 0)
>> +                             rte_power_monitor(&pmc, UINT64_MAX);
>>                }
>>        } else
>>                q_conf->empty_poll_stats = 0;
>> @@ -183,6 +162,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>   {
>>        struct pmd_queue_cfg *queue_cfg;
>>        struct rte_eth_dev_info info;
>> +     rte_rx_callback_fn clb;
>>        int ret;
>>
>>        RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>> @@ -232,17 +212,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                        ret = -ENOTSUP;
>>                        goto end;
>>                }
>> -             /* initialize data before enabling the callback */
>> -             queue_cfg->empty_poll_stats = 0;
>> -             queue_cfg->cb_mode = mode;
>> -             queue_cfg->umwait_in_progress = false;
>> -             queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> -
>> -             /* ensure we update our state before callback starts */
>> -             rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
>> -
>> -             queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> -                             clb_umwait, NULL);
>> +             clb = clb_umwait;
>>                break;
>>        }
>>        case RTE_POWER_MGMT_TYPE_SCALE:
>> @@ -269,16 +239,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                        ret = -ENOTSUP;
>>                        goto end;
>>                }
>> -             /* initialize data before enabling the callback */
>> -             queue_cfg->empty_poll_stats = 0;
>> -             queue_cfg->cb_mode = mode;
>> -             queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> -
>> -             /* this is not necessary here, but do it anyway */
>> -             rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
>> -
>> -             queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
>> -                             queue_id, clb_scale_freq, NULL);
>> +             clb = clb_scale_freq;
>>                break;
>>        }
>>        case RTE_POWER_MGMT_TYPE_PAUSE:
>> @@ -286,18 +247,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                if (global_data.tsc_per_us == 0)
>>                        calc_tsc();
>>
>> -             /* initialize data before enabling the callback */
>> -             queue_cfg->empty_poll_stats = 0;
>> -             queue_cfg->cb_mode = mode;
>> -             queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> -
>> -             /* this is not necessary here, but do it anyway */
>> -             rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
>> -
>> -             queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> -                             clb_pause, NULL);
>> +             clb = clb_pause;
>>                break;
>> +     default:
>> +             RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
>> +             ret = -EINVAL;
>> +             goto end;
>>        }
>> +
>> +     /* initialize data before enabling the callback */
>> +     queue_cfg->empty_poll_stats = 0;
>> +     queue_cfg->cb_mode = mode;
>> +     queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +     queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +                     clb, NULL);
>> +
>>        ret = 0;
>>   end:
>>        return ret;
>> @@ -323,27 +287,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>        /* stop any callbacks from progressing */
>>        queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>>
>> -     /* ensure we update our state before continuing */
>> -     rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
>> -
>>        switch (queue_cfg->cb_mode) {
>> -     case RTE_POWER_MGMT_TYPE_MONITOR:
>> -     {
>> -             bool exit = false;
>> -             do {
>> -                     /*
>> -                      * we may request cancellation while the other thread
>> -                      * has just entered the callback but hasn't started
>> -                      * sleeping yet, so keep waking it up until we know it's
>> -                      * done sleeping.
>> -                      */
>> -                     if (queue_cfg->umwait_in_progress)
>> -                             rte_power_monitor_wakeup(lcore_id);
>> -                     else
>> -                             exit = true;
>> -             } while (!exit);
>> -     }
>> -     /* fall-through */
>> +     case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
>>        case RTE_POWER_MGMT_TYPE_PAUSE:
>>                rte_eth_remove_rx_callback(port_id, queue_id,
>>                                queue_cfg->cur_cb);
>> @@ -356,10 +301,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>                break;
>>        }
>>        /*
>> -      * we don't free the RX callback here because it is unsafe to do so
>> -      * unless we know for a fact that all data plane threads have stopped.
>> +      * the API doc mandates that the user stops all processing on affected
>> +      * ports before calling any of these API's, so we can assume that the
>> +      * callbacks can be freed. we're intentionally casting away const-ness.
>>         */
>> -     queue_cfg->cur_cb = NULL;
>> +     rte_free((void *)queue_cfg->cur_cb);
>>
>>        return 0;
>>   }
>> diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
>> index 7a0ac24625..7557f5d7e1 100644
>> --- a/lib/power/rte_power_pmd_mgmt.h
>> +++ b/lib/power/rte_power_pmd_mgmt.h
>> @@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
>>    *
>>    * @note This function is not thread-safe.
>>    *
>> + * @warning This function must be called when all affected Ethernet ports are
>> + *   stopped and no Rx/Tx is in progress!
>> + *
>>    * @param lcore_id
>>    *   The lcore the Rx queue will be polled from.
>>    * @param port_id
>> @@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
>>    *
>>    * @note This function is not thread-safe.
>>    *
>> + * @warning This function must be called when all affected Ethernet ports are
>> + *   stopped and no Rx/Tx is in progress!
>> + *
>>    * @param lcore_id
>>    *   The lcore the Rx queue is polled from.
>>    * @param port_id
>> --
>> 2.25.1
> 


-- 
Thanks,
Anatoly

* Re: [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues
  2021-06-23  9:36     ` Burakov, Anatoly
@ 2021-06-23  9:49       ` Ananyev, Konstantin
  2021-06-23  9:56         ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-23  9:49 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David, Ray Kinsella, Neil Horman
  Cc: Loftus, Ciara


> 
> On 22-Jun-21 10:41 AM, Ananyev, Konstantin wrote:
> >
> >> Currently, there is a hard limitation on the PMD power management
> >> support that only allows it to support a single queue per lcore. This is
> >> not ideal as most DPDK use cases will poll multiple queues per core.
> >>
> >> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> >> is very difficult to implement such support because callbacks are
> >> effectively stateless and have no visibility into what the other ethdev
> >> devices are doing.  This places limitations on what we can do within the
> >> framework of Rx callbacks, but the basics of this implementation are as
> >> follows:
> >>
> >> - Replace per-queue structures with per-lcore ones, so that any device
> >>    polled from the same lcore can share data
> >> - Any queue that is going to be polled from a specific lcore has to be
> >>    added to the list of cores to poll, so that the callback is aware of
> >>    other queues being polled by the same lcore
> >> - Both the empty poll counter and the actual power saving mechanism is
> >>    shared between all queues polled on a particular lcore, and is only
> >>    activated when a special designated "power saving" queue is polled. To
> >>    put it another way, we have no idea which queue the user will poll in
> >>    what order, so we rely on them telling us that queue X is the last one
> >>    in the polling loop, so any power management should happen there.
> >> - A new API is added to mark a specific Rx queue as "power saving".
> >
> > Honestly, I don't understand the logic behind that new function.
> > I understand that depending on HW we ca monitor either one or multiple queues.
> > That's ok, but why we now need to mark one queue as a 'very special' one?
> 
> Because we don't know which of the queues we are supposed to sleep on.
> 
> Imagine a situation where you have 3 queues. What usually happens is you
> poll them in a loop, so q0, q1, q2, q0, q1, q2... etc. We only want to
> enter power-optimized state on polling q2, because otherwise we're
> risking going into power optimized state while q1 or q2 have traffic.

That's why, before going to sleep, we need to make sure that for *all* queues
we have at least EMPTYPOLL_MAX empty polls.
Then the order of queue checking wouldn't matter.
With your example it should be:
if (q0.empty_polls > EMPTYPOLL_MAX && q1.empty_polls > EMPTYPOLL_MAX &&
     q2.empty_polls > EMPTYPOLL_MAX)
        goto_sleep;

Don't get me wrong, I am not suggesting making *precisely* those checks
in the actual code (it could be time-consuming if the number of checks is big),
but the logic needs to remain.
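For illustration, that rule can be sketched as a self-contained fragment (the names `EMPTYPOLL_MAX` and `struct queue_state` are placeholders here, not the actual library code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* placeholder threshold */

struct queue_state {
	uint64_t empty_polls; /* consecutive empty polls seen on this queue */
};

/* Sleep only once *every* monitored queue has been empty for long
 * enough; then the order in which queues are polled does not matter. */
static bool
can_sleep(const struct queue_state *q, unsigned int n_queues)
{
	unsigned int i;

	for (i = 0; i < n_queues; i++)
		if (q[i].empty_polls <= EMPTYPOLL_MAX)
			return false;
	return true;
}
```

The per-queue loop is what makes this potentially expensive on the fast path when the number of monitored queues grows.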

> 
> Worst case scenario, we enter sleep after polling q0, then traffic
> arrives at q2, we wake up, and then attempt to go to sleep on q1 instead
> of skipping it. Essentially, we will be attempting to sleep at every
> queue, instead of once in a loop. This *might* be OK for multi-monitor
> because we'll be aborting sleep due to sleep condition check failure,
> but for modes like rte_pause()/rte_power_pause()-based sleep, we will be
> entering sleep unconditionally, and will be risking to sleep at q1 while
> there's traffic at q2.
> 
> So, we need this mechanism to be activated once every *loop*, not per queue.
> 
> > Why can't rte_power_ethdev_pmgmt_queue_enable() just:
> > check if the number of monitored queues exceeds HW/SW capabilities,
> > and if so then just return a failure.
> > Otherwise add the queue to the list and treat them all equally, i.e.:
> > go to power save mode when the number of sequential empty polls on
> > all monitored queues exceeds the EMPTYPOLL_MAX threshold?
> >
> >>    Failing to call this API will result in no power management, however
> >>    when having only one queue per core it is obvious which queue is the
> >>    "power saving" one, so things will still work without this new API for
> >>    use cases that were previously working without it.
> >> - The limitation on UMWAIT-based polling is not removed because UMWAIT
> >>    is incapable of monitoring more than one address.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   lib/power/rte_power_pmd_mgmt.c | 335 ++++++++++++++++++++++++++-------
> >>   lib/power/rte_power_pmd_mgmt.h |  34 ++++
> >>   lib/power/version.map          |   3 +
> >>   3 files changed, 306 insertions(+), 66 deletions(-)
> >>
> >> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> >> index 0707c60a4f..60dd21a19c 100644
> >> --- a/lib/power/rte_power_pmd_mgmt.c
> >> +++ b/lib/power/rte_power_pmd_mgmt.c
> >> @@ -33,7 +33,19 @@ enum pmd_mgmt_state {
> >>        PMD_MGMT_ENABLED
> >>   };
> >>
> >> -struct pmd_queue_cfg {
> >> +struct queue {
> >> +     uint16_t portid;
> >> +     uint16_t qid;
> >> +};
> >
> > Just a thought: if that would help somehow, it can be changed to:
> > union queue {
> >          uint32_t raw;
> >          struct { uint16_t portid, qid;
> >          };
> > };
> >
> > That way in the queue find/cmp functions below you can operate with single raw 32-bit values.
> > Probably not that important, as all these functions are on slow path, but might look nicer.
> 
> Sure, that can work. We actually do comparisons with power save queue on
> fast path, so maybe that'll help.
> 
> >
> >> +struct pmd_core_cfg {
> >> +     struct queue queues[RTE_MAX_ETHPORTS];
> >
> > If we'll have ability to monitor multiple queues per lcore, would it be always enough?
> >  From the other side, it is updated on the control path only.
> > Wouldn't a normal list with malloc()/rte_malloc() be more suitable here?
> 
> You're right, it should be dynamically allocated.
> 
> >
> >> +     /**< Which port-queue pairs are associated with this lcore? */
> >> +     struct queue power_save_queue;
> >> +     /**< When polling multiple queues, all but this one will be ignored */
> >> +     bool power_save_queue_set;
> >> +     /**< When polling multiple queues, power save queue must be set */
> >> +     size_t n_queues;
> >> +     /**< How many queues are in the list? */
> >>        volatile enum pmd_mgmt_state pwr_mgmt_state;
> >>        /**< State of power management for this queue */
> >>        enum rte_power_pmd_mgmt_type cb_mode;
> >> @@ -43,8 +55,97 @@ struct pmd_queue_cfg {
> >>        uint64_t empty_poll_stats;
> >>        /**< Number of empty polls */
> >>   } __rte_cache_aligned;
> >> +static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
> >>
> >> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> >> +static inline bool
> >> +queue_equal(const struct queue *l, const struct queue *r)
> >> +{
> >> +     return l->portid == r->portid && l->qid == r->qid;
> >> +}
> >> +
> >> +static inline void
> >> +queue_copy(struct queue *dst, const struct queue *src)
> >> +{
> >> +     dst->portid = src->portid;
> >> +     dst->qid = src->qid;
> >> +}
> >> +
> >> +static inline bool
> >> +queue_is_power_save(const struct pmd_core_cfg *cfg, const struct queue *q) {
> >
> > Here and in other places - any reason why standard DPDK coding style is not used?
> 
> Just accidental :)
> 
> >
> >> +     const struct queue *pwrsave = &cfg->power_save_queue;
> >> +
> >> +     /* if there's only single queue, no need to check anything */
> >> +     if (cfg->n_queues == 1)
> >> +             return true;
> >> +     return cfg->power_save_queue_set && queue_equal(q, pwrsave);
> >> +}
> >> +
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's
  2021-06-23  9:46     ` Burakov, Anatoly
@ 2021-06-23  9:52       ` Ananyev, Konstantin
  2021-06-25 11:52         ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-23  9:52 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara



> 
> On 22-Jun-21 10:13 AM, Ananyev, Konstantin wrote:
> >
> >> Currently, we expect that only one callback can be active at any given
> >> moment, for a particular queue configuration, which is relatively easy
> >> to implement in a thread-safe way. However, we're about to add support
> >> for multiple queues per lcore, which will greatly increase the
> >> possibility of various race conditions.
> >>
> >> We could have used something like an RCU for this use case, but absent
> >> of a pressing need for thread safety we'll go the easy way and just
> >> mandate that the API's are to be called when all affected ports are
> >> stopped, and document this limitation. This greatly simplifies the
> >> `rte_power_monitor`-related code.
> >
> > I think you need to update RN too with that.
> 
> Yep, will fix.
> 
> > Another thing - do you really need the whole port stopped?
> >  From what I understand - you work on queues, so it is enough for you
> > that related RX queue is stopped.
> > So, to make things a bit more robust, in pmgmt_queue_enable/disable
> > you can call rte_eth_rx_queue_info_get() and check queue state.
> 
> We work on queues, but the data is per-lcore not per-queue, and it is
> potentially used by multiple queues, so checking one specific queue is
> not going to be enough. We could check all queues that were registered
> so far with the power library, maybe that'll work better?

Yep, that's what I mean: in queue_enable(), check whether that queue is stopped or not.
If not, return -EBUSY/-EAGAIN or so.
Sorry if I wasn't clear the first time.
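As a sketch, such a check could look roughly like this (the ethdev types are stubbed out so the fragment is self-contained; in real code `struct rte_eth_rxq_info`, its `queue_state` field, and `rte_eth_rx_queue_info_get()` come from `rte_ethdev.h`, and the field is only filled in by drivers that report queue state):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define RTE_ETH_QUEUE_STATE_STOPPED 0 /* stub of the ethdev constant */
#define RTE_ETH_QUEUE_STATE_STARTED 1

struct rte_eth_rxq_info {
	uint8_t queue_state; /* stub: the real struct has more fields */
};

/* stand-in for rte_eth_rx_queue_info_get(); the real one queries the PMD */
static int
rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
		struct rte_eth_rxq_info *qinfo)
{
	(void)port_id;
	/* for the sake of the example: queue 0 is running, others stopped */
	qinfo->queue_state = (queue_id == 0) ?
			RTE_ETH_QUEUE_STATE_STARTED : RTE_ETH_QUEUE_STATE_STOPPED;
	return 0;
}

/* refuse to (re)configure power management on a queue that is running */
static int
pmgmt_check_stopped(uint16_t port_id, uint16_t queue_id)
{
	struct rte_eth_rxq_info qinfo;

	if (rx_queue_info_get(port_id, queue_id, &qinfo) != 0)
		return -ENOTSUP;
	if (qinfo.queue_state != RTE_ETH_QUEUE_STATE_STOPPED)
		return -EBUSY;
	return 0;
}
```

For per-lcore shared data, the same check would simply be applied to every queue already registered with the power library.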


> 
> >
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   lib/power/meson.build          |   3 +
> >>   lib/power/rte_power_pmd_mgmt.c | 106 ++++++++-------------------------
> >>   lib/power/rte_power_pmd_mgmt.h |   6 ++
> >>   3 files changed, 35 insertions(+), 80 deletions(-)
> >>
> >> diff --git a/lib/power/meson.build b/lib/power/meson.build
> >> index c1097d32f1..4f6a242364 100644
> >> --- a/lib/power/meson.build
> >> +++ b/lib/power/meson.build
> >> @@ -21,4 +21,7 @@ headers = files(
> >>           'rte_power_pmd_mgmt.h',
> >>           'rte_power_guest_channel.h',
> >>   )
> >> +if cc.has_argument('-Wno-cast-qual')
> >> +    cflags += '-Wno-cast-qual'
> >> +endif
> >>   deps += ['timer', 'ethdev']
> >> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> >> index db03cbf420..0707c60a4f 100644
> >> --- a/lib/power/rte_power_pmd_mgmt.c
> >> +++ b/lib/power/rte_power_pmd_mgmt.c
> >> @@ -40,8 +40,6 @@ struct pmd_queue_cfg {
> >>        /**< Callback mode for this queue */
> >>        const struct rte_eth_rxtx_callback *cur_cb;
> >>        /**< Callback instance */
> >> -     volatile bool umwait_in_progress;
> >> -     /**< are we currently sleeping? */
> >>        uint64_t empty_poll_stats;
> >>        /**< Number of empty polls */
> >>   } __rte_cache_aligned;
> >> @@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> >>                        struct rte_power_monitor_cond pmc;
> >>                        uint16_t ret;
> >>
> >> -                     /*
> >> -                      * we might get a cancellation request while being
> >> -                      * inside the callback, in which case the wakeup
> >> -                      * wouldn't work because it would've arrived too early.
> >> -                      *
> >> -                      * to get around this, we notify the other thread that
> >> -                      * we're sleeping, so that it can spin until we're done.
> >> -                      * unsolicited wakeups are perfectly safe.
> >> -                      */
> >> -                     q_conf->umwait_in_progress = true;
> >> -
> >> -                     rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> >> -
> >> -                     /* check if we need to cancel sleep */
> >> -                     if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> >> -                             /* use monitoring condition to sleep */
> >> -                             ret = rte_eth_get_monitor_addr(port_id, qidx,
> >> -                                             &pmc);
> >> -                             if (ret == 0)
> >> -                                     rte_power_monitor(&pmc, UINT64_MAX);
> >> -                     }
> >> -                     q_conf->umwait_in_progress = false;
> >> -
> >> -                     rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> >> +                     /* use monitoring condition to sleep */
> >> +                     ret = rte_eth_get_monitor_addr(port_id, qidx,
> >> +                                     &pmc);
> >> +                     if (ret == 0)
> >> +                             rte_power_monitor(&pmc, UINT64_MAX);
> >>                }
> >>        } else
> >>                q_conf->empty_poll_stats = 0;
> >> @@ -183,6 +162,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> >>   {
> >>        struct pmd_queue_cfg *queue_cfg;
> >>        struct rte_eth_dev_info info;
> >> +     rte_rx_callback_fn clb;
> >>        int ret;
> >>
> >>        RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> >> @@ -232,17 +212,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> >>                        ret = -ENOTSUP;
> >>                        goto end;
> >>                }
> >> -             /* initialize data before enabling the callback */
> >> -             queue_cfg->empty_poll_stats = 0;
> >> -             queue_cfg->cb_mode = mode;
> >> -             queue_cfg->umwait_in_progress = false;
> >> -             queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> -
> >> -             /* ensure we update our state before callback starts */
> >> -             rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> >> -
> >> -             queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> -                             clb_umwait, NULL);
> >> +             clb = clb_umwait;
> >>                break;
> >>        }
> >>        case RTE_POWER_MGMT_TYPE_SCALE:
> >> @@ -269,16 +239,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> >>                        ret = -ENOTSUP;
> >>                        goto end;
> >>                }
> >> -             /* initialize data before enabling the callback */
> >> -             queue_cfg->empty_poll_stats = 0;
> >> -             queue_cfg->cb_mode = mode;
> >> -             queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> -
> >> -             /* this is not necessary here, but do it anyway */
> >> -             rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> >> -
> >> -             queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
> >> -                             queue_id, clb_scale_freq, NULL);
> >> +             clb = clb_scale_freq;
> >>                break;
> >>        }
> >>        case RTE_POWER_MGMT_TYPE_PAUSE:
> >> @@ -286,18 +247,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> >>                if (global_data.tsc_per_us == 0)
> >>                        calc_tsc();
> >>
> >> -             /* initialize data before enabling the callback */
> >> -             queue_cfg->empty_poll_stats = 0;
> >> -             queue_cfg->cb_mode = mode;
> >> -             queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> -
> >> -             /* this is not necessary here, but do it anyway */
> >> -             rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> >> -
> >> -             queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> -                             clb_pause, NULL);
> >> +             clb = clb_pause;
> >>                break;
> >> +     default:
> >> +             RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
> >> +             ret = -EINVAL;
> >> +             goto end;
> >>        }
> >> +
> >> +     /* initialize data before enabling the callback */
> >> +     queue_cfg->empty_poll_stats = 0;
> >> +     queue_cfg->cb_mode = mode;
> >> +     queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> +     queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +                     clb, NULL);
> >> +
> >>        ret = 0;
> >>   end:
> >>        return ret;
> >> @@ -323,27 +287,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
> >>        /* stop any callbacks from progressing */
> >>        queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> >>
> >> -     /* ensure we update our state before continuing */
> >> -     rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
> >> -
> >>        switch (queue_cfg->cb_mode) {
> >> -     case RTE_POWER_MGMT_TYPE_MONITOR:
> >> -     {
> >> -             bool exit = false;
> >> -             do {
> >> -                     /*
> >> -                      * we may request cancellation while the other thread
> >> -                      * has just entered the callback but hasn't started
> >> -                      * sleeping yet, so keep waking it up until we know it's
> >> -                      * done sleeping.
> >> -                      */
> >> -                     if (queue_cfg->umwait_in_progress)
> >> -                             rte_power_monitor_wakeup(lcore_id);
> >> -                     else
> >> -                             exit = true;
> >> -             } while (!exit);
> >> -     }
> >> -     /* fall-through */
> >> +     case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
> >>        case RTE_POWER_MGMT_TYPE_PAUSE:
> >>                rte_eth_remove_rx_callback(port_id, queue_id,
> >>                                queue_cfg->cur_cb);
> >> @@ -356,10 +301,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
> >>                break;
> >>        }
> >>        /*
> >> -      * we don't free the RX callback here because it is unsafe to do so
> >> -      * unless we know for a fact that all data plane threads have stopped.
> >> +      * the API doc mandates that the user stops all processing on affected
> >> +      * ports before calling any of these API's, so we can assume that the
> >> +      * callbacks can be freed. we're intentionally casting away const-ness.
> >>         */
> >> -     queue_cfg->cur_cb = NULL;
> >> +     rte_free((void *)queue_cfg->cur_cb);
> >>
> >>        return 0;
> >>   }
> >> diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
> >> index 7a0ac24625..7557f5d7e1 100644
> >> --- a/lib/power/rte_power_pmd_mgmt.h
> >> +++ b/lib/power/rte_power_pmd_mgmt.h
> >> @@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
> >>    *
> >>    * @note This function is not thread-safe.
> >>    *
> >> + * @warning This function must be called when all affected Ethernet ports are
> >> + *   stopped and no Rx/Tx is in progress!
> >> + *
> >>    * @param lcore_id
> >>    *   The lcore the Rx queue will be polled from.
> >>    * @param port_id
> >> @@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
> >>    *
> >>    * @note This function is not thread-safe.
> >>    *
> >> + * @warning This function must be called when all affected Ethernet ports are
> >> + *   stopped and no Rx/Tx is in progress!
> >> + *
> >>    * @param lcore_id
> >>    *   The lcore the Rx queue is polled from.
> >>    * @param port_id
> >> --
> >> 2.25.1
> >
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-23  9:43     ` Burakov, Anatoly
@ 2021-06-23  9:55       ` Ananyev, Konstantin
  2021-06-23 10:00         ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-23  9:55 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David



> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, June 23, 2021 10:43 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Loftus, Ciara <ciara.loftus@intel.com>; Hunt, David <david.hunt@intel.com>
> Subject: Re: [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
> 
> On 21-Jun-21 1:56 PM, Ananyev, Konstantin wrote:
> >
> > Hi Anatoly,
> >
> >> Previously, the semantics of power monitor were such that we were
> >> checking current value against the expected value, and if they matched,
> >> then the sleep was aborted. This is somewhat inflexible, because it only
> >> allowed us to check for a specific value.
> >>
> >> This commit adds an option to reverse the check, so that we can have
> >> monitor sleep aborted if the expected value *doesn't* match what's in
> >> memory. This allows us to both implement all currently implemented
> >> driver code, as well as support more use cases which don't easily map to
> >> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>
> >> Since the old behavior is the default, no need to adjust existing
> >> implementations.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>   lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>   2 files changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >> index dddca3d41c..1006c2edfc 100644
> >> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>                          *   4, or 8. Supplying any other value will result in
> >>                          *   an error.
> >>                          */
> >> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >> +                       *   checking if `val` matches something, check if
> >> +                       *   `val` *doesn't* match a particular value)
> >> +                       */
> >>   };
> >>
> >>   /**
> >> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >> index 39ea9fdecd..5d944e9aa4 100644
> >> --- a/lib/eal/x86/rte_power_intrinsics.c
> >> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>                const uint64_t masked = cur_value & pmc->mask;
> >>
> >>                /* if the masked value is already matching, abort */
> >> -             if (masked == pmc->val)
> >> +             if (!pmc->invert && masked == pmc->val)
> >> +                     goto end;
> >> +             /* same, but for inverse check */
> >> +             if (pmc->invert && masked != pmc->val)
> >>                        goto end;
> >>        }
> >>
> >
> > Hmm..., such approach looks too 'patchy'...
> > Can we at least replace 'invert' with something like:
> > enum rte_power_monitor_cond_op {
> >          _EQ, NEQ,...
> > };
> > Then at least new comparions ops can be added in future.
> > Even better, I think, would be to just let the PMD provide a comparison callback.
> > Will make things really simple and generic:
> > struct rte_power_monitor_cond {
> >       volatile void *addr;
> >       int (*cmp)(uint64_t val);
> >       uint8_t size;
> > };
> > And then in rte_power_monitor(...):
> > ....
> > const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> > if (pmc->cmp(cur_value) != 0)
> >          goto end;
> > ....
> >
> 
> I like the idea of a callback, but these are supposed to be
> intrinsic-like functions, so putting too much into them is contrary to
> their goal, and it's going to make the API hard to use in simpler cases
> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> letting the RX callback do it for us). For example, event/dlb code calls
> rte_power_monitor explicitly.

Good point, I didn't know that.
Would be interesting to see how they use it.

> 
> It's going to be especially "fun" to do these indirect function calls
> from inside transactional region on call to multi-monitor.

But the callback is not supposed to do any memory reads/writes.
Just a mask/compare of the provided value against some constant.
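To make that concrete, the callback-based condition being proposed could be sketched like this (a hypothetical shape from this discussion only, not the merged DPDK API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical variant of struct rte_power_monitor_cond with a
 * comparison callback instead of val/mask/invert fields */
struct monitor_cond {
	volatile void *addr;      /* address to monitor */
	int (*cmp)(uint64_t val); /* non-zero -> abort the sleep */
	uint8_t size;             /* 1, 2, 4 or 8 bytes */
};

/* example comparison: abort when the masked value is non-zero --
 * the "inverted" check needed for e.g. an AF_XDP counter */
static int
wake_on_nonzero(uint64_t val)
{
	return (val & 0xffff) != 0;
}

/* what rte_power_monitor() would do with the current value before
 * arming the hardware monitor */
static int
should_abort(const struct monitor_cond *pmc, uint64_t cur_value)
{
	return pmc->cmp != NULL && pmc->cmp(cur_value) != 0;
}
```

The callback itself only does register arithmetic on the value it is handed, which is why it would contain no loads or stores of its own.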

> I'm not
> opposed to having a callback here, but maybe others have more thoughts
> on this?
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues
  2021-06-23  9:49       ` Ananyev, Konstantin
@ 2021-06-23  9:56         ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-23  9:56 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David, Ray Kinsella, Neil Horman
  Cc: Loftus, Ciara

On 23-Jun-21 10:49 AM, Ananyev, Konstantin wrote:
> 
>>
>> On 22-Jun-21 10:41 AM, Ananyev, Konstantin wrote:
>>>
>>>> Currently, there is a hard limitation on the PMD power management
>>>> support that only allows it to support a single queue per lcore. This is
>>>> not ideal as most DPDK use cases will poll multiple queues per core.
>>>>
>>>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>>>> is very difficult to implement such support because callbacks are
>>>> effectively stateless and have no visibility into what the other ethdev
>>>> devices are doing.  This places limitations on what we can do within the
>>>> framework of Rx callbacks, but the basics of this implementation are as
>>>> follows:
>>>>
>>>> - Replace per-queue structures with per-lcore ones, so that any device
>>>>     polled from the same lcore can share data
>>>> - Any queue that is going to be polled from a specific lcore has to be
>>>>     added to the list of cores to poll, so that the callback is aware of
>>>>     other queues being polled by the same lcore
>>>> - Both the empty poll counter and the actual power saving mechanism is
>>>>     shared between all queues polled on a particular lcore, and is only
>>>>     activated when a special designated "power saving" queue is polled. To
>>>>     put it another way, we have no idea which queue the user will poll in
>>>>     what order, so we rely on them telling us that queue X is the last one
>>>>     in the polling loop, so any power management should happen there.
>>>> - A new API is added to mark a specific Rx queue as "power saving".
>>>
>>> Honestly, I don't understand the logic behind that new function.
>>> I understand that depending on HW we can monitor either one or multiple queues.
>>> That's ok, but why do we now need to mark one queue as a 'very special' one?
>>
>> Because we don't know which of the queues we are supposed to sleep on.
>>
>> Imagine a situation where you have 3 queues. What usually happens is you
>> poll them in a loop, so q0, q1, q2, q0, q1, q2... etc. We only want to
>> enter power-optimized state on polling q2, because otherwise we're
>> risking going into power optimized state while q1 or q2 have traffic.
> 
> That's why, before going to sleep, we need to make sure that for *all* queues
> we have at least EMPTYPOLL_MAX empty polls.
> Then the order of queue checking wouldn't matter.
> With your example it should be:
> if (q0.empty_polls > EMPTYPOLL_MAX && q1.empty_polls > EMPTYPOLL_MAX &&
>       q2.empty_polls > EMPTYPOLL_MAX)
>          goto_sleep;
> 
> Don't get me wrong, I am not suggesting making *precisely* those checks
> in the actual code (it could be time-consuming if the number of checks is big),
> but the logic needs to remain.
> 

The empty poll counter is *per core*, not *per queue*. All the shared 
data is per core. We only increment the empty poll counter on the last 
queue, but we drop it to 0 on any queue that has received traffic. That 
way, we can avoid checking/incrementing empty poll counters for multiple 
queues. In other words, this is effectively achieving what you're 
suggesting, but without per-queue checks.

Of course, I could make it per-queue like before, but then we just end 
up doing way more checks on every callback and basically need to have 
the same logic anyway, so why bother?
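To illustrate the per-core scheme described above (all names and the threshold are illustrative stand-ins, not the actual rte_power_pmd_mgmt.c code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* illustrative threshold */

struct lcore_cfg {
	uint64_t empty_poll_stats; /* shared by all queues on this lcore */
	unsigned int sleeps;       /* times we entered a power-saving state */
};

static void
enter_power_save(struct lcore_cfg *cfg)
{
	cfg->sleeps++; /* stand-in for monitor/pause/frequency scaling */
}

/* body of the per-queue Rx callback: traffic on *any* queue resets the
 * shared counter; only the designated power-save queue increments it
 * and can trigger the sleep */
static void
on_rx_poll(struct lcore_cfg *cfg, bool is_power_save_queue, uint16_t nb_rx)
{
	if (nb_rx > 0)
		cfg->empty_poll_stats = 0;
	else if (is_power_save_queue &&
			++cfg->empty_poll_stats > EMPTYPOLL_MAX)
		enter_power_save(cfg);
}
```

The counter advances once per polling loop rather than once per queue, so a single comparison against the power-save queue replaces per-queue bookkeeping on the fast path.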

>>
>> Worst case scenario, we enter sleep after polling q0, then traffic
>> arrives at q2, we wake up, and then attempt to go to sleep on q1 instead
>> of skipping it. Essentially, we will be attempting to sleep at every
>> queue, instead of once in a loop. This *might* be OK for multi-monitor
>> because we'll be aborting sleep due to sleep condition check failure,
>> but for modes like rte_pause()/rte_power_pause()-based sleep, we will be
>> entering sleep unconditionally, and will be risking to sleep at q1 while
>> there's traffic at q2.
>>
>> So, we need this mechanism to be activated once every *loop*, not per queue.
>>
>>> Why can't rte_power_ethdev_pmgmt_queue_enable() just:
>>> check if the number of monitored queues exceeds HW/SW capabilities,
>>> and if so then just return a failure.
>>> Otherwise add the queue to the list and treat them all equally, i.e.:
>>> go to power save mode when the number of sequential empty polls on
>>> all monitored queues exceeds the EMPTYPOLL_MAX threshold?
>>>
>>>>     Failing to call this API will result in no power management, however
>>>>     when having only one queue per core it is obvious which queue is the
>>>>     "power saving" one, so things will still work without this new API for
>>>>     use cases that were previously working without it.
>>>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>>>     is incapable of monitoring more than one address.
>>>>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> ---
>>>>    lib/power/rte_power_pmd_mgmt.c | 335 ++++++++++++++++++++++++++-------
>>>>    lib/power/rte_power_pmd_mgmt.h |  34 ++++
>>>>    lib/power/version.map          |   3 +
>>>>    3 files changed, 306 insertions(+), 66 deletions(-)
>>>>
>>>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>>>> index 0707c60a4f..60dd21a19c 100644
>>>> --- a/lib/power/rte_power_pmd_mgmt.c
>>>> +++ b/lib/power/rte_power_pmd_mgmt.c
>>>> @@ -33,7 +33,19 @@ enum pmd_mgmt_state {
>>>>         PMD_MGMT_ENABLED
>>>>    };
>>>>
>>>> -struct pmd_queue_cfg {
>>>> +struct queue {
>>>> +     uint16_t portid;
>>>> +     uint16_t qid;
>>>> +};
>>>
>>> Just a thought: if that would help somehow, it can be changed to:
>>> union queue {
>>>           uint32_t raw;
>>>           struct { uint16_t portid, qid;
>>>           };
>>> };
>>>
> That way in the queue find/cmp functions below you can operate with single raw 32-bit values.
>>> Probably not that important, as all these functions are on slow path, but might look nicer.
>>
>> Sure, that can work. We actually do comparisons with power save queue on
>> fast path, so maybe that'll help.
>>
>>>
>>>> +struct pmd_core_cfg {
>>>> +     struct queue queues[RTE_MAX_ETHPORTS];
>>>
>>> If we'll have ability to monitor multiple queues per lcore, would it be always enough?
>>>   From the other side, it is updated on the control path only.
>>> Wouldn't a normal list with malloc()/rte_malloc() be more suitable here?
>>
>> You're right, it should be dynamically allocated.
>>
>>>
>>>> +     /**< Which port-queue pairs are associated with this lcore? */
>>>> +     struct queue power_save_queue;
>>>> +     /**< When polling multiple queues, all but this one will be ignored */
>>>> +     bool power_save_queue_set;
>>>> +     /**< When polling multiple queues, power save queue must be set */
>>>> +     size_t n_queues;
>>>> +     /**< How many queues are in the list? */
>>>>         volatile enum pmd_mgmt_state pwr_mgmt_state;
>>>>         /**< State of power management for this queue */
>>>>         enum rte_power_pmd_mgmt_type cb_mode;
>>>> @@ -43,8 +55,97 @@ struct pmd_queue_cfg {
>>>>         uint64_t empty_poll_stats;
>>>>         /**< Number of empty polls */
>>>>    } __rte_cache_aligned;
>>>> +static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
>>>>
>>>> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
>>>> +static inline bool
>>>> +queue_equal(const struct queue *l, const struct queue *r)
>>>> +{
>>>> +     return l->portid == r->portid && l->qid == r->qid;
>>>> +}
>>>> +
>>>> +static inline void
>>>> +queue_copy(struct queue *dst, const struct queue *src)
>>>> +{
>>>> +     dst->portid = src->portid;
>>>> +     dst->qid = src->qid;
>>>> +}
>>>> +
>>>> +static inline bool
>>>> +queue_is_power_save(const struct pmd_core_cfg *cfg, const struct queue *q) {
>>>
>>> Here and in other places - any reason why standard DPDK coding style is not used?
>>
>> Just accidental :)
>>
>>>
>>>> +     const struct queue *pwrsave = &cfg->power_save_queue;
>>>> +
>>>> +     /* if there's only single queue, no need to check anything */
>>>> +     if (cfg->n_queues == 1)
>>>> +             return true;
>>>> +     return cfg->power_save_queue_set && queue_equal(q, pwrsave);
>>>> +}
>>>> +
>>
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-23  9:55       ` Ananyev, Konstantin
@ 2021-06-23 10:00         ` Burakov, Anatoly
  2021-06-23 11:00           ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-23 10:00 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David

On 23-Jun-21 10:55 AM, Ananyev, Konstantin wrote:
> 
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Wednesday, June 23, 2021 10:43 AM
>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>
>> Cc: Loftus, Ciara <ciara.loftus@intel.com>; Hunt, David <david.hunt@intel.com>
>> Subject: Re: [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
>>
>> On 21-Jun-21 1:56 PM, Ananyev, Konstantin wrote:
>>>
>>> Hi Anatoly,
>>>
>>>> Previously, the semantics of power monitor were such that we were
>>>> checking current value against the expected value, and if they matched,
>>>> then the sleep was aborted. This is somewhat inflexible, because it only
>>>> allowed us to check for a specific value.
>>>>
>>>> This commit adds an option to reverse the check, so that we can have
>>>> monitor sleep aborted if the expected value *doesn't* match what's in
>>>> memory. This allows us to both implement all currently implemented
>>>> driver code, as well as support more use cases which don't easily map to
>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>>>
>>>> Since the old behavior is the default, no need to adjust existing
>>>> implementations.
>>>>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> ---
>>>>    lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>>>    lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>>>    2 files changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>>>> index dddca3d41c..1006c2edfc 100644
>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>>>                           *   4, or 8. Supplying any other value will result in
>>>>                           *   an error.
>>>>                           */
>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>>>> +                       *   checking if `val` matches something, check if
>>>> +                       *   `val` *doesn't* match a particular value)
>>>> +                       */
>>>>    };
>>>>
>>>>    /**
>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>>>> index 39ea9fdecd..5d944e9aa4 100644
>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>>>                 const uint64_t masked = cur_value & pmc->mask;
>>>>
>>>>                 /* if the masked value is already matching, abort */
>>>> -             if (masked == pmc->val)
>>>> +             if (!pmc->invert && masked == pmc->val)
>>>> +                     goto end;
>>>> +             /* same, but for inverse check */
>>>> +             if (pmc->invert && masked != pmc->val)
>>>>                         goto end;
>>>>         }
>>>>
>>>
>>> Hmm..., such approach looks too 'patchy'...
>>> Can we at least replace 'invert' with something like:
>>> enum rte_power_monitor_cond_op {
>>>           _EQ, NEQ,...
>>> };
>>> Then at least new comparison ops can be added in the future.
>>> Even better I think would be to just leave to PMD to provide a comparison callback.
>>> Will make things really simple and generic:
>>> struct rte_power_monitor_cond {
>>>        volatile void *addr;
>>>        int (*cmp)(uint64_t val);
>>>        uint8_t size;
>>> };
>>> And then in rte_power_monitor(...):
>>> ....
>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>> if (pmc->cmp(cur_value) != 0)
>>>           goto end;
>>> ....
>>>
>>
>> I like the idea of a callback, but these are supposed to be
>> intrinsic-like functions, so putting too much into them is contrary to
>> their goal, and it's going to make the API hard to use in simpler cases
>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
>> letting the RX callback do it for us). For example, event/dlb code calls
>> rte_power_monitor explicitly.
> 
> Good point, I didn't know that.
> Would be interesting to see how do they use it.

To be fair, it should be possible to rewrite their code using a 
callback. Perhaps adding a (void *) parameter for any custom data 
related to the callback (because C doesn't have closures...), but 
otherwise it should be doable. So the question isn't that it's 
impossible to rewrite the event/dlb code to use callbacks; it's more 
that it would complicate usage of an already-not-quite-straightforward 
API even more.

> 
>>
>> It's going to be especially "fun" to do these indirect function calls
>> from inside transactional region on call to multi-monitor.
> 
> But the callback is not supposed to do any memory reads/writes.
> Just mask/compare of the provided value with some constant.

Yeah, but with callbacks we can't really control that, can we? I mean I 
guess a *sane* implementation wouldn't do that, but still, it's 
theoretically possible to perform more complex checks and even touch 
some unrelated data in the process.

> 
>> I'm not
>> opposed to having a callback here, but maybe others have more thoughts
>> on this?
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-23 10:00         ` Burakov, Anatoly
@ 2021-06-23 11:00           ` Ananyev, Konstantin
  2021-06-23 12:12             ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-23 11:00 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David


> >>>
> >>>> Previously, the semantics of power monitor were such that we were
> >>>> checking current value against the expected value, and if they matched,
> >>>> then the sleep was aborted. This is somewhat inflexible, because it only
> >>>> allowed us to check for a specific value.
> >>>>
> >>>> This commit adds an option to reverse the check, so that we can have
> >>>> monitor sleep aborted if the expected value *doesn't* match what's in
> >>>> memory. This allows us to both implement all currently implemented
> >>>> driver code, as well as support more use cases which don't easily map to
> >>>> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>>>
> >>>> Since the old behavior is the default, no need to adjust existing
> >>>> implementations.
> >>>>
> >>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>> ---
> >>>>    lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>>>    lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>>>    2 files changed, 8 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>> index dddca3d41c..1006c2edfc 100644
> >>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>>>                           *   4, or 8. Supplying any other value will result in
> >>>>                           *   an error.
> >>>>                           */
> >>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >>>> +                       *   checking if `val` matches something, check if
> >>>> +                       *   `val` *doesn't* match a particular value)
> >>>> +                       */
> >>>>    };
> >>>>
> >>>>    /**
> >>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >>>> index 39ea9fdecd..5d944e9aa4 100644
> >>>> --- a/lib/eal/x86/rte_power_intrinsics.c
> >>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>>>                 const uint64_t masked = cur_value & pmc->mask;
> >>>>
> >>>>                 /* if the masked value is already matching, abort */
> >>>> -             if (masked == pmc->val)
> >>>> +             if (!pmc->invert && masked == pmc->val)
> >>>> +                     goto end;
> >>>> +             /* same, but for inverse check */
> >>>> +             if (pmc->invert && masked != pmc->val)
> >>>>                         goto end;
> >>>>         }
> >>>>
> >>>
> >>> Hmm..., such approach looks too 'patchy'...
> >>> Can we at least replace 'invert' with something like:
> >>> enum rte_power_monitor_cond_op {
> >>>           _EQ, NEQ,...
> >>> };
> >>> Then at least new comparison ops can be added in the future.
> >>> Even better I think would be to just leave to PMD to provide a comparison callback.
> >>> Will make things really simple and generic:
> >>> struct rte_power_monitor_cond {
> >>>        volatile void *addr;
> >>>        int (*cmp)(uint64_t val);
> >>>        uint8_t size;
> >>> };
> >>> And then in rte_power_monitor(...):
> >>> ....
> >>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> >>> if (pmc->cmp(cur_value) != 0)
> >>>           goto end;
> >>> ....
> >>>
> >>
> >> I like the idea of a callback, but these are supposed to be
> >> intrinsic-like functions, so putting too much into them is contrary to
> >> their goal, and it's going to make the API hard to use in simpler cases
> >> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> >> letting the RX callback do it for us). For example, event/dlb code calls
> >> rte_power_monitor explicitly.
> >
> > Good point, I didn't know that.
> > Would be interesting to see how do they use it.
> 
> To be fair, it should be possible to rewrite their code using a
> callback. Perhaps adding a (void *) parameter for any custom data
> related to the callback (because C doesn't have closures...), but
> otherwise it should be doable, so the question isn't that it's
> impossible to rewrite event/dlb code to use callbacks, it's more of an
> issue with complicating usage of already-not-quite-straightforward API
> even more.
> 
> >
> >>
> >> It's going to be especially "fun" to do these indirect function calls
> >> from inside transactional region on call to multi-monitor.
> >
> > But the callback is not supposed to do any memory reads/writes.
> > Just mask/compare of the provided value with some constant.
> 
> Yeah, but with callbacks we can't really control that, can we? I mean i
> guess a *sane* implementation wouldn't do that, but still, it's
> theoretically possible to perform more complex checks and even touch
> some unrelated data in the process.

Yep, a PMD developer can ignore recommendations and do whatever
he wants in the callback. We can't control it.
If he touches some memory in it, there will probably be more spurious wakeups and fewer power savings.
In principle it is the same with all other PMD dev-ops: we have to trust that they are
doing what they are supposed to.

> 
> >
> >> I'm not
> >> opposed to having a callback here, but maybe others have more thoughts
> >> on this?
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-23 11:00           ` Ananyev, Konstantin
@ 2021-06-23 12:12             ` Burakov, Anatoly
  2021-06-23 13:27               ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-23 12:12 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David

On 23-Jun-21 12:00 PM, Ananyev, Konstantin wrote:
> 
>>>>>
>>>>>> Previously, the semantics of power monitor were such that we were
>>>>>> checking current value against the expected value, and if they matched,
>>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
>>>>>> allowed us to check for a specific value.
>>>>>>
>>>>>> This commit adds an option to reverse the check, so that we can have
>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
>>>>>> memory. This allows us to both implement all currently implemented
>>>>>> driver code, as well as support more use cases which don't easily map to
>>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>>>>>
>>>>>> Since the old behavior is the default, no need to adjust existing
>>>>>> implementations.
>>>>>>
>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> ---
>>>>>>     lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>>>>>     lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>>>>>     2 files changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>> index dddca3d41c..1006c2edfc 100644
>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>>>>>                            *   4, or 8. Supplying any other value will result in
>>>>>>                            *   an error.
>>>>>>                            */
>>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>>>>>> +                       *   checking if `val` matches something, check if
>>>>>> +                       *   `val` *doesn't* match a particular value)
>>>>>> +                       */
>>>>>>     };
>>>>>>
>>>>>>     /**
>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>>>>>> index 39ea9fdecd..5d944e9aa4 100644
>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>>>>>                  const uint64_t masked = cur_value & pmc->mask;
>>>>>>
>>>>>>                  /* if the masked value is already matching, abort */
>>>>>> -             if (masked == pmc->val)
>>>>>> +             if (!pmc->invert && masked == pmc->val)
>>>>>> +                     goto end;
>>>>>> +             /* same, but for inverse check */
>>>>>> +             if (pmc->invert && masked != pmc->val)
>>>>>>                          goto end;
>>>>>>          }
>>>>>>
>>>>>
>>>>> Hmm..., such approach looks too 'patchy'...
>>>>> Can we at least replace 'invert' with something like:
>>>>> enum rte_power_monitor_cond_op {
>>>>>            _EQ, NEQ,...
>>>>> };
>>>>> Then at least new comparison ops can be added in the future.
>>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
>>>>> Will make things really simple and generic:
>>>>> struct rte_power_monitor_cond {
>>>>>         volatile void *addr;
>>>>>         int (*cmp)(uint64_t val);
>>>>>         uint8_t size;
>>>>> };
>>>>> And then in rte_power_monitor(...):
>>>>> ....
>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>>>> if (pmc->cmp(cur_value) != 0)
>>>>>            goto end;
>>>>> ....
>>>>>
>>>>
>>>> I like the idea of a callback, but these are supposed to be
>>>> intrinsic-like functions, so putting too much into them is contrary to
>>>> their goal, and it's going to make the API hard to use in simpler cases
>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
>>>> letting the RX callback do it for us). For example, event/dlb code calls
>>>> rte_power_monitor explicitly.
>>>
>>> Good point, I didn't know that.
>>> Would be interesting to see how do they use it.
>>
>> To be fair, it should be possible to rewrite their code using a
>> callback. Perhaps adding a (void *) parameter for any custom data
>> related to the callback (because C doesn't have closures...), but
>> otherwise it should be doable, so the question isn't that it's
>> impossible to rewrite event/dlb code to use callbacks, it's more of an
>> issue with complicating usage of already-not-quite-straightforward API
>> even more.
>>
>>>
>>>>
>>>> It's going to be especially "fun" to do these indirect function calls
>>>> from inside transactional region on call to multi-monitor.
>>>
>>> But the callback is not supposed to do any memory reads/writes.
>>> Just mask/compare of the provided value with some constant.
>>
>> Yeah, but with callbacks we can't really control that, can we? I mean i
>> guess a *sane* implementation wouldn't do that, but still, it's
>> theoretically possible to perform more complex checks and even touch
>> some unrelated data in the process.
> 
> Yep, PMD developer can ignore recommendations and do whatever
> he wants in the call-back. We can't control it.
> If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
> In principle it is the same with all other PMD dev-ops - we have to trust that they are
> doing what they have to.

I did a quick prototype for this, and i don't think it is going to work.

Callbacks with just "current value" as argument will be pretty limited 
and will only really work for cases where we know what we are expecting. 
However, for cases like event/dlb or net/mlx5, the expected value is (or 
appears to be) dependent upon some internal device data, and is not 
constant like in the case of net/ixgbe, for example.

This can be fixed by passing an opaque pointer, either by storing it in 
the monitor condition, or by passing it directly to rte_power_monitor at 
invocation time.

The latter doesn't work well because when we call rte_power_monitor from 
inside the rte_power library, we lack the context necessary to get said 
opaque pointer.

The former doesn't work either, because the only place where we can get 
this argument is inside get_monitor_addr, but the opaque pointer must 
persist after we exit that function in order to avoid use-after-free - 
which means that it either has to be statically allocated (which means 
it's not thread-safe for a non-trivial case), or dynamically allocated 
(which is a big no-no on a hotpath).

Any other suggestions? :)

> 
>>
>>>
>>>> I'm not
>>>> opposed to having a callback here, but maybe others have more thoughts
>>>> on this?
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
>>
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-23 12:12             ` Burakov, Anatoly
@ 2021-06-23 13:27               ` Ananyev, Konstantin
  2021-06-23 14:13                 ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-23 13:27 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David


> On 23-Jun-21 12:00 PM, Ananyev, Konstantin wrote:
> >
> >>>>>
> >>>>>> Previously, the semantics of power monitor were such that we were
> >>>>>> checking current value against the expected value, and if they matched,
> >>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
> >>>>>> allowed us to check for a specific value.
> >>>>>>
> >>>>>> This commit adds an option to reverse the check, so that we can have
> >>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
> >>>>>> memory. This allows us to both implement all currently implemented
> >>>>>> driver code, as well as support more use cases which don't easily map to
> >>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>>>>>
> >>>>>> Since the old behavior is the default, no need to adjust existing
> >>>>>> implementations.
> >>>>>>
> >>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>> ---
> >>>>>>     lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>>>>>     lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>>>>>     2 files changed, 8 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>> index dddca3d41c..1006c2edfc 100644
> >>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>>>>>                            *   4, or 8. Supplying any other value will result in
> >>>>>>                            *   an error.
> >>>>>>                            */
> >>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >>>>>> +                       *   checking if `val` matches something, check if
> >>>>>> +                       *   `val` *doesn't* match a particular value)
> >>>>>> +                       */
> >>>>>>     };
> >>>>>>
> >>>>>>     /**
> >>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>> index 39ea9fdecd..5d944e9aa4 100644
> >>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
> >>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>>>>>                  const uint64_t masked = cur_value & pmc->mask;
> >>>>>>
> >>>>>>                  /* if the masked value is already matching, abort */
> >>>>>> -             if (masked == pmc->val)
> >>>>>> +             if (!pmc->invert && masked == pmc->val)
> >>>>>> +                     goto end;
> >>>>>> +             /* same, but for inverse check */
> >>>>>> +             if (pmc->invert && masked != pmc->val)
> >>>>>>                          goto end;
> >>>>>>          }
> >>>>>>
> >>>>>
> >>>>> Hmm..., such approach looks too 'patchy'...
> >>>>> Can we at least replace 'invert' with something like:
> >>>>> enum rte_power_monitor_cond_op {
> >>>>>            _EQ, NEQ,...
> >>>>> };
> >>>>> Then at least new comparison ops can be added in the future.
> >>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
> >>>>> Will make things really simple and generic:
> >>>>> struct rte_power_monitor_cond {
> >>>>>         volatile void *addr;
> >>>>>         int (*cmp)(uint64_t val);
> >>>>>         uint8_t size;
> >>>>> };
> >>>>> And then in rte_power_monitor(...):
> >>>>> ....
> >>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> >>>>> if (pmc->cmp(cur_value) != 0)
> >>>>>            goto end;
> >>>>> ....
> >>>>>
> >>>>
> >>>> I like the idea of a callback, but these are supposed to be
> >>>> intrinsic-like functions, so putting too much into them is contrary to
> >>>> their goal, and it's going to make the API hard to use in simpler cases
> >>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> >>>> letting the RX callback do it for us). For example, event/dlb code calls
> >>>> rte_power_monitor explicitly.
> >>>
> >>> Good point, I didn't know that.
> >>> Would be interesting to see how do they use it.
> >>
> >> To be fair, it should be possible to rewrite their code using a
> >> callback. Perhaps adding a (void *) parameter for any custom data
> >> related to the callback (because C doesn't have closures...), but
> >> otherwise it should be doable, so the question isn't that it's
> >> impossible to rewrite event/dlb code to use callbacks, it's more of an
> >> issue with complicating usage of already-not-quite-straightforward API
> >> even more.
> >>
> >>>
> >>>>
> >>>> It's going to be especially "fun" to do these indirect function calls
> >>>> from inside transactional region on call to multi-monitor.
> >>>
> >>> But the callback is not supposed to do any memory reads/writes.
> >>> Just mask/compare of the provided value with some constant.
> >>
> >> Yeah, but with callbacks we can't really control that, can we? I mean i
> >> guess a *sane* implementation wouldn't do that, but still, it's
> >> theoretically possible to perform more complex checks and even touch
> >> some unrelated data in the process.
> >
> > Yep, PMD developer can ignore recommendations and do whatever
> > he wants in the call-back. We can't control it.
> > If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
> > In principle it is the same with all other PMD dev-ops - we have to trust that they are
> > doing what they have to.
> 
> I did a quick prototype for this, and i don't think it is going to work.
> 
> Callbacks with just "current value" as argument will be pretty limited
> and will only really work for cases where we know what we are expecting.
> However, for cases like event/dlb or net/mlx5, the expected value is (or
> appears to be) dependent upon some internal device data, and is not
> constant like in case of net/ixgbe for example.
> 
> This can be fixed by passing an opaque pointer, either by storing it in
> the monitor condition, or by passing it directly to rte_power_monitor at
> invocation time.
> 
> The latter doesn't work well because when we call rte_power_monitor from
> inside the rte_power library, we lack the context necessary to get said
> opaque pointer.
> 
> The former doesn't work either, because the only place where we can get
> this argument is inside get_monitor_addr, but the opaque pointer must
> persist after we exit that function in order to avoid use-after-free -
> which means that it either has to be statically allocated (which means
> it's not thread-safe for a non-trivial case), or dynamically allocated
> (which is a big no-no on a hotpath).

If I get you right, expected_value (and probably mask) can vary at run time.
So for the callback approach to work, we need to pass all of this as parameters
to the PMD comparison callback:
int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
Correct?

> 
> Any other suggestions? :)
> 
> >
> >>
> >>>
> >>>> I'm not
> >>>> opposed to having a callback here, but maybe others have more thoughts
> >>>> on this?
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Anatoly
> >>
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-23 13:27               ` Ananyev, Konstantin
@ 2021-06-23 14:13                 ` Burakov, Anatoly
  2021-06-24  9:47                   ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-23 14:13 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David

On 23-Jun-21 2:27 PM, Ananyev, Konstantin wrote:
> 
>> On 23-Jun-21 12:00 PM, Ananyev, Konstantin wrote:
>>>
>>>>>>>
>>>>>>>> Previously, the semantics of power monitor were such that we were
>>>>>>>> checking current value against the expected value, and if they matched,
>>>>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
>>>>>>>> allowed us to check for a specific value.
>>>>>>>>
>>>>>>>> This commit adds an option to reverse the check, so that we can have
>>>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
>>>>>>>> memory. This allows us to both implement all currently implemented
>>>>>>>> driver code, as well as support more use cases which don't easily map to
>>>>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>>>>>>>
>>>>>>>> Since the old behavior is the default, no need to adjust existing
>>>>>>>> implementations.
>>>>>>>>
>>>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>>>> ---
>>>>>>>>      lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>>>>>>>      lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>>>>>>>      2 files changed, 8 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>>>> index dddca3d41c..1006c2edfc 100644
>>>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>>>>>>>                             *   4, or 8. Supplying any other value will result in
>>>>>>>>                             *   an error.
>>>>>>>>                             */
>>>>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>>>>>>>> +                       *   checking if `val` matches something, check if
>>>>>>>> +                       *   `val` *doesn't* match a particular value)
>>>>>>>> +                       */
>>>>>>>>      };
>>>>>>>>
>>>>>>>>      /**
>>>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>>>>>>>> index 39ea9fdecd..5d944e9aa4 100644
>>>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
>>>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>>>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>>>>>>>                   const uint64_t masked = cur_value & pmc->mask;
>>>>>>>>
>>>>>>>>                   /* if the masked value is already matching, abort */
>>>>>>>> -             if (masked == pmc->val)
>>>>>>>> +             if (!pmc->invert && masked == pmc->val)
>>>>>>>> +                     goto end;
>>>>>>>> +             /* same, but for inverse check */
>>>>>>>> +             if (pmc->invert && masked != pmc->val)
>>>>>>>>                           goto end;
>>>>>>>>           }
>>>>>>>>
>>>>>>>
>>>>>>> Hmm..., such approach looks too 'patchy'...
>>>>>>> Can we at least replace 'invert' with something like:
>>>>>>> enum rte_power_monitor_cond_op {
>>>>>>>             _EQ, NEQ,...
>>>>>>> };
>>>>>>> Then at least new comparison ops can be added in the future.
>>>>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
>>>>>>> Will make things really simple and generic:
>>>>>>> struct rte_power_monitor_cond {
>>>>>>>          volatile void *addr;
>>>>>>>          int (*cmp)(uint64_t val);
>>>>>>>          uint8_t size;
>>>>>>> };
>>>>>>> And then in rte_power_monitor(...):
>>>>>>> ....
>>>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>>>>>> if (pmc->cmp(cur_value) != 0)
>>>>>>>             goto end;
>>>>>>> ....
>>>>>>>
>>>>>>
>>>>>> I like the idea of a callback, but these are supposed to be
>>>>>> intrinsic-like functions, so putting too much into them is contrary to
>>>>>> their goal, and it's going to make the API hard to use in simpler cases
>>>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
>>>>>> letting the RX callback do it for us). For example, event/dlb code calls
>>>>>> rte_power_monitor explicitly.
>>>>>
>>>>> Good point, I didn't know that.
>>>>> Would be interesting to see how do they use it.
>>>>
>>>> To be fair, it should be possible to rewrite their code using a
>>>> callback. Perhaps adding a (void *) parameter for any custom data
>>>> related to the callback (because C doesn't have closures...), but
>>>> otherwise it should be doable, so the question isn't that it's
>>>> impossible to rewrite event/dlb code to use callbacks, it's more of an
>>>> issue with complicating usage of already-not-quite-straightforward API
>>>> even more.
>>>>
>>>>>
>>>>>>
>>>>>> It's going to be especially "fun" to do these indirect function calls
>>>>>> from inside transactional region on call to multi-monitor.
>>>>>
>>>>> But the callback is not supposed to do any memory reads/writes.
>>>>> Just mask/compare of the provided value with some constant.
>>>>
>>>> Yeah, but with callbacks we can't really control that, can we? I mean i
>>>> guess a *sane* implementation wouldn't do that, but still, it's
>>>> theoretically possible to perform more complex checks and even touch
>>>> some unrelated data in the process.
>>>
>>> Yep, PMD developer can ignore recommendations and do whatever
>>> he wants in the call-back. We can't control it.
>>> If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
>>> In principle it is the same with all other PMD dev-ops - we have to trust that they are
>>> doing what they have to.
>>
>> I did a quick prototype for this, and i don't think it is going to work.
>>
>> Callbacks with just "current value" as argument will be pretty limited
>> and will only really work for cases where we know what we are expecting.
>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>> appears to be) dependent upon some internal device data, and is not
>> constant like in case of net/ixgbe for example.
>>
>> This can be fixed by passing an opaque pointer, either by storing it in
>> the monitor condition, or by passing it directly to rte_power_monitor at
>> invocation time.
>>
>> The latter doesn't work well because when we call rte_power_monitor from
>> inside the rte_power library, we lack the context necessary to get said
>> opaque pointer.
>>
>> The former doesn't work either, because the only place where we can get
>> this argument is inside get_monitor_addr, but the opaque pointer must
>> persist after we exit that function in order to avoid use-after-free -
>> which means that it either has to be statically allocated (which means
>> it's not thread-safe for a non-trivial case), or dynamically allocated
>> (which a big no-no on a hotpath).
> 
> If I get you right, expected_value (and probably mask) can be variable ones.
> So for callback approach to work we need to pass all this as parameters
> to PMD comparison callback:
> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> Correct?

If we have both expected value, mask, and current value, then what's the 
point of the callback? The point of the callback would be to pass just 
the current value, and let the callback decide what's the expected value 
and how to compare it.

So, we can either let callback handle expected values itself by having 
an opaque callback-specific argument (which means it has to persist 
between .get_monitor_addr() and rte_power_monitor() calls), or we do the 
comparisons inside rte_power_monitor(), and store the expected/mask 
values in the monitor condition, and *don't* have any callbacks at all.

Are you suggesting an alternative to the above two options?
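For concreteness, the two options could be sketched as standalone C like below (type and field names here are made up for illustration - this is not the actual DPDK API):

```c
#include <stdint.h>

/* Option 1: callback plus a persistent opaque argument (hypothetical).
 * The ctx pointer must stay valid from .get_monitor_addr() until the
 * rte_power_monitor() call - which is exactly the lifetime problem
 * described above. */
struct monitor_cond_cb {
	volatile void *addr;
	uint8_t size;
	int (*cmp)(uint64_t cur, void *ctx);
	void *ctx;
};

/* Option 2: no callback - expected value and mask are stored in the
 * condition, and the comparison happens inside rte_power_monitor(). */
struct monitor_cond_val {
	volatile void *addr;
	uint8_t size;
	uint64_t val;  /* expected value, after masking */
	uint64_t mask; /* which bits to compare */
};

/* the check rte_power_monitor() would perform for option 2:
 * nonzero means the condition already matched, so abort the sleep */
static inline int
cond_matched(const struct monitor_cond_val *c, uint64_t cur)
{
	return (cur & c->mask) == c->val;
}
```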

> 
>>
>> Any other suggestions? :)
>>
>>>
>>>>
>>>>>
>>>>>> I'm not
>>>>>> opposed to having a callback here, but maybe others have more thoughts
>>>>>> on this?
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Anatoly
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
>>
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-23 14:13                 ` Burakov, Anatoly
@ 2021-06-24  9:47                   ` Ananyev, Konstantin
  2021-06-24 14:34                     ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-24  9:47 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David


> >>>>>>>
> >>>>>>>> Previously, the semantics of power monitor were such that we were
> >>>>>>>> checking current value against the expected value, and if they matched,
> >>>>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
> >>>>>>>> allowed us to check for a specific value.
> >>>>>>>>
> >>>>>>>> This commit adds an option to reverse the check, so that we can have
> >>>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
> >>>>>>>> memory. This allows us to both implement all currently implemented
> >>>>>>>> driver code, as well as support more use cases which don't easily map to
> >>>>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>>>>>>>
> >>>>>>>> Since the old behavior is the default, no need to adjust existing
> >>>>>>>> implementations.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>>>> ---
> >>>>>>>>      lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>>>>>>>      lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>>>>>>>      2 files changed, 8 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> index dddca3d41c..1006c2edfc 100644
> >>>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>>>>>>>                             *   4, or 8. Supplying any other value will result in
> >>>>>>>>                             *   an error.
> >>>>>>>>                             */
> >>>>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >>>>>>>> +                       *   checking if `val` matches something, check if
> >>>>>>>> +                       *   `val` *doesn't* match a particular value)
> >>>>>>>> +                       */
> >>>>>>>>      };
> >>>>>>>>
> >>>>>>>>      /**
> >>>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> index 39ea9fdecd..5d944e9aa4 100644
> >>>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>>>>>>>                   const uint64_t masked = cur_value & pmc->mask;
> >>>>>>>>
> >>>>>>>>                   /* if the masked value is already matching, abort */
> >>>>>>>> -             if (masked == pmc->val)
> >>>>>>>> +             if (!pmc->invert && masked == pmc->val)
> >>>>>>>> +                     goto end;
> >>>>>>>> +             /* same, but for inverse check */
> >>>>>>>> +             if (pmc->invert && masked != pmc->val)
> >>>>>>>>                           goto end;
> >>>>>>>>           }
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hmm..., such approach looks too 'patchy'...
> >>>>>>> Can we at least replace 'inver' with something like:
> >>>>>>> enum rte_power_monitor_cond_op {
> >>>>>>>             _EQ, NEQ,...
> >>>>>>> };
> >>>>>>> Then at least new comparions ops can be added in future.
> >>>>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
> >>>>>>> Will make things really simple and generic:
> >>>>>>> struct rte_power_monitor_cond {
> >>>>>>>          volatile void *addr;
> >>>>>>>          int (*cmp)(uint64_t val);
> >>>>>>>          uint8_t size;
> >>>>>>> };
> >>>>>>> And then in rte_power_monitor(...):
> >>>>>>> ....
> >>>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> >>>>>>> if (pmc->cmp(cur_value) != 0)
> >>>>>>>             goto end;
> >>>>>>> ....
> >>>>>>>
> >>>>>>
> >>>>>> I like the idea of a callback, but these are supposed to be
> >>>>>> intrinsic-like functions, so putting too much into them is contrary to
> >>>>>> their goal, and it's going to make the API hard to use in simpler cases
> >>>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> >>>>>> letting the RX callback do it for us). For example, event/dlb code calls
> >>>>>> rte_power_monitor explicitly.
> >>>>>
> >>>>> Good point, I didn't know that.
> >>>>> Would be interesting to see how do they use it.
> >>>>
> >>>> To be fair, it should be possible to rewrite their code using a
> >>>> callback. Perhaps adding a (void *) parameter for any custom data
> >>>> related to the callback (because C doesn't have closures...), but
> >>>> otherwise it should be doable, so the question isn't that it's
> >>>> impossible to rewrite event/dlb code to use callbacks, it's more of an
> >>>> issue with complicating usage of already-not-quite-straightforward API
> >>>> even more.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> It's going to be especially "fun" to do these indirect function calls
> >>>>>> from inside transactional region on call to multi-monitor.
> >>>>>
> >>>>> But the callback is not supposed to do any memory reads/writes.
> >>>>> Just mask/compare of the provided value with some constant.
> >>>>
> >>>> Yeah, but with callbacks we can't really control that, can we? I mean i
> >>>> guess a *sane* implementation wouldn't do that, but still, it's
> >>>> theoretically possible to perform more complex checks and even touch
> >>>> some unrelated data in the process.
> >>>
> >>> Yep, PMD developer can ignore recommendations and do whatever
> >>> he wants in the call-back. We can't control it.
> >>> If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
> >>> In principle it is the same with all other PMD dev-ops - we have to trust that they are
> >>> doing what they have to.
> >>
> >> I did a quick prototype for this, and i don't think it is going to work.
> >>
> >> Callbacks with just "current value" as argument will be pretty limited
> >> and will only really work for cases where we know what we are expecting.
> >> However, for cases like event/dlb or net/mlx5, the expected value is (or
> >> appears to be) dependent upon some internal device data, and is not
> >> constant like in case of net/ixgbe for example.
> >>
> >> This can be fixed by passing an opaque pointer, either by storing it in
> >> the monitor condition, or by passing it directly to rte_power_monitor at
> >> invocation time.
> >>
> >> The latter doesn't work well because when we call rte_power_monitor from
> >> inside the rte_power library, we lack the context necessary to get said
> >> opaque pointer.
> >>
> >> The former doesn't work either, because the only place where we can get
> >> this argument is inside get_monitor_addr, but the opaque pointer must
> >> persist after we exit that function in order to avoid use-after-free -
> >> which means that it either has to be statically allocated (which means
> >> it's not thread-safe for a non-trivial case), or dynamically allocated
> >> (which a big no-no on a hotpath).
> >
> > If I get you right, expected_value (and probably mask) can be variable ones.
> > So for callback approach to work we need to pass all this as parameters
> > to PMD comparison callback:
> > int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> > Correct?
> 
> If we have both expected value, mask, and current value, then what's the
> point of the callback? The point of the callback would be to pass just
> the current value, and let the callback decide what's the expected value
> and how to compare it.

For me the main point of the callback is to hide PMD-specific comparison semantics.
Basically the PMD provides us with some values in struct rte_power_monitor_cond,
and then it is up to the PMD how to interpret them in its comparison function.
All we'll do for it is read the value at the address provided.
I understand that it looks like overkill, as the majority of these comparison functions
will be like:
int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
{
	return ((real_val & mask) == expected_val);
}
Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
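As a compilable illustration (a standalone sketch, not the DPDK sources), both the common case and the inverted check from this patch map naturally onto such callbacks:

```c
#include <stdint.h>

/* the common case: wake up when the masked value equals the expected one */
static int
cmp_eq(uint64_t real_val, uint64_t expected_val, uint64_t mask)
{
	return (real_val & mask) == expected_val;
}

/* the 'invert' check from this patch, expressed as a callback:
 * wake up when the masked value *differs* from the expected one */
static int
cmp_neq(uint64_t real_val, uint64_t expected_val, uint64_t mask)
{
	return (real_val & mask) != expected_val;
}
```

In both cases a nonzero return means the wakeup condition is already met and the sleep should be aborted.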

> 
> So, we can either let callback handle expected values itself by having
> an opaque callback-specific argument (which means it has to persist
> between .get_monitor_addr() and rte_power_monitor() calls), 

But that's what we're doing already - the PMD fills the rte_power_monitor_cond values
for us, we store them somewhere and then use them to decide whether we should go to sleep or not.
All the callback does is move the actual interpretation of the values back to the PMD:
Right now:
PMD:      provide PMC values
POWER: store PMC values somewhere
                read the value at the address provided in PMC
                interpret the PMC values and the newly read value and make the decision

With callback:
PMD:      provide PMC values
POWER: store PMC values somewhere
                read the value at the address provided in PMC
PMD:      interpret the PMC values and the newly read value and make the decision

Or did you mean something different here?

>or we do the
> comparisons inside rte_power_monitor(), and store the expected/mask
> values in the monitor condition, and *don't* have any callbacks at all.
> Are you suggesting an alternative to the above two options?

As I said in my first mail - we can just replace 'invert' with 'op'.
That at least will make this API extendable, if someone needs
something different in future.
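The 'op' variant could look roughly like this (a standalone sketch; the enum and function names are made up for illustration, not merged DPDK code):

```c
#include <stdint.h>

/* hypothetical comparison ops replacing the 'invert' flag */
enum rte_power_monitor_cond_op {
	RTE_POWER_MONITOR_OP_EQ,  /* abort sleep when (cur & mask) == val */
	RTE_POWER_MONITOR_OP_NEQ, /* abort sleep when (cur & mask) != val */
};

/* nonzero means the wakeup condition is met, so the sleep is aborted */
static int
cond_aborts_sleep(enum rte_power_monitor_cond_op op,
		uint64_t cur, uint64_t val, uint64_t mask)
{
	const uint64_t masked = cur & mask;

	switch (op) {
	case RTE_POWER_MONITOR_OP_EQ:
		return masked == val;
	case RTE_POWER_MONITOR_OP_NEQ:
		return masked != val;
	}
	return 0;
}
```

New ops (greater-than, bit-set, etc.) could then be added without another ABI-visible flag.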

Another option is 

> 
> >
> >>
> >> Any other suggestions? :)
> >>
> >>>
> >>>>
> >>>>>
> >>>>>> I'm not
> >>>>>> opposed to having a callback here, but maybe others have more thoughts
> >>>>>> on this?
> >>>>>>
> >>>>>> --
> >>>>>> Thanks,
> >>>>>> Anatoly
> >>>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Anatoly
> >>
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-24  9:47                   ` Ananyev, Konstantin
@ 2021-06-24 14:34                     ` Burakov, Anatoly
  2021-06-24 14:57                       ` Ananyev, Konstantin
  2021-07-09 15:03                       ` David Marchand
  0 siblings, 2 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-24 14:34 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David

On 24-Jun-21 10:47 AM, Ananyev, Konstantin wrote:
> 

<snip>

>>>> I did a quick prototype for this, and i don't think it is going to work.
>>>>
>>>> Callbacks with just "current value" as argument will be pretty limited
>>>> and will only really work for cases where we know what we are expecting.
>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>>>> appears to be) dependent upon some internal device data, and is not
>>>> constant like in case of net/ixgbe for example.
>>>>
>>>> This can be fixed by passing an opaque pointer, either by storing it in
>>>> the monitor condition, or by passing it directly to rte_power_monitor at
>>>> invocation time.
>>>>
>>>> The latter doesn't work well because when we call rte_power_monitor from
>>>> inside the rte_power library, we lack the context necessary to get said
>>>> opaque pointer.
>>>>
>>>> The former doesn't work either, because the only place where we can get
>>>> this argument is inside get_monitor_addr, but the opaque pointer must
>>>> persist after we exit that function in order to avoid use-after-free -
>>>> which means that it either has to be statically allocated (which means
>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
>>>> (which a big no-no on a hotpath).
>>>
>>> If I get you right, expected_value (and probably mask) can be variable ones.
>>> So for callback approach to work we need to pass all this as parameters
>>> to PMD comparison callback:
>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>> Correct?
>>
>> If we have both expected value, mask, and current value, then what's the
>> point of the callback? The point of the callback would be to pass just
>> the current value, and let the callback decide what's the expected value
>> and how to compare it.
> 
> For me the main point of callback is to hide PMD specific comparison semantics.
> Basically they provide us with some values in struct rte_power_monitor_cond,
> and then it is up to them how to interpret them in their comparison function.
> All we'll do for them: will read the value at address provided.
> I understand that it looks like an overkill, as majority of these comparison functions
> will be like:
> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> {
>          return ((real_val & mask) == expected_val);
> }
> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
> 
>>
>> So, we can either let callback handle expected values itself by having
>> an opaque callback-specific argument (which means it has to persist
>> between .get_monitor_addr() and rte_power_monitor() calls),
> 
> But that's what we doing already - PMD fills rte_power_monitor_cond values
> for us, we store them somewhere and then use them to decide should we go to sleep or not.
> All callback does - moves actual values interpretation back to PMD:
> Right now:
> PMD:      provide PMC values
> POWER: store PMC values somewhere
>                  read the value at address provided in PMC
>                  interpret PMC values and newly read value and make the decision
> 
> With callback:
> PMD:      provide PMC values
> POWER: store PMC values somewhere
>                  read the value at address provided in PMC
> PMD:      interpret PMC values and newly read value and make the decision
> 
> Or did you mean something different here?
> 
>> or we do the
>> comparisons inside rte_power_monitor(), and store the expected/mask
>> values in the monitor condition, and *don't* have any callbacks at all.
>> Are you suggesting an alternative to the above two options?
> 
> As I said in my first mail - we can just replace 'inverse' with 'op'.
> That at least will make this API extendable, if someone will need
> something different in future.
> 
> Another option is

Right, so the idea is to store the PMD-specific data in the monitor 
condition, and leave it to the callback to interpret it.

The obvious question then is, how many values are enough? Two? Three? 
Four? This option doesn't really solve the basic issue, it just kicks 
the can down the road. I guess three values should be enough for 
everyone (tm) ? :D

I don't like the 'op' thing because if the goal is to be flexible, it's 
unnecessarily limiting *and* makes the API even more complex to use. I 
would rather have a number of PMD-specific values and leave it up to the 
callback to interpret them, because at least that way we're not limited 
to predefined operations on the monitor condition data.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-24 14:34                     ` Burakov, Anatoly
@ 2021-06-24 14:57                       ` Ananyev, Konstantin
  2021-06-24 15:04                         ` Burakov, Anatoly
  2021-07-09 15:03                       ` David Marchand
  1 sibling, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-24 14:57 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David



> >>>> I did a quick prototype for this, and i don't think it is going to work.
> >>>>
> >>>> Callbacks with just "current value" as argument will be pretty limited
> >>>> and will only really work for cases where we know what we are expecting.
> >>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
> >>>> appears to be) dependent upon some internal device data, and is not
> >>>> constant like in case of net/ixgbe for example.
> >>>>
> >>>> This can be fixed by passing an opaque pointer, either by storing it in
> >>>> the monitor condition, or by passing it directly to rte_power_monitor at
> >>>> invocation time.
> >>>>
> >>>> The latter doesn't work well because when we call rte_power_monitor from
> >>>> inside the rte_power library, we lack the context necessary to get said
> >>>> opaque pointer.
> >>>>
> >>>> The former doesn't work either, because the only place where we can get
> >>>> this argument is inside get_monitor_addr, but the opaque pointer must
> >>>> persist after we exit that function in order to avoid use-after-free -
> >>>> which means that it either has to be statically allocated (which means
> >>>> it's not thread-safe for a non-trivial case), or dynamically allocated
> >>>> (which a big no-no on a hotpath).
> >>>
> >>> If I get you right, expected_value (and probably mask) can be variable ones.
> >>> So for callback approach to work we need to pass all this as parameters
> >>> to PMD comparison callback:
> >>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> >>> Correct?
> >>
> >> If we have both expected value, mask, and current value, then what's the
> >> point of the callback? The point of the callback would be to pass just
> >> the current value, and let the callback decide what's the expected value
> >> and how to compare it.
> >
> > For me the main point of callback is to hide PMD specific comparison semantics.
> > Basically they provide us with some values in struct rte_power_monitor_cond,
> > and then it is up to them how to interpret them in their comparison function.
> > All we'll do for them: will read the value at address provided.
> > I understand that it looks like an overkill, as majority of these comparison functions
> > will be like:
> > int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> > {
> >          return ((real_val & mask) == expected_val);
> > }
> > Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
> >
> >>
> >> So, we can either let callback handle expected values itself by having
> >> an opaque callback-specific argument (which means it has to persist
> >> between .get_monitor_addr() and rte_power_monitor() calls),
> >
> > But that's what we doing already - PMD fills rte_power_monitor_cond values
> > for us, we store them somewhere and then use them to decide should we go to sleep or not.
> > All callback does - moves actual values interpretation back to PMD:
> > Right now:
> > PMD:      provide PMC values
> > POWER: store PMC values somewhere
> >                  read the value at address provided in PMC
> >                  interpret PMC values and newly read value and make the decision
> >
> > With callback:
> > PMD:      provide PMC values
> > POWER: store PMC values somewhere
> >                  read the value at address provided in PMC
> > PMD:      interpret PMC values and newly read value and make the decision
> >
> > Or did you mean something different here?
> >
> >> or we do the
> >> comparisons inside rte_power_monitor(), and store the expected/mask
> >> values in the monitor condition, and *don't* have any callbacks at all.
> >> Are you suggesting an alternative to the above two options?
> >
> > As I said in my first mail - we can just replace 'inverse' with 'op'.
> > That at least will make this API extendable, if someone will need
> > something different in future.
> >
> > Another option is
> 
> Right, so the idea is store the PMD-specific data in the monitor
> condition, and leave it to the callback to interpret it.
> 
> The obvious question then is, how many values is enough? Two? Three?
> Four? This option doesn't really solve the basic issue, it just kicks
> the can down the road. I guess three values should be enough for
> everyone (tm) ? :D
> 
> I don't like the 'op' thing because if the goal is to be flexible, it's
> unnecessarily limiting *and* makes the API even more complex to use. I
> would rather have a number of PMD-specific values and leave it up to the
> callback to interpret them, because at least that way we're not limited
> to predefined operations on the monitor condition data.

Just to make sure we are talking about the same thing, does what you propose
look like this:

 struct rte_power_monitor_cond {
        volatile void *addr;  /**< Address to monitor for changes */
        uint8_t size;    /**< Data size (in bytes) that will be read from the
                          *   monitored memory location (`addr`) and passed to
                          *   the comparison callback (`cmp`). Can be 1, 2,
                          *   4, or 8. Supplying any other value will result in
                          *   an error.  
                          */
        int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
        uint64_t opaque[4];  /**< PMD-specific data, used by the comparison callback (`cmp`) above */
};

And then in rte_power_monitor():
...
uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
if (pmc->cmp(cur_value, pmc->opaque) != 0) {
    /* value already matches - abort the sleep */
}

?
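For what it's worth, a toy end-to-end use of that layout could look like this (a standalone mock - the struct mirrors the proposal above, and `pmd_cmp` is a hypothetical PMD callback, not merged DPDK code):

```c
#include <stdint.h>

/* mock of the proposed condition layout */
struct rte_power_monitor_cond {
	volatile void *addr;
	uint8_t size;
	int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
	uint64_t opaque[4];
};

/* PMD-side callback: by this PMD's own convention,
 * opaque[0] = expected value, opaque[1] = mask */
static int
pmd_cmp(uint64_t real_value, const uint64_t opaque[4])
{
	return (real_value & opaque[1]) == opaque[0];
}

/* power-library side: nonzero return means abort the sleep */
static int
should_abort_sleep(const struct rte_power_monitor_cond *pmc, uint64_t cur)
{
	return pmc->cmp(cur, pmc->opaque);
}
```

The library never needs to know what the four opaque values mean - only the PMD's callback interprets them.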
 


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-24 14:57                       ` Ananyev, Konstantin
@ 2021-06-24 15:04                         ` Burakov, Anatoly
  2021-06-24 15:25                           ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-24 15:04 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David

On 24-Jun-21 3:57 PM, Ananyev, Konstantin wrote:
> 
> 
>>>>>> I did a quick prototype for this, and i don't think it is going to work.
>>>>>>
>>>>>> Callbacks with just "current value" as argument will be pretty limited
>>>>>> and will only really work for cases where we know what we are expecting.
>>>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>>>>>> appears to be) dependent upon some internal device data, and is not
>>>>>> constant like in case of net/ixgbe for example.
>>>>>>
>>>>>> This can be fixed by passing an opaque pointer, either by storing it in
>>>>>> the monitor condition, or by passing it directly to rte_power_monitor at
>>>>>> invocation time.
>>>>>>
>>>>>> The latter doesn't work well because when we call rte_power_monitor from
>>>>>> inside the rte_power library, we lack the context necessary to get said
>>>>>> opaque pointer.
>>>>>>
>>>>>> The former doesn't work either, because the only place where we can get
>>>>>> this argument is inside get_monitor_addr, but the opaque pointer must
>>>>>> persist after we exit that function in order to avoid use-after-free -
>>>>>> which means that it either has to be statically allocated (which means
>>>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
>>>>>> (which a big no-no on a hotpath).
>>>>>
>>>>> If I get you right, expected_value (and probably mask) can be variable ones.
>>>>> So for callback approach to work we need to pass all this as parameters
>>>>> to PMD comparison callback:
>>>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>>>> Correct?
>>>>
>>>> If we have both expected value, mask, and current value, then what's the
>>>> point of the callback? The point of the callback would be to pass just
>>>> the current value, and let the callback decide what's the expected value
>>>> and how to compare it.
>>>
>>> For me the main point of callback is to hide PMD specific comparison semantics.
>>> Basically they provide us with some values in struct rte_power_monitor_cond,
>>> and then it is up to them how to interpret them in their comparison function.
>>> All we'll do for them: will read the value at address provided.
>>> I understand that it looks like an overkill, as majority of these comparison functions
>>> will be like:
>>> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>> {
>>>           return ((real_val & mask) == expected_val);
>>> }
>>> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
>>>
>>>>
>>>> So, we can either let callback handle expected values itself by having
>>>> an opaque callback-specific argument (which means it has to persist
>>>> between .get_monitor_addr() and rte_power_monitor() calls),
>>>
>>> But that's what we doing already - PMD fills rte_power_monitor_cond values
>>> for us, we store them somewhere and then use them to decide should we go to sleep or not.
>>> All callback does - moves actual values interpretation back to PMD:
>>> Right now:
>>> PMD:      provide PMC values
>>> POWER: store PMC values somewhere
>>>                   read the value at address provided in PMC
>>>                   interpret PMC values and newly read value and make the decision
>>>
>>> With callback:
>>> PMD:      provide PMC values
>>> POWER: store PMC values somewhere
>>>                   read the value at address provided in PMC
>>> PMD:      interpret PMC values and newly read value and make the decision
>>>
>>> Or did you mean something different here?
>>>
>>>> or we do the
>>>> comparisons inside rte_power_monitor(), and store the expected/mask
>>>> values in the monitor condition, and *don't* have any callbacks at all.
>>>> Are you suggesting an alternative to the above two options?
>>>
>>> As I said in my first mail - we can just replace 'inverse' with 'op'.
>>> That at least will make this API extendable, if someone will need
>>> something different in future.
>>>
>>> Another option is
>>
>> Right, so the idea is store the PMD-specific data in the monitor
>> condition, and leave it to the callback to interpret it.
>>
>> The obvious question then is, how many values is enough? Two? Three?
>> Four? This option doesn't really solve the basic issue, it just kicks
>> the can down the road. I guess three values should be enough for
>> everyone (tm) ? :D
>>
>> I don't like the 'op' thing because if the goal is to be flexible, it's
>> unnecessarily limiting *and* makes the API even more complex to use. I
>> would rather have a number of PMD-specific values and leave it up to the
>> callback to interpret them, because at least that way we're not limited
>> to predefined operations on the monitor condition data.
> 
> Just to make sure we are talking about the same, does what you propose
> looks like that:
> 
>   struct rte_power_monitor_cond {
>          volatile void *addr;  /**< Address to monitor for changes */
>          uint8_t size;    /**< Data size (in bytes) that will be used to compare
>                            *   expected value (`val`) with data read from the
>                            *   monitored memory location (`addr`). Can be 1, 2,
>                            *   4, or 8. Supplying any other value will result in
>                            *   an error.
>                            */
>          int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
>          uint64_t opaque[4];  /*PMD specific data, used by comparison call-back below */
> };
> 
> And then in rte_power_monitor():
> ...
> uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> if (pmc->cmp(cur_value, pmc->opaque) != 0) {
>      /* goto sleep */
> }
> 
> ?
> 

Something like that, yes.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-24 15:04                         ` Burakov, Anatoly
@ 2021-06-24 15:25                           ` Ananyev, Konstantin
  2021-06-24 15:54                             ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-24 15:25 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David


> >>>>>> I did a quick prototype for this, and i don't think it is going to work.
> >>>>>>
> >>>>>> Callbacks with just "current value" as argument will be pretty limited
> >>>>>> and will only really work for cases where we know what we are expecting.
> >>>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
> >>>>>> appears to be) dependent upon some internal device data, and is not
> >>>>>> constant like in case of net/ixgbe for example.
> >>>>>>
> >>>>>> This can be fixed by passing an opaque pointer, either by storing it in
> >>>>>> the monitor condition, or by passing it directly to rte_power_monitor at
> >>>>>> invocation time.
> >>>>>>
> >>>>>> The latter doesn't work well because when we call rte_power_monitor from
> >>>>>> inside the rte_power library, we lack the context necessary to get said
> >>>>>> opaque pointer.
> >>>>>>
> >>>>>> The former doesn't work either, because the only place where we can get
> >>>>>> this argument is inside get_monitor_addr, but the opaque pointer must
> >>>>>> persist after we exit that function in order to avoid use-after-free -
> >>>>>> which means that it either has to be statically allocated (which means
> >>>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
> >>>>>> (which a big no-no on a hotpath).
> >>>>>
> >>>>> If I get you right, expected_value (and probably mask) can be variable ones.
> >>>>> So for callback approach to work we need to pass all this as parameters
> >>>>> to PMD comparison callback:
> >>>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> >>>>> Correct?
> >>>>
> >>>> If we have both expected value, mask, and current value, then what's the
> >>>> point of the callback? The point of the callback would be to pass just
> >>>> the current value, and let the callback decide what's the expected value
> >>>> and how to compare it.
> >>>
> >>> For me the main point of callback is to hide PMD specific comparison semantics.
> >>> Basically they provide us with some values in struct rte_power_monitor_cond,
> >>> and then it is up to them how to interpret them in their comparison function.
> >>> All we'll do for them: will read the value at address provided.
> >>> I understand that it looks like an overkill, as majority of these comparison functions
> >>> will be like:
> >>> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> >>> {
> >>>           return ((real_val & mask) == expected_val);
> >>> }
> >>> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
> >>>
> >>>>
> >>>> So, we can either let callback handle expected values itself by having
> >>>> an opaque callback-specific argument (which means it has to persist
> >>>> between .get_monitor_addr() and rte_power_monitor() calls),
> >>>
> >>> But that's what we're doing already - PMD fills rte_power_monitor_cond values
> >>> for us, we store them somewhere and then use them to decide should we go to sleep or not.
> >>> All callback does - moves actual values interpretation back to PMD:
> >>> Right now:
> >>> PMD:      provide PMC values
> >>> POWER: store PMC values somewhere
> >>>                   read the value at address provided in PMC
> >>>                   interpret PMC values and newly read value and make the decision
> >>>
> >>> With callback:
> >>> PMD:      provide PMC values
> >>> POWER: store PMC values somewhere
> >>>                   read the value at address provided in PMC
> >>> PMD:      interpret PMC values and newly read value and make the decision
> >>>
> >>> Or did you mean something different here?
> >>>
> >>>> or we do the
> >>>> comparisons inside rte_power_monitor(), and store the expected/mask
> >>>> values in the monitor condition, and *don't* have any callbacks at all.
> >>>> Are you suggesting an alternative to the above two options?
> >>>
> >>> As I said in my first mail - we can just replace 'inverse' with 'op'.
> >>> That at least will make this API extendable, if someone will need
> >>> something different in future.
> >>>
> >>> Another option is
> >>
> >> Right, so the idea is to store the PMD-specific data in the monitor
> >> condition, and leave it to the callback to interpret it.
> >>
> >> The obvious question then is, how many values is enough? Two? Three?
> >> Four? This option doesn't really solve the basic issue, it just kicks
> >> the can down the road. I guess three values should be enough for
> >> everyone (tm) ? :D
> >>
> >> I don't like the 'op' thing because if the goal is to be flexible, it's
> >> unnecessarily limiting *and* makes the API even more complex to use. I
> >> would rather have a number of PMD-specific values and leave it up to the
> >> callback to interpret them, because at least that way we're not limited
> >> to predefined operations on the monitor condition data.
> >
> > Just to make sure we are talking about the same thing, does what you propose
> > look like this:
> >
> >   struct rte_power_monitor_cond {
> >          volatile void *addr;  /**< Address to monitor for changes */
> >          uint8_t size;    /**< Data size (in bytes) that will be used to compare
> >                            *   expected value (`val`) with data read from the
> >                            *   monitored memory location (`addr`). Can be 1, 2,
> >                            *   4, or 8. Supplying any other value will result in
> >                            *   an error.
> >                            */
> >          int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
> >          uint64_t opaque[4];  /*PMD specific data, used by comparison call-back below */
> > };
> >
> > And then in rte_power_monitor():
> > ...
> > uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> > if (pmc->cmp(cur_value, pmc->opaque) != 0) {
> >      /* goto sleep */
> > }
> >
> > ?
> >
> 
> Something like that, yes.
> 

Seems reasonable to me.
Thanks
Konstantin


* Re: [dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
  2021-06-24 15:25                           ` Ananyev, Konstantin
@ 2021-06-24 15:54                             ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-24 15:54 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Richardson, Bruce; +Cc: Loftus, Ciara, Hunt, David

On 24-Jun-21 4:25 PM, Ananyev, Konstantin wrote:
> 
>>>>>>>> I did a quick prototype for this, and i don't think it is going to work.
>>>>>>>>
>>>>>>>> Callbacks with just "current value" as argument will be pretty limited
>>>>>>>> and will only really work for cases where we know what we are expecting.
>>>>>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>>>>>>>> appears to be) dependent upon some internal device data, and is not
>>>>>>>> constant like in case of net/ixgbe for example.
>>>>>>>>
>>>>>>>> This can be fixed by passing an opaque pointer, either by storing it in
>>>>>>>> the monitor condition, or by passing it directly to rte_power_monitor at
>>>>>>>> invocation time.
>>>>>>>>
>>>>>>>> The latter doesn't work well because when we call rte_power_monitor from
>>>>>>>> inside the rte_power library, we lack the context necessary to get said
>>>>>>>> opaque pointer.
>>>>>>>>
>>>>>>>> The former doesn't work either, because the only place where we can get
>>>>>>>> this argument is inside get_monitor_addr, but the opaque pointer must
>>>>>>>> persist after we exit that function in order to avoid use-after-free -
>>>>>>>> which means that it either has to be statically allocated (which means
>>>>>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
>>>>>>>> (which is a big no-no on a hotpath).
>>>>>>>
>>>>>>> If I get you right, expected_value (and probably mask) can be variable ones.
>>>>>>> So for callback approach to work we need to pass all this as parameters
>>>>>>> to PMD comparison callback:
>>>>>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>>>>>> Correct?
>>>>>>
>>>>>> If we have both expected value, mask, and current value, then what's the
>>>>>> point of the callback? The point of the callback would be to pass just
>>>>>> the current value, and let the callback decide what's the expected value
>>>>>> and how to compare it.
>>>>>
>>>>> For me the main point of callback is to hide PMD specific comparison semantics.
>>>>> Basically they provide us with some values in struct rte_power_monitor_cond,
>>>>> and then it is up to them how to interpret them in their comparison function.
>>>>> All we'll do for them: will read the value at address provided.
>>>>> I understand that it looks like an overkill, as majority of these comparison functions
>>>>> will be like:
>>>>> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>>>> {
>>>>>            return ((real_val & mask) == expected_val);
>>>>> }
>>>>> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
>>>>>
>>>>>>
>>>>>> So, we can either let callback handle expected values itself by having
>>>>>> an opaque callback-specific argument (which means it has to persist
>>>>>> between .get_monitor_addr() and rte_power_monitor() calls),
>>>>>
>>>>> But that's what we're doing already - PMD fills rte_power_monitor_cond values
>>>>> for us, we store them somewhere and then use them to decide should we go to sleep or not.
>>>>> All callback does - moves actual values interpretation back to PMD:
>>>>> Right now:
>>>>> PMD:      provide PMC values
>>>>> POWER: store PMC values somewhere
>>>>>                    read the value at address provided in PMC
>>>>>                    interpret PMC values and newly read value and make the decision
>>>>>
>>>>> With callback:
>>>>> PMD:      provide PMC values
>>>>> POWER: store PMC values somewhere
>>>>>                    read the value at address provided in PMC
>>>>> PMD:      interpret PMC values and newly read value and make the decision
>>>>>
>>>>> Or did you mean something different here?
>>>>>
>>>>>> or we do the
>>>>>> comparisons inside rte_power_monitor(), and store the expected/mask
>>>>>> values in the monitor condition, and *don't* have any callbacks at all.
>>>>>> Are you suggesting an alternative to the above two options?
>>>>>
>>>>> As I said in my first mail - we can just replace 'inverse' with 'op'.
>>>>> That at least will make this API extendable, if someone will need
>>>>> something different in future.
>>>>>
>>>>> Another option is
>>>>
>>>> Right, so the idea is to store the PMD-specific data in the monitor
>>>> condition, and leave it to the callback to interpret it.
>>>>
>>>> The obvious question then is, how many values is enough? Two? Three?
>>>> Four? This option doesn't really solve the basic issue, it just kicks
>>>> the can down the road. I guess three values should be enough for
>>>> everyone (tm) ? :D
>>>>
>>>> I don't like the 'op' thing because if the goal is to be flexible, it's
>>>> unnecessarily limiting *and* makes the API even more complex to use. I
>>>> would rather have a number of PMD-specific values and leave it up to the
>>>> callback to interpret them, because at least that way we're not limited
>>>> to predefined operations on the monitor condition data.
>>>
>>> Just to make sure we are talking about the same thing, does what you propose
>>> look like this:
>>>
>>>    struct rte_power_monitor_cond {
>>>           volatile void *addr;  /**< Address to monitor for changes */
>>>           uint8_t size;    /**< Data size (in bytes) that will be used to compare
>>>                             *   expected value (`val`) with data read from the
>>>                             *   monitored memory location (`addr`). Can be 1, 2,
>>>                             *   4, or 8. Supplying any other value will result in
>>>                             *   an error.
>>>                             */
>>>           int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
>>>           uint64_t opaque[4];  /*PMD specific data, used by comparison call-back below */
>>> };
>>>
>>> And then in rte_power_monitor():
>>> ...
>>> uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>> if (pmc->cmp(cur_value, pmc->opaque) != 0) {
>>>       /* goto sleep */
>>> }
>>>
>>> ?
>>>
>>
>> Something like that, yes.
>>
> 
> Seems reasonable to me.
> Thanks
> Konstantin
> 

OK, i'll implement this in v2. Thanks for your input!

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's
  2021-06-23  9:52       ` Ananyev, Konstantin
@ 2021-06-25 11:52         ` Burakov, Anatoly
  2021-06-25 14:42           ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-25 11:52 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 23-Jun-21 10:52 AM, Ananyev, Konstantin wrote:
> 
> 
>>
>> On 22-Jun-21 10:13 AM, Ananyev, Konstantin wrote:
>>>
>>>> Currently, we expect that only one callback can be active at any given
>>>> moment, for a particular queue configuration, which is relatively easy
>>>> to implement in a thread-safe way. However, we're about to add support
>>>> for multiple queues per lcore, which will greatly increase the
>>>> possibility of various race conditions.
>>>>
> >> We could have used something like an RCU for this use case, but absent
> >> a pressing need for thread safety we'll go the easy way and just
>>>> mandate that the API's are to be called when all affected ports are
>>>> stopped, and document this limitation. This greatly simplifies the
>>>> `rte_power_monitor`-related code.
>>>
>>> I think you need to update RN too with that.
>>
>> Yep, will fix.
>>
>>> Another thing - do you really need the whole port stopped?
>>>   From what I understand - you work on queues, so it is enough for you
>>> that related RX queue is stopped.
>>> So, to make things a bit more robust, in pmgmt_queue_enable/disable
>>> you can call rte_eth_rx_queue_info_get() and check queue state.
>>
>> We work on queues, but the data is per-lcore not per-queue, and it is
>> potentially used by multiple queues, so checking one specific queue is
>> not going to be enough. We could check all queues that were registered
>> so far with the power library, maybe that'll work better?
> 
> Yep, that's what I mean: on queue_enable() check whether that queue is stopped or not.
> If not, return -EBUSY/EAGAIN or so.
> Sorry if I wasn't clear the first time.

I think it's still better that all queues are stopped, rather than 
trying to work around the inherently racy implementation. So while i'll 
add the queue stopped checks, i'll still remove all of the thread safety 
stuff from here.

-- 
Thanks,
Anatoly


* [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management
  2021-06-01 12:00 [dpdk-dev] [PATCH v1 0/7] Enhancements for PMD power management Anatoly Burakov
                   ` (6 preceding siblings ...)
  2021-06-01 12:00 ` [dpdk-dev] [PATCH v1 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-06-25 14:00 ` Anatoly Burakov
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
                     ` (7 more replies)
  7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, ciara.loftus

This patchset introduces several changes related to PMD power management:

- Changed monitoring intrinsics to use callbacks as a comparison function, based
  on previous patchset [1] but incorporating feedback [2] - this hopefully will
  make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
  set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
  accompanying infrastructure and example apps changes

v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary

[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274

Anatoly Burakov (7):
  power_intrinsics: use callbacks for comparison
  net/af_xdp: add power monitor support
  eal: add power monitor for multiple events
  power: remove thread safety from PMD power API's
  power: support callbacks for multiple Rx queues
  power: support monitoring multiple Rx queues
  l3fwd-power: support multiqueue in PMD pmgmt modes

 doc/guides/prog_guide/power_man.rst           |  83 ++-
 doc/guides/rel_notes/release_21_08.rst        |  11 +
 drivers/event/dlb2/dlb2.c                     |  16 +-
 drivers/net/af_xdp/rte_eth_af_xdp.c           |  33 +
 drivers/net/i40e/i40e_rxtx.c                  |  19 +-
 drivers/net/iavf/iavf_rxtx.c                  |  19 +-
 drivers/net/ice/ice_rxtx.c                    |  19 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |  19 +-
 drivers/net/mlx5/mlx5_rx.c                    |  16 +-
 examples/l3fwd-power/main.c                   |  39 +-
 lib/eal/arm/rte_power_intrinsics.c            |  11 +
 lib/eal/include/generic/rte_cpuflags.h        |   2 +
 .../include/generic/rte_power_intrinsics.h    |  64 +-
 lib/eal/ppc/rte_power_intrinsics.c            |  11 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_cpuflags.c                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  78 ++-
 lib/power/meson.build                         |   3 +
 lib/power/rte_power_pmd_mgmt.c                | 574 +++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h                |  40 ++
 lib/power/version.map                         |   3 +
 21 files changed, 841 insertions(+), 224 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH v2 1/7] power_intrinsics: use callbacks for comparison
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-06-25 14:00   ` Anatoly Burakov
  2021-06-28 12:19     ` Ananyev, Konstantin
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 2/7] net/af_xdp: add power monitor support Anatoly Burakov
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
	Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
	Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value.

This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.

Existing implementations are adjusted to follow the new semantics.

Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Use callback mechanism for more flexibility
    - Address feedback from Konstantin

 doc/guides/rel_notes/release_21_08.rst        |  1 +
 drivers/event/dlb2/dlb2.c                     | 16 ++++++++--
 drivers/net/i40e/i40e_rxtx.c                  | 19 ++++++++----
 drivers/net/iavf/iavf_rxtx.c                  | 19 ++++++++----
 drivers/net/ice/ice_rxtx.c                    | 19 ++++++++----
 drivers/net/ixgbe/ixgbe_rxtx.c                | 19 ++++++++----
 drivers/net/mlx5/mlx5_rx.c                    | 16 ++++++++--
 .../include/generic/rte_power_intrinsics.h    | 29 ++++++++++++++-----
 lib/eal/x86/rte_power_intrinsics.c            |  9 ++----
 9 files changed, 106 insertions(+), 41 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
 ABI Changes
 -----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..14dfac257c 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,15 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
 	}
 }
 
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val, const uint64_t opaque[4])
+{
+	/* abort if the value matches */
+	return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
 static inline int
 dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		  struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3203,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 			expected_value = 0;
 
 		pmc.addr = monitor_addr;
-		pmc.val = expected_value;
-		pmc.mask = qe_mask.raw_qe[1];
+		/* store expected value and comparison mask in opaque data */
+		pmc.opaque[CLB_VAL_IDX] = expected_value;
+		pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+		/* set up callback */
+		pmc.fn = dlb2_monitor_callback;
 		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..45f3fbf4ec 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,17 @@
 #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
 
+static int
+i40e_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -93,12 +104,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = i40e_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..6e12ecce07 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,17 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
 				rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
 }
 
+static int
+iavf_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -69,12 +80,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = iavf_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..278eb4b9a1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,17 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+static int
+ice_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -39,12 +50,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.status_error0;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
-	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/* comparison callback */
+	pmc->fn = ice_monitor_callback;
 
 	/* register is 16-bit */
 	pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..0c5045d9dc 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,17 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+static int
+ixgbe_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -1381,12 +1392,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.upper.status_error;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
-	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/* comparison callback */
+	pmc->fn = ixgbe_monitor_callback;
 
 	/* the registers are 32-bit */
 	pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 6cd71a44eb..f31a1ec839 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,17 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 	return rx_queue_count(rxq);
 }
 
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value, const uint64_t opaque[4])
+{
+	const uint64_t m = opaque[CLB_MSK_IDX];
+	const uint64_t v = opaque[CLB_VAL_IDX];
+
+	return (value & m) == v ? -1 : 0;
+}
+
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +293,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 		return -rte_errno;
 	}
 	pmc->addr = &cqe->op_own;
-	pmc->val =  !!idx;
-	pmc->mask = MLX5_CQE_OWNER_MASK;
+	pmc->opaque[CLB_VAL_IDX] = !!idx;
+	pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+	pmc->fn = mlx_monitor_callback;
 	pmc->size = sizeof(uint8_t);
 	return 0;
 }
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..046667ade6 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,34 @@
  * which are architecture-dependent.
  */
 
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ *   The value read from memory.
+ * @param opaque
+ *   Callback-specific data.
+ *
+ * @return
+ *   0 if entering of power optimized state should proceed
+ *   -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+		const uint64_t opaque[4]);
 struct rte_power_monitor_cond {
 	volatile void *addr;  /**< Address to monitor for changes */
-	uint64_t val;         /**< If the `mask` is non-zero, location pointed
-	                       *   to by `addr` will be read and compared
-	                       *   against this value.
-	                       */
-	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
-	uint8_t size;    /**< Data size (in bytes) that will be used to compare
-	                  *   expected value (`val`) with data read from the
+	uint8_t size;    /**< Data size (in bytes) that will be read from the
 	                  *   monitored memory location (`addr`). Can be 1, 2,
 	                  *   4, or 8. Supplying any other value will result in
 	                  *   an error.
 	                  */
+	rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+	                             *   entering power optimized state should
+	                             *   be aborted.
+	                             */
+	uint64_t opaque[4]; /**< Callback-specific data */
 };
 
 /**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..3c5c9ce7ad 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -110,14 +110,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	/* now that we've put this address into monitor, we can unlock */
 	rte_spinlock_unlock(&s->lock);
 
-	/* if we have a comparison mask, we might not need to sleep at all */
-	if (pmc->mask) {
+	/* if we have a callback, we might not need to sleep at all */
+	if (pmc->fn) {
 		const uint64_t cur_value = __get_umwait_val(
 				pmc->addr, pmc->size);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
+		if (pmc->fn(cur_value, pmc->opaque) != 0)
 			goto end;
 	}
 
-- 
2.25.1



* [dpdk-dev] [PATCH v2 2/7] net/af_xdp: add power monitor support
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-06-25 14:00   ` Anatoly Burakov
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events Anatoly Burakov
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt

Implement support for .get_monitor_addr in AF_XDP driver.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Rewrite using the callback mechanism

 drivers/net/af_xdp/rte_eth_af_xdp.c | 33 +++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..8b9c89c3e8 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
 
 #include "compat.h"
 
@@ -788,6 +789,37 @@ eth_dev_configure(struct rte_eth_dev *dev)
 	return 0;
 }
 
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value, const uint64_t opaque[4])
+{
+	const uint64_t v = opaque[CLB_VAL_IDX];
+	const uint64_t m = (uint32_t)~0;
+
+	/* if the value has changed, abort entering power optimized state */
+	return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct pkt_rx_queue *rxq = rx_queue;
+	unsigned int *prod = rxq->rx.producer;
+	const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+	/* watch for changes in producer ring */
+	pmc->addr = (void*)prod;
+
+	/* store current value */
+	pmc->opaque[CLB_VAL_IDX] = cur_val;
+	pmc->fn = eth_monitor_callback;
+
+	/* AF_XDP producer ring index is 32-bit */
+	pmc->size = sizeof(uint32_t);
+
+	return 0;
+}
+
 static int
 eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
@@ -1448,6 +1480,7 @@ static const struct eth_dev_ops ops = {
 	.link_update = eth_link_update,
 	.stats_get = eth_stats_get,
 	.stats_reset = eth_stats_reset,
+	.get_monitor_addr = eth_get_monitor_addr
 };
 
 /** parse busy_budget argument */
-- 
2.25.1



* [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-06-25 14:00   ` Anatoly Burakov
  2021-06-28 12:37     ` Ananyev, Konstantin
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because a TPAUSE sleep inside an RTM
transaction region is aborted (waking the CPU up) whenever any address
in the transaction read-set is written to, so if we add all the
addresses we're interested in to the read-set, a write to any of those
addresses will wake us up.
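
The condition-scanning half of this scheme can be shown in isolation (the struct and helper names below are illustrative stand-ins for `rte_power_monitor_cond` and `__get_umwait_val`, not the real DPDK definitions): each condition carries an address, an access size and a callback; if no callback fires, the core is free to sleep.

```c
#include <stddef.h>
#include <stdint.h>

/* stand-in for struct rte_power_monitor_cond */
struct mon_cond {
	const void *addr;
	uint8_t size;
	int (*fn)(uint64_t value, const uint64_t opaque[4]);
	uint64_t opaque[4];
};

/* read a 1/2/4/8-byte value at p, widened to 64 bits */
static uint64_t
read_val(const void *p, uint8_t sz)
{
	switch (sz) {
	case 1: return *(const uint8_t *)p;
	case 2: return *(const uint16_t *)p;
	case 4: return *(const uint32_t *)p;
	default: return *(const uint64_t *)p;
	}
}

/* example callback: abort sleep when value differs from opaque[0] */
static int
changed_cb(uint64_t value, const uint64_t opaque[4])
{
	return value == opaque[0] ? 0 : -1;
}

/* returns index of the first condition that fired; equal to num when
 * none fired, i.e. when entering the sleep state is safe */
static uint32_t
scan_conditions(const struct mon_cond pmc[], uint32_t num)
{
	uint32_t i;

	for (i = 0; i < num; i++) {
		const struct mon_cond *c = &pmc[i];

		if (c->fn == NULL)
			continue; /* no callback: nothing to check */
		if (c->fn(read_val(c->addr, c->size), c->opaque) != 0)
			break;
	}
	return i;
}
```

In the real implementation this scan runs inside the RTM transaction, so the reads also populate the transaction read-set as a side effect.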

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Adapt to callback mechanism

 doc/guides/rel_notes/release_21_08.rst        |  2 +
 lib/eal/arm/rte_power_intrinsics.c            | 11 +++
 lib/eal/include/generic/rte_cpuflags.h        |  2 +
 .../include/generic/rte_power_intrinsics.h    | 35 ++++++++++
 lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
 lib/eal/version.map                           |  3 +
 lib/eal/x86/rte_cpuflags.c                    |  2 +
 lib/eal/x86/rte_power_intrinsics.c            | 69 +++++++++++++++++++
 8 files changed, 135 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
 	/**< indicates support for rte_power_monitor function */
 	uint32_t power_pause : 1;
 	/**< indicates support for rte_power_pause function */
+	uint32_t power_monitor_multi : 1;
+	/**< indicates support for rte_power_monitor_multi function */
 };
 
 /**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index 046667ade6..877fb282cb 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -124,4 +124,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 __rte_experimental
 int rte_power_pause(const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each monitoring condition may provide a comparison callback.
+ * If one is set, the current value pointed to by that condition's address will
+ * be passed to the callback, and if it indicates that a wakeup condition has
+ * already been met, the entering of optimized power state may be aborted.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ *   An array of monitoring condition structures.
+ * @param num
+ *   Length of the `pmc` array.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp);
+
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
 	rte_version_release; # WINDOWS_NO_EXPORT
 	rte_version_suffix; # WINDOWS_NO_EXPORT
 	rte_version_year; # WINDOWS_NO_EXPORT
+
+	# added in 21.08
+	rte_power_monitor_multi; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
 		intrinsics->power_monitor = 1;
 		intrinsics->power_pause = 1;
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+			intrinsics->power_monitor_multi = 1;
 	}
 }
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 3c5c9ce7ad..3fc6f62ef5 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_rtm.h>
 #include <rte_spinlock.h>
 
 #include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
 }
 
 static bool wait_supported;
+static bool wait_multi_supported;
 
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -164,6 +166,8 @@ RTE_INIT(rte_power_intrinsics_init) {
 
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
+	if (i.power_monitor_multi)
+		wait_multi_supported = 1;
 }
 
 int
@@ -202,6 +206,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	 * In this case, since we've already woken up, the "wakeup" was
 	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
 	 * wakeup address is still valid so it's perfectly safe to write it.
+	 *
+	 * For the multi-monitor case, the act of locking will in itself trigger
+	 * the wakeup, so no additional writes are necessary.
 	 */
 	rte_spinlock_lock(&s->lock);
 	if (s->monitor_addr != NULL)
@@ -210,3 +217,65 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return 0;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s = &wait_status[lcore_id];
+	uint32_t i, rc;
+
+	/* check if supported */
+	if (!wait_multi_supported)
+		return -ENOTSUP;
+
+	if (pmc == NULL || num == 0)
+		return -EINVAL;
+
+	/* we are already inside transaction region, return */
+	if (rte_xtest() != 0)
+		return 0;
+
+	/* start new transaction region */
+	rc = rte_xbegin();
+
+	/* transaction abort, possible write to one of wait addresses */
+	if (rc != RTE_XBEGIN_STARTED)
+		return 0;
+
+	/*
+	 * the mere act of reading the lock status here adds the lock to
+	 * the read set. This means that when we trigger a wakeup from another
+	 * thread, even if we don't have a defined wakeup address and thus don't
+	 * actually cause any writes, the act of locking our lock will itself
+	 * trigger the wakeup and abort the transaction.
+	 */
+	rte_spinlock_is_locked(&s->lock);
+
+	/*
+	 * add all addresses to wait on into transaction read-set and check if
+	 * any of wakeup conditions are already met.
+	 */
+	for (i = 0; i < num; i++) {
+		const struct rte_power_monitor_cond *c = &pmc[i];
+
+		if (c->fn == NULL)
+			continue;
+
+		const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+		/* abort if callback indicates that we need to stop */
+		if (c->fn(val, c->opaque) != 0)
+			break;
+	}
+
+	/* none of the conditions were met, sleep until timeout */
+	if (i == num)
+		rte_power_pause(tsc_timestamp);
+
+	/* end transaction region */
+	rte_xend();
+
+	return 0;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v2 4/7] power: remove thread safety from PMD power API's
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
                     ` (2 preceding siblings ...)
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-06-25 14:00   ` Anatoly Burakov
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.

We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll go the easy way and just
mandate that the API's are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.
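
The stopped-queue requirement this adds is enforced by mapping a tri-state helper result onto errno-style codes; a minimal stand-alone sketch of that mapping (the function name here is hypothetical, modelled on the patch's use of `queue_stopped()`):

```c
#include <errno.h>

/* stopped: 1 = queue stopped, 0 = still running, <0 = invalid queue.
 * Only a stopped queue may have its power management reconfigured. */
static int
check_queue_state(int stopped)
{
	if (stopped != 1)
		return stopped < 0 ? -EINVAL : -EBUSY;
	return 0; /* safe to add/remove callbacks */
}
```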

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Add check for stopped queue
    - Clarified doc message
    - Added release notes

 doc/guides/rel_notes/release_21_08.rst |   5 +
 lib/power/meson.build                  |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 133 ++++++++++---------------
 lib/power/rte_power_pmd_mgmt.h         |   6 ++
 4 files changed, 67 insertions(+), 80 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
 
 * eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
+* rte_power: The experimental PMD power management API is no longer considered
+  to be thread safe; all Rx queues affected by the API will now need to be
+  stopped before making any changes to the power management scheme.
+
+
 ABI Changes
 -----------
 
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
         'rte_power_pmd_mgmt.h',
         'rte_power_guest_channel.h',
 )
+if cc.has_argument('-Wno-cast-qual')
+    cflags += '-Wno-cast-qual'
+endif
 deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
 	/**< Callback mode for this queue */
 	const struct rte_eth_rxtx_callback *cur_cb;
 	/**< Callback instance */
-	volatile bool umwait_in_progress;
-	/**< are we currently sleeping? */
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 			struct rte_power_monitor_cond pmc;
 			uint16_t ret;
 
-			/*
-			 * we might get a cancellation request while being
-			 * inside the callback, in which case the wakeup
-			 * wouldn't work because it would've arrived too early.
-			 *
-			 * to get around this, we notify the other thread that
-			 * we're sleeping, so that it can spin until we're done.
-			 * unsolicited wakeups are perfectly safe.
-			 */
-			q_conf->umwait_in_progress = true;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-			/* check if we need to cancel sleep */
-			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
-				/* use monitoring condition to sleep */
-				ret = rte_eth_get_monitor_addr(port_id, qidx,
-						&pmc);
-				if (ret == 0)
-					rte_power_monitor(&pmc, UINT64_MAX);
-			}
-			q_conf->umwait_in_progress = false;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+			/* use monitoring condition to sleep */
+			ret = rte_eth_get_monitor_addr(port_id, qidx,
+					&pmc);
+			if (ret == 0)
+				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
 		q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 	return nb_rx;
 }
 
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+	struct rte_eth_rxq_info qinfo;
+
+	if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+		return -1;
+
+	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
 	struct pmd_queue_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
+	rte_rx_callback_fn clb;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
 	queue_cfg = &port_cfg[port_id][queue_id];
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->umwait_in_progress = false;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* ensure we update our state before callback starts */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_umwait, NULL);
+		clb = clb_umwait;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
-				queue_id, clb_scale_freq, NULL);
+		clb = clb_scale_freq;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (global_data.tsc_per_us == 0)
 			calc_tsc();
 
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_pause, NULL);
+		clb = clb_pause;
 		break;
+	default:
+		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+		ret = -EINVAL;
+		goto end;
 	}
+
+	/* initialize data before enabling the callback */
+	queue_cfg->empty_poll_stats = 0;
+	queue_cfg->cb_mode = mode;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, NULL);
+
 	ret = 0;
 end:
 	return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
 	struct pmd_queue_cfg *queue_cfg;
+	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
 
 	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
 		return -EINVAL;
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
+
 	/* no need to check queue id as wrong queue id would not be enabled */
 	queue_cfg = &port_cfg[port_id][queue_id];
 
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	/* stop any callbacks from progressing */
 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
-	/* ensure we update our state before continuing */
-	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
 	switch (queue_cfg->cb_mode) {
-	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		bool exit = false;
-		do {
-			/*
-			 * we may request cancellation while the other thread
-			 * has just entered the callback but hasn't started
-			 * sleeping yet, so keep waking it up until we know it's
-			 * done sleeping.
-			 */
-			if (queue_cfg->umwait_in_progress)
-				rte_power_monitor_wakeup(lcore_id);
-			else
-				exit = true;
-		} while (!exit);
-	}
-	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		rte_eth_remove_rx_callback(port_id, queue_id,
 				queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		break;
 	}
 	/*
-	 * we don't free the RX callback here because it is unsafe to do so
-	 * unless we know for a fact that all data plane threads have stopped.
+	 * the API doc mandates that the user stops all processing on affected
+	 * ports before calling any of these API's, so we can assume that the
+	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	queue_cfg->cur_cb = NULL;
+	rte_free((void *)queue_cfg->cur_cb);
 
 	return 0;
 }
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue will be polled from.
  * @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue is polled from.
  * @param port_id
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v2 5/7] power: support callbacks for multiple Rx queues
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
                     ` (3 preceding siblings ...)
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-25 14:00   ` Anatoly Burakov
  2021-06-28  7:10     ` David Marchand
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 6/7] power: support monitoring " Anatoly Burakov
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev, David Hunt, Ray Kinsella, Neil Horman; +Cc: ciara.loftus

Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.

The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:

- Replace per-queue structures with per-lcore ones, so that any device
  polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
  added to the list of cores to poll, so that the callback is aware of
  other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
  shared between all queues polled on a particular lcore, and are only
  activated when a special designated "power saving" queue is polled. To
  put it another way, we have no idea which queue the user will poll in
  what order, so we rely on them telling us that queue X is the last one
  in the polling loop, so any power management should happen there.
- A new API is added to mark a specific Rx queue as "power saving".
  Failing to call this API will result in no power management, however
  when having only one queue per core it is obvious which queue is the
  "power saving" one, so things will still work without this new API for
  use cases that were previously working without it.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
  is incapable of monitoring more than one address.
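
The "power save queue" decision described above reduces to a small per-lcore check; a self-contained model of it (types are illustrative stand-ins for the patch's `union queue` and `struct pmd_core_cfg`):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* port/queue pair packed into one comparable 32-bit value */
union queue_id {
	uint32_t val;
	struct {
		uint16_t portid;
		uint16_t qid;
	};
};

struct core_cfg {
	union queue_id power_save_queue;
	bool power_save_queue_set;
	size_t n_queues;
};

/* with one queue the power-save queue is implicit; with several, it
 * must be set explicitly, otherwise no queue ever triggers sleep */
static bool
queue_is_power_save(const struct core_cfg *cfg, union queue_id q)
{
	if (cfg->n_queues == 1)
		return true;
	return cfg->power_save_queue_set &&
			cfg->power_save_queue.val == q.val;
}
```

This is why the commit message warns that forgetting to designate a power-save queue in the multi-queue case results in no power saving at all.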

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Use a TAILQ for queues instead of a static array
    - Address feedback from Konstantin
    - Add additional checks for stopped queues

 doc/guides/prog_guide/power_man.rst    |  80 ++++--
 doc/guides/rel_notes/release_21_08.rst |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 381 ++++++++++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h         |  34 +++
 lib/power/version.map                  |   3 +
 5 files changed, 407 insertions(+), 94 deletions(-)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..38f876466a 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,48 @@ Ethernet PMD Power Management API
 Abstract
 ~~~~~~~~
 
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
-   This power saving scheme will put the CPU into optimized power state
-   and use the ``rte_power_monitor()`` function
-   to monitor the Ethernet PMD RX descriptor address,
-   and wake the CPU up whenever there's new traffic.
-
-Pause
-   This power saving scheme will avoid busy polling
-   by either entering power-optimized sleep state
-   with ``rte_power_pause()`` function,
-   or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
-   This power saving scheme will use ``librte_power`` library
-   functionality to scale the core frequency up/down
-   depending on traffic volume.
-
-.. note::
-
-   Currently, this power management API is limited to mandatory mapping
-   of 1 queue to 1 core (multiple queues are supported,
-   but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+   This power saving scheme will put the CPU into optimized power state and
+   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+   there's new traffic. Support for this scheme may not be available on all
+   platforms, and further limitations may apply (see below).
+
+* Pause
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+   This power saving scheme will use ``librte_power`` library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+  limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
+  monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+  ``rte_power_monitor()`` function is not supported, then monitor mode will not
+  be supported.
+
+* Not all Ethernet devices support monitoring, even if the underlying
+  platform may support the necessary CPU instructions. Support for monitoring is
+  currently implemented in the following DPDK drivers:
+
+  * net/ixgbe
+  * net/i40e
+  * net/ice
+  * net/iavf
+  * net/mlx5
+  * net/af_xdp
+
 
 API Overview for Ethernet PMD Power Management
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -234,6 +248,16 @@ API Overview for Ethernet PMD Power Management
 
 * **Queue Disable**: Disable power scheme for certain queue/port/core.
 
+* **Set Power Save Queue**: In case of polling multiple queues from one lcore,
+  designate a specific queue to be the one that triggers power management routines.
+
+.. note::
+
+   When using PMD power management with multiple Ethernet Rx queues on one lcore,
+   it is required to designate one of the configured Rx queues as a "power save"
+   queue by calling the appropriate API. Failing to do so will result in no
+   power saving ever taking effect.
+
 References
 ----------
 
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@ New Features
 
 * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
 
+* rte_power: The experimental PMD power management API now supports managing
+  multiple Ethernet Rx queues per lcore.
+
 
 Removed Items
 -------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..7762cd39b8 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,7 +33,28 @@ enum pmd_mgmt_state {
 	PMD_MGMT_ENABLED
 };
 
-struct pmd_queue_cfg {
+union queue {
+	uint32_t val;
+	struct {
+		uint16_t portid;
+		uint16_t qid;
+	};
+};
+
+struct queue_list_entry {
+	TAILQ_ENTRY(queue_list_entry) next;
+	union queue queue;
+};
+
+struct pmd_core_cfg {
+	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+	/**< Which port-queue pairs are associated with this lcore? */
+	union queue power_save_queue;
+	/**< When polling multiple queues, all but this one will be ignored */
+	bool power_save_queue_set;
+	/**< When polling multiple queues, power save queue must be set */
+	size_t n_queues;
+	/**< How many queues are in the list? */
 	volatile enum pmd_mgmt_state pwr_mgmt_state;
 	/**< State of power management for this queue */
 	enum rte_power_pmd_mgmt_type cb_mode;
@@ -43,8 +64,96 @@ struct pmd_queue_cfg {
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
 
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+	return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+	dst->val = src->val;
+}
+
+static inline bool
+queue_is_power_save(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	const union queue *pwrsave = &cfg->power_save_queue;
+
+	/* if there's only single queue, no need to check anything */
+	if (cfg->n_queues == 1)
+		return true;
+	return cfg->power_save_queue_set && queue_equal(q, pwrsave);
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *cur;
+
+	TAILQ_FOREACH(cur, &cfg->head, next) {
+		if (queue_equal(&cur->queue, q))
+			return cur;
+	}
+	return NULL;
+}
+
+static int
+queue_set_power_save(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	const struct queue_list_entry *found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return -ENOENT;
+	queue_copy(&cfg->power_save_queue, q);
+	cfg->power_save_queue_set = true;
+	return 0;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *qle;
+
+	/* is it already in the list? */
+	if (queue_list_find(cfg, q) != NULL)
+		return -EEXIST;
+
+	qle = malloc(sizeof(*qle));
+	if (qle == NULL)
+		return -ENOMEM;
+
+	queue_copy(&qle->queue, q);
+	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+	cfg->n_queues++;
+
+	return 0;
+}
+
+static int
+queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *found;
+
+	found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return -ENOENT;
+
+	TAILQ_REMOVE(&cfg->head, found, next);
+	cfg->n_queues--;
+	free(found);
+
+	/* if this was a power save queue, unset it */
+	if (cfg->power_save_queue_set && queue_is_power_save(cfg, q)) {
+		union queue *pwrsave = &cfg->power_save_queue;
+		cfg->power_save_queue_set = false;
+		pwrsave->val = 0;
+	}
+
+	return 0;
+}
 
 static void
 calc_tsc(void)
@@ -79,10 +188,10 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
+	const unsigned int lcore = rte_lcore_id();
+	struct pmd_core_cfg *q_conf;
 
-	struct pmd_queue_cfg *q_conf;
-
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
 	if (unlikely(nb_rx == 0)) {
 		q_conf->empty_poll_stats++;
@@ -107,11 +216,26 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		/* sleep for 1 microsecond */
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
@@ -127,8 +251,7 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 					rte_pause();
 			}
 		}
-	} else
-		q_conf->empty_poll_stats = 0;
+	}
 
 	return nb_rx;
 }
@@ -138,19 +261,33 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+
+		/* scale up freq immediately */
+		rte_power_freq_max(rte_lcore_id());
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
 			/* scale down freq */
 			rte_power_freq_min(rte_lcore_id());
-	} else {
-		q_conf->empty_poll_stats = 0;
-		/* scale up freq */
-		rte_power_freq_max(rte_lcore_id());
 	}
 
 	return nb_rx;
@@ -167,11 +304,79 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
 	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
 }
 
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+	const struct queue_list_entry *entry;
+
+	TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+		const union queue *q = &entry->queue;
+		int ret = queue_stopped(q->portid, q->qid);
+		if (ret != 1)
+			return ret;
+	}
+	return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+	enum power_management_env env;
+
+	/* only PSTATE and ACPI modes are supported */
+	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+		return -ENOTSUP;
+	}
+	/* ensure we could initialize the power library */
+	if (rte_power_init(lcore))
+		return -EINVAL;
+
+	/* ensure we initialized the correct env */
+	env = rte_power_get_env();
+	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+	struct rte_power_monitor_cond dummy;
+
+	/* check if rte_power_monitor is supported */
+	if (!global_data.intrinsics_support.power_monitor) {
+		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+		return -ENOTSUP;
+	}
+
+	if (cfg->n_queues > 0) {
+		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+		return -ENOTSUP;
+	}
+
+	/* check if the device supports the necessary PMD API */
+	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+			&dummy) == -ENOTSUP) {
+		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
 	rte_rx_callback_fn clb;
 	int ret;
@@ -202,9 +407,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(queue_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
+	/* if callback was already enabled, check current callback type */
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+			queue_cfg->cb_mode != mode) {
 		ret = -EINVAL;
 		goto end;
 	}
@@ -214,53 +429,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 
 	switch (mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		struct rte_power_monitor_cond dummy;
-
-		/* check if rte_power_monitor is supported */
-		if (!global_data.intrinsics_support.power_monitor) {
-			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_monitor(queue_cfg, &qdata);
+		if (ret < 0)
 			goto end;
-		}
 
-		/* check if the device supports the necessary PMD API */
-		if (rte_eth_get_monitor_addr(port_id, queue_id,
-				&dummy) == -ENOTSUP) {
-			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_umwait;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
-	{
-		enum power_management_env env;
-		/* only PSTATE and ACPI modes are supported */
-		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
-				!rte_power_check_env_supported(
-					PM_ENV_PSTATE_CPUFREQ)) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_scale(lcore_id);
+		if (ret < 0)
 			goto end;
-		}
-		/* ensure we could initialize the power library */
-		if (rte_power_init(lcore_id)) {
-			ret = -EINVAL;
-			goto end;
-		}
-		/* ensure we initialized the correct env */
-		env = rte_power_get_env();
-		if (env != PM_ENV_ACPI_CPUFREQ &&
-				env != PM_ENV_PSTATE_CPUFREQ) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_scale_freq;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		/* figure out various time-to-tsc conversions */
 		if (global_data.tsc_per_us == 0)
@@ -273,11 +455,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		ret = -EINVAL;
 		goto end;
 	}
+	/* add this queue to the list */
+	ret = queue_list_add(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+				strerror(-ret));
+		goto end;
+	}
 
 	/* initialize data before enabling the callback */
-	queue_cfg->empty_poll_stats = 0;
-	queue_cfg->cb_mode = mode;
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	if (queue_cfg->n_queues == 1) {
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	}
 	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
 			clb, NULL);
 
@@ -290,7 +481,8 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,13 +498,31 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
+
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(queue_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
 		return -EINVAL;
 
-	/* stop any callbacks from progressing */
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	/*
+	 * There is no good/easy way to do this without race conditions, so we
+	 * are just going to throw our hands in the air and hope that the user
+	 * has read the documentation and has ensured that ports are stopped at
+	 * the time we enter the API functions.
+	 */
+	ret = queue_list_remove(queue_cfg, &qdata);
+	if (ret < 0)
+		return ret;
+
+	/* if we've removed all queues from the lists, set state to disabled */
+	if (queue_cfg->n_queues == 0)
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
 	switch (queue_cfg->cb_mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
@@ -336,3 +546,42 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 
 	return 0;
 }
+
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &lcore_cfg[lcore_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	ret = queue_set_power_save(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to set power save queue: %s\n",
+			strerror(-ret));
+		return ret;
+	}
+
+	return 0;
+}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+	size_t i;
+
+	/* initialize all tailqs */
+	for (i = 0; i < RTE_DIM(lcore_cfg); i++) {
+		struct pmd_core_cfg *cfg = &lcore_cfg[i];
+		TAILQ_INIT(&cfg->head);
+	}
+}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 444e7b8a66..d6ef8f778a 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -90,6 +90,40 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Set a specific Ethernet device Rx queue to be the "power save" queue for a
+ * particular lcore. When multiple queues are assigned to a single lcore using
+ * the `rte_power_ethdev_pmgmt_queue_enable` API, only one of them will trigger
+ * power management. In a typical scenario, the last queue to be polled on a
+ * particular lcore should be designated as the power save queue.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @note When using multiple queues per lcore, calling this function is
+ *   mandatory. If it is not called, no power management routines will be
+ *   triggered when traffic stops arriving on the polled queues.
+ *
+ * @warning This function must be called when all affected Ethernet ports are
+ *   stopped and no Rx/Tx is in progress!
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/power/version.map b/lib/power/version.map
index b004e3e4a9..105d1d94c2 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -38,4 +38,7 @@ EXPERIMENTAL {
 	# added in 21.02
 	rte_power_ethdev_pmgmt_queue_disable;
 	rte_power_ethdev_pmgmt_queue_enable;
+
+	# added in 21.08
+	rte_power_ethdev_pmgmt_queue_set_power_save;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v2 6/7] power: support monitoring multiple Rx queues
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
                     ` (4 preceding siblings ...)
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-25 14:00   ` Anatoly Burakov
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 doc/guides/prog_guide/power_man.rst |  9 ++--
 lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 38f876466a..defb61bdc4 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
 The "monitor" mode is only supported in the following configurations and scenarios:
 
 * If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor_multi()`` function is supported by the platform, then
+  monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
   ``rte_power_monitor()`` is supported by the platform, then monitoring will be
   limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
   monitored from a different lcore).
 
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
-  ``rte_power_monitor()`` function is not supported, then monitor mode will not
-  be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+  two monitoring functions are supported, then monitor mode will not be supported.
 
 * Not all Ethernet devices support monitoring, even if the underlying
   platform may support the necessary CPU instructions. Support for monitoring is
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 7762cd39b8..aab2d4f1ee 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
 	return 0;
 }
 
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+		struct rte_power_monitor_cond *pmc)
+{
+	const struct queue_list_entry *qle;
+	size_t i = 0;
+	int ret;
+
+	TAILQ_FOREACH(qle, &cfg->head, next) {
+		struct rte_power_monitor_cond *cur = &pmc[i++];
+		const union queue *q = &qle->queue;
+		ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+		if (ret < 0)
+			return ret;
+	}
+	return 0;
+}
+
 static void
 calc_tsc(void)
 {
@@ -183,6 +201,48 @@ calc_tsc(void)
 	}
 }
 
+static uint16_t
+clb_multiwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
+
+	q_conf = &lcore_cfg[lcore];
+
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
+			int ret;
+
+			/* gather all monitoring conditions */
+			ret = get_monitor_addresses(q_conf, pmc);
+
+			if (ret == 0)
+				rte_power_monitor_multi(pmc,
+					q_conf->n_queues, UINT64_MAX);
+		}
+	}
+
+	return nb_rx;
+}
+
 static uint16_t
 clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
@@ -348,14 +408,19 @@ static int
 check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 {
 	struct rte_power_monitor_cond dummy;
+	bool multimonitor_supported;
 
 	/* check if rte_power_monitor is supported */
 	if (!global_data.intrinsics_support.power_monitor) {
 		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
 		return -ENOTSUP;
 	}
+	/* check if multi-monitor is supported */
+	multimonitor_supported =
+			global_data.intrinsics_support.power_monitor_multi;
 
-	if (cfg->n_queues > 0) {
+	/* if we're adding a new queue, do we support multiple queues? */
+	if (cfg->n_queues > 0 && !multimonitor_supported) {
 		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
 		return -ENOTSUP;
 	}
@@ -371,6 +436,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 	return 0;
 }
 
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+	return global_data.intrinsics_support.power_monitor_multi ?
+		clb_multiwait : clb_umwait;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -434,7 +506,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (ret < 0)
 			goto end;
 
-		clb = clb_umwait;
+		clb = get_monitor_callback();
 		break;
 	case RTE_POWER_MGMT_TYPE_SCALE:
 		/* check if we can add a new queue */
-- 
2.25.1



* [dpdk-dev] [PATCH v2 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
                     ` (5 preceding siblings ...)
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-25 14:00   ` Anatoly Burakov
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-25 14:00 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation, and always
mark the last queue in qconf as the power save queue.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 examples/l3fwd-power/main.c | 39 +++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..3057c06936 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2498,6 +2498,27 @@ mode_to_str(enum appmode mode)
 	}
 }
 
+static void
+pmd_pmgmt_set_up(unsigned int lcore, uint16_t portid, uint16_t qid, bool last)
+{
+	int ret;
+
+	ret = rte_power_ethdev_pmgmt_queue_enable(lcore, portid,
+			qid, pmgmt_type);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+			ret, portid);
+
+	if (!last)
+		return;
+	ret = rte_power_ethdev_pmgmt_queue_set_power_save(lcore, portid, qid);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_set_power_save: err=%d, port=%d\n",
+			ret, portid);
+}
+
 int
 main(int argc, char **argv)
 {
@@ -2723,12 +2744,6 @@ main(int argc, char **argv)
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
 
-		/* PMD power management mode can only do 1 queue per core */
-		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
-			rte_exit(EXIT_FAILURE,
-				"In PMD power management mode, only one queue per lcore is allowed\n");
-		}
-
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2767,15 +2782,9 @@ main(int argc, char **argv)
 						 "Fail to add ptype cb\n");
 			}
 
-			if (app_mode == APP_MODE_PMD_MGMT) {
-				ret = rte_power_ethdev_pmgmt_queue_enable(
-						lcore_id, portid, queueid,
-						pmgmt_type);
-				if (ret < 0)
-					rte_exit(EXIT_FAILURE,
-						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
-							ret, portid);
-			}
+			if (app_mode == APP_MODE_PMD_MGMT)
+				pmd_pmgmt_set_up(lcore_id, portid, queueid,
+					queue == (qconf->n_rx_queue - 1));
 		}
 	}
 
-- 
2.25.1



* Re: [dpdk-dev] [PATCH v1 4/7] power: remove thread safety from PMD power API's
  2021-06-25 11:52         ` Burakov, Anatoly
@ 2021-06-25 14:42           ` Ananyev, Konstantin
  0 siblings, 0 replies; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-25 14:42 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara



> >>
> >> On 22-Jun-21 10:13 AM, Ananyev, Konstantin wrote:
> >>>
> >>>> Currently, we expect that only one callback can be active at any given
> >>>> moment, for a particular queue configuration, which is relatively easy
> >>>> to implement in a thread-safe way. However, we're about to add support
> >>>> for multiple queues per lcore, which will greatly increase the
> >>>> possibility of various race conditions.
> >>>>
> >>>> We could have used something like an RCU for this use case, but absent
> >>>> of a pressing need for thread safety we'll go the easy way and just
> >>>> mandate that the API's are to be called when all affected ports are
> >>>> stopped, and document this limitation. This greatly simplifies the
> >>>> `rte_power_monitor`-related code.
> >>>
> >>> I think you need to update RN too with that.
> >>
> >> Yep, will fix.
> >>
> >>> Another thing - do you really need the whole port stopped?
> >>>   From what I understand - you work on queues, so it is enough for you
> >>> that related RX queue is stopped.
> >>> So, to make things a bit more robust, in pmgmt_queue_enable/disable
> >>> you can call rte_eth_rx_queue_info_get() and check queue state.
> >>
> >> We work on queues, but the data is per-lcore not per-queue, and it is
> >> potentially used by multiple queues, so checking one specific queue is
> >> not going to be enough. We could check all queues that were registered
> >> so far with the power library, maybe that'll work better?
> >
> > Yep, that's what I mean: on queue_enable(), check whether that queue is stopped or not.
> > If not, return -EBUSY/-EAGAIN or so.
> > Sorry if I wasn't clear at first time.
> 
> I think it's still better that all queues are stopped, rather than
> trying to work around the inherently racy implementation. So while i'll
> add the queue stopped checks, i'll still remove all of the thread safety
> stuff from here.

That's fine by me, all I asked for here - an extra check to make sure the queue is really stopped.




* Re: [dpdk-dev] [PATCH v2 5/7] power: support callbacks for multiple Rx queues
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-28  7:10     ` David Marchand
  2021-06-28  9:25       ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: David Marchand @ 2021-06-28  7:10 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dev, David Hunt, Ray Kinsella, Neil Horman, Ciara Loftus,
	Thomas Monjalon, Andrew Rybchenko, Yigit, Ferruh

On Fri, Jun 25, 2021 at 4:01 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index c70ae128ac..38f876466a 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst

[snip]

> +* Not all Ethernet devices support monitoring, even if the underlying
> +  platform may support the necessary CPU instructions. Support for monitoring is
> +  currently implemented in the following DPDK drivers:
> +
> +  * net/ixgbe
> +  * net/i40e
> +  * net/ice
> +  * net/iavf
> +  * net/mlx5
> +  * net/af_xdp

This list will get obsolete.

It looks like a driver capability, so can we have a ethdev feature added?
Then mark drivers that supports this feature.

And the power lib documentation will have a reference to
doc/guides/nics/features.rst.


-- 
David Marchand



* Re: [dpdk-dev] [PATCH v2 5/7] power: support callbacks for multiple Rx queues
  2021-06-28  7:10     ` David Marchand
@ 2021-06-28  9:25       ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-28  9:25 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, David Hunt, Ray Kinsella, Neil Horman, Ciara Loftus,
	Thomas Monjalon, Andrew Rybchenko, Yigit, Ferruh

On 28-Jun-21 8:10 AM, David Marchand wrote:
> On Fri, Jun 25, 2021 at 4:01 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index c70ae128ac..38f876466a 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
> 
> [snip]
> 
>> +* Not all Ethernet devices support monitoring, even if the underlying
>> +  platform may support the necessary CPU instructions. Support for monitoring is
>> +  currently implemented in the following DPDK drivers:
>> +
>> +  * net/ixgbe
>> +  * net/i40e
>> +  * net/ice
>> +  * net/iavf
>> +  * net/mlx5
>> +  * net/af_xdp
> 
> This list will get obsolete.
> 
> It looks like a driver capability, so can we have a ethdev feature added?
> Then mark drivers that supports this feature.
> 
> And the power lib documentation will have a reference to
> doc/guides/nics/features.rst.
> 
> 

Good idea, thanks for the suggestion! Will fix in v3.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v2 1/7] power_intrinsics: use callbacks for comparison
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-06-28 12:19     ` Ananyev, Konstantin
  0 siblings, 0 replies; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-28 12:19 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, McDaniel, Timothy, Xing, Beilei, Wu,
	Jingjing, Yang, Qiming, Zhang, Qi Z, Wang, Haiyue, Matan Azrad,
	Shahaf Shuler, Viacheslav Ovsiienko, Richardson, Bruce
  Cc: Hunt, David, Loftus, Ciara


 
> Previously, the semantics of power monitor were such that we were
> checking current value against the expected value, and if they matched,
> then the sleep was aborted. This is somewhat inflexible, because it only
> allowed us to check for a specific value.
> 
> This commit replaces the comparison with a user callback mechanism, so
> that any PMD (or other code) using `rte_power_monitor()` can define
> their own comparison semantics and decision making on how to detect the
> need to abort the entering of power optimized state.
> 
> Existing implementations are adjusted to follow the new semantics.
> 
> Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v2:
>     - Use callback mechanism for more flexibility
>     - Address feedback from Konstantin
> 
>  doc/guides/rel_notes/release_21_08.rst        |  1 +
>  drivers/event/dlb2/dlb2.c                     | 16 ++++++++--
>  drivers/net/i40e/i40e_rxtx.c                  | 19 ++++++++----
>  drivers/net/iavf/iavf_rxtx.c                  | 19 ++++++++----
>  drivers/net/ice/ice_rxtx.c                    | 19 ++++++++----
>  drivers/net/ixgbe/ixgbe_rxtx.c                | 19 ++++++++----
>  drivers/net/mlx5/mlx5_rx.c                    | 16 ++++++++--
>  .../include/generic/rte_power_intrinsics.h    | 29 ++++++++++++++-----
>  lib/eal/x86/rte_power_intrinsics.c            |  9 ++----
>  9 files changed, 106 insertions(+), 41 deletions(-)
> 
> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> index dddca3d41c..046667ade6 100644
> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> @@ -18,19 +18,34 @@
>   * which are architecture-dependent.
>   */
> 
> +/**
> + * Callback definition for monitoring conditions. Callbacks with this signature
> + * will be used by `rte_power_monitor()` to check if the entering of power
> + * optimized state should be aborted.
> + *
> + * @param val
> + *   The value read from memory.
> + * @param opaque
> + *   Callback-specific data.
> + *
> + * @return
> + *   0 if entering of power optimized state should proceed
> + *   -1 if entering of power optimized state should be aborted
> + */
> +typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
> +		const uint64_t opaque[4]);
>  struct rte_power_monitor_cond {
>  	volatile void *addr;  /**< Address to monitor for changes */
> -	uint64_t val;         /**< If the `mask` is non-zero, location pointed
> -	                       *   to by `addr` will be read and compared
> -	                       *   against this value.
> -	                       */
> -	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
> -	uint8_t size;    /**< Data size (in bytes) that will be used to compare
> -	                  *   expected value (`val`) with data read from the
> +	uint8_t size;    /**< Data size (in bytes) that will be read from the
>  	                  *   monitored memory location (`addr`). Can be 1, 2,
>  	                  *   4, or 8. Supplying any other value will result in
>  	                  *   an error.
>  	                  */
> +	rte_power_monitor_clb_t fn; /**< Callback to be used to check if
> +	                             *   entering power optimized state should
> +	                             *   be aborted.
> +	                             */
> +	uint64_t opaque[4]; /**< Callback-specific data */


As a nit - would be good to add some new macro for '4'.
Apart from that - LGTM.
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

>  };
> 
>  /**
> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> index 39ea9fdecd..3c5c9ce7ad 100644
> --- a/lib/eal/x86/rte_power_intrinsics.c
> +++ b/lib/eal/x86/rte_power_intrinsics.c
> @@ -110,14 +110,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>  	/* now that we've put this address into monitor, we can unlock */
>  	rte_spinlock_unlock(&s->lock);
> 
> -	/* if we have a comparison mask, we might not need to sleep at all */
> -	if (pmc->mask) {
> +	/* if we have a callback, we might not need to sleep at all */
> +	if (pmc->fn) {
>  		const uint64_t cur_value = __get_umwait_val(
>  				pmc->addr, pmc->size);
> -		const uint64_t masked = cur_value & pmc->mask;
> -
> -		/* if the masked value is already matching, abort */
> -		if (masked == pmc->val)
> +		if (pmc->fn(cur_value, pmc->opaque) != 0)
>  			goto end;
>  	}
> 
> --
> 2.25.1



* Re: [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-06-28 12:37     ` Ananyev, Konstantin
  2021-06-28 12:43       ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-28 12:37 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce
  Cc: Hunt, David, Loftus, Ciara


> Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
> what UMWAIT does, but without the limitation of having to listen for
> just one event. This works because the optimized power state used by the
> TPAUSE instruction will cause a wake up on RTM transaction abort, so if
> we add the addresses we're interested in to the read-set, any write to
> those addresses will wake us up.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v2:
>     - Adapt to callback mechanism
> 
>  doc/guides/rel_notes/release_21_08.rst        |  2 +
>  lib/eal/arm/rte_power_intrinsics.c            | 11 +++
>  lib/eal/include/generic/rte_cpuflags.h        |  2 +
>  .../include/generic/rte_power_intrinsics.h    | 35 ++++++++++
>  lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
>  lib/eal/version.map                           |  3 +
>  lib/eal/x86/rte_cpuflags.c                    |  2 +
>  lib/eal/x86/rte_power_intrinsics.c            | 69 +++++++++++++++++++
>  8 files changed, 135 insertions(+)
> 
...

> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> index 3c5c9ce7ad..3fc6f62ef5 100644
> --- a/lib/eal/x86/rte_power_intrinsics.c
> +++ b/lib/eal/x86/rte_power_intrinsics.c
> @@ -4,6 +4,7 @@
> 
>  #include <rte_common.h>
>  #include <rte_lcore.h>
> +#include <rte_rtm.h>
>  #include <rte_spinlock.h>
> 
>  #include "rte_power_intrinsics.h"
> @@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
>  }
> 
>  static bool wait_supported;
> +static bool wait_multi_supported;
> 
>  static inline uint64_t
>  __get_umwait_val(const volatile void *p, const uint8_t sz)
> @@ -164,6 +166,8 @@ RTE_INIT(rte_power_intrinsics_init) {
> 
>  	if (i.power_monitor && i.power_pause)
>  		wait_supported = 1;
> +	if (i.power_monitor_multi)
> +		wait_multi_supported = 1;
>  }
> 
>  int
> @@ -202,6 +206,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
>  	 * In this case, since we've already woken up, the "wakeup" was
>  	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
>  	 * wakeup address is still valid so it's perfectly safe to write it.
> +	 *
> +	 * For multi-monitor case, the act of locking will in itself trigger the
> +	 * wakeup, so no additional writes necessary.
>  	 */
>  	rte_spinlock_lock(&s->lock);
>  	if (s->monitor_addr != NULL)
> @@ -210,3 +217,65 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
> 
>  	return 0;
>  }
> +
> +int
> +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
> +		const uint32_t num, const uint64_t tsc_timestamp)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	struct power_wait_status *s = &wait_status[lcore_id];
> +	uint32_t i, rc;
> +
> +	/* check if supported */
> +	if (!wait_multi_supported)
> +		return -ENOTSUP;
> +
> +	if (pmc == NULL || num == 0)
> +		return -EINVAL;
> +
> +	/* we are already inside transaction region, return */
> +	if (rte_xtest() != 0)
> +		return 0;
> +
> +	/* start new transaction region */
> +	rc = rte_xbegin();
> +
> +	/* transaction abort, possible write to one of wait addresses */
> +	if (rc != RTE_XBEGIN_STARTED)
> +		return 0;
> +
> +	/*
> +	 * the mere act of reading the lock status here adds the lock to
> +	 * the read set. This means that when we trigger a wakeup from another
> +	 * thread, even if we don't have a defined wakeup address and thus don't
> +	 * actually cause any writes, the act of locking our lock will itself
> +	 * trigger the wakeup and abort the transaction.
> +	 */
> +	rte_spinlock_is_locked(&s->lock);
> +
> +	/*
> +	 * add all addresses to wait on into transaction read-set and check if
> +	 * any of wakeup conditions are already met.
> +	 */
> +	for (i = 0; i < num; i++) {
> +		const struct rte_power_monitor_cond *c = &pmc[i];
> +
> +		if (pmc->fn == NULL)

Should be c->fn, I believe.

> +			continue;

Actually, that way, if c->fn == NULL, we'll never add our c->addr to the monitored addresses.
Is that what we really want?
My thought was that if the callback is not set, we'll just go to the power-save state without extra checking, no?
Something like that:

const struct rte_power_monitor_cond *c = &pmc[i];
const uint64_t val = __get_umwait_val(c->addr, c->size);

if (c->fn && c->fn(val, c->opaque) != 0)
   break;

Same thought for rte_power_monitor().

> +		const uint64_t val = __get_umwait_val(pmc->addr, pmc->size);

Same thing: s/pmc->/c->/

> +
> +		/* abort if callback indicates that we need to stop */
> +		if (c->fn(val, c->opaque) != 0)
> +			break;
> +	}
> +
> +	/* none of the conditions were met, sleep until timeout */
> +	if (i == num)
> +		rte_power_pause(tsc_timestamp);
> +
> +	/* end transaction region */
> +	rte_xend();
> +
> +	return 0;
> +}
> --
> 2.25.1



* [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management
  2021-06-25 14:00 ` [dpdk-dev] [PATCH v2 0/7] Enhancements for PMD power management Anatoly Burakov
                     ` (6 preceding siblings ...)
  2021-06-25 14:00   ` [dpdk-dev] [PATCH v2 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-06-28 12:41   ` Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
                       ` (7 more replies)
  7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, ciara.loftus

This patchset introduces several changes related to PMD power management:

- Changed monitoring intrinsics to use callbacks as a comparison function, based
  on previous patchset [1] but incorporating feedback [2] - this hopefully will
  make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
  set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
  accompanying infrastructure and example apps changes

v3:
- Moved some doc updates to NIC features list

v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary

[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274

Anatoly Burakov (7):
  power_intrinsics: use callbacks for comparison
  net/af_xdp: add power monitor support
  eal: add power monitor for multiple events
  power: remove thread safety from PMD power API's
  power: support callbacks for multiple Rx queues
  power: support monitoring multiple Rx queues
  l3fwd-power: support multiqueue in PMD pmgmt modes

 doc/guides/nics/features.rst                  |  10 +
 doc/guides/prog_guide/power_man.rst           |  78 ++-
 doc/guides/rel_notes/release_21_08.rst        |  11 +
 drivers/event/dlb2/dlb2.c                     |  16 +-
 drivers/net/af_xdp/rte_eth_af_xdp.c           |  33 +
 drivers/net/i40e/i40e_rxtx.c                  |  19 +-
 drivers/net/iavf/iavf_rxtx.c                  |  19 +-
 drivers/net/ice/ice_rxtx.c                    |  19 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |  19 +-
 drivers/net/mlx5/mlx5_rx.c                    |  16 +-
 examples/l3fwd-power/main.c                   |  39 +-
 lib/eal/arm/rte_power_intrinsics.c            |  11 +
 lib/eal/include/generic/rte_cpuflags.h        |   2 +
 .../include/generic/rte_power_intrinsics.h    |  64 +-
 lib/eal/ppc/rte_power_intrinsics.c            |  11 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_cpuflags.c                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  78 ++-
 lib/power/meson.build                         |   3 +
 lib/power/rte_power_pmd_mgmt.c                | 574 +++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h                |  40 ++
 lib/power/version.map                         |   3 +
 22 files changed, 846 insertions(+), 224 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH v3 1/7] power_intrinsics: use callbacks for comparison
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-06-28 12:41     ` Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 2/7] net/af_xdp: add power monitor support Anatoly Burakov
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
	Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
	Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Previously, the semantics of power monitor were such that we checked the
current value against an expected value, and if they matched, the sleep was
aborted. This is somewhat inflexible, because it only allowed us to check for
a specific value.

This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.

Existing implementations are adjusted to follow the new semantics.

Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Use callback mechanism for more flexibility
    - Address feedback from Konstantin

 doc/guides/rel_notes/release_21_08.rst        |  1 +
 drivers/event/dlb2/dlb2.c                     | 16 ++++++++--
 drivers/net/i40e/i40e_rxtx.c                  | 19 ++++++++----
 drivers/net/iavf/iavf_rxtx.c                  | 19 ++++++++----
 drivers/net/ice/ice_rxtx.c                    | 19 ++++++++----
 drivers/net/ixgbe/ixgbe_rxtx.c                | 19 ++++++++----
 drivers/net/mlx5/mlx5_rx.c                    | 16 ++++++++--
 .../include/generic/rte_power_intrinsics.h    | 29 ++++++++++++++-----
 lib/eal/x86/rte_power_intrinsics.c            |  9 ++----
 9 files changed, 106 insertions(+), 41 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
 ABI Changes
 -----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..14dfac257c 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,15 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
 	}
 }
 
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val, const uint64_t opaque[4])
+{
+	/* abort if the value matches */
+	return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
 static inline int
 dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		  struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3203,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 			expected_value = 0;
 
 		pmc.addr = monitor_addr;
-		pmc.val = expected_value;
-		pmc.mask = qe_mask.raw_qe[1];
+		/* store expected value and comparison mask in opaque data */
+		pmc.opaque[CLB_VAL_IDX] = expected_value;
+		pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+		/* set up callback */
+		pmc.fn = dlb2_monitor_callback;
 		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..45f3fbf4ec 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,17 @@
 #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
 
+static int
+i40e_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -93,12 +104,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = i40e_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..6e12ecce07 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,17 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
 				rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
 }
 
+static int
+iavf_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -69,12 +80,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = iavf_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..278eb4b9a1 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,17 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+static int
+ice_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -39,12 +50,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.status_error0;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
-	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/* comparison callback */
+	pmc->fn = ice_monitor_callback;
 
 	/* register is 16-bit */
 	pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..0c5045d9dc 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,17 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+static int
+ixgbe_monitor_callback(const uint64_t value, const uint64_t arg[4] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -1381,12 +1392,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.upper.status_error;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
-	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/* comparison callback */
+	pmc->fn = ixgbe_monitor_callback;
 
 	/* the registers are 32-bit */
 	pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..57f6ca1467 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,17 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 	return rx_queue_count(rxq);
 }
 
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value, const uint64_t opaque[4])
+{
+	const uint64_t m = opaque[CLB_MSK_IDX];
+	const uint64_t v = opaque[CLB_VAL_IDX];
+
+	return (value & m) == v ? -1 : 0;
+}
+
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +293,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 		return -rte_errno;
 	}
 	pmc->addr = &cqe->op_own;
-	pmc->val =  !!idx;
-	pmc->mask = MLX5_CQE_OWNER_MASK;
+	pmc->opaque[CLB_VAL_IDX] = !!idx;
+	pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+	pmc->fn = mlx_monitor_callback;
 	pmc->size = sizeof(uint8_t);
 	return 0;
 }
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..046667ade6 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,34 @@
  * which are architecture-dependent.
  */
 
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ *   The value read from memory.
+ * @param opaque
+ *   Callback-specific data.
+ *
+ * @return
+ *   0 if entering of power optimized state should proceed
+ *   -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+		const uint64_t opaque[4]);
 struct rte_power_monitor_cond {
 	volatile void *addr;  /**< Address to monitor for changes */
-	uint64_t val;         /**< If the `mask` is non-zero, location pointed
-	                       *   to by `addr` will be read and compared
-	                       *   against this value.
-	                       */
-	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
-	uint8_t size;    /**< Data size (in bytes) that will be used to compare
-	                  *   expected value (`val`) with data read from the
+	uint8_t size;    /**< Data size (in bytes) that will be read from the
 	                  *   monitored memory location (`addr`). Can be 1, 2,
 	                  *   4, or 8. Supplying any other value will result in
 	                  *   an error.
 	                  */
+	rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+	                             *   entering power optimized state should
+	                             *   be aborted.
+	                             */
+	uint64_t opaque[4]; /**< Callback-specific data */
 };
 
 /**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..3c5c9ce7ad 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -110,14 +110,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	/* now that we've put this address into monitor, we can unlock */
 	rte_spinlock_unlock(&s->lock);
 
-	/* if we have a comparison mask, we might not need to sleep at all */
-	if (pmc->mask) {
+	/* if we have a callback, we might not need to sleep at all */
+	if (pmc->fn) {
 		const uint64_t cur_value = __get_umwait_val(
 				pmc->addr, pmc->size);
-		const uint64_t masked = cur_value & pmc->mask;
-
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
+		if (pmc->fn(cur_value, pmc->opaque) != 0)
 			goto end;
 	}
 
-- 
2.25.1



* [dpdk-dev] [PATCH v3 2/7] net/af_xdp: add power monitor support
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-06-28 12:41     ` Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 3/7] eal: add power monitor for multiple events Anatoly Burakov
                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt

Implement support for .get_monitor_addr in AF_XDP driver.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Rewrite using the callback mechanism

 drivers/net/af_xdp/rte_eth_af_xdp.c | 33 +++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..8b9c89c3e8 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
 
 #include "compat.h"
 
@@ -788,6 +789,37 @@ eth_dev_configure(struct rte_eth_dev *dev)
 	return 0;
 }
 
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value, const uint64_t opaque[4])
+{
+	const uint64_t v = opaque[CLB_VAL_IDX];
+	const uint64_t m = (uint32_t)~0;
+
+	/* if the value has changed, abort entering power optimized state */
+	return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct pkt_rx_queue *rxq = rx_queue;
+	unsigned int *prod = rxq->rx.producer;
+	const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+	/* watch for changes in producer ring */
+	pmc->addr = (void*)prod;
+
+	/* store current value */
+	pmc->opaque[CLB_VAL_IDX] = cur_val;
+	pmc->fn = eth_monitor_callback;
+
+	/* AF_XDP producer ring index is 32-bit */
+	pmc->size = sizeof(uint32_t);
+
+	return 0;
+}
+
 static int
 eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
@@ -1448,6 +1480,7 @@ static const struct eth_dev_ops ops = {
 	.link_update = eth_link_update,
 	.stats_get = eth_stats_get,
 	.stats_reset = eth_stats_reset,
+	.get_monitor_addr = eth_get_monitor_addr
 };
 
 /** parse busy_budget argument */
-- 
2.25.1



* [dpdk-dev] [PATCH v3 3/7] eal: add power monitor for multiple events
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-06-28 12:41     ` Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Adapt to callback mechanism

 doc/guides/rel_notes/release_21_08.rst        |  2 +
 lib/eal/arm/rte_power_intrinsics.c            | 11 +++
 lib/eal/include/generic/rte_cpuflags.h        |  2 +
 .../include/generic/rte_power_intrinsics.h    | 35 ++++++++++
 lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
 lib/eal/version.map                           |  3 +
 lib/eal/x86/rte_cpuflags.c                    |  2 +
 lib/eal/x86/rte_power_intrinsics.c            | 69 +++++++++++++++++++
 8 files changed, 135 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
 	/**< indicates support for rte_power_monitor function */
 	uint32_t power_pause : 1;
 	/**< indicates support for rte_power_pause function */
+	uint32_t power_monitor_multi : 1;
+	/**< indicates support for rte_power_monitor_multi function */
 };
 
 /**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index 046667ade6..877fb282cb 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -124,4 +124,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 __rte_experimental
 int rte_power_pause(const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, a callback function (`fn`) may be supplied in each condition.
+ * If the callback is non-NULL, the current value read from the monitored
+ * address will be passed to it, and if the callback returns non-zero, the
+ * entering of the optimized power state will be aborted.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API
+ *   call. Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ *   An array of monitoring condition structures.
+ * @param num
+ *   Length of the `pmc` array.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp);
+
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
 	rte_version_release; # WINDOWS_NO_EXPORT
 	rte_version_suffix; # WINDOWS_NO_EXPORT
 	rte_version_year; # WINDOWS_NO_EXPORT
+
+	# added in 21.08
+	rte_power_monitor_multi; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
 		intrinsics->power_monitor = 1;
 		intrinsics->power_pause = 1;
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+			intrinsics->power_monitor_multi = 1;
 	}
 }
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 3c5c9ce7ad..3fc6f62ef5 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_rtm.h>
 #include <rte_spinlock.h>
 
 #include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
 }
 
 static bool wait_supported;
+static bool wait_multi_supported;
 
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -164,6 +166,8 @@ RTE_INIT(rte_power_intrinsics_init) {
 
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
+	if (i.power_monitor_multi)
+		wait_multi_supported = 1;
 }
 
 int
@@ -202,6 +206,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	 * In this case, since we've already woken up, the "wakeup" was
 	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
 	 * wakeup address is still valid so it's perfectly safe to write it.
+	 *
+	 * For multi-monitor case, the act of locking will in itself trigger the
+	 * wakeup, so no additional writes necessary.
 	 */
 	rte_spinlock_lock(&s->lock);
 	if (s->monitor_addr != NULL)
@@ -210,3 +217,65 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return 0;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s = &wait_status[lcore_id];
+	uint32_t i, rc;
+
+	/* check if supported */
+	if (!wait_multi_supported)
+		return -ENOTSUP;
+
+	if (pmc == NULL || num == 0)
+		return -EINVAL;
+
+	/* we are already inside transaction region, return */
+	if (rte_xtest() != 0)
+		return 0;
+
+	/* start new transaction region */
+	rc = rte_xbegin();
+
+	/* transaction abort, possible write to one of wait addresses */
+	if (rc != RTE_XBEGIN_STARTED)
+		return 0;
+
+	/*
+	 * the mere act of reading the lock status here adds the lock to
+	 * the read set. This means that when we trigger a wakeup from another
+	 * thread, even if we don't have a defined wakeup address and thus don't
+	 * actually cause any writes, the act of locking our lock will itself
+	 * trigger the wakeup and abort the transaction.
+	 */
+	rte_spinlock_is_locked(&s->lock);
+
+	/*
+	 * add all addresses to wait on into transaction read-set and check if
+	 * any of wakeup conditions are already met.
+	 */
+	for (i = 0; i < num; i++) {
+		const struct rte_power_monitor_cond *c = &pmc[i];
+
+		if (c->fn == NULL)
+			continue;
+
+		const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+		/* abort if callback indicates that we need to stop */
+		if (c->fn(val, c->opaque) != 0)
+			break;
+	}
+
+	/* none of the conditions were met, sleep until timeout */
+	if (i == num)
+		rte_power_pause(tsc_timestamp);
+
+	/* end transaction region */
+	rte_xend();
+
+	return 0;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v3 4/7] power: remove thread safety from PMD power API's
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
                       ` (2 preceding siblings ...)
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-06-28 12:41     ` Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.

We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll go the easy way and just
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Add check for stopped queue
    - Clarified doc message
    - Added release notes

 doc/guides/rel_notes/release_21_08.rst |   5 +
 lib/power/meson.build                  |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 133 ++++++++++---------------
 lib/power/rte_power_pmd_mgmt.h         |   6 ++
 4 files changed, 67 insertions(+), 80 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
 
 * eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
+* rte_power: The experimental PMD power management API is no longer considered
+  to be thread safe; all Rx queues affected by the API will now need to be
+  stopped before making any changes to the power management scheme.
+
+
 ABI Changes
 -----------
 
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
         'rte_power_pmd_mgmt.h',
         'rte_power_guest_channel.h',
 )
+if cc.has_argument('-Wno-cast-qual')
+    cflags += '-Wno-cast-qual'
+endif
 deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
 	/**< Callback mode for this queue */
 	const struct rte_eth_rxtx_callback *cur_cb;
 	/**< Callback instance */
-	volatile bool umwait_in_progress;
-	/**< are we currently sleeping? */
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 			struct rte_power_monitor_cond pmc;
 			uint16_t ret;
 
-			/*
-			 * we might get a cancellation request while being
-			 * inside the callback, in which case the wakeup
-			 * wouldn't work because it would've arrived too early.
-			 *
-			 * to get around this, we notify the other thread that
-			 * we're sleeping, so that it can spin until we're done.
-			 * unsolicited wakeups are perfectly safe.
-			 */
-			q_conf->umwait_in_progress = true;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-			/* check if we need to cancel sleep */
-			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
-				/* use monitoring condition to sleep */
-				ret = rte_eth_get_monitor_addr(port_id, qidx,
-						&pmc);
-				if (ret == 0)
-					rte_power_monitor(&pmc, UINT64_MAX);
-			}
-			q_conf->umwait_in_progress = false;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+			/* use monitoring condition to sleep */
+			ret = rte_eth_get_monitor_addr(port_id, qidx,
+					&pmc);
+			if (ret == 0)
+				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
 		q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 	return nb_rx;
 }
 
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+	struct rte_eth_rxq_info qinfo;
+
+	if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+		return -1;
+
+	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
 	struct pmd_queue_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
+	rte_rx_callback_fn clb;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
 	queue_cfg = &port_cfg[port_id][queue_id];
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->umwait_in_progress = false;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* ensure we update our state before callback starts */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_umwait, NULL);
+		clb = clb_umwait;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
-				queue_id, clb_scale_freq, NULL);
+		clb = clb_scale_freq;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (global_data.tsc_per_us == 0)
 			calc_tsc();
 
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_pause, NULL);
+		clb = clb_pause;
 		break;
+	default:
+		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+		ret = -EINVAL;
+		goto end;
 	}
+
+	/* initialize data before enabling the callback */
+	queue_cfg->empty_poll_stats = 0;
+	queue_cfg->cb_mode = mode;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, NULL);
+
 	ret = 0;
 end:
 	return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
 	struct pmd_queue_cfg *queue_cfg;
+	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
 
 	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
 		return -EINVAL;
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
+
 	/* no need to check queue id as wrong queue id would not be enabled */
 	queue_cfg = &port_cfg[port_id][queue_id];
 
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	/* stop any callbacks from progressing */
 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
-	/* ensure we update our state before continuing */
-	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
 	switch (queue_cfg->cb_mode) {
-	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		bool exit = false;
-		do {
-			/*
-			 * we may request cancellation while the other thread
-			 * has just entered the callback but hasn't started
-			 * sleeping yet, so keep waking it up until we know it's
-			 * done sleeping.
-			 */
-			if (queue_cfg->umwait_in_progress)
-				rte_power_monitor_wakeup(lcore_id);
-			else
-				exit = true;
-		} while (!exit);
-	}
-	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		rte_eth_remove_rx_callback(port_id, queue_id,
 				queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		break;
 	}
 	/*
-	 * we don't free the RX callback here because it is unsafe to do so
-	 * unless we know for a fact that all data plane threads have stopped.
+	 * the API doc mandates that the user stops all processing on affected
+	 * ports before calling any of these APIs, so we can assume that the
+	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	queue_cfg->cur_cb = NULL;
+	rte_free((void *)queue_cfg->cur_cb);
 
 	return 0;
 }
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue will be polled from.
  * @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue is polled from.
  * @param port_id
-- 
2.25.1



* [dpdk-dev] [PATCH v3 5/7] power: support callbacks for multiple Rx queues
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
                       ` (3 preceding siblings ...)
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-28 12:41     ` Anatoly Burakov
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 6/7] power: support monitoring " Anatoly Burakov
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev, David Hunt, Ray Kinsella, Neil Horman; +Cc: ciara.loftus

Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.

The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:

- Replace per-queue structures with per-lcore ones, so that any device
  polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
  added to the list of cores to poll, so that the callback is aware of
  other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
  shared between all queues polled on a particular lcore, and are only
  activated when a specially designated "power saving" queue is polled. To
  put it another way, we have no idea which queue the user will poll in
  what order, so we rely on them telling us that queue X is the last one
  in the polling loop, so any power management should happen there.
- A new API is added to mark a specific Rx queue as "power saving".
  Failing to call this API will result in no power management; however,
  when there is only one queue per core it is obvious which queue is the
  "power saving" one, so things will still work without this new API for
  use cases that were previously working without it.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
  is incapable of monitoring more than one address.

Also, while we're at it, update and improve the docs.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v3:
    - Move the list of supported NICs to NIC feature table
    
    v2:
    - Use a TAILQ for queues instead of a static array
    - Address feedback from Konstantin
    - Add additional checks for stopped queues

 doc/guides/nics/features.rst           |  10 +
 doc/guides/prog_guide/power_man.rst    |  75 +++--
 doc/guides/rel_notes/release_21_08.rst |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 381 ++++++++++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h         |  34 +++
 lib/power/version.map                  |   3 +
 6 files changed, 412 insertions(+), 94 deletions(-)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
 * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
 * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
 
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
 .. _nic_features_other:
 
 Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..fac2c19516 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,41 @@ Ethernet PMD Power Management API
 Abstract
 ~~~~~~~~
 
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
-   This power saving scheme will put the CPU into optimized power state
-   and use the ``rte_power_monitor()`` function
-   to monitor the Ethernet PMD RX descriptor address,
-   and wake the CPU up whenever there's new traffic.
-
-Pause
-   This power saving scheme will avoid busy polling
-   by either entering power-optimized sleep state
-   with ``rte_power_pause()`` function,
-   or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
-   This power saving scheme will use ``librte_power`` library
-   functionality to scale the core frequency up/down
-   depending on traffic volume.
-
-.. note::
-
-   Currently, this power management API is limited to mandatory mapping
-   of 1 queue to 1 core (multiple queues are supported,
-   but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+   This power saving scheme will put the CPU into optimized power state and
+   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+   there's new traffic. Support for this scheme may not be available on all
+   platforms, and further limitations may apply (see below).
+
+* Pause
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+   This power saving scheme will use ``librte_power`` library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+  limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have
+  to be monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+  ``rte_power_monitor()`` function is not supported, then monitor mode will not
+  be supported.
+
+* Not all Ethernet devices support monitoring, even if the underlying
+  platform may support the necessary CPU instructions. Please refer to
+  :doc:`../nics/overview` for more information.
+
 
 API Overview for Ethernet PMD Power Management
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -234,6 +241,16 @@ API Overview for Ethernet PMD Power Management
 
 * **Queue Disable**: Disable power scheme for certain queue/port/core.
 
+* **Set Power Save Queue**: In case of polling multiple queues from one lcore,
+  designate a specific queue to be the one that triggers power management routines.
+
+.. note::
+
+   When using PMD power management with multiple Ethernet Rx queues on one lcore,
+   it is required to designate one of the configured Rx queues as a "power save"
+   queue by calling the appropriate API. Failing to do so will result in no
+   power saving ever taking effect.
+
 References
 ----------
 
@@ -242,3 +259,5 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@ New Features
 
 * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
 
+* rte_power: The experimental PMD power management API now supports managing
+  multiple Ethernet Rx queues per lcore.
+
 
 Removed Items
 -------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..7762cd39b8 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,7 +33,28 @@ enum pmd_mgmt_state {
 	PMD_MGMT_ENABLED
 };
 
-struct pmd_queue_cfg {
+union queue {
+	uint32_t val;
+	struct {
+		uint16_t portid;
+		uint16_t qid;
+	};
+};
+
+struct queue_list_entry {
+	TAILQ_ENTRY(queue_list_entry) next;
+	union queue queue;
+};
+
+struct pmd_core_cfg {
+	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+	/**< Which port-queue pairs are associated with this lcore? */
+	union queue power_save_queue;
+	/**< When polling multiple queues, all but this one will be ignored */
+	bool power_save_queue_set;
+	/**< When polling multiple queues, power save queue must be set */
+	size_t n_queues;
+	/**< How many queues are in the list? */
 	volatile enum pmd_mgmt_state pwr_mgmt_state;
 	/**< State of power management for this queue */
 	enum rte_power_pmd_mgmt_type cb_mode;
@@ -43,8 +64,96 @@ struct pmd_queue_cfg {
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
 
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+	return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+	dst->val = src->val;
+}
+
+static inline bool
+queue_is_power_save(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	const union queue *pwrsave = &cfg->power_save_queue;
+
+	/* if there's only single queue, no need to check anything */
+	if (cfg->n_queues == 1)
+		return true;
+	return cfg->power_save_queue_set && queue_equal(q, pwrsave);
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *cur;
+
+	TAILQ_FOREACH(cur, &cfg->head, next) {
+		if (queue_equal(&cur->queue, q))
+			return cur;
+	}
+	return NULL;
+}
+
+static int
+queue_set_power_save(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	const struct queue_list_entry *found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return -ENOENT;
+	queue_copy(&cfg->power_save_queue, q);
+	cfg->power_save_queue_set = true;
+	return 0;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *qle;
+
+	/* is it already in the list? */
+	if (queue_list_find(cfg, q) != NULL)
+		return -EEXIST;
+
+	qle = malloc(sizeof(*qle));
+	if (qle == NULL)
+		return -ENOMEM;
+
+	queue_copy(&qle->queue, q);
+	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+	cfg->n_queues++;
+
+	return 0;
+}
+
+static int
+queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *found;
+
+	found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return -ENOENT;
+
+	TAILQ_REMOVE(&cfg->head, found, next);
+	cfg->n_queues--;
+	free(found);
+
+	/* if this was a power save queue, unset it */
+	if (cfg->power_save_queue_set && queue_is_power_save(cfg, q)) {
+		union queue *pwrsave = &cfg->power_save_queue;
+		cfg->power_save_queue_set = false;
+		pwrsave->val = 0;
+	}
+
+	return 0;
+}
 
 static void
 calc_tsc(void)
@@ -79,10 +188,10 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
+	const unsigned int lcore = rte_lcore_id();
+	struct pmd_core_cfg *q_conf;
 
-	struct pmd_queue_cfg *q_conf;
-
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
 	if (unlikely(nb_rx == 0)) {
 		q_conf->empty_poll_stats++;
@@ -107,11 +216,26 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		/* sleep for 1 microsecond */
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
@@ -127,8 +251,7 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 					rte_pause();
 			}
 		}
-	} else
-		q_conf->empty_poll_stats = 0;
+	}
 
 	return nb_rx;
 }
@@ -138,19 +261,33 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+
+		/* scale up freq immediately */
+		rte_power_freq_max(rte_lcore_id());
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
 			/* scale down freq */
 			rte_power_freq_min(rte_lcore_id());
-	} else {
-		q_conf->empty_poll_stats = 0;
-		/* scale up freq */
-		rte_power_freq_max(rte_lcore_id());
 	}
 
 	return nb_rx;
@@ -167,11 +304,79 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
 	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
 }
 
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+	const struct queue_list_entry *entry;
+
+	TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+		const union queue *q = &entry->queue;
+		int ret = queue_stopped(q->portid, q->qid);
+		if (ret != 1)
+			return ret;
+	}
+	return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+	enum power_management_env env;
+
+	/* only PSTATE and ACPI modes are supported */
+	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+		return -ENOTSUP;
+	}
+	/* ensure we could initialize the power library */
+	if (rte_power_init(lcore))
+		return -EINVAL;
+
+	/* ensure we initialized the correct env */
+	env = rte_power_get_env();
+	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+	struct rte_power_monitor_cond dummy;
+
+	/* check if rte_power_monitor is supported */
+	if (!global_data.intrinsics_support.power_monitor) {
+		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+		return -ENOTSUP;
+	}
+
+	if (cfg->n_queues > 0) {
+		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+		return -ENOTSUP;
+	}
+
+	/* check if the device supports the necessary PMD API */
+	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+			&dummy) == -ENOTSUP) {
+		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
 	rte_rx_callback_fn clb;
 	int ret;
@@ -202,9 +407,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(queue_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
+	/* if callback was already enabled, check current callback type */
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+			queue_cfg->cb_mode != mode) {
 		ret = -EINVAL;
 		goto end;
 	}
@@ -214,53 +429,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 
 	switch (mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		struct rte_power_monitor_cond dummy;
-
-		/* check if rte_power_monitor is supported */
-		if (!global_data.intrinsics_support.power_monitor) {
-			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_monitor(queue_cfg, &qdata);
+		if (ret < 0)
 			goto end;
-		}
 
-		/* check if the device supports the necessary PMD API */
-		if (rte_eth_get_monitor_addr(port_id, queue_id,
-				&dummy) == -ENOTSUP) {
-			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_umwait;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
-	{
-		enum power_management_env env;
-		/* only PSTATE and ACPI modes are supported */
-		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
-				!rte_power_check_env_supported(
-					PM_ENV_PSTATE_CPUFREQ)) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_scale(lcore_id);
+		if (ret < 0)
 			goto end;
-		}
-		/* ensure we could initialize the power library */
-		if (rte_power_init(lcore_id)) {
-			ret = -EINVAL;
-			goto end;
-		}
-		/* ensure we initialized the correct env */
-		env = rte_power_get_env();
-		if (env != PM_ENV_ACPI_CPUFREQ &&
-				env != PM_ENV_PSTATE_CPUFREQ) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_scale_freq;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		/* figure out various time-to-tsc conversions */
 		if (global_data.tsc_per_us == 0)
@@ -273,11 +455,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		ret = -EINVAL;
 		goto end;
 	}
+	/* add this queue to the list */
+	ret = queue_list_add(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+				strerror(-ret));
+		goto end;
+	}
 
 	/* initialize data before enabling the callback */
-	queue_cfg->empty_poll_stats = 0;
-	queue_cfg->cb_mode = mode;
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	if (queue_cfg->n_queues == 1) {
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	}
 	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
 			clb, NULL);
 
@@ -290,7 +481,8 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,13 +498,31 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
+
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(queue_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
 		return -EINVAL;
 
-	/* stop any callbacks from progressing */
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	/*
+	 * There is no good/easy way to do this without race conditions, so we
+	 * are just going to throw our hands in the air and hope that the user
+	 * has read the documentation and has ensured that ports are stopped at
+	 * the time we enter the API functions.
+	 */
+	ret = queue_list_remove(queue_cfg, &qdata);
+	if (ret < 0)
+		return -ret;
+
+	/* if we've removed all queues from the lists, set state to disabled */
+	if (queue_cfg->n_queues == 0)
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
 	switch (queue_cfg->cb_mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
@@ -336,3 +546,42 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 
 	return 0;
 }
+
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &lcore_cfg[lcore_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	ret = queue_set_power_save(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to set power save queue: %s\n",
+			strerror(-ret));
+		return -ret;
+	}
+
+	return 0;
+}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+	size_t i;
+
+	/* initialize all tailqs */
+	for (i = 0; i < RTE_DIM(lcore_cfg); i++) {
+		struct pmd_core_cfg *cfg = &lcore_cfg[i];
+		TAILQ_INIT(&cfg->head);
+	}
+}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 444e7b8a66..d6ef8f778a 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -90,6 +90,40 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Set a specific Ethernet device Rx queue to be the "power save" queue for a
+ * particular lcore. When multiple queues are assigned to a single lcore using
+ * the `rte_power_ethdev_pmgmt_queue_enable` API, only one of them will trigger
+ * the power management. In a typical scenario, the last queue to be polled on
+ * a particular lcore should be designated as power save queue.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @note When using multiple queues per lcore, calling this function is
+ *   mandatory. If not called, no power management routines would be triggered
+ *   when the traffic starts.
+ *
+ * @warning This function must be called when all affected Ethernet ports are
+ *   stopped and no Rx/Tx is in progress!
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/power/version.map b/lib/power/version.map
index b004e3e4a9..105d1d94c2 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -38,4 +38,7 @@ EXPERIMENTAL {
 	# added in 21.02
 	rte_power_ethdev_pmgmt_queue_disable;
 	rte_power_ethdev_pmgmt_queue_enable;
+
+	# added in 21.08
+	rte_power_ethdev_pmgmt_queue_set_power_save;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
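The per-lcore bookkeeping in the patch above — queue_list_add()/queue_list_remove() updating n_queues, with pwr_mgmt_state flipping to enabled only when the first queue is added and back to disabled only when the last one is removed — can be sketched with a self-contained mock. The type and function names follow the diff, but this is a simplified illustration, not the actual library code:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* hypothetical stand-ins modelled on the patch's structures */
union queue {
	struct {
		uint16_t portid;
		uint16_t qid;
	};
	uint32_t val; /* used only for cheap equality comparison */
};

struct queue_list_entry {
	TAILQ_ENTRY(queue_list_entry) next;
	union queue queue;
};

struct pmd_core_cfg {
	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
	size_t n_queues;
	int pwr_mgmt_state; /* 0 = disabled, 1 = enabled */
};

int
queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
{
	struct queue_list_entry *qle = calloc(1, sizeof(*qle));

	if (qle == NULL)
		return -ENOMEM;
	qle->queue = *q;
	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
	/* first queue added flips the per-lcore state to enabled */
	if (++cfg->n_queues == 1)
		cfg->pwr_mgmt_state = 1;
	return 0;
}

int
queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
{
	struct queue_list_entry *qle;

	TAILQ_FOREACH(qle, &cfg->head, next) {
		if (qle->queue.val == q->val) {
			TAILQ_REMOVE(&cfg->head, qle, next);
			free(qle);
			/* last queue removed drops the state to disabled */
			if (--cfg->n_queues == 0)
				cfg->pwr_mgmt_state = 0;
			return 0;
		}
	}
	return -ENOENT;
}
```

The enable/disable APIs in the patch apply the same rule: per-queue configuration (callback mode, poll stats) is only (re)initialized when n_queues transitions through 1 or 0.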

* [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
                       ` (4 preceding siblings ...)
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-28 12:41     ` Anatoly Burakov
  2021-06-28 13:29       ` Ananyev, Konstantin
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 doc/guides/prog_guide/power_man.rst |  9 ++--
 lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index fac2c19516..3245a5ebed 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
 The "monitor" mode is only supported in the following configurations and scenarios:
 
 * If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor_multi()`` function is supported by the platform, then
+  monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
   ``rte_power_monitor()`` is supported by the platform, then monitoring will be
   limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
   monitored from a different lcore).
 
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
-  ``rte_power_monitor()`` function is not supported, then monitor mode will not
-  be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+  two monitoring functions are supported, then monitor mode will not be supported.
 
 * Not all Ethernet devices support monitoring, even if the underlying
   platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 7762cd39b8..aab2d4f1ee 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
 	return 0;
 }
 
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+		struct rte_power_monitor_cond *pmc)
+{
+	const struct queue_list_entry *qle;
+	size_t i = 0;
+	int ret;
+
+	TAILQ_FOREACH(qle, &cfg->head, next) {
+		struct rte_power_monitor_cond *cur = &pmc[i];
+		const union queue *q = &qle->queue;
+		ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+		if (ret < 0)
+			return ret;
+	}
+	return 0;
+}
+
 static void
 calc_tsc(void)
 {
@@ -183,6 +201,48 @@ calc_tsc(void)
 	}
 }
 
+static uint16_t
+clb_multiwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
+
+	q_conf = &lcore_cfg[lcore];
+
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
+			uint16_t ret;
+
+			/* gather all monitoring conditions */
+			ret = get_monitor_addresses(q_conf, pmc);
+
+			if (ret == 0)
+				rte_power_monitor_multi(pmc,
+					q_conf->n_queues, UINT64_MAX);
+		}
+	}
+
+	return nb_rx;
+}
+
 static uint16_t
 clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
@@ -348,14 +408,19 @@ static int
 check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 {
 	struct rte_power_monitor_cond dummy;
+	bool multimonitor_supported;
 
 	/* check if rte_power_monitor is supported */
 	if (!global_data.intrinsics_support.power_monitor) {
 		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
 		return -ENOTSUP;
 	}
+	/* check if multi-monitor is supported */
+	multimonitor_supported =
+			global_data.intrinsics_support.power_monitor_multi;
 
-	if (cfg->n_queues > 0) {
+	/* if we're adding a new queue, do we support multiple queues? */
+	if (cfg->n_queues > 0 && !multimonitor_supported) {
 		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
 		return -ENOTSUP;
 	}
@@ -371,6 +436,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 	return 0;
 }
 
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+	return global_data.intrinsics_support.power_monitor_multi ?
+		clb_multiwait : clb_umwait;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -434,7 +506,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (ret < 0)
 			goto end;
 
-		clb = clb_umwait;
+		clb = get_monitor_callback();
 		break;
 	case RTE_POWER_MGMT_TYPE_SCALE:
 		/* check if we can add a new queue */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
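The gating logic of clb_multiwait() above — reset the empty-poll counter on traffic from any queue, advance it only when the designated power-save queue polls empty, and enter the sleep once EMPTYPOLL_MAX is exceeded — can be modelled with a small stub. Here rte_power_monitor_multi() is replaced by a flag, and the names and threshold value are illustrative rather than taken from the real library:

```c
#include <stdbool.h>
#include <stdint.h>

/* illustrative threshold; the real value lives in rte_power_pmd_mgmt.c */
#define EMPTYPOLL_MAX 512

struct core_cfg {
	uint64_t empty_poll_stats;
	uint16_t power_save_qid;
	bool slept; /* records that the stubbed sleep was entered */
};

/* stand-in for the rte_power_monitor_multi() call */
void
sleep_stub(struct core_cfg *cfg)
{
	cfg->slept = true;
}

/* decision flow of the multi-queue Rx callback */
void
poll_event(struct core_cfg *cfg, uint16_t qid, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic on any queue of this lcore resets the counter */
		cfg->empty_poll_stats = 0;
		return;
	}
	/* only the designated power-save queue advances the counter */
	if (qid != cfg->power_save_qid)
		return;
	if (++cfg->empty_poll_stats > EMPTYPOLL_MAX)
		sleep_stub(cfg);
}
```

This is why the power-save queue should be the last one polled on the lcore: by the time it is reached with an empty poll, all other queues of that lcore have already had their chance to reset the counter.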

* [dpdk-dev] [PATCH v3 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
                       ` (5 preceding siblings ...)
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-28 12:41     ` Anatoly Burakov
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 12:41 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus

Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation, and always
mark the last queue in qconf as the power save queue.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 examples/l3fwd-power/main.c | 39 +++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..3057c06936 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2498,6 +2498,27 @@ mode_to_str(enum appmode mode)
 	}
 }
 
+static void
+pmd_pmgmt_set_up(unsigned int lcore, uint16_t portid, uint16_t qid, bool last)
+{
+	int ret;
+
+	ret = rte_power_ethdev_pmgmt_queue_enable(lcore, portid,
+			qid, pmgmt_type);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+			ret, portid);
+
+	if (!last)
+		return;
+	ret = rte_power_ethdev_pmgmt_queue_set_power_save(lcore, portid, qid);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_set_power_save: err=%d, port=%d\n",
+			ret, portid);
+}
+
 int
 main(int argc, char **argv)
 {
@@ -2723,12 +2744,6 @@ main(int argc, char **argv)
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
 
-		/* PMD power management mode can only do 1 queue per core */
-		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
-			rte_exit(EXIT_FAILURE,
-				"In PMD power management mode, only one queue per lcore is allowed\n");
-		}
-
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2767,15 +2782,9 @@ main(int argc, char **argv)
 						 "Fail to add ptype cb\n");
 			}
 
-			if (app_mode == APP_MODE_PMD_MGMT) {
-				ret = rte_power_ethdev_pmgmt_queue_enable(
-						lcore_id, portid, queueid,
-						pmgmt_type);
-				if (ret < 0)
-					rte_exit(EXIT_FAILURE,
-						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
-							ret, portid);
-			}
+			if (app_mode == APP_MODE_PMD_MGMT)
+				pmd_pmgmt_set_up(lcore_id, portid, queueid,
+					queue == (qconf->n_rx_queue - 1));
 		}
 	}
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
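The restructured setup loop boils down to one rule: enable every Rx queue of the lcore, and additionally designate the last one as the power-save queue. A toy mock of that rule (not the real rte_power_ethdev_pmgmt_* calls, which require a running EAL and configured ports):

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical per-queue record for illustration only */
struct rx_queue {
	uint16_t port_id;
	uint16_t queue_id;
	bool power_save;
};

/* mirrors the l3fwd-power loop: every queue is enabled for PMD power
 * management, and only the final queue in qconf is marked power-save */
void
set_up_lcore_queues(struct rx_queue *queues, uint16_t n_rx_queue)
{
	uint16_t queue;

	for (queue = 0; queue < n_rx_queue; ++queue) {
		/* rte_power_ethdev_pmgmt_queue_enable() would go here */
		queues[queue].power_save = (queue == n_rx_queue - 1);
	}
}
```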

* Re: [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events
  2021-06-28 12:37     ` Ananyev, Konstantin
@ 2021-06-28 12:43       ` Burakov, Anatoly
  2021-06-28 12:58         ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-28 12:43 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Jerin Jacob, Ruifeng Wang,
	Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman,
	Richardson, Bruce
  Cc: Hunt, David, Loftus, Ciara

On 28-Jun-21 1:37 PM, Ananyev, Konstantin wrote:
> 
>> Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
>> what UMWAIT does, but without the limitation of having to listen for
>> just one event. This works because the optimized power state used by the
>> TPAUSE instruction will cause a wake up on RTM transaction abort, so if
>> we add the addresses we're interested in to the read-set, any write to
>> those addresses will wake us up.
>>
>> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v2:
>>      - Adapt to callback mechanism
>>
>>   doc/guides/rel_notes/release_21_08.rst        |  2 +
>>   lib/eal/arm/rte_power_intrinsics.c            | 11 +++
>>   lib/eal/include/generic/rte_cpuflags.h        |  2 +
>>   .../include/generic/rte_power_intrinsics.h    | 35 ++++++++++
>>   lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
>>   lib/eal/version.map                           |  3 +
>>   lib/eal/x86/rte_cpuflags.c                    |  2 +
>>   lib/eal/x86/rte_power_intrinsics.c            | 69 +++++++++++++++++++
>>   8 files changed, 135 insertions(+)
>>
> ...
> 
>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>> index 3c5c9ce7ad..3fc6f62ef5 100644
>> --- a/lib/eal/x86/rte_power_intrinsics.c
>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>> @@ -4,6 +4,7 @@
>>
>>   #include <rte_common.h>
>>   #include <rte_lcore.h>
>> +#include <rte_rtm.h>
>>   #include <rte_spinlock.h>
>>
>>   #include "rte_power_intrinsics.h"
>> @@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
>>   }
>>
>>   static bool wait_supported;
>> +static bool wait_multi_supported;
>>
>>   static inline uint64_t
>>   __get_umwait_val(const volatile void *p, const uint8_t sz)
>> @@ -164,6 +166,8 @@ RTE_INIT(rte_power_intrinsics_init) {
>>
>>        if (i.power_monitor && i.power_pause)
>>                wait_supported = 1;
>> +     if (i.power_monitor_multi)
>> +             wait_multi_supported = 1;
>>   }
>>
>>   int
>> @@ -202,6 +206,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
>>         * In this case, since we've already woken up, the "wakeup" was
>>         * unneeded, and since T1 is still waiting on T2 releasing the lock, the
>>         * wakeup address is still valid so it's perfectly safe to write it.
>> +      *
>> +      * For multi-monitor case, the act of locking will in itself trigger the
>> +      * wakeup, so no additional writes necessary.
>>         */
>>        rte_spinlock_lock(&s->lock);
>>        if (s->monitor_addr != NULL)
>> @@ -210,3 +217,65 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
>>
>>        return 0;
>>   }
>> +
>> +int
>> +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
>> +             const uint32_t num, const uint64_t tsc_timestamp)
>> +{
>> +     const unsigned int lcore_id = rte_lcore_id();
>> +     struct power_wait_status *s = &wait_status[lcore_id];
>> +     uint32_t i, rc;
>> +
>> +     /* check if supported */
>> +     if (!wait_multi_supported)
>> +             return -ENOTSUP;
>> +
>> +     if (pmc == NULL || num == 0)
>> +             return -EINVAL;
>> +
>> +     /* we are already inside transaction region, return */
>> +     if (rte_xtest() != 0)
>> +             return 0;
>> +
>> +     /* start new transaction region */
>> +     rc = rte_xbegin();
>> +
>> +     /* transaction abort, possible write to one of wait addresses */
>> +     if (rc != RTE_XBEGIN_STARTED)
>> +             return 0;
>> +
>> +     /*
>> +      * the mere act of reading the lock status here adds the lock to
>> +      * the read set. This means that when we trigger a wakeup from another
>> +      * thread, even if we don't have a defined wakeup address and thus don't
>> +      * actually cause any writes, the act of locking our lock will itself
>> +      * trigger the wakeup and abort the transaction.
>> +      */
>> +     rte_spinlock_is_locked(&s->lock);
>> +
>> +     /*
>> +      * add all addresses to wait on into transaction read-set and check if
>> +      * any of wakeup conditions are already met.
>> +      */
>> +     for (i = 0; i < num; i++) {
>> +             const struct rte_power_monitor_cond *c = &pmc[i];
>> +
>> +             if (pmc->fn == NULL)
> 
> Should be c->fn, I believe.

Yep, will fix.

> 
>> +                     continue;
> 
> Actually that way, if c->fn == NULL, we'll never add  our c->addr to monitored addresses.
> Is that what we really want?
> My thought was, that if callback is not set, we'll just go to power-save state without extra checking, no?
> Something like that:
> 
> const struct rte_power_monitor_cond *c = &pmc[i];
> const uint64_t val = __get_umwait_val(c->addr, c->size);
> 
> if (c->fn && c->fn(val, c->opaque) != 0)
>     break;

This is consistent with previous behavior of rte_power_monitor where if 
mask wasn't set we entered power save mode without any checks. If we do 
a break, that means the check condition has failed somewhere and we have 
to abort the sleep. Continue keeps the sleep.

> 
> Same thought for rte_power_monitor().
> 
>> +             const uint64_t val = __get_umwait_val(pmc->addr, pmc->size);
> 
> Same thing: s/pmc->/c->/

Yep, you're right.

> 
>> +
>> +             /* abort if callback indicates that we need to stop */
>> +             if (c->fn(val, c->opaque) != 0)
>> +                     break;
>> +     }
>> +
>> +     /* none of the conditions were met, sleep until timeout */
>> +     if (i == num)
>> +             rte_power_pause(tsc_timestamp);
>> +
>> +     /* end transaction region */
>> +     rte_xend();
>> +
>> +     return 0;
>> +}
>> --
>> 2.25.1
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread
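The fix agreed in this exchange (indexing via the local `c` rather than `pmc`) comes down to the condition-check loop below. This is a simplified, hypothetical model: the RTM transaction and rte_power_pause() parts are omitted, and the monitored value is passed in directly instead of being read through __get_umwait_val():

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* simplified monitor condition: val stands in for the dereferenced addr */
struct monitor_cond {
	uint64_t val;
	int (*fn)(uint64_t val, void *opaque);
	void *opaque;
};

/* returns true when no wakeup condition was met, i.e. the core would
 * proceed to the power-save pause */
bool
should_sleep(const struct monitor_cond pmc[], uint32_t num)
{
	uint32_t i;

	for (i = 0; i < num; i++) {
		/* iterate via c, not pmc: this is the bug fixed in review */
		const struct monitor_cond *c = &pmc[i];

		if (c->fn == NULL)
			continue;
		if (c->fn(c->val, c->opaque) != 0)
			break; /* condition already met, abort the sleep */
	}
	return i == num;
}

/* example callback: wake up once the monitored value becomes non-zero */
int
nonzero_cb(uint64_t val, void *opaque)
{
	(void)opaque;
	return val != 0;
}
```

With the original `pmc->fn`/`pmc->addr` indexing, only the first element would ever be inspected, which is exactly what the review caught.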

* Re: [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events
  2021-06-28 12:43       ` Burakov, Anatoly
@ 2021-06-28 12:58         ` Ananyev, Konstantin
  2021-06-28 13:29           ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-28 12:58 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin,
	David Christensen, Ray Kinsella, Neil Horman, Richardson, Bruce
  Cc: Hunt, David, Loftus, Ciara


> On 28-Jun-21 1:37 PM, Ananyev, Konstantin wrote:
> >
> >> Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
> >> what UMWAIT does, but without the limitation of having to listen for
> >> just one event. This works because the optimized power state used by the
> >> TPAUSE instruction will cause a wake up on RTM transaction abort, so if
> >> we add the addresses we're interested in to the read-set, any write to
> >> those addresses will wake us up.
> >>
> >> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>
> >> Notes:
> >>      v2:
> >>      - Adapt to callback mechanism
> >>
> >>   doc/guides/rel_notes/release_21_08.rst        |  2 +
> >>   lib/eal/arm/rte_power_intrinsics.c            | 11 +++
> >>   lib/eal/include/generic/rte_cpuflags.h        |  2 +
> >>   .../include/generic/rte_power_intrinsics.h    | 35 ++++++++++
> >>   lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
> >>   lib/eal/version.map                           |  3 +
> >>   lib/eal/x86/rte_cpuflags.c                    |  2 +
> >>   lib/eal/x86/rte_power_intrinsics.c            | 69 +++++++++++++++++++
> >>   8 files changed, 135 insertions(+)
> >>
> > ...
> >
> >> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >> index 3c5c9ce7ad..3fc6f62ef5 100644
> >> --- a/lib/eal/x86/rte_power_intrinsics.c
> >> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >> @@ -4,6 +4,7 @@
> >>
> >>   #include <rte_common.h>
> >>   #include <rte_lcore.h>
> >> +#include <rte_rtm.h>
> >>   #include <rte_spinlock.h>
> >>
> >>   #include "rte_power_intrinsics.h"
> >> @@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
> >>   }
> >>
> >>   static bool wait_supported;
> >> +static bool wait_multi_supported;
> >>
> >>   static inline uint64_t
> >>   __get_umwait_val(const volatile void *p, const uint8_t sz)
> >> @@ -164,6 +166,8 @@ RTE_INIT(rte_power_intrinsics_init) {
> >>
> >>        if (i.power_monitor && i.power_pause)
> >>                wait_supported = 1;
> >> +     if (i.power_monitor_multi)
> >> +             wait_multi_supported = 1;
> >>   }
> >>
> >>   int
> >> @@ -202,6 +206,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
> >>         * In this case, since we've already woken up, the "wakeup" was
> >>         * unneeded, and since T1 is still waiting on T2 releasing the lock, the
> >>         * wakeup address is still valid so it's perfectly safe to write it.
> >> +      *
> >> +      * For multi-monitor case, the act of locking will in itself trigger the
> >> +      * wakeup, so no additional writes necessary.
> >>         */
> >>        rte_spinlock_lock(&s->lock);
> >>        if (s->monitor_addr != NULL)
> >> @@ -210,3 +217,65 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
> >>
> >>        return 0;
> >>   }
> >> +
> >> +int
> >> +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
> >> +             const uint32_t num, const uint64_t tsc_timestamp)
> >> +{
> >> +     const unsigned int lcore_id = rte_lcore_id();
> >> +     struct power_wait_status *s = &wait_status[lcore_id];
> >> +     uint32_t i, rc;
> >> +
> >> +     /* check if supported */
> >> +     if (!wait_multi_supported)
> >> +             return -ENOTSUP;
> >> +
> >> +     if (pmc == NULL || num == 0)
> >> +             return -EINVAL;
> >> +
> >> +     /* we are already inside transaction region, return */
> >> +     if (rte_xtest() != 0)
> >> +             return 0;
> >> +
> >> +     /* start new transaction region */
> >> +     rc = rte_xbegin();
> >> +
> >> +     /* transaction abort, possible write to one of wait addresses */
> >> +     if (rc != RTE_XBEGIN_STARTED)
> >> +             return 0;
> >> +
> >> +     /*
> >> +      * the mere act of reading the lock status here adds the lock to
> >> +      * the read set. This means that when we trigger a wakeup from another
> >> +      * thread, even if we don't have a defined wakeup address and thus don't
> >> +      * actually cause any writes, the act of locking our lock will itself
> >> +      * trigger the wakeup and abort the transaction.
> >> +      */
> >> +     rte_spinlock_is_locked(&s->lock);
> >> +
> >> +     /*
> >> +      * add all addresses to wait on into transaction read-set and check if
> >> +      * any of wakeup conditions are already met.
> >> +      */
> >> +     for (i = 0; i < num; i++) {
> >> +             const struct rte_power_monitor_cond *c = &pmc[i];
> >> +
> >> +             if (pmc->fn == NULL)
> >
> > Should be c->fn, I believe.
> 
> Yep, will fix.
> 
> >
> >> +                     continue;
> >
> > Actually that way, if c->fn == NULL, we'll never add  our c->addr to monitored addresses.
> > Is that what we really want?
> > My thought was, that if callback is not set, we'll just go to power-save state without extra checking, no?
> > Something like that:
> >
> > const struct rte_power_monitor_cond *c = &pmc[i];
> > const uint64_t val = __get_umwait_val(c->addr, c->size);
> >
> > if (c->fn && c->fn(val, c->opaque) != 0)
> >     break;
> 
> This is consistent with previous behavior of rte_power_monitor where if
> mask wasn't set we entered power save mode without any checks. If we do
> a break, that means the check condition has failed somewhere and we have
> to abort the sleep. Continue keeps the sleep.

Ok, so what is the current intention?
If pmc->fn == NULL, what does it mean:
1) pmc->addr shouldn't be monitored at all?
2) pmc->addr should be monitored unconditionally
3) pmc->fn should never be NULL and monitor should return an error
4) something else?

For me 1) looks really strange: if the user doesn't want to sleep on that
address, he can just not add this addr to pmc[].

2) is probably ok... but is that really needed?
The user can just provide a NOP as a callback and it would be the same.

3) seems like the most sane to me.

> 
> >
> > Same thought for rte_power_monitor().
> >
> >> +             const uint64_t val = __get_umwait_val(pmc->addr, pmc->size);
> >
> > Same thing: s/pmc->/c->/
> 
> Yep, you're right.
> 
> >
> >> +
> >> +             /* abort if callback indicates that we need to stop */
> >> +             if (c->fn(val, c->opaque) != 0)
> >> +                     break;
> >> +     }
> >> +
> >> +     /* none of the conditions were met, sleep until timeout */
> >> +     if (i == num)
> >> +             rte_power_pause(tsc_timestamp);
> >> +
> >> +     /* end transaction region */
> >> +     rte_xend();
> >> +
> >> +     return 0;
> >> +}
> >> --
> >> 2.25.1
> >
> 
> 
> --
> Thanks,
> Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-28 13:29       ` Ananyev, Konstantin
  2021-06-28 14:09         ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-28 13:29 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara



> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
> Rx queues while entering the energy efficient power state. The multi
> version will be used unconditionally if supported, and the UMWAIT one
> will only be used when multi-monitor is not supported by the hardware.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  doc/guides/prog_guide/power_man.rst |  9 ++--
>  lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
>  2 files changed, 80 insertions(+), 5 deletions(-)
> 
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index fac2c19516..3245a5ebed 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>  The "monitor" mode is only supported in the following configurations and scenarios:
> 
>  * If ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>    ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>    limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>    monitored from a different lcore).
> 
> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
> -  be supported.
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
> +  two monitoring functions are supported, then monitor mode will not be supported.
> 
>  * Not all Ethernet devices support monitoring, even if the underlying
>    platform may support the necessary CPU instructions. Please refer to
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index 7762cd39b8..aab2d4f1ee 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
>  	return 0;
>  }
> 
> +static inline int
> +get_monitor_addresses(struct pmd_core_cfg *cfg,
> +		struct rte_power_monitor_cond *pmc)
> +{
> +	const struct queue_list_entry *qle;
> +	size_t i = 0;
> +	int ret;
> +
> +	TAILQ_FOREACH(qle, &cfg->head, next) {
> +		struct rte_power_monitor_cond *cur = &pmc[i];

Looks like you never increment the 'i' value inside that function.
Also it will probably be safer to add a 'num' parameter to check that
we never over-run the pmc[] boundaries.

> +		const union queue *q = &qle->queue;
> +		ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
>  static void
>  calc_tsc(void)
>  {
> @@ -183,6 +201,48 @@ calc_tsc(void)
>  	}
>  }
> 
> +static uint16_t
> +clb_multiwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
> +{
> +	const unsigned int lcore = rte_lcore_id();
> +	const union queue q = {.portid = port_id, .qid = qidx};
> +	const bool empty = nb_rx == 0;
> +	struct pmd_core_cfg *q_conf;
> +
> +	q_conf = &lcore_cfg[lcore];
> +
> +	/* early exit */
> +	if (likely(!empty)) {
> +		q_conf->empty_poll_stats = 0;
> +	} else {
> +		/* do we care about this particular queue? */
> +		if (!queue_is_power_save(q_conf, &q))
> +			return nb_rx;

I still don't understand the need for the 'special' power_save queue here...
Why can't we just have a function:

get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(struct pmd_core_cfg *lcore_cfg),
and then just:

/* all queues have at least EMPTYPOLL_MAX sequential empty polls */
if (get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(q_conf) == 0) {
    /* go into power-save mode here */
}

> +
> +		/*
> +		 * we can increment unconditionally here because if there were
> +		 * non-empty polls in other queues assigned to this core, we
> +		 * dropped the counter to zero anyway.
> +		 */
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];

I think what you need here is:
struct rte_power_monitor_cond pmc[q_conf->n_queues];


> +			uint16_t ret;
> +
> +			/* gather all monitoring conditions */
> +			ret = get_monitor_addresses(q_conf, pmc);
> +
> +			if (ret == 0)
> +				rte_power_monitor_multi(pmc,
> +					q_conf->n_queues, UINT64_MAX);
> +		}
> +	}
> +
> +	return nb_rx;
> +}
> +
>  static uint16_t
>  clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>  		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> @@ -348,14 +408,19 @@ static int
>  check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>  {
>  	struct rte_power_monitor_cond dummy;
> +	bool multimonitor_supported;
> 
>  	/* check if rte_power_monitor is supported */
>  	if (!global_data.intrinsics_support.power_monitor) {
>  		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>  		return -ENOTSUP;
>  	}
> +	/* check if multi-monitor is supported */
> +	multimonitor_supported =
> +			global_data.intrinsics_support.power_monitor_multi;
> 
> -	if (cfg->n_queues > 0) {
> +	/* if we're adding a new queue, do we support multiple queues? */
> +	if (cfg->n_queues > 0 && !multimonitor_supported) {
>  		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
>  		return -ENOTSUP;
>  	}
> @@ -371,6 +436,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>  	return 0;
>  }
> 
> +static inline rte_rx_callback_fn
> +get_monitor_callback(void)
> +{
> +	return global_data.intrinsics_support.power_monitor_multi ?
> +		clb_multiwait : clb_umwait;
> +}
> +
>  int
>  rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
> @@ -434,7 +506,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		if (ret < 0)
>  			goto end;
> 
> -		clb = clb_umwait;
> +		clb = get_monitor_callback();
>  		break;
>  	case RTE_POWER_MGMT_TYPE_SCALE:
>  		/* check if we can add a new queue */
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/7] eal: add power monitor for multiple events
  2021-06-28 12:58         ` Ananyev, Konstantin
@ 2021-06-28 13:29           ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-28 13:29 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Jerin Jacob, Ruifeng Wang,
	Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman,
	Richardson, Bruce
  Cc: Hunt, David, Loftus, Ciara

On 28-Jun-21 1:58 PM, Ananyev, Konstantin wrote:
> 
>> On 28-Jun-21 1:37 PM, Ananyev, Konstantin wrote:
>>>
>>>> Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
>>>> what UMWAIT does, but without the limitation of having to listen for
>>>> just one event. This works because the optimized power state used by the
>>>> TPAUSE instruction will cause a wake up on RTM transaction abort, so if
>>>> we add the addresses we're interested in to the read-set, any write to
>>>> those addresses will wake us up.
>>>>
>>>> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> ---
>>>>
>>>> Notes:
>>>>       v2:
>>>>       - Adapt to callback mechanism
>>>>
>>>>    doc/guides/rel_notes/release_21_08.rst        |  2 +
>>>>    lib/eal/arm/rte_power_intrinsics.c            | 11 +++
>>>>    lib/eal/include/generic/rte_cpuflags.h        |  2 +
>>>>    .../include/generic/rte_power_intrinsics.h    | 35 ++++++++++
>>>>    lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
>>>>    lib/eal/version.map                           |  3 +
>>>>    lib/eal/x86/rte_cpuflags.c                    |  2 +
>>>>    lib/eal/x86/rte_power_intrinsics.c            | 69 +++++++++++++++++++
>>>>    8 files changed, 135 insertions(+)
>>>>
>>> ...
>>>
>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>>>> index 3c5c9ce7ad..3fc6f62ef5 100644
>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>>>> @@ -4,6 +4,7 @@
>>>>
>>>>    #include <rte_common.h>
>>>>    #include <rte_lcore.h>
>>>> +#include <rte_rtm.h>
>>>>    #include <rte_spinlock.h>
>>>>
>>>>    #include "rte_power_intrinsics.h"
>>>> @@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
>>>>    }
>>>>
>>>>    static bool wait_supported;
>>>> +static bool wait_multi_supported;
>>>>
>>>>    static inline uint64_t
>>>>    __get_umwait_val(const volatile void *p, const uint8_t sz)
>>>> @@ -164,6 +166,8 @@ RTE_INIT(rte_power_intrinsics_init) {
>>>>
>>>>         if (i.power_monitor && i.power_pause)
>>>>                 wait_supported = 1;
>>>> +     if (i.power_monitor_multi)
>>>> +             wait_multi_supported = 1;
>>>>    }
>>>>
>>>>    int
>>>> @@ -202,6 +206,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
>>>>          * In this case, since we've already woken up, the "wakeup" was
>>>>          * unneeded, and since T1 is still waiting on T2 releasing the lock, the
>>>>          * wakeup address is still valid so it's perfectly safe to write it.
>>>> +      *
>>>> +      * For multi-monitor case, the act of locking will in itself trigger the
>>>> +      * wakeup, so no additional writes necessary.
>>>>          */
>>>>         rte_spinlock_lock(&s->lock);
>>>>         if (s->monitor_addr != NULL)
>>>> @@ -210,3 +217,65 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
>>>>
>>>>         return 0;
>>>>    }
>>>> +
>>>> +int
>>>> +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
>>>> +             const uint32_t num, const uint64_t tsc_timestamp)
>>>> +{
>>>> +     const unsigned int lcore_id = rte_lcore_id();
>>>> +     struct power_wait_status *s = &wait_status[lcore_id];
>>>> +     uint32_t i, rc;
>>>> +
>>>> +     /* check if supported */
>>>> +     if (!wait_multi_supported)
>>>> +             return -ENOTSUP;
>>>> +
>>>> +     if (pmc == NULL || num == 0)
>>>> +             return -EINVAL;
>>>> +
>>>> +     /* we are already inside transaction region, return */
>>>> +     if (rte_xtest() != 0)
>>>> +             return 0;
>>>> +
>>>> +     /* start new transaction region */
>>>> +     rc = rte_xbegin();
>>>> +
>>>> +     /* transaction abort, possible write to one of wait addresses */
>>>> +     if (rc != RTE_XBEGIN_STARTED)
>>>> +             return 0;
>>>> +
>>>> +     /*
>>>> +      * the mere act of reading the lock status here adds the lock to
>>>> +      * the read set. This means that when we trigger a wakeup from another
>>>> +      * thread, even if we don't have a defined wakeup address and thus don't
>>>> +      * actually cause any writes, the act of locking our lock will itself
>>>> +      * trigger the wakeup and abort the transaction.
>>>> +      */
>>>> +     rte_spinlock_is_locked(&s->lock);
>>>> +
>>>> +     /*
>>>> +      * add all addresses to wait on into transaction read-set and check if
>>>> +      * any of wakeup conditions are already met.
>>>> +      */
>>>> +     for (i = 0; i < num; i++) {
>>>> +             const struct rte_power_monitor_cond *c = &pmc[i];
>>>> +
>>>> +             if (pmc->fn == NULL)
>>>
>>> Should be c->fn, I believe.
>>
>> Yep, will fix.
>>
>>>
>>>> +                     continue;
>>>
>>> Actually that way, if c->fn == NULL, we'll never add our c->addr to monitored addresses.
>>> Is that what we really want?
>>> My thought was that if the callback is not set, we'll just go to power-save state without extra checking, no?
>>> Something like that:
>>>
>>> const struct rte_power_monitor_cond *c = &pmc[i];
>>> const uint64_t val = __get_umwait_val(c->addr, c->size);
>>>
>>> if (c->fn && c->fn(val, c->opaque) != 0)
>>>      break;
>>
This is consistent with the previous behavior of rte_power_monitor, where,
if the mask wasn't set, we entered power-save mode without any checks. If
we do a break, that means the check condition has failed somewhere and we
have to abort the sleep. A continue keeps the sleep.
> 
> Ok, so what is the current intention?
> If pmc->fn == NULL, what does it mean:
> 1) pmc->addr shouldn't be monitored at all?
> 2) pmc->addr should be monitored unconditionally
> 3) pmc->fn should never be NULL and monitor should return an error
> 4) something else?
> 
> For me, 1) looks really strange: if the user doesn't want to sleep on that address,
> he can just not add this addr to pmc[].

Ah, I see what you mean now. While we can skip the comparison, we still 
want to read the address. So, I would guess 2) should be true.

> 
> 2) is probably ok... but is that really needed?
> The user can just provide a NOP as a callback and it would be the same.
> 
> 3) seems like the most sane option to me.

In that case, we have to do the same in the singular power monitor 
function: fn set to NULL should return an error. Will fix in v4 (v3 
went out before I noticed your comments :/ ).
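In other words, the argument checking in both variants would end up with roughly this shape (a sketch only, with simplified stand-ins for the DPDK types — not the actual v4 code):

```c
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

/* simplified stand-in for the DPDK callback type and condition struct */
typedef int (*monitor_clb_t)(const uint64_t val, const uint64_t *opaque);

struct monitor_cond {
	volatile void *addr;
	monitor_clb_t fn; /* comparison callback; must not be NULL */
};

/* shared validation, option 3) from the discussion: a NULL callback is
 * an error, for both the single- and the multi-monitor variant */
static int
check_monitor_conds(const struct monitor_cond *pmc, uint32_t num)
{
	uint32_t i;

	if (pmc == NULL || num == 0)
		return -EINVAL;
	for (i = 0; i < num; i++)
		if (pmc[i].fn == NULL)
			return -EINVAL;
	return 0;
}
```

That way rte_power_monitor() and rte_power_monitor_multi() would reject a missing callback consistently instead of silently skipping the address.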

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-28 13:29       ` Ananyev, Konstantin
@ 2021-06-28 14:09         ` Burakov, Anatoly
  2021-06-29  0:07           ` Ananyev, Konstantin
  0 siblings, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-28 14:09 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 28-Jun-21 2:29 PM, Ananyev, Konstantin wrote:
> 
> 
>> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
>> Rx queues while entering the energy efficient power state. The multi
>> version will be used unconditionally if supported, and the UMWAIT one
>> will only be used when multi-monitor is not supported by the hardware.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   doc/guides/prog_guide/power_man.rst |  9 ++--
>>   lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
>>   2 files changed, 80 insertions(+), 5 deletions(-)
>>
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index fac2c19516..3245a5ebed 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>>   The "monitor" mode is only supported in the following configurations and scenarios:
>>
>>   * If ``rte_cpu_get_intrinsics_support()`` function indicates that
>> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
>> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>>     ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>>     limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>>     monitored from a different lcore).
>>
>> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
>> -  be supported.
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
>> +  two monitoring functions are supported, then monitor mode will not be supported.
>>
>>   * Not all Ethernet devices support monitoring, even if the underlying
>>     platform may support the necessary CPU instructions. Please refer to
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index 7762cd39b8..aab2d4f1ee 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
>>        return 0;
>>   }
>>
>> +static inline int
>> +get_monitor_addresses(struct pmd_core_cfg *cfg,
>> +             struct rte_power_monitor_cond *pmc)
>> +{
>> +     const struct queue_list_entry *qle;
>> +     size_t i = 0;
>> +     int ret;
>> +
>> +     TAILQ_FOREACH(qle, &cfg->head, next) {
>> +             struct rte_power_monitor_cond *cur = &pmc[i];
> 
> Looks like you never increment 'i' inside that function.
> Also, it would probably be safer to add a 'num' parameter to check that
> we never overrun the pmc[] boundaries.

Will fix in v4, good catch!
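Something like this, perhaps: restore the increment and add a 'num' bound (the types and the address-lookup helper below are simplified stand-ins for the DPDK ones, not the actual v4 code):

```c
#include <stddef.h>
#include <errno.h>

/* simplified stand-ins for struct rte_power_monitor_cond and the
 * queue list entries used in rte_power_pmd_mgmt.c */
struct cond { volatile void *addr; };
struct entry { struct entry *next; unsigned int portid; unsigned int qid; };

/* stand-in for rte_eth_get_monitor_addr() */
static int get_addr_stub(unsigned int portid, unsigned int qid, struct cond *c)
{
	(void)portid; (void)qid;
	c->addr = (void *)0x1; /* pretend the driver filled it in */
	return 0;
}

static int
get_monitor_addresses(struct entry *head, struct cond *pmc, size_t num)
{
	struct entry *qle;
	size_t i = 0;

	for (qle = head; qle != NULL; qle = qle->next) {
		int ret;

		/* bounds check suggested in the review */
		if (i >= num)
			return -EINVAL;
		ret = get_addr_stub(qle->portid, qle->qid, &pmc[i]);
		if (ret < 0)
			return ret;
		/* the increment that was missing in v3 */
		i++;
	}
	return 0;
}
```

With that shape, the caller passes the size of pmc[] as 'num', and an over-long queue list fails loudly instead of writing past the array.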

> 
>> +             const union queue *q = &qle->queue;
>> +             ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
>> +             if (ret < 0)
>> +                     return ret;
>> +     }
>> +     return 0;
>> +}
>> +
>>   static void
>>   calc_tsc(void)
>>   {
>> @@ -183,6 +201,48 @@ calc_tsc(void)
>>        }
>>   }
>>
>> +static uint16_t
>> +clb_multiwait(uint16_t port_id, uint16_t qidx,
>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +             uint16_t max_pkts __rte_unused, void *addr __rte_unused)
>> +{
>> +     const unsigned int lcore = rte_lcore_id();
>> +     const union queue q = {.portid = port_id, .qid = qidx};
>> +     const bool empty = nb_rx == 0;
>> +     struct pmd_core_cfg *q_conf;
>> +
>> +     q_conf = &lcore_cfg[lcore];
>> +
>> +     /* early exit */
>> +     if (likely(!empty)) {
>> +             q_conf->empty_poll_stats = 0;
>> +     } else {
>> +             /* do we care about this particular queue? */
>> +             if (!queue_is_power_save(q_conf, &q))
>> +                     return nb_rx;
> 
> I still don't understand the need for a 'special' power_save queue here...
> Why can't we just have a function:
> 
> get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(struct pmd_core_cfg *lcore_cfg),
> and then just:
> 
> /* all queues have at least EMPTYPOLL_MAX sequential empty polls */
> if (get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(q_conf) == 0) {
>      /* go into power-save mode here */
> }

Okay, let's go through this step by step :)

Let's suppose we have three queues: q0, q1 and q2. We want to sleep 
only when there's no traffic on *any of them*; however, we cannot know 
that until we have checked all of them.

So, let's suppose that q0, q1 and q2 were empty all this time, but now 
some traffic arrived at q2 while we're still checking q0. We see that q0 
is empty, and all of the queues were empty for the last N polls, so we 
think we will be safe to sleep at q0 despite the fact that traffic has 
just arrived at q2.

This is not an issue with MONITOR mode, because we will be able to see 
whether the current Rx ring descriptor is busy via the NIC callback, 
*but this is not possible* with PAUSE and SCALE modes, because they 
don't have the sneaky lookahead function of MONITOR! So, with PAUSE and 
SCALE modes, it is possible to end up in a situation where you *think* 
you don't have any traffic, but you actually do; you just haven't 
checked the relevant queue yet.

In order to prevent this from happening, we do not sleep on every queue, 
instead we sleep *once* per loop. That is, we check q0, check q1, check 
q2, and only then we decide whether we want to sleep or not.

Of course, with such a scheme it is still possible to e.g. sleep at q2 
while there's traffic waiting in q0, but the worst case is less bad with 
this scheme, because we'll be doing at worst 1 extra sleep.

Whereas with what you're suggesting, if we had e.g. 10 queues to poll, 
and we checked q1 but traffic has just arrived at q0, we'll be sleeping 
at q1, then we'll be sleeping at q2, then we'll be sleeping at q3, then 
we'll be sleeping at q4, then we'll be sleeping at q5.... and 9 sleeps 
later we finally reach q0 and find out after all this time that we 
shouldn't have slept in the first place. Hopefully you get the point now :)

So, the idea here is, for any N queues, sleep only once, not N times.
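As a toy model of that idea (this is deliberately not the actual implementation: queue polling and the power-save entry are stubbed out, and the threshold is tiny):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_QUEUES 3
#define EMPTYPOLL_MAX 4 /* tiny threshold, for illustration only */

/* toy traffic source: nonzero means packets arrived on that queue */
static uint16_t rx_counts[NUM_QUEUES];
/* one empty-poll counter per lcore, shared by all of its queues */
static uint64_t empty_poll_stats;
static unsigned int sleeps;

/* one pass over all queues; the sleep decision is made only once,
 * after every queue has been checked, so traffic that just arrived
 * on q2 can still cancel a sleep we would otherwise take "at q0" */
static void poll_all_queues_once(void)
{
	bool all_empty = true;
	int q;

	for (q = 0; q < NUM_QUEUES; q++) {
		if (rx_counts[q] != 0) {
			all_empty = false;
			empty_poll_stats = 0;
		}
	}
	if (all_empty && ++empty_poll_stats > EMPTYPOLL_MAX)
		sleeps++; /* stand-in for entering the power-save state */
}
```

The key property is that traffic on any one queue during the pass resets the counter and prevents the sleep, so there is at most one sleep per full pass over the queues.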

> 
>> +
>> +             /*
>> +              * we can increment unconditionally here because if there were
>> +              * non-empty polls in other queues assigned to this core, we
>> +              * dropped the counter to zero anyway.
>> +              */
>> +             q_conf->empty_poll_stats++;
>> +             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +                     struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
> 
> I think what you need here is:
> struct rte_power_monitor_cond pmc[q_conf->n_queues];

I think VLAs are generally agreed to be unsafe, so I'm avoiding 
them here.

> 
> 
>> +                     uint16_t ret;
>> +
>> +                     /* gather all monitoring conditions */
>> +                     ret = get_monitor_addresses(q_conf, pmc);
>> +
>> +                     if (ret == 0)
>> +                             rte_power_monitor_multi(pmc,
>> +                                     q_conf->n_queues, UINT64_MAX);
>> +             }
>> +     }
>> +
>> +     return nb_rx;
>> +}
>> +
>>   static uint16_t
>>   clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>>                uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> @@ -348,14 +408,19 @@ static int
>>   check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>>   {
>>        struct rte_power_monitor_cond dummy;
>> +     bool multimonitor_supported;
>>
>>        /* check if rte_power_monitor is supported */
>>        if (!global_data.intrinsics_support.power_monitor) {
>>                RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>>                return -ENOTSUP;
>>        }
>> +     /* check if multi-monitor is supported */
>> +     multimonitor_supported =
>> +                     global_data.intrinsics_support.power_monitor_multi;
>>
>> -     if (cfg->n_queues > 0) {
>> +     /* if we're adding a new queue, do we support multiple queues? */
>> +     if (cfg->n_queues > 0 && !multimonitor_supported) {
>>                RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
>>                return -ENOTSUP;
>>        }
>> @@ -371,6 +436,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>>        return 0;
>>   }
>>
>> +static inline rte_rx_callback_fn
>> +get_monitor_callback(void)
>> +{
>> +     return global_data.intrinsics_support.power_monitor_multi ?
>> +             clb_multiwait : clb_umwait;
>> +}
>> +
>>   int
>>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>> @@ -434,7 +506,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                if (ret < 0)
>>                        goto end;
>>
>> -             clb = clb_umwait;
>> +             clb = get_monitor_callback();
>>                break;
>>        case RTE_POWER_MGMT_TYPE_SCALE:
>>                /* check if we can add a new queue */
>> --
>> 2.25.1
> 


-- 
Thanks,
Anatoly


* [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management
  2021-06-28 12:41   ` [dpdk-dev] [PATCH v3 0/7] Enhancements for PMD power management Anatoly Burakov
                       ` (6 preceding siblings ...)
  2021-06-28 12:41     ` [dpdk-dev] [PATCH v3 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-06-28 15:54     ` Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
                         ` (7 more replies)
  7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, konstantin.ananyev, ciara.loftus

This patchset introduces several changes related to PMD power management:

- Changed monitoring intrinsics to use callbacks as a comparison function, based
  on previous patchset [1] but incorporating feedback [2] - this hopefully will
  make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
  set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
  accompanying infrastructure and example apps changes

v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections

v3:
- Moved some doc updates to NIC features list

v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary

[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274

Anatoly Burakov (7):
  power_intrinsics: use callbacks for comparison
  net/af_xdp: add power monitor support
  eal: add power monitor for multiple events
  power: remove thread safety from PMD power API's
  power: support callbacks for multiple Rx queues
  power: support monitoring multiple Rx queues
  l3fwd-power: support multiqueue in PMD pmgmt modes

 doc/guides/nics/features.rst                  |  10 +
 doc/guides/prog_guide/power_man.rst           |  78 ++-
 doc/guides/rel_notes/release_21_08.rst        |  11 +
 drivers/event/dlb2/dlb2.c                     |  17 +-
 drivers/net/af_xdp/rte_eth_af_xdp.c           |  34 +
 drivers/net/i40e/i40e_rxtx.c                  |  20 +-
 drivers/net/iavf/iavf_rxtx.c                  |  20 +-
 drivers/net/ice/ice_rxtx.c                    |  20 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |  20 +-
 drivers/net/mlx5/mlx5_rx.c                    |  17 +-
 examples/l3fwd-power/main.c                   |  39 +-
 lib/eal/arm/rte_power_intrinsics.c            |  11 +
 lib/eal/include/generic/rte_cpuflags.h        |   2 +
 .../include/generic/rte_power_intrinsics.h    |  68 +-
 lib/eal/ppc/rte_power_intrinsics.c            |  11 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_cpuflags.c                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  90 ++-
 lib/power/meson.build                         |   3 +
 lib/power/rte_power_pmd_mgmt.c                | 582 +++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h                |  40 ++
 lib/power/version.map                         |   3 +
 22 files changed, 874 insertions(+), 227 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH v4 1/7] power_intrinsics: use callbacks for comparison
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-06-28 15:54       ` Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 2/7] net/af_xdp: add power monitor support Anatoly Burakov
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
	Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
	Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Previously, the semantics of power monitor were such that we were
checking the current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.

This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.

Existing implementations are adjusted to follow the new semantics.
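Reduced to a self-contained sketch, the new contract looks like this (the typedef mirrors the one this patch adds to rte_power_intrinsics.h; the masked-comparison callback follows the same pattern as the dlb2/mlx5 ones in the diff, with a hypothetical name):

```c
#include <stdint.h>

#define RTE_POWER_MONITOR_OPAQUE_SZ 4

/* mirrors the callback type added to rte_power_intrinsics.h */
typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);

#define CLB_VAL_IDX 0
#define CLB_MSK_IDX 1

/* same pattern as the dlb2/mlx5 callbacks below: abort the sleep
 * (return -1) when the masked value matches the expected value,
 * proceed with the sleep (return 0) otherwise */
static int
masked_match_callback(const uint64_t value,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
{
	const uint64_t m = opaque[CLB_MSK_IDX];
	const uint64_t v = opaque[CLB_VAL_IDX];

	return (value & m) == v ? -1 : 0;
}
```

Drivers with more exotic abort conditions can now encode them entirely in the callback instead of being limited to a single value/mask pair.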

Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v4:
    - Return error if callback is set to NULL
    - Replace raw number with a macro in monitor condition opaque data
    
    v2:
    - Use callback mechanism for more flexibility
    - Address feedback from Konstantin

 doc/guides/rel_notes/release_21_08.rst        |  1 +
 drivers/event/dlb2/dlb2.c                     | 17 ++++++++--
 drivers/net/i40e/i40e_rxtx.c                  | 20 +++++++----
 drivers/net/iavf/iavf_rxtx.c                  | 20 +++++++----
 drivers/net/ice/ice_rxtx.c                    | 20 +++++++----
 drivers/net/ixgbe/ixgbe_rxtx.c                | 20 +++++++----
 drivers/net/mlx5/mlx5_rx.c                    | 17 ++++++++--
 .../include/generic/rte_power_intrinsics.h    | 33 +++++++++++++++----
 lib/eal/x86/rte_power_intrinsics.c            | 17 +++++-----
 9 files changed, 121 insertions(+), 44 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
 ABI Changes
 -----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
 	}
 }
 
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	/* abort if the value matches */
+	return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
 static inline int
 dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		  struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 			expected_value = 0;
 
 		pmc.addr = monitor_addr;
-		pmc.val = expected_value;
-		pmc.mask = qe_mask.raw_qe[1];
+		/* store expected value and comparison mask in opaque data */
+		pmc.opaque[CLB_VAL_IDX] = expected_value;
+		pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+		/* set up callback */
+		pmc.fn = dlb2_monitor_callback;
 		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..081682f88b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
 #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
 
+static int
+i40e_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = i40e_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..7ed196ec22 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
 				rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
 }
 
+static int
+iavf_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = iavf_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..d12437d19d 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+static int
+ice_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.status_error0;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
-	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/* comparison callback */
+	pmc->fn = ice_monitor_callback;
 
 	/* register is 16-bit */
 	pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+static int
+ixgbe_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.upper.status_error;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
-	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/* comparison callback */
+	pmc->fn = ixgbe_monitor_callback;
 
 	/* the registers are 32-bit */
 	pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 	return rx_queue_count(rxq);
 }
 
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	const uint64_t m = opaque[CLB_MSK_IDX];
+	const uint64_t v = opaque[CLB_VAL_IDX];
+
+	return (value & m) == v ? -1 : 0;
+}
+
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 		return -rte_errno;
 	}
 	pmc->addr = &cqe->op_own;
-	pmc->val =  !!idx;
-	pmc->mask = MLX5_CQE_OWNER_MASK;
+	pmc->opaque[CLB_VAL_IDX] = !!idx;
+	pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+	pmc->fn = mlx_monitor_callback;
 	pmc->size = sizeof(uint8_t);
 	return 0;
 }
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
  * which are architecture-dependent.
  */
 
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ *   The value read from memory.
+ * @param opaque
+ *   Callback-specific data.
+ *
+ * @return
+ *   0 if entering of power optimized state should proceed
+ *   -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
 struct rte_power_monitor_cond {
 	volatile void *addr;  /**< Address to monitor for changes */
-	uint64_t val;         /**< If the `mask` is non-zero, location pointed
-	                       *   to by `addr` will be read and compared
-	                       *   against this value.
-	                       */
-	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
-	uint8_t size;    /**< Data size (in bytes) that will be used to compare
-	                  *   expected value (`val`) with data read from the
+	uint8_t size;    /**< Data size (in bytes) that will be read from the
 	                  *   monitored memory location (`addr`). Can be 1, 2,
 	                  *   4, or 8. Supplying any other value will result in
 	                  *   an error.
 	                  */
+	rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+	                             *   entering power optimized state should
+	                             *   be aborted.
+	                             */
+	uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+	/**< Callback-specific data */
 };
 
 /**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 	const unsigned int lcore_id = rte_lcore_id();
 	struct power_wait_status *s;
+	uint64_t cur_value;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (__check_val_size(pmc->size) < 0)
 		return -EINVAL;
 
+	if (pmc->fn == NULL)
+		return -EINVAL;
+
 	s = &wait_status[lcore_id];
 
 	/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	/* now that we've put this address into monitor, we can unlock */
 	rte_spinlock_unlock(&s->lock);
 
-	/* if we have a comparison mask, we might not need to sleep at all */
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->size);
-		const uint64_t masked = cur_value & pmc->mask;
+	cur_value = __get_umwait_val(pmc->addr, pmc->size);
 
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			goto end;
-	}
+	/* check if callback indicates we should abort */
+	if (pmc->fn(cur_value, pmc->opaque) != 0)
+		goto end;
 
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v4 2/7] net/af_xdp: add power monitor support
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-06-28 15:54       ` Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 3/7] eal: add power monitor for multiple events Anatoly Burakov
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev

Implement support for .get_monitor_addr in the AF_XDP driver.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Rewrite using the callback mechanism

 drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..7830d0c23a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
 
 #include "compat.h"
 
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
 	return 0;
 }
 
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	const uint64_t v = opaque[CLB_VAL_IDX];
+	const uint64_t m = (uint32_t)~0;
+
+	/* if the value has changed, abort entering power optimized state */
+	return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct pkt_rx_queue *rxq = rx_queue;
+	unsigned int *prod = rxq->rx.producer;
+	const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+	/* watch for changes in producer ring */
+	pmc->addr = (void*)prod;
+
+	/* store current value */
+	pmc->opaque[CLB_VAL_IDX] = cur_val;
+	pmc->fn = eth_monitor_callback;
+
+	/* AF_XDP producer ring index is 32-bit */
+	pmc->size = sizeof(uint32_t);
+
+	return 0;
+}
+
 static int
 eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
 	.link_update = eth_link_update,
 	.stats_get = eth_stats_get,
 	.stats_reset = eth_stats_reset,
+	.get_monitor_addr = eth_get_monitor_addr
 };
 
 /** parse busy_budget argument */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
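The AF_XDP monitor condition above aborts sleep when the producer ring index moves past the cached value. A standalone sketch of that callback contract (DPDK types stubbed out; `RTE_POWER_MONITOR_OPAQUE_SZ`, the callback signature, and the 0/-1 return convention mirror the patch, the rest is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define RTE_POWER_MONITOR_OPAQUE_SZ 4
#define CLB_VAL_IDX 0

/*
 * Callback contract: return 0 to proceed into the power-optimized state,
 * -1 to abort. The cached producer index is carried in the opaque data.
 */
static int
xdp_monitor_clb(const uint64_t value,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
{
	const uint64_t v = opaque[CLB_VAL_IDX];
	const uint64_t m = (uint32_t)~0; /* producer ring index is 32-bit */

	/* if the producer index has changed, new packets arrived: abort */
	return (value & m) == v ? 0 : -1;
}
```

Note that masking with the 32-bit mask means only the low 32 bits of the monitored word matter, which matches the `pmc->size = sizeof(uint32_t)` setting in the patch.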

* [dpdk-dev] [PATCH v4 3/7] eal: add power monitor for multiple events
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-06-28 15:54       ` Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
                         ` (4 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v4:
    - Fixed bugs in accessing the monitor condition
    - Abort on any monitor condition not having a defined callback
    
    v2:
    - Adapt to callback mechanism

 doc/guides/rel_notes/release_21_08.rst        |  2 +
 lib/eal/arm/rte_power_intrinsics.c            | 11 +++
 lib/eal/include/generic/rte_cpuflags.h        |  2 +
 .../include/generic/rte_power_intrinsics.h    | 35 +++++++++
 lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
 lib/eal/version.map                           |  3 +
 lib/eal/x86/rte_cpuflags.c                    |  2 +
 lib/eal/x86/rte_power_intrinsics.c            | 73 +++++++++++++++++++
 8 files changed, 139 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
 	/**< indicates support for rte_power_monitor function */
 	uint32_t power_pause : 1;
 	/**< indicates support for rte_power_pause function */
+	uint32_t power_monitor_multi : 1;
+	/**< indicates support for rte_power_monitor_multi function */
 };
 
 /**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 __rte_experimental
 int rte_power_pause(const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each monitoring condition provides a comparison callback and
+ * callback-specific data. The current value at each monitored address is passed
+ * to that callback, and if the callback indicates that the wakeup condition is
+ * already met, the entering of optimized power state may be aborted.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using the `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ *   An array of monitoring condition structures.
+ * @param num
+ *   Length of the `pmc` array.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp);
+
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
 	rte_version_release; # WINDOWS_NO_EXPORT
 	rte_version_suffix; # WINDOWS_NO_EXPORT
 	rte_version_year; # WINDOWS_NO_EXPORT
+
+	# added in 21.08
+	rte_power_monitor_multi; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
 		intrinsics->power_monitor = 1;
 		intrinsics->power_pause = 1;
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+			intrinsics->power_monitor_multi = 1;
 	}
 }
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_rtm.h>
 #include <rte_spinlock.h>
 
 #include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
 }
 
 static bool wait_supported;
+static bool wait_multi_supported;
 
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
 
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
+	if (i.power_monitor_multi)
+		wait_multi_supported = 1;
 }
 
 int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	 * In this case, since we've already woken up, the "wakeup" was
 	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
 	 * wakeup address is still valid so it's perfectly safe to write it.
+	 *
+	 * For the multi-monitor case, the act of locking will in itself
+	 * trigger the wakeup, so no additional writes are necessary.
 	 */
 	rte_spinlock_lock(&s->lock);
 	if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return 0;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s = &wait_status[lcore_id];
+	uint32_t i, rc;
+
+	/* check if supported */
+	if (!wait_multi_supported)
+		return -ENOTSUP;
+
+	if (pmc == NULL || num == 0)
+		return -EINVAL;
+
+	/* we are already inside transaction region, return */
+	if (rte_xtest() != 0)
+		return 0;
+
+	/* start new transaction region */
+	rc = rte_xbegin();
+
+	/* transaction abort, possible write to one of wait addresses */
+	if (rc != RTE_XBEGIN_STARTED)
+		return 0;
+
+	/*
+	 * the mere act of reading the lock status here adds the lock to
+	 * the read set. This means that when we trigger a wakeup from another
+	 * thread, even if we don't have a defined wakeup address and thus don't
+	 * actually cause any writes, the act of locking our lock will itself
+	 * trigger the wakeup and abort the transaction.
+	 */
+	rte_spinlock_is_locked(&s->lock);
+
+	/*
+	 * add all addresses to wait on into transaction read-set and check if
+	 * any of wakeup conditions are already met.
+	 */
+	rc = 0;
+	for (i = 0; i < num; i++) {
+		const struct rte_power_monitor_cond *c = &pmc[i];
+
+		/* cannot be NULL */
+		if (c->fn == NULL) {
+			rc = -EINVAL;
+			break;
+		}
+
+		const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+		/* abort if callback indicates that we need to stop */
+		if (c->fn(val, c->opaque) != 0)
+			break;
+	}
+
+	/* none of the conditions were met, sleep until timeout */
+	if (i == num)
+		rte_power_pause(tsc_timestamp);
+
+	/* end transaction region */
+	rte_xend();
+
+	return rc;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
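The core of `rte_power_monitor_multi()` above is the loop that evaluates every condition's callback before deciding whether to sleep. A standalone model of that loop (struct and callback shapes mirror the patch, but the RTM transaction, the read-set side effects, and the actual `rte_power_pause()` sleep are omitted — the return value simply reports which branch would be taken):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define RTE_POWER_MONITOR_OPAQUE_SZ 4

typedef int (*monitor_clb_t)(const uint64_t val,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);

/* simplified stand-in for struct rte_power_monitor_cond (size omitted;
 * reads are modeled as 32-bit for brevity) */
struct monitor_cond {
	const volatile void *addr;
	monitor_clb_t fn;
	uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
};

/* example condition: abort if the monitored value is non-zero */
static int
nonzero_clb(const uint64_t val,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
{
	(void)opaque;
	return val != 0 ? -1 : 0;
}

/*
 * Mirrors the check loop: returns 1 if all conditions allow sleeping
 * (the i == num case, where the real code calls rte_power_pause()),
 * 0 if some callback aborted, and -EINVAL if a condition has no callback.
 */
static int
check_conditions(const struct monitor_cond pmc[], uint32_t num)
{
	uint32_t i;

	for (i = 0; i < num; i++) {
		const struct monitor_cond *c = &pmc[i];

		/* callback cannot be NULL */
		if (c->fn == NULL)
			return -EINVAL;

		const uint64_t val = *(const volatile uint32_t *)c->addr;

		/* abort if callback indicates that we need to stop */
		if (c->fn(val, c->opaque) != 0)
			return 0;
	}
	return 1; /* none of the conditions were met: sleep until timeout */
}
```

In the real implementation this loop runs inside an `rte_xbegin()`/`rte_xend()` transaction, so the very act of reading each address adds it to the RTM read-set and a later write to any of them aborts the transaction, ending the TPAUSE sleep.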

* [dpdk-dev] [PATCH v4 4/7] power: remove thread safety from PMD power API's
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
                         ` (2 preceding siblings ...)
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-06-28 15:54       ` Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
                         ` (3 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus

Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.

We could have used something like RCU for this use case, but absent a
pressing need for thread safety we'll go the easy way and just mandate
that the APIs are to be called when all affected ports are stopped, and
document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Add check for stopped queue
    - Clarified doc message
    - Added release notes

 doc/guides/rel_notes/release_21_08.rst |   5 +
 lib/power/meson.build                  |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 133 ++++++++++---------------
 lib/power/rte_power_pmd_mgmt.h         |   6 ++
 4 files changed, 67 insertions(+), 80 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
 
 * eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
+* rte_power: The experimental PMD power management API is no longer considered
+  to be thread safe; all Rx queues affected by the API will now need to be
+  stopped before making any changes to the power management scheme.
+
+
 ABI Changes
 -----------
 
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
         'rte_power_pmd_mgmt.h',
         'rte_power_guest_channel.h',
 )
+if cc.has_argument('-Wno-cast-qual')
+    cflags += '-Wno-cast-qual'
+endif
 deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
 	/**< Callback mode for this queue */
 	const struct rte_eth_rxtx_callback *cur_cb;
 	/**< Callback instance */
-	volatile bool umwait_in_progress;
-	/**< are we currently sleeping? */
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 			struct rte_power_monitor_cond pmc;
 			uint16_t ret;
 
-			/*
-			 * we might get a cancellation request while being
-			 * inside the callback, in which case the wakeup
-			 * wouldn't work because it would've arrived too early.
-			 *
-			 * to get around this, we notify the other thread that
-			 * we're sleeping, so that it can spin until we're done.
-			 * unsolicited wakeups are perfectly safe.
-			 */
-			q_conf->umwait_in_progress = true;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-			/* check if we need to cancel sleep */
-			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
-				/* use monitoring condition to sleep */
-				ret = rte_eth_get_monitor_addr(port_id, qidx,
-						&pmc);
-				if (ret == 0)
-					rte_power_monitor(&pmc, UINT64_MAX);
-			}
-			q_conf->umwait_in_progress = false;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+			/* use monitoring condition to sleep */
+			ret = rte_eth_get_monitor_addr(port_id, qidx,
+					&pmc);
+			if (ret == 0)
+				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
 		q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 	return nb_rx;
 }
 
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+	struct rte_eth_rxq_info qinfo;
+
+	if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+		return -1;
+
+	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
 	struct pmd_queue_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
+	rte_rx_callback_fn clb;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
 	queue_cfg = &port_cfg[port_id][queue_id];
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->umwait_in_progress = false;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* ensure we update our state before callback starts */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_umwait, NULL);
+		clb = clb_umwait;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
-				queue_id, clb_scale_freq, NULL);
+		clb = clb_scale_freq;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (global_data.tsc_per_us == 0)
 			calc_tsc();
 
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_pause, NULL);
+		clb = clb_pause;
 		break;
+	default:
+		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+		ret = -EINVAL;
+		goto end;
 	}
+
+	/* initialize data before enabling the callback */
+	queue_cfg->empty_poll_stats = 0;
+	queue_cfg->cb_mode = mode;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, NULL);
+
 	ret = 0;
 end:
 	return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
 	struct pmd_queue_cfg *queue_cfg;
+	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
 
 	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
 		return -EINVAL;
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
+
 	/* no need to check queue id as wrong queue id would not be enabled */
 	queue_cfg = &port_cfg[port_id][queue_id];
 
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	/* stop any callbacks from progressing */
 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
-	/* ensure we update our state before continuing */
-	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
 	switch (queue_cfg->cb_mode) {
-	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		bool exit = false;
-		do {
-			/*
-			 * we may request cancellation while the other thread
-			 * has just entered the callback but hasn't started
-			 * sleeping yet, so keep waking it up until we know it's
-			 * done sleeping.
-			 */
-			if (queue_cfg->umwait_in_progress)
-				rte_power_monitor_wakeup(lcore_id);
-			else
-				exit = true;
-		} while (!exit);
-	}
-	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		rte_eth_remove_rx_callback(port_id, queue_id,
 				queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		break;
 	}
 	/*
-	 * we don't free the RX callback here because it is unsafe to do so
-	 * unless we know for a fact that all data plane threads have stopped.
+	 * the API doc mandates that the user stops all processing on affected
+	 * ports before calling any of these API's, so we can assume that the
+	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	queue_cfg->cur_cb = NULL;
+	rte_free((void *)queue_cfg->cur_cb);
 
 	return 0;
 }
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue will be polled from.
  * @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue is polled from.
  * @param port_id
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
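The stopped-queue guard this patch adds to both enable and disable maps queue state onto error codes. A minimal sketch of that mapping, with the `rte_eth_rx_queue_info_get()` lookup stubbed out by an enum (the -EINVAL/-EBUSY convention mirrors the patch; everything else is illustrative):

```c
#include <assert.h>
#include <errno.h>

enum queue_state {
	QUEUE_INVALID = -1, /* queue info lookup failed */
	QUEUE_STARTED = 0,
	QUEUE_STOPPED = 1,
};

/* stand-in for queue_stopped(): <0 invalid queue, 0 running, 1 stopped */
static int
queue_stopped(enum queue_state st)
{
	return (int)st;
}

/* mirrors the guard at the top of the enable/disable paths */
static int
pmgmt_guard(enum queue_state st)
{
	int ret = queue_stopped(st);

	if (ret != 1) {
		/* error means invalid queue, 0 means queue wasn't stopped */
		return ret < 0 ? -EINVAL : -EBUSY;
	}
	return 0; /* safe to install or free the Rx callback */
}
```

Because the guard rejects running queues with -EBUSY, the disable path can safely `rte_free()` the callback instead of leaking it, as no data-plane thread can be inside it.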

* [dpdk-dev] [PATCH v4 5/7] power: support callbacks for multiple Rx queues
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
                         ` (3 preceding siblings ...)
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-28 15:54       ` Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 6/7] power: support monitoring " Anatoly Burakov
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev, David Hunt, Ray Kinsella, Neil Horman
  Cc: konstantin.ananyev, ciara.loftus

Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.

The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:

- Replace per-queue structures with per-lcore ones, so that any device
  polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
  added to the list of cores to poll, so that the callback is aware of
  other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
  shared between all queues polled on a particular lcore, and are only
  activated when a special designated "power saving" queue is polled. To
  put it another way, we have no idea which queue the user will poll in
  what order, so we rely on them telling us that queue X is the last one
  in the polling loop, so any power management should happen there.
- A new API is added to mark a specific Rx queue as "power saving".
  Failing to call this API will result in no power management; however,
  when there is only one queue per core it is obvious which queue is the
  "power saving" one, so things will still work without this new API for
  use cases that were previously working without it.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
  is incapable of monitoring more than one address.

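The per-lcore sharing described in the list above can be sketched in standalone C. Everything here is illustrative — the struct, the threshold value, and the exact counter semantics are assumptions for the sketch, not the patch's actual data layout — but it shows the key design point: the shared counter only advances, and a power-saving action is only considered, when the designated "power saving" queue is the one being polled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMPTY_POLL_THRESHOLD 4 /* illustrative value */

/* per-lcore state shared by all queues polled from that lcore */
struct lcore_pmgmt {
	uint64_t empty_poll_stats; /* shared empty-poll counter */
	int power_save_qid;        /* designated "power saving" queue */
};

/*
 * Returns true if a power-saving action would be taken for this poll.
 * Sleep is only considered on the designated queue -- the one the user
 * told us is last in the polling loop.
 */
static bool
rx_callback_model(struct lcore_pmgmt *st, int qid, uint16_t nb_rx)
{
	if (nb_rx > 0) {
		/* traffic seen on any queue resets the shared counter */
		st->empty_poll_stats = 0;
		return false;
	}
	if (qid != st->power_save_qid)
		return false; /* not the designated queue, keep polling */
	return ++st->empty_poll_stats > EMPTY_POLL_THRESHOLD;
}
```

This is why the designated-queue API matters: without knowing which queue closes the polling loop, the callback could sleep before the lcore's other queues had been drained.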
Also, while we're at it, update and improve the docs.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v3:
    - Move the list of supported NICs to NIC feature table
    
    v2:
    - Use a TAILQ for queues instead of a static array
    - Address feedback from Konstantin
    - Add additional checks for stopped queues

 doc/guides/nics/features.rst           |  10 +
 doc/guides/prog_guide/power_man.rst    |  75 +++--
 doc/guides/rel_notes/release_21_08.rst |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 381 ++++++++++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h         |  34 +++
 lib/power/version.map                  |   3 +
 6 files changed, 412 insertions(+), 94 deletions(-)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
 * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
 * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
 
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
 .. _nic_features_other:
 
 Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..fac2c19516 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,41 @@ Ethernet PMD Power Management API
 Abstract
 ~~~~~~~~
 
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
-   This power saving scheme will put the CPU into optimized power state
-   and use the ``rte_power_monitor()`` function
-   to monitor the Ethernet PMD RX descriptor address,
-   and wake the CPU up whenever there's new traffic.
-
-Pause
-   This power saving scheme will avoid busy polling
-   by either entering power-optimized sleep state
-   with ``rte_power_pause()`` function,
-   or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
-   This power saving scheme will use ``librte_power`` library
-   functionality to scale the core frequency up/down
-   depending on traffic volume.
-
-.. note::
-
-   Currently, this power management API is limited to mandatory mapping
-   of 1 queue to 1 core (multiple queues are supported,
-   but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of them. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+   This power saving scheme will put the CPU into optimized power state and
+   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+   there's new traffic. Support for this scheme may not be available on all
+   platforms, and further limitations may apply (see below).
+
+* Pause
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+   This power saving scheme will use ``librte_power`` library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+  limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
+  monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+  ``rte_power_monitor()`` function is not supported, then monitor mode will not
+  be supported.
+
+* Not all Ethernet devices support monitoring, even if the underlying
+  platform may support the necessary CPU instructions. Please refer to
+  :doc:`../nics/overview` for more information.
+
 
 API Overview for Ethernet PMD Power Management
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -234,6 +241,16 @@ API Overview for Ethernet PMD Power Management
 
 * **Queue Disable**: Disable power scheme for certain queue/port/core.
 
+* **Set Power Save Queue**: In case of polling multiple queues from one lcore,
+  designate a specific queue to be the one that triggers power management routines.
+
+.. note::
+
+   When using PMD power management with multiple Ethernet Rx queues on one lcore,
+   it is required to designate one of the configured Rx queues as a "power save"
+   queue by calling the appropriate API. Failing to do so will result in no
+   power saving ever taking effect.
+
 References
 ----------
 
@@ -242,3 +259,5 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@ New Features
 
 * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
 
+* rte_power: The experimental PMD power management API now supports managing
+  multiple Ethernet Rx queues per lcore.
+
 
 Removed Items
 -------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..7762cd39b8 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,7 +33,28 @@ enum pmd_mgmt_state {
 	PMD_MGMT_ENABLED
 };
 
-struct pmd_queue_cfg {
+union queue {
+	uint32_t val;
+	struct {
+		uint16_t portid;
+		uint16_t qid;
+	};
+};
+
+struct queue_list_entry {
+	TAILQ_ENTRY(queue_list_entry) next;
+	union queue queue;
+};
+
+struct pmd_core_cfg {
+	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+	/**< Which port-queue pairs are associated with this lcore? */
+	union queue power_save_queue;
+	/**< When polling multiple queues, all but this one will be ignored */
+	bool power_save_queue_set;
+	/**< When polling multiple queues, power save queue must be set */
+	size_t n_queues;
+	/**< How many queues are in the list? */
 	volatile enum pmd_mgmt_state pwr_mgmt_state;
 	/**< State of power management for this queue */
 	enum rte_power_pmd_mgmt_type cb_mode;
@@ -43,8 +64,96 @@ struct pmd_queue_cfg {
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
 
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+	return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+	dst->val = src->val;
+}
+
+static inline bool
+queue_is_power_save(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	const union queue *pwrsave = &cfg->power_save_queue;
+
+	/* if there's only single queue, no need to check anything */
+	if (cfg->n_queues == 1)
+		return true;
+	return cfg->power_save_queue_set && queue_equal(q, pwrsave);
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *cur;
+
+	TAILQ_FOREACH(cur, &cfg->head, next) {
+		if (queue_equal(&cur->queue, q))
+			return cur;
+	}
+	return NULL;
+}
+
+static int
+queue_set_power_save(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	const struct queue_list_entry *found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return -ENOENT;
+	queue_copy(&cfg->power_save_queue, q);
+	cfg->power_save_queue_set = true;
+	return 0;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *qle;
+
+	/* is it already in the list? */
+	if (queue_list_find(cfg, q) != NULL)
+		return -EEXIST;
+
+	qle = malloc(sizeof(*qle));
+	if (qle == NULL)
+		return -ENOMEM;
+
+	queue_copy(&qle->queue, q);
+	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+	cfg->n_queues++;
+
+	return 0;
+}
+
+static int
+queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *found;
+
+	found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return -ENOENT;
+
+	TAILQ_REMOVE(&cfg->head, found, next);
+	cfg->n_queues--;
+	free(found);
+
+	/* if this was a power save queue, unset it */
+	if (cfg->power_save_queue_set && queue_is_power_save(cfg, q)) {
+		union queue *pwrsave = &cfg->power_save_queue;
+		cfg->power_save_queue_set = false;
+		pwrsave->val = 0;
+	}
+
+	return 0;
+}
 
 static void
 calc_tsc(void)
@@ -79,10 +188,10 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
+	const unsigned int lcore = rte_lcore_id();
+	struct pmd_core_cfg *q_conf;
 
-	struct pmd_queue_cfg *q_conf;
-
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
 	if (unlikely(nb_rx == 0)) {
 		q_conf->empty_poll_stats++;
@@ -107,11 +216,26 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
 		void *addr __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		/* sleep for 1 microsecond */
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
@@ -127,8 +251,7 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 					rte_pause();
 			}
 		}
-	} else
-		q_conf->empty_poll_stats = 0;
+	}
 
 	return nb_rx;
 }
@@ -138,19 +261,33 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
 
-	q_conf = &port_cfg[port_id][qidx];
+	q_conf = &lcore_cfg[lcore];
 
-	if (unlikely(nb_rx == 0)) {
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+
+		/* scale up freq immediately */
+		rte_power_freq_max(rte_lcore_id());
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
 		q_conf->empty_poll_stats++;
 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
 			/* scale down freq */
 			rte_power_freq_min(rte_lcore_id());
-	} else {
-		q_conf->empty_poll_stats = 0;
-		/* scale up freq */
-		rte_power_freq_max(rte_lcore_id());
 	}
 
 	return nb_rx;
@@ -167,11 +304,79 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
 	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
 }
 
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+	const struct queue_list_entry *entry;
+
+	TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+		const union queue *q = &entry->queue;
+		int ret = queue_stopped(q->portid, q->qid);
+		if (ret != 1)
+			return ret;
+	}
+	return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+	enum power_management_env env;
+
+	/* only PSTATE and ACPI modes are supported */
+	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+		return -ENOTSUP;
+	}
+	/* ensure we could initialize the power library */
+	if (rte_power_init(lcore))
+		return -EINVAL;
+
+	/* ensure we initialized the correct env */
+	env = rte_power_get_env();
+	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+	struct rte_power_monitor_cond dummy;
+
+	/* check if rte_power_monitor is supported */
+	if (!global_data.intrinsics_support.power_monitor) {
+		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+		return -ENOTSUP;
+	}
+
+	if (cfg->n_queues > 0) {
+		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+		return -ENOTSUP;
+	}
+
+	/* check if the device supports the necessary PMD API */
+	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+			&dummy) == -ENOTSUP) {
+		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
 	rte_rx_callback_fn clb;
 	int ret;
@@ -202,9 +407,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(queue_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
+	/* if callback was already enabled, check current callback type */
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+			queue_cfg->cb_mode != mode) {
 		ret = -EINVAL;
 		goto end;
 	}
@@ -214,53 +429,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 
 	switch (mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		struct rte_power_monitor_cond dummy;
-
-		/* check if rte_power_monitor is supported */
-		if (!global_data.intrinsics_support.power_monitor) {
-			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_monitor(queue_cfg, &qdata);
+		if (ret < 0)
 			goto end;
-		}
 
-		/* check if the device supports the necessary PMD API */
-		if (rte_eth_get_monitor_addr(port_id, queue_id,
-				&dummy) == -ENOTSUP) {
-			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_umwait;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
-	{
-		enum power_management_env env;
-		/* only PSTATE and ACPI modes are supported */
-		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
-				!rte_power_check_env_supported(
-					PM_ENV_PSTATE_CPUFREQ)) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_scale(lcore_id);
+		if (ret < 0)
 			goto end;
-		}
-		/* ensure we could initialize the power library */
-		if (rte_power_init(lcore_id)) {
-			ret = -EINVAL;
-			goto end;
-		}
-		/* ensure we initialized the correct env */
-		env = rte_power_get_env();
-		if (env != PM_ENV_ACPI_CPUFREQ &&
-				env != PM_ENV_PSTATE_CPUFREQ) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_scale_freq;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		/* figure out various time-to-tsc conversions */
 		if (global_data.tsc_per_us == 0)
@@ -273,11 +455,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		ret = -EINVAL;
 		goto end;
 	}
+	/* add this queue to the list */
+	ret = queue_list_add(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+				strerror(-ret));
+		goto end;
+	}
 
 	/* initialize data before enabling the callback */
-	queue_cfg->empty_poll_stats = 0;
-	queue_cfg->cb_mode = mode;
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	if (queue_cfg->n_queues == 1) {
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	}
 	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
 			clb, NULL);
 
@@ -290,7 +481,8 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,13 +498,31 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	queue_cfg = &port_cfg[port_id][queue_id];
+	queue_cfg = &lcore_cfg[lcore_id];
+
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(queue_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
 		return -EINVAL;
 
-	/* stop any callbacks from progressing */
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	/*
+	 * There is no good/easy way to do this without race conditions, so we
+	 * are just going to throw our hands in the air and hope that the user
+	 * has read the documentation and has ensured that ports are stopped at
+	 * the time we enter the API functions.
+	 */
+	ret = queue_list_remove(queue_cfg, &qdata);
+	if (ret < 0)
+		return -ret;
+
+	/* if we've removed all queues from the lists, set state to disabled */
+	if (queue_cfg->n_queues == 0)
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
 	switch (queue_cfg->cb_mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
@@ -336,3 +546,42 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 
 	return 0;
 }
+
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	queue_cfg = &lcore_cfg[lcore_id];
+
+	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	ret = queue_set_power_save(queue_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to set power save queue: %s\n",
+			strerror(-ret));
+		return -ret;
+	}
+
+	return 0;
+}
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+	size_t i;
+
+	/* initialize all tailqs */
+	for (i = 0; i < RTE_DIM(lcore_cfg); i++) {
+		struct pmd_core_cfg *cfg = &lcore_cfg[i];
+		TAILQ_INIT(&cfg->head);
+	}
+}
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 444e7b8a66..d6ef8f778a 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -90,6 +90,40 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Set a specific Ethernet device Rx queue to be the "power save" queue for a
+ * particular lcore. When multiple queues are assigned to a single lcore using
+ * the `rte_power_ethdev_pmgmt_queue_enable` API, only one of them will trigger
+ * the power management. In a typical scenario, the last queue to be polled on
+ * a particular lcore should be designated as power save queue.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @note When using multiple queues per lcore, calling this function is
+ *   mandatory. If not called, no power management routines would be triggered
+ *   when the traffic starts.
+ *
+ * @warning This function must be called when all affected Ethernet ports are
+ *   stopped and no Rx/Tx is in progress!
+ *
+ * @param lcore_id
+ *   The lcore the Rx queue is polled from.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/power/version.map b/lib/power/version.map
index b004e3e4a9..105d1d94c2 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -38,4 +38,7 @@ EXPERIMENTAL {
 	# added in 21.02
 	rte_power_ethdev_pmgmt_queue_disable;
 	rte_power_ethdev_pmgmt_queue_enable;
+
+	# added in 21.08
+	rte_power_ethdev_pmgmt_queue_set_power_save;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v4 6/7] power: support monitoring multiple Rx queues
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
                         ` (4 preceding siblings ...)
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-28 15:54       ` Anatoly Burakov
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus

Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v4:
    - Fix possible out of bounds access
    - Added missing index increment

 doc/guides/prog_guide/power_man.rst |  9 ++--
 lib/power/rte_power_pmd_mgmt.c      | 84 ++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index fac2c19516..3245a5ebed 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
 The "monitor" mode is only supported in the following configurations and scenarios:
 
 * If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor_multi()`` function is supported by the platform, then
+  monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
   ``rte_power_monitor()`` is supported by the platform, then monitoring will be
   limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
   monitored from a different lcore).
 
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
-  ``rte_power_monitor()`` function is not supported, then monitor mode will not
-  be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+  two monitoring functions are supported, then monitor mode will not be supported.
 
 * Not all Ethernet devices support monitoring, even if the underlying
   platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 7762cd39b8..97c9f1ea36 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -155,6 +155,32 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
 	return 0;
 }
 
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+		struct rte_power_monitor_cond *pmc, size_t len)
+{
+	const struct queue_list_entry *qle;
+	size_t i = 0;
+	int ret;
+
+	TAILQ_FOREACH(qle, &cfg->head, next) {
+		const union queue *q = &qle->queue;
+		struct rte_power_monitor_cond *cur;
+
+		/* attempted out of bounds access */
+		if (i >= len) {
+			RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+			return -1;
+		}
+
+		cur = &pmc[i++];
+		ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+		if (ret < 0)
+			return ret;
+	}
+	return 0;
+}
+
 static void
 calc_tsc(void)
 {
@@ -183,6 +209,48 @@ calc_tsc(void)
 	}
 }
 
+static uint16_t
+clb_multiwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	const unsigned int lcore = rte_lcore_id();
+	const union queue q = {.portid = port_id, .qid = qidx};
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *q_conf;
+
+	q_conf = &lcore_cfg[lcore];
+
+	/* early exit */
+	if (likely(!empty)) {
+		q_conf->empty_poll_stats = 0;
+	} else {
+		/* do we care about this particular queue? */
+		if (!queue_is_power_save(q_conf, &q))
+			return nb_rx;
+
+		/*
+		 * we can increment unconditionally here because if there were
+		 * non-empty polls in other queues assigned to this core, we
+		 * dropped the counter to zero anyway.
+		 */
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
+			uint16_t ret;
+
+			/* gather all monitoring conditions */
+			ret = get_monitor_addresses(q_conf, pmc, RTE_DIM(pmc));
+
+			if (ret == 0)
+				rte_power_monitor_multi(pmc,
+					q_conf->n_queues, UINT64_MAX);
+		}
+	}
+
+	return nb_rx;
+}
+
 static uint16_t
 clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
@@ -348,14 +416,19 @@ static int
 check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 {
 	struct rte_power_monitor_cond dummy;
+	bool multimonitor_supported;
 
 	/* check if rte_power_monitor is supported */
 	if (!global_data.intrinsics_support.power_monitor) {
 		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
 		return -ENOTSUP;
 	}
+	/* check if multi-monitor is supported */
+	multimonitor_supported =
+			global_data.intrinsics_support.power_monitor_multi;
 
-	if (cfg->n_queues > 0) {
+	/* if we're adding a new queue, do we support multiple queues? */
+	if (cfg->n_queues > 0 && !multimonitor_supported) {
 		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
 		return -ENOTSUP;
 	}
@@ -371,6 +444,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 	return 0;
 }
 
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+	return global_data.intrinsics_support.power_monitor_multi ?
+		clb_multiwait : clb_umwait;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -434,7 +514,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (ret < 0)
 			goto end;
 
-		clb = clb_umwait;
+		clb = get_monitor_callback();
 		break;
 	case RTE_POWER_MGMT_TYPE_SCALE:
 		/* check if we can add a new queue */
-- 
2.25.1



* [dpdk-dev] [PATCH v4 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
                         ` (5 preceding siblings ...)
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-28 15:54       ` Anatoly Burakov
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-28 15:54 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus

Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation, and always
mark the last queue in qconf as the power save queue.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 examples/l3fwd-power/main.c | 39 +++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..3057c06936 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2498,6 +2498,27 @@ mode_to_str(enum appmode mode)
 	}
 }
 
+static void
+pmd_pmgmt_set_up(unsigned int lcore, uint16_t portid, uint16_t qid, bool last)
+{
+	int ret;
+
+	ret = rte_power_ethdev_pmgmt_queue_enable(lcore, portid,
+			qid, pmgmt_type);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
+			ret, portid);
+
+	if (!last)
+		return;
+	ret = rte_power_ethdev_pmgmt_queue_set_power_save(lcore, portid, qid);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE,
+			"rte_power_ethdev_pmgmt_queue_set_power_save: err=%d, port=%d\n",
+			ret, portid);
+}
+
 int
 main(int argc, char **argv)
 {
@@ -2723,12 +2744,6 @@ main(int argc, char **argv)
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
 
-		/* PMD power management mode can only do 1 queue per core */
-		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
-			rte_exit(EXIT_FAILURE,
-				"In PMD power management mode, only one queue per lcore is allowed\n");
-		}
-
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
@@ -2767,15 +2782,9 @@ main(int argc, char **argv)
 						 "Fail to add ptype cb\n");
 			}
 
-			if (app_mode == APP_MODE_PMD_MGMT) {
-				ret = rte_power_ethdev_pmgmt_queue_enable(
-						lcore_id, portid, queueid,
-						pmgmt_type);
-				if (ret < 0)
-					rte_exit(EXIT_FAILURE,
-						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
-							ret, portid);
-			}
+			if (app_mode == APP_MODE_PMD_MGMT)
+				pmd_pmgmt_set_up(lcore_id, portid, queueid,
+					queue == (qconf->n_rx_queue - 1));
 		}
 	}
 
-- 
2.25.1



* Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-28 14:09         ` Burakov, Anatoly
@ 2021-06-29  0:07           ` Ananyev, Konstantin
  2021-06-29 11:05             ` Burakov, Anatoly
  2021-06-29 11:39             ` Burakov, Anatoly
  0 siblings, 2 replies; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-29  0:07 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara



> >> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
> >> Rx queues while entering the energy efficient power state. The multi
> >> version will be used unconditionally if supported, and the UMWAIT one
> >> will only be used when multi-monitor is not supported by the hardware.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   doc/guides/prog_guide/power_man.rst |  9 ++--
> >>   lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
> >>   2 files changed, 80 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> >> index fac2c19516..3245a5ebed 100644
> >> --- a/doc/guides/prog_guide/power_man.rst
> >> +++ b/doc/guides/prog_guide/power_man.rst
> >> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
> >>   The "monitor" mode is only supported in the following configurations and scenarios:
> >>
> >>   * If ``rte_cpu_get_intrinsics_support()`` function indicates that
> >> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
> >> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
> >> +
> >> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
> >>     ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> >>     limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
> >>     monitored from a different lcore).
> >>
> >> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> >> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
> >> -  be supported.
> >> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
> >> +  two monitoring functions are supported, then monitor mode will not be supported.
> >>
> >>   * Not all Ethernet devices support monitoring, even if the underlying
> >>     platform may support the necessary CPU instructions. Please refer to
> >> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> >> index 7762cd39b8..aab2d4f1ee 100644
> >> --- a/lib/power/rte_power_pmd_mgmt.c
> >> +++ b/lib/power/rte_power_pmd_mgmt.c
> >> @@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
> >>        return 0;
> >>   }
> >>
> >> +static inline int
> >> +get_monitor_addresses(struct pmd_core_cfg *cfg,
> >> +             struct rte_power_monitor_cond *pmc)
> >> +{
> >> +     const struct queue_list_entry *qle;
> >> +     size_t i = 0;
> >> +     int ret;
> >> +
> >> +     TAILQ_FOREACH(qle, &cfg->head, next) {
> >> +             struct rte_power_monitor_cond *cur = &pmc[i];
> >
> > Looks like you never increment 'i' value inside that function.
> > Also it probably will be safer to add 'num' parameter to check that
> > we will never over-run pmc[] boundaries.
> 
> Will fix in v4, good catch!
> 
> >
> >> +             const union queue *q = &qle->queue;
> >> +             ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
> >> +             if (ret < 0)
> >> +                     return ret;
> >> +     }
> >> +     return 0;
> >> +}
> >> +
> >>   static void
> >>   calc_tsc(void)
> >>   {
> >> @@ -183,6 +201,48 @@ calc_tsc(void)
> >>        }
> >>   }
> >>
> >> +static uint16_t
> >> +clb_multiwait(uint16_t port_id, uint16_t qidx,
> >> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >> +             uint16_t max_pkts __rte_unused, void *addr __rte_unused)
> >> +{
> >> +     const unsigned int lcore = rte_lcore_id();
> >> +     const union queue q = {.portid = port_id, .qid = qidx};
> >> +     const bool empty = nb_rx == 0;
> >> +     struct pmd_core_cfg *q_conf;
> >> +
> >> +     q_conf = &lcore_cfg[lcore];
> >> +
> >> +     /* early exit */
> >> +     if (likely(!empty)) {
> >> +             q_conf->empty_poll_stats = 0;
> >> +     } else {
> >> +             /* do we care about this particular queue? */
> >> +             if (!queue_is_power_save(q_conf, &q))
> >> +                     return nb_rx;
> >
> > I still don't understand the need of 'special' power_save queue here...
> > Why we can't just have a function:
> >
> > get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(struct pmd_core_cfg *lcore_cfg),
> > and then just:
> >
> > /* all queues have at least EMPTYPOLL_MAX sequential empty polls */
> > if (get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(q_conf) == 0) {
> >      /* go into power-save mode here */
> > }
> 
> Okay, let's go through this step by step :)
> 
> Let's suppose we have three queues - q0, q1 and q2. We want to sleep
> whenever there's no traffic on *all of them*, however we cannot know
> that until we have checked all of them.
> 
> So, let's suppose that q0, q1 and q2 were empty all this time, but now
> some traffic arrived at q2 while we're still checking q0. We see that q0
> is empty, and all of the queues were empty for the last N polls, so we
> think we will be safe to sleep at q0 despite the fact that traffic has
> just arrived at q2.
> This is not an issue with MONITOR mode because we will be able to see if
> current Rx ring descriptor is busy or not via the NIC callback, *but
> this is not possible* with PAUSE and SCALE modes, because they don't
> have the sneaky lookahead function of MONITOR! So, with PAUSE and SCALE
> modes, it is possible to end up in a situation where you *think* you
> don't have any traffic, but you actually do, you just haven't checked
> the relevant queue yet.

I think such a situation is unavoidable.
Yes, traffic can arrive on *any* queue at *any* time.
With your example above - the user chose q2 as the 'special' queue, but
traffic actually arrives on q0 or q1.
And yes, if the user chooses the PAUSE or SCALE methods, they *can* miss traffic,
because, as you said, these methods have no notification mechanism.
I think these are just unavoidable limitations of these power-save methods.
 
> In order to prevent this from happening, we do not sleep on every queue,
> instead we sleep *once* per loop. 

Yes, I totally agree we shouldn't sleep on *every* queue.
We need to go to sleep when there is no traffic on *any* of the queues we monitor.

> That is, we check q0, check q1, check
> q2, and only then we decide whether we want to sleep or not.

> Of course, with such scheme it is still possible to e.g. sleep in q2
> while there's traffic waiting in q0,

Yes, exactly.

> but worst case is less bad with
> this scheme, because we'll be doing at worst 1 extra sleep.

Hmm, I think it would be one extra sleep anyway.

> Whereas with what you're suggesting, if we had e.g. 10 queues to poll,
> and we checked q1 but traffic has just arrived at q0, we'll be sleeping
> at q1, then we'll be sleeping at q2, then we'll be sleeping at q3, then
> we'll be sleeping at q4, then we'll be sleeping at q5.... and 9 sleeps
> later we finally reach q0 and find out after all this time that we
> shouldn't have slept in the first place.

Ah ok, I think I understand now what you are saying.
Sure, to avoid such a situation, we'll need to maintain extra counters and
update them properly when we go to sleep.
I should have stated that clearly at the beginning.
It might be easier to explain what I meant with a code snippet:

lcore_conf needs 2 counters:
uint64_t   nb_queues_ready_to_sleep;
uint64_t   nb_sleeps;
 
Plus each queue needs 2 counters:
uint64_t nb_empty_polls;
uint64_t nb_sleeps;

Now, at rx_callback():

/*
 * check whether a sleep happened since the previous call;
 * if yes, then reset the queue counters
 */
if (queue->nb_sleeps != lcore_conf->nb_sleeps) {
    queue->nb_sleeps = lcore_conf->nb_sleeps;
    queue->nb_empty_polls = 0;
}

/* packet arrived, reset counters */
if (nb_rx != 0) {
    /* queue is not 'ready_to_sleep' any more */
    if (queue->nb_empty_polls > EMPTYPOLL_MAX)
        lcore_conf->nb_queues_ready_to_sleep--;
    queue->nb_empty_polls = 0;

/* empty poll */
} else {
    /* queue reached the EMPTYPOLL_MAX threshold, mark it as 'ready_to_sleep' */
    if (queue->nb_empty_polls == EMPTYPOLL_MAX)
        lcore_conf->nb_queues_ready_to_sleep++;
    queue->nb_empty_polls++;
}

/* no traffic on any queue for at least EMPTYPOLL_MAX iterations */
if (lcore_conf->nb_queues_ready_to_sleep == lcore_conf->n_queues) {
    /* update counters and sleep */
    lcore_conf->nb_sleeps++;
    lcore_conf->nb_queues_ready_to_sleep = 0;
    goto_sleep();
}

> Hopefully you get the point now :)
> 
> So, the idea here is, for any N queues, sleep only once, not N times.
> 
> >
> >> +
> >> +             /*
> >> +              * we can increment unconditionally here because if there were
> >> +              * non-empty polls in other queues assigned to this core, we
> >> +              * dropped the counter to zero anyway.
> >> +              */
> >> +             q_conf->empty_poll_stats++;
> >> +             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> >> +                     struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
> >
> > I think you need here:
> > struct rte_power_monitor_cond pmc[q_conf->n_queues];
> 
> I think VLA's are generally agreed upon to be unsafe, so i'm avoiding
> them here.

Wonder why?
These days DPDK uses VLAs in dozens of places...
But if you'd like to avoid VLAs, you can use alloca(),
or have lcore_conf->pmc[] and realloc() it when a queue is
added to or removed from the list.

> 
> >
> >
> >> +                     uint16_t ret;
> >> +
> >> +                     /* gather all monitoring conditions */
> >> +                     ret = get_monitor_addresses(q_conf, pmc);
> >> +
> >> +                     if (ret == 0)
> >> +                             rte_power_monitor_multi(pmc,
> >> +                                     q_conf->n_queues, UINT64_MAX);
> >> +             }
> >> +     }
> >> +
> >> +     return nb_rx;
> >> +}
> >> +
> >>   static uint16_t
> >>   clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> >>                uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> >> @@ -348,14 +408,19 @@ static int
> >>   check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
> >>   {
> >>        struct rte_power_monitor_cond dummy;
> >> +     bool multimonitor_supported;
> >>
> >>        /* check if rte_power_monitor is supported */
> >>        if (!global_data.intrinsics_support.power_monitor) {
> >>                RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> >>                return -ENOTSUP;
> >>        }
> >> +     /* check if multi-monitor is supported */
> >> +     multimonitor_supported =
> >> +                     global_data.intrinsics_support.power_monitor_multi;
> >>
> >> -     if (cfg->n_queues > 0) {
> >> +     /* if we're adding a new queue, do we support multiple queues? */
> >> +     if (cfg->n_queues > 0 && !multimonitor_supported) {
> >>                RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
> >>                return -ENOTSUP;
> >>        }
> >> @@ -371,6 +436,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
> >>        return 0;
> >>   }
> >>
> >> +static inline rte_rx_callback_fn
> >> +get_monitor_callback(void)
> >> +{
> >> +     return global_data.intrinsics_support.power_monitor_multi ?
> >> +             clb_multiwait : clb_umwait;
> >> +}
> >> +
> >>   int
> >>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> >>                uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
> >> @@ -434,7 +506,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> >>                if (ret < 0)
> >>                        goto end;
> >>
> >> -             clb = clb_umwait;
> >> +             clb = get_monitor_callback();
> >>                break;
> >>        case RTE_POWER_MGMT_TYPE_SCALE:
> >>                /* check if we can add a new queue */
> >> --
> >> 2.25.1
> >
> 
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-29  0:07           ` Ananyev, Konstantin
@ 2021-06-29 11:05             ` Burakov, Anatoly
  2021-06-29 11:39             ` Burakov, Anatoly
  1 sibling, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-29 11:05 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 29-Jun-21 1:07 AM, Ananyev, Konstantin wrote:

>>>> +static uint16_t
>>>> +clb_multiwait(uint16_t port_id, uint16_t qidx,
>>>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>>> +             uint16_t max_pkts __rte_unused, void *addr __rte_unused)
>>>> +{
>>>> +     const unsigned int lcore = rte_lcore_id();
>>>> +     const union queue q = {.portid = port_id, .qid = qidx};
>>>> +     const bool empty = nb_rx == 0;
>>>> +     struct pmd_core_cfg *q_conf;
>>>> +
>>>> +     q_conf = &lcore_cfg[lcore];
>>>> +
>>>> +     /* early exit */
>>>> +     if (likely(!empty)) {
>>>> +             q_conf->empty_poll_stats = 0;
>>>> +     } else {
>>>> +             /* do we care about this particular queue? */
>>>> +             if (!queue_is_power_save(q_conf, &q))
>>>> +                     return nb_rx;
>>>
>>> I still don't understand the need of 'special' power_save queue here...
>>> Why we can't just have a function:
>>>
>>> get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(struct pmd_core_cfg *lcore_cfg),
>>> and then just:
>>>
>>> /* all queues have at least EMPTYPOLL_MAX sequential empty polls */
>>> if (get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(q_conf) == 0) {
>>>       /* go into power-save mode here */
>>> }
>>
>> Okay, let's go through this step by step :)
>>
>> Let's suppose we have three queues - q0, q1 and q2. We want to sleep
>> whenever there's no traffic on *all of them*, however we cannot know
>> that until we have checked all of them.
>>
>> So, let's suppose that q0, q1 and q2 were empty all this time, but now
>> some traffic arrived at q2 while we're still checking q0. We see that q0
>> is empty, and all of the queues were empty for the last N polls, so we
>> think we will be safe to sleep at q0 despite the fact that traffic has
>> just arrived at q2.
>> This is not an issue with MONITOR mode because we will be able to see if
>> current Rx ring descriptor is busy or not via the NIC callback, *but
>> this is not possible* with PAUSE and SCALE modes, because they don't
>> have the sneaky lookahead function of MONITOR! So, with PAUSE and SCALE
>> modes, it is possible to end up in a situation where you *think* you
>> don't have any traffic, but you actually do, you just haven't checked
>> the relevant queue yet.
> 
> I think such situation is unavoidable.
> Yes, traffic can arrive to *any* queue at *any* time.
> With your example above - user choose q2 as 'special' queue, but
> traffic actually arrives on q0 or q1.
> And yes, if user choose PAUSE or SCALE methods he *can* miss the traffic,
> because as you said for these methods there is no notification mechanisms.
> I think there are just unavoidable limitations with these power-save methods.
> 
>> In order to prevent this from happening, we do not sleep on every queue,
>> instead we sleep *once* per loop.
> 
> Yes, totally agree we shouldn't sleep on *every* queue.
> We need to go to sleep when there is no traffic on *any* of queues we monitor.
> 
>> That is, we check q0, check q1, check
>> q2, and only then we decide whether we want to sleep or not.
> 
>> Of course, with such scheme it is still possible to e.g. sleep in q2
>> while there's traffic waiting in q0,
> 
> Yes, exactly.
> 
>> but worst case is less bad with
>> this scheme, because we'll be doing at worst 1 extra sleep.
> 
> Hmm, I think it would be one extra sleep anyway.
> 
>> Whereas with what you're suggesting, if we had e.g. 10 queues to poll,
>> and we checked q1 but traffic has just arrived at q0, we'll be sleeping
>> at q1, then we'll be sleeping at q2, then we'll be sleeping at q3, then
>> we'll be sleeping at q4, then we'll be sleeping at q5.... and 9 sleeps
>> later we finally reach q0 and find out after all this time that we
>> shouldn't have slept in the first place.
> 
> Ah ok, I think I understand now what you are saying.
> Sure, to avoid such situation, we'll need to maintain extra counters and
> update them properly when we go to sleep.
> I should state it clearly at the beginning.
> It might be easier to explain what I meant by code snippet:
> 
> lcore_conf needs 2 counters:
> uint64_t   nb_queues_ready_to_sleep;
> uint64_t   nb_sleeps;
> 
> Plus each queue needs 2 counters:
> uint64_t nb_empty_polls;
> uint64_t nb_sleeps;
> 
> Now, at rx_callback():
> 
> /* check did sleep happen since previous call,
>       if yes, then reset queue counters */
> if (queue->nb_sleeps != lcore_conf->nb_sleeps) {
>      queue->nb_sleeps = lcore_conf->nb_sleeps;
>      queue->nb_empty_polls = 0;
> }
> 
>   /* packet arrived, reset counters */
>   if (nb_rx != 0) {
>     /* queue is not 'ready_to_sleep' any more */
>     if (queue->nb_empty_polls > EMPTYPOLL_MAX)
>         lcore_conf-> nb_queues_ready_to_sleep--;
>     queue->nb_empty_polls = 0;
> 
> /* empty poll */
> } else {
>      /* queue reaches EMPTYPOLL_MAX threshold, mark it as 'ready_to_sleep' */
>      if (queue->nb_empty_polls == EMPTYPOLL_MAX)
>         lcore_conf-> nb_queues_ready_to_sleep++;
>      queue->nb_empty_polls++;
> }
> 
>     /* no traffic on any queue for at least EMPTYPOLL_MAX iterations */
>     if (lcore_conf-> nb_queues_ready_to_sleep == lcore_conf->n_queues) {
>        /* update counters and sleep */
>        lcore_conf->nb_sleeps++;
>        lcore_conf-> nb_queues_ready_to_sleep = 0;
>        goto_sleep();
>     }
> }

OK, this sounds like it is actually doable :) I'll prototype and see if 
it works.

> 
>> Hopefully you get the point now :)
>>
>> So, the idea here is, for any N queues, sleep only once, not N times.
>>
>>>
>>>> +
>>>> +             /*
>>>> +              * we can increment unconditionally here because if there were
>>>> +              * non-empty polls in other queues assigned to this core, we
>>>> +              * dropped the counter to zero anyway.
>>>> +              */
>>>> +             q_conf->empty_poll_stats++;
>>>> +             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>>>> +                     struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
>>>
>>> I think you need here:
>>> struct rte_power_monitor_cond pmc[q_conf->n_queues];
>>
>> I think VLA's are generally agreed upon to be unsafe, so i'm avoiding
>> them here.
> 
> Wonder why?
> These days DPDK uses VLA in dozens of places...

Well, if that's the case, I can use it here also :) Realistically, the 
n_queues value will be very small, so it shouldn't be a big issue.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-29  0:07           ` Ananyev, Konstantin
  2021-06-29 11:05             ` Burakov, Anatoly
@ 2021-06-29 11:39             ` Burakov, Anatoly
  2021-06-29 12:14               ` Ananyev, Konstantin
  1 sibling, 1 reply; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-29 11:39 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 29-Jun-21 1:07 AM, Ananyev, Konstantin wrote:
> 
> 
>>>> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
>>>> Rx queues while entering the energy efficient power state. The multi
>>>> version will be used unconditionally if supported, and the UMWAIT one
>>>> will only be used when multi-monitor is not supported by the hardware.
>>>>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> ---
>>>>    doc/guides/prog_guide/power_man.rst |  9 ++--
>>>>    lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
>>>>    2 files changed, 80 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>>>> index fac2c19516..3245a5ebed 100644
>>>> --- a/doc/guides/prog_guide/power_man.rst
>>>> +++ b/doc/guides/prog_guide/power_man.rst
>>>> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>>>>    The "monitor" mode is only supported in the following configurations and scenarios:
>>>>
>>>>    * If ``rte_cpu_get_intrinsics_support()`` function indicates that
>>>> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
>>>> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
>>>> +
>>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>>>>      ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>>>>      limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>>>>      monitored from a different lcore).
>>>>
>>>> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>>>> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
>>>> -  be supported.
>>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
>>>> +  two monitoring functions are supported, then monitor mode will not be supported.
>>>>
>>>>    * Not all Ethernet devices support monitoring, even if the underlying
>>>>      platform may support the necessary CPU instructions. Please refer to
>>>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>>>> index 7762cd39b8..aab2d4f1ee 100644
>>>> --- a/lib/power/rte_power_pmd_mgmt.c
>>>> +++ b/lib/power/rte_power_pmd_mgmt.c
>>>> @@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
>>>>         return 0;
>>>>    }
>>>>
>>>> +static inline int
>>>> +get_monitor_addresses(struct pmd_core_cfg *cfg,
>>>> +             struct rte_power_monitor_cond *pmc)
>>>> +{
>>>> +     const struct queue_list_entry *qle;
>>>> +     size_t i = 0;
>>>> +     int ret;
>>>> +
>>>> +     TAILQ_FOREACH(qle, &cfg->head, next) {
>>>> +             struct rte_power_monitor_cond *cur = &pmc[i];
>>>
>>> Looks like you never increment 'i' value inside that function.
>>> Also it probably will be safer to add 'num' parameter to check that
>>> we will never over-run pmc[] boundaries.
>>
>> Will fix in v4, good catch!
>>
>>>
>>>> +             const union queue *q = &qle->queue;
>>>> +             ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
>>>> +             if (ret < 0)
>>>> +                     return ret;
>>>> +     }
>>>> +     return 0;
>>>> +}
>>>> +
>>>>    static void
>>>>    calc_tsc(void)
>>>>    {
>>>> @@ -183,6 +201,48 @@ calc_tsc(void)
>>>>         }
>>>>    }
>>>>
>>>> +static uint16_t
>>>> +clb_multiwait(uint16_t port_id, uint16_t qidx,
>>>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>>> +             uint16_t max_pkts __rte_unused, void *addr __rte_unused)
>>>> +{
>>>> +     const unsigned int lcore = rte_lcore_id();
>>>> +     const union queue q = {.portid = port_id, .qid = qidx};
>>>> +     const bool empty = nb_rx == 0;
>>>> +     struct pmd_core_cfg *q_conf;
>>>> +
>>>> +     q_conf = &lcore_cfg[lcore];
>>>> +
>>>> +     /* early exit */
>>>> +     if (likely(!empty)) {
>>>> +             q_conf->empty_poll_stats = 0;
>>>> +     } else {
>>>> +             /* do we care about this particular queue? */
>>>> +             if (!queue_is_power_save(q_conf, &q))
>>>> +                     return nb_rx;
>>>
>>> I still don't understand the need of 'special' power_save queue here...
>>> Why we can't just have a function:
>>>
>>> get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(struct pmd_core_cfg *lcore_cfg),
>>> and then just:
>>>
>>> /* all queues have at least EMPTYPOLL_MAX sequential empty polls */
>>> if (get_number_of_queues_whose_sequential_empty_polls_less_then_threshold(q_conf) == 0) {
>>>       /* go into power-save mode here */
>>> }
>>
>> Okay, let's go through this step by step :)
>>
>> Let's suppose we have three queues - q0, q1 and q2. We want to sleep
>> whenever there's no traffic on *all of them*, however we cannot know
>> that until we have checked all of them.
>>
>> So, let's suppose that q0, q1 and q2 were empty all this time, but now
>> some traffic arrived at q2 while we're still checking q0. We see that q0
>> is empty, and all of the queues were empty for the last N polls, so we
>> think we will be safe to sleep at q0 despite the fact that traffic has
>> just arrived at q2.
>> This is not an issue with MONITOR mode because we will be able to see if
>> current Rx ring descriptor is busy or not via the NIC callback, *but
>> this is not possible* with PAUSE and SCALE modes, because they don't
>> have the sneaky lookahead function of MONITOR! So, with PAUSE and SCALE
>> modes, it is possible to end up in a situation where you *think* you
>> don't have any traffic, but you actually do, you just haven't checked
>> the relevant queue yet.
> 
> I think such situation is unavoidable.
> Yes, traffic can arrive to *any* queue at *any* time.
> With your example above - user choose q2 as 'special' queue, but
> traffic actually arrives on q0 or q1.
> And yes, if user choose PAUSE or SCALE methods he *can* miss the traffic,
> because as you said for these methods there is no notification mechanisms.
> I think there are just unavoidable limitations with these power-save methods.
> 
>> In order to prevent this from happening, we do not sleep on every queue,
>> instead we sleep *once* per loop.
> 
> Yes, totally agree we shouldn't sleep on *every* queue.
> We need to go to sleep when there is no traffic on *any* of queues we monitor.
> 
>> That is, we check q0, check q1, check
>> q2, and only then we decide whether we want to sleep or not.
> 
>> Of course, with such scheme it is still possible to e.g. sleep in q2
>> while there's traffic waiting in q0,
> 
> Yes, exactly.
> 
>> but worst case is less bad with
>> this scheme, because we'll be doing at worst 1 extra sleep.
> 
> Hmm, I think it would be one extra sleep anyway.
> 
>> Whereas with what you're suggesting, if we had e.g. 10 queues to poll,
>> and we checked q1 but traffic has just arrived at q0, we'll be sleeping
>> at q1, then we'll be sleeping at q2, then we'll be sleeping at q3, then
>> we'll be sleeping at q4, then we'll be sleeping at q5.... and 9 sleeps
>> later we finally reach q0 and find out after all this time that we
>> shouldn't have slept in the first place.
> 
> Ah ok, I think I understand now what you are saying.
> Sure, to avoid such situation, we'll need to maintain extra counters and
> update them properly when we go to sleep.
> I should state it clearly at the beginning.
> It might be easier to explain what I meant by code snippet:
> 
> lcore_conf needs 2 counters:
> uint64_t   nb_queues_ready_to_sleep;
> uint64_t   nb_sleeps;
> 
> Plus each queue needs 2 counters:
> uint64_t nb_empty_polls;
> uint64_t nb_sleeps;
> 
> Now, at rx_callback():
> 
> /* check did sleep happen since previous call,
>       if yes, then reset queue counters */
> if (queue->nb_sleeps != lcore_conf->nb_sleeps) {
>      queue->nb_sleeps = lcore_conf->nb_sleeps;
>      queue->nb_empty_polls = 0;
> }
> 
>   /* packet arrived, reset counters */
>   if (nb_rx != 0) {
>     /* queue is not 'ready_to_sleep' any more */
>     if (queue->nb_empty_polls > EMPTYPOLL_MAX)
>         lcore_conf-> nb_queues_ready_to_sleep--;
>     queue->nb_empty_polls = 0;
> 
> /* empty poll */
> } else {
>      /* queue reaches EMPTYPOLL_MAX threshold, mark it as 'ready_to_sleep' */
>      if (queue->nb_empty_polls == EMPTYPOLL_MAX)
>         lcore_conf-> nb_queues_ready_to_sleep++;
>      queue->nb_empty_polls++;
> }
> 
>     /* no traffic on any queue for at least EMPTYPOLL_MAX iterations */
>     if (lcore_conf-> nb_queues_ready_to_sleep == lcore_conf->n_queues) {
>        /* update counters and sleep */
>        lcore_conf->nb_sleeps++;
>        lcore_conf-> nb_queues_ready_to_sleep = 0;
>        goto_sleep();
>     }
> }
> 
Actually, I don't think this is going to work, because I can see no 
(easy) way to get from an lcore to a specific queue. I mean, you could have 
an O(N) for loop that loops over the list of queues every time we 
enter the callback, but I don't think that's such a good idea.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-29 11:39             ` Burakov, Anatoly
@ 2021-06-29 12:14               ` Ananyev, Konstantin
  2021-06-29 13:23                 ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-29 12:14 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara



> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Tuesday, June 29, 2021 12:40 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Hunt, David <david.hunt@intel.com>
> Cc: Loftus, Ciara <ciara.loftus@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
> 
> On 29-Jun-21 1:07 AM, Ananyev, Konstantin wrote:
> >
> >
> >>>> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
> >>>> Rx queues while entering the energy efficient power state. The multi
> >>>> version will be used unconditionally if supported, and the UMWAIT one
> >>>> will only be used when multi-monitor is not supported by the hardware.
> >>>>
> >>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>> ---
> >>>>    doc/guides/prog_guide/power_man.rst |  9 ++--
> >>>>    lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
> >>>>    2 files changed, 80 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> >>>> index fac2c19516..3245a5ebed 100644
> >>>> --- a/doc/guides/prog_guide/power_man.rst
> >>>> +++ b/doc/guides/prog_guide/power_man.rst
> >>>> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
> >>>>    The "monitor" mode is only supported in the following configurations and scenarios:
> >>>>
> >>>>    * If ``rte_cpu_get_intrinsics_support()`` function indicates that
> >>>> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
> >>>> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
> >>>> +
> >>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
> >>>>      ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> >>>>      limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
> >>>>      monitored from a different lcore).
> >>>>
> >>>> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> >>>> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
> >>>> -  be supported.
> >>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
> >>>> +  two monitoring functions are supported, then monitor mode will not be supported.
> >>>>
> >>>>    * Not all Ethernet devices support monitoring, even if the underlying
> >>>>      platform may support the necessary CPU instructions. Please refer to
> >>>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> >>>> index 7762cd39b8..aab2d4f1ee 100644
> >>>> --- a/lib/power/rte_power_pmd_mgmt.c
> >>>> +++ b/lib/power/rte_power_pmd_mgmt.c
> >>>> @@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
> >>>>         return 0;
> >>>>    }
> >>>>
> >>>> +static inline int
> >>>> +get_monitor_addresses(struct pmd_core_cfg *cfg,
> >>>> +             struct rte_power_monitor_cond *pmc)
> >>>> +{
> >>>> +     const struct queue_list_entry *qle;
> >>>> +     size_t i = 0;
> >>>> +     int ret;
> >>>> +
> >>>> +     TAILQ_FOREACH(qle, &cfg->head, next) {
> >>>> +             struct rte_power_monitor_cond *cur = &pmc[i];
> >>>
> >>> Looks like you never increment 'i' value inside that function.
> >>> Also it probably will be safer to add 'num' parameter to check that
> >>> we will never over-run pmc[] boundaries.
> >>
> >> Will fix in v4, good catch!
> >>
> >>>
> >>>> +             const union queue *q = &qle->queue;
> >>>> +             ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
> >>>> +             if (ret < 0)
> >>>> +                     return ret;
> >>>> +     }
> >>>> +     return 0;
> >>>> +}
> >>>> +
> >>>>    static void
> >>>>    calc_tsc(void)
> >>>>    {
> >>>> @@ -183,6 +201,48 @@ calc_tsc(void)
> >>>>         }
> >>>>    }
> >>>>
> >>>> +static uint16_t
> >>>> +clb_multiwait(uint16_t port_id, uint16_t qidx,
> >>>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> >>>> +             uint16_t max_pkts __rte_unused, void *addr __rte_unused)
> >>>> +{
> >>>> +     const unsigned int lcore = rte_lcore_id();
> >>>> +     const union queue q = {.portid = port_id, .qid = qidx};
> >>>> +     const bool empty = nb_rx == 0;
> >>>> +     struct pmd_core_cfg *q_conf;
> >>>> +
> >>>> +     q_conf = &lcore_cfg[lcore];
> >>>> +
> >>>> +     /* early exit */
> >>>> +     if (likely(!empty)) {
> >>>> +             q_conf->empty_poll_stats = 0;
> >>>> +     } else {
> >>>> +             /* do we care about this particular queue? */
> >>>> +             if (!queue_is_power_save(q_conf, &q))
> >>>> +                     return nb_rx;
> >>>
> >>> I still don't understand the need for the 'special' power_save queue here...
> >>> Why we can't just have a function:
> >>>
> >>> get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(struct pmd_core_cfg *lcore_cfg),
> >>> and then just:
> >>>
> >>> /* all queues have at least EMPTYPOLL_MAX sequential empty polls */
> >>> if (get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(q_conf) == 0) {
> >>>       /* go into power-save mode here */
> >>> }
> >>
> >> Okay, let's go through this step by step :)
> >>
> >> Let's suppose we have three queues - q0, q1 and q2. We want to sleep
> >> whenever there's no traffic on *all of them*, however we cannot know
> >> that until we have checked all of them.
> >>
> >> So, let's suppose that q0, q1 and q2 were empty all this time, but now
> >> some traffic arrived at q2 while we're still checking q0. We see that q0
> >> is empty, and all of the queues were empty for the last N polls, so we
> >> think we will be safe to sleep at q0 despite the fact that traffic has
> >> just arrived at q2.
> >> This is not an issue with MONITOR mode because we will be able to see if
> >> current Rx ring descriptor is busy or not via the NIC callback, *but
> >> this is not possible* with PAUSE and SCALE modes, because they don't
> >> have the sneaky lookahead function of MONITOR! So, with PAUSE and SCALE
> >> modes, it is possible to end up in a situation where you *think* you
> >> don't have any traffic, but you actually do, you just haven't checked
> >> the relevant queue yet.
> >
> > I think such a situation is unavoidable.
> > Yes, traffic can arrive at *any* queue at *any* time.
> > With your example above - the user chose q2 as the 'special' queue, but
> > traffic actually arrives on q0 or q1.
> > And yes, if the user chooses the PAUSE or SCALE methods he *can* miss
> > traffic, because, as you said, these methods have no notification mechanism.
> > I think these are just unavoidable limitations of these power-save methods.
> >
> >> In order to prevent this from happening, we do not sleep on every queue,
> >> instead we sleep *once* per loop.
> >
> > Yes, totally agree we shouldn't sleep on *every* queue.
> > We need to go to sleep when there is no traffic on *any* of the queues we monitor.
> >
> >> That is, we check q0, check q1, check
> >> q2, and only then we decide whether we want to sleep or not.
> >
> >> Of course, with such scheme it is still possible to e.g. sleep in q2
> >> while there's traffic waiting in q0,
> >
> > Yes, exactly.
> >
> >> but worst case is less bad with
> >> this scheme, because we'll be doing at worst 1 extra sleep.
> >
> > Hmm, I think it would be one extra sleep anyway.
> >
> >> Whereas with what you're suggesting, if we had e.g. 10 queues to poll,
> >> and we checked q1 but traffic has just arrived at q0, we'll be sleeping
> >> at q1, then we'll be sleeping at q2, then we'll be sleeping at q3, then
> >> we'll be sleeping at q4, then we'll be sleeping at q5.... and 9 sleeps
> >> later we finally reach q0 and find out after all this time that we
> >> shouldn't have slept in the first place.
> >
> > Ah ok, I think I understand now what you are saying.
> > Sure, to avoid such a situation, we'll need to maintain extra counters and
> > update them properly when we go to sleep.
> > I should have stated this clearly at the beginning.
> > It might be easier to explain what I meant with a code snippet:
> >
> > lcore_conf needs 2 counters:
> > uint64_t   nb_queues_ready_to_sleep;
> > uint64_t   nb_sleeps;
> >
> > Plus each queue needs 2 counters:
> > uint64_t nb_empty_polls;
> > uint64_t nb_sleeps;
> >
> > Now, at rx_callback():
> >
> > /* check whether a sleep happened since the previous call;
> >  * if yes, then reset the queue counters */
> > if (queue->nb_sleeps != lcore_conf->nb_sleeps) {
> >     queue->nb_sleeps = lcore_conf->nb_sleeps;
> >     queue->nb_empty_polls = 0;
> > }
> >
> > /* packet arrived, reset counters */
> > if (nb_rx != 0) {
> >     /* queue is not 'ready_to_sleep' any more */
> >     if (queue->nb_empty_polls > EMPTYPOLL_MAX)
> >         lcore_conf->nb_queues_ready_to_sleep--;
> >     queue->nb_empty_polls = 0;
> >
> > /* empty poll */
> > } else {
> >     /* queue reached the EMPTYPOLL_MAX threshold, mark it as 'ready_to_sleep' */
> >     if (queue->nb_empty_polls == EMPTYPOLL_MAX)
> >         lcore_conf->nb_queues_ready_to_sleep++;
> >     queue->nb_empty_polls++;
> > }
> >
> > /* no traffic on any queue for at least EMPTYPOLL_MAX iterations */
> > if (lcore_conf->nb_queues_ready_to_sleep == lcore_conf->n_queues) {
> >     /* update counters and sleep */
> >     lcore_conf->nb_sleeps++;
> >     lcore_conf->nb_queues_ready_to_sleep = 0;
> >     goto_sleep();
> > }
> >
> Actually, I don't think this is going to work, because I can see no
> (easy) way to get from an lcore to a specific queue. I mean, you could
> have an O(N) for loop that iterates over the list of queues every time
> we enter the callback, but I don't think that's such a good idea.

I think something like that will work:

struct queue_list_entry {
        TAILQ_ENTRY(queue_list_entry) next;
        union queue queue;
+       /* pointer to the lcore config that this queue is managed by */
+       struct pmd_core_cfg *lcore_cfg;
+       /* queue RX callback */
+       const struct rte_eth_rxtx_callback *cur_cb;
};

At rte_power_ethdev_pmgmt_queue_enable():

+       struct queue_list_entry *qle;
...
-       ret = queue_list_add(queue_cfg, &qdata);
+       qle = queue_list_add(queue_cfg, &qdata);
...
-       queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, clb, NULL);
+       qle->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, clb, qle);

At the actual clb_xxx(uint16_t port_id, uint16_t qidx,
                struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
                uint16_t max_pkts __rte_unused, void *addr)
{
   ...
   struct queue_list_entry *qle = addr;
   struct pmd_core_cfg *lcore_cfg = qle->lcore_cfg;
   ....
}



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
  2021-06-29 12:14               ` Ananyev, Konstantin
@ 2021-06-29 13:23                 ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-06-29 13:23 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 29-Jun-21 1:14 PM, Ananyev, Konstantin wrote:
> 
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Tuesday, June 29, 2021 12:40 PM
>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Hunt, David <david.hunt@intel.com>
>> Cc: Loftus, Ciara <ciara.loftus@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
>>
>> On 29-Jun-21 1:07 AM, Ananyev, Konstantin wrote:
>>>
>>>
>>>>>> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
>>>>>> Rx queues while entering the energy efficient power state. The multi
>>>>>> version will be used unconditionally if supported, and the UMWAIT one
>>>>>> will only be used when multi-monitor is not supported by the hardware.
>>>>>>
>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> ---
>>>>>>     doc/guides/prog_guide/power_man.rst |  9 ++--
>>>>>>     lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
>>>>>>     2 files changed, 80 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>>>>>> index fac2c19516..3245a5ebed 100644
>>>>>> --- a/doc/guides/prog_guide/power_man.rst
>>>>>> +++ b/doc/guides/prog_guide/power_man.rst
>>>>>> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>>>>>>     The "monitor" mode is only supported in the following configurations and scenarios:
>>>>>>
>>>>>>     * If ``rte_cpu_get_intrinsics_support()`` function indicates that
>>>>>> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
>>>>>> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
>>>>>> +
>>>>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>>>>>>       ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>>>>>>       limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>>>>>>       monitored from a different lcore).
>>>>>>
>>>>>> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>>>>>> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
>>>>>> -  be supported.
>>>>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
>>>>>> +  two monitoring functions are supported, then monitor mode will not be supported.
>>>>>>
>>>>>>     * Not all Ethernet devices support monitoring, even if the underlying
>>>>>>       platform may support the necessary CPU instructions. Please refer to
>>>>>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>>>>>> index 7762cd39b8..aab2d4f1ee 100644
>>>>>> --- a/lib/power/rte_power_pmd_mgmt.c
>>>>>> +++ b/lib/power/rte_power_pmd_mgmt.c
>>>>>> @@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
>>>>>>          return 0;
>>>>>>     }
>>>>>>
>>>>>> +static inline int
>>>>>> +get_monitor_addresses(struct pmd_core_cfg *cfg,
>>>>>> +             struct rte_power_monitor_cond *pmc)
>>>>>> +{
>>>>>> +     const struct queue_list_entry *qle;
>>>>>> +     size_t i = 0;
>>>>>> +     int ret;
>>>>>> +
>>>>>> +     TAILQ_FOREACH(qle, &cfg->head, next) {
>>>>>> +             struct rte_power_monitor_cond *cur = &pmc[i];
>>>>>
>>>>> Looks like you never increment 'i' value inside that function.
> >>>>> Also it would probably be safer to add a 'num' parameter to check that
> >>>>> we will never overrun the pmc[] boundaries.
>>>>
>>>> Will fix in v4, good catch!
>>>>
>>>>>
>>>>>> +             const union queue *q = &qle->queue;
>>>>>> +             ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
>>>>>> +             if (ret < 0)
>>>>>> +                     return ret;
>>>>>> +     }
>>>>>> +     return 0;
>>>>>> +}
>>>>>> +
>>>>>>     static void
>>>>>>     calc_tsc(void)
>>>>>>     {
>>>>>> @@ -183,6 +201,48 @@ calc_tsc(void)
>>>>>>          }
>>>>>>     }
>>>>>>
>>>>>> +static uint16_t
>>>>>> +clb_multiwait(uint16_t port_id, uint16_t qidx,
>>>>>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>>>>> +             uint16_t max_pkts __rte_unused, void *addr __rte_unused)
>>>>>> +{
>>>>>> +     const unsigned int lcore = rte_lcore_id();
>>>>>> +     const union queue q = {.portid = port_id, .qid = qidx};
>>>>>> +     const bool empty = nb_rx == 0;
>>>>>> +     struct pmd_core_cfg *q_conf;
>>>>>> +
>>>>>> +     q_conf = &lcore_cfg[lcore];
>>>>>> +
>>>>>> +     /* early exit */
>>>>>> +     if (likely(!empty)) {
>>>>>> +             q_conf->empty_poll_stats = 0;
>>>>>> +     } else {
>>>>>> +             /* do we care about this particular queue? */
>>>>>> +             if (!queue_is_power_save(q_conf, &q))
>>>>>> +                     return nb_rx;
>>>>>
> >>>>> I still don't understand the need for the 'special' power_save queue here...
>>>>> Why we can't just have a function:
>>>>>
> >>>>> get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(struct pmd_core_cfg *lcore_cfg),
> >>>>> and then just:
> >>>>>
> >>>>> /* all queues have at least EMPTYPOLL_MAX sequential empty polls */
> >>>>> if (get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(q_conf) == 0) {
>>>>>        /* go into power-save mode here */
>>>>> }
>>>>
>>>> Okay, let's go through this step by step :)
>>>>
>>>> Let's suppose we have three queues - q0, q1 and q2. We want to sleep
>>>> whenever there's no traffic on *all of them*, however we cannot know
>>>> that until we have checked all of them.
>>>>
>>>> So, let's suppose that q0, q1 and q2 were empty all this time, but now
>>>> some traffic arrived at q2 while we're still checking q0. We see that q0
>>>> is empty, and all of the queues were empty for the last N polls, so we
>>>> think we will be safe to sleep at q0 despite the fact that traffic has
>>>> just arrived at q2.
>>>> This is not an issue with MONITOR mode because we will be able to see if
>>>> current Rx ring descriptor is busy or not via the NIC callback, *but
>>>> this is not possible* with PAUSE and SCALE modes, because they don't
>>>> have the sneaky lookahead function of MONITOR! So, with PAUSE and SCALE
>>>> modes, it is possible to end up in a situation where you *think* you
>>>> don't have any traffic, but you actually do, you just haven't checked
>>>> the relevant queue yet.
>>>
> >>> I think such a situation is unavoidable.
> >>> Yes, traffic can arrive at *any* queue at *any* time.
> >>> With your example above - the user chose q2 as the 'special' queue, but
> >>> traffic actually arrives on q0 or q1.
> >>> And yes, if the user chooses the PAUSE or SCALE methods he *can* miss
> >>> traffic, because, as you said, these methods have no notification mechanism.
> >>> I think these are just unavoidable limitations of these power-save methods.
>>>
>>>> In order to prevent this from happening, we do not sleep on every queue,
>>>> instead we sleep *once* per loop.
>>>
>>> Yes, totally agree we shouldn't sleep on *every* queue.
> >>> We need to go to sleep when there is no traffic on *any* of the queues we monitor.
>>>
>>>> That is, we check q0, check q1, check
>>>> q2, and only then we decide whether we want to sleep or not.
>>>
>>>> Of course, with such scheme it is still possible to e.g. sleep in q2
>>>> while there's traffic waiting in q0,
>>>
>>> Yes, exactly.
>>>
>>>> but worst case is less bad with
>>>> this scheme, because we'll be doing at worst 1 extra sleep.
>>>
>>> Hmm, I think it would be one extra sleep anyway.
>>>
>>>> Whereas with what you're suggesting, if we had e.g. 10 queues to poll,
>>>> and we checked q1 but traffic has just arrived at q0, we'll be sleeping
>>>> at q1, then we'll be sleeping at q2, then we'll be sleeping at q3, then
>>>> we'll be sleeping at q4, then we'll be sleeping at q5.... and 9 sleeps
>>>> later we finally reach q0 and find out after all this time that we
>>>> shouldn't have slept in the first place.
>>>
>>> Ah ok, I think I understand now what you are saying.
> >>> Sure, to avoid such a situation, we'll need to maintain extra counters and
> >>> update them properly when we go to sleep.
> >>> I should have stated this clearly at the beginning.
> >>> It might be easier to explain what I meant with a code snippet:
>>>
>>> lcore_conf needs 2 counters:
>>> uint64_t   nb_queues_ready_to_sleep;
>>> uint64_t   nb_sleeps;
>>>
>>> Plus each queue needs 2 counters:
>>> uint64_t nb_empty_polls;
>>> uint64_t nb_sleeps;
>>>
>>> Now, at rx_callback():
>>>
> >>> /* check whether a sleep happened since the previous call;
> >>>  * if yes, then reset the queue counters */
> >>> if (queue->nb_sleeps != lcore_conf->nb_sleeps) {
> >>>     queue->nb_sleeps = lcore_conf->nb_sleeps;
> >>>     queue->nb_empty_polls = 0;
> >>> }
> >>>
> >>> /* packet arrived, reset counters */
> >>> if (nb_rx != 0) {
> >>>     /* queue is not 'ready_to_sleep' any more */
> >>>     if (queue->nb_empty_polls > EMPTYPOLL_MAX)
> >>>         lcore_conf->nb_queues_ready_to_sleep--;
> >>>     queue->nb_empty_polls = 0;
> >>>
> >>> /* empty poll */
> >>> } else {
> >>>     /* queue reached the EMPTYPOLL_MAX threshold, mark it as 'ready_to_sleep' */
> >>>     if (queue->nb_empty_polls == EMPTYPOLL_MAX)
> >>>         lcore_conf->nb_queues_ready_to_sleep++;
> >>>     queue->nb_empty_polls++;
> >>> }
> >>>
> >>> /* no traffic on any queue for at least EMPTYPOLL_MAX iterations */
> >>> if (lcore_conf->nb_queues_ready_to_sleep == lcore_conf->n_queues) {
> >>>     /* update counters and sleep */
> >>>     lcore_conf->nb_sleeps++;
> >>>     lcore_conf->nb_queues_ready_to_sleep = 0;
> >>>     goto_sleep();
> >>> }
>>>
> >> Actually, I don't think this is going to work, because I can see no
> >> (easy) way to get from an lcore to a specific queue. I mean, you could
> >> have an O(N) for loop that iterates over the list of queues every time
> >> we enter the callback, but I don't think that's such a good idea.
> 
> I think something like that will work:
> 
> struct queue_list_entry {
>         TAILQ_ENTRY(queue_list_entry) next;
>         union queue queue;
> +       /* pointer to the lcore config that this queue is managed by */
> +       struct pmd_core_cfg *lcore_cfg;
> +       /* queue RX callback */
> +       const struct rte_eth_rxtx_callback *cur_cb;
> };
> 
> At rte_power_ethdev_pmgmt_queue_enable():
> 
> +       struct queue_list_entry *qle;
> ...
> -       ret = queue_list_add(queue_cfg, &qdata);
> +       qle = queue_list_add(queue_cfg, &qdata);
> ...
> -       queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, clb, NULL);
> +       qle->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, clb, qle);
> 
> At the actual clb_xxx(uint16_t port_id, uint16_t qidx,
>                 struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>                 uint16_t max_pkts __rte_unused, void *addr)
> {
>    ...
>    struct queue_list_entry *qle = addr;
>    struct pmd_core_cfg *lcore_cfg = qle->lcore_cfg;
>    ....
> }
> 
> 

Hm, that's actually a clever solution :) Thanks!

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management
  2021-06-28 15:54     ` [dpdk-dev] [PATCH v4 0/7] Enhancements for PMD power management Anatoly Burakov
                         ` (6 preceding siblings ...)
  2021-06-28 15:54       ` [dpdk-dev] [PATCH v4 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-06-29 15:48       ` Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
                           ` (7 more replies)
  7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, konstantin.ananyev, ciara.loftus

This patchset introduces several changes related to PMD power management:

- Changed monitoring intrinsics to use callbacks as a comparison function, based
  on previous patchset [1] but incorporating feedback [2] - this hopefully will
  make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
  set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
  accompanying infrastructure and example apps changes

v5:
- Removed "power save queue" API and replaced with mechanism suggested by
  Konstantin
- Addressed other feedback

v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections

v3:
- Moved some doc updates to NIC features list

v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary

[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274

Anatoly Burakov (7):
  power_intrinsics: use callbacks for comparison
  net/af_xdp: add power monitor support
  eal: add power monitor for multiple events
  power: remove thread safety from PMD power API's
  power: support callbacks for multiple Rx queues
  power: support monitoring multiple Rx queues
  l3fwd-power: support multiqueue in PMD pmgmt modes

 doc/guides/nics/features.rst                  |  10 +
 doc/guides/prog_guide/power_man.rst           |  68 +-
 doc/guides/rel_notes/release_21_08.rst        |  11 +
 drivers/event/dlb2/dlb2.c                     |  17 +-
 drivers/net/af_xdp/rte_eth_af_xdp.c           |  34 +
 drivers/net/i40e/i40e_rxtx.c                  |  20 +-
 drivers/net/iavf/iavf_rxtx.c                  |  20 +-
 drivers/net/ice/ice_rxtx.c                    |  20 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |  20 +-
 drivers/net/mlx5/mlx5_rx.c                    |  17 +-
 examples/l3fwd-power/main.c                   |   6 -
 lib/eal/arm/rte_power_intrinsics.c            |  11 +
 lib/eal/include/generic/rte_cpuflags.h        |   2 +
 .../include/generic/rte_power_intrinsics.h    |  68 +-
 lib/eal/ppc/rte_power_intrinsics.c            |  11 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_cpuflags.c                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  90 ++-
 lib/power/meson.build                         |   3 +
 lib/power/rte_power_pmd_mgmt.c                | 633 +++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h                |   6 +
 21 files changed, 810 insertions(+), 262 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-06-29 15:48         ` Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 2/7] net/af_xdp: add power monitor support Anatoly Burakov
                           ` (6 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
	Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
	Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Previously, the semantics of power monitor were such that we were
checking the current value against an expected value, and if they matched,
the sleep was aborted. This was somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.

This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.

Existing implementations are adjusted to follow the new semantics.

Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v4:
    - Return error if callback is set to NULL
    - Replace raw number with a macro in monitor condition opaque data
    
    v2:
    - Use callback mechanism for more flexibility
    - Address feedback from Konstantin

 doc/guides/rel_notes/release_21_08.rst        |  1 +
 drivers/event/dlb2/dlb2.c                     | 17 ++++++++--
 drivers/net/i40e/i40e_rxtx.c                  | 20 +++++++----
 drivers/net/iavf/iavf_rxtx.c                  | 20 +++++++----
 drivers/net/ice/ice_rxtx.c                    | 20 +++++++----
 drivers/net/ixgbe/ixgbe_rxtx.c                | 20 +++++++----
 drivers/net/mlx5/mlx5_rx.c                    | 17 ++++++++--
 .../include/generic/rte_power_intrinsics.h    | 33 +++++++++++++++----
 lib/eal/x86/rte_power_intrinsics.c            | 17 +++++-----
 9 files changed, 121 insertions(+), 44 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
 ABI Changes
 -----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
 	}
 }
 
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	/* abort if the value matches */
+	return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
 static inline int
 dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		  struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 			expected_value = 0;
 
 		pmc.addr = monitor_addr;
-		pmc.val = expected_value;
-		pmc.mask = qe_mask.raw_qe[1];
+		/* store expected value and comparison mask in opaque data */
+		pmc.opaque[CLB_VAL_IDX] = expected_value;
+		pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+		/* set up callback */
+		pmc.fn = dlb2_monitor_callback;
 		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..081682f88b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
 #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
 
+static int
+i40e_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = i40e_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..7ed196ec22 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
 				rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
 }
 
+static int
+iavf_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = iavf_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..d12437d19d 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+static int
+ice_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.status_error0;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
-	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/* comparison callback */
+	pmc->fn = ice_monitor_callback;
 
 	/* register is 16-bit */
 	pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+static int
+ixgbe_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.upper.status_error;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
-	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/* comparison callback */
+	pmc->fn = ixgbe_monitor_callback;
 
 	/* the registers are 32-bit */
 	pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 	return rx_queue_count(rxq);
 }
 
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	const uint64_t m = opaque[CLB_MSK_IDX];
+	const uint64_t v = opaque[CLB_VAL_IDX];
+
+	return (value & m) == v ? -1 : 0;
+}
+
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 		return -rte_errno;
 	}
 	pmc->addr = &cqe->op_own;
-	pmc->val =  !!idx;
-	pmc->mask = MLX5_CQE_OWNER_MASK;
+	pmc->opaque[CLB_VAL_IDX] = !!idx;
+	pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+	pmc->fn = mlx_monitor_callback;
 	pmc->size = sizeof(uint8_t);
 	return 0;
 }
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
  * which are architecture-dependent.
  */
 
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ *   The value read from memory.
+ * @param opaque
+ *   Callback-specific data.
+ *
+ * @return
+ *   0 if entering of power optimized state should proceed
+ *   -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
 struct rte_power_monitor_cond {
 	volatile void *addr;  /**< Address to monitor for changes */
-	uint64_t val;         /**< If the `mask` is non-zero, location pointed
-	                       *   to by `addr` will be read and compared
-	                       *   against this value.
-	                       */
-	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
-	uint8_t size;    /**< Data size (in bytes) that will be used to compare
-	                  *   expected value (`val`) with data read from the
+	uint8_t size;    /**< Data size (in bytes) that will be read from the
 	                  *   monitored memory location (`addr`). Can be 1, 2,
 	                  *   4, or 8. Supplying any other value will result in
 	                  *   an error.
 	                  */
+	rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+	                             *   entering power optimized state should
+	                             *   be aborted.
+	                             */
+	uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+	/**< Callback-specific data */
 };
 
 /**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 	const unsigned int lcore_id = rte_lcore_id();
 	struct power_wait_status *s;
+	uint64_t cur_value;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (__check_val_size(pmc->size) < 0)
 		return -EINVAL;
 
+	if (pmc->fn == NULL)
+		return -EINVAL;
+
 	s = &wait_status[lcore_id];
 
 	/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	/* now that we've put this address into monitor, we can unlock */
 	rte_spinlock_unlock(&s->lock);
 
-	/* if we have a comparison mask, we might not need to sleep at all */
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->size);
-		const uint64_t masked = cur_value & pmc->mask;
+	cur_value = __get_umwait_val(pmc->addr, pmc->size);
 
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			goto end;
-	}
+	/* check if callback indicates we should abort */
+	if (pmc->fn(cur_value, pmc->opaque) != 0)
+		goto end;
 
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v5 2/7] net/af_xdp: add power monitor support
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-06-29 15:48         ` Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 3/7] eal: add power monitor for multiple events Anatoly Burakov
                           ` (5 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev

Implement support for .get_monitor_addr in AF_XDP driver.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Rewrite using the callback mechanism

 drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..7830d0c23a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
 
 #include "compat.h"
 
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
 	return 0;
 }
 
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	const uint64_t v = opaque[CLB_VAL_IDX];
+	const uint64_t m = (uint32_t)~0;
+
+	/* if the value has changed, abort entering power optimized state */
+	return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct pkt_rx_queue *rxq = rx_queue;
+	unsigned int *prod = rxq->rx.producer;
+	const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+	/* watch for changes in producer ring */
+	pmc->addr = (void *)prod;
+
+	/* store current value */
+	pmc->opaque[CLB_VAL_IDX] = cur_val;
+	pmc->fn = eth_monitor_callback;
+
+	/* AF_XDP producer ring index is 32-bit */
+	pmc->size = sizeof(uint32_t);
+
+	return 0;
+}
+
 static int
 eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
 	.link_update = eth_link_update,
 	.stats_get = eth_stats_get,
 	.stats_reset = eth_stats_reset,
+	.get_monitor_addr = eth_get_monitor_addr
 };
 
 /** parse busy_budget argument */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v5 3/7] eal: add power monitor for multiple events
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-06-29 15:48         ` Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
                           ` (4 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev, Jan Viktorin, Ruifeng Wang, Jerin Jacob, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v4:
    - Fixed bugs in accessing the monitor condition
    - Abort on any monitor condition not having a defined callback
    
    v2:
    - Adapt to callback mechanism

 doc/guides/rel_notes/release_21_08.rst        |  2 +
 lib/eal/arm/rte_power_intrinsics.c            | 11 +++
 lib/eal/include/generic/rte_cpuflags.h        |  2 +
 .../include/generic/rte_power_intrinsics.h    | 35 +++++++++
 lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
 lib/eal/version.map                           |  3 +
 lib/eal/x86/rte_cpuflags.c                    |  2 +
 lib/eal/x86/rte_power_intrinsics.c            | 73 +++++++++++++++++++
 8 files changed, 139 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
 	/**< indicates support for rte_power_monitor function */
 	uint32_t power_pause : 1;
 	/**< indicates support for rte_power_pause function */
+	uint32_t power_monitor_multi : 1;
+	/**< indicates support for rte_power_monitor_multi function */
 };
 
 /**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 __rte_experimental
 int rte_power_pause(const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, a comparison callback `fn` and callback-specific `opaque` data
+ * are provided for each monitoring condition. The current value read from each
+ * monitored address will be passed to the respective callback, and if any
+ * callback indicates that its wakeup condition was already met, the entering of
+ * optimized power state will be aborted.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ *   An array of monitoring condition structures.
+ * @param num
+ *   Length of the `pmc` array.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp);
+
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
 	rte_version_release; # WINDOWS_NO_EXPORT
 	rte_version_suffix; # WINDOWS_NO_EXPORT
 	rte_version_year; # WINDOWS_NO_EXPORT
+
+	# added in 21.08
+	rte_power_monitor_multi; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
 		intrinsics->power_monitor = 1;
 		intrinsics->power_pause = 1;
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+			intrinsics->power_monitor_multi = 1;
 	}
 }
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_rtm.h>
 #include <rte_spinlock.h>
 
 #include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
 }
 
 static bool wait_supported;
+static bool wait_multi_supported;
 
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
 
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
+	if (i.power_monitor_multi)
+		wait_multi_supported = 1;
 }
 
 int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	 * In this case, since we've already woken up, the "wakeup" was
 	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
 	 * wakeup address is still valid so it's perfectly safe to write it.
+	 *
+	 * For the multi-monitor case, the act of locking will in itself trigger
+	 * the wakeup, so no additional writes are necessary.
 	 */
 	rte_spinlock_lock(&s->lock);
 	if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return 0;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s = &wait_status[lcore_id];
+	uint32_t i, rc;
+
+	/* check if supported */
+	if (!wait_multi_supported)
+		return -ENOTSUP;
+
+	if (pmc == NULL || num == 0)
+		return -EINVAL;
+
+	/* we are already inside transaction region, return */
+	if (rte_xtest() != 0)
+		return 0;
+
+	/* start new transaction region */
+	rc = rte_xbegin();
+
+	/* transaction abort, possible write to one of wait addresses */
+	if (rc != RTE_XBEGIN_STARTED)
+		return 0;
+
+	/*
+	 * the mere act of reading the lock status here adds the lock to
+	 * the read set. This means that when we trigger a wakeup from another
+	 * thread, even if we don't have a defined wakeup address and thus don't
+	 * actually cause any writes, the act of locking our lock will itself
+	 * trigger the wakeup and abort the transaction.
+	 */
+	rte_spinlock_is_locked(&s->lock);
+
+	/*
+	 * add all addresses to wait on into transaction read-set and check if
+	 * any of wakeup conditions are already met.
+	 */
+	rc = 0;
+	for (i = 0; i < num; i++) {
+		const struct rte_power_monitor_cond *c = &pmc[i];
+
+		/* cannot be NULL */
+		if (c->fn == NULL) {
+			rc = -EINVAL;
+			break;
+		}
+
+		const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+		/* abort if callback indicates that we need to stop */
+		if (c->fn(val, c->opaque) != 0)
+			break;
+	}
+
+	/* none of the conditions were met, sleep until timeout */
+	if (i == num)
+		rte_power_pause(tsc_timestamp);
+
+	/* end transaction region */
+	rte_xend();
+
+	return rc;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v5 4/7] power: remove thread safety from PMD power API's
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
                           ` (2 preceding siblings ...)
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-06-29 15:48         ` Anatoly Burakov
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
                           ` (3 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus

Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.

We could have used something like an RCU for this use case, but in the
absence of a pressing need for thread safety we'll take the easy route
and simply mandate that the APIs be called only when all affected ports
are stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Add check for stopped queue
    - Clarified doc message
    - Added release notes

 doc/guides/rel_notes/release_21_08.rst |   5 +
 lib/power/meson.build                  |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 133 ++++++++++---------------
 lib/power/rte_power_pmd_mgmt.h         |   6 ++
 4 files changed, 67 insertions(+), 80 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
 
 * eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
+* rte_power: The experimental PMD power management API is no longer considered
+  to be thread safe; all Rx queues affected by the API will now need to be
+  stopped before making any changes to the power management scheme.
+
+
 ABI Changes
 -----------
 
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
         'rte_power_pmd_mgmt.h',
         'rte_power_guest_channel.h',
 )
+if cc.has_argument('-Wno-cast-qual')
+    cflags += '-Wno-cast-qual'
+endif
 deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
 	/**< Callback mode for this queue */
 	const struct rte_eth_rxtx_callback *cur_cb;
 	/**< Callback instance */
-	volatile bool umwait_in_progress;
-	/**< are we currently sleeping? */
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 			struct rte_power_monitor_cond pmc;
 			uint16_t ret;
 
-			/*
-			 * we might get a cancellation request while being
-			 * inside the callback, in which case the wakeup
-			 * wouldn't work because it would've arrived too early.
-			 *
-			 * to get around this, we notify the other thread that
-			 * we're sleeping, so that it can spin until we're done.
-			 * unsolicited wakeups are perfectly safe.
-			 */
-			q_conf->umwait_in_progress = true;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-			/* check if we need to cancel sleep */
-			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
-				/* use monitoring condition to sleep */
-				ret = rte_eth_get_monitor_addr(port_id, qidx,
-						&pmc);
-				if (ret == 0)
-					rte_power_monitor(&pmc, UINT64_MAX);
-			}
-			q_conf->umwait_in_progress = false;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+			/* use monitoring condition to sleep */
+			ret = rte_eth_get_monitor_addr(port_id, qidx,
+					&pmc);
+			if (ret == 0)
+				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
 		q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 	return nb_rx;
 }
 
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+	struct rte_eth_rxq_info qinfo;
+
+	if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+		return -1;
+
+	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
 	struct pmd_queue_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
+	rte_rx_callback_fn clb;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
 	queue_cfg = &port_cfg[port_id][queue_id];
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->umwait_in_progress = false;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* ensure we update our state before callback starts */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_umwait, NULL);
+		clb = clb_umwait;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
-				queue_id, clb_scale_freq, NULL);
+		clb = clb_scale_freq;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (global_data.tsc_per_us == 0)
 			calc_tsc();
 
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_pause, NULL);
+		clb = clb_pause;
 		break;
+	default:
+		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+		ret = -EINVAL;
+		goto end;
 	}
+
+	/* initialize data before enabling the callback */
+	queue_cfg->empty_poll_stats = 0;
+	queue_cfg->cb_mode = mode;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, NULL);
+
 	ret = 0;
 end:
 	return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
 	struct pmd_queue_cfg *queue_cfg;
+	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
 
 	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
 		return -EINVAL;
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
+
 	/* no need to check queue id as wrong queue id would not be enabled */
 	queue_cfg = &port_cfg[port_id][queue_id];
 
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	/* stop any callbacks from progressing */
 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
-	/* ensure we update our state before continuing */
-	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
 	switch (queue_cfg->cb_mode) {
-	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		bool exit = false;
-		do {
-			/*
-			 * we may request cancellation while the other thread
-			 * has just entered the callback but hasn't started
-			 * sleeping yet, so keep waking it up until we know it's
-			 * done sleeping.
-			 */
-			if (queue_cfg->umwait_in_progress)
-				rte_power_monitor_wakeup(lcore_id);
-			else
-				exit = true;
-		} while (!exit);
-	}
-	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		rte_eth_remove_rx_callback(port_id, queue_id,
 				queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		break;
 	}
 	/*
-	 * we don't free the RX callback here because it is unsafe to do so
-	 * unless we know for a fact that all data plane threads have stopped.
+	 * the API doc mandates that the user stops all processing on affected
+	 * ports before calling any of these API's, so we can assume that the
+	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	queue_cfg->cur_cb = NULL;
+	rte_free((void *)queue_cfg->cur_cb);
 
 	return 0;
 }
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue will be polled from.
  * @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue is polled from.
  * @param port_id
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
                           ` (3 preceding siblings ...)
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-06-29 15:48         ` Anatoly Burakov
  2021-06-30  9:52           ` David Hunt
  2021-06-30 11:04           ` Ananyev, Konstantin
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 6/7] power: support monitoring " Anatoly Burakov
                           ` (2 subsequent siblings)
  7 siblings, 2 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus

Currently, the PMD power management support has a hard limitation: it
only allows a single queue per lcore. This is not ideal, as most DPDK
use cases will poll multiple queues per core.

The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:

- Replace per-queue structures with per-lcore ones, so that any device
  polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
  added to the list of queues to poll, so that the callback is aware of
  other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
  shared between all queues polled on a particular lcore, and are only
  activated when all queues in the list were polled and were determined
  to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
  is incapable of monitoring more than one address.

Also, while we're at it, update and improve the docs.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Remove the "power save queue" API and replace it with mechanism suggested by
      Konstantin
    
    v3:
    - Move the list of supported NICs to NIC feature table
    
    v2:
    - Use a TAILQ for queues instead of a static array
    - Address feedback from Konstantin
    - Add additional checks for stopped queues

 doc/guides/nics/features.rst           |  10 +
 doc/guides/prog_guide/power_man.rst    |  65 ++--
 doc/guides/rel_notes/release_21_08.rst |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
 4 files changed, 373 insertions(+), 136 deletions(-)

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
 * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
 * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
 
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
 .. _nic_features_other:
 
 Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..ec04a72108 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,41 @@ Ethernet PMD Power Management API
 Abstract
 ~~~~~~~~
 
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
-   This power saving scheme will put the CPU into optimized power state
-   and use the ``rte_power_monitor()`` function
-   to monitor the Ethernet PMD RX descriptor address,
-   and wake the CPU up whenever there's new traffic.
-
-Pause
-   This power saving scheme will avoid busy polling
-   by either entering power-optimized sleep state
-   with ``rte_power_pause()`` function,
-   or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
-   This power saving scheme will use ``librte_power`` library
-   functionality to scale the core frequency up/down
-   depending on traffic volume.
-
-.. note::
-
-   Currently, this power management API is limited to mandatory mapping
-   of 1 queue to 1 core (multiple queues are supported,
-   but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+   This power saving scheme will put the CPU into optimized power state and
+   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+   there's new traffic. Support for this scheme may not be available on all
+   platforms, and further limitations may apply (see below).
+
+* Pause
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+   This power saving scheme will use ``librte_power`` library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+  limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
+  monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+  ``rte_power_monitor()`` function is not supported, then monitor mode will not
+  be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+  platform may support the necessary CPU instructions. Please refer to
+  :doc:`../nics/overview` for more information.
+
 
 API Overview for Ethernet PMD Power Management
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +249,5 @@ References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section.
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@ New Features
 
 * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
 
+* rte_power: The experimental PMD power management API now supports managing
+  multiple Ethernet Rx queues per lcore.
+
 
 Removed Items
 -------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..fccfd236c2 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,96 @@ enum pmd_mgmt_state {
 	PMD_MGMT_ENABLED
 };
 
-struct pmd_queue_cfg {
+union queue {
+	uint32_t val;
+	struct {
+		uint16_t portid;
+		uint16_t qid;
+	};
+};
+
+struct queue_list_entry {
+	TAILQ_ENTRY(queue_list_entry) next;
+	union queue queue;
+	uint64_t n_empty_polls;
+	const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+	/**< List of queues associated with this lcore */
+	size_t n_queues;
+	/**< Number of queues in the list */
 	volatile enum pmd_mgmt_state pwr_mgmt_state;
 	/**< State of power management for this queue */
 	enum rte_power_pmd_mgmt_type cb_mode;
 	/**< Callback mode for this queue */
-	const struct rte_eth_rxtx_callback *cur_cb;
-	/**< Callback instance */
-	uint64_t empty_poll_stats;
-	/**< Number of empty polls */
+	uint64_t n_queues_ready_to_sleep;
+	/**< Number of queues ready to enter power optimized state */
 } __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
 
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+	return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+	dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *cur;
+
+	TAILQ_FOREACH(cur, &cfg->head, next) {
+		if (queue_equal(&cur->queue, q))
+			return cur;
+	}
+	return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *qle;
+
+	/* is it already in the list? */
+	if (queue_list_find(cfg, q) != NULL)
+		return -EEXIST;
+
+	qle = malloc(sizeof(*qle));
+	if (qle == NULL)
+		return -ENOMEM;
+	memset(qle, 0, sizeof(*qle));
+
+	queue_copy(&qle->queue, q);
+	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+	cfg->n_queues++;
+	qle->n_empty_polls = 0;
+
+	return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *found;
+
+	found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return NULL;
+
+	TAILQ_REMOVE(&cfg->head, found, next);
+	cfg->n_queues--;
+
+	/* freeing is responsibility of the caller */
+	return found;
+}
 
 static void
 calc_tsc(void)
@@ -74,21 +152,56 @@ calc_tsc(void)
 	}
 }
 
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+	/* reset empty poll counter for this queue */
+	qcfg->n_empty_polls = 0;
+	/* reset the sleep counter too */
+	cfg->n_queues_ready_to_sleep = 0;
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+	/* this function being called means we have an empty poll */
+	qcfg->n_empty_polls++;
+
+	/* if we haven't reached threshold for empty polls, we can't sleep */
+	if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+		return false;
+
+	/* we're ready to sleep */
+	cfg->n_queues_ready_to_sleep++;
+
+	return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+	/* are all queues ready to sleep? */
+	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+		return false;
+
+	/* we've reached an iteration where we can sleep, reset sleep counter */
+	cfg->n_queues_ready_to_sleep = 0;
+
+	return true;
+}
+
 static uint16_t
 clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
-		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
-		void *addr __rte_unused)
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
 {
+	struct queue_list_entry *queue_conf = arg;
 
-	struct pmd_queue_cfg *q_conf;
-
-	q_conf = &port_cfg[port_id][qidx];
-
+	/* this callback can't do more than one queue, omit multiqueue logic */
 	if (unlikely(nb_rx == 0)) {
-		q_conf->empty_poll_stats++;
-		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+		queue_conf->n_empty_polls++;
+		if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
 			struct rte_power_monitor_cond pmc;
-			uint16_t ret;
+			int ret;
 
 			/* use monitoring condition to sleep */
 			ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
-		q_conf->empty_poll_stats = 0;
+		queue_conf->n_empty_polls = 0;
 
 	return nb_rx;
 }
 
 static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
-		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
-		void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *arg)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	struct queue_list_entry *queue_conf = arg;
+	struct pmd_core_cfg *lcore_conf;
+	const bool empty = nb_rx == 0;
 
-	q_conf = &port_cfg[port_id][qidx];
+	lcore_conf = &lcore_cfgs[lcore];
 
-	if (unlikely(nb_rx == 0)) {
-		q_conf->empty_poll_stats++;
-		/* sleep for 1 microsecond */
-		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
-			/* use tpause if we have it */
-			if (global_data.intrinsics_support.power_pause) {
-				const uint64_t cur = rte_rdtsc();
-				const uint64_t wait_tsc =
-						cur + global_data.tsc_per_us;
-				rte_power_pause(wait_tsc);
-			} else {
-				uint64_t i;
-				for (i = 0; i < global_data.pause_per_us; i++)
-					rte_pause();
-			}
+	if (likely(!empty))
+		/* early exit */
+		queue_reset(lcore_conf, queue_conf);
+	else {
+		/* can this queue sleep? */
+		if (!queue_can_sleep(lcore_conf, queue_conf))
+			return nb_rx;
+
+		/* can this lcore sleep? */
+		if (!lcore_can_sleep(lcore_conf))
+			return nb_rx;
+
+		/* sleep for 1 microsecond, use tpause if we have it */
+		if (global_data.intrinsics_support.power_pause) {
+			const uint64_t cur = rte_rdtsc();
+			const uint64_t wait_tsc =
+					cur + global_data.tsc_per_us;
+			rte_power_pause(wait_tsc);
+		} else {
+			uint64_t i;
+			for (i = 0; i < global_data.pause_per_us; i++)
+				rte_pause();
 		}
-	} else
-		q_conf->empty_poll_stats = 0;
+	}
 
 	return nb_rx;
 }
 
 static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
-		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+		uint16_t max_pkts __rte_unused, void *arg)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct queue_list_entry *queue_conf = arg;
 
-	q_conf = &port_cfg[port_id][qidx];
+	if (likely(!empty)) {
+		/* early exit */
+		queue_reset(lcore_conf, queue_conf);
 
-	if (unlikely(nb_rx == 0)) {
-		q_conf->empty_poll_stats++;
-		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
-			/* scale down freq */
-			rte_power_freq_min(rte_lcore_id());
-	} else {
-		q_conf->empty_poll_stats = 0;
-		/* scale up freq */
+		/* scale up freq immediately */
 		rte_power_freq_max(rte_lcore_id());
+	} else {
+		/* can this queue sleep? */
+		if (!queue_can_sleep(lcore_conf, queue_conf))
+			return nb_rx;
+
+		/* can this lcore sleep? */
+		if (!lcore_can_sleep(lcore_conf))
+			return nb_rx;
+
+		rte_power_freq_min(rte_lcore_id());
 	}
 
 	return nb_rx;
@@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
 	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
 }
 
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+	const struct queue_list_entry *entry;
+
+	TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+		const union queue *q = &entry->queue;
+		int ret = queue_stopped(q->portid, q->qid);
+		if (ret != 1)
+			return ret;
+	}
+	return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+	enum power_management_env env;
+
+	/* only PSTATE and ACPI modes are supported */
+	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+		return -ENOTSUP;
+	}
+	/* ensure we could initialize the power library */
+	if (rte_power_init(lcore))
+		return -EINVAL;
+
+	/* ensure we initialized the correct env */
+	env = rte_power_get_env();
+	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+	struct rte_power_monitor_cond dummy;
+
+	/* check if rte_power_monitor is supported */
+	if (!global_data.intrinsics_support.power_monitor) {
+		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+		return -ENOTSUP;
+	}
+
+	if (cfg->n_queues > 0) {
+		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+		return -ENOTSUP;
+	}
+
+	/* check if the device supports the necessary PMD API */
+	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+			&dummy) == -ENOTSUP) {
+		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *lcore_cfg;
+	struct queue_list_entry *queue_cfg;
 	struct rte_eth_dev_info info;
 	rte_rx_callback_fn clb;
 	int ret;
@@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	queue_cfg = &port_cfg[port_id][queue_id];
+	lcore_cfg = &lcore_cfgs[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(lcore_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
+	/* if callback was already enabled, check current callback type */
+	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+			lcore_cfg->cb_mode != mode) {
 		ret = -EINVAL;
 		goto end;
 	}
@@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 
 	switch (mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		struct rte_power_monitor_cond dummy;
-
-		/* check if rte_power_monitor is supported */
-		if (!global_data.intrinsics_support.power_monitor) {
-			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_monitor(lcore_cfg, &qdata);
+		if (ret < 0)
 			goto end;
-		}
 
-		/* check if the device supports the necessary PMD API */
-		if (rte_eth_get_monitor_addr(port_id, queue_id,
-				&dummy) == -ENOTSUP) {
-			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_umwait;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
-	{
-		enum power_management_env env;
-		/* only PSTATE and ACPI modes are supported */
-		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
-				!rte_power_check_env_supported(
-					PM_ENV_PSTATE_CPUFREQ)) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_scale(lcore_id);
+		if (ret < 0)
 			goto end;
-		}
-		/* ensure we could initialize the power library */
-		if (rte_power_init(lcore_id)) {
-			ret = -EINVAL;
-			goto end;
-		}
-		/* ensure we initialized the correct env */
-		env = rte_power_get_env();
-		if (env != PM_ENV_ACPI_CPUFREQ &&
-				env != PM_ENV_PSTATE_CPUFREQ) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_scale_freq;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		/* figure out various time-to-tsc conversions */
 		if (global_data.tsc_per_us == 0)
@@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		ret = -EINVAL;
 		goto end;
 	}
+	/* add this queue to the list */
+	ret = queue_list_add(lcore_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+				strerror(-ret));
+		goto end;
+	}
+	/* new queue is always added last */
+	queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
 
 	/* initialize data before enabling the callback */
-	queue_cfg->empty_poll_stats = 0;
-	queue_cfg->cb_mode = mode;
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-			clb, NULL);
+	if (lcore_cfg->n_queues == 1) {
+		lcore_cfg->cb_mode = mode;
+		lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	}
+	queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, queue_cfg);
 
 	ret = 0;
 end:
@@ -290,7 +476,9 @@ int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *lcore_cfg;
+	struct queue_list_entry *queue_cfg;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	queue_cfg = &port_cfg[port_id][queue_id];
+	lcore_cfg = &lcore_cfgs[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(lcore_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
+
+	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
 		return -EINVAL;
 
-	/* stop any callbacks from progressing */
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	/*
+	 * There is no good/easy way to do this without race conditions, so we
+	 * are just going to throw our hands in the air and hope that the user
+	 * has read the documentation and has ensured that ports are stopped at
+	 * the time we enter the API functions.
+	 */
+	queue_cfg = queue_list_take(lcore_cfg, &qdata);
+	if (queue_cfg == NULL)
+		return -ENOENT;
 
-	switch (queue_cfg->cb_mode) {
+	/* if we've removed all queues from the lists, set state to disabled */
+	if (lcore_cfg->n_queues == 0)
+		lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	switch (lcore_cfg->cb_mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
-		rte_eth_remove_rx_callback(port_id, queue_id,
-				queue_cfg->cur_cb);
+		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
 		break;
 	case RTE_POWER_MGMT_TYPE_SCALE:
 		rte_power_freq_max(lcore_id);
-		rte_eth_remove_rx_callback(port_id, queue_id,
-				queue_cfg->cur_cb);
+		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
 		rte_power_exit(lcore_id);
 		break;
 	}
@@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	 * ports before calling any of these API's, so we can assume that the
 	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	rte_free((void *)queue_cfg->cur_cb);
+	rte_free((void *)queue_cfg->cb);
+	free(queue_cfg);
 
 	return 0;
 }
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+	size_t i;
+
+	/* initialize all tailqs */
+	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+		TAILQ_INIT(&cfg->head);
+	}
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* [dpdk-dev] [PATCH v5 6/7] power: support monitoring multiple Rx queues
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
                           ` (4 preceding siblings ...)
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-29 15:48         ` Anatoly Burakov
  2021-06-30 10:29           ` Ananyev, Konstantin
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
  2021-07-05 15:21         ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus

Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
Rx queues while entering the energy efficient power state. The multi
version will be used unconditionally if supported, and the UMWAIT one
will only be used when multi-monitor is not supported by the hardware.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v4:
    - Fix possible out of bounds access
    - Added missing index increment

 doc/guides/prog_guide/power_man.rst |  9 ++--
 lib/power/rte_power_pmd_mgmt.c      | 81 ++++++++++++++++++++++++++++-
 2 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index ec04a72108..94353ca012 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
 The "monitor" mode is only supported in the following configurations and scenarios:
 
 * If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor_multi()`` function is supported by the platform, then
+  monitoring multiple Ethernet Rx queues for traffic will be supported.
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
   ``rte_power_monitor()`` is supported by the platform, then monitoring will be
   limited to a mapping of 1 core to 1 queue (thus, each Rx queue will have to be
   monitored from a different lcore).
 
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
-  ``rte_power_monitor()`` function is not supported, then monitor mode will not
-  be supported.
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
+  two monitoring functions is supported, then monitor mode will not be supported.
 
 * Not all Ethernet drivers support monitoring, even if the underlying
   platform may support the necessary CPU instructions. Please refer to
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index fccfd236c2..2056996b9c 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -124,6 +124,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
 	return found;
 }
 
+static inline int
+get_monitor_addresses(struct pmd_core_cfg *cfg,
+		struct rte_power_monitor_cond *pmc, size_t len)
+{
+	const struct queue_list_entry *qle;
+	size_t i = 0;
+	int ret;
+
+	TAILQ_FOREACH(qle, &cfg->head, next) {
+		const union queue *q = &qle->queue;
+		struct rte_power_monitor_cond *cur;
+
+		/* attempted out of bounds access */
+		if (i >= len) {
+			RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
+			return -1;
+		}
+
+		cur = &pmc[i++];
+		ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
+		if (ret < 0)
+			return ret;
+	}
+	return 0;
+}
+
 static void
 calc_tsc(void)
 {
@@ -190,6 +216,45 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
 	return true;
 }
 
+static uint16_t
+clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *arg)
+{
+	const unsigned int lcore = rte_lcore_id();
+	struct queue_list_entry *queue_conf = arg;
+	struct pmd_core_cfg *lcore_conf;
+	const bool empty = nb_rx == 0;
+
+	lcore_conf = &lcore_cfgs[lcore];
+
+	/* if there was traffic, reset state; otherwise consider sleeping */
+	if (likely(!empty))
+		/* early exit */
+		queue_reset(lcore_conf, queue_conf);
+	else {
+		struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
+		int ret;
+
+		/* can this queue sleep? */
+		if (!queue_can_sleep(lcore_conf, queue_conf))
+			return nb_rx;
+
+		/* can this lcore sleep? */
+		if (!lcore_can_sleep(lcore_conf))
+			return nb_rx;
+
+		/* gather all monitoring conditions */
+		ret = get_monitor_addresses(lcore_conf, pmc, RTE_DIM(pmc));
+		if (ret < 0)
+			return nb_rx;
+
+		rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
+	}
+
+	return nb_rx;
+}
+
 static uint16_t
 clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 		uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
@@ -341,14 +406,19 @@ static int
 check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 {
 	struct rte_power_monitor_cond dummy;
+	bool multimonitor_supported;
 
 	/* check if rte_power_monitor is supported */
 	if (!global_data.intrinsics_support.power_monitor) {
 		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
 		return -ENOTSUP;
 	}
+	/* check if multi-monitor is supported */
+	multimonitor_supported =
+			global_data.intrinsics_support.power_monitor_multi;
 
-	if (cfg->n_queues > 0) {
+	/* if we're adding a new queue, do we support multiple queues? */
+	if (cfg->n_queues > 0 && !multimonitor_supported) {
 		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
 		return -ENOTSUP;
 	}
@@ -364,6 +434,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
 	return 0;
 }
 
+static inline rte_rx_callback_fn
+get_monitor_callback(void)
+{
+	return global_data.intrinsics_support.power_monitor_multi ?
+		clb_multiwait : clb_umwait;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
@@ -428,7 +505,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (ret < 0)
 			goto end;
 
-		clb = clb_umwait;
+		clb = get_monitor_callback();
 		break;
 	case RTE_POWER_MGMT_TYPE_SCALE:
 		/* check if we can add a new queue */
-- 
2.25.1



* [dpdk-dev] [PATCH v5 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
                           ` (5 preceding siblings ...)
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-29 15:48         ` Anatoly Burakov
  2021-07-05 15:21         ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-06-29 15:48 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: konstantin.ananyev, ciara.loftus

Currently, l3fwd-power enforces the limitation of having one queue per
lcore. This is no longer necessary, so remove the limitation.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 examples/l3fwd-power/main.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f8dfed1634..52f56dc405 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -2723,12 +2723,6 @@ main(int argc, char **argv)
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
 
-		/* PMD power management mode can only do 1 queue per core */
-		if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) {
-			rte_exit(EXIT_FAILURE,
-				"In PMD power management mode, only one queue per lcore is allowed\n");
-		}
-
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
 			struct rte_eth_rxconf rxq_conf;
-- 
2.25.1



* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
@ 2021-06-30  9:52           ` David Hunt
  2021-07-01  9:01             ` David Hunt
  2021-06-30 11:04           ` Ananyev, Konstantin
  1 sibling, 1 reply; 165+ messages in thread
From: David Hunt @ 2021-06-30  9:52 UTC (permalink / raw)
  To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus

Hi Anatoly,

On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
>    polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
>    added to the list of queues to poll, so that the callback is aware of
>    other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism is
>    shared between all queues polled on a particular lcore, and is only
>    activated when all queues in the list were polled and were determined
>    to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>    is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>      v5:
>      - Remove the "power save queue" API and replace it with mechanism suggested by
>        Konstantin
>      
>      v3:
>      - Move the list of supported NICs to NIC feature table
>      
>      v2:
>      - Use a TAILQ for queues instead of a static array
>      - Address feedback from Konstantin
>      - Add additional checks for stopped queues
>
>   doc/guides/nics/features.rst           |  10 +
>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>   doc/guides/rel_notes/release_21_08.rst |   3 +
>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>   4 files changed, 373 insertions(+), 136 deletions(-)
>

--snip--

>   int
>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>   {
> -	struct pmd_queue_cfg *queue_cfg;
> +	const union queue qdata = {.portid = port_id, .qid = queue_id};
> +	struct pmd_core_cfg *lcore_cfg;
> +	struct queue_list_entry *queue_cfg;
>   	struct rte_eth_dev_info info;
>   	rte_rx_callback_fn clb;
>   	int ret;
> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   		goto end;
>   	}
>   
> -	queue_cfg = &port_cfg[port_id][queue_id];
> +	lcore_cfg = &lcore_cfgs[lcore_id];
>   
> -	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
> +	/* check if other queues are stopped as well */
> +	ret = cfg_queues_stopped(lcore_cfg);
> +	if (ret != 1) {
> +		/* error means invalid queue, 0 means queue wasn't stopped */
> +		ret = ret < 0 ? -EINVAL : -EBUSY;
> +		goto end;
> +	}
> +
> +	/* if callback was already enabled, check current callback type */
> +	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
> +			lcore_cfg->cb_mode != mode) {
>   		ret = -EINVAL;
>   		goto end;
>   	}
> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   
>   	switch (mode) {
>   	case RTE_POWER_MGMT_TYPE_MONITOR:
> -	{
> -		struct rte_power_monitor_cond dummy;
> -
> -		/* check if rte_power_monitor is supported */
> -		if (!global_data.intrinsics_support.power_monitor) {
> -			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_monitor(lcore_cfg, &qdata);
> +		if (ret < 0)
>   			goto end;
> -		}
>   
> -		/* check if the device supports the necessary PMD API */
> -		if (rte_eth_get_monitor_addr(port_id, queue_id,
> -				&dummy) == -ENOTSUP) {
> -			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>   		clb = clb_umwait;
>   		break;
> -	}
>   	case RTE_POWER_MGMT_TYPE_SCALE:
> -	{
> -		enum power_management_env env;
> -		/* only PSTATE and ACPI modes are supported */
> -		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> -				!rte_power_check_env_supported(
> -					PM_ENV_PSTATE_CPUFREQ)) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_scale(lcore_id);
> +		if (ret < 0)
>   			goto end;
> -		}
> -		/* ensure we could initialize the power library */
> -		if (rte_power_init(lcore_id)) {
> -			ret = -EINVAL;
> -			goto end;
> -		}
> -		/* ensure we initialized the correct env */
> -		env = rte_power_get_env();
> -		if (env != PM_ENV_ACPI_CPUFREQ &&
> -				env != PM_ENV_PSTATE_CPUFREQ) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>   		clb = clb_scale_freq;
>   		break;
> -	}
>   	case RTE_POWER_MGMT_TYPE_PAUSE:
>   		/* figure out various time-to-tsc conversions */
>   		if (global_data.tsc_per_us == 0)
> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   		ret = -EINVAL;
>   		goto end;
>   	}
> +	/* add this queue to the list */
> +	ret = queue_list_add(lcore_cfg, &qdata);
> +	if (ret < 0) {
> +		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
> +				strerror(-ret));
> +		goto end;
> +	}
> +	/* new queue is always added last */
> +	queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);


Need to ensure that queue_cfg actually gets set here, otherwise we'll get a 
segfault below. Note that the TAILQ_LAST() call above uses lcore_cfgs (the 
array base, i.e. lcore 0's list) rather than lcore_cfg, so it can walk the 
wrong list and return NULL.



>   
>   	/* initialize data before enabling the callback */
> -	queue_cfg->empty_poll_stats = 0;
> -	queue_cfg->cb_mode = mode;
> -	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> -	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> -			clb, NULL);
> +	if (lcore_cfg->n_queues == 1) {
> +		lcore_cfg->cb_mode = mode;
> +		lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +	}
> +	queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
> +			clb, queue_cfg);
--snip--


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v5 6/7] power: support monitoring multiple Rx queues
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 6/7] power: support monitoring " Anatoly Burakov
@ 2021-06-30 10:29           ` Ananyev, Konstantin
  2021-07-05 10:08             ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-30 10:29 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara



> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
> Rx queues while entering the energy efficient power state. The multi
> version will be used unconditionally if supported, and the UMWAIT one
> will only be used when multi-monitor is not supported by the hardware.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v4:
>     - Fix possible out of bounds access
>     - Added missing index increment
> 
>  doc/guides/prog_guide/power_man.rst |  9 ++--
>  lib/power/rte_power_pmd_mgmt.c      | 81 ++++++++++++++++++++++++++++-
>  2 files changed, 85 insertions(+), 5 deletions(-)
> 
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index ec04a72108..94353ca012 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>  The "monitor" mode is only supported in the following configurations and scenarios:
> 
>  * If ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>    ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>    limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>    monitored from a different lcore).
> 
> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
> -  be supported.
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
> +  two monitoring functions are supported, then monitor mode will not be supported.
> 
>  * Not all Ethernet drivers support monitoring, even if the underlying
>    platform may support the necessary CPU instructions. Please refer to
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index fccfd236c2..2056996b9c 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -124,6 +124,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
>  	return found;
>  }
> 
> +static inline int
> +get_monitor_addresses(struct pmd_core_cfg *cfg,
> +		struct rte_power_monitor_cond *pmc, size_t len)
> +{
> +	const struct queue_list_entry *qle;
> +	size_t i = 0;
> +	int ret;
> +
> +	TAILQ_FOREACH(qle, &cfg->head, next) {
> +		const union queue *q = &qle->queue;
> +		struct rte_power_monitor_cond *cur;
> +
> +		/* attempted out of bounds access */
> +		if (i >= len) {
> +			RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
> +			return -1;
> +		}
> +
> +		cur = &pmc[i++];
> +		ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
>  static void
>  calc_tsc(void)
>  {
> @@ -190,6 +216,45 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
>  	return true;
>  }
> 
> +static uint16_t
> +clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *arg)
> +{
> +	const unsigned int lcore = rte_lcore_id();
> +	struct queue_list_entry *queue_conf = arg;
> +	struct pmd_core_cfg *lcore_conf;
> +	const bool empty = nb_rx == 0;
> +
> +	lcore_conf = &lcore_cfgs[lcore];
> +
> +	if (likely(!empty))
> +		/* early exit */
> +		queue_reset(lcore_conf, queue_conf);
> +	else {
> +		struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];

As discussed, I still think it needs to be pmc[lcore_conf->n_queues];
or, if a VLA is not an option, alloca(), a dynamically allocated lcore_conf->pmc[], or similar.

> +		int ret;
> +
> +		/* can this queue sleep? */
> +		if (!queue_can_sleep(lcore_conf, queue_conf))
> +			return nb_rx;
> +
> +		/* can this lcore sleep? */
> +		if (!lcore_can_sleep(lcore_conf))
> +			return nb_rx;
> +
> +		/* gather all monitoring conditions */
> +		ret = get_monitor_addresses(lcore_conf, pmc, RTE_DIM(pmc));
> +		if (ret < 0)
> +			return nb_rx;
> +
> +		rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
> +	}
> +
> +	return nb_rx;
> +}
> +
>  static uint16_t
>  clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>  		uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
> @@ -341,14 +406,19 @@ static int
>  check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>  {
>  	struct rte_power_monitor_cond dummy;
> +	bool multimonitor_supported;
> 
>  	/* check if rte_power_monitor is supported */
>  	if (!global_data.intrinsics_support.power_monitor) {
>  		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>  		return -ENOTSUP;
>  	}
> +	/* check if multi-monitor is supported */
> +	multimonitor_supported =
> +			global_data.intrinsics_support.power_monitor_multi;
> 
> -	if (cfg->n_queues > 0) {
> +	/* if we're adding a new queue, do we support multiple queues? */
> +	if (cfg->n_queues > 0 && !multimonitor_supported) {
>  		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
>  		return -ENOTSUP;
>  	}
> @@ -364,6 +434,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>  	return 0;
>  }
> 
> +static inline rte_rx_callback_fn
> +get_monitor_callback(void)
> +{
> +	return global_data.intrinsics_support.power_monitor_multi ?
> +		clb_multiwait : clb_umwait;
> +}
> +
>  int
>  rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
> @@ -428,7 +505,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		if (ret < 0)
>  			goto end;
> 
> -		clb = clb_umwait;
> +		clb = get_monitor_callback();
>  		break;
>  	case RTE_POWER_MGMT_TYPE_SCALE:
>  		/* check if we can add a new queue */
> --
> 2.25.1



* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
  2021-06-30  9:52           ` David Hunt
@ 2021-06-30 11:04           ` Ananyev, Konstantin
  2021-07-05 10:23             ` Burakov, Anatoly
  1 sibling, 1 reply; 165+ messages in thread
From: Ananyev, Konstantin @ 2021-06-30 11:04 UTC (permalink / raw)
  To: Burakov, Anatoly, dev, Hunt, David; +Cc: Loftus, Ciara



 
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
> 
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
> 
> - Replace per-queue structures with per-lcore ones, so that any device
>   polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
>   added to the list of queues to poll, so that the callback is aware of
>   other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism is
>   shared between all queues polled on a particular lcore, and is only
>   activated when all queues in the list were polled and were determined
>   to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>   is incapable of monitoring more than one address.
> 
> Also, while we're at it, update and improve the docs.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v5:
>     - Remove the "power save queue" API and replace it with mechanism suggested by
>       Konstantin
> 
>     v3:
>     - Move the list of supported NICs to NIC feature table
> 
>     v2:
>     - Use a TAILQ for queues instead of a static array
>     - Address feedback from Konstantin
>     - Add additional checks for stopped queues
> 
>  doc/guides/nics/features.rst           |  10 +
>  doc/guides/prog_guide/power_man.rst    |  65 ++--
>  doc/guides/rel_notes/release_21_08.rst |   3 +
>  lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>  4 files changed, 373 insertions(+), 136 deletions(-)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>  * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>  * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
> 
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
> +
>  .. _nic_features_other:
> 
>  Other dev ops not represented by a Feature
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index c70ae128ac..ec04a72108 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
>  Abstract
>  ~~~~~~~~
> 
> -Existing power management mechanisms require developers
> -to change application design or change code to make use of it.
> -The PMD power management API provides a convenient alternative
> -by utilizing Ethernet PMD RX callbacks,
> -and triggering power saving whenever empty poll count reaches a certain number.
> -
> -Monitor
> -   This power saving scheme will put the CPU into optimized power state
> -   and use the ``rte_power_monitor()`` function
> -   to monitor the Ethernet PMD RX descriptor address,
> -   and wake the CPU up whenever there's new traffic.
> -
> -Pause
> -   This power saving scheme will avoid busy polling
> -   by either entering power-optimized sleep state
> -   with ``rte_power_pause()`` function,
> -   or, if it's not available, use ``rte_pause()``.
> -
> -Frequency scaling
> -   This power saving scheme will use ``librte_power`` library
> -   functionality to scale the core frequency up/down
> -   depending on traffic volume.
> -
> -.. note::
> -
> -   Currently, this power management API is limited to mandatory mapping
> -   of 1 queue to 1 core (multiple queues are supported,
> -   but they must be polled from different cores).
> +Existing power management mechanisms require developers to change application
> +design or change code to make use of it. The PMD power management API provides a
> +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
> +power saving whenever empty poll count reaches a certain number.
> +
> +* Monitor
> +   This power saving scheme will put the CPU into optimized power state and
> +   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
> +   there's new traffic. Support for this scheme may not be available on all
> +   platforms, and further limitations may apply (see below).
> +
> +* Pause
> +   This power saving scheme will avoid busy polling by either entering
> +   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
> +   not supported by the underlying platform, use ``rte_pause()``.
> +
> +* Frequency scaling
> +   This power saving scheme will use ``librte_power`` library functionality to
> +   scale the core frequency up/down depending on traffic volume.
> +
> +The "monitor" mode is only supported in the following configurations and scenarios:
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> +  limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
> +  monitored from a different lcore).
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> +  ``rte_power_monitor()`` function is not supported, then monitor mode will not
> +  be supported.
> +
> +* Not all Ethernet drivers support monitoring, even if the underlying
> +  platform may support the necessary CPU instructions. Please refer to
> +  :doc:`../nics/overview` for more information.
> +
> 
>  API Overview for Ethernet PMD Power Management
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -242,3 +249,5 @@ References
> 
>  *   The :doc:`../sample_app_ug/vm_power_management`
>      chapter in the :doc:`../sample_app_ug/index` section.
> +
> +*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
> diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
> index f015c509fc..3926d45ef8 100644
> --- a/doc/guides/rel_notes/release_21_08.rst
> +++ b/doc/guides/rel_notes/release_21_08.rst
> @@ -57,6 +57,9 @@ New Features
> 
>  * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
> 
> +* rte_power: The experimental PMD power management API now supports managing
> +  multiple Ethernet Rx queues per lcore.
> +
> 
>  Removed Items
>  -------------
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index 9b95cf1794..fccfd236c2 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -33,18 +33,96 @@ enum pmd_mgmt_state {
>  	PMD_MGMT_ENABLED
>  };
> 
> -struct pmd_queue_cfg {
> +union queue {
> +	uint32_t val;
> +	struct {
> +		uint16_t portid;
> +		uint16_t qid;
> +	};
> +};
> +
> +struct queue_list_entry {
> +	TAILQ_ENTRY(queue_list_entry) next;
> +	union queue queue;
> +	uint64_t n_empty_polls;
> +	const struct rte_eth_rxtx_callback *cb;
> +};
> +
> +struct pmd_core_cfg {
> +	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
> +	/**< List of queues associated with this lcore */
> +	size_t n_queues;
> +	/**< How many queues are in the list? */
>  	volatile enum pmd_mgmt_state pwr_mgmt_state;
>  	/**< State of power management for this queue */
>  	enum rte_power_pmd_mgmt_type cb_mode;
>  	/**< Callback mode for this queue */
> -	const struct rte_eth_rxtx_callback *cur_cb;
> -	/**< Callback instance */
> -	uint64_t empty_poll_stats;
> -	/**< Number of empty polls */
> +	uint64_t n_queues_ready_to_sleep;
> +	/**< Number of queues ready to enter power optimized state */
>  } __rte_cache_aligned;
> +static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
> 
> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> +static inline bool
> +queue_equal(const union queue *l, const union queue *r)
> +{
> +	return l->val == r->val;
> +}
> +
> +static inline void
> +queue_copy(union queue *dst, const union queue *src)
> +{
> +	dst->val = src->val;
> +}
> +
> +static struct queue_list_entry *
> +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
> +{
> +	struct queue_list_entry *cur;
> +
> +	TAILQ_FOREACH(cur, &cfg->head, next) {
> +		if (queue_equal(&cur->queue, q))
> +			return cur;
> +	}
> +	return NULL;
> +}
> +
> +static int
> +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
> +{
> +	struct queue_list_entry *qle;
> +
> +	/* is it already in the list? */
> +	if (queue_list_find(cfg, q) != NULL)
> +		return -EEXIST;
> +
> +	qle = malloc(sizeof(*qle));
> +	if (qle == NULL)
> +		return -ENOMEM;
> +	memset(qle, 0, sizeof(*qle));
> +
> +	queue_copy(&qle->queue, q);
> +	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
> +	cfg->n_queues++;
> +	qle->n_empty_polls = 0;
> +
> +	return 0;
> +}
> +
> +static struct queue_list_entry *
> +queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
> +{
> +	struct queue_list_entry *found;
> +
> +	found = queue_list_find(cfg, q);
> +	if (found == NULL)
> +		return NULL;
> +
> +	TAILQ_REMOVE(&cfg->head, found, next);
> +	cfg->n_queues--;
> +
> +	/* freeing is responsibility of the caller */
> +	return found;
> +}
> 
>  static void
>  calc_tsc(void)
> @@ -74,21 +152,56 @@ calc_tsc(void)
>  	}
>  }
> 
> +static inline void
> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +	/* reset empty poll counter for this queue */
> +	qcfg->n_empty_polls = 0;
> +	/* reset the sleep counter too */
> +	cfg->n_queues_ready_to_sleep = 0;
> +}
> +
> +static inline bool
> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +	/* this function is called - that means we have an empty poll */
> +	qcfg->n_empty_polls++;
> +
> +	/* if we haven't reached threshold for empty polls, we can't sleep */
> +	if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
> +		return false;
> +
> +	/* we're ready to sleep */
> +	cfg->n_queues_ready_to_sleep++;
> +
> +	return true;
> +}
> +
> +static inline bool
> +lcore_can_sleep(struct pmd_core_cfg *cfg)
> +{
> +	/* are all queues ready to sleep? */
> +	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
> +		return false;
> +
> +	/* we've reached an iteration where we can sleep, reset sleep counter */
> +	cfg->n_queues_ready_to_sleep = 0;
> +
> +	return true;
> +}

As far as I can see, this is a slightly modified version of what was discussed.
I understand that it seems simpler, but I think there are some problems with it:
- each queue can be counted more than once in lcore_cfg->n_queues_ready_to_sleep
- each queue's n_empty_polls counter is not reset after sleep().

To illustrate the problem, let say we have 2 queues, and at some moment we have:
q0.n_empty_polls == EMPTYPOLL_MAX + 1
q1.n_empty_polls == EMPTYPOLL_MAX + 1
cfg->n_queues_ready_to_sleep == 2

So lcore_can_sleep() returns 'true' and sets:
cfg->n_queues_ready_to_sleep == 0

Now, after sleep():
q0.n_empty_polls == EMPTYPOLL_MAX + 1
q1.n_empty_polls == EMPTYPOLL_MAX + 1

So after:
queue_can_sleep(q0);
queue_can_sleep(q1);

we will have:
cfg->n_queues_ready_to_sleep == 2
again, and we'll go to another sleep after just one rx_burst() attempt for each queue.

> +
>  static uint16_t
>  clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> -		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> -		void *addr __rte_unused)
> +		uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
>  {
> +	struct queue_list_entry *queue_conf = arg;
> 
> -	struct pmd_queue_cfg *q_conf;
> -
> -	q_conf = &port_cfg[port_id][qidx];
> -
> +	/* this callback can't do more than one queue, omit multiqueue logic */
>  	if (unlikely(nb_rx == 0)) {
> -		q_conf->empty_poll_stats++;
> -		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +		queue_conf->n_empty_polls++;
> +		if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
>  			struct rte_power_monitor_cond pmc;
> -			uint16_t ret;
> +			int ret;
> 
>  			/* use monitoring condition to sleep */
>  			ret = rte_eth_get_monitor_addr(port_id, qidx,
> @@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>  				rte_power_monitor(&pmc, UINT64_MAX);
>  		}
>  	} else
> -		q_conf->empty_poll_stats = 0;
> +		queue_conf->n_empty_polls = 0;
> 
>  	return nb_rx;
>  }
> 
>  static uint16_t
> -clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> -		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> -		void *addr __rte_unused)
> +clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *arg)
>  {
> -	struct pmd_queue_cfg *q_conf;
> +	const unsigned int lcore = rte_lcore_id();
> +	struct queue_list_entry *queue_conf = arg;
> +	struct pmd_core_cfg *lcore_conf;
> +	const bool empty = nb_rx == 0;
> 
> -	q_conf = &port_cfg[port_id][qidx];
> +	lcore_conf = &lcore_cfgs[lcore];
> 
> -	if (unlikely(nb_rx == 0)) {
> -		q_conf->empty_poll_stats++;
> -		/* sleep for 1 microsecond */
> -		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> -			/* use tpause if we have it */
> -			if (global_data.intrinsics_support.power_pause) {
> -				const uint64_t cur = rte_rdtsc();
> -				const uint64_t wait_tsc =
> -						cur + global_data.tsc_per_us;
> -				rte_power_pause(wait_tsc);
> -			} else {
> -				uint64_t i;
> -				for (i = 0; i < global_data.pause_per_us; i++)
> -					rte_pause();
> -			}
> +	if (likely(!empty))
> +		/* early exit */
> +		queue_reset(lcore_conf, queue_conf);
> +	else {
> +		/* can this queue sleep? */
> +		if (!queue_can_sleep(lcore_conf, queue_conf))
> +			return nb_rx;
> +
> +		/* can this lcore sleep? */
> +		if (!lcore_can_sleep(lcore_conf))
> +			return nb_rx;
> +
> +		/* sleep for 1 microsecond, use tpause if we have it */
> +		if (global_data.intrinsics_support.power_pause) {
> +			const uint64_t cur = rte_rdtsc();
> +			const uint64_t wait_tsc =
> +					cur + global_data.tsc_per_us;
> +			rte_power_pause(wait_tsc);
> +		} else {
> +			uint64_t i;
> +			for (i = 0; i < global_data.pause_per_us; i++)
> +				rte_pause();
>  		}
> -	} else
> -		q_conf->empty_poll_stats = 0;
> +	}
> 
>  	return nb_rx;
>  }
> 
>  static uint16_t
> -clb_scale_freq(uint16_t port_id, uint16_t qidx,
> +clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>  		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> -		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +		uint16_t max_pkts __rte_unused, void *arg)
>  {
> -	struct pmd_queue_cfg *q_conf;
> +	const unsigned int lcore = rte_lcore_id();
> +	const bool empty = nb_rx == 0;
> +	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
> +	struct queue_list_entry *queue_conf = arg;
> 
> -	q_conf = &port_cfg[port_id][qidx];
> +	if (likely(!empty)) {
> +		/* early exit */
> +		queue_reset(lcore_conf, queue_conf);
> 
> -	if (unlikely(nb_rx == 0)) {
> -		q_conf->empty_poll_stats++;
> -		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
> -			/* scale down freq */
> -			rte_power_freq_min(rte_lcore_id());
> -	} else {
> -		q_conf->empty_poll_stats = 0;
> -		/* scale up freq */
> +		/* scale up freq immediately */
>  		rte_power_freq_max(rte_lcore_id());
> +	} else {
> +		/* can this queue sleep? */
> +		if (!queue_can_sleep(lcore_conf, queue_conf))
> +			return nb_rx;
> +
> +		/* can this lcore sleep? */
> +		if (!lcore_can_sleep(lcore_conf))
> +			return nb_rx;
> +
> +		rte_power_freq_min(rte_lcore_id());
>  	}
> 
>  	return nb_rx;
> @@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
>  	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
>  }
> 
> +static int
> +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
> +{
> +	const struct queue_list_entry *entry;
> +
> +	TAILQ_FOREACH(entry, &queue_cfg->head, next) {
> +		const union queue *q = &entry->queue;
> +		int ret = queue_stopped(q->portid, q->qid);
> +		if (ret != 1)
> +			return ret;
> +	}
> +	return 1;
> +}
> +
> +static int
> +check_scale(unsigned int lcore)
> +{
> +	enum power_management_env env;
> +
> +	/* only PSTATE and ACPI modes are supported */
> +	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> +	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
> +		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> +		return -ENOTSUP;
> +	}
> +	/* ensure we could initialize the power library */
> +	if (rte_power_init(lcore))
> +		return -EINVAL;
> +
> +	/* ensure we initialized the correct env */
> +	env = rte_power_get_env();
> +	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
> +		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> +		return -ENOTSUP;
> +	}
> +
> +	/* we're done */
> +	return 0;
> +}
> +
> +static int
> +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
> +{
> +	struct rte_power_monitor_cond dummy;
> +
> +	/* check if rte_power_monitor is supported */
> +	if (!global_data.intrinsics_support.power_monitor) {
> +		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> +		return -ENOTSUP;
> +	}
> +
> +	if (cfg->n_queues > 0) {
> +		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
> +		return -ENOTSUP;
> +	}
> +
> +	/* check if the device supports the necessary PMD API */
> +	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
> +			&dummy) == -ENOTSUP) {
> +		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> +		return -ENOTSUP;
> +	}
> +
> +	/* we're done */
> +	return 0;
> +}
> +
>  int
>  rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>  {
> -	struct pmd_queue_cfg *queue_cfg;
> +	const union queue qdata = {.portid = port_id, .qid = queue_id};
> +	struct pmd_core_cfg *lcore_cfg;
> +	struct queue_list_entry *queue_cfg;
>  	struct rte_eth_dev_info info;
>  	rte_rx_callback_fn clb;
>  	int ret;
> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		goto end;
>  	}
> 
> -	queue_cfg = &port_cfg[port_id][queue_id];
> +	lcore_cfg = &lcore_cfgs[lcore_id];
> 
> -	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
> +	/* check if other queues are stopped as well */
> +	ret = cfg_queues_stopped(lcore_cfg);
> +	if (ret != 1) {
> +		/* error means invalid queue, 0 means queue wasn't stopped */
> +		ret = ret < 0 ? -EINVAL : -EBUSY;
> +		goto end;
> +	}
> +
> +	/* if callback was already enabled, check current callback type */
> +	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
> +			lcore_cfg->cb_mode != mode) {
>  		ret = -EINVAL;
>  		goto end;
>  	}
> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> 
>  	switch (mode) {
>  	case RTE_POWER_MGMT_TYPE_MONITOR:
> -	{
> -		struct rte_power_monitor_cond dummy;
> -
> -		/* check if rte_power_monitor is supported */
> -		if (!global_data.intrinsics_support.power_monitor) {
> -			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_monitor(lcore_cfg, &qdata);
> +		if (ret < 0)
>  			goto end;
> -		}
> 
> -		/* check if the device supports the necessary PMD API */
> -		if (rte_eth_get_monitor_addr(port_id, queue_id,
> -				&dummy) == -ENOTSUP) {
> -			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>  		clb = clb_umwait;
>  		break;
> -	}
>  	case RTE_POWER_MGMT_TYPE_SCALE:
> -	{
> -		enum power_management_env env;
> -		/* only PSTATE and ACPI modes are supported */
> -		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> -				!rte_power_check_env_supported(
> -					PM_ENV_PSTATE_CPUFREQ)) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_scale(lcore_id);
> +		if (ret < 0)
>  			goto end;
> -		}
> -		/* ensure we could initialize the power library */
> -		if (rte_power_init(lcore_id)) {
> -			ret = -EINVAL;
> -			goto end;
> -		}
> -		/* ensure we initialized the correct env */
> -		env = rte_power_get_env();
> -		if (env != PM_ENV_ACPI_CPUFREQ &&
> -				env != PM_ENV_PSTATE_CPUFREQ) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>  		clb = clb_scale_freq;
>  		break;
> -	}
>  	case RTE_POWER_MGMT_TYPE_PAUSE:
>  		/* figure out various time-to-tsc conversions */
>  		if (global_data.tsc_per_us == 0)
> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		ret = -EINVAL;
>  		goto end;
>  	}
> +	/* add this queue to the list */
> +	ret = queue_list_add(lcore_cfg, &qdata);
> +	if (ret < 0) {
> +		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
> +				strerror(-ret));
> +		goto end;
> +	}
> +	/* new queue is always added last */
> +	queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
> 
>  	/* initialize data before enabling the callback */
> -	queue_cfg->empty_poll_stats = 0;
> -	queue_cfg->cb_mode = mode;
> -	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> -	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> -			clb, NULL);
> +	if (lcore_cfg->n_queues == 1) {
> +		lcore_cfg->cb_mode = mode;
> +		lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +	}
> +	queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
> +			clb, queue_cfg);
> 
>  	ret = 0;
>  end:
> @@ -290,7 +476,9 @@ int
>  rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  		uint16_t port_id, uint16_t queue_id)
>  {
> -	struct pmd_queue_cfg *queue_cfg;
> +	const union queue qdata = {.portid = port_id, .qid = queue_id};
> +	struct pmd_core_cfg *lcore_cfg;
> +	struct queue_list_entry *queue_cfg;
>  	int ret;
> 
>  	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> @@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  	}
> 
>  	/* no need to check queue id as wrong queue id would not be enabled */
> -	queue_cfg = &port_cfg[port_id][queue_id];
> +	lcore_cfg = &lcore_cfgs[lcore_id];
> 
> -	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
> +	/* check if other queues are stopped as well */
> +	ret = cfg_queues_stopped(lcore_cfg);
> +	if (ret != 1) {
> +		/* error means invalid queue, 0 means queue wasn't stopped */
> +		return ret < 0 ? -EINVAL : -EBUSY;
> +	}
> +
> +	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>  		return -EINVAL;
> 
> -	/* stop any callbacks from progressing */
> -	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +	/*
> +	 * There is no good/easy way to do this without race conditions, so we
> +	 * are just going to throw our hands in the air and hope that the user
> +	 * has read the documentation and has ensured that ports are stopped at
> +	 * the time we enter the API functions.
> +	 */
> +	queue_cfg = queue_list_take(lcore_cfg, &qdata);
> +	if (queue_cfg == NULL)
> +		return -ENOENT;
> 
> -	switch (queue_cfg->cb_mode) {
> +	/* if we've removed all queues from the lists, set state to disabled */
> +	if (lcore_cfg->n_queues == 0)
> +		lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +
> +	switch (lcore_cfg->cb_mode) {
>  	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
>  	case RTE_POWER_MGMT_TYPE_PAUSE:
> -		rte_eth_remove_rx_callback(port_id, queue_id,
> -				queue_cfg->cur_cb);
> +		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>  		break;
>  	case RTE_POWER_MGMT_TYPE_SCALE:
>  		rte_power_freq_max(lcore_id);
> -		rte_eth_remove_rx_callback(port_id, queue_id,
> -				queue_cfg->cur_cb);
> +		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>  		rte_power_exit(lcore_id);
>  		break;
>  	}
> @@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  	 * ports before calling any of these API's, so we can assume that the
>  	 * callbacks can be freed. we're intentionally casting away const-ness.
>  	 */
> -	rte_free((void *)queue_cfg->cur_cb);
> +	rte_free((void *)queue_cfg->cb);
> +	free(queue_cfg);
> 
>  	return 0;
>  }
> +
> +RTE_INIT(rte_power_ethdev_pmgmt_init) {
> +	size_t i;
> +
> +	/* initialize all tailqs */
> +	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
> +		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
> +		TAILQ_INIT(&cfg->head);
> +	}
> +}
> --
> 2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
  2021-06-30  9:52           ` David Hunt
@ 2021-07-01  9:01             ` David Hunt
  2021-07-05 10:24               ` Burakov, Anatoly
  0 siblings, 1 reply; 165+ messages in thread
From: David Hunt @ 2021-07-01  9:01 UTC (permalink / raw)
  To: Anatoly Burakov, dev; +Cc: konstantin.ananyev, ciara.loftus


On 30/6/2021 10:52 AM, David Hunt wrote:
> Hi Anatoly,
>
> On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing. This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>>    polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>>    added to the list of queues to poll, so that the callback is aware of
>>    other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism is
>>    shared between all queues polled on a particular lcore, and is only
>>    activated when all queues in the list were polled and were determined
>>    to have no traffic.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>    is incapable of monitoring more than one address.
>>
>> Also, while we're at it, update and improve the docs.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v5:
>> -      - Remove the "power save queue" API and replace it with mechanism
>> -        suggested by Konstantin
>> -
>> -      v3:
>> -      - Move the list of supported NICs to NIC feature table
>> -
>> -      v2:
>>      - Use a TAILQ for queues instead of a static array
>>      - Address feedback from Konstantin
>>      - Add additional checks for stopped queues
>>
>>   doc/guides/nics/features.rst           |  10 +
>>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>>   doc/guides/rel_notes/release_21_08.rst |   3 +
>>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>>   4 files changed, 373 insertions(+), 136 deletions(-)
>>
>
> --snip--
>
>>   int
>>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>           uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>>   {
>> -    struct pmd_queue_cfg *queue_cfg;
>> +    const union queue qdata = {.portid = port_id, .qid = queue_id};
>> +    struct pmd_core_cfg *lcore_cfg;
>> +    struct queue_list_entry *queue_cfg;
>>       struct rte_eth_dev_info info;
>>       rte_rx_callback_fn clb;
>>       int ret;
>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>           goto end;
>>       }
>>   -    queue_cfg = &port_cfg[port_id][queue_id];
>> +    lcore_cfg = &lcore_cfgs[lcore_id];
>>   -    if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>> +    /* check if other queues are stopped as well */
>> +    ret = cfg_queues_stopped(lcore_cfg);
>> +    if (ret != 1) {
>> +        /* error means invalid queue, 0 means queue wasn't stopped */
>> +        ret = ret < 0 ? -EINVAL : -EBUSY;
>> +        goto end;
>> +    }
>> +
>> +    /* if callback was already enabled, check current callback type */
>> +    if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>> +            lcore_cfg->cb_mode != mode) {
>>           ret = -EINVAL;
>>           goto end;
>>       }
>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>         switch (mode) {
>>       case RTE_POWER_MGMT_TYPE_MONITOR:
>> -    {
>> -        struct rte_power_monitor_cond dummy;
>> -
>> -        /* check if rte_power_monitor is supported */
>> -        if (!global_data.intrinsics_support.power_monitor) {
>> -            RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> -            ret = -ENOTSUP;
>> +        /* check if we can add a new queue */
>> +        ret = check_monitor(lcore_cfg, &qdata);
>> +        if (ret < 0)
>>               goto end;
>> -        }
>>   -        /* check if the device supports the necessary PMD API */
>> -        if (rte_eth_get_monitor_addr(port_id, queue_id,
>> -                &dummy) == -ENOTSUP) {
>> -            RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
>> -            ret = -ENOTSUP;
>> -            goto end;
>> -        }
>>           clb = clb_umwait;
>>           break;
>> -    }
>>       case RTE_POWER_MGMT_TYPE_SCALE:
>> -    {
>> -        enum power_management_env env;
>> -        /* only PSTATE and ACPI modes are supported */
>> -        if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> -                !rte_power_check_env_supported(
>> -                    PM_ENV_PSTATE_CPUFREQ)) {
>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
>> -            ret = -ENOTSUP;
>> +        /* check if we can add a new queue */
>> +        ret = check_scale(lcore_id);
>> +        if (ret < 0)
>>               goto end;
>> -        }
>> -        /* ensure we could initialize the power library */
>> -        if (rte_power_init(lcore_id)) {
>> -            ret = -EINVAL;
>> -            goto end;
>> -        }
>> -        /* ensure we initialized the correct env */
>> -        env = rte_power_get_env();
>> -        if (env != PM_ENV_ACPI_CPUFREQ &&
>> -                env != PM_ENV_PSTATE_CPUFREQ) {
>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
>> -            ret = -ENOTSUP;
>> -            goto end;
>> -        }
>>           clb = clb_scale_freq;
>>           break;
>> -    }
>>       case RTE_POWER_MGMT_TYPE_PAUSE:
>>           /* figure out various time-to-tsc conversions */
>>           if (global_data.tsc_per_us == 0)
>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>           ret = -EINVAL;
>>           goto end;
>>       }
>> +    /* add this queue to the list */
>> +    ret = queue_list_add(lcore_cfg, &qdata);
>> +    if (ret < 0) {
>> +        RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>> +                strerror(-ret));
>> +        goto end;
>> +    }
>> +    /* new queue is always added last */
>> +    queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>
>
> Need to ensure that queue_cfg gets set here, otherwise we'll get a 
> segfault below.
>

Or, looking at this again, shouldn't "lcore_cfgs" be "lcore_cfg"?


>
>
>>         /* initialize data before enabling the callback */
>> -    queue_cfg->empty_poll_stats = 0;
>> -    queue_cfg->cb_mode = mode;
>> -    queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> -    queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> -            clb, NULL);
>> +    if (lcore_cfg->n_queues == 1) {
>> +        lcore_cfg->cb_mode = mode;
>> +        lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +    }
>> +    queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +            clb, queue_cfg);
> --snip--
>


* Re: [dpdk-dev] [PATCH v5 6/7] power: support monitoring multiple Rx queues
  2021-06-30 10:29           ` Ananyev, Konstantin
@ 2021-07-05 10:08             ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-05 10:08 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 30-Jun-21 11:29 AM, Ananyev, Konstantin wrote:
> 
> 
>> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
>> Rx queues while entering the energy efficient power state. The multi
>> version will be used unconditionally if supported, and the UMWAIT one
>> will only be used when multi-monitor is not supported by the hardware.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v4:
>>      - Fix possible out of bounds access
>>      - Added missing index increment
>>
>>   doc/guides/prog_guide/power_man.rst |  9 ++--
>>   lib/power/rte_power_pmd_mgmt.c      | 81 ++++++++++++++++++++++++++++-
>>   2 files changed, 85 insertions(+), 5 deletions(-)
>>
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index ec04a72108..94353ca012 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>>   The "monitor" mode is only supported in the following configurations and scenarios:
>>
>>   * If ``rte_cpu_get_intrinsics_support()`` function indicates that
>> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
>> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>>     ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>>     limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>>     monitored from a different lcore).
>>
>> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
>> -  be supported.
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
>> +  two monitoring functions are supported, then monitor mode will not be supported.
>>
>>   * Not all Ethernet drivers support monitoring, even if the underlying
>>     platform may support the necessary CPU instructions. Please refer to
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index fccfd236c2..2056996b9c 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -124,6 +124,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
>>        return found;
>>   }
>>
>> +static inline int
>> +get_monitor_addresses(struct pmd_core_cfg *cfg,
>> +             struct rte_power_monitor_cond *pmc, size_t len)
>> +{
>> +     const struct queue_list_entry *qle;
>> +     size_t i = 0;
>> +     int ret;
>> +
>> +     TAILQ_FOREACH(qle, &cfg->head, next) {
>> +             const union queue *q = &qle->queue;
>> +             struct rte_power_monitor_cond *cur;
>> +
>> +             /* attempted out of bounds access */
>> +             if (i >= len) {
>> +                     RTE_LOG(ERR, POWER, "Too many queues being monitored\n");
>> +                     return -1;
>> +             }
>> +
>> +             cur = &pmc[i++];
>> +             ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
>> +             if (ret < 0)
>> +                     return ret;
>> +     }
>> +     return 0;
>> +}
>> +
>>   static void
>>   calc_tsc(void)
>>   {
>> @@ -190,6 +216,45 @@ lcore_can_sleep(struct pmd_core_cfg *cfg)
>>        return true;
>>   }
>>
>> +static uint16_t
>> +clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +             uint16_t max_pkts __rte_unused, void *arg)
>> +{
>> +     const unsigned int lcore = rte_lcore_id();
>> +     struct queue_list_entry *queue_conf = arg;
>> +     struct pmd_core_cfg *lcore_conf;
>> +     const bool empty = nb_rx == 0;
>> +
>> +     lcore_conf = &lcore_cfgs[lcore];
>> +
>> +     /* early exit */
>> +     if (likely(!empty))
>> +             /* early exit */
>> +             queue_reset(lcore_conf, queue_conf);
>> +     else {
>> +             struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS];
> 
> As discussed, I still think it needs to be pmc[lcore_conf->n_queues];
> Or if VLA is not an option - alloca(), or dynamic lcore_conf->pmc[], or...
> 

Apologies, this was a rebase mistake. Thanks for catching it! Will fix 
in v6.

-- 
Thanks,
Anatoly


* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
  2021-06-30 11:04           ` Ananyev, Konstantin
@ 2021-07-05 10:23             ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-05 10:23 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Hunt, David; +Cc: Loftus, Ciara

On 30-Jun-21 12:04 PM, Ananyev, Konstantin wrote:
> 
> 
> 
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing. This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>>    polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>>    added to the list of queues to poll, so that the callback is aware of
>>    other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism is
>>    shared between all queues polled on a particular lcore, and is only
>>    activated when all queues in the list were polled and were determined
>>    to have no traffic.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>    is incapable of monitoring more than one address.
>>
>> Also, while we're at it, update and improve the docs.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v5:
>>      - Remove the "power save queue" API and replace it with mechanism suggested by
>>        Konstantin
>>
>>      v3:
>>      - Move the list of supported NICs to NIC feature table
>>
>>      v2:
>>      - Use a TAILQ for queues instead of a static array
>>      - Address feedback from Konstantin
>>      - Add additional checks for stopped queues
>>
>>   doc/guides/nics/features.rst           |  10 +
>>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>>   doc/guides/rel_notes/release_21_08.rst |   3 +
>>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>>   4 files changed, 373 insertions(+), 136 deletions(-)
>>
>> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
>> index 403c2b03a3..a96e12d155 100644
>> --- a/doc/guides/nics/features.rst
>> +++ b/doc/guides/nics/features.rst
>> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>>   * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>>   * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>>
>> +.. _nic_features_get_monitor_addr:
>> +
>> +PMD power management using monitor addresses
>> +--------------------------------------------
>> +
>> +Supports getting a monitoring condition to use together with Ethernet PMD power
>> +management (see :doc:`../prog_guide/power_man` for more details).
>> +
>> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
>> +
>>   .. _nic_features_other:
>>
>>   Other dev ops not represented by a Feature
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index c70ae128ac..ec04a72108 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
>>   Abstract
>>   ~~~~~~~~
>>
>> -Existing power management mechanisms require developers
>> -to change application design or change code to make use of it.
>> -The PMD power management API provides a convenient alternative
>> -by utilizing Ethernet PMD RX callbacks,
>> -and triggering power saving whenever empty poll count reaches a certain number.
>> -
>> -Monitor
>> -   This power saving scheme will put the CPU into optimized power state
>> -   and use the ``rte_power_monitor()`` function
>> -   to monitor the Ethernet PMD RX descriptor address,
>> -   and wake the CPU up whenever there's new traffic.
>> -
>> -Pause
>> -   This power saving scheme will avoid busy polling
>> -   by either entering power-optimized sleep state
>> -   with ``rte_power_pause()`` function,
>> -   or, if it's not available, use ``rte_pause()``.
>> -
>> -Frequency scaling
>> -   This power saving scheme will use ``librte_power`` library
>> -   functionality to scale the core frequency up/down
>> -   depending on traffic volume.
>> -
>> -.. note::
>> -
>> -   Currently, this power management API is limited to mandatory mapping
>> -   of 1 queue to 1 core (multiple queues are supported,
>> -   but they must be polled from different cores).
>> +Existing power management mechanisms require developers to change application
>> +design or change code to make use of it. The PMD power management API provides a
>> +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
>> +power saving whenever empty poll count reaches a certain number.
>> +
>> +* Monitor
>> +   This power saving scheme will put the CPU into optimized power state and
>> +   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
>> +   there's new traffic. Support for this scheme may not be available on all
>> +   platforms, and further limitations may apply (see below).
>> +
>> +* Pause
>> +   This power saving scheme will avoid busy polling by either entering
>> +   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
>> +   not supported by the underlying platform, use ``rte_pause()``.
>> +
>> +* Frequency scaling
>> +   This power saving scheme will use ``librte_power`` library functionality to
>> +   scale the core frequency up/down depending on traffic volume.
>> +
>> +The "monitor" mode is only supported in the following configurations and scenarios:
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that
>> +  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>> +  limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>> +  monitored from a different lcore).
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>> +  ``rte_power_monitor()`` function is not supported, then monitor mode will not
>> +  be supported.
>> +
>> +* Not all Ethernet drivers support monitoring, even if the underlying
>> +  platform may support the necessary CPU instructions. Please refer to
>> +  :doc:`../nics/overview` for more information.
>> +
>>
>>   API Overview for Ethernet PMD Power Management
>>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> @@ -242,3 +249,5 @@ References
>>
>>   *   The :doc:`../sample_app_ug/vm_power_management`
>>       chapter in the :doc:`../sample_app_ug/index` section.
>> +
>> +*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
>> diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
>> index f015c509fc..3926d45ef8 100644
>> --- a/doc/guides/rel_notes/release_21_08.rst
>> +++ b/doc/guides/rel_notes/release_21_08.rst
>> @@ -57,6 +57,9 @@ New Features
>>
>>   * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
>>
>> +* rte_power: The experimental PMD power management API now supports managing
>> +  multiple Ethernet Rx queues per lcore.
>> +
>>
>>   Removed Items
>>   -------------
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index 9b95cf1794..fccfd236c2 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -33,18 +33,96 @@ enum pmd_mgmt_state {
>>        PMD_MGMT_ENABLED
>>   };
>>
>> -struct pmd_queue_cfg {
>> +union queue {
>> +     uint32_t val;
>> +     struct {
>> +             uint16_t portid;
>> +             uint16_t qid;
>> +     };
>> +};
>> +
>> +struct queue_list_entry {
>> +     TAILQ_ENTRY(queue_list_entry) next;
>> +     union queue queue;
>> +     uint64_t n_empty_polls;
>> +     const struct rte_eth_rxtx_callback *cb;
>> +};
>> +
>> +struct pmd_core_cfg {
>> +     TAILQ_HEAD(queue_list_head, queue_list_entry) head;
>> +     /**< List of queues associated with this lcore */
>> +     size_t n_queues;
>> +     /**< How many queues are in the list? */
>>        volatile enum pmd_mgmt_state pwr_mgmt_state;
>>        /**< State of power management for this queue */
>>        enum rte_power_pmd_mgmt_type cb_mode;
>>        /**< Callback mode for this queue */
>> -     const struct rte_eth_rxtx_callback *cur_cb;
>> -     /**< Callback instance */
>> -     uint64_t empty_poll_stats;
>> -     /**< Number of empty polls */
>> +     uint64_t n_queues_ready_to_sleep;
>> +     /**< Number of queues ready to enter power optimized state */
>>   } __rte_cache_aligned;
>> +static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
>>
>> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
>> +static inline bool
>> +queue_equal(const union queue *l, const union queue *r)
>> +{
>> +     return l->val == r->val;
>> +}
>> +
>> +static inline void
>> +queue_copy(union queue *dst, const union queue *src)
>> +{
>> +     dst->val = src->val;
>> +}
>> +
>> +static struct queue_list_entry *
>> +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> +     struct queue_list_entry *cur;
>> +
>> +     TAILQ_FOREACH(cur, &cfg->head, next) {
>> +             if (queue_equal(&cur->queue, q))
>> +                     return cur;
>> +     }
>> +     return NULL;
>> +}
>> +
>> +static int
>> +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> +     struct queue_list_entry *qle;
>> +
>> +     /* is it already in the list? */
>> +     if (queue_list_find(cfg, q) != NULL)
>> +             return -EEXIST;
>> +
>> +     qle = malloc(sizeof(*qle));
>> +     if (qle == NULL)
>> +             return -ENOMEM;
>> +     memset(qle, 0, sizeof(*qle));
>> +
>> +     queue_copy(&qle->queue, q);
>> +     TAILQ_INSERT_TAIL(&cfg->head, qle, next);
>> +     cfg->n_queues++;
>> +     qle->n_empty_polls = 0;
>> +
>> +     return 0;
>> +}
>> +
>> +static struct queue_list_entry *
>> +queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> +     struct queue_list_entry *found;
>> +
>> +     found = queue_list_find(cfg, q);
>> +     if (found == NULL)
>> +             return NULL;
>> +
>> +     TAILQ_REMOVE(&cfg->head, found, next);
>> +     cfg->n_queues--;
>> +
>> +     /* freeing is responsibility of the caller */
>> +     return found;
>> +}
>>
>>   static void
>>   calc_tsc(void)
>> @@ -74,21 +152,56 @@ calc_tsc(void)
>>        }
>>   }
>>
>> +static inline void
>> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> +     /* reset empty poll counter for this queue */
>> +     qcfg->n_empty_polls = 0;
>> +     /* reset the sleep counter too */
>> +     cfg->n_queues_ready_to_sleep = 0;
>> +}
>> +
>> +static inline bool
>> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> +     /* this function is called - that means we have an empty poll */
>> +     qcfg->n_empty_polls++;
>> +
>> +     /* if we haven't reached threshold for empty polls, we can't sleep */
>> +     if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
>> +             return false;
>> +
>> +     /* we're ready to sleep */
>> +     cfg->n_queues_ready_to_sleep++;
>> +
>> +     return true;
>> +}
>> +
>> +static inline bool
>> +lcore_can_sleep(struct pmd_core_cfg *cfg)
>> +{
>> +     /* are all queues ready to sleep? */
>> +     if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
>> +             return false;
>> +
>> +     /* we've reached an iteration where we can sleep, reset sleep counter */
>> +     cfg->n_queues_ready_to_sleep = 0;
>> +
>> +     return true;
>> +}
> 
> As I can see, it's a slightly modified version of what was discussed.
> I understand that it seems simpler, but I think there are some problems with it:
> - each queue can be counted more than once at lcore_cfg->n_queues_ready_to_sleep
> - queues n_empty_polls are not reset after sleep().
> 

The latter is intentional: we *want* to sleep constantly once we pass 
the empty poll counter.

The former shouldn't be a big problem in the conventional case, as I
don't think there are situations where people would poll core-pinned
queues in different orders, but you're right: this is a potential issue
and should be fixed. I'll add back the n_sleeps in the next iteration.

> To illustrate the problem, let say we have 2 queues, and at some moment we have:
> q0.n_empty_polls == EMPTYPOLL_MAX + 1
> q1.n_empty_polls == EMPTYPOLL_MAX + 1
> cfg->n_queues_ready_to_sleep == 2
> 
> So lcore_can_sleep() returns 'true' and sets:
> cfg->n_queues_ready_to_sleep == 0
> 
> Now, after sleep():
> q0.n_empty_polls == EMPTYPOLL_MAX + 1
> q1.n_empty_polls == EMPTYPOLL_MAX + 1
> 
>   So after:
> queue_can_sleep(q0);
> queue_can_sleep(q1);
> 
> will have:
> cfg->n_queues_ready_to_sleep == 2
> again, and we'll go to another sleep after just one rx_burst() attempt for each queue.
> 
>> +
>>   static uint16_t
>>   clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> -             uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> -             void *addr __rte_unused)
>> +             uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
>>   {
>> +     struct queue_list_entry *queue_conf = arg;
>>
>> -     struct pmd_queue_cfg *q_conf;
>> -
>> -     q_conf = &port_cfg[port_id][qidx];
>> -
>> +     /* this callback can't do more than one queue, omit multiqueue logic */
>>        if (unlikely(nb_rx == 0)) {
>> -             q_conf->empty_poll_stats++;
>> -             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +             queue_conf->n_empty_polls++;
>> +             if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
>>                        struct rte_power_monitor_cond pmc;
>> -                     uint16_t ret;
>> +                     int ret;
>>
>>                        /* use monitoring condition to sleep */
>>                        ret = rte_eth_get_monitor_addr(port_id, qidx,
>> @@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>>                                rte_power_monitor(&pmc, UINT64_MAX);
>>                }
>>        } else
>> -             q_conf->empty_poll_stats = 0;
>> +             queue_conf->n_empty_polls = 0;
>>
>>        return nb_rx;
>>   }
>>
>>   static uint16_t
>> -clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> -             uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> -             void *addr __rte_unused)
>> +clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +             uint16_t max_pkts __rte_unused, void *arg)
>>   {
>> -     struct pmd_queue_cfg *q_conf;
>> +     const unsigned int lcore = rte_lcore_id();
>> +     struct queue_list_entry *queue_conf = arg;
>> +     struct pmd_core_cfg *lcore_conf;
>> +     const bool empty = nb_rx == 0;
>>
>> -     q_conf = &port_cfg[port_id][qidx];
>> +     lcore_conf = &lcore_cfgs[lcore];
>>
>> -     if (unlikely(nb_rx == 0)) {
>> -             q_conf->empty_poll_stats++;
>> -             /* sleep for 1 microsecond */
>> -             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> -                     /* use tpause if we have it */
>> -                     if (global_data.intrinsics_support.power_pause) {
>> -                             const uint64_t cur = rte_rdtsc();
>> -                             const uint64_t wait_tsc =
>> -                                             cur + global_data.tsc_per_us;
>> -                             rte_power_pause(wait_tsc);
>> -                     } else {
>> -                             uint64_t i;
>> -                             for (i = 0; i < global_data.pause_per_us; i++)
>> -                                     rte_pause();
>> -                     }
>> +     if (likely(!empty))
>> +             /* early exit */
>> +             queue_reset(lcore_conf, queue_conf);
>> +     else {
>> +             /* can this queue sleep? */
>> +             if (!queue_can_sleep(lcore_conf, queue_conf))
>> +                     return nb_rx;
>> +
>> +             /* can this lcore sleep? */
>> +             if (!lcore_can_sleep(lcore_conf))
>> +                     return nb_rx;
>> +
>> +             /* sleep for 1 microsecond, use tpause if we have it */
>> +             if (global_data.intrinsics_support.power_pause) {
>> +                     const uint64_t cur = rte_rdtsc();
>> +                     const uint64_t wait_tsc =
>> +                                     cur + global_data.tsc_per_us;
>> +                     rte_power_pause(wait_tsc);
>> +             } else {
>> +                     uint64_t i;
>> +                     for (i = 0; i < global_data.pause_per_us; i++)
>> +                             rte_pause();
>>                }
>> -     } else
>> -             q_conf->empty_poll_stats = 0;
>> +     }
>>
>>        return nb_rx;
>>   }
>>
>>   static uint16_t
>> -clb_scale_freq(uint16_t port_id, uint16_t qidx,
>> +clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>>                struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> -             uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +             uint16_t max_pkts __rte_unused, void *arg)
>>   {
>> -     struct pmd_queue_cfg *q_conf;
>> +     const unsigned int lcore = rte_lcore_id();
>> +     const bool empty = nb_rx == 0;
>> +     struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
>> +     struct queue_list_entry *queue_conf = arg;
>>
>> -     q_conf = &port_cfg[port_id][qidx];
>> +     if (likely(!empty)) {
>> +             /* early exit */
>> +             queue_reset(lcore_conf, queue_conf);
>>
>> -     if (unlikely(nb_rx == 0)) {
>> -             q_conf->empty_poll_stats++;
>> -             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
>> -                     /* scale down freq */
>> -                     rte_power_freq_min(rte_lcore_id());
>> -     } else {
>> -             q_conf->empty_poll_stats = 0;
>> -             /* scale up freq */
>> +             /* scale up freq immediately */
>>                rte_power_freq_max(rte_lcore_id());
>> +     } else {
>> +             /* can this queue sleep? */
>> +             if (!queue_can_sleep(lcore_conf, queue_conf))
>> +                     return nb_rx;
>> +
>> +             /* can this lcore sleep? */
>> +             if (!lcore_can_sleep(lcore_conf))
>> +                     return nb_rx;
>> +
>> +             rte_power_freq_min(rte_lcore_id());
>>        }
>>
>>        return nb_rx;
>> @@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
>>        return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
>>   }
>>
>> +static int
>> +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
>> +{
>> +     const struct queue_list_entry *entry;
>> +
>> +     TAILQ_FOREACH(entry, &queue_cfg->head, next) {
>> +             const union queue *q = &entry->queue;
>> +             int ret = queue_stopped(q->portid, q->qid);
>> +             if (ret != 1)
>> +                     return ret;
>> +     }
>> +     return 1;
>> +}
>> +
>> +static int
>> +check_scale(unsigned int lcore)
>> +{
>> +     enum power_management_env env;
>> +
>> +     /* only PSTATE and ACPI modes are supported */
>> +     if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> +         !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
>> +             RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
>> +             return -ENOTSUP;
>> +     }
>> +     /* ensure we could initialize the power library */
>> +     if (rte_power_init(lcore))
>> +             return -EINVAL;
>> +
>> +     /* ensure we initialized the correct env */
>> +     env = rte_power_get_env();
>> +     if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
>> +             RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     /* we're done */
>> +     return 0;
>> +}
>> +
>> +static int
>> +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>> +{
>> +     struct rte_power_monitor_cond dummy;
>> +
>> +     /* check if rte_power_monitor is supported */
>> +     if (!global_data.intrinsics_support.power_monitor) {
>> +             RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     if (cfg->n_queues > 0) {
>> +             RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     /* check if the device supports the necessary PMD API */
>> +     if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
>> +                     &dummy) == -ENOTSUP) {
>> +             RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     /* we're done */
>> +     return 0;
>> +}
>> +
>>   int
>>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>>   {
>> -     struct pmd_queue_cfg *queue_cfg;
>> +     const union queue qdata = {.portid = port_id, .qid = queue_id};
>> +     struct pmd_core_cfg *lcore_cfg;
>> +     struct queue_list_entry *queue_cfg;
>>        struct rte_eth_dev_info info;
>>        rte_rx_callback_fn clb;
>>        int ret;
>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                goto end;
>>        }
>>
>> -     queue_cfg = &port_cfg[port_id][queue_id];
>> +     lcore_cfg = &lcore_cfgs[lcore_id];
>>
>> -     if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>> +     /* check if other queues are stopped as well */
>> +     ret = cfg_queues_stopped(lcore_cfg);
>> +     if (ret != 1) {
>> +             /* error means invalid queue, 0 means queue wasn't stopped */
>> +             ret = ret < 0 ? -EINVAL : -EBUSY;
>> +             goto end;
>> +     }
>> +
>> +     /* if callback was already enabled, check current callback type */
>> +     if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>> +                     lcore_cfg->cb_mode != mode) {
>>                ret = -EINVAL;
>>                goto end;
>>        }
>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>
>>        switch (mode) {
>>        case RTE_POWER_MGMT_TYPE_MONITOR:
>> -     {
>> -             struct rte_power_monitor_cond dummy;
>> -
>> -             /* check if rte_power_monitor is supported */
>> -             if (!global_data.intrinsics_support.power_monitor) {
>> -                     RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> -                     ret = -ENOTSUP;
>> +             /* check if we can add a new queue */
>> +             ret = check_monitor(lcore_cfg, &qdata);
>> +             if (ret < 0)
>>                        goto end;
>> -             }
>>
>> -             /* check if the device supports the necessary PMD API */
>> -             if (rte_eth_get_monitor_addr(port_id, queue_id,
>> -                             &dummy) == -ENOTSUP) {
>> -                     RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
>> -                     ret = -ENOTSUP;
>> -                     goto end;
>> -             }
>>                clb = clb_umwait;
>>                break;
>> -     }
>>        case RTE_POWER_MGMT_TYPE_SCALE:
>> -     {
>> -             enum power_management_env env;
>> -             /* only PSTATE and ACPI modes are supported */
>> -             if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> -                             !rte_power_check_env_supported(
>> -                                     PM_ENV_PSTATE_CPUFREQ)) {
>> -                     RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
>> -                     ret = -ENOTSUP;
>> +             /* check if we can add a new queue */
>> +             ret = check_scale(lcore_id);
>> +             if (ret < 0)
>>                        goto end;
>> -             }
>> -             /* ensure we could initialize the power library */
>> -             if (rte_power_init(lcore_id)) {
>> -                     ret = -EINVAL;
>> -                     goto end;
>> -             }
>> -             /* ensure we initialized the correct env */
>> -             env = rte_power_get_env();
>> -             if (env != PM_ENV_ACPI_CPUFREQ &&
>> -                             env != PM_ENV_PSTATE_CPUFREQ) {
>> -                     RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
>> -                     ret = -ENOTSUP;
>> -                     goto end;
>> -             }
>>                clb = clb_scale_freq;
>>                break;
>> -     }
>>        case RTE_POWER_MGMT_TYPE_PAUSE:
>>                /* figure out various time-to-tsc conversions */
>>                if (global_data.tsc_per_us == 0)
>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                ret = -EINVAL;
>>                goto end;
>>        }
>> +     /* add this queue to the list */
>> +     ret = queue_list_add(lcore_cfg, &qdata);
>> +     if (ret < 0) {
>> +             RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>> +                             strerror(-ret));
>> +             goto end;
>> +     }
>> +     /* new queue is always added last */
>> +     queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>>
>>        /* initialize data before enabling the callback */
>> -     queue_cfg->empty_poll_stats = 0;
>> -     queue_cfg->cb_mode = mode;
>> -     queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> -     queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> -                     clb, NULL);
>> +     if (lcore_cfg->n_queues == 1) {
>> +             lcore_cfg->cb_mode = mode;
>> +             lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +     }
>> +     queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +                     clb, queue_cfg);
>>
>>        ret = 0;
>>   end:
>> @@ -290,7 +476,9 @@ int
>>   rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>                uint16_t port_id, uint16_t queue_id)
>>   {
>> -     struct pmd_queue_cfg *queue_cfg;
>> +     const union queue qdata = {.portid = port_id, .qid = queue_id};
>> +     struct pmd_core_cfg *lcore_cfg;
>> +     struct queue_list_entry *queue_cfg;
>>        int ret;
>>
>>        RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>> @@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>        }
>>
>>        /* no need to check queue id as wrong queue id would not be enabled */
>> -     queue_cfg = &port_cfg[port_id][queue_id];
>> +     lcore_cfg = &lcore_cfgs[lcore_id];
>>
>> -     if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>> +     /* check if other queues are stopped as well */
>> +     ret = cfg_queues_stopped(lcore_cfg);
>> +     if (ret != 1) {
>> +             /* error means invalid queue, 0 means queue wasn't stopped */
>> +             return ret < 0 ? -EINVAL : -EBUSY;
>> +     }
>> +
>> +     if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>>                return -EINVAL;
>>
>> -     /* stop any callbacks from progressing */
>> -     queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> +     /*
>> +      * There is no good/easy way to do this without race conditions, so we
>> +      * are just going to throw our hands in the air and hope that the user
>> +      * has read the documentation and has ensured that ports are stopped at
>> +      * the time we enter the API functions.
>> +      */
>> +     queue_cfg = queue_list_take(lcore_cfg, &qdata);
>> +     if (queue_cfg == NULL)
>> +             return -ENOENT;
>>
>> -     switch (queue_cfg->cb_mode) {
>> +     /* if we've removed all queues from the lists, set state to disabled */
>> +     if (lcore_cfg->n_queues == 0)
>> +             lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> +
>> +     switch (lcore_cfg->cb_mode) {
>>        case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
>>        case RTE_POWER_MGMT_TYPE_PAUSE:
>> -             rte_eth_remove_rx_callback(port_id, queue_id,
>> -                             queue_cfg->cur_cb);
>> +             rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>>                break;
>>        case RTE_POWER_MGMT_TYPE_SCALE:
>>                rte_power_freq_max(lcore_id);
>> -             rte_eth_remove_rx_callback(port_id, queue_id,
>> -                             queue_cfg->cur_cb);
>> +             rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>>                rte_power_exit(lcore_id);
>>                break;
>>        }
>> @@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>         * ports before calling any of these API's, so we can assume that the
>>         * callbacks can be freed. we're intentionally casting away const-ness.
>>         */
>> -     rte_free((void *)queue_cfg->cur_cb);
>> +     rte_free((void *)queue_cfg->cb);
>> +     free(queue_cfg);
>>
>>        return 0;
>>   }
>> +
>> +RTE_INIT(rte_power_ethdev_pmgmt_init) {
>> +     size_t i;
>> +
>> +     /* initialize all tailqs */
>> +     for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
>> +             struct pmd_core_cfg *cfg = &lcore_cfgs[i];
>> +             TAILQ_INIT(&cfg->head);
>> +     }
>> +}
>> --
>> 2.25.1
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues
  2021-07-01  9:01             ` David Hunt
@ 2021-07-05 10:24               ` Burakov, Anatoly
  0 siblings, 0 replies; 165+ messages in thread
From: Burakov, Anatoly @ 2021-07-05 10:24 UTC (permalink / raw)
  To: David Hunt, dev; +Cc: konstantin.ananyev, ciara.loftus

On 01-Jul-21 10:01 AM, David Hunt wrote:
> 
> On 30/6/2021 10:52 AM, David Hunt wrote:
>> Hi Anatoly,
>>
>> On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
>>> Currently, there is a hard limitation on the PMD power management
>>> support that only allows it to support a single queue per lcore. This is
>>> not ideal as most DPDK use cases will poll multiple queues per core.
>>>
>>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>>> is very difficult to implement such support because callbacks are
>>> effectively stateless and have no visibility into what the other ethdev
>>> devices are doing. This places limitations on what we can do within the
>>> framework of Rx callbacks, but the basics of this implementation are as
>>> follows:
>>>
>>> - Replace per-queue structures with per-lcore ones, so that any device
>>>    polled from the same lcore can share data
>>> - Any queue that is going to be polled from a specific lcore has to be
>>>    added to the list of queues to poll, so that the callback is aware of
>>>    other queues being polled by the same lcore
>>> - Both the empty poll counter and the actual power saving mechanism are
>>>    shared between all queues polled on a particular lcore, and the power
>>>    saving mechanism is only activated when all queues in the list were
>>>    polled and were determined to have no traffic.
>>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>>    is incapable of monitoring more than one address.
>>>
>>> Also, while we're at it, update and improve the docs.
>>>
>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>> ---
>>>
>>> Notes:
>>>      v5:
>>>      - Remove the "power save queue" API and replace it with
>>>        mechanism suggested by Konstantin
>>>
>>>      v3:
>>>      - Move the list of supported NICs to NIC feature table
>>>
>>>      v2:
>>>      - Use a TAILQ for queues instead of a static array
>>>      - Address feedback from Konstantin
>>>      - Add additional checks for stopped queues
>>>
>>>   doc/guides/nics/features.rst           |  10 +
>>>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>>>   doc/guides/rel_notes/release_21_08.rst |   3 +
>>>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>>>   4 files changed, 373 insertions(+), 136 deletions(-)
>>>
>>
>> --snip--
>>
>>>   int
>>>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t 
>>> port_id,
>>>           uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>>>   {
>>> -    struct pmd_queue_cfg *queue_cfg;
>>> +    const union queue qdata = {.portid = port_id, .qid = queue_id};
>>> +    struct pmd_core_cfg *lcore_cfg;
>>> +    struct queue_list_entry *queue_cfg;
>>>       struct rte_eth_dev_info info;
>>>       rte_rx_callback_fn clb;
>>>       int ret;
>>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int 
>>> lcore_id, uint16_t port_id,
>>>           goto end;
>>>       }
>>>   -    queue_cfg = &port_cfg[port_id][queue_id];
>>> +    lcore_cfg = &lcore_cfgs[lcore_id];
>>>   -    if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>>> +    /* check if other queues are stopped as well */
>>> +    ret = cfg_queues_stopped(lcore_cfg);
>>> +    if (ret != 1) {
>>> +        /* error means invalid queue, 0 means queue wasn't stopped */
>>> +        ret = ret < 0 ? -EINVAL : -EBUSY;
>>> +        goto end;
>>> +    }
>>> +
>>> +    /* if callback was already enabled, check current callback type */
>>> +    if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>>> +            lcore_cfg->cb_mode != mode) {
>>>           ret = -EINVAL;
>>>           goto end;
>>>       }
>>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned 
>>> int lcore_id, uint16_t port_id,
>>>         switch (mode) {
>>>       case RTE_POWER_MGMT_TYPE_MONITOR:
>>> -    {
>>> -        struct rte_power_monitor_cond dummy;
>>> -
>>> -        /* check if rte_power_monitor is supported */
>>> -        if (!global_data.intrinsics_support.power_monitor) {
>>> -            RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not 
>>> supported\n");
>>> -            ret = -ENOTSUP;
>>> +        /* check if we can add a new queue */
>>> +        ret = check_monitor(lcore_cfg, &qdata);
>>> +        if (ret < 0)
>>>               goto end;
>>> -        }
>>>   -        /* check if the device supports the necessary PMD API */
>>> -        if (rte_eth_get_monitor_addr(port_id, queue_id,
>>> -                &dummy) == -ENOTSUP) {
>>> -            RTE_LOG(DEBUG, POWER, "The device does not support 
>>> rte_eth_get_monitor_addr\n");
>>> -            ret = -ENOTSUP;
>>> -            goto end;
>>> -        }
>>>           clb = clb_umwait;
>>>           break;
>>> -    }
>>>       case RTE_POWER_MGMT_TYPE_SCALE:
>>> -    {
>>> -        enum power_management_env env;
>>> -        /* only PSTATE and ACPI modes are supported */
>>> -        if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>>> -                !rte_power_check_env_supported(
>>> -                    PM_ENV_PSTATE_CPUFREQ)) {
>>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are 
>>> supported\n");
>>> -            ret = -ENOTSUP;
>>> +        /* check if we can add a new queue */
>>> +        ret = check_scale(lcore_id);
>>> +        if (ret < 0)
>>>               goto end;
>>> -        }
>>> -        /* ensure we could initialize the power library */
>>> -        if (rte_power_init(lcore_id)) {
>>> -            ret = -EINVAL;
>>> -            goto end;
>>> -        }
>>> -        /* ensure we initialized the correct env */
>>> -        env = rte_power_get_env();
>>> -        if (env != PM_ENV_ACPI_CPUFREQ &&
>>> -                env != PM_ENV_PSTATE_CPUFREQ) {
>>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes 
>>> were initialized\n");
>>> -            ret = -ENOTSUP;
>>> -            goto end;
>>> -        }
>>>           clb = clb_scale_freq;
>>>           break;
>>> -    }
>>>       case RTE_POWER_MGMT_TYPE_PAUSE:
>>>           /* figure out various time-to-tsc conversions */
>>>           if (global_data.tsc_per_us == 0)
>>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned 
>>> int lcore_id, uint16_t port_id,
>>>           ret = -EINVAL;
>>>           goto end;
>>>       }
>>> +    /* add this queue to the list */
>>> +    ret = queue_list_add(lcore_cfg, &qdata);
>>> +    if (ret < 0) {
>>> +        RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>>> +                strerror(-ret));
>>> +        goto end;
>>> +    }
>>> +    /* new queue is always added last */
>>> +    queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>>
>>
>> Need to ensure that queue_cfg gets set here, otherwise we'll get a segfault below.
>>
> 
> Or, looking at this again, shouldn't "lcore_cfgs" be "lcore_cfg"?

Good catch, will fix!

> 
> 
>>
>>
>>>         /* initialize data before enabling the callback */
>>> -    queue_cfg->empty_poll_stats = 0;
>>> -    queue_cfg->cb_mode = mode;
>>> -    queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>>> -    queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>>> -            clb, NULL);
>>> +    if (lcore_cfg->n_queues == 1) {
>>> +        lcore_cfg->cb_mode = mode;
>>> +        lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>>> +    }
>>> +    queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>>> +            clb, queue_cfg);
>> --snip--
>>


-- 
Thanks,
Anatoly


* [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management
  2021-06-29 15:48       ` [dpdk-dev] [PATCH v5 0/7] Enhancements for PMD power management Anatoly Burakov
                           ` (6 preceding siblings ...)
  2021-06-29 15:48         ` [dpdk-dev] [PATCH v5 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes Anatoly Burakov
@ 2021-07-05 15:21         ` Anatoly Burakov
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
                             ` (7 more replies)
  7 siblings, 8 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
  To: dev; +Cc: david.hunt, ciara.loftus, konstantin.ananyev

This patchset introduces several changes related to PMD power management:

- Changed monitoring intrinsics to use callbacks as a comparison function, based
  on previous patchset [1] but incorporating feedback [2] - this hopefully will
  make it possible to add support for .get_monitor_addr in virtio
- Add a new intrinsic to monitor multiple addresses, based on RTM instruction
  set and the TPAUSE instruction
- Add support for PMD power management on multiple queues, as well as all
  accompanying infrastructure and example apps changes
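To illustrate the multi-queue change, here is a hypothetical, simplified model of the per-lcore sleep decision: a queue becomes sleep-eligible only after enough consecutive empty polls, and the lcore may sleep only once every queue it polls is sleep-eligible. The struct layout and the threshold value are illustrative stand-ins, not the exact internals of rte_power_pmd_mgmt.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative threshold: consecutive empty polls before a queue may sleep */
#define EMPTYPOLL_MAX 512

struct queue_entry {
	uint64_t empty_polls; /* consecutive empty polls on this queue */
	bool ready_to_sleep;  /* queue has crossed the threshold */
};

struct lcore_cfg {
	unsigned int n_queues; /* queues polled by this lcore */
	unsigned int n_sleepy; /* queues currently flagged ready_to_sleep */
};

/* Called on an empty poll: returns true once this queue may sleep. */
static bool
queue_can_sleep(struct lcore_cfg *lc, struct queue_entry *q)
{
	if (++q->empty_polls <= EMPTYPOLL_MAX)
		return false;
	if (!q->ready_to_sleep) {
		q->ready_to_sleep = true;
		lc->n_sleepy++;
	}
	return true;
}

/* Called on a non-empty poll: the queue (and thus the lcore) is busy again. */
static void
queue_reset(struct lcore_cfg *lc, struct queue_entry *q)
{
	if (q->ready_to_sleep) {
		q->ready_to_sleep = false;
		lc->n_sleepy--;
	}
	q->empty_polls = 0;
}

/* The lcore may only sleep when every queue it polls is ready to sleep. */
static bool
lcore_can_sleep(const struct lcore_cfg *lc)
{
	return lc->n_sleepy == lc->n_queues;
}
```

The point of the shared per-lcore state is visible here: one busy queue keeps `n_sleepy < n_queues` and therefore keeps the whole lcore awake.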

v6:
- Improved the algorithm for multi-queue sleep
- Fixed segfault and addressed other feedback

v5:
- Removed "power save queue" API and replaced with mechanism suggested by
  Konstantin
- Addressed other feedback

v4:
- Replaced raw number with a macro
- Fixed all the bugs found by Konstantin
- Some other minor corrections

v3:
- Moved some doc updates to NIC features list

v2:
- Changed check inversion to callbacks
- Addressed feedback from Konstantin
- Added doc updates where necessary

[1] http://patches.dpdk.org/project/dpdk/list/?series=16930&state=*
[2] http://patches.dpdk.org/project/dpdk/patch/819ef1ace187365a615d3383e54579e3d9fb216e.1620747068.git.anatoly.burakov@intel.com/#133274

Anatoly Burakov (7):
  power_intrinsics: use callbacks for comparison
  net/af_xdp: add power monitor support
  eal: add power monitor for multiple events
  power: remove thread safety from PMD power API's
  power: support callbacks for multiple Rx queues
  power: support monitoring multiple Rx queues
  l3fwd-power: support multiqueue in PMD pmgmt modes

 doc/guides/nics/features.rst                  |  10 +
 doc/guides/prog_guide/power_man.rst           |  68 +-
 doc/guides/rel_notes/release_21_08.rst        |  11 +
 drivers/event/dlb2/dlb2.c                     |  17 +-
 drivers/net/af_xdp/rte_eth_af_xdp.c           |  34 +
 drivers/net/i40e/i40e_rxtx.c                  |  20 +-
 drivers/net/iavf/iavf_rxtx.c                  |  20 +-
 drivers/net/ice/ice_rxtx.c                    |  20 +-
 drivers/net/ixgbe/ixgbe_rxtx.c                |  20 +-
 drivers/net/mlx5/mlx5_rx.c                    |  17 +-
 examples/l3fwd-power/main.c                   |   6 -
 lib/eal/arm/rte_power_intrinsics.c            |  11 +
 lib/eal/include/generic/rte_cpuflags.h        |   2 +
 .../include/generic/rte_power_intrinsics.h    |  68 +-
 lib/eal/ppc/rte_power_intrinsics.c            |  11 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_cpuflags.c                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  90 ++-
 lib/power/meson.build                         |   3 +
 lib/power/rte_power_pmd_mgmt.c                | 655 +++++++++++++-----
 lib/power/rte_power_pmd_mgmt.h                |   6 +
 21 files changed, 832 insertions(+), 262 deletions(-)

-- 
2.25.1



* [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison
  2021-07-05 15:21         ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
@ 2021-07-05 15:21           ` Anatoly Burakov
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 2/7] net/af_xdp: add power monitor support Anatoly Burakov
                             ` (6 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
  To: dev, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang,
	Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler,
	Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Previously, the semantics of power monitor were such that the current value
was checked against an expected value and, if they matched, the sleep was
aborted. This was somewhat inflexible, because it only allowed checking for
a specific value in a specific way.

This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.

Existing implementations are adjusted to follow the new semantics.
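As a minimal, self-contained sketch of the new contract (the `RTE_POWER_MONITOR_OPAQUE_SZ` constant is mocked here so the example compiles without DPDK headers; the callback shape follows the drivers adjusted in this patch):

```c
#include <stdint.h>

/* Stand-in for the DPDK definition in rte_power_intrinsics.h, mocked so
 * this sketch is self-contained. */
#define RTE_POWER_MONITOR_OPAQUE_SZ 4

#define CLB_MASK_IDX 0
#define CLB_VAL_IDX  1

/* A comparison callback in the new style: it receives the value read from
 * the monitored address plus user-supplied opaque data, and returns -1 to
 * abort the sleep or 0 to proceed into the power-optimized state. */
static int
example_monitor_callback(const uint64_t val,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
{
	/* abort if the masked value matches the expected value */
	return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
}
```

A driver would store its expected value and mask in `pmc->opaque[]` and point `pmc->fn` at such a callback, as the dlb2 and mlx5 hunks in this patch do, instead of filling in the removed `pmc->val`/`pmc->mask` fields.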

Suggested-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v4:
    - Return error if callback is set to NULL
    - Replace raw number with a macro in monitor condition opaque data
    
    v2:
    - Use callback mechanism for more flexibility
    - Address feedback from Konstantin

 doc/guides/rel_notes/release_21_08.rst        |  1 +
 drivers/event/dlb2/dlb2.c                     | 17 ++++++++--
 drivers/net/i40e/i40e_rxtx.c                  | 20 +++++++----
 drivers/net/iavf/iavf_rxtx.c                  | 20 +++++++----
 drivers/net/ice/ice_rxtx.c                    | 20 +++++++----
 drivers/net/ixgbe/ixgbe_rxtx.c                | 20 +++++++----
 drivers/net/mlx5/mlx5_rx.c                    | 17 ++++++++--
 .../include/generic/rte_power_intrinsics.h    | 33 +++++++++++++++----
 lib/eal/x86/rte_power_intrinsics.c            | 17 +++++-----
 9 files changed, 121 insertions(+), 44 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
 ABI Changes
 -----------
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index eca183753f..252bbd8d5e 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num)
 	}
 }
 
+#define CLB_MASK_IDX 0
+#define CLB_VAL_IDX 1
+static int
+dlb2_monitor_callback(const uint64_t val,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	/* abort if the value matches */
+	return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0;
+}
+
 static inline int
 dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 		  struct dlb2_eventdev_port *ev_port,
@@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2,
 			expected_value = 0;
 
 		pmc.addr = monitor_addr;
-		pmc.val = expected_value;
-		pmc.mask = qe_mask.raw_qe[1];
+		/* store expected value and comparison mask in opaque data */
+		pmc.opaque[CLB_VAL_IDX] = expected_value;
+		pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1];
+		/* set up callback */
+		pmc.fn = dlb2_monitor_callback;
 		pmc.size = sizeof(uint64_t);
 
 		rte_power_monitor(&pmc, timeout + start_ticks);
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..081682f88b 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -81,6 +81,18 @@
 #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK)
 
+static int
+i40e_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = i40e_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c
index 0361af0d85..7ed196ec22 100644
--- a/drivers/net/iavf/iavf_rxtx.c
+++ b/drivers/net/iavf/iavf_rxtx.c
@@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type)
 				rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1;
 }
 
+static int
+iavf_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.qword1.status_error_len;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
-	pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT);
+	/* comparison callback */
+	pmc->fn = iavf_monitor_callback;
 
 	/* registers are 64-bit */
 	pmc->size = sizeof(uint64_t);
diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c
index fc9bb5a3e7..d12437d19d 100644
--- a/drivers/net/ice/ice_rxtx.c
+++ b/drivers/net/ice/ice_rxtx.c
@@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask;
 uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask;
 
+static int
+ice_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.status_error0;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
-	pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S);
+	/* comparison callback */
+	pmc->fn = ice_monitor_callback;
 
 	/* register is 16-bit */
 	pmc->size = sizeof(uint16_t);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index d69f36e977..c814a28cb4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1369,6 +1369,18 @@ const uint32_t
 		RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP,
 };
 
+static int
+ixgbe_monitor_callback(const uint64_t value,
+		const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused)
+{
+	const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/*
+	 * we expect the DD bit to be set to 1 if this descriptor was already
+	 * written to.
+	 */
+	return (value & m) == m ? -1 : 0;
+}
+
 int
 ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
@@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	/* watch for changes in status bit */
 	pmc->addr = &rxdp->wb.upper.status_error;
 
-	/*
-	 * we expect the DD bit to be set to 1 if this descriptor was already
-	 * written to.
-	 */
-	pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
-	pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD);
+	/* comparison callback */
+	pmc->fn = ixgbe_monitor_callback;
 
 	/* the registers are 32-bit */
 	pmc->size = sizeof(uint32_t);
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 777a1d6e45..17370b77dc 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 	return rx_queue_count(rxq);
 }
 
+#define CLB_VAL_IDX 0
+#define CLB_MSK_IDX 1
+static int
+mlx_monitor_callback(const uint64_t value,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	const uint64_t m = opaque[CLB_MSK_IDX];
+	const uint64_t v = opaque[CLB_VAL_IDX];
+
+	return (value & m) == v ? -1 : 0;
+}
+
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	struct mlx5_rxq_data *rxq = rx_queue;
@@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 		return -rte_errno;
 	}
 	pmc->addr = &cqe->op_own;
-	pmc->val =  !!idx;
-	pmc->mask = MLX5_CQE_OWNER_MASK;
+	pmc->opaque[CLB_VAL_IDX] = !!idx;
+	pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
+	pmc->fn = mlx_monitor_callback;
 	pmc->size = sizeof(uint8_t);
 	return 0;
 }
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..c9aa52a86d 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -18,19 +18,38 @@
  * which are architecture-dependent.
  */
 
+/** Size of the opaque data in monitor condition */
+#define RTE_POWER_MONITOR_OPAQUE_SZ 4
+
+/**
+ * Callback definition for monitoring conditions. Callbacks with this signature
+ * will be used by `rte_power_monitor()` to check if the entering of power
+ * optimized state should be aborted.
+ *
+ * @param val
+ *   The value read from memory.
+ * @param opaque
+ *   Callback-specific data.
+ *
+ * @return
+ *   0 if entering of power optimized state should proceed
+ *   -1 if entering of power optimized state should be aborted
+ */
+typedef int (*rte_power_monitor_clb_t)(const uint64_t val,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]);
 struct rte_power_monitor_cond {
 	volatile void *addr;  /**< Address to monitor for changes */
-	uint64_t val;         /**< If the `mask` is non-zero, location pointed
-	                       *   to by `addr` will be read and compared
-	                       *   against this value.
-	                       */
-	uint64_t mask;   /**< 64-bit mask to extract value read from `addr` */
-	uint8_t size;    /**< Data size (in bytes) that will be used to compare
-	                  *   expected value (`val`) with data read from the
+	uint8_t size;    /**< Data size (in bytes) that will be read from the
 	                  *   monitored memory location (`addr`). Can be 1, 2,
 	                  *   4, or 8. Supplying any other value will result in
 	                  *   an error.
 	                  */
+	rte_power_monitor_clb_t fn; /**< Callback to be used to check if
+	                             *   entering power optimized state should
+	                             *   be aborted.
+	                             */
+	uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ];
+	/**< Callback-specific data */
 };
 
 /**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..66fea28897 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
 	const unsigned int lcore_id = rte_lcore_id();
 	struct power_wait_status *s;
+	uint64_t cur_value;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
@@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (__check_val_size(pmc->size) < 0)
 		return -EINVAL;
 
+	if (pmc->fn == NULL)
+		return -EINVAL;
+
 	s = &wait_status[lcore_id];
 
 	/* update sleep address */
@@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	/* now that we've put this address into monitor, we can unlock */
 	rte_spinlock_unlock(&s->lock);
 
-	/* if we have a comparison mask, we might not need to sleep at all */
-	if (pmc->mask) {
-		const uint64_t cur_value = __get_umwait_val(
-				pmc->addr, pmc->size);
-		const uint64_t masked = cur_value & pmc->mask;
+	cur_value = __get_umwait_val(pmc->addr, pmc->size);
 
-		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
-			goto end;
-	}
+	/* check if callback indicates we should abort */
+	if (pmc->fn(cur_value, pmc->opaque) != 0)
+		goto end;
 
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
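The driver changes in the patch above all follow the same pattern: instead of handing a `val`/`mask` pair to the EAL, each driver supplies a callback that tests the descriptor-done bit itself. A minimal standalone sketch of that pattern (the bit position and names here are illustrative, not the real ice/ixgbe layout):

```c
#include <assert.h>
#include <stdint.h>

/* illustrative "descriptor done" bit, not an actual ice/ixgbe define */
#define DEMO_DD_BIT (UINT64_C(1) << 0)

/* return -1 to abort entering the power-optimized state, 0 to proceed */
static int demo_monitor_callback(uint64_t value)
{
	const uint64_t m = DEMO_DD_BIT;

	/* DD bit set means this descriptor was already written back */
	return (value & m) == m ? -1 : 0;
}
```

The EAL reads the monitored address, passes the raw value to the callback, and only enters the optimized power state if the callback returns 0.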

* [dpdk-dev] [PATCH v6 2/7] net/af_xdp: add power monitor support
  2021-07-05 15:21         ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
@ 2021-07-05 15:21           ` Anatoly Burakov
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events Anatoly Burakov
                             ` (5 subsequent siblings)
  7 siblings, 0 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
  To: dev, Ciara Loftus, Qi Zhang; +Cc: david.hunt, konstantin.ananyev

Implement support for .get_monitor_addr in AF_XDP driver.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Rewrite using the callback mechanism

 drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index eb5660a3dc..7830d0c23a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -37,6 +37,7 @@
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include <rte_spinlock.h>
+#include <rte_power_intrinsics.h>
 
 #include "compat.h"
 
@@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev)
 	return 0;
 }
 
+#define CLB_VAL_IDX 0
+static int
+eth_monitor_callback(const uint64_t value,
+		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
+{
+	const uint64_t v = opaque[CLB_VAL_IDX];
+	const uint64_t m = (uint32_t)~0;
+
+	/* if the value has changed, abort entering power optimized state */
+	return (value & m) == v ? 0 : -1;
+}
+
+static int
+eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
+{
+	struct pkt_rx_queue *rxq = rx_queue;
+	unsigned int *prod = rxq->rx.producer;
+	const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */
+
+	/* watch for changes in producer ring */
+	pmc->addr = (void *)prod;
+
+	/* store current value */
+	pmc->opaque[CLB_VAL_IDX] = cur_val;
+	pmc->fn = eth_monitor_callback;
+
+	/* AF_XDP producer ring index is 32-bit */
+	pmc->size = sizeof(uint32_t);
+
+	return 0;
+}
+
 static int
 eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
@@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = {
 	.link_update = eth_link_update,
 	.stats_get = eth_stats_get,
 	.stats_reset = eth_stats_reset,
+	.get_monitor_addr = eth_get_monitor_addr
 };
 
 /** parse busy_budget argument */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
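Unlike the descriptor-bit drivers, AF_XDP watches the producer ring index for movement: the cached index is stashed in the condition's opaque data, and the callback wakes up when the live value differs from it. A standalone model of that comparison (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* model of eth_monitor_callback: the 64-bit value read from the
 * monitored address is masked down to the 32-bit AF_XDP producer index
 * and compared with the cached index captured when the monitor was
 * armed */
static int demo_xdp_callback(uint64_t value, uint64_t cached_prod)
{
	const uint64_t m = (uint32_t)~0;

	/* if the producer index moved, abort entering the power state */
	return (value & m) == cached_prod ? 0 : -1;
}
```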

* [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events
  2021-07-05 15:21         ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 1/7] power_intrinsics: use callbacks for comparison Anatoly Burakov
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 2/7] net/af_xdp: add power monitor support Anatoly Burakov
@ 2021-07-05 15:21           ` Anatoly Burakov
  2021-08-04  9:52             ` Kinsella, Ray
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
                             ` (4 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
  To: dev, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen,
	Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
  Cc: david.hunt, ciara.loftus

Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v4:
    - Fixed bugs in accessing the monitor condition
    - Abort on any monitor condition not having a defined callback
    
    v2:
    - Adapt to callback mechanism

 doc/guides/rel_notes/release_21_08.rst        |  2 +
 lib/eal/arm/rte_power_intrinsics.c            | 11 +++
 lib/eal/include/generic/rte_cpuflags.h        |  2 +
 .../include/generic/rte_power_intrinsics.h    | 35 +++++++++
 lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
 lib/eal/version.map                           |  3 +
 lib/eal/x86/rte_cpuflags.c                    |  2 +
 lib/eal/x86/rte_power_intrinsics.c            | 73 +++++++++++++++++++
 8 files changed, 139 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+
 
 Removed Items
 -------------
diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c
index e83f04072a..78f55b7203 100644
--- a/lib/eal/arm/rte_power_intrinsics.c
+++ b/lib/eal/arm/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h
index 28a5aecde8..d35551e931 100644
--- a/lib/eal/include/generic/rte_cpuflags.h
+++ b/lib/eal/include/generic/rte_cpuflags.h
@@ -24,6 +24,8 @@ struct rte_cpu_intrinsics {
 	/**< indicates support for rte_power_monitor function */
 	uint32_t power_pause : 1;
 	/**< indicates support for rte_power_pause function */
+	uint32_t power_monitor_multi : 1;
+	/**< indicates support for rte_power_monitor_multi function */
 };
 
 /**
diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index c9aa52a86d..04e8c2ab37 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id);
 __rte_experimental
 int rte_power_pause(const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Monitor a set of addresses for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either one of the specified
+ * memory addresses is written to, a certain TSC timestamp is reached, or other
+ * reasons cause the CPU to wake up.
+ *
+ * Additionally, each monitoring condition provides a comparison callback. For
+ * each condition, the current value at the monitored address will be read and
+ * passed to that callback, and if the callback indicates a wakeup condition,
+ * the entering of optimized power state may be aborted.
+ *
+ * @warning It is the responsibility of the user to check if this function is
+ *   supported at runtime using `rte_cpu_get_intrinsics_support()` API call.
+ *   Failing to do so may result in an illegal CPU instruction error.
+ *
+ * @param pmc
+ *   An array of monitoring condition structures.
+ * @param num
+ *   Length of the `pmc` array.
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   0 on success
+ *   -EINVAL on invalid parameters
+ *   -ENOTSUP if unsupported
+ */
+__rte_experimental
+int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp);
+
 #endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c
index 7fc9586da7..f00b58ade5 100644
--- a/lib/eal/ppc/rte_power_intrinsics.c
+++ b/lib/eal/ppc/rte_power_intrinsics.c
@@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return -ENOTSUP;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	RTE_SET_USED(pmc);
+	RTE_SET_USED(num);
+	RTE_SET_USED(tsc_timestamp);
+
+	return -ENOTSUP;
+}
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..4ccd5475d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
 	rte_version_release; # WINDOWS_NO_EXPORT
 	rte_version_suffix; # WINDOWS_NO_EXPORT
 	rte_version_year; # WINDOWS_NO_EXPORT
+
+	# added in 21.08
+	rte_power_monitor_multi; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c
index a96312ff7f..d339734a8c 100644
--- a/lib/eal/x86/rte_cpuflags.c
+++ b/lib/eal/x86/rte_cpuflags.c
@@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
 		intrinsics->power_monitor = 1;
 		intrinsics->power_pause = 1;
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM))
+			intrinsics->power_monitor_multi = 1;
 	}
 }
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 66fea28897..f749da9b85 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_rtm.h>
 #include <rte_spinlock.h>
 
 #include "rte_power_intrinsics.h"
@@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr)
 }
 
 static bool wait_supported;
+static bool wait_multi_supported;
 
 static inline uint64_t
 __get_umwait_val(const volatile void *p, const uint8_t sz)
@@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) {
 
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
+	if (i.power_monitor_multi)
+		wait_multi_supported = 1;
 }
 
 int
@@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	 * In this case, since we've already woken up, the "wakeup" was
 	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
 	 * wakeup address is still valid so it's perfectly safe to write it.
+	 *
+	 * For multi-monitor case, the act of locking will in itself trigger the
+	 * wakeup, so no additional writes are necessary.
 	 */
 	rte_spinlock_lock(&s->lock);
 	if (s->monitor_addr != NULL)
@@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 
 	return 0;
 }
+
+int
+rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
+		const uint32_t num, const uint64_t tsc_timestamp)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s = &wait_status[lcore_id];
+	uint32_t i, rc;
+
+	/* check if supported */
+	if (!wait_multi_supported)
+		return -ENOTSUP;
+
+	if (pmc == NULL || num == 0)
+		return -EINVAL;
+
+	/* we are already inside transaction region, return */
+	if (rte_xtest() != 0)
+		return 0;
+
+	/* start new transaction region */
+	rc = rte_xbegin();
+
+	/* transaction abort, possible write to one of wait addresses */
+	if (rc != RTE_XBEGIN_STARTED)
+		return 0;
+
+	/*
+	 * the mere act of reading the lock status here adds the lock to
+	 * the read set. This means that when we trigger a wakeup from another
+	 * thread, even if we don't have a defined wakeup address and thus don't
+	 * actually cause any writes, the act of locking our lock will itself
+	 * trigger the wakeup and abort the transaction.
+	 */
+	rte_spinlock_is_locked(&s->lock);
+
+	/*
+	 * add all addresses to wait on into transaction read-set and check if
+	 * any of wakeup conditions are already met.
+	 */
+	rc = 0;
+	for (i = 0; i < num; i++) {
+		const struct rte_power_monitor_cond *c = &pmc[i];
+
+		/* cannot be NULL */
+		if (c->fn == NULL) {
+			rc = -EINVAL;
+			break;
+		}
+
+		const uint64_t val = __get_umwait_val(c->addr, c->size);
+
+		/* abort if callback indicates that we need to stop */
+		if (c->fn(val, c->opaque) != 0)
+			break;
+	}
+
+	/* none of the conditions were met, sleep until timeout */
+	if (i == num)
+		rte_power_pause(tsc_timestamp);
+
+	/* end transaction region */
+	rte_xend();
+
+	return rc;
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
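The RTM transaction setup (`rte_xbegin`/`rte_xend`) above is hardware-specific, but the per-condition scan it wraps can be modeled portably. A sketch of that "check every condition, skip the sleep if any callback fires or is missing" logic, with illustrative names rather than the DPDK API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*demo_clb_t)(uint64_t value, uint64_t opaque);

struct demo_cond {
	const uint64_t *addr; /* address whose value is sampled */
	demo_clb_t fn;        /* comparison callback, must be set */
	uint64_t opaque;      /* callback-specific data */
};

/* -22 (-EINVAL) on a missing callback, 1 if a condition already signals
 * wakeup (the sleep would be skipped), 0 if all conditions allow it */
static int demo_scan(const struct demo_cond pmc[], uint32_t num)
{
	uint32_t i;

	for (i = 0; i < num; i++) {
		if (pmc[i].fn == NULL)
			return -22;
		if (pmc[i].fn(*pmc[i].addr, pmc[i].opaque) != 0)
			return 1; /* would skip the TPAUSE */
	}
	return 0; /* all clear: would TPAUSE until timeout */
}

/* example callback: wake up when the monitored value is non-zero */
static int demo_nonzero(uint64_t v, uint64_t opaque)
{
	(void)opaque;
	return v != 0 ? -1 : 0;
}
```

In the real implementation the reads additionally land in the RTM read-set, so any later write to those addresses aborts the transaction and wakes the core.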

* [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's
  2021-07-05 15:21         ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
                             ` (2 preceding siblings ...)
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 3/7] eal: add power monitor for multiple events Anatoly Burakov
@ 2021-07-05 15:21           ` Anatoly Burakov
  2021-07-07 10:14             ` Ananyev, Konstantin
  2021-07-05 15:22           ` [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues Anatoly Burakov
                             ` (3 subsequent siblings)
  7 siblings, 1 reply; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:21 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev

Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.

We could have used something like an RCU for this use case, but absent a
pressing need for thread safety we'll go the easy way and just
mandate that the API's are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v2:
    - Add check for stopped queue
    - Clarified doc message
    - Added release notes

 doc/guides/rel_notes/release_21_08.rst |   5 +
 lib/power/meson.build                  |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 133 ++++++++++---------------
 lib/power/rte_power_pmd_mgmt.h         |   6 ++
 4 files changed, 67 insertions(+), 80 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
 
 * eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
+* rte_power: The experimental PMD power management API is no longer considered
+  to be thread safe; all Rx queues affected by the API will now need to be
+  stopped before making any changes to the power management scheme.
+
+
 ABI Changes
 -----------
 
diff --git a/lib/power/meson.build b/lib/power/meson.build
index c1097d32f1..4f6a242364 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -21,4 +21,7 @@ headers = files(
         'rte_power_pmd_mgmt.h',
         'rte_power_guest_channel.h',
 )
+if cc.has_argument('-Wno-cast-qual')
+    cflags += '-Wno-cast-qual'
+endif
 deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index db03cbf420..9b95cf1794 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -40,8 +40,6 @@ struct pmd_queue_cfg {
 	/**< Callback mode for this queue */
 	const struct rte_eth_rxtx_callback *cur_cb;
 	/**< Callback instance */
-	volatile bool umwait_in_progress;
-	/**< are we currently sleeping? */
 	uint64_t empty_poll_stats;
 	/**< Number of empty polls */
 } __rte_cache_aligned;
@@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 			struct rte_power_monitor_cond pmc;
 			uint16_t ret;
 
-			/*
-			 * we might get a cancellation request while being
-			 * inside the callback, in which case the wakeup
-			 * wouldn't work because it would've arrived too early.
-			 *
-			 * to get around this, we notify the other thread that
-			 * we're sleeping, so that it can spin until we're done.
-			 * unsolicited wakeups are perfectly safe.
-			 */
-			q_conf->umwait_in_progress = true;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-			/* check if we need to cancel sleep */
-			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
-				/* use monitoring condition to sleep */
-				ret = rte_eth_get_monitor_addr(port_id, qidx,
-						&pmc);
-				if (ret == 0)
-					rte_power_monitor(&pmc, UINT64_MAX);
-			}
-			q_conf->umwait_in_progress = false;
-
-			rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
+			/* use monitoring condition to sleep */
+			ret = rte_eth_get_monitor_addr(port_id, qidx,
+					&pmc);
+			if (ret == 0)
+				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
 		q_conf->empty_poll_stats = 0;
@@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx,
 	return nb_rx;
 }
 
+static int
+queue_stopped(const uint16_t port_id, const uint16_t queue_id)
+{
+	struct rte_eth_rxq_info qinfo;
+
+	if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0)
+		return -1;
+
+	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
 	struct pmd_queue_cfg *queue_cfg;
 	struct rte_eth_dev_info info;
+	rte_rx_callback_fn clb;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
 	queue_cfg = &port_cfg[port_id][queue_id];
 
 	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
@@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->umwait_in_progress = false;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* ensure we update our state before callback starts */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_umwait, NULL);
+		clb = clb_umwait;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
@@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 			ret = -ENOTSUP;
 			goto end;
 		}
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
-				queue_id, clb_scale_freq, NULL);
+		clb = clb_scale_freq;
 		break;
 	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
@@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		if (global_data.tsc_per_us == 0)
 			calc_tsc();
 
-		/* initialize data before enabling the callback */
-		queue_cfg->empty_poll_stats = 0;
-		queue_cfg->cb_mode = mode;
-		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-
-		/* this is not necessary here, but do it anyway */
-		rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
-		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-				clb_pause, NULL);
+		clb = clb_pause;
 		break;
+	default:
+		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+		ret = -EINVAL;
+		goto end;
 	}
+
+	/* initialize data before enabling the callback */
+	queue_cfg->empty_poll_stats = 0;
+	queue_cfg->cb_mode = mode;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, NULL);
+
 	ret = 0;
 end:
 	return ret;
@@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
 	struct pmd_queue_cfg *queue_cfg;
+	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
 
 	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
 		return -EINVAL;
 
+	/* check if the queue is stopped */
+	ret = queue_stopped(port_id, queue_id);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
+
 	/* no need to check queue id as wrong queue id would not be enabled */
 	queue_cfg = &port_cfg[port_id][queue_id];
 
@@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	/* stop any callbacks from progressing */
 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
 
-	/* ensure we update our state before continuing */
-	rte_atomic_thread_fence(__ATOMIC_SEQ_CST);
-
 	switch (queue_cfg->cb_mode) {
-	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		bool exit = false;
-		do {
-			/*
-			 * we may request cancellation while the other thread
-			 * has just entered the callback but hasn't started
-			 * sleeping yet, so keep waking it up until we know it's
-			 * done sleeping.
-			 */
-			if (queue_cfg->umwait_in_progress)
-				rte_power_monitor_wakeup(lcore_id);
-			else
-				exit = true;
-		} while (!exit);
-	}
-	/* fall-through */
+	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		rte_eth_remove_rx_callback(port_id, queue_id,
 				queue_cfg->cur_cb);
@@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		break;
 	}
 	/*
-	 * we don't free the RX callback here because it is unsafe to do so
-	 * unless we know for a fact that all data plane threads have stopped.
+	 * the API doc mandates that the user stops all processing on affected
+	 * ports before calling any of these APIs, so we can assume that the
+	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	queue_cfg->cur_cb = NULL;
+	rte_free((void *)queue_cfg->cur_cb);
 
 	return 0;
 }
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 7a0ac24625..444e7b8a66 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type {
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue will be polled from.
  * @param port_id
@@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id,
  *
  * @note This function is not thread-safe.
  *
+ * @warning This function must be called when all affected Ethernet queues are
+ *   stopped and no Rx/Tx is in progress!
+ *
  * @param lcore_id
  *   The lcore the Rx queue is polled from.
  * @param port_id
-- 
2.25.1


^ permalink raw reply	[flat|nested] 165+ messages in thread
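The stopped-queue gate that replaces the old sleep-cancellation dance can be reduced to a small error-mapping step: `queue_stopped()` yields 1 (stopped), 0 (still running) or a negative value (invalid queue), and anything but "stopped" becomes an errno-style failure. A hedged sketch of that mapping, with the errno values spelled out for clarity:

```c
#include <assert.h>

/* model of the gate in pmgmt_queue_enable()/disable(): only a stopped
 * queue may have its callback registered or freed */
static int demo_check_queue(int stopped_state)
{
	if (stopped_state != 1)
		return stopped_state < 0 ? -22 /* -EINVAL */
					 : -16 /* -EBUSY */;
	return 0; /* queue stopped: safe to (un)register the callback */
}
```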

* [dpdk-dev] [PATCH v6 5/7] power: support callbacks for multiple Rx queues
  2021-07-05 15:21         ` [dpdk-dev] [PATCH v6 0/7] Enhancements for PMD power management Anatoly Burakov
                             ` (3 preceding siblings ...)
  2021-07-05 15:21           ` [dpdk-dev] [PATCH v6 4/7] power: remove thread safety from PMD power API's Anatoly Burakov
@ 2021-07-05 15:22           ` Anatoly Burakov
  2021-07-06 18:50             ` Ananyev, Konstantin
  2021-07-07 10:04             ` David Hunt
  2021-07-05 15:22           ` [dpdk-dev] [PATCH v6 6/7] power: support monitoring " Anatoly Burakov
                             ` (2 subsequent siblings)
  7 siblings, 2 replies; 165+ messages in thread
From: Anatoly Burakov @ 2021-07-05 15:22 UTC (permalink / raw)
  To: dev, David Hunt; +Cc: ciara.loftus, konstantin.ananyev

Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.

The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:

- Replace per-queue structures with per-lcore ones, so that any device
  polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
  added to the list of queues to poll, so that the callback is aware of
  other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism are
  shared between all queues polled on a particular lcore, and are only
  activated when all queues in the list were polled and were determined
  to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
  is incapable of monitoring more than one address.

Also, while we're at it, update and improve the docs.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v6:
    - Track each individual queue sleep status (Konstantin)
    - Fix segfault (Dave)
    
    v5:
    - Remove the "power save queue" API and replace it with mechanism suggested by
      Konstantin
    
    v3:
    - Move the list of supported NICs to NIC feature table
    
    v2:
    - Use a TAILQ for queues instead of a static array
    - Address feedback from Konstantin
    - Add additional checks for stopped queues

 doc/guides/nics/features.rst           |  10 +
 doc/guides/prog_guide/power_man.rst    |  65 ++--
 doc/guides/rel_notes/release_21_08.rst |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 452 +++++++++++++++++++------
 4 files changed, 394 insertions(+), 136 deletions(-)
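The per-lcore aggregation described in the commit message can be modeled as a shared sleep decision over the lcore's queue list: power saving engages only once every queue polled by that lcore has reported an empty poll. A toy sketch (fixed-size array and names are illustrative only, not the TAILQ-based DPDK code):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NUM_QUEUES 3

struct demo_lcore_cfg {
	bool queue_empty[DEMO_NUM_QUEUES]; /* was the last poll empty? */
};

/* the sleep decision is shared by all queues the lcore polls */
static bool demo_can_sleep(const struct demo_lcore_cfg *cfg)
{
	int i;

	for (i = 0; i < DEMO_NUM_QUEUES; i++)
		if (!cfg->queue_empty[i])
			return false; /* some queue still sees traffic */
	return true; /* every queue idle: power saving may engage */
}
```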