* [PATCH 0/2] introduce PM QoS interface
@ 2024-03-20 10:55 Huisong Li
  2024-03-20 10:55 ` [PATCH 1/2] power: " Huisong Li
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Huisong Li @ 2024-03-20 10:55 UTC (permalink / raw)
  To: dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong, lihuisong

Subject: [PATCH 0/2] introduce PM QoS interface                           

The system-wide CPU latency QoS limit influences idle state selection in
the cpuidle governor.

Linux creates a cpu_dma_latency device under the '/dev' directory, through
which userspace can read the system-wide CPU latency QoS limit and submit
QoS requests. See the PM QoS framework documentation:
https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
This feature has been supported since kernel v2.6.25.
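
As a rough illustration of the read side of this device (a sketch, not code
from this series): the helper below reads the limit as one binary 32-bit
value in microseconds. The name read_cpu_latency_limit() and the path
parameter are assumptions made for illustration; on a real system the path
would be /dev/cpu_dma_latency.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Read the current CPU latency QoS limit (in microseconds) from a
 * cpu_dma_latency-style device node. Returns 0 on success, -1 on failure.
 * The path is a parameter (an assumption of this sketch) so the helper can
 * be pointed at a regular file when testing.
 */
int
read_cpu_latency_limit(const char *path, int32_t *usec)
{
	ssize_t n;
	int fd;

	assert(path != NULL && usec != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The device returns the limit as one binary 32-bit signed value. */
	n = read(fd, usec, sizeof(*usec));
	close(fd); /* close before checking so the fd is never leaked */

	return n == (ssize_t)sizeof(*usec) ? 0 : -1;
}
```

A short read is treated as a failure here, since a partial value would not
be meaningful.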

The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are latency sensitive and need a short
resume time, for example packet reception in interrupt mode.

So this series introduces a PM QoS interface.

Huisong Li (2):
  power: introduce PM QoS interface
  examples/l3fwd-power: add PM QoS request configuration

 doc/guides/prog_guide/power_man.rst    |  16 ++++
 doc/guides/rel_notes/release_24_03.rst |   4 +
 examples/l3fwd-power/main.c            |  41 +++++++++-
 lib/power/meson.build                  |   2 +
 lib/power/rte_power_qos.c              |  98 ++++++++++++++++++++++++
 lib/power/rte_power_qos.h              | 101 +++++++++++++++++++++++++
 lib/power/version.map                  |   4 +
 7 files changed, 265 insertions(+), 1 deletion(-)
 create mode 100644 lib/power/rte_power_qos.c
 create mode 100644 lib/power/rte_power_qos.h

-- 
2.22.0



* [PATCH 1/2] power: introduce PM QoS interface
  2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
@ 2024-03-20 10:55 ` Huisong Li
  2024-03-20 10:55 ` [PATCH 2/2] examples/l3fwd-power: add PM QoS request configuration Huisong Li
  2024-03-20 14:05 ` [PATCH 0/2] introduce PM QoS interface Morten Brørup
  2 siblings, 0 replies; 16+ messages in thread
From: Huisong Li @ 2024-03-20 10:55 UTC (permalink / raw)
  To: dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong, lihuisong

The system-wide CPU latency QoS limit influences idle state selection in
the cpuidle governor.

Linux creates a cpu_dma_latency device under the '/dev' directory, through
which userspace can read the system-wide CPU latency QoS limit and submit
QoS requests. See the PM QoS framework documentation:
https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
This feature has been supported since kernel v2.6.25.

The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are latency sensitive and need a short
resume time, for example packet reception in interrupt mode.

This PM QoS API makes it easy for an application to read the system-wide
CPU latency limit and to submit its own CPU latency QoS request.

The recommended usage is as follows:
1) The application process first creates a QoS request.
2) Update the CPU latency request to zero when low latency is needed.
3) Restore the default value when it is no longer needed (this step is optional).
4) Release the QoS request when the process exits.
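
The four steps above can be sketched directly against the kernel device, as
a minimal illustration only. The qos_request_*() names and the path
parameter are hypothetical and are not the rte_power_* functions added by
this patch; on a real system the path would be /dev/cpu_dma_latency.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Step 1: open the device; the request lives as long as the fd stays open. */
int
qos_request_create(const char *path)
{
	assert(path != NULL);
	return open(path, O_WRONLY);
}

/* Steps 2 and 3: write the wanted limit in microseconds as a binary s32. */
int
qos_request_update(int fd, int32_t usec)
{
	if (fd < 0 || usec < 0)
		return -1;
	return write(fd, &usec, sizeof(usec)) == (ssize_t)sizeof(usec) ? 0 : -1;
}

/* Step 4: closing the fd makes the kernel drop the request. */
void
qos_request_release(int fd)
{
	if (fd >= 0)
		close(fd);
}
```

Because the request is tied to the open file descriptor, a process that
exits without calling the release helper still has its request removed when
the kernel closes the descriptor.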

Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
 doc/guides/prog_guide/power_man.rst    |  16 ++++
 doc/guides/rel_notes/release_24_03.rst |   4 +
 lib/power/meson.build                  |   2 +
 lib/power/rte_power_qos.c              |  98 ++++++++++++++++++++++++
 lib/power/rte_power_qos.h              | 101 +++++++++++++++++++++++++
 lib/power/version.map                  |   4 +
 6 files changed, 225 insertions(+)
 create mode 100644 lib/power/rte_power_qos.c
 create mode 100644 lib/power/rte_power_qos.h

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..493c75bf9d 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,22 @@ Get Num Pkgs
 Get Num Dies
   Get the number of die's on a given package.
 
+PM QoS API
+----------
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some service threads are latency sensitive and need a short
+resume time, for example packet reception in interrupt mode.
+
+The PM QoS API lets an application read the system-wide CPU latency limit
+and submit its own CPU latency QoS request.
+
+* ``rte_power_qos_get_curr_cpu_latency()`` gets the current CPU latency limit
+  of the system.
+* To send a CPU latency QoS request, first call ``rte_power_create_qos_request()``
+  to create the request, then update the CPU latency value by calling
+  ``rte_power_qos_update_request()``. Call ``rte_power_release_qos_request()``
+  to release the request when the process exits.
+
 References
 ----------
 
diff --git a/doc/guides/rel_notes/release_24_03.rst b/doc/guides/rel_notes/release_24_03.rst
index 14826ea08f..b5be724133 100644
--- a/doc/guides/rel_notes/release_24_03.rst
+++ b/doc/guides/rel_notes/release_24_03.rst
@@ -196,6 +196,10 @@ New Features
   Added DMA producer mode to measure performance of ``OP_FORWARD`` mode
   of event DMA adapter.
 
+* **Added CPU latency PM QoS support.**
+
+  Added an interface to query the system-wide CPU latency PM QoS limit and
+  an interface to send CPU latency QoS requests in the power library.
 
 Removed Items
 -------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
         'rte_power.c',
         'rte_power_uncore.c',
         'rte_power_pmd_mgmt.c',
+        'rte_power_qos.c',
 )
 headers = files(
         'rte_power.h',
         'rte_power_guest_channel.h',
         'rte_power_pmd_mgmt.h',
         'rte_power_uncore.h',
+        'rte_power_qos.h',
 )
 if cc.has_argument('-Wno-cast-qual')
     cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..d2b55923a0
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define QOS_CPU_DMA_LATENCY_DEV "/dev/cpu_dma_latency"
+
+struct rte_power_qos_info {
+	/*
+	 * Keep the file descriptor open so the QoS request can be updated
+	 * until it is no longer needed.
+	 */
+	int fd;
+	int cur_cpu_latency; /* unit microseconds */
+};
+
+struct rte_power_qos_info g_qos = {
+	.fd = -1,
+	.cur_cpu_latency = -1,
+};
+
+int
+rte_power_qos_get_curr_cpu_latency(int *latency)
+{
+	int fd, ret;
+
+	fd = open(QOS_CPU_DMA_LATENCY_DEV, O_RDONLY);
+	if (fd < 0) {
+		POWER_LOG(ERR, "Failed to open %s", QOS_CPU_DMA_LATENCY_DEV);
+		return -1;
+	}
+
+	ret = read(fd, latency, sizeof(*latency));
+	close(fd); /* close first so the descriptor is not leaked on error */
+	if (ret != (int)sizeof(*latency)) {
+		POWER_LOG(ERR, "Failed to read %s", QOS_CPU_DMA_LATENCY_DEV);
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+rte_power_qos_update_request(int latency)
+{
+	int ret;
+
+	if (g_qos.fd == -1) {
+		POWER_LOG(ERR, "please create QoS request first.");
+		return -EINVAL;
+	}
+
+	if (latency < 0) {
+		POWER_LOG(ERR, "latency should be non negative number.");
+		return -EINVAL;
+	}
+
+	if (g_qos.cur_cpu_latency != -1 && latency == g_qos.cur_cpu_latency)
+		return 0;
+
+	ret = write(g_qos.fd, &latency, sizeof(latency));
+	if (ret != (int)sizeof(latency)) {
+		POWER_LOG(ERR, "Failed to write %s", QOS_CPU_DMA_LATENCY_DEV);
+		return -1;
+	}
+	g_qos.cur_cpu_latency = latency;
+
+	return 0;
+}
+
+int
+rte_power_create_qos_request(void)
+{
+	g_qos.fd = open(QOS_CPU_DMA_LATENCY_DEV, O_WRONLY);
+	if (g_qos.fd < 0) {
+		POWER_LOG(ERR, "Failed to open %s.", QOS_CPU_DMA_LATENCY_DEV);
+		return -1;
+	}
+
+	return 0;
+}
+
+void
+rte_power_release_qos_request(void)
+{
+	if (g_qos.fd != -1) {
+		close(g_qos.fd);
+		g_qos.fd = -1;
+	}
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..d39f5d0c0f
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The system-wide CPU latency QoS limit influences idle state selection
+ * in the cpuidle governor.
+ *
+ * Linux creates a cpu_dma_latency device under the '/dev' directory, through
+ * which userspace can read the system-wide CPU latency QoS limit and submit
+ * QoS requests. See the PM QoS framework documentation:
+ * https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
+ *
+ * The deeper the idle state, the lower the power consumption, but the longer
+ * the resume time. Some services are latency sensitive and need a short
+ * resume time, for example packet reception in interrupt mode.
+ *
+ * This PM QoS API makes it easy for an application to read the system-wide
+ * CPU latency limit and to submit its own CPU latency QoS request.
+ *
+ * The recommended usage is as follows:
+ * 1) The application process first creates a QoS request.
+ * 2) Update the CPU latency request to zero when low latency is needed.
+ * 3) Restore the default value @see PM_QOS_CPU_LATENCY_DEFAULT_VALUE when
+ *    it is no longer needed (this step is optional).
+ * 4) Release the QoS request when the process exits.
+ */
+
+#define QOS_USEC_PER_SEC                        1000000
+#define PM_QOS_CPU_LATENCY_DEFAULT_VALUE        (2000 * QOS_USEC_PER_SEC)
+#define PM_QOS_STRICT_LATENCY_VALUE             0
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Create a CPU latency QoS request. Release the request with
+ * @see rte_power_release_qos_request.
+ *
+ * @return
+ *   0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_create_qos_request(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Release the CPU latency QoS request.
+ */
+__rte_experimental
+void rte_power_release_qos_request(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current system-wide CPU latency QoS limit, in microseconds.
+ * The default value in the kernel is @see PM_QOS_CPU_LATENCY_DEFAULT_VALUE.
+ *
+ * @return
+ *   0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_get_curr_cpu_latency(int *latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Update the CPU latency QoS request.
+ * Note: a QoS request must be created before calling this API.
+ *
+ * @param latency
+ *   The latency, in microseconds, must be greater than or equal to zero.
+ *
+ * @return
+ *   0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_update_request(int latency);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..42770762b1 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
 	rte_power_set_uncore_env;
 	rte_power_uncore_freqs;
 	rte_power_unset_uncore_env;
+	rte_power_create_qos_request;
+	rte_power_release_qos_request;
+	rte_power_qos_get_curr_cpu_latency;
+	rte_power_qos_update_request;
 };
-- 
2.22.0



* [PATCH 2/2] examples/l3fwd-power: add PM QoS request configuration
  2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
  2024-03-20 10:55 ` [PATCH 1/2] power: " Huisong Li
@ 2024-03-20 10:55 ` Huisong Li
  2024-03-20 14:05 ` [PATCH 0/2] introduce PM QoS interface Morten Brørup
  2 siblings, 0 replies; 16+ messages in thread
From: Huisong Li @ 2024-03-20 10:55 UTC (permalink / raw)
  To: dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong, lihuisong

Add PM QoS request configuration to decrease the process resume latency.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
 examples/l3fwd-power/main.c | 41 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f4adcf41b5..78f292ed02 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
 #include <rte_telemetry.h>
 #include <rte_power_pmd_mgmt.h>
 #include <rte_power_uncore.h>
+#include <rte_power_qos.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -2232,12 +2233,48 @@ static int check_ptype(uint16_t portid)
 
 }
 
+static int
+pm_qos_init(void)
+{
+	int cur_cpu_latency;
+	int ret;
+
+	ret = rte_power_qos_get_curr_cpu_latency(&cur_cpu_latency);
+	if (ret < 0) {
+		RTE_LOG(ERR, L3FWD_POWER, "failed to get current cpu latency.\n");
+		return ret;
+	}
+	RTE_LOG(INFO, L3FWD_POWER, "current cpu latency is %dus on system.\n",
+			cur_cpu_latency);
+
+	ret = rte_power_create_qos_request();
+	if (ret < 0) {
+		RTE_LOG(ERR, L3FWD_POWER, "Failed to create power QoS request.\n");
+		return ret;
+	}
+
+	/*
+	 * Set strict latency requirement to prevent service thread going into
+	 * a deeper sleep state whose resume time is longer.
+	 */
+	ret = rte_power_qos_update_request(PM_QOS_STRICT_LATENCY_VALUE);
+	if (ret < 0)
+		RTE_LOG(ERR, L3FWD_POWER, "Failed to change cpu latency to 0.\n");
+	return ret;
+}
+
 static int
 init_power_library(void)
 {
 	enum power_management_env env;
 	unsigned int lcore_id;
-	int ret = 0;
+	int ret;
+
+	ret = pm_qos_init();
+	if (ret != 0) {
+		RTE_LOG(ERR, L3FWD_POWER, "init power Qos failed.\n");
+		return ret;
+	}
 
 	RTE_LCORE_FOREACH(lcore_id) {
 		/* init power management library */
@@ -2268,6 +2305,8 @@ deinit_power_library(void)
 	unsigned int lcore_id, max_pkg, max_die, die, pkg;
 	int ret = 0;
 
+	rte_power_release_qos_request();
+
 	RTE_LCORE_FOREACH(lcore_id) {
 		/* deinit power management library */
 		ret = rte_power_exit(lcore_id);
-- 
2.22.0



* RE: [PATCH 0/2] introduce PM QoS interface
  2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
  2024-03-20 10:55 ` [PATCH 1/2] power: " Huisong Li
  2024-03-20 10:55 ` [PATCH 2/2] examples/l3fwd-power: add PM QoS request configuration Huisong Li
@ 2024-03-20 14:05 ` Morten Brørup
  2024-03-21  3:04   ` lihuisong (C)
  2 siblings, 1 reply; 16+ messages in thread
From: Morten Brørup @ 2024-03-20 14:05 UTC (permalink / raw)
  To: Huisong Li, dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong

> From: Huisong Li [mailto:lihuisong@huawei.com]
> Sent: Wednesday, 20 March 2024 11.55
> 
> The system-wide CPU latency QoS limit has a positive impact on the idle
> state selection in cpuidle governor.
> 
> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> CPU latency QoS limit on system and send the QoS request for userspace.
> Please see the PM QoS framework in the following link:
> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> This feature is supported by kernel-v2.6.25.
> 
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some service are delay sensitive and very except the low
> resume time, like interrupt packet receiving mode.
> 
> So this series introduce PM QoS interface.

This looks like a 1:1 wrapper for a Linux kernel feature.
Does Windows or BSD offer something similar?

Furthermore, any high-res timing should use nanoseconds, not microseconds or milliseconds.
I realize that the Linux kernel only uses microseconds for these APIs, but the DPDK API should use nanoseconds.



* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-20 14:05 ` [PATCH 0/2] introduce PM QoS interface Morten Brørup
@ 2024-03-21  3:04   ` lihuisong (C)
  2024-03-21 13:30     ` Morten Brørup
  0 siblings, 1 reply; 16+ messages in thread
From: lihuisong (C) @ 2024-03-21  3:04 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong

Hi Morten,

Thanks for your review.

On 2024/3/20 22:05, Morten Brørup wrote:
>> From: Huisong Li [mailto:lihuisong@huawei.com]
>> Sent: Wednesday, 20 March 2024 11.55
>>
>> The system-wide CPU latency QoS limit has a positive impact on the idle
>> state selection in cpuidle governor.
>>
>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
>> CPU latency QoS limit on system and send the QoS request for userspace.
>> Please see the PM QoS framework in the following link:
>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>> This feature is supported by kernel-v2.6.25.
>>
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some service are delay sensitive and very except the low
>> resume time, like interrupt packet receiving mode.
>>
>> So this series introduce PM QoS interface.
> This looks like a 1:1 wrapper for a Linux kernel feature.
right
> Does Windows or BSD offer something similar?
How can we find out whether Windows or BSD supports a similar feature?
The DPDK power lib only works on Linux, according to the meson.build under
lib/power.
If they support this feature, they can enable it.
>
> Furthermore, any high-res timing should use nanoseconds, not microseconds or milliseconds.
> I realize that the Linux kernel only uses microseconds for these APIs, but the DPDK API should use nanoseconds.
Nanoseconds are more precise, which is good.
But how can the DPDK API use nanoseconds when, as you said, the Linux kernel
only uses microseconds for these APIs?
The kernel interface only understands an integer value in microseconds.

/BR
/Huisong


* RE: [PATCH 0/2] introduce PM QoS interface
  2024-03-21  3:04   ` lihuisong (C)
@ 2024-03-21 13:30     ` Morten Brørup
  2024-03-22  8:54       ` lihuisong (C)
  0 siblings, 1 reply; 16+ messages in thread
From: Morten Brørup @ 2024-03-21 13:30 UTC (permalink / raw)
  To: lihuisong (C), dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong

> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Thursday, 21 March 2024 04.04
> 
> Hi Moren,
> 
> Thanks for your revew.
> 
> On 2024/3/20 22:05, Morten Brørup wrote:
> >> From: Huisong Li [mailto:lihuisong@huawei.com]
> >> Sent: Wednesday, 20 March 2024 11.55
> >>
> >> The system-wide CPU latency QoS limit has a positive impact on the idle
> >> state selection in cpuidle governor.
> >>
> >> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> >> CPU latency QoS limit on system and send the QoS request for userspace.
> >> Please see the PM QoS framework in the following link:
> >> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >> This feature is supported by kernel-v2.6.25.
> >>
> >> The deeper the idle state, the lower the power consumption, but the longer
> >> the resume time. Some service are delay sensitive and very except the low
> >> resume time, like interrupt packet receiving mode.
> >>
> >> So this series introduce PM QoS interface.
> > This looks like a 1:1 wrapper for a Linux kernel feature.
> right
> > Does Windows or BSD offer something similar?
> How do we know Windows or BSD support this similar feature?

Ask Windows experts or research using Google.

> The DPDK power lib just work on Linux according to the meson.build under
> lib/power.
> If they support this features, they can open it.

The DPDK power lib currently only works on Linux, yes.
But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.

DPDK is on track to work across multiple platforms, including Windows.
We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.

> >
> > Furthermore, any high-res timing should use nanoseconds, not microseconds or
> milliseconds.
> > I realize that the Linux kernel only uses microseconds for these APIs, but
> the DPDK API should use nanoseconds.
> Nanoseconds is more precise, it's good.
> But DPDK API how use nanoseconds as you said the the Linux kernel only
> uses microseconds for these APIs.
> Kernel interface just know an integer value with microseconds unit.

One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
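
A minimal sketch of that conversion, under stated assumptions: the helper
names and the choice to round down are illustrative, not part of the patch.
Rounding down means the microsecond value handed to the kernel is never
looser than the nanosecond limit the caller asked for.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/*
 * Convert a latency limit in nanoseconds to whole microseconds.
 * Rounding down keeps the request at least as strict as requested;
 * values too large for a signed 32-bit integer are clamped.
 */
int32_t
latency_ns_to_usec(uint64_t ns)
{
	uint64_t usec = ns / NSEC_PER_USEC;

	return usec > INT32_MAX ? INT32_MAX : (int32_t)usec;
}

/* Convert microseconds reported by the kernel back to nanoseconds. */
uint64_t
latency_usec_to_ns(int32_t usec)
{
	assert(usec >= 0);
	return (uint64_t)usec * NSEC_PER_USEC;
}
```

One consequence of rounding down is that any limit below 1000 ns becomes a
0 us request, which is the strictest setting.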




* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-21 13:30     ` Morten Brørup
@ 2024-03-22  8:54       ` lihuisong (C)
  2024-03-22 12:35         ` Morten Brørup
  2024-03-22 17:55         ` Tyler Retzlaff
  0 siblings, 2 replies; 16+ messages in thread
From: lihuisong (C) @ 2024-03-22  8:54 UTC (permalink / raw)
  To: Morten Brørup, dev, Tyler Retzlaff, weh,
	longli@microsoft.com >> Long Li, alan.elder
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong

+Tyler, +Alan, +Wei, +Long to ask about a similar feature on Windows.

On 2024/3/21 21:30, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Thursday, 21 March 2024 04.04
>>
>> Hi Moren,
>>
>> Thanks for your revew.
>>
>> On 2024/3/20 22:05, Morten Brørup wrote:
>>>> From: Huisong Li [mailto:lihuisong@huawei.com]
>>>> Sent: Wednesday, 20 March 2024 11.55
>>>>
>>>> The system-wide CPU latency QoS limit has a positive impact on the idle
>>>> state selection in cpuidle governor.
>>>>
>>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
>>>> CPU latency QoS limit on system and send the QoS request for userspace.
>>>> Please see the PM QoS framework in the following link:
>>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>>>> This feature is supported by kernel-v2.6.25.
>>>>
>>>> The deeper the idle state, the lower the power consumption, but the longer
>>>> the resume time. Some service are delay sensitive and very except the low
>>>> resume time, like interrupt packet receiving mode.
>>>>
>>>> So this series introduce PM QoS interface.
>>> This looks like a 1:1 wrapper for a Linux kernel feature.
>> right
>>> Does Windows or BSD offer something similar?
>> How do we know Windows or BSD support this similar feature?
> Ask Windows experts or research using Google.
I downloaded the FreeBSD source code and didn't find a similar feature.
It doesn't even support a cpuidle feature (this QoS feature affects cpuidle).
I didn't find anything useful about this on Windows from Google.


@Tyler, @Alan, @Wei and @Long

Do you know whether Windows lets userspace read and set a CPU latency
limit that affects how deep the CPU idle state can be?

>> The DPDK power lib just work on Linux according to the meson.build under
>> lib/power.
>> If they support this features, they can open it.
> The DPDK power lib currently only works on Linux, yes.
> But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
>
> DPDK is on track to work across multiple platforms, including Windows.
> We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
Totally understood.
>
>>> Furthermore, any high-res timing should use nanoseconds, not microseconds or
>> milliseconds.
>>> I realize that the Linux kernel only uses microseconds for these APIs, but
>> the DPDK API should use nanoseconds.
>> Nanoseconds is more precise, it's good.
>> But DPDK API how use nanoseconds as you said the the Linux kernel only
>> uses microseconds for these APIs.
>> Kernel interface just know an integer value with microseconds unit.
> One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
If so, we would have to modify the interface implementation on Linux, which
changes the input/output unit of the interface.
DPDK would also have to do this depending on the kernel version, which is not good.
The cpuidle governor selects an idle state based on the worst-case latency
of each idle state.
The worst-case latency of a C-state reported by the ACPI tables is in
microseconds; see sections 8.4.1.1. _CST (C States) and 8.4.3.3. _LPI
(Low Power Idle States) of the ACPI spec [1].
So it is probably not meaningful to change this interface implementation.
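
To illustrate the selection rule, here is a deliberately simplified sketch.
It is not the kernel's governor code, which also weighs predicted idle
duration and target residency. The struct, the function name and the
example latencies are hypothetical; states are assumed to be ordered from
shallowest to deepest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle-state descriptor, ordered shallowest to deepest. */
struct idle_state {
	const char *name;
	int32_t exit_latency_us; /* worst-case resume latency */
};

/*
 * Return the index of the deepest state whose worst-case exit latency does
 * not exceed the QoS limit. State 0 is always allowed as a fallback.
 */
size_t
pick_idle_state(const struct idle_state *states, size_t n, int32_t limit_us)
{
	size_t i, best = 0;

	assert(states != NULL && n > 0);

	for (i = 1; i < n; i++) {
		if (states[i].exit_latency_us <= limit_us)
			best = i;
	}
	return best;
}
```

With made-up exit latencies of 1, 20 and 200 us, a limit of 0 keeps the CPU
in the shallowest state, a limit of 50 us permits the middle state, and the
kernel default limit permits the deepest one.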

For the cases that need PM QoS in DPDK, I think it is better to set the CPU
latency to zero to prevent the service thread from entering a deeper idle state.
> You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
>
[1] 
https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html


* RE: [PATCH 0/2] introduce PM QoS interface
  2024-03-22  8:54       ` lihuisong (C)
@ 2024-03-22 12:35         ` Morten Brørup
  2024-03-26  2:11           ` lihuisong (C)
  2024-03-22 17:55         ` Tyler Retzlaff
  1 sibling, 1 reply; 16+ messages in thread
From: Morten Brørup @ 2024-03-22 12:35 UTC (permalink / raw)
  To: lihuisong (C), dev, Tyler Retzlaff, weh, longli, alan.elder
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong

> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Friday, 22 March 2024 09.54
> 
> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
> 
> On 2024/3/21 21:30, Morten Brørup wrote:
> >> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >> Sent: Thursday, 21 March 2024 04.04
> >>
> >> Hi Moren,
> >>
> >> Thanks for your revew.
> >>
> >> On 2024/3/20 22:05, Morten Brørup wrote:
> >>>> From: Huisong Li [mailto:lihuisong@huawei.com]
> >>>> Sent: Wednesday, 20 March 2024 11.55
> >>>>
> >>>> The system-wide CPU latency QoS limit has a positive impact on the idle
> >>>> state selection in cpuidle governor.
> >>>>
> >>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain
> the
> >>>> CPU latency QoS limit on system and send the QoS request for userspace.
> >>>> Please see the PM QoS framework in the following link:
> >>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >>>> This feature is supported by kernel-v2.6.25.
> >>>>
> >>>> The deeper the idle state, the lower the power consumption, but the
> longer
> >>>> the resume time. Some service are delay sensitive and very except the low
> >>>> resume time, like interrupt packet receiving mode.
> >>>>
> >>>> So this series introduce PM QoS interface.
> >>> This looks like a 1:1 wrapper for a Linux kernel feature.
> >> right
> >>> Does Windows or BSD offer something similar?
> >> How do we know Windows or BSD support this similar feature?
> > Ask Windows experts or research using Google.
> I download freebsd source code, I didn't find this similar feature.
> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
> I don't find any useful about this on Windows from google.
> 
> 
> @Tyler, @Alan, @Wei and @Long
> 
> Do you know windows support that userspace read and send CPU latency
> which has an impact on deep level of CPU idle?
> 
> >> The DPDK power lib just work on Linux according to the meson.build under
> >> lib/power.
> >> If they support this features, they can open it.
> > The DPDK power lib currently only works on Linux, yes.
> > But its API should still be designed to be platform agnostic, so the
> functions can be implemented on other platforms in the future.
> >
> > DPDK is on track to work across multiple platforms, including Windows.
> > We must always consider other platforms, and not design DPDK APIs as if they
> are for Linux/BSD only.
> totally understand you.
> >
> >>> Furthermore, any high-res timing should use nanoseconds, not microseconds
> or
> >> milliseconds.
> >>> I realize that the Linux kernel only uses microseconds for these APIs, but
> >> the DPDK API should use nanoseconds.
> >> Nanoseconds is more precise, it's good.
> >> But DPDK API how use nanoseconds as you said the the Linux kernel only
> >> uses microseconds for these APIs.
> >> Kernel interface just know an integer value with microseconds unit.
> > One solution is to expose nanoseconds in the DPDK API, and in the Linux
> specific implementation convert from/to microseconds.
> If so, we have to modify the implementation interface on Linux. This
> change the input/output unit about the interface.
> And DPDK also has to do this based on kernel version. It is not good.
> The cpuidle governor select which idle state based on the worst-case
> latency of idle state.
> These the worst-case latency of Cstate reported by ACPI table is in
> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3. _LPI
> (Low Power Idle States) in ACPI spec [1].
> So it is probably not meaning to change this interface implementation.

OK... Since microsecond resolution is good enough for ACPI and Linux, you have me convinced that it's also good enough for DPDK (for this specific topic).

Thank you for the detailed reply!

> 
> For the case need PM QoS in DPDK, I think, it is better to set cpu
> latency to zero to prevent service thread from the deeper the idle state.

It would defeat the purpose (i.e. not saving sufficient amounts of power) if the CPU cannot enter a deeper idle state.

Personally, I would think a wake-up latency of up to 10 microseconds should be fine for most purposes.
The default Linux timerslack is 50 microseconds, so you could also use that value.

> > You might also want to add a note to the in-line documentation of the
> relevant functions that the Linux implementation only uses microsecond
> resolution.
> >
> [1]
> https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html


* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-22  8:54       ` lihuisong (C)
  2024-03-22 12:35         ` Morten Brørup
@ 2024-03-22 17:55         ` Tyler Retzlaff
  2024-03-26  2:20           ` lihuisong (C)
  1 sibling, 1 reply; 16+ messages in thread
From: Tyler Retzlaff @ 2024-03-22 17:55 UTC (permalink / raw)
  To: lihuisong (C)
  Cc: Morten Brørup, dev, weh,
	longli@microsoft.com >> Long Li, alan.elder, thomas,
	ferruh.yigit, anatoly.burakov, david.hunt, sivaprasad.tummala,
	liuyonglong

On Fri, Mar 22, 2024 at 04:54:01PM +0800, lihuisong (C) wrote:
> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
> 
> On 2024/3/21 21:30, Morten Brørup wrote:
> >>From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >>Sent: Thursday, 21 March 2024 04.04
> >>
> >>Hi Moren,
> >>
> >>Thanks for your revew.
> >>
> >>On 2024/3/20 22:05, Morten Brørup wrote:
> >>>>From: Huisong Li [mailto:lihuisong@huawei.com]
> >>>>Sent: Wednesday, 20 March 2024 11.55
> >>>>
> >>>>The system-wide CPU latency QoS limit has a positive impact on the idle
> >>>>state selection in cpuidle governor.
> >>>>
> >>>>Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> >>>>CPU latency QoS limit on system and send the QoS request for userspace.
> >>>>Please see the PM QoS framework in the following link:
> >>>>https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >>>>This feature is supported by kernel-v2.6.25.
> >>>>
> >>>>The deeper the idle state, the lower the power consumption, but the longer
> >>>>the resume time. Some service are delay sensitive and very except the low
> >>>>resume time, like interrupt packet receiving mode.
> >>>>
> >>>>So this series introduce PM QoS interface.
> >>>This looks like a 1:1 wrapper for a Linux kernel feature.
> >>right
> >>>Does Windows or BSD offer something similar?
> >>How do we know Windows or BSD support this similar feature?
> >Ask Windows experts or research using Google.
> I download freebsd source code, I didn't find this similar feature.
> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
> I don't find any useful about this on Windows from google.
> 
> 
> @Tyler, @Alan, @Wei and @Long
> 
> Do you know windows support that userspace read and send CPU latency
> which has an impact on deep level of CPU idle?

it is unlikely you'll find an api that lets you manage things in terms
of raw latency values as the linux knobs here do. windows more often employs
policy centric schemes to permit the system to abstract implementation detail.

powercfg is probably the closest thing you can use to tune the same
things on windows: you select e.g. the 'performance' scheme, but it
won't allow you to pick specific latency numbers.

https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/powercfg-command-line-options

> 
> >>The DPDK power lib just work on Linux according to the meson.build under
> >>lib/power.
> >>If they support this features, they can open it.
> >The DPDK power lib currently only works on Linux, yes.
> >But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
> >
> >DPDK is on track to work across multiple platforms, including Windows.
> >We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
> totally understand you.

since lib/power isn't built for windows at this time i don't think it's
appropriate to constrain your innovation. i do appreciate the engagement
though and would just offer general guidance that if you can design your
api with some kind of abstraction in mind that would be great and by all
means if you can figure out how to wrangle powercfg /Qh into satisfying the
api in a policy centric way it might be kind of nice.

i'll let other windows experts chime in here if they choose.

thanks!

> >
> >>>Furthermore, any high-res timing should use nanoseconds, not microseconds or
> >>milliseconds.
> >>>I realize that the Linux kernel only uses microseconds for these APIs, but
> >>the DPDK API should use nanoseconds.
> >>Nanoseconds is more precise, it's good.
> >>But DPDK API how use nanoseconds as you said the the Linux kernel only
> >>uses microseconds for these APIs.
> >>Kernel interface just know an integer value with microseconds unit.
> >One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
> If so, we have to modify the implementation interface on Linux. This
> change the input/output unit about the interface.
> And DPDK also has to do this based on kernel version. It is not good.
> The cpuidle governor select which idle state based on the worst-case
> latency of idle state.
> These the worst-case latency of Cstate reported by ACPI table is in
> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3.
> _LPI (Low Power Idle States) in ACPI spec [1].
> So it is probably not meaning to change this interface implementation.
> 
> For the case need PM QoS in DPDK, I think, it is better to set cpu
> latency to zero to prevent service thread from the deeper the idle
> state.
> >You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
> >
> [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html


* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-22 12:35         ` Morten Brørup
@ 2024-03-26  2:11           ` lihuisong (C)
  2024-03-26  8:27             ` Morten Brørup
  0 siblings, 1 reply; 16+ messages in thread
From: lihuisong (C) @ 2024-03-26  2:11 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong


On 2024/3/22 20:35, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Friday, 22 March 2024 09.54
>>
>> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
>>
>> On 2024/3/21 21:30, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Thursday, 21 March 2024 04.04
>>>>
>>>> Hi Morten,
>>>>
>>>> Thanks for your review.
>>>>
>>>> On 2024/3/20 22:05, Morten Brørup wrote:
>>>>>> From: Huisong Li [mailto:lihuisong@huawei.com]
>>>>>> Sent: Wednesday, 20 March 2024 11.55
>>>>>>
>>>>>> The system-wide CPU latency QoS limit has a positive impact on the idle
>>>>>> state selection in cpuidle governor.
>>>>>>
>>>>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain
>> the
>>>>>> CPU latency QoS limit on system and send the QoS request for userspace.
>>>>>> Please see the PM QoS framework in the following link:
>>>>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>>>>>> This feature is supported by kernel-v2.6.25.
>>>>>>
>>>>>> The deeper the idle state, the lower the power consumption, but the
>> longer
>>>>>> the resume time. Some service are delay sensitive and very except the low
>>>>>> resume time, like interrupt packet receiving mode.
>>>>>>
>>>>>> So this series introduce PM QoS interface.
>>>>> This looks like a 1:1 wrapper for a Linux kernel feature.
>>>> right
>>>>> Does Windows or BSD offer something similar?
>>>> How do we know Windows or BSD support this similar feature?
>>> Ask Windows experts or research using Google.
>> I download freebsd source code, I didn't find this similar feature.
>> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
>> I don't find any useful about this on Windows from google.
>>
>>
>> @Tyler, @Alan, @Wei and @Long
>>
>> Do you know windows support that userspace read and send CPU latency
>> which has an impact on deep level of CPU idle?
>>
>>>> The DPDK power lib just work on Linux according to the meson.build under
>>>> lib/power.
>>>> If they support this features, they can open it.
>>> The DPDK power lib currently only works on Linux, yes.
>>> But its API should still be designed to be platform agnostic, so the
>> functions can be implemented on other platforms in the future.
>>> DPDK is on track to work across multiple platforms, including Windows.
>>> We must always consider other platforms, and not design DPDK APIs as if they
>> are for Linux/BSD only.
>> totally understand you.
>>>>> Furthermore, any high-res timing should use nanoseconds, not microseconds
>> or
>>>> milliseconds.
>>>>> I realize that the Linux kernel only uses microseconds for these APIs, but
>>>> the DPDK API should use nanoseconds.
>>>> Nanoseconds is more precise, it's good.
>>>> But DPDK API how use nanoseconds as you said the the Linux kernel only
>>>> uses microseconds for these APIs.
>>>> Kernel interface just know an integer value with microseconds unit.
>>> One solution is to expose nanoseconds in the DPDK API, and in the Linux
>> specific implementation convert from/to microseconds.
>> If so, we have to modify the implementation interface on Linux. This
>> change the input/output unit about the interface.
>> And DPDK also has to do this based on kernel version. It is not good.
>> The cpuidle governor select which idle state based on the worst-case
>> latency of idle state.
>> These the worst-case latency of Cstate reported by ACPI table is in
>> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3. _LPI
>> (Low Power Idle States) in ACPI spec [1].
>> So it is probably not meaning to change this interface implementation.
> OK... Since microsecond resolution is good enough for ACPI and Linux, you have me convinced that it's also good enough for DPDK (for this specific topic).
>
> Thank you for the detailed reply!
>
>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>> latency to zero to prevent service thread from the deeper the idle state.
> It would defeat the purpose (i.e. not saving sufficient amounts of power) if the CPU cannot enter a deeper idle state.
Yes, it is not good for power.
AFAIS, PM QoS is just meant to reduce the impact on performance.
Anyway, if we set it to zero, the system can still enter C-state 0 at least.
>
> Personally, I would think a wake-up latency of up to 10 microseconds should be fine for most purposes.
> Default Linux timerslack is 50 microseconds, so you could also use that value.
How much CPU latency is OK? Maybe we can leave that decision to the
application.
Linux collects all these QoS requests and uses the minimum latency.
What do you think, Morten?
>
>>> You might also want to add a note to the in-line documentation of the
>> relevant functions that the Linux implementation only uses microsecond
>> resolution.
>> [1]
>> https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
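
For illustration only (not code from this series): a simplified C model of the two statements above, namely that all outstanding QoS requests are aggregated to their minimum, and that the resulting limit is compared against each idle state's worst-case exit latency. The struct, the function names and the latency figures used with them are hypothetical, and a real cpuidle governor weighs more inputs than this (for example the predicted idle duration), so treat it as a sketch of the latency check alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle state: name plus worst-case exit latency in microseconds. */
struct idle_state {
	const char *name;
	uint32_t exit_latency_us;
};

/*
 * Aggregate all outstanding latency requests: the effective limit is the
 * smallest one. "no_limit" is returned when there are no requests.
 */
static uint32_t
effective_latency_limit_us(const uint32_t *requests_us, size_t n,
			   uint32_t no_limit)
{
	uint32_t limit = no_limit;

	for (size_t i = 0; i < n; i++)
		if (requests_us[i] < limit)
			limit = requests_us[i];
	return limit;
}

/*
 * Return the index of the deepest state whose exit latency does not exceed
 * the limit. States are assumed to be sorted from shallowest to deepest.
 * Falls back to index 0 if no state fits.
 */
static size_t
deepest_allowed_state(const struct idle_state *states, size_t n,
		      uint32_t limit_us)
{
	size_t chosen = 0;

	for (size_t i = 0; i < n; i++)
		if (states[i].exit_latency_us <= limit_us)
			chosen = i;
	return chosen;
}
```

In this model a request of zero leaves only the shallowest state available, which matches the power cost discussed above, while a larger application-chosen value keeps the intermediate states usable.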


* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-22 17:55         ` Tyler Retzlaff
@ 2024-03-26  2:20           ` lihuisong (C)
  2024-03-26 16:04             ` Tyler Retzlaff
  0 siblings, 1 reply; 16+ messages in thread
From: lihuisong (C) @ 2024-03-26  2:20 UTC (permalink / raw)
  To: Tyler Retzlaff
  Cc: Morten Brørup, dev, weh,
	longli@microsoft.com >> Long Li, alan.elder, thomas,
	ferruh.yigit, anatoly.burakov, david.hunt, sivaprasad.tummala,
	liuyonglong

Hi Tyler,

On 2024/3/23 1:55, Tyler Retzlaff wrote:
> On Fri, Mar 22, 2024 at 04:54:01PM +0800, lihuisong (C) wrote:
>> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
>>
>> On 2024/3/21 21:30, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Thursday, 21 March 2024 04.04
>>>>
>>>> Hi Morten,
>>>>
>>>> Thanks for your review.
>>>>
>>>> On 2024/3/20 22:05, Morten Brørup wrote:
>>>>>> From: Huisong Li [mailto:lihuisong@huawei.com]
>>>>>> Sent: Wednesday, 20 March 2024 11.55
>>>>>>
>>>>>> The system-wide CPU latency QoS limit has a positive impact on the idle
>>>>>> state selection in cpuidle governor.
>>>>>>
>>>>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
>>>>>> CPU latency QoS limit on system and send the QoS request for userspace.
>>>>>> Please see the PM QoS framework in the following link:
>>>>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>>>>>> This feature is supported by kernel-v2.6.25.
>>>>>>
>>>>>> The deeper the idle state, the lower the power consumption, but the longer
>>>>>> the resume time. Some service are delay sensitive and very except the low
>>>>>> resume time, like interrupt packet receiving mode.
>>>>>>
>>>>>> So this series introduce PM QoS interface.
>>>>> This looks like a 1:1 wrapper for a Linux kernel feature.
>>>> right
>>>>> Does Windows or BSD offer something similar?
>>>> How do we know Windows or BSD support this similar feature?
>>> Ask Windows experts or research using Google.
>> I download freebsd source code, I didn't find this similar feature.
>> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
>> I don't find any useful about this on Windows from google.
>>
>>
>> @Tyler, @Alan, @Wei and @Long
>>
>> Do you know windows support that userspace read and send CPU latency
>> which has an impact on deep level of CPU idle?
> it is unlikely you'll find an api that let's you manage things in terms
> of raw latency values as the linux knobs here do. windows more often employs
> policy centric schemes to permit the system to abstract implementation detail.
>
> powercfg is probably the closest thing you can use to tune the same
> things on windows. where you select e.g. the 'performance' scheme but it
> won't allow you to pick specific latency numbers.
>
> https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/powercfg-command-line-options

Thanks for your feedback. I will take a look at this tool.

>
>>>> The DPDK power lib just work on Linux according to the meson.build under
>>>> lib/power.
>>>> If they support this features, they can open it.
>>> The DPDK power lib currently only works on Linux, yes.
>>> But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
>>>
>>> DPDK is on track to work across multiple platforms, including Windows.
>>> We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
>> totally understand you.
> since lib/power isn't built for windows at this time i don't think it's
> appropriate to constrain your innovation. i do appreciate the engagement
> though and would just offer general guidance that if you can design your
> api with some kind of abstraction in mind that would be great and by all
> means if you can figure out how to wrangle powercfg /Qh into satisfying the
> api in a policy centric way it might be kind of nice.
Testing this with powercfg on Windows would be very challenging for me,
so I don't plan to do this on Windows. If you need it, you can add it, OK?
>
> i'll let other windows experts chime in here if they choose.
>
> thanks!
>
>>>>> Furthermore, any high-res timing should use nanoseconds, not microseconds or
>>>> milliseconds.
>>>>> I realize that the Linux kernel only uses microseconds for these APIs, but
>>>> the DPDK API should use nanoseconds.
>>>> Nanoseconds is more precise, it's good.
>>>> But DPDK API how use nanoseconds as you said the the Linux kernel only
>>>> uses microseconds for these APIs.
>>>> Kernel interface just know an integer value with microseconds unit.
>>> One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
>> If so, we have to modify the implementation interface on Linux. This
>> change the input/output unit about the interface.
>> And DPDK also has to do this based on kernel version. It is not good.
>> The cpuidle governor select which idle state based on the worst-case
>> latency of idle state.
>> These the worst-case latency of Cstate reported by ACPI table is in
>> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3.
>> _LPI (Low Power Idle States) in ACPI spec [1].
>> So it is probably not meaning to change this interface implementation.
>>
>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>> latency to zero to prevent service thread from the deeper the idle
>> state.
>>> You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
>>>
>> [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
> .


* RE: [PATCH 0/2] introduce PM QoS interface
  2024-03-26  2:11           ` lihuisong (C)
@ 2024-03-26  8:27             ` Morten Brørup
  2024-03-26 12:15               ` lihuisong (C)
  0 siblings, 1 reply; 16+ messages in thread
From: Morten Brørup @ 2024-03-26  8:27 UTC (permalink / raw)
  To: lihuisong (C), dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong

> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Tuesday, 26 March 2024 03.12
> 
> On 2024/3/22 20:35, Morten Brørup wrote:
> >> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >> Sent: Friday, 22 March 2024 09.54

[...]

> >> For the case need PM QoS in DPDK, I think, it is better to set cpu
> >> latency to zero to prevent service thread from the deeper the idle
> state.
> > It would defeat the purpose (i.e. not saving sufficient amounts of
> power) if the CPU cannot enter a deeper idle state.
> Yes, it is not good for power.
> AFAIS, PM QoS is just to decrease the influence for performance.
> Anyway, if we set to zero, system can be into Cstates-0 at least.
> >
> > Personally, I would think a wake-up latency of up to 10 microseconds
> should be fine for most purposes.
> > Default Linux timerslack is 50 microseconds, so you could also use
> that value.
> How much CPU latency is ok. Maybe, we can give the decision to the
> application.

Yes, the application should decide the acceptable worst-case latency.

> Linux will collect all these QoS request and use the minimum latency.
> what do you think, Morten?

For the example application, you could use a value of 50 microseconds and refer to this value also being the default timerslack in Linux.
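
For illustration only (not part of the patch under review): a minimal C sketch of how an application could hold a 50 microsecond CPU latency request through /dev/cpu_dma_latency. It assumes the behaviour described in the kernel PM QoS documentation linked in the cover letter: the request is a 32-bit value in microseconds, and it stays registered only while the file descriptor is kept open. The names cpu_latency_request() and cpu_latency_release() are hypothetical and are not the API proposed in this series.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define CPU_DMA_LATENCY_DEV "/dev/cpu_dma_latency"
#define EXAMPLE_LATENCY_US 50 /* value suggested in this thread */

/*
 * Register a CPU latency request of latency_us microseconds on the given
 * device node. Returns an open file descriptor on success or -errno on
 * failure. The caller must keep the descriptor open for as long as the
 * request should remain in effect.
 */
static int
cpu_latency_request(const char *path, int32_t latency_us)
{
	ssize_t n;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The device node takes a binary 32-bit value in microseconds. */
	n = write(fd, &latency_us, sizeof(latency_us));
	if (n != (ssize_t)sizeof(latency_us)) {
		int err = (n < 0) ? errno : EIO;

		close(fd);
		return -err;
	}
	return fd;
}

/* Drop the request by closing the descriptor. */
static void
cpu_latency_release(int fd)
{
	if (fd >= 0)
		close(fd);
}
```

An application would call cpu_latency_request(CPU_DMA_LATENCY_DEV, EXAMPLE_LATENCY_US) once at start-up, typically with root privileges, and cpu_latency_release() on exit. The path is a parameter so the helper can be exercised without touching the real device node.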



* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-26  8:27             ` Morten Brørup
@ 2024-03-26 12:15               ` lihuisong (C)
  2024-03-26 12:46                 ` Morten Brørup
  0 siblings, 1 reply; 16+ messages in thread
From: lihuisong (C) @ 2024-03-26 12:15 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong


On 2024/3/26 16:27, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Tuesday, 26 March 2024 03.12
>>
>> On 2024/3/22 20:35, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Friday, 22 March 2024 09.54
> [...]
>
>>>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>>>> latency to zero to prevent service thread from the deeper the idle
>> state.
>>> It would defeat the purpose (i.e. not saving sufficient amounts of
>> power) if the CPU cannot enter a deeper idle state.
>> Yes, it is not good for power.
>> AFAIS, PM QoS is just to decrease the influence for performance.
>> Anyway, if we set to zero, system can be into Cstates-0 at least.
>>> Personally, I would think a wake-up latency of up to 10 microseconds
>> should be fine for most purposes.
>>> Default Linux timerslack is 50 microseconds, so you could also use
>> that value.
>> How much CPU latency is ok. Maybe, we can give the decision to the
>> application.
> Yes, the application should decide the acceptable worst-case latency.
>
>> Linux will collect all these QoS request and use the minimum latency.
>> what do you think, Morten?
> For the example application, you could use a value of 50 microseconds and refer to this value also being the default timerslack in Linux.
There is a description of "/proc/<pid>/timerslack_ns" in the Linux documentation [1]:
"
This file provides the value of the task’s timerslack value in nanoseconds.
This value specifies an amount of time that normal timers may be 
deferred in order to coalesce timers and avoid unnecessary wakeups.
This allows a task’s interactivity vs power consumption tradeoff to be 
adjusted.
"
I cannot understand the relationship between timerslack in Linux and
the CPU wake-up latency.
It seems that timerslack just defers timers in order to coalesce them
and avoid unnecessary wakeups.
It does not have much to do with the CPU latency limit, which is meant to
keep the CPU from entering deeper idle states and so satisfy the
application's request.
>


* RE: [PATCH 0/2] introduce PM QoS interface
  2024-03-26 12:15               ` lihuisong (C)
@ 2024-03-26 12:46                 ` Morten Brørup
  2024-03-29  1:59                   ` lihuisong (C)
  0 siblings, 1 reply; 16+ messages in thread
From: Morten Brørup @ 2024-03-26 12:46 UTC (permalink / raw)
  To: lihuisong (C), dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong

> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Tuesday, 26 March 2024 13.15
> 
> On 2024/3/26 16:27, Morten Brørup wrote:
> >> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >> Sent: Tuesday, 26 March 2024 03.12
> >>
> >> On 2024/3/22 20:35, Morten Brørup wrote:
> >>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >>>> Sent: Friday, 22 March 2024 09.54
> > [...]
> >
> >>>> For the case need PM QoS in DPDK, I think, it is better to set cpu
> >>>> latency to zero to prevent service thread from the deeper the idle
> >> state.
> >>> It would defeat the purpose (i.e. not saving sufficient amounts of
> >> power) if the CPU cannot enter a deeper idle state.
> >> Yes, it is not good for power.
> >> AFAIS, PM QoS is just to decrease the influence for performance.
> >> Anyway, if we set to zero, system can be into Cstates-0 at least.
> >>> Personally, I would think a wake-up latency of up to 10 microseconds
> >> should be fine for most purposes.
> >>> Default Linux timerslack is 50 microseconds, so you could also use
> >> that value.
> >> How much CPU latency is ok. Maybe, we can give the decision to the
> >> application.
> > Yes, the application should decide the acceptable worst-case latency.
> >
> >> Linux will collect all these QoS request and use the minimum latency.
> >> what do you think, Morten?
> > For the example application, you could use a value of 50 microseconds
> and refer to this value also being the default timerslack in Linux.
> There is a description for "/proc/<pid>/timerslack_ns" in Linux document
> [1]
> "
> This file provides the value of the task’s timerslack value in
> nanoseconds.
> This value specifies an amount of time that normal timers may be
> deferred in order to coalesce timers and avoid unnecessary wakeups.
> This allows a task’s interactivity vs power consumption tradeoff to be
> adjusted.
> "
> I cannot understand what the relationship is between the timerslack in
> Linux and cpu latency to wake up.
> It seems that timerslack is just to defer the timer in order to coalesce
> timers and avoid unnecessary wakeups.
> And it has not a lot to do with the CPU latency which is aimed to avoid
> task to enter deeper idle state and satify application request.

Correct. They control two different things.

However, both can cause latency for the application, so my rationale for the relationship was:
If the application accepts X us of latency caused by kernel scheduling delays (caused by timerslack), the application should accept the same amount of latency caused by CPU wake-up latency.

This also means that if you want lower latency than 50 us, you should not only set cpu wake-up latency, you should also set timerslack.

Obviously, if the application is only affected by one of the two, the application only needs to adjust that one of them.

As for the 50 us value, someone in the Linux kernel team decided that 50 us was an acceptable amount of latency for the kernel; we could use the same value, referring to that. Or we could choose some other value, and describe how we came up with our own value. And if necessary, also adjust timerslack accordingly.
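
For illustration only: a minimal C sketch of the timerslack half of this, i.e. lowering the calling thread's timer slack to the same budget that is used for the CPU wake-up latency. It uses prctl(PR_SET_TIMERSLACK), which takes nanoseconds and applies to the calling thread only. The helper names are hypothetical, and whether l3fwd-power should do anything like this is still open in this discussion.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <sys/prctl.h>

/* Convert a latency budget in microseconds to nanoseconds. */
static uint64_t
latency_us_to_ns(uint32_t us)
{
	return (uint64_t)us * 1000u;
}

/*
 * Set the calling thread's timer slack to a budget given in microseconds.
 * Returns 0 on success or -errno on failure.
 *
 * A budget of zero is rejected: passing 0 to PR_SET_TIMERSLACK resets the
 * slack to the thread's default value instead of disabling it.
 */
static int
thread_timerslack_set_us(uint32_t us)
{
	uint64_t ns = latency_us_to_ns(us);

	if (ns == 0 || ns > ULONG_MAX)
		return -EINVAL;
	if (prctl(PR_SET_TIMERSLACK, (unsigned long)ns, 0UL, 0UL, 0UL) != 0)
		return -errno;
	return 0;
}
```

An application that wants, say, a 10 us budget would call thread_timerslack_set_us(10) in each latency-sensitive thread and request the same value for the CPU wake-up latency.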



* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-26  2:20           ` lihuisong (C)
@ 2024-03-26 16:04             ` Tyler Retzlaff
  0 siblings, 0 replies; 16+ messages in thread
From: Tyler Retzlaff @ 2024-03-26 16:04 UTC (permalink / raw)
  To: lihuisong (C)
  Cc: Morten Brørup, dev, weh,
	longli@microsoft.com >> Long Li, alan.elder, thomas,
	ferruh.yigit, anatoly.burakov, david.hunt, sivaprasad.tummala,
	liuyonglong

On Tue, Mar 26, 2024 at 10:20:45AM +0800, lihuisong (C) wrote:
> Hi Tyler,
> 
> On 2024/3/23 1:55, Tyler Retzlaff wrote:
> >On Fri, Mar 22, 2024 at 04:54:01PM +0800, lihuisong (C) wrote:
> >>+Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
> >>
> >>On 2024/3/21 21:30, Morten Brørup wrote:
> >>>>From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >>>>Sent: Thursday, 21 March 2024 04.04
> >>>>
> >>>>Hi Morten,
> >>>>
> >>>>Thanks for your review.
> >>>>
> >>>>On 2024/3/20 22:05, Morten Brørup wrote:
> >>>>>>From: Huisong Li [mailto:lihuisong@huawei.com]
> >>>>>>Sent: Wednesday, 20 March 2024 11.55
> >>>>>>
> >>>>>>The system-wide CPU latency QoS limit has a positive impact on the idle
> >>>>>>state selection in cpuidle governor.
> >>>>>>
> >>>>>>Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> >>>>>>CPU latency QoS limit on system and send the QoS request for userspace.
> >>>>>>Please see the PM QoS framework in the following link:
> >>>>>>https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >>>>>>This feature is supported by kernel-v2.6.25.
> >>>>>>
> >>>>>>The deeper the idle state, the lower the power consumption, but the longer
> >>>>>>the resume time. Some service are delay sensitive and very except the low
> >>>>>>resume time, like interrupt packet receiving mode.
> >>>>>>
> >>>>>>So this series introduce PM QoS interface.
> >>>>>This looks like a 1:1 wrapper for a Linux kernel feature.
> >>>>right
> >>>>>Does Windows or BSD offer something similar?
> >>>>How do we know Windows or BSD support this similar feature?
> >>>Ask Windows experts or research using Google.
> >>I download freebsd source code, I didn't find this similar feature.
> >>They don't even support cpuidle feature(this QoS feature affects cpuilde.).
> >>I don't find any useful about this on Windows from google.
> >>
> >>
> >>@Tyler, @Alan, @Wei and @Long
> >>
> >>Do you know windows support that userspace read and send CPU latency
> >>which has an impact on deep level of CPU idle?
> >it is unlikely you'll find an api that let's you manage things in terms
> >of raw latency values as the linux knobs here do. windows more often employs
> >policy centric schemes to permit the system to abstract implementation detail.
> >
> >powercfg is probably the closest thing you can use to tune the same
> >things on windows. where you select e.g. the 'performance' scheme but it
> >won't allow you to pick specific latency numbers.
> >
> >https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/powercfg-command-line-options
> 
> Thanks for your feedback. I will take a look at this tool.
> 
> >
> >>>>The DPDK power lib just work on Linux according to the meson.build under
> >>>>lib/power.
> >>>>If they support this features, they can open it.
> >>>The DPDK power lib currently only works on Linux, yes.
> >>>But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
> >>>
> >>>DPDK is on track to work across multiple platforms, including Windows.
> >>>We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
> >>totally understand you.
> >since lib/power isn't built for windows at this time i don't think it's
> >appropriate to constrain your innovation. i do appreciate the engagement
> >though and would just offer general guidance that if you can design your
> >api with some kind of abstraction in mind that would be great and by all
> >means if you can figure out how to wrangle powercfg /Qh into satisfying the
> >api in a policy centric way it might be kind of nice.
> Testing this by using powercfg on Windows creates a very challenge for me.
> So I don't plan to do this on Windows. If you need, you can add it, ok?

ordinarily i would say it is appropriate to, however in this
circumstance i agree. there is quite possibly significant porting work
to be done so i would have to address it if we ever include it for
windows.

thanks

> >
> >i'll let other windows experts chime in here if they choose.
> >
> >thanks!
> >
> >>>>>Furthermore, any high-res timing should use nanoseconds, not microseconds or
> >>>>milliseconds.
> >>>>>I realize that the Linux kernel only uses microseconds for these APIs, but
> >>>>the DPDK API should use nanoseconds.
> >>>>Nanoseconds is more precise, it's good.
> >>>>But DPDK API how use nanoseconds as you said the the Linux kernel only
> >>>>uses microseconds for these APIs.
> >>>>Kernel interface just know an integer value with microseconds unit.
> >>>One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
> >>If so, we have to modify the implementation interface on Linux. This
> >>change the input/output unit about the interface.
> >>And DPDK also has to do this based on kernel version. It is not good.
> >>The cpuidle governor select which idle state based on the worst-case
> >>latency of idle state.
> >>These the worst-case latency of Cstate reported by ACPI table is in
> >>microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3.
> >>_LPI (Low Power Idle States) in ACPI spec [1].
> >>So it is probably not meaning to change this interface implementation.
> >>
> >>For the case need PM QoS in DPDK, I think, it is better to set cpu
> >>latency to zero to prevent service thread from the deeper the idle
> >>state.
> >>>You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
> >>>
> >>[1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
> >.


* Re: [PATCH 0/2] introduce PM QoS interface
  2024-03-26 12:46                 ` Morten Brørup
@ 2024-03-29  1:59                   ` lihuisong (C)
  0 siblings, 0 replies; 16+ messages in thread
From: lihuisong (C) @ 2024-03-29  1:59 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
	sivaprasad.tummala, liuyonglong


On 2024/3/26 20:46, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Tuesday, 26 March 2024 13.15
>>
>> On 2024/3/26 16:27, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Tuesday, 26 March 2024 03.12
>>>>
>>>> On 2024/3/22 20:35, Morten Brørup wrote:
>>>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>>>> Sent: Friday, 22 March 2024 09.54
>>> [...]
>>>
>>>>>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>>>>>> latency to zero to prevent service thread from the deeper the idle
>>>> state.
>>>>> It would defeat the purpose (i.e. not saving sufficient amounts of
>>>> power) if the CPU cannot enter a deeper idle state.
>>>> Yes, it is not good for power.
>>>> AFAIS, PM QoS is just to decrease the influence for performance.
>>>> Anyway, if we set to zero, system can be into Cstates-0 at least.
>>>>> Personally, I would think a wake-up latency of up to 10 microseconds
>>>> should be fine for most purposes.
>>>>> Default Linux timerslack is 50 microseconds, so you could also use
>>>> that value.
>>>> How much CPU latency is ok. Maybe, we can give the decision to the
>>>> application.
>>> Yes, the application should decide the acceptable worst-case latency.
>>>
>>>> Linux will collect all these QoS request and use the minimum latency.
>>>> what do you think, Morten?
>>> For the example application, you could use a value of 50 microseconds
>> and refer to this value also being the default timerslack in Linux.
>> There is a description for "/proc/<pid>/timerslack_ns" in Linux document
>> [1]
>> "
>> This file provides the value of the task’s timerslack value in
>> nanoseconds.
>> This value specifies an amount of time that normal timers may be
>> deferred in order to coalesce timers and avoid unnecessary wakeups.
>> This allows a task’s interactivity vs power consumption tradeoff to be
>> adjusted.
>> "
>> I cannot understand what the relationship is between the timerslack in
>> Linux and cpu latency to wake up.
>> It seems that timerslack is just to defer the timer in order to coalesce
>> timers and avoid unnecessary wakeups.
>> And it has not a lot to do with the CPU latency which is aimed to avoid
>> task to enter deeper idle state and satify application request.
> Correct. They control two different things.
>
> However, both can cause latency for the application, so my rationale for the relationship was:
> If the application accepts X us of latency caused by kernel scheduling delays (caused by timerslack), the application should accept the same amount of latency caused by CPU wake-up latency.
Understood, thanks for explaining.
>
> This also means that if you want lower latency than 50 us, you should not only set cpu wake-up latency, you should also set timerslack.
>
> Obviously, if the application is only affected by one of the two, the application only needs to adjust that one of them.
Yes, I think so.
>
> As for the 50 us value, someone in the Linux kernel team decided that 50 us was an acceptable amount of latency for the kernel; we could use the same value, referring to that. Or we could choose some other value, and describe how we came up with our own value. And if necessary, also adjust timerslack accordingly.
So how about using the default timerslack value of 50 us in l3fwd-power?
We can also add a description of this in the code or documentation, e.g.
suggesting that users also adjust the process's timerslack if they want
a lower latency.
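
To make the two knobs discussed above concrete, here is a minimal sketch,
not the code from this patch series. It assumes the kernel's
/dev/cpu_dma_latency device (which accepts a binary signed 32-bit value in
microseconds) and prctl(PR_SET_TIMERSLACK) for the calling thread's timer
slack. The helper names cpu_latency_request(), timerslack_set_us() and
timerslack_get_us() are hypothetical and exist only for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Path of the kernel's system-wide CPU latency QoS device. */
#define CPU_DMA_LATENCY_PATH "/dev/cpu_dma_latency"

/*
 * Hypothetical helper: write a 32-bit latency limit (microseconds) to
 * `path` and return the open file descriptor, or -1 on error.
 * For the real device, the request stays active only while the
 * descriptor is open, so the caller keeps it and close()s it to drop
 * the request.
 */
static int cpu_latency_request(const char *path, int32_t latency_us)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, &latency_us, sizeof(latency_us)) !=
	    (ssize_t)sizeof(latency_us)) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: set the calling thread's timer slack.
 * PR_SET_TIMERSLACK takes nanoseconds, so convert from microseconds.
 * Returns 0 on success, -1 on error.
 */
static int timerslack_set_us(unsigned long slack_us)
{
	return prctl(PR_SET_TIMERSLACK, slack_us * 1000UL, 0, 0, 0);
}

/*
 * Hypothetical helper: read back the calling thread's timer slack in
 * microseconds (PR_GET_TIMERSLACK returns nanoseconds), or -1 on error.
 */
static long timerslack_get_us(void)
{
	int slack_ns = prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);

	return slack_ns < 0 ? -1 : slack_ns / 1000;
}
```

An application that accepts 50 us of latency could call
cpu_latency_request(CPU_DMA_LATENCY_PATH, 50) at startup and keep the
returned descriptor open for its lifetime. If it wants less than 50 us, it
would lower both values, e.g. cpu_latency_request(CPU_DMA_LATENCY_PATH, 20)
together with timerslack_set_us(20). Opening the device normally requires
root privileges.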
>


Thread overview: 16+ messages
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
2024-03-20 10:55 ` [PATCH 1/2] power: " Huisong Li
2024-03-20 10:55 ` [PATCH 2/2] examples/l3fwd-power: add PM QoS request configuration Huisong Li
2024-03-20 14:05 ` [PATCH 0/2] introduce PM QoS interface Morten Brørup
2024-03-21  3:04   ` lihuisong (C)
2024-03-21 13:30     ` Morten Brørup
2024-03-22  8:54       ` lihuisong (C)
2024-03-22 12:35         ` Morten Brørup
2024-03-26  2:11           ` lihuisong (C)
2024-03-26  8:27             ` Morten Brørup
2024-03-26 12:15               ` lihuisong (C)
2024-03-26 12:46                 ` Morten Brørup
2024-03-29  1:59                   ` lihuisong (C)
2024-03-22 17:55         ` Tyler Retzlaff
2024-03-26  2:20           ` lihuisong (C)
2024-03-26 16:04             ` Tyler Retzlaff
